Statistics 511 Additional Materials

Sampling Distributions and Central Limit Theorem

In previous topics we discussed taking a single observation from a distribution. More precisely, we looked at the probability that a single variable takes a specific value or falls in a range of values; that is, we wanted to know the chance that a single variable is more than 7, less than 190, or equal to 4. In this section we move toward combining data, that is, using more than one data value. As we move toward inference (decision making from data) we will be combining information, but first we must understand the variable that summarizes a collection of data.

If we take lots of samples of size one, the distribution of those samples will look like the original distribution. But suppose we take a sample of 10 observations and average them. If we repeated this process, what would the distribution of those averages look like? That is the question we hope to answer in this section. Though what we do in this topic may seem somewhat convoluted, it will help us understand the behavior of a sample relative to a population. Here we (will pretend to) know the population, or at least what it looks like, and we want to know how samples behave. Knowing how samples behave will help us later, when we only have a sample and we want to draw conclusions, or make inferences, about the population.

Generating Random Numbers

We begin this topic with a section about generating random numbers. To understand the process of obtaining a random sample and the variability associated with it, we must first know how to take a random sample. One of the hardest ideas in statistics is the idea of randomness and what is truly random. What we intuitively think of as random often does not meet the criteria that science has set for randomness; hence the need for a way of generating random numbers.

There are several ways to generate random numbers. The simple roll of a die is one; however, it only gives the integers from one to six. If we want a larger set of values, we need to consider another way. One of the most commonly used is a random number table. (Unfortunately, our textbook does not include a table of random numbers.)

In order to use such a table, we need to know the size of the population that we wish to draw from. If we want to select one observation from a population of size 58, we use a slightly different method than if we want to select one observation from a population of size 9. The outline is this: the number of individuals in the population determines the labels given to each unit, and the number of digits in each label is set by the following table.

Number of individuals    Digits in the labels
2 to 10                  1
11 to 100                2
101 to 1000              3
1001 to 10000            4
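
If a printed table is not handy, a pseudo-random number generator does the same job. Below is a minimal sketch, not part of the handout, that uses Python's standard random module to draw one unit at random from populations of the two sizes mentioned above.

```python
import random

# A pseudo-random number generator standing in for a random number table.
# (Illustrative sketch; not part of the original handout.)

# One unit drawn at random from a population of 58 units, labelled 1-58.
print(random.randint(1, 58))

# One unit drawn at random from a population of 9 units, labelled 1-9.
print(random.randint(1, 9))
```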

Suppose we select a sample of size 5 from a population consisting of 26 individuals. With 26 units, the labels would be as follows.

Unit      Label        Unit      Label
Unit 1    01           Unit 14   14
Unit 2    02           Unit 15   15
Unit 3    03           Unit 16   16
Unit 4    04           Unit 17   17
Unit 5    05           Unit 18   18
Unit 6    06           Unit 19   19
Unit 7    07           Unit 20   20
Unit 8    08           Unit 21   21
Unit 9    09           Unit 22   22
Unit 10   10           Unit 23   23
Unit 11   11           Unit 24   24
Unit 12   12           Unit 25   25
Unit 13   13           Unit 26   26

To take numbers from the random number table, begin at a random location in the table. (One way to start is to close your eyes and place a finger on the table.) Then read left to right across that row, taking the number of digits in each label. For example, suppose that we started at the beginning of a row that looks like

92 41 24 08 42 64 96 82 07 01 40 00 95 09 30 23 40 08 19 78

Since we have two-digit labels, we take two digits at a time. The first number is 92, which we do not use because none of our units was assigned that label; likewise 41. The next number, 24, was assigned to a unit, so it gets used, and so does 08. Then 42 (skip), 64 (skip), 96 (skip), 82 (skip), 07 (use), 01 (use), 40 (skip), 00 (skip), 95 (skip), 09 (use). We stop there, since we now have the five labels we need. Our sample contains Unit 01, Unit 07, Unit 08, Unit 09, and Unit 24. If we had reached the end of the row without the five units we needed, we would continue with the next row.

One question that may arise is what would have happened if a unit were chosen twice. Looking at the example row above, if we had continued we would have found 08 showing up a second time. There are two possible resolutions. The first, sampling without replacement, is that we would have skipped 08 the second time it arose; as the name implies, without replacement means that once a unit is selected it is no longer eligible to be selected again. Sampling with replacement means that we could use a label as many times as it appears in the table. For the most part we will be interested in sampling with replacement, since it is simpler to handle from a theoretical standpoint.
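
The selection above can be mimicked in a few lines of code. This is a rough sketch, not part of the handout, that reads the example row in two-digit groups, keeps only labels 01 through 26, and drops repeats (sampling without replacement).

```python
# Read a row of the random number table two digits at a time, keeping only
# labels that belong to one of the 26 units and skipping any repeats.
# (Illustrative sketch; the row is the example row from the text.)
row = "92 41 24 08 42 64 96 82 07 01 40 00 95 09 30 23 40 08 19 78"
digits = row.replace(" ", "")

sample = []
for i in range(0, len(digits) - 1, 2):
    label = digits[i:i + 2]
    if 1 <= int(label) <= 26 and label not in sample:
        sample.append(label)
    if len(sample) == 5:
        break

print(sample)  # ['24', '08', '07', '01', '09']
```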

Here is another sample using the same units and labels as above. We again take a sample of size 5, this time starting with

Row a: 19 49 99 18 26 11 63 74 29 96 14 57 76 72 92 86 28
Row b: 39 14 12 52 96 24 33 70 06

Reading two digits at a time: 19 (use), 49 (skip), 99 (skip), 18 (use), 26 (use), 11 (use), 63 (skip), 74 (skip), 29 (skip), 96 (skip), 14 (use). This time our sample consists of units 11, 14, 18, 19, and 26, which is a different sample than the first one. Suppose the population were the students who attend a Stat 215 lab. If we are interested in heights or number of siblings for each student, then we would get a (potentially) different set of 5 values each time we construct a sample.

Steps for using a random number table:
1. Label each unit in the population with a label of equal length.
2. Pick a starting point in the table.
3. Read off labels from the table until you have the number of units that you need.

TIP: It is important to differentiate between the unit that we have selected and the value of a particular variable that that unit carries. Unit 26 won't have 26 siblings; the label is just a way to identify the unit when we take our sample.

Random Sampling

The key to the idea of random sampling is variability. Each time we sample we will, most likely, get a different subset of the population. Consider the following population of frogs. Each frog is labeled and a weight measurement, in grams, is given.

Unit      Label   Weight     Unit      Label   Weight     Unit      Label   Weight
Frog 1    01      163        Frog 17   17      183        Frog 33   33      154
Frog 2    02      142        Frog 18   18      147        Frog 34   34      147
Frog 3    03      183        Frog 19   19      148        Frog 35   35      157
Frog 4    04      129        Frog 20   20      138        Frog 36   36      158
Frog 5    05      134        Frog 21   21      136        Frog 37   37      139
Frog 6    06      138        Frog 22   22      168        Frog 38   38      195
Frog 7    07      190        Frog 23   23      149        Frog 39   39      184
Frog 8    08      130        Frog 24   24      158        Frog 40   40      175
Frog 9    09      122        Frog 25   25      124        Frog 41   41      183
Frog 10   10      175        Frog 26   26      175        Frog 42   42      148
Frog 11   11      158        Frog 27   27      184        Frog 43   43      149
Frog 12   12      200        Frog 28   28      159        Frog 44   44      139
Frog 13   13      140        Frog 29   29      193        Frog 45   45      130
Frog 14   14      145        Frog 30   30      131        Frog 46   46      129
Frog 15   15      143        Frog 31   31      195        Frog 47   47      131
Frog 16   16      156        Frog 32   32      196        Frog 48   48      182

Let X be the RV representing the weight of a frog in grams.

Sample 1: Suppose that we took a sample of size 10, sampling without replacement, and got Frogs 2, 3, 14, 25, 27, 29, 30, 34, 40, and 48. The weights for those frogs are x: 142, 183, 145, 124, 184, 193, 131, 147, 175, 182.

Sample 2: Suppose that we took another sample of size 10 without replacement and got Frogs 5, 9, 12, 14, 16, 18, 21, 24, 29, and 44. The weights for those frogs are x: 134, 122, 200, 145, 156, 147, 136, 158, 193, 139.

Sample 3: Suppose that we took another sample of size 10 without replacement and got Frogs 4, 5, 13, 23, 24, 27, 33, 40, 43, and 46. The weights for those frogs are x: 129, 134, 140, 149, 158, 184, 154, 175, 149, 129.

Consider the following summaries for the three samples.

Sample    Sample mean, x̄    Sample standard deviation, s_x
1         160.6              25.29
2         153.0              25.28
3         150.1              18.56

For all the frogs, the population mean is µ_x = 156.92 and the population standard deviation is σ_x = 22.73.

There are two important things to notice here. First, each sample is different: we do not get the same values for the sample statistics each time we take a sample. Second, the sample statistics do not have the same values as the population parameters. Focusing on the mean, the sample 1 mean is larger than the population mean, while the sample means for samples 2 and 3 are lower than the population mean. Similarly, samples 1 and 2 have standard deviations, s_x, that are larger than the population standard deviation, σ_x, while sample 3 has a standard deviation that is smaller than the population standard deviation. This is a result of the variability that is present in taking a sample. We usually get different values for the sample mean and sample standard deviation each time we take a sample, and these values usually differ from the population mean and the population standard deviation.
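
The same experiment is easy to repeat by computer. The sketch below, which is not part of the handout, stores the 48 frog weights from the table, then draws a few samples of size 10 without replacement and prints each sample's mean and standard deviation next to the population values.

```python
import random
import statistics

# The 48 frog weights from the table above, in label order 01-48.
weights = [163, 142, 183, 129, 134, 138, 190, 130, 122, 175, 158, 200,
           140, 145, 143, 156, 183, 147, 148, 138, 136, 168, 149, 158,
           124, 175, 184, 159, 193, 131, 195, 196, 154, 147, 157, 158,
           139, 195, 184, 175, 183, 148, 149, 139, 130, 129, 131, 182]

mu = statistics.mean(weights)       # population mean (the handout reports 156.92)
sigma = statistics.pstdev(weights)  # population standard deviation

print("population:", round(mu, 2), round(sigma, 2))
for i in range(3):
    sample = random.sample(weights, 10)  # size 10, without replacement
    print("sample", i + 1, ":",
          round(statistics.mean(sample), 1),
          round(statistics.stdev(sample), 2))
```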

Sampling Distributions

Definition: A sampling distribution is the collection of all possible values of a sample statistic, e.g., the values of x̄ from all possible samples.

Using the previous example with the frogs, we could list all possible samples of size 10 and calculate the mean of each one: the mean of sample 1, the mean of sample 2, and so on through every possible sample of size 10. The sampling distribution of the sample mean would then be all of those possible values. That distribution of sample means has parameters that we can use to describe it: we can talk about the mean of that sampling distribution and the standard deviation of that sampling distribution. It is certainly awkward to talk about "the mean of the sampling distribution of the sample mean," but we will. It is simply the average of all possible values of the sample mean from a certain population.

Consider the following population of five units with values y: 3, 4, 5, 8, 10. Taking samples of size 3 without replacement, we get the following possible samples.

3, 4, 5     3, 4, 8     3, 4, 10    3, 5, 8     3, 5, 10
3, 8, 10    4, 5, 8     4, 5, 10    4, 8, 10    5, 8, 10

(You might recall that the number of ways we can choose 3 objects from a set of 5 objects is C(5, 3) = 10.)

We can find the mean of each sample and thus get the sampling distribution of the mean.

Sample      Mean
3, 4, 5     4.00
3, 4, 8     5.00
3, 4, 10    5.67
3, 5, 8     5.33
3, 5, 10    6.00
3, 8, 10    7.00
4, 5, 8     5.67
4, 5, 10    6.33
4, 8, 10    7.33
5, 8, 10    7.67

The sampling distribution for the sample mean (with samples of size 3) is simply the second column of the above table. The mean of this distribution is 6.00 and its standard deviation is 1.12. These samples were taken without replacement.

In addition to the summary that we created above, we could create a histogram of the means. It might not be helpful in this case, since we only have 10 means, but if the list of all possible sample means were long, a histogram would be a useful way to display the sampling distribution of the sample means.

Note: If our population consists of values of a continuous random variable, then the number of possible samples is infinitely large.
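
The enumeration above is small enough to check by hand, but the same bookkeeping can be done in a few lines. This sketch, not part of the handout, lists every size-3 sample from the five values, computes each mean, and then summarizes the resulting sampling distribution.

```python
from itertools import combinations
from statistics import mean, stdev

population = [3, 4, 5, 8, 10]

# Every possible sample of size 3, drawn without replacement.
samples = list(combinations(population, 3))
sample_means = [mean(s) for s in samples]

print(len(samples))                         # 10 possible samples
print([round(m, 2) for m in sample_means])  # 4.0, 5.0, 5.67, ...
print(round(mean(sample_means), 2))         # 6.0, the mean of the sampling distribution
print(round(stdev(sample_means), 2))        # about 1.12, as reported above
```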

Central Limit Theorem

In the previous section we saw an example of a sampling distribution. In this section we will explore three mathematical results that will be important for the rest of the course. These results give mathematical descriptions of the sampling distribution of the mean for any data set.

Theorem 1: If X is a RV with mean µ_x and standard deviation σ_x, then X̄ is a RV with mean µ_X̄ = µ_x and standard deviation σ_X̄ = σ_x/√n, where n is the number of observations per sample.

This means that regardless of the distribution from which the sample was selected, the average, X̄, is a RV with the same mean as the population from which our sample was obtained. The second part says that the standard deviation of the distribution of sample means is the standard deviation of the population from which our sample was obtained, divided by the square root of the number of observations. This implies that the distribution of X̄ has a smaller standard deviation than the distribution of X. What is happening is that by averaging, there is less variability in the resulting distribution: the larger values are counterbalanced by smaller values. Note also that if n = 1, that is, if each sample is of size one, then X̄ has the same distribution as X. Finally, it is worth reiterating that Theorem 1 says nothing about the shape of the distribution of either X or X̄.

For example, let X be a RV with mean 18 and standard deviation 3. If I take samples of size 10, then the sampling distribution of X̄ will have mean 18 and standard deviation 3/√10 = 0.94868.
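
Theorem 1 is easy to check by simulation. The following sketch, which is not part of the handout and uses an arbitrary skewed population of my own choosing, draws many samples of size 10 and compares the mean and standard deviation of the sample means with µ_x and σ_x/√n.

```python
import math
import random
import statistics

random.seed(1)

# An arbitrary, skewed population (exponential-shaped, mean about 18);
# Theorem 1 does not care what shape the population has.
population = [random.expovariate(1 / 18) for _ in range(50_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

n = 10
means = [statistics.mean(random.sample(population, n)) for _ in range(10_000)]

print(statistics.mean(means), "vs", mu)                     # close to the population mean
print(statistics.stdev(means), "vs", sigma / math.sqrt(n))  # close to sigma / sqrt(n)
```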

Theorem 2: If X is a Normal RV with mean µ_x and standard deviation σ_x, then X̄ is a Normal RV with mean µ_X̄ = µ_x and standard deviation σ_X̄ = σ_x/√n, where n is the number of observations per sample.

This result is similar to Theorem 1, except that it adds that by averaging observations from a Normal distribution, the averages will always possess a Normal distribution.

For example, let Y be a Normal RV from a population with mean 240 and standard deviation 12. If I take a sample of 18 observations from the distribution of Y and average them, what is the probability that the sample mean will be more than 245? Note that Ȳ is a Normal RV with mean 240 and standard deviation 12/√18 = 2.828427. So

P(Ȳ > 245) = P(Z > (245 - 240)/2.828427) = P(Z > 1.77) = 0.0384.

Theorem 3 (Central Limit Theorem): If X is a RV with mean µ_x and standard deviation σ_x, and n is large (n ≥ 30), then X̄ has approximately a Normal distribution with mean µ_X̄ = µ_x and standard deviation σ_X̄ = σ_x/√n, where n is the number of observations per sample.

What this result says is that regardless of the distribution we start with (it might be Poisson, it might be continuous uniform, etc.), the sampling distribution of the sample mean is a Normal distribution if we obtain at least 30 observations. Theorem 3 takes Theorem 1 a step farther, but requires that n ≥ 30. If n is large, then we can use the Normal distribution to calculate probabilities associated with X̄.

For example, suppose X is a RV from a population with mean 40 and standard deviation 5. If we are going to take 50 observations from this distribution, what is the chance that the sample mean of those 50 observations will be more than 42? Since n ≥ 30, we can use the Central Limit Theorem. Even though we know nothing about the distribution of X, we know that the distribution of X̄ will be a Normal distribution with mean 40 and standard deviation 5/√50. With this knowledge we can use the tools of the Normal distribution to calculate the probability we need:

P(X̄ > 42) = P(Z > (42 - 40)/(5/√50)) = P(Z > 2.83) = 0.0023.

As another example, the amount of time it takes a shipment of coal to cross the Bitterroot Mountains from Hamilton, MT to Salmon, ID has a mean of 5.4 hours and a standard deviation of 0.6 hours. What is the probability that the average time of the next 36 shipments will be longer than 5.46 hours? Let Y be the time it takes for a shipment of coal to cross the Bitterroot Mountains; Y has mean 5.4 and standard deviation 0.6, and we want P(Ȳ > 5.46). Since the number of shipments we are averaging, 36, is more than 30, we can use the Central Limit Theorem to say that the distribution of Ȳ will be a Normal distribution with mean 5.4 and standard deviation 0.6/√36 = 0.1. So

P(Ȳ > 5.46) = P(Z > (5.46 - 5.4)/0.1) = P(Z > 0.60) = 0.2743.
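
All three of the probability calculations above follow the same recipe: model the sample mean as Normal with mean µ and standard deviation σ/√n, then find the upper-tail area. Here is a short sketch, not from the handout, that reproduces them with Python's standard library; the function name prob_mean_exceeds is just an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_exceeds(mu, sigma, n, cutoff):
    """P(sample mean > cutoff) when the sample mean is Normal(mu, sigma / sqrt(n))."""
    return 1 - NormalDist(mu, sigma / sqrt(n)).cdf(cutoff)

print(prob_mean_exceeds(240, 12, 18, 245))    # about 0.038 (Theorem 2 example)
print(prob_mean_exceeds(40, 5, 50, 42))       # about 0.002 (first CLT example)
print(prob_mean_exceeds(5.4, 0.6, 36, 5.46))  # about 0.274 (coal shipments)
```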

DARF is a psychometric measurement designed to ascertain the amount of stress that an individual is able to handle. The average score on the DARF is 100 with a standard deviation of 25. The basketball coach at Mount Carmel High School in Illinois wants to know the chance that the average score of their 18 players is less than 90. Find that probability.

Let F be a score on the DARF. F has a mean of 100 and a standard deviation of 25. We need to find the distribution of F̄. Unfortunately, since n = 18, we cannot use the Central Limit Theorem to say that the distribution is Normal. We can say from Theorem 1 that the mean of the distribution of possible values of F̄ will be 100 and the standard deviation will be 25/√18. However, we cannot say anything about the shape of the distribution, and therefore we cannot calculate this probability.

Some final comments about this topic. This topic is really a prelude to the next several topics. We need to understand the variability of the sample average when we know the population (or distribution) it came from. The reason for this is that as we move to the next chapter, we will no longer know the mean and standard deviation of the population, and we will have to use our knowledge of the behavior of the sample mean to help make inferences about the population mean.

Notes

In the section on normal distributions we used the z-score formula

z = (x - µ_x) / σ_x

to answer probability questions involving a single random variable X, e.g., P(X > a), P(X < b), or P(a < X < b). However, if we ask a probability question about the sample mean X̄ of several observations, we use a different z-score formula. We use

z = (x̄ - µ_X̄) / σ_X̄ = (x̄ - µ) / (σ/√n)

to answer probability questions involving the sample mean X̄ of several observations, e.g., P(X̄ > a), P(X̄ < b), or P(a < X̄ < b).
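
To tie the two formulas together, here is a final sketch, not part of the handout, using the numbers from the Theorem 2 example (where X is Normal, so both z-scores are legitimate): the z-score for a single observation of 245 and the z-score for a sample mean of 245 based on n = 18 observations.

```python
from math import sqrt

mu, sigma, n = 240, 12, 18
x = 245

z_single = (x - mu) / sigma            # z-score for one observation X
z_mean = (x - mu) / (sigma / sqrt(n))  # z-score for the sample mean of n observations

print(round(z_single, 2))  # 0.42
print(round(z_mean, 2))    # 1.77, as in the Theorem 2 example
```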