Chapter 18. Sampling Distribution Models /51

Chapter 18 Sampling Distribution Models 1 /51

Homework p432 2, 4, 6, 8, 10, 16, 17, 20, 30, 36, 41 2 /51

3 /51

Objective Students calculate values of central 4 /51

The Central Limit Theorem for Sample Proportions In the real world it is very common to sample the population to estimate a proportion of the population demonstrating some characteristic. Rather than showing real repeated samples, imagine what would happen if we were to actually draw many samples. We know 9/10 dentists recommend Crest. Were only 10 dentists surveyed? Trump s approval rating is currently 38%. Did they ask every person in the U.S? How do they know how many people watched the Super Bowl? (Estimated 114.4 million) 5 /51

The Central Limit Theorem for Sample Proportions If we only take one sample, what makes anyone think the sample is a good representation of the population. Can we trust the results actually reflect the proportion in the population? Suppose we could take many samples of the same size (n). What would we expect to see if we looked at a distribution of the sample proportions for these samples. In other words, we have a distribution of data values that are proportions taken from our many samples. We could then create a histogram of the proportions from many samples. Do you think this histogram might look like the histogram of all possible sample proportions from samples of size n? 6 /51

The Central Limit Theorem for Sample Proportions What would the histogram of all the sample proportions look like? That histogram of all possible sample proportions is called the sampling distribution of the sample proportions. What is the expected value of the proportion? What would this value be for our sampling distribution of sample proportions? One very nice attribute of sampling proportions is the fact that the standard deviation of the distribution is an artifact of the proportion. 7 /51

Modeling the Distribution of Sample Proportions We would expect the histogram of the sample proportions to be centered at the actual true proportion, p, in the population. As far as the shape of the histogram goes, we could simulate a bunch of random samples that we did not actually draw. You could do this with a random table or random number generator by assigning values appropriate to the proportion. It is a really fortunate turn that the histogram turns out to be unimodal, symmetric, and centered at p. More specifically, it is extremely fortuitous that a Normal model is juuuuust right for the histogram of sample proportions. 8 /51

Modeling the Distribution of Sample Proportions Modeling how sample proportions (and soon sample means) vary from sample to sample is the foundation of the methodologies we use in parametric statistics. A sampling distribution model for how a sample proportion varies from sample to sample provides the ability to quantify variation and determine the probability we would observe a sample proportion within a particular interval. To define a Normal model, we need the mean and standard deviation. For the distribution of sample proportions µ, the mean of the Normal Distribution, is equal to p, the true proportion in the population. 9 /51

Modeling the Distribution of Sample Proportions When working with proportions, knowing the mean (proportion) automatically gives us the standard deviation as well. In our case the standard deviation we will use is SD (X ) = pq n So, the distribution of the sample proportions is modeled with a probability model that is: N p, pq n 10/51

Modeling the Distribution of Sample Proportions This is the graph of the distribution of sample proportions N p, pq n 3 pq n 2 pq n 1 pq n p 1 pq n 2 pq n 3 pq n 11/51

The Central Limit Theorem for Sample Proportions Because we have a Normal model, we know that 95% of Normally distributed values are within two standard deviations of the mean. So we should not be surprised if 95% of various sampling polls gave results that were near the mean but varied above and below that by no more than two standard deviations. This is what we mean by sampling error. Sampling error is not really an error (mistake), it is just variability we would expect expect to see from one sample to another. A better phrase would be sampling variability. The Normal model gets better as a model for the distribution of sample proportions as the sample size gets bigger. 12/51

Assumptions and Conditions Like most models, the distribution of sample proportions is useful (valid) only when specific assumptions are true. There are two assumptions in the case of the model for the distribution of sample proportions: 1. Independence Assumption: The sampled values must be independent of each other. 2. Sample Size Assumption: The sample size, n, must be large enough. 13/51

Assumptions and Conditions We check that the assumptions are reasonable by checking conditions that provide information about the assumptions. The corresponding conditions to check before using the Normal to model the distribution of sample proportions are: Randomization Condition (independence) 10% Condition (independence) Success/Failure Condition (adequate sample size) 14/51

Assumptions and Conditions 1. Randomization Condition: The sample should be a simple random sample of the population (SRS). 2. 10% Condition: the sample size, n, must be no larger than 10% of the population (N 10n). 3. Success/Failure Condition: The sample size has to be big enough so that both np (number of successes) and nq (number of failures) are at least 10 (np 10, nq 10). 15/51

A Sampling Distribution Model for a Proportion A proportion is now more than just a computation from a set of data. A proportion is now one example of a random variable quantity (statistic) that has a probability distribution. This distribution is called the sampling distribution model for proportions. We make decisions based on what we know about sampling distribution models, but we never actually create them. It would be kind of tough to do, as the model requires an infinite number of data values. We rarely take repeated samples from the same population and make a histogram. We can simulate them, but we know what the model would look like, so we never need to actually create it. 16/51

Sampling Distribution Let us be clear about what we are now discussing. We have a new distribution of values (statistics) taken from samples. This new distribution (model) is to be distinguished from a distribution of individual observations taken from a population. This new distribution currently consists of sample proportions denoted: p^ p-hat p 17/51

A Sampling Distribution Model for a Proportion Provided that the sampled values are independent and the sample size is sufficiently large, the sampling distribution of model with: P! is modeled by a Normal Mean: µ P! = P Standard deviation: SD P! ( ) = pq n N p, pq n 18/51

Example Suppose we find data that tells us 80% of cars on the interstate exceed the speed limit. What proportion of the next 50 cars that speed by will be exceeding the speed limit. The problem (parameter of interest): We want to find the distribution of the proportion of the next 50 cars that may be speeding on the highway. We are told 80% of all cars on the highway are speeding. Randomize Condition: The 50 cars can be considered a random sample and thus, representative sample of all cars on the highway. 10% Condition: 50 is certainly less than 10% of all cars on the highway. Success/Failure Condition: np = 50(0.80) = 40, nq = 50(0.20) = 10, both 10. 19/51

Example Suppose we find data that tells us 80% of cars on the interstate exceed the speed limit. What proportion of the next 50 cars that speed by will be exceeding the speed limit. The sampling distribution model for the proportion of speeders in 50 cars: Mean: µ P! =.8 Standard deviation: SD P! ( ) = pq n =.8 i.2 50.057 N p, pq n The model for P! is N(.8,.057) 20/51

Example The model for P! is N(.8,.057) Proportion of speeders in 50 cars Accordingly, we expect 68% of the samples of 50 cars to have proportions of speeders between 0.743 and 0.857 95% of the samples to have 99.7% 95% 68% 0.629 0.686 0.743 0.80 0.857 0.914 0.971 proportions between 0.686 and 0.914 99.7% of the samples to have proportions between 0.629 and 0.971. 21/51

What About Quantitative Data? Proportions summarize categorical variables. It is binomial data, you do exhibit the characteristic or you do not exhibit the characteristic. The Normal sampling distribution model is extremely useful in evaluating sample proportion data. But can we do something similar with quantitative data (i.e. means)? Yeperdoo. Even better, not only can we use all of the same concepts, but almost the same model. 22/51

Simulating the Sampling Distribution of a Mean Like other statistics computed from a random sample, a sample mean also has a sampling distribution. A simulation will give us a sense as to what the sampling distribution of the sample mean might look like Let s start with a simulation of many tosses of a die. Using a die or your calculator roll a die 20 times and record the results on a dot plot on the board. 23/51

Simulating the Sampling Distribution of a Mean A histogram of the results might look like this: 150 freq 112.5 75 How did we do? 37.5 0 1 2 3 4 5 6 24/51

Simulation the Sampling Distribution of the Mean Now we will take samples of size 5 rolls. Roll the die 5 times and calculate the mean of the 5 rolls. Repeat 10 times, record on the board. Complete questions 1-3 of the worksheet. 25/51

Means Averaging More Dice Throws Looking at the mean of two dice after a simulation of 10,000 tosses: The mean of three dice after a simulation of 10,000 tosses looks like: 26/51

Means Averaging Even More Dice Throws The mean of 5 dice after a simulation of 10,000 tosses looks like: The mean of 20 dice after 10,000 tosses looks like: What does this tell us? 27/51

Means What the Simulations Show As the sample size (number of dice) gets larger, each sample mean is more likely to be closer to the population mean. So, we see the shape continuing to compress around 3.5 And, you hopefully noticed that the sampling distribution of a mean becomes Normal with all possible samples of size n. 28/51

The Central Limit Theorem The sampling distribution of any mean becomes more nearly Normal as the sample size grows. It must be true that the observations are independent and random. This is true, no matter the shape of the population distribution as long as the sample is large enough. More on this soon. This Fundamental Theorem of Statistics is called the Central Limit Theorem. 29/51

The Central Limit Theorem The Central Limit Theorem is not intuitive and a tad bit weird: Not only does the histogram of the sample means get closer and closer to the Normal model as the sample size grows, but this is true regardless of the shape of the population distribution. The Central Limit Theorem works better (and faster) the more unimodal and symmetric the population. Additionally, the Central Limit Theorem works better for larger samples. To summarize, the more unimodal and symmetric the population, the more unimodal and symmetric the sampling distribution. The larger the sample sizes, the more unimodal and symmetric the sampling distribution. 30/51

The Central Limit Theorem The mean of a random sample is a random variable whose sampling distribution (the distribution of means) can be approximated by a Normal model. The larger the sample, the better the approximation will be. It is also intuitive that the standard deviation of the distribution of sample means will be smaller than the population standard deviation. A collection of means should be close to the population mean, and thus closer to each other. SD(X ) = σ n So, the distribution of the sample means (sampling distribution) is a probability model: N µ, σ n 31/51

Sampling Distribution We have introduced a new distribution that is to be differentiated from the distribution of population values. The new distribution, the sampling distribution (distribution of a sample statistic), is just that; a distribution of statistics (means). The new distribution, has a mean equal to the population parameter (μ) and a standard deviation (now called standard error) that is smaller than that of the original population distribution. 32/51

Assumptions and Conditions The CLT requires essentially the same assumptions we saw for modeling proportions: Random Assumption: The samples must be randomly selected. Independence Assumption: The individual values must be independent. Sample Size Assumption: The sample size must be sufficiently large. 33/51

Assumptions and Conditions Once again we cannot check these directly, but we can be reasonably confident about whether the Independence Assumption is appropriate. We can check some related conditions to verify the assumption: Randomization Condition: The data values must be sampled randomly. 10% Condition: When the sample is drawn without replacement, the sample size, n, should be no more than 10% of the population. Large Enough Sample Condition: The CLT doesn t tell us how large a sample we need. For now, you need to think about your sample size in the context of what you know about the population. 34/51

But Which Normal? The Central Limit Theorem ensures that the sampling distribution of a statistic (mean) is approximately Normal. But which Normal model? We have learned that for proportions, the sampling distribution is centered at the population proportion (p). For means, it s centered at the population mean (μ). All that remains is to define the standard deviations. 35/51

But Which Normal The Normal model for the sampling distribution of the proportion has a standard deviation equal to SD(P! ) = pq n = pq n The Normal model for the sampling distribution of the means of samples size n has a standard deviation equal to SD(X ) = σ n where σ is the population standard deviation. 36/51

About Variation The standard deviation of the sampling distribution decreases only with the square root of the sample size. Thus, the variability decreases as the sample size increases but only by a factor of a square root. To decrease standard deviation significantly we have to increase sample size a whole bunch. But we cannot forget that we are limited in sample size to 10% of the population. 37/51

Example SAT math scores have mean 541 and standard deviation 103. What about the mean of random samples of 20 students? (Note that the small sample is okay because we believe a Normal model applies to the population.) We are interested in the distribution of possible means from samples of SAT scores from 20 students. SAT scores have a mean of 541 and a standard deviation of 103, and since the SAT is standardized, it s reasonable to assume that the model for all SAT scores is Normal. What we are NOT discussing is a distribution of individual SAT scores. 38/51

Example SAT scores have mean 541 and standard deviation 103. What about the mean of random samples of 20 students? Random Sampling Condition: The 20 students were sampled randomly. Independence Assumption: It is reasonable to believe that the SAT scores of the 20 randomly sampled students will be mutually independent, as long as the students were not all from the same school. 10% Condition: 20 students represent less than 10% of all students. Under these conditions, the sampling distribution of y has a Normal model. Large enough: 20 students provide a small sample, but we believe the distribution of individual SAT scores is unimodal and symmetric so the sampling distribution of y has a Normal model, 39/51

Example Individual SAT scores have mean 541 and standard deviation 103. What about the mean of random samples of 20 students? SD X ( ) = σ n = 103 20 23.0315 N(541, 23.0315) 40/51

Example Mean SAT scores of 20 students SAT scores have mean 541 and standard deviation 103. N(541, 23.0315) 99.7% 95% According to the Normal model, we can expect 68% 472 495 518 541 564 587 610 68% of samples of 20 to have mean SAT scores between 517.97 and 564.03 95% of samples to have means between 494.94 and 587.06 99.7% of samples to have means between 471.91 and 610.09. Remember, we are not comparing the SAT scores of individual students but, rather, comparing SAT scores of samples of 20 students. 41/51

Example SAT scores have mean 541 and standard deviation 103. Mean SAT scores of 20 students 1.67% What is the probability that a sample of 20 students would have an aggregate mean SAT score of 590 or more? 472 495 518 541 564 587 610 N(541, 23.0315) We meet the conditions for calculating the probability (listed previously). P(x 590) = normcdf(590, 10^99, 541, 23.0315) =.0167 P(z 2.2361) = normcdf(2.1275, 10^99, 0, 1) =.0167 Z = 590 541 23.0315 = 2.1275 The probability of a random sample of 20 students having a mean SAT score of 590 or better is approximately 1.67% 42/51

Individual SAT scores have mean 541 and standard deviation 103. What is the probability that an individual will have a score of 590 or better? Mean SAT scores 99.7% 95% 68% 232 335 438 541 644 747 850 P(x 590) = normcdf(590, 10^99, 541, 103) =.3171 P(z.4757) = normcdf(.4757, 10^99, 0, 1) =.3171 Z = 590 541 103 =.4757 The probability of an individual student having a mean 31.71% SAT score of 550 or better is approximately 31.7% Do you see the difference? 232 335 438 541 644 747 850 43/51

Another Example Speeds of cars on a highway have mean 52 mph and standard deviation of 6 mph, and are likely to be skewed to the right (a few very fast drivers). Describe what we might see in random samples of 50 cars. We are investigating the distribution of sample mean speeds from many samples of 50 cars on the highway. 44/51

Example 2 Random Sampling Condition: The 50 speeds were sampled randomly. Independence Assumption: It is reasonable to think that the speeds of the 50 randomly sampled cars will be mutually independent. 10% Condition: 50 cars represent less than 10% of all cars on the highway. Large enough: 50 cars is a large enough sample to ensure the sampling distribution is unimodal and symmetric even though the original distribution is skewed. 45/51

Example 2 The distribution of our sample (and of the population of cars being driven) is almost certainly positively skewed, but the Central Limit Theorem takes care of that because the sample size of 50 cars, is sufficiently large and overcomes the skewness in the population. Under these conditions, the sampling distribution of y has a Normal model, with mean 52 mph and standard deviation SD X ( ) = σ n = 6 0.8485 50 N(52, 0.8485) 46/51

Example 2 Mean speed of 50 cars N(52, 0.8485) 99.7% 95% According to the Normal model, we can expect 68% 49.45 50.30 51.15 52 52.85 53.7 54.55 68% of samples of 50 cars to have mean speeds between 51.2 and 52.9 mph 95% of samples to have means between 50.3 and 53.7 mph 99.7% of samples to have means between 49.5 and 54.6 mph. 47/51

Example What is the probability the next 50 cars will have an average speed of 50 mph or less? Mean speed of 50 cars N(52, 0.8485) n 0.92% 49.45 50.30 51.15 52 52.85 53.7 54.55 The conditions for a normal model N(52, 0.8485) have been met. P(x 50) = normalcdf(-10^99, 50, 52, 0.8485) =.0092 P(z -2.3571) = normalcdf(-10^99, -2.3571, 0, 1) =.0092 Z = 50 52.8485 = 2.3571 The probability of a random sample of 50 cars having a mean speed of 50 or lower is approximately 0.92% 48/51

Complete Response Once again, to ensure you completely answer a question, the following elements must be included in your response. n 1. Identify the parameter of study. n 2. You must determine that the sample was obtained randomly. n 3. Determine that the measures are independent of each other. n 4. Ensure the sample is not too large, n < 10% of population. n 5. Determine that the sample is sufficiently large such that the central limit theorem ensures the sampling distribution is sufficiently unimodal and symmetric, np 10 nq 10 for proportion, n > 30 for means. n 6. Calculate the SE showing the appropriate values. SD X n 7. A complete probability statement P(who) = how = what. ( ) = σ n SD p! ( ) = pq n n 8. If a normal model is involved; define the model, calculate any z scores (even if you do not use them), and an appropriately labeled curve. n 9. A concluding sentence in context. 49/51

The Real World and the Model World Be careful! Now we have two distributions to deal with. The first is the real world distribution of a sample from individual data, which we might display with a histogram. The second is the theoretical sampling distribution of the statistic, which we model with a Normal model based on the Central Limit Theorem. Don t confuse the two! 50/51

The Process Going Into the Sampling Distribution Model 51/51