Sampling Distribution Models. Central Limit Theorem


Thought Questions 1. 40% of large population disagree with new law. In parts a and b, think about role of sample size. a. If randomly sample 10 people, will exactly four (40%) disagree with law? Surprised if only two in sample disagreed? How about if none disagreed? b. If randomly sample 1000 people, will exactly 400 (40%) disagree with law? Surprised if only 200 in sample disagreed? How about if none disagreed?

The Diversity of Samples from the Same Population Working Backward from Samples to Populations: Start with a question about the population. Collect a sample from the population and measure the variable. Answer the question of interest for the sample. With statistics, determine how close such an answer, based on a sample, would tend to be to the actual answer for the population. Understanding Dissimilarity among Samples: We need to understand what kind of differences we should expect to see in various samples from the same population.

What to Expect of Sample Proportions A slice of the population 40% of population carry a certain gene Do Not Carry Gene =, Do Carry Gene = X

What to Expect of Sample Proportions Possible Samples Sample 1: Proportion with gene = 12/25 = 0.48 = 48% Sample 2: Proportion with gene = 9/25 = 0.36 = 36% Sample 3: Proportion with gene = 10/25 = 0.40 = 40% Sample 4: Proportion with gene = 7/25 = 0.28 = 28%

The Central Limit Theorem for Sample Proportions Rather than showing real repeated samples, imagine what would happen if we were to actually draw many samples. Now imagine what would happen if we looked at the sample proportions for these samples. The histogram we'd get if we could see all the proportions from all possible samples is called the sampling distribution of the proportions. What would the histogram of all the sample proportions look like?

The Central Limit Theorem for Sample Proportions We would expect the histogram of the sample proportions to center at the true proportion, p, in the population. As far as the shape of the histogram goes, we can simulate a bunch of random samples that we didn't really draw. It turns out that the histogram is unimodal, symmetric, and centered at p. More specifically, it's an amazing and fortunate fact that a Normal model is just the right one for the histogram of sample proportions.

Imagine, Imagine, Imagine - Predicting Election Results Imagine a research organization (R.E.) wants to know if Candidate A would win the election if held next week. They completed a well-conducted poll of 100 randomly selected voters. Imagine several other research organizations (499 of them) also completed a well-conducted poll of 100 randomly selected voters to try and answer the same question. They all conduct the polls on the same day. Imagine that the answer they are looking for is 0.53, that is Candidate A would get 53% of the vote if the election were held on the day of the polls.

Predicting Election Results
R.E.   Results
1      52/100 = 0.52
2      49/100 = 0.49
3      62/100 = 0.62
4      45/100 = 0.45
5      59/100 = 0.59
...and so on

[Figure: Results from 500 Polls of 100 voters. Histogram of relative frequency against proportion in sample that would vote for Candidate A.]

[Figure: Results from 500 Polls of 400 voters. Histogram of relative frequency against proportion in sample that would vote for Candidate A.]

[Figure: side-by-side histograms, Results from 500 Polls of 100 voters and Results from 500 Polls of 400 voters. Relative frequency against proportion in sample that would vote for Candidate A.]

[Figure: Results from 500 Polls of 1000 voters. Histogram of relative frequency against proportion in sample that would vote for Candidate A.]

[Figure: side-by-side histograms, Results from 500 Polls of 400 voters and Results from 500 Polls of 1000 voters. Relative frequency against proportion in sample that would vote for Candidate A.]
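The polling thought experiment can be imitated with a short simulation. This is only a sketch under stated assumptions: the function name simulate_polls and the fixed seed are illustrative, and each voter is assumed to support Candidate A independently with probability 0.53.

```python
import random
import statistics

def simulate_polls(p, n, num_polls=500, seed=0):
    """Return the sample proportions from num_polls simulated polls of n voters each."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_polls):
        # Each voter independently supports Candidate A with probability p.
        votes_for_a = sum(rng.random() < p for _ in range(n))
        proportions.append(votes_for_a / n)
    return proportions

# Compare the center and spread of the simulated proportions for each poll size.
for n in (100, 400, 1000):
    props = simulate_polls(p=0.53, n=n)
    print(n, round(statistics.mean(props), 3), round(statistics.pstdev(props), 3))
```

The printed spread should shrink as the poll size grows, while the center stays near 0.53.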

The Central Limit Theorem for Sample Proportions Modeling how sample proportions vary from sample to sample is one of the most powerful ideas we'll see in this course. A sampling distribution model for how a sample proportion varies from sample to sample allows us to quantify that variation and how likely it is that we'd observe a sample proportion in any particular interval. To use a Normal model, we need to specify its mean and standard deviation. We'll put µ, the mean of the Normal, at p.

The Central Limit Theorem for Sample Proportions When working with proportions, knowing the mean automatically gives us the standard deviation as well: the standard deviation we will use is $\sqrt{pq/n}$, where $q = 1 - p$. So, the distribution of the sample proportions is modeled with a probability model that is $N\left(p, \sqrt{pq/n}\right)$.

Sampling Distribution Model for Proportions Mean and Standard Deviation Let Y ~ Binom(n, p), where n is the number of trials and p is the probability of success.
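A standard way to get the mean and standard deviation of the sample proportion from this setup is sketched here. The notation $\hat{p} = Y/n$ and $q = 1 - p$ is assumed for the sketch.

```latex
\begin{align*}
E(Y) &= np, & SD(Y) &= \sqrt{npq}, \\
E(\hat{p}) &= E\!\left(\frac{Y}{n}\right) = \frac{np}{n} = p, &
SD(\hat{p}) &= SD\!\left(\frac{Y}{n}\right) = \frac{\sqrt{npq}}{n} = \sqrt{\frac{pq}{n}}.
\end{align*}
```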

The Central Limit Theorem for Sample Proportions A picture of what we just discussed is as follows:

The Central Limit Theorem for Sample Proportions Because we have a Normal model, we know, for example, that about 95% of Normally distributed values are within two standard deviations of the mean. So we should not be surprised if 95% of various polls gave results that were near the mean but varied above and below it by no more than two standard deviations. This is what we call sampling variability: the sample proportion varies from sample to sample. All possible sample proportions arrange themselves neatly under a Normal curve.

What to Expect of Sample Proportions If numerous samples of the same size are taken from a population, the frequency curve made from the proportions from the various samples will be approximately bell-shaped. In other words, the sampling distribution of possible sample proportions is approximately Normal and centered at the true population proportion, with standard deviation $\sqrt{(\text{true proportion})(1 - \text{true proportion}) / \text{sample size}}$.

What to Expect of Sample Proportions In reality, we don't know the true proportion. The standard error (se) for the sampling distribution of possible sample proportions is $\sqrt{(\text{sample proportion})(1 - \text{sample proportion}) / \text{sample size}}$.

Assumptions and Conditions
1. Randomization Condition: The sample should be a random sample of the population.
2. 10% Condition: The sample size, n, must be no larger than 10% of the population.
3. Success/Failure Condition: The sample size has to be big enough so that both np (expected number of successes) and nq (expected number of failures) are at least 10.
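The two numerical conditions can be checked mechanically. This is a sketch only: the helper name check_proportion_conditions and the population size of 1,000,000 are hypothetical. The Randomization Condition cannot be verified from numbers alone, so it is left out.

```python
def check_proportion_conditions(n, p, population_size):
    """Check the 10% and Success/Failure conditions for a sample proportion.

    Randomization is a property of how the data were collected,
    so it is not tested here.
    """
    q = 1 - p
    return {
        "ten_percent": n <= 0.10 * population_size,
        "success_failure": n * p >= 10 and n * q >= 10,
    }

# A poll of 100 voters with p = 0.53 (population size is a made-up figure).
print(check_proportion_conditions(n=100, p=0.53, population_size=1_000_000))
# A sample of only 10 people with p = 0.40 fails the Success/Failure Condition.
print(check_proportion_conditions(n=10, p=0.40, population_size=1_000_000))
```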

A Sampling Distribution Model for a Proportion A proportion is no longer just a computation from a set of data. It is now a random variable quantity that has a probability distribution. This distribution is called the sampling distribution model for proportions. Even though we depend on sampling distribution models, we never actually get to see them. We never actually take repeated samples from the same population and make a histogram. We only imagine or simulate them.

A Sampling Distribution Model for a Proportion Still, sampling distribution models are important because they act as a bridge from the real world of data to the imaginary world of the statistic and enable us to say something about the population when all we have is data from the real world.

The Sampling Distribution Model for a Proportion Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with Mean: $E(\hat{p}) = p$ and Standard deviation: $SD(\hat{p}) = \sqrt{pq/n}$.

What to Expect of Sample Proportions Example: Suppose 40% of all voters in the U.S. favor candidate X. Pollsters take a sample of 2400 people. What sample proportion would be expected to favor candidate X? The sample proportion could be anything from a bell-shaped curve with mean 0.40 and standard deviation $\sqrt{(0.40)(1 - 0.40)/2400} = 0.01$.
For our sample of 2400 people:
68% of sample proportions will be between 39% and 41%
95% of sample proportions will be between 38% and 42%
99.7% of sample proportions will be between 37% and 43%
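The arithmetic in this example can be reproduced in a few lines. The function names proportion_sd and empirical_rule_intervals are illustrative, and the intervals use the 68-95-99.7 rule (1, 2, and 3 standard deviations).

```python
import math

def proportion_sd(p, n):
    """Standard deviation of the sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def empirical_rule_intervals(center, sd):
    """Return the 68%, 95%, and 99.7% intervals (1, 2, and 3 SDs from the center)."""
    return {label: (center - k * sd, center + k * sd)
            for label, k in (("68%", 1), ("95%", 2), ("99.7%", 3))}

sd = proportion_sd(0.40, 2400)
for label, (low, high) in empirical_rule_intervals(0.40, sd).items():
    print(f"{label}: {low:.2f} to {high:.2f}")
```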

Example: Do Americans Really Vote When They Say They Do? Reported in Time magazine (Nov 28, 1994): Telephone poll of 800 adults (2 days after election) 56% reported they had voted. Committee for Study of American Electorate stated only 39% of American adults had voted. Could it be the results of poll simply reflected a sample that, by chance, voted with greater frequency than general population?

Example: Do Americans Really Vote When They Say They Do? Suppose only 39% of American adults voted. We can expect sample proportions to be represented by a bell-shaped curve with mean 0.39 and standard deviation $\sqrt{(0.39)(1 - 0.39)/800} = 0.017$, or 1.7%.
68% of sample proportions will be between 37.3% and 40.7%
95% of sample proportions will be between 35.6% and 42.4%
99.7% of sample proportions will be between 33.9% and 44.1%
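One way to answer the question posed by this example is to ask how many standard deviations the poll result of 56% lies from 39%. This is a sketch that assumes the 39% figure is the true proportion; the variable names are illustrative.

```python
import math

p_claimed = 0.39   # proportion of adults who voted, per the committee
n = 800            # poll size
p_hat = 0.56       # proportion in the poll who reported voting

sd = math.sqrt(p_claimed * (1 - p_claimed) / n)
z = (p_hat - p_claimed) / sd  # number of SDs the poll result lies above 0.39

print(f"SD = {sd:.4f}, z = {z:.1f}")
```

Under these assumptions the poll result lies nearly ten standard deviations above 39%, far outside the 99.7% range, so chance alone is an implausible explanation.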


Thought Questions 2. Mean weight of all women at large university is 135 pounds with a standard deviation of 10 pounds. a. Recalling Empirical Rule for bell-shaped curves, in what range would you expect 95% of women's weights to fall? b. If randomly sampled 10 women at university, how close do you think their average weight would be to 135 pounds? c. If sampled 1000 women, would you expect average weight to be closer to 135 pounds than for the sample of only 10 women?

What to Expect of Sample Means Example: Want to estimate average weight loss for all who attend national weight-loss clinic for 10 weeks. Unknown to us, population mean weight loss is 8 pounds and standard deviation is 5 pounds. If weight losses are approximately bell-shaped, 95% of individual weight losses will fall between -2 (a gain of 2 pounds) and 18 pounds lost.
Possible Samples (random samples of 25 people from this population)
Sample 1: 1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 11, 11, 13, 13, 14, 14, 15, 16, 16
Sample 2: -2, -2, 0, 0, 3, 4, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9, 9, 9, 9, 10, 11, 12, 13, 13, 16
Sample 3: -4, -4, 2, 3, 4, 5, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13, 14, 16, 18
Sample 4: -3, -3, -2, 0, 1, 2, 2, 4, 4, 5, 7, 7, 9, 9, 10, 10, 10, 11, 11, 12, 12, 14, 14, 14, 19

What to Expect of Sample Means Results: Sample 1: Mean = 8.32 pounds Sample 2: Mean = 6.76 pounds Sample 3: Mean = 8.48 pounds Sample 4: Mean = 7.16 pounds Each sample gave a different sample mean, but close to 8.
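The four sample means can be checked directly. In this sketch, negative entries denote weight gains; that reading of the leading values in Samples 2 to 4 is an assumption, but it reproduces the stated means exactly.

```python
from statistics import mean

# Weight losses in pounds; negative values are gains (assumed reading of the data).
samples = {
    1: [1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 11, 11, 13, 13, 14, 14, 15, 16, 16],
    2: [-2, -2, 0, 0, 3, 4, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9, 9, 9, 9, 10, 11, 12, 13, 13, 16],
    3: [-4, -4, 2, 3, 4, 5, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13, 14, 16, 18],
    4: [-3, -3, -2, 0, 1, 2, 2, 4, 4, 5, 7, 7, 9, 9, 10, 10, 10, 11, 11, 12, 12, 14, 14, 14, 19],
}

for label, values in samples.items():
    print(f"Sample {label}: n = {len(values)}, mean = {mean(values):.2f} pounds")
```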

What to Expect of Sample Means Say the true population mean height is 68 inches and the population standard deviation is 3 inches

What to Expect of Sample Means Variation in sample means: Say the actual mean height of a population is 68 inches. First sample mean is 67.5 inches. Second sample mean is 66 inches. Third sample mean is 69 inches. Sample means are (hopefully) close to the population mean, but they are not identical. Why? Due to sampling variability: each sample is only a subset of the population.

What to Expect of Sample Means : Population of measurements is bell-shaped, and a random sample of any size is measured.

What to Expect of Sample Means: Population of measurements of interest is not bell-shaped, but a large random sample is measured. Sample of size 30 is considered large, but if there are extreme outliers, better to have a larger sample. Population Mean is 21.76


What to Expect of Sample Means If numerous samples or repetitions of the same size are taken, the frequency curve of means from the various samples will be approximately bell-shaped. The mean for the sampling distribution of the sample mean is equal to the true population mean. The standard deviation (sd) for the sampling distribution of the possible sample means is $\text{population standard deviation} / \sqrt{\text{sample size}}$.

What to Expect of Sample Means In reality, we won't know the population standard deviation. The standard error (se) for the sampling distribution of the possible sample means is $\text{sample standard deviation} / \sqrt{\text{sample size}}$.
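As a minimal sketch, the standard error can be computed from a single sample. The data are Sample 1 from the weight-loss example; the variable names are illustrative.

```python
import math
from statistics import stdev

# Sample 1 from the weight-loss example (pounds lost).
sample = [1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9,
          11, 11, 13, 13, 14, 14, 15, 16, 16]

s = stdev(sample)                  # sample standard deviation (n - 1 in the denominator)
se = s / math.sqrt(len(sample))    # standard error of the sample mean

print(f"s = {s:.2f} pounds, se = {se:.2f} pounds")
```

The result is close to, but not the same as, the value obtained from the population standard deviation of 5 pounds, since s is itself an estimate.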

The Central Limit Theorem: The Fundamental Theorem of Statistics The sampling distribution of any mean becomes more nearly Normal as the sample size grows. We don't even care about the shape of the population distribution as long as the sample size is large enough! The Fundamental Theorem of Statistics is called the Central Limit Theorem (CLT).

The Central Limit Theorem: The Fundamental Theorem of Statistics The CLT is surprising: Not only does the histogram of the sample means get closer and closer to the Normal distribution as the sample size grows, but this is true regardless of the shape of the population distribution. The CLT works better (and faster) the closer the population distribution is to a Normal itself. It also works better for larger samples.
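A small simulation can illustrate this claim. The sketch below draws samples from a right-skewed population (exponential with mean 1 and standard deviation 1, an illustrative choice that is not from the slides) and compares samples of size 1 with samples of size 50. For a Normal model about 68% of values fall within one standard deviation of the mean, so that fraction serves as a rough check of shape.

```python
import math
import random
import statistics

def sample_means(n, num_samples=5000, seed=1):
    """Means of num_samples random samples of size n from a right-skewed population.

    The population is exponential with mean 1 and standard deviation 1
    (an illustrative choice, not taken from the slides).
    """
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(num_samples)]

def fraction_within_one_sd(values):
    """Fraction of values within one standard deviation of their own mean."""
    center = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - center) <= sd for v in values) / len(values)

for n in (1, 50):
    means = sample_means(n)
    print(n,
          round(statistics.pstdev(means), 3),      # observed SD of the sample means
          round(1 / math.sqrt(n), 3),              # sigma / sqrt(n) with sigma = 1
          round(fraction_within_one_sd(means), 3)) # about 0.68 if nearly Normal
```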


The Fundamental Theorem of Statistics The Central Limit Theorem (CLT) The mean of a random sample is a random variable whose sampling distribution can be approximated by a Normal model. The larger the sample, the better the approximation will be.

Assumptions and Conditions The CLT requires essentially the same assumptions we saw for modeling proportions: Independence Assumption: The sampled values must be independent of each other. Sample Size Assumption: The sample size must be sufficiently large.

Assumptions and Conditions (cont.) We can't check these directly, but we can think about whether the Independence Assumption is plausible. We can also check some related conditions:
Randomization Condition: The data values must be sampled randomly.
10% Condition: When the sample is drawn without replacement, the sample size, n, should be no more than 10% of the population.
Large Enough Sample Condition: The CLT doesn't tell us how large a sample we need. For now, you need to think about your sample size in the context of what you know about the population.

Weight-Loss Example In the weight-loss example, the population mean and standard deviation were 8 pounds and 5 pounds, respectively, and we were taking random samples of size 25. Potential sample means are represented by a bell-shaped curve with mean of 8 pounds and standard deviation $5/\sqrt{25} = 1$ pound.
For our samples of 25 people:
68% of sample means will be between 7 and 9 pounds
95% of sample means will be between 6 and 10 pounds
99.7% of sample means will be between 5 and 11 pounds

Weight-Loss Example Increasing the Size of the Sample: Suppose a sample of 100 people instead of 25 was taken. Potential sample means are still represented by a bell-shaped curve with mean of 8 pounds, but with standard deviation $5/\sqrt{100} = 0.5$ pounds.
For our sample of 100 people:
68% of sample means will be between 7.5 and 8.5 pounds
95% of sample means will be between 7 and 9 pounds
99.7% of sample means will be between 6.5 and 9.5 pounds

Question Suppose that test scores on a particular exam have a mean of 77 and standard deviation of 5, and that they follow a bell-shaped curve. Suppose you randomly select 1000 samples of size 100 from this population and calculate the sample mean test scores. Between what two values (sample means) would you expect 95% of these sample mean test scores to fall? A. 72 and 82 B. 67 and 87 C. 76 and 78


But Which Normal? The CLT says that the sampling distribution of any mean or proportion is approximately Normal. But which Normal model? For proportions, the sampling distribution is centered at the population proportion. For means, it's centered at the population mean. But what about the standard deviations?

But Which Normal? The Normal model for the sampling distribution of the mean has a standard deviation equal to $SD(\bar{y}) = \sigma/\sqrt{n}$, where σ is the population standard deviation.

But Which Normal? The Normal model for the sampling distribution of the proportion has a standard deviation equal to $SD(\hat{p}) = \sqrt{pq/n} = \sqrt{pq}/\sqrt{n}$.

About Variation The standard deviation of the sampling distribution declines only with the square root of the sample size (the denominator contains the square root of n). Therefore, the variability decreases as the sample size increases. While we'd always like a larger sample, the square root limits how much we can make a sample tell about the population. (This is an example of the Law of Diminishing Returns.)
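The diminishing returns are easy to see numerically. This sketch uses the population standard deviation of 5 pounds from the weight-loss example; the function name sd_of_mean is illustrative. Each time the sample size is quadrupled, the standard deviation of the sample mean is only halved.

```python
import math

def sd_of_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 5  # population SD from the weight-loss example, in pounds

for n in (25, 100, 400, 1600):
    print(f"n = {n:>4}: SD of sample mean = {sd_of_mean(sigma, n):.3f} pounds")
```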

The Real World and the Model World Be careful! Now we have two distributions to deal with. The first is the real world distribution of the sample, which we might display with a histogram. The second is the math world sampling distribution of the statistic, which we model with a Normal model based on the Central Limit Theorem. Don't confuse the two!

Sampling Distribution Models There are two basic truths about sampling distributions: 1. Sampling distributions arise because samples vary. Each random sample will have different cases and, so, a different value of the statistic. 2. Although we can always simulate a sampling distribution, the Central Limit Theorem saves us the trouble for means and proportions.

What Can Go Wrong? Don't confuse the sampling distribution with the distribution of the sample. When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics. The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples: the one you got and the ones you didn't get. Watch out for small samples from skewed populations. The more skewed the distribution, the larger the sample size we need for the CLT to work.

Summary Sample proportions and means will vary from sample to sample; that's sampling error (sampling variability). Sampling variability may be unavoidable, but it is also predictable! In statistics, the concept of a population parameter is theoretical. Only "God" knows the "Truth"; we try our best to find out what it is.

A Scientific Look at the Dangers of High Heels, NY Times, Jan., 2012 Not long ago, Neil J. Cronin, a postdoctoral researcher, and two of his colleagues at the Musculoskeletal Research Program at Griffith University in Queensland, Australia, were having coffee on the university's campus when they noticed a young woman tottering past in high heels. "She looked quite uncomfortable and unstable," Dr. Cronin says.

A Scientific Look at the Dangers of High Heels, NY Times, Jan., 2012 Some observers, particularly women, might have winced in sympathy or, alternatively, wondered where she'd bought stilettos. But the three researchers, men who study the biomechanics of walking, were struck instead by the scientific implications of her passage. "We began to consider what might be happening at the muscle and tendon level in women who wear heels," Dr. Cronin says.

Study: Long-term use of high heeled shoes alters the neuromechanics of human walking