Carolyn Anderson & YoungShil Paek (Slide contributors: Shuai Wang, Yi Zheng, Michael Culbertson, & Haiyan Li)

Department of Educational Psychology, University of Illinois at Urbana-Champaign

Inferential methods are the main focus of the rest of the course. Understanding the concept of a sampling distribution is crucial to understanding statistical inference.

Key Points
1. Statistic vs. Parameter
2. Population Distribution, Data Distribution, and Sampling Distributions
3. Mean and Standard Deviation of the Sampling Distribution of a Proportion
4. Inference with Sampling Distribution of a Proportion

Example: Predicting California Election Results Using Exit Polls
Using exit polls, polling organizations predict winners after learning how a small number of people voted, often only a few thousand out of possibly millions of voters. The total number of voters was over nine million, and the poll sampled a small portion of them. How do we know whether the sample proportion from the California exit poll is a good estimate, falling close to the population proportion? This section introduces a type of probability distribution called the sampling distribution, which helps us determine how close to the population parameter a sample statistic is likely to fall.

Example: Predicting California Election Results Using Exit Polls
In California in November 2010, the gubernatorial race pitted the Republican candidate, Meg Whitman, against the Democratic candidate, Jerry Brown. In an exit poll of 3889 randomly selected voters, 53.1% said they voted for Brown and 42.4% for Whitman. At the time of the exit poll, the percentage of the entire voting population (nearly 9.5 million people) that voted for Brown was unknown.

Example: Predicting Election Results Using Exit Polls
How close can we expect a sample percentage to be to the population percentage? How does the sample size influence our analysis? The sampling distribution helps us determine how close to the population parameter a sample statistic is likely to fall.

Recall: Statistic and Parameter
A statistic is a numerical summary of sample data, such as a sample proportion or sample mean. A parameter is a numerical summary of a population, such as a population proportion or population mean. In practice, we seldom know the values of parameters, so we use statistics computed from sample data to estimate them.

Population Distribution
Population distribution: the probability distribution of the random variable of interest in the whole population.
Example: Let X = vote outcome, with x = 1 for Jerry Brown and x = 0 for all other responses. The possible values of the random variable X (0 and 1) and how often these values occurred in the whole population (0.462 and 0.538) give the population distribution.

Data Distribution
Data distribution: the distribution of the random variable of interest in the one sample that we obtain from the population.
Example: The possible values of the random variable X (0 and 1) and how often these values occurred (0.469 and 0.531) give the data distribution for this one sample.
With random sampling, the larger the sample size n, the more closely the data distribution resembles the population distribution.

Example: Predicting Election Results Using Exit Polls
Figure 7.1: The population (9.5 million voters) and data (n = 3889) distributions of candidate preference (0 = Not Brown, 1 = Brown).

Sampling Distribution
Sampling distribution: the probability distribution of a sample statistic. With random sampling, the sampling distribution provides probabilities for all the possible values the statistic can take.
Examples: the sampling distribution of a sample proportion; the sampling distribution of a sample mean.

Sampling Distribution
A sampling distribution is different from a population distribution and a data distribution. Rather than giving probabilities for an observation for an individual subject (as in a population or data distribution), it gives probabilities for the value of a statistic for a sample of subjects. Sampling distributions describe the variability of the sample statistic (e.g., sample mean, sample proportion) that occurs from sample to sample. The sampling distribution provides the key for telling us how close a sample statistic falls to the corresponding unknown parameter.

True or False: For one population distribution there is only one data distribution.
a) True
b) False

Mean and SD of the Sampling Distribution of a Proportion
For a random sample of size n from a population with proportion p of outcomes in a particular category, the sampling distribution of the proportion of the sample in that category has
Mean = $p$
Standard deviation = $\sqrt{p(1-p)/n}$
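
The two formulas above can be checked by simulation. The following is a minimal sketch, not part of the original slides: it assumes an illustrative sample size of n = 100 and 2000 repeated samples (both chosen only to keep the run fast), and the helper name `simulate_sample_proportions` is made up for this example.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` random samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    proportions = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

p = 0.538   # population proportion from the California example
n = 100     # assumed sample size for illustration only
props = simulate_sample_proportions(p, n, reps=2000)

sim_mean = statistics.mean(props)        # should land near p
sim_sd = statistics.stdev(props)         # should land near sqrt(p(1-p)/n)
theory_sd = math.sqrt(p * (1 - p) / n)

print(f"simulated mean = {sim_mean:.3f} (theory: {p})")
print(f"simulated SD   = {sim_sd:.4f} (theory: {theory_sd:.4f})")
```

The simulated mean and standard deviation of the sample proportions should fall close to the theoretical values, though they will not match exactly because the number of repeated samples is finite.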

The Standard Error
To distinguish the standard deviation of a sampling distribution from the standard deviation of an ordinary probability distribution, we refer to it as a standard error. The standard error of a sample statistic (e.g., sample mean, sample proportion) is the standard deviation of the sampling distribution of that statistic.

Example: 2010 California Election Revisited
Election results showed that 53.8% of the population of all voters voted for Brown. What were the mean and standard deviation of the sampling distribution of the sample proportion who voted for him? Given that the exit poll sampled 3889 people (n = 3889) and 53.8% of all voters supported Brown (p = .538),
Mean = $p = .538$
S.E. = $\sqrt{p(1-p)/n} = \sqrt{.538(1-.538)/3889} = .008$

Suppose that 40% of men over the age of 30 suffer from lower back pain. For a random sample of 50 men over the age of 30, find the mean and the standard error of the sampling distribution of the sample proportion of men over the age of 30 who suffer from lower back pain.
a) Mean = 0.40, Standard Error = 0.0693
b) Mean = 20, Standard Error = 3.464
c) Mean = 0.40, Standard Error = 3.464
d) Mean = 20, Standard Error = 0.0693
e) Cannot be determined

Example: 2010 California Election Revisited
[Figure: sampling distribution of the sample proportion, marked at Mean - 3*S.E. = .514, Mean = .538, and Mean + 3*S.E. = .562]

Example: 2010 California Election Revisited
Q1: Given the sampling distribution of the sample proportion who voted for Brown, what values of the sample proportion would we expect to observe from random sampling (data distribution)?
Answer: Given p = .538, the sample proportion from a random sample taken from this population is likely to fall within 3 S.E. of the mean, that is, between .514 and .562.

Example: 2010 California Election Revisited
Q2: Based on the results of the exit poll, would you have been willing to predict Brown as the winner on election night while the votes were still being counted?

Example: 2010 California Election Revisited
Think it through: Our inference on the plausible population proportion will help us predict the election result. When the votes are still being counted, we do not know the actual population proportion (p). Our best estimate of the population proportion is the sample proportion (p-hat) from the exit poll. We can estimate the standard error of the sample proportion by substituting p-hat for p:
$\hat{p} = .531$
Estimated S.E. = $\sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{.531(1-.531)/3889} = .008$

Example: 2010 California Election Revisited
Think it through: With the estimated mean and standard error of the sample proportion, we can find a range of plausible values for the actual population proportion: .531 ± 3*.008 = [.507, .555]. All the plausible values for the population proportion of voters who voted for Brown are above 0.50 and give Brown a majority over any other candidate. Therefore, we would be willing to predict Brown as the winner.
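
The exit-poll reasoning above (estimate the standard error from p-hat, then take p-hat plus or minus 3 standard errors) can be sketched in a few lines. This is an illustration only; the function name `plausible_range` and the `multiplier` parameter are hypothetical, and the multiplier of 3 follows the rule used on these slides.

```python
import math

def plausible_range(p_hat, n, multiplier=3):
    """Return (low, high) = p_hat -/+ multiplier * estimated standard error."""
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # p_hat substituted for the unknown p
    return p_hat - multiplier * se_hat, p_hat + multiplier * se_hat

# Exit-poll values from the slides: 53.1% for Brown among 3889 sampled voters.
low, high = plausible_range(p_hat=0.531, n=3889)

# Predict Brown only if every plausible value of p exceeds one half.
brown_predicted = low > 0.5

print(f"plausible range for p: [{low:.3f}, {high:.3f}]")
print("predict Brown as winner:", brown_predicted)
```

The decision rule is deliberately conservative: the prediction is made only when the whole range, not just the point estimate, lies above 0.50.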

Key Points Revisited
1. Statistic vs. Parameter
2. Population Distribution, Data Distribution, and Sampling Distributions
3. Mean and Standard Deviation of the Sampling Distribution of a Proportion
4. Inference with Sampling Distribution of a Proportion

Key Points
1. The Sampling Distribution of the Sample Mean
2. Effect of n on the Standard Error
3. Central Limit Theorem (CLT)
4. Calculating Probabilities of Sample Means
5. Binomial Distribution is a Sampling Distribution

The Sampling Distribution of the Sample Mean
The sample mean, $\bar{x}$, is a random variable: it varies from sample to sample. By contrast, the population mean, µ, is a single fixed number.

The Mean and Standard Deviation of the Sampling Distribution of the Sample Mean
For a random sample of size n from a population having mean µ and standard deviation σ, the sampling distribution of the sample mean has its center described by the mean µ (the same as the mean of the population) and its spread described by the standard error, which equals the population standard deviation divided by the square root of the sample size:
$\mathrm{S.E.}(\bar{x}) = \sigma/\sqrt{n}$

Example 1: Pizza Sales
Daily sales at a pizza restaurant vary from day to day. The daily sales figures fluctuate around a mean µ = $900 with a standard deviation σ = $300. What are the center and spread of the sampling distribution of the average daily sales in a week (n = 7)?
Center: µ = 900 dollars
Spread: $\mathrm{S.E.} = \sigma/\sqrt{n} = 300/\sqrt{7} \approx 113$ dollars

The Sampling Distribution of the Sample Mean When the Population Distribution is Normally Distributed
For a random sample of size n from a normally distributed population having mean µ and standard deviation σ, the sampling distribution of the sample mean is also normally distributed, with its center described by the mean µ (the same as the mean of the population) and its spread described by the standard error, which equals the population standard deviation divided by the square root of the sample size:
$\mathrm{S.E.}(\bar{x}) = \sigma/\sqrt{n}$

The Sampling Distribution of the Sample Mean When the Population Distribution is NOT Normally Distributed
For a random sample of size n from a population that is NOT normally distributed, having mean µ and standard deviation σ, the sampling distribution of the sample mean becomes approximately normal as the sample size increases, has its center described by the mean µ (the same as the mean of the population), and has its spread described by the standard error, which equals the population standard deviation divided by the square root of the sample size:
$\mathrm{S.E.}(\bar{x}) = \sigma/\sqrt{n}$

Central Limit Theorem (CLT)
CLT: For a random sample of size n from a population having mean µ and standard deviation σ, the sampling distribution of the sample mean becomes approximately normal as the sample size increases, has its center described by the mean µ (the same as the mean of the population), and has its spread described by the standard error, which equals the population standard deviation divided by the square root of the sample size:
$\mathrm{S.E.}(\bar{x}) = \sigma/\sqrt{n}$
This result applies no matter what the shape of the probability distribution from which the samples are taken.

CLT: How Large a Sample?
The sampling distribution of the sample mean takes more of a bell shape as the random sample size n increases. The more skewed the population distribution, the larger n must be before the shape of the sampling distribution is close to normal. In practice, the sampling distribution is usually close to normal when the sample size n is at least about 30. If the population distribution is approximately normal, then the sampling distribution is approximately normal for all sample sizes.
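
To see the bell shape emerge, one can simulate sample means from a skewed population and watch the skewness shrink as n grows. The sketch below is an illustration under stated assumptions, not material from the slides: the population is taken to be exponential with mean 1 (a strongly right-skewed choice), 4000 repeated samples are drawn per sample size, and `sample_means` and `skewness` are hypothetical helper names.

```python
import random
import statistics

def sample_means(n, reps, seed=0):
    """Means of `reps` random samples of size n from an exponential population (mean 1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric, bell-shaped distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

# Compare a very small sample size with the n = 30 guideline.
skew_by_n = {n: skewness(sample_means(n, reps=4000)) for n in (2, 30)}

for n, sk in skew_by_n.items():
    print(f"n = {n:2d}: skewness of the sample means = {sk:.2f}")
```

With n = 2 the sample means remain clearly right-skewed, while at n = 30 the skewness is much closer to zero, in line with the "at least about 30" guideline.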

CLT: Impact of increasing n

CLT Helps Us Make Inferences
For large n, the sampling distribution is approximately normal even if the population distribution is not. This enables us to make inferences about population means regardless of the shape of the population distribution.

Effect of n on the Standard Error
Knowing how to find a standard error gives us a mechanism for understanding how much variability to expect in sample statistics just by chance.
The standard error of the sample mean = $\sigma/\sqrt{n}$
As the sample size n increases, the denominator increases, so the standard error decreases. With larger samples, the sample mean is more likely to fall closer to the population mean.
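
A short numerical sketch of this effect, reusing σ = $300 from the pizza-sales example. The sample sizes 7, 28, and 112 are assumptions chosen so that each is four times the previous one; they are not from the slides.

```python
import math

def standard_error_of_mean(sigma, n):
    """Standard error of the sample mean: sigma divided by sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 300  # standard deviation of daily sales (pizza example)

# Each sample size is 4x the previous one, so each S.E. should be half the previous.
se_by_n = {n: standard_error_of_mean(sigma, n) for n in (7, 28, 112)}

for n, se in se_by_n.items():
    print(f"n = {n:3d} days: S.E. = {se:.2f} dollars")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.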

Calculating Probabilities of Sample Means
The weights of milk bottles are normally distributed with a mean of 1.1 lbs and a standard deviation σ = 0.20 lbs. What is the probability that the mean of a random sample of 5 bottles will be greater than 0.99 lbs?
First, calculate the mean and standard error of the sampling distribution for a random sample of 5 milk bottles. Because the population is normally distributed, $\bar{x}$ is normally distributed with mean = 1.1 and standard error = $0.2/\sqrt{5} = 0.0894$.
$P(\bar{X} > .99) = P\left(Z > \frac{.99 - 1.1}{.0894}\right) = P(Z > -1.23) = .89$
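
The same calculation can be done without a z-table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); the variable names are chosen for this example only.

```python
import math
from statistics import NormalDist

mu, sigma, n = 1.1, 0.20, 5      # population mean, population SD, sample size
se = sigma / math.sqrt(n)        # standard error of the sample mean

# Sampling distribution of the mean weight of 5 bottles.
sampling_dist = NormalDist(mu=mu, sigma=se)

# P(sample mean > 0.99) is the area to the right of 0.99.
prob_greater = 1 - sampling_dist.cdf(0.99)

print(f"S.E. = {se:.4f}")
print(f"P(sample mean > 0.99) = {prob_greater:.2f}")
```

Note that the standard deviation passed to `NormalDist` is the standard error, not the population standard deviation, because the question is about a sample mean rather than a single bottle.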

Binomial Distribution is a Sampling Distribution
In the binomial distribution, p, the probability of success in one trial, can also be regarded as the population proportion of successes. The binomial distribution is the probability distribution of the number of successes in n independent trials. It can be regarded as the sampling distribution of the sample proportion of successes, multiplied by n, when the sample size is n.

Binomial Distribution is a Sampling Distribution
For a random sample of size n from a population with proportion p of success, the sampling distribution of the sample proportion has
Mean = $p$
Standard error = $\sqrt{p(1-p)/n}$
Multiplying these by n gives the mean and standard deviation of the binomial distribution. The binomial distribution of the number of successes in n independent trials with probability of success p in each trial has:
Mean = $np$
Standard deviation = $n\sqrt{p(1-p)/n} = \sqrt{np(1-p)}$
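
The link between the two sets of formulas can be verified numerically. The sketch below reuses n = 3889 and p = .538 from the California example purely as illustrative inputs; `binomial_mean_sd` is a hypothetical helper name.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of the number of successes in n trials."""
    return n * p, math.sqrt(n * p * (1 - p))

n, p = 3889, 0.538

# Sampling distribution of the sample proportion.
prop_mean = p
prop_se = math.sqrt(p * (1 - p) / n)

# Binomial distribution of the count of successes.
count_mean, count_sd = binomial_mean_sd(n, p)

print(f"proportion: mean = {prop_mean:.3f}, S.E. = {prop_se:.4f}")
print(f"count:      mean = {count_mean:.1f}, SD = {count_sd:.2f}")
print("count SD equals n * S.E.:", math.isclose(count_sd, n * prop_se))
```

Multiplying the proportion by n rescales both its center and its spread by n, which is why the count's standard deviation is n times the standard error of the proportion.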

Approximating the Binomial Distribution with the Normal Distribution
The binomial distribution can be well approximated by the normal distribution when the expected number of successes, np, and the expected number of failures, n(1-p), are both at least 15. This is an application of the CLT.
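
The rule of thumb above can be sketched as a small check, followed by a comparison of an exact binomial probability with its normal approximation. The values n = 100, p = 0.3, and k = 30 are hypothetical, and the continuity correction (evaluating the normal curve at k + 0.5) is a common refinement assumed here; it does not appear on the slides.

```python
import math
from statistics import NormalDist

def normal_approx_ok(n, p, threshold=15):
    """Rule of thumb: expected successes and expected failures both at least 15."""
    return n * p >= threshold and n * (1 - p) >= threshold

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial count X with n trials and success probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p, k = 100, 0.3, 30   # hypothetical values for illustration

exact = binomial_cdf(k, n, p)

# Normal approximation with mean np and SD sqrt(np(1-p)), plus continuity correction.
approx = NormalDist(n * p, math.sqrt(n * p * (1 - p))).cdf(k + 0.5)

print("rule of thumb satisfied:", normal_approx_ok(n, p))
print(f"exact P(X <= {k}) = {exact:.4f}, normal approximation = {approx:.4f}")
```

When the rule of thumb fails, for example with n = 20 and p = 0.5 (only 10 expected successes), the approximation should be used with more caution.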

2000 Presidential Election
The 2000 US presidential election came down to votes in Florida. The official results from the Florida Department of State, Division of Elections, for the two top candidates on Sunday November 28, 2000:
George W. Bush   2,912,790
Al Gore          2,912,253
Total            5,825,043
Bush had a lead of only 537 votes. The distribution of the proportion for Bush is approximately normal.

Example continued
Proportion for Bush: $p = 2912790/5825043 = .5000460941$
$se = \sqrt{p(1-p)/n} = .000207166$
If the election was a tie, $z = (.5000460941 - .5)/.000207166 = .2225$
$P(|Z| > .2225) = .82$, which equals the probability of making a mistake if the election was a tie. (Bush would have had to win by 6,217 votes for a decisive victory.)
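
The arithmetic for this example is easy to slip on because of the many leading zeros, so it is worth recomputing directly from the vote counts. The sketch below is an illustration, not the authors' own calculation. One assumption is labeled in the code: the 6,217-vote figure is reproduced by requiring a lead of 2.576 standard errors (the two-sided 99% normal cutoff), which is an inference about how that number was obtained.

```python
import math
from statistics import NormalDist

bush, gore = 2_912_790, 2_912_253
n = bush + gore                      # total votes for the two top candidates

p_hat = bush / n                     # proportion for Bush
se = math.sqrt(p_hat * (1 - p_hat) / n)

# How many standard errors is Bush's proportion above a 50/50 tie?
z = (p_hat - 0.5) / se

# Two-sided probability of a gap at least this large if the race were a tie.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# ASSUMPTION: "decisive" means a lead of 2.576 standard errors.
# A lead of L votes shifts the proportion by L / (2n), so L = 2.576 * se * 2n.
decisive_lead = 2.576 * se * 2 * n

print(f"p_hat = {p_hat:.10f}, se = {se:.9f}")
print(f"z = {z:.4f}, two-sided probability = {p_value:.2f}")
print(f"lead needed under the assumed cutoff: {decisive_lead:.0f} votes")
```

A 537-vote lead is a small fraction of one standard error, so a gap of this size would be entirely unsurprising in a truly tied race.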

Key Points Revisited
1. The Sampling Distribution of the Sample Mean
2. Effect of n on the Standard Error
3. Central Limit Theorem (CLT)
4. Calculating Probabilities of Sample Means
5. Binomial Distribution is a Sampling Distribution