David Meredith
Department of Mathematics
San Francisco State University
October 22, 2009

What we will do today

What is a?

Decision support
- Goal of statistics: optimize decisions with partial information.

Political decisions
- Example: a political party wants to appeal to voters.
- It cannot ask every voter what they think, so it asks a sample.
- Statistics tells us how closely the sample is likely to match the population.

Medical innovations
- Example: medical researchers think they have an effective drug.
- They conduct a careful test with randomization of patients, placebos, etc.
- Statistics helps plan the test, decide the number of subjects, etc.
- Statistics tells us whether the test provides good evidence of the drug's effectiveness.

Using sample statistics to estimate population parameters
- Population parameter: percentage of all voters who favor a single-payer health-care system.
- Sample statistic: percentage of a sample of voters who favor a single-payer health-care system.
- The sample percentage is a point estimate for the population percentage.
- Any sample statistic is a point estimate for the corresponding population parameter.

Example point estimates
- If we measure heights, the mean height of a sample is a point estimate for the mean height of the population.
- The standard deviation of the sample heights is a point estimate for the standard deviation of the heights in the population.
- The proportion of people favoring single-payer in a sample of voters is a point estimate for the proportion of people favoring single-payer in the population.
- The slope of the regression line through a sample of heights and weights is a point estimate for the slope of the regression line through the heights and weights of the entire population.

The big question
- The important question is: how accurate is a point estimate?
- The rest of this course will study that question.
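
The point estimates listed above can each be computed directly from a sample. The sketch below is only an illustration: the heights, weights and yes/no answers are made-up values, not data from the lecture, and the variable names are hypothetical.

```python
from statistics import mean, stdev

# Made-up sample of (height in inches, weight in pounds); illustrative only.
sample = [(66.0, 140.0), (68.5, 155.0), (70.0, 172.0), (71.5, 180.0), (73.0, 195.0)]
heights = [h for h, _ in sample]
weights = [w for _, w in sample]

# Point estimates for the population mean and standard deviation of height.
mean_height = mean(heights)
sd_height = stdev(heights)  # sample standard deviation (n - 1 divisor)

# Made-up answers: True means the voter favors single-payer.
answers = [True, False, True, True, False, True, False, True]
proportion = sum(answers) / len(answers)  # point estimate for the population proportion

# Least-squares slope of weight on height: point estimate for the population slope.
mean_weight = mean(weights)
sxy = sum((h - mean_height) * (w - mean_weight) for h, w in sample)
sxx = sum((h - mean_height) ** 2 for h in heights)
slope = sxy / sxx

print(mean_height, sd_height, proportion, slope)
```

None of these numbers says how close the estimate is to the population value, which is the question the rest of the lecture takes up.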

Three answers for quantitative variable
Question: how effective is this medicine?
Population: sick people; variable: blood pressure (quantitative)
1. The medicine lowers blood pressure by about 20 points on average (point estimate).
2. I'm highly confident that the medicine lowers blood pressure by between 15 and 25 points on average (confidence interval).
3. I'm almost certain the medicine lowers blood pressure (hypothesis test).

Three answers for categorical variable
Question: how effective is this medicine?
Population: sick people; variable: disease, present or absent (categorical)
1. The medicine cures about 80% of cases (point estimate).
2. I'm highly confident that the medicine cures between 70% and 90% of cases (confidence interval).
3. I'm almost certain the medicine cures some people (hypothesis test).

Confidence intervals
- Suppose you want to estimate some parameter for a population, like average height, percentage with type A blood, percentage of Republicans, average lifespan, etc.
- A sample will give you a point estimate p for the parameter.
- We do not expect p to be exactly correct.
- A confidence interval is an interval (a, b) that you are somewhat confident contains the true population parameter.
- We are somewhat confident that $a < \pi < b$, where $\pi$ is the true parameter.
- Your point estimate p is right in the middle of (a, b): $p = \frac{a + b}{2}$, $a = p - e$, $b = p + e$.
- You need to learn how to compute the margin of error e for different levels of confidence.
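
As a small check on the relations between a, b, p and e, the sketch below recovers the point estimate and the margin of error from the two example intervals quoted earlier (15 to 25 points, and 70% to 90%). The function name is hypothetical.

```python
def point_estimate_and_margin(a, b):
    """Return (p, e) for the interval (a, b), where a = p - e and b = p + e."""
    p = (a + b) / 2  # midpoint of the interval
    e = (b - a) / 2  # half the width of the interval
    return p, e

# Blood pressure example: between 15 and 25 points.
print(point_estimate_and_margin(15, 25))
# Cure-rate example: between 70% and 90% of cases.
print(point_estimate_and_margin(0.70, 0.90))
```

The midpoints come out at the point estimates given in the examples: 20 points and 80%.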

Levels of confidence
- What does it mean to be 50% confident, or 95% confident?
- You are 50% confident if you think there is a 50% chance that you are correct (not good).
- You are 95% confident if you think there is a 95% chance that you are correct (pretty high).
- In real life you are seldom certain.
- Statistics is a rigorous way of dealing with uncertainty.

The key idea
To calculate a 95% confidence interval for a sample with mean m, standard deviation s and size n:
- Pretend your sample represents the population perfectly.
- Let $\bar{x}$ follow the sampling distribution for your variable and sample size. The sample size is n, the mean of $\bar{x}$ is m, and the standard deviation of $\bar{x}$ is $s/\sqrt{n}$.
- Find the central range that contains 95% of all possible samples: solve $P(\bar{x} < a) = 0.025$ and $P(\bar{x} < b) = 0.975$.
- Then $P(a < \bar{x} < b) = P(\bar{x} < b) - P(\bar{x} < a) = 0.975 - 0.025 = 0.95$.
- (a, b) is your 95% confidence interval.

Example 1
- Suppose we wanted a 95% confidence interval for male heights, and we took a random sample of 50 men. The sample average height was 69.2" with a sample standard deviation of 3.1".
- Related problem: population average is 69.2" and population standard deviation is 3.1".
- Let $\bar{x}$ follow the sampling distribution for heights with sample size 50. The mean of $\bar{x}$ is 69.2 and the standard deviation of $\bar{x}$ is $3.1/\sqrt{50}$.
- We will find a and b such that $P(\bar{x} < a) = 0.025$ and $P(\bar{x} < b) = 0.975$.
- Then $P(a < \bar{x} < b) = P(\bar{x} < b) - P(\bar{x} < a) = 0.975 - 0.025 = 0.95$.
- a = qnorm(0.025, 69.2, 3.1/sqrt(50)) = 68.34
- b = qnorm(0.975, 69.2, 3.1/sqrt(50)) = 70.06
- The 95% confidence interval is (68.34, 70.06).

Example 2
- Suppose we wanted a 90% confidence interval for male heights, and we took a random sample of 50 men. The sample average height was 69.2" with a sample standard deviation of 3.1".
- Related problem: population average is 69.2" and population standard deviation is 3.1".
- Let $\bar{x}$ follow the sampling distribution for heights with sample size 50. The mean of $\bar{x}$ is 69.2 and the standard deviation of $\bar{x}$ is $3.1/\sqrt{50}$.
- We will find a and b such that $P(\bar{x} < a) = 0.05$ and $P(\bar{x} < b) = 0.95$.
- Then $P(a < \bar{x} < b) = P(\bar{x} < b) - P(\bar{x} < a) = 0.95 - 0.05 = 0.90$.
- qnorm(c(0.05, 0.95), 69.2, 3.1/sqrt(50)) = 68.48, 69.92
- The 90% confidence interval is (68.48, 69.92).
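
The two height examples can be reproduced outside R. The sketch below is one possible translation, not the lecture's own code: Python's `statistics.NormalDist(...).inv_cdf` plays the role of R's `qnorm`, the function name `mean_confidence_interval` is hypothetical, and the sample size of 50 is an assumption inferred from the intervals quoted in the examples.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(m, s, n, confidence=0.95):
    """Central range of a normal sampling distribution with mean m and sd s / sqrt(n)."""
    sampling_dist = NormalDist(mu=m, sigma=s / sqrt(n))
    tail = (1 - confidence) / 2           # probability left in each tail
    a = sampling_dist.inv_cdf(tail)       # analogue of qnorm(tail, m, s / sqrt(n))
    b = sampling_dist.inv_cdf(1 - tail)   # analogue of qnorm(1 - tail, m, s / sqrt(n))
    return a, b

# n = 50 is an assumed sample size, inferred from the quoted intervals.
for level in (0.95, 0.90):
    a, b = mean_confidence_interval(69.2, 3.1, 50, level)
    print(f"{level:.0%} CI: ({a:.2f}, {b:.2f})")
```

Changing only the `confidence` argument moves both endpoints symmetrically around the sample mean of 69.2.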

Compare confidence intervals
- Two confidence intervals for heights:
  95% confidence interval: (68.34, 70.06)
  90% confidence interval: (68.48, 69.92)
- The sample average 69.2" is right in the middle of both.
- 95% margin of error: e = 69.2 - 68.34 = 70.06 - 69.2 = 0.86
- 90% margin of error: e = 69.2 - 68.48 = 69.92 - 69.2 = 0.72
- The 95% interval is wider, because we are more confident of a less precise statement. We are less confident of a more precise statement.

Convenient approximation for 95% CI
- If your sample has mean m, standard deviation s and size n, an approximation frequently used for the 95% confidence interval is $\left(m - 2\frac{s}{\sqrt{n}},\ m + 2\frac{s}{\sqrt{n}}\right)$.
- In the previous example, that would be $\left(69.2 - 2\frac{3.1}{\sqrt{50}},\ 69.2 + 2\frac{3.1}{\sqrt{50}}\right) = (68.32, 70.08)$.
- The actual answer was (68.34, 70.06).

Sample sizes
- Picking the sample size is an important question for researchers at the beginning of a project.
- Too small a sample, and your research might not be significant.
- Too big a sample, and your research might be too expensive.

Sample sizes
- Suppose you wanted to estimate a quantitative variable like men's heights with a margin of error of e = 0.1 with 95% confidence. How big a sample do you need?
- Let n be the sample size and s the standard deviation of the sample we will measure.
- We have to guess or estimate s to find n. We guess s = 3. Sometimes researchers do a small study to estimate s with a small (cheap) sample.
- Let z solve P(Z < z) = 0.975: z = qnorm(0.975,0,1) = 1.96
- The margin of error is $e = z\,s/\sqrt{n}$, so $n = \left(\frac{sz}{e}\right)^2 = 3457.31$.
- Minimal sample size is 3458.
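
The sample-size calculation above can be wrapped in a short function. This is a minimal sketch under the same assumptions as the slide (a normal sampling distribution and a guessed standard deviation s); the name `sample_size_for_mean` is hypothetical, and `NormalDist().inv_cdf` stands in for `qnorm(..., 0, 1)`.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(s, e, confidence=0.95):
    """Smallest n with z * s / sqrt(n) <= e, where s is a guessed standard deviation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # analogue of qnorm(0.975, 0, 1) for 95%
    return ceil((s * z / e) ** 2)                       # round up to a whole subject

# Guess s = 3 and ask for a margin of error of 0.1 with 95% confidence.
print(sample_size_for_mean(3, 0.1, 0.95))
```

Because n grows with the square of s/e, halving the margin of error roughly quadruples the required sample.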

Sample sizes, the sequel
- Suppose you wanted to estimate a quantitative variable like men's heights with a margin of error of e = 0.2 with 90% confidence. How big a sample do you need?
- Let n be the sample size and s the standard deviation of the sample we will measure. Assume s = 3.
- Let z solve P(Z < z) = 0.95: z = qnorm(0.95,0,1) = 1.645
- Then $n = \left(\frac{sz}{e}\right)^2 = 608.75$.
- Minimal sample size is 609.

The key idea
To calculate a 95% confidence interval for a sample with proportion p and size n:
- Pretend your sample represents the population perfectly.
- Let $\hat{p}$ follow the sampling distribution for your variable and sample size. The sample size is n, the mean of $\hat{p}$ is p, and the standard deviation of $\hat{p}$ is $\sqrt{\frac{p(1-p)}{n}}$.
- Find the central range that contains 95% of all possible samples: solve $P(\hat{p} < a) = 0.025$ and $P(\hat{p} < b) = 0.975$.
- Then (a, b) is your 95% confidence interval.

Example 1
- Suppose we wanted a 95% confidence interval for voters' preferences, and we took a random sample of 50 voters. 61% wanted Congress to pass a health plan (CBS radio news, Sunday, October 18, 2009).
- Related problem: population proportion is 0.61.
- Let $\hat{p}$ follow the sampling distribution for voter preferences with sample size 50. The mean of $\hat{p}$ is 0.61 and the standard deviation of $\hat{p}$ is $\sqrt{\frac{0.61 \times 0.39}{50}} = 0.069$.
- We will find a and b such that $P(\hat{p} < a) = 0.025$ and $P(\hat{p} < b) = 0.975$.
- Then $P(a < \hat{p} < b) = P(\hat{p} < b) - P(\hat{p} < a) = 0.975 - 0.025 = 0.95$.
- a = qnorm(0.025, 0.61, sqrt(.61*.39/50)) = 0.47
- b = qnorm(0.975, 0.61, sqrt(.61*.39/50)) = 0.75
- The 95% confidence interval is (0.47, 0.75).

Example 2
- Suppose we wanted a 90% confidence interval for voters' preferences, and we took a random sample of 50 voters. 61% wanted Congress to pass a health plan.
- Related problem: population proportion is 0.61.
- Let $\hat{p}$ follow the sampling distribution for voter preferences with sample size 50. The mean of $\hat{p}$ is 0.61 and the standard deviation of $\hat{p}$ is $\sqrt{\frac{0.61 \times 0.39}{50}} = 0.069$.
- We will find a and b such that $P(\hat{p} < a) = 0.05$ and $P(\hat{p} < b) = 0.95$.
- Then $P(a < \hat{p} < b) = P(\hat{p} < b) - P(\hat{p} < a) = 0.95 - 0.05 = 0.90$.
- a = qnorm(0.05, 0.61, sqrt(.61*.39/50)) = 0.50
- b = qnorm(0.95, 0.61, sqrt(.61*.39/50)) = 0.72
- The 90% confidence interval is (0.50, 0.72).
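
The two voter examples follow the same recipe as the height examples, with the standard deviation replaced by sqrt(p(1 - p)/n). The sketch below is an illustrative translation rather than the lecture's code: `NormalDist(...).inv_cdf` stands in for `qnorm`, the function name is hypothetical, and the sample size of 50 is an assumption inferred from the quoted standard deviation of 0.069.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(p, n, confidence=0.95):
    """Central range of a normal sampling distribution with mean p and sd sqrt(p(1-p)/n)."""
    sampling_dist = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    tail = (1 - confidence) / 2  # probability left in each tail
    return sampling_dist.inv_cdf(tail), sampling_dist.inv_cdf(1 - tail)

# n = 50 is an assumed sample size, inferred from the quoted standard deviation.
for level in (0.95, 0.90):
    a, b = proportion_confidence_interval(0.61, 50, level)
    print(f"{level:.0%} CI: ({a:.2f}, {b:.2f})")
```

With so small a sample the intervals are wide: even the 90% interval spans more than twenty percentage points.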

Compare confidence intervals
- Two confidence intervals for voter preferences:
  95% confidence interval: (0.47, 0.75)
  90% confidence interval: (0.50, 0.72)
- The sample proportion 0.61 is right in the middle of both.
- 95% margin of error: e = 0.61 - 0.47 = 0.75 - 0.61 = 0.14
- 90% margin of error: e = 0.61 - 0.50 = 0.72 - 0.61 = 0.11
- The 95% interval is wider, because we are more confident of a less precise statement. We are less confident of a more precise statement.

Convenient approximation for 95% CI
- If your sample has proportion p and size n, an approximation frequently used for the 95% confidence interval is $\left(p - 2\sqrt{\frac{p(1-p)}{n}},\ p + 2\sqrt{\frac{p(1-p)}{n}}\right)$.
- In the previous example, that would be $\left(0.61 - 2\sqrt{\frac{0.61 \times 0.39}{50}},\ 0.61 + 2\sqrt{\frac{0.61 \times 0.39}{50}}\right) = (0.47, 0.75)$.
- The actual answer was the same.

Sample sizes
- Suppose you wanted to estimate a categorical variable like voter preferences with a margin of error of e = 0.03 with 95% confidence. How big a sample do you need?
- Let n be the sample size.
- Let z solve P(Z < z) = 0.975: z = qnorm(0.975,0,1) = 1.96
- The margin of error $e = z\sqrt{\frac{p(1-p)}{n}}$ is largest when p = 0.5, where it equals $\frac{z}{2\sqrt{n}}$. Then $n = \left(\frac{z}{2e}\right)^2 = 1067.11$.
- Minimal sample size is 1068.
- Statisticians often replace 1.96 with 2 and calculate the sample size for a 95% CI with margin of error e to be $n = \frac{1}{e^2}$.
- A margin of error of 3% then requires sample size $\frac{1}{0.03^2} = 1111.11$, so 1112.

Sample sizes
- Suppose you wanted to estimate a categorical variable like voter preferences with a margin of error of e = 0.03 with 90% confidence. How big a sample do you need?
- Let n be the sample size.
- Let z solve P(Z < z) = 0.95: z = qnorm(0.95,0,1) = 1.64
- Then $n = \left(\frac{z}{2e}\right)^2 = 747.11$.
- Minimal sample size is 748. With the unrounded z = 1.645 the formula gives about 751.5, so 752 is the safer choice.
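
The categorical sample-size formula can be sketched the same way. The code assumes the worst case p = 0.5, which is where p(1 - p) is largest and which gives n = (z / 2e)^2. The function name is hypothetical, and `NormalDist().inv_cdf` stands in for `qnorm(..., 0, 1)`. Because it uses the unrounded z, the 90% result is slightly larger than the value obtained with z rounded to 1.64.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(e, confidence=0.95):
    """Smallest n with z / (2 * sqrt(n)) <= e, the worst case p = 0.5."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # unrounded normal quantile
    return ceil((z / (2 * e)) ** 2)

def rule_of_thumb_size(e):
    """Quick 95% rule n = 1 / e^2, obtained by replacing 1.96 with 2."""
    return ceil(1 / e ** 2)

print(sample_size_for_proportion(0.03, 0.95))
print(sample_size_for_proportion(0.03, 0.90))
print(rule_of_thumb_size(0.03))
```

The rule of thumb always asks for a slightly larger sample than the exact 95% calculation, so it errs on the safe side.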

Why the lecture is better than the textbook
- The text often assumes you know the standard deviation of the population, but only the mean of the sample.
- The lecture used both the mean and the standard deviation derived from the sample, which is more realistic.