Central Limit Theorem Confidence Intervals Worked example #6 July 24, 2017
[Figure: histogram of raw scores in 10-point bins (0-10 through 90+). Mean = 71.4%]
Scaling is to add 3.6% to bring the mean to 75%.
[Figure: histogram of scaled scores in 10-point bins (0-10 through 90+). Mean = 75%]
[Figure: individual scores listed by numeric code and plotted on a 0.4 to 1.0 scale with letter-grade bands A, B, C, D, F. Mean = 71.4% (add 3.6%); scaled mean = 75%]
Let's return to the Normal distribution
[Figure: normal density curve f(x)]
The area under the curve between two x values represents how much of the data is in that range.
[Figure: normal curve with the area between points A and B shaded]
The integral has no closed-form solution, but numerical methods were used to create tables of areas under the standard normal curve from A to B.
Values in the table are areas (α values).
Z_α is the value of Z such that there is α area to the left.
Z_{1-α} is the value of Z such that there is α area to the right.
Z_{1-α/2} is the value of Z such that there is α/2 area to the right.
Ex: Z_{1-0.05/2} = Z_{0.975} = 1.96
Beware how different texts and tables use α, 1-α, α/2 and 1-α/2.
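The table lookups described above can also be reproduced numerically. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `area_between` is made up for illustration and is not part of the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_between(a, b):
    """Area under the standard normal curve between a and b (hypothetical helper)."""
    return std_normal.cdf(b) - std_normal.cdf(a)

# Z_{1-alpha/2}: the Z value with 1 - alpha/2 of the area to its left,
# which leaves alpha/2 of the area to its right.
alpha = 0.05
z_crit = std_normal.inv_cdf(1 - alpha / 2)

print(round(z_crit, 2))                     # 1.96
print(round(area_between(-1.96, 1.96), 3))  # 0.95
```

Here `cdf` plays the role of a table of areas to the left, and `inv_cdf` reads the same table in reverse.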
Why do we care about areas under the Normal distribution?
(1) Many populations exhibit a normal distribution.
Calculating areas allows us to predict how much of the population data is in a region if we have a sample mean and variance. - Lab #8
Recall: we transformed any normal distribution into the standard normal with 2 steps: (1) subtract the mean, (2) divide by the standard deviation.
Consider individual data points:
Z-score of x = Z = (x - μ)/σ
The Z-score allows comparison or standardization of data sets with different means and standard deviations.
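As a small illustration of the two steps, the sketch below standardizes two made-up data sets that sit on very different scales. The data values and the function name `z_scores` are assumptions for the example. Each data set is treated as a whole population, so the population standard deviation (`pstdev`) is used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    m = mean(values)
    s = pstdev(values)  # assumption: the values are the whole population
    return [(x - m) / s for x in values]

# Two illustrative data sets with different means and spreads (not lecture data).
test_a = [55, 60, 65, 70, 75]
test_b = [510, 520, 530, 540, 550]

print([round(z, 2) for z in z_scores(test_a)])
print([round(z, 2) for z in z_scores(test_b)])
```

Both lists come out the same after standardizing, which is what makes scores from different tests comparable.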
Example, IQ: How to compare IQ test results from 2008 with 1996?
Get test scores
Compute Z-scores
Compute an IQ score: IQ = 100 + 15Z
Mean IQ = 100, standard deviation = 15
~ 68% of the population: 85-115
~ 13% of the population: 115-130
~ 2.5% of the population: >130
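The rounded population shares quoted above can be checked against the normal curve. This is a sketch using the standard-library `NormalDist`; the helper names `iq_from_z` and `share_between` are hypothetical.

```python
from statistics import NormalDist

# IQ scale as defined on the slide: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

def iq_from_z(z):
    """Convert a Z-score to an IQ score: IQ = 100 + 15Z."""
    return 100 + 15 * z

def share_between(lo, hi):
    """Fraction of the population with IQ between lo and hi."""
    return iq.cdf(hi) - iq.cdf(lo)

print(iq_from_z(1.0))                     # 115.0
print(round(share_between(85, 115), 3))   # within 1 SD of the mean
print(round(share_between(115, 130), 3))  # between 1 and 2 SD above the mean
print(round(1 - iq.cdf(130), 3))          # more than 2 SD above the mean
```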
Why do we care about areas under the Normal distribution?
(2) Central Limit Theorem: for large sample sizes the distribution of sample means will be approximately normal, no matter what the actual population distribution is. - Lab #8
[Figure: distribution of data next to distribution of means]
If we study means we can use the normal distribution.
Central limit theorem
For large sample sizes the distribution of sample means will be approximately normal, no matter what the actual population distribution is.
Sample means ~ N(μ, σ²/n)
Notice: the standard deviation of the sample means is σ/√n.
Note: the variance of the distribution of sample means depends on the size of the sample.
Note: the ranges around the mean are confidence intervals; we are confident the true population mean lies within a region around our sample mean. A bigger region means more confidence but less utility.
Central limit theorem
For large sample sizes the distribution of sample means will be approximately normal, no matter what the actual population distribution is.
[Figure: distribution of the sample means, with the pop. mean marked]
Central limit theorem
The variance of the distribution of sample means depends on the size of the sample and the population variance:
Variance of the sample means = σ²/n
which means that
Standard error of the sample mean = SE = σ/√n
[Figure: distribution of means from samples with many values (narrower) and distribution of means from samples with few values (wider)]
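A quick simulation can make the σ/√n relationship concrete. The sketch below is an illustration, not part of the lecture or Lab #8. It assumes a skewed exponential population (mean 1, standard deviation 1), a fixed random seed, and a made-up helper `sample_means`.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, trials=5000):
    """Means of `trials` samples of size n from a skewed (exponential) population."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

# Exponential(rate=1) has population mean 1 and population standard deviation 1.
sigma = 1.0
for n in (4, 25, 100):
    means = sample_means(n)
    observed = stdev(means)        # spread of the simulated sample means
    predicted = sigma / n ** 0.5   # standard error: sigma / sqrt(n)
    print(n, round(observed, 3), round(predicted, 3))
```

The observed spread of the sample means should track the predicted standard error closely, and it shrinks as the sample size grows even though the population itself is far from normal.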
Central limit theorem
The range around the sample mean is a confidence interval. We are x% confident the true population mean lies within a region around our sample mean that includes x% of the distribution.
[Figure: two distributions of sample means. Wide range: very confident the pop. mean is somewhere in this range. Narrow range: less confident the pop. mean is somewhere in this range.]
There is a tradeoff between certainty and utility.
Standard deviation: measures the spread of sample or population data values.
Standard error: measures the spread of sample means, and so of the potential values of the population mean.
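The meaning of "x% confident" can be illustrated by repeated sampling: build the interval many times and count how often it contains the population mean. This is a sketch under stated assumptions. The population values (μ = 50, σ = 10), the sample size, the seed and the helper `coverage` are all made up for the example.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA, N = 50.0, 10.0, 20   # illustrative population parameters and sample size
SE = SIGMA / N ** 0.5           # standard error of the sample mean

def coverage(z, trials=4000):
    """Fraction of samples whose interval xbar +/- z*SE contains the population mean."""
    hits = 0
    for _ in range(trials):
        xbar = mean(random.gauss(MU, SIGMA) for _ in range(N))
        if xbar - z * SE <= MU <= xbar + z * SE:
            hits += 1
    return hits / trials

# Wider intervals (larger z) capture the population mean more often.
for z in (1.0, 1.96, 2.575):
    print(z, round(coverage(z), 3))
```

The hit rates should come out near 68%, 95% and 99%, which shows the tradeoff: a wider region is right more often but says less.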
Central limit theorem
We are x% confident the true population mean lies within a region around our sample mean that includes x% of the distribution.
The distribution is a normal distribution and we know how to calculate what % of the area lies within a range defined by Z scores.
Using the Normal distribution allows calculation of confidence intervals and quantitative statements about the population mean from sample data.
Since we usually don't know σ, we have to estimate it from s. This means we can't just use Z scores; we have to use the t distribution (which includes the uncertainty in our estimation of σ via s).
[Figure: Z distribution (if we knew σ, we could use this) and t distribution (since σ is unknown, we have to use this)]
Central limit theorem
We typically don't know μ or σ (if we did we wouldn't need a confidence interval because we would know μ), so we use x̄ to estimate μ and s to estimate σ.
Sample means ~ N(x̄, σ²/n) (if σ is known, μ unknown)

Region                        Degree of confidence that pop. mean is in this region
x̄ ± Z_{1-0.32/2} √(σ²/n)      68%
x̄ ± Z_{1-0.05/2} √(σ²/n)      95%
x̄ ± Z_{1-0.01/2} √(σ²/n)      99%

Z_{1-0.32/2} ≈ 1, Z_{1-0.05/2} = 1.96, Z_{1-0.01/2} = 2.575
Note: We use values from a Z distribution since σ is known.
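The three regions follow one recipe: pick α, look up Z_{1-α/2}, and multiply by the standard error. Here is a minimal sketch of that recipe, assuming the standard-library `NormalDist` in place of a printed Z table. The function name `z_interval` and the numbers (x̄ = 50, σ = 10, n = 25) are illustrative only.

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1-alpha/2}
    half_width = z * sigma / n ** 0.5         # z * sqrt(sigma^2 / n)
    return xbar - half_width, xbar + half_width

# Illustrative values (not from the lecture): SE = 10 / sqrt(25) = 2.
for conf in (0.68, 0.95, 0.99):
    lo, hi = z_interval(50.0, 10.0, 25, conf)
    print(conf, round(lo, 2), round(hi, 2))
```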
Central limit theorem
We typically don't know μ or σ (if we did we wouldn't need a confidence interval because we would know μ), so we use x̄ to estimate μ and s to estimate σ.
Sample means ~ t(x̄, s²/n) (if σ is unknown, μ unknown)

Region                           Degree of confidence that pop. mean is in this region
x̄ ± t_{1-0.32/2,df} √(s²/n)      68%
x̄ ± t_{1-0.05/2,df} √(s²/n)      95%
x̄ ± t_{1-0.01/2,df} √(s²/n)      99%

t_{1-0.32/2,df}, t_{1-0.05/2,df} and t_{1-0.01/2,df} all vary with df.
Note: We have to use values from a t distribution since σ is unknown. The t distribution is a bit wider than the Z distribution to include our uncertainty in estimating σ with s.
Central limit theorem
μ known, σ known: no sampling needed
μ unknown, σ known: use Z distribution
μ unknown, σ unknown: use t distribution
Note: As the sample size increases the t distribution becomes the Z distribution. Some naughty people forget the t distribution even exists...
Caution: Don't lose sight of the goal, which is to describe a region within which we are confident the population mean lies.
Central limit theorem
The t distribution includes the uncertainty in our estimate of σ and is wider.
Using the t distribution also requires us to specify the degrees of freedom (df) of the data, in this case df = n - 1.
Notation: t_{α,df}; you may also see t_{α/2,df} and t_{1-α/2,df} in texts and tables.
For our t table, α refers to the area to the right (our Z table shows area to the left).
Tables often require interpolation if the sample size is not listed.
As df increases the t distribution becomes the Z distribution.
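To see how t values depend on df and approach the Z value, the sketch below computes them numerically with only the standard library. This is an illustration, not how the course's t table was produced. It integrates the Student's t density with Simpson's rule and inverts by bisection, and the function names are made up for the example.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def t_cdf(x, df, steps=1000):
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_crit(right_tail_area, df):
    """t value with the given area to its right (area must be below 0.5)."""
    target = 1 - right_tail_area
    lo, hi = 0.0, 100.0
    for _ in range(40):  # bisection: halve the bracket each pass
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (11, 19, 30, 100, 1000):
    print(df, round(t_crit(0.025, df), 3))
print("Z", round(NormalDist().inv_cdf(0.975), 3))
```

Like the t table described above, `t_crit` takes the area to the right, so a 95% interval uses `t_crit(0.025, df)`.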
Central limit theorem
Example:
(1) σ known = 10: use Z distribution. Sample size = 20, df = n/a.
95% CI (confidence interval): from -1.96 SE to +1.96 SE around the sample mean. We are 95% confident the pop. mean is somewhere in this range.
(2) σ unknown: use t distribution. s = 10, sample size = 20, df = 20 - 1 = 19.
95% CI: from -2.093 SE to +2.093 SE around the sample mean.
(3) σ unknown: use t distribution. s = 10, sample size = 101, df = 101 - 1 = 100.
95% CI: from -1.984 SE to +1.984 SE around the sample mean.
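Putting numbers on the three cases shows how much each choice changes the interval. This is a short sketch using the critical values quoted in the example. The helper name `half_width` is hypothetical, and the SE is computed as 10/√n in every case.

```python
def half_width(critical_value, sd, n):
    """Half-width of the interval: critical value times the standard error."""
    return critical_value * sd / n ** 0.5

sd = 10.0  # sigma in case (1), s in cases (2) and (3)

cases = [
    ("(1) sigma known, n = 20", 1.96, 20),
    ("(2) sigma unknown, n = 20", 2.093, 20),
    ("(3) sigma unknown, n = 101", 1.984, 101),
]
for label, crit, n in cases:
    se = sd / n ** 0.5
    print(label, "SE =", round(se, 3), "half-width =", round(half_width(crit, sd, n), 2))
```

With the same sample size, switching from Z to t widens the interval only slightly (4.38 to 4.68). Increasing n from 20 to 101 narrows it far more (to 1.97), mostly through the smaller SE.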
HANDOUT #6
Consider a medication that may increase the clotting time of patients taking it. We need reliable data on the usual clotting time (CT) of individuals not taking the medication in order to determine whether the medication is effective. We are unable to regularly measure the CT of every individual in the population of people not taking the medication, so we will estimate the population parameter, mean clotting time (CT), with a sample from the general population.
For the first set of questions we have access to detailed physiological data from the NIH and we know that the population standard deviation of CT values in humans is 3.2 seconds.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which we are 95% confident that the true population mean CT lies?
What is the region in which we are 99% confident that the true population mean CT lies?
If we have a larger sample: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which we are 95% confident that the true population mean CT lies?
HANDOUT #6
Answers, first set of questions (population standard deviation known, σ = 3.2 seconds).
Note: we are interested in α/2 area to the right, but our Z table shows 1-α/2 area to the left.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19. Sample mean x̄ = 20, n = 12.
95% region:
Z_{1-α/2} = Z_{1-0.05/2} = Z_{0.975} = 1.96 from our Z table
x̄ ± 1.96(σ/√n) = 20 ± 1.96(3.2/√12) = 20 ± 1.81, {18.19, 21.81}
99% region:
Z_{1-α/2} = Z_{1-0.01/2} = Z_{0.995} = 2.575 from our Z table
x̄ ± 2.575(σ/√n) = 20 ± 2.575(3.2/√12) = 20 ± 2.38, {17.62, 22.38}
Larger sample (the same 12 values three times over, n = 36, sample mean x̄ = 20), 95% region:
Z_{1-α/2} = Z_{1-0.05/2} = Z_{0.975} = 1.96 from our Z table
x̄ ± 1.96(σ/√n) = 20 ± 1.96(3.2/√36) = 20 ± 1.05, {18.95, 21.05}
Note: with a larger sample size the confidence interval for the same degree of confidence is narrower because the SE is smaller.
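The handout arithmetic for the known-σ case can be checked in a few lines. This is a sketch using the Z table values quoted above (1.96 and 2.575); the function name `z_ci` is made up for the example.

```python
from statistics import mean

SIGMA = 3.2  # population standard deviation of CT, given in the handout
small = [18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19]
large = small * 3  # the larger sample repeats the same 12 values three times

def z_ci(data, sigma, z):
    """Interval xbar +/- z * sigma / sqrt(n), rounded to 2 decimals."""
    xbar = mean(data)
    half = z * sigma / len(data) ** 0.5
    return round(xbar - half, 2), round(xbar + half, 2)

print(z_ci(small, SIGMA, 1.96))    # (18.19, 21.81)
print(z_ci(small, SIGMA, 2.575))   # (17.62, 22.38)
print(z_ci(large, SIGMA, 1.96))    # (18.95, 21.05)
```

Tripling n divides the SE by √3, which is why the 95% half-width drops from 1.81 to 1.05.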
HANDOUT #6
For the second set of questions we consider the more realistic situation in which we don't know the population standard deviation and have to estimate it from the sample data.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which we are 95% confident that the true population mean CT lies?
What is the region in which we are 99% confident that the true population mean CT lies?
If we have a larger sample: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which we are 95% confident that the true population mean CT lies?
HANDOUT #6
Answers, second set of questions (population standard deviation unknown, estimated from the sample).
Note: instead of showing 1-α area to the left like the Z table does, our t table shows values for α area to the right.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19. Sample mean x̄ = 20, sample s = 3.133, n = 12, df = 11.
95% region:
t_{α/2,df} = t_{0.05/2,11} = t_{0.025,11} = 2.201 from our t table
x̄ ± 2.201(s/√n) = 20 ± 2.201(3.133/√12) = 20 ± 1.99, {18.01, 21.99}
99% region:
t_{α/2,df} = t_{0.01/2,11} = t_{0.005,11} = 3.106 from our t table
x̄ ± 3.106(s/√n) = 20 ± 3.106(3.133/√12) = 20 ± 2.81, {17.19, 22.81}
Larger sample (the same 12 values three times over, n = 36, df = 35, sample mean x̄ = 20, sample s = 3.04), 95% region:
Note: our t table does not have an entry for df = 35, so we round down and use the df = 30 value from our table.
t_{α/2,df} = t_{0.05/2,35} ≈ t_{0.025,30} = 2.042 from our t table
x̄ ± 2.042(s/√n) = 20 ± 2.042(3.04/√36) = 20 ± 1.04, {18.96, 21.04}
Note: the SD is also smaller for the larger sample (3.04 vs. 3.133) because of the relatively larger denominator: the sum of squared deviations triples, but n-1 grows from 11 to 35. As n increases, n-1 approaches n.
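The unknown-σ answers can be checked the same way. This sketch estimates s with the standard library's `stdev` (which divides by n-1) and plugs in the t table values quoted above (2.201, 3.106 and the df = 30 value 2.042). The function name `t_ci` is made up for the example.

```python
from statistics import mean, stdev

small = [18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19]
large = small * 3  # the larger sample repeats the same 12 values three times

def t_ci(data, t_value):
    """Interval xbar +/- t * s / sqrt(n), with s estimated from the data."""
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    half = t_value * s / len(data) ** 0.5
    return round(xbar - half, 2), round(xbar + half, 2)

print(round(stdev(small), 3), round(stdev(large), 2))  # 3.133 3.04
print(t_ci(small, 2.201))   # (18.01, 21.99)
print(t_ci(small, 3.106))   # (17.19, 22.81)
print(t_ci(large, 2.042))   # (18.96, 21.04)
```

Because the code keeps s unrounded (about 3.0426 for the larger sample), the last half-width comes out as 1.04, matching the handout.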