Central Limit Theorem Confidence Intervals Worked example #6. July 24, 2017

Shavonne Little
1 Central Limit Theorem Confidence Intervals Worked example #6 July 24, 2017

2 [Figure: histograms of raw scores (mean = 71.4%) and scaled scores (mean = 75%). Scaling adds 3.6% to each score to bring the mean to 75%.]

3 [Figure: score distribution with letter-grade bands A, B, C, D, F. Raw mean = 71.4%; adding 3.6% gives a scaled mean of 75%.]

4 Let's return to the Normal distribution f(x)

5 The area under the curve between two x values represents how much of the data lies in that range. [Figure: normal curve with the area between A and B shaded.] The integral has no closed-form solution, but numerical methods were used to create tables of areas under the standard normal curve from A to B.
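To make the idea of "area between A and B" concrete, here is a minimal sketch using only the Python standard library. The helper names `area_between` and `trapezoid_area` are hypothetical, introduced only for illustration; the second one mimics, in a simple way, the kind of numerical method that printed tables rely on.

```python
from statistics import NormalDist

def area_between(a: float, b: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Area under a normal curve between a and b, using the library's CDF."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(b) - dist.cdf(a)

def trapezoid_area(a: float, b: float, steps: int = 10_000) -> float:
    """Same area for the standard normal, approximated with the trapezoid rule."""
    dist = NormalDist()
    width = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * width)
    return total * width

# Fraction of a standard normal population between -1.96 and +1.96
print(round(area_between(-1.96, 1.96), 4))
# The numerical approximation should agree closely with the CDF-based value
print(round(trapezoid_area(-1.96, 1.96), 4))
```

Both approaches give the same area to several decimal places, which is why a table of precomputed values is enough for hand calculations.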

6 Values in the table are α values.
$Z_{\alpha}$ is the value of Z such that there is α area to the left.
$Z_{1-\alpha}$ is the value of Z such that there is α area to the right.
$Z_{1-\alpha/2}$ is the value of Z such that there is α/2 area to the right.
Ex: $Z_{1-0.05/2} = Z_{0.975} = 1.96$
Beware how different texts and tables use α, 1-α, α/2 and 1-α/2.

7 Why do we care about areas under the Normal distribution? (1) Many populations exhibit a normal distribution. Calculating areas allows us to predict how much of the population data is in a region if we have a sample mean and variance (Lab #8).

8 Recall: we transformed any normal distribution into the standard normal in 2 steps: (1) subtract the mean, (2) divide by the standard deviation. For individual data points, the Z-score of x is $Z = (x - \mu)/\sigma$. The Z-score allows comparison or standardization of data sets with different means and standard deviations.

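9 Example, IQ: how do we compare IQ test results from 2008 with those from 1996? Get the test scores, compute Z-scores, then compute an IQ score: $IQ = 100 + 15Z$ (mean IQ = 100, standard deviation = 15). [Figure: normal curve of IQ scores; ~68% of the population lies within one standard deviation of the mean (85 to 115), ~13.5% lies between 115 and 130, and ~2.5% lies above 130.]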
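The standardization steps can be sketched as follows. This is an illustration, not the slides' own procedure: the raw scores are made up, the helper names `z_scores` and `to_iq` are hypothetical, and the rescaling assumes IQ = 100 + 15·Z, which follows from the stated mean of 100 and standard deviation of 15.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize: subtract the mean, then divide by the standard deviation.

    Assumption: the values are treated as a whole population (pstdev).
    """
    m = mean(values)
    s = pstdev(values)
    return [(x - m) / s for x in values]

def to_iq(z, iq_mean=100.0, iq_sd=15.0):
    """Rescale a Z-score onto a scale with the given mean and SD."""
    return iq_mean + iq_sd * z

# Hypothetical raw test scores from one year's test
raw_scores = [52, 60, 68, 76, 84]
for raw, z in zip(raw_scores, z_scores(raw_scores)):
    print(raw, round(z, 2), round(to_iq(z), 1))
```

Because each year's raw scores are standardized against that year's own mean and standard deviation, the resulting IQ values are on a common scale and can be compared across years.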

10 Why do we care about areas under the Normal distribution? (2) Central Limit Theorem: for a large sample size, the distribution of sample means will be approximately normal, no matter what the actual population distribution is (Lab #8). [Figure: distribution of data versus distribution of means.] If we study means, we can use the normal distribution.

11 Central limit theorem: for a large sample size, the distribution of sample means will be approximately normal, no matter what the actual population distribution is. Sample means ~ $N(\mu, \sigma^2/n)$. Notice: the standard deviation of the sample means is $\sigma/\sqrt{n}$. Note: the variance of the distribution of sample means depends on the size of the sample. Note: the ranges around the mean are confidence intervals; we are confident the true population mean lies within a region around our sample mean. A bigger region means more confidence but less utility.
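A small simulation can illustrate the claim. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), a sample size of 50, and 2000 repeated samples; all of these choices, and the helper name `sample_means`, are illustrative assumptions rather than anything from the slides.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_means(n, trials, rng):
    """Means of `trials` independent samples of size n from an exponential population."""
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50
means = sample_means(n=n, trials=2000, rng=rng)

# The population has mean 1 and SD 1, so theory predicts the sample means
# to center near 1 with a spread near 1/sqrt(n).
print("mean of sample means:", round(mean(means), 3))
print("SD of sample means:  ", round(stdev(means), 3))
print("sigma / sqrt(n):     ", round(1.0 / sqrt(n), 3))
```

Even though individual exponential values are far from normally distributed, the sample means cluster symmetrically around the population mean with a spread close to σ/√n.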

12 Central limit theorem: for a large sample size, the distribution of sample means will be approximately normal, no matter what the actual population distribution is. [Figure: distribution of the sample means, centered on the population mean.]

13 Central limit theorem: the variance of the distribution of sample means depends on the size of the sample and the population variance, $\sigma_{\bar{X}}^2 = \sigma^2/n$, which means that the standard error of the sample mean is $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. [Figure: distribution of means from samples with many values (narrow) versus samples with few values (wide).]

14 Central limit theorem: the range around the sample mean is a confidence interval. We are x% confident the true population mean lies within a region around our sample mean that includes x% of the distribution. [Figure: two distributions of sample means. Wide range: very confident the pop. mean is somewhere in this range. Narrow range: less confident the pop. mean is somewhere in this range.] There is a tradeoff between certainty and utility. Standard deviation: measures the spread of sample or population data values. Standard error: measures the spread of potential estimates of the population mean.

15 Central limit theorem: we are x% confident the true population mean lies within a region around our sample mean that includes x% of the distribution. The distribution is a normal distribution, and we know how to calculate what % of the area lies within a range defined by Z scores. Using the Normal distribution allows calculation of confidence intervals and quantitative statements about the population mean from sample data. Since we usually don't know σ, we have to estimate it from s. This means we can't just use Z scores; we have to use the t distribution (which includes the uncertainty in our estimation of σ via s). [Figure: if we knew σ, we could use the Z distribution; since σ is unknown, we have to use the t distribution.]

16 Central limit theorem: we typically don't know μ or σ (if we did, we wouldn't need a confidence interval because we would know μ), so we use $\bar{X}$ to estimate μ and s to estimate σ.
Sample means ~ $N(\bar{X}, \sigma^2/n)$ (if σ is known, μ unknown)
Region                                        Degree of confidence that pop. mean is in this region   Critical value
$\bar{x} \pm Z_{\alpha/2}\sqrt{\sigma^2/n}$   68%                                                     $Z_{\alpha/2} = 1$
$\bar{x} \pm Z_{\alpha/2}\sqrt{\sigma^2/n}$   95%                                                     $Z_{\alpha/2} = 1.96$
$\bar{x} \pm Z_{\alpha/2}\sqrt{\sigma^2/n}$   99%                                                     $Z_{\alpha/2} = 2.575$
Note: we use values from a Z distribution since σ is known.
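The critical values for the three confidence levels can be reproduced without a printed table. This is a small sketch using the standard library's `statistics.NormalDist`; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    """Two-sided critical value: leaves (1 - confidence)/2 area in each tail."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.68, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_critical(level):.3f}")
```

The computed values may differ slightly from table-derived ones in the last digit (for example, the 99% value is closer to 2.576 than to 2.575), which has a negligible effect on the resulting interval.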

17 Central limit theorem: we typically don't know μ or σ (if we did, we wouldn't need a confidence interval because we would know μ), so we use $\bar{X}$ to estimate μ and s to estimate σ.
Sample means ~ $t(\bar{x}, s^2/n)$ (if σ is unknown, μ unknown)
Region                                      Degree of confidence that pop. mean is in this region   Critical value
$\bar{x} \pm t_{\alpha/2,df}\sqrt{s^2/n}$   68%                                                     $t_{\alpha/2,df}$ = varies
$\bar{x} \pm t_{\alpha/2,df}\sqrt{s^2/n}$   95%                                                     $t_{\alpha/2,df}$ = varies
$\bar{x} \pm t_{\alpha/2,df}\sqrt{s^2/n}$   99%                                                     $t_{\alpha/2,df}$ = varies
Note: we have to use values from a t distribution since σ is unknown. The t distribution is a bit wider than the Z distribution to include our uncertainty in estimating σ with s.

18 Central limit theorem
μ known, σ known: no sampling needed
μ unknown, σ known: use Z distribution
μ unknown, σ unknown: use t distribution
Note: as the sample size increases, the t distribution becomes the Z distribution. Some naughty people forget the t distribution even exists...
Caution: don't lose sight of the goal, which is to describe a region within which we are confident the population mean lies.

19 Central limit theorem: the t distribution includes the uncertainty in the estimate of σ and is wider. Using the t distribution also requires us to specify the degrees of freedom (df) of the data, in this case df = n-1. The notation is $t_{\alpha,df}$; you may also see $t_{\alpha/2,df}$ and $t_{1-\alpha/2,df}$ in texts and tables. For our t table, α refers to the area to the right (our Z table shows area to the left). Tables often require interpolation if the sample size is not listed. As df increases, the t distribution becomes the Z distribution.

20 Central limit theorem. Example:
(1) σ known (σ = 10): use Z distribution. Sample size = 20, df = n/a. 95% CI (confidence interval): from -1.96 SE to +1.96 SE; we are 95% confident the pop. mean is somewhere in this range.
(2) σ unknown: use t distribution. s = 10, sample size = 20, df = 20-1 = 19. 95% CI: from -2.093 SE to +2.093 SE.
(3) σ unknown: use t distribution. s = 10, sample size = 101, df = 101-1 = 100. 95% CI: from -1.984 SE to +1.984 SE.
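The three cases can be compared numerically. In this sketch the critical values are typed in by hand as assumptions: 1.960 is the usual Z value for 95% confidence, and 2.093 and 1.984 are the standard t-table entries for α/2 = 0.025 with df = 19 and df = 100. The helper name `margin` is hypothetical.

```python
from math import sqrt

def margin(critical: float, sd: float, n: int) -> float:
    """Half-width of the interval: critical value times the standard error."""
    return critical * sd / sqrt(n)

# (label, critical value from a table, SD, sample size)
cases = [
    ("(1) sigma known, Z, n=20", 1.960, 10.0, 20),
    ("(2) sigma unknown, t, df=19", 2.093, 10.0, 20),
    ("(3) sigma unknown, t, df=100", 1.984, 10.0, 101),
]

margins = [margin(crit, sd, n) for _, crit, sd, n in cases]
for (label, _, _, _), m in zip(cases, margins):
    print(f"{label}: sample mean +/- {m:.2f}")
```

Comparing (1) and (2) shows the cost of estimating σ from a small sample, while (3) shows how both the smaller standard error and a t value close to 1.96 shrink the interval when n is large.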

21 HANDOUT #6
Consider a medication that may increase the clotting time of patients taking it. We need reliable data on the usual clotting time (CT) of individuals not taking the medication in order to determine whether the medication is effective. We are unable to regularly measure the CT of every individual in the population of people not taking the medication, so we will estimate the population parameter, mean clotting time (CT), with a sample from the general population.
For the first set of questions we have access to detailed physiological data from the NIH, and we know that the population standard deviation of CT values in humans is 3.2 seconds.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which there is a 95% chance that the true population mean CT lies?
What is the region in which there is a 99% chance that the true population mean CT lies?
If we have a larger sample: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which there is a 95% chance that the true population mean CT lies?

22 HANDOUT #6
Note: we are interested in α/2 area to the right, but our Z table shows 1-α/2 area to the left.
For the first set of questions, the population standard deviation of CT values in humans is known to be 3.2 seconds.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19. Sample mean = 20, pop. σ = 3.2.
What is the region in which there is a 95% chance that the true population mean CT lies?
$Z_{1-\alpha/2} = Z_{1-0.05/2} = Z_{0.975} = 1.96$ from our Z table
$\bar{X} \pm 1.96(\sigma/\sqrt{n}) = 20 \pm 1.96(3.2/\sqrt{12}) = 20 \pm 1.81$, {18.19, 21.81}
What is the region in which there is a 99% chance that the true population mean CT lies?
$Z_{1-\alpha/2} = Z_{1-0.01/2} = Z_{0.995} = 2.575$ from our Z table
$\bar{X} \pm 2.575(\sigma/\sqrt{n}) = 20 \pm 2.575(3.2/\sqrt{12}) = 20 \pm 2.38$, {17.62, 22.38}
If we have a larger sample: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which there is a 95% chance that the true population mean CT lies?
$Z_{1-\alpha/2} = Z_{1-0.05/2} = Z_{0.975} = 1.96$ from our Z table
$\bar{X} \pm 1.96(\sigma/\sqrt{n}) = 20 \pm 1.96(3.2/\sqrt{36}) = 20 \pm 1.05$, {18.95, 21.05}
Note: with a larger sample size, the confidence interval for the same degree of confidence is narrower because the SE is smaller.
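The known-σ answers can be checked with a short script. This is a sketch under the handout's stated assumptions (σ = 3.2 seconds, the listed clotting times); the function name `known_sigma_ci` is hypothetical, and the critical value is computed from the standard normal distribution rather than read from a table, so the 99% result may differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist, mean

CT = [18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19]
SIGMA = 3.2  # population standard deviation, taken as known

def known_sigma_ci(data, sigma, confidence):
    """Z-based confidence interval for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    center = mean(data)
    half_width = z * sigma / sqrt(len(data))
    return center - half_width, center + half_width

for label, data, level in [
    ("n=12, 95%", CT, 0.95),
    ("n=12, 99%", CT, 0.99),
    ("n=36, 95%", CT * 3, 0.95),  # the larger sample repeats the same 12 values
]:
    low, high = known_sigma_ci(data, SIGMA, level)
    print(f"{label}: {low:.2f} to {high:.2f}")
```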

23 HANDOUT #6
For the second set of questions we consider the more realistic situation in which we don't know the population standard deviation and have to estimate it from the sample data.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which there is a 95% chance that the true population mean CT lies?
What is the region in which there is a 99% chance that the true population mean CT lies?
If we have a larger sample: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which there is a 95% chance that the true population mean CT lies?

24 HANDOUT #6
Note: instead of showing 1-α area to the left like the Z table does, our t table shows values for α area to the right.
For the second set of questions we consider the more realistic situation in which we don't know the population standard deviation and have to estimate it from the sample data.
Sample data: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19. Sample mean = 20, sample s = 3.133.
What is the region in which there is a 95% chance that the true population mean CT lies?
$t_{\alpha/2,df} = t_{0.05/2,11} = t_{0.025,11} = 2.201$ from our t table
$\bar{X} \pm 2.201(s/\sqrt{n}) = 20 \pm 2.201(3.133/\sqrt{12}) = 20 \pm 1.99$, {18.01, 21.99}
What is the region in which there is a 99% chance that the true population mean CT lies?
$t_{\alpha/2,df} = t_{0.01/2,11} = t_{0.005,11} = 3.106$ from our t table
$\bar{X} \pm 3.106(s/\sqrt{n}) = 20 \pm 3.106(3.133/\sqrt{12}) = 20 \pm 2.81$, {17.19, 22.81}
If we have a larger sample: 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19, 18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19.
What is the region in which there is a 95% chance that the true population mean CT lies?
Note: our t table does not have an entry for df = 35, so we round down and use the df = 30 value from our table.
$t_{\alpha/2,df} = t_{0.05/2,35} \approx t_{0.025,30} = 2.042$ from our t table
$\bar{X} \pm 2.042(s/\sqrt{n}) = 20 \pm 2.042(3.04/\sqrt{36}) = 20 \pm 1.04$, {18.96, 21.04}
Note: the SD is also smaller for the larger sample (3.04 versus 3.133) because of the relatively larger denominator; as n increases, n-1 approaches n.
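The unknown-σ answers can be checked the same way. The Python standard library has no t distribution, so this sketch hardcodes the three critical values that appear in the handout's calculations (2.201, 3.106, and the df = 30 stand-in 2.042); the dictionary `T_TABLE` and the function `t_interval` are hypothetical names used only for illustration.

```python
from math import sqrt
from statistics import mean, stdev

CT = [18, 20, 22, 23, 26, 17, 14, 22, 18, 21, 20, 19]

# Critical values as read from a printed t table: (area to the right, df) -> t
T_TABLE = {
    (0.025, 11): 2.201,
    (0.005, 11): 3.106,
    (0.025, 30): 2.042,  # used in place of df=35, which the table lacks
}

def t_interval(data, t_crit):
    """t-based confidence interval using the sample SD (n-1 denominator)."""
    center = mean(data)
    se = stdev(data) / sqrt(len(data))
    return center - t_crit * se, center + t_crit * se

print("sample s, n=12:", round(stdev(CT), 3))
print("sample s, n=36:", round(stdev(CT * 3), 3))
for label, data, key in [
    ("n=12, 95%", CT, (0.025, 11)),
    ("n=12, 99%", CT, (0.005, 11)),
    ("n=36, 95%", CT * 3, (0.025, 30)),
]:
    low, high = t_interval(data, T_TABLE[key])
    print(f"{label}: {low:.2f} to {high:.2f}")
```

Rounding df down to 30 gives a slightly larger critical value than the exact df = 35 entry would, so the resulting interval is marginally wider than strictly necessary, which errs on the cautious side.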
