Chapter 6 Part 4. Confidence Intervals

Posy Matthews

Chapter 6 Part 4: Confidence Intervals. October 1, 2008

Goal: To clearly understand the link between probability distributions and confidence intervals.

Skills: Be able to calculate a $100(1 - \alpha)\%$ confidence interval from a sample mean, both for the case that the population variance is known and the case that it is not known. Be able to accurately interpret a confidence interval.

Contents:
Central Limit Theorem
Confidence interval using the normal distribution
Formula
What impacts the length of a CI

Stata commands: invnormal

Usually we study samples from a population rather than the population itself because it is not possible to get our hands on the whole population (e.g. it is too big, the process is too costly, or, frequently, some of the members of the population we are interested in haven't even been born yet). We have agreed that, when possible, we should select a random sample. We also know that when we select a random sample of size $n$ for a study, it is just one of many possible samples of size $n$ that could have been selected from the population.

Suppose we want to know the average fasting triglycerides of the entire population of the U.S. that is 55 years old or older (55+). Some of the reasons why we'll have to select a sample would be: 1) usually the whole population is simply not available (e.g. the ALLHAT investigators were hoping that the results of their study would apply not only to those who were 55+ at the time of entry into the study but also to those who will later become 55+) and 2) even in cases where the population is available (an unusual case) the cost and time involved to study a whole population tend to be prohibitive. So we've decided to select a random sample from the population and use the mean of the fasting triglycerides of that sample to estimate the mean of the entire population.

What we learned earlier when studying the sampling distribution of means is the following: Let $X$ be a random variable representing the distribution of the fasting triglycerides in the population of people aged 55+. Let $\mu_X$ represent the population mean for the fasting triglycerides and $\sigma_X^2$ the population variance for the fasting triglycerides. Then we usually denote the random variable representing the sampling distribution of means of samples of size $n$ by $\bar{X}$, the mean of the population of means by $\mu_{\bar{X}}$ and the variance of the population of means by $\sigma_{\bar{X}}^2$.

If the sample size $n$ is large enough, we have

1) $\mu_{\bar{X}} = \mu_X$ (Fact 1 from before): the mean of the original distribution is equal to the mean of the sampling distribution.

2) $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$, where $\sigma_{\bar{X}}$ is called the standard error of the mean (SEM) (Fact 2 from before). Note that $\sigma_X$ refers to variation among individual values, such as we see within a single sample, and $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$, the SEM, refers to variation among the means of different samples.

3) We also noticed that the larger the sample size $n$ got, the more the distribution of those sample means looked like a normal distribution.

The Central Limit Theorem states the following: Given the notation we have used above for the original population of fasting triglycerides and the notation for the sampling distribution of means, for large $n$, $\bar{X}$ is approximately normally distributed with mean $\mu_{\bar{X}} = \mu_X$ (Fact 1) and variance $\sigma_{\bar{X}}^2 = \sigma_X^2 / n$ (Fact 2: $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$) regardless of the distribution of $X$. If $X$ is distributed normally, then $\bar{X}$ is also distributed normally (as opposed to approximately normally).

Now our problem is: how do we know if the sample mean is a good estimate of the population mean? Let us say that the graph below is the distribution of the means for fasting triglycerides (AFTRIG) of all samples of size $n$ from the U.S. population of those aged 55+. Looking at the histogram of the sampling distribution below, we would probably be willing to say that the means represented by the bar on the far right end (the bar with square dots) of the distribution are not good estimates for the mean of the distribution of the original AFTRIG values, because they are probably not what we would be willing to call close to the mean of the distribution of sample means (i.e. $\mu_{\bar{X}}$). But what about the means represented by the striped bar in the graph below? This is where our problems begin. We are clearly going to need some sort of measure of how certain we are that the mean of our sample is a reasonable estimate of the population mean.

This is where confidence intervals come in. Confidence intervals are going to be defined such that, given a 95% confidence interval, we will be 95% confident that $\mu_{\bar{X}}$ (and hence $\mu_X$) lies within our interval. So in obtaining a 95% confidence interval for $\mu_{\bar{X}}$, we will have also obtained an interval for the original population mean $\mu_X$.

Just as we have only one sample and one sample mean, we will have only one confidence interval based on that sample and its mean. If, however, we had all possible samples, we could get a confidence interval for the mean of each sample. Then the interpretation of the 95% confidence interval is that we are confident that 95% of these intervals contain the original population mean ($\mu_X$).
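The facts about the sampling distribution and the "95% of all such intervals" interpretation can be checked with a small simulation. The following is only a sketch, written in Python rather than Stata. The skewed population (exponential with mean 150), the sample size n = 50, the number of repetitions and the function name `simulate` are all assumptions chosen for illustration; none of them come from the ALLHAT data.

```python
import random
import statistics
from math import sqrt


def simulate(n=50, reps=4000, seed=1):
    """Draw `reps` random samples of size `n` and summarize the sample means."""
    rng = random.Random(seed)
    mu = 150.0      # assumed population mean (hypothetical)
    sigma = 150.0   # for an exponential population the SD equals the mean
    sem = sigma / sqrt(n)   # standard error of the mean (Fact 2)
    means = []
    covered = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0 / mu) for _ in range(n)]
        xbar = statistics.fmean(sample)
        means.append(xbar)
        # Known-sigma 95% interval centered at this sample's mean
        if xbar - 1.96 * sem < mu < xbar + 1.96 * sem:
            covered += 1
    return {
        "mean_of_means": statistics.fmean(means),  # should be near mu (Fact 1)
        "sd_of_means": statistics.stdev(means),    # should be near sem (Fact 2)
        "sem": sem,
        "coverage": covered / reps,                # should be near 0.95
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Even though the assumed population is far from normal, the mean of the sample means should land near 150, their standard deviation near 150/√50, and roughly 95% of the intervals should contain the population mean.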

Looking at the graph below of the confidence intervals, we notice that 3 of the intervals (the dashed ones) do not contain the population mean. The very top confidence interval does not contain the mean because confidence intervals will be defined as open intervals (i.e. intervals that do not contain their endpoints). The other two dashed confidence intervals don't even come particularly close to the mean.

[Figure: 95% CIs for the sample means, assuming $\sigma_X$ is known. Each interval is centered about a sample mean. Each interval is the same length because $\sigma_X$ is known.]

The intervals are all of the same length because (as we will show) the length of each interval depends on the sample size $n$ (remember all samples from the sampling distribution have the same size) and on the size of $\sigma_X$ when $\sigma_X$ is known. We'll show later that when $\sigma_X$ is not known, we can calculate the confidence interval using the sample estimate of $\sigma_X$, namely $s$. In this case the lengths of the intervals will vary as $s$ varies from sample to sample.

There are actually 3 kinds of intervals that we can use: prediction, confidence and tolerance intervals. We won't do much with prediction and tolerance intervals until we get to regression, but I will describe all three kinds of intervals here.

This example is taken from Forthofer and Lee's (2007) book Biostatistics. Dairies add vitamin D to milk for the purpose of fortification. The recommended amount of vitamin D to be added to a quart of milk is 400 IUs (10 μg). If a dairy adds too much vitamin D, perhaps over 5000 IUs, the amount of vitamin D could be toxic.

A prediction interval focuses on a single observation of the variable, for example the amount of vitamin D in the next bottle of milk.

A confidence interval focuses on a population parameter, for example the mean or median of vitamin D in a population of bottles of milk. Thus, the prediction interval is of more interest to the consumer of the next bottle of milk, whereas the confidence interval is of more interest to the dairy.

A tolerance interval provides limits such that there is a high level of confidence that a large portion of the values of the variable will fall within them. For example, besides being interested in the mean, the dairy owner or regulatory agency also wants to be confident that for a large portion of the bottles the vitamin D contents are within a specified tolerance of the value of 400 IUs.

So back to confidence intervals. The picture of the confidence intervals above is a nice graphic, but how do we actually calculate the confidence interval for our sample mean?

Confidence Intervals

Below we give the confidence interval for the mean of the sampling distribution under the conditions that the random variable is normally distributed, has an unknown mean and has a known variance. It is not usually the case that we know the variance, but we present this simplest version of the confidence interval first.

So let $\bar{X}$ be the random variable associated with the sampling distribution of means of samples of size $n$ drawn from the distribution with random variable $X$. The Central Limit Theorem says: for $n$ large enough, $\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$ approximately, where $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$ (regardless of the distribution of $X$).

[Figure: density of $N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$ with $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$; 95% of the area lies in the center and 2.5% in each tail.]

Note that the areas and standard deviations in the graph above were derived under the assumption that $\bar{X}$ is close enough to being normally distributed to not make any difference.

How did I decide that the area under the normal density, above the x-axis and between $\mu_{\bar{X}} - 1.96\,\sigma_{\bar{X}}$ and $\mu_{\bar{X}} + 1.96\,\sigma_{\bar{X}}$, is 95% of the total area under the curve? Well, $\mu_{\bar{X}} - 1.96\,\sigma_{\bar{X}}$ is 1.96 standard deviations ($\sigma_{\bar{X}}$) below the mean ($\mu_{\bar{X}}$) of the normal distribution $N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$ and $\mu_{\bar{X}} + 1.96\,\sigma_{\bar{X}}$ is 1.96 standard deviations above the mean. We learned earlier that from 1.96 standard deviations below the mean to 1.96 standard deviations above the mean cuts off 95% of the area under the curve for any normal distribution (i.e. this is part of what we learned when we showed that any normal distribution could be mapped into the standard normal distribution, $Z \sim N(0, 1)$).

So for $n$ large enough we have

$$\Pr\left(\mu_{\bar{X}} - 1.96\,\sigma_{\bar{X}} < \bar{X} < \mu_{\bar{X}} + 1.96\,\sigma_{\bar{X}}\right) = 0.95 \qquad \text{(Equation 1)}$$

[Aside: Notice above that I have used $<$ rather than $\le$ because although it doesn't make any difference which you use in terms of the probability of a continuous distribution, confidence intervals are always written as open intervals.]

But according to the Central Limit Theorem $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$. So Equation 1 becomes

$$\Pr\left(\mu_X - 1.96\,\frac{\sigma_X}{\sqrt{n}} < \bar{X} < \mu_X + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 2)}$$

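But we want $\mu_X$ in the middle and $\bar{X}$ on the ends, so we subtract $\mu_X$ across all parts of the inequality in Equation 2 and get

$$\Pr\left(-1.96\,\frac{\sigma_X}{\sqrt{n}} < \bar{X} - \mu_X < 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 3)}$$

Now subtract $\bar{X}$ across all parts of the inequality in Equation 3 and get

$$\Pr\left(-\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} < -\mu_X < -\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 4)}$$

Now multiply by $-1$ across all parts of the inequality in Equation 4 (note this reverses the inequalities)

$$\Pr\left(\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}} > \mu_X > \bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 5)}$$

Now just put the smaller endpoint of Equation 5 on the left and the larger on the right.

$$\Pr\left(\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} < \mu_X < \bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 6)}$$

Below we switch from probability to confidence because $\bar{X}$ is a random variable for which probability is appropriate but $\bar{x}$ is the mean of a particular sample. Once we use the sample mean $\bar{x}$, the population mean $\mu_X$ either is or is not in the interval and probability is no longer appropriate.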
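The 1.96 cutoff that drives Equations 1 through 6 can be checked numerically. The sketch below uses Python's standard library rather than Stata; the function name `central_area` and the example mean and standard error (150 and 21.2) are assumptions made only for illustration.

```python
from statistics import NormalDist


def central_area(z, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma^2) between mu - z*sigma and mu + z*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + z * sigma) - dist.cdf(mu - z * sigma)


if __name__ == "__main__":
    # Standard normal distribution
    print(round(central_area(1.96), 4))
    # Any other normal distribution gives the same area (assumed mean and SEM)
    print(round(central_area(1.96, mu=150.0, sigma=21.2), 4))
```

Both calls give an area of about 0.95, which is why the same cutoff can be used whatever the mean and standard error of the sampling distribution happen to be.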

So our 95% confidence interval is

$$\left(\bar{x} - 1.96\,\frac{\sigma_X}{\sqrt{n}},\ \bar{x} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right)$$

On the $N(0,1)$ curve the area to the right of 1.96 is 0.025 or 2.5%. Or the area to the left of 1.96 is 0.975 or 97.5%. This means we could denote 1.96 as $z_{0.975}$. Or if we let $\alpha = 0.05$, so that $\alpha/2 = 0.025$, then more generally we have $z_{0.975} = z_{1-(\alpha/2)}$. This pattern will work regardless of the value of $\alpha$.

Well, what do we do about $-1.96$? We'll use $-z_{1-(\alpha/2)}$.

Therefore, the general form of the $100(1 - \alpha)\%$ confidence interval is

$$\left(\bar{x} - z_{1-(\alpha/2)}\,\frac{\sigma_X}{\sqrt{n}},\ \bar{x} + z_{1-(\alpha/2)}\,\frac{\sigma_X}{\sqrt{n}}\right)$$

Usually we don't have to work so hard to distinguish between $X$ and $\bar{X}$ and their means and variances. This is because the random variable $\bar{X}$ is not usually part of the conversation. We have only used it to derive the formula for the confidence interval. This means we can just say that the distribution for the random variable $X$ has mean $\mu$ and standard deviation $\sigma$. So the commonly used form of the $100(1 - \alpha)\%$ confidence interval is

$$\left(\bar{x} - z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}\right)$$
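The general formula translates directly into a few lines of code. This is a minimal sketch in Python, not the Stata workflow used in these notes: the function name `z_interval` and the illustrative numbers (sample mean 100, standard deviation 15, n = 36) are hypothetical, and `NormalDist().inv_cdf(p)` is used here as the counterpart of Stata's `invnormal(p)`.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for the mean when sigma is known."""
    if n <= 0 or sigma <= 0 or not 0 < alpha < 1:
        raise ValueError("need n > 0, sigma > 0 and 0 < alpha < 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)   # the cutoff z_{1 - alpha/2}
    half_width = z * sigma / sqrt(n)          # z times the standard error
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the text
    print(z_interval(100.0, 15.0, 36))               # 95% interval
    print(z_interval(100.0, 15.0, 36, alpha=0.10))   # 90% interval
```

The interval is always centered on the sample mean, and only the half-width changes with `alpha`, `sigma` and `n`.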

In the above formula $\bar{x}$ is the mean of a single sample and is not a random variable.

The confidence for the interval above is $1 - \alpha$. So if $\alpha = 0.10$, then $1 - \alpha = 0.90$ and we would have a 90% confidence interval. In that case $\alpha/2 = 0.05$. Therefore, an area equal to 0.05 is cut off each end of the distribution.

The length of the confidence interval is

$$2\, z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}$$

As we select different samples of size $n$, we get different values for $\bar{x}$. So the location of the confidence interval changes. However, the length of the confidence interval remains the same (this is because $\sigma$ is known and the samples are all of size $n$).

Find the 95% confidence interval for the baseline heart rate in beats/min for the Propranolol treatment group (Cardiology Problem 6.81 in Rosner; see also the original description of the problem under Cardiovascular Disease on page 157). Let us suppose that the standard deviation of the baseline heart rate for Propranolol is known and is equal to 17 beats/minute. The Stata data set for this problem is nifed.dta.

. des

Contains data from C:\Stata\StataData\Myfiles\BiostatFall003\Data\nifed.dta
  obs:  34
 vars:  10 Oct 00 0:53
 size:  1,496 (99.9% of memory free)

variable name   storage type   display format   value label   variable label
id              float          %1.0g
trtgrp          float          %11.0g           trt           Treatment Group
heartlv0        float          %1.0g                          Baseline Heart Rate beats/min
heartlv1        float          %1.0g                          Level 1 Heart Rate beats/min
heartlv2        float          %1.0g                          Level 2 Heart Rate beats/min
heartlv3        float          %1.0g                          Level 3 Heart Rate beats/min
syslv0          float          %1.0g                          Baseline Systolic Blood Pressure mmhg
syslv1          float          %1.0g                          Level 1 Systolic Blood Pressure mmhg
syslv2          float          %1.0g                          Level 2 Systolic Blood Pressure mmhg
syslv3          float          %1.0g                          Level 3 Systolic Blood Pressure mmhg

. tab trtgrp

Treatment Group   Freq.   Percent     Cum.
nifedipine           18     52.94    52.94
propranolol          16     47.06   100.00
Total                34    100.00

. label list
trt:
   0 nifedipine
   1 propranolol

Since we have not used this data set before, I have run codebook for treatment group and for baseline heart rate so we can see what we have.

. codebook trtgrp

Treatment Group
           type: numeric (float)
          label: trt
          range: [0,1]              units: 1
  unique values: 2              missing .: 0/34
     tabulation: Freq.   Numeric   Label
                    18         0   nifedipine
                    16         1   propranolol

heartlv0    Baseline Heart Rate beats/min
           type: numeric (float)
          range: [51,116]           units: 1
      missing .: 0/34

The baseline heart rate in beats/minute is denoted heartlv0 and trtgrp = 1 is the propranolol treatment group.

. sum(heartlv0) if trtgrp == 1

So $\bar{x} = 76.81$ (from the $n = 16$ patients in the propranolol group) and $\sigma = 17$ (i.e. we don't use the sample standard deviation $s$, because $\sigma$ is known). Since we are assuming $n$ is large enough to assume normality, the 95% confidence interval is

$$\left(76.81 - 1.96\,\frac{17}{\sqrt{16}},\ 76.81 + 1.96\,\frac{17}{\sqrt{16}}\right) = (68.48,\ 85.14)$$

We are confident that 95% of all such confidence intervals cover $\mu$, the mean baseline heart rate of the population (i.e. all people treated with Propranolol). That is what we mean when we say we are 95% confident that $\mu$ lies between 68.48 and 85.14.

When assuming normality, our equation for the confidence interval implies that the confidence interval is centered about the sample mean. So when you are carefully double-checking your work, you'll want to make sure that the confidence interval you have gotten actually contains the sample mean.

What impacts the length of the confidence interval?

Remember that the length of the confidence interval is $2\, z_{1-(\alpha/2)}\,\sigma / \sqrt{n}$.

1) Sample size $n$. As $n$ increases, the length of the confidence interval decreases. So there is an inverse relationship between the sample size $n$ and the length of the confidence interval. Note that shorter confidence intervals are better. ($x$ and $y$ are inversely related if one increases as the other decreases.)

2) The standard deviation $\sigma$ or variance $\sigma^2$.

As the standard deviation $\sigma$ or variance $\sigma^2$ increases, the length of the confidence interval increases. So there is a direct relationship between the size of $\sigma$ and the length of the confidence interval. ($x$ and $y$ are directly related if they both increase or they both decrease.)

3) The $\alpha$-level. As $\alpha$ increases (meaning the confidence decreases), the length of the confidence interval decreases. So there is an inverse relationship between the size of $\alpha$ and the length of the confidence interval.

Let us use the function invnormal(p) = z, where p is the probability or area to the left of the cutoff and z is the cutoff. We can write the equation as invnormal(1 - (α/2)) = z.

Suppose that $\alpha = 0.05$ (i.e. we are talking about a 95% confidence interval). This means that an area of 0.025 will be cut off on each end of the normal distribution. So we have $1 - (\alpha/2) = 1 - 0.025 = 0.975$.

. di invnormal(1-(0.05/2))
1.959964

or

. di invnormal(0.975)
1.959964

So for $\alpha = 0.05$, $z_{1-(\alpha/2)} = z_{0.975} = 1.96$.

If $\alpha = 0.10$, then $1 - (\alpha/2) = 1 - 0.05 = 0.95$.

. di invnormal(1 - (0.10/2))
1.6448536

or

. di invnormal(0.95)
1.6448536

So $z_{1-(\alpha/2)} = z_{0.95} = 1.64$.

So $\alpha_1 = 0.05$ produces a z value of 1.96 and $\alpha_2 = 0.10$ produces a z value of 1.64. So the larger of the two $\alpha$'s produces the smaller z value and hence the shorter confidence interval.

If $\alpha = 0.05$, then we have a 95% [i.e. $100(1 - \alpha)\%$] confidence interval. If $\alpha = 0.10$, then we have a 90% confidence interval. So less confidence and shorter confidence intervals go together.
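The three influences on the length of the interval can be seen side by side with the heart rate example. The sketch below is in Python rather than Stata, with `NormalDist().inv_cdf(p)` standing in for `invnormal(p)`. The function names are hypothetical, and the sample mean 76.81 is an assumption: it is taken as the midpoint of the reported interval (68.48, 85.14), with n = 16 from the codebook tabulation and σ = 17 as stated in the problem.

```python
from math import sqrt
from statistics import NormalDist


def z_cutoff(alpha):
    """z_{1 - alpha/2}; plays the role of Stata's invnormal(1 - alpha/2)."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def ci_length(sigma, n, alpha):
    """Length of the known-sigma interval: 2 * z_{1 - alpha/2} * sigma / sqrt(n)."""
    return 2 * z_cutoff(alpha) * sigma / sqrt(n)


if __name__ == "__main__":
    # xbar is assumed: midpoint of the reported interval (68.48, 85.14)
    xbar, sigma, n = 76.81, 17.0, 16
    for alpha in (0.05, 0.10):
        half = ci_length(sigma, n, alpha) / 2
        print(alpha, round(z_cutoff(alpha), 2),
              round(xbar - half, 2), round(xbar + half, 2))
    # Quadrupling n halves the length; doubling sigma doubles it
    print(round(ci_length(sigma, n, 0.05), 2),
          round(ci_length(sigma, 4 * n, 0.05), 2),
          round(ci_length(2 * sigma, n, 0.05), 2))
```

With α = 0.05 the sketch reproduces the interval (68.48, 85.14); with α = 0.10 the interval is shorter, and the last line shows the effect of changing n and σ while holding α fixed.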
