Introduction to Statistical Data Analysis
Lecture 5: Confidence Intervals
James V. Lambers
Department of Mathematics, The University of Southern Mississippi

Introduction
Now that we have learned about sampling and sampling distributions, we are ready to learn how to use inferential statistics to draw conclusions about populations based on information obtained from samples. A key component of inferential statistics is quantifying the uncertainty that is inherent in using only a sample. An example of this is polling: a statement of a poll result is accompanied by an indication of the sampling error.

Suppose that we wish to know the population mean, but only have a sample mean. We can construct a confidence interval that is centered at the sample mean and indicates a range of plausible values for the population mean.

We first consider the case where the sample size $n$ is sufficiently large, meaning that $n \geq 30$. If this is the case, then, by the Central Limit Theorem, the sample means are approximately normally distributed, even if the population is not.

Estimators
The sample mean is an example of what is called a point estimate, which is a single value that describes the population. Point estimates are easy to compute, but impossible to validate on their own, because a single value gives no indication of how close it is to the population parameter. To gauge the validity of a sample mean, we will rely on an interval estimate, which is a range of values that describes the population. The particular interval estimate we will use is called a confidence interval.

Confidence Levels
The first step in constructing a confidence interval is choosing a confidence level, which is the probability that the interval estimate will include the population parameter (in this case, the population mean). For example, for a 90% confidence interval, the confidence level is 0.9. Subtracting this value from 1 yields the significance level $\alpha$; that is, for a 90% confidence interval, the significance level is 0.1.

Constructing a Confidence Interval
When the population standard deviation $\sigma$ is known, the confidence interval is determined as follows:
1. Compute the standard error of the mean, $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
2. Find the z-value $z_{\alpha/2}$ such that, for the random variable $Z$ with standard normal distribution $N(0, 1)$, $P(Z \leq z_{\alpha/2}) = 1 - \alpha/2$, where $\alpha$ is the level of significance and $1 - \alpha$ is the corresponding confidence level. The value of $z_{\alpha/2}$ can be found by looking up the probability $1 - \alpha/2$ in a normal distribution table, or by using the R function qnorm with argument $1 - \alpha/2$.
3. Compute the margin of error $E = z_{\alpha/2}\,\sigma_{\bar{x}}$.
4. Then, the confidence interval is $[\bar{x} - E, \bar{x} + E]$.

Meaning of $z_{\alpha/2}$ (figure not reproduced)

Example
Suppose that a signal with value $\mu$ is received with a value that is normally distributed around $\mu$ with variance 4. To reduce error, the signal is transmitted 10 times. If the values received are 8.5, 9.5, 9.5, 7.5, 9, 8.5, 10.5, 11, 11 and 7.5, then what is a 95% confidence interval for $\mu$?

Example, cont'd
1. First, we compute the sample mean, $\bar{x} = 9.25$.
2. Then, we compute the standard error of the mean, $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 2/\sqrt{10} = 0.6325$.
3. Using $\alpha = 0.05$, we obtain $z_{\alpha/2} = 1.96$.
4. The margin of error is then $E = z_{\alpha/2}\,\sigma_{\bar{x}} = (1.96)(0.6325) = 1.2397$.
5. Finally, the confidence interval is $[\bar{x} - E, \bar{x} + E] = [9.25 - 1.2397, 9.25 + 1.2397] = [8.01, 10.49]$.
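The steps of this example can be checked numerically. The following is a minimal sketch in Python rather than R, assuming that `statistics.NormalDist().inv_cdf` from the standard library is an acceptable stand-in for the `qnorm` lookup; the function name `z_confidence_interval` is hypothetical and not part of the lecture.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when sigma is known (hypothetical helper)."""
    n = len(sample)
    alpha = 1 - confidence                    # significance level
    standard_error = sigma / sqrt(n)          # sigma / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # plays the role of qnorm(1 - alpha/2)
    margin = z * standard_error               # margin of error E
    center = mean(sample)                     # sample mean
    return center - margin, center + margin

# Received signal values from the example; the variance is 4, so sigma = 2.
received = [8.5, 9.5, 9.5, 7.5, 9, 8.5, 10.5, 11, 11, 7.5]
lower, upper = z_confidence_interval(received, sigma=2.0, confidence=0.95)
print(f"[{lower:.2f}, {upper:.2f}]")  # prints [8.01, 10.49]
```

The unrounded z-value (about 1.959964) differs slightly from the tabulated 1.96, but the interval agrees with the slide to two decimal places.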

Interpreting Confidence Intervals
Once the confidence interval is obtained, it is essential to interpret it correctly. Given a 90% confidence interval, it is not true that the population mean has a 90% probability of falling within that particular interval: the population mean is a fixed value, so it either lies in the interval or it does not. Instead, what we know is that there is a 90% probability that a confidence interval constructed from a random sample will contain the population mean. Note that, when $\sigma$ is known, all confidence intervals for a given confidence level and sample size have the same margin of error $E$, and hence the same width $2E$, but the center is the sample mean, which varies from sample to sample.
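This interpretation can be illustrated by simulation. The sketch below is an illustration only, not part of the lecture: it draws many samples from a normal population whose mean is known to the simulation, builds a 90% interval from each, and counts how often the interval contains the true mean. The population parameters, the number of trials, and the seed are all assumed values chosen for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def coverage(mu=9.0, sigma=2.0, n=10, confidence=0.90, trials=2000, seed=0):
    """Fraction of simulated confidence intervals that contain mu."""
    rng = random.Random(seed)                        # fixed seed for repeatability
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)                     # same margin for every sample
    hits = 0
    for _ in range(trials):
        sample_mean = mean(rng.gauss(mu, sigma) for _ in range(n))
        # The interval is centered at the sample mean, which changes each trial.
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / trials

print(f"fraction of intervals containing mu: {coverage():.3f}")
```

The printed fraction should be close to 0.90. Each individual interval either contains the population mean or does not; the 90% describes the long-run behavior of the procedure.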

Changing the Confidence Level
The significance level $\alpha$ represents the probability that a confidence interval constructed from a random sample fails to contain the population mean. As the confidence level $1 - \alpha$ increases, the significance level $\alpha$ decreases (since these two quantities must sum to one), which causes the z-value $z_{\alpha/2}$ to increase, and therefore the interval widens. As a result, the chance that the interval misses the population mean decreases.
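The effect of the confidence level on $z_{\alpha/2}$ can be tabulated directly. This is a small sketch, assuming Python's `statistics.NormalDist` in place of a normal table; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence):
    """z-value z_{alpha/2} for a given confidence level 1 - alpha."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

# Higher confidence level -> smaller alpha -> larger z-value -> wider interval.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
# 90%: z = 1.645
# 95%: z = 1.960
# 99%: z = 2.576
```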

Changing the Sample Size
As the sample size $n$ increases, the standard error of the mean decreases. It follows that the margin of error decreases, and therefore the confidence interval shrinks. This makes sense because with a larger sample size, the sample mean should more accurately approximate the population mean. This is consistent with the Law of Large Numbers, which states that as $n \to \infty$, the sample mean $\bar{x}$ converges to the population mean $\mu$.
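Because the standard error is $\sigma/\sqrt{n}$, quadrupling the sample size halves the margin of error. A short sketch, with $\sigma = 2$ and the sample sizes chosen purely for illustration:

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error E = z_{alpha/2} * sigma / sqrt(n), with sigma known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

# Each step multiplies n by 4, so each margin is half of the previous one.
for n in (10, 40, 160):
    print(n, round(margin_of_error(2.0, n), 4))
```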

Choosing the Sample Size for the Mean
Given a desired margin of error $E$, one can solve for the sample size $n$ that would produce this value of $E$. Rearranging the formulas presented earlier for the construction of the confidence interval, we obtain
$$n = \left(\frac{\sigma}{\sigma_{\bar{x}}}\right)^2 = \left(\frac{\sigma z_{\alpha/2}}{E}\right)^2.$$
We can see from this formula that as the margin of error $E$ decreases, the sample size $n$ must increase.
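The formula translates directly into code. In this sketch the result is rounded up to the next whole number, since a sample size must be an integer; that rounding step is an addition not stated on the slide. The values $\sigma = 2$ and $E = 0.5$ are illustrative only, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((sigma * z / margin) ** 2)   # round up (assumption, see lead-in)

print(sample_size_for_mean(sigma=2.0, margin=0.5))  # prints 62
```

Halving the desired margin to 0.25 roughly quadruples the required sample size, in line with the inverse-square dependence on $E$.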

When $\sigma$ is Unknown
If the population standard deviation $\sigma$ is unknown, a confidence interval can be obtained by substituting the sample standard deviation $s$. That is, the standard error of the mean is taken to be $\hat{\sigma}_{\bar{x}} = s/\sqrt{n}$.

When the sample size $n$ is considered small (that is, $n < 30$), we can no longer rely on the Central Limit Theorem to conclude that the sampling distribution of the mean is normal. We must instead assume that the population itself is normal.

When $\sigma$ is Known
When the population standard deviation $\sigma$ is known, we can proceed in the same way as for large samples.

When $\sigma$ is Unknown
When $\sigma$ is unknown, we can substitute $s$ for $\sigma$ as is done for large samples, but to determine the margin of error $E$, instead of using the z-value $z_{\alpha/2}$ from the normal distribution, we use the Student's t-distribution. This distribution, like the normal distribution, is bell-shaped and symmetric around the mean, and the area under the probability density curve is 1, but the shape of this curve depends on the degrees of freedom, which is $n - 1$. This is because there are $n$ observations in the sample, but one degree of freedom is removed by estimating the mean. The Student's t-distribution curve is flatter than the normal distribution curve, with heavier tails, but it converges to a normal distribution as $n$ increases.

Using the Student's t-distribution
In this scenario, the confidence interval is given by
$$\left[\bar{x} - t_{\alpha/2,n-1}\,\hat{\sigma}_{\bar{x}},\; \bar{x} + t_{\alpha/2,n-1}\,\hat{\sigma}_{\bar{x}}\right], \qquad \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}}.$$
The value of $t_{\alpha/2,n-1}$ can be obtained by looking up the probability $1 - \alpha/2$ in a Student's t-distribution table, or by using the R function qt with arguments $1 - \alpha/2$ and $n - 1$.

Example
We revisit our previous example with a signal transmitted 10 times, except that now, the variance $\sigma^2$ is unknown. Recall that the received values are 8.5, 9.5, 9.5, 7.5, 9, 8.5, 10.5, 11, 11 and 7.5. Therefore, for the standard error, we use the sample standard deviation $s \approx 1.2964$, which yields the standard error
$$\hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}} \approx \frac{1.2964}{\sqrt{10}} \approx 0.4099.$$

Example, cont'd
The number of degrees of freedom is $n - 1 = 10 - 1 = 9$, and therefore we have $t_{\alpha/2,n-1} = t_{0.05/2,9} \approx 2.2622$ and margin of error $E = t_{\alpha/2,n-1}\,\hat{\sigma}_{\bar{x}} \approx 2.2622(0.4099) \approx 0.9274$ (computed from unrounded intermediate values). We conclude that the 95% confidence interval is $[\bar{x} - E, \bar{x} + E] = [9.25 - 0.9274, 9.25 + 0.9274] = [8.3226, 10.1774]$. Note that this interval is somewhat narrower than the one constructed in the previous example, due mainly to the sample standard deviation $s \approx 1.2964$ being smaller than the value $\sigma = 2$ used there.
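The t-based interval can also be checked numerically. Python's standard library has no counterpart to R's `qt`, so the sketch below approximates the t quantile by integrating the density with Simpson's rule and inverting with bisection. This numerical stand-in, and the helper names `t_pdf`, `t_cdf`, and `t_quantile`, are assumptions made for the illustration; in R one would simply call `qt(0.975, 9)`.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using symmetry plus Simpson's rule on [0, x]."""
    h = x / steps                              # steps must be even
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Value t with P(T <= t) = p, for 0.5 < p < 1, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

received = [8.5, 9.5, 9.5, 7.5, 9, 8.5, 10.5, 11, 11, 7.5]
n = len(received)
s = stdev(received)                            # sample standard deviation (n - 1 divisor)
standard_error = s / math.sqrt(n)
t = t_quantile(1 - 0.05 / 2, n - 1)            # stands in for qt(0.975, 9)
margin = t * standard_error
lower, upper = mean(received) - margin, mean(received) + margin
print(f"s = {s:.4f}, t = {t:.4f}, interval = [{lower:.4f}, {upper:.4f}]")
```

Note that `statistics.stdev` divides by $n - 1$, which matches the sample standard deviation $s$ used on the slide.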

We will now learn how to construct confidence intervals for proportions, using the standard error of the proportion introduced earlier. This construction is based on the approximation of the binomial distribution by a normal distribution, for sample size $n$ sufficiently large so that $np \geq 5$ and $n(1 - p) \geq 5$, where $p$ is the population proportion. If $p$ is unknown, then we use the sample proportion $p_s$ in place of $p$.

Calculating Confidence Intervals
We first compute the standard error of the proportion, except that we use the sample proportion $p_s$ instead of the population proportion $p$:
$$\hat{\sigma}_p = \sqrt{\frac{p_s(1 - p_s)}{n}}.$$
Then, as was done for confidence intervals for the mean, we compute $z_{\alpha/2}$, where $P(Z \leq z_{\alpha/2}) = 1 - \alpha/2$, using either table lookup or the R function qnorm. Finally, we obtain the confidence interval
$$\left[p_s - z_{\alpha/2}\,\hat{\sigma}_p,\; p_s + z_{\alpha/2}\,\hat{\sigma}_p\right].$$
As before, the margin of error $E$ is given by $z_{\alpha/2}\,\hat{\sigma}_p$, which is half the width of the interval.

Example
Suppose that 600 voters are polled, and 52% of them indicate that they approve of the president's job performance. Using a 95% confidence interval, what is the margin of error? The standard error of the proportion, with $p_s = 0.52$ and $n = 600$, is
$$\hat{\sigma}_p = \sqrt{\frac{(0.52)(0.48)}{600}} = 0.0204.$$
We also have $z_{\alpha/2} = z_{0.05/2} = 1.96$. Therefore, the margin of error is $z_{\alpha/2}\,\hat{\sigma}_p = 1.96(0.0204) \approx 0.04$. That is, the poll has a margin of error of 4%.
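The poll calculation can be reproduced with a few lines of code. This is a sketch under the same normal approximation, again assuming `statistics.NormalDist` in place of `qnorm`; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(p_s, n, confidence=0.95):
    """Return (lower, upper, margin) for a proportion via the normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_s * (1 - p_s) / n)   # estimated standard error of the proportion
    margin = z * standard_error                  # margin of error E
    return p_s - margin, p_s + margin, margin

# Poll from the example: 52% approval among 600 voters.
lower, upper, margin = proportion_confidence_interval(0.52, 600)
print(f"margin = {margin:.4f}, interval = [{lower:.3f}, {upper:.3f}]")
```

With $n p_s = 312$ and $n(1 - p_s) = 288$, both well above 5, the normal approximation condition stated earlier is satisfied for this poll.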

Determining the Sample Size
As with confidence intervals for the mean, we can determine the sample size $n$ so that a confidence interval for the proportion has a given margin of error. Solving the equation for the margin of error for $n$ yields
$$n = \frac{p(1 - p)}{\sigma_p^2} = \frac{p(1 - p)\,z_{\alpha/2}^2}{E^2}.$$
Without an estimate of $p$, we can use $p = 0.5$ to maximize $n$, since that choice of $p$ maximizes the quantity $p(1 - p)$.
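As a sketch of how this formula might be used, the function below defaults to $p = 0.5$ when no estimate is available and rounds the result up to a whole number of respondents. The rounding, the 3% target margin, and the function name are assumptions for the illustration, not part of the slide.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n giving at most the requested margin of error for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(p * (1 - p) * z ** 2 / margin ** 2)   # round up (assumption)

print(sample_size_for_proportion(0.03))          # prints 1068
print(sample_size_for_proportion(0.03, p=0.3))   # smaller, since p(1 - p) < 0.25
```

Using $p = 0.5$ gives the largest, most conservative sample size; any other value of $p$ requires fewer respondents for the same margin.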
