Chapter 8 Point Estimation and Confidence Interval

8.1 Point estimator

The purpose of point estimation is to use a function of the sample data to estimate the unknown parameter.

Definition 8.1 A parameter is a constant that describes the population. A statistic is a random variable that can be computed from the sample data without making use of any unknown parameters. The statistic $\hat\Theta$ used to estimate the unknown parameter $\theta$ is called a point estimator of $\theta$. A point estimate is the value of $\hat\Theta$ calculated from the observed sample values.

Sample mean (mean of the sample):
\[ \bar X = \frac{1}{n}\sum_{i=1}^{n} X_i \]

Sample variance (variance of the sample):
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2 \]

Sample standard deviation (standard deviation of the sample):
\[ S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2} \]

Recall: A statistic $\hat\Theta$ is said to be unbiased if $E(\hat\Theta) = \theta$. $\bar X$ is an unbiased estimator of $\mu$, and $S^2$ is an unbiased estimator of $\sigma^2$.

Warning: Unbiased estimator is not unique.

Definition 8.2 If we consider all possible unbiased estimators of $\theta$, the one with the smallest variance is called the most efficient estimator of $\theta$.
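As a small numerical illustration of the three sample statistics defined above, the sketch below computes them directly from the formulas and cross-checks against Python's standard `statistics` module. The data values are hypothetical and chosen only so the arithmetic is easy to follow; they do not come from these notes.

```python
import statistics
from math import sqrt

# Hypothetical sample, used only for illustration (not from the notes).
sample = [4.0, 7.0, 5.0, 9.0, 5.0]

n = len(sample)
x_bar = sum(sample) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # sample variance, divisor n - 1
s = sqrt(s2)                                          # sample standard deviation

# The standard library uses the same n - 1 divisor for variance() and stdev().
assert abs(x_bar - statistics.mean(sample)) < 1e-12
assert abs(s2 - statistics.variance(sample)) < 1e-12
assert abs(s - statistics.stdev(sample)) < 1e-12

print(x_bar, s2, s)
```

The divisor $n - 1$ (rather than $n$) is what makes $S^2$ unbiased for $\sigma^2$.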
8.2 Interval estimation

We have proved that the sample mean is an unbiased estimator of the population mean. Suppose a sample of size $n$ is taken from a Poisson distribution and it is found that $\bar x = 3.8$. Then 3.8 is a point estimate of $\lambda$. But what exactly does this tell us about the true value of $\lambda$? Can we feel reasonably certain, for example, that $\lambda$ lies somewhere close to $\bar x$, say, in the interval from 3.7 to 3.9? Or, on the other hand, is $\bar X$ so variable that there is a good chance that $|\bar X - \lambda|$ is fairly large? To address this uncertainty we turn from point estimation to a technique known as interval estimation.

Interval estimation is exactly what the name implies. We want to find two statistics, $\hat\Theta_1$ and $\hat\Theta_2$, that can be used to generate an interval of real numbers that we hope contains the true value of the parameter $\theta$ being estimated.

Definition 8.3 A $100(1-\alpha)\%$ confidence interval for a parameter $\theta$ is an interval of the form $[\hat\theta_1, \hat\theta_2]$, in which $\hat\theta_1$ and $\hat\theta_2$ are the observed values of statistics $\hat\Theta_1$ and $\hat\Theta_2$ such that
\[ P(\hat\Theta_1 \le \theta \le \hat\Theta_2) = 1 - \alpha. \]

8.3 Confidence interval for µ when σ is known

Theorem 1 If $\bar x$ is the value of the sample mean of a random sample of size $n$ from a normal population with the known variance $\sigma^2$, then a $100(1-\alpha)\%$ confidence interval for $\mu$ is
\[ \bar x \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = \left(\bar x - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar x + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right). \]
The quantity $z_{\alpha/2}\,\sigma/\sqrt{n}$ is called the margin of error.

Example 8.1 What is the average price of statistics books? Below is a random sample of prices:

40 53 39 37 22 35 66 80 95 35

What is a 95% confidence interval for $\mu$, assuming $\sigma = 20$?

$\bar x = 50.2$, $s = 23.13$
\[ 50.2 \pm 1.96\,\frac{20}{\sqrt{10}} = 50.2 \pm 12.4 = (37.8,\ 62.6) \]
The interval (37.8, 62.6) is called a 95% confidence interval for $\mu$. We are 95% confident that the unknown $\mu$ lies between $37.8 and $62.6. We got this interval by a method that gives correct results 95% of the time.

Caution: There are only two possibilities:
1. The interval (37.8, 62.6) contains the true $\mu$.
2. This random sample is one of the few samples for which $\bar x$ is not within $12.4 of the true $\mu$. Only 5% of all samples give such inaccurate results.

Example 8.1 You have measured the systolic blood pressure of a random sample of 25 students in UST. A 95% confidence interval for the mean systolic blood pressure for all students in UST is (122, 138). Which of the following statements gives a valid interpretation of this interval?
(a) 95% of the sample of students have a systolic blood pressure between 122 and 138.
(b) 95% of the population of students have a systolic blood pressure between 122 and 138.
(c) The probability that the population mean blood pressure is between 122 and 138 is 0.95.
(d) If the procedure were repeated many times, 95% of the resulting confidence intervals would contain the population mean systolic blood pressure.
(e) If the procedure were repeated many times, 95% of the sample means would be between 122 and 138.

Example 8.2 The Bureau of Labor Statistics (BLS) conducts surveys each month to collect information on the labor market. According to one recent survey, the average hourly earnings of workers employed in the manufacturing industries edged up 1 cent in September to $12.86. The survey is based on a random sample of 390,000 workers. Suppose that the standard deviation of hourly earnings is $1.875. Find 95% and 99% confidence intervals for the mean hourly earnings of all workers employed in the manufacturing industries in September 1998. What are the margins of error for 95% and 99% confidence?
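$n = 390000$, $\bar x = 12.86$, $\sigma = 1.875$

95% confidence (margin of error $m = 0.0059$):
\[ \bar x \pm z_{0.025}\frac{\sigma}{\sqrt{n}} = 12.86 \pm 1.96\,\frac{1.875}{\sqrt{390000}} = 12.86 \pm 0.0059 = (12.8541,\ 12.8659) \]

99% confidence (margin of error $m = 0.0077$):
\[ \bar x \pm z_{0.005}\frac{\sigma}{\sqrt{n}} = 12.86 \pm 2.576\,\frac{1.875}{\sqrt{390000}} = 12.86 \pm 0.0077 = (12.8523,\ 12.8677) \]

Example 8.3 Refer to Example 8.2. Suppose that the sample size is 1000. Find 95% and 99% confidence intervals for the mean hourly earnings in September 1998. What are the margins of error for 95% and 99% confidence?

$n = 1000$, $\bar x = 12.86$, $\sigma = 1.875$
\[ \bar x \pm z_{0.025}\frac{\sigma}{\sqrt{n}} = 12.86 \pm 1.96\,\frac{1.875}{\sqrt{1000}} = 12.86 \pm 0.116 = (12.744,\ 12.976) \]
\[ \bar x \pm z_{0.005}\frac{\sigma}{\sqrt{n}} = 12.86 \pm 2.576\,\frac{1.875}{\sqrt{1000}} = 12.86 \pm 0.153 = (12.707,\ 13.013) \]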
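The known-σ interval is simple enough to check numerically. The sketch below is one possible way to do so in Python using only the standard library; the function name `z_interval` is our own and not part of the notes. It recomputes the margins of error for the book-price example and for the hourly-earnings examples with n = 390000 and n = 1000.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, conf=0.95):
    """Return (lower, upper, margin) for x_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}, e.g. about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin, margin


# Book prices, sigma assumed known and equal to 20.
prices = [40, 53, 39, 37, 22, 35, 66, 80, 95, 35]
books = z_interval(sum(prices) / len(prices), sigma=20, n=len(prices))
print("books:", books)

# Hourly earnings for both sample sizes and both confidence levels.
results = {
    (n, conf): z_interval(12.86, 1.875, n, conf)
    for n in (390000, 1000)
    for conf in (0.95, 0.99)
}
for (n, conf), (lo, hi, m) in results.items():
    print(f"n={n}, conf={conf}: margin={m:.4f}, interval=({lo:.4f}, {hi:.4f})")
```

Comparing the two sample sizes shows the margin of error shrinking in proportion to $1/\sqrt{n}$, and comparing the two confidence levels shows it growing with $z_{\alpha/2}$.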
8.4 Confidence interval for µ when σ is unknown

Recall that for a random sample of size $n$ from $N(\mu, \sigma^2)$,
\[ Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \]
What if $\sigma$ is unknown?

Student-t statistic:
\[ T = \frac{\bar X - \mu}{S/\sqrt{n}}, \]
where $S$ is the sample standard deviation.

Theorem 2 (Student-t Distributions) For a random sample of size $n$ from $N(\mu, \sigma^2)$,
\[ T = \frac{\bar X - \mu}{S/\sqrt{n}} \sim t(n-1), \]
where $t(n-1)$ is the t-distribution with $n-1$ degrees of freedom.

To understand the t-distribution, we first introduce the chi-square distribution.

Definition 8.4 $X$ is said to have the chi-square distribution with $v$ degrees of freedom, denoted by $X \sim \chi^2(v)$, if its pdf is given by
\[ f(x) = \begin{cases} \dfrac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2-1} e^{-x/2} & \text{for } x > 0, \\ 0 & \text{elsewhere.} \end{cases} \]

Definition 8.5 If $Y$ and $Z$ are independent, with $Z \sim N(0, 1)$ and $Y \sim \chi^2(v)$, then
\[ T = \frac{Z}{\sqrt{Y/v}} \]
has pdf
\[ f(t) = \frac{\Gamma((v+1)/2)}{\sqrt{\pi v}\,\Gamma(v/2)} \left(1 + \frac{t^2}{v}\right)^{-(v+1)/2}, \quad -\infty < t < \infty, \]
and it is called the t distribution with $v$ degrees of freedom.

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from $N(\mu, \sigma^2)$, and let $\bar X$ and $S^2$ be the mean and variance of the random sample. Then
- $\sum_{i=1}^{n} \left(\dfrac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(n)$;
- $\dfrac{(n-1)S^2}{\sigma^2}$ has a chi-square distribution with $n-1$ degrees of freedom;
- $\bar X$ and $S^2$ are independent;
- $\sqrt{n}(\bar X - \mu)/S$ has a t-distribution with $n-1$ degrees of freedom.

Properties of t distributions:
- Symmetric about 0.
- Bell-shaped, similar to the $N(0,1)$ curve, but with heavier tails.
- As $k$ increases, the $t(k)$ distribution approaches the $N(0,1)$ distribution.

The t confidence interval

If $\bar x$ and $s$ are the values of the mean and the standard deviation of a random sample of size $n$ from a normal population, then a $100(1-\alpha)\%$ confidence interval for $\mu$ is
\[ \bar x \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \]
or
\[ \bar x \pm \text{margin of error}, \quad \text{where margin of error} = t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}. \]

Example 8.4 Refer to Table A.4. What t critical value would you use for a C.I. for $\mu$?
(a) A 95% C.I. based on $n = 10$
(b) A 99% C.I. based on $n = 20$

Example 8.5 Great discoveries in science tend to be made by persons who are quite young. Listed below are 12 major scientific breakthroughs from the middle of the sixteenth century to the early part of the twentieth century.
Discovery                                              Discoverer   Date   Age
Earth goes around sun                                  Copernicus   1543   40
Telescope, basic laws of astronomy                     Galileo      1600   34
Principles of motion, gravitation, calculus            Newton       1665   23
Nature of electricity                                  Franklin     1746   40
Burning is uniting with oxygen                         Lavoisier    1774   31
Earth evolved by gradual processes                     Lyell        1830   33
Evidence for natural selection controlling evolution   Darwin       1858   49
Field equations for light                              Maxwell      1864   33
Radioactivity                                          Curie        1896   33
Quantum theory                                         Planck       1901   43
Special theory of relativity, E = mc^2                 Einstein     1905   26
Mathematical foundations for probability theory        Kolmogorov   1933   30

Let $\mu$ denote the true average age at which great scientific discoveries are made. Construct a 95% confidence interval for $\mu$.

$n = 12$, $\bar x = 34.58$, $s = 7.3045$

A 95% C.I. is
\[ 34.58 \pm 2.201\,\frac{7.3045}{\sqrt{12}} = 34.58 \pm 4.6411 = (29.9389,\ 39.2211) \]

Example 8.6 Large superstores use scanners to calculate a customer's bill. Scanners should be as accurate as possible. A state agency regularly monitors stores by randomly selecting a large number of items and comparing the shelf price with the checkout scanner price. Are the overcharges balanced by the undercharges, or is the mean overcharge of all different items in the store positive? During one check by the agency, 16 items were found to be incorrectly scanned. The amounts of overcharges were

2.00  -0.99  1.00  -0.50  0.40  -0.60  0.20  0.30  0.50  3.00  -1.20  1.00  0.50  0.30  -0.70  0.40

(a) Find a 95% C.I. for the mean overcharge.

$n = 16$, $\bar x = 0.351$, $s = 1.083$

A 95% C.I. is
\[ 0.351 \pm 2.131\,\frac{1.083}{\sqrt{16}} = 0.351 \pm 0.577 = (-0.226,\ 0.928) \]
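The two t intervals above can be reproduced from the raw data. The sketch below is a minimal illustration using only the Python standard library. Because the standard library has no t quantile function, the critical value is passed in by hand from a t table (2.201 for 11 degrees of freedom and 2.131 for 15 degrees of freedom, both at the 95% level). The helper name `t_interval` is our own.

```python
from math import sqrt
from statistics import mean, stdev


def t_interval(data, t_crit):
    """Return (lower, upper) for x_bar +/- t_crit * s / sqrt(n).

    t_crit is t_{alpha/2, n-1}, looked up in a t table by the caller.
    """
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation, divisor n - 1
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Ages at the time of the 12 discoveries (Example 8.5).
ages = [40, 34, 23, 40, 31, 33, 49, 33, 33, 43, 26, 30]
ages_ci = t_interval(ages, t_crit=2.201)  # t_{0.025, 11}

# Overcharges for the 16 incorrectly scanned items (Example 8.6).
overcharges = [2.00, -0.99, 1.00, -0.50, 0.40, -0.60, 0.20, 0.30,
               0.50, 3.00, -1.20, 1.00, 0.50, 0.30, -0.70, 0.40]
over_ci = t_interval(overcharges, t_crit=2.131)  # t_{0.025, 15}

print("ages:", ages_ci)
print("overcharges:", over_ci)
```

Small differences in the last digits relative to the hand calculation come from rounding $\bar x$ and $s$ before substituting them. The overcharge interval contains 0, so these data alone do not show that the mean overcharge is positive.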
8.5 Point and interval estimation for a population proportion

In many problems we must estimate proportions, probabilities, percentages or rates.

Example:
(a) What is the current unemployment rate in Hong Kong?
(b) What is President Bush's approval rating?
(c) What proportion of students in Math 144 will receive Grade A?

Point estimation of proportions

Let $p$ be a population proportion of successes (or the probability of success), and let $X$ be the number of successes in a random sample of size $n$. Define the sample proportion $\hat p$ by
\[ \hat p = \frac{X}{n}. \]
Then
\[ E(\hat p) = p, \qquad \mathrm{Var}(\hat p) = \frac{p(1-p)}{n}. \]
Therefore, the sample proportion is an unbiased estimator of $p$.

Interval estimation of p

Recall that
\[ \frac{X - np}{\sqrt{np(1-p)}} \to N(0, 1) \quad \text{as } n \to \infty. \]
Hence
\[ \frac{\hat p - p}{\sqrt{p(1-p)/n}} \to N(0, 1) \]
and
\[ \frac{\hat p - p}{\sqrt{\hat p(1-\hat p)/n}} \to N(0, 1). \]

Confidence interval for p: An approximate $100(1-\alpha)\%$ confidence interval for $p$ is
\[ \hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}} \]
or
\[ \hat p \pm \text{margin of error}, \quad \text{where margin of error} = z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}. \]
Example 8.7 Do you approve or disapprove of the way Bill Clinton is handling his job as president? A Gallup poll conducted May 7-9, 1999 found that 60% of 1,015 adults interviewed approved of Clinton's job performance. The Gallup poll claims that "For results based on the total sample of adults nationwide, one can say with 95% confidence that the margin of sampling error is no greater than +/- 3 percentage points." Explain.

$\hat p = 0.60$, $n = 1015$

A 95% C.I. for $p$ is
\[ 0.60 \pm 1.96\sqrt{\frac{0.60 \times 0.40}{1015}} = 0.60 \pm 1.96 \times 0.015377 = 0.60 \pm 0.03 = (0.57,\ 0.63) \]

Example 8.8 In a random sample of $n = 500$ families owning television sets in the city of Hamilton, Canada, it is found that $x = 340$ subscribed to HBO. Find a 95% confidence interval for the actual proportion of families in this city who subscribe to HBO.

$\hat p = 340/500 = 0.68$, and a 95% C.I. is
\[ 0.68 \pm 1.96\sqrt{\frac{0.68(0.32)}{500}} = 0.68 \pm 0.04 = [0.64,\ 0.72] \]

Choosing the sample size

Recall the margin of error is given by
\[ m = z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}. \]

Sample size for desired margin of error: The $100(1-\alpha)\%$ confidence interval for $p$ will have a margin of error approximately equal to a specified value $m$ when the sample size is
\[ n = \left(\frac{z_{\alpha/2}}{m}\right)^2 p^*(1-p^*), \]
where $p^*$ is a guessed value for the sample proportion.

Conservative sample size: Since $p^*(1-p^*) \le 1/4$, taking
\[ n = \frac{1}{4}\left(\frac{z_{\alpha/2}}{m}\right)^2 \]
guarantees that the margin of error will be less than or equal to $m$.
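The interval for a proportion and the sample-size formulas translate directly into code. The sketch below is an illustration using the Python standard library; the function names `proportion_interval` and `sample_size` are our own. It recomputes the Gallup and HBO intervals and evaluates the sample size for a 3-point margin, rounding up to a whole number of respondents.

```python
from math import ceil, sqrt
from statistics import NormalDist


def proportion_interval(p_hat, n, conf=0.95):
    """Return (lower, upper, margin) for p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


def sample_size(m, conf=0.95, p_star=0.5):
    """Smallest integer n with margin of error at most m, given a guess p_star.

    The default p_star = 0.5 gives the conservative sample size (1/4)(z/m)^2.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(p_star * (1 - p_star) * (z / m) ** 2)


gallup = proportion_interval(0.60, 1015)   # Example 8.7
hbo = proportion_interval(340 / 500, 500)  # Example 8.8
print("Gallup:", gallup)
print("HBO:", hbo)

# Sample size for a margin of 3 percentage points at 95% confidence.
n_conservative = sample_size(0.03)         # no prior guess for p
n_guess = sample_size(0.03, p_star=0.3)    # with a guessed p* = 0.3
print(n_conservative, n_guess)
```

The conservative size for a 3-point margin is close to Gallup's 1,015 interviews, which is consistent with the poll's stated margin of sampling error.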
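Example 8.9 Find the sample size needed if the margin of error of the 95% confidence interval is
(a) $m = 1\%$
(b) $m = 2\%$
(c) $m = 3\%$
(d) $m = 3\%$, $p^* = 0.3$

(a) $n = 0.25(1.96/0.01)^2 = 9604$
(b) $n = 0.25(1.96/0.02)^2 = 2401$
(c) $n = 0.25(1.96/0.03)^2 = 1067.1$
(d) $n = 0.3 \times 0.7\,(1.96/0.03)^2 = 896.4$

In practice the values in (c) and (d) are rounded up to the next integer, 1068 and 897.

8.6 Estimation of differences between means

Variances known

Let $\bar X_1$ and $\bar X_2$ be the sample means of independent random samples of sizes $n_1$ and $n_2$ from normal populations with means $\mu_1$ and $\mu_2$ and with the known variances $\sigma_1^2$ and $\sigma_2^2$. Then
\[ Z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1). \]

Confidence interval for $\mu_1 - \mu_2$: If $\bar x_1$ and $\bar x_2$ are the values of the sample means of independent random samples of sizes $n_1$ and $n_2$ from normal populations with means $\mu_1$ and $\mu_2$ and with the known variances $\sigma_1^2$ and $\sigma_2^2$, then a $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is
\[ \bar x_1 - \bar x_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]
The margin of error is
\[ z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]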
Example 8.10 A survey of credit card holders revealed that Americans carried an average credit card balance of $3900 in 1995 and $3300 in 1994 (U.S. News & World Report, January 1, 1996). Suppose that these averages are based on random samples of 400 credit card holders in 1995 and 450 credit card holders in 1994 and that the population standard deviations of the balances were $880 in 1995 and $810 in 1994. Construct a 95% confidence interval for the difference between the mean credit card balances for all credit card holders in 1995 and 1994.

For 1995: $n_1 = 400$, $\bar x_1 = \$3900$, $\sigma_1 = \$880$
For 1994: $n_2 = 450$, $\bar x_2 = \$3300$, $\sigma_2 = \$810$
\[ \sigma_{\bar x_1 - \bar x_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{880^2}{400} + \frac{810^2}{450}} = 58.258 \]
A 95% confidence interval for $\mu_1 - \mu_2$ is
\[ (\bar x_1 - \bar x_2) \pm z_{0.025}\,\sigma_{\bar x_1 - \bar x_2} = (3900 - 3300) \pm 1.96(58.258) = 600 \pm 114.19 = (485.81,\ 714.19) \]

Variance unknown

Case I: $\sigma_1^2 = \sigma_2^2 = \sigma^2$

Pooled estimator of $\sigma^2$:
\[ S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \]
- $S_p^2$ is an unbiased estimator of $\sigma^2$;
- $\dfrac{(n_1 + n_2 - 2)S_p^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2)$;
- $\bar X_1 - \bar X_2$ and $S_p^2$ are independent.

Therefore
\[ T = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t(n_1 + n_2 - 2). \]

Confidence interval for $\mu_1 - \mu_2$ when $\sigma_1 = \sigma_2$ unknown: If $\bar x_1$ and $\bar x_2$ are the values of the sample means of independent random samples of sizes $n_1$ and $n_2$ from normal populations with means $\mu_1$ and $\mu_2$ and with the unknown common variance $\sigma_1^2 = \sigma_2^2 = \sigma^2$, then a $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is
\[ \bar x_1 - \bar x_2 \pm t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]
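Both two-sample intervals seen so far (known variances, and unknown but equal variances) work from summary statistics alone, so they are easy to sketch in code. The functions below are an illustration with names of our own choosing. The known-variance case is checked against the credit card balances of Example 8.10. The pooled case is run on hypothetical summary numbers that do not come from these notes, with the t critical value taken from a table.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_z(x1, sigma1, n1, x2, sigma2, n2, conf=0.95):
    """Interval for mu1 - mu2 when both population variances are known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    d = x1 - x2
    return d - z * se, d + z * se


def diff_means_pooled(x1, s1, n1, x2, s2, n2, t_crit):
    """Interval for mu1 - mu2 assuming a common unknown variance.

    t_crit is t_{alpha/2, n1 + n2 - 2}, looked up in a t table by the caller.
    """
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)  # pooled variance
    margin = t_crit * sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    d = x1 - x2
    return d - margin, d + margin


# Example 8.10: credit card balances, population standard deviations known.
credit = diff_means_z(3900, 880, 400, 3300, 810, 450, conf=0.95)
print("credit cards:", credit)

# Hypothetical summary statistics (illustration only): two groups of 10,
# both with sample s.d. 3.0; t_{0.025, 18} = 2.101 from a t table.
pooled = diff_means_pooled(20.0, 3.0, 10, 18.0, 3.0, 10, t_crit=2.101)
print("pooled (hypothetical data):", pooled)
```

When the two sample standard deviations are equal, as in the hypothetical call, the pooled standard deviation is simply that common value.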
Example 8.11 A company claims that its medicine, Brand A, provides faster relief from pain than another company's medicine, Brand B. A researcher tested both brands of medicine on two groups of randomly selected patients. The results of the test are given in the following table. The mean and standard deviation of relief times are in minutes.

Brand   Sample size   Mean   S.d.
A       25            44     13
B       22            49     11

Construct a 95% C.I. for the difference between the mean relief times for the two brands of medicine. Assume that $\sigma_1 = \sigma_2$.
\[ s_p^2 = \frac{24(13)^2 + 21(11)^2}{45} = 146.6, \qquad s_p = 12.1078 \]
A 95% C.I. for $\mu_1 - \mu_2$ is
\[ 44 - 49 \pm 2.0141(12.1078)\sqrt{1/25 + 1/22} = -5 \pm 7.13 = (-12.13,\ 2.13) \]

Example 8.12 An instructor in Math 144 believes that students will not do well if they skip too many classes. To test the claim, the instructor divides the students into a high-attendance group and a low-attendance group.

Group   Sample size   Mean   S.d.
1       69            84.5   12
2       51            77     14

Find a 95% confidence interval for $\mu_1 - \mu_2$.

$s_p^2 = 166.034$, $s_p = 12.8854$. A 95% confidence interval is
\[ 84.5 - 77 \pm 1.98(12.8854)\sqrt{1/69 + 1/51} = 7.5 \pm 4.71 = (2.79,\ 12.21) \]

Case II: $\sigma_1^2 \ne \sigma_2^2$

For this case, we have, approximately,
\[ T = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim t_v, \]
where
\[ v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}. \]
Thus, a $(1-\alpha)100\%$ confidence interval is approximately
\[ \bar x_1 - \bar x_2 \pm t_{\alpha/2,\,v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \]

Example 8.13 Does increasing the amount of calcium in our diet reduce blood pressure? Examination of a large sample of people revealed a relationship between calcium intake and blood pressure, but such observational studies do not establish causation. A randomized comparative experiment gave one group of 10 men a calcium supplement for 12 weeks. The control group of 11 men received a placebo. Below are the data for seated systolic blood pressure.

Calcium group              Placebo group
Begin   End   Decrease     Begin   End   Decrease
107     100     7          123     124    -1
110     114    -4          109      97    12
123     105    18          112     113    -1
129     112    17          102     105    -3
112     115    -5           98      95     3
111     116    -5          114     119    -5
107     106     1          112     114    -2
112     102    10          110     121   -11
136     125    11          117     118    -1
102     125    11          130     133    -3
                           119     114     5

Find a 95% confidence interval for $\mu_1 - \mu_2$.

$n_1 = 10$, $\bar x_1 = 6.1$, $s_1 = 8.81$, $n_2 = 11$, $\bar x_2 = -0.64$, $s_2 = 5.87$

Assuming equal variances,
\[ s_p^2 = \frac{9(8.81)^2 + 10(5.87)^2}{19} = 54.90, \qquad s_p = 7.41. \]
A 95% confidence interval for $\mu_1 - \mu_2$ is
\[ 6.1 + 0.64 \pm 2.093(7.41)\sqrt{\frac{1}{10} + \frac{1}{11}} = 6.74 \pm 6.78 = (-0.04,\ 13.52). \]

Without assuming equal variances: using twosample 95 c1 c2, a 95% C.I. for $\mu_1 - \mu_2$ is $(-0.30,\ 13.77)$, df = 15.
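The unequal-variance interval reported for this example, $(-0.30, 13.77)$ with df = 15, can be reproduced from the Decrease columns as printed in the table. The sketch below is an illustration using only the Python standard library; the function names are our own. The degrees of freedom $v$ are computed from the formula above and rounded down to 15, and the critical value $t_{0.025,15} = 2.131$ is then supplied from a t table.

```python
from math import floor, sqrt
from statistics import mean, variance


def welch_df(sample1, sample2):
    """Approximate degrees of freedom v for the unequal-variance t interval."""
    a = variance(sample1) / len(sample1)  # s1^2 / n1
    b = variance(sample2) / len(sample2)  # s2^2 / n2
    return (a + b) ** 2 / (a ** 2 / (len(sample1) - 1) + b ** 2 / (len(sample2) - 1))


def welch_interval(sample1, sample2, t_crit):
    """Return (lower, upper) for x1_bar - x2_bar +/- t_crit * sqrt(s1^2/n1 + s2^2/n2)."""
    se = sqrt(variance(sample1) / len(sample1) + variance(sample2) / len(sample2))
    d = mean(sample1) - mean(sample2)
    return d - t_crit * se, d + t_crit * se


# Decrease columns exactly as printed in the table above.
calcium = [7, -4, 18, 17, -5, -5, 1, 10, 11, 11]
placebo = [-1, 12, -1, -3, 3, -5, 5, -2, -11, -1, -3]

v = welch_df(calcium, placebo)
df = floor(v)  # round down to a whole number of degrees of freedom
ci = welch_interval(calcium, placebo, t_crit=2.131)  # t_{0.025, 15} from a t table

print(f"v = {v:.2f}, df = {df}")
print("95% interval:", ci)
```

Rounding $v$ down is the conservative choice, since a smaller df gives a slightly larger critical value and hence a slightly wider interval.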
Matched pairs intervals

Consider the problem of comparing two means for samples that are not independent. This situation arises quite naturally when observations occur in pairs.

Given paired data $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, let $d_i = X_i - Y_i$. Then a $100(1-\alpha)\%$ confidence interval for $\mu_X - \mu_Y$ is
\[ \bar d \pm t_{\alpha/2,\,n-1}\, s_d/\sqrt{n}, \]
where $\bar d$ and $s_d$ are the sample mean and the sample standard deviation of $\{d_i\}$.

Example 8.14 It is claimed that a new diet will reduce a person's weight by 4.5 kg on the average in a period of 4 weeks. The weights of 9 women who followed this diet were recorded before and after a 4-week period:

Weight before   Weight after   Difference
58.5            60.0           -1.5
60.3            54.9            5.4
61.7            58.1            3.6
69.0            62.1            6.9
64.0            58.5            5.5
62.6            59.9            2.7
56.7            54.4            2.3
70.4            68.5            1.9
73.2            70.5            2.7

Find a 90% C.I. for the mean weight loss in a 4-week period.

$\bar d = 3.278$, $s_d = 2.475$

A 90% C.I. is
\[ 3.278 \pm 1.86(2.475/3) = 3.278 \pm 1.535 = (1.743,\ 4.813) \]

8.7 Estimation of differences between proportions

Suppose we have two independent samples.
Population   Population proportion   Sample size   Sample proportion
1            $\theta_1$              $n_1$         $\hat\Theta_1$
2            $\theta_2$              $n_2$         $\hat\Theta_2$

The sampling distribution of $\hat\Theta_1 - \hat\Theta_2$:
\[ E(\hat\Theta_1 - \hat\Theta_2) = \theta_1 - \theta_2 \]
and
\[ \mathrm{Var}(\hat\Theta_1 - \hat\Theta_2) = \frac{\theta_1(1-\theta_1)}{n_1} + \frac{\theta_2(1-\theta_2)}{n_2}. \]
By the central limit theorem,
\[ \frac{\hat\Theta_1 - \hat\Theta_2 - (\theta_1 - \theta_2)}{\text{S.E.}} \approx N(0, 1), \]
where
\[ \text{S.E.} = \sqrt{\frac{\hat\Theta_1(1-\hat\Theta_1)}{n_1} + \frac{\hat\Theta_2(1-\hat\Theta_2)}{n_2}}. \]

Confidence interval for $\theta_1 - \theta_2$: If $X_1$ is a binomial random variable with $B(n_1, \theta_1)$, $X_2$ is a binomial random variable with $B(n_2, \theta_2)$, and $X_1$ and $X_2$ are independent, then an approximate $100(1-\alpha)\%$ confidence interval for $\theta_1 - \theta_2$ is
\[ (\hat\theta_1 - \hat\theta_2) \pm z_{\alpha/2}\,\text{s.e.}, \]
where $\hat\theta_1 = x_1/n_1$, $\hat\theta_2 = x_2/n_2$ and
\[ \text{s.e.} = \sqrt{\frac{\hat\theta_1(1-\hat\theta_1)}{n_1} + \frac{\hat\theta_2(1-\hat\theta_2)}{n_2}}. \]

Example 8.15 A study is made to determine if a cold climate results in more students being absent from school during a semester than a warmer climate. Two groups of students are selected at random, one group from Vermont and the other group from Georgia. Of the 300 students from Vermont, 64 were absent at least 1 day during the semester, and of the 400 students from Georgia, 51 were absent 1 or more days. Find a 90% confidence interval for the difference between the fractions of students who are absent in the two states.

$n_1 = 300$, $\hat\theta_1 = 64/300 = 0.2133$
$n_2 = 400$, $\hat\theta_2 = 51/400 = 0.1275$
\[ \text{SE} = \sqrt{\frac{0.2133 \times 0.7867}{300} + \frac{0.1275 \times 0.8725}{400}} = 0.02894 \]
An approximate 90% CI for $\theta_1 - \theta_2$ is
\[ 0.2133 - 0.1275 \pm 1.645 \times 0.02894 = 0.0858 \pm 0.0476 = (0.0382,\ 0.1334) \]

Example 8.16 A clinical trial is conducted to determine if a certain type of inoculation has an effect on the incidence of a certain disease. A sample of 1000 rats was kept in a controlled environment for a period of 1 year and 500 of the rats were given the inoculation. Of the group not given the drug, there were 120 incidences of the disease, while 98 of the inoculated group contracted it. If we call $p_1$ the probability of incidence of the disease in uninoculated rats and $p_2$ the probability of incidence after receiving the drug, compute a 90% confidence interval for $p_1 - p_2$.

\[ 0.0011 < p_1 - p_2 < 0.0869 \]
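The interval for a difference of two proportions can be checked in a few lines. The sketch below is an illustration using the Python standard library; the function name `diff_props_interval` is our own. It reproduces the 90% intervals for the school-absence study (64 of 300 versus 51 of 400) and for the inoculation trial (120 of 500 versus 98 of 500).

```python
from math import sqrt
from statistics import NormalDist


def diff_props_interval(x1, n1, x2, n2, conf=0.90):
    """Approximate interval for theta1 - theta2 from two independent binomial counts."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.645 for 90%
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se


absences = diff_props_interval(64, 300, 51, 400)  # Vermont vs Georgia
rats = diff_props_interval(120, 500, 98, 500)     # uninoculated vs inoculated

print("absences:", absences)
print("rats:", rats)
```

Both intervals lie entirely above 0, although the lower end of the second one is very close to 0.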