Interval estimation. October 3,
Outline: Basic ideas; CLT and CI; CI for a population mean; CI for a population proportion; CI for a Normal mean

Interval estimation October 3, 2018 STAT 151 Class 7 Slide 1

Pandemic data
Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered and 0 = not recovered
1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1
1 1 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 1 1 0
0 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 1 0
1 0 1 0 1 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1
1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 1
A probability model for treatment outcome:
Outcome                 Probability
1 (recovers)            p
0 (does not recover)    1 - p
The maximum likelihood estimate of p is $\hat{p} = 60/100 = 0.6$, a point estimate.
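The maximum likelihood estimate on this slide can be checked numerically. The sketch below is only an illustration: it takes the counts from the slide (60 recoveries among 100 patients), and the function name `log_likelihood` and the grid of candidate values are our own choices, not part of the lecture.

```python
import math

def log_likelihood(p: float, recovered: int, n: int) -> float:
    """Bernoulli log-likelihood for `recovered` ones among n outcomes."""
    return recovered * math.log(p) + (n - recovered) * math.log(1 - p)

# Counts taken from the slide: 60 recoveries among n = 100 patients.
recovered, n = 60, 100

# Search a grid of candidate values strictly between 0 and 1 (grid is illustrative).
grid = [i / 1000 for i in range(1, 1000)]
p_mle_grid = max(grid, key=lambda p: log_likelihood(p, recovered, n))

# The closed-form MLE is the sample proportion.
p_mle_closed = recovered / n

print(p_mle_grid, p_mle_closed)
```

Both routes should agree, which is why the slide can simply report the sample proportion as the MLE.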

Interval estimation
The MLE $\hat{p}$ comes from a sample of n = 100 patients.
Recall that any estimate incurs sampling error [a], defined by: point estimate - unknown = $\hat{p} - p$.
To account for the uncertainty due to sampling error, we estimate that p lies within $(\hat{p} - a, \hat{p} + a)$, $a > 0$, where a is called the margin of error.
The interval width can be adjusted:
large a (wider interval): more certain that $p \in (\hat{p} \pm a)$
small a (shorter interval): less certain that $p \in (\hat{p} \pm a)$
[a] Sampling error arises because we use a sample (only a part of the population) to infer about the entire population.

Confidence interval (CI)
Our new type of estimate is called a confidence interval estimate [a].
There are two basic components in a confidence interval estimate:
(a) Level of confidence: a measure of our level of belief
(b) Margin of error: a measure of the precision of our estimate
For the pandemic example, we wish to say something like: we are 95% confident that the population proportion p is within $0.6 \pm a$. In that case
(a) Level of confidence = 95%
(b) Margin of error = a
How do we determine a?
[a] Sometimes simply called an interval estimate.

Sampling distribution
Some facts:
Estimates [a] from different random samples form a sampling distribution around the unknown p (see the gray x-marks in the figure).
$\hat{p} = 0.6$ (the green x-mark), from the observed data, behaves like the gray x-marks.
Anything in gray is not observed.
[Figure: x-marks scattered around the unknown p. Gray x-marks: estimates from different samples. Green x-mark: $\hat{p} = 0.6$ from the observed sample.]
[a] Assuming an unbiased or consistent estimator, so there is no systematic over- or under-estimation.
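The gray x-marks can be imitated by simulation. This is a rough sketch rather than anything from the lecture: the true p is unknown in practice, so setting it to 0.6 here is purely an assumption that lets the simulation run, and `simulate_estimates` is a hypothetical helper name.

```python
import random
import statistics

def simulate_estimates(p: float, n: int, n_samples: int, seed: int = 0) -> list[float]:
    """Draw n_samples random samples of size n and return each sample proportion."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        sample = [1 if rng.random() < p else 0 for _ in range(n)]
        estimates.append(sum(sample) / n)
    return estimates

# ASSUMPTION: the true p is set to 0.6 only so that the simulation can run.
estimates = simulate_estimates(p=0.6, n=100, n_samples=2000)
centre = statistics.mean(estimates)
spread = statistics.stdev(estimates)
print(f"mean of estimates = {centre:.3f}, SD of estimates = {spread:.3f}")
```

The estimates should cluster around the assumed p, with a spread close to $\sqrt{p(1-p)/n}$; any single observed $\hat{p}$ is one draw from this cloud.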

Central Limit Theorem (CLT)
Let $\hat{\theta}$ be the sample estimate of a population characteristic $\theta$. If $\hat{\theta}$ is obtained using a well-behaved estimator and given a sufficiently large sample (of n independent randomly drawn observations), then the sampling distribution of $\hat{\theta}$ is approximately normal with mean $\theta$ and variance $\mathrm{var}(\hat{\theta})$.
[Figure: x-marks scattered around the unknown p. Gray x-marks: estimates from different samples. Green x-mark: $\hat{p}$ from the observed sample.]
Using the empirical rule: we can be 95% certain that $\hat{p}$ lies within $p \pm 2\sqrt{\mathrm{var}(\hat{p})} = p \pm 2\,\mathrm{SE}(\hat{p})$ [a].
[a] 2 is an approximation; a more exact value is 1.96.
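The CLT claim can be probed with a population that is clearly not normal. The sketch below is an illustration under stated assumptions: it uses an Exponential(rate 1) population (mean 1, SD 1), which is our choice and not an example from the lecture, and counts how often the sample mean falls within $\theta \pm 1.96\,\mathrm{SE}$.

```python
import math
import random
import statistics

def fraction_within(n: int, n_samples: int, z: float = 1.96, seed: int = 1) -> float:
    """Fraction of simulated sample means within theta +/- z * SE of the true mean."""
    rng = random.Random(seed)
    theta = 1.0               # mean of an Exponential(rate=1) population (assumed population)
    se = 1.0 / math.sqrt(n)   # SD of that population is 1, so SE = 1 / sqrt(n)
    hits = 0
    for _ in range(n_samples):
        theta_hat = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        if abs(theta_hat - theta) <= z * se:
            hits += 1
    return hits / n_samples

frac = fraction_within(n=100, n_samples=5000)
print(f"fraction within theta +/- 1.96 SE: {frac:.3f}")
```

With n = 100 the fraction should land near 0.95 even though the individual observations are strongly skewed.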

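Large sample 95% confidence interval for p
"We can be 95% certain that $\hat{p}$ lies within $p \pm 2\,\mathrm{SE}(\hat{p})$" [a] translates into notation as
$p - 1.96\,\mathrm{SE}(\hat{p}) < \hat{p} < p + 1.96\,\mathrm{SE}(\hat{p})$
Subtracting $p$ and $\hat{p}$ from every part:
$p - 1.96\,\mathrm{SE}(\hat{p}) - p - \hat{p} < \hat{p} - p - \hat{p} < p + 1.96\,\mathrm{SE}(\hat{p}) - p - \hat{p}$
$-1.96\,\mathrm{SE}(\hat{p}) - \hat{p} < -p < 1.96\,\mathrm{SE}(\hat{p}) - \hat{p}$
Multiplying by -1, which reverses the inequalities:
$\hat{p} - 1.96\,\mathrm{SE}(\hat{p}) < p < \hat{p} + 1.96\,\mathrm{SE}(\hat{p})$
We are 95% certain that p is within $\hat{p} \pm 1.96\,\mathrm{SE}(\hat{p})$.
The level of confidence is 95% and the margin of error is $1.96\,\mathrm{SE}(\hat{p})$.
[a] The more exact value of 1.96 is used here.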

Large sample 95% confidence interval for any quantity
Given a sufficiently large random sample (of n independent observations) from a population, let $\hat{\theta}$ be the sample estimate of a population characteristic $\theta$. If $\hat{\theta}$ is obtained from a well-behaved estimator, then an approximate 95% confidence interval for $\theta$ is given by
$\hat{\theta} \pm 1.96\,\mathrm{SE}(\hat{\theta})$
Why 95%?
Using the empirical rule: we can be 90% certain that $\hat{p}$ lies within $p \pm 1.64\,\mathrm{SE}(\hat{p})$, so $\hat{p} \pm 1.64\,\mathrm{SE}(\hat{p})$ is a 90% confidence interval.
We can form intervals at many confidence levels from the same set of data.
Every study should have one conclusion. A meaningful interval should have:
(a) a high level of confidence
(b) a width that is not too wide
Due to (a) and (b), we often use a 95% confidence interval.
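The multipliers 1.96 and 1.64 come from the standard normal distribution, and the same recipe gives an interval at any confidence level. A minimal sketch, assuming Python's standard library `statistics.NormalDist`; the helper names `z_multiplier` and `large_sample_ci` are ours, and the 99% level is included only for comparison. The slide's 1.64 is a rounding of roughly 1.645.

```python
from statistics import NormalDist

def z_multiplier(level: float) -> float:
    """Two-sided standard normal multiplier for a confidence level such as 0.95."""
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    return NormalDist().inv_cdf(0.5 + level / 2)

def large_sample_ci(estimate: float, se: float, level: float = 0.95) -> tuple[float, float]:
    """Approximate CI: estimate +/- z * SE, valid when the CLT applies."""
    margin = z_multiplier(level) * se
    return estimate - margin, estimate + margin

# Pandemic example: p-hat = 0.6 with estimated SE = sqrt(0.6 * 0.4 / 100).
se = (0.6 * 0.4 / 100) ** 0.5
for level in (0.90, 0.95, 0.99):
    low, high = large_sample_ci(0.6, se, level)
    print(f"{level:.0%}: z = {z_multiplier(level):.3f}, CI = ({low:.3f}, {high:.3f})")
```

Raising the confidence level widens the interval, which is the trade-off behind points (a) and (b).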

Interpretation of a confidence level
A confidence interval (CI) is a method for finding a plausible range for p. Each time a CI is calculated using a random sample, we obtain a different interval. For example, a 95% CI has the following property: if the method is used repeatedly, then 95% of the intervals will actually include p.
However, each time a 95% CI is calculated, the chance that p is included in that particular interval is NOT 95%; it is either
0% (p not inside CI, wrong estimate!) or
100% (p inside CI, correct estimate!).
Therefore, our confidence in our interval is based on the fact that it may be one of the 95 (out of 100) that actually include the unknown.
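The repeated-use property can be illustrated by simulation. This is a sketch under an assumption: the true p is fixed at 0.6 so that we can check whether each interval contains it, which is never possible with real data. The function name `coverage` is hypothetical.

```python
import math
import random

def coverage(p: float, n: int, n_repeats: int, seed: int = 2) -> float:
    """Fraction of simulated 95% intervals p_hat +/- 1.96 * SE that contain the true p."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_repeats):
        # One random sample of n outcomes, summarised by its sample proportion.
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        # Each individual interval either contains p or it does not.
        if p_hat - margin <= p <= p_hat + margin:
            covered += 1
    return covered / n_repeats

# ASSUMPTION: true p = 0.6, chosen only for the simulation.
rate = coverage(p=0.6, n=100, n_repeats=4000)
print(f"share of intervals containing p: {rate:.3f}")
```

The share should be close to 0.95, while each single interval is simply right or wrong.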

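Large sample 95% confidence interval for a population mean
If $\bar{X}$ is a point estimate of µ, an approximate 95% confidence interval is:
$\hat{\mu} \pm 1.96\,\mathrm{SE}(\hat{\mu})$, that is, $\bar{X} \pm 1.96\,\mathrm{SE}(\bar{X})$
The interval estimate can be completed by working out $\mathrm{SE}(\bar{X})$. The variance of $\bar{X}$ between samples is
$\mathrm{var}(\bar{X}) = \mathrm{var}\left(\frac{X_1 + \dots + X_n}{n}\right)$
$= \frac{1}{n^2}\,\mathrm{var}(X_1 + \dots + X_n)$
$= \frac{1}{n^2}\,[\mathrm{var}(X_1) + \dots + \mathrm{var}(X_n)]$   ($X_1, \dots, X_n$ are independent)
$= \frac{1}{n^2}\, n\,\mathrm{var}(X)$   ($\mathrm{var}(X_1) = \dots = \mathrm{var}(X_n) = \mathrm{var}(X)$)
$= \frac{\mathrm{var}(X)}{n}$   (depends on $\mathrm{var}(X)$ and n)
$\mathrm{SE}(\bar{X}) = \sqrt{\mathrm{var}(\bar{X})} = \mathrm{SD}(X)/\sqrt{n}$ depends on
(1) SD(X), how different the values of X are in the population
(2) n, the sample size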
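The two ingredients of $\mathrm{SE}(\bar{X})$ can be seen in a few lines. This is a small sketch: the value SD(X) = 10 is made up for illustration, and `standard_error_of_mean` is our own helper name.

```python
import math

def standard_error_of_mean(sd: float, n: int) -> float:
    """SE of the sample mean: sqrt(var(X) / n) = SD(X) / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sd / math.sqrt(n)

# ASSUMPTION: SD(X) = 10 is a made-up value used only for illustration.
for n in (25, 100, 400):
    print(n, standard_error_of_mean(10.0, n))
```

Because n enters through a square root, quadrupling the sample size only halves the standard error.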

Woman's wage data example
Suppose we wish to estimate µ = mean hours of work for all working white married women in the US in 1975-1976.
Available data:
$\bar{X}$      $s = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$      n
1303      776.2744                                                     428
Since SD(X) is unknown, an approximate 95% confidence interval for µ is
$\bar{X} \pm 1.96\, s/\sqrt{n}$
Using the data gives
$1303 \pm 1.96 \times 776.2744/\sqrt{428} \approx 1303 \pm 74 = (1229, 1377)$
In the approximate 95% confidence interval, 1229 and 1377 hours are, respectively, the lower and upper confidence limits; the margin of error is 74.
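The arithmetic for the wage example can be reproduced from the three summary statistics on the slide. A minimal sketch; the function name `mean_ci_95` is ours, and the unrounded margin comes out a little under the slide's rounded value of 74.

```python
import math

def mean_ci_95(xbar: float, s: float, n: int) -> tuple[float, float, float]:
    """Return (lower, upper, margin) for xbar +/- 1.96 * s / sqrt(n)."""
    margin = 1.96 * s / math.sqrt(n)
    return xbar - margin, xbar + margin, margin

# Summary statistics from the slide: sample mean, sample SD, sample size.
lower, upper, margin = mean_ci_95(xbar=1303, s=776.2744, n=428)
print(f"margin = {margin:.2f}, CI = ({lower:.1f}, {upper:.1f})")
```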

What is a proportion? Pandemic example (2)
p is the proportion in the population of N patients who would recover, where
X = 1 if a patient recovers, X = 0 if not.
Suppose the values of X in the population are $X_1 = 1$ (recovers), $X_2 = 0$ (does not recover), $X_3 = 0$, ..., $X_N = 1$, which is a collection of 1's and 0's. Then
$p = \frac{\#\,1\text{'s}}{N} = \frac{1 + 0 + 0 + \dots + 1}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \mu$
Hence a proportion is a special case of µ with only 1's and 0's.

Sampling to estimate a proportion: Pandemic example (3)
We take a sample $X_1, \dots, X_n$ and estimate $p \equiv \mu$ using $\hat{p} \equiv \bar{X} = \frac{X_1 + \dots + X_n}{n}$, where each of $X_1, \dots, X_n$ is
1 with probability p
0 with probability 1 - p
An approximate 95% confidence interval for p, as a special case of µ, is
$\hat{p} \pm 1.96\,\mathrm{SE}(\hat{p})$, that is, $\bar{X} \pm 1.96\,\mathrm{SE}(\bar{X}) = \bar{X} \pm 1.96\,\mathrm{SD}(X)/\sqrt{n}$
with
$\mathrm{var}(X) = E(X^2) - E(X)^2 = [(1)^2 p + (0)^2 (1 - p)] - p^2 = p - p^2 = p(1 - p)$   (here $E(X)^2 = \mu^2 = p^2$)
Hence, an approximate 95% confidence interval for p is:
$\hat{p} \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$
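The variance step can be checked directly from the definition of the moments of a 0/1 variable. This is only a sketch; `bernoulli_moments` is a hypothetical helper and the values of p are arbitrary illustrations.

```python
def bernoulli_moments(p: float) -> tuple[float, float]:
    """Return (mean, variance) of X taking 1 with probability p and 0 with probability 1 - p."""
    mean = 1 * p + 0 * (1 - p)                  # E(X)
    second_moment = 1**2 * p + 0**2 * (1 - p)   # E(X^2)
    variance = second_moment - mean**2          # E(X^2) - E(X)^2
    return mean, variance

# Compare the moment-based variance with the closed form p(1 - p).
for p in (0.1, 0.5, 0.6, 0.9):
    mean, variance = bernoulli_moments(p)
    print(p, variance, p * (1 - p))
```

The two columns should agree up to floating-point rounding, matching $\mathrm{var}(X) = p(1 - p)$.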

Pandemic example (4)
Available data:
$\hat{p} = \bar{X}$      $s = \sqrt{\hat{p}(1 - \hat{p})/n}$      n
60/100 = 0.6        $\sqrt{0.6(1 - 0.6)/100}$            100
Since p(1 - p) is unknown, an approximate 95% confidence interval for p is
$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
which, using the data, gives
$0.6 \pm 1.96\sqrt{\frac{0.6(0.4)}{100}} \approx 0.6 \pm 0.096 = (0.504, 0.696)$.
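The same calculation in code, as a minimal sketch using the counts from the slide; the function name `proportion_ci_95` is ours.

```python
import math

def proportion_ci_95(successes: int, n: int) -> tuple[float, float]:
    """Large-sample 95% CI for a proportion: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 60 recoveries among 100 patients, as on the slide.
low, high = proportion_ci_95(successes=60, n=100)
print(f"({low:.3f}, {high:.3f})")
```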

A normal population mean
Interval estimates that rely on the CLT require a large sample size n.
There is no general expression for small n, except when estimating a normal population mean µ with SD(X) known, in which case the following is still valid:
$\bar{X} \pm 1.96\,\mathrm{SE}(\bar{X}) = \bar{X} \pm 1.96\,\mathrm{SD}(X)/\sqrt{n}$
SD(X) is often unknown and is replaced by s to give:
$\bar{X} \pm t\, s/\sqrt{n}$, where $t \ge 1.96$
t stretches the interval to compensate for the extra uncertainty that comes from estimating SD(X) poorly by s when n is small.
The amount of stretching depends on n; a small n requires more stretching.

Example
Suppose we wish to estimate average household expenditure in a population, with available data
$\bar{X}$      $s = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$      n
1924.9    223.1021                                                     10
Since n is small, we assume household expenditure follows a normal distribution.
To find an approximate 95% confidence interval, we need to use a t value that depends on the degrees of freedom (df), defined as df = n - 1.
df = n - 1    6      7      8      9      10     20     120    >120
t value       2.447  2.365  2.306  2.262  2.228  2.086  1.98   1.96
In this example, n = 10, which gives df = 10 - 1 = 9; so we choose the value 2.262 in the table to replace 1.96, giving
$1924.9 \pm 2.262 \times 223.1021/\sqrt{10} \approx 1924.9 \pm 159.6 = (1765.3, 2084.5)$
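The table lookup and the interval can be sketched as follows. The t values are copied from the slide's table; the function name `t_interval_95` and the choice to raise an error for df values missing from the table are our own assumptions. Carried to more decimal places, the margin comes out close to 159.6, so the last digit can differ slightly from a hand-rounded figure.

```python
import math

# t values copied from the table on the slide, keyed by df = n - 1.
T_VALUES_95 = {6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 20: 2.086, 120: 1.98}

def t_interval_95(xbar: float, s: float, n: int) -> tuple[float, float, float]:
    """Return (lower, upper, margin) for xbar +/- t * s / sqrt(n), with t looked up by df."""
    df = n - 1
    if df > 120:
        t = 1.96  # the table's last column
    elif df in T_VALUES_95:
        t = T_VALUES_95[df]
    else:
        # The slide's table is sparse; this sketch does not interpolate.
        raise KeyError(f"df = {df} is not in the table")
    margin = t * s / math.sqrt(n)
    return xbar - margin, xbar + margin, margin

# Household expenditure summary statistics from the slide.
lower, upper, margin = t_interval_95(xbar=1924.9, s=223.1021, n=10)
print(f"margin = {margin:.2f}, CI = ({lower:.1f}, {upper:.1f})")
```

Using 1.96 instead of 2.262 here would give a narrower interval, which understates the uncertainty when n is only 10.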