Business Statistics. Lecture 5: Confidence Intervals

Goals for this Lecture
- Confidence intervals
- The t distribution

Welcome to Interval Estimation!
Moments
Statistic        Value      Note
Mean             815.0340   Sample mean
Std Dev          0.8923     Sample SD
Std Error Mean   0.0892     Last class
Upper 95% Mean   815.2111   This class
Lower 95% Mean   814.8569   This class
N                100.0000   Sample size
Sum Weights      100.0000

Hmmm...
- In the motor shaft case, we built a model using $\bar{X}$ and $s$
- We treated them like population parameters
- What if they were way off? How would we know?
- E.g., what if $\bar{X} = 820$ for the 100 shafts and we decided the process was not capable?
- How sure are we that $\mu$ would give the same result?

Inference: Making Educated Guesses
- We want to use a sample to make guesses about a larger population
- Samples are variable (we'd get different values if we took a different sample), so our guesses are uncertain
- We want to guess in such a way that:
  - There is a chance we guessed right
  - We know what that chance is

General Strategy for Guessing
- Pick a statistic similar to the parameter you want to guess
- Figure out what the sampling distribution of your statistic looks like
- Use the sampling distribution to assess the quality of your guess
- We'll start by guessing averages, because they're easy

Assumption
- Data are collected as a simple random sample:
  - Every unit in the population is equally likely to be chosen (don't just look at the 800 most highly paid CEOs)
  - Choosing one unit does not change the relative chances for another unit to be chosen
- Other sampling schemes require different techniques

Last Class: Distribution of Averages
[Figure: two overlaid histograms over ShaftDiam (812 to 818). "Individual": histogram of all future shafts, SD = s. "Mean of 5": histogram of all possible means of five future shafts, SE = s/sqrt(5).]
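The relationship SE = s/sqrt(5) for means of five shafts can be checked by simulation. The sketch below is only illustrative: it assumes a normal population with mean 815 and SD 0.9 (made-up values chosen to resemble the shaft data, not estimates from the lecture), and the function name `simulate_means` is hypothetical.

```python
import random
import statistics

def simulate_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]

# Assumed population values (illustrative only)
MU, SIGMA, N = 815.0, 0.9, 5

means = simulate_means(MU, SIGMA, N, reps=20000)
sd_of_means = statistics.stdev(means)   # spread of the simulated sample means
theoretical_se = SIGMA / N ** 0.5       # SD / sqrt(n)

print(f"SD of simulated means: {sd_of_means:.3f}")
print(f"SD / sqrt(n):          {theoretical_se:.3f}")
```

The two printed numbers should be close, which is the sense in which the histogram of means is narrower than the histogram of individual shafts.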

Guessing $\mu$, the Population Mean
- Best guess for $\mu$ is $\bar{X}$
- If you always guess $\bar{X}$ for $\mu$, you will never be right!
- You can guess with a confidence interval and be right some of the time
  - Narrow intervals: higher chance of being wrong
  - Wide intervals: less useful

Main Idea
- Because of the CLT, we know that $\bar{X}$ is within 2 SEs of $\mu$ 95% of the time
- Alternatively, $\mu$ is within 2 SEs of $\bar{X}$ 95% of the time
[Figure: over ShaftDiam (812 to 818), the (unobserved) distribution of the population ("Individual") and the (unobserved) distribution of the sample mean ("Mean of 5"), with the unobserved population mean $\mu$, the sample mean $\bar{X}$, and the 95% confidence interval for the population mean marked.]

How to Guess
- Choose a probability of being wrong: $\alpha$
- Find $z$ on the normal table so that $\alpha/2$ of the probability is above $z$ and $\alpha/2$ is below $-z$
  - Example: if $\alpha = 5\%$ then $z = 1.96$
- Calculate $\bar{X}$ and $s/\sqrt{n}$
- Your interval is $\bar{X} \pm z \, s/\sqrt{n}$
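The table lookup for $z$ can also be done in software. This is a small sketch using Python's standard library; `z_critical` is a hypothetical helper name, not something from the lecture or from JMP.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return z such that alpha/2 of the standard normal lies above z."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# A few common choices for the probability of being wrong
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  ->  z = {z_critical(alpha):.3f}")
```

For alpha = 0.05 this reproduces the z = 1.96 quoted on the slide.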

Example: Shaft Process
- Sample mean: $\bar{x} = 815.034$
- Sample SD: $s = 0.8923$
- Number of shafts: $n = 100$
- SE: $s_{\bar{X}} = s/\sqrt{n} = 0.8923/\sqrt{100} = 0.08923$
- 95% confidence interval for $\mu$: $\bar{x} \pm 1.96 \, s/\sqrt{n}$
- With numbers: $815.034 \pm 1.96(0.08923) = [814.859, 815.209]$
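The arithmetic in this example can be reproduced in a few lines. This is a sketch using the slide's numbers (mean taken as 815.034 from the JMP output); `z_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, s, n, alpha=0.05):
    """Normal-based interval: xbar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se = s / sqrt(n)                          # standard error of the mean
    return xbar - z * se, xbar + z * se

lower, upper = z_interval(xbar=815.034, s=0.8923, n=100)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```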

JMP: Shaft Diameters
Statistic        Value      Note
Mean             815.0340   Best guess at $\mu$
Std Dev          0.8923     Best guess at $\sigma$
Std Error Mean   0.0892
Upper 95% Mean   815.2111   95% confidence interval for $\mu$ (upper end)
Lower 95% Mean   814.8569   95% confidence interval for $\mu$ (lower end)
N                100.0000
Sum Weights      100.0000
- Loosely speaking, there's a 95% chance that an interval built this way contains $\mu$
- There is a 5% chance it does not

So, What is a Confidence Interval?
- It's an interval around our sample statistic that shows how variable the sample statistic is
  - Narrow: real (population) value unlikely to be far away
  - Wide: little information about the population value
- Two CIs for our example:
[Figure: confidence interval #1 and confidence interval #2 drawn around the observed sample mean on an axis from 814 to 816.]

Again, What is a Confidence Interval?
- A confidence interval is a random interval
  - Random because it is a function of a random variable ($\bar{X}$)
- Confidence level is the long-run percentage of intervals that will cover the population parameter
- It is not the probability that the interval contains the true parameter!

A Simulation
[Figure: 95% confidence intervals for 50 samples, mean = 50, sd = 10, n = 5; vertical axis from 10 to 100. Intervals not including the population mean: 2.]

Another Simulation
[Figure: 95% confidence intervals for 200 samples, mean = 50, sd = 10, n = 595; vertical axis from 20 to 80. Intervals not including the population mean: 10.]
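The long-run reading of the confidence level can be checked with a simulation along the same lines. The sketch below is an assumption-laden illustration, not the code behind the slides: it uses the first simulation's settings (mean = 50, sd = 10, n = 5), draws from a normal population, and treats the population SD as known so that the z value 1.96 applies. The function name `coverage` is hypothetical.

```python
import random
from math import sqrt
from statistics import fmean

Z_95 = 1.96  # z value for 95% confidence

def coverage(mu, sigma, n, reps, seed=1):
    """Fraction of simulated intervals xbar +/- 1.96*sigma/sqrt(n) that cover mu."""
    rng = random.Random(seed)
    margin = Z_95 * sigma / sqrt(n)   # same width every time because sigma is known
    hits = 0
    for _ in range(reps):
        xbar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

rate = coverage(mu=50, sigma=10, n=5, reps=20000)
print(f"Fraction of intervals covering the population mean: {rate:.3f}")
```

Any single interval either covers the mean or it does not; the 95% describes the fraction over many repetitions, which is what the printed rate approximates.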

Deriving a Confidence Interval (1)
- Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with unknown mean $\mu$ and known standard deviation $\sigma$
- Create a CI for $\mu$ based on the sampling distribution of the mean: $\bar{X} \sim N(\mu, \sigma^2/n)$
- To start, we know that (via standardizing): $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

Deriving a Confidence Interval (2)
- Now for $Z \sim N(0, 1)$ we know $\Pr(-1.96 \le Z \le 1.96) = 0.95$
  - That is, there is a 95% probability that the random variable $Z$ lies in this fixed interval
- Thus $\Pr\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$
- And, after some algebra, $\Pr\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
- Now we say we are 95% confident that this random interval covers the fixed (unknown) $\mu$
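For reference, the "some algebra" step can be spelled out as follows. Starting from the standardized inequality, multiply through by $\sigma/\sqrt{n}$, subtract $\bar{X}$, then multiply by $-1$ (which reverses the inequalities):

$$
\begin{aligned}
-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96
&\iff -1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le 1.96\frac{\sigma}{\sqrt{n}} \\
&\iff -\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \\
&\iff \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}
\end{aligned}
$$

Each line describes the same event, so the probability stays at 0.95 throughout.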

Deriving a Confidence Interval (3)
- So, if $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ are observed values of a random sample from a $N(\mu, \sigma^2)$ population, then $\bar{x} \pm 1.96 \, \sigma/\sqrt{n}$ is a 95% confidence interval for $\mu$
- We can be 95% confident that the interval covers the population mean
- Interpretation: in the long run, 19 times out of 20 the interval will cover the true mean and 1 time out of 20 it will not

But Something's Fishy...
- Why make all the fuss about the mean being random if we treat the SD as known?
- Since $\sigma$ is a population quantity, we have to estimate it with $s$
- Should that make the intervals wider or narrower?
- When $\sigma$ is unknown (almost always), use the t distribution rather than the normal

The t Distribution
[Figure: density curves for the normal distribution and for t distributions with 3, 10, and 100 degrees of freedom (T3, T10, T100); horizontal axis: number of SEs from the mean, from -4 to 4.]

Degrees of Freedom (df)
- The more degrees of freedom we have, the better we can estimate $\sigma$
- The better we estimate $\sigma$, the closer we are to $\sigma$ being known
- Thus, the more df we have, the closer t values are to z values
- Calculating degrees of freedom:
  - Each observation adds one degree of freedom
  - One degree of freedom is used up when we calculate $\bar{X}$
  - There are $n - 1$ degrees of freedom left

How to (Really) Guess
- Choose a probability of being wrong: $\alpha$
- Calculate $\bar{X}$ and $s/\sqrt{n}$
- For df = $n - 1$, find $t$ from Table A3-5 (page 496) for $p = 1 - \alpha$
- Then, the confidence interval is $\bar{X} \pm t \, s/\sqrt{n}$

Table A3-5 Example
- For $\alpha = 0.05$ and df = 100, we have $t = 1.984$
- Notation: $t_{df,\,\alpha/2} = t_{n-1,\,\alpha/2}$, so here $t_{100,\,0.025} = 1.984$
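The table value can be cross-checked numerically, which also shows t values approaching z as df grows. The sketch below sticks to Python's standard library, so it integrates the t density with Simpson's rule and finds the quantile by bisection; this is one simple way to do it, not how the textbook table or JMP computes it. The names `t_cdf` and `t_critical` are hypothetical.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x > 0: 0.5 plus Simpson's rule on the t density over [0, x]."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_critical(df, alpha, lo=0.0, hi=50.0):
    """Bisect for the t value with alpha/2 of the probability above it."""
    target = 1 - alpha / 2
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

results = {df: t_critical(df, 0.05) for df in (3, 10, 100)}
for df, t in results.items():
    print(f"df = {df:>3}:  t = {t:.3f}")
print(f"normal:    z = {NormalDist().inv_cdf(0.975):.3f}")
```

With df = 100 the result agrees with the 1.984 read from the table, and the values shrink toward 1.96 as df increases.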

Shaft Diameters Example Redux
Statistic        Value      Note
Mean             815.0340   Best guess at $\mu$
Std Dev          0.8923     Best guess at $\sigma$
Std Error Mean   0.0892
Upper 95% Mean   815.2111   95% confidence interval for $\mu$ (upper end)
Lower 95% Mean   814.8569   95% confidence interval for $\mu$ (lower end)
N                100.0000
Sum Weights      100.0000
- $\bar{x} \pm 1.984 \, s/\sqrt{n} = 815.034 \pm 1.984 \times 0.8923/\sqrt{100} = [814.857, 815.211]$
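The same calculation as a short script, using the t value 1.984 read from the table rather than computing it. `t_interval` is a hypothetical helper name; the numbers are the ones on the slide.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """t-based interval: xbar +/- t * s / sqrt(n), with t supplied by the caller."""
    se = s / sqrt(n)          # standard error of the mean
    margin = t_crit * se      # margin of error
    return xbar - margin, xbar + margin

# t_crit = 1.984 is the table value for alpha = 0.05 quoted in the lecture
lower, upper = t_interval(xbar=815.034, s=0.8923, n=100, t_crit=1.984)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

The endpoints match the "Lower 95% Mean" and "Upper 95% Mean" rows of the JMP output to three decimals.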

How Confidence Intervals Behave
- Width of CIs: $w = 2 \, t_{n-1,\,\alpha/2} \, s/\sqrt{n}$
- Margin of error: $E = t_{n-1,\,\alpha/2} \, s/\sqrt{n}$
- Bigger SD -> bigger SE -> wider intervals
- Bigger sample size -> smaller SE -> narrower intervals
- Bigger sample size -> smaller t values -> narrower intervals
- Higher confidence -> bigger t values -> wider intervals
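These relationships are easy to see numerically. The sketch below simplifies by using z in place of t (reasonable only for large n), so it shows the effect of sample size and confidence level on the margin of error E but not the extra narrowing from smaller t values. The SD is the shaft example's s = 0.8923; `margin_of_error` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(s, n, crit):
    """E = crit * s / sqrt(n), where crit is the z (or t) multiplier."""
    return crit * s / sqrt(n)

S = 0.8923                            # sample SD from the shaft example
z95 = NormalDist().inv_cdf(0.975)     # multiplier for 95% confidence
z99 = NormalDist().inv_cdf(0.995)     # multiplier for 99% confidence

# Bigger sample size -> narrower interval (4x the data halves the margin)
for n in (25, 100, 400):
    print(f"n = {n:>3}:  E = {margin_of_error(S, n, z95):.4f}")

# Higher confidence -> wider interval
print(f"95% margin at n = 100: {margin_of_error(S, 100, z95):.4f}")
print(f"99% margin at n = 100: {margin_of_error(S, 100, z99):.4f}")
```

Because n enters through a square root, cutting the margin in half takes four times as many observations.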

t vs. z
- Use t when you don't know $\sigma$
- The t distribution assumes the data are normally distributed
- Options if data are not normally distributed:
  - Transform the data (logarithms)
  - If transformations don't work and sample size is big (n > 30), ignore the problem
  - If transformations don't work and sample size is small, read the book about nonparametric tests

Example (CompPur.jmp)
- Manufacturer of consumer electronics: how many households will purchase a computer in the next year?
- Use survey to collect responses from 100 households
- To justify sales projections, management needs the proportion to be at least 25%
- Should management revise sales projections?

Example, continued
- Survey results:
Frequencies
Level   Count   Prob
No      86      0.86000
Yes     14      0.14000
Total   100     1.00000
2 Levels
- Another sample would likely have given a different result
- What we want to know is, based on this result, where could the true proportion lie?

Example, continued
- When data are 0s and 1s, they are REALLY not normal
[Figure: 95% CI for the true proportion, plotted on an axis from 0 to 1.1.]
Statistic        Value
Mean             0.1400
Std Dev          0.3487
Std Error Mean   0.0349
Upper 95% Mean   0.2092
Lower 95% Mean   0.0708
N                100.0000
Sum Weights      100.0000
- Rule of thumb: at least 30 observations, 5 successes, and 5 failures lets the CLT kick in
- No difference between means and proportions!
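Treating the survey answers as 0s and 1s, the interval in the JMP output can be reproduced with the same mean-based recipe. This is a sketch under two assumptions: the sample SD uses the usual n - 1 divisor, and the t value is the table's 1.984. The name `proportion_interval` is hypothetical.

```python
from math import sqrt

def proportion_interval(successes, n, t_crit):
    """CI for a proportion, computed as a CI for the mean of 0/1 data."""
    p_hat = successes / n
    # Sample SD of 0/1 data: sum of squared deviations is n*p*(1-p), divided by n - 1
    s = sqrt(n / (n - 1) * p_hat * (1 - p_hat))
    margin = t_crit * s / sqrt(n)
    return p_hat - margin, p_hat + margin

# 14 "Yes" answers out of 100 households; t = 1.984 from the lecture's table
lower, upper = proportion_interval(successes=14, n=100, t_crit=1.984)
print(f"95% CI for the proportion: [{lower:.4f}, {upper:.4f}]")
print("Is 25% inside the interval?", lower <= 0.25 <= upper)
```

The whole interval sits below the 25% that management needs, which points toward revising the sales projections.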

Other Confidence Intervals
- There are lots of other confidence intervals; we've concentrated on CIs for the mean
- See your textbook for:
  - CI for the variance (i.e., $\sigma^2$)
  - CI for the difference of two means
- Not enough time to learn about these
  - Just skim those sections in the book
  - And know that CIs exist for other parameters

What We Have Learned So Far
- Descriptive statistics
- Probability
  - And, Or, Not
  - Normal distribution
  - Central limit theorem
  - Computing SE($\bar{X}$) from SD($X$)
- Inference
  - Confidence intervals for population means and proportions