Chapter 23. Inference About Means

Deirdre Carmel Gibson

1 Chapter 23 Inference About Means

2 Homework
p. 554: 2, 4, 9, 10, 13, 15, 17, 33, 34

3 Objective
Students test null and alternate hypotheses about a population mean.

4 Here We Go Again
Now that we are familiar with creating confidence intervals and testing hypotheses about proportions, we should find doing the same for means to be very easy. Confidence intervals and hypothesis tests are still based on the sampling distribution model. The only real change is that the Central Limit Theorem tells us that the sampling distribution model for means is Normal with mean $\mu$ and standard deviation $SD(\bar{y}) = \sigma/\sqrt{n}$.

5 All We Need Is
All we need to proceed is a random sample of quantitative data. Oh yeah, and the true population standard deviation, $\sigma$. Uh oh, that's a problem.
Proportions have a link between the proportion value and the standard deviation of the sample proportion. This is not the case with means: knowing the sample mean tells us nothing about $SD(\bar{y})$.

6 What Shall We Do?
We will use what we do have available: we will estimate the population parameter $\sigma$ with the sample statistic $s$. Our resulting standard error is $SE(\bar{y}) = s/\sqrt{n}$.
We now have additional uncertainty, or variation, in our standard error from $s$, the sample standard deviation, which most likely $\neq \sigma$. We need to allow for the extra variation so that it does not mess up the margin of error and P-value, especially for a small sample.
The additional uncertainty alters the shape of the sampling distribution. The distribution is still unimodal and symmetric, but it is no longer Normal. At least, not the Normal with which we are familiar. So, what is the sampling model?

7 Student's t
Coming to the rescue was William S. Gosset, an employee of the Guinness Brewery in Dublin, Ireland, who worked out what the change to the sampling model was and how we might deal with the added variability.
The sampling model that Gosset created he called Student's t. His employer would not allow employees to publish data from the brewery, so Gosset published his finding under the pseudonym Student.
The Student's t-models are a whole family of related distributions that depend on a parameter known as degrees of freedom. We denote degrees of freedom as df (or $\nu$, nu), and the statistic as $t_{df}$. When appropriate, the t statistic replaces the z statistic in calculations.


8 Confidence Interval, Means
A sampling distribution model for means:
$t = \dfrac{\bar{y} - \mu}{SE(\bar{y})}$
Look familiar? When the conditions are met, the standardized sample mean follows a Student's t-model with $n - 1$ degrees of freedom. We estimate the standard error with $SE(\bar{y}) = s/\sqrt{n}$.

9 Confidence Interval, Means
One-sample t procedures are to be used only when the population is sufficiently Normal. It must be reasonable to assume that the population (and so the sampling distribution) is sufficiently unimodal and symmetric in order to justify the use of t.
The t statistic is strongly influenced by outliers. As a result, it is imperative that you check the data. If the population is not unimodal and symmetric, there are outliers, and the sample size is small, the results will not be reliable. We have to find other statistical methods.
The simplest way to verify that the population is sufficiently Normal is to trust your methodology, assume the sample is representative of the population, and verify that the sample is sufficiently unimodal and symmetric by making a histogram of the sample.

10 When to use t procedures
I, and your calculator, have a simple rule for using t over z: if you do not know the population $\sigma$, use the t statistic. Howsomever, keep in mind that it is not always appropriate to use the t model. To ensure the sampling distribution warrants using the t model, we must verify that the sample size is appropriate.
If the sample size is less than 15, only use t procedures if the data are really close to Normal. It must be a reasonable assumption that the population distribution is Normal.
If the sample size is between 15 and 40, use t if the data are unimodal and reasonably symmetric, verified by a histogram of the sample.
If the sample size is at least 40, you may use t procedures, even if the original data are skewed.

11 Confidence Interval, Means
When Gosset corrected the model for the extra uncertainty, the margin of error got bigger. Confidence intervals will be slightly wider and P-values a little larger for the t models than they were with the Normal model.
Using the t-model compensates for the extra variability introduced by estimating the population parameters with sample statistics.


12 Confidence Interval, Means
One-sample t-interval for the mean
When the conditions are met, we can find the confidence interval for the population mean, $\mu$, exactly like we did for proportions. The confidence interval is
$\bar{y} \pm t^*_{n-1} \cdot SE(\bar{y})$, that is, $\bar{y} \pm t^*_{n-1} \cdot \dfrac{s}{\sqrt{n}}$
where the standard error of the mean is $SE(\bar{y}) = s/\sqrt{n}$.
The critical value $t^*_{n-1}$ depends on the particular confidence level, C, that you specify and on the number of degrees of freedom, $n - 1$.

13 Confidence Interval, Means
Student's t-models are unimodal and symmetric, just like the Normal. But t-models with only a few degrees of freedom are lower and have much fatter tails than the Normal, like someone sat on the top of the Z curve and squashed it down. (That makes the margin of error bigger.)
The difference between the t-curve and the Z curve is the standard deviation. Both distributions have a mean of 0. The Z distribution has a variance of 1. The t distribution has a variance of $\dfrac{\nu}{\nu - 2}$ (for $\nu > 2$).

14 The t curve
As the degrees of freedom increase, the t-curves look more and more like the Normal curve.
[Figure: Normal distribution and t-distribution curves]
As the degrees of freedom increase, the t distribution gets closer to the Normal distribution, since $s$ gets closer to $\sigma$. In fact, the t-model with infinite degrees of freedom is exactly Normal.

15 Assumptions and Conditions
Normal Population Assumption: We can never know if the population is truly Normal, but we can make a reasonable assumption about some populations.
Nearly Normal Condition: The data come from a distribution that is unimodal and symmetric. Check by making a histogram or Normal probability plot (NPP) of the sample data.
If the sample is representative, and the sample is sufficiently unimodal and symmetric, we conclude the population and sampling distribution are sufficiently unimodal and symmetric.

16 Assumptions and Conditions
These should come as no surprise to you...
Independence Assumption: The data values should be independent.
Randomization Condition: The data arise from a random sample or randomized experiment.
10% Condition: When a sample is drawn without replacement, the sample should be no more than 10% of the population.
As there is no proportion, there is no np or nq. Sufficient sample size requires considering the sample-size cutoffs 15 and 40.

17 Remember
If we cannot reasonably assume that the population data are Normally distributed, then we must check further.
Nearly Normal Condition:
The smaller the sample size (n < 15 or so), the more closely the population data should follow a Normal model.
For moderate sample sizes (n between 15 and 40 or so), the t works well as long as the data are unimodal and reasonably symmetric. Check the sample distribution with a histogram.
For larger sample sizes (n > 40), the t-statistic is acceptable to use unless the data are extremely skewed or have significant outliers.
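
The sample-size guidance above can be written as a small lookup. This is only a sketch of the rule of thumb: the function name `t_guidance` and the wording of the returned messages are made up for illustration, and the cutoff at exactly 40 follows the "at least 40" wording used earlier in the deck.

```python
def t_guidance(n):
    """Rule of thumb for one-sample t procedures, keyed on sample size n."""
    if n < 2:
        # s cannot be computed from fewer than two observations
        raise ValueError("need at least 2 observations")
    if n < 15:
        return "small sample: use t only if the data look very close to Normal"
    if n < 40:
        return "moderate sample: use t if the histogram is unimodal and reasonably symmetric"
    return "large sample: t is acceptable unless the data are extremely skewed or have outliers"


for size in (10, 18, 38, 60):
    print(size, "->", t_guidance(size))
```

The function only encodes the cutoffs; looking at a histogram of the sample is still a separate, required step.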

18 Finding t-values By Hand
The Student's t-model is different for each value of degrees of freedom. Because of this, statistics books usually have one table of t-model critical values for selected confidence levels. Your book has a table in Appendix G, p. A-104.
Of course, we will use the calculator.

19 t - test
Test Statistic = (Observed Value (mean) - Expected Value) / sample standard error
t = (Sample mean - Population mean) / adjusted standard deviation
$t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$, d.f. $= n - 1$

20 t - distribution
The t distribution has the following properties:
The mean of the distribution is equal to 0.
The variance is equal to $\dfrac{\nu}{\nu - 2}$, where $\nu$ (df) is the degrees of freedom and $\nu > 2$.
The variance is always greater than 1, although it is close to 1 when there are many degrees of freedom.
With infinite degrees of freedom, the t distribution is the same as the standard normal (z) distribution.
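
To see how quickly the variance of the t distribution settles toward 1, the formula $\nu/(\nu - 2)$ can be tabulated for a few degrees of freedom. A minimal sketch; the function name `t_variance` is an illustrative choice.

```python
def t_variance(df):
    """Variance of Student's t with df degrees of freedom (defined only for df > 2)."""
    if df <= 2:
        raise ValueError("the variance formula requires df > 2")
    return df / (df - 2)


for df in (3, 5, 10, 30, 100, 1000):
    print(f"df = {df:4d}  variance = {t_variance(df):.4f}")
```

Every value is above 1, and the excess shrinks as the degrees of freedom grow, which matches the picture of the t curve approaching the Normal curve.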

21 Confidence Interval
Calculating the confidence interval is exactly the same as with the z statistic; simply replace the z with t.
$\bar{X} \pm t_{\alpha/2} \cdot \dfrac{s}{\sqrt{n}}$ or $\bar{X} \pm t^*_{n-1} \cdot \dfrac{s}{\sqrt{n}}$

22 Example
Find a 90% confidence interval for the population mean if a sample of 20 has a mean of 1462 with a standard deviation of 42.
This is calculator practice, so we will forgo the assumptions and conditions; plus, these numbers have no context.
To find $t^*$ (here $t_{19}$): invT(.95, 19) = 1.729
$t^*_{19} \cdot \dfrac{s}{\sqrt{n}} = 1.729 \cdot \dfrac{42}{\sqrt{20}} \approx 16.2$
Thus our 90% interval would be 1462 ± 16.2, or (1445.8, 1478.2).
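
The same interval can be checked away from the calculator. The sketch below uses only the Python standard library, so it builds its own stand-ins for the calculator's tcdf and invT: `t_pdf`, `t_cdf`, and `t_inv` are illustrative helper names, and the numerical integration and bisection are one simple way to get the values, not how the TI-84 does it.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_inv(p, df):
    """Value t with P(T <= t) = p, found by bisection (plays the role of invT)."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Values from the example: n = 20, mean 1462, s = 42, 90% confidence.
n, mean, s, conf = 20, 1462, 42, 0.90
t_star = t_inv(1 - (1 - conf) / 2, n - 1)   # upper 5% cut-off with 19 df
margin = t_star * s / math.sqrt(n)
low, high = mean - margin, mean + margin
print(f"t* = {t_star:.3f}, ME = {margin:.1f}, interval = ({low:.1f}, {high:.1f})")
```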

23 TI-84
With data, enter the data into a list, then:
STAT, TESTS, 8:TInterval
Inpt: Data
List:
Freq: 1
C-Level:
Calculate
With statistics:
STAT, TESTS, 8:TInterval
Inpt: Stats
x̄:
Sx: .4
n:
C-Level:
Calculate

24 TI-84
To find the critical value t*:
2nd VARS (DISTR), 4:invT(1 - α, df) or invT(1 - α/2, df)
To find the area below (to the left of) a t-value:
2nd VARS (DISTR), 6:tcdf(-10^99, t, df)
To find the area between two t-values:
2nd VARS (DISTR), 6:tcdf(lower t, upper t, df)
To find the area above (to the right of) a t-value:
2nd VARS (DISTR), 6:tcdf(t, 10^99, df)
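
For readers without the calculator at hand, the three tcdf uses above map onto one cumulative-probability function. This is a standard-library sketch; `t_pdf`, `t_cdf`, `area_left`, `area_between`, and `area_right` are illustrative names, and Simpson's rule is simply a convenient way to approximate the areas.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def area_left(t, df):
    """Like tcdf(-10^99, t, df): area below t."""
    return t_cdf(t, df)


def area_between(lower, upper, df):
    """Like tcdf(lower, upper, df): area between two t-values."""
    return t_cdf(upper, df) - t_cdf(lower, df)


def area_right(t, df):
    """Like tcdf(t, 10^99, df): area above t."""
    return 1 - t_cdf(t, df)


print(round(area_left(-1.740, 17), 3))           # lower tail beyond a one-sided 5% cut-off
print(round(area_between(-2.110, 2.110, 17), 3)) # middle area for a 95% interval
print(round(area_right(2.110, 17), 3))           # upper tail
```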

25 Computer
Computer output for a 1-sample t-test and t-interval.
[Figure: annotated computer output for a two-tailed test, with labels for the mean and standard deviation of the sample, the SE of the sampling distribution, the 95% confidence interval, the test statistic, and the P-value]
Run the test and CI on your calculator.

26 Cautions About Interpreting Confidence Intervals
Remember the appropriate interpretation of your confidence interval.
What NOT to say:
"90% of all the subjects scored between 7 and 9 on the depression test." The confidence interval is about the population mean, NOT the individual data values.
"We are 90% confident that a randomly selected subject will have a depression score between 7 and 9." Again, the confidence interval is about the mean, not the individual values.

27 Cautions About Interpreting Confidence Intervals
What NOT to say:
"The mean depression score is 8, 90% of the time." The true population mean does not vary; the confidence interval and sample mean would be different from a different sample.
"90% of all samples will have a mean score between 7 and 9." The interval we calculate does not represent any other interval. It is only one of many possible intervals, and no more likely to be valid than any other interval.
"The probability of a mean score between 7 and 9 is 90%." The interval we calculate is not special. It is only one of many possible intervals, and no more likely to be valid than some other interval.

28 Cautions About Interpreting Confidence Intervals
What you DO say:
Conclusion: Based on data from a sample of size ___, I am C% confident that the true value of the population mean is between ___ and ___.
C% Confidence: With repeated sampling, C% of the intervals that could be found from samples of size n would contain the true value.
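
The "repeated sampling" reading of C% confidence can be made concrete with a small simulation. Everything in this sketch is hypothetical: the population (Normal with mean 8 and standard deviation 2), the seed, the number of repetitions, and the function name `coverage_rate`. The critical value 1.729 is the 90% value for 19 degrees of freedom, so samples of size 20 are used.

```python
import math
import random
import statistics


def coverage_rate(mu=8.0, sigma=2.0, n=20, t_star=1.729, reps=4000, seed=23):
    """Fraction of simulated 90% t-intervals that contain the true mean mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)   # s / sqrt(n)
        if ybar - t_star * se <= mu <= ybar + t_star * se:
            hits += 1
    return hits / reps


rate = coverage_rate()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The share should land near 0.90. Each individual interval either contains the true mean or it does not; the 90% describes the procedure over many samples.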

29 Make a Picture x 3
Draw a picher, draw a picher, draw a picher.
Now you have another graph to draw. In addition to the t curve, you must draw a histogram of your sample data to demonstrate that the sample, and thus the sampling distribution, is sufficiently unimodal and symmetric with no problematic outliers.
You may also want to make a Normal probability plot to ensure that it is reasonably straight, indicating a nearly Normal distribution.

30 t - test
Test Statistic = (Observed Value (mean) - Expected Value) / sample standard error
t = (Sample mean - Population mean) / adjusted standard deviation
$t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$, d.f. $= n - 1$

31 A Test for the Mean
One-sample t-test for the mean
The conditions for the one-sample t-test for the mean are the same as for the one-sample t-interval. We test the hypothesis $H_0: \mu = \mu_0$ using the statistic
$t = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$, d.f. $= n - 1$
When the conditions are met and the null hypothesis is true, this statistic follows a Student's t model with $n - 1$ df. We use that model to obtain a P-value.

32 t - test for a mean
Do not forget to write a complete response.
P: Define the parameter. (In this case, the population mean.)
H: Formulate all hypotheses: H0, Ha.
A: Check the assumptions and conditions.
N: Determine the test statistic that will assess the evidence against the null hypothesis. (Now the t-test.)
T: Find critical values of the test statistic based on α.
O: Obtain the p-value for the statistic.
M: Decide to reject or fail to reject H0.
S: Tell someone about your results in context.

33 Example
A consumer, Skeptical Starks, tested 18 bottles of sodypop and found a sample mean of 15.8 ounces with a standard deviation of 0.4 ounces. Skeptical is convinced the company is deliberately shorting the drinks that are labeled 16 ounces. Is Skeptical right?
18 bottles, $\bar{x} = 15.8$, $s = 0.4$
We do not know the standard deviation of the population of sodypop bottles, so we will use the t-test for the mean volume of sodypop.
The null hypothesis is that the mean volume of the sodypop is not less than 16 oz. The alternate hypothesis is that the mean volume is less than 16 oz.
H0: µ = 16 and Ha: µ < 16

34 Example
18 bottles, $\bar{x} = 15.8$, $s = 0.4$
Independence Assumption: We are not certain that the consumer did not get all of the bottles from one distributor, so independence is questionable, but let us pretend.
Randomization Condition: We can be relatively certain that the bottles were randomly chosen based on marketing protocols.
10% Condition: 18 bottles is certainly not more than 10% of the population of sodypop bottles.
Normal Population Assumption, Nearly Normal Condition: It is a relatively safe bet that the amount of sodypop in 16-ounce bottles is Normally distributed. To be confident, we would make a histogram of the volumes in our sample bottles.

35 TI-84
18 bottles, $\bar{x} = 15.8$, $s = 0.4$
We have met the conditions for a t-test. Use the calculator to run the t-test.
STAT, TESTS, 2:T-Test
Inpt: Stats
µ0: 16
x̄: 15.8
Sx: .4
n: 18
µ: ≠µ0 <µ0 >µ0 (select <µ0)
Calculate
Output:
µ < 16
t ≈ -2.121
p ≈ .024
x̄ = 15.8
Sx = .4
n = 18
Note all the useful information the calculator so kindly provides for you. Treat it nicely by keeping it fully charged.

36 Example
18 bottles, $\bar{x} = 15.8$, $s = 0.4$
We are not given the level of significance, so we default to .05. Because this is a one-tailed (lower-tail) test, all of the critical area is in the lower tail.
The critical t-statistic ($t^*$, here $t_{\alpha}$) corresponding to α = .05, df = 17:
2nd VARS (DISTR), 4:invT(.05, 17) = -1.740
[Figure: t curve with the lower 5% tail shaded as the rejection region]
The critical value of the t-statistic corresponding to α = .05 and d.f. = 17 is -1.740.

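37 Example
18 bottles, $\bar{x} = 15.8$, $s = 0.4$
The critical value of the t-statistic corresponding to α = .05 and d.f. = 17 is -1.740.
$t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$, d.f. $= n - 1$
$t = \dfrac{15.8 - 16}{0.4/\sqrt{18}} = \dfrac{-0.2}{0.0943} \approx -2.121$
[Figure: t curve with the lower 5% rejection region and the observed t-statistic marked]
p(t < -2.121) = p(x̄ < 15.8) = ttest(stats, 16, 15.8, .4, <µ0) ≈ .024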
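
The sodypop test statistic and its lower-tail P-value can be reproduced with a few lines of standard-library Python. The helper names `t_pdf` and `t_cdf` are illustrative, and the numerical integration is a simple substitute for the calculator's tcdf, so treat this as a cross-check under those assumptions.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


# Sodypop example: H0: mu = 16, Ha: mu < 16
n, xbar, s, mu0 = 18, 15.8, 0.4, 16
se = s / math.sqrt(n)                 # standard error of the mean
t_stat = (xbar - mu0) / se            # observed t-statistic
p_value = t_cdf(t_stat, n - 1)        # lower-tail area, since Ha is mu < 16
print(f"SE = {se:.4f}, t = {t_stat:.3f}, p = {p_value:.3f}")
```

Because the alternative is one-sided (less than), only the area below the observed t counts toward the P-value.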

38 Example
-2.121 < -1.740 and p-value = .024 < .05, so our t statistic falls within the rejection area.
If the null hypothesis (µ = 16.0 oz) is true, the probability of getting a sample mean of 15.8 oz or less is less than 5%, so our decision is to reject the null hypothesis.
Since we reject H0, it is plausible to conclude that the bottles average less than 16 oz; the test alone cannot tell us whether the shortfall is deliberate. Skeptical Starks has a reason to be skeptical.

39 Intervals and Tests
Confidence intervals and hypothesis tests are built from the same concepts. In point of fact, they are two perspectives on the same question.
The confidence interval provides the range of values that will cause us to fail to reject the null hypothesis. Values outside the confidence interval will lead us to reject the null hypothesis.

40 Intervals and Tests
To sum up, a confidence interval with confidence level C contains all of the plausible null hypothesis values that would fail to be rejected by a two-tailed hypothesis test at alpha level 1 - C.
So a 95% confidence interval matches an α = 0.05 level two-sided test (.025 on each side of the interval).
Confidence intervals are naturally two-sided, so they match exactly with two-sided hypothesis tests. When the hypothesis is one-sided, the corresponding alpha level is (1 - C)/2.

41 Determining Sample Size
To find the sample size needed for a particular confidence level with a particular margin of error (ME), solve this equation for n:
$ME = t^*_{n-1} \cdot \dfrac{s}{\sqrt{n}}$, where $CI = \bar{X} \pm t^*_{n-1} \cdot \dfrac{s}{\sqrt{n}}$
The problem with using the equation above is that we don't know most of the values. Howsomever, do not fret; we can overcome this:
We can use s from a small pilot study to establish a sample size for a more comprehensive study.
OR
We can use a two-step process, using z* in place of the necessary t value: calculate a sample size, then use that sample size to set the degrees of freedom for t*.

42 Determining Sample Size
Two steps to finding the sample size for a t-test.
We initially estimate an n by using z:
$ME = z^* \cdot \dfrac{s}{\sqrt{n}}$
Now dial it in even closer by using that first n to determine the d.f. for t*. Calculate again, this time with the t* found in the table, to get the new sample size n':
$ME = t^*_{n-1} \cdot \dfrac{s}{\sqrt{n'}}$

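43 Suppose an initial convenience sample has a mean of 135 lbs and a standard deviation of 32 lbs. How large a sample would you need to estimate the population mean with 95% confidence and a margin of error of 5 pounds?
First, an estimate with z:
$5 = 1.96 \cdot \dfrac{32}{\sqrt{n}}$, so $n = \left(\dfrac{1.96 \cdot 32}{5}\right)^2 \approx 157.4$, which rounds up to 158.
Now we use t with 157 degrees of freedom ($t^* \approx 1.975$):
$5 = 1.975 \cdot \dfrac{32}{\sqrt{n'}}$, so $n' = \left(\dfrac{1.975 \cdot 32}{5}\right)^2 \approx 159.8$
We would need a sample of 160 observations.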
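
The two-step calculation can be scripted so the arithmetic is easy to rerun with other values of s or ME. This is a standard-library sketch: `t_pdf`, `t_cdf`, and `t_inv` are illustrative helpers standing in for the t table, and the degrees of freedom for the second step are taken as the first-step n minus 1, which is an assumption about how the refinement is meant to be done.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_inv(p, df):
    """Value t with P(T <= t) = p, found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


s, me, conf = 32, 5, 0.95
tail = 1 - (1 - conf) / 2                      # 0.975 for 95% confidence

# Step 1: rough n from z*
z_star = NormalDist().inv_cdf(tail)
n_first = math.ceil((z_star * s / me) ** 2)

# Step 2: refine with t*, using df = n_first - 1
t_star = t_inv(tail, n_first - 1)
n_second = math.ceil((t_star * s / me) ** 2)

print(f"z* = {z_star:.3f} -> n = {n_first}")
print(f"t* = {t_star:.3f} -> n' = {n_second}")
```

Rounding up with `math.ceil` keeps the margin of error at or below the target rather than slightly above it.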

44 Sample Size
Sample size calculations are never exact and rarely understood. The margin of error you find after collecting the data won't match exactly the one you used to find n.
The sample size formula depends on quantities you do not have until you collect the data from a sample you do not have, but using it is an important first step.
Before you collect data, it is probably a good idea to know that the sample size is large enough to give you a good chance of being able to use the statistical techniques you hope to use to answer the question.

45 Degrees of Freedom
If only we knew the true population mean, $\mu$, we would find the sample standard deviation as
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$
But we use $\bar{x}$ instead of $\mu$, and that introduces a source of variability.
When we use $\sum_{i=1}^{n} (x_i - \bar{x})^2$ rather than $\sum_{i=1}^{n} (x_i - \mu)^2$ to calculate s, our standard deviation estimate turns out to be too small.
We compensate for the smaller sum by dividing by $n - 1$, which is the degrees of freedom. We use this simply because it gives us a better estimate of the population standard deviation.

46 Standard Deviation
That is why, when we do not know the mean of the population,
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
That is the difference between Sx and σx in what your calculator reports for the standard deviation when you ask for the statistics of a list.
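
The difference between the calculator's Sx and σx shows up the same way in Python's standard library: `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by n. The data list below is made up purely for illustration.

```python
import math
import statistics

# Hypothetical data, chosen so the arithmetic is easy to follow (mean = 5)
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
xbar = statistics.mean(data)
sum_sq = sum((x - xbar) ** 2 for x in data)    # sum of squared deviations = 32

sigma_x = math.sqrt(sum_sq / n)                # divide by n      (calculator's sigma-x)
s_x = math.sqrt(sum_sq / (n - 1))              # divide by n - 1  (calculator's Sx)

print(f"sigma_x = {sigma_x:.4f}  matches pstdev: {statistics.pstdev(data):.4f}")
print(f"S_x     = {s_x:.4f}  matches stdev:  {statistics.stdev(data):.4f}")
```

Sx is always a little larger than σx for the same list, and the gap narrows as n grows.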

47 Caution
Don't confuse proportions and means.
Examine the Normal condition closely; if we ain't normal, we are done:
Beware of multimodality. The Nearly Normal Condition clearly fails if a histogram of the sample has two or more modes.
Beware of skewed data. If the data are very skewed, try re-expressing the variable.
If you have outliers, run the test both with and without the outliers, reporting the results for both.

48 Calculator Practice
Let us say that in previous years the average temperature for this time of year is 67°F. Students are complaining that this year it is much warmer. To find out, the students record the temperature at noon for a random two-week period.
For this example we will assume σ = 5. Test the students' conjecture at a significance level of .05.
This is for the purpose of becoming familiar with the calculator. The data are most appropriate for a t-test, but we will run both z and t.

49 Practice
We know the population mean (67°F), but we do not know the population standard deviation, so we should be running just a t-test. For educational purposes we will test using both a z-test and a t-test.
We are measuring the mean temperature for this time of year. The hypotheses are:
H0: µ ≤ 67 and H1: µ > 67

50 Assumptions
Independence Assumption: Our temperature measurements are independent.
Randomization Condition: The days were chosen at random.
10% Condition: 14 days is certainly not more than 10% of the days we could measure.
Normal Population Assumption, Nearly Normal Condition: Fourteen days is a small sample, but temperatures for this time of year are distributed Normally enough around the mean temperature.

51 Test Statistic
Since this is for learning the calculator, we will be finding both z and t.
Enter the data into L1.
To clear an earlier list: STAT, 4:ClrList, 2nd L1, Enter
If we wish to find the critical z-value: 2nd DISTR, 3:invNorm(.975), Enter

52 z-test
To find the z stat: STAT (TESTS), 1:Z-Test, Enter
Inpt: Data
µ0: 67
σ: 5
List: (2nd) L1
Freq: 1
µ: ≠µ0 <µ0 >µ0 (select >µ0; this is Ha)
Calculate
Output:
Z-Test
µ > 67
z =
p =
x̄ =
Sx =
n = 14
p < .05, so reject H0.

53 If we wish to find the critical t-value: 2nd DISTR, 4:invT(.975, 13), Enter (df: 13)

54 t-test
To find the t stat: STAT (TESTS), 2:T-Test, Enter
Inpt: Data
µ0: 67
List: (2nd) L1
Freq: 1
µ: ≠µ0 <µ0 >µ0 (select >µ0; this is H1)
Calculate
Output:
T-Test
µ > 67
t =
p =
x̄ =
Sx =
n = 14
p < .05, so reject H0.

55 When You Have the Statistics
If we know the sample statistics, we can test the hypotheses using the sample mean and standard deviation.
We will compare a sample mean to a known population mean using the hypotheses:
H0: µ = 135 and H1: µ ≠ 135
Suppose the population has a standard deviation of 32 lbs and our sample had a standard deviation of 31.5 lbs. Would a sample of 38 items weighing in at a mean of 138 lbs be enough to reject H0?

56 z-test
We do not have any data to enter; instead we have statistics.
To find the z stat: STAT (TESTS), 1:Z-Test, Enter
Inpt: Stats
µ0: 135
σ: 32
x̄: 138
n: 38
µ: ≠µ0 <µ0 >µ0 (select ≠µ0; this is H1)
Calculate
Output:
µ ≠ 135
z ≈ 0.578
p ≈ .563
x̄ = 138
n = 38
p > .05, so fail to reject H0.
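
The z-test from summary statistics is short enough to check directly with `statistics.NormalDist` from the Python standard library. This is a sketch of the two-sided calculation for the numbers on this slide; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Summary statistics from the example: H0: mu = 135, H1: mu != 135
mu0, sigma, xbar, n = 135, 32, 138, 38

sd_mean = sigma / math.sqrt(n)                  # SD of the sample mean, sigma / sqrt(n)
z_stat = (xbar - mu0) / sd_mean
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z_stat)))   # both tails

print(f"z = {z_stat:.3f}, two-sided p = {p_two_sided:.3f}")
print("reject H0" if p_two_sided < 0.05 else "fail to reject H0")
```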

57 t-test
To find the t stat: STAT (TESTS), 2:T-Test, Enter
To find the critical value of the test statistic: 2nd DISTR, 4:invT(.975, 37), Enter (df: 37)
Inpt: Stats
µ0: 135
x̄: 138
Sx: 31.5
n: 38
µ: ≠µ0 <µ0 >µ0 (select ≠µ0; this is H1)
Calculate
Output:
µ ≠ 135
t ≈ 0.587
p ≈ .56
x̄ = 138
n = 38
p > .05, so fail to reject H0.
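
The matching t-test from summary statistics swaps σ for the sample standard deviation and the Normal model for a t model with n - 1 degrees of freedom. The sketch below stays within the standard library; `t_pdf` and `t_cdf` are illustrative helpers that approximate what the calculator's tcdf returns.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


# Summary statistics from the example: H0: mu = 135, H1: mu != 135
mu0, s, xbar, n = 135, 31.5, 138, 38

se = s / math.sqrt(n)                            # standard error, s / sqrt(n)
t_stat = (xbar - mu0) / se
p_two_sided = 2 * (1 - t_cdf(abs(t_stat), n - 1))   # both tails, 37 df

print(f"t = {t_stat:.3f}, two-sided p = {p_two_sided:.3f}")
print("reject H0" if p_two_sided < 0.05 else "fail to reject H0")
```

With 37 degrees of freedom the t and z results are close, which is what the deck's discussion of large df would lead us to expect.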

Chapter 23. Inferences About Means. Monday, May 6, 13. Copyright 2009 Pearson Education, Inc.
