7.2 Inference for comparing means of two populations where the samples are independent


Jade Jenkins
6 years ago

1 Objectives 7.2 Inference for comparing means of two populations where the samples are independent
- Two-sample t significance test (we give three examples)
- Two-sample t confidence interval
http://onlinestatbook.com/2/tests_of_means/difference_means.html

2 Topics: Independent two-sample t-test
- Be able to construct the appropriate hypothesis for comparing two populations based on what researchers want to prove.
- When given a data set and the story behind it, be able to identify when to use an independent two-sample t-test.
- Understand the Statcrunch output for an independent two-sample t-test and confidence interval.
- Be able to construct confidence intervals and hypothesis tests based on a part of the output.
- Understand the standard error for the independent two-sample t-test and confidence interval.
- Understand what combination of the sample sizes yields the smallest standard error and why.
- Be able to check the validity (accuracy) of the p-values and confidence intervals.

3 Standard errors We have learnt that standard errors are crucial both in constructing confidence intervals and in statistical testing. Do not get mixed up between the standard error of an estimator and the standard deviation of the sample. The amount of variation in the sample mean (the average squared distance between the estimate and the population mean) is measured by its standard error, which is $\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{\text{amount of variation in the sample}}{\text{square root of the sample size}}$. You can imagine that the unknown population mean should be in some proximity to the known sample mean. That proximity is measured by the standard error. The sample mean tends to get closer to the population mean (more precise) as you increase the sample size. As we continue with the course, the standard errors will become more complex, but the underlying ideas are the same.
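As a minimal sketch of the formula s.e. = s/√n, the snippet below computes the standard error for a small sample and for a sample four times as large with the same values repeated. The measurements are made-up numbers for illustration only, not data from the lecture, and the function name `standard_error` is hypothetical.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return s / math.sqrt(len(sample))

# Hypothetical measurements, for illustration only.
small = [2.0, 4.0, 6.0, 8.0]
large = small * 4  # same values repeated: similar spread, four times the sample size

se_small = standard_error(small)
se_large = standard_error(large)
print(f"n = {len(small):2d}: s.e. = {se_small:.3f}")
print(f"n = {len(large):2d}: s.e. = {se_large:.3f}")
```

The spread of the data barely changes, but the standard error shrinks because the sample size sits under the square root in the denominator.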

4 Comparisons everywhere! You will see comparisons being made all over the place. Just look at some of the products you have at home: Dentex floss sticks are clinically proven to remove more plaque than regular floss. What does this mean, and how on earth do they prove this? This is an example of where they prove the results statistically. It is done via clinical trials, by collecting data. Aim: to see if it is possible to prove that on average the amount of plaque removed using floss sticks is more than the average amount of plaque removed using regular floss. They state their hypothesis as H0: μ_FP - μ_F ≤ 0 against HA: μ_FP - μ_F > 0, where μ_FP = mean plaque removed using floss sticks and μ_F = mean plaque removed using regular floss.

5 Designing the floss study There are two ways the data could have been collected. Either one simple random sample of individuals is taken, and each individual is asked (on separate, far-apart days) to use a floss stick and regular floss, with the amount of plaque removed measured for each treatment. This is an example of a matched pair study, where the same individual is used in both treatments. In this case a matched paired t-test is done, which we covered in the previous lectures. The advantage of this design is that it avoids confounding because the same individual is used for both experiments. The disadvantage is that it takes time and effort because we need to do it over several days. Alternatively, a simple random sample is taken and randomly split into two groups. Some are asked to use floss and others are asked to use floss sticks. The individuals in both groups are completely independent of each other and there isn't any matching. The advantage is that it is quick to do this experiment. Disadvantage: larger standard errors.

6 Independent samples inference The purpose of most studies is to compare the effects of different treatments or conditions. Using matching to design an experiment is a very useful way to make comparisons between populations since it tends to reduce confounding factors (such as the ability of a person to floss). If we have reason to believe that there is matching between subjects, then we should use a matched paired t-test. However, in many situations it is impossible to have any matching between the samples. If we want to see whether a drug works, we need to compare an SRS (simple random sample) of patients treated with the drug with an SRS of patients given the placebo.

7 Example 1: Floss A simple random group of individuals is chosen and the amount of plaque removed using either regular floss or floss picks is measured. Specifically, 50 individuals were given floss and another 50 were given floss sticks. The average amount of plaque removed in the group using floss sticks is 3.16 mg, whereas the average amount of plaque removed in the group using regular floss is 2.99 mg.

8 Example 1: Floss The data is on the left. Note that each row does not correspond to the same individual. These are two different individuals with no pairing/matching. They are two independent samples. It is hard to understand these numbers, so the data is summarized in the table below.

9 Does the difference of 3.16 vs 2.99 automatically prove that floss sticks are better? No, it doesn't. The product's makers cannot use this as proof. They need to show that, given the data, it is implausible that over the entire population there is no difference between the floss and floss sticks (they need to show that the null is not plausible). Technically, this means calculating the chance of observing a difference at least as large as 3.16 mg - 2.99 mg = 0.17 mg by fluke when the null is true. This is the p-value. If the p-value is large (over a pre-determined significance level, say 5%), then we cannot reject the null. This means that there isn't any evidence in the collected data that floss picks are better than regular floss. If the p-value is small (below a pre-determined significance level), then the data collected suggests that floss sticks function better than floss. Fortunately, computer software does the calculation. However, the usual assumptions of normality of the sample means still apply. So these still have to be checked:

10 The independent sample t-test in Statcrunch Go to Stat -> T Stats -> Two Sample -> With Data. We test the hypothesis H0: μ_FP - μ_F ≤ 0 against HA: μ_FP - μ_F > 0.

11 Analysis of data The green lines in the dot plot correspond to the sample means calculated from the data. We test the hypothesis H0: μ_FP - μ_F ≤ 0 against HA: μ_FP - μ_F > 0. We see that the sample mean for FlossSticks is greater than the sample mean for regular floss. But just looking at the two data sets, we see there is a large overlap in numbers. This makes it visually hard to discriminate between the two data sets and determine the p-value. However, the p-value can be calculated and it is 3.15%.

12 The p-value of 3.15% (which is less than 5%) tells us that there is some evidence for the claim that floss sticks remove more plaque than floss. To find out how much more plaque floss sticks remove on average compared with regular floss, we calculate the 95% confidence interval. This can be done using the output, and the t-calculator in Statcrunch gives us the critical value to use in constructing the 95% confidence interval for the mean difference. [ , ] = [ , 0.329]

13 Thus with 95% confidence the mean difference lies somewhere within the interval computed above. Note that zero lies in this interval. This is because we can reject the null using a one-sided test at the 5% level, but not using a two-sided one. Alternatively, use Statcrunch to give the confidence interval. Observe that these numbers are identical to the calculation we made.

14 Since Dentex found a statistically significant difference, they can make the claim that floss sticks remove more plaque than floss. Checking the normality assumption of the sample means: the sample size is relatively large and the distributions of the two sample means (look at the green plots) look quite normal. Therefore we can say the 3.15% p-value is close to the true p-value, based on normality of the sample means.

15 Summary In the situation discussed above the samples are completely independent of each other; there isn't any matching. In this situation we need to use an independent t-test. In general, subjects are often observed separately under the different conditions, resulting in samples that are independent. That is, the subjects of each sample are obtained and observed separately from, and without any regard to, the subjects of the other samples. As in the matched pairs design, subjects should be sampled randomly from the population of interest. By the end of the class you should be able to identify which test to apply given the situation. You should look to see if there is any matching in the data. If there is matching, never do an independent sample t-test (this will give the wrong standard errors and can lead to unreliable results). If the samples appear to be completely independent of each other, use an independent sample t-test.

16 Example 2: Heights Consider the following problem that we already know the answer to: in general, do male students tend to be taller than female students? In terms of a hypothesis test, we want to see if there is evidence to support HA in H0: μ_M - μ_F ≤ 0 against HA: μ_M - μ_F > 0. A matched design is possible, by randomly sampling male and female student siblings. Such data may be hard to come by. In addition, we would exclude the sub-population of people with same-sex or no siblings. Instead, a random sample of students was drawn and an independent sample t-test is done. Statcrunch instructions: Stat -> T Stats -> Two Sample -> With Data. Then place the relevant columns in each box and uncheck the box that says pooled variance. You have the option of doing a test (one or two sided) or constructing a confidence interval.

17 In this sample there were 27 males and 37 females; there is clearly no matching. The difference in sample means is 0.45. Visually, there seems to be a large difference between the data sets, meaning that it is unlikely they share the same mean (small p-value).

18 We see that the p-value is less than 0.01% (we do the test at the 5% level), which means there is strong evidence to suggest that males are on average taller than females. t-value = 7.27. We can use the same output to construct a 99% confidence interval for the mean difference. The only difference is that the degrees of freedom is unusual: it is 48.29. However, we do exactly the same as before: we either look up tables (provided by me in the exam paper) or use software such as Statcrunch. [0.466 ± ] = [0.29, 0.64]. With 99% confidence we believe the mean difference in height lies between 0.29 and 0.64 feet.

20 Example 3: Diets We want to know whether there is any difference between two different diets. 20 randomly sampled people are randomly placed into two groups of 10. The first group goes on Diet 1 and the second group on Diet 2. The weight loss for each group (after dieting for one month) is given below. We need to use an independent two-sample t-procedure (no matching between individuals). As we have no reason to believe one diet is better than another, our hypothesis of interest is H0: μ_1 - μ_2 = 0 against HA: μ_1 - μ_2 ≠ 0. The sample means are different, but there is a large overlap in the dots.

21 The 95% confidence interval is [-2.23, 0.598]. This tells us with 95% confidence the mean difference between the diets is somewhere in this interval. As this contains the mean difference of 0, we cannot reject the null (for the two-sided test). Equivalently, the p-value for H0: μ_1 - μ_2 = 0 against HA: μ_1 - μ_2 ≠ 0 is greater than 5%. To calculate the precise p-value we use the t-transform, t-value = -1.22. Using Statcrunch we see that the smallest area is the area to the LEFT of -1.22, which is 12%. Thus the p-value for the two-sided test is 24%. From the data there is no evidence to suggest there is any difference between the means of the diets.
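The step from the t-value to the two-sided p-value can be sketched in Python as follows. The degrees of freedom are not stated on the slide, so df = 18 (that is, 10 + 10 - 2, the largest value possible for two groups of 10) is an assumption here. The helper names `t_pdf` and `t_upper_tail` are hypothetical, and the tail area is obtained by simple numerical integration rather than by whatever method Statcrunch uses.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, span=200.0, steps=20000):
    """Area to the right of t, by Simpson's rule on [t, t + span].

    Adequate for moderate df, where the density beyond t + span is negligible.
    """
    h = span / steps
    total = t_pdf(t, df) + t_pdf(t + span, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

t_value = -1.22   # from the slide
df = 18           # ASSUMPTION: not given in the source

# By symmetry, the area to the LEFT of -1.22 equals the area to the RIGHT of +1.22.
left_area = t_upper_tail(abs(t_value), df)
two_sided_p = 2 * left_area  # two-sided test: double the smaller tail

print(f"area to the left of {t_value}: {left_area:.3f}")
print(f"two-sided p-value: {two_sided_p:.3f}")
```

Under that assumed df, the left-tail area comes out close to the 12% quoted on the slide, and doubling it gives the two-sided p-value.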

22 Example 4: Does calcium interact with iron absorption? It is believed that too much calcium in a diet can reduce the absorption of iron. To test this, 20 randomly sampled people were put into two groups of 10. One group was given a high-calcium diet and the iron absorption recorded. The other group was given a low-calcium diet and the iron absorption recorded. The differences from their previous levels are given below (this is why you see some negative numbers). The data and summary statistics are given below. We observe that in this sample those in the low-calcium group absorb more iron. Is this statistically significant?

23 The hypothesis of interest is H0: μ_CH - μ_CL ≥ 0 against HA: μ_CH - μ_CL < 0. The hypothesis given in the output above is the opposite of what we want to test. However, from the output we immediately see that the p-value for H0: μ_CH - μ_CL ≥ 0 against HA: μ_CH - μ_CL < 0 is the area to the LEFT of the t-value, which is 0.26%. As this p-value is less than 5%, there is evidence to reject the null and conclude that high calcium decreases iron absorption (compared with low calcium). The 95% confidence interval for the mean difference is [ ± ]

24 Example 5: Calf treatments Comparing the weights of calves under different treatments. Treatment A vs B: Is there evidence in the data to suggest there is a difference between treatments A and B? This means we are testing H0: μ_A - μ_B = 0 against HA: μ_A - μ_B ≠ 0. Eyeballing the numbers, we see that there is not much difference in the sample means, and it is very hard to discriminate between the two data sets. This corresponds to a large p-value.

25 Example 5: Calf treatments The p-value for H0: μ_A - μ_B = 0 against HA: μ_A - μ_B ≠ 0 is 93%. This tells us that obtaining the difference seen between the two groups when there is no difference in the treatments (in terms of weight) is highly likely. Thus there is no evidence to reject the null. Note: To analyze the calf data in Statcrunch you need to split each group into its own column. To do this go to Data -> Arrange -> Split, select the column of data you want to analyze (for example Wt 8), and select the group you want (for example TRT).

26 Treatment A vs D From the summary statistics, the difference between treatments A and D appears quite large (7.7). Can this difference be explained by random chance? We test the hypothesis H0: μ_A - μ_D = 0 against HA: μ_A - μ_D ≠ 0. There is a 7.7 point difference in the treatments but a large overlap in the data sets (both have a large standard deviation).

27 The mean difference may be -7.7, but the p-value is 34%. This tells us there is over a 1/3 chance of observing a difference of 7.7 in the sample means when there is in fact no difference in the treatments. This is well over the 5% significance level, so there is no evidence to reject the null. We now construct a 95% confidence interval. To do this we use Statcrunch to find the critical value of a t-distribution with the relevant degrees of freedom. The 95% confidence interval for the difference in mean weights for the treatments is [ , ] = [-24, 9.2]. This is an interval where we believe the mean difference should lie, and it explains why we were not able to reject the null, despite 7.7 being subjectively large. The reason this interval is wide is that the standard error is large, due to the small sample size and large standard deviation of calf weights.

28 The standard error, what is that? We illustrate the idea with the female and male height example. For every sample the difference in sample means $\bar{X}_M - \bar{X}_F$ will vary. If the sample size is large enough, $\bar{X}_M - \bar{X}_F$ will have a normal distribution (thanks to the central limit theorem). The normal distribution will be centered about the true mean difference μ_M - μ_F (population male mean minus population female mean), but it will have a complicated standard error: $\sqrt{\frac{\sigma_M^2}{27} + \frac{\sigma_F^2}{34}}$, where σ_M = standard deviation of male heights and σ_F = standard deviation of female heights.

29 Therefore, just like in the one-sample case, in order to do the test we simply take the z-transform under the null that the mean male and female heights are the same (μ_M - μ_F = 0): $z = \frac{\bar{X}_M - \bar{X}_F}{\sqrt{\frac{\sigma_M^2}{27} + \frac{\sigma_F^2}{34}}}$. At this point we encounter a problem. We do not know the population standard deviations σ_M and σ_F. But we see from the summary statistics that we do have estimates for them. Thus we can replace the true population standard deviations by their estimates and obtain the transformation: $t = \frac{\bar{X}_M - \bar{X}_F}{\sqrt{\frac{s_M^2}{27} + \frac{s_F^2}{34}}}$

30 The distribution of this ratio? Having exchanged the unknown true standard deviations for their estimators (calculated from the data), it seems reasonable to suppose that extra variability has been added to this ratio, and we need to correct for it by changing from a normal distribution to another distribution. Previously, for the one-sample case, the new distribution which took this variability into account was the t-distribution. In the two-sample case, the ratio $t = \frac{\bar{X}_M - \bar{X}_F}{\sqrt{\frac{s_M^2}{27} + \frac{s_F^2}{34}}}$ has approximately a t-distribution with a very strange number of degrees of freedom: $df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$. This is why using software is important; you don't want to calculate this stuff!
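For readers who do want to see the degrees-of-freedom formula evaluated, here is a small Python sketch. The function name `welch_df` is hypothetical, and the inputs are the rounded summary values quoted elsewhere in these slides for the height example (s = 0.27 with n = 27 for males, s = 0.21 with n = 34 for females), so the result need not agree with the software output to every decimal.

```python
def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for the unpooled two-sample t statistic."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Rounded summary statistics from the height example.
df = welch_df(0.27, 27, 0.21, 34)
print(f"degrees of freedom: {df:.2f}")
```

The value always lies between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2, which is a handy sanity check on any software output.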

31 We are testing H0: μ_M - μ_F = 0 against HA: μ_M - μ_F > 0 and have the t-transform $t = \frac{5.91 - 5.46}{\sqrt{\frac{0.27^2}{27} + \frac{0.21^2}{34}}} = 7.27$, which we know has 48.045 degrees of freedom. Now going to Statcrunch -> Stat -> Calculators -> T, we get that the area to the right of 7.11 for a t-distribution with these degrees of freedom is tiny. So at both the 5% and 1% significance levels we would reject the null. This means there is plenty of evidence to reject the null and conclude the mean height of males is greater than that of females. Remember: if the sample sizes are both over 15, and the data are not too skewed, using the t-distribution is reasonable.
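The calculation on this slide can be sketched in Python using only the summary statistics. The helper names are hypothetical and the tail area is found by numerical integration, which is an assumption of this sketch rather than the method Statcrunch uses. With the rounded inputs shown on the slide the statistic evaluates to roughly 7.1, in line with the "area to the right of 7.11" quoted above; the 7.27 on the slide presumably comes from unrounded data.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, span=200.0, steps=20000):
    """Area to the right of t, by Simpson's rule on [t, t + span]."""
    h = span / steps
    total = t_pdf(t, df) + t_pdf(t + span, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

# Rounded summary statistics from the slide (heights in feet).
mean_m, s_m, n_m = 5.91, 0.27, 27
mean_f, s_f, n_f = 5.46, 0.21, 34

se = math.sqrt(s_m**2 / n_m + s_f**2 / n_f)   # two-sample standard error
t_stat = (mean_m - mean_f) / se               # t-transform under H0: difference = 0
p_one_sided = t_upper_tail(t_stat, df=48.045) # area to the right (HA: difference > 0)

print(f"standard error: {se:.4f}")
print(f"t statistic:    {t_stat:.2f}")
print(f"one-sided p:    {p_one_sided:.2e}")
```

Either way the tail area is far below 1%, so the conclusion on the slide does not depend on the rounding.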

32 Summary of Analysis: Significant effect Remember: Significance means the evidence of the data is sufficient to reject the null hypothesis (at our stated level α). Only data, and the statistics we calculate from the data, can be statistically significant. We can say that the sample means are significantly different or that the observed effect is significant. But the conclusion about the population means is simply that they are different. The observed effect of 0.46 between male and female height is significant, so we conclude that the true effect μ_M - μ_F is greater than zero. Having made this conclusion, or even if we have not, we can always estimate the difference using the confidence interval [0.33, 0.58].

33 Standard errors In the one-sample case the standard error is $\frac{s}{\sqrt{n}} = \sqrt{\frac{s^2}{n}}$, where s is the standard deviation and n is the sample size. In the independent two-sample case the standard error is $\sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{m}}$, where $s_1^2$ and $s_2^2$ are the variances for populations one and two, and n and m are the two sample sizes. These two different standard errors are for different situations, but the ideas are the same. Remember that a smaller standard error leads to more reliable estimators. Therefore, if we are designing the experiment to decrease the standard error, we observe that: for the one-sample case, we can decrease the standard error by increasing the sample size (it is usually impossible to decrease the standard deviation); for the two-sample case, we can decrease the standard error by increasing the size of both samples (again, it is usually impossible to decrease the standard deviation of the populations).

34 Choosing the sample size We now consider how to distribute the sample sizes in the case that the standard deviations for both samples are about the same. In this case the standard error is $\sqrt{\frac{s^2}{n} + \frac{s^2}{m}} = s\sqrt{\frac{1}{n} + \frac{1}{m}}$. Remember the standard deviation is fixed; we cannot change this value. Suppose that we only have enough funds to include 200 subjects in our experiment. How should we distribute them amongst the two groups? It makes no sense to have one subject in group 1 and 199 in group 2. For example, if we are comparing male and female heights, this would be using one male height to estimate the mean height of males and 199 female heights to estimate the mean height of females. Clearly this is wrong, and we can understand why from the standard error, which is $s\sqrt{\frac{1}{1} + \frac{1}{199}} = 1.002s$. On the other hand, if we distributed them evenly, 100 and 100, the standard error is a lot smaller: $s\sqrt{\frac{1}{100} + \frac{1}{100}} = 0.141s$.
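The comparison above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `se_factor` is hypothetical, and it assumes, as the slide does, that both groups share the same standard deviation s, so only the multiplier of s is computed.

```python
import math

def se_factor(n, m):
    """Multiplier of s in the two-sample standard error, sqrt(1/n + 1/m)."""
    return math.sqrt(1 / n + 1 / m)

total = 200  # subjects we can afford in total

unbalanced = se_factor(1, total - 1)  # 1 subject versus 199
balanced = se_factor(100, 100)        # even split

# Scan every possible split to see which one gives the smallest standard error.
best_n = min(range(1, total), key=lambda n: se_factor(n, total - n))

print(f"1 vs 199:   {unbalanced:.3f} * s")
print(f"100 vs 100: {balanced:.3f} * s")
print(f"best split: {best_n} vs {total - best_n}")
```

The scan confirms that, for a fixed total and equal standard deviations, the even split minimizes the standard error.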

35 Which type of test? One sample, paired samples or two independent samples?
- Comparing vitamin content of bread immediately after baking vs. 3 days later (the same loaves are used on day one and 3 days later).
- Comparing vitamin content of bread immediately after baking vs. 3 days later (tests made on independent loaves).
- Average fuel efficiency for 2005 vehicles is 21 miles per gallon. Is average fuel efficiency higher in the new generation of green vehicles?
- Is blood pressure altered by use of an oral contraceptive? Comparing a group of women not using an oral contraceptive with a group taking it.
- Review insurance records for the dollar amount paid after fire damage in houses equipped with a fire extinguisher vs. houses without one. Was there a difference in the average dollar amount paid?

36 Cautions about the two-sample t-test or interval Using the correct standard error and degrees of freedom is critical. As in the one-sample t-test, the method assumes simple random samples. Likewise, it also assumes the populations have normal distributions. Skewness and outliers can make the methods inaccurate (that is, having a confidence/significance level other than what they are supposed to have). The larger the sample sizes, the less this is a problem. It is also less of a problem if the populations have similar skewness and the two samples are close to the same size. A significant effect merely means we have sufficient evidence to say the two true means are different. It does not explain why they are different or how meaningful/important the difference is. A confidence interval is needed to determine how big the effect is.

37 Summary: Distribution of two sample means In order to do statistical inference, we must know a few things about the sampling distribution of our statistic. The sampling distribution of $\bar{x}_1 - \bar{x}_2$ has standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. (Mathematically, the variance of the difference is the sum of the variances of the two sample means.) This is estimated by the standard error $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$. If the sample sizes are both over 15, and the data are not too skewed, using the t-distribution is reasonable. Then the two-sample t statistic is $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$. This statistic has an approximate t-distribution on which we will base our inferences. But the degrees of freedom is complicated.

38 Two-sample t confidence interval Recall that we have two independent samples and we use the difference between the sample averages $(\bar{x}_1 - \bar{x}_2)$ to estimate (μ_1 - μ_2). This estimate has standard error $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$. The margin of error for a confidence interval of μ_1 - μ_2 is $m = t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = t^* \, SE$, where t* is found using the computer. The confidence interval is then computed as $(\bar{x}_1 - \bar{x}_2) \pm m$. The interpretation of confidence is the same as before: it is the proportion of possible samples for which the method leads to a true statement about the parameters.
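A minimal Python sketch of this recipe is given below. The function name `two_sample_ci` is hypothetical. The inputs are the rounded height summaries from slide 31, and the critical value t* = 2.01 is an assumed table value for 95% confidence with about 48 degrees of freedom; in practice it would come from Statcrunch or a t-table.

```python
import math

def two_sample_ci(mean1, s1, n1, mean2, s2, n2, t_star):
    """Confidence interval (x1bar - x2bar) +/- t* * SE for mu1 - mu2."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # unpooled standard error
    margin = t_star * se                     # margin of error m
    diff = mean1 - mean2
    return diff - margin, diff + margin

# Rounded summaries for male and female heights; t_star is an ASSUMED value
# (95% confidence, roughly 48 degrees of freedom).
low, high = two_sample_ci(5.91, 0.27, 27, 5.46, 0.21, 34, t_star=2.01)
print(f"95% CI for the mean difference: [{low:.3f}, {high:.3f}]")
```

Passing t* in as an argument keeps the sketch free of any particular statistics package; only the critical value changes when a different confidence level is wanted.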

39 Two-sample t significance test The null hypothesis is that both population means μ_1 and μ_2 are equal, thus their difference is equal to zero. H0: μ_1 = μ_2 ⇔ H0: μ_1 - μ_2 = 0. Either a one-sided or a two-sided alternative hypothesis can be tested. Using the value (μ_1 - μ_2) = 0 given in H0, the test statistic becomes $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$. To find the P-value, we look up the appropriate probability of the t-distribution using the df given by Statcrunch or me.

40 Statistics in the media Look at this article and the data they describe: http:// What is the data that Dr. Carrasco has? If we did an independent sample t-test to see whether those with Alzheimer's had more fungal cells than those without Alzheimer's, what would the p-value be (give a rough estimate)?

41 Accompanying problems associated with this Chapter: Quiz 14, Homework 7 (Questions 5, 6 and 7)
