Two-sample inference: Continuous Data


1 Two-sample inference: Continuous Data November 5

2 Diarrhea
Diarrhea is a major health problem for babies, especially in underdeveloped countries. It leads to dehydration, which results in millions of deaths each year worldwide. Bismuth salicylate (Pepto-Bismol) reduces diarrhea in adults. Researchers in Peru conducted a double-blind randomized controlled trial, published in The New England Journal of Medicine, to determine whether it would do so in infants suffering from diarrhea as well.

3 Peruvian diarrhea study
In their study, all infants received the standard therapy for diarrhea: oral rehydration. In addition to the rehydration, 85 babies received bismuth salicylate, while 84 babies received a placebo. The total stool volume of each infant over the course of the illness was measured. To adjust for body size, the researchers divided by body weight to obtain their outcome of interest: stool output per kilogram of body weight.

4 Results
The investigators found that in the placebo group, the total stool output was 260 ± 254 ml/kg (mean ± SD), while in the treatment group, stool output was 182 ± 197 ml/kg. From these numbers, does it seem likely that stool output is normally distributed? Absolutely not; if stool output followed a normal distribution, these numbers would imply that about 16% of infants had negative stool output.
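As a quick check of the 16% figure, the sketch below evaluates the probability of a negative value under a normal curve with the reported means and SDs. It relies on `statistics.NormalDist` from Python's standard library; the variable names are illustrative and not part of the original slides.

```python
from statistics import NormalDist

# Reported summaries (mean, SD) in ml/kg, taken from the slide above.
placebo = NormalDist(mu=260, sigma=254)
treatment = NormalDist(mu=182, sigma=197)

# If stool output were normal, this fraction of infants would fall below zero.
frac_negative_placebo = placebo.cdf(0)
frac_negative_treatment = treatment.cdf(0)

print(f"placebo:   {frac_negative_placebo:.1%}")
print(f"treatment: {frac_negative_treatment:.1%}")
```

Both fractions land between roughly 15% and 18%, in line with the "about 16%" quoted above.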

5 Looking at the data
Indeed, the distribution does not look normal at all.
[Figure: distribution of stool output (ml/kg) in the control and treatment groups]
Should we worry about this? We don't really have to; n = 84 or 85 is big enough that the sample mean should be approximately normally distributed even if the data themselves are not.

6 Separate confidence intervals
One way of analyzing this data is to calculate separate confidence intervals for each group.
[Figure: confidence intervals for stool output (ml/kg) in the placebo and treatment groups]
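The numbers behind such a figure are not given in the slides, but they can be approximated from the reported summaries. The sketch below is only an illustration: the helper `mean_ci` is hypothetical, and the critical value 1.99 is an assumed approximation to the 97.5th percentile of Student's curve with 83 or 84 degrees of freedom.

```python
import math

def mean_ci(mean, sd, n, crit):
    """Confidence interval for a single mean: mean ± crit * SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    return mean - crit * se, mean + crit * se

# Assumption: approximate 95% critical value for 83 or 84 degrees of freedom.
CRIT_95 = 1.99

placebo_ci = mean_ci(260, 254, 84, CRIT_95)
treatment_ci = mean_ci(182, 197, 85, CRIT_95)

print("placebo:  ", placebo_ci)
print("treatment:", treatment_ci)
```

With these inputs the two intervals overlap slightly, so comparing separate intervals alone does not settle whether the groups differ.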

7 The difference between two means
As we said last week, however, this is not the most powerful way to analyze this data if we are interested in the difference between treatment and placebo. Denoting the two groups with the subscripts 1 and 2, we will attempt to test the hypothesis that $\mu_1 = \mu_2$ by looking at the random variable $\bar{x}_1 - \bar{x}_2$ and determining whether it is significantly different from 0. Today's outline will be similar to last week's: we will go over an exact, albeit computer/labor-intensive, approach (the permutation test), then talk about an approximate approach that is much easier to do by hand (the two-sample t-test).

8 Viewing our study as balls in an urn
The same concept that we encountered in the two-sample Fisher's exact test can also be used for continuous data. If Pepto-Bismol had no effect on diarrhea in children, then it wouldn't matter which group a child was assigned to. Under this null hypothesis, then, it would be like writing down each child's stool output/kg on a ball, putting all the balls into an urn, then randomly picking out 84 balls and calling them the control group. How likely is it that the difference in means between two groups formed this way would be as large as or larger than 78, the actual difference observed?

9 Results of the experiment
[Figure: histogram of the differences in means from the random draws; x-axis: Difference, y-axis: Frequency]

10 Results of the experiment (cont'd)
In the experiment, I obtained a random difference in means as large as or larger than the observed one 283 times out of 10,000 draws. Thus, my experimental p-value is 283/10,000 = 0.0283. Conclusion, based on statistically significant evidence: Pepto-Bismol causes a reduction in the symptoms of diarrhea in children.

11 Permutation tests
This approach to carrying out a hypothesis test is called a permutation test. The different orders in which a sequence of numbers can be arranged are called permutations; a permutation test essentially calculates the percentage of random permutations under the null hypothesis that produce a result as extreme as or more extreme than the observed value. Unlike for Fisher's exact test, exact solutions are not readily available (this problem is much harder). Unless the number of observations is small enough that we can count the permutations by hand, we need a computer to perform a permutation test.
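A minimal sketch of such a test is given here, assuming a two-sided comparison of the absolute difference in means. The raw trial data are not available in these slides, so the example draws hypothetical skewed data whose means loosely echo the reported ones; the function name `permutation_test`, the seeds, and the exponential model are all illustrative assumptions.

```python
import random

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    n_extreme = 0
    for _ in range(n_perm):
        # Shuffling the urn and splitting it mimics a random reassignment.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / n_b
        # Small tolerance guards against floating-point ties.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_perm

# Hypothetical data only: exponential draws with means near 260 and 182.
data_rng = random.Random(42)
placebo = [data_rng.expovariate(1 / 260) for _ in range(84)]
treatment = [data_rng.expovariate(1 / 182) for _ in range(85)]

p_value = permutation_test(placebo, treatment)
print(p_value)
```

Because the data are simulated, the printed p-value will not match the 0.0283 from the lecture; only the procedure is the same.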

12 Getting an approximate answer based on the normal distribution
(Section outline: The standard error of a difference between two means; Student's test; Confidence intervals)
As you might guess from the look of the histogram, a much easier way to obtain an answer is to use the normal distribution as an approximation. Letting $d = \bar{x}_1 - \bar{x}_2$ represent the difference between the two means, our test statistic will be
$$\frac{d - d_0}{SE_d} = \frac{d}{SE_d},$$
where $d_0 = 0$ is the difference under the null hypothesis. But what's the standard error of the difference between two means, $SE_d$?

13 The standard error of the difference between two means
Suppose we have $\bar{x}_1$ with standard error $SE_1$ and $\bar{x}_2$ with standard error $SE_2$ (and that $\bar{x}_1$ and $\bar{x}_2$ are independent). Then the standard error of $\bar{x}_1 - \bar{x}_2$ is
$$SE_d = \sqrt{SE_1^2 + SE_2^2}.$$
Note the connections with both the root-mean-square idea and the square root law from earlier in the course.
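To make the formula concrete, the sketch below plugs in the diarrhea study's summaries. It assumes each group's standard error is estimated as SD divided by the square root of n, which is one way of filling in the unknown standard errors; the function name `se_difference` is illustrative.

```python
import math

def se_difference(se_1, se_2):
    """Standard error of the difference between two independent means."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)

# Assumption: estimate each SE from the group's own SD and sample size.
se_placebo = 254 / math.sqrt(84)
se_treatment = 197 / math.sqrt(85)

se_d = se_difference(se_placebo, se_treatment)
print(round(se_placebo, 1), round(se_treatment, 1), round(se_d, 1))
```

Note that the combined standard error is larger than either individual one but smaller than their sum, as the root-mean-square form suggests.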

14 The split
This equation would be perfect if we knew $SE_1$ and $SE_2$. But of course we don't. There are two ways of settling this question, and they have led to two different forms of the two-sample t-test.

15 Approach #1: Student's t-test
The first approach was invented by W.S. Gosset (Student). His approach was to assume that the standard deviations of the two groups were the same. If you do this, then you only have two sources of variability to worry about: the variability of $\bar{x}_1 - \bar{x}_2$ and the variability in your estimate of the common standard deviation.

16 Approach #2: Welch's t-test
The second approach was invented by B.L. Welch. He generalized the two-sample t-test to situations in which the standard deviations were different between the two groups. If you don't make Student's assumption, then you have three sources of variability to worry about: the variability of $\bar{x}_1 - \bar{x}_2$, the variability of $SD_1$, and the variability of $SD_2$.
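The slides do not spell out Welch's formulas. As a hedged illustration, the sketch below uses the standard Welch-Satterthwaite approximation for the degrees of freedom, applied to the reported summaries; the function name `welch_t` is hypothetical, and the p-value lookup is left out.

```python
import math

def welch_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v_1 = sd_1 ** 2 / n_1          # squared standard error, group 1
    v_2 = sd_2 ** 2 / n_2          # squared standard error, group 2
    se_d = math.sqrt(v_1 + v_2)
    t = (mean_1 - mean_2) / se_d
    df = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))
    return t, df

t_welch, df_welch = welch_t(260, 254, 84, 182, 197, 85)
print(round(t_welch, 2), round(df_welch, 1))
```

The degrees of freedom come out somewhat below $n_1 + n_2 - 2$, which reflects the extra uncertainty from estimating two standard deviations rather than one.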

17 Student's test vs. Welch's test
We'll talk more about the difference between the two tests at the end of class. For now, it will suffice to say that Student's test is more powerful when the sample sizes are small, but that it can yield poor results if the true standard deviations are quite different. When the sample sizes are reasonably large and the standard deviations are reasonably close, the two tests give essentially the same answer. Student's test is much easier to do by hand, however.

18 The pooled standard deviation
In order to estimate a standard error, we will need an estimate of the common standard deviation. This is obtained by pooling the deviations. To calculate a standard deviation, we took the root-mean-square of the deviations (only with $n - 1$ in the denominator). To calculate a pooled standard deviation, we take the root-mean-square of all the deviations from both groups (only with $n_1 + n_2 - 2$ in the denominator). Essentially, the pooled standard deviation is an average of the standard deviations in the two groups, weighted by the number of observations in each group.

19 The pooled standard deviation and the standard error
Letting $SD_p$ denote our pooled standard deviation, our estimated standard error is
$$SE_d = \sqrt{\frac{SD_p^2}{n_1} + \frac{SD_p^2}{n_2}} = SD_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$
This equation is similar to our earlier one, only now the amount by which the SE is reduced in comparison to the SD depends on the sample size in each group.
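When only group summaries are available, the pooled standard deviation can be recovered from the two SDs, since the sum of squared deviations in each group equals $(n - 1)\,SD^2$. The sketch below applies this to the diarrhea study; the function names are illustrative.

```python
import math

def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Root-mean-square of all deviations, with n1 + n2 - 2 in the denominator."""
    sum_sq = (n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2
    return math.sqrt(sum_sq / (n_1 + n_2 - 2))

def pooled_se(sd_p, n_1, n_2):
    """Standard error of the difference in means under a common SD."""
    return sd_p * math.sqrt(1 / n_1 + 1 / n_2)

sd_p = pooled_sd(254, 84, 197, 85)
se_d = pooled_se(sd_p, 84, 85)
print(round(sd_p), round(se_d, 1))
```

The pooled value falls between the two group SDs (197 and 254), as a weighted average should.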

20 Sample size and standard error
So, let's say that we have 50 subjects in one group and 10 subjects in the other group, and we have enough money to enroll 20 more people in the study. To reduce the SE as much as possible, should we assign them to the group that already has 50, or the group that only has 10? Let's check the factor multiplying $SD_p$ in each case:
$$\sqrt{\frac{1}{70} + \frac{1}{10}} = 0.34, \qquad \sqrt{\frac{1}{50} + \frac{1}{30}} = 0.23.$$

21 The advantages of balanced sample sizes
This example illustrates an important general point: the greatest improvement in accuracy/reduction in standard error comes when the sample sizes of the two groups are balanced. Occasionally, it is much easier (or cheaper) to obtain (or assign) subjects in one group than in the other. In these cases, one often sees unbalanced sample sizes. However, even in these cases, it is rare to see a ratio that exceeds 3:1. The reason is that the study runs into diminishing returns: no matter how much you reduce the standard error of $\bar{x}_1$, the standard error of $\bar{x}_2$ will still be there.
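The allocation example and the diminishing-returns argument can both be checked numerically. The sketch below computes the factor $\sqrt{1/n_1 + 1/n_2}$ that multiplies the pooled SD; the function name `se_factor` and the particular sample sizes in the loop are illustrative choices.

```python
import math

def se_factor(n_1, n_2):
    """Multiplier applied to the pooled SD to obtain the standard error."""
    return math.sqrt(1 / n_1 + 1 / n_2)

# Enroll 20 more subjects: add them to the large group or the small one?
add_to_large = se_factor(70, 10)
add_to_small = se_factor(50, 30)
print(round(add_to_large, 2), round(add_to_small, 2))

# Diminishing returns: hold the second group at 10 and grow the first.
floor = math.sqrt(1 / 10)   # the factor can never drop below this
for n_1 in (10, 30, 100, 1000):
    print(n_1, round(se_factor(n_1, 10), 3), "floor:", round(floor, 3))
```

Growing only one group pushes the factor toward a floor set entirely by the other group's size, which is why extreme imbalance buys so little.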

22 The degrees of freedom of the two-sample t-test
We have said that if we make the assumption of equal standard deviations, then we only need to worry about the variability of $\bar{x}_1 - \bar{x}_2$ and the variability in our estimate of the common standard deviation. How variable is our estimate of the common standard deviation? Well, it's now based on $n_1 + n_2 - 2$ degrees of freedom (we lose one degree of freedom for each mean that we calculate). Thus, to perform inference, we will look up results on Student's curve with $n_1 + n_2 - 2$ degrees of freedom.

23 Student's t-test: procedure
The procedure of Student's two-sample t-test should look quite familiar:
#1 Estimate the standard error: $SE_d = SD_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
#2 Calculate the test statistic: $t = \frac{d - d_0}{SE_d}$
#3 Calculate the area under Student's curve with $n_1 + n_2 - 2$ degrees of freedom outside $\pm t$

24 Student's t-test: example
For the diarrhea study, the pooled standard deviation is 227 ml/kg:
#1 Estimate the standard error: $SE_d = 227 \sqrt{\frac{1}{84} + \frac{1}{85}} = 34.9$
#2 Calculate the test statistic: $t = \frac{78}{34.9} = 2.23$
#3 For Student's curve with $84 + 85 - 2 = 167$ degrees of freedom, only 2.7% of the area lies outside $\pm 2.23$
Thus, there is only a 2.7% probability that the sample means would have been this far apart if bismuth salicylate did nothing to reduce diarrhea.
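The three steps can be traced in code as a rough check. In the sketch below, the tail area is obtained by numerically integrating Student's density with Simpson's rule, which is an assumed stand-in for a table or statistics library; the function name `t_two_sided_p` and the step count are illustrative.

```python
import math

def t_two_sided_p(t, df, steps=2000):
    """Area under Student's curve outside ±t (Simpson's rule; steps must be even)."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    inside = 2 * total * h / 3   # area between -|t| and +|t|, by symmetry
    return 1.0 - inside

n_1, n_2 = 84, 85
sd_p = 227                       # pooled SD quoted on the slide
d = 260 - 182                    # observed difference in means

se_d = sd_p * math.sqrt(1 / n_1 + 1 / n_2)   # step 1
t_stat = d / se_d                            # step 2
df = n_1 + n_2 - 2
p_value = t_two_sided_p(t_stat, df)          # step 3

print(round(se_d, 1), round(t_stat, 2), df, round(p_value, 3))
```

The standard error and test statistic reproduce the slide's 34.9 and 2.23, and the tail area should land close to the quoted 2.7%.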

25 Student's t-test, Welch's t-test, and the permutation test
Note that Student's t-test agrees quite well with our permutation test from earlier, and with Welch's test:
Permutation: p = 0.028
Student's: p = 0.027
Welch's: p = 0.026
This is usually the case when the sample sizes are reasonably large: the approximations work well, agreeing both with each other and with exact approaches.

26 Confidence intervals: Procedure
The p-value indicates that we should rule out 0 as the likely effect size for bismuth salicylate. But we should always be interested in confidence intervals to assess the clinical significance of our findings. The procedure for calculating confidence intervals is straightforward:
#1 Estimate the standard error: $SE_d = SD_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
#2 Determine the values that contain the middle x% of Student's curve with $n_1 + n_2 - 2$ degrees of freedom; denote these values $\pm t_{x\%,\,n_1+n_2-2}$
#3 Calculate the confidence interval: $\left(d - t_{x\%,\,n_1+n_2-2}\,SE_d,\; d + t_{x\%,\,n_1+n_2-2}\,SE_d\right)$

27 Confidence intervals: Example
For the diarrhea study:
#1 The standard error is $SE_d = 227 \sqrt{\frac{1}{84} + \frac{1}{85}} = 34.9$, and the difference between the two means was $260 - 182 = 78$
#2 The values $\pm 1.97$ contain the middle 95% of Student's curve with $84 + 85 - 2 = 167$ degrees of freedom
#3 Thus, the 95% confidence interval is: $(78 - 1.97(34.9),\; 78 + 1.97(34.9)) = (9, 147)$
So, although we can rule out no effect, it's possible that bismuth salicylate has only a slight effect on diarrhea (9 is 3% of 260), but it's also possible that it has a major effect (147 is 57% of 260).
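The interval arithmetic is simple enough to verify directly. The sketch below takes the difference, standard error, and critical value quoted on the slide as given; the function name `t_confidence_interval` is illustrative.

```python
def t_confidence_interval(d, se_d, t_crit):
    """Confidence interval for a difference in means: d ± t_crit * SE_d."""
    half_width = t_crit * se_d
    return d - half_width, d + half_width

# Values quoted on the slide: d = 78, SE_d = 34.9, critical value 1.97.
lower, upper = t_confidence_interval(d=78, se_d=34.9, t_crit=1.97)
print(round(lower), round(upper))
```

Passing a larger critical value (for a higher confidence level) widens the interval symmetrically around the observed difference.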

28 Should I use a permutation test or a t-test?
(Section outline: Permutation tests vs. t-tests; Student's test vs. Welch's test)
Approaches which do not depend on assuming that the sampling distribution follows a certain shape are called nonparametric approaches. The permutation test can be a very valuable alternative to the t-test, making many fewer assumptions. The trade-off, however, is that it is less powerful.

29 Power in small sample sizes
For example, consider the following made-up data: the response in one group is 1, 2, 3, while the response in the other group is 101, 102, 103. The t-test has no difficulty rejecting the null hypothesis: its p-value is far below 0.0001. However, the permutation test only comes up with a p-value of 0.1. (Don't read too much into this, however: the difference in power is far less dramatic when the sample size is larger.)
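With only six observations, every possible split can be enumerated, which shows where the 0.1 comes from. The sketch below assumes a two-sided test on the absolute difference in means; the function name `exact_permutation_p` is illustrative.

```python
from itertools import combinations

def exact_permutation_p(group_a, group_b):
    """Exact two-sided permutation p-value by enumerating every split."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    n_splits = 0
    n_extreme = 0
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits

p_exact = exact_permutation_p([1, 2, 3], [101, 102, 103])
print(p_exact)
```

There are only 20 ways to split six values into two groups of three, and just two of them (the observed split and its mirror image) are as extreme as the data, so no two-sided permutation p-value below 2/20 = 0.1 is possible at this sample size.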

30 However, this is something of a catch-22: small samples are precisely the situations in which t-tests are least reliable! This dilemma illustrates an important general trend in statistics: the more data you have, the fewer assumptions you have to make. Because of this, it is difficult to say anything too conclusively when the sample size is very small.

31 Should I use Student's test or Welch's test?
As we saw earlier, Student's test and Welch's test were basically the same when the sample sizes were reasonably large. In the diarrhea example, even though the standard deviations differed by about 30%, violating Student's assumption of equal standard deviations had almost no effect on the result. However, if the sample sizes are small, this is not the case. What does small mean? Let's run some simulations and see what happens.

32 Power vs. sample size: population SDs equal
[Figure: power of Student's and Welch's tests vs. sample size per group]

33 Type I error rate vs. sample size: ratio of SDs = 4
[Figure: type I error rate of Student's and Welch's tests vs. sample size per group]

34 Type I error rate vs. ratio of standard deviations
5 subjects in each group.
[Figure: type I error rate of Student's and Welch's tests vs. ratio of standard deviations]

35 Type I error rate and unequal sample sizes
For 20 total subjects and a ratio of standard deviations equal to 4.
[Figure: type I error rate of Student's and Welch's tests vs. number in group 1]
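The simulation code behind these figures is not shown in the slides. The sketch below is a simplified stand-in for the last scenario only: it simulates Student's test (not Welch's) with 20 total subjects, normal data, and an SD ratio of 4, under a true null hypothesis. The 5/15 split, the seed, the number of simulations, and the function names are all assumptions; 2.101 is the two-sided 5% critical value of Student's curve with 18 degrees of freedom.

```python
import math
import random

T_CRIT_18 = 2.101   # two-sided 5% critical value for 18 degrees of freedom

def student_t(x, y):
    """Student's two-sample t statistic using the pooled SD."""
    n_x, n_y = len(x), len(y)
    mean_x, mean_y = sum(x) / n_x, sum(y) / n_y
    sum_sq = (sum((v - mean_x) ** 2 for v in x)
              + sum((v - mean_y) ** 2 for v in y))
    sd_p = math.sqrt(sum_sq / (n_x + n_y - 2))
    return (mean_x - mean_y) / (sd_p * math.sqrt(1 / n_x + 1 / n_y))

def type_1_error(n_1, sd_1, n_2, sd_2, n_sim=5000, seed=1):
    """Fraction of null simulations (n_1 + n_2 must be 20) rejected at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Both groups have mean 0, so every rejection is a false positive.
        x = [rng.gauss(0, sd_1) for _ in range(n_1)]
        y = [rng.gauss(0, sd_2) for _ in range(n_2)]
        if abs(student_t(x, y)) > T_CRIT_18:
            rejections += 1
    return rejections / n_sim

small_group_noisy = type_1_error(5, 4.0, 15, 1.0)
large_group_noisy = type_1_error(15, 4.0, 5, 1.0)
print(small_group_noisy, large_group_noisy)
```

When the smaller group is the more variable one, the pooled SD understates the true standard error and the false-positive rate rises well above the nominal 5%; when the larger group is the more variable one, the test becomes overly conservative instead.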

36 Conclusions
If $n_1$ and $n_2$ are the same and both are above 10, the two t-tests are usually very similar. If $n_1$ and $n_2$ are below 5, Student's t-test is a bit more powerful than Welch's (about 10-40% more). If $n_1$ and $n_2$ are different, and the standard deviations in the two groups are different, and you're planning on using Student's t-test, watch out!
