Data Analysis and Statistical Methods, Statistics 651
http://www.stat.tamu.edu/~suhasini/teaching.html
Suhasini Subba Rao

Comparing populations

Suppose I want to compare the heights of males and females at A&M. I can consider all boys at Texas A&M as one population and all girls at Texas A&M as another population.

Question 1: Is the mean girl height less than the mean boy height?

Question 2: What is the difference between the mean girl height and the mean boy height? How much taller are boys than girls?

As I do not have data from the entire student population, I can use the data from a class.

Suggestion: Compare the sample mean of the male heights with the sample mean of the female heights.

Female heights: 5.33 5.33 5.17 5.75 5.42 5.42 5.50 5.50 5.58 5.33 5.50 5.67 5.42 5.25 6.17 5.42 5.33 5.17 5.42 5.42 5.42 5.42 5.42 5.83 5.33 5.67 5.33 5.66 5.25 5.75 5.57 5.35 5.42 5.08 5.75 5.33 5.08

Male heights: 5.75 5.92 6.17 6.08 5.58 5.92 6.00 5.75 5.92 5.75 5.75 5.83 6.58 6.00 5.75 6.42 6.50 6.17 6.00 5.67 5.58 5.83 5.58 5.58 6.08 5.67 6.00

Let X be the height of a randomly selected female and Y the height of a randomly selected male. There are n = 37 girls and m = 27 boys in the samples. The sample mean for the girls is X̄ = (1/37) Σ_{i=1}^{37} X_i = 5.45 and the sample mean for the boys is Ȳ = (1/27) Σ_{i=1}^{27} Y_i = 5.92.

Let µ_X be the female population mean height and µ_Y be the male population mean height. We are interested in the quantity µ_X − µ_Y. It will tell us how much larger or smaller one mean is, or whether the male and female mean heights are equal. Of course, we do not know the difference µ_X − µ_Y, and need to infer something about it from the samples. Intuitively, it is obvious that to see whether µ_X and µ_Y are equal, we need to compare the sample averages X̄ and Ȳ and look at their difference X̄ − Ȳ.

What can the difference in the sample means say about the difference in the true means, that is µ_X − µ_Y (population mean of females − population mean of males)?
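Since the raw heights are listed above, the sample means can be checked with a short Python sketch. This uses only the standard library; the variable names are my own.

```python
# Sketch: reproducing the two sample means from the class height data.
from statistics import mean

female = [5.33, 5.33, 5.17, 5.75, 5.42, 5.42, 5.50, 5.50, 5.58, 5.33,
          5.50, 5.67, 5.42, 5.25, 6.17, 5.42, 5.33, 5.17, 5.42, 5.42,
          5.42, 5.42, 5.42, 5.83, 5.33, 5.67, 5.33, 5.66, 5.25, 5.75,
          5.57, 5.35, 5.42, 5.08, 5.75, 5.33, 5.08]
male = [5.75, 5.92, 6.17, 6.08, 5.58, 5.92, 6.00, 5.75, 5.92, 5.75,
        5.75, 5.83, 6.58, 6.00, 5.75, 6.42, 6.50, 6.17, 6.00, 5.67,
        5.58, 5.83, 5.58, 5.58, 6.08, 5.67, 6.00]

n, m = len(female), len(male)          # n = 37 girls, m = 27 boys
xbar, ybar = mean(female), mean(male)  # X-bar = 5.45, Y-bar = 5.92 (to 2 dp)
diff = xbar - ybar                     # estimate of mu_X - mu_Y; negative here
```

Rounding to two decimal places reproduces the values quoted in the notes.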
We would expect the population mean of females to be less than the population mean of males; in other words, (population mean of females) − (population mean of males) should be less than zero. Hence we would be interested in testing H_0: µ_X − µ_Y ≥ 0 against H_A: µ_X − µ_Y < 0. We also want to know the magnitude of the difference; this means constructing a CI for µ_X − µ_Y.

Clearly, if X̄ − Ȳ > 0 we would be unable to reject the null (why? remember X̄ − Ȳ has to point in the same direction as the alternative). But if X̄ − Ȳ < 0, then we can use a statistical test. The question is how to make the comparison; that is, what is the distribution of X̄ − Ȳ? We look at this now.

Aims: comparing male and female heights

- Build a confidence interval for the mean difference µ_X − µ_Y (this will tell us where the mean difference lies and is very informative).

- Test the hypothesis H_0: µ_X − µ_Y ≥ 0 (mean female height is the same as or greater than mean male height) against the alternative H_A: µ_X − µ_Y < 0.

- We can also test H_0: µ_X − µ_Y ≥ −0.3 against H_A: µ_X − µ_Y < −0.3. This is essentially testing whether boys tend on average to be more than 0.3 feet taller than girls. This situation can also arise.

We will consider both constructing CIs for the difference between the means and hypothesis testing. We show how to do this both by hand and by reading the output in JMP. It is important to understand both. Below we consider the assumptions that are required to do the test, and also the details. The details may appear overwhelming, but do not be deterred by them.

To do any test, i.e. H_0: µ_X − µ_Y ≥ 0 against H_A: µ_X − µ_Y < 0, or H_0: µ_X − µ_Y ≥ −0.3 against H_A: µ_X − µ_Y < −0.3, or to construct a CI for µ_X − µ_Y, we need three magical ingredients:

- The difference of the sample averages: X̄ − Ȳ.
- The standard error of X̄ − Ȳ (this will turn out to be sqrt(σ²/n + σ²/m)).
- The sample sizes m and n should be relatively large.
Formal: comparing populations

We have two samples from two different populations. That is, X_1, ..., X_n is a size-n sample (e.g. the heights of females in the 651 class) from population 1 (e.g. the heights of all females), and Y_1, ..., Y_m is a size-m sample (e.g. the heights of males in the 651 class) from population 2 (e.g. the heights of all males). The mean of population 1 is µ_X (e.g. the mean height of a female) and the mean of population 2 is µ_Y (e.g. the mean height of a male). Given the samples, we want to make inference about the difference µ_X − µ_Y.

It is clear this is an important question. Other examples include: Does a new therapy work better than the old therapy? Is there a difference in the performance of one school over another? Is one housing material better than another? On average, does eating healthy food mean you live longer? On average, if one studies more, do they get better grades? In the above examples, what are the different populations and samples? All these questions are important and can lead to quite important decisions, therefore it is important that we do a careful analysis.

To construct CIs and do a hypothesis test we use an independent two-sample t-test. To do this test we have to ensure the data satisfies the assumptions below.

Assumptions and how to check them

- We have two samples from two different populations, X_1, ..., X_n and Y_1, ..., Y_m (sample sizes n and m respectively).

- Both samples are independent of each other and independent within the sample. For example, the values X_1, ..., X_n should have no influence on Y_1, ..., Y_m, and X_1 should not have any influence on X_2, ..., X_n. Can you think of examples when this may not be true? For observations taken over time, those taken close together in time are likely to be close in value. Checking for independence can be difficult, though there are methods available.

- The variances of the two populations should be the same. In practice this may not hold, but the variances do not have to be strictly the same so long as the samples have similar sizes (see Ott and Longnecker, page 275).
- Make a boxplot of both samples and check whether the variation is the same. We can also do a test to see whether the variances of the two populations are the same (we do this later).

- If n and m are small, the observations X_1, ..., X_n and Y_1, ..., Y_m should be close to normal. If n and m are large, normality of the observations does not matter (this is the same as in the one-sample tests). When n and m are small, make a QQ-plot. There may also be good reasons why the original data is normal.

- The variances of both populations need to be about the same. That is, var(X_i) = σ²_X and var(Y_i) = σ²_Y, and we must have σ²_X = σ²_Y.
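As a quick numerical companion to the boxplot check, a common rule of thumb (a convention, not from these notes) is that the larger sample standard deviation should be no more than about twice the smaller. A sketch, using the sample variances for the height data that are quoted later in the notes:

```python
from math import sqrt

# Sample variances for the female (X) and male (Y) heights.
s2_x, s2_y = 0.0484, 0.0758

# Ratio of the larger to the smaller sample standard deviation.
ratio = sqrt(max(s2_x, s2_y) / min(s2_x, s2_y))

# Rule of thumb: treat the population variances as roughly equal
# if this ratio is below 2.
equal_var_ok = ratio < 2
```

Here the ratio is about 1.25, so the equal-variance assumption looks reasonable for the height data.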
Comparing the means of the populations

We do not have the populations available, only the samples, and we have to base our conclusions on the samples. To make inference about the population means we look at the difference between the sample means: X̄ − Ȳ. We are interested in:

- Constructing a confidence interval for the difference in the population means. The CI will tell us how much larger one mean is than the other, or whether they could be similar in value.

- Testing the hypothesis H_0: µ_X − µ_Y = 0 against H_A: µ_X − µ_Y ≠ 0 (or the one-sided versions of this: H_0: µ_X − µ_Y ≥ 0 against H_A: µ_X − µ_Y < 0, or H_0: µ_X − µ_Y ≤ 0 against H_A: µ_X − µ_Y > 0).

But to do any of the above we require more than just X̄ − Ȳ. Remember both X̄ and Ȳ are sample means, hence random variables, and their distributions are centered about the true means µ_X and µ_Y. Therefore X̄ − Ȳ is a random variable too, and its distribution is centered about µ_X − µ_Y. To do anything we require the distribution of X̄ − Ȳ and its standard error, which describes how much spread or error there is in X̄ − Ȳ. We formalise this below. Don't panic, we will go through some examples later on.

Distribution of the difference of the sample means X̄ − Ȳ

If the X_i and Y_i have the same variance σ², and X̄ and Ȳ are close to normal, then the difference of the averages has the following distribution:

X̄ − Ȳ ~ N( µ_X − µ_Y, σ²(1/n + 1/m) ).

Note that σ²(1/n + 1/m) = σ²/n + σ²/m.

Important points:

- The distribution is centered about µ_X − µ_Y, hence X̄ − Ȳ is likely to be close to µ_X − µ_Y. How close depends on the standard error, which is sqrt(σ²(1/n + 1/m)).

- The larger the sample sizes n and m, the smaller the standard error (just like the one-sample case, where we deal with one sample mean X̄, which has standard error sqrt(σ²/n)).

Therefore we can make a Z-transform of the difference X̄ − Ȳ:

( (X̄ − Ȳ) − (µ_X − µ_Y) ) / sqrt(σ²(1/n + 1/m)) ~ N(0, 1).

Of course, in practice σ² will not be known and has to be estimated from the data.
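The standard error formula above is easy to evaluate directly. A minimal sketch (the function name is mine), plugged through with the pooled variance estimate s² = 0.06 that appears later in the height example:

```python
from math import sqrt

def se_diff(sigma2, n, m):
    """Standard error of X-bar minus Y-bar when both populations
    share the common variance sigma2: sqrt(sigma2 * (1/n + 1/m))."""
    return sqrt(sigma2 * (1.0 / n + 1.0 / m))

# Height example: variance estimated by s^2 = 0.06, n = 37 girls, m = 27 boys.
se = se_diff(0.06, 37, 27)   # about 0.062
```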
The distribution, continued

When the variance is unknown and we use the pooled sample variance s², the distribution of the standardised transform using the sample variance is:

( (X̄ − Ȳ) − (µ_X − µ_Y) ) / ( s · sqrt(1/n + 1/m) ) ~ t(n + m − 2).

It has a t-distribution with (n + m − 2) degrees of freedom (that is, a t-distribution where the number of degrees of freedom is the sum of the two sample sizes minus two).

Don't panic! If the samples from both populations are greater than 30, then everything is wonderful, and all we require is X̄ − Ȳ (which we can get from the data), s (the pooled sample standard deviation, which is always given to you) and the sample sizes n and m. With these ingredients you can construct CIs and do tests. If you are really lucky you will not have to evaluate sqrt(s²/n + s²/m) yourself; if you are given output, it will already be there in the JMP output! See how it is all put together in two sample independent t-test JMP.pdf.

Confidence intervals for the difference in the means

At the 100(1−α)% level, the confidence interval for the difference in means is

[ (X̄ − Ȳ) − t_{α/2}(n + m − 2) · s · sqrt(1/n + 1/m), (X̄ − Ȳ) + t_{α/2}(n + m − 2) · s · sqrt(1/n + 1/m) ].

Examples: The 95% CI in the case that n = 21 and m = 31 is

[ (X̄ − Ȳ) − t_{0.025}(50) · s · sqrt(1/21 + 1/31), (X̄ − Ȳ) + t_{0.025}(50) · s · sqrt(1/21 + 1/31) ].

The 99% CI in the case that n = 21 and m = 31 is

[ (X̄ − Ȳ) − t_{0.005}(50) · s · sqrt(1/21 + 1/31), (X̄ − Ȳ) + t_{0.005}(50) · s · sqrt(1/21 + 1/31) ].

You will need to look up t_{0.025}(50) and t_{0.005}(50) in the t-tables.
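The CI formula can be sketched in code as follows (the function name is mine). Here the normal critical value 1.96 stands in for the t-value, which the notes allow when the degrees of freedom are large; for small samples you would look up t_{α/2}(n + m − 2) in the tables instead.

```python
from math import sqrt

def ci_diff(xbar, ybar, s2, n, m, crit):
    """CI for mu_X - mu_Y: (X-bar - Y-bar) +/- crit * s * sqrt(1/n + 1/m),
    where crit is t_{alpha/2}(n+m-2) or the normal value for large df."""
    half_width = crit * sqrt(s2 * (1.0 / n + 1.0 / m))
    d = xbar - ybar
    return d - half_width, d + half_width

# Height example: X-bar = 5.45, Y-bar = 5.92, pooled s^2 = 0.06,
# n = 37, m = 27, z_{0.025} = 1.96.
lo, hi = ci_diff(5.45, 5.92, 0.06, 37, 27, 1.96)   # roughly (-0.59, -0.35)
```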
Choosing the sample sizes

Notice that the length of the interval is small when sqrt(1/n + 1/m) is small. Suppose n + m = 100. Then:

- If n = 50 and m = 50, sqrt(1/50 + 1/50) = 0.2.
- If n = 99 and m = 1, sqrt(1/99 + 1/1) ≈ 1.0.

We see that the variance will be small when n and m are close. Remember, a smaller variance means a better estimator. Therefore, having similar sample sizes (for a given total sample size) is a good thing! We can always assess the quality of the estimated difference X̄ − Ȳ by looking at its variance σ²(1/n + 1/m). As always, the smaller (1/n + 1/m), the better.

Hypothesis testing

Testing H_0: µ_X − µ_Y = 0 against H_A: µ_X − µ_Y ≠ 0. What we need to do:

- Calculate the test statistic under the null:

( (X̄ − Ȳ) − 0 ) / ( s · sqrt(1/n + 1/m) )

- Look this number up in the t-tables with (n + m − 2) degrees of freedom. If (n + m − 2) is large, use the normal tables instead. This will give you the p-value.

- If the p-value is small (say, less than 5%), then we reject the null in favour of the alternative.

Example: heights of students

The two populations are all male and all female heights at A&M. Suppose the population mean female height is µ_X and the population mean male height is µ_Y. We know that there are n = 37 girls and m = 27 boys. The sample mean for the girls is X̄ = (1/37) Σ_{i=1}^{37} X_i = 5.45 and the sample mean for the boys is Ȳ = (1/27) Σ_{i=1}^{27} Y_i = 5.92. The sample variance for the girls is s²_X = (1/(37−1)) Σ_{i=1}^{37} (X_i − X̄)² = 0.0484 and for the boys s²_Y = (1/(27−1)) Σ_{i=1}^{27} (Y_i − Ȳ)² = 0.0758.

Objectives:

- Build a 95% confidence interval for µ_X − µ_Y.
- Do a hypothesis test that the mean male and female heights are the same against the alternative that the mean male height is greater than the mean female height (α = 0.05).
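The effect of splitting a fixed total sample size can be checked numerically; a sketch (the function name is mine):

```python
from math import sqrt

def se_factor(n, m):
    """The factor sqrt(1/n + 1/m) that multiplies sigma in the
    standard error of X-bar minus Y-bar."""
    return sqrt(1.0 / n + 1.0 / m)

balanced = se_factor(50, 50)   # equal split of n + m = 100
lopsided = se_factor(99, 1)    # extremely unequal split of n + m = 100
```

The balanced split gives 0.2 while the lopsided split gives about 1.0, five times larger, which is why similar sample sizes are preferred.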
Checking the assumptions for the height data (to do an independent-sample t-test)

- Unless many of the students in the 651 class were related, it is reasonable to assume that the observations are independent.

- The sample standard deviations are s_X = 0.22 and s_Y = 0.275, which are close.

- Below we make boxplots and QQ-plots.

[Figure: boxplots of the male and female heights. The standard deviations s_X = 0.22 and s_Y = 0.275 are quite similar, and this is confirmed by the boxplots: the interquartile ranges in the two plots look similar.]

[Figure: normal QQ-plots, female heights in the top plot and male heights in the lower plot. The data looks close to normal (in a hand-wavey sense). The sample sizes of 27 and 37 are quite large, so I think we can stick with the normality assumption. There does seem to be one huge outlier in the female plot and a few male outliers, which we need to keep in mind.]

We now do the test by hand, and compare it with the JMP output in two sample independent t-test JMP.pdf. The ingredients we need are:

The pooled sample variance:

s² = ( (37 − 1) × 0.0484 + (27 − 1) × 0.0758 ) / (37 + 27 − 2) = (1.74 + 1.97) / 62 = 0.06.

Don't worry about how this was obtained.
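The pooled variance computation above can be reproduced in a couple of lines (a sketch; the function name is mine):

```python
def pooled_variance(s2_x, n, s2_y, m):
    """Pooled estimate of the common variance sigma^2: a weighted
    average of the two sample variances, with weights (n-1) and (m-1)."""
    return ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)

# Height example: s_X^2 = 0.0484 with n = 37, s_Y^2 = 0.0758 with m = 27.
s2 = pooled_variance(0.0484, 37, 0.0758, 27)   # about 0.06
```

Note that the pooled estimate sits between the two individual sample variances, closer to the one from the larger sample.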
t_{α/2}(n + m − 2) = t_{0.025}(62). The t-distribution with 62 degrees of freedom is not in the tables. Either use t_{0.025}(60), or, since 62 is quite large, use the normal approximation: z_{0.025} = 1.96.

X̄ − Ȳ = 5.45 − 5.92 = −0.47.

The confidence interval for the heights

The 95% CI is

[ −0.47 − 1.96 · sqrt(0.06 (1/27 + 1/37)), −0.47 + 1.96 · sqrt(0.06 (1/27 + 1/37)) ]
= [ −0.47 − 1.96 × 0.062, −0.47 + 1.96 × 0.062 ] = [−0.59, −0.34].

Zero is not contained in the above interval. So it seems that Texas A&M boys tend to be taller than Texas A&M girls. With 95% confidence, the difference in mean heights seems to lie in the interval [−0.59, −0.34].

Hypothesis test for the heights

This is closely related to what we did above. We want to test H_0: µ_X − µ_Y ≥ 0 against H_A: µ_X − µ_Y < 0.

Note that whether the test is a left-sided or a right-sided test depends on how you choose to order µ_X and µ_Y: either µ_X − µ_Y or µ_Y − µ_X. This becomes even more important when you do the test in JMP. JMP automatically selects whether it is considering the difference µ_X − µ_Y or µ_Y − µ_X, and this depends on how you code the levels (for example, for the male/female data it depends on how you code the male and female categories). But from the output you should be able to see which way it takes the difference. In "Means for Oneway Anova", you will see that JMP gives the sample mean for each level (you should know what the levels correspond to); for example, in the height example 0 is male and 1 is female, the mean for level 0 is 5.92 and the mean for level 1 is 5.45. In the t-test, it will give the Difference; for example, the difference for the height example is −0.466, hence you can see that JMP is evaluating level 1 − level 0, i.e. it formulates the test in terms of µ_X − µ_Y, hence you should state your hypotheses in terms of the difference µ_X − µ_Y. JMP also gives you a clue: just below "t-test" it states "1-0", which means that it formulates the test as level 1 − level 0.

We assume the null for now and construct the test statistic. Under the null we have

( X̄ − Ȳ ) / ( s · sqrt(1/37 + 1/27) ) ~ t(37 + 27 − 2).
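The observed value of this test statistic can be computed directly from the summary numbers above (a sketch using the rounded values from the notes):

```python
from math import sqrt

xbar, ybar = 5.45, 5.92   # sample means for the girls and the boys
s2, n, m = 0.06, 37, 27   # pooled variance and the two sample sizes

# Observed t-statistic under the null hypothesis mu_X - mu_Y = 0.
t_stat = (xbar - ybar) / sqrt(s2 * (1.0 / n + 1.0 / m))
```

The value is about −7.6, far out in the left tail of a t(62) (or standard normal) distribution, so the p-value is essentially zero.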
We do the calculation:

−0.47 / sqrt(0.06 (1/37 + 1/27)) ≈ −7.6.

We don't have t(62) in the tables, so we approximate with a normal distribution. Suppose Z ~ N(0, 1); then P(Z ≤ −7.6) ≈ 0. So the p-value is really small, and for pretty much all values of α we reject the null in favour of the alternative: Texas A&M boys tend to be taller than Texas A&M girls.

Example: Diets

Two diets are being compared for effectiveness. 10 volunteers went on Diet I and 10 different volunteers went on Diet II. After one month their weight loss (in kilos) was recorded. The data is given below.

Diet I: 2.9 2.7 3.9 2.7 2.1 2.6 2.2 4.2 5.0 0.7

Diet II: 3.5 2.5 3.8 8.1 3.6 2.5 5.0 2.9 2.3 3.1

Let µ_I be the mean weight loss on Diet I and µ_II be the mean weight loss on Diet II. Test the hypothesis that the diets are different.

Aside: estimating the variance σ² — the pooled sample variance

This is the formula for estimating the common variance σ²:

- Evaluate the sample variance s²_X = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)².
- Evaluate the sample variance s²_Y = (1/(m−1)) Σ_{i=1}^{m} (Y_i − Ȳ)².
- Evaluate the pooled sample variance: s² = ( (n−1)s²_X + (m−1)s²_Y ) / (n + m − 2).

You do not have to memorise this; you just need to know that JMP estimates the population variance σ² using the pooled sample variance above.
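Putting all the ingredients together, the whole calculation can be wrapped in one helper working from summary statistics (a sketch; the function name is mine, and JMP produces the same quantities automatically):

```python
from math import sqrt

def two_sample_t(xbar, s2_x, n, ybar, s2_y, m):
    """Independent two-sample t-test with pooled variance.
    Returns the observed t-statistic and the degrees of freedom;
    compare t against the t(df) tables (or the normal tables when
    df is large) to get a p-value."""
    s2 = ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)   # pooled variance
    se = sqrt(s2 * (1.0 / n + 1.0 / m))                    # standard error
    t = (xbar - ybar) / se
    return t, n + m - 2

# Height example: girls (X) versus boys (Y).
t, df = two_sample_t(5.45, 0.0484, 37, 5.92, 0.0758, 27)
```

For the heights this gives t ≈ −7.6 on 62 degrees of freedom, matching the hand calculation above; the same helper could be applied to the diet data once its sample means and variances are computed.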