Multiple Sample Numerical Data
Analysis of Variance, Kruskal-Wallis test, Friedman test

University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html

Multiple unpaired numerical samples

Suppose we have multiple unpaired samples. The $i$th observation from group $j$ is denoted $Y_{ij}$:

group 1      group 2      ...   group J
Y_{1,1}      Y_{1,2}      ...   Y_{1,J}
Y_{2,1}      Y_{2,2}      ...   Y_{2,J}
...
Y_{n_1,1}    Y_{n_2,2}    ...   Y_{n_J,J}

There are $n_j$ observations in group $j$, so that $N = n_1 + \cdots + n_J$ is the total sample size. An experiment with the same number of observations per group is said to have a balanced design.

Summary statistics: summarize each sample independently of the others.
Graphics: side-by-side boxplots.

Example: Smoking and Heart Rate

We consider the smokers data (Larsen & Marx, case study 12.2.1).

non   light   moderate   heavy
69    55      66         91
52    60      81         72
71    78      70         81
58    58      77         67
59    62      57         95
65    66      79         84

Non-smokers, light smokers, moderate smokers and heavy smokers (six in each group) undertook sustained physical exercise. Their heart rates were measured after resting for three minutes. (The observations are not paired.)

Main question: Is smoking associated with a decrease in fitness level?
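As an illustration, the per-group summary statistics for the smokers data can be computed as follows. This is a minimal sketch in Python; the course itself works in R.

```python
from statistics import mean, stdev

# Smokers data (Larsen & Marx, case study 12.2.1): heart rates after exercise
groups = {
    "non":      [69, 52, 71, 58, 59, 65],
    "light":    [55, 60, 78, 58, 62, 66],
    "moderate": [66, 81, 70, 77, 57, 79],
    "heavy":    [91, 72, 81, 67, 95, 84],
}

# Summarize each sample independently of the others
for name, y in groups.items():
    print(f"{name:9s} n={len(y)}  mean={mean(y):6.2f}  sd={stdev(y):5.2f}")
```

The group means already hint at the answer to the main question: mean heart rate increases from non-smokers to heavy smokers.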
Testing for Equality of Means

Let $\mu_j$ be the population mean for group $j$. Consider testing

H_0: \mu_1 = \cdots = \mu_J  versus  H_1: not all $\mu_j$'s are equal

One-Way Analysis of Variance (ANOVA)

Let $\bar{Y}_{.j}$ and $S_j^2$ denote the $j$th sample mean and variance:

\bar{Y}_{.j} = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij}, \qquad S_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{.j})^2

Let $\bar{Y}_{..}$ denote the (all-groups-combined) grand mean:

\bar{Y}_{..} = \frac{1}{N} \sum_{j=1}^{J} \sum_{i=1}^{n_j} Y_{ij} = \sum_{j=1}^{J} \frac{n_j}{N} \bar{Y}_{.j}

Compute the between-groups sum of squares (aka treatment sum of squares):

SS_T = \sum_{j=1}^{J} n_j (\bar{Y}_{.j} - \bar{Y}_{..})^2

Compute the within-groups sum of squares (aka error sum of squares):

SS_E = \sum_{j=1}^{J} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{.j})^2 = \sum_{j=1}^{J} (n_j - 1) S_j^2

The F-test rejects for large values of

F = \frac{SS_T/(J-1)}{SS_E/(N-J)}

Theory. If the observations are iid normal, then F has the so-called F-distribution with $J-1$ and $N-J$ degrees of freedom.
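To make the formulas concrete, here is a pure-Python sketch (the course itself uses R, e.g. `aov`) computing $SS_T$, $SS_E$ and the F-ratio for the smokers data:

```python
from statistics import mean

# Smokers data: heart rates for the J = 4 groups
samples = [
    [69, 52, 71, 58, 59, 65],   # non
    [55, 60, 78, 58, 62, 66],   # light
    [66, 81, 70, 77, 57, 79],   # moderate
    [91, 72, 81, 67, 95, 84],   # heavy
]

def f_ratio(samples):
    """One-way ANOVA F-ratio: (SS_T/(J-1)) / (SS_E/(N-J))."""
    J = len(samples)
    N = sum(len(y) for y in samples)
    grand = mean(x for y in samples for x in y)
    # between-groups (treatment) sum of squares
    ss_t = sum(len(y) * (mean(y) - grand) ** 2 for y in samples)
    # within-groups (error) sum of squares
    ss_e = sum(sum((x - mean(y)) ** 2 for x in y) for y in samples)
    return (ss_t / (J - 1)) / (ss_e / (N - J))

F_obs = f_ratio(samples)
print(f"F = {F_obs:.2f}")   # to be compared to the F-distribution on 3 and 20 df
```

Here the observed F is about 6.1, well beyond the 5% critical value of the F-distribution with 3 and 20 degrees of freedom, so the test rejects equality of the four means.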
One-way ANOVA table

Traditionally, the computations above are gathered in an ANOVA table:

            Df     SS     MS = SS/Df           F-ratio          p-value
Treatment   J-1    SS_T   MS_T = SS_T/(J-1)    F = MS_T/MS_E    P(F_{J-1,N-J} > F)
Residuals   N-J    SS_E   MS_E = SS_E/(N-J)

Sometimes an ANOVA table also includes the total sum of squares:

SS_Tot = \sum_{j=1}^{J} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{..})^2

Note that SS_Tot = SS_T + SS_E.

One-Way ANOVA model assumptions

The p-value relies on the assumption that, at least approximately,

Y_{ij} \sim N(\mu_j, \sigma^2)

and that these are independent. This is often expressed as

Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}

where the $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$ are the measurement errors,

\mu = \frac{1}{N} \sum_{j=1}^{J} n_j \mu_j

is the grand mean, and $\alpha_j = \mu_j - \mu$ is the $j$th treatment effect. Note that $\sum_{j=1}^{J} n_j \alpha_j = 0$.

Testing $\mu_1 = \cdots = \mu_J$ is equivalent to testing $\alpha_1 = \cdots = \alpha_J = 0$.
Checking Assumptions

We need to check that the $\varepsilon_{ij}$'s satisfy the following properties:
1. They have the same variance (homoscedasticity).
2. They are approximately normally distributed (important if the sample sizes are small).
3. They are independent of each other.

The errors are not available to us. As a substitute we use the residuals:

e_{ij} = Y_{ij} - \bar{Y}_{.j}

1. We use a residual plot to check for homoscedasticity.
2. We use a Q-Q plot of the residuals to check for (approximate) normality.
3. Independence is usually assessed based on how the data was collected.

While the assumption of normality is cured in the large-sample limit (because of the CLT), the assumption of homoscedasticity is not. However, there is a variant (the analog of Welch's t-test) that does not require homoscedasticity.

Tukey's Honest Significant Differences

Although the F-test is significant, the boxplots suggest that there is not much difference between the heart rates of non-smokers and light smokers. We might want to know which groups have different population means. Tukey's honest significant differences method performs all (or some selected) pairwise tests with an overall level of 5%. The level is exact under the same assumptions as in ANOVA.

In our example, only the differences between heavy and light (or non) smokers are statistically significant. This may seem paradoxical, as we accept $\mu_{light} = \mu_{moderate}$ and $\mu_{moderate} = \mu_{heavy}$ while rejecting $\mu_{light} = \mu_{heavy}$.
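As a rough numerical stand-in for the residual plot, one can compute the residuals $e_{ij}$ and compare the group standard deviations directly. This is only a sketch of the diagnostic idea, using the smokers data:

```python
from statistics import mean, stdev

groups = {
    "non":      [69, 52, 71, 58, 59, 65],
    "light":    [55, 60, 78, 58, 62, 66],
    "moderate": [66, 81, 70, 77, 57, 79],
    "heavy":    [91, 72, 81, 67, 95, 84],
}

# Residuals e_ij = Y_ij - group mean (substitutes for the unobservable errors)
residuals = {name: [x - mean(y) for x in y] for name, y in groups.items()}

# Crude homoscedasticity check: compare the group standard deviations
# (a residual plot would show the same information graphically)
sds = {name: stdev(y) for name, y in groups.items()}
for name, s in sds.items():
    print(f"{name:9s} sd = {s:5.2f}")

# Residuals within each group sum to zero by construction
for e in residuals.values():
    assert abs(sum(e)) < 1e-9
```

For these data the four standard deviations are of comparable magnitude, so there is no glaring evidence against equal variances.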
Calibration by bootstrap

The one-way ANOVA F-test and Tukey's HSD both assume (approximate) normality and equal variances. (The Welch ANOVA F-test does not assume equal variances.) However, just as in the two-sample situation, a bootstrap calibration is possible. For the F-test, it works as follows:

1. Start by centering each group, subtracting its sample mean. This places us under the null, where the means are all equal.
2. Let B be a large integer. For b = 1, ..., B do the following:
   (a) For each (centered) group j = 1, ..., J, sample n_j observations with replacement.
   (b) Compute the corresponding F-ratio F_b.
3. The p-value is

\frac{\#\{b : F_b \ge F_{obs}\} + 1}{B + 1}

(roughly) the fraction of bootstrapped statistics $(F_b : b = 1, \ldots, B)$ that are at least as large as the observed statistic $F_{obs}$.

Permutation test

Just as in the two-sample situation, a calibration by permutation is appropriate if we are interested in comparing the group distributions (goodness-of-fit testing) instead of merely comparing the means. This leads to testing

H_0: all J samples have the same distribution, meaning, $(Y_{ij} : i = 1, \ldots, n_j;\ j = 1, \ldots, J)$ are iid
versus
H_1: this is not the case

Let $Z_1, \ldots, Z_N$ denote the concatenated sample $(Y_{ij} : i = 1, \ldots, n_j;\ j = 1, \ldots, J)$. Under the null, the $Z_l$'s are exchangeable because the $Y_{ij}$'s are iid.

A permutation test based on the treatment sum of squares works as follows:

1. For each permutation $\pi$ of $\{1, \ldots, N\}$, form the permuted $j$th sample $(Z_{\pi(l)} : l = N_{j-1}+1, \ldots, N_j)$, where $N_j = n_1 + \cdots + n_j$ (with $N_0 = 0$), and compute

SS_T^\pi = \sum_{j=1}^{J} n_j (\bar{Y}_{.j}^\pi - \bar{Y}_{..})^2

2. The (exact) p-value is the fraction of $SS_T^\pi$ that are greater than or equal to the observed $SS_T^{obs}$.
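The three bootstrap steps above can be sketched directly in Python (a minimal illustration, again on the smokers data; the seed and B are arbitrary choices):

```python
import random
from statistics import mean

def f_ratio(samples):
    """One-way ANOVA F-ratio: (SS_T/(J-1)) / (SS_E/(N-J))."""
    J = len(samples)
    N = sum(len(y) for y in samples)
    grand = mean(x for y in samples for x in y)
    ss_t = sum(len(y) * (mean(y) - grand) ** 2 for y in samples)
    ss_e = sum(sum((x - mean(y)) ** 2 for x in y) for y in samples)
    return (ss_t / (J - 1)) / (ss_e / (N - J))

def bootstrap_f_pvalue(samples, B=2000, seed=0):
    rng = random.Random(seed)
    F_obs = f_ratio(samples)
    # Step 1: center each group to place ourselves under the null
    centered = [[x - mean(y) for x in y] for y in samples]
    count = 0
    for _ in range(B):
        # Step 2(a): resample each centered group with replacement
        boot = [rng.choices(y, k=len(y)) for y in centered]
        # Step 2(b): the bootstrapped F-ratio
        if f_ratio(boot) >= F_obs:
            count += 1
    # Step 3: p-value with the +1 correction
    return (count + 1) / (B + 1)

samples = [
    [69, 52, 71, 58, 59, 65], [55, 60, 78, 58, 62, 66],
    [66, 81, 70, 77, 57, 79], [91, 72, 81, 67, 95, 84],
]
print(bootstrap_f_pvalue(samples))   # small, in line with the classical F-test
```

The resulting p-value is close to the one obtained from the F-distribution, as expected when the normal model is reasonable.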
Even for small sample sizes, computing the exact p-value as above may not be feasible. Indeed, assuming all the $Y_{ij}$'s are distinct, there are

\binom{N}{n_1, \ldots, n_J} = \frac{(n_1 + \cdots + n_J)!}{n_1! \cdots n_J!}

non-equivalent permutations, a number that is very large even for small sample sizes. When this is the case, we estimate the p-value by Monte Carlo sampling: we sample B permutations (B a large integer) uniformly at random.

Kruskal-Wallis Test

The Kruskal-Wallis test is a rank-based method, and as such is a form of permutation test. The null hypothesis is the same as with any permutation test: the J samples come from the same distribution (i.e., the same population).

Let $R_{ij}$ denote the overall rank of $Y_{ij}$ and define the rank-sum for group j:

R_{.j} = \sum_{i=1}^{n_j} R_{ij}

The test rejects for large values of

D = \frac{12}{N(N+1)} \sum_{j=1}^{J} \frac{R_{.j}^2}{n_j} - 3(N+1)

If the design is balanced, meaning $n_1 = \cdots = n_J$, this is equivalent to rejecting for large values of $\sum_{j=1}^{J} R_{.j}^2$.
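A sketch of the Kruskal-Wallis statistic in pure Python follows. The slides leave ties aside; this version uses midranks for tied observations (a standard convention, but an assumption beyond the formula above), without the usual tie-correction factor:

```python
def overall_ranks(values):
    """Midranks of `values` within the pooled sample (ties get average rank)."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def kruskal_wallis(samples):
    """Kruskal-Wallis statistic D (no tie-correction factor)."""
    flat = [x for y in samples for x in y]
    ranks = overall_ranks(flat)
    N = len(flat)
    total, idx = 0.0, 0
    for y in samples:
        R_j = sum(ranks[idx:idx + len(y)])   # rank-sum for group j
        total += R_j ** 2 / len(y)
        idx += len(y)
    return 12 / (N * (N + 1)) * total - 3 * (N + 1)

samples = [
    [69, 52, 71, 58, 59, 65], [55, 60, 78, 58, 62, 66],
    [66, 81, 70, 77, 57, 79], [91, 72, 81, 67, 95, 84],
]
D = kruskal_wallis(samples)
print(f"D = {D:.2f}")   # compare to the chi-squared distribution with J-1 = 3 df
```

For the smokers data this gives D of about 10.7, which exceeds the 5% critical value (7.81) of the chi-squared distribution with 3 degrees of freedom, in agreement with the F-test.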
To compare it with one-way ANOVA, note that

D = (N-1)\, \frac{\sum_{j=1}^{J} n_j (\bar{R}_{.j} - \bar{R}_{..})^2}{\sum_{j=1}^{J} \sum_{i=1}^{n_j} (R_{ij} - \bar{R}_{..})^2}

where $\bar{R}_{.j} = R_{.j}/n_j$ and $\bar{R}_{..} = (N+1)/2$ is the grand mean of the ranks. So the total sum of squares appears in the denominator, as opposed to the error sum of squares as in regular one-way ANOVA.

As with any rank-based method, the KW test is distribution-free (important for tabulation in the pre-personal-computer age) and invariant with respect to strictly increasing transformations of the data.

Theory. Under the null, D has asymptotically (as all the sample sizes tend to infinity in proportion to each other) a chi-squared distribution with J-1 degrees of freedom.

Repeated measures: multiple paired numerical samples

Suppose now that we have multiple paired samples. This setting is often referred to as repeated measures, and presented as follows:

            treatment 1   treatment 2   ...   treatment J
subject 1   Y_{1,1}       Y_{1,2}       ...   Y_{1,J}
subject 2   Y_{2,1}       Y_{2,2}       ...   Y_{2,J}
...
subject I   Y_{I,1}       Y_{I,2}       ...   Y_{I,J}

There are I subjects (or experimental units) in the study and each subject receives all J treatments, often over a period of time. This typically leads to a two-way analysis of variance, where the subjects are the blocks. (This is covered later in the slides.)
Friedman test

The Friedman test is a rank-based procedure for repeated measures designs, to test

H_0: for all i = 1, ..., I, $(Y_{i,1}, \ldots, Y_{i,J})$ are exchangeable

with the alternative H_1 being the negation of H_0.

Let $R_{ij}$ be the rank of $Y_{i,j}$ among $(Y_{i,1}, \ldots, Y_{i,J})$. The treatment responses are thus compared within each subject. Define the sum of the ranks for treatment j:

R_{.j} = \sum_{i=1}^{I} R_{ij}

The test rejects for large values of

G = \frac{12}{IJ(J+1)} \sum_{j=1}^{J} R_{.j}^2 - 3I(J+1)

Theory. Under the null, G has asymptotically (as $I \to \infty$) the chi-squared distribution with J-1 degrees of freedom.

Two-way designs

Suppose we have a two-way design with two factors:
Treatment: I levels, indexed by i.
Blocking: J blocks, indexed by j.

These define I × J groups, also called cells, indexed by the pairs $(i,j) : i = 1, \ldots, I;\ j = 1, \ldots, J$. Let $Y_{i,j,k}$ denote the kth observation in cell (i,j). For simplicity, we assume the design is balanced, with the same number n of observations per cell. The observations in each cell, $(Y_{i,j,k} : k = 1, \ldots, n)$, are assumed iid.

Summary statistics: for each group independently.
Graphics: side-by-side boxplots and interaction plots.
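The Friedman statistic is easy to compute by hand. Here is a self-contained sketch on a small hypothetical repeated-measures data set (the numbers are made up for illustration; the slides give no worked example here):

```python
def ranks_within(row):
    """Ranks of the entries of `row` within that row (midranks for ties)."""
    positions = {}
    for pos, v in enumerate(sorted(row), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in row]

def friedman(Y):
    """Friedman statistic G for an I x J table (rows = subjects, cols = treatments)."""
    I, J = len(Y), len(Y[0])
    R = [ranks_within(row) for row in Y]   # ranking is done within each subject
    col_sums = [sum(R[i][j] for i in range(I)) for j in range(J)]   # R_.j
    return 12 / (I * J * (J + 1)) * sum(s ** 2 for s in col_sums) - 3 * I * (J + 1)

# Hypothetical data: I = 4 subjects, each measured under J = 3 treatments
Y = [
    [8.2, 9.1, 10.3],
    [7.5, 8.0,  9.9],
    [6.9, 8.8,  8.4],
    [7.7, 8.5,  9.6],
]
G = friedman(Y)
print(f"G = {G:.2f}")   # compare to the chi-squared distribution with J-1 = 2 df
```

For this table G = 6.5, which exceeds the 5% critical value (5.99) of the chi-squared distribution with 2 degrees of freedom.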
Pictorially, the data is divided into groups indexed by the treatment and blocking variables:

          Treatment 1                Treatment 2                ...   Treatment I
Block 1   Y_{1,1,1}, ..., Y_{1,1,n}  Y_{2,1,1}, ..., Y_{2,1,n}  ...   Y_{I,1,1}, ..., Y_{I,1,n}
Block 2   Y_{1,2,1}, ..., Y_{1,2,n}  Y_{2,2,1}, ..., Y_{2,2,n}  ...   Y_{I,2,1}, ..., Y_{I,2,n}
...
Block J   Y_{1,J,1}, ..., Y_{1,J,n}  Y_{2,J,1}, ..., Y_{2,J,n}  ...   Y_{I,J,1}, ..., Y_{I,J,n}

Example: orange juice vs ascorbic acid in tooth growth

We consider the ToothGrowth data, already loaded in R.
The response is the length of odontoblasts (continuous).
Supplement type: VC = ascorbic acid or OJ = orange juice (categorical).
Dose is either 0.5, 1 or 2 mg (discrete).

Questions:
1. Does the delivery method have any effect on tooth growth when controlling for dosage?
2. Is there any interaction between delivery method and dosage in their effect on tooth growth?

Population means

Let $\mu_{ij}$ denote the population mean of group (i,j). Define

\mu_{..} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \mu_{ij}, \quad \mu_{i.} = \frac{1}{J} \sum_{j=1}^{J} \mu_{ij}, \quad \mu_{.j} = \frac{1}{I} \sum_{i=1}^{I} \mu_{ij}

We have the decomposition

\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}

Grand mean: $\mu$ (same as $\mu_{..}$).
Treatment effects: $\alpha_i = \mu_{i.} - \mu$
Blocking effects: $\beta_j = \mu_{.j} - \mu$
Interaction terms: $\gamma_{ij} = \mu_{ij} - \mu_{i.} - \mu_{.j} + \mu$
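The decomposition is exact by construction, and it is instructive to verify this numerically. A sketch on a hypothetical 2 × 3 table of cell means (the numbers are arbitrary, chosen only for illustration):

```python
# Hypothetical 2 x 3 table of cell means mu_ij (rows = treatments, cols = blocks)
mu = [
    [10.0, 12.0, 17.0],
    [14.0, 13.0, 18.0],
]
I, J = len(mu), len(mu[0])

grand = sum(sum(row) for row in mu) / (I * J)                           # mu_..
row_means = [sum(row) / J for row in mu]                                # mu_i.
col_means = [sum(mu[i][j] for i in range(I)) / I for j in range(J)]     # mu_.j

alpha = [m - grand for m in row_means]                                  # treatment effects
beta  = [m - grand for m in col_means]                                  # blocking effects
gamma = [[mu[i][j] - row_means[i] - col_means[j] + grand
          for j in range(J)] for i in range(I)]                         # interactions

# The decomposition mu_ij = mu + alpha_i + beta_j + gamma_ij holds exactly
for i in range(I):
    for j in range(J):
        assert abs(mu[i][j] - (grand + alpha[i] + beta[j] + gamma[i][j])) < 1e-9
```

The same computation also exhibits the zero-sum constraints on the effects, stated next.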
Note that

\sum_{i=1}^{I} \alpha_i = 0, \quad \sum_{j=1}^{J} \beta_j = 0, \quad \sum_{i=1}^{I} \gamma_{ij} = 0 \ \forall j, \quad \sum_{j=1}^{J} \gamma_{ij} = 0 \ \forall i

Hypotheses about the means

Testing for a treatment effect:

H_T: \alpha_i = \mu_{i.} - \mu_{..} = 0, \ \forall i

This is often the main testing problem considered.

Testing for a blocking effect:

H_B: \beta_j = \mu_{.j} - \mu_{..} = 0, \ \forall j

This is often less important, since the blocking variable is usually known to affect the response variable.

Testing for interactions:

H_{TB}: \gamma_{ij} = \mu_{ij} - \mu_{i.} - \mu_{.j} + \mu_{..} = 0, \ \forall i,j

Sample means

Define

\bar{Y}_{ij.} = \frac{1}{n} \sum_{k=1}^{n} Y_{ijk}, \quad \bar{Y}_{i..} = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{k=1}^{n} Y_{ijk}, \quad \bar{Y}_{.j.} = \frac{1}{nI} \sum_{i=1}^{I} \sum_{k=1}^{n} Y_{ijk}, \quad \bar{Y}_{...} = \frac{1}{nIJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{n} Y_{ijk}

$\bar{Y}_{ij.}$ is the method of moments estimate (MME) for $\mu_{ij}$.
$\bar{Y}_{i..}$ is the MME for $\mu_{i.}$.
$\bar{Y}_{.j.}$ is the MME for $\mu_{.j}$.
$\bar{Y}_{...}$ is the MME for $\mu_{..}$.
Sums of squares

The treatment sum of squares:

SS_T = nJ \sum_{i=1}^{I} (\bar{Y}_{i..} - \bar{Y}_{...})^2

The blocks sum of squares:

SS_B = nI \sum_{j=1}^{J} (\bar{Y}_{.j.} - \bar{Y}_{...})^2

The interactions sum of squares:

SS_{TB} = n \sum_{i=1}^{I} \sum_{j=1}^{J} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2

The error sum of squares:

SS_E = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y}_{ij.})^2

Assume equal variance: $Var(Y_{ijk}) = \sigma^2$ for all i, j, k. We have:

SS       df            E(SS/df)
SS_T     I-1           \sigma^2 + \frac{nJ}{I-1} \sum_i \alpha_i^2
SS_B     J-1           \sigma^2 + \frac{nI}{J-1} \sum_j \beta_j^2
SS_TB    (I-1)(J-1)    \sigma^2 + \frac{n}{(I-1)(J-1)} \sum_{ij} \gamma_{ij}^2
SS_E     (n-1)IJ       \sigma^2
F-tests

For H_T, reject for large values of

F_T = \frac{SS_T/(I-1)}{SS_E/((n-1)IJ)}

For H_B, reject for large values of

F_B = \frac{SS_B/(J-1)}{SS_E/((n-1)IJ)}

For H_TB, reject for large values of

F_{TB} = \frac{SS_{TB}/((I-1)(J-1))}{SS_E/((n-1)IJ)}

Theory. Assume the $Y_{ijk}$'s are independent normal with equal variance.
Under H_T, F_T has the F-distribution with I-1 and (n-1)IJ degrees of freedom.
Under H_B, F_B has the F-distribution with J-1 and (n-1)IJ degrees of freedom.
Under H_TB, F_TB has the F-distribution with (I-1)(J-1) and (n-1)IJ degrees of freedom.

Two-way ANOVA table

Traditionally, the computations above are gathered in an ANOVA table:

              Df            SS      MS = SS/Df            F-ratio             p-value
Treatment     I-1           SS_T    SS_T/(I-1)            F_T = MS_T/MS_E     P(F_{df_T,df_E} > F_T)
Blocking      J-1           SS_B    SS_B/(J-1)            F_B = MS_B/MS_E     P(F_{df_B,df_E} > F_B)
Interactions  (I-1)(J-1)    SS_TB   SS_TB/((I-1)(J-1))    F_TB = MS_TB/MS_E   P(F_{df_TB,df_E} > F_TB)
Residuals     (n-1)IJ       SS_E    SS_E/((n-1)IJ)

Sometimes an ANOVA table also includes the total sum of squares:

SS_Tot = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y}_{...})^2

Note that SS_Tot = SS_T + SS_B + SS_TB + SS_E.

NOTE: sometimes the table is presented differently, and the tests for treatment and blocking are performed as if it were a one-way ANOVA.
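The sums of squares, their decomposition of SS_Tot, and the three F-ratios can be computed in a few lines. A sketch on hypothetical balanced data with I = 2 treatments, J = 3 blocks and n = 2 observations per cell (the numbers are made up; the course would do this in R with `aov`):

```python
from statistics import mean

# Hypothetical balanced two-way data: Y[i][j] holds the n = 2 observations in cell (i, j)
Y = [
    [[10.1, 11.9], [12.2, 13.8], [16.5, 17.5]],
    [[13.6, 14.4], [12.9, 13.1], [18.2, 17.8]],
]
I, J, n = len(Y), len(Y[0]), len(Y[0][0])

cell = [[mean(Y[i][j]) for j in range(J)] for i in range(I)]                # Y_ij.
trt   = [mean(x for j in range(J) for x in Y[i][j]) for i in range(I)]      # Y_i..
blk   = [mean(x for i in range(I) for x in Y[i][j]) for j in range(J)]      # Y_.j.
grand = mean(x for i in range(I) for j in range(J) for x in Y[i][j])        # Y_...

ss_t  = n * J * sum((t - grand) ** 2 for t in trt)
ss_b  = n * I * sum((b - grand) ** 2 for b in blk)
ss_tb = n * sum((cell[i][j] - trt[i] - blk[j] + grand) ** 2
                for i in range(I) for j in range(J))
ss_e  = sum((x - cell[i][j]) ** 2
            for i in range(I) for j in range(J) for x in Y[i][j])

df_e = (n - 1) * I * J
f_t  = (ss_t / (I - 1)) / (ss_e / df_e)
f_b  = (ss_b / (J - 1)) / (ss_e / df_e)
f_tb = (ss_tb / ((I - 1) * (J - 1))) / (ss_e / df_e)

# Sanity check: the four sums of squares decompose the total sum of squares
ss_tot = sum((x - grand) ** 2
             for i in range(I) for j in range(J) for x in Y[i][j])
assert abs(ss_tot - (ss_t + ss_b + ss_tb + ss_e)) < 1e-9
```

Each F-ratio would then be compared to the F-distribution with the degrees of freedom listed in the table above.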
Two-Way ANOVA model assumptions and diagnostics

The p-values rely on the assumption that the $Y_{ijk}$'s are independent with $Y_{ijk} \sim N(\mu_{ij}, \sigma^2)$. This is equivalent to

Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}

where the $\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$ are the measurement errors. Checking the assumptions on the $\varepsilon_{ijk}$'s is done as before, based on the residuals:

e_{ijk} = Y_{ijk} - \bar{Y}_{ij.}

If we are fitting a model without interactions, i.e.

Y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}

then we need to check that this model is appropriate. This can be done via an interaction plot or a residual plot.

Permutation test

A calibration by permutation is also possible for the null where the groups within each block come from the same population. This allows the blocking variable to have an effect on the response, which it typically does, but it does not allow for interactions. Therefore, consider testing

H_0: the observations within each block are iid, meaning, for each j = 1, ..., J, $(Y_{ijk} : i = 1, \ldots, I;\ k = 1, \ldots, n)$ are iid
versus
H_1: this is not the case

Choose a test statistic, for example the treatment sum of squares. Permute the observations within each block, independently of the other blocks. This is done many times (say, B = 2,000 times) and the statistic is computed each time. The p-value is computed as usual.
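The within-block permutation scheme can be sketched as follows, here for the simplest case of one observation per cell (a randomized block design). The data are hypothetical, chosen only to illustrate the mechanics:

```python
import random

def treatment_ss(Y):
    """Treatment sum of squares for a layout Y[i][j] (treatment i, block j, n = 1)."""
    I, J = len(Y), len(Y[0])
    grand = sum(sum(row) for row in Y) / (I * J)
    return J * sum((sum(Y[i]) / J - grand) ** 2 for i in range(I))

def within_block_permutation_pvalue(Y, B=2000, seed=0):
    """Permute observations within each block, independently of the other
    blocks, and recompute the treatment sum of squares each time."""
    rng = random.Random(seed)
    I, J = len(Y), len(Y[0])
    obs = treatment_ss(Y)
    count = 0
    for _ in range(B):
        cols = []
        for j in range(J):
            col = [Y[i][j] for i in range(I)]
            rng.shuffle(col)              # permute within block j only
            cols.append(col)
        perm = [[cols[j][i] for j in range(J)] for i in range(I)]
        if treatment_ss(perm) >= obs:
            count += 1
    return (count + 1) / (B + 1)

# Hypothetical data: rows = I = 3 treatments, columns = J = 4 blocks
Y = [
    [10.2, 12.1,  9.8, 11.5],
    [11.0, 13.4, 10.9, 12.2],
    [13.1, 15.0, 12.6, 14.3],
]
print(within_block_permutation_pvalue(Y))
```

Because the treatment ordering is the same in every block of this example, the observed treatment sum of squares is extreme among within-block rearrangements, and the p-value comes out small.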