Kruskal-Wallis and Friedman type tests for nested effects in hierarchical designs


Assaf P. Oron and Peter D. Hoff
Department of Statistics, University of Washington, Seattle
assaf@u.washington.edu, hoff@stat.washington.edu

Working Paper no. 68
Center for Statistics and the Social Sciences, University of Washington
December 21, 2006

This work was performed under a seed grant from the University of Washington's Center for Statistics and the Social Sciences (CSSS). The authors thank A. Roth for providing the Roth et al. (1991) dataset, and S. Sirakaya for useful discussion.

Abstract

In hierarchical-design experiments researchers need to test for nested effects, but sometimes parametric assumptions are violated. We show that the Kruskal-Wallis (K-W) rank test can be extended to test for such effects. Let $B = b_1, \ldots, b_h$ be the tested effect, and $A = a_1, \ldots, a_g$ be the effect immediately above it in the hierarchy. If one calculates separate K-W statistics for the effect $B$ at each level of $A$, then under $H_0$ for $B$, the sum of all these statistics is asymptotically $\chi^2_{h-g}$. This test is demonstrated using a well-known experimental economics dataset. We also show that the Friedman test can be extended in a similar fashion for designs that are both nested and crossed. In general, the preferable inference approach is to use the label permutation distribution rather than the asymptotic distribution. Code for carrying out such tests has been implemented and is available.

1 Introduction

The motivating application for this problem was encountered during reanalysis of a highly influential four-country economic experiment (Roth et al., 1991). A major question of interest in this experiment was whether cultural differences affect bargaining behavior, as judged from offers in a two-player game called the ultimatum game. Three sessions were held in each of four sites located in four different countries. In each session there were 10 rounds. In each round, buyers were paired with sellers for a single game, in a round-robin pairing. Since the data are skewed and heteroscedastic, the authors performed rank tests and concluded that there were significant country differences in offer amounts.

Upon inspecting the data we noticed that a potential nested session effect was overlooked. But even had such an effect been originally identified, economists and other researchers do not have at their disposal an off-the-shelf, easy-to-use rank test to check whether it is significant. Rank tests are readily available for one-way designs (Kruskal-Wallis or Wilcoxon-Mann-Whitney) and for complete balanced two-way designs (Friedman). The Friedman test can be relatively easily extended to incomplete and/or unbalanced designs (see e.g. Mack and Skillings, 1980), though this option appears to be much less known, except among researchers specializing in nonparametric statistics. For more complex designs or research questions, there are generic frameworks suggesting overall formulae for tests (Akritas et al., 1997; Gao and Alvo, 2005). These provide templates from which an individual test can be derived. Our question of interest may be covered by these frameworks, but the resulting test would be neither as simple nor as accessible to applied researchers as the one described below.

2 Nested Kruskal-Wallis Test

2.1 Asymptotic Theory

For this article's purpose, let a hierarchical experiment design be any design where there is at least one effect (hereafter, the nested effect) whose levels are each observed within exactly one level of the effect above it in the hierarchy (the nesting effect). The design may be purely hierarchical, i.e., all effects are either nested or nesting, or it may be combined with other types. For our discussion, let the nested effect be $B = b_1, \ldots, b_h$, and one nesting effect $A = a_1, \ldots, a_g$, so that $h$ levels of $B$ are nested within $g$ levels of $A$ (trivially, $h > g$). Note that the effect can be a controlled treatment or an uncontrollable factor, e.g. a country as in the motivating example.

It is customary and more convenient to use two indices to denote the nested levels, e.g. $B = b_{i(j)}$, $i(j) = 1, \ldots, h_j$, $j = 1, \ldots, g$. Obviously $\sum_{j=1}^{g} h_j = h$. A parametric ANOVA null hypothesis for $B$ would be $\mu_{i(j)} = \mu_j \;\; \forall i, j$. The analogous Kruskal-Wallis (K-W) null hypothesis is $F_{i(j)} = F_j \;\; \forall i, j$, where $F_{i(j)}$ is the distribution of outcomes for level $b_{i(j)}$. The alternative is that $\exists\, l, i, j$ such that $F_{i(j)} > F_{l(j)}$, with the greater-than sign denoting "stochastically greater", i.e., all quantiles of $F_{i(j)}$ are greater than the corresponding quantiles of $F_{l(j)}$.

Consider a single level of $A$, $a_j$, with $h_j$ levels of $B$, each with $n_{i(j)}$ observations, and $\sum_{i(j)=1}^{h_j} n_{i(j)} = n_j$. Then, under the K-W null hypothesis, the ranks of all $n_j$ observations are i.i.d. discrete uniform. Therefore, for each level of $B$,

$$\sqrt{n_{i(j)}} \left( \bar{r}_{i(j)} - \frac{n_j + 1}{2} \right) \xrightarrow{\; n_{i(j)} \to \infty \;} N\!\left( 0, \frac{(n_j + 1)(n_j - 1)}{12} \right), \tag{1}$$

where $\bar{r}_{i(j)}$ is the sample mean rank of observations for level $b_{i(j)}$. Therefore,

$$\frac{12\, n_{i(j)}}{(n_j + 1)\, n_j} \left( \bar{r}_{i(j)} - \frac{n_j + 1}{2} \right)^2 \xrightarrow{\; n_{i(j)} \to \infty \;} \chi^2_1. \tag{2}$$

Summing (2) over all levels nested in $a_j$, we obtain the ordinary K-W statistic $H$ (Kruskal and Wallis, 1952). Since the ranks are constrained by a constant sum, $H$ is asymptotically $\chi^2_{h_j - 1}$. Now under the K-W null for $B$ as defined above, if we take another level of $A$, $a_k$, and in the absence of any effect crossing the hierarchical structure, the ranks within $a_k$ would be independent of the ranks within $a_j$. Therefore, the analogous K-W statistics would be independent across different levels of $A$. Hence $H^{(nest)}$, the sum of all K-W statistics from different levels of $A$, would have (under the K-W null) the following asymptotic distribution:

$$H^{(nest)} \equiv \sum_{j=1}^{g} H_j = \sum_{j=1}^{g} \sum_{i=1}^{h_j} \frac{12\, n_{i(j)}}{(n_j + 1)\, n_j} \left( \bar{r}_{i(j)} - \frac{n_j + 1}{2} \right)^2 \xrightarrow{\; n_j \to \infty \;} \chi^2_{h-g}. \tag{3}$$

The degrees of freedom follow from $\sum_{j=1}^{g} (h_j - 1) = h - g$. The process for performing a nested K-W test is schematically illustrated in fig. 1.

2.2 Permutation K-W Tests

While establishing asymptotic properties is intrinsically important and useful, often the K-W test is called for in a limited sample-size scenario. Moreover, in many datasets ties are common. There exists a tie correction to the K-W statistic, but it, too, is based upon asymptotic properties. Fortunately, label permutation tests provide an adequate solution. We take the standard permutation test approach, focusing on a single level of $A$, $a_j$: under the K-W null for $B$, all ranks are uniformly distributed and therefore all possible assignments of ranks to $h_j$ subgroups with subgroup sizes $\{n_{i(j)}\}$ are equally likely. This is of course true for all levels of $A$. Thus, randomly generating $M$ distinct sets of group and subgroup labels (keeping the same subgroup sizes as the true ones), calculating $H^{(nest)}_m$, $m = 1, \ldots, M$, and ordering them into $\{H^{(nest)}_{(m)}\}$ provides a representative sample from the true sampling distribution of $H^{(nest)}$. Setting a significance level $\alpha$, we would reject the K-W null hypothesis for $B$ nested in $A$ if $H^{(nest)} > H^{(nest)}_{(M(1-\alpha))}$.
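As a minimal sketch of equation (3), the Python code below computes a separate K-W statistic within each level of A and sums them. It is not the authors' R code. The function names, the dictionary layout of `design`, and the toy numbers are illustrative assumptions; tied values receive average ranks and no tie correction is applied.

```python
def average_ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis(groups):
    """Ordinary K-W statistic H for a list of samples (no tie correction)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, pos = 0.0, 0
    for g in groups:
        mean_rank = sum(ranks[pos:pos + len(g)]) / len(g)
        h += 12 * len(g) / (n * (n + 1)) * (mean_rank - (n + 1) / 2) ** 2
        pos += len(g)
    return h


def nested_kruskal_wallis(design):
    """design maps each level of A to a list of samples, one per nested level of B.

    Returns (H_nest, df), where df = sum over levels of A of (h_j - 1) = h - g.
    """
    h_nest = sum(kruskal_wallis(groups) for groups in design.values())
    df = sum(len(groups) - 1 for groups in design.values())
    return h_nest, df


# Made-up toy data: 2 levels of A, with 2 and 3 nested levels of B.
toy = {
    "a1": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    "a2": [[10.0, 12.0], [11.0, 15.0], [13.0, 14.0]],
}
h_nest, df = nested_kruskal_wallis(toy)
print(h_nest, df)
```

Ranking is done separately within each level of A, which is what makes the per-level statistics independent under the null.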

Figure 1: Schematic illustration of the nested K-W test. Some individual data points are labeled; this requires a minimum of 3 indices. The K-W statistic H is calculated separately for each level of the nesting effect A, comparing points grouped by the nested effect B. The overall test is carried out on the sum of all such statistics. A label permutation test for B would involve randomly re-assigning points to the groups within each level of A, and calculating H (nest) for each such assignment. In the data example, effect A is country, effect B is session, and data points for the test are player summaries.
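The label permutation procedure of Section 2.2 can be sketched along the following lines. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: relabelings are drawn at random (so repeats are possible, rather than M strictly distinct sets), the p-value uses the common (count + 1) / (M + 1) convention, which the paper does not prescribe, and all names and toy data are hypothetical.

```python
import random


def average_ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def h_from_ranks(ranks, sizes):
    """K-W statistic from pooled ranks split into consecutive subgroups."""
    n = len(ranks)
    h, pos = 0.0, 0
    for size in sizes:
        mean_rank = sum(ranks[pos:pos + size]) / size
        h += 12 * size / (n * (n + 1)) * (mean_rank - (n + 1) / 2) ** 2
        pos += size
    return h


def nested_kw_permutation_test(design, n_perm=2000, seed=0):
    """Permutation p-value for B nested in A.

    design maps each level of A to a list of samples (one per level of B).
    Labels are shuffled only within a level of A; subgroup sizes are kept.
    """
    rng = random.Random(seed)
    ranked = []
    for groups in design.values():
        pooled = [x for g in groups for x in g]
        # Ranks within a level of A do not change under relabeling: rank once.
        ranked.append((average_ranks(pooled), [len(g) for g in groups]))
    observed = sum(h_from_ranks(r, s) for r, s in ranked)
    at_least_as_large = 0
    for _ in range(n_perm):
        total = 0.0
        for ranks, sizes in ranked:
            shuffled = ranks[:]
            rng.shuffle(shuffled)
            total += h_from_ranks(shuffled, sizes)
        if total >= observed - 1e-12:  # tolerance for floating-point ties
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Made-up toy data: 2 levels of A, with 2 and 3 nested levels of B.
toy = {
    "a1": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    "a2": [[10.0, 12.0], [11.0, 15.0], [13.0, 14.0]],
}
observed, p_value = nested_kw_permutation_test(toy)
print(observed, p_value)
```

Because the reference distribution is built from the data's own ranks, ties and small samples are handled without relying on the chi-square approximation.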

2.3 Data Example

As stated in the introduction, the ultimatum game experiment of Roth et al. (1991) was designed, among other things, to test for country differences in bargaining behavior. It took place in 1989-1990, in four cities around the world: Jerusalem (Israel and Palestinian West Bank), Tokyo (Japan), Pittsburgh (U.S.A.) and Ljubljana (formerly Yugoslavia, nowadays Slovenia). Three sessions were run in each city. In each session there were 10 buyers and 10 sellers, matched with each other for a round-robin game of 10 rounds. [1] In each round and for each buyer-seller encounter, the buyer suggested a division of a fixed sum of real money with the seller. If the seller accepted the offer, the division was carried out. If the seller refused, neither player received any money. All interactions were indirect and anonymous.

The authors recognized that in general there is a dependence within any given buyer's offers. Therefore, they used only a single value from each buyer for inference: the round 10 offer. They also assumed that these offers are independent of session assignment, yielding a test sample size of 116. Descriptive statistics of this sample seem to suggest significant country differences (fig. 2). The authors' inference method was pairwise Mann-Whitney tests, which yielded 3 highly significant and 2 significant pairwise differences at the 0.05 level. [2] But was the independence assumption correct?

Figure 2: Roth et al. (1991) buyer offers during the last round of play (round 10), grouped by site (country). [Plot not reproduced. Title: "Roth et al. 1991: Round 10 Offers by Country"; vertical axis: percent of pie offered; horizontal axis: country (Israel, Japan, USA, Yugo).]

Revisiting the experimental design and play evolution, it seems that buyers gradually adapt to seller responses and vice versa. [3] Since all buyers of a given session face the same pool of sellers, and since these pools differ for different sessions, it makes sense to assume some intra-session dependence. Figure 3, splitting the 116 round-10 offers by session, does show marked session-specific differences at all sites. We performed a permutation nested K-W test for the session effect on round 10 data: $H^{(nest)} = 22.4$ on 8 d.f., yielding a permutation p-value of $1.3 \times 10^{-3}$ (permutation sample size: 50,000). Hence, one can safely conclude that the within-session independence assumption was wrong.

In the absence of any explicit model accounting for the intra-session dependence, the effective sample size for testing the country effect is the number of sessions: only 12. Subsequent tests for country differences using various session summaries (including round 6-10 summaries instead of only round 10) all yield p-values greater than 0.1. [4] Pairwise tests are impossible for this sample size, and the only two-way contrast indicating a marginally significant effect is Jerusalem vs. all other three sites. We conclude that given the significant intra-session dependence in player behavior, a larger number of sessions needs to be designed into this type of experiment in order to draw meaningful conclusions about the main effect.

[1] Due to no-shows, three sessions ended up having fewer players; however, the number of rounds remained 10 in all sessions.
[2] There was a fundamental error in the execution of pairwise tests: confidence levels were not adjusted for pairwise comparisons. Following a Bonferroni correction, 3 tests (all those involving Jerusalem) are still deemed significant, while two more comparisons involving Tokyo fail to reject the null at the 0.1 level.
[3] The authors also reached this conclusion, though it wasn't factored into their inference procedure.
[4] Analysis using intra-session dependence models was also attempted; the resulting inference was not substantially different from the one reported here.

Figure 3: Roth et al. (1991) buyer offers during the last round of play (round 10), grouped by session and site (country). [Plot not reproduced. Title: "Roth et al. 1991: Round 10 Offers by Country and Session"; vertical axis: percent of pie offered; horizontal axis: session within country (sessions 1-3 for each of Israel, Japan, USA, Yugo).]
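The 8 degrees of freedom reported for the session test in Section 2.3 follow from h - g, with h = 12 sessions nested in g = 4 countries. As a rough cross-check on the permutation result, the asymptotic chi-square tail probability for the reported statistic can be evaluated with the closed form that holds for even degrees of freedom. The sketch below is an illustration only: the function name is hypothetical, and the asymptotic value it prints is computed here rather than reported by the authors.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) for a chi-square variable with even d.f.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


sessions, countries = 12, 4               # h and g in the data example
df = sessions - countries                 # h - g
p_asymptotic = chi2_sf_even_df(22.4, df)  # 22.4 is the reported H(nest)
print(df, p_asymptotic)
```

With samples this small the permutation p-value remains the preferred reference; the asymptotic figure only indicates the order of magnitude.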

3 Nested Friedman Test

The same logic leading to the nested K-W test can produce a nested Friedman test. Here there would be at least one blocking effect $C$, crossing the $A, B$ hierarchical design. Even under the null hypothesis for $B$, ranks within each level of $A$ may now depend upon $C$. The natural solution is to perform a separate ranking for each level of $C$. Within a single level of $A$ and given a balanced complete design, this is a simple Friedman test (Friedman, 1937). [5] Using the approach of Section 2.1, since the process adjusts for $C$ via the separate rankings, the Friedman statistics $Q_j$ obtained for each level of $A$ are mutually independent. Therefore, under the null hypothesis for $B$ the sum of separate Friedman statistics from all levels of $A$ would have asymptotic properties analogous to (3) (cf. fig. 4):

$$Q^{(nest)} = \sum_{j=1}^{g} Q_j \xrightarrow{\; v \to \infty \;} \chi^2_{h-g}, \tag{4}$$

where $v$ is the number of levels of $C$. If the design is incomplete or unbalanced, weighted Friedman-type statistics as in Mack and Skillings (1980) can be used instead. Of course, a permutation test implementation is highly recommended in any case.

In the data example above, a nested Friedman-type test can be performed for the buyer effect nested within sessions. This yields a test statistic of 562 on 104 degrees of freedom (both asymptotic and permutation p-values are practically zero), compared with 511 with the same d.f. using an analogous nested K-W test, i.e., one that ignores the crossed round factor. For this particular application, both tests yield a strongly significant effect, as was deduced by the researchers without any formal testing. However, there may be similar applications in other fields where the nested Friedman test may prove useful.

[5] Note that testing for C can still be achieved using a simple Friedman test, in spite of the hierarchy. One needs to perform a separate ranking for each level of the nested effect B.

4 Conclusions

The two nested rank tests described here are easily implemented and accessible to any researcher. Additionally, they possess clear asymptotic properties and are directly related to well-known standard tests. They should therefore be a significant enhancement of the readily available nonparametric toolset for the analysis of experiments with nontrivial designs. We recommend using these tests via a permutation application. R language code (R Development Core Team, 2005), including permutation code for the original non-nested K-W and Friedman tests and a generic Friedman-type formulation that allows for incomplete and unbalanced designs, is available from the authors.

Figure 4: Schematic illustration of the nested Friedman test. Nested and nesting levels are as in fig. 1, but now they are crossed with effect C (having, say, v levels) in a complete design. If one wishes to test for C, then a single Friedman Q statistic would suffice, which is equivalent to ranking observations within each column, then summing ranks by row. However, in order to test for B nested in A and assuming C can be significant, one needs to calculate a separate Q statistic for each level of A and test their sum. This is equivalent to ranking observations within each row (marked with horizontal rectangles), then summing ranks by column, but doing so separately for each array of points. As in the K-W example, inference is obtained by summing all separate Q statistics. A label permutation test for B would involve randomly scrambling points in each row, within each level of A, and calculating Q (nest) for each such assignment. In the data example, effect A is session, effect B is buyer, and effect C is round. Individual points are single offers.
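The nested Friedman computation of Section 3 and Figure 4 can be sketched as follows, assuming a complete balanced design within each level of A. This Python sketch is illustrative and is not the authors' R code: the function names, the layout of `design` (rows are levels of C, columns are levels of B), and the toy numbers are assumptions, the classical Friedman formula is used without a tie correction, and the weighted Mack and Skillings (1980) variant for incomplete designs is not covered.

```python
def average_ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def friedman_q(block_rows):
    """Friedman Q for one level of A.

    block_rows holds one row per level of C (block); each row has one
    observation per level of B nested in this level of A.
    """
    v = len(block_rows)     # number of blocks (levels of C)
    k = len(block_rows[0])  # number of nested levels of B
    rank_sums = [0.0] * k
    for row in block_rows:
        # Rank within each row, then accumulate ranks by column.
        for col, r in enumerate(average_ranks(row)):
            rank_sums[col] += r
    return 12.0 / (v * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * v * (k + 1)


def nested_friedman(design):
    """Sum of per-level Friedman statistics, with df = h - g."""
    q_nest = sum(friedman_q(rows) for rows in design.values())
    df = sum(len(rows[0]) - 1 for rows in design.values())
    return q_nest, df


# Made-up toy data: level a1 has 3 nested levels of B observed in 3 blocks,
# level a2 has 2 nested levels of B observed in 4 blocks.
toy = {
    "a1": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    "a2": [[1.0, 2.0], [2.0, 1.0], [1.0, 2.0], [1.0, 2.0]],
}
q_nest, df = nested_friedman(toy)
print(q_nest, df)
```

In the paper's data example the same degrees-of-freedom rule gives 116 buyers minus 12 sessions, i.e. the 104 d.f. quoted in Section 3.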

References

Akritas, M., Arnold, S., Brunner, E., 1997. Nonparametric hypotheses and rank statistics for unbalanced factorial designs. J. Amer. Statist. Assoc. 92, 258-265.

Friedman, M., 1937. The use of ranks to avoid the assumption of normality in the analysis of variance. J. Amer. Statist. Assoc. 32, 675-701.

Gao, X., Alvo, M., 2005. A unified nonparametric approach for unbalanced factorial designs. J. Amer. Statist. Assoc. 100, 926-941.

Kruskal, W. H., Wallis, W. A., 1952. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Assoc. 47, 583-621.

Mack, G., Skillings, J., 1980. A Friedman-type rank test for main effect in a two-factor ANOVA. J. Amer. Statist. Assoc. 75, 947-951.

R Development Core Team, 2005. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL http://www.r-project.org

Roth, A., Prasnikar, V., Okuno-Fujiwara, M., Zamir, S., 1991. Bargaining and market behavior in Jerusalem, Ljubljana, Pittsburgh and Tokyo: an experimental study. Amer. Econ. Rev. 81, 1068-1095.