A comparison of efficient permutation tests for unbalanced ANOVA in two by two designs and their behavior under heteroscedasticity

arXiv:1309.7781v1 [stat.ME] 30 Sep 2013

Sonja Hahn, Department of Psychology, University of Jena
Frank Konietschke, Department of Medical Statistics, University Medical Center Göttingen
Luigi Salmaso, Department of Management and Engineering, University of Padova

October 1, 2013

Abstract

We compare different permutation tests and some parametric counterparts that are applicable to unbalanced two by two designs. First, the different approaches are briefly summarized. Then we investigate the behavior of the tests in a simulation study, with a special focus on their behavior under heteroscedastic variances.

Keywords: Conditional Testing Procedures, Non-normal Data, Nonparametric Statistics, Simulation study

1 Introduction

In many biological, medical and social trials, data are collected in a two by two design, e.g. when male and female patients are randomized to two different treatment groups (placebo and active treatment). The data are often analyzed by assuming linear treatment effects and applying ANOVA procedures. These approaches rely on rather strict model assumptions such as normally distributed error terms and variance homogeneity. However, these assumptions can rarely be justified. In particular, heteroscedastic variances occur frequently in a variety of disciplines, e.g. in genetic data. It is well known that the classical ANOVA F-test tends to result in liberal or conservative decisions, depending on the underlying distribution, the amount of variance heterogeneity, and the degree of unbalance. Thus, asymptotic (or approximate) procedures that allow the data to be heteroscedastic are a robust alternative to the classical ANOVA F-test.

One asymptotic testing procedure is the Wald-type statistic (see e.g., Brunner et al., 1997; Pauly et al., submitted), which is based on the asymptotic distribution of an appropriate quadratic form. It is valid even without the assumptions of normality and variance homogeneity. However, very large sample sizes are necessary to achieve accurate test results (see, e.g., Brunner et al. (1997) and references therein). As an approximate solution, Brunner et al. (1997) propose the so-called ANOVA-type statistic (ATS), which is based on a Box-type approximation. The ATS, however, is an approximate test, and whether it is asymptotically exact is unknown (see, e.g., Pauly et al. (submitted)).

On the other hand, permutation approaches are known to be very robust under non-normality. In particular, under certain model assumptions, permutation tests are exact level α tests. Usual permutation tests assume that the data are exchangeable, which in particular implies homogeneous variances. Recently, Pauly et al. (2013) proposed asymptotic permutation tests that are asymptotically exact even under non-normality and possibly heteroscedastic variances. Various permutation approaches for factorial designs have been developed within the last

years, but a comparison of the different permutation approaches for unbalanced factorial designs with variance heterogeneity is still missing.

The aim of the present paper is to investigate different parametric and permutation tests for factorial linear models. For simplicity, we focus on two by two designs within this paper. The paper is organized as follows: After introducing the notation, we summarize different existing approaches that were developed for unbalanced ANOVA designs. Afterwards we investigate the behavior of these procedures in a simulation study. Here we focus on small sample sizes, heterogeneity of variances, and different error term distributions. Finally, we discuss the results of the simulation study and add further considerations about the procedures.

1.1 Notation and Hypotheses

We consider the two-way crossed factorial design

$$X_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \quad i = 1, 2;\ j = 1, 2;\ k = 1, \ldots, n_{ij}, \tag{1}$$

where $\alpha_i$ denotes the effect of level $i$ of factor A, $\beta_j$ denotes the effect of level $j$ of factor B, and $(\alpha\beta)_{ij}$ denotes the $(ij)$th interaction effect of $A \times B$. Here, $\epsilon_{ijk}$ denotes the error term with $E(\epsilon_{ijk}) = 0$ and $Var(\epsilon_{ijk}) = \sigma_{ij}^2 > 0$. Under the assumption of equal variances, we simply write $Var(\epsilon_{ijk}) = \sigma^2$. Our purpose is to test the null hypotheses

$$H_0^{(A)}: \alpha_1 = \alpha_2, \qquad H_0^{(B)}: \beta_1 = \beta_2, \qquad H_0^{(A \times B)}: (\alpha\beta)_{11} = \ldots = (\alpha\beta)_{22}. \tag{2}$$

For simplicity, let $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$. Then the hypotheses defined above can be equivalently written as

$$H_0^{(A)}: C_A \mu = 0, \qquad H_0^{(B)}: C_B \mu = 0, \qquad H_0^{(A \times B)}: C_{A \times B}\, \mu = 0,$$

where $C_L$, $L \in \{A, B, A \times B\}$, denote suitable contrast matrices and $\mu = (\mu_{11}, \ldots, \mu_{22})'$. To test the null hypotheses formulated in (2), various asymptotic and approximate test procedures have been proposed. We explain the current state of the art in the subsequent sections.

1.2 Wald-Type Statistic (WTS)

Let $\bar{X} = (\bar{X}_{11}, \ldots, \bar{X}_{22})'$ denote the vector of sample means $\bar{X}_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} X_{ijk}$, and let $\hat{S}_N = diag(\hat{\sigma}_{11}^2, \ldots, \hat{\sigma}_{22}^2)$ denote the $4 \times 4$ diagonal matrix of sample variances $\hat{\sigma}_{ij}^2 = \frac{1}{n_{ij} - 1} \sum_{k=1}^{n_{ij}} (X_{ijk} - \bar{X}_{ij})^2$. Under the null hypothesis $H_0^{(L)}: C_L \mu = 0$, the Wald-type statistic

$$W_N(L) = N \bar{X}' C_L' (C_L \hat{S}_N C_L')^{+} C_L \bar{X} \tag{3}$$

has, asymptotically as $N \to \infty$, a $\chi^2_{rank(C_L)}$ distribution. The rate of convergence, however, is rather slow, particularly for larger numbers of factor levels and smaller sample sizes. For small and medium sample sizes, the WTS tends to give rather liberal results (see Brunner et al., 1997; Pauly et al., submitted, for some simulation results). However, the Wald-type statistic is asymptotically exact even under non-normality.
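To make the contrast formulation and the statistic in (3) concrete, the following Python sketch evaluates a Wald-type statistic for a single contrast in a two by two layout. It is an illustration under stated assumptions, not the authors' implementation: the names `CONTRASTS`, `wald_type_statistic` and `chi2_1_p_value` are hypothetical, the contrast vectors are one possible choice of $C_L$, and each cell variance is scaled by $N / n_{ij}$ (with $N$ taken as the total sample size) so that the quadratic form is standardized in unbalanced designs. That scaling is an assumption of the sketch and is not visible in the formula as printed above.

```python
import math
import statistics

# One possible choice of contrast rows for the cell order (11, 12, 21, 22).
# In a two by two design every hypothesis has a single contrast row,
# so rank(C_L) = 1.
CONTRASTS = {
    "A": (1, 1, -1, -1),
    "B": (1, -1, 1, -1),
    "AxB": (1, -1, -1, 1),
}


def wald_type_statistic(cells, contrast):
    """Wald-type statistic for one contrast row.

    cells    : four lists of observations, ordered (11, 12, 21, 22),
               each with at least two observations.
    contrast : tuple of four contrast weights.

    Assumption of this sketch: S_N = N * diag(var_ij / n_ij).
    """
    n_total = sum(len(c) for c in cells)
    means = [statistics.fmean(c) for c in cells]
    # Unbiased cell variances (divisor n_ij - 1).
    variances = [statistics.variance(c) for c in cells]
    c_mean = sum(w * m for w, m in zip(contrast, means))
    c_s_c = sum(w * w * n_total * v / len(c)
                for w, v, c in zip(contrast, variances, cells))
    # For a single row the Moore-Penrose inverse is the reciprocal.
    return n_total * c_mean ** 2 / c_s_c


def chi2_1_p_value(w):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(w / 2.0))


cells = [[1, 2, 3], [2, 3, 4], [1, 2, 3], [2, 3, 4]]
w_b = wald_type_statistic(cells, CONTRASTS["B"])
p_b = chi2_1_p_value(w_b)
```

Because the contrast has a single row, the generalized inverse reduces to a reciprocal, which keeps the sketch free of matrix libraries.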

1.3 ANOVA-Type Statistic (ATS)

In order to overcome the strong liberality of the Wald-type statistic in (3) with small sample sizes, Brunner et al. (1997) propose the so-called ANOVA-type statistic (ATS)

$$F_N(L) = \frac{N \bar{X}' T_L \bar{X}}{trace(T_L \hat{S}_N)}, \tag{4}$$

where $T_L = C_L' (C_L C_L')^{+} C_L$. The null distribution of $F_N(L)$ is approximated by an F-distribution with

$$f_1 = \frac{[trace(T_L \hat{S}_N)]^2}{trace[(T_L \hat{S}_N)^2]} \quad \text{and} \quad f_2 = \frac{[trace(T_L \hat{S}_N)]^2}{trace(D_T^2 \hat{S}_N^2 \Lambda)} \tag{5}$$

degrees of freedom, where $D_T$ is the diagonal matrix of the diagonal elements of $T_L$ and $\Lambda = diag\{(n_{ij} - 1)^{-1}\}_{i,j=1,2}$. The ATS relies on the assumption of normally distributed error terms (Pauly et al., submitted). Especially for skewed error terms the procedure tends to be very conservative (Pauly et al., submitted; Vallejo et al., 2010). When the sample sizes are extremely small ($n_{ij} \le 5$), it tends to result in conservative decisions (Richter and Payton, 2003). We note that in two by two designs, the Wald-type statistic $W_N(L)$ in (3) and the ANOVA-type statistic $F_N(L)$ are identical. Furthermore, the ATS remains an approximate test even asymptotically, and whether it is asymptotically exact is unknown.

1.4 Wald-Type Permutation Test (WTPS)

Recently, Pauly et al. (submitted) proposed an asymptotic permutation-based Wald-type test that is asymptotically exact even when the data are not exchangeable. In particular, it is asymptotically valid under variance heterogeneity. This procedure is a generalization of two-sample studentized permutation tests for the Behrens-Fisher problem (Janssen, 1997; Janssen and Pauls, 2003; Konietschke and Pauly, 2012, 2013). The procedure is based

on (randomly) permuting the data $X = (X_{111}, \ldots, X_{22n_{22}})'$ within the whole data set. Let $\bar{X}^* = (\bar{X}^*_{11}, \ldots, \bar{X}^*_{22})'$ denote the vector of permuted means $\bar{X}^*_{ij} = n_{ij}^{-1} \sum_{k=1}^{n_{ij}} X^*_{ijk}$, and let $\hat{S}^*_N = diag(\hat{\sigma}^{*2}_{11}, \ldots, \hat{\sigma}^{*2}_{22})$ denote the $4 \times 4$ diagonal matrix of permuted sample variances $\hat{\sigma}^{*2}_{ij} = \frac{1}{n_{ij} - 1} \sum_{k=1}^{n_{ij}} (X^*_{ijk} - \bar{X}^*_{ij})^2$. Further, let

$$W^*_N(L) = N (\bar{X}^*)' C_L' (C_L \hat{S}^*_N C_L')^{+} C_L \bar{X}^* \tag{6}$$

denote the permuted version of the Wald-type statistic $W_N(L)$. Pauly et al. (submitted) show that, given the data $X$, the distribution of $W^*_N(L)$ is, asymptotically, the $\chi^2_{rank(C_L)}$ distribution. The p-value is derived as the proportion of test statistics of the permuted data sets that are equal to or more extreme than the test statistic of the original data set. If the data are exchangeable, this Wald-type permutation test is an exact level α test. Otherwise, the procedure is asymptotically exact due to the multivariate studentization. Simulation results showed that this test adheres better to the nominal α-level than its unconditional counterpart for small and medium sample sizes (see Pauly et al., submitted, and the supplementary materials therein). Furthermore, the Wald-type permutation test generally achieves a higher power than the ATS. We note that the WTPS is not restricted to two by two designs. The procedure is applicable in higher-way layouts and even in nested and hierarchical designs.

1.5 Synchronized Permutation Tests (CSP and USP)

Synchronized permutation tests were designed to test the different hypotheses in (2) of a factorial design separately (e.g., testing a main effect when there is an interaction effect). There are two important differences to the WTPS approach: (1) The data are not permuted within the whole data set; instead, a special synchronized permutation mechanism is used. (2) The test statistic is not studentized. This procedure assumes that the error terms are exchangeable.
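For the Wald-type permutation test of Section 1.4, the resampling loop can be sketched in Python as follows. This is a minimal illustration rather than the procedure as implemented by Pauly et al.: the function names `wald_stat` and `wtps_p_value` are hypothetical, the statistic is restricted to a single contrast row (sufficient for two by two designs), and the cell variances are scaled by $N / n_{ij}$, which is an assumption of the sketch.

```python
import random
import statistics


def wald_stat(cells, contrast):
    """Studentized (Wald-type) statistic for one contrast row.

    Assumption of this sketch: S_N = N * diag(var_ij / n_ij).
    Every cell needs at least two observations and nonzero variance.
    """
    n_total = sum(len(c) for c in cells)
    c_mean = sum(w * statistics.fmean(c) for w, c in zip(contrast, cells))
    c_s_c = sum(w * w * n_total * statistics.variance(c) / len(c)
                for w, c in zip(contrast, cells))
    return n_total * c_mean ** 2 / c_s_c


def wtps_p_value(cells, contrast, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value: permute within the whole data set."""
    rng = random.Random(seed)
    observed = wald_stat(cells, contrast)
    pooled = [x for c in cells for x in c]
    sizes = [len(c) for c in cells]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-split the shuffled data into cells of the original sizes.
        permuted, start = [], 0
        for n in sizes:
            permuted.append(pooled[start:start + n])
            start += n
        # Recompute means AND variances: the statistic is re-studentized.
        if wald_stat(permuted, contrast) >= observed:
            hits += 1
    return hits / n_perm


# Illustrative data: a clear main effect of A, no effect of B.
data_rng = random.Random(42)
cells = [[data_rng.gauss(mu, 1.0) for _ in range(8)] for mu in (3, 3, 0, 0)]
p_a = wtps_p_value(cells, (1, 1, -1, -1))
p_b = wtps_p_value(cells, (1, -1, 1, -1))
```

The key design point is that the variances are recomputed for every permuted data set, so each permuted statistic is studentized in the same way as the observed one.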

Basso et al. (2009), Pesarin and Salmaso (2010) and Salmaso (2003) propose synchronized permutation tests for balanced factorial designs. Synchronization means that the data are permuted within blocks built by one of the factors. In addition, the number of exchanged observations in each of these blocks is equal for a single permutation. For example, when testing for the main effect A or the interaction effect, the observations can be permuted within the blocks built by the levels of factor B. Different variants of synchronized permutations have been developed (see Corain and Salmaso, 2007, for details):

- Constrained Synchronized Permutations (CSP): Here only observations at the same position within each subsample are permuted. When the method is applied to a real data set, it is strongly recommended to pre-randomize the observations in the data set to eliminate possible systematic order effects.
- Unconstrained Synchronized Permutations (USP): Here observations at different positions can also be permuted. In this case it has to be ensured that the test statistic follows a uniform distribution.

The test statistics for the main effect A and the interaction effect are

$$T_A = (T_{11} + T_{12} - T_{21} - T_{22})^2 \quad \text{and} \quad T_{A \times B} = (T_{11} - T_{12} - T_{21} + T_{22})^2,$$

with $T_{ij} = \sum_k X_{ijk}$. Due to the synchronization and the form of the test statistic, the effects not of interest are eliminated (e.g., when testing for main effect A, main effect B and the interaction effect are eliminated; see Basso et al., 2009, for more background information). When testing for main effect B, the data have to be permuted within blocks built by the levels of A and the test statistics

have to be adapted.

For certain unbalanced factorial designs this method can be extended (Hahn and Salmaso, submitted). In the case of CSP this leads to the situation that some observations will never be exchanged. In the case of USP the maximum number of exchanged observations equals the minimum subsample size. A test statistic that eliminates the effects not of interest is only available in special cases (Hahn and Salmaso, submitted). For example, when $n_{11} = n_{12}$ and $n_{21} = n_{22}$, possible test statistics are

$$T_A = (n_{21} T_{11} + n_{22} T_{12} - n_{11} T_{21} - n_{12} T_{22})^2, \qquad T_{A \times B} = (n_{21} T_{11} - n_{22} T_{12} - n_{11} T_{21} + n_{12} T_{22})^2.$$

For both the balanced and the unbalanced case, the p-value is again calculated as the proportion of test statistics of permuted data sets that are greater than or equal to the test statistic of the original data set.

These procedures showed a good adherence to the nominal α-level as well as good power in simulation studies (Basso et al., 2009; Hahn and Salmaso, submitted). However, the approach is limited in various ways:

- It is restricted to very specific cases of unbalanced designs due to assumptions on equal-sized subsamples.
- Extension to more complex factorial designs seems quite difficult (see e.g., Basso et al. (2009) for balanced cases with more levels).
- It assumes exchangeability. This might not hold in cases with heteroscedastic error variances.
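The synchronization idea for the balanced case can be illustrated with a short Python sketch. It shows one simple way to realize a synchronized scheme, namely applying the same random rearrangement of positions to both blocks built by factor B, so that the same positions and the same number of observations are exchanged in each block. This is an assumption made for illustration and not necessarily the exact CSP algorithm of Basso et al. (2009); the function names `t_a`, `t_axb` and `synchronized_p_value` are hypothetical.

```python
import random


def t_a(cells):
    """Statistic for main effect A; cells ordered (11, 12, 21, 22)."""
    t11, t12, t21, t22 = (sum(c) for c in cells)
    return (t11 + t12 - t21 - t22) ** 2


def t_axb(cells):
    """Statistic for the interaction effect."""
    t11, t12, t21, t22 = (sum(c) for c in cells)
    return (t11 - t12 - t21 + t22) ** 2


def synchronized_p_value(cells, stat, n_perm=2000, seed=0):
    """Monte Carlo p-value for a balanced two by two design.

    Observations are exchanged only within the blocks built by factor B,
    and the same index permutation is used in both blocks.
    """
    n = len(cells[0])
    if any(len(c) != n for c in cells):
        raise ValueError("this sketch covers balanced designs only")
    rng = random.Random(seed)
    # Pre-randomize the order within each cell to remove order effects.
    cells = [rng.sample(c, n) for c in cells]
    observed = stat(cells)
    block1 = cells[0] + cells[2]  # level 1 of B: cells 11 and 21
    block2 = cells[1] + cells[3]  # level 2 of B: cells 12 and 22
    hits = 0
    for _ in range(n_perm):
        idx = list(range(2 * n))
        rng.shuffle(idx)
        p1 = [block1[i] for i in idx]
        p2 = [block2[i] for i in idx]
        permuted = [p1[:n], p2[:n], p1[n:], p2[n:]]
        # Small relative tolerance guards against rounding in the sums.
        if stat(permuted) >= observed * (1.0 - 1e-12):
            hits += 1
    return hits / n_perm


# Illustrative data: a clear main effect of A, no interaction.
data_rng = random.Random(7)
cells = [[data_rng.gauss(mu, 1.0) for _ in range(8)] for mu in (3, 3, 0, 0)]
p_main_a = synchronized_p_value(cells, t_a)
p_inter = synchronized_p_value(cells, t_axb)
```

Because both blocks exchange the same number of observations between the two levels of A, the contributions of main effect B and of the interaction cancel in `t_a`, which is the elimination property described in the text.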

As the behavior of this procedure under variance heterogeneity has not been investigated yet, we included it in the following simulation study.

1.6 Summary

We outlined various procedures that aim to compensate for shortcomings of classical ANOVA. Some procedures are valid only under normality but allow for heteroscedastic variances (ATS). CSP and USP are valid under non-normally distributed error terms and homoscedastic variances. Both the WTS and the WTPS are asymptotically valid even under non-normality and heteroscedasticity. Most of these procedures are intended to be used for small samples (ATS, WTPS, CSP, and USP); only the WTS requires a sufficiently large sample size. In the following simulation we additionally vary balanced vs. unbalanced designs, as heteroscedasticity is especially problematic in the latter.

2 Simulation study

2.1 General aspects

The present simulation study investigates the behavior of the procedures described above (see Section 1) for balanced vs. unbalanced designs and homo- vs. heteroscedastic variances. A major assessment criterion for the accuracy of the procedures is their behavior when increasing sample sizes are combined with increasing variances (positive pairing) or with decreasing variances (negative pairing).

We investigate data sets that did not contain any effect and data sets that contained an effect. In the first case we were interested in whether the procedures keep the nominal level; in the second case the power behavior was investigated additionally. Similar to the notation

introduced above we used the following approach for data simulation:

$$X_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}. \tag{7}$$

Specifications for the different data settings can be found below. Throughout all studies we focused on main effect A and the interaction effect. All simulations were conducted using the freely available software R (www.r-project.org), version 2.15.2 (R Core Team, 2013). The numbers of simulation and permutation runs were $n_{sim} = 5000$ and $n_{perm} = 5000$, respectively. All simulations were conducted at the 5% level of significance.

2.2 Data Sets Containing no Effect

2.2.1 Description

Table 1 outlines the combinations of balanced vs. unbalanced designs and homo- vs. heteroscedastic variances. Larger sample sizes were obtained by adding a constant number to each of the sample sizes. Those numbers were 5, 10, 20, and 25. There was no effect in the data (i.e. in Equation (7), $\mu = \alpha_i = \beta_j = (\alpha\beta)_{ij} = 0$).

Setting  Description                  n11 (SD)  n12 (SD)  n21 (SD)  n22 (SD)
1        balanced and homoscedastic   5 (1.0)   5 (1.0)   5 (1.0)   5 (1.0)
2        differing sample sizes       5 (1.0)   7 (1.0)   10 (1.0)  15 (1.0)
3        differing variances          5 (1.0)   5 (1.3)   5 (1.5)   5 (2.0)
4        positive pairings            5 (1.0)   7 (1.3)   10 (1.5)  15 (2.0)
5        negative pairings            5 (2.0)   7 (1.5)   10 (1.3)  15 (1.0)

Table 1: Different subsample sizes and variances considered in the simulation study. Besides the data settings in the table, bigger samples were achieved by adding 5, 10, 20, or 25 observations to each subsample.

For the error terms, different symmetric and skewed distributions were used:

- Symmetric distributions: normal, Laplace, logistic, and a mixed distribution, where each factor level combination has a different symmetric distribution (normal,

Laplace, logistic and uniform).
- Skewed distributions: log-normal, $\chi^2_3$, $\chi^2_{10}$, and a mixed distribution, where each factor level combination has a different skewed distribution (exponential, log-normal, $\chi^2_3$, $\chi^2_{10}$).

To generate variance heterogeneity, random variables were first generated from the distributions mentioned above and standardized to achieve an expected value of 0 and a standard deviation of 1. These values were then multiplied by the standard deviations given in Table 1 to achieve different degrees of variance heterogeneity.

2.2.2 Results

Figure 1 shows the behavior of the different procedures in the case of symmetric and homoscedastic error terms. Most procedures stay close to the nominal α-level of .05, which is indicated by the thin red line. The WTS tends to be quite liberal, while the ATS tends to be slightly conservative for small sample sizes.

Figure 2 shows the behavior for skewed but still homoscedastic error term distributions. The picture is very similar to the previous one, but the conservative behavior of the ATS procedure is more pronounced.

Figure 3 shows the behavior in the symmetric and heteroscedastic case. For Setting 3 with equal sample sizes there is not much difference in comparison to the previous cases. In Setting 4, the positive pairings, the WTPS and the ATS show a good adherence to the level and a slightly conservative behavior in the case of the Laplace distribution. The WTS tends to over-reject the null hypothesis in small sample size settings. Both the CSP and USP tests tend to result in conservative decisions. This is more pronounced for small sample sizes and for the USP procedure. In Setting 5, the negative pairings, all procedures except the ATS tend towards liberal behavior, especially for small sample sizes. USP has the strongest tendency, with Type-I error rates up to .08.
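The error-generation step of Section 2.2.1 (draw from a distribution, standardize to mean 0 and standard deviation 1, then multiply by the cell standard deviation) can be sketched in Python as follows. The study itself was run in R, so this is only an illustration: the function names are hypothetical, the standardization uses the theoretical moments of each distribution, and the log-normal parameters (log-mean 0, log-sd 1) are an assumption, since the text does not state them.

```python
import math
import random


def standardized_error(dist, rng):
    """Draw one error term with expected value 0 and standard deviation 1."""
    if dist == "normal":
        return rng.gauss(0.0, 1.0)
    if dist == "laplace":
        # The difference of two Exp(1) draws is Laplace(0, 1), variance 2.
        diff = rng.expovariate(1.0) - rng.expovariate(1.0)
        return diff / math.sqrt(2.0)
    if dist == "lognormal":
        # Assumed parameters (0, 1): mean exp(1/2), variance (e - 1) * e.
        mean = math.exp(0.5)
        sd = math.sqrt((math.e - 1.0) * math.e)
        return (rng.lognormvariate(0.0, 1.0) - mean) / sd
    if dist == "chisq3":
        # Chi-square(3) equals Gamma(shape 1.5, scale 2): mean 3, variance 6.
        return (rng.gammavariate(1.5, 2.0) - 3.0) / math.sqrt(6.0)
    raise ValueError(f"unknown distribution: {dist}")


def simulate_cells(sizes, sds, dist, rng):
    """Null data for the cells (11, 12, 21, 22): scaled standardized errors."""
    return [[sd * standardized_error(dist, rng) for _ in range(n)]
            for n, sd in zip(sizes, sds)]


rng = random.Random(1)
# Setting 4 of Table 1 (positive pairings).
cells = simulate_cells((5, 7, 10, 15), (1.0, 1.3, 1.5, 2.0), "laplace", rng)
```

Feeding such data sets repeatedly into a test and counting rejections at the 5% level would give an empirical Type-I error rate of the kind reported in the figures.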

Figure 1: Results for the different procedures testing main effect A (left hand side) or the interaction effect (right hand side) for symmetric distributions and homoscedastic variances. (Panels: Settings 1 and 2; distributions: Laplace, logistic, normal, mixed.)

Figure 2: Results for the different procedures testing main effect A (left hand side) or the interaction effect (right hand side) for skewed distributions and homoscedastic variances. (Panels: Settings 1 and 2; distributions: Chi(10), Chi(3), log-normal, mixed.)

Figure 4 shows the behavior in the skewed and heteroscedastic case. In general, the same conclusions can be drawn. For the log-normal distribution there is a general tendency towards more liberal decisions than in the other cases. This means that in Setting 4 with positive pairings the procedures keep the level reasonably well, but in the other cases the Type-I error rate is up to .10.

Figure 3: Results for the different procedures testing main effect A (left hand side) or the interaction effect (right hand side) for symmetric distributions and heteroscedastic variances. (Panels: Settings 3, 4 and 5; distributions: Laplace, logistic, normal, mixed.)

Figure 4: Results for the different procedures testing main effect A (left hand side) or the interaction effect (right hand side) for skewed distributions and heteroscedastic variances. (Panels: Settings 3, 4 and 5; distributions: Chi(10), Chi(3), log-normal, mixed.)

2.3 Data Sets Containing an Effect

2.3.1 Description

Table 2 shows the different combinations of subsample sizes and standard deviations for data sets that contained an effect. Two aspects were considered: the power behavior, and the level of the procedures when testing an inactive effect. To ensure a valid comparison of the power behavior, the sample sizes and the variance heterogeneity were chosen to be less extreme than in the previous simulation study (see Section 2.2).

Setting  Description                  n11 (SD)            n12 (SD)            n21 (SD)             n22 (SD)
1        balanced and homoscedastic   10 (1)              10 (1)              10 (1)               10 (1)
2        differing sample sizes       9 (1)               9 (1)               15 (1)               15 (1)
3        differing variances          10 (1)              10 (1)              10 ($\sqrt[4]{2}$)   10 ($\sqrt[4]{2}$)
4        positive pairings            9 (1)               9 (1)               15 ($\sqrt[4]{2}$)   15 ($\sqrt[4]{2}$)
5        negative pairings            9 ($\sqrt[4]{2}$)   9 ($\sqrt[4]{2}$)   15 (1)               15 (1)

Table 2: Different subsample sizes and standard deviations considered in the simulation study containing effects.

Again, different error term distributions were used: the normal and Laplace distributions as symmetric distributions, and the log-normal and exponential distributions as skewed distributions. Different error term variances were obtained in the same manner as described in Section 2.2, using the standard deviations from Table 2. Additionally, there were active effects in the data as described in Table 3, with $\mu = 0$ and $\delta \in \{0, 0.2, \ldots, 1\}$. The tested effects were again main effect A and the interaction effect. In some cases only main effect B was active; here the aim was to test whether the procedures kept the level.

Condition  α1     α2     β1   β2   (αβ)11  (αβ)12  (αβ)21  (αβ)22  Active effects
1          +δ     -δ     0    0    0       0       0       0       main effect A
2          0      0      +δ   -δ   0       0       0       0       main effect B
3          +δ/2   -δ/2   0    0    +δ/2    -δ/2    -δ/2    +δ/2    main effect A and interaction effect

Table 3: Different effects in simulated data sets with $\mu = 0$ and $\delta \in \{0, 0.2, \ldots, 1\}$ in the simulation study containing effects.

2.3.2 Results

Figures 5 to 9 show the behavior of the different procedures for data sets containing an effect. The procedures show a very similar power behavior.

Figure 5: Results for data sets containing effects (equal subsample sizes and homoscedastic variances). (Panels for Setting 1.)

Figure 6: Results for data sets containing effects (equal subsample sizes and heteroscedastic variances). (Panels for Setting 2.)

Figure 7: Results for data sets containing effects (unequal subsample sizes and homoscedastic variances). (Panels for Setting 3.)

Figure 8: Results for data sets containing effects (positive pairings). (Panels for Setting 4.)

Figure 9: Results for data sets containing effects (negative pairings). (Panels for Setting 5.)

3 Conclusion

As the simulation study showed, the different procedures may be useful depending on the data setting and further aspects.

The ATS procedure was the only one that never exceeded the nominal level. On the other hand, it may show a conservative behavior, but in the simulations containing effects this was only slightly observable. Similar to the results of previous simulation studies, the conservative behavior was more pronounced for skewed distributions, especially with homoscedastic error term variances. An advantage of this procedure is that it can be adapted to very different designs and hypotheses (see Vallejo et al., 2010, for more background information).

The WTS procedure showed a liberal behavior for small samples in almost every data setting. It should only be applied when sample sizes are large. The WTPS procedure overcomes this problem. In all considered simulation settings this procedure controls the Type-I error rate quite accurately. In the case of positive or negative pairings, this permutation test shows better results than its competitors. Both the WTS and the WTPS can be adapted to higher-way layouts and hierarchical designs.

The CSP and USP procedures work well for all cases with equal subsample sizes or homogeneous variances. This includes cases where exchangeability of the observations might not hold due to different error term distributions (mixed distributions) or heterogeneous variances. In the case of positive and negative pairings, the behavior is similar to parametric ANOVA, with a conservative behavior for positive pairings and a liberal behavior for negative pairings. This is more pronounced for the USP procedure. The power behavior of both procedures was very comparable to that of the other procedures. CSP showed in some cases a slightly lower power than the other procedures. The CSP and USP procedures are restricted to certain hypotheses due to their construction: they assume that all cells should get the same weight in the analysis. This corresponds to Type III sums of squares (Searle, 1987). Extension of these procedures to other kinds of hypotheses, unbalancedness, or more complex designs might be challenging.

There are various ways in which this line of research might be continued:

- The simulation study could be extended by looking at other data settings (e.g., null pairings) or by including further procedures (e.g., Kherad-Pajouh and Renaud, 2010).
- The CSP and USP algorithms are still not widely applicable. More research on these procedures, or possibly a combination of these procedures with other approaches, could enlarge their scope of application.
- We restricted our analysis to two by two ANOVA models. Permutation tests allowing for covariates in higher-way layouts will be part of future research.

References

Basso, D., Pesarin, F., Salmaso, L., and Solari, A. (2009), Permutation Tests for Stochastic Ordering and ANOVA, New York: Springer.

Brunner, E., Dette, H., and Munk, A. (1997), Box-Type Approximations in Nonparametric Factorial Designs, Journal of the American Statistical Association, 92, 1494-1502.

Corain, L. and Salmaso, L. (2007), A critical review and a comparative study on conditional permutation tests for two-way ANOVA, Communications in Statistics: Simulation and Computation, 36, 791-805.

Hahn, S. and Salmaso, L. (submitted), A Comparison of Different Permutation Approaches to Testing Effects in Unbalanced Two-Level ANOVA Designs.

Janssen, A. (1997), Studentized permutation tests for non-i.i.d. hypotheses and the generalized Behrens-Fisher problem, Statistics and Probability Letters, 36, 9-21.

Janssen, A. and Pauls, T. (2003), How do bootstrap and permutation tests work? The Annals of Statistics, 31, 768-806.

Kherad-Pajouh, S. and Renaud, D. (2010), An exact permutation method for testing any effect in balanced and unbalanced fixed effect ANOVA, Computational Statistics and Data Analysis, 54, 1881-1893.

Konietschke, F. and Pauly, M. (2012), A studentized permutation test for the nonparametric Behrens-Fisher problem in paired data, The Electronic Journal of Statistics, 6, 1358-1372.

Konietschke, F. and Pauly, M. (2013), Bootstrapping and Permuting paired t-test type statistics, Statistics and Computing, 36.

Pauly, M., Brunner, E., and Konietschke, F. (submitted), Asymptotic Permutation Tests in General Factorial Designs.

Pesarin, F. and Salmaso, L. (2010), Permutation Tests for Complex Data: Theory, Application and Software, New York: Wiley & Sons.

R Core Team (2013), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

Richter, S. J. and Payton, M. E. (2003), Performing Two Way Analysis of Variance Under Variance Heterogeneity, Journal of Modern Applied Statistical Methods, 2 (1), 152-160.

Salmaso, L. (2003), Synchronized Permutation Tests in 2^k Factorial Designs, Communication in Statistics, 32, 1419-1437.

Searle, S. R. (1987), Linear Models for Unbalanced Data, Wiley series in probability and mathematical statistics, New York: John Wiley & Sons, Ltd.

Vallejo, G., Fernández, M. P., and Livacic-Rojas, P. E. (2010), Analysis of unbalanced factorial designs with heteroscedastic data, Journal of Statistical Computation and Simulation, 80, 75-88.