Anova with binary variables - Alternatives for a dangerous F-test

Size: px

Start display at page:

Download "Anova with binary variables - Alternatives for a dangerous F-test"

Valerie Barrett
5 years ago
Views:

2018) Haiko Lüpsen Regionales Rechenzentrum

1 Anova with binary variables - Alternatives for a dangerous F-test Version 1.0 ( ) Haiko Lüpsen Regionales Rechenzentrum (RRZK) Kontakt: Luepsen@Uni-Koeln.de Universität zu Köln

2 Introduction 1 Abstract ANOVA with binary variables - Alternatives for a dangerous F-test Several methods are compared with the parametric F-test to perform an ANOVA with a binary dependent variable in 2-way layouts. Equal and unequal cell counts as well as several different effect models are taken into account. Special attention has been paid on heterogeneous conditions which are caused by nonnull effects through the relation of the binomial probability and its variance. For between subject designs Puri & Sen s L statistic, Brunner & Munzel s ATS, the χ 2 -test of loglinear models, the logistic and the probit regresssion in conjunction with the Wald test, the LR test and a compromise test as ANOVA-like tests are considered. The L statistic is recommended because the F-test cannot keep the type I error under control if there are nonnull effects in unbalanced designs. For mixed designs the Huynh-Feldt adjustment, Hotelling Lawley s multivariate test, Puri & Sen s L statistic, Brunner & Munzel s ATS and Koch s ANOVA are considered. In addition several GLMM and GEE models are included. The ATS turned out to be the only method which can handle all, especially heterogeneous conditions and large designs, where the F-test fails. The GLMM and GEE methods were not able to compete with the tests listed above, from which the GLMM together with the type II Wald test proved to be the best one. Keywords: ANOVA, binary, dichotomous, Puri & Sen, ATS, GLMM, GEE, logistic, probit, regression. 1. Introduction The analysis of variance (ANOVA) is one of the most important and frequently used methods of applied statistics, mainly for the analysis of designs with only grouping factors (between subject designs) and of designs with grouping and repeated measurements factors (mixed or split-plot designs). There is the parametric version and there are nonparametric methods as well. The first one has assumptions, of course. These are essentially normality of the residuals, homogeneity of the variances and in the case of repeated measurements additionally sphericity and homogeneity of the covariance matrices over the groups. But what to do if the dependent variable y is dichotomous, e.g. with values yes or no, or 1 and 0? Due to the familiarity and simplicity of the ANOVA methodology, one could trust in the robustness of the parametric tests. A test is called robust when its significance level (Type I error probability) and power (one minus Type-II probability) are insensitive to departures from the assumptions on which it is derives. (See Ito, 1980). And one of the first who investigated the applicability of the parametric F-test to a dichotomous criterion was Lunney (1970). His simulations showed that for 1-, 2- and 3-factorial designs the type I error rate is controlled as long as n 20 for 0.2 p 0.8 and n 40 for other values of p (p being the percentage of one of the outcomes y). And the power is satisfying as long as n is not too small. This seems reasonable as on one side the homogeneity of the variances is the most essential assumption - noting that the variance p(1-p) depends on the mean p - and on the other side for 0.25 p 0.75 variances of a binomial distributed outcome can be regarded as equal. Unfortunately Lunney s study has a fundamental restriction: he examined only equal sample sizes and type I error rates if there are no other effects, therefore neglecting the cases of unequal variances. Beside that only between subject designs had been studied. D Agostino (1971) wrote a detailed critic on Lunney s paper.

3 Introduction 2 Nevertheless it remains one of the most important works on this subject. Only decades later Jaeger (2008) expressed concern on the use of the parametric ANOVA for the analysis of a binary outcome, but mainly on a variation of this procedure: Cochran (1940) proposed first computing the success proportion and applying an arcsine-square-root transformation before performing the ANOVA. But which are the alternatives? First, one of the nonparametric ANOVA methods could be applied. See Luepsen (2017) for an overview. A simple rank transform (RT) or an inverse normal transformation (INT) of y would make no difference compared to the parametric F-test because these transform the two values of y just into two other different values. The aligned rank transform (ART) is not reasonable for dichotomous variables as Luepsen (2016) pointed out. But the Puri & Sen-method (e.g. Puri & Sen, 1985), often referred as L statistic, is one alternative. It can be viewed as a generalization both of the Kruskal-Wallis H test and the Friedman ANOVA. Here also applying the INT-transformation on the Puri & Sen-method, resulting in the van der Waerden-test, makes no difference compared to Puri & Sen s test. Brunner & Munzel (2002) developed two tests to compare samples by means of comparing the relative effects (for details see next chapter): the ATS (ANO- VA-type statistic) and the WTS (Wald-type statistic). Furthermore there are the logistic regression and the probit regression (see e.g. Agresti, 2002, and Haberman, 1978), ideal models for an ANOVA with a dichotomous dependent variable. One disadvantage of these methods is the large n required. Recommendations say e.g. 10 per parameter and a minimum of 100 (see. e.g. Peng et al., 2002). Finally there is the χ 2 test. But this one also needs a large sample size to prevent that too many of the expected cell frequencies will be 1.0 or less. Additionally it should be remarked that for the analysis of factorial designs loglinear models must be used instead which require more effort. Therefore Hsu & Feldt (1969), who investigated the effect of scale limitations on the type I error rate, do not recommend the χ 2 -test for the analysis of ANOVA designs. At this point it should be remarked that the χ 2 - and the F-test are algebraically similar and under the null hypothesis asymptotically equivalent as D Agostino (1972) has shown. So far only methods suitable for between subjects designs have been considered. Of course the parametric F-test, but also the Puri & Sen-method and the ATS may be applied to repeated measures designs. At this point Cochran s Q-test should be mentioned. But this one is identical to the Friedman test in the case of a binary dependent variable (see e.g. Mandeville, 1972). And on the other side the Puri & Sen-test is a generalization of the Friedman test. So Cochran s test is included in Puri & Sen s L-statistic. Reflecting that unequal binomial probabilities p i result in unequal variances, methods should be considered that do not assume sphericity, e.g. the Huynh-Feldt adjustment for the parametric ANOVA and the multivariate statistic by Hotelling- Lawley. Mandeville (1972) tried in his study the multivariate statistic by Hotelling-Lawley and Myers et al. (1982) applied an adjustment of the degrees of freedom to the F- and Q-test to compensate for the case of nonsphericity. Another solution to this problem is supplied by G. Koch. On one side he proposed several nonparametric ANOVA procedures (Koch, 1969). On the other side he developed a general framework for the analysis of experiments with repeated measurements of categorical data (Koch et al., 1977), also known as marginal probability model, which is based on the weighted least squares methodology by Grizzle, Starmer & Koch (GSK-method). Guthrie (1981) specialized this method for the case of binary variables. Both, of course, include the application on dichotomous variables. Unfortunately both methods require large sample sizes: n i 25 is recommended (Koch et al., 1977). And, the tests of main and interaction effects are not ortho-

4 Introduction 3 gonal. By the way, Kenward & Jones (1992) compared these together with a couple of other approaches to the analysis of categorical data in repeated measures designs. Other comparisons come from Omar et al. (1999), Allan (1997) as well as Neuhaus (1992) who use sample datasets without giving general recommendations. Concerning the logistic regression there exist also corresponding methods for dependent samples (see e.g. Hosmer & Lemeshow, 2013): GEE (Generalized Estimating Equations), established by Liang & Zeger (1986), and GLMM (Generalized LinearMixed Models, sometimes also called MLM, multi level models) by Harville (1977). Both are extensions of the generalized linear models GLM allowing correlated responses. Pendergast et al. (1996) give an overview of extensions of the logistic regression to correlated samples. GEE can be considered as an advancement of the above mentioned marginal probability model and is based on LS estimation. It has appealing and robust properties in parameter estimation: Unlike the repeated measures ANOVA, GEE does not require the outcome variable to have a particular distribution. GEE allows specification of the correlation structure, among others the compound symmetry (CS) and the autoregressive (AR(1)) correlation structure. And the GEE approach produces virtually unbiased estimates even if the correlation structure is misspecified (see Emrich & Piedmonte, 1992 and Pan & Connett, 2002). On the other side, as the method is based on large sample asymptotic theory, it is not surprising that the tests of the parameters are generally liberal for small samples n<50 (see e.g. Qu et al., 1994 and Stiger et al., 1998). That induced a number of statisticians to think about an amelioration for small samples. Some of their work is summarized by Wang et al. (2016) and Fan & Zhang (2014). GLMM is also based on large sample asymptotic theory which causes the same deficiencies in the case of small samples (see e.g. Li & Redden, 2015). In contrary to GEE, GLMM uses ML estimation methods which lead to a number of different solutions and programs. One advantage of this approach is the flexibilty in handling missing data though such datasets are not considered here. While both methods primarily estimate and test the regression coefficients of the model, an ANOVA-like test is compulsory for each of the effects, e.g. the Wald-test. Finally two major disadvantages of these methods should be mentioned: First, due to different estimating methods there are a number of programs for GEE and GLMM which yield unfortunately different results, and secondly, due to the complex computations, the failure to converge in fitting GEE and GLMM is a common occurrence (Fox & Weisberg, 2011). There are several studies comparing the different methods. First the case of between subject designs. Malhotra (1983) considered regression methods and discriminant analysis which is also used for the prediction of a binary outcome, but not suitable for the analysis of ANOVA designs. He sees the logistic approach superior to the probit regression. Powers et al. (1978) come to the same result. Malhotra reported in his publication also quite a number of comparative studies and gave the results in a clearly arranged table. Gessner et al. (1988) studied the same methods but under special situations, covariance heterogeneity, multicollinearity and nonnormal joint distribution of predictors. But they give no recommendation. Whitmore & Schumacker (1999) compared the parametric ANOVA with the logistic regression in the context of differential item functioning (DIF). They favored the logistic approach. Hsu & Feldt (1969) compared in their above mentioned study the ANOVA F-test with the χ 2 -test. They state: If a sample size of 50 or more per treatment is used data from a two-point scale such that.25 < p <.75 may be validly analyzed via analysis of variance technique, and even for smaller sample sizes the error rates exceed the nominal level of 5 percent by 10% at most. Especially for very small sample sizes or when the design involves more than one factor, the F-test has an advantage over the χ 2 -test of independence. A summary of these results can be found at Glass et al. (1972). For the case of within subjects designs the studies restrict to the comparison of Cochran s Q-

5 Introduction 4 test with the F-test. First of all two assumptions for the Q-test: The population covariances for all pairs of conditions are the same (Cochran, 1950) and the variance may not be zero, i.e. all values are equal. Mandeville (1972) compared the two tests for.1 < p <.9 and showed that the F-Test has generally the larger power and for the number of treatments k > 5 the lower type I error rate. Further he demonstrated that its error rate offends the nominal level of 5 percent by a maximum of 20% as long as the total n is larger than 60. Unfortunately both tests offend the nominal 1 percent level. Additionally he included the multivariate statistic by Hotelling-Lawley. He observed that for 3 treatments the procedure behaves fairly well, but that for 6 or more treatments in some instances the errors are very large. Myers et al. (1982) compared the two tests for the case of heterogeneous covariance matrices and considered additional adjustments of the degrees of freedom for these tests to compensate the degree of heterogeneity. They revealed an improvement by their adjustments in these cases. Additionally they cited several other studies on these two tests, the results of which seemed to be less useful because of too small sample sizes. Finally Seeger & Gabrielsson (1968) proved by means of a simulation study that the F-test as a rule gives at least as good results as the Q-test with regard to the type I error. Meanwhile a large number of studies are concerned with GEE and GLMM, but only very few compare these methods with the parametric F-test. Stiger et al. (1998) analyzed ordinal data in a 2*4 split-plot design with small samples sizes (20, 40, 60, 80) and examined the performance with respect to error rates and power of ANOVA (with and without the Huynh-Feldt-adjustment), Manova, GEE and a weighted least squared method by Koch. Though a 4-point scale had been used the results may be adapted to binary data. The ANOVA with adjustment as well as the Manova performed well for all sample sizes, while the unadjusted ANOVA behaved sometimes slightly liberal if sphericity was not given. In contrast the GEE exceeded the type I error rates generally for n i <60. Concerning the power, ANOVA was overall superior while Manova had the lowest rates. Ma et al. (2012) also compared GEE with ANOVA in their study, considering continuous and binary variables, and, surprisingly, found that GEE keeps the type I error rate even for small n i and has the largest power. Mancl & DeRouen (2001) summarized a number of studies examining the behaviour of GEE in small samples and concluded that for n i <50 the type I error rate is generally much too high. McNeish & Stapleton (2016) compared, among other methods, GEE and GLMM for very small sample sizes ( 4 n i 14 ), but unfortunately for a continuous outcome. They found that uncorrected GEE is a poor choice, which also applies to most of the small sample corrections except the version by Morel et al. (2003). This one could control the type I error rate at the cost of a diminished power while GLMM provided satisfying results. Just to mention their large list of studies analyzing these two methods. McNeish & Harring (2017) confirmed these results for binary variables. Fan & Zhang (2014) dealt with another problem associated with GEE and GLMM: a suitable ANOVA-type test for small samples. They suggested a test based on the work of Akritas et al. (1997) and showed in their study of GEE for repeated measures models with 5 n i 20 that this one could control the error rates in most situations while the Wald-test produced rates up to 80% for the trial effects. Concerning the previously cited corrections for GEE there are several studies beside those mentioned above, e.g. by Fan et al. (2013) or by Lu et al. (2007) which lead to different results, partly because of the different effect tests used. Littell (2002) compared the ANOVA with maximum likelihood and generalized least squares estimation, mainly from a theoretical point of view, and favored the modern techniques because of their flexibility matching the model. Finally a couple of warnings in this context: When applied to modeling binary responses, different software packages and even different procedures within a package may give quite different results (Zhang et al., 2011). This kind of convergence problem is a common occurrence in

6 Methods to be compared 5 mixed-effects modeling (Fox & Weisberg, 2015) who also report that SAS (Proc NLMIXED) and R (lme4 and glmmml) yield different results for the same datasets though they all use the same integral approximation approach. By the way, sincere convergence problems are reported in quite a number of publications, e.g. by Beckman & Stroup (2003), who also tell: The SASavailable GLMM algorithms considered in this paper performed poorly with fewer than 20 subjects per treatment.... This raises significant questions about the viability of studies with few subjects and binary data. So this study tries on one side to fill the gaps from the Lunney study (unequal sample sizes, impact of other effects and split plot designs) and to compare the results with those produced with the other methods mentioned above. However those methods are focused which are easily available in the statistical packages and give results for the global hypotheses. But on the other side the number of designs, ANOVA arrangements and different models had to be limited. Whereas Lunney (1970) studied 4 different levels for the factors, 5 two-way and 3 three-way designs as well as nested designs, and e.g. Mandeville (1972) compared 4 fixed and several nonpatterned correlation matrices, here only a few ANOVA settings could be considered while additionally the impact of effects of other factors is studied. So the influence of a number of parameters is checked here: equal/unequal cell counts, small/large number of cells, effect models, success rate p, sample size n, and additionally for within subject designs the pattern of the correlations between the repeated measurements. But not in detail. Therefore further investigations may be helpful. 2. Methods to be compared It follows a brief description of the methods compared in this paper. More information, especially how to use them in R or SPSS can be found in Luepsen (2015). The parametric F-test In the case of a between subject design the 2-factorial ANOVA model for a dependent variable y with N observations shall be denoted by y ijk = α i + β j + αβ ij + e ijk with fixed effects α i (factor A, i=1,..,i), β j (factor B, j=1,..,j), αβ ij (interaction AB), error e ijk (k=1,..,n ij ), cell counts n ij and N = n ij. The parameters α i, β j and αβ ij with the restrictions α = 0 i, β j = 0, αβ ij= 0 can be estimated by means of a linear model y = X p + e using the least squares method, where y are the values of the dependent variable, p is the vector of the parameters, X a suitable design matrix and e the random variable of the errors. If the contrasts for the tests of the hypotheses H A (α i =0), H B (β i =0) and H AB (αβ ij =0) are orthogonal the resulting sum of squares SS A, SS B, SS AB of the parameters are also orthogonal and commonly called type III SSq. They are tested by means of the F-distribution. In case of equal sample sizes the sum of squares as well as the mean squares can be easily computed as N SS A --- I ( N = y i y )2 SS B --- J ( N = y j y )2 SS AB = ---- IJ ( y ij y i y j + y)2 MS A = SS A ( I 1) MS B = SS B ( J 1) MS AB = SS AB (( I 1) ( J 1) ) MS error = ( y ijk y ij ) 2 ( N IJ) and the F-ratios as F A = MS A MS error F B = MS B MS error F AB = MS AB MS error

7 Methods to be compared 6 where y i, y j are the level means of factor A and B, y ij are the cell means and y is the grand mean (see e.g. Winer, 1991). In the case of a mixed design with one grouping factor A and one repeated measures factor B, often called trial factor, the 2-factorial ANOVA model for a dependent variable y with N = n i observations shall be denoted by y ijk = α i + β j + αβ ij + τ ik + βτ ijk + e ijk with α i, β j, e ijk as above, n i subjects per group and a subject specific variation τ ik (k=1,..,n i ). The sums of squares and mean squares of the effects are the same as above, if N is substituted by NJ, due to the different definition of N, whereas those for the error terms are different: MS between = J ( y i k y i ) 2 MS within = ( y ijk y i k y ij + y i ) 2 i k and the F-ratios as F A = MS A MS between F B = MS B MS within F AB = MS AB MS within To make up for heterogeneous variances, i.e. here unequal p i, on factor B, an appropriate adjustment of the degrees of freedom for the F-test is applied. Here the Huynh-Feldt adjustment is chosen (see e.g. Winer et al., 1991). Puri & Sen tests (L statistic) k j These are generalizations of the well known Kruskal-Wallis H test (for independent samples) and the Friedman test (for dependent samples) by Puri & Sen (1985), often referred as L statistic. A good introduction offer Thomas et al. (1999). The idea dates back to the 60s, when Bennett (1968), Scheirer, Ray & Hare (1976) and later Shirley (1981) generalized the H test for multifactorial designs. It is well-known that the Kruskal-Wallis H test as well as the Friedman test can be performed by a suitable ranking of y (see e.g. Winer,1991), conducting a parametric ANOVA and finally computing χ 2 -ratios using the sum of squares. In fact the same applies to the generalized tests.the χ 2 -ratios are computed in the case of only grouping factors as χ 2 SS effect effect = MS total and in the case of a mixed design for the tests of A, B and AB as 2 SS A 2 SS χ A = B 2 SS χ B = AB χ AB = MS between MS within Here SS A, SS B, SS AB are the sum of squares as outlined before, but computed for R(y), the ranks of y, where midranks are used in case of tied values. MS between and MS within are the mean squares defined previously, and MS total the variance of R(y). The degrees of freedom are those of the numerator of the corresponding F-test. The major disadvantage of this method is the lack of power for any effect in the case of other nonnull effects in the model. The reason: In the standard ANOVA the denominator of the F- values is the residual mean square which is reduced by the effects of other factors in the model. In contrast the mean squares in the denominator of the χ 2 -tests of Puri & Sen s L statistic increase with effects of the other factors, thus making the ratio of the considered effect and therefore also the χ 2 -ratio smaller. A good review of articles concerning this test can be found in the study by Toothaker & De Newman (1994). i MS within

8 Methods to be compared 7 Brunner, Munzel and Puri (ATS) Based on the relative effect (see Brunner & Munzel, 2002) the authors developed two tests to compare samples by means of comparing the relative effects: the approximately F-distributed ATS (ANOVA-type statistic) and the asymptotically χ 2 -distributed WTS (Wald type statistic). The ATS has preferable attributes e.g. more power (see Brunner & Munzel (2002). The relative effect of a random variable X 1 to a second one X 2 is defined as p + = PX ( 1 X 2 ), i.e. the probability that X 1 has smaller values than X 2. As the definition of relative effects is based only on an ordinal scale of y this method is suitable also for variables of ordinal or dichotomous scale. For between subject designs detailed descriptions can be found in Brunner & Munzel (2002, chapter 3) as well as in Akritas, Arnold and Brunner (1997). These tests have been extended to reapeated measures designs by Brunner et al. (1999). Noguchi et al. (2012) described the procedures which involve a lot of matrix algebra, and developed the corresponding R package nparld. Logistic Regression and Probit Regression Instead of building a model for y these methods model the probability of y=1: logistic regression probit regression Here x i (i=1,..,p) are predictors which correspond in an ANOVA environment to design variables, β i are the regression parameters, and Φ the normal distribution. The computational procedures are described e.g. in Agresti (2002) and not repeated here. An ANOVA-type test, into which all β i belonging to the same effect are summarized, is available by means of the Wald test (see below) or LR (likelihood ratio) test. Usually the latter is preferred (see e.g. Agresti, 2002, and Fox, 1997), especially for smaller samples as analyzed in this study. But a warning for R users: the standard function for an ANOVA-type test, anova, returns sequential (type I) tests, running from 1st factor to the interaction, and the result is therefore dependent from the order of the factors for unbalanced designs. Normally the function Anova (from package car) should be preferred which yields type II F-tests. χ 2 -test and loglinear model For between subject designs Pearson s χ 2 -test is performed. The test of the main effects is received from the classical test of independence for two variables. And for the test of the interaction of factors A and B a loglinear model including all 2-way interactions is fitted which yields the desired result. For details see e.g. Agresti (2002). Hotelling-Lawley s multivariate test Py ( = 1) = exp( β i x i ) ( 1 + exp( β i x i )) Py ( = 1) = Φ( exp( β i x i )) This test is often used for the analysis of repeated measures designs because it does not require a compound symmetry of the variance-covariance matrix of y. Instead a multivariate normal distribution is demanded. Therefore this test does not seem appropriate for the analysis of dichotomous dependent variables. Nevertheless various authors tried it with differing success (see e.g. Mandeville, 1972, and Stiger, 1998). First the differences of two consecutive variables are computed d 1ik = y i2k - y i1k,, d 2ik = y i3k - y i2k,,.. (for i=1,..,i and k=1,..,n i ). Then d 1, d 2,.. are checked for 0 by means of Hotelling Lawley s test (see e.g. Winer et a.., 1991).

9 Methods to be compared 8 Koch s ANOVA Koch (1969 and 1970) proposed a couple of nonparametric procedures for split-plot designs based on a multivariate version of the Kruskal-Wallis test and a nonparametric analogue of the one-way Manova based on the trace. There are several variants for the cases with and without compound symmetry, as well as with and without independence of the factors A and B. The version used here assumes an interaction, but no compound symmetry. A detailed description can be found in Koch (1969) and shall not be repeated here. GEE (Generalized Estimating Equations) The GEE method (Liang & Zeger, 1986) can be considered as an extension of the logistic regression to designs with repeated measurements. The specification of the model requires the type of correlation matrix of y. Possible correlation structures are among others the compound symmetry (CS), also often named exchangeable, and the autoregressive (AR(1)), though most authors remark that an incorrect specification has scarcely an impact on the results (see e.g.fan et al., 2013). A short sketch of the model: Let p jk = exp( β i x ijk ) ( 1 + exp( β i x ijk )) (k=1,..,n, j=1,..,j and i=1,...,p) or equivalently p P jk ln = 1 p jk β i x ijk i be the marginal expectation of the binary y: y jk = p jk + e jk with E(e jk ) = 0 and Var(e jk ) = V k, and with P predictors and corresponding regression parameters β i (k=1,...,p). Here, x ijk is the design matrix of subject k, and V k = A k R k ( α)a k with a correlation matrix R k (α) for y k, which can be parametrized by a vector α, and A k = diag(p k1 (1-p k1 ), p k2 (1-p k2 ),...). Then the GEE estimates of β k are the solution of D k V ( k y k p ) p k k = 0 where D k = β k and y k = (y k1, y k2,...), p k = (p k1, p k2,...), β = (β 1, β 2,...) (see e.g. Emrich & Piedmonte, 1992). A detailed description of the model and the estimation process can be found in McNeish, D. & Stapleton, L.M. (2016). Also to mention Ziegler et al. (1998) who summarize a number of variants and different estimation methods for GEE. For the tests of the ANOVA effects the variance-covariance matrix of the estimation β k is essential. For this purpose normally the sandwich estimator by Liang & Zeger (1986) is used. But this has the disadvantage that for smaller N<50 the estimates are too small which leads to liberal tests of the β k. A number of authors proposed bias-corrected sandwich estimators, among others Fay & Graubard (2001), Kauermann & Carroll (2001), Mancl & DeRouen (2001) and Morel et al. (2003). A comparison of them can be found in Fan et al. (2013). But as Fan & Zhang (2014) had shown, these are also unsatisfying. Other solutions came from Pan & Wall (2001), Gosho et al. (2014) and Wang & Long (2011) who assumed additionally that all subjects have a common correlation structure. Since their estimate cov(y k ) is obtained by pooling observations across different subjects, whereas (y k -μ k )(y k -μ k ), the estimate by Liang & Zeger and the first mentioned authors, is only based on one observation, it is expected they are more efficient. Pan (2001) showed that in the extreme case when all subjects have different correlation structures the estimates by Pan and Liang & Zeger are the same. In a recent publication Wang et al. (2016) compared all of the currently available methods.

10 Methods to be compared 9 GLMM (Generalized LinearMixed Models) Also the GLMM method can be considered as an extension of the logistic regression to designs with repeated measurements. The model looks rather similar to the GEE model: for the binary y: y jk = p jk + e jk with E(e jk )=0 and Var(e jk )= V k. But here, in addition to the fixed effects β i (i=1,...,p) with design matrix x ijk, there are also random effects γ ik (i=1,...,q) for subject k with a design matrix z ijk, e.g. for modelling subject and repeated measures effects, and to reflect the correlation among the observations of the same subject, often called cluster in this context. Thus, a correlation structure, as for GEE, has not to be stated here. γ ik are multivariate normal distributed with E(γ ik )=0 und independent from the e jk which do not depend on the repeated measurements. Details, especially concerning the ML estimation, can be found at Tuerlinckx et al. (2006) and Song, X.Y. & Lee, S.Y. (2006). As mentioned above there are several estimation methods leading not necessarily to the same results, e.g. Penalized quasilikelihood, Laplace approximation, Gauss-Hermite quadrature or Markov chain Monte Carlo. Similar to GEE, here also the method is based on large sample asymptotic theory with the consequence that for small N<50 the tests for β k and γ ik are sometimes liberal. Li & Redden (2015) discuss a number of solutions for this problem which lies in the estimation of the denominator degrees of freedom (ddf) for the F-test into which the Wald test is transformed. The most popular solution is probably the rather complicated one by Kenward & Roger. The simpliest one uses ddf=n-rank(c) where C is the contrast matrix. Additional ANOVA-like tests are mentioned below. Wald tests p jk ln p jk = P i β i x ijk + Q i γ ik z ijk The primary results from an analysis using logistic regression, probit regression, the GEE or GLMM method are the estimates of the model parameters β i together with their standard errors and a significance test of β i =0 for each i, normally by means of the Wald test. But in this context an ANOVA-like test is desired into which all β i belonging to the same effect are summarized. On one side there is Wald s χ 2 -test in the variant for several parameters (see e.g. Carr & Chi, 1992 and Pan & Wall, 2001): ( Cβˆ )'( CV β C' ) 1 ( Cβˆ ) which is approximately χ 2 -distributed with rank(c) degrees of freedom, and where βˆ are the estimates of β, V β is the variance-covariance-matrix of β, and C a contrast matrix, and in its simpler form (see e.g. Kenward & Jones, 1992) β' ˆ 1 Vβ βˆ where βˆ and V β are restricted to those i belonging to the effect of interest. Fan & Zhang (2014) found that the above test is too liberal for small sample sizes and proposed a different one, based on the work of Akritas et al. (1997) and Brunner et al. (1997): Q = βˆ'c( C'C) 1 C'βˆ The expression (c 1 /c 2 )Q is approximately χ 2 -distributed with f degrees of freedom where c 1 c 2 = = tr( TV β ) tr( TV β TV β )

11 The Study 10 T = C( C'C) 1 C' The tests above cannot be simply calculated from the usual table of coefficients βˆ which include normally either a normal z- or a χ 2 -value. Therefore finally it should be remarked that, assuming the βˆ as independent, i.e. their covariances being zero, an ANOVA-like test can be computed by summing up either the χ 2 - or the z 2 -values which has the same degrees as the numerator from the corresponding F-test. While the Wald test above is equivalent to a Type III test, Fox & Weisberg (2011, chapter 4.4.4) favored a Type II Wald test which is offered in the function Anova of the R package car. It is based on the likelihood ratio method, using analysis of deviance tests. This one conforms to the principle of marginality and is most powerful in the case of no interaction. Using it the main effects may be overestimated, in contrary to the interaction effects. For the application of GEE estimation Pan (2001) derived additionally the exact degrees of freedom for the above described Wald test for those cases where a robust estimator of the covariance matrix V β, mentioned in a previous section, is used: the sum of eigenvalues of V P VLZ, 1 where V P and V LZ are the covariance estimates by Pan and Liang & Zeger respectively. As this test requires an invertible covariance matrix which is often not guaranteed for smaller n i it is not recommendable in this context. 3. The Study 2 f = c 1 c2 This is a pure Monte Carlo study, which means a couple of designs and theoretical distributions had been chosen from which a large number of samples had been drawn by means of a random number generator. These samples had been analyzed using the various ANOVA methods. Concerning the number of different situations, for example, distributions, equal/unequal cell counts, effect sizes, relations of means, variances, and cell counts, one had to restrict to a minimum, as the number of resulting combinations produce an unmanageable amount of information. Therefore, not all influencing factors could be varied. The main focus had been laid upon the control of the type I error rates for α = 0.05 and α = 0.01 as well as upon a comparison of the power for the various methods in the following situations: There are two major designs: a between subject and a mixed (split-plot) design. For both the following subdesigns are analysed: a 2*4 design ( small design ) with equal cell counts (balanced) and one with unequal cell counts and a ratio max(n ij )/min(n ij ) of 3 (unbalanced), and a 4*5 design ( large design ) with equal cell counts (balanced) and one with unequal cell counts and a ratio max(n ij )/min(n ij ) of 4. Mainly the aspect will be put on differences between the small balanced and the large unbalanced designs. So, if results show differences between them in a special situation, this may have two reasons. In such cases additionally the results for the corresponding small unbalanced and the large balanced designs are consulted to get things straight concerning the difference. The binomial probabilities p i had been set to 0.5, 0.8 and 0.9, which is equivalent to 0.5, 0.2 and 0.1, as for 0.25 p 0.75 the variances of a binomial distributed outcome can be regarded as equal. For the split-plot design the following correlation structures had been chosen which are assumed equal for all groups, while heterogeneous correlation matrices over the groups of factor A were not considered:

12 The Study 11 exchangeable (compound symmetry) with r=0.3, a value that seems realistic and had often been chosen (see e.g. Emrich & Piedmonte, 1992), and descending correlations r=(0.7, 0.5, 0.4, 0.2) which is similar to the AR(1) structure, denoted as ar1. In the case of between subject designs the error rates had been checked for the following effect models: main effects and interaction effect for the case of no effects (null model, equal probabilities p i ) and for the case of one significant main effect A(0.6), main effect B for the case of a significant interaction AB(0.6) and for the case of significant effects A(0.6) and AB(0.6), interaction effect for the case of both significant main effects A(0.4) and B(0.4). In the case of mixed designs the error rates had been checked for the following effect models: main effects and interaction effect for the case of no effects (null model, equal probabilities p i ), for the case of one significant main effect A(0.6) or B(0.4), and for the case of a significant interaction AB(0.4). interaction effect for the case of both significant main effects A(0.3) and B(0.3). where e.g. A(d) denotes an effect of size d for factor A, which means p i -s*d/2 is compared with p i +s*d/2 (s being the standard deviation), or computationally p i ( d 2) p i ( 1 p i ) with p i + ( d 2) p i ( 1 p i ). In 2*4 designs the first group will be compared with last group, in 4*5 designs the first two groups with the last two groups. An analog procedure is chosen for B(d) and the interaction effect. For unbalanced designs the interaction effects ab ij had to be adjusted in order to avoid impacts on the main effects. Unfortunately first simulations revealed that difficulties arise when data have to be generated in mixed design models with p i =0.9 and effects had to be added. In order not to let the shifted parameter come too close to 0 or 1 respectively, this parameter had generally to be reduced to p i =0.88. The problems intensified in the case of an unequal correlation structure with descending correlations where effect sizes had to be reduced from 0.6 to 0.4 (for factor B) and 0.4 to 0.36 (for the interaction). Another problem to be investigated is heterogeneity in unbalanced designs, because it is wellknown that at least the parametric F-test tends to be conservative if cells with larger n i have also larger variances (positive pairing) and that it reacts liberal if cells with larger n i have the smaller variances (negative pairing), as Feir & Toothaker (1974) and Luepsen (2017), to name a few, reported. Here unequal variances coincide with unequal means, i.e. in between subject designs as well as in mixed designs a factor A with nonnull effects will produce heterogeneity which may in unbalanced designs lead to biased type I error rates. Being aware that for p i >0.5 the variances of y become smaller with increasing p i, it is to be expected that the F-test reacts liberal if levels of A with larger n i have also larger p i and that it reacts conservative if levels with larger n i have the smaller p i. Of course, the same behavior will apply to the case p i <0.5. Therefore the effects of factor A will be analyzed for both situations. Though for both designs, between subjects and mixed, the effect size will be the same (d=0.6), the ratio max(n i )/min(n i ) will be only 1.3 for the between subjects, but 4 for the mixed design. As in the case of p i =0.5 with an effect d for factor A the resulting p i -s*d/2 and p i +s*d/2 will be equal and therefore produce equal variances, p i =0.6 is chosen instead when situations of heterogeneity are analyzed.

13 Selected Methods and Preliminary Results 12 The type I error rates at 5% and 1% and the power were computed for n i =5,10,15,..,50 as percentages of rejected null hypotheses. The number of repetitions were 2000 for both, the between subject and the split-plot designs. Because of the enormous computational effort for the GEE and the GLMM method, the number of repetitions for these had been limited to The relatively small number of samples is not unusual (see e.g. McNeish, 2017 and Guerin & Stroup, 2000). Due to the convergence problems, mentioned in the previous chapter, which occurred mainly with GLMM in smaller samples n i 15, the actual number of repetitions was reduced sometimes by about 2 percent. But the situation became much worse with GEE which produced unmanageable covariance matrices for smaller samples n i 10. The failure rates were sometimes 90 percent and more. In those cases the repetitions had to be increased to 5000 in order to receive at least 200 valid results, or the sample size of 5 had been dropped from the study, especially for unbalanced designs. For between subject designs the variables had been generated from a uniform distribution which had been dichotomized according to the desired p i. For the repeated measurements rmvbin (from R package bindata) created correlated multivariate binary random variables by thresholding a normal distribution, given probabilities p i and a given correlation matrix. 4. Selected Methods and Preliminary Results For between subject designs these are the parametric F-test, Puri & Sen s L-statistic, Brunner & Munzel s ATS, the logistic and the probit regression, both with the Wald and the LR test, as well as loglinear models (denoted as chi square). As first results showed that, especially for small samples, the LR test behaves rather liberal while the Wald test acts extremely conservative, a compromise test for the logistic and the probit regression is proposed, composed by the χ 2 -values of the two tests with the same degrees of freedom, denoted by WLR: 2 χ WLR 2 χ Wald 2 χ LR = ( + ) 2 Especially for between subject designs there were considerations to include exact tests for the comparison of two groups (see e.g. Storer & Kim, 1990). These could be either corrected for the number of comparisons or accumulated, e.g. by Fisher s combined probability test, to get a test for the global hypothesis. But first, some preliminary tests revealed these solutions lead to a very poor power, and secondly, it is very difficult to deduce an corresponding test for the interaction. Concerning split-plot designs the choice is also the parametric F-test, with and without the Huynh-Feldt correction, Puri & Sen s L-statistic, Brunner & Munzels ATS, together with Hotelling-Lawley s multivariate ANOVA test, Koch s rank ANOVA, later often denoted as standard methods, as well as GEE and GLMM, special cases of GLM models. For the selection of a suitable function to apply these two methods in R, preliminary studies had been necessary. For the GLMM analysis three R functions had been tested: glmer (R package lm4) which is based on restricted maximum likelihood estimation (REML) using a bounds constrained quasi- Newton method (nlminb, by means of R function optimx from package optimx), glmmpql (R package MASS) which uses Penalized quasilikelihood estimation and glmmml (R package glmmml) which applies adaptive Gauss-Hermite quadrature. (A practical comparison give Bolker et al., 2009, while Zhang et al., 2011, matched the methods using data.) The three methods had been compared for the above mentioned balanced and unbalanced designs with n i =5,...,50, given compound symmetry and p i =0.8, for the following models: null model, A signficant, B signficant and AB significant. To compare the results the classical Wald test, the robust test by Fan & Zhang (2014) and the Type II Wald test (except glmmml)had been used

14 Selected Methods and Preliminary Results 13 which were described in the previous chapter. Additionally the F-transformed Wald test and the sum of χ 2 -values were computed (see appendix B6). The only satisfying procedure was glmer, which had the error rates completely under control. But, unexpectedly, it is the Type II Wald test which manages also the case of a signficant interaction, though this one is not recommended if there is an interaction. In contrast, the other ANOVA-type tests explode with increasing sample sizes. glmmml generally offends the acceptable limits for the type I error rate with values around 10 for sample sizes n i 10. And glmmpql shows in nearly all situations error rates beyond 10. For the choice of the GEE procedure the focus had been laid upon the different estimation methods for the covariance matrix of the parameter estimates β ˆ i. As a basis for the estimation of the parameters themselves the function geeglm (R package geepack) had been used which is based on the estimation method by Prentice & Zhao (1991). All the 9 methods described by Wang et al. (2016) had been compared for the above mentioned situations using the functions from the R package geesmv. For the estimation according to Liang & Zeger together with the sandwich estimators for the covariance matrix of the β ˆ i two different functions had been considered: the above mentioned geeglm and the corresponding function GEE.var.lz in package geesmv. The results in general were rather disappointing (see also appendix B6), particularly for the case of an interaction effect, when none of the methods was able to keep the type I error rate for the test of any main effect in an acceptable range. In fact, for all methods and tests the error rates rose up to over 90 percent for n i =50 (see 6.10 and 6.11 in B6). A few words to the two functions used for the Liang & Zeger method. Generally geeglm behaved much more liberal than GEE.var.lz. E.g. for the test of the interaction in the null model where the first one showed exceeding error rates for nearly all sample sizes, especially for unbalanced designs, and even when the conservative ANOVA test by Fan & Zhang is used (see B6.3). Similar results were achieved for the test of the interaction when there are significant main effects. One reason might be the following fact: GEE.var.lz is just a transcription of the formula into R. For this study the function had been modified so that any sample is dropped if computational problems like a not invertible matrix arise. By the way, the same had been applied to the other functions of package geesmv. In contrary geeglm tries to handle problematic situations itself. Among others this difference leads to quite different accepted cases from the sample. But as cited in the first chapter, the same method may lead to different results when different functions or programs are used. Some remarks to the other covariance estimation methods. As to be expected: the ANOVA-like tests by Fan & Zhang (2014) show generally much smaller error rates than the Wald test, but with the disadvantage of an also much smaller power. McKinnon shows in many situations a strange behavior: e.g. higher error rates for the Fan & Zhang test than for the Wald test and decreasing power rates for increasing sample sizes (see tables 6.4 and 6.8 in B6). For the methods by Kauermann & Carroll, Mancl & DeRouen and Fay & Grauband there are huge differences between the error rates by the Wald test and the one by Fan & Zhang in a number of situations. In general the solutions from Pan & Wall (2001), Gosho et al. (2014) and Wang & Long (2011) which obtain their estimates by pooling observations across different subjects, as well as the method by Morel et al. (2003), have the most benevolent behavior. For the first three methods additionally the previously mentioned ANOVA-type test by Pan (2001) had been computed which is able to control the type I error rate in a same way as the one by Fan & Zhang (2014), but shows on the other side clearly better power rates: e.g. 35 (factor A for unequal n i ), 51 (factor B for equal n i ) and 37 (factor B for unequal n i ) in contrary to 14, 26 and 22 in the same situations for the test by Fan & Zhang (see tables 6.4 and 6.8 in B6). But, unfortunately, even these

15 Results 14 procedures cannot keep the type I error rate of the main effects under control for the case of a significant interaction. In an additional preliminary study the error rates and power for the previously mentioned four methods, together with the one by Liang & Zeger, have been studied for all situations quoted in the previous chapter, where the ANOVA-type tests by Wald, by Fan & Zhang and, if possible, by Pan were applied. For the case of an underlying ar1-correlation structure of the data two GEE models have been analyzed: one assuming the correct ar1-structure and one assuming an exchangeable (compound symmetry) correlation structure. The results from it are to be found in appendixes B7 and B8. If there is no interaction effect the procedure of Morel keeps the type I error always under control, also the one by Gosho et al., except in a few situations for n i =5. For the approach by Wang & Long it is the test by Pan which leads sometimes to exceeding error rates for small n i. The one by Pan & Wall is even more severe affected, though for both the ANOVA-like test by Fan & Zhang keeps the error under control while the Wald test leads often to liberal results, e.g. for the tests of factor B with an ar1-structure (see and in B7) or if an ar1-structure is handled as an exchangeable structure (see e.g , or in B7). Finally, for the approach by Liang & Zeger it is again the test by Fan & Zhang which produces satisfying results whereas the Wald test leads generally to error rates of 10 and above. As had to be expected: for all methods using the ANOVA-like test by Fan & Zhang the power is the worst, from which the one by Morel has the lowest. Reasonable results are achieved with the procedures of Wang & Long and Gosho et al., if the Wald test is used, which both are reliable concerning the type I error rate. Regarding the case of an interaction effect, the type I error rates rise up to 50 percent and beyond (for n i =50) for all methods and nearly all situations (see 7.3 and 7.6 un B7). There are only differences in the size of the intervals for which the distinct methods keep the error rate in the acceptable range. If the focus is directed to those techniques favoured above, one has to accept that only for n i 20 admissible error rates are produced, while, of course, the conservative methods, like the one by Morel et al., have generally a wider range n i 40, especially when the test by Fan & Zhang is applied. Overall it can be concluded that the procedures by Gosho et al. and by Wang & Long in conjunction with the ANOVA tests by Wald or Pan are the most recommendable from all GEE models with a slight advantage for the first one, while the Pan & Wall procedure is a bit more liberal in some situations. Finally, a couple of observations to mention: first, the error and power rates reduce with rising parameter p i, and second, if the underlying correlation structure is assumed as an ar1 model, the Liang & Zeger method as well as the application of the ANOVA test by Pan lead to rather conservative results (see e.g. the tables in B9, where the results for the different correlation structures are summarized). As a result from these experiences, for the comparison with the other ANOVA methods the estimation approach by Gosho et al. (2014) is selected together with the ANOVA-type tests by Wald, by Fan & Zhang and by Pan. As these are not generally available the estimation results by Liang & Zeger as received from GEE.var.lz together the ANOVA-type test by Fan & Zhang are also included for comparison purposes. One final remark: As cited in the first chapter, the same method may lead to different results when different functions or programs are used. 5. Results Tables and Graphical Illustrations The following remarks represent only a small extract from the numerous tables and graphics produced in this study and will concentrate on essential and perhaps unexpected results. All tables and corresponding graphical illustrations are available online (see address below). These

Comparison of nonparametric analysis of variance methods a Monte Carlo study Part A: Between subjects designs - A Vote for van der Waerden

Comparison of nonparametric analysis of variance methods a Monte Carlo study Part A: Between subjects designs - A Vote for van der Waerden Version 5 completely revised and extended (13.7.2017) Haiko Lüpsen