A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies


Anne-Laure Boulesteix, Robert Hable, Sabine Lauer, Manuel Eugster

A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Technical Report Number 136, 2013
Department of Statistics, University of Munich

A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Anne-Laure Boulesteix (1), Robert Hable (2,3), Sabine Lauer (1), Manuel J. A. Eugster (2,4)

(1) Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Marchioninistr. 15, Munich, Germany. boulesteix@ibe.med.uni-muenchen.de
(2) Department of Statistics, University of Munich (LMU), Munich, Germany
(3) Chair of Stochastics, Department of Mathematics, University of Bayreuth, Bayreuth, Germany
(4) Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland

Anne-Laure Boulesteix is a statistician and biometrician, Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Munich, Germany. Robert Hable is a mathematician and statistician, Department of Mathematics, University of Bayreuth, Bayreuth, Germany. Sabine Lauer is a statistician and sociologist, Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Munich, Germany. Manuel Eugster is a computer scientist and statistician, Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland.

ABSTRACT

In computational sciences, including computational statistics, machine learning, and bioinformatics, most abstracts of articles presenting new supervised learning methods end with a sentence like "our method performed better than existing methods on real data sets", e.g. in terms of error rate. However, these claims are often not based on proper statistical tests and, if such tests are performed (as is usual in the machine learning literature), the tested hypothesis is not clearly defined and little attention is devoted to the type I and type II errors. In the present paper we aim to fill this gap by providing a proper statistical framework for hypothesis tests that compare the performance of supervised learning methods based on several real data sets with unknown underlying distributions. After giving a statistical interpretation of ad-hoc tests commonly performed by machine learning scientists, we devote special attention to power issues and suggest a simple method to determine the number of data sets to be included in a comparison study to reach an adequate power. These methods are illustrated through three comparison studies from the literature and an exemplary benchmarking study using gene expression microarray data. All our results can be reproduced using R code and data sets available from the companion website de/organisation/mitarbeiter/020_professuren/boulesteix/compstud2013.

KEYWORDS: Benchmarking, supervised learning, comparison, testing, performance

1. INTRODUCTION

Almost all machine learning or computational statistics articles on supervised learning methods include a more or less extensive comparison study based on real data sets of moderate size assessing the respective performance (typically in terms of prediction error) of a few algorithms, either new or already described in the previous literature. These comparisons are often termed benchmark experiments (e.g. Hothorn et al., 2005) in the statistical and machine learning communities. However, the statistical foundations of these comparisons usually receive surprisingly little attention in concrete comparison studies, although they are addressed in a large body of methodological literature. For simplicity we will from now on talk about the comparison of two supervised classification methods, but all the ideas discussed here are also relevant to the comparison of more than two methods or to other supervised learning problems. These two methods are used to derive a classification rule from an available training sample. In a seminal paper, Dietterich (1998) proposes a general taxonomy of the problems related to performance evaluation and comparison in supervised learning. The comparison of two methods for a particular distribution in the absence of a large sample is termed "Question 8", while "Questions 1 to 7" consider error estimation (as opposed to comparison of methods), situations with large samples, and/or the assessment of specific classification rules (as opposed to the problem considered here of the assessment of the methods used to derive classification rules). The paper considers a few simple testing procedures based on resampling-based estimates of the prediction error and compares them empirically via simulations in terms of type I error and power. The conclusion is that none of these procedures is completely satisfactory. The essential problem is that they do not use adequate estimates of the unconditional variance of the error estimates. Estimating the unconditional variance based on the empirical variance over resampling iterations implies a violation of independence assumptions and thus a reduction of the effective degrees of freedom

(Bouckaert, 2003). Nadeau and Bengio (2003) provide an overview of estimators of the unconditional variance of resampling-based error estimates used in the machine learning community and suggest two additional estimators that can be applied to error estimators obtained through repeated splitting into training and test data. They present benchmark experiments as a natural application of their new estimators and stress that naive estimators of the variance (e.g. the empirical variance of the estimated error over resampling iterations) often lead to false positives in the sense that researchers see a difference in performance between the considered methods although there is no such difference. Hothorn et al. (2005) give a statistical interpretation of these issues and recommend estimating the variance of resampling-based error estimates through bootstrapping. Dietterich (1998) also mentions a further question, termed "Question 9" in his taxonomy and referring to several domains, where the term domain here denotes a data set with its own underlying distribution: "Question 9: Given two learning algorithms A and B and data sets from several domains, which algorithm will produce more accurate classifiers when trained on examples from new domains? This is perhaps the most fundamental and difficult question in machine learning." Indeed, almost all comparison studies based on real data consider not one but several data sets, thus implicitly involving several different underlying distributions. While it is usual to perform hypothesis tests to compare the performance of different methods in the context of benchmarking studies (Demšar, 2006), the literature on the statistical interpretation of such tests based on multiple data sets is surprisingly sparse. One of the few articles on this topic suggests using a mixed-model approach with the data sets as subjects with random intercepts and the methods as fixed effects (Eugster et al., 2012).

In this paper, our aim is to give a statistical formulation of tests performed in the context of comparison studies and to correspondingly interpret the results of published comparison studies, with a focus on classification based on high-dimensional gene expression data as an illustration. The paper is structured as follows. In Section 2 we outline the epistemological background of comparisons of prediction errors from a statistical perspective, contrasting with the machine learning perspective taken by most papers handling this topic. Section 3 is especially devoted to tests in the context of comparison studies based on several real data sets and gives an original formulation of these tests in a strict statistical framework. In particular, Section 3 suggests a power calculation approach in this context. Section 4 presents an application of these methods to three comparison studies of classification methods for microarray data from the literature, while Section 5 describes an exemplary benchmark study based on 50 data sets. The appendix contains some technicalities which are needed for the mathematically rigorous formulation of the testing problem given in Section 3.

2. EPISTEMOLOGICAL BACKGROUND

Settings

From a statistical point of view, binary supervised classification can be described in the following way. On the one hand, we have a response variable taking values in $\mathcal{Y} = \{0, 1\}$. On the other hand, we have predictors taking values in $\mathcal{X} \subseteq \mathbb{R}^p$ that will be used for constructing a classification rule. The predictors $X$ and the response $Y$ follow an unknown joint distribution on $\mathcal{X} \times \mathcal{Y}$ denoted by $P$. The observed i.i.d. sample of size $n$ is denoted by $s_0 = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The classification task consists in building a decision function $\hat{f}$ that maps elements of the predictor space $\mathcal{X}$ into the

response space $\mathcal{Y}$:

$$\hat{f}^{s_0}: \mathcal{X} \to \mathcal{Y}, \quad x \mapsto \hat{f}^{s_0}(x),$$

where the superscript $s_0$ indicates that the decision function is built using the sample $s_0$. For simplicity we assume that the classification method is deterministic, i.e. that $\hat{f}^{s_0}$ is uniquely defined given the sample $s_0$. There are many possible methods to fit a function $\hat{f}^{s_0}$ based on a sample $s_0$, which we denote as $M_1, \ldots, M_K$. The decision function obtained by fitting method $M_k$ to the sample $s_0$ is denoted as $\hat{f}^{s_0}_{M_k}$. The true error of this classification rule $\hat{f}^{s_0}_{M_k}$ can be written as

$$\varepsilon(\hat{f}^{s_0}_{M_k}, P) = E_P\left[L\left(\hat{f}^{s_0}_{M_k}(X), Y\right)\right] \quad (1)$$

where $E_P$ stands for the mean over the joint distribution $P$ of $X$ and $Y$, and $L(\cdot,\cdot)$ is an adequate loss function, e.g. the indicator loss yielding the classification error rate considered in this paper. The true error $\varepsilon(\hat{f}^{s_0}_{M_k}, P)$ of method $M_k$ constructed using sample $s_0$ is commonly referred to as the conditional error since it corresponds to the decision function constructed on the specific sample $s_0$. The notation $\varepsilon(\hat{f}^{s_0}_{M_k}, P)$ stresses that this error depends on the distribution $P$ as well as on the method $M_k$ and the sample $s_0$ used to fit the classification rule. In this perspective, the error $\varepsilon(\hat{f}^{S}_{M_k}, P)$ (corresponding to Eq. (1) with $s_0$ replaced by $S$) should be seen as a random variable, where $S$ stands for a random i.i.d. sample that follows the distribution $P^n$. The mean of this random variable over $P^n$ is commonly referred to as the unconditional true error rate of method $M_k$. In this paper it is denoted as $\varepsilon(n, M_k, P) = E_{P^n}[\varepsilon(\hat{f}^{S}_{M_k}, P)]$. It depends only on the method $M_k$, on the size $n$ of the sample, and on the joint

distribution $P$ of $X$ and $Y$. Note that the joint distribution $P$ is involved twice in this formula.

Method 2 is better than method 1: what does this mean?

Now suppose that we are interested in the relative performance of two methods $M_1$ and $M_2$. What does it mean when we ask whether method $M_2$ is better than method $M_1$ or vice-versa? An applicant (for instance a biologist) who collected a specific data set $s_0$ in his lab is primarily interested in whether the classification rule $\hat{f}^{s_0}_{M_2}$ fitted with method $M_2$ has a smaller error on future independent data than the classification rule $\hat{f}^{s_0}_{M_1}$ fitted with method $M_1$ or vice-versa, i.e. whether $\varepsilon(\hat{f}^{s_0}_{M_2}, P) < \varepsilon(\hat{f}^{s_0}_{M_1}, P)$. We define the null and alternative hypotheses of the applicant correspondingly as

$$H_0^{(cond)}: \varepsilon(\hat{f}^{s_0}_{M_2}, P) - \varepsilon(\hat{f}^{s_0}_{M_1}, P) \geq 0 \quad \text{vs.} \quad H_1^{(cond)}: \varepsilon(\hat{f}^{s_0}_{M_2}, P) - \varepsilon(\hat{f}^{s_0}_{M_1}, P) < 0.$$

These hypotheses can be seen as conditional (hence the superscript $(cond)$) in the sense that they are conditional on a fixed sample $s_0$. In contrast, statisticians or machine learners doing methodological research are not primarily interested in the performance of the classification rule fitted on a specific sample $s_0$ but rather in the mean performance over different samples. In mathematical terms, they are interested in the comparison of the unconditional errors $\varepsilon(n, M_1, P)$ and $\varepsilon(n, M_2, P)$, yielding the corresponding hypotheses:

$$H_0^{(uncond)}: \varepsilon(n, M_2, P) - \varepsilon(n, M_1, P) \geq 0 \quad \text{vs.} \quad H_1^{(uncond)}: \varepsilon(n, M_2, P) - \varepsilon(n, M_1, P) < 0,$$

with the superscript $(uncond)$ standing for unconditional. When methodological researchers write that method $M_2$ is better than the standard method $M_1$, they

implicitly mean that the unconditional error is smaller for $M_2$ than for $M_1$ and that $H_0^{(uncond)}$ can be rejected. In practical data analysis, when the distribution $P$ is unknown, it is not easy to test $H_0^{(uncond)}$. A natural estimator of the difference $\varepsilon(n, M_2, P) - \varepsilon(n, M_1, P)$ is the difference between resampling-based error estimates obtained with the two considered methods, e.g. cross-validation error estimates. The problem is that the true unconditional variance of this difference under $H_0^{(uncond)}$ is unknown and difficult to estimate. Many naive or more complex estimates of this variance have been considered in the literature (Dietterich, 1998; Nadeau and Bengio, 2003; Hanczar and Dougherty, 2010), but they rely on uncertain assumptions that are most often not met in practical cases. The essential problem is that they are all based on the available sample $s_0$, while their target is actually the variance over different samples drawn from $P^n$. Hence, they are all conditional in some way. To date, there exists no widely accepted test for testing $H_0^{(uncond)}$ based on a real data set with unknown underlying distribution.

Epistemological background: theory and simulations

For a given distribution $P$ and a specific $n$, however, it is easy to empirically approximate the unconditional error $\varepsilon(n, M_1, P)$ of a given classification method $M_1$ via simulations. A straightforward procedure is as follows (a code sketch is given at the end of this subsection):

1. Randomly draw a huge number $n_{test}$ of independent realizations of $P$, yielding a so-called test sample $s^{(T)} = \{(x^{(T)}_1, y^{(T)}_1), \ldots, (x^{(T)}_{n_{test}}, y^{(T)}_{n_{test}})\}$. For instance, a very large value of $n_{test}$ may be appropriate.

2. For $b = 1, \ldots, B$, with $B$ large (typically $B \geq 1000$):

2.a. Draw $n$ i.i.d. realizations from the distribution $P$, yielding the training sample $s^{(b)} = ((x^{(b)}_1, y^{(b)}_1), \ldots, (x^{(b)}_n, y^{(b)}_n))$.

2.b. Fit method $M_1$ to $s^{(b)}$, yielding the classification rule $\hat{f}^{s^{(b)}}_{M_1}$.

2.c. Estimate the true error of $\hat{f}^{s^{(b)}}_{M_1}$ based on the test sample drawn in step 1 as the proportion of misclassified realizations in the test sample:

$$\hat{\varepsilon}(\hat{f}^{s^{(b)}}_{M_1}, P) = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} L\left(\hat{f}^{s^{(b)}}_{M_1}(x^{(T)}_i), y^{(T)}_i\right).$$

3. Estimate the true unconditional error $\varepsilon(n, M_1, P)$ as

$$\hat{\varepsilon}(n, M_1, P) = \frac{1}{B} \sum_{b=1}^{B} \hat{\varepsilon}(\hat{f}^{s^{(b)}}_{M_1}, P).$$

Note that this approach involves two embedded approximation procedures: approximation of the true error of $\hat{f}^{s^{(b)}}_{M_1}$ using a huge test sample $s^{(T)}$, and approximation of the unconditional error over $P^n$ by averaging over a large number $B$ of random training samples $s^{(b)}$. Note that in specific cases, the unconditional error might even be derived analytically. No matter whether one derives this error analytically or via simulations, two parameters have to be chosen: the sample size $n$ and the joint distribution $P$. Except for trivial examples (e.g. an algorithm that randomly generates a rule $f$ without looking at the data), it is not to be expected that the new method $M_2$ performs better than the standard method $M_1$ for all sample sizes and all imaginable joint distributions: the so-called "no free lunch" theorem (Wolpert, 2001).
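To make the procedure concrete, here is a minimal R sketch of steps 1-3 under an artificial data-generating process. The functions draw_from_P, fit_M1 and predict_M1, the stand-in classifier (logistic regression), and all numerical settings are illustrative assumptions, not part of the original study.

```r
## Minimal sketch of the Monte Carlo approximation of the unconditional error
## epsilon(n, M1, P). The data-generating process and classifier are placeholders.
set.seed(1)

## Hypothetical stand-in for the unknown P: two Gaussian classes in p = 5 dimensions.
draw_from_P <- function(m, p = 5, delta = 0.5) {
  y <- rbinom(m, 1, 0.5)
  x <- matrix(rnorm(m * p), nrow = m) + delta * y   # shift class-1 observations
  list(x = x, y = y)
}

## Method M1: logistic regression as an arbitrary stand-in classifier.
fit_M1     <- function(train) glm(y ~ ., data = data.frame(train$x, y = train$y),
                                  family = binomial)
predict_M1 <- function(fit, x) as.numeric(predict(fit, data.frame(x),
                                                  type = "response") > 0.5)

n      <- 50       # training sample size of interest
n_test <- 10000    # size of the huge test sample (step 1); illustrative value
B      <- 1000     # number of training samples (step 2)

test  <- draw_from_P(n_test)                       # step 1
err_b <- replicate(B, {                            # steps 2.a-2.c
  train <- draw_from_P(n)
  fit   <- fit_M1(train)
  mean(predict_M1(fit, test$x) != test$y)          # indicator loss on the test sample
})
eps_hat <- mean(err_b)                             # step 3: estimate of epsilon(n, M1, P)
```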

Epistemological background: real data study

The essential limitation of simulations and analytical results is that the chosen distribution $P$ often does not reflect the complexity of real-life distributions. Therefore, performance estimation on real data is usually considered as extremely important. In essence, the goal of such studies is to evaluate the considered classification methods $M_1$ and $M_2$ on real-life distributions $P$, i.e. to compare $\varepsilon(n, M_k, P)$ for $k = 1, 2$. There are, however, two important problems related to real data studies.

The first problem ("variability of error estimation") is related to the fact that for a specific data set the underlying distribution $P$ is essentially unknown in practice. It is thus impossible to derive the prediction error analytically. If a huge sample is given, we can obtain a good approximation by drawing numerous non-overlapping samples of the considered size $n$ out of the huge sample, and estimating the error on the rest of the sample. However, in most practical situations no huge samples are given and resampling methods are then used to address this estimation issue. Such methods, however, are known to poorly estimate the unconditional error because they suffer from a very high variance and/or a high bias, resulting in a large mean squared error (Braga-Neto and Dougherty, 2005; Zollanvari et al., 2009; Dougherty et al., 2011; Zollanvari et al., 2011; Dalton and Dougherty, 2012a,b). The second problem ("variability across data sets") is that each real-life data set follows its own distribution. In the example of microarray gene expression data, the leukemia data set of size $n = 38$ by Golub et al. (1999) follows a certain distribution $P_1$ while the breast cancer data set of size $n = 76$ by Van't Veer et al. (2002) follows another distribution $P_2$. Even if we had a good estimate of the unconditional error $\varepsilon(n = 38, M_1, P_1)$ for method $M_1$, it would tell us nothing about $\varepsilon(n = 38, M_1, P_2)$ and $\varepsilon(n = 76, M_1, P_2)$, unless we assume that $P_2$ is somehow similar to $P_1$, an assumption that may make sense in some exceptional cases, but not in general. When researchers perform a real data study based on several data sets, they implicitly aim to capture the variability across data sets, i.e. across distributions. A problem related to the variability across data sets is the definition of the targeted area of application. It can be defined in a very general way without restrictions, or the authors may choose to focus on a very particular area of application with restrictions regarding the structure of the data and/or the substantive context of the data sets. No matter how the area of application is defined and how example data sets are selected, the two sources of variability (variability of error estimation and variability across data sets) imply a high variance. The variance of error estimation

can be addressed by using an estimation method known to have smaller variance: for instance, we know that repeated subsampling with a training/test splitting ratio of, say, 2:1 has a smaller (unconditional) variance than leave-one-out cross-validation (LOOCV). However, the variance remains high even for the less variable methods. As to the variability across data sets, it can only be addressed by increasing the number of candidate data sets. Therefore, after motivating the issues with an illustrative exemplary scenario, we formally address the problem of testing the difference between the unconditional errors of two methods $M_1$ and $M_2$ in a real data study with several data sets and suggest a statistical framework, including power considerations.

Examples

Suppose that authors want to compare two simple statistical procedures for supervised classification based on high-dimensional microarray data to be used after a variable selection step: linear discriminant analysis (method A, considered as the reference), and diagonal linear discriminant analysis (method B, considered as a new method suggested by the authors). Method B is expected to work well if the covariance matrix is indeed diagonal (which depends on $P$). Method A may have problems if the sample size $n$ is not large compared to the number of predictors (which also depends on $P$ via the dimension of $\mathcal{X}$), because the estimated covariance matrix is then ill-conditioned and its inversion is problematic. But it may work well if the true model implied by the distribution $P$ involves correlations between the variables and if $n$ is large enough. Quite generally, for learning algorithms based on the estimation of the parameters of a stochastic model, the closer the assumed model is to the true distribution $P$, the better the prediction accuracy in infinite-sample settings. Things are more complicated for learning algorithms that are not based on an underlying stochastic model and in finite-sample settings, but we can again say that the error essentially depends on $n$

and $P$. When evaluating methods based on simulated data, the choice of the distribution $P$ and of the sample size $n$ is thus crucial. A particular distribution and a particular sample size $n$ can be realistic in one considered area of application, but not realistic for another one. In practice, it is recommended that researchers consider distributions and sample sizes that are typical of the area of application they are aiming at. For example, a sample size of $n = 100$ and a distribution $P$ involving $p$ multivariate Gaussian covariates with block-diagonal structure, of which 20 are related to the response class $Y$ through a logistic model, may more or less reflect the reality of microarray gene expression data, but not of large epidemiological studies involving only a handful of covariates, or of imaging data with complex spatial structure. In practice, methodological researchers often develop methods addressing a particular type of distribution $P$ (sometimes combined with a particular sample size range). They correspondingly design a simulation study based on such a distribution and/or sample size. For instance, the authors comparing methods A and B would probably consider simulation settings with a diagonal covariance structure. While it is obviously more than recommended to evaluate the new classification method in the data setting it was designed for, it is also recommended to i) clearly state that the superiority of the new method is specific to the investigated distributions (here: diagonal covariance matrix), and ii) investigate other distributions as well to examine the robustness of the method to changes in the distribution (e.g. distributions with block-diagonal covariance matrix). In real life, the data do not stem from simple joint distributions like those commonly used in simulations. That is why real-life comparison studies are considered very important in computational science. If we refer again to the example of methods A and B, the success of classification method B in the real data study with microarray data then depends on three essential factors: i) whether the considered

data sets have a diagonal covariance structure, ii) how the new method performs if the structure is indeed diagonal, and iii) whether the method is robust against deviations from the diagonal covariance structure. Obviously, method B will comparatively perform better in the real data study if condition i) holds. And it will have more impact in the future if diagonal covariance structures often occur not only in the selected data sets but in the whole considered area of application. More generally, we can say that the data sets selected for the real data study should reflect the targeted area of application. For example, if one consciously selects microarray gene expression data with diagonal covariance structure, the real data analysis section should not be written as if the area of application were microarray gene expression data in general. Conversely, one could say that data sets should be selected randomly within the stated field of application. Most importantly, data sets should not be selected a posteriori based on the obtained results, because this procedure generates a substantial bias (Yousefi et al., 2010).

3. TESTING FRAMEWORK FOR REAL DATA STUDIES

Settings

Webb (2000) states that "it is debatable whether error rates in different domains [where the term domain here refers to the underlying joint distribution] are commensurable, and hence whether averaging error rates across domains is very meaningful. Nonetheless, a low average error rate is indicative of a tendency toward low error rates for individual domains." In this section, we try to address this issue in terms of statistical testing. More precisely, we consider the problem of testing the error difference between two methods $M_1$ and $M_2$ based on several real-life data sets. The first task we have to address is thus to properly define the testing problem at hand. Let us consider a given area of application. We independently and randomly

draw $J$ data sets belonging to this area. It is questionable whether the selection of data sets would really be random over the area in practice. For example, researchers may preferably look for data sets in their favorite database. This database is not necessarily representative of the whole area, for instance because it includes, say, many small data sets (the distribution of $n$ is then not the same as in the whole area) or more data sets for a particular disease (possibly implying specific forms of $P$). For simplicity, however, we assume in this paper that the data sets are randomly drawn from the considered area. These data sets are denoted as $D_1, \ldots, D_J$. We intentionally do not use the notation $S$ from the previous section to stress that the situation is now different: the data sets $D_1, \ldots, D_J$ are not drawn from the same distribution (as was the case for $s^{(1)}, \ldots, s^{(B)}$ in the previous section). Each data set $D_j$ is a realization of $P_j^{n_j}$, where $n_j$ is its size and $P_j$ is the distribution of the underlying population. As we assume that data sets are randomly drawn from the considered area, the distribution $P_j$ is the outcome of a random variable $\Phi_j: \Omega \to \mathcal{V}$, where $\mathcal{V}$ is the set of all possible distributions (in the area of application). Furthermore, the size $n_j$ of the data set $D_j$ is the outcome of a random variable $N_j: \Omega \to \mathbb{N}$. The random variables $(\Phi_1, N_1), \ldots, (\Phi_J, N_J)$ are i.i.d. but, of course, we only observe $N_j = n_j$ and cannot observe $\Phi_j = P_j$ for every $j \in \{1, \ldots, J\}$.

Test hypothesis

When randomly drawing a data set $D$, one thus implicitly draws simultaneously a distribution $P$ and $n$ realizations from this distribution. Both imply a certain variability. Importantly, $P$ can now be seen as the outcome of a random variable $\Phi$ and $n$ as the outcome of a random variable $N$. Note that $\Phi$ and $N$ are not necessarily independent. The test hypotheses of interest that are implicitly considered when comparing the performance of methods based on different real data sets can

be stated as

$$H_0^{(real)}: E(\varepsilon(N, M_2, \Phi)) - E(\varepsilon(N, M_1, \Phi)) \geq 0 \quad \text{vs.} \quad H_1^{(real)}: E(\varepsilon(N, M_2, \Phi)) - E(\varepsilon(N, M_1, \Phi)) < 0,$$

where $E$ here denotes the expectation over the random variables $\Phi$ and $N$. Accordingly, $\varepsilon(n, M_k, P)$ is now the outcome of the random variable $\varepsilon(N, M_k, \Phi)$, $k \in \{1, 2\}$. In comparison studies based on real data, researchers typically estimate the error for each data set using a resampling procedure such as, e.g., repeated splitting into training and test sets. Let $e(n, M_k, D)$ denote the error of method $M_k$ estimated for data set $D$ with the chosen resampling procedure. The estimated error $e(n, M_k, D)$ can be seen as an estimator of the unknown parameter $\varepsilon(n, M_k, P)$. In resampling procedures, however, the training data set used at each resampling iteration is smaller than $n$, hence leading to an error $e(n, M_k, D)$ larger than $\varepsilon(n, M_k, P)$ on average. Nevertheless, if we assume that this bias is equal for both considered methods $M_1$ and $M_2$, i.e. that

$$E_{P^n}(e(n, M_1, D)) - \varepsilon(n, M_1, P) = E_{P^n}(e(n, M_2, D)) - \varepsilon(n, M_2, P),$$

we have

$$\varepsilon(n, M_2, P) - \varepsilon(n, M_1, P) = E_{P^n}(e(n, M_2, D) - e(n, M_1, D)) \quad (2)$$

where $E_{P^n}$ denotes the expectation over the data set $D$ drawn i.i.d. from $P$. The data set $D$ can be seen as the outcome of a random variable $\mathcal{D}$ whose conditional distribution given $\Phi = P$ and $N = n$ is equal to $P^n$. Then, using the relation

$$E\big(E(e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D}) \mid \Phi, N)\big) = E(e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})), \quad (3)$$

we can formulate the null hypothesis $H_0^{(real)}$ as

$$H_0^{(real)}: E(e(N, M_2, \mathcal{D})) - E(e(N, M_1, \mathcal{D})) \geq 0 \quad \text{vs.} \quad H_1^{(real)}: E(e(N, M_2, \mathcal{D})) - E(e(N, M_1, \mathcal{D})) < 0.$$

This formulation is advantageous because we have access to independent and identically distributed realizations $e(n_j, M_2, D_j) - e(n_j, M_1, D_j)$ (with $j = 1, \ldots, J$) of $e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})$, thus yielding estimates of its mean and its variance. Note that the exact measure-theoretic formulation of (3) needs some more care because: (i) $\Phi$ is a random variable which takes its values in a set $\mathcal{V}$ of probability measures $P$, so that we need a suitable $\sigma$-algebra on $\mathcal{V}$; (ii) the size $n = N$ of the data set $D = \mathcal{D}$ is random, so that a naive formalization would yield that $\mathcal{D} \in (\mathcal{X} \times \mathcal{Y})^N$ where the dimension $N$ is random; and (iii) the existence of a suitable random variable $\mathcal{D}$ (whose conditional distribution given $\Phi = P$ and $N = n$ is equal to $P^n$) is not obvious. The mathematically correct derivation of the above testing problem which takes care of these technicalities is deferred to the appendix.

Let $\Delta e(n_j, D_j) = e(n_j, M_2, D_j) - e(n_j, M_1, D_j)$ (with $j = 1, \ldots, J$) be independent and identically distributed realizations of $e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})$ and define $\overline{\Delta e} = \frac{1}{J} \sum_{j=1}^{J} \Delta e(n_j, D_j)$. Under a normality assumption or for very large $J$ it is now possible to perform a paired-sample t-test to test $H_0^{(real)}$. The test statistic

$$T = \frac{\sqrt{J}\,\overline{\Delta e}}{\sqrt{\frac{1}{J-1} \sum_{j=1}^{J} \left(\Delta e(n_j, D_j) - \overline{\Delta e}\right)^2}}$$

follows a Student distribution with $J - 1$ degrees of freedom under $H_0^{(real)}$.
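In R, this test amounts to a one-sided one-sample t-test on the per-data-set error differences. The following minimal sketch uses a simulated stand-in for the matrix of estimated error rates; the object names and values are illustrative assumptions only.

```r
## Illustrative stand-in for a J x 2 matrix of resampling-based error estimates,
## one row per data set, columns M1 (reference) and M2 (new method).
set.seed(1)
J   <- 10
err <- cbind(M1 = runif(J, 0.15, 0.35),
             M2 = runif(J, 0.10, 0.30))

## Realizations of Delta e(n_j, D_j) = e(n_j, M2, D_j) - e(n_j, M1, D_j)
delta_e <- err[, "M2"] - err[, "M1"]

## One-sided matched-pair t-test of H0^(real): is the mean difference < 0?
t.test(delta_e, mu = 0, alternative = "less")

## Non-parametric counterpart: one-sided Wilcoxon signed-rank test
wilcox.test(delta_e, mu = 0, alternative = "less")
```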

This type of test is performed in many machine learning studies, where it is usual to apply new and existing methods to a large number of data sets (Demšar, 2006), for instance from databases especially designed for this purpose. Non-parametric tests such as the Wilcoxon signed-rank test are also commonly applied in this context (Demšar, 2006), including adjustment procedures for multiple comparisons or the use of global test statistics for the comparison of more than two methods (Garcia and Herrera, 2008; Garcia et al., 2010). When authors state in their abstracts that the new method performs better than existing methods on real data sets, they implicitly say that $H_0^{(real)}$ can be rejected, although most of them do not perform the test and give almost no attention to the theoretical background of these tests.

Decomposition of the variance

The variance term $\mathrm{Var}(\Delta e(N, \mathcal{D}))$ consists of two parts: firstly, the variance of $\Delta e(N, \mathcal{D})$ conditional on the distribution $\Phi = P$ and the sample size $N = n$; secondly, the variance of $\varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)$:

$$\mathrm{Var}(\Delta e(N, \mathcal{D})) = \mathrm{Var}\big(\varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)\big) + E\big(\mathrm{Var}(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big). \quad (4)$$

Essentially, this follows from the law of total variance, Equation (2), and the fact that the conditional distribution of $\mathcal{D}$ given $\Phi = P$ and $N = n$ is equal to $P^n$; the exact derivation needs some more care again and is deferred to the appendix. The literature on error estimation and comparison usually focuses on the second part of the variance in (4): several estimators have been proposed (Nadeau and Bengio, 2003). The first part of the variance is considered as a disturbing factor by many authors. Data sets that behave differently from the others are generally considered as cumbersome outliers and sometimes even excluded from the comparison study, possibly leading to a substantial bias (Yousefi et al., 2010). This part of the variance is also tightly related to the optimistic bias commonly observed in studies assessing new methods in comparison studies (Jelizarow et al., 2010). Researchers tend to overfit their new method to specific example data sets while developing it. Since the variance across data sets is high, a new method that has been optimized for these particular data sets is likely to perform much worse on other data sets.
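Written out, the derivation of Equation (4) is short: it combines the law of total variance with the equal-bias assumption behind Equation (2), under which $E(\Delta e(N, \mathcal{D}) \mid \Phi, N) = \varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)$:

$$\mathrm{Var}(\Delta e(N, \mathcal{D})) = \mathrm{Var}\big(E(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big) + E\big(\mathrm{Var}(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big)$$
$$= \mathrm{Var}\big(\varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)\big) + E\big(\mathrm{Var}(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big),$$

which is exactly (4).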

Power considerations

Considering the one-sided one-sample t-test outlined above, the number $J$ of data sets necessary to detect a given effect size $\Delta/\sigma$ at a certain power $1 - \beta$ can be derived from the formula

$$J \geq \left[t_{1-\alpha, J-1} + t_{1-\beta, J-1}\right]^2 \left(\frac{\sigma}{\Delta}\right)^2,$$

where $t_{\alpha, df}$ denotes the $\alpha$-quantile of the Student distribution with $df$ degrees of freedom, $\alpha$ denotes the type I error (typically $\alpha = 0.05$), $\beta$ denotes the type II error, $\Delta$ denotes the difference that we want to be able to detect, and $\sigma$ denotes the standard deviation (Bock, 1998). Conversely, for a given $J$, a given $\Delta$ and a given $\sigma$, one can also compute the power $1 - \beta$ of the test as

$$1 - \beta = \Phi_{t_{df=J-1}}\left(\sqrt{\frac{J \Delta^2}{\sigma^2}} - t_{1-\alpha, df=J-1}\right),$$

where $\Phi_{t_{df=J-1}}$ denotes the cumulative distribution function of the Student distribution with $J - 1$ degrees of freedom. It is common practice to use such formulas to derive the adequate sample size in various types of experiments including, e.g., animal trials or clinical trials. Such statistical planning, however, is never considered when performing benchmark experiments, even if the researchers eventually want to perform statistical tests. In this paper, we suggest also giving attention to such power considerations when planning or performing a comparison study involving several data sets. These issues are illustrated in the next section through an application to comparison studies of microarray-based classification methods.
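As a sketch of how these calculations can be carried out in R, the snippet below implements the approximate formulas given above and cross-checks them against base R's power.t.test (which uses the exact noncentral t-distribution, so results may differ slightly). The values of Delta and sigma are illustrative only.

```r
## Illustrative power / sample-size calculations for the one-sided one-sample
## t-test on per-data-set differences; Delta and sigma below are example values.
alpha <- 0.05
Delta <- 0.05    # difference in error rate to detect (illustrative)
sigma <- 0.075   # standard deviation of the differences (illustrative)

## Approximate power for a given number J of data sets, using the formula above
power_approx <- function(J, Delta, sigma, alpha = 0.05) {
  pt(sqrt(J * Delta^2 / sigma^2) - qt(1 - alpha, df = J - 1), df = J - 1)
}
power_approx(J = 10, Delta, sigma)

## Smallest J reaching 80% power, by direct search over the sample-size formula
J_needed <- function(Delta, sigma, alpha = 0.05, beta = 0.2) {
  for (J in 3:1000) {
    if (J >= (qt(1 - alpha, J - 1) + qt(1 - beta, J - 1))^2 * (sigma / Delta)^2)
      return(J)
  }
  NA
}
J_needed(Delta, sigma)

## Cross-check with base R's exact noncentral-t calculation
power.t.test(delta = Delta, sd = sigma, sig.level = alpha, power = 0.8,
             type = "one.sample", alternative = "one.sided")
```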

4. ILLUSTRATION: POWER OF PUBLISHED STUDIES

In this section we examine comparison studies from the literature on microarray-based supervised classification with respect to the issues discussed above. In particular, we compute the empirical variance of the difference between methods and estimate the power of the comparisons.

Considered studies

We selected comparison studies (i) whose aim was not to establish the superiority of a new method, hence warranting a certain level of neutrality (Boulesteix et al., 2008; Boulesteix and Eugster, 2012), (ii) focusing on diagnosis or prognosis based on high-dimensional gene expression microarray data, (iii) representing the estimated error rates in the form of a table (thus providing access to the exact figures) rather than in the form of graphics, and (iv) examining at least five data sets and at least five classification methods. Three of the considered comparison studies fulfilled these four criteria: the study by Lee et al. (2005) comparing six methods on six cancer data sets, the study by Lai et al. (2006) comparing 13 methods on seven cancer data sets, and the study by Statnikov et al. (2005) comparing eight methods on 11 cancer data sets.

Standard deviation

We first derived the standard deviation of each pairwise difference between methods for each study. There were $6 \times 5/2 = 15$ pairwise differences for the Lee study, $13 \times 12/2 = 78$ pairwise differences for the Lai study, and $8 \times 7/2 = 28$ pairwise differences for the Statnikov study. The boxplots representing the standard deviations of these pairwise differences are displayed in Figure 1. It can be seen from these boxplots that the range of the standard deviations is surprisingly similar for the three studies, except for a few more extreme values in the Statnikov study. Furthermore, the standard deviations are very different within the studies: some pairs of methods have a very low standard deviation (indicating that the data sets are similar with respect to the difference in performance), while others show large standard deviations.
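The computation behind Figure 1 can be sketched in a few lines of R; the matrix errs of estimated error rates (one row per data set, one column per method) is a hypothetical toy object used here only for illustration.

```r
## Standard deviation of the pairwise error differences across data sets.
## 'errs' is a toy stand-in for a data sets x methods matrix of error rates.
set.seed(1)
errs <- matrix(runif(6 * 5, 0.1, 0.4), nrow = 6,
               dimnames = list(NULL, paste0("method", 1:5)))

pairs   <- combn(colnames(errs), 2)                  # all method pairs
sd_diff <- apply(pairs, 2, function(pr) sd(errs[, pr[2]] - errs[, pr[1]]))
names(sd_diff) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
sd_diff                                              # one standard deviation per pair
boxplot(sd_diff)                                     # analogous display to Figure 1
```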

Figure 1: Estimated standard deviation of $\Delta e(n_j, D_j)$ in the three studies by Lee et al. (2005), Lai et al. (2006) and Statnikov et al. (2005).

Power considerations

Assuming values of the standard deviation $\sigma$ in the range of those observed in the three investigated studies, Figure 2 (left panel) represents the number $J$ of data sets required to detect a difference in error rate of $\Delta$ with a power of 80%. The right panel of Figure 2 displays the reached power against the number of data sets $J$ for $\Delta = 0.05$ and different values of the standard deviation $\sigma$ in the range of those observed in the three investigated studies. This figure suggests that most published comparison studies (either neutral comparison studies or comparison studies included in an original article) are substantially underpowered. In this perspective, we propose and recommend that researchers conducting comparison studies explicitly address such power issues in the design and/or interpretation of their benchmark experiments.

5. AN EXEMPLARY BENCHMARK STUDY

In order to explicitly illustrate these problems in real benchmark experiments, we perform an exemplary benchmark study based on $J = 50$ microarray data sets with binary response as prepared by de Souza et al. (2010).

Figure 2: Left: Number $J$ of data sets required to detect a difference $\Delta$ for different values of $\sigma$ (black: $\sigma = 0.03$, red: $\sigma = 0.05$, green: $\sigma = 0.075$, blue: $\sigma = 0.1$) at a power of 80%. Right: Power to detect a difference of $\Delta = 0.05$ for different values of $\sigma$ (black: $\sigma = 0.03$, red: $\sigma = 0.05$, green: $\sigma = 0.075$, blue: $\sigma = 0.1$).

These data sets and the corresponding R code are available from our companion website mitarbeiter/020_professuren/boulesteix/compstud2013. For each data set, repeated subsampling with 4/5 of the observations in the training sets and 300 resampling iterations is used for error estimation. The considered classification methods are: i) diagonal linear discriminant analysis (DLDA) without variable selection (DLDA-all), ii) DLDA with 500 selected variables (DLDA-500), iii) DLDA with 20 selected variables (DLDA-20), and iv) DLDA with 10 selected variables (DLDA-10). Variable selection is performed by selecting the variables yielding the smallest p-values when testing the equality of the means in the two groups $Y = 0/1$ with a classical t-test. All analyses are performed using the Bioconductor package CMA (Slawski et al., 2008).
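For illustration, the per-data-set error estimation can be sketched in plain R as follows. This is a generic re-implementation for exposition only, not the CMA-based code used for the actual study; the objects x (an n x p expression matrix) and y (a binary 0/1 response) are hypothetical placeholders.

```r
## Repeated subsampling (4/5 training) with DLDA after t-test based variable
## selection; generic sketch, not the CMA code used in the study.
dlda_fit <- function(x, y) {
  # class-wise means and pooled per-variable variances (diagonal covariance model)
  m0 <- colMeans(x[y == 0, , drop = FALSE])
  m1 <- colMeans(x[y == 1, , drop = FALSE])
  v  <- (apply(x[y == 0, , drop = FALSE], 2, var) * (sum(y == 0) - 1) +
         apply(x[y == 1, , drop = FALSE], 2, var) * (sum(y == 1) - 1)) / (length(y) - 2)
  list(m0 = m0, m1 = m1, v = v, prior1 = mean(y == 1))
}

dlda_predict <- function(fit, xnew) {
  # discriminant scores under a diagonal covariance matrix; assign the smaller one
  d0 <- rowSums(sweep(sweep(xnew, 2, fit$m0)^2, 2, fit$v, "/")) - 2 * log(1 - fit$prior1)
  d1 <- rowSums(sweep(sweep(xnew, 2, fit$m1)^2, 2, fit$v, "/")) - 2 * log(fit$prior1)
  as.numeric(d1 < d0)
}

select_vars <- function(x, y, nvar) {
  # variables with the smallest p-values of a classical two-sample t-test
  p <- apply(x, 2, function(g) t.test(g[y == 0], g[y == 1], var.equal = TRUE)$p.value)
  order(p)[seq_len(nvar)]
}

subsampling_error <- function(x, y, nvar = NULL, niter = 300, ratio = 4/5) {
  errs <- replicate(niter, {
    train <- sample(length(y), floor(ratio * length(y)))
    keep  <- if (is.null(nvar)) seq_len(ncol(x)) else select_vars(x[train, ], y[train], nvar)
    fit   <- dlda_fit(x[train, keep, drop = FALSE], y[train])
    mean(dlda_predict(fit, x[-train, keep, drop = FALSE]) != y[-train])
  })
  mean(errs)  # resampling-based estimate e(n, M_k, D) for this data set
}

## Usage on a toy data set (purely illustrative):
## set.seed(1); x <- matrix(rnorm(60 * 200), nrow = 60); y <- rbinom(60, 1, 0.5)
## c(DLDA_all = subsampling_error(x, y), DLDA_20 = subsampling_error(x, y, nvar = 20))
```

Applying such a function to each data set and each method would yield the per-data-set error estimates on which the tests of Section 3 operate.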

Figure 3 displays the boxplots of the estimated errors over the 50 data sets using the four considered classification methods (top) and the boxplots of the six pairwise differences (bottom), also showing their respective standard deviations. It can be clearly seen from this figure that the standard deviations of the differences vary a lot depending on the considered pair of methods and that their range is similar to the values observed in the comparison studies from the literature discussed in the previous section. The results of the power considerations presented in Figure 2 are thus also relevant to our exemplary benchmark study.

Figure 3: Top: Boxplots of the estimated errors over the 50 data sets using the four considered classification methods. Bottom: Boxplots of the six pairwise differences, also showing their respective standard deviations. The gray lines connect points corresponding to the same data set.

Table 1: Results of the one-sided matched-pair t-test and one-sided Wilcoxon signed-rank test for all six pairwise comparisons of differences in error rates over the 50 data sets when testing hypothesis $H_0^{(real)}$ vs. $H_1^{(real)}$ (columns: Comparison, Difference, t, p-value, W, p-value). For each comparison, the first method plays the role of $M_1$ and the second method plays the role of $M_2$.

The results of the one-sided matched-pair t-test and one-sided

Wilcoxon signed-rank test for the six considered pairs of methods are displayed in Table 1. Four of the six pairwise differences between error rates are statistically significant when tested at an $\alpha$-level of 0.05 using the parametric matched-pair t-test and the non-parametric Wilcoxon signed-rank test. The largest difference can be found between DLDA-all and DLDA-20, where the standard deviation is as large as 0.1. The results from Table 1 outline the importance of the standard deviation of the difference: the comparisons DLDA-all vs. DLDA-500 and DLDA-all vs. DLDA-10 yield almost equal mean differences (0.038 and 0.036, respectively), but the first comparison leads to a much lower p-value due to the smaller standard deviation of 0.06 (see Figure 3). Similarly, the mean difference for the comparison DLDA-10 vs. DLDA-20 is moderate (0.009) but yields small p-values due to the very small standard deviation. On the whole, these results again stress the importance of a sufficient number of data sets in comparison studies, especially in the context of the high-dimensional data investigated in our example. Differences between methods are often moderate and their standard deviations may be comparatively large, thus making comparison studies based on the usual number of, say, 5 to 10 data sets substantially underpowered.

6. CONCLUDING REMARKS

In this paper we proposed a statistical formulation and interpretation of hypothesis tests performed in the context of comparison studies comparing the performance of supervised classification methods based on several data sets with unknown underlying distributions. Although we focused on classification problems, the developed ideas may easily be extended to other problems through the use of a different loss function. In the light of this framework, we examined published comparison studies

assessing classification methods for high-dimensional microarray data. Considering the large variance of the difference in performance across the data sets, we found that a very large number of data sets would be necessary to reach an adequate power, i.e. to have a large probability of detecting relevant differences in error rates as statistically significant in the testing framework. We conclude that most published comparison studies (either neutral or part of an original article) are substantially underpowered. As an outlook, we point to the parallels that can be drawn between comparison studies in the context of supervised learning and experiments from application fields of statistics (e.g. biomedicine). Roughly speaking, a comparison study comparing supervised learning methods shows some similarities with a clinical trial. For example, it might be helpful to first perform a pilot study with a limited number of data sets to provide a first rough estimate of the variance in order to evaluate the necessary number of data sets. Subgroup analyses may also make sense, e.g. focusing on data sets of particular sizes, with particular numbers of predictors, etc. The same rules should be observed as when performing subgroup analyses in the biomedical sciences: relevant subgroups should be defined prior to the analysis and all the results should be reported in order to avoid fishing for significance. Alternatively, it may make sense to model $\Delta e$ as a dependent variable with data set characteristics as independent variables (for instance size, number of predictors, ratio between size and number of predictors, signal strength, balance between the classes, etc.). This would be an alternative to the previously suggested Bradley-Terry model approach (Eugster et al., 2010). Going one step further, meta-analyses of comparison studies would also be conceivable. To conclude, we believe that statisticians working on methodological research projects such as the development of new supervised learning methods (including ourselves) should probably take more care of the rules that they themselves (rightly) impose on their statistical consulting clients: take care of sample size and power issues, pay attention to the underlying hypothesis when performing a test, and do not

over-interpret results that are not statistically significant.

References

Billingsley, P. (1986). Probability and Measure, 2nd Edition. John Wiley & Sons, New York.

Bock, J. (1998). Bestimmung des Stichprobenumfangs. München/Wien: Oldenbourg.

Bouckaert, R. (2003). Choosing between two learning algorithms based on calibrated tests. In: Machine Learning, International Workshop then Conference, Vol. 20, p. 51.

Boulesteix, A., Eugster, M. (2012). A plea for neutral comparison studies in computational sciences. Technical report.

Boulesteix, A., Strobl, C., Augustin, T., Daumer, M. (2008). Evaluating microarray-based classifiers: an overview. Cancer Informatics 6.

Braga-Neto, U., Dougherty, E. (2005). Exact performance of error estimators for discrete classifiers. Pattern Recognition 38.

Dalton, L., Dougherty, E. (2012a). Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error, Part I: Representation. IEEE Transactions on Signal Processing 60.

Dalton, L., Dougherty, E. (2012b). Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error, Part II: Consistency and performance analysis. IEEE Transactions on Signal Processing 60.

de Souza, B., de Carvalho, A., Soares, C. (2010). A comprehensive comparison of ML algorithms for gene expression data classification. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7.

Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10.

Dougherty, E., Zollanvari, A., Braga-Neto, U. (2011). The illusion of distribution-free small-sample classification in genomics. Current Genomics 12.

Eugster, M., Hothorn, T., Leisch, F. (2012). Domain-based benchmark experiments: exploratory and inferential analysis. Austrian Journal of Statistics 41.

Eugster, M., Leisch, F., Strobl, C. (2010). (Psycho-)analysis of benchmark experiments. Technical Report 78, Department of Statistics, LMU.

Garcia, S., Fernandez, A., Luengo, J., Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10).

Garcia, S., Herrera, F. (2008). An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research 9.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.

Hanczar, B., Dougherty, E. (2010). On the comparison of classifiers for microarray data. Current Bioinformatics 5.

Hoeffding, W., Wolfowitz, J. (1958). Distinguishability of sets of distributions (the case of independent and identically distributed chance variables). Annals of Mathematical Statistics 29.

Hothorn, T., Leisch, F., Zeileis, A., Hornik, K. (2005). The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics 14.

Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K., Boulesteix, A. (2010). Over-optimism in bioinformatics: an illustration. Bioinformatics 26.

Lai, C., Reinders, M., Van't Veer, L., Wessels, L. (2006). A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinformatics 7, 235.

Lee, J., Lee, J., Park, M., Song, S. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48.

Nadeau, C., Bengio, Y. (2003). Inference for the generalization error. Machine Learning 52.

Slawski, M., Daumer, M., Boulesteix, A. (2008). CMA: a comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9, 439.

Statnikov, A., Aliferis, C., Tsamardinos, I., Hardin, D., Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21.

Van't Veer, L., Dai, H., Van De Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., Van Der Kooy, K., Marton, M., Witteveen, A., et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415.

Webb, G. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning 40.

Wolpert, D. (2001). The supervised learning no-free-lunch theorems. In: Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications.

Yousefi, M., Hua, J., Sima, C., Dougherty, E. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26.

Zollanvari, A., Braga-Neto, U., Dougherty, E. (2009). On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recognition 42.

Zollanvari, A., Braga-Neto, U., Dougherty, E. (2011). Analytic study of performance of error estimators for linear discriminant analysis. IEEE Transactions on Signal Processing 59.

APPENDIX

The appendix contains some technicalities which are needed for a mathematically rigorous formulation of the testing problem in Section 3. In particular, it is shown that there are suitable random variables $N$ and $\mathcal{D}$ such that $e(n, M_2, D) - e(n, M_1, D)$ can be seen as the observed realization of the random variable $e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})$, where also the distribution $P$ (which generates the data set $D$) is randomly chosen. This is crucial in order to apply the central limit theorem; otherwise, using the t-test to test $H_0^{(real)}$ in Section 3 would not be justified.

Preliminaries

Let $\big((\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}}\big)$ be the countably-infinite product space of the measurable space $(\mathcal{X} \times \mathcal{Y}, \mathcal{B})$. Let $\mathcal{V}$ be the set of all possible distributions (in the area of application), endowed with the Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{V}}$ with respect to the total variation norm. We arbitrarily fix any $x_0 \in \mathcal{X}$ and $y_0 \in \mathcal{Y}$ and define the function

$$\gamma: (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}} \times \mathbb{N} \to (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}, \quad (D, n) \mapsto \gamma(D, n) =: D^{(n)}$$

by

$$D^{(n)} = \big((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), (x_0, y_0), (x_0, y_0), \ldots\big) \in (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}$$

for $D = \big((x_1, y_1), (x_2, y_2), \ldots\big) \in (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}$ and $n \in \mathbb{N}$. Note that $\gamma$ is measurable with respect to $\mathcal{B}^{\mathbb{N}} \otimes 2^{\mathbb{N}}$ and $\mathcal{B}^{\mathbb{N}}$. Similarly, define $P^{(n)} = P^n \otimes \delta_{(x_0, y_0)} \otimes \delta_{(x_0, y_0)} \otimes \cdots$ on $\big((\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}}\big)$ for every $P \in \mathcal{V}$.


More information

CS 543 Page 1 John E. Boon, Jr.

CS 543 Page 1 John E. Boon, Jr. CS 543 Machine Learning Spring 2010 Lecture 05 Evaluating Hypotheses I. Overview A. Given observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Sequential/Adaptive Benchmarking

Sequential/Adaptive Benchmarking Sequential/Adaptive Benchmarking Manuel J. A. Eugster Institut für Statistik Ludwig-Maximiliams-Universität München Validation in Statistics and Machine Learning, 2010 1 / 22 Benchmark experiments Data

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Why uncertainty? Why should data mining care about uncertainty? We

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Structural Uncertainty in Health Economic Decision Models

Structural Uncertainty in Health Economic Decision Models Structural Uncertainty in Health Economic Decision Models Mark Strong 1, Hazel Pilgrim 1, Jeremy Oakley 2, Jim Chilcott 1 December 2009 1. School of Health and Related Research, University of Sheffield,

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

The Design and Analysis of Benchmark Experiments Part II: Analysis

The Design and Analysis of Benchmark Experiments Part II: Analysis The Design and Analysis of Benchmark Experiments Part II: Analysis Torsten Hothorn Achim Zeileis Friedrich Leisch Kurt Hornik Friedrich Alexander Universität Erlangen Nürnberg http://www.imbe.med.uni-erlangen.de/~hothorn/

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

A STUDY OF PRE-VALIDATION

A STUDY OF PRE-VALIDATION A STUDY OF PRE-VALIDATION Holger Höfling Robert Tibshirani July 3, 2007 Abstract Pre-validation is a useful technique for the analysis of microarray and other high dimensional data. It allows one to derive

More information

Algorithms for Classification: The Basic Methods

Algorithms for Classification: The Basic Methods Algorithms for Classification: The Basic Methods Outline Simplicity first: 1R Naïve Bayes 2 Classification Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.

More information

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1 CptS 570 Machine Learning School of EECS Washington State University CptS 570 - Machine Learning 1 IEEE Expert, October 1996 CptS 570 - Machine Learning 2 Given sample S from all possible examples D Learner

More information

An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance

An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance Dhaka Univ. J. Sci. 61(1): 81-85, 2013 (January) An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance A. H. Sajib, A. Z. M. Shafiullah 1 and A. H. Sumon Department of Statistics,

More information

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS

DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS THUY ANH NGO 1. Introduction Statistics are easily come across in our daily life. Statements such as the average

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Data Warehousing & Data Mining

Data Warehousing & Data Mining 13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Smart Home Health Analytics Information Systems University of Maryland Baltimore County

Smart Home Health Analytics Information Systems University of Maryland Baltimore County Smart Home Health Analytics Information Systems University of Maryland Baltimore County 1 IEEE Expert, October 1996 2 Given sample S from all possible examples D Learner L learns hypothesis h based on

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

How Much Evidence Should One Collect?

How Much Evidence Should One Collect? How Much Evidence Should One Collect? Remco Heesen October 10, 2013 Abstract This paper focuses on the question how much evidence one should collect before deciding on the truth-value of a proposition.

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

NEAREST NEIGHBOR CLASSIFICATION WITH IMPROVED WEIGHTED DISSIMILARITY MEASURE

NEAREST NEIGHBOR CLASSIFICATION WITH IMPROVED WEIGHTED DISSIMILARITY MEASURE THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Series A, OF THE ROMANIAN ACADEMY Volume 0, Number /009, pp. 000 000 NEAREST NEIGHBOR CLASSIFICATION WITH IMPROVED WEIGHTED DISSIMILARITY MEASURE

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Reading for Lecture 6 Release v10

Reading for Lecture 6 Release v10 Reading for Lecture 6 Release v10 Christopher Lee October 11, 2011 Contents 1 The Basics ii 1.1 What is a Hypothesis Test?........................................ ii Example..................................................

More information

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Algorithm Independent Topics Lecture 6

Algorithm Independent Topics Lecture 6 Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an

More information

Research Article Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

Research Article Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 7, Article ID, pages doi:./7/ Research Article Quantification of the Impact of Feature Selection on the Variance

More information

SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM)

SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM) SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM) SEM is a family of statistical techniques which builds upon multiple regression,

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Diagnostics. Gad Kimmel

Diagnostics. Gad Kimmel Diagnostics Gad Kimmel Outline Introduction. Bootstrap method. Cross validation. ROC plot. Introduction Motivation Estimating properties of an estimator. Given data samples say the average. x 1, x 2,...,

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information