A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies


Anne-Laure Boulesteix, Robert Hable, Sabine Lauer, Manuel Eugster

A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Technical Report Number 136, 2013
Department of Statistics, University of Munich

A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Anne-Laure Boulesteix (1), Robert Hable (2,3), Sabine Lauer (1), Manuel J. A. Eugster (2,4)

(1) Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Marchioninistr. 15, Munich, Germany. boulesteix@ibe.med.uni-muenchen.de
(2) Department of Statistics, University of Munich (LMU), Munich, Germany
(3) Chair of Stochastics, Department of Mathematics, University of Bayreuth, Bayreuth, Germany
(4) Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland

Anne-Laure Boulesteix is a statistician and biometrician, Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Munich, Germany. Robert Hable is a mathematician and statistician, Department of Mathematics, University of Bayreuth, Bayreuth, Germany. Sabine Lauer is a statistician and sociologist, Department of Medical Informatics, Biometry and Epidemiology, University of Munich (LMU), Munich, Germany. Manuel Eugster is a computer scientist and statistician, Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland.

ABSTRACT

In computational sciences, including computational statistics, machine learning, and bioinformatics, most abstracts of articles presenting new supervised learning methods end with a sentence like "our method performed better than existing methods on real data sets", e.g. in terms of error rate. However, these claims are often not based on proper statistical tests and, if such tests are performed (as is usual in the machine learning literature), the tested hypothesis is not clearly defined and little attention is devoted to the type I and type II errors. In the present paper we aim to fill this gap by providing a proper statistical framework for hypothesis tests that compare the performance of supervised learning methods based on several real data sets with unknown underlying distributions. After giving a statistical interpretation of ad-hoc tests commonly performed by machine learning scientists, we devote special attention to power issues and suggest a simple method to determine the number of data sets to be included in a comparison study to reach an adequate power. These methods are illustrated through three comparison studies from the literature and an exemplary benchmarking study using gene expression microarray data. All our results can be reproduced using R code and data sets available from the companion website de/organisation/mitarbeiter/020_professuren/boulesteix/compstud2013.

KEYWORDS: Benchmarking, supervised learning, comparison, testing, performance

1. INTRODUCTION

Almost all machine learning or computational statistics articles on supervised learning methods include a more or less extensive comparison study based on real data sets of moderate size assessing the respective performance (typically in terms of prediction error) of a few algorithms, either new or already described in the previous literature. These comparisons are often termed benchmark experiments (e.g. Hothorn et al., 2005) in the statistical and machine learning communities. However, the statistical foundations of these comparisons usually receive surprisingly little attention in concrete comparison studies, although they are addressed in a large body of methodological literature. For simplicity we will from now on talk about the comparison of two supervised classification methods, but all the ideas discussed here are also relevant to the comparison of more than two methods or to other supervised learning problems. These two methods are used to derive a classification rule from an available training sample. In a seminal paper, Dietterich (1998) proposes a general taxonomy of the problems related to performance evaluation and comparison in supervised learning. The comparison of two methods for a particular distribution in the absence of a large sample is termed "Question 8", while "Questions 1 to 7" consider error estimation (as opposed to comparison of methods), situations with large samples, and/or the assessment of specific classification rules (as opposed to the problem considered here of the assessment of the methods used to derive classification rules). The paper considers a few simple testing procedures based on resampling-based estimates of the prediction error and compares them empirically via simulations in terms of type I error and power. The conclusion is that none of these procedures is completely satisfactory. The essential problem is that they do not use adequate estimates of the unconditional variance of the error estimates. Estimating the unconditional variance based on the empirical variance over resampling iterations implies a violation of independence assumptions and thus a reduction of the effective degrees of freedom

(Bouckaert, 2003). Nadeau and Bengio (2003) provide an overview of estimators of the unconditional variance of resampling-based error estimates used in the machine learning community and suggest two additional estimators that can be applied to error estimators obtained through repeated splitting into training and test data. They present benchmark experiments as a natural application of their new estimators and stress that naive estimators of the variance (e.g. the empirical variance of the estimated error over resampling iterations) often lead to false positives in the sense that researchers see a difference in performance between the considered methods although there is no such difference. Hothorn et al. (2005) give a statistical interpretation of these issues and recommend estimating the variance of resampling-based error estimates through bootstrapping. Dietterich (1998) also mentions a further question, termed "Question 9" in his taxonomy and referring to several domains, where the term domain here denotes a data set with its own underlying distribution: "Question 9: Given two learning algorithms A and B and data sets from several domains, which algorithm will produce more accurate classifiers when trained on examples from new domains? This is perhaps the most fundamental and difficult question in machine learning." Indeed, almost all comparison studies based on real data consider not one but several data sets, thus implicitly involving several different underlying distributions. While it is usual to perform hypothesis tests to compare the performance of different methods in the context of benchmarking studies (Demšar, 2006), the literature on the statistical interpretation of such tests based on multiple data sets is surprisingly sparse. One of the few articles on this topic suggests using a mixed-model approach with the data sets as subjects with random intercepts and the methods as fixed effects (Eugster et al., 2012).

In this paper, our aim is to give a statistical formulation of tests performed in the context of comparison studies and to correspondingly interpret the results of published comparison studies, with a focus on classification based on high-dimensional gene expression data as an illustration. The paper is structured as follows. In Section 2 we outline the epistemological background of comparisons of prediction errors from a statistical perspective, contrasting with the machine learning perspective taken by most papers handling this topic. Section 3 is especially devoted to tests in the context of comparison studies based on several real data sets and gives an original formulation of these tests in a strict statistical framework. In particular, Section 3 suggests a power calculation approach in this context. Section 4 presents an application of these methods to three comparison studies of classification methods for microarray data from the literature, while Section 5 describes an exemplary benchmark study based on 50 data sets. The appendix contains some technicalities which are needed for the mathematically rigorous formulation of the testing problem given in Section 3.

2. EPISTEMOLOGICAL BACKGROUND

Settings

From a statistical point of view, binary supervised classification can be described in the following way. On the one hand, we have a response variable taking values in $\mathcal{Y} = \{0, 1\}$. On the other hand, we have predictors taking values in $\mathcal{X} \subseteq \mathbb{R}^p$ that will be used for constructing a classification rule. The predictors $X$ and the response $Y$ follow an unknown joint distribution on $\mathcal{X} \times \mathcal{Y}$ denoted by $P$. The observed i.i.d. sample of size $n$ is denoted by $s_0 = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The classification task consists in building a decision function $\hat{f}$ that maps elements of the predictor space $\mathcal{X}$ into the

response space $\mathcal{Y}$:

$$\hat{f}^{s_0}: \mathcal{X} \to \mathcal{Y}, \quad x \mapsto \hat{f}^{s_0}(x),$$

where the superscript $s_0$ indicates that the decision function is built using the sample $s_0$. For simplicity we assume that the classification method is deterministic, i.e. that $\hat{f}^{s_0}$ is uniquely defined given the sample $s_0$. There are many possible methods to fit a function $\hat{f}^{s_0}$ based on a sample $s_0$, which we denote as $M_1, \ldots, M_K$. The decision function obtained by fitting method $M_k$ to the sample $s_0$ is denoted as $\hat{f}^{s_0}_{M_k}$. The true error of this classification rule $\hat{f}^{s_0}_{M_k}$ can be written as

$$\varepsilon(\hat{f}^{s_0}_{M_k}, P) = E_P\left[L\left(\hat{f}^{s_0}_{M_k}(X), Y\right)\right] \quad (1)$$

where $E_P$ stands for the mean over the joint distribution $P$ of $X$ and $Y$, and $L(\cdot,\cdot)$ is an adequate loss function, e.g. the indicator loss yielding the classification error rate considered in this paper. The true error $\varepsilon(\hat{f}^{s_0}_{M_k}, P)$ of method $M_k$ constructed using sample $s_0$ is commonly referred to as the conditional error since it corresponds to the decision function constructed on the specific sample $s_0$. The notation $\varepsilon(\hat{f}^{s_0}_{M_k}, P)$ stresses that this error depends on the distribution $P$ as well as on the method $M_k$ and the sample $s_0$ used to fit the classification rule. In this perspective, the error $\varepsilon(\hat{f}^{S}_{M_k}, P)$ (corresponding to Eq. (1) with $s_0$ replaced by $S$) should be seen as a random variable, where $S$ stands for a random i.i.d. sample that follows the distribution $P^n$. The mean of this random variable over $P^n$ is commonly referred to as the unconditional true error rate of method $M_k$. In this paper it is denoted as $\varepsilon(n, M_k, P) = E_{P^n}[\varepsilon(\hat{f}^{S}_{M_k}, P)]$. It depends only on the method $M_k$, on the size $n$ of the sample, and on the joint

distribution $P$ of $X$ and $Y$. Note that the joint distribution $P$ is involved twice in this formula.

Method 2 is better than method 1: what does this mean?

Now suppose that we are interested in the relative performance of two methods $M_1$ and $M_2$. What does it mean when we ask whether method $M_2$ is better than method $M_1$ or vice-versa? An applicant (for instance a biologist) who collected a specific data set $s_0$ in his lab is primarily interested in whether the classification rule $\hat{f}^{s_0}_{M_2}$ fitted with method $M_2$ has a smaller error on future independent data than the classification rule $\hat{f}^{s_0}_{M_1}$ fitted with method $M_1$ or vice-versa, i.e. whether $\varepsilon(\hat{f}^{s_0}_{M_2}, P) < \varepsilon(\hat{f}^{s_0}_{M_1}, P)$. We define the null and alternative hypotheses of the applicant correspondingly as

$$H_0^{(cond)}: \varepsilon(\hat{f}^{s_0}_{M_2}, P) - \varepsilon(\hat{f}^{s_0}_{M_1}, P) \geq 0 \quad \text{vs.} \quad H_1^{(cond)}: \varepsilon(\hat{f}^{s_0}_{M_2}, P) - \varepsilon(\hat{f}^{s_0}_{M_1}, P) < 0.$$

These hypotheses can be seen as conditional (hence the superscript $(cond)$) in the sense that they are conditional on a fixed sample $s_0$. In contrast, statisticians or machine learners doing methodological research are not primarily interested in the performance of the classification rule fitted on a specific sample $s_0$ but rather in the mean performance over different samples. In mathematical terms, they are interested in the comparison of the unconditional errors $\varepsilon(n, M_1, P)$ and $\varepsilon(n, M_2, P)$, yielding the corresponding hypotheses:

$$H_0^{(uncond)}: \varepsilon(n, M_2, P) - \varepsilon(n, M_1, P) \geq 0 \quad \text{vs.} \quad H_1^{(uncond)}: \varepsilon(n, M_2, P) - \varepsilon(n, M_1, P) < 0,$$

with the superscript $(uncond)$ standing for unconditional. When methodological researchers write that method $M_2$ is better than the standard method $M_1$, they

implicitly mean that the unconditional error is smaller for $M_2$ than for $M_1$ and that $H_0^{(uncond)}$ can be rejected. In practical data analysis, when the distribution $P$ is unknown, it is not easy to test $H_0^{(uncond)}$. A natural estimator of the difference $\varepsilon(n, M_2, P) - \varepsilon(n, M_1, P)$ is the difference between resampling-based error estimates obtained with the two considered methods, e.g. cross-validation error estimates. The problem is that the true unconditional variance of this difference under $H_0^{(uncond)}$ is unknown and difficult to estimate. Many naive or more complex estimates of this variance have been considered in the literature (Dietterich, 1998; Nadeau and Bengio, 2003; Hanczar and Dougherty, 2010), but they rely on uncertain assumptions that are most often not met in practical cases. The essential problem is that they are all based on the available sample $s_0$, while their target is actually the variance over different samples drawn from $P^n$. Hence, they are all conditional in some way. To date, there exists no widely accepted test for testing $H_0^{(uncond)}$ based on a real data set with unknown underlying distribution.

Epistemological background: theory and simulations

For a given distribution $P$ and a specific $n$, however, it is easy to empirically approximate the unconditional error $\varepsilon(n, M_1, P)$ of a given classification method $M_1$ via simulations. A straightforward procedure is as follows (a code sketch is given at the end of this subsection):

1. Randomly draw a huge number $n_{test}$ of independent realizations of $P$, yielding a so-called test sample $s^{(T)} = \{(x^{(T)}_1, y^{(T)}_1), \ldots, (x^{(T)}_{n_{test}}, y^{(T)}_{n_{test}})\}$. For instance, a very large value of $n_{test}$ may be appropriate.

2. For $b = 1, \ldots, B$, with $B$ large (typically $B \geq 1000$):

2.a. Draw $n$ i.i.d. realizations from the distribution $P$, yielding the training sample $s^{(b)} = ((x^{(b)}_1, y^{(b)}_1), \ldots, (x^{(b)}_n, y^{(b)}_n))$.

2.b. Fit method $M_1$ to $s^{(b)}$, yielding the classification rule $\hat{f}^{s^{(b)}}_{M_1}$.

2.c. Estimate the true error of $\hat{f}^{s^{(b)}}_{M_1}$ based on the test sample drawn in step 1 as the proportion of misclassified realizations in the test sample:

$$\hat{\varepsilon}(\hat{f}^{s^{(b)}}_{M_1}, P) = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} L\left(\hat{f}^{s^{(b)}}_{M_1}(x^{(T)}_i), y^{(T)}_i\right).$$

3. Estimate the true unconditional error $\varepsilon(n, M_1, P)$ as

$$\hat{\varepsilon}(n, M_1, P) = \frac{1}{B} \sum_{b=1}^{B} \hat{\varepsilon}(\hat{f}^{s^{(b)}}_{M_1}, P).$$

Note that this approach involves two embedded approximation procedures: approximation of the true error of $\hat{f}^{s^{(b)}}_{M_1}$ using a huge test sample $s^{(T)}$, and approximation of the unconditional error over $P^n$ by averaging over a large number $B$ of random training samples $s^{(b)}$. Note that in specific cases, the unconditional error might even be derived analytically. No matter whether one derives this error analytically or via simulations, two parameters have to be chosen: the sample size $n$ and the joint distribution $P$. Except for trivial examples (e.g. an algorithm that randomly generates a rule $f$ without looking at the data), it is not to be expected that the new method $M_2$ performs better than the standard method $M_1$ for all sample sizes and all imaginable joint distributions: the so-called "no free lunch" theorem (Wolpert, 2001).
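To make the procedure concrete, here is a minimal R sketch of steps 1-3 under an artificial data-generating process. The functions draw_from_P, fit_M1 and predict_M1, the stand-in classifier (logistic regression), and all numerical settings are illustrative assumptions, not part of the original study.

```r
## Minimal sketch of the Monte Carlo approximation of the unconditional error
## epsilon(n, M1, P). The data-generating process and classifier are placeholders.
set.seed(1)

## Hypothetical stand-in for the unknown P: two Gaussian classes in p = 5 dimensions.
draw_from_P <- function(m, p = 5, delta = 0.5) {
  y <- rbinom(m, 1, 0.5)
  x <- matrix(rnorm(m * p), nrow = m) + delta * y   # shift class-1 observations
  list(x = x, y = y)
}

## Method M1: logistic regression as an arbitrary stand-in classifier.
fit_M1     <- function(train) glm(y ~ ., data = data.frame(train$x, y = train$y),
                                  family = binomial)
predict_M1 <- function(fit, x) as.numeric(predict(fit, data.frame(x),
                                                  type = "response") > 0.5)

n      <- 50       # training sample size of interest
n_test <- 10000    # size of the huge test sample (step 1); illustrative value
B      <- 1000     # number of training samples (step 2)

test  <- draw_from_P(n_test)                       # step 1
err_b <- replicate(B, {                            # steps 2.a-2.c
  train <- draw_from_P(n)
  fit   <- fit_M1(train)
  mean(predict_M1(fit, test$x) != test$y)          # indicator loss on the test sample
})
eps_hat <- mean(err_b)                             # step 3: estimate of epsilon(n, M1, P)
```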

Epistemological background: real data study

The essential limitation of simulations and analytical results is that the chosen distribution $P$ often does not reflect the complexity of real-life distributions. Therefore, performance estimation on real data is usually considered as extremely important. In essence, the goal of such studies is to evaluate the considered classification methods $M_1$ and $M_2$ on real-life distributions $P$, i.e. to compare $\varepsilon(n, M_k, P)$ for $k = 1, 2$. There are, however, two important problems related to real data studies.

The first problem ("variability of error estimation") is related to the fact that for a specific data set the underlying distribution $P$ is essentially unknown in practice. It is thus impossible to derive the prediction error analytically. If a huge sample is given, we can obtain a good approximation by drawing numerous non-overlapping samples of the considered size $n$ out of the huge sample, and estimating the error on the rest of the sample. However, in most practical situations no huge samples are given and resampling methods are then used to address this estimation issue. Such methods, however, are known to poorly estimate the unconditional error because they suffer from a very high variance and/or a high bias, resulting in a large mean squared error (Braga-Neto and Dougherty, 2005; Zollanvari et al., 2009; Dougherty et al., 2011; Zollanvari et al., 2011; Dalton and Dougherty, 2012a,b). The second problem ("variability across data sets") is that each real-life data set follows its own distribution. In the example of microarray gene expression data, the leukemia data set of size $n = 38$ by Golub et al. (1999) follows a certain distribution $P_1$ while the breast cancer data set of size $n = 76$ by Van't Veer et al. (2002) follows another distribution $P_2$. Even if we had a good estimate of the unconditional error $\varepsilon(n = 38, M_1, P_1)$ for method $M_1$, it would tell us nothing about $\varepsilon(n = 38, M_1, P_2)$ and $\varepsilon(n = 76, M_1, P_2)$, unless we assume that $P_2$ is somehow similar to $P_1$, an assumption that may make sense in some exceptional cases, but not in general. When researchers perform a real data study based on several data sets, they implicitly aim to capture the variability across data sets, i.e. across distributions. A problem related to the variability across data sets is the definition of the targeted area of application. It can be defined in a very general way without restrictions, or the authors may choose to focus on a very particular area of application with restrictions regarding the structure of the data and/or the substantive context of the data sets. No matter how the area of application is defined and how example data sets are selected, the two sources of variability (variability of error estimation and variability across data sets) imply a high variance. The variance of error estimation

can be addressed by using an estimation method known to have smaller variance: for instance, we know that repeated subsampling with a training/test splitting ratio of, say, 2:1 has a smaller (unconditional) variance than leave-one-out cross-validation (LOOCV). However, the variance remains high even for the less variable methods. As to the variability across data sets, it can only be addressed by increasing the number of candidate data sets. Therefore, after motivating the issues with an illustrative exemplary scenario, we formally address the problem of testing the difference between the unconditional errors of two methods $M_1$ and $M_2$ in a real data study with several data sets and suggest a statistical framework, including power considerations.

Examples

Suppose that authors want to compare two simple statistical procedures for supervised classification based on high-dimensional microarray data to be used after a variable selection step: linear discriminant analysis (method A, considered as the reference), and diagonal linear discriminant analysis (method B, considered as a new method suggested by the authors). Method B is expected to work well if the covariance matrix is indeed diagonal (which depends on $P$). Method A may have problems if the sample size $n$ is not large compared to the number of predictors (which also depends on $P$ via the dimension of $\mathcal{X}$), because the estimated covariance matrix is then ill-conditioned and its inversion is problematic. But it may work well if the true model implied by the distribution $P$ involves correlations between the variables and if $n$ is large enough. Quite generally, for learning algorithms based on the estimation of the parameters of a stochastic model, the closer the assumed model is to the true distribution $P$, the better the prediction accuracy in infinite-sample settings. Things are more complicated for learning algorithms that are not based on an underlying stochastic model and in finite-sample settings, but we can again say that the error essentially depends on $n$

and $P$. When evaluating methods based on simulated data, the choice of the distribution $P$ and of the sample size $n$ is thus crucial. A particular distribution and a particular sample size $n$ can be realistic in one considered area of application, but not realistic for another one. In practice, it is recommended that researchers consider distributions and sample sizes that are typical of the area of application they are aiming at. For example, a sample size of $n = 100$ and a distribution $P$ involving $p$ multivariate Gaussian covariates with block-diagonal structure, of which 20 are related to the response class $Y$ through a logistic model, may more or less reflect the reality of microarray gene expression data, but not of large epidemiological studies involving only a handful of covariates, or of imaging data with complex spatial structure. In practice, methodological researchers often develop methods addressing a particular type of distribution $P$ (sometimes combined with a particular sample size range). They correspondingly design a simulation study based on such a distribution and/or sample size. For instance, the authors comparing methods A and B would probably consider simulation settings with a diagonal covariance structure. While it is obviously more than recommended to evaluate the new classification method in the data setting it was designed for, it is also recommended to i) clearly state that the superiority of the new method is specific to the investigated distributions (here: diagonal covariance matrix), and ii) investigate other distributions as well to examine the robustness of the method to changes in the distribution (e.g. distributions with block-diagonal covariance matrix). In real life, the data do not stem from simple joint distributions like those commonly used in simulations. That is why real-life comparison studies are considered very important in computational science. If we refer again to the example of methods A and B, the success of classification method B in the real data study with microarray data then depends on three essential factors: i) whether the considered

data sets have a diagonal covariance structure, ii) how the new method performs if the structure is indeed diagonal, and iii) whether the method is robust against deviations from the diagonal covariance structure. Obviously, method B will comparatively perform better in the real data study if condition i) holds. And it will have more impact in the future if diagonal covariance structures often occur not only in the selected data sets but in the whole considered area of application. More generally, we can say that the data sets selected for the real data study should reflect the targeted area of application. For example, if one consciously selects microarray gene expression data with diagonal covariance structure, the real data analysis section should not be written as if the area of application were microarray gene expression data in general. Conversely, one could say that data sets should be selected randomly within the stated field of application. Most importantly, data sets should not be selected a posteriori based on the obtained results, because this procedure generates a substantial bias (Yousefi et al., 2010).

3. TESTING FRAMEWORK FOR REAL DATA STUDIES

Settings

Webb (2000) states that "it is debatable whether error rates in different domains [where the term domain here refers to the underlying joint distribution] are commensurable, and hence whether averaging error rates across domains is very meaningful. Nonetheless, a low average error rate is indicative of a tendency toward low error rates for individual domains." In this section, we try to address this issue in terms of statistical testing. More precisely, we consider the problem of testing the error difference between two methods $M_1$ and $M_2$ based on several real-life data sets. The first task we have to address is thus to properly define the testing problem at hand. Let us consider a given area of application. We independently and randomly

draw $J$ data sets belonging to this area. It is questionable whether the selection of data sets would really be random over the area in practice. For example, researchers may preferably look for data sets in their favorite database. This database is not necessarily representative of the whole area, for instance because it includes, say, many small data sets (the distribution of $n$ is then not the same as in the whole area) or more data sets for a particular disease (possibly implying specific forms of $P$). For simplicity, however, we assume in this paper that the data sets are randomly drawn from the considered area. These data sets are denoted as $D_1, \ldots, D_J$. We intentionally do not use the notation $S$ from the previous section to stress that the situation is now different: the data sets $D_1, \ldots, D_J$ are not drawn from the same distribution (as was the case for $s^{(1)}, \ldots, s^{(B)}$ in the previous section). Each data set $D_j$ is a realization of $P_j^{n_j}$, where $n_j$ is its size and $P_j$ is the distribution of the underlying population. As we assume that data sets are randomly drawn from the considered area, the distribution $P_j$ is the outcome of a random variable $\Phi_j: \Omega \to \mathcal{V}$, where $\mathcal{V}$ is the set of all possible distributions (in the area of application). Furthermore, the size $n_j$ of the data set $D_j$ is the outcome of a random variable $N_j: \Omega \to \mathbb{N}$. The random variables $(\Phi_1, N_1), \ldots, (\Phi_J, N_J)$ are i.i.d. but, of course, we only observe $N_j = n_j$ and cannot observe $\Phi_j = P_j$ for every $j \in \{1, \ldots, J\}$.

Test hypothesis

When randomly drawing a data set $D$, one thus implicitly draws simultaneously a distribution $P$ and $n$ realizations from this distribution. Both imply a certain variability. Importantly, $P$ can now be seen as the outcome of a random variable $\Phi$ and $n$ as the outcome of a random variable $N$. Note that $\Phi$ and $N$ are not necessarily independent. The test hypotheses of interest that are implicitly considered when comparing the performance of methods based on different real data sets can

be stated as

$$H_0^{(real)}: E(\varepsilon(N, M_2, \Phi)) - E(\varepsilon(N, M_1, \Phi)) \geq 0 \quad \text{vs.} \quad H_1^{(real)}: E(\varepsilon(N, M_2, \Phi)) - E(\varepsilon(N, M_1, \Phi)) < 0,$$

where $E$ here denotes the expectation over the random variables $\Phi$ and $N$. Accordingly, $\varepsilon(n, M_k, P)$ is now the outcome of the random variable $\varepsilon(N, M_k, \Phi)$, $k \in \{1, 2\}$. In comparison studies based on real data, researchers typically estimate the error for each data set using a resampling procedure such as, e.g., repeated splitting into training and test sets. Let $e(n, M_k, D)$ denote the error of method $M_k$ estimated for data set $D$ with the chosen resampling procedure. The estimated error $e(n, M_k, D)$ can be seen as an estimator of the unknown parameter $\varepsilon(n, M_k, P)$. In resampling procedures, however, the training data set used at each resampling iteration is smaller than $n$, hence leading to an error $e(n, M_k, D)$ larger than $\varepsilon(n, M_k, P)$ on average. Nevertheless, if we assume that this bias is equal for both considered methods $M_1$ and $M_2$, i.e. that

$$E_{P^n}(e(n, M_1, D)) - \varepsilon(n, M_1, P) = E_{P^n}(e(n, M_2, D)) - \varepsilon(n, M_2, P),$$

we have

$$\varepsilon(n, M_2, P) - \varepsilon(n, M_1, P) = E_{P^n}(e(n, M_2, D) - e(n, M_1, D)) \quad (2)$$

where $E_{P^n}$ denotes the expectation over the data set $D$ drawn i.i.d. from $P$. The data set $D$ can be seen as the outcome of a random variable $\mathcal{D}$ whose conditional distribution given $\Phi = P$ and $N = n$ is equal to $P^n$. Then, using the relation

$$E\big(E(e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D}) \mid \Phi, N)\big) = E(e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})), \quad (3)$$

we can formulate the null hypothesis $H_0^{(real)}$ as

$$H_0^{(real)}: E(e(N, M_2, \mathcal{D})) - E(e(N, M_1, \mathcal{D})) \geq 0 \quad \text{vs.} \quad H_1^{(real)}: E(e(N, M_2, \mathcal{D})) - E(e(N, M_1, \mathcal{D})) < 0.$$

This formulation is advantageous because we have access to independent and identically distributed realizations $e(n_j, M_2, D_j) - e(n_j, M_1, D_j)$ (with $j = 1, \ldots, J$) of $e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})$, thus yielding estimates of its mean and its variance. Note that the exact measure-theoretic formulation of (3) needs some more care because: (i) $\Phi$ is a random variable which takes its values in a set $\mathcal{V}$ of probability measures $P$, so that we need a suitable $\sigma$-algebra on $\mathcal{V}$; (ii) the size $n = N$ of the data set $D = \mathcal{D}$ is random, so that a naive formalization would yield that $\mathcal{D} \in (\mathcal{X} \times \mathcal{Y})^N$ where the dimension $N$ is random; and (iii) the existence of a suitable random variable $\mathcal{D}$ (whose conditional distribution given $\Phi = P$ and $N = n$ is equal to $P^n$) is not obvious. The mathematically correct derivation of the above testing problem which takes care of these technicalities is deferred to the appendix.

Let $\Delta e(n_j, D_j) = e(n_j, M_2, D_j) - e(n_j, M_1, D_j)$ (with $j = 1, \ldots, J$) be independent and identically distributed realizations of $e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})$ and define $\overline{\Delta e} = \frac{1}{J} \sum_{j=1}^{J} \Delta e(n_j, D_j)$. Under a normality assumption or for very large $J$ it is now possible to perform a paired-sample t-test to test $H_0^{(real)}$. The test statistic

$$T = \frac{\sqrt{J}\,\overline{\Delta e}}{\sqrt{\frac{1}{J-1} \sum_{j=1}^{J} \left(\Delta e(n_j, D_j) - \overline{\Delta e}\right)^2}}$$

follows a Student distribution with $J - 1$ degrees of freedom under $H_0^{(real)}$.
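In R, this test amounts to a one-sided one-sample t-test on the per-data-set error differences. The following minimal sketch uses a simulated stand-in for the matrix of estimated error rates; the object names and values are illustrative assumptions only.

```r
## Illustrative stand-in for a J x 2 matrix of resampling-based error estimates,
## one row per data set, columns M1 (reference) and M2 (new method).
set.seed(1)
J   <- 10
err <- cbind(M1 = runif(J, 0.15, 0.35),
             M2 = runif(J, 0.10, 0.30))

## Realizations of Delta e(n_j, D_j) = e(n_j, M2, D_j) - e(n_j, M1, D_j)
delta_e <- err[, "M2"] - err[, "M1"]

## One-sided matched-pair t-test of H0^(real): is the mean difference < 0?
t.test(delta_e, mu = 0, alternative = "less")

## Non-parametric counterpart: one-sided Wilcoxon signed-rank test
wilcox.test(delta_e, mu = 0, alternative = "less")
```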

This type of test is performed in many machine learning studies, where it is usual to apply new and existing methods to a large number of data sets (Demšar, 2006), for instance from databases especially designed for this purpose. Non-parametric tests such as the Wilcoxon signed-rank test are also commonly applied in this context (Demšar, 2006), including adjustment procedures for multiple comparisons or the use of global test statistics for the comparison of more than two methods (Garcia and Herrera, 2008; Garcia et al., 2010). When authors state in their abstracts that the new method performs better than existing methods on real data sets, they implicitly say that $H_0^{(real)}$ can be rejected, although most of them do not perform the test and give almost no attention to the theoretical background of these tests.

Decomposition of the variance

The variance term $\mathrm{Var}(\Delta e(N, \mathcal{D}))$ consists of two parts: firstly, the variance of $\Delta e(N, \mathcal{D})$ conditional on the distribution $\Phi = P$ and the sample size $N = n$; secondly, the variance of $\varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)$:

$$\mathrm{Var}(\Delta e(N, \mathcal{D})) = \mathrm{Var}\big(\varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)\big) + E\big(\mathrm{Var}(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big). \quad (4)$$

Essentially, this follows from the law of total variance, Equation (2), and the fact that the conditional distribution of $\mathcal{D}$ given $\Phi = P$ and $N = n$ is equal to $P^n$; the exact derivation needs some more care again and is deferred to the appendix. The literature on error estimation and comparison usually focuses on the second part of the variance in (4): several estimators have been proposed (Nadeau and Bengio, 2003). The first part of the variance is considered as a disturbing factor by many authors. Data sets that behave differently from the others are generally considered as cumbersome outliers and sometimes even excluded from the comparison study, possibly leading to a substantial bias (Yousefi et al., 2010). This part of the variance is also tightly related to the optimistic bias commonly observed in studies assessing new methods in comparison studies (Jelizarow et al., 2010). Researchers tend to overfit their new method to specific example data sets while developing it. Since the variance across data sets is high, a new method that has been optimized for these particular data sets is likely to perform much worse on other data sets.
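Written out, the derivation of Equation (4) is short: it combines the law of total variance with the equal-bias assumption behind Equation (2), under which $E(\Delta e(N, \mathcal{D}) \mid \Phi, N) = \varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)$:

$$\mathrm{Var}(\Delta e(N, \mathcal{D})) = \mathrm{Var}\big(E(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big) + E\big(\mathrm{Var}(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big)$$
$$= \mathrm{Var}\big(\varepsilon(N, M_2, \Phi) - \varepsilon(N, M_1, \Phi)\big) + E\big(\mathrm{Var}(\Delta e(N, \mathcal{D}) \mid \Phi, N)\big),$$

which is exactly (4).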

Power considerations

Considering the one-sided one-sample t-test outlined above, the number $J$ of data sets necessary to detect a given effect size $\Delta/\sigma$ at a certain power $1 - \beta$ can be derived from the formula

$$J \geq \left[t_{1-\alpha, J-1} + t_{1-\beta, J-1}\right]^2 \left(\frac{\sigma}{\Delta}\right)^2,$$

where $t_{\alpha, df}$ denotes the $\alpha$-quantile of the Student distribution with $df$ degrees of freedom, $\alpha$ denotes the type I error (typically $\alpha = 0.05$), $\beta$ denotes the type II error, $\Delta$ denotes the difference that we want to be able to detect, and $\sigma$ denotes the standard deviation (Bock, 1998). Conversely, for a given $J$, a given $\Delta$ and a given $\sigma$, one can also compute the power $1 - \beta$ of the test as

$$1 - \beta = \Phi_{t_{df=J-1}}\left(\sqrt{\frac{J \Delta^2}{\sigma^2}} - t_{1-\alpha, df=J-1}\right),$$

where $\Phi_{t_{df=J-1}}$ denotes the cumulative distribution function of the Student distribution with $J - 1$ degrees of freedom. It is common practice to use such formulas to derive the adequate sample size in various types of experiments including, e.g., animal trials or clinical trials. Such statistical planning, however, is never considered when performing benchmark experiments, even if the researchers eventually want to perform statistical tests. In this paper, we suggest also giving attention to such power considerations when planning or performing a comparison study involving several data sets. These issues are illustrated in the next section through an application to comparison studies of microarray-based classification methods.
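As a sketch of how these calculations can be carried out in R, the snippet below implements the approximate formulas given above and cross-checks them against base R's power.t.test (which uses the exact noncentral t-distribution, so results may differ slightly). The values of Delta and sigma are illustrative only.

```r
## Illustrative power / sample-size calculations for the one-sided one-sample
## t-test on per-data-set differences; Delta and sigma below are example values.
alpha <- 0.05
Delta <- 0.05    # difference in error rate to detect (illustrative)
sigma <- 0.075   # standard deviation of the differences (illustrative)

## Approximate power for a given number J of data sets, using the formula above
power_approx <- function(J, Delta, sigma, alpha = 0.05) {
  pt(sqrt(J * Delta^2 / sigma^2) - qt(1 - alpha, df = J - 1), df = J - 1)
}
power_approx(J = 10, Delta, sigma)

## Smallest J reaching 80% power, by direct search over the sample-size formula
J_needed <- function(Delta, sigma, alpha = 0.05, beta = 0.2) {
  for (J in 3:1000) {
    if (J >= (qt(1 - alpha, J - 1) + qt(1 - beta, J - 1))^2 * (sigma / Delta)^2)
      return(J)
  }
  NA
}
J_needed(Delta, sigma)

## Cross-check with base R's exact noncentral-t calculation
power.t.test(delta = Delta, sd = sigma, sig.level = alpha, power = 0.8,
             type = "one.sample", alternative = "one.sided")
```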

4. ILLUSTRATION: POWER OF PUBLISHED STUDIES

In this section we examine comparison studies from the literature on microarray-based supervised classification with respect to the issues discussed above. In particular, we compute the empirical variance of the difference between methods and estimate the power of the comparisons.

Considered studies

We selected comparison studies (i) whose aim was not to establish the superiority of a new method, hence warranting a certain level of neutrality (Boulesteix et al., 2008; Boulesteix and Eugster, 2012), (ii) focusing on diagnosis or prognosis based on high-dimensional gene expression microarray data, (iii) representing the estimated error rates in the form of a table (thus providing access to the exact figures) rather than in the form of graphics, and (iv) examining at least five data sets and at least five classification methods. Three of the considered comparison studies fulfilled these four criteria: the study by Lee et al. (2005) comparing six methods on six cancer data sets, the study by Lai et al. (2006) comparing 13 methods on seven cancer data sets, and the study by Statnikov et al. (2005) comparing eight methods on 11 cancer data sets.

Standard deviation

We first derived the standard deviation of each pairwise difference between methods for each study. There were $6 \times 5/2 = 15$ pairwise differences for the Lee study, $13 \times 12/2 = 78$ pairwise differences for the Lai study, and $8 \times 7/2 = 28$ pairwise differences for the Statnikov study. The boxplots representing the standard deviations of these pairwise differences are displayed in Figure 1. It can be seen from these boxplots that the range of the standard deviations is surprisingly similar for the three studies, except for a few more extreme values in the Statnikov study. Furthermore, the standard deviations are very different within the studies: some pairs of methods have a very low standard deviation (indicating that the data sets are similar with respect to the difference in performance), while others show large standard deviations.
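The computation behind Figure 1 can be sketched in a few lines of R; the matrix errs of estimated error rates (one row per data set, one column per method) is a hypothetical toy object used here only for illustration.

```r
## Standard deviation of the pairwise error differences across data sets.
## 'errs' is a toy stand-in for a data sets x methods matrix of error rates.
set.seed(1)
errs <- matrix(runif(6 * 5, 0.1, 0.4), nrow = 6,
               dimnames = list(NULL, paste0("method", 1:5)))

pairs   <- combn(colnames(errs), 2)                  # all method pairs
sd_diff <- apply(pairs, 2, function(pr) sd(errs[, pr[2]] - errs[, pr[1]]))
names(sd_diff) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
sd_diff                                              # one standard deviation per pair
boxplot(sd_diff)                                     # analogous display to Figure 1
```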

Figure 1: Estimated standard deviation of $\Delta e(n_j, D_j)$ in the three studies by Lee et al. (2005), Lai et al. (2006) and Statnikov et al. (2005).

Power considerations

Assuming values of the standard deviation $\sigma$ in the range of those observed in the three investigated studies, Figure 2 (left panel) represents the number $J$ of data sets required to detect a difference in error rate of $\Delta$ with a power of 80%. The right panel of Figure 2 displays the reached power against the number of data sets $J$ for $\Delta = 0.05$ and different values of the standard deviation $\sigma$ in the range of those observed in the three investigated studies. This figure suggests that most published comparison studies (either neutral comparison studies or comparison studies included in an original article) are substantially underpowered. In this perspective, we propose and recommend that researchers conducting comparison studies explicitly address such power issues in the design and/or interpretation of their benchmark experiments.

5. AN EXEMPLARY BENCHMARK STUDY

In order to explicitly illustrate these problems in real benchmark experiments, we perform an exemplary benchmark study based on $J = 50$ microarray data sets with binary response as prepared by de Souza et al. (2010).

Figure 2: Left: Number $J$ of data sets required to detect a difference $\Delta$ for different values of $\sigma$ (black: $\sigma = 0.03$, red: $\sigma = 0.05$, green: $\sigma = 0.075$, blue: $\sigma = 0.1$) at a power of 80%. Right: Power to detect a difference of $\Delta = 0.05$ for different values of $\sigma$ (black: $\sigma = 0.03$, red: $\sigma = 0.05$, green: $\sigma = 0.075$, blue: $\sigma = 0.1$).

These data sets and the corresponding R code are available from our companion website mitarbeiter/020_professuren/boulesteix/compstud2013. For each data set, repeated subsampling with 4/5 of the observations in the training sets and 300 resampling iterations is used for error estimation. The considered classification methods are: i) diagonal linear discriminant analysis (DLDA) without variable selection (DLDA-all), ii) DLDA with 500 selected variables (DLDA-500), iii) DLDA with 20 selected variables (DLDA-20), and iv) DLDA with 10 selected variables (DLDA-10). Variable selection is performed by selecting the variables yielding the smallest p-values when testing the equality of the means in the two groups $Y = 0/1$ with a classical t-test. All analyses are performed using the Bioconductor package CMA (Slawski et al., 2008).
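For illustration, the per-data-set error estimation can be sketched in plain R as follows. This is a generic re-implementation for exposition only, not the CMA-based code used for the actual study; the objects x (an n x p expression matrix) and y (a binary 0/1 response) are hypothetical placeholders.

```r
## Repeated subsampling (4/5 training) with DLDA after t-test based variable
## selection; generic sketch, not the CMA code used in the study.
dlda_fit <- function(x, y) {
  # class-wise means and pooled per-variable variances (diagonal covariance model)
  m0 <- colMeans(x[y == 0, , drop = FALSE])
  m1 <- colMeans(x[y == 1, , drop = FALSE])
  v  <- (apply(x[y == 0, , drop = FALSE], 2, var) * (sum(y == 0) - 1) +
         apply(x[y == 1, , drop = FALSE], 2, var) * (sum(y == 1) - 1)) / (length(y) - 2)
  list(m0 = m0, m1 = m1, v = v, prior1 = mean(y == 1))
}

dlda_predict <- function(fit, xnew) {
  # discriminant scores under a diagonal covariance matrix; assign the smaller one
  d0 <- rowSums(sweep(sweep(xnew, 2, fit$m0)^2, 2, fit$v, "/")) - 2 * log(1 - fit$prior1)
  d1 <- rowSums(sweep(sweep(xnew, 2, fit$m1)^2, 2, fit$v, "/")) - 2 * log(fit$prior1)
  as.numeric(d1 < d0)
}

select_vars <- function(x, y, nvar) {
  # variables with the smallest p-values of a classical two-sample t-test
  p <- apply(x, 2, function(g) t.test(g[y == 0], g[y == 1], var.equal = TRUE)$p.value)
  order(p)[seq_len(nvar)]
}

subsampling_error <- function(x, y, nvar = NULL, niter = 300, ratio = 4/5) {
  errs <- replicate(niter, {
    train <- sample(length(y), floor(ratio * length(y)))
    keep  <- if (is.null(nvar)) seq_len(ncol(x)) else select_vars(x[train, ], y[train], nvar)
    fit   <- dlda_fit(x[train, keep, drop = FALSE], y[train])
    mean(dlda_predict(fit, x[-train, keep, drop = FALSE]) != y[-train])
  })
  mean(errs)  # resampling-based estimate e(n, M_k, D) for this data set
}

## Usage on a toy data set (purely illustrative):
## set.seed(1); x <- matrix(rnorm(60 * 200), nrow = 60); y <- rbinom(60, 1, 0.5)
## c(DLDA_all = subsampling_error(x, y), DLDA_20 = subsampling_error(x, y, nvar = 20))
```

Applying such a function to each data set and each method would yield the per-data-set error estimates on which the tests of Section 3 operate.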

Figure 3 displays the boxplots of the estimated errors over the 50 data sets using the four considered classification methods (top) and the boxplots of the six pairwise differences (bottom), also showing their respective standard deviations. It can be clearly seen from this figure that the standard deviations of the differences vary a lot depending on the considered pair of methods and that their range is similar to the values observed in the comparison studies from the literature discussed in the previous section. The results of the power considerations presented in Figure 2 are thus also relevant to our exemplary benchmark study.

Figure 3: Top: Boxplots of the estimated errors over the 50 data sets using the four considered classification methods. Bottom: Boxplots of the six pairwise differences, also showing their respective standard deviations. The gray lines connect points corresponding to the same data set.

Table 1: Results of the one-sided matched-pair t-test and one-sided Wilcoxon signed-rank test for all six pairwise comparisons of differences in error rates over the 50 data sets when testing hypothesis $H_0^{(real)}$ vs. $H_1^{(real)}$ (columns: Comparison, Difference, t, p-value, W, p-value). For each comparison, the first method plays the role of $M_1$ and the second method plays the role of $M_2$.

The results of the one-sided matched-pair t-test and one-sided

Wilcoxon signed-rank test for the six considered pairs of methods are displayed in Table 1. Four of the six pairwise differences between error rates are statistically significant when tested at an $\alpha$-level of 0.05 using the parametric matched-pair t-test and the non-parametric Wilcoxon signed-rank test. The largest difference can be found between DLDA-all and DLDA-20, where the standard deviation is as large as 0.1. The results from Table 1 outline the importance of the standard deviation of the difference: the comparisons DLDA-all vs. DLDA-500 and DLDA-all vs. DLDA-10 yield almost equal mean differences (0.038 and 0.036, respectively), but the first comparison leads to a much lower p-value due to the smaller standard deviation of 0.06 (see Figure 3). Similarly, the mean difference for the comparison DLDA-10 vs. DLDA-20 is moderate (0.009) but yields small p-values due to the very small standard deviation. On the whole, these results again stress the importance of a sufficient number of data sets in comparison studies, especially in the context of the high-dimensional data investigated in our example. Differences between methods are often moderate and their standard deviations may be comparatively large, thus making comparison studies based on the usual number of, say, 5 to 10 data sets substantially underpowered.

6. CONCLUDING REMARKS

In this paper we proposed a statistical formulation and interpretation of hypothesis tests performed in the context of comparison studies comparing the performance of supervised classification methods based on several data sets with unknown underlying distributions. Although we focused on classification problems, the developed ideas may easily be extended to other problems through the use of a different loss function. In the light of this framework, we examined published comparison studies

assessing classification methods for high-dimensional microarray data. Considering the large variance of the difference in performance across the data sets, we found that a very large number of data sets would be necessary to reach an adequate power, i.e. to have a large probability of detecting relevant differences in error rates as statistically significant in the testing framework. We conclude that most published comparison studies (either neutral or part of an original article) are substantially underpowered. As an outlook, we point to the parallels that can be drawn between comparison studies in the context of supervised learning and experiments from application fields of statistics (e.g. biomedicine). Roughly speaking, a comparison study comparing supervised learning methods shows some similarities with a clinical trial. For example, it might be helpful to first perform a pilot study with a limited number of data sets to provide a first rough estimate of the variance in order to evaluate the necessary number of data sets. Subgroup analyses may also make sense, e.g. focusing on data sets of particular sizes, with particular numbers of predictors, etc. The same rules should be observed as when performing subgroup analyses in the biomedical sciences: relevant subgroups should be defined prior to the analysis and all the results should be reported in order to avoid fishing for significance. Alternatively, it may make sense to model $\Delta e$ as a dependent variable with data set characteristics as independent variables (for instance size, number of predictors, ratio between size and number of predictors, signal strength, balance between the classes, etc.). This would be an alternative to the previously suggested Bradley-Terry model approach (Eugster et al., 2010). Going one step further, meta-analyses of comparison studies would also be conceivable. To conclude, we believe that statisticians working on methodological research projects such as the development of new supervised learning methods (including ourselves) should probably take more care of the rules that they themselves (rightly) impose on their statistical consulting clients: take care of sample size and power issues, pay attention to the underlying hypothesis when performing a test, and do not

over-interpret results that are not statistically significant.

References

Billingsley, P. (1986). Probability and Measure, 2nd Edition. John Wiley & Sons, New York.

Bock, J. (1998). Bestimmung des Stichprobenumfangs. München/Wien: Oldenbourg.

Bouckaert, R. (2003). Choosing between two learning algorithms based on calibrated tests. In: Machine Learning, International Workshop then Conference, Vol. 20, p. 51.

Boulesteix, A., Eugster, M. (2012). A plea for neutral comparison studies in computational sciences. Technical report.

Boulesteix, A., Strobl, C., Augustin, T., Daumer, M. (2008). Evaluating microarray-based classifiers: an overview. Cancer Informatics 6.

Braga-Neto, U., Dougherty, E. (2005). Exact performance of error estimators for discrete classifiers. Pattern Recognition 38.

Dalton, L., Dougherty, E. (2012a). Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error, Part I: Representation. IEEE Transactions on Signal Processing 60.

Dalton, L., Dougherty, E. (2012b). Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error, Part II: Consistency and performance analysis. IEEE Transactions on Signal Processing 60.

de Souza, B., de Carvalho, A., Soares, C. (2010). A comprehensive comparison of ML algorithms for gene expression data classification. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7.

Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10.

Dougherty, E., Zollanvari, A., Braga-Neto, U. (2011). The illusion of distribution-free small-sample classification in genomics. Current Genomics 12.

Eugster, M., Hothorn, T., Leisch, F. (2012). Domain-based benchmark experiments: exploratory and inferential analysis. Austrian Journal of Statistics 41.

Eugster, M., Leisch, F., Strobl, C. (2010). (Psycho-)analysis of benchmark experiments. Technical Report 78, Department of Statistics, LMU.

Garcia, S., Fernandez, A., Luengo, J., Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10).

Garcia, S., Herrera, F. (2008). An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research 9.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.

Hanczar, B., Dougherty, E. (2010). On the comparison of classifiers for microarray data. Current Bioinformatics 5.

Hoeffding, W., Wolfowitz, J. (1958). Distinguishability of sets of distributions (the case of independent and identically distributed chance variables). Annals of Mathematical Statistics 29.

Hothorn, T., Leisch, F., Zeileis, A., Hornik, K. (2005). The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics 14.

Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K., Boulesteix, A. (2010). Over-optimism in bioinformatics: an illustration. Bioinformatics 26.

Lai, C., Reinders, M., Van't Veer, L., Wessels, L. (2006). A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinformatics 7, 235.

Lee, J., Lee, J., Park, M., Song, S. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48.

Nadeau, C., Bengio, Y. (2003). Inference for the generalization error. Machine Learning 52.

Slawski, M., Daumer, M., Boulesteix, A. (2008). CMA: a comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9, 439.

Statnikov, A., Aliferis, C., Tsamardinos, I., Hardin, D., Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21.

Van't Veer, L., Dai, H., Van De Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., Van Der Kooy, K., Marton, M., Witteveen, A., et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415.

Webb, G. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning 40.

Wolpert, D. (2001). The supervised learning no-free-lunch theorems. In: Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications.

Yousefi, M., Hua, J., Sima, C., Dougherty, E. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26.

Zollanvari, A., Braga-Neto, U., Dougherty, E. (2009). On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recognition 42.

Zollanvari, A., Braga-Neto, U., Dougherty, E. (2011). Analytic study of performance of error estimators for linear discriminant analysis. IEEE Transactions on Signal Processing 59.

APPENDIX

The appendix contains some technicalities which are needed for a mathematically rigorous formulation of the testing problem in Section 3. In particular, it is shown that there are suitable random variables $N$ and $\mathcal{D}$ such that $e(n, M_2, D) - e(n, M_1, D)$ can be seen as the observed realization of the random variable $e(N, M_2, \mathcal{D}) - e(N, M_1, \mathcal{D})$, where also the distribution $P$ (which generates the data set $D$) is randomly chosen. This is crucial in order to apply the central limit theorem; otherwise, using the t-test to test $H_0^{(real)}$ in Section 3 would not be justified.

Preliminaries

Let $\big((\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}}\big)$ be the countably-infinite product space of the measurable space $(\mathcal{X} \times \mathcal{Y}, \mathcal{B})$. Let $\mathcal{V}$ be the set of all possible distributions (in the area of application), endowed with the Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{V}}$ with respect to the total variation norm. We arbitrarily fix any $x_0 \in \mathcal{X}$ and $y_0 \in \mathcal{Y}$ and define the function

$$\gamma: (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}} \times \mathbb{N} \to (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}, \quad (D, n) \mapsto \gamma(D, n) =: D^{(n)}$$

by

$$D^{(n)} = \big((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), (x_0, y_0), (x_0, y_0), \ldots\big) \in (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}$$

for $D = \big((x_1, y_1), (x_2, y_2), \ldots\big) \in (\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}$ and $n \in \mathbb{N}$. Note that $\gamma$ is measurable with respect to $\mathcal{B}^{\mathbb{N}} \otimes 2^{\mathbb{N}}$ and $\mathcal{B}^{\mathbb{N}}$. Similarly, define $P^{(n)} = P^n \otimes \delta_{(x_0, y_0)} \otimes \delta_{(x_0, y_0)} \otimes \cdots$ on $\big((\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}}\big)$ for every $P \in \mathcal{V}$.


More information

CS 543 Page 1 John E. Boon, Jr.

CS 543 Page 1 John E. Boon, Jr. CS 543 Machine Learning Spring 2010 Lecture 05 Evaluating Hypotheses I. Overview A. Given observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Sequential/Adaptive Benchmarking

Sequential/Adaptive Benchmarking Sequential/Adaptive Benchmarking Manuel J. A. Eugster Institut für Statistik Ludwig-Maximiliams-Universität München Validation in Statistics and Machine Learning, 2010 1 / 22 Benchmark experiments Data

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Why uncertainty? Why should data mining care about uncertainty? We

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Structural Uncertainty in Health Economic Decision Models

Structural Uncertainty in Health Economic Decision Models Structural Uncertainty in Health Economic Decision Models Mark Strong 1, Hazel Pilgrim 1, Jeremy Oakley 2, Jim Chilcott 1 December 2009 1. School of Health and Related Research, University of Sheffield,

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

The Design and Analysis of Benchmark Experiments Part II: Analysis

The Design and Analysis of Benchmark Experiments Part II: Analysis The Design and Analysis of Benchmark Experiments Part II: Analysis Torsten Hothorn Achim Zeileis Friedrich Leisch Kurt Hornik Friedrich Alexander Universität Erlangen Nürnberg http://www.imbe.med.uni-erlangen.de/~hothorn/

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

A STUDY OF PRE-VALIDATION

A STUDY OF PRE-VALIDATION A STUDY OF PRE-VALIDATION Holger Höfling Robert Tibshirani July 3, 2007 Abstract Pre-validation is a useful technique for the analysis of microarray and other high dimensional data. It allows one to derive

More information

Algorithms for Classification: The Basic Methods

Algorithms for Classification: The Basic Methods Algorithms for Classification: The Basic Methods Outline Simplicity first: 1R Naïve Bayes 2 Classification Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.

More information

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1 CptS 570 Machine Learning School of EECS Washington State University CptS 570 - Machine Learning 1 IEEE Expert, October 1996 CptS 570 - Machine Learning 2 Given sample S from all possible examples D Learner

More information

An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance

An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance Dhaka Univ. J. Sci. 61(1): 81-85, 2013 (January) An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance A. H. Sajib, A. Z. M. Shafiullah 1 and A. H. Sumon Department of Statistics,

More information

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS

DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS THUY ANH NGO 1. Introduction Statistics are easily come across in our daily life. Statements such as the average

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Data Warehousing & Data Mining

Data Warehousing & Data Mining 13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Smart Home Health Analytics Information Systems University of Maryland Baltimore County

Smart Home Health Analytics Information Systems University of Maryland Baltimore County Smart Home Health Analytics Information Systems University of Maryland Baltimore County 1 IEEE Expert, October 1996 2 Given sample S from all possible examples D Learner L learns hypothesis h based on

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

How Much Evidence Should One Collect?

How Much Evidence Should One Collect? How Much Evidence Should One Collect? Remco Heesen October 10, 2013 Abstract This paper focuses on the question how much evidence one should collect before deciding on the truth-value of a proposition.

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

NEAREST NEIGHBOR CLASSIFICATION WITH IMPROVED WEIGHTED DISSIMILARITY MEASURE

NEAREST NEIGHBOR CLASSIFICATION WITH IMPROVED WEIGHTED DISSIMILARITY MEASURE THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Series A, OF THE ROMANIAN ACADEMY Volume 0, Number /009, pp. 000 000 NEAREST NEIGHBOR CLASSIFICATION WITH IMPROVED WEIGHTED DISSIMILARITY MEASURE

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Reading for Lecture 6 Release v10

Reading for Lecture 6 Release v10 Reading for Lecture 6 Release v10 Christopher Lee October 11, 2011 Contents 1 The Basics ii 1.1 What is a Hypothesis Test?........................................ ii Example..................................................

More information

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Algorithm Independent Topics Lecture 6

Algorithm Independent Topics Lecture 6 Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an

More information

Research Article Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

Research Article Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 7, Article ID, pages doi:./7/ Research Article Quantification of the Impact of Feature Selection on the Variance

More information

SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM)

SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM) SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM) SEM is a family of statistical techniques which builds upon multiple regression,

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Diagnostics. Gad Kimmel

Diagnostics. Gad Kimmel Diagnostics Gad Kimmel Outline Introduction. Bootstrap method. Cross validation. ROC plot. Introduction Motivation Estimating properties of an estimator. Given data samples say the average. x 1, x 2,...,

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information