The Design and Analysis of Benchmark Experiments Part II: Analysis


Torsten Hothorn, Achim Zeileis, Friedrich Leisch, Kurt Hornik
Friedrich-Alexander-Universität Erlangen-Nürnberg
http://www.imbe.med.uni-erlangen.de/~hothorn/

Benchmark Experiments A comparison of algorithms with respect to certain performance measures is of special interest in the following problems: selecting the best out of a set of candidates, identifying groups of algorithms with the same performance, testing whether any useful structure is inherent in the data, or demonstrating the equivalence of two algorithms.

Illustrating Example Stabilization of a Linear Discriminant Analysis (LDA) by using low-dimensional principal component (PC-q) scores (Läuter, 1992; Läuter et al., 1998; Kropf, 2000) for glaucoma diagnosis (Hothorn et al., 2003; Mardin et al., 2003). Laser-scanning images from 98 patients and 98 controls (n = 196), p = 62 numeric input variables. Data generating process: the empirical distribution function Ẑn. Performance measure: out-of-bootstrap misclassification error.

Experiment Question: Does the performance distribution P̂_LDA(Ẑn) of an LDA using the original p input variables differ from the performance distribution P̂_slda(Ẑn) of a stabilized LDA? Experiment: Draw B samples L_b from the data generating process Ẑn and compute p̂_LDA,b and p̂_slda,b, the misclassification errors evaluated on the out-of-bootstrap observations.
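A sketch of this out-of-bootstrap scheme in Python. The synthetic two-class data, the hand-rolled Fisher LDA, and the choice of q = 5 principal components are all hypothetical stand-ins for the glaucoma study setup, not the original analysis code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 196, 62
# Synthetic stand-in for the glaucoma data: two balanced classes whose
# means differ in the first five coordinates (purely illustrative).
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)
X[y == 1, :5] += 1.0

def lda_error(Xtr, ytr, Xte, yte):
    """Misclassification error of Fisher's LDA with a pooled covariance."""
    m0, m1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    Xc = np.vstack([Xtr[ytr == 0] - m0, Xtr[ytr == 1] - m1])
    S = Xc.T @ Xc / (len(ytr) - 2)
    w = np.linalg.pinv(S) @ (m1 - m0)   # discriminant direction
    c = w @ (m0 + m1) / 2               # midpoint decision threshold
    return np.mean((Xte @ w > c).astype(int) != yte)

def pc_scores(Xtr, Xte, q):
    """Project training and test data onto the first q PCs of Xtr."""
    mu = Xtr.mean(0)
    _, _, Vt = np.linalg.svd(Xtr - mu, full_matrices=False)
    return (Xtr - mu) @ Vt[:q].T, (Xte - mu) @ Vt[:q].T

B = 50
p_lda, p_slda = np.empty(B), np.empty(B)
for b in range(B):
    boot = rng.integers(0, n, n)                # bootstrap sample L_b
    oob = np.setdiff1d(np.arange(n), boot)      # out-of-bootstrap cases
    p_lda[b] = lda_error(X[boot], y[boot], X[oob], y[oob])
    Ztr, Zte = pc_scores(X[boot], X[oob], q=5)  # q = 5 is a hypothetical choice
    p_slda[b] = lda_error(Ztr, y[boot], Zte, y[oob])
```

The pseudo-inverse guards against a singular pooled covariance, which can occur when p is large relative to the number of distinct bootstrap observations.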

[Figure: boxplots of the out-of-bootstrap misclassification error (roughly 0.05–0.35) for LDA and stabilized LDA]

Inference H0: P̂_LDA(Ẑn) = P̂_slda(Ẑn) Problem: We do not know anything about the performance distributions, except that parametric assumptions are surely not appropriate. Solution: Dispose of the unknown performance distributions by conditioning on all permutations of the algorithm labels within each bootstrap sample.

Inference T = ∑_{b=1}^{B} (p̂_LDA,b − p̂_slda,b) = B (p̄_LDA − p̄_slda) The conditional distribution of the test statistic T under the conditions described by H0 can be used to construct a permutation test. In our case, the P-value based on the asymptotic conditional distribution is p < 0.001, and therefore H0 can be rejected.
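A minimal sketch of such a conditional test: under H0 the two algorithm labels are exchangeable within each bootstrap sample, which for a paired comparison amounts to random sign flips of the differences. The performance vectors below are simulated, not the glaucoma results:

```python
import numpy as np

def paired_permutation_test(pa, pb, n_perm=9999, seed=0):
    """Two-sided permutation p-value for T = sum_b (pa_b - pb_b).

    Exchangeability of the labels within each sample b is approximated
    by Monte Carlo sign flips of the paired differences.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(pa) - np.asarray(pb)
    t_obs = d.sum()                            # observed statistic T
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    t_perm = signs @ d                         # T under relabelled samples
    return np.mean(np.abs(t_perm) >= abs(t_obs))

# Hypothetical performance vectors over B = 100 samples:
# algorithm A is worse by about 0.03 on average.
rng = np.random.default_rng(1)
pa = 0.20 + rng.normal(0.0, 0.02, 100)
pb = 0.17 + rng.normal(0.0, 0.02, 100)
pval = paired_permutation_test(pa, pb)
```

With a clear mean difference relative to the noise, the p-value comes out far below conventional levels, mirroring the p < 0.001 reported for the glaucoma comparison.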

A Regression Example Exactly the same methodology can be applied to regression problems with univariate numeric responses. Example: Can additional randomness via Random Forests improve Bagging for the Boston Housing data? House prices for n = 506 houses near Boston, p = 13 input variables. Data generating process: The empirical distribution function Ẑn. Performance measure: Out-of-bootstrap mean squared error.

[Figure: boxplots and density estimates of the out-of-bootstrap mean squared error (roughly 5–30) for Bagging and Random Forests]

Inference The null hypothesis of equal performance distributions can be rejected (P-value < 0.001). The estimated difference of the mean squared error of Bagging compared to Random Forests is 0.969 with confidence limits (0.633, 1.305).

Comparison of Multiple Algorithms When multiple algorithms are under test, we are interested in both a global test and a multiple test procedure showing where the differences, if any, come from. Example: Breast Cancer data with a binary tumor classification, n = 699 observations and p = 9 inputs. Comparison of slda, Support Vector Machine, Random Forests and Bundling (Hothorn and Lausen, 2003). Data generating process: the empirical distribution function Ẑn. Performance measure: out-of-bootstrap misclassification error.

[Figure: boxplots and density estimates of the out-of-bootstrap misclassification error (roughly 0–0.1) for slda, SVM, Random Forests and Bundling]

Inference Again, the global hypothesis H0: P̂_1(Ẑn) = ... = P̂_K(Ẑn) can be rejected (P-value < 0.001). Problem: Which differences cause the rejection of H0? Solution: One can avoid complicated closed testing procedures by computing confidence intervals after mapping the B-block design into a K-sample problem via alignment (Hájek et al., 1999).

Alignment When we look at the performance measure of algorithm k in the bth sample drawn from the data generating process, we might want to write p_kb = μ + β_b + γ_k + ε_kb, where μ corresponds to the performance of the Bayes rule, β_b is the error induced by the bth sample, γ_k is the error of the kth algorithm (the quantity we are primarily interested in), and ε_kb is an error term.

Alignment (cont d) The aligned performance measures p̃_kb capture the difference of the performance of the kth algorithm from the average performance of all K algorithms: p̃_kb = p_kb − p̄_b = (γ_k + ε_kb) − (1/K) ∑_{k'=1}^{K} (γ_{k'} + ε_{k'b}) For classification problems, p̃_{k1,b} − p̃_{k2,b} is the difference of the misclassification errors of algorithms k1 and k2.

Alignment (cont d) The aligned random variables are not independent but exchangeable within each of the B samples, and are independent between samples. Therefore, (asymptotic) permutation test procedures can be used to assess deviations from the global null hypothesis. For example, asymptotic simultaneous confidence intervals for Tukey contrasts can be used for an all-pairs comparison of the K algorithms under test.
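A minimal numeric sketch of the alignment step, assuming the two-way model above with made-up effect sizes. The unadjusted normal intervals below merely stand in for the simultaneous Tukey intervals, which would use a larger critical value:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
B, K = 100, 4
algos = ["slda", "svm", "rf", "bundling"]       # illustrative labels
gamma = np.array([0.00, 0.01, 0.005, -0.005])   # hypothetical algorithm effects
beta = rng.normal(0.0, 0.02, size=(B, 1))       # block (sample) effects
perf = 0.05 + beta + gamma + rng.normal(0.0, 0.005, size=(B, K))

# Alignment: subtract the per-block mean, removing the beta_b effects.
aligned = perf - perf.mean(axis=1, keepdims=True)

for i, j in combinations(range(K), 2):
    d = aligned[:, i] - aligned[:, j]           # block means cancel here
    se = d.std(ddof=1) / np.sqrt(B)
    lo, hi = d.mean() - 1.96 * se, d.mean() + 1.96 * se  # unadjusted 95% interval
    print(f"{algos[i]} vs {algos[j]}: {d.mean():+.4f} ({lo:+.4f}, {hi:+.4f})")
```

Note that for a pairwise contrast the per-block mean cancels, so alignment matters mainly for contrasts involving the average over algorithms and for the joint permutation distribution.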

Asymptotic Tukey Confidence Sets [Figure: 95% two-sided simultaneous confidence intervals, lying roughly between −0.005 and 0.010, for all pairwise comparisons: RF vs. Bundling, SVM vs. Bundling, slda vs. Bundling, SVM vs. RF, slda vs. RF, slda vs. SVM]

Classical Tests? We advocate the use of permutation tests, but what about more classical tests? Consider a paired comparison of slda vs. SVM for the Breast Cancer data: Permutation test: T = 1.488, p = 0.776. t test: t = 0.284, p = 0.777. Wilcoxon signed rank test: W = 18216, p < 0.001.

Rank Tests: A Warning Tests like the Wilcoxon signed rank test are constructed for the null hypothesis that the difference of the performance measures is symmetrically distributed around zero. For non-symmetric distributions this leads to a complete disaster. Look at n = 500 realizations of the skewed random variable (X − d)/√(2d), which has expectation zero and unit variance, where X ~ χ²_d.

[Figure: empirical size (0–1) of the Wilcoxon signed rank test, the permutation test and the t test against the nominal size, as a function of the degrees of freedom (0–200); the Wilcoxon test grossly exceeds the nominal size for small degrees of freedom]
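A small simulation, assuming the construction above with d = 3 degrees of freedom and nominal level 0.05, reproduces the qualitative picture: the Wilcoxon signed rank test rejects a true (asymmetric) null far too often, while the t test stays near its nominal level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
d, n, reps, alpha = 3, 500, 200, 0.05
rej_w = rej_t = 0
for _ in range(reps):
    # Skewed null: (X - d)/sqrt(2d) with X ~ chi^2_d has mean 0, variance 1.
    z = (rng.chisquare(d, n) - d) / np.sqrt(2 * d)
    rej_w += stats.wilcoxon(z).pvalue < alpha       # tests symmetry about 0
    rej_t += stats.ttest_1samp(z, 0.0).pvalue < alpha  # tests mean = 0
print(f"Wilcoxon size: {rej_w / reps:.2f}, t test size: {rej_t / reps:.2f}")
```

The t test is protected by the central limit theorem at n = 500, whereas the Wilcoxon test answers a different question (symmetry) than the one posed by H0 (equal means).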

Lifetime Analysis Problems Appropriate performance measures for censored responses are by no means obvious and still a matter of debate (Henderson, 1995; Graf et al., 1999; Molinaro et al., 2004). We use the Brier score for censored data suggested by Graf et al. (1999). Example: Predictive performance of the Kaplan-Meier estimator, a single survival tree and Bagging of survival trees (Hothorn et al., 2004), measured for the n = 686 women enrolled in the German Breast Cancer Study Group 2 trial.

Kaplan-Meier vs. Single Tree [Figure: scatter plot of out-of-bootstrap Brier scores for the Kaplan-Meier estimator vs. a single tree (both roughly 0.17–0.22), and mean-difference plot of Tree − KM against (Tree + KM)/2]

Kaplan-Meier vs. Bagging [Figure: scatter plot of out-of-bootstrap Brier scores for the Kaplan-Meier estimator vs. Bagging (both roughly 0.15–0.20), and mean-difference plot of Bagging − KM against (Bagging + KM)/2]

Asymptotic Dunnett Confidence Sets [Figure: 95% two-sided simultaneous confidence intervals, lying roughly between −0.025 and 0.005, for Bagging vs. Kaplan-Meier and Tree vs. Kaplan-Meier]

Interpretation Predictions derived from the estimated Kaplan-Meier curve do not take any information contained in the input variables into account. A test of the hypothesis that there is no (detectable) relationship between the input variables and the response can therefore be performed by comparing the performance of the simple Kaplan-Meier curve with the performance of the best tools available for predicting survival times.

Conclusion When comparing the performance of K algorithms, it is appropriate to treat the B samples from the data generating process as blocks. Standard statistical test procedures can be used to compare arbitrary performance measures for multiple algorithms. Some classical parametric and non-parametric procedures are suboptimal; we advocate procedures based on the conditional distribution of test statistics for inference.

References

Graf, E., Schmoor, C., Sauerbrei, W., and Schumacher, M. (1999), "Assessment and comparison of prognostic classification schemes for survival data," Statistics in Medicine, 18, 2529–2545.

Hájek, J., Šidák, Z., and Sen, P. K. (1999), Theory of Rank Tests, London: Academic Press, 2nd edition.

Henderson, R. (1995), "Problems and prediction in survival-data analysis," Statistics in Medicine, 14, 161–184.

Hothorn, T. and Lausen, B. (2003), "Bundling classifiers by bagging trees," Preprint, Friedrich-Alexander-University Erlangen-Nuremberg, URL http://www.mathpreprints.com/.

Hothorn, T., Lausen, B., Benner, A., and Radespiel-Tröger, M. (2004), "Bagging survival trees," Statistics in Medicine, 23, 77–91.

Hothorn, T., Pal, I., Gefeller, O., Lausen, B., Michelson, G., and Paulus, D. (2003), "Automated classification of optic nerve head topography images for glaucoma screening," in Studies in Classification, Data Analysis, and Knowledge Organization: Exploratory Data Analysis in Empirical Research, eds. M. Schwaiger and O. Opitz, Heidelberg: Springer, pp. 346–356.

Kropf, S. (2000), Hochdimensionale multivariate Verfahren in der medizinischen Statistik, Aachen: Shaker Verlag.

Läuter, J. (1992), Stabile multivariate Verfahren: Diskriminanzanalyse - Regressionsanalyse - Faktoranalyse, Berlin: Akademie Verlag.

Läuter, J., Glimm, E., and Kropf, S. (1998), "Multivariate tests based on left-spherically distributed linear scores," The Annals of Statistics, 26, 1972–1988; correction: 1999, Vol. 27, p. 1441.

Mardin, C. Y., Hothorn, T., Peters, A., Jünemann, A. G., Nguyen, N. X., and Lausen, B. (2003), "New glaucoma classification method based on standard HRT parameters by bagging classification trees," Journal of Glaucoma, 12, 340–346.

Molinaro, A. M., Dudoit, S., and van der Laan, M. J. (2004), "Tree-based multivariate regression and density estimation with right-censored data," Journal of Multivariate Analysis, 90, 154–177.