Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation


Joint work with Christoph Bernau, Caroline Truntzer, Thomas Stadler and Anne-Laure Boulesteix
LMU Munich, Department of Medical Informatics, Biometry and Epidemiology
July 7th, 2014

Modern technologies, most prominently microarrays, enable the measurement of the expression of thousands or many thousands of genes for each unit of investigation. Agglomerating such measurements for units (patients, tissues, ...) which are affected by a disease of interest and for unaffected controls enables the building of prediction rules for the purpose of predicting the status of new units. Due to limited sample sizes and the limited information contained in gene expression, such prediction rules make errors. Our general interest lies in the estimation of the expected error frequency.

Prediction rule in general

Sample: S = {(X_1, Y_1), ..., (X_n, Y_n)} ~ P^n
X_i: (long) vector of (raw!) gene expressions ("covariates")
Y_i: status of patient ("class") (i = 1, ..., n)

Prediction rule fitted on training data S: ĝ_S : X → Y = {1, 2}
Predicted status of a new patient with gene expression vector x ∈ X: ĝ_S(x) = ŷ

Exemplary analysis for obtaining a prediction rule

Data material: 100 DNA microarrays of breast tissues, 50 affected with early stage breast cancer and 50 unaffected

Analysis:
1. Normalization using the RMA method; 47,000 different expression variables
2. t-test based selection of the 500 most informative variables
3. Cross-validation based selection of the optimal cost parameter for the Support Vector Machine (SVM) classification method
4. Fitting the SVM classification method using the optimized cost parameter

Three preliminary steps before fitting the actual classification method - these steps are part of the prediction rule ĝ_S(·).

(Expected) misclassification probability

Misclassification probability of ĝ_S(·):
ε[ĝ_S] := P_{(X,Y) ~ P}[ĝ_S(X) ≠ Y]
Relevant to the medical doctor

Expected misclassification probability when considering samples of size n (following distribution P^n):
ε(n) := E_{S ~ P^n}[ε[ĝ_S]]
Relevant to the statistical methodologist

Motivation for cross-validation

Taking the misclassification rate of ĝ_S(·) on S as an estimator for ε[ĝ_S] would result in an unrealistically small error estimate, because ĝ_S(·) is overly adapted to S ("resubstitution bias").

Building a prediction rule on only a part S_train ⊂ S of the data and estimating ε[ĝ_{S_train}] on the rest S_test = S \ S_train is often very inefficient because of the limited sample size.

This motivates cross-validation, which is an estimator for ε(n_train) := E_{S_train ~ P^{n_train}}[ε[ĝ_{S_train}]] with n_train < n.

Cross-validation (CV) procedure

General idea: perform more than one splitting of the whole data set S into two parts S_train and S_test.

Procedure:
1. Split the data set S into K (approximately) equally sized folds S_1, ..., S_K.
2. For k = 1, ..., K: use the units in S \ S_k for constructing the prediction rule and the units in S_k as test data.
3. Average the misclassification rates out of the K splittings in (2).

Formula of the estimator e_full,K(S):

e_full,K(S) = (1/K) Σ_{k=1}^K (1/#S_k) Σ_{j ∈ {i : (X_i, Y_i) ∈ S_k}} I(ĝ_{S\S_k}(X_j) ≠ Y_j)

In practice: repeat the CV and take the average.
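The estimator e_full,K(S) can be sketched in a few lines of code. The following is a hypothetical numpy illustration, not the authors' implementation; the toy nearest-centroid classifier merely stands in for an arbitrary prediction rule ĝ.

```python
# Minimal sketch of the full K-fold CV estimator e_full,K(S):
# every step of rule construction is repeated from scratch on each
# training part S \ S_k, and fold error rates are averaged.
import numpy as np

def e_full_cv(X, y, fit, predict, K=5, rng=None):
    """Return the K-fold CV misclassification rate."""
    rng = np.random.default_rng(rng)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = fit(X[train_idx], y[train_idx])   # rule built on S \ S_k only
        yhat = predict(model, X[test_idx])        # evaluated on S_k
        errs.append(np.mean(yhat != y[test_idx]))
    return float(np.mean(errs))

# Toy classifier: class centroids, prediction by nearest centroid.
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = np.array(sorted(model))
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return classes[d.argmin(axis=0)]
```

In practice (as the slide notes) the whole procedure would be repeated several times with different random fold assignments and the results averaged.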

Properties of cross-validation (CV)

(Formally) not an estimator for the misclassification probability of a specific prediction rule (i.e. ε[ĝ_S]), but one of the expected misclassification probability for samples of size n_train,K := #{S \ S_k} (k ∈ {1, ..., K}), i.e. ε(n_train,K).

For n_train,K approaching n (big K) its expectation gets increasingly similar to ε[ĝ_S] - interesting for the medical doctor, but unfavourable for the methodologist.

High variance (around ε(n_train,K)).

Incomplete cross-validation

Incomplete CV: one or more preliminary steps performed before CV, so that part of the prediction rule is already constructed on the whole data set.

This violates the training and test set principle of CV. It can lead to severe underestimation of the expected misclassification probability; over-optimistic conclusions are possible.

Incomplete CV is known to be severely downwardly biased in the case of variable selection.

Incomplete cross-validation

The issue is previously unexamined for other preliminary steps in the literature (to our knowledge), although preliminary steps are almost always conducted before CV.

Examples: normalization of gene expression data, imputation of missing values, variable filtering by variance, dichotomization of continuous variables, data-driven determination of powers of fractional polynomials, sophisticated preprocessing steps for imaging data, ...

There is no certainty on the extent of the downward bias of incomplete CV with respect to such steps.

Incomplete cross-validation

Formula of the estimator e_incompl,K(S):

e_incompl,K(S) = (1/K) Σ_{k=1}^K (1/#S_k) Σ_{j ∈ {i : (X_i, Y_i) ∈ S_k}} I(ĝ^S_{S\S_k}(X_j) ≠ Y_j),

where ĝ^S_{S\S_k}(·) denotes the prediction rule obtained when specific steps in its construction on S \ S_k are performed on the whole sample S.
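The bias of e_incompl,K can be made concrete with a small simulation in the spirit of the classic variable-selection example. This is an illustration under simplifying assumptions, not the study's code: a mean-difference score stands in for the t-test, and a nearest-centroid rule for the classifier. With pure noise data the true error is 50%, yet performing selection on the whole sample before CV yields a much smaller estimate.

```python
# Hypothetical demonstration: incomplete CV (variable selection BEFORE
# CV, on the whole sample) versus full CV (selection repeated inside
# each training fold) on labels that are pure noise.
import numpy as np

def select_top(X, y, p_sel):
    # rank variables by absolute difference of class means (t-test surrogate)
    score = np.abs(X[y == 0].mean(0) - X[y == 1].mean(0))
    return np.argsort(score)[-p_sel:]

def cv_error(X, y, K, p_sel, incomplete, rng):
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    if incomplete:
        keep = select_top(X, y, p_sel)              # uses the test folds too
    errs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        if not incomplete:
            keep = select_top(X[tr], y[tr], p_sel)  # training part only
        mu0 = X[np.ix_(tr[y[tr] == 0], keep)].mean(0)
        mu1 = X[np.ix_(tr[y[tr] == 1], keep)].mean(0)
        d0 = np.linalg.norm(X[np.ix_(te, keep)] - mu0, axis=1)
        d1 = np.linalg.norm(X[np.ix_(te, keep)] - mu1, axis=1)
        errs.append(np.mean((d1 < d0) != y[te]))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))       # pure noise predictors ...
y = np.repeat([0, 1], 30)             # ... unrelated to the class labels
e_incompl = cv_error(X, y, K=5, p_sel=10, incomplete=True, rng=rng)
e_full = cv_error(X, y, K=5, p_sel=10, incomplete=False, rng=rng)
# e_incompl typically comes out far below the true 50% error; e_full does not
```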

Incomplete cross-validation

e_incompl,K(S) is a negatively biased estimator for ε(n_train,K), while e_full,K(S) is unbiased for ε(n_train,K).

e_incompl,K(S) is unbiased as an estimator of the following term:

ε_incompl(n_train,K; n) := E_{S ~ P^n}[ P(ĝ^S_{S_train,K}(X_{n_train,K+1}) ≠ Y_{n_train,K+1}) ],

with S_train,K := {(X_1, Y_1), ..., (X_{n_train,K}, Y_{n_train,K})} ⊂ S and (X_{n_train,K+1}, Y_{n_train,K+1}) ∈ S playing the role of an arbitrary test set observation.

Full versus incomplete cross-validation

Full CV: perform all steps for obtaining the prediction rule within CV; estimator e_full,K(S).

Incomplete CV: perform one or more steps before CV on the whole data set; formally wrong; estimator e_incompl,K(S). In general likely: e_incompl,K(S) < ε(n_train,K).

Full CV can be computationally intensive, and add-on procedures to integrate certain steps into CV are often not implemented. BUT: the extent to which e_incompl,K(S) underestimates ε(n_train,K) can be marginal in some cases. Full CV can be avoided in these cases.

A new measure of the impact of CV incompleteness (CVIIM)

A quantitative measure for the degree of bias induced by incomplete CV with respect to specific steps.

Main purpose: spot cases where full CV can generally be avoided, and cases where incomplete CV is especially dangerous.

A straightforward but naive measure would be the difference ε(n_train,K) − ε_incompl(n_train,K; n). However, smaller differences can easily also be due to a smaller ε(n_train,K). We therefore prefer the ratio of the errors.

A new measure of the impact of CV incompleteness (CVIIM)

Our new measure CVIIM (standing for "Cross-Validation Incompleteness Impact Measure") is defined as

CVIIM_{P,n,K} = 1 − ε_incompl(n_train,K; n) / ε(n_train,K)   if ε_incompl(n_train,K; n) < ε(n_train,K) and ε(n_train,K) > 0,
CVIIM_{P,n,K} = 0   otherwise.

CVIIM_{P,n,K} ∈ [0, 1]. Larger values of CVIIM_{P,n,K} are associated with a stronger underestimation of ε(n_train,K).

Interpretation: the relative reduction of the mean estimated error when performing incomplete instead of full CV.

A new measure of the impact of CV incompleteness (CVIIM)

Estimator of CVIIM_{P,n,K}: replace ε(n_train,K) and ε_incompl(n_train,K; n) by their unbiased estimators e_full,K(S) and e_incompl,K(S); denoted as CVIIM_{S,n,K}.

Rules of thumb: CVIIM_{S,n,K} ∈ [0, 0.02]: no bias; ]0.02, 0.1]: weak bias; ]0.1, 0.2]: medium bias; ]0.2, 0.4]: strong bias; ]0.4, 1]: very strong bias.

CVIIM_{P,n,K} depends on the data distribution P. Calculate CVIIM_{S,n,K} for several data sets and average the values to draw conclusions.
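The estimator and the rules of thumb translate directly into code; a minimal sketch (function names are illustrative):

```python
# CVIIM_{S,n,K} from the two CV error estimates, plus the slide's
# rule-of-thumb bias categories.
def cviim(e_full, e_incompl):
    """Estimated CVIIM_{S,n,K} from e_full,K(S) and e_incompl,K(S)."""
    if e_incompl < e_full and e_full > 0:
        return 1.0 - e_incompl / e_full
    return 0.0

def bias_label(v):
    # interval upper bounds as given on the slide: ]a, b] categories
    for bound, label in [(0.02, "no bias"), (0.1, "weak bias"),
                         (0.2, "medium bias"), (0.4, "strong bias"),
                         (1.0, "very strong bias")]:
        if v <= bound:
            return label

print(bias_label(cviim(0.30, 0.21)))  # CVIIM ≈ 0.3, i.e. "strong bias"
```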

Add-on procedures for preliminary steps

In general: for the purpose of prediction, each step performed for obtaining a prediction rule has to be done on new units as well.

Naive approach: 1) pool the training data with the new units, 2) re-perform all preliminary steps, 3) fit the classification method anew on the training data. But: a) this is impossible for steps whose fitting requires the target variable; b) a prediction rule is commonly kept fixed.

All steps have to be integrated into the constructed prediction rule ĝ_S(·).

Add-on procedures for preliminary steps

Add-on procedure: new units are made subject to exactly the same procedure as those in the training data, but the new units are not involved in the adaptation of the procedure to the training data.

Example - add-on procedure for variable selection: the same variables are chosen for new units as for those in the training data, but only the training data is used to determine which variables are used.

Various real-life data sets, mostly gene-expression data.

Investigated preliminary steps: 1) variable selection, 2) variable filtering by variance, 3) choice of tuning parameters for various classification methods, 4) imputation using a variant of k-nearest-neighbors, 5) normalization with the RMA method.

In each case the CVs were repeated 300 times and the results averaged. Splitting ratios between the sizes of the training and test sets: 2:1 (3-fold CV), 4:1 (5-fold CV) and 9:1 (10-fold CV).

Overview of used data sets

Name                number of   number of   % diseased   type of          disease
                    samples     variables                variables
ProstatecTranscr    102         12,625      51%          transcriptomic   prostate cancer
HeadNeckcTranscr     50         22,011      50%          transcriptomic   head and neck squamous
LungcTranscr        100         22,277      49%          transcriptomic   lung adenocarcinoma
SLETranscr           36         47,231      56%          transcriptomic   systemic lupus erythematosus
GenitInfCoww0        51             21      71%          various          genital infection in cows
GenitInfCoww1        51             24      71%          various          genital infection in cows
GenitInfCoww2        51             27      71%          various          genital infection in cows
GenitInfCoww3        51             26      71%          various          genital infection in cows
GenitInfCoww4        51             27      71%          various          genital infection in cows
ProstatecMethyl      70            222      41%          methylation      prostate cancer
ColoncTranscr        47         22,283      53%          transcriptomic   colon cancer
WilmsTumorTranscr   100         22,283      42%          transcriptomic   Wilms tumor

Investigated steps: variable selection with add-on procedure

Procedure:
1. For each variable: a two-sample t-test with groups according to the target variable.
2. Select the p_sel variables with the smallest p-values out of the t-tests.

Considered values for the number of selected variables p_sel: 5, 10, 20 and half of the total number p of variables.

Classification methods: Diagonal Linear Discriminant Analysis (DLDA) for p_sel = p/2, otherwise Linear Discriminant Analysis (LDA).

Add-on procedure: use only the variables which were selected on the training data.
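The add-on logic for this step can be sketched as follows. This is a hedged numpy-only illustration (a Welch t statistic stands in for the t-test routine; function names like `fit_selector` are hypothetical, not from the authors' software):

```python
# Add-on variable selection: the selected variable indices are
# determined on the training data only and then applied, fixed,
# to any new unit.
import numpy as np

def welch_t(X, y):
    # two-sample Welch t statistic per variable
    X0, X1 = X[y == 0], X[y == 1]
    se = np.sqrt(X0.var(0, ddof=1) / len(X0) + X1.var(0, ddof=1) / len(X1))
    return (X0.mean(0) - X1.mean(0)) / se

def fit_selector(X_train, y_train, p_sel):
    t = np.abs(welch_t(X_train, y_train))
    return np.sort(np.argsort(t)[-p_sel:])   # indices become part of g_S

def addon_apply(selected, X_new):
    return X_new[:, selected]                # same variables for new units
```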

Investigated steps: variable filtering by variance with add-on procedure

Procedure: calculate the empirical variance of every variable and select the p/2 variables with the largest variances.

Classification method: DLDA.

Add-on procedure: as with the t-test based variable selection, use only the variables selected on the training data.

Investigated steps: optimization of tuning parameters with add-on procedure

Optimization of tuning parameters on a grid for seven different classification methods:
1. number of iterations m_stop in componentwise boosting with logistic loss function
2. number of neighbors in the k-nearest-neighbors algorithm
3. shrinkage intensity in the Lasso
4. shrinkage intensity for the class centroids in Nearest Shrunken Centroids
5. number of components in Linear Discriminant Analysis on Partial Least Squares components
6. number mtry of variables randomly sampled as candidates at each split in Random Forests
7. cost parameter in Support Vector Machines with linear kernel

Investigated steps: optimization of tuning parameters with add-on procedure

Procedure: for each candidate value of the tuning parameter on a respective grid, 3-fold CV (i.e. K = 3) of the classifier is performed using this value of the tuning parameter. The value yielding the smallest 3-fold CV error is selected.

Add-on procedure: the tuning parameter value chosen on the training data is used.
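This inner 3-fold CV tuning loop can be sketched as follows, using k-nearest-neighbors (method 2 on the previous slide) as the classifier. A hedged illustration with hypothetical function names, not the study's implementation:

```python
# Tuning by inner 3-fold CV on the training data only: each candidate k
# is scored over the same folds, and the winner becomes part of the
# prediction rule (the add-on procedure reuses this value unchanged).
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return (y_tr[nn].mean(axis=1) > 0.5).astype(int)   # majority vote (0/1)

def tune_k(X, y, grid=(1, 3, 5, 7), K=3, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    best_k, best_err = None, np.inf
    for k in grid:
        errs = []
        for te in folds:
            tr = np.setdiff1d(np.arange(len(y)), te)
            errs.append(np.mean(knn_predict(X[tr], y[tr], X[te], k) != y[te]))
        if np.mean(errs) < best_err:
            best_k, best_err = k, float(np.mean(errs))
    return best_k
```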

Investigated steps: imputation of missing values with add-on procedure

Procedure: k-nearest-neighbors imputation with standardization during the imputation; tuning of k using 3-fold CV in an analogous way as described before.

Classification method: Nearest Shrunken Centroids for the high-dimensional data set (of those considered for this step) and Random Forests for the other data sets.

Add-on procedure: for the standardization, use means and standard deviations estimated from the training data. Search the k nearest neighbours only in the training data.
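The add-on standardization described here is simply "fit on training data, apply fixed to new units"; a minimal sketch with hypothetical function names:

```python
# Add-on standardization: column means and standard deviations come
# from the training data only and are then reused, unchanged, for
# new units during imputation.
import numpy as np

def fit_standardizer(X_train):
    return X_train.mean(axis=0), X_train.std(axis=0, ddof=1)

def addon_standardize(params, X_new):
    mu, sd = params
    return (X_new - mu) / sd     # new units never influence mu or sd
```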

Investigated steps: Robust Multi-array Average (RMA) normalization with add-on procedure

Purpose: remove technical artefacts in raw microarray data and summarize the multiple measurements done for each variable value.

Three steps: 1) background correction, 2) quantile normalization, 3) summarization.

Classification method: Nearest Shrunken Centroids.

Add-on procedure: steps 1) and 3) are performed array by array; only 2) requires an add-on strategy. Use quantiles from the training data to perform quantile normalization for new samples (Kostka and Spang, 2008).
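The add-on strategy for step 2) amounts to freezing the reference quantiles learned from the training arrays and imposing them on each new array individually. A simplified numpy sketch under stated assumptions (no ties, probe-level details ignored; not the actual RMA code):

```python
# Add-on quantile normalization: the reference profile is the mean of
# the sorted training arrays; a new array keeps its ranks but takes
# its values from that fixed reference.
import numpy as np

def fit_reference_quantiles(X_train):
    # X_train: one array per row; average of the sorted expression profiles
    return np.sort(X_train, axis=1).mean(axis=0)

def addon_quantile_normalize(ref, x_new):
    ranks = np.argsort(np.argsort(x_new))   # rank of each value in x_new
    return ref[ranks]                       # replace values by ref quantiles
```

After normalization a new array has exactly the reference distribution while preserving the ordering of its own values.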

Results: variable selection and variable filtering by variance

[Figure: CVIIM_{S,n,K} values (y-axis, 0 to 1) for variable selection with p_sel = 5, 10, 20 and p_sel = p/2, and for filtering by variance, each for K = 3, 5, 10, on the data sets ProstatecTranscr, HeadNeckcTranscr, LungcTranscr and SLETranscr.]

Results: optimization of tuning parameters

[Figure: CVIIM_{S,n,K} values (y-axis, -0.2 to 0.2) for the seven classification methods (Boosting, kNN, Lasso, NSC, PLS LDA, Random Forest, SVM), each for K = 3, 5, 10, on the data sets ProstatecTranscr, HeadNeckcTranscr, LungcTranscr and SLETranscr.]

Results: imputation of missing values and RMA normalization

[Figure: CVIIM_{S,n,K} values (y-axis, -0.15 to 0.10) for imputation (data sets GenitInfCoww0 to GenitInfCoww4 and ProstatecMethyl) and for RMA normalization (ColoncTranscr, WilmsTumorTranscr), each for K = 3, 5, 10.]

Considered data preparation step: variable selection.

Simulated data: n ∈ {50, 100}, 2000 correlated normally distributed predictors, two different signal strengths.

Main results:
1. Relatively high variance of CVIIM_{S,n,K} - lower for smaller CVIIM_{P,n,K} values.
2. Negligible bias with respect to the true measure values.
3. The choice K = 3 might be preferable over larger values - smaller variance and a better assessment of the variance achievable.

Conclusion

Data preparation steps are very often done before CV: a violation of the separation of training and test data, so over-optimistic conclusions are possible.

The impact is very different for different steps - in our analyses greater for steps taking the target variable into account (variable selection and tuning) - but this is not necessarily the case. CVIIM can be used to assess this impact.

Constantly arising new types of molecular data will require specialized data preparation steps, for which the impact of CV incompleteness will have to be assessed.

Thank you for your attention!

References

Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences 99, 6562-6566.

Kostka, D. and Spang, R. (2008). Microarray based diagnosis profits from better documentation of gene expression signatures. PLoS Computational Biology 4, e22.

Simon, R., Radmacher, M. D., Dobbin, K., and McShane, L. M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95, 14-18.

Varma, S. and Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7, 91.

Technical report available online: Hornung, R., Bernau, C., Truntzer, C., Stadler, T., and Boulesteix, A.-L. (2014). Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation. Department of Statistics, LMU, Technical Report 159.