Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power


1 Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power. Thomas Alexander Gerds, Department of Biostatistics, University of Copenhagen. 30 May

2 Outline: * Introduction * Parameters of interest * Estimation methods * Simulation results * Guidelines & pitfalls * Summary

3 This talk is not about - how to deal with censored data - how to deal with competing risks - the choice of the prediction metric - the choice of the modelling algorithm. This talk is about methods for internal validation of risk prediction models based on repeated data splitting. [Introduction]

4 The estimation problem: a mission impossible. A statistical risk prediction model which only works in its own training data is practically useless. Aim: to estimate how well the model generalizes to new data, i.e. how it will perform in *yet unseen patients*. Dilemma: there are no new data! [Introduction]

5 For the purpose of illustration. The Copenhagen stroke study: ten years of follow-up of 518 patients after stroke until death. Cox regression:

Factor          Unit            Hazard ratio  95% CI        P-value
Age             per year        1.06          [1.04;1.07]   <
Sex             female vs male  1.47          [1.18;1.83]
Hypertension    no vs yes       1.24          [1.00;1.53]
Stroke history  no vs yes       1.27          [0.98;1.63]
Other disease   no vs yes       1.13          [0.87;1.47]   0.35
Alcohol         no vs yes       0.91          [0.72;1.16]
Diabetes        no vs yes       1.38          [1.04;1.81]
Smoking         no vs yes       1.30          [1.04;1.62]
Stroke score    scale           0.98          [0.97;0.98]   <
Cholesterol     mg/dl           1.00          [0.92;1.08]

In what follows: simulate data that are "alike" the real data, based on Weibull regression models. [Introduction]
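The "alike" simulation idea can be sketched with a Weibull proportional-hazards model; the shape, scale, and age coefficient below are hypothetical illustration values, not the parameters fitted to the stroke data.

```python
import math
import random

random.seed(1)

def simulate_weibull_ph(n, shape=1.2, scale=8.0, beta_age=0.05):
    """Draw (age, event time) pairs from a Weibull proportional-hazards model.

    Survival function: S(t | x) = exp(-(t/scale)^shape * exp(lp)), so by
    inverse transform sampling T = scale * (-log(U) / exp(lp))**(1/shape).
    """
    data = []
    for _ in range(n):
        age = random.uniform(40, 90)
        lp = beta_age * (age - 65)      # linear predictor, centered at age 65
        u = random.random()
        t = scale * (-math.log(u) / math.exp(lp)) ** (1 / shape)
        data.append((age, t))
    return data

sample = simulate_weibull_ph(500)
```

A censoring time would be drawn the same way from a second Weibull model and the observed time taken as the minimum of the two.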

6 Notation (part I).
Predictors: X ∈ ℝ^p (age, sex, diabetes, stroke score, ...)
Outcome at time t: Y(t) ∈ {0,1}, 0 = survived, 1 = dead
Risk prediction model: (X, t) ↦ [0,1], R̂_n(t|X) ≈ P(Y(t) = 1 | X)
Performance metric: 1 − E[{Y(t) − R̂_n(t|X)}²] [Introduction]
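The performance metric above is one minus the expected squared prediction error (the Brier score). A minimal sketch with made-up outcomes and risks:

```python
def brier_score(y, risk):
    """Mean squared difference between binary outcomes and predicted risks."""
    assert len(y) == len(risk)
    return sum((yi - ri) ** 2 for yi, ri in zip(y, risk)) / len(y)

# Toy data: status Y(t) at time t (1 = dead) and predicted risks (hypothetical).
y = [0, 1, 1, 0, 1]
risk = [0.2, 0.8, 0.6, 0.3, 0.9]

bs = brier_score(y, risk)
performance = 1 - bs     # the slide's performance metric
```

A coin flipper (risk 0.5 for everyone) has Brier score 0.25, which is the benchmark a useful model must beat.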

7 Fundamental idea. Data splitting is very intuitive: we hide one part of the data, learn on the rest, and then check our knowledge on what was hidden. There is a hidden parameter here: how much we hide and how much we show. [Introduction]

8 Sketch: 3-fold CV. [Figure] [Introduction]

9 Effect of learning sample size on predictions. [Figure: 5-year survival probabilities for three patients (nr. 17, 18, 19) from the Copenhagen stroke study (n=518, p=11, n.event=404), Cox regression with B=100 repetitions, plotted against the percentage of hidden data, up to leave-one-out.] [Introduction]

10 Effect of test set size on model performance. [Figure: performance of 5-year predictions for Cox regression (Copenhagen stroke study) and for a coin flipper, plotted against the size of the simulated validation sample.] [Introduction]

11 Effect of split-ratio. No matter how often we split the data: - The smaller the size of the learning samples, the higher the variability of the risk predictions. - The smaller the size of the validation samples, the higher the variability of the estimate of model performance. *Dilemma* *Tradeoff* *#%&Grr!!?! Wait a minute: what do we want to estimate? [Introduction]

14 Dietterich (1998): a frequently applied strategy is to convert Question 2 into Question 6. [Parameters of interest]

15 Notation (part II).
The one and only data set: D_n = {(Y_1(t), X_1), ..., (Y_n(t), X_n)}
The available risk prediction model: R̂_n = R(D_n) [trained in this data set]
The average model at n: r_n = E_{D_n} R(D_n) [trained in a data set of size n]
Note: the function R is the model selection algorithm. [Parameters of interest]

16 Definition:
a) Conditional performance at D_n: E_{Y,X}[{Y(t) − R̂_n(t|X)}² | D_n]
b) Expected performance at sample size n: E_{D_n}(E_{Y,X}[{Y(t) − R̂_n(t|X)}² | D_n])
Efron & Tibshirani (1997): "Note, however, that although the conditional error rate is often what we would like to obtain, none of the methods correlates very well with it on a sample by sample basis." [Parameters of interest]

18 Decomposition of the expected prediction performance:
E_{D_n}(E_{Y,X}[{Y(t) − R̂_n(t|X)}² | D_n])
  = E_{X,Y}[{Y(t) − r_n(t|X)}²]                                  (model accuracy)
  + E_{D_n} E_X[{R̂_n(t|X) − r_n(t|X)}²]                          (model uncertainty)
  − 2 E_{X,Y}[{Y(t) − r_n(t|X)} E_{D_n}{R̂_n(t|X) − r_n(t|X)}]    (= 0, since E_{D_n} R̂_n = r_n)
Note: the model accuracy is the conditional error of the average model r_n = E_{D_n} R(D_n) at size n. [Parameters of interest]

19 Learning curve. [Figure: true performance versus learning sample size, comparing the data generating model, a useless model, an overfitting model, and "Marty McFly"; both the average performance across learning sets and the conditional performance for a single learning set are shown.] [Parameters of interest]

20 Overview. - Cross-validation: + leave-one-out (LOOCV) + k-fold (repeated B times) + leave-k-out (repeated random sub-sampling). - Bootstrap: + optimism corrected bootstrap [Efron -> Harrell] + bootstrap cross-validation + leave-one-out bootstrap [Efron & Tibshirani 1997] + adjusted bootstrap [Jiang & Simon 2007]. Note: all estimates are practically subject to Monte-Carlo variation (except perhaps LOOCV). [Estimation methods]

21 Leave-one-out cross-validation. Denote R̂_n^{−i} = R(D_n \ {(X_i, Y_i)}) for the model trained in the data without subject i:
LOOCV = (1/n) Σ_{i ∈ D_n} {Y_i(t) − R̂_n^{−i}(t|X_i)}²
Advantages: - We expect R(D_n)(t|x) = R̂_n(t|x) ≈ R̂_n^{−i}(t|x). - The result does not depend on a random seed. Disadvantages: - The whole modelling strategy has to be applied n times. - Bias? Variance? For which parameter? [Estimation methods]

23 K-fold cross-validation. Split the data into K disjoint subsets D_n^1, ..., D_n^K of approximately equal size and denote R̂_n^{−k} = R(D_n \ D_n^k):
CV(K) = (1/n) Σ_{k=1}^{K} Σ_{i ∈ D_n^k} {Y_i(t) − R̂_n^{−k}(t|X_i)}²
Advantages: - CV(10) is a frequently used procedure. - Useful if model selection is time consuming. - Has (asymptotic) oracle properties!? Disadvantages: - Usually high Monte-Carlo variation. - Negative bias for the performance at n. [Estimation methods]
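A minimal sketch of CV(K), assuming squared prediction error and a deliberately trivial stand-in model (predict the training mean of the outcome) in place of a Cox regression; LOOCV is the special case K = n.

```python
import random

def kfold_cv(data, K, fit, predict, seed=0):
    """CV(K): split the data into K folds, train on the other K-1 folds,
    and average the squared error over the held-out observations."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]        # K disjoint, near-equal subsets
    total = 0.0
    for k in range(K):
        held_out = set(folds[k])
        train = [data[i] for i in idx if i not in held_out]
        model = fit(train)
        total += sum((data[i][1] - predict(model, data[i][0])) ** 2
                     for i in folds[k])
    return total / len(data)

# Toy "model": ignore x, predict the training mean of y (hypothetical stand-in).
fit = lambda train: sum(y for _, y in train) / len(train)
predict = lambda model, x: model

data = [(x, x % 2) for x in range(20)]           # (covariate, binary outcome)
cv10 = kfold_cv(data, K=10, fit=fit, predict=predict)
```

Repeating the call with different seeds and averaging the results is the "repeated B times" variant from the overview slide, which reduces the Monte-Carlo variation.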

25 Bootstrap cross-validation. 1. Draw B bootstrap training data sets D_b^train from D_n, either with replacement of size m = n or without replacement of size m < n. 2. Fit the model in each training set: R̂_b^train = R(D_b^train). 3. Use the left-out data to compute the performance, then average:
BootCV = (1/B) Σ_{b=1}^{B} (1/n_b) Σ_{i ∉ D_b^train} {Y_i(t) − R̂_b^train(t|X_i)}²
where n_b is the number of subjects left out of D_b^train. [Estimation methods]
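The three steps above can be sketched as follows, again with a trivial stand-in model (predict the training mean) and drawing with replacement of size m = n:

```python
import random

def boot_cv(data, B, fit, predict, seed=0):
    """Bootstrap cross-validation: train on a bootstrap sample, score the
    squared error on the observations left out of that sample, average over B."""
    rng = random.Random(seed)
    n = len(data)
    errs = []
    for _ in range(B):
        in_bag = [rng.randrange(n) for _ in range(n)]   # with replacement, m = n
        out = [i for i in range(n) if i not in set(in_bag)]
        if not out:                                     # rare: no one was left out
            continue
        model = fit([data[i] for i in in_bag])
        err = sum((data[i][1] - predict(model, data[i][0])) ** 2
                  for i in out) / len(out)
        errs.append(err)
    return sum(errs) / len(errs)

fit = lambda train: sum(y for _, y in train) / len(train)   # toy model
predict = lambda model, x: model
data = [(x, x % 2) for x in range(20)]
bootcv = boot_cv(data, B=100, fit=fit, predict=predict)
```

On average about 36.8% of the subjects are left out of each bootstrap sample, so every model is validated on roughly a third of the data.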

26 Leave-one-out bootstrap. 1. Draw B bootstrap training data sets D_b^train from D_n, either with replacement of size m = n or without replacement of size m < n. 2. Fit the model in each training set: R̂_b^train = R(D_b^train). 3. For each subject, average over the models trained without that subject:
LOOBOOT = (1/n) Σ_{i=1}^{n} (1/K_i) Σ_{b: i ∉ D_b^train} {Y_i(t) − R̂_b^train(t|X_i)}²
where K_i is the number of bootstrap training sets not containing subject i. [Estimation methods]
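The difference from BootCV is the order of averaging: first over the K_i out-of-bag models for each subject, then over subjects. A sketch under the same toy-model assumptions as before:

```python
import random

def loo_boot(data, B, fit, predict, seed=0):
    """Leave-one-out bootstrap: each subject i is scored only against the
    bootstrap models whose training sample did not contain i, and the
    per-subject averages are then averaged over all n subjects."""
    rng = random.Random(seed)
    n = len(data)
    models = []
    oob = [[] for _ in range(n)]        # indices b with i not in D_b^train
    for b in range(B):
        draw = [rng.randrange(n) for _ in range(n)]   # with replacement, m = n
        models.append(fit([data[i] for i in draw]))
        in_bag = set(draw)
        for i in range(n):
            if i not in in_bag:
                oob[i].append(b)
    total = 0.0
    for i, (x, y) in enumerate(data):
        if not oob[i]:                  # K_i = 0: subject was in every sample
            continue
        errs = [(y - predict(models[b], x)) ** 2 for b in oob[i]]
        total += sum(errs) / len(errs)  # inner average over the K_i models
    return total / n

fit = lambda train: sum(y for _, y in train) / len(train)   # toy model
predict = lambda model, x: model
data = [(x, x % 2) for x in range(20)]
looboot = loo_boot(data, B=100, fit=fit, predict=predict)
```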

27 Leave-one-out bootstrap. Advantages: - includes more model variability - is less variable than LOOCV. Disadvantages: - assesses the expected performance rather than the conditional performance - underestimates the performance at n - depends on the random seed unless B is large. Notes: - If B is small then LOOBOOT is preferable to BootCV. - The bias depends on the slope of the learning curve. [Estimation methods]

28 Decomposition of the leave-one-out bootstrap:
LOOBOOT = (1/n) Σ_{i=1}^{n} {Y_i(t) − r̄_{K_i}^train(t|X_i)}²                                      (estimated model accuracy)
        + (1/n) Σ_{i=1}^{n} (1/K_i) Σ_{b: i ∉ D_b^train} {R̂_b^train(t|X_i) − r̄_{K_i}^train(t|X_i)}²   (estimated model uncertainty)
Here r̄_{K_i}^train(t|X_i) = (1/K_i) Σ_{b: i ∉ D_b^train} R̂_b^train(t|X_i) is the average prediction at X_i of the bootstrap training models which did not include subject i. [Estimation methods]

29 Apparent performance. The apparent performance (aka re-substitution performance) is obtained by validating the model in its own training data:
App = (1/n) Σ_{i ∈ D_n} {Y_i(t) − R̂_n(t|X_i)}²
Disadvantage: overestimates the prediction performance. [Estimation methods]

30 Optimism corrected bootstrap. Two problems: 1. The bootstrap cross-validation estimate underestimates the expected performance. 2. The apparent performance overestimates the expected performance. A compromise: BootCV + ω (App − BootCV). Remaining question: how to choose ω? [Estimation methods]

31 The .632+ bootstrap estimate. Efron & Tibshirani: define the relative overfit
R̂ = (BootCV − App) / (NoInf − App)
the weight ω̂ = .632 / (1 − .368 R̂), and
Boot632+ = (1 − ω̂) App + ω̂ BootCV
The no-information performance assesses the overfitting by permutation:
NoInf = (1/n²) Σ_{j=1}^{n} Σ_{i=1}^{n} {Y_i(t) − R̂_n(t|X_j)}² [Estimation methods]

32 Design. For different sample sizes we simulate data that are "alike" the Copenhagen stroke study data, based on parametric models for survival and censoring. 1. In each sample we fit a Cox model after automated backward elimination. 2. We generate a huge independent test set ( records) to compute the conditional performance. 3. In each sample we compute LOOCV, App and, using 1000 bootstrap samples, also BootCV and the .632+. Steps 1-3 are repeated 360 times. [Simulation results]

33 Cost study based simulation results. [Figure: learning curve and variation of the conditional performance (45% to 65%) as a function of the learning sample size.] [Simulation results]

34 Apparent performance. [Figure: apparent performance estimates (45% to 65%) as a function of the learning sample size.] [Simulation results]

35 LOOCV versus Bootstrap. [Figure: LOOCV and bootstrap estimates as a function of the learning sample size.] [Simulation results]

36 BootCV: hiding (subsampling) different portions. [Figure: BootCV estimates with subsample sizes for learning of 50%, 36.8%, 20% and 10%, as a function of the learning sample size.] [Simulation results]

37 Next. Guidelines & pitfalls: - comparison of risk prediction models does not always work - the superlearner - practical hints. [Guidelines & pitfalls]

38 Comparison of risk prediction models. We want to assess if R̂_n^(2) has significantly better prediction performance than R̂_n^(1). Define paired residual differences:
Δ_i(t) = {Y_i(t) − R̂_n^(1)(t|X_i)}² − {Y_i(t) − R̂_n^(2)(t|X_i)}²
van de Wiel (2009) proposed a statistical test of H_0: F(δ) + F(−δ) = 1 for all δ, where Δ_i ~ F for fixed training and test set. 1. Use a paired test in each split, and report the median p-value. 2. Requires equal size of the validation sets! 3. There are some unsolved issues in right censored data. [Guidelines & pitfalls]
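The split-and-test recipe can be sketched as follows. An exact sign test serves here as a simple stand-in for van de Wiel's test, and the predictions are precomputed rather than refitted per split, which is a simplification of the actual procedure:

```python
import math
import random
import statistics

def sign_test_p(deltas):
    """Exact two-sided sign test for median(delta) = 0 (hypothetical
    stand-in for the test applied to the paired residual differences)."""
    pos = sum(d > 0 for d in deltas)
    m = sum(d != 0 for d in deltas)
    k = min(pos, m - pos)
    tail = sum(math.comb(m, j) for j in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)

def median_p_over_splits(y, r1, r2, n_splits=51, test_frac=0.5, seed=0):
    """Repeat the data split, apply the paired test in each validation set
    (all of equal size), and report the median p-value."""
    rng = random.Random(seed)
    n = len(y)
    pvals = []
    for _ in range(n_splits):
        test = rng.sample(range(n), int(n * test_frac))
        deltas = [(y[i] - r1[i]) ** 2 - (y[i] - r2[i]) ** 2 for i in test]
        pvals.append(sign_test_p(deltas))
    return statistics.median(pvals)

# Toy example: model 2 tracks the truth, model 1 is uninformative.
y = [i % 2 for i in range(40)]
r1 = [0.5] * 40
r2 = [0.9 if yi else 0.1 for yi in y]
mp = median_p_over_splits(y, r1, r2)
```

An odd number of splits guarantees that the reported median is itself one of the computed p-values.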

41 The .632+ bootstrap does not seem to work for random forest. [Figure: prediction error curves over time for Cox regression (COX) and random forest (RF) in the Copenhagen stroke study, estimation method BootCV.] [Guidelines & pitfalls]

42 The .632+ bootstrap does not seem to work for random forest. [Figure: prediction error curves over time for COX and RF in the Copenhagen stroke study, estimation methods BootCV and App.] [Guidelines & pitfalls]

43 The .632+ bootstrap likes random forest. [Figure: prediction error curves over time for COX and RF in the Copenhagen stroke study, estimation methods BootCV and App.] [Guidelines & pitfalls]

44 The SuperLearner has oracle properties (van der Laan et al. 2007). Validation? Interpretation? Risk of an endless loop! [Guidelines & pitfalls]

45 Finally: some practical hints. - Do many splits (or repeat cross-validation several times) to avoid Monte-Carlo variation. - Repeat all model specification steps in each split. - Use the same splits to compare modelling strategies. - Prefer LOOBOOT over BootCV when model fitting is slow. - Do not sample with replacement: + if the learning algorithm uses cross-validation to select a hyperparameter + in high dimensions (see Binder & Schumacher). - UseR! my packages: + pec (continuous, survival, competing risks) + ModelGood (binary). [Guidelines & pitfalls]

46 Citations. RJ Hyndman (blog): "Every statistician knows that the model fit statistics are not a good guide to how well a model will predict." J Shao (1993): "Using the LOOCV method can be compared to using a telescope to see some objects (i.e. different models) 10,000 meters away, whereas using the BOOTCV method is more like using the same telescope to see the same objects only 100 meters away." Efron & Tibshirani (1997): "The same set of bootstrap replications that gives a point estimate of prediction error can also be used to assess the variability of that estimate." [Summary]

49 Last slide. [Summary]

50 References.
1. Shao (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88.
2. Dietterich (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7).
3. Efron & Tibshirani (1997). Improvement on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92.
4. van der Laan, Polley & Hubbard (2008). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6.
5. Binder & Schumacher (2008). Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples. Statistical Applications in Genetics and Molecular Biology, 7.
6. van de Wiel, Berkhof & van Wieringen (2009). Testing the prediction error difference between 2 predictors. Biostatistics, 10.
7. Mogensen, Ishwaran & Gerds (2012). Evaluating random forests for survival analysis using prediction error curves. Journal of Statistical Software, 50(11). [Summary]


Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization

More information

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback University of South Carolina Scholar Commons Theses and Dissertations 2017 Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback Yanan Zhang University of South Carolina Follow

More information

How do we compare the relative performance among competing models?

How do we compare the relative performance among competing models? How do we compare the relative performance among competing models? 1 Comparing Data Mining Methods Frequent problem: we want to know which of the two learning techniques is better How to reliably say Model

More information

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements [Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers

More information

Decision Trees. Tirgul 5

Decision Trees. Tirgul 5 Decision Trees Tirgul 5 Using Decision Trees It could be difficult to decide which pet is right for you. We ll find a nice algorithm to help us decide what to choose without having to think about it. 2

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

Analysis of MALDI-TOF Data: from Data Preprocessing to Model Validation for Survival Outcome

Analysis of MALDI-TOF Data: from Data Preprocessing to Model Validation for Survival Outcome Analysis of MALDI-TOF Data: from Data Preprocessing to Model Validation for Survival Outcome Heidi Chen, Ph.D. Cancer Biostatistics Center Vanderbilt University School of Medicine March 20, 2009 Outline

More information

Look before you leap: Some insights into learner evaluation with cross-validation

Look before you leap: Some insights into learner evaluation with cross-validation Look before you leap: Some insights into learner evaluation with cross-validation Gitte Vanwinckelen and Hendrik Blockeel Department of Computer Science, KU Leuven, Belgium, {gitte.vanwinckelen,hendrik.blockeel}@cs.kuleuven.be

More information

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation Patrick J. Heagerty PhD Department of Biostatistics University of Washington 166 ISCB 2010 Session Four Outline Examples

More information

Introduction to Machine Learning and Cross-Validation

Introduction to Machine Learning and Cross-Validation Introduction to Machine Learning and Cross-Validation Jonathan Hersh 1 February 27, 2019 J.Hersh (Chapman ) Intro & CV February 27, 2019 1 / 29 Plan 1 Introduction 2 Preliminary Terminology 3 Bias-Variance

More information

WALD LECTURE II LOOKING INSIDE THE BLACK BOX. Leo Breiman UCB Statistics

WALD LECTURE II LOOKING INSIDE THE BLACK BOX. Leo Breiman UCB Statistics 1 WALD LECTURE II LOOKING INSIDE THE BLACK BOX Leo Breiman UCB Statistics leo@stat.berkeley.edu ORIGIN OF BLACK BOXES 2 Statistics uses data to explore problems. Think of the data as being generated by

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Probability and Statistical Decision Theory

Probability and Statistical Decision Theory Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Probability and Statistical Decision Theory Many slides attributable to: Erik Sudderth (UCI) Prof. Mike Hughes

More information

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity P. Richard Hahn, Jared Murray, and Carlos Carvalho June 22, 2017 The problem setting We want to estimate

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Estimating Explained Variation of a Latent Scale Dependent Variable Underlying a Binary Indicator of Event Occurrence

Estimating Explained Variation of a Latent Scale Dependent Variable Underlying a Binary Indicator of Event Occurrence International Journal of Statistics and Probability; Vol. 4, No. 1; 2015 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education Estimating Explained Variation of a Latent

More information

Statistical Inference for Data Adaptive Target Parameters

Statistical Inference for Data Adaptive Target Parameters Statistical Inference for Data Adaptive Target Parameters Mark van der Laan, Alan Hubbard Division of Biostatistics, UC Berkeley December 13, 2013 Mark van der Laan, Alan Hubbard ( Division of Biostatistics,

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Known unknowns : using multiple imputation to fill in the blanks for missing data

Known unknowns : using multiple imputation to fill in the blanks for missing data Known unknowns : using multiple imputation to fill in the blanks for missing data James Stanley Department of Public Health University of Otago, Wellington james.stanley@otago.ac.nz Acknowledgments Cancer

More information

Train the model with a subset of the data. Test the model on the remaining data (the validation set) What data to choose for training vs. test?

Train the model with a subset of the data. Test the model on the remaining data (the validation set) What data to choose for training vs. test? Train the model with a subset of the data Test the model on the remaining data (the validation set) What data to choose for training vs. test? In a time-series dimension, it is natural to hold out the

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

An Empirical Study of Building Compact Ensembles

An Empirical Study of Building Compact Ensembles An Empirical Study of Building Compact Ensembles Huan Liu, Amit Mandvikar, and Jigar Mody Computer Science & Engineering Arizona State University Tempe, AZ 85281 {huan.liu,amitm,jigar.mody}@asu.edu Abstract.

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Estimating Optimal Dynamic Treatment Regimes from Clustered Data

Estimating Optimal Dynamic Treatment Regimes from Clustered Data Estimating Optimal Dynamic Treatment Regimes from Clustered Data Bibhas Chakraborty Department of Biostatistics, Columbia University bc2425@columbia.edu Society for Clinical Trials Annual Meetings Boston,

More information

Part III Measures of Classification Accuracy for the Prediction of Survival Times

Part III Measures of Classification Accuracy for the Prediction of Survival Times Part III Measures of Classification Accuracy for the Prediction of Survival Times Patrick J Heagerty PhD Department of Biostatistics University of Washington 102 ISCB 2010 Session Three Outline Examples

More information

BAGGING PREDICTORS AND RANDOM FOREST

BAGGING PREDICTORS AND RANDOM FOREST BAGGING PREDICTORS AND RANDOM FOREST DANA KANER M.SC. SEMINAR IN STATISTICS, MAY 2017 BAGIGNG PREDICTORS / LEO BREIMAN, 1996 RANDOM FORESTS / LEO BREIMAN, 2001 THE ELEMENTS OF STATISTICAL LEARNING (CHAPTERS

More information

Decision Trees & Random Forests

Decision Trees & Random Forests Decision Trees & Random Forests BUGS Meeting Daniel Pimentel-Alarcón Computer Science, GSU Decision Trees Goal: Predict Will I get El Cáncer? Will I develop Diabetes? Is my boyfriend/girlfriend cheating

More information

Multi-state models: prediction

Multi-state models: prediction Department of Medical Statistics and Bioinformatics Leiden University Medical Center Course on advanced survival analysis, Copenhagen Outline Prediction Theory Aalen-Johansen Computational aspects Applications

More information

Linear Regression 1 / 25. Karl Stratos. June 18, 2018

Linear Regression 1 / 25. Karl Stratos. June 18, 2018 Linear Regression Karl Stratos June 18, 2018 1 / 25 The Regression Problem Problem. Find a desired input-output mapping f : X R where the output is a real value. x = = y = 0.1 How much should I turn my

More information

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear.

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. 1/48 Linear regression Linear regression is a simple approach

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information

Ph.D. course: Regression models. Introduction. 19 April 2012

Ph.D. course: Regression models. Introduction. 19 April 2012 Ph.D. course: Regression models Introduction PKA & LTS Sect. 1.1, 1.2, 1.4 19 April 2012 www.biostat.ku.dk/~pka/regrmodels12 Per Kragh Andersen 1 Regression models The distribution of one outcome variable

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects Nicholas C. Henderson Thomas A. Louis Gary Rosner Ravi Varadhan Johns Hopkins University September 28,

More information

Machine Learning. Ensemble Methods. Manfred Huber

Machine Learning. Ensemble Methods. Manfred Huber Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Math 6330: Statistical Consulting Class 5

Math 6330: Statistical Consulting Class 5 Math 6330: Statistical Consulting Class 5 Tony Cox tcoxdenver@aol.com University of Colorado at Denver Course web site: http://cox-associates.com/6330/ What is a predictive model? The probability that

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

Individual Treatment Effect Prediction Using Model-Based Random Forests

Individual Treatment Effect Prediction Using Model-Based Random Forests Individual Treatment Effect Prediction Using Model-Based Random Forests Heidi Seibold, Achim Zeileis, Torsten Hothorn https://eeecon.uibk.ac.at/~zeileis/ Motivation: Overall treatment effect Base model:

More information

Supervised Learning via Decision Trees

Supervised Learning via Decision Trees Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions

More information

Ph.D. course: Regression models. Regression models. Explanatory variables. Example 1.1: Body mass index and vitamin D status

Ph.D. course: Regression models. Regression models. Explanatory variables. Example 1.1: Body mass index and vitamin D status Ph.D. course: Regression models Introduction PKA & LTS Sect. 1.1, 1.2, 1.4 25 April 2013 www.biostat.ku.dk/~pka/regrmodels13 Per Kragh Andersen Regression models The distribution of one outcome variable

More information

UC Berkeley UC Berkeley Electronic Theses and Dissertations

UC Berkeley UC Berkeley Electronic Theses and Dissertations UC Berkeley UC Berkeley Electronic Theses and Dissertations Title Super Learner Permalink https://escholarship.org/uc/item/4qn0067v Author Polley, Eric Publication Date 2010-01-01 Peer reviewed Thesis/dissertation

More information

Prediction Performance of Survival Models

Prediction Performance of Survival Models Prediction Performance of Survival Models by Yan Yuan A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Doctor of Philosophy in Statistics Waterloo,

More information

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models Nicholas C. Henderson Thomas A. Louis Gary Rosner Ravi Varadhan Johns Hopkins University July 31, 2018

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information