Cross-Validation with Confidence

1 Cross-Validation with Confidence
Jing Lei, Department of Statistics, Carnegie Mellon University
UMN Statistics Seminar, Mar 30, 2017

2-8 Overview

                 Parameter est.          Model selection
Point est.       MLE, M-est., ...        Cross-validation
Interval est.    Confidence interval     CVC

9 Outline
Background: cross-validation, overfitting, and uncertainty of model selection
Cross-validation with confidence: a hypothesis testing framework; p-value calculation; validity of the confidence set
Model selection consistency for (low-dim.) sparse linear models
Examples

10-11 A regression setting
Data: $D = \{(X_i, Y_i) : 1 \le i \le n\}$, i.i.d. from a joint distribution $P$ on $\mathbb{R}^p \times \mathbb{R}^1$.
$Y = f(X) + \varepsilon$, with $E(\varepsilon \mid X) = 0$.
Loss function: $\ell(\cdot,\cdot) : \mathbb{R}^2 \to \mathbb{R}$.
Goal: find $\hat f \approx f$ so that $Q(\hat f) \equiv E[\ell(\hat f(X), Y) \mid \hat f]$ is small.
The framework can be extended to unsupervised learning problems.

12-14 Model selection
Candidate set: $\mathcal{M} = \{1, \ldots, M\}$. Each $m \in \mathcal{M}$ corresponds to a candidate model.
1. $m$ can represent a competing theory about $P$ (e.g., $f$ is linear, $f$ is quadratic, variable $j$ is irrelevant, etc.).
2. $m$ can represent a particular value of a tuning parameter of a certain algorithm used to compute $\hat f$ (e.g., $\lambda$ in the lasso, the number of steps in forward selection).
Given $m$ and data $D$, there is an estimate $\hat f(D, m)$ of $f$.
Model selection: find the best $m$
1. such that it equals the true model, or
2. such that it minimizes $Q(\hat f)$ over all $m \in \mathcal{M}$ with high probability.

15 Cross-validation
Sample split: let $I_{tr}$ and $I_{te}$ be a partition of $\{1, \ldots, n\}$.
Fitting: $\hat f_m = \hat f(D_{tr}, m)$, where $D_{tr} = \{(X_i, Y_i) : i \in I_{tr}\}$.
Validation: $\hat Q(\hat f_m) = n_{te}^{-1} \sum_{i \in I_{te}} \ell(\hat f_m(X_i), Y_i)$.
CV model selection: $\hat m_{cv} = \arg\min_{m \in \mathcal{M}} \hat Q(\hat f_m)$.
V-fold cross-validation:
1. For $V \ge 2$, split the data into $V$ folds.
2. Rotate over each fold as $I_{te}$ to obtain $\hat Q^{(v)}(\hat f_m^{(v)})$.
3. $\hat m = \arg\min_m V^{-1} \sum_{v=1}^{V} \hat Q^{(v)}(\hat f_m^{(v)})$.
4. Popular choices of $V$: 10, and $V = n$ (leave-one-out cross-validation).
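To make the recipe above concrete, here is a minimal sketch (mine, not from the talk) of V-fold CV model selection in Python, using scikit-learn's Lasso over a grid of penalties as the family of fitting procedures and squared-error loss; the function name and data layout are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def vfold_cv_select(X, y, alphas, V=5, seed=0):
    """V-fold cross-validation over a grid of lasso penalties.

    Returns the index of the penalty with the smallest average held-out
    squared error, plus the vector of per-model CV errors."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V          # fold label for each observation
    cv_err = np.zeros(len(alphas))
    for v in range(V):
        te, tr = folds == v, folds != v
        for m, a in enumerate(alphas):
            fhat = Lasso(alpha=a, max_iter=10000).fit(X[tr], y[tr])
            cv_err[m] += np.mean((y[te] - fhat.predict(X[te])) ** 2) / V
    return int(np.argmin(cv_err)), cv_err

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 20))
    beta = np.r_[1.0, 1.0, 1.0, np.zeros(17)]
    y = X @ beta + rng.standard_normal(200)
    alphas = np.logspace(-3, 0, 20)
    m_cv, errs = vfold_cv_select(X, y, alphas)
    print("chosen lambda:", alphas[m_cv])
```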

16-17 Why can cross-validation be successful?
To find the best model:
1. The fitting procedure $\hat f(D, m)$ needs to be stable, so that the best model (almost) always gives the best fit.
2. Conditional inference: given $(\hat f_m : m \in \mathcal{M})$, cross-validation approximately minimizes $Q(\hat f_m)$ over all $m$.
The value of cross-validation is in conditional inference.

18-20 A simple negative example
Model: $Y = \mu + \varepsilon$, where $\varepsilon \sim N(0, 1)$. $\mathcal{M} = \{1, 2\}$; $m = 1$: $\mu = 0$; $m = 2$: $\mu \in \mathbb{R}$. Truth: $\mu = 0$.
Consider a single split: $\hat f_1 \equiv 0$, $\hat f_2 = \bar\varepsilon_{tr}$.
$\hat m_{cv} = 1 \iff 0 < \hat Q(\hat f_2) - \hat Q(\hat f_1) = \bar\varepsilon_{tr}^2 - 2\,\bar\varepsilon_{tr}\bar\varepsilon_{te}$.
If $n_{tr}/n_{te} \asymp 1$, then $\sqrt{n}\,\bar\varepsilon_{tr}$ and $\sqrt{n}\,\bar\varepsilon_{te}$ are independent normal random variables with constant variances, so $P(\hat m_{cv} = 1)$ is bounded away from 1.
(Shao 93, Zhang 93, Yang 07) $\hat m_{cv}$ is inconsistent unless $n_{tr} = o(n)$. V-fold does not help!
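A quick simulation (my own sketch, not part of the slides) makes the non-vanishing error probability visible: with a 50/50 split, the chance that CV picks the true model m = 1 stays near 0.65 no matter how large n gets.

```python
import numpy as np

# Simple negative example: mu = 0, half/half split, f1 = 0 vs. f2 = training mean.
rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    reps, correct = 5000, 0
    for _ in range(reps):
        eps = rng.standard_normal(n)
        tr, te = eps[: n // 2], eps[n // 2:]
        q1 = np.mean(te ** 2)                # test loss of f1 = 0 (true model)
        q2 = np.mean((te - tr.mean()) ** 2)  # test loss of f2 = training mean
        correct += q1 <= q2
    print(n, correct / reps)                 # hovers around ~0.65, not tending to 1
```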

21-25 A closer look at the example
Two potential sources of mistake: $\hat f(D_{tr}, m)$ is not stable, or CV does not work as expected.
Cross-validation makes a mistake if $\hat Q(\hat f_2) - \hat Q(\hat f_1) = \bar\varepsilon_{tr}^2 - 2\,\bar\varepsilon_{tr}\bar\varepsilon_{te} < 0$.
But $Q(\hat f_2) - Q(\hat f_1) = \bar\varepsilon_{tr}^2 > 0$. So the best model indeed gives the best fit. The problem is in CV!
Here $\bar\varepsilon_{tr}^2$ is the signal, and $2\,\bar\varepsilon_{tr}\bar\varepsilon_{te}$ is the noise.
Observation: cross-validation makes a mistake when it fails to take into account the uncertainty in the testing sample.

26 Cross-validation with confidence (CVC)
We want to avoid the kind of mistake made in the simple example, while still using conventional split ratios and a V-fold implementation.

27-29 A fix for the simple example: hypothesis testing
The fundamental question: when we see $\hat Q(\hat f_2) < \hat Q(\hat f_1)$, do we feel confident to say $Q(\hat f_2) < Q(\hat f_1)$?
A standard solution uses hypothesis testing: $H_0: Q(\hat f_1) \le Q(\hat f_2)$, conditioning on $\hat f_1, \hat f_2$.
This can be done using a paired-sample t-test, say with type I error level $\alpha$.
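As a sketch of this step (mine, not from the talk), the conditional comparison of two fitted models can be run as a one-sided paired t-test on the per-observation loss differences over the test set; this assumes SciPy 1.6+ for the `alternative` keyword, and the toy data layout is illustrative.

```python
import numpy as np
from scipy import stats

def paired_test(loss1, loss2):
    """One-sided paired t-test of H0: Q(f1) <= Q(f2), i.e. model 1 is no worse.

    loss1, loss2: per-observation test-set losses of the two fitted models.
    Small p-values give confidence that Q(f2) < Q(f1)."""
    d = loss1 - loss2                      # d_i = l(f1(X_i),Y_i) - l(f2(X_i),Y_i)
    return stats.ttest_1samp(d, 0.0, alternative="greater").pvalue

# Toy usage on the simple example: f1 = 0 (true), f2 = training mean.
rng = np.random.default_rng(0)
eps = rng.standard_normal(200)
tr, te = eps[:100], eps[100:]
loss1 = te ** 2                            # squared loss of f1
loss2 = (te - tr.mean()) ** 2              # squared loss of f2
print(paired_test(loss1, loss2))           # typically large: cannot claim f2 beats f1
```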

30-32 CVC for the simple example
Recall that $H_0: Q(\hat f_1) \le Q(\hat f_2)$. When $H_0$ is not rejected, does it mean we should just pick $m = 1$?
No. Because if we consider $H_0': Q(\hat f_2) \le Q(\hat f_1)$, then $H_0'$ need not be rejected either (the probability of not rejecting $H_0'$ is bounded away from 0). Most likely, we reject neither $H_0$ nor $H_0'$.
We accept both fitted models $\hat f_1$ and $\hat f_2$: they are very similar, and the difference cannot be detected from the data.

33-35 Existing work
Hansen et al. (2011, Econometrica): sequential testing, only for low-dimensional problems.
Ferrari and Yang (2014): F-tests; need a good variable screening procedure in high dimensions.
Our approach: one step, with provable coverage and power under mild assumptions in high dimensions.
Key technique: high-dimensional Gaussian comparison of sample means (Chernozhukov et al.).

36-38 CVC in general
Now suppose we have a set of candidate models $\mathcal{M} = \{1, \ldots, M\}$. Split the data into $D_{tr}$ and $D_{te}$, and use $D_{tr}$ to obtain $\hat f_m$ for each $m$. Recall that the model quality is $Q(\hat f) = E[\ell(\hat f(X), Y) \mid \hat f]$.
For each $m$, test the hypothesis (conditioning on $\hat f_1, \ldots, \hat f_M$)
$H_{0,m}: \min_{j \ne m} Q(\hat f_j) \ge Q(\hat f_m)$,
and let $\hat p_m$ be a valid p-value.
$A_{cvc} = \{m : \hat p_m > \alpha\}$ is our confidence set for the best fitted model: $P(m^* \in A_{cvc}) \ge 1 - \alpha$, where $m^* = \arg\min_m Q(\hat f_m)$.

39-43 Calculating $\hat p_m$
Recall that $D_{tr}$ is the training data and $D_{te}$ is the testing data. The test and p-values are conditional on $D_{tr}$.
Data: the $n_{te} \times (M-1)$ matrix ($I_{te}$ is the index set of $D_{te}$)
$[\xi^{(i)}_{m,j}]_{i \in I_{te},\, j \ne m}$, where $\xi^{(i)}_{m,j} = \ell(\hat f_m(X_i), Y_i) - \ell(\hat f_j(X_i), Y_i)$.
Multivariate mean testing: $H_{0,m}: E(\xi_{m,j}) \le 0$ for all $j \ne m$.
Challenges
1. High dimensionality: $M$ can be large.
2. Potentially high correlation between $\xi_{m,j}$ and $\xi_{m,j'}$.
3. Vastly different scaling: $\mathrm{Var}(\xi_{m,j})$ can be $O(1)$ or $O(n^{-1})$.
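A minimal sketch (mine) of building this matrix from per-observation test losses; the `test_losses` layout, with one column per fitted model, is an assumption for illustration.

```python
import numpy as np

def loss_diff_matrix(test_losses, m):
    """Build the n_te x (M-1) matrix xi for candidate m.

    test_losses: array of shape (n_te, M); entry (i, k) is l(fhat_k(X_i), Y_i),
    the loss of fitted model k on test observation i.
    Returns xi[i, .] = l(fhat_m(X_i), Y_i) - l(fhat_j(X_i), Y_i) over all j != m."""
    others = [j for j in range(test_losses.shape[1]) if j != m]
    return test_losses[:, [m]] - test_losses[:, others]
```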

44 Calculating $\hat p_m$
$H_{0,m}: E(\xi_{m,j}) \le 0$ for all $j \ne m$.
Let $\hat\mu_{m,j}$ and $\hat\sigma_{m,j}$ be the sample mean and standard deviation of $(\xi^{(i)}_{m,j} : i \in I_{te})$.
Naturally, one would reject $H_{0,m}$ for large values of $\max_{j \ne m} \hat\mu_{m,j}/\hat\sigma_{m,j}$.
Approximate the null distribution using high-dimensional Gaussian comparison.

45-47 Studentized Gaussian Multiplier Bootstrap
1. $T_m = \max_{j \ne m} \sqrt{n_{te}}\, \hat\mu_{m,j} / \hat\sigma_{m,j}$.
2. Let $B$ be the bootstrap sample size. For $b = 1, \ldots, B$:
   2.1 Generate i.i.d. standard Gaussian $\zeta_i$, $i \in I_{te}$.
   2.2 $T^*_b = \max_{j \ne m} \frac{1}{\sqrt{n_{te}}} \sum_{i \in I_{te}} \frac{\xi^{(i)}_{m,j} - \hat\mu_{m,j}}{\hat\sigma_{m,j}}\, \zeta_i$.
3. $\hat p_m = B^{-1} \sum_{b=1}^{B} 1(T^*_b > T_m)$.
- The studentization takes care of the scaling difference.
- The bootstrap Gaussian comparison takes care of the dimensionality and correlation.
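Here is a numpy sketch (mine, following the algorithm above) of the studentized Gaussian multiplier bootstrap p-value, together with the resulting confidence set A_cvc = {m : p_hat_m > alpha}; function names and the per-observation loss layout are assumptions.

```python
import numpy as np

def cvc_pvalue(xi, B=2000, seed=0):
    """Bootstrap p-value for one candidate m, testing H_{0,m}: E(xi_{m,j}) <= 0 for all j != m.

    xi: n_te x (M-1) array with xi[i, .] = l(fhat_m(X_i), Y_i) - l(fhat_j(X_i), Y_i)."""
    rng = np.random.default_rng(seed)
    n_te = xi.shape[0]
    mu = xi.mean(axis=0)
    sigma = np.maximum(xi.std(axis=0, ddof=1), 1e-12)  # guard against identical models
    T_m = np.sqrt(n_te) * np.max(mu / sigma)
    scores = (xi - mu) / sigma                # centered, studentized scores
    zeta = rng.standard_normal((B, n_te))     # Gaussian multipliers, one row per draw
    T_boot = (zeta @ scores / np.sqrt(n_te)).max(axis=1)
    return np.mean(T_boot > T_m)

def cvc_confidence_set(test_losses, alpha=0.05, B=2000):
    """A_cvc = {m : p_hat_m > alpha}, given per-observation test losses (n_te x M)."""
    M = test_losses.shape[1]
    pvals = []
    for m in range(M):
        others = [j for j in range(M) if j != m]
        xi = test_losses[:, [m]] - test_losses[:, others]
        pvals.append(cvc_pvalue(xi, B=B))
    return [m for m in range(M) if pvals[m] > alpha], pvals
```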

48 Properties of CVC
$A_{cvc} = \{m : \hat p_m > \alpha\}$. Let $\hat m_{cv} = \arg\min_m \hat Q(\hat f_m)$. By construction, $T_{\hat m_{cv}} \le 0$.
Proposition. If $\alpha < 0.5$, then $P(\hat m_{cv} \in A_{cvc}) \to 1$ as $B \to \infty$.
Proof: $\big[ n_{te}^{-1/2} \sum_{i \in I_{te}} \frac{\xi^{(i)}_{m,j} - \hat\mu_{m,j}}{\hat\sigma_{m,j}}\, \zeta_i \big]_{j \ne m}$ is a zero-mean Gaussian random vector, so the upper $\alpha$ quantile of its maximum must be positive.
We can view $\hat m_{cv}$ as the center of the confidence set.

49 Coverage of $A_{cvc}$
Recall $\xi_{m,j} = \ell(\hat f_m(X), Y) - \ell(\hat f_j(X), Y)$, with an independent $(X, Y)$. Let $\mu_{m,j} = E[\xi_{m,j} \mid \hat f_m, \hat f_j]$, $\sigma^2_{m,j} = \mathrm{Var}[\xi_{m,j} \mid \hat f_m, \hat f_j]$.
Theorem. Assume that $(\xi_{m,j} - \mu_{m,j})/(A_n \sigma_{m,j})$ has sub-exponential tails for all $m \ne j$ and some $A_n \ge 1$ such that $A_n^6 \log^7(M \vee n) = O(n^{1-c})$ for some $c > 0$.
1. If $\max_{j \ne m} (\mu_{m,j}/\sigma_{m,j})_+ = o\big(1/\sqrt{n \log(M \vee n)}\big)$, then $P(m \in A_{cvc}) \ge 1 - \alpha + o(1)$.
2. If $\max_{j \ne m} (\mu_{m,j}/\sigma_{m,j})_+ \ge C A_n \sqrt{\log(M \vee n)/n}$ for some constant $C$, and $\alpha \ge n^{-1}$, then $P(m \in A_{cvc}) = o(1)$.

50 Coverage of $A_{cvc}$
1. If $m^*$ is the best fitted model, minimizing $Q(\hat f_m)$ over all $\{\hat f_m : m \in \mathcal{M}\}$, then $\mu_{m^*,j}/\sigma_{m^*,j} \le 0$ for all $j$. Thus $P(m^* \in A_{cvc}) \ge 1 - \alpha + o(1)$.
2. If $\mu_{m^*,j} = 0$ for all $j$, then $P(m^* \in A_{cvc}) = 1 - \alpha + o(1)$.
3. Part 2 of the theorem ensures that bad models are excluded with high probability.

51 Proof of coverage
Let $Z(\Sigma) = \max N(0, \Sigma)$, and $z(1-\alpha, \Sigma)$ its $1-\alpha$ quantile. Let $\hat\Gamma$ and $\Gamma$ be the sample and population correlation matrices of $(\xi^{(i)}_{m,j})_{i \in I_{te},\, j \ne m}$.
When $B \to \infty$,
$P(\hat p_m \le \alpha) = P\big(\max_j \sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j} \ge z(1-\alpha, \hat\Gamma)\big)$.
Tools (2 and 3 are due to Chernozhukov et al.):
1. Concentration: $\sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j} \le \sqrt{n_{te}}\, (\hat\mu_{m,j} - \mu_{m,j})/\sigma_{m,j} + o(1/\sqrt{\log M})$.
2. Gaussian comparison: $\max_j \sqrt{n_{te}}\, (\hat\mu_{m,j} - \mu_{m,j})/\sigma_{m,j} \approx_d Z(\Gamma) \approx_d Z(\hat\Gamma)$.
3. Anti-concentration: $Z(\hat\Gamma)$ and $Z(\Gamma)$ have densities bounded by $O(\sqrt{\log M})$.

52-57 V-fold CVC
Split the data into $V$ folds. Let $v_i$ be the fold that contains data point $i$.
Let $\hat f_{m,v}$ be the estimate using model $m$ and all data but fold $v$.
$\xi^{(i)}_{m,j} = \ell(\hat f_{m,v_i}(X_i), Y_i) - \ell(\hat f_{j,v_i}(X_i), Y_i)$, for all $1 \le i \le n$.
Calculate $T_m$ and $T^*_b$ correspondingly using the $n \times (M-1)$ cross-validated error difference matrix $(\xi^{(i)}_{m,j})_{1 \le i \le n,\, j \ne m}$.
Rigorous justification is hard due to dependence between folds, but empirically this works much better.
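A sketch (mine) of the V-fold version: compute the n x M matrix of cross-validated losses, whose column differences can be fed into the same bootstrap p-value as in the earlier sketch; Lasso over a penalty grid and squared loss are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cv_loss_matrix(X, y, alphas, V=5, seed=0):
    """n x M matrix of cross-validated squared losses.

    Entry (i, m) is the loss of fhat_{m, v_i} on observation i, where
    fhat_{m, v} is fit on all data except fold v."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V
    losses = np.empty((n, len(alphas)))
    for v in range(V):
        te, tr = folds == v, folds != v
        for m, a in enumerate(alphas):
            fhat = Lasso(alpha=a, max_iter=10000).fit(X[tr], y[tr])
            losses[te, m] = (y[te] - fhat.predict(X[te])) ** 2
    return losses
```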

58 Example: the diabetes data (Efron et al. 04)
$n = 442$, with 10 covariates: age, sex, bmi, blood pressure, etc. Response: diabetes progression after one year. Including all quadratic terms, $p = 64$.
5-fold CVC with $\alpha = 0.05$, using the lasso with 50 values of $\lambda$.
[Figure: test error against $\log(\lambda)$; triangles mark models in $A_{cvc}$, the solid triangle marks $\hat m_{cv}$.]

59 Simulations: coverage of $A_{cvc}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated).
$\beta = (1,1,1,0,\ldots,0)^T$ (simple), or $\beta = (1,1,1,0.7,0.5,0.3,0,\ldots,0)^T$ (mixed).
5-fold CVC with $\alpha = 0.05$ using the lasso with 50 values of $\lambda$.

setting of (Σ, β)      coverage     |A_cvc|      cv is opt.
identity, simple       .92 (.03)    5.1 (.19)    .27 (.04)
identity, mixed        .95 (.02)    5.1 (.18)    .37 (.05)
correlated, simple     .96 (.02)    7.5 (.18)    .18 (.04)
correlated, mixed      .93 (.03)    7.4 (.23)    .19 (.04)

60 Simulations: coverage of $A_{cvc}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated).
$\beta = (1,1,1,0,\ldots,0)^T$ (simple), or $\beta = (1,1,1,0.7,0.5,0.3,0,\ldots,0)^T$ (mixed).
5-fold CVC with $\alpha = 0.05$ using forward stepwise.

setting of (Σ, β)      coverage     |A_cvc|      cv is opt.
identity, simple       1 (0)        3.7 (.29)    .87 (.03)
identity, mixed        .95 (.02)    5.2 (.33)    .58 (.05)
correlated, simple     .97 (.02)    4.1 (.31)    .80 (.04)
correlated, mixed      .93 (.03)    6.3 (.36)    .44 (.05)

61-64 How to use $A_{cvc}$?
We are often interested in picking one model, not a subset of models. $A_{cvc}$ provides some flexibility in picking among a subset of highly competitive models.
1. $A_{cvc}$ may contain a model that includes a particularly interesting variable.
2. $A_{cvc}$ can be used to answer questions like: is fitting procedure A better than procedure B?
3. We can also simply choose the most parsimonious model in $A_{cvc}$.

65 The most parsimonious model in $A_{cvc}$
Now consider the linear regression problem $Y = X^T\beta + \varepsilon$. Let $J_m$ be the subset of variables selected using model $m$:
$\hat m_{cvc.min} = \arg\min_{m \in A_{cvc}} |J_m|$.
$\hat m_{cvc.min}$ is the simplest model that gives a similar predictive risk as $\hat m_{cv}$.
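As a sketch (mine), picking the most parsimonious member of the confidence set is a one-liner once the support size |J_m| of each candidate is known; `support_sizes` is an assumed lookup, e.g. the number of nonzero lasso coefficients of each model.

```python
def cvc_min(A_cvc, support_sizes):
    """Among the models kept in A_cvc, pick the one using the fewest variables."""
    return min(A_cvc, key=lambda m: support_sizes[m])
```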

66 A classical setting
$Y = X^T\beta + \varepsilon$, $X \in \mathbb{R}^p$, $\mathrm{Var}(X) = \Sigma$ has full rank. $\varepsilon$ has mean zero and variance $\sigma^2 < \infty$.
Assume that $(p, \Sigma, \sigma^2)$ are fixed and $n \to \infty$. $\mathcal{M}$ contains the true model $m^*$ and at least one overfitting model. $n_{tr}/n_{te} \asymp 1$.
Using squared loss, the true model and all overfitting models give $\sqrt{n}$-consistent estimates.
Early results (Shao 93, Zhang 93, Yang 07) show that $P(\hat m_{cv} \ne m^*)$ is bounded away from 0.

67-69 Consistency of $\hat m_{cvc.min}$
Theorem. Assume that $X$ and $\varepsilon$ are independent and sub-Gaussian, and $A_{cvc}$ is the output of CVC with $\alpha = o(1)$ and $\alpha \ge n^{-1}$. Then
$\lim_{n \to \infty} P(\hat m_{cvc.min} = m^*) = 1$.
Sub-Gaussianity of $X$ and $\varepsilon$ implies that $(Y - X^T\beta)^2$ is sub-exponential.
We can allow $p$ to grow slowly with $n$ using a union bound.

70 Example in low-dim. variable selection
Synthetic data with $p = 5$, $n = 40$, as in [Shao 93]. $Y = X^T\beta + \varepsilon$, $\beta = (2, 9, 0, 4, 8)^T$, $\varepsilon \sim N(0, 1)$.
Generated additional rows for $n = 60, 80, 100, 120, 140, 160$.
Candidate models: $(1,4,5)$, $(1,2,4,5)$, $(1,3,4,5)$, $(1,2,3,4,5)$.
Repeated 1000 times, using OLS with 5-fold CVC.
[Figure: rate of correct selection against $n$, for cvc.min and cv.]

71 Simulations: variable selection with $\hat m_{cvc.min}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated). $\beta = (1,1,1,0,\ldots,0)^T$ (simple).
5-fold CVC with $\alpha = 0.05$ using forward stepwise.

setting of (Σ, β)      oracle    m_cvc.min    m_cv
identity, simple
correlated, simple

Proportion of correct model selection over 100 independent data sets. Oracle method: the number of steps that gives the smallest prediction risk.

72 Simulations: variable selection with $\hat m_{cvc.min}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated). $\beta = (1,1,1,0,\ldots,0)^T$ (simple).
5-fold CVC with $\alpha = 0.05$ using Lasso + Least Squares.

setting of (Σ, β)      oracle    m_cvc.min    m_cv
identity, simple
correlated, simple

Proportion of correct model selection over 100 independent data sets. Oracle method: the $\lambda$ value that gives the smallest prediction risk.

73 The diabetes data revisited
Split $n = 442$ into 300 (estimation) and 142 (risk approximation).
5-fold CVC is applied to the 300 sample points, with a final re-fit. The final estimate is evaluated on the 142 hold-out samples.
Repeat 100 times, using the lasso with 50 values of $\lambda$.
[Figure: boxplots of test error and model size, comparing cv and cvc.min.]

74 Summary
CVC: confidence sets for model selection. $\hat m_{cvc.min}$ has similar risk to $\hat m_{cv}$ while using a simpler model.
Extensions:
Validity of CVC in high-dimensional and nonparametric settings.
Unsupervised problems: 1. clustering, 2. matrix decomposition (PCA, SVD, etc.), 3. network models.
Other sample-splitting-based inference methods.

75 Thanks! Questions? Paper: Cross-Validation with Confidence, arxiv.org/ Slides:
