Cross-Validation with Confidence
Jing Lei, Department of Statistics, Carnegie Mellon University
WHOA-PSI Workshop, St. Louis, 2017
Quotes from Day 1 and Day 2: "Good model or pure model?" "Occam's razor." "We really want a λ that looks reasonable."
Overview

                 Parameter est.         Model selection
Point est.       MLE, M-est., ...       Cross-validation
Interval est.    Confidence interval    CVC (this talk)
Outline
- Background: cross-validation, overfitting, and uncertainty of model selection
- Cross-validation with confidence: a hypothesis testing framework; p-value calculation; validity of the confidence set
- Examples
A regression setting
Data: D = {(X_i, Y_i) : 1 ≤ i ≤ n}, i.i.d. from a joint distribution P on R^p × R^1, with Y = f(X) + ε and E(ε | X) = 0.
Loss function: ℓ(·, ·) : R^2 → R.
Goal: find f̂ ≈ f so that Q(f̂) = E[ℓ(f̂(X), Y) | f̂] is small.
The framework can be extended to unsupervised learning problems, including network data.
Model selection
Candidate set: M = {1, ..., M}. Each m ∈ M corresponds to a candidate model:
1. m can represent a competing theory about P (e.g., f is linear, f is quadratic, variable j is irrelevant, etc.).
2. m can represent a particular value of a tuning parameter of an algorithm that computes f̂ (e.g., λ in the lasso, the number of steps in forward selection).
Given m and data D, there is an estimate f̂(D, m) of f.
Model selection: find the best m —
1. such that it equals the true model, or
2. such that it minimizes Q(f̂) over all m ∈ M with high probability.
Cross-validation
Sample split: let I_tr and I_te be a partition of {1, ..., n}.
Fitting: f̂_m = f̂(D_tr, m), where D_tr = {(X_i, Y_i) : i ∈ I_tr}.
Validation: Q̂(f̂_m) = n_te^{-1} Σ_{i ∈ I_te} ℓ(f̂_m(X_i), Y_i).
CV model selection: m̂_cv = argmin_{m ∈ M} Q̂(f̂_m).
V-fold cross-validation:
1. For V ≥ 2, split the data into V folds.
2. Rotate over each fold, holding it out as I_te, to obtain Q̂^(v)(f̂_m^(v)).
3. m̂ = argmin_m V^{-1} Σ_{v=1}^V Q̂^(v)(f̂_m^(v)).
4. Popular choices of V: 10 and 5.
5. V = n: leave-one-out cross-validation.
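The V-fold procedure above can be sketched as follows. This is a generic illustration, not code from the talk: the `fit(X, y, m)` and `loss(pred, y)` interfaces are assumptions of mine, standing in for any fitting procedure indexed by a candidate model m.

```python
import numpy as np

def v_fold_cv(fit, loss, X, y, n_models, V=5, seed=0):
    """Plain V-fold CV: returns the cross-validated risk estimate
    Q_hat for each candidate model.

    fit(X, y, m)  -> a prediction function for candidate model m
    loss(pred, y) -> elementwise losses
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V            # fold label for each point
    risk = np.zeros(n_models)
    for v in range(V):
        te = folds == v                       # hold out fold v as I_te
        for m in range(n_models):
            f_hat = fit(X[~te], y[~te], m)    # fit on the other folds
            risk[m] += loss(f_hat(X[te]), y[te]).mean() / V
    return risk                               # m_hat_cv = risk.argmin()
```

CV model selection then picks `risk.argmin()`; the rest of the talk is about quantifying how unstable that argmin is.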
A simple negative example
Model: Y = μ + ε, where ε ~ N(0, 1). M = {1, 2}; m = 1: μ = 0; m = 2: μ ∈ R. Truth: μ = 0.
Consider a single split: f̂_1 ≡ 0, f̂_2 = ε̄_tr.
m̂_cv = 1 ⟺ 0 < Q̂(f̂_2) − Q̂(f̂_1) = ε̄_tr² − 2 ε̄_tr ε̄_te.
If n_tr/n_te → 1, then √n ε̄_tr and √n ε̄_te are independent normal random variables with constant variances. So P(m̂_cv = 1) is bounded away from 1.
(Shao 93, Zhang 93, Yang 07): m̂_cv is inconsistent unless n_tr = o(n). V-fold does not help!
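A quick Monte Carlo check of this claim (a sketch of mine, not from the talk): with μ = 0 and an even split, the probability that single-split CV picks the true null model stays well below 1 no matter how large n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pick_null(n, reps=2000):
    """Fraction of replications in which single-split CV picks m = 1
    (the true model mu = 0) over the fitted-mean model m = 2."""
    wins = 0
    for _ in range(reps):
        eps = rng.standard_normal(n)
        tr, te = eps[: n // 2], eps[n // 2:]
        mu_hat = tr.mean()                  # f_hat_2 fitted on I_tr
        q1 = np.mean(te ** 2)               # test loss of f_hat_1 = 0
        q2 = np.mean((te - mu_hat) ** 2)    # test loss of f_hat_2
        wins += q1 <= q2
    return wins / reps

for n in (100, 1000, 10000):
    print(n, p_pick_null(n))   # does not approach 1 as n grows
```

Consistent with the slide's calculation, the selection event depends only on the signs of two correlated normals, so its probability converges to a constant strictly between 0 and 1.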
A fix for the simple example: hypothesis testing
The fundamental question: when we see Q̂(f̂_2) < Q̂(f̂_1), do we feel confident to say Q(f̂_2) < Q(f̂_1)?
A standard solution uses hypothesis testing: H_0: Q(f̂_1) ≤ Q(f̂_2), conditioning on f̂_1, f̂_2.
This can be done with a paired-sample t-test, say at type I error level α.
CVC for the simple example
Recall that H_0: Q(f̂_1) ≤ Q(f̂_2). When H_0 is not rejected, does it mean we shall just pick m = 1?
No. Because if we consider H_0′: Q(f̂_2) ≤ Q(f̂_1), H_0′ will not be rejected either (the probability of not rejecting H_0′ is bounded away from 0). Most likely, we reject neither H_0 nor H_0′.
We accept both fitted models f̂_1 and f̂_2: they are very similar, and the difference cannot be detected from the data.
Existing work
- Hansen et al. (2011, Econometrica): sequential testing; only for low-dimensional problems.
- Ferrari and Yang (2014): F-tests; need a good variable screening procedure in high dimensions.
Our approach: one step, with provable coverage and power under mild assumptions in high dimensions.
Key technique: high-dimensional Gaussian comparison of sample means (Chernozhukov et al.).
CVC in general
Now suppose we have a set of candidate models M = {1, ..., M}. Split the data into D_tr and D_te, and use D_tr to obtain f̂_m for each m. Recall that the model quality is Q(f̂) = E[ℓ(f̂(X), Y) | f̂].
For each m, test the hypothesis (conditioning on f̂_1, ..., f̂_M)
H_{0,m}: min_{j ≠ m} Q(f̂_j) ≥ Q(f̂_m),
and let p̂_m be a valid p-value.
A_cvc = {m : p̂_m > α} is our confidence set for the best fitted model:
P(m* ∈ A_cvc) ≥ 1 − α, where m* = argmin_m Q(f̂_m).
Calculating p̂_m
Recall that D_tr is the training data and D_te is the testing data. The test and p-values are conditional on D_tr.
Data: an n_te × (M − 1) matrix (I_te is the index set of D_te)
[ξ_{m,j}^(i)]_{i ∈ I_te, j ≠ m}, where ξ_{m,j}^(i) = ℓ(f̂_m(X_i), Y_i) − ℓ(f̂_j(X_i), Y_i).
Multivariate mean testing: H_{0,m}: E(ξ_{m,j}) ≤ 0 for all j ≠ m.
Challenges:
1. High dimensionality: M can be large.
2. Potentially high correlation between ξ_{m,j} and ξ_{m,j′}.
3. Vastly different scaling: Var(ξ_{m,j}) can be O(1) or O(n^{-1}).
Calculating p̂_m
H_{0,m}: E(ξ_{m,j}) ≤ 0 for all j ≠ m. Let μ̂_{m,j} and σ̂_{m,j} be the sample mean and standard deviation of (ξ_{m,j}^(i) : i ∈ I_te). Naturally, one would reject H_{0,m} for large values of max_{j ≠ m} μ̂_{m,j}/σ̂_{m,j}. Approximate the null distribution using high-dimensional Gaussian comparison.
Studentized Gaussian multiplier bootstrap
1. T_m = max_{j ≠ m} √n_te · μ̂_{m,j} / σ̂_{m,j}.
2. Let B be the bootstrap sample size. For b = 1, ..., B:
   2.1 Generate i.i.d. standard Gaussian ζ_i, i ∈ I_te.
   2.2 T*_b = max_{j ≠ m} (1/√n_te) Σ_{i ∈ I_te} ζ_i (ξ_{m,j}^(i) − μ̂_{m,j}) / σ̂_{m,j}.
3. p̂_m = B^{-1} Σ_{b=1}^B 1(T*_b > T_m).
- The studentization takes care of the scaling difference.
- The bootstrap Gaussian comparison takes care of the dimensionality and correlation.
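A minimal sketch of these three steps (the function names and interfaces are mine, not from the talk): `cvc_p_value` computes the multiplier-bootstrap p-value for one candidate m from the loss-difference matrix, and `cvc_set` assembles the confidence set A_cvc from an n_te × M matrix of held-out losses.

```python
import numpy as np

def cvc_p_value(xi, B=1000, seed=0):
    """Studentized Gaussian multiplier bootstrap p-value for H_{0,m}.

    xi is the n_te x (M-1) matrix of loss differences:
    xi[i, j] = loss of f_hat_m at point i minus loss of f_hat_j at i.
    """
    n_te = xi.shape[0]
    mu = xi.mean(axis=0)
    sigma = xi.std(axis=0, ddof=1)
    T_m = np.sqrt(n_te) * np.max(mu / sigma)            # step 1

    rng = np.random.default_rng(seed)
    zeta = rng.standard_normal((B, n_te))               # step 2.1, B draws
    centered = (xi - mu) / sigma                        # studentized
    T_star = (zeta @ centered / np.sqrt(n_te)).max(axis=1)  # step 2.2
    return np.mean(T_star > T_m)                        # step 3

def cvc_set(losses, alpha=0.05, B=1000):
    """A_cvc = {m : p_hat_m > alpha}; losses is n_te x M."""
    M = losses.shape[1]
    out = []
    for m in range(M):
        xi = losses[:, [m]] - np.delete(losses, m, axis=1)
        if cvc_p_value(xi, B=B) > alpha:
            out.append(m)
    return out
```

Note the one-sided comparison: a model whose worst studentized mean difference is negative (T_m ≤ 0) is essentially never excluded, which is why m̂_cv always survives into the set.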
Properties of CVC
A_cvc = {m : p̂_m > α}. Let m̂_cv = argmin_m Q̂(f̂_m). By construction, T_{m̂_cv} ≤ 0.
Proposition. If α < 0.5, then P(m̂_cv ∈ A_cvc) → 1 as B → ∞.
Proof: [(1/√n_te) Σ_{i ∈ I_te} ζ_i (ξ_{m,j}^(i) − μ̂_{m,j}) / σ̂_{m,j}]_{j ≠ m} is a zero-mean Gaussian random vector, so the upper α quantile of its maximum must be positive.
Can view m̂_cv as the center of the confidence set.
Coverage of A_cvc
Recall ξ_{m,j} = ℓ(f̂_m(X), Y) − ℓ(f̂_j(X), Y), with independent (X, Y). Let μ_{m,j} = E[ξ_{m,j} | f̂_m, f̂_j], σ²_{m,j} = Var[ξ_{m,j} | f̂_m, f̂_j].
Theorem. Assume that (ξ_{m,j} − μ_{m,j})/(A_n σ_{m,j}) has a sub-exponential tail for all m ≠ j and some A_n ≥ 1 such that, for some c > 0, A_n^6 log^7(M ∨ n) = O(n^{1−c}). Then:
1. If max_{j ≠ m} (μ_{m,j}/σ_{m,j})_+ = o(1/√(n log(M ∨ n))), then P(m ∈ A_cvc) ≥ 1 − α + o(1).
2. If max_{j ≠ m} (μ_{m,j}/σ_{m,j})_+ ≥ C A_n √(log(M ∨ n)/n) for some constant C, and α ≫ n^{-1}, then P(m ∈ A_cvc) = o(1).
Proof of coverage
Let Z(Σ) = max N(0, Σ) (the maximum of a Gaussian vector with covariance Σ), and z(1 − α, Σ) its 1 − α quantile. Let Γ̂ and Γ be the sample and population correlation matrices of (ξ_{m,j}^(i))_{i ∈ I_te, j ≠ m}. When B → ∞,
P(p̂_m ≤ α) = P[max_j √n_te μ̂_{m,j}/σ̂_{m,j} ≥ z(1 − α, Γ̂)].
Tools (2 and 3 are due to Chernozhukov et al.):
1. Concentration: max_j √n_te μ̂_{m,j}/σ̂_{m,j} ≤ max_j √n_te (μ̂_{m,j} − μ_{m,j})/σ_{m,j} + o(1/√log M).
2. Gaussian comparison: max_j √n_te (μ̂_{m,j} − μ_{m,j})/σ_{m,j} ≈_d Z(Γ) ≈_d Z(Γ̂).
3. Anti-concentration: Z(Γ̂) and Z(Γ) have densities bounded by O(√log M).
V-fold CVC
Split the data into V folds. Let v_i be the fold that contains data point i, and let f̂_{m,v} be the estimate using model m and all data but fold v.
ξ_{m,j}^(i) = ℓ(f̂_{m,v_i}(X_i), Y_i) − ℓ(f̂_{j,v_i}(X_i), Y_i), for all 1 ≤ i ≤ n.
Calculate T_m and T*_b correspondingly, using the n × (M − 1) cross-validated error difference matrix (ξ_{m,j}^(i))_{1 ≤ i ≤ n, j ≠ m}.
Rigorous justification is hard due to dependence between folds, but empirically this works much better.
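The only new ingredient relative to the single-split version is how the loss matrix is built: each point is scored by the fits that excluded its fold. A sketch (interfaces assumed, as before, with `fit(X, y, m)` returning a predictor and `loss(pred, y)` returning elementwise losses); the same studentized multiplier-bootstrap test from the previous slide is then applied to column differences of this matrix.

```python
import numpy as np

def vfold_loss_matrix(fit, loss, X, y, n_models, V=5, seed=0):
    """n x M matrix of cross-validated losses: entry (i, m) uses the
    fit of model m trained on all folds except the one containing
    point i."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V
    L = np.empty((n, n_models))
    for v in range(V):
        te = folds == v
        for m in range(n_models):
            f_hat = fit(X[~te], y[~te], m)   # train without fold v
            L[te, m] = loss(f_hat(X[te]), y[te])
    return L  # xi for model m: L[:, [m]] - np.delete(L, m, axis=1)
```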
Example: the diabetes data (Efron et al. 04)
n = 442, with 10 covariates: age, sex, BMI, blood pressure, etc. The response is diabetes progression after one year. Including all quadratic terms, p = 64.
5-fold CVC with α = 0.05, using the lasso with 50 values of λ.
[Figure: test error versus log(λ). Triangles: models in A_cvc; solid triangle: m̂_cv.]
Simulations: coverage of A_cvc
Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200.
Σ = I_200 (identity), or Σ_jk = 0.5 + 0.5 δ_jk (correlated).
β = (1, 1, 1, 0, ..., 0)^T (simple), or β = (1, 1, 1, 0.7, 0.5, 0.3, 0, ..., 0)^T (mixed).
5-fold CVC with α = 0.05, using the lasso with 50 values of λ.

setting of (Σ, β)      coverage    |A_cvc|     cv is opt.
identity, simple       .92 (.03)   5.1 (.19)   .27 (.04)
identity, mixed        .95 (.02)   5.1 (.18)   .37 (.05)
correlated, simple     .96 (.02)   7.5 (.18)   .18 (.04)
correlated, mixed      .93 (.03)   7.4 (.23)   .19 (.04)
Simulations: coverage of A_cvc
Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200.
Σ = I_200 (identity), or Σ_jk = 0.5 + 0.5 δ_jk (correlated).
β = (1, 1, 1, 0, ..., 0)^T (simple), or β = (1, 1, 1, 0.7, 0.5, 0.3, 0, ..., 0)^T (mixed).
5-fold CVC with α = 0.05, using forward stepwise.

setting of (Σ, β)      coverage    |A_cvc|     cv is opt.
identity, simple       1 (0)       3.7 (.29)   .87 (.03)
identity, mixed        .95 (.02)   5.2 (.33)   .58 (.05)
correlated, simple     .97 (.02)   4.1 (.31)   .80 (.04)
correlated, mixed      .93 (.03)   6.3 (.36)   .44 (.05)
How to use A_cvc?
We are often interested in picking one model, not a subset of models. A_cvc provides some flexibility in picking among a subset of highly competitive models:
1. A_cvc may contain a model that includes a particularly interesting variable.
2. A_cvc can be used to answer questions like "Is fitting procedure A better than procedure B?"
3. We can also simply choose the most parsimonious model in A_cvc.
The most parsimonious model in A_cvc
Now consider the linear regression problem Y = X^T β + ε. Let J_m be the subset of variables selected using model m, and
m̂_cvc.min = argmin_{m ∈ A_cvc} |J_m|.
m̂_cvc.min is the simplest model that gives a similar predictive risk as m̂_cv.
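The selection rule itself is a one-liner once the confidence set and the model supports are available (a sketch with assumed inputs: `A_cvc` a list of model indices, `supports[m]` the set of variables model m selects):

```python
def cvc_min(A_cvc, supports):
    """Most parsimonious model in the confidence set:
    argmin over m in A_cvc of |J_m|."""
    return min(A_cvc, key=lambda m: len(supports[m]))
```

Ties are broken arbitrarily here (Python's `min` keeps the first); the talk does not specify a tie-breaking rule.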
A classical setting
Y = X^T β + ε, X ∈ R^p, Var(X) = Σ has full rank; ε has mean zero and variance σ² < ∞. Assume that (p, Σ, σ²) are fixed and n → ∞. M contains the true model m* and at least one overfitting model. n_tr/n_te → 1.
Under squared loss, the true model and all overfitting models give √n-consistent estimates. Early results (Shao 93, Zhang 93, Yang 07) show that P(m̂_cv ≠ m*) is bounded away from 0.
Consistency of m̂_cvc.min
Theorem. Assume that X and ε are independent and sub-Gaussian, and A_cvc is the output of CVC with α = o(1) and α ≫ n^{-1}. Then lim_{n→∞} P(m̂_cvc.min = m*) = 1.
Sub-Gaussianity of X and ε implies that (Y − X^T β)² is sub-exponential.
Can allow p to grow slowly with n using a union bound.
Example in low-dimensional variable selection
Synthetic data with p = 5, n = 40, as in [Shao 93]: Y = X^T β + ε, β = (2, 9, 0, 4, 8)^T, ε ~ N(0, 1). Generated additional rows for n = 60, 80, 100, 120, 140, 160.
Candidates: (1,4,5), (1,2,4,5), (1,3,4,5), (1,2,3,4,5).
Repeated 1000 times, using OLS with 5-fold CVC.
[Figure: rate of correct selection versus n, for cvc.min and cv.]
Simulations: variable selection with m̂_cvc.min
Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200.
Σ = I_200 (identity), or Σ_jk = 0.5 + 0.5 δ_jk (correlated). β = (1, 1, 1, 0, ..., 0)^T (simple).
5-fold CVC with α = 0.05, using forward stepwise.

setting of (Σ, β)      oracle   m̂_cvc.min   m̂_cv
identity, simple       1        1            .87
correlated, simple     1        .97          .80

Proportion of correct model selection over 100 independent data sets. Oracle method: the number of steps that gives the smallest prediction risk.
Simulations: variable selection with m̂_cvc.min
Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200.
Σ = I_200 (identity), or Σ_jk = 0.5 + 0.5 δ_jk (correlated). β = (1, 1, 1, 0, ..., 0)^T (simple).
5-fold CVC with α = 0.05, using Lasso + Least Square.

setting of (Σ, β)      oracle   m̂_cvc.min   m̂_cv
identity, simple       1        1            .88
correlated, simple     .87      .85          .71

Proportion of correct model selection over 100 independent data sets. Oracle method: the λ value that gives the smallest prediction risk.
The diabetes data revisited
Split n = 442 into 300 (estimation) and 142 (risk approximation). 5-fold CVC applied to the 300 sample points, with a final re-fit; the final estimate is evaluated on the 142 hold-out points. Repeated 100 times, using the lasso with 50 values of λ.
[Figure: boxplots of test error and model size, for cv and cvc.min.]
Summary
- CVC: confidence sets for model selection.
- m̂_cvc.min has similar risk to m̂_cv using a simpler model.
Extensions:
- Validity of CVC in high-dimensional and nonparametric settings.
- Unsupervised problems: 1. clustering; 2. matrix decomposition (PCA, SVD, etc.); 3. network models.
- Other sample-splitting based inference methods.
Thanks! Questions? Paper: Cross-Validation with Confidence, arxiv.org/1703.07904 Slides: http://www.stat.cmu.edu/~jinglei/talk.shtml