The lasso, persistence, and cross-validation

Size: px

Start display at page:

Download "The lasso, persistence, and cross-validation"

Pearl Gordon
5 years ago
Views:

1 The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University danielmc Joint work with: Darren Homrighausen Colorado State University

2 The question: All the results about lasso are for oracle tuning parameter. What happens if you choose it using the data? The answer: YES! 1

3 The question: All the results about lasso are for oracle tuning parameter. What happens if you choose it using the data? The answer: YES! 1

4 The motivation: 2

5 The motivation: 2

6 The Setup Suppose we have data D n = {(Y 1, X1 ),..., (Y n, Xn )} where X i = (X i1,..., X ip ) R p are the features Y i R are the responses We use D n to find a function f that can predict Y from X. The regression function is the best possible function m(x) = E[Y X] = argmin E [ (Y f(x)) 2] f 3

7 Parameterizing this Relationship A good start is to find the best linear approximation of m(x). A linear predictor specifies a β R p and forms f(x) = X 1 β X p β p = X β Important: This does not assume that m is linear in X! We need to find a good estimator of β. 4

8 The lasso

9 l 1 -regularized regression Of course, for large p, small n, we need to regularize Known as lasso basis pursuit The estimator satisfies β t = argmin Y Xβ 2 2 subject to β 1 t β Alternatively: β λ = argmin Y Xβ λ β 1 β 5

10 Some properties Suppose m(x) IS linear, p is small 1... If λ = o(n), then β λ a.s. β If λ n a (0, ), then β λ β in general If λ n, then β λ a.s. 0 If λ = o( n), then n β λ β d A, A is a random variable. 1 Knight and Fu (2000), Chatterjee and Lahiri (2011) 6

11 Different properties What if m(x) is not linear, p n...? Define Z = (Y, X ) to be a new observation [same distribution]. We define the (predictive) risk to be [ ( ) ] 2 R(β) = E Z Y X β. Define the oracle estimator β t = argmin R(β) {β: β 1 t} The excess risk is E( β t, β t ) = R( β t ) R(β t ) 7

12 Different properties (cont.) A procedure is persistent (relative to the oracle) if E( β t, β t ) P 0 Then 2 ( ) If t 4 = o n log n, β t is persistent relative to βt ( ) β t is not necessarily persistent if t 4 / o n log n 2 Greenshtein and Ritov (2004) 8

13 You ve got data... What t to use?

14 Methods for choosing t The tuning parameter can be selected by unbiased risk estimation using degrees of freedom using an adapted Bayesian information criterion However... Many papers recommend cross-validation [3, 4, 7, 8, 9, 10, 11] [It is also the default method in the R package glmnet. See Zou, Hastie, and Tibshirani (2010)] 9

15 Cross-validation Define V n = {v 1,..., v Kn } to be a set of validation sets β (v) t lasso estimator computed on observations not in v {1,..., n} The cross-validation estimator of the risk is R Vn (t) = R ( β(v 1 ) Vn t,..., β (v Kn ) t := 1 K n v V n 1 v r v ) ( Y r X r ) (v) 2 β t Define t := argmin RVn (t) t T n 10

16 Cross-validation Define V n = {v 1,..., v Kn } to be a set of validation sets β (v) t lasso estimator computed on observations not in v {1,..., n} The cross-validation estimator of the risk is R Vn (t) = R ( β(v 1 ) Vn t,..., β (v Kn ) t := 1 K n v V n 1 v r v ) ( Y r X r ) (v) 2 β t Define t := argmin RVn (t) t T n 10

17 Choosing T n In practice, the optimization set T n = [0, t max ] needs to be specified However, if t max is too small, good solutions might be excluded What is too small? 11

18 Choosing T n By definition, β t {β : β 1 t} This constraint is only binding if t < min η K β 0 + η 1 =: t 0, where β 0 := (X X) X Y is a least squares solution K := {a : Xa = 0} is the null space of X If t t 0, then β t is equal to β 0 We define t max := β

19 Does cross-validation work? Prevailing heuristic: Regarding the choice of the regularization parameter, we typically use [the tuning parameter chosen by] cross-validation. Luckily, empirical and some theoretical indications support [good performance]... Peter Bühlmann s comments to Tibshirani (2011). What does theory have to say? 13

20 Stability (or lack thereoff) Sparsity inducing algorithms, such as lasso, are not (uniformly) algorithmically stable Algorithmic stability is sufficient, but not necessary, for persistence 4 4 Xu and Mannor (2008) and Bousquet and Elisseeff (2002) 14

21 Model selection and cross-validation There is a close connection between lasso and model selection [e.g. the LARS algorithm] For model selection 5... Leave-one-out cross-validation is inconsistent If c n /n 1 and n c n, then cross-validation is consistent [[c n is the size of the smallest held-out set]] Very restrictive: asymptotically, all the data is used for validation 5 Shao (1993) 15

22 Results: Cross-validation does work

23 Conditions C1. E [ t max 4 ] [ ] 1 = E β = o(t 4 n) C2. Held-out sets contain at least c n observations, don t overlap. C3. Let Z = (Y, X ) F n. Then, (F n ) n 1 is such that C < for all n where E Fn max (Z jz k E Fn Z j Z k ) 2 C 0 j,k p 16

24 Results (cont.) Theorem Assume C1 C3 and that p n = n α for some α > 0. Then, for any δ > 0, ( ) log n P (E( β t, β t n ) > δ) = o. t 2 n c n Some remarks c n n for K-fold cross-validation leave-one-out cross-validation has c n = 1 E( β t, β t n ) CAN be negative (don t care) 17

25 Properties of t n The faster t n... the less restrictive condition C1 becomes R n (β tn ) shrinks faster But, if t n grows as fast or faster than necessarily persistent [ ] ( ( ) ) 1/4 Can E β = o(t 4 n) if t n = o n log n? ( n log n) 1/4, then βtn is not Yes... 18

26 When it works... Suppose Y = m(x) + ɛ, m(x) bounded, E[ɛ 4 ] < Example 1: X i R p i.i.d sub-gaussian with independent components Example 2: Fixed design e i = i/n X ij = h 1 φ( e j e i /h) φ satisfies h 1 φ(1/h) 0 as h Example 3: Orthogonal basis regression 19

27 Future work Show similar results for lasso-type estimators, such as group lasso G is a partition of {1,..., p} G u := {β : g G g βg 2 u} Theorem Suppose 1 E [ ( g G β 0 g 2 ) 4 ] = o(u 4 n) 2 p n = n α for some α > 0 3 max g G g = a n 4 conditions C2 and C3 Then, for any δ > 0, ) P Fn (E ( β û, β un ( ) > δ = o a n u 2 n ) log n. c n 20

28 References I [1] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5): , [2] A. Chatterjee and SN Lahiri. Strong consistency of lasso estimators. Sankhya A-Mathematical Statistics and Probability, 73(1):55 78, [3] E. Greenshtein and Y.A. Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6): , [4] H. Zou, T. Hastie, and R. Tibshirani. On the degrees of freedom of the lasso. The Annals of Statistics, 35(5): , [5] R.J. Tibshirani and J. Taylor. Degrees of freedom in lasso problems. Annals of Statistics, 40: , [6] H. Wang and C. Leng. Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102(479): ,

29 References II [7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1): , [8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, [9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2): , [10] R. Tibshirani. Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3): , [11] S. van de Geer and J. Lederer. The lasso, correlated design, and improved oracle inequalities, [12] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1 22,

30 References III [13] H. Xu, S. Mannor, and C. Caramanis. Sparse algorithms are not stable: A no-free-lunch theorem. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages IEEE, [14] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2: , [15] J. Shao. Linear model selection by cross-validation. Journal of the American Statistical Association, 88: , [16] M. Rudelson and R. Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12): ,

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los