Adaptive Lasso for correlated predictors

Size: px

Start display at page:

Download "Adaptive Lasso for correlated predictors"

Felix Paul
5 years ago
Views:

1 Adaptive Lasso for correlated predictors Keith Knight Department of Statistics University of Toronto This research was supported by NSERC of Canada.

2 OUTLINE 1. Introduction 2. The Lasso under collinearity 3. Projection pursuit with the Lasso 4. Example: Diabetes data

3 1. INTRODUCTION Assume a linear model for {(xi.yi) : i = 1., n}: Yi = β0 + β1x1i + + βpxpi + εi = x T i β + εi (i = 1,, n) Assume that the predictors are centred and scaled to have mean 0 and variance 1. We can estimate β0 by Ȳ least squares estimator. Thus we can assume that {Yi} are centred to have mean 0. In many applications, p can be much greater than n. In this talk, we will assume implicitly that p < n.

4 Shrinkage estimation Bridge regression: Minimize (Yi x T i β) 2 + λ p βj γ j=1 for some γ > 0. Includes the Lasso (Tibshirani, 1996) and ridge regression as special cases with γ = 1 and 2 respectively. For γ 1, it s possible to obtain exact 0 parameter estimates. Many other variations of the Lasso: elastic nets (Zou & Hastie, 2005), fused lasso (Tibshirani et al., 2006) among others. The Dantzig selector of Candès & Tao (2007) is similar in spirit to the Lasso.

5 Stagewise fitting: Given β (k), minimize (Yi x T β (k) i x T i φ) 2 over φ with all but 1 (or a small number) of its elements equal to 0. Then define β(k+1) = β (k) + ɛ φ (0 < ɛ 1) and repeat until convergence. This is a special case of boosting (Shapire, 1990). Also related to LARS (Efron et al., 2004), which in turn is related to the Lasso.

6 2. THE LASSO UNDER COLLINEARITY For given λ, the Lasso estimator β(λ) can be defined in a number of equivalent ways: 1. β(λ) minimizes (Yi x T i β) 2 subject to p j=1 β j t(λ); 2. β(λ) minimizes (x T i β) 2 subject to (Yi x T i β)xij λ for j = 1,, p.

7 The advantage of the Lasso is that it produces exact 0 estimates while β(λ) is a smooth function of λ. This is very useful when p n to produce sparse models. However, when the predictors {xi} are highly correlated then β(λ) may contain too many zeroes. This is not necessarily undesirable but some important effects may be missed as a result. How does one interpret a sparse model under high collinearity?

8 Question: Why does this happen? Answer: Redundancy in the constraints (Yi x T i β)xij λ for j = 1,, p due to collinearity; that is, we don t have p independent constraints. The Dantzig selector minimizes j β j subject to similar constraints on the correlations, and thus will tend to behave similarly.

9 For LS estimation (λ = 0), we have (Yi x T β)x i T i a = 0 for any a. Similarly, we could try to consider estimates β such that (Yi x T i β)x T i al λ for some set of vectors (projections) {al : l L}. If the set L is finite, we can incorporate predictors {a T l x} into the Lasso.

10 Example: Principal components regression ( L = p) where a1,, ap are the eigenvectors of C = xix T i. However... Projections obtain via PC are based solely on information in the design. Moreover, they need not be particular easy to interpret. More generally, there s no problem in taking L p.

11 3. PROJECTION PURSUIT WITH THE LASSO For collinear predictors, it s often desirable to consider projections of the original predictors. Given predictors x1,, xp and projections {al : l L}, we want to identify interesting (data-driven) projections al1,, alp and define new predictors a T x,, l1 at x. lp We can take L to be very large but the projections we consider should be easily interpretable. Coordinate projections (i.e. original predictors). Sums and differences of two or more predictors.

12 Question: How do we do this? Answer: Two possibilities: Use the Lasso on the projections. But we need to worry about the choice of λ. The active projections will depend on λ. Look at the Lasso solution as λ 0. This identifies a set of p projections. These projections can be used in the Lasso.

13 Question: What happens to the Lasso solution as λ 0? Suppose that β(λ) minimizes p (Yi x T i β) 2 + λ βj j=1 and that C = xix T i is singular. Define D = { φ : (Yi x T i φ) 2 = min β (Yi x T i β) 2 }.

14 Proposition: For the Lasso estimate β(λ), we have p lim β(λ) = argmin φj : φ D λ 0. j=1 Proof. Assume (for simplicity) that the minimum RSS is 0. Then β(λ) minimizes Zλ(β) = 1 λ (Yi x T i β) 2 + p βj. j=1 As λ 0, the first term of Zλ blows up for β D and is exactly 0 for β D. The conclusion follows using convexity of Zλ. Corollary: The Dantzig selector estimator has the same limit as λ 0.

15 In our problem, define til to be a scaled version of a T l x i. The model now becomes Yi = l L φltil + εi = t T i φ + εi (i = 1,, n) We estimate φ by minimizing φl subject to l L (Yi t T i φ)ti = 0. This can be solved using linear programming methods. Software for the Lasso tends to be unstable as λ 0.

16 Asymptotics: Assume p < r = L are fixed and n. Define matrices 1 C = lim n n 1 D = lim n n xix T i tit T i where C is non-singular and D singular with rank p. Then φ n p some φ 0. We also have d n( φ n φ 0 ) V where the distribution of V is concentrated on the orthogonal complement of the null space of D.

17 4. EXAMPLE Diabetes data (Efron et al., 2004) Response: measure of disease progression. Predictors: age, sex, BMI, blood pressure, and 6 blood serum measurements (TC, LDL, HDL, TCH, LTG, GLU). Some predictors are quite highly correlated. Analysis indicates that the most important variables are LTG, BMI, BP, TC, and sex. Look at coordinate-wise projections as well as pairwise sums and differences (100 projections in total).

18 coefficients proportional bound Lasso plot for original predictors. LTG BMI LDL BP TCH HDL GLU AGE SEX TC

19 Results: Estimated projections Projections Estimates BMI + LTG LTG TC LDL TC BP SEX 9.61 BMI + BP 6.64 BMI + GLU 5.36 BP + LTG 5.33 TCH SEX 4.18 HDL + TCH 3.48 BP AGE 0.55

20 coefficients proportional bound Lasso plot for the 10 identified projections. BMI+LTG LTG TC LDL TC BP SEX BMI+BP BMI+GLU BP+LTG TCH SEX HDL+TCH BP AGE

21 coefficients LTG BMI LDL BP TCH HDL GLU AGE SEX TC proportional bound Lasso trajectories for original predictors using the projections.

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression