The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA
1 The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010
2 Outline: Introduction; Definition; Oracle Properties; Computations; Relationship: Nonnegative Garrote; Extensions: GLM. Main results: Theorem 2 (Oracle Properties of the Adaptive Lasso); Corollary 2 (Consistency of the Nonnegative Garrote); Theorem 4 (Oracle Properties of the Adaptive Lasso for GLM).
3 Setting: $y_i = x_i^T \beta^* + \varepsilon_i$, where $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. with mean 0 and variance $\sigma^2$. Let $A = \{ j : \beta^*_j \neq 0 \}$ and $|A| = p_0 < p$. Assume $\frac{1}{n} X^T X \to C$, where $C$ is a positive definite matrix, partitioned as $C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}$ with $C_{11}$ a $p_0 \times p_0$ matrix.
4 Definition of Oracle Procedures: We call $\delta$ an oracle procedure if $\hat\beta(\delta)$ (asymptotically) has the following oracle properties: (1) it identifies the right subset model, $\{ j : \hat\beta_j \neq 0 \} = A$; (2) $\sqrt{n}\,(\hat\beta(\delta)_A - \beta^*_A) \to_d N(0, \Sigma^*)$, where $\Sigma^*$ is the covariance matrix knowing the true subset model.
5 Definition of LASSO (Tibshirani, 1996): $\hat\beta^{(n)} = \arg\min_\beta \| y - \sum_{j=1}^p x_j \beta_j \|^2 + \lambda_n \sum_{j=1}^p |\beta_j|$, where $\lambda_n$ varies with $n$. Let $A_n = \{ j : \hat\beta^{(n)}_j \neq 0 \}$. LASSO variable selection is consistent iff $\lim_n P(A_n = A) = 1$.
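To make the criterion above concrete, here is a minimal coordinate-descent lasso solver in plain NumPy. This is an illustrative sketch, not the LARS algorithm used later in the slides; the data and the value of $\lambda$ are invented for the demo.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for: min_b ||y - X b||^2 + lam * sum_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # ||x_j||^2 for each column
    r = y.copy()                       # residual y - X b (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]        # add back x_j's contribution
            zj = X[:, j] @ r           # partial correlation with x_j
            # soft-thresholding update; lam/2 because the squared loss is unscaled
            b[j] = np.sign(zj) * max(abs(zj) - lam / 2.0, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

# Toy demo: 3 relevant predictors out of 5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=200)
b_hat = lasso_cd(X, y, lam=40.0)
```

With a large enough $\lambda_n$ the soft-thresholding step sets some coefficients exactly to zero, which is what makes the selection question ($A_n = A$?) meaningful.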
6 Proposition 1: If $\lambda_n / \sqrt{n} \to \lambda_0 \geq 0$, then $\limsup_n P(A_n = A) \leq c < 1$, where $c$ is a constant depending on the true model.
7 Theorem 1 (Necessary Condition for Consistency of LASSO): Suppose that $\lim_n P(A_n = A) = 1$. Then there exists some sign vector $s = (s_1, \dots, s_{p_0})^T$, $s_j = 1$ or $-1$, such that $|C_{21} C_{11}^{-1} s| \leq 1$ componentwise. (1)
8 Corollary 1 (an interesting case): Suppose that $p_0 = 2m + 1$ and $p = p_0 + 1$, so there is one irrelevant predictor. Let $C_{11} = (1 - \rho_1) I + \rho_1 J_1$, where $J_1$ is the matrix of 1's, $C_{12} = \rho_2 \mathbf{1}$, and $C_{22} = 1$. If $-\frac{1}{p_0 - 1} < \rho_1 < -\frac{1}{p_0}$ and $1 + (p_0 - 1)\rho_1 < \rho_2 < \sqrt{(1 + (p_0 - 1)\rho_1)/p_0}$, then condition (1) cannot be satisfied. Thus LASSO variable selection is inconsistent.
9 Corollary 1, interesting case with $m = 1$, $p_0 = 3$. [Figure: the region of $(\rho_1, \rho_2)$ values for which LASSO selection is inconsistent; axes $\rho_2$ versus $\rho_1$.]
10 Definition of the Adaptive Lasso: $\hat\beta^{*(n)} = \arg\min_\beta \| y - \sum_{j=1}^p x_j \beta_j \|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j|$, with weight vector $\hat w_j = 1 / |\hat\beta_j|^\gamma$ (data-dependent) and $\gamma > 0$. Here $\hat\beta$ is a root-$n$-consistent estimator of $\beta^*$, e.g. $\hat\beta = \hat\beta(\mathrm{ols})$. Let $A^*_n = \{ j : \hat\beta^{*(n)}_j \neq 0 \}$.
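A common way to compute the adaptive lasso is to fold the weights into the design: rescale each column as $x_j / \hat w_j$, solve an ordinary lasso, then rescale the solution back. A minimal NumPy sketch of this trick (the solver, data, and $\lambda$ are illustrative assumptions, not the paper's LARS-based implementation):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Plain lasso by coordinate descent (the slide-5 criterion)."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]
            zj = X[:, j] @ r
            b[j] = np.sign(zj) * max(abs(zj) - lam / 2.0, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive lasso via column rescaling, with an OLS pilot estimate."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # root-n-consistent pilot
    w = 1.0 / np.abs(b_ols) ** gamma                # hat-w_j = 1 / |bhat_j|^gamma
    X_scaled = X / w                                # column j divided by w_j
    b_scaled = lasso_cd(X_scaled, y, lam)
    return b_scaled / w                             # undo the rescaling

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
beta_true = np.array([4.0, 0.0, 0.0, 2.0])
y = X @ beta_true + rng.normal(size=300)
b_ada = adaptive_lasso(X, y, lam=50.0, gamma=1.0)
```

The rescaling works because substituting $u_j = \hat w_j \beta_j$ turns the weighted penalty $\lambda_n \sum_j \hat w_j |\beta_j|$ into an ordinary $\ell_1$ penalty on $u$.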
11 Penalty and Thresholding Functions of LASSO, SCAD, and the Adaptive Lasso. [Figure 1 of Zou (2006): plot of thresholding functions with $\lambda = 2$ for (a) the hard rule; (b) bridge $L_{0.5}$; (c) the lasso; (d) the SCAD; (e) and (f) the adaptive lasso for two values of $\gamma$.]
12 Remarks: The data-dependent weight $\hat w$ is the key to the oracle properties. As $n$ grows, the weights for zero-coefficient predictors get inflated (to infinity), while the weights for nonzero-coefficient predictors converge to a finite constant. In the view of Fan and Li (2001) (presented by Yang Zhao), the adaptive lasso satisfies the three properties of a good penalty function: unbiasedness, sparsity, and continuity.
13 Theorem 2 (Oracle Properties of the Adaptive Lasso): Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then the adaptive lasso estimates must satisfy: (1) consistency in variable selection, $\lim_n P(A^*_n = A) = 1$; (2) asymptotic normality, $\sqrt{n}\,(\hat\beta^{*(n)}_A - \beta^*_A) \to_d N(0, \sigma^2 C_{11}^{-1})$.
14 Computations: The adaptive lasso estimates can be solved by the LARS algorithm (Efron et al., 2004); the entire solution path can be computed at the same order of computation as a single OLS fit. Tuning: if we use $\hat\beta(\mathrm{ols})$, use two-dimensional cross-validation to find an optimal pair $(\gamma, \lambda_n)$; or use three-dimensional cross-validation to find an optimal triple $(\hat\beta, \gamma, \lambda_n)$. $\hat\beta(\mathrm{ridge})$ from the best ridge regression fit may be used instead when collinearity is a concern.
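The two-dimensional tuning described above can be sketched as a small grid search over $(\gamma, \lambda)$ with K-fold cross-validation. Everything here (the solver, the grid values, the data) is an illustrative stand-in, not the paper's exact procedure:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=300):
    """Plain lasso by coordinate descent."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]
            zj = X[:, j] @ r
            b[j] = np.sign(zj) * max(abs(zj) - lam / 2.0, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def adaptive_lasso(X, y, lam, gamma):
    """Adaptive lasso via column rescaling with OLS pilot weights."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.abs(b_ols) ** gamma
    return lasso_cd(X / w, y, lam) / w

def cv_mse(X, y, lam, gamma, k=5):
    """Mean validation MSE of the adaptive lasso over k folds."""
    n = X.shape[0]
    errs = []
    for val in np.array_split(np.arange(n), k):
        train = np.setdiff1d(np.arange(n), val)
        b = adaptive_lasso(X[train], y[train], lam, gamma)
        errs.append(np.mean((y[val] - X[val] @ b) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=200)

# 2-D grid search over (gamma, lambda), as the slide suggests for an OLS pilot
grid = [(g, l) for g in (0.5, 1.0, 2.0) for l in (1.0, 10.0, 50.0)]
best_gamma, best_lam = min(grid, key=lambda gl: cv_mse(X, y, gl[1], gl[0]))
b_best = adaptive_lasso(X, y, best_lam, best_gamma)
```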
15 Definition of the Nonnegative Garrote (Breiman, 1995): $\hat\beta_j(\mathrm{garrote}) = c_j \hat\beta_j(\mathrm{ols})$, where the nonnegative scaling factors $\{c_j\}$ minimize $\| y - \sum_{j=1}^p c_j \hat\beta_j(\mathrm{ols})\, x_j \|^2 + \lambda_n \sum_{j=1}^p c_j$ subject to $c_j \geq 0$ for all $j$. A sufficiently large $\lambda_n$ shrinks some $c_j$ to exactly 0, i.e. $\hat\beta_j(\mathrm{garrote}) = 0$. Yuan and Lin (2007) also studied the consistency of the nonnegative garrote.
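Breiman's criterion is a nonnegatively constrained lasso in the scaled design $z_j = \hat\beta_j(\mathrm{ols})\, x_j$, so coordinate descent works with a one-sided soft-thresholding update. A minimal NumPy sketch (the data and $\lambda$ are invented for the demo):

```python
import numpy as np

def nn_garrote(X, y, lam, n_iter=500):
    """Nonnegative garrote: find c_j >= 0 minimizing
    ||y - sum_j c_j * bols_j * x_j||^2 + lam * sum_j c_j,
    then return the shrunken coefficients c_j * bols_j."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    Z = X * b_ols                          # column j is bols_j * x_j
    col_sq = (Z ** 2).sum(axis=0)
    p = Z.shape[1]
    c, r = np.zeros(p), y.copy()
    for _ in range(n_iter):
        for j in range(p):
            r += Z[:, j] * c[j]
            zj = Z[:, j] @ r
            # one-sided soft threshold: the constraint c_j >= 0 clips at zero
            c[j] = max(0.0, (zj - lam / 2.0) / col_sq[j])
            r -= Z[:, j] * c[j]
    return c * b_ols

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
beta_true = np.array([2.0, 0.0, -3.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=200)
b_gar = nn_garrote(X, y, lam=40.0)
```

Note that the sign of each garrote coefficient is inherited from OLS; only the magnitude is shrunk, which is exactly the sign-constraint view on the next slide.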
16 Garrote: Formulation and Consistency. Formulation: $\hat\beta(\mathrm{garrote}) = \arg\min_\beta \| y - \sum_{j=1}^p x_j \beta_j \|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j|$ subject to $\beta_j \hat\beta_j(\mathrm{ols}) \geq 0$ for all $j$, where $\gamma = 1$ and $\hat w_j = 1 / |\hat\beta_j(\mathrm{ols})|$. Corollary 2 (Consistency of the Nonnegative Garrote): If we choose a $\lambda_n$ such that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$, then the nonnegative garrote is consistent for variable selection.
17 The Adaptive Lasso for GLM: $\hat\beta^{*(n)}(\mathrm{glm}) = \arg\min_\beta \sum_{i=1}^n \left( -y_i x_i^T \beta + \phi(x_i^T \beta) \right) + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j|$, with weight vector $\hat w_j = 1 / |\hat\beta_j(\mathrm{mle})|^\gamma$ for some $\gamma > 0$. The model is $f(y \mid x, \theta) = h(y) \exp(y\theta - \phi(\theta))$, where $\theta = x^T \beta$. The Fisher information matrix is $I(\beta^*) = \begin{bmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{bmatrix}$, where $I_{11}$ is a $p_0 \times p_0$ matrix; then $I_{11}$ is the Fisher information matrix with the true submodel known.
18 Theorem 4 (Oracle Properties of the Adaptive Lasso for GLM): Let $A^*_n = \{ j : \hat\beta_j(\mathrm{glm}) \neq 0 \}$. Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then, under some mild regularity conditions, the adaptive lasso estimate $\hat\beta^{*(n)}(\mathrm{glm})$ must satisfy: (1) consistency in variable selection, $\lim_n P(A^*_n = A) = 1$; (2) asymptotic normality, $\sqrt{n}\,(\hat\beta^{*(n)}_A(\mathrm{glm}) - \beta^*_A) \to_d N(0, I_{11}^{-1})$.
19 Experimental Setting: We let $y = x^T \beta^* + N(0, \sigma^2)$, where the true regression coefficients are $\beta^* = (5.6, 5.6, 5.6, 0)$. The predictors $x_i$ ($i = 1, \dots, n$) are i.i.d. $N(0, C)$, where $C$ is the matrix in Corollary 1 with $\rho_1 = -0.39$ and $\rho_2 = 0.23$ (the red point in the figure on slide 9).
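The claim that this $(\rho_1, \rho_2)$ pair defeats the lasso can be checked directly: condition (1) asks for some sign vector $s$ with $|C_{21} C_{11}^{-1} s| \leq 1$, so we enumerate every sign vector and verify the bound fails for all of them, while $C$ stays positive definite. A small NumPy check of that arithmetic:

```python
import itertools
import numpy as np

p0, rho1, rho2 = 3, -0.39, 0.23
C11 = (1 - rho1) * np.eye(p0) + rho1 * np.ones((p0, p0))
C12 = rho2 * np.ones((p0, 1))
C = np.block([[C11, C12], [C12.T, np.ones((1, 1))]])

# C must be a valid (positive definite) limiting design matrix
assert np.all(np.linalg.eigvalsh(C) > 0)

# |C21 C11^{-1} s| for every sign vector s in {-1, +1}^{p0}
vals = [abs(C12.T.flatten() @ np.linalg.solve(C11, np.array(s)))
        for s in itertools.product((-1.0, 1.0), repeat=p0)]
min_val = min(vals)   # condition (1) fails iff this minimum exceeds 1
```

Because the minimum over all sign vectors exceeds 1, no $s$ satisfies condition (1), so by Theorem 1 the lasso cannot be selection-consistent in this design.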
20 Numerical Experiments: In both models we simulated 100 training datasets for each combination of $(n, \sigma^2)$. All of the training and tuning were done on the training set. We also collected independent test datasets of 10,000 observations to compute the relative prediction error (RPE). To estimate the standard error of the RPE, we generated bootstrapped samples from the 100 RPEs and calculated the bootstrapped standard error. Table 1 (Simulation Model 0) reports the probability of containing the true model in the solution path, for the lasso and the adaptive lasso ($\gamma = 0.5$, $\gamma = 1$, $\gamma = 2$, and $\gamma$ selected by five-fold cross-validation from those three choices) at $(n = 60, \sigma = 9)$, $(n = 120, \sigma = 5)$, and $(n = 300, \sigma = 3)$. [The table's numerical entries are not recoverable from this transcription.]
21 General Observations: Comparison of LASSO, the adaptive lasso, SCAD, and the nonnegative garrote, with $p = 8$ and $p_0 = 3$. We consider a few large effects ($n = 20, 60$) and many small effects ($n = 40, 80$). LASSO performs best when the SNR is low. The adaptive lasso, SCAD, and the nonnegative garrote outperform LASSO with a medium or high level of SNR. The adaptive lasso tends to be more stable than SCAD. LASSO tends to select noise variables more often than the other methods.
22 Theorem 2 (Oracle Properties of the Adaptive Lasso, restated): Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then the adaptive lasso estimates must satisfy: (1) consistency in variable selection, $\lim_n P(A^*_n = A) = 1$; (2) asymptotic normality, $\sqrt{n}\,(\hat\beta^{*(n)}_A - \beta^*_A) \to_d N(0, \sigma^2 C_{11}^{-1})$.
23 Proof of Theorem 2, Asymptotic Normality: Let $\beta = \beta^* + u/\sqrt{n}$ and $\Psi_n(u) = \| y - \sum_{j=1}^p x_j (\beta^*_j + u_j/\sqrt{n}) \|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta^*_j + u_j/\sqrt{n}|$. Let $\hat u^{(n)} = \arg\min_u \Psi_n(u)$; then $\hat u^{(n)} = \sqrt{n}\,(\hat\beta^{*(n)} - \beta^*)$. Now $\Psi_n(u) - \Psi_n(0) = V_4^{(n)}(u)$, where $V_4^{(n)}(u) = u^T \left( \tfrac{1}{n} X^T X \right) u - 2 \tfrac{\varepsilon^T X}{\sqrt{n}} u + \tfrac{\lambda_n}{\sqrt{n}} \sum_{j=1}^p \hat w_j \sqrt{n} \left( |\beta^*_j + u_j/\sqrt{n}| - |\beta^*_j| \right)$.
24 Proof of Theorem 2, Asymptotic Normality (cont.): Then $V_4^{(n)}(u) \to_d V_4(u)$ for every $u$, where $V_4(u) = u_A^T C_{11} u_A - 2 u_A^T W_A$ if $u_j = 0$ for all $j \notin A$, and $V_4(u) = \infty$ otherwise, with $W_A \sim N(0, \sigma^2 C_{11})$. $V_4^{(n)}$ is convex, and the unique minimum of $V_4$ is $(C_{11}^{-1} W_A, 0)^T$. Following the epi-convergence results of Geyer (1994), we have $\hat u^{(n)}_A \to_d C_{11}^{-1} W_A$ and $\hat u^{(n)}_{A^c} \to_d 0$. Hence we have proven the asymptotic normality part.
25 Proof of Theorem 2, Consistency: The asymptotic normality result indicates that for all $j \in A$, $\hat\beta^{*(n)}_j \to_p \beta^*_j$; thus $P(j \in A^*_n) \to 1$. Then it suffices to show that for all $j \notin A$, $P(j \in A^*_n) \to 0$. Consider the event $j \in A^*_n$. By the KKT optimality conditions, $2 x_j^T ( y - X \hat\beta^{*(n)} ) = \lambda_n \hat w_j$. Note that $\lambda_n \hat w_j / \sqrt{n} = \lambda_n n^{(\gamma - 1)/2} / |\sqrt{n}\, \hat\beta_j|^\gamma \to_p \infty$, whereas $2 x_j^T (y - X \hat\beta^{*(n)}) / \sqrt{n} = 2 \tfrac{x_j^T X}{n} \sqrt{n}\,(\beta^* - \hat\beta^{*(n)}) + 2 \tfrac{x_j^T \varepsilon}{\sqrt{n}}$, and each of these two terms converges to some normal distribution. Thus $P(j \in A^*_n) \leq P\left( 2 x_j^T ( y - X \hat\beta^{*(n)} ) = \lambda_n \hat w_j \right) \to 0$.
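The KKT identity this step relies on (equality $2x_j^T(y - X\hat\beta) = \lambda \hat w_j$ in magnitude on the active set, and an inequality off it) can be verified numerically for a fitted lasso. Shown here for the plain lasso ($\hat w_j \equiv 1$) with an illustrative coordinate-descent solver and made-up data:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=1000):
    """Plain lasso by coordinate descent."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]
            zj = X[:, j] @ r
            b[j] = np.sign(zj) * max(abs(zj) - lam / 2.0, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 0.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=200)
lam = 40.0
b_hat = lasso_cd(X, y, lam)

g = 2.0 * X.T @ (y - X @ b_hat)     # the KKT quantity 2 x_j^T (y - X beta_hat)
active = b_hat != 0
# Stationarity: |g_j| equals lam on the active set ...
assert np.allclose(np.abs(g[active]), lam, atol=1e-6)
# ... and |g_j| <= lam on the inactive set
assert np.all(np.abs(g[~active]) <= lam + 1e-6)
```

The proof's argument is that on the event $j \in A^*_n$ this equality must hold, yet its two sides live on incompatible scales as $n \to \infty$.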
26 Corollary 2 (Consistency of the Nonnegative Garrote, restated): Formulation: $\hat\beta(\mathrm{garrote}) = \arg\min_\beta \| y - \sum_{j=1}^p x_j \beta_j \|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j|$ subject to $\beta_j \hat\beta_j(\mathrm{ols}) \geq 0$ for all $j$, where $\gamma = 1$ and $\hat w_j = 1 / |\hat\beta_j(\mathrm{ols})|$. If we choose a $\lambda_n$ such that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$, then the nonnegative garrote is consistent for variable selection.
27 Proof of Corollary 2: Let $\hat\beta^{*(n)}$ be the adaptive lasso estimates with $\gamma = 1$. By Theorem 2, $\hat\beta^{*(n)}$ is an oracle estimator if $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$ (for $\gamma = 1$, $\lambda_n n^{(\gamma-1)/2} = \lambda_n$). To show the consistency, it suffices to show that $\hat\beta^{*(n)}$ satisfies the sign constraint with probability tending to 1. Pick any $j$. If $j \in A$, then $\hat\beta^{*(n)}(\gamma = 1)_j\, \hat\beta(\mathrm{ols})_j \to_p (\beta^*_j)^2 > 0$. If $j \notin A$, then $P\left( \hat\beta^{*(n)}(\gamma = 1)_j\, \hat\beta(\mathrm{ols})_j \geq 0 \right) \geq P\left( \hat\beta^{*(n)}(\gamma = 1)_j = 0 \right) \to 1$. In either case, $P\left( \hat\beta^{*(n)}(\gamma = 1)_j\, \hat\beta(\mathrm{ols})_j \geq 0 \right) \to 1$ for every $j = 1, 2, \dots, p$.
28 Theorem 4 (Oracle Properties of the Adaptive Lasso for GLM, restated): Let $A^*_n = \{ j : \hat\beta_j(\mathrm{glm}) \neq 0 \}$. Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then, under some mild regularity conditions, the adaptive lasso estimate $\hat\beta^{*(n)}(\mathrm{glm})$ must satisfy: (1) consistency in variable selection, $\lim_n P(A^*_n = A) = 1$; (2) asymptotic normality, $\sqrt{n}\,(\hat\beta^{*(n)}_A(\mathrm{glm}) - \beta^*_A) \to_d N(0, I_{11}^{-1})$. Recall $f(y \mid x, \theta) = h(y) \exp(y\theta - \phi(\theta))$, where $\theta = x^T \beta$.
29 Theorem 4, Regularity Conditions: (1) The Fisher information matrix is finite and positive definite, $I(\beta^*) = E\left[ \phi''(x^T \beta^*)\, x x^T \right]$. (2) There is a sufficiently large open set $O$ containing $\beta^*$ such that for all $\beta \in O$, $|\phi'''(x^T \beta)| \leq M(x) < \infty$ and $E[M(x)\, |x_j x_k x_l|] < \infty$ for all $1 \leq j, k, l \leq p$.
30 Proof of Theorem 4, Asymptotic Normality: Let $\beta = \beta^* + u/\sqrt{n}$. Define $\Gamma_n(u) = \sum_{i=1}^n \left\{ -y_i x_i^T (\beta^* + u/\sqrt{n}) + \phi\left( x_i^T (\beta^* + u/\sqrt{n}) \right) \right\} + \lambda_n \sum_{j=1}^p \hat w_j |\beta^*_j + u_j/\sqrt{n}|$. Let $\hat u^{(n)} = \arg\min_u \Gamma_n(u)$; then $\hat u^{(n)} = \sqrt{n}\,(\hat\beta^{*(n)}(\mathrm{glm}) - \beta^*)$. Using the Taylor expansion, we have $\Gamma_n(u) - \Gamma_n(0) = H^{(n)}(u)$, where $H^{(n)}(u) = A_1^{(n)} + A_2^{(n)} + A_3^{(n)} + A_4^{(n)}$, with $A_1^{(n)} = -\sum_{i=1}^n \left[ y_i - \phi'(x_i^T \beta^*) \right] \tfrac{x_i^T u}{\sqrt{n}}$, $A_2^{(n)} = \sum_{i=1}^n \tfrac{1}{2} \phi''(x_i^T \beta^*)\, u^T \tfrac{x_i x_i^T}{n} u$, $A_3^{(n)} = \tfrac{\lambda_n}{\sqrt{n}} \sum_{j=1}^p \hat w_j \sqrt{n} \left( |\beta^*_j + u_j/\sqrt{n}| - |\beta^*_j| \right)$,
31 Proof of Theorem 4, Asymptotic Normality (cont.): and $A_4^{(n)} = n^{-3/2} \sum_{i=1}^n \tfrac{1}{6} \phi'''(x_i^T \tilde\beta)\, (x_i^T u)^3$, where $\tilde\beta$ is between $\beta^*$ and $\beta^* + u/\sqrt{n}$. Then, by regularity conditions 1 and 2, $H^{(n)}(u) \to_d H(u)$ for every $u$, where $H(u) = u_A^T I_{11} u_A - 2 u_A^T W_A$ if $u_j = 0$ for all $j \notin A$, and $H(u) = \infty$ otherwise, with $W_A \sim N(0, I_{11})$. $H^{(n)}$ is convex, and the unique minimum of $H$ is $(I_{11}^{-1} W_A, 0)^T$. Following the epi-convergence results of Geyer (1994), we have $\hat u^{(n)}_A \to_d I_{11}^{-1} W_A$ and $\hat u^{(n)}_{A^c} \to_d 0$, and the asymptotic normality part is proven.
32 Proof of Theorem 4, Consistency: The asymptotic normality result indicates that for all $j \in A$, $P(j \in A^*_n) \to 1$. Then it suffices to show that for all $j \notin A$, $P(j \in A^*_n) \to 0$. Consider the event $j \in A^*_n$. By the KKT optimality conditions, $\sum_{i=1}^n x_{ij} \left( y_i - \phi'\left( x_i^T \hat\beta^{*(n)}(\mathrm{glm}) \right) \right) = \lambda_n \hat w_j$. Expanding, $\sum_{i=1}^n x_{ij} \left( y_i - \phi'\left( x_i^T \hat\beta^{*(n)}(\mathrm{glm}) \right) \right) / \sqrt{n} = B_1^{(n)} + B_2^{(n)} + B_3^{(n)}$, with $B_1^{(n)} = \sum_{i=1}^n x_{ij} \left( y_i - \phi'(x_i^T \beta^*) \right) / \sqrt{n}$, $B_2^{(n)} = \left( \tfrac{1}{n} \sum_{i=1}^n x_{ij}\, \phi''(x_i^T \beta^*)\, x_i^T \right) \sqrt{n}\,( \beta^* - \hat\beta^{*(n)}(\mathrm{glm}) )$, and $B_3^{(n)} = -\tfrac{1}{2\sqrt{n}} \left( \tfrac{1}{n} \sum_{i=1}^n x_{ij}\, \phi'''(x_i^T \tilde\beta) \left( x_i^T \sqrt{n}\,( \beta^* - \hat\beta^{*(n)}(\mathrm{glm}) ) \right)^2 \right)$, where $\tilde\beta$ is between $\hat\beta^{*(n)}(\mathrm{glm})$ and $\beta^*$.
33 Proof of Theorem 4, Consistency (cont.): $B_1^{(n)}$ and $B_2^{(n)}$ converge to some normal distributions, and $B_3^{(n)} = O_p(1/\sqrt{n})$. Meanwhile, $\lambda_n \hat w_j / \sqrt{n} = \lambda_n n^{(\gamma - 1)/2} / |\sqrt{n}\, \hat\beta_j(\mathrm{glm})|^\gamma \to_p \infty$. Thus $P(j \in A^*_n) \leq P\left( \sum_{i=1}^n x_{ij} \left( y_i - \phi'\left( x_i^T \hat\beta^{*(n)}(\mathrm{glm}) \right) \right) = \lambda_n \hat w_j \right) \to 0$, and this completes the proof.
ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University
More informationSingle Index Quantile Regression for Heteroscedastic Data
Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR
More informationOn High-Dimensional Cross-Validation
On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationConfidence Intervals for Low-dimensional Parameters with High-dimensional Data
Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More informationGenomics, Transcriptomics and Proteomics in Clinical Research. Statistical Learning for Analyzing Functional Genomic Data. Explanation vs.
Genomics, Transcriptomics and Proteomics in Clinical Research Statistical Learning for Analyzing Functional Genomic Data German Cancer Research Center, Heidelberg, Germany June 16, 6 Diagnostics signatures
More informationSimultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR
Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation
More informationPENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract
PENALIZED PRINCIPAL COMPONENT REGRESSION by Ayanna Byrd (Under the direction of Cheolwoo Park) Abstract When using linear regression problems, an unbiased estimate is produced by the Ordinary Least Squares.
More informationNonnegative Garrote Component Selection in Functional ANOVA Models
Nonnegative Garrote Component Selection in Functional ANOVA Models Ming Yuan School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 3033-005 Email: myuan@isye.gatech.edu
More informationSparse survival regression
Sparse survival regression Anders Gorst-Rasmussen gorst@math.aau.dk Department of Mathematics Aalborg University November 2010 1 / 27 Outline Penalized survival regression The semiparametric additive risk
More informationSpatial Lasso with Application to GIS Model Selection. F. Jay Breidt Colorado State University
Spatial Lasso with Application to GIS Model Selection F. Jay Breidt Colorado State University with Hsin-Cheng Huang, Nan-Jung Hsu, and Dave Theobald September 25 The work reported here was developed under
More informationLASSO-type penalties for covariate selection and forecasting in time series
LASSO-type penalties for covariate selection and forecasting in time series Evandro Konzen 1 Flavio A. Ziegelmann 2 Abstract This paper studies some forms of LASSO-type penalties in time series to reduce
More informationSaharon Rosset 1 and Ji Zhu 2
Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.
More informationTwo Tales of Variable Selection for High Dimensional Regression: Screening and Model Building
Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection
More informationVariable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function
Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-03-10 Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function
More informationarxiv: v2 [math.st] 9 Feb 2017
Submitted to the Annals of Statistics PREDICTION ERROR AFTER MODEL SEARCH By Xiaoying Tian Harris, Department of Statistics, Stanford University arxiv:1610.06107v math.st 9 Feb 017 Estimation of the prediction
More informationarxiv: v2 [stat.ml] 22 Feb 2008
arxiv:0710.0508v2 [stat.ml] 22 Feb 2008 Electronic Journal of Statistics Vol. 2 (2008) 103 117 ISSN: 1935-7524 DOI: 10.1214/07-EJS125 Structured variable selection in support vector machines Seongho Wu
More informationSparsity and the Lasso
Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationLeast Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding
Least Absolute Gradient Selector: Statistical Regression via Pseudo-Hard Thresholding Kun Yang Trevor Hastie April 0, 202 Abstract Variable selection in linear models plays a pivotal role in modern statistics.
More informationCOMS 4771 Lecture Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso
COMS 477 Lecture 6. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso / 2 Fixed-design linear regression Fixed-design linear regression A simplified
More informationAnalysis Methods for Supersaturated Design: Some Comparisons
Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs
More informationInstitute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR
DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable
More informationTuning Parameter Selection in Regularized Estimations of Large Covariance Matrices
Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia
More informationLUO, BIN, M.A. Robust High-dimensional Data Analysis Using A Weight Shrinkage Rule. (2016) Directed by Dr. Xiaoli Gao. 73 pp.
LUO, BIN, M.A. Robust High-dimensional Data Analysis Using A Weight Shrinkage Rule. (2016) Directed by Dr. Xiaoli Gao. 73 pp. In high-dimensional settings, a penalized least squares approach may lose its
More informationASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS
The Annals of Statistics 2008, Vol. 36, No. 2, 587 613 DOI: 10.1214/009053607000000875 Institute of Mathematical Statistics, 2008 ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION
More informationConsistent Selection of Tuning Parameters via Variable Selection Stability
Journal of Machine Learning Research 14 2013 3419-3440 Submitted 8/12; Revised 7/13; Published 11/13 Consistent Selection of Tuning Parameters via Variable Selection Stability Wei Sun Department of Statistics
More informationLarge Sample Theory For OLS Variable Selection Estimators
Large Sample Theory For OLS Variable Selection Estimators Lasanthi C. R. Pelawa Watagoda and David J. Olive Southern Illinois University June 18, 2018 Abstract This paper gives large sample theory for
More information