The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

Presented by Dongjun Chung, March 12, 2010

Outline: Introduction; Definition; Oracle Properties; Computations; Relationship to the Nonnegative Garrote; Extensions to GLMs. Main results: Theorem 2 (Oracle Properties of the Adaptive Lasso), Corollary 2 (Consistency of the Nonnegative Garrote), and Theorem 4 (Oracle Properties of the Adaptive Lasso for GLMs).

Setting: $y_i = x_i^T \beta^* + \varepsilon_i$, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. with mean 0 and variance $\sigma^2$. Let $\mathcal{A} = \{ j : \beta^*_j \neq 0 \}$ with $|\mathcal{A}| = p_0 < p$. Assume $\frac{1}{n} X^T X \to C$, where $C$ is a positive definite matrix partitioned as
$$C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix},$$
where $C_{11}$ is a $p_0 \times p_0$ matrix.

Definition of Oracle Procedures: We call $\delta$ an oracle procedure if $\hat\beta(\delta)$ (asymptotically) has the following oracle properties:
1. Identifies the right subset model: $\{ j : \hat\beta_j \neq 0 \} = \mathcal{A}$.
2. Asymptotic normality: $\sqrt{n} \left( \hat\beta(\delta)_{\mathcal{A}} - \beta^*_{\mathcal{A}} \right) \to_d N(0, \Sigma^*)$, where $\Sigma^*$ is the covariance matrix knowing the true subset model.

Definition of the LASSO (Tibshirani, 1996):
$$\hat\beta^{(n)} = \arg\min_\beta \left\| y - \sum_{j=1}^p x_j \beta_j \right\|^2 + \lambda_n \sum_{j=1}^p |\beta_j|,$$
where $\lambda_n$ varies with $n$. Let $\mathcal{A}_n = \{ j : \hat\beta^{(n)}_j \neq 0 \}$. LASSO variable selection is consistent if and only if $\lim_n P(\mathcal{A}_n = \mathcal{A}) = 1$.

Proposition 1: If $\lambda_n / \sqrt{n} \to \lambda_0 \geq 0$, then $\limsup_n P(\mathcal{A}_n = \mathcal{A}) \leq c < 1$, where $c$ is a constant depending on the true model.

Theorem 1 (Necessary Condition for Consistency of the LASSO): Suppose that $\lim_n P(\mathcal{A}_n = \mathcal{A}) = 1$. Then there exists some sign vector $s = (s_1, \ldots, s_{p_0})^T$, $s_j = 1$ or $-1$, such that
$$\left| C_{21} C_{11}^{-1} s \right| \leq \mathbf{1}, \quad (1)$$
where the inequality holds element-wise.

Corollary 1 (an interesting case): Suppose that $p_0 = 2m + 1 \geq 3$ and $p = p_0 + 1$, so there is one irrelevant predictor. Let $C_{11} = (1 - \rho_1) I + \rho_1 J_1$, where $J_1$ is the matrix of 1's, $C_{12} = \rho_2 \mathbf{1}$, and $C_{22} = 1$. If
$$-\frac{1}{p_0 - 1} < \rho_1 < -\frac{1}{p_0} \quad \text{and} \quad 1 + (p_0 - 1)\rho_1 < |\rho_2| < \sqrt{\bigl(1 + (p_0 - 1)\rho_1\bigr)/p_0},$$
then condition (1) cannot be satisfied. Thus LASSO variable selection is inconsistent.

[Figure: Corollary 1, the interesting case with $m = 1$, $p_0 = 3$; the region of $(\rho_1, \rho_2)$ values, with both axes ranging from $-1.0$ to $1.0$, for which LASSO variable selection is inconsistent.]
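
As a quick numerical sanity check (not part of the original slides), one can verify that condition (1) indeed fails at the point $(\rho_1, \rho_2) = (-0.39, 0.23)$ used in the simulations below: for every sign vector $s$, $|C_{21} C_{11}^{-1} s| > 1$. A minimal sketch in Python, with illustrative variable names:

```python
import itertools
import numpy as np

p0, rho1, rho2 = 3, -0.39, 0.23
C11 = (1 - rho1) * np.eye(p0) + rho1      # C_11 = (1 - rho1) I + rho1 J_1
C21 = np.full(p0, rho2)                   # C_21 = rho2 * 1^T (one irrelevant predictor)
for s in itertools.product([-1.0, 1.0], repeat=p0):
    val = abs(C21 @ np.linalg.solve(C11, np.array(s)))
    print(s, round(val, 3))               # every value exceeds 1, so condition (1) fails
```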

Definition of the Adaptive Lasso:
$$\hat\beta^{(n)} = \arg\min_\beta \left\| y - \sum_{j=1}^p x_j \beta_j \right\|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j|,$$
with data-dependent weight vector $\hat w = 1 / |\hat\beta|^\gamma$ for some $\gamma > 0$, where $\hat\beta$ is a root-$n$-consistent estimator of $\beta^*$, e.g. $\hat\beta = \hat\beta(\text{ols})$. Let $\mathcal{A}^*_n = \{ j : \hat\beta^{(n)}_j \neq 0 \}$.

Penalty Functions of the LASSO, SCAD, and the Adaptive Lasso. [Figure 1 from Zou (2006): plots of thresholding functions with $\lambda = 2$ for (a) the hard; (b) bridge $L_{0.5}$; (c) the lasso; (d) the SCAD; (e) the adaptive lasso; and (f) the adaptive lasso with $\gamma = 2$.]

Remarks: The data-dependent weight vector $\hat w$ is the key to the adaptive lasso's oracle properties. As $n$ grows, the weights for zero-coefficient predictors get inflated (to infinity), while the weights for nonzero-coefficient predictors converge to a finite constant. In the view of Fan and Li (2001) (presented earlier by Yang Zhao), the adaptive lasso satisfies the three properties of a good penalty function: unbiasedness, sparsity, and continuity.

Theorem 2 (Oracle Properties of the Adaptive Lasso): Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then the adaptive LASSO estimates must satisfy the following:
1. Consistency in variable selection: $\lim_n P(\mathcal{A}^*_n = \mathcal{A}) = 1$.
2. Asymptotic normality: $\sqrt{n} \left( \hat\beta^{(n)}_{\mathcal{A}} - \beta^*_{\mathcal{A}} \right) \to_d N(0, \sigma^2 C_{11}^{-1})$.

Computations: The adaptive lasso estimates can be computed with the LARS algorithm (Efron et al., 2004); the entire solution path is obtained at the same order of computation as a single OLS fit. Tuning: if we use $\hat\beta(\text{ols})$, use two-dimensional cross-validation to find an optimal pair $(\gamma, \lambda_n)$, or use three-dimensional cross-validation to find an optimal triple $(\hat\beta, \gamma, \lambda_n)$. $\hat\beta(\text{ridge})$ from the best ridge regression fit may be used instead when collinearity is a concern.
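
In practice the adaptive lasso reduces to an ordinary lasso fit on a rescaled design: solve the lasso on $x_j^{**} = x_j / \hat w_j$ and rescale the solution back, so any LARS/lasso solver can be reused. A minimal sketch assuming scikit-learn is available; the function name and fixed `alpha` are illustrative, and in practice $(\gamma, \lambda_n)$ would be chosen by cross-validation as described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoLars

def adaptive_lasso(X, y, gamma=1.0, alpha=0.1):
    """Adaptive lasso via the rescaling trick: lasso on x_j / w_j, then undo the scaling."""
    beta_ols = LinearRegression().fit(X, y).coef_      # root-n-consistent pilot estimate
    w = 1.0 / (np.abs(beta_ols) ** gamma)              # weights w_j = 1 / |beta_ols_j|^gamma
    lasso = LassoLars(alpha=alpha).fit(X / w, y)       # ordinary LARS-lasso on rescaled columns
    return lasso.coef_ / w                             # beta_j = beta**_j / w_j
```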

Definition of the Nonnegative Garrote (Breiman, 1995): $\hat\beta_j(\text{garrote}) = c_j \hat\beta_j(\text{ols})$, where the nonnegative scaling factors $\{c_j\}$ minimize
$$\left\| y - \sum_{j=1}^p x_j \hat\beta_j(\text{ols}) c_j \right\|^2 + \lambda_n \sum_{j=1}^p c_j \quad \text{subject to } c_j \geq 0 \ \forall j.$$
A sufficiently large $\lambda_n$ shrinks some $c_j$ to exactly 0, i.e. $\hat\beta_j(\text{garrote}) = 0$. Yuan and Lin (2007) also studied the consistency of the nonnegative garrote.
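
Because the penalty on $c$ is an $\ell_1$ penalty restricted to the nonnegative orthant, the garrote can be fit with any lasso solver that supports a positivity constraint. A rough sketch assuming scikit-learn, whose `Lasso` objective rescales the penalty by $1/(2n)$, so `alpha` stands in for $\lambda_n$ only up to that factor; names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

def nonnegative_garrote(X, y, alpha=0.1):
    """Breiman's garrote: nonnegative l1-shrinkage of OLS-scaled columns."""
    beta_ols = LinearRegression().fit(X, y).coef_
    Z = X * beta_ols                                         # column j of Z is x_j * beta_ols_j
    c = Lasso(alpha=alpha, positive=True).fit(Z, y).coef_    # scaling factors c_j >= 0
    return c * beta_ols                                      # beta_j(garrote) = c_j * beta_ols_j
```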

Garrote: Formulation and Consistency. Formulation:
$$\hat\beta(\text{garrote}) = \arg\min_\beta \left\| y - \sum_{j=1}^p x_j \beta_j \right\|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j| \quad \text{subject to } \beta_j \hat\beta_j(\text{ols}) \geq 0 \ \forall j,$$
where $\gamma = 1$ and $\hat w = 1 / |\hat\beta(\text{ols})|$; so the nonnegative garrote is the adaptive lasso with $\gamma = 1$ plus a sign constraint. Corollary 2 (Consistency of the Nonnegative Garrote): If we choose a $\lambda_n$ such that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$, then the nonnegative garrote is consistent for variable selection.

The Adaptive Lasso for GLMs:
$$\hat\beta^{(n)}(\text{glm}) = \arg\min_\beta \sum_{i=1}^n \left( - y_i (x_i^T \beta) + \phi(x_i^T \beta) \right) + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j|,$$
with weight vector $\hat w = 1 / |\hat\beta(\text{mle})|^\gamma$ for some $\gamma > 0$. Here the response has density $f(y \mid x, \theta) = h(y) \exp(y\theta - \phi(\theta))$ with $\theta = x^T \beta$. The Fisher information matrix is partitioned as
$$I(\beta^*) = \begin{bmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{bmatrix},$$
where $I_{11}$ is a $p_0 \times p_0$ matrix; then $I_{11}$ is the Fisher information with the true submodel known.
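
The same rescaling trick carries over to GLMs: compute MLE-based weights, rescale the columns, and call any $\ell_1$-penalized GLM solver. A minimal sketch for logistic regression assuming scikit-learn; the nearly-unpenalized fit with a very large `C` stands in for the MLE, and all names and the fixed penalty level are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptive_lasso_logistic(X, y, gamma=1.0, lam=1.0):
    """Adaptive lasso for logistic regression via column rescaling."""
    mle = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)   # huge C ~ unpenalized MLE
    w = 1.0 / (np.abs(mle.coef_.ravel()) ** gamma)             # w_j = 1 / |beta_mle_j|^gamma
    l1 = LogisticRegression(penalty="l1", solver="liblinear",
                            C=1.0 / lam, max_iter=1000).fit(X / w, y)
    return l1.coef_.ravel() / w                                # undo the rescaling
```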

Theorem 4 (Oracle Properties of the Adaptive Lasso for GLMs): Let $\mathcal{A}^*_n = \{ j : \hat\beta^{(n)}_j(\text{glm}) \neq 0 \}$. Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then, under some mild regularity conditions, the adaptive LASSO estimate $\hat\beta^{(n)}(\text{glm})$ must satisfy the following:
1. Consistency in variable selection: $\lim_n P(\mathcal{A}^*_n = \mathcal{A}) = 1$.
2. Asymptotic normality: $\sqrt{n} \left( \hat\beta^{(n)}_{\mathcal{A}}(\text{glm}) - \beta^*_{\mathcal{A}} \right) \to_d N(0, I_{11}^{-1})$.

Experiments (Simulation Model 0): Setting. We let $y = x^T \beta^* + N(0, \sigma^2)$, where the true regression coefficients are $\beta^* = (5.6, 5.6, 5.6, 0)$. The predictors $x_i$ ($i = 1, \ldots, n$) are i.i.d. $N(0, C)$, where $C$ is the matrix in Corollary 1 with $\rho_1 = -0.39$ and $\rho_2 = 0.23$ (the red point in the inconsistency region of the preceding figure).
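
A sketch of how data from this model could be generated (not from the slides; the function name and seed handling are illustrative):

```python
import numpy as np

def simulate_model0(n, sigma, rho1=-0.39, rho2=0.23, rng=np.random.default_rng(0)):
    """Draw (X, y) from Simulation Model 0 with the Corollary 1 covariance matrix."""
    p0 = 3
    C = np.empty((p0 + 1, p0 + 1))
    C[:p0, :p0] = (1 - rho1) * np.eye(p0) + rho1   # C_11 = (1 - rho1) I + rho1 J_1
    C[:p0, p0] = C[p0, :p0] = rho2                 # C_12 = C_21^T = rho2 * 1
    C[p0, p0] = 1.0                                # C_22 = 1
    beta = np.array([5.6, 5.6, 5.6, 0.0])
    X = rng.multivariate_normal(np.zeros(p0 + 1), C, size=n)
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    return X, y, beta
```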

Experiments (Simulation Model 0): Results. In both models we simulated 100 training datasets for each combination of $(n, \sigma^2)$. All of the training and tuning were done on the training set. We also collected independent test datasets of 10,000 observations to compute the relative prediction error (RPE); to estimate the standard error of the RPE, bootstrapped samples were drawn from the 100 RPEs.

Table 1. Simulation Model 0: the probability of containing the true model in the solution path.

                        n = 60, σ = 9    n = 120, σ = 5    n = 300, σ = 3
  lasso                      .55              .51               .53
  adalasso (γ = .5)          .59              .68               .93
  adalasso (γ = 1)           .67              .89              1
  adalasso (γ = 2)           .73              .97              1
  adalasso (γ by cv)         .67              .91              1

NOTE: adalasso is the adaptive lasso, and "γ by cv" means that γ was selected by five-fold cross-validation from three choices: γ = .5, γ = 1, and γ = 2.
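
For concreteness, the Table 1 metric (the proportion of simulated datasets whose lasso solution path contains the true model at some point) could be estimated along the following lines, reusing `simulate_model0` from the sketch above and scikit-learn's `lars_path`; this is an illustrative reconstruction, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import lars_path

def prob_true_model_in_path(n, sigma, n_rep=100, true_set=frozenset({0, 1, 2})):
    """Estimate P(true model appears somewhere on the lasso solution path)."""
    hits = 0
    for _ in range(n_rep):
        X, y, _ = simulate_model0(n, sigma)
        _, _, coefs = lars_path(X, y, method="lasso")          # coefs: (n_features, n_steps)
        supports = {frozenset(np.flatnonzero(c)) for c in coefs.T}
        hits += true_set in supports
    return hits / n_rep
```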

General Observations: Comparison of the LASSO, the adaptive lasso, SCAD, and the nonnegative garrote, with $p = 8$ and $p_0 = 3$; we consider a few large effects ($n = 20, 60$) and many small effects ($n = 40, 80$). The LASSO performs best when the SNR is low. The adaptive lasso, SCAD, and the nonnegative garrote outperform the LASSO with a medium or high level of SNR. The adaptive lasso tends to be more stable than SCAD. The LASSO tends to select noise variables more often than the other methods.

Theorem 2 (Oracle Properties of the Adaptive Lasso, restated): Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then the adaptive LASSO estimates must satisfy the following:
1. Consistency in variable selection: $\lim_n P(\mathcal{A}^*_n = \mathcal{A}) = 1$.
2. Asymptotic normality: $\sqrt{n} \left( \hat\beta^{(n)}_{\mathcal{A}} - \beta^*_{\mathcal{A}} \right) \to_d N(0, \sigma^2 C_{11}^{-1})$.

Proof of Theorem 2: Asymptotic Normality. Let $\beta = \beta^* + u / \sqrt{n}$ and
$$\Psi_n(u) = \left\| y - \sum_{j=1}^p x_j \left( \beta^*_j + \frac{u_j}{\sqrt{n}} \right) \right\|^2 + \lambda_n \sum_{j=1}^p \hat w_j \left| \beta^*_j + \frac{u_j}{\sqrt{n}} \right|.$$
Let $\hat u^{(n)} = \arg\min \Psi_n(u)$; then $\hat u^{(n)} = \sqrt{n} \left( \hat\beta^{(n)} - \beta^* \right)$. Write $\Psi_n(u) - \Psi_n(0) = V^{(n)}_4(u)$, where
$$V^{(n)}_4(u) = u^T \left( \frac{1}{n} X^T X \right) u - 2 \frac{\varepsilon^T X}{\sqrt{n}} u + \frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^p \hat w_j \sqrt{n} \left( \left| \beta^*_j + \frac{u_j}{\sqrt{n}} \right| - |\beta^*_j| \right).$$
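
The behavior of the penalty term in $V^{(n)}_4$ is the crux of the argument (this is the limit spelled out in the paper's proof): with $\hat w_j = 1 / |\hat\beta_j|^\gamma$,
$$\frac{\lambda_n}{\sqrt{n}} \hat w_j \sqrt{n} \left( \left| \beta^*_j + \frac{u_j}{\sqrt{n}} \right| - |\beta^*_j| \right) \to_p
\begin{cases}
0, & j \in \mathcal{A}, \ \text{since } \hat w_j \to_p |\beta^*_j|^{-\gamma} \text{ and } \lambda_n / \sqrt{n} \to 0, \\
\infty \cdot \mathbb{1}\{u_j \neq 0\}, & j \notin \mathcal{A}, \ \text{since } \lambda_n \hat w_j / \sqrt{n} = \lambda_n n^{(\gamma-1)/2} / |\sqrt{n}\, \hat\beta_j|^\gamma \to_p \infty,
\end{cases}$$
which forces the limit $V_4(u)$ below to be infinite whenever $u_j \neq 0$ for some $j \notin \mathcal{A}$.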

Proof of Theorem 2: Asymptotic Normality (cont.). Then $V^{(n)}_4(u) \to_d V_4(u)$ for every $u$, where
$$V_4(u) = \begin{cases} u_{\mathcal{A}}^T C_{11} u_{\mathcal{A}} - 2 u_{\mathcal{A}}^T W_{\mathcal{A}} & \text{if } u_j = 0 \ \forall j \notin \mathcal{A}, \\ \infty & \text{otherwise}, \end{cases}$$
with $W_{\mathcal{A}} \sim N(0, \sigma^2 C_{11})$. $V^{(n)}_4$ is convex, and the unique minimum of $V_4$ is $(C_{11}^{-1} W_{\mathcal{A}}, 0)^T$. Following the epi-convergence results of Geyer (1994), we have $\hat u^{(n)}_{\mathcal{A}} \to_d C_{11}^{-1} W_{\mathcal{A}}$ and $\hat u^{(n)}_{\mathcal{A}^c} \to_d 0$. Hence the asymptotic normality part is proven.

Proof of Theorem 2: Consistency. The asymptotic normality result indicates that for $j \in \mathcal{A}$, $\hat\beta^{(n)}_j \to_p \beta^*_j$; thus $P(j \in \mathcal{A}^*_n) \to 1$. It then suffices to show that for $j \notin \mathcal{A}$, $P(j \in \mathcal{A}^*_n) \to 0$. Consider the event $j \in \mathcal{A}^*_n$. By the KKT optimality conditions, $2 x_j^T \left( y - X \hat\beta^{(n)} \right) = \lambda_n \hat w_j$. Now $\lambda_n \hat w_j / \sqrt{n} = \lambda_n n^{(\gamma - 1)/2} / |\sqrt{n}\, \hat\beta_j|^\gamma \to_p \infty$, whereas
$$\frac{2 x_j^T (y - X \hat\beta^{(n)})}{\sqrt{n}} = \frac{2 x_j^T X \sqrt{n} (\beta^* - \hat\beta^{(n)})}{n} + \frac{2 x_j^T \varepsilon}{\sqrt{n}},$$
and each of these two terms converges to some normal distribution. Thus
$$P(j \in \mathcal{A}^*_n) \leq P\left( 2 x_j^T \left( y - X \hat\beta^{(n)} \right) = \lambda_n \hat w_j \right) \to 0.$$

Corollary 2 (Consistency of the Nonnegative Garrote, restated). Formulation:
$$\hat\beta(\text{garrote}) = \arg\min_\beta \left\| y - \sum_{j=1}^p x_j \beta_j \right\|^2 + \lambda_n \sum_{j=1}^p \hat w_j |\beta_j| \quad \text{subject to } \beta_j \hat\beta_j(\text{ols}) \geq 0 \ \forall j,$$
where $\gamma = 1$ and $\hat w = 1 / |\hat\beta(\text{ols})|$. Corollary 2: If we choose a $\lambda_n$ such that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$, then the nonnegative garrote is consistent for variable selection.

Proof of Corollary 2. Let $\hat\beta^{(n)}$ be the adaptive LASSO estimates with $\gamma = 1$. By Theorem 2, $\hat\beta^{(n)}$ is an oracle estimator if $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$. To show consistency of the garrote, it suffices to show that $\hat\beta^{(n)}$ satisfies the sign constraint with probability tending to 1. Pick any $j$. If $j \in \mathcal{A}$, then $\hat\beta^{(n)}(\gamma = 1)_j \, \hat\beta(\text{ols})_j \to_p (\beta^*_j)^2 > 0$. If $j \notin \mathcal{A}$, then $P\left( \hat\beta^{(n)}(\gamma = 1)_j \, \hat\beta(\text{ols})_j \geq 0 \right) \geq P\left( \hat\beta^{(n)}(\gamma = 1)_j = 0 \right) \to 1$. In either case, $P\left( \hat\beta^{(n)}(\gamma = 1)_j \, \hat\beta(\text{ols})_j \geq 0 \right) \to 1$ for every $j = 1, 2, \ldots, p$.

Theorem 4 (Oracle Properties of the Adaptive Lasso for GLMs, restated): Let $\mathcal{A}^*_n = \{ j : \hat\beta^{(n)}_j(\text{glm}) \neq 0 \}$. Suppose that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. Then, under some mild regularity conditions, the adaptive LASSO estimate $\hat\beta^{(n)}(\text{glm})$ must satisfy the following:
1. Consistency in variable selection: $\lim_n P(\mathcal{A}^*_n = \mathcal{A}) = 1$.
2. Asymptotic normality: $\sqrt{n} \left( \hat\beta^{(n)}_{\mathcal{A}}(\text{glm}) - \beta^*_{\mathcal{A}} \right) \to_d N(0, I_{11}^{-1})$.
Here $f(y \mid x, \theta) = h(y) \exp(y\theta - \phi(\theta))$ with $\theta = x^T \beta$.

Theorem 4: Regularity Conditions.
1. The Fisher information matrix is finite and positive definite: $I(\beta^*) = E\left[ \phi''(x^T \beta^*) \, x x^T \right]$.
2. There is a sufficiently large open set $\mathcal{O}$ containing $\beta^*$ such that for all $\beta \in \mathcal{O}$, $\left| \phi'''(x^T \beta) \right| \leq M(x) < \infty$ and $E\left[ M(x) |x_j x_k x_l| \right] < \infty$ for all $1 \leq j, k, l \leq p$.

Proof of Theorem 4: Asymptotic Normality. Let $\beta = \beta^* + u / \sqrt{n}$ and define
$$\Gamma_n(u) = \sum_{i=1}^n \left\{ - y_i \, x_i^T \left( \beta^* + \frac{u}{\sqrt{n}} \right) + \phi\left( x_i^T \left( \beta^* + \frac{u}{\sqrt{n}} \right) \right) \right\} + \lambda_n \sum_{j=1}^p \hat w_j \left| \beta^*_j + \frac{u_j}{\sqrt{n}} \right|.$$
Let $\hat u^{(n)} = \arg\min_u \Gamma_n(u)$; then $\hat u^{(n)} = \sqrt{n} \left( \hat\beta^{(n)}(\text{glm}) - \beta^* \right)$. Using a Taylor expansion, $\Gamma_n(u) - \Gamma_n(0) = H^{(n)}(u)$, where $H^{(n)}(u) = A^{(n)}_1 + A^{(n)}_2 + A^{(n)}_3 + A^{(n)}_4$, with
$$A^{(n)}_1 = - \sum_{i=1}^n \left[ y_i - \phi'(x_i^T \beta^*) \right] \frac{x_i^T u}{\sqrt{n}}, \qquad
A^{(n)}_2 = \sum_{i=1}^n \frac{1}{2} \phi''(x_i^T \beta^*) \, \frac{u^T x_i x_i^T u}{n}, \qquad
A^{(n)}_3 = \frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^p \hat w_j \sqrt{n} \left( \left| \beta^*_j + \frac{u_j}{\sqrt{n}} \right| - |\beta^*_j| \right),$$

Proof of Theorem 4: Asymptotic Normality (cont.). and
$$A^{(n)}_4 = n^{-3/2} \sum_{i=1}^n \frac{1}{6} \phi'''(x_i^T \tilde\beta) \left( x_i^T u \right)^3,$$
where $\tilde\beta$ is between $\beta^*$ and $\beta^* + u / \sqrt{n}$. Then, by regularity conditions 1 and 2, $H^{(n)}(u) \to_d H(u)$ for every $u$, where
$$H(u) = \begin{cases} u_{\mathcal{A}}^T I_{11} u_{\mathcal{A}} - 2 u_{\mathcal{A}}^T W_{\mathcal{A}} & \text{if } u_j = 0 \ \forall j \notin \mathcal{A}, \\ \infty & \text{otherwise}, \end{cases}$$
with $W_{\mathcal{A}} \sim N(0, I_{11})$. $H^{(n)}$ is convex, and the unique minimum of $H$ is $(I_{11}^{-1} W_{\mathcal{A}}, 0)^T$. Following the epi-convergence results of Geyer (1994), we have $\hat u^{(n)}_{\mathcal{A}} \to_d I_{11}^{-1} W_{\mathcal{A}}$ and $\hat u^{(n)}_{\mathcal{A}^c} \to_d 0$, and the asymptotic normality part is proven.

Proof of Theorem 4: Consistency. The asymptotic normality result indicates that for $j \in \mathcal{A}$, $P(j \in \mathcal{A}^*_n) \to 1$. It then suffices to show that for $j \notin \mathcal{A}$, $P(j \in \mathcal{A}^*_n) \to 0$. Consider the event $j \in \mathcal{A}^*_n$. By the KKT optimality conditions,
$$\sum_{i=1}^n x_{ij} \left( y_i - \phi'\left( x_i^T \hat\beta^{(n)}(\text{glm}) \right) \right) = \lambda_n \hat w_j.$$
Write $\sum_{i=1}^n x_{ij} \left( y_i - \phi'(x_i^T \hat\beta^{(n)}(\text{glm})) \right) / \sqrt{n} = B^{(n)}_1 + B^{(n)}_2 + B^{(n)}_3$, with
$$B^{(n)}_1 = \sum_{i=1}^n x_{ij} \left( y_i - \phi'(x_i^T \beta^*) \right) / \sqrt{n}, \qquad
B^{(n)}_2 = \left( \frac{1}{n} \sum_{i=1}^n x_{ij} \, \phi''(x_i^T \beta^*) \, x_i^T \right) \sqrt{n} \left( \beta^* - \hat\beta^{(n)}(\text{glm}) \right),$$
$$B^{(n)}_3 = \left( \frac{1}{n} \sum_{i=1}^n x_{ij} \, \phi'''(x_i^T \tilde\beta) \left( x_i^T \sqrt{n} \left( \beta^* - \hat\beta^{(n)}(\text{glm}) \right) \right)^2 \right) \Big/ \sqrt{n},$$
where $\tilde\beta$ is between $\hat\beta^{(n)}(\text{glm})$ and $\beta^*$.

Proof of Theorem 4: Consistency (cont.). $B^{(n)}_1$ and $B^{(n)}_2$ converge to some normal distributions, and $B^{(n)}_3 = O_p(1/\sqrt{n})$, whereas $\lambda_n \hat w_j / \sqrt{n} = \lambda_n n^{(\gamma - 1)/2} / |\sqrt{n}\, \hat\beta_j(\text{mle})|^\gamma \to_p \infty$. Thus
$$P(j \in \mathcal{A}^*_n) \leq P\left( \sum_{i=1}^n x_{ij} \left( y_i - \phi'\left( x_i^T \hat\beta^{(n)}(\text{glm}) \right) \right) = \lambda_n \hat w_j \right) \to 0,$$
and this completes the proof.