Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001). Presented by Yang Zhao, March 5, 2010. 1 / 36

Outline 2 / 36

Motivation
Variable selection is fundamental to high-dimensional statistical modeling.
Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process.
The theoretical properties of stepwise deletion and subset selection are somewhat hard to understand.
The most severe drawback of best subset variable selection is its lack of stability, as analyzed by Breiman (1996).
3 / 36

Overview of Proposed Method
Penalized likelihood approaches are proposed, which simultaneously select variables and estimate coefficients.
The penalty functions are symmetric, nonconcave on $(0, \infty)$, and have singularities at the origin to produce sparse solutions.
Furthermore, the penalty functions are bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions.
A new algorithm is proposed for optimizing penalized likelihood functions; it is widely applicable.
Rates of convergence of the proposed penalized likelihood estimators are established.
With proper choice of regularization parameters, the proposed estimators perform as well as the oracle procedure (as if the correct submodel were known in advance).
4 / 36

Penalized Least Squares and Subset Selection
Linear regression model: $y = X\beta + \varepsilon$, where $y$ is $n \times 1$ and $X$ is $n \times d$. Assume for now that the columns of $X$ are orthonormal. Denote $z = X^T y$.
A form of penalized least squares is
$\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_{j=1}^d p_j(|\beta_j|) = \frac{1}{2}\|y - XX^T y\|^2 + \frac{1}{2}\sum_{j=1}^d (z_j - \beta_j)^2 + \lambda \sum_{j=1}^d p_j(|\beta_j|).$
The $p_j(\cdot)$ are not necessarily the same for all $j$, but we will assume they are for simplicity. From now on, denote $\lambda p(\cdot)$ by $p_\lambda(\cdot)$.
The minimization problem is equivalent to minimizing componentwise, which leads us to consider the penalized least squares problem
$\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|). \qquad (2.3)$
5 / 36
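The decoupling above rests on the orthonormal-column identity $\|y - X\beta\|^2 = \|y - XX^Ty\|^2 + \|z - \beta\|^2$. A minimal numerical check (my own illustration with simulated data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X, _ = np.linalg.qr(rng.standard_normal((n, d)))   # reduced QR, so X has orthonormal columns
y = rng.standard_normal(n)
beta = rng.standard_normal(d)
z = X.T @ y

lhs = 0.5 * np.sum((y - X @ beta) ** 2)
rhs = 0.5 * np.sum((y - X @ X.T @ y) ** 2) + 0.5 * np.sum((z - beta) ** 2)
print(np.isclose(lhs, rhs))   # True: a beta-free term plus a componentwise sum
```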

Penalized Least Squares and Subset Selection Cont'd
By taking the hard thresholding penalty function $p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda)$, the hard thresholding rule is obtained: $\hat\theta = z\, I(|z| > \lambda)$.
This coincides with best subset selection and stepwise deletion and addition for orthonormal designs!
6 / 36

What Makes a Good Penalty Function?
Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
Continuity: the resulting estimator is continuous in the data $z$ to avoid instability in model prediction.
7 / 36

Sufficient Conditions for Good Penalty Functions
$p'_\lambda(|\theta|) = 0$ for large $|\theta|$ $\Rightarrow$ Unbiasedness.
Intuition: the first-order derivative of (2.3) w.r.t. $\theta$ is $\mathrm{sgn}(\theta)\{|\theta| + p'_\lambda(|\theta|)\} - z$. When $p'_\lambda(|\theta|) = 0$ for large $|\theta|$, the resulting estimator is $z$ when $|z|$ is sufficiently large, which is approximately unbiased.
The minimum of the function $|\theta| + p'_\lambda(|\theta|)$ is positive $\Rightarrow$ Sparsity.
The minimum of the function $|\theta| + p'_\lambda(|\theta|)$ is attained at 0 $\Rightarrow$ Continuity.
Intuition: see next page.
8 / 36

Sufficient Conditions for Good Penalty Functions Cont'd
When $|z| < \min_{\theta \neq 0}\{|\theta| + p'_\lambda(|\theta|)\}$, the derivative of (2.3) is positive for all positive $\theta$ and negative for all negative $\theta$, so $\hat\theta = 0$.
When $|z| > \min_{\theta \neq 0}\{|\theta| + p'_\lambda(|\theta|)\}$, two crossings may exist as shown; the larger one is a penalized least squares estimator.
Further, this implies that a necessary and sufficient condition for continuity is that the minimum of $|\theta| + p'_\lambda(|\theta|)$ be attained at 0.
9 / 36

Penalty Function of Interest
Hard thresholding penalty: $p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda)$.
Soft thresholding penalty ($L_1$ penalty, LASSO): $p_\lambda(|\theta|) = \lambda|\theta|$.
$L_2$ penalty (ridge regression): $p_\lambda(|\theta|) = \lambda|\theta|^2$.
SCAD (Smoothly Clipped Absolute Deviation) penalty, defined through its derivative: $p'_\lambda(\theta) = \lambda\left\{I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda)\right\}$ for some $a > 2$ and $\theta > 0$.
Note: of the three properties that make a penalty function good, only SCAD possesses all of them (and is therefore advocated by the authors); none of the other penalty functions satisfies the three sufficient conditions simultaneously.
10 / 36
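Because SCAD is specified through its derivative, a direct transcription into code makes the piecewise definition concrete. The sketch below is my own illustration (the function name and the default $a = 3.7$, discussed on a later slide, are assumptions), not code from the paper:

```python
import numpy as np

def scad_penalty_deriv(theta, lam, a=3.7):
    """SCAD derivative p'_lambda(theta) for theta > 0 and a > 2."""
    theta = np.abs(np.asarray(theta, dtype=float))
    flat = (theta <= lam).astype(float)                                         # constant slope lambda near the origin
    taper = np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam) * (theta > lam)  # linear decay on (lambda, a*lambda]
    return lam * (flat + taper)                                                 # identically zero once theta > a*lambda

print(scad_penalty_deriv([0.5, 1.5, 4.0], lam=1.0))   # approx. [1.0, 0.815, 0.0]
```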

Penalty Function of Interest Cont'd 11 / 36

Penalty Function of Interest Cont'd
When the design matrix $X$ is orthonormal, closed-form solutions are obtained as follows.
Hard thresholding rule: $\hat\theta = z\, I(|z| > \lambda)$.
LASSO ($L_1$ penalty): $\hat\theta = \mathrm{sgn}(z)(|z| - \lambda)_+$.
SCAD:
$\hat\theta = \begin{cases} \mathrm{sgn}(z)(|z| - \lambda)_+ & \text{when } |z| \le 2\lambda, \\ \{(a-1)z - \mathrm{sgn}(z)a\lambda\}/(a-2) & \text{when } 2\lambda < |z| \le a\lambda, \\ z & \text{when } |z| > a\lambda. \end{cases}$
12 / 36
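These closed-form rules translate directly into vectorized code. A minimal sketch (illustrative only; the function names are mine):

```python
import numpy as np

def hard_threshold(z, lam):
    # Hard thresholding: keep z when |z| > lambda, otherwise zero
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) > lam, z, 0.0)

def soft_threshold(z, lam):
    # LASSO / soft thresholding: shrink each component toward zero by lambda
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    # SCAD: soft thresholding near zero, a linear blend in the middle, identity far from zero
    z = np.asarray(z, dtype=float)
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    middle = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return np.where(np.abs(z) <= 2 * lam, soft,
                    np.where(np.abs(z) <= a * lam, middle, z))

z = np.linspace(-6.0, 6.0, 7)
print(hard_threshold(z, 2.0))
print(soft_threshold(z, 2.0))
print(scad_threshold(z, 2.0))
```

Note how the SCAD rule agrees with soft thresholding near the origin (sparsity, continuity) and with the identity map for large $|z|$ (near-unbiasedness).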

Penalty Function of Interest Cont'd (figure: the three thresholding rules plotted against $z$) 13 / 36

More on SCAD
The SCAD penalty has two unknown parameters, $\lambda$ and $a$. They could be chosen by a two-dimensional grid search using cross-validation (CV) or generalized cross-validation (GCV), but that is computationally intensive.
The authors applied a Bayesian risk analysis to help choose $a$, since the Bayes risk appears not to be very sensitive to the value of $a$. They found that $a = 3.7$ performs similarly to the value chosen by GCV.
14 / 36

Performance Comparison
Refer to the last graph on the previous slide. SCAD performs favorably compared with the other two thresholding rules. It is in fact expected to perform best, given that it retains the good mathematical properties of both of the other penalty functions.
15 / 36

Penalized Likelihood Formulation
From now on, assume that $X$ is standardized so that each column has mean 0 and variance 1, and is no longer assumed orthonormal.
A form of penalized least squares is
$\frac{1}{2}(y - X\beta)^T(y - X\beta) + n\sum_{j=1}^d p_\lambda(|\beta_j|). \qquad (3.1)$
A form of penalized robust least squares is
$\sum_{i=1}^n \psi(|y_i - x_i^T\beta|) + n\sum_{j=1}^d p_\lambda(|\beta_j|). \qquad (3.2)$
A form of penalized likelihood for generalized linear models is
$\sum_{i=1}^n \ell_i(g(x_i^T\beta), y_i) + n\sum_{j=1}^d p_\lambda(|\beta_j|). \qquad (3.3)$
16 / 36
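As a concrete instance of (3.1) with the SCAD penalty, the sketch below evaluates the objective for a candidate $\beta$. It is my own illustration and assumes the closed form of the SCAD penalty obtained by integrating the derivative on slide 10:

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    # SCAD penalty value, obtained by integrating p'_lambda from 0 (piecewise closed form)
    t = np.abs(np.asarray(theta, dtype=float))
    near = lam * t
    middle = -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1))
    far = (a + 1) * lam**2 / 2
    return np.where(t <= lam, near, np.where(t <= a * lam, middle, far))

def penalized_ls_objective(beta, X, y, lam, a=3.7):
    # Objective (3.1): 0.5 * (y - X beta)'(y - X beta) + n * sum_j p_lambda(|beta_j|)
    n = X.shape[0]
    return 0.5 * np.sum((y - X @ beta) ** 2) + n * np.sum(scad_penalty(beta, lam, a))
```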

Sampling Properties and Oracle Properties
Let $\beta_0 = (\beta_{10}, \ldots, \beta_{d0})^T = (\beta_{10}^T, \beta_{20}^T)^T$. Assume that $\beta_{20} = 0$. Let $I(\beta_0)$ be the Fisher information matrix and $I_1(\beta_{10}, 0)$ be the Fisher information knowing $\beta_{20} = 0$.
General setting: let $V_i = (X_i, Y_i)$, $i = 1, \ldots, n$. Let $L(\beta)$ be the log-likelihood function of the observations $V_1, \ldots, V_n$ and let $Q(\beta)$ be the penalized likelihood function $L(\beta) - n\sum_{j=1}^d p_\lambda(|\beta_j|)$.
17 / 36

Sampling Properties and Oracle Properties Cont'd
Regularity Conditions
(A) The observations $V_i$ are iid with density $f(V, \beta)$. $f(V, \beta)$ has a common support and the model is identifiable. Furthermore,
$E_\beta\left(\frac{\partial \log f(V, \beta)}{\partial \beta_j}\right) = 0$ for $j = 1, \ldots, d$, and
$I_{jk}(\beta) = E_\beta\left(\frac{\partial \log f(V, \beta)}{\partial \beta_j}\,\frac{\partial \log f(V, \beta)}{\partial \beta_k}\right) = E_\beta\left(-\frac{\partial^2 \log f(V, \beta)}{\partial \beta_j\, \partial \beta_k}\right).$
(B) The Fisher information matrix $I(\beta) = E\left\{\left(\frac{\partial}{\partial\beta}\log f(V, \beta)\right)\left(\frac{\partial}{\partial\beta}\log f(V, \beta)\right)^T\right\}$ is finite and positive definite at $\beta = \beta_0$.
(C) There exists an open subset $\omega$ of $\Omega$ containing the true parameter point $\beta_0$ such that for almost all $V$ the density $f(V, \beta)$ admits all third derivatives for all $\beta \in \omega$. Furthermore, there exist functions $M_{jkl}$ such that $\left|\frac{\partial^3}{\partial\beta_j \partial\beta_k \partial\beta_l}\log f(V, \beta)\right| \le M_{jkl}(V)$ for all $\beta \in \omega$, where $m_{jkl} = E_\beta\{M_{jkl}(V)\} < \infty$ for all $j, k, l$.
18 / 36

Sampling Properties and Oracle Properties Cont'd
Theorem 1. Let $V_1, \ldots, V_n$ be iid, each with a density $f(V, \beta)$ that satisfies conditions (A)-(C). If $\max\{|p''_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \neq 0\} \to 0$, then there exists a local maximizer $\hat\beta$ of $Q(\beta)$ such that $\|\hat\beta - \beta_0\| = O_p(n^{-1/2} + a_n)$, where $a_n = \max\{|p'_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \neq 0\}$.
Note: for the hard thresholding and SCAD penalty functions, $\lambda_n \to 0$ implies $a_n = 0$; therefore the penalized likelihood estimator is root-n consistent.
19 / 36

Sampling Properties and Oracle Properties Cont'd
Lemma 1. Let $V_1, \ldots, V_n$ be iid, each with a density $f(V, \beta)$ that satisfies conditions (A)-(C). Assume that
$\liminf_{n\to\infty}\,\liminf_{\theta\to 0^+}\, p'_{\lambda_n}(\theta)/\lambda_n > 0. \qquad (3.5)$
If $\lambda_n \to 0$ and $\sqrt{n}\,\lambda_n \to \infty$ as $n \to \infty$, then with probability tending to 1, for any given $\beta_1$ satisfying $\|\beta_1 - \beta_{10}\| = O_p(n^{-1/2})$ and any constant $C$,
$Q\{(\beta_1^T, 0)^T\} = \max_{\|\beta_2\| \le C n^{-1/2}} Q\{(\beta_1^T, \beta_2^T)^T\}.$
Aside: denote $\Sigma = \mathrm{diag}\{p''_{\lambda_n}(|\beta_{10}|), \ldots, p''_{\lambda_n}(|\beta_{s0}|)\}$ and $b = (p'_{\lambda_n}(|\beta_{10}|)\,\mathrm{sgn}(\beta_{10}), \ldots, p'_{\lambda_n}(|\beta_{s0}|)\,\mathrm{sgn}(\beta_{s0}))^T$, where $s$ is the number of components of $\beta_{10}$.
Theorem 2 (Oracle Property). Let $V_1, \ldots, V_n$ be iid, each with a density $f(V, \beta)$ that satisfies conditions (A)-(C). Assume that the penalty function $p_{\lambda_n}(|\theta|)$ satisfies condition (3.5). If $\lambda_n \to 0$ and $\sqrt{n}\,\lambda_n \to \infty$, then with probability tending to 1, the root-n consistent local maximizer $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ in Theorem 1 must satisfy:
(a) Sparsity: $\hat\beta_2 = 0$.
(b) Asymptotic normality: $\sqrt{n}\,\{I_1(\beta_{10}) + \Sigma\}\{\hat\beta_1 - \beta_{10} + (I_1(\beta_{10}) + \Sigma)^{-1} b\} \to N\{0, I_1(\beta_{10})\}$ in distribution.
20-22 / 36

Sampling Properties and Oracle Properties Cont'd
Remarks:
(1) For the hard thresholding and SCAD penalty functions, if $\lambda_n \to 0$ then $a_n = 0$. Hence, by Theorem 2, when $\sqrt{n}\,\lambda_n \to \infty$, their corresponding penalized likelihood estimators possess the oracle property and perform as well as the maximum likelihood estimates for estimating $\beta_1$ knowing $\beta_2 = 0$.
(2) However, for the LASSO ($L_1$) penalty, $a_n = \lambda_n$. Hence, root-n consistency requires $\lambda_n = O_p(n^{-1/2})$, while the oracle property requires $\sqrt{n}\,\lambda_n \to \infty$. These two conditions cannot be satisfied simultaneously. The authors conjecture that the oracle property does not hold for the LASSO.
(3) For the $L_q$ penalty with $q < 1$, the oracle property continues to hold with a suitable choice of $\lambda_n$.
23 / 36

Tibshirani (1996) proposed an algorithm for solving the constrained least squares problem of the LASSO.
Fu (1998) provided a shooting algorithm for the LASSO.
LASSO2 was submitted by Berwin Turlach to Statlib.
Here the authors propose a unified algorithm, which optimizes problems (3.1), (3.2), and (3.3) via local quadratic approximations.
24 / 36

The Algorithm
Rewrite (3.1), (3.2), and (3.3) as
$\ell(\beta) + n\sum_{j=1}^d p_\lambda(|\beta_j|), \qquad (3.6)$
where $\ell(\beta)$ is a general loss function.
Suppose we are given an initial value $\beta_0$ that is close to the minimizer of (3.6). If $\beta_{j0}$ is very close to 0, then set $\hat\beta_j = 0$; otherwise the penalty is locally approximated by a quadratic function via
$[p_\lambda(|\beta_j|)]' = p'_\lambda(|\beta_j|)\,\mathrm{sgn}(\beta_j) \approx \{p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|\}\,\beta_j.$
In other words,
$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_{j0}|) + \tfrac{1}{2}\{p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|\}(\beta_j^2 - \beta_{j0}^2) \quad \text{for } \beta_j \approx \beta_{j0}.$
In the robust penalized least squares case, $\psi(|y - x^T\beta|)$ can be approximated by $\{\psi(|y - x^T\beta_0|)/(y - x^T\beta_0)^2\}(y - x^T\beta)^2$.
Assume the log-likelihood function is smooth, so that it can be locally approximated by a quadratic function; the Newton-Raphson algorithm then applies. The updating equation is
$\beta_1 = \beta_0 - \{\nabla^2\ell(\beta_0) + n\Sigma_\lambda(\beta_0)\}^{-1}\{\nabla\ell(\beta_0) + nU_\lambda(\beta_0)\},$
where $\Sigma_\lambda(\beta_0) = \mathrm{diag}\{p'_\lambda(|\beta_{10}|)/|\beta_{10}|, \ldots, p'_\lambda(|\beta_{d0}|)/|\beta_{d0}|\}$ and $U_\lambda(\beta_0) = \Sigma_\lambda(\beta_0)\,\beta_0$.
25 / 36
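For squared-error loss $\ell(\beta) = \frac{1}{2}\|y - X\beta\|^2$, the update above simplifies to an iterative ridge-type step $\beta_1 = (X^TX + n\Sigma_\lambda(\beta_0))^{-1}X^Ty$ on the nonzero coefficients. The sketch below is my own illustration of this local quadratic approximation for SCAD-penalized least squares (the function names, OLS starting value, and tolerance are assumptions), not the authors' implementation:

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    # SCAD derivative p'_lambda(t) for t >= 0
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def scad_lqa_ls(X, y, lam, a=3.7, n_iter=100, tol=1e-6):
    """Local quadratic approximation for SCAD-penalized least squares, objective (3.1)."""
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # initial value: ordinary least squares
    for _ in range(n_iter):
        active = np.abs(beta) > tol                  # coefficients very close to zero are set to zero
        beta_new = np.zeros(d)
        if active.any():
            Xa, ba = X[:, active], beta[active]
            # Sigma_lambda(beta_0) = diag{ p'_lambda(|beta_j0|) / |beta_j0| } on the active set
            Sigma = np.diag(scad_deriv(np.abs(ba), lam, a) / np.abs(ba))
            # For squared-error loss the Newton-Raphson step reduces to a ridge-type update
            beta_new[active] = np.linalg.solve(Xa.T @ Xa + n * Sigma, Xa.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

For example, `beta_hat = scad_lqa_ls(X, y, lam=0.1)` on a standardized design matrix.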

Standard Error Formula
Obtained directly from the algorithm. Sandwich formula:
$\widehat{\mathrm{cov}}(\hat\beta_1) = \{\nabla^2\ell(\hat\beta_1) + n\Sigma_\lambda(\hat\beta_1)\}^{-1}\,\widehat{\mathrm{cov}}\{\nabla\ell(\hat\beta_1)\}\,\{\nabla^2\ell(\hat\beta_1) + n\Sigma_\lambda(\hat\beta_1)\}^{-1}$
for the nonvanishing components of $\beta$.
26 / 36
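In the least squares case, $\widehat{\mathrm{cov}}\{\nabla\ell(\hat\beta_1)\}$ can be estimated by $\hat\sigma^2 X_a^T X_a$, with $X_a$ the columns for the nonvanishing coefficients. A sketch under that assumption (again my own illustration, not the paper's exact implementation):

```python
import numpy as np

def scad_sandwich_se(X_active, y, beta_active, lam, a=3.7):
    # Sandwich standard errors for the nonvanishing coefficients, squared-error loss
    n, s = X_active.shape
    resid = y - X_active @ beta_active
    sigma2 = resid @ resid / (n - s)                       # residual variance estimate
    t = np.abs(beta_active)
    deriv = lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))
    Sigma = np.diag(deriv / t)                             # Sigma_lambda evaluated at beta_hat
    bread = np.linalg.inv(X_active.T @ X_active + n * Sigma)
    cov = bread @ (sigma2 * (X_active.T @ X_active)) @ bread
    return np.sqrt(np.diag(cov))
```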

Prediction and Model Error
Two regression situations: $X$ random and $X$ controlled. For ease of presentation, consider only the $X$-random case.
$PE(\hat\mu) = E\{Y - \hat\mu(x)\}^2 = E\{Y - E(Y \mid x)\}^2 + E\{E(Y \mid x) - \hat\mu(x)\}^2$
The first component is the irreducible noise; the second component is the model error.
27 / 36
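For a linear model $\mu(x) = x^T\beta$, the model error is $E\{x^T(\hat\beta - \beta_0)\}^2$, which can be approximated by Monte Carlo on fresh design points. A small sketch (hypothetical helper, for illustration only):

```python
import numpy as np

def model_error(beta_hat, beta_true, X_new):
    # Monte Carlo estimate of the model error E{ x'(beta_hat - beta_true) }^2
    diff = X_new @ (beta_hat - beta_true)
    return np.mean(diff ** 2)
```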

Simulation Results 28 / 36

Simulation Results Cont'd
Note:
SD = median absolute deviation of $\hat\beta_1$ divided by 0.6745 (regarded as the true standard deviation)
$SD_m$ = median of the estimated standard errors $\hat\sigma(\hat\beta_1)$
$SD_{mad}$ = median absolute deviation of the estimated standard errors $\hat\sigma(\hat\beta_1)$
29 / 36

Simulation Results Cont'd 30 / 36

Real Data Analysis 31 / 36

A family of penalty functions was introduced, among which SCAD performs favorably.
Rates of convergence of the proposed penalized likelihood estimators were established.
With proper choice of regularization parameters, the estimators perform as well as the oracle procedure.
The unified algorithm was demonstrated to be effective, and standard errors were estimated with good accuracy.
The proposed approach can be applied to various statistical contexts.
32 / 36

Proofs 33 / 36

Proofs 34 / 36

Proofs 35 / 36

The End! 36 / 36