Stability and the elastic net

Patrick Breheny, High-Dimensional Data Analysis (BIOS 7600), March 28

Introduction

Our last several lectures have concentrated on methods for reducing the bias of lasso estimates. This week, we will discuss methods for doing the opposite: introducing ridge penalties in order to reduce the variance of lasso estimates at the cost of further increasing their bias. As we saw when discussing the ridge penalty itself, there is typically some degree of shrinkage we can introduce for which the gains of variance reduction outweigh the cost of increased bias to produce more accurate estimates.

Elastic net penalty

Recall that lasso solutions are not always unique, but that solutions to ridge regression are. This is somewhat unsatisfactory for the lasso, as we could re-order the covariates and end up with different estimates. Consider, then, the following penalty, known as the elastic net penalty:

$$
P_\lambda(\beta) = \lambda_1 \lVert\beta\rVert_1 + \frac{\lambda_2}{2} \lVert\beta\rVert_2^2,
$$

which consists of two terms: a lasso term plus a ridge term.
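
To make the penalty concrete, here is a minimal NumPy sketch of the expression above; the function name enet_penalty is illustrative, not taken from any particular package.

```python
import numpy as np

def enet_penalty(beta, lam1, lam2):
    """Elastic net penalty: lam1 * ||beta||_1 + (lam2 / 2) * ||beta||_2^2."""
    beta = np.asarray(beta, dtype=float)
    return lam1 * np.sum(np.abs(beta)) + 0.5 * lam2 * np.sum(beta ** 2)
```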

Example (revisited)

Because the ridge penalty is strictly convex, the elastic net solution β̂ is unique provided that λ2 > 0. To see how this works, let us revisit the example from February 15, which consisted of two observations: (y1, x11, x12) = (1, 1, 1) and (y2, x21, x22) = (−1, −1, −1). For λ < 1, the lasso admits infinitely many solutions along the line β1 + β2 = 1 − λ in the β1 > 0, β2 > 0 quadrant.

Example (revisited, continued)

In contrast, the elastic net penalty always yields a unique solution:

$$
\hat\beta_1 = \hat\beta_2 =
\begin{cases}
0 & \text{if } \lambda_1 \ge 1, \\
\dfrac{1 - \lambda_1}{2 + \lambda_2} & \text{if } \lambda_1 < 1.
\end{cases}
$$

Note that regardless of λ1 and λ2, β̂1 is always equal to β̂2; this is reasonable, given that x_1 = x_2. Indeed, this is a general property of the elastic net: whenever x_j = x_k, β̂j = β̂k.
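
As a sanity check on the closed form above, the following sketch minimizes the penalized objective for the two-observation example directly, using SciPy's derivative-free Nelder-Mead method (the L1 term is not differentiable at zero); the particular values λ1 = 0.4 and λ2 = 0.5 are chosen only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# The two observations from the example; note that the two columns of X are identical.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
n = len(y)
lam1, lam2 = 0.4, 0.5

def objective(beta):
    # (1/2n)||y - X beta||^2 + lam1 ||beta||_1 + (lam2/2) ||beta||^2
    return (0.5 / n) * np.sum((y - X @ beta) ** 2) \
        + lam1 * np.sum(np.abs(beta)) + 0.5 * lam2 * np.sum(beta ** 2)

res = minimize(objective, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12})
print(res.x)                    # both coordinates approach the same value
print((1 - lam1) / (2 + lam2))  # closed-form value: 0.6 / 2.5 = 0.24
```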

Remarks

The example also illustrates that the elastic net retains properties of both the lasso and ridge regression methods. From the lasso, it inherits sparsity: in particular, β̂ = 0 if λ1 ≥ 1. From ridge regression, the elastic net inherits the ability to always produce a unique solution, as well as ridge regression's property of proportional shrinkage: β̂1 = β̂2 = (1 − λ1)/(2 + λ2) for the elastic net, compared to β̂1 = β̂2 = (1 − λ1)/2 for (one possible solution of) the lasso.

Reparameterization

A common reparameterization of the elastic net is to express the regularization parameters in terms of λ, which controls the overall degree of regularization, and α, which controls the balance between the lasso and ridge penalties:

$$
\lambda_1 = \alpha\lambda, \qquad \lambda_2 = (1 - \alpha)\lambda.
$$

This reparameterization is useful in practice, as it allows one to fix α and then select a single tuning parameter λ, which is more straightforward than attempting to select λ1 and λ2 separately.

Orthonormal solutions: Introduction

As with several other penalties we have considered, the elastic net has a closed form solution in the orthonormal case. Considering this special case lends considerable insight into the nature of the estimator and, in addition, proves useful for optimization via the coordinate descent algorithm.

KKT conditions

The KKT conditions, or penalized likelihood equations, are given by:

$$
\frac{1}{n}x_j^T(y - X\hat\beta) - \lambda_2\hat\beta_j = \lambda_1\,\mathrm{sign}(\hat\beta_j) \quad \text{if } \hat\beta_j \neq 0,
$$
$$
\left|\frac{1}{n}x_j^T(y - X\hat\beta)\right| \le \lambda_1 \quad \text{if } \hat\beta_j = 0.
$$

Simplifying these conditions for the orthonormal case yields

$$
z_j - \hat\beta_j - \lambda_2\hat\beta_j = \lambda_1\,\mathrm{sign}(\hat\beta_j) \quad \text{if } \hat\beta_j \neq 0,
$$
$$
|z_j| \le \lambda_1 \quad \text{if } \hat\beta_j = 0,
$$

where $z_j = x_j^T y / n$.
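
For readers who want to verify a fitted solution, here is a sketch of a numerical check based directly on the conditions above; the function name check_enet_kkt and its tolerance handling are illustrative, and it assumes the (1/2n) least-squares loss used in these notes.

```python
import numpy as np

def check_enet_kkt(X, y, beta, lam1, lam2, tol=1e-6):
    """Return True if beta (approximately) satisfies the elastic net KKT conditions."""
    n = X.shape[0]
    r = X.T @ (y - X @ beta) / n          # (1/n) x_j^T (y - X beta) for each j
    active = beta != 0
    ok_active = np.allclose(r[active] - lam2 * beta[active],
                            lam1 * np.sign(beta[active]), atol=tol)
    ok_inactive = np.all(np.abs(r[~active]) <= lam1 + tol)
    return bool(ok_active and ok_inactive)
```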

These equations can be further simplified by writing them in terms of the soft-thresholding operator:

$$
\hat\beta_j = \frac{S(z_j, \lambda_1)}{1 + \lambda_2}.
$$

In the orthonormal case, then, the elastic net solutions are simply the lasso solutions divided by 1 + λ2. In other words, the additional ridge penalty has the same effect on the lasso as the ridge penalty itself has on ordinary least squares regression: it provides shrinkage.
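
A minimal sketch of this update in NumPy; the same two lines are what a coordinate descent implementation would apply one coordinate at a time, with z_j recomputed from the current partial residuals.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def enet_orthonormal(z, lam1, lam2):
    """Elastic net solution in the orthonormal case: soft-threshold, then shrink."""
    return soft_threshold(z, lam1) / (1.0 + lam2)
```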

Remarks

As with ridge regression itself, shrinking the coefficients towards zero increases bias but reduces variance. Since this involves drawbacks as well as advantages, adding a ridge penalty is not universally beneficial, as the bias can dominate the variance. Still, as with ridge regression itself, it is typically the case that a profitable compromise can be reached by incorporating some (possibly small) ridge term into the penalty.

The grouping effect: Introduction

Our earlier example is an extreme instance of a property of the elastic net known as the grouping effect. The property states that highly correlated features will have similar estimated coefficients, which seems intuitively reasonable. Even if a data set does not contain identical variables as in the toy example, many data sets, particularly high-dimensional ones, contain highly correlated predictors. The shrinkage and grouping effects produced by the elastic net are an effective way of dealing with these correlated predictors.

The property can be described formally in terms of an upper bound on the difference between two coefficients as it relates to the correlation between the predictors:

$$
|\hat\beta_j - \hat\beta_k| \le \frac{\lVert y\rVert\sqrt{2(1 - \rho_{jk})}}{\lambda_2\sqrt{n}},
$$

where ρjk is the sample correlation between x_j and x_k. Note, in particular, that as ρjk → 1, the difference between β̂j and β̂k goes to zero.
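
A small helper evaluating this bound; note that the exact form of the bound depends on how the least-squares loss is scaled, and the version here assumes the (1/2n) scaling used in these notes, so treat it as a sketch rather than a definitive formula.

```python
import numpy as np

def grouping_bound(y, rho_jk, lam2):
    """Upper bound on |beta_j - beta_k| from the grouping-effect inequality above."""
    n = len(y)
    return np.linalg.norm(y) * np.sqrt(2.0 * (1.0 - rho_jk)) / (lam2 * np.sqrt(n))

# As rho_jk -> 1 the bound vanishes, forcing the two coefficient estimates together.
```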

Simulation example: Setup

To see the effect of the grouping property, we will carry out a simulation study with n = 50 and p = 100. All features x_j follow standard Gaussian distributions marginally, but we introduce correlation between the features in one of two ways:

- Compound symmetric: all features have the same pairwise correlation ρ.
- Block diagonal: the 100 features are partitioned into blocks of 5 features each, with a pairwise correlation of ρ between features within a block; features from separate blocks are independent.

In the generating model, we will set β1 = β2 = ... = β5 = 0.5 and β6 = β7 = ... = β100 = 0.

Simulation example: Setup (cont'd)

Note that in the block diagonal case, this introduces a grouping property: correlated features have identical coefficients. In the compound symmetric case, on the other hand, correlation between features does not tell us anything about their corresponding coefficients. For the elastic net penalty, for the sake of simplicity we set λ1 = λ2 and select λ by independent validation.
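
The following sketch generates data with these two correlation structures; it is a reconstruction of the setup described above (the slides do not give code), and the standard Gaussian error term is an assumption.

```python
import numpy as np

def simulate(n=50, p=100, rho=0.5, structure="block", block_size=5, seed=None):
    """Simulate y = X beta + e with compound symmetric or block diagonal correlation."""
    rng = np.random.default_rng(seed)
    if structure == "compound":
        Sigma = np.full((p, p), rho)
    else:  # block diagonal: correlation rho within blocks, independence across blocks
        Sigma = np.kron(np.eye(p // block_size), np.full((block_size, block_size), rho))
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.r_[np.full(5, 0.5), np.zeros(p - 5)]   # beta_1 = ... = beta_5 = 0.5
    y = X @ beta + rng.standard_normal(n)            # assumed N(0, 1) errors
    return X, y, beta
```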

Simulation example: Results

[Figure: MSE versus ρ for the lasso and elastic net; left panel: compound symmetric correlation, right panel: block diagonal correlation.]

Remarks

This simulation demonstrates that when the correlation between features is not large, there is often little difference between the lasso and elastic net estimators in terms of their estimation accuracy; indeed, when correlation is near zero, the lasso is often more accurate. When the correlation between features is large, however, the elastic net has an advantage over the lasso. This advantage is much more pronounced in the block diagonal case, where the coefficients have a grouping property.

Commentary on the grouping property

In practice, the grouping effect is often one of the strongest motivations for applying an elastic net penalty. For example, in gene expression studies, genes that have similar functions, or that work together in a pathway to accomplish a certain function, are often correlated. It is often reasonable to assume, then, that if the function is relevant to the response we are analyzing, the coefficients will be similar across the correlated group.

Commentary on the grouping property (cont'd)

It is worth pointing out, however, that grouping does not always hold. For example, in a genetic association study, it is certainly quite possible for two nearby variants to be highly correlated in their inheritance patterns, but for one variant to be harmless and the other to be highly deleterious. Nevertheless, in such a case it is often quite difficult to determine which of two highly correlated features is the causative feature, and the elastic net, which splits the estimated signal between the correlated features, offers a reasonable compromise.

Ridge + nonconvex penalty

The motivation for adding a ridge penalty to the lasso penalty also applies to nonconvex penalties such as MCP and SCAD; in fact, the motivation is perhaps even stronger in this case. As we saw last week, the objective functions for MCP and SCAD may fail to be convex and present multiple local minima, which leads to difficulty in optimization and decreased numerical stability. Adding a strictly convex ridge penalty can often substantially stabilize the problem.

Orthonormal solution for MCP

The addition of a ridge penalty has a similar shrinkage effect on MCP and SCAD as it does on lasso-penalized models. In particular, for MCP in the orthonormal case,

$$
\hat\beta_j =
\begin{cases}
\dfrac{S(z_j, \lambda_1)}{1 - 1/\gamma + \lambda_2} & \text{if } |z_j| \le \gamma\lambda_1(1 + \lambda_2), \\[1ex]
\dfrac{z_j}{1 + \lambda_2} & \text{if } |z_j| > \gamma\lambda_1(1 + \lambda_2).
\end{cases}
$$

Similar, if somewhat more complicated, results are available for SCAD (equation 4.7 in the book).
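
Here is a short NumPy sketch of this piecewise solution, assuming γ > 1 (so that 1 − 1/γ + λ2 > 0); it simply mirrors the cases above and is not taken from any package's source.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mnet_orthonormal(z, lam1, lam2, gamma=3.0):
    """MCP + ridge (Mnet) solution in the orthonormal case."""
    z = np.asarray(z, dtype=float)
    firm_region = np.abs(z) <= gamma * lam1 * (1.0 + lam2)
    return np.where(firm_region,
                    soft_threshold(z, lam1) / (1.0 - 1.0 / gamma + lam2),
                    z / (1.0 + lam2))
```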

Remarks

From this solution, we can see that the shrinkage role played by λ2 is, in a sense, the opposite of the bias-reduction role played by γ. While dividing by 1 − 1/γ inflates the value of S(z_j, λ1), dividing by 1 + λ2 shrinks it. When both are present in the model, the orthonormal solution is the soft-thresholding solution divided by 1 − 1/γ + λ2, which could either shrink or inflate S(z_j, λ1) depending on the balance between γ and λ2.

Remarks

It should be noted, however, that the two terms are not entirely redundant; while they cancel each other out in the denominator of the orthonormal solution, they do not cancel out elsewhere. In particular, they can have rather different effects in the presence of correlation among the features. Finally, as in the elastic net, the regularization parameters for the ridge-stabilized versions of MCP and SCAD are often expressed in terms of λ and α.

Simulation 1: Setup

We close this lecture with two simulation studies comparing the estimation accuracy of the lasso, MCP, the elastic net, and what we will abbreviate as Mnet: the MCP version of the elastic net (i.e., a penalty consisting of MCP plus a ridge term). First, suppose all covariates {x_j} follow independent standard Gaussian distributions, and that the outcome y equals Xβ plus errors drawn from the standard Gaussian distribution. For each independently generated data set, let n = 100 and p = 500, with 12 nonzero coefficients equal to s and the remaining 488 coefficients equal to zero; we will consider varying the signal strength s between 0.1 and 1.1.

Simulation 1: Setup (cont'd)

For all methods, tuning parameters are selected on the basis of mean-squared prediction error on an independent validation data set, also of size n = 100. For the lasso and MCP, only one tuning parameter (λ) was selected (for MCP, γ = 3 was fixed). For the Enet and Mnet estimators, we consider both fixed-α estimators and estimators in which both λ and α were selected by external validation (i.e., prediction error was calculated over a two-dimensional grid and the best combination of λ and α was chosen).
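
A sketch of what such a two-dimensional validation search might look like using scikit-learn's ElasticNet (not necessarily the software behind the slides' results); note that scikit-learn calls the overall penalty level alpha and the lasso/ridge balance l1_ratio, which play the roles of λ and α here. The grids below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def select_enet_by_validation(X_tr, y_tr, X_val, y_val,
                              lambdas=np.logspace(-3, 0, 50),
                              alphas=(0.1, 0.3, 0.5, 0.7, 0.9, 1.0)):
    """Choose (lambda, alpha) minimizing prediction MSE on a validation set."""
    best_mse, best_lam, best_alpha = np.inf, None, None
    for a in alphas:
        for lam in lambdas:
            fit = ElasticNet(alpha=lam, l1_ratio=a, max_iter=10000).fit(X_tr, y_tr)
            mse = np.mean((y_val - fit.predict(X_val)) ** 2)
            if mse < best_mse:
                best_mse, best_lam, best_alpha = mse, lam, a
    return best_lam, best_alpha, best_mse
```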

Simulation 1: Results

[Figure: relative MSE versus signal strength s. Left panel: lasso, MCP, and fixed-α estimators with α = 0.9, 0.7, 0.5, 0.3, 0.1; right panel: lasso, MCP, Enet, and Mnet with tuning parameters selected by validation.]

Fixed-α remarks

All methods behave rather similarly when s is small, as all models end up with estimates of β̂ ≈ 0 in these settings. As one might expect, a modest ridge penalty is beneficial in the medium-signal settings, with α = 0.5 achieving the highest accuracy when s = 0.5. As the signal increases, however, the downward bias of the ridge and lasso penalties plays a larger role, and MCP becomes the most accurate estimator, along with the α = 0.9 Mnet estimator, which is similar to MCP.

Variable-α remarks

In the variable-α case, there is little difference between the lasso and elastic net estimators. In particular, when s is large the two are virtually identical, due in part to the fact that α is typically selected to be 1 for the Enet when s is large. MCP and Mnet are similar to the lasso and Enet when s is small, but substantially outperform the lasso and elastic net when the signal is increased. One can improve estimation accuracy by adding a ridge penalty, although the gains are not particularly dramatic when the features are independent.

Simulation 2: Setup

Let us modify the previous simulation to examine how correlation among the features affects the results. In particular, all covariates {x_j} follow a standard Gaussian distribution marginally, but are now correlated, with a common (compound symmetric) correlation of ρ = 0.7 between any two covariates. This is a rather extreme amount of correlation, but it helps to clearly illustrate the effect of correlation on the relative performance of the methods.

Simulation 2: Results

[Figure: relative MSE versus signal strength s under compound symmetric correlation (ρ = 0.7). Left panel: lasso, MCP, and fixed-α estimators; right panel: lasso, MCP, Enet, and Mnet with tuning parameters selected by validation.]

Remarks

The primary point illustrated by the figure is that the benefits of shrinkage are much more pronounced in the presence of correlation. For example, while MCP was never far from the best estimator in the uncorrelated case, it is one of the worst methods for most signal strengths in the correlated case. Meanwhile, although Mnet and MCP were generally similar in Simulation 1, here Mnet outperforms MCP rather dramatically.

Conclusions

The addition of a ridge penalty typically yields little improvement in the absence of correlation. Benefits are more substantial in the presence of correlation, and very substantial in grouped scenarios. Adding ridge penalties offers much more potential advantage for nonconvex penalties; indeed, adjusting α to stabilize MCP in this way is often a more fruitful approach than adjusting γ. It is difficult to rely on any particular value of α; in practice, it is advisable to try out several values of α and use cross-validation to guide its selection.