The lasso. Patrick Breheny. February 15. Topics: the lasso, convex optimization, soft thresholding.

Patrick Breheny. High-Dimensional Data Analysis (BIOS 7600). February 15.

Introduction
Last week, we introduced penalized regression and discussed ridge regression, in which the penalty took the form of a sum of squares of the regression coefficients. In this topic, we will instead penalize the absolute values of the regression coefficients, a seemingly simple change with widespread consequences.

Specifically, consider the objective function
Q(β | X, y) = (1/2n) ‖y − Xβ‖₂² + λ‖β‖₁,
where ‖β‖₁ = Σ_j |β_j| denotes the ℓ₁ norm of the regression coefficients. As before, estimates of β are obtained by minimizing the above function for a given value of λ, yielding β̂(λ). This approach was originally proposed in the regression context by Robert Tibshirani in 1996, who called it the least absolute shrinkage and selection operator, or lasso.
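To make the objective concrete, here is a minimal numpy sketch of Q(β | X, y); the function name and the toy data are mine, not part of the slides.

import numpy as np

def lasso_objective(beta, X, y, lam):
    # Q(beta | X, y) = (1/2n) * ||y - X beta||_2^2 + lam * ||beta||_1
    n = X.shape[0]
    rss = np.sum((y - X @ beta) ** 2) / (2 * n)
    return rss + lam * np.sum(np.abs(beta))

# Toy usage: 5 observations, 2 predictors
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))
y = rng.standard_normal(5)
print(lasso_objective(np.zeros(2), X, y, lam=0.1))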

Shrinkage, selection, and sparsity
Its name captures the essence of what the lasso penalty accomplishes:
Shrinkage: Like ridge regression, the lasso penalizes large regression coefficients and shrinks estimates towards zero.
Selection: Unlike ridge regression, the lasso produces sparse solutions: some coefficient estimates are exactly zero, effectively removing those predictors from the model.
Sparsity has two very attractive properties:
Speed: Algorithms which take advantage of sparsity can scale up very efficiently, offering considerable computational advantages.
Interpretability: In models with hundreds or thousands of predictors, sparsity offers a helpful simplification of the model by allowing us to focus only on the predictors with nonzero coefficient estimates.

Ridge and lasso penalties
[Figure: the lasso and ridge penalty functions P(β) and their derivatives P′(β), plotted against β.]

Semi-differentiable functions
One obvious challenge that comes with the lasso is that, by introducing absolute values, we are no longer dealing with differentiable functions. For this reason, we're going to take a moment and extend some basic calculus results to the case of non-differentiable (more specifically, semi-differentiable) functions.
A function f : ℝ → ℝ is said to be semi-differentiable at a point x if both d₋f(x) and d₊f(x) exist as real numbers, where d₋f(x) and d₊f(x) are the left- and right-derivatives of f at x. Note that if f is semi-differentiable, then f is continuous.

Subderivatives and subdifferentials
Given a semi-differentiable function f : ℝ → ℝ, we say that d is a subderivative of f at x if d ∈ [d₋f(x), d₊f(x)]; the set [d₋f(x), d₊f(x)] is called the subdifferential of f at x, and is denoted ∂f(x). Note that the subdifferential is a set-valued function.
Recall that a function is differentiable at x if d₋f(x) = d₊f(x); i.e., if the subdifferential consists of a single point.

Example: |x|
For example, consider the function f(x) = |x|. The subdifferential is
∂f(x) = {−1} if x < 0,
∂f(x) = [−1, 1] if x = 0,
∂f(x) = {+1} if x > 0.
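A small sketch of this set-valued function, returning the subdifferential of |x| as an interval (the function name is mine, not from the slides):

def abs_subdifferential(x):
    # Subdifferential of f(x) = |x|, returned as the interval (lower, upper)
    if x < 0:
        return (-1.0, -1.0)
    if x > 0:
        return (1.0, 1.0)
    return (-1.0, 1.0)  # at x = 0, the whole interval [-1, 1]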

Optimization
The essential results of optimization can be extended to semi-differentiable functions.
Theorem: If f is a semi-differentiable function and x₀ is a local minimum or maximum of f, then 0 ∈ ∂f(x₀).
As with regular calculus, the converse is not true in general.

Computation rules
As with regular differentiation, the following basic rules apply.
Theorem: Let f be semi-differentiable, a, b be constants, and g be differentiable. Then
∂{af(x) + b} = a ∂f(x),
∂{f(x) + g(x)} = ∂f(x) + g′(x).
The notions extend to higher-order derivatives as well; a function f : ℝ → ℝ is said to be second-order semi-differentiable at a point x if both d²₋f(x) and d²₊f(x) exist as real numbers. The second-order subdifferential is denoted ∂²f(x) = [d²₋f(x), d²₊f(x)].

Convexity
As in the differentiable case, a convex function can be characterized in terms of its subdifferential.
Theorem: Suppose f is semi-differentiable on (a, b). Then f is convex on (a, b) if and only if ∂f is increasing on (a, b).
Theorem: Suppose f is second-order semi-differentiable on (a, b). Then f is convex on (a, b) if and only if ∂²f(x) ≥ 0 for all x ∈ (a, b).
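As a quick check connecting this with the earlier example (a worked step I am adding, not from the slides), the subdifferential of f(x) = |x| is increasing, so the first theorem confirms that |x| is convex:

\partial f(x) =
\begin{cases}
\{-1\}, & x < 0,\\
[-1,\,1], & x = 0,\\
\{+1\}, & x > 0,
\end{cases}
\qquad
d_1 \le d_2 \text{ whenever } d_1 \in \partial f(x_1),\ d_2 \in \partial f(x_2),\ x_1 < x_2.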

Multidimensional results
The previous results can be extended (although we'll gloss over the details) to multidimensional functions by replacing left- and right-derivatives with directional derivatives. A function f : ℝⁿ → ℝ is said to be semi-differentiable if the directional derivative d_u f(x) exists in all directions u.
Theorem: If f is a semi-differentiable function and x₀ is a local minimum of f, then d_u f(x₀) ≥ 0 for all directions u.
Theorem: Suppose f is a semi-differentiable function. Then f is convex over a set S if and only if d²_u f(x) ≥ 0 for all x ∈ S and in all directions u.

Score functions and penalized score functions
In classical statistical theory, the derivative of the log-likelihood function is called the score function, and maximum likelihood estimators are found by setting this derivative equal to zero, thus yielding the likelihood equations (or score equations):
0 = ∇_θ L(θ),
where L denotes the log-likelihood. Extending this idea to penalized likelihoods involves taking the derivatives of objective functions of the form Q(θ) = L(θ) + P(θ), yielding the penalized score function.

Penalized likelihood equations
For ridge regression, the penalized likelihood is everywhere differentiable, and the extension to penalized score equations is straightforward. For the lasso, and for the other penalties we will consider in this class, the penalized likelihood is not differentiable (specifically, not differentiable at zero), and subdifferentials are needed to characterize them.
Letting ∂Q(θ) denote the subdifferential of Q, the penalized likelihood equations (or penalized score equations) are:
0 ∈ ∂Q(θ).

In the optimization literature, the resulting equations are known as the Karush-Kuhn-Tucker (KKT) conditions. For convex optimization problems such as the lasso, the KKT conditions are both necessary and sufficient to characterize the solution.
A rigorous proof of this claim in multiple dimensions would involve some of the details we glossed over, but the idea is fairly straightforward: to solve for β, we simply replace the derivative with the subderivative and the likelihood with the penalized likelihood.

KKT conditions for the lasso
Result: β̂ minimizes the lasso objective function if and only if it satisfies the KKT conditions:
(1/n) x_jᵀ(y − Xβ̂) = λ sign(β̂_j) if β̂_j ≠ 0,
|(1/n) x_jᵀ(y − Xβ̂)| ≤ λ if β̂_j = 0.
In other words, the correlation between a predictor and the residuals, x_jᵀ(y − Xβ̂)/n, must exceed a certain minimum threshold λ in absolute value before it is included in the model. When this correlation is below λ, β̂_j = 0.
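A minimal numpy sketch that checks these conditions for a candidate solution (the function name and tolerance are mine; standardized columns are assumed, as in the slides):

import numpy as np

def check_lasso_kkt(beta_hat, X, y, lam, tol=1e-8):
    # Correlation between each predictor and the residuals
    n = X.shape[0]
    corr = X.T @ (y - X @ beta_hat) / n
    active = beta_hat != 0
    # Active coefficients: correlation must equal lam * sign(beta_hat_j)
    ok_active = np.all(np.abs(corr[active] - lam * np.sign(beta_hat[active])) <= tol)
    # Zero coefficients: correlation must be at most lam in absolute value
    ok_zero = np.all(np.abs(corr[~active]) <= lam + tol)
    return ok_active and ok_zero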

Remarks
If we set λ = λ_max ≡ max_{1≤j≤p} |x_jᵀy|/n, then β̂ = 0 satisfies the KKT conditions. That is, for any λ ≥ λ_max, we have β̂(λ) = 0.
On the other hand, if we set λ = 0, the KKT conditions are simply the normal equations for OLS, Xᵀ(y − Xβ̂) = 0.
Thus, the coefficient path for the lasso starts at λ_max and may continue until λ = 0 if X is full rank; otherwise it will terminate at some λ_min > 0 when the model becomes saturated.
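A one-liner for the largest meaningful λ on the path, assuming centered y and standardized columns of X as in the slides (the function name is mine):

import numpy as np

def lambda_max(X, y):
    # Smallest lambda at which beta_hat = 0 satisfies the KKT conditions
    n = X.shape[0]
    return np.max(np.abs(X.T @ y)) / n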

Lasso and uniqueness
The lasso criterion is convex, but not strictly convex if XᵀX is not full rank; thus the lasso solution may not be unique.
For example, suppose n = 2 and p = 2, with (y_1, x_11, x_12) = (1, 1, 1) and (y_2, x_21, x_22) = (−1, −1, −1). Then the solutions are
(β̂_1, β̂_2) = (0, 0) if λ ≥ 1,
(β̂_1, β̂_2) ∈ {(β_1, β_2) : β_1 + β_2 = 1 − λ, β_1 ≥ 0, β_2 ≥ 0} if 0 ≤ λ < 1.
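The non-uniqueness is easy to verify numerically; this small check of my own confirms that several points on the solution set attain the same value of the lasso objective when λ = 1/2:

import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
lam = 0.5

def Q(beta):
    n = X.shape[0]
    return np.sum((y - X @ beta) ** 2) / (2 * n) + lam * np.sum(np.abs(beta))

# Each of these satisfies beta_1 + beta_2 = 1 - lam with nonnegative entries
# and gives the same minimum objective value (0.375 here).
for beta in ([0.5, 0.0], [0.0, 0.5], [0.25, 0.25]):
    print(beta, Q(np.array(beta)))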

Special case: Orthonormal design
As with ridge regression, it is instructive to consider the special case where the design matrix X is orthonormal: n⁻¹XᵀX = I.
Result: In the orthonormal case, the lasso estimate is
β̂_j(λ) = z_j − λ if z_j > λ,
β̂_j(λ) = 0 if |z_j| ≤ λ,
β̂_j(λ) = z_j + λ if z_j < −λ,
where z_j = x_jᵀy/n is the OLS solution.

The result on the previous slide can be written more compactly as β̂_j(λ) = S(z_j, λ), where the function S(·, λ) is known as the soft thresholding operator. This was originally proposed by Donoho and Johnstone in 1994 for soft thresholding of wavelet coefficients in the context of nonparametric regression.
By comparison, the hard thresholding operator is H(z, λ) = z·I{|z| > λ}, where I(S) is the indicator function for the set S.
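Both operators are one-liners in numpy (a small sketch; the function names are mine):

import numpy as np

def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    # H(z, lam) = z * 1{|z| > lam}
    return z * (np.abs(z) > lam)

print(soft_threshold(np.array([-2.0, -0.3, 0.3, 2.0]), 0.5))
print(hard_threshold(np.array([-2.0, -0.3, 0.3, 2.0]), 0.5))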

Soft and hard thresholding operators
[Figure: β̂(z) plotted against z for the hard and soft thresholding operators.]

Probability that β̂_j = 0
With soft thresholding, it is clear that the lasso has a positive probability of yielding an estimate of exactly 0, in other words, of producing a sparse solution. Specifically, the probability of dropping x_j from the model is P(|z_j| ≤ λ).
Under the assumption that ε_i ~ iid N(0, σ²), we have z_j ~ N(β, σ²/n) and
P(β̂_j(λ) = 0) = Φ((λ − β)/(σ/√n)) − Φ((−λ − β)/(σ/√n)),
where Φ is the Gaussian CDF.
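This probability is easy to evaluate with scipy (a sketch of my own; the calls below use the setting of the next slide, σ = 1, n = 10, λ = 1/2):

import numpy as np
from scipy.stats import norm

def prob_zero(beta, lam, sigma, n):
    # P(beta_hat_j(lam) = 0) = Phi((lam - beta)/se) - Phi((-lam - beta)/se)
    se = sigma / np.sqrt(n)
    return norm.cdf((lam - beta) / se) - norm.cdf((-lam - beta) / se)

print(prob_zero(beta=0.0, lam=0.5, sigma=1.0, n=10))  # true coefficient is zero
print(prob_zero(beta=1.0, lam=0.5, sigma=1.0, n=10))  # true coefficient is one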

Sampling distribution
For σ = 1, n = 10, and λ = 1/2: [Figure: sampling distribution of β̂, showing the continuous density and the probability (point mass) at zero, for β₀ = 0 and β₀ = 1.]
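The figure can be reproduced approximately by simulation under the orthonormal-design result (a sketch of my own; the seed and number of replicates are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sigma, n, lam = 1.0, 10, 0.5

def simulate_beta_hat(beta0, reps=100_000):
    # z ~ N(beta0, sigma^2/n); the lasso estimate is the soft-thresholded z
    z = rng.normal(beta0, sigma / np.sqrt(n), size=reps)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

for beta0 in (0.0, 1.0):
    draws = simulate_beta_hat(beta0)
    print(beta0, (draws == 0).mean(), draws.mean())  # point mass at zero, overall mean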

Remarks
This sampling distribution is very different from that of a classical MLE:
The distribution is mixed: a portion is continuously distributed, but there is also a point mass at zero.
The continuous portion is not normally distributed.
The distribution is asymmetric (unless β = 0).
The distribution is not centered at the true value of β.
These facts create a number of challenges for carrying out inference using the lasso; we will be putting this issue aside for now, but will return to it later in the course.