Least Angle Regression, Forward Stagewise and the Lasso


January 2005 Rob Tibshirani, Stanford 1 Least Angle Regression, Forward Stagewise and the Lasso. Brad Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani, Stanford University. Annals of Statistics, 2004 (with discussion). http://www-stat.stanford.edu/~tibs

January 2005 Rob Tibshirani, Stanford 2 Background. Today's talk is about linear regression, but the motivation comes from the area of flexible function fitting: boosting (Freund & Schapire, 1995).

January 2005 Rob Tibshirani, Stanford 3 Least Squares Boosting (Friedman, Hastie & Tibshirani; see Elements of Statistical Learning, chapter 10). Supervised learning: response y, predictors x = (x_1, x_2, ..., x_p).
1. Start with function F(x) = 0 and residual r = y.
2. Fit a CART regression tree to r, giving f(x).
3. Set F(x) ← F(x) + εf(x), r ← r − εf(x), and repeat step 2 many times.
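
For concreteness, here is a minimal R sketch of this procedure (not the authors' code) using small rpart regression trees; the function name ls_boost and the choices of eps, n_steps and depth are illustrative.

```r
# A minimal sketch of least squares boosting with small regression trees,
# assuming the rpart package; eps, n_steps and depth are illustrative choices.
library(rpart)

ls_boost <- function(x, y, eps = 0.01, n_steps = 500, depth = 2) {
  dat <- data.frame(x)
  r <- y                                   # start with F(x) = 0, so r = y
  trees <- vector("list", n_steps)
  for (m in seq_len(n_steps)) {
    fit <- rpart(r ~ ., data = data.frame(dat, r = r),
                 control = rpart.control(maxdepth = depth, cp = 0))
    trees[[m]] <- fit
    r <- r - eps * predict(fit, dat)       # F <- F + eps*f,  r <- r - eps*f
  }
  trees                                    # F(x) is eps times the sum of tree predictions
}
```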

January 2005 Rob Tibshirani, Stanford 4 Prediction Error. [Figure: prediction error versus number of steps for least squares boosting with ε = 1 and ε = 0.01, compared with a single tree.]

January 2005 Rob Tibshirani, Stanford 5 Linear Regression. Here is a version of least squares boosting for multiple linear regression (assume the predictors are standardized).
(Incremental) Forward Stagewise:
1. Start with r = y and β_1 = β_2 = ... = β_p = 0.
2. Find the predictor x_j most correlated with r.
3. Update β_j ← β_j + δ_j, where δ_j = ε · sign(⟨r, x_j⟩).
4. Set r ← r − δ_j x_j and repeat steps 2 and 3 many times.
Taking δ_j = ⟨r, x_j⟩ gives the usual forward stagewise procedure; this is different from forward stepwise. The procedure is analogous to least squares boosting, with trees replaced by predictors.
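
A minimal R sketch of the incremental forward stagewise updates above; the function name stagewise and the defaults for eps and n_steps are illustrative, not the lars package implementation.

```r
# Incremental forward stagewise for standardized predictors (illustrative sketch).
stagewise <- function(x, y, eps = 0.01, n_steps = 5000) {
  x <- scale(x)                          # standardize the predictors
  r <- y - mean(y)
  beta <- rep(0, ncol(x))
  path <- matrix(0, n_steps, ncol(x))
  for (k in seq_len(n_steps)) {
    cors <- drop(crossprod(x, r))        # <r, x_j> for every predictor
    j <- which.max(abs(cors))            # most correlated predictor
    delta <- eps * sign(cors[j])
    beta[j] <- beta[j] + delta           # beta_j <- beta_j + delta_j
    r <- r - delta * x[, j]              # r <- r - delta_j * x_j
    path[k, ] <- beta
  }
  path                                   # coefficient path, one row per step
}
```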

January 2005 Rob Tibshirani, Stanford 6 Prostate Cancer Data. [Figure: coefficient profiles for the prostate cancer data. Left panel: Lasso coefficients plotted against t = Σ_j |β_j| (0 to 2.5). Right panel: Forward Stagewise coefficients plotted against iteration (0 to 250). The labelled predictors are lcavol, svi, lweight, pgg45, lbph, gleason, age and lcp.]

January 2005 Rob Tibshirani, Stanford 7 Linear regression via the Lasso (Tibshirani, 1995). Assume ȳ = 0, x̄_j = 0, Var(x_j) = 1 for all j. Minimize Σ_i (y_i − Σ_j x_ij β_j)² subject to Σ_j |β_j| ≤ s. With orthogonal predictors, the solutions are soft-thresholded versions of the least squares coefficients: sign(β̂_j)(|β̂_j| − γ)_+, where γ is a function of the bound s. For small values of the bound s, the Lasso does variable selection. See the pictures that follow.
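
The soft-thresholding rule for the orthogonal case is easy to state in code; a short R helper, assuming the least squares coefficients and the threshold γ are given:

```r
# Soft thresholding of least squares coefficients (orthogonal-design case);
# gamma plays the role of the threshold determined by the bound s.
soft_threshold <- function(beta_ls, gamma) {
  sign(beta_ls) * pmax(abs(beta_ls) - gamma, 0)
}

soft_threshold(c(-2.5, 0.3, 1.7), gamma = 0.5)   # -2.0  0.0  1.2
```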

January 2005 Rob Tibshirani, Stanford 8 Lasso and Ridge regression. [Figure: the lasso and ridge constraint regions in the (β_1, β_2) plane, together with the elliptical contours of the residual sum of squares centred at the full least squares estimate β̂.]

January 2005 Rob Tibshirani, Stanford 9 More on the Lasso. Current implementations use quadratic programming to compute solutions. It can be applied when p > n; in that case, the number of non-zero coefficients is at most n − 1 (by convex duality), which has interesting consequences for applications, e.g. microarray data.

January 2005 Rob Tibshirani, Stanford 10 Diabetes Data. [Figure: coefficient paths β̂_j for the 10 diabetes predictors, Lasso in the left panel and Stagewise in the right panel, plotted against t = Σ|β̂_j| (0 to about 3000), with the curves labelled by predictor number.]

January 2005 Rob Tibshirani, Stanford 11 Why are Forward Stagewise and Lasso so similar? Are they identical? In the orthogonal predictor case: yes. In the hard-to-verify case of monotone coefficient paths: yes. In general: almost! Least angle regression (LAR) provides answers to these questions, and an efficient way to compute the complete Lasso sequence of solutions.

January 2005 Rob Tibshirani, Stanford 12 Least Angle Regression (LAR). Like a more democratic version of forward stepwise regression.
1. Start with r = y and β̂_1 = β̂_2 = ... = β̂_p = 0. Assume the x_j are standardized.
2. Find the predictor x_j most correlated with r.
3. Increase β_j in the direction of sign(corr(r, x_j)) until some other competitor x_k has as much correlation with the current residual as does x_j.
4. Move (β̂_j, β̂_k) in the joint least squares direction for (x_j, x_k) until some other competitor x_l has as much correlation with the current residual.
5. Continue in this way until all predictors have been entered. Stop when corr(r, x_j) = 0 for all j, i.e. the OLS solution.
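
A small numerical check, on simulated data, of the property behind steps 3 and 4: moving toward the joint least squares fit of the active predictors shrinks every active correlation with the residual by the same factor, so predictors that enter with equal correlation stay tied.

```r
# Simulated illustration: the joint least squares direction for the active set
# reduces all active correlations with the residual proportionally.
set.seed(1)
n  <- 100
x1 <- as.numeric(scale(rnorm(n)))
x2 <- as.numeric(scale(rnorm(n) + 0.5 * x1))
XA <- cbind(x1, x2)                                     # active set after step 2
r  <- as.numeric(scale(x1 + x2 + rnorm(n)))             # current residual

d     <- XA %*% solve(crossprod(XA), crossprod(XA, r))  # joint LS fit of r on XA
gamma <- 0.3                                            # any step length in (0, 1)
crossprod(XA, r - gamma * d) / crossprod(XA, r)         # both ratios equal 1 - gamma
```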

January 2005 Rob Tibshirani, Stanford 13 [Figure: geometry of LAR with two predictors x_1 and x_2: starting from µ̂_0, the fit moves along x_1 to µ̂_1, then along the equiangular direction u_2 toward ȳ_2, the projection of y onto the plane spanned by x_1 and x_2.] The LAR direction u_2 at step 2 makes an equal angle with x_1 and x_2.

January 2005 Rob Tibshirani, Stanford 14 LARS. [Figure: LARS on the diabetes data. Top panel: coefficient paths β̂_j versus Σ|β̂_j|, with curves labelled by predictor number. Bottom panel: the absolute current correlations ĉ_kj and their common maximum Ĉ_k plotted against step k = 2, 4, ..., 10.]

January 2005 Rob Tibshirani, Stanford 15 Relationship between the 3 algorithms. Lasso and forward stagewise can be thought of as restricted versions of LAR. For the Lasso: start with LAR; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction and continue. This gives the Lasso path. Proof (lengthy): use the Karush-Kuhn-Tucker theory of convex optimization. Informally, setting ∂/∂β_j { ‖y − Xβ‖² + λ Σ_j |β_j| } = 0 gives ⟨x_j, r⟩ = (λ/2) sign(β̂_j) if β̂_j ≠ 0 (active).
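
A hedged R helper (illustrative, not part of the lars package) that checks this stationarity condition at a candidate solution, using the λ/2 scaling above:

```r
# Check the lasso optimality conditions for a candidate (beta, lambda):
# active coordinates should satisfy <x_j, r> = (lambda/2) * sign(beta_j),
# inactive ones |<x_j, r>| <= lambda/2.
check_lasso_kkt <- function(X, y, beta, lambda, tol = 1e-6) {
  r <- y - X %*% beta
  g <- drop(crossprod(X, r))                  # <x_j, r> for each predictor
  active <- abs(beta) > tol
  list(active_ok   = all(abs(g[active] - (lambda / 2) * sign(beta[active])) < 1e-4),
       inactive_ok = all(abs(g[!active]) <= lambda / 2 + 1e-4))
}
```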

January 2005 Rob Tibshirani, Stanford 16 For forward stagewise: start with LAR. Compute the best (equiangular) direction at each stage. If the direction for any predictor j doesn't agree in sign with corr(r, x_j), project the direction into the positive cone and use the projected direction instead. In other words, forward stagewise always moves each predictor in the direction of corr(r, x_j). The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size ε → 0, one can show that it coincides with this modified version of LAR.

January 2005 Rob Tibshirani, Stanford 17 More on forward stagewise. Let A be the active set of predictors at some stage. Suppose the procedure takes M = Σ_j M_j steps of size ε in these predictors, M_j of them in predictor j (M_j = 0 for j ∉ A). Then β̂ is changed to β̂ + Mε (s_1 M_1/M, s_2 M_2/M, ..., s_p M_p/M), where s_j = sign(corr(r, x_j)). The vector θ = (M_1/M, ..., M_p/M) satisfies θ = argmin Σ_i (r_i − Σ_{j∈A} x_ij s_j θ_j)², subject to θ_j ≥ 0 for all j. This is a non-negative least squares estimate.
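
A sketch of that direction computation in R, assuming the nnls package for the non-negative least squares fit; the function stagewise_direction and its arguments are illustrative names, not from the talk.

```r
# Forward stagewise direction: non-negative least squares fit of the residual
# on the signed active predictors, mapped back to the coefficient scale.
library(nnls)

stagewise_direction <- function(X, r, active) {
  s     <- sign(drop(crossprod(X[, active, drop = FALSE], r)))  # s_j for j in A
  Z     <- sweep(X[, active, drop = FALSE], 2, s, `*`)          # signed predictors
  theta <- nnls(Z, r)$x                                         # theta_j >= 0
  dir   <- rep(0, ncol(X))
  dir[active] <- s * theta     # each beta_j moves in the sign of corr(r, x_j)
  dir
}
```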

January 2005 Rob Tibshirani, Stanford 18 [Figure: geometric picture of the stagewise direction relative to the positive cone C_A generated by the signed active predictors x_1, x_2, x_3, with the equiangular vector v_A and the projected direction v_B̂.] The forward stagewise direction lies in the positive cone spanned by the (signed) predictors with equal correlation with the current residual.

January 2005 Rob Tibshirani, Stanford 19 Summary. LARS uses least squares directions in the active set of variables. Lasso uses least squares directions; if a variable crosses zero, it is removed from the active set. Forward stagewise uses non-negative least squares directions in the active set.

January 2005 Rob Tibshirani, Stanford 20 Benefits. A possible explanation of the benefit of slow learning in boosting: it is approximately fitting via an L_1 (lasso) penalty. The new algorithm computes the entire Lasso path in the same order of computation as one full least squares fit. Splus/R software is on Hastie's website: www-stat.stanford.edu/~hastie/papers#lars. Degrees of freedom formula for LAR: after k steps, the degrees of freedom of the fit = k (with some regularity conditions). For the Lasso, the procedure often takes more than p steps, since predictors can drop out. Corresponding formula (conjecture): the degrees of freedom of the last model in the sequence with k predictors is equal to k.

January 2005 Rob Tibshirani, Stanford 21 Degrees of freedom. [Figure: two panels plotting the df estimate against step k, for k up to 10 and for k up to 60.]

January 2005 Rob Tibshirani, Stanford 22 Degrees of freedom result: df(µ̂) ≡ Σ_{i=1}^n cov(µ̂_i, y_i)/σ² = k. The proof is an application of Stein's unbiased risk estimate (SURE). Suppose that g: R^n → R^n is almost differentiable and set ∇·g = Σ_{i=1}^n ∂g_i/∂x_i. If y ~ N_n(µ, σ²I), then Stein's formula states that Σ_{i=1}^n cov(g_i, y_i)/σ² = E[∇·g(y)]. The LHS is the degrees of freedom. Set g(·) equal to the LAR estimate. In the orthogonal case, ∂g_i/∂x_i is 1 if the predictor is in the model and 0 otherwise, hence the RHS equals the number of predictors in the model (= k). The non-orthogonal case is much harder.
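
A Monte Carlo illustration of the covariance definition of degrees of freedom, using an ordinary least squares fit where df = p exactly; all data are simulated, and replacing the hat-matrix fit by the LAR fit at step k would give an estimate of the df in the result above.

```r
# Monte Carlo check of df = sum_i cov(muhat_i, y_i) / sigma^2 for OLS (df = p).
set.seed(1)
n <- 50; p <- 5; sigma <- 1
X  <- matrix(rnorm(n * p), n, p)
mu <- drop(X %*% rnorm(p))                # true mean vector
H  <- X %*% solve(crossprod(X), t(X))     # hat matrix

B    <- 2000
Y    <- matrix(rnorm(n * B, mean = mu, sd = sigma), n, B)   # B simulated responses
Fits <- H %*% Y                                             # muhat for each y
df_hat <- sum(sapply(seq_len(n), function(i) cov(Fits[i, ], Y[i, ]))) / sigma^2
df_hat                                                      # close to p = 5
```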

January 2005 Rob Tibshirani, Stanford 23 Software for R and Splus. The lars() function fits all three models: lasso, lar or forward.stagewise. There are methods for prediction, plotting, and cross-validation, and detailed documentation is provided. Visit www-stat.stanford.edu/~hastie/papers/#lars. The main computations involve least squares fitting using the active set of variables; they are managed by updating the Choleski R matrix (with frequent downdating for lasso and forward stagewise).
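
A typical session with the lars package; lars(), plot() and cv.lars() are the package's documented entry points, and the data below are purely illustrative.

```r
# Fit the three paths and cross-validate the lasso on simulated data.
library(lars)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

fit <- lars(x, y, type = "lasso")   # also type = "lar" or "forward.stagewise"
plot(fit)                           # coefficient paths against |beta| / max|beta|
cv.lars(x, y, K = 10)               # 10-fold cross-validation curve
```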

January 2005 Rob Tibshirani, Stanford 24 Microarray Example. Expression data for 38 leukemia patients (the Golub data): an X matrix with 38 samples and 7129 variables (genes). The response Y is dichotomous: ALL (27) vs AML (11). LARS (lasso) took 4 seconds in R version 1.7 on a 1.8GHz Dell workstation running Linux. In 70 steps, 52 variables were ever non-zero, at most 37 at a time.

January 2005 Rob Tibshirani, Stanford 25 LASSO. [Figure: Lasso standardized coefficient paths for the leukemia data plotted against |beta|/max|beta|; curves labelled with gene numbers, including 2968, 6801, 2945, 461 and 2267.]

January 2005 Rob Tibshirani, Stanford 26 LASSO. [Figure: the same Lasso paths, with the values 0, 1, 3, 4, 5, 8, 12, 15, 20, 26, 31, 38, 50, 65 marked along the top axis.]

January 2005 Rob Tibshirani, Stanford 27 [Figure: 10-fold cross-validation for the leukemia expression data (Lasso): CV error (about 0.05 to 0.25) plotted against the fraction |beta|/max|beta|.]

January 2005 Rob Tibshirani, Stanford 28 LAR. [Figure: LAR standardized coefficient paths for the leukemia data plotted against |beta|/max|beta|; curves labelled with gene numbers, including 6895, 1817, 4328, 1241, 2534, 5039, 2267 and 1882.]

January 2005 Rob Tibshirani, Stanford 29 LAR. [Figure: the same LAR paths, with the values 0, 2, 3, 4, 8, 13, 20, 25, 29, 32, 35, 36, 37 marked along the top axis.]

January 2005 Rob Tibshirani, Stanford 30 [Figure: 10-fold cross-validation for the leukemia expression data (LAR): CV error plotted against the fraction |beta|/max|beta|.]

January 2005 Rob Tibshirani, Stanford 31 Forward Stagewise. [Figure: Forward Stagewise standardized coefficient paths for the leukemia data plotted against |beta|/max|beta|; curves labelled with gene numbers, including 4535, 4399, 87, 2774, 461, 1834, 2267 and 1882.]

January 2005 Rob Tibshirani, Stanford 32 Forward Stagewise. [Figure: the same Forward Stagewise paths, with the values 0, 1, 3, 4, 5, 8, 12, 15, 19, 29, 43, 54, 82, 121 marked along the top axis.]

January 2005 Rob Tibshirani, Stanford 33 [Figure: 10-fold cross-validation for the leukemia expression data (Stagewise): CV error plotted against the fraction |beta|/max|beta|.]

January 2005 Rob Tibshirani, Stanford 34 A new result (Hastie, Taylor, Tibshirani, Walther). Criterion for forward stagewise: minimize Σ_i (y_i − Σ_j x_ij β_j(t))², subject to Σ_j ∫_0^t |β′_j(s)| ds ≤ t (bounded L_1 arc-length).

January 2005 Rob Tibshirani, Stanford 35 Future directions. Use these ideas to make better versions of boosting (Friedman and Popescu). Application to support vector machines (Zhu, Rosset, Hastie, Tibshirani). The fused lasso: β̂ = argmin Σ_i (y_i − Σ_j x_ij β_j)² subject to Σ_{j=1}^p |β_j| ≤ s_1 and Σ_{j=2}^p |β_j − β_{j−1}| ≤ s_2. It has applications to microarrays and protein mass spectrometry.