Regularization Paths


Trevor Hastie, Stanford University, drawing on collaborations with Brad Efron, Mee-Young Park, Saharon Rosset, Rob Tibshirani, Hui Zou and Ji Zhu.

Theme

- Boosting fits a regularization path toward a max-margin classifier. Svmpath does as well.
- In neither case is this endpoint always of interest; somewhere along the path is often better.
- Having efficient algorithms for computing entire paths facilitates this selection.
- A mini-industry has emerged for generating regularization paths covering a broad spectrum of statistical problems.

[Figure: two panels, Adaboost stumps for classification (test misclassification error vs. number of iterations) and boosting stumps for regression (test mean squared error vs. number of trees), each shown with and without shrinkage.]

Least Squares Boosting

Friedman, Hastie & Tibshirani (see Elements of Statistical Learning, chapter 10). Supervised learning: response $y$, predictors $x = (x_1, x_2, \ldots, x_p)$.

1. Start with the function $F(x) = 0$ and residual $r = y$.
2. Fit a CART regression tree to $r$, giving $f(x)$.
3. Set $F(x) \leftarrow F(x) + \epsilon f(x)$, $r \leftarrow r - \epsilon f(x)$, and repeat steps 2 and 3 many times.

Here is a version of least squares boosting for multiple linear regression (assume the predictors are standardized):

(Incremental) Forward Stagewise

1. Start with $r = y$ and $\beta_1 = \beta_2 = \cdots = \beta_p = 0$.
2. Find the predictor $x_j$ most correlated with $r$.
3. Update $\beta_j \leftarrow \beta_j + \delta_j$, where $\delta_j = \epsilon \cdot \mathrm{sign}\langle r, x_j\rangle$.
4. Set $r \leftarrow r - \delta_j x_j$ and repeat steps 2 and 3 many times.

Taking $\delta_j = \langle r, x_j\rangle$ gives the usual forward stagewise procedure, which is different from forward stepwise. The algorithm is analogous to least squares boosting, with individual predictors playing the role of trees.

Example: Prostate Cancer Data

[Figure: coefficient profiles for the Lasso, plotted against $t = \sum_j |\beta_j|$, and for incremental Forward Stagewise, plotted against the iteration number, on the prostate cancer data (lcavol, lweight, svi, lbph, pgg45, gleason, age, lcp). The two sets of paths are nearly identical.]

Linear Regression via the Lasso (Tibshirani, 1996)

Assume $\bar y = 0$, $\bar x_j = 0$ and $\mathrm{Var}(x_j) = 1$ for all $j$. Minimize
$$ \sum_i \Big(y_i - \sum_j x_{ij}\beta_j\Big)^2 \quad\text{subject to}\quad \sum_j |\beta_j| \le t. $$
This is similar to ridge regression, which has the constraint $\sum_j \beta_j^2 \le t$. The lasso does variable selection and shrinkage, while ridge regression only shrinks.
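To make the incremental forward stagewise recipe above concrete, here is a minimal Python/NumPy sketch (not from the talk). The synthetic data, the step size eps and the number of steps are placeholder choices, and the predictors are assumed standardized as on the slides.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise for linear regression.

    Assumes the columns of X are centered and scaled and y is centered,
    as on the slides. Returns the coefficient iterates so the whole path
    can be plotted against sum_j |beta_j|.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                      # start with residual r = y
    path = np.zeros((n_steps, p))
    for step in range(n_steps):
        corr = X.T @ r                # inner products <r, x_j>
        j = np.argmax(np.abs(corr))   # predictor most correlated with r
        delta = eps * np.sign(corr[j])
        beta[j] += delta              # tiny move in the chosen coordinate
        r -= delta * X[:, j]          # update the residual
        path[step] = beta
    return path

# Example with synthetic data (a placeholder for the prostate data):
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0]) + rng.standard_normal(100)
y -= y.mean()
path = forward_stagewise(X, y)
```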

Diabetes Data: Lasso and Stagewise

[Figure: coefficient profiles $\hat\beta_j$ for the lasso and for incremental forward stagewise on the diabetes data, both plotted against $t = \sum_j |\hat\beta_j|$; the two sets of paths are almost indistinguishable.]

Why are Forward Stagewise and the Lasso so similar?

- Are they identical?
- In the orthogonal-predictor case: yes.
- In the hard-to-verify case of monotone coefficient paths: yes.
- In general, almost!
- Least angle regression (LAR) provides answers to these questions, and an efficient way to compute the complete lasso sequence of solutions.

Least Angle Regression (LAR)

LAR is like a more democratic version of forward stepwise regression.

1. Start with $r = y$ and $\hat\beta_1 = \hat\beta_2 = \cdots = \hat\beta_p = 0$. Assume the $x_j$ are standardized.
2. Find the predictor $x_j$ most correlated with $r$.
3. Increase $\beta_j$ in the direction of $\mathrm{sign}(\mathrm{corr}(r, x_j))$ until some other competitor $x_k$ has as much correlation with the current residual as $x_j$ does.
4. Move $(\hat\beta_j, \hat\beta_k)$ in the joint least squares direction for $(x_j, x_k)$ until some other competitor $x_l$ has as much correlation with the current residual.
5. Continue in this way until all predictors have been entered. Stop when $\mathrm{corr}(r, x_j) = 0$ for all $j$, i.e. at the OLS solution.

[Figure: LAR standardized coefficient profiles plotted against $|\beta|/\max|\beta|$, with the degrees of freedom for LAR labeled along the top of the plot.]

The df are labeled at the top of the figure. At the point a competitor enters the active set, the df are incremented by 1. This is not true, for example, for stepwise regression.
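The LAR/lasso path itself is easy to compute with off-the-shelf software. The sketch below uses scikit-learn's lars_path as a stand-in for the lars R package mentioned later in the talk, with scikit-learn's bundled diabetes data assumed to play the role of the diabetes data on the slides.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

# Diabetes data (centered and scaled inputs), as used to illustrate the paths.
X, y = load_diabetes(return_X_y=True)

# method='lar' gives the pure least angle regression path;
# method='lasso' gives LAR with the lasso modification
# (drop a variable when its coefficient crosses zero).
alphas, active, coefs = lars_path(X, y, method="lasso")

# coefs has shape (n_features, n_breakpoints): column k holds the
# coefficients at the k-th breakpoint of the piecewise-linear path.
t = np.abs(coefs).sum(axis=0)          # L1 norm at each breakpoint
print("number of breakpoints:", coefs.shape[1])
print("active set at the end of the path:", active)
print("L1 arc length of the path:", t[-1])
```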

LARS Geometry

[Figure: left, the coefficient profiles $\hat\beta_j$ and the current absolute correlations $\hat c_{kj}$, together with their maximum $\hat C_k$, plotted against the step $k$; right, a geometric diagram in which the LAR direction $u_2$ at step 2 makes an equal angle with $x_1$ and $x_2$.]

Relationship Between the Algorithms

Lasso and forward stagewise can be thought of as restricted versions of LAR.

Lasso: Start with LAR. If a coefficient crosses zero, stop; drop that predictor, recompute the best direction and continue. This gives the lasso path.

Proof: use the KKT conditions for the appropriate Lagrangian. Informally,
$$ \frac{\partial}{\partial \beta_j}\Big[\tfrac12\|y - X\beta\|^2 + \lambda\sum_j|\beta_j|\Big] = 0 \;\Longrightarrow\; \langle x_j, r\rangle = \lambda\,\mathrm{sign}(\hat\beta_j)\quad\text{if } \hat\beta_j \ne 0\ \text{(active)}. $$

Forward Stagewise: Compute the LAR direction, but constrain the signs of the coefficients to match the correlations $\mathrm{corr}(r, x_j)$. The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size $\epsilon \to 0$, it can be shown to coincide with this modified version of LAR.
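As a hedged numerical illustration of the stationarity condition above: scikit-learn's Lasso minimizes $\frac{1}{2n}\|y - X\beta\|^2 + \alpha\|\beta\|_1$, so the condition becomes $\langle x_j, r\rangle/n = \alpha\,\mathrm{sign}(\hat\beta_j)$ for active coordinates and $|\langle x_j, r\rangle|/n \le \alpha$ for inactive ones. The data below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)

alpha = 0.1
fit = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y)
beta = fit.coef_
r = y - X @ beta

# KKT / stationarity check: active coefficients sit exactly at
# alpha * sign(beta_j); inactive ones lie inside [-alpha, alpha].
grad = X.T @ r / n
active = beta != 0
print(np.allclose(grad[active], alpha * np.sign(beta[active]), atol=1e-6))
print(np.all(np.abs(grad[~active]) <= alpha + 1e-6))
```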

The lars Package

- The LARS algorithm computes the entire lasso/forward-stagewise/LAR path with the same order of computation as one full least squares fit.
- When $p \gg N$, the solution has at most $N$ non-zero coefficients.
- Works efficiently for microarray data ($p$ in the thousands).
- Cross-validation is quick and easy.

[Figure: 10-fold cross-validation error curve using the lasso on the diabetes data, plotted against the tuning parameter $s$. The thick curve is the CV error curve; the shaded region indicates the standard error of the CV estimate. The curve shows the effect of overfitting: errors start to increase again for larger $s$, reflecting the trade-off between bias and variance.]

Forward Stagewise and the Monotone Lasso

- Expand the variable set to include the negated versions $-x_j$.
- The original lasso corresponds to a positive lasso in this enlarged space.
- Forward stagewise corresponds to a monotone lasso.
- The $L_1$ norm $\|\beta\|_1$ in this enlarged space is arc length.
- Forward stagewise produces the maximum decrease in loss per unit arc length in the coefficients.

[Figure: positive and negative coefficient paths for the lasso and for forward stagewise, plotted against the standardized $L_1$ arc length.]

Degrees of Freedom of the Lasso

The df, or effective number of parameters, give us an indication of how much fitting we have done.

Stein's Lemma: if the $y_i$ are independent $N(\mu_i, \sigma^2)$, then
$$ \mathrm{df}(\hat\mu) \;\overset{\mathrm{def}}{=}\; \sum_{i=1}^n \mathrm{cov}(\hat\mu_i, y_i)/\sigma^2 \;=\; E\Big[\sum_{i=1}^n \frac{\partial \hat\mu_i}{\partial y_i}\Big]. $$

Degrees-of-freedom formula for LAR: after $k$ steps, $\mathrm{df}(\hat\mu_k) = k$ exactly (amazing! with some regularity conditions).

Degrees-of-freedom formula for the lasso: let $\widehat{\mathrm{df}}(\hat\mu_\lambda)$ be the number of non-zero elements in $\hat\beta_\lambda$. Then $E[\widehat{\mathrm{df}}(\hat\mu_\lambda)] = \mathrm{df}(\hat\mu_\lambda)$.
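The $\mathrm{df}(\hat\mu_k) = k$ claim for LAR can be checked by simulation using Stein's covariance definition above. This is only an illustrative sketch: the design, true coefficients, noise level and number of replications are arbitrary choices, and scikit-learn's lars_path stands in for the lars package.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, sigma = 60, 8, 1.0
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)            # centered, standardized predictors
beta_true = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
mu = X @ beta_true

n_sim = 2000
fits = np.zeros((n_sim, p + 1, n))        # fitted vectors after k = 0, ..., p LAR steps
ys = np.zeros((n_sim, n))
for s in range(n_sim):
    y = mu + sigma * rng.standard_normal(n)
    _, _, coefs = lars_path(X, y, method="lar")   # coefs: (p, p+1), column k = step k
    fits[s] = (X @ coefs).T
    ys[s] = y

# Stein: df_k = sum_i cov(muhat_i^(k), y_i) / sigma^2, estimated over simulations.
yc = ys - ys.mean(0)
fc = fits - fits.mean(0)
df_hat = np.einsum("ski,si->k", fc, yc) / (n_sim - 1) / sigma**2
print(np.round(df_hat, 2))   # should be close to 0, 1, 2, ..., p
```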

Back to Boosting

[Figure: the LAR coefficient profiles from earlier, repeated, with the degrees of freedom labeled along the top.]

Work with Rosset and Zhu (JMLR 2004) extends the connections between forward stagewise and $L_1$-penalized fitting to other loss functions, in particular the exponential loss of Adaboost and the binomial loss of Logitboost.

In the separable case, $L_1$-regularized fitting with these losses converges to an $L_1$-margin-maximizing solution (margin defined via $\|\beta\|_1$) as the penalty disappears. That is, if
$$ \hat\beta(t) = \arg\min_{\|\beta\|_1 \le t} L(y, f), $$
then
$$ \lim_{t\to\infty} \frac{\hat\beta(t)}{\|\hat\beta(t)\|_1} \to \beta^*, $$
and $\min_i y_i F(x_i) = \min_i y_i x_i^T\beta^*$, the $L_1$ margin, is maximized.

- When the monotone lasso is used in the expanded feature space, the connection with boosting (with shrinkage) is more precise.
- This ties in very nicely with the $L_1$-margin explanation of boosting (Schapire, Freund, Bartlett and Lee, 1998), makes connections between SVMs and boosting, and makes explicit the margin-maximizing properties of boosting.
- Experience from statistics suggests that some $\hat\beta(t)$ along the path might perform better, a.k.a. stopping early.
- Zhao and Yu incorporate backward corrections with forward stagewise, and produce a boosting algorithm that mimics the lasso.

Maximum Margin and Overfitting

Mixture data from ESL. Boosting with small trees (gbm package in R, with shrinkage), Adaboost loss.

[Figure: left, the margin as a function of the number of trees; right, the test error as a function of the number of trees.]
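A hedged sketch of the margin story: on linearly separable toy data, fit $L_1$-regularized logistic regression with progressively weaker penalties and track the normalized $L_1$ margin $\min_i y_i x_i^T\hat\beta / \|\hat\beta\|_1$, which should grow toward its maximum as the penalty disappears. scikit-learn's LogisticRegression is used here rather than a path algorithm, and the two-Gaussian data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100
# Two linearly separable Gaussian clouds, labels in {-1, +1}.
X = np.vstack([rng.standard_normal((n, 2)) + [3, 3],
               rng.standard_normal((n, 2)) - [3, 3]])
y = np.r_[np.ones(n), -np.ones(n)]

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                             fit_intercept=False, max_iter=5000).fit(X, y)
    beta = clf.coef_.ravel()
    l1 = np.abs(beta).sum()
    margin = np.min(y * (X @ beta)) / l1 if l1 > 0 else float("nan")
    # As C grows (i.e. the L1 penalty weakens), the normalized L1 margin
    # should increase toward its maximum.
    print(f"C={C:7.2f}  ||beta||_1={l1:8.3f}  L1 margin={margin:.3f}")
```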

Lasso or Forward Stagewise?

Micro-array example (Golub leukemia data); binary response, ALL vs. AML. The lasso behaves chaotically near the end of the path, while forward stagewise is smooth and stable.

[Figure: lasso and forward stagewise coefficient paths for the micro-array example, plotted against $|\beta|/\max|\beta|$.]

Other Path Algorithms

- Elasticnet (Zou and Hastie, 2005). A compromise between lasso and ridge: minimize $\sum_i (y_i - \sum_j x_{ij}\beta_j)^2$ subject to $\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2 \le t$. Useful for situations where variables operate in correlated groups (genes in pathways).
- Glmpath (Park and Hastie). Approximates the $L_1$ regularization path for generalized linear models, e.g. logistic regression and Poisson regression.
- Friedman and Popescu created Pathseeker. It uses an efficient incremental forward-stagewise algorithm with a variety of loss functions. A generalization adjusts the leading $k$ coefficients at each step; $k = 1$ corresponds to forward stagewise, $k = p$ to gradient descent.
- Rosset and Zhu discuss the conditions needed to obtain piecewise-linear paths. A combination of a piecewise quadratic/linear loss function and an $L_1$ penalty is sufficient.
- Mee-Young Park is finishing a Cosso path algorithm. Cosso (Lin and Zhang) fits models of the form $\min L(\beta) + \lambda\sum_{k=1}^K \|\beta_k\|$, where $\|\cdot\|$ is the $L_2$ norm (not squared) and $\beta_k$ represents a subset of the coefficients.
- Bach and Jordan have path algorithms for kernel estimation, and for efficient ROC curve estimation. The latter is a useful generalization of the Svmpath algorithm discussed later.

The elasticnet Package (Hui Zou)

Minimize $\sum_i (y_i - \sum_j x_{ij}\beta_j)^2$ subject to $\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2 \le t$.

- The mixed penalty selects correlated sets of variables in groups.
- For fixed $\alpha$, the LARS algorithm, along with a standard ridge regression trick, lets us compute the entire regularization path.

[Figure: standardized coefficient paths for the lasso and for the elastic net (at a fixed ridge penalty $\lambda$) on the same data, plotted against $s = |\beta|/\max|\beta|$.]
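A small illustration of the grouping effect, using scikit-learn's lasso_path and enet_path (which work with the Lagrangian form of the penalty rather than the constrained form on the slides); the correlated-group data are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import enet_path, lasso_path

rng = np.random.default_rng(3)
n = 100
# Three highly correlated predictors forming a "group", plus noise predictors.
z = rng.standard_normal(n)
group = [z + 0.05 * rng.standard_normal(n) for _ in range(3)]
noise = [rng.standard_normal(n) for _ in range(5)]
X = np.column_stack(group + noise)
y = 3 * z + rng.standard_normal(n)

# Lagrangian-form paths; l1_ratio=1 is the pure lasso, l1_ratio=0.5 mixes in ridge.
alphas_l, coefs_l, _ = lasso_path(X, y)
alphas_e, coefs_e, _ = enet_path(X, y, l1_ratio=0.5)

# Inspect coefficients part-way along each path: the lasso tends to pick one
# member of the correlated group, while the elastic net tends to share the
# coefficient across the whole group.
idx_l, idx_e = len(alphas_l) // 4, len(alphas_e) // 4
print("lasso, group coefficients:      ", np.round(coefs_l[:3, idx_l], 2))
print("elastic net, group coefficients:", np.round(coefs_e[:3, idx_e], 2))
```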

The glmpath Package

Fits the $L_1$-regularized GLM path: maximize $\ell(\beta)$ subject to $\|\beta\|_1 \le t$.

- Predictor-corrector methods from convex optimization are used.
- Computes the exact path at a sequence of index points $t$.
- Can approximate the junctions (in $t$) where the active set changes.
- coxpath is included in the package.

[Figure: glmpath coefficient paths for a logistic regression example, plotted against $\lambda$, with the active variables labeled.]

Path Algorithms for the SVM

- The two-class SVM classifier $f(x) = \alpha_0 + \sum_{i=1}^N \alpha_i K(x, x_i) y_i$ can be seen to have a quadratic penalty and a piecewise-linear loss.
- As the cost parameter $C$ is varied, the Lagrange multipliers $\alpha_i$ change piecewise-linearly.
- This allows the entire regularization path to be traced exactly. The active set is determined by the points exactly on the margin.

[Figure: snapshots of the SVM path, for a small separated two-class example and for the mixture data with a radial kernel; each panel reports the step number, the error, the elbow size (points exactly on the margin) and the loss.]

SVM as a Regularization Method

With $f(x) = x^T\beta + \beta_0$ and $y_i \in \{-1, 1\}$, consider
$$ \min_{\beta_0,\beta}\; \sum_{i=1}^N \big[1 - y_i f(x_i)\big]_+ + \lambda\|\beta\|^2. $$
This hinge-loss criterion is equivalent to the SVM, with $\lambda$ monotone in $B$. Compare with
$$ \min_{\beta_0,\beta}\; \sum_{i=1}^N \log\big(1 + e^{-y_i f(x_i)}\big) + \lambda\|\beta\|^2. $$
This is the binomial deviance loss, and the solution is ridged linear logistic regression.

[Figure: the binomial log-likelihood and support-vector (hinge) losses plotted against the margin $y f(x)$.]

The Need for Regularization

[Figure: test error curves for the SVM with a radial kernel, for several values of $\gamma$, plotted against $C = 1/\lambda$ on a log scale.]

- $\gamma$ is a kernel parameter: $K(x, z) = \exp(-\gamma\|x - z\|^2)$.
- $\lambda$ (or $C$) is a regularization parameter, which has to be determined using some means such as cross-validation.
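Echoing the need-for-regularization figure, the sketch below computes test-error curves for an RBF-kernel SVM over a grid of $C$ for a few values of $\gamma$, using scikit-learn's SVC; the overlapping Gaussian data are a stand-in for the ESL mixture data, and the grids of $C$ and $\gamma$ are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 400
# Overlapping Gaussian classes as a stand-in for the ESL mixture data.
X = np.vstack([rng.standard_normal((n, 2)) + [1, 1],
               rng.standard_normal((n, 2)) - [1, 1]])
y = np.r_[np.ones(n), -np.ones(n)]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Test-error curves over C = 1/lambda for a few kernel widths gamma.
for gamma in [0.1, 0.5, 1.0, 5.0]:
    errs = []
    for C in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
        errs.append(1 - clf.score(X_te, y_te))
    print(f"gamma={gamma:4.1f}  test errors over C: {np.round(errs, 3)}")
```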

Concluding Comments

- Using logistic regression with the binomial loss, or Adaboost with the exponential loss, and the same quadratic penalty as the SVM, we get the same limiting margin as the SVM (Rosset, Zhu and Hastie, JMLR 2004).
- Alternatively, using the hinge loss of the SVM and an $L_1$ penalty (rather than a quadratic one), we get a lasso version of the SVM, with at most $N$ variables in the solution for any value of the penalty.
- Boosting fits a monotone $L_1$ regularization path toward a maximum-margin classifier.
- Many modern function estimation techniques create a path of solutions via regularization. In many cases these paths can be computed efficiently and entirely. This facilitates the important step of model selection: selecting a desirable position along the path using a test sample or by cross-validation.