Fast Regularization Paths via Coordinate Descent

Transcription:

Slide 1: Fast Regularization Paths via Coordinate Descent. Trevor Hastie, Stanford University; joint work with Jerome Friedman and Rob Tibshirani. (useR! 2009)

Slide 2: Linear Models in Data Mining. As datasets grow wide, i.e. many more features than samples, the linear model has regained favor in the data miner's toolbox. Document classification: bag-of-words can lead to p = 20K features and N = 5K document samples. Image deblurring and classification: p = 65K pixels are features, N = 100 samples. Genomics, microarray studies: p = 40K genes are measured for each of N = 100 subjects. Genome-wide association studies: p = 500K SNPs measured for N = 2000 case-control subjects. In all of these we use linear models, e.g. linear regression or logistic regression. Since p ≫ N, we have to regularize.

Slide 3: February 2009. Additional chapters on wide data, random forests, graphical models and ensemble methods, plus new material on path algorithms, kernel methods and more.

Slide 4: Linear Regression via the Lasso (Tibshirani, 1995). Given observations $\{y_i, x_{i1}, \dots, x_{ip}\}_{i=1}^N$:

$$\min_\beta \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t$$

Similar to ridge regression, which has the constraint $\sum_j \beta_j^2 \le t$. The lasso does variable selection and shrinkage, while ridge only shrinks. [Figure: the lasso and ridge constraint regions in the $(\beta_1, \beta_2)$ plane, with the least-squares estimate $\hat\beta$.]

Slide 5: Brief History of ℓ1 Regularization. Wavelet soft thresholding (Donoho and Johnstone, 1994) in the orthonormal setting. Tibshirani introduces the lasso for regression in 1995. The same idea is used in Basis Pursuit (Chen, Donoho and Saunders, 1996). Extended to many linear-model settings, e.g. survival models (Tibshirani, 1997), logistic regression, and so on. Gives rise to a new field, compressed sensing (Donoho 2004; Candes and Tao 2005): near-exact recovery of sparse signals in very high dimensions. In many cases ℓ1 is a good surrogate for ℓ0.

Slide 6: Lasso Coefficient Path. [Figure: standardized coefficient profiles plotted against $\|\hat\beta(\lambda)\|_1 / \|\hat\beta(0)\|_1$.] Lasso: $\hat\beta(\lambda) = \arg\min_\beta \sum_{i=1}^N (y_i - \beta_0 - x_i^T\beta)^2 + \lambda\|\beta\|_1$.

Slide 7: History of Path Algorithms. Efficient path algorithms for $\hat\beta(\lambda)$ allow for easy and exact cross-validation and model selection. In 2001 the LARS algorithm (Efron et al.) provides a way to compute the entire lasso coefficient path efficiently, at the cost of a full least-squares fit. 2001 to present: path algorithms pop up for a wide variety of related problems: grouped lasso (Yuan & Lin 2006), support-vector machine (Hastie, Rosset, Tibshirani & Zhu 2004), elastic net (Zou & Hastie 2004), quantile regression (Li & Zhu 2007), logistic regression and GLMs (Park & Hastie 2007), Dantzig selector (James & Radchenko 2008), and more. Many of these do not enjoy the piecewise-linearity of LARS, and seize up on very large problems.

Slide 8: Coordinate Descent. Solve the lasso problem by coordinate descent: optimize each parameter separately, holding all the others fixed; the updates are trivial. Cycle around until the coefficients stabilize. Do this on a grid of λ values from λ_max down to λ_min (uniform on the log scale), using warm starts. This works with a variety of loss functions and additive penalties. Coordinate descent achieves dramatic speedups over all competitors, by factors of 10, 100 and more.

Slide 9: LARS and GLMNET. [Figure: coefficient profiles from LARS and glmnet plotted against the L1 norm of the coefficients.]

Slide 10: Speed Trials. Competitors: lars, as implemented in the R package, for squared-error loss. glmnet, a Fortran-based R package using coordinate descent (the topic of this talk); does squared-error and logistic (2- and K-class) loss. l1logreg, a lasso-logistic regression package by Koh, Kim and Boyd, using state-of-the-art interior point methods for convex optimization. BBR/BMR, Bayesian binomial/multinomial regression packages by Genkin, Lewis and Madigan; also use coordinate descent to compute the posterior mode with a Laplace prior (the lasso fit). Comparisons are based on simulations (next 3 slides) and real data (4th slide).
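
For orientation (this call sequence is not part of the slides), fitting a lasso path and cross-validating it with the glmnet R package looks roughly like the following sketch; the data here are simulated purely for illustration.

```r
# Illustrative use of the glmnet R package discussed in this talk.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # N = 100, p = 20
y <- rnorm(100)

fit <- glmnet(x, y)                     # lasso path over a grid of lambda values
plot(fit)                               # coefficient profiles along the path

cvfit <- cv.glmnet(x, y)                # ten-fold cross-validation over the path
coef(cvfit, s = "lambda.min")           # coefficients at the CV-chosen lambda
```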

Slide 11: Linear Regression, Dense Features. Timings (seconds) for the glmnet and lars algorithms for linear regression with the lasso penalty; total time for 100 λ values, averaged over 3 runs.

                        Average correlation between features
                        0       0.1     0.2     0.5     0.9     0.95
  N = 5000, p = 100
    glmnet              0.05    0.05    0.05    0.05    0.05    0.05
    lars                0.29    0.29    0.29    0.30    0.29    0.29
  N = 100, p = 50000
    glmnet              2.66    2.46    2.84    3.53    3.39    2.43
    lars                58.68   64.00   64.79   58.20   66.39   79.79

Slide 12: Logistic Regression, Dense Features. Timings (seconds) for logistic models with the lasso penalty; total time for ten-fold cross-validation over a grid of 100 λ values.

                        Average correlation between features
                        0        0.1      0.2      0.5      0.9      0.95
  N = 5000, p = 100
    glmnet              7.89     8.48     9.01     13.39    26.68    26.36
    l1lognet            239.88   232.00   229.62   229.49   223.19   223.09
  N = 100, p = 5000
    glmnet              5.24     4.43     5.12     7.05     7.87     6.05
    l1lognet            165.02   161.90   163.25   166.50   151.91   135.28

Slide 13: Logistic Regression, Sparse Features. Timings (seconds) for logistic models with the lasso penalty and sparse features (95% zeros in X); total time for ten-fold cross-validation over a grid of 100 λ values.

                        Average correlation between features
                        0        0.1      0.2      0.5      0.9      0.95
  N = 10,000, p = 100
    glmnet              3.21     3.02     2.95     3.25     4.58     5.08
    BBR                 11.80    11.64    11.58    13.30    12.46    11.83
    l1lognet            45.87    46.63    44.33    43.99    45.60    43.16
  N = 100, p = 10,000
    glmnet              10.18    10.35    9.93     10.04    9.02     8.91
    BBR                 45.72    47.50    47.46    48.49    56.29    60.21
    l1lognet            130.27   124.88   124.18   129.84   137.21   159.54

Slide 14: Logistic Regression, Real Datasets. Timings in seconds (unless stated otherwise). For Cancer, Leukemia and Internet ad, times are for ten-fold cross-validation over 100 λ values; for Newsgroup we performed a single run with 100 values of λ, with λ_min = 0.05 λ_max.

           Name          Type      N       p        glmnet    l1logreg   BBR/BMR
  Dense    Cancer        14-class  144     16,063   2.5 mins  NA         2.1 hrs
           Leukemia      2-class   72      3571     2.50      55.0       450
  Sparse   Internet ad   2-class   2359    1430     5.0       20.9       34.7
           Newsgroup     2-class   11,314  777,811  2 mins    3.5 hrs

Slides 15-17: A Brief History of Coordinate Descent for the Lasso. 1997: Tibshirani's student Wenjiang Fu at U. Toronto develops the "shooting algorithm" for the lasso; Tibshirani doesn't fully appreciate it. 2002: Ingrid Daubechies gives a talk at Stanford and describes a one-at-a-time algorithm for the lasso; Hastie implements it, makes an error, and Hastie and Tibshirani conclude that the method doesn't work. 2006: Friedman is the external examiner at the PhD oral of Anita van der Kooij (Leiden), who uses coordinate descent for the elastic net; Friedman, Hastie and Tibshirani revisit the problem. Others have too: Shevade and Keerthi (2003), Krishnapuram and Hartemink (2005), Genkin, Lewis and Madigan (2007), Wu and Lange (2008), Meier, van de Geer and Buehlmann (2008).

Slide 18: Coordinate Descent for the Lasso.

$$\min_\beta \frac{1}{2N}\sum_{i=1}^N \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|$$

Suppose the p predictors and the response are standardized to have mean zero and variance 1. Initialize all the β_j = 0. Cycle over j = 1, 2, ..., p, 1, 2, ... until convergence:
1. Compute the partial residuals $r_{ij} = y_i - \sum_{k \ne j} x_{ik}\beta_k$.
2. Compute the simple least-squares coefficient of these residuals on the jth predictor: $\tilde\beta_j = \frac{1}{N}\sum_{i=1}^N x_{ij} r_{ij}$.
3. Update β_j by soft-thresholding: $\beta_j \leftarrow S(\tilde\beta_j, \lambda) = \mathrm{sign}(\tilde\beta_j)(|\tilde\beta_j| - \lambda)_+$.
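
To make the update concrete, here is a minimal R sketch of the cyclic coordinate descent described on this slide. It is illustrative only, not the glmnet implementation: the function names (`soft_threshold`, `lasso_cd`) and arguments are my own, and it assumes the columns of X are standardized and y is centered so the intercept can be dropped.

```r
# Minimal sketch of cyclic coordinate descent for the lasso (illustrative).
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lasso_cd <- function(X, y, lambda, beta = rep(0, ncol(X)),
                     tol = 1e-7, max_iter = 1000) {
  N <- nrow(X)
  r <- drop(y - X %*% beta)              # current model residuals
  for (iter in 1:max_iter) {
    max_change <- 0
    for (j in seq_len(ncol(X))) {
      beta_old <- beta[j]
      # simple least-squares coefficient of the partial residuals on x_j
      beta_tilde <- sum(X[, j] * r) / N + beta_old
      beta[j] <- soft_threshold(beta_tilde, lambda)
      delta <- beta[j] - beta_old
      if (delta != 0) {
        r <- r - X[, j] * delta          # O(N) residual update
        max_change <- max(max_change, abs(delta))
      }
    }
    if (max_change < tol) break          # coefficients have stabilized
  }
  beta
}
```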

Slide 19: Why is Coordinate Descent so Fast? There are a number of tricks and strategies we use to exploit the structure of the problem.
Naive updates: $\tilde\beta_j = \frac{1}{N}\sum_{i=1}^N x_{ij} r_{ij} = \frac{1}{N}\sum_{i=1}^N x_{ij} r_i + \beta_j$, where $r_i$ is the current model residual; this costs O(N). Many coefficients are zero and stay zero. If a coefficient changes, the residuals are updated in O(N) computations.
Covariance updates: $\sum_{i=1}^N x_{ij} r_i = \langle x_j, y\rangle - \sum_{k:\,|\beta_k|>0} \langle x_j, x_k\rangle \beta_k$. The cross-covariance terms are computed once for active variables and stored (helps a lot when N ≫ p).
Sparse updates: if the data is sparse (many zeros), the inner products can be computed efficiently.
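
As a small illustration of the covariance update (my own sketch, not the package internals), the inner product $\langle x_j, r\rangle$ can be assembled from cached quantities instead of touching all N observations; `Xty` and `XtX_active` below are assumed to have been precomputed as variables entered the active set.

```r
# Covariance update sketch: <x_j, r> = <x_j, y> - sum over active k of
# <x_j, x_k> * beta_k, so only the active set is touched.
# Xty        <- drop(crossprod(X, y))        # p-vector, computed once
# XtX_active <- crossprod(X, X[, active])    # p x |active| matrix, cached
inner_prod_with_residual <- function(j, Xty, XtX_active, beta_active) {
  Xty[j] - sum(XtX_active[j, ] * beta_active)
}
```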

Slides 20-21: Active set convergence: after a cycle through the p variables, we can restrict further iterations to the active set until convergence, plus one more cycle through all p to check whether the active set has changed. Helps when p ≫ N. Warm starts: we fit a sequence of models from λ_max down to λ_min = ελ_max (on the log scale); λ_max is the smallest value of λ for which all coefficients are zero. Solutions don't change much from one λ to the next, and convergence is often faster for the entire sequence than for a single solution at a small value of λ. FFT: Friedman + Fortran + Tricks (no sloppy flops!).
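
A sketch of the pathwise strategy with warm starts follows; it reuses the hypothetical `lasso_cd` from the earlier sketch and is illustrative rather than the glmnet implementation (which also uses active-set iterations and other tricks).

```r
# Fit the whole path from lambda_max down to eps * lambda_max on a log scale,
# warm-starting each fit from the previous solution (illustrative sketch;
# assumes standardized X and centered y).
lasso_path <- function(X, y, nlambda = 100, eps = 0.001) {
  N <- nrow(X); p <- ncol(X)
  lambda_max <- max(abs(crossprod(X, y))) / N   # smallest lambda with all beta_j = 0
  lambdas <- exp(seq(log(lambda_max), log(eps * lambda_max), length.out = nlambda))
  betas <- matrix(0, p, nlambda)
  beta <- rep(0, p)
  for (k in seq_along(lambdas)) {
    beta <- lasso_cd(X, y, lambdas[k], beta = beta)  # warm start from previous lambda
    betas[, k] <- beta
  }
  list(lambda = lambdas, beta = betas)
}
```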

Slide 22: Binary Logistic Models. Newton updates: for binary logistic regression we have an outer Newton loop at each λ. This amounts to fitting a lasso with weighted squared-error loss, using weighted soft-thresholding. Multinomial: we use a symmetric formulation for multiclass logistic regression:

$$\Pr(G = \ell \mid x) = \frac{e^{\beta_{0\ell} + x^T\beta_\ell}}{\sum_{k=1}^K e^{\beta_{0k} + x^T\beta_k}}.$$

This creates an additional loop, as we cycle through the classes and compute the quadratic approximation to the multinomial log-likelihood, holding all but one class's parameters fixed. Details: there are many important but tedious details with logistic models, e.g. if p ≫ N, we cannot let λ run down to zero.
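
A rough sketch of one outer Newton (IRLS) pass for the binary case is below; the weighted coordinate update is the weighted-squared-error analogue of the lasso update, but the function name, the weight floor, and the omission of the intercept update are my own simplifications, not glmnet's inner loop.

```r
# One penalized weighted least-squares pass for the binary logistic lasso
# (illustrative). X standardized, y in {0, 1}; reuses soft_threshold() from
# the earlier sketch; the intercept update is omitted for brevity.
logistic_lasso_irls_pass <- function(X, y, beta0, beta, lambda) {
  N <- nrow(X)
  eta <- drop(beta0 + X %*% beta)
  p_hat <- 1 / (1 + exp(-eta))
  w <- pmax(p_hat * (1 - p_hat), 1e-5)          # IRLS weights (floored for stability)
  z <- eta + (y - p_hat) / w                    # working response
  for (j in seq_len(ncol(X))) {
    r_j <- z - eta + X[, j] * beta[j]           # partial residual for coordinate j
    num <- sum(w * X[, j] * r_j) / N
    denom <- sum(w * X[, j]^2) / N
    beta_new <- soft_threshold(num, lambda) / denom
    eta <- eta + X[, j] * (beta_new - beta[j])  # keep the linear predictor in sync
    beta[j] <- beta_new
  }
  beta
}
```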

Slide 23: Elastic-Net Penalty. Proposed in Zou and Hastie (2005) for p ≫ N situations where predictors are correlated in groups:

$$P_\alpha(\beta) = \sum_{j=1}^p \Big[\tfrac{1}{2}(1-\alpha)\beta_j^2 + \alpha|\beta_j|\Big].$$

α creates a compromise between the lasso and ridge. The coordinate update is now

$$\beta_j \leftarrow \frac{S(\tilde\beta_j, \lambda\alpha)}{1 + \lambda(1-\alpha)},$$

where $\tilde\beta_j = \frac{1}{N}\sum_{i=1}^N x_{ij} r_{ij}$ as before.
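
The elastic-net update on this slide translates directly into code; a one-line sketch, reusing the hypothetical `soft_threshold` from the earlier sketch:

```r
# Elastic-net coordinate update: soft-threshold at lambda * alpha, then shrink
# by the ridge factor 1 + lambda * (1 - alpha) (illustrative).
enet_update <- function(beta_tilde, lambda, alpha) {
  soft_threshold(beta_tilde, lambda * alpha) / (1 + lambda * (1 - alpha))
}
```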

Slide 24: [Figure: coefficient profiles over the first 10 steps for Lasso, Elastic Net (α = 0.4) and Ridge; Leukemia data, logistic regression, N = 72, p = 3571.]

Slide 25: Multiclass Classification. Microarray classification: 16,063 genes, 144 training samples, 54 test samples, 14 cancer classes. Multinomial regression model.

  Method                                  CV errors      Test errors   # of genes
                                          (out of 144)   (out of 54)   used
  1. Nearest shrunken centroids           35 (5)         17            6520
  2. L2-penalized discriminant analysis   25 (4.1)       12            16063
  3. Support vector classifier            26 (4.2)       14            16063
  4. Lasso regression (one vs all)        30.7 (1.8)     12.5          1429
  5. K-nearest neighbors                  41 (4.6)       26            16063
  6. L2-penalized multinomial             26 (4.2)       15            16063
  7. Lasso-penalized multinomial          17 (2.8)       13            269
  8. Elastic-net penalized multinomial    22 (3.7)       11.8          384

Methods 6-8 were fit using glmnet.

Slide 26: Summary. Many problems have the form

$$\min_{\{\beta_j\}_1^p} R(y, \beta) + \lambda\sum_{j=1}^p P_j(\beta_j).$$

If R and the P_j are convex, and R is differentiable, then coordinate descent converges to the solution (Tseng, 1988). Often each coordinate step is trivial: e.g. for the lasso it amounts to soft-thresholding, with many steps leaving $\hat\beta_j = 0$. Decreasing λ slowly means not much cycling is needed. Coordinate moves can exploit sparsity.

Slide 27: Other Applications: Undirected Graphical Models. Learning dependence structure via the lasso: model the inverse covariance Θ in the Gaussian family, with L1 penalties applied to its elements:

$$\max_\Theta \;\log\det\Theta - \mathrm{Tr}(S\Theta) - \lambda\|\Theta\|_1$$

A modified block-wise lasso algorithm, which we solve by coordinate descent (FHT 2007). The algorithm is very fast, and solves moderately sparse graphs with 1000 nodes in under a minute. Example: flow cytometry, with p = 11 proteins measured in N = 7466 cells (Sachs et al. 2003); see the next slide.
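
For reference (not from the slides), the graphical lasso can be called from R roughly as follows; the argument and component names are as I recall from the glasso package, so treat them as assumptions to verify against the package documentation.

```r
# Illustrative: sparse inverse covariance estimation with the glasso R package.
library(glasso)

set.seed(1)
X <- matrix(rnorm(200 * 10), 200, 10)   # N = 200 observations, p = 10 variables
S <- var(X)                             # sample covariance matrix

fit <- glasso(S, rho = 0.1)             # rho is the L1 penalty on Theta
Theta_hat <- fit$wi                     # estimated sparse inverse covariance
Sigma_hat <- fit$w                      # corresponding covariance estimate
```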

Slide 28: [Figure: estimated protein networks over the 11 proteins (Raf, Mek, Plcg, PIP2, PIP3, Erk, Akt, PKA, PKC, P38, Jnk) at four penalty values, λ = 36, 27, 7 and 0.]

Slide 29: Grouped Lasso (Yuan and Lin, 2007; Meier, van de Geer and Buehlmann, 2008). Each term $P_j(\beta_j)$ applies to a set of parameters:

$$\sum_{j=1}^J \|\beta_j\|_2.$$

Example: each block represents the levels of a categorical predictor. Leads to a block-updating form of coordinate descent.
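
As a sketch of the block update (my own simplification, assuming the within-group columns are orthonormal so the update has a closed form), the whole coefficient block is shrunk toward zero and set exactly to zero when its norm falls below the penalty:

```r
# Group-wise soft-thresholding for one block (illustrative sketch; assumes an
# orthonormal within-group design so this is the exact block update).
group_soft_threshold <- function(z, lambda) {
  nz <- sqrt(sum(z^2))
  if (nz <= lambda) rep(0, length(z)) else (1 - lambda / nz) * z
}
```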

Slide 30: CGH Modeling and the Fused Lasso. Here the penalty has the form

$$\sum_{j=1}^p |\beta_j| + \alpha\sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j|.$$

This is not additive, so a modified coordinate descent algorithm is required (FHT + Hoefling 2007). [Figure: CGH data, log2 ratio plotted against genome order.]

Slide 31: Matrix Completion. [Figure: two heatmaps of a movies-by-raters ratings matrix, one complete and one showing only the observed entries.] Example: the Netflix problem. We partially observe a matrix of movie ratings (rows) by a number of raters (columns). The goal is to predict the future ratings of these same individuals for movies they have not yet rated (or seen). We solve this problem by fitting an ℓ1-regularized SVD path to the observed data matrix (Mazumder, Hastie and Tibshirani, 2009).

Slide 32: ℓ1-Regularized SVD.

$$\min_{\hat X} \|P_\Omega(X) - P_\Omega(\hat X)\|_F^2 + \lambda\|\hat X\|_*$$

P_Ω is the projection onto the observed values (it sets unobserved entries to 0). $\|\hat X\|_*$ is the nuclear norm, the sum of the singular values. This is a convex optimization problem (Candes 2008), with solution given by a soft-thresholded SVD: the singular values are shrunk toward zero, and many are set to zero. Our algorithm iteratively soft-thresholds the SVD of

$$P_\Omega(X) + P_\Omega^\perp(\hat X) = \{P_\Omega(X) - P_\Omega(\hat X)\} + \hat X = \text{Sparse} + \text{Low-Rank}.$$

Using Lanczos techniques and warm starts, we can efficiently compute solution paths for very large matrices (50K × 50K).
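
A toy sketch of the iteration just described: fill in the missing entries with the current estimate, take the SVD, and soft-threshold the singular values. It uses a dense SVD on a small matrix with NA for missing entries; the real algorithm works with the sparse-plus-low-rank representation and Lanczos methods, so this is purely illustrative.

```r
# One soft-thresholded SVD step (illustrative). X has NA for unobserved entries.
soft_svd_step <- function(X, Xhat, lambda) {
  filled <- ifelse(is.na(X), Xhat, X)     # P_Omega(X) + P_Omega_perp(Xhat)
  s <- svd(filled)
  d <- pmax(s$d - lambda, 0)              # shrink singular values; many become zero
  s$u %*% (d * t(s$v))                    # reconstruct the low-rank estimate
}

# Toy usage: iterate from Xhat = 0 until the estimate stabilizes.
# Xhat <- matrix(0, nrow(X), ncol(X))
# for (it in 1:100) Xhat <- soft_svd_step(X, Xhat, lambda = 1)
```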

Slide 33: Summary. ℓ1 regularization (and its variants) has become a powerful tool with the advent of wide data. In compressed sensing, exact recovery of sparse signals is often possible with ℓ1 methods. Coordinate descent is the fastest known algorithm for solving these problems along a path of values of the tuning parameter.