Pathwise coordinate optimization


Stanford University 1 Pathwise coordinate optimization. Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani, Stanford University. Acknowledgements: thanks to Stephen Boyd, Michael Saunders, Guenther Walther. Email: tibs@stat.stanford.edu, http://www-stat.stanford.edu/~tibs

Stanford University 2 Jerome Friedman Trevor Hastie

Stanford University 3 From MyHeritage.Com

Stanford University 4 Jerome Friedman Trevor Hastie

Stanford University 5 Today's topic: convex optimization for two problems.

Lasso:
$$\min_\beta \; \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j|$$
(assume the features are standardized and the $y_i$ have mean zero; Tibshirani, 1996; also Basis Pursuit, Chen, Donoho and Saunders 1997)

Fused lasso:
$$\min_\beta \; \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=2}^p |\beta_j - \beta_{j-1}|$$
(Tibshirani, Saunders, Rosset, Zhu and Knight, 2004)

and other related problems.

Stanford University 6 Approaches. Many good methods are available, including LARS (a homotopy algorithm) and sophisticated convex optimization procedures. Today's topic: an extremely simple-minded approach, coordinate descent: minimize over one parameter at a time, keeping all others fixed. It works for the lasso and many other related methods (elastic net, grouped lasso). Coordinate-wise descent seems to have been virtually ignored by statistics and CS researchers. It doesn't work for the fused lasso, but a modified version does work.

Stanford University 7 A brief history of coordinate descent for the lasso. 1997: Tibshirani's student Wenjiang Fu at the University of Toronto develops the "shooting algorithm" for the lasso; Tibshirani doesn't fully appreciate it. 2002: Ingrid Daubechies gives a talk at Stanford and describes a one-at-a-time algorithm for the lasso. Hastie implements it, makes an error, and Hastie and Tibshirani conclude that the method doesn't work. 2006: Friedman is the external examiner at the PhD oral of Anita van der Kooij (Leiden), who uses the coordinate descent idea for the elastic net. Friedman wonders whether it works for the lasso. Friedman, Hastie and Tibshirani start working on this problem.

Stanford University 8 Outline:
- pathwise coordinate descent for the lasso
- application to other separable models
- one-dimensional fused lasso for signals
- two-dimensional fused lasso for images

Stanford University 9 Pathwise coordinate descent for the lasso. For a single predictor, the solution is given by soft-thresholding of the least squares estimate: $\mathrm{sign}(\hat\beta)(|\hat\beta| - \gamma)_+$. The same holds for multiple orthogonal predictors. For general multiple regression, the coordinate descent update is soft-thresholding of a partial residual:
$$\tilde\beta_j(\lambda) \leftarrow S\Big(\sum_{i=1}^n x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\; \lambda\Big), \qquad j = 1, 2, \ldots, p, 1, 2, \ldots \tag{1}$$
where $S(t, \lambda) = \mathrm{sign}(t)(|t| - \lambda)_+$ and $\tilde y_i^{(j)} = \sum_{k \neq j} x_{ik}\tilde\beta_k(\lambda)$. Start with a large value of λ and run the procedure until convergence; then decrease λ, using the previous solution as a warm start. Key point: the quantities in the above equation can be quickly updated as j cycles over the predictors.
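For concreteness, here is a minimal Python/NumPy sketch of update (1) with warm starts over a decreasing grid of λ values. It is illustrative only, not the talk's Fortran routine: the function names are made up, and the division by $x_j^\top x_j$ reduces to the slide's form when each predictor has unit norm.

```python
import numpy as np

def soft_threshold(t, lam):
    """S(t, lambda) = sign(t) * (|t| - lambda)_+"""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, beta=None, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    r = y - X @ beta                       # full residual, updated incrementally
    for _ in range(n_sweeps):
        for j in range(p):
            r = r + X[:, j] * beta[j]      # partial residual y - y_tilde^(j)
            beta[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
            r = r - X[:, j] * beta[j]      # back to the full residual
    return beta

def lasso_path(X, y, lams, n_sweeps=100):
    """Solve on a decreasing lambda grid, warm-starting from the previous fit."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lams, reverse=True):
        beta = lasso_coordinate_descent(X, y, lam, beta, n_sweeps)
        path.append((lam, beta.copy()))
    return path
```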

Stanford University 10 The basic coordinate descent code for the lasso is just 73 lines of Fortran! We also iterate on the active set of predictors to speed things up. The glmnet package is the R implementation.

Stanford University 11 Movie of Pathwise Coordinate algorithm [show movie]

Stanford University 12 Comparison to the LARS algorithm. LARS (Efron, Hastie, Johnstone, Tibshirani 2002) gives the exact solution path and is faster than general-purpose algorithms for this problem; it is closely related to the homotopy procedure (Osborne, Presnell, Turlach 2000). Pathwise coordinate optimization gives solutions on a grid of λ values. Speed comparisons: pathwise coordinate optimization (Fortran + R calling function), LARS-R (mostly R, with matrix operations in Fortran), LARS-Fort (Fortran with R calling function), lasso2 (C + R calling function). Simulated data with exponentially decreasing coefficients.

Stanford University 13

n = 100, p = 1000
Population correlation between features:
Method        0        0.1      0.2      0.5      0.9      0.95
coord-fort    0.31     0.33     0.40     0.57     1.20     1.45
lars-r        2.18     2.46     2.14     2.45     2.37     2.10
lars-fort     2.01     2.09     2.12     1.95     2.50     2.22
lasso2-c      2.42     2.16     2.39     2.18     2.01     2.71

n = 100, p = 20,000
Method        0        0.1      0.2      0.5      0.9      0.95
coord-fort    7.03     9.34     8.83     10.62    27.46    40.37
lars-r        116.26   122.39   121.48   104.17   100.30   107.29
lars-fort     would not run
lasso2-c      would not run

Stanford University 14

n = 1000, p = 100
Population correlation between features:
Method        0        0.1      0.2      0.5      0.9      0.95
coord-fort    0.03     0.04     0.04     0.04     0.06     0.08
lars-r        0.42     0.41     0.40     0.40     0.40     0.40
lars-fort     0.30     0.24     0.22     0.23     0.23     0.28
lasso2-c      0.73     0.66     0.69     0.68     0.69     0.70

n = 5000, p = 100
Method        0        0.1      0.2      0.5      0.9      0.95
coord-fort    0.16     0.15     0.14     0.16     0.15     0.16
lars-r        1.02     1.03     1.02     1.04     1.02     1.03
lars-fort     1.07     1.09     1.10     1.10     1.10     1.08
lasso2-c      2.91     2.90     3.00     2.95     2.95     2.92

Stanford University 15 Extension to related models. Coordinate descent works whenever the penalty is separable: a sum of functions of each parameter (see next slide). Examples: elastic net (Zou and Hastie), garotte (Breiman), grouped lasso (Yuan and Lin), Berhu (Owen), least absolute deviation (LAD) regression, LAD-lasso. Also, logistic regression with an L1 penalty: use the same approach as in the lasso, with iteratively reweighted least squares. See next slide.

Stanford University 16 Logistic regression. The outcome is Y = 0 or 1, and the logistic regression model is
$$\log\frac{\Pr(Y=1)}{1 - \Pr(Y=1)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$$
The criterion is the binomial log-likelihood plus an absolute-value (L1) penalty. Example: sparse data with N = 50,000 and p = 700,000. A state-of-the-art interior point algorithm (Stephen Boyd, Stanford), exploiting sparsity of the features: 3.5 hours for 100 values along the path. Pathwise coordinate descent (glmnet): 1 minute. 75 vs. 7,500 lines of code.
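As a rough illustration of the IRLS-plus-coordinate-descent idea (a toy version, not the glmnet implementation), the sketch below alternates a weighted quadratic approximation of the binomial log-likelihood with the lasso coordinate update; the intercept is omitted and the inner loop naively recomputes the fit, and all names are assumptions.

```python
import numpy as np

def soft_threshold(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def logistic_lasso(X, y, lam, n_outer=25, n_inner=50):
    """L1-penalized logistic regression by IRLS with inner coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):
        eta = X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-5, None)   # IRLS weights
        z = eta + (y - prob) / w                       # working response
        # weighted lasso on (X, z, w) via coordinate descent
        for _ in range(n_inner):
            for j in range(p):
                r = z - X @ beta + X[:, j] * beta[j]   # partial residual
                rho = np.sum(w * X[:, j] * r)
                beta[j] = soft_threshold(rho, lam) / np.sum(w * X[:, j] ** 2)
        # a real implementation would track convergence, fit an intercept,
        # and update residuals incrementally as in the lasso sketch above
    return beta
```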

Stanford University 17 Recent generalizations of the lasso.

The elastic net (Zou, Hastie 2005):
$$\min_\beta \; \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2/2 \tag{2}$$

Grouped lasso (Yuan and Lin 2005). Here the variables occur in groups (such as dummy variables for multi-level factors). Suppose $X_j$ is an $N \times p_j$ orthonormal matrix that represents the jth group of $p_j$ variables, $j = 1, \ldots, m$, and $\beta_j$ is the corresponding coefficient vector. The grouped lasso solves
$$\min_\beta \; \Big\|y - \sum_{j=1}^m X_j\beta_j\Big\|_2^2 + \sum_{j=1}^m \lambda_j \|\beta_j\|_2, \tag{3}$$
where $\lambda_j = \lambda\sqrt{p_j}$.

Coordinate descent can be applied in both models!
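Hedged sketches of the resulting single-coordinate update for the elastic net and the blockwise update for the grouped lasso with orthonormal $X_j$, written for a ½-scaled squared-error loss; the partial-residual argument and function names are illustrative assumptions, not from the talk or any library API.

```python
import numpy as np

def soft_threshold(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def elastic_net_update(xj, partial_residual, lam1, lam2):
    """Minimize 0.5*||r - xj*b||^2 + lam1*|b| + lam2*b^2/2 over the scalar b."""
    rho = xj @ partial_residual
    return soft_threshold(rho, lam1) / (xj @ xj + lam2)

def group_lasso_update(Xj, partial_residual, lam_j):
    """Minimize 0.5*||r - Xj b||^2 + lam_j*||b||_2 when Xj has orthonormal
    columns: b = (1 - lam_j / ||Xj^T r||_2)_+ * Xj^T r."""
    s = Xj.T @ partial_residual
    norm = np.linalg.norm(s)
    if norm == 0.0:
        return np.zeros_like(s)
    return max(0.0, 1.0 - lam_j / norm) * s
```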

Stanford University 18 Fused lasso, general form:
$$\min_\beta \; \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=2}^p |\beta_j - \beta_{j-1}|$$
We will focus on the diagonal fused lasso,
$$\min_\beta \; \frac{1}{2}\sum_{i=1}^n (y_i - \beta_i)^2 + \lambda_1 \sum_{i=1}^n |\beta_i| + \lambda_2 \sum_{i=2}^n |\beta_i - \beta_{i-1}|,$$
which is useful for approximating signals and images.

Stanford University 19 Comparative Genomic Hybridization (CGH) data

Stanford University 20 [Figure: two panels of log2 ratio versus genome order (0 to 1000) for the CGH data.]

Stanford University 21 [Figure: ROC curves (TPR versus FPR) comparing Fused.Lasso, CGHseg, CLAC, and CBS at window sizes 5, 10, 20, and 40.]

Stanford University 22 Surprise! The criterion is strictly convex, but coordinate descent doesn't work! The criterion is not differentiable (the fusion penalty is not separable), so the algorithm can get stuck.

Stanford University 23 When does coordinate descent work? Paul Tseng (1988, 2001): if
$$f(\beta_1, \ldots, \beta_p) = g(\beta_1, \ldots, \beta_p) + \sum_j h_j(\beta_j),$$
where $g(\cdot)$ is convex and differentiable and each $h_j(\cdot)$ is convex, then coordinate descent converges to a minimizer of $f$.

Stanford University 24 What goes wrong: we can have a situation where no move of an individual parameter helps, but setting two (or more) adjacent parameters equal, and moving their common value, does help.

Stanford University 25 The problem

Stanford University 26 [Figure: f(β) plotted against β63 alone (no improvement), against β64 alone (no improvement), and against their common value (improvement).]

Stanford University 27 Our solution: try moving not only individual parameters, but also fusing together pairs and varying their common value. What if no fusion of a pair helps, but moving three or more together helps? Solution: don't let this happen. Fix λ1, start with λ2 = 0, and solve the problem for incremental values of λ2.

Stanford University 28 Strategy for fused lasso. [Figure: path through the (λ1, λ2) plane.]

Stanford University 29 Theorem: any two non-zero adjacent parameters that are fused at the value λ2 are also fused for every λ2' > λ2. This seems obvious, but is surprisingly difficult to prove. Thanks to Stephen Boyd for showing us the subgradient representation.

Stanford University 30 Sub-gradient representations. Necessary and sufficient conditions for a solution (see Bertsekas, 1999).

Lasso:
$$-X^T(y - X\beta) + \lambda s = 0,$$
where $s_j \in \partial|\beta_j|$, that is, $s_j = \mathrm{sign}(\beta_j)$ if $\beta_j \neq 0$, and $s_j \in [-1, 1]$ otherwise.

Fused lasso (diagonal form):
$$-(y_1 - \beta_1) + \lambda_1 s_1 - \lambda_2 t_2 = 0$$
$$-(y_j - \beta_j) + \lambda_1 s_j + \lambda_2 (t_j - t_{j+1}) = 0, \qquad j = 2, \ldots, n \tag{4}$$
with $s_j = \mathrm{sign}(\beta_j)$ if $\beta_j \neq 0$ and $s_j \in [-1, 1]$ if $\beta_j = 0$. Similarly, $t_j = \mathrm{sign}(\beta_j - \beta_{j-1})$ if $\beta_j \neq \beta_{j-1}$ and $t_j \in [-1, 1]$ if $\beta_j = \beta_{j-1}$ (with $t_{n+1} = 0$).
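A tiny sketch (illustrative; the function name and tolerances are made up) of numerically checking the lasso subgradient conditions above for a candidate solution:

```python
import numpy as np

def check_lasso_kkt(X, y, beta, lam, tol=1e-6, slack=1e-4):
    """Check -X^T(y - X beta) + lam*s = 0 for some valid subgradient s."""
    g = X.T @ (y - X @ beta)                 # equals lam * s at a solution
    active = np.abs(beta) > tol
    ok_active = np.allclose(g[active], lam * np.sign(beta[active]), atol=slack)
    ok_inactive = np.all(np.abs(g[~active]) <= lam + slack)
    return bool(ok_active and ok_inactive)
```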

Stanford University 31 Algorithm for fused lasso:
- Fix λ1 and start λ2 at zero.
- For each parameter, try coordinate descent. (This is fast: the function is piecewise linear.)
- If coordinate descent doesn't help, try fusing the parameter with the parameter to its left and moving their common value.
- When the algorithm has converged for the current value of λ2, fuse together equal adjacent non-zero values permanently; collapse the data and form a weighted problem.
- Increment λ2 and repeat.
A rough sketch of the descend-then-fuse move follows below.
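The sketch below is a deliberately crude illustration of the descend-then-fuse move for the diagonal fused lasso: it evaluates the full objective and minimizes the one-dimensional subproblems by grid search rather than the fast piecewise-linear updates and weighted collapsing of the actual algorithm. All names and the search range are assumptions.

```python
import numpy as np

def fused_objective(beta, y, lam1, lam2):
    """Diagonal fused lasso criterion."""
    return (0.5 * np.sum((y - beta) ** 2)
            + lam1 * np.sum(np.abs(beta))
            + lam2 * np.sum(np.abs(np.diff(beta))))

def best_scalar(f, lo=-10.0, hi=10.0, num=4001):
    """Crude 1-D minimization by grid search (stand-in for the exact update)."""
    grid = np.linspace(lo, hi, num)
    vals = np.array([f(c) for c in grid])
    return grid[int(np.argmin(vals))]

def descend_or_fuse(beta, y, lam1, lam2, j):
    """Try a single-coordinate move at j; if it does not improve the objective,
    try fusing beta_{j-1} and beta_j and moving their common value."""
    base = fused_objective(beta, y, lam1, lam2)
    trial = beta.copy()
    trial[j] = best_scalar(lambda c: fused_objective(
        np.concatenate([beta[:j], [c], beta[j + 1:]]), y, lam1, lam2))
    if fused_objective(trial, y, lam1, lam2) < base - 1e-12:
        return trial
    if j >= 1:
        common = best_scalar(lambda c: fused_objective(
            np.concatenate([beta[:j - 1], [c, c], beta[j + 1:]]), y, lam1, lam2))
        fused = beta.copy()
        fused[j - 1] = fused[j] = common
        if fused_objective(fused, y, lam1, lam2) < base - 1e-12:
            return fused
    return beta
```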

Stanford University 32 Movie of fused lasso algorithm [Show movie of 1D fused lasso]

Stanford University 33 Speed comparisons: generalized pathwise coordinate optimization versus the two-phase active-set algorithm sqopt of Gill, Murray, and Saunders.

Stanford University 34

p = 1000
λ1      λ2      # Active   Path Coord   Standard
0.01    0.01    450        0.039        1.933
0.01    2.00    948        0.017        0.989
1.00    0.01    824        0.022        1.519
1.00    2.00    981        0.023        1.404
2.00    0.01    861        0.023        1.499
2.00    2.00    991        0.018        1.407

p = 5000
λ1      λ2      # Active   Path Coord   Standard
0.002   0.002   4044       0.219        19.157
0.002   0.400   3855       0.133        27.030
0.200   0.002   4305       0.150        41.105
0.200   0.400   4701       0.129        45.136
0.400   0.002   4301       0.108        41.062
0.400   0.400   4722       0.119        38.896

Stanford University 35 Two-dimensional fused lasso. The problem has the form
$$\min_\beta \; \frac{1}{2}\sum_{i=1}^p\sum_{i'=1}^p (y_{ii'} - \beta_{ii'})^2$$
subject to
$$\sum_{i=1}^p\sum_{i'=1}^p |\beta_{ii'}| \le s_1, \quad \sum_{i=1}^p\sum_{i'=2}^p |\beta_{i,i'} - \beta_{i,i'-1}| \le s_2, \quad \sum_{i=2}^p\sum_{i'=1}^p |\beta_{i,i'} - \beta_{i-1,i'}| \le s_3.$$
We consider fusions of pixels one unit away in the horizontal or vertical directions. After a number of fusions, fused regions can become very irregular in shape. This required some impressive programming by Friedman. A small sketch of the penalized form of the objective appears below.
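A minimal sketch of the penalized (Lagrangian) counterpart of the constrained problem above, with separate multipliers for the two fusion directions; the correspondence between (λ1, λ2, λ3) and (s1, s2, s3) is assumed, and the function name is made up.

```python
import numpy as np

def fused_lasso_2d_objective(beta, y, lam1, lam2, lam3):
    """Lagrangian form of the 2-D fused lasso for an image beta (same shape as y)."""
    fit = 0.5 * np.sum((y - beta) ** 2)
    sparsity = lam1 * np.sum(np.abs(beta))
    horiz = lam2 * np.sum(np.abs(np.diff(beta, axis=1)))  # neighbors within a row
    vert = lam3 * np.sum(np.abs(np.diff(beta, axis=0)))   # neighbors within a column
    return fit + sparsity + horiz + vert
```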

Stanford University 36 The algorithm is not exact: fused groups can break up as λ2 increases.

Stanford University 37 Examples. Two-fold validation (even and odd pixels) was used to estimate λ1 and λ2.

Stanford University 38 [Figure: three panels titled Data, Lasso, and Fused Lasso.]

Stanford University 39

Stanford University 40 [Figure: original image and its fused lasso reconstruction; noisy image and its fused lasso reconstruction.]

Stanford University 41 [Figure: original image, noisy image, lasso denoising, fusion denoising.]

Stanford University 42 movie [Show movie of 2D fused lasso]

Stanford University 43 Discussion. Pathwise coordinate optimization is simple but extremely useful. We are thinking about and working on further applications of pathwise coordinate optimization: generalizations of the fused lasso to higher (> 2) dimensions, and applications to big, sparse problems. Hoefling has derived an exact-path (LARS-like) algorithm for the fused lasso, using maximal-flow ideas.