The lasso: some novel algorithms and applications


1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak, Pei Wang, Daniela Witten Email:tibs@stat.stanford.edu http://www-stat.stanford.edu/~tibs

2 Tibs Lab Genevera Allen Holger Hoefling Gen Nowak Daniela Witten

3 Jerome Friedman Trevor Hastie

From MyHeritage.Com 4

5 Jerome Friedman Trevor Hastie

6 Outline New fast algorithm for the lasso: pathwise coordinate descent. Three examples of generalizations of the lasso: 1. Logistic/multinomial regression for classification, with a later example of cancer classification from microarray data 2. Undirected graphical models: the graphical lasso 3. Estimation of signals along chromosomes (Comparative Genomic Hybridization): the fused lasso

7 Linear regression via the Lasso (Tibshirani, 1995) Outcome variable $y_i$ for cases $i = 1, 2, \ldots, n$; features $x_{ij}$, $j = 1, 2, \ldots, p$. Minimize $\sum_{i=1}^{n} (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|$. Equivalent to minimizing the sum of squares with the constraint $\sum_j |\beta_j| \le s$. Similar to ridge regression, which has the constraint $\sum_j \beta_j^2 \le t$. Lasso does variable selection and shrinkage, while ridge only shrinks. See also Basis Pursuit (Chen, Donoho and Saunders, 1998).
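As a quick numerical illustration of the selection-versus-shrinkage contrast (my own simulated example, using the glmnet package that appears later in the talk; the penalty value s = 0.1 is arbitrary):

```r
# Lasso (alpha = 1) sets many coefficients exactly to zero;
# ridge (alpha = 0) only shrinks them towards zero.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - X[, 2] + rnorm(100)

b_lasso <- coef(glmnet(X, y, alpha = 1), s = 0.1)  # sparse: exact zeros appear
b_ridge <- coef(glmnet(X, y, alpha = 0), s = 0.1)  # all entries shrunken but nonzero
```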

8 Picture of Lasso and Ridge regression [Figure: the lasso and ridge constraint regions in the $(\beta_1, \beta_2)$ plane, with the least squares estimate $\hat\beta$ and the error contours]

9 Example: Prostate Cancer Data. $y_i$ = log(PSA), $x_{ij}$ = measurements on a man and his prostate. [Figure: lasso coefficient profiles for lcavol, svi, lweight, pgg45, lbph, gleason, age and lcp, plotted against the shrinkage factor s]

Example: Graphical lasso for cell signalling proteins 10

11 [Figure: estimated protein networks, one panel per bound, for L1-norm values ranging from 2.27182 down to 7e-05]

12 Fused lasso for CGH data [Figure: two panels of log2 ratio plotted against genome order (0 to 1000)]

13 Algorithms for the lasso: standard convex optimizer; least angle regression (LAR), Efron et al. 2004, which computes the entire path of solutions and was state of the art until 2008; pathwise coordinate descent (new).

14 Pathwise coordinate descent for the lasso. Coordinate descent: optimize one parameter (coordinate) at a time. How? Suppose we had only one predictor. The problem is to minimize $\sum_i (y_i - x_i\beta)^2 + \lambda|\beta|$. The solution is the soft-thresholded estimate $\mathrm{sign}(\hat\beta)(|\hat\beta| - \lambda)_+$, where $\hat\beta$ is the usual least squares estimate. Idea: with multiple predictors, cycle through each predictor in turn. We compute residuals $r_i = y_i - \sum_{k \neq j} x_{ik}\hat\beta_k$ and apply univariate soft-thresholding, pretending that our data is $(x_{ij}, r_i)$.

15 Turns out that this is coordinate descent for the lasso criterion $\sum_i (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|$: like skiing to the bottom of a hill, going north-south, east-west, north-south, etc. [Show movie] Too simple?!
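The soft-thresholding update above translates almost directly into code. Here is a minimal sketch in R (not the talk's 73-line Fortran routine), under the assumptions that the columns of X are centered and scaled so that $\sum_i x_{ij}^2 = 1$ and that the criterion carries a factor of 1/2 in front of the residual sum of squares, so the coordinate update is exactly a soft-threshold of the inner product of $x_j$ with the partial residual:

```r
# Cyclic coordinate descent for the lasso:
#   minimize (1/2) * sum_i (y_i - sum_j x_ij * beta_j)^2 + lambda * sum_j |beta_j|
# with each column of X scaled to unit sum of squares.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lasso_cd <- function(X, y, lambda, n_sweeps = 100, tol = 1e-8) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (sweep in seq_len(n_sweeps)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]         # partial residual, leaving out x_j
      beta[j] <- soft_threshold(sum(X[, j] * r), lambda)  # univariate soft-threshold update
    }
    if (max(abs(beta - beta_old)) < tol) break            # stop when a full sweep changes nothing
  }
  beta
}
```

Running this over a decreasing grid of lambda values, warm-starting each fit from the previous solution, gives the pathwise version described on the next slide.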

16 A brief history of coordinate descent for the lasso. 1997: Tibshirani's student Wenjiang Fu at the University of Toronto develops the shooting algorithm for the lasso. Tibshirani doesn't fully appreciate it. 2002: Ingrid Daubechies gives a talk at Stanford and describes a one-at-a-time algorithm for the lasso. Hastie implements it, makes an error, and Hastie and Tibshirani conclude that the method doesn't work. 2006: Friedman is the external examiner at the PhD oral of Anita van der Kooij (Leiden), who uses the coordinate descent idea for the elastic net. Friedman wonders whether it works for the lasso. Friedman, Hastie and Tibshirani start working on this problem. See also Wu and Lange (2008)!

17 Pathwise coordinate descent for the lasso. Start with a large value for λ (very sparse model) and slowly decrease it; most coordinates that are zero never become non-zero. The coordinate descent code for the lasso is just 73 lines of Fortran!

18 Extensions Pathwise coordinate descent can be generalized to many other models: logistic/multinomial for classification, graphical lasso for undirected graphs, fused lasso for signals. Its speed and simplicity are quite remarkable. glmnet R package now available on CRAN
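A minimal usage sketch of the glmnet package mentioned above, on simulated data (package defaults for the lambda grid and cross-validation; only the data here are invented):

```r
library(glmnet)
set.seed(2)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)

fit <- glmnet(X, y, alpha = 1)        # lasso path over a decreasing grid of lambda values
plot(fit, xvar = "lambda")            # coefficient profiles along the path
cvfit <- cv.glmnet(X, y)              # cross-validation to choose lambda
coef(cvfit, s = "lambda.min")         # sparse coefficients at the selected lambda
```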

19 Logistic regression. Outcome $Y = 0$ or $1$; logistic regression model $\log\frac{\Pr(Y=1)}{1-\Pr(Y=1)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$ The criterion is the binomial log-likelihood plus an absolute-value penalty. Example: sparse data, $N = 50{,}000$, $p = 700{,}000$. State-of-the-art interior point algorithm (Stephen Boyd, Stanford), exploiting sparsity of features: 3.5 hours for 100 values along the path. Pathwise coordinate descent: 1 minute.
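The same coordinate-descent machinery handles this case through glmnet's binomial family; a toy-scale sketch (the N = 50,000, p = 700,000 example in the talk is of course far larger):

```r
library(glmnet)
set.seed(3)
n <- 200; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - 2 * X[, 2]))   # binary outcome from a sparse logistic model

fit <- glmnet(X, y, family = "binomial")         # L1-penalized logistic regression path
```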

20 Multiclass classification. Microarray classification: 16,000 genes, 144 training samples, 54 test samples, 14 cancer classes. Multinomial regression model.

Method                                   CV errors (out of 144)   Test errors (out of 54)   # of genes used
1. Nearest shrunken centroids            35 (5)                   17                        6520
2. L2-penalized discriminant analysis    25 (4.1)                 12                        16063
3. Support vector classifier             26 (4.2)                 14                        16063
4. Lasso regression (one vs all)         30.7 (1.8)               12.5                      1429
5. K-nearest neighbors                   41 (4.6)                 26                        16063
6. L2-penalized multinomial              26 (4.2)                 15                        16063
7. Lasso-penalized multinomial           17 (2.8)                 13                        269
8. Elastic-net penalized multinomial     22 (3.7)                 11.8                      384

21 Graphical lasso for undirected graphs. p variables measured on N observations, e.g. p proteins measured in N cells. The goal is to estimate the best undirected graph on the variables: a missing link means that the two variables are conditionally independent, given the others.

22 [Figure: an example undirected graph on five nodes X1, X2, X3, X4, X5]

23 Naive procedure for graphs: Meinshausen and Buhlmann (2006). The coefficient of the jth predictor in a regression measures the partial correlation between the response variable and the jth predictor. Idea: apply the lasso to the graph problem by treating each node in turn as the response variable; include an edge between i and j in the graph if either the coefficient of the jth predictor in the regression for x_i is non-zero, or that of the ith predictor in the regression for x_j is non-zero (see the sketch below).
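A minimal sketch of this node-wise recipe (my own illustration, not Meinshausen and Buhlmann's code), running one lasso regression per node at a common penalty value lambda:

```r
# Regress each variable on all the others with the lasso and connect i -- j
# if either regression gives the partner a non-zero coefficient (the "or" rule).
library(glmnet)

neighborhood_graph <- function(X, lambda) {
  p <- ncol(X)
  adj <- matrix(0, p, p)
  for (j in seq_len(p)) {
    fit <- glmnet(X[, -j], X[, j])
    b <- as.matrix(coef(fit, s = lambda))[-1, 1]   # drop the intercept
    adj[j, -j] <- as.numeric(b != 0)
  }
  (adj + t(adj)) > 0                               # symmetrize with the "or" rule
}
```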

24 More exact formulation. Assume $x \sim N(0, \Sigma)$ and let $\ell(\Sigma; X)$ be the log-likelihood. $X_j$ and $X_k$ are conditionally independent if $(\Sigma^{-1})_{jk} = 0$. Let $\Theta = \Sigma^{-1}$ and let $S$ be the empirical covariance matrix; the problem is to maximize the penalized log-likelihood $\log\det\Theta - \mathrm{tr}(S\Theta)$ subject to $\|\Theta\|_1 \le t$. (1) Proposed in Yuan and Lin (2007). Convex problem! How to maximize? Naive: $\mathrm{lasso}(X_{,-j}, X_j, t)$. Naive, using inner products: $\mathrm{lasso}((X^T X)_{-j,-j}, X_{,-j}^T X_j, t)$. Exact, just a modified lasso: $\mathrm{lasso}(\hat\Sigma_{-j,-j}, X_{,-j}^T X_j, t)$.
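The glasso R package listed in the Discussion implements the blockwise procedure described on the next slides; note that it uses the Lagrangian penalty rho rather than the bound t in (1). A minimal usage sketch on simulated data (rho = 0.1 is an arbitrary choice):

```r
library(glasso)
set.seed(4)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
S <- var(X)                        # empirical covariance matrix

fit <- glasso(S, rho = 0.1)        # L1-penalized estimate of Theta = Sigma^{-1}
Theta <- fit$wi                    # estimated inverse covariance matrix
adj <- abs(Theta) > 1e-8           # zero entries correspond to missing edges
diag(adj) <- FALSE
```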

25 Why does this work? Considered one row and column at a time, the log-likelihood and the lasso problem have the same subgradient. Banerjee et al (2007) found this connection, but didn't fully exploit it. The resulting algorithm is a blockwise coordinate descent procedure.

26 Speed comparisons. Timings for p = 400 variables:

Method                               CPU time
State-of-the-art convex optimizer    27 min
Graphical lasso                      6.2 sec

All of this can be generalized to binary Markov networks, but computation of the partition function remains the major hurdle.

Cell signalling proteins 27

28 [Figure: estimated protein networks, one panel per bound, for L1-norm values ranging from 2.27182 down to 7e-05]

29 Fitting graphs with missing edges: a new solution to an old problem. [Figure: a five-node undirected graph on X1, ..., X5 with some edges missing] Do a modified linear regression of each node on the other nodes, leaving out variables that represent missing edges; cycle through the nodes until convergence. This is a simple, apparently new algorithm for this problem [according to Whittaker, Lauritzen].

30 Another application of the graphical lasso. Covariance-regularized regression and classification for high-dimensional problems: the Scout procedure. Daniela Witten, ongoing Stanford Ph.D. thesis. The idea is to penalize the joint distribution of the features and response, to obtain a better estimate of the regression parameters.

31 Fused lasso $$\min_\beta \; \frac{1}{2}\sum_{j=1}^{n}(y_j-\beta_j)^2 + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=2}^{p}|\beta_j-\beta_{j-1}|.$$ (Tibshirani, Saunders, Rosset, Zhu and Knight, 2004) Coordinate descent: although the criterion is convex, it is not differentiable, and coordinate descent can get stuck in the cusps; a modification is needed: we have to move pairs of parameters together.
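For illustration, here is a 1-d fused lasso fit to a simulated piecewise-constant signal using the genlasso package rather than the talk's own software; as I read its documentation, the path is computed over the fusion penalty lambda2 and gamma sets the ratio lambda1/lambda2:

```r
library(genlasso)
set.seed(5)
# noisy piecewise-constant signal, loosely mimicking a single CGH profile
y <- c(rep(0, 40), rep(2, 30), rep(0, 30)) + rnorm(100, sd = 0.5)

fit <- fusedlasso1d(y, gamma = 0.5)                  # full solution path
beta_hat <- as.numeric(coef(fit, lambda = 2)$beta)   # fitted signal at one penalty value
plot(y, col = "grey")
lines(beta_hat, col = "red", lwd = 2)                # piecewise-constant estimate
```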

32 Comparative Genomic Hybridization (CGH) data [Figure: two panels of log2 ratio plotted against genome order (0 to 1000)]

33 [Figure: 3 x 3 grid of fused-lasso fits to the CGH profile, with lambda1 in {0.01, 1.56, 3.11} (rows) and lambda2 in {0, 0.21, 1} (columns), each panel showing the data against genome order]

34 [Figure: ROC curves (TPR versus FPR) comparing Fused.Lasso, CGHseg, CLAC and CBS for window sizes 5, 10, 20 and 40]

CGH- multisample fused lasso 35

36 Marginal projections have limitations...

37 Our idea. Patient profiles (cakes), underlying patterns (raw ingredients), weights (recipes).

Common patterns that were discovered 38

39 Two-dimensional fused lasso. The problem has the form $$\min_\beta \; \frac{1}{2}\sum_{i=1}^{p}\sum_{i'=1}^{p}(y_{ii'}-\beta_{ii'})^2 \quad \text{subject to} \quad \sum_{i=1}^{p}\sum_{i'=1}^{p}|\beta_{ii'}|\le s_1, \;\; \sum_{i=1}^{p}\sum_{i'=2}^{p}|\beta_{i,i'}-\beta_{i,i'-1}|\le s_2, \;\; \sum_{i=2}^{p}\sum_{i'=1}^{p}|\beta_{i,i'}-\beta_{i-1,i'}|\le s_3.$$ We consider fusions of pixels one unit away in horizontal or vertical directions. After a number of fusions, fused regions can become very irregular in shape. Required some impressive programming by Friedman.
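A small sketch of the two-dimensional version on a noisy image, again with genlasso (not the Friedman implementation described in the talk); the image size and gamma are arbitrary, and the penalty level is picked from the middle of the computed path just to keep the example self-contained:

```r
library(genlasso)
set.seed(6)
img <- matrix(0, 16, 16)
img[6:11, 6:11] <- 2                              # bright square on a dark background
y <- img + matrix(rnorm(256, sd = 0.5), 16, 16)   # add noise

fit <- fusedlasso2d(y, gamma = 0.2)               # fuse horizontally/vertically adjacent pixels
lam <- median(fit$lambda)                         # a penalty value inside the computed path
denoised <- matrix(coef(fit, lambda = lam)$beta, 16, 16)
image(denoised)                                   # roughly piecewise-constant reconstruction
```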

40 Examples. Two-fold validation (even and odd pixels) was used to estimate $\lambda_1$, $\lambda_2$.

41 Discussion. Lasso penalties are useful for fitting a wide variety of models. Pathwise coordinate descent enables us to fit these models to large datasets for the first time. Software in the R language: LARS, cghflasso, glasso, glmnet.

42 Finally Lasso can be used to prove Efron s last theorem. [Wild Applause]

[Figure: original image and noisy image, each shown with its fused lasso reconstruction]