The lasso: some novel algorithms and applications
Robert Tibshirani, Stanford University
Newton Institute, June 25, 2008
Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak, Pei Wang, Daniela Witten
Email: tibs@stat.stanford.edu    http://www-stat.stanford.edu/~tibs
Tibs Lab: Genevera Allen, Holger Hoefling, Gen Nowak, Daniela Witten
[Photos: Jerome Friedman and Trevor Hastie, with their look-alike matches from MyHeritage.com]
Outline
A new fast algorithm for the lasso: pathwise coordinate descent.
Three examples of generalizations of the lasso:
1. Logistic/multinomial regression for classification, with a later example of cancer classification from microarray data
2. Undirected graphical models: the graphical lasso
3. Estimation of signals along chromosomes (comparative genomic hybridization): the fused lasso
Linear regression via the lasso (Tibshirani, 1995)
Outcome variable $y_i$ for cases $i = 1, 2, \ldots, n$; features $x_{ij}$, $j = 1, 2, \ldots, p$.
Minimize $\sum_{i=1}^n \big(y_i - \sum_j x_{ij}\beta_j\big)^2 + \lambda \sum_j |\beta_j|$.
Equivalent to minimizing the sum of squares subject to the constraint $\sum_j |\beta_j| \le s$.
Similar to ridge regression, which uses the constraint $\sum_j \beta_j^2 \le t$.
The lasso does variable selection and shrinkage, while ridge only shrinks.
See also basis pursuit (Chen, Donoho and Saunders, 1998).
Picture of lasso and ridge regression
[Figure: elliptical contours of the least squares criterion around $\hat\beta$ in the $(\beta_1, \beta_2)$ plane, with the diamond-shaped lasso constraint region and the circular ridge constraint region]
Example: prostate cancer data
$y_i = \log(\text{PSA})$; the $x_{ij}$ are measurements on a man and his prostate.
[Figure: lasso coefficient profiles for lcavol, svi, lweight, pgg45, lbph, gleason, age and lcp, plotted against the shrinkage factor $s$]
Example: graphical lasso for cell-signalling proteins
[Figure: estimated protein networks along the graphical lasso path, for L1 norms from 2.27182 down to 7e-05; the graph becomes sparser as the norm bound decreases]
Fused lasso for CGH data
[Figure: two panels of log2 ratio versus genome order (0 to 1000): the raw data and the fused lasso fit]
Algorithms for the lasso
Standard convex optimizers
Least angle regression (LAR), Efron et al. 2004: computes the entire path of solutions. State of the art until 2008.
Pathwise coordinate descent: new
Pathwise coordinate descent for the lasso
Coordinate descent: optimize one parameter (coordinate) at a time.
How? Suppose we had only one predictor. The problem is to minimize $\sum_i (y_i - x_i\beta)^2 + \lambda|\beta|$.
The solution is the soft-thresholded estimate $\mathrm{sign}(\hat\beta)\,(|\hat\beta| - \lambda)_+$, where $\hat\beta$ is the usual least squares estimate.
Idea: with multiple predictors, cycle through the predictors in turn. For predictor $j$, compute the partial residuals $r_i = y_i - \sum_{k \ne j} x_{ik}\hat\beta_k$ and apply univariate soft-thresholding, pretending that our data is $(x_{ij}, r_i)$.
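As a rough illustration, here is a numpy sketch of these updates. The function names are mine, I use the common $\frac{1}{2n}$ scaling of the squared error (which simply rescales $\lambda$), and the columns of X are assumed standardized so that $\frac{1}{n}\sum_i x_{ij}^2 = 1$:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
    Assumes standardized columns: (1/n) * sum_i x_ij^2 = 1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every predictor's fit except j's
            r = y - X @ beta + X[:, j] * beta[j]
            # univariate least squares on (x_j, r), then soft-threshold
            beta[j] = soft_threshold(X[:, j] @ r / n, lam)
    return beta
```

With `lam = 0` the sweeps converge to the ordinary least squares fit; with a positive `lam`, weakly correlated predictors are set exactly to zero.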
It turns out that this is coordinate descent for the lasso criterion $\sum_i \big(y_i - \sum_j x_{ij}\beta_j\big)^2 + \lambda \sum_j |\beta_j|$: like skiing to the bottom of a hill, going north-south, then east-west, then north-south, and so on. [Show movie] Too simple?!
A brief history of coordinate descent for the lasso
1997: Tibshirani's student Wenjiang Fu at the University of Toronto develops the "shooting algorithm" for the lasso. Tibshirani doesn't fully appreciate it.
2002: Ingrid Daubechies gives a talk at Stanford and describes a one-at-a-time algorithm for the lasso. Hastie implements it, makes an error, and Hastie and Tibshirani conclude that the method doesn't work.
2006: Friedman is the external examiner at the PhD oral of Anita van der Kooij (Leiden), who uses the coordinate descent idea for the elastic net. Friedman wonders whether it works for the lasso. Friedman, Hastie and Tibshirani start working on this problem. See also Wu and Lange (2008)!
Pathwise coordinate descent for the lasso
Start with a large value of $\lambda$ (a very sparse model) and slowly decrease it.
Most coordinates that are zero never become non-zero.
The coordinate descent code for the lasso is just 73 lines of Fortran!
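A sketch of this pathwise strategy, warm-starting each fit from the previous solution. The function name and the 20-value log-spaced grid are illustrative choices of mine, not the Fortran implementation, and the same $\frac{1}{2n}$ scaling and standardized columns as before are assumed:

```python
import numpy as np

def lasso_path(X, y, n_lambda=20, eps=1e-3, n_sweeps=50):
    """Pathwise coordinate descent: start at the smallest lambda that makes
    every coefficient zero, decrease lambda on a log scale, and warm-start
    each fit from the previous solution."""
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / n      # all coefficients zero above this
    lams = np.logspace(np.log10(lam_max), np.log10(lam_max * eps), n_lambda)
    beta = np.zeros(p)
    path = []
    for lam in lams:                           # decreasing sequence
        for _ in range(n_sweeps):
            for j in range(p):
                r = y - X @ beta + X[:, j] * beta[j]
                z = X[:, j] @ r / n
                beta[j] = np.sign(z) * max(abs(z) - lam, 0.0)
        path.append(beta.copy())               # beta carries over: warm start
    return lams, np.array(path)
```

The warm starts are what make the whole path cheap: each new $\lambda$ changes the solution only slightly, so a few sweeps suffice.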
Extensions
Pathwise coordinate descent generalizes to many other models: logistic/multinomial regression for classification, the graphical lasso for undirected graphs, the fused lasso for signals. Its speed and simplicity are quite remarkable.
The glmnet R package is now available on CRAN.
Logistic regression
Outcome $Y = 0$ or $1$; the logistic regression model is
$\log\!\left(\frac{\Pr(Y=1)}{1 - \Pr(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$
The criterion is the binomial log-likelihood plus an absolute-value penalty.
Example: sparse data with $N = 50{,}000$, $p = 700{,}000$.
State-of-the-art interior point algorithm (Stephen Boyd, Stanford), exploiting sparsity of the features: 3.5 hours for 100 values along the path.
Pathwise coordinate descent: 1 minute.
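The glmnet-style approach wraps coordinate descent inside iteratively reweighted least squares: each outer step forms the quadratic approximation to the binomial log-likelihood, and the inner loop solves the resulting penalized weighted least squares problem by soft-thresholding. A hedged numpy sketch (the function name, the weight floor and the iteration counts are my own choices):

```python
import numpy as np

def logistic_lasso(X, y, lam, n_outer=25, n_inner=25):
    """L1-penalized logistic regression via IRLS plus coordinate descent."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_outer):
        eta = beta0 + X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.maximum(prob * (1.0 - prob), 1e-5)   # IRLS weights, kept away from 0
        z = eta + (y - prob) / w                    # working response
        for _ in range(n_inner):
            # unpenalized intercept: weighted mean of the residual
            beta0 = np.sum(w * (z - X @ beta)) / np.sum(w)
            for j in range(p):
                r = z - beta0 - X @ beta + X[:, j] * beta[j]
                num = np.dot(w * X[:, j], r) / n
                den = np.dot(w, X[:, j] ** 2) / n
                beta[j] = np.sign(num) * max(abs(num) - lam, 0.0) / den
    return beta0, beta
```

Each inner update is the same soft-thresholding as in the Gaussian case, just with observation weights; the outer loop refreshes the weights and working response.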
Multiclass classification
Microarray classification: 16,063 genes, 144 training samples, 54 test samples, 14 cancer classes. Multinomial regression model.

Method                                   CV errors      Test errors    # genes
                                         (out of 144)   (out of 54)    used
1. Nearest shrunken centroids            35 (5)         17             6520
2. L2-penalized discriminant analysis    25 (4.1)       12             16063
3. Support vector classifier             26 (4.2)       14             16063
4. Lasso regression (one vs all)         30.7 (1.8)     12.5           1429
5. K-nearest neighbors                   41 (4.6)       26             16063
6. L2-penalized multinomial              26 (4.2)       15             16063
7. Lasso-penalized multinomial           17 (2.8)       13             269
8. Elastic-net penalized multinomial     22 (3.7)       11.8           384
Graphical lasso for undirected graphs
$p$ variables measured on $N$ observations, e.g. $p$ proteins measured in $N$ cells.
The goal is to estimate the best undirected graph on the variables: a missing link means that the two variables are conditionally independent given the rest.
[Figure: example undirected graph on five nodes X1, X2, X3, X4, X5]
Naive procedure for graphs (Meinshausen and Bühlmann, 2006)
The coefficient of the jth predictor in a regression measures the partial correlation between the response variable and the jth predictor.
Idea: apply the lasso to the graph problem by treating each node in turn as the response variable.
Include an edge $i \sim j$ in the graph if either the coefficient of the jth predictor in the regression for $x_i$ is non-zero, or the coefficient of the ith predictor in the regression for $x_j$ is non-zero.
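A small numpy sketch of this node-wise procedure with the "or" rule. The function names, the internal mini-lasso and the zero tolerance are mine; real uses would tune the penalty by cross-validation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Plain coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0)
    return b

def neighborhood_select(X, lam):
    """Meinshausen-Buhlmann style edge estimation: lasso-regress each
    variable on all the others; keep edge (i, j) if either of the two
    regressions gives it a non-zero coefficient (the 'or' rule)."""
    n, p = X.shape
    coef = np.zeros((p, p))
    for j in range(p):
        others = [k for k in range(p) if k != j]
        coef[j, others] = lasso_cd(X[:, others], X[:, j], lam)
    return (np.abs(coef) > 1e-8) | (np.abs(coef.T) > 1e-8)
```

The result is a symmetric boolean adjacency matrix; variables that are conditionally unrelated get their coefficients soft-thresholded to zero in both directions.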
More exact formulation
Assume $x \sim N(0, \Sigma)$ and let $\ell(\Sigma; X)$ be the log-likelihood. $X_j$ and $X_k$ are conditionally independent if $(\Sigma^{-1})_{jk} = 0$.
Let $\Theta = \Sigma^{-1}$ and let $S$ be the empirical covariance matrix. The problem is to maximize the penalized log-likelihood
$\log\det\Theta - \mathrm{tr}(S\Theta)$ subject to $\|\Theta\|_1 \le t$. (1)
Proposed in Yuan and Lin (2007). A convex problem! How to maximize?
Naive: lasso$(X_{[\,,-j]},\, X_j,\, t)$
Naive, using inner products: lasso$\big((X^T X)_{[-j,-j]},\, X_{[\,,-j]}^T X_j,\, t\big)$
Exact, just a modified lasso: lasso$\big(\hat\Sigma_{[-j,-j]},\, X_{[\,,-j]}^T X_j,\, t\big)$
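A minimal numpy sketch of the criterion in (1), written in Lagrange form; the function name and the positive-definiteness check via a Cholesky attempt are my own:

```python
import numpy as np

def glasso_objective(theta, S, lam):
    """Penalized Gaussian log-likelihood:
    log det(Theta) - tr(S @ Theta) - lam * sum_jk |Theta_jk|."""
    try:
        L = np.linalg.cholesky(theta)          # fails unless Theta is PD
    except np.linalg.LinAlgError:
        return float("-inf")
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return logdet - np.trace(S @ theta) - lam * np.abs(theta).sum()
```

With `lam = 0` the maximizer is $S^{-1}$ (the gradient $\Theta^{-1} - S$ vanishes there), which gives an easy numerical sanity check on any candidate solver.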
Why does this work?
Considered one row and column at a time, the log-likelihood problem and the lasso problem have the same subgradient equations.
Banerjee et al. (2007) found this connection but didn't fully exploit it.
The resulting algorithm is a blockwise coordinate descent procedure.
Speed comparisons
Timings for $p = 400$ variables:

Method                               CPU time
State-of-the-art convex optimizer    27 min
Graphical lasso                      6.2 sec

All of this can be generalized to binary Markov networks, but computation of the partition function remains the major hurdle.
Cell signalling proteins
[Figure: estimated protein networks along the graphical lasso path, for L1 norms from 2.27182 down to 7e-05, shown again]
Fitting graphs with missing edges: a new solution to an old problem
[Figure: five-node graph X1, ..., X5 with some edges absent]
Do a modified linear regression of each node on the other nodes, leaving out the variables that correspond to missing edges; cycle through the nodes until convergence.
This is a simple, apparently new algorithm for this problem [according to Whittaker and Lauritzen].
Another application of the graphical lasso
Covariance-regularized regression and classification for high-dimensional problems: the "scout" procedure. Daniela Witten, ongoing Stanford PhD thesis.
The idea is to penalize the joint distribution of the features and response, to obtain a better estimate of the regression parameters.
Fused lasso
$\min_\beta \; \frac{1}{2}\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=2}^p |\beta_j - \beta_{j-1}|$
(Tibshirani, Saunders, Rosset, Zhu and Knight, 2004)
Coordinate descent: although the criterion is convex, it is not differentiable, and coordinate descent can get stuck in the "cusps". A modification is needed: pairs of parameters have to be moved together.
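To make the criterion concrete, here is a small numpy check; the three-point signal, the grid and the penalty values are my own illustration, not from the talk. With $\lambda_1 = 0$ and $\lambda_2 = 0.5$, brute-force minimization over a grid shows the difference penalty fusing the two nearby values into one common level:

```python
import numpy as np

def fused_objective(beta, y, lam1, lam2):
    """1D fused lasso criterion: 0.5*||y - beta||^2 + lam1 * sum_j |beta_j|
    + lam2 * sum_j |beta_j - beta_{j-1}|."""
    return (0.5 * np.sum((y - beta) ** 2)
            + lam1 * np.sum(np.abs(beta))
            + lam2 * np.sum(np.abs(np.diff(beta))))

# Tiny signal: the first two observations (1.0 and 1.1) are close, the
# third (3.0) is far away.  Minimize over a grid of candidate betas.
y = np.array([1.0, 1.1, 3.0])
grid = np.arange(0.0, 3.5, 0.05)
B1, B2, B3 = np.meshgrid(grid, grid, grid, indexing="ij")
obj = (0.5 * ((y[0] - B1) ** 2 + (y[1] - B2) ** 2 + (y[2] - B3) ** 2)
       + 0.5 * (np.abs(B2 - B1) + np.abs(B3 - B2)))   # lam1 = 0, lam2 = 0.5
i, j, k = np.unravel_index(np.argmin(obj), obj.shape)
best = np.array([grid[i], grid[j], grid[k]])
# best is approximately [1.3, 1.3, 2.5]: the first two coefficients fuse
```

The minimizer is piecewise constant, which is exactly what makes the penalty attractive for copy-number data.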
Comparative genomic hybridization (CGH) data
[Figure: log2 ratio versus genome order (0 to 1000), two panels: raw data and fitted signal]
[Figure: fused lasso fits to the CGH data for lambda1 in {0.01, 1.56, 3.11} and lambda2 in {0, 0.21, 1}; larger lambda2 fuses neighboring coefficients into piecewise-constant segments, larger lambda1 shrinks the fit toward zero]
[Figure: ROC curves (TPR versus FPR) comparing Fused.Lasso, CGHseg, CLAC and CBS for window sizes 5, 10, 20 and 40]
CGH: multisample fused lasso

Marginal projections have limitations...
Our idea
Patient profiles (cakes) = underlying patterns (raw ingredients) combined with weights (recipes).
Common patterns that were discovered
The two-dimensional fused lasso problem has the form
$\min_\beta \; \frac{1}{2}\sum_{i=1}^p \sum_{i'=1}^p (y_{ii'} - \beta_{ii'})^2$
subject to
$\sum_{i=1}^p \sum_{i'=1}^p |\beta_{ii'}| \le s_1$, $\quad \sum_{i=1}^p \sum_{i'=2}^p |\beta_{i,i'} - \beta_{i,i'-1}| \le s_2$, $\quad \sum_{i=2}^p \sum_{i'=1}^p |\beta_{i,i'} - \beta_{i-1,i'}| \le s_3$.
We consider fusions of pixels one unit away in the horizontal or vertical directions. After a number of fusions, the fused regions can become very irregular in shape. This required some impressive programming by Friedman.
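A direct numpy transcription of this criterion, written in Lagrange form rather than with the constraints $s_1, s_2, s_3$ (the function name and that rewriting are mine); it is useful for checking candidate solutions:

```python
import numpy as np

def fused2d_objective(beta, y, lam1, lam2, lam3):
    """2D fused lasso criterion in Lagrange form:
    0.5 * sum (y - beta)^2  +  lam1 * sum |beta|
    + lam2 * (horizontal differences) + lam3 * (vertical differences)."""
    return (0.5 * np.sum((y - beta) ** 2)
            + lam1 * np.sum(np.abs(beta))
            + lam2 * np.sum(np.abs(np.diff(beta, axis=1)))   # within rows
            + lam3 * np.sum(np.abs(np.diff(beta, axis=0))))  # within columns
```

A flat image pays only the L1 penalty, while edges in the image pay the difference penalties, so the fit trades data fidelity against spatially coherent blocks.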
Examples
Two-fold validation (on even and odd pixels) was used to estimate $\lambda_1$, $\lambda_2$.
Discussion
Lasso penalties are useful for fitting a wide variety of models to large datasets.
Pathwise coordinate descent makes it possible to fit these models to large datasets for the first time.
Software in the R language: LARS, cghflasso, glasso, glmnet.
Finally
The lasso can be used to prove Efron's last theorem. [Wild applause]
[Figure: original image and noisy image, each with its fused lasso reconstruction]