Group lasso for genomic data
Jean-Philippe Vert
Mines ParisTech / Curie Institute / Inserm
Machine learning: Theory and Computation workshop, IMA, Minneapolis, March 26-30, 2012
J.P Vert (ParisTech), Group lasso in genomics, IMA
Outline
1. Motivations
2. Finding multiple change-points in a single profile
3. Finding multiple change-points shared by many signals
4. Learning molecular classifiers with network information
5. Conclusion
Chromosomal aberrations in cancer
Comparative Genomic Hybridization (CGH)
Motivation: comparative genomic hybridization (CGH) data measure the DNA copy number along the genome. Very useful, in particular in cancer research, to observe systematically variants in DNA content.
[Figure: log-ratio of DNA copy number along chromosomes 1-22 and X]
Can we identify breakpoints and "smooth" each profile?
[Figure: a noisy copy-number profile and its piecewise constant approximation]
Can we detect frequent breakpoints?
[Figure: a collection of bladder tumour copy number profiles.]
DNA -> RNA -> protein
CGH shows the (static) DNA. Cancer cells also have abnormal (dynamic) gene expression (= transcription).
Can we identify the cancer subtype? (diagnosis)
Can we predict the future evolution? (prognosis)
Outline
1. Motivations
2. Finding multiple change-points in a single profile
3. Finding multiple change-points shared by many signals
4. Learning molecular classifiers with network information
5. Conclusion
The problem
[Figure: a noisy signal and its piecewise constant approximation]
Let Y ∈ R^p be the signal. We want to find a piecewise constant approximation Û ∈ R^p with at most k change-points.
An optimal solution?
We can define an "optimal" piecewise constant approximation Û ∈ R^p as the solution of

  min_{U ∈ R^p} ‖Y - U‖²  such that  Σ_{i=1}^{p-1} 1(U_{i+1} ≠ U_i) ≤ k

This is an optimization problem over the (p choose k) partitions...
Dynamic programming finds the solution in O(p²k) in time and O(p²) in memory.
But: does not scale to p = 10^6 - 10^9...
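As a concrete illustration of the dynamic program above, here is a minimal sketch in Python (function names and index conventions are ours, not from the talk): segment costs come from cumulative sums in O(1) each, and the table fill matches the O(p²k) time bound.

```python
import numpy as np

def segment_dp(y, k):
    """Optimal piecewise-constant fit of y with at most k change-points,
    by dynamic programming. Returns (change-point positions, squared error)."""
    p = len(y)
    s = np.concatenate([[0.0], np.cumsum(y)])
    s2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(i, j):  # squared error of a constant fit on y[i..j], inclusive
        n = j - i + 1
        return s2[j + 1] - s2[i] - (s[j + 1] - s[i]) ** 2 / n

    # dp[m][j] = best error for y[0..j] using exactly m change-points
    dp = np.full((k + 1, p), np.inf)
    back = np.zeros((k + 1, p), dtype=int)
    for j in range(p):
        dp[0][j] = cost(0, j)
    for m in range(1, k + 1):
        for j in range(p):
            for t in range(j):  # t = last index before the final segment
                c = dp[m - 1][t] + cost(t + 1, j)
                if c < dp[m][j]:
                    dp[m][j] = c
                    back[m][j] = t
    # backtrack the change-point positions (index of first point of a new segment)
    cps, j, m = [], p - 1, int(np.argmin(dp[:, p - 1]))
    while m > 0:
        t = back[m][j]
        cps.append(t + 1)
        j, m = t, m - 1
    return sorted(cps), dp[:, p - 1].min()
```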
Promoting sparsity with the ℓ1 penalty
The ℓ1 penalty (Tibshirani, 1996; Chen et al., 1998): if R(β) is convex and "smooth", the solution of

  min_{β ∈ R^p} R(β) + λ Σ_{i=1}^p |β_i|

is usually sparse.
Promoting piecewise constant profiles
The total variation / variable fusion penalty: if R(β) is convex and "smooth", the solution of

  min_{β ∈ R^p} R(β) + λ Σ_{i=1}^{p-1} |β_{i+1} - β_i|

is usually piecewise constant (Rudin et al., 1992; Land and Friedman, 1996).
Proof:
- change of variable u_i = β_{i+1} - β_i, u_0 = β_1
- we obtain a Lasso problem in u ∈ R^p
- u sparse means β piecewise constant
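The change of variable in this proof can be turned directly into a tiny solver: write β = Lu with L the lower-triangular all-ones matrix, then run a plain lasso algorithm on u, leaving the offset u_0 unpenalized. The sketch below uses ISTA (proximal gradient) written from scratch; it is only illustrative and far slower than the dedicated algorithms cited on the next slides.

```python
import numpy as np

def tv_approximator(y, lam, n_iter=10000):
    """Minimize 0.5 * ||y - beta||^2 + lam * sum_i |beta_{i+1} - beta_i|
    via the change of variable u_i = beta_{i+1} - beta_i (u_0 = beta_1),
    i.e. a lasso in u with design L = lower-triangular ones, solved by ISTA."""
    p = len(y)
    L = np.tril(np.ones((p, p)))             # beta = L @ u
    step = 1.0 / np.linalg.norm(L, 2) ** 2   # 1 / Lipschitz constant of the gradient
    u = np.zeros(p)
    for _ in range(n_iter):
        u -= step * (L.T @ (L @ u - y))      # gradient step on the quadratic part
        # soft-threshold the increments; the offset u_0 is not penalized
        u[1:] = np.sign(u[1:]) * np.maximum(np.abs(u[1:]) - step * lam, 0.0)
    return L @ u                             # piecewise constant fit
```

Because the ℓ1 prox zeroes most increments exactly, the returned fit is genuinely piecewise constant, with jumps only where the data support them.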
TV signal approximator

  min_{β ∈ R^p} ‖Y - β‖²  such that  Σ_{i=1}^{p-1} |β_{i+1} - β_i| ≤ μ

Adding additional constraints does not change the change-points:
- Σ_{i=1}^p |β_i| ≤ ν (Tibshirani et al., 2005; Tibshirani and Wang, 2008)
- Σ_{i=1}^p β_i² ≤ ν (Mairal et al., 2010)
Solving the TV signal approximator

  min_{β ∈ R^p} ‖Y - β‖²  such that  Σ_{i=1}^{p-1} |β_{i+1} - β_i| ≤ μ

- QP with sparse linear constraints in O(p²): 35 min for p = 10^5 (Tibshirani and Wang, 2008)
- Coordinate descent-like method, O(p) in practice? 3 s for p = 10^5 (Friedman et al., 2007)
- For all μ with the LARS in O(pK) (Harchaoui and Levy-Leduc, 2008)
- For all μ in O(p ln p) (Hoefling, 2009)
- For the first K change-points in O(p ln K) (Bleakley and V., 2011)
TV signal approximator as dichotomic segmentation
We will investigate different functions I_L(I), I_R(I) and a gain Δ(I). The greedy dichotomic segmentation method (Algorithm 1) starts from the full interval [1, n] as a trivial partition of [1, n] into 1 interval, then iteratively refines the current partition P = {I_1, ..., I_p} by splitting the interval I* ∈ P with maximal Δ(I*) into the two intervals I_L(I*) and I_R(I*).

Algorithm 1: Greedy dichotomic segmentation
Require: k number of intervals, Δ(I) gain function to split an interval I into I_L(I), I_R(I)
1: I_1 represents the interval [1, n]
2: P = {I_1}
3: for i = 1 to k do
4:   I* <- argmax_{I ∈ P} Δ(I)
5:   P <- P \ {I*}
6:   P <- P ∪ {I_L(I*), I_R(I*)}
7: end for
8: return P

Theorem: the TV signal approximator performs "greedy" dichotomic segmentation (V. and Bleakley, 2010; see also Hoefling, 2009).
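A direct transcription of Algorithm 1 in Python, as a sketch. The gain Δ(I) here is our choice, not necessarily the one in the theorem: the decrease in squared error from the best single split of I, i.e. the standard binary-segmentation gain; we also parametrize by the number of intervals (k intervals = k - 1 splits).

```python
import numpy as np

def best_split(y, lo, hi):
    """Return (gain, split) for the interval y[lo:hi] (hi exclusive):
    the decrease in squared error from the best single split point."""
    seg = y[lo:hi]
    n = len(seg)
    total = ((seg - seg.mean()) ** 2).sum()
    s = np.cumsum(seg)
    s2 = np.cumsum(seg ** 2)
    best_gain, best_t = -np.inf, None
    for t in range(1, n):  # left part seg[:t], right part seg[t:]
        left = s2[t - 1] - s[t - 1] ** 2 / t
        right = (s2[n - 1] - s2[t - 1]) - (s[n - 1] - s[t - 1]) ** 2 / (n - t)
        gain = total - left - right
        if gain > best_gain:
            best_gain, best_t = gain, lo + t
    return best_gain, best_t

def greedy_segmentation(y, k):
    """Split [0, len(y)) into k intervals by always splitting the
    interval with maximal gain (greedy dichotomic segmentation)."""
    partition = [(0, len(y))]
    for _ in range(k - 1):
        gains = [best_split(y, lo, hi) + (idx,)
                 for idx, (lo, hi) in enumerate(partition) if hi - lo > 1]
        gain, t, idx = max(gains)
        lo, hi = partition.pop(idx)
        partition[idx:idx] = [(lo, t), (t, hi)]
    return sorted(partition)
```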
Speed trial: about 2 s for K = 100, p = 10^7
[Figure: running time (seconds) vs. signal length p, for K = 1, 10, 10², 10³, 10⁴, 10⁵ change-points]
Outline
1. Motivations
2. Finding multiple change-points in a single profile
3. Finding multiple change-points shared by many signals
4. Learning molecular classifiers with network information
5. Conclusion
The problem
[Figure: several noisy profiles sharing change-points]
Let Y ∈ R^{p×n} be the n signals of length p. We want to find a piecewise constant approximation Û ∈ R^{p×n} with at most k change-points.
"Optimal" segmentation by dynamic programming.5.5.5 2 3 4 5 6 7 8 9.5.5.5 2 3 4 5 6 7 8 9.5.5.5 2 3 4 5 6 7 8 9 Define the "optimal" piecewise constant approximation Û Rp n of Y as the solution of min Y U 2 such U Rp n that p ( ) U i+, U i, k i= DP finds the solution in O(p 2 kn) in time and O(p 2 ) in memory But: does not scale to p = 6 9... J.P Vert (ParisTech) Group lasso in genomics IMA 22 / 5
Selecting pre-defined groups of variables
Group lasso (Yuan & Lin, 2006): if groups of covariates are likely to be selected together, the ℓ1/ℓ2-norm induces sparse solutions at the group level:

  Ω_group(w) = Σ_g ‖w_g‖₂

  Ω(w_1, w_2, w_3) = ‖(w_1, w_2)‖₂ + ‖w_3‖₂ = √(w_1² + w_2²) + √(w_3²)
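To make the penalty concrete, here is a small sketch (names are ours) of Ω_group and its proximal operator for non-overlapping groups, i.e. block soft-thresholding, which is what makes whole groups vanish at once:

```python
import numpy as np

def group_norm(w, groups):
    """Omega_group(w): sum over groups of the l2 norm of w restricted to the group."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def prox_group(w, groups, lam):
    """Proximal operator of lam * Omega_group for non-overlapping groups."""
    out = np.zeros_like(w)
    for g in groups:
        ng = np.linalg.norm(w[g])
        if ng > lam:
            out[g] = (1 - lam / ng) * w[g]  # shrink the whole block
        # else: the whole group is set exactly to zero
    return out
```

A group whose joint norm falls below λ is killed entirely, while surviving groups are shrunk uniformly; this is the group-level analogue of the scalar soft-threshold behind the lasso.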
TV approximator for many signals
Replace

  min_{U ∈ R^{p×n}} ‖Y - U‖²  such that  Σ_{i=1}^{p-1} 1(U_{i+1,•} ≠ U_{i,•}) ≤ k

by

  min_{U ∈ R^{p×n}} ‖Y - U‖²  such that  Σ_{i=1}^{p-1} w_i ‖U_{i+1,•} - U_{i,•}‖ ≤ μ

Questions:
- Practice: can we solve it efficiently?
- Theory: does it benefit from increasing p (for n fixed)?
TV approximator as a group Lasso problem
Make the change of variables:
- γ = U_{1,•}
- β_{i,•} = w_i (U_{i+1,•} - U_{i,•}) for i = 1, ..., p - 1.
The TV approximator is then equivalent to the following group Lasso problem (Yuan and Lin, 2006):

  min_{β ∈ R^{(p-1)×n}} ‖Ȳ - X̄β‖² + λ Σ_{i=1}^{p-1} ‖β_{i,•}‖

where Ȳ is the centered signal matrix and X̄ is a particular p × (p - 1) design matrix.
TV approximator implementation
Theorem: the TV approximator

  min_{β ∈ R^{(p-1)×n}} ‖Ȳ - X̄β‖² + λ Σ_{i=1}^{p-1} ‖β_{i,•}‖

can be solved efficiently:
- approximately, with the group LARS, in O(npk) in time and O(np) in memory
- exactly, with a block coordinate descent + active set method, in O(np) in memory
Proof: computational tricks...
Although X̄ is p × (p - 1):
- For any R ∈ R^{p×n}, we can compute C = X̄ᵀR in O(np) operations and memory
- For any two subsets of indices A = (a_1, ..., a_|A|) and B = (b_1, ..., b_|B|) in [1, p - 1], we can compute X̄_{•,A}ᵀ X̄_{•,B} in O(|A||B|) in time and memory
- For any A = (a_1, ..., a_|A|), a set of distinct indices with a_1 < ... < a_|A| ≤ p - 1, and for any |A| × n matrix R, we can compute C = (X̄_{•,A}ᵀ X̄_{•,A})⁻¹ R in O(|A|n) in time and memory
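The first trick is easy to make concrete: for the implicit TV design with X[j, i] = 1/w_i for j > i, each row of XᵀR is just a scaled tail sum of the rows of R, so the product costs O(np) instead of O(p²n). A sketch (the design used in the paper also involves column centering, which we omit here), with an explicit O(p²n) reference implementation for checking:

```python
import numpy as np

def xtr_fast(R, w):
    """Compute X^T R in O(np) for the implicit p x (p-1) TV design
    with X[j, i] = 1/w_i for j > i, and 0 otherwise."""
    tail = np.cumsum(R[::-1], axis=0)[::-1]  # tail[i] = sum_{j >= i} R[j]
    return tail[1:] / w[:, None]             # row i: (1/w_i) * sum_{j > i} R[j]

def xtr_naive(R, w):
    """Reference: materialize X explicitly, in O(p^2 n)."""
    p, n = R.shape
    X = np.zeros((p, p - 1))
    for i in range(p - 1):
        X[i + 1:, i] = 1.0 / w[i]
    return X.T @ R
```

The same cumulative-sum idea underlies the other two tricks: all the needed Gram submatrices have closed forms in i and j, so nothing of size p × p is ever stored.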
Speed trial
[Figure 2: Speed trials for group fused LARS (GFLars) and group fused Lasso (GFLasso). Left column: varying n (p and k fixed); center column: varying p (n and k fixed); right column: varying k (n and p fixed). Figure axes are log-log; results are averaged over trials.]
The main difference between the Lasso and LARS performance appears when the number of change-points increases: with respective empirical complexities cubic and linear in k, as predicted by the theoretical analysis, the Lasso is already 1,000 times slower than LARS at the largest numbers of change-points sought. With respect to n and p, the Lasso optimization problem seems to become relatively easier to solve as they increase, as we observed that the algorithm converges quickly to its final set of change-points.
Consistency for a single change-point
Suppose a single change-point:
- at position u = αp
- with increments (β_i)_{i=1,...,n} s.t. β̄² = lim_{n→∞} (1/n) Σ_{i=1}^n β_i²
- corrupted by i.i.d. Gaussian noise of variance σ²
[Figure: three noisy profiles sharing a change-point]
Does the TV approximator correctly estimate the first change-point as p increases?
Consistency of the unweighted TV approximator
Theorem: consider

  min_{U ∈ R^{p×n}} ‖Y - U‖²  such that  Σ_{i=1}^{p-1} ‖U_{i+1,•} - U_{i,•}‖ ≤ μ

The unweighted TV approximator finds the correct change-point with probability tending to 1 (resp. 0) as n → +∞ if σ² < σ̃²_α (resp. σ² > σ̃²_α), where σ̃²_α is an explicit threshold, of order p β̄² away from the boundaries, that vanishes as α approaches 0 or 1. Consequently:
- correct estimation on [pε, p(1 - ε)] with ε = σ²/(2p β̄²) + o(p^{-1/2})
- wrong estimation near the boundaries
Consistency of the weighted TV approximator
Theorem: consider

  min_{U ∈ R^{p×n}} ‖Y - U‖²  such that  Σ_{i=1}^{p-1} w_i ‖U_{i+1,•} - U_{i,•}‖ ≤ μ

The weighted TV approximator with weights

  w_i = √(i(p - i)) / p  for i ∈ [1, p - 1]

correctly finds the first change-point with probability tending to 1 as n → +∞.
- we see the benefit of increasing n
- we see the benefit of adding weights to the TV penalty
Proof sketch
The first change-point î found by the TV approximator maximizes F_i = ‖ĉ_{i,•}‖², where ĉ = X̄ᵀȲ = X̄ᵀX̄β + X̄ᵀW. ĉ is Gaussian, and F_i follows a non-central χ² distribution with

  G_i = E F_i = (i(p - i))/(p w_i²) σ²  +  (β̄²)/(p² w_i² w_u²) × { i²(p - u)²  if i ≤ u
                                                                  { u²(p - i)²  otherwise.

We then just check when G_u = max_i G_i.
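The statistic in this proof sketch can be simulated directly. With the weights above, F_i reduces (up to constants) to the classical multivariate CUSUM score i(p - i)‖m_right(i) - m_left(i)‖². The small experiment below, our setup rather than the paper's, estimates the first change-point that way:

```python
import numpy as np

rng = np.random.default_rng(0)

def first_changepoint(Y):
    """argmax over i of i*(p-i)*||m_right(i) - m_left(i)||^2,
    the weighted statistic F_i written in terms of left/right means."""
    p, n = Y.shape
    i = np.arange(1, p)
    csum = np.cumsum(Y, axis=0)
    m_left = csum[:-1] / i[:, None]
    m_right = (csum[-1] - csum[:-1]) / (p - i)[:, None]
    F = i * (p - i) * np.sum((m_right - m_left) ** 2, axis=1)
    return int(np.argmax(F)) + 1   # split after this many positions

def simulate(p=100, n=50, u=70, sigma=1.0):
    """n profiles of length p sharing one change-point at u, plus noise."""
    jump = rng.standard_normal(n)           # one jump height per profile
    Y = sigma * rng.standard_normal((p, n))
    Y[u:] += jump
    return first_changepoint(Y)
```

Increasing n strengthens the aggregate jump norm relative to the noise, which is exactly the benefit of sharing change-points across profiles claimed by the theorem.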
Consistency for a single change-point remains robust against fluctuations in the exact change-point location.
[Figure 3: Single change-point accuracy for the group fused Lasso. Accuracy as a function of the number of profiles p when the change-point is placed in a variety of positions u = 50 to u = 90 (left and centre plots, resp. unweighted and weighted group fused Lasso), or u = 50 ± 2 to u = 90 ± 2 (right plot, weighted, with varying change-point location), for a signal of length 100.]
Estimation of more change-points?
[Figure 4: Multiple change-point accuracy, for unweighted/weighted LARS and Lasso (U LARS, W LARS, U Lasso, W Lasso). Accuracy as a function of the number of profiles p when change-points are placed at the nine positions {10, 20, ..., 90} and the variance σ² of the centered Gaussian noise is 0.05 (left), 0.2 (center) and 1 (right). The profile length is 100.]
Note that we do not assume that all profiles share exactly the same change-points, but merely see the joint segmentation as an adaptive way to reduce the dimension and remove noise from the data; each profile is then summarized by its average value on each interval of the segmentation.
Outline
1. Motivations
2. Finding multiple change-points in a single profile
3. Finding multiple change-points shared by many signals
4. Learning molecular classifiers with network information
5. Conclusion
Molecular diagnosis / prognosis / theragnosis
Gene networks, gene groups
[Figure: a gene network organized into functional modules, e.g. N-glycan biosynthesis; glycolysis / gluconeogenesis; porphyrin and chlorophyll metabolism; riboflavin metabolism; sulfur metabolism; protein kinases; nitrogen and asparagine metabolism; folate biosynthesis; biosynthesis of steroids and ergosterol metabolism; DNA and RNA polymerase subunits; lysine biosynthesis; phenylalanine, tyrosine and tryptophan biosynthesis; oxidative phosphorylation and TCA cycle; purine metabolism.]
Structured feature selection
Basic biological functions usually involve the coordinated action of several proteins:
- formation of protein complexes
- activation of metabolic, signalling or regulatory pathways
How to perform structured feature selection, such that selected genes:
- belong to only a few groups?
- form a small number of connected components on the graph?
Selecting pre-defined groups of variables
Group lasso (Yuan & Lin, 2006): if groups of covariates are likely to be selected together, the ℓ1/ℓ2-norm induces sparse solutions at the group level:

  Ω_group(w) = Σ_g ‖w_g‖₂

  Ω(w_1, w_2, w_3) = ‖(w_1, w_2)‖₂ + ‖w_3‖₂
Group lasso with overlapping groups
Idea 1: shrink groups to zero (Jenatton et al., 2009)

  Ω_group(w) = Σ_g ‖w_g‖₂

sets groups to 0:
- if one variable is selected, all the groups to which it belongs are selected (IGF selection ⇒ selection of unwanted groups)
- removal of any group containing a gene ⇒ the weight of the gene is 0
Group lasso with overlapping groups
Idea 2: latent group Lasso (Jacob et al., 2009)

  Ω^G_latent(w) = min { Σ_{g∈G} ‖v_g‖₂ : w = Σ_{g∈G} v_g, supp(v_g) ⊆ g }

Properties:
- the resulting support is a union of groups in G
- possible to select one variable without selecting all the groups containing it
- equivalent to the group lasso when there is no overlap
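A tiny worked example for two overlapping groups G = {{1,2},{2,3}} (0-based in the code): any admissible decomposition has the form v_1 = (w_1, t, 0), v_2 = (0, w_2 - t, w_3), so Ω_latent is a one-dimensional minimization over t, compared here with the Idea-1 penalty. Purely illustrative; function names are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def omega_latent(w):
    """Latent group norm for G = {{0,1},{1,2}}: minimize ||v1|| + ||v2||
    over decompositions v1 = (w0, t, 0), v2 = (0, w1 - t, w2)."""
    w0, w1, w2 = w
    f = lambda t: np.hypot(w0, t) + np.hypot(w1 - t, w2)
    return minimize_scalar(f).fun   # f is convex in t, so Brent finds the minimum

def omega_group(w):
    """Idea-1 penalty for the same groups: ||(w0, w1)|| + ||(w1, w2)||."""
    w0, w1, w2 = w
    return np.hypot(w0, w1) + np.hypot(w1, w2)
```

For w = (3, 4, 0), supported on the first group only, Ω_latent = 5 (the overlap coordinate is assigned entirely to the first group's latent vector), while the Idea-1 penalty pays for the overlap twice: 5 + 4 = 9.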
Overlap and group unit balls
[Figure: balls for Ω^G_group (middle) and Ω^G_latent (right) for the groups G = {{1, 2}, {2, 3}}, where w_2 is represented as the vertical coordinate. Left: group lasso (G = {{1, 2}, {3}}), for comparison.]
Theoretical results
Consistency in group support (Jacob et al., 2009)
Let w̄ be the true parameter vector. Assume that there exists a unique decomposition w̄ = Σ_g v̄_g such that Ω^G_latent(w̄) = Σ_g ‖v̄_g‖₂, and consider the regularized empirical risk minimization problem

  min_w L(w) + λ Ω^G_latent(w).

Then, under appropriate mutual incoherence conditions on X, as n → ∞, with very high probability, the optimal solution ŵ admits a unique decomposition (v̂_g)_{g∈G} such that

  {g ∈ G : v̂_g ≠ 0} = {g ∈ G : v̄_g ≠ 0}.
Experiments
Synthetic data: overlapping groups
- 10 groups of 10 variables with 2 variables of overlap between two successive groups: {1,...,10}, {9,...,18}, ..., {73,...,82}
- Support: union of the 4th and 5th groups
- Learn from a small training set
[Figure: frequency of selection of each variable with the lasso (left) and Ω^G_latent (middle); comparison of the RMSE of both methods (right).]
Graph lasso
Two solutions:

  Ω^G_group(β) = Σ_{i~j} √(β_i² + β_j²)

  Ω^G_latent(β) = sup { αᵀβ : α ∈ R^p, ∀ i~j, α_i² + α_j² ≤ 1 }
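The group version is straightforward to write down: one group per edge of the gene network. A minimal sketch (the edge-list representation and names are ours):

```python
import numpy as np

def graph_group_penalty(beta, edges):
    """Omega_group for a graph: sum over edges (i, j) of sqrt(beta_i^2 + beta_j^2).
    Both endpoints of an edge are shrunk together, so the penalty favours
    weight patterns aligned with the graph structure."""
    return sum(np.hypot(beta[i], beta[j]) for i, j in edges)
```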
Preliminary results
Breast cancer data:
- gene expression data for 8,141 genes in 295 breast cancer tumors
- canonical pathways from MSigDB containing 639 groups of genes, 637 of which involve genes from our study
- graph on the genes

  METHOD         ℓ1            Ω^G_latent
  ERROR          0.38 ± 0.04   0.36 ± 0.03
  MEAN ♯ PATH.   3             3

  METHOD          ℓ1            Ω_graph
  ERROR           0.39 ± 0.04   0.36 ± 0.01
  AV. SIZE C.C.   1.3           1.3
Lasso signature
Graph Lasso signature
Outline
1. Motivations
2. Finding multiple change-points in a single profile
3. Finding multiple change-points shared by many signals
4. Learning molecular classifiers with network information
5. Conclusion
Conclusions
- Feature / pattern selection in high dimension is central for many applications
- Convex sparsity-inducing penalties are useful: efficient implementations + consistency results
Acknowledgements: Kevin Bleakley (INRIA), Laurent Jacob (UC Berkeley), Guillaume Obozinski (INRIA)
Post-docs available in Paris!