Statistics for high-dimensional data: Group Lasso and additive models


1 Statistics for high-dimensional data: Group Lasso and additive models
Peter Bühlmann and Sara van de Geer, Seminar für Statistik, ETH Zürich, May 2012

2 The Group Lasso (Yuan & Lin, 2006)
the high-dimensional parameter vector is structured into $q$ groups or partitions (known a priori): $G_1, \ldots, G_q \subseteq \{1, \ldots, p\}$, disjoint, with $\bigcup_g G_g = \{1, \ldots, p\}$
corresponding coefficients: $\beta_G = \{\beta_j;\ j \in G\}$

3 Example: categorical covariates
$X^{(1)}, \ldots, X^{(p)}$ are factors (categorical variables), each with 4 levels (e.g. letters from DNA)
encoding a main effect: 3 parameters
encoding a first-order interaction: 9 parameters
and so on...
the parameterization (e.g. with sum contrasts) is structured as follows:
intercept: no penalty
main effect of $X^{(1)}$: group $G_1$ with df = 3
main effect of $X^{(2)}$: group $G_2$ with df = 3
...
first-order interaction of $X^{(1)}$ and $X^{(2)}$: group $G_{p+1}$ with df = 9
...
often, we want sparsity on the group level: either all parameters of an effect are zero or they enter the model as a whole

4 often, we want sparsity on the group level: either all parameters of an effect are zero or they enter the model as a whole
this can be achieved with the Group-Lasso penalty
$$\lambda \sum_{g=1}^{q} m_g \|\beta_{G_g}\|_2$$
typically with weights $m_g = \sqrt{|G_g|}$
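
As a concrete illustration, a minimal numpy sketch of evaluating this penalty (the group partition and coefficient values below are hypothetical):

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Group-Lasso penalty: lam * sum_g m_g * ||beta_{G_g}||_2,
    with the usual weights m_g = sqrt(|G_g|)."""
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)

# hypothetical example: p = 7 coefficients in q = 3 disjoint groups
beta = np.array([0.0, 0.0, 0.0, 1.2, -0.5, 0.3, 0.0])
groups = [[0, 1, 2], [3, 4, 5], [6]]  # union = {0, ..., 6}
print(group_lasso_penalty(beta, groups, lam=0.1))  # first group contributes 0
```

Note the norm is not squared: that is what makes whole groups drop out of the model exactly.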

5 properties of the Group-Lasso penalty
for group sizes $|G_g| \equiv 1$: standard Lasso penalty
convex penalty: convex optimization for standard likelihoods (exponential family models)
either $(\hat\beta_G(\lambda))_j = 0$ for all $j \in G$ or $\neq 0$ for all $j \in G$
the penalty is invariant under orthonormal transformations, e.g. invariant when requiring an orthonormal parameterization for factors

6 DNA splice site detection: (mainly) a prediction problem
DNA sequence ... ACGGC [E E E | GC | I I I I] AAC ..., where the potential donor site consists of 3 exon positions, the conserved GC, and 4 intron positions
response $Y \in \{0, 1\}$: splice or non-splice site
predictor variables: 7 factors, each having 4 levels (full dimension: $4^7 = 16384$)
data: a training set and a test set of true splice sites and non-splice sites, plus an unbalanced validation set

7 logistic regression:
$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \text{main effects} + \text{first-order interactions} + \ldots$$
up to second-order interactions: 1156 parameters
use the Group Lasso, which selects whole terms

8 [Figure: $\ell_2$-norm of each fitted term (main effects and first- and second-order interactions between DNA positions) for the estimators GL, GL/R and GL/MLE]
mainly neighboring DNA positions show interactions (this has been known and debated)
no interactions between exon and intron positions (with the Group Lasso method)
no second-order interactions (with the Group Lasso method)

9 predictive power: competitive with the state-of-the-art maximum entropy modeling of Yeo and Burge (2004)
[Table: correlation between true and predicted class for the Logistic Group Lasso and for maximum entropy (Yeo and Burge)]
our model (not necessarily the method/algorithm) is simple and has a clear interpretation

10 Generalized group Lasso penalty
$$\lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}},$$
where the $A_j$ are $T_j \times T_j$ positive definite matrices
generalized group Lasso:
$$\hat\beta = \mathrm{argmin}_\beta \Big( \|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}} \Big)$$

11 reparameterize: $\tilde\beta_{G_j} = A_j^{1/2} \beta_{G_j}$, $\tilde X_{G_j} = X_{G_j} A_j^{-1/2}$
then
$$\hat\beta = \mathrm{argmin}_\beta \Big( \|Y - X\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T A_j \beta_{G_j}} \Big)$$
can be derived from $\hat\beta_{G_j} = A_j^{-1/2} \hat{\tilde\beta}_{G_j}$, where
$$\hat{\tilde\beta} = \mathrm{argmin}_{\tilde\beta} \Big( \|Y - \tilde X \tilde\beta\|_2^2/n + \lambda \sum_{j=1}^{q} m_j \|\tilde\beta_{G_j}\|_2 \Big)$$
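
A sketch of this reduction for a single block, assuming a symmetric positive definite $A_j$; any factorization $A_j = C^T C$ (e.g. Cholesky) would work equally well in place of the matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def to_standard_group_lasso(X_g, A_g):
    """Map one block of a generalized group-Lasso problem to standard form:
    beta~ = A^{1/2} beta and X~ = X A^{-1/2}, so that
    sqrt(beta' A beta) = ||beta~||_2 and X beta = X~ beta~."""
    A_half = np.real(sqrtm(A_g))        # A^{1/2} (real for s.p.d. A)
    A_half_inv = np.linalg.inv(A_half)  # A^{-1/2}
    X_tilde = X_g @ A_half_inv
    # after solving the standard group Lasso in beta~, map back:
    back_transform = lambda beta_tilde: A_half_inv @ beta_tilde
    return X_tilde, back_transform
```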

12 Groupwise prediction penalty and parameterization invariance
$$\lambda \sum_{j=1}^{q} m_j \|X_{G_j} \beta_{G_j}\|_2 = \lambda \sum_{j=1}^{q} m_j \sqrt{\beta_{G_j}^T X_{G_j}^T X_{G_j} \beta_{G_j}}$$
is a generalized group Lasso penalty if the $X_{G_j}^T X_{G_j}$ are positive definite (i.e. necessarily $|G_j| \le n$)
this penalty is invariant under any (invertible) transformation within groups, i.e. we can use $\tilde\beta_{G_j} = B_j \beta_{G_j}$, where $B_j$ is any $T_j \times T_j$ invertible matrix:
$$X_{G_j} \hat\beta_{G_j} = \tilde X_{G_j} \hat{\tilde\beta}_{G_j}, \quad \{j;\ \hat\beta_{G_j} \neq 0\} = \{j;\ \hat{\tilde\beta}_{G_j} \neq 0\}$$

13 Some aspects from theory
again: optimal prediction and estimation (oracle inequality)
group screening: $\hat S \supseteq S_0$ with high probability, where $S_0$ is the set of active groups
but listen to Sara in a few minutes
interesting case: the $G_j$'s are large and the $\beta_{G_j}$'s are smooth

14 example: high-dimensional additive model
$$Y = \sum_{j=1}^{p} f_j(X^{(j)}) + \epsilon,$$
with smooth functions $f_j(\cdot)$, and expand
$$f_j(x^{(j)}) = \sum_{k=1}^{n} \underbrace{\beta_k^{(j)}}_{(\beta_{G_j})_k} \underbrace{B_k(x^{(j)})}_{\text{basis fct.s}},$$
so that smoothness of $f_j(\cdot)$ corresponds to smoothness of $\beta_{G_j}$

15 Computation and KKT
criterion function
$$Q_\lambda(\beta) = \underbrace{n^{-1} \sum_{i=1}^{n} \rho_\beta(x_i, Y_i)}_{\text{loss fct.}} + \lambda \sum_{g=1}^{q} m_g \|\beta_{G_g}\|_2,$$
with loss function $\rho_\beta(\cdot, \cdot)$ convex in $\beta$
KKT conditions:
$$\nabla\rho(\hat\beta)_{G_g} + \lambda m_g \frac{\hat\beta_{G_g}}{\|\hat\beta_{G_g}\|_2} = 0 \quad \text{if } \hat\beta_{G_g} \neq 0 \text{ (not the 0-vector)},$$
$$\|\nabla\rho(\hat\beta)_{G_g}\|_2 \le \lambda m_g \quad \text{if } \hat\beta_{G_g} \equiv 0$$
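
For the squared error loss $\rho(\beta) = \|Y - X\beta\|_2^2/n$ the gradient block is $\nabla\rho(\hat\beta)_{G_g} = -2 X_{G_g}^T (Y - X\hat\beta)/n$, so the conditions are easy to verify numerically; a minimal sketch (function and variable names are hypothetical):

```python
import numpy as np

def check_kkt(X, Y, beta_hat, groups, lam, tol=1e-6):
    """Check the group-Lasso KKT conditions for the squared error loss
    rho(beta) = ||Y - X beta||_2^2 / n."""
    n = len(Y)
    grad = -2.0 * X.T @ (Y - X @ beta_hat) / n     # gradient of the loss
    ok = True
    for g in groups:
        m_g = np.sqrt(len(g))
        b_g, grad_g = beta_hat[g], grad[g]
        if np.linalg.norm(b_g) > tol:              # active group: equality
            resid = grad_g + lam * m_g * b_g / np.linalg.norm(b_g)
            ok = ok and np.linalg.norm(resid) < tol
        else:                                      # zero group: inequality
            ok = ok and np.linalg.norm(grad_g) <= lam * m_g + tol
    return ok
```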

16–25 Block coordinate descent algorithm
generic description for both Lasso and Group-Lasso problems:
cycle through the coordinates $j = 1, \ldots, p, 1, 2, \ldots$ (Lasso) or the groups $j = 1, \ldots, q, 1, 2, \ldots$ (Group Lasso)
optimize the penalized log-likelihood w.r.t. $\beta_j$ (or $\beta_{G_j}$), keeping all other coefficients $\beta_k$, $k \neq j$ (or $k \notin G_j$), fixed
Lasso: the iterate passes through the states (the coordinate written without a superscript is the one currently being optimized)
$$(\beta_1, \beta_2^{(0)}, \ldots, \beta_p^{(0)}) \to (\beta_1^{(1)}, \beta_2, \ldots, \beta_p^{(0)}) \to \cdots \to (\beta_1^{(1)}, \ldots, \beta_{p-1}^{(1)}, \beta_p) \to (\beta_1, \beta_2^{(1)}, \ldots, \beta_p^{(1)}) \to \cdots$$
Group Lasso: analogously, blockwise:
$$(\beta_{G_1}, \beta_{G_2}^{(0)}, \ldots, \beta_{G_q}^{(0)}) \to (\beta_{G_1}^{(1)}, \beta_{G_2}, \ldots, \beta_{G_q}^{(0)}) \to \cdots \to (\beta_{G_1}, \beta_{G_2}^{(1)}, \ldots, \beta_{G_q}^{(1)}) \to \cdots$$

26 for the Gaussian log-likelihood (squared error loss): blockwise updates are easy, and closed-form solutions exist (use the KKT conditions)
for other loss functions (e.g. logistic loss): blockwise updates have no closed-form solution
a fast strategy: improve every coordinate/group numerically, but not until numerical convergence (using a quadratic approximation of the log-likelihood for improving/optimizing a single block), plus further tricks... (still allowing provable numerical convergence)
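
In the Gaussian case the closed form is a groupwise soft-thresholding: if each block is orthonormalized so that $n^{-1} X_{G_g}^T X_{G_g} = I$, the KKT conditions give $\hat\beta_{G_g} = (1 - \lambda m_g/(2\|Z_g\|_2))_+ Z_g$ with $Z_g = X_{G_g}^T r/n$ for the partial residual $r$. A minimal sketch under that orthonormality assumption:

```python
import numpy as np

def block_coordinate_descent(X, Y, groups, lam, n_cycles=100):
    """Block coordinate descent for
        ||Y - X beta||_2^2 / n + lam * sum_g m_g ||beta_{G_g}||_2,
    assuming each block satisfies X_g' X_g / n = I (orthonormalized)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_cycles):
        for g in groups:
            m_g = np.sqrt(len(g))
            r = Y - X @ beta + X[:, g] @ beta[g]   # partial residual
            Z_g = X[:, g].T @ r / n
            nz = np.linalg.norm(Z_g)
            if nz <= lam * m_g / 2.0:              # whole group set to zero
                beta[g] = 0.0
            else:                                  # groupwise soft-threshold
                beta[g] = (1.0 - lam * m_g / (2.0 * nz)) * Z_g
    return beta
```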

27 How fast?
logistic case: $p = 10^6$, $n = 100$, group size = 20, sparsity: 2 active groups, i.e. 40 parameters
for 10 different $\lambda$-values: CPU time using grplasso: about 3.5 minutes (dual-core processor with 2.6 GHz and 32 GB RAM)
we can easily deal today with predictors in the millions, i.e. $p \approx 10^6$


29 The sparsity-smoothness penalty (SSP)
(whose corresponding optimization becomes again a Group-Lasso problem...)
for additive modeling in high dimensions:
$$Y_i = \sum_{j=1}^{p} f_j(x_i^{(j)}) + \varepsilon_i \quad (i = 1, \ldots, n),$$
with $f_j: \mathbb{R} \to \mathbb{R}$ smooth univariate functions and $p \gg n$

30 in principle: basis expansion for every $f_j(\cdot)$ with basis functions $h_{j,1}, \ldots, h_{j,K}$, where $K = O(n)$ (or e.g. $K = O(n^{1/2})$), $j = 1, \ldots, p$
represent
$$\sum_{j=1}^{p} f_j(x^{(j)}) = \sum_{j=1}^{p} \sum_{k=1}^{K} \beta_{j,k} h_{j,k}(x^{(j)})$$
a high-dimensional parametric problem; use the Group-Lasso penalty to ensure sparsity of whole functions:
$$\lambda \sum_{j=1}^{p} \|\beta_j\|_2, \quad \beta_j := (\beta_{j,1}, \ldots, \beta_{j,K})^T$$
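
A sketch of this construction, using a simple truncated power basis as a stand-in for whatever spline basis one prefers (helper name and basis choice are illustrative, not from the slides):

```python
import numpy as np

def additive_design(X, K):
    """Expand each column of X (n x p) into K basis functions, returning the
    n x (p*K) design H = [H_1, ..., H_p] and the group indices of each f_j."""
    n, p = X.shape
    blocks, groups = [], []
    for j in range(p):
        x = X[:, j]
        knots = np.quantile(x, np.linspace(0, 1, K + 1)[1:-1])  # K-1 knots
        # degree-1 truncated power basis: x plus hinges (x - t)_+ -> K columns
        H_j = np.column_stack([x] + [np.maximum(x - t, 0.0) for t in knots])
        blocks.append(H_j)
        groups.append(list(range(j * K, (j + 1) * K)))
    return np.hstack(blocks), groups
```

Running the group Lasso on this design with the groups above then zeroes out entire functions $\hat f_j$.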

31 drawback: this does not exploit smoothness (except through the choice of $K$, which is bad if different $f_j$'s have different complexity)
when using a large number of basis functions (large $K$) to achieve a high degree of flexibility, we need additional control of smoothness

32 Sparsity-Smoothness Penalty (SSP)
$$\lambda_1 \sum_{j=1}^{p} \underbrace{\|f_j\|_n}_{\|H_j \beta_j\|_2/\sqrt{n}} + \lambda_2 \sum_{j=1}^{p} I(f_j), \quad I^2(f_j) = \int (f_j''(x))^2\, dx = \beta_j^T W_j \beta_j,$$
where $f_j = (f_j(X_1^{(j)}), \ldots, f_j(X_n^{(j)}))^T$ and $(W_j)_{k,l} = \int h_{j,k}''(x)\, h_{j,l}''(x)\, dx$
the SSP penalty does variable selection ($\hat f_j \equiv 0$ for some $j$)

33 Orthogonal basis and diagonal smoothing matrices
if $n^{-1} H_j^T H_j = I$ and $W_j = \mathrm{diag}(d_1^2, \ldots, d_K^2) := D^2$ with $d_k = k^m$ $(m > 1/2)$, then the penalty becomes
$$\lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{p} \|D\beta_j\|_2$$
and
$$\hat\beta(\lambda_1, \lambda_2) = \mathrm{argmin}_\beta\, \|Y - \sum_{j=1}^{p} H_j \beta_j\|_2^2/n + \lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{p} \|D\beta_j\|_2$$
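
Under these assumptions the penalty is cheap to evaluate; a small sketch with the diagonal $d_k = k^m$ (the default $m$ below is an arbitrary illustrative choice):

```python
import numpy as np

def ssp_penalty_diag(betas, lam1, lam2, m=2.0):
    """SSP with orthogonal basis and diagonal smoothing:
    lam1 * sum_j ||beta_j||_2 + lam2 * sum_j ||D beta_j||_2,
    where D = diag(1^m, 2^m, ..., K^m), m > 1/2."""
    K = len(betas[0])
    d = np.arange(1, K + 1) ** m      # diagonal of D
    return sum(lam1 * np.linalg.norm(b) + lam2 * np.linalg.norm(d * b)
               for b in betas)
```

Note this is a sum of two norms per function, so it is not yet a plain Group-Lasso penalty; hence the computational difficulty mentioned next.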

34 the difficulty is the computation, although it is still a convex optimization problem; see the corresponding section in the book

35 A modified SSP penalty
$$\lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)}$$
for additive modeling:
$$\hat f_1, \ldots, \hat f_p = \mathrm{argmin}_{f_1, \ldots, f_p}\, \|Y - \sum_{j=1}^{p} f_j\|_2^2/n + \lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)}$$
assuming the $f_j$ are twice differentiable, the solution is a natural cubic spline with knots at $X_i^{(j)}$
finite-dimensional parameterization with e.g. B-splines: $f = \sum_{j=1}^{p} f_j$, $f_j = H_j \beta_j$

36 the penalty becomes:
$$\lambda_1 \sum_{j=1}^{p} \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)} = \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T \underbrace{B_j^T B_j}_{\Sigma_j} \beta_j + \lambda_2\, \beta_j^T \underbrace{W_j}_{\text{integr. 2nd derivatives}} \beta_j} = \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T \underbrace{(\Sigma_j + \lambda_2 W_j)}_{A_j = A_j(\lambda_2)} \beta_j}$$
re-parameterize: $\tilde\beta_j = \tilde\beta_j(\lambda_2) = R_j \beta_j$ with $R_j^T R_j = A_j = A_j(\lambda_2)$ (Cholesky)
the penalty becomes
$$\lambda_1 \sum_{j=1}^{p} \|\tilde\beta_j\|_2$$
with $\tilde\beta_j$ depending on $\lambda_2$, i.e., a Group-Lasso penalty
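
A sketch of this Cholesky reparameterization for one function, assuming $\Sigma_j$ and $W_j$ have already been computed from the B-spline basis:

```python
import numpy as np

def reparameterize(Sigma_j, W_j, H_j, lam2):
    """Turn sqrt(beta' (Sigma_j + lam2 * W_j) beta) into ||beta~||_2 via
    the Cholesky factorization R_j' R_j = A_j(lam2), beta~ = R_j beta."""
    A_j = Sigma_j + lam2 * W_j
    R_j = np.linalg.cholesky(A_j).T        # upper triangular, R' R = A
    R_inv = np.linalg.inv(R_j)
    H_tilde = H_j @ R_inv                  # transformed design for beta~
    # fit a plain group Lasso in beta~, then recover beta_j = R_inv @ beta~_j
    return H_tilde, R_inv
```

This makes the two-tuning-parameter problem solvable with standard Group-Lasso software, at the cost of redoing the factorization for each value of $\lambda_2$.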

37 HIF1α motif additive regression
for finding HIF1α transcription factor binding sites on DNA sequences; $n = 287$, $p = 196$
the additive model with SSP has 20% better prediction performance than the linear model with the Lasso
bootstrap stability analysis: select the variables (functions) which occur in at least 50% of all bootstrap runs
only 2 stable variables/candidate motifs remain
[Figure: partial effect plots for the two stable candidate motifs]

38 [Figure: partial effect plots for the two stable candidate motifs]
right panel: the variable corresponds to a true, known motif
the variable/motif in the left panel: good additional support for relevance (nearness to the transcriptional start site of important genes, ...)
