A note on the group lasso and a sparse group lasso


arXiv:1001.0736v1 [math.ST] 5 Jan 2010

Jerome Friedman, Trevor Hastie and Robert Tibshirani

January 5, 2010

Author affiliations: Jerome Friedman, Dept. of Statistics, Stanford Univ., CA 94305, jhf@stanford.edu; Trevor Hastie, Depts. of Statistics, and Health, Research & Policy, Stanford Univ., CA 94305, hastie@stanford.edu; Robert Tibshirani, Depts. of Health, Research & Policy, and Statistics, Stanford Univ., tibs@stanford.edu.

Abstract

We consider the group lasso penalty for the linear model. We note that the standard algorithm for solving the problem assumes that the model matrices in each group are orthonormal. Here we consider a more general penalty that blends the lasso ($\ell_1$) with the group lasso ("two-norm"). This penalty yields solutions that are sparse at both the group and individual feature levels. We derive an efficient algorithm for the resulting convex problem based on coordinate descent. This algorithm can also be used to solve the general form of the group lasso, with non-orthonormal model matrices.

1 Introduction

In this note we consider the problem of prediction using a linear model. Our data consist of $y$, a vector of $N$ observations, and $X$, an $N\times p$ matrix of features. Suppose that the $p$ predictors are divided into $L$ groups, with $p_l$ the number in group $l$. For ease of notation, we use a matrix $X_l$ to represent the predictors corresponding to the $l$th group, with corresponding coefficient vector $\beta_l$. Assume that $y$ and $X$ have been centered, that is, all variables have mean zero.
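For readers who wish to experiment with the methods below, the following minimal R sketch sets up data in the form assumed here: a centered response $y$, a centered $N\times p$ design $X$, and a list of column indices defining the groups. The dimensions and the block structure are illustrative choices, not taken from this note.

```r
set.seed(1)
N <- 100; p <- 20
X <- matrix(rnorm(N * p), N, p)
y <- rnorm(N)

# Center y and the columns of X so that all variables have mean zero.
y <- y - mean(y)
X <- scale(X, center = TRUE, scale = FALSE)

# L = 4 groups of p_l = 5 predictors each; groups[[l]] holds the columns of X_l.
groups <- split(seq_len(p), rep(1:4, each = 5))
```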

In an elegant paper, Yuan & Lin (2007) proposed the group lasso, which solves the convex optimization problem

$$\min_{\beta\in\mathbb{R}^p}\;\Bigl\|y-\sum_{l=1}^{L}X_l\beta_l\Bigr\|_2^2+\lambda\sum_{l=1}^{L}\sqrt{p_l}\,\|\beta_l\|_2,\qquad(1)$$

where the $\sqrt{p_l}$ terms account for the varying group sizes, and $\|\cdot\|_2$ is the Euclidean norm (not squared). This procedure acts like the lasso at the group level: depending on $\lambda$, an entire group of predictors may drop out of the model. In fact, if the group sizes are all one, it reduces to the lasso. Meier et al. (2008) extend the group lasso to logistic regression.

The group lasso does not, however, yield sparsity within a group. That is, if a group of parameters is non-zero, they will all be non-zero. In this note we propose a more general penalty that yields sparsity at both the group and individual feature levels, in order to select groups and predictors within a group. We also point out that the algorithm proposed by Yuan & Lin (2007) for fitting the group lasso assumes that the model matrices in each group are orthonormal. The algorithm that we provide for our more general criterion also works for the standard group lasso with non-orthonormal model matrices.

We consider the sparse group lasso criterion

$$\min_{\beta\in\mathbb{R}^p}\;\Bigl\|y-\sum_{l=1}^{L}X_l\beta_l\Bigr\|_2^2+\lambda_1\sum_{l=1}^{L}\|\beta_l\|_2+\lambda_2\|\beta\|_1,\qquad(2)$$

where $\beta=(\beta_1,\beta_2,\ldots,\beta_L)$ is the entire parameter vector. For notational simplicity we omit the weights $\sqrt{p_l}$. Expression (2) is the sum of convex functions and is therefore convex. Figure 1 shows the constraint region for the group lasso, lasso and sparse group lasso. A similar penalty involving both group lasso and lasso terms is discussed in Peng et al. (2009). When $\lambda_2=0$, criterion (2) reduces to the group lasso, whose computation we discuss next.
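For concreteness, here is a minimal R sketch that evaluates criterion (2) for a given coefficient vector; the function name sgl_objective and the representation of the groups as a list of column indices (as in the setup sketch above) are illustrative choices, not part of this note.

```r
# Minimal sketch: evaluate the sparse group lasso objective (2).
# 'groups' is assumed to be a list of integer vectors giving the columns of X in each group.
sgl_objective <- function(y, X, beta, groups, lambda1, lambda2) {
  fit <- as.vector(X %*% beta)
  rss <- sum((y - fit)^2)                                             # ||y - sum_l X_l beta_l||_2^2
  group_pen <- sum(sapply(groups, function(g) sqrt(sum(beta[g]^2))))  # sum_l ||beta_l||_2
  rss + lambda1 * group_pen + lambda2 * sum(abs(beta))                # + lambda_2 ||beta||_1
}
```

With the toy data above and $\beta=0$, `sgl_objective(y, X, rep(0, p), groups, 1, 1)` simply returns $\sum_i y_i^2$.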

[Figure 1. Contour lines for the penalty for the group lasso (dotted), lasso (dashed) and sparse group lasso penalty (solid), for a single group with two predictors.]

2 Computation for the group lasso

Here we briefly review the computation for the group lasso of Yuan & Lin (2007). In the process we clarify a confusing issue regarding orthonormality of predictors within a group. The subgradient equations (see e.g. Bertsekas (1999)) for the group lasso are

$$-X_l^T\Bigl(y-\sum_{l}X_l\beta_l\Bigr)+\lambda\,s_l=0;\qquad l=1,2,\ldots,L,\qquad(3)$$

where $s_l=\beta_l/\|\beta_l\|_2$ if $\beta_l\neq 0$, and $s_l$ is a vector with $\|s_l\|_2<1$ otherwise. Let the solutions be $\hat\beta_1,\hat\beta_2,\ldots,\hat\beta_L$. If

$$\Bigl\|X_l^T\Bigl(y-\sum_{k\neq l}X_k\hat\beta_k\Bigr)\Bigr\|_2<\lambda\qquad(4)$$

then $\hat\beta_l$ is zero; otherwise it satisfies

$$\hat\beta_l=\bigl(X_l^TX_l+\lambda I/\|\hat\beta_l\|_2\bigr)^{-1}X_l^Tr_l,\qquad(5)$$

where $r_l=y-\sum_{k\neq l}X_k\hat\beta_k$. Now if we assume that $X_l^TX_l=I$, and let $s_l=X_l^Tr_l$, then (5) simplifies to $\hat\beta_l=(1-\lambda/\|s_l\|_2)\,s_l$. This leads to an algorithm that cycles through the groups, and is a blockwise coordinate descent procedure; it is given in Yuan & Lin (2007).

If however the predictors are not orthonormal, one approach is to orthonormalize them before applying the group lasso. However, this will not generally provide a solution to the original problem. In detail, if $X_l=UDV^T$, then the columns of $U=X_lVD^{-1}$ are orthonormal, and

$$X_l\beta_l=UDV^T\beta_l=U[DV^T\beta_l]=U\beta_l^*.$$

But $\|\beta_l^*\|_2=\|\beta_l\|_2$ only if $D=I$. This will not be true in general; for example, if $X_l$ is a set of dummy variables for a factor, it holds only if the number of observations in each category is equal. Hence an alternative approach is needed.
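This point can be checked numerically. The R sketch below, using a hypothetical random design, verifies that $U=X_lVD^{-1}$ has orthonormal columns while $\|DV^T\beta_l\|_2\neq\|\beta_l\|_2$ in general, so the group penalty applied to the transformed coefficients is not the original penalty.

```r
set.seed(1)
Xl   <- matrix(rnorm(50 * 3), 50, 3)           # hypothetical group of 3 predictors
beta <- c(1, -2, 0.5)

sv <- svd(Xl)                                  # Xl = U D V^T
U  <- Xl %*% sv$v %*% diag(1 / sv$d)           # columns of U are orthonormal
beta_star <- diag(sv$d) %*% t(sv$v) %*% beta   # coefficients in the orthonormal basis

max(abs(crossprod(U) - diag(3)))               # ~ 0: U^T U = I
c(norm_beta = sqrt(sum(beta^2)),               # the two group norms generally differ
  norm_beta_star = sqrt(sum(beta_star^2)))
```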

In the non-orthonormal case, we can think of equation (5) as a ridge regression, with the ridge parameter depending on $\|\hat\beta_l\|_2$. A (complicated) scalar equation for $\|\hat\beta_l\|_2$ can be derived from (5); substituting it into the right-hand side of (5) then gives the solution. However, this is not a good approach numerically, as it can involve dividing by the norm of a vector that is very close to zero, and it is not guaranteed to converge. In the next section we provide a better solution to this problem, and to the sparse group lasso.

3 Computation for the sparse group lasso

The criterion (2) is separable, so block coordinate descent can be used for its optimization. We therefore focus on a single group $l$, and denote the predictors by $X_l=Z=(Z_1,Z_2,\ldots,Z_k)$, the coefficients by $\beta_l=\theta=(\theta_1,\theta_2,\ldots,\theta_k)$, and the residual by $r=y-\sum_{k\neq l}X_k\beta_k$. The subgradient equations are

$$-Z_j^T\Bigl(r-\sum_{m=1}^{k}Z_m\theta_m\Bigr)+\lambda_1 s_j+\lambda_2 t_j=0\qquad(6)$$

for $j=1,2,\ldots,k$, where $s_j=\theta_j/\|\theta\|_2$ if $\theta\neq 0$ and $s$ is a vector satisfying $\|s\|_2\le 1$ otherwise, and $t_j\in\mathrm{sign}(\theta_j)$, that is, $t_j=\mathrm{sign}(\theta_j)$ if $\theta_j\neq 0$ and $t_j\in[-1,1]$ if $\theta_j=0$. Letting $a=X_l^Tr$, a necessary and sufficient condition for $\theta$ to be zero is that the system of equations

$$a_j=\lambda_1 s_j+\lambda_2 t_j\qquad(7)$$

have a solution with $\|s\|_2\le 1$ and $t_j\in[-1,1]$. We can determine this by minimizing

$$J(t)=\frac{1}{\lambda_1^2}\sum_{j=1}^{k}(a_j-\lambda_2 t_j)^2=\sum_{j=1}^{k}s_j^2\qquad(8)$$

with respect to $t_j\in[-1,1]$ and then checking whether $J(\hat t)\le 1$. The minimizer is easily seen to be

$$\hat t_j=\begin{cases} a_j/\lambda_2 & \text{if } |a_j/\lambda_2|\le 1,\\ \mathrm{sign}(a_j/\lambda_2) & \text{if } |a_j/\lambda_2|>1.\end{cases}$$
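A minimal R sketch of this group-level test follows; the function name group_is_zero is an illustrative choice, not from this note. It computes $a=X_l^Tr$, the minimizer $\hat t$ and $J(\hat t)$ exactly as in (7)-(8), and returns TRUE when the whole group can be set to zero.

```r
# Sketch of the group-level test in (7)-(8); assumes lambda1 > 0 and lambda2 > 0.
# Xl: n x k matrix of the group's predictors; r: partial residual y - sum_{k != l} X_k beta_k.
group_is_zero <- function(Xl, r, lambda1, lambda2) {
  a     <- drop(crossprod(Xl, r))            # a_j = Z_j^T r
  t_hat <- pmin(pmax(a / lambda2, -1), 1)    # minimizer of (8), clipped to [-1, 1]
  J     <- sum((a - lambda2 * t_hat)^2) / lambda1^2
  J <= 1                                     # TRUE => set beta_l = 0
}
```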

Now if $J(\hat t)>1$, we must minimize the criterion

$$\frac12\sum_{i=1}^{N}\Bigl(r_i-\sum_{j=1}^{k}Z_{ij}\theta_j\Bigr)^2+\lambda_1\|\theta\|_2+\lambda_2\sum_{j=1}^{k}|\theta_j|.\qquad(9)$$

This is the sum of a convex differentiable function (the first two terms) and a separable penalty, and hence we can use coordinate descent to obtain the global minimum. Here are the details of the coordinate descent procedure. For each $j$ let $r_j=r-\sum_{m\neq j}Z_m\hat\theta_m$. Then $\hat\theta_j=0$ if $|Z_j^Tr_j|<\lambda_2$; this follows easily by examining the subgradient equation corresponding to (9). Otherwise, if $|Z_j^Tr_j|\ge\lambda_2$, we minimize (9) by a one-dimensional search over $\theta_j$. We use the optimize function in R, which is a combination of golden section search and successive parabolic interpolation. This leads to the following algorithm.

Algorithm for the sparse group lasso

1. Start with $\hat\beta=\beta^0$.

2. In group $l$ define $r_l=y-\sum_{k\neq l}X_k\beta_k$, $X_l=(Z_1,Z_2,\ldots,Z_k)$, $\beta_l=(\theta_1,\theta_2,\ldots,\theta_k)$ and $r_j=r_l-\sum_{m\neq j}Z_m\theta_m$. Check whether $J(\hat t)\le 1$ according to (8), and if so set $\hat\beta_l=0$. Otherwise, for $j=1,2,\ldots,k$: if $|Z_j^Tr_j|<\lambda_2$ then $\hat\theta_j=0$; if instead $|Z_j^Tr_j|\ge\lambda_2$ then minimize

$$\frac12\sum_{i=1}^{N}\Bigl(r_{li}-\sum_{j=1}^{k}Z_{ij}\theta_j\Bigr)^2+\lambda_1\|\theta\|_2+\lambda_2\sum_{j=1}^{k}|\theta_j|\qquad(10)$$

over $\theta_j$ by a one-dimensional optimization.

3. Iterate step 2 over the groups $l=1,2,\ldots,L$ until convergence.

If $\lambda_2$ is zero, we instead use condition (4) for the group-level test, and we do not need to check the condition $|Z_j^Tr_j|<\lambda_2$. With these modifications, this algorithm also gives an effective method for solving the group lasso with non-orthonormal model matrices.

Note that in the special case where $X_l^TX_l=I$, with $X_l=(Z_1,Z_2,\ldots,Z_k)$, it is easy to show that

$$\hat\theta_j=\Bigl(\|S(Z^Ty,\lambda_2)\|_2-\lambda_1\Bigr)_+\,\frac{S(Z_j^Ty,\lambda_2)}{\|S(Z^Ty,\lambda_2)\|_2},\qquad(11)$$

where $S(\cdot,\lambda)$ denotes coordinate-wise soft-thresholding, $S(z,\lambda)=\mathrm{sign}(z)(|z|-\lambda)_+$, and this reduces to the algorithm of Yuan & Lin (2007).
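As a concrete illustration of the inner loop in step 2, here is a minimal R sketch of the one-dimensional update for a single $\theta_j$; the helper name update_theta_j and the finite search interval are illustrative choices, not from this note.

```r
# Sketch of the inner update in step 2: coordinate-wise minimization of (10) over theta_j.
# Z: n x k matrix of the group's predictors; r: partial residual for the group;
# theta: current coefficients for the group; 'bound' is an arbitrary half-width for the search interval.
update_theta_j <- function(j, Z, r, theta, lambda1, lambda2, bound = 10) {
  rj <- r - Z[, -j, drop = FALSE] %*% theta[-j]          # residual excluding coordinate j
  if (abs(crossprod(Z[, j], rj)) < lambda2) return(0)    # subgradient check: theta_j = 0
  obj <- function(th) {                                  # criterion (10) as a function of theta_j
    theta[j] <- th
    0.5 * sum((r - Z %*% theta)^2) +
      lambda1 * sqrt(sum(theta^2)) + lambda2 * sum(abs(theta))
  }
  optimize(obj, interval = c(-bound, bound))$minimum     # golden section + parabolic interpolation
}
```

In practice one iterates these updates over $j=1,\ldots,k$ within each group, and over the groups, until the coefficients stop changing.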

4 An example

We generated $n=200$ observations with $p=100$ predictors, in ten blocks of ten. The second fifty predictors all have coefficients of zero. The numbers of non-zero coefficients in the first five blocks of 10 are (10, 8, 6, 4, 2, 1) respectively, with coefficients equal to $\pm 1$, the sign chosen at random. The predictors are standard Gaussian, with correlation 0.2 within a group and zero otherwise. Finally, Gaussian noise with standard deviation 4.0 was added to each observation.

Figure 2 shows the signs of the estimated coefficients from the lasso, group lasso and sparse group lasso, using a well-chosen tuning parameter for each method (we set $\lambda_1=\lambda_2$ for the sparse group lasso). The corresponding misclassification rates for the groups and individual features are shown in Figure 3. We see that the sparse group lasso strikes an effective compromise between the lasso and the group lasso, yielding sparseness at both the group and individual predictor levels.

References

Bertsekas, D. (1999), Nonlinear Programming, Athena Scientific.

Meier, L., van de Geer, S. & Bühlmann, P. (2008), The group lasso for logistic regression, Journal of the Royal Statistical Society, Series B 70, 53-71.

Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R. & Wang, P. (2009), Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer, Annals of Applied Statistics (to appear).

Yuan, M. & Lin, Y. (2007), Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society, Series B 68(1), 49-67.

[Figure 2. Three panels (Group lasso, Sparse Group lasso, Lasso), each plotting Coefficient against Predictor index. Caption: Results for the simulated example. True coefficients are indicated by the open triangles, while the filled green circles indicate the sign of the estimated coefficients from each method.]

[Figure 3. Two panels plotting, against the regularization parameter, the number of groups misclassified (top) and the number of features misclassified (bottom), for the Lasso, Group lasso and Sparse Group lasso. Caption: Results for the simulated example. The top panel shows the number of groups that are misclassified as the regularization parameter is varied. A misclassified group is one with at least one nonzero coefficient whose estimated coefficients are all set to zero, or vice versa. The bottom panel shows the number of individual coefficients that are misclassified, that is, estimated to be zero when the true coefficient is nonzero, or vice versa.]