Proximal methods for the latent group lasso penalty


arXiv:1209.0368v1 [math.OC] 3 Sep 2012

The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

Citation: Villa, Silvia, Lorenzo Rosasco, Sofia Mosci, and Alessandro Verri. "Proximal Methods for the Latent Group Lasso Penalty." Computational Optimization and Applications 58, no. 2 (December 21, 2013).
As published: http://dx.doi.org/
Publisher: Springer US
Version: Author's final manuscript
Accessed: Mon Jan 07 01:50:09 EST 2019
Citable link: http://hdl.handle.net/1721.1/
Terms of use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

Proximal methods for the latent group lasso penalty

Silvia Villa · Lorenzo Rosasco · Sofia Mosci · Alessandro Verri

Received: date / Accepted: date

Abstract We consider a regularized least squares problem, with regularization by structured sparsity-inducing norms, which extend the usual $\ell_1$ and the group lasso penalty by allowing the subsets to overlap. Such regularizations lead to nonsmooth problems that are difficult to optimize, and in this paper we propose a suitable version of an accelerated proximal method to solve them. We prove convergence of a nested procedure, obtained by composing an accelerated proximal method with an inner algorithm for computing the proximity operator. By exploiting the geometrical properties of the penalty, we devise a new active set strategy, thanks to which the inner iteration is relatively fast, thus guaranteeing good computational performance of the overall algorithm. Our approach allows us to deal with high dimensional problems without pre-processing for dimensionality reduction, leading to better computational and prediction performance with respect to state-of-the-art methods, as shown empirically both on toy and real data.

Keywords Structured sparsity · Proximal methods · Regularization

S. Villa, Istituto Italiano di Tecnologia, Italy. E-mail: Silvia.Villa@iit.it
L. Rosasco, CBCL, McGovern Institute, Massachusetts Institute of Technology, USA, and Istituto Italiano di Tecnologia, Italy. E-mail: lrosasco@mit.edu
S. Mosci, DIBRIS, Università di Genova, Italy. E-mail: Sofia.Mosci@unige.it
A. Verri, DIBRIS, Università di Genova, Italy. E-mail: Alessandro.Verri@unige.it

1 Introduction

Sparsity has become a popular way to deal with a number of problems arising in signal and image processing, statistics and machine learning [19]. In a broad sense, it refers to the possibility of writing the solution in terms of a few building blocks. Often, sparsity based methods are the key towards finding interpretable models in real-world problems. For example, sparse regularization with $\ell_1$-type penalties is a powerful approach to find sparse solutions by minimizing a convex functional [48,12,18]. The success of $\ell_1$ regularization motivated exploring different kinds of sparsity properties for regularized optimization problems, exploiting available a priori information which restricts the admissible sparsity patterns of the solution. An example of a sparsity pattern is when the variables are partitioned into groups (known a priori), and the goal is to estimate a sparse model where variables belonging to the same group are either jointly selected or discarded. This problem can be solved by regularizing with the group $\ell_1$ penalty, also known as the group lasso penalty [52]. The latter is the sum, over the groups, of the Euclidean norms of the coefficients restricted to each group. Note that, for any $p>1$, the same group-wise selection can be achieved by regularizing with the $\ell_1/\ell_p$ norm, i.e. the sum over the groups of the $\ell_p$ norm of the coefficients restricted to each group.

A possible generalization of the group lasso penalty is obtained by considering groups of variables which can potentially overlap [53,24], the goal being to estimate a model whose support is a union of groups. For example, this is a common situation in bioinformatics (especially in the context of high-throughput data such as gene expression and mass spectrometry data), where problems are characterized by a very low number of samples with several thousands of variables. In fact, when the number of samples is not sufficient to guarantee accurate model estimation, a possible solution is to take advantage of the huge amount of prior knowledge encoded in online databases such as the Gene Ontology [15]. Largely motivated by applications in bioinformatics, the latent group lasso with overlap penalty was proposed in [22], and further studied in [36,2] and, in the image processing context, in [38]; it generalizes the $\ell_1/\ell_2$ penalty to overlapping groups, thus satisfying the assumption that the admissible sparsity patterns must be unions of a subset of the groups.

All the methods proposed in the literature solve the minimization problem arising in [22] by applying state-of-the-art techniques for group lasso in an expanded space, called the space of latent variables, built by duplicating variables that belong to more than one group. The most popular optimization strategies that have been proposed are interior-point methods [3,37], block coordinate descent [28], proximal methods [43,31,38,26,13] and the related alternating direction method [16]. Very recently, the paper [40] proposed an accelerated alternating direction method, and [41] studied a block coordinate descent, along with a proximal method with variable step-sizes. As already noted in [22], though very natural, every implementation developed in the latent variables does not scale to large datasets: when the groups have significant overlap, a more scalable algorithm with no data duplication

is needed. For this reason we propose an alternative optimization approach to solve the group lasso problem with overlap, and extend it to the entire family of group lasso with overlap penalties, which generalize the $\ell_1/\ell_p$ penalties to overlapping groups for $p>1$. Our method is a two-loop iterative scheme based on proximal methods (see for example [33,7,6]), and more precisely on the accelerated version named FISTA [6]. It does not require explicit replication of the variables and is thus more appropriate for dealing with high dimensional problems with large group overlap. In fact, the proximity operator can be efficiently computed by exploiting the geometrical properties of the penalty. We show that such an operator can be written as the identity minus the projection onto a suitable convex set, which is the intersection of as many convex sets as the number of active groups, that is, groups corresponding to active constraints, which can be easily found. Indeed, the identification of the active groups is a key step, since it allows computing the projection in a reduced space. For general $p$, the projection can be computed via the Cyclic Projections algorithm [4]. Furthermore, for the case $p=2$, we present an accelerated scheme, where the reduced projection is computed by solving a corresponding dual problem via the projected Newton method [8], thus working in a much lower dimensional space.

The present paper completes and extends the preliminary results presented in the short conference version [32]. In particular, it contains a general mathematical presentation and all the proofs, which were omitted in [32]. We next describe how the rest of the paper is organized, and then highlight the main novelties with respect to the short version. In Section 2, we cast the problem of Group-wise Selection with Overlap (GSO) as a regularization problem based on a modified $\ell_1/\ell_p$-type penalty and compare it with other structured sparsity penalties. We extend the approach in [32] for $p=2$ to general $p>1$. In Section 3, we describe the derivation of the proposed optimization scheme, and prove its convergence. Precisely, we first recall proximal methods in Subsection 3.1; then, in Subsection 3.2, we describe the technical results that ease the computation of the proximity operator as a simplified projection, and present different projection algorithms depending on $p$. With respect to [32], we show that our active set strategy can be profitably used in this generalized framework in combination with any algorithm chosen to compute the inner projection. Furthermore, to solve the projection for general $p\in(1,+\infty]$, we discuss the use of a cyclic projections algorithm, whose convergence in norm is guaranteed and results in a rate of convergence for the proposed proximal method, proved in Subsection 3.3. Section 4 is a substantial extension of the experiments performed in [32]. We empirically analyze the computational performance of our optimization procedure. We first study the performance of the different variations of the proposed optimization scheme. Then we present a set of numerical experiments comparing the running time of our algorithm with state-of-the-art techniques. We conclude with a real data experiment where we show that the improved computational performance allows dealing with large data sets without preprocessing, thus improving also the prediction and selection performance. Finally, in Appendix B we review the projected Newton

method [8].

Notation. Given a vector $x \in \mathbb{R}^d$, we denote by $\|x\|_p$ the $\ell_p$-norm of $x$, defined as $\|x\|_p = \big(\sum_{j=1}^d |x_j|^p\big)^{1/p}$, and $\|x\|_\infty = \max_{j\in\{1,\dots,d\}} |x_j|$. We also use the notation $\|x\|_{G,p} = \big(\sum_{j\in G} |x_j|^p\big)^{1/p}$ for $p\ge 1$, and $\|x\|_{G,\infty} = \max_{j\in G}|x_j|$, to denote the $\ell_p$-norm of the components of $x$ in $G\subseteq\{1,\dots,d\}$. When the subscript $p$ is omitted, the $\ell_2$ norm is used, $\|\cdot\| = \|\cdot\|_2$. The conjugate exponent of $p$ is denoted by $q$; we recall that $q$ is such that $1/p + 1/q = 1$. In the following, $X$ will denote $\mathbb{R}^d$ and $Y$ a bounded interval in $\mathbb{R}$.

2 Group-wise selection with Overlap (GSO)

This paper proposes an optimization algorithm for a regularized least-squares problem of the type

$$\min_{x\in\mathbb{R}^d} E^p_\tau(x), \qquad E^p_\tau(x) = \frac{1}{n}\|\Psi x - y\|^2 + 2\tau\,\Omega^G_p(x), \qquad (\text{GSO-}p)$$

where $\Psi:\mathbb{R}^d\to\mathbb{R}^n$ is a linear operator, $y\in\mathbb{R}^n$, and $\Omega^G_p:\mathbb{R}^d\to[0,+\infty)$ is a convex and lower semicontinuous penalty, depending on a parameter $p\in(1,+\infty]$ and on an a priori given group structure $G = \{G_r\}_{r=1}^B$, with $G_r\subseteq\{1,\dots,d\}$ and $\bigcup_{r=1}^B G_r = \{1,\dots,d\}$. Note that other data fit terms could be used, different from the quadratic one, as long as they are convex and continuously differentiable with Lipschitz continuous gradient. We focus on least squares to simplify the exposition.

Most group sparsity penalties can be built starting from the family of canonical linear projections onto the subspaces identified by the indices belonging to $G_r$, i.e. $P_r:\mathbb{R}^d\to\mathbb{R}^{G_r}$. The definition of the penalties we consider is based on the adjoint of the linear operator $P:\mathbb{R}^d\to\prod_{r=1}^B\mathbb{R}^{G_r}$, $Px = (P_1x,\dots,P_Bx)$, that is the operator $P^*:\prod_{r=1}^B\mathbb{R}^{G_r}\to\mathbb{R}^d$, $P^*(v_1,\dots,v_B) = \sum_{r=1}^B P^*_r v_r$, where $P^*_r:\mathbb{R}^{G_r}\to\mathbb{R}^d$ is the canonical injection. For $x\in\mathbb{R}^d$ we set

$$\Omega^G_p(x) = \min_{\substack{v\in\prod_{r=1}^B\mathbb{R}^{G_r}\\ P^*v = x}} \sum_{r=1}^B \|v_r\|_p. \qquad (1)$$

For $p=2$, the functional $\Omega^G_2$ was introduced in [22] (see also [36,2]). The distinctive feature of the family of penalties $\Omega^G_p$ is that they induce group-wise selection, that is, they lead to solutions whose support (i.e. the set of non-zero entries) is the union of a subset of the groups defined a priori.
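As a running convention for the illustrative sketches below (our own additions, not the authors' code; all names are hypothetical), the group norms and conjugate exponents of the Notation paragraph can be fixed in a few lines of Python:

import numpy as np

def conjugate_exponent(p: float) -> float:
    """Return q with 1/p + 1/q = 1 (q = 1 for p = infinity)."""
    if np.isinf(p):
        return 1.0
    return p / (p - 1.0) if p > 1.0 else np.inf

def group_norm(x: np.ndarray, G: list, p: float) -> float:
    """||x||_{G,p}: the l_p norm of the components of x indexed by G."""
    xg = x[G]
    return np.max(np.abs(xg)) if np.isinf(p) else np.linalg.norm(xg, ord=p)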

In fact, $\Omega^G_p$ can be seen as a generalization of the mixed $\ell_1/\ell_p$ norms, originally introduced for disjoint groups:

$$R^G_p(x) = \sum_{r=1}^B \|x\|_{G_r,p}, \qquad p\ge 1.$$

For $p=2$, $R^G_p$ is the group lasso penalty, and it is well known [52] that such penalties lead to solutions whose support is the union of a small number of groups. The penalty $R^G_p$ can be written also when the groups overlap, and more generally the composite absolute penalties (CAP)

$$J^G_{\gamma,p}(x) = \sum_{r=1}^B \big(\|x\|_{G_r,p}\big)^\gamma,$$

first introduced in [53] and coinciding with $R^G_p$ for $\gamma=1$, have been intensively studied. The $J_{\gamma,p}$ penalties allow dealing with complex group structures involving hierarchies or graphs, and it is proved in [24] that the CAP penalties constrain the support to be the complement of a union of groups. $\Omega^G_p$ and $R^G_p$ are thus somehow complementary and have different domains of application [24,27,25]. While many algorithms have been proposed to solve the optimization problem corresponding to $R^G_p$, the one corresponding to $\Omega^G_p$ is much less studied. This is due on the one hand to the fact that the penalty is more complex, and on the other hand to the widespread use of the replication strategy. The latter is based on the observation that, using the definition of $\Omega^G_p$ and the surjectivity of $P^*$, the (GSO-$p$) minimization problem can be written as

$$\min_{v\in\prod_{r=1}^B\mathbb{R}^{G_r}} \frac{1}{n}\|\Psi P^* v - y\|^2 + 2\tau\sum_{r=1}^B\|v_r\|_p, \qquad (2)$$

which is a group lasso problem without overlap for the linear operator $\Psi P^*$ in the so-called latent variables $(v_r)_{r=1}^B$, obtained by replicating variables belonging to more than one group. This rewriting allows applying any algorithm developed for the standard group lasso to the overlapping case, but the strategy is not feasible for high dimensional problems with large group overlaps, as potentially many artificial dimensions are created. The main goal of this paper is to propose and study an optimization algorithm which does not require the replication of variables belonging to more than one group.

The choice $p>1$ has both technical and practical motivations. On the one hand, it guarantees convexity of the penalty, which can be shown to be a norm (see Lemma 1 in [22] for $p=2$), and, as a consequence, convexity of the (GSO-$p$) regularization problem (note that this is valid for $p=1$ too). On the other hand, it enforces democracy among the elements that belong to the same group, in the sense that no intra-group sparsity is enforced, thus inducing group-wise selection.
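To make the replication strategy behind (2) concrete, the following sketch (ours; `expansion_matrix` is a hypothetical name) builds the adjoint $P^*$ as a dense 0/1 matrix, so that (2) becomes a plain non-overlapping group lasso for the operator $\Psi P^*$ acting on the stacked latent vector:

import numpy as np

def expansion_matrix(groups: list, d: int) -> np.ndarray:
    """Dense matrix of P*: maps the stacked latent variables v = (v_1,...,v_B),
    living in R^{d~} with d~ = sum_r |G_r|, to x = sum_r P_r* v_r in R^d."""
    d_tilde = sum(len(G) for G in groups)
    Pstar = np.zeros((d, d_tilde))
    col = 0
    for G in groups:
        for j in G:
            Pstar[j, col] = 1.0   # copy latent coordinate back to position j
            col += 1
    return Pstar

# With Psi of shape (n, d), problem (2) is a standard group lasso for the
# matrix Psi @ expansion_matrix(groups, d), over the (now disjoint) blocks
# of the stacked latent vector; overlapping variables are simply duplicated.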

The case $p=1$ is trivial, since the penalty $\Omega^G_1$ coincides with the $\ell_1$ norm, or lasso penalty [48]:

$$\Omega^G_1(x) = \inf_{\substack{(v_r)\in\prod_r\mathbb{R}^{G_r}\\ P^*v=x}} \sum_{r=1}^B\sum_{j\in G_r} |(v_r)_j| = \inf_{\substack{(v_r)\in\prod_r\mathbb{R}^{G_r}\\ P^*v=x}} \sum_{j=1}^d\sum_{r:\,j\in G_r} |(v_r)_j| = \sum_{j=1}^d |x_j|,$$

and is thus independent of $G$.

Example 1 A particular instance of the above problem occurs in statistical learning theory. Given a set $X$, a set $Y\subseteq\mathbb{R}$, and, for $j=1,\dots,d$, a function $\psi_j:X\to\mathbb{R}$ (the collection $\{\psi_j \mid j=1,\dots,d\}$ is called a dictionary), the family of functions $\{\sum_{j=1}^d x_j\psi_j \mid x\in\mathbb{R}^d\}$ is called a generalized linear model. If the estimator and the regression function belong to a generalized linear model, then, given a training set $\{(t_i,y_i)\}_{i=1}^n\subseteq(X\times Y)^n$, the regularized empirical risk takes the form $\frac{1}{n}\|\Psi x - y\|^2 + 2\tau\Omega^G_p(x)$, with $\Psi:\mathbb{R}^d\to\mathbb{R}^n$, $(\Psi x)_i = \sum_{j=1}^d x_j\psi_j(t_i)$, and $y = (y_1,\dots,y_n)$.

Example 2 Most results obtained in the paper hold in an infinite dimensional setting. In particular, our approach can be naturally extended to the multiple kernel learning (MKL) problem [29]. For this problem, given reproducing kernel Hilbert spaces $H_1,\dots,H_m$ of functions $g:X\to\mathbb{R}$, and defining $H=\prod_{r=1}^m H_r$, the resulting optimization problem takes the form (see [29])

$$\min_{g_r\in H_r} \Big\|\Psi\Big(\sum_r g_r\Big) - y\Big\|^2 + \sum_{r=1}^m \|g_r\|_{H_r},$$

for a suitable $\Psi:H\to\mathbb{R}^n$ and $y\in\mathbb{R}^n$. As can be readily seen, the multiple kernel learning problem has the same structure as the (GSO-$p$) problem described above.
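As a concrete companion to Example 1 above, this short sketch (ours; names and the toy dictionary are hypothetical) assembles the matrix $\Psi$ with entries $\Psi_{ij} = \psi_j(t_i)$ from a dictionary of functions:

import numpy as np

def design_matrix(dictionary, t: np.ndarray) -> np.ndarray:
    """Psi with entries Psi[i, j] = psi_j(t_i), as in Example 1.
    `dictionary` is a list of callables psi_j: X -> R."""
    return np.column_stack([np.array([psi(ti) for ti in t])
                            for psi in dictionary])

# Toy usage with a small polynomial dictionary on X = R:
t = np.linspace(-1.0, 1.0, 5)
Psi = design_matrix([lambda s: 1.0, lambda s: s, lambda s: s**2], t)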

3 An efficient proximal algorithm

Due to the non-smoothness of the penalty term, solving the (GSO-$p$) minimization problem is not trivial. Moreover, if one needs to solve the (GSO-$p$) problem for high dimensional data, the use of standard second-order methods such as interior-point methods is precluded (see for instance [7]), since they need to solve large systems of linear equations to compute the Newton steps. On the other hand, first order methods inspired by Nesterov's seminal paper [34] (see also [33]) and based on the proximal map are accurate and robust, in the sense that their performance does not depend on the fine tuning of various controlling parameters. Furthermore, these methods have already proved to be a computationally efficient alternative for solving many regularized inverse problems in image processing [11], compressed sensing [7] and machine learning applications [2,17,31].

3.1 Proximal methods

The (GSO-$p$) regularized convex functional is the sum of a convex smooth term, $F(x) = \frac{1}{n}\|\Psi x - y\|^2$, with Lipschitz continuous gradient, and a nondifferentiable penalty $\tau\Omega^G_p(\cdot)$. A minimizing sequence can be computed with a proximal gradient algorithm [49] (a.k.a. forward-backward splitting method [14], or Iterative Shrinkage Thresholding Algorithm (ISTA) [6]),

$$x_m = \mathrm{prox}_{\frac{\tau}{\sigma}\Omega^G_p}\Big(x_{m-1} - \frac{1}{2\sigma}\nabla F(x_{m-1})\Big), \qquad (\text{ISTA})$$

for a suitable choice of $\sigma$ and any initialization $x_0$. Recently, several accelerations of ISTA have been proposed [35,49,6]. With respect to ISTA, they only require the additional computation of a linear combination of two consecutive iterates. Among them, FISTA (Fast Iterative Shrinkage Thresholding Algorithm) [6] is given by the following updating rule for $m\ge 1$:

$$x_m = \mathrm{prox}_{\frac{\tau}{\sigma}\Omega^G_p}\Big(h_m - \frac{1}{2\sigma}\nabla F(h_m)\Big), \quad s_{m+1} = \frac{1+\sqrt{1+4s_m^2}}{2}, \quad h_{m+1} = \Big(1+\frac{s_m-1}{s_{m+1}}\Big)x_m + \frac{1-s_m}{s_{m+1}}\,x_{m-1}, \qquad (\text{FISTA})$$

for a suitable choice of $\sigma>0$, $s_1=1$, and any initialization $h_1 = x_0$. Both schemes are based on the computation of the proximity operator [30], which is defined as

$$\mathrm{prox}_{\lambda\Omega^G_p}(z) = \operatorname*{argmin}_{x\in\mathbb{R}^d}\Phi_\lambda(x), \quad\text{with}\quad \Phi_\lambda(x) = \frac{1}{2\lambda}\|x-z\|^2 + \Omega^G_p(x), \quad \lambda>0. \qquad (3)$$

The convergence rate of $E^p_\tau(x_m) - \min E^p_\tau$, for ISTA and FISTA, is $O(1/m)$ and $O(1/m^2)$, respectively, when the proximity operator is computed exactly. However, in general, the exact expression is not available. Recently, it has been shown that, also in the presence of errors, the accelerated version maintains its advantages with respect to the basic one. In fact, the rate $O(1/m^2)$ for FISTA in the presence of computational errors was recently proved in [46,51] for various error criteria. Convergence of ISTA with errors was already known, and was first proved in [42,14]. Since the proximity operator of the penalty $\Omega^G_p$ is not available in closed form, the (GSO-$p$) minimization problem can thus be solved via an inexact version of the iterative schemes ISTA or FISTA, where $\nabla F(h_m)$ is simply $2\Psi^T(\Psi h_m - y)/n$.
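For concreteness, here is a minimal NumPy sketch of the FISTA iteration above (our illustration, not the authors' Matlab code); the proximity operator is passed in as a callback, so that either the projection-based operator of Subsection 3.2 or the group-wise one of Subsection 3.5 can be plugged in:

import numpy as np

def fista_gso(Psi, y, prox, tau, n_iter=500):
    """FISTA sketch for (1/n)||Psi x - y||^2 + 2*tau*penalty(x).
    `prox(z, lam)` must (approximately) return prox_{lam*penalty}(z)."""
    n, d = Psi.shape
    sigma = np.linalg.norm(Psi.T @ Psi, 2) / n    # sigma = ||Psi^T Psi|| / n
    x_prev = np.zeros(d)
    x = np.zeros(d)
    h = np.zeros(d)
    s = 1.0
    for m in range(1, n_iter + 1):
        grad = 2.0 * Psi.T @ (Psi @ h - y) / n    # gradient of the data term
        x_prev, x = x, prox(h - grad / (2.0 * sigma), tau / sigma)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        h = x + ((s - 1.0) / s_next) * (x - x_prev)   # momentum combination
        s = s_next
    return x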

Algorithm 1 FISTA for GSO-$p$
Given: $G$, $p\in(1,+\infty]$, $\tau>0$, $\varepsilon_0>0$, $\alpha>0$, $x_0 = h_0\in\mathbb{R}^d$, $s_0 = 1$
Let: $\sigma = \|\Psi^T\Psi\|/n$, $m = 0$, and $q$ such that $1/p + 1/q = 1$
while convergence not reached do
    $\hat h_m = h_m - \frac{1}{n\sigma}\Psi^T(\Psi h_m - y)$
    Find $\hat G_m = \{G\in G : \|\hat h_m\|_{G,q} > \tau/\sigma\}$
    Approximately compute the projection of $\hat h_m$ onto $\frac{\tau}{\sigma}K^{\hat G_m} := \{h\in\mathbb{R}^d : \|h\|_{G,q}\le\frac{\tau}{\sigma}\ \forall G\in\hat G_m\}$, with tolerance $\varepsilon_0 m^{-\alpha}$
    $x_m = \hat h_m - \pi_{\frac{\tau}{\sigma}K^{\hat G_m}}(\hat h_m)$
    $s_{m+1} = \frac{1+\sqrt{1+4s_m^2}}{2}$
    $h_{m+1} = \big(1+\frac{s_m-1}{s_{m+1}}\big)x_m + \frac{1-s_m}{s_{m+1}}\,x_{m-1}$
end while
return $x_m$

Note that, in the special case of non-overlapping groups, the proximity operator can be explicitly evaluated group-wise, and reduces to a group-wise soft-thresholding operator. In the general case, as explained in Subsection 3.2, the proximity operator can be written in terms of a projection, and we will provide an algorithm to approximately compute it. Note also that we will show that at each step the projection involves only a subset of the initial groups, the active groups, thus significantly increasing the computational performance of the overall algorithm.

3.2 Computing the proximity operator of $\Omega^G_p$

In this subsection we state the lemmas that allow us to efficiently compute the proximity operator of $\Omega^G_p$ and to formulate the inexact version of FISTA reported in Algorithm 1. As a direct consequence of standard results of convex analysis, Lemma 1 shows that the computation of the proximity operator amounts to the computation of a projection operator onto the intersection of convex sets, each of them corresponding to a group. In Lemma 2, we theoretically justify an active set strategy, by showing that when projecting a vector onto this intersection, it is possible to discard the constraints which are already satisfied. In the following, given a convex and closed subset $A$ (of $\mathbb{R}^d$ for some $d$), we denote by $\pi_A$ the associated projection: for $x\in\mathbb{R}^d$, $\pi_A(x) = \operatorname{argmin}_{y\in A}\|y-x\|$.
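The shape of the resulting prox step can be sketched as follows (ours, with hypothetical names; Lemma 1 and Lemma 2 below make both steps rigorous). The inner projection is delegated to any routine operating on the active groups only:

import numpy as np

def group_norm_q(v, q):
    """l_q norm used for the constraint sets (q conjugate to p)."""
    return np.max(np.abs(v)) if np.isinf(q) else np.linalg.norm(v, ord=q)

def active_groups(z, groups, lam, q):
    """Lemma 2: only groups violating their constraint, ||z_G||_q > lam,
    matter when projecting z onto lam*K^G; the rest can be discarded."""
    return [G for G in groups if group_norm_q(z[G], q) > lam]

def prox_omega(z, groups, lam, q, project_reduced):
    """prox_{lam*Omega_p^G}(z) = z - pi_{lam K^G}(z) (Lemma 1), with the
    projection computed on the active groups only.  `project_reduced` is
    any inner solver, e.g. cyclic projections (Algorithm 2) or the dual
    method of Subsection 3.2.2."""
    active = active_groups(z, groups, lam, q)
    if not active:                  # z already lies in lam*K^G: the prox is 0
        return np.zeros_like(z)
    return z - project_reduced(z, active, lam, q)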

Lemma 1 For any $\lambda>0$ and $p\ge 1$, the proximity operator of $\lambda\Omega^G_p$, where $\Omega^G_p$ is defined in (1), is given by

$$\mathrm{prox}_{\lambda\Omega^G_p} = I - \pi_{\lambda K^G_p},$$

where $K^G_p$ is given by

$$K^G_p = \{x\in\mathbb{R}^d : \|x\|_{G_r,q}\le 1 \ \text{for}\ r=1,\dots,B\}. \qquad (4)$$

The proof exploits the particular definition of the penalty and relies on the Moreau decomposition

$$\mathrm{prox}_{\lambda\Omega}(x) = x - \lambda\,\mathrm{prox}_{\frac{1}{\lambda}\Omega^*}\Big(\frac{x}{\lambda}\Big). \qquad (5)$$

Formula (5) allows computing the proximity operator of $\Omega$ starting from the proximity operator of its Fenchel conjugate. In our case, $\Omega^G_p$ being one-homogeneous, we obtain the identity minus the projection onto a closed and convex set. The particular geometry of $K^G_p$, which is the intersection of $B$ convex generalized cylinders, each centered on a coordinate subspace, derives from the definition of $\Omega^G_p$ and the explicit computation of its Fenchel conjugate. Observe that by definition $\Omega^G_p$ is the infimal convolution of $B$ functions, precisely the $B$ norms on $\mathbb{R}^{G_r}$ composed with the projections. By standard properties of the Fenchel conjugate, it follows that $(\Omega^G_p)^* = \sum_{r=1}^B \iota_q\circ P_r$, where $\iota_q$ is the conjugate of $\|\cdot\|_p$, i.e. the indicator function of the $\ell_q$ unit ball in $\mathbb{R}^{G_r}$, defined as

$$\iota_q(v) = \begin{cases} 0 & \text{if } \|v\|_q\le 1\\ +\infty & \text{otherwise.}\end{cases}$$

We give here a self-contained proof which does not use the notion of infimal convolution. A different proof for the case $p=2$ is given in [36].

Proof We start by computing explicitly the Fenchel conjugate of $\Omega^G_p$. By definition,

$$(\Omega^G_p)^*(u) = \sup_{x\in\mathbb{R}^d}\Big[\langle x,u\rangle - \min_{\substack{v\in\prod_r\mathbb{R}^{G_r}\\ P^*v=x}}\sum_{r=1}^B\|v_r\|_p\Big] = \sup_{x\in\mathbb{R}^d}\ \sup_{\substack{v\in\prod_r\mathbb{R}^{G_r}\\ P^*v=x}}\Big[\langle x,u\rangle - \sum_{r=1}^B\|v_r\|_p\Big]$$
$$= \sup_{v}\Big[\Big\langle \sum_{r=1}^B P^*_r v_r, u\Big\rangle - \sum_{r=1}^B\|v_r\|_p\Big] = \sum_{r=1}^B\ \sup_{v_r\in\mathbb{R}^{G_r}}\big[\langle v_r, P_r u\rangle - \|v_r\|_p\big] = \sum_{r=1}^B \iota_q(P_r u),$$

where $\iota_q$ is the Fenchel conjugate of $\|\cdot\|_p$, i.e. the indicator function of the $\ell_q$ unit ball in $\mathbb{R}^{G_r}$. We can rewrite the sum of indicator functions as $\sum_{r=1}^B\iota_q(P_r u) = \iota_{K^G_p}(u)$. It is well known that $\mathrm{prox}_{\lambda\iota_{K^G_p}}(x) = \pi_{K^G_p}(x)$.

Using the Moreau decomposition (5) and basic properties of the projection, we obtain

$$\mathrm{prox}_{\lambda\Omega^G_p}(x) = x - \lambda\,\pi_{K^G_p}(x/\lambda) = x - \pi_{\lambda K^G_p}(x). \qquad (6)$$

The following lemma shows that, when evaluating the projection $\pi_{\lambda K^G_p}(x)$, we can restrict ourselves to a subset of active groups, denoted by $\hat G$ and defined in Lemma 2. This equivalence is crucial to speed up Algorithm 1; in fact, the number of active groups at iteration $m$ will converge to the number of selected groups, which is typically small if one is interested in sparse solutions.

Lemma 2 Given $x\in\mathbb{R}^d$ and $\lambda>0$, it holds that

$$\pi_{\lambda K^G_p}(x) = \pi_{\lambda K^{\hat G}_p}(x), \qquad (7)$$

where $\hat G := \{G\in G : \|x\|_{G,q} > \lambda\}$.

Proof Given a group of indices $G$ and a number $p>1$, we denote by $C_{G,p}$ the convex set $C_{G,p} = \{x\in\mathbb{R}^d : \|x\|_{G,q}\le 1\}$. To prove the result, we first show that for any subset $S\subseteq G$ the projection onto the intersection $\lambda K^S_p = \bigcap_{G\in S}\lambda C_{G,p}$ is non-expansive coordinate-wise with respect to zero. More precisely, for all $x\in\mathbb{R}^d$ it holds that $|\pi_{\lambda K^S_p}(x)_i|\le|x_i|$ for all $i=1,\dots,d$ and for all $\lambda>0$. By contradiction, assume that there exists an index $\hat j$ such that $|\pi_{\lambda K^S_p}(x)_{\hat j}| > |x_{\hat j}|$. Consider the vector $\bar x$ defined by setting

$$\bar x_j = \begin{cases} \pi_{\lambda K^S_p}(x)_j & \text{if } j\ne\hat j\\ x_{\hat j} & \text{otherwise.}\end{cases}$$

First note that $\bar x\in\lambda K^S_p$, since $\|\bar x\|_{G,q}\le\|\pi_{\lambda K^S_p}(x)\|_{G,q}\le\lambda$ for all $G\in S$. On the other hand,

$$\|\bar x - x\|^2 = \sum_{\substack{j=1\\ j\ne\hat j}}^d (\bar x_j - x_j)^2 < \|x - \pi_{\lambda K^S_p}(x)\|^2,$$

which is a contradiction. To conclude, suppose that $x\in\lambda K^S_p$, with $S\subseteq G$. If we prove that $\pi_{\lambda K^G_p}(x) = \pi_{\lambda K^{G\setminus S}_p}(x)$, we are done. For the sake of brevity, denote $v = \pi_{\lambda K^{G\setminus S}_p}(x)$. Thanks to the non-expansivity property it follows that $|v_j|\le|x_j|$ for all $j=1,\dots,d$, and therefore $v\in\lambda K^S_p$. Since $v\in\lambda K^{G\setminus S}_p$ by hypothesis, we get that $v\in\lambda K^G_p$. Furthermore, by definition of the projection, $\|v-x\|\le\|w-x\|$ for every $w\in\lambda K^{G\setminus S}_p$, and a fortiori $\|v-x\|\le\|w-x\|$ for every $w\in\lambda K^G_p$.

3.2.1 The projection onto $K^G_p$ for general $p$

The convex set $K^G_p$ is an intersection of convex sets; precisely, $K^G_p = \bigcap_{G\in G} C_{G,p}$, where $C_{G,p} = \{v\in\mathbb{R}^d : \|v\|_{G,q}\le 1\}$. For general $p$, a possible minimization scheme for computing the projection in (7) can be obtained by applying the Cyclic Projections algorithm [9] or one of its modified versions (see [4] and references therein). In the particular case $p=2$, we describe the Lagrangian dual problem corresponding to the projection onto $K^G_2$, and we propose an alternative optimization scheme, the projected Newton method [8], which better exploits the geometry of the set $K^G_2$ and in practice proves to be faster than the Cyclic Projections algorithm. Note that, in order to satisfy the hypotheses of Theorem 2, the tolerance for stopping the iteration must decrease with the outer iteration $m$.

A simple way to compute the projection onto the intersection of convex sets is given by the Cyclic Projections algorithm [9], which amounts to cyclically projecting onto each set. We recall, as Algorithm 2, a modification of the Cyclic Projections algorithm proposed in [4], for which strong convergence is guaranteed (see Theorem 4.1 in [4]).

Algorithm 2 Cyclic Projections
Given $x\in\mathbb{R}^d$, $\{C_{G_1,p},\dots,C_{G_B,p}\}$
Let $l = 0$, $w_0 = x$, and find $C_{\hat G_1,p},\dots,C_{\hat G_{\hat B},p}$
while convergence not reached do
    $l = l+1$
    Let $\pi_l$ be the projection onto $\tau C_{\hat G_{l \bmod \hat B},p}$
    $w_l = \frac{1}{l+1}x + \frac{l}{l+1}\pi_l(w_{l-1})$
end while

In the following we describe how to compute each projection $\pi_{C_{G,p}}$ for specific values of $p$ and an arbitrary group $G$.

$p=2$. In this case $q=2$, and the projection is trivial:

$$[\pi_{\tau C_{G,2}}(w)]_j = \begin{cases} \dfrac{\tau}{\|w\|_{G,2}}\,w_j & \text{if } j\in G \text{ and } \|w\|_{G,2} > \tau\\ w_j & \text{otherwise.}\end{cases}$$

$p=\infty$. In this case $q=1$, and $C_{G,\infty}$ is an $\ell_1$ ball when restricted to the coordinates in $G$. From Lemma 4.2 in [20], we have that if $\|w\|_1 > \tau$, then the projection of $w$ onto the $\ell_1$ ball of radius $\tau$, $\tau B_1$, is given by the soft-thresholding operation

$$[\pi_{\tau B_1}(w)]_j = (|w_j| - \mu)_+\,\mathrm{sign}(w_j),$$

where $\mu$ (depending on $w$ and $\tau$) is chosen such that $\sum_j(|w_j|-\mu)_+ = \tau$. We recall a simple procedure provided in [20] for determining $\mu$. In a first step, sort the absolute values of the components of $w$, obtaining the rearranged sequence $w^*_j \ge w^*_{j+1} \ge 0$ for all $j$. Next, perform a search to find $k$ such that

$$\sum_{j=1}^{k-1}(w^*_j - w^*_k) \le \tau \le \sum_{j=1}^{k}(w^*_j - w^*_{k+1}).$$

Then set

$$\mu = \frac{1}{k}\Big(\sum_{j=1}^{k} w^*_j - \tau\Big).$$

$p\notin\{2,\infty\}$. In these cases no closed form for the projection onto the set $C_{G,p}$ is known, but it can be efficiently computed using Newton's method, as done in [23].
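The two explicit projections and the averaged cycle of Algorithm 2 translate directly into code. The sketch below is ours (the stopping rule and all names are our choices, not the paper's implementation):

import numpy as np

def project_l2_cylinder(w, G, tau):
    """Projection onto {v : ||v||_{G,2} <= tau}: rescale the G-block if needed."""
    v = w.copy()
    nrm = np.linalg.norm(w[G])
    if nrm > tau:
        v[G] *= tau / nrm
    return v

def project_l1_cylinder(w, G, tau):
    """Projection onto {v : ||v||_{G,1} <= tau} by soft-thresholding (p = inf)."""
    v = w.copy()
    wg = w[G]
    if np.abs(wg).sum() <= tau:
        return v
    u = np.sort(np.abs(wg))[::-1]                      # sorted |w_G|, decreasing
    css = np.cumsum(u)
    k = np.nonzero(u - (css - tau) / np.arange(1, len(u) + 1) > 0)[0][-1] + 1
    mu = (css[k - 1] - tau) / k                        # so that sum (|w_j|-mu)_+ = tau
    v[G] = np.sign(wg) * np.maximum(np.abs(wg) - mu, 0.0)
    return v

def cyclic_projections(x, cylinders, tol=1e-6, max_iter=10000):
    """Algorithm 2: strongly convergent averaged cycle
    w_l = x/(l+1) + (l/(l+1)) * P_l(w_{l-1}), cycling over the sets."""
    w = x.copy()
    for l in range(1, max_iter + 1):
        proj = cylinders[(l - 1) % len(cylinders)]     # next set in the cycle
        w_new = x / (l + 1) + (l / (l + 1)) * proj(w)
        if np.linalg.norm(w_new - w) <= tol * (np.linalg.norm(w) + 1e-12):
            return w_new
        w = w_new
    return w

# Gluing to prox_omega above (p = 2 case):
# project_reduced = lambda z, active, lam, q: cyclic_projections(
#     z, [lambda w, G=G: project_l2_cylinder(w, G, lam) for G in active])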

3.2.2 The projection onto $K^G_2$ for $p=2$

When $p=2$, the projection onto $\tau K^G_2$ amounts to solving the constrained minimization problem

$$\text{minimize } \|v-x\|^2 \quad\text{subject to}\quad v\in\mathbb{R}^d,\ \|v\|_{G,2}\le\tau \ \text{for } G\in\hat G, \qquad (8)$$

whose Lagrangian dual problem can be written in closed form. Working on the dual is advantageous, since the number of groups is typically much smaller than $d$, and furthermore Lemma 2 guarantees that one can restrict to the subset of groups

$$\hat G := \{G\in G : \|x\|_{G,2} > \tau\} =: \{\hat G_1,\dots,\hat G_{\hat B}\}, \qquad (9)$$

which in general is a proper subset of $G$. In the following theorem we show how to compute the solution of problem (8) by solving the associated dual problem.

Theorem 1 Given $x\in\mathbb{R}^d$, $G = \{G_r\}_{r=1}^B$ with $G_r\subseteq\{1,\dots,d\}$, $\hat G$ as in (9), and $\tau>0$, the projection of $x$ onto the convex set $\tau K^G_2 = \{v\in\mathbb{R}^d : \|v\|_{G_r,2}\le\tau \text{ for } r=1,\dots,B\}$ is given by

$$[\pi_{\tau K^G_2}(x)]_j = \frac{x_j}{1+\sum_{r=1}^{\hat B}\lambda^*_r 1_{r,j}} \qquad\text{for } j=1,\dots,d, \qquad (10)$$

where $\lambda^*$ is the solution of $\operatorname{argmax}_{\lambda\in\mathbb{R}^{\hat B}_+} f(\lambda)$, with

$$f(\lambda) = \|x\|^2 - \sum_{j=1}^d \frac{x_j^2}{1+\sum_{r=1}^{\hat B} 1_{r,j}\lambda_r} - \tau^2\sum_{r=1}^{\hat B}\lambda_r, \qquad (11)$$

and $1_{r,j}$ equal to 1 if $j$ belongs to group $\hat G_r$ and 0 otherwise.

Proof The Lagrangian function for the minimization problem (8) is defined as

$$L(v,\lambda) = \|v-x\|^2 + \sum_{r=1}^{\hat B}\lambda_r\big(\|v\|^2_{\hat G_r,2} - \tau^2\big) = \sum_{j=1}^d (v_j - x_j)^2 + \sum_{r=1}^{\hat B}\lambda_r\Big(\sum_{j=1}^d 1_{r,j}v_j^2 - \tau^2\Big)$$
$$= \sum_{j=1}^d \Big(1+\sum_{r=1}^{\hat B} 1_{r,j}\lambda_r\Big)\Big(v_j - \frac{x_j}{1+\sum_{r=1}^{\hat B}1_{r,j}\lambda_r}\Big)^2 - \sum_{j=1}^d \frac{x_j^2}{1+\sum_{r=1}^{\hat B}1_{r,j}\lambda_r} - \sum_{r=1}^{\hat B}\lambda_r\tau^2 + \|x\|^2, \qquad (12)$$

where $\lambda\in\mathbb{R}^{\hat B}_+$. The dual function is then

$$f(\lambda) = \inf_{v\in\mathbb{R}^d} L(v,\lambda) = L\Big(\Big(\frac{x_j}{1+\sum_r 1_{r,j}\lambda_r}\Big)_{j=1}^d,\,\lambda\Big) = \|x\|^2 - \sum_{j=1}^d\frac{x_j^2}{1+\sum_{r=1}^{\hat B}1_{r,j}\lambda_r} - \sum_{r=1}^{\hat B}\lambda_r\tau^2.$$

Since strong duality holds, the minimum of (8) equals the maximum of the dual problem, which is therefore

$$\text{maximize } f(\lambda) \quad\text{subject to}\quad \lambda_r\ge 0 \ \text{for } r=1,\dots,\hat B. \qquad (13)$$

Once the solution $\lambda^*$ of the dual problem (13) is obtained, the solution $v^*$ of the primal problem (8) is given by

$$v^*_j = \frac{x_j}{1+\sum_{r=1}^{\hat B}\lambda^*_r 1_{r,j}} \qquad\text{for } j=1,\dots,d.$$

The dual problem can be efficiently solved, for instance, via Bertsekas' projected Newton method described in [8], reported here as Algorithm 5 in the Appendix, where the first and second partial derivatives of $f(\lambda)$ are given by

$$\partial_r f(\lambda) = \sum_{j=1}^d \frac{x_j^2\,1_{r,j}}{\big(1+\sum_{s=1}^{\hat B}1_{s,j}\lambda_s\big)^2} - \tau^2$$

and

$$\partial_r\partial_s f(\lambda) = -\sum_{j=1}^d \frac{2x_j^2\,1_{r,j}1_{s,j}}{\big(1+\sum_{t=1}^{\hat B}1_{t,j}\lambda_t\big)^3} = \begin{cases} 0 & \text{if } \hat G_r\cap\hat G_s = \emptyset\\ -2\sum_{j\in\hat G_r\cap\hat G_s}\dfrac{x_j^2}{\big(1+\sum_{t=1}^{\hat B}1_{t,j}\lambda_t\big)^3} & \text{otherwise.}\end{cases}$$
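The dual objective (11) and its derivatives translate directly into code. In the sketch below (ours; names hypothetical), SciPy's bound-constrained quasi-Newton solver stands in for the projected Newton method of [8], and the primal projection is then recovered via (10):

import numpy as np
from scipy.optimize import minimize

def make_dual(x, groups, tau):
    """Pieces of the dual problem (13) of the projection (8), for p = 2.
    A[r, j] = 1 iff coordinate j belongs to the (active) group groups[r]."""
    A = np.zeros((len(groups), x.size))
    for r, G in enumerate(groups):
        A[r, G] = 1.0

    def f(lam):     # (11): ||x||^2 - sum_j x_j^2/(1+a_j) - tau^2 sum_r lam_r
        a = 1.0 + A.T @ lam
        return x @ x - np.sum(x ** 2 / a) - tau ** 2 * lam.sum()

    def grad(lam):  # d_r f = sum_{j in G_r} x_j^2/(1+a_j)^2 - tau^2
        a = 1.0 + A.T @ lam
        return A @ (x ** 2 / a ** 2) - tau ** 2

    def hess(lam):  # d_r d_s f = -2 sum_{j in G_r and G_s} x_j^2/(1+a_j)^3
        a = 1.0 + A.T @ lam
        return -2.0 * (A * (x ** 2 / a ** 3)) @ A.T

    return A, f, grad, hess

def project_dual(x, groups, tau):
    """Maximize f over lam >= 0, then recover the projection via (10):
    v_j = x_j / (1 + sum_r lam_r 1_{r,j})."""
    A, f, grad, _ = make_dual(x, groups, tau)
    res = minimize(lambda l: -f(l), np.zeros(len(groups)),
                   jac=lambda l: -grad(l),
                   bounds=[(0.0, None)] * len(groups))
    return x / (1.0 + A.T @ res.x)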

Bertsekas' iterative scheme combines the basic simplicity of the steepest descent iteration [44] with the quadratic convergence of the projected Newton method [10]. It does not involve the solution of a quadratic program, thereby avoiding the associated computational overhead. Its convergence properties have been studied in [8] and are briefly mentioned in the next section.

3.3 Convergence analysis of the GSO-$p$ algorithm

In this subsection we clarify the accuracy in the computation of the projection which is required to prove convergence of Algorithm 1. As mentioned above, we rely on recent theorems providing a convergence rate for proximal gradient methods with approximations.

Definition 1 We say that $w$ is an approximation of $\pi_{\frac{\tau}{\sigma}K^G_p}(x)$ with tolerance $\varepsilon$ if $\|w - \pi_{\frac{\tau}{\sigma}K^G_p}(x)\|\le\varepsilon$.

Theorem 2 Let $x_0\in\mathbb{R}^d$ and $\sigma = \|\Psi^T\Psi\|/n$. Assume that $\pi_{\frac{\tau}{\sigma}K^G_p}(x_m)$ in Algorithm 1 is approximately computed at step $m$ with tolerance $\varepsilon_m = \varepsilon_0/m^\alpha$. If $\alpha>2$, there exists a constant $C_I := C_I(p,G,x_0,\sigma,\tau,\alpha)$ such that the iterative update (ISTA) satisfies

$$E^p_\tau\Big(\frac{1}{m}\sum_{i=1}^m x_i\Big) - E^p_\tau(x^*) \le \frac{C_I}{m}. \qquad (14)$$

If $\alpha>4$, there exists a constant $C_F := C_F(p,G,x_0,\sigma,\tau,\alpha)$ such that the iterative update (FISTA) satisfies

$$E^p_\tau(x_m) - E^p_\tau(x^*) \le \frac{C_F}{m^2}. \qquad (15)$$

Proof It is enough to show that there exists a constant $C>0$ (independent of $w_l$ and $x_m$) such that

$$\|w_l - \pi_{\frac{\tau}{\sigma}K^G_p}(x_m)\| \le \frac{\varepsilon_m}{C} \ \implies\ \Phi_{\frac{\tau}{\sigma}}(x_m - w_l) \le \min\Phi_{\frac{\tau}{\sigma}} + \varepsilon_m, \qquad (16)$$

where $\Phi_{\frac{\tau}{\sigma}}$ is defined as in (3). The statement then directly follows from Proposition 1 and Proposition 2 in [46]. In order to prove equation (16), first note that, thanks to the assumption $\bigcup_{r=1}^B G_r = \{1,\dots,d\}$ made at the beginning, it

easily follows from the definition that $\Omega^G_p$ is a norm on $\mathbb{R}^d$, and is therefore equivalent to the Euclidean one. Thus, there exists a constant $A$ (depending only on $p$ and $G$) such that

$$\Omega^G_p(x) - \Omega^G_p(x') \le A\|x - x'\| \qquad \forall x,x'\in\mathbb{R}^d.$$

Next, let $w$ and $x$ be such that

$$\|w - \pi_{\frac{\tau}{\sigma}K^G_p}(x)\| \le \gamma \qquad (17)$$

for some $\gamma>0$ (and suppose w.l.o.g. that $\gamma<1$). By Lemma 1 and by the definitions of $\mathrm{prox}_{\frac{\tau}{\sigma}\Omega^G_p}(x)$ and $\Phi_{\frac{\tau}{\sigma}}$ (see equation (3)), we have $\Phi_{\frac{\tau}{\sigma}}\big(x - \pi_{\frac{\tau}{\sigma}K^G_p}(x)\big) = \min\Phi_{\frac{\tau}{\sigma}}$. Thus, by equation (17), and using the fact that $\Omega^G_p$ is a norm,

$$\begin{aligned}
\Phi_{\frac{\tau}{\sigma}}(x-w) &= \frac{\sigma}{2\tau}\|w\|^2 + \Omega^G_p(x-w)\\
&\le \frac{\sigma}{2\tau}\big\|w-\pi_{\frac{\tau}{\sigma}K^G_p}(x)\big\|^2 + \frac{\sigma}{2\tau}\big\|\pi_{\frac{\tau}{\sigma}K^G_p}(x)\big\|^2 + \frac{\sigma}{\tau}\big\langle w-\pi_{\frac{\tau}{\sigma}K^G_p}(x),\ \pi_{\frac{\tau}{\sigma}K^G_p}(x)\big\rangle\\
&\qquad + \Omega^G_p\big(x - \pi_{\frac{\tau}{\sigma}K^G_p}(x)\big) + \Omega^G_p\big(\pi_{\frac{\tau}{\sigma}K^G_p}(x) - w\big)\\
&\le \min\Phi_{\frac{\tau}{\sigma}} + \frac{\sigma}{2\tau}\gamma^2 + \frac{\sigma}{\tau}\gamma\tilde A + A\gamma \le \min\Phi_{\frac{\tau}{\sigma}} + \Big(\frac{\sigma}{2\tau}\gamma + \frac{\sigma}{\tau}\tilde A + A\Big)\gamma \le \min\Phi_{\frac{\tau}{\sigma}} + C\gamma,
\end{aligned}$$

where $\tilde A$ is such that $\sup_{v\in K^G_p}\|v\|\le\tilde A$ and $C = C(p,G,\sigma,\tau)$. Therefore, equation (16) holds with $C$ as defined above.

By Theorem 3.1 in [4], Algorithm 2 is strongly convergent, and therefore, given arbitrary $\varepsilon>0$ and $x\in\mathbb{R}^d$, there exists an index $l_m := l_m(\varepsilon)$ such that $w_{l_m}$ produced by Algorithm 2 satisfies $\|w_l - \pi_{\tau K^G_p}(x_m)\|\le\varepsilon$ for every $l\ge l_m$. Algorithm 1 combined with Algorithm 2 thus converges to the minimum of the (GSO-$p$) problem with rate $1/m^2$, if the projection is approximately computed with tolerance $\varepsilon_0/m^\alpha$ with $\alpha>4$. Similarly, one can use ISTA instead of FISTA as the updating rule in Algorithm 1, obtaining the convergence rate $1/m$ and setting $\alpha>2$. It is clear that the choice of $\alpha$ defines the stopping rule for the internal algorithm (see Subsection 4.1). As happens for the exact accelerations of the basic forward-backward splitting algorithm such as [34,7,6], convergence of the sequence $x_m$ is no longer guaranteed unless strong convexity is assumed.

Every other algorithm producing admissible approximations can be used in place of Algorithm 2 in the computation of the projection. In the case $p=2$, we tested Bertsekas' projected Newton method, reported in the Appendix as Algorithm 5. Its convergence is not always guaranteed, since there are particular choices of $x$ and $G$ for which the partial Hessian of the dual function is not strictly positive definite, as would be required to ensure strong convergence (see Proposition 3 and Proposition 4 in [8]). However, ideas which are useful for circumventing the same problem for the unconstrained Newton method, such as preconditioning, could easily be adapted to this case, and convergence has always been observed in our experiments (for more details see the discussion in [8] and also the comments at the end of the next subsection).

3.4 Computing the regularization path

In Algorithm 3 we report the complete scheme for computing the regularization path for the Group-wise Selection with Overlap problem (GSO-$p$), i.e. the set of solutions corresponding to different values of the regularization parameter $\tau_1 > \dots > \tau_T$. Note that we employ the continuation strategy proposed in [21].

Algorithm 3 Regularization path for GSO-$p$
Given: $\tau_1 > \tau_2 > \dots > \tau_T$, $G$, $\varepsilon_0 > 0$, $\nu > 0$
Let: $\sigma = \|\Psi^T\Psi\|/n$, $x(\tau_0) = 0$
for $t = 1,\dots,T$ do
    Initialize: $x = x(\tau_{t-1})$
    while convergence not reached do
        update $x$ according to Algorithm 1, with the projection computed via Cyclic Projections or by solving the dual problem
    end while
    $x(\tau_t) = x$
end for
return $x(\tau_1),\dots,x(\tau_T)$

When computing the proximity operator with Bertsekas' projected Newton method, a similar warm start is applied to the inner iteration, since the $m$-th projection is initialized with the solution of the $(m-1)$-th projection. Despite the local nature of Bertsekas' scheme, such an initialization empirically proved to guarantee convergence.
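A minimal sketch of the warm-started sweep of Algorithm 3 (ours; `solve_gso` is a placeholder for any GSO solver, e.g. the FISTA routine sketched earlier):

import numpy as np

def regularization_path(Psi, y, groups, taus, solve_gso):
    """Sweep tau_1 > tau_2 > ... and warm-start each solve from the previous
    solution (continuation strategy of Algorithm 3)."""
    path = []
    x = np.zeros(Psi.shape[1])
    for tau in sorted(taus, reverse=True):  # largest tau first: sparsest solution
        x = solve_gso(Psi, y, groups, tau, x0=x)
        path.append((tau, x.copy()))
    return path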

3.5 The replicates formulation

As discussed in Section 2, the most common method to solve the (GSO-$p$) problem is to minimize the standard group $\ell_1/\ell_p$ regularization (without overlap) in the expanded space of latent variables in (2), built by replicating variables belonging to more than one group, thus working in a $\tilde d$-dimensional space with $\tilde d = \sum_{r=1}^B |G_r|$. Setting $\tilde\Psi = \Psi P^*$ and $R^G_p(v) = \sum_{r=1}^B \|v_r\|_p$, problem (2) can be written as

$$\min_{v\in\prod_{r=1}^B\mathbb{R}^{G_r}} \frac{1}{n}\|\tilde\Psi v - y\|^2 + 2\tau R^G_p(v).$$

The main advantage of such a formulation relies on the possibility of using any state-of-the-art optimization procedure for $\ell_1/\ell_p$ regularization without overlap. In terms of proximal methods, a possible solution is given by Algorithm 3, where the proximity operator can now be computed group-wise as

$$\big((\mathrm{prox}_{\lambda R^G_p}(v))_j\big)_{j\in G_r} = \big(I - \pi_{\lambda S_{G_r,p}}\big)\big((v_j)_{j\in G_r}\big) \qquad\text{for all } r=1,\dots,B,$$

where $S_{G_r,p}$ now denotes the $\ell_q$ unit ball in $\mathbb{R}^{G_r}$. Furthermore, for $p=2$ and $p=+\infty$ each projection can be computed exactly as described in Subsection 3.2.1, and the proximity operator of $R^G_p$ is thus exact. The optimization algorithm for solving (GSO-$p$) via FISTA in the replicated space is reported as Algorithm 4.

Algorithm 4 FISTA for Group-wise Selection without overlap
Given: $v_0\in\prod_{r=1}^B\mathbb{R}^{G_r}$, $\tau>0$, $\sigma = \|\tilde\Psi^T\tilde\Psi\|/n$
Initialize: $m = 0$, $w_1 = v_0$, $s_1 = 1$
while convergence not reached do
    for $r = 1,\dots,B$ do
        $v^m_r = \big(I - \pi_{\frac{\tau}{\sigma}S_{G_r,p}}\big)\Big(\big(w_m - \frac{1}{n\sigma}\tilde\Psi^T(\tilde\Psi w_m - y)\big)_{j\in G_r}\Big)$
    end for
    $s_{m+1} = \frac{1+\sqrt{1+4s_m^2}}{2}$
    $w_{m+1} = \big(1+\frac{s_m-1}{s_{m+1}}\big)v_m + \frac{1-s_m}{s_{m+1}}\,v_{m-1}$
end while
return $v_m$

The replicates formulation involves a much simpler proximity operator, but each iteration has a higher computational cost, since it now depends on $\tilde d$ rather than on $d$, and thus increases with the amount of overlap among the variable subsets (see Section 4 for numerical comparisons between the projection and replication approaches).
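For $p=2$ and non-overlapping blocks, the operator $I - \pi_{\lambda S_{G_r,2}}$ used in Algorithm 4 is exactly the familiar group soft-thresholding; a minimal sketch (ours):

import numpy as np

def group_soft_threshold(v, groups, lam):
    """Exact prox of lam * sum_r ||v_r||_2 for NON-overlapping blocks: each
    block is shrunk toward zero, and zeroed out if its norm is below lam."""
    out = v.copy()
    for G in groups:
        nrm = np.linalg.norm(v[G])
        out[G] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * v[G]
    return out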

4 Numerical experiments

In this section we present numerical experiments aimed at studying the computational performance of the proposed family of optimization algorithms, and at comparing them with state-of-the-art algorithms applied to the replicates formulation.

4.1 Cyclic Projections vs dual formulation

We build $B$ groups $\{G_r\}_{r=1}^B$ of size $b$, with $G_r\subseteq\{1,\dots,d\}$, by randomly drawing sets of $b$ indices from $\{1,\dots,d\}$, and consider the cases $b=10$ and $b=100$. We vary the number of groups $B$, so that the dimension of the expanded space is $\alpha$ times the input dimension, $\tilde d = \alpha d$, with $\alpha = 1.2$, 2 and 5. Clearly this amounts to taking $B = \alpha d/b$. We then generate a vector $x\in\mathbb{R}^d$ by randomly drawing each of its entries from $N(0,1)$. We then pick a value of $\tau$ such that, when computing $\mathrm{prox}_{\tau\Omega^G_2}(x)$, all groups are active; precisely, we take $\tau = 0.8\min_{r=1,\dots,B}\|x\|_{G_r,2}$. We first compute the exact solution $x^* = \mathrm{prox}_{\tau\Omega^G_2}(x)$, taken to be the solution computed via the projected Newton method for the dual problem with a very tight tolerance. Then we compute the approximated solutions with the Cyclic Projections Algorithm 2 and by solving the dual via the projected Newton method. We refer to the former as CP2 and to the latter as dual. We stop the iteration when the distance from the exact solution is less than $\varepsilon$ times the norm of $x$. We consider different values for the tolerance $\varepsilon$; precisely, we take $\varepsilon = 10^{-2}, 10^{-3}, 10^{-4}$. Mean and standard deviation of the computing time over 20 repetitions are plotted in Figures 1 and 2 for each value of $\alpha$ and $\varepsilon$.

Fig. 1 Computing time (in seconds) necessary for evaluating the prox vs the number of variables ($d$), for different values of the overlap degree $\alpha$ and of the tolerance, for fixed group size $b=10$.

The dual formulation is faster than the Cyclic Projections algorithm in most situations. It is convenient to use Cyclic Projections when the number of active groups is high and the required tolerance very low. When computing the projection for Algorithm 1, it is thus reasonable to use Cyclic Projections in the very first outer iterations,

when the tolerance, which depends on the outer iteration, is coarse and the solution, being still far from convergence, may not yet be sparse. After a few iterations, it is more convenient to resort to the dual formulation. Though this is not optimal, in the following experiments, when referring to GSO-2 via projection, we always consider the projection computed with the dual formulation.

Fig. 2 Computing time (in seconds) necessary for evaluating the prox vs the number of variables ($d$), for different values of the overlap degree $\alpha$ and of the tolerance, for fixed group size $b=100$.

4.2 Projection vs replication

In this subsection we compare the running time performance of the proposed set of algorithms, where the proximity operator is computed approximately, with state-of-the-art algorithms used to solve the equivalent formulation in the replicated space. For such a comparison we restrict to $p=2$, since many benchmark algorithms are available in the case of groups that do not overlap. In order to ensure a fair comparison, we first run some preliminary experiments to identify the fastest codes for group $\ell_1$ regularization with no overlap.

4.2.1 Comparison without overlap

Recently there has been very active research on this topic, see e.g. [40,41,13]. For the comparison, we considered three algorithms which are representative of the optimization techniques used to solve the group lasso: interior-point methods, (group) coordinate descent and its variations, and proximal methods. As an instance of the first set of techniques we employed the publicly available

Matlab code at http:// described in [1]. For coordinate descent methods, we employed the R package grplasso, which implements block coordinate gradient descent minimization for a set of possible loss functions. In the following we refer to these two algorithms as IP and BCGD, respectively. Finally, as an instance of proximal methods, we use our Matlab implementation of FISTA for Group-wise Selection, namely Algorithm 4 with FISTA instead of ISTA as the updating rule. We refer to it as PROX. In our experiments, we stop the PROX algorithm when the relative precision between two iterates is below a given threshold, i.e. when $\|x_m - x_{m-1}\| \le \nu\|x_{m-1}\|$. Though the theoretical results only guarantee convergence of the objective value, we observe in practice that the algorithm with this stopping criterion always stops on our problems.

We first observe that, on randomly generated toy examples for which the solution is easily computable, the solutions of the three algorithms coincide up to an error which depends on each algorithm's tolerance. We thus need to tune each tolerance in order to guarantee that all iterative algorithms are stopped when an approximation of the solution of the same level is obtained, namely when $\|x_m - x_{opt}\|$ is of the same order for all three algorithms. Toward this end, we run algorithm PROX with machine precision, $\nu = 10^{-16}$, in order to have a good approximation of the solution $x_{opt}$. We observe that, for many values of $n$ and $d$, and over a large range of values of $\tau$, the approximation of PROX with $\nu = 10^{-6}$ is of the same order as the approximation of IP with optparam.tol $= 10^{-9}$ and of BCGD with a correspondingly tuned tol. Note also that with these tolerances the three solutions coincide also in terms of selection, i.e. their supports are identical for each value of $\tau$. Therefore the following results correspond to optparam.tol $= 10^{-9}$ for IP, the matching tol for BCGD, and $\nu = 10^{-6}$ for PROX. For the other parameters of IP we used the values used in the demos supplied with the code.

Concerning the data generation protocol, the input variables $t = (t_1,\dots,t_n)$ are uniformly drawn from $[-1,1]^d$. The labels $y$ are computed using a noise-corrupted linear regression function, i.e. $y_j = \langle x^*, t_j\rangle + w$ for all $j\in\{1,\dots,n\}$, where $x^*$ depends on the first 30 variables, $x^*_j = c$ if $j=1,\dots,30$ and 0 otherwise, $w$ is additive noise, $w\sim N(0,1)$, and $c$ is a rescaling factor that sets the signal-to-noise ratio to 5:1. We consider the model described in Example 1. In this case the dictionary is $\psi_j(t) = t_j$ for $j=1,\dots,d$, so that the linear operator $\Psi$ can be represented by the $n\times d$ matrix with entries $\Psi_{ij} = (t_i)_j$. We then evaluate the entire regularization path for the three algorithms with $B$ sequential groups of 10 variables ($G_1 = [1,\dots,10]$, $G_2 = [11,\dots,20]$, and so on), for different values of $n$ and $B$. In order to make sure that we are working on the correct range of values for the parameter $\tau$, we first evaluate the set of solutions of PROX corresponding to a large range of 500 values of $\tau$, with $\nu = 10^{-4}$. We then determine the smallest value of $\tau$ which corresponds to selecting less than $n$ variables, $\tau_{min}$, and the smallest one returning the null solution, $\tau_{max}$. Finally we build a geometric series of 50 values between $\tau_{min}$ and $\tau_{max}$, and use it to evaluate the regularization path with the three algorithms. In order to obtain robust estimates of the running times, we repeat the experiment 20 times for each pair $(n, B)$.
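A sketch of this data generation protocol (ours, not the authors' scripts; in particular, the exact convention by which $c$ enforces the 5:1 signal-to-noise ratio is our assumption):

import numpy as np

def make_toy_data(n, d, n_relevant=30, snr=5.0, seed=None):
    """t uniform on [-1,1]^d, linear model on the first n_relevant coordinates,
    unit-variance additive noise, coefficient c rescaled to the requested SNR."""
    rng = np.random.default_rng(seed)
    T = rng.uniform(-1.0, 1.0, size=(n, d))
    x_star = np.zeros(d)
    x_star[:n_relevant] = 1.0
    signal = T @ x_star
    c = snr / signal.std()          # our assumed reading of the 5:1 ratio
    y = c * signal + rng.standard_normal(n)
    return T, c * x_star, y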

Table 1 Running time (mean ± standard deviation) in seconds for computing the entire regularization path of IP, BCGD, and PROX for different values of $B$ and $n$ (entries marked "–" are not legible in this transcription).

            B = 10        B = 100
n = 100
  IP        5.6 ± –       – ± 90
  BCGD      2.1 ± –       – ± 0.6
  PROX      0.21 ± –      – ± 0.4
n = 500
  IP        2.30 ± –      – ± 30
  BCGD      2.15 ± –      – ± 0.5
  PROX      – ± –         – ± 0.16
n = 1000
  IP        1.92 ± –      – ± 22
  BCGD      2.06 ± –      – ± 3
  PROX      – ± –         – ± 0.5

In Table 1 we report the computational times required to evaluate the entire regularization path for the three algorithms. Algorithms BCGD and PROX are always faster than IP which, due to memory reasons, cannot be applied to problems where the number of variables exceeds 5000, since it requires storing the $d\times d$ matrix $\Psi^T\Psi$. It must be said that the code for IP was made available mainly in order to allow reproducibility of the results presented in [1], and is not optimized in terms of time and memory occupation. However, it is well known that standard second-order methods are typically precluded on large data sets, since they need to solve large systems of linear equations to compute the Newton steps. PROX is the fastest for $B = 10$, 100 and behaves similarly to BCGD. The candidate benchmark algorithms for comparison with FISTA via projection are therefore BCGD and PROX. Since we are more familiar with the PROX algorithm, we compare FISTA via projection with the PROX algorithm, i.e. FISTA via replication, only.

4.2.2 Comparison with overlap

Here we compare two different implementations of the GSO-2 solution: FISTA via approximated projection, computed by solving the dual problem with the projected Newton method, and FISTA via replication. We refer to the former as FISTA-proj and to the latter as FISTA-repl. The data generation protocol is equal to the one described in the previous experiments, but $x^*$ depends on the first $(12/5)b$ variables (which correspond to the first three groups):

$$x^* = (\underbrace{c,\dots,c}_{(12/5)b \text{ times}},\ \underbrace{0,0,\dots,0}_{d-(12/5)b \text{ times}}).$$

We then define $B$ groups of size $b$, so that $\tilde d = Bb > d$. The first three groups correspond to the subset of relevant variables, and are defined as $G_1 = [1,\dots,b]$, $G_2 = [(4/5)b+1,\dots,(9/5)b]$, and $G_3 = [1,\dots,b/5,\ (8/5)b+1,\dots,(12/5)b]$, so that they have a 20% pair-wise overlap. The remaining $B-3$ groups are built by randomly drawing sets of $b$ indices from $\{1,\dots,d\}$. In the following we let $n = 10\,|G_1\cup G_2\cup G_3|$, i.e. $n$ is ten times the number of relevant variables, and vary $d$ and $b$. We also vary the number of groups $B$, so that the dimension of the space of latent variables is $\alpha$ times the input dimension, $\tilde d = \alpha d$, with $\alpha = 1.2$, 2, 5. Clearly this amounts to taking $B = \alpha d/b$. The parameter $\alpha$ can be thought of as the average number of groups a single variable belongs to. We identify the correct range of values for $\tau$ as in the previous experiments, using FISTA-proj with a loose tolerance, and then evaluate the running time and the number of iterations necessary to compute the entire regularization path for FISTA-repl on the expanded space and for FISTA-proj, both with the same value of $\nu$. Finally we repeat the experiment 20 times for each combination of the three parameters $d$, $b$, and $\alpha$.

Table 2 Running time (mean ± standard deviation) in seconds for $b=10$ (top) and $b=100$ (bottom). For each $d$ and $\alpha$, the left and right entries correspond to FISTA-proj and FISTA-repl, respectively.

Table 3 Number of iterations (mean ± standard deviation) for $b=10$ (top) and $b=100$ (bottom). For each $d$ and $\alpha$, the left and right entries correspond to FISTA-proj and FISTA-repl, respectively.

Fig. 3 Number of iterations necessary for evaluating the prox vs the number of variables ($d$), for different values of the overlap degree $\alpha$ and of the tolerance.

Running times and numbers of iterations are reported in Tables 2 and 3, respectively. When the overlap, that is $\alpha$, is low, the computational times of FISTA-repl and FISTA-proj are comparable. As $\alpha$ increases, there is a clear advantage in using FISTA-proj instead of FISTA-repl. The same behavior occurs for the number of iterations.

4.3 $p=2$ vs $p=\infty$

We generate the groups and the coefficient vector as in Subsection 4.1, with $b=10$. Differently from Subsection 4.1, here we compare the computational performance of the same algorithm applied to two different problems: Cyclic Projections for $p=2$ and Cyclic Projections for $p=\infty$, which yield different solutions, since $\mathrm{prox}_{\tau\Omega^G_2}(x) \ne \mathrm{prox}_{\tau\Omega^G_\infty}(x)$. In order to guarantee a fair comparison we consider two different values of $\tau$, $\tau_2$ and $\tau_\infty$, such that, when computing $\mathrm{prox}_{\tau_2\Omega^G_2}(x)$ and $\mathrm{prox}_{\tau_\infty\Omega^G_\infty}(x)$, all groups are active. Precisely, we take $\tau_2 = 0.8\min_{r=1,\dots,B}\|x\|_{G_r,2}$ and $\tau_\infty = 0.8\min_{r=1,\dots,B}\|x\|_{G_r,1}$. We compute the approximated solutions with the Cyclic Projections Algorithm 2 for $p=2$ and $p=\infty$. We refer to the former as CP2 and to the latter as CPinf. We stop the iteration when the relative decrease of the approximated solution is below $\varepsilon$. We consider different values for the tolerance $\varepsilon$; precisely, we take $\varepsilon = 10^{-2}, 10^{-3}, 10^{-4}$. For each value of $\alpha$ and $\varepsilon$ we estimate the number of iterations and the computing time for the two algorithms, and average over 20 repetitions. Mean and standard deviation of the number of iterations and of the computing time are plotted in Figures 3 and 4. In all conditions CP2 is much faster than CPinf.


More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations Preconditioning techniques for Newton s method for the incomressible Navier Stokes equations H. C. ELMAN 1, D. LOGHIN 2 and A. J. WATHEN 3 1 Deartment of Comuter Science, University of Maryland, College

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm Gabriel Noriega, José Restreo, Víctor Guzmán, Maribel Giménez and José Aller Universidad Simón Bolívar Valle de Sartenejas,

More information

Online homotopy algorithm for a generalization of the LASSO

Online homotopy algorithm for a generalization of the LASSO This article has been acceted for ublication in a future issue of this journal, but has not been fully edited. Content may change rior to final ublication. Online homotoy algorithm for a generalization

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

Finding a sparse vector in a subspace: linear sparsity using alternating directions

Finding a sparse vector in a subspace: linear sparsity using alternating directions IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright

More information

Applications to stochastic PDE

Applications to stochastic PDE 15 Alications to stochastic PE In this final lecture we resent some alications of the theory develoed in this course to stochastic artial differential equations. We concentrate on two secific examles:

More information

Recursive Estimation of the Preisach Density function for a Smart Actuator

Recursive Estimation of the Preisach Density function for a Smart Actuator Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator

More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

Machine Learning: Homework 4

Machine Learning: Homework 4 10-601 Machine Learning: Homework 4 Due 5.m. Monday, February 16, 2015 Instructions Late homework olicy: Homework is worth full credit if submitted before the due date, half credit during the next 48 hours,

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

PARTIAL FACE RECOGNITION: A SPARSE REPRESENTATION-BASED APPROACH. Luoluo Liu, Trac D. Tran, and Sang Peter Chin

PARTIAL FACE RECOGNITION: A SPARSE REPRESENTATION-BASED APPROACH. Luoluo Liu, Trac D. Tran, and Sang Peter Chin PARTIAL FACE RECOGNITION: A SPARSE REPRESENTATION-BASED APPROACH Luoluo Liu, Trac D. Tran, and Sang Peter Chin Det. of Electrical and Comuter Engineering, Johns Hokins Univ., Baltimore, MD 21218, USA {lliu69,trac,schin11}@jhu.edu

More information

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem An Ant Colony Otimization Aroach to the Probabilistic Traveling Salesman Problem Leonora Bianchi 1, Luca Maria Gambardella 1, and Marco Dorigo 2 1 IDSIA, Strada Cantonale Galleria 2, CH-6928 Manno, Switzerland

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

EXACTLY PERIODIC SUBSPACE DECOMPOSITION BASED APPROACH FOR IDENTIFYING TANDEM REPEATS IN DNA SEQUENCES

EXACTLY PERIODIC SUBSPACE DECOMPOSITION BASED APPROACH FOR IDENTIFYING TANDEM REPEATS IN DNA SEQUENCES EXACTLY ERIODIC SUBSACE DECOMOSITION BASED AROACH FOR IDENTIFYING TANDEM REEATS IN DNA SEUENCES Ravi Guta, Divya Sarthi, Ankush Mittal, and Kuldi Singh Deartment of Electronics & Comuter Engineering, Indian

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS

STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS Massimo Vaccarini Sauro Longhi M. Reza Katebi D.I.I.G.A., Università Politecnica delle Marche, Ancona, Italy

More information

Recent Developments in Multilayer Perceptron Neural Networks

Recent Developments in Multilayer Perceptron Neural Networks Recent Develoments in Multilayer Percetron eural etworks Walter H. Delashmit Lockheed Martin Missiles and Fire Control Dallas, Texas 75265 walter.delashmit@lmco.com walter.delashmit@verizon.net Michael

More information

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H:

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H: Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 2 October 25, 2017 Due: November 08, 2017 A. Growth function Growth function of stum functions.

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS Electronic Transactions on Numerical Analysis. Volume 44,. 53 72, 25. Coyright c 25,. ISSN 68 963. ETNA ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

The Value of Even Distribution for Temporal Resource Partitions

The Value of Even Distribution for Temporal Resource Partitions The Value of Even Distribution for Temoral Resource Partitions Yu Li, Albert M. K. Cheng Deartment of Comuter Science University of Houston Houston, TX, 7704, USA htt://www.cs.uh.edu Technical Reort Number

More information

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

NONLINEAR OPTIMIZATION WITH CONVEX CONSTRAINTS. The Goldstein-Levitin-Polyak algorithm

NONLINEAR OPTIMIZATION WITH CONVEX CONSTRAINTS. The Goldstein-Levitin-Polyak algorithm - (23) NLP - NONLINEAR OPTIMIZATION WITH CONVEX CONSTRAINTS The Goldstein-Levitin-Polya algorithm We consider an algorithm for solving the otimization roblem under convex constraints. Although the convexity

More information

Monopolist s mark-up and the elasticity of substitution

Monopolist s mark-up and the elasticity of substitution Croatian Oerational Research Review 377 CRORR 8(7), 377 39 Monoolist s mark-u and the elasticity of substitution Ilko Vrankić, Mira Kran, and Tomislav Herceg Deartment of Economic Theory, Faculty of Economics

More information

Department of Electrical and Computer Systems Engineering

Department of Electrical and Computer Systems Engineering Deartment of Electrical and Comuter Systems Engineering Technical Reort MECSE-- A monomial $\nu$-sv method for Regression A. Shilton, D.Lai and M. Palaniswami A Monomial ν-sv Method For Regression A. Shilton,

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

3 Properties of Dedekind domains

3 Properties of Dedekind domains 18.785 Number theory I Fall 2016 Lecture #3 09/15/2016 3 Proerties of Dedekind domains In the revious lecture we defined a Dedekind domain as a noetherian domain A that satisfies either of the following

More information

Convex Analysis and Economic Theory Winter 2018

Convex Analysis and Economic Theory Winter 2018 Division of the Humanities and Social Sciences Ec 181 KC Border Conve Analysis and Economic Theory Winter 2018 Toic 16: Fenchel conjugates 16.1 Conjugate functions Recall from Proosition 14.1.1 that is

More information

HENSEL S LEMMA KEITH CONRAD

HENSEL S LEMMA KEITH CONRAD HENSEL S LEMMA KEITH CONRAD 1. Introduction In the -adic integers, congruences are aroximations: for a and b in Z, a b mod n is the same as a b 1/ n. Turning information modulo one ower of into similar

More information

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1)

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1) Introduction to MVC Definition---Proerness and strictly roerness A system G(s) is roer if all its elements { gij ( s)} are roer, and strictly roer if all its elements are strictly roer. Definition---Causal

More information

Metrics Performance Evaluation: Application to Face Recognition

Metrics Performance Evaluation: Application to Face Recognition Metrics Performance Evaluation: Alication to Face Recognition Naser Zaeri, Abeer AlSadeq, and Abdallah Cherri Electrical Engineering Det., Kuwait University, P.O. Box 5969, Safat 6, Kuwait {zaery, abeer,

More information

Online Learning of Noisy Data with Kernels

Online Learning of Noisy Data with Kernels Online Learning of Noisy Data with Kernels Nicolò Cesa-Bianchi Università degli Studi di Milano cesa-bianchi@dsiunimiit Shai Shalev Shwartz The Hebrew University shais@cshujiacil Ohad Shamir The Hebrew

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP Submitted to the Annals of Statistics arxiv: arxiv:1706.07237 CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP By Johannes Tewes, Dimitris N. Politis and Daniel J. Nordman Ruhr-Universität

More information

RESOLUTIONS OF THREE-ROWED SKEW- AND ALMOST SKEW-SHAPES IN CHARACTERISTIC ZERO

RESOLUTIONS OF THREE-ROWED SKEW- AND ALMOST SKEW-SHAPES IN CHARACTERISTIC ZERO RESOLUTIONS OF THREE-ROWED SKEW- AND ALMOST SKEW-SHAPES IN CHARACTERISTIC ZERO MARIA ARTALE AND DAVID A. BUCHSBAUM Abstract. We find an exlicit descrition of the terms and boundary mas for the three-rowed

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms New Schedulability Test Conditions for Non-reemtive Scheduling on Multirocessor Platforms Technical Reort May 2008 Nan Guan 1, Wang Yi 2, Zonghua Gu 3 and Ge Yu 1 1 Northeastern University, Shenyang, China

More information

OPTIMAL AFFINE INVARIANT SMOOTH MINIMIZATION ALGORITHMS

OPTIMAL AFFINE INVARIANT SMOOTH MINIMIZATION ALGORITHMS 1 OPTIMAL AFFINE INVARIANT SMOOTH MINIMIZATION ALGORITHMS ALEXANDRE D ASPREMONT, CRISTÓBAL GUZMÁN, AND MARTIN JAGGI ABSTRACT. We formulate an affine invariant imlementation of the accelerated first-order

More information

Pretest (Optional) Use as an additional pacing tool to guide instruction. August 21

Pretest (Optional) Use as an additional pacing tool to guide instruction. August 21 Trimester 1 Pretest (Otional) Use as an additional acing tool to guide instruction. August 21 Beyond the Basic Facts In Trimester 1, Grade 8 focus on multilication. Daily Unit 1: Rational vs. Irrational

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

Period-two cycles in a feedforward layered neural network model with symmetric sequence processing

Period-two cycles in a feedforward layered neural network model with symmetric sequence processing PHYSICAL REVIEW E 75, 4197 27 Period-two cycles in a feedforward layered neural network model with symmetric sequence rocessing F. L. Metz and W. K. Theumann Instituto de Física, Universidade Federal do

More information

t 0 Xt sup X t p c p inf t 0

t 0 Xt sup X t p c p inf t 0 SHARP MAXIMAL L -ESTIMATES FOR MARTINGALES RODRIGO BAÑUELOS AND ADAM OSȨKOWSKI ABSTRACT. Let X be a suermartingale starting from 0 which has only nonnegative jums. For each 0 < < we determine the best

More information

Solving Support Vector Machines in Reproducing Kernel Banach Spaces with Positive Definite Functions

Solving Support Vector Machines in Reproducing Kernel Banach Spaces with Positive Definite Functions Solving Suort Vector Machines in Reroducing Kernel Banach Saces with Positive Definite Functions Gregory E. Fasshauer a, Fred J. Hickernell a, Qi Ye b, a Deartment of Alied Mathematics, Illinois Institute

More information

POINTS ON CONICS MODULO p

POINTS ON CONICS MODULO p POINTS ON CONICS MODULO TEAM 2: JONGMIN BAEK, ANAND DEOPURKAR, AND KATHERINE REDFIELD Abstract. We comute the number of integer oints on conics modulo, where is an odd rime. We extend our results to conics

More information

1-way quantum finite automata: strengths, weaknesses and generalizations

1-way quantum finite automata: strengths, weaknesses and generalizations 1-way quantum finite automata: strengths, weaknesses and generalizations arxiv:quant-h/9802062v3 30 Se 1998 Andris Ambainis UC Berkeley Abstract Rūsiņš Freivalds University of Latvia We study 1-way quantum

More information

Research of PMU Optimal Placement in Power Systems

Research of PMU Optimal Placement in Power Systems Proceedings of the 5th WSEAS/IASME Int. Conf. on SYSTEMS THEORY and SCIENTIFIC COMPUTATION, Malta, Setember 15-17, 2005 (38-43) Research of PMU Otimal Placement in Power Systems TIAN-TIAN CAI, QIAN AI

More information

Approximate Dynamic Programming for Dynamic Capacity Allocation with Multiple Priority Levels

Approximate Dynamic Programming for Dynamic Capacity Allocation with Multiple Priority Levels Aroximate Dynamic Programming for Dynamic Caacity Allocation with Multile Priority Levels Alexander Erdelyi School of Oerations Research and Information Engineering, Cornell University, Ithaca, NY 14853,

More information

arxiv: v2 [stat.me] 3 Nov 2014

arxiv: v2 [stat.me] 3 Nov 2014 onarametric Stein-tye Shrinkage Covariance Matrix Estimators in High-Dimensional Settings Anestis Touloumis Cancer Research UK Cambridge Institute University of Cambridge Cambridge CB2 0RE, U.K. Anestis.Touloumis@cruk.cam.ac.uk

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

Almost 4000 years ago, Babylonians had discovered the following approximation to. x 2 dy 2 =1, (5.0.2)

Almost 4000 years ago, Babylonians had discovered the following approximation to. x 2 dy 2 =1, (5.0.2) Chater 5 Pell s Equation One of the earliest issues graled with in number theory is the fact that geometric quantities are often not rational. For instance, if we take a right triangle with two side lengths

More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

s v 0 q 0 v 1 q 1 v 2 (q 2) v 3 q 3 v 4

s v 0 q 0 v 1 q 1 v 2 (q 2) v 3 q 3 v 4 Discrete Adative Transmission for Fading Channels Lang Lin Λ, Roy D. Yates, Predrag Sasojevic WINLAB, Rutgers University 7 Brett Rd., NJ- fllin, ryates, sasojevg@winlab.rutgers.edu Abstract In this work

More information