Solving Constrained Lasso and Elastic Net Using ν-SVMs

Carlos M. Alaíz, Alberto Torres and José R. Dorronsoro

Universidad Autónoma de Madrid - Departamento de Ingeniería Informática
Tomás y Valiente 11, 28049 Madrid - Spain

Abstract. Many important linear sparse models have at their core the Lasso problem, for which the glmnet algorithm is often considered as the current state of the art. Recently M. Jaggi has observed that Constrained Lasso (CL) can be reduced to an SVM-like problem, which opens the way to use efficient SVM algorithms to solve CL. We will refine Jaggi's arguments to reduce CL as well as constrained Elastic Net to a Nearest Point Problem and show experimentally that the well known LIBSVM library results in a faster convergence than glmnet for small problems and also, if properly adapted, for larger ones.

1 Introduction

Big Data problems are putting a strong emphasis on using simple linear sparse models to handle samples of large size and dimension. This has made Lasso [1] and Elastic Net [2] often the methods of choice in large scale Machine Learning. Both are usually stated in their unconstrained version of solving

    \min_\beta U_{\lambda,\mu}(\beta) = \tfrac{1}{2}\|X\beta - y\|_2^2 + \tfrac{\mu}{2}\|\beta\|_2^2 + \lambda\|\beta\|_1,    (1)

where for an N-size sample S = {(x_1, y^1), ..., (x_N, y^N)}, X is an N × d data matrix containing as its rows the transposes of the d-dimensional features x_n, y = (y^1, ..., y^N)^t is the N × 1 target vector, β denotes the Lasso (when µ = 0) or Elastic Net model coefficients, and the subscripts 1, 2 denote the ℓ1 and ℓ2 norms respectively. We will refer to problem (1) as λ-Unconstrained Lasso (or λ-UL), writing then simply U_λ, or as (µ, λ)-Unconstrained Elastic Net (or (µ, λ)-UEN). It has an equivalent constrained formulation

    \min_\beta C_{\rho,\mu}(\beta) = \tfrac{1}{2}\|X\beta - y\|_2^2 + \tfrac{\mu}{2}\|\beta\|_2^2 \quad \text{s.t.}\ \|\beta\|_1 \le \rho.    (2)

Again, we will refer to problem (2) as ρ-CL, and write C_ρ, or as (µ, ρ)-CEN. Two algorithmic approaches stand out when solving (1): the glmnet algorithm [3], which uses cyclic coordinate descent, and the FISTA algorithm [4], based on proximal optimization. Recently M.
Jaggi [5] has shown the equivalence between SVMs and

(*) With partial support from Spain's grants TIN2013-42351-P and S2013/ICE-2845 CASI-CAM-CM and also of the Cátedra UAM-ADIC in Data Science and Machine Learning. The authors also gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at UAM. The second author is also supported by the FPU-MEC grant AP-2012-5163.
constrained Lasso and some variants. This is relatively simple when going from Lasso to SVM (the direction we are interested in) but more involved the other way around. In [6] Jaggi's approach is refined to obtain a one way reduction from constrained Elastic Net to a squared hinge loss SVM problem. As mentioned in [5], this opens the way for advances in one problem to be translated into advances in the other and, while the application of SVM solvers to Lasso is not addressed in [5], prior work in parallelizing SVMs is leveraged in [6] to obtain a highly optimized and parallel SVM solver for the Elastic Net and Lasso. In this paper we retake M. Jaggi's approach, first to simplify and refine the reductions in [5] and [6] and, then, to explore the efficiency of the SVM solver LIBSVM when applied to Lasso. More precisely, our contributions are:
- A simple reduction in Sect. 2 of Elastic Net to Lasso and a proof of the equivalence between λ-UL and ρ-CL. This is a known result but proofs are hard to find, so our simple argument may have a value in itself.
- A refinement in Sect. 3 of M. Jaggi's arguments to reduce ρ-CL to Nearest Point and ν-SVC problems solvable using LIBSVM.
- An experimental comparison in Sect. 4 of NPP-LIBSVM with glmnet and FISTA, showing NPP-LIBSVM to be competitive with glmnet.
The paper will close with a brief discussion and pointers to further work.

2 Unconstrained and Constrained Lasso and Elastic Net

We show that λ-UL and ρ-CL, and also (µ, λ)-UEN and (µ, ρ)-CEN, have the same β solution for appropriate choices of λ and ρ. We will first reduce our discussion to the Lasso case, describing how Elastic Net (EN) can be recast as a pure Lasso problem over an enlarged data matrix and target vector. For this let us consider in either problem (1) or (2) the (N + d) × d matrix X̃ and (N + d) × 1 vector Ỹ defined as

    \tilde{X}^t = (X^t, \sqrt{\mu}\, I_d), \qquad \tilde{Y}^t = (y^t, 0_d^t),

with I_d the d × d identity matrix and 0_d the d-dimensional zero vector.
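The effect of this augmentation can be checked numerically. The following NumPy sketch (random data; the sizes N, d and the value of µ are purely illustrative) verifies that the Elastic Net square loss on (X, y) equals the plain square loss on the augmented pair (X̃, Ỹ):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, mu = 20, 5, 0.3                  # illustrative sizes and EN parameter
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
beta = rng.standard_normal(d)

# Augmented (N + d) x d matrix and (N + d) target: stack sqrt(mu) * I_d below X
X_tilde = np.vstack([X, np.sqrt(mu) * np.eye(d)])
Y_tilde = np.concatenate([y, np.zeros(d)])

lhs = np.linalg.norm(X @ beta - y) ** 2 + mu * np.linalg.norm(beta) ** 2
rhs = np.linalg.norm(X_tilde @ beta - Y_tilde) ** 2
assert np.isclose(lhs, rhs)            # EN loss equals Lasso loss on the extended sample
```

Note that the ℓ1 penalty is untouched by the augmentation, which is what allows any Lasso solver to handle the EN problem over the extended sample.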
It is now easy to check that ‖Xβ − y‖_2^2 + µ‖β‖_2^2 = ‖X̃β − Ỹ‖_2^2 and, thus, we can rewrite (1) or (2) as Lasso problems over an extended sample S̃ = {(x_1, y^1), ..., (x_N, y^N), (√µ e_1, 0), ..., (√µ e_d, 0)} of size N + d, where e_i denotes the canonical vectors of R^d, and for which X̃ and Ỹ are the new data matrix and target vector. Because of this, in what follows we limit our discussion to UL and CL, pointing out where needed the changes to be made for UEN and CEN, and reverting to the (X, y) notation instead of (X̃, Ỹ). Assuming λ fixed, it is obvious that a minimizer β_λ of λ-UL is also a minimizer of ρ_λ-CL with ρ_λ = ‖β_λ‖_1. Next, if β_lr is the solution of Linear Regression and ρ_lr = ‖β_lr‖_1, the solution of ρ-CL is clearly β_lr for ρ ≥ ρ_lr, so we assume ρ < ρ_lr. Let us denote by β_ρ a minimum of the convex problem ρ-CL; we shall
prove next that β_ρ solves λ_ρ-UL with λ_ρ = −β_ρ^t X^t (Xβ_ρ − y)/ρ. To do so, let e(β) = ½‖Xβ − y‖_2^2 and g(β) = ‖β‖_1 − ρ, and define f(β) = max{e(β) − e_ρ, g(β)}, where e_ρ = e(β_ρ) is the optimal square error in ρ-CL. Then, since f is convex and both terms in the max vanish at β_ρ, the subgradient of f at β_ρ is

    \partial f(\beta_\rho) = \{\gamma \nabla e(\beta_\rho) + (1 - \gamma) h : 0 \le \gamma \le 1,\ h \in \partial g(\beta_\rho) = \partial \|\cdot\|_1(\beta_\rho)\}.

Thus, as β_ρ minimizes f (any β′ with ‖β′‖_1 < ρ will have an error greater than e_ρ), there is an h_ρ ∈ ∂‖·‖_1(β_ρ) and a γ, 0 ≤ γ ≤ 1, such that 0 = γ∇e(β_ρ) + (1 − γ)h_ρ. But γ > 0, for otherwise we would have h_ρ = 0, i.e., β_ρ = 0, contradicting that ‖β_ρ‖_1 = ρ. Therefore, taking λ_ρ = (1 − γ)/γ, we have 0 = ∇e(β_ρ) + λ_ρ h_ρ ∈ ∂[e(·) + λ_ρ‖·‖_1](β_ρ) = ∂U_{λ_ρ}(β_ρ), i.e., β_ρ minimizes λ_ρ-UL. We finally derive a value for λ_ρ by observing that β_ρ^t h_ρ = ‖β_ρ‖_1 = ρ and 0 = ∇e(β_ρ) + λ_ρ h_ρ = X^t r_{β_ρ} + λ_ρ h_ρ, where r_{β_ρ} is just the residual Xβ_ρ − y. As a result,

    \lambda_\rho \|\beta_\rho\|_1 = \lambda_\rho \beta_\rho^t h_\rho = -\beta_\rho^t X^t r_{\beta_\rho} \;\Longrightarrow\; \lambda_\rho = -\frac{\beta_\rho^t X^t r_{\beta_\rho}}{\|\beta_\rho\|_1} = -\frac{\beta_\rho^t X^t r_{\beta_\rho}}{\rho}.

For (µ, ρ)-CEN, the corresponding λ_ρ would be

    \lambda_\rho = \frac{(\tilde{Y} - \tilde{X}\beta_\rho)^t \tilde{X}\beta_\rho}{\rho} = \frac{(y - X\beta_\rho)^t X\beta_\rho - \mu\|\beta_\rho\|_2^2}{\rho}.

3 From Constrained Lasso to NPP

The basic idea in [5] is to re-scale β′ = β/ρ and y′ = y/ρ to normalize ρ-CL into 1-CL and then to replace the ℓ1 unit ball B_1 with the simplex

    \Delta_{2d} = \{\alpha = (\alpha_1, \ldots, \alpha_{2d}) \in \mathbb{R}^{2d} : 0 \le \alpha_i \le 1,\ \textstyle\sum_{i=1}^{2d} \alpha_i = 1\},

by enlarging the initial data matrix X to an N × 2d dimensional X̂ with columns X̂^c = X^c and X̂^{d+c} = −X^c, c = 1, ..., d. As a consequence, the square error in the re-scaled Lasso ‖Xβ′ − y′‖_2^2 becomes ‖X̂α − y′‖_2^2. Since X̂α now lies in the convex hull C spanned by {±X^c, 1 ≤ c ≤ d}, finding an optimal α^o is equivalent to finding the point in C closest to y′, i.e., solving the nearest point problem (NPP) between the N-dimensional convex hull C and the singleton {y′}. In turn, NPP can be scaled [7] into an equivalent linear ν-SVC problem [8] with ν = 2/(2d + 1), over the sample S = {S_+, S_−}, where S_+ = {X^1, ..., X^d, −X^1, ..., −X^d} and S_− = {y′}. This problem can be solved using the LIBSVM library [9].
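The reduction above can be prototyped directly, without the ν-SVC machinery. In the sketch below (plain NumPy; a basic Frank-Wolfe iteration over the simplex stands in for the SMO-type solver of LIBSVM, and all sizes are illustrative) we build X̂ from the ±X^c columns, minimize ‖X̂α − y′‖_2^2 over the simplex and map α back to β:

```python
import numpy as np

def constrained_lasso_fw(X, y, rho, iters=5000):
    """Approximately solve min ||X b - y||^2 s.t. ||b||_1 <= rho via the
    simplex reformulation, using Frank-Wolfe instead of an SMO solver."""
    N, d = X.shape
    Xhat = np.hstack([X, -X])                 # N x 2d: columns X^c and -X^c
    yp = y / rho                              # re-scaled target y' = y / rho
    alpha = np.full(2 * d, 1.0 / (2 * d))     # start at the simplex barycenter
    for k in range(iters):
        grad = Xhat.T @ (Xhat @ alpha - yp)
        i = int(np.argmin(grad))              # best simplex vertex e_i
        step = 2.0 / (k + 2)                  # standard Frank-Wolfe step size
        alpha *= 1.0 - step
        alpha[i] += step                      # alpha stays in the simplex
    beta_bar = alpha[:d] - alpha[d:]          # (beta_bar)_i = alpha_i - alpha_{i+d}
    return rho * beta_bar                     # undo the re-scaling

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
y = rng.standard_normal(30)
beta = constrained_lasso_fw(X, y, rho=0.8)
assert np.abs(beta).sum() <= 0.8 + 1e-8      # the l1 constraint holds by construction
```

Since the simplex constraint forces Σα_i = 1 and α ≥ 0, the recovered β automatically satisfies ‖β‖_1 ≤ ρ; the SMO-based ν-SVC route of the paper solves the same simplex problem, only much more efficiently.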
The optimal ρ-cl solution βρ is reovered as follows: first we obtain the optimal NPP oeffiients αo by saling the ν- solution γ o as αo = (d + 1)γ o, then o we ompute (β ρ )i = αio αi+d and finally we re-sale again βρ = ρβ ρ. Finally, hanges for (µ, ρ)-cen are straightforward, sine we only have to add the d extra dimensions µe and µe to the olumn vetors X. 69
4 Numerical Experiments

In this section we will compare the performance of the LIBSVM approach for solving ρ-CL with two well known, state-of-the-art algorithms for λ-UL, FISTA and glmnet. We first discuss their expected complexity. FISTA (Fast ISTA; [4]) combines the basic iterations of the Iterative Shrinkage Thresholding Algorithm (ISTA) with a Nesterov acceleration step. Assuming that the covariance X^t X is precomputed at a fixed initial cost O(N d^2), the cost per iteration of FISTA is O(d^2), i.e., that of computing X^t Xβ. If only m components of β are non zero, the iteration cost is O(dm). On the other hand, glmnet performs cyclic coordinate subgradient descent on the λ-UL cost function. glmnet carefully manages the covariance computations X^j · X^k, ensuring that there is a cost O(N d) only the first time they are performed at a given coordinate k and that afterwards their cost is just O(d). Note that an iteration in glmnet just changes one coordinate while in FISTA it changes d components. Finally, the ν-SVC solver in LIBSVM performs SMO-like iterations that change two coordinates, the pair that most violates the ν-SVC KKT conditions. The cost per iteration is also O(d) plus the time needed to compute the required dot products, i.e., linear kernel operations. LIBSVM builds the kernel matrix making no assumptions on the particular structure of the data. This may lead to eventually computing 4d^2 dot products (without considering the last column) even though only d^2 are actually needed, since the 2d × 2d dimensional linear kernel matrix is made of four d × d blocks with the covariance matrix C = X^t X in the two diagonal blocks and −C in the other two. This can be certainly controlled but in our experiments we have directly used LIBSVM, without any attempt to optimize running times exploiting the structure of the kernel matrix. We compare next the number of iterations and running times that FISTA, glmnet and LIBSVM need to solve equivalent λ-UL and ρ-CL problems.
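The redundancy just described comes from the block structure of the linear kernel matrix. A minimal NumPy sketch (illustrative sizes) of how an adapted solver could assemble the full 2d × 2d kernel from a single d × d covariance block, instead of recomputing all signed dot products:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 6
X = rng.standard_normal((N, d))

# Naive route: dot products between all 2d signed columns (about 4 d^2 products).
Xhat = np.hstack([X, -X])
K_naive = Xhat.T @ Xhat

# Structured route: compute the d x d covariance C = X^t X once and tile it,
# since the kernel of the signed columns is [[C, -C], [-C, C]].
C = X.T @ X
K_blocks = np.block([[C, -C], [-C, C]])

assert np.allclose(K_naive, K_blocks)   # same 2d x 2d kernel, ~1/4 of the work
```

This is only the storage/assembly side of the adaptation suggested in the text; a full adapted solver would also have to route LIBSVM's kernel lookups through the single block C.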
We will do so over four datasets from the UCI collection, namely the relatively small prostate and housing, and the larger year and ctscan. As it is well known, training complexity greatly depends on λ. We will consider three possible λ values for UL: an optimal λ* obtained according to the regularization path procedure of [3], a smaller λ*/2 value (that should result in longer training) and a stricter penalty 2λ* value. The optimal λ* values for the problems considered are 2.04 × 10^-3, 8.19 × 10^-3, 6.87 × 10^-3 and 3.75 × 10^-3, respectively. The corresponding ρ parameters are computed as ρ = ‖β_λ‖_1, with β_λ the optimal solution of λ-UL. To make a balanced comparison, for each λ and dataset we first make a long run of each method M so that it converges to a β*_M that we take as the optimum. We then compare for each M the evolution of f_M(β^k_M) − f_M(β*_M), with β^k_M the coefficients at the k-th iteration of method M, until it is smaller than a threshold that we take as 10^-12 for prostate and housing and 10^-6 for year and ctscan. Table 1 shows in columns 2 to 4 the number of iterations that each method requires to arrive at the threshold. As it can be seen, LIBSVM is the fastest on this account, with glmnet in second place and FISTA a more distant third (for a proper comparison a FISTA iterate is made to correspond to
Dataset          |        Iterations          |    Time (ms)
                 |  LIBSVM   glmnet    FISTA  |  LIBSVM    glmnet
prostate (λ*)    |      86      181      1,3  |   0.051     0.053
prostate (2λ*)   |      75      174    1,336  |   0.049     0.057
prostate (λ*/2)  |       5      173      904  |   0.036     0.075
housing (λ*)     |     143    1,010    5,538  |    0.66      0.83
housing (2λ*)    |      11      933    5,538  |    0.05     0.749
housing (λ*/2)   |      88      666    3,913  |   0.190      0.50
year (λ*)        |     760    5,584    8,550  |  1,58.9      558.
year (2λ*)       |     576    6,13     8,460  | 1,371.8     590.7
year (λ*/2)      |      39    6,393    8,640  | 1,140.6      66.0
ctscan (λ*)      |      97    9,390  190,080  | 3,935.4    8,83.9
ctscan (2λ*)     |     670   78,456  159,744  | 15,868.   6,935.8
ctscan (λ*/2)    |     418   53,630   13,480  | 14,999.5  4,173.7

Table 1: Iterations and running times.

d iterates of LIBSVM and glmnet). Columns 5 and 6 give the times required by LIBSVM and glmnet; we omit FISTA's times as they are not competitive (at least under the implementation we used). We use the LIBSVM and glmnet implementations in the Scikit-learn Python library, which both have a compiled C core, so we may expect time comparisons to be broadly homogeneous. Even without considering LIBSVM's overhead when computing a 2d × 2d kernel matrix, it is clearly faster on prostate and housing but not so on the other two datasets. However, it turns out that LIBSVM indeed computes about 4 times more dot products than needed, which suggests that a LIBSVM version properly adapted for ρ-CL should have running times about a fourth of those reported here, outperforming then glmnet. This behavior is further illustrated in Fig. 1, which depicts, for housing and ctscan with λ*, the number of iterations and running times required to reach the threshold.

5 Discussion and Conclusions

glmnet can be considered the current state of the art to solve the Lasso problem. M. Jaggi's recent observation that ρ-CL can be reduced to ν-SVC opens the way to the application of SVM methods. In this work we have shown using four examples how the ν-SVC option of LIBSVM is faster than glmnet for small problems and pointed out how adapted versions could also beat it on larger problems. Devising such an adaptation is thus a first further line of work.
Moreover, Lasso is at the core of many other problems in convex regularization, such as Fused Lasso, wavelet smoothing or trend filtering, currently solved using specialized algorithms. A ν-SVC approach could provide for them the same faster convergence that we have illustrated here for standard Lasso. Furthermore, the ν-SVC training could be sped up, as suggested in [5], using the screening rules available for Lasso [10] in order to remove columns of the data matrix. We are working on these and other related questions.
[Fig. 1 consists of four panels (Iterations and Time for housing, Iterations and Time for ctscan) plotting the objective gap of each method against iterations and against time in ms.]

Fig. 1: Results for housing and ctscan.

References

[1] R. Tibshirani. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.
[2] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.
[3] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1-22, 2010.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183-202, 2009.
[5] M. Jaggi. An Equivalence between the Lasso and Support Vector Machines. In Regularization, Optimization, Kernels, and Support Vector Machines. CRC Press, 2014.
[6] Q. Zhou, W. Chen, S. Song, J. R. Gardner, K. Q. Weinberger, and Y. Chen. A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing. Technical Report arXiv:1409.1976 [stat.ML], 2014.
[7] J. López, A. Barbero, and J. R. Dorronsoro. Clipping Algorithms for Solving the Nearest Point Problem over Reduced Convex Hulls. Pattern Recognition, 44(3):607-614, 2011.
[8] C.-C. Chang and C.-J. Lin. Training ν-Support Vector Classifiers: Theory and Algorithms. Neural Computation, 13(9):2119-2147, 2001.
[9] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] J. Wang, J. Zhou, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1070-1078. Curran Associates, Inc., 2013.