Solving Constrained Lasso and Elastic Net Using ν-SVMs

Carlos M. Alaíz, Alberto Torres and José R. Dorronsoro
Universidad Autónoma de Madrid - Departamento de Ingeniería Informática
Tomás y Valiente 11, 28049 Madrid - Spain

Abstract. Many important linear sparse models have at their core the Lasso problem, for which the glmnet algorithm is often considered the current state of the art. Recently M. Jaggi has observed that Constrained Lasso (CL) can be reduced to an SVM-like problem, which opens the way to using efficient SVM algorithms to solve CL. We will refine Jaggi's arguments to reduce CL, as well as constrained Elastic Net, to a Nearest Point Problem, and show experimentally that the well-known LIBSVM library results in faster convergence than glmnet for small problems and also, if properly adapted, for larger ones.

(With partial support from Spain's grants TIN2013-42351-P and S2013/ICE-2845 CASI-CAM-CM and also of the Cátedra UAM-ADIC in Data Science and Machine Learning. The authors also gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at UAM. The second author is also supported by the FPU MEC grant AP-2012-5163.)

1 Introduction

Big Data problems are putting a strong emphasis on using simple linear sparse models to handle samples that are large both in size and in dimension. This has made Lasso [1] and Elastic Net [2] often the methods of choice in large scale Machine Learning. Both are usually stated in their unconstrained version of solving

    min_β U_{λ,µ}(β) = (1/2) ‖Xβ − y‖² + (µ/2) ‖β‖² + λ ‖β‖_1,        (1)

where for an N-size sample S = {(x_1, y_1), ..., (x_N, y_N)}, X is the N × d data matrix containing as its rows the transposes of the d-dimensional features x_n, y = (y_1, ..., y_N)^t is the N × 1 target vector, β denotes the Lasso (when µ = 0) or Elastic Net model coefficients, and the subscripts 1, 2 denote the ℓ_1 and ℓ_2 norms respectively. We will refer to problem (1) as λ-Unconstrained Lasso (or λ-UL), writing then simply U_λ, or as (µ, λ)-Unconstrained Elastic Net (or (µ, λ)-UEN). It has an equivalent constrained formulation

    min_β C_{ρ,µ}(β) = (1/2) ‖Xβ − y‖² + (µ/2) ‖β‖²   s.t.   ‖β‖_1 ≤ ρ.        (2)

Again, we will refer to problem (2) as ρ-CL, writing C_ρ, or as (µ, ρ)-CEN.
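For concreteness, the two objectives can be written directly as NumPy functions. This is only an illustrative sketch (the names and data are placeholders, not the paper's notation or code), and it also records the fact, established in Sect. 2 and used again in Sect. 4, that a minimizer β_λ of (1) corresponds to the constraint radius ρ = ‖β_λ‖_1 in (2).

```python
import numpy as np

def U(beta, X, y, lam, mu=0.0):
    # Unconstrained objective (1): 0.5*||X beta - y||^2 + 0.5*mu*||beta||^2 + lam*||beta||_1
    r = X @ beta - y
    return 0.5 * (r @ r) + 0.5 * mu * (beta @ beta) + lam * np.abs(beta).sum()

def C(beta, X, y, mu=0.0):
    # Constrained objective (2); the feasibility ||beta||_1 <= rho is checked separately.
    r = X @ beta - y
    return 0.5 * (r @ r) + 0.5 * mu * (beta @ beta)

def equivalent_rho(beta_lambda):
    # For a minimizer beta_lambda of (1), the equivalent l1 radius in (2) is its l1 norm.
    return np.abs(beta_lambda).sum()
```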

Two algorithmic approaches stand out when solving (1): the glmnet algorithm [3], which uses cyclic coordinate descent, and the FISTA algorithm [4], based on proximal optimization. Recently M. Jaggi [5] has shown the equivalence between SVMs and constrained Lasso and some variants. This is relatively simple when going from Lasso to SVMs (the direction we are interested in) but more involved the other way around. In [6] Jaggi's approach is refined to obtain a one-way reduction from constrained Elastic Net to a squared hinge loss SVM problem. As mentioned in [5], this opens the way for advances in one problem to be translated into advances in the other and, while the application of SVM solvers to Lasso is not addressed in [5], prior work in parallelizing SVMs is leveraged in [6] to obtain a highly optimized and parallel solver for the Elastic Net and Lasso.

In this paper we retake M. Jaggi's approach, first to simplify and refine the reductions in [5] and [6] and, then, to explore the efficiency of the SVM solver LIBSVM when applied to Lasso. More precisely, our contributions are:

- A simple reduction in Sect. 2 of Elastic Net to Lasso and a proof of the equivalence between λ-UL and ρ-CL. This is a known result but proofs are hard to find, so our simple argument may have a value in itself.
- A refinement in Sect. 3 of M. Jaggi's arguments to reduce ρ-CL to Nearest Point and ν-SVC problems solvable using LIBSVM.
- An experimental comparison in Sect. 4 of NPP-LIBSVM with glmnet and FISTA, showing NPP-LIBSVM to be competitive with glmnet.

The paper will close with a brief discussion and pointers to further work.

2 Unconstrained and Constrained Lasso and Elastic Net

We show that λ-UL and ρ-CL, and also (µ, λ)-UEN and (µ, ρ)-CEN, have the same solution β for appropriate choices of λ and ρ. We will first reduce our discussion to the Lasso case, describing how Elastic Net (EN) can be recast as a pure Lasso problem over an enlarged data matrix and target vector. For this, let us consider in either problem (1) or (2) the (N + d) × d matrix X̃ and the (N + d) × 1 vector Ỹ defined as X̃^t = (X^t, √µ I_d), Ỹ^t = (y^t, 0_d^t), with I_d the d × d identity matrix and 0_d the d-dimensional zero vector. It is now easy to check that ‖Xβ − y‖² + µ‖β‖² = ‖X̃β − Ỹ‖² and, thus, we can rewrite (1) or (2) as Lasso problems over an extended sample S̃ = {(x_1, y_1), ..., (x_N, y_N), (√µ e_1, 0), ..., (√µ e_d, 0)} of size N + d, where e_i denotes the canonical vectors of R^d, and for which X̃ and Ỹ are the new data matrix and target vector. Because of this, in what follows we limit our discussion to UL and CL, pointing out where needed the changes to be made for UEN and CEN, and reverting to the (X, y) notation instead of (X̃, Ỹ).
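The recast of EN as a Lasso over the augmented pair (X̃, Ỹ) is easy to check numerically. The following sketch (with made-up data, not taken from the paper) builds the augmented matrix and verifies the identity ‖Xβ − y‖² + µ‖β‖² = ‖X̃β − Ỹ‖².

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, mu = 50, 10, 0.3                      # illustrative sizes and EN parameter
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
beta = rng.standard_normal(d)

# Augmented data matrix and target: X_tilde^t = (X^t, sqrt(mu) I_d), Y_tilde^t = (y^t, 0_d^t).
X_tilde = np.vstack([X, np.sqrt(mu) * np.eye(d)])
Y_tilde = np.concatenate([y, np.zeros(d)])

lhs = np.sum((X @ beta - y) ** 2) + mu * np.sum(beta ** 2)
rhs = np.sum((X_tilde @ beta - Y_tilde) ** 2)
assert np.allclose(lhs, rhs)                # EN quadratic term equals the augmented Lasso error
```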

Assuming λ fixed, it is obvious that a minimizer β_λ of λ-UL is also a minimizer of ρ_λ-CL with ρ_λ = ‖β_λ‖_1. Next, if β_LR is the solution of Linear Regression and ρ_LR = ‖β_LR‖_1, the solution of ρ-CL is clearly β_LR for ρ ≥ ρ_LR, so we assume ρ < ρ_LR. Let us denote by β_ρ a minimum of the convex problem ρ-CL; since ρ < ρ_LR the constraint is active, i.e., ‖β_ρ‖_1 = ρ. We shall prove next that β_ρ solves λ_ρ-UL with λ_ρ = −β_ρ^t X^t (Xβ_ρ − y) / ρ. To do so, let e(β) = (1/2)‖Xβ − y‖² and g(β) = ‖β‖_1 − ρ, and define f(β) = max{e(β) − e_ρ, g(β)}, where e_ρ = e(β_ρ) is the optimal square error in ρ-CL. Then, since f is convex and both terms attain the maximum at β_ρ (e(β_ρ) − e_ρ = 0 = g(β_ρ)), the subdifferential of f at β_ρ is

    ∂f(β_ρ) = {γ ∇e(β_ρ) + (1 − γ) h : 0 ≤ γ ≤ 1, h ∈ ∂g(β_ρ) = ∂‖·‖_1(β_ρ)}.

Thus, as β_ρ minimizes f (indeed f(β_ρ) = 0, any β′ with ‖β′‖_1 ≤ ρ has e(β′) ≥ e_ρ, and any β′ with ‖β′‖_1 > ρ has g(β′) > 0), there are an h_ρ ∈ ∂‖·‖_1(β_ρ) and a γ, 0 ≤ γ ≤ 1, such that 0 = γ ∇e(β_ρ) + (1 − γ) h_ρ. But γ > 0, for otherwise we would have h_ρ = 0, i.e., β_ρ = 0, contradicting that ‖β_ρ‖_1 = ρ. Therefore, taking λ_ρ = (1 − γ)/γ, we have

    0 = ∇e(β_ρ) + λ_ρ h_ρ ∈ ∂[e(·) + λ_ρ ‖·‖_1](β_ρ) = ∂U_{λ_ρ}(β_ρ),

i.e., β_ρ minimizes λ_ρ-UL. We finally derive a value for λ_ρ by observing that β_ρ^t h_ρ = ‖β_ρ‖_1 = ρ and 0 = ∇e(β_ρ) + λ_ρ h_ρ = X^t r_{β_ρ} + λ_ρ h_ρ, where r_{β_ρ} is just the residual (Xβ_ρ − y). As a result,

    λ_ρ ‖β_ρ‖_1 = λ_ρ β_ρ^t h_ρ = −β_ρ^t X^t r_{β_ρ},   i.e.,   λ_ρ = −β_ρ^t X^t r_{β_ρ} / ‖β_ρ‖_1 = −β_ρ^t X^t r_{β_ρ} / ρ.

For (µ, ρ)-CEN, the corresponding λ_ρ would be

    λ_ρ = (Ỹ − X̃β_ρ)^t X̃β_ρ / ρ = [(y − Xβ_ρ)^t Xβ_ρ − µ‖β_ρ‖²] / ρ.

3 From Constrained Lasso to NPP

The basic idea in [5] is to re-scale β′ = β/ρ and y′ = y/ρ to normalize ρ-CL into 1-CL and then to replace the ℓ_1 unit ball B_1 with the simplex

    Δ_{2d} = {α = (α_1, ..., α_{2d}) ∈ R^{2d} : 0 ≤ α_i ≤ 1, Σ_{i=1}^{2d} α_i = 1},

by enlarging the initial data matrix X to an N × 2d dimensional X̂ with columns X̂_c = X_c and X̂_{d+c} = −X_c, c = 1, ..., d. As a consequence, the square error in the re-scaled Lasso ‖Xβ′ − y′‖² becomes ‖X̂α − y′‖². Since X̂α now lies in the convex hull C spanned by {X̂_c, 1 ≤ c ≤ 2d}, finding an optimal α° is equivalent to finding the point in C closest to y′, i.e., solving the nearest point problem (NPP) between the N-dimensional convex hull C and the singleton {y′}. In turn, NPP can be scaled [7] into an equivalent linear ν-SVC problem [8] with ν = 2/(2d + 1), over the sample S = {S₊, S₋}, where S₊ = {X_1, ..., X_d, −X_1, ..., −X_d} and S₋ = {y′}. This problem can be solved using the LIBSVM library [9]. The optimal ρ-CL solution β_ρ is recovered as follows: first we obtain the optimal NPP coefficients α° by scaling the ν-SVC solution γ° as α° = (2d + 1) γ°; then we compute (β′_ρ)_i = α°_i − α°_{i+d} and finally we re-scale again β_ρ = ρ β′_ρ. Finally, the changes for (µ, ρ)-CEN are straightforward, since we only have to add the d extra dimensions √µ e_c and −√µ e_c to the column vectors X̂_c.
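To make the reduction concrete, the sketch below carries out the column doubling and the coefficient recovery, but solves the resulting NPP with a generic simplex-constrained least-squares solver (SciPy's SLSQP) rather than with the ν-SVC route through LIBSVM used in the paper; the data and the solver choice are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, d, rho = 40, 8, 1.5                      # illustrative problem sizes and l1 budget
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Re-scale to 1-CL and double the columns: X_hat = [X, -X], y_prime = y / rho.
X_hat = np.hstack([X, -X])                  # N x 2d
y_prime = y / rho

# NPP: nearest point to y_prime in the convex hull of the 2d columns of X_hat,
# i.e. min_alpha ||X_hat @ alpha - y_prime||^2 over the simplex Delta_{2d}.
def sq_err(alpha):
    r = X_hat @ alpha - y_prime
    return r @ r

cons = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]
bnds = [(0.0, 1.0)] * (2 * d)
alpha0 = np.full(2 * d, 1.0 / (2 * d))
res = minimize(sq_err, alpha0, method="SLSQP", bounds=bnds, constraints=cons)

# Recover the rho-CL solution: (beta')_i = alpha_i - alpha_{i+d}, then re-scale by rho.
alpha = res.x
beta_rho = rho * (alpha[:d] - alpha[d:])
assert np.abs(beta_rho).sum() <= rho + 1e-6  # l1 constraint of (2) is satisfied
```

In the paper the same NPP is instead rescaled into a linear ν-SVC problem with ν = 2/(2d + 1) and handed to LIBSVM; the column doubling and the recovery step β_ρ = ρ(α_i − α_{i+d}) are identical.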

4 Numerical Experiments

In this section we will compare the performance of the LIBSVM approach for solving ρ-CL with two well known, state-of-the-art algorithms for λ-UL, FISTA and glmnet. We first discuss their expected complexity.

FISTA (Fast ISTA; [4]) combines the basic iterations of the Iterative Shrinkage-Thresholding Algorithm (ISTA) with a Nesterov acceleration step. Assuming that the covariance X^t X is precomputed at a fixed initial cost O(Nd²), the cost per iteration of FISTA is O(d²), i.e., that of computing X^t Xβ. If only m components of β are non-zero, the iteration cost is O(dm). On the other hand, glmnet performs cyclic coordinate subgradient descent on the λ-UL cost function. It carefully manages the covariance computations X_j^t X_k, ensuring that there is a cost O(Nd) only the first time they are performed at a given coordinate k and that afterwards their cost is just O(d). Note that an iteration of glmnet just changes one coordinate, while a FISTA iteration changes d components.

Finally, the ν-SVC solver in LIBSVM performs SMO-like iterations that change two coordinates, the pair that most violates the ν-SVC KKT conditions. The cost per iteration is also O(d) plus the time needed to compute the required dot products, i.e., linear kernel operations. LIBSVM builds the kernel matrix making no assumptions on the particular structure of the data. This may lead it to eventually compute 4d² dot products (without considering the last column) even though only d² are actually needed, since the 2d × 2d linear kernel matrix X̂^t X̂ is made of four d × d blocks, with the covariance matrix C = X^t X in the two diagonal blocks and −C in the other two. This can certainly be controlled, but in our experiments we have directly used LIBSVM, without any attempt to optimize running times by exploiting the structure of the kernel matrix.
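As a companion to the cost discussion above, a minimal NumPy sketch of FISTA for λ-UL (µ = 0) follows; it precomputes the covariance X^t X once and then pays O(d²) per iteration for the gradient, with the standard soft-thresholding proximal step and Nesterov momentum. It is illustrative only and not the implementation used in the experiments.

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=500):
    """Minimal FISTA sketch for (1) with mu = 0: 0.5*||X beta - y||^2 + lam*||beta||_1."""
    N, d = X.shape
    G = X.T @ X                          # covariance, precomputed once at cost O(N d^2)
    b = X.T @ y
    L = np.linalg.norm(G, 2)             # Lipschitz constant of the gradient (largest eigenvalue)
    beta = np.zeros(d)
    z, t = beta.copy(), 1.0
    for _ in range(n_iter):
        grad = G @ z - b                 # O(d^2) per iteration
        w = z - grad / L
        beta_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft-thresholding prox
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)         # Nesterov acceleration
        beta, t = beta_new, t_new
    return beta
```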

We compare next the number of iterations and running times that FISTA, glmnet and LIBSVM need to solve equivalent λ-UL and ρ-CL problems. We will do so over four datasets from the UCI collection, namely the relatively small prostate and housing, and the larger year and ctscan. As is well known, training complexity greatly depends on λ. We will consider three possible λ values for UL: an optimal λ* obtained according to the regularization path procedure of [3], a smaller λ*/2 value (that should result in longer training) and a stricter penalty 2λ* value. The optimal λ* values for the problems considered are 2.04 × 10⁻³, 8.19 × 10⁻³, 6.87 × 10⁻³ and 3.75 × 10⁻³ respectively. The corresponding ρ parameters are computed as ρ = ‖β_λ‖_1, with β_λ the optimal solution of λ-UL. To make a balanced comparison, for each λ and dataset we first make a long run of each method M so that it converges to a β_M^* that we take as the optimum. We then compare for each M the evolution of f_M(β_M^k) − f_M(β_M^*), with β_M^k the coefficients at the k-th iteration of method M, until it is smaller than a threshold that we take as 10⁻¹² for prostate and housing and 10⁻⁶ for year and ctscan.

Table 1 shows in columns 2 to 4 the number of iterations that each method requires to arrive at the threshold. As can be seen, LIBSVM is the fastest on this account, with glmnet in second place and FISTA a more distant third (for a proper comparison a FISTA iterate is made to correspond to d iterates of LIBSVM and glmnet). Columns 5 and 6 give the times required by LIBSVM and glmnet; we omit FISTA's times as they are not competitive (at least under the implementation we used).

                 |        Iterations          |    Time (ms)
Dataset          | LIBSVM   glmnet   FISTA    | LIBSVM    glmnet
prostate (λ*)    |     86      181     1,3    |  0.051     0.053
prostate (2λ*)   |     75      174    1,336   |  0.049     0.057
prostate (λ*/2)  |      5      173      904   |  0.036     0.075
housing (λ*)     |    143    1,010    5,538   |  0.66      0.83
housing (2λ*)    |     11      933    5,538   |  0.05      0.749
housing (λ*/2)   |     88      666    3,913   |  0.190     0.50
year (λ*)        |    760    5,584    8,550   |  1,58.9    558.
year (2λ*)       |    576    6,13     8,460   |  1,371.8   590.7
year (λ*/2)      |     39    6,393    8,640   |  1,140.6   66.0
ctscan (λ*)      |     97    9,390   190,080  |  3,935.4   8,83.9
ctscan (2λ*)     |    670   78,456   159,744  |  15,868.   6,935.8
ctscan (λ*/2)    |    418   53,630   13,480   |  14,999.5  4,173.7

Table 1: Iterations and running times.

We use the LIBSVM and glmnet implementations in the Scikit Python library, which both have a compiled C core, so we may expect time comparisons to be broadly homogeneous. Even without considering LIBSVM's overhead when computing a 2d × 2d kernel matrix, it is clearly faster on prostate and housing, but not so on the other two datasets. However, it turns out that LIBSVM indeed computes about 4 times more dot products than needed, which suggests that a LIBSVM version properly adapted for ρ-CL should have running times about a fourth of those reported here, thus outperforming glmnet. This behavior is further illustrated in Fig. 1, which depicts, for housing and ctscan with λ*, the number of iterations and running times required to reach the threshold.

5 Discussion and Conclusions

glmnet can be considered the current state of the art to solve the Lasso problem. M. Jaggi's recent observation that ρ-CL can be reduced to ν-SVC opens the way to the application of SVM methods. In this work we have shown, using four examples, how the ν-SVC option of LIBSVM is faster than glmnet for small problems and pointed out how adapted versions could also beat it on larger problems. Devising such an adaptation is thus a first further line of work. Moreover, Lasso is at the core of many other problems in convex regularization, such as Fused Lasso, wavelet smoothing or trend filtering, currently solved using specialized algorithms. A ν-SVC approach could provide for them the same faster convergence that we have illustrated here for standard Lasso. Furthermore, the ν-SVC training could be sped up, as suggested in [5], using the screening rules available for Lasso [10] in order to remove columns of the data matrix. We are working on these and other related questions.

Fig. 1: Results for housing and ctscan: objective gap versus iterations and versus time (ms) for LIBSVM, glmnet and FISTA.

References

[1] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
[2] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.
[3] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1-22, 2010.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[5] M. Jaggi. An Equivalence between the Lasso and Support Vector Machines. In Regularization, Optimization, Kernels, and Support Vector Machines. CRC Press, 2014.
[6] Q. Zhou, W. Chen, S. Song, J. R. Gardner, K. Q. Weinberger, and Y. Chen. A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing. Technical Report arXiv:1409.1976 [stat.ML], 2014.
[7] J. López, A. Barbero, and J. R. Dorronsoro. Clipping Algorithms for Solving the Nearest Point Problem over Reduced Convex Hulls. Pattern Recognition, 44(3):607-614, 2011.
[8] C.-C. Chang and C.-J. Lin. Training ν-Support Vector Classifiers: Theory and Algorithms. Neural Computation, 13(9):2119-2147, 2001.
[9] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] J. Wang, J. Zhou, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1070-1078. Curran Associates, Inc., 2013.