Fast Algorithms for LAD Lasso Problems


Robert J. Vanderbei
2015 Nov 3
INFORMS, Philadelphia
http://www.princeton.edu/~rvdb

Lasso Regression

The problem is to solve a sparsity-encouraging regularized regression problem:

    minimize ‖Ax − b‖₂² + λ‖x‖₁

My gut reaction: replace least squares (LS) with least absolute deviations (LAD). LAD is to LS as the median is to the mean, and the median is a more robust statistic.

The LAD version can be recast as a linear programming (LP) problem. If the solution is expected to be sparse, then the simplex method can be expected to solve the problem very quickly.

No one knows the correct value of the parameter λ. The parametric simplex method can solve the problem for all values of λ, from λ = ∞ down to a small value of λ, in the same (fast) time it takes the standard simplex method to solve the problem for one choice of λ. The parametric simplex method can be stopped when the desired sparsity level is attained. There is no need to guess or search for the correct value of λ.

Linear Programming Formulation

The least absolute deviations variant is given by:

    minimize ‖Ax − b‖₁ + λ‖x‖₁

It is easy to reformulate this problem as a linear programming problem in the variables x, ε, and δ:

    minimize   1ᵀε + λ1ᵀδ
    subject to −ε ≤ Ax − b ≤ ε
               −δ ≤ x ≤ δ

Note: there exists a critical value λ_c such that the optimal solution for all λ ≥ λ_c is trivial:

    x = 0,   δ = 0,   ε_i = |b_i| for all i.
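This LP is easy to prototype with an off-the-shelf solver. Here is a minimal sketch (my own illustration, not the specialized C code described later in the talk) that stacks the variables as z = (x, ε, δ) and passes the four inequality blocks to scipy.optimize.linprog:

```python
# A minimal sketch, not the talk's implementation: solve the LAD-lasso LP
# for one fixed lambda with SciPy's HiGHS-based linprog. The function name
# lad_lasso_lp and the variable stacking z = (x, eps, delta) are mine.
import numpy as np
from scipy.optimize import linprog

def lad_lasso_lp(A, b, lam):
    """min_x ||Ax - b||_1 + lam*||x||_1, recast as an LP; returns x."""
    m, n = A.shape
    Im, In = np.eye(m), np.eye(n)
    Zmn, Znm = np.zeros((m, n)), np.zeros((n, m))
    # cost: 1' eps + lam * 1' delta  (x itself is free and costless)
    c = np.concatenate([np.zeros(n), np.ones(m), lam * np.ones(n)])
    # -eps <= Ax - b <= eps  and  -delta <= x <= delta, written A_ub z <= b_ub
    A_ub = np.block([[ A, -Im, Zmn],
                     [-A, -Im, Zmn],
                     [ In, Znm, -In],
                     [-In, Znm, -In]])
    b_ub = np.concatenate([b, -b, np.zeros(2 * n)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + m + n), method="highs")
    return res.x[:n]
```

For λ at or above the critical value λ_c, the returned x is numerically zero, consistent with the note above.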

The Simplex Method

We start by introducing slack variables:

    minimize   1ᵀε + λ1ᵀδ
    subject to −Ax − ε + s = −b
                Ax − ε + t =  b
                −x − δ + u =  0
                 x − δ + v =  0
               s, t, u, v ≥ 0

We must identify a partition of the set of all variables into basic and nonbasic variables. If A is an m × n matrix, then there are 2m + 2n equality constraints in 3m + 4n variables.

The variables x, ε, and δ can be regarded as free variables. In the simplex method, free variables are always basic variables.

Let P denote the set of indices i for which b_i ≥ 0 and let N denote those for which b_i < 0.
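As a quick sanity check on the bookkeeping, the sketch below (my own helper, using the same sign conventions as the display above) assembles the equality-form constraint matrix and confirms that it has 2m + 2n rows and 3m + 4n columns:

```python
# A sanity check on the bookkeeping (my own sketch, same sign conventions
# as the display above): assemble the equality-form constraint matrix with
# columns ordered (x, eps, delta, s, t, u, v) and confirm its dimensions.
import numpy as np

def augmented_system(A, b):
    m, n = A.shape
    Im, In = np.eye(m), np.eye(n)
    Zmn, Znm = np.zeros((m, n)), np.zeros((n, m))
    Zmm, Znn = np.zeros((m, m)), np.zeros((n, n))
    E = np.block([
        [-A, -Im, Zmn,  Im, Zmm, Zmn, Zmn],   # -Ax - eps   + s = -b
        [ A, -Im, Zmn, Zmm,  Im, Zmn, Zmn],   #  Ax - eps   + t =  b
        [-In, Znm, -In, Znm, Znm,  In, Znn],  #  -x - delta + u =  0
        [ In, Znm, -In, Znm, Znm, Znn,  In],  #   x - delta + v =  0
    ])
    rhs = np.concatenate([-b, b, np.zeros(2 * n)])
    P, N = np.flatnonzero(b >= 0), np.flatnonzero(b < 0)  # index partition
    assert E.shape == (2 * m + 2 * n, 3 * m + 4 * n)
    return E, rhs, P, N
```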

Dictionary Form

Writing the equations with the nonbasic variables defining the basic ones, we get:

    minimize   1ᵀb_P − 1ᵀb_N + 1ᵀs_P + 1ᵀt_N − ½(p − n)u + ½(p − n)v + (λ/2)1ᵀu + (λ/2)1ᵀv

    subject to ε_P =   b_P + s_P − ½Pu + ½Pv
               s_N = −2b_N + t_N +  Nu −  Nv
               t_P =  2b_P + s_P −  Pu +  Pv
               ε_N =  −b_N + t_N + ½Nu − ½Nv
                 x =     0 + ½u − ½v
                 δ =     0 + ½u + ½v

where A = [P; N] (the rows of A with b_i ≥ 0 and with b_i < 0, respectively), ε = (ε_P, ε_N), s = (s_P, s_N), t = (t_P, t_N), p = 1ᵀP, and n = 1ᵀN.


Parametric Simplex Method

A dictionary solution is obtained by setting the nonbasic variables to zero and reading off the values of the basic variables from the dictionary. The dictionary solution on the previous slide is feasible. It is optimal provided the coefficients of the variables in the objective function are nonnegative:

    1 ≥ 0,    −(p_j − n_j)/2 + λ/2 ≥ 0,    (p_j − n_j)/2 + λ/2 ≥ 0.

This simplifies to

    λ ≥ max_j |p_j − n_j|.

The specific index attaining this max sets the lower bound on λ and identifies the entering variable for the parametric simplex method. The leaving variable is determined in the usual way. A simplex pivot is performed. A new range of λ values is determined (the lower bound from before becomes the upper bound). The process is repeated.
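The starting condition is easy to compute directly: p − n = 1ᵀP − 1ᵀN is just sign(b)ᵀA (treating sign(0) as +1), so the critical value is λ_c = max_j |p_j − n_j|. The sketch below is a hedged stand-in for the true parametric pivot loop: it starts at λ_c and simply re-solves the LP (via the earlier lad_lasso_lp sketch) on a decreasing grid of λ values, stopping at a target sparsity, whereas the real method pivots exactly from breakpoint to breakpoint at no extra cost.

```python
# Hedged stand-in for the parametric simplex path: re-solve the LP on a
# decreasing lambda grid and stop once the target sparsity is reached.
# The genuine parametric method pivots from breakpoint to breakpoint
# instead of re-solving from scratch. Uses lad_lasso_lp defined earlier.
import numpy as np

def lambda_critical(A, b):
    s = np.where(b >= 0, 1.0, -1.0)   # p - n = sign(b)' A
    return np.max(np.abs(s @ A))

def lad_lasso_path(A, b, target_nnz, num_steps=50, tol=1e-8):
    lam_c = lambda_critical(A, b)
    for lam in np.geomspace(lam_c, 1e-3 * lam_c, num_steps):
        x = lad_lasso_lp(A, b, lam)
        if np.count_nonzero(np.abs(x) > tol) >= target_nnz:
            break
    return lam, x
```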

Random Example

A matrix A with m = 3000 observations and n = 200 was generated randomly. The linear programming problem has 6400 constraints, 9800 variables, and 1,213,200 nonzeros in the constraint matrix.

A simple C implementation of the algorithm finds a solution with 6 nonzero regression coefficients in just 44 simplex pivots (7.8 seconds). This compares favorably with the 5321 pivots (168.5 seconds) needed to solve the problem all the way to λ = 0: a speedup by a factor of 21.6.

Real-World Example

A regression model from my Local Warming studies with m = 2899 observations and n = 106 regression coefficients, of which only 4 are deemed relevant (a linear trend plus seasonal changes). The linear programming problem has 6010 constraints, 9121 variables, and 626,820 nonzeros in the constraint matrix.

My implementation finds a solution with 4 nonzero regression coefficients in 1871 iterations (37.0 seconds). This compares favorably with the 4774 iterations (108.2 seconds) needed to solve the problem to λ = 0.

But why so many iterations just to get a few nonzeros?

A Simple Median Example

To answer the question, consider the simplest example, where n = 1 and A = e (a column of ones). In this case, the problem is simply a lasso variant of a median computation:

    min_x Σ_i |x − b_i| + λ|x|.

Clearly, if λ is a positive integer, then this lasso problem is equivalent to computing the median of the original numbers b_1, b_2, ..., b_m augmented with λ zeros: 0, 0, ..., 0.

Best-case scenario: the median of the b_i's is close to zero. For example, suppose the median is positive but the next smaller b_i is negative. Then we need to add only one or two zeros to make the median exactly zero, and after one simplex iteration we arrive at the true median.

Worst-case scenario: all of the b_i's have the same sign. Suppose they are all positive. Then we have to add m + 1 zeros to make the lasso'ed median zero. The first iteration will produce a single nonzero: the minimum of the b_i's. That's a terrible estimate of the median! Each subsequent simplex pivot produces the next larger b_i until, after m/2 pivots, the true median is found.
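This equivalence is easy to verify numerically. A small sketch (made-up data, integer λ): pad b with λ zeros, take the median, and compare against a brute-force minimizer of the objective.

```python
# Numerical check of the median interpretation (made-up data, integer
# lambda): argmin_x sum_i |x - b_i| + lam*|x| should equal the median of
# b padded with lam zeros.
import numpy as np

b = np.array([3.0, 7.0, -2.0, 5.0, 9.0])
lam = 4
padded_median = np.median(np.concatenate([b, np.zeros(lam)]))

grid = np.linspace(-15.0, 15.0, 300001)           # brute-force minimizer
obj = np.sum(np.abs(grid[:, None] - b), axis=1) + lam * np.abs(grid)
x_star = grid[np.argmin(obj)]

print(padded_median, x_star)   # both are 0.0 for this data
```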

Traditional (Least Squares) Lasso

Consider the least squares variant of the previous example:

    min_x ½ Σ_i (x − b_i)² + λ|x|.

Set the derivative of the objective function to zero:

    f′(x) = mx − Σ_i b_i + λ sgn(x) = 0.

Without loss of generality, suppose that Σ_i b_i > 0. We see that the lasso solution is zero for λ ≥ Σ_i b_i. For 0 ≤ λ ≤ Σ_i b_i, the solution is

    x = (Σ_i b_i − λ) / m.

[Figure: df/dx plotted against x for the data, with λ = 10 and λ = 0.]

So, as with LAD-lasso, depending on the choice of λ we can get anything between 0 and the sample mean x̄ = Σ_i b_i / m.
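This closed form is the one-dimensional soft-thresholding rule. A short sketch (made-up data) checking it against brute-force minimization:

```python
# Check of the closed form (made-up data): the minimizer of
# 0.5*sum_i (x - b_i)^2 + lam*|x| is a soft-thresholded mean,
# x = sign(S) * max(|S| - lam, 0) / m  with  S = sum_i b_i.
import numpy as np

def ls_lasso_1d(b, lam):
    S, m = b.sum(), len(b)
    return np.sign(S) * max(abs(S) - lam, 0.0) / m

b = np.array([3.0, 7.0, -2.0, 5.0, 9.0])          # S = 22, mean = 4.4
grid = np.linspace(-10.0, 10.0, 200001)
for lam in (0.0, 10.0, 25.0):
    vals = 0.5 * np.sum((grid[:, None] - b) ** 2, axis=1) + lam * np.abs(grid)
    print(lam, ls_lasso_1d(b, lam), grid[np.argmin(vals)])
```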

(Unexpected) Conclusion

The parametric simplex method for the LAD-lasso problem can generate sparse solutions in a very small number of pivots. With the parametric simplex method, there is no need to guess or search for the value of λ that gives the desired sparsity in the regression coefficients.

When the number of pivots is small, one can expect the result to be good. But LAD-lasso regression can also sometimes produce terrible results. The same is true for traditional lasso regression.

Lasso References

Regression Shrinkage and Selection via the Lasso, Tibshirani, J. Royal Statistical Soc., Ser. B, 58, 267–288, 1996.

Robust Regression Shrinkage and Consistent Variable Selection Through the LAD-Lasso, Wang, Li, and Jiang, Journal of Business and Economic Statistics, 25(3), 347–355, 2007.

Parametric Simplex Method References

Linear Programming and Extensions, Dantzig, Princeton University Press, Chapter 11, 1963.

Linear Programming: Foundations and Extensions, Vanderbei, Springer (any edition), Chapter 7, 1997.

Frontiers of Stochastically Nondominated Portfolios, Ruszczyński and Vanderbei, Econometrica, 71(4), 1289–1297, 2003.

The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R, Pang, Liu, and Vanderbei, Journal of Machine Learning Research, 2013.

Optimization for Compressed Sensing: the Simplex Method and Kronecker Sparsification, Vanderbei, Liu, Wang, and Lin, Mathematical Programming Computation, to appear (hopefully), 2016.

A Parametric Simplex Approach to Statistical Learning Problems, Pang, Zhao, Vanderbei, Liu, in preparation, 2015.

Thank You!