Lasso: Algorithms and Extensions


ELE 538B: Sparsity, Structure and Inference
Lasso: Algorithms and Extensions
Yuxin Chen, Princeton University, Spring 2017

Outline: proximal operators; proximal gradient methods for lasso and its extensions; Nesterov's accelerated algorithm.

Proximal operators

Gradient descent

minimize_{β∈ℝ^p} f(β), where f(β) is convex and differentiable.

Algorithm 4.1 (Gradient descent): for t = 0, 1, ...:
β^{t+1} = β^t − µ_t ∇f(β^t),
where µ_t is the step size / learning rate.
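As an illustration (not part of the original slides), here is a minimal NumPy sketch of Algorithm 4.1 for the least-squares objective f(β) = (1/2)‖Xβ − y‖²; the function name, the fixed step size 1/L, and the iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, n_iters=500):
    """Gradient descent on f(beta) = 0.5 * ||X beta - y||^2 (illustrative sketch)."""
    n, p = X.shape
    # 1/L with L = ||X||_2^2, the Lipschitz constant of grad f (assumed fixed step)
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)    # gradient of f at beta^t
        beta = beta - step * grad      # beta^{t+1} = beta^t - mu_t * grad f(beta^t)
    return beta
```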

A proximal point of view of GD

β^{t+1} = arg min_β { f(β^t) + ⟨∇f(β^t), β − β^t⟩ + (1/(2µ_t)) ‖β − β^t‖² },

where the first two terms form the linear approximation of f at β^t and the last term is the proximal term. When µ_t is small, β^{t+1} tends to stay close to β^t.

Proximal operator

If we define the proximal operator
prox_h(b) := arg min_β { (1/2) ‖β − b‖² + h(β) }
for any convex function h, then one can write
β^{t+1} = prox_{µ_t f_t}(β^t), where f_t(β) := f(β^t) + ⟨∇f(β^t), β − β^t⟩.

Why consider proximal operators?

prox_h(b) := arg min_β { (1/2) ‖β − b‖² + h(β) }

It is well-defined under very general conditions (including nonsmooth convex functions).
The operator can be evaluated efficiently for many widely used functions (in particular, regularizers).
This abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms.

Example: characteristic functions

If h is the characteristic function
h(β) = 0 if β ∈ C, and ∞ otherwise,
then
prox_h(b) = arg min_{β∈C} ‖β − b‖₂ (Euclidean projection onto C).

Example: ℓ₁ norm

If h(β) = ‖β‖₁, then
prox_{λh}(b) = ψ_st(b; λ),
where the soft-thresholding operator ψ_st(·) is applied in an entrywise manner.
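A minimal NumPy sketch of entrywise soft thresholding, ψ_st(b; λ)_i = sign(b_i)·max(|b_i| − λ, 0); the function name is an illustrative choice, not from the slides.

```python
import numpy as np

def soft_threshold(b, lam):
    """Entrywise soft thresholding: the prox of lam * ||.||_1 evaluated at b."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```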

Example: ℓ₂ norm

prox_h(b) := arg min_β { (1/2) ‖β − b‖² + h(β) }

If h(β) = ‖β‖, then
prox_{λh}(b) = (1 − λ/‖b‖)₊ b,
where (a)₊ := max{a, 0}. This is called block soft thresholding.
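A matching sketch of block soft thresholding, i.e. the prox of λ‖·‖ for the unsquared ℓ₂ norm; again the function name is illustrative.

```python
import numpy as np

def block_soft_threshold(b, lam):
    """Prox of lam * ||.||_2 (unsquared): shrink the whole vector toward zero."""
    norm_b = np.linalg.norm(b)
    if norm_b <= lam:
        return np.zeros_like(b)          # (1 - lam/||b||)_+ = 0
    return (1.0 - lam / norm_b) * b
```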

Example: log barrier

prox_h(b) := arg min_β { (1/2) ‖β − b‖² + h(β) }

If h(β) = −∑_{i=1}^p log β_i, then
(prox_{λh}(b))_i = (b_i + √(b_i² + 4λ)) / 2.
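This formula follows from the first-order optimality condition of the separable prox problem; a short per-coordinate derivation, not spelled out on the slide:

```latex
\min_{\beta_i > 0} \; \tfrac{1}{2}(\beta_i - b_i)^2 - \lambda \log \beta_i
\quad\Longrightarrow\quad
(\beta_i - b_i) - \frac{\lambda}{\beta_i} = 0
\;\Longleftrightarrow\;
\beta_i^2 - b_i \beta_i - \lambda = 0
\;\Longleftrightarrow\;
\beta_i = \frac{b_i + \sqrt{b_i^2 + 4\lambda}}{2},
% taking the positive root so that beta_i > 0.
```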

Nonexpansiveness of proximal operators

Recall that when h(β) = 0 if β ∈ C and ∞ otherwise, prox_h(β) is the Euclidean projection P_C onto C, which is nonexpansive:
‖P_C(β₁) − P_C(β₂)‖ ≤ ‖β₁ − β₂‖.

Nonexpansiveness of proximal operators

Nonexpansiveness is a property of the general prox_h(·).

Fact 4.1 (Nonexpansiveness): ‖prox_h(β₁) − prox_h(β₂)‖ ≤ ‖β₁ − β₂‖.

In some sense, the proximal operator behaves like a projection.

Proof of nonexpansiveness

Let z₁ = prox_h(β₁) and z₂ = prox_h(β₂). The subgradient characterizations of z₁ and z₂ read
β₁ − z₁ ∈ ∂h(z₁) and β₂ − z₂ ∈ ∂h(z₂).
The claim would follow (together with Cauchy-Schwarz) if
(β₁ − β₂)ᵀ(z₁ − z₂) ≥ ‖z₁ − z₂‖²  ⟺  (β₁ − z₁ − β₂ + z₂)ᵀ(z₁ − z₂) ≥ 0.
This holds because the subgradient inequalities
h(z₂) ≥ h(z₁) + ⟨β₁ − z₁, z₂ − z₁⟩ and h(z₁) ≥ h(z₂) + ⟨β₂ − z₂, z₁ − z₂⟩,
added together, give ⟨β₁ − z₁ − β₂ + z₂, z₁ − z₂⟩ ≥ 0.

Proximal gradient methods

Optimizing composite functions

(lasso) minimize_{β∈ℝ^p} (1/2)‖Xβ − y‖² + λ‖β‖₁ = f(β) + g(β),
where f(β) := (1/2)‖Xβ − y‖² is differentiable and g(β) := λ‖β‖₁ is non-smooth.

Since g(β) is non-differentiable, we cannot run vanilla gradient descent.

Proximal gradient methods

One strategy: replace f(β) with its linear approximation, and compute the proximal solution
β^{t+1} = arg min_β { f(β^t) + ⟨∇f(β^t), β − β^t⟩ + g(β) + (1/(2µ_t)) ‖β − β^t‖² }.
The optimality condition reads
0 ∈ ∇f(β^t) + ∂g(β^{t+1}) + (1/µ_t)(β^{t+1} − β^t),
which is equivalent to the optimality condition of
β^{t+1} = arg min_β { g(β) + (1/(2µ_t)) ‖β − (β^t − µ_t ∇f(β^t))‖² } = prox_{µ_t g}(β^t − µ_t ∇f(β^t)).

Proximal gradient methods

Alternate between gradient updates on f and proximal minimization on g.

Algorithm 4.2 (Proximal gradient methods): for t = 0, 1, ...:
β^{t+1} = prox_{µ_t g}(β^t − µ_t ∇f(β^t)),
where µ_t is the step size / learning rate.
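A compact sketch of Algorithm 4.2 in Python/NumPy, written against a generic pair (grad_f, prox_g); all names, the fixed step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, beta0, step, n_iters=500):
    """Generic proximal gradient method (sketch of Algorithm 4.2).

    grad_f(beta)  : gradient of the smooth part f at beta
    prox_g(b, t)  : proximal operator of t * g evaluated at b
    """
    beta = np.array(beta0, dtype=float)
    for _ in range(n_iters):
        beta = prox_g(beta - step * grad_f(beta), step)  # gradient step, then prox
    return beta
```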

Projected gradient methods

When g(β) = 0 if β ∈ C and ∞ otherwise (a convex characteristic function):
β^{t+1} = P_C(β^t − µ_t ∇f(β^t)) := arg min_{β∈C} ‖β − (β^t − µ_t ∇f(β^t))‖.
This is a first-order method for solving the constrained optimization problem
minimize_β f(β) subject to β ∈ C.
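As an illustration with a concrete C (the nonnegative orthant, chosen here for simplicity and not taken from the slides), the projection P_C reduces to an entrywise clamp at zero:

```python
import numpy as np

def projected_gradient_nonneg(X, y, n_iters=500):
    """Projected gradient for min 0.5*||X beta - y||^2 subject to beta >= 0 (sketch)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # fixed step mu = 1/L (assumption)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)
        beta = np.maximum(beta - step * grad, 0.0)   # gradient step, then project onto C
    return beta
```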

Proximal gradient methods for lasso

For lasso: f(β) = (1/2)‖Xβ − y‖² and g(β) = λ‖β‖₁, so
prox_g(β) = arg min_b { (1/2)‖β − b‖² + λ‖b‖₁ } = ψ_st(β; λ)
⟹ β^{t+1} = ψ_st(β^t − µ_t Xᵀ(Xβ^t − y); µ_t λ)   (iterative soft thresholding).
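Putting the pieces together gives iterative soft thresholding for the lasso; a minimal sketch assuming a fixed step size µ = 1/L with L = ‖X‖₂² (the helper and iteration count are illustrative).

```python
import numpy as np

def soft_threshold(b, lam):
    """Entrywise soft thresholding, psi_st(b; lam)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ista_lasso(X, y, lam, n_iters=500):
    """Iterative soft thresholding for the lasso (sketch)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # mu = 1/L, L = ||X||_2^2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)             # gradient of f at beta^t
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```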

Proximal gradient methods for group lasso

Sometimes variables have a natural group structure, and it is desirable to set all variables within a group to zero (or nonzero) simultaneously:
(group lasso) minimize_β (1/2)‖Xβ − y‖² + λ ∑_{j=1}^k ‖β_j‖ = f(β) + g(β),
where β_j ∈ ℝ^{p/k} and β = [β₁; ...; β_k].
prox_g(β) = ψ_bst(β; λ) := [ (1 − λ/‖β_j‖)₊ β_j ]_{1≤j≤k}
⟹ β^{t+1} = ψ_bst(β^t − µ_t Xᵀ(Xβ^t − y); µ_t λ).
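A sketch of ψ_bst and the corresponding proximal gradient step, assuming the groups are supplied as a list of index arrays (this interface, like the function names, is an illustrative choice, not from the slides).

```python
import numpy as np

def group_soft_threshold(b, lam, groups):
    """Apply block soft thresholding to each group of coordinates."""
    out = b.copy()
    for idx in groups:
        norm_g = np.linalg.norm(b[idx])
        out[idx] = 0.0 if norm_g <= lam else (1.0 - lam / norm_g) * b[idx]
    return out

def proximal_gradient_group_lasso(X, y, lam, groups, n_iters=500):
    """Proximal gradient for the group lasso (sketch, fixed step mu = 1/L)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)
        beta = group_soft_threshold(beta - step * grad, step * lam, groups)
    return beta
```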

Proximal gradient methods for elastic net

Lasso does not handle highly correlated variables well: if there is a group of highly correlated variables, lasso often picks one from the group and ignores the rest. Sometimes we make a compromise between the lasso and ℓ₂ penalties:
(elastic net) minimize_β (1/2)‖Xβ − y‖² + λ { ‖β‖₁ + (γ/2)‖β‖₂² },
where f(β) := (1/2)‖Xβ − y‖² and g(β) := ‖β‖₁ + (γ/2)‖β‖₂².
prox_{λg}(β) = (1/(1 + λγ)) ψ_st(β; λ)
⟹ β^{t+1} = (1/(1 + µ_t λγ)) ψ_st(β^t − µ_t Xᵀ(Xβ^t − y); µ_t λ):
soft thresholding followed by multiplicative shrinkage.
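A matching sketch of the elastic-net update, soft thresholding followed by multiplicative shrinkage, under the same illustrative fixed-step assumptions.

```python
import numpy as np

def soft_threshold(b, lam):
    """Entrywise soft thresholding, psi_st(b; lam)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def proximal_gradient_elastic_net(X, y, lam, gamma, n_iters=500):
    """Proximal gradient for the elastic net (sketch)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)
        z = soft_threshold(beta - step * grad, step * lam)   # soft thresholding
        beta = z / (1.0 + step * lam * gamma)                # multiplicative shrinkage
    return beta
```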

Interpretation: majorization-minimization

f_{µ_t}(β, β^t) := f(β^t) + ⟨∇f(β^t), β − β^t⟩ + (1/(2µ_t)) ‖β − β^t‖²
(linearization plus trust-region penalty) majorizes f(β) if 0 < µ_t < 1/L, where L is the Lipschitz constant¹ of ∇f(·).

Proximal gradient descent is a majorization-minimization algorithm:
β^{t+1} = arg min_β { f_{µ_t}(β, β^t) + g(β) }
(majorize, then minimize).

¹ This means ‖∇f(β) − ∇f(b)‖ ≤ L ‖β − b‖ for all β and b.
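The majorization claim is the descent lemma for functions with L-Lipschitz gradient; spelled out (this step is implicit on the slide):

```latex
f(\beta) \;\le\; f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle
          + \frac{L}{2}\,\|\beta - \beta^t\|^2
       \;\le\; f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle
          + \frac{1}{2\mu_t}\,\|\beta - \beta^t\|^2
       \;=\; f_{\mu_t}(\beta, \beta^t),
\qquad \text{whenever } 0 < \mu_t \le \tfrac{1}{L}.
```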

Convergence rate of proximal gradient methods

Theorem 4.2 (fixed step size; Nesterov '07). Suppose g is convex, and f is differentiable and convex with a gradient that has Lipschitz constant L. If µ_t ≡ µ ∈ (0, 1/L), then
f(β^t) + g(β^t) − min_β { f(β) + g(β) } ≤ O(1/t).

The step size requires an upper bound on L; one may prefer backtracking line search to a fixed step size.

Question: can we further improve the convergence rate?

Nesterov's accelerated gradient methods

Nesterov's accelerated method

Problem of gradient descent: zigzagging.
Nesterov's idea: include a momentum term to avoid overshooting.

Nesterov's accelerated method

Nesterov's idea: include a momentum term to avoid overshooting.
β^t = prox_{µ_t g}(b^{t−1} − µ_t ∇f(b^{t−1}))
b^t = β^t + α_t (β^t − β^{t−1})   (extrapolation; the second term is the momentum term)

A simple (but mysterious) choice of extrapolation parameter: α_t = (t − 1)/(t + 2).
Fixed step size µ_t ≡ µ ∈ (0, 1/L) or backtracking line search.
Same computational cost per iteration as proximal gradient.
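A minimal sketch of this accelerated iteration specialized to the lasso, with α_t = (t − 1)/(t + 2) and a fixed step size; the helper, names, and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(b, lam):
    """Entrywise soft thresholding, psi_st(b; lam)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def accelerated_lasso(X, y, lam, n_iters=500):
    """Nesterov-accelerated proximal gradient for the lasso (sketch)."""
    p = X.shape[1]
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta_prev = np.zeros(p)
    b = np.zeros(p)                                        # extrapolated point b^{t-1}
    for t in range(1, n_iters + 1):
        grad = X.T @ (X @ b - y)
        beta = soft_threshold(b - step * grad, step * lam)  # prox step at b^{t-1}
        alpha = (t - 1) / (t + 2)                           # momentum parameter alpha_t
        b = beta + alpha * (beta - beta_prev)               # extrapolation (momentum)
        beta_prev = beta
    return beta_prev
```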

Convergence rate of Nesterov's accelerated method

Theorem 4.3 (Nesterov '83, Nesterov '07). Suppose f is differentiable and convex and g is convex. If one takes α_t = (t − 1)/(t + 2) and a fixed step size µ_t ≡ µ ∈ (0, 1/L), then
f(β^t) + g(β^t) − min_β { f(β) + g(β) } ≤ O(1/t²).

In general, this rate cannot be improved if one only uses gradient information!

Numerical experiments (for lasso). Figure credit: Hastie, Tibshirani, and Wainwright '15.

References

[1] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[2] Convex optimization algorithms, D. Bertsekas, Athena Scientific, 2015.
[3] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[4] Statistical learning with sparsity: the lasso and generalizations, T. Hastie, R. Tibshirani, and M. Wainwright, 2015.
[5] Model selection and estimation in regression with grouped variables, M. Yuan and Y. Lin, Journal of the Royal Statistical Society, 2006.
[6] A method of solving a convex programming problem with convergence rate O(1/k²), Y. Nesterov, Soviet Mathematics Doklady, 1983.
[7] Gradient methods for minimizing composite functions, Y. Nesterov, Technical Report, 2007.
[8] A fast iterative shrinkage-thresholding algorithm for linear inverse problems, A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.