Proximal Gradient Descent and Acceleration. Ryan Tibshirani, Convex Optimization 10-725/36-725


Last time: subgradient method. Consider the problem

min f(x)

with f convex and dom(f) = R^n. Subgradient method: choose an initial x^{(0)} ∈ R^n, and repeat:

x^{(k)} = x^{(k-1)} - t_k g^{(k-1)},  k = 1, 2, 3, ...

where g^{(k-1)} ∈ ∂f(x^{(k-1)}). We use pre-set rules for the step sizes (e.g., a diminishing step size rule). If f is Lipschitz, then the subgradient method has convergence rate O(1/ε²). Upside: very generic. Downside: can be very slow!

Outline. Today:
- Proximal gradient descent
- Convergence analysis
- ISTA, matrix completion
- Special cases
- Acceleration

Decomposable functions. Suppose

f(x) = g(x) + h(x)

where g is convex and differentiable with dom(g) = R^n, and h is convex, not necessarily differentiable. If f were differentiable, then the gradient descent update would be:

x^+ = x - t ∇f(x)

Recall the motivation: minimize the quadratic approximation to f around x, replacing ∇²f(x) by (1/t) I:

x^+ = argmin_z f(x) + ∇f(x)^T (z - x) + (1/(2t)) ||z - x||_2^2

In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update

x^+ = argmin_z g̃_t(z) + h(z)
    = argmin_z g(x) + ∇g(x)^T (z - x) + (1/(2t)) ||z - x||_2^2 + h(z)
    = argmin_z (1/(2t)) ||z - (x - t ∇g(x))||_2^2 + h(z)

The first term keeps z close to the gradient update for g; the second term also makes h small.

Proximal gradient descent. Define the proximal mapping:

prox_t(x) = argmin_z (1/(2t)) ||x - z||_2^2 + h(z)

Proximal gradient descent: choose an initial x^{(0)}, and repeat:

x^{(k)} = prox_{t_k}( x^{(k-1)} - t_k ∇g(x^{(k-1)}) ),  k = 1, 2, 3, ...

To make this update step look familiar, we can rewrite it as

x^{(k)} = x^{(k-1)} - t_k G_{t_k}(x^{(k-1)})

where G_t is the generalized gradient of f,

G_t(x) = ( x - prox_t( x - t ∇g(x) ) ) / t
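To make the update concrete, here is a minimal numpy sketch of the iteration; the callables grad_g and prox are placeholders for whatever problem is being solved (they are illustrative names, not part of the slides).

```python
import numpy as np

def proximal_gradient(x0, grad_g, prox, t, n_iter=1000):
    # grad_g(x): gradient of the smooth part g
    # prox(x, t): proximal mapping of the nonsmooth part h at step size t
    x = x0
    for _ in range(n_iter):
        x = prox(x - t * grad_g(x), t)  # gradient step on g, then prox step on h
    return x
```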

What good did this do? You have a right to be suspicious... it may look like we just swapped one minimization problem for another. The key point is that prox_t(·) can be computed analytically for many important functions h. Note:
- the mapping prox_t(·) doesn't depend on g at all, only on h
- the smooth part g can be complicated; we only need to compute its gradients

Convergence analysis will be in terms of the number of iterations of the algorithm. Keep in mind that each iteration evaluates prox_t(·) once, and this can be cheap or expensive, depending on h.

Example: ISTA. Given y ∈ R^n and X ∈ R^{n×p}, recall the lasso criterion:

f(β) = (1/2) ||y - Xβ||_2^2 + λ ||β||_1

with smooth part g(β) = (1/2) ||y - Xβ||_2^2 and nonsmooth part h(β) = λ ||β||_1. The prox mapping is now

prox_t(β) = argmin_{z ∈ R^p} (1/(2t)) ||β - z||_2^2 + λ ||z||_1 = S_{λt}(β)

where S_λ(β) is the soft-thresholding operator,

[S_λ(β)]_i = β_i - λ if β_i > λ;  0 if -λ ≤ β_i ≤ λ;  β_i + λ if β_i < -λ,   for i = 1, ..., p

Recall ∇g(β) = -X^T (y - Xβ), hence the proximal gradient update is:

β^+ = S_{λt}( β + t X^T (y - Xβ) )

This is often called the iterative soft-thresholding algorithm (ISTA).¹ It is a very simple algorithm for computing a lasso solution.

[Figure: f - f* versus iteration k on a log scale, comparing the convergence of the subgradient method and proximal gradient (ISTA); ISTA converges much faster.]

¹ Beck and Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
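A minimal numpy sketch of ISTA as just described; the step size t and iteration count are illustrative, and in practice one takes t ≤ 1/L with L the largest eigenvalue of X^T X.

```python
import numpy as np

def soft_threshold(b, lam):
    # Elementwise soft-thresholding operator S_lam
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ista(X, y, lam, t, n_iter=1000):
    # Proximal gradient descent on the lasso criterion
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta
```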

Convergence analysis. With criterion f(x) = g(x) + h(x), we assume:
- g is convex, differentiable, dom(g) = R^n, and ∇g is Lipschitz continuous with constant L > 0
- h is convex, and prox_t(x) = argmin_z { ||x - z||_2^2 / (2t) + h(z) } can be evaluated

Theorem: proximal gradient descent with fixed step size t ≤ 1/L satisfies

f(x^{(k)}) - f* ≤ ||x^{(0)} - x*||_2^2 / (2tk)

Proximal gradient descent has convergence rate O(1/k), or O(1/ε). Same as gradient descent! But remember, this counts the number of iterations, not operations.

Backtracking line search. Similar to gradient descent, but operating on g and not f. We fix a parameter 0 < β < 1. At each iteration, start with t = 1, and while

g( x - t G_t(x) ) > g(x) - t ∇g(x)^T G_t(x) + (t/2) ||G_t(x)||_2^2

shrink t = βt; else perform the prox gradient update. Under the same assumptions, we get the same rate.

Theorem: proximal gradient descent with backtracking line search satisfies

f(x^{(k)}) - f* ≤ ||x^{(0)} - x*||_2^2 / (2 t_min k)

where t_min = min{1, β/L}.
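A sketch of the backtracking rule above, again with placeholder callables g, grad_g, and prox (a minimal illustration under the stated assumptions, not a reference implementation).

```python
import numpy as np

def prox_gradient_backtracking(x0, g, grad_g, prox, beta=0.5, n_iter=100):
    x = x0
    for _ in range(n_iter):
        t = 1.0
        while True:
            z = prox(x - t * grad_g(x), t)  # candidate update x - t G_t(x)
            G = (x - z) / t                 # generalized gradient G_t(x)
            if g(z) <= g(x) - t * grad_g(x) @ G + (t / 2) * (G @ G):
                break                       # sufficient decrease on g holds
            t *= beta                       # otherwise shrink the step size
        x = z
    return x
```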

Example: matrix completion. Given a matrix Y ∈ R^{m×n}, we only observe entries Y_ij, (i, j) ∈ Ω. Suppose we want to fill in the missing entries (e.g., for a recommender system), so we solve a matrix completion problem:

min_{B ∈ R^{m×n}} (1/2) Σ_{(i,j) ∈ Ω} (Y_ij - B_ij)^2 + λ ||B||_tr

Here ||B||_tr is the trace (or nuclear) norm of B,

||B||_tr = Σ_{i=1}^{r} σ_i(B)

where r = rank(B) and σ_1(B) ≥ ... ≥ σ_r(B) ≥ 0 are the singular values.

Define P_Ω, the projection operator onto the observed set:

[P_Ω(B)]_ij = B_ij if (i, j) ∈ Ω,  and 0 if (i, j) ∉ Ω

Then the criterion is

f(B) = (1/2) ||P_Ω(Y) - P_Ω(B)||_F^2 + λ ||B||_tr

with g(B) = (1/2) ||P_Ω(Y) - P_Ω(B)||_F^2 and h(B) = λ ||B||_tr. Two ingredients are needed for proximal gradient descent:
- Gradient calculation: ∇g(B) = -(P_Ω(Y) - P_Ω(B))
- Prox function: prox_t(B) = argmin_{Z ∈ R^{m×n}} (1/(2t)) ||B - Z||_F^2 + λ ||Z||_tr

Claim: prox_t(B) = S_{λt}(B), matrix soft-thresholding at the level λt. Here S_λ(B) is defined by

S_λ(B) = U Σ_λ V^T

where B = U Σ V^T is an SVD, and Σ_λ is diagonal with (Σ_λ)_ii = max{Σ_ii - λ, 0}.

Why? Note that prox_t(B) = Z, where Z satisfies

0 ∈ Z - B + λt ∂||Z||_tr

Fact: if Z = U Σ V^T, then

∂||Z||_tr = { U V^T + W : ||W||_op ≤ 1, U^T W = 0, W V = 0 }

Now plug in Z = S_{λt}(B) and check that we can get 0.

Hence the proximal gradient update step is:

B^+ = S_{λt}( B + t ( P_Ω(Y) - P_Ω(B) ) )

Note that ∇g(B) is Lipschitz continuous with L = 1, so we can choose fixed step size t = 1. The update step is then:

B^+ = S_λ( P_Ω(Y) + P_Ω^⊥(B) )

where P_Ω^⊥ projects onto the unobserved set, so that P_Ω(B) + P_Ω^⊥(B) = B. This is the soft-impute algorithm², a simple and effective method for matrix completion.

² Mazumder et al. (2011), "Spectral regularization algorithms for learning large incomplete matrices"
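A compact numpy sketch of soft-impute under these assumptions (t = 1; mask marks the observed entries). It recomputes a full SVD each iteration, which is fine for small matrices but would be replaced by a truncated/low-rank SVD in a serious implementation.

```python
import numpy as np

def matrix_soft_threshold(B, lam):
    # S_lam(B): soft-threshold the singular values of B
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(Y, mask, lam, n_iter=100):
    # mask: boolean array, True where Y_ij is observed
    B = np.zeros_like(Y)
    for _ in range(n_iter):
        # form P_Omega(Y) + P_Omega_perp(B), then soft-threshold
        B = matrix_soft_threshold(np.where(mask, Y, B), lam)
    return B
```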

Special cases. Proximal gradient descent is also called composite gradient descent, or generalized gradient descent. Why generalized? This refers to several special cases, when minimizing f = g + h:
- h = 0: gradient descent
- h = I_C: projected gradient descent
- g = 0: proximal minimization algorithm

Therefore these algorithms all have the O(1/ε) convergence rate.

Projected gradient descent. Given a closed, convex set C ⊆ R^n,

min_{x ∈ C} g(x)   ⟺   min_{x ∈ R^n} g(x) + I_C(x)

where I_C(x) = 0 if x ∈ C and ∞ if x ∉ C is the indicator function of C. Hence

prox_t(x) = argmin_{z ∈ R^n} (1/(2t)) ||x - z||_2^2 + I_C(z) = argmin_{z ∈ C} ||x - z||_2^2

I.e., prox_t(x) = P_C(x), the projection operator onto C.

Therefore the proximal gradient update step is:

x^+ = P_C( x - t ∇g(x) )

i.e., perform the usual gradient update and then project back onto C. This is called projected gradient descent.

[Figure: a gradient step that leaves the convex set C, followed by projection back onto C.]

(Note: the projected subgradient method works too.)
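A short sketch of projected gradient descent with a user-supplied projection; the l2-ball projection below is one example of a set where P_C is cheap (an illustration, not from the slides).

```python
import numpy as np

def projected_gradient(x0, grad_g, project, t, n_iter=1000):
    # project(x): Euclidean projection P_C(x) onto the constraint set C
    x = x0
    for _ in range(n_iter):
        x = project(x - t * grad_g(x))  # gradient step, then project onto C
    return x

def project_l2_ball(x, r=1.0):
    # Projection onto the l2 ball of radius r
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x
```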

Proximal minimization algorithm. Consider, for h convex (not necessarily differentiable),

min h(x)

The proximal gradient update step is just:

x^+ = argmin_z (1/(2t)) ||x - z||_2^2 + h(z)

This is called the proximal minimization algorithm. It is faster than the subgradient method, but not implementable unless we know the prox in closed form.

What happens if we can't evaluate the prox? The theory for proximal gradient, with f = g + h, assumes that the prox function can be evaluated, i.e., assumes the minimization

prox_t(x) = argmin_{z ∈ R^n} (1/(2t)) ||x - z||_2^2 + h(z)

can be done exactly. In general, all bets are off if we just minimize this approximately. But if you can precisely control the errors in approximating the prox operator, then you can recover the original convergence rates.³ In practice, if the prox evaluation is done approximately, it should be done to fairly high precision.

³ Schmidt et al. (2011), "Convergence rates of inexact proximal-gradient methods for convex optimization"

Acceleration. It turns out we can accelerate proximal gradient descent in order to achieve the optimal O(1/√ε) convergence rate. Four ideas (three acceleration methods) by Nesterov:
- 1983: original acceleration idea for smooth functions
- 1988: another acceleration idea for smooth functions
- 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea
- 2007: acceleration idea for composite functions⁴

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions.⁵

⁴ Each step uses the entire history of previous steps and makes two prox calls
⁵ Each step uses information from the last two steps and makes one prox call

Accelerated proximal gradient method. Our problem, as before:

min g(x) + h(x)

where g is convex and differentiable, and h is convex. Accelerated proximal gradient method: choose an initial point x^{(0)} = x^{(-1)} ∈ R^n, and repeat for k = 1, 2, 3, ...

v = x^{(k-1)} + ((k-2)/(k+1)) (x^{(k-1)} - x^{(k-2)})
x^{(k)} = prox_{t_k}( v - t_k ∇g(v) )

The first step k = 1 is just the usual proximal gradient update. After that, v carries some momentum from previous iterations. Setting h = 0 gives the accelerated gradient method.
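A minimal sketch of the accelerated iteration with fixed step size; as before, grad_g and prox are placeholder callables.

```python
import numpy as np

def accelerated_prox_gradient(x0, grad_g, prox, t, n_iter=1000):
    x_prev = np.array(x0, dtype=float)  # plays the role of x^{(k-2)}
    x = x_prev.copy()                   # plays the role of x^{(k-1)}
    for k in range(1, n_iter + 1):
        v = x + (k - 2) / (k + 1) * (x - x_prev)  # momentum step
        x_prev = x
        x = prox(v - t * grad_g(v), t)            # prox gradient step at v
    return x
```

Note that at k = 1 the momentum term vanishes (x = x_prev), matching the slide's remark that the first step is a plain proximal gradient update.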

[Figure: momentum weights (k-2)/(k+1) versus k = 1, ..., 100; the weights start at -1/2 and increase toward 1.]

Recall the lasso example: acceleration can really help!

[Figure: f - f* versus iteration k on a log scale for the subgradient method, proximal gradient, and Nesterov acceleration; the accelerated method converges fastest.]

Note: accelerated proximal gradient is not a descent method ("Nesterov ripples").

Convergence analysis. As usual, we are minimizing f(x) = g(x) + h(x), assuming:
- g is convex, differentiable, dom(g) = R^n, and ∇g is Lipschitz continuous with constant L > 0
- h is convex, and its prox function can be evaluated

Theorem: the accelerated proximal gradient method with fixed step size t ≤ 1/L satisfies

f(x^{(k)}) - f* ≤ 2 ||x^{(0)} - x*||_2^2 / ( t (k+1)^2 )

This achieves the optimal O(1/k²) rate for first-order methods! I.e., a rate of O(1/√ε).

Backtracking line search. There are a few ways to do this with acceleration... here's a simple method (more complicated strategies exist): fix β < 1 and t_0 = 1. At iteration k, start with t = t_{k-1}, and while

g(x^+) > g(v) + ∇g(v)^T (x^+ - v) + (1/(2t)) ||x^+ - v||_2^2

shrink t = βt and let x^+ = prox_t( v - t ∇g(v) ); else keep x^+. Under the same assumptions, we get the same rate.

Theorem: the accelerated proximal gradient method with backtracking line search satisfies

f(x^{(k)}) - f* ≤ 2 ||x^{(0)} - x*||_2^2 / ( t_min (k+1)^2 )

where t_min = min{1, β/L}.

Is acceleration always useful? Acceleration can be a very effective speedup tool... but should it always be used? In practice, the speedup from acceleration is diminished in the presence of warm starts. I.e., suppose we want to solve the lasso problem for tuning parameter values λ_1 > λ_2 > ... > λ_r (see the sketch after this list):
- When solving for λ_1, initialize x^{(0)} = 0, and record the solution x̂(λ_1)
- When solving for λ_j, initialize x^{(0)} = x̂(λ_{j-1}), the recorded solution for λ_{j-1}

Over a fine enough grid of λ values, proximal gradient descent can often perform just as well without acceleration.
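A sketch of this warm-start strategy, reusing the soft_threshold helper from the ISTA sketch above (lambdas, t, and the iteration count are illustrative).

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_path(X, y, lambdas, t, n_iter=200):
    # lambdas: tuning parameter values sorted in decreasing order
    beta = np.zeros(X.shape[1])  # cold start only for the largest lambda
    path = []
    for lam in lambdas:
        for _ in range(n_iter):  # warm-started ISTA at this lambda
            beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
        path.append(beta.copy())
    return path
```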

Sometimes backtracking and acceleration can be disadvantageous! Recall the matrix completion problem: the proximal gradient update is

B^+ = S_{λt}( B + t ( P_Ω(Y) - P_Ω(B) ) )

where S_{λt} is the matrix soft-thresholding operator... which requires an SVD.

One backtracking loop evaluates the generalized gradient G_t(x), i.e., evaluates prox_t(x), across various values of t. For matrix completion, this means multiple SVDs...

Acceleration changes the argument we pass to the prox: v - t ∇g(v) instead of x - t ∇g(x). For matrix completion (and t = 1):

B - ∇g(B) = P_Ω(Y) + P_Ω^⊥(B): sparse plus low rank, so a fast SVD
V - ∇g(V) = P_Ω(Y) + P_Ω^⊥(V): sparse plus not necessarily low rank, so a slow SVD

References and further reading. Nesterov's four ideas (three acceleration methods):
- Y. Nesterov (1983), "A method for solving a convex programming problem with convergence rate O(1/k²)"
- Y. Nesterov (1988), "On an approach to the construction of optimal methods of minimization of smooth convex functions"
- Y. Nesterov (2005), "Smooth minimization of non-smooth functions"
- Y. Nesterov (2007), "Gradient methods for minimizing composite objective function"

Extensions and/or analyses:
- A. Beck and M. Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
- S. Becker, J. Bobin, and E. Candes (2009), "NESTA: a fast and accurate first-order method for sparse recovery"
- P. Tseng (2008), "On accelerated proximal gradient methods for convex-concave optimization"

and there are many more...

Helpful lecture notes/books:
- E. Candes, Lecture notes for Math 301, Stanford University, Winter 2010-2011
- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012