
Agenda: fast proximal gradient methods
1. Accelerated first-order methods
2. Auxiliary sequences
3. Convergence analysis
4. Numerical examples
5. Optimality of Nesterov's scheme

Last time
- Proximal gradient method: convergence rate O(1/k)
- Subgradient method: convergence rate O(1/√k)
Can we do better for nonsmooth problems min f(x) = g(x) + h(x), with the same computational effort as the proximal gradient method but with faster convergence? Answer: yes, with an equally simple scheme
  x_{k+1} = argmin_x Q_{1/t}(x, y_k)
Note that we use y_k instead of x_k, where the new point y_k is cleverly chosen. Original idea: Nesterov (1983), for minimization of a smooth objective. Here: nonsmooth problems.

Accelerated first-order methods
Choose x_0 and set y_0 = x_0. Repeat for k = 1, 2, ...:
  x_k = prox_{t_k h}(y_{k−1} − t_k ∇g(y_{k−1}))
  y_k = x_k + ((k−1)/(k+2)) (x_k − x_{k−1})
- same computational complexity as the proximal gradient method
- with h = 0, this is the accelerated gradient descent of Nesterov (1983)
- can be used with various stepsize rules: fixed, backtracking line search (BLS), ...
- interpretation: ((k−1)/(k+2))(x_k − x_{k−1}) is a momentum term that prevents zigzagging
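The two-line iteration above is short enough to sketch directly. Below is a minimal Python sketch (the function and argument names are my own, not from the lecture), taking the prox operator as a black box:

```python
import numpy as np

def accel_prox_grad(grad_g, prox_h, x0, t, n_iter=100):
    """Accelerated proximal gradient with the (k-1)/(k+2) momentum rule.

    grad_g : callable, gradient of the smooth part g
    prox_h : callable, prox_h(v, t) = argmin_x h(x) + ||x - v||^2 / (2t)
    t      : fixed step size (e.g. 1/L when grad g is L-Lipschitz)
    """
    x_prev = x0
    y = x0
    for k in range(1, n_iter + 1):
        x = prox_h(y - t * grad_g(y), t)          # proximal gradient step at y
        y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum extrapolation
        x_prev = x
    return x

# With h = 0 the prox is the identity and this is accelerated gradient descent:
c = np.array([1.0, -2.0, 3.0])
x_hat = accel_prox_grad(lambda x: x - c, lambda v, t: v, np.zeros(3), t=1.0)
```

With h = 0 and step t = 1/L this recovers Nesterov's 1983 method; passing a soft-thresholding prox recovers FISTA.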

Other formulations: Beck and Teboulle (2009)
Fix step size t = 1/L(∇g). Choose x_0, set y_0 = x_0, θ_0 = 1. Loop: for k = 1, 2, ...
(a) x_k = prox_{t h}(y_{k−1} − t ∇g(y_{k−1}))
(b) θ_k = 2 / (1 + √(1 + 4/θ_{k−1}²))
(c) y_k = x_k + θ_k (1/θ_{k−1} − 1)(x_k − x_{k−1})

With BLS (knowledge of the Lipschitz constant is not necessary)
Choose x_0, set y_0 = x_0, θ_0 = 1. Loop: for k = 1, 2, ...: backtrack until (this gives t_k)
  f(y_{k−1} − t_k G_{t_k}(y_{k−1})) ≤ Q_{1/t_k}(y_{k−1} − t_k G_{t_k}(y_{k−1}), y_{k−1})
then apply (a), (b), (c) as before, using
  prox_{t_k h}(y_{k−1} − t_k ∇g(y_{k−1})) = y_{k−1} − t_k G_{t_k}(y_{k−1})
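Backtracking plugs into the θ-formulation as follows. This is a hedged sketch (the function name and signature are my own); since h(x_k) appears on both sides, the backtracking test reduces to a condition on the smooth part g alone:

```python
import numpy as np

def fista_backtracking(grad_g, g, prox_h, x0, t0=1.0, beta=0.5, n_iter=100):
    """FISTA with backtracking line search: shrink t until the quadratic
    upper bound holds at the candidate point (no Lipschitz constant needed)."""
    x_prev = x0
    y = x0
    theta_prev = 1.0
    t = t0
    for _ in range(n_iter):
        while True:
            x = prox_h(y - t * grad_g(y), t)
            G = (y - x) / t  # generalized gradient G_t(y)
            # accept t when g(y - t G) <= g(y) - t grad_g(y)^T G + (t/2)||G||^2
            if g(x) <= g(y) - t * grad_g(y) @ G + 0.5 * t * (G @ G):
                break
            t *= beta  # otherwise shrink the step and retry
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev**2))
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, theta_prev = x, theta
    return x

# Quick check with g(x) = ||x - c||^2 / 2 and h = 0 (prox = identity):
c = np.array([2.0, -1.0, 0.5])
x_hat = fista_backtracking(lambda x: x - c,
                           lambda x: 0.5 * np.sum((x - c) ** 2),
                           lambda v, t: v, np.zeros(3))
```

Keeping the accepted t for the next iteration (rather than resetting to t0) makes the step sizes nonincreasing, which is what the convergence analysis assumes.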

Convergence analysis
Theorem: f(x_k) − f* ≤ 2‖x_0 − x*‖² / ((k+1)² t), where t = 1/L for a fixed step size and t = β/L for BLS.
Other 1/k² first-order methods:
- Nesterov (2007): two auxiliary sequences {y_k}, {z_k}; two prox operations at each iteration; convergence analysis by Lu, Lan and Monteiro
- Tseng
- Auslender and Teboulle
Unified analysis framework: Tseng (2008).

Proof (Beck and Teboulle's version)
Define v_k = (1/θ_{k−1}) x_k − (1/θ_{k−1} − 1) x_{k−1}, so that y_k = θ_k v_k + (1 − θ_k) x_k. Then
(i) v_{k+1} = v_k + (1/θ_k)(x_{k+1} − y_k)
(ii) (1 − θ_k)/θ_k² = 1/θ_{k−1}²
Proof of (ii): with u = √(4/θ_{k−1}² + 1), so that θ_k = 2/(1 + u),
(1 − θ_k)/θ_k² = ((1 + u)²/4) · (u − 1)/(u + 1) = (u² − 1)/4 = (4/θ_{k−1}² + 1 − 1)/4 = 1/θ_{k−1}².
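Identity (ii), and the decay θ_k ≤ 2/(k+2) that the convergence proof uses later, can be checked numerically. A quick sketch (not part of the lecture):

```python
import numpy as np

theta = 1.0  # theta_0
for k in range(1, 50):
    theta_next = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta**2))
    # identity (ii): (1 - theta_k) / theta_k^2 = 1 / theta_{k-1}^2
    assert np.isclose((1 - theta_next) / theta_next**2, 1.0 / theta**2)
    # decay used in the convergence proof: theta_k <= 2 / (k + 2)
    assert theta_next <= 2.0 / (k + 2) + 1e-12
    theta = theta_next
```

The second assertion is exactly the bound 1/θ_{k−1}² ≥ (k+1)²/4 invoked at the end of the proof.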

Increment in one iteration (Beck and Teboulle; Vandenberghe)
Notation: x = x_{i−1}, x⁺ = x_i, y = y_{i−1}, v = v_{i−1}, v⁺ = v_i, θ = θ_{i−1}.
Pillars of the analysis:
(1) f(x⁺) ≤ f(x) + G_t(y)ᵀ(y − x) − (t/2)‖G_t(y)‖²
(2) f(x⁺) ≤ f* + G_t(y)ᵀ(y − x*) − (t/2)‖G_t(y)‖²
Take the convex combination (1 − θ)·(1) + θ·(2):
f(x⁺) ≤ (1 − θ)f(x) + θf* + ⟨G_t(y), y − (1 − θ)x − θx*⟩ − (t/2)‖G_t(y)‖²
      = (1 − θ)f(x) + θf* + θ⟨G_t(y), v − x*⟩ − (t/2)‖G_t(y)‖²

(The last equality holds because y = θv + (1 − θ)x.) Completing the square,
f(x⁺) − f* ≤ (1 − θ)[f(x) − f*] + (θ²/2t)[‖v − x*‖² − ‖v − x* − (t/θ)G_t(y)‖²]
and, using x⁺ = y − t G_t(y) together with (i),
v − (t/θ)G_t(y) = v + (1/θ)[(y − t G_t(y)) − y] = v + (1/θ)(x⁺ − y) = v⁺.
Therefore
f(x⁺) − f* ≤ (1 − θ)[f(x) − f*] + (θ²/2t)[‖v − x*‖² − ‖v⁺ − x*‖²]
Conclusion (dividing by θ² = θ_{i−1}²):
(1/θ_{i−1}²)[f(x_i) − f*] + (1/2t)‖v_i − x*‖² ≤ ((1 − θ_{i−1})/θ_{i−1}²)[f(x_{i−1}) − f*] + (1/2t)‖v_{i−1} − x*‖²

By (ii) we have (1 − θ_{i−1})/θ_{i−1}² = 1/θ_{i−2}², so the inequality telescopes:
(1/θ_{k−1}²)[f(x_k) − f*] + (1/2t)‖v_k − x*‖² ≤ ((1 − θ_0)/θ_0²)[f(x_0) − f*] + (1/2t)‖v_0 − x*‖²
Since θ_0 = 1 and v_0 = x_0,
(1/θ_{k−1}²)(f(x_k) − f*) ≤ (1/2t)‖x_0 − x*‖²
Since 1/θ_{k−1}² ≥ (k+1)²/4,
f(x_k) − f* ≤ 2‖x_0 − x*‖² / ((k+1)² t)
The argument with BLS is similar; see Beck and Teboulle (2009).
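As a sanity check (not part of the lecture), one can run the θ-formulation on a smooth quadratic (h = 0, so the prox is the identity) and verify that the bound f(x_k) − f* ≤ 2‖x_0 − x*‖²/((k+1)² t) holds at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
B = rng.standard_normal((n, n))
Q = B.T @ B / n + np.eye(n)            # positive definite, so x* = 0, f* = 0
f = lambda x: 0.5 * x @ Q @ x
t = 1.0 / np.linalg.eigvalsh(Q).max()  # fixed step t = 1/L

x0 = rng.standard_normal(n)
x_prev, y, theta_prev = x0, x0.copy(), 1.0
for k in range(1, 200):
    x = y - t * (Q @ y)  # gradient step (prox_h = identity since h = 0)
    theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev**2))
    y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
    x_prev, theta_prev = x, theta
    # theorem: f(x_k) - f* <= 2 ||x0 - x*||^2 / ((k+1)^2 t)
    assert f(x) <= 2 * (x0 @ x0) / ((k + 1) ** 2 * t) + 1e-12
```

In practice the iterates beat the worst-case O(1/k²) bound by a wide margin on well-conditioned quadratics; the bound is tight only for adversarial instances like the one in the optimality section below.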

Case study: LASSO
min f(x) = (1/2)‖Ax − b‖₂² + λ‖x‖₁
Choose x_0, set y_0 = x_0 and θ_0 = 1, and repeat until convergence:
  x_k = S_{t_k λ}(y_{k−1} − t_k Aᵀ(Ay_{k−1} − b))
  θ_k = 2 / (1 + √(1 + 4/θ_{k−1}²))
  y_k = x_k + θ_k (1/θ_{k−1} − 1)(x_k − x_{k−1})
where S_t is soft-thresholding at level t. Dominant computational cost per iteration: one application of A and one application of Aᵀ.
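The LASSO iteration above fits in a few lines. A sketch (function names are mine), with fixed step t = 1/λ_max(AᵀA):

```python
import numpy as np

def soft_threshold(v, tau):
    """S_tau(v): componentwise soft-thresholding at level tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=500):
    """FISTA for min (1/2)||Ax - b||_2^2 + lam ||x||_1, fixed step t = 1/L."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2  # L = lambda_max(A^T A) = ||A||_2^2
    x = x_prev = np.zeros(A.shape[1])
    y = x.copy()
    theta_prev = 1.0
    for _ in range(n_iter):
        x = soft_threshold(y - t * A.T @ (A @ y - b), t * lam)
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev**2))
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, theta_prev = x, theta
    return x
```

As the slide notes, each iteration touches A only twice: once as A @ y and once as A.T @ (...), so the per-iteration cost matches the plain proximal gradient method.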

Example from Beck and Teboulle (FISTA)
[Figure 5 of Beck and Teboulle's FISTA paper: comparison of the function value errors F(x_k) − F(x*) of ISTA, MTWIST, and FISTA over 10,000 iterations.]

Example from Vandenberghe (EE 236C, UCLA): 1-norm regularized least squares
minimize (1/2)‖Ax − b‖₂² + ‖x‖₁
with randomly generated A ∈ R^{2000×1000}; step size t_k = 1/L with L = λ_max(AᵀA).
[Figure: relative error (f(x^(k)) − f*)/f* versus iteration k.]

Nuclear norm regularization
min g(X) + λ‖X‖_*
General gradient update:
X⁺ = argmin_X { (1/2t)‖X − (X_0 − t∇g(X_0))‖_F² + λ‖X‖_* } = S_{tλ}(X_0 − t∇g(X_0))
where S_τ is the singular value soft-thresholding operator: if X = Σ_{j=1}^r σ_j u_j v_jᵀ, then
S_τ(X) := Σ_{j=1}^r max(σ_j − τ, 0) u_j v_jᵀ
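The singular value soft-thresholding operator is a one-liner on top of an SVD. A sketch (the function name is mine):

```python
import numpy as np

def sv_soft_threshold(X, tau):
    """S_tau(X): soft-threshold the singular values of X at level tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # shrink each singular value by tau, dropping those below tau
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# On a diagonal matrix with nonnegative entries this reduces to
# soft-thresholding the diagonal:
Y = sv_soft_threshold(np.diag([3.0, 1.0, 0.5]), 1.0)
```

Note that shrinking sets small singular values to zero, so the update also reduces rank, which is what makes the nuclear norm a convex surrogate for low rank.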

Example
min (1/2)‖A(X) − b‖² + λ‖X‖_*
Choose X_0, set Y_0 = X_0, θ_0 = 1 and repeat:
  X_k = S_{t_k λ}[Y_{k−1} − t_k A*(A(Y_{k−1}) − b)]
  θ_k = ..., Y_k = ... (as before)
Important remark: one only needs to compute the top part of the SVD of Y_{k−1} − t_k A*(A(Y_{k−1}) − b), i.e. the singular values exceeding t_k λ.

Example from Vandenberghe (EE 236C, UCLA): matrix completion
minimize Σ_{(i,j) ∈ obs.} (X_ij − M_ij)² + λ‖X‖_*
where X is 500 × 500 with 5,000 observed entries; fixed step size t = 1/L.
[Figure: relative error (f(X^(k)) − f*)/f* versus iteration k, showing convergence.]

Optimality of Nesterov's method
min f(x), with f convex and ∇f Lipschitz. No method that updates x_k in span{x_0, ∇f(x_0), ..., ∇f(x_{k−1})} can converge faster than 1/k². Hence 1/k² is the optimal rate for first-order methods.

Why? Consider
f(x) = (1/2) xᵀAx − e₁ᵀx
where A ∈ R^{n×n} is the tridiagonal matrix with 2 on the diagonal and −1 on the sub- and superdiagonals, and e₁ = (1, 0, ..., 0)ᵀ. Then A ⪰ 0, ‖A‖ ≤ 4, and the solution obeys Ax* = e₁:
x*_i = 1 − i/(n+1),  f* = −n/(2(n+1))
Moreover, since (k+1)³ − k³ ≥ 3k² for 0 ≤ k ≤ n implies Σ_{k=1}^n k² ≤ (n+1)³/3,
‖x*‖² = (1/(n+1)²)(n² + ... + 1) ≤ (n+1)/3
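These closed-form claims are easy to verify numerically. A sketch for n = 50 (not part of the lecture):

```python
import numpy as np

n = 50
# Tridiagonal A: 2 on the diagonal, -1 on the sub- and superdiagonals
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0

# Claimed minimizer x*_i = 1 - i/(n+1) indeed solves A x = e1
x_star = 1 - np.arange(1, n + 1) / (n + 1)
assert np.allclose(A @ x_star, e1)

# Optimal value and norm bound from the slide
f_star = 0.5 * x_star @ A @ x_star - e1 @ x_star
assert np.isclose(f_star, -n / (2 * (n + 1)))
assert x_star @ x_star <= (n + 1) / 3

# Spectral claims: A is positive semidefinite with norm at most 4
eigs = np.linalg.eigvalsh(A)
assert eigs.min() >= -1e-10 and eigs.max() <= 4
```

The eigenvalues of this A are 2 − 2cos(jπ/(n+1)), which is why they all lie in (0, 4).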

Start a first-order algorithm at x_0 = 0. Then
span(∇f(x_0)) = span(e₁), so x_1 ∈ span(e₁)
span(∇f(x_1)) ⊆ span(e₁, e₂), so x_2 ∈ span(e₁, e₂)
...
so after k iterations, x_k ∈ span(e₁, ..., e_k). Hence, for k ≤ n/2, say n = 2k + 1,
f(x_k) ≥ inf_{x: x_{k+1} = ... = x_n = 0} f(x) = −k/(2(k+1))
so
f(x_k) − f* ≥ n/(2(n+1)) − k/(2(k+1)) = 1/(4(k+1))
Therefore
f(x_k) − f* ≥ 1/(4(k+1)) = ‖x*‖² / (4(k+1)‖x*‖²) ≥ 3‖x*‖² / (4(k+1)(n+1)) = 3‖x*‖² / (8(k+1)²)

References
1. Y. Nesterov, "Gradient methods for minimizing composite objective function," CORE Discussion Paper, Université Catholique de Louvain, 2007.
2. A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, 2009.
3. M. Teboulle, "First Order Algorithms for Convex Minimization," Optimization Tutorials, IPAM, UCLA, 2010.
4. L. Vandenberghe, EE236C lecture notes (Spring 2011), UCLA.