Accelerating Nesterov's Method for Strongly Convex Functions

Hao Chen, Xiangrui Meng. MATH301, 2011.

Outline: 1. The Gap  2. ...  3. ...

Our talk begins with a tiny gap. For any $x_0 \in \mathbb{R}^\infty$ and any constants $\mu > 0$, $L > \mu$, there exists a function $f \in \mathcal{S}^{\infty,1}_{\mu,L}$ such that for any first-order method we have
$$f(x_k) - f^* \ge \frac{\mu}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k}\|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}.$$
Nesterov's method generates a sequence $\{x_k\}_{k=0}^\infty$ such that
$$f(x_k) - f^* \le L\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}\right)^{k}\|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}.$$

At a closer look, the gap is not tiny. Assume that $\kappa$ is large. Given a small tolerance $\epsilon > 0$, to make $f(x_k) - f^* < \epsilon$, the ideal first-order method needs
$$K^* = \frac{\log\epsilon - \log\frac{\mu}{2}}{2\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}} \approx \frac{\sqrt{\kappa}}{4}\log\frac{1}{\epsilon}$$
iterations. Nesterov's method needs
$$K = \frac{\log\epsilon - \log L}{\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}} \approx \sqrt{\kappa}\log\frac{1}{\epsilon}$$
iterations, which is about 4 times the ideal number.
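
A quick numerical check (not from the slides) of the factor of 4: compare how many iterations each bound requires per unit of $\log\frac{1}{\epsilon}$; the ratio approaches 4 as $\kappa$ grows.

```python
import numpy as np

# Iterations needed per unit of log(1/eps), i.e. the asymptotic cost per digit
# of accuracy, for the lower bound and for Nesterov's method.
for kappa in [1e2, 1e4, 1e6]:
    rk = np.sqrt(kappa)
    per_digit_ideal = 1.0 / (2 * abs(np.log((rk - 1) / (rk + 1))))  # ~ sqrt(kappa)/4
    per_digit_nesterov = 1.0 / abs(np.log((rk - 1) / rk))           # ~ sqrt(kappa)
    print(kappa, per_digit_nesterov / per_digit_ideal)              # ratio tends to 4
```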

Can we reduce the gap? Can we reduce the gap for quadratic functions,
$$\text{minimize } f(x) = \tfrac{1}{2}x^TAx - b^Tx, \qquad \mu I_n \preceq A \preceq L I_n?$$
In this case we do have an ideal method, the conjugate gradient method, which attains the optimal convergence rate. Can we reduce it for general strongly convex functions,
$$\text{minimize } f(x), \qquad f \in \mathcal{S}^{1}_{\mu,L}?$$

Nesterov's constant step scheme, III.
0. Choose $y_0 = x_0 \in \mathbb{R}^n$.
1. $k$-th iteration ($k \ge 0$):
$$x_{k+1} = y_k - h\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \beta(x_{k+1} - x_k),$$
where $h = \frac{1}{L}$ and $\beta = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}}$.
Q: Is Nesterov's choice of $h$ and $\beta$ optimal?
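
A minimal Python sketch of this constant-step scheme, assuming the caller supplies a gradient oracle `grad` together with the constants `L` and `mu`; the function name and signature are illustrative, not from the slides.

```python
import numpy as np

def nesterov_constant_step(grad, x0, L, mu, iters=1000):
    """Nesterov's constant step scheme III: h = 1/L, beta = (1-sqrt(mu*h))/(1+sqrt(mu*h))."""
    h = 1.0 / L
    beta = (1 - np.sqrt(mu * h)) / (1 + np.sqrt(mu * h))
    x = y = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_new = y - h * grad(y)          # gradient step from the extrapolated point
        y = x_new + beta * (x_new - x)   # momentum (extrapolation) step
        x = x_new
    return x
```

For the quadratic $f(x)=\frac12 x^TAx - b^Tx$ used below, the oracle is simply `grad = lambda x: A @ x - b`.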

On quadratic functions. When minimizing a quadratic function $f(x) = \frac{1}{2}x^TAx - b^Tx$, Nesterov's updates become:
0. Choose $y_0 = x_0 = 0$.
1. $k$-th iteration ($k \ge 0$):
$$x_{k+1} = y_k - h(Ay_k - b), \qquad y_{k+1} = x_{k+1} + \beta(x_{k+1} - x_k).$$

Eigendecomposition. Let $A = V\Lambda V^T$ be $A$'s eigendecomposition. Define $\bar{x}_k = V^Tx_k$, $\bar{y}_k = V^Ty_k$ for all $k$, and $\bar{b} = V^Tb$. Then Nesterov's updates can be written as:
0. Choose $\bar{y}_0 = \bar{x}_0 = 0$.
1. $k$-th iteration ($k \ge 0$):
$$\bar{x}_{k+1} = \bar{y}_k - h(\Lambda\bar{y}_k - \bar{b}), \qquad \bar{y}_{k+1} = \bar{x}_{k+1} + \beta(\bar{x}_{k+1} - \bar{x}_k).$$
$\Lambda$ is diagonal, hence the updates are actually element-wise:
$$\bar{x}_{k+1,i} = \bar{y}_{k,i} - h(\lambda_i\bar{y}_{k,i} - \bar{b}_i), \qquad \bar{y}_{k+1,i} = \bar{x}_{k+1,i} + \beta(\bar{x}_{k+1,i} - \bar{x}_{k,i}), \qquad i = 1,\dots,n.$$

Recurrence relation. We can eliminate the sequence $\{\bar{y}_k\}$ from the update scheme:
$$\bar{x}_{k+1,i} = \bar{y}_{k,i} - h(\lambda_i\bar{y}_{k,i} - \bar{b}_i) = \big(\bar{x}_{k,i} + \beta(\bar{x}_{k,i} - \bar{x}_{k-1,i})\big) - h\big(\lambda_i(\bar{x}_{k,i} + \beta(\bar{x}_{k,i} - \bar{x}_{k-1,i})) - \bar{b}_i\big) = (1+\beta)(1-\lambda_ih)\bar{x}_{k,i} - \beta(1-\lambda_ih)\bar{x}_{k-1,i} + h\bar{b}_i.$$
Let $\bar{e}_k = V^T(x_k - x^*) = V^T(x_k - V\Lambda^{-1}V^Tb) = \bar{x}_k - \Lambda^{-1}\bar{b}$ for all $k$. We have the following recurrence relation on the error:
$$\bar{e}_{k+1,i} = (1+\beta)(1-\lambda_ih)\bar{e}_{k,i} - \beta(1-\lambda_ih)\bar{e}_{k-1,i}.$$
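
A quick numerical check (not part of the slides) that the iterates of the scheme do satisfy this two-term recurrence in the eigenbasis; the problem data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, L = 5, 1.0, 10.0

# Random SPD matrix with spectrum in [mu, L] and known eigenbasis V = Q.
lam = np.linspace(mu, L, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

h = 1.0 / L
beta = (1 - np.sqrt(mu * h)) / (1 + np.sqrt(mu * h))

# Run the scheme and record errors in the eigenbasis, e_k = V^T (x_k - x*).
x = y = np.zeros(n)
errs = [Q.T @ (x - x_star)]
for _ in range(20):
    x_new = y - h * (A @ y - b)
    y = x_new + beta * (x_new - x)
    x = x_new
    errs.append(Q.T @ (x - x_star))

# Check e_{k+1,i} = (1+beta)(1-lam_i h) e_{k,i} - beta(1-lam_i h) e_{k-1,i}.
for k in range(1, 20):
    rhs = (1 + beta) * (1 - lam * h) * errs[k] - beta * (1 - lam * h) * errs[k - 1]
    assert np.allclose(errs[k + 1], rhs)
```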

Characteristic equation. The characteristic equation for the recurrence relation is
$$\xi_i^2 = (1+\beta)(1-\lambda_ih)\xi_i - \beta(1-\lambda_ih).$$
Denote the two roots by $\xi_{i,1}$ and $\xi_{i,2}$, and assume they are distinct for simplicity. The general solution is
$$\bar{e}_{k,i} = C_{i,1}\xi_{i,1}^k + C_{i,2}\xi_{i,2}^k.$$
Let $C_i = |C_{i,1}| + |C_{i,2}|$ and $\theta_i = \max\{|\xi_{i,1}|, |\xi_{i,2}|\}$. We have $|\bar{e}_{k,i}| \le C_i\theta_i^k$. Hence,
$$\|x_k - x^*\|^2 = \|\bar{x}_k - \bar{x}^*\|^2 = \sum_i |\bar{e}_{k,i}|^2 \le \sum_i C_i^2\theta_i^{2k} \le C\theta^{2k},$$
where $C = \sum_i C_i^2$ and $\theta = \max_i \theta_i$.

Finding the optimal convergence rate. Our problem becomes
$$\text{minimize } \theta \quad \text{subject to } \theta \ge |\xi_1(\lambda)|,\ |\xi_2(\lambda)| \quad \text{for all } \lambda \in [\mu, L],$$
where $\xi_1(\lambda)$ and $\xi_2(\lambda)$ are the roots of $\xi^2 = (1+\beta)(1-\lambda h)\xi - \beta(1-\lambda h)$, and $h$, $\beta$, and $\theta$ are the variables.

Special cases. If $\beta = 0$, we are doing gradient descent. The optimal rate is $\theta = \frac{L-\mu}{L+\mu}$, attained at $h = \frac{2}{L+\mu}$. If $h = \frac{1}{L}$, the optimal rate is $\theta = 1 - \sqrt{\mu h} = 1 - \sqrt{\mu/L}$, attained at $\beta = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}} = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$, which confirms Nesterov's choice. Q: Why do we choose $h = \frac{1}{L}$? It guarantees the largest decrease in function value for a function whose gradient is Lipschitz continuous with constant $L$.
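
One way to see where $\theta = 1 - \sqrt{\mu h}$ comes from (a brief derivation sketch, not spelled out on the slide): when the roots of the characteristic equation are complex conjugates, their common modulus is the square root of their product $\beta(1-\lambda_i h)$, which is largest at $\lambda_i = \mu$. With $\beta = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}}$,
$$\theta^2 = \beta(1-\mu h) = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}}\,(1-\sqrt{\mu h})(1+\sqrt{\mu h}) = (1-\sqrt{\mu h})^2, \qquad\text{so}\qquad \theta = 1-\sqrt{\mu h}.$$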

The optimal convergence rate. By considering all combinations of $h$ and $\beta$, we reach the following optimal solution:
$$h = \frac{4}{3L+\mu} \ \left(\text{the harmonic mean of } \frac{1}{L} \text{ and } \frac{2}{L+\mu}\right), \qquad \beta = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}}, \qquad \theta = 1 - \sqrt{\mu h} = 1 - \frac{2}{\sqrt{3\kappa+1}}.$$
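
A small experiment (not from the slides; the problem data and names are illustrative) comparing the two step sizes on a random quadratic with the matching $\beta$; the tuned step $h = 4/(3L+\mu)$ should reach a smaller error for the same iteration budget.

```python
import numpy as np

def run(A, b, h, mu, iters=300):
    """Nesterov's scheme with step size h and beta = (1-sqrt(mu*h))/(1+sqrt(mu*h))."""
    beta = (1 - np.sqrt(mu * h)) / (1 + np.sqrt(mu * h))
    x = y = np.zeros(len(b))
    for _ in range(iters):
        x_new = y - h * (A @ y - b)
        y = x_new + beta * (x_new - x)
        x = x_new
    return x

rng = np.random.default_rng(1)
n, mu, L = 200, 1.0, 1000.0
lam = np.linspace(mu, L, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A, b = Q @ np.diag(lam) @ Q.T, rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

for h in (1.0 / L, 4.0 / (3 * L + mu)):   # Nesterov's step vs. the tuned step
    err = np.linalg.norm(run(A, b, h, mu) - x_star)
    print(f"h = {h:.6f}: ||x_k - x*|| = {err:.3e}")
```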

Comparing the convergence rates. Nesterov's method ($h = \frac{1}{L}$):
$$\|x_k - x^*\| \le C\left(1 - \frac{1}{\sqrt{\kappa}}\right)^k \|x_0 - x^*\|.$$
Note that this is better than the convergence rate we have for general strongly convex functions. Nesterov's method ($h = \frac{4}{3L+\mu}$):
$$\|x_k - x^*\| \le C\left(1 - \frac{2}{\sqrt{3\kappa+1}}\right)^k \|x_0 - x^*\|.$$
Conjugate gradient:
$$\|x_k - x^*\|_A \le 2\left(1 - \frac{2}{\sqrt{\kappa}+1}\right)^k \|x_0 - x^*\|_A.$$

What's happening in the eigenspace. Figure: error along eigendirections ($|\bar{e}_{k,i}|$).

The model problem.
$$\text{minimize } f(x) = \tfrac{1}{2}x^TAx - b^Tx,$$
where $A = \mathrm{tridiag}(-1, 2, -1) + \delta I_n \in \mathbb{R}^{n\times n}$ and $b = \mathrm{randn}(n,1) \in \mathbb{R}^n$. We chose $n = 10^6$ and $\delta = 0.05$.
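
A sketch of the model problem in Python, assuming the sign convention $\mathrm{tridiag}(-1,2,-1)$ reconstructed above; a sparse matrix keeps the problem cheap even at $n = 10^6$ (a smaller $n$ is used here).

```python
import numpy as np
import scipy.sparse as sp

# Model problem: A = tridiag(-1, 2, -1) + delta*I, b random.
n, delta = 10_000, 0.05
A = sp.diags([-np.ones(n - 1), 2 * np.ones(n) + delta, -np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csr")
b = np.random.randn(n)
grad = lambda x: A @ x - b   # gradient oracle for f(x) = 0.5 x^T A x - b^T x
```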

Figure: $\|x_k - x^*\|$.

Figure: $f(x_k) - f^*$.

Back to Nesterov's proof. A pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$, $\lambda_k \ge 0$, is called an estimate sequence of the function $f(x)$ if $\lambda_k \to 0$ and for any $x \in \mathbb{R}^n$ and all $k \ge 0$ we have
$$\phi_k(x) \le (1-\lambda_k)f(x) + \lambda_k\phi_0(x).$$
If for a sequence $\{x_k\}$ we have
$$f(x_k) \le \phi_k^* \equiv \min_{x\in\mathbb{R}^n} \phi_k(x),$$
then
$$f(x_k) - f^* \le \lambda_k[\phi_0(x^*) - f^*] \to 0.$$

A useful estimate sequence provided by Nesterov:
$$\lambda_{k+1} = (1-\alpha_k)\lambda_k,$$
$$\phi_{k+1}(x) = (1-\alpha_k)\phi_k(x) + \alpha_k\left[f(y_k) + \langle\nabla f(y_k), x - y_k\rangle + \frac{\mu}{2}\|x - y_k\|^2\right],$$
where $\{y_k\}$ is an arbitrary sequence in $\mathbb{R}^n$, $\alpha_k \in (0,1)$ with $\sum_{k=0}^\infty \alpha_k = \infty$, $\lambda_0 = 1$, and $\phi_0$ is an arbitrary function on $\mathbb{R}^n$.

A specific choice of $\phi_0(x)$. Take $\phi_0(x) \equiv \phi_0^* + \frac{\gamma_0}{2}\|x - v_0\|^2$ and set $x_0 = v_0$, $\phi_0^* = f(x_0)$. The previous estimate sequence keeps the form $\phi_k(x) \equiv \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2$ with
$$\gamma_{k+1} = (1-\alpha_k)\gamma_k + \alpha_k\mu,$$
$$v_{k+1} = \big[(1-\alpha_k)\gamma_kv_k + \alpha_k\mu y_k - \alpha_k\nabla f(y_k)\big]/\gamma_{k+1},$$
$$\phi_{k+1}^* = (1-\alpha_k)\phi_k^* + \alpha_kf(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k)\|^2 + \frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle\nabla f(y_k), v_k - y_k\rangle\right).$$

Let the update be $x_{k+1} = y_k - h_k\nabla f(y_k)$ and use the inequalities
$$\phi_k^* \ge f(x_k) \ge f(y_k) + \langle\nabla f(y_k), x_k - y_k\rangle + \frac{\mu}{2}\|x_k - y_k\|^2,$$
$$f(x_{k+1}) \le f(y_k) - \frac{h_k(2 - Lh_k)}{2}\|\nabla f(y_k)\|^2.$$
We have
$$\phi_{k+1}^* - f(x_{k+1}) \ge \left(-\frac{\alpha_k^2}{2\gamma_{k+1}} + \frac{h_k(2 - Lh_k)}{2}\right)\|\nabla f(y_k)\|^2 + (1-\alpha_k)\left\langle\nabla f(y_k),\ \frac{\alpha_k\gamma_k}{\gamma_{k+1}}(v_k - y_k) + (x_k - y_k)\right\rangle + \frac{\mu(1-\alpha_k)}{2}\left(\frac{\alpha_k\gamma_k}{\gamma_{k+1}}\|v_k - y_k\|^2 + \|x_k - y_k\|^2\right).$$

In the inequality above, Nesterov's choice is
$$y_k = \frac{\alpha_k\gamma_k v_k + \gamma_{k+1}x_k}{\gamma_k + \alpha_k\mu}, \qquad h_k = \frac{1}{L}, \qquad \gamma_0 \ge \mu,$$
which makes the inner-product term vanish. Keeping the coefficient of $\|\nabla f(y_k)\|^2$ nonnegative then requires $L\alpha_k^2 \le \gamma_{k+1}$. Since $\gamma_{k+1} = (1-\alpha_k)\gamma_k + \alpha_k\mu$, we have $\gamma_k \ge \mu$, so $\alpha_k$ can be as large as $\sqrt{\mu/L}$ at each step, which leads to the convergence rate $1 - \sqrt{\mu/L} = 1 - \frac{1}{\sqrt{\kappa}}$.

A simplified version. Take $\gamma_k \equiv \mu$ and $h_k \equiv \frac{1}{L}$. Then
$$y_k = \frac{\alpha_kv_k + x_k}{\alpha_k+1}, \qquad v_k - y_k = \frac{v_k - x_k}{\alpha_k+1}, \qquad x_k - y_k = \frac{\alpha_k(x_k - v_k)}{\alpha_k+1},$$
and
$$\phi_{k+1}^* - f(x_{k+1}) \ge \left(-\frac{\alpha_k^2}{2\mu} + \frac{1}{2L}\right)\|\nabla f(y_k)\|^2 + \frac{\mu\alpha_k(1-\alpha_k)}{2(1+\alpha_k)}\|x_k - v_k\|^2.$$

Figure: $\|x_k - v_k\|^2 / \|\nabla f(y_k)\|^2$ for $f(x) = \frac{1}{2}\|Ax-b\|^2 + \lambda\,\mathrm{smooth}(\|x\|_1, \tau) + \frac{\mu}{2}\|x\|^2$.

For $\phi_{k+1}^* \ge f(x_{k+1})$ it suffices that
$$\frac{\mu\alpha_k(1-\alpha_k)}{2(1+\alpha_k)}\|x_k - v_k\|^2 \;\ge\; \left(\frac{\alpha_k^2}{2\mu} - \frac{1}{2L}\right)\left\|\nabla f\!\left(\frac{\alpha_k v_k + x_k}{\alpha_k+1}\right)\right\|^2.$$
Since the decay rate is $\prod_k(1-\alpha_k)$, we want to find a large $\alpha_k$ such that the inequality holds. Evaluating $\nabla f\!\left(\frac{\alpha_k v_k + x_k}{\alpha_k+1}\right)$ is time-consuming, so we hope our first guess of $\alpha_k$ is good. Since $\|\nabla f(y_k)\|$ tends to decrease, our procedure is to find an $\alpha_k \ge \sqrt{\mu/L}$ such that
$$\frac{\mu\alpha_k(1-\alpha_k)}{2(1+\alpha_k)}\|x_k - v_k\|^2 - \frac{\alpha_k^2}{2\mu}\|\nabla f(y_{k-1})\|^2$$
is large; such an $\alpha_k$ usually makes the inequality hold.
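
A minimal Python sketch of this adaptive step, under assumptions: the surrogate above is maximized over a simple grid of candidate values of $\alpha_k$, and the estimate-sequence update for $v_k$ is taken with $\gamma_k \equiv \mu$; the slides do not specify the exact selection rule or safeguards, so the names and details below are illustrative.

```python
import numpy as np

def adaptive_nesterov(f_grad, x0, L, mu, iters=500, n_grid=50):
    """Sketch of the adaptive scheme: gamma_k = mu, h_k = 1/L, and alpha_k chosen
    greedily by maximizing the surrogate built from the previous gradient norm."""
    x = v = np.asarray(x0, dtype=float)
    alpha_min = np.sqrt(mu / L)
    g_prev = np.linalg.norm(f_grad(x))   # stand-in for ||grad f(y_{k-1})|| at the first step
    for _ in range(iters):
        # Surrogate: mu*a*(1-a)/(2(1+a)) * ||x - v||^2  -  a^2/(2*mu) * ||grad f(y_{k-1})||^2
        dxv2 = np.dot(x - v, x - v)
        cand = np.linspace(alpha_min, 0.999, n_grid)
        surr = mu * cand * (1 - cand) / (2 * (1 + cand)) * dxv2 - cand**2 / (2 * mu) * g_prev**2
        alpha = cand[np.argmax(surr)]
        y = (alpha * v + x) / (1 + alpha)
        g = f_grad(y)
        x_new = y - g / L                                    # gradient step, h = 1/L
        v = (1 - alpha) * v + alpha * y - (alpha / mu) * g   # estimate-sequence update, gamma = mu
        x, g_prev = x_new, np.linalg.norm(g)
    return x
```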

Figure: $\|\nabla f(y_k)\|$ for $f(x) = \frac{1}{2}\|Ax-b\|^2 + \lambda\,\mathrm{smooth}(\|x\|_1, \tau) + \frac{\mu}{2}\|x\|^2$.

Test 1: smooth-BPDN. The first test is a smooth version of Basis Pursuit De-Noising:
$$\text{minimize } f(x) = \tfrac{1}{2}\|Ax - b\|^2 + \lambda\,\mathrm{smooth}(\|x\|_1, \tau) + \tfrac{\mu}{2}\|x\|^2,$$
where we set $A = \frac{1}{\sqrt{n}}\,\mathrm{randn}(m, n)$, $m = 1000$, $n = 3000$, $\lambda = 0.2$, $\tau = 0.001$, and $\mu = 0.01$. Here $x$ is a random sparse vector with 125 non-zeros and $b = Ax + \varepsilon$. We use the following estimate for $L$:
$$\hat{L} = \left(1 + \sqrt{\tfrac{m}{n}}\right)^2 + \frac{\lambda}{\tau} + \mu \approx 202.50.$$
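
A sketch of how this test instance might be set up in Python. The slides do not specify the smoothing of $\|x\|_1$ or the noise level; a pseudo-Huber surrogate $\sum_i\sqrt{x_i^2+\tau^2}$ (whose curvature is bounded by $1/\tau$, consistent with $\hat{L}$) and a small Gaussian $\varepsilon$ are assumed here.

```python
import numpy as np

m, n, lam, tau, mu = 1000, 3000, 0.2, 0.001, 0.01
rng = np.random.default_rng(0)

A = rng.standard_normal((m, n)) / np.sqrt(n)
x_true = np.zeros(n)
x_true[rng.choice(n, 125, replace=False)] = rng.standard_normal(125)
b = A @ x_true + 0.01 * rng.standard_normal(m)   # assumed noise level

def f(x):
    return (0.5 * np.sum((A @ x - b) ** 2)
            + lam * np.sum(np.sqrt(x**2 + tau**2))   # assumed smoothing of ||x||_1
            + 0.5 * mu * np.sum(x**2))

def grad(x):
    return A.T @ (A @ x - b) + lam * x / np.sqrt(x**2 + tau**2) + mu * x

L_hat = (1 + np.sqrt(m / n)) ** 2 + lam / tau + mu   # ~ 202.50
```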

Figure: $\|x_k - x^*\|$.

Figure: $f(x_k) - f^*$.

Test 2: anisotropic bowl. The second test is
$$\text{minimize } f(x) = \sum_{i=1}^n i\,x_i^4 + \tfrac{1}{2}\|x\|^2 \quad \text{subject to } \|x\| \le \tau.$$
We choose $n = 500$ and $\tau = 4$; $x_0$ is randomly chosen from the boundary $\{x : \|x\| = \tau\}$. For this problem we have $L = 12n\tau^2 + 1 = 96001$ and $\mu = 1$.
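
A short sketch of the objective, gradient, and constants for this test, assuming the boundary start described above; the bound $L = 12n\tau^2 + 1$ comes from the Hessian of the quartic term over the feasible ball.

```python
import numpy as np

# Anisotropic bowl: f(x) = sum_i i*x_i^4 + 0.5*||x||^2 over ||x|| <= tau.
n, tau = 500, 4.0
idx = np.arange(1, n + 1)

def f(x):
    return np.sum(idx * x**4) + 0.5 * np.dot(x, x)

def grad(x):
    return 4 * idx * x**3 + x

# Over the ball ||x|| <= tau, the second derivative of i*x_i^4 is 12*i*x_i^2 <= 12*n*tau^2,
# so L = 12*n*tau^2 + 1; the 0.5*||x||^2 term gives mu = 1.
L, mu = 12 * n * tau**2 + 1, 1.0

x0 = np.random.randn(n)
x0 *= tau / np.linalg.norm(x0)   # random point on the boundary ||x|| = tau
```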

Figure: $\|x_k - x^*\|$.

Figure: $f(x_k) - f^*$.

Test 3: back to quadratic functions. Let's check the performance of the adaptive algorithm on quadratic functions:
$$\text{minimize } f(x) = \tfrac{1}{2}x^TAx - b^Tx.$$
We choose $A \sim \frac{1}{m}W_n(I_n, m)$ (a scaled Wishart matrix), where $n = 4500$ and $m = 5000$. We use the following estimates for $L$ and $\mu$:
$$\hat{L} = \left(1 + \sqrt{\tfrac{n}{m}}\right)^2, \qquad \hat{\mu} = \left(1 - \sqrt{\tfrac{n}{m}}\right)^2.$$
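
A sketch of how this test instance can be generated, assuming $W_n(I_n, m)$ denotes the Wishart matrix $G^TG$ with $G$ an $m\times n$ standard Gaussian matrix; $\hat{L}$ and $\hat{\mu}$ are the edges of the Marchenko-Pastur spectrum.

```python
import numpy as np

# A ~ (1/m) W_n(I_n, m), i.e. A = G^T G / m with G an m-by-n standard Gaussian matrix.
n, m = 4500, 5000
rng = np.random.default_rng(0)
G = rng.standard_normal((m, n))
A = G.T @ G / m
b = rng.standard_normal(n)

# Marchenko-Pastur edges used as estimates of L and mu.
L_hat = (1 + np.sqrt(n / m)) ** 2
mu_hat = (1 - np.sqrt(n / m)) ** 2
```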

Figure: $\|x_k - x^*\|$.

Figure: $f(x_k) - f^*$.

Comparing with TFOCS (AT). Figure: $\|x_k - x^*\|$.

Figure: $f(x_k) - f^*$.

Final thoughts. The convergence rate of Nesterov's method depends on the problem type: for quadratic problems, the observed rate is roughly twice as fast as the general strongly convex bound suggests. There is room to improve Nesterov's optimal gradient method on strongly convex functions. Whether Nesterov's method can be improved universally (with a theoretical proof) is still an open question.