On Nesterov s Random Coordinate Descent Algorithms - Continued

Similar documents
On Nesterov s Random Coordinate Descent Algorithms

Coordinate Descent Methods on Huge-Scale Optimization Problems

Conditional Gradient (Frank-Wolfe) Method

CSC 576: Gradient Descent Algorithms

Convex Optimization Lecture 16

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Coordinate Descent and Ascent Methods

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Lecture 5: September 12

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

arxiv: v1 [math.oc] 1 Jul 2016

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Numerical Methods. V. Leclère May 15, x R n

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Sub-Sampled Newton Methods

Non-negative Matrix Factorization via accelerated Projected Gradient Descent

Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints

Selected Topics in Optimization. Some slides borrowed from

Lecture 14: October 17

Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725

Higher-Order Methods

CPSC 540: Machine Learning

Non-convex optimization. Issam Laradji

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

Newton s Method. Javier Peña Convex Optimization /36-725

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Constrained Optimization

Optimization Tutorial 1. Basic Gradient Descent

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization

Stochastic Optimization Algorithms Beyond SG

Convergence Analysis of Deterministic. and Stochastic Methods for Convex. Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Lecture 3: Huge-scale optimization problems

Cubic regularization of Newton s method for convex problems with constraints

8 Numerical methods for unconstrained problems

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Unconstrained Optimization

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Global convergence of the Heavy-ball method for convex optimization

Unconstrained minimization of smooth functions

Comparison of Modern Stochastic Optimization Algorithms

Adaptive Restarting for First Order Optimization Methods

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Lecture 14: Newton s Method

Introduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

Numerical optimization

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Lecture 5: September 15

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

Sparse Optimization Lecture: Dual Methods, Part I

Accelerating Nesterov s Method for Strongly Convex Functions

An introduction to complexity analysis for nonconvex optimization

Introduction to Optimization

Scientific Computing: Optimization

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

Lecture 23: Conditional Gradient Method

Gradient methods for minimizing composite functions

Linear Convergence under the Polyak-Łojasiewicz Inequality

Convex Optimization CMU-10725

Fast proximal gradient methods

Linear Convergence under the Polyak-Łojasiewicz Inequality

Accelerated primal-dual methods for linearly constrained convex problems

1 Maximizing a Submodular Function

Lecture 16: FTRL and Online Mirror Descent

Descent methods. min x. f(x)

Accelerated Block-Coordinate Relaxation for Regularized Optimization

ECS171: Machine Learning

Introduction. A Modified Steepest Descent Method Based on BFGS Method for Locally Lipschitz Functions. R. Yousefpour 1

Unconstrained optimization

Majorization Minimization - the Technique of Surrogate

Stochastic Quasi-Newton Methods

Lecture 19: Follow The Regulerized Leader

Gradient methods for minimizing composite functions

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Optimized first-order minimization methods

Optimization II: Unconstrained Multivariable

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent

Stochastic and online algorithms

GRADIENT = STEEPEST DESCENT

A Unified Approach to Proximal Algorithms using Bregman Distance

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

6. Proximal gradient method

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

Line Search Methods for Unconstrained Optimisation

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

Linearly-Convergent Stochastic-Gradient Methods

Transcription:

On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015

1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower Bound of Bregman Distance 2 Variants of Random Coordinate Descent Minimizing Strongly Convex Functions When α 0 When α = 0 Constrained Minimization 3 Summary

Revisit Random Coordinate Descent The Random Coordinate Descent Revisit Random Coordinate Descent When we assume the partial gradient of f (x) is Lipschitz continuous, i.e. i f (x + x) i f (x) (i) L i x (1) We can then introduce the random coordinate descent algorithm (hereafter RCDM), x k+1 i k i k = R α (2) = x k i k + α k ( ik f (x k )) (3) where R α is a random number generator that is subject to some distribution with parameter α.

Revisit Random Coordinate Descent The Random Coordinate Descent Revisit the R α R α N is a special random counter, generating integer number with probability p (i) α = L α i [ n j=1 L α j ] 1 (4) where L i is the Lipschitz constant of the partial gradient i f (x). To give an example, when α = 0, R 0 is a uniform random number generator.

Revisit Random Coordinate Descent Upper and Lower Bound of Bregman Distance Bregman Distance of a Function Definition The Bregman Distance associated with a convex function f at the point y is D y f where p f (v). (x, y) = f (x) f (y) p, x y (5)

Strong Convexity Definition A differentiable function f is called strongly convex with convexity parameter µ > 0 if the following inequality holds for all points x, y in its domain: f (x) f (y) + f (y), x y + µ 2 x y 2 (6) Remark The definition of strong convexity yields a lower bound of Bregman distance of f D y f (x, y) = f (x) f (y) f (y), x y µ 2 x y 2 (7)

Strong Smoothness Definition A function f is 1/L-Lipschitz continuous if for all x, y domf f (x) f (y) L x y (8) In [NN04], a well-known inequality is introduced for a function with 1/L-Lipschitz continuous gradient. f (x) f (y) f (y), x y L 2 x y 2 (9) which yields another bound from the upper side of the Bregman distance D y f (x, y) = f (x) f (y) f (y), x y L 2 x y 2 (10)

Upper and Lower Bound of Bregman Distance From definitions, if f (x) is both µ-strongly convex and L-strongly smooth, we can bound the Bregman distance or Hessian from both upper and lower side, i.e. µ 2 x y 2 f (x) f (y) f (y), x y L x y 2 2 (11) Letting in the Taylor expansion, we now have the bounds for Hessian matrix of twice differentiable f (x) µ 2 f (x) L (12)

Variants of Random Coordinate Descent 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower Bound of Bregman Distance 2 Variants of Random Coordinate Descent Minimizing Strongly Convex Functions When α 0 When α = 0 Constrained Minimization 3 Summary

Variants of Random Coordinate Descent Minimizing Strongly Convex Functions Case when f (x) is strongly convex In [NN04], it is proven that, for deterministic numerical optimization, when objection function f (x) is both µ strongly convex and L strongly smooth, the optimal convergence rate for minimizing f (x) is f (x k ) f µ 2 ( κ 1 κ + 1 ) 2k x 0 x (13) where κ = L is the condition number of f. This type of µ convergence rate is also known as linear rate convergence.

Linear rate convergence of RCDM when f is strongly convex Theorem Let function f (x) be strongly convex with convexity parameter µ > 0. Then, for the sequence {x k } generated by RCDM(α, x 0 ) (α 0), we have where S = n Ef (x k ) f L α j j=1 ( 1 µ S) k (f (x0 ) f ) (14) From this theorem, we can see the Random Coordinate Descent Method can reach the first-order optimal rate when f (x) is both strongly convex and strongly smooth.

Case when α = 0 Here, when α = 0, the R 0 is thus a uniform random number generator. The algorithm is thus 1 Define v 0 = x 0, a 0 = 1 n, b 0 = 2 2 For k > 0 iterate: 1 Solve γ k from γk 2 γ k n = (1 γ kσ n ) a2 k bk 2 Set α k = n γ kσ γ k (n 2 σ), β k = 1 1 n γ kσ 2 Select y k = α k v k + (1 α k )x k 3 Choose i k = R 0 and update x k+1 = T ik (y k ), v k+1 = β k v k + (1 β k )y k γ k L ik f i (y k) 4 Set b k+1 = b k βk, and a k+1 = γ k b k+1

Variants of Random Coordinate Descent Minimizing Strongly Convex Functions Convergence Analysis of Accelerated Random Coordinate Descent Theorem For any k 0 we have [ Ef (x k ) f σ 2 x 0 x 2 1 + 1 ] n (f (x 0) f ) (15) 2 [ σ (1 + 2n )k+1 (1 n ( k + 1 )2 ] 2 σ 2n )k+1 [ 2 x 0 x 2 1 + 1 n 2 (f (x 0) f ) ]

Variants of Random Coordinate Descent Constrained Minimization The constrained problem Consider now the constrained minimization problem min f (x) (16) x Q Here, Q R N is a convex and closed set which constructs a constrain in this problem.

Conditional Gradient Optimization First, we need to fix the random number generator parameter α = 0 and thus R α is a uniform random number generator. The conditional coordinate descent steps are u (i) (x) = arg min u (i) Q i V i (x) = x + (u (i) (x) x (i) ), Under this denotation, the algorithm is [ f i (x), u (i) x (i) + L i 2 u(i) x (i) 2 (i) ], i k = R 0 x k+1 = V ik (x k ) Note this type of update is quite like the Frank-Wolfe ( conditional gradient) method.

Variants of Random Coordinate Descent Constrained Minimization Convergence Analysis for Constrained Random Coordinate Descent Theorem For any k > 0 we have Ef (x k ) f n n + k (1 2 R 2 1 (x 0 ) + f (x 0 ) f ) (17) if f is strongly convex in 1 with constant µ, then Ef (x k ) f ( ) k 2σ 1 ( 1 n(1 + σ) 2 R 1 2 (x 0 ) + f (x 0 ) f ) (18)

Summary Summary 1 We ve revisited the Basic Random Coordinate Descent and the basic definition of strongly convex and strongly smooth. 2 When unconstrained objective function f is both strongly convex and strongly smooth, 1 if α 0, the convergence rate could be accelerated to linear rate i.e. O(θ k ) θ (0, 1) 2 if α = 0, the convergence rate could be accelerated to sublinear rate O( 1 k 2 ) where k is the iteration counter 3 Discussion of Frank-Wolfe-type constrained minimization of random coordinate descent method, could only work with α = 0.

Summary Thanks!

Summary Yurii Nesterov and Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2004.