On Nesterov's Random Coordinate Descent Algorithms



On Nesterov's Random Coordinate Descent Algorithms. Zheng Xu, University of Texas at Arlington. February 19, 2015.

Outline. 1. Introduction: Full-Gradient Descent; Coordinate Descent. 2. Random Coordinate Descent Algorithm: Improved Assumption; The Basic Idea of Random Coordinate Descent; Convergence Analysis. 3. Summary.

Introduction. Unconstrained smooth minimization problem. Consider the following mathematical optimization problem

\min_{x \in \mathbb{R}^N} f(x) \quad (1)

where f(x) is a convex and differentiable function on \mathbb{R}^N.

Introduction: Full-Gradient Descent. When f(x) is differentiable, it is favorable to apply a first-order method such as gradient descent. The basic step of the traditional full-gradient descent method is

x^{k+1} = x^k + \alpha_k \left( -\nabla f(x^k) \right) \quad (2)

where \{x^k\} is the sequence of iterates generated by the gradient descent algorithm, \alpha_k > 0 is the step size (usually given by some line-search technique), and \nabla f(x^k) is the gradient of f(x) at x^k. Note that the gradient direction at x^k is the steepest ascent direction, so -\nabla f(x^k) is the steepest descent direction at x^k.
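
As a rough illustration (not part of the original slides), here is a minimal Python sketch of the full-gradient step (2); the constant step size and the toy objective in the usage line are illustrative assumptions rather than choices made in the talk.

```python
import numpy as np

def full_gradient_descent(grad_f, x0, step=0.1, iters=100):
    """Minimal sketch of full-gradient descent: x^{k+1} = x^k - alpha_k * grad f(x^k).

    grad_f : callable returning the full gradient at a point (cost at least O(N) per call)
    step   : constant step size alpha_k (a line search could be used instead)
    """
    x = np.array(x0, dtype=float)      # work on a copy of the starting point
    for _ in range(iters):
        x = x - step * grad_f(x)       # move along the steepest descent direction
    return x

# Illustrative use on f(x) = 1/2 ||x||^2, whose gradient is simply x.
x_star = full_gradient_descent(lambda x: x, x0=np.ones(5))
```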

Introduction: Full-Gradient Descent. Figure: A basic example in R^2.

Introduction: Coordinate Descent. Huge-scale problems. The full-gradient descent method works! But what if N is incredibly large, e.g. N = 1,000,000,000,000,000? Computing \nabla f(x^k) then incurs an infeasible computational cost, and it may even be impossible to store data of that scale on a single computer! For problems of this type we adopt the term huge-scale problem.

Introduction: Coordinate Descent. Recently, in [Ber99][Nes12][RT14], the age-old coordinate descent method has been revisited to solve huge-scale problems. The basic step of the coordinate descent method is

x^{k+1}_{i_k} = x^k_{i_k} + \alpha_k \left( -\nabla_{i_k} f(x^k) \right) \quad (3)

where i_k is the component index of x updated at the k-th step. The main idea of coordinate descent is to update only a very few components of x in each iteration instead of all of x. The benefit is that the iteration cost of coordinate descent is only O(1) or O(m) with m \ll N, while a full-gradient iteration usually requires at least O(N).
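
For comparison, here is a minimal Python sketch of the cyclic variant of step (3) (not part of the original slides); the partial-gradient oracle, the constant step size, and the toy usage line are illustrative assumptions.

```python
import numpy as np

def cyclic_coordinate_descent(partial_grad, x0, step=0.1, sweeps=100):
    """Minimal sketch of cyclic coordinate descent: one coordinate update per step.

    partial_grad : callable (x, i) -> df/dx_i, so each step touches only coordinate i
    step         : constant step size alpha_k (a coordinate line search could be used instead)
    """
    x = np.array(x0, dtype=float)
    n = x.size
    for _ in range(sweeps):
        for i in range(n):                        # cycle through the coordinates
            x[i] -= step * partial_grad(x, i)     # only component i changes in this step
    return x

# Illustrative use on f(x) = 1/2 ||x||^2, where df/dx_i = x_i.
x_star = cyclic_coordinate_descent(lambda x, i: x[i], x0=np.ones(5))
```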

Introduction: Coordinate Descent. Figure: Example of cyclic coordinate descent in R^2.

Introduction: Coordinate Descent. Does coordinate descent really work? Vague convergence theory: theoretically, it is hard to prove the convergence of conventional cyclic coordinate search, and even harder to estimate its convergence rate. Computationally expensive line search: for coordinate descent methods, line-search strategies based on function values are too expensive. To overcome these issues, in [Nes12] Yu. Nesterov proposed a random coordinate descent method with a clear convergence theory and a cheaper iteration.

2. Random Coordinate Descent Algorithm: Improved Assumption; The Basic Idea of Random Coordinate Descent; Convergence Analysis.

Random Coordinate Descent Algorithm: Improved Assumption. In [Nes12] it is further assumed that every partial gradient of the objective function f(x) is Lipschitz continuous, i.e.

\| \nabla_i f(x + \Delta x) - \nabla_i f(x) \|_{(i)} \le L_i \| \Delta x \|_{(i)} \quad (4)

for perturbations \Delta x along the i-th coordinate (block). Roughly speaking, it is assumed that f(x) is strongly smooth in every component of x \in \mathbb{R}^N.
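
As a concrete example (not taken from the slides), for a least-squares objective the coordinate-wise constants L_i in (4) can be read off directly:

```latex
% Illustrative example: coordinate-wise Lipschitz constants for
% f(x) = \tfrac{1}{2}\|Ax - b\|^2, where a_1,\dots,a_N are the columns of A.
\[
  \nabla_i f(x) = a_i^{\top}(Ax - b), \qquad
  \bigl|\nabla_i f(x + h e_i) - \nabla_i f(x)\bigr|
  = \bigl|a_i^{\top} A e_i\bigr|\,|h|
  = \|a_i\|^2 |h| ,
  \quad\text{so}\quad L_i = \|a_i\|^2 .
\]
```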

Random Coordinate Descent Algorithm: The Basic Idea of Random Coordinate Descent. The basic idea of random coordinate descent is to adopt random index selection instead of cyclic index selection:

i_k = R_\alpha \quad (5)
x^{k+1}_{i_k} = x^k_{i_k} + \alpha_k \left( -\nabla_{i_k} f(x^k) \right) \quad (6)

where R_\alpha is a random number generator following some distribution with parameter \alpha.

Random Coordinate Descent Algorithm: The Basic Idea of Random Coordinate Descent. The random counter R_\alpha. For \alpha \in \mathbb{R}, R_\alpha is a special random counter that generates an integer index i \in \{1, \dots, n\} with probability

p_\alpha^{(i)} = L_i^\alpha \left[ \sum_{j=1}^{n} L_j^\alpha \right]^{-1} \quad (7)

where L_i is the Lipschitz constant of the partial gradient \nabla_i f(x). As a special case, when \alpha = 0, R_0 is a uniform random number generator.
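
A minimal Python sketch of the resulting method (not part of the original slides), combining the sampling rule (7) with the coordinate step (6). The step size 1/L_{i_k} follows the standard choice analyzed in [Nes12]; the partial-gradient oracle and the toy usage lines are illustrative assumptions.

```python
import numpy as np

def random_coordinate_descent(partial_grad, L, x0, alpha=0.0, iters=1000, seed=0):
    """Sketch of random coordinate descent with index sampling p_i proportional to L_i**alpha.

    partial_grad : callable (x, i) -> df/dx_i
    L            : array of coordinate-wise Lipschitz constants L_i
    alpha        : sampling parameter; alpha = 0 gives uniform sampling
    """
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    p = L**alpha / np.sum(L**alpha)          # probabilities p_alpha^{(i)} from (7)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = rng.choice(L.size, p=p)          # i_k = R_alpha
        x[i] -= partial_grad(x, i) / L[i]    # coordinate step with step size 1/L_{i_k}
    return x

# Illustrative use on f(x) = 1/2 * sum_i L_i x_i^2, where df/dx_i = L_i x_i.
L = np.array([1.0, 2.0, 5.0])
x_star = random_coordinate_descent(lambda x, i: L[i] * x[i], L, x0=np.ones(3), alpha=1.0)
```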

Random Coordinate Descent Algorithm: Convergence Analysis. Does it converge? Denoting the random variable \xi_k = \{i_0, i_1, \dots, i_k\}, we have the following convergence theorem for the random coordinate descent method under the improved assumption.

Theorem. For any k \ge 0 we have

E_{\xi_k} f(x^k) - f^* \le \frac{2}{k+4} \left[ \sum_{j=1}^{n} L_j^\alpha \right] R_{1-\alpha}^2(x_0) \quad (8)

where R_\beta(x_0) = \max_x \left\{ \max_{x^* \in X^*} \|x - x^*\|_\beta : f(x) \le f(x_0) \right\}.

Random Coordinate Descent Algorithm: Convergence Analysis. How fast does it converge? First consider \alpha = 0. We then have

E_{\xi_k} f(x^k) - f^* \le \frac{2n}{k+4} R_1^2(x_0) \quad (9)

while the worst-case convergence rate of the full-gradient descent method is

f(x^k) - f^* \le \frac{n}{k} R_1^2(x_0) \quad (10)

So the convergence rate of random coordinate descent is proportional to that of the full-gradient method, even though each coordinate iteration is far cheaper.

Random Coordinate Descent Algorithm: Convergence Analysis. How fast does it converge? (continued) Next consider \alpha = 1/2 and denote

D(x_0) = \max_x \left\{ \max_{y \in X^*} \max_{1 \le i \le N} |x^{(i)} - y^{(i)}| : f(x) \le f(x_0) \right\} \quad (11)

Under this notation we have

E_{\xi_k} f(x^k) - f^* \le \frac{2}{k+4} \left[ \sum_{j=1}^{n} L_j^{1/2} \right]^2 D^2(x_0) \quad (12)

However, there exist cases where the random coordinate descent method works while the full-gradient method does not, in the sense that \sum_{j=1}^{n} L_j^{1/2} can be bounded when L = \max_i \{L_i\} cannot.

Random Coordinate Descent Algorithm: Convergence Analysis. How fast does it converge? (continued) Finally consider \alpha = 1. We then have

E_{\xi_k} f(x^k) - f^* \le \frac{2}{k+4} \left[ \sum_{j=1}^{n} L_j \right] R_0^2(x_0) \quad (13)

while the worst-case convergence rate of the full-gradient descent method is

f(x^k) - f^* \le \frac{\max_i \{L_i\}}{k} R_1^2(x_0) \quad (14)

Still, the convergence rate of random coordinate descent is proportional to that of the full-gradient method.

Random Coordinate Descent Algorithm: Convergence Analysis. Summary of the random coordinate descent method. To sum up, the random coordinate descent method has a very cheap iteration cost, adopts random coordinate selection to ensure convergence, and, surprisingly enough, its convergence rate is almost the same as the worst-case convergence rate of the full-gradient descent method.

Summary. Motivation: the full-gradient method is not well suited to solving huge-scale problems, and the conventional coordinate descent method lacks theoretical justification. Random coordinate descent: it adopts random coordinate index selection; its convergence rate is almost the same as the worst-case convergence rate of the full-gradient method; and the computational cost of each iteration is much cheaper!

Summary. Future topics. In the future, we will discuss: the convergence properties when f is strongly convex; minimizing f(x) with constraints; and accelerating the random coordinate descent method to O(1/k^2).

Summary Thanks!

References.
[Ber99] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[Nes12] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[RT14] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.