Nesterov s Optimal Gradient Methods

Similar documents
Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Support Vector Machines and Kernel Methods

Accelerated Training of Max-Margin Markov Networks with Kernels

Convex Optimization Lecture 16

Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725

Conditional Gradient (Frank-Wolfe) Method

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

GRADIENT = STEEPEST DESCENT

WHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????

Basis Expansion and Nonlinear SVM. Kai Yu

Coordinate Descent and Ascent Methods

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Newton s Method. Javier Peña Convex Optimization /36-725

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

RegML 2018 Class 2 Tikhonov regularization and kernels

Convex Optimization Algorithms for Machine Learning in 10 Slides

Oslo Class 2 Tikhonov regularization and kernels

Accelerating Stochastic Optimization

Stochastic and online algorithms

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

The Frank-Wolfe Algorithm:

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

(Kernels +) Support Vector Machines

Accelerated Proximal Gradient Methods for Convex Optimization

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Warm up. Regrade requests submitted directly in Gradescope, do not instructors.

6. Proximal gradient method

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

You should be able to...

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization

Stochastic Subgradient Method

Constrained Optimization and Lagrangian Duality

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Optimization methods

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Advanced Machine Learning

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017

10-725/36-725: Convex Optimization Prerequisite Topics

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Neural Networks and Deep Learning

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Big Data Analytics: Optimization and Randomization

1 Sparsity and l 1 relaxation

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Optimization and Optimal Control in Banach Spaces

Lecture 1: Background on Convex Analysis

CS60021: Scalable Data Mining. Large Scale Machine Learning

Nesterov s Acceleration

Optimization for Machine Learning

CIS 520: Machine Learning Oct 09, Kernel Methods

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

An Optimal Affine Invariant Smooth Minimization Algorithm.

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Exponentiated Gradient Descent

ICS-E4030 Kernel Methods in Machine Learning

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

CSC 411 Lecture 17: Support Vector Machine

Math 273a: Optimization Convex Conjugacy

Unconstrained minimization of smooth functions

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

Computational and Statistical Learning Theory

Lasso: Algorithms and Extensions

Stochastic Optimization Algorithms Beyond SG

STA141C: Big Data & High Performance Statistical Computing

Machine Learning for NLP

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37

Lecture 23: November 19

For any y 2C, any sub-gradient, v, of h at prox(x), i.e., v and by optimality of prox(x) in (3), we have

Primal/Dual Decomposition Methods

ECS289: Scalable Machine Learning

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Solutions Chapter 5. The problem of finding the minimum distance from the origin to a line is written as. min 1 2 kxk2. subject to Ax = b.

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Lecture 5: September 12

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Statistical Machine Learning from Data

Support Vector Machines for Classification: A Statistical Portrait

6. Proximal gradient method

Support Vector Machines, Kernel SVM

Machine Learning Lecture 5

Discriminative Models

Math 273a: Optimization Subgradients of convex functions

COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Support Vector Machines for Classification and Regression

Max Margin-Classifier

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

Transcription:

Yurii Nesterov http://www.core.ucl.ac.be/~nesterov Nesterov s Optimal Gradient Methods Xinhua Zhang Australian National University NICTA 1

Outline The problem from machine learning perspective Preliminaries Convex analysis and gradient descent Nesterov soptimal gradient method Lower bound of optimization Optimal gradient method Utilizing structure: composite optimization Smooth minimization Excessive gap minimization Conclusion 2

Outline The problem from machine learning perspective Preliminaries Convex analysis and gradient descent Nesterov soptimal gradient method Lower bound of optimization Optimal gradient method Utilizing structure: composite optimization Smooth minimization Excessive gap minimization Conclusion 3

The problem Many machine learning problems have the form where min w w: weight vector fx i ;y i g n i=1 : training data J(w) := (w)+r emp(w) R emp (w) := 1 n nx l(x i ; y i ; w) i=1 l(x;y;w) : convex and non-negative loss function Can be non-smooth, possibly non-convex. (w) : convex and non-negative regularizer 4

min w s.t. The problem: Examples 2 kwk2 + 1 n nx i=1» i» i 1 y i hw; x i i 8 1 i n» i 0 8 1 i n» i =max f0; 1 y i hw; x i ig 2 kwk2 + 1 n nx max f0; 1 y i hw; x i ig i=1 max f0; 1 xg hinge loss x Model (obj) (w) + R emp (w) linear SVMs 2 kwk2 2 + 1 P n n i=1 max f0; 1 y i hw; x i ig `1 logistic regression kwk 1 + 1 P n n i=1 log(1+exp( y ihw;x i i)) ²-insensitive classify 2 kwk2 2 + 1 P n n i=1 max f0; jy i hw; x i ij ²g kw 1 k 1 = P i jw ij 5

The problem: More examples Lasso Multi-task learning argmin w argmin w argmin w kwk 1 + kaw bk 2 2 TX kw k tr + kx t w t b t k 2 2 kw k 1;1 + t=1 TX kx t w t b t k 2 2 t=1 Matrix game Entropy regularized LPBoost argmin w2 d hc; wi +max u2 n fhaw; ui + hb; uig argmin w2 d (w; w 0 ) + max u2 n haw; ui 6

The problem: Lagrange dual 7 Binary SVM Entropy regularized LPBoost min 1 Q X i s.t. i 2 [0;n 1 ] P y i i =0 ln X wd 0 exp d i s.t. i 2 [0; 1] P i =1 i à i where à n!! X 1 A i;d i i=1 Q ij = y i y j hx i ; x j i

The problem 8 Summary where min J(w) w2q J is convex, but might be non-smooth Q is a (simple) convex set J might have composite form Solver: iterative method Want ² k := J(w k ) J(w ) to decrease to 0 quickly where w :=argmin w2q J(w). w 0 ; w 1 ; w 2 ;::: We only discuss optimization in this session, no generalization bound.

The problem: What makes a good optimizer? Find an ² -approximate solution Desirable: J(w k ) min w k as small as possible (take as few steps as possible) ² k J(w)+² Error decays by 1=k 2, 1=k, or e k. Each iteration costs reasonable amount of work Depends on n, and other condition parameters leniently General purpose, parallelizable (low sequential processing) Quit when done (measurable convergence criteria) w k 9

The problem: Rate of convergence Convergence rate: lim k!1 ² k+1 ² k = 8 < : 0 superlinear rate ² k = e ek 2 (0; 1) linear rate ² k = e k 1 sublinear rate ² k = 1 k Use interchangeably: Fix step index k, upper bound min 1 t k ² t Fix precision, how many steps needed for E.g. ² min ² t < ² 1 t k 1 1 ² ; 2 ² ; p² 1 ; log 1 ² ; loglog 1 ² 10

The problem: Collection of results Convergence rate: Objective function Gradient descent Nesterov Lower bound Composite non-smooth Smooth µ 1 O ² ³q 1 O ² ³q O 1 ² Smooth and very convex µ O log 1 µ ² O log 1 µ ² O log 1 ² 11 Smooth + (dual of smooth) O µ 1 ² (very convex) + (dual of smooth) ³q O 1 ²

Outline The problem from machine learning perspective Preliminaries Convex analysis and gradient descent Nesterov soptimal gradient method Lower bound of optimization Optimal gradient method Utilizing structure: composite optimization Smooth minimization Excessive gap minimization Conclusion 12

Preliminaries: convex analysis Convex functions A function f is convex iff 8 x; y; 2 (0; 1) f( x +(1 )y) f(x)+(1 )f(y) 13

Preliminaries: convex analysis Convex functions A function f is convex iff 8 x; y; 2 (0; 1) f( x +(1 )y) f(x)+(1 )f(y) 14

Preliminaries: convex analysis Strong convexity ¾ A function f is called -strongly convex wrt a norm iff k k f(x) 1 2 ¾ kxk2 is convex 8 x; y; 2 (0; 1) f( x +(1 )y) f(x)+(1 )f(y) ¾ (1 ) 2 kx yk 2 15

Preliminaries: convex analysis Strong convexity First order equivalent condition f(y) f(x)+hrf(x); y xi + ¾ 2 kx yk2 8 x; y 16

Preliminaries: convex analysis Strong convexity First order equivalent condition f(y) f(x)+hrf(x); y xi + ¾ 2 kx yk2 8 x; y 17

Preliminaries: convex analysis Strong convexity Second order r 2 f(x)y; y ¾ kyk 2 8 x; y If k k Euclidean norm, then r 2 f(x) º ¾I Lower bounds rate of change of gradient 18

Preliminaries: convex analysis Lipschitz continuous gradient Lipschitzcontinuity Stronger than continuity, weaker than differentiability Upper bounds rate of change 9L >0 jf(x) f(y)j L kx yk 8 x; y 19

Preliminaries: convex analysis Lipschitz continuous gradient Gradient is Lipschitzcontinuous (must be differentiable) krf(x) rf(y)k L kx yk f(y) f(x)+hrf(x); y xi + L 2 kx yk2 8 x; y 8 x; y L L-l.c.g ¾ 20

Preliminaries: convex analysis Lipschitz continuous gradient Gradient is Lipschitzcontinuous (must be differentiable) krf(x) rf(y)k L kx yk 8 x; y f(y) f(x)+hrf(x); y xi + L 2 kx yk2 8 x; y r 2 f(x)y; y L kyk 2 8 x; y r 2 f(x) ¹ LI if L 2 norm 21

Preliminaries: convex analysis Fenchel Dual Fencheldual of a function f f? (s) =sup x hs; xi f(x) Properties f ¾ strongly convex f?? = f if f is convex and closed f? 1 ¾ -l.c.g on Rd L-l.c.g on R d 1 L strongly convex 22

Preliminaries: convex analysis Fenchel Dual Fencheldual of a function f f? (s) =sup x hs; xi f(x) s = rf(x) s 2 @f(x) slope = s f? (s) f(x) 23

Preliminaries: convex analysis: Subgradient Generalize gradient to non-differentiable functions Idea: tangent plane lying below the graph of f 24

Preliminaries: convex analysis: Subgradient Generalize gradient to non-differentiable functions ¹ is called a subgradientof f at x if f(x 0 ) f(x)+hx 0 x;¹i 8x 0 All such ¹ comprise the subdifferential of f at x: @f(x) 25

Preliminaries: convex analysis: Subgradient Generalize gradient to non-differentiable functions is called a subgradientof f at x if ¹ f(x 0 ) f(x)+hx 0 x;¹i 8x 0 All such ¹ comprise the subdifferential of f at x: Unique if f is differentiable at x @f(x) 26

Preliminaries: optimization: Gradient descent Gradient descent x k+1 = x k s k rf(x k ) s k 0 Suppose f is both ¾-strongly convex and L-l.c.g. ² k := f(x k ) f(w ) ² k ³ 1 ¾ L k ²0 Key idea Norm of gradient upper bounds how far away from optimal Lower bounds how much progress one can make 27

Preliminaries: optimization: Gradient descent Upper bound distance from optimal x k+1 = x k 1 L rf(x k) rf shaded area triangle area slope = ¾ x x k rf(x k ) So f(x k ) f(x ) 1 2¾ krf(x k)k 2 f(x k ) f(x ) 1 2¾ krf(x k)k 2 28

Preliminaries: optimization: Gradient descent Lower bound progress at each step x k+1 = x k 1 L rf(x k) rf shaded area triangle area slope = L rf(x k ) So f(x k ) f(x k+1 ) 1 2L krf(x k)k 2 rf(x k ) x k+1 f(x L k ) f(x k+1 ) 1 2L krf(x k)k 2 x k 29

Preliminaries: optimization: Gradient descent Putting things together distance to optimal progress 2¾(f(x k ) f(x )) krf(x k )k 2 2L(f(x k ) f(x k+1 )) f(x k+1 ) f(x ) 1 ¾ L (f(xk ) f(x )) 30

Preliminaries: optimization: Gradient descent Putting things together distance to optimal progress 2¾(f(x k ) f(x )) krf(x k )k 2 2L(f(x k ) f(x k+1 )) f(x k+1 ) f(x ) {z } ² k+1 1 ¾ L (f(xk ) f(x )) {z } ² k 31

Preliminaries: optimization: Gradient descent Putting things together distance to optimal progress 2¾(f(x k ) f(x )) krf(x k )k 2 2L(f(x k ) f(x k+1 )) f(x k+1 ) f(x ) {z } ² k+1 1 ¾ L (f(xk ) f(x )) {z } ² k What if ¾ =0? What if there is constraint? 32

Preliminaries: optimization: Projected Gradient descent If objective function is L-l.c.g., but not strongly convex Constrained to convex set Projected gradient descent x k+1 = Q µx k 1 L rf(x k) =argmin x2q Rate of convergence: µ L O ² ³q L Compare with Newton O, interior point Q =argmin ^x2q ^x (x k 1 L rf(x k)) f(x k )+hrf(x k ); x x k i + L 2 kx x kk 2 ² O log 1 ² 33

Preliminaries: optimization: Projected Gradient descent Projected gradient descent x k+1 = Q µx k 1 L rf(x k) =argmin x2q =argmin ^x2q f(x k )+hrf(x k ); x x k i + L 2 kx x kk 2 Property 1: monotonic decreasing ^x (x k 1 L rf(x k)) f(x k+1 ) f(x k )+hrf(x k ); x k+1 x k i + L 2 kx k+1 x k k 2 L-l.c.g. f(x k )+hrf(x k ); x k x k i + L 2 kx k x k k 2 = f(x k ) Def x k+1 projection 34

Preliminaries: optimization: Projected Gradient descent Property 2: x x k+1 ; (x k 1L À 8 x 2 Q rf(x k)) x k+1 0 L-l.c.g. f(x k+1 ) f(x k )+hrf(x k ); x k+1 x k i + L 2 kx k+1 x k k 2 x k 1 L rf(x k) x k+1 Q x Property 2 f(x k )+hrf(x k ); x x k i + L 2 kx x kk 2 L 2 kx x k+1k 2 8 x 2 Q Convexity of f f(x)+ L 2 kx x kk 2 L 2 kx x k+1k 2 8 x 2 Q 35

Preliminaries: optimization: Projected Gradient descent Put together f(x k+1 ) f(x)+ L 2 kx x kk 2 L 2 kx x k+1k 2 8 x 2 Q Let x = x : 0 L 2 kx x k+1 k 2 ² k+1 + L 2 kx x k k 2 ::: k+1 X ² i + L 2 kx x 0 k 2 i=1 (k+1)² k+1 + L 2 kx x 0 k 2 ( monotonic decreasing) ² k 36 ² k+1 L 2(k +1) kx x 0 k 2

Preliminaries: optimization: Subgradient method Objective is continuous but not differentiable Subgradientmethod for min f(x) x2q x k+1 = Q (x k s k rf(x k )) 37 where Rate of convergence Summary O µ 1 ² 2 rf(x k ) 2 @f(x k ) µ 1 O ² 2 O µ L ² (arbitrary subgradient) ln 1 ² ln(1 ¾ L ) non-smooth L-l.c.g. L-l.c.g. & ¾-strongly convex

Outline The problem from machine learning perspective Preliminaries Convex analysis and gradient descent Nesterov soptimal gradient method Lower bound of optimization Optimal gradient method Utilizing structure: composite optimization Smooth minimization Excessive gap minimization Conclusion 38

Optimal gradient method Lower bound Consider the set of L-l.c.g. functions For any ² > 0, there exists an L-l.c.g. function f, such that any first-order method takes at least ³q k = O steps to ensure ² k < ². First-order method means L ² x k 2 x 0 +span frf(x 0 );:::;rf(x k 1 )g Not saying: there exists an L-l.c.g. function f, such that for all any first-order method takes at least k =O( p ² > 0 L=²) steps to ensure ² k < ². Gap: recall the upper bound O L of GD, two possibilities. ² 39

Optimal gradient method: Primitive Nesterov Problem under consideration where f is L-l.c.g., Big results is convex He proposed an algorithm attaining Not for free: require an oracle to project a point onto in sense L 2 min w f(w) w 2 Q Q p L=" Q 40

Primitive Nesterov Construct quadratic functions 1 Á k (x) and k > 0 Á k (x)=á k + k 2 kx v k k 2 f(x k ) 2 3 4 9x k ;s:t: f(x k ) Á k Á k (x) (1 k)f(x)+ ká 0 (x) k! 0 v k x k 2 1 f(x k ) Á k Á k (x ) 3 (1 k)f(x )+ ká 0 (x ) 41 f(x k ) f(x ) k(á 0 (x ) f(x ))! 0

Primitive Nesterov Construct quadratic functions 1 Á k (x) and k > 0 Á k (x) =Á k + k 2 kx v k k 2 f(x k ) 2 3 4 9x k ;s:t: f(x k ) Á k Á k (x) (1 k)f(x)+ ká 0 (x) k! 0 v k x k 2 1 f(x k ) Á k Á k (x ) 3 (1 k)f(x )+ ká 0 (x ) 42 f(x k ) f(x ) k(á 0 (x ) f(x ))! 0

Primitive Nesterov: Rate of convergence 1 2 3 4 Á k (x) =Á k + k 2 kx v k k 2 9x k ;s:t: f(x k ) Á k Á k (x) (1 k)f(x)+ ká 0 (x) k! 0 Nesterov constructed, in a highly non-trivial way, the Á k (x) and k, s.t. has closed form (grad desc) x k k 4L (2 p L+k p 0 ) 2 f(x k ) f(x ) k(á 0 (x ) f(x )) Furthermore, if f is ¾-strongly convex, then Rate of convergence sheerly depends on k k 1 ¾ L k 43

Primitive Nesterov: Dealing with constraints x k has closed form by gradient descent When constrained to set Q, modify by New gradient: x k+1 = x k rf(x k ) x Q k+1 = Q (x k rf(x k ))=argmin x2q kx (x k rf(x k ))k g Q k := 1 ³x k x Q k+1 gradient mapping 44 This new gradient keeps all important properties of gradient, also keeping the rate of convergence

Primitive Nesterov: Gradient mapping x k has closed form by gradient descent When constrained to set Q, modify by New gradient: x k+1 = x k rf(x k ) x Q k+1 = Q (x k rf(x k ))=argmin x2q kx (x k rf(x k ))k Expensive? g Q k := 1 ³x k x Q k+1 gradient mapping 45 This new gradient keeps all important properties of gradient, also keeping the rate of convergence

Primitive Nesterov Summary where f is L-l.c.g., min w f(w) w 2 Q Q Rate of convergence is convex. r L ² if no strong convexity ln 1 ² ln(1 ¾ L ) if ¾-strongly convexity 46

Primitive Nesterov: Example 1 min x;y 2 x2 +2y 2 ¹=1; L=4 min x 0;y 0 1 (x + y)2 2 ¹ =0; L=2 Nesterov Grad desc Nesterov Grad desc 47

Extension: Non-Euclidean norm Remember strong convexity and l.c.g. are wrt some norm We have implicitly used Euclidean norm (L 2 norm) Some functions are strongly convex wrt other norms P i x i ln x i Not l.c.g. wrt L 2 norm Negative entropy is l.c.g. wrt L 1 norm kxk 1 = P i x i strongly convex wrt L 1 norm. Can Nesterov s approach be extended to non-euclidean norm? 48

Extension: Non-Euclidean norm Remember strong convexity and l.c.g. are wrt some norm We have implicitly used Euclidean norm (L 2 norm) Some functions are l.c.g. wrt other norms P i x i ln x i Not l.c.g. wrt L 2 norm Negative entropy is l.c.g. wrt L 1 norm kxk 1 = P i x i strongly convex wrt L 1 norm. Can Nesterov s approach be extended to non-euclidean norm? 49

Extension: Non-Euclidean norm Suppose the objective function f is l.c.g. wrt k k. Use a prox-function d on Q which is ¾ -strongly convex wrt k k, and min d(x) =0 D:= max d(x) x2q x2q Algorithm 1:Nesterovs algorithm for non-euclidean norm 1 2 3 4 5 6 Output: Asequence y kª converging to the optimal at O(1=k 2 )rate. Initialize: Set x 0 toarandom value in Q. for k =0;1;2;::: do Query the gradient of f at point x k : rf(x k ). Find y k à argmin x2q rf(x k );x x k + 1 2 L x x k 2. Find z k L à argmin x2q ¾ d(x)+p k i=0 i+1 2 rf(x i ); x x i. Update x k+1 à 2 k+3 zk + k+1 k+3 yk. I won t mention details 50

Extension: Non-Euclidean norm Suppose the objective function f is l.c.g. wrt k k. Use a prox-function d on Q which is ¾ -strongly convex wrt k k, and min d(x) =0 D:= max d(x) x2q x2q Algorithm 1:Nesterovs algorithm for non-euclidean norm 1 2 3 4 5 6 Output: Asequence y kª converging to the optimal at O(1=k 2 )rate. Initialize: Set x 0 toarandom value in Q. for k =0;1;2;::: do Query the gradient of f at point x k : rf(x k ). Find y k à argmin x2q rf(x k );x x k + 1 2 L x x k 2. Find z k L à argmin x2q ¾ d(x)+p k i=0 i+1 2 rf(x i ); x x i. Update x k+1 à 2 k+3 zk + k+1 k+3 yk. I won t mention details 51

Extension: Non-Euclidean norm Rate of convergence f(y k ) f(x ) 4Ld(x ) ¾(k +1)(k +2) Applications will be given later. 52

Immediate application: Non-smooth functions Objective function not differentiable Suppose it is the Fencheldual of some function f min x f? (x) where f is de ned on Q Idea: smooth the non-smooth function. Add a small ¾-strongly convex function d to f f + d is ¾-strongly convex (f + d)? is 1 ¾ -l.c.g 53

Immediate application: Non-smooth functions (f + ²d)? (x) approximates f? (x) If 0 d(u) D for u 2 Q then f? (x) ²D (f + ²d)? (x) f? (x) max u Proof hu; xi f(u) ²D max hu; xi f(u) ²d(u) max u u hu; xi f(u) 0 = f? (x) ²D = = (f + ²d)? (x) f? (x) 54

Immediate application: Non-smooth functions (f + ²d)? (x) approximates f? (x) well If d(u)2[0;d]on Q, then Algorithm (given precision ) Fix Optimize (l.c.g. function) to precision Rate of convergence ² ^² = ² 2D (f +^²d)? (x) ²=2 r 1 ² L = r 1 ² 1 ^²¾ = (f + ²d)? (x) f? (x) 2 [ ²D; 0] r 2D ¾² 2 = 1 r 2D ² ¾ 55

Outline The problem from machine learning perspective Preliminaries Convex analysis and gradient descent Nesterov soptimal gradient method Lower bound of optimization Optimal gradient method Utilizing structure: composite optimization Smooth minimization Excessive gap minimization Conclusion 56

Composite optimization Many applications have objectives in the form of where J(w) =f(w)+g? (Aw) f is convex on the region E 1 with norm k k 1 g is convex on the region E 2 with norm k k 2 Very useful in machine learning Aw corresponds to linear model 57

Composite optimization Example: binary SVM 1 nx J(w)= 2 kwk2 +min [1 y i (hx i ; wi + b)] b2r {z } n + i=1 {z } f(w) g? (Aw) A= (y 1 x 1 ;:::;y n x n ) > g? is the dual of g( ) = P i i over Q 2 = 2[0;n 1 ] n : P i y i i =0 ª 58

Composite optimization 1: Smooth minimization J(w) =f(w)+g? (Aw) Let us only assume that f is M-l.c.g wrt k k 1 Smooth into g? (g + ¹d 2 )? (d 2 is ¾ 2 -strongly convex wrt k k 2 ) then is J ¹ (w) =f(w)+(g+¹d 2 )? (Aw) ³ M + 1 ¹¾ 2 kak 2 1;2 -l.c.g Apply Nesterov on J ¹ (w) 59

Composite optimization 1: Smooth minimization Rate of convergence to find an ² accurate solution, it costs steps. 4 kak 1;2 r D1 D 2 ¾ 1 ¾ 2 1 ² + r MD1 ¾ 1 ² d 1 is ¾ 1 -strongly convex wrt k k 1 d 2 is ¾ 2 -strongly convex wrt k k 2 D 1 :=max w2e 1 d 1 (w) D 2 :=max 2E 2 d 2 ( ) 60

Composite optimization 1: Smooth minimization Example: matrix game argmin w2 n hc; wi {z } f(w) Use Euclidean distance +max 2 m fhaw; i + hb; ig {z } g? (Aw) E 1 = n kwk 1 = Pi w2 i 1=2 d 1 (w) = 1 2 Pi (w i n 1 ) 2 ¾ 1 =¾ 2 =1 E 2 = m k k 2 = Pi 2 i 1=2 d 2 ( ) = 1 2 Pi ( i m 1 ) 2 D 1 <1;D 2 <1 kak 2 1;2 = 1=2 max(a > A) f(w k ) f(w ) 4 1=2 max(a > A) k +1 May scale with O(nm) 61

Composite optimization 1: Smooth minimization Example: matrix game E 1 = n E 2 = m argmin w2 n hc; wi {z } f(w) Use Entropy distance kwk 1 = P i jw ij k k 2 = P i j ij +max 2 m fhaw; i + hb; ig {z } g? (Aw) d 1 (w)=ln n + P i w i ln w i d 2 ( ) =ln m + P i i ln i ¾ 1 = ¾ 2 =1 D 1 =ln n D2 =ln m kak 1;2 =max i;j ja i;j j f(w k ) f(w ) 4(ln n ln m) 1 2 k +1 max i;j ka i;j k 62

Composite optimization 1: Smooth minimization Disadvantages: Fix the smoothing beforehand using prescribed accuracy ² No convergence criteria because real min is unknown. 63

Composite optimization 2: Excessive gap minimization Primal-dual Easily upper bounds the duality gap Idea Assume objective function takes the form J(w) =f(w)+g? (Aw) Utilizes the adjoint form Relations: D( )= g( ) f? ( A > ) J(w) D( ) 8w; and inf w2e 1 J(w) = sup 2E 2 D( ) 64

Composite optimization 2: Excessive gap minimization Sketch of idea L f L g Assume f is -l.c.g. and g is -l.c.g. Smooth both f? and g? by prox-functions d 1 ;d 2 J ¹2 (w) =f(w)+(g+¹ 2 d 2 )? (Aw) D ¹1 ( )= g( ) (f+¹ 1 d 1 )? ( A > ) 65

Composite optimization 2: Excessive gap minimization Sketch of idea Maintain two point sequences fw k g and and two regularization sequences s.t. f k g and f¹ 1 (k)g f¹ 2 (k)g ¹ 1 (k)! 0 J ¹2 (k)(w k ) D ¹1 (k)( k ) ¹ 2 (k)! 0 66

Composite optimization 2: Excessive gap minimization J ¹2 (k)(w k ) D ¹1 (k)( k ) 67 Challenge: How to efficiently find the initial point,,, that satisfy excessive gap condition. Given,,,, with new and how to efficiently find w k+1 and k+1. How to anneal ¹ 1 (k) and ¹ 2 (k) (otherwise one step done). Solution Gradient mapping Bregman projection (very cool) w 1 1 ¹ 1 (1) ¹ 2 (1) w k k ¹ 1 (k) ¹ 2 (k) ¹ 1 (k +1) ¹ 2 (k +1)

Composite optimization 2: Excessive gap minimization Rate of convergence: J(w k ) D( k ) 4 kak 1;2 k +1 r D1 D 2 ¾ 1 ¾ 2 f is ¾-strongly convex No need to add prox-function to f, ¹ 1 (k) 0 J(w k ) D( k ) 4D 2 ¾ 2 k(k+1) Ã kak 2 1;2 ¾ + L g! 68

Composite optimization 2: Excessive gap minimization Example: binary SVM 1 nx J(w)= 2 kwk2 +min [1 y i (hx i ; wi + b)] b2r {z } n + i=1 {z } f(w) g? (Aw) 69 A= (y 1 x 1 ;:::;y n x n ) > g? is the dual of g( ) = P i i over Adjoint form E 2 = 2 [0;n 1 ] n : P i y i i =0 ª D( ) = X i 1 2 > AA > i

Composite optimization 2: Convergence rate for SVM Theorem: running on SVM for k iterations J(w k ) D( k ) 2L (k +1)(k +2)n L = 1 kak 2 = 1 k(y 1 x 1 ;:::;y n x n )k 2 nr2 Final conclusion µ R J(w k ) D( k ) " as long as k>o p 11111 1=" (kx i k R) 70

Composite optimization 2: Projection for SVM Efficient O(n) time projection onto E 2 = Projection leads to a singly linear constrained QP min ( nx ( i m i ) 2 i=1 s:t: l i i u i 8i 2 [n]; nx ¾ i i = z: i=1 2 [0;n 1 ] n : X i y i i =0 ) Key tool: Median nding takes O(n) time 71

Automatic estimation of Lipschitz constant Automatic estimation of Lipschitzconstant Geometric scaling Does not affect the rate of convergence L 72

Conclusion Nesterov smethod attains the lower bound O L for L-l.c.g. objectives ² Linear rate for l.c.g. and strongly convex objectives Composite optimization Attains the rate of the nice part of the function Handling constraints Gradient mapping and Bregmanprojection Essentially does not change the convergence rate Expecting wide applications in machine learning Note: not in terms of generalization performance 73

74