Nesterov's Optimal Gradient Methods

Yurii Nesterov, http://www.core.ucl.ac.be/~nesterov
Xinhua Zhang, Australian National University, NICTA

Outline
- The problem from a machine learning perspective
- Preliminaries
  - Convex analysis and gradient descent
- Nesterov's optimal gradient method
  - Lower bound of optimization
  - Optimal gradient method
- Utilizing structure: composite optimization
  - Smooth minimization
  - Excessive gap minimization
- Conclusion

The problem
Many machine learning problems have the form
$$\min_w J(w) := \lambda \Omega(w) + R_{\mathrm{emp}}(w), \qquad R_{\mathrm{emp}}(w) := \frac{1}{n} \sum_{i=1}^{n} l(x_i, y_i, w),$$
where
- $w$: weight vector
- $\{x_i, y_i\}_{i=1}^{n}$: training data
- $l(x, y, w)$: convex and non-negative loss function. Can be non-smooth, possibly non-convex.
- $\Omega(w)$: convex and non-negative regularizer

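The problem: Examples
Linear SVM with slack variables:
$$\min_{w,\xi} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \xi_i \ge 1 - y_i \langle w, x_i\rangle, \;\; \xi_i \ge 0 \quad \forall\, 1 \le i \le n$$
At the optimum $\xi_i = \max\{0, 1 - y_i \langle w, x_i\rangle\}$, which gives the unconstrained form
$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \max\{0, 1 - y_i \langle w, x_i\rangle\}$$
(Figure: the hinge loss $\max\{0, 1 - x\}$ plotted against $x$.)

Model                    Objective $\lambda\Omega(w) + R_{\mathrm{emp}}(w)$
linear SVMs              $\frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \max\{0, 1 - y_i\langle w, x_i\rangle\}$
$\ell_1$ logistic regr.  $\lambda\|w\|_1 + \frac{1}{n}\sum_{i=1}^{n} \log(1 + \exp(-y_i\langle w, x_i\rangle))$
$\epsilon$-insensitive   $\frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \max\{0, |y_i - \langle w, x_i\rangle| - \epsilon\}$

Here $\|w\|_1 = \sum_i |w_i|$.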
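The objectives in the table are direct to evaluate. A minimal NumPy sketch of two of them follows; the function names and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def hinge_objective(w, X, y, lam):
    """J(w) = lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.mean(np.maximum(0.0, 1.0 - margins))

def l1_logistic_objective(w, X, y, lam):
    """J(w) = lam * ||w||_1 + (1/n) * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    # logaddexp(0, -m) equals log(1 + exp(-m)) and is numerically stable
    return lam * np.sum(np.abs(w)) + np.mean(np.logaddexp(0.0, -margins))

# Toy data (hypothetical): three points in R^2 with labels in {-1, +1}
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.zeros(2)
J_svm = hinge_objective(w, X, y, lam=0.1)        # every hinge term is 1 at w = 0
J_log = l1_logistic_objective(w, X, y, lam=0.1)  # every logistic term is log 2 at w = 0
```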

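The problem: More examples
- Lasso: $\arg\min_w \; \lambda\|w\|_1 + \|Aw - b\|_2^2$
- Multi-task learning:
  $\arg\min_W \; \lambda\|W\|_{\mathrm{tr}} + \sum_{t=1}^{T} \|X_t w_t - b_t\|_2^2$
  $\arg\min_W \; \lambda\|W\|_{1,\infty} + \sum_{t=1}^{T} \|X_t w_t - b_t\|_2^2$
- Matrix game: $\arg\min_{w \in \Delta_d} \; \langle c, w\rangle + \max_{u \in \Delta_n} \{\langle Aw, u\rangle + \langle b, u\rangle\}$
- Entropy regularized LPBoost: $\arg\min_{w \in \Delta_d} \; \lambda\Delta(w, w^0) + \max_{u \in \Delta_n} \langle Aw, u\rangle$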

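The problem: Lagrange dual
- Binary SVM:
$$\min_\alpha \; \frac{1}{2\lambda}\alpha^\top Q \alpha - \sum_i \alpha_i \quad \text{s.t.} \quad \alpha_i \in [0, n^{-1}], \;\; \sum_i y_i \alpha_i = 0, \qquad \text{where } Q_{ij} = y_i y_j \langle x_i, x_j\rangle$$
- Entropy regularized LPBoost:
$$\min_\alpha \; \lambda \ln \sum_d w_d^0 \exp\Big(\lambda^{-1} \sum_{i=1}^{n} A_{i,d}\,\alpha_i\Big) \quad \text{s.t.} \quad \alpha_i \in [0, 1], \;\; \sum_i \alpha_i = 1$$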

The problem: Summary
$$\min_{w \in Q} J(w)$$
where
- $J$ is convex, but might be non-smooth
- $Q$ is a (simple) convex set
- $J$ might have composite form
Solver: iterative method producing $w_0, w_1, w_2, \ldots$
Want $\epsilon_k := J(w_k) - J(w^*)$ to decrease to 0 quickly, where $w^* := \arg\min_{w \in Q} J(w)$.
We only discuss optimization in this session, no generalization bound.

The problem: What makes a good optimizer?
- Find an $\epsilon$-approximate solution $w_k$: $J(w_k) \le \min_w J(w) + \epsilon$
- Desirable:
  - $k$ as small as possible (take as few steps as possible): error $\epsilon_k$ decays by $1/k^2$, $1/k$, or $e^{-k}$
  - Each iteration costs a reasonable amount of work
  - Depends leniently on $n$ and other condition parameters
  - General purpose, parallelizable (low sequential processing)
  - Quit when done (measurable convergence criteria)

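The problem: Rate of convergence
Convergence rate:
$$\lim_{k \to \infty} \frac{\epsilon_{k+1}}{\epsilon_k} = \begin{cases} 0 & \text{superlinear rate, e.g. } \epsilon_k = e^{-e^k} \\ \in (0, 1) & \text{linear rate, e.g. } \epsilon_k = e^{-k} \\ 1 & \text{sublinear rate, e.g. } \epsilon_k = 1/k \end{cases}$$
Use interchangeably:
- Fix step index $k$, upper bound $\min_{1 \le t \le k} \epsilon_t$
- Fix precision $\epsilon$, how many steps are needed for $\min_{1 \le t \le k} \epsilon_t < \epsilon$. E.g. $\frac{1}{\epsilon^2}$, $\frac{1}{\epsilon}$, $\frac{1}{\sqrt{\epsilon}}$, $\log\frac{1}{\epsilon}$, $\log\log\frac{1}{\epsilon}$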
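The three regimes can be checked numerically on the example sequences from this slide. A small sketch (the helper name `ratio` is an illustrative assumption):

```python
import math

def ratio(eps, k):
    """Successive error ratio eps_{k+1} / eps_k."""
    return eps(k + 1) / eps(k)

superlinear = lambda k: math.exp(-math.exp(k))   # eps_k = e^{-e^k}
linear = lambda k: math.exp(-k)                  # eps_k = e^{-k}
sublinear = lambda k: 1.0 / k                    # eps_k = 1/k

r_super = ratio(superlinear, 5)    # collapses towards 0
r_lin = ratio(linear, 5)           # constant e^{-1}, strictly inside (0, 1)
r_sub = ratio(sublinear, 1000)     # k/(k+1), creeps up to 1
```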

The problem: Collection of results
Convergence rate:

Objective function        Gradient descent              Nesterov                           Lower bound
Smooth                    $O(\frac{1}{\epsilon})$       $O(\sqrt{\frac{1}{\epsilon}})$     $O(\sqrt{\frac{1}{\epsilon}})$
Smooth and very convex    $O(\log\frac{1}{\epsilon})$   $O(\log\frac{1}{\epsilon})$        $O(\log\frac{1}{\epsilon})$

Composite non-smooth:
- Smooth + (dual of smooth): $O(\frac{1}{\epsilon})$
- (very convex) + (dual of smooth): $O(\sqrt{\frac{1}{\epsilon}})$

Preliminaries: convex analysis
Convex functions
A function $f$ is convex iff
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \forall\, x, y, \; \lambda \in (0, 1)$$

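Preliminaries: convex analysis
Strong convexity
A function $f$ is called $\sigma$-strongly convex wrt a norm $\|\cdot\|$ iff $f(x) - \frac{1}{2}\sigma\|x\|^2$ is convex, i.e.
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) - \sigma \frac{\lambda(1 - \lambda)}{2}\|x - y\|^2 \quad \forall\, x, y, \; \lambda \in (0, 1)$$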

Preliminaries: convex analysis
Strong convexity
First order equivalent condition
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\sigma}{2}\|x - y\|^2 \quad \forall\, x, y$$

Preliminaries: convex analysis
Strong convexity
Second order
$$\langle \nabla^2 f(x) y, y\rangle \ge \sigma\|y\|^2 \quad \forall\, x, y$$
If $\|\cdot\|$ is the Euclidean norm, then $\nabla^2 f(x) \succeq \sigma I$.
Lower bounds the rate of change of the gradient.

Preliminaries: convex analysis
Lipschitz continuous gradient
Lipschitz continuity
- Stronger than continuity, weaker than differentiability
- Upper bounds the rate of change:
$$\exists L > 0: \quad |f(x) - f(y)| \le L\|x - y\| \quad \forall\, x, y$$

Preliminaries: convex analysis
Lipschitz continuous gradient
The gradient is Lipschitz continuous (so $f$ must be differentiable); abbreviated $L$-l.c.g.
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \quad \forall\, x, y$$
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|x - y\|^2 \quad \forall\, x, y$$
$$\langle \nabla^2 f(x) y, y\rangle \le L\|y\|^2 \quad \forall\, x, y$$
$$\nabla^2 f(x) \preceq L I \quad \text{if } L_2 \text{ norm}$$
(Figure: $f$ is sandwiched between a lower quadratic with curvature $\sigma$ and an upper quadratic with curvature $L$.)

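Preliminaries: convex analysis
Fenchel dual
Fenchel dual of a function $f$:
$$f^\star(s) = \sup_x \; \langle s, x\rangle - f(x)$$
Properties
- $f^{\star\star} = f$ if $f$ is convex and closed
- $f$ is $\sigma$-strongly convex $\Rightarrow$ $f^\star$ is $\frac{1}{\sigma}$-l.c.g. on $\mathbb{R}^d$
- $f$ is $L$-l.c.g. on $\mathbb{R}^d$ $\Rightarrow$ $f^\star$ is $\frac{1}{L}$-strongly convex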

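Preliminaries: convex analysis
Fenchel dual
Fenchel dual of a function $f$:
$$f^\star(s) = \sup_x \; \langle s, x\rangle - f(x)$$
The supremum is attained at a point $x$ with $s = \nabla f(x)$, or more generally $s \in \partial f(x)$.
(Figure: the graph of $f(x)$ with a supporting line of slope $s$; the line meets the vertical axis at $-f^\star(s)$.)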

Preliminaries: convex analysis: Subgradient
- Generalize the gradient to non-differentiable functions
- Idea: tangent plane lying below the graph of $f$
- $\mu$ is called a subgradient of $f$ at $x$ if
$$f(x') \ge f(x) + \langle x' - x, \mu\rangle \quad \forall\, x'$$
- All such $\mu$ comprise the subdifferential of $f$ at $x$: $\partial f(x)$
- Unique if $f$ is differentiable at $x$

Preliminaries: optimization: Gradient descent
Gradient descent
$$x_{k+1} = x_k - s_k \nabla f(x_k), \qquad s_k \ge 0$$
Suppose $f$ is both $\sigma$-strongly convex and $L$-l.c.g. With $\epsilon_k := f(x_k) - f(x^*)$,
$$\epsilon_k \le \Big(1 - \frac{\sigma}{L}\Big)^k \epsilon_0$$
Key idea
- The norm of the gradient upper bounds how far away we are from the optimum
- It also lower bounds how much progress one can make

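Preliminaries: optimization: Gradient descent
Upper bound on the distance from the optimum
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$$
(Figure: graph of $\nabla f$ between $x^*$ and $x_k$; shaded area $\le$ triangle area, where the triangle has height $\|\nabla f(x_k)\|$ and slope $\sigma$.)
So
$$f(x_k) - f(x^*) \le \frac{1}{2\sigma}\|\nabla f(x_k)\|^2$$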

Preliminaries: optimization: Gradient descent
Lower bound on the progress at each step
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$$
(Figure: graph of $\nabla f$ between $x_{k+1}$ and $x_k$; shaded area $\ge$ triangle area, where the triangle has height $\|\nabla f(x_k)\|$ and slope $L$.)
So
$$f(x_k) - f(x_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|^2$$

Preliminaries: optimization: Gradient descent
Putting things together (distance to optimal on the left, progress on the right):
$$2\sigma\,(f(x_k) - f(x^*)) \le \|\nabla f(x_k)\|^2 \le 2L\,(f(x_k) - f(x_{k+1}))$$
Rearranging,
$$\underbrace{f(x_{k+1}) - f(x^*)}_{\epsilon_{k+1}} \le \Big(1 - \frac{\sigma}{L}\Big)\underbrace{(f(x_k) - f(x^*))}_{\epsilon_k}$$
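The per-step contraction can be observed directly. The sketch below runs gradient descent with step $1/L$ on the quadratic $\frac{1}{2}x^2 + 2y^2$ ($\sigma = 1$, $L = 4$) and checks $\epsilon_{k+1} \le (1 - \sigma/L)\epsilon_k$ at every iteration; the starting point and variable names are arbitrary assumptions.

```python
import numpy as np

d = np.array([1.0, 4.0])                 # Hessian eigenvalues: sigma = 1, L = 4
sigma, L = d.min(), d.max()
f = lambda x: 0.5 * np.sum(d * x**2)     # minimized at x* = 0 with f(x*) = 0
grad = lambda x: d * x

x = np.array([3.0, -2.0])                # arbitrary starting point
eps = [f(x)]
for k in range(30):
    x = x - grad(x) / L                  # gradient step with s_k = 1/L
    eps.append(f(x))

# eps_{k+1} <= (1 - sigma/L) * eps_k for every k (small slack for rounding)
bound_holds = all(eps[k + 1] <= (1 - sigma / L) * eps[k] + 1e-15 for k in range(30))
```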

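Preliminaries: optimization: Projected gradient descent
- If the objective function is $L$-l.c.g., but not strongly convex
- Constrained to a convex set $Q$
Projected gradient descent:
$$x_{k+1} = \Pi_Q\Big(x_k - \frac{1}{L}\nabla f(x_k)\Big) = \arg\min_{x \in Q} \; f(x_k) + \langle \nabla f(x_k), x - x_k\rangle + \frac{L}{2}\|x - x_k\|^2$$
where $\Pi_Q(y) = \arg\min_{\hat{x} \in Q} \|\hat{x} - y\|$, here applied to $y = x_k - \frac{1}{L}\nabla f(x_k)$.
Rate of convergence: $O(\frac{L}{\epsilon})$
Compare with Newton $O(\sqrt{\frac{L}{\epsilon}})$, interior point $O(\log\frac{1}{\epsilon})$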

Preliminaries: optimization: Projected gradient descent
Projected gradient descent:
$$x_{k+1} = \Pi_Q\Big(x_k - \frac{1}{L}\nabla f(x_k)\Big) = \arg\min_{x \in Q} \; f(x_k) + \langle \nabla f(x_k), x - x_k\rangle + \frac{L}{2}\|x - x_k\|^2$$
Property 1: monotonic decreasing
$$f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \quad (L\text{-l.c.g.})$$
$$\le f(x_k) + \langle \nabla f(x_k), x_k - x_k\rangle + \frac{L}{2}\|x_k - x_k\|^2 = f(x_k) \quad (\text{definition of } x_{k+1} \text{ as the minimizer over } Q)$$

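Preliminaries: optimization: Projected gradient descent
Property 2 (projection):
$$\Big\langle x - x_{k+1}, \; \Big(x_k - \frac{1}{L}\nabla f(x_k)\Big) - x_{k+1}\Big\rangle \le 0 \quad \forall\, x \in Q$$
(Figure: the point $x_k - \frac{1}{L}\nabla f(x_k)$, its projection $x_{k+1}$ onto $Q$, and a point $x \in Q$.)
Then, for all $x \in Q$,
$$f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \quad (L\text{-l.c.g.})$$
$$\le f(x_k) + \langle \nabla f(x_k), x - x_k\rangle + \frac{L}{2}\|x - x_k\|^2 - \frac{L}{2}\|x - x_{k+1}\|^2 \quad (\text{Property 2})$$
$$\le f(x) + \frac{L}{2}\|x - x_k\|^2 - \frac{L}{2}\|x - x_{k+1}\|^2 \quad (\text{convexity of } f)$$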

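Preliminaries: optimization: Projected gradient descent
Put together:
$$f(x_{k+1}) \le f(x) + \frac{L}{2}\|x - x_k\|^2 - \frac{L}{2}\|x - x_{k+1}\|^2 \quad \forall\, x \in Q$$
Let $x = x^*$:
$$0 \le \frac{L}{2}\|x^* - x_{k+1}\|^2 \le -\epsilon_{k+1} + \frac{L}{2}\|x^* - x_k\|^2 \le \ldots \le -\sum_{i=1}^{k+1} \epsilon_i + \frac{L}{2}\|x^* - x_0\|^2$$
$$\le -(k+1)\,\epsilon_{k+1} + \frac{L}{2}\|x^* - x_0\|^2 \quad (\epsilon_k \text{ is monotonically decreasing})$$
Hence
$$\epsilon_{k+1} \le \frac{L}{2(k+1)}\|x^* - x_0\|^2$$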
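A minimal sketch of projected gradient descent that checks both the monotone decrease and the $\frac{L}{2(k+1)}\|x^* - x_0\|^2$ bound. The test problem (a separable quadratic over the non-negative orthant, so that $x^*$ is known in closed form) is an assumption chosen for illustration.

```python
import numpy as np

d = np.array([0.05, 1.0, 2.0])                  # curvatures; L = max(d)
c = np.array([2.0, -1.0, 0.5])
L = d.max()
f = lambda x: 0.5 * np.sum(d * (x - c)**2)
grad = lambda x: d * (x - c)
project = lambda x: np.maximum(x, 0.0)          # Euclidean projection onto Q = {x >= 0}

x_star = project(c)                             # valid because the objective is separable
x0 = np.array([0.0, 3.0, 3.0])                  # starting point inside Q
x, gaps = x0.copy(), []
for k in range(50):
    x = project(x - grad(x) / L)                # x_{k+1} = Pi_Q(x_k - grad f(x_k) / L)
    gaps.append(f(x) - f(x_star))               # gaps[k] is eps_{k+1}

r2 = np.sum((x0 - x_star)**2)
rate_ok = all(gaps[k] <= L * r2 / (2 * (k + 1)) + 1e-12 for k in range(50))
monotone = all(gaps[k + 1] <= gaps[k] + 1e-12 for k in range(49))
```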

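Preliminaries: optimization: Subgradient method
Objective is continuous but not differentiable.
Subgradient method for $\min_{x \in Q} f(x)$:
$$x_{k+1} = \Pi_Q(x_k - s_k \nabla f(x_k)), \qquad \text{where } \nabla f(x_k) \in \partial f(x_k) \text{ (arbitrary subgradient)}$$
Rate of convergence: $O(\frac{1}{\epsilon^2})$

Summary
Objective                              Steps to reach precision $\epsilon$
non-smooth                             $O(\frac{1}{\epsilon^2})$
$L$-l.c.g.                             $O(\frac{L}{\epsilon})$
$L$-l.c.g. & $\sigma$-strongly convex  $\frac{\ln(1/\epsilon)}{-\ln(1 - \sigma/L)}$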
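A sketch of the projected subgradient method on a small non-smooth problem. The objective ($\ell_1$ distance to a point $c$), the box constraint, and the diminishing step $s_k = 1/\sqrt{k+1}$ are assumptions made for illustration; the slide does not fix a step-size rule. Because the objective value is not monotone, the best value seen so far is tracked.

```python
import numpy as np

c = np.array([0.3, -2.0])
f = lambda x: np.sum(np.abs(x - c))             # convex, non-differentiable
subgrad = lambda x: np.sign(x - c)              # one valid element of the subdifferential
project = lambda x: np.clip(x, -1.0, 1.0)       # Q = [-1, 1]^2

f_star = f(project(c))                          # separable problem: x* = clip(c)
x, best_gap = np.zeros(2), np.inf
for k in range(2000):
    x = project(x - subgrad(x) / np.sqrt(k + 1))    # diminishing step s_k
    best_gap = min(best_gap, f(x) - f_star)
```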

Optimal gradient method
Lower bound
Consider the set of $L$-l.c.g. functions.
For any $\epsilon > 0$, there exists an $L$-l.c.g. function $f$ such that any first-order method takes at least $k = \Omega(\sqrt{L/\epsilon})$ steps to ensure $\epsilon_k < \epsilon$.
First-order method means
$$x_k \in x_0 + \mathrm{span}\{\nabla f(x_0), \ldots, \nabla f(x_{k-1})\}$$
Not saying: there exists an $L$-l.c.g. function $f$ such that, for all $\epsilon > 0$, any first-order method takes at least $k = \Omega(\sqrt{L/\epsilon})$ steps to ensure $\epsilon_k < \epsilon$.
Gap: recall the upper bound $O(\frac{L}{\epsilon})$ of GD; two possibilities.

Optimal gradient method: Primitive Nesterov
Problem under consideration:
$$\min_w f(w) \quad \text{s.t. } w \in Q$$
where $f$ is $L$-l.c.g. and $Q$ is convex.
Big result
- He proposed an algorithm attaining $\sqrt{L/\epsilon}$
- Not for free: requires an oracle to project a point onto $Q$ in the $L_2$ sense

Primitive Nesterov
Construct quadratic functions $\phi_k(x)$ and $\lambda_k > 0$ such that
1. $\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2$
2. $\exists\, x_k$ s.t. $f(x_k) \le \phi_k^*$
3. $\phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x)$
4. $\lambda_k \to 0$
Chaining 2, 1 and 3 at $x = x^*$:
$$f(x_k) \le \phi_k^* \le \phi_k(x^*) \le (1 - \lambda_k) f(x^*) + \lambda_k \phi_0(x^*)$$
so
$$f(x_k) - f(x^*) \le \lambda_k(\phi_0(x^*) - f(x^*)) \to 0$$
(Figure: $f$, the quadratic $\phi_k$ with minimizer $v_k$ and minimum value $\phi_k^*$, and the point $x_k$.)

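Primitive Nesterov: Rate of convergence
1. $\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2$
2. $\exists\, x_k$ s.t. $f(x_k) \le \phi_k^*$
3. $\phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x)$
4. $\lambda_k \to 0$
Nesterov constructed, in a highly non-trivial way, the $\phi_k(x)$ and $\lambda_k$ such that
- $x_k$ has closed form (gradient descent)
- $\lambda_k \le \frac{4L}{(2\sqrt{L} + k\sqrt{\gamma_0})^2}$
Since $f(x_k) - f(x^*) \le \lambda_k(\phi_0(x^*) - f(x^*))$, the rate of convergence depends purely on $\lambda_k$.
Furthermore, if $f$ is $\sigma$-strongly convex, then $\lambda_k \le (1 - \frac{\sigma}{L})^k$.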
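The estimate-sequence construction is not reproduced here. As a stand-in, the sketch below uses a widely used momentum form of accelerated gradient descent (a gradient step taken at an extrapolated point, with the FISTA-style $t_k$ recursion), which is an assumption rather than the exact scheme on the slide. It compares against plain gradient descent on an ill-conditioned quadratic and checks the known bounds $\frac{L\|x_0 - x^*\|^2}{2k}$ for gradient descent and $\frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$ for the accelerated scheme.

```python
import numpy as np

d = np.array([0.0025, 0.5, 1.0])          # Hessian eigenvalues, L = 1
L = d.max()
f = lambda x: 0.5 * np.sum(d * x**2)      # x* = 0, f(x*) = 0
grad = lambda x: d * x
x0 = np.ones(3)
K = 400

# Plain gradient descent with step 1/L.
x_gd = x0.copy()
for k in range(K):
    x_gd = x_gd - grad(x_gd) / L

# Accelerated scheme: gradient step at the extrapolated point y.
x, y, t = x0.copy(), x0.copy(), 1.0
for k in range(K):
    x_next = y - grad(y) / L
    t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum / extrapolation
    x, t = x_next, t_next

r2 = np.sum(x0**2)                        # ||x0 - x*||^2
gap_gd, gap_acc = f(x_gd), f(x)
```

On this problem the accelerated iterate ends with a smaller objective gap than gradient descent after the same number of gradient evaluations.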

Primitive Nesterov: Dealing with constraints (gradient mapping)
$x_k$ has closed form by gradient descent:
$$x_{k+1} = x_k - \gamma \nabla f(x_k)$$
When constrained to the set $Q$, modify by
$$x^Q_{k+1} = \Pi_Q(x_k - \gamma \nabla f(x_k)) = \arg\min_{x \in Q} \|x - (x_k - \gamma \nabla f(x_k))\|$$
(Expensive?)
New gradient (the gradient mapping):
$$g^Q_k := \frac{1}{\gamma}\big(x_k - x^Q_{k+1}\big)$$
This new gradient keeps all the important properties of the gradient, and also keeps the rate of convergence.
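A small sketch of the gradient mapping, assuming a step size named `gamma` and a box for $Q$ (both illustrative choices). Two sanity checks: without constraints the mapping reduces to the ordinary gradient, and at the constrained minimizer it vanishes.

```python
import numpy as np

def gradient_mapping(x, grad, project, gamma):
    """g = (x - Pi_Q(x - gamma * grad f(x))) / gamma."""
    x_q = project(x - gamma * grad(x))
    return (x - x_q) / gamma

c = np.array([2.0, -1.0])
grad = lambda x: x - c                       # f(x) = 0.5 * ||x - c||^2
box = lambda x: np.clip(x, 0.0, 1.0)         # Q = [0, 1]^2
identity = lambda x: x                       # Q = R^2, projection does nothing

x = np.array([0.5, 0.5])
g_free = gradient_mapping(x, grad, identity, gamma=0.5)   # equals grad(x)
g_opt = gradient_mapping(box(c), grad, box, gamma=0.5)    # zero at the constrained optimum
```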

Primitive Nesterov: Summary
$$\min_w f(w) \quad \text{s.t. } w \in Q$$
where $f$ is $L$-l.c.g. and $Q$ is convex.
Rate of convergence:
- $\sqrt{\frac{L}{\epsilon}}$ if no strong convexity
- $\frac{\ln(1/\epsilon)}{-\ln(1 - \sigma/L)}$ if $\sigma$-strongly convex

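Primitive Nesterov: Example
- $\min_{x, y} \; \frac{1}{2}x^2 + 2y^2$, with $\mu = 1$, $L = 4$
- $\min_{x \ge 0,\, y \ge 0} \; \frac{1}{2}(x + y)^2$, with $\mu = 0$, $L = 2$
(Figure: two panels, one per problem, each comparing Nesterov with gradient descent.)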

Extension: Non-Euclidean norm
- Remember that strong convexity and l.c.g. are defined wrt some norm
- We have implicitly used the Euclidean norm ($L_2$ norm)
- Some functions are strongly convex wrt other norms
- Negative entropy $\sum_i x_i \ln x_i$:
  - not l.c.g. wrt the $L_2$ norm
  - strongly convex wrt the $L_1$ norm $\|x\|_1 = \sum_i |x_i|$
- Can Nesterov's approach be extended to non-Euclidean norms?

Extension: Non-Euclidean norm
Suppose the objective function $f$ is $L$-l.c.g. wrt $\|\cdot\|$.
Use a prox-function $d$ on $Q$ which is $\sigma$-strongly convex wrt $\|\cdot\|$, with
$$\min_{x \in Q} d(x) = 0, \qquad D := \max_{x \in Q} d(x)$$

Algorithm 1: Nesterov's algorithm for non-Euclidean norm
Output: a sequence $\{y_k\}$ converging to the optimum at $O(1/k^2)$ rate.
Initialize: set $x_0$ to a random value in $Q$.
for $k = 0, 1, 2, \ldots$ do
    Query the gradient of $f$ at point $x_k$: $\nabla f(x_k)$.
    Find $y_k \leftarrow \arg\min_{x \in Q} \; \langle \nabla f(x_k), x - x_k\rangle + \frac{1}{2}L\|x - x_k\|^2$.
    Find $z_k \leftarrow \arg\min_{x \in Q} \; \frac{L}{\sigma}d(x) + \sum_{i=0}^{k} \frac{i+1}{2}\langle \nabla f(x_i), x - x_i\rangle$.
    Update $x_{k+1} \leftarrow \frac{2}{k+3}z_k + \frac{k+1}{k+3}y_k$.

(Details omitted.)

Extension: Non-Euclidean norm
Rate of convergence:
$$f(y_k) - f(x^*) \le \frac{4L\,d(x^*)}{\sigma(k+1)(k+2)}$$
Applications will be given later.
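A sketch of Algorithm 1 in the simplest setting, under stated assumptions: Euclidean norm, prox-function $d(x) = \frac{1}{2}\|x - x_0\|^2$ (so $\sigma = 1$), $Q$ a box, and the iteration started at the prox-center $x_0 = \arg\min_Q d$ rather than a random point. With these choices both inner minimizations reduce to Euclidean projections. The separable test objective is hypothetical and picked so that $x^*$ is known.

```python
import numpy as np

d_eig = np.array([0.01, 0.3, 1.0])
c = np.array([3.0, -2.0, 0.4])
L = d_eig.max()
f = lambda x: 0.5 * np.sum(d_eig * (x - c)**2)
grad = lambda x: d_eig * (x - c)
project = lambda x: np.clip(x, -1.0, 1.0)         # Q = [-1, 1]^3
x_star = project(c)                               # separable objective

x0 = np.zeros(3)                                  # prox-center
sigma = 1.0
d_prox = lambda x: 0.5 * np.sum((x - x0)**2)      # 1-strongly convex, d(x0) = 0

x, weighted_grads, gaps = x0.copy(), np.zeros(3), []
for k in range(100):
    g = grad(x)
    y = project(x - g / L)                        # y-step: projected gradient step
    weighted_grads += 0.5 * (k + 1) * g           # running sum of (i+1)/2 * grad f(x_i)
    z = project(x0 - (sigma / L) * weighted_grads)    # z-step in closed form for this d
    x = 2.0 / (k + 3) * z + (k + 1.0) / (k + 3) * y   # convex combination of z and y
    gaps.append(f(y) - f(x_star))

bound_ok = all(
    gaps[k] <= 4 * L * d_prox(x_star) / (sigma * (k + 1) * (k + 2)) + 1e-12
    for k in range(100)
)
```

For a non-Euclidean choice such as the entropy prox-function on the simplex, the two argmin steps would need their own closed forms; that case is not covered by this sketch.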

Immediate application: Non-smooth functions
- Objective function is not differentiable
- Suppose it is the Fenchel dual of some function $f$:
$$\min_x f^\star(x), \qquad \text{where } f \text{ is defined on } Q$$
- Idea: smooth the non-smooth function
- Add a small $\sigma$-strongly convex function $d$ to $f$:
  - $f + d$ is $\sigma$-strongly convex
  - $(f + d)^\star$ is $\frac{1}{\sigma}$-l.c.g.

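Immediate application: Non-smooth functions
$(f + \epsilon d)^\star(x)$ approximates $f^\star(x)$.
If $0 \le d(u) \le D$ for $u \in Q$, then
$$f^\star(x) - \epsilon D \le (f + \epsilon d)^\star(x) \le f^\star(x)$$
Proof:
$$\underbrace{\max_u \; \langle u, x\rangle - f(u) - \epsilon D}_{=\, f^\star(x) - \epsilon D} \;\le\; \underbrace{\max_u \; \langle u, x\rangle - f(u) - \epsilon d(u)}_{=\, (f + \epsilon d)^\star(x)} \;\le\; \underbrace{\max_u \; \langle u, x\rangle - f(u) - 0}_{=\, f^\star(x)}$$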

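Immediate application: Non-smooth functions
$(f + \epsilon d)^\star(x)$ approximates $f^\star(x)$ well: if $d(u) \in [0, D]$ on $Q$, then
$$(f + \epsilon d)^\star(x) - f^\star(x) \in [-\epsilon D, 0]$$
Algorithm (given precision $\epsilon$)
- Fix $\hat{\epsilon} = \frac{\epsilon}{2D}$
- Optimize $(f + \hat{\epsilon} d)^\star(x)$ (an l.c.g. function) to precision $\epsilon / 2$
Rate of convergence, using $L = \frac{1}{\hat{\epsilon}\sigma}$:
$$\sqrt{\frac{L}{\epsilon}} = \sqrt{\frac{1}{\epsilon} \cdot \frac{1}{\hat{\epsilon}\sigma}} = \sqrt{\frac{2D}{\sigma\epsilon^2}} = \frac{1}{\epsilon}\sqrt{\frac{2D}{\sigma}}$$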
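A one-dimensional instance makes the sandwich bound concrete. Take $f$ to be the indicator of $Q = [-1, 1]$, so $f^\star(x) = \max_{|u| \le 1} ux = |x|$, and take $d(u) = \frac{1}{2}u^2$ (so $\sigma = 1$, $D = \frac{1}{2}$). These choices are assumptions for illustration; the smoothed dual is then the Huber function.

```python
import numpy as np

def smoothed_abs(x, eps):
    """(f + eps*d)^*(x) for f = indicator of [-1, 1] and d(u) = u^2 / 2."""
    u = np.clip(x / eps, -1.0, 1.0)     # maximizer of u*x - eps*u^2/2 over [-1, 1]
    return u * x - 0.5 * eps * u**2

eps, D = 0.1, 0.5                       # D = max of d over Q
xs = np.linspace(-2.0, 2.0, 401)
vals = smoothed_abs(xs, eps)
lower_ok = np.all(np.abs(xs) - eps * D <= vals + 1e-12)   # f*(x) - eps*D <= smoothed
upper_ok = np.all(vals <= np.abs(xs) + 1e-12)             # smoothed <= f*(x)
```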

Composite optimization
Many applications have objectives of the form
$$J(w) = f(w) + g^\star(Aw)$$
where
- $f$ is convex on the region $E_1$ with norm $\|\cdot\|_1$
- $g$ is convex on the region $E_2$ with norm $\|\cdot\|_2$
Very useful in machine learning: $Aw$ corresponds to a linear model.

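Composite optimization
Example: binary SVM
$$J(w) = \underbrace{\frac{\lambda}{2}\|w\|^2}_{f(w)} + \underbrace{\min_{b \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^{n} [1 - y_i(\langle x_i, w\rangle + b)]_+}_{g^\star(Aw)}$$
- $A = -(y_1 x_1, \ldots, y_n x_n)^\top$
- $g^\star$ is the dual of $g(\alpha) = -\sum_i \alpha_i$ over $Q_2 = \{\alpha \in [0, n^{-1}]^n : \sum_i y_i \alpha_i = 0\}$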

Composite optimization 1: Smooth minimization
$$J(w) = f(w) + g^\star(Aw)$$
- Let us only assume that $f$ is $M$-l.c.g. wrt $\|\cdot\|_1$
- Smooth $g^\star$ into $(g + \mu d_2)^\star$ ($d_2$ is $\sigma_2$-strongly convex wrt $\|\cdot\|_2$); then
$$J_\mu(w) = f(w) + (g + \mu d_2)^\star(Aw)$$
is $\big(M + \frac{1}{\mu\sigma_2}\|A\|_{1,2}^2\big)$-l.c.g.
- Apply Nesterov on $J_\mu(w)$

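Composite optimization 1: Smooth minimization
Rate of convergence: to find an $\epsilon$-accurate solution, it costs
$$4\|A\|_{1,2}\sqrt{\frac{D_1 D_2}{\sigma_1\sigma_2}} \cdot \frac{1}{\epsilon} + \sqrt{\frac{M D_1}{\sigma_1 \epsilon}}$$
steps, where
- $d_1$ is $\sigma_1$-strongly convex wrt $\|\cdot\|_1$
- $d_2$ is $\sigma_2$-strongly convex wrt $\|\cdot\|_2$
- $D_1 := \max_{w \in E_1} d_1(w)$, $\; D_2 := \max_{\alpha \in E_2} d_2(\alpha)$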

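Composite optimization 1: Smooth minimization
Example: matrix game
$$\arg\min_{w \in \Delta_n} \; \underbrace{\langle c, w\rangle}_{f(w)} + \underbrace{\max_{\alpha \in \Delta_m} \{\langle Aw, \alpha\rangle + \langle b, \alpha\rangle\}}_{g^\star(Aw)}$$
Use Euclidean distance:
- $E_1 = \Delta_n$, $\; \|w\|_1 = (\sum_i w_i^2)^{1/2}$, $\; d_1(w) = \frac{1}{2}\sum_i (w_i - n^{-1})^2$
- $E_2 = \Delta_m$, $\; \|\alpha\|_2 = (\sum_i \alpha_i^2)^{1/2}$, $\; d_2(\alpha) = \frac{1}{2}\sum_i (\alpha_i - m^{-1})^2$
- $\sigma_1 = \sigma_2 = 1$, $\; D_1 < 1$, $\; D_2 < 1$
- $\|A\|_{1,2} = \lambda_{\max}^{1/2}(A^\top A)$
$$f(w_k) - f(w^*) \le \frac{4\lambda_{\max}^{1/2}(A^\top A)}{k+1}$$
The numerator may scale with $O(nm)$.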

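Composite optimization 1: Smooth minimization
Example: matrix game
$$\arg\min_{w \in \Delta_n} \; \underbrace{\langle c, w\rangle}_{f(w)} + \underbrace{\max_{\alpha \in \Delta_m} \{\langle Aw, \alpha\rangle + \langle b, \alpha\rangle\}}_{g^\star(Aw)}$$
Use entropy distance:
- $E_1 = \Delta_n$, $\; \|w\|_1 = \sum_i |w_i|$, $\; d_1(w) = \ln n + \sum_i w_i \ln w_i$
- $E_2 = \Delta_m$, $\; \|\alpha\|_2 = \sum_i |\alpha_i|$, $\; d_2(\alpha) = \ln m + \sum_i \alpha_i \ln \alpha_i$
- $\sigma_1 = \sigma_2 = 1$, $\; D_1 = \ln n$, $\; D_2 = \ln m$
- $\|A\|_{1,2} = \max_{i,j} |A_{i,j}|$
$$f(w_k) - f(w^*) \le \frac{4(\ln n \ln m)^{1/2}}{k+1}\max_{i,j}|A_{i,j}|$$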
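With the entropy prox-function $d_2$, the smoothed inner maximum has a closed form: $\max_{\alpha \in \Delta_m} \langle z, \alpha\rangle - \mu d_2(\alpha) = \mu \ln\big(\frac{1}{m}\sum_j e^{z_j/\mu}\big)$ with $z = Aw + b$. The sketch below evaluates it stably and checks that it lies within $\mu D_2 = \mu \ln m$ of the true maximum. The random matrix, the value of $\mu$, and the function name are assumptions for illustration.

```python
import numpy as np

def smoothed_max(z, mu):
    """mu * log(mean(exp(z / mu))), computed with the max subtracted for stability."""
    zmax = z.max()
    return zmax + mu * np.log(np.sum(np.exp((z - zmax) / mu)) / z.size)

rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)   # m = 5, n = 3
w = np.full(3, 1.0 / 3.0)              # a point of the simplex Delta_3
z = A @ w + b
mu = 0.05
val = smoothed_max(z, mu)              # smooth surrogate for max_j z_j
```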

Composite optimization 1: Smooth minimization
Disadvantages:
- The smoothing is fixed beforehand using the prescribed accuracy $\epsilon$
- No convergence criteria, because the real minimum is unknown

Composite optimization 2: Excessive gap minimization
Primal-dual
- Easily upper bounds the duality gap
Idea
- Assume the objective function takes the form $J(w) = f(w) + g^\star(Aw)$
- Utilizes the adjoint form
$$D(\alpha) = -g(\alpha) - f^\star(-A^\top\alpha)$$
- Relations:
$$J(w) \ge D(\alpha) \;\; \forall\, w, \alpha \qquad \text{and} \qquad \inf_{w \in E_1} J(w) = \sup_{\alpha \in E_2} D(\alpha)$$

Composite optimization 2: Excessive gap minimization
Sketch of idea
- Assume $f$ is $L_f$-l.c.g. and $g$ is $L_g$-l.c.g.
- Smooth both $f^\star$ and $g^\star$ by prox-functions $d_1$, $d_2$:
$$J_{\mu_2}(w) = f(w) + (g + \mu_2 d_2)^\star(Aw)$$
$$D_{\mu_1}(\alpha) = -g(\alpha) - (f + \mu_1 d_1)^\star(-A^\top\alpha)$$

Composite optimization 2: Excessive gap minimization
Sketch of idea
- Maintain two point sequences $\{w_k\}$ and $\{\alpha_k\}$, and two regularization sequences $\{\mu_1(k)\}$ and $\{\mu_2(k)\}$, such that
$$J_{\mu_2(k)}(w_k) \le D_{\mu_1(k)}(\alpha_k), \qquad \mu_1(k) \to 0, \qquad \mu_2(k) \to 0$$

Composite optimization 2: Excessive gap minimization
$$J_{\mu_2(k)}(w_k) \le D_{\mu_1(k)}(\alpha_k)$$
Challenge:
- How to efficiently find initial points $w_1$, $\alpha_1$, $\mu_1(1)$, $\mu_2(1)$ that satisfy the excessive gap condition.
- Given $w_k$, $\alpha_k$, $\mu_1(k)$, $\mu_2(k)$, with new $\mu_1(k+1)$ and $\mu_2(k+1)$, how to efficiently find $w_{k+1}$ and $\alpha_{k+1}$.
- How to anneal $\mu_1(k)$ and $\mu_2(k)$ (otherwise one step done).
Solution
- Gradient mapping
- Bregman projection (very cool)

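Composite optimization 2: Excessive gap minimization
Rate of convergence:
$$J(w_k) - D(\alpha_k) \le \frac{4\|A\|_{1,2}}{k+1}\sqrt{\frac{D_1 D_2}{\sigma_1\sigma_2}}$$
If $f$ is $\sigma$-strongly convex:
- No need to add a prox-function to $f$, $\mu_1(k) \equiv 0$
$$J(w_k) - D(\alpha_k) \le \frac{4 D_2}{\sigma_2\, k(k+1)}\left(\frac{\|A\|_{1,2}^2}{\sigma} + L_g\right)$$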

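Composite optimization 2: Excessive gap minimization
Example: binary SVM
$$J(w) = \underbrace{\frac{\lambda}{2}\|w\|^2}_{f(w)} + \underbrace{\min_{b \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^{n} [1 - y_i(\langle x_i, w\rangle + b)]_+}_{g^\star(Aw)}$$
- $A = -(y_1 x_1, \ldots, y_n x_n)^\top$
- $g^\star$ is the dual of $g(\alpha) = -\sum_i \alpha_i$ over $E_2 = \{\alpha \in [0, n^{-1}]^n : \sum_i y_i \alpha_i = 0\}$
- Adjoint form:
$$D(\alpha) = \sum_i \alpha_i - \frac{1}{2\lambda}\alpha^\top A A^\top \alpha$$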

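Composite optimization 2: Convergence rate for SVM
Theorem: running on SVM for $k$ iterations,
$$J(w_k) - D(\alpha_k) \le \frac{2L}{(k+1)(k+2)\,n}, \qquad L = \frac{1}{\lambda}\|A\|^2 = \frac{1}{\lambda}\|(y_1 x_1, \ldots, y_n x_n)\|^2 \le \frac{nR^2}{\lambda} \quad (\|x_i\| \le R)$$
Final conclusion:
$$J(w_k) - D(\alpha_k) \le \epsilon \quad \text{as long as} \quad k > O\Big(\frac{R}{\sqrt{\lambda\epsilon}}\Big)$$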

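Composite optimization 2: Projection for SVM
Efficient $O(n)$ time projection onto
$$E_2 = \Big\{\alpha \in [0, n^{-1}]^n : \sum_i y_i \alpha_i = 0\Big\}$$
Projection leads to a singly linearly constrained QP:
$$\min_\alpha \sum_{i=1}^{n} (\alpha_i - m_i)^2 \quad \text{s.t.} \quad l_i \le \alpha_i \le u_i \;\; \forall\, i \in [n], \qquad \sum_{i=1}^{n} \sigma_i \alpha_i = z$$
Key tool: median finding takes $O(n)$ time.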
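The $O(n)$ median-finding routine is not reproduced here. The sketch below solves the same QP with a simpler swapped-in technique, bisection on the Lagrange multiplier $\tau$ of the equality constraint: the KKT conditions give $\alpha_i(\tau) = \mathrm{clip}(m_i - \tau\sigma_i, l_i, u_i)$, and $\sum_i \sigma_i\alpha_i(\tau)$ is non-increasing in $\tau$. The function name and test data are hypothetical, and the constraint set is assumed non-empty.

```python
import numpy as np

def project_box_hyperplane(m, s, z, lo, hi, iters=100):
    """Minimize sum_i (a_i - m_i)^2 s.t. lo <= a_i <= hi and sum_i s_i * a_i = z."""
    alpha = lambda tau: np.clip(m - tau * s, lo, hi)
    # Bracket the multiplier (assumes the feasible set is non-empty).
    t_lo, t_hi = -1.0, 1.0
    while np.dot(s, alpha(t_lo)) < z:
        t_lo *= 2.0
    while np.dot(s, alpha(t_hi)) > z:
        t_hi *= 2.0
    # Bisection: sum_i s_i * alpha_i(tau) is non-increasing in tau.
    for _ in range(iters):
        t_mid = 0.5 * (t_lo + t_hi)
        if np.dot(s, alpha(t_mid)) > z:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return alpha(0.5 * (t_lo + t_hi))

# SVM-style instance: labels y in {-1, +1}, box [0, 1/n], constraint sum_i y_i a_i = 0
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = np.array([0.3, -0.2, 0.1, 0.05, 0.4, -0.1])
a = project_box_hyperplane(m, y, 0.0, 0.0, 1.0 / 6.0)
small = project_box_hyperplane(np.array([0.5, 0.1]), np.array([1.0, -1.0]), 0.0, 0.0, 1.0)
```

In the two-variable instance the projection of $(0.5, 0.1)$ onto $\{a_1 = a_2\} \cap [0, 1]^2$ is $(0.3, 0.3)$, which the bisection recovers.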

Automatic estimation of Lipschitz constant
- Automatic estimation of the Lipschitz constant $L$
- Geometric scaling
- Does not affect the rate of convergence
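One common way to realize geometric scaling is backtracking: start from a small guess for $L$ and multiply it by a constant factor until the quadratic upper bound implied by the l.c.g. property holds for the trial step. The sketch below applies this to plain gradient descent; the initial guess, the factor 2, and the function name are assumptions, and the slides may use a different variant.

```python
import numpy as np

def gd_with_backtracking(f, grad, x, L0=1e-3, scale=2.0, iters=100):
    """Gradient descent that grows its estimate of L geometrically when needed."""
    L = L0
    for _ in range(iters):
        g = grad(x)
        while True:
            x_new = x - g / L
            # For a step of length 1/L the l.c.g. upper bound reads
            # f(x_new) <= f(x) - ||g||^2 / (2L); small slack guards against rounding.
            if f(x_new) <= f(x) - np.dot(g, g) / (2.0 * L) + 1e-12:
                break
            L *= scale
        x = x_new
    return x, L

d = np.array([1.0, 4.0])                       # true Lipschitz constant is 4
f = lambda x: 0.5 * np.sum(d * x**2)
grad = lambda x: d * x
x_final, L_est = gd_with_backtracking(f, grad, np.array([3.0, -2.0]))
```

Because the estimate is only increased when the test fails, it never exceeds `scale` times the true constant once the initial guess is below it.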

Conclusion
- Nesterov's method attains the lower bound $O(\sqrt{L/\epsilon})$ for $L$-l.c.g. objectives
- Linear rate for l.c.g. and strongly convex objectives
- Composite optimization
  - Attains the rate of the nice part of the function
- Handling constraints
  - Gradient mapping and Bregman projection
  - Essentially does not change the convergence rate
- Expecting wide applications in machine learning
  - Note: not in terms of generalization performance