Nesterov's Optimal Gradient Methods


1 Nesterov's Optimal Gradient Methods. Xinhua Zhang, Australian National University / NICTA. (Pictured: Yurii Nesterov.)

2 Outline: the problem from a machine learning perspective; preliminaries (convex analysis and gradient descent); Nesterov's optimal gradient method (lower bound of optimization, optimal gradient method); utilizing structure: composite optimization (smooth minimization, excessive gap minimization); conclusion.

3 Outline: the problem from a machine learning perspective; preliminaries (convex analysis and gradient descent); Nesterov's optimal gradient method (lower bound of optimization, optimal gradient method); utilizing structure: composite optimization (smooth minimization, excessive gap minimization); conclusion.

4 The problem. Many machine learning problems have the form $\min_w J(w) := \lambda\,\Omega(w) + R_{\mathrm{emp}}(w)$, where $R_{\mathrm{emp}}(w) := \frac{1}{n}\sum_{i=1}^n l(x_i, y_i; w)$. Here $w$ is the weight vector, $\{x_i, y_i\}_{i=1}^n$ is the training data, $l(x, y; w)$ is a convex, non-negative loss function (can be non-smooth, possibly non-convex), and $\Omega(w)$ is a convex, non-negative regularizer.

5 The problem: examples. Linear SVM (primal): $\min_{w,\xi} \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \xi_i$ s.t. $\xi_i \ge 1 - y_i\langle w, x_i\rangle$ and $\xi_i \ge 0$ for all $1 \le i \le n$; eliminating $\xi_i = \max\{0, 1 - y_i\langle w, x_i\rangle\}$ gives $\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max\{0, 1 - y_i\langle w, x_i\rangle\}$, where $\max\{0, 1-x\}$ is the hinge loss. Models of the form $\lambda\,\Omega(w) + R_{\mathrm{emp}}(w)$: linear SVM: $\frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max\{0, 1 - y_i\langle w, x_i\rangle\}$; $\ell_1$ logistic regression: $\lambda\|w\|_1 + \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i\langle w, x_i\rangle))$; $\epsilon$-insensitive regression: $\frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max\{0, |y_i - \langle w, x_i\rangle| - \epsilon\}$. Here $\|w\|_1 = \sum_i |w_i|$.
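
A minimal NumPy sketch (my own illustration, not from the slides) that evaluates the regularized hinge-loss objective above on synthetic data; the names lam, X, y and the random data are assumptions.

import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective: lam/2 * ||w||^2 + mean hinge loss."""
    margins = y * (X @ w)                      # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)     # max{0, 1 - y_i <w, x_i>}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# tiny synthetic example
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100))
print(svm_objective(np.zeros(5), X, y, lam=0.1))   # equals 1.0 at w = 0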

6 The problem: more examples. Lasso: $\operatorname{argmin}_w \lambda\|w\|_1 + \|Aw - b\|_2^2$. Multi-task learning: $\operatorname{argmin}_W \lambda\|W\|_{\mathrm{tr}} + \sum_{t=1}^T \|X_t w_t - b_t\|_2^2$, or $\operatorname{argmin}_W \lambda\|W\|_{1,1} + \sum_{t=1}^T \|X_t w_t - b_t\|_2^2$. Matrix game: $\operatorname{argmin}_{w\in\Delta_d} \langle c, w\rangle + \max_{u\in\Delta_n}\{\langle Aw, u\rangle + \langle b, u\rangle\}$. Entropy-regularized LPBoost: $\operatorname{argmin}_{w\in\Delta_d} \lambda\,\Delta(w; w^0) + \max_{u\in\Delta_n}\langle Aw, u\rangle$, where $\Delta(w; w^0)$ is the relative entropy to a prior $w^0$.

7 The problem: Lagrange duals. Binary SVM: $\min_\alpha \frac{1}{2\lambda}\alpha^\top Q\alpha - \sum_i \alpha_i$ s.t. $\alpha_i \in [0, n^{-1}]$ and $\sum_i y_i\alpha_i = 0$, where $Q_{ij} = y_i y_j\langle x_i, x_j\rangle$. Entropy-regularized LPBoost: $\min_\alpha\ \lambda\ln\sum_d w^0_d\exp\!\big(-\tfrac{1}{\lambda}\sum_{i=1}^n A_{i,d}\,\alpha_i\big)$ s.t. $\alpha_i \in [0, 1]$ and $\sum_i \alpha_i = 1$.

8 The problem: summary. $\min_{w\in Q} J(w)$, where $J$ is convex but might be non-smooth, $Q$ is a (simple) convex set, and $J$ might have a composite form. Solver: an iterative method producing $w_0, w_1, w_2, \ldots$; we want $\epsilon_k := J(w_k) - J(w^*)$ to decrease to 0 quickly, where $w^* := \operatorname{argmin}_{w\in Q} J(w)$. We only discuss optimization in this session, not generalization bounds.

9 The problem: what makes a good optimizer? Find an $\epsilon$-approximate solution, i.e. $J(w_k) \le \min_w J(w) + \epsilon$, with $\epsilon_k$ as small as possible (take as few steps as possible); the error should decay like $1/k^2$, $1/k$, or $e^{-k}$. Each iteration should cost a reasonable amount of work, depending leniently on $n$ and other condition parameters. The method should be general purpose and parallelizable (low sequential processing), and should quit when done (measurable convergence criteria).

10 The problem: rate of convergence. Convergence rate $\lim_{k\to\infty} \epsilon_{k+1}/\epsilon_k$: equal to 0 gives a superlinear rate (e.g. $\epsilon_k = e^{-e^k}$); in $(0,1)$ gives a linear rate (e.g. $\epsilon_k = e^{-k}$); equal to 1 gives a sublinear rate (e.g. $\epsilon_k = 1/k$). Two interchangeable views: fix the step index $k$ and upper bound $\min_{1\le t\le k}\epsilon_t$, or fix the precision $\epsilon$ and ask how many steps are needed for $\min_{1\le t\le k}\epsilon_t < \epsilon$, e.g. $\frac{1}{\epsilon}$, $\frac{1}{\epsilon^2}$, $\frac{1}{\sqrt{\epsilon}}$, $\log\frac{1}{\epsilon}$, $\log\log\frac{1}{\epsilon}$.

11 The problem: collection of results (iterations to reach precision $\epsilon$):
Objective function                                      | Gradient descent          | Nesterov                   | Lower bound
Smooth                                                  | $O(1/\epsilon)$           | $O(\sqrt{1/\epsilon})$     | $O(\sqrt{1/\epsilon})$
Smooth and very convex                                  | $O(\log\frac{1}{\epsilon})$ | $O(\log\frac{1}{\epsilon})$ | $O(\log\frac{1}{\epsilon})$
Composite non-smooth: smooth + (dual of smooth)         |                           | $O(1/\epsilon)$            |
Composite non-smooth: (very convex) + (dual of smooth)  |                           | $O(\sqrt{1/\epsilon})$     |

12 Outline: the problem from a machine learning perspective; preliminaries (convex analysis and gradient descent); Nesterov's optimal gradient method (lower bound of optimization, optimal gradient method); utilizing structure: composite optimization (smooth minimization, excessive gap minimization); conclusion.

13 Preliminaries: convex analysis. Convex functions: a function $f$ is convex iff for all $x, y$ and $\lambda\in(0,1)$: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)$.

14 Preliminaries: convex analysis. Convex functions: a function $f$ is convex iff for all $x, y$ and $\lambda\in(0,1)$: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)$.

15 Preliminaries: convex analysis. Strong convexity: a function $f$ is called $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$ iff $f(x) - \frac{\sigma}{2}\|x\|^2$ is convex, equivalently for all $x, y$ and $\lambda\in(0,1)$: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y) - \frac{\sigma}{2}\lambda(1-\lambda)\|x-y\|^2$.

16 Preliminaries: convex analysis. Strong convexity, first-order equivalent condition: $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\sigma}{2}\|x-y\|^2$ for all $x, y$.

17 Preliminaries: convex analysis. Strong convexity, first-order equivalent condition: $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\sigma}{2}\|x-y\|^2$ for all $x, y$.

18 Preliminaries: convex analysis. Strong convexity, second-order condition: $\langle\nabla^2 f(x)\,y, y\rangle \ge \sigma\|y\|^2$ for all $x, y$; if $\|\cdot\|$ is the Euclidean norm, this reads $\nabla^2 f(x) \succeq \sigma I$. Strong convexity lower-bounds the rate of change of the gradient.

19 Preliminaries: convex analysis. Lipschitz continuity: stronger than continuity, weaker than differentiability; it upper-bounds the rate of change: there exists $L > 0$ such that $|f(x) - f(y)| \le L\|x - y\|$ for all $x, y$.

20 Preliminaries: convex analysis. Lipschitz continuous gradient: the gradient is Lipschitz continuous ($f$ must be differentiable) if $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$ (we say $f$ is $L$-l.c.g.); this implies $f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|x - y\|^2$ for all $x, y$. (Picture: the $L$-l.c.g. quadratic upper bound vs. the $\sigma$-strong-convexity quadratic lower bound.)

21 Preliminaries: convex analysis. Lipschitz continuous gradient: $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$, which implies $f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|x - y\|^2$ and $\langle\nabla^2 f(x)\,y, y\rangle \le L\|y\|^2$ for all $x, y$; with the $L_2$ norm, $\nabla^2 f(x) \preceq LI$.

22 Preliminaries: convex analysis. Fenchel dual of a function $f$: $f^\star(s) = \sup_x \langle s, x\rangle - f(x)$. Properties: if $f$ is $\sigma$-strongly convex then $f^\star$ is $\frac{1}{\sigma}$-l.c.g. on $\mathbb{R}^d$; if $f$ is $L$-l.c.g. on $\mathbb{R}^d$ then $f^\star$ is $\frac{1}{L}$-strongly convex; $f^{\star\star} = f$ if $f$ is convex and closed.

23 Preliminaries: convex analysis. Fenchel dual, geometric picture: $f^\star(s) = \sup_x \langle s, x\rangle - f(x)$; the supremum is attained at the $x$ where $s = \nabla f(x)$, and $f^\star(s)$ is the largest gap between the line of slope $s$ and the graph of $f$.
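
A quick numerical check (my own, not from the slides) of the strong-convexity / Lipschitz-gradient duality, using the 1-D function $f(x) = \frac{\sigma}{2}x^2$ whose conjugate is $f^\star(s) = \frac{s^2}{2\sigma}$; the grid-based conjugate is only an approximation of the supremum.

import numpy as np

sigma = 4.0
xs = np.linspace(-50, 50, 200001)          # grid over x for the sup
f = 0.5 * sigma * xs**2                    # f(x) = (sigma/2) x^2, sigma-strongly convex

def conjugate(s):
    # f*(s) = sup_x s*x - f(x), approximated on the grid
    return np.max(s * xs - f)

for s in [-3.0, 0.5, 2.0]:
    print(conjugate(s), s**2 / (2 * sigma))   # matches the closed form s^2 / (2*sigma)
# f* has gradient s/sigma, i.e. it is (1/sigma)-l.c.g., as the slide states.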

24 Preliminaries: convex analysis: subgradient. Generalize the gradient to non-differentiable functions. Idea: a tangent plane lying below the graph of $f$.

25 Preliminaries: convex analysis: subgradient. $\mu$ is called a subgradient of $f$ at $x$ if $f(x') \ge f(x) + \langle x' - x, \mu\rangle$ for all $x'$. All such $\mu$ comprise the subdifferential of $f$ at $x$.

26 Preliminaries: convex analysis: subgradient. $\mu$ is called a subgradient of $f$ at $x$ if $f(x') \ge f(x) + \langle x' - x, \mu\rangle$ for all $x'$. All such $\mu$ comprise the subdifferential of $f$ at $x$; it is unique if $f$ is differentiable at $x$.

27 Preliminaries: optimization: gradient descent. Gradient descent: $x_{k+1} = x_k - s_k\nabla f(x_k)$ with step sizes $s_k \ge 0$. Suppose $f$ is both $\sigma$-strongly convex and $L$-l.c.g.; then $\epsilon_k := f(x_k) - f(x^*)$ satisfies $\epsilon_k \le \big(1 - \frac{\sigma}{L}\big)^k\epsilon_0$. Key idea: the norm of the gradient upper-bounds the distance from optimality and lower-bounds how much progress one step can make.
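
A small sketch (mine, on an assumed quadratic test function) of gradient descent with step size $1/L$ on a $\sigma$-strongly convex, $L$-l.c.g. function; the error indeed stays below the $(1-\sigma/L)^k$ bound from the slide.

import numpy as np

# f(x) = 1/2 x^T H x with Hessian eigenvalues in [sigma, L]
H = np.diag([1.0, 4.0])
sigma, L = 1.0, 4.0
f = lambda x: 0.5 * x @ H @ x          # minimum value 0 at the origin
grad = lambda x: H @ x

x = np.array([1.0, 1.0])
eps0 = f(x)
for k in range(20):
    print(k, f(x), (1 - sigma / L) ** k * eps0)   # error vs. the (1 - sigma/L)^k bound
    x = x - (1.0 / L) * grad(x)                   # gradient descent with step 1/L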

28 Preliminaries: optimization: gradient descent. Upper bound on the distance from optimal, for the step $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$: by $\sigma$-strong convexity (picture: the shaded area under $\|\nabla f\|$ between $x^*$ and $x_k$ is at least the area of a triangle with slope $\sigma$), $f(x_k) - f(x^*) \le \frac{1}{2\sigma}\|\nabla f(x_k)\|^2$.

29 Preliminaries: optimization: gradient descent. Lower bound on the progress at each step, for $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$: by $L$-l.c.g. (picture: the shaded area between $x_{k+1}$ and $x_k$ is at least the area of a triangle with slope $L$), $f(x_k) - f(x_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|^2$.

30 Preliminaries: optimization: gradient descent. Putting things together: (distance to optimal) $2\sigma(f(x_k) - f(x^*)) \le \|\nabla f(x_k)\|^2 \le 2L(f(x_k) - f(x_{k+1}))$ (progress), hence $f(x_{k+1}) - f(x^*) \le \big(1 - \frac{\sigma}{L}\big)(f(x_k) - f(x^*))$.

31 Preliminaries: optimization: gradient descent. Putting things together: $2\sigma(f(x_k) - f(x^*)) \le \|\nabla f(x_k)\|^2 \le 2L(f(x_k) - f(x_{k+1}))$, hence $\underbrace{f(x_{k+1}) - f(x^*)}_{\epsilon_{k+1}} \le \big(1 - \frac{\sigma}{L}\big)\underbrace{(f(x_k) - f(x^*))}_{\epsilon_k}$.

32 Preliminaries: optimization: gradient descent. Putting things together: $\epsilon_{k+1} \le \big(1 - \frac{\sigma}{L}\big)\epsilon_k$. What if $\sigma = 0$? What if there is a constraint?

33 Preliminaries: optimization: projected gradient descent. If the objective function is $L$-l.c.g. but not strongly convex, and is constrained to a convex set $Q$: $x_{k+1} = \Pi_Q\big(x_k - \frac{1}{L}\nabla f(x_k)\big) = \operatorname{argmin}_{x\in Q}\ f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \frac{L}{2}\|x - x_k\|^2$, where $\Pi_Q(\hat{x}) = \operatorname{argmin}_{x\in Q}\|x - \hat{x}\|$. Rate of convergence: $O\big(\frac{L}{\epsilon}\big)$; compare with Newton and interior-point methods at $O\big(\log\frac{1}{\epsilon}\big)$.
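
A minimal sketch (my own example) of projected gradient descent for an L-l.c.g. quadratic over the nonnegative orthant, where the projection onto Q is just a coordinate-wise clamp; the problem instance is an assumption for illustration.

import numpy as np

# minimize f(x) = 1/2 ||A x - b||^2 over Q = {x >= 0}
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
L = np.linalg.norm(A.T @ A, 2)             # Lipschitz constant of the gradient

grad = lambda x: A.T @ (A @ x - b)
project = lambda x: np.maximum(x, 0.0)     # Euclidean projection onto the orthant

x = np.zeros(2)
for k in range(200):
    x = project(x - (1.0 / L) * grad(x))   # x_{k+1} = Pi_Q(x_k - (1/L) grad f(x_k))
print(x)                                   # approximate constrained minimizer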

34 Preliminaries: optimization: projected gradient descent. Projected gradient descent: $x_{k+1} = \Pi_Q\big(x_k - \frac{1}{L}\nabla f(x_k)\big) = \operatorname{argmin}_{x\in Q}\ f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \frac{L}{2}\|x - x_k\|^2$. Property 1 (monotonic decrease): $f(x_{k+1}) \le f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2$ by $L$-l.c.g., and since $x_{k+1}$ minimizes this quadratic model over $Q$ while $x_k\in Q$, it is at most $f(x_k) + \langle\nabla f(x_k), x_k - x_k\rangle + \frac{L}{2}\|x_k - x_k\|^2 = f(x_k)$.

35 Preliminaries: optimization: projected gradient descent. Property 2 (projection optimality): $\big\langle x - x_{k+1},\ \big(x_k - \frac{1}{L}\nabla f(x_k)\big) - x_{k+1}\big\rangle \le 0$ for all $x\in Q$. Combining: $f(x_{k+1}) \le f(x_k) + \langle\nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2$ (by $L$-l.c.g.) $\le f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \frac{L}{2}\|x - x_k\|^2 - \frac{L}{2}\|x - x_{k+1}\|^2$ (by Property 2) $\le f(x) + \frac{L}{2}\|x - x_k\|^2 - \frac{L}{2}\|x - x_{k+1}\|^2$ (by convexity of $f$), for all $x\in Q$.

36 Preliminaries: optimization: projected gradient descent. Put together: $f(x_{k+1}) \le f(x) + \frac{L}{2}\|x - x_k\|^2 - \frac{L}{2}\|x - x_{k+1}\|^2$ for all $x\in Q$. Let $x = x^*$: $\epsilon_{k+1} \le \frac{L}{2}\|x^* - x_k\|^2 - \frac{L}{2}\|x^* - x_{k+1}\|^2$. Summing over iterations, $\sum_{i=1}^{k+1}\epsilon_i \le \frac{L}{2}\|x^* - x_0\|^2$, and since $\epsilon_i$ is monotonically decreasing, $(k+1)\epsilon_{k+1} \le \frac{L}{2}\|x^* - x_0\|^2$, i.e. $\epsilon_{k+1} \le \frac{L}{2(k+1)}\|x^* - x_0\|^2$.

37 Preliminaries: optimization: subgradient method. When the objective is continuous but not differentiable, the subgradient method for $\min_{x\in Q} f(x)$ iterates $x_{k+1} = \Pi_Q(x_k - s_k g_k)$, where $g_k\in\partial f(x_k)$ is an arbitrary subgradient. Rate of convergence: $O\big(\frac{1}{\epsilon^2}\big)$. Summary: non-smooth: $O\big(\frac{1}{\epsilon^2}\big)$; $L$-l.c.g.: $O\big(\frac{L}{\epsilon}\big)$; $L$-l.c.g. and $\sigma$-strongly convex: $\frac{\ln\frac{1}{\epsilon}}{-\ln(1-\frac{\sigma}{L})}$.
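
A sketch (mine) of the projected subgradient method on a non-smooth example: minimizing the mean hinge loss over an L2 ball, with the classical $1/\sqrt{k}$ step size; the data, radius and step-size constant are assumptions.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = np.sign(X @ rng.standard_normal(10))
R = 5.0                                          # Q = {w : ||w|| <= R}

def subgrad(w):
    # an (arbitrary) subgradient of (1/n) sum_i max{0, 1 - y_i <w, x_i>}
    active = (y * (X @ w)) < 1.0
    return -(X[active] * y[active][:, None]).sum(axis=0) / len(y)

def project(w):
    nrm = np.linalg.norm(w)
    return w if nrm <= R else (R / nrm) * w

w = np.zeros(10)
for k in range(1, 2001):
    w = project(w - (1.0 / np.sqrt(k)) * subgrad(w))    # s_k ~ 1/sqrt(k)
print(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))      # final objective value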

38 Outline: the problem from a machine learning perspective; preliminaries (convex analysis and gradient descent); Nesterov's optimal gradient method (lower bound of optimization, optimal gradient method); utilizing structure: composite optimization (smooth minimization, excessive gap minimization); conclusion.

39 Optimal gradient method: lower bound. Consider the set of $L$-l.c.g. functions. For any $\epsilon > 0$, there exists an $L$-l.c.g. function $f$ such that any first-order method takes at least $k = \Omega\big(\sqrt{L/\epsilon}\big)$ steps to ensure $\epsilon_k < \epsilon$. First-order method means $x_k \in x_0 + \operatorname{span}\{\nabla f(x_0), \ldots, \nabla f(x_{k-1})\}$. This is not saying that there exists a single $L$-l.c.g. function $f$ such that for all $\epsilon > 0$ any first-order method takes at least $\Omega\big(\sqrt{L/\epsilon}\big)$ steps to ensure $\epsilon_k < \epsilon$. Gap: recall the $O\big(\frac{L}{\epsilon}\big)$ upper bound of gradient descent; two possibilities: either the lower bound is loose, or gradient descent is not optimal.

40 Optimal gradient method: primitive Nesterov. Problem under consideration: $\min_w f(w)$ s.t. $w\in Q$, where $f$ is $L$-l.c.g. and $Q$ is convex. Big result: Nesterov proposed an algorithm attaining $O\big(\sqrt{L/\epsilon}\big)$. Not for free: it requires an oracle that projects a point onto $Q$ in the $L_2$ sense.

41 Primitive Nesterov. Construct quadratic functions $\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2$ and scalars $\lambda_k > 0$ with $\lambda_k\to 0$, together with points $x_k$, such that (1) $f(x_k) \le \phi_k^*$ and (2) $\phi_k(x) \le (1-\lambda_k)f(x) + \lambda_k\phi_0(x)$. Then (3) $f(x_k) \le \phi_k^* \le \phi_k(x^*) \le (1-\lambda_k)f(x^*) + \lambda_k\phi_0(x^*)$, so $f(x_k) - f(x^*) \le \lambda_k\big(\phi_0(x^*) - f(x^*)\big) \to 0$.

42 Primitive Nesterov. Construct quadratic functions $\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2$ and scalars $\lambda_k > 0$ with $\lambda_k\to 0$, together with points $x_k$, such that (1) $f(x_k) \le \phi_k^*$ and (2) $\phi_k(x) \le (1-\lambda_k)f(x) + \lambda_k\phi_0(x)$. Then (3) $f(x_k) \le \phi_k^* \le \phi_k(x^*) \le (1-\lambda_k)f(x^*) + \lambda_k\phi_0(x^*)$, so $f(x_k) - f(x^*) \le \lambda_k\big(\phi_0(x^*) - f(x^*)\big) \to 0$.

43 Primitive Nesterov: rate of convergence. Nesterov constructed, in a highly non-trivial way, the $\phi_k(x)$ and $\lambda_k$ such that $x_k$ has a closed form (a gradient-descent-like step) and $\lambda_k \le \frac{4L}{(2\sqrt{L} + k\sqrt{\gamma_0})^2}$, which with $f(x_k) - f(x^*) \le \lambda_k\big(\phi_0(x^*) - f(x^*)\big)$ gives the $O(1/k^2)$ rate. Furthermore, if $f$ is $\sigma$-strongly convex, then $\lambda_k \le \big(1 - \frac{\sigma}{L}\big)^k$. The rate of convergence depends solely on $\lambda_k$.

44 Primitive Nesterov: dealing with constraints. $x_k$ has a closed form given by a gradient-descent step. When constrained to a set $Q$, modify the step $x_{k+1} = x_k - \gamma\nabla f(x_k)$ to $x^Q_{k+1} = \Pi_Q\big(x_k - \gamma\nabla f(x_k)\big) = \operatorname{argmin}_{x\in Q}\|x - (x_k - \gamma\nabla f(x_k))\|$, and define the new "gradient" as the gradient mapping $g^Q_k := \frac{1}{\gamma}\big(x_k - x^Q_{k+1}\big)$. This new gradient keeps all the important properties of the gradient, and so the rate of convergence is preserved.

45 Primitive Nesterov: gradient mapping. Same construction: $x^Q_{k+1} = \Pi_Q\big(x_k - \gamma\nabla f(x_k)\big)$ with gradient mapping $g^Q_k := \frac{1}{\gamma}\big(x_k - x^Q_{k+1}\big)$. Expensive? (It requires one projection onto $Q$ per iteration.) The gradient mapping keeps all the important properties of the gradient, and so the rate of convergence is preserved.

46 Primitive Nesterov: summary. Problem: $\min_w f(w)$ s.t. $w\in Q$, where $f$ is $L$-l.c.g. and $Q$ is convex. Rate of convergence: $\sqrt{\frac{L}{\epsilon}}$ iterations if there is no strong convexity; $\frac{\ln\frac{1}{\epsilon}}{-\ln(1-\frac{\sigma}{L})}$ iterations if $f$ is $\sigma$-strongly convex.
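
A sketch of one standard unconstrained form of Nesterov's accelerated gradient (the common two-sequence / momentum variant, not necessarily the exact update of the original "primitive" construction); the quadratic test function is my own assumption.

import numpy as np

# f(x) = 1/2 x^T H x, L = largest eigenvalue of H
H = np.diag(np.linspace(1.0, 100.0, 50))
L = 100.0
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x = y = np.ones(50)
t = 1.0
for k in range(200):
    x_new = y - (1.0 / L) * grad(y)                  # gradient step at the extrapolated point
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    y = x_new + ((t - 1.0) / t_new) * (x_new - x)    # momentum / extrapolation
    x, t = x_new, t_new
print(f(x))   # decays at the O(1/k^2) rate, vs. O(1/k) for plain gradient descent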

47 Primitive Nesterov: example 1. $\min_{x,y}\ \frac{1}{2}x^2 + 2y^2$ ($\mu = 1$, $L = 4$) and $\min_{x\ge 0, y\ge 0}\ \frac{1}{2}(x+y)^2$ ($\mu = 0$, $L = 2$). (Plots compare the trajectories of Nesterov's method and gradient descent on the two problems.)

48 Extension: non-Euclidean norms. Remember that strong convexity and l.c.g. are defined w.r.t. some norm; so far we have implicitly used the Euclidean ($L_2$) norm. Some functions are strongly convex w.r.t. other norms: the negative entropy $\sum_i x_i\ln x_i$ is not l.c.g. w.r.t. the $L_2$ norm, but it is strongly convex w.r.t. the $L_1$ norm $\|x\|_1 = \sum_i|x_i|$. Can Nesterov's approach be extended to non-Euclidean norms?

49 Extension: non-Euclidean norms. Similarly, some functions are l.c.g. w.r.t. norms other than the $L_2$ norm: again, the negative entropy $\sum_i x_i\ln x_i$ is not l.c.g. w.r.t. the $L_2$ norm, but it is strongly convex w.r.t. the $L_1$ norm $\|x\|_1 = \sum_i|x_i|$. Can Nesterov's approach be extended to non-Euclidean norms?

50 Extension: non-Euclidean norms. Suppose the objective function $f$ is l.c.g. w.r.t. $\|\cdot\|$. Use a prox-function $d$ on $Q$ which is $\sigma$-strongly convex w.r.t. $\|\cdot\|$, with $\min_{x\in Q} d(x) = 0$ and $D := \max_{x\in Q} d(x)$.
Algorithm 1: Nesterov's algorithm for non-Euclidean norms.
Output: a sequence $\{y^k\}$ converging to the optimum at an $O(1/k^2)$ rate.
Initialize: set $x^0$ to a random value in $Q$.
for $k = 0, 1, 2, \ldots$ do
  Query the gradient of $f$ at the point $x^k$: $\nabla f(x^k)$.
  Find $y^k \leftarrow \operatorname{argmin}_{x\in Q}\ \langle\nabla f(x^k), x - x^k\rangle + \frac{L}{2}\|x - x^k\|^2$.
  Find $z^k \leftarrow \operatorname{argmin}_{x\in Q}\ \frac{L}{\sigma}d(x) + \sum_{i=0}^{k}\frac{i+1}{2}\langle\nabla f(x^i), x - x^i\rangle$.
  Update $x^{k+1} \leftarrow \frac{2}{k+3}z^k + \frac{k+1}{k+3}y^k$.
(I won't go into the details.)

51 Extension: non-Euclidean norms. (Slide repeated: same setup and Algorithm 1 as above.)
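
A sketch of Algorithm 1 specialized to the unconstrained Euclidean case, where $Q = \mathbb{R}^d$, $\|\cdot\|$ is the $L_2$ norm and $d(x) = \frac12\|x - x^0\|^2$ (so $\sigma = 1$); both argmins then have closed forms. The quadratic test function and starting point are my own assumptions.

import numpy as np

H = np.diag(np.linspace(1.0, 50.0, 20))
L = 50.0
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

dim = 20
x0 = np.ones(dim)
x = x0.copy()
weighted_grad_sum = np.zeros(dim)       # accumulates sum_i (i+1)/2 * grad f(x^i)
for k in range(300):
    g = grad(x)
    y = x - (1.0 / L) * g                             # argmin <g, z - x> + L/2 ||z - x||^2
    weighted_grad_sum += 0.5 * (k + 1) * g
    z = x0 - (1.0 / L) * weighted_grad_sum            # argmin L*d(z) + sum_i (i+1)/2 <grad f(x^i), z>
    x = (2.0 / (k + 3)) * z + ((k + 1.0) / (k + 3)) * y
print(f(y))   # y^k converges at the O(1/k^2) rate of slide 52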

52 Extension: non-Euclidean norms. Rate of convergence: $f(y^k) - f(x^*) \le \frac{4Ld(x^*)}{\sigma(k+1)(k+2)}$. Applications will be given later.

53 Immediate application: non-smooth functions. The objective function is not differentiable; suppose it is the Fenchel dual of some function $f$: $\min_x f^\star(x)$, where $f$ is defined on $Q$. Idea: smooth the non-smooth function. Add a small $\sigma$-strongly convex function $d$ to $f$: $f + d$ is $\sigma$-strongly convex, so $(f+d)^\star$ is $\frac{1}{\sigma}$-l.c.g.

54 Immediate application: non-smooth functions. $(f+\epsilon d)^\star(x)$ approximates $f^\star(x)$: if $0 \le d(u) \le D$ for $u\in Q$, then $f^\star(x) - \epsilon D \le (f+\epsilon d)^\star(x) \le f^\star(x)$. Proof: $(f+\epsilon d)^\star(x) = \max_u\ \langle u, x\rangle - f(u) - \epsilon d(u) \ge \max_u\ \langle u, x\rangle - f(u) - \epsilon D = f^\star(x) - \epsilon D$, and $(f+\epsilon d)^\star(x) \le \max_u\ \langle u, x\rangle - f(u) - 0 = f^\star(x)$.

55 Immediate application: non-smooth functions. $(f+\epsilon d)^\star(x)$ approximates $f^\star(x)$ well: if $d(u)\in[0, D]$ on $Q$, then $(f+\epsilon d)^\star(x) - f^\star(x) \in [-\epsilon D, 0]$. Algorithm (given a target precision $\epsilon$): fix $\hat\epsilon = \frac{\epsilon}{2D}$ and optimize the l.c.g. function $(f+\hat\epsilon d)^\star(x)$ to precision $\epsilon/2$. Rate of convergence: $\sqrt{\frac{L}{\epsilon}}$ with $L = \frac{1}{\hat\epsilon\sigma}$, i.e. $\sqrt{\frac{1}{\epsilon}\cdot\frac{1}{\hat\epsilon\sigma}} = \sqrt{\frac{2D}{\sigma\epsilon^2}} = \frac{1}{\epsilon}\sqrt{\frac{2D}{\sigma}}$.
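
A small illustration (mine) of the $(f+\epsilon d)^\star$ construction in 1-D: take $f^\star(x) = |x| = \max_{u\in[-1,1]} ux$ (so $f$ is the indicator of $[-1,1]$) and $d(u) = \frac12 u^2$ with $D = \frac12$; the smoothed function $\max_{u\in[-1,1]} ux - \frac{\epsilon}{2}u^2$ is a Huber-style function that is $\frac{1}{\epsilon}$-l.c.g. and within $\epsilon D$ of $|x|$.

import numpy as np

def smoothed_abs(x, eps):
    # (f + eps*d)^*(x) = max_{u in [-1,1]} u*x - eps/2 * u^2  (closed form: Huber-like)
    u = np.clip(x / eps, -1.0, 1.0)               # the maximizer
    return u * x - 0.5 * eps * u * u

eps = 0.1
xs = np.linspace(-2, 2, 9)
print(np.abs(xs))                                  # the non-smooth f^*(x) = |x|
print(smoothed_abs(xs, eps))                       # smooth approximation
print(np.max(np.abs(xs) - smoothed_abs(xs, eps)))  # <= eps * D = eps/2, as on the slide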

56 Outline: the problem from a machine learning perspective; preliminaries (convex analysis and gradient descent); Nesterov's optimal gradient method (lower bound of optimization, optimal gradient method); utilizing structure: composite optimization (smooth minimization, excessive gap minimization); conclusion.

57 Composite optimization. Many applications have objectives of the form $J(w) = f(w) + g^\star(Aw)$, where $f$ is convex on the region $E_1$ with norm $\|\cdot\|_1$ and $g$ is convex on the region $E_2$ with norm $\|\cdot\|_2$. Very useful in machine learning: $Aw$ corresponds to the linear model.

58 Composite optimization. Example: binary SVM. $J(w) = \underbrace{\tfrac{\lambda}{2}\|w\|^2}_{f(w)} + \underbrace{\min_{b\in\mathbb{R}}\ \tfrac{1}{n}\sum_{i=1}^n\big[1 - y_i(\langle x_i, w\rangle + b)\big]_+}_{g^\star(Aw)}$, where $A = -(y_1x_1, \ldots, y_nx_n)^\top$ and $g^\star$ is the dual of $g(\alpha) = -\sum_i\alpha_i$ over $Q_2 = \{\alpha\in[0, n^{-1}]^n : \sum_i y_i\alpha_i = 0\}$.

59 Composite optimization 1: smooth minimization. $J(w) = f(w) + g^\star(Aw)$. Assume only that $f$ is $M$-l.c.g. w.r.t. $\|\cdot\|_1$. Smooth $g^\star$ into $(g + \mu d_2)^\star$, where $d_2$ is $\sigma_2$-strongly convex w.r.t. $\|\cdot\|_2$; then $J_\mu(w) = f(w) + (g + \mu d_2)^\star(Aw)$ is $\big(M + \frac{1}{\mu\sigma_2}\|A\|^2_{1,2}\big)$-l.c.g. Apply Nesterov's method to $J_\mu(w)$.

60 Composite optimization 1: smooth minimization. Rate of convergence: finding an $\epsilon$-accurate solution costs $4\|A\|_{1,2}\sqrt{\frac{D_1D_2}{\sigma_1\sigma_2}}\,\frac{1}{\epsilon} + \sqrt{\frac{MD_1}{\sigma_1\epsilon}}$ steps, where $d_1$ is $\sigma_1$-strongly convex w.r.t. $\|\cdot\|_1$, $d_2$ is $\sigma_2$-strongly convex w.r.t. $\|\cdot\|_2$, $D_1 := \max_{w\in E_1} d_1(w)$, and $D_2 := \max_{\alpha\in E_2} d_2(\alpha)$.

61 Composite optimization 1: smooth minimization. Example: matrix game, $\operatorname{argmin}_{w\in\Delta_n}\ \underbrace{\langle c, w\rangle}_{f(w)} + \underbrace{\max_{\alpha\in\Delta_m}\{\langle Aw, \alpha\rangle + \langle b, \alpha\rangle\}}_{g^\star(Aw)}$, using Euclidean distances: $E_1 = \Delta_n$ with norm $\big(\sum_i w_i^2\big)^{1/2}$ and $d_1(w) = \frac{1}{2}\sum_i(w_i - n^{-1})^2$; $E_2 = \Delta_m$ with norm $\big(\sum_i\alpha_i^2\big)^{1/2}$ and $d_2(\alpha) = \frac{1}{2}\sum_i(\alpha_i - m^{-1})^2$; $\sigma_1 = \sigma_2 = 1$, $D_1 < 1$, $D_2 < 1$, and $\|A\|_{1,2} = \lambda^{1/2}_{\max}(A^\top A)$. Result: $f(w^k) - f(w^*) \le \frac{4\lambda^{1/2}_{\max}(A^\top A)}{k+1}$, which may scale with $O(nm)$.

62 Composite optimization 1: smooth minimization. Example: matrix game, $\operatorname{argmin}_{w\in\Delta_n}\ \underbrace{\langle c, w\rangle}_{f(w)} + \underbrace{\max_{\alpha\in\Delta_m}\{\langle Aw, \alpha\rangle + \langle b, \alpha\rangle\}}_{g^\star(Aw)}$, using entropy distances: $E_1 = \Delta_n$ with norm $\sum_i|w_i|$ and $d_1(w) = \ln n + \sum_i w_i\ln w_i$; $E_2 = \Delta_m$ with norm $\sum_i|\alpha_i|$ and $d_2(\alpha) = \ln m + \sum_i\alpha_i\ln\alpha_i$; $\sigma_1 = \sigma_2 = 1$, $D_1 = \ln n$, $D_2 = \ln m$, and $\|A\|_{1,2} = \max_{i,j}|A_{i,j}|$. Result: $f(w^k) - f(w^*) \le \frac{4(\ln n\,\ln m)^{1/2}}{k+1}\max_{i,j}|A_{i,j}|$.
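
With the entropy prox-function on the simplex, the smoothed max term has a closed form: $\max_{\alpha\in\Delta_m}\{\langle z,\alpha\rangle - \mu d_2(\alpha)\} = \mu\ln\big(\frac{1}{m}\sum_i e^{z_i/\mu}\big)$, a log-sum-exp that approximates $\max_i z_i$ to within $\mu\ln m$. A small sketch (mine, with random placeholder A, b, c) of the smoothed matrix-game objective and its gradient, the two quantities needed when applying Nesterov's method.

import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def smoothed_game_objective(w, A, b, c, mu):
    """<c, w> + entropy-smoothed max over the simplex of <Aw + b, alpha>."""
    z = A @ w + b
    m = len(z)
    smooth_max = mu * (logsumexp(z / mu) - np.log(m))   # mu * ln( (1/m) sum exp(z_i/mu) )
    alpha = np.exp(z / mu - logsumexp(z / mu))          # soft-max weights
    grad = c + A.T @ alpha                              # gradient of the smoothed objective
    return np.dot(c, w) + smooth_max, grad

rng = np.random.default_rng(0)
A, b, c = rng.standard_normal((30, 10)), rng.standard_normal(30), rng.standard_normal(10)
w = np.full(10, 0.1)                                    # a point on the simplex Delta_10
val, g = smoothed_game_objective(w, A, b, c, mu=0.01)
print(val, np.dot(c, w) + np.max(A @ w + b))            # close: gap <= mu * ln(m)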

63 Composite optimization 1: smooth minimization. Disadvantages: the smoothing must be fixed beforehand using the prescribed accuracy $\epsilon$, and there is no convergence criterion because the true minimum is unknown.

64 Composite optimization 2: excessive gap minimization. A primal-dual method that easily upper-bounds the duality gap. Idea: assume the objective function takes the form $J(w) = f(w) + g^\star(Aw)$ and utilize the adjoint form $D(\alpha) = -g(\alpha) - f^\star(-A^\top\alpha)$. Relations: $J(w) \ge D(\alpha)$ for all $w, \alpha$, and $\inf_{w\in E_1} J(w) = \sup_{\alpha\in E_2} D(\alpha)$.

65 Composite optimization 2: excessive gap minimization. Sketch of the idea: assume $f$ is $L_f$-l.c.g. and $g$ is $L_g$-l.c.g. Smooth both $f^\star$ and $g^\star$ by prox-functions $d_1, d_2$: $J_{\mu_2}(w) = f(w) + (g + \mu_2 d_2)^\star(Aw)$ and $D_{\mu_1}(\alpha) = -g(\alpha) - (f + \mu_1 d_1)^\star(-A^\top\alpha)$.

66 Composite optimization 2: excessive gap minimization. Sketch of the idea: maintain two point sequences $\{w_k\}$ and $\{\alpha_k\}$ and two regularization sequences $\{\mu_1(k)\}$ and $\{\mu_2(k)\}$ such that the excessive gap condition $J_{\mu_2(k)}(w_k) \le D_{\mu_1(k)}(\alpha_k)$ holds, with $\mu_1(k)\to 0$ and $\mu_2(k)\to 0$.

67 Composite optimization 2: excessive gap minimization. Maintaining $J_{\mu_2(k)}(w_k) \le D_{\mu_1(k)}(\alpha_k)$ poses three challenges: (1) how to efficiently find an initial $w_1, \alpha_1, \mu_1(1), \mu_2(1)$ satisfying the excessive gap condition; (2) given $w_k, \alpha_k, \mu_1(k), \mu_2(k)$ and the new $\mu_1(k+1)$ and $\mu_2(k+1)$, how to efficiently find $w_{k+1}$ and $\alpha_{k+1}$; (3) how to anneal $\mu_1(k)$ and $\mu_2(k)$ (otherwise one step would be enough). Solution: gradient mapping and Bregman projection (very cool).

68 Composite optimization 2: excessive gap minimization. Rate of convergence: $J(w_k) - D(\alpha_k) \le \frac{4\|A\|_{1,2}}{k+1}\sqrt{\frac{D_1D_2}{\sigma_1\sigma_2}}$. If $f$ is $\sigma$-strongly convex, there is no need to add a prox-function to $f$ (take $\mu_1(k)\equiv 0$), and $J(w_k) - D(\alpha_k) \le \frac{4D_2}{\sigma_2\,k(k+1)}\Big(\frac{\|A\|^2_{1,2}}{\sigma} + L_g\Big)$.

69 Composite optimization 2: excessive gap minimization. Example: binary SVM. $J(w) = \underbrace{\tfrac{\lambda}{2}\|w\|^2}_{f(w)} + \underbrace{\min_{b\in\mathbb{R}}\ \tfrac{1}{n}\sum_{i=1}^n\big[1 - y_i(\langle x_i, w\rangle + b)\big]_+}_{g^\star(Aw)}$, with $A = -(y_1x_1, \ldots, y_nx_n)^\top$ and $g^\star$ the dual of $g(\alpha) = -\sum_i\alpha_i$ over $E_2 = \{\alpha\in[0, n^{-1}]^n : \sum_i y_i\alpha_i = 0\}$. Adjoint form: $D(\alpha) = \sum_i\alpha_i - \frac{1}{2\lambda}\alpha^\top AA^\top\alpha$.

70 Composite optimization 2: convergence rate for SVM. Theorem: running on the SVM for $k$ iterations, $J(w_k) - D(\alpha_k) \le \frac{2L}{(k+1)(k+2)\,n}$, where $L = \frac{1}{\lambda}\|A\|^2 = \frac{1}{\lambda}\|(y_1x_1, \ldots, y_nx_n)\|^2 \le \frac{nR^2}{\lambda}$ (assuming $\|x_i\| \le R$). Final conclusion: $J(w_k) - D(\alpha_k) \le \varepsilon$ as long as $k > O\big(\frac{R}{\sqrt{\lambda\varepsilon}}\big)$.

71 Composite optimization 2: projection for SVM. Efficient $O(n)$-time projection onto $E_2 = \{\alpha\in[0, n^{-1}]^n : \sum_i y_i\alpha_i = 0\}$. The projection leads to a singly linearly constrained QP: $\min\ \sum_{i=1}^n(\alpha_i - m_i)^2$ s.t. $l_i \le \alpha_i \le u_i$ for all $i\in[n]$ and $\sum_{i=1}^n\sigma_i\alpha_i = z$. Key tool: median finding, which takes $O(n)$ time.
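
The slides use an $O(n)$ median-finding scheme; as a simpler sketch of the same singly linearly constrained projection, one can search for the multiplier of the equality constraint by bisection (roughly $O(n\log\frac{1}{\mathrm{tol}})$ time). This is my own illustrative variant, not the algorithm from the talk, and the random test data is an assumption.

import numpy as np

def project_linear_box(m, sigma, z, l, u, iters=100):
    """Euclidean projection onto {a : l <= a <= u, <sigma, a> = z}, by bisection
    on the multiplier of the equality constraint (simpler than the O(n) median method)."""
    def alpha(nu):
        return np.clip(m - nu * sigma, l, u)
    def h(nu):                                   # monotone non-increasing in nu
        return sigma @ alpha(nu) - z
    # bracket the root widely enough to saturate every clip
    scale = (np.abs(m).max() + np.abs(l).max() + np.abs(u).max() + 1.0) / max(np.abs(sigma).min(), 1e-12)
    lo, hi = -scale, scale
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    return alpha(0.5 * (lo + hi))

# SVM case: l = 0, u = 1/n, sigma = y, z = 0
rng = np.random.default_rng(0)
n = 8
y = rng.choice([-1.0, 1.0], size=n)
m = rng.standard_normal(n)
a = project_linear_box(m, y, 0.0, np.zeros(n), np.full(n, 1.0 / n))
print(a, y @ a)     # feasible point with <y, a> approximately 0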

72 Automatic estimation of the Lipschitz constant $L$: use geometric scaling (backtracking on $L$); this does not affect the rate of convergence.

73 Conclusion. Nesterov's method attains the lower bound $O\big(\sqrt{L/\epsilon}\big)$ for $L$-l.c.g. objectives, and a linear rate for l.c.g. and strongly convex objectives. Composite optimization attains the rate of the "nice" part of the function. Constraints are handled via the gradient mapping and Bregman projection, which essentially do not change the convergence rate. We expect wide applications in machine learning (note: not in terms of generalization performance).
