Nesterov's Optimal Gradient Methods
1 Nesterov's Optimal Gradient Methods. Xinhua Zhang, Australian National University / NICTA. [Photo: Yurii Nesterov]
2 Outline. The problem from a machine learning perspective. Preliminaries: convex analysis and gradient descent. Nesterov's optimal gradient method: lower bound of optimization; optimal gradient method. Utilizing structure, composite optimization: smooth minimization; excessive gap minimization. Conclusion.
4 The problem. Many machine learning problems have the form min_w J(w) := λΩ(w) + R_emp(w), where w is the weight vector, {x_i, y_i}_{i=1}^n is the training data, and R_emp(w) := (1/n) Σ_{i=1}^n l(x_i, y_i; w). The loss l(x, y; w) is convex and non-negative; it can be non-smooth (and in some applications even non-convex, though we assume convexity here). Ω(w) is a convex and non-negative regularizer.
5 The problem: examples. Binary SVM in constrained form: min_w (λ/2)||w||² + (1/n) Σ_{i=1}^n ξ_i, s.t. ξ_i ≥ 1 − y_i⟨w, x_i⟩ and ξ_i ≥ 0 for all 1 ≤ i ≤ n. Eliminating ξ_i = max{0, 1 − y_i⟨w, x_i⟩} gives the unconstrained hinge-loss form min_w (λ/2)||w||² + (1/n) Σ_i max{0, 1 − y_i⟨w, x_i⟩}. [Plot: the hinge loss max{0, 1 − x}.] Models, each with objective λΩ(w) + R_emp(w): linear SVM, (λ/2)||w||² + (1/n) Σ_i max{0, 1 − y_i⟨w, x_i⟩}; ℓ1 logistic regression, λ||w||₁ + (1/n) Σ_i log(1 + exp(−y_i⟨w, x_i⟩)); ε-insensitive loss, (λ/2)||w||² + (1/n) Σ_i max{0, |y_i − ⟨w, x_i⟩| − ε}; here ||w||₁ = Σ_i |w_i|.
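As a quick sanity check, the regularized hinge-loss objective above is easy to evaluate directly. A minimal sketch on synthetic data (the function and variable names are my own, not from the slides):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective: (lam/2)||w||^2 + mean hinge loss."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.where(rng.normal(size=20) > 0, 1.0, -1.0)
w = np.zeros(3)
print(svm_objective(w, X, y, lam=0.1))  # zero vector: every hinge term is 1, so 1.0
```

At w = 0 every margin is 0, so each hinge term equals 1 and the regularizer vanishes, giving objective 1.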
6 The problem: more examples. Lasso: argmin_w λ||w||₁ + ||Aw − b||₂². Multi-task learning: argmin_W λ||W||_tr + Σ_{t=1}^T ||X_t w_t − b_t||₂², or argmin_W λ||W||_{1,∞} + Σ_{t=1}^T ||X_t w_t − b_t||₂². Matrix game: argmin_{w∈Δ_d} ⟨c, w⟩ + max_{u∈Δ_n} {⟨Aw, u⟩ + ⟨b, u⟩}. Entropy-regularized LPBoost: argmin_{w∈Δ_d} λΔ(w; w⁰) + max_{u∈Δ_n} ⟨Aw, u⟩.
7 The problem: Lagrange dual. Binary SVM: min_α (1/(2λ)) α⊤Qα − Σ_i α_i, s.t. α_i ∈ [0, n⁻¹] and Σ_i y_i α_i = 0, where Q_ij = y_i y_j ⟨x_i, x_j⟩. Entropy-regularized LPBoost: min_α λ ln(Σ_d w⁰_d exp(λ⁻¹ Σ_{i=1}^n α_i A_{i,d})), s.t. α_i ∈ [0, 1] and Σ_i α_i = 1.
8 The problem: summary. min_{w∈Q} J(w), where J is convex but might be non-smooth, Q is a (simple) convex set, and J might have composite form. Solver: an iterative method producing w_0, w_1, w_2, …; we want ε_k := J(w_k) − J(w*) to decrease to 0 quickly, where w* := argmin_{w∈Q} J(w). We only discuss optimization in this session, not generalization bounds.
9 What makes a good optimizer? Find an ε-approximate solution, i.e. some w_k with J(w_k) ≤ min_w J(w) + ε, taking as few steps as possible: the error should decay like 1/k², 1/k, or e^{−k}. Each iteration should cost a reasonable amount of work (this depends on n and other condition parameters). The method should be general purpose and parallelizable (low sequential processing), and should quit when done (measurable convergence criteria).
10 The problem: rate of convergence. lim_{k→∞} ε_{k+1}/ε_k = 0 is a superlinear rate (e.g. ε_k = e^{−e^k}); a limit in (0, 1) is a linear rate (ε_k = e^{−k}); a limit of 1 is a sublinear rate (ε_k = 1/k). Two views, used interchangeably: fix the step index k and upper-bound min_{1≤t≤k} ε_t, or fix the precision ε and ask how many steps k are needed for min_{1≤t≤k} ε_t < ε. Typical step counts: 1/ε², 1/ε, 1/√ε, log(1/ε), log log(1/ε).
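The gap between these step counts is dramatic. A tiny sketch (the helper names are illustrative, not from the slides) comparing a sublinear 1/ε rate with a linear log(1/ε) rate:

```python
import math

# A sublinear method with eps_k ~ 1/k needs about 1/eps steps;
# a linear method with eps_k ~ e^{-k} needs about log(1/eps) steps.
def steps_sublinear(eps):
    return math.ceil(1.0 / eps)

def steps_linear(eps):
    return math.ceil(math.log(1.0 / eps))

print(steps_sublinear(1e-3), steps_linear(1e-3))  # 1000 vs 7
```

For three digits of accuracy the linear-rate method is already over a hundred times cheaper, and the gap widens as ε shrinks.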
11 The problem: collection of results (steps to reach precision ε).
Objective function                          | Gradient descent | Nesterov     | Lower bound
Smooth                                      | O(1/ε)           | O(√(1/ε))    | O(√(1/ε))
Smooth and very convex                      | O(log(1/ε))      | O(log(1/ε))  | O(log(1/ε))
Composite: smooth + (dual of smooth)        |                  | O(1/ε)       |
Composite: (very convex) + (dual of smooth) |                  | O(√(1/ε))    |
12 Outline. The problem from a machine learning perspective. Preliminaries: convex analysis and gradient descent. Nesterov's optimal gradient method: lower bound of optimization; optimal gradient method. Utilizing structure, composite optimization: smooth minimization; excessive gap minimization. Conclusion.
13 Preliminaries, convex analysis: convex functions. A function f is convex iff for all x, y and λ ∈ (0, 1): f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y).
15 Preliminaries, convex analysis: strong convexity. A function f is called σ-strongly convex wrt a norm ||·|| iff f(x) − (σ/2)||x||² is convex; equivalently, for all x, y and λ ∈ (0, 1): f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) − (σ/2)λ(1−λ)||x − y||².
16 Strong convexity: first-order equivalent condition. f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/2)||x − y||² for all x, y.
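The first-order condition is easy to verify numerically on a quadratic, where σ is exactly the smallest Hessian eigenvalue. A small sketch (setup of my own choosing):

```python
import numpy as np

# f(x) = 0.5 x^T A x is sigma-strongly convex (L2 norm) with sigma = lambda_min(A).
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)              # symmetric positive definite, eigenvalues >= 1
sigma = np.linalg.eigvalsh(A).min()
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Check f(y) >= f(x) + <grad f(x), y - x> + (sigma/2)||y - x||^2 at random pairs.
for _ in range(100):
    x, y = rng.normal(size=4), rng.normal(size=4)
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * sigma * np.dot(y - x, y - x)
    assert lhs >= rhs - 1e-9
print("first-order strong convexity verified")
```

For this quadratic the inequality is tight in the direction of the smallest eigenvector, which is why σ = λ_min(A) is the best possible constant.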
18 Strong convexity: second-order condition. ⟨∇²f(x)y, y⟩ ≥ σ||y||² for all x, y. If ||·|| is the Euclidean norm, this reads ∇²f(x) ⪰ σI. Strong convexity lower-bounds the rate of change of the gradient.
19 Preliminaries, convex analysis: Lipschitz continuous gradient. First, Lipschitz continuity: there exists L > 0 with |f(x) − f(y)| ≤ L||x − y|| for all x, y. Stronger than continuity, weaker than differentiability; it upper-bounds the rate of change of f.
20 Lipschitz continuous gradient. The gradient of f is Lipschitz continuous (so f must be differentiable), written L-l.c.g., iff ||∇f(x) − ∇f(y)|| ≤ L||x − y|| for all x, y. This implies the quadratic upper bound f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)||x − y||² for all x, y.
21 Lipschitz continuous gradient: second-order condition. ⟨∇²f(x)y, y⟩ ≤ L||y||² for all x, y; if ||·|| is the L2 norm, ∇²f(x) ⪯ LI.
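The quadratic upper bound (the "descent lemma") can be checked numerically. A sketch using the logistic loss, whose second derivative is bounded by 1/4, so it is (1/4)-l.c.g.:

```python
import numpy as np

# f(x) = log(1 + e^x); f''(x) = s(1-s) <= 1/4 with s = sigmoid(x), so L = 1/4.
f = lambda x: np.log1p(np.exp(x))
df = lambda x: 1.0 / (1.0 + np.exp(-x))
L = 0.25

# Check f(y) <= f(x) + f'(x)(y - x) + (L/2)(y - x)^2 on a grid of pairs.
xs = np.linspace(-5, 5, 201)
for x in xs:
    bound = f(x) + df(x) * (xs - x) + 0.5 * L * (xs - x) ** 2
    assert np.all(f(xs) <= bound + 1e-12)
print("descent lemma verified")
```

The bound is exactly the model that gradient descent with step 1/L minimizes at each iteration, which is why that step size guarantees decrease.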
22 Preliminaries, convex analysis: Fenchel dual. The Fenchel dual of a function f is f*(s) = sup_x ⟨s, x⟩ − f(x). Properties: f** = f if f is convex and closed; if f is σ-strongly convex then f* is (1/σ)-l.c.g. on R^d; if f is L-l.c.g. on R^d then f* is (1/L)-strongly convex.
23 Fenchel dual: geometric picture. For a slope s, the affine function ⟨s, x⟩ − f*(s) is the tightest lower bound on f with that slope; it touches the graph of f at the point x where s = ∇f(x), and there f*(s) = ⟨s, x⟩ − f(x).
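The supremum in the definition can be approximated on a fine grid. A sketch using f(x) = x²/2, which is its own Fenchel dual (f*(s) = s²/2):

```python
import numpy as np

def fenchel(f, s, grid):
    """Approximate f*(s) = sup_x s*x - f(x) by maximizing over a grid."""
    return np.max(s * grid - f(grid))

grid = np.linspace(-10.0, 10.0, 100001)
f = lambda x: 0.5 * x ** 2
for s in (-2.0, 0.5, 3.0):
    # f is self-dual: f*(s) = 0.5 s^2.
    assert abs(fenchel(f, s, grid) - 0.5 * s ** 2) < 1e-6
print("self-duality of 0.5*x^2 verified")
```

The maximizer is x = s, matching the slide's picture where the supporting slope s equals ∇f(x) at the touching point.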
24 Preliminaries, convex analysis: subgradient. Generalize the gradient to non-differentiable functions. Idea: a tangent plane lying below the graph of f.
25 Subgradient. μ is called a subgradient of f at x if f(x′) ≥ f(x) + ⟨x′ − x, μ⟩ for all x′. All such μ comprise the subdifferential of f at x, written ∂f(x).
26 The subgradient is unique, and equals the gradient, exactly when f is differentiable at x.
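The defining inequality is easy to test for the classic non-smooth example f(x) = |x|, where sign(x) is a valid subgradient everywhere (with 0 a valid choice at x = 0):

```python
import numpy as np

def subgrad_abs(x):
    # A valid subgradient of f(x) = |x|: sign(x), choosing 0 at x = 0.
    return np.sign(x)

# Check the subgradient inequality f(x') >= f(x) + <x' - x, mu> on a grid.
xs = np.linspace(-2.0, 2.0, 81)
for x in xs:
    mu = subgrad_abs(x)
    assert np.all(np.abs(xs) >= np.abs(x) + mu * (xs - x) - 1e-12)
print("subgradient inequality holds for |x|")
```

At x = 0 the subdifferential is the whole interval [−1, 1]; any choice in it satisfies the inequality, illustrating non-uniqueness at kinks.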
27 Preliminaries, optimization: gradient descent. x_{k+1} = x_k − s_k ∇f(x_k), s_k ≥ 0. Suppose f is both σ-strongly convex and L-l.c.g. Then ε_k := f(x_k) − f(x*) satisfies ε_k ≤ (1 − σ/L)^k ε_0. Key idea: the norm of the gradient upper-bounds how far we are from optimal, and lower-bounds how much progress one step can make.
28 Gradient descent: upper bound on distance from optimal. With x_{k+1} = x_k − (1/L)∇f(x_k): since the gradient grows with slope at least σ (the shaded area under ∇f is at least the triangle with slope σ), f(x_k) − f(x*) ≤ (1/(2σ))||∇f(x_k)||².
29 Gradient descent: lower bound on progress per step. Since the gradient changes with slope at most L (triangle with slope L), f(x_k) − f(x_{k+1}) ≥ (1/(2L))||∇f(x_k)||².
30 Gradient descent: putting things together. Distance to optimal: 2σ(f(x_k) − f(x*)) ≤ ||∇f(x_k)||². Progress: ||∇f(x_k)||² ≤ 2L(f(x_k) − f(x_{k+1})). Combining the two: ε_{k+1} = f(x_{k+1}) − f(x*) ≤ (1 − σ/L)(f(x_k) − f(x*)) = (1 − σ/L) ε_k.
32 Two remaining questions: what if σ = 0? And what if there is a constraint?
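The linear rate above can be observed directly on a strongly convex quadratic. A sketch with a diagonal Hessian so σ and L are the extreme eigenvalues:

```python
import numpy as np

# Gradient descent with step 1/L on f(x) = 0.5 x^T A x, A = diag(1, 4):
# sigma = 1, L = 4, and the guarantee is eps_k <= (1 - sigma/L)^k eps_0.
A = np.diag([1.0, 4.0])
sigma, L = 1.0, 4.0
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0])
eps = [f(x)]                      # f(x*) = 0, so eps_k = f(x_k)
for _ in range(20):
    x = x - (1.0 / L) * (A @ x)   # gradient step
    eps.append(f(x))
for k, e in enumerate(eps):
    assert e <= (1 - sigma / L) ** k * eps[0] + 1e-12
print(eps[20])
```

The bound is driven by the worst-conditioned direction (eigenvalue σ), while well-conditioned directions converge much faster, as the actual errors show.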
33 Preliminaries, optimization: projected gradient descent. If the objective is L-l.c.g. but not strongly convex, and constrained to a convex set Q: x_{k+1} = Π_Q(x_k − (1/L)∇f(x_k)) = argmin_{x̂∈Q} ||x̂ − (x_k − (1/L)∇f(x_k))|| = argmin_{x∈Q} f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (L/2)||x − x_k||². Rate of convergence: O(L/ε). Compare with Newton and interior-point methods at O(log(1/ε)).
34 Projected gradient descent, property 1: monotonic decrease. f(x_{k+1}) ≤ f(x_k) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)||x_{k+1} − x_k||² (by L-l.c.g.) ≤ f(x_k) + ⟨∇f(x_k), x_k − x_k⟩ + (L/2)||x_k − x_k||² = f(x_k), where the second inequality holds because x_{k+1} minimizes the quadratic model over Q and x_k ∈ Q.
35 Projected gradient descent, property 2: ⟨x − x_{k+1}, (x_k − (1/L)∇f(x_k)) − x_{k+1}⟩ ≤ 0 for all x ∈ Q (optimality of the projection). Combining L-l.c.g. with property 2: f(x_{k+1}) ≤ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (L/2)||x − x_k||² − (L/2)||x − x_{k+1}||² for all x ∈ Q; then by convexity of f: f(x_{k+1}) ≤ f(x) + (L/2)||x − x_k||² − (L/2)||x − x_{k+1}||² for all x ∈ Q.
36 Projected gradient descent: put together. From f(x_{k+1}) ≤ f(x) + (L/2)||x − x_k||² − (L/2)||x − x_{k+1}||² for all x ∈ Q, let x = x*: ε_{k+1} ≤ (L/2)||x* − x_k||² − (L/2)||x* − x_{k+1}||². Summing these inequalities telescopes: Σ_{i=1}^{k+1} ε_i ≤ (L/2)||x* − x_0||². Since the ε_i are monotonically decreasing, (k+1)ε_{k+1} ≤ (L/2)||x* − x_0||², hence ε_{k+1} ≤ L||x* − x_0||²/(2(k+1)).
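A minimal sketch of projected gradient descent on a box constraint, where the Euclidean projection is just coordinate-wise clipping (problem data of my own choosing):

```python
import numpy as np

# f(x) = 0.5||x - c||^2 over the box Q = [0,1]^2, with c outside the box.
# f is 1-l.c.g., and the optimum is the projection of c onto the box.
c = np.array([2.0, -1.0])
L = 1.0
proj = lambda x: np.clip(x, 0.0, 1.0)   # Euclidean projection onto Q

x = np.array([0.5, 0.5])
for _ in range(100):
    x = proj(x - (1.0 / L) * (x - c))   # projected gradient step
x_star = proj(c)                        # [1, 0]
assert np.allclose(x, x_star)
print(x)
```

For this particular f the step x − (1/L)∇f(x) lands exactly on c, so a single projected step already reaches the optimum; the loop just illustrates the iteration.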
37 Preliminaries, optimization: subgradient method. When the objective is continuous but not differentiable, use x_{k+1} = Π_Q(x_k − s_k g_k) with g_k ∈ ∂f(x_k) an arbitrary subgradient. Rate of convergence: O(1/ε²). Summary so far (steps to precision ε): non-smooth, O(1/ε²); L-l.c.g., O(L/ε); L-l.c.g. and σ-strongly convex, ln(1/ε) / (−ln(1 − σ/L)).
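A sketch of the subgradient method on the non-smooth f(x) = |x| with the classic diminishing step s_k = 1/√(k+1); since the method is not a descent method, one tracks the best value seen:

```python
import numpy as np

# Subgradient method on f(x) = |x| (Q = R, so no projection needed).
f = lambda x: abs(x)
x = 5.0
best = f(x)
for k in range(10000):
    g = np.sign(x)                  # a subgradient of |x|
    x = x - g / np.sqrt(k + 1)      # diminishing step s_k = 1/sqrt(k+1)
    best = min(best, f(x))
print(best)
```

The iterates first march toward 0, then oscillate around it with amplitude comparable to the current step size; `best` shrinks only sublinearly, consistent with the O(1/ε²) step count.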
38 Outline. The problem from a machine learning perspective. Preliminaries: convex analysis and gradient descent. Nesterov's optimal gradient method: lower bound of optimization; optimal gradient method. Utilizing structure, composite optimization: smooth minimization; excessive gap minimization. Conclusion.
39 Optimal gradient method: lower bound. Consider the set of L-l.c.g. functions. For any ε > 0 there exists an L-l.c.g. function f such that any first-order method takes at least on the order of k = √(L/ε) steps to ensure ε_k < ε. "First-order method" means x_k ∈ x_0 + span{∇f(x_0), …, ∇f(x_{k−1})}. This is not saying: there exists a single L-l.c.g. function f such that for all ε > 0 any first-order method takes at least k = O(√(L/ε)) steps to ensure ε_k < ε. Recall the O(L/ε) upper bound of gradient descent; there is a gap, with two possibilities: a sharper lower bound, or a better algorithm.
40 Optimal gradient method: primitive Nesterov. Problem under consideration: min_w f(w), w ∈ Q, where f is L-l.c.g. and Q is convex. Big result: he proposed an algorithm attaining the √(L/ε) lower bound. Not for free: it requires an oracle to project a point onto Q in the L2 sense.
41 Primitive Nesterov: estimate sequences. Construct quadratic functions φ_k(x) = φ*_k + (γ_k/2)||x − v_k||² and coefficients λ_k > 0 with λ_k → 0, together with points x_k, such that (1) f(x_k) ≤ φ*_k, and (2) φ_k(x) ≤ (1 − λ_k)f(x) + λ_k φ_0(x). Then f(x_k) ≤ φ*_k ≤ φ_k(x*) ≤ (1 − λ_k)f(x*) + λ_k φ_0(x*), so f(x_k) − f(x*) ≤ λ_k(φ_0(x*) − f(x*)) → 0.
43 Primitive Nesterov: rate of convergence. Nesterov constructed, in a highly non-trivial way, the φ_k(x) and λ_k such that x_k has a closed form (a gradient-descent-like step) and λ_k ≤ 4L/(2√L + k√γ_0)², so f(x_k) − f(x*) ≤ λ_k(φ_0(x*) − f(x*)) = O(L/k²). Furthermore, if f is σ-strongly convex, then λ_k ≤ (1 − √(σ/L))^k. The rate of convergence depends sheerly on λ_k.
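A commonly used closed form of the accelerated method is the FISTA-style momentum schedule, shown here as a sketch (this is one standard instantiation, not the estimate-sequence construction from the slide):

```python
import numpy as np

# Nesterov's accelerated gradient on f(x) = 0.5 x^T A x, L = lambda_max(A) = 100.
A = np.diag(np.linspace(1.0, 100.0, 10))
L = 100.0
f = lambda x: 0.5 * x @ A @ x
x = np.ones(10)
y = x.copy()
t = 1.0
fx = [f(x)]
for _ in range(200):
    x_new = y - (1.0 / L) * (A @ y)               # gradient step at extrapolated point
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # momentum / extrapolation
    x, t = x_new, t_new
    fx.append(f(x))

# O(1/k^2) guarantee: f(x_k) - f* <= 2 L ||x_0 - x*||^2 / (k+1)^2  (here x* = 0).
for k in range(1, 201):
    assert fx[k] <= 2.0 * L * 10.0 / (k + 1) ** 2 + 1e-9
print(fx[200])
```

Note the iterates need not decrease monotonically; only the O(1/k²) envelope is guaranteed, which is exactly what the assertion checks.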
44 Primitive Nesterov: dealing with constraints. Unconstrained, x_k has a closed form via a gradient step x_{k+1} = x_k − h∇f(x_k) with step size h. When constrained to a set Q, modify it to x^Q_{k+1} = Π_Q(x_k − h∇f(x_k)) = argmin_{x∈Q} ||x − (x_k − h∇f(x_k))||, and define the gradient mapping g^Q_k := (1/h)(x_k − x^Q_{k+1}). This new "gradient" keeps all the important properties of the gradient, and also keeps the rate of convergence.
45 Primitive Nesterov: gradient mapping, continued. One caveat: the projection Π_Q, and hence the gradient mapping, can be expensive for complicated Q.
46 Primitive Nesterov: summary. min_w f(w), w ∈ Q, where f is L-l.c.g. and Q is convex. Rate of convergence: √(L/ε) steps if there is no strong convexity; ln(1/ε) / (−ln(1 − √(σ/L))) steps if f is σ-strongly convex.
47 Primitive Nesterov: examples. Two test problems: min_{x,y} (1/2)x² + 2y² (μ = 1, L = 4) and min_{x≥0, y≥0} (1/2)(x + y)² (μ = 0, L = 2). [Plots: iterates of Nesterov's method vs. gradient descent on each problem.]
48 Extension: non-Euclidean norms. Remember that strong convexity and l.c.g. are defined with respect to some norm, and we have implicitly used the Euclidean (L2) norm. Some functions behave better under other norms: for example, the negative entropy Σ_i x_i ln x_i is strongly convex with respect to the L1 norm ||x||₁ = Σ_i |x_i| on the simplex, but not (with a dimension-independent constant) with respect to the L2 norm. Can Nesterov's approach be extended to non-Euclidean norms?
50 Extension: non-Euclidean norms. Suppose the objective f is l.c.g. wrt ||·||. Use a prox-function d on Q which is σ-strongly convex wrt ||·||, with min_{x∈Q} d(x) = 0 and D := max_{x∈Q} d(x).
Algorithm 1: Nesterov's algorithm for non-Euclidean norms.
Output: a sequence {y^k} converging to the optimum at O(1/k²) rate.
Initialize: set x^0 to a random value in Q.
for k = 0, 1, 2, … do
  Query the gradient of f at x^k: ∇f(x^k).
  Find y^k ← argmin_{x∈Q} ⟨∇f(x^k), x − x^k⟩ + (L/2)||x − x^k||².
  Find z^k ← argmin_{x∈Q} (L/σ)d(x) + Σ_{i=0}^k ((i+1)/2)⟨∇f(x^i), x − x^i⟩.
  Update x^{k+1} ← (2/(k+3)) z^k + ((k+1)/(k+3)) y^k.
(I won't mention details.)
52 Extension, non-Euclidean norms: rate of convergence. f(y^k) − f(x*) ≤ 4Ld(x*)/(σ(k+1)(k+2)). Applications will be given later.
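In the Euclidean special case d(x) = ½||x − x₀||² (σ = 1), both argmins in Algorithm 1 reduce to clipped affine steps on a box. A sketch on a small constrained quadratic (problem data of my own choosing, not from the slides):

```python
import numpy as np

# Algorithm 1 with Euclidean prox d(x) = 0.5||x - x0||^2 (sigma = 1) on
# f(x) = 0.5 (x-c)^T A (x-c) over the box Q = [0,1]^2; L = lambda_max(A) = 4.
A = np.diag([1.0, 4.0])
c = np.array([1.5, -0.5])
L = 4.0
f = lambda x: 0.5 * (x - c) @ A @ (x - c)
grad = lambda x: A @ (x - c)
proj = lambda x: np.clip(x, 0.0, 1.0)

x0 = np.full(2, 0.5)
x, acc = x0.copy(), np.zeros(2)
for k in range(2000):
    g = grad(x)
    y = proj(x - g / L)                    # y^k: projected gradient step
    acc += (k + 1) / 2.0 * g               # running sum_i ((i+1)/2) grad f(x^i)
    z = proj(x0 - acc / L)                 # z^k: prox-center step (Euclidean d)
    x = 2.0 / (k + 3) * z + (k + 1.0) / (k + 3) * y

x_star = np.array([1.0, 0.0])              # coordinate-wise optimum over the box
assert f(y) - f(x_star) < 1e-5             # within the 4Ld(x*)/(k^2) envelope
print(y)
```

With d(x*) = 0.25 the stated rate guarantees a gap of roughly 4·4·0.25/(2000·2001) ≈ 1e−6 after 2000 iterations, which the assertion checks with slack.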
53 Immediate application: non-smooth functions. Suppose the non-differentiable objective is the Fenchel dual of some function f: min_x f*(x), where f is defined on Q. Idea: smooth the non-smooth function. Add a small σ-strongly convex function d to f; then f + d is σ-strongly convex, so (f + d)* is (1/σ)-l.c.g.
54 (f + εd)*(x) approximates f*(x). If 0 ≤ d(u) ≤ D for u ∈ Q, then f*(x) − εD ≤ (f + εd)*(x) ≤ f*(x). Proof: (f + εd)*(x) = max_u ⟨u, x⟩ − f(u) − εd(u) ≤ max_u ⟨u, x⟩ − f(u) = f*(x), and (f + εd)*(x) ≥ max_u ⟨u, x⟩ − f(u) − εD = f*(x) − εD.
55 (f + εd)*(x) approximates f*(x) well. If d(u) ∈ [0, D] on Q, the algorithm is: given precision ε, fix ε̂ = ε/(2D) and optimize the l.c.g. function (f + ε̂d)* to precision ε/2; since (f + ε̂d)*(x) − f*(x) ∈ [−ε/2, 0], this yields an ε-accurate solution of f*. Rate of convergence: √(L/(ε/2)) with L = 1/(ε̂σ) = 2D/(εσ), i.e. √(4D/(σε²)) = (2/ε)√(D/σ) = O(1/ε) steps.
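A concrete instance: |x| = max_{u∈[−1,1]} ux is the Fenchel dual of f = 0 on Q = [−1, 1]. Smoothing with d(u) = u²/2 (σ = 1, D = 1/2) yields the Huber function, which is (1/ε)-l.c.g. and within εD of |x|:

```python
import numpy as np

# (f + eps*d)^*(x) = max_{u in [-1,1]} u*x - eps*u^2/2, the Huber function.
def smoothed_abs(x, eps):
    u = np.clip(x / eps, -1.0, 1.0)    # maximizer of u*x - eps*u^2/2 over [-1,1]
    return u * x - eps * u ** 2 / 2.0

eps, D = 0.1, 0.5
xs = np.linspace(-3.0, 3.0, 1001)
gap = np.abs(xs) - smoothed_abs(xs, eps)
# The sandwich bound from the slide: 0 <= |x| - smoothed <= eps*D.
assert np.all(gap >= -1e-12) and np.all(gap <= eps * D + 1e-12)
print(gap.max())
```

Outside [−ε, ε] the gap is exactly εD = ε/2, so the bound is tight; inside, the smoothed function is a quadratic x²/(2ε).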
56 Outline. The problem from a machine learning perspective. Preliminaries: convex analysis and gradient descent. Nesterov's optimal gradient method: lower bound of optimization; optimal gradient method. Utilizing structure, composite optimization: smooth minimization; excessive gap minimization. Conclusion.
57 Composite optimization. Many applications have objectives of the form J(w) = f(w) + g*(Aw), where f is convex on the region E₁ with norm ||·||₁ and g is convex on the region E₂ with norm ||·||₂. Very useful in machine learning: Aw corresponds to the linear model.
58 Composite optimization, example: binary SVM. J(w) = (λ/2)||w||² + min_{b∈R} (1/n) Σ_{i=1}^n [1 − y_i(⟨x_i, w⟩ + b)]₊, where f(w) = (λ/2)||w||² and the loss term is g*(Aw), with A = −(y_1 x_1, …, y_n x_n)⊤ and g* the dual of g(α) = −Σ_i α_i over Q₂ = {α ∈ [0, n⁻¹]^n : Σ_i y_i α_i = 0}.
59 Composite optimization 1: smooth minimization. J(w) = f(w) + g*(Aw); assume only that f is M-l.c.g. wrt ||·||₁. Smooth g* into (g + μd₂)*, where d₂ is σ₂-strongly convex wrt ||·||₂; then J_μ(w) = f(w) + (g + μd₂)*(Aw) is (M + ||A||²₁,₂/(μσ₂))-l.c.g. Apply Nesterov's method to J_μ(w).
60 Smooth minimization: rate of convergence. Finding an ε-accurate solution costs 4||A||₁,₂ √(D₁D₂/(σ₁σ₂)) · (1/ε) + √(MD₁/(σ₁ε)) steps, where d₁ is σ₁-strongly convex wrt ||·||₁, d₂ is σ₂-strongly convex wrt ||·||₂, D₁ := max_{w∈E₁} d₁(w), and D₂ := max_{α∈E₂} d₂(α).
61 Smooth minimization, example: the matrix game argmin_{w∈Δ_n} ⟨c, w⟩ + max_{u∈Δ_m} {⟨Aw, u⟩ + ⟨b, u⟩}, with f(w) = ⟨c, w⟩ and the max term as g*(Aw). Using the Euclidean distance: E₁ = Δ_n with ||w|| = (Σ_i w_i²)^{1/2} and d₁(w) = (1/2)Σ_i (w_i − n⁻¹)²; E₂ = Δ_m with the analogous norm and d₂(u) = (1/2)Σ_i (u_i − m⁻¹)²; σ₁ = σ₂ = 1, D₁ < 1, D₂ < 1, and ||A||²₁,₂ = λ_max(A⊤A). Result: f(w_k) − f(w*) ≤ 4λ_max^{1/2}(A⊤A)/(k+1), where λ_max(A⊤A) may scale with O(nm).
62 Smooth minimization, example: the matrix game with the entropy distance. E₁ = Δ_n with ||w||₁ = Σ_i |w_i| and d₁(w) = ln n + Σ_i w_i ln w_i; E₂ = Δ_m with ||u||₁ = Σ_i |u_i| and d₂(u) = ln m + Σ_i u_i ln u_i; σ₁ = σ₂ = 1, D₁ = ln n, D₂ = ln m, and ||A||₁,₂ = max_{i,j} |A_{i,j}|. Result: f(w_k) − f(w*) ≤ 4(ln n · ln m)^{1/2} max_{i,j}|A_{i,j}| / (k+1).
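The payoff of choosing the entropy prox is visible in the constants themselves. A sketch comparing the two bounds' leading factors on a random sign matrix (my own illustrative data):

```python
import numpy as np

# Euclidean prox constant: lambda_max(A^T A)^(1/2);
# entropy prox constant:   sqrt(ln n * ln m) * max|a_ij|.
rng = np.random.default_rng(0)
n, m = 500, 400
A = rng.choice([-1.0, 1.0], size=(m, n))

c_euclid = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())
c_entropy = np.sqrt(np.log(n) * np.log(m)) * np.abs(A).max()
print(c_euclid, c_entropy)
assert c_entropy < c_euclid
```

For a random ±1 matrix the spectral norm grows like √n + √m (here around 40), while the entropy constant grows only logarithmically (here around 6), so the geometry-adapted prox gives a much smaller bound.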
63 Smooth minimization: disadvantages. The smoothing must be fixed beforehand using the prescribed accuracy ε, and there is no convergence criterion because the real minimum is unknown.
64 Composite optimization 2: excessive gap minimization. A primal-dual method, so it easily upper-bounds the duality gap. Idea: assume the objective takes the form J(w) = f(w) + g*(Aw) and utilize the adjoint form D(α) = −g(α) − f*(−A⊤α). Relations: J(w) ≥ D(α) for all w, α, and inf_{w∈E₁} J(w) = sup_{α∈E₂} D(α).
65 Excessive gap minimization, sketch of idea. Assume f* is L_f-l.c.g. and g* is L_g-l.c.g. Smooth both f* and g* by prox-functions d₁, d₂: J_{μ₂}(w) = f(w) + (g + μ₂d₂)*(Aw) and D_{μ₁}(α) = −g(α) − (f + μ₁d₁)*(−A⊤α).
66 Sketch of idea, continued. Maintain two point sequences {w_k} and {α_k}, and two regularization sequences {μ₁(k)} and {μ₂(k)} with μ₁(k) → 0 and μ₂(k) → 0, such that the excessive gap condition J_{μ₂(k)}(w_k) ≤ D_{μ₁(k)}(α_k) always holds.
67 Excessive gap condition: J_{μ₂(k)}(w_k) ≤ D_{μ₁(k)}(α_k). Challenges: how to efficiently find an initial point w_1, α_1, μ₁(1), μ₂(1) that satisfies the excessive gap condition; given w_k, α_k, μ₁(k), μ₂(k), with new μ₁(k+1) and μ₂(k+1), how to efficiently find w_{k+1} and α_{k+1}; and how to anneal μ₁(k) and μ₂(k) (otherwise one step would be done). Solution: gradient mapping and Bregman projection (very cool).
68 Excessive gap minimization: rate of convergence. J(w_k) − D(α_k) ≤ (4||A||₁,₂/(k+1)) √(D₁D₂/(σ₁σ₂)). If f is σ-strongly convex, there is no need to add a prox-function to f (take μ₁(k) ≡ 0), and J(w_k) − D(α_k) ≤ (4D₂/(σ₂ k(k+1))) (||A||²₁,₂/σ + L_g).
69 Excessive gap minimization, example: binary SVM. J(w) = (λ/2)||w||² + min_{b∈R} (1/n) Σ_{i=1}^n [1 − y_i(⟨x_i, w⟩ + b)]₊, with A = −(y_1 x_1, …, y_n x_n)⊤, g* the dual of g(α) = −Σ_i α_i over E₂ = {α ∈ [0, n⁻¹]^n : Σ_i y_i α_i = 0}. Adjoint form: D(α) = Σ_i α_i − (1/(2λ)) α⊤AA⊤α.
70 Convergence rate for SVM. Theorem: running on the SVM for k iterations, J(w_k) − D(α_k) ≤ 2L/((k+1)(k+2)n), where L = λ⁻¹||A||² = λ⁻¹||(y_1 x_1, …, y_n x_n)||² ≤ λ⁻¹nR² (with ||x_i|| ≤ R). Final conclusion: J(w_k) − D(α_k) ≤ ε as long as k > O(R/√(λε)).
71 Excessive gap minimization: projection for SVM. An efficient O(n)-time projection onto E₂ = {α ∈ [0, n⁻¹]^n : Σ_i y_i α_i = 0} is available. The projection is a singly linearly constrained QP: min Σ_{i=1}^n (α_i − m_i)², s.t. l_i ≤ α_i ≤ u_i for all i ∈ [n] and Σ_{i=1}^n σ_i α_i = z. Key tool: median finding takes O(n) time.
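This QP has a one-dimensional dual, which makes it easy to sketch: each coordinate of the solution is a clipped shift of m, and the shift is the root of a monotone scalar equation. The O(n) median-finding variant from the slide replaces the bisection below with a selection step; this sketch keeps the simpler O(n log(1/tol)) bisection:

```python
import numpy as np

def project(m, s, z, l, u, iters=100):
    """Project m onto {x in [l,u]^n : <s, x> = z} via bisection on the multiplier t."""
    x = lambda t: np.clip(m - t * s, l, u)
    lo, hi = -1e6, 1e6                 # assume the root lies in this bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if s @ x(mid) > z:             # <s, x(t)> is non-increasing in t
            lo = mid
        else:
            hi = mid
    return x(0.5 * (lo + hi))

m = np.array([0.9, -0.3, 0.4])
s = np.ones(3)
p = project(m, s, z=1.0, l=np.zeros(3), u=np.ones(3))
print(p)   # projection of m onto the simplex-like set {x in [0,1]^3 : sum x = 1}
```

With s = 1 and box [0, 1] this is projection onto the probability simplex; for this m the shift works out to t = 0.15, giving p ≈ [0.75, 0, 0.25].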
72 Automatic estimation of the Lipschitz constant L, by geometric scaling (backtracking). This does not affect the rate of convergence.
73 Conclusion. Nesterov's method attains the O(√(L/ε)) lower bound for L-l.c.g. objectives, and a linear rate for l.c.g. and strongly convex objectives. Composite optimization attains the rate of the "nice" part of the function. Constraints are handled via gradient mapping and Bregman projection, which essentially do not change the convergence rate. We expect wide applications in machine learning (note: not in terms of generalization performance).
Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More informationLecture 1: Background on Convex Analysis
Lecture 1: Background on Convex Analysis John Duchi PCMI 2016 Outline I Convex sets 1.1 Definitions and examples 2.2 Basic properties 3.3 Projections onto convex sets 4.4 Separating and supporting hyperplanes
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationMachine Learning. Lecture 6: Support Vector Machine. Feng Li.
Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)
More informationAn Optimal Affine Invariant Smooth Minimization Algorithm.
An Optimal Affine Invariant Smooth Minimization Algorithm. Alexandre d Aspremont, CNRS & École Polytechnique. Joint work with Martin Jaggi. Support from ERC SIPA. A. d Aspremont IWSL, Moscow, June 2013,
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationMath 273a: Optimization Convex Conjugacy
Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper
More informationUnconstrained minimization of smooth functions
Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationCOMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37
COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37 Today Perceptrons Definition Perceptron learning rule Convergence (Linear) support vector machines Margin & max margin classifier Formulation
More informationLecture 23: November 19
10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo
More informationFor any y 2C, any sub-gradient, v, of h at prox(x), i.e., v and by optimality of prox(x) in (3), we have
A Proofs We now give the details for the proof of our main results, i.e., heorems and 2. Below, we outline the steps for the proof of FAG s heorem. he proof of heorem 2 for FARE follows the same line of
More informationPrimal/Dual Decomposition Methods
Primal/Dual Decomposition Methods Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2018-19, HKUST, Hong Kong Outline of Lecture Subgradients
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationSolutions Chapter 5. The problem of finding the minimum distance from the origin to a line is written as. min 1 2 kxk2. subject to Ax = b.
Solutions Chapter 5 SECTION 5.1 5.1.4 www Throughout this exercise we will use the fact that strong duality holds for convex quadratic problems with linear constraints (cf. Section 3.4). The problem of
More informationSupport Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs
E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in
More informationLecture 5: September 12
10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationSupport Vector Machines, Kernel SVM
Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationConvex Optimization. Newton s method. ENSAE: Optimisation 1/44
Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More informationSmoothing Proximal Gradient Method. General Structured Sparse Regression
for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:
More informationSupport Vector Machines for Classification and Regression
CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More information