Randomized Smoothing Techniques in Optimization


1 Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory Seminar October 2014

2 Problem Statement Goal: solve the following problem minimize f(x) subject to x ∈ X

3 Problem Statement Goal: solve the following problem minimize f(x) subject to x ∈ X Often, we will assume f(x) := (1/n) Σ_{i=1}^n F(x; ξ_i) or f(x) := E[F(x; ξ)]

4 Gradient Descent Goal: solve minimize f(x) Technique: go down the slope, x_{t+1} = x_t − α ∇f(x_t) [figure: f with its linearization f(y) + ⟨∇f(y), x − y⟩ at the point (y, f(y))]

5 When is optimization easy? Easy problem: function is convex, nice and smooth

6 When is optimization easy? Easy problem: function is convex, nice and smooth Not so easy problem: function is non-smooth

7 When is optimization easy? Easy problem: function is convex, nice and smooth Not so easy problem: function is non-smooth Even harder problems: We cannot compute gradients ∇f(x) Function f is non-convex and non-smooth

8 Example 1: Robust regression Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R Want to estimate b_i ≈ a_i^⊤ x

9 Example 1: Robust regression Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R Want to estimate b_i ≈ a_i^⊤ x To avoid outliers, minimize f(x) = (1/n) Σ_{i=1}^n |a_i^⊤ x − b_i| = (1/n) ‖Ax − b‖_1
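For concreteness, here is a minimal NumPy sketch of this robust-regression objective and one of its subgradients; the toy data and the sign-based subgradient rule are illustrative choices, not taken from the talk.

```python
import numpy as np

def robust_loss(x, A, b):
    """f(x) = (1/n) * ||A x - b||_1."""
    return np.abs(A @ x - b).mean()

def robust_subgradient(x, A, b):
    """One subgradient of f at x: (1/n) * A^T sign(A x - b)."""
    n = A.shape[0]
    return A.T @ np.sign(A @ x - b) / n

# toy usage with synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = A @ np.ones(5) + rng.standard_normal(100)
x = np.zeros(5)
print(robust_loss(x, A, b), robust_subgradient(x, A, b))
```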

10 Example 2: Protein Structure Prediction [figure: structure for the protein sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV (Grace et al., PNAS 2004)]

11 Protein Structure Prediction Featurize edges e in graph: vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings. [figure: graph with edge weights x^⊤ ξ_e]

12 Protein Structure Prediction Featurize edges e in graph: vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings. [figure: graph with edge weights x^⊤ ξ_e] Goal: Learn weights x so that argmax_{ν̂ ∈ V} Σ_{e ∈ ν̂} ξ_e^⊤ x = ν, i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is the correct one

13 Protein Structure Prediction [figure: graph with edge weights x^⊤ ξ_e] Goal: Learn weights x so that argmax_{ν̂ ∈ V} Σ_{e ∈ ν̂} ξ_e^⊤ x = ν, i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is the correct one

14 Protein Structure Prediction [figure: graph with edge weights x^⊤ ξ_e] Goal: Learn weights x so that argmax_{ν̂ ∈ V} Σ_{e ∈ ν̂} ξ_e^⊤ x = ν, i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is the correct one Loss function: L(ν, ν̂) is the number of disagreements between matchings, and F(x; {ξ, ν}) := max_{ν̂ ∈ V} ( L(ν, ν̂) + x^⊤ Σ_{e ∈ ν̂} ξ_e − x^⊤ Σ_{e ∈ ν} ξ_e ).
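To make this loss concrete, the following hedged Python sketch evaluates F(x; {ξ, ν}) and one subgradient, assuming a problem-specific loss-augmented decoder is available; the `loss_augmented_decode` function and its interface are hypothetical placeholders, not part of the talk.

```python
import numpy as np

def structured_hinge(x, edge_features, true_matching, loss_augmented_decode):
    """Structured loss F(x; {xi, nu}) = max_nuhat [L(nu, nuhat) + x.phi(nuhat)] - x.phi(nu).

    `edge_features[e]` is the vector xi_e; `true_matching` is the set of edges in nu.
    `loss_augmented_decode(x, edge_features, true_matching)` is assumed to return the
    matching nuhat maximizing L(nu, nuhat) + sum_{e in nuhat} x^T xi_e together with
    the value L(nu, nuhat); implementing it is a combinatorial (max-matching) problem.
    """
    nuhat, loss = loss_augmented_decode(x, edge_features, true_matching)
    score = lambda matching: sum(float(x @ edge_features[e]) for e in matching)
    value = loss + score(nuhat) - score(true_matching)
    # One subgradient: features of the loss-augmented matching minus those of the truth.
    subgrad = (sum(edge_features[e] for e in nuhat)
               - sum(edge_features[e] for e in true_matching))
    return value, subgrad
```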

15 When is optimization easy? Easy problem: function is convex, nice and smooth Not so easy problem: function is non-smooth Even harder problems: We cannot compute gradients ∇f(x) Function f is non-convex and non-smooth

16 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information

17 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information Let Z be a random variable and, for small u > 0, look at f at the nearby points x + uZ

18 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information Let Z be a random variable and, for small u > 0, look at f at the nearby points x + uZ

19 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information Let Z be a random variable and, for small u > 0, look at f at the nearby points x + uZ

20 Three instances I Solving previously unsolvable problems [Burke, Lewis, Overton 2005] Non-smooth, non-convex problems

21 Three instances I Solving previously unsolvable problems [Burke, Lewis, Overton 2005] Non-smooth, non-convex problems II Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014] Smooth and non-smooth zero order stochastic and non-stochastic optimization problems

22 Three instances I Solving previously unsolvable problems [Burke, Lewis, Overton 2005] Non-smooth, non-convex problems II Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014] Smooth and non-smooth zero order stochastic and non-stochastic optimization problems III Parallelism: really fast solutions for large scale problems [D., Bartlett, Wainwright 2013] Smooth and non-smooth stochastic optimization problems

23 Instance I: Gradient Sampling Algorithm Problem: Solve minimize_{x ∈ X} f(x) where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]

24 Instance I: Gradient Sampling Algorithm Problem: Solve minimize_{x ∈ X} f(x) where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]
At each iteration t:
Draw Z_1, ..., Z_m i.i.d. with ‖Z_i‖_2 ≤ 1
Set g_t^i = ∇f(x_t + u Z_i)
Set the gradient g_t as g_t = argmin_g { ‖g‖_2^2 : g = Σ_i λ_i g_t^i, λ ≥ 0, Σ_i λ_i = 1 }
Update x_{t+1} = x_t − α g_t, where α > 0 is chosen by line search
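A rough sketch of one gradient-sampling iteration follows; `grad_f` is an assumed oracle for ∇f where it exists, the min-norm convex combination is found with a small SciPy QP over the simplex, and the fixed step in place of a line search is a simplification.

```python
import numpy as np
from scipy.optimize import minimize

def gradient_sampling_step(x, grad_f, u=0.1, m=20, alpha=0.5, rng=np.random.default_rng()):
    """One iteration: sample gradients in a radius-u ball around x, combine, step."""
    d = x.size
    # Z_i roughly uniform in the unit ball: random direction times a scaled radius.
    Z = rng.standard_normal((m, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    Z *= rng.uniform(size=(m, 1)) ** (1.0 / d)
    G = np.array([grad_f(x + u * z) for z in Z])        # rows are g_t^i

    # g_t = argmin { ||g||_2^2 : g = sum_i lambda_i g_t^i, lambda >= 0, sum_i lambda_i = 1 }
    objective = lambda lam: float(np.sum((G.T @ lam) ** 2))
    constraints = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(objective, np.full(m, 1.0 / m),
                   bounds=[(0.0, None)] * m, constraints=constraints)
    g = G.T @ res.x
    return x - alpha * g   # the actual method chooses alpha by line search
```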

25 Instance I: Gradient Sampling Algorithm Problem: Solve minimize_{x ∈ X} f(x) where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005] Define the set G_u(x) := Conv{ ∇f(x′) : ‖x′ − x‖_2 ≤ u, ∇f(x′) exists } Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence x_t, and for any such cluster point, 0 ∈ G_u(x̄)

26 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ))

27 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ)) Idea: Approximate the gradient by function differences, f′(y) ≈ g_u := (f(y + u) − f(y))/u [figure: f with secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 < u_0]

28 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ)) Idea: Approximate the gradient by function differences, f′(y) ≈ g_u := (f(y + u) − f(y))/u Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro [figure: f with secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 < u_0]

29 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ)) Idea: Approximate the gradient by function differences, f′(y) ≈ g_u := (f(y + u) − f(y))/u Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro Can randomized perturbations give insights? [figure: f with secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 < u_0]

30 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t

31 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t

32 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t
Theorem (Russians): Let x̄_T = (1/T) Σ_{t=1}^T x_t and assume R ≥ ‖x⋆ − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then E[f(x̄_T) − f(x⋆)] ≤ RG/√T

33 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t
Theorem (Russians): Let x̄_T = (1/T) Σ_{t=1}^T x_t and assume R ≥ ‖x⋆ − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then E[f(x̄_T) − f(x⋆)] ≤ RG/√T
Note: Dependence on G important
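As a baseline for what follows, here is a minimal sketch of this stochastic gradient method with iterate averaging; `stoch_grad` is an assumed oracle returning an unbiased stochastic (sub)gradient, and projection onto X is omitted.

```python
import numpy as np

def sgd_average(stoch_grad, x0, alpha, T, rng=np.random.default_rng()):
    """Plain SGD, x_{t+1} = x_t - alpha * g_t, returning the averaged iterate xbar_T."""
    x = np.array(x0, dtype=float)
    xbar = np.zeros_like(x)
    for t in range(1, T + 1):
        g = stoch_grad(x, rng)          # E[g] = grad f(x)
        x = x - alpha * g
        xbar += (x - xbar) / t          # running average (1/T) sum_t x_t
    return xbar
```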

34 Derivative-free gradient descent E[f(x̄_T) − f(x⋆)] ≤ RG · (1/√T) Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?

35 Derivative-free gradient descent E[f(x̄_T) − f(x⋆)] ≤ RG · (1/√T) Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate? First idea for a gradient estimator: Sample Z ∼ µ satisfying E_µ[ZZ^⊤] = I_{d×d} Gradient estimator at x: g = (f(x + uZ) − f(x))/u · Z Perform gradient descent using these g

36 Two-point gradient estimates At any point x and any direction z, for small u > 0, (f(x + uz) − f(x))/u ≈ f′(x, z) := lim_{h→0} (f(x + hz) − f(x))/h If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩ If E[ZZ^⊤] = I, then E[f′(x, Z)Z] = E[ZZ^⊤ ∇f(x)] = ∇f(x)

37 Two-point gradient estimates At any point x and any direction z, for small u > 0, (f(x + uz) − f(x))/u ≈ f′(x, z) := lim_{h→0} (f(x + hz) − f(x))/h If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩ If E[ZZ^⊤] = I, then E[f′(x, Z)Z] = E[ZZ^⊤ ∇f(x)] = ∇f(x) [figures: random estimates f′(x, Z)Z and their average, ∇f]

38 Two-point stochastic gradient: differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
Set u_t = u/t and g_t = (F(x_t + u_t Z; ξ) − F(x_t; ξ))/u_t · Z
Update x_{t+1} = x_t − α g_t

39 Two-point stochastic gradient: differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
Set u_t = u/t and g_t = (F(x_t + u_t Z; ξ) − F(x_t; ξ))/u_t · Z
Update x_{t+1} = x_t − α g_t
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x⋆ − x_1‖_2 and E[‖∇F(x; ξ)‖_2^2] ≤ G^2 for all x, then E[f(x̄_T) − f(x⋆)] ≤ RG √(d/T) + O(u^2 log T / T).
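A minimal sketch of this two-point zero-order method, assuming an oracle `F(x, rng)` that returns F(x; ξ) for a sample ξ drawn from `rng`; reusing the seed ensures both evaluations use the same ξ, and projection onto X is again omitted.

```python
import numpy as np

def zero_order_sgd(F, x0, alpha, u, T, rng=np.random.default_rng()):
    """Two-point stochastic gradient descent using only function evaluations."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    xbar = np.zeros_like(x)
    for t in range(1, T + 1):
        u_t = u / t
        Z = rng.standard_normal(d)                    # Cov(Z) = I
        seed = int(rng.integers(1 << 31))             # reuse the same xi in both calls
        f_plus = F(x + u_t * Z, np.random.default_rng(seed))
        f_zero = F(x, np.random.default_rng(seed))
        g = (f_plus - f_zero) / u_t * Z               # two-point gradient estimate
        x = x - alpha * g
        xbar += (x - xbar) / t
    return xbar
```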

40 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T)

41 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T) To achieve ǫ-accuracy: 1/ǫ^2 versus d/ǫ^2 iterations

42 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T) To achieve ǫ-accuracy: 1/ǫ^2 versus d/ǫ^2 iterations [figure: optimality gap versus iterations for SGD and the zero-order method]

43 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T) To achieve ǫ-accuracy: 1/ǫ^2 versus d/ǫ^2 iterations [figures: optimality gap versus iterations, and versus rescaled iterations, for SGD and the zero-order method]

44 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large

45 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large

46 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0, E[‖(f(uZ) − 0)Z/u‖_2^2] = E[‖Z‖_2^2 · ‖Z‖_2^2] ≈ d^2

47 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0, E[‖(f(uZ) − 0)Z/u‖_2^2] = E[‖Z‖_2^2 · ‖Z‖_2^2] ≈ d^2 Idea: More randomization! [figure: two perturbation radii u_1 and u_2]

48 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ √d, then g := (f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1))/u_2 · Z_2 satisfies E[g] = ∇f_{u_1}(x) + O(u_2/u_1) and E[‖g‖_2^2] ≲ d ((u_2/u_1) d + log(2d)).

49 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ √d, then g := (f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1))/u_2 · Z_2 satisfies E[g] = ∇f_{u_1}(x) + O(u_2/u_1) and E[‖g‖_2^2] ≲ d ((u_2/u_1) d + log(2d)). Note: If u_2/u_1 → 0, scaling linear in d

50 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize f(x) subject to x ∈ X ⊆ R^d
Algorithm: Iterate
Draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t

51 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize f(x) subject to x ∈ X ⊆ R^d
Algorithm: Iterate
Draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x⋆ − x_1‖_2 and ‖∂f(x)‖_2^2 ≤ G^2 for all x, then E[f(x̄_T) − f(x⋆)] ≤ RG √(d log d / T) + O(u log T / T).

52 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t

53 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t
Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x⋆ − x_1‖_2 and E[‖∂F(x; ξ)‖_2^2] ≤ G^2 for all x, then E[f(x̄_T) − f(x⋆)] ≤ RG √(d log d / T) + O(u log T / T).
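For the non-smooth case, a sketch of the double-perturbation estimator from the proposition is below; `F_xi` is an assumed deterministic map x ↦ F(x; ξ) for one fixed sample ξ, and the Gaussian choice of Z_1, Z_2 follows the proposition.

```python
import numpy as np

def double_smoothing_gradient(F_xi, x, u1, u2, rng=np.random.default_rng()):
    """g = (F(x + u1 Z1 + u2 Z2; xi) - F(x + u1 Z1; xi)) / u2 * Z2 with Z1, Z2 ~ N(0, I)."""
    d = x.size
    Z1 = rng.standard_normal(d)
    Z2 = rng.standard_normal(d)
    return (F_xi(x + u1 * Z1 + u2 * Z2) - F_xi(x + u1 * Z1)) / u2 * Z2

# Used inside an SGD loop with the schedules from the slide, e.g.
#   u_t1, u_t2 = u / t, u / t**2
#   g_t = double_smoothing_gradient(lambda y: F(y, xi_t), x_t, u_t1, u_t2)
#   then update x_{t+1} = x_t - alpha * g_t
```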

54 Wrapping up zero-order gradient methods If gradients are available, convergence rates of 1/√T If only zero-order information is available, in the smooth and non-smooth cases, convergence rates of √(d/T) Time to ǫ-accuracy: 1/ǫ^2 → d/ǫ^2

55 Wrapping up zero-order gradient methods If gradients are available, convergence rates of 1/√T If only zero-order information is available, in the smooth and non-smooth cases, convergence rates of √(d/T) Time to ǫ-accuracy: 1/ǫ^2 → d/ǫ^2 Sharpness: In the stochastic case, no algorithm can do better than those we have provided; that is, there is a lower bound for all zero-order algorithms of RG √(d/T).

56 Wrapping up zero-order gradient methods If gradients are available, convergence rates of 1/√T If only zero-order information is available, in the smooth and non-smooth cases, convergence rates of √(d/T) Time to ǫ-accuracy: 1/ǫ^2 → d/ǫ^2 Sharpness: In the stochastic case, no algorithm can do better than those we have provided; that is, there is a lower bound for all zero-order algorithms of RG √(d/T). Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)

57 Instance III: Parallelization and fast algorithms Goal: solve the following problem minimize f(x) subject to x ∈ X where f(x) := (1/n) Σ_{i=1}^n F(x; ξ_i) or f(x) := E[F(x; ξ)]

58 Stochastic Gradient Descent Problem: Tough to compute ∇f(x) = (1/n) Σ_{i=1}^n ∇F(x; ξ_i). Instead: At iteration t, choose a random ξ_i, set g_t = ∇F(x_t; ξ_i), and update x_{t+1} = x_t − α g_t

59 Stochastic Gradient Descent Problem: Tough to compute ∇f(x) = (1/n) Σ_{i=1}^n ∇F(x; ξ_i). Instead: At iteration t, choose a random ξ_i, set g_t = ∇F(x_t; ξ_i), and update x_{t+1} = x_t − α g_t

60 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}

61 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}

62 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}

63 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}
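In code, the minibatch estimate is just an average of independent single-sample gradients, which keeps the mean and divides the variance by m; `stoch_grad` is an assumed one-sample gradient oracle.

```python
import numpy as np

def minibatch_gradient(stoch_grad, x, m, rng=np.random.default_rng()):
    """g_t = (1/m) sum_{j=1}^m g_{j,t}, each g_{j,t} an independent unbiased estimate."""
    return np.mean([stoch_grad(x, rng) for _ in range(m)], axis=0)
```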

64 Problem: only works for smooth functions.

65 Non-smooth problems we care about: Classification F(x; ξ) = F(x; (a, b)) = [1 − b⟨x, a⟩]_+ Robust regression F(x; (a, b)) = |b − ⟨x, a⟩|

66 Non-smooth problems we care about: Classification F(x; ξ) = F(x; (a, b)) = [1 − b⟨x, a⟩]_+ Robust regression F(x; (a, b)) = |b − ⟨x, a⟩| Structured prediction (ranking, parsing, learning matchings) F(x; {ξ, ν}) = max_{ν̂ ∈ V} [ L(ν, ν̂) + x^⊤ Φ(ξ, ν̂) ] − x^⊤ Φ(ξ, ν)

67 Difficulties of non-smooth Intuition: Gradient is poor indicator of global structure

68 Better global estimators Idea: Ask for subgradients from multiple points

69 Better global estimators Idea: Ask for subgradients from multiple points

70 The algorithm Normal approach: sample ξ at random, g_{j,t} ∈ ∂F(x_t; ξ). Our approach: add noise to x, g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ), and decrease the magnitude u_t over time [figure: sampling around the point x_t]

71 The algorithm Normal approach: sample ξ at random, g_{j,t} ∈ ∂F(x_t; ξ). Our approach: add noise to x, g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ), and decrease the magnitude u_t over time [figure: ball of radius u_t around x_t]

72 The algorithm Normal approach: sample ξ at random, g_{j,t} ∈ ∂F(x_t; ξ). Our approach: add noise to x, g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ), and decrease the magnitude u_t over time [figure: ball of radius u_t around x_t]
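A sketch of this perturbed-point minibatch step, assuming an oracle `subgrad_F(y, rng)` that returns a subgradient of F(·; ξ) at y for a fresh sample ξ; the resulting average is an unbiased estimate of the gradient of the smoothed function f_{u_t}.

```python
import numpy as np

def smoothed_minibatch_gradient(subgrad_F, x, u_t, m, rng=np.random.default_rng()):
    """Average of subgradients taken at independently perturbed copies of x."""
    d = x.size
    grads = [subgrad_F(x + u_t * rng.standard_normal(d), rng)   # g_{j,t} in dF(x + u_t Z_j; xi_j)
             for _ in range(m)]
    return np.mean(grads, axis=0)
```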

73 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have query point and exploration point

74 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients: y_t = (1 − θ_t) x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation g_t = (1/m) Σ_{j=1}^m g_{j,t}, with g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})

75 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients: y_t = (1 − θ_t) x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation g_t = (1/m) Σ_{j=1}^m g_{j,t}, with g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})
II. Solve for the exploration point: z_{t+1} = argmin_{x ∈ X} { Σ_{τ=0}^t (1/θ_τ) ⟨g_τ, x⟩ + ‖x‖_2^2/(2α_t) } (the sum approximates f; the quadratic regularizes)

76 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients: y_t = (1 − θ_t) x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation g_t = (1/m) Σ_{j=1}^m g_{j,t}, with g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})
II. Solve for the exploration point: z_{t+1} = argmin_{x ∈ X} { Σ_{τ=0}^t (1/θ_τ) ⟨g_τ, x⟩ + ‖x‖_2^2/(2α_t) } (the sum approximates f; the quadratic regularizes)
III. Interpolate: x_{t+1} = (1 − θ_t) x_t + θ_t z_{t+1}
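Putting the three steps together, here is a hedged, unconstrained (X = R^d) sketch of the scheme; the schedules θ_t = 2/(t + 2) and u_t = u/(t + 1) and the fixed α are illustrative choices, and with X = R^d the dual-averaging step II has the closed form z = −α · Σ_τ g_τ/θ_τ.

```python
import numpy as np

def accelerated_randomized_smoothing(subgrad_F, x0, alpha, u, m, T,
                                     rng=np.random.default_rng()):
    """Query point / exploration point / interpolation loop, unconstrained sketch."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    z = x.copy()
    s = np.zeros(d)                                  # running sum of g_tau / theta_tau
    for t in range(T):
        theta = 2.0 / (t + 2)
        u_t = u / (t + 1)
        y = (1 - theta) * x + theta * z              # I. query point
        grads = [subgrad_F(y + u_t * rng.standard_normal(d), rng) for _ in range(m)]
        g = np.mean(grads, axis=0)
        s += g / theta
        z = -alpha * s                               # II. exploration point (X = R^d case)
        x = (1 - theta) * x + theta * z              # III. interpolate
    return x
```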

77 Theoretical Results Objective: minimize_{x ∈ X} f(x) where f(x) = E[F(x; ξ)], using m stochastic gradient samples per iteration.

78 Theoretical Results Objective: minimize_{x ∈ X} f(x) where f(x) = E[F(x; ξ)], using m stochastic gradient samples per iteration. Non-strongly convex objectives: f(x_T) − f(x⋆) = O(C/T + 1/√(Tm))

79 Theoretical Results Objective: minimize_{x ∈ X} f(x) where f(x) = E[F(x; ξ)], using m stochastic gradient samples per iteration. Non-strongly convex objectives: f(x_T) − f(x⋆) = O(C/T + 1/√(Tm)) λ-strongly convex objectives: f(x_T) − f(x⋆) = O(C/T^2 + 1/(λTm))

80 A few remarks on distributing Convergence rate: f(x_T) − f(x⋆) = O(1/T + 1/√(Tm)) If communication is expensive, use larger batch sizes m: (a) Communication cost is c (b) n computers with batch size m (c) S total update steps

81 A few remarks on distributing Convergence rate: f(x_T) − f(x⋆) = O(1/T + 1/√(Tm)) If communication is expensive, use larger batch sizes m: (a) Communication cost is c (b) n computers with batch size m (c) S total update steps Backsolve: after T = S(m + c) units of time, the error is O((m + c)/T + √((m + c)/(Tnm)))
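A quick numeric sketch of this tradeoff: evaluating the predicted error (m + c)/T + √((m + c)/(Tnm)) over a range of batch sizes m shows the sweet spot between communication overhead and variance reduction; the constants below are made up for illustration.

```python
import numpy as np

def predicted_error(m, T=1e5, n=8, c=50.0):
    """Predicted error after T time units with n workers, batch size m, comm. cost c."""
    return (m + c) / T + np.sqrt((m + c) / (T * n * m))

for m in (1, 10, 50, 100, 500, 1000):
    print(m, predicted_error(m))
```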

82 Experimental results

83 Iteration complexity simulations Define T(ǫ, m) = min{t ∈ N : f(x_t) − f(x⋆) ≤ ǫ}, and solve the robust regression problem f(x) = (1/n) Σ_{i=1}^n |a_i^⊤ x − b_i| = (1/n) ‖Ax − b‖_1 [figures: iterations to ǫ-optimality, actual versus predicted T(ǫ, m), as a function of the number m of gradient samples]

84 Robustness to stepsize and smoothing Two parameters: smoothing parameter u, stepsize η Plot: optimality gap f(x̂) + ϕ(x̂) − f(x⋆) − ϕ(x⋆) after 2000 iterations on a synthetic SVM problem, f(x) + ϕ(x) := (1/n) Σ_{i=1}^n [1 − ⟨ξ_i, x⟩]_+ + (λ/2) ‖x‖_2^2 [figure: optimality gap as a function of u and η]

85 Text Classification Reuters RCV1 dataset, time to ǫ-optimal solution for (1/n) Σ_{i=1}^n [1 − ⟨ξ_i, x⟩]_+ + (λ/2) ‖x‖_2^2 [figures: mean time (seconds) to a fixed optimality gap versus number of worker threads, for batch sizes 10 and 20]

86 Text Classification Reuters RCV1 dataset, optimization speed for minimizing (1/n) Σ_{i=1}^n [1 − ⟨ξ_i, x⟩]_+ + (λ/2) ‖x‖_2^2 [figure: optimality gap versus time (s) for accelerated variants Acc(1)–Acc(8) and Pegasos]

87 Parsing Penn Treebank dataset, learning weights for a hypergraph parser (here ξ is a sentence, ν ∈ V is a parse tree): (1/n) Σ_{i=1}^n max_{ν̂ ∈ V} [ L(ν_i, ν̂) + x^⊤ (Φ(ξ_i, ν̂) − Φ(ξ_i, ν_i)) ] + (λ/2) ‖x‖_2^2 [figures: optimality gap f(x) + ϕ(x) − f(x⋆) − ϕ(x⋆) versus iterations and versus time (seconds), compared with RDA]

88 Is smoothing necessary? Solve the multiple-median problem f(x) = (1/n) Σ_{i=1}^n ‖x − ξ_i‖_1, ξ_i ∈ {−1, 1}^d. Compare with standard stochastic gradient: [figure: iterations to ǫ-optimality, smoothed versus unsmoothed, as a function of the number m of gradient samples]

89 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems

90 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems Current experiments in consultation with Google for large-scale parsing/translation tasks Open questions: non-stochastic optimality guarantees? True zero-order optimization?

91 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems Current experiments in consultation with Google for large-scale parsing/translation tasks Open questions: non-stochastic optimality guarantees? True zero-order optimization? Thanks!

92 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems Current experiments in consultation with Google for large-scale parsing/translation tasks Open questions: non-stochastic optimality guarantees? True zero-order optimization? References: Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2). Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv [math.OC]. Available on my webpage.

