Mini-Course 1: SGD Escapes Saddle Points


Yang Yuan, Computer Science Department, Cornell University

Gradient Descent (GD)
Task: min_x f(x). GD does iterative updates:
x_{t+1} = x_t − η_t ∇f(x_t)
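
A minimal sketch of this update rule (the quadratic example objective and the step size below are illustrative assumptions, not from the slides):

```python
import numpy as np

def gd(grad_f, x0, eta=0.1, num_steps=100):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x, so GD converges to the origin.
x_star = gd(lambda x: x, x0=[3.0, -2.0])
```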

Gradient Descent (GD) has at least two problems:
- Computing the full gradient is slow for big data.
- It gets stuck at stationary points.

Stochastic Gradient Descent (SGD)
Very similar to GD, but the gradient now has some randomness:
x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).
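
A minimal minibatch-SGD sketch (the least-squares example, batch size, and step size are assumptions chosen for illustration):

```python
import numpy as np

def sgd(grad_batch, x0, data, eta=0.01, batch_size=128, num_steps=1000, seed=0):
    """SGD: x_{t+1} = x_t - eta * g_t, where g_t is an unbiased gradient estimate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        x = x - eta * grad_batch(x, data[idx])   # E[g_t] = grad f(x_t)
    return x

# Example: 1-D least squares, f(x) = E[(a*x - b)^2] / 2 over samples (a, b).
samples = np.random.default_rng(1).normal(size=(1000, 2))
grad = lambda x, batch: np.mean((batch[:, 0] * x - batch[:, 1]) * batch[:, 0])
x_hat = sgd(grad, x0=0.0, data=samples)
```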

Why do we use SGD?
Initially because:
- It is much cheaper to compute using mini-batches.
- It can still converge to the global minimum in the convex case.
But now people realize:
- It can escape saddle points! (Today's topic)
- It can escape shallow local minima. (Next time's topic; some progress.)
- It can find local minima that generalize well. (Not well understood)
Therefore, it's not only faster, but also works better!

About the g_t that we use
x_{t+1} = x_t − η_t g_t, where E[g_t] = ∇f(x_t).
In practice, g_t is obtained by sampling a minibatch of size 128 or 256 from the dataset.
To simplify the analysis, we assume g_t = ∇f(x_t) + ξ_t, where ξ_t ∼ N(0, I) or ξ_t is uniform over a ball B_0(r).
In general, the analysis works as long as ξ_t has non-negligible components in every direction.
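
A sketch of this simplified noise model (the uniform-ball sampler below is one standard construction and an assumption, not spelled out on the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(d):
    # xi_t ~ N(0, I)
    return rng.normal(size=d)

def ball_noise(d, r):
    # xi_t uniform over B_0(r): uniform direction, radius r * U^(1/d).
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    return r * rng.uniform() ** (1.0 / d) * v

def noisy_gradient(grad_f, x, r=None):
    """g_t = grad f(x_t) + xi_t, so E[g_t] = grad f(x_t)."""
    xi = gaussian_noise(len(x)) if r is None else ball_noise(len(x), r)
    return grad_f(x) + xi
```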

Preliminaries
- L-Lipschitz: |f(w_1) − f(w_2)| ≤ L ‖w_1 − w_2‖_2
- l-smoothness: the gradient is l-Lipschitz, i.e., ‖∇f(w_1) − ∇f(w_2)‖_2 ≤ l ‖w_1 − w_2‖_2
- ρ-Hessian smoothness: the Hessian matrix is ρ-Lipschitz, i.e., ‖∇²f(w_1) − ∇²f(w_2)‖_sp ≤ ρ ‖w_1 − w_2‖_2
We need the last condition because we will use the Hessian at the current point to approximate f in a neighborhood, and then bound the approximation error.
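
A rough numerical sanity check of these constants (a sketch under the assumption that crude random sampling suffices for illustration; it only produces lower bounds on the true Lipschitz constants):

```python
import numpy as np

def estimate_constants(f, grad_f, hess_f, d, num_pairs=1000, scale=1.0, seed=0):
    """Empirical lower bounds on the L, l, rho Lipschitz constants by random sampling."""
    rng = np.random.default_rng(seed)
    L = l = rho = 0.0
    for _ in range(num_pairs):
        w1, w2 = rng.normal(scale=scale, size=(2, d))
        dist = np.linalg.norm(w1 - w2)
        L = max(L, abs(f(w1) - f(w2)) / dist)                                  # function Lipschitz
        l = max(l, np.linalg.norm(grad_f(w1) - grad_f(w2)) / dist)             # gradient Lipschitz
        rho = max(rho, np.linalg.norm(hess_f(w1) - hess_f(w2), ord=2) / dist)  # Hessian, spectral norm
    return L, l, rho
```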

Saddle points and negative eigenvalues

Stationary points: saddle points, local minima, local maxima
For stationary points, ∇f(w) = 0:
- If ∇²f(w) ≻ 0, it's a local minimum.
- If ∇²f(w) ≺ 0, it's a local maximum.
- If ∇²f(w) has both positive and negative eigenvalues, it's a saddle point.
- Degenerate case: ∇²f(w) has eigenvalues equal to 0. It could be either a local minimum (maximum) or a saddle point; f is flat in some directions, and SGD behaves like a random walk there.
We only consider the non-degenerate case!
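
A small classifier following this case analysis (a sketch; the tolerance `tol` for treating an eigenvalue as zero is an assumption):

```python
import numpy as np

def classify_stationary_point(hess, tol=1e-8):
    """Classify a stationary point (where grad f(w) = 0) by the Hessian eigenvalues."""
    eigvals = np.linalg.eigvalsh(hess)          # Hessian is symmetric
    if np.any(np.abs(eigvals) <= tol):
        return "degenerate"                      # some eigenvalue is (numerically) zero
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"                        # both positive and negative eigenvalues

# Example: f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin, a saddle point.
print(classify_stationary_point(np.diag([2.0, -2.0])))
```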

Strict saddle property
f(w) is (α, γ, ε, ζ)-strict saddle if, for any w, at least one of the following holds:
- ‖∇f(w)‖_2 ≥ ε
- λ_min(∇²f(w)) ≤ −γ < 0
- there exists w* such that ‖w − w*‖_2 ≤ ζ and the region centered at w* with radius 2ζ is α-strongly convex.
Which means: either the gradient is large; or (at a near-stationary point) we have a negative-eigenvalue direction to escape along; or (at a near-stationary point with no negative eigenvalue) we are already pretty close to a local minimum.
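
A pointwise check of the first two conditions (a sketch only; the third condition, local strong convexity around a nearby point, is not checked here):

```python
import numpy as np

def strict_saddle_first_two_conditions(grad_f, hess_f, w, eps, gamma):
    """At a single point w: large gradient, or a negative-eigenvalue direction to escape along."""
    large_gradient = np.linalg.norm(grad_f(w)) >= eps
    lam_min = np.linalg.eigvalsh(hess_f(w))[0]
    has_escape_direction = lam_min <= -gamma
    return large_gradient, has_escape_direction
```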

Strict saddle functions are everywhere:
- Orthogonal tensor decomposition [Ge et al 2015]
- Deep linear (residual) networks [Kawaguchi 2016], [Hardt and Ma 2016]
- Matrix completion [Ge et al 2016]
- Generalized phase retrieval problem [Sun et al 2016]
- Low rank matrix recovery [Bhojanapalli et al 2016]
Moreover, in these problems, all local minima are equally good! That means: SGD escapes all saddle points, so whichever local minimum SGD arrives at is a global minimum! This is one popular way to prove that SGD solves the problem.

Main Results
[Ge et al 2015] says: whp, SGD will escape all saddle points and converge to a local minimum. The convergence time has a polynomial dependence on the dimension d.
[Jin et al 2017] says: whp, PGD (a variant of SGD) will escape all saddle points and converge to a local minimum much faster. The dependence on d is logarithmic.
Same proof framework. We'll mainly look at the new result.

Description of PGD
Do the following iteratively:
- If ‖∇f(x_t)‖ ≤ g_thres and the last perturbation was more than t_thres steps before, do a random perturbation (uniform over a ball).
- If the perturbation happened t_thres steps ago, but f has decreased by less than f_thres, return the value before the last perturbation.
- Do a gradient descent step: x_{t+1} = x_t − η ∇f(x_t)
A few remarks:
- Unfortunately, it is not a fast algorithm, because of GD!
- η = c/l, and g_thres, t_thres, f_thres depend on a constant c, as well as other parameters.
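
A sketch of this loop (hedged: the threshold values are left as parameters; [Jin et al 2017] sets g_thres, t_thres, f_thres from c, l, ρ, ε by formulas not reproduced on the slides):

```python
import numpy as np

def pgd(f, grad_f, x0, eta, g_thres, t_thres, f_thres, r, max_iters=10_000, seed=0):
    """Perturbed gradient descent, following the loop described above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = len(x)
    t_perturb = -np.inf               # iteration of the last perturbation
    x_before, f_before = None, None   # state just before the last perturbation
    for t in range(max_iters):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - t_perturb > t_thres:
            # Random perturbation: uniform over a ball of radius r.
            x_before, f_before = x.copy(), f(x)
            xi = rng.normal(size=d)
            xi *= r * rng.uniform() ** (1.0 / d) / np.linalg.norm(xi)
            x = x + xi
            t_perturb = t
        if t - t_perturb == t_thres and f(x) > f_before - f_thres:
            # Not enough decrease after the perturbation: return the pre-perturbation point.
            return x_before
        x = x - eta * grad_f(x)       # plain gradient descent step
    return x
```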

Main theorem in [Jin et al 2017]
Theorem (Main Theorem). Assume the function f is l-smooth, ρ-Hessian Lipschitz, and (α, γ, ε, ζ)-strict saddle. There exists an absolute constant c_max such that, for any δ > 0, Δ_f ≥ f(x_0) − f*, constant c ≤ c_max, and ε̃ = min{ε, γ²/ρ}, PGD(c) will output a point ζ-close to a local minimum, with probability 1 − δ, and terminate in the following number of iterations:
O( (l (f(x_0) − f*) / ε̃²) · log⁴( d l Δ_f / (ε̃² δ) ) )
If we could show that SGD has a similar property, that would be great!
The convergence rate is almost optimal.

More general version: why it's fast
Theorem (A more general version). Assume the function f is l-smooth and ρ-Hessian Lipschitz. There exists an absolute constant c_max such that, for any δ > 0, Δ_f ≥ f(x_0) − f*, constant c ≤ c_max, and ε ≤ l²/ρ, PGD(c) will output an ε-second-order stationary point, with probability 1 − δ, and terminate in the following number of iterations:
O( (l (f(x_0) − f*) / ε²) · log⁴( d l Δ_f / (ε² δ) ) )
Essentially saying the same thing. If f is not strict saddle, only an ε-second-order stationary point (instead of a local minimum) is guaranteed.

ε-stationary points
- ε-first-order stationary point: ‖∇f(x)‖ ≤ ε
- ε-second-order stationary point: ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε)
- If f is l-smooth, |λ_min(∇²f(x))| ≤ l. So for any ε ≥ l²/ρ, the curvature condition is automatic, and an ε-first-order stationary point of an l-smooth function is already an ε-second-order stationary point.
- If f is (α, γ, ε, ζ)-strict saddle and ε < γ²/ρ, then any ε-second-order stationary point is a local minimum.
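
A direct check of these definitions (a sketch; `eps` and `rho` are whatever accuracy and Hessian-Lipschitz parameters you have for your problem):

```python
import numpy as np

def is_first_order_stationary(grad_f, x, eps):
    return np.linalg.norm(grad_f(x)) <= eps

def is_second_order_stationary(grad_f, hess_f, x, eps, rho):
    """eps-second-order stationary: small gradient AND no very negative curvature."""
    small_grad = np.linalg.norm(grad_f(x)) <= eps
    lam_min = np.linalg.eigvalsh(hess_f(x))[0]
    return small_grad and lam_min >= -np.sqrt(rho * eps)
```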

[Nesterov, 1998]
Theorem. Assume that f is l-smooth. Then for any ε > 0, if we run GD with step size η = 1/l and termination condition ‖∇f(x)‖ ≤ ε, the output will be an ε-first-order stationary point, and the algorithm terminates in the following number of iterations:
l(f(x_0) − f*) / ε²
[Jin et al 2017]: PGD converges to an ε-second-order stationary point in O( (l(f(x_0) − f*)/ε²) · log⁴( d l Δ_f / (ε² δ) ) ) steps.
Matched up to log factors!

Why √(ρε)?
If we use the third-order approximation at x [Nesterov and Polyak, 2006]:
min_y { ⟨∇f(x), y − x⟩ + ½ ⟨∇²f(x)(y − x), y − x⟩ + (ρ/6) ‖y − x‖³ }
denote the minimizer as T_x, and denote the distance r = ‖x − T_x‖. Then
‖∇f(T_x)‖ ≤ ρ r²,  ∇²f(T_x) ⪰ −(3/2) ρ r I
To get a lower bound for r:
r ≥ max{ √(‖∇f(T_x)‖ / ρ), −(2/(3ρ)) λ_min(∇²f(T_x)) }
The two terms are balanced exactly when ‖∇f(T_x)‖ ≈ ε and −λ_min(∇²f(T_x)) ≈ √(ρε), which is where the √(ρε) threshold comes from.

Related results
1. Gradient Descent Converges to Minimizers, by Lee, Simchowitz, Jordan and Recht, '15: with random initialization, GD almost surely never touches any saddle point, and always converges to local minima.
2. The power of normalization: faster evasion of saddle points, Kfir Levy, '16: normalized gradient descent can escape saddle points in O(d³ poly(1/ε)) steps, slower than [Jin et al 2017], faster than [Ge et al 2015], but still polynomial in d.

Main theorem in [Jin et al 2017] (restated; see the full statement above).

Proof framework: Progress, Escape and Trap
Progress: when ‖∇f(x)‖ > g_thres, f(x) is decreased by at least f_thres/t_thres per step.
Escape: when ‖∇f(x)‖ ≤ g_thres and λ_min(∇²f(x)) ≤ −γ, whp the function value is decreased by f_thres after the perturbation plus t_thres steps, i.e., by f_thres/t_thres on average per step.
Trap: the algorithm can't make progress and escape forever, because f is bounded below! When it stops, the perturbation happened t_thres steps ago but f has decreased by less than f_thres. That means ‖∇f(x)‖ ≤ g_thres before the perturbation, and whp there is no eigenvalue ≤ −γ. So it's a local minimum!

Progress Lemma
Lemma. If f is l-smooth, then for GD with step size η < 1/l, we have:
f(x_{t+1}) ≤ f(x_t) − (η/2) ‖∇f(x_t)‖²
Proof.
f(x_{t+1}) ≤ f(x_t) + ∇f(x_t)ᵀ(x_{t+1} − x_t) + (l/2) ‖x_{t+1} − x_t‖²
= f(x_t) − η ‖∇f(x_t)‖² + (η² l / 2) ‖∇f(x_t)‖²
≤ f(x_t) − (η/2) ‖∇f(x_t)‖²
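
A quick numerical sanity check of this inequality on a random quadratic (the quadratic and the step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)                 # positive definite quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
l = np.linalg.eigvalsh(A)[-1]           # smoothness constant = largest eigenvalue
eta = 0.9 / l                           # any eta < 1/l works

x = rng.normal(size=d)
x_next = x - eta * grad(x)
assert f(x_next) <= f(x) - 0.5 * eta * np.linalg.norm(grad(x)) ** 2
```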

Escape: main idea [figure]

Escape: thin pancake [figure]

Main Lemma: measure the width
Lemma. Suppose we start with a point x̃ satisfying the following conditions:
‖∇f(x̃)‖ ≤ g_thres,  λ_min(∇²f(x̃)) ≤ −γ
Let e_1 be the minimum-eigenvalue eigenvector. Consider two gradient descent sequences {u_t}, {w_t}, with initial points u_0, w_0 satisfying:
‖u_0 − x̃‖ ≤ r,  w_0 = u_0 + μ r e_1,  μ ∈ [δ/(2√d), 1]
Then, for any step size η ≤ c_max/l and any T ≥ t_thres, we have
min{ f(u_T) − f(u_0), f(w_T) − f(w_0) } ≤ −2.5 f_thres
As long as u_0 − w_0 lies along e_1 and ‖u_0 − w_0‖ ≥ δr/(2√d), at least one of them will escape!


Escape Case
Lemma (Escape case). Suppose we start with a point x̃ satisfying the following conditions:
‖∇f(x̃)‖ ≤ g_thres,  λ_min(∇²f(x̃)) ≤ −γ
Let x_0 = x̃ + ξ, where ξ comes from the uniform distribution over the ball with radius r, and let x_t be the iterates of GD starting from x_0. Then when η < c_max/l, with probability at least 1 − δ, for any T ≥ t_thres:
f(x_T) − f(x̃) ≤ −f_thres
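
A small simulation in the spirit of this lemma, on the toy saddle f(x) = ½ xᵀ diag(1, −1) x (all constants here are illustrative choices, not the ones from the paper; the toy f is unbounded below, so we only check that f decreases, i.e., that GD escapes the saddle):

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, -1.0])                 # saddle at the origin: lambda_min = -1
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def perturb_then_gd(x_tilde, r=1e-2, eta=0.1, T=200):
    # Random perturbation uniform over a ball of radius r, then T GD steps.
    xi = rng.normal(size=2)
    xi *= r * rng.uniform() ** 0.5 / np.linalg.norm(xi)
    x = x_tilde + xi
    for _ in range(T):
        x = x - eta * grad(x)
    return x

decreases = [f(perturb_then_gd(np.zeros(2))) - f(np.zeros(2)) for _ in range(100)]
print(f"fraction of runs with significant decrease: {np.mean(np.array(decreases) < -1e-3):.2f}")
```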

Proof of the escape lemma
By smoothness, the perturbation step does not increase f by much:
f(x_0) − f(x̃) ≤ ∇f(x̃)ᵀξ + (l/2) ‖ξ‖² ≤ 1.5 f_thres
By the main lemma, for any x_0 ∈ X_stuck we know (x_0 ± μ r e_1) ∉ X_stuck for μ ∈ [δ/(2√d), 1], so the width of X_stuck along e_1 is at most 2 · δr/(2√d), and
Vol(X_stuck) ≤ Vol(B_x̃^{(d−1)}(r)) · δr/(2√d) · 2
Therefore, the probability that we picked a point in X_stuck is bounded by
Vol(X_stuck) / Vol(B_x̃^{(d)}(r)) ≤ δ
Thus, with probability at least 1 − δ, x_0 ∉ X_stuck, and in this case, by the main lemma,
f(x_T) − f(x̃) ≤ −2.5 f_thres + 1.5 f_thres = −f_thres

How to prove the main Lemma?
- If u_T does not decrease the function value, then {u_0, ..., u_T} stay close to x̃.
- If {u_0, ..., u_T} stay close to x̃, then GD starting from w_0 will decrease the function value.
We will need the following quadratic approximation:
f̃_y(x) = f(y) + ∇f(y)ᵀ(x − y) + ½ (x − y)ᵀ H (x − y), where H = ∇²f(x̃).
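
This approximation written as code (a direct transcription; H is the Hessian at x̃, held fixed across all evaluation points):

```python
import numpy as np

def f_tilde(f, grad_f, H, y, x):
    """Quadratic approximation of f around y, with the Hessian frozen at H = hess f(x_tilde)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return f(y) + grad_f(y) @ diff + 0.5 * diff @ H @ diff
```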

Two lemmas (simplified)
Lemma (u_T-stuck). There exists an absolute constant c_max such that, for any initial point u_0 with ‖u_0 − x̃‖ ≤ r, define
T = min{ inf_t { t | f̃_{u_0}(u_t) − f(u_0) ≤ −3 f_thres }, t_thres }
Then, for any η ≤ c_max/l, we have ‖u_t − x̃‖ ≤ Φ for all t < T.

Lemma (w_T-escape). There exists an absolute constant c_max such that, if we define
T = min{ inf_t { t | f̃_{w_0}(w_t) − f(w_0) ≤ −3 f_thres }, t_thres }
then, for any η ≤ c_max/l, if ‖u_t − x̃‖ ≤ Φ for all t < T, we have T < t_thres.

Prove the main lemma
Assume x̃ is the origin. Define
T = inf_t { t | f̃_{u_0}(u_t) − f(u_0) ≤ −3 f_thres }
Case T ≤ t_thres: We know ‖u_{T−1}‖ ≤ Φ by the u_T-stuck lemma. By a simple calculation, we can show that ‖u_T‖ = O(Φ) as well. Then
f(u_T) − f(u_0) ≤ ∇f(u_0)ᵀ(u_T − u_0) + ½ (u_T − u_0)ᵀ ∇²f(u_0) (u_T − u_0) + (ρ/6) ‖u_T − u_0‖³
≤ f̃_{u_0}(u_T) − f(u_0) + (ρ/2) ‖u_0 − x̃‖ ‖u_T − u_0‖² + (ρ/6) ‖u_T − u_0‖³
≤ −2.5 f_thres

Case T > t_thres: By the u_T-stuck lemma, we know ‖u_t‖ ≤ Φ for all t ≤ t_thres. Using the w_T-escape lemma, we then know
inf_t { t | f̃_{w_0}(w_t) − f(w_0) ≤ −3 f_thres } < t_thres
So we can reduce this to the case T ≤ t_thres, because w and u are interchangeable.

Prove the u_T-stuck lemma
Lemma (u_T-stuck, restated). For any initial point u_0 with ‖u_0 − x̃‖ ≤ r, with T = min{ inf_t { t | f̃_{u_0}(u_t) − f(u_0) ≤ −3 f_thres }, t_thres } and any η ≤ c_max/l, we have ‖u_t − x̃‖ ≤ Φ for all t < T.
We won't move much in the large-negative-eigenvalue directions, because otherwise that would already be a lot of progress!
Consider B_t, the component of u_t in the remaining subspace, where the eigenvalues are ≥ −γ/100:
B_{t+1} ≤ (1 + ηγ/100) B_t + 2η g_thres
If T ≤ t_thres, we have (1 + ηγ/100)^T ≤ 3, so B_T is bounded.

Prove the w_T-escape lemma
Lemma (w_T-escape, restated). With T = min{ inf_t { t | f̃_{w_0}(w_t) − f(w_0) ≤ −3 f_thres }, t_thres } and any η ≤ c_max/l: if ‖u_t − x̃‖ ≤ Φ for all t < T, then T < t_thres.

Let v_t = w_t − u_t.
We want to show that for some T < t_thres, w_T has made progress.
If w_t makes no progress, then by the u_T-stuck lemma it also stays near x̃.
Therefore, we would always have ‖v_t‖ ≤ ‖u_t‖ + ‖w_t‖ ≤ 2Φ.
However, v_t increases very rapidly, so it can't stay small forever!
- In the e_1 direction, v_0 has magnitude at least δr/(2√d).
- Every step, it is multiplied by at least 1 + ηγ.
- Within T < t_thres steps, we get ‖v_T‖ > 2Φ, so w_T must have made progress!
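
A toy illustration of this growth on a pure quadratic saddle (assumptions: an exact quadratic f, so the difference v_t grows along e_1 by exactly the factor 1 + ηγ per step; all constants are made up for the demo):

```python
import numpy as np

gamma, eta, d = 1.0, 0.05, 10
H = np.diag([-gamma] + [1.0] * (d - 1))   # lambda_min = -gamma, with eigenvector e_1
grad = lambda x: H @ x

rng = np.random.default_rng(0)
u = rng.normal(scale=1e-3, size=d)
w = u + 1e-4 * np.eye(d)[0]               # w_0 = u_0 + (mu * r) * e_1: difference only along e_1

for t in range(300):
    u, w = u - eta * grad(u), w - eta * grad(w)

# For a quadratic, v_t = w_t - u_t = (I - eta*H)^t v_0, so the e_1 component
# is multiplied by exactly (1 + eta*gamma) each step.
print(np.linalg.norm(w - u), 1e-4 * (1 + eta * gamma) ** 300)
```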


Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method

Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yingbin Liang Department of EECS Syracuse

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information