Large-scale Machine Learning and Optimization


1 Large-scale Machine Learning and Optimization
Zaiwen Wen
Acknowledgement: these slides are based on lecture notes by Dimitris Papailiopoulos and Shiqian Ma and on the book "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David. Thanks to Yongfeng Li for preparing part of these slides.

2 Outline
1 Problem Description
2 Subgradient Methods
  The gradient and subgradient methods
  Stochastic subgradient methods
3 Stochastic Gradient Descent
  Gradient Descent
  Stochastic Gradient methods
4 Variance Reduction
  SAG method and SAGA method
  SVRG method
5 Stochastic Algorithms in Deep Learning

3 Machine Learning Model
Machine learning model: $(x, y) \sim P$, where $P$ is an underlying distribution. Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $(x_i, y_i) \sim P$ i.i.d. Our goal is to find a hypothesis $h_s$ with the smallest expected risk, i.e.,
$\min_{h_s \in \mathcal{H}} R[h_s] := \mathbb{E}[f(h_s(x), y)] \quad (1)$
where $\mathcal{H}$ is the hypothesis class.

4 Machine Learning Model
In practice, we do not know the exact form of the underlying distribution $P$. Empirical Risk Minimization (ERM):
$\min_{h_s \in \mathcal{H}} \hat{R}[h_s] := \frac{1}{n}\sum_{i=1}^n f(h_s(x_i), y_i) \quad (2)$
We care about two questions on ERM:
When does the empirical risk concentrate around the true risk?
How does the hypothesis class affect the ERM solution?
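
To make ERM concrete, here is a minimal numerical sketch (our own illustration, not part of the original slides): a finite hypothesis class of threshold classifiers $h_\theta(x) = \operatorname{sign}(x - \theta)$ with the 0-1 loss. The ERM solution is found by enumeration and its empirical risk is compared against a large-sample estimate of the expected risk; the data model and all constants are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # assumed data model: x ~ N(0,1), y = sign(x - 0.3) with 10% label noise
    x = rng.normal(size=n)
    y = np.sign(x - 0.3)
    flip = rng.random(n) < 0.1
    y[flip] = -y[flip]
    return x, y

def risk(theta, x, y):
    # empirical 0-1 risk of the threshold classifier h(x) = sign(x - theta)
    return np.mean(np.sign(x - theta) != y)

thetas = np.linspace(-2, 2, 101)         # finite hypothesis class H, |H| = 101
x_tr, y_tr = sample(200)                 # training set defines the empirical risk
theta_hat = min(thetas, key=lambda t: risk(t, x_tr, y_tr))  # ERM by enumeration

x_te, y_te = sample(100_000)             # large sample approximates the expected risk
print("ERM threshold:", theta_hat)
print("empirical risk:", risk(theta_hat, x_tr, y_tr))
print("expected risk (estimate):", risk(theta_hat, x_te, y_te))
```

With more training data the two printed risks move closer together, which is exactly the concentration question raised above.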

5 Machine Learning Model
Empirical risk minimizer: $\hat{h}_n \in \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_n[h]$
Expected risk minimizer: $h^* \in \operatorname{argmin}_{h \in \mathcal{H}} R[h]$
Concentration means that for any $\epsilon > 0$ and $0 < \delta < 1$, if $n$ is large enough, we have
$P(|R[\hat{h}_n] - R[h^*]| \le \epsilon) > 1 - \delta \quad (3)$
This just means that $R[\hat{h}_n]$ converges to $R[h^*]$ in probability. Concentration can fail in some cases.

6 Hoeffding Inequality
Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}(X_i) = \mu$ and $P(a \le X_i \le b) = 1$. Then for any $\epsilon > 0$,
$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \ge \epsilon\right) \le 2\exp\left(-\frac{2n\epsilon^2}{(b-a)^2}\right) \quad (4)$
The Hoeffding inequality quantifies, non-asymptotically, how fast the sample mean concentrates around the expectation. The Azuma-Hoeffding inequality is a martingale version: let $X_1, X_2, \ldots$ be a martingale difference sequence with $|X_i| \le B$ for all $i = 1, 2, \ldots$. Then
$P\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(-\frac{2t^2}{nB^2}\right) \quad (5)$
$P\left(\sum_{i=1}^n X_i \le -t\right) \le \exp\left(-\frac{2t^2}{nB^2}\right) \quad (6)$
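
A quick Monte Carlo sanity check of (4) (our own illustration) for Bernoulli variables, where $a = 0$ and $b = 1$; the sample size, threshold, and number of trials are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 200, 0.1, 100_000
mu = 0.5                                   # Bernoulli(0.5), so a = 0, b = 1
X = rng.random((trials, n)) < mu           # each row: n i.i.d. Bernoulli samples
dev = np.abs(X.mean(axis=1) - mu)          # deviation of the sample mean
print("empirical P(|mean - mu| >= eps):", np.mean(dev >= eps))
print("Hoeffding bound 2*exp(-2*n*eps^2):", 2 * np.exp(-2 * n * eps**2))
```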

7 To make the exposition simpler, we assume that our loss function satisfies $0 \le f(a, b) \le 1$ for all $a, b$. By the Hoeffding inequality, for fixed $h$,
$P(|\hat{R}_n[h] - R[h]| \ge \epsilon) \le 2e^{-2n\epsilon^2} \quad (7)$
Union bound:
$P\left(\bigcup_{h \in \mathcal{H}} \{|\hat{R}_n[h] - R[h]| \ge \epsilon\}\right) \le 2|\mathcal{H}| e^{-2n\epsilon^2} \quad (8)$
If we want to bound $P(\bigcup_{h \in \mathcal{H}} \{|\hat{R}_n[h] - R[h]| \ge \epsilon\}) \le \delta$, we need the sample size
$n \ge \frac{1}{2\epsilon^2}\log\left(\frac{2|\mathcal{H}|}{\delta}\right) = O\left(\frac{\log|\mathcal{H}| + \log(\delta^{-1})}{\epsilon^2}\right) \quad (9)$
What if $|\mathcal{H}| = \infty$? This bound doesn't work.

8 If $n$ is large enough, then with probability $1 - \delta$ we have
$R[\hat{h}_n] - R[h^*] = (R[\hat{h}_n] - \hat{R}_n[\hat{h}_n]) + (\hat{R}_n[\hat{h}_n] - \hat{R}_n[h^*]) + (\hat{R}_n[h^*] - R[h^*]) \le \epsilon + 0 + \epsilon$
where the middle term is at most 0 because $\hat{h}_n$ minimizes the empirical risk. For a two-label classification problem, with probability $1 - \delta$ we have
$\sup_{h \in \mathcal{H}} |\hat{R}_n[h] - R[h]| \le O\left(\sqrt{\frac{VC[\mathcal{H}]\log\left(\frac{n}{VC[\mathcal{H}]}\right) + \log\left(\frac{1}{\delta}\right)}{n}}\right) \quad (10)$
where $VC[\mathcal{H}]$ is the VC dimension of $\mathcal{H}$. A finite VC dimension is a necessary and sufficient condition for empirical risk concentration in two-label classification.

9 VC dimension
VC dimension of a set family: Let $\mathcal{H}$ be a set family (a set of sets) and $C$ a set. Their intersection is defined as the set family
$\mathcal{H} \cap C := \{h \cap C \mid h \in \mathcal{H}\}$
We say that a set $C$ is shattered by $\mathcal{H}$ if $|\mathcal{H} \cap C| = 2^{|C|}$. The VC dimension of $\mathcal{H}$ is the largest integer $D$ such that there exists a set $C$ with cardinality $D$ that is shattered by $\mathcal{H}$.
A classification model $f$ with parameter vector $\theta$ is said to shatter a set of data points $(x_1, x_2, \ldots, x_n)$ if, for all assignments of labels to those points, there exists a $\theta$ such that the model $f$ makes no errors when evaluating that set of data points. The VC dimension of a model $f$ is the maximum number of points that can be arranged so that $f$ shatters them; more formally, it is the maximum cardinality $D$ such that some data point set of cardinality $D$ can be shattered by $f$.

10 VC dimension examples:
If for any $n$ and any $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ there exists $h \in \mathcal{H}$ s.t. $h(x_i) = y_i$ for all $i$, then $VC[\mathcal{H}] = \infty$.
For a neural network whose activation functions are all sign functions, $VC[\mathcal{H}] \le O(w\log w)$, where $w$ is the number of parameters.
We must use prior knowledge and choose a proper hypothesis class. Suppose $a$ is the true model:
$R[\hat{h}_n] - R[a] = \underbrace{(R[\hat{h}_n] - R[h^*])}_{A} + \underbrace{(R[h^*] - R[a])}_{B}$
If the hypothesis class is too large, $B$ will be small but $A$ will be large (overfitting). If the hypothesis class is too small, $A$ will be small but $B$ will be large (underfitting).

11 Outline
1 Problem Description
2 Subgradient Methods
  The gradient and subgradient methods
  Stochastic subgradient methods
3 Stochastic Gradient Descent
  Gradient Descent
  Stochastic Gradient methods
4 Variance Reduction
  SAG method and SAGA method
  SVRG method
5 Stochastic Algorithms in Deep Learning

12 The gradient and subgradient methods
Consider the problem
$\min_{x \in \mathbb{R}^n} f(x) \quad (11)$
gradient method:
$x_{k+1} = x_k - \alpha_k \nabla f(x_k) \quad (12)$
subgradient method:
$x_{k+1} = x_k - \alpha_k g_k, \quad g_k \in \partial f(x_k) \quad (13)$
The update is equivalent to
$x_{k+1} = \operatorname{argmin}_x f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2 \quad (14)$
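
A minimal numpy sketch of iteration (13) (our own illustration) for the nonsmooth convex function $f(x) = \|Ax - b\|_1$, one of whose subgradients is $A^T\operatorname{sign}(Ax - b)$; the random problem data and the diminishing stepsize are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)

f = lambda x: np.abs(A @ x - b).sum()    # f(x) = ||Ax - b||_1, convex, nonsmooth

x = np.zeros(10)
best = np.inf
for k in range(1, 5001):
    g = A.T @ np.sign(A @ x - b)         # a subgradient g_k of f at x_k
    x = x - (0.01 / np.sqrt(k)) * g      # x_{k+1} = x_k - alpha_k g_k, diminishing stepsize
    best = min(best, f(x))               # track the best value; f(x_k) need not be monotone
print("best f value:", best)
```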

13 Convergence guarantees
Assumption:
There is at least one minimizing point $x^* \in \operatorname{argmin}_x f(x)$ with $f(x^*) > -\infty$.
The subgradients are bounded: $\|g\|_2 \le M$ for all $x$ and all $g \in \partial f(x)$.
Theorem 1: Convergence of the subgradient method
Let $\alpha_k \ge 0$ be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let $x_k$ be generated by the subgradient iteration. Then for all $K \ge 1$,
$\sum_{k=1}^K \alpha_k [f(x_k) - f(x^*)] \le \frac{1}{2}\|x_1 - x^*\|_2^2 + \frac{1}{2}\sum_{k=1}^K \alpha_k^2 M^2 \quad (15)$

14 Proof of Theorem 1
By convexity, $\langle g_k, x^* - x_k \rangle \le f(x^*) - f(x_k)$. Hence
$\frac{1}{2}\|x_{k+1} - x^*\|_2^2 = \frac{1}{2}\|x_k - \alpha_k g_k - x^*\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k\langle g_k, x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 \le \frac{1}{2}\|x_k - x^*\|_2^2 - \alpha_k(f(x_k) - f(x^*)) + \frac{\alpha_k^2}{2}M^2$
so
$\alpha_k(f(x_k) - f(x^*)) \le \frac{1}{2}\|x_k - x^*\|_2^2 - \frac{1}{2}\|x_{k+1} - x^*\|_2^2 + \frac{\alpha_k^2}{2}M^2$
Summing over $k$ and telescoping gives the theorem.

15 Convergence guarantees
Corollary: Let $A_K = \sum_{i=1}^K \alpha_i$ and define $\bar{x}_K = \frac{1}{A_K}\sum_{k=1}^K \alpha_k x_k$. Then
$f(\bar{x}_K) - f(x^*) \le \frac{\|x_1 - x^*\|_2^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2\sum_{k=1}^K \alpha_k}$
Convergence: $\sum_{k=1}^\infty \alpha_k = \infty$ and $\frac{\sum_{k=1}^K \alpha_k^2}{\sum_{k=1}^K \alpha_k} \to 0$.
Let $\|x_1 - x^*\|_2 \le R$. For a fixed stepsize $\alpha_k = \alpha$:
$f(\bar{x}_K) - f(x^*) \le \frac{R^2}{2K\alpha} + \frac{\alpha M^2}{2}$
For a given $K$, take $\alpha = \frac{R}{M\sqrt{K}}$:
$f(\bar{x}_K) - f(x^*) \le \frac{RM}{\sqrt{K}} \quad (16)$

16 Projected subgradient methods
Consider the problem
$\min_{x \in C} f(x) \quad (17)$
subgradient method:
$x_{k+1} = \pi_C(x_k - \alpha_k g_k), \quad g_k \in \partial f(x_k) \quad (18)$
projection: $\pi_C(x) = \operatorname{argmin}_{y \in C} \|x - y\|_2^2$
The update is equivalent to
$x_{k+1} = \operatorname{argmin}_{x \in C} f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2 \quad (19)$
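
For simple sets $C$ the projection $\pi_C$ has a closed form. A small sketch (our own illustration) for the Euclidean ball and a box, followed by one projected subgradient step on made-up data:

```python
import numpy as np

def proj_ball(x, r=1.0):
    # projection onto C = {x : ||x||_2 <= r}
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def proj_box(x, lo=-1.0, hi=1.0):
    # projection onto C = {x : lo <= x_i <= hi for all i}
    return np.clip(x, lo, hi)

# one projected subgradient step: x_{k+1} = pi_C(x_k - alpha_k g_k)
x, g, alpha = np.ones(5), np.full(5, 2.0), 0.5
print(proj_ball(x - alpha * g))
print(proj_box(x - alpha * g))
```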

17 Convergence guarantees
Assumption:
The set $C \subseteq \mathbb{R}^n$ is compact and convex, and $\|x - x^*\|_2 \le R < \infty$ for all $x \in C$.
The subgradients are bounded: $\|g\|_2 \le M$ for all $x$ and all $g \in \partial f(x)$.
Theorem 2: Convergence of the projected subgradient method
Let $\alpha_k \ge 0$ be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let $x_k$ be generated by the projected subgradient iteration. Then for all $K \ge 1$,
$\sum_{k=1}^K \alpha_k [f(x_k) - f(x^*)] \le \frac{1}{2}\|x_1 - x^*\|_2^2 + \frac{1}{2}\sum_{k=1}^K \alpha_k^2 M^2 \quad (20)$

18 Proof of Theorem 2
By the non-expansiveness of $\pi_C$,
$\|x_{k+1} - x^*\|_2^2 = \|\pi_C(x_k - \alpha_k g_k) - x^*\|_2^2 \le \|x_k - \alpha_k g_k - x^*\|_2^2$
Then, as before,
$\frac{1}{2}\|x_{k+1} - x^*\|_2^2 \le \frac{1}{2}\|x_k - \alpha_k g_k - x^*\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k\langle g_k, x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 \le \frac{1}{2}\|x_k - x^*\|_2^2 - \alpha_k(f(x_k) - f(x^*)) + \frac{\alpha_k^2}{2}M^2$
so
$\alpha_k(f(x_k) - f(x^*)) \le \frac{1}{2}\|x_k - x^*\|_2^2 - \frac{1}{2}\|x_{k+1} - x^*\|_2^2 + \frac{\alpha_k^2}{2}M^2$

19 Convergence guarantees
Corollary: Let $A_K = \sum_{i=1}^K \alpha_i$ and define $\bar{x}_K = \frac{1}{A_K}\sum_{k=1}^K \alpha_k x_k$. Then
$f(\bar{x}_K) - f(x^*) \le \frac{\|x_1 - x^*\|_2^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2\sum_{k=1}^K \alpha_k}$
Convergence: $\sum_{k=1}^\infty \alpha_k = \infty$ and $\frac{\sum_{k=1}^K \alpha_k^2}{\sum_{k=1}^K \alpha_k} \to 0$.
For a fixed stepsize $\alpha_k = \alpha$:
$f(\bar{x}_K) - f(x^*) \le \frac{R^2}{2K\alpha} + \frac{\alpha M^2}{2} \quad (21)$
Take $\alpha = \frac{R}{M\sqrt{K}}$:
$f(\bar{x}_K) - f(x^*) \le \frac{RM}{\sqrt{K}}$

20 Stochastic subgradient methods
the stochastic optimization problem
$\min_{x \in C} f(x) := \mathbb{E}_P[F(x; S)] \quad (22)$
$S$ is a random variable on the sample space $\mathcal{S}$ with distribution $P$.
For each $s$, $x \mapsto F(x; s)$ is convex.
The subgradient: $\mathbb{E}_P[g(x; S)] \in \partial f(x)$, where $g(x; s) \in \partial F(x; s)$, since
$f(y) = \mathbb{E}_P[F(y; S)] \ge \mathbb{E}_P[F(x; S) + \langle g(x; S), y - x \rangle] = f(x) + \langle \mathbb{E}_P[g(x; S)], y - x \rangle$

21 Stochastic subgradient methods
the deterministic optimization problem
$\min_{x \in C} f(x) := \frac{1}{m}\sum_{i=1}^m F(x; s_i) \quad (23)$
Why stochastic?
$\mathbb{E}_P[F(x; S)]$ is generally intractable to compute.
Small per-iteration cost: only one subgradient $g(x; s) \in \partial F(x; s)$ needs to be computed in each iteration.
More likely to reach a global solution in the non-convex case.
stochastic subgradient method:
$x_{k+1} = \pi_C(x_k - \alpha_k g_k), \quad \mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$
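
A sketch (our own illustration) of the stochastic subgradient method for a finite-sum problem of the form (23) with hinge-loss components $F(x; s_i) = \max(0, 1 - y_i\langle a_i, x\rangle)$ over the unit ball; sampling one component per iteration gives $g_k$ with $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$. The data and stepsize are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 20
Adata = rng.normal(size=(m, d))
y = np.sign(Adata @ rng.normal(size=d))   # labels from an assumed linear model

def proj_ball(x, r=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

x = np.zeros(d)
avg = np.zeros(d)
K = 20_000
for k in range(1, K + 1):
    i = rng.integers(m)                                   # sample one component uniformly
    margin = y[i] * (Adata[i] @ x)
    g = -y[i] * Adata[i] if margin < 1 else np.zeros(d)   # subgradient of the hinge term
    x = proj_ball(x - (1.0 / np.sqrt(k)) * g)             # projected stochastic subgradient step
    avg += x
avg /= K                                                  # averaged iterate, as in the corollaries
print("average hinge loss:", np.maximum(0.0, 1.0 - y * (Adata @ avg)).mean())
```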

22 Convergence guarantees
Assumption:
The set $C \subseteq \mathbb{R}^n$ is compact and convex, and $\|x - x^*\|_2 \le R < \infty$ for all $x \in C$.
The variance is bounded: $\mathbb{E}\|g(x, S)\|_2^2 \le M^2$ for all $x$.
Theorem 3: Convergence of the stochastic subgradient method
Let $\alpha_k \ge 0$ be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let $x_k$ be generated by the stochastic subgradient iteration. Then for all $K \ge 1$,
$\sum_{k=1}^K \alpha_k \mathbb{E}[f(x_k) - f(x^*)] \le \frac{1}{2}\mathbb{E}\|x_1 - x^*\|_2^2 + \sum_{k=1}^K \frac{\alpha_k^2 M^2}{2} \quad (24)$

23 Proof of Theorem 3
Let $\nabla f(x_k) := \mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$ and $\xi_k = g_k - \nabla f(x_k)$. Then
$\frac{1}{2}\|x_{k+1} - x^*\|_2^2 \le \frac{1}{2}\|x_k - \alpha_k g_k - x^*\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k\langle g_k, x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k\langle \nabla f(x_k), x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 + \alpha_k\langle \xi_k, x^* - x_k \rangle \le \frac{1}{2}\|x_k - x^*\|_2^2 - \alpha_k(f(x_k) - f(x^*)) + \frac{\alpha_k^2}{2}\|g_k\|_2^2 + \alpha_k\langle \xi_k, x^* - x_k \rangle$
Since $\mathbb{E}[\langle \xi_k, x^* - x_k \rangle] = \mathbb{E}[\mathbb{E}[\langle \xi_k, x^* - x_k \rangle \mid x_k]] = 0$, taking expectations gives
$\alpha_k \mathbb{E}[f(x_k) - f(x^*)] \le \frac{1}{2}\mathbb{E}\|x_k - x^*\|_2^2 - \frac{1}{2}\mathbb{E}\|x_{k+1} - x^*\|_2^2 + \frac{\alpha_k^2}{2}M^2$

24 Convergence guarantees
Corollary: Let $A_K = \sum_{i=1}^K \alpha_i$ and define $\bar{x}_K = \frac{1}{A_K}\sum_{k=1}^K \alpha_k x_k$. Then
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{R^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2\sum_{k=1}^K \alpha_k}$
Convergence: $\sum_{k=1}^\infty \alpha_k = \infty$ and $\frac{\sum_{k=1}^K \alpha_k^2}{\sum_{k=1}^K \alpha_k} \to 0$.
For a fixed stepsize $\alpha_k = \alpha$:
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{R^2}{2K\alpha} + \frac{\alpha M^2}{2} \quad (25)$
Take $\alpha = \frac{R}{M\sqrt{K}}$:
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{RM}{\sqrt{K}}$

25 Theorem 5: Convergence of the stochastic subgradient method
Let $\alpha_k > 0$ be a non-increasing sequence of stepsizes and let the preceding assumptions hold. Let $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$. Then
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 \quad (26)$
The proof divides the per-step bound by $\alpha_k$:
$\alpha_k \mathbb{E}[f(x_k) - f(x^*)] \le \frac{1}{2}\mathbb{E}\|x_k - x^*\|_2^2 - \frac{1}{2}\mathbb{E}\|x_{k+1} - x^*\|_2^2 + \frac{\alpha_k^2}{2}M^2$
$\mathbb{E}[f(x_k) - f(x^*)] \le \frac{1}{2\alpha_k}\mathbb{E}\|x_k - x^*\|_2^2 - \frac{1}{2\alpha_k}\mathbb{E}\|x_{k+1} - x^*\|_2^2 + \frac{\alpha_k}{2}M^2$

26 Corollary: Let the conditions of Theorem 5 hold, and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each $k$. Then
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{3RM}{2\sqrt{K}} \quad (27)$
proof: $\sum_{k=1}^K \frac{1}{\sqrt{k}} \le \int_0^K \frac{1}{\sqrt{t}}\,dt = 2\sqrt{K}$
Corollary: Let $\alpha_k$ be chosen such that $\mathbb{E}[f(\bar{x}_K) - f(x^*)] \to 0$. Then $f(\bar{x}_K) - f(x^*) \stackrel{P}{\to} 0$ as $K \to \infty$; that is, for all $\epsilon > 0$ we have
$\limsup_{K \to \infty} P(f(\bar{x}_K) - f(x^*) \ge \epsilon) = 0 \quad (28)$

27 By Markov's inequality, $P(X \ge \alpha) \le \frac{\mathbb{E}X}{\alpha}$ for $X \ge 0$ and $\alpha > 0$, so
$P(f(\bar{x}_K) - f(x^*) \ge \epsilon) \le \frac{1}{\epsilon}\mathbb{E}[f(\bar{x}_K) - f(x^*)] \to 0$
Theorem 6: Convergence of the stochastic subgradient method
In addition to the conditions of Theorem 5, assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$, with probability at least $1 - e^{-\frac{1}{2}\epsilon^2}$,
$f(\bar{x}_K) - f(x^*) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\epsilon \quad (29)$
Taking $\alpha_k = \frac{R}{M\sqrt{k}}$ and setting $\delta = e^{-\frac{1}{2}\epsilon^2}$, with probability at least $1 - \delta$,
$f(\bar{x}_K) - f(x^*) \le \frac{3RM}{2\sqrt{K}} + \frac{RM\sqrt{2\log\frac{1}{\delta}}}{\sqrt{K}}$

28 Azuma-Hoeffding Inequality
martingale: A sequence $X_1, X_2, \ldots$ of random vectors is a martingale if there is a sequence of random vectors $Z_1, Z_2, \ldots$ such that for each $n$, $X_n$ is a function of $Z_n$, $Z_{n-1}$ is a function of $Z_n$, and the conditional expectation satisfies $\mathbb{E}[X_n \mid Z_{n-1}] = X_{n-1}$.
martingale difference sequence: $X_1, X_2, \ldots$ is a martingale difference sequence if $S_n = \sum_{i=1}^n X_i$ is a martingale or, equivalently, $\mathbb{E}[X_n \mid Z_{n-1}] = 0$.
example: $X_1, X_2, \ldots$ independent with $\mathbb{E}(X_i) = 0$ and $Z_i = (X_1, \ldots, X_i)$.

29 Azuma-Hoeffding Inequality
Let $X_1, X_2, \ldots$ be a martingale difference sequence with $|X_i| \le B$ for all $i = 1, 2, \ldots$. Then
$P\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(-\frac{2t^2}{nB^2}\right) \quad (30)$
$P\left(\sum_{i=1}^n X_i \le -t\right) \le \exp\left(-\frac{2t^2}{nB^2}\right) \quad (31)$
Let $\delta = \frac{t}{n}$:
$P\left(\frac{1}{n}\sum_{i=1}^n X_i \ge \delta\right) \le \exp\left(-\frac{2n\delta^2}{B^2}\right)$
For $X_1, X_2, \ldots$ i.i.d. with $\mathbb{E}X_i = \mu$:
$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \ge \delta\right) \le 2\exp\left(-\frac{2n\delta^2}{B^2}\right)$

30 Azuma-Hoeffding Inequality
Theorem 5 (restated): Let $\alpha_k > 0$ be a non-increasing sequence of stepsizes and let the preceding assumptions hold. Let $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$. Then
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 \quad (32)$
Theorem 6 (restated): In addition to the conditions of Theorem 5, assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$,
$f(\bar{x}_K) - f(x^*) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\epsilon \quad (33)$
with probability at least $1 - e^{-\frac{1}{2}\epsilon^2}$.

31 Proof of Theorem 6
Let $\nabla f(x_k) := \mathbb{E}[g_k \mid x_k]$ and $\xi_k = g_k - \nabla f(x_k)$. Then
$\frac{1}{2}\|x_{k+1} - x^*\|_2^2 = \frac{1}{2}\|x_k - \alpha_k g_k - x^*\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k\langle g_k, x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 = \frac{1}{2}\|x_k - x^*\|_2^2 + \alpha_k\langle \nabla f(x_k), x^* - x_k \rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 + \alpha_k\langle \xi_k, x^* - x_k \rangle \le \frac{1}{2}\|x_k - x^*\|_2^2 - \alpha_k(f(x_k) - f(x^*)) + \frac{\alpha_k^2}{2}\|g_k\|_2^2 + \alpha_k\langle \xi_k, x^* - x_k \rangle$
so
$f(x_k) - f(x^*) \le \frac{1}{2\alpha_k}\|x_k - x^*\|_2^2 - \frac{1}{2\alpha_k}\|x_{k+1} - x^*\|_2^2 + \frac{\alpha_k}{2}\|g_k\|_2^2 + \langle \xi_k, x^* - x_k \rangle$

32 Proof of Theorem 6
$f(\bar{x}_K) - f(x^*) \le \frac{1}{K}\sum_{k=1}^K f(x_k) - f(x^*) \le \frac{1}{2K\alpha_K}\|x_1 - x^*\|_2^2 + \frac{1}{2K}\sum_{k=1}^K \alpha_k\|g_k\|_2^2 + \frac{1}{K}\sum_{k=1}^K \langle \xi_k, x^* - x_k \rangle \le \frac{1}{2K\alpha_K}\|x_1 - x^*\|_2^2 + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{1}{K}\sum_{k=1}^K \langle \xi_k, x^* - x_k \rangle$
Let $\omega = \frac{1}{2K\alpha_K}\|x_1 - x^*\|_2^2 + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2$. Then
$P(f(\bar{x}_K) - f(x^*) \ge \omega + t) \le P\left(\frac{1}{K}\sum_{k=1}^K \langle \xi_k, x^* - x_k \rangle \ge t\right)$

33 Proof of Theorem 6
$\langle \xi_k, x^* - x_k \rangle$ is a bounded martingale difference sequence with $Z_k = (x_1, \ldots, x_{k+1})$:
Since $\mathbb{E}[\xi_k \mid Z_{k-1}] = 0$ and $x_k$ is a function of $Z_{k-1}$, we get $\mathbb{E}[\langle \xi_k, x^* - x_k \rangle \mid Z_{k-1}] = 0$.
Since $\|\xi_k\|_2 = \|g_k - \nabla f(x_k)\|_2 \le 2M$,
$|\langle \xi_k, x^* - x_k \rangle| \le \|\xi_k\|_2\|x^* - x_k\|_2 \le 2MR$
By the Azuma-Hoeffding inequality,
$P\left(\sum_{k=1}^K \langle \xi_k, x^* - x_k \rangle \ge t\right) \le \exp\left(-\frac{t^2}{2KM^2R^2}\right)$
Substituting $t = MR\sqrt{K}\epsilon$,
$P\left(\frac{1}{K}\sum_{k=1}^K \langle \xi_k, x^* - x_k \rangle \ge \frac{MR\epsilon}{\sqrt{K}}\right) \le \exp\left(-\frac{\epsilon^2}{2}\right)$

34 Adaptive stepsizes
Choose an appropriate metric and an associated distance-generating function $h$.
It may be advantageous to adapt the metric being used, or at least the stepsizes, to achieve faster convergence guarantees.
A simple scheme: $h(x) = \frac{1}{2}x^T A x$, where $A$ may change depending on information observed during the solution of the problem.

35 Adaptive stepsizes
Recall the bound
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \mathbb{E}\left[\frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k\|g_k\|_2^2\right] \quad (34)$
Taking $\alpha_k = R/\sqrt{\sum_{i=1}^k \|g_i\|_2^2}$,
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{2R}{K}\mathbb{E}\left[\left(\sum_{k=1}^K \|g_k\|_2^2\right)^{1/2}\right] \quad (35)$
If $\mathbb{E}[\|g_k\|_2^2] \le M^2$ for all $k$, then
$\mathbb{E}\left[\left(\sum_{k=1}^K \|g_k\|_2^2\right)^{1/2}\right] \le \left(\sum_{k=1}^K \mathbb{E}\|g_k\|_2^2\right)^{1/2} \le \sqrt{M^2 K} = M\sqrt{K} \quad (36)$

36 Variable metric methods
Variable metric update:
$x_{k+1} = \operatorname{argmin}_{x \in C}\left\{\langle g_k, x \rangle + \frac{1}{2}\langle x - x_k, H_k(x - x_k) \rangle\right\}$
Projected subgradient method: $H_k = \frac{1}{\alpha_k}I$
Newton's method: $H_k = \nabla^2 f(x_k)$
AdaGrad: $H_k = \frac{1}{\alpha}\operatorname{diag}\left(\sum_{i=1}^k g_i \odot g_i\right)^{1/2}$
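
A minimal sketch of the AdaGrad choice of $H_k$ above, written with a diagonal accumulator rather than an explicit matrix (our own illustration; the quadratic test objective, $\alpha$, and the small $\epsilon$ added for numerical stability are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
c = rng.normal(size=d)
grad = lambda x: x - c                       # gradient of f(x) = 0.5*||x - c||^2

x = np.zeros(d)
G = np.zeros(d)                              # running sum of squared gradients: diag(sum g_i . g_i)
alpha, eps = 1.0, 1e-8
for k in range(500):
    g = grad(x)
    G += g * g
    x = x - alpha * g / np.sqrt(G + eps)     # H_k = diag(G)^(1/2)/alpha, so the step is alpha*g/sqrt(G)
print("distance to minimizer:", np.linalg.norm(x - c))
```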

37 Variable metric methods
Theorem 9: Convergence of variable metric methods
Let $H_k \succ 0$ be a sequence of positive definite matrices, where $H_k$ is a function of $g_1, \ldots, g_k$. Let $g_k$ be a stochastic subgradient with $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$. Then
$\mathbb{E}\left[\sum_{k=1}^K (f(x_k) - f(x^*))\right] \le \frac{1}{2}\mathbb{E}\left[\sum_{k=2}^K \left(\|x_k - x^*\|_{H_k}^2 - \|x_k - x^*\|_{H_{k-1}}^2\right)\right] + \frac{1}{2}\mathbb{E}\left[\|x_1 - x^*\|_{H_1}^2 + \sum_{k=1}^K \|g_k\|_{H_k^{-1}}^2\right]$
which follows by summing the per-step bound
$\mathbb{E}[f(x_k) - f(x^*)] \le \frac{1}{2}\mathbb{E}\left[\|x_k - x^*\|_{H_k}^2 - \|x_{k+1} - x^*\|_{H_k}^2 + \|g_k\|_{H_k^{-1}}^2\right]$

38 Proof of Theorem 9
Use the non-expansiveness of $\pi_C$ under the norm $\|x\|_{H_k}^2 = \langle x, H_k x \rangle$ and define $\xi_k = g_k - \nabla f(x_k)$:
$\|x_{k+1} - x^*\|_{H_k}^2 \le \|x_k - H_k^{-1}g_k - x^*\|_{H_k}^2$
$\frac{1}{2}\|x_{k+1} - x^*\|_{H_k}^2 \le \frac{1}{2}\|x_k - x^*\|_{H_k}^2 + \langle g_k, x^* - x_k \rangle + \frac{1}{2}\|g_k\|_{H_k^{-1}}^2 = \frac{1}{2}\|x_k - x^*\|_{H_k}^2 + \langle \nabla f(x_k), x^* - x_k \rangle + \frac{1}{2}\|g_k\|_{H_k^{-1}}^2 + \langle \xi_k, x^* - x_k \rangle \le \frac{1}{2}\|x_k - x^*\|_{H_k}^2 - (f(x_k) - f(x^*)) + \frac{1}{2}\|g_k\|_{H_k^{-1}}^2 + \langle \xi_k, x^* - x_k \rangle$
Taking expectations,
$\mathbb{E}[f(x_k) - f(x^*)] \le \frac{1}{2}\mathbb{E}\left[\|x_k - x^*\|_{H_k}^2 - \|x_{k+1} - x^*\|_{H_k}^2 + \|g_k\|_{H_k^{-1}}^2\right]$

39 Variable metric methods
Corollary: Convergence of AdaGrad
Let $R_\infty = \sup_{x \in C}\|x - x^*\|_\infty$ and let the conditions of Theorem 9 hold. Then
$\mathbb{E}\left[\sum_{k=1}^K (f(x_k) - f(x^*))\right] \le \frac{1}{2\alpha}R_\infty^2\,\mathbb{E}[\operatorname{tr}(M_K)] + \alpha\,\mathbb{E}[\operatorname{tr}(M_K)] \quad (37)$
where $M_k = \operatorname{diag}\left(\sum_{i=1}^k g_i \odot g_i\right)^{1/2}$ and $H_k = \frac{1}{\alpha}M_k$. Let $\alpha = R_\infty$; then
$\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{3}{2K}R_\infty\,\mathbb{E}[\operatorname{tr}(M_K)] = \frac{3}{2K}R_\infty\sum_{j=1}^n \mathbb{E}\left[\left(\sum_{k=1}^K g_{k,j}^2\right)^{1/2}\right]$
If $C = \{x : \|x\|_\infty \le 1\}$, this bound is lower than the one obtained with the adaptive scalar stepsize.

40 Proof of Corollary
Aim: $\|x_k - x^*\|_{H_k}^2 - \|x_k - x^*\|_{H_{k-1}}^2 \le \|x_k - x^*\|_\infty^2\operatorname{tr}(H_k - H_{k-1})$.
Let $z = x - x^*$. Since the $H_k$ are diagonal with entries $H_{k,j}$,
$\|z\|_{H_k}^2 - \|z\|_{H_{k-1}}^2 = \sum_{j=1}^n H_{k,j}z_j^2 - \sum_{j=1}^n H_{k-1,j}z_j^2 = \sum_{j=1}^n (H_{k,j} - H_{k-1,j})z_j^2 \le \|z\|_\infty^2\sum_{j=1}^n (H_{k,j} - H_{k-1,j}) = \|z\|_\infty^2\operatorname{tr}(H_k - H_{k-1})$

41 Proof of Corollary
Assume $a = (a_1, a_2, \ldots, a_T)$; a simple inequality (proved by induction):
$\sum_{t=1}^T \frac{a_t^2}{\sqrt{\sum_{i=1}^t a_i^2}} \le 2\sqrt{\sum_{t=1}^T a_t^2}$
Aim: $\sum_{k=1}^K \|g_k\|_{H_k^{-1}}^2 \le 2\alpha\operatorname{tr}(M_K)$. Writing $M_{k,j}$ for the $j$-th diagonal entry of $M_k$,
$\sum_{k=1}^K \|g_k\|_{H_k^{-1}}^2 = \alpha\sum_{k=1}^K\sum_{j=1}^n \frac{g_{k,j}^2}{M_{k,j}} = \alpha\sum_{j=1}^n\sum_{k=1}^K \frac{g_{k,j}^2}{\sqrt{\sum_{i=1}^k g_{i,j}^2}} \le 2\alpha\sum_{j=1}^n\sqrt{\sum_{i=1}^K g_{i,j}^2} = 2\alpha\operatorname{tr}(M_K) \quad (38)$

42 Optimality Guarantees
Abstract description:
A collection $\mathcal{F}$ of convex functions $f: \mathbb{R}^n \to \mathbb{R}$.
A closed convex set $C \subseteq \mathbb{R}^n$ over which we optimize.
A collection $\mathcal{G}$ of stochastic gradient oracles, each consisting of a sample space $\mathcal{S}$, a gradient mapping $g: \mathbb{R}^n \times \mathcal{S} \times \mathcal{F} \to \mathbb{R}^n$, and (implicitly) a probability distribution $P$ on $\mathcal{S}$.
The stochastic gradient oracle may be queried at a point $x$; when queried, it draws $s \sim P$ and returns $g(x, s, f)$ with the property that $\mathbb{E}[g(x, s, f)] \in \partial f(x)$.
minimax risk:
$\mathcal{M}(C, \mathcal{G}, \mathcal{F}) := \inf_{\hat{x}_K}\sup_{g \in \mathcal{G}}\sup_{f \in \mathcal{F}} \mathbb{E}\left[f(\hat{x}_K(g_1, \ldots, g_K)) - \inf_{x \in C} f(x)\right]$

43 Optimality Guarantees [figure]

44 Optimality Guarantees
Theorem 10: Optimality Guarantees
Let $C \subseteq \mathbb{R}^n$ be a convex set containing an $\ell_2$ ball of radius $R$, and let $\mathcal{G}$ denote the collection of distributions generating stochastic subgradients with $\|g\|_2 \le M$ with probability 1. Then
$\mathcal{M}(C, \mathcal{G}, \mathcal{F}) \ge \frac{RM}{\sqrt{96K}}$
Theorem 11: Optimality Guarantees
Let $\mathcal{G}$ and the stochastic gradient oracle be as above, and assume $C \supseteq [-R, R]^n$. Then
$\mathcal{M}(C, \mathcal{G}, \mathcal{F}) \ge RM\min\left\{\frac{1}{\sqrt{5}}, \sqrt{\frac{n}{96K}}\right\}$

45 Summary
expectation: $\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{3RM}{2\sqrt{K}}$
convergence in probability: $f(\bar{x}_K) - f(x^*) \le \frac{3RM}{2\sqrt{K}} + \frac{RM\sqrt{2\log\frac{1}{\delta}}}{\sqrt{K}}$ with probability at least $1 - \delta$
The optimal order: $\mathcal{M}(C, \mathcal{G}, \mathcal{F}) \ge \frac{RM}{\sqrt{96K}}$
Using a proper metric and an adaptive strategy can improve the convergence: the mirror descent method and AdaGrad.

46 Outline
1 Problem Description
2 Subgradient Methods
  The gradient and subgradient methods
  Stochastic subgradient methods
3 Stochastic Gradient Descent
  Gradient Descent
  Stochastic Gradient methods
4 Variance Reduction
  SAG method and SAGA method
  SVRG method
5 Stochastic Algorithms in Deep Learning

47 Gradient methods
Rewrite the ERM problem as
$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \quad (39)$
gradient method:
$x_{k+1} = x_k - \alpha_k \nabla f(x_k) \quad (40)$
The update is equivalent to
$x_{k+1} = \operatorname{argmin}_x f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2 \quad (41)$

48 Basic Properties
We only consider convex differentiable functions.
convex functions: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)$ for all $\lambda \in [0,1]$ and all $x, y$
M-Lipschitz functions: $|f(x) - f(y)| \le M\|x - y\|_2$
L-smooth functions: $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$
µ-strongly convex functions: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y) - \frac{\mu}{2}\lambda(1-\lambda)\|x - y\|_2^2$ for all $\lambda \in [0,1]$ and all $x, y$

49 Some useful results
convex functions: $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$
M-Lipschitz functions: $\|\nabla f(x)\|_2 \le M$
L-smooth functions: $\frac{L}{2}x^Tx - f(x)$ is convex;
$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|x - y\|_2^2$
$\frac{1}{2L}\|\nabla f(x)\|_2^2 \le f(x) - f(x^*) \le \frac{L}{2}\|x - x^*\|_2^2$
µ-strongly convex functions: $f(x) - \frac{\mu}{2}x^Tx$ is convex;
$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|x - y\|_2^2$

50 Co-coercivity of gradient
If $f$ is convex with $\operatorname{dom} f = \mathbb{R}^n$ and $(L/2)x^Tx - f(x)$ is convex, then
$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|_2^2 \quad \forall x, y$
proof: define convex functions $f_x, f_y$ with domain $\mathbb{R}^n$:
$f_x(z) = f(z) - \langle \nabla f(x), z \rangle, \quad f_y(z) = f(z) - \langle \nabla f(y), z \rangle$
The functions $(L/2)z^Tz - f_x(z)$ and $(L/2)z^Tz - f_y(z)$ are convex. $z = x$ minimizes $f_x(z)$; from the left-hand inequality on the previous slide (applied to $f_x$),
$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = f_x(y) - f_x(x) \ge \frac{1}{2L}\|\nabla f_x(y)\|_2^2 = \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|_2^2$
Similarly, $z = y$ minimizes $f_y(z)$; therefore
$f(x) - f(y) - \langle \nabla f(y), x - y \rangle \ge \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|_2^2$
Adding the two inequalities shows co-coercivity.

51 Extension of co-coercivity
If $f$ is $\mu$-strongly convex and $\nabla f$ is $L$-Lipschitz continuous, then $g(x) = f(x) - \frac{\mu}{2}\|x\|_2^2$ is convex and $\nabla g$ is Lipschitz continuous with parameter $L - \mu$. Co-coercivity of $g$ gives, for all $x, y \in \operatorname{dom} f$,
$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{\mu L}{\mu + L}\|x - y\|_2^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|_2^2$

52 Convergence guarantees
Assumption: $f$ is $L$-smooth and $\mu$-strongly convex.
lemma: Co-coercivity of gradients
$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{L\mu}{L + \mu}\|x - y\|_2^2 + \frac{1}{L + \mu}\|\nabla f(x) - \nabla f(y)\|_2^2 \quad (42)$
Theorem: Convergence rate of GD
Let $\alpha_k = \frac{2}{L + \mu}$ and let $\kappa = \frac{L}{\mu}$. Define $\Delta_k = \|x_k - x^*\|_2$. Then
$f(x_{T+1}) - f(x^*) \le \frac{L\Delta_1^2}{2}\exp\left(-\frac{4T}{\kappa + 1}\right) \quad (43)$

53 Proof of Theorem
$\Delta_{k+1}^2 = \|x_{k+1} - x^*\|_2^2 = \|x_k - \alpha_k\nabla f(x_k) - x^*\|_2^2 = \Delta_k^2 - 2\alpha_k\langle \nabla f(x_k), x_k - x^* \rangle + \alpha_k^2\|\nabla f(x_k)\|_2^2$
By the lemma (with $\nabla f(x^*) = 0$),
$\Delta_{k+1}^2 \le \Delta_k^2 - 2\alpha_k\left(\frac{L\mu}{L + \mu}\Delta_k^2 + \frac{1}{L + \mu}\|\nabla f(x_k)\|_2^2\right) + \alpha_k^2\|\nabla f(x_k)\|_2^2 = \left(1 - \frac{2\alpha_k L\mu}{L + \mu}\right)\Delta_k^2 + \left(\alpha_k^2 - \frac{2\alpha_k}{L + \mu}\right)\|\nabla f(x_k)\|_2^2 \le \left(1 - \frac{2\alpha_k L\mu}{L + \mu}\right)\Delta_k^2 + \left(\alpha_k^2 - \frac{2\alpha_k}{L + \mu}\right)L^2\Delta_k^2 \quad (44)$

54 Proof of Theorem
With $\alpha_k = \frac{2}{L + \mu}$:
$\Delta_{k+1}^2 \le \left(1 - \frac{4L\mu}{(L + \mu)^2}\right)\Delta_k^2 = \left(\frac{L - \mu}{L + \mu}\right)^2\Delta_k^2 = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2\Delta_k^2$
$\Delta_{T+1}^2 \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2T}\Delta_1^2 = \Delta_1^2\exp\left(2T\log\left(1 - \frac{2}{\kappa + 1}\right)\right) \le \Delta_1^2\exp\left(-\frac{4T}{\kappa + 1}\right)$
$f(x_{T+1}) - f(x^*) \le \frac{L}{2}\Delta_{T+1}^2 \le \frac{L\Delta_1^2}{2}\exp\left(-\frac{4T}{\kappa + 1}\right)$
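
A numerical check of this linear rate (our own illustration): GD with $\alpha = \frac{2}{L+\mu}$ on a strongly convex quadratic contracts $\|x_k - x^*\|_2$ by a factor of at most $\frac{\kappa-1}{\kappa+1}$ per step. The diagonal test problem is an assumption for the example.

```python
import numpy as np

mu, L = 1.0, 10.0                           # strong convexity and smoothness constants
H = np.diag(np.linspace(mu, L, 20))         # f(x) = 0.5 x^T H x, minimizer x* = 0
alpha = 2.0 / (L + mu)
kappa = L / mu

x = np.ones(20)
for _ in range(100):
    x = x - alpha * (H @ x)                 # gradient step
rate = (kappa - 1) / (kappa + 1)
print("||x_100||:", np.linalg.norm(x))
print("theoretical bound rate^100 * ||x_1||:", rate**100 * np.linalg.norm(np.ones(20)))
```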

55 Stochastic Gradient methods
ERM problem:
$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \quad (45)$
gradient descent:
$x_{k+1} = x_k - \alpha_k\nabla f(x_k) \quad (46)$
stochastic gradient descent:
$x_{k+1} = x_k - \alpha_k\nabla f_{s_k}(x_k) \quad (47)$
where $s_k$ is uniformly sampled from $\{1, \ldots, n\}$.
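
A sketch of iteration (47) (our own illustration) on a least-squares finite sum, using a diminishing stepsize $\alpha_k = \frac{\beta}{k+\gamma}$ of the kind analyzed below; the data and the choice of $(\beta, \gamma)$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the full problem

# f(x) = (1/n) sum_i f_i(x), with f_i(x) = 0.5*(a_i^T x - b_i)^2
beta, gamma = 1.0, 50.0
x = np.zeros(d)
for k in range(1, 100_001):
    i = rng.integers(n)                    # s_k uniform on {1,...,n}
    g = (A[i] @ x - b[i]) * A[i]           # gradient of the sampled component f_i
    x = x - (beta / (k + gamma)) * g       # diminishing stepsize alpha_k = beta/(k+gamma)
print("||x - x*||:", np.linalg.norm(x - x_star))
```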

56 Convergence guarantees
Assumption:
$f(x)$ is $L$-smooth: $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$
$f(x)$ is $\mu$-strongly convex: $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \mu\|x - y\|_2^2$
$\mathbb{E}_s[\nabla f_s(x)] = \nabla f(x)$
$\mathbb{E}_s\|\nabla f_s(x)\|_2^2 \le M^2$
Theorem: Convergence rate of SGD
Define $\Delta_k = \|x_k - x^*\|_2$. For a fixed stepsize $\alpha_k = \alpha$ with $0 < \alpha < \frac{1}{2\mu}$, we have
$\mathbb{E}[f(x_{T+1}) - f(x^*)] \le \frac{L}{2}\mathbb{E}[\Delta_{T+1}^2] \le \frac{L}{2}\left[(1 - 2\alpha\mu)^T\Delta_1^2 + \frac{\alpha M^2}{2\mu}\right] \quad (48)$

57 Proof of Theorem
$\Delta_{k+1}^2 = \|x_{k+1} - x^*\|_2^2 = \|x_k - \alpha_k\nabla f_{s_k}(x_k) - x^*\|_2^2 = \Delta_k^2 - 2\alpha_k\langle \nabla f_{s_k}(x_k), x_k - x^* \rangle + \alpha_k^2\|\nabla f_{s_k}(x_k)\|_2^2$
Using $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$:
$\mathbb{E}_{s_1,\ldots,s_k}[\langle \nabla f_{s_k}(x_k), x_k - x^* \rangle] = \mathbb{E}_{s_1,\ldots,s_{k-1}}[\mathbb{E}_{s_k}[\langle \nabla f_{s_k}(x_k), x_k - x^* \rangle]] = \mathbb{E}_{s_1,\ldots,s_{k-1}}[\langle \mathbb{E}_{s_k}[\nabla f_{s_k}(x_k)], x_k - x^* \rangle] = \mathbb{E}_{s_1,\ldots,s_{k-1}}[\langle \nabla f(x_k), x_k - x^* \rangle] = \mathbb{E}_{s_1,\ldots,s_k}[\langle \nabla f(x_k), x_k - x^* \rangle]$
By strong convexity (and $\nabla f(x^*) = 0$),
$\mathbb{E}_{s_1,\ldots,s_k}[\Delta_{k+1}^2] \le (1 - 2\alpha\mu)\mathbb{E}_{s_1,\ldots,s_k}[\Delta_k^2] + \alpha^2M^2 \quad (49)$

58 Proof of Theorem
Applying the recursion from $k = 1$ to $k = T$, we have
$\mathbb{E}_{s_1,\ldots,s_T}[\Delta_{T+1}^2] \le (1 - 2\alpha\mu)^T\Delta_1^2 + \sum_{i=0}^{T-1}(1 - 2\alpha\mu)^i\alpha^2M^2 \quad (50)$
Under the assumption that $0 < 2\alpha\mu < 1$, we have $\sum_{i=0}^\infty (1 - 2\alpha\mu)^i = \frac{1}{2\alpha\mu}$. Then
$\mathbb{E}_{s_1,\ldots,s_T}[\Delta_{T+1}^2] \le (1 - 2\alpha\mu)^T\Delta_1^2 + \frac{\alpha M^2}{2\mu} \quad (51)$

59 Convergence guarantees
For a fixed stepsize, we do not obtain convergence; the error saturates at the level $\frac{\alpha M^2}{2\mu}$.
For a diminishing stepsize, the order of convergence is $O(\frac{1}{T})$.
Theorem: Convergence rate of SGD
Define $\Delta_k = \|x_k - x^*\|_2$. For a diminishing stepsize $\alpha_k = \frac{\beta}{k + \gamma}$ for some $\beta > \frac{1}{2\mu}$ and $\gamma > 0$ such that $\alpha_1 \le \frac{1}{2\mu}$, we have, for any $T \ge 1$,
$\mathbb{E}[f(x_T) - f(x^*)] \le \frac{L}{2}\mathbb{E}[\Delta_T^2] \le \frac{L}{2}\cdot\frac{v}{\gamma + T} \quad (52)$
where $v = \max\left(\frac{\beta^2M^2}{2\beta\mu - 1}, (\gamma + 1)\Delta_1^2\right)$.

60 Proof of Theorem
Recall the bound
$\mathbb{E}_{s_1,\ldots,s_k}[\Delta_{k+1}^2] \le (1 - 2\alpha_k\mu)\mathbb{E}_{s_1,\ldots,s_k}[\Delta_k^2] + \alpha_k^2M^2 \quad (53)$
We prove $\mathbb{E}[\Delta_k^2] \le \frac{v}{\gamma + k}$ by induction. First, the definition of $v$ ensures that it holds for $k = 1$. Assume the conclusion holds for some $k$; with $\hat{k} := \gamma + k$, it follows that
$\mathbb{E}[\Delta_{k+1}^2] \le \left(1 - \frac{2\beta\mu}{\hat{k}}\right)\frac{v}{\hat{k}} + \frac{\beta^2M^2}{\hat{k}^2} = \left(\frac{\hat{k} - 2\beta\mu}{\hat{k}^2}\right)v + \frac{\beta^2M^2}{\hat{k}^2} = \frac{\hat{k} - 1}{\hat{k}^2}v - \frac{2\beta\mu - 1}{\hat{k}^2}v + \frac{\beta^2M^2}{\hat{k}^2} \le \frac{v}{\hat{k} + 1}$
where the last step uses $\frac{\hat{k} - 1}{\hat{k}^2} \le \frac{1}{\hat{k} + 1}$ and $v \ge \frac{\beta^2M^2}{2\beta\mu - 1}$.

61 stochastic optimization
stochastic subgradient descent: $O(1/\epsilon^2)$
stochastic gradient descent with strong convexity: $O(1/\epsilon)$
stochastic gradient descent with strong convexity and smoothness: $O(1/\epsilon)$
deterministic optimization
subgradient descent: $O(n/\epsilon^2)$
gradient descent with strong convexity: $O(n/\epsilon)$
gradient descent with strong convexity and smoothness: $O(n\log(1/\epsilon))$
Here complexity refers to the number of component (sub)gradient evaluations: every iteration of GD computes $n$ component gradients, while every iteration of SGD computes one.

62 Outline
1 Problem Description
2 Subgradient Methods
  The gradient and subgradient methods
  Stochastic subgradient methods
3 Stochastic Gradient Descent
  Gradient Descent
  Stochastic Gradient methods
4 Variance Reduction
  SAG method and SAGA method
  SVRG method
5 Stochastic Algorithms in Deep Learning

63 Variance Reduction
Assumption:
$f(x)$ is $L$-smooth: $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$
$f(x)$ is $\mu$-strongly convex: $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \mu\|x - y\|_2^2$
$\mathbb{E}_s[\nabla f_s(x)] = \nabla f(x)$
$\mathbb{E}_s\|\nabla f_s(x)\|_2^2 \le M^2$
GD: linear convergence, $O(n\log(1/\epsilon))$
SGD: sublinear convergence, $O(1/\epsilon)$
What is the essential difference between SGD and GD?

64 Variance Reduction
GD:
$\Delta_{k+1}^2 = \|x_{k+1} - x^*\|_2^2 = \|x_k - \alpha\nabla f(x_k) - x^*\|_2^2 = \Delta_k^2 - 2\alpha\langle \nabla f(x_k), x_k - x^* \rangle + \alpha^2\|\nabla f(x_k)\|_2^2$
$\le (1 - 2\alpha\mu)\Delta_k^2 + \alpha^2\|\nabla f(x_k)\|_2^2$ ($\mu$-strong convexity)
$\le (1 - 2\alpha\mu + \alpha^2L^2)\Delta_k^2$ ($L$-smoothness)
SGD:
$\mathbb{E}\Delta_{k+1}^2 = \mathbb{E}\|x_{k+1} - x^*\|_2^2 = \mathbb{E}\|x_k - \alpha\nabla f_{s_k}(x_k) - x^*\|_2^2 = \mathbb{E}\Delta_k^2 - 2\alpha\mathbb{E}\langle \nabla f_{s_k}(x_k), x_k - x^* \rangle + \alpha^2\mathbb{E}\|\nabla f_{s_k}(x_k)\|_2^2 = \mathbb{E}\Delta_k^2 - 2\alpha\mathbb{E}\langle \nabla f(x_k), x_k - x^* \rangle + \alpha^2\mathbb{E}\|\nabla f_{s_k}(x_k)\|_2^2$
$\le (1 - 2\alpha\mu)\mathbb{E}\Delta_k^2 + \alpha^2\mathbb{E}\|\nabla f_{s_k}(x_k)\|_2^2$ ($\mu$-strong convexity)
$= (1 - 2\alpha\mu)\mathbb{E}\Delta_k^2 + \alpha^2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f(x_k) + \nabla f(x_k)\|_2^2$
$\le (1 - 2\alpha\mu + 2\alpha^2L^2)\mathbb{E}\Delta_k^2 + 2\alpha^2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f(x_k)\|_2^2$

65 Variance Reduction
$\mathbb{E}\Delta_{k+1}^2 \le \underbrace{(1 - 2\alpha\mu + 2\alpha^2L^2)\mathbb{E}\Delta_k^2}_{A} + \underbrace{2\alpha^2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f(x_k)\|_2^2}_{B} \quad (54)$
This gives a worst-case convergence rate of $1/T$ for SGD.
In practice, the actual convergence may be somewhat better than this bound: initially $B \ll A$ and we observe a linear-rate regime; once $B > A$ we observe the $1/T$ rate.
How can we reduce the variance term $B$ to speed up SGD?
SAG (stochastic average gradient)
SAGA
SVRG (stochastic variance reduced gradient)

66 SAG method
SAG method (Le Roux, Schmidt, Bach 2012):
$x_{k+1} = x_k - \frac{\alpha_k}{n}\sum_{i=1}^n g_k^i = x_k - \alpha_k\left(\frac{1}{n}\left(\nabla f_{s_k}(x_k) - g_{k-1}^{s_k}\right) + \frac{1}{n}\sum_{i=1}^n g_{k-1}^i\right) \quad (55)$
where
$g_k^i = \begin{cases} \nabla f_i(x_k) & \text{if } i = s_k \\ g_{k-1}^i & \text{otherwise} \end{cases} \quad (56)$
and $s_k$ is uniformly sampled from $\{1, \ldots, n\}$.
complexity (number of component gradient evaluations): $O(\max\{n, \frac{L}{\mu}\}\log(1/\epsilon))$
needs to store the most recent gradient of each component.
SAGA (Defazio, Bach, Lacoste-Julien 2014) is an unbiased revision of SAG:
$x_{k+1} = x_k - \alpha_k\left(\nabla f_{s_k}(x_k) - g_{k-1}^{s_k} + \frac{1}{n}\sum_{i=1}^n g_{k-1}^i\right) \quad (57)$
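
A compact sketch of the SAGA update (57) (our own illustration) on a least-squares finite sum; the table holds the most recent gradient $g^i$ of each component and its running mean is maintained incrementally. Problem data and stepsize are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)                       # consistent system, so the residual can reach 0

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2

x = np.zeros(d)
table = np.array([grad_i(x, i) for i in range(n)])  # stored gradients g^i
table_mean = table.mean(axis=0)
alpha = 0.005
for k in range(50_000):
    i = rng.integers(n)
    g_new = grad_i(x, i)
    v = g_new - table[i] + table_mean            # SAGA direction, unbiased for the full gradient
    x = x - alpha * v
    table_mean += (g_new - table[i]) / n         # maintain the running mean of the table
    table[i] = g_new
print("residual ||Ax - b||:", np.linalg.norm(A @ x - b))
```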

67 SVRG method
SVRG method (Johnson and Zhang 2013):
$v_k = \nabla f_{s_k}(x_k) - \nabla f_{s_k}(y) + \nabla f(y)$
$x_{k+1} = x_k - \alpha_k v_k \quad (58)$
where $y$ is a reference point and $s_k$ is uniformly sampled from $\{1, \ldots, n\}$.
$v_k$ is an unbiased estimate of the gradient $\nabla f(x_k)$:
$\mathbb{E}v_k = \nabla f(x_k) + \nabla f(y) - \nabla f(y) = \nabla f(x_k)$
Recall the bound
$\mathbb{E}\Delta_{k+1}^2 \le (1 - 2\alpha\mu)\mathbb{E}\Delta_k^2 + \alpha^2\mathbb{E}\|v_k\|_2^2 \quad (59)$

68 SVRG method
Additional assumption: $L$-smoothness of the component functions:
$\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L\|x - y\|_2 \quad (60)$
Let's analyze the "variance":
$\mathbb{E}\|v_k\|_2^2 = \mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(y) + \nabla f(y)\|_2^2 = \mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(x^*) + \nabla f_{s_k}(x^*) - \nabla f_{s_k}(y) + \nabla f(y)\|_2^2 \le 2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(x^*)\|_2^2 + 2\mathbb{E}\|\nabla f_{s_k}(y) - \nabla f(y) - \nabla f_{s_k}(x^*)\|_2^2 \le 2L^2\mathbb{E}\Delta_k^2 + 2\mathbb{E}\|\nabla f_{s_k}(y) - \nabla f_{s_k}(x^*)\|_2^2 \le 2L^2\mathbb{E}\Delta_k^2 + 2L^2\mathbb{E}\|y - x^*\|_2^2$
If $x_k$ and $y$ are close to $x^*$, the variance is small.

69 SVRG method
We only need to choose a recent iterate as $y$. Picking a fresh $y$ more often should decrease the variance; however, doing this too often involves computing too many full gradients.
Let's set $y = x_1$:
$\mathbb{E}\Delta_{k+1}^2 \le (1 - 2\alpha\mu + 2\alpha^2L^2)\mathbb{E}\Delta_k^2 + 2\alpha^2L^2\mathbb{E}\Delta_1^2 \quad (61)$
Unrolling this:
$\mathbb{E}\Delta_{k+1}^2 \le (1 - 2\alpha\mu + 2\alpha^2L^2)^k\mathbb{E}\Delta_1^2 + \sum_{i=0}^{k-1}(1 - 2\alpha\mu + 2\alpha^2L^2)^i\,2\alpha^2L^2\mathbb{E}\Delta_1^2 \le (1 - 2\alpha\mu + 2\alpha^2L^2)^k\mathbb{E}\Delta_1^2 + 2k\alpha^2L^2\mathbb{E}\Delta_1^2 \quad (62)$

70 SVRG method
Unrolling this:
$\mathbb{E}\Delta_{k+1}^2 \le (1 - 2\alpha\mu + 2\alpha^2L^2)^k\mathbb{E}\Delta_1^2 + 2k\alpha^2L^2\mathbb{E}\Delta_1^2 \quad (63)$
Suppose we would like this to be $0.5\,\mathbb{E}\Delta_1^2$ after $T$ iterations. If we pick $\alpha = O(1)\frac{\mu}{L^2}$, then it turns out that we can set $T = O(1)\frac{L^2}{\mu^2}$. In fact, this can be improved to $T = O(1)\frac{L}{\mu}$, where $\kappa = \frac{L}{\mu}$ is the condition number.

71 SVRG method
Algorithm 2: SVRG method
Input: $\tilde{x}_0$, $\alpha$, $m$
for $e = 1 : E$ do
    $y \leftarrow \tilde{x}_{e-1}$, $x_1 \leftarrow \tilde{x}_{e-1}$
    $g \leftarrow \nabla f(y)$ (full gradient)
    for $k = 1 : m$ do
        pick $s_k \in \{1, \ldots, n\}$ uniformly at random
        $v_k = \nabla f_{s_k}(x_k) - \nabla f_{s_k}(y) + g$
        $x_{k+1} = x_k - \alpha v_k$
    end for
    $\tilde{x}_e \leftarrow \frac{1}{m}\sum_{k=1}^m x_k$
end for
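
A direct transcription of Algorithm 2 into numpy (our own sketch; the least-squares data and the choices of $\alpha$, $m$, and $E$ are assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # component gradient
full_grad = lambda x: A.T @ (A @ x - b) / n      # full gradient, once per epoch

x_tilde = np.zeros(d)
alpha, m, E = 0.005, 4 * n, 50
for e in range(E):
    y = x_tilde
    g = full_grad(y)                             # full gradient at the reference point
    x = y.copy()
    xs = np.zeros(d)
    for k in range(m):
        i = rng.integers(n)
        v = grad_i(x, i) - grad_i(y, i) + g      # variance-reduced direction v_k
        x = x - alpha * v
        xs += x
    x_tilde = xs / m                             # next reference: average of inner iterates
print("residual ||A x_tilde - b||:", np.linalg.norm(A @ x_tilde - b))
```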

72 SVRG method
Convergence of the SVRG method
Suppose $0 < \alpha \le \frac{1}{2L}$ and $m$ is sufficiently large such that
$\rho = \frac{1}{\mu\alpha(1 - 2L\alpha)m} + \frac{2L\alpha}{1 - 2L\alpha} < 1 \quad (64)$
Then we have linear convergence in expectation:
$\mathbb{E}f(\tilde{x}_s) - f(x^*) \le \rho^s[f(\tilde{x}_0) - f(x^*)] \quad (65)$
If $\alpha = \frac{\theta}{L}$, then
$\rho = \frac{L/\mu}{\theta(1 - 2\theta)m} + \frac{2\theta}{1 - 2\theta} \quad (66)$
Choosing $\theta = 0.1$ and $m = 50(L/\mu)$ results in $\rho = \frac{1}{0.1 \cdot 0.8 \cdot 50} + \frac{0.2}{0.8} = 0.25 + 0.25 = 0.5$.
overall complexity: $O((\frac{L}{\mu} + n)\log(1/\epsilon))$

73 proof of theorem
$\mathbb{E}\Delta_{k+1}^2 = \mathbb{E}\|x_{k+1} - x^*\|_2^2 = \mathbb{E}\|x_k - \alpha v_k - x^*\|_2^2 = \mathbb{E}\Delta_k^2 - 2\alpha\mathbb{E}\langle v_k, x_k - x^* \rangle + \alpha^2\mathbb{E}\|v_k\|_2^2 = \mathbb{E}\Delta_k^2 - 2\alpha\mathbb{E}\langle \nabla f(x_k), x_k - x^* \rangle + \alpha^2\mathbb{E}\|v_k\|_2^2 \le \mathbb{E}\Delta_k^2 - 2\alpha\mathbb{E}(f(x_k) - f(x^*)) + \alpha^2\mathbb{E}\|v_k\|_2^2$
By the smoothness of $f_i(x)$,
$\|\nabla f_i(x) - \nabla f_i(x^*)\|_2^2 \le 2L[f_i(x) - f_i(x^*) - \nabla f_i(x^*)^T(x - x^*)] \quad (67)$
Summing these inequalities over $i = 1, 2, \ldots, n$ and using $\nabla f(x^*) = 0$,
$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(x^*)\|_2^2 \le 2L[f(x) - f(x^*)] \quad (68)$

74 proof of theorem
Using $\mathbb{E}\|X - \mathbb{E}[X]\|^2 \le \mathbb{E}\|X\|^2$, we obtain the bound
$\mathbb{E}\|v_k\|_2^2 = \mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(y) + \nabla f(y) + \nabla f_{s_k}(x^*) - \nabla f_{s_k}(x^*)\|_2^2 \le 2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(x^*)\|_2^2 + 2\mathbb{E}\|\nabla f_{s_k}(y) - \nabla f(y) - \nabla f_{s_k}(x^*)\|_2^2 = 2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(x^*)\|_2^2 + 2\mathbb{E}\|\nabla f_{s_k}(y) - \nabla f_{s_k}(x^*) - \mathbb{E}[\nabla f_{s_k}(y) - \nabla f_{s_k}(x^*)]\|_2^2 \le 2\mathbb{E}\|\nabla f_{s_k}(x_k) - \nabla f_{s_k}(x^*)\|_2^2 + 2\mathbb{E}\|\nabla f_{s_k}(y) - \nabla f_{s_k}(x^*)\|_2^2 \le 4L[f(x_k) - f(x^*) + f(y) - f(x^*)]$
Now continue the derivation:
$\mathbb{E}\Delta_{k+1}^2 \le \mathbb{E}\Delta_k^2 - 2\alpha\mathbb{E}(f(x_k) - f(x^*)) + \alpha^2\mathbb{E}\|v_k\|_2^2 \le \mathbb{E}\Delta_k^2 - 2\alpha(1 - 2\alpha L)\mathbb{E}(f(x_k) - f(x^*)) + 4L\alpha^2[f(y) - f(x^*)]$

75 proof of theorem
Summing over $k = 1, \ldots, m$ (note that $y = \tilde{x}_{e-1}$ and $\tilde{x}_e = \frac{1}{m}\sum_{k=1}^m x_k$):
$\mathbb{E}\Delta_{m+1}^2 + 2\alpha(1 - 2\alpha L)\sum_{k=1}^m \mathbb{E}(f(x_k) - f(x^*)) \le \mathbb{E}\|\tilde{x}_{e-1} - x^*\|_2^2 + 4L\alpha^2 m\,\mathbb{E}[f(\tilde{x}_{e-1}) - f(x^*)] \le \frac{2}{\mu}\mathbb{E}[f(\tilde{x}_{e-1}) - f(x^*)] + 4L\alpha^2 m\,\mathbb{E}[f(\tilde{x}_{e-1}) - f(x^*)]$
where the last step uses strong convexity. Therefore, for each stage $e$ (the first step below uses convexity of $f$),
$\mathbb{E}[f(\tilde{x}_e) - f(x^*)] \le \frac{1}{m}\sum_{k=1}^m \mathbb{E}(f(x_k) - f(x^*)) \le \frac{1}{2\alpha(1 - 2\alpha L)m}\left(\frac{2}{\mu} + 4mL\alpha^2\right)\mathbb{E}[f(\tilde{x}_{e-1}) - f(x^*)] \quad (69)$

76 Summary
condition number: $\kappa = \frac{L}{\mu}$
SVRG: $E \sim \log(\frac{1}{\epsilon})$ epochs, so the complexity is $O((n + \kappa)\log(\frac{1}{\epsilon}))$
GD: $T \sim \kappa\log(\frac{1}{\epsilon})$, so the complexity is $O(n\kappa\log(\frac{1}{\epsilon}))$
SGD: $T \sim \frac{\kappa}{\epsilon}$, so the complexity is $O(\frac{\kappa}{\epsilon})$
Even though we allow ourselves a few full-gradient computations, we don't really pay too much in terms of complexity.

77 Outline
1 Problem Description
2 Subgradient Methods
  The gradient and subgradient methods
  Stochastic subgradient methods
3 Stochastic Gradient Descent
  Gradient Descent
  Stochastic Gradient methods
4 Variance Reduction
  SAG method and SAGA method
  SVRG method
5 Stochastic Algorithms in Deep Learning

78 Stochastic Algorithms in Deep Learning
Consider the problem $\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(x)$
References: chapter 8 in
Gradient descent:
$x_{t+1} = x_t - \frac{\alpha_t}{n}\sum_{i=1}^n \nabla f_i(x_t)$
Stochastic gradient descent:
$x_{t+1} = x_t - \alpha_t\nabla f_i(x_t)$
SGD with momentum:
$v_{t+1} = \mu_t v_t - \alpha_t\nabla f_i(x_t)$
$x_{t+1} = x_t + v_{t+1}$

79 Stochastic Algorithms in Deep Learning
Nesterov accelerated gradient (original version):
$v_{t+1} = (1 + \mu_t)x_t - \mu_t x_{t-1}$
$x_{t+1} = v_{t+1} - \alpha_t\nabla f_i(v_{t+1})$
here $\mu_t = \frac{t+2}{t+5}$ and $\alpha_t$ is fixed or determined by line search (inverse of the Lipschitz constant).
Nesterov accelerated gradient (momentum version):
$v_{t+1} = \mu_t v_t - \alpha_t\nabla f_i(x_t + \mu_t v_t)$
$x_{t+1} = x_t + v_{t+1}$
here $\mu_t = \frac{t+2}{t+5}$ and $\alpha_t$ is fixed or determined by line search.
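
The momentum update from the previous slide and the momentum form of Nesterov's method above, as numpy loops on a toy quadratic (our own sketch; for simplicity $\mu_t$ and $\alpha_t$ are held fixed and the full gradient stands in for the sampled $\nabla f_i$).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
c = rng.normal(size=d)
grad = lambda x: x - c                    # gradient of f(x) = 0.5*||x - c||^2

mu, alpha = 0.9, 0.1

# SGD with momentum
x, v = np.zeros(d), np.zeros(d)
for t in range(200):
    v = mu * v - alpha * grad(x)
    x = x + v

# Nesterov accelerated gradient, momentum version: gradient at the look-ahead point
xn, vn = np.zeros(d), np.zeros(d)
for t in range(200):
    vn = mu * vn - alpha * grad(xn + mu * vn)
    xn = xn + vn

print("momentum:", np.linalg.norm(x - c), " nesterov:", np.linalg.norm(xn - c))
```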

80 Stochastic Algorithms in Deep Learning
Adaptive Subgradient Method (AdaGrad): let $g_t = \nabla f_i(x_t)$, $g_t^2 = \operatorname{diag}[g_t g_t^T] \in \mathbb{R}^d$, and initialize $G_1 = g_1^2$. At step $t$,
$x_{t+1} = x_t - \frac{\alpha_t}{\sqrt{G_t + \epsilon\mathbf{1}_d}}\odot\nabla f_i(x_t)$
$G_{t+1} = G_t + g_{t+1}^2$
In the update above and in the following methods, vector-vector multiplication, division, and square roots are element-wise.

81 Stochastic Algorithms in Deep Learning
Adadelta: denote the exponentially decaying averages
$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)g_t^2, \quad RMS[g]_t = \sqrt{E[g^2]_t + \epsilon\mathbf{1}_d} \in \mathbb{R}^d$
$E[\Delta x^2]_t = \rho E[\Delta x^2]_{t-1} + (1 - \rho)\Delta x_t^2, \quad RMS[\Delta x]_t = \sqrt{E[\Delta x^2]_t + \epsilon\mathbf{1}_d} \in \mathbb{R}^d$
Initial parameters: $x_1$, $E[g^2]_0 = 0$, $E[\Delta x^2]_0 = 0$, decay rate $\rho$, constant $\epsilon$. At step $t$, compute $g_t = \nabla f_i(x_t)$ and
$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)g_t^2$
$\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t}\odot g_t$
$E[\Delta x^2]_t = \rho E[\Delta x^2]_{t-1} + (1 - \rho)\Delta x_t^2$
$x_{t+1} = x_t + \Delta x_t$

82 Stochastic Algorithms in Deep Learning
RMSprop: initialize $E[g^2]_0 = 0$. At step $t$,
$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)g_t^2$
$x_{t+1} = x_t - \frac{\alpha_t}{\sqrt{E[g^2]_t + \epsilon\mathbf{1}_d}}\odot\nabla f_i(x_t)$
here $\rho \approx 0.9$ with a small fixed $\alpha_t$ is a good choice.

83 Stochastic Algorithms in Deep Learning
Adam: initialize $E[g]_0 = 0$, $E[g^2]_0 = 0$. At step $t$,
$E[g]_t = \mu E[g]_{t-1} + (1 - \mu)g_t$
$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)g_t^2$
$\hat{E}[g]_t = \frac{E[g]_t}{1 - \mu^t}, \quad \hat{E}[g^2]_t = \frac{E[g^2]_t}{1 - \rho^t}$
$x_{t+1} = x_t - \alpha\frac{\hat{E}[g]_t}{\sqrt{\hat{E}[g^2]_t} + \epsilon\mathbf{1}_d}$
here $\rho, \mu$ are decay rates and $\alpha$ is the learning rate.
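
The Adam recursion above as a numpy sketch (our own illustration; the toy objective and the hyperparameters $\mu = 0.9$, $\rho = 0.999$, $\alpha = 0.1$ are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
c = rng.normal(size=d)
grad = lambda x: x - c                      # gradient of f(x) = 0.5*||x - c||^2

mu, rho, alpha, eps = 0.9, 0.999, 0.1, 1e-8
x = np.zeros(d)
m1 = np.zeros(d)                            # E[g]_t,   first-moment average
m2 = np.zeros(d)                            # E[g^2]_t, second-moment average
for t in range(1, 501):
    g = grad(x)
    m1 = mu * m1 + (1 - mu) * g
    m2 = rho * m2 + (1 - rho) * g * g
    m1_hat = m1 / (1 - mu**t)               # bias correction
    m2_hat = m2 / (1 - rho**t)
    x = x - alpha * m1_hat / (np.sqrt(m2_hat) + eps)
print("distance to minimizer:", np.linalg.norm(x - c))
```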

84 Optimization algorithms in deep learning
Stochastic gradient-type algorithms:
caffe implements adadelta, adagrad, adam, nesterov, rmsprop: caffe/solvers
Lasagne provides sgd, momentum, nesterov_momentum, adagrad, rmsprop, adadelta, adam, adamax: lasagne/updates.py
tensorflow implements Adadelta, AdagradDA, Adagrad, ProximalAdagrad, Ftrl, Momentum, adam, CenteredRMSProp; implementations: master/tensorflow/core/kernels/training_ops.cc


More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information