Large-scale Machine Learning and Optimization
1 1/84 Large-scale Machine Learning and Optimization. Zaiwen Wen. Acknowledgement: these slides are based on lecture notes by Dimitris Papailiopoulos and Shiqian Ma and on the book "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David. Thanks to Yongfeng Li for preparing part of these slides.
2 Outline 2/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
3 Machine Learning Model 3/84 Machine learning model: (x, y) ∼ P, where P is an underlying distribution. Given a dataset D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} with (x_i, y_i) ∼ P i.i.d. Our goal is to find a hypothesis h_s with the smallest expected risk, i.e., min_{h_s ∈ H} R[h_s] := E[f(h_s(x), y)], (1) where H is the hypothesis class.
4 Machine Learning Model 4/84 In practice, we do not know the exact form of the underlying distribution P. Empirical Risk Minimization (ERM): min_{h_s ∈ H} R̂[h_s] := (1/n) Σ_{i=1}^n f(h_s(x_i), y_i). (2) We care about two questions on ERM: When does the empirical risk concentrate around the true risk? How does the choice of hypothesis class affect the ERM solution?
5 Machine Learning Model 5/84 Empirical risk minimizer: ĥ_n ∈ argmin_{h ∈ H} R̂_n[h]. Expected risk minimizer: h* ∈ argmin_{h ∈ H} R[h]. Concentration means that for any ɛ > 0 and 0 < δ < 1, if n is large enough, we have P(|R[ĥ_n] − R[h*]| ≤ ɛ) > 1 − δ, (3) i.e., R[ĥ_n] converges to R[h*] in probability. Concentration will fail in some cases.
6 Hoeffding Inequality 6/84 Let X_1, X_2, … be a sequence of i.i.d. random variables and assume that for all i, E(X_i) = µ and P(a ≤ X_i ≤ b) = 1. Then for any ɛ > 0, P(|(1/n) Σ_{i=1}^n X_i − µ| ≥ ɛ) ≤ 2 exp(−2nɛ² / (b − a)²). (4) The Hoeffding inequality is a non-asymptotic statement that the sample mean concentrates around the expectation. The Azuma-Hoeffding inequality is a martingale version: let X_1, X_2, … be a martingale difference sequence with |X_i| ≤ B for all i = 1, 2, … Then P(Σ_{i=1}^n X_i ≥ t) ≤ exp(−2t² / (nB²)) (5) and P(Σ_{i=1}^n X_i ≤ −t) ≤ exp(−2t² / (nB²)). (6)
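As a quick numerical sanity check of the bound in (4), one can simulate sample means and compare the empirical deviation frequency with the Hoeffding bound. The sketch below is ours, not from the slides; the sample size, trial count, and ɛ are arbitrary illustrative choices.

```python
import numpy as np

# Numerical sanity check of the Hoeffding bound (4).
rng = np.random.default_rng(0)
n, trials, eps = 200, 5000, 0.1
a, b = 0.0, 1.0                          # X_i uniform on [a, b], so mu = 0.5
means = rng.uniform(a, b, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= eps)
# P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2 / (b - a)^2)
bound = 2 * np.exp(-2 * n * eps**2 / (b - a) ** 2)
print(empirical, "<=", bound)            # observed frequency obeys the bound
```

For uniform data the observed frequency is far below the bound, which is distribution-free and therefore conservative.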
7 7/84 To make the exposition simpler, we assume that the loss function satisfies 0 ≤ f(a, b) ≤ 1 for all a, b. By the Hoeffding inequality, for a fixed h, P(|R̂_n[h] − R[h]| ≥ ɛ) ≤ 2e^{−2nɛ²}. (7) Union bound: P(∪_{h∈H} {|R̂_n[h] − R[h]| ≥ ɛ}) ≤ 2|H| e^{−2nɛ²}. (8) If we want to bound P(∪_{h∈H} {|R̂_n[h] − R[h]| ≥ ɛ}) ≤ δ, we need a sample size n ≥ (1/(2ɛ²)) log(2|H|/δ) = O((log|H| + log(δ⁻¹)) / ɛ²). (9) What if |H| = ∞? Then this bound doesn't work.
8 8/84 If n is large enough, with probability 1 − δ we have R[ĥ_n] − R[h*] = (R[ĥ_n] − R̂_n[ĥ_n]) + (R̂_n[ĥ_n] − R̂_n[h*]) + (R̂_n[h*] − R[h*]) ≤ ɛ + 0 + ɛ, since the middle term is nonpositive by the definition of ĥ_n. For a two-label classification problem, with probability 1 − δ, sup_{h∈H} |R̂_n[h] − R[h]| ≤ O(√((VC[H] log(n / VC[H]) + log(1/δ)) / n)), (10) where VC[H] is the VC dimension of H. Finite VC dimension is a sufficient and necessary condition for empirical risk concentration in two-label classification.
9 9/84 VC dimension. VC dimension of a set family: let H be a set family (a set of sets) and C a set. Their intersection is defined as the set family H ∩ C := {h ∩ C | h ∈ H}. We say that a set C is shattered by H if |H ∩ C| = 2^{|C|}. The VC dimension of H is the largest integer D such that there exists a set C with cardinality D that is shattered by H. A classification model f with parameter vector θ is said to shatter a set of data points (x_1, x_2, …, x_n) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points. The VC dimension of a model f is the maximum number of points that can be arranged so that f shatters them; more formally, it is the maximum cardinality D such that some data point set of cardinality D can be shattered by f.
10 10/84 VC dimension examples: if for every n and every labeled set {(x_1, y_1), …, (x_n, y_n)} there exists h ∈ H s.t. h(x_i) = y_i, then VC[H] = ∞. For a neural network whose activation functions are all sign functions, VC[H] ≤ O(w log(w)), where w is the number of parameters. We must use prior knowledge and choose a proper hypothesis class. Suppose a is the true model: R[ĥ_n] − R[a] = (R[ĥ_n] − R[h*]) + (R[h*] − R[a]) =: A + B. If the hypothesis class is too large, B will be small but A will be large (overfitting). If the hypothesis class is too small, A will be small but B will be large (underfitting).
11 Outline 11/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
12 The gradient and subgradient methods 12/84 Consider the problem min_{x∈R^n} f(x). (11) Gradient method: x_{k+1} = x_k − α_k ∇f(x_k). (12) Subgradient method: x_{k+1} = x_k − α_k g_k, g_k ∈ ∂f(x_k). (13) The update is equivalent to x_{k+1} = argmin_x { f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }. (14)
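The subgradient iteration (13) can be sketched in a few lines. The toy instance below is an assumption of ours (the target c and the 1/√k stepsize are illustrative choices): it minimizes the nonsmooth convex function f(x) = ‖x − c‖₁, tracking the best function value since the iterates themselves need not be monotone.

```python
import numpy as np

# Subgradient method x_{k+1} = x_k - alpha_k g_k on f(x) = ||x - c||_1.
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.abs(x - c).sum()

x = np.zeros(3)
best = f(x)
for k in range(1, 2001):
    g = np.sign(x - c)                   # a subgradient of f at x
    x = x - g / np.sqrt(k)               # alpha_k = 1 / sqrt(k)
    best = min(best, f(x))
print(best)                              # approaches the optimal value 0
```

Near the solution the iterates oscillate with amplitude roughly α_k, which is why the best (or averaged) iterate is the right quantity to monitor.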
13 Convergence guarantees 13/84 Assumption: there is at least one minimizing point x* ∈ argmin f(x) with f(x*) > −∞; the subgradients are bounded: ‖g‖₂ ≤ M for all x and all g ∈ ∂f(x). Theorem 1 (convergence of subgradient method): let α_k ≥ 0 be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let x_k be generated by the subgradient iteration. Then for all K ≥ 1, Σ_{k=1}^K α_k [f(x_k) − f(x*)] ≤ (1/2)‖x_1 − x*‖₂² + (1/2) Σ_{k=1}^K α_k² M². (15)
14 Proof of Theorem 1 14/84 By convexity, ⟨g_k, x* − x_k⟩ ≤ f(x*) − f(x_k). Then (1/2)‖x_{k+1} − x*‖₂² = (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)M². Rearranging, α_k(f(x_k) − f(x*)) ≤ (1/2)‖x_k − x*‖₂² − (1/2)‖x_{k+1} − x*‖₂² + (α_k²/2)M².
15 Convergence guarantees 15/84 Corollary: let A_K = Σ_{k=1}^K α_k and define x̄_K = (1/A_K) Σ_{k=1}^K α_k x_k. Then f(x̄_K) − f(x*) ≤ (‖x_1 − x*‖₂² + Σ_{k=1}^K α_k² M²) / (2 Σ_{k=1}^K α_k). Convergence requires Σ_k α_k = ∞ and (Σ_{k=1}^K α_k²)/(Σ_{k=1}^K α_k) → 0. Let ‖x_1 − x*‖₂ ≤ R. For a fixed stepsize α_k = α: f(x̄_K) − f(x*) ≤ R²/(2Kα) + αM²/2. For a given K, take α = R/(M√K): f(x̄_K) − f(x*) ≤ RM/√K. (16)
16 Projected subgradient methods 16/84 Consider the problem min_{x∈C} f(x). (17) Projected subgradient method: x_{k+1} = π_C(x_k − α_k g_k), g_k ∈ ∂f(x_k), (18) with projection π_C(x) = argmin_{y∈C} ‖x − y‖₂². The update is equivalent to x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }. (19)
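A minimal sketch of the projected subgradient iteration (18), on a hypothetical instance of our choosing: a linear objective f(x) = ⟨c, x⟩ over the unit ℓ₂ ball, where π_C is simply a rescaling and the solution is x* = −c/‖c‖.

```python
import numpy as np

# Projected subgradient method: minimize <c, x> over {x : ||x||_2 <= 1}.
c = np.array([3.0, 4.0])

def project_ball(y):
    """pi_C(y): Euclidean projection onto the unit l2 ball."""
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

x = np.zeros(2)
for k in range(1, 501):
    g = c                                 # gradient of the linear objective
    x = project_ball(x - g / np.sqrt(k))  # alpha_k = 1 / sqrt(k)
print(x)                                  # [-0.6, -0.8] = -c / ||c||
```

On this instance the iterate lands exactly on the boundary point −c/‖c‖ and stays there, since every step moves in the fixed direction −c before projection.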
17 Convergence guarantees 17/84 Assumption: the set C ⊆ R^n is compact and convex, and ‖x − x*‖₂ ≤ R < ∞ for all x ∈ C; the subgradients are bounded: ‖g‖₂ ≤ M for all x and all g ∈ ∂f(x). Theorem 2 (convergence of projected subgradient method): let α_k ≥ 0 be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let x_k be generated by the projected subgradient iteration. Then for all K ≥ 1, Σ_{k=1}^K α_k [f(x_k) − f(x*)] ≤ (1/2)‖x_1 − x*‖₂² + (1/2) Σ_{k=1}^K α_k² M². (20)
18 Proof of Theorem 2 18/84 By non-expansiveness of π_C, ‖x_{k+1} − x*‖₂² = ‖π_C(x_k − α_k g_k) − x*‖₂² ≤ ‖x_k − α_k g_k − x*‖₂². Hence (1/2)‖x_{k+1} − x*‖₂² ≤ (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)M². Rearranging, α_k(f(x_k) − f(x*)) ≤ (1/2)‖x_k − x*‖₂² − (1/2)‖x_{k+1} − x*‖₂² + (α_k²/2)M².
19 Convergence guarantees 19/84 Corollary: let A_K = Σ_{k=1}^K α_k and define x̄_K = (1/A_K) Σ_{k=1}^K α_k x_k. Then f(x̄_K) − f(x*) ≤ (‖x_1 − x*‖₂² + Σ_{k=1}^K α_k² M²) / (2 Σ_{k=1}^K α_k). Convergence requires Σ_k α_k = ∞ and (Σ_{k=1}^K α_k²)/(Σ_{k=1}^K α_k) → 0. For a fixed stepsize α_k = α: f(x̄_K) − f(x*) ≤ R²/(2Kα) + αM²/2. (21) Take α = R/(M√K): f(x̄_K) − f(x*) ≤ RM/√K.
20 20/84 Stochastic subgradient methods. The stochastic optimization problem: min_{x∈C} f(x) := E_P[F(x; S)], (22) where S is a random variable on the sample space S with distribution P, and for each s, x ↦ F(x; s) is convex. The expected subgradient satisfies E_P[g(x; S)] ∈ ∂f(x), where g(x; s) ∈ ∂F(x; s): f(y) = E_P[F(y; S)] ≥ E_P[F(x; S) + ⟨g(x; S), y − x⟩] = f(x) + ⟨E_P[g(x; S)], y − x⟩.
21 Stochastic subgradient methods 21/84 The deterministic (finite-sum) optimization problem: min_{x∈C} f(x) := (1/m) Σ_{i=1}^m F(x; s_i). (23) Why stochastic? E_P[F(x; S)] is generally intractable to compute. Small per-iteration complexity: only one subgradient g(x; s) ∈ ∂F(x; s) needs to be computed in each iteration. It is also more likely to reach a global solution in the non-convex case. Stochastic subgradient method: x_{k+1} = π_C(x_k − α_k g_k), with E[g_k | x_k] ∈ ∂f(x_k).
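The stochastic subgradient method can be sketched on a toy finite sum whose minimizer is known in closed form: for f(x) = (1/m) Σ_i |x − s_i| the minimizer is the median of the s_i. The data, the 1/√k stepsizes, and the stepsize-weighted averaging below are our illustrative choices, not prescriptions from the slides.

```python
import numpy as np

# Stochastic subgradient method for min_x (1/m) sum_i |x - s_i|.
rng = np.random.default_rng(1)
s = rng.normal(2.0, 1.0, size=1000)

x, wsum, wavg = 0.0, 0.0, 0.0
for k in range(1, 5001):
    i = rng.integers(len(s))             # sample one component uniformly
    g = np.sign(x - s[i])                # E[g | x] is a subgradient of f
    alpha = 1.0 / np.sqrt(k)
    x -= alpha * g
    wavg += alpha * x                    # stepsize-weighted average iterate
    wsum += alpha
x_bar = wavg / wsum
print(x_bar, np.median(s))               # x_bar is close to the median
```

As in the corollaries, it is the (weighted) average iterate x̄ that carries the guarantee; the raw iterate keeps jittering by roughly α_k.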
22 Convergence guarantees 22/84 Assumption: the set C ⊆ R^n is compact and convex, and ‖x − x*‖₂ ≤ R < ∞ for all x ∈ C; the second moments are bounded: E‖g(x, S)‖₂² ≤ M² for all x. Theorem 3 (convergence of stochastic subgradient method): let α_k ≥ 0 be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let x_k be generated by the stochastic subgradient iteration. Then for all K ≥ 1, Σ_{k=1}^K α_k E[f(x_k) − f(x*)] ≤ (1/2) E‖x_1 − x*‖₂² + (1/2) Σ_{k=1}^K α_k² M². (24)
23 Proof of Theorem 3 23/84 Let ∇f(x_k) = E[g_k | x_k] and ξ_k = g_k − ∇f(x_k). Then (1/2)‖x_{k+1} − x*‖₂² ≤ (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨∇f(x_k), x_k − x*⟩ + (α_k²/2)‖g_k‖₂² + α_k⟨ξ_k, x* − x_k⟩ ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)‖g_k‖₂² + α_k⟨ξ_k, x* − x_k⟩. Since E[⟨ξ_k, x* − x_k⟩] = E[E[⟨ξ_k, x* − x_k⟩ | x_k]] = 0, taking expectations gives α_k E[f(x_k) − f(x*)] ≤ (1/2) E‖x_k − x*‖₂² − (1/2) E‖x_{k+1} − x*‖₂² + (α_k²/2) M².
24 Convergence guarantees 24/84 Corollary: let A_K = Σ_{k=1}^K α_k and define x̄_K = (1/A_K) Σ_{k=1}^K α_k x_k. Then E[f(x̄_K) − f(x*)] ≤ (R² + Σ_{k=1}^K α_k² M²) / (2 Σ_{k=1}^K α_k). Convergence requires Σ_k α_k = ∞ and (Σ_{k=1}^K α_k²)/(Σ_{k=1}^K α_k) → 0. For a fixed stepsize α_k = α: E[f(x̄_K) − f(x*)] ≤ R²/(2Kα) + αM²/2. (25) Take α = R/(M√K): E[f(x̄_K) − f(x*)] ≤ RM/√K.
25 25/84 Theorem 5 (convergence of stochastic subgradient method): let α_k > 0 be a non-increasing sequence of stepsizes and let the preceding assumptions hold. Let x̄_K = (1/K) Σ_{k=1}^K x_k. Then E[f(x̄_K) − f(x*)] ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k. (26) Proof sketch: from α_k E[f(x_k) − f(x*)] ≤ (1/2) E‖x_k − x*‖₂² − (1/2) E‖x_{k+1} − x*‖₂² + (α_k²/2) M², divide by α_k to get E[f(x_k) − f(x*)] ≤ (1/(2α_k)) (E‖x_k − x*‖₂² − E‖x_{k+1} − x*‖₂²) + (α_k/2) M², then sum over k, using the non-increasing stepsizes and ‖x_k − x*‖₂ ≤ R to telescope.
26 26/84 Corollary: let the conditions of Theorem 5 hold, and let α_k = R/(M√k) for each k. Then E[f(x̄_K) − f(x*)] ≤ 3RM/(2√K). (27) Proof: Σ_{k=1}^K 1/√k ≤ ∫₀^K t^{−1/2} dt = 2√K. Corollary: let α_k be chosen such that E[f(x̄_K) − f(x*)] → 0. Then f(x̄_K) − f(x*) → 0 in probability as K → ∞; that is, for all ɛ > 0 we have lim sup_{K→∞} P(f(x̄_K) − f(x*) ≥ ɛ) = 0. (28)
27 27/84 By the Markov inequality, P(X ≥ α) ≤ EX/α for X ≥ 0 and α > 0, so P(f(x̄_K) − f(x*) ≥ ɛ) ≤ (1/ɛ) E[f(x̄_K) − f(x*)] → 0. Theorem 6 (convergence of stochastic subgradient method): in addition to the conditions of Theorem 5, assume that ‖g‖₂ ≤ M for all stochastic subgradients g. Then for any ɛ > 0, with probability at least 1 − e^{−ɛ²/2}, f(x̄_K) − f(x*) ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + RMɛ/√K. (29) Taking α_k = R/(M√k) and setting δ = e^{−ɛ²/2}, with probability at least 1 − δ, f(x̄_K) − f(x*) ≤ 3RM/(2√K) + RM√(2 log(1/δ)/K).
28 Azuma-Hoeffding Inequality 28/84 Martingale: a sequence X_1, X_2, … of random vectors is a martingale if there is a sequence of random vectors Z_1, Z_2, … such that for each n, X_n is a function of Z_n, Z_{n−1} is a function of Z_n, and the conditional expectation satisfies E[X_n | Z_{n−1}] = X_{n−1}. Martingale difference sequence: X_1, X_2, … is a martingale difference sequence if S_n = Σ_{i=1}^n X_i is a martingale or, equivalently, E[X_n | Z_{n−1}] = 0. Example: X_1, X_2, … independent with E(X_i) = 0 and Z_i = (X_1, …, X_i).
29 Azuma-Hoeffding Inequality 29/84 Let X_1, X_2, … be a martingale difference sequence with |X_i| ≤ B for all i = 1, 2, … Then P(Σ_{i=1}^n X_i ≥ t) ≤ exp(−2t²/(nB²)) (30) and P(Σ_{i=1}^n X_i ≤ −t) ≤ exp(−2t²/(nB²)). (31) Letting δ = t/n: P((1/n) Σ_{i=1}^n X_i ≥ δ) ≤ exp(−2nδ²/B²). For X_1, X_2, … i.i.d. with EX_i = µ: P(|(1/n) Σ_{i=1}^n X_i − µ| ≥ δ) ≤ 2 exp(−2nδ²/B²).
30 Azuma-Hoeffding Inequality 30/84 Theorem 5 (restated): let α_k > 0 be a non-increasing sequence of stepsizes and let the preceding assumptions hold. Let x̄_K = (1/K) Σ_{k=1}^K x_k. Then E[f(x̄_K) − f(x*)] ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k. (32) Theorem 6 (restated): in addition to the conditions of Theorem 5, assume that ‖g‖₂ ≤ M for all stochastic subgradients g. Then for any ɛ > 0, f(x̄_K) − f(x*) ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + RMɛ/√K (33) with probability at least 1 − e^{−ɛ²/2}.
31 Proof of Theorem 6 31/84 Let ∇f(x_k) = E[g_k | x_k] and ξ_k = g_k − ∇f(x_k). Then (1/2)‖x_{k+1} − x*‖₂² ≤ (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)‖g_k‖₂² + α_k⟨ξ_k, x* − x_k⟩. Dividing by α_k, f(x_k) − f(x*) ≤ (1/(2α_k))‖x_k − x*‖₂² − (1/(2α_k))‖x_{k+1} − x*‖₂² + (α_k/2)‖g_k‖₂² + ⟨ξ_k, x* − x_k⟩.
32 Proof of Theorem 6 32/84 Averaging, f(x̄_K) − f(x*) ≤ (1/K) Σ_{k=1}^K (f(x_k) − f(x*)) ≤ (1/(2Kα_K))‖x_1 − x*‖₂² + (1/(2K)) Σ_{k=1}^K α_k ‖g_k‖₂² + (1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + (1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩. Let ω = R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k. Then P(f(x̄_K) − f(x*) ≥ ω + t) ≤ P((1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≥ t).
33 Proof of Theorem 6 33/84 ⟨ξ_k, x* − x_k⟩ is a bounded martingale difference sequence with Z_k = (x_1, …, x_{k+1}): since E[ξ_k | Z_{k−1}] = 0 and x_k is a function of Z_{k−1}, we have E[⟨ξ_k, x* − x_k⟩ | Z_{k−1}] = 0. Since ‖ξ_k‖₂ = ‖g_k − ∇f(x_k)‖₂ ≤ 2M, |⟨ξ_k, x* − x_k⟩| ≤ ‖ξ_k‖₂ ‖x* − x_k‖₂ ≤ 2MR. By the Azuma-Hoeffding inequality, P(Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≥ t) ≤ exp(−t²/(2KM²R²)). Substituting t = MR√K ɛ: P((1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≥ MRɛ/√K) ≤ exp(−ɛ²/2).
34 Adaptive stepsizes 34/84 Choose an appropriate metric and associated distance-generating function h. It may be advantageous to adapt the metric being used, or at least the stepsizes, to achieve faster convergence guarantees. A simple scheme: h(x) = (1/2) xᵀAx, where A may change depending on information observed during the solution of the problem.
35 35/84 Adaptive stepsizes. Recall the bound E[f(x̄_K) − f(x*)] ≤ E[ R²/(Kα_K) + (1/(2K)) Σ_{k=1}^K α_k ‖g_k‖₂² ]. (34) Taking α_k = R/√(Σ_{i=1}^k ‖g_i‖₂²), E[f(x̄_K) − f(x*)] ≤ (2R/K) E[(Σ_{k=1}^K ‖g_k‖₂²)^{1/2}]. (35) If E[‖g_k‖₂²] ≤ M² for all k, then E[(Σ_{k=1}^K ‖g_k‖₂²)^{1/2}] ≤ (Σ_{k=1}^K E‖g_k‖₂²)^{1/2} ≤ √(M²K) = M√K. (36)
36 Variable metric methods 36/84 Variable metric update: x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + (1/2)⟨x − x_k, H_k(x − x_k)⟩ }. Projected subgradient method: H_k = (1/α_k) I. Newton method: H_k = ∇²f(x_k). AdaGrad: H_k = (1/α) diag(Σ_{i=1}^k g_i ∘ g_i)^{1/2}.
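The AdaGrad choice of H_k amounts to a per-coordinate stepsize α/√(Σ_{i≤k} g²_{i,j}). A sketch on a badly scaled unconstrained toy quadratic (the matrix A, the value of α, and the ɛ regularizer are our illustrative choices):

```python
import numpy as np

# AdaGrad: per-coordinate steps alpha / sqrt(running sum of squared gradients).
A = np.diag([100.0, 1.0])                # f(x) = 0.5 x^T A x, minimizer x* = 0
x = np.array([1.0, 1.0])
G = np.zeros(2)                          # running sum of g o g
alpha, eps = 0.5, 1e-8

for _ in range(500):
    g = A @ x                            # gradient of f
    G += g * g
    x = x - alpha * g / (np.sqrt(G) + eps)
print(x)                                 # both coordinates near 0 despite scaling
```

Note how the accumulated G automatically equalizes progress across the coordinate with curvature 100 and the one with curvature 1, which a single scalar stepsize cannot do.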
37 37/84 Variable metric methods. Theorem 9 (convergence of variable metric methods): let H_k ≻ 0 be a sequence of positive definite matrices, where H_k is a function of g_1, …, g_k, and let g_k be stochastic subgradients with E[g_k | x_k] ∈ ∂f(x_k). Then E[Σ_{k=1}^K (f(x_k) − f(x*))] ≤ (1/2) E[Σ_{k=2}^K (‖x_k − x*‖²_{H_k} − ‖x_k − x*‖²_{H_{k−1}})] + (1/2) E[‖x_1 − x*‖²_{H_1} + Σ_{k=1}^K ‖g_k‖²_{H_k^{−1}}]. The key one-step bound is E[f(x_k) − f(x*)] ≤ (1/2) E[‖x_k − x*‖²_{H_k} − ‖x_{k+1} − x*‖²_{H_k} + ‖g_k‖²_{H_k^{−1}}].
38 Proof of Theorem 9 38/84 By non-expansiveness of π_C under the norm ‖x‖²_{H_k} = ⟨x, H_k x⟩, and defining ξ_k = g_k − ∇f(x_k): (1/2)‖x_{k+1} − x*‖²_{H_k} ≤ (1/2)‖x_k − H_k^{−1} g_k − x*‖²_{H_k} = (1/2)‖x_k − x*‖²_{H_k} − ⟨g_k, x_k − x*⟩ + (1/2)‖g_k‖²_{H_k^{−1}} = (1/2)‖x_k − x*‖²_{H_k} − ⟨∇f(x_k), x_k − x*⟩ + (1/2)‖g_k‖²_{H_k^{−1}} + ⟨ξ_k, x* − x_k⟩ ≤ (1/2)‖x_k − x*‖²_{H_k} − (f(x_k) − f(x*)) + (1/2)‖g_k‖²_{H_k^{−1}} + ⟨ξ_k, x* − x_k⟩. Taking expectations, E[f(x_k) − f(x*)] ≤ (1/2) E[‖x_k − x*‖²_{H_k} − ‖x_{k+1} − x*‖²_{H_k} + ‖g_k‖²_{H_k^{−1}}].
39 Variable metric methods 39/84 Corollary (convergence of AdaGrad): let R∞ = sup_{x∈C} ‖x − x*‖∞ and let the conditions of Theorem 9 hold. Then E[Σ_{k=1}^K (f(x_k) − f(x*))] ≤ (1/(2α)) R∞² E[tr(M_K)] + α E[tr(M_K)], (37) where M_k = diag(Σ_{i=1}^k g_i ∘ g_i)^{1/2} and H_k = (1/α) M_k. Let α = R∞. Then E[f(x̄_K) − f(x*)] ≤ (3/(2K)) R∞ E[tr(M_K)] = (3/(2K)) R∞ Σ_{j=1}^n E[(Σ_{k=1}^K g²_{k,j})^{1/2}]. If C = {x : ‖x‖∞ ≤ 1}, this bound is lower than the one for the adaptive scalar stepsize.
40 Proof of Corollary 40/84 Aim: ‖x_k − x*‖²_{H_k} − ‖x_k − x*‖²_{H_{k−1}} ≤ ‖x_k − x*‖∞² tr(H_k − H_{k−1}). Let z = x* − x_k. Then ‖z‖²_{H_k} − ‖z‖²_{H_{k−1}} = Σ_{j=1}^n H_{k,j} z_j² − Σ_{j=1}^n H_{k−1,j} z_j² = Σ_{j=1}^n (H_{k,j} − H_{k−1,j}) z_j² ≤ ‖z‖∞² Σ_{j=1}^n (H_{k,j} − H_{k−1,j}) = ‖z‖∞² tr(H_k − H_{k−1}).
41 Proof of Corollary 41/84 For a = (a_1, a_2, …, a_T), a simple inequality (proved by induction): Σ_{t=1}^T a_t² / √(Σ_{i=1}^t a_i²) ≤ 2 √(Σ_{t=1}^T a_t²). Aim: Σ_{k=1}^K ‖g_k‖²_{H_k^{−1}} ≤ 2α tr(M_K). Indeed, Σ_{k=1}^K ‖g_k‖²_{H_k^{−1}} = α Σ_{k=1}^K Σ_{j=1}^n g²_{k,j} / M_{k,j} = α Σ_{j=1}^n Σ_{k=1}^K g²_{k,j} / √(Σ_{i=1}^k g²_{i,j}) ≤ 2α Σ_{j=1}^n √(Σ_{i=1}^K g²_{i,j}) = 2α tr(M_K). (38)
42 Optimality Guarantees 42/84 Abstract description: a collection F of convex functions f : R^n → R; a closed convex set C ⊆ R^n over which we optimize; a collection G of stochastic gradient oracles, each consisting of a sample space S, a gradient mapping g : R^n × S × F → R^n, and (implicitly) a probability distribution P on S. The stochastic gradient oracle may be queried at a point x; when queried, it draws s ∼ P and has the property that E[g(x, s, f)] ∈ ∂f(x). Minimax risk: M(C, G, F) := inf_{x̂_K} sup_{g∈G} sup_{f∈F} E[f(x̂_K(g_1, …, g_K)) − inf_{x∈C} f(x)].
43 Optimality Guarantees 43/84
44 Optimality Guarantees 44/84 Theorem 10: let C ⊆ R^n be a convex set containing an ℓ₂ ball of radius R, and let G denote the collection of distributions generating stochastic subgradients with ‖g‖₂ ≤ M with probability 1. Then M(C, G, F) ≥ RM/√(96K). Theorem 11: let G and the stochastic gradient oracle be as above, and assume C ⊇ [−R, R]^n. Then M(C, G, F) ≥ RM √n min{1/5, 1/√(96K)}.
45 Summary 45/84 Convergence in expectation: E[f(x̄_K) − f(x*)] ≤ 3RM/(2√K). Convergence in probability: f(x̄_K) − f(x*) ≤ 3RM/(2√K) + RM√(2 log(1/δ)/K) with probability at least 1 − δ. The optimal order: M(C, G, F) ≥ RM/√(96K). Using a proper metric and an adaptive strategy can improve the convergence: the mirror descent method and AdaGrad.
46 Outline 46/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
47 Gradient methods 47/84 Rewrite the ERM problem as min_{x∈R^n} f(x) := (1/n) Σ_{i=1}^n f_i(x). (39) Gradient method: x_{k+1} = x_k − α_k ∇f(x_k). (40) The update is equivalent to x_{k+1} = argmin_x { f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }. (41)
48 48/84 Basic properties. We only consider convex differentiable functions. Convex functions: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all λ ∈ [0, 1] and all x, y. M-Lipschitz functions: |f(x) − f(y)| ≤ M‖x − y‖₂. L-smooth functions: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂. µ-strongly convex functions: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) − (µ/2)λ(1−λ)‖x − y‖₂² for all λ ∈ [0, 1] and all x, y.
49 Some useful results 49/84 Convex functions: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩. M-Lipschitz functions: ‖∇f(x)‖₂ ≤ M. L-smooth functions: (L/2)xᵀx − f(x) is convex; f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖₂²; (1/(2L))‖∇f(x)‖₂² ≤ f(x) − f(x*) ≤ (L/2)‖x − x*‖₂². µ-strongly convex functions: f(x) − (µ/2)xᵀx is convex; f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖x − y‖₂².
50 Co-coercivity of gradient 50/84 If f is convex with dom f = R^n and (L/2)xᵀx − f(x) is convex, then (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖₂² for all x, y. Proof: define convex functions f_x, f_y with domain R^n: f_x(z) = f(z) − ∇f(x)ᵀz, f_y(z) = f(z) − ∇f(y)ᵀz; the functions (L/2)zᵀz − f_x(z) and (L/2)zᵀz − f_y(z) are convex. z = x minimizes f_x(z); from the left-hand inequality, f(y) − f(x) − ∇f(x)ᵀ(y − x) = f_x(y) − f_x(x) ≥ (1/(2L))‖∇f_x(y)‖₂² = (1/(2L))‖∇f(y) − ∇f(x)‖₂². Similarly, z = y minimizes f_y(z); therefore f(x) − f(y) − ∇f(y)ᵀ(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖₂². Combining the two inequalities shows co-coercivity.
51 Extension of co-coercivity 51/84 If f is µ-strongly convex and ∇f is L-Lipschitz continuous, then g(x) = f(x) − (µ/2)‖x‖₂² is convex and ∇g is Lipschitz continuous with parameter L − µ. Co-coercivity of g gives, for all x, y ∈ dom f, (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (µL/(µ + L))‖x − y‖₂² + (1/(µ + L))‖∇f(x) − ∇f(y)‖₂².
52 52/84 Convergence guarantees. Assumption: f is L-smooth and µ-strongly convex. Lemma (co-coercivity of gradients): ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (Lµ/(L + µ))‖x − y‖₂² + (1/(L + µ))‖∇f(x) − ∇f(y)‖². (42) Theorem (convergence rate of GD): let α_k = 2/(L + µ) and let κ = L/µ. Define Δ_k = ‖x_k − x*‖₂. Then f(x_{T+1}) − f(x*) ≤ (L/2) Δ₁² exp(−4T/(κ + 1)). (43)
53 Proof of Theorem 53/84 Δ²_{k+1} = ‖x_{k+1} − x*‖₂² = ‖x_k − α_k ∇f(x_k) − x*‖₂² = Δ²_k − 2α_k⟨∇f(x_k), x_k − x*⟩ + α_k²‖∇f(x_k)‖₂². By the lemma (with ∇f(x*) = 0), Δ²_{k+1} ≤ Δ²_k − 2α_k((Lµ/(L+µ))Δ²_k + (1/(L+µ))‖∇f(x_k)‖²) + α_k²‖∇f(x_k)‖₂² = (1 − 2α_k Lµ/(L+µ)) Δ²_k + (α_k² − 2α_k/(L+µ))‖∇f(x_k)‖₂² ≤ (1 − 2α_k Lµ/(L+µ)) Δ²_k + (α_k² − 2α_k/(L+µ)) L² Δ²_k. (44)
54 Proof of Theorem 54/84 With α_k = 2/(L+µ): Δ²_{k+1} ≤ (1 − 4Lµ/(L+µ)²) Δ²_k = ((L−µ)/(L+µ))² Δ²_k = ((κ−1)/(κ+1))² Δ²_k. Hence Δ²_{T+1} ≤ ((κ−1)/(κ+1))^{2T} Δ²_1 = Δ²_1 exp(2T log(1 − 2/(κ+1))) ≤ Δ²_1 exp(−4T/(κ+1)), and f(x_{T+1}) − f(x*) ≤ (L/2) Δ²_{T+1} ≤ (L/2) Δ²_1 exp(−4T/(κ+1)).
55 Stochastic Gradient methods 55/84 ERM problem: min_{x∈R^n} f(x) := (1/n) Σ_{i=1}^n f_i(x). (45) Gradient descent: x_{k+1} = x_k − α_k ∇f(x_k). (46) Stochastic gradient descent: x_{k+1} = x_k − α_k ∇f_{s_k}(x_k), (47) where s_k is uniformly sampled from {1, …, n}.
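A sketch of the iteration (47) on a synthetic least-squares finite sum. The data, the noise level, and the fixed stepsize are illustrative assumptions of ours; with a fixed stepsize the iterates settle in a noise ball around x* rather than converging exactly, matching the αM²/(2µ) term in the theorem below.

```python
import numpy as np

# SGD on the finite sum f(x) = (1/2n) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 500, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of f

x = np.zeros(d)
alpha = 0.05                                    # fixed stepsize
for k in range(20000):
    i = rng.integers(n)                         # sample one component
    g = (A[i] @ x - b[i]) * A[i]                # gradient of f_i at x
    x -= alpha * g
print(np.linalg.norm(x - x_star))               # small, but a noise floor remains
```

Shrinking α shrinks the floor at the cost of a slower transient, which is exactly the trade-off the diminishing-stepsize theorem resolves.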
56 56/84 Convergence guarantees. Assumption: f(x) is L-smooth: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂; f(x) is µ-strongly convex: ⟨∇f(x) − ∇f(y), x − y⟩ ≥ µ‖x − y‖₂²; E_s[∇f_s(x)] = ∇f(x); E_s‖∇f_s(x)‖² ≤ M². Theorem (convergence rate of SGD): define Δ_k = ‖x_k − x*‖₂. For a fixed stepsize α_k = α with 0 < α < 1/(2µ), we have E[f(x_{T+1}) − f(x*)] ≤ (L/2) E[Δ²_{T+1}] ≤ (L/2)[(1 − 2αµ)^T Δ₁² + αM²/(2µ)]. (48)
57 Proof of Theorem 57/84 Δ²_{k+1} = ‖x_{k+1} − x*‖₂² = ‖x_k − α_k ∇f_{s_k}(x_k) − x*‖₂² = Δ²_k − 2α_k⟨∇f_{s_k}(x_k), x_k − x*⟩ + α_k²‖∇f_{s_k}(x_k)‖₂². Using E[X] = E[E[X | Y]]: E_{s_1,…,s_k}[⟨∇f_{s_k}(x_k), x_k − x*⟩] = E_{s_1,…,s_{k−1}}[E_{s_k}[⟨∇f_{s_k}(x_k), x_k − x*⟩]] = E_{s_1,…,s_{k−1}}[⟨E_{s_k}[∇f_{s_k}(x_k)], x_k − x*⟩] = E_{s_1,…,s_k}[⟨∇f(x_k), x_k − x*⟩]. By strong convexity, E_{s_1,…,s_k}(Δ²_{k+1}) ≤ (1 − 2αµ) E_{s_1,…,s_k}(Δ²_k) + α²M². (49)
58 58/84 Proof of Theorem. Applying the bound recursively from k = 1 to k = T, we have E_{s_1,…,s_T}(Δ²_{T+1}) ≤ (1 − 2αµ)^T Δ²_1 + Σ_{i=0}^{T−1} (1 − 2αµ)^i α²M². (50) Under the assumption that 0 ≤ 2αµ ≤ 1, we have Σ_{i=0}^∞ (1 − 2αµ)^i = 1/(2αµ). Then E_{s_1,…,s_T}(Δ²_{T+1}) ≤ (1 − 2αµ)^T Δ²_1 + αM²/(2µ). (51)
59 59/84 Convergence guarantees. For a fixed stepsize, we do not have convergence to the solution, only to a neighborhood of it. For a diminishing stepsize, the order of convergence is O(1/T). Theorem (convergence rate of SGD): define Δ_k = ‖x_k − x*‖₂. For a diminishing stepsize α_k = β/(k + γ) for some β > 1/(2µ) and γ > 0 such that α_1 ≤ 1/(2µ), we have, for any T ≥ 1, E[f(x_T) − f(x*)] ≤ (L/2) E[Δ²_T] ≤ (L/2) · v/(γ + T), (52) where v = max( β²M²/(2βµ − 1), (γ + 1) Δ²_1 ).
60 Proof of Theorem 60/84 Recall the bound E(Δ²_{k+1}) ≤ (1 − 2α_kµ) E(Δ²_k) + α_k²M². (53) We prove the claim by induction. The definition of v ensures that it holds for k = 1. Assume it holds for some k; with k̂ := γ + k it follows that E(Δ²_{k+1}) ≤ (1 − 2βµ/k̂)(v/k̂) + β²M²/k̂² = ((k̂ − 1)/k̂²) v − ((2βµ − 1)/k̂²) v + β²M²/k̂² ≤ ((k̂ − 1)/k̂²) v ≤ v/(k̂ + 1), where the second inequality uses v ≥ β²M²/(2βµ − 1).
61 61/84 Stochastic optimization: stochastic subgradient descent O(1/ɛ²); stochastic gradient descent with strong convexity O(1/ɛ); stochastic gradient descent with strong convexity and smoothness O(1/ɛ). Deterministic optimization: subgradient descent O(n/ɛ²); gradient descent with strong convexity O(n/ɛ); gradient descent with strong convexity and smoothness O(n log(1/ɛ)). The complexity refers to the number of component (sub)gradient evaluations: we need to compute n gradients in every iteration of GD and one gradient in SGD.
62 Outline 62/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
63 63/84 Variance Reduction. Assumption: f(x) is L-smooth: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂; f(x) is µ-strongly convex: ⟨∇f(x) − ∇f(y), x − y⟩ ≥ µ‖x − y‖₂²; E_s[∇f_s(x)] = ∇f(x); E_s‖∇f_s(x)‖² ≤ M². GD: linear convergence, complexity O(n log(1/ɛ)). SGD: sublinear convergence, complexity O(1/ɛ). What is the essential difference between SGD and GD?
64 Variance Reduction 64/84 GD: Δ²_{k+1} = ‖x_{k+1} − x*‖₂² = ‖x_k − α∇f(x_k) − x*‖₂² = Δ²_k − 2α⟨∇f(x_k), x_k − x*⟩ + α²‖∇f(x_k)‖₂² ≤ (1 − 2αµ)Δ²_k + α²‖∇f(x_k)‖₂² (µ-strong convexity) ≤ (1 − 2αµ + α²L²)Δ²_k (L-smoothness). SGD: EΔ²_{k+1} = E‖x_{k+1} − x*‖₂² = E‖x_k − α∇f_{s_k}(x_k) − x*‖₂² = EΔ²_k − 2αE⟨∇f_{s_k}(x_k), x_k − x*⟩ + α²E‖∇f_{s_k}(x_k)‖₂² = EΔ²_k − 2αE⟨∇f(x_k), x_k − x*⟩ + α²E‖∇f_{s_k}(x_k)‖₂² ≤ (1 − 2αµ)EΔ²_k + α²E‖∇f_{s_k}(x_k) − ∇f(x_k) + ∇f(x_k)‖₂² ≤ (1 − 2αµ + 2α²L²)EΔ²_k + 2α²E‖∇f_{s_k}(x_k) − ∇f(x_k)‖₂².
65 Variance Reduction 65/84 EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²) EΔ²_k [term A] + 2α² E‖∇f_{s_k}(x_k) − ∇f(x_k)‖₂² [term B]. (54) This yields a worst-case convergence rate of 1/T for SGD. In practice, the actual convergence rate may be somewhat better than this bound: initially B ≪ A and we observe a linear-rate regime; once B > A we observe the 1/T rate. How can we reduce the variance term B to speed up SGD? SAG (stochastic average gradient), SAGA, and SVRG (stochastic variance reduced gradient).
66 SAG method 66/84 SAG method (Le Roux, Schmidt, Bach 2012): x_{k+1} = x_k − (α_k/n) Σ_{i=1}^n g^i_k = x_k − α_k ( (1/n)(∇f_{s_k}(x_k) − g^{s_k}_{k−1}) + (1/n) Σ_{i=1}^n g^i_{k−1} ), (55) where g^i_k = ∇f_i(x_k) if i = s_k and g^i_k = g^i_{k−1} otherwise, (56) and s_k is uniformly sampled from {1, …, n}. Complexity (number of component gradient evaluations): O(max{n, L/µ} log(1/ɛ)). It needs to store the most recent gradient of each component. SAGA (Defazio, Bach, Lacoste-Julien 2014) is an unbiased revision of SAG: x_{k+1} = x_k − α_k ( ∇f_{s_k}(x_k) − g^{s_k}_{k−1} + (1/n) Σ_{i=1}^n g^i_{k−1} ). (57)
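SAGA as in (57) can be sketched with an explicit table of stored component gradients. The synthetic least-squares data are our toy choice, and α = 1/(3 L_max) is a common heuristic stepsize, an assumption on our part rather than a value from the slides.

```python
import numpy as np

# SAGA with a gradient table for f(x) = (1/2n) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

table = np.zeros((n, d))                 # g^i: last stored gradient of f_i
g_avg = table.mean(axis=0)
x = np.zeros(d)
alpha = 1.0 / (3 * (A * A).sum(axis=1).max())

for k in range(20000):
    i = rng.integers(n)
    g_new = (A[i] @ x - b[i]) * A[i]
    v = g_new - table[i] + g_avg         # unbiased: E[v | x] = grad f(x)
    x -= alpha * v
    g_avg += (g_new - table[i]) / n      # keep the running average consistent
    table[i] = g_new
print(np.linalg.norm(x - x_star))        # linear convergence to x*
```

Maintaining g_avg incrementally keeps the per-iteration cost at one component gradient, at the price of O(nd) memory for the table, exactly the storage caveat noted above.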
67 67/84 SVRG method. SVRG method (Johnson and Zhang 2013): v_k = ∇f_{s_k}(x_k) − ∇f_{s_k}(y) + ∇f(y), x_{k+1} = x_k − α_k v_k, where y is a snapshot point and s_k is uniformly sampled from {1, …, n}. v_k is an unbiased estimate of the gradient ∇f(x_k): Ev_k = ∇f(x_k) − ∇f(y) + ∇f(y) = ∇f(x_k). (58) Recall the bound EΔ²_{k+1} ≤ (1 − 2αµ) EΔ²_k + α² E‖v_k‖₂². (59)
68 SVRG method 68/84 Additional assumption: L-smoothness of the component functions, ‖∇f_i(x) − ∇f_i(y)‖₂ ≤ L‖x − y‖₂. (60) Let's analyze the "variance": E‖v_k‖₂² = E‖∇f_{s_k}(x_k) − ∇f_{s_k}(y) + ∇f(y)‖₂² = E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*) + ∇f_{s_k}(x*) − ∇f_{s_k}(y) + ∇f(y)‖₂² ≤ 2E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*)‖₂² + 2E‖∇f_{s_k}(y) − ∇f(y) − ∇f_{s_k}(x*)‖₂² ≤ 2L²EΔ²_k + 2E‖∇f_{s_k}(y) − ∇f_{s_k}(x*)‖₂² ≤ 2L²EΔ²_k + 2L²E‖y − x*‖². If x_k and y are close to x*, the variance is small.
69 SVRG method 69/84 We only need to choose a recent iterate as y. Picking a fresh y more often should decrease the variance; however, doing this too often involves computing too many full gradients. Let's set y = x_1. Then EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²) EΔ²_k + 2α²L² EΔ²_1. (61) Unrolling this: EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²)^k EΔ²_1 + Σ_{i=0}^{k−1} (1 − 2αµ + 2α²L²)^i · 2α²L² EΔ²_1 ≤ (1 − 2αµ + 2α²L²)^k EΔ²_1 + 2kα²L² EΔ²_1. (62)
70 SVRG method 70/84 Unrolling this: EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²)^k EΔ²_1 + 2kα²L² EΔ²_1. (63) Suppose we would like this to be 0.5 EΔ²_1 after T iterations. If we pick α = O(1) µ/L², then it turns out that we can set T = O(1) L²/µ². In fact, this can be improved to T = O(1) L/µ, where κ = L/µ is the condition number.
71 SVRG method 71/84 Algorithm 2: SVRG method. Input: x̃_0, α, m. for e = 1 : E do: y ← x̃_{e−1}, x_1 ← x̃_{e−1}; g ← ∇f(y) (full gradient); for k = 1 : m do: pick s_k ∈ {1, …, n} uniformly at random; v_k = ∇f_{s_k}(x_k) − ∇f_{s_k}(y) + g; x_{k+1} = x_k − α v_k; end for; x̃_e ← (1/m) Σ_{k=1}^m x_k; end for
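Algorithm 2 translates almost line for line into code. The sketch below uses least-squares components; the number of stages E, the inner length m, and the stepsize α are our own illustrative choices rather than the tuned constants of the convergence theorem.

```python
import numpy as np

# SVRG (Algorithm 2) on f(x) = (1/2n) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

L = (A * A).sum(axis=1).max()            # smoothness of the components
alpha, m, E = 1.0 / (10 * L), 5 * n, 30

x_tilde = np.zeros(d)
for e in range(E):                       # outer stages
    y = x_tilde.copy()                   # snapshot point
    g = A.T @ (A @ y - b) / n            # full gradient at the snapshot
    x, acc = y.copy(), np.zeros(d)
    for k in range(m):                   # inner loop
        i = rng.integers(n)
        v = (A[i] @ x - b[i]) * A[i] - (A[i] @ y - b[i]) * A[i] + g
        x -= alpha * v                   # variance-reduced step
        acc += x
    x_tilde = acc / m                    # average of the inner iterates
print(np.linalg.norm(x_tilde - x_star))
```

Each stage costs one full gradient plus m component gradients, matching the O((n + κ) log(1/ɛ)) accounting in the summary.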
72 72/84 SVRG method. Convergence of SVRG: suppose 0 < α ≤ 1/(2L) and m is sufficiently large so that ρ = 1/(µα(1 − 2Lα)m) + 2Lα/(1 − 2Lα) < 1. (64) Then we have linear convergence in expectation: E f(x̃_s) − f(x*) ≤ ρ^s [f(x̃_0) − f(x*)]. (65) If α = θ/L, then ρ = (L/µ)/(θ(1 − 2θ)m) + 2θ/(1 − 2θ). (66) Choosing θ = 0.1 and m = 50(L/µ) results in ρ ≈ 0.5. Overall complexity: O((L/µ + n) log(1/ɛ)).
73 Proof of theorem 73/84 EΔ²_{k+1} = E‖x_{k+1} − x*‖₂² = E‖x_k − αv_k − x*‖₂² = EΔ²_k − 2αE⟨v_k, x_k − x*⟩ + α²E‖v_k‖₂² = EΔ²_k − 2αE⟨∇f(x_k), x_k − x*⟩ + α²E‖v_k‖₂² ≤ EΔ²_k − 2αE(f(x_k) − f(x*)) + α²E‖v_k‖₂². By smoothness of f_i(x), ‖∇f_i(x) − ∇f_i(x*)‖² ≤ 2L[f_i(x) − f_i(x*) − ∇f_i(x*)ᵀ(x − x*)]. (67) Summing these inequalities over i = 1, 2, …, n and using ∇f(x*) = 0: (1/n) Σ_{i=1}^n ‖∇f_i(x) − ∇f_i(x*)‖² ≤ 2L[f(x) − f(x*)]. (68)
74 Proof of theorem 74/84 Using E[‖X − E[X]‖²] ≤ E[‖X‖²], we obtain the bound E_s‖v_k‖₂² = E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*) + ∇f_{s_k}(x*) − ∇f_{s_k}(y) + ∇f(y)‖₂² ≤ 2E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*)‖₂² + 2E‖∇f_{s_k}(y) − ∇f_{s_k}(x*) − E[∇f_{s_k}(y) − ∇f_{s_k}(x*)]‖₂² ≤ 2E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*)‖₂² + 2E‖∇f_{s_k}(y) − ∇f_{s_k}(x*)‖² ≤ 4L[f(x_k) − f(x*) + f(y) − f(x*)]. Now continue the derivation: EΔ²_{k+1} ≤ EΔ²_k − 2αE(f(x_k) − f(x*)) + α²E‖v_k‖₂² ≤ EΔ²_k − 2α(1 − 2αL)E(f(x_k) − f(x*)) + 4Lα²[f(y) − f(x*)].
75 Proof of theorem 75/84 Summing over k = 1, …, m (note that y = x̃_{e−1} and x̃_e = (1/m) Σ_{k=1}^m x_k): EΔ²_{m+1} + 2α(1 − 2αL) Σ_{k=1}^m E(f(x_k) − f(x*)) ≤ E‖x̃_{e−1} − x*‖² + 4Lα²m E[f(x̃_{e−1}) − f(x*)] ≤ (2/µ) E[f(x̃_{e−1}) − f(x*)] + 4Lα²m E[f(x̃_{e−1}) − f(x*)]. Therefore, for each stage e, E[f(x̃_e) − f(x*)] ≤ (1/m) Σ_{k=1}^m E(f(x_k) − f(x*)) ≤ (1/(2α(1 − 2αL)m)) ((2/µ) + 4mLα²) E[f(x̃_{e−1}) − f(x*)]. (69)
76 Summary 76/84 Condition number: κ = L/µ. SVRG: E ∼ log(1/ɛ), so the complexity is O((n + κ) log(1/ɛ)). GD: T ∼ κ log(1/ɛ), so the complexity is O(nκ log(1/ɛ)). SGD: T ∼ κ/ɛ, so the complexity is O(κ/ɛ). Even though we allow ourselves a few full-gradient computations in SVRG, we don't really pay too much in terms of complexity.
77 Outline 77/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
78 78/84 Stochastic Algorithms in Deep Learning. Consider the problem min_{x∈R^d} (1/n) Σ_{i=1}^n f_i(x). References: chapter 8 in … Gradient descent: x_{t+1} = x_t − (α_t/n) Σ_{i=1}^n ∇f_i(x_t). Stochastic gradient descent: x_{t+1} = x_t − α_t ∇f_i(x_t). SGD with momentum: v_{t+1} = µ_t v_t − α_t ∇f_i(x_t), x_{t+1} = x_t + v_{t+1}.
79 79/84 Stochastic Algorithms in Deep Learning. Nesterov accelerated gradient (original version): v_{t+1} = (1 + µ_t)x_t − µ_t x_{t−1}, x_{t+1} = v_{t+1} − α_t ∇f_i(v_{t+1}), where µ_t = (t + 2)/(t + 5) and α_t is fixed or determined by line search (inverse of the Lipschitz constant). Nesterov accelerated gradient (momentum version): v_{t+1} = µ_t v_t − α_t ∇f_i(x_t + µ_t v_t), x_{t+1} = x_t + v_{t+1}, with the same µ_t and α_t.
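The momentum update from the previous slide can be sketched as follows on toy least-squares components. The constant µ_t = 0.9, the stepsize, and the data are our illustrative choices rather than the t-dependent schedule above.

```python
import numpy as np

# SGD with momentum: v_{t+1} = mu v_t - alpha grad f_i(x_t); x_{t+1} = x_t + v_{t+1}.
rng = np.random.default_rng(0)
n, d = 100, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

x, v = np.zeros(d), np.zeros(d)
alpha, mu = 0.005, 0.9
for t in range(20000):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]         # stochastic gradient of f_i
    v = mu * v - alpha * g               # velocity update
    x = x + v                            # position update
print(np.linalg.norm(x - x_star))
```

With µ = 0.9 the effective stepsize is roughly α/(1 − µ), ten times the raw α, which is why momentum is typically paired with a smaller learning rate.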
80 Stochastic Algorithms in Deep Learning 80/84 Adaptive Subgradient Method (Adagrad): let g_t = ∇f_i(x_t), g²_t = diag[g_t g_tᵀ] ∈ R^d, and initialize G_1 = g²_1. At step t: x_{t+1} = x_t − (α_t / (√(G_t) + ɛ1_d)) ∘ ∇f_i(x_t), G_{t+1} = G_t + g²_{t+1}. Here and in the following slides, vector-vector multiplications, divisions, and square roots are element-wise.
81 Stochastic Algorithms in Deep Learning 81/84 Adadelta: denote the exponentially decaying averages E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, RMS[g]_t = √(E[g²]_t + ɛ1_d) ∈ R^d, and E[Δx²]_t = ρE[Δx²]_{t−1} + (1−ρ)Δx²_t, RMS[Δx]_t = √(E[Δx²]_t + ɛ1_d) ∈ R^d. Initial parameters: x_1, E[g²]_0 = 0, E[Δx²]_0 = 0, decay rate ρ, constant ɛ. At step t, compute g_t = ∇f_i(x_t) and E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, Δx_t = −(RMS[Δx]_{t−1} / RMS[g]_t) ∘ g_t, E[Δx²]_t = ρE[Δx²]_{t−1} + (1−ρ)Δx²_t, x_{t+1} = x_t + Δx_t.
82 Stochastic Algorithms in Deep Learning 82/84 RMSprop: initialize E[g²]_0 = 0. At step t: E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, x_{t+1} = x_t − (α_t / √(E[g²]_t + ɛ1_d)) ∘ ∇f_i(x_t). Here ρ ≈ 0.9 and a small constant α_t is a good choice.
83 Stochastic Algorithms in Deep Learning 83/84 Adam: initialize E[g²]_0 = 0, E[g]_0 = 0. At step t: E[g]_t = µE[g]_{t−1} + (1−µ)g_t, E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, Ê[g]_t = E[g]_t / (1 − µ^t), Ê[g²]_t = E[g²]_t / (1 − ρ^t), x_{t+1} = x_t − α Ê[g]_t / (√(Ê[g²]_t) + ɛ1_d). Here ρ, µ are decay rates and α is the learning rate.
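The Adam update transcribes directly into code. The toy least-squares objective is our choice, and the hyperparameter values below are the commonly used defaults, an assumption on our part; with a fixed α the method reaches a neighborhood of x* rather than the exact minimizer.

```python
import numpy as np

# Adam, following the update on this slide, on a toy least-squares objective.
rng = np.random.default_rng(0)
n, d = 100, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

x = np.zeros(d)
m1, m2 = np.zeros(d), np.zeros(d)        # E[g]_t and E[g^2]_t
alpha, mu, rho, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 20001):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]
    m1 = mu * m1 + (1 - mu) * g          # first-moment EMA
    m2 = rho * m2 + (1 - rho) * g * g    # second-moment EMA
    m1_hat = m1 / (1 - mu**t)            # bias correction
    m2_hat = m2 / (1 - rho**t)
    x = x - alpha * m1_hat / (np.sqrt(m2_hat) + eps)
print(np.linalg.norm(x - x_star))        # a neighborhood of x*, not exact
```

The bias-correction factors matter mostly in the first few hundred steps, when the EMAs are still dominated by their zero initialization.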
84 84/84 Optimization algorithms in Deep Learning. Stochastic gradient-type algorithms. Algorithms implemented in Caffe: adadelta, adagrad, adam, nesterov, rmsprop (caffe/solvers). Lasagne provides: sgd, momentum, nesterov_momentum, adagrad, rmsprop, adadelta, adam, adamax (lasagne/updates.py). Algorithms implemented in TensorFlow: Adadelta, AdagradDA, Adagrad, ProximalAdagrad, Ftrl, Momentum, adam, Momentum, CenteredRMSProp; concrete implementations: master/tensorflow/core/kernels/training_ops.cc
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationJournal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19
Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how
More information5. Subgradient method
L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize
More informationIntroduction to Machine Learning (67577) Lecture 7
Introduction to Machine Learning (67577) Lecture 7 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Solving Convex Problems using SGD and RLM Shai Shalev-Shwartz (Hebrew
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationLecture 3: Huge-scale optimization problems
Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1
More informationClass 2 & 3 Overfitting & Regularization
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationFinite-sum Composition Optimization via Variance Reduced Gradient Descent
Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationConvergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity
Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to
More informationThe PAC Learning Framework -II
The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationMini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization
Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More information1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0
Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =
More informationLecture 25: Subgradient Method and Bundle Methods April 24
IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationSecond order machine learning
Second order machine learning Michael W. Mahoney ICSI and Department of Statistics UC Berkeley Michael W. Mahoney (UC Berkeley) Second order machine learning 1 / 88 Outline Machine Learning s Inverse Problem
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationLinear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra
Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Michael W. Mahoney ICSI and Dept of Statistics, UC Berkeley ( For more info,
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationA Sparsity Preserving Stochastic Gradient Method for Composite Optimization
A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationA Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir
A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence
More informationCS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims
CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target
More informationYou should be able to...
Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationSecond order machine learning
Second order machine learning Michael W. Mahoney ICSI and Department of Statistics UC Berkeley Michael W. Mahoney (UC Berkeley) Second order machine learning 1 / 96 Outline Machine Learning s Inverse Problem
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationStochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited
More informationModern Optimization Techniques
Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationConvex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization
Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationLinearly-Convergent Stochastic-Gradient Methods
Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationComputational Learning Theory - Hilary Term : Learning Real-valued Functions
Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More information