Large-scale Machine Learning and Optimization
1 1/84 Large-scale Machine Learning and Optimization. Zaiwen Wen. Acknowledgement: these slides are based on lecture notes by Dimitris Papailiopoulos and Shiqian Ma and on the book "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David. Thanks to Yongfeng Li for preparing part of these slides.
2 Outline 2/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
3 Machine Learning Model 3/84 Machine learning model: (x, y) ∼ P, where P is an underlying distribution. Given a dataset D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} with (x_i, y_i) ∼ P i.i.d. Our goal is to find a hypothesis h_s with the smallest expected risk, i.e., min_{h_s ∈ H} R[h_s] := E[f(h_s(x), y)], (1) where H is the hypothesis class.
4 Machine Learning Model 4/84 In practice, we do not know the exact form of the underlying distribution P. Empirical Risk Minimization (ERM): min_{h_s ∈ H} R̂[h_s] := (1/n) Σ_{i=1}^n f(h_s(x_i), y_i). (2) We care about two questions on ERM: When does the empirical risk concentrate around the true risk? How does the choice of hypothesis class affect the ERM solution?
5 Machine Learning Model 5/84 Empirical risk minimizer: ĥ_n ∈ argmin_{h ∈ H} R̂_n[h]. Expected risk minimizer: h* ∈ argmin_{h ∈ H} R[h]. Concentration means that for any ɛ > 0 and 0 < δ < 1, if n is large enough, we have P(|R[ĥ_n] − R[h*]| ≤ ɛ) > 1 − δ, (3) i.e., R[ĥ_n] converges to R[h*] in probability. Concentration will fail in some cases.
6 Hoeffding Inequality 6/84 Let X_1, X_2, … be a sequence of i.i.d. random variables and assume that for all i, E(X_i) = µ and P(a ≤ X_i ≤ b) = 1. Then for any ɛ > 0, P(|(1/n) Σ_{i=1}^n X_i − µ| ≥ ɛ) ≤ 2 exp(−2nɛ² / (b − a)²). (4) The Hoeffding inequality is a non-asymptotic statement that the sample mean concentrates around the expectation. The Azuma-Hoeffding inequality is a martingale version: let X_1, X_2, … be a martingale difference sequence with |X_i| ≤ B for all i = 1, 2, … Then P(Σ_{i=1}^n X_i ≥ t) ≤ exp(−2t² / (nB²)) (5) and P(Σ_{i=1}^n X_i ≤ −t) ≤ exp(−2t² / (nB²)). (6)
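As a quick numerical sanity check of the bound in (4), one can simulate sample means and compare the empirical deviation frequency with the Hoeffding bound. The sketch below is ours, not from the slides; the sample size, trial count, and ɛ are arbitrary illustrative choices.

```python
import numpy as np

# Numerical sanity check of the Hoeffding bound (4).
rng = np.random.default_rng(0)
n, trials, eps = 200, 5000, 0.1
a, b = 0.0, 1.0                          # X_i uniform on [a, b], so mu = 0.5
means = rng.uniform(a, b, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= eps)
# P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2 / (b - a)^2)
bound = 2 * np.exp(-2 * n * eps**2 / (b - a) ** 2)
print(empirical, "<=", bound)            # observed frequency obeys the bound
```

For uniform data the observed frequency is far below the bound, which is distribution-free and therefore conservative.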
7 7/84 To make the exposition simpler, we assume that the loss function satisfies 0 ≤ f(a, b) ≤ 1 for all a, b. By the Hoeffding inequality, for a fixed h, P(|R̂_n[h] − R[h]| ≥ ɛ) ≤ 2e^{−2nɛ²}. (7) Union bound: P(∪_{h∈H} {|R̂_n[h] − R[h]| ≥ ɛ}) ≤ 2|H| e^{−2nɛ²}. (8) If we want to bound P(∪_{h∈H} {|R̂_n[h] − R[h]| ≥ ɛ}) ≤ δ, we need a sample size n ≥ (1/(2ɛ²)) log(2|H|/δ) = O((log|H| + log(δ⁻¹)) / ɛ²). (9) What if |H| = ∞? Then this bound doesn't work.
8 8/84 If n is large enough, with probability 1 − δ we have R[ĥ_n] − R[h*] = (R[ĥ_n] − R̂_n[ĥ_n]) + (R̂_n[ĥ_n] − R̂_n[h*]) + (R̂_n[h*] − R[h*]) ≤ ɛ + 0 + ɛ, since the middle term is nonpositive by the definition of ĥ_n. For a two-label classification problem, with probability 1 − δ, sup_{h∈H} |R̂_n[h] − R[h]| ≤ O(√((VC[H] log(n / VC[H]) + log(1/δ)) / n)), (10) where VC[H] is the VC dimension of H. Finite VC dimension is a sufficient and necessary condition for empirical risk concentration in two-label classification.
9 9/84 VC dimension. VC dimension of a set family: let H be a set family (a set of sets) and C a set. Their intersection is defined as the set family H ∩ C := {h ∩ C | h ∈ H}. We say that a set C is shattered by H if |H ∩ C| = 2^{|C|}. The VC dimension of H is the largest integer D such that there exists a set C with cardinality D that is shattered by H. A classification model f with parameter vector θ is said to shatter a set of data points (x_1, x_2, …, x_n) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points. The VC dimension of a model f is the maximum number of points that can be arranged so that f shatters them; more formally, it is the maximum cardinality D such that some data point set of cardinality D can be shattered by f.
10 10/84 VC dimension examples: if for every n and every labeled set {(x_1, y_1), …, (x_n, y_n)} there exists h ∈ H s.t. h(x_i) = y_i, then VC[H] = ∞. For a neural network whose activation functions are all sign functions, VC[H] ≤ O(w log(w)), where w is the number of parameters. We must use prior knowledge and choose a proper hypothesis class. Suppose a is the true model: R[ĥ_n] − R[a] = (R[ĥ_n] − R[h*]) + (R[h*] − R[a]) =: A + B. If the hypothesis class is too large, B will be small but A will be large (overfitting). If the hypothesis class is too small, A will be small but B will be large (underfitting).
11 Outline 11/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
12 The gradient and subgradient methods 12/84 Consider the problem min_{x∈R^n} f(x). (11) Gradient method: x_{k+1} = x_k − α_k ∇f(x_k). (12) Subgradient method: x_{k+1} = x_k − α_k g_k, g_k ∈ ∂f(x_k). (13) The update is equivalent to x_{k+1} = argmin_x { f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }. (14)
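The subgradient iteration (13) can be sketched in a few lines. The toy instance below is an assumption of ours (the target c and the 1/√k stepsize are illustrative choices): it minimizes the nonsmooth convex function f(x) = ‖x − c‖₁, tracking the best function value since the iterates themselves need not be monotone.

```python
import numpy as np

# Subgradient method x_{k+1} = x_k - alpha_k g_k on f(x) = ||x - c||_1.
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.abs(x - c).sum()

x = np.zeros(3)
best = f(x)
for k in range(1, 2001):
    g = np.sign(x - c)                   # a subgradient of f at x
    x = x - g / np.sqrt(k)               # alpha_k = 1 / sqrt(k)
    best = min(best, f(x))
print(best)                              # approaches the optimal value 0
```

Near the solution the iterates oscillate with amplitude roughly α_k, which is why the best (or averaged) iterate is the right quantity to monitor.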
13 Convergence guarantees 13/84 Assumption: there is at least one minimizing point x* ∈ argmin f(x) with f(x*) > −∞; the subgradients are bounded: ‖g‖₂ ≤ M for all x and all g ∈ ∂f(x). Theorem 1 (convergence of subgradient method): let α_k ≥ 0 be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let x_k be generated by the subgradient iteration. Then for all K ≥ 1, Σ_{k=1}^K α_k [f(x_k) − f(x*)] ≤ (1/2)‖x_1 − x*‖₂² + (1/2) Σ_{k=1}^K α_k² M². (15)
14 Proof of Theorem 1 14/84 By convexity, ⟨g_k, x* − x_k⟩ ≤ f(x*) − f(x_k). Then (1/2)‖x_{k+1} − x*‖₂² = (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)M². Rearranging, α_k(f(x_k) − f(x*)) ≤ (1/2)‖x_k − x*‖₂² − (1/2)‖x_{k+1} − x*‖₂² + (α_k²/2)M².
15 Convergence guarantees 15/84 Corollary: let A_K = Σ_{k=1}^K α_k and define x̄_K = (1/A_K) Σ_{k=1}^K α_k x_k. Then f(x̄_K) − f(x*) ≤ (‖x_1 − x*‖₂² + Σ_{k=1}^K α_k² M²) / (2 Σ_{k=1}^K α_k). Convergence requires Σ_k α_k = ∞ and (Σ_{k=1}^K α_k²)/(Σ_{k=1}^K α_k) → 0. Let ‖x_1 − x*‖₂ ≤ R. For a fixed stepsize α_k = α: f(x̄_K) − f(x*) ≤ R²/(2Kα) + αM²/2. For a given K, take α = R/(M√K): f(x̄_K) − f(x*) ≤ RM/√K. (16)
16 Projected subgradient methods 16/84 Consider the problem min_{x∈C} f(x). (17) Projected subgradient method: x_{k+1} = π_C(x_k − α_k g_k), g_k ∈ ∂f(x_k), (18) with projection π_C(x) = argmin_{y∈C} ‖x − y‖₂². The update is equivalent to x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }. (19)
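A minimal sketch of the projected subgradient iteration (18), on a hypothetical instance of our choosing: a linear objective f(x) = ⟨c, x⟩ over the unit ℓ₂ ball, where π_C is simply a rescaling and the solution is x* = −c/‖c‖.

```python
import numpy as np

# Projected subgradient method: minimize <c, x> over {x : ||x||_2 <= 1}.
c = np.array([3.0, 4.0])

def project_ball(y):
    """pi_C(y): Euclidean projection onto the unit l2 ball."""
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

x = np.zeros(2)
for k in range(1, 501):
    g = c                                 # gradient of the linear objective
    x = project_ball(x - g / np.sqrt(k))  # alpha_k = 1 / sqrt(k)
print(x)                                  # [-0.6, -0.8] = -c / ||c||
```

On this instance the iterate lands exactly on the boundary point −c/‖c‖ and stays there, since every step moves in the fixed direction −c before projection.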
17 Convergence guarantees 17/84 Assumption: the set C ⊆ R^n is compact and convex, and ‖x − x*‖₂ ≤ R < ∞ for all x ∈ C; the subgradients are bounded: ‖g‖₂ ≤ M for all x and all g ∈ ∂f(x). Theorem 2 (convergence of projected subgradient method): let α_k ≥ 0 be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let x_k be generated by the projected subgradient iteration. Then for all K ≥ 1, Σ_{k=1}^K α_k [f(x_k) − f(x*)] ≤ (1/2)‖x_1 − x*‖₂² + (1/2) Σ_{k=1}^K α_k² M². (20)
18 Proof of Theorem 2 18/84 By non-expansiveness of π_C, ‖x_{k+1} − x*‖₂² = ‖π_C(x_k − α_k g_k) − x*‖₂² ≤ ‖x_k − α_k g_k − x*‖₂². Hence (1/2)‖x_{k+1} − x*‖₂² ≤ (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)M². Rearranging, α_k(f(x_k) − f(x*)) ≤ (1/2)‖x_k − x*‖₂² − (1/2)‖x_{k+1} − x*‖₂² + (α_k²/2)M².
19 Convergence guarantees 19/84 Corollary: let A_K = Σ_{k=1}^K α_k and define x̄_K = (1/A_K) Σ_{k=1}^K α_k x_k. Then f(x̄_K) − f(x*) ≤ (‖x_1 − x*‖₂² + Σ_{k=1}^K α_k² M²) / (2 Σ_{k=1}^K α_k). Convergence requires Σ_k α_k = ∞ and (Σ_{k=1}^K α_k²)/(Σ_{k=1}^K α_k) → 0. For a fixed stepsize α_k = α: f(x̄_K) − f(x*) ≤ R²/(2Kα) + αM²/2. (21) Take α = R/(M√K): f(x̄_K) − f(x*) ≤ RM/√K.
20 20/84 Stochastic subgradient methods. The stochastic optimization problem: min_{x∈C} f(x) := E_P[F(x; S)], (22) where S is a random variable on the sample space S with distribution P, and for each s, x ↦ F(x; s) is convex. The expected subgradient satisfies E_P[g(x; S)] ∈ ∂f(x), where g(x; s) ∈ ∂F(x; s): f(y) = E_P[F(y; S)] ≥ E_P[F(x; S) + ⟨g(x; S), y − x⟩] = f(x) + ⟨E_P[g(x; S)], y − x⟩.
21 Stochastic subgradient methods 21/84 The deterministic (finite-sum) optimization problem: min_{x∈C} f(x) := (1/m) Σ_{i=1}^m F(x; s_i). (23) Why stochastic? E_P[F(x; S)] is generally intractable to compute. Small per-iteration complexity: only one subgradient g(x; s) ∈ ∂F(x; s) needs to be computed in each iteration. It is also more likely to reach a global solution in the non-convex case. Stochastic subgradient method: x_{k+1} = π_C(x_k − α_k g_k), with E[g_k | x_k] ∈ ∂f(x_k).
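The stochastic subgradient method can be sketched on a toy finite sum whose minimizer is known in closed form: for f(x) = (1/m) Σ_i |x − s_i| the minimizer is the median of the s_i. The data, the 1/√k stepsizes, and the stepsize-weighted averaging below are our illustrative choices, not prescriptions from the slides.

```python
import numpy as np

# Stochastic subgradient method for min_x (1/m) sum_i |x - s_i|.
rng = np.random.default_rng(1)
s = rng.normal(2.0, 1.0, size=1000)

x, wsum, wavg = 0.0, 0.0, 0.0
for k in range(1, 5001):
    i = rng.integers(len(s))             # sample one component uniformly
    g = np.sign(x - s[i])                # E[g | x] is a subgradient of f
    alpha = 1.0 / np.sqrt(k)
    x -= alpha * g
    wavg += alpha * x                    # stepsize-weighted average iterate
    wsum += alpha
x_bar = wavg / wsum
print(x_bar, np.median(s))               # x_bar is close to the median
```

As in the corollaries, it is the (weighted) average iterate x̄ that carries the guarantee; the raw iterate keeps jittering by roughly α_k.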
22 Convergence guarantees 22/84 Assumption: the set C ⊆ R^n is compact and convex, and ‖x − x*‖₂ ≤ R < ∞ for all x ∈ C; the second moments are bounded: E‖g(x, S)‖₂² ≤ M² for all x. Theorem 3 (convergence of stochastic subgradient method): let α_k ≥ 0 be any non-negative sequence of stepsizes and let the preceding assumptions hold. Let x_k be generated by the stochastic subgradient iteration. Then for all K ≥ 1, Σ_{k=1}^K α_k E[f(x_k) − f(x*)] ≤ (1/2) E‖x_1 − x*‖₂² + (1/2) Σ_{k=1}^K α_k² M². (24)
23 Proof of Theorem 3 23/84 Let ∇f(x_k) = E[g_k | x_k] and ξ_k = g_k − ∇f(x_k). Then (1/2)‖x_{k+1} − x*‖₂² ≤ (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨∇f(x_k), x_k − x*⟩ + (α_k²/2)‖g_k‖₂² + α_k⟨ξ_k, x* − x_k⟩ ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)‖g_k‖₂² + α_k⟨ξ_k, x* − x_k⟩. Since E[⟨ξ_k, x* − x_k⟩] = E[E[⟨ξ_k, x* − x_k⟩ | x_k]] = 0, taking expectations gives α_k E[f(x_k) − f(x*)] ≤ (1/2) E‖x_k − x*‖₂² − (1/2) E‖x_{k+1} − x*‖₂² + (α_k²/2) M².
24 Convergence guarantees 24/84 Corollary: let A_K = Σ_{k=1}^K α_k and define x̄_K = (1/A_K) Σ_{k=1}^K α_k x_k. Then E[f(x̄_K) − f(x*)] ≤ (R² + Σ_{k=1}^K α_k² M²) / (2 Σ_{k=1}^K α_k). Convergence requires Σ_k α_k = ∞ and (Σ_{k=1}^K α_k²)/(Σ_{k=1}^K α_k) → 0. For a fixed stepsize α_k = α: E[f(x̄_K) − f(x*)] ≤ R²/(2Kα) + αM²/2. (25) Take α = R/(M√K): E[f(x̄_K) − f(x*)] ≤ RM/√K.
25 25/84 Theorem 5 (convergence of stochastic subgradient method): let α_k > 0 be a non-increasing sequence of stepsizes and let the preceding assumptions hold. Let x̄_K = (1/K) Σ_{k=1}^K x_k. Then E[f(x̄_K) − f(x*)] ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k. (26) Proof sketch: from α_k E[f(x_k) − f(x*)] ≤ (1/2) E‖x_k − x*‖₂² − (1/2) E‖x_{k+1} − x*‖₂² + (α_k²/2) M², divide by α_k to get E[f(x_k) − f(x*)] ≤ (1/(2α_k)) (E‖x_k − x*‖₂² − E‖x_{k+1} − x*‖₂²) + (α_k/2) M², then sum over k, using the non-increasing stepsizes and ‖x_k − x*‖₂ ≤ R to telescope.
26 26/84 Corollary: let the conditions of Theorem 5 hold, and let α_k = R/(M√k) for each k. Then E[f(x̄_K) − f(x*)] ≤ 3RM/(2√K). (27) Proof: Σ_{k=1}^K 1/√k ≤ ∫₀^K t^{−1/2} dt = 2√K. Corollary: let α_k be chosen such that E[f(x̄_K) − f(x*)] → 0. Then f(x̄_K) − f(x*) → 0 in probability as K → ∞; that is, for all ɛ > 0 we have lim sup_{K→∞} P(f(x̄_K) − f(x*) ≥ ɛ) = 0. (28)
27 27/84 By the Markov inequality, P(X ≥ α) ≤ EX/α for X ≥ 0 and α > 0, so P(f(x̄_K) − f(x*) ≥ ɛ) ≤ (1/ɛ) E[f(x̄_K) − f(x*)] → 0. Theorem 6 (convergence of stochastic subgradient method): in addition to the conditions of Theorem 5, assume that ‖g‖₂ ≤ M for all stochastic subgradients g. Then for any ɛ > 0, with probability at least 1 − e^{−ɛ²/2}, f(x̄_K) − f(x*) ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + RMɛ/√K. (29) Taking α_k = R/(M√k) and setting δ = e^{−ɛ²/2}, with probability at least 1 − δ, f(x̄_K) − f(x*) ≤ 3RM/(2√K) + RM√(2 log(1/δ)/K).
28 Azuma-Hoeffding Inequality 28/84 Martingale: a sequence X_1, X_2, … of random vectors is a martingale if there is a sequence of random vectors Z_1, Z_2, … such that for each n, X_n is a function of Z_n, Z_{n−1} is a function of Z_n, and the conditional expectation satisfies E[X_n | Z_{n−1}] = X_{n−1}. Martingale difference sequence: X_1, X_2, … is a martingale difference sequence if S_n = Σ_{i=1}^n X_i is a martingale or, equivalently, E[X_n | Z_{n−1}] = 0. Example: X_1, X_2, … independent with E(X_i) = 0 and Z_i = (X_1, …, X_i).
29 Azuma-Hoeffding Inequality 29/84 Let X_1, X_2, … be a martingale difference sequence with |X_i| ≤ B for all i = 1, 2, … Then P(Σ_{i=1}^n X_i ≥ t) ≤ exp(−2t²/(nB²)) (30) and P(Σ_{i=1}^n X_i ≤ −t) ≤ exp(−2t²/(nB²)). (31) Letting δ = t/n: P((1/n) Σ_{i=1}^n X_i ≥ δ) ≤ exp(−2nδ²/B²). For X_1, X_2, … i.i.d. with EX_i = µ: P(|(1/n) Σ_{i=1}^n X_i − µ| ≥ δ) ≤ 2 exp(−2nδ²/B²).
30 Azuma-Hoeffding Inequality 30/84 Theorem 5 (restated): let α_k > 0 be a non-increasing sequence of stepsizes and let the preceding assumptions hold. Let x̄_K = (1/K) Σ_{k=1}^K x_k. Then E[f(x̄_K) − f(x*)] ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k. (32) Theorem 6 (restated): in addition to the conditions of Theorem 5, assume that ‖g‖₂ ≤ M for all stochastic subgradients g. Then for any ɛ > 0, f(x̄_K) − f(x*) ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + RMɛ/√K (33) with probability at least 1 − e^{−ɛ²/2}.
31 Proof of Theorem 6 31/84 Let ∇f(x_k) = E[g_k | x_k] and ξ_k = g_k − ∇f(x_k). Then (1/2)‖x_{k+1} − x*‖₂² ≤ (1/2)‖x_k − α_k g_k − x*‖₂² = (1/2)‖x_k − x*‖₂² − α_k⟨g_k, x_k − x*⟩ + (α_k²/2)‖g_k‖₂² ≤ (1/2)‖x_k − x*‖₂² − α_k(f(x_k) − f(x*)) + (α_k²/2)‖g_k‖₂² + α_k⟨ξ_k, x* − x_k⟩. Dividing by α_k, f(x_k) − f(x*) ≤ (1/(2α_k))‖x_k − x*‖₂² − (1/(2α_k))‖x_{k+1} − x*‖₂² + (α_k/2)‖g_k‖₂² + ⟨ξ_k, x* − x_k⟩.
32 Proof of Theorem 6 32/84 Averaging, f(x̄_K) − f(x*) ≤ (1/K) Σ_{k=1}^K (f(x_k) − f(x*)) ≤ (1/(2Kα_K))‖x_1 − x*‖₂² + (1/(2K)) Σ_{k=1}^K α_k ‖g_k‖₂² + (1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + (1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩. Let ω = R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k. Then P(f(x̄_K) − f(x*) ≥ ω + t) ≤ P((1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≥ t).
33 Proof of Theorem 6 33/84 ⟨ξ_k, x* − x_k⟩ is a bounded martingale difference sequence with Z_k = (x_1, …, x_{k+1}): since E[ξ_k | Z_{k−1}] = 0 and x_k is a function of Z_{k−1}, we have E[⟨ξ_k, x* − x_k⟩ | Z_{k−1}] = 0. Since ‖ξ_k‖₂ = ‖g_k − ∇f(x_k)‖₂ ≤ 2M, |⟨ξ_k, x* − x_k⟩| ≤ ‖ξ_k‖₂ ‖x* − x_k‖₂ ≤ 2MR. By the Azuma-Hoeffding inequality, P(Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≥ t) ≤ exp(−t²/(2KM²R²)). Substituting t = MR√K ɛ: P((1/K) Σ_{k=1}^K ⟨ξ_k, x* − x_k⟩ ≥ MRɛ/√K) ≤ exp(−ɛ²/2).
34 Adaptive stepsizes 34/84 Choose an appropriate metric and associated distance-generating function h. It may be advantageous to adapt the metric being used, or at least the stepsizes, to achieve faster convergence guarantees. A simple scheme: h(x) = (1/2) xᵀAx, where A may change depending on information observed during the solution of the problem.
35 35/84 Adaptive stepsizes. Recall the bound E[f(x̄_K) − f(x*)] ≤ E[ R²/(Kα_K) + (1/(2K)) Σ_{k=1}^K α_k ‖g_k‖₂² ]. (34) Taking α_k = R/√(Σ_{i=1}^k ‖g_i‖₂²), E[f(x̄_K) − f(x*)] ≤ (2R/K) E[(Σ_{k=1}^K ‖g_k‖₂²)^{1/2}]. (35) If E[‖g_k‖₂²] ≤ M² for all k, then E[(Σ_{k=1}^K ‖g_k‖₂²)^{1/2}] ≤ (Σ_{k=1}^K E‖g_k‖₂²)^{1/2} ≤ √(M²K) = M√K. (36)
36 Variable metric methods 36/84 Variable metric update: x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + (1/2)⟨x − x_k, H_k(x − x_k)⟩ }. Projected subgradient method: H_k = (1/α_k) I. Newton method: H_k = ∇²f(x_k). AdaGrad: H_k = (1/α) diag(Σ_{i=1}^k g_i ∘ g_i)^{1/2}.
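The AdaGrad choice of H_k amounts to a per-coordinate stepsize α/√(Σ_{i≤k} g²_{i,j}). A sketch on a badly scaled unconstrained toy quadratic (the matrix A, the value of α, and the ɛ regularizer are our illustrative choices):

```python
import numpy as np

# AdaGrad: per-coordinate steps alpha / sqrt(running sum of squared gradients).
A = np.diag([100.0, 1.0])                # f(x) = 0.5 x^T A x, minimizer x* = 0
x = np.array([1.0, 1.0])
G = np.zeros(2)                          # running sum of g o g
alpha, eps = 0.5, 1e-8

for _ in range(500):
    g = A @ x                            # gradient of f
    G += g * g
    x = x - alpha * g / (np.sqrt(G) + eps)
print(x)                                 # both coordinates near 0 despite scaling
```

Note how the accumulated G automatically equalizes progress across the coordinate with curvature 100 and the one with curvature 1, which a single scalar stepsize cannot do.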
37 37/84 Variable metric methods. Theorem 9 (convergence of variable metric methods): let H_k ≻ 0 be a sequence of positive definite matrices, where H_k is a function of g_1, …, g_k, and let g_k be stochastic subgradients with E[g_k | x_k] ∈ ∂f(x_k). Then E[Σ_{k=1}^K (f(x_k) − f(x*))] ≤ (1/2) E[Σ_{k=2}^K (‖x_k − x*‖²_{H_k} − ‖x_k − x*‖²_{H_{k−1}})] + (1/2) E[‖x_1 − x*‖²_{H_1} + Σ_{k=1}^K ‖g_k‖²_{H_k^{−1}}]. The key one-step bound is E[f(x_k) − f(x*)] ≤ (1/2) E[‖x_k − x*‖²_{H_k} − ‖x_{k+1} − x*‖²_{H_k} + ‖g_k‖²_{H_k^{−1}}].
38 Proof of Theorem 9 38/84 By non-expansiveness of π_C under the norm ‖x‖²_{H_k} = ⟨x, H_k x⟩, and defining ξ_k = g_k − ∇f(x_k): (1/2)‖x_{k+1} − x*‖²_{H_k} ≤ (1/2)‖x_k − H_k^{−1} g_k − x*‖²_{H_k} = (1/2)‖x_k − x*‖²_{H_k} − ⟨g_k, x_k − x*⟩ + (1/2)‖g_k‖²_{H_k^{−1}} = (1/2)‖x_k − x*‖²_{H_k} − ⟨∇f(x_k), x_k − x*⟩ + (1/2)‖g_k‖²_{H_k^{−1}} + ⟨ξ_k, x* − x_k⟩ ≤ (1/2)‖x_k − x*‖²_{H_k} − (f(x_k) − f(x*)) + (1/2)‖g_k‖²_{H_k^{−1}} + ⟨ξ_k, x* − x_k⟩. Taking expectations, E[f(x_k) − f(x*)] ≤ (1/2) E[‖x_k − x*‖²_{H_k} − ‖x_{k+1} − x*‖²_{H_k} + ‖g_k‖²_{H_k^{−1}}].
39 Variable metric methods 39/84 Corollary (convergence of AdaGrad): let R∞ = sup_{x∈C} ‖x − x*‖∞ and let the conditions of Theorem 9 hold. Then E[Σ_{k=1}^K (f(x_k) − f(x*))] ≤ (1/(2α)) R∞² E[tr(M_K)] + α E[tr(M_K)], (37) where M_k = diag(Σ_{i=1}^k g_i ∘ g_i)^{1/2} and H_k = (1/α) M_k. Let α = R∞. Then E[f(x̄_K) − f(x*)] ≤ (3/(2K)) R∞ E[tr(M_K)] = (3/(2K)) R∞ Σ_{j=1}^n E[(Σ_{k=1}^K g²_{k,j})^{1/2}]. If C = {x : ‖x‖∞ ≤ 1}, this bound is lower than the one for the adaptive scalar stepsize.
40 Proof of Corollary 40/84 Aim: ‖x_k − x*‖²_{H_k} − ‖x_k − x*‖²_{H_{k−1}} ≤ ‖x_k − x*‖∞² tr(H_k − H_{k−1}). Let z = x* − x_k. Then ‖z‖²_{H_k} − ‖z‖²_{H_{k−1}} = Σ_{j=1}^n H_{k,j} z_j² − Σ_{j=1}^n H_{k−1,j} z_j² = Σ_{j=1}^n (H_{k,j} − H_{k−1,j}) z_j² ≤ ‖z‖∞² Σ_{j=1}^n (H_{k,j} − H_{k−1,j}) = ‖z‖∞² tr(H_k − H_{k−1}).
41 Proof of Corollary 41/84 For a = (a_1, a_2, …, a_T), a simple inequality (proved by induction): Σ_{t=1}^T a_t² / √(Σ_{i=1}^t a_i²) ≤ 2 √(Σ_{t=1}^T a_t²). Aim: Σ_{k=1}^K ‖g_k‖²_{H_k^{−1}} ≤ 2α tr(M_K). Indeed, Σ_{k=1}^K ‖g_k‖²_{H_k^{−1}} = α Σ_{k=1}^K Σ_{j=1}^n g²_{k,j} / M_{k,j} = α Σ_{j=1}^n Σ_{k=1}^K g²_{k,j} / √(Σ_{i=1}^k g²_{i,j}) ≤ 2α Σ_{j=1}^n √(Σ_{i=1}^K g²_{i,j}) = 2α tr(M_K). (38)
42 Optimality Guarantees 42/84 Abstract description: a collection F of convex functions f : R^n → R; a closed convex set C ⊆ R^n over which we optimize; a collection G of stochastic gradient oracles, each consisting of a sample space S, a gradient mapping g : R^n × S × F → R^n, and (implicitly) a probability distribution P on S. The stochastic gradient oracle may be queried at a point x; when queried, it draws s ∼ P and has the property that E[g(x, s, f)] ∈ ∂f(x). Minimax risk: M(C, G, F) := inf_{x̂_K} sup_{g∈G} sup_{f∈F} E[f(x̂_K(g_1, …, g_K)) − inf_{x∈C} f(x)].
43 Optimality Guarantees 43/84
44 Optimality Guarantees 44/84 Theorem 10: let C ⊆ R^n be a convex set containing an ℓ₂ ball of radius R, and let G denote the collection of distributions generating stochastic subgradients with ‖g‖₂ ≤ M with probability 1. Then M(C, G, F) ≥ RM/√(96K). Theorem 11: let G and the stochastic gradient oracle be as above, and assume C ⊇ [−R, R]^n. Then M(C, G, F) ≥ RM √n min{1/5, 1/√(96K)}.
45 Summary 45/84 Convergence in expectation: E[f(x̄_K) − f(x*)] ≤ 3RM/(2√K). Convergence in probability: f(x̄_K) − f(x*) ≤ 3RM/(2√K) + RM√(2 log(1/δ)/K) with probability at least 1 − δ. The optimal order: M(C, G, F) ≥ RM/√(96K). Using a proper metric and an adaptive strategy can improve the convergence: the mirror descent method and AdaGrad.
46 Outline 46/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
47 Gradient methods 47/84 Rewrite the ERM problem as min_{x∈R^n} f(x) := (1/n) Σ_{i=1}^n f_i(x). (39) Gradient method: x_{k+1} = x_k − α_k ∇f(x_k). (40) The update is equivalent to x_{k+1} = argmin_x { f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }. (41)
48 48/84 Basic properties. We only consider convex differentiable functions. Convex functions: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all λ ∈ [0, 1] and all x, y. M-Lipschitz functions: |f(x) − f(y)| ≤ M‖x − y‖₂. L-smooth functions: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂. µ-strongly convex functions: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) − (µ/2)λ(1−λ)‖x − y‖₂² for all λ ∈ [0, 1] and all x, y.
49 Some useful results 49/84 Convex functions: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩. M-Lipschitz functions: ‖∇f(x)‖₂ ≤ M. L-smooth functions: (L/2)xᵀx − f(x) is convex; f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖₂²; (1/(2L))‖∇f(x)‖₂² ≤ f(x) − f(x*) ≤ (L/2)‖x − x*‖₂². µ-strongly convex functions: f(x) − (µ/2)xᵀx is convex; f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖x − y‖₂².
50 Co-coercivity of gradient 50/84 If f is convex with dom f = R^n and (L/2)xᵀx − f(x) is convex, then (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖₂² for all x, y. Proof: define convex functions f_x, f_y with domain R^n: f_x(z) = f(z) − ∇f(x)ᵀz, f_y(z) = f(z) − ∇f(y)ᵀz; the functions (L/2)zᵀz − f_x(z) and (L/2)zᵀz − f_y(z) are convex. z = x minimizes f_x(z); from the left-hand inequality, f(y) − f(x) − ∇f(x)ᵀ(y − x) = f_x(y) − f_x(x) ≥ (1/(2L))‖∇f_x(y)‖₂² = (1/(2L))‖∇f(y) − ∇f(x)‖₂². Similarly, z = y minimizes f_y(z); therefore f(x) − f(y) − ∇f(y)ᵀ(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖₂². Combining the two inequalities shows co-coercivity.
51 Extension of co-coercivity 51/84 If f is µ-strongly convex and ∇f is L-Lipschitz continuous, then g(x) = f(x) − (µ/2)‖x‖₂² is convex and ∇g is Lipschitz continuous with parameter L − µ. Co-coercivity of g gives, for all x, y ∈ dom f, (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (µL/(µ + L))‖x − y‖₂² + (1/(µ + L))‖∇f(x) − ∇f(y)‖₂².
52 52/84 Convergence guarantees. Assumption: f is L-smooth and µ-strongly convex. Lemma (co-coercivity of gradients): ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (Lµ/(L + µ))‖x − y‖₂² + (1/(L + µ))‖∇f(x) − ∇f(y)‖². (42) Theorem (convergence rate of GD): let α_k = 2/(L + µ) and let κ = L/µ. Define Δ_k = ‖x_k − x*‖₂. Then f(x_{T+1}) − f(x*) ≤ (L/2) Δ₁² exp(−4T/(κ + 1)). (43)
53 Proof of Theorem 53/84 Δ²_{k+1} = ‖x_{k+1} − x*‖₂² = ‖x_k − α_k ∇f(x_k) − x*‖₂² = Δ²_k − 2α_k⟨∇f(x_k), x_k − x*⟩ + α_k²‖∇f(x_k)‖₂². By the lemma (with ∇f(x*) = 0), Δ²_{k+1} ≤ Δ²_k − 2α_k((Lµ/(L+µ))Δ²_k + (1/(L+µ))‖∇f(x_k)‖²) + α_k²‖∇f(x_k)‖₂² = (1 − 2α_k Lµ/(L+µ)) Δ²_k + (α_k² − 2α_k/(L+µ))‖∇f(x_k)‖₂² ≤ (1 − 2α_k Lµ/(L+µ)) Δ²_k + (α_k² − 2α_k/(L+µ)) L² Δ²_k. (44)
54 Proof of Theorem 54/84 With α_k = 2/(L+µ): Δ²_{k+1} ≤ (1 − 4Lµ/(L+µ)²) Δ²_k = ((L−µ)/(L+µ))² Δ²_k = ((κ−1)/(κ+1))² Δ²_k. Hence Δ²_{T+1} ≤ ((κ−1)/(κ+1))^{2T} Δ²_1 = Δ²_1 exp(2T log(1 − 2/(κ+1))) ≤ Δ²_1 exp(−4T/(κ+1)), and f(x_{T+1}) − f(x*) ≤ (L/2) Δ²_{T+1} ≤ (L/2) Δ²_1 exp(−4T/(κ+1)).
55 Stochastic Gradient methods 55/84 ERM problem: min_{x∈R^n} f(x) := (1/n) Σ_{i=1}^n f_i(x). (45) Gradient descent: x_{k+1} = x_k − α_k ∇f(x_k). (46) Stochastic gradient descent: x_{k+1} = x_k − α_k ∇f_{s_k}(x_k), (47) where s_k is uniformly sampled from {1, …, n}.
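A sketch of the iteration (47) on a synthetic least-squares finite sum. The data, the noise level, and the fixed stepsize are illustrative assumptions of ours; with a fixed stepsize the iterates settle in a noise ball around x* rather than converging exactly, matching the αM²/(2µ) term in the theorem below.

```python
import numpy as np

# SGD on the finite sum f(x) = (1/2n) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 500, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of f

x = np.zeros(d)
alpha = 0.05                                    # fixed stepsize
for k in range(20000):
    i = rng.integers(n)                         # sample one component
    g = (A[i] @ x - b[i]) * A[i]                # gradient of f_i at x
    x -= alpha * g
print(np.linalg.norm(x - x_star))               # small, but a noise floor remains
```

Shrinking α shrinks the floor at the cost of a slower transient, which is exactly the trade-off the diminishing-stepsize theorem resolves.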
56 56/84 Convergence guarantees. Assumption: f(x) is L-smooth: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂; f(x) is µ-strongly convex: ⟨∇f(x) − ∇f(y), x − y⟩ ≥ µ‖x − y‖₂²; E_s[∇f_s(x)] = ∇f(x); E_s‖∇f_s(x)‖² ≤ M². Theorem (convergence rate of SGD): define Δ_k = ‖x_k − x*‖₂. For a fixed stepsize α_k = α with 0 < α < 1/(2µ), we have E[f(x_{T+1}) − f(x*)] ≤ (L/2) E[Δ²_{T+1}] ≤ (L/2)[(1 − 2αµ)^T Δ₁² + αM²/(2µ)]. (48)
57 Proof of Theorem 57/84 Δ²_{k+1} = ‖x_{k+1} − x*‖₂² = ‖x_k − α_k ∇f_{s_k}(x_k) − x*‖₂² = Δ²_k − 2α_k⟨∇f_{s_k}(x_k), x_k − x*⟩ + α_k²‖∇f_{s_k}(x_k)‖₂². Using E[X] = E[E[X | Y]]: E_{s_1,…,s_k}[⟨∇f_{s_k}(x_k), x_k − x*⟩] = E_{s_1,…,s_{k−1}}[E_{s_k}[⟨∇f_{s_k}(x_k), x_k − x*⟩]] = E_{s_1,…,s_{k−1}}[⟨E_{s_k}[∇f_{s_k}(x_k)], x_k − x*⟩] = E_{s_1,…,s_k}[⟨∇f(x_k), x_k − x*⟩]. By strong convexity, E_{s_1,…,s_k}(Δ²_{k+1}) ≤ (1 − 2αµ) E_{s_1,…,s_k}(Δ²_k) + α²M². (49)
58 58/84 Proof of Theorem. Applying the bound recursively from k = 1 to k = T, we have E_{s_1,…,s_T}(Δ²_{T+1}) ≤ (1 − 2αµ)^T Δ²_1 + Σ_{i=0}^{T−1} (1 − 2αµ)^i α²M². (50) Under the assumption that 0 ≤ 2αµ ≤ 1, we have Σ_{i=0}^∞ (1 − 2αµ)^i = 1/(2αµ). Then E_{s_1,…,s_T}(Δ²_{T+1}) ≤ (1 − 2αµ)^T Δ²_1 + αM²/(2µ). (51)
59 59/84 Convergence guarantees. For a fixed stepsize, we do not have convergence to the solution, only to a neighborhood of it. For a diminishing stepsize, the order of convergence is O(1/T). Theorem (convergence rate of SGD): define Δ_k = ‖x_k − x*‖₂. For a diminishing stepsize α_k = β/(k + γ) for some β > 1/(2µ) and γ > 0 such that α_1 ≤ 1/(2µ), we have, for any T ≥ 1, E[f(x_T) − f(x*)] ≤ (L/2) E[Δ²_T] ≤ (L/2) · v/(γ + T), (52) where v = max( β²M²/(2βµ − 1), (γ + 1) Δ²_1 ).
60 Proof of Theorem 60/84 Recall the bound E(Δ²_{k+1}) ≤ (1 − 2α_kµ) E(Δ²_k) + α_k²M². (53) We prove the claim by induction. The definition of v ensures that it holds for k = 1. Assume it holds for some k; with k̂ := γ + k it follows that E(Δ²_{k+1}) ≤ (1 − 2βµ/k̂)(v/k̂) + β²M²/k̂² = ((k̂ − 1)/k̂²) v − ((2βµ − 1)/k̂²) v + β²M²/k̂² ≤ ((k̂ − 1)/k̂²) v ≤ v/(k̂ + 1), where the second inequality uses v ≥ β²M²/(2βµ − 1).
61 61/84 Stochastic optimization: stochastic subgradient descent O(1/ɛ²); stochastic gradient descent with strong convexity O(1/ɛ); stochastic gradient descent with strong convexity and smoothness O(1/ɛ). Deterministic optimization: subgradient descent O(n/ɛ²); gradient descent with strong convexity O(n/ɛ); gradient descent with strong convexity and smoothness O(n log(1/ɛ)). The complexity refers to the number of component (sub)gradient evaluations: we need to compute n gradients in every iteration of GD and one gradient in SGD.
62 Outline 62/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
63 63/84 Variance Reduction. Assumption: f(x) is L-smooth: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂; f(x) is µ-strongly convex: ⟨∇f(x) − ∇f(y), x − y⟩ ≥ µ‖x − y‖₂²; E_s[∇f_s(x)] = ∇f(x); E_s‖∇f_s(x)‖² ≤ M². GD: linear convergence, complexity O(n log(1/ɛ)). SGD: sublinear convergence, complexity O(1/ɛ). What is the essential difference between SGD and GD?
64 Variance Reduction 64/84 GD: Δ²_{k+1} = ‖x_{k+1} − x*‖₂² = ‖x_k − α∇f(x_k) − x*‖₂² = Δ²_k − 2α⟨∇f(x_k), x_k − x*⟩ + α²‖∇f(x_k)‖₂² ≤ (1 − 2αµ)Δ²_k + α²‖∇f(x_k)‖₂² (µ-strong convexity) ≤ (1 − 2αµ + α²L²)Δ²_k (L-smoothness). SGD: EΔ²_{k+1} = E‖x_{k+1} − x*‖₂² = E‖x_k − α∇f_{s_k}(x_k) − x*‖₂² = EΔ²_k − 2αE⟨∇f_{s_k}(x_k), x_k − x*⟩ + α²E‖∇f_{s_k}(x_k)‖₂² = EΔ²_k − 2αE⟨∇f(x_k), x_k − x*⟩ + α²E‖∇f_{s_k}(x_k)‖₂² ≤ (1 − 2αµ)EΔ²_k + α²E‖∇f_{s_k}(x_k) − ∇f(x_k) + ∇f(x_k)‖₂² ≤ (1 − 2αµ + 2α²L²)EΔ²_k + 2α²E‖∇f_{s_k}(x_k) − ∇f(x_k)‖₂².
65 Variance Reduction 65/84 EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²) EΔ²_k [term A] + 2α² E‖∇f_{s_k}(x_k) − ∇f(x_k)‖₂² [term B]. (54) This yields a worst-case convergence rate of 1/T for SGD. In practice, the actual convergence rate may be somewhat better than this bound: initially B ≪ A and we observe a linear-rate regime; once B > A we observe the 1/T rate. How can we reduce the variance term B to speed up SGD? SAG (stochastic average gradient), SAGA, and SVRG (stochastic variance reduced gradient).
66 SAG method 66/84 SAG method (Le Roux, Schmidt, Bach 2012): x_{k+1} = x_k − (α_k/n) Σ_{i=1}^n g^i_k = x_k − α_k ( (1/n)(∇f_{s_k}(x_k) − g^{s_k}_{k−1}) + (1/n) Σ_{i=1}^n g^i_{k−1} ), (55) where g^i_k = ∇f_i(x_k) if i = s_k and g^i_k = g^i_{k−1} otherwise, (56) and s_k is uniformly sampled from {1, …, n}. Complexity (number of component gradient evaluations): O(max{n, L/µ} log(1/ɛ)). It needs to store the most recent gradient of each component. SAGA (Defazio, Bach, Lacoste-Julien 2014) is an unbiased revision of SAG: x_{k+1} = x_k − α_k ( ∇f_{s_k}(x_k) − g^{s_k}_{k−1} + (1/n) Σ_{i=1}^n g^i_{k−1} ). (57)
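SAGA as in (57) can be sketched with an explicit table of stored component gradients. The synthetic least-squares data are our toy choice, and α = 1/(3 L_max) is a common heuristic stepsize, an assumption on our part rather than a value from the slides.

```python
import numpy as np

# SAGA with a gradient table for f(x) = (1/2n) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

table = np.zeros((n, d))                 # g^i: last stored gradient of f_i
g_avg = table.mean(axis=0)
x = np.zeros(d)
alpha = 1.0 / (3 * (A * A).sum(axis=1).max())

for k in range(20000):
    i = rng.integers(n)
    g_new = (A[i] @ x - b[i]) * A[i]
    v = g_new - table[i] + g_avg         # unbiased: E[v | x] = grad f(x)
    x -= alpha * v
    g_avg += (g_new - table[i]) / n      # keep the running average consistent
    table[i] = g_new
print(np.linalg.norm(x - x_star))        # linear convergence to x*
```

Maintaining g_avg incrementally keeps the per-iteration cost at one component gradient, at the price of O(nd) memory for the table, exactly the storage caveat noted above.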
67 67/84 SVRG method. SVRG method (Johnson and Zhang 2013): v_k = ∇f_{s_k}(x_k) − ∇f_{s_k}(y) + ∇f(y), x_{k+1} = x_k − α_k v_k, where y is a snapshot point and s_k is uniformly sampled from {1, …, n}. v_k is an unbiased estimate of the gradient ∇f(x_k): Ev_k = ∇f(x_k) − ∇f(y) + ∇f(y) = ∇f(x_k). (58) Recall the bound EΔ²_{k+1} ≤ (1 − 2αµ) EΔ²_k + α² E‖v_k‖₂². (59)
68 SVRG method 68/84 Additional assumption: L-smoothness of the component functions, ‖∇f_i(x) − ∇f_i(y)‖₂ ≤ L‖x − y‖₂. (60) Let's analyze the "variance": E‖v_k‖₂² = E‖∇f_{s_k}(x_k) − ∇f_{s_k}(y) + ∇f(y)‖₂² = E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*) + ∇f_{s_k}(x*) − ∇f_{s_k}(y) + ∇f(y)‖₂² ≤ 2E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*)‖₂² + 2E‖∇f_{s_k}(y) − ∇f(y) − ∇f_{s_k}(x*)‖₂² ≤ 2L²EΔ²_k + 2E‖∇f_{s_k}(y) − ∇f_{s_k}(x*)‖₂² ≤ 2L²EΔ²_k + 2L²E‖y − x*‖². If x_k and y are close to x*, the variance is small.
69 SVRG method 69/84 We only need to choose a recent iterate as y. Picking a fresh y more often should decrease the variance; however, doing this too often involves computing too many full gradients. Let's set y = x_1. Then EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²) EΔ²_k + 2α²L² EΔ²_1. (61) Unrolling this: EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²)^k EΔ²_1 + Σ_{i=0}^{k−1} (1 − 2αµ + 2α²L²)^i · 2α²L² EΔ²_1 ≤ (1 − 2αµ + 2α²L²)^k EΔ²_1 + 2kα²L² EΔ²_1. (62)
70 SVRG method 70/84 Unrolling this: EΔ²_{k+1} ≤ (1 − 2αµ + 2α²L²)^k EΔ²_1 + 2kα²L² EΔ²_1. (63) Suppose we would like this to be 0.5 EΔ²_1 after T iterations. If we pick α = O(1) µ/L², then it turns out that we can set T = O(1) L²/µ². In fact, this can be improved to T = O(1) L/µ, where κ = L/µ is the condition number.
71 SVRG method 71/84 Algorithm 2: SVRG method. Input: x̃_0, α, m. for e = 1 : E do: y ← x̃_{e−1}, x_1 ← x̃_{e−1}; g ← ∇f(y) (full gradient); for k = 1 : m do: pick s_k ∈ {1, …, n} uniformly at random; v_k = ∇f_{s_k}(x_k) − ∇f_{s_k}(y) + g; x_{k+1} = x_k − α v_k; end for; x̃_e ← (1/m) Σ_{k=1}^m x_k; end for
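Algorithm 2 translates almost line for line into code. The sketch below uses least-squares components; the number of stages E, the inner length m, and the stepsize α are our own illustrative choices rather than the tuned constants of the convergence theorem.

```python
import numpy as np

# SVRG (Algorithm 2) on f(x) = (1/2n) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

L = (A * A).sum(axis=1).max()            # smoothness of the components
alpha, m, E = 1.0 / (10 * L), 5 * n, 30

x_tilde = np.zeros(d)
for e in range(E):                       # outer stages
    y = x_tilde.copy()                   # snapshot point
    g = A.T @ (A @ y - b) / n            # full gradient at the snapshot
    x, acc = y.copy(), np.zeros(d)
    for k in range(m):                   # inner loop
        i = rng.integers(n)
        v = (A[i] @ x - b[i]) * A[i] - (A[i] @ y - b[i]) * A[i] + g
        x -= alpha * v                   # variance-reduced step
        acc += x
    x_tilde = acc / m                    # average of the inner iterates
print(np.linalg.norm(x_tilde - x_star))
```

Each stage costs one full gradient plus m component gradients, matching the O((n + κ) log(1/ɛ)) accounting in the summary.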
72 72/84 SVRG method. Convergence of SVRG: suppose 0 < α ≤ 1/(2L) and m is sufficiently large so that ρ = 1/(µα(1 − 2Lα)m) + 2Lα/(1 − 2Lα) < 1. (64) Then we have linear convergence in expectation: E f(x̃_s) − f(x*) ≤ ρ^s [f(x̃_0) − f(x*)]. (65) If α = θ/L, then ρ = (L/µ)/(θ(1 − 2θ)m) + 2θ/(1 − 2θ). (66) Choosing θ = 0.1 and m = 50(L/µ) results in ρ ≈ 0.5. Overall complexity: O((L/µ + n) log(1/ɛ)).
73 Proof of theorem 73/84 EΔ²_{k+1} = E‖x_{k+1} − x*‖₂² = E‖x_k − αv_k − x*‖₂² = EΔ²_k − 2αE⟨v_k, x_k − x*⟩ + α²E‖v_k‖₂² = EΔ²_k − 2αE⟨∇f(x_k), x_k − x*⟩ + α²E‖v_k‖₂² ≤ EΔ²_k − 2αE(f(x_k) − f(x*)) + α²E‖v_k‖₂². By smoothness of f_i(x), ‖∇f_i(x) − ∇f_i(x*)‖² ≤ 2L[f_i(x) − f_i(x*) − ∇f_i(x*)ᵀ(x − x*)]. (67) Summing these inequalities over i = 1, 2, …, n and using ∇f(x*) = 0: (1/n) Σ_{i=1}^n ‖∇f_i(x) − ∇f_i(x*)‖² ≤ 2L[f(x) − f(x*)]. (68)
74 Proof of theorem 74/84 Using E[‖X − E[X]‖²] ≤ E[‖X‖²], we obtain the bound E_s‖v_k‖₂² = E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*) + ∇f_{s_k}(x*) − ∇f_{s_k}(y) + ∇f(y)‖₂² ≤ 2E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*)‖₂² + 2E‖∇f_{s_k}(y) − ∇f_{s_k}(x*) − E[∇f_{s_k}(y) − ∇f_{s_k}(x*)]‖₂² ≤ 2E‖∇f_{s_k}(x_k) − ∇f_{s_k}(x*)‖₂² + 2E‖∇f_{s_k}(y) − ∇f_{s_k}(x*)‖² ≤ 4L[f(x_k) − f(x*) + f(y) − f(x*)]. Now continue the derivation: EΔ²_{k+1} ≤ EΔ²_k − 2αE(f(x_k) − f(x*)) + α²E‖v_k‖₂² ≤ EΔ²_k − 2α(1 − 2αL)E(f(x_k) − f(x*)) + 4Lα²[f(y) − f(x*)].
75 Proof of theorem 75/84 Summing over k = 1, …, m (note that y = x̃_{e−1} and x̃_e = (1/m) Σ_{k=1}^m x_k): EΔ²_{m+1} + 2α(1 − 2αL) Σ_{k=1}^m E(f(x_k) − f(x*)) ≤ E‖x̃_{e−1} − x*‖² + 4Lα²m E[f(x̃_{e−1}) − f(x*)] ≤ (2/µ) E[f(x̃_{e−1}) − f(x*)] + 4Lα²m E[f(x̃_{e−1}) − f(x*)]. Therefore, for each stage e, E[f(x̃_e) − f(x*)] ≤ (1/m) Σ_{k=1}^m E(f(x_k) − f(x*)) ≤ (1/(2α(1 − 2αL)m)) ((2/µ) + 4mLα²) E[f(x̃_{e−1}) − f(x*)]. (69)
76 Summary 76/84 Condition number: κ = L/µ. SVRG: E ∼ log(1/ɛ), so the complexity is O((n + κ) log(1/ɛ)). GD: T ∼ κ log(1/ɛ), so the complexity is O(nκ log(1/ɛ)). SGD: T ∼ κ/ɛ, so the complexity is O(κ/ɛ). Even though we allow ourselves a few full-gradient computations in SVRG, we don't really pay too much in terms of complexity.
77 Outline 77/84 1 Problem Description 2 Subgradient Methods: The gradient and subgradient methods; Stochastic subgradient methods 3 Stochastic Gradient Descent: Gradient Descent; Stochastic Gradient methods 4 Variance Reduction: SAG method and SAGA method; SVRG method 5 Stochastic Algorithms in Deep Learning
78 78/84 Stochastic Algorithms in Deep Learning. Consider the problem min_{x∈R^d} (1/n) Σ_{i=1}^n f_i(x). References: chapter 8 in … Gradient descent: x_{t+1} = x_t − (α_t/n) Σ_{i=1}^n ∇f_i(x_t). Stochastic gradient descent: x_{t+1} = x_t − α_t ∇f_i(x_t). SGD with momentum: v_{t+1} = µ_t v_t − α_t ∇f_i(x_t), x_{t+1} = x_t + v_{t+1}.
79 79/84 Stochastic Algorithms in Deep Learning. Nesterov accelerated gradient (original version): v_{t+1} = (1 + µ_t)x_t − µ_t x_{t−1}, x_{t+1} = v_{t+1} − α_t ∇f_i(v_{t+1}), where µ_t = (t + 2)/(t + 5) and α_t is fixed or determined by line search (inverse of the Lipschitz constant). Nesterov accelerated gradient (momentum version): v_{t+1} = µ_t v_t − α_t ∇f_i(x_t + µ_t v_t), x_{t+1} = x_t + v_{t+1}, with the same µ_t and α_t.
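The momentum update from the previous slide can be sketched as follows on toy least-squares components. The constant µ_t = 0.9, the stepsize, and the data are our illustrative choices rather than the t-dependent schedule above.

```python
import numpy as np

# SGD with momentum: v_{t+1} = mu v_t - alpha grad f_i(x_t); x_{t+1} = x_t + v_{t+1}.
rng = np.random.default_rng(0)
n, d = 100, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

x, v = np.zeros(d), np.zeros(d)
alpha, mu = 0.005, 0.9
for t in range(20000):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]         # stochastic gradient of f_i
    v = mu * v - alpha * g               # velocity update
    x = x + v                            # position update
print(np.linalg.norm(x - x_star))
```

With µ = 0.9 the effective stepsize is roughly α/(1 − µ), ten times the raw α, which is why momentum is typically paired with a smaller learning rate.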
80 Stochastic Algorithms in Deep Learning 80/84 Adaptive Subgradient Method (Adagrad): let g_t = ∇f_i(x_t), g²_t = diag[g_t g_tᵀ] ∈ R^d, and initialize G_1 = g²_1. At step t: x_{t+1} = x_t − (α_t / (√(G_t) + ɛ1_d)) ∘ ∇f_i(x_t), G_{t+1} = G_t + g²_{t+1}. Here and in the following slides, vector-vector multiplications, divisions, and square roots are element-wise.
81 Stochastic Algorithms in Deep Learning 81/84 Adadelta: denote the exponentially decaying averages E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, RMS[g]_t = √(E[g²]_t + ɛ1_d) ∈ R^d, and E[Δx²]_t = ρE[Δx²]_{t−1} + (1−ρ)Δx²_t, RMS[Δx]_t = √(E[Δx²]_t + ɛ1_d) ∈ R^d. Initial parameters: x_1, E[g²]_0 = 0, E[Δx²]_0 = 0, decay rate ρ, constant ɛ. At step t, compute g_t = ∇f_i(x_t) and E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, Δx_t = −(RMS[Δx]_{t−1} / RMS[g]_t) ∘ g_t, E[Δx²]_t = ρE[Δx²]_{t−1} + (1−ρ)Δx²_t, x_{t+1} = x_t + Δx_t.
82 Stochastic Algorithms in Deep Learning 82/84 RMSprop: initialize E[g²]_0 = 0. At step t: E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, x_{t+1} = x_t − (α_t / √(E[g²]_t + ɛ1_d)) ∘ ∇f_i(x_t). Here ρ ≈ 0.9 and a small constant α_t is a good choice.
83 Stochastic Algorithms in Deep Learning 83/84 Adam: initialize E[g²]_0 = 0, E[g]_0 = 0. At step t: E[g]_t = µE[g]_{t−1} + (1−µ)g_t, E[g²]_t = ρE[g²]_{t−1} + (1−ρ)g²_t, Ê[g]_t = E[g]_t / (1 − µ^t), Ê[g²]_t = E[g²]_t / (1 − ρ^t), x_{t+1} = x_t − α Ê[g]_t / (√(Ê[g²]_t) + ɛ1_d). Here ρ, µ are decay rates and α is the learning rate.
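The Adam update transcribes directly into code. The toy least-squares objective is our choice, and the hyperparameter values below are the commonly used defaults, an assumption on our part; with a fixed α the method reaches a neighborhood of x* rather than the exact minimizer.

```python
import numpy as np

# Adam, following the update on this slide, on a toy least-squares objective.
rng = np.random.default_rng(0)
n, d = 100, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

x = np.zeros(d)
m1, m2 = np.zeros(d), np.zeros(d)        # E[g]_t and E[g^2]_t
alpha, mu, rho, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 20001):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]
    m1 = mu * m1 + (1 - mu) * g          # first-moment EMA
    m2 = rho * m2 + (1 - rho) * g * g    # second-moment EMA
    m1_hat = m1 / (1 - mu**t)            # bias correction
    m2_hat = m2 / (1 - rho**t)
    x = x - alpha * m1_hat / (np.sqrt(m2_hat) + eps)
print(np.linalg.norm(x - x_star))        # a neighborhood of x*, not exact
```

The bias-correction factors matter mostly in the first few hundred steps, when the EMAs are still dominated by their zero initialization.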
84 84/84 Optimization algorithms in Deep Learning. Stochastic gradient-type algorithms. Algorithms implemented in Caffe: adadelta, adagrad, adam, nesterov, rmsprop (caffe/solvers). Lasagne provides: sgd, momentum, nesterov_momentum, adagrad, rmsprop, adadelta, adam, adamax (lasagne/updates.py). Algorithms implemented in TensorFlow: Adadelta, AdagradDA, Adagrad, ProximalAdagrad, Ftrl, Momentum, adam, Momentum, CenteredRMSProp; concrete implementations: master/tensorflow/core/kernels/training_ops.cc
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationJournal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19
Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how
More information5. Subgradient method
L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize
More informationIntroduction to Machine Learning (67577) Lecture 7
Introduction to Machine Learning (67577) Lecture 7 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Solving Convex Problems using SGD and RLM Shai Shalev-Shwartz (Hebrew
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationLecture 3: Huge-scale optimization problems
Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1
More informationClass 2 & 3 Overfitting & Regularization
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationFinite-sum Composition Optimization via Variance Reduced Gradient Descent
Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationConvergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity
Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to
More informationThe PAC Learning Framework -II
The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationMini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization
Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More information1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0
Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =
More informationLecture 25: Subgradient Method and Bundle Methods April 24
IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationSecond order machine learning
Second order machine learning Michael W. Mahoney ICSI and Department of Statistics UC Berkeley Michael W. Mahoney (UC Berkeley) Second order machine learning 1 / 88 Outline Machine Learning s Inverse Problem
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationLinear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra
Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Michael W. Mahoney ICSI and Dept of Statistics, UC Berkeley ( For more info,
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationA Sparsity Preserving Stochastic Gradient Method for Composite Optimization
A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationA Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir
A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence
More informationCS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims
CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target
More informationYou should be able to...
Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationSecond order machine learning
Second order machine learning Michael W. Mahoney ICSI and Department of Statistics UC Berkeley Michael W. Mahoney (UC Berkeley) Second order machine learning 1 / 96 Outline Machine Learning s Inverse Problem
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationStochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited
More informationModern Optimization Techniques
Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationConvex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization
Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationLinearly-Convergent Stochastic-Gradient Methods
Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationComputational Learning Theory - Hilary Term : Learning Real-valued Functions
Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More information