Large-scale Stochastic Optimization

Size: px

Start display at page:

Download "Large-scale Stochastic Optimization"

Peter O’Brien’
5 years ago
Views:

1 Large-scale Stochastic Optimization /641/441 (Spring 2016) Hanxiao Liu March 24, / 22

2 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation Comparisons with GD 3. Useful large-scale SGD solvers Support Vector Machines Matrix Factorization 4. Random topics Variance reduction Implementation trick Other variants 2 / 22

3 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation Comparisons with GD 3. Useful large-scale SGD solvers Support Vector Machines Matrix Factorization 4. Random topics Variance reduction Implementation trick Other variants 2 / 22

4 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation Comparisons with GD 3. Useful large-scale SGD solvers Support Vector Machines Matrix Factorization 4. Random topics Variance reduction Implementation trick Other variants 2 / 22

5 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation Comparisons with GD 3. Useful large-scale SGD solvers Support Vector Machines Matrix Factorization 4. Random topics Variance reduction Implementation trick Other variants 2 / 22

6 Risk Minimization {(x i, y i )} n i.i.d. i=1 : training data D. min f E l (f (x), y) (x,y) D }{{} True risk 1 n = min l (f (x i ), y i ) f n i=1 }{{} Empirical risk 1 = min w n (1) n l (f w (x i ), y i ) (2) i=1 Algorithm l (f w (x i ), y i ) Logistic Regression ) ln (1 + e y iw x i + λ 2 w 2 SVMs max ( 0, 1 y i w x i ) + λ 2 w 2 3 / 22

7 Risk Minimization {(x i, y i )} n i.i.d. i=1 : training data D. min f E l (f (x), y) (x,y) D }{{} True risk 1 n = min l (f (x i ), y i ) f n i=1 }{{} Empirical risk 1 = min w n (1) n l (f w (x i ), y i ) (2) i=1 Algorithm l (f w (x i ), y i ) Logistic Regression ) ln (1 + e y iw x i + λ 2 w 2 SVMs max ( 0, 1 y i w x i ) + λ 2 w 2 3 / 22

8 Gradient Descent (GD) l (f w (x i ), y i ) def = l i (w), l (w) def = 1 n n i=1 l i (w) Training objective: min w l(w) (3) Gradient update: w (k) = w (k 1) η k l ( w (k 1)) η k : pre-specified or determined via backtracking Can be easily generalized to handle nonsmooth loss 1. Gradient Subgradient 2. Proximal gradient method (for structured l (w)) Question of interest: How fast does GD converge? 4 / 22

9 Gradient Descent (GD) l (f w (x i ), y i ) def = l i (w), l (w) def = 1 n n i=1 l i (w) Training objective: min w l(w) (3) Gradient update: w (k) = w (k 1) η k l ( w (k 1)) η k : pre-specified or determined via backtracking Can be easily generalized to handle nonsmooth loss 1. Gradient Subgradient 2. Proximal gradient method (for structured l (w)) Question of interest: How fast does GD converge? 4 / 22

10 Gradient Descent (GD) l (f w (x i ), y i ) def = l i (w), l (w) def = 1 n n i=1 l i (w) Training objective: min w l(w) (3) Gradient update: w (k) = w (k 1) η k l ( w (k 1)) η k : pre-specified or determined via backtracking Can be easily generalized to handle nonsmooth loss 1. Gradient Subgradient 2. Proximal gradient method (for structured l (w)) Question of interest: How fast does GD converge? 4 / 22

11 Convergence Theorem (GD convergence) If l is both convex and differentiable 1 l ( w (k)) l (w ) { w (0) w 2 2 = O ( ) 1 in general 2ηk k c k L w (0) w 2 2 = O ( c k) l is strongly convex 2 (4) To achieve l ( x (k)) l (x ) ρ, GD needs O ( 1 ρ) iterations ( in general, and O log ( 1 ρ) ) iterations with strong convexity. 1 the step size η must be no larger than 1 L, where L is the Lipschitz constant satisfying l (a) l (b) 2 L a b 2 a, b 5 / 22

12 Convergence Theorem (GD convergence) If l is both convex and differentiable 1 l ( w (k)) l (w ) { w (0) w 2 2 = O ( ) 1 in general 2ηk k c k L w (0) w 2 2 = O ( c k) l is strongly convex 2 (4) To achieve l ( x (k)) l (x ) ρ, GD needs O ( 1 ρ) iterations ( in general, and O log ( 1 ρ) ) iterations with strong convexity. 1 the step size η must be no larger than 1 L, where L is the Lipschitz constant satisfying l (a) l (b) 2 L a b 2 a, b 5 / 22

13 GD Efficiency Why not happy with GD? Fast convergence high efficiency. w (k) = w (k 1) η k l ( w (k 1)) (5) [ = w (k 1) 1 n ( η k l )] i w (k 1) (6) n i=1 Per-iteration complexity = O(n) (extremely large) A single cycle of all the data may take forever. Cheaper GD? - SGD 6 / 22

14 GD Efficiency Why not happy with GD? Fast convergence high efficiency. w (k) = w (k 1) η k l ( w (k 1)) (5) [ = w (k 1) 1 n ( η k l )] i w (k 1) (6) n i=1 Per-iteration complexity = O(n) (extremely large) A single cycle of all the data may take forever. Cheaper GD? - SGD 6 / 22

15 Stochastic Gradient Descent Approximate the full gradient via an unbiased estimator ( 1 w (k) = w (k 1) η k n ( 1 w (k 1) η k n ( l ) ) i w (k 1) (7) i=1 ( l ) ) i w (k 1) B i B }{{} mini-batch SGD 2 w (k 1) η k l i ( w (k 1) ) }{{} pure SGD B unif {1, 2,... n} (8) i unif {1, 2,... n} (9) Trade-off: lower computation cost v.s. larger variance 2 When using GPU, B usually depends on the memory budget. 7 / 22

16 GD v.s. SGD For strongly convex l (w), according to [Bottou, 2012] Optimizer GD SGD Winner Time per-iteration O (n) ( O (1) SGD Iterations to accuracy ρ O log ( 1 ρ) ) ( ) 1 Õ ρ GD ( ) ( ) Time to accuracy ρ O n log 1 1 ρ Õ ρ Depends Time to generalization error ɛ O ( ) 1 log 1 ɛ 1/α ɛ Õ ( ) 1 ɛ SGD where 1 2 α 1 8 / 22

17 SVMs Solver: Pegasos [Shalev-Shwartz et al., 2011] Recall l i (w) = max ( ) 0, 1 y i w λ x i + 2 w 2 (10) { λ 2 = w 2 y i w x i 1 (11) 1 y i w x i + λ 2 w 2 y i w x i < 1 Therefore l i (w) = { λw y i w x i 1 λw y i x i y i w x i < 1 (12) 9 / 22

18 SVMs Solver: Pegasos [Shalev-Shwartz et al., 2011] Recall l i (w) = max ( ) 0, 1 y i w λ x i + 2 w 2 (10) { λ 2 = w 2 y i w x i 1 (11) 1 y i w x i + λ 2 w 2 y i w x i < 1 Therefore l i (w) = { λw y i w x i 1 λw y i x i y i w x i < 1 (12) 9 / 22

19 SVMs in 10 Lines Algorithm 1: Pegasos: SGD solver for SVMs Input: n, λ, T ; Initialization: w 0; for k = 1, 2,..., T do i uni {1, 2,... n}; η k 1 λk ; if y i w (k) x i < 1 then w (k+1) w (k) ( η k λw (k) ) y i x i else w (k+1) w (k) η k λw (k) end end Output: w (T +1) 10 / 22

20 Empirical Comparisons SGD v.s. batch solvers 3 on RCV1 #Features #Training examples 47, , 265 Algorithm Time (secs) Primal Obj Test Error SMO (SVM light ) 16, % Cutting Plane (SVM perf ) % SGD < % Where is the magic? / 22

21 The Magic SGD takes a long time to accurately solve the problem. There s no need to solve the problem super accurately in order to get good generalization ability / 22

22 SGD for Matrix Factorization The idea of SGD can be trivially extended to MF 4 l (U, V ) = 1 O (a,b) O l a,b (u a, v b ) (13) }{{} e.g. (r ab u a v b) 2 SGD updating rule: for each user-item pair (a, b) O ( ) u (k) a = u (k 1) a η k l a,b u (k 1) a ( ) v (k) b = v (k 1) b η k l a,b v (k 1) b (14) (15) Buildingblock for distributed SGD for MF 4 We omit the regularization term for simplicity 13 / 22

23 SGD for Matrix Factorization The idea of SGD can be trivially extended to MF 4 l (U, V ) = 1 O (a,b) O l a,b (u a, v b ) (13) }{{} e.g. (r ab u a v b) 2 SGD updating rule: for each user-item pair (a, b) O ( ) u (k) a = u (k 1) a η k l a,b u (k 1) a ( ) v (k) b = v (k 1) b η k l a,b v (k 1) b (14) (15) Buildingblock for distributed SGD for MF 4 We omit the regularization term for simplicity 13 / 22

24 Empirical Comparisons On Netflix [Gemulla et al., 2011] DSGD Distributed SGD ALS Alternating least squares one of the state-of-the-art batch solvers DGD Distributed GD 14 / 22

25 SGD Revisited Can we even do better? Bottleneck of SGD: high variance in l i (w) Less effective gradient steps The existence of variance = lim k η k = 0 for convergence = slower progress Variance reduction SVRG [Johnson and Zhang, 2013], SAG, SDCA / 22

26 Stochastic Variance Reduced Gradient w - a snapshot of w (to be updated every few cycles) µ - 1 n n i=1 l i ( w) Key idea - use l i ( w) to cancel the volatility in l i (w) 1 n n l i (w) = 1 n i=1 = 1 n n l i (w) µ + µ (16) i=1 n l i (w) 1 n i=1 n l i ( w) + µ (17) i=1 l i (w) l i ( w) + µ i {1, 2,..., n} (18) A desirable property: l i ( w (k) ) l i ( w) + µ 0 16 / 22

27 Stochastic Variance Reduced Gradient w - a snapshot of w (to be updated every few cycles) µ - 1 n n i=1 l i ( w) Key idea - use l i ( w) to cancel the volatility in l i (w) 1 n n l i (w) = 1 n i=1 = 1 n n l i (w) µ + µ (16) i=1 n l i (w) 1 n i=1 n l i ( w) + µ (17) i=1 l i (w) l i ( w) + µ i {1, 2,..., n} (18) A desirable property: l i ( w (k) ) l i ( w) + µ 0 16 / 22

28 Results Image classification using neural networks [Johnson and Zhang, 2013] 17 / 22

29 Implementation trick For l i (w) = ψ i (w) + λ 2 w 2 ( w (k+1) (1 ηλ) w (k) η ψ ) }{{} i w (k) }{{} shrink w highly sparse (19) The shrinking operations takes O(p) not happy Trick 5 : recast w as w = c w ( ηψ i c (k) w (k) c (k+1) w (k+1) (1 ηλ) c (k) w (k) }{{} (1 ηλ) c (k) scalar update } {{ } sparse update (20) More SGD tricks can be found at [Bottou, 2012] 5 fast-quadratic-regularization-for-online-learning 18 / 22

30 Implementation trick For l i (w) = ψ i (w) + λ 2 w 2 ( w (k+1) (1 ηλ) w (k) η ψ ) }{{} i w (k) }{{} shrink w highly sparse (19) The shrinking operations takes O(p) not happy Trick 5 : recast w as w = c w ( ηψ i c (k) w (k) c (k+1) w (k+1) (1 ηλ) c (k) w (k) }{{} (1 ηλ) c (k) scalar update } {{ } sparse update (20) More SGD tricks can be found at [Bottou, 2012] 5 fast-quadratic-regularization-for-online-learning 18 / 22

31 Popular SGD Variants A non-exhaustive list 1. AdaGrad [Duchi et al., 2011] 2. Momentum [Rumelhart et al., 1988] 3. Nesterov s method [Nesterov et al., 1994] 4. AdaDelta: AdaGrad refined [Zeiler, 2012] 5. Rprop & Rmsprop [Tieleman and Hinton, 2012]: Ignoring the magnitude of gradient All are empirically found effective in solving nonconvex problems (e.g., deep neural nets). Demos 6 : Animation 0, 1, 2, visualizing_gradient_optimization_techniques/cklhott 19 / 22

32 Popular SGD Variants A non-exhaustive list 1. AdaGrad [Duchi et al., 2011] 2. Momentum [Rumelhart et al., 1988] 3. Nesterov s method [Nesterov et al., 1994] 4. AdaDelta: AdaGrad refined [Zeiler, 2012] 5. Rprop & Rmsprop [Tieleman and Hinton, 2012]: Ignoring the magnitude of gradient All are empirically found effective in solving nonconvex problems (e.g., deep neural nets). Demos 6 : Animation 0, 1, 2, visualizing_gradient_optimization_techniques/cklhott 19 / 22

33 Summary Today s talk 1. GD - expensive, accurate gradient evaluation 2. SGD - cheap, noisy gradient evaluation 3. SGD-based solvers (SVMs, MF) 4. Variance reduction techniques Remarks about SGD extremely handy for large problems only one of many handy tools alternatives: quasi-newton (BFGS), Coordinate descent, ADMM, CG, etc. depending on the problem structure 20 / 22

34 Summary Today s talk 1. GD - expensive, accurate gradient evaluation 2. SGD - cheap, noisy gradient evaluation 3. SGD-based solvers (SVMs, MF) 4. Variance reduction techniques Remarks about SGD extremely handy for large problems only one of many handy tools alternatives: quasi-newton (BFGS), Coordinate descent, ADMM, CG, etc. depending on the problem structure 20 / 22

35 Reference I Bordes, A., Bottou, L., and Gallinari, P. (2009). Sgd-qn: Careful quasi-newton stochastic gradient descent. The Journal of Machine Learning Research, 10: Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages Springer. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: Gemulla, R., Nijkamp, E., Haas, P. J., and Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages ACM. Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages / 22

36 Reference II Nesterov, Y., Nemirovskii, A., and Ye, Y. (1994). Interior-point polynomial algorithms in convex programming, volume 13. SIAM. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5:3. Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3 30. Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4. Zeiler, M. D. (2012). Adadelta: An adaptive learning rate method. arxiv preprint arxiv: Zhang, T. Modern optimization techniques for big data machine learning. 22 / 22

CS260: Machine Learning Algorithms

CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {