Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade


1 Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data, CSE547/STAT548, University of Washington. Sham Kakade

2 Announcements: HW3 posted: dual coordinate ascent (some review of SGD and random features). Today: review of tradeoffs in large scale learning, then adaptive gradient methods.

3 Review: Tradeoffs in Large Scale Learning

4 Tradeoffs in Large Scale Learning. There are many sources of error: approximation error (our choice of a hypothesis class), estimation error (we only have n samples), and optimization error (computing exact, or near-exact, minimizers can be costly). How do we think about these issues?

5 The true objective. Hypotheses map $x \in \mathcal{X}$ to $y \in \mathcal{Y}$; we have $n$ training examples $(x_1, y_1), \ldots, (x_n, y_n)$ sampled i.i.d. from $\mathcal{D}$. Training objective: given a set of parametric predictors $\{h(x, w) : w \in \mathcal{W}\}$, solve
$$\min_{w \in \mathcal{W}} \hat{L}_n(w) \quad \text{where} \quad \hat{L}_n(w) = \frac{1}{n} \sum_{i=1}^n \mathrm{loss}(h(x_i, w), y_i).$$
True objective: to generalize to $\mathcal{D}$, solve
$$\min_{w \in \mathcal{W}} L(w) \quad \text{where} \quad L(w) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}\,\mathrm{loss}(h(X, w), Y).$$
Optimization: can we obtain linear-time algorithms that find an $\epsilon$-accurate solution, i.e. find $\hat{w}$ such that $L(\hat{w}) - \min_{w \in \mathcal{W}} L(w) \le \epsilon$?
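To make the training objective concrete, here is a minimal sketch of evaluating the empirical risk $\hat{L}_n(w)$, assuming a linear predictor $h(x, w) = \langle x, w \rangle$ and squared loss; the function names and synthetic data are illustrative, not part of the slides.

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    """L_hat_n(w) = (1/n) * sum_i loss(h(x_i, w), y_i), here with h(x, w) = <x, w>."""
    preds = X @ w
    return np.mean([loss(p, yi) for p, yi in zip(preds, y)])

# Illustrative data: n = 100 examples drawn i.i.d., evaluated with squared loss.
squared_loss = lambda p, yi: 0.5 * (p - yi) ** 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y, squared_loss))
```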

6 Definitions. Let $h^*$ be the Bayes optimal hypothesis, over all functions from $\mathcal{X} \to \mathcal{Y}$: $h^* \in \arg\min_h L(h)$. Let $w^*$ be the best-in-class hypothesis: $w^* \in \arg\min_{w \in \mathcal{W}} L(w)$. Let $w_n$ be the empirical risk minimizer: $w_n \in \arg\min_{w \in \mathcal{W}} \hat{L}_n(w)$. Let $\bar{w}_n$ be what our algorithm returns.

7 Loss decomposition. Observe:
$$L(\bar{w}_n) - L(h^*) = \underbrace{L(w^*) - L(h^*)}_{\text{approximation error}} + \underbrace{L(w_n) - L(w^*)}_{\text{estimation error}} + \underbrace{L(\bar{w}_n) - L(w_n)}_{\text{optimization error}}$$
Three parts determine our performance. Optimization algorithms with the best accuracy dependence on $\hat{L}_n$ may not be the best overall; forcing one error to decrease much faster than the others may be wasteful.

8 Best-in-class error. Fix a class $\mathcal{W}$. What is the best estimator of $w^*$ for this model? For a wide class of models (linear regression, logistic regression, etc.), the ERM, $w_n \in \arg\min_{w \in \mathcal{W}} \hat{L}_n(w)$, is (in the limit) the best estimator. 1) What is the generalization error of the best estimator $w_n$? 2) How well can we do? Note:
$$L(\bar{w}_n) - L(w^*) = \underbrace{L(w_n) - L(w^*)}_{\text{estimation error}} + \underbrace{L(\bar{w}_n) - L(w_n)}_{\text{optimization error}}$$
Can we generalize as well as the sample minimizer, $w_n$, without computing it exactly?

9 Statistical Optimality. Can we generalize as well as the sample minimizer, $w_n$, without computing it exactly? For a wide class of models (linear regression, logistic regression, etc.), the estimation error satisfies
$$\mathbb{E}[L(w_n)] - L(w^*) \;\overset{n \to \infty}{=}\; \frac{\sigma^2_{\mathrm{opt}}}{n},$$
where $\sigma^2_{\mathrm{opt}}$ is an (optimal) problem-dependent constant. This is the best possible statistical rate (one can also quantify the non-asymptotic "burn-in"). What is the computational cost of achieving exactly this rate, say for large $n$?

10 Averaged SGD. SGD: $w_{t+1} \leftarrow w_t - \eta_t \nabla \mathrm{loss}(h(x, w_t), y)$. An (asymptotically) optimal algorithm: let $\eta_t$ go to 0 (sufficiently slowly) and maintain a running average of the iterates (iterate averaging):
$$\bar{w}_n = \frac{1}{n} \sum_{t \le n} w_t$$
(Polyak & Juditsky, 1992). For large enough $n$, and with one pass of SGD over the dataset,
$$\mathbb{E}[L(\bar{w}_n)] - L(w^*) \;\overset{n \to \infty}{=}\; \frac{\sigma^2_{\mathrm{opt}}}{n}.$$
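A minimal sketch of one pass of averaged SGD, assuming a generic stochastic gradient oracle; the decaying step schedule $\eta_t = \eta_0 / \sqrt{t}$ is one common choice, not prescribed by the slides.

```python
import numpy as np

def averaged_sgd(grad, w0, eta0, n_steps, sample):
    """One pass of SGD with decaying steps plus Polyak-Juditsky iterate averaging."""
    w = w0.copy()
    w_bar = np.zeros_like(w0)
    for t in range(1, n_steps + 1):
        x, y = sample()                               # draw one training example (x, y)
        w = w - (eta0 / np.sqrt(t)) * grad(w, x, y)   # plain SGD step
        w_bar += (w - w_bar) / t                      # running average (1/n) * sum_t w_t
    return w_bar
```

For least squares, for example, one would pass `grad = lambda w, x, y: (x @ w - y) * x`.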

11 Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data, CSE547/STAT548, University of Washington. Sham Kakade

12 The Problem with GD (and SGD)

13 Adaptive Gradient Methods: Convex Case. What do we want? Newton's method:
$$w \leftarrow w - [\nabla^2 L(w)]^{-1} \nabla L(w)$$
Why is this a good idea? Guarantees? Stepsize? Related ideas: conjugate gradient / acceleration, L-BFGS, quasi-Newton methods.
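A sketch of a single Newton update, assuming callables for the gradient and Hessian; in practice one solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def newton_step(w, grad_L, hess_L):
    """w <- w - [hessian L(w)]^{-1} grad L(w), via a linear solve."""
    return w - np.linalg.solve(hess_L(w), grad_L(w))
```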

14 Adaptive Gradient Methods: Non-Convex Case. What do we want? The Hessian may not be PSD, so is Newton's method even a descent method? Other ideas: natural gradient methods; curvature-adaptive methods (AdaGrad, AdaDelta, RMSProp, Adam, L-BFGS, heavy-ball / momentum); noise injection (simulated annealing, dropout, Langevin methods). Caveats: batch methods may be poor; see "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima".

15 Natural Gradient Idea. Probabilistic models and maximum likelihood estimation: $\hat{L}(w) = \log \Pr(\text{data} \mid w)$. True likelihood function: $L(w) = \mathbb{E}_{z \sim \mathcal{D}}[\log \Pr(z \mid w)]$, where $z$ is sampled from the underlying data distribution $\mathcal{D}$. Suppose the model is correct, i.e. $z \sim \Pr(z \mid w^*)$ for some $w^*$. Let's look at the Hessian at $w^*$:
$$\nabla^2 L(w^*) = \mathbb{E}_{z \sim \Pr(z \mid w^*)}\big[\nabla^2 \log \Pr(z \mid w^*)\big] = -\,\mathbb{E}_{z \sim \Pr(z \mid w^*)}\big[\nabla \log \Pr(z \mid w^*)\,(\nabla \log \Pr(z \mid w^*))^\top\big]$$
How do we approximate the Hessian at $w^*$?


17 Fisher Information Matrix. Define the Fisher matrix:
$$F(w) := \mathbb{E}_{z \sim \Pr(z \mid w)}\big[\nabla \log \Pr(z \mid w)\,(\nabla \log \Pr(z \mid w))^\top\big]$$
If the model is correct and $w \to w^*$, then $F(w) \to F(w^*)$. Natural gradient: use the update rule
$$w \leftarrow w - \eta\,[F(w)]^{-1} \nabla L(w)$$
Empirically, use $\hat{L}(w)$ and
$$\hat{F}(w) := \frac{1}{t} \sum_t g_t(w)\, g_t(w)^\top,$$
where $g_t(w)$ is the gradient of the log likelihood of the $t$-th data point.
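A sketch of one natural-gradient step using the empirical Fisher matrix $\hat{F}(w)$, treating $L$ as a loss to be minimized as in the update rule above. The small damping term is a practical addition of this sketch (to keep the solve well conditioned), not part of the slides.

```python
import numpy as np

def natural_gradient_step(w, grad_L, score_matrix, eta, damping=1e-3):
    """w <- w - eta * F_hat(w)^{-1} grad L(w).

    score_matrix(w) returns a (t, d) array whose rows are the per-example
    log-likelihood gradients g_t(w); grad_L(w) is the gradient of the loss
    being minimized.
    """
    G = score_matrix(w)                          # shape (t, d)
    t, d = G.shape
    F_hat = G.T @ G / t                          # (1/t) * sum_t g_t g_t^T
    return w - eta * np.linalg.solve(F_hat + damping * np.eye(d), grad_L(w))
```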

18 Curvature approximation. One idea:
$$\nabla^2 \hat{L}(w) \;\overset{?}{\approx}\; \frac{1}{t} \sum_t g_t(w)\, g_t(w)^\top,$$
where $g_t(w)$ is the gradient of the $t$-th data point's loss. Many methods try to use this approximation: quasi-Newton methods, Gauss-Newton methods, and (sort of) the ellipsoid method.

19 Motivating AdaGrad (Duchi, Hazan, Singer 2011). Assuming $w \in \mathbb{R}^d$, standard stochastic (sub)gradient descent updates are of the form
$$w^{(t+1)}_i \leftarrow w^{(t)}_i - \eta_t\, g_{t,i}.$$
Should all features share the same learning rate? Motivation: we often have high-dimensional feature spaces; many features are irrelevant; rare features are often very informative. AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations.

20 Why Adapt to Geometry? [Figure: a "hard" loss geometry vs. a "nice" one, with three features: 1 frequent but irrelevant; 2 infrequent, predictive; 3 infrequent, predictive.] Examples from Duchi et al. ISMP 2012 slides.

21 Not All Features are Created Equal. Examples: text data ("The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." The Atlantic, July/August); high-dimensional image features. Images from Duchi et al. ISMP 2012 slides.

22 Visualizing Effect. Credit:

23 Regret Minimization. How do we assess the performance of an online algorithm? The algorithm iteratively predicts $w^{(t)}$ and incurs loss $\ell_t(w^{(t)})$. Regret: the total incurred loss of the algorithm relative to the best fixed choice of $w$ that could have been made in retrospect:
$$R(T) = \sum_{t=1}^T \ell_t(w^{(t)}) - \inf_{w \in \mathcal{W}} \sum_{t=1}^T \ell_t(w)$$

24 Regret Bounds for Standard SGD. Standard projected stochastic gradient updates:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - (w^{(t)} - \eta\, g_t) \big\|_2^2$$
Standard regret bound:
$$\sum_{t=1}^T \ell_t(w^{(t)}) - \ell_t(w^*) \le \frac{1}{2\eta}\, \big\| w^{(1)} - w^* \big\|_2^2 + \frac{\eta}{2} \sum_{t=1}^T \| g_t \|_2^2$$

25 Projected Gradient using a Mahalanobis Norm. Standard projected stochastic gradient updates:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - (w^{(t)} - \eta\, g_t) \big\|_2^2$$
What if, instead of the $L_2$ metric for the projection, we used a Mahalanobis norm?
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - (w^{(t)} - \eta\, A^{-1} g_t) \big\|_A^2$$

26 Mahalanobis Regret Bounds. Which $A$ should we choose? With the update
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - (w^{(t)} - \eta\, A^{-1} g_t) \big\|_A^2,$$
the regret bound becomes
$$\sum_{t=1}^T \ell_t(w^{(t)}) - \ell_t(w^*) \le \frac{1}{2\eta}\, \big\| w^{(1)} - w^* \big\|_A^2 + \frac{\eta}{2} \sum_{t=1}^T \| g_t \|_{A^{-1}}^2$$
What if we minimize the upper bound on the regret with respect to $A$ in hindsight?
$$\min_A \sum_{t=1}^T g_t^\top A^{-1} g_t$$

27 Mahalanobis Regret Minimization. Objective:
$$\min_A \sum_{t=1}^T g_t^\top A^{-1} g_t \quad \text{subject to } A \succeq 0,\ \mathrm{tr}(A) \le C$$
Solution:
$$A = c \left( \sum_{t=1}^T g_t\, g_t^\top \right)^{1/2}$$
For a proof, see Appendix E, Lemma 15 of Duchi et al. 2011; it uses the trace trick and a Lagrangian. $A$ defines the norm of the metric space we should be operating in.
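A sketch of computing the hindsight-optimal metric $A \propto (\sum_t g_t g_t^\top)^{1/2}$ via an eigendecomposition (the matrix is symmetric PSD); the constant $c$ is left as a free scale, and the function name is illustrative.

```python
import numpy as np

def optimal_metric(grads, c=1.0):
    """Return A = c * (sum_t g_t g_t^T)^{1/2} for a (T, d) array of gradients."""
    G = np.asarray(grads)
    M = G.T @ G                                      # sum_t g_t g_t^T
    evals, evecs = np.linalg.eigh(M)                 # symmetric eigendecomposition
    sqrt_M = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    return c * sqrt_M
```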

28 AdaGrad Algorithm. At time $t$, estimate the optimal (sub)gradient modification $A$ by
$$A_t = \left( \sum_{\tau=1}^t g_\tau\, g_\tau^\top \right)^{1/2}$$
For large $d$, $A_t$ is computationally intensive to compute. Instead, use only its diagonal. The algorithm is then a simple modification of the normal updates:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - (w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t) \big\|_{\mathrm{diag}(A_t)}^2$$

29 AdaGrad in Euclidean Space. For $\mathcal{W} = \mathbb{R}^d$, each feature dimension gets its own update:
$$w^{(t+1)}_i \leftarrow w^{(t)}_i - \eta_{t,i}\, g_{t,i} \quad \text{where} \quad \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}$$
That is, each feature dimension has its own learning rate, which adapts with $t$ and takes the geometry of past observations into account. The primary role of $\eta$ is to determine the step size the first time a feature is encountered.
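A minimal sketch of the diagonal AdaGrad update in the unconstrained case $\mathcal{W} = \mathbb{R}^d$; the small `eps` in the denominator is a standard numerical safeguard that the slides omit.

```python
import numpy as np

def adagrad(grad, w0, eta, n_steps, sample, eps=1e-8):
    """Per-coordinate step: eta / sqrt(sum of squared past gradients in that coordinate)."""
    w = w0.copy()
    sq_sum = np.zeros_like(w0)                    # running sum of g_{tau,i}^2 per coordinate
    for _ in range(n_steps):
        g = grad(w, *sample())                    # stochastic (sub)gradient g_t
        sq_sum += g ** 2
        w = w - eta * g / (np.sqrt(sq_sum) + eps)
    return w
```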

30 AdaGrad Theoretical Guarantees. AdaGrad regret bound:
$$\sum_{t=1}^T \ell_t(w^{(t)}) - \ell_t(w^*) \le 2 R_\infty \sum_{i=1}^d \big\| g_{1:T,\,i} \big\|_2, \quad \text{where } R_\infty := \max_t \big\| w^{(t)} - w^* \big\|_\infty$$
In the stochastic setting:
$$\mathbb{E}\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^T w^{(t)} \right) \right] - \ell(w^*) \le \frac{2 R_\infty}{T} \sum_{i=1}^d \mathbb{E}\big[ \| g_{1:T,\,i} \|_2 \big]$$
This is used in practice. Many cool examples; let's just examine one.

31 AdaGrad Theoretical Example. Expect AdaGrad to outperform when the gradient vectors are sparse. SVM hinge loss example:
$$\ell_t(w) = [1 - y_t \langle x_t, w \rangle]_+, \quad x_t \in \{-1, 0, 1\}^d$$
If feature $j$ is nonzero ($x_{jt} \ne 0$) with probability $\propto 1/j^\alpha$ for some $\alpha > 1$, then (sort of)
$$\mathbb{E}\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^T w^{(t)} \right) \right] - \ell(w^*) = O\!\left( \| w^* \|_\infty \cdot \frac{\max\{\log d,\ d^{1-\alpha/2}\}}{\sqrt{T}} \right),$$
compared with the previous bound
$$\mathbb{E}\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^T w^{(t)} \right) \right] - \ell(w^*) = O\!\left( \| w^* \|_\infty \frac{\sqrt{d}}{\sqrt{T}} \right).$$
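For intuition about why sparse gradients help, here is a sketch of the hinge-loss subgradient used above: coordinates where $x_t$ is zero contribute nothing, so rare features accumulate small sums $\sum_\tau g_{\tau,i}^2$ and keep large AdaGrad step sizes.

```python
import numpy as np

def hinge_subgrad(w, x, y):
    """Subgradient of l_t(w) = [1 - y * <x, w>]_+ ; it is zero wherever x is zero."""
    return -y * x if y * (x @ w) < 1 else np.zeros_like(w)
```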

32 Neural Network Learning. Wildly non-convex problem, but we use SGD methods anyway. Example objective:
$$\ell(w, x) = \log\!\big(1 + \exp\big( \big\langle [\,p(\langle w_1, x_1 \rangle)\ \cdots\ p(\langle w_k, x_k \rangle)\,],\ x_0 \big\rangle \big)\big), \quad \text{where } p(\alpha) = \frac{1}{1 + \exp(-\alpha)}$$
[Figure: average frame accuracy (%) on the test set vs. training time (hours) for SGD, Downpour SGD, Downpour SGD w/ AdaGrad, and Sandblaster L-BFGS (Dean et al. 2012). Distributed training with a very large number of parameters $d$; SGD and AdaGrad use 80 machines (1000 cores), L-BFGS uses 800 machines (10000 cores).] Images from Duchi et al. ISMP 2012 slides.

33 ADAM. Like AdaGrad, but with forgetting. The algorithm has component-wise updates.
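A minimal sketch of the Adam update (Kingma & Ba), shown here to make "AdaGrad with forgetting" concrete: the per-coordinate sums of squared gradients are replaced by exponential moving averages with bias correction. The hyperparameter defaults follow common convention and are not taken from the slides.

```python
import numpy as np

def adam(grad, w0, n_steps, sample, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Component-wise Adam updates with first/second moment estimates m and v."""
    w = w0.copy()
    m = np.zeros_like(w0)
    v = np.zeros_like(w0)
    for t in range(1, n_steps + 1):
        g = grad(w, *sample())
        m = beta1 * m + (1 - beta1) * g           # moving average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2      # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
```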

34 Comparisons: MNIST, Sigmoid 100 layer

35 Comparisons: MNIST, Tanh 100 layer

36 Comparisons: Sigmoid, ReLU, Sigmoid

37 Acknowledgments. Some figures taken from: of-optimization-techniques-stochastic-gradient-descent-momentum-adagrad-and-adadelta/
