Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Machine Learning for Big Data (CSE547/STAT548), University of Washington

Announcements

- Work on your project milestones:
  - reading / related work summary
  - some empirical work

Today:
- Review: optimization of finite sums, (dual) coordinate ascent
- New: SVRG (for sums of loss functions); tradeoffs in large scale learning
- How do we optimize in the big data regime?

Machine Learning and the Big Data Regime

Goal: find a d-dimensional parameter vector which minimizes the loss on n training examples.
- We have n training examples $(x_1, y_1), \ldots, (x_n, y_n)$.
- We have a parametric classifier $h(x, w)$, where $w$ is a d-dimensional vector.

$$\min_w L(w) \quad \text{where} \quad L(w) = \sum_i \mathrm{loss}(h(x_i, w), y_i)$$

"Big Data Regime": how do you optimize this when n and d are large?
- memory?
- parallelization?

Can we obtain linear time algorithms to find an $\epsilon$-accurate solution? That is, find $\hat{w}$ so that

$$L(\hat{w}) - \min_w L(w) \le \epsilon$$

Review: Stochastic Gradient Descent (SGD)

SGD update rule (written here for the square loss): at each time t, sample a point $(x_i, y_i)$ and update

$$w \leftarrow w - \eta \, (w \cdot x_i - y_i) \, x_i$$

- Problem: even if $w = w^*$ (the minimizer), the update changes $w$.
- Rate: the convergence rate is $O(1/\epsilon)$, with decaying $\eta$.
- A simple algorithm, light on memory, but with a poor convergence rate.
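The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the lecture's code: the function name `sgd_least_squares`, the synthetic data, and the step-size schedule `eta0 / sqrt(t)` are all hypothetical choices. The last lines illustrate the problem noted on the slide: at the training minimizer the full gradient vanishes, yet single-example gradients do not, so an SGD step taken there still moves w.

```python
import numpy as np

def sgd_least_squares(X, y, eta0=0.1, steps=5000, seed=0):
    """SGD on the square loss with a decaying step size eta0 / sqrt(t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)                       # sample a point (x_i, y_i)
        eta = eta0 / np.sqrt(t)                   # assumed decay schedule
        w = w - eta * (w @ X[i] - y[i]) * X[i]    # w <- w - eta (w.x_i - y_i) x_i
    return w

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + 0.5 * rng.normal(size=200)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]     # minimizer of the training loss
w_sgd = sgd_least_squares(X, y)

# Full gradient at w_star is zero (up to rounding) ...
full_grad = X.T @ (X @ w_star - y) / len(y)
# ... but the per-example gradients at w_star are not.
per_example = (X @ w_star - y)[:, None] * X
largest_single_grad = np.linalg.norm(per_example, axis=1).max()
```

Comparing `np.linalg.norm(full_grad)` with `largest_single_grad` shows why a decaying step size is needed for plain SGD to settle.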

SDCA advantages/disadvantages

What about more general convex problems? e.g.

$$\min_w L(w) \quad \text{where} \quad L(w) = \sum_i \mathrm{loss}(h(x_i, w), y_i)$$

- The basic idea (formalized with duality) is pretty general for convex $\mathrm{loss}(\cdot)$.
- Works very well in practice.
- Memory: SDCA needs $O(n + d)$ memory, while SGD needs only $O(d)$.

What about an algorithm for non-convex problems?
- SDCA seems heavily tied to the convex case.
- Is there an algorithm that is highly accurate in the convex case and sensible in the non-convex case?

The $L$-smooth and $\mu$-strongly convex case

Review: Stochastic Gradient Descent

Suppose $L(w)$ is $\mu$-strongly convex, and suppose each loss $\mathrm{loss}(\cdot)$ is $L$-smooth.

- Number of iterations to get $\epsilon$-accuracy: $\frac{L}{\mu \epsilon}$ (see related work for precise problem dependent parameters).
- Computation time to get $\epsilon$-accuracy: $\frac{L}{\mu \epsilon} \, d$ (assuming $O(d)$ cost per gradient evaluation).

(another idea) Stochastic Variance Reduced Gradient (SVRG)

1. Exact gradient computation: at stage $s$, using $\tilde{w}_s$, compute

$$\nabla L(\tilde{w}_s) = \frac{1}{n} \sum_{i=1}^{n} \nabla \mathrm{loss}(h(x_i, \tilde{w}_s), y_i)$$

2. Variance reduction + SGD: initialize $w \leftarrow \tilde{w}_s$. For $m$ steps, sample a point $(x, y)$ and update

$$w \leftarrow w - \eta \left( \nabla \mathrm{loss}(h(x, w), y) - \nabla \mathrm{loss}(h(x, \tilde{w}_s), y) + \nabla L(\tilde{w}_s) \right)$$

3. Update and repeat: $\tilde{w}_{s+1} \leftarrow w$.
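The three steps above can be sketched as follows. This is a minimal illustration, assuming a square loss and synthetic data; the function name `svrg_least_squares` and the values of `eta`, `m`, and `stages` are hypothetical choices made so the example converges, not settings taken from the lecture.

```python
import numpy as np

def svrg_least_squares(X, y, eta, m, stages, seed=0):
    """SVRG for L(w) = (1/n) * sum_i 0.5 * (w.x_i - y_i)^2 (square loss assumed)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)
    for _ in range(stages):
        # 1. exact gradient at the snapshot w_tilde
        full_grad = X.T @ (X @ w_tilde - y) / n
        # 2. variance-reduced SGD, initialized at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)                        # sample a point (x, y)
            g_w = (w @ X[i] - y[i]) * X[i]             # gradient at current w
            g_snap = (w_tilde @ X[i] - y[i]) * X[i]    # gradient at snapshot
            w = w - eta * (g_w - g_snap + full_grad)
        # 3. update the snapshot and repeat
        w_tilde = w
    return w_tilde

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
n, d = 300, 5
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + 0.5 * rng.normal(size=n)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
w_svrg = svrg_least_squares(X, y, eta=0.005, m=5 * n, stages=40)
error = np.linalg.norm(w_svrg - w_star)
```

Note the constant step size: unlike plain SGD, no decay schedule is used here, and only the snapshot and one full gradient are stored, so memory stays at O(d).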

Properties of SVRG

- Unbiased updates: what is the mean of the blue term (the correction built from the snapshot $\tilde{w}_s$)?

$$\mathbb{E}\left[ \nabla \mathrm{loss}(h(x, \tilde{w}_s), y) - \nabla L(\tilde{w}_s) \right] = \; ?$$

  where the expectation is over a random sample $(x, y)$.
- If $w = \tilde{w}_s = w^*$, then there is no update.
- Memory is $O(d)$.
- No dual variables.
- Applicable to non-convex optimization.
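The first two properties can be checked numerically. The sketch below assumes a square loss and a small synthetic dataset (both illustrative assumptions, as is the helper name `per_example_grads`). Because the sample is drawn uniformly from the n training points, the expectation is just the average over the rows, and the check suggests the mean of the correction term is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=100)

def per_example_grads(w):
    """Row i is the gradient of 0.5 * (w.x_i - y_i)^2 with respect to w."""
    return (X @ w - y)[:, None] * X

w_tilde = rng.normal(size=4)                 # an arbitrary snapshot
grads_snap = per_example_grads(w_tilde)
full_grad = grads_snap.mean(axis=0)          # grad L(w_tilde)

# Mean of the correction term over a uniformly sampled (x, y).
mean_correction = (grads_snap - full_grad).mean(axis=0)

# The SVRG direction at any w therefore averages to grad L(w): it is unbiased.
w = rng.normal(size=4)
directions = per_example_grads(w) - grads_snap + full_grad
mean_direction = directions.mean(axis=0)
true_grad = per_example_grads(w).mean(axis=0)

# At w = w_tilde = w_star every direction is zero (up to rounding): no update.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
g_star = per_example_grads(w_star)
directions_at_opt = g_star - g_star + g_star.mean(axis=0)
```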

Guarantees of SVRG

Set $m = L/\mu$. Number of gradient computations to get $\epsilon$ accuracy:

$$\left( n + \frac{L}{\mu} \right) \log(1/\epsilon)$$

Comparisons

A gradient evaluation is at a point $(x, y)$.

- SVRG: number of gradient computations to get $\epsilon$ accuracy: $\left( n + \frac{L}{\mu} \right) \log(1/\epsilon)$
- Batch gradient descent: number of gradient evaluations: $n \, \frac{L}{\mu} \log(1/\epsilon)$, where here $L$ is the smoothness of $L(w)$.
- SGD: number of gradient computations: $\frac{L}{\mu \epsilon}$
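To get a feel for these counts, the sketch below plugs numbers into the three expressions with constants dropped. The function name `gradient_evaluations` and the values of `n`, `kappa`, and `eps` are hypothetical; using a single `kappa = L / mu` for all three methods is a simplifying assumption, since the slide notes that the L for batch gradient descent is the smoothness of the full objective.

```python
import math

def gradient_evaluations(n, kappa, eps):
    """Order-of-magnitude gradient counts (constants dropped), kappa = L / mu."""
    log_term = math.log(1.0 / eps)
    return {
        "SVRG": (n + kappa) * log_term,    # (n + L/mu) log(1/eps)
        "GD": n * kappa * log_term,        # n (L/mu) log(1/eps)
        "SGD": kappa / eps,                # L / (mu eps)
    }

# Illustrative problem sizes (assumptions, not from the lecture).
counts = gradient_evaluations(n=1_000_000, kappa=1_000.0, eps=1e-6)
ranking = sorted(counts, key=counts.get)   # method names, cheapest first
```

For these particular values SVRG needs the fewest evaluations and batch gradient descent the most; at a loose accuracy such as `eps=1e-2` the ordering between SGD and SVRG can flip.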

Non-convex comparisons

How many gradient evaluations does it take to find $w$ so that

$$\| \nabla L(w) \|_2^2 \le \epsilon$$

(i.e. close to a stationary point)?

Rates: the number of gradient evaluations, at a point $(x, y)$, is:
- GD: $O(n/\epsilon)$
- SGD: $O(1/\epsilon^2)$
- SVRG: $O(n + n^{2/3}/\epsilon)$

Does SVRG work well in practice?

Tradeoffs in Large Scale Learning

Many issues and sources of error:
- approximation error: our choice of a hypothesis class
- estimation error: we only have n samples
- optimization error: computing exact (or near-exact) minimizers can be costly.

How do we think about these issues?

The true objective

- A hypothesis maps $x \in \mathcal{X}$ to $y \in \mathcal{Y}$.
- We have n training examples $(x_1, y_1), \ldots, (x_n, y_n)$ sampled i.i.d. from $D$.
- Training objective: we have a set of parametric predictors $\{ h(x, w) : w \in \mathcal{W} \}$,

$$\min_{w \in \mathcal{W}} \hat{L}_n(w) \quad \text{where} \quad \hat{L}_n(w) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(h(x_i, w), y_i)$$

- True objective: to generalize to $D$,

$$\min_{w \in \mathcal{W}} L(w) \quad \text{where} \quad L(w) = \mathbb{E}_{(X, Y) \sim D} \, \mathrm{loss}(h(X, w), Y)$$

- Optimization: can we obtain linear time algorithms to find an $\epsilon$-accurate solution? That is, find $\hat{w}$ so that

$$L(\hat{w}) - \min_{w \in \mathcal{W}} L(w) \le \epsilon$$

Definitions

- Let $h^*$ be the Bayes optimal hypothesis, over all functions from $\mathcal{X} \to \mathcal{Y}$: $h^* \in \arg\min_h L(h)$
- Let $w^*$ be the best in class hypothesis: $w^* \in \arg\min_{w \in \mathcal{W}} L(w)$
- Let $w_n$ be the empirical risk minimizer: $w_n \in \arg\min_{w \in \mathcal{W}} \hat{L}_n(w)$
- Let $\tilde{w}_n$ be what our algorithm returns.

Loss decomposition

Observe:

$$L(\tilde{w}_n) - L(h^*) = \underbrace{L(w^*) - L(h^*)}_{\text{approximation error}} + \underbrace{L(w_n) - L(w^*)}_{\text{estimation error}} + \underbrace{L(\tilde{w}_n) - L(w_n)}_{\text{optimization error}}$$

- Three parts which determine our performance.
- Optimization algorithms with the best accuracy dependencies on $\hat{L}_n$ may not be best.
- Forcing one error to decrease much faster may be wasteful.
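A toy problem where all three terms can be computed in closed form may make the decomposition concrete. Everything in this sketch is an assumption chosen for illustration: x is uniform on [-1, 1], y = x^2 plus Gaussian noise, the hypothesis class is linear predictors w * x with a scalar w, the loss is the square loss, and the "algorithm" is five gradient steps on the empirical loss. With these choices the Bayes predictor is x^2 and the best linear predictor is w = 0 because E[x^3] = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.1, 200
x = rng.uniform(-1.0, 1.0, size=n)
y = x**2 + sigma * rng.normal(size=n)

def true_loss(w):
    """L(w) = E[(w * x - y)^2], using E[x^2] = 1/3, E[x^3] = 0, E[x^4] = 1/5."""
    return w**2 / 3.0 + 1.0 / 5.0 + sigma**2

bayes_loss = sigma**2            # L(h*), attained by h*(x) = x^2
w_best = 0.0                     # best in class (w* in the slide's notation)
w_n = (x @ y) / (x @ x)          # empirical risk minimizer

# Hypothetical algorithm: a few gradient steps on the empirical loss.
w_alg = 1.0
for _ in range(5):
    w_alg -= 0.5 * 2.0 * np.mean((w_alg * x - y) * x)

approximation = true_loss(w_best) - bayes_loss
estimation = true_loss(w_n) - true_loss(w_best)
optimization = true_loss(w_alg) - true_loss(w_n)
total = true_loss(w_alg) - bayes_loss
```

Here the approximation error dominates, so spending more computation to shrink the optimization error would have little effect on `total`.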

Time to a fixed accuracy

[Figure: test error versus training time]

Comparing sample sizes

[Figure: test error versus training time]
Vary the number of examples.

Comparing sample sizes and models

[Figure: test error versus training time]
Vary the number of examples.

Optimal choices

[Figure: test error versus training time, with good combinations marked]
The optimal combination depends on the training time budget.

Estimation error: simplest case

Measuring a mean:

$$L(\mu) = \mathbb{E}(\mu - y)^2$$

- The minimum is at $\mu = \mathbb{E}[y]$.
- With n samples, the Bayes optimal estimator is the sample mean: $\hat{\mu}_n = \frac{1}{n} \sum_i y_i$.
- The error is:

$$\mathbb{E}[L(\hat{\mu}_n)] - L(\mathbb{E}[y]) = \frac{\sigma^2}{n}$$

  where $\sigma^2$ is the variance and the expectation is with respect to the n samples.
- How many samples do we need for $\epsilon$ error?
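The $\sigma^2 / n$ rate can be checked with a short simulation. This is a sketch under assumptions: the data are Gaussian, and the values of `mean`, `sigma`, `n`, and `trials` are arbitrary illustrative choices. It uses the identity $L(\mu) - L(\mathbb{E}[y]) = (\mu - \mathbb{E}[y])^2$, which holds for the square loss.

```python
import numpy as np

rng = np.random.default_rng(0)
mean, sigma, n, trials = 1.0, 2.0, 50, 20_000

# Each row is one dataset of n samples; each dataset gives one sample mean.
samples = rng.normal(loc=mean, scale=sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)

# Excess risk of the sample mean, averaged over the simulated datasets.
excess_risk = np.mean((mu_hat - mean) ** 2)
predicted = sigma**2 / n
```

Setting $\sigma^2 / n = \epsilon$ suggests that roughly $n = \sigma^2 / \epsilon$ samples are needed for $\epsilon$ error.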

Let's compare:

- SGD: is $O(1/\epsilon)$ reasonable?
- GD: is $\log(1/\epsilon)$ needed?
- SDCA/SVRG: these are also $\log(1/\epsilon)$ but much faster.

Statistical Optimality

Can we generalize as well as the sample minimizer, $w_n$ (without computing it exactly)?

For a wide class of models (linear regression, logistic regression, etc.), the estimation error is:

$$\mathbb{E}[L(w_n)] - L(w^*) = \frac{\sigma^2_{\mathrm{opt}}}{n}$$

where $\sigma^2_{\mathrm{opt}}$ is a problem dependent constant.

What is the computational cost of achieving exactly this rate, say for large n?

Averaged SGD

SGD:

$$w_{t+1} \leftarrow w_t - \eta_t \nabla \mathrm{loss}(h(x, w_t), y)$$

An (asymptotically) optimal algorithm:
- Have $\eta_t$ go to 0 (sufficiently slowly).
- (Iterate averaging) Maintain a running average:

$$\bar{w}_n = \frac{1}{n} \sum_{t \le n} w_t$$

(Polyak & Juditsky, 1992) For large enough n and with one pass of SGD over the dataset:

$$\mathbb{E}[L(\bar{w}_n)] - L(w^*) = \frac{\sigma^2_{\mathrm{opt}}}{n} \quad (n \to \infty)$$
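One pass of SGD with iterate averaging can be sketched as below. The setup is an assumption made for illustration: a well-specified linear regression with standard Gaussian features and unit noise, the loss 0.5 * (w.x - y)^2, and the step size `eta0 / sqrt(t)`; the function name `averaged_sgd` is hypothetical. In this particular setup the excess risk of a parameter w is 0.5 * ||w - w_true||^2, and the classical rate for the empirical risk minimizer is roughly d / (2n), which serves as a reference value here.

```python
import numpy as np

def averaged_sgd(X, y, eta0=0.1, power=0.5):
    """One pass of SGD on the square loss, keeping a running average of iterates."""
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(n):
        eta = eta0 / (t + 1) ** power             # step size goes to 0 slowly
        w = w - eta * (w @ X[t] - y[t]) * X[t]    # SGD step on example t
        w_bar += (w - w_bar) / (t + 1)            # average of w_1, ..., w_{t+1}
    return w, w_bar

# Synthetic, well-specified data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d = 20_000, 5
w_true = np.ones(d)
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(size=n)

w_last, w_avg = averaged_sgd(X, y)

# With identity feature covariance, L(w) - L(w_true) = 0.5 * ||w - w_true||^2.
excess_last = 0.5 * np.sum((w_last - w_true) ** 2)
excess_avg = 0.5 * np.sum((w_avg - w_true) ** 2)
reference = 0.5 * d / n       # approximate rate of the exact minimizer in this setup
```

Comparing `excess_avg` with `reference` gives a rough sense of how close a single averaged pass comes to the statistical rate; `excess_last` shows the last iterate without averaging.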

Acknowledgements

Some slides from "Large-scale machine learning revisited", Leon Bottou.
