Stochastic Optimization Methods
1  Stochastic Optimization Methods

Lecturer: Pradeep Ravikumar
Co-instructor: Aarti Singh
Convex Optimization
Adapted from slides from Ryan Tibshirani
2  Stochastic gradient descent

Consider minimizing a sum of functions

    min_x (1/n) Σ_{i=1}^n f_i(x)

Gradient descent applied to this problem would repeat

    x^(k) = x^(k-1) - t_k · (1/n) Σ_{i=1}^n ∇f_i(x^(k-1)),   k = 1, 2, 3, ...

In comparison, stochastic gradient descent (or incremental gradient descent) repeats

    x^(k) = x^(k-1) - t_k · ∇f_{i_k}(x^(k-1)),   k = 1, 2, 3, ...

where i_k ∈ {1, ..., n} is some chosen index at iteration k
3  Notes:

- Typically we make a (uniform) random choice i_k ∈ {1, ..., n}
- Also common: mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, ..., n} of size b ≪ n, and update according to

    x^(k) = x^(k-1) - t_k · (1/b) Σ_{i ∈ I_k} ∇f_i(x^(k-1)),   k = 1, 2, 3, ...

In both cases, we are approximating the full gradient by a noisy estimate, and our noisy estimate is unbiased:

    E[∇f_{i_k}(x)] = ∇f(x),   E[(1/b) Σ_{i ∈ I_k} ∇f_i(x)] = ∇f(x)

The mini-batch reduces the variance by a factor of 1/b, but is also b times more expensive!
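Both claims (unbiasedness, and the 1/b variance reduction) are easy to check numerically. A quick sketch on a made-up least-squares sum, with all sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, b = 1000, 3, 10
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
x = rng.standard_normal(p)

grads = (A @ x - y)[:, None] * A   # row i holds the gradient of f_i at x
full = grads.mean(axis=0)          # the exact (full) gradient

# Many single-index estimates, and many size-b mini-batch estimates
single = grads[rng.integers(n, size=100_000)]
batch = np.array([grads[rng.choice(n, size=b, replace=False)].mean(axis=0)
                  for _ in range(20_000)])

bias = single.mean(axis=0) - full                           # near zero: unbiased
ratio = batch.var(axis=0).sum() / single.var(axis=0).sum()  # near 1/b
```

(Sampling without replacement shaves the variance slightly below 1/b, by the finite-population factor (n-b)/(n-1).)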
4  Example: regularized logistic regression

Given labels y_i ∈ {0, 1} and features x_i ∈ R^p, i = 1, ..., n, consider logistic regression with ridge regularization:

    min_{β ∈ R^p} (1/n) Σ_{i=1}^n ( -y_i x_i^T β + log(1 + e^{x_i^T β}) ) + (λ/2) ||β||_2^2

Write the criterion as f(β) = (1/n) Σ_{i=1}^n f_i(β), where

    f_i(β) = -y_i x_i^T β + log(1 + e^{x_i^T β}) + (λ/2) ||β||_2^2

The gradient computation

    ∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) - y_i) x_i + λβ,   where p_i(β) = e^{x_i^T β} / (1 + e^{x_i^T β}),

is doable when n is moderate, but not when n is huge. Note that:

- One batch update costs O(np)
- One stochastic update costs O(p)
- One mini-batch update costs O(bp)
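A sketch of the per-sample gradient for this criterion on synthetic data (n, p, and λ are chosen arbitrarily), checking that the O(p) per-sample gradients average to the O(np) full gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 200, 20, 0.01
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)
beta = 0.1 * rng.standard_normal(p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_fi(beta, i):
    # gradient of f_i(beta) = -y_i x_i^T beta + log(1 + exp(x_i^T beta))
    #                         + (lam/2) ||beta||_2^2; costs O(p)
    return (sigmoid(X[i] @ beta) - y[i]) * X[i] + lam * beta

def full_grad(beta):
    # costs O(np): this is what the stochastic updates avoid
    return X.T @ (sigmoid(X @ beta) - y) / n + lam * beta

avg = np.mean([grad_fi(beta, i) for i in range(n)], axis=0)
```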
5  Example with n = 10,000, p = 20; all methods employ fixed step sizes (diminishing step sizes give roughly similar results):

[Figure: criterion value f_k versus iteration number k, comparing full gradient, stochastic gradient, and mini-batch (b = 10, b = 100) methods]
6  What is happening? Iterations make better progress as the mini-batch size b gets bigger. But now let's parametrize by flops:

[Figure: criterion value f_k versus flop count (1e+02 to 1e+06), comparing full gradient, stochastic gradient, and mini-batch (b = 10, b = 100) methods]
7  Convergence rates

Recall that, under suitable step sizes, when f is convex and has a Lipschitz gradient, full gradient (FG) descent satisfies

    f(x^(k)) - f* = O(1/k)

What about stochastic gradient (SG) descent? Under diminishing step sizes, when f is convex (plus other conditions),

    E[f(x^(k))] - f* = O(1/√k)

Finally, what about mini-batch stochastic gradient? Again, under diminishing step sizes, for f convex (plus other conditions),

    E[f(x^(k))] - f* = O(1/√(bk) + 1/k)

But each iteration here is b times more expensive... and (for small b), in terms of flops, this is the same rate
8  Back to our ridge logistic regression example: we gain important insight by looking at the suboptimality gap (on a log scale):

[Figure: criterion gap f_k - f* (log scale, 1e-12 to 1e-03) versus iteration number k, comparing full gradient, stochastic gradient, and mini-batch (b = 10, b = 100) methods]
9  Recall that, under suitable step sizes, when f is strongly convex with a Lipschitz gradient, gradient descent satisfies

    f(x^(k)) - f* = O(ρ^k)

where ρ < 1. But, under diminishing step sizes, when f is strongly convex (plus other conditions), stochastic gradient descent gives

    E[f(x^(k))] - f* = O(1/k)

So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.

For a while, this was believed to be inevitable, as Nemirovski and others had established matching lower bounds... but these applied to stochastic minimization of criteria of the form f(x) = ∫ F(x, ξ) dξ. Can we do better for finite sums?
10  Stochastic Gradient: Bias

For solving min_x f(x), stochastic gradient is actually a class of algorithms that use the iterates

    x^(k) = x^(k-1) - η_k g(x^(k-1); ξ_k),

where g(x^(k-1); ξ_k) is a stochastic gradient of the objective f(x) evaluated at x^(k-1).

Bias: the bias of the stochastic gradient is defined as

    bias(g(x^(k-1); ξ_k)) := E_{ξ_k}[g(x^(k-1); ξ_k)] - ∇f(x^(k-1))

Unbiased: when E_{ξ_k}[g(x^(k-1); ξ_k)] = ∇f(x^(k-1)), the stochastic gradient is said to be unbiased (e.g., the stochastic gradient scheme discussed so far).

Biased: we might also be interested in biased estimators, but where the bias is small, so that E_{ξ_k}[g(x^(k-1); ξ_k)] ≈ ∇f(x^(k-1)).
11  Stochastic Gradient: Variance

Variance: in addition to small (or zero) bias, we also want the variance of the estimator to be small:

    variance(g(x^(k-1); ξ_k)) := E_{ξ_k} ||g(x^(k-1); ξ_k) - E_{ξ_k}[g(x^(k-1); ξ_k)]||^2 ≤ E_{ξ_k} ||g(x^(k-1); ξ_k)||^2

The caveat with the stochastic gradient scheme we have seen so far is that its variance is large, and in particular doesn't decay to zero with the iteration index.

Loosely: because of the above, we have to decay the step size η_k to zero, which in turn means we can't take large steps, and hence the convergence rate is slow.

Can we get the variance to be small, and decay to zero with the iteration index?
12  Variance Reduction

Consider an estimator X for a parameter θ. Note that for an unbiased estimator, E(X) = θ. Now consider the following modified estimator:

    Z := X - Y,   where E(Y) ≈ 0

Then the bias of Z is also close to zero, since E(Z) = E(X) - E(Y) ≈ θ. If E(Y) = 0, then Z is unbiased iff X is unbiased.

What about the variance of the estimator Z?

    Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y)

This can be seen to be much less than Var(X) if Y is highly correlated with X.

Thus, given any estimator X, we can reduce its variance if we can construct a Y that (a) has expectation (close to) zero, and (b) is highly correlated with X. This is the abstract template followed by SAG, SAGA, SVRG, SDCA, ...
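The template is easy to verify numerically. Below, X is a noisy unbiased measurement of θ, and Y is a zero-mean variable built to be highly correlated with X; the distributions are invented purely for the demo:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0
N = 200_000

noise = rng.standard_normal(N)
X = theta + noise                          # E[X] = theta, Var(X) = 1
Y = noise + 0.1 * rng.standard_normal(N)   # E[Y] = 0, Cov(X, Y) = 1
Z = X - Y                                  # still unbiased, and
                                           # Var(Z) = Var(X) + Var(Y) - 2 Cov(X, Y)
```

Here Var(X) = 1, Var(Y) = 1.01, Cov(X, Y) = 1, so Var(Z) = 0.01: a hundredfold reduction with no bias introduced.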
13  Outline

Rest of today:
- Stochastic average gradient (SAG)
- SAGA (does this stand for something?)
- Stochastic Variance Reduced Gradient (SVRG)
- Many, many others
14  Stochastic average gradient

Stochastic average gradient or SAG (Schmidt, Le Roux, Bach 2013) is a breakthrough method in stochastic optimization. The idea is fairly simple:

- Maintain a table containing the gradient g_i of f_i, i = 1, ..., n
- Initialize x^(0), and g_i^(0) = ∇f_i(x^(0)), i = 1, ..., n
- At steps k = 1, 2, 3, ..., pick a random i_k ∈ {1, ..., n} and then let

    g_{i_k}^(k) = ∇f_{i_k}(x^(k-1))   (most recent gradient of f_{i_k})

  Set all other g_i^(k) = g_i^(k-1), i ≠ i_k, i.e., these stay the same
- Update

    x^(k) = x^(k-1) - t_k · (1/n) Σ_{i=1}^n g_i^(k)
15  Notes:

- The key of SAG is to allow each f_i, i = 1, ..., n to communicate a part of the gradient estimate at each step
- This basic idea can be traced back to incremental aggregated gradient (Blatt, Hero, Gauchman, 2006)
- SAG gradient estimates are no longer unbiased, but they have greatly reduced variance
- Isn't it expensive to average all these gradients? (Especially if n is huge?) This is basically just as efficient as stochastic gradient descent, as long as we're clever:

    x^(k) = x^(k-1) - t_k · ( g_{i_k}^(k)/n - g_{i_k}^(k-1)/n + (1/n) Σ_{i=1}^n g_i^(k-1) )

  where the last term is the old table average, and the whole quantity in parentheses is the new table average
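Putting the table and the cleverly maintained running average together gives a compact SAG sketch. This is on a synthetic smooth least-squares sum, with the step size and iteration count hand-picked for the example (not the theoretical 1/(16L) schedule):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
g = np.array([grad_fi(x, i) for i in range(n)])  # table of gradients g_i
g_avg = g.mean(axis=0)                           # running table average
t = 0.02
for k in range(20_000):
    i = rng.integers(n)              # random index i_k
    g_new = grad_fi(x, i)            # most recent gradient of f_i
    g_avg += (g_new - g[i]) / n      # refresh the average in O(p), not O(np)
    g[i] = g_new
    x = x - t * g_avg                # step along the table average
```

The `g_avg` update is the whole trick: replacing one table entry changes the average by (g_new - g[i])/n, so each step stays as cheap as plain SGD.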
16  SAG Variance Reduction

The stochastic gradient used by SAG, written here scaled by n (the SAG update applies an overall factor of 1/n):

    g_{i_k}^(k)  -  ( g_{i_k}^(k-1) - Σ_{i=1}^n g_i^(k-1) )
        [X]                       [Y]

It can be seen that E(X) = ∇f(x^(k-1)). But E(Y) ≠ 0, so we have a biased estimator. We do have that Y is correlated with X (in line with the variance reduction template).

In particular, we have that X - Y → 0 as k → ∞: since x^(k-1) and x^(k) converge to x*, the difference between the first two terms converges to zero, and the sum in the last term converges to n times the gradient at the optimum, i.e., also to zero. Thus the overall estimator's ℓ2 norm (and accordingly its variance) decays to zero.
17  SAG convergence analysis

Assume that f(x) = (1/n) Σ_{i=1}^n f_i(x), where each f_i is differentiable and ∇f_i is Lipschitz with constant L. Denote by x̄^(k) = (1/k) Σ_{l=0}^{k-1} x^(l) the average iterate after k-1 steps.

Theorem (Schmidt, Le Roux, Bach): SAG, with a fixed step size t = 1/(16L), and an initialization satisfying

    g_i^(0) = ∇f_i(x^(0)) - ∇f(x^(0)),   i = 1, ..., n,

satisfies

    E[f(x̄^(k))] - f* ≤ (48n/k) (f(x^(0)) - f*) + (128L/k) ||x^(0) - x*||_2^2

where the expectation is taken over the random choice of index at each iteration.
18  Notes:

- The result is stated in terms of the average iterate x̄^(k), but can also be shown to hold for the best iterate x_best^(k) seen so far
- This is an O(1/k) convergence rate for SAG. Compare to the O(1/k) rate for FG, and the O(1/√k) rate for SG
- But the constants are different! Bounds after k steps:

    SAG:  ( 48n (f(x^(0)) - f*) + 128L ||x^(0) - x*||_2^2 ) / k
    FG:   L ||x^(0) - x*||_2^2 / (2k)
    SG:   L√5 ||x^(0) - x*||_2 / (2√k)   (*not a real bound; loose translation)

- So the first term in the SAG bound suffers from a factor of n; the authors suggest a smarter initialization to make f(x^(0)) - f* small (e.g., they suggest using the result of n SG steps)
19  Convergence analysis under strong convexity

Assume further that each f_i is strongly convex with parameter m.

Theorem (Schmidt, Le Roux, Bach): SAG, with a step size t = 1/(16L) and the same initialization as before, satisfies

    E[f(x^(k))] - f* ≤ ( 1 - min{ m/(16L), 1/(8n) } )^k · ( (3/2) (f(x^(0)) - f*) + (4L/n) ||x^(0) - x*||_2^2 )

More notes:

- This is a linear convergence rate O(ρ^k) for SAG. Compare this to O(ρ^k) for FG, and only O(1/k) for SG
- Like FG, we say SAG is adaptive to strong convexity (it achieves the better rate with the same settings)
- Proofs of these results are not easy: 15 pages, computer-aided!
20  Back to our ridge logistic regression example, SG versus SAG, over 30 reruns of these randomized algorithms:

[Figure: criterion gap f_k - f* versus iteration number k, for SG and SAG]
21  SAG does well, but it did not work out of the box; it required a specific setup:

- Took one full cycle of SG (one pass over the data) to get β^(0), and then started SG and SAG both from β^(0). This warm start helped a lot
- SAG was initialized at g_i^(0) = ∇f_i(β^(0)), i = 1, ..., n, computed during the initial SG cycle. Centering these gradients was much worse (and so was initializing them at 0)
- Tuning the fixed step size for SAG was very finicky; here it was hand-tuned to be about as large as possible before it diverges
22  Experiments from Schmidt, Le Roux, Bach (each plot is a different problem setting):

[Figure 1 from the paper: comparison of different FG and SG optimization strategies. Nine panels plot objective minus optimum versus effective passes, comparing AFG, ASG, IAG, L-BFGS, SG, and SAG-LS across problem settings]
23  SAGA

SAGA (Defazio, Bach, Lacoste-Julien, 2014) is another recent stochastic method, similar in spirit to SAG. The idea is again simple:

- Maintain a table containing the gradient g_i of f_i, i = 1, ..., n
- Initialize x^(0), and g_i^(0) = ∇f_i(x^(0)), i = 1, ..., n
- At steps k = 1, 2, 3, ..., pick a random i_k ∈ {1, ..., n} and then let

    g_{i_k}^(k) = ∇f_{i_k}(x^(k-1))   (most recent gradient of f_{i_k})

  Set all other g_i^(k) = g_i^(k-1), i ≠ i_k, i.e., these stay the same
- Update

    x^(k) = x^(k-1) - t_k · ( g_{i_k}^(k) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1) )
24  Notes:

- SAGA gradient estimate:

    g_{i_k}^(k) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1)

  versus the SAG gradient estimate:

    (1/n) g_{i_k}^(k) - (1/n) g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1)

- Recall that the SAG estimate is biased; remarkably, the SAGA estimate is unbiased!
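The same table machinery gives SAGA; the only change from the SAG sketch is that the correction term g_new - g[i] enters the step without the 1/n scaling. Again a synthetic least-squares sum with a hand-picked step size:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
g = np.array([grad_fi(x, i) for i in range(n)])  # gradient table
g_avg = g.mean(axis=0)                           # running table average
t = 0.05
for k in range(20_000):
    i = rng.integers(n)
    g_new = grad_fi(x, i)
    x = x - t * (g_new - g[i] + g_avg)  # unbiased SAGA direction
    g_avg += (g_new - g[i]) / n         # then refresh the table and average
    g[i] = g_new
```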
25  SAGA Variance Reduction

The stochastic gradient in SAGA:

    g_{i_k}^(k)  -  ( g_{i_k}^(k-1) - (1/n) Σ_{i=1}^n g_i^(k-1) )
        [X]                        [Y]

It can be seen that E(X) = ∇f(x^(k-1)). And E(Y) = 0, so we have an unbiased estimator. Moreover, Y is correlated with X (in line with the variance reduction template).

In particular, we have that X - Y → 0 as k → ∞: since x^(k-1) and x^(k) converge to x*, the difference between the first two terms converges to zero, and the last term converges to the gradient at the optimum, i.e., also to zero. Thus the overall estimator's ℓ2 norm (and accordingly its variance) decays to zero.
26  SAGA basically matches the strong convergence rates of SAG (for both the Lipschitz gradient and strongly convex cases), but the proofs here are much simpler.

Another strength of SAGA is that it extends to composite problems of the form

    min_x (1/n) Σ_{i=1}^n f_i(x) + h(x)

where each f_i is smooth and convex, and h is convex and nonsmooth but has a known prox. The updates are now

    x^(k) = prox_{h, t_k}( x^(k-1) - t_k · ( g_{i_k}^(k) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1) ) )

It is not known whether SAG is generally convergent under such a scheme.
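The composite update is a one-line change to the SAGA sketch: apply prox_{h,t} after the step. Here h(x) = λ ||x||_1, whose prox is soft-thresholding; the data, λ, and step size are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 100, 5, 0.1
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # smooth part: f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def prox_l1(z, s):
    # prox of s * lam * ||.||_1, i.e. soft-thresholding at level s * lam
    return np.sign(z) * np.maximum(np.abs(z) - s * lam, 0.0)

x = np.zeros(p)
g = np.array([grad_fi(x, i) for i in range(n)])
g_avg = g.mean(axis=0)
t = 0.05
for k in range(20_000):
    i = rng.integers(n)
    g_new = grad_fi(x, i)
    x = prox_l1(x - t * (g_new - g[i] + g_avg), t)  # prox after the SAGA step
    g_avg += (g_new - g[i]) / n
    g[i] = g_new
```

A natural sanity check is the fixed-point condition of the composite problem: at the minimizer, x = prox_{h,t}(x - t ∇f(x)).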
27  Back to our ridge logistic regression example, now adding SAGA to the mix:

[Figure: criterion gap f_k - f* versus iteration number k, for SG, SAG, and SAGA]
28  SAGA does well, but again it required a somewhat specific setup:

- As before, took one full cycle of SG (one pass over the data) to get β^(0), and then started SG, SAG, and SAGA all from β^(0). This warm start helped a lot
- SAGA was initialized at g_i^(0) = ∇f_i(β^(0)), i = 1, ..., n, computed during the initial SG cycle. Centering these gradients was much worse (and so was initializing them at 0)
- Tuning the fixed step size for SAGA was fine; seemingly on par with tuning for SG, and more robust than tuning for SAG
- Interestingly, the SAGA criterion curves look like SG curves (realizations being jagged and highly variable); SAG looks very different, and this really emphasizes the fact that its updates have much lower variance
29  Stochastic Variance Reduced Gradient (SVRG)

The Stochastic Variance Reduced Gradient (SVRG) algorithm (Johnson, Zhang, 2013) runs in epochs:

- Initialize x̃^(0)
- For k = 1, 2, ...:
  - Set x̃ = x̃^(k-1). Compute μ := ∇f(x̃). Set x^(0) = x̃
  - For l = 1, ..., m:
    - Pick an index i_l at random from {1, ..., n}
    - Set

        x^(l) = x^(l-1) - η ( ∇f_{i_l}(x^(l-1)) - ∇f_{i_l}(x̃) + μ )

  - Set x̃^(k) = x^(m)
30  Stochastic Variance Reduced Gradient (SVRG)

Just like SAG/SAGA, but it does not store a full table of gradients, just an average, and updates this occasionally:

- SAGA: ∇f_{i_k}(x^(k-1)) - g_{i_k}^(k-1) + (1/n) Σ_{i=1}^n g_i^(k-1)
- SVRG: ∇f_{i_l}(x^(l-1)) - ∇f_{i_l}(x̃) + (1/n) Σ_{i=1}^n ∇f_i(x̃)

SVRG can be shown to achieve variance reduction similar to SAGA. Its convergence rates are similar to SAGA, but the formal analysis is much simpler.
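The epoch structure above transcribes almost line for line, again on a synthetic least-squares sum; η, m, and the number of epochs are hand-picked for the example:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(x, i):
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

x_tilde = np.zeros(p)
eta, m = 0.05, 2 * n
for k in range(50):                # outer epochs
    mu = full_grad(x_tilde)        # one full gradient per epoch, at the snapshot
    x = x_tilde.copy()
    for l in range(m):             # m cheap stochastic steps
        i = rng.integers(n)
        x = x - eta * (grad_fi(x, i) - grad_fi(x_tilde, i) + mu)
    x_tilde = x                    # new snapshot
```

Note the storage: only the snapshot x̃ and its full gradient μ are kept, rather than SAG/SAGA's n-row table, at the cost of one full-gradient pass per epoch.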
31  Many, many others

A lot of recent work is revisiting stochastic optimization:

- SDCA (Shalev-Shwartz, Zhang, 2013): applies randomized coordinate ascent to the dual of ridge-regularized problems. The effective primal updates are similar to SAG/SAGA
- There are also S2GD (Konecny, Richtarik, 2014), MISO (Mairal, 2013), Finito (Defazio, Caetano, Domke, 2014), etc.
32  Basic summary of method properties (from Defazio, Bach, Lacoste-Julien, 2014); question marks denote properties that are unproven, but not experimentally ruled out:

                          SAGA   SAG   SDCA    SVRG   FINITO
    Strongly Convex (SC)   ✓      ✓     ✓       ✓      ✓
    Convex, Non-SC*        ✓      ✓     ✗       ?      ?
    Prox Reg.              ✓      ?     ✓[6]    ✓      ✗
    Non-smooth             ✗      ✗     ✓       ✗      ✗
    Low Storage Cost       ✗      ✗     ✗       ✓      ✗
    Simple(-ish) Proof     ✓      ✗     ✓       ✓      ✓
    Adaptive to SC         ✓      ✓     ✗       ?      ?

Are we approaching optimality with these methods? Agarwal and Bottou (2014) recently proved nonmatching lower bounds for minimizing finite sums. This leaves three possibilities: (i) the algorithms we currently have are not optimal; (ii) the lower bounds can be tightened; or (iii) the upper bounds can be tightened. This is a very active area of research, and it will likely be sorted out soon.
33  References and further reading

- D. Bertsekas (2010), "Incremental gradient, subgradient, and proximal methods for convex optimization: a survey"
- A. Defazio, F. Bach, and S. Lacoste-Julien (2014), "SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives"
- R. Johnson and T. Zhang (2013), "Accelerating stochastic gradient descent using predictive variance reduction"
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009), "Robust stochastic approximation approach to stochastic programming"
- M. Schmidt, N. Le Roux, and F. Bach (2013), "Minimizing finite sums with the stochastic average gradient"
- S. Shalev-Shwartz and T. Zhang (2013), "Stochastic dual coordinate ascent methods for regularized loss minimization"
Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected
More informationChapter - 2. Distribution System Power Flow Analysis
Chapter - 2 Dstrbuton System Power Flow Analyss CHAPTER - 2 Radal Dstrbuton System Load Flow 2.1 Introducton Load flow s an mportant tool [66] for analyzng electrcal power system network performance. Load
More informationProblem Set 9 Solutions
Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16
STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus
More informationHidden Markov Models
Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,
More informationExcess Error, Approximation Error, and Estimation Error
E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple
More informationBoostrapaggregating (Bagging)
Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod
More informationLinear Feature Engineering 11
Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19
More informationEconomics 101. Lecture 4 - Equilibrium and Efficiency
Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of
More information3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X
Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number
More informationEdge Isoperimetric Inequalities
November 7, 2005 Ross M. Rchardson Edge Isopermetrc Inequaltes 1 Four Questons Recall that n the last lecture we looked at the problem of sopermetrc nequaltes n the hypercube, Q n. Our noton of boundary
More informationEstimation: Part 2. Chapter GREG estimation
Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationExpected Value and Variance
MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or
More informationVector Norms. Chapter 7 Iterative Techniques in Matrix Algebra. Cauchy-Bunyakovsky-Schwarz Inequality for Sums. Distances. Convergence.
Vector Norms Chapter 7 Iteratve Technques n Matrx Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematcs Unversty of Calforna, Berkeley Math 128B Numercal Analyss Defnton A vector norm
More informationarxiv: v1 [math.oc] 6 Jan 2016
arxv:1601.01174v1 [math.oc] 6 Jan 2016 THE SUPPORTING HALFSPACE - QUADRATIC PROGRAMMING STRATEGY FOR THE DUAL OF THE BEST APPROXIMATION PROBLEM C.H. JEFFREY PANG Abstract. We consder the best approxmaton
More informationEcon107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)
I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes
More informationOutline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique
Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationEcon Statistical Properties of the OLS estimator. Sanjaya DeSilva
Econ 39 - Statstcal Propertes of the OLS estmator Sanjaya DeSlva September, 008 1 Overvew Recall that the true regresson model s Y = β 0 + β 1 X + u (1) Applyng the OLS method to a sample of data, we estmate
More informationx = , so that calculated
Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to
More informationVariance-Reduced Stochastic Gradient Descent on Streaming Data
Varance-Reduced Stochastc Gradent Descent on Streamng Data Ellango Jothmurugesan Carnege Mellon Unversty ejothmu@cs.cmu.edu Phllp B. Gbbons Carnege Mellon Unversty gbbons@cs.cmu.edu Ashraf Tahmasb Iowa
More informationResearch Article. Almost Sure Convergence of Random Projected Proximal and Subgradient Algorithms for Distributed Nonsmooth Convex Optimization
To appear n Optmzaton Vol. 00, No. 00, Month 20XX, 1 27 Research Artcle Almost Sure Convergence of Random Projected Proxmal and Subgradent Algorthms for Dstrbuted Nonsmooth Convex Optmzaton Hdea Idua a
More informationLine Drawing and Clipping Week 1, Lecture 2
CS 43 Computer Graphcs I Lne Drawng and Clppng Week, Lecture 2 Davd Breen, Wllam Regl and Maxm Peysakhov Geometrc and Intellgent Computng Laboratory Department of Computer Scence Drexel Unversty http://gcl.mcs.drexel.edu
More informationConvergence rates of proximal gradient methods via the convex conjugate
Convergence rates of proxmal gradent methods va the convex conjugate Davd H Gutman Javer F Peña January 8, 018 Abstract We gve a novel proof of the O(1/ and O(1/ convergence rates of the proxmal gradent
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013
COS 511: heoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 15 Scrbe: Jemng Mao Aprl 1, 013 1 Bref revew 1.1 Learnng wth expert advce Last tme, we started to talk about learnng wth expert advce.
More informationCalculation of time complexity (3%)
Problem 1. (30%) Calculaton of tme complexty (3%) Gven n ctes, usng exhaust search to see every result takes O(n!). Calculaton of tme needed to solve the problem (2%) 40 ctes:40! dfferent tours 40 add
More information6) Derivatives, gradients and Hessian matrices
30C00300 Mathematcal Methods for Economsts (6 cr) 6) Dervatves, gradents and Hessan matrces Smon & Blume chapters: 14, 15 Sldes by: Tmo Kuosmanen 1 Outlne Defnton of dervatve functon Dervatve notatons
More informationLinear Regression Analysis: Terminology and Notation
ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented
More informationAdditional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty
Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,
More informationEconomics 130. Lecture 4 Simple Linear Regression Continued
Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do
More information