Lecture 3: Minimizing Large Sums. Peter Richtárik
1 Lecture 3: Minimizing Large Sums. Peter Richtárik. Graduate School in Systems, Control and Networks, Belgium, 2015.
2 Machine Learning & Empirical Risk
3 Training Linear Predictors
4 Nature of Data. Data (e.g., image, text, measurements, ...): $A_i \in \mathbb{R}^{d \times m}$. Label: $y_i \in \mathbb{R}^m$. Each pair $(A_i, y_i)$ is drawn from an (unknown) Distribution.
5 Prediction of Labels from Data. Find a linear predictor $w \in \mathbb{R}^d$ such that when a (data, label) pair $(A_i, y_i)$ is drawn from the Distribution, then $A_i^\top w \approx y_i$ (predicted label $\approx$ true label).
6 Measure of Success. Given a loss function $\mathrm{loss}(a, b)$ comparing a prediction $a$ (from data) with a label $b$, we want the expected loss (= risk) to be small: $\mathbb{E}_{(A_i, y_i) \sim \text{Distribution}}\,[\mathrm{loss}(A_i^\top w, y_i)]$.
7 Finding a Linear Predictor via Empirical Risk Minimization (ERM). Draw i.i.d. data (samples) from the distribution: $(A_1, y_1), (A_2, y_2), \dots, (A_n, y_n) \sim$ Distribution. Output the predictor which minimizes the empirical risk: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \mathrm{loss}(A_i^\top w, y_i)$.
8 Primal and Dual Problems
9 Primal Problem: ERM. $\min_{w \in \mathbb{R}^d} \Big[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \Big]$, where $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth and convex loss functions, $\lambda$ is the regularization parameter, $g$ is a 1-strongly convex function (regularizer), $d$ = # features (parameters), $n$ = # samples, and $A_i \in \mathbb{R}^{d \times m}$.
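As a concrete illustration (added here, not on the slide), the sketch below evaluates $P(w)$ in the common special case $m = 1$, binary logistic loss and $g(w) = \tfrac12\|w\|^2$; the data and function names are made up.

```python
# A minimal sketch (assumed special case: m = 1, logistic loss, g(w) = ||w||^2 / 2).
import numpy as np

def primal_objective(w, A, y, lam):
    """P(w) = (1/n) sum_i phi_i(A_i^T w) + lam * g(w), with phi_i the logistic loss."""
    margins = y * (A.T @ w)                      # A has the examples A_i as columns, y_i in {-1, +1}
    losses = np.log1p(np.exp(-margins))          # phi_i(A_i^T w) = log(1 + exp(-y_i A_i^T w))
    return losses.mean() + lam * 0.5 * (w @ w)   # lam * g(w) with g(w) = ||w||^2 / 2

rng = np.random.default_rng(0)
A, y = rng.standard_normal((10, 100)), rng.choice([-1.0, 1.0], size=100)
print(primal_objective(np.zeros(10), A, y, lam=1e-2))   # equals log(2) at w = 0
```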
10 Is the difficulty in n or d?
- Big n, work in the primal: process one loss function (= one example) at a time. Type of methods: stochastic gradient descent (modern variants: SAG, SVRG, S2GD, mS2GD, SAGA, S2CD, MISO, FINITO, ...).
- Big d, work in the primal: process one primal variable at a time. Type of methods: randomized coordinate descent (e.g., Hydra, Hydra2).
- Big n, work in the dual: process one dual variable (= one example) at a time. Type of methods: randomized coordinate descent applied to the dual (modern variants: RCDM, PCDM, Shotgun, SDCA, APPROX, Quartz, ALPHA, SDNA, SPDC, ASDCA, ...). E.g., SDCA = run coordinate descent on the dual problem.
11 Dual Problem. $\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} D(\alpha) \equiv -\lambda\, g^*\!\Big(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)$, with $\alpha_i \in \mathbb{R}^m$. Here $g^*(w') = \max_{w \in \mathbb{R}^d} (w')^\top w - g(w)$ is 1-smooth and convex (since $g$ is 1-strongly convex), and $\phi_i^*(a') = \max_{a \in \mathbb{R}^m} (a')^\top a - \phi_i(a)$ is $\gamma$-strongly convex (since $\phi_i$ is $1/\gamma$-smooth).
12 An Efficient Dual Method Zheng Qu, P.R. and Tong Zhang Randomized dual coordinate ascent with arbitrary sampling In NIPS 2015 (arxiv: )
13 Empirical Risk
14 Primal Problem: ERM. $\min_{w \in \mathbb{R}^d} \Big[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \Big]$: the $\phi_i$ are $1/\gamma$-smooth and convex, $\lambda$ is the regularization parameter, $g$ is a 1-strongly convex function (regularizer), $d$ = # features (parameters), $n$ = # samples.
15 Assumption 1: The loss functions $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth: $\|\nabla \phi_i(a) - \nabla \phi_i(a')\| \le \frac{1}{\gamma} \|a - a'\|$ for all $a, a' \in \mathbb{R}^m$; $1/\gamma$ is the Lipschitz constant of the gradient of the function.
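For instance (an illustration added here, not on the slide), the binary logistic loss $\phi(a) = \log(1 + e^{-ya})$ with $y \in \{-1, +1\}$ is $1/\gamma$-smooth with $\gamma = 4$, since $|\phi''(a)| \le 1/4$; a quick numerical check:

```python
# Numerical check (illustrative) that the gradient of the logistic loss is 1/4-Lipschitz.
import numpy as np

def phi_grad(a, y=1.0):
    return -y / (1.0 + np.exp(y * a))   # derivative of log(1 + exp(-y a))

a = np.linspace(-5.0, 5.0, 1001)
ratios = np.abs(np.diff(phi_grad(a))) / np.diff(a)
print(ratios.max() <= 0.25 + 1e-9)      # True: empirical Lipschitz constant <= 1/4 = 1/gamma
```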
16 Assumption 2: The regularizer $g$ is 1-strongly convex: $g(w) \ge g(w') + \langle \nabla g(w'), w - w' \rangle + \frac{1}{2} \|w - w'\|^2$ for all $w, w' \in \mathbb{R}^d$ (where $\nabla g(w')$ may be any subgradient). For example, $g(w) = \frac{1}{2}\|w\|^2$ satisfies this with equality.
17 Dual Problem. $\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} D(\alpha) \equiv -\lambda\, g^*\!\Big(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)$, with $\alpha_i \in \mathbb{R}^m$; $g^*(w') = \max_{w \in \mathbb{R}^d} (w')^\top w - g(w)$ (1-smooth and convex), $\phi_i^*(a') = \max_{a \in \mathbb{R}^m} (a')^\top a - \phi_i(a)$ ($\gamma$-strongly convex).
18 The Algorithm
19 Quartz
20 Fenchel Duality. Let $\bar\alpha = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i$. Then
$P(w) - D(\alpha) = \lambda \big(g(w) + g^*(\bar\alpha)\big) + \frac{1}{n} \sum_{i=1}^n \big(\phi_i(A_i^\top w) + \phi_i^*(-\alpha_i)\big) = \lambda \big(g(w) + g^*(\bar\alpha) - \langle w, \bar\alpha \rangle\big) + \frac{1}{n} \sum_{i=1}^n \big(\phi_i(A_i^\top w) + \phi_i^*(-\alpha_i) + \langle A_i^\top w, \alpha_i \rangle\big)$.
Both bracketed terms are $\ge 0$ (weak duality). Optimality conditions: $w = \nabla g^*(\bar\alpha)$ and $\alpha_i = -\nabla \phi_i(A_i^\top w)$.
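As a concrete instance (added for illustration; the slide keeps $g$ and $\phi_i$ generic), the conjugates are explicit for the squared loss and the L2 regularizer, which makes the duality gap above directly computable:

$$\phi_i(a) = \tfrac{1}{2}(a - y_i)^2 \;\Rightarrow\; \phi_i^*(u) = \tfrac{1}{2}u^2 + u\,y_i, \qquad g(w) = \tfrac{1}{2}\|w\|^2 \;\Rightarrow\; g^*(w') = \tfrac{1}{2}\|w'\|^2, \;\; \nabla g^*(\bar\alpha) = \bar\alpha.$$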
21 The Algorithm: $(\alpha^t, w^t) \Rightarrow (\alpha^{t+1}, w^{t+1})$.
22 Quartz: Bird's Eye View.
STEP 1 (PRIMAL UPDATE): $w^{t+1} \leftarrow (1 - \theta) w^t + \theta\, \nabla g^*(\bar\alpha^t)$.
STEP 2 (DUAL UPDATE): Choose a random set $S_t$ of dual variables; for $i \in S_t$ do $\alpha_i^{t+1} \leftarrow \big(1 - \frac{\theta}{p_i}\big) \alpha_i^t + \frac{\theta}{p_i} \big({-\nabla \phi_i(A_i^\top w^{t+1})}\big)$, where $p_i = \mathbb{P}(i \in S_t)$.
23 The Algorithm (annotations). STEP 1 is a convex combination; $\theta$ is a constant step parameter. STEP 2 updates the chosen dual variables and then just maintains the relationship $\bar\alpha^{t+1} = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^{t+1}$.
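A minimal sketch of Quartz (serial uniform sampling, $|S_t| = 1$, $p_i = 1/n$), specialized to the squared loss and $g(w) = \tfrac12\|w\|^2$ as in the worked conjugates above, so that $\nabla g^*(\bar\alpha) = \bar\alpha$ and $\gamma = 1$. This is an illustrative implementation under those assumptions, not the lecture's own code; the synthetic data is made up.

```python
import numpy as np

def quartz_ridge(A, y, lam, epochs=30, seed=0):
    """Quartz sketch for phi_i(a) = (a - y_i)^2 / 2, g(w) = ||w||^2 / 2, serial uniform sampling.
    A: d x n matrix whose columns are the examples A_i; y: labels in R^n."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    v = (A ** 2).sum(axis=0)                                 # v_i = ||A_i||^2 (ESO parameters, serial sampling)
    theta = np.min((1.0 / n) * lam * n / (v + lam * n))      # theta = min_i p_i*lam*gamma*n/(v_i + lam*gamma*n)
    alpha = np.zeros(n)                                      # dual variables
    abar = A @ alpha / (lam * n)                             # abar = (1/(lam n)) sum_i A_i alpha_i
    w = abar.copy()
    for t in range(epochs * n):
        # STEP 1: primal update  w <- (1-theta) w + theta * grad g*(abar) = (1-theta) w + theta * abar
        w = (1 - theta) * w + theta * abar
        # STEP 2: dual update on one random coordinate i; p_i = 1/n so theta/p_i = theta*n
        i = rng.integers(n)
        new_ai = (1 - theta * n) * alpha[i] + theta * n * (-(A[:, i] @ w - y[i]))
        abar += A[:, i] * (new_ai - alpha[i]) / (lam * n)    # maintain abar cheaply
        alpha[i] = new_ai
    return w, alpha

def gap(A, y, lam, w, alpha):
    """Duality gap P(w) - D(alpha) for this loss/regularizer pair."""
    n = A.shape[1]
    residuals = A.T @ w - y
    P = 0.5 * np.mean(residuals ** 2) + 0.5 * lam * (w @ w)
    abar = A @ alpha / (lam * n)
    # phi_i^*(-alpha_i) = alpha_i^2 / 2 - alpha_i y_i for the squared loss
    D = -0.5 * lam * (abar @ abar) - np.mean(0.5 * alpha ** 2 - alpha * y)
    return P - D

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, y = rng.standard_normal((50, 200)), rng.standard_normal(200)
    w, alpha = quartz_ridge(A, y, lam=0.1, epochs=50)
    print("duality gap:", gap(A, y, 0.1, w, alpha))
```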
24 Other Dual Methods for ERM
25 Randomized Dual Coordinate Ascent Methods for ERM. [Table indicating, for each algorithm, which samplings it handles (1-nice, 1-optimal, τ-nice, arbitrary) and which features it has (additional speedup for sparse data, direct primal-dual analysis, acceleration).] Algorithms and references:
- SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
- mSDCA: M. Takáč, A. Bijral, P. R. & N. Srebro, 03/2013
- ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
- AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
- DisDCA: T. Yang, 2013
- Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
- APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
- SPDC: Y. Zhang & L. Xiao, 09/2014
- Quartz: Z. Qu, P. R. & T. Zhang, 11/2014
26 Complexity
27 Assumption 3 (Expected Separable Overapproximation). Parameters $v_1, \dots, v_n$ satisfy $\mathbb{E}\,\big\| \sum_{i \in S_t} A_i \alpha_i \big\|^2 \le \sum_{i=1}^n p_i v_i \|\alpha_i\|^2$, where $p_i = \mathbb{P}(i \in S_t)$; the inequality must hold for all $\alpha_1, \dots, \alpha_n \in \mathbb{R}^m$.
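A quick numerical check (illustrative, not from the slides) that the ESO inequality holds for serial uniform sampling $S_t = \{i\}$ with $p_i = 1/n$ and $v_i = \lambda_{\max}(A_i^\top A_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 3, 50
A = [rng.standard_normal((d, m)) for _ in range(n)]          # blocks A_i in R^{d x m}
h = [rng.standard_normal(m) for _ in range(n)]                # arbitrary vectors h_1, ..., h_n in R^m
v = [np.linalg.eigvalsh(Ai.T @ Ai).max() for Ai in A]         # v_i = lambda_max(A_i^T A_i)

lhs = np.mean([np.linalg.norm(A[i] @ h[i]) ** 2 for i in range(n)])        # E || sum_{i in S} A_i h_i ||^2
rhs = sum((1.0 / n) * v[i] * np.linalg.norm(h[i]) ** 2 for i in range(n))  # sum_i p_i v_i ||h_i||^2
print(lhs <= rhs + 1e-12)   # True
```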
28 Complexity Theorem [Qu, R. & Zhang '14]. Let $\theta = \min_i \frac{p_i \lambda \gamma n}{v_i + \lambda \gamma n}$. Then $\mathbb{E}[P(w^t) - D(\alpha^t)] \le (1 - \theta)^t \big(P(w^0) - D(\alpha^0)\big)$, so $t \ge \max_i \Big(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\Big) \log\Big(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\Big)$ implies $\mathbb{E}\big[P(w^t) - D(\alpha^t)\big] \le \epsilon$. Example: with normalized data ($v_i = \lambda_{\max}(A_i^\top A_i) \le 1$) and serial uniform sampling ($|S_t| = 1$, $p_i = 1/n$), $\theta$ is on the order of $1/n$, so the expected gap shrinks by a constant factor (e.g., below $1/10$ of its initial value) after a small multiple of $n$ iterations.
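A tiny helper (an illustration added here, not from the slides) that evaluates the quantities in the theorem: the rate parameter $\theta$ and the resulting iteration count.

```python
# Illustrative helper: theta = min_i p_i*lam*gamma*n/(v_i + lam*gamma*n), and the number of
# iterations t with (1 - theta)^t <= eps, i.e. t >= log(1/eps)/theta, which equals
# max_i (1/p_i + v_i/(p_i*lam*gamma*n)) * log(1/eps).
# (Read eps here as the target accuracy divided by the initial gap P(w^0) - D(alpha^0).)
import numpy as np

def quartz_rate_and_iterations(v, p, lam, gamma, eps):
    v, p = np.asarray(v, float), np.asarray(p, float)
    n = len(v)
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))
    return theta, np.ceil(np.log(1.0 / eps) / theta)

# serial uniform sampling on normalized data: v_i <= 1, p_i = 1/n
n = 10**5
theta, t = quartz_rate_and_iterations(np.ones(n), np.full(n, 1.0 / n), lam=1e-4, gamma=1.0, eps=1e-6)
print(theta, t)
```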
29 One Dual Variable at a Time
30 Complexity of Quartz specialized to serial sampling. Optimal sampling: $n + \frac{1}{\lambda \gamma} \cdot \frac{1}{n} \sum_{i=1}^n L_i$. Uniform sampling: $n + \frac{\max_i L_i}{\lambda \gamma}$. Here $L_i = \lambda_{\max}(A_i^\top A_i)$.
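A small comparison of the two bounds on synthetic data (a sketch added here, not from the slides; it assumes $m = 1$, so that $L_i = \lambda_{\max}(A_i^\top A_i) = \|A_i\|^2$):

```python
# Compare the leading factors of the two serial-sampling bounds from the slide:
# optimal sampling: n + mean(L_i)/(lam*gamma); uniform sampling: n + max(L_i)/(lam*gamma).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, gamma = 10_000, 100, 1e-4, 1.0
A = rng.standard_normal((d, n)) * rng.uniform(0.1, 3.0, size=n)   # columns with uneven norms
L = (A ** 2).sum(axis=0)                                          # L_i = ||A_i||^2 (m = 1)

bound_optimal = n + L.mean() / (lam * gamma)
bound_uniform = n + L.max() / (lam * gamma)
print(bound_optimal, bound_uniform)   # optimal (importance) sampling replaces max_i L_i by the average
```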
31 Experiment: Quartz vs SDCA, uniform vs optimal sampling. [Plot of the primal-dual gap against the number of epochs for Prox-SDCA, Quartz-U (10θ), Iprox-SDCA and Quartz-IP (10θ).] Data = cov1, n = 522,911, $\lambda = 10^{-6}$.
32 An Efficient Primal Method. S. Shalev-Shwartz. SDCA without Duality. NIPS 2015 (arXiv). Dominik Csiba and P. R. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv, 2015.
33 Empirical Risk
34 Primal Problem: ERM. $\min_{w \in \mathbb{R}^d} \Big[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \frac{\lambda}{2} \|w\|_2^2 \Big]$, where $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth and convex, $\lambda$ is the regularization parameter, $d$ = # features (parameters), $n$ = # samples, $A_i \in \mathbb{R}^{d \times m}$. (We had a general 1-strongly convex function $g$ here before.)
35 The loss functions $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth: $\|\nabla \phi_i(a) - \nabla \phi_i(a')\| \le \frac{1}{\gamma} \|a - a'\|$ for all $a, a' \in \mathbb{R}^m$; $1/\gamma$ is the Lipschitz constant of the gradient of the function.
36 Dual Problem. $\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} D(\alpha) \equiv -\lambda\, g^*\!\Big(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)$, with $g^*(w') = \max_{w \in \mathbb{R}^d} (w')^\top w - g(w)$ (1-smooth and convex) and $\phi_i^*(a') = \max_{a \in \mathbb{R}^m} (a')^\top a - \phi_i(a)$ ($\gamma$-strongly convex). Goal: an efficient algorithm which naturally operates in the primal space (i.e., on the primal problem) only. The method will have the same theoretical guarantee as Quartz. The computer lab will be based on this.
37 6.2 The Algorithm
38 I. If $w^*$ is optimal, then $0 = \nabla P(w^*) = \frac{1}{n} \sum_{i=1}^n A_i \nabla \phi_i(A_i^\top w^*) + \lambda w^*$, hence $w^* = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^*$ where $\alpha_i^* := -\nabla \phi_i(A_i^\top w^*)$.
39 II. Algorithmic ideas: (1) simultaneously search for both $w^*$ and $\alpha_1^*, \dots, \alpha_n^*$; (2) try to do something like $\alpha_i^{t+1} \leftarrow -\nabla \phi_i(A_i^\top w^t)$ (does not quite work: too greedy); (3) maintain the relationship $w^t = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^t$.
40 The Algorithm: dfSDCA.
STEP 0 (INITIALIZE): Choose $\alpha_1^0, \dots, \alpha_n^0 \in \mathbb{R}^m$ and initialize the relationship: $w^0 = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^0$.
STEP 1 (DUAL UPDATE): Choose a random set $S_t$ of dual variables; for $i \in S_t$ do $\alpha_i^{t+1} \leftarrow \big(1 - \frac{\theta}{p_i}\big) \alpha_i^t + \frac{\theta}{p_i} \big({-\nabla \phi_i(A_i^\top w^t)}\big)$ (controlling greed by taking a convex combination), where $p_i = \mathbb{P}(i \in S_t)$ and $\theta = \min_i \frac{p_i \lambda \gamma n}{v_i + \lambda \gamma n}$.
STEP 2 (PRIMAL UPDATE): $w^{t+1} \leftarrow w^t - \sum_{i \in S_t} \frac{\theta}{p_i \lambda n} A_i \big(\nabla \phi_i(A_i^\top w^t) + \alpha_i^t\big)$; this is just maintaining the relationship $w^{t+1} = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^{t+1}$.
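A minimal sketch of dfSDCA with serial uniform sampling (assumptions: $m = 1$, binary logistic loss so $\gamma = 4$, regularizer $\tfrac{\lambda}{2}\|w\|^2$); the data and names are illustrative, not the lecture's own code. Note that only gradients of $\phi_i$ are needed, no conjugate functions.

```python
import numpy as np

def dfsdca_logistic(A, y, lam, gamma=4.0, epochs=30, seed=0):
    """dfSDCA sketch. A: d x n data (columns A_i); y: labels in {-1, +1}; logistic loss is 1/4-smooth."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    v = (A ** 2).sum(axis=0)                               # v_i = ||A_i||^2 (ESO, serial sampling)
    theta = np.min(lam * gamma / (v + lam * gamma * n))    # theta = min_i p_i*lam*gamma*n/(v_i + lam*gamma*n), p_i = 1/n
    alpha = np.zeros(n)                                    # "dual-like" auxiliary variables
    w = A @ alpha / (lam * n)                              # initialize w = (1/(lam n)) sum_i A_i alpha_i
    for t in range(epochs * n):
        i = rng.integers(n)
        margin = y[i] * (A[:, i] @ w)
        grad_i = -y[i] / (1.0 + np.exp(margin))            # grad phi_i(A_i^T w)
        # STEP 1: dual update  alpha_i <- (1 - theta/p_i) alpha_i + (theta/p_i)(-grad_i), theta/p_i = theta*n
        delta = (theta * n) * (-grad_i - alpha[i])
        alpha[i] += delta
        # STEP 2: primal update, maintaining w = (1/(lam n)) sum_i A_i alpha_i
        w += A[:, i] * delta / (lam * n)
    return w
```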
41 Complexity
42 ESO (same as before!). Parameters $v_1, \dots, v_n$ satisfy $\mathbb{E}\,\big\| \sum_{i \in S_t} A_i \alpha_i \big\|^2 \le \sum_{i=1}^n p_i v_i \|\alpha_i\|^2$, where $p_i = \mathbb{P}(i \in S_t)$; the inequality must hold for all $\alpha_1, \dots, \alpha_n \in \mathbb{R}^m$.
43 Complexity Theorem [Csiba & R. '15]. If $t \ge \max_i \Big(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\Big) \log\Big(\frac{C}{\epsilon}\Big)$, where $p_i = \mathbb{P}(i \in S_t)$ and $C$ is a constant depending on $P$, $w^0$, $\alpha_i^0$, $w^*$, $\alpha_i^*$, then $\mathbb{E}\big[P(w^t) - P(w^*)\big] \le \epsilon$.
44 Experiments
45 Some More Efficient Primal Methods for ERM: SAG, SVRG and S2GD.
SAG: Stochastic Average Gradient. N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. NIPS, 2012.
SVRG: Stochastic Variance Reduced Gradient. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS.
S2GD: Semi-Stochastic Gradient Descent. J. Konečný and P. R. Semi-stochastic gradient descent methods. arXiv, 2013.
46 Modern Methods for ERM vs SGD. Dataset: rcv1 (n = 20,241; d = 47,232). [Plot; the compared methods include dfSDCA.]
47 Behavior of dfSDCA for various parameters. Dataset: rcv1 (n = 20,241; d = 47,232). [Plot.]
48 (Minibatching)
49 NAIVE APPROACH
50-54 Failure of the naive approach. [Sequence of illustrations with points labeled 0, 1a, 1b, 1, 2a, 2b, 2.]
55 Idea: averaging updates may help. [Illustration.]
56-57 Averaging can be bad too!!! [Illustrations; the point we WANT is marked.]
58 Experiment with a 1 billion-by-2 billion LASSO problem. P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2015 (arXiv).
59 Coordinate Updates. [Plot.]
60 [Plots.]
61 Wall Time. [Plot.]
62 Minibatching & Quartz [Qu, R & Zhang 14]
63 Data Sparsity. $\omega$ is a normalized measure of average sparsity of the data, with $1 \le \omega \le n$: $\omega = 1$ for fully sparse data and $\omega = n$ for fully dense data.
64 Complexity of Quartz [Qu, R. & Zhang '14] with minibatch size $\tau$, writing $T(\tau)$ for the iteration bound:
- Fully sparse data ($\omega = 1$): $\frac{n}{\tau} + \frac{\max_i L_i}{\lambda \gamma \tau}$.
- Fully dense data ($\omega = n$): $\frac{n}{\tau} + \frac{\max_i L_i}{\lambda \gamma}$.
- Any data ($1 \le \omega \le n$): $T(\tau) = \frac{n}{\tau} + \Big(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\Big) \frac{\max_i L_i}{\lambda \gamma \tau}$.
65 Speedup. Assume the data is normalized: $L_i = \lambda_{\max}(A_i^\top A_i) \le 1$. Then $T(\tau) = \frac{1}{\tau}\Big(1 + \frac{(\omega - 1)(\tau - 1)}{(n - 1)(1 + \lambda \gamma n)}\Big) T(1)$. Linear speedup up to a certain data-independent minibatch size: $\tau \le 2 + \lambda \gamma n \;\Rightarrow\; T(\tau) \le \frac{2\, T(1)}{\tau}$. Further data-dependent speedup, up to the extreme case: $\omega = O(\lambda \gamma n) \;\Rightarrow\; T(\tau) = O\big(\frac{T(1)}{\tau}\big)$.
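The speedup formula from this slide, evaluated numerically (an illustrative sketch added here; the parameter values are made up):

```python
# Evaluate T(tau)/T(1) = (1/tau) * (1 + (omega-1)*(tau-1)/((n-1)*(1 + lam*gamma*n)))
# and report the speedup factor T(1)/T(tau), for normalized data (L_i <= 1).
import numpy as np

def quartz_speedup(tau, n, omega, lam, gamma=1.0):
    ratio = (1.0 + (omega - 1.0) * (tau - 1.0) / ((n - 1.0) * (1.0 + lam * gamma * n))) / tau
    return 1.0 / ratio            # speedup factor T(1)/T(tau)

n = 10**6
for omega in (10**2, 10**4, 10**6):   # sparse, denser, fully dense data
    print(omega, [round(quartz_speedup(tau, n, omega, lam=1e-4), 1) for tau in (1, 10, 100, 1000)])
```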
66 Quartz: Parallelization Speedup [Qu, R. & Zhang '14]. Setup: # examples $n = 10^6$; smoothness of loss functions $\gamma = 1$; regularization ranging from high ($\lambda = 1/\sqrt{n}$) to low ($\lambda = 1/n$). [Three panels plot the speedup factor against $\tau$ for $\lambda = 10^{-3}, 10^{-4}, 10^{-6}$: Sparse Data ($\omega = 10^2$), Denser Data ($\omega = 10^4$), Fully Dense Data ($\omega = 10^6$).]
67 astro_ph: n = 29,882, density = 0.08%. [Plot of the speedup factor T(1,1)/T(1,τ) against τ, comparing practice and theory for λ = 5.8e-3, 1e-3 and 1e-4.]
68 CCAT: n = 781,265, density = 0.16%. [Plot of the speedup factor T(1,1)/T(1,τ) against τ, comparing practice and theory for λ = 1.1e-3, 1e-4 and 1e-5.]
69 Primal-dual methods with τ-nice sampling (iteration complexity; admissible regularizer $g$):
- SDCA [S. Shalev-Shwartz & Zhang '12] (serial baseline; assumes $L_i = 1$): $n + \frac{1}{\lambda\gamma}$; $g = \frac{1}{2}\|\cdot\|^2$.
- ASDCA [S. Shalev-Shwartz & Zhang '13a]: $4 \max\Big(\frac{n}{\tau}, \sqrt{\frac{n}{\lambda\gamma\tau}}, \frac{1}{\lambda\gamma\tau}, \frac{n^{1/3}}{(\lambda\gamma\tau)^{2/3}}\Big)$; $g = \frac{1}{2}\|\cdot\|^2$.
- SPDC [Zhang & Xiao '14]: $\frac{n}{\tau} + \sqrt{\frac{n}{\lambda\gamma\tau}}$; general $g$.
- Quartz: $\frac{n}{\tau} + \Big(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\Big)\frac{1}{\lambda\gamma\tau}$; general $g$.
70 For sufficiently sparse data, Quartz wins even when compared against accelerated methods. The slide compares the methods across four regularization regimes, $\lambda\gamma n = \Theta(1/\tau)$, $\Theta(1)$, $\Theta(\tau)$ and $\Theta(\sqrt{n})$ (equivalently, $\kappa := \frac{1}{\lambda\gamma} = \tau n$, $n$, $n/\tau$ and $\sqrt{n}$): SDCA's bound stays of order $n$ (up to a factor $\tau$ in the most ill-conditioned regime); the accelerated methods ASDCA and SPDC have bounds featuring terms such as $n\sqrt{\tau}$ and $n + n^{3/4}\sqrt{\tau}$; Quartz's bounds feature $\omega$-dependent terms such as $n + \omega\tau$, so when the sparsity measure $\omega$ is small, Quartz has the best bound of the three.
71 THE END