Mini-Batch Primal and Dual Methods for SVMs
Mini-Batch Primal and Dual Methods for SVMs

Peter Richtárik
School of Mathematics, The University of Edinburgh

Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago)

arXiv preprint

Fête Parisienne in Computation, Inference and Optimization, March 20, 2013
References

- Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. arXiv, 2012.
- Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. ICML 2007.
- Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv, 2012.
- Martin Takáč, Avleen Bijral, Peter Richtárik and Nathan Srebro. Mini-batch primal and dual methods for SVMs. arXiv:1303, 2013.
Support Vector Machine

[Figure: separating hyperplane $\langle w, x\rangle - b = 0$ with margin hyperplanes $\langle w, x\rangle - b = 1$ and $\langle w, x\rangle - b = -1$.]
Family Support Machine

[Figure.]
PART I: Stochastic Gradient Descent (SGD)
SVM: Primal Problem

Data: $\{(x_i, y_i) \in \mathbb{R}^d \times \{+1,-1\} : i \in S \stackrel{\text{def}}{=} \{1,2,\dots,n\}\}$
- Examples: $x_1, \dots, x_n$ (assumption: $\max_i \|x_i\|^2 \le 1$)
- Labels: $y_i \in \{+1, -1\}$

Optimization formulation of SVM:
$$\min_{w \in \mathbb{R}^d} \Big\{ P_S(w) \stackrel{\text{def}}{=} \frac{\lambda}{2}\|w\|^2 + \hat{L}_S(w) \Big\}, \qquad (P)$$
where
$$\hat{L}_A(w) \stackrel{\text{def}}{=} \frac{1}{|A|}\sum_{i\in A} \ell(y_i \langle w, x_i\rangle) \quad \text{(average hinge loss on the examples in } A\text{)},$$
$$\ell(\zeta) \stackrel{\text{def}}{=} \max\{0, 1-\zeta\} \quad \text{(hinge loss)}.$$
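As a concrete reference point, the primal objective above takes a few lines to evaluate. A minimal NumPy sketch (the function name and array layout are our own):

```python
import numpy as np

def primal_obj(w, X, y, lam):
    """P_S(w) = (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * <w, x_i>).

    X holds one example x_i per row; y holds labels in {-1, +1}."""
    margins = y * (X @ w)                      # y_i * <w, x_i> for every i
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss per example
    return 0.5 * lam * (w @ w) + hinge.mean()
```

Note that at $w = 0$ every hinge term equals 1, so $P_S(0) = 1$ regardless of the data.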
Pegasos (SGD)

Algorithm:
1. Choose $w_1 = 0 \in \mathbb{R}^d$
2. Iterate for $t = 1, 2, \dots, T$:
   2.1 Choose $A_t \subseteq S = \{1,2,\dots,n\}$, $|A_t| = b$, uniformly at random
   2.2 Set stepsize $\eta_t \leftarrow \frac{1}{\lambda t}$
   2.3 Update $w_{t+1} \leftarrow w_t - \eta_t \nabla P_{A_t}(w_t)$, i.e.,
   $$w_{t+1} \leftarrow (1 - \eta_t \lambda) w_t + \frac{\eta_t}{b} \sum_{i \in A_t :\; y_i \langle w_t, x_i\rangle < 1} y_i x_i$$

Theorem. For $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$ we have
$$\mathbf{E}[P(\bar{w})] \le P(w^*) + c\,\frac{\log(T)}{\lambda T}, \qquad \text{where } c = (\sqrt{\lambda} + 1)^2.$$

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM, ICML 2007.
Pegasos (SGD): Mini-Batch Analysis

For the same algorithm, the new mini-batch analysis gives:

Theorem 1. For $\bar{w} = \frac{2}{T}\sum_{t=\lfloor T/2\rfloor + 1}^{T} w_t$ we have
$$\mathbf{E}[P(\bar{w})] \le P(w^*) + 30\,\frac{\beta_b}{b}\cdot\frac{1}{\lambda T},$$
where
$$\beta_b = 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}, \qquad \sigma^2 \stackrel{\text{def}}{=} \frac{1}{n}\|Q\|, \qquad Q_{ij} = \langle y_i x_i, y_j x_j \rangle.$$

Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro. Mini-batch primal and dual methods for SVMs, 2013.
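The update in step 2.3 is easy to state in code. A minimal NumPy sketch of the mini-batch Pegasos loop (function and variable names are our own; for simplicity it returns the average of all iterates, as in the classical theorem, rather than the tail average of Theorem 1):

```python
import numpy as np

def pegasos_minibatch(X, y, lam, b, T, seed=0):
    """Mini-batch Pegasos sketch: X is (n, d) with rows x_i, y in {-1,+1}^n."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        A = rng.choice(n, size=b, replace=False)   # mini-batch A_t, uniform
        eta = 1.0 / (lam * t)                      # stepsize eta_t = 1/(lambda t)
        margins = y[A] * (X[A] @ w)
        viol = A[margins < 1]                      # examples with active hinge loss
        # w_{t+1} = (1 - eta*lam) w_t + (eta/b) sum_{i in viol} y_i x_i
        w = (1 - eta * lam) * w + (eta / b) * (y[viol] @ X[viol])
        w_sum += w
    return w_sum / T                               # averaged iterate
```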
Insight into $\beta_b / b$

$$\frac{\beta_b}{b} = \frac{1}{b}\left(1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}\right)$$

Letting $X = [x_1, x_2, \dots, x_n]$, $Z = [y_1 x_1, \dots, y_n x_n]$ and assuming $\|x_i\| = 1$ for all $i$, we have
$$n\sigma^2 \stackrel{\text{def}}{=} \|Q\| = \|ZZ^T\| = \|Z^T Z\| = \lambda_{\max}(Z^T Z) \in \left[\frac{\operatorname{tr}(Z^T Z)}{n},\; \operatorname{tr}(Z^T Z)\right] = \Big[\underbrace{\tfrac{\operatorname{tr}(X^T X)}{n}}_{=1},\; \underbrace{\operatorname{tr}(X^T X)}_{=n}\Big].$$

- $n\sigma^2 = n \;\Rightarrow\; \beta_b/b = 1$ (no parallelization speedup; mini-batching does not help)
- $n\sigma^2 = 1 \;\Rightarrow\; \beta_b/b = 1/b$ (speedup equal to the batch size!)

A similar expression appears in P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012, with $n\sigma^2$ replaced by $\omega$ (the degree of partial separability of the loss function).
Computing $\beta_b$: SVM Datasets

To run SDCA with safe mini-batching, we need to compute
$$\beta_b = 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}, \qquad \text{where } n\sigma^2 = \lambda_{\max}(Z^T Z).$$
Two options:
- Compute the largest eigenvalue (e.g., by the power method)
- Replace $n\sigma^2$ by an upper bound: the degree of partial separability $\omega$

General SDCA methods based on $\omega$ are described in: P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012.

[Table: # examples, # features, $\|A\|_0$, $n\sigma^2 \in [1, n]$ and $\omega$ for the datasets a1a, a9a, rcv1, real-sim, news20, url, webspam, kdda2010 and kddb2010 (e.g. rcv1: 20,242 examples, 47,236 features; news20: 19,996 examples, 1,355,191 features); the remaining numeric entries are not recoverable from this transcription.]
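For the first option, $n\sigma^2$ can be estimated without ever forming $Q$, by running power iteration through the data matrix. A sketch (names are our own; dense NumPy arrays are assumed, though the same matrix-free trick works for sparse data):

```python
import numpy as np

def n_sigma_sq(X, y, iters=100, seed=0):
    """Estimate n*sigma^2 = lambda_max(Z^T Z), Z = [y_1 x_1, ..., y_n x_n],
    by power iteration applied through Z (Q = Z Z^T is never formed)."""
    rng = np.random.default_rng(seed)
    Z = X * y[:, None]                 # row i is y_i * x_i
    v = rng.normal(size=X.shape[1])
    for _ in range(iters):
        v = Z.T @ (Z @ v)              # apply Z^T Z as two matrix-vector products
        v /= np.linalg.norm(v)
    return v @ (Z.T @ (Z @ v))        # Rayleigh quotient ~ lambda_max

def beta(b, n, nss):
    """beta_b = 1 + (b-1)(n*sigma^2 - 1)/(n - 1)."""
    return 1.0 + (b - 1) * (nss - 1.0) / (n - 1)
```

Since $y_i^2 = 1$, $Z^T Z = X^T X$, so the labels do not change the eigenvalue; they are kept only to mirror the slide's notation.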
Where Does $\beta_b$ Come From?

Lemma 1. Consider any symmetric $Q \in \mathbb{R}^{n\times n}$, a random subset $A \subseteq \{1,2,\dots,n\}$ with $|A| = b$ chosen uniformly at random, and $v \in \mathbb{R}^n$. Then
$$\mathbf{E}\big[v_{[A]}^T Q v_{[A]}\big] = \frac{b}{n}\left[\left(1 - \frac{b-1}{n-1}\right)\sum_{i=1}^n Q_{ii} v_i^2 + \frac{b-1}{n-1}\, v^T Q v\right].$$
Moreover, if $Q_{ii} \le 1$ for all $i$, then we get the following ESO (Expected Separable Overapproximation):
$$\mathbf{E}\big[v_{[A]}^T Q v_{[A]}\big] \le \frac{b}{n}\,\beta_b\,\|v\|^2.$$

Remark: ESO inequalities are systematically developed in P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012.
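Because the first statement of Lemma 1 is an exact identity, it can be checked numerically by enumerating every subset of size $b$. A small self-contained check (all names are our own):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, b = 6, 3
M = rng.normal(size=(n, n))
Q = M @ M.T                        # an arbitrary symmetric matrix
v = rng.normal(size=n)

# left-hand side: exact expectation over all uniform subsets A of size b
subsets = list(combinations(range(n), b))
lhs = 0.0
for A in subsets:
    vA = np.zeros(n)
    vA[list(A)] = v[list(A)]       # v_[A]: keep coordinates in A, zero the rest
    lhs += vA @ Q @ vA
lhs /= len(subsets)

# right-hand side of Lemma 1
c = (b - 1) / (n - 1)
rhs = (b / n) * ((1 - c) * np.sum(np.diag(Q) * v**2) + c * (v @ Q @ v))
```

The two quantities agree to machine precision.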
Insight into the Analysis

The classical Pegasos analysis uses the inequality
$$\|\nabla \hat{L}_{A_t}(w)\|^2 \le 1,$$
which holds for any $A_t \subseteq S = \{1,2,\dots,n\}$.

The new analysis uses the inequality
$$\mathbf{E}\,\|\nabla \hat{L}_{A_t}(w)\|^2 \le \frac{\beta_b}{b},$$
which holds for $A_t$ with $|A_t| = b$ chosen uniformly at random (established by the previous lemma).
PART II: Stochastic Dual Coordinate Ascent (SDCA)
Stochastic Dual Coordinate Ascent (SDCA)

Problem:
$$\max_{\alpha \in \mathbb{R}^n,\; 0 \le \alpha_i \le 1} \Big\{ D(\alpha) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \alpha_i - \frac{1}{2\lambda n^2}\, \alpha^T Q \alpha \Big\} \qquad (D)$$

Algorithm:
1. Choose $\alpha_0 = 0 \in \mathbb{R}^n$
2. For $t = 0, 1, 2, \dots$ iterate:
   2.1 Choose $i \in \{1,\dots,n\}$ uniformly at random
   2.2 Set $\delta^* \leftarrow \arg\max_{\delta} \{D(\alpha_t + \delta e_i) : 0 \le \alpha_i + \delta \le 1\}$
   2.3 $\alpha_{t+1} \leftarrow \alpha_t + \delta^* e_i$

First proposed for SVM by: C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, ICML 2008.

General analysis in: P. R. and M. Takáč, Iteration complexity of randomized block-coordinate descent methods..., MAPR 2012. [INFORMS Computing Society Best Student Paper Prize (runner-up), 2012]
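For the SVM dual, the one-dimensional subproblem in step 2.2 has a closed form: $D(\alpha + \delta e_i)$ is a concave quadratic in $\delta$, so its box-constrained maximizer is the unconstrained one clipped to the box. A sketch (our own naming; assumes $Q_{ii} > 0$):

```python
import numpy as np

def sdca_coordinate_step(alpha, Q, lam, i):
    """max_delta D(alpha + delta*e_i) subject to 0 <= alpha_i + delta <= 1.

    Setting the derivative 1/n - ((Q alpha)_i + delta*Q_ii)/(lam*n^2) to zero
    gives delta* = (lam*n - (Q alpha)_i)/Q_ii; for a concave 1-d quadratic,
    clipping the unconstrained maximizer to [0, 1] is exact."""
    n = len(alpha)
    delta = (lam * n - Q[i] @ alpha) / Q[i, i]
    out = alpha.copy()
    out[i] = np.clip(alpha[i] + delta, 0.0, 1.0)
    return out
```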
Naive Mini-Batching / Parallelization

Problem: the dual (D) above.

Algorithm:
1. Choose $\alpha_0 = 0 \in \mathbb{R}^n$
2. For $t = 0, 1, 2, \dots$ iterate:
   2.1 Choose $A_t \subseteq \{1,\dots,n\}$, $|A_t| = b$, uniformly at random
   2.2 Set $\alpha_{t+1} \leftarrow \alpha_t$
   2.3 For each $i \in A_t$:
       - Set $\delta^* \leftarrow \arg\max_{\delta} \{D(\alpha_t + \delta e_i) : 0 \le \alpha_i + \delta \le 1\}$
       - $\alpha_{t+1} \leftarrow \alpha_{t+1} + \delta^* e_i$

Analyzed in: Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin, Parallel coordinate descent for L1-regularized loss minimization, ICML 2011.
- Convergence guaranteed only for small $b$ ($\beta_b \le 2$)
- Analysis does not cover the SVM dual (D)
Example: Failure of Naive Parallelization

[Figure.]
Example: Details

Problem: $Q = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $\lambda = \tfrac{1}{2}$, $n = b = 2$, so
$$D(\alpha) = \tfrac{1}{2}\, e^T \alpha - \tfrac{1}{4}\, \alpha^T Q \alpha.$$

The naive approach produces the sequence:
- $\alpha_0 = (0, 0)^T$ with $D(\alpha_0) = 0$
- $\alpha_1 = (1, 1)^T$ with $D(\alpha_1) = 0$
- $\alpha_2 = (0, 0)^T$ with $D(\alpha_2) = 0$
- $\alpha_3 = (1, 1)^T$ with $D(\alpha_3) = 0$
- ...

Optimal solution: $D(\alpha^*) = D\big((\tfrac12, \tfrac12)^T\big) = \tfrac14$.
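The oscillation on this example can be reproduced in a few lines (a self-contained sketch; variable names are our own):

```python
import numpy as np

# Naive parallel SDCA on the 2x2 example from the slide:
# Q = [[1,1],[1,1]], n = 2, lambda = 1/2, so D(alpha) = 0.5*sum(alpha) - 0.25*alpha'Q alpha.
Q = np.ones((2, 2))
lam, n = 0.5, 2

def D(a):
    return np.sum(a) / n - a @ Q @ a / (2 * lam * n**2)

def naive_step(a):
    """Each coordinate maximizes D independently; both are applied at once (b = n = 2)."""
    new = a.copy()
    for i in range(2):
        # concave 1-d quadratic in delta; clip the unconstrained maximizer to the box
        delta = (lam * n - Q[i] @ a) / Q[i, i]
        new[i] = np.clip(a[i] + delta, 0.0, 1.0)
    return new

a = np.zeros(2)
traj = [a]
for _ in range(4):
    a = naive_step(a)
    traj.append(a)
# alpha oscillates (0,0) -> (1,1) -> (0,0) -> ... with D = 0 throughout,
# while the optimum D((1/2, 1/2)) = 1/4 is never approached.
```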
Safe Mini-Batching

Instead of choosing $\delta$ naively via maximizing the original function
$$D(\alpha + \delta) = -\frac{\alpha^T Q \alpha + 2\alpha^T Q \delta + \delta^T Q \delta}{2\lambda n^2} + \frac{1}{n}\sum_{i=1}^n (\alpha_i + \delta_i),$$
work with its Expected Separable Underapproximation:
$$H_{\beta_b}(\delta, \alpha) := -\frac{\alpha^T Q \alpha + 2\alpha^T Q \delta + \beta_b \|\delta\|^2}{2\lambda n^2} + \frac{1}{n}\sum_{i=1}^n (\alpha_i + \delta_i).$$
That is, instead of
$$\delta^* \leftarrow \arg\max_{\delta} \{D(\alpha + \delta e_i) : 0 \le \alpha_i + \delta \le 1\}, \qquad i \in A_t,$$
do
$$\delta^* \leftarrow \arg\max_{\delta} \{H_{\beta_b}(\delta e_i, \alpha) : 0 \le \alpha_i + \delta \le 1\}, \qquad i \in A_t.$$
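Maximizing $H_{\beta_b}$ coordinate-wise changes only one thing relative to the naive update: the curvature $Q_{ii}$ is replaced by $\beta_b$, which shrinks every step. A sketch (our own naming; assumes the normalization $Q_{ii} \le 1$ from Lemma 1):

```python
import numpy as np

def safe_minibatch_step(alpha, Q, lam, A, beta_b):
    """One safe mini-batch SDCA step: each i in A maximizes H_{beta_b}(delta*e_i, alpha),
    giving delta_i = (lam*n - (Q alpha)_i)/beta_b, clipped to the box [0, 1]."""
    n = len(alpha)
    Qa = Q @ alpha                      # computed once, at the old point
    out = alpha.copy()
    for i in A:
        delta = (lam * n - Qa[i]) / beta_b
        out[i] = np.clip(alpha[i] + delta, 0.0, 1.0)
    return out
```

On the 2x2 failure example, $n\sigma^2 = \lambda_{\max}(Q) = 2$, so $\beta_2 = 2$ and the safe step from $(0,0)$ lands exactly on the optimum $(\tfrac12, \tfrac12)$.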
Safe Mini-Batching: General Theory

Developed in: P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012. Based on the idea: if you can't guarantee descent/ascent, guarantee it in expectation.

Definition (Expected Separable Overapproximation). Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and smooth. Let $\hat{S}$ be a random subset of $\{1,2,\dots,n\}$ such that $\mathbf{P}(i \in \hat{S}) = \text{const}$ for all $i$ (uniform sampling), and for $w \in \mathbb{R}^n_{++}$ define $\|x\|_w \stackrel{\text{def}}{=} \big(\sum_{i=1}^n w_i x_i^2\big)^{1/2}$. Then we say that $f$ admits a $(\beta, w)$-ESO w.r.t. $\hat{S}$ if for all $x, h \in \mathbb{R}^n$:
$$\mathbf{E}\big[f(x + h_{[\hat{S}]})\big] \le f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n}\left(\langle \nabla f(x), h\rangle + \frac{\beta}{2}\|h\|_w^2\right).$$
ESO for SVM Dual

The ESO can also be written as
$$\mathbf{E}\big[f(x + h_{[\hat{S}]})\big] \le \left(1 - \frac{\mathbf{E}[|\hat{S}|]}{n}\right) f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n} \underbrace{\left(f(x) + \langle \nabla f(x), h\rangle + \frac{\beta}{2}\|h\|_w^2\right)}_{\stackrel{\text{def}}{=}\, H_\beta(h;\, x)}.$$

SVM setting: $f(x) = D(\alpha)$, $h = \delta$, $\hat{S} = A_t$, $|\hat{S}| = b$, $w = (1,\dots,1)^T$.

Lemma 3. For the SVM dual we have, for all $\alpha, \delta \in \mathbb{R}^n$, the following ESO:
$$\mathbf{E}\big[D(\alpha + \delta_{[A_t]})\big] \ge \Big(1 - \frac{b}{n}\Big) D(\alpha) + \frac{b}{n}\, H_{\beta_b}(\delta;\, \alpha),$$
where
$$H_{\beta_b}(\delta, \alpha) = -\frac{\alpha^T Q \alpha + 2\alpha^T Q \delta + \beta_b \|\delta\|^2}{2\lambda n^2} + \frac{1}{n}\sum_{i=1}^n (\alpha_i + \delta_i)
\qquad\text{and}\qquad
\beta_b \stackrel{\text{def}}{=} 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}.$$
Primal Suboptimality for SDCA with Safe Mini-Batching

Theorem 2. Run SDCA with safe mini-batching, with $\alpha_0 = 0 \in \mathbb{R}^n$, and let $\epsilon > 0$. If
$$t_0 \ge \max\left\{0,\; \frac{n}{b}\log\Big(\frac{2\lambda n}{\beta_b}\Big)\right\}, \qquad
T_0 \ge t_0 + \left[\frac{4\beta_b}{b\lambda\epsilon} - \frac{2n}{\beta_b}\right]_+, \qquad
T \ge T_0 + \max\left\{\frac{n}{b},\; \frac{\beta_b}{b\lambda\epsilon}\right\},$$
then
$$\bar{\alpha} \stackrel{\text{def}}{=} \frac{1}{T - T_0}\sum_{t=T_0}^{T-1} \alpha_t, \qquad
w(\bar{\alpha}) \stackrel{\text{def}}{=} \frac{1}{\lambda n}\sum_{i=1}^n \bar{\alpha}_i y_i x_i$$
is an $\epsilon$-approximate solution to the PRIMAL problem, i.e.,
$$\mathbf{E}\big[P(w(\bar{\alpha}))\big] - P(w^*) \le \mathbf{E}\big[\underbrace{P(w(\bar{\alpha})) - D(\bar{\alpha})}_{\text{duality gap}}\big] \le \epsilon.$$
Primal Suboptimality: Simple Expression

$$T \approx \frac{\beta_b}{b}\cdot\frac{5}{\lambda\epsilon} + \frac{n}{b}\left(1 + \log\Big(\frac{2\lambda n}{\beta_b}\Big)\right)$$
PART III: SGD vs SDCA: Theory and Numerics
SGD vs. SDCA: Theory

- Stochastic Gradient Descent (SGD) needs
$$T = \frac{\beta_b}{b}\cdot\frac{30}{\lambda\epsilon}.$$
- Stochastic Dual Coordinate Ascent (SDCA, with safe mini-batching) needs
$$T = \frac{\beta_b}{b}\cdot\frac{5}{\lambda\epsilon} + \frac{n}{b}\left(1 + \log\Big(\frac{2\lambda n}{\beta_b}\Big)\right).$$
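The two bounds are easy to compare numerically. A sketch plugging in representative values (the constants follow the slide; the specific $\lambda$, $\epsilon$, $n$ below are illustrative choices of ours):

```python
import math

def t_sgd(beta_b, b, lam, eps):
    """SGD bound: T = (beta_b/b) * 30/(lam*eps)."""
    return (beta_b / b) * 30.0 / (lam * eps)

def t_sdca(beta_b, b, lam, eps, n):
    """Safe mini-batch SDCA bound:
    T = (beta_b/b) * 5/(lam*eps) + (n/b) * (1 + log(2*lam*n/beta_b))."""
    return (beta_b / b) * 5.0 / (lam * eps) \
        + (n / b) * (1.0 + math.log(2.0 * lam * n / beta_b))

# e.g. n = 10^5, lam = 1e-4, eps = 1e-3, b = 100, beta_b = 10:
# the SGD bound is ~3e7 iterations; the SDCA bound is ~5e6 plus a small log term.
```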
Numerical Experiments: Datasets

Data       # train    # test     # features (n)
cov        522,911    58,101     54
rcv1       20,242     677,399    47,236
astro-ph   29,882     32,487     99,757
news20     15,020     4,976      1,355,191

(The slide also lists the sparsity and the regularization parameter $\lambda$ for each dataset; those entries are not recoverable from this transcription.)
Batch Size vs Iterations (covertype)

[Figure: iterations needed to reach accuracy $\epsilon$ on covertype, as a function of batch size, for Pegasos (SGD), naive SDCA, safe SDCA and aggressive SDCA; $\beta_b/b$ shown on the right axis.]
Batch Size vs Iterations (astro-ph)

[Figure: iterations vs. batch size on astro-ph.]
Numerical Experiments (astro-ph)

[Figure: left panel: test error vs. iterations on astro-ph; right panel: primal/dual suboptimality vs. iterations on astro-ph with b = 8192.]
Test Error and Primal/Dual Suboptimality (news20)

[Figure: test error (left, b = 256) and primal/dual suboptimality (right) vs. iterations on news20, for Pegasos (SGD), naive SDCA, safe SDCA and aggressive SDCA.]
Summary 1

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM, ICML 2007.
+ Analysis of SGD for b = 1
- Weak analysis for b > 1 (no speedup)

P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization, 2012.
+ General analysis of SDCA for b > 1 (even variable b)
+ ESO: Expected Separable Overapproximation
- Dual suboptimality only

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization, 2012.
+ Primal suboptimality for SDCA with b = 1
- No analysis for b > 1
Summary 2

Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro. Mini-batch primal and dual methods for SVMs, 2013.
- First analysis of mini-batched SGD for the SVM primal which works
- New mini-batch SDCA method for the SVM dual, with safe mini-batching and with aggressive mini-batching
- Both SGD and SDCA: have guarantees in terms of primal suboptimality; the spectral norm of the data controls the parallelization speedup; have essentially identical iterations
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationSimultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms
More informationLearning in a Distributed and Heterogeneous Environment
Learning in a Distributed and Heterogeneous Environment Martin Jaggi EPFL Machine Learning and Optimization Laboratory mlo.epfl.ch Inria - EPFL joint Workshop - Paris - Feb 15 th Machine Learning Methods
More informationSADAGRAD: Strongly Adaptive Stochastic Gradient Methods
Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,
More informationBeating SGD: Learning SVMs in Sublinear Time
Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationarxiv: v1 [math.oc] 18 Mar 2016
Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing
More informationStreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationA Dual Coordinate Descent Method for Large-scale Linear SVM
Cho-Jui Hsieh b92085@csie.ntu.edu.tw Kai-Wei Chang b92084@csie.ntu.edu.tw Chih-Jen Lin cjlin@csie.ntu.edu.tw Department of Computer Science, National Taiwan University, Taipei 06, Taiwan S. Sathiya Keerthi
More informationDoubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization
Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4
More informationPrimal-dual coordinate descent
Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x
More informationRandomized Smoothing Techniques in Optimization
Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory
More informationDistributed Box-Constrained Quadratic Optimization for Dual Linear SVM
ICML 15 Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Ching-pei Lee LEECHINGPEI@GMAIL.COM Dan Roth DANR@ILLINOIS.EDU University of Illinois at Urbana-Champaign, 201 N. Goodwin
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationOnline Sparse Passive Aggressive Learning with Kernels
Online Sparse Passive Aggressive Learning with Kernels Jing Lu Peilin Zhao Steven C.H. Hoi Abstract Conventional online kernel methods often yield an unbounded large number of support vectors, making them
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationKernelized Perceptron Support Vector Machines
Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:
More informationBeyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
More informationarxiv: v6 [math.oc] 24 Sep 2018
: The First Direct Acceleration of Stochastic Gradient Methods version 6 arxiv:63.5953v6 [math.oc] 4 Sep 8 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March
More informationThe Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited
The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited Constantinos Panagiotaopoulos and Petroula Tsampoua School of Technology, Aristotle University of Thessalonii, Greece costapan@eng.auth.gr,
More information