Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Shai Shalev-Shwartz and Tong Zhang
School of CS and Engineering, The Hebrew University of Jerusalem

Optimization for Machine Learning, NIPS Workshop, 2012
Regularized Loss Minimization

$$\min_w \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2$$

Examples:

- SVM (hinge loss): $\phi_i(z) = \max\{0,\, 1 - y_i z\}$ (Lipschitz, not smooth)
- Logistic regression: $\phi_i(z) = \log(1 + \exp(-y_i z))$ (Lipschitz and smooth)
- Abs-loss regression: $\phi_i(z) = |z - y_i|$ (Lipschitz, not smooth)
- Square-loss regression: $\phi_i(z) = (z - y_i)^2$ (smooth, not Lipschitz)
Dual Coordinate Ascent (DCA)

Primal problem:

$$\min_w \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2$$

Dual problem:

$$\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) \;-\; \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \Big\|^2$$

DCA: At each iteration, optimize $D(\alpha)$ with respect to a single coordinate, while the rest of the coordinates are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): Choose the updated coordinate uniformly at random (see the sketch below).
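To make the procedure concrete, here is a minimal Python sketch of the SDCA loop, assuming a user-supplied routine `argmax_coordinate` that solves the one-dimensional dual maximization; all names here are illustrative, not the authors' code.

```python
import numpy as np

def sdca(X, argmax_coordinate, lam, epochs=10, seed=0):
    """Generic SDCA skeleton.

    X: (n, d) data matrix. argmax_coordinate(i, w, alpha_i) should return the
    increment delta maximizing D(alpha + delta * e_i) over delta in R.
    Maintains the primal-dual relation w = (1/(lam*n)) * sum_i alpha_i * x_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs * n):
        i = rng.integers(n)                      # coordinate chosen uniformly at random
        delta = argmax_coordinate(i, w, alpha[i])
        alpha[i] += delta
        w += (delta / (lam * n)) * X[i]          # incremental update keeps w in sync
    return w, alpha
```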
SDCA vs. SGD: update rules

Stochastic Gradient Descent (SGD) update rule:

$$w^{(t+1)} = \Big(1 - \frac{1}{t}\Big) w^{(t)} - \frac{1}{\lambda t}\, \phi_i'(w^{(t)\top} x_i)\, x_i$$

SDCA update rule:

1. $\Delta\alpha_i = \operatorname{argmax}_{\Delta \in \mathbb{R}} \; D(\alpha^{(t)} + \Delta e_i)$
2. $\alpha^{(t+1)} = \alpha^{(t)} + \Delta\alpha_i e_i, \qquad w^{(t+1)} = w^{(t)} + \frac{\Delta\alpha_i}{\lambda n}\, x_i$

Rather similar update rules. SDCA has several advantages:

- Stopping criterion
- No need to tune a learning rate
SDCA vs. SGD: update rules

Example: SVM with the hinge loss, $\phi_i(w) = \max\{0,\, 1 - y_i w^\top x_i\}$.

SGD update rule:

$$w^{(t+1)} = \Big(1 - \frac{1}{t}\Big) w^{(t)} + \frac{1}{\lambda t}\, \mathbb{1}[y_i x_i^\top w^{(t)} < 1]\, y_i x_i$$

SDCA update rule (sketched in code below):

1. $\Delta\alpha_i = y_i \max\Big(0,\, \min\Big(1,\, \frac{1 - y_i x_i^\top w^{(t-1)}}{\|x_i\|_2^2/(\lambda n)} + y_i \alpha_i^{(t-1)}\Big)\Big) - \alpha_i^{(t-1)}$
2. $w^{(t)} = w^{(t-1)} + \frac{\Delta\alpha_i}{\lambda n}\, x_i$
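Plugging the closed-form hinge update into the skeleton sketched earlier gives a complete solver. The factory function `make_hinge_coordinate_step` below is a hypothetical helper, not from the talk.

```python
import numpy as np

def make_hinge_coordinate_step(X, y, lam):
    """Closed-form SDCA coordinate maximization for the hinge loss (SVM)."""
    n = X.shape[0]
    sq_norms = (X ** 2).sum(axis=1)              # precompute ||x_i||^2

    def argmax_coordinate(i, w, alpha_i):
        # Delta = y_i * clip((1 - y_i x_i^T w) / (||x_i||^2 / (lam n)) + y_i alpha_i, 0, 1) - alpha_i
        z = (1.0 - y[i] * (X[i] @ w)) / (sq_norms[i] / (lam * n)) + y[i] * alpha_i
        return y[i] * min(1.0, max(0.0, z)) - alpha_i

    return argmax_coordinate
```

Usage with the earlier skeleton would be `w, alpha = sdca(X, make_hinge_coordinate_step(X, y, lam), lam)`.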
SDCA vs. SGD: experimental observations

[Figure: log-scale convergence curves on the CCAT dataset, $\lambda = 10^{-6}$, smoothed loss; methods: SDCA, SDCA-Perm, SGD]
SDCA vs. SGD: experimental observations

[Figure: log-scale convergence curves on the CCAT dataset, $\lambda = 10^{-5}$, hinge loss; methods: SDCA, SDCA-Perm, SGD]
SDCA vs. SGD: current analysis is unsatisfactory

How many iterations are required to guarantee $P(w^{(t)}) \le P(w^*) + \epsilon$?

For SGD: $\tilde{O}\big(\frac{1}{\lambda\epsilon}\big)$

For SDCA:

- Hsieh et al. (ICML 2008), following Luo and Tseng (1992): $O\big(\frac{1}{\nu}\log(1/\epsilon)\big)$, but $\nu$ can be arbitrarily small
- Shalev-Shwartz and Tewari (2009), Nesterov (2010): $O(n/\epsilon)$ for general $n$-dimensional coordinate ascent
  - Can apply it to the dual problem
  - The resulting rate is slower than that of SGD
  - The analysis does not hold for logistic regression (it requires a smooth dual)
  - The analysis is for dual sub-optimality
Dual vs. primal sub-optimality

Take data which is linearly separable using a vector $w_0$. Set $\lambda = 2\epsilon/\|w_0\|^2$ and use the hinge loss. Then:

- $P(w^*) \le P(w_0) = \epsilon$ (the hinge terms vanish at $w_0$, leaving only the regularizer)
- $D(0) = 0$, so $D(\alpha^*) - D(0) = P(w^*) - D(0) \le \epsilon$
- But $w^{(0)} = 0$, so $P(w^{(0)}) - P(w^*) = 1 - P(w^*) \ge 1 - \epsilon$

Conclusion: In the interesting regime, $\epsilon$-sub-optimality on the dual can be meaningless w.r.t. the primal!
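To check the numbers (a sketch, assuming $w_0$ separates the data with margin 1, so every hinge term vanishes at $w_0$):

```latex
P(w_0) = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\max\{0,\, 1 - y_i w_0^\top x_i\}}_{=\,0}
         + \frac{\lambda}{2}\|w_0\|^2
       = \frac{1}{2}\cdot\frac{2\epsilon}{\|w_0\|^2}\cdot\|w_0\|^2 = \epsilon,
\qquad
P(w^{(0)}) = P(0) = \frac{1}{n}\sum_{i=1}^{n} \max\{0,\, 1 - 0\} = 1 .
```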
Our results

For a $(1/\gamma)$-smooth loss:

$$\tilde{O}\Big(\Big(n + \frac{1}{\lambda\gamma}\Big)\log\frac{1}{\epsilon}\Big)$$

For an $L$-Lipschitz loss:

$$\tilde{O}\Big(n + \frac{1}{\lambda\epsilon}\Big)$$

For almost-smooth loss functions (e.g. the hinge loss):

$$\tilde{O}\Big(n + \frac{1}{\lambda\,\epsilon^{1/(1+\nu)}}\Big)$$

where $\nu > 0$ is a data-dependent quantity.
Smoothing the hinge-loss

$$\phi(x) = \begin{cases} 0 & x > 1 \\ 1 - x - \gamma/2 & x < 1 - \gamma \\ \dfrac{(1-x)^2}{2\gamma} & \text{otherwise} \end{cases}$$
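A direct transcription of the piecewise definition above (a minimal sketch; `gamma` is the smoothing parameter and must be positive, with $\gamma \to 0$ recovering the plain hinge loss):

```python
import numpy as np

def smoothed_hinge(x, gamma):
    """Smoothed hinge loss: zero for x > 1, linear for x < 1 - gamma,
    and the quadratic (1 - x)^2 / (2 * gamma) in between (assumes gamma > 0)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 1.0, 0.0,
                    np.where(x < 1.0 - gamma,
                             1.0 - x - gamma / 2.0,
                             (1.0 - x) ** 2 / (2.0 * gamma)))
```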
Smoothing the hinge-loss: mild effect on 0-1 error

[Figure: 0-1 error as a function of the smoothing parameter $\gamma \in [0, 1]$ on three datasets: astro-ph, CCAT, cov1]
Smoothing the hinge-loss: improves training time

[Figure: duality gap as a function of runtime for different smoothing parameters $\gamma \in \{0, 0.01, 0.1, 1\}$ on astro-ph, CCAT, and cov1]
SDCA vs. DCA: randomization is crucial

[Figure: convergence on the CCAT dataset, $\lambda = 10^{-4}$, smoothed hinge loss; curves for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound]

In particular, the bound of Luo and Tseng holds for the cyclic order, hence must be inferior to our bound.
Additional related work

- Collins et al. (2008): For smooth losses, a bound similar to ours, but for a more complicated algorithm (Exponentiated Gradient on the dual).
- Lacoste-Julien, Jaggi, Schmidt, Pletscher (preprint on arXiv): Study the Frank-Wolfe algorithm for the dual of structured prediction problems. It boils down to SDCA for the case of the binary hinge loss; same bound as ours for the Lipschitz case.
- Le Roux, Schmidt, Bach (NIPS 2012): A variant of SGD for smooth losses and a finite sample; also obtains a $\log(1/\epsilon)$ rate.
Extensions

- Slightly better rates for SDCA with SGD initialization
- Proximal Stochastic Dual Coordinate Ascent (in preparation; a preliminary version is on arXiv). Solve:

  $$\min_w \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(X_i^\top w) + \lambda\, g(w)$$

  - For example, $g(w) = \frac{1}{2}\|w\|_2^2 + \frac{\sigma}{\lambda}\|w\|_1$ leads to a thresholding operator (see the sketch after this list)
  - For example, $\phi_i : \mathbb{R}^k \to \mathbb{R}$, $\phi_i(v) = \max_{y} \big(\delta(y_i, y) + v_y - v_{y_i}\big)$ is useful for structured output prediction
- Approximate dual maximization (one can obtain closed-form approximate solutions while still maintaining the same guarantees on the convergence rate)
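For the $\ell_1$-regularized example above, the thresholding operator in question is, to our understanding, the familiar soft-thresholding map applied to the dual average $v = \frac{1}{\lambda n}\sum_i \alpha_i x_i$; a sketch under that assumption:

```python
import numpy as np

def soft_threshold(v, thresh):
    """Componentwise soft-thresholding: sign(v) * max(|v| - thresh, 0).

    With g(w) = 0.5 * ||w||_2^2 + (sigma / lam) * ||w||_1, maximizing
    <v, w> - g(w) over w is solved coordinatewise by this map with
    thresh = sigma / lam (an assumption stated in the lead-in).
    """
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)
```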
Proof Idea

Main lemma: for any $t$ and any $s \in [0, 1]$,

$$\mathbb{E}\big[D(\alpha^{(t)}) - D(\alpha^{(t-1)})\big] \;\ge\; \frac{s}{n}\, \mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;-\; \Big(\frac{s}{n}\Big)^2 \frac{G^{(t)}}{2\lambda}$$

- $G^{(t)} = O(1)$ for Lipschitz losses
- With an appropriate $s$, $G^{(t)} \le 0$ for smooth losses
Proof Idea

Main lemma: for any $t$ and any $s \in [0, 1]$,

$$\mathbb{E}\big[D(\alpha^{(t)}) - D(\alpha^{(t-1)})\big] \;\ge\; \frac{s}{n}\, \mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;-\; \Big(\frac{s}{n}\Big)^2 \frac{G^{(t)}}{2\lambda}$$

Bounding dual sub-optimality: Since $P(w^{(t-1)}) \ge D(\alpha^*)$, the above lemma yields a convergence rate for the dual sub-optimality.

Bounding the duality gap: Summing the inequality over iterations $T_0 + 1, \ldots, T$ and choosing a random $t \in \{T_0 + 1, \ldots, T\}$ yields (made explicit below)

$$\mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;\le\; \frac{n}{s(T - T_0)}\, \mathbb{E}\big[D(\alpha^{(T)}) - D(\alpha^{(T_0)})\big] + \frac{s\, G}{2\lambda n}$$
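To make the summing step explicit (a sketch, writing $G$ for an upper bound on the $G^{(t)}$):

```latex
\sum_{t=T_0+1}^{T} \mathbb{E}\big[D(\alpha^{(t)}) - D(\alpha^{(t-1)})\big]
  \;\ge\; \frac{s}{n} \sum_{t=T_0+1}^{T} \mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big]
  \;-\; (T - T_0)\Big(\frac{s}{n}\Big)^{2} \frac{G}{2\lambda}.
% The left-hand side telescopes to E[D(alpha^(T)) - D(alpha^(T_0))].
% Dividing through by s(T - T_0)/n and drawing t uniformly from
% {T_0+1, ..., T} turns the sum on the right into the expected duality
% gap, giving the displayed bound.
```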
Summary

- SDCA works very well in practice
- So far, theoretical guarantees were unsatisfactory
- Our analysis shows that SDCA is an excellent choice in many scenarios