Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz (HUJI&Mobileye) Fast Stochastic TAU 14 1 / 34

Learning

Goal (informal): Learn an accurate mapping $h : \mathcal{X} \to \mathcal{Y}$ based on examples $((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$.

Deep learning: Each mapping $h : \mathcal{X} \to \mathcal{Y}$ is parameterized by a weight vector $w \in \mathbb{R}^d$, so our goal is to learn the vector $w$.

Regularized Loss Minimization

A popular learning approach is Regularized Loss Minimization (RLM) with Euclidean regularization: sample $S = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$ and approximately solve the RLM problem

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n \phi_i(w) + \frac{\lambda}{2} \|w\|^2$$

where $\phi_i(w) = \ell_{y_i}(h_w(x_i))$ is the loss of predicting $h_w(x_i)$ when the true target is $y_i$.
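The RLM objective can be made concrete with a small sketch. It assumes a linear predictor $h_w(x) = \langle w, x \rangle$, labels in $\{-1, +1\}$, and the logistic loss; none of these choices come from the slides, and the function names are hypothetical.

```python
import numpy as np

def rlm_objective(w, X, y, lam):
    """P(w) = (1/n) * sum_i phi_i(w) + (lam/2) * ||w||^2 with logistic phi_i (assumed)."""
    margins = y * (X @ w)
    losses = np.logaddexp(0.0, -margins)      # log(1 + exp(-margin)), numerically stable
    return losses.mean() + 0.5 * lam * np.dot(w, w)

def rlm_gradient(w, X, y, lam):
    """Gradient of rlm_objective with respect to w."""
    margins = y * (X @ w)
    # d/ds log(1 + exp(-y*s)) = -y * sigmoid(-y*s); sigmoid(-m) = 0.5 * (1 - tanh(m/2))
    coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return X.T @ coeff / len(y) + lam * w
```

At $w = 0$ every logistic loss equals $\log 2$, which gives a quick sanity check on the objective value.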

How to solve RLM for Deep Learning?

Stochastic Gradient Descent (SGD):
- Advantages:
  - Works well in practice
  - Per iteration cost independent of $n$
- Disadvantage: slow convergence

[Figure: objective (log scale) vs. # of backpropagation ($\times 10^7$)]

How to improve SGD convergence rate?

1. Stochastic Dual Coordinate Ascent (SDCA):
   - Same per iteration cost as SGD... but converges exponentially faster
   - Designed for convex problems... but can be adapted to deep learning
2. SelfieBoost:
   - AdaBoost, with SGD as weak learner, converges exponentially faster than vanilla SGD
   - But it yields an ensemble of networks, which is very expensive at prediction time
   - I'll describe a new boosting algorithm that boosts the performance of the same network
   - I'll show faster convergence under some "SGD success" assumption

SDCA vs. SGD

On CCAT dataset, shallow architecture

[Figure: log-scale convergence curves for SDCA, SDCA-Perm, and SGD]

SelfieBoost vs. SGD

On MNIST dataset, depth 5 network

[Figure: error (log scale) vs. # of backpropagation ($\times 10^7$) for SGD and SelfieBoost]

Gradient Descent vs. Stochastic Gradient Descent

Define: $P_I(w) = \frac{1}{|I|} \sum_{i \in I} \phi_i(w) + \frac{\lambda}{2} \|w\|^2$

| | GD | SGD |
|---|---|---|
| rule | $w_{t+1} = w_t - \eta \nabla P(w_t)$ | $w_{t+1} = w_t - \eta \nabla P_I(w_t)$ for random $I \subset [n]$ |
| per iteration cost | $O(n)$ | $O(1)$ |
| convergence rate | $\log(1/\epsilon)$ | $1/\epsilon$ |
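The two update rules differ only in the index set used for the gradient. The sketch below illustrates this, assuming the squared loss $\phi_i(w) = \frac{1}{2}(\langle w, x_i \rangle - y_i)^2$ and a single-example set $I$ for SGD; both choices and all function names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def grad_P_I(w, X, y, lam, idx):
    """Gradient of P_I(w) for the (assumed) squared loss on the index set idx."""
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx) + lam * w

def gd_step(w, X, y, lam, eta):
    """Full-gradient step: touches all n examples, so it costs O(n) per iteration."""
    return w - eta * grad_P_I(w, X, y, lam, np.arange(len(y)))

def sgd_step(w, X, y, lam, eta, rng):
    """Stochastic step: one random example, so the cost does not depend on n."""
    i = rng.integers(len(y))
    return w - eta * grad_P_I(w, X, y, lam, np.array([i]))
```

Averaging the single-example gradients over all $i$ recovers the full gradient, which is why the SGD direction is an unbiased estimate of the GD direction.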

Hey, wait, but what about...

- Decaying learning rate
- Nesterov's momentum

SGD: More powerful oracle is crucial

Theorem: Any algorithm for solving RLM that only accesses the objective using a stochastic gradient oracle and has a $\log(1/\epsilon)$ rate must perform $\Omega(n^2)$ iterations.

Proof idea:
- Consider two objectives (in both, $\lambda = 1$): for $i \in \{\pm 1\}$,
  $$P_i(w) = \frac{1}{2n} \left( \frac{n-1}{2} (w - i)^2 + \frac{n+1}{2} (w + i)^2 \right)$$
- A stochastic gradient oracle returns $w \pm i$ w.p. $\frac{1}{2} \pm \frac{1}{2n}$
- Easy to see that $w_i^* = -i/n$, $P_i(0) = 1/2$, $P_i(w_i^*) = 1/2 - 1/(2n^2)$
- Therefore, solving to accuracy $\epsilon < 1/(2n^2)$ amounts to determining the bias of the coin
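The "easy to see" values in the proof idea can be checked with exact rational arithmetic. This is only a verification sketch of the stated quantities; the function names are hypothetical.

```python
from fractions import Fraction

def P(w, i, n):
    """P_i(w) from the proof idea, with lambda = 1 and i in {+1, -1}."""
    return Fraction(1, 2 * n) * (
        Fraction(n - 1, 2) * (w - i) ** 2 + Fraction(n + 1, 2) * (w + i) ** 2
    )

def expected_oracle_output(w, i, n):
    """Mean of the oracle that returns w + i w.p. 1/2 + 1/(2n) and w - i otherwise."""
    p_plus = Fraction(1, 2) + Fraction(1, 2 * n)
    return p_plus * (w + i) + (1 - p_plus) * (w - i)

n = 10
for i in (1, -1):
    w_star = Fraction(-i, n)
    # Prints P_i(0), P_i(w*), and the expected gradient at w* for each sign of i.
    print(i, P(Fraction(0), i, n), P(w_star, i, n), expected_oracle_output(w_star, i, n))
```

Expanding the squares gives $P_i(w) = \frac{1}{2} w^2 + \frac{i}{n} w + \frac{1}{2}$, so the two objectives differ only through the sign of a term of size $1/n$, which is the coin bias the oracle has to reveal.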

Outline

1. SDCA
   - Description and analysis for convex problems
   - SDCA for Deep Learning
   - Accelerated SDCA
2. SelfieBoost

Stochastic Dual Coordinate Ascent

Primal problem:

$$\min_{w \in \mathbb{R}^d} P(w) := \left[ \frac{1}{n} \sum_{i=1}^n \phi_i(w) + \frac{\lambda}{2} \|w\|^2 \right]$$

(Fenchel) Dual problem:

$$\max_{\alpha \in \mathbb{R}^{d,n}} D(\alpha) := \left[ \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-\alpha_i) - \frac{1}{2\lambda n^2} \left\| \sum_{i=1}^n \alpha_i \right\|^2 \right]$$

(where $\alpha_i$ is the $i$-th column of $\alpha$)

DCA: At each iteration, optimize $D(\alpha)$ w.r.t. a single column of $\alpha$, while the rest of the columns are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): Choose the updated column uniformly at random.

Fenchel Conjugate

Two equivalent representations of a convex function:
- Point: $(w, f(w))$
- Tangent: $(\theta, -f^*(\theta))$

$$f^*(\theta) = \max_w \; \langle w, \theta \rangle - f(w)$$

[Figure: a convex function $f(w)$ with a tangent line of slope $\theta$ whose intercept is $-f^*(\theta)$]
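The definition of the conjugate can be tried numerically in one dimension. The sketch below approximates the maximization by a grid search, an illustrative shortcut rather than anything from the slides, and uses $f(w) = \frac{1}{2} w^2$, whose conjugate is known to be $f^*(\theta) = \frac{1}{2} \theta^2$.

```python
import numpy as np

def conjugate_on_grid(f, theta, w_grid):
    """Approximate f*(theta) = max_w <w, theta> - f(w) by a maximum over a 1-D grid."""
    return np.max(w_grid * theta - f(w_grid))

w_grid = np.linspace(-10.0, 10.0, 200001)
f = lambda w: 0.5 * w ** 2
theta = 1.5
approx = conjugate_on_grid(f, theta, w_grid)
exact = 0.5 * theta ** 2       # closed-form conjugate of 0.5 * w^2
print(approx, exact)
```

The grid must cover the maximizer (here $w = \theta$), otherwise the approximation underestimates the conjugate.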

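SDCA Analysis

Theorem: Assume each $\phi_i$ is convex and smooth. Then, after

$$\tilde{O}\left( \left( n + \frac{1}{\lambda} \right) \log\frac{1}{\epsilon} \right)$$

iterations of SDCA we have, with high probability, $P(w_t) - P(w^*) \le \epsilon$.

| | GD | SGD | SDCA |
|---|---|---|---|
| iteration cost | $nd$ | $d$ | $d$ |
| convergence rate | $\frac{1}{\lambda} \log(1/\epsilon)$ | $\frac{1}{\lambda\epsilon}$ | $(n + \frac{1}{\lambda}) \log(1/\epsilon)$ |
| runtime | $nd \cdot \frac{1}{\lambda}$ | $d \cdot \frac{1}{\lambda\epsilon}$ | $d\,(n + \frac{1}{\lambda})$ |
| runtime for $\lambda = \frac{1}{n}$ | $n^2 d$ | $\frac{nd}{\epsilon}$ | $nd$ |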

SDCA vs. SGD: experimental observations

On CCAT dataset, shallow architecture, $\lambda = 10^{-6}$

[Figure: log-scale convergence curves for SDCA, SDCA-Perm, and SGD]

SDCA vs. DCA: Randomization is crucial

[Figure: log-scale convergence curves for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound]

Deep Networks are Non-Convex

A 2-dim slice of a network with hidden layers {10, 10, 10, 10}, on MNIST, with the clamped ReLU activation function and logistic loss. The slice is defined by finding a global minimum (using SGD) and creating two random permutations of the first hidden layer.

[Figure: plot of the objective over the 2-dim slice]

But Deep Networks Seem Convex Near a Minimum

Now the slice is based on 2 random points at distance 1 around a global minimum.

[Figure: plot of the objective over the 2-dim slice around the minimum]

SDCA for Deep Learning

For $\phi_i$ being non-convex, the Fenchel conjugate often becomes meaningless. But our analysis implies that an approximate dual update suffices, that is,

$$\alpha_i^{(t)} = \alpha_i^{(t-1)} - \eta \lambda n \left( \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$$

The relation between primal and dual vectors is

$$w^{(t-1)} = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^{(t-1)}$$

and therefore the corresponding primal update is:

$$w^{(t)} = w^{(t-1)} - \eta \left( \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$$

These updates can be implemented for deep learning as well and do not require calculating the Fenchel conjugate.
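The two updates above can be sketched on a toy convex problem. The example assumes the squared loss $\phi_i(w) = \frac{1}{2}(\langle w, x_i \rangle - y_i)^2$ so that the result can be compared with a closed-form solution; the function name, the zero initialization, and the step size used in the usage lines are illustrative assumptions rather than choices taken from the slides. The vectors $\alpha_i$ are stored as rows instead of columns.

```python
import numpy as np

def dual_free_sdca(X, y, lam, eta, n_iters, seed=0):
    """Run the approximate dual update and its matching primal update (squared loss assumed)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros((n, d))       # row i holds alpha_i
    w = np.zeros(d)                # stays equal to alpha.sum(axis=0) / (lam * n)
    for _ in range(n_iters):
        i = rng.integers(n)                     # coordinate chosen uniformly at random
        grad_i = (X[i] @ w - y[i]) * X[i]       # gradient of phi_i at the current w
        v = grad_i + alpha[i]
        alpha[i] -= eta * lam * n * v           # approximate dual update
        w -= eta * v                            # corresponding primal update
    return w, alpha

# Usage on synthetic data; rows are normalized so each phi_i is 1-smooth.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=50)
lam = 0.1
w, alpha = dual_free_sdca(X, y, lam, eta=0.25 / (1.0 + lam * 50), n_iters=20000)
w_star = np.linalg.solve(X.T @ X / 50 + lam * np.eye(5), X.T @ y / 50)
print(np.linalg.norm(w - w_star))
```

Only gradients of $\phi_i$ are needed, which is why the same loop applies when $\phi_i$ comes from backpropagation through a network; the price is storing one vector $\alpha_i$ per example.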

Intuition: Why SDCA is better than SGD

Recall that the SDCA primal update rule is

$$w^{(t)} = w^{(t-1)} - \eta \underbrace{\left( \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)}_{v^{(t)}}$$

and that $w^{(t-1)} = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^{(t-1)}$.

Observe: $v^{(t)}$ is an unbiased estimate of the gradient:

$$\mathbb{E}[v^{(t)} \mid w^{(t-1)}] = \frac{1}{n} \sum_{i=1}^n \left( \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right) = \nabla P(w^{(t-1)}) - \lambda w^{(t-1)} + \lambda w^{(t-1)} = \nabla P(w^{(t-1)})$$

Intuition: Why SDCA is better than SGD

The update step of both SGD and SDCA is $w^{(t)} = w^{(t-1)} - \eta v^{(t)}$ where

$$v^{(t)} = \begin{cases} \nabla\phi_i(w^{(t-1)}) + \lambda w^{(t-1)} & \text{for SGD} \\ \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} & \text{for SDCA} \end{cases}$$

In both cases $\mathbb{E}[v^{(t)} \mid w^{(t-1)}] = \nabla P(w^{(t-1)})$.

What about the variance?
- For SGD, even if $w^{(t-1)} = w^*$, the variance of $v^{(t)}$ is still constant
- For SDCA, we'll show that the variance of $v^{(t)}$ goes to zero as $w^{(t-1)} \to w^*$

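SDCA as variance reduction technique (Johnson & Zhang)

For $w^*$ we have $\nabla P(w^*) = 0$, which means

$$\frac{1}{n} \sum_{i=1}^n \nabla\phi_i(w^*) + \lambda w^* = 0 \;\Rightarrow\; w^* = \frac{1}{\lambda n} \sum_{i=1}^n \underbrace{\left( -\nabla\phi_i(w^*) \right)}_{\alpha_i^*} = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^*$$

Therefore, if $\alpha_i^{(t-1)} \to \alpha_i^*$ and $w^{(t-1)} \to w^*$ then the update vector satisfies

$$v^{(t)} = \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \to \nabla\phi_i(w^*) + \alpha_i^* = 0$$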
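The contrast between the two update vectors at the optimum can be computed directly. The sketch below assumes the squared loss $\phi_i(w) = \frac{1}{2}(\langle w, x_i \rangle - y_i)^2$ so that $w^*$ has a closed form; the function name is hypothetical.

```python
import numpy as np

def update_vectors_at_optimum(X, y, lam):
    """Return the SGD and SDCA update vectors v for every example i, evaluated at w*."""
    n, d = X.shape
    w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    grads = (X @ w_star - y)[:, None] * X     # row i is the gradient of phi_i at w*
    alpha_star = -grads                       # alpha_i* = -grad phi_i(w*)
    v_sgd = grads + lam * w_star              # SGD: zero on average, nonzero per example
    v_sdca = grads + alpha_star               # SDCA: zero for every example
    return v_sgd, v_sdca

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
v_sgd, v_sdca = update_vectors_at_optimum(X, y, lam=0.1)
print(np.mean(np.sum(v_sgd ** 2, axis=1)), np.mean(np.sum(v_sdca ** 2, axis=1)))
```

Both update vectors average to zero over $i$, but only the SDCA one vanishes example by example, which is the variance reduction the slide describes.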

Issues with SDCA for Deep Learning

- Needs to maintain the matrix $\alpha$
- Can significantly reduce storage if working with mini-batches
- Another approach is the SVRG algorithm of Johnson and Zhang

Accelerated SDCA

Nesterov's Accelerated (deterministic) Gradient Descent (AGD): combine 2 gradients to accelerate the convergence rate

- AGD runtime: $\tilde{O}\left( d\, n \sqrt{\frac{1}{\lambda}} \right)$
- SDCA runtime: $\tilde{O}\left( d \left( n + \frac{1}{\lambda} \right) \right)$

Can we accelerate SDCA? Yes! The main idea is to iterate:
- Use SDCA to approximately minimize $P_t(w) = P(w) + \frac{\kappa}{2} \|w - y^{(t-1)}\|^2$
- Update $y^{(t)} = w^{(t)} + \beta (w^{(t)} - w^{(t-1)})$

Accelerated SDCA runtime: $\tilde{O}\left( d \left( n + \sqrt{\frac{n}{\lambda}} \right) \right)$
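The outer iteration can be sketched on a quadratic objective $P(w) = \frac{1}{2} w^\top H w - b^\top w$. To keep the example short, the inner problem is solved exactly by a linear solve, which is a stand-in for the approximate SDCA solve; the values of $\kappa$ and $\beta$ in the usage lines are arbitrary illustrative choices, not the ones prescribed by the analysis, and the function name is hypothetical.

```python
import numpy as np

def accelerated_outer_loop(H, b, kappa, beta, n_outer):
    """Outer loop: minimize P_t(w) = P(w) + (kappa/2)||w - y||^2, then extrapolate y."""
    d = len(b)
    w_prev = np.zeros(d)
    y = np.zeros(d)
    A = H + kappa * np.eye(d)
    for _ in range(n_outer):
        # Exact minimizer of P_t: solves H w - b + kappa * (w - y) = 0 (stand-in for SDCA).
        w = np.linalg.solve(A, b + kappa * y)
        y = w + beta * (w - w_prev)           # y^(t) = w^(t) + beta * (w^(t) - w^(t-1))
        w_prev = w
    return w_prev

# Usage: a ridge-regression objective written in quadratic form.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
targets = rng.normal(size=40)
lam = 0.1
H = X.T @ X / 40 + lam * np.eye(4)
b = X.T @ targets / 40
w_acc = accelerated_outer_loop(H, b, kappa=1.0, beta=0.5, n_outer=200)
print(np.linalg.norm(w_acc - np.linalg.solve(H, b)))
```

The added term $\frac{\kappa}{2}\|w - y^{(t-1)}\|^2$ makes each inner problem better conditioned, and the extrapolation step compensates for the bias it introduces.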

Experimental Demonstration

Smoothed hinge-loss with $\ell_1$, $\ell_2$ regularization, $\lambda = 10^{-7}$:

[Figure: three panels (astro-ph, cov1, CCAT), each comparing AccProxSDCA, ProxSDCA, and FISTA]

SelfieBoost Motivation

[Figure: objective (log scale) vs. # of backpropagation ($\times 10^7$)]

Why is SGD slow at the end?
- High variance, even close to the optimum
- Rare mistakes: Suppose all but 1% of the examples are correctly classified. SGD will now waste 99% of its time on examples that the model already classifies correctly

SelfieBoost Motivation

- For simplicity, consider a binary classification problem in the realizable case
- For a fixed $\epsilon_0$ (not too small), a few SGD iterations find a solution with $P(w) - P(w^*) \le \epsilon_0$
- However, for a small $\epsilon$, SGD requires many iterations
- Smells like we need to use boosting...

First idea: learn an ensemble using AdaBoost

- Fix $\epsilon_0$ (say 0.05), and assume SGD can find a solution with error $< \epsilon_0$ quite fast
- Let's apply AdaBoost with the SGD learner as a weak learner:
  - At iteration $t$, we sub-sample a training set based on a distribution $D_t$ over $[n]$
  - We feed the sub-sample to an SGD learner and get a weak classifier $h_t$
  - Update $D_{t+1}$ based on the predictions of $h_t$
  - The output of AdaBoost is an ensemble with prediction $\sum_{t=1}^T \alpha_t h_t(x)$
- The celebrated Freund & Schapire theorem states that if $T = O(\log(1/\epsilon))$ then the error of the ensemble classifier is at most $\epsilon$
- Observe that each boosting iteration involves calling SGD on a relatively small data set, and updating the distribution on the entire big data set. The latter step can be performed in parallel
- Disadvantage of learning an ensemble: at prediction time, we need to apply many networks

Boosting the Same Network

Can we obtain boosting-like convergence, while learning a single network?

The SelfieBoost Algorithm:
- Start with an initial network $f_1$
- At iteration $t$, define weights over the $n$ examples according to $D_i \propto e^{-y_i f_t(x_i)}$
- Sub-sample a training set $S \sim D$
- Use SGD for approximately solving the problem
  $$f_{t+1} \approx \operatorname{argmin}_g \sum_{i \in S} y_i (f_t(x_i) - g(x_i)) + \frac{1}{2} \sum_{i \in S} (g(x_i) - f_t(x_i))^2$$
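The ingredients of one SelfieBoost iteration can be sketched on arrays of network outputs. The sketch treats $f_t(x_i)$ and $g(x_i)$ as precomputed score vectors, assumes labels in $\{-1, +1\}$, and samples with replacement; the function names and those choices are illustrative assumptions, and the SGD training of $g$ itself is not shown.

```python
import numpy as np

def selfieboost_weights(f_scores, y):
    """D_i proportional to exp(-y_i * f_t(x_i)), normalized to sum to 1."""
    z = -y * f_scores
    z = z - z.max()                 # shift for numerical stability; cancels after normalizing
    D = np.exp(z)
    return D / D.sum()

def subsample(D, m, rng):
    """Draw m example indices according to the distribution D (with replacement)."""
    return rng.choice(len(D), size=m, replace=True, p=D)

def selfieboost_objective(g_scores, f_scores, y):
    """sum_i y_i (f_t(x_i) - g(x_i)) + 0.5 * sum_i (g(x_i) - f_t(x_i))^2 on the sub-sample."""
    diff = g_scores - f_scores
    return np.sum(-y * diff) + 0.5 * np.sum(diff ** 2)

rng = np.random.default_rng(0)
y = np.array([1.0, 1.0, -1.0, -1.0])
f_scores = np.array([2.0, -1.0, -3.0, 0.5])    # examples 1 and 3 are misclassified
D = selfieboost_weights(f_scores, y)
S = subsample(D, m=3, rng=rng)
print(D, S, selfieboost_objective(f_scores[S] + y[S], f_scores[S], y[S]))
```

Taking $g = f_t$ gives objective value 0, while moving each score by $y_i$ gives $-1/2$ per sampled example, so a negative objective value means $g$ improves the margins on the examples that currently carry the most weight.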

Analysis of the SelfieBoost Algorithm

Lemma: At each iteration, with high probability over the choice of $S$, there exists a network $g$ with objective value of at most $-1/4$.

Theorem: If at each iteration, the SGD algorithm finds a solution with objective value of at most $-\rho$, then after $\frac{\log(1/\epsilon)}{\rho}$ SelfieBoost iterations the error of $f_t$ will be at most $\epsilon$.

To summarize: we have obtained $\log(1/\epsilon)$ convergence assuming that the SGD algorithm can solve each sub-problem to a fixed accuracy (which seems to hold in practice).

Summary

- SGD converges quickly to an o.k. solution, but then slows down:
  1. High variance even at $w^*$
  2. Wastes time on already solved cases
- There's a need for stochastic methods that have similar per-iteration complexity as SGD but converge faster:
  1. SDCA reduces variance and therefore converges faster
  2. SelfieBoost focuses on the hard cases
- Future Work and Open Questions:
  - Evaluate the empirical performance of SDCA and SelfieBoost for challenging deep learning tasks
  - Bridge the gap between empirical success of SGD and worst-case hardness results