Lock-Free Approaches to Parallelizing Stochastic Gradient Descent. Benjamin Recht, Department of Computer Sciences, University of Wisconsin-Madison, with Feng Niu, Christopher Ré, and Stephen Wright.
$\text{minimize}_x \; f(x) = \frac{1}{N}\sum_{j=1}^{N} f_j(x)$
Incremental Stochastic Gradient Descent: for each $k$, choose $j_k$ and compute $x_{k+1} = x_k - \alpha_k \nabla f_{j_k}(x_k)$, where $\alpha_k$ is the stepsize.
Robbins and Monro (1950). Least Mean Squares + Recursive Estimators (1960s-1990s). Back Propagation in Neural Networks (1980s). NIPS (2007ish).
Example: Computing the mean. $\text{minimize} \sum_{k=1}^{4} (x-k)^2$, stepsize $1/(2k)$:
$x_0 = 0$
$x_1 = x_0 - (x_0 - 1) = 1$
$x_2 = x_1 - (x_1 - 2)/2 = 1.5$
$x_3 = x_2 - (x_2 - 3)/3 = 2$
$x_4 = x_3 - (x_3 - 4)/4 = 2.5$
In general, if we minimize $\sum_{k=1}^{N} (x - z_k)^2$, SGD returns $x_N = \frac{1}{N}\sum_{k=1}^{N} z_k$.
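The recursion on this slide is easy to verify numerically. A minimal sketch (the function name sgd_mean is mine):

    # SGD on sum_k (x - z_k)^2 with stepsize 1/(2k) computes the running mean.
    def sgd_mean(z):
        x = 0.0
        for k, zk in enumerate(z, start=1):
            grad = 2.0 * (x - zk)      # gradient of (x - z_k)^2
            x -= grad / (2.0 * k)      # stepsize 1/(2k)
        return x

    print(sgd_mean([1, 2, 3, 4]))      # 2.5, the mean of 1, 2, 3, 4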
$\text{minimize} \sum_{k=0}^{9} \left(\cos\left(\tfrac{\pi k}{10}\right) x_1 + \sin\left(\tfrac{\pi k}{10}\right) x_2\right)^2 = 5x_1^2 + 5x_2^2$
Stepsize $= 1/2$; $\nabla f_j(x) = 2(c_j x_1 + s_j x_2)\begin{bmatrix} c_j \\ s_j \end{bmatrix}$ with $c_j = \cos(\pi j/10)$, $s_j = \sin(\pi j/10)$.
[Figure: SGD iterates in the $(x_1, x_2)$ plane. Left panel: choose directions in order. Right panel: choose a direction uniformly with replacement.]
Convergence Rates
$\text{minimize}_x \; f(x) = \frac{1}{N}\sum_{j=1}^{N} f_j(x)$; choose a direction uniformly with replacement.
ASSUME: $cI \preceq \nabla^2 f \preceq LI$, $\|\nabla f_j(x)\|^2 \leq M^2$, $D_0 = \|x_0 - x_{\text{opt}}\|^2$.
Diminishing stepsizes (Nemirovski et al., 2009, or 1983): $\gamma_k = \frac{\Theta}{2ck}$ with $\Theta > 1$ gives
$\mathbb{E}\|x_k - x_{\text{opt}}\|^2 \leq \frac{1}{k} \max\left\{ \frac{M^2}{c^2}\cdot\frac{\Theta^2}{4\Theta - 4},\; D_0 \right\}$.
Constant stepsizes: $\gamma_k = \frac{\vartheta \epsilon c}{2M^2}$ with $\vartheta < 1$ gives $\mathbb{E}\|x_k - x_{\text{opt}}\|^2 \leq \epsilon$ after $k \geq \frac{2M^2}{c^2 \vartheta \epsilon}\log\left(\frac{4D_0}{\epsilon}\right)$ iterations; then reduce the stepsize by a factor $\beta$, paying an extra $\frac{\log(2/\beta)}{4(1-\beta)}$ factor of iterations per reduction.
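The two stepsize regimes are easy to compare on a toy problem. A sketch (my own code; the problem, constants, and epoch length are illustrative assumptions, not the slide's):

    import numpy as np

    # SGD on f(x) = (1/N) sum_j (x - z_j)^2, which is c-strongly convex with c = 2.
    rng = np.random.default_rng(0)
    z = rng.normal(size=1000)
    c, Theta, beta = 2.0, 2.0, 0.5

    def run(schedule, iters=20000):
        x = 5.0
        for k in range(1, iters + 1):
            g = 2.0 * (x - rng.choice(z))
            if schedule == "1/k":
                gamma = Theta / (2.0 * c * k)        # diminishing stepsize
            else:
                gamma = 0.1 * beta ** (k // 5000)    # constant, cut by beta each epoch
            x -= gamma * g
        return abs(x - z.mean())

    print("1/k schedule:      ", run("1/k"))
    print("constant + backoff:", run("const"))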
$\text{minimize}_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{N}\sum_{j=1}^{N} f_j(x)$

Algorithm   Time per iteration    Error after T iterations   Error after N items
Newton      $O(Nd^2 + d^3)$       $C_N^{2^T}$                $C_N^2$
Gradient    $O(Nd)$               $C_G^T$                    $C_G$
SGD         $O(d)$ (or constant)  $C_S/T$                    $C_S/N$
SGD and BIG Data
$\text{minimize}_x \; f(x) = \frac{1}{N}\sum_{j=1}^{N} f_j(x)$
Stochastic Gradient Descent: for each $k$, choose $j_k$ and compute $x_{k+1} = x_k - \alpha_k \nabla f_{j_k}(x_k)$, where $\alpha_k$ is the stepsize.
Ideal for big data analysis:
robustness against noise in data
rapid learning rates
small, predictable memory footprint
Is SGD inherently Serial? How to parallelize SGD?
Master/Worker (Bertsekas and Tsitsiklis, 1985)
Round Robin (Langford et al., 2009)
Average Runs (Zinkevich et al., 2010)
Average Gradients (Duchi et al.; Dekel et al., 2010)
All require massive overhead due to lock contention and synchronization.
Don't lock! Don't communicate! What happens when we run parallel instances of SGD without locks?
Sparse Function: $f(x) = \sum_{e \in E} f_e(x_e)$
Hypergraph $G = (V, E)$: $V$ indexes the coordinates of $x \in \mathbb{R}^D$; $v \in e$ for $e \in E$ if $f_e$ depends on $x_v$.
Graph statistics:
$\Omega := \max_{e \in E} |e|$ (maximum edge size)
$\Delta := \frac{\max_{v \in V} |\{e \in E : v \in e\}|}{|E|}$ (maximum degree)
$\rho := \frac{\max_{e \in E} |\{\hat{e} \in E : \hat{e} \cap e \neq \emptyset\}|}{|E|}$ (maximum edge degree)
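These three statistics are cheap to compute from a hyperedge list. A small sketch (my own code, quadratic in |E|, which is fine for an illustration):

    from collections import Counter

    def graph_stats(edges):
        E = len(edges)
        sets = [set(e) for e in edges]
        omega = max(len(s) for s in sets)          # maximum edge size
        deg = Counter(v for s in sets for v in s)
        delta = max(deg.values()) / E              # maximum degree / |E|
        rho = max(sum(1 for f in sets if s & f)    # edges meeting e, incl. e
                  for s in sets) / E               # maximum edge degree / |E|
        return omega, delta, rho

    print(graph_stats([(0, 1), (1, 2), (2, 3)]))   # (2, 2/3, 1.0)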
Sparse Support Vector Machines
[Figure: positively and negatively labeled points separated by a hyperplane.]
$\text{minimize}_x \sum_{\alpha \in E} \max(1 - y_\alpha x^T z_\alpha, 0) + \lambda \|x\|_2^2$ (sparse vectors $z_\alpha$)
$\Rightarrow \text{minimize}_x \sum_{\alpha \in E} \left( \max(1 - y_\alpha x^T z_\alpha, 0) + \lambda \sum_{u \in e_\alpha} \frac{x_u^2}{d_u} \right)$, where the $d_u$ are edge degrees.
$\Omega = \max_\alpha \|z_\alpha\|_0$, $\Delta = \max_u d_u / |E|$, $\rho \in (0, 1]$
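The rescaled objective implies a per-example update that touches only the example's nonzero coordinates. A minimal sketch (my own code; the (idx, vals) sparse representation and variable names are assumptions):

    import numpy as np

    # One SGD step for sparse example (idx, vals) with label y in {-1, +1}.
    # d[u] counts how many examples touch coordinate u, so the regularizer
    # is spread across the examples that share u.
    def svm_sgd_step(x, idx, vals, y, lam, d, gamma):
        margin = y * np.dot(x[idx], vals)
        for j, u in enumerate(idx):
            g = 2.0 * lam * x[u] / d[u]       # rescaled regularizer
            if margin < 1.0:                  # hinge-loss subgradient
                g -= y * vals[j]
            x[u] -= gamma * g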
Matrix Completion
$M = L R^T$, where $M$ is $k \times n$, $L$ is $k \times r$, and $R^T$ is $r \times n$. Entries specified on a set $E$.
$\text{minimize}_X \sum_{(u,v) \in E} (X_{uv} - M_{uv})^2 + \mu \|X\|_*$
Idea: approximate with $X = LR^T$:
$\text{minimize}_{(L,R)} \sum_{(u,v) \in E} (L_u R_v^T - M_{uv})^2 + \mu \sum_u \|L_u\|_F^2 + \mu \sum_v \|R_v\|_F^2$
$\Omega = 2r$, $\Delta = O(\log(n)/n)$, $\rho = O(\log(n)/n)$
Graph Cuts: Image Segmentation, Entity Resolution, Topic Modeling
$\text{minimize}_x \sum_{(u,v) \in E} w_{uv} \|x_u - x_v\|_1$
subject to $\mathbf{1}_K^T x_v = 1$, $x_v \geq 0$, for $v = 1, \ldots, D$
$\Omega = 2K$, $\Delta = d_{\max}/|E|$, $\rho = 2 d_{\max}/|E|$
HOGWILD!
Run SGD in parallel without locks. Each processor independently runs:
1. Sample $e$ uniformly from $E$
2. Read current state of $x_e$
3. For $v$ in $e$ do $x_v \leftarrow x_v - \alpha [\nabla f_e(x_e)]_v$
Only assume atomicity of $x_v \leftarrow x_v + a$.
Issues: updates can be very old; processors can overwrite each other's work.
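A structural sketch of this loop in Python (my own illustration, not the authors' implementation; grad_e is an assumed user-supplied sparse gradient, and CPython's GIL serializes these threads, so this shows the lock-free access pattern rather than real parallel speedups):

    import threading
    import numpy as np

    def hogwild_worker(x, edges, grad_e, gamma, iters, rng):
        for _ in range(iters):
            e = edges[rng.integers(len(edges))]   # 1. sample e from E
            xe = x[list(e)].copy()                # 2. read current state of x_e
            g = grad_e(e, xe)
            for v, gv in zip(e, g):               # 3. unlocked per-coordinate
                x[v] -= gamma * gv                #    updates, assumed atomic

    def hogwild(x, edges, grad_e, gamma=0.01, iters=10000, nthreads=4):
        threads = [threading.Thread(
            target=hogwild_worker,
            args=(x, edges, grad_e, gamma, iters, np.random.default_rng(t)))
            for t in range(nthreads)]
        for t in threads: t.start()
        for t in threads: t.join()
        return x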
Convergence Theory
$\Omega := \max_{e \in E} |e|$, $\Delta := \frac{\max_{v \in V} |\{e \in E : v \in e\}|}{|E|}$, $\rho := \frac{\max_{e \in E} |\{\hat{e} \in E : \hat{e} \cap e \neq \emptyset\}|}{|E|}$
ASSUME: $cI \preceq \nabla^2 f \preceq LI$, $\|\nabla f_e(x)\|^2 \leq M^2$, $D_0 = \|x_0 - x_{\text{opt}}\|^2$.
ASSUME: the longest delay between an update and a memory read is $\tau$.
Choose $k \geq \frac{2M^2 (1 + 6\tau\rho + 6\tau^2 \Omega \Delta^{1/2}) \log(LD_0/\epsilon)}{c^2 \epsilon}$.
Then after $k$ gradient updates, with an appropriate choice of constant stepsize, $\mathbb{E}[\|x_k - x_{\text{opt}}\|^2] \leq \epsilon$.
Hogs gone wild!

         data set   size (GB)   ρ         Δ         time (s)   speedup
SVM      RCV1       0.9         4.4E-01   1.0E+00   10         4.5
MC       Netflix    1.5         2.5E-03   2.3E-03   301        5.3
MC       KDD        3.9         3.0E-03   1.8E-03   878        5.2
MC       JUMBO      30          2.6E-07   1.4E-07   9,454      6.8
CUTS     DBLife     0.003       8.6E-03   4.3E-03   230        8.8
CUTS     Abdomen    18          9.2E-04   9.2E-04   1,181      4.1

Experiments run on a 12 core machine: 10 cores for gradients, 2 cores for data shuffling.
Speedups
[Figure: three panels plotting speedup against number of splits (0-10) for Hogwild, AIG, and RR, on SVM RCV1, MC Netflix, and CUTS Abdomen.]
Experiments run on a 12 core machine: 10 cores for gradients, 2 cores for data shuffling.
JELLYFISH
SGD for Matrix Factorizations. Example:
$\text{minimize}_X \sum_{(u,v) \in E} (X_{uv} - M_{uv})^2 + \mu \|X\|_*$
Idea: approximate with $X = LR^T$:
$\text{minimize}_{(L,R)} \sum_{(u,v) \in E} (L_u R_v^T - M_{uv})^2 + \mu \sum_u \|L_u\|_F^2 + \mu \sum_v \|R_v\|_F^2$
Step 1: Pick $(u,v) \in E$ and compute the residual: $e = L_u R_v^T - M_{uv}$
Step 2: Take a gradient step:
$L_u \leftarrow (1 - \gamma\mu_u) L_u - \gamma e R_v$
$R_v \leftarrow (1 - \gamma\mu_v) R_v - \gamma e L_u$
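The two steps in code (a minimal sketch; the folded-in per-row weights mu_u, mu_v and the argument names are my assumptions):

    import numpy as np

    # One stochastic gradient step for a single observed entry (u, v).
    def jellyfish_step(L, R, u, v, Muv, gamma, mu_u, mu_v):
        e = L[u] @ R[v] - Muv                              # Step 1: residual
        Lu_new = (1 - gamma * mu_u) * L[u] - gamma * e * R[v]
        R[v] = (1 - gamma * mu_v) * R[v] - gamma * e * L[u]
        L[u] = Lu_new                                      # Step 2: simultaneous update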
Mixture of hundreds of models, including nuclear norm (a.k.a. SVD).
JELLYFISH
Observation: with-replacement sampling = poor locality.
Idea: bias the sample to improve locality.
Algorithm: shuffle the data.
1. Process $\{L_1 R_1, L_2 R_2, L_3 R_3\}$ in parallel
2. Process $\{L_1 R_2, L_2 R_3, L_3 R_1\}$ in parallel
3. Process $\{L_1 R_3, L_2 R_1, L_3 R_2\}$ in parallel
Big wins: No locks. No overwrites. Strong locality.
Example Optimization Shuffle all the rows and columns Apply block partitioning Train on each block independently Repeat...
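A sketch of this shuffle-and-partition schedule (my own code; p is the assumed number of parallel workers):

    import numpy as np

    # Permute rows and columns, tile into p x p blocks, and sweep p rounds.
    # The p blocks in each round share no rows and no columns, so they can
    # be trained in parallel with no locks and no overwrites.
    def jellyfish_rounds(n_rows, n_cols, p, rng):
        row_blocks = np.array_split(rng.permutation(n_rows), p)
        col_blocks = np.array_split(rng.permutation(n_cols), p)
        for shift in range(p):
            yield [(row_blocks[i], col_blocks[(i + shift) % p])
                   for i in range(p)]

    for blocks in jellyfish_rounds(9, 9, 3, np.random.default_rng(0)):
        pass  # dispatch each (rows, cols) block in this round to a worker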
JELLYFISH results
100x faster than standard solvers.
25% speedup over HOGWILD! on a 12 core machine.
3x speedup on a 40 core machine.
Netflix data set: 100M examples, 17,770 rows, 480,189 columns.
Theory: treats without-replacement sampling like deterministic sampling.
Practice: without-replacement sampling is always faster.
Open question: Why?
Simplest Problem? Least Squares
$\text{minimize}_x \sum_{k=1}^{N} (a_k^T x - b_k)^2$
Assume overdetermined and consistent: $a_k^T x_{\text{opt}} = b_k$ for all $k$.
Fixed order: $x_N = x_{\text{opt}} + \prod_{k=1}^{N} (I - \gamma a_k a_k^T)(x_0 - x_{\text{opt}})$
With replacement: $\mathbb{E}[x_N] = x_{\text{opt}} + \left(I - \frac{\gamma}{N}\sum_{k=1}^{N} a_k a_k^T\right)^N (x_0 - x_{\text{opt}})$
Which is better?
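A quick Monte Carlo look at this question (my own toy construction, not from the talk): compare the contraction of the error through the fixed-order product against randomly sampled orders.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, gamma, trials = 50, 10, 0.3, 2000
    A = rng.normal(size=(N, d)) / np.sqrt(d)      # rows a_k with ||a_k|| ~ 1

    def contraction(order):
        P = np.eye(d)
        for k in order:
            P = (np.eye(d) - gamma * np.outer(A[k], A[k])) @ P
        return np.linalg.norm(P, 2)

    fixed = contraction(range(N))
    sampled = np.mean([contraction(rng.integers(N, size=N))
                       for _ in range(trials)])
    print("fixed order:      ", fixed)
    print("with replacement: ", sampled)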
Simple Question
Given $A_1, \ldots, A_N \succeq 0$, $D \times D$, is it true that $\left\|\prod_{i=1}^{N} A_i\right\| \leq \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N$?
If the $A_i$ are scalars, this is always true by the arithmetic-geometric mean inequality. Is it true for matrices?
True for $N = 2$ (Bhatia and Kittaneh, 1990).
What about generically?
Given $A_1, \ldots, A_N \succeq 0$, $D \times D$, is it true that $\left\|\prod_{i=1}^{N} A_i\right\| \leq \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N$ if the $A_i$ are sampled from the normal distribution: $g_i \sim \mathcal{N}(0, \frac{1}{D} I_D)$, $A_i = g_i g_i^T$, so that $\mathbb{E}[A_i] = \frac{1}{D} I_D$?
The moments of the product $Q_N = \prod_{i=1}^{N} A_i$ can be evaluated in closed form as a hypergeometric series in powers of $D/N$, and generic draws satisfy the inequality.
Simple Question
Given $A_1, \ldots, A_N \succeq 0$, $D \times D$, is it true that $\left\|\prod_{i=1}^{N} A_i\right\| \leq \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N$?
If the $A_i$ are scalars, this is always true by the arithmetic-geometric mean inequality. Is it true for matrices?
True for $N = 2$ (Bhatia and Kittaneh, 1990). True for generic $A_i$.
Straightforward bound: $\left\|\prod_{i=1}^{N} A_i\right\| \leq D^N \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N$
$\log\left\|\prod_{i=1}^{N} A_i\right\| \leq \sum_{i=1}^{N} \log\|A_i\| \leq \sum_{i=1}^{N} \log \mathrm{Tr}(A_i) = N \cdot \frac{1}{N}\sum_{i=1}^{N} \log \mathrm{Tr}(A_i) \leq N \log\left(\frac{1}{N}\sum_{i=1}^{N} \mathrm{Tr}(A_i)\right) = N \log \mathrm{Tr}\left(\frac{1}{N}\sum_{i=1}^{N} A_i\right) \leq N \log\left(D \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|\right)$
(The middle inequality is Jensen's, by concavity of $\log$; the last step uses $\mathrm{Tr}(A) \leq D\|A\|$.)
Given $A_1, \ldots, A_N \succeq 0$, $D \times D$, is it true that $\left\|\prod_{i=1}^{N} A_i\right\| \leq \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N$?
No! Counterexample:
$A_k = I + \begin{bmatrix} \cos\frac{2\pi k}{N} & \sin\frac{2\pi k}{N} \\ \sin\frac{2\pi k}{N} & -\cos\frac{2\pi k}{N} \end{bmatrix}$, with $\frac{1}{N}\sum_{i=1}^{N} A_i = I$
$\left\|\prod_{i=1}^{N} A_i\right\| = 2^N \cos(\pi/N)^{N-1}$, which saturates the bound $\left\|\prod_{i=1}^{N} A_i\right\| \leq D^N \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N = 2^N$.
Similar counterexamples for all $N$ and $D > 2$.
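A numeric check of this counterexample (my own code; the closed form in the comment follows from writing each $A_k$ as twice a rank-one projection):

    import numpy as np

    # Each A_k = 2 v_k v_k^T with v_k at angle pi*k/N, so the ordered product
    # has norm 2^N cos(pi/N)^(N-1) -> 2^N, while the average of the A_k is I.
    def counterexample(N):
        P, S = np.eye(2), np.zeros((2, 2))
        for k in range(1, N + 1):
            t = 2 * np.pi * k / N
            Ak = np.eye(2) + np.array([[np.cos(t), np.sin(t)],
                                       [np.sin(t), -np.cos(t)]])
            P = Ak @ P
            S += Ak / N
        return np.linalg.norm(P, 2), np.linalg.norm(S, 2)

    for N in (4, 8, 16):
        prod, avg = counterexample(N)
        print(N, prod, 2.0 ** N * np.cos(np.pi / N) ** (N - 1), avg)  # avg = 1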
$\text{minimize} \sum_{k=0}^{9} \left(\cos\left(\tfrac{\pi k}{10}\right) x_1 + \sin\left(\tfrac{\pi k}{10}\right) x_2\right)^2 = 5x_1^2 + 5x_2^2$
Stepsize $= 1/2$; $\nabla f_j(x) = 2(c_j x_1 + s_j x_2)\begin{bmatrix} c_j \\ s_j \end{bmatrix}$ with $c_j = \cos(\pi j/10)$, $s_j = \sin(\pi j/10)$.
[Figure: SGD iterates in the $(x_1, x_2)$ plane. Left panel: choose directions in order. Right panel: choose a direction uniformly with replacement.]
But what about on average?
Given $A_1, \ldots, A_N \succeq 0$, $D \times D$, is it true that $\left\|\frac{1}{N!}\sum_{\sigma \in S_N} \prod_{i=1}^{N} A_{\sigma(i)}\right\| \leq \left\|\frac{1}{N}\sum_{i=1}^{N} A_i\right\|^N$?
Using the counterexample $A_k = I + \begin{bmatrix} \cos\frac{2\pi k}{N} & \sin\frac{2\pi k}{N} \\ \sin\frac{2\pi k}{N} & -\cos\frac{2\pi k}{N} \end{bmatrix}$ (with $\frac{1}{N}\sum_{i=1}^{N} A_i = I$): $\left\|\frac{1}{N!}\sum_{\sigma \in S_N} \prod_{i=1}^{N} A_{\sigma(i)}\right\|$ evaluates to a $_2F_3$ hypergeometric expression, and it is $O(1)$.
Yet to find a counterexample for averaging!
Is there a noncommutative arithmetic-geometric mean inequality?
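A brute-force check of the averaged quantity on the same matrices (my own code): the norm of the permutation-averaged product stays small even though each single ordering has norm close to $2^N$.

    import numpy as np
    from itertools import permutations
    from math import factorial

    def counterexample_matrices(N):
        ts = [2 * np.pi * k / N for k in range(1, N + 1)]
        return [np.eye(2) + np.array([[np.cos(t), np.sin(t)],
                                      [np.sin(t), -np.cos(t)]]) for t in ts]

    for N in (4, 6, 8):
        A = counterexample_matrices(N)
        S = np.zeros((2, 2))
        for sigma in permutations(range(N)):
            P = np.eye(2)
            for i in sigma:
                P = A[i] @ P
            S += P
        print(N, np.linalg.norm(S / factorial(N), 2))   # stays O(1)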
Summary
Practical Lessons: Don't lock! Locality matters.
Theory: bias in sampling for better data access. Understanding is only beginning here:
Noncommutative arithmetic-geometric mean in full generality
Biased orderings of SGD
Automatic identification of locality
Acknowledgements
See http://pages.cs.wisc.edu/~brecht/publications.html for all references.
Reports:
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Niu, Recht, Ré, and Wright. 2011. Code: http://pages.cs.wisc.edu/~chrisre/hogwild.bz2
Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Recht and Ré. 2011. Code: http://pages.cs.wisc.edu/~chrisre/jellyfish.bz2