Lock-Free Approaches to Parallelizing Stochastic Gradient Descent


Lock-Free Approaches to Parallelizing Stochastic Gradient Descent
Benjamin Recht, Department of Computer Sciences, University of Wisconsin-Madison
with Feng Niu, Christopher Ré, and Stephen Wright

minimize_x  f(x) = (1/N) Σ_{j=1}^{N} f_j(x)

Incremental Stochastic Gradient Descent: for each k, choose j_k and compute

x_{k+1} = x_k − α_k ∇f_{j_k}(x_k),    with stepsize α_k.

Robbins and Monro (1951). Least Mean Squares + Recursive Estimators (1960s-1990s). Back Propagation in Neural Networks (1980s). NIPS (2007ish).
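As a concrete reference point, here is a minimal sketch of the incremental update above in Python/NumPy; the function names (`sgd`, `grad_fj`) and the quadratic toy losses are illustrative choices, not part of the original slides.

```python
import numpy as np

def sgd(grad_fj, N, x0, stepsize, iters, seed=0):
    """Incremental SGD: x_{k+1} = x_k - alpha_k * grad f_{j_k}(x_k)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(1, iters + 1):
        j = rng.integers(N)               # choose j_k uniformly at random
        x -= stepsize(k) * grad_fj(j, x)  # take the incremental gradient step
    return x

# Toy losses f_j(x) = 0.5 * ||x - z_j||^2, so grad f_j(x) = x - z_j;
# SGD then drifts toward the mean of the z_j.
z = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x_hat = sgd(lambda j, x: x - z[j], N=len(z), x0=np.zeros(2),
            stepsize=lambda k: 1.0 / k, iters=10_000)
print(x_hat)  # close to [3.0, 4.0]
```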

Example: Computing the mean.

minimize (1/4) Σ_{k=1}^{4} (x − k)²,   stepsize α_k = 1/(2k)

x_0 = 0
x_1 = x_0 − (x_0 − 1) = 1
x_2 = x_1 − (x_1 − 2)/2 = 1.5
x_3 = x_2 − (x_2 − 3)/3 = 2
x_4 = x_3 − (x_3 − 4)/4 = 2.5

In general, if we minimize (1/N) Σ_{k=1}^{N} (x − z_k)², SGD returns x_N = (1/N) Σ_{k=1}^{N} z_k.
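The four iterates above can be replayed in a few lines; this is just a sketch of the slide's arithmetic, with the losses taken in order.

```python
# f_k(x) = (x - k)^2 with stepsize 1/(2k), terms taken in order k = 1..4.
x = 0.0
for k in range(1, 5):
    grad = 2.0 * (x - k)      # gradient of (x - k)^2 at the current x
    x -= grad / (2.0 * k)     # i.e. x <- x - (x - k)/k
    print(k, x)               # prints 1.0, 1.5, 2.0, 2.5: the running mean of 1..k
```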

minimize Σ_{k=0}^{9} (cos(πk/10) x_1 + sin(πk/10) x_2)² = 5 x_1² + 5 x_2²,   stepsize = 1/2

∇f_j(x) = 2 (c_j x_1 + s_j x_2) (c_j, s_j)^T,  where c_j = cos(πj/10), s_j = sin(πj/10)

[Figure: SGD iterates on this quadratic. Left panel: choose directions in order. Right panel: choose a direction uniformly with replacement.]

Convergence Rates

minimize_x  f(x) = (1/N) Σ_{j=1}^{N} f_j(x),   choosing a direction uniformly with replacement.

ASSUME: cI ⪯ ∇²f ⪯ LI,   ‖∇f_j(x)‖ ≤ M,   D_0 = ‖x_0 − x_opt‖².

Diminishing stepsizes (Nemirovski et al., 2009; going back to 1983): with γ_k = Θ/(2ck) and Θ > 1,

E‖x_k − x_opt‖² ≤ (1/k) max{ (M²/c²) Θ²/(4Θ − 4),  D_0 }.

Constant stepsizes: with γ_k = ϑ/(2c) and ϑ < 1, the error contracts geometrically to a noise floor of order ϑM²/c² after roughly k ≥ (1/ϑ) log(4 D_0 c²/(ϑ M²)) iterations; reducing the stepsize by a factor β and repeating (at a cost of about log(2/β)/(4(1 − β)) more work per round) drives the error to zero.

minimize_{x ∈ R^d}  f(x) = (1/N) Σ_{j=1}^{N} f_j(x)

Algorithm | Time per iteration  | Error after T iterations | Error after N items
Newton    | O(N d² + d³)        | C_N^(2^T)                | C_N²
Gradient  | O(N d)              | C_G^T                    | C_G
SGD       | O(d) (or constant)  | C_S / T                  | C_S / N

SGD and BIG Data

minimize_x  f(x) = (1/N) Σ_{j=1}^{N} f_j(x)

Stochastic Gradient Descent: for each k, choose j_k and compute x_{k+1} = x_k − α_k ∇f_{j_k}(x_k), with stepsize α_k.

Ideal for big data analysis: robustness against noise in data, rapid learning rates, small and predictable memory footprint.

Is SGD inherently serial? How to parallelize SGD?

Master/Worker (Bertsekas and Tsitsiklis, 1985). Round Robin (Langford et al., 2009). Average Runs (Zinkevich et al., 2010). Average Gradients (Duchi et al.; Dekel et al., 2010).

All require massive overhead due to lock contention and synchronization. Don't lock! Don't communicate! What happens when we run parallel instances of SGD without locks?

Sparse function: f(x) = Σ_{e ∈ E} f_e(x_e)

Hypergraph G = (V, E): V indexes the coordinates of x ∈ R^D, and a coordinate v belongs to an edge e ∈ E if f_e depends on x_v.

Graph statistics:
Ω := max_{e ∈ E} |e|                              (maximum edge size)
Δ := (1/|E|) max_v |{e ∈ E : v ∈ e}|              (maximum degree)
ρ := (1/|E|) max_{e ∈ E} |{ê ∈ E : ê ∩ e ≠ ∅}|    (maximum edge degree)
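The three statistics can be computed directly from the edge list; the sketch below (the function name `hypergraph_stats` is mine) follows the definitions above, with each edge given as a set of coordinate indices.

```python
def hypergraph_stats(edges, num_coords):
    """Compute (Omega, Delta, rho) for a hypergraph given as a list of
    coordinate sets, one set per edge, following the definitions above."""
    E = [frozenset(e) for e in edges]
    omega = max(len(e) for e in E)                                    # max edge size
    delta = max(sum(1 for e in E if v in e) for v in range(num_coords)) / len(E)
    rho = max(sum(1 for f in E if e & f) for e in E) / len(E)         # edges meeting e
    return omega, delta, rho

# Example: three terms touching overlapping coordinate sets.
print(hypergraph_stats([{0, 1}, {1, 2}, {3}], num_coords=4))  # (2, 2/3, 2/3)
```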

Sparse Support Vector Machines

[Figure: labeled + / − training examples.]

minimize_x  Σ_{α ∈ E} max(1 − y_α x^T z_α, 0) + λ ‖x‖²₂     (sparse example vectors z_α)

minimize_x  Σ_{α ∈ E} [ max(1 − y_α x^T z_α, 0) + λ Σ_{u ∈ e_α} x_u² / d_u ]     (regularizer spread using the coordinate degrees d_u)

Ω = max_α ‖z_α‖₀,   Δ = max_u d_u / |E|,   ρ ∈ (0, 1]
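Here is a sketch of one SGD step for a single hinge-loss term, touching only the support of the sparse example; the dictionary representation of z_α and the helper name are assumptions for illustration.

```python
def sparse_svm_step(x, z_alpha, y_alpha, d, lam, gamma):
    """One SGD step on max(1 - y_a x^T z_a, 0) + lam * sum_{u in supp(z_a)} x_u^2 / d_u,
    where z_alpha is a sparse dict {coordinate: value} and d[u] is the degree d_u."""
    margin = y_alpha * sum(x[u] * v for u, v in z_alpha.items())
    for u, v in z_alpha.items():
        g = 2.0 * lam * x[u] / d[u]   # the regularizer split across the terms touching u
        if margin < 1.0:
            g -= y_alpha * v          # subgradient of the hinge part
        x[u] -= gamma * g             # only coordinates in supp(z_alpha) are written
    return x

x = [0.0] * 5
x = sparse_svm_step(x, {0: 1.0, 3: -2.0}, y_alpha=+1, d={0: 3, 3: 1}, lam=0.1, gamma=0.5)
```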

Matrix Completion

M ≈ L R^T, with M of size k×n (entries specified on a set E), L of size k×r, R of size n×r.

minimize_X  Σ_{(u,v) ∈ E} (X_uv − M_uv)² + µ ‖X‖_*

Idea: approximate X ≈ L R^T and solve

minimize_{(L,R)}  Σ_{(u,v) ∈ E} [ (L_u R_v^T − M_uv)² + µ_u ‖L_u‖²_F + µ_v ‖R_v‖²_F ]

Ω = 2r,   Δ = O(log(n)/n),   ρ = O(log(n)/n)

Graph Cuts: image segmentation, entity resolution, topic modeling.

minimize_x  Σ_{(u,v) ∈ E} w_uv ‖x_u − x_v‖_1
subject to  1^T x_v = 1,  x_v ≥ 0,  with x_v ∈ R^K,  for v = 1, ..., D

Ω = 2K,   Δ = d_max / |E|,   ρ = 2 d_max / |E|

HOGWILD! Run SGD in parallel without locks. Each processor independently runs:

1. Sample e from E
2. Read the current state of x_e
3. For each v in e: x_v ← x_v − α [∇f_e(x_e)]_v

Only assume atomicity of the single-coordinate update x_v ← x_v − a.

Issues: updates can be very old, and processors can overwrite each other's work.
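A minimal lock-free sketch of this loop in Python: several threads hammer one shared NumPy array with no locks. The names and the toy least-squares terms are mine; Python's GIL serializes the bytecode, so this illustrates the access pattern rather than the speedup that the original C++ implementation delivers.

```python
import threading
import numpy as np

def hogwild(x, edges, grad_fe, gamma, steps, num_threads=4, seed=0):
    """Each thread: sample e, read the (possibly stale) coordinates x_e,
    and write x_v <- x_v - gamma * [grad f_e(x_e)]_v with no locking."""
    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(steps):
            e = edges[rng.integers(len(edges))]   # 1. sample e from E
            xe = x[list(e)]                       # 2. read current state of x_e
            g = grad_fe(e, xe)                    # 3. gradient of f_e at x_e
            for i, v in enumerate(e):
                x[v] -= gamma * g[i]              #    unsynchronized coordinate writes
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return x

# Toy sparse terms f_e(x) = (x_u + x_v - 1)^2 on a ring of coordinate pairs.
edges = [(i, (i + 1) % 100) for i in range(100)]
grad = lambda e, xe: 2.0 * (xe.sum() - 1.0) * np.ones(len(e))
x = hogwild(np.zeros(100), edges, grad, gamma=0.05, steps=2000)
print(x[:5])  # each coordinate drifts toward 0.5
```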

Convergence Theory

Ω := max_{e ∈ E} |e|,   Δ := (1/|E|) max_v |{e ∈ E : v ∈ e}|,   ρ := (1/|E|) max_{e ∈ E} |{ê ∈ E : ê ∩ e ≠ ∅}|

ASSUME: cI ⪯ ∇²f ⪯ LI,   ‖∇f_e(x)‖ ≤ M,   D_0 = ‖x_0 − x_opt‖².
ASSUME: the longest delay between an update and a memory read is τ.

Choose k ≥ 2M² (1 + 6τρ + 6τ²ΩΔ^{1/2}) log(L D_0 / ε) / (c² ε). Then after k gradient updates with an appropriate choice of constant stepsize, we have E[‖x_k − x_opt‖²] ≤ ε.

Hogs gone wild!

     | data set | size (GB) | ρ       | Δ       | time (s) | speedup
SVM  | RCV1     | 0.9       | 4.4E-01 | 1.0E+00 | 10       | 4.5
MC   | Netflix  | 1.5       | 2.5E-03 | 2.3E-03 | 30       | 5.3
     | KDD      | 3.9       | 3.0E-03 | 1.8E-03 | 878      | 5.2
     | JUMBO    | 30        | 2.6E-07 | 1.4E-07 | 9,454    | 6.8
CUTS | DBLife   | 0.003     | 8.6E-03 | 4.3E-03 | 230      | 8.8
     | Abdomen  | 18        | 9.2E-04 | 9.2E-04 | 1,181    | 4.1

Experiments run on a 12-core machine: 10 cores for gradients, 2 cores for data shuffling.

Speedups

[Figure: three panels of speedup vs. number of splits (up to 10), comparing Hogwild, AIG, and RR (round robin) on SVM RCV1, MC Netflix, and CUTS Abdomen.]

Experiments run on a 12-core machine: 10 cores for gradients, 2 cores for data shuffling.

JELLYFISH: SGD for Matrix Factorizations

Idea: approximate X ≈ L R^T in

minimize_X  Σ_{(u,v) ∈ E} (X_uv − M_uv)² + µ ‖X‖_*

and solve

minimize_{(L,R)}  Σ_{(u,v) ∈ E} [ (L_u R_v^T − M_uv)² + µ_u ‖L_u‖²_F + µ_v ‖R_v‖²_F ].

Step 1: Pick (u,v) ∈ E and compute the residual e = L_u R_v^T − M_uv.
Step 2: Take a gradient step:
  L_u ← (1 − γ µ_u) L_u − γ e R_v
  R_v ← (1 − γ µ_v) R_v − γ e L_u
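The two steps above, as a sketch in NumPy; the function name and the use of the old L_u in the R_v update are my reading of the slide.

```python
import numpy as np

def mf_sgd_step(L, R, u, v, M_uv, gamma, mu_u, mu_v):
    """One SGD step on (L_u R_v^T - M_uv)^2 + mu_u ||L_u||^2 + mu_v ||R_v||^2,
    touching only row u of L and row v of R (updated in place)."""
    e = L[u] @ R[v] - M_uv                                   # Step 1: residual
    Lu = L[u].copy()                                         # keep old L_u for the R_v update
    L[u] = (1.0 - gamma * mu_u) * L[u] - gamma * e * R[v]    # Step 2: gradient step
    R[v] = (1.0 - gamma * mu_v) * R[v] - gamma * e * Lu

rng = np.random.default_rng(0)
L, R = 0.1 * rng.standard_normal((100, 5)), 0.1 * rng.standard_normal((200, 5))
mf_sgd_step(L, R, u=3, v=7, M_uv=1.0, gamma=0.01, mu_u=0.1, mu_v=0.1)
```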

Mixture of hundreds of models, including the nuclear norm (a.k.a. SVD).

JELLYFISH

Observation: with-replacement sampling = poor locality. Idea: bias the sample to improve locality.

Algorithm: shuffle the data, then
1. Process {L_1 R_1, L_2 R_2, L_3 R_3} in parallel
2. Process {L_1 R_2, L_2 R_3, L_3 R_1} in parallel
3. Process {L_1 R_3, L_2 R_1, L_3 R_2} in parallel

Big wins: no locks, no overwrites, strong locality.

Example Optimization: shuffle all the rows and columns, apply block partitioning, train on each block independently, repeat...
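The block schedule can be written down explicitly; the sketch below generalizes the 3×3 example above to B×B blocks (the generator name is mine). Within a round, no two blocks share a row block or a column block, so their pieces of L and R never collide.

```python
def jellyfish_rounds(B):
    """Yield, for each round r, the (row block, column block) pairs
    (i, (i + r) mod B) that can be processed in parallel without conflicts."""
    for r in range(B):
        yield [(i, (i + r) % B) for i in range(B)]

for r, blocks in enumerate(jellyfish_rounds(3)):
    print(f"round {r}: {blocks}")
# round 0: [(0, 0), (1, 1), (2, 2)]
# round 1: [(0, 1), (1, 2), (2, 0)]
# round 2: [(0, 2), (1, 0), (2, 1)]
```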

JELLYFISH is 100x faster than standard solvers, a 25% speedup over HOGWILD! on a 12-core machine, and a 3x speedup on a 40-core machine. Netflix data set: 100M examples, 17,770 rows, 480,189 columns.

Theory: treats without replacement sampling like deterministic sampling. Practice: without replacement sampling is always faster. Open question: Why?

Simplest Problem? Least Squares

minimize_x  Σ_{k=1}^{N} (a_k^T x − b_k)²

Assume overdetermined: a_k^T x_opt = b_k for all k.

Fixed order:  x_N = x_opt + Π_{k=1}^{N} (I − γ a_k a_k^T) (x_0 − x_opt)

With replacement:  E[x_N] = x_opt + ( I − (γ/N) Σ_{k=1}^{N} a_k a_k^T )^N (x_0 − x_opt)

Which is better?
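The two contraction matrices can be compared numerically; the sketch below uses random Gaussian rows and an arbitrary stepsize (none of these constants come from the talk).

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, gamma = 5, 20, 0.05
A = rng.standard_normal((N, D))          # rows a_k^T

# Fixed order: product of the per-step contractions I - gamma * a_k a_k^T.
P_fixed = np.eye(D)
for a in A:
    P_fixed = (np.eye(D) - gamma * np.outer(a, a)) @ P_fixed

# With replacement, in expectation: the averaged contraction applied N times.
P_avg = np.linalg.matrix_power(np.eye(D) - gamma * (A.T @ A) / N, N)

print(np.linalg.norm(P_fixed, 2), np.linalg.norm(P_avg, 2))  # smaller = faster forgetting of x_0
```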

Simple Question

Given A_1, ..., A_N ⪰ 0, each D×D, is it true that

‖ A_1 A_2 ⋯ A_N ‖ ≤ ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N ?

If the A_i are scalars, this is always true by the arithmetic-geometric mean inequality. Is it true for matrices? True for N = 2 (Bhatia and Kittaneh, 1990).

What about generically?

Given A_1, ..., A_N ⪰ 0, each D×D, is ‖ A_1 ⋯ A_N ‖ ≤ ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N when the A_i are sampled from the normal distribution? Take A_i = g_i g_i^T with g_i ~ N(0, (1/D) I_D), so that E[A_i] = (1/D) I_D, and let Q_N = A_1 ⋯ A_N. The expected trace E[Tr(Q_N)] has a closed form (a series in D and N involving the hypergeometric function ₂F₁), and the calculation comes out in favor of the inequality for such generic draws.

Simple Question

Given A_1, ..., A_N ⪰ 0, each D×D, is it true that ‖ A_1 A_2 ⋯ A_N ‖ ≤ ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N?

If the A_i are scalars, this is always true by the arithmetic-geometric mean inequality. Is it true for matrices? True for N = 2 (Bhatia and Kittaneh, 1990). True for generic A_i.

Straightforward bound: ‖ Π_{i=1}^{N} A_i ‖ ≤ D^N ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N.

log ‖ Π_{i=1}^{N} A_i ‖ ≤ Σ_{i=1}^{N} log ‖A_i‖ ≤ Σ_{i=1}^{N} log Tr(A_i) = N · (1/N) Σ_{i=1}^{N} log Tr(A_i) ≤ N log( (1/N) Σ_{i=1}^{N} Tr(A_i) )  (by concavity of log)  = N log Tr( (1/N) Σ_{i=1}^{N} A_i ) ≤ N log( D ‖ (1/N) Σ_{i=1}^{N} A_i ‖ ),

which exponentiates to the straightforward bound ‖ Π_{i=1}^{N} A_i ‖ ≤ D^N ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N.

Given A_1, ..., A_N ⪰ 0, each D×D, is it true that ‖ A_1 ⋯ A_N ‖ ≤ ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N?

No! Counterexample (D = 2):

A_k = [ 1 + cos(2πk/N)    sin(2πk/N)     ]
      [ sin(2πk/N)        1 − cos(2πk/N) ]

Each A_k is positive semidefinite, and (1/N) Σ_{i=1}^{N} A_i = I, yet ‖ A_1 A_2 ⋯ A_N ‖ = 2^N cos^{N−1}(π/N), which saturates the straightforward bound ‖ Π_i A_i ‖ ≤ D^N ‖ (1/N) Σ_i A_i ‖^N. Similar counterexamples exist for all N and D > 2.
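The counterexample is easy to check numerically; the sketch below builds the A_k for N = 8, D = 2 and compares the norm of their product with the norm of their average raised to the N-th power.

```python
import numpy as np

N, D = 8, 2
def A(k):
    t = 2 * np.pi * k / N
    return np.eye(D) + np.array([[np.cos(t),  np.sin(t)],
                                 [np.sin(t), -np.cos(t)]])

mats = [A(k) for k in range(1, N + 1)]   # each A_k is PSD with eigenvalues {0, 2}
avg = sum(mats) / N                      # equals the identity
prod = np.linalg.multi_dot(mats)         # ordered product A_1 A_2 ... A_N
print(np.linalg.norm(avg, 2) ** N)       # 1.0
print(np.linalg.norm(prod, 2))           # about 2^N * cos(pi/N)^(N-1) >> 1
```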

Recall the earlier example:

minimize Σ_{k=0}^{9} (cos(πk/10) x_1 + sin(πk/10) x_2)² = 5 x_1² + 5 x_2²,   stepsize = 1/2

∇f_j(x) = 2 (c_j x_1 + s_j x_2) (c_j, s_j)^T,  where c_j = cos(πj/10), s_j = sin(πj/10)

[Figure: SGD iterates on this quadratic. Left panel: choose directions in order. Right panel: choose a direction uniformly with replacement.]

But what about on average?

Given A_1, ..., A_N ⪰ 0, each D×D, is it true that

‖ (1/N!) Σ_{σ ∈ S_N} A_σ(1) A_σ(2) ⋯ A_σ(N) ‖ ≤ ‖ (1/N) Σ_{i=1}^{N} A_i ‖^N ?

For the counterexample A_k = I + [cos(2πk/N), sin(2πk/N); sin(2πk/N), −cos(2πk/N)], which has (1/N) Σ_i A_i = I, the permutation-averaged product can be evaluated in closed form in terms of a hypergeometric function ₂F₁, and it stays bounded rather than giving a counterexample. Yet to find a counterexample for averaging! Is there a noncommutative arithmetic-geometric mean inequality?

Summary

Practical lessons: don't lock! Locality matters.

Theory: bias the sampling for better data access. Understanding is only beginning here:
- noncommutative arithmetic-geometric mean inequality in full generality
- biased orderings of SGD
- automatic identification of locality

Acknowledgements

See http://pages.cs.wisc.edu/~brecht/publications.html for all references.

Reports:
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Niu, Recht, Ré, and Wright, 2011. Code: http://pages.cs.wisc.edu/~chrisre/hogwild.bz2
Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Recht and Ré, 2011. Code: http://pages.cs.wisc.edu/~chrisre/jellyfish.bz2