On Gradient-Based Optimization: Accelerated, Stochastic, Asynchronous, Distributed


1 On Gradient-Based Optimization: Accelerated, Stochastic, Asynchronous, Distributed Michael I. Jordan University of California, Berkeley February 9, 2017

2 Modern Optimization and Large-Scale Data Analysis A need to exploit parallelism, while controlling stochasticity, and tolerating asynchrony, and trying to go as fast as the oracle allows, and maybe even faster

3 Outline Variational and Hamiltonian perspectives on Nesterov acceleration Stochastically-controlled stochastic gradient A perturbed iterate framework for analysis of asynchronous algorithms Primal-dual distributed optimization, commoditized Avoiding saddle points, quickly

4 Part I Variational and Hamiltonian Perspectives on Nesterov Acceleration with Andre Wibisono and Ashia Wilson

5 Accelerated gradient descent Setting: unconstrained convex optimization min_{x ∈ R^d} f(x)

6 Accelerated gradient descent Setting: unconstrained convex optimization min_{x ∈ R^d} f(x). Classical gradient descent: x_{k+1} = x_k − β ∇f(x_k) obtains a convergence rate of O(1/k)

7 Accelerated gradient descent Setting: unconstrained convex optimization min_{x ∈ R^d} f(x). Classical gradient descent: x_{k+1} = x_k − β ∇f(x_k) obtains a convergence rate of O(1/k). Accelerated gradient descent: y_{k+1} = x_k − β ∇f(x_k), x_{k+1} = (1 − λ_k) y_{k+1} + λ_k y_k obtains the (optimal) convergence rate of O(1/k²)
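
As a toy illustration of the two updates above (not part of the talk), the following NumPy sketch runs both schemes on an ill-conditioned quadratic; the objective, the step size β = 1/L, and the momentum schedule λ_k = −(k − 1)/(k + 2) are assumptions chosen for the example, not quantities specified on the slide.

```python
# Sketch: gradient descent vs. Nesterov-style acceleration on f(x) = 0.5 * x' A x.
# The schedule lambda_k = -(k - 1)/(k + 2) is an assumed standard choice.
import numpy as np

A = np.diag([1.0, 100.0])            # ill-conditioned quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
beta = 1.0 / 100.0                   # step size 1/L, with L the largest eigenvalue

x_gd = np.array([1.0, 1.0])          # plain gradient descent iterate
x, y_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])

for k in range(1, 200):
    # classical gradient descent: x_{k+1} = x_k - beta * grad f(x_k)
    x_gd = x_gd - beta * grad(x_gd)

    # accelerated gradient descent:
    #   y_{k+1} = x_k - beta * grad f(x_k)
    #   x_{k+1} = (1 - lambda_k) y_{k+1} + lambda_k y_k
    y_next = x - beta * grad(x)
    lam = -(k - 1) / (k + 2)
    x = (1 - lam) * y_next + lam * y_prev
    y_prev = y_next

print("gradient descent :", f(x_gd))
print("accelerated      :", f(y_prev))
```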

8 The acceleration phenomenon Two classes of algorithms: Gradient methods Gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak '06), etc. Greedy descent methods, relatively well-understood

9 The acceleration phenomenon Two classes of algorithms: Gradient methods Gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak '06), etc. Greedy descent methods, relatively well-understood Accelerated methods Nesterov's accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton's method (Nesterov '08), etc. Important for both theory (optimal rate for first-order methods) and practice (many extensions: FISTA, stochastic setting, etc.) Not descent methods, faster than gradient methods, still mysterious

10 Accelerated methods Analysis using Nesterov's estimate sequence technique Common interpretation as momentum methods (Euclidean case)

11 Accelerated methods Analysis using Nesterov's estimate sequence technique Common interpretation as momentum methods (Euclidean case) Many proposed explanations: Chebyshev polynomial (Hardt '13) Linear coupling (Allen-Zhu, Orecchia '14) Optimized first-order method (Drori, Teboulle '14; Kim, Fessler '15) Geometric shrinking (Bubeck, Lee, Singh '15) Universal catalyst (Lin, Mairal, Harchaoui '15)...

12 Accelerated methods Analysis using Nesterov's estimate sequence technique Common interpretation as momentum methods (Euclidean case) Many proposed explanations: Chebyshev polynomial (Hardt '13) Linear coupling (Allen-Zhu, Orecchia '14) Optimized first-order method (Drori, Teboulle '14; Kim, Fessler '15) Geometric shrinking (Bubeck, Lee, Singh '15) Universal catalyst (Lin, Mairal, Harchaoui '15)... But only for strongly convex functions, or first-order methods Question: What is the underlying mechanism that generates acceleration (including for higher-order methods)?

13 Accelerated methods: Continuous-time perspective Gradient descent is a discretization of gradient flow Ẋ_t = −∇f(X_t) (and mirror descent is a discretization of natural gradient flow)

14 Accelerated methods: Continuous-time perspective Gradient descent is a discretization of gradient flow Ẋ_t = −∇f(X_t) (and mirror descent is a discretization of natural gradient flow) Su, Boyd, Candès '14: The continuous-time limit of accelerated gradient descent is a second-order ODE Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

15 Accelerated methods: Continuous-time perspective Gradient descent is a discretization of gradient flow Ẋ_t = −∇f(X_t) (and mirror descent is a discretization of natural gradient flow) Su, Boyd, Candès '14: The continuous-time limit of accelerated gradient descent is a second-order ODE Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0 These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism? Our work: A general variational approach to acceleration A systematic discretization methodology
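
One quick way to see the correspondence is to integrate the two dynamics numerically; the sketch below is an illustration only, with a forward-Euler integrator, an arbitrary quadratic objective, and a step size dt chosen for the example rather than anything prescribed by the talk.

```python
# Sketch: forward-Euler integration of the accelerated-gradient ODE
#   Xdd_t + (3/t) Xd_t + grad f(X_t) = 0
# versus gradient flow Xd_t = -grad f(X_t), on a quadratic f.
import numpy as np

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

dt, T = 1e-3, 10.0
x_ode, v_ode = np.array([1.0, 1.0]), np.zeros(2)   # position and velocity for the ODE
x_flow = np.array([1.0, 1.0])                      # gradient-flow trajectory

t = dt
while t < T:
    # second-order ODE rewritten as a first-order system in (x, v)
    a = -(3.0 / t) * v_ode - grad(x_ode)
    x_ode = x_ode + dt * v_ode
    v_ode = v_ode + dt * a
    # gradient flow
    x_flow = x_flow - dt * grad(x_flow)
    t += dt

print("f along the accelerated ODE:", f(x_ode))
print("f along gradient flow      :", f(x_flow))
```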

16 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )

17 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t

18 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function [figure: Bregman divergence D_h(y, x)]

19 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function [figure: Bregman divergence D_h(y, x)]

20 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function α_t, β_t, γ_t ∈ R are arbitrary smooth functions [figure: Bregman divergence D_h(y, x)]

21 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t − α_t} ( (1/2) ||ẋ||² − e^{2α_t + β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function α_t, β_t, γ_t ∈ R are arbitrary smooth functions In the Euclidean setting, it simplifies to this damped Lagrangian [figure: Bregman divergence D_h(y, x)]

22 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function α_t, β_t, γ_t ∈ R are arbitrary smooth functions In the Euclidean setting, it simplifies to a damped Lagrangian Ideal scaling conditions: β̇_t ≤ e^{α_t}, γ̇_t = e^{α_t} [figure: Bregman divergence D_h(y, x)]
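
For concreteness, here is a small Python sketch (not from the talk) that evaluates the Bregman divergence D_h and the Bregman Lagrangian above for the polynomial parameter choice that appears later in the talk (α_t = log p − log t, β_t = p log t + log C, γ_t = p log t); the negative-entropy h, the quadratic f, and the values of p and C are placeholder assumptions.

```python
# Sketch: evaluating D_h(y, x) = h(y) - h(x) - <grad h(x), y - x> and
# L(x, xdot, t) = e^{gamma_t + alpha_t} ( D_h(x + e^{-alpha_t} xdot, x) - e^{beta_t} f(x) ).
# h, f, p, and C below are illustrative placeholders.
import numpy as np

def bregman(h, grad_h, y, x):
    return h(y) - h(x) - grad_h(x) @ (y - x)

def bregman_lagrangian(h, grad_h, f, x, xdot, t, p=2.0, C=0.25):
    # polynomial subfamily: alpha_t = log p - log t, beta_t = p log t + log C, gamma_t = p log t
    alpha = np.log(p) - np.log(t)
    beta = p * np.log(t) + np.log(C)
    gamma = p * np.log(t)
    return np.exp(gamma + alpha) * (
        bregman(h, grad_h, x + np.exp(-alpha) * xdot, x) - np.exp(beta) * f(x)
    )

# example: negative entropy on the positive orthant, quadratic objective
h = lambda x: np.sum(x * np.log(x))
grad_h = lambda x: np.log(x) + 1.0
f = lambda x: 0.5 * np.sum(x ** 2)

x = np.array([0.5, 0.5])
xdot = np.array([0.1, -0.1])
print("D_h:", bregman(h, grad_h, x + xdot, x))
print("L  :", bregman_lagrangian(h, grad_h, f, x, xdot, t=1.0))
```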

23 Bregman Lagrangian L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Variational problem over curves: min_X ∫ L(X_t, Ẋ_t, t) dt Optimal curves are characterized by the Euler-Lagrange equation: d/dt { ∂L/∂ẋ (X_t, Ẋ_t, t) } = ∂L/∂x (X_t, Ẋ_t, t)

24 Bregman Lagrangian L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Variational problem over curves: min_X ∫ L(X_t, Ẋ_t, t) dt Optimal curves are characterized by the Euler-Lagrange equation: d/dt { ∂L/∂ẋ (X_t, Ẋ_t, t) } = ∂L/∂x (X_t, Ẋ_t, t) E-L equation for the Bregman Lagrangian under ideal scaling: Ẍ_t + (e^{α_t} − α̇_t) Ẋ_t + e^{2α_t + β_t} [∇²h(X_t + e^{−α_t} Ẋ_t)]^{−1} ∇f(X_t) = 0

25 General convergence rate Theorem. Under ideal scaling, the E-L equation has convergence rate f(X_t) − f(x*) ≤ O(e^{−β_t})

26 General convergence rate Theorem. Under ideal scaling, the E-L equation has convergence rate f(X_t) − f(x*) ≤ O(e^{−β_t}) Proof. Exhibit a Lyapunov function for the dynamics: E_t = D_h(x*, X_t + e^{−α_t} Ẋ_t) + e^{β_t} (f(X_t) − f(x*)) Ė_t = −e^{α_t + β_t} D_f(x*, X_t) + (β̇_t − e^{α_t}) e^{β_t} (f(X_t) − f(x*)) ≤ 0 Note: Only requires convexity and differentiability of f, h

27 Polynomial convergence rate For p > 0, choose parameters: α_t = log p − log t, β_t = p log t + log C, γ_t = p log t The E-L equation has an O(e^{−β_t}) = O(1/t^p) convergence rate: Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0

28 Polynomial convergence rate For p > 0, choose parameters: α_t = log p − log t, β_t = p log t + log C, γ_t = p log t The E-L equation has an O(e^{−β_t}) = O(1/t^p) convergence rate: Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0 For p = 2: Recover the result of Krichene et al. with O(1/t²) convergence rate In the Euclidean case, recover the ODE of Su et al.: Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

29 Time dilation property (reparameterizing time) (p = 2: accelerated gradient descent) O(1/t²): Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0 speed up time: Y_t = X_{t^{3/2}} O(1/t³): Ÿ_t + (4/t) Ẏ_t + 9Ct [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0 (p = 3: accelerated cubic-regularized Newton's method)

30 Time dilation property (reparameterizing time) (p = 2: accelerated gradient descent) O(1/t²): Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0 speed up time: Y_t = X_{t^{3/2}} O(1/t³): Ÿ_t + (4/t) Ẏ_t + 9Ct [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0 (p = 3: accelerated cubic-regularized Newton's method) All accelerated methods are traveling the same curve in space-time at different speeds Gradient methods don't have this property From gradient flow to rescaled gradient flow: replace ∇f by ∇f / ||∇f||^{(p−2)/(p−1)}

31 Time dilation for the general Bregman Lagrangian O(e^{−β_t}): E-L for Lagrangian L_{α,β,γ} speed up time: Y_t = X_{τ(t)} O(e^{−β_{τ(t)}}): E-L for Lagrangian L_{α̃,β̃,γ̃}, where α̃_t = α_{τ(t)} + log τ̇(t), β̃_t = β_{τ(t)}, γ̃_t = γ_{τ(t)}

32 Time dilation for the general Bregman Lagrangian O(e^{−β_t}): E-L for Lagrangian L_{α,β,γ} speed up time: Y_t = X_{τ(t)} O(e^{−β_{τ(t)}}): E-L for Lagrangian L_{α̃,β̃,γ̃}, where α̃_t = α_{τ(t)} + log τ̇(t), β̃_t = β_{τ(t)}, γ̃_t = γ_{τ(t)} Question: How to discretize the E-L equation while preserving the convergence rate?

33 Discretizing the dynamics (naive approach) Write the E-L equation as a system of first-order equations: Z_t = X_t + (t/p) Ẋ_t, d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t)

34 Discretizing the dynamics (naive approach) Write the E-L equation as a system of first-order equations: Z_t = X_t + (t/p) Ẋ_t, d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t) Euler discretization with time step δ > 0 (i.e., set x_k = X_t, x_{k+1} = X_{t+δ}): x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } with step size ε = δ^p, where k^{(p−1)} = k(k+1)···(k+p−2) is the rising factorial

35 Naive discretization doesn't work x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } Cannot obtain a convergence guarantee, and empirically unstable [plot: f(x_k) vs. k]

36 Modified discretization Introduce an auxiliary sequence y_k: x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } Sufficient condition: ⟨∇f(y_k), x_k − y_k⟩ ≥ M ε^{1/(p−1)} ||∇f(y_k)||^{p/(p−1)}

37 Modified discretization Introduce an auxiliary sequence y_k: x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } Sufficient condition: ⟨∇f(y_k), x_k − y_k⟩ ≥ M ε^{1/(p−1)} ||∇f(y_k)||^{p/(p−1)} Assume h is uniformly convex: D_h(y, x) ≥ (1/p) ||y − x||^p Theorem. f(y_k) − f(x*) ≤ O(1/(ε k^p)) Note: Matching convergence rates, 1/(ε k^p) = 1/(δk)^p = 1/t^p Proof using a generalization of Nesterov's estimate sequence technique

38 Accelerated higher-order gradient method x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k, y_k = arg min_y { f_{p−1}(y; x_k) + (2/(εp)) ||y − x_k||^p }, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } If ∇^{p−1} f is (1/ε)-Lipschitz and h is uniformly convex of order p, then f(y_k) − f(x*) ≤ O(1/(ε k^p)), the accelerated rate p = 2: Accelerated gradient/mirror descent p = 3: Accelerated cubic-regularized Newton's method (Nesterov '08) p ≥ 2: Accelerated higher-order methods
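
To make the p = 2 case concrete, the sketch below (an illustration, not the talk's code) instantiates the three updates with Euclidean h, so that both arg mins have closed forms, on a simple quadratic; the constants C and ε and the objective are assumptions chosen for the example.

```python
# Sketch: the p = 2 instance (accelerated gradient descent) of the scheme above with h(x) = 0.5||x||^2,
# so y_k is a gradient step and the z-update is z_k = z_{k-1} - eps * C * p * k * grad f(y_k).
# C, eps, and the quadratic objective are illustrative assumptions.
import numpy as np

A = np.diag([1.0, 50.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

p, C = 2, 0.25
eps = 1.0 / 50.0                     # assumes grad f is (1/eps)-Lipschitz

x = np.array([1.0, 1.0])
z = x.copy()
for k in range(1, 300):
    # y_k = argmin_y { <grad f(x_k), y - x_k> + (2/(eps*p)) ||y - x_k||^p }  (p = 2: a gradient step)
    y = x - (eps / 2.0) * grad(x)
    # z_k = argmin_z { C p k <grad f(y_k), z> + (1/eps) D_h(z, z_{k-1}) }    (Euclidean h: explicit)
    z = z - eps * C * p * k * grad(y)
    # x_{k+1} = p/(k+p) z_k + k/(k+p) y_k
    x = (p / (k + p)) * z + (k / (k + p)) * y

print("f(y_k) =", f(y))
```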

39 Recap: Gradient vs. accelerated methods How to design dynamics for minimizing a convex function f? Rescaled gradient flow Ẋ_t = −∇f(X_t) / ||∇f(X_t)||^{(p−2)/(p−1)}, rate O(1/t^{p−1}) Higher-order gradient method, rate O(1/(ε k^{p−1})) when ∇^{p−1} f is (1/ε)-Lipschitz matching rate with ε = δ^{p−1}, i.e., δ = ε^{1/(p−1)}

40 Recap: Gradient vs. accelerated methods How to design dynamics for minimizing a convex function f? Rescaled gradient flow Ẋ_t = −∇f(X_t) / ||∇f(X_t)||^{(p−2)/(p−1)}, rate O(1/t^{p−1}) Polynomial Euler-Lagrange equation Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0, rate O(1/t^p) Higher-order gradient method, rate O(1/(ε k^{p−1})) when ∇^{p−1} f is (1/ε)-Lipschitz matching rate with ε = δ^{p−1}, δ = ε^{1/(p−1)} Accelerated higher-order method, rate O(1/(ε k^p)) when ∇^{p−1} f is (1/ε)-Lipschitz matching rate with ε = δ^p, δ = ε^{1/p}

41 Summary: Bregman Lagrangian Bregman Lagrangian family with a general convergence guarantee L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Polynomial subfamily generates accelerated higher-order methods: O(1/t^p) convergence rate via higher-order smoothness Exponential subfamily: O(e^{−ct}) rate via uniform convexity Understand the structure and properties of the Bregman Lagrangian: gauge invariance, symmetry, gradient flows as limit points, etc. Bregman Hamiltonian: H(x, p, t) = e^{α_t + γ_t} ( D_{h*}(∇h(x) + e^{−γ_t} p, ∇h(x)) + e^{β_t} f(x) )

42 Part II Stochastically-Controlled Stochastic Gradient with Lihua Lei

43 Setup Task: minimizing a composite objective: min_{x ∈ R^d} f(x) = (1/n) Σ_{i ∈ [n]} f_i(x)

44 Setup Task: minimizing a composite objective: min_{x ∈ R^d} f(x) = (1/n) Σ_{i ∈ [n]} f_i(x) Assumption: ∃ L < ∞, μ ≥ 0, s.t. (μ/2) ||x − y||² ≤ f_i(x) − f_i(y) − ⟨∇f_i(y), x − y⟩ ≤ (L/2) ||x − y||²

45 Setup Task: minimizing a composite objective: min_{x ∈ R^d} f(x) = (1/n) Σ_{i ∈ [n]} f_i(x) Assumption: ∃ L < ∞, μ ≥ 0, s.t. (μ/2) ||x − y||² ≤ f_i(x) − f_i(y) − ⟨∇f_i(y), x − y⟩ ≤ (L/2) ||x − y||² μ = 0: non-strongly convex case; μ > 0: strongly convex case; κ := L/μ.

46 Computation Complexity Accessing (f_i(x), ∇f_i(x)) incurs one unit of cost; Given ε > 0, let T(ε) be the minimum cost to achieve E( f(x_{T(ε)}) − f(x*) ) ≤ ε; Worst-case analysis: bound T(ε) almost surely, e.g., T(ε) = O( (n + κ) log(1/ε) ) (SVRG).

47 SVRG Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for
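
A direct transcription of this epoch structure into NumPy might look as follows; the least-squares components f_i, the step size η, and the number of outer epochs are illustrative assumptions rather than anything prescribed by the slide.

```python
# Sketch: SVRG-style epochs for least squares f_i(x) = 0.5 * (a_i @ x - b_i)^2.
# Data, step size eta, and the number of epochs are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

eta = 0.1 / np.max(np.sum(A ** 2, axis=1))
x = np.zeros(d)

for epoch in range(20):
    x0 = x.copy()
    g = full_grad(x0)                           # line 2: anchor gradient over I = [n]
    N = rng.integers(1, n + 1)                  # line 4: epoch length N ~ Uniform([m]), m = n here
    for _ in range(N):
        i = rng.integers(n)                     # line 6: sample i uniformly from [n]
        nu = grad_i(x, i) - grad_i(x0, i) + g   # line 7: variance-reduced direction
        x = x - eta * nu                        # line 8: update

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```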

48 Analysis
                  General Convex    Strongly Convex
Nesterov's AGD    n/√ε              n √κ log(1/ε)
SGD               1/ε²              (κ/ε) log(1/ε)
SVRG              -                 (n + κ) log(1/ε)
Katyusha          n/√ε              (n + √(nκ)) log(1/ε)
All of the above results are from worst-case analysis; SGD is the only method with complexity free of n; however, its stepsize η depends on the unknown ||x_0 − x*||² and the total number of epochs T.

49 Average-Case Analysis Instead of bounding T(ε), one can bound E T(ε); there is evidence from other areas that average-case analysis can give much better complexity bounds than worst-case analysis.

50 Average-Case Analysis Instead of bounding T(ε), one can bound E T(ε); there is evidence from other areas that average-case analysis can give much better complexity bounds than worst-case analysis. Goal: design an efficient algorithm such that E T(ε) is independent of n and E T(ε) is smaller than the worst-case bounds.

51 Average-Case Analysis An algorithm is tuning-friendly if: the stepsize η is the only parameter to tune; η is a constant which only depends on L and μ. [table: bounds on E T(ε) for SGD, SCSG, and SCSG+ in the general convex and strongly convex cases, together with whether each method is tuning-friendly]

52 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for

53 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch):

54 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch): 1: Randomly pick I ⊆ [n] with |I| = B 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0)

55 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch): 1: Randomly pick I ⊆ [n] with |I| = B 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: γ ← 1 − 1/B 4: Generate N ~ Geo(γ)

56 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch): 1: Randomly pick I ⊆ [n] with |I| = B 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: γ ← 1 − 1/B 4: Generate N ~ Geo(γ) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ I 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for
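
Relative to the SVRG sketch given earlier, only the anchor batch and the epoch length change; a minimal SCSG-style modification is sketched below, with the data, η, B, and epoch count again illustrative assumptions (NumPy's geometric sampler, which has mean B here, stands in for Geo(γ)).

```python
# Sketch: SCSG-style epochs; the anchor gradient uses a random batch I of size B and the
# epoch length is geometric with mean B. All constants are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 1000, 10, 64
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

eta = 0.1 / np.max(np.sum(A ** 2, axis=1))
x = np.zeros(d)

for epoch in range(50):
    I = rng.choice(n, size=B, replace=False)        # line 1: random batch I, |I| = B
    x0 = x.copy()
    g = A[I].T @ (A[I] @ x0 - b[I]) / B             # line 2: batch anchor gradient
    N = rng.geometric(1.0 / B)                      # lines 3-4: N geometric, E[N] = B
    for _ in range(N):
        i = rng.choice(I)                           # line 6: sample i from the batch I
        nu = grad_i(x, i) - grad_i(x0, i) + g       # line 7: variance-reduced direction
        x = x - eta * nu                            # line 8: update

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```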

57 SCSG/SCSG+: Algorithm In epoch j, SCSG fixes B_j ≡ B(ε); explicit forms of B(ε) are given in both the non-strongly convex and strongly convex cases; SCSG+ uses a geometrically increasing sequence B_j = B_0 b^j ∧ n

58 Conclusion SCSG/SCSG+ save computation costs on average by avoiding calculating the full gradient; SCSG/SCSG+ also save communication costs in distributed systems by avoiding sampling a gradient from the whole dataset; SCSG/SCSG+ are able to achieve an approximate optimum with potentially less than a single pass through the data; The average computation cost of SCSG+ beats the oracle lower bounds from worst-case analysis.

59 Part III On the Analysis of Asynchronous Algorithms: A Perturbed Iterate Framework with Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Ben Recht and Kannan Ramchandran

60 SGD in Practice Step 3: compute ∇f_{s_k}(x_k). Step 4: update the model, x_{k+1} = x_k − γ ∇f_{s_k}(x_k). SGD can take 100s of hours to run on large data sets [Dean et al., Google Brain]. Goal: parallelize. [figure: a data point touches only a few model coordinates, e.g., x_0 ← x_0 − γ [∇f_{s_k}(x_k)]_0, x_4 ← x_4 − γ [∇f_{s_k}(x_k)]_4, x_6 ← x_6 − γ [∇f_{s_k}(x_k)]_6]

61 Challenges in Parallel SGD [figure: several workers read data points and apply updates x ← x − γ ∇f_1(x), ..., x ← x − γ ∇f_n(x) to a shared model] If there is no conflict => speedup

62 Perspective! Asynchronous(Alg( INPUT )) = Serial(Alg(INPUT)+ Noise)

63 Challenges in Parallel SGD [figure: two workers apply x ← x − γ ∇f_1(x) and x ← x − γ ∇f_2(x) to overlapping coordinates of the shared model, creating a conflict] Issues: What happens if there are conflicts? Approach 1: Coordinate? Approach 2: Don't Care?

64 Prior (2011) Work Long line of theoretical work since the 60s: [Chazan, Miranker, 1969] foundational work on asynchronous optimization; master/worker model [Tsitsiklis, Bertsekas, 1986, 1989]. Recent hardware/software advances renewed the interest: round-robin approach [Zinkevich, Langford, Smola, 2009]; average runs [Zinkevich et al., 2009]; average gradients [Duchi et al., Dekel et al., 2010]. Many are based on a Coordinate or Lock approach. Issue: synchronization and communication overheads

65 HOGWILD! (2011) Run SGD in parallel without synchronization. Each processor, in parallel: sample a function f_i; x = read shared memory; g = ∇f_i(x); for v in the support of f_i do x_v ← x_v − γ g_v. [plot: speedup vs. number of splits for an SVM on RCV1 (10 cores), comparing Hogwild, AIG, and RR] Impact: Downpour SGD and Project Adam use HOGWILD!; renewed interest in asynchronous lock-free optimization
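
A toy rendering of this lock-free loop with Python threads is shown below; the sparse logistic-regression data, step size, and thread count are illustrative assumptions, and CPython's GIL means the sketch only mimics the unsynchronized access pattern rather than delivering real speedup.

```python
# Toy sketch of the HOGWILD!-style loop: each worker samples f_i, reads the shared iterate,
# and writes -eta * g only on the support of f_i, with no locks. Illustrative data and constants.
import numpy as np
import threading

rng = np.random.default_rng(0)
n, d, nnz = 2000, 100, 5
rows = [rng.choice(d, size=nnz, replace=False) for _ in range(n)]   # sparse support of f_i
vals = [rng.normal(size=nnz) for _ in range(n)]
y = rng.choice([-1.0, 1.0], size=n)

x = np.zeros(d)          # shared model, written without any locks
eta = 0.05

def worker(seed, num_updates):
    r = np.random.default_rng(seed)
    for _ in range(num_updates):
        i = r.integers(n)                        # sample function f_i
        idx, a = rows[i], vals[i]
        margin = y[i] * (a @ x[idx])             # read shared memory (support only)
        g = -y[i] * a / (1.0 + np.exp(margin))   # logistic-loss gradient on the support
        x[idx] -= eta * g                        # sparse lock-free write: x_v <- x_v - eta * g_v

threads = [threading.Thread(target=worker, args=(s, 5000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

loss = np.mean([np.log1p(np.exp(-y[i] * (vals[i] @ x[rows[i]]))) for i in range(n)])
print("training logistic loss:", loss)
```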

66 Challenges in Asynchronous Analysis [figure: processors 1 through P reading and writing shared memory] Issue: Reads and writes intertwine in arbitrary ways. No clear concept of an iteration.

67 Main Contribution: View Asynchrony as Perturbation Suppose we want to optimize f; then perturbed gradient descent is x_{k+1} = x_k − γ ∇f(x̂_k) [figure: x_1 = x_0 − γ ∇f(x̂_0), where the gradient is evaluated at a perturbed copy x̂_0 of x_0]

68 HOGWILD! as Perturbed SGD (Each processor, in parallel: sample a function f_i; x = read shared memory; g = ∇f_i(x); for v in the support of f_i do x_v ← x_v − γ g_v.) Def: s_k is the k-th sampled data point; x̂_k is whatever the processor working on s_k read from memory. After T processed samples, the content in memory is x_T = x_0 − γ ∇f_{s_0}(x̂_0) − ... − γ ∇f_{s_{T−1}}(x̂_{T−1}) (due to atomic writes + commutativity of addition). The algorithmic progress is captured by "phantom iterates": x_{k+1} = x_k − γ ∇f_{s_k}(x̂_k). Main Questions: 1) How much does the perturbation affect SGD? 2) Where does it come from, and how strong is it?

69 Understanding Asynchronous Perturbation: Bounding ||x_5 − x̂_5|| [figure: timeline of several CPUs processing samples] Recall that x_5 = x_0 − γ ∇f_{s_0}(x̂_0) − γ ∇f_{s_1}(x̂_1) − ... − γ ∇f_{s_4}(x̂_4), and x̂_5 = whatever CPU 2 read from memory. Then x_5 − x̂_5 depends on ∇f_{s_3}(x̂_3), ∇f_{s_4}(x̂_4), ∇f_{s_6}(x̂_6), and ∇f_{s_7}(x̂_7). Use sparsity to show that the perturbation is small.

70 Asynchrony Perturbation Asynchrony causes an error iff sampled data points are in conflict. Pr(two samples conflict) = O(Δ̄/n), where Δ̄ is the average degree of the conflict graph. Observation: If fewer than O(n/Δ̄) samples are processed concurrently, then we have at most a constant number of conflicts (in expectation)

71 Our Main Hogwild Result Assumption: at most τ samples are processed concurrently with any one sample. Theorem: Assuming τ ≤ O(n/Δ̄), the asynchrony noise is "small" and Hogwild achieves the same convergence rate as serial SGD (for strongly convex functions). Sparsity is key in the theory. True in practice? Matrix completion, graph cuts, when using dropout, etc. Comparison to Niu et al.: 1. Full SG update vs. single-coordinate update 2. No assumption of consistent reads 3. Samples ordered by sampling time 4. Improved overlap from n^{1/4} to n 5. Proof is one page long

72 The Perturbed Iterate Framework Several new asynchronous algorithms are analyzable via Asynchronous(Alg(INPUT)) = Serial(Alg(INPUT) + Noise): Asynchronous Coordinate Descent; KroMagnon: Asynchronous Sparse SVRG. KroMagnon on 16 threads is up to 10^4 x faster than serial dense SVRG, and gives an 8x speedup compared to serial sparse SVRG

73 Part IV CoCoA: Communication-Efficient Distributed Primal- Dual Optimization with Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, and Martin Jaggi

74 Optimization when Data is Big Iteratively solve [avoid closed-form] First-order methods [do less computation O(nd)] Stochastic methods [do even less computation O(d)] Parallel, distributed methods

75 Distributed Optimization Always communicate: reduce w = w − α Σ_k Δw_k, with convergence guarantees but high communication. Never communicate: average w := (1/K) Σ_k w_k, with low communication but convergence not guaranteed [ZDWJ, 2012]

76 Mini-batch reduce: w = w − (α/b) Σ_{i∈b} Δw_i convergence guarantees tunable communication a natural middle-ground

77 Mini-batch Limitations 1. METHODS BEYOND SGD 2. STALE UPDATES 3. AVERAGE OVER BATCH SIZE

78 Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA) 1. METHODS BEYOND SGD Use Primal-Dual Framework 2. STALE UPDATES Immediately apply local updates 3. AVERAGE OVER BATCH SIZE Average over K << batch size

79 1. Primal-Dual Framework PRIMAL: min_{w ∈ R^d} P(w) := (λ/2) ||w||² + (1/n) Σ_{i=1}^n ℓ_i(w^T x_i) DUAL: max_{α ∈ R^n} D(α) := −(λ/2) ||Aα||² − (1/n) Σ_{i=1}^n ℓ_i*(−α_i), where A_i = (1/(λn)) x_i Stopping criteria given by the duality gap Good performance in practice Default in software packages, e.g., liblinear The dual separates across machines (one dual variable per datapoint)
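
The duality-gap stopping criterion mentioned above is cheap to evaluate; the sketch below does so for the square loss, whose conjugate is ℓ_i*(u) = ½u² + u·y_i, using a plain SDCA-style coordinate loop to drive the gap down. The data and λ are illustrative assumptions, and the coordinate solver is only a stand-in for whatever local method one prefers.

```python
# Sketch: primal P(w), dual D(alpha), and the duality gap for the square loss
# l_i(z) = 0.5*(z - y_i)^2, with A_i = x_i / (lam * n). Illustrative data and lambda.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 20, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def primal(w):
    return 0.5 * lam * w @ w + np.mean(0.5 * (X @ w - y) ** 2)

def dual(alpha):
    w = X.T @ alpha / (lam * n)                 # w(alpha) = A alpha
    return -0.5 * lam * w @ w - np.mean(0.5 * alpha ** 2 - alpha * y)

alpha, w = np.zeros(n), np.zeros(d)
for _ in range(5000):                           # plain dual coordinate ascent (SDCA-style)
    i = rng.integers(n)
    # closed-form coordinate maximizer for the square loss
    delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
    alpha[i] += delta
    w += delta * X[i] / (lam * n)               # keep w = A alpha in sync

print("duality gap P(w) - D(alpha):", primal(w) - dual(alpha))
```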

80 1. Primal-Dual Framework Global objective: max_{α ∈ R^n} D(α) := −(λ/2) ||Aα||² − (1/n) Σ_{i=1}^n ℓ_i*(−α_i), where A_i = (1/(λn)) x_i Local objective: max_{Δα_[k] ∈ R^n} (1/n) Σ_{i∈P_k} −ℓ_i*(−(α_[k] + Δα_[k])_i) − λ w^T A Δα_[k] − (λ/2) ||A Δα_[k]||² Can solve the local objective using any internal optimization method

81 2. Immediately Apply Updates STALE: for i ∈ b: Δw ← Δw − α ∇_i P(w); end; w ← w + Δw. FRESH: for i ∈ b: w ← w − α ∇_i P(w); end

82 3. Average over K reduce: w = w + (1/K) Σ_k Δw_k

83 CoCoA
Algorithm 1: CoCoA
Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1).
Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
for t = 1, 2, ..., T
  for all machines k = 1, 2, ..., K in parallel
    (Δα_[k], Δw_k) ← LocalDualMethod(α^(t−1)_[k], w^(t−1))
    α^(t)_[k] ← α^(t−1)_[k] + (β_K/K) Δα_[k]
  end
  reduce w^(t) ← w^(t−1) + (β_K/K) Σ_{k=1}^K Δw_k
end
Procedure A: LocalDualMethod: Dual algorithm on machine k
Input: Local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α s.t. w = Aα
Data: Local {(x_i, y_i)}_{i=1}^{n_k}
Output: Δα_[k] and Δw := A_[k] Δα_[k]
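
A single-process simulation of this outer loop, with a few square-loss SDCA coordinate steps standing in for LocalDualMethod, is sketched below; K, H, λ, β_K = 1, and the data are illustrative assumptions, not the paper's configuration.

```python
# Sketch: CoCoA outer loop on one machine. Each "machine" owns a block of dual variables,
# runs H local coordinate steps against its current view of w, and the scaled updates are reduced.
import numpy as np

rng = np.random.default_rng(0)
n, d, K, lam, beta_K, H, T = 400, 20, 4, 0.1, 1.0, 50, 30
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
parts = np.array_split(rng.permutation(n), K)      # data distributed over K "machines"

def local_dual_method(alpha_k, w, idx):
    """A few square-loss SDCA steps on block idx; returns (delta_alpha_k, delta_w)."""
    d_alpha = np.zeros_like(alpha_k)
    d_w = np.zeros_like(w)
    for _ in range(H):
        j = rng.integers(len(idx))
        i = idx[j]
        a_i = alpha_k[j] + d_alpha[j]
        delta = (y[i] - X[i] @ (w + d_w) - a_i) / (1.0 + X[i] @ X[i] / (lam * n))
        d_alpha[j] += delta
        d_w += delta * X[i] / (lam * n)
    return d_alpha, d_w

alpha = [np.zeros(len(idx)) for idx in parts]
w = np.zeros(d)
for _ in range(T):
    updates = [local_dual_method(alpha[k], w, parts[k]) for k in range(K)]   # "in parallel"
    for k in range(K):
        alpha[k] += (beta_K / K) * updates[k][0]
    w += (beta_K / K) * sum(u[1] for u in updates)                           # reduce step

print("primal objective:", 0.5 * lam * w @ w + np.mean(0.5 * (X @ w - y) ** 2))
```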

84 Convergence Assumptions: the ℓ_i are (1/γ)-smooth; LocalDualMethod makes improvement Θ per step, e.g., for SDCA, Θ = (1 − (λnγ)/(1 + λnγ) · (1/ñ))^H Theorem: E[D(α*) − D(α^(T))] ≤ (1 − (1 − Θ) (1/K) (λnγ)/(σ + λnγ))^T (D(α*) − D(α^(0))), and the bound applies also to the duality gap; here 0 ≤ σ ≤ n/K is a measure of the difficulty of the data partition

85 Empirical Results
Dataset    Training (n)    Features (d)    Sparsity    Workers (K)
Cov        522,...                         ...%        4
Rcv1       677,399         47,...          ...%        8
Imagenet   32,...                          ...%        32

86 Algorithms for Comparison [table: Mini-batch SGD, Mini-batch SDCA, Local-SGD, and CoCoA compared on four properties: convergence guarantees, primal-dual formulation, immediately applied updates, and averaging over K]

87 Imagenet [plot: log primal suboptimality vs. time (s), comparing COCOA (H=1e3), mini-batch CD (H=1), local SGD (H=1e3), and mini-batch SGD (H=10)]

88 RCV1, Cov, Imagenet [plots: log primal suboptimality vs. time (s) on RCV1 (COCOA H=1e5, mini-batch CD H=100, local SGD H=1e4, batch SGD H=100), Cov (COCOA H=1e5, mini-batch CD H=100, local SGD H=1e5, batch SGD H=1), and Imagenet (COCOA H=1e3, mini-batch CD H=1, local SGD H=1e3, mini-batch SGD H=10)]

89 CoCoA+ reduce: w = w + ν Σ_k Δw_k, where ν = 1/K => Averaging and ν = 1 => Adding

90 Local Subproblems CoCoA: max_{Δα_[k] ∈ R^n} (1/n) Σ_{i∈P_k} −ℓ_i*(−(α_[k] + Δα_[k])_i) − λ w^T A Δα_[k] − (λ/2) ||A Δα_[k]||² CoCoA+: max_{Δα_[k] ∈ R^n} (1/n) Σ_{i∈P_k} −ℓ_i*(−(α_[k] + Δα_[k])_i) − λ w^T A Δα_[k] − (λσ'/2) ||A Δα_[k]||²

91 must choose: σ' ≥ σ'_min := max_{α ∈ R^n} ||Aα||² / Σ_{k=1}^K ||A_[k] α_[k]||² σ'_min is a measure of the difficulty of the data partitioning; in practice, σ' := νK ∈ [1, K] are safe values CoCoA+: σ' = 1 => Averaging, σ' = K => Adding

92 CoCoA+ vs. CoCoA [plots: duality gap vs. number of communications and vs. elapsed time (s) on Covertype, comparing Averaging and Adding with H = 10^6, 10^5, 10^4]

93 CoCoA+ vs. CoCoA [plots: duality gap vs. number of communications and vs. elapsed time (s) on RCV1, comparing Averaging and Adding with H = 10^6, 10^5, 10^4]

94 Stronger Theoretical Results Convergence rate is independent of K Provide primal-dual rate Cover the case of non-smooth losses (e.g., hinge-loss)

95 Code: github.com/gingsmith/cocoa

96 Part V Avoiding Saddle Points Quickly with Chi Jin, Rong Ge, Sham Kakade, and Praneeth Netrapalli

97 Definition: ε-second-order stationary points, for ℓ-smooth, ρ-Hessian-Lipschitz functions (Nesterov & Polyak, 2006): ||∇f(x)|| ≤ ε, λ_min(∇²f(x)) ≥ −√(ρε). When all saddle points are strict, how fast can we converge to ε-second-order stationary points?
Algorithm                      Iterations            Oracle
(Ge, et al., 2015)             O(poly(d/ε))          Gradient
(Levy, 2016)                   O(d³ poly(1/ε))       Gradient
This Work                      O(polylog(d)/ε²)      Gradient
(Agarwal, et al., 2016)        O(log d / ε^{7/4})    Hessian-vector product
(Carmon, et al., 2016)         O(log d / ε^{7/4})    Hessian-vector product
(Carmon and Duchi, 2016)       O(log d / ε²)         Hessian-vector product
(Nesterov and Polyak, 2006)    O(1/ε^{1.5})          Hessian
(Curtis, et al., 2014)         O(1/ε^{1.5})          Hessian
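
The gradient-only rate in the "This Work" row comes from occasionally perturbing gradient descent near candidate saddle points; a toy sketch of that idea is below, with the test function and all constants chosen purely for illustration rather than taken from the paper.

```python
# Toy sketch of perturbed gradient descent: when the gradient is small (a candidate saddle),
# add a small random perturbation so strict saddles are escaped. The function
# f(x, y) = x^2 - y^2 + y^4/4 has a strict saddle at the origin and minima near y = +/- sqrt(2).
import numpy as np

f = lambda z: z[0] ** 2 - z[1] ** 2 + 0.25 * z[1] ** 4
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1] + z[1] ** 3])

rng = np.random.default_rng(0)
z = np.array([1.0, 1e-12])        # starts essentially on the stable manifold of the saddle
eta, eps, radius = 0.1, 1e-6, 1e-3

for k in range(300):
    if np.linalg.norm(grad(z)) <= eps:          # near a first-order stationary point
        z = z + radius * rng.normal(size=2)     # random perturbation to escape the strict saddle
    z = z - eta * grad(z)

print("final point:", z, "f =", f(z))           # ends near a local minimum, not the saddle
```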
