On Gradient-Based Optimization: Accelerated, Stochastic, Asynchronous, Distributed


1 On Gradient-Based Optimization: Accelerated, Stochastic, Asynchronous, Distributed Michael I. Jordan University of California, Berkeley February 9, 2017

2 Modern Optimization and Large-Scale Data Analysis A need to exploit parallelism, while controlling stochasticity, and tolerating asynchrony, and trying to go as fast as the oracle allows, and maybe even faster

3 Outline Variational and Hamiltonian perspectives on Nesterov acceleration Stochastically-controlled stochastic gradient A perturbed iterate framework for analysis of asynchronous algorithms Primal-dual distributed optimization, commoditized Avoiding saddle points, quickly

4 Part I Variational and Hamiltonian Perspectives on Nesterov Acceleration with Andre Wibisono and Ashia Wilson

5 Accelerated gradient descent Setting: unconstrained convex optimization min_{x ∈ R^d} f(x)

6 Accelerated gradient descent Setting: unconstrained convex optimization min_{x ∈ R^d} f(x). Classical gradient descent: x_{k+1} = x_k − β ∇f(x_k) obtains a convergence rate of O(1/k)

7 Accelerated gradient descent Setting: unconstrained convex optimization min_{x ∈ R^d} f(x). Classical gradient descent: x_{k+1} = x_k − β ∇f(x_k) obtains a convergence rate of O(1/k). Accelerated gradient descent: y_{k+1} = x_k − β ∇f(x_k), x_{k+1} = (1 − λ_k) y_{k+1} + λ_k y_k obtains the (optimal) convergence rate of O(1/k²)
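
As a toy illustration of the two updates above (not part of the talk), the following NumPy sketch runs both schemes on an ill-conditioned quadratic; the objective, the step size β = 1/L, and the momentum schedule λ_k = −(k − 1)/(k + 2) are assumptions chosen for the example, not quantities specified on the slide.

```python
# Sketch: gradient descent vs. Nesterov-style acceleration on f(x) = 0.5 * x' A x.
# The schedule lambda_k = -(k - 1)/(k + 2) is an assumed standard choice.
import numpy as np

A = np.diag([1.0, 100.0])            # ill-conditioned quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
beta = 1.0 / 100.0                   # step size 1/L, with L the largest eigenvalue

x_gd = np.array([1.0, 1.0])          # plain gradient descent iterate
x, y_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])

for k in range(1, 200):
    # classical gradient descent: x_{k+1} = x_k - beta * grad f(x_k)
    x_gd = x_gd - beta * grad(x_gd)

    # accelerated gradient descent:
    #   y_{k+1} = x_k - beta * grad f(x_k)
    #   x_{k+1} = (1 - lambda_k) y_{k+1} + lambda_k y_k
    y_next = x - beta * grad(x)
    lam = -(k - 1) / (k + 2)
    x = (1 - lam) * y_next + lam * y_prev
    y_prev = y_next

print("gradient descent :", f(x_gd))
print("accelerated      :", f(y_prev))
```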

8 The acceleration phenomenon Two classes of algorithms: Gradient methods Gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak '06), etc. Greedy descent methods, relatively well-understood

9 The acceleration phenomenon Two classes of algorithms: Gradient methods Gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak '06), etc. Greedy descent methods, relatively well-understood Accelerated methods Nesterov's accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton's method (Nesterov '08), etc. Important for both theory (optimal rate for first-order methods) and practice (many extensions: FISTA, stochastic setting, etc.) Not descent methods, faster than gradient methods, still mysterious

10 Accelerated methods Analysis using Nesterov's estimate sequence technique Common interpretation as momentum methods (Euclidean case)

11 Accelerated methods Analysis using Nesterov's estimate sequence technique Common interpretation as momentum methods (Euclidean case) Many proposed explanations: Chebyshev polynomial (Hardt '13) Linear coupling (Allen-Zhu, Orecchia '14) Optimized first-order method (Drori, Teboulle '14; Kim, Fessler '15) Geometric shrinking (Bubeck, Lee, Singh '15) Universal catalyst (Lin, Mairal, Harchaoui '15)...

12 Accelerated methods Analysis using Nesterov's estimate sequence technique Common interpretation as momentum methods (Euclidean case) Many proposed explanations: Chebyshev polynomial (Hardt '13) Linear coupling (Allen-Zhu, Orecchia '14) Optimized first-order method (Drori, Teboulle '14; Kim, Fessler '15) Geometric shrinking (Bubeck, Lee, Singh '15) Universal catalyst (Lin, Mairal, Harchaoui '15)... But only for strongly convex functions, or first-order methods Question: What is the underlying mechanism that generates acceleration (including for higher-order methods)?

13 Accelerated methods: Continuous-time perspective Gradient descent is a discretization of gradient flow Ẋ_t = −∇f(X_t) (and mirror descent is a discretization of natural gradient flow)

14 Accelerated methods: Continuous-time perspective Gradient descent is a discretization of gradient flow Ẋ_t = −∇f(X_t) (and mirror descent is a discretization of natural gradient flow) Su, Boyd, Candès '14: The continuous-time limit of accelerated gradient descent is a second-order ODE Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

15 Accelerated methods: Continuous-time perspective Gradient descent is a discretization of gradient flow Ẋ_t = −∇f(X_t) (and mirror descent is a discretization of natural gradient flow) Su, Boyd, Candès '14: The continuous-time limit of accelerated gradient descent is a second-order ODE Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0 These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism? Our work: A general variational approach to acceleration A systematic discretization methodology
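
One quick way to see the correspondence is to integrate the two dynamics numerically; the sketch below is an illustration only, with a forward-Euler integrator, an arbitrary quadratic objective, and a step size dt chosen for the example rather than anything prescribed by the talk.

```python
# Sketch: forward-Euler integration of the accelerated-gradient ODE
#   Xdd_t + (3/t) Xd_t + grad f(X_t) = 0
# versus gradient flow Xd_t = -grad f(X_t), on a quadratic f.
import numpy as np

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

dt, T = 1e-3, 10.0
x_ode, v_ode = np.array([1.0, 1.0]), np.zeros(2)   # position and velocity for the ODE
x_flow = np.array([1.0, 1.0])                      # gradient-flow trajectory

t = dt
while t < T:
    # second-order ODE rewritten as a first-order system in (x, v)
    a = -(3.0 / t) * v_ode - grad(x_ode)
    x_ode = x_ode + dt * v_ode
    v_ode = v_ode + dt * a
    # gradient flow
    x_flow = x_flow - dt * grad(x_flow)
    t += dt

print("f along the accelerated ODE:", f(x_ode))
print("f along gradient flow      :", f(x_flow))
```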

16 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )

17 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t

18 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function [figure: Bregman divergence D_h(y, x)]

19 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function [figure: Bregman divergence D_h(y, x)]

20 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function α_t, β_t, γ_t ∈ R are arbitrary smooth functions [figure: Bregman divergence D_h(y, x)]

21 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t − α_t} ( (1/2) ||ẋ||² − e^{2α_t + β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function α_t, β_t, γ_t ∈ R are arbitrary smooth functions In the Euclidean setting, it simplifies to this damped Lagrangian [figure: Bregman divergence D_h(y, x)]

22 Bregman Lagrangian Define the Bregman Lagrangian: L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Function of position x, velocity ẋ, and time t D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence h is the convex distance-generating function f is the convex objective function α_t, β_t, γ_t ∈ R are arbitrary smooth functions In the Euclidean setting, it simplifies to a damped Lagrangian Ideal scaling conditions: β̇_t ≤ e^{α_t}, γ̇_t = e^{α_t} [figure: Bregman divergence D_h(y, x)]
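
For concreteness, here is a small Python sketch (not from the talk) that evaluates the Bregman divergence D_h and the Bregman Lagrangian above for the polynomial parameter choice that appears later in the talk (α_t = log p − log t, β_t = p log t + log C, γ_t = p log t); the negative-entropy h, the quadratic f, and the values of p and C are placeholder assumptions.

```python
# Sketch: evaluating D_h(y, x) = h(y) - h(x) - <grad h(x), y - x> and
# L(x, xdot, t) = e^{gamma_t + alpha_t} ( D_h(x + e^{-alpha_t} xdot, x) - e^{beta_t} f(x) ).
# h, f, p, and C below are illustrative placeholders.
import numpy as np

def bregman(h, grad_h, y, x):
    return h(y) - h(x) - grad_h(x) @ (y - x)

def bregman_lagrangian(h, grad_h, f, x, xdot, t, p=2.0, C=0.25):
    # polynomial subfamily: alpha_t = log p - log t, beta_t = p log t + log C, gamma_t = p log t
    alpha = np.log(p) - np.log(t)
    beta = p * np.log(t) + np.log(C)
    gamma = p * np.log(t)
    return np.exp(gamma + alpha) * (
        bregman(h, grad_h, x + np.exp(-alpha) * xdot, x) - np.exp(beta) * f(x)
    )

# example: negative entropy on the positive orthant, quadratic objective
h = lambda x: np.sum(x * np.log(x))
grad_h = lambda x: np.log(x) + 1.0
f = lambda x: 0.5 * np.sum(x ** 2)

x = np.array([0.5, 0.5])
xdot = np.array([0.1, -0.1])
print("D_h:", bregman(h, grad_h, x + xdot, x))
print("L  :", bregman_lagrangian(h, grad_h, f, x, xdot, t=1.0))
```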

23 Bregman Lagrangian L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Variational problem over curves: min_X ∫ L(X_t, Ẋ_t, t) dt Optimal curves are characterized by the Euler-Lagrange equation: d/dt { ∂L/∂ẋ (X_t, Ẋ_t, t) } = ∂L/∂x (X_t, Ẋ_t, t)

24 Bregman Lagrangian L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Variational problem over curves: min_X ∫ L(X_t, Ẋ_t, t) dt Optimal curves are characterized by the Euler-Lagrange equation: d/dt { ∂L/∂ẋ (X_t, Ẋ_t, t) } = ∂L/∂x (X_t, Ẋ_t, t) E-L equation for the Bregman Lagrangian under ideal scaling: Ẍ_t + (e^{α_t} − α̇_t) Ẋ_t + e^{2α_t + β_t} [∇²h(X_t + e^{−α_t} Ẋ_t)]^{−1} ∇f(X_t) = 0

25 General convergence rate Theorem. Under ideal scaling, the E-L equation has convergence rate f(X_t) − f(x*) ≤ O(e^{−β_t})

26 General convergence rate Theorem. Under ideal scaling, the E-L equation has convergence rate f(X_t) − f(x*) ≤ O(e^{−β_t}) Proof. Exhibit a Lyapunov function for the dynamics: E_t = D_h(x*, X_t + e^{−α_t} Ẋ_t) + e^{β_t} (f(X_t) − f(x*)) Ė_t = −e^{α_t + β_t} D_f(x*, X_t) + (β̇_t − e^{α_t}) e^{β_t} (f(X_t) − f(x*)) ≤ 0 Note: Only requires convexity and differentiability of f, h

27 Polynomial convergence rate For p > 0, choose parameters: α_t = log p − log t, β_t = p log t + log C, γ_t = p log t The E-L equation has an O(e^{−β_t}) = O(1/t^p) convergence rate: Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0

28 Polynomial convergence rate For p > 0, choose parameters: α_t = log p − log t, β_t = p log t + log C, γ_t = p log t The E-L equation has an O(e^{−β_t}) = O(1/t^p) convergence rate: Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0 For p = 2: Recover the result of Krichene et al. with O(1/t²) convergence rate In the Euclidean case, recover the ODE of Su et al.: Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

29 Time dilation property (reparameterizing time) (p = 2: accelerated gradient descent) O(1/t²): Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0 speed up time: Y_t = X_{t^{3/2}} O(1/t³): Ÿ_t + (4/t) Ẏ_t + 9Ct [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0 (p = 3: accelerated cubic-regularized Newton's method)

30 Time dilation property (reparameterizing time) (p = 2: accelerated gradient descent) O(1/t²): Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0 speed up time: Y_t = X_{t^{3/2}} O(1/t³): Ÿ_t + (4/t) Ẏ_t + 9Ct [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0 (p = 3: accelerated cubic-regularized Newton's method) All accelerated methods are traveling the same curve in space-time at different speeds Gradient methods don't have this property From gradient flow to rescaled gradient flow: replace ∇f by ∇f / ||∇f||^{(p−2)/(p−1)}

31 Time dilation for the general Bregman Lagrangian O(e^{−β_t}): E-L for Lagrangian L_{α,β,γ} speed up time: Y_t = X_{τ(t)} O(e^{−β_{τ(t)}}): E-L for Lagrangian L_{α̃,β̃,γ̃}, where α̃_t = α_{τ(t)} + log τ̇(t), β̃_t = β_{τ(t)}, γ̃_t = γ_{τ(t)}

32 Time dilation for the general Bregman Lagrangian O(e^{−β_t}): E-L for Lagrangian L_{α,β,γ} speed up time: Y_t = X_{τ(t)} O(e^{−β_{τ(t)}}): E-L for Lagrangian L_{α̃,β̃,γ̃}, where α̃_t = α_{τ(t)} + log τ̇(t), β̃_t = β_{τ(t)}, γ̃_t = γ_{τ(t)} Question: How to discretize the E-L equation while preserving the convergence rate?

33 Discretizing the dynamics (naive approach) Write the E-L equation as a system of first-order equations: Z_t = X_t + (t/p) Ẋ_t, d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t)

34 Discretizing the dynamics (naive approach) Write the E-L equation as a system of first-order equations: Z_t = X_t + (t/p) Ẋ_t, d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t) Euler discretization with time step δ > 0 (i.e., set x_k = X_t, x_{k+1} = X_{t+δ}): x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } with step size ε = δ^p, where k^{(p−1)} = k(k+1)···(k+p−2) is the rising factorial

35 Naive discretization doesn't work x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } Cannot obtain a convergence guarantee, and empirically unstable [plot: f(x_k) vs. k]

36 Modified discretization Introduce an auxiliary sequence y_k: x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } Sufficient condition: ⟨∇f(y_k), x_k − y_k⟩ ≥ M ε^{1/(p−1)} ||∇f(y_k)||^{p/(p−1)}

37 Modified discretization Introduce an auxiliary sequence y_k: x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } Sufficient condition: ⟨∇f(y_k), x_k − y_k⟩ ≥ M ε^{1/(p−1)} ||∇f(y_k)||^{p/(p−1)} Assume h is uniformly convex: D_h(y, x) ≥ (1/p) ||y − x||^p Theorem. f(y_k) − f(x*) ≤ O(1/(ε k^p)) Note: Matching convergence rates, 1/(ε k^p) = 1/(δk)^p = 1/t^p Proof using a generalization of Nesterov's estimate sequence technique

38 Accelerated higher-order gradient method x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k, y_k = arg min_y { f_{p−1}(y; x_k) + (2/(εp)) ||y − x_k||^p }, z_k = arg min_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) } If ∇^{p−1} f is (1/ε)-Lipschitz and h is uniformly convex of order p, then f(y_k) − f(x*) ≤ O(1/(ε k^p)), the accelerated rate p = 2: Accelerated gradient/mirror descent p = 3: Accelerated cubic-regularized Newton's method (Nesterov '08) p ≥ 2: Accelerated higher-order methods
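
To make the p = 2 case concrete, the sketch below (an illustration, not the talk's code) instantiates the three updates with Euclidean h, so that both arg mins have closed forms, on a simple quadratic; the constants C and ε and the objective are assumptions chosen for the example.

```python
# Sketch: the p = 2 instance (accelerated gradient descent) of the scheme above with h(x) = 0.5||x||^2,
# so y_k is a gradient step and the z-update is z_k = z_{k-1} - eps * C * p * k * grad f(y_k).
# C, eps, and the quadratic objective are illustrative assumptions.
import numpy as np

A = np.diag([1.0, 50.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

p, C = 2, 0.25
eps = 1.0 / 50.0                     # assumes grad f is (1/eps)-Lipschitz

x = np.array([1.0, 1.0])
z = x.copy()
for k in range(1, 300):
    # y_k = argmin_y { <grad f(x_k), y - x_k> + (2/(eps*p)) ||y - x_k||^p }  (p = 2: a gradient step)
    y = x - (eps / 2.0) * grad(x)
    # z_k = argmin_z { C p k <grad f(y_k), z> + (1/eps) D_h(z, z_{k-1}) }    (Euclidean h: explicit)
    z = z - eps * C * p * k * grad(y)
    # x_{k+1} = p/(k+p) z_k + k/(k+p) y_k
    x = (p / (k + p)) * z + (k / (k + p)) * y

print("f(y_k) =", f(y))
```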

39 Recap: Gradient vs. accelerated methods How to design dynamics for minimizing a convex function f? Rescaled gradient flow Ẋ_t = −∇f(X_t) / ||∇f(X_t)||^{(p−2)/(p−1)}, rate O(1/t^{p−1}) Higher-order gradient method, rate O(1/(ε k^{p−1})) when ∇^{p−1} f is (1/ε)-Lipschitz matching rate with ε = δ^{p−1}, i.e., δ = ε^{1/(p−1)}

40 Recap: Gradient vs. accelerated methods How to design dynamics for minimizing a convex function f? Rescaled gradient flow Ẋ_t = −∇f(X_t) / ||∇f(X_t)||^{(p−2)/(p−1)}, rate O(1/t^{p−1}) Polynomial Euler-Lagrange equation Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0, rate O(1/t^p) Higher-order gradient method, rate O(1/(ε k^{p−1})) when ∇^{p−1} f is (1/ε)-Lipschitz matching rate with ε = δ^{p−1}, δ = ε^{1/(p−1)} Accelerated higher-order method, rate O(1/(ε k^p)) when ∇^{p−1} f is (1/ε)-Lipschitz matching rate with ε = δ^p, δ = ε^{1/p}

41 Summary: Bregman Lagrangian Bregman Lagrangian family with a general convergence guarantee L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) ) Polynomial subfamily generates accelerated higher-order methods: O(1/t^p) convergence rate via higher-order smoothness Exponential subfamily: O(e^{−ct}) rate via uniform convexity Understand the structure and properties of the Bregman Lagrangian: gauge invariance, symmetry, gradient flows as limit points, etc. Bregman Hamiltonian: H(x, p, t) = e^{α_t + γ_t} ( D_{h*}(∇h(x) + e^{−γ_t} p, ∇h(x)) + e^{β_t} f(x) )

42 Part II Stochastically-Controlled Stochastic Gradient with Lihua Lei

43 Setup Task: minimizing a composite objective: min_{x ∈ R^d} f(x) = (1/n) Σ_{i ∈ [n]} f_i(x)

44 Setup Task: minimizing a composite objective: min_{x ∈ R^d} f(x) = (1/n) Σ_{i ∈ [n]} f_i(x) Assumption: ∃ L < ∞, μ ≥ 0, s.t. (μ/2) ||x − y||² ≤ f_i(x) − f_i(y) − ⟨∇f_i(y), x − y⟩ ≤ (L/2) ||x − y||²

45 Setup Task: minimizing a composite objective: min_{x ∈ R^d} f(x) = (1/n) Σ_{i ∈ [n]} f_i(x) Assumption: ∃ L < ∞, μ ≥ 0, s.t. (μ/2) ||x − y||² ≤ f_i(x) − f_i(y) − ⟨∇f_i(y), x − y⟩ ≤ (L/2) ||x − y||² μ = 0: non-strongly convex case; μ > 0: strongly convex case; κ := L/μ.

46 Computation Complexity Accessing (f_i(x), ∇f_i(x)) incurs one unit of cost; Given ε > 0, let T(ε) be the minimum cost to achieve E( f(x_{T(ε)}) − f(x*) ) ≤ ε; Worst-case analysis: bound T(ε) almost surely, e.g., T(ε) = O( (n + κ) log(1/ε) ) (SVRG).

47 SVRG Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for
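
A direct transcription of this epoch structure into NumPy might look as follows; the least-squares components f_i, the step size η, and the number of outer epochs are illustrative assumptions rather than anything prescribed by the slide.

```python
# Sketch: SVRG-style epochs for least squares f_i(x) = 0.5 * (a_i @ x - b_i)^2.
# Data, step size eta, and the number of epochs are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

eta = 0.1 / np.max(np.sum(A ** 2, axis=1))
x = np.zeros(d)

for epoch in range(20):
    x0 = x.copy()
    g = full_grad(x0)                           # line 2: anchor gradient over I = [n]
    N = rng.integers(1, n + 1)                  # line 4: epoch length N ~ Uniform([m]), m = n here
    for _ in range(N):
        i = rng.integers(n)                     # line 6: sample i uniformly from [n]
        nu = grad_i(x, i) - grad_i(x0, i) + g   # line 7: variance-reduced direction
        x = x - eta * nu                        # line 8: update

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```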

48 Analysis
                  General Convex    Strongly Convex
Nesterov's AGD    n/√ε              n √κ log(1/ε)
SGD               1/ε²              (κ/ε) log(1/ε)
SVRG              -                 (n + κ) log(1/ε)
Katyusha          n/√ε              (n + √(nκ)) log(1/ε)
All of the above results are from worst-case analysis; SGD is the only method with complexity free of n; however, its stepsize η depends on the unknown ||x_0 − x*||² and the total number of epochs T.

49 Average-Case Analysis Instead of bounding T(ε), one can bound E T(ε); there is evidence from other areas that average-case analysis can give much better complexity bounds than worst-case analysis.

50 Average-Case Analysis Instead of bounding T(ε), one can bound E T(ε); there is evidence from other areas that average-case analysis can give much better complexity bounds than worst-case analysis. Goal: design an efficient algorithm such that E T(ε) is independent of n and E T(ε) is smaller than the worst-case bounds.

51 Average-Case Analysis An algorithm is tuning-friendly if: the stepsize η is the only parameter to tune; η is a constant which only depends on L and μ. [table: bounds on E T(ε) for SGD, SCSG, and SCSG+ in the general convex and strongly convex cases, together with whether each method is tuning-friendly]

52 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for

53 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch):

54 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch): 1: Randomly pick I ⊆ [n] with |I| = B 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0)

55 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch): 1: Randomly pick I ⊆ [n] with |I| = B 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: γ ← 1 − 1/B 4: Generate N ~ Geo(γ)

56 SCSG/SCSG+: Algorithm SVRG (within an epoch): 1: I ← [n] 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: m ∝ n 4: Generate N ~ Uniform([m]) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ [n] 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for SCSG(+) (within an epoch): 1: Randomly pick I ⊆ [n] with |I| = B 2: g ← (1/|I|) Σ_{i∈I} ∇f_i(x_0) 3: γ ← 1 − 1/B 4: Generate N ~ Geo(γ) 5: for k = 1, 2, ..., N do 6: Randomly pick i ∈ I 7: ν ← ∇f_i(x) − ∇f_i(x_0) + g 8: x ← x − ην 9: end for
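
Relative to the SVRG sketch given earlier, only the anchor batch and the epoch length change; a minimal SCSG-style modification is sketched below, with the data, η, B, and epoch count again illustrative assumptions (NumPy's geometric sampler, which has mean B here, stands in for Geo(γ)).

```python
# Sketch: SCSG-style epochs; the anchor gradient uses a random batch I of size B and the
# epoch length is geometric with mean B. All constants are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 1000, 10, 64
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

eta = 0.1 / np.max(np.sum(A ** 2, axis=1))
x = np.zeros(d)

for epoch in range(50):
    I = rng.choice(n, size=B, replace=False)        # line 1: random batch I, |I| = B
    x0 = x.copy()
    g = A[I].T @ (A[I] @ x0 - b[I]) / B             # line 2: batch anchor gradient
    N = rng.geometric(1.0 / B)                      # lines 3-4: N geometric, E[N] = B
    for _ in range(N):
        i = rng.choice(I)                           # line 6: sample i from the batch I
        nu = grad_i(x, i) - grad_i(x0, i) + g       # line 7: variance-reduced direction
        x = x - eta * nu                            # line 8: update

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```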

57 SCSG/SCSG+: Algorithm In epoch j, SCSG fixes B_j ≡ B(ε); explicit forms of B(ε) are given in both the non-strongly convex and strongly convex cases; SCSG+ uses a geometrically increasing sequence B_j = B_0 b^j ∧ n

58 Conclusion SCSG/SCSG+ save computation costs on average by avoiding calculating the full gradient; SCSG/SCSG+ also save communication costs in distributed systems by avoiding sampling a gradient from the whole dataset; SCSG/SCSG+ are able to achieve an approximate optimum with potentially less than a single pass through the data; The average computation cost of SCSG+ beats the oracle lower bounds from worst-case analysis.

59 Part III On the Analysis of Asynchronous Algorithms: A Perturbed Iterate Framework with Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Ben Recht and Kannan Ramchandran

60 SGD in Practice Step 3: compute ∇f_{s_k}(x_k). Step 4: update the model, x_{k+1} = x_k − γ ∇f_{s_k}(x_k). SGD can take 100s of hours to run on large data sets [Dean et al., Google Brain]. Goal: parallelize. [figure: a data point touches only a few model coordinates, e.g., x_0 ← x_0 − γ [∇f_{s_k}(x_k)]_0, x_4 ← x_4 − γ [∇f_{s_k}(x_k)]_4, x_6 ← x_6 − γ [∇f_{s_k}(x_k)]_6]

61 Challenges in Parallel SGD [figure: several workers read data points and apply updates x ← x − γ ∇f_1(x), ..., x ← x − γ ∇f_n(x) to a shared model] If there is no conflict => speedup

62 Perspective! Asynchronous(Alg( INPUT )) = Serial(Alg(INPUT)+ Noise)

63 Challenges in Parallel SGD [figure: two workers apply x ← x − γ ∇f_1(x) and x ← x − γ ∇f_2(x) to overlapping coordinates of the shared model, creating a conflict] Issues: What happens if there are conflicts? Approach 1: Coordinate? Approach 2: Don't Care?

64 Prior (2011) Work Long line of theoretical work since the 60s: [Chazan, Miranker, 1969] foundational work on asynchronous optimization; master/worker model [Tsitsiklis, Bertsekas, 1986, 1989]. Recent hardware/software advances renewed the interest: round-robin approach [Zinkevich, Langford, Smola, 2009]; average runs [Zinkevich et al., 2009]; average gradients [Duchi et al., Dekel et al., 2010]. Many are based on a Coordinate or Lock approach. Issue: synchronization and communication overheads

65 HOGWILD! (2011) Run SGD in parallel without synchronization. Each processor, in parallel: sample a function f_i; x = read shared memory; g = ∇f_i(x); for v in the support of f_i do x_v ← x_v − γ g_v. [plot: speedup vs. number of splits for an SVM on RCV1 (10 cores), comparing Hogwild, AIG, and RR] Impact: Downpour SGD and Project Adam use HOGWILD!; renewed interest in asynchronous lock-free optimization
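
A toy rendering of this lock-free loop with Python threads is shown below; the sparse logistic-regression data, step size, and thread count are illustrative assumptions, and CPython's GIL means the sketch only mimics the unsynchronized access pattern rather than delivering real speedup.

```python
# Toy sketch of the HOGWILD!-style loop: each worker samples f_i, reads the shared iterate,
# and writes -eta * g only on the support of f_i, with no locks. Illustrative data and constants.
import numpy as np
import threading

rng = np.random.default_rng(0)
n, d, nnz = 2000, 100, 5
rows = [rng.choice(d, size=nnz, replace=False) for _ in range(n)]   # sparse support of f_i
vals = [rng.normal(size=nnz) for _ in range(n)]
y = rng.choice([-1.0, 1.0], size=n)

x = np.zeros(d)          # shared model, written without any locks
eta = 0.05

def worker(seed, num_updates):
    r = np.random.default_rng(seed)
    for _ in range(num_updates):
        i = r.integers(n)                        # sample function f_i
        idx, a = rows[i], vals[i]
        margin = y[i] * (a @ x[idx])             # read shared memory (support only)
        g = -y[i] * a / (1.0 + np.exp(margin))   # logistic-loss gradient on the support
        x[idx] -= eta * g                        # sparse lock-free write: x_v <- x_v - eta * g_v

threads = [threading.Thread(target=worker, args=(s, 5000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

loss = np.mean([np.log1p(np.exp(-y[i] * (vals[i] @ x[rows[i]]))) for i in range(n)])
print("training logistic loss:", loss)
```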

66 Challenges in Asynchronous Analysis [figure: processors 1 through P reading and writing shared memory] Issue: Reads and writes intertwine in arbitrary ways. No clear concept of an iteration.

67 Main Contribution: View Asynchrony as Perturbation Suppose we want to optimize f; then perturbed gradient descent is x_{k+1} = x_k − γ ∇f(x̂_k) [figure: x_1 = x_0 − γ ∇f(x̂_0), where the gradient is evaluated at a perturbed copy x̂_0 of x_0]

68 HOGWILD! as Perturbed SGD (Each processor, in parallel: sample a function f_i; x = read shared memory; g = ∇f_i(x); for v in the support of f_i do x_v ← x_v − γ g_v.) Def: s_k is the k-th sampled data point; x̂_k is whatever the processor working on s_k read from memory. After T processed samples, the content in memory is x_T = x_0 − γ ∇f_{s_0}(x̂_0) − ... − γ ∇f_{s_{T−1}}(x̂_{T−1}) (due to atomic writes + commutativity of addition). The algorithmic progress is captured by "phantom iterates": x_{k+1} = x_k − γ ∇f_{s_k}(x̂_k). Main Questions: 1) How much does the perturbation affect SGD? 2) Where does it come from, and how strong is it?

69 Understanding Asynchronous Perturbation: Bounding ||x_5 − x̂_5|| [figure: timeline of several CPUs processing samples] Recall that x_5 = x_0 − γ ∇f_{s_0}(x̂_0) − γ ∇f_{s_1}(x̂_1) − ... − γ ∇f_{s_4}(x̂_4), and x̂_5 = whatever CPU 2 read from memory. Then x_5 − x̂_5 depends on ∇f_{s_3}(x̂_3), ∇f_{s_4}(x̂_4), ∇f_{s_6}(x̂_6), and ∇f_{s_7}(x̂_7). Use sparsity to show that the perturbation is small.

70 Asynchrony Perturbation Asynchrony causes an error iff sampled data points are in conflict. Pr(two samples conflict) = O(Δ̄/n), where Δ̄ is the average degree of the conflict graph. Observation: If fewer than O(n/Δ̄) samples are processed concurrently, then we have at most a constant number of conflicts (in expectation)

71 Our Main Hogwild Result Assumption: at most τ samples are processed concurrently with any one sample. Theorem: Assuming τ ≤ O(n/Δ̄), the asynchrony noise is "small" and Hogwild achieves the same convergence rate as serial SGD (for strongly convex functions). Sparsity is key in the theory. True in practice? Matrix completion, graph cuts, when using dropout, etc. Comparison to Niu et al.: 1. Full SG update vs. single-coordinate update 2. No assumption of consistent reads 3. Samples ordered by sampling time 4. Improved overlap from n^{1/4} to n 5. Proof is one page long

72 The Perturbed Iterate Framework Several new asynchronous algorithms are analyzable via Asynchronous(Alg(INPUT)) = Serial(Alg(INPUT) + Noise): Asynchronous Coordinate Descent; KroMagnon: Asynchronous Sparse SVRG. KroMagnon on 16 threads is up to 10^4 x faster than serial dense SVRG, and gives an 8x speedup compared to serial sparse SVRG

73 Part IV CoCoA: Communication-Efficient Distributed Primal- Dual Optimization with Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, and Martin Jaggi

74 Optimization when Data is Big Iteratively solve [avoid closed-form] First-order methods [do less computation O(nd)] Stochastic methods [do even less computation O(d)] Parallel, distributed methods

75 Distributed Optimization Always communicate: reduce w = w − α Σ_k Δw_k, with convergence guarantees but high communication. Never communicate: average w := (1/K) Σ_k w_k, with low communication but convergence not guaranteed [ZDWJ, 2012]

76 Mini-batch reduce: w = w − (α/b) Σ_{i∈b} Δw_i convergence guarantees tunable communication a natural middle-ground

77 Mini-batch Limitations 1. METHODS BEYOND SGD 2. STALE UPDATES 3. AVERAGE OVER BATCH SIZE

78 Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA) 1. METHODS BEYOND SGD Use Primal-Dual Framework 2. STALE UPDATES Immediately apply local updates 3. AVERAGE OVER BATCH SIZE Average over K << batch size

79 1. Primal-Dual Framework PRIMAL: min_{w ∈ R^d} P(w) := (λ/2) ||w||² + (1/n) Σ_{i=1}^n ℓ_i(w^T x_i) DUAL: max_{α ∈ R^n} D(α) := −(λ/2) ||Aα||² − (1/n) Σ_{i=1}^n ℓ_i*(−α_i), where A_i = (1/(λn)) x_i Stopping criteria given by the duality gap Good performance in practice Default in software packages, e.g., liblinear The dual separates across machines (one dual variable per datapoint)
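
The duality-gap stopping criterion mentioned above is cheap to evaluate; the sketch below does so for the square loss, whose conjugate is ℓ_i*(u) = ½u² + u·y_i, using a plain SDCA-style coordinate loop to drive the gap down. The data and λ are illustrative assumptions, and the coordinate solver is only a stand-in for whatever local method one prefers.

```python
# Sketch: primal P(w), dual D(alpha), and the duality gap for the square loss
# l_i(z) = 0.5*(z - y_i)^2, with A_i = x_i / (lam * n). Illustrative data and lambda.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 20, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def primal(w):
    return 0.5 * lam * w @ w + np.mean(0.5 * (X @ w - y) ** 2)

def dual(alpha):
    w = X.T @ alpha / (lam * n)                 # w(alpha) = A alpha
    return -0.5 * lam * w @ w - np.mean(0.5 * alpha ** 2 - alpha * y)

alpha, w = np.zeros(n), np.zeros(d)
for _ in range(5000):                           # plain dual coordinate ascent (SDCA-style)
    i = rng.integers(n)
    # closed-form coordinate maximizer for the square loss
    delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
    alpha[i] += delta
    w += delta * X[i] / (lam * n)               # keep w = A alpha in sync

print("duality gap P(w) - D(alpha):", primal(w) - dual(alpha))
```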

80 1. Primal-Dual Framework Global objective: max_{α ∈ R^n} D(α) := −(λ/2) ||Aα||² − (1/n) Σ_{i=1}^n ℓ_i*(−α_i), where A_i = (1/(λn)) x_i Local objective: max_{Δα_[k] ∈ R^n} (1/n) Σ_{i∈P_k} −ℓ_i*(−(α_[k] + Δα_[k])_i) − λ w^T A Δα_[k] − (λ/2) ||A Δα_[k]||² Can solve the local objective using any internal optimization method

81 2. Immediately Apply Updates STALE: for i ∈ b: Δw ← Δw − α ∇_i P(w); end; w ← w + Δw. FRESH: for i ∈ b: w ← w − α ∇_i P(w); end

82 3. Average over K reduce: w = w + (1/K) Σ_k Δw_k

83 CoCoA
Algorithm 1: CoCoA
Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1).
Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
for t = 1, 2, ..., T
  for all machines k = 1, 2, ..., K in parallel
    (Δα_[k], Δw_k) ← LocalDualMethod(α^(t−1)_[k], w^(t−1))
    α^(t)_[k] ← α^(t−1)_[k] + (β_K/K) Δα_[k]
  end
  reduce w^(t) ← w^(t−1) + (β_K/K) Σ_{k=1}^K Δw_k
end
Procedure A: LocalDualMethod: Dual algorithm on machine k
Input: Local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α s.t. w = Aα
Data: Local {(x_i, y_i)}_{i=1}^{n_k}
Output: Δα_[k] and Δw := A_[k] Δα_[k]
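
A single-process simulation of this outer loop, with a few square-loss SDCA coordinate steps standing in for LocalDualMethod, is sketched below; K, H, λ, β_K = 1, and the data are illustrative assumptions, not the paper's configuration.

```python
# Sketch: CoCoA outer loop on one machine. Each "machine" owns a block of dual variables,
# runs H local coordinate steps against its current view of w, and the scaled updates are reduced.
import numpy as np

rng = np.random.default_rng(0)
n, d, K, lam, beta_K, H, T = 400, 20, 4, 0.1, 1.0, 50, 30
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
parts = np.array_split(rng.permutation(n), K)      # data distributed over K "machines"

def local_dual_method(alpha_k, w, idx):
    """A few square-loss SDCA steps on block idx; returns (delta_alpha_k, delta_w)."""
    d_alpha = np.zeros_like(alpha_k)
    d_w = np.zeros_like(w)
    for _ in range(H):
        j = rng.integers(len(idx))
        i = idx[j]
        a_i = alpha_k[j] + d_alpha[j]
        delta = (y[i] - X[i] @ (w + d_w) - a_i) / (1.0 + X[i] @ X[i] / (lam * n))
        d_alpha[j] += delta
        d_w += delta * X[i] / (lam * n)
    return d_alpha, d_w

alpha = [np.zeros(len(idx)) for idx in parts]
w = np.zeros(d)
for _ in range(T):
    updates = [local_dual_method(alpha[k], w, parts[k]) for k in range(K)]   # "in parallel"
    for k in range(K):
        alpha[k] += (beta_K / K) * updates[k][0]
    w += (beta_K / K) * sum(u[1] for u in updates)                           # reduce step

print("primal objective:", 0.5 * lam * w @ w + np.mean(0.5 * (X @ w - y) ** 2))
```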

84 Convergence Assumptions: the ℓ_i are (1/γ)-smooth; LocalDualMethod makes improvement Θ per step, e.g., for SDCA, Θ = (1 − (λnγ)/(1 + λnγ) · (1/ñ))^H Theorem: E[D(α*) − D(α^(T))] ≤ (1 − (1 − Θ) (1/K) (λnγ)/(σ + λnγ))^T (D(α*) − D(α^(0))), and the bound applies also to the duality gap; here 0 ≤ σ ≤ n/K is a measure of the difficulty of the data partition

85 Empirical Results
Dataset    Training (n)    Features (d)    Sparsity    Workers (K)
Cov        522,...                         ...%        4
Rcv1       677,399         47,...          ...%        8
Imagenet   32,...                          ...%        32

86 Algorithms for Comparison [table: Mini-batch SGD, Mini-batch SDCA, Local-SGD, and CoCoA compared on four properties: convergence guarantees, primal-dual formulation, immediately applied updates, and averaging over K]

87 Imagenet [plot: log primal suboptimality vs. time (s), comparing COCOA (H=1e3), mini-batch CD (H=1), local SGD (H=1e3), and mini-batch SGD (H=10)]

88 RCV1, Cov, Imagenet [plots: log primal suboptimality vs. time (s) on RCV1 (COCOA H=1e5, mini-batch CD H=100, local SGD H=1e4, batch SGD H=100), Cov (COCOA H=1e5, mini-batch CD H=100, local SGD H=1e5, batch SGD H=1), and Imagenet (COCOA H=1e3, mini-batch CD H=1, local SGD H=1e3, mini-batch SGD H=10)]

89 CoCoA+ reduce: w = w + ν Σ_k Δw_k, where ν = 1/K => Averaging and ν = 1 => Adding

90 Local Subproblems CoCoA: max_{Δα_[k] ∈ R^n} (1/n) Σ_{i∈P_k} −ℓ_i*(−(α_[k] + Δα_[k])_i) − λ w^T A Δα_[k] − (λ/2) ||A Δα_[k]||² CoCoA+: max_{Δα_[k] ∈ R^n} (1/n) Σ_{i∈P_k} −ℓ_i*(−(α_[k] + Δα_[k])_i) − λ w^T A Δα_[k] − (λσ'/2) ||A Δα_[k]||²

91 must choose: σ' ≥ σ'_min := max_{α ∈ R^n} ||Aα||² / Σ_{k=1}^K ||A_[k] α_[k]||² σ'_min is a measure of the difficulty of the data partitioning; in practice, σ' := νK ∈ [1, K] are safe values CoCoA+: σ' = 1 => Averaging, σ' = K => Adding

92 CoCoA+ vs. CoCoA [plots: duality gap vs. number of communications and vs. elapsed time (s) on Covertype, comparing Averaging and Adding with H = 10^6, 10^5, 10^4]

93 CoCoA+ vs. CoCoA [plots: duality gap vs. number of communications and vs. elapsed time (s) on RCV1, comparing Averaging and Adding with H = 10^6, 10^5, 10^4]

94 Stronger Theoretical Results Convergence rate is independent of K Provide primal-dual rate Cover the case of non-smooth losses (e.g., hinge-loss)

95 Code: github.com/gingsmith/cocoa

96 Part V Avoiding Saddle Points Quickly with Chi Jin, Rong Ge, Sham Kakade, and Praneeth Netrapalli

97 Definition: ε-second-order stationary points, for ℓ-smooth, ρ-Hessian-Lipschitz functions (Nesterov & Polyak, 2006): ||∇f(x)|| ≤ ε, λ_min(∇²f(x)) ≥ −√(ρε). When all saddle points are strict, how fast can we converge to ε-second-order stationary points?
Algorithm                      Iterations            Oracle
(Ge, et al., 2015)             O(poly(d/ε))          Gradient
(Levy, 2016)                   O(d³ poly(1/ε))       Gradient
This Work                      O(polylog(d)/ε²)      Gradient
(Agarwal, et al., 2016)        O(log d / ε^{7/4})    Hessian-vector product
(Carmon, et al., 2016)         O(log d / ε^{7/4})    Hessian-vector product
(Carmon and Duchi, 2016)       O(log d / ε²)         Hessian-vector product
(Nesterov and Polyak, 2006)    O(1/ε^{1.5})          Hessian
(Curtis, et al., 2014)         O(1/ε^{1.5})          Hessian
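
The gradient-only rate in the "This Work" row comes from occasionally perturbing gradient descent near candidate saddle points; a toy sketch of that idea is below, with the test function and all constants chosen purely for illustration rather than taken from the paper.

```python
# Toy sketch of perturbed gradient descent: when the gradient is small (a candidate saddle),
# add a small random perturbation so strict saddles are escaped. The function
# f(x, y) = x^2 - y^2 + y^4/4 has a strict saddle at the origin and minima near y = +/- sqrt(2).
import numpy as np

f = lambda z: z[0] ** 2 - z[1] ** 2 + 0.25 * z[1] ** 4
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1] + z[1] ** 3])

rng = np.random.default_rng(0)
z = np.array([1.0, 1e-12])        # starts essentially on the stable manifold of the saddle
eta, eps, radius = 0.1, 1e-6, 1e-3

for k in range(300):
    if np.linalg.norm(grad(z)) <= eps:          # near a first-order stationary point
        z = z + radius * rng.normal(size=2)     # random perturbation to escape the strict saddle
    z = z - eta * grad(z)

print("final point:", z, "f =", f(z))           # ends near a local minimum, not the saddle
```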
