Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity


1 Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Zheng Qu, University of Hong Kong
CAM, Aug 2016, Hong Kong
Based on joint work with Peter Richtarik and Dominik Csiba (University of Edinburgh)

2 Outline
First-order methods for composite convex optimization
Randomized coordinate descent method
Adaptive sampling
Expected separable overapproximation

3 Problem and Motivation

4 Problem Setup
$$\min_{x \in \mathbb{R}^n} \; [F(x) := f(x) + \psi(x)]$$
$f : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth:
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2, \quad \forall x, h \in \mathbb{R}^n$$
$\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, closed and separable:
$$\psi(x) = \sum_{i=1}^n \psi_i(x_i)$$

5-7 Motivation: Empirical Risk Minimization
ERM: $$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]$$
supervised learning, image processing, ...; train a linear predictor $w \in \mathbb{R}^d$; $n$ training samples $A_1, \ldots, A_n \in \mathbb{R}^d$;
convex loss functions $\phi_i : \mathbb{R} \to \mathbb{R}$; e.g. squared loss ($\phi_i(a) = \tfrac{1}{2}(a - b_i)^2$), logistic loss ($\phi_i(a) = \log(1 + e^{-a})$), ...
convex regularizer $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$; e.g. $L_1$ regularization ($g(w) = \|w\|_1$), $L_2$ regularization ($g(w) = \tfrac{1}{2}\|w\|_2^2$), ...
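
To make the setup concrete, here is a minimal numpy sketch (not from the slides) of the ERM objective $P(w)$ for the squared loss and $L_2$ regularizer listed above; the data A, b and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20                      # n samples A_1, ..., A_n in R^d
A = rng.standard_normal((n, d))     # row i is A_i^T
b = rng.standard_normal(n)          # targets for the squared loss
lam = 1.0 / n                       # regularization parameter lambda


def P(w):
    """ERM objective: (1/n) sum_i phi_i(A_i^T w) + lambda * g(w),
    with squared loss phi_i(a) = 0.5*(a - b_i)^2 and g(w) = 0.5*||w||^2."""
    margins = A @ w                           # (A_i^T w)_{i=1..n}
    loss = 0.5 * np.mean((margins - b) ** 2)  # (1/n) sum_i phi_i
    return loss + lam * 0.5 * np.dot(w, w)


print(P(np.zeros(d)))
```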

8-9 Primal Dual Formulation
ERM: $$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]$$
Dual problem of ERM:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := \underbrace{-\lambda g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i \right)}_{\text{smooth if } g \text{ strongly convex}} \underbrace{- \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)}_{\text{convex and separable}} \right]$$
Optimality conditions:
$$\text{OPT1}: \; w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A \alpha^* \right), \qquad \text{OPT2}: \; \alpha_i^* = -\phi_i'(A_i^\top w^*), \; i = 1, \ldots, n.$$
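
A small numerical illustration of the primal-dual pair may help. The sketch below (my own, assuming squared loss and the $L_2$ regularizer, for which $g^*(z) = \tfrac{1}{2}\|z\|^2$, $\nabla g^*(z) = z$ and $\phi_i^*(u) = \tfrac{1}{2}u^2 + b_i u$) evaluates $P$, $D$ and the primal point $w(\alpha)$ from OPT1, and checks that the duality gap is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))   # column i is the sample A_i in R^d
b = rng.standard_normal(n)
lam = 1.0 / n

# Squared loss phi_i(a) = 0.5*(a - b_i)^2  =>  phi_i^*(u) = 0.5*u^2 + b_i*u
# L2 regularizer g(w) = 0.5*||w||^2        =>  g^*(z) = 0.5*||z||^2, grad g^*(z) = z

def primal(w):
    return 0.5 * np.mean((A.T @ w - b) ** 2) + lam * 0.5 * w @ w

def dual(alpha):
    z = A @ alpha / (lam * n)
    return -lam * 0.5 * z @ z - np.mean(0.5 * alpha ** 2 - b * alpha)

def primal_from_dual(alpha):      # OPT1: w(alpha) = grad g^*( A alpha / (lam n) )
    return A @ alpha / (lam * n)

alpha = rng.standard_normal(n)
w = primal_from_dual(alpha)
print("duality gap:", primal(w) - dual(alpha))   # always >= 0 by weak duality
```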

10 First-order methods for non-strongly convex composite optimization

11 Problem Setup
$$\min_{x \in \mathbb{R}^n} \; [F(x) := f(x) + \psi(x)]$$
$f : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth:
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2, \quad \forall x, h \in \mathbb{R}^n$$
$\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, closed and separable:
$$\psi(x) = \sum_{i=1}^n \psi_i(x_i)$$

12-14 Proximal Gradient
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$
3: for $k \ge 0$ do
4:   for $i \in [n]$ do
5:     $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \{ \langle \nabla_i f(x_k), x \rangle + \frac{v_i}{2} |x - x_k^i|^2 + \psi_i(x) \}$
6:   end for
7: end for
The proximal operator of each $\psi_i$ is assumed to be easily computable.
a.k.a. explicit-implicit / forward-backward splitting algorithm [Lions & Mercier 79], [Eckstein & Bertsekas 89]
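
A minimal sketch of this loop, assuming $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ and $\psi(x) = \lambda\|x\|_1$ (so each prox is soft-thresholding) and the uniform stepsize $v_i = L \ge \lambda_{\max}(A^\top A)$; the problem data and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1                                  # weight of psi(x) = lam_reg * ||x||_1

L = np.linalg.eigvalsh(A.T @ A).max()          # v_i = L >= lambda_max(A^T A)

def soft_threshold(u, t):                      # prox of t*|.| applied coordinate-wise
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def F(x):                                      # F(x) = f(x) + psi(x)
    return 0.5 * np.sum((A @ x - b) ** 2) + lam_reg * np.abs(x).sum()

x = np.zeros(n)
for k in range(200):
    grad = A.T @ (A @ x - b)                   # gradient of f at x_k
    # coordinate-wise minimization of <grad_i, x> + (L/2)|x - x_k^i|^2 + psi_i(x)
    x = soft_threshold(x - grad / L, lam_reg / L)

print("F(x_200) =", F(x))
```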

15-16 Accelerated Proximal Gradient
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}(\psi)$, set $z_0 = x_0$ and $\theta_0 = 1$
3: for $k \ge 0$ do
4:   for $i \in [n]$ do
5:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \{ \langle \nabla_i f((1-\theta_k)x_k + \theta_k z_k), z \rangle + \frac{\theta_k v_i}{2} |z - z_k^i|^2 + \psi_i(z) \}$
6:   end for
7:   $x_{k+1} = (1-\theta_k) x_k + \theta_k z_{k+1}$
8:   $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
9: end for
[Nesterov 83, 04], [Beck & Teboulle 08] (FISTA), [Tseng 08], [Su, Boyd & Candès 14], [Chambolle & Pock 15], [Chambolle & Dossal 15], [Attouch & Peypouquet 15]
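
The same lasso instance run through the accelerated scheme above, again with $v_i = L$; the $\theta$-update of step 8 is implemented verbatim. This is only a sketch under those assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1

L = np.linalg.eigvalsh(A.T @ A).max()

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x = np.zeros(n)
z = x.copy()
theta = 1.0                                    # theta_0 = 1
for k in range(200):
    y = (1 - theta) * x + theta * z            # the point where the gradient is taken
    grad = A.T @ (A @ y - b)
    # step 5 with v_i = L: prox step on z with stepsize 1/(theta*L)
    z_new = soft_threshold(z - grad / (theta * L), lam_reg / (theta * L))
    x = (1 - theta) * x + theta * z_new        # step 7
    z = z_new
    theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2  # step 8

print("F(x) =", 0.5 * np.sum((A @ x - b) ** 2) + lam_reg * np.abs(x).sum())
```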

17 Convergence Analysis
Theorem
If $v_i = L$ for every $i \in [n]$ with $L \ge \lambda_{\max}(A^\top A)$, then the iterates $\{x_k\}$ of the proximal gradient method satisfy:
$$F(x_k) - F(x^*) \le \frac{L \|x_0 - x^*\|^2}{2k}, \quad \forall k \ge 1.$$
Theorem (Tseng 08)
If $v_i = L$ for every $i \in [n]$ with $L \ge \lambda_{\max}(A^\top A)$, then the iterates $\{x_k\}$ of the accelerated proximal gradient algorithm satisfy:
$$F(x_k) - F(x^*) \le \frac{2L \|x_0 - x^*\|^2}{(k+1)^2}, \quad \forall k \ge 1.$$

18-20 Randomized Coordinate Descent
Randomized coordinate descent
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$
3: for $k \ge 0$ do
4:   Generate random $i \in [n]$ uniformly
5:   $x_{k+1} \leftarrow x_k$
6:   $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \{ \langle \nabla_i f(x_k), x \rangle + \frac{v_i}{2} |x - x_k^i|^2 + \psi_i(x) \}$
7: end for
$v = \operatorname{Diag}(A^\top A)$ [Nesterov 10], [Shalev-Shwartz & Tewari 11], [Richtarik & Takac 11]
Other variants [Wright 15]: cyclic (Gauss-Seidel) [Canutescu & Dunbrack 03]; greedy [Wu & Lange 08], [Nutini et al. 15]
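
A minimal sketch of the method for the lasso instance used earlier, with $v = \operatorname{Diag}(A^\top A)$; the residual $r = Ax - b$ is maintained so that one coordinate step costs $O(m)$. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1

v = np.sum(A * A, axis=0)              # v = Diag(A^T A), i.e. v_i = ||A_{:,i}||^2

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x = np.zeros(n)
r = A @ x - b                          # maintained residual
for k in range(20 * n):
    i = rng.integers(n)                # uniform coordinate
    g_i = A[:, i] @ r                  # nabla_i f(x) for f(x) = 0.5*||Ax - b||^2
    x_new_i = soft_threshold(x[i] - g_i / v[i], lam_reg / v[i])
    r += A[:, i] * (x_new_i - x[i])    # update residual after the coordinate move
    x[i] = x_new_i

print("F(x) =", 0.5 * r @ r + lam_reg * np.abs(x).sum())
```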

21-22 Parallel Randomized Coordinate Descent
Parallel coordinate descent
1: Parameters: $\tau \in [n]$, vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$
3: for $k \ge 0$ do
4:   Generate a random subset $S_k \subseteq [n]$ of size $\tau$ uniformly
5:   $x_{k+1} \leftarrow x_k$
6:   for $i \in S_k$ do
7:     $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \{ \langle \nabla_i f(x_k), x \rangle + \frac{v_i}{2} |x - x_k^i|^2 + \psi_i(x) \}$
8:   end for
9: end for
$$v = \left( 1 + \frac{(\tau - 1)(\omega - 1)}{\max(n-1,\,1)} \right) \operatorname{Diag}(A^\top A) \quad \text{[Richtarik \& Takac 13]}$$
where $\omega$ is the maximal number of nonzero elements in a row of $A$.
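
A sketch of one way to form the stepsize vector $v$ above and run the parallel update, again on an illustrative lasso instance (for a dense $A$ one gets $\omega = n$ and hence $v = \tau \operatorname{Diag}(A^\top A)$).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 60, 40, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1

omega = np.max(np.count_nonzero(A, axis=1))    # max nonzeros per row of A
beta = 1 + (tau - 1) * (omega - 1) / max(n - 1, 1)
v = beta * np.sum(A * A, axis=0)               # v = beta * Diag(A^T A)

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x = np.zeros(n)
for k in range(500):
    S = rng.choice(n, size=tau, replace=False)      # tau-nice sampling
    grad_S = A[:, S].T @ (A @ x - b)                # nabla_i f(x) for i in S_k
    x[S] = soft_threshold(x[S] - grad_S / v[S], lam_reg / v[S])

print("F(x) =", 0.5 * np.sum((A @ x - b) ** 2) + lam_reg * np.abs(x).sum())
```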

23 Convergence Analysis
Theorem (Richtarik & Takac 13)
Define the level-set distance
$$\mathcal{R}_v(x_0, x^*) := \max_x \{ \|x - x^*\|_v^2 : F(x) \le F(x_0) \}.$$
Under the assumption $\mathcal{R}_v(x_0, x^*) < +\infty$, we have:
$$\mathbb{E}[F(x_k)] - F(x^*) \le \frac{2n \max\{\mathcal{R}_v(x_0, x^*),\, F(x_0) - F(x^*)\}}{2n \max\{\mathcal{R}_v(x_0, x^*)/(F(x_0) - F(x^*)),\, 1\} + \tau k}.$$

24 Accelerated Parallel Proximal Coordinate Descent
1: Parameters: $\tau \in [n]$, vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}(\psi)$, set $z_0 = x_0$ and $\theta_0 = \tau/n$
3: for $k \ge 0$ do
4:   $y_k = (1-\theta_k) x_k + \theta_k z_k$
5:   Generate a random subset $S_k \subseteq [n]$ of size $\tau$ uniformly
6:   $z_{k+1} \leftarrow z_k$
7:   for $i \in S_k$ do
8:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \{ \langle \nabla_i f(y_k), z \rangle + \frac{\theta_k v_i n}{2\tau} |z - z_k^i|^2 + \psi_i(z) \}$
9:   end for
10:  $x_{k+1} = y_k + \theta_k (n/\tau) (z_{k+1} - z_k)$
11:  $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
12: end for
[Nesterov 10], [Lee & Sidford 13], [Fercoq & Richtarik 13] ...

25 Convergence Analysis
Theorem (Fercoq & Richtarik 13)
Choose
$$v_i = \sum_{j=1}^m \left( 1 + \frac{(\omega_j - 1)(\tau - 1)}{\max(n-1,\,1)} \right) A_{ji}^2, \quad i = 1, 2, \ldots, n,$$
where $\omega_j$ is the number of nonzero elements in the $j$-th row of $A$. Then the iterates $\{x_k\}$ of APPROX satisfy, for all $k \ge 1$:
$$\mathbb{E}[F(x_k) - F(x^*)] \le \frac{4 \left[ \left(1 - \frac{\tau}{n}\right) (F(x_0) - F(x^*)) + \frac{1}{2} \|x_0 - x^*\|_v^2 \right]}{((k-1)\tau/n + 2)^2}.$$
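
The row-wise stepsizes $v_i$ in the theorem can be computed directly from the sparsity pattern of $A$; a small numpy sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 12, 8, 3
A = rng.standard_normal((m, n))
A[rng.random((m, n)) < 0.5] = 0.0            # sparsify so omega_j varies across rows

omega = np.count_nonzero(A, axis=1)          # omega_j = number of nonzeros in row j
weights = 1 + (omega - 1) * (tau - 1) / max(n - 1, 1)
v = (weights[:, None] * A ** 2).sum(axis=0)  # v_i = sum_j weight_j * A_ji^2
print(np.round(v, 3))
```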

26-31 Summary
Parallel Coordinate Descent (choose subset of size $\tau$ uniformly):
  $\tau = 1$: Randomized Coordinate Descent
  $\tau = n$: Proximal Gradient
Accelerated Parallel Proximal Coordinate Descent (choose subset of size $\tau$ uniformly):
  $\tau = n$: Accelerated Proximal Gradient

32 Randomized coordinate descent method with arbitrary sampling Q. and Richtarik. Coordinate descent with arbitrary sampling I: algorithms and complexity, Optimization methods and software, 2016.

33 Sampling
A sampling is a set-valued random variable $\hat{S} \subseteq \{1, \ldots, n\}$.
Probability vector: $p_i = \mathbb{P}(i \in \hat{S}), \; i \in \{1, \ldots, n\}$
Proper sampling: $p_i = \mathbb{P}(i \in \hat{S}) > 0$ for all $i \in \{1, \ldots, n\}$
Serial sampling: $\mathbb{P}(|\hat{S}| = 1) = 1$
Uniform sampling: $p_1 = \cdots = p_n = \mathbb{E}[|\hat{S}|]/n$
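
A quick sketch of a $\tau$-nice sampling (the uniform sampling over subsets of size $\tau$ used earlier) together with an empirical check that its probability vector is constant and equals $\mathbb{E}[|\hat{S}|]/n = \tau/n$; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 10, 3

def tau_nice_sampling():
    """Uniform sampling over subsets of [n] of size tau."""
    return rng.choice(n, size=tau, replace=False)

counts = np.zeros(n)
trials = 20000
for _ in range(trials):
    counts[tau_nice_sampling()] += 1
p_hat = counts / trials                # empirical p_i = P(i in S_hat)

print("empirical p:", np.round(p_hat, 3))
print("tau / n    :", tau / n)
```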

34 Algorithm
1: Parameters: proper sampling $\hat{S}$ with probability vector $p = (p_1, \ldots, p_n) \in [0,1]^n$, $v \in \mathbb{R}^n_{++}$, sequence $\{\theta_k\}_{k \ge 0} \subset (0,1]$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$ and set $z_0 = x_0$
3: for $k \ge 0$ do
4:   $y_k = (1-\theta_k) x_k + \theta_k z_k$
5:   Generate a random set of blocks $S_k \sim \hat{S}$
6:   $z_{k+1} \leftarrow z_k$
7:   for $i \in S_k$ do
8:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \{ \langle \nabla_i f(y_k), z \rangle + \frac{\theta_k v_i}{2 p_i} |z - z_k^i|^2 + \psi_i(z) \}$
9:   end for
10:  $x_{k+1} = y_k + \theta_k\, p^{-1} \circ (z_{k+1} - z_k)$
11: end for
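
A minimal sketch of this scheme in the special case $\psi \equiv 0$ with a serial sampling $p_i \propto \sqrt{L_i}$, $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, $v_i = L_i = \|A_i\|^2$, $\theta_0 = 1$ and the accelerated $\theta$-update used later in the convergence results. It is written naively with full-vector operations; avoiding those is exactly the point of the efficient implementation on the next slide. All concrete choices are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

L = np.sum(A * A, axis=0)                  # v_i = L_i = ||A_i||^2 (i-th column of A)
p = np.sqrt(L) / np.sqrt(L).sum()          # serial sampling with p_i propto sqrt(L_i)

x = np.zeros(n)
z = x.copy()
theta = 1.0                                # theta_0 = 1 (psi = 0)
for k in range(5000):
    y = (1 - theta) * x + theta * z        # step 4
    i = rng.choice(n, p=p)                 # step 5: serial sampling from p
    g_i = A[:, i] @ (A @ y - b)            # nabla_i f(y_k)
    dz = -p[i] * g_i / (theta * L[i])      # step 8 with psi_i = 0
    x = y.copy()
    x[i] = y[i] + theta / p[i] * dz        # step 10 (only coordinate i moves)
    z[i] += dz
    theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2

print("f(x) =", 0.5 * np.sum((A @ x - b) ** 2))
```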

35 Efficient Implementation
1: Parameters: proper sampling $\hat{S}$ with probability vector $p = (p_1, \ldots, p_n)$, $v \in \mathbb{R}^n_{++}$, sequence $\{\theta_k\}_{k \ge 0}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$, set $z_0 = x_0$, $u_0 = 0$ and $\alpha_0 = 1$
3: for $k \ge 0$ do
4:   Generate a random set of coordinates $S_k \sim \hat{S}$
5:   $z_{k+1} \leftarrow z_k$, $u_{k+1} \leftarrow u_k$
6:   for $i \in S_k$ do
7:     $\Delta z_k^i = \arg\min_{t \in \mathbb{R}} \{ t\, \nabla_i f(\alpha_k u_k + z_k) + \frac{\theta_k v_i}{2 p_i} t^2 + \psi_i(z_k^i + t) \}$
8:     $z_{k+1}^i \leftarrow z_k^i + \Delta z_k^i$
9:     $u_{k+1}^i \leftarrow u_k^i - \alpha_k^{-1} (1 - \theta_k p_i^{-1}) \Delta z_k^i$
10:  end for
11:  $\alpha_{k+1} = (1 - \theta_{k+1}) \alpha_k$
12: end for
13: OUTPUT: $x_{k+1} = z_k + \alpha_k u_k + \theta_k\, p^{-1} \circ (z_{k+1} - z_k)$

36-37 Convergence Analysis
Lemma
Let $\hat{S}$ be an arbitrary proper sampling and $v \in \mathbb{R}^n_{++}$ be such that
$$\mathbb{E}[f(x + h_{[\hat{S}]})] \le f(x) + \langle \nabla f(x), h \rangle_p + \tfrac{1}{2} \|h\|^2_{v \circ p}, \quad \forall x, h \in \mathbb{R}^n.$$
Let $\{\theta_k\}_{k \ge 0}$ be an arbitrary sequence of numbers in $(0,1]$. Then for the sequence of iterates produced by the algorithm and all $k \ge 0$, the following recursion holds:
$$\mathbb{E}_k\!\left[ \hat{F}_{k+1} + \frac{\theta_k^2}{2} \|z_{k+1} - x^*\|^2_{v \circ p^{-2}} \right] \le \left[ \hat{F}_k + \frac{\theta_k^2}{2} \|z_k - x^*\|^2_{v \circ p^{-2}} \right] - \theta_k (\hat{F}_k - F^*).$$
Moreover, $\hat{F}_k \ge F(x_k)$ if $\psi \equiv 0$ or $\theta_k \le \min_i p_i$.

38 Convergence Results
Assume $(f, \hat{S}) \sim \mathrm{ESO}(v)$ and that $\psi \equiv 0$ or $\theta_k \le \min_i p_i$. Let
$$C = (1 - \theta_0)(F(x_0) - F^*) + \frac{\theta_0^2}{2} \|x_0 - x^*\|^2_{v \circ p^{-2}}.$$
Constant stepsize $\theta_k \equiv \theta_0$:
$$\mathbb{E}\!\left[ F\!\left( \frac{x_k + \theta_0 \sum_{t=1}^{k-1} x_t}{1 + (k-1)\theta_0} \right) \right] - F^* \le \frac{C}{(k-1)\theta_0 + 1}.$$
Accelerated stepsize $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$:
$$\mathbb{E}[F(x_k)] - F^* \le \frac{4C}{((k-1)\theta_0 + 2)^2}.$$

39 Corollaries: Parallel Coordinate Descent
Corollary
The iterates $\{x_k\}$ of Parallel Coordinate Descent satisfy:
$$\mathbb{E}[F(x_k)] - F(x^*) \le \frac{n}{(k-1)\tau + n} \left[ \left( 1 - \frac{\tau}{n} \right) (F(x_0) - F(x^*)) + \frac{1}{2} \|x_0 - x^*\|_v^2 \right].$$
Compare with [Richtarik & Takac 13], which requires $\max_x \{ \|x - x^*\|_v^2 : F(x) \le F(x_0) \} < +\infty$.
[Lu & Xiao 14] ($\tau = 1$):
$$\mathbb{E}[F(x_k)] - F(x^*) \le \frac{n}{n + k} \left[ (F(x_0) - F(x^*)) + \frac{1}{2} \|x_0 - x^*\|_v^2 \right].$$

40-41 Corollaries: Smooth Minimization
Corollary
If $\psi \equiv 0$, then the iterates $\{x_k\}$ of accelerated coordinate descent satisfy:
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 \|x_0 - x^*\|^2_{v \circ p^{-2}}}{(k+1)^2}, \quad \forall k \ge 1.$$
Define $L_i = A_i^\top A_i$ for $i = 1, \ldots, n$.
Corollary
If at each step we update coordinate $i$ with probability $p_i \propto \sqrt{L_i}$, then
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 \left( \sum_i \sqrt{L_i} \right)^2 \|x_0 - x^*\|^2}{(k+1)^2}, \quad \forall k \ge 1.$$

42-43 Corollaries: Smooth Minimization
Serial sampling $\hat{S}$, $v = L$:
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 \|x_0 - x^*\|^2_{L \circ p^{-2}}}{(k+1)^2}, \quad \forall k \ge 1.$$
The probability minimizing the right-hand side is:
$$p_i^* = \frac{(L_i\, |x_i^* - x_i^0|^2)^{1/3}}{\sum_{j=1}^n (L_j\, |x_j^* - x_j^0|^2)^{1/3}}, \quad i = 1, \ldots, n.$$
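
The optimal probabilities depend on the unknown optimum $x^*$, so the sketch below (illustrative data, with $x^*$ drawn at random) only shows the what-if effect: the weighted distance $\sum_i L_i |x_i^0 - x_i^*|^2 / p_i^2$ driving the bound is never larger under $p^*$ than under the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
L = rng.uniform(0.5, 5.0, size=n)          # L_i = ||A_i||^2
x0 = np.zeros(n)
x_star = rng.standard_normal(n)            # pretend optimum (unknown in practice)

# p_i propto (L_i * |x_i^* - x_i^0|^2)^(1/3) minimizes
# sum_i L_i |x_i^0 - x_i^*|^2 / p_i^2 subject to sum_i p_i = 1.
w = (L * (x_star - x0) ** 2) ** (1.0 / 3.0)
p_opt = w / w.sum()

def bound_constant(p):                     # ||x_0 - x^*||^2 with weights L_i / p_i^2
    return np.sum(L * (x0 - x_star) ** 2 / p ** 2)

print("constant, uniform p :", bound_constant(np.full(n, 1.0 / n)))
print("constant, optimal p :", bound_constant(p_opt))   # never larger
```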

44 Stochastic dual coordinate ascent with adaptive sampling
Csiba, Q. and Richtarik. Stochastic dual coordinate ascent with adaptive probabilities, International Conference on Machine Learning, 2015.

45-46 Primal Dual Formulation
ERM: $$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]$$
Dual problem of ERM:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := \underbrace{-\lambda g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i \right)}_{\text{smooth}} \underbrace{- \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)}_{\gamma\text{ strongly convex and separable}} \right]$$
Optimality conditions:
$$\text{OPT1}: \; w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A \alpha^* \right), \qquad \text{OPT2}: \; \alpha_i^* = -\phi_i'(A_i^\top w^*), \; i = 1, \ldots, n.$$

47-48 Stochastic Dual Coordinate Ascent
Primal solution: for $t \ge 0$: $\; w^t = \nabla g^*\!\left( \frac{1}{\lambda n} A \alpha^t \right)$
Dual solution: for $t \ge 0$:
1. $\alpha^{t+1} = \alpha^t$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to a fixed distribution $p$;
3. Update $\alpha^{t+1}_{i_t}$:
$$\alpha^{t+1}_{i_t} = \arg\max_{\beta \in \mathbb{R}} \left\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^t)\,\beta - \frac{\|A_{i_t}\|^2}{2\lambda n} |\beta - \alpha^t_{i_t}|^2 \right\}$$
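
A minimal SDCA sketch, assuming the squared loss and $L_2$ regularizer so that the coordinate maximization in step 3 has the closed form $\beta = (b_i - A_i^\top w + 2s\alpha_i)/(1 + 2s)$ with $s = \|A_i\|^2/(2\lambda n)$; the relation $w = A\alpha/(\lambda n)$ is maintained incrementally. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))            # column i is sample A_i
b = rng.standard_normal(n)
lam = 1.0 / n
sqnorms = np.sum(A * A, axis=0)            # ||A_i||^2

def primal(w):
    return 0.5 * np.mean((A.T @ w - b) ** 2) + lam * 0.5 * w @ w

def dual(alpha):
    z = A @ alpha / (lam * n)
    return -lam * 0.5 * z @ z - np.mean(0.5 * alpha ** 2 - b * alpha)

alpha = np.zeros(n)
w = A @ alpha / (lam * n)                  # w^t = grad g^*(A alpha / (lam n))
for t in range(20 * n):
    i = rng.integers(n)                    # uniform sampling (fixed distribution)
    s = sqnorms[i] / (2 * lam * n)
    # closed-form maximizer of -phi_i^*(-beta) - (A_i^T w) beta - s (beta - alpha_i)^2
    beta = (b[i] - A[:, i] @ w + 2 * s * alpha[i]) / (1 + 2 * s)
    w += A[:, i] * (beta - alpha[i]) / (lam * n)   # keep w = A alpha / (lam n) in sync
    alpha[i] = beta

print("duality gap:", primal(w) - dual(alpha))
```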

49-50 Uniform and Importance Sampling
Uniform sampling (SDCA: [Shalev-Shwartz & Zhang 13], ...):
$$p_i = \mathrm{Prob}(i_t = i) \equiv \frac{1}{n}, \qquad \text{iteration complexity: } \tilde{O}\!\left( n + \frac{\max_i \|A_i\|^2}{\lambda \gamma} \right)$$
Importance sampling (Iprox-SDCA: [Zhao & Zhang 15], ...):
$$p_i = \mathrm{Prob}(i_t = i) \propto \|A_i\|^2 + \lambda \gamma n, \qquad \text{iteration complexity: } \tilde{O}\!\left( n + \frac{\frac{1}{n} \sum_{i=1}^n \|A_i\|^2}{\lambda \gamma} \right)$$
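
A short sketch computing the importance-sampling probabilities and the leading terms of the two complexity bounds on an illustrative data set with heterogeneous column norms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n)) * rng.uniform(0.2, 3.0, size=n)  # heterogeneous norms
lam, gamma = 1.0 / n, 1.0

sqnorms = np.sum(A * A, axis=0)                    # ||A_i||^2

p_uniform = np.full(n, 1.0 / n)
p_importance = sqnorms + lam * gamma * n           # p_i propto ||A_i||^2 + lam*gamma*n
p_importance = p_importance / p_importance.sum()

# Leading terms of the two iteration-complexity bounds (up to log factors):
print("uniform    : n + max_i ||A_i||^2 / (lam*gamma) =", n + sqnorms.max() / (lam * gamma))
print("importance : n + avg_i ||A_i||^2 / (lam*gamma) =", n + sqnorms.mean() / (lam * gamma))
```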

51-53 Adaptive Sampling
Each dual variable has a natural measure of progress, called the dual residue:
$$\kappa_i^t := \alpha_i^t + \phi_i'(A_i^\top w^t), \quad i = 1, \ldots, n.$$
Optimality conditions:
$$\text{OPT1}: \; w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A \alpha^* \right), \qquad \text{OPT2}: \; \alpha_i^* = -\phi_i'(A_i^\top w^*), \; i \in [n].$$
A sampling distribution $p$ is coherent with $\kappa^t$ if for all $i \in [n]$: $\kappa_i^t \ne 0 \Rightarrow p_i > 0$.
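
For the squared loss ($\phi_i'(a) = a - b_i$) the dual residue is a one-line computation; the sketch below (illustrative data) also checks that a distribution proportional to $|\kappa|$ is coherent with $\kappa$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))
b = rng.standard_normal(n)
lam = 1.0 / n

alpha = rng.standard_normal(n)
w = A @ alpha / (lam * n)                 # w^t = grad g^*(A alpha^t / (lam n))

# Dual residue for the squared loss phi_i(a) = 0.5*(a - b_i)^2, phi_i'(a) = a - b_i:
kappa = alpha + (A.T @ w - b)             # kappa_i^t = alpha_i^t + phi_i'(A_i^T w^t)

# p is coherent with kappa if p_i > 0 whenever kappa_i != 0:
p = np.abs(kappa) / np.abs(kappa).sum()
print("coherent:", np.all((kappa == 0) | (p > 0)))
```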

54-55 Adaptive Stochastic Dual Coordinate Ascent
Primal solution: for $t \ge 0$: $\; w^t = \nabla g^*\!\left( \frac{1}{\lambda n} A \alpha^t \right)$
Dual solution: for $t \ge 0$:
1. $\alpha^{t+1} = \alpha^t$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to a distribution $p^t$ coherent with the dual residue $\kappa^t$ (instead of a fixed distribution $p$ as in SDCA);
3. Update $\alpha^{t+1}_{i_t}$:
$$\alpha^{t+1}_{i_t} = \arg\max_{\beta \in \mathbb{R}} \left\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^t)\,\beta - \frac{\|A_{i_t}\|^2}{2\lambda n} |\beta - \alpha^t_{i_t}|^2 \right\}$$

56 Convergence Theorem
Theorem (AdaSDCA)
Consider AdaSDCA. If at each iteration $t \ge 0$,
$$\theta(\kappa^t, p^t) := \frac{n \lambda \gamma \sum_i |\kappa_i^t|^2}{\sum_{i: \kappa_i^t \ne 0} (p_i^t)^{-1} |\kappa_i^t|^2 (\|A_i\|^2 + n \lambda \gamma)} \ge \min_{i: \kappa_i^t \ne 0} p_i^t,$$
then
$$\mathbb{E}[P(w^t) - D(\alpha^t)] \le \frac{1}{\tilde{\theta}^t} \prod_{k=0}^{t} (1 - \tilde{\theta}^k) \left( D(\alpha^*) - D(\alpha^0) \right)$$
for all $t \ge 0$, where $\tilde{\theta}^t := \dfrac{\mathbb{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))]}{\mathbb{E}[P(w^t) - D(\alpha^t)]}$.

57-60 Optimal Adaptive Sampling Probability
$$p^*(\kappa^t) = \arg\max_p \; \theta(\kappa^t, p) \quad \text{s.t.} \quad p \in \mathbb{R}^n_+, \; \textstyle\sum_i p_i = 1, \; p \text{ coherent with } \kappa^t, \; \theta(\kappa^t, p) \ge \min_{i: \kappa_i^t \ne 0} p_i$$
Relaxation:
$$p^*(\kappa^t) = \arg\max_p \; \frac{n \lambda \gamma \sum_i |\kappa_i^t|^2}{\sum_{i: \kappa_i^t \ne 0} p_i^{-1} |\kappa_i^t|^2 (\|A_i\|^2 + n \lambda \gamma)} \quad \text{s.t.} \quad p \in \mathbb{R}^n_+, \; \textstyle\sum_{i=1}^n p_i = 1$$
Solution:
$$(p^*(\kappa^t))_i \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}, \quad i \in [n].$$

61-63 Exact Relaxation for Squared Loss
Theorem (AdaSDCA for squared loss)
Consider AdaSDCA. If all the loss functions $\{\phi_i\}$ are squared loss functions, then
$$\mathbb{E}[P(w^t) - D(\alpha^t)] \le \frac{1}{\tilde{\theta}^t} \prod_{k=0}^{t} (1 - \tilde{\theta}^k) \left( D(\alpha^*) - D(\alpha^0) \right)$$
for all $t \ge 0$, where $\tilde{\theta}^t := \dfrac{\mathbb{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))]}{\mathbb{E}[P(w^t) - D(\alpha^t)]}$.
The optimal adaptive sampling probability is given by:
$$(p^*(\kappa^t))_i \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}, \quad i \in [n].$$

64 AdaSDCA
Dual solution: for $t \ge 1$:
1. Compute the dual residue $\kappa^t$: $\kappa_i^t = \alpha_i^t + \phi_i'(A_i^\top w^t)$, and set $p_i^t \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ with probability proportional to $p^t$;
3. Update $\alpha^t_{i_t}$:
$$\alpha^t_{i_t} = \arg\max_{\beta \in \mathbb{R}} \left\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^{t-1})\,\beta - \frac{\|A_{i_t}\|^2}{2\lambda n} |\beta - \alpha^{t-1}_{i_t}|^2 \right\}$$
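
A minimal AdaSDCA sketch for the squared-loss case, reusing the closed-form coordinate update from the SDCA sketch above and the sampling probability $p_i^t \propto |\kappa_i^t|\sqrt{\|A_i\|^2 + n\lambda\gamma}$. Recomputing $\kappa^t$ at every step is what makes one epoch cost $O(n \cdot \mathrm{nnz})$, as recorded in the cost table two slides below. All concrete choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))
b = rng.standard_normal(n)
lam, gamma = 1.0 / n, 1.0                       # squared loss is 1-smooth (gamma = 1)
sqnorms = np.sum(A * A, axis=0)

alpha = np.zeros(n)
w = A @ alpha / (lam * n)
for t in range(20 * n):
    kappa = alpha + (A.T @ w - b)               # dual residue, phi_i'(a) = a - b_i
    score = np.abs(kappa) * np.sqrt(sqnorms + n * lam * gamma)
    if score.sum() == 0:                        # all residues zero: alpha is optimal
        break
    p = score / score.sum()                     # p_i^t propto |kappa_i^t| sqrt(||A_i||^2 + n*lam*gamma)
    i = rng.choice(n, p=p)
    s = sqnorms[i] / (2 * lam * n)
    beta = (b[i] - A[:, i] @ w + 2 * s * alpha[i]) / (1 + 2 * s)   # same closed form as SDCA
    w += A[:, i] * (beta - alpha[i]) / (lam * n)
    alpha[i] = beta

print("max |dual residue|:", np.abs(alpha + (A.T @ w - b)).max())
```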

65 Heuristic and Efficient Variant of AdaSDCA
AdaSDCA+: for $t \ge 1$:
1. If $\mathrm{mod}(t, n) = 0$, then
   Option I (Adaptive Sampling Probability): compute the dual residue $\kappa^t$: $\kappa_i^t = \alpha_i^t + \phi_i'(A_i^\top w^t)$, and set $p_i^t \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}$;
   Option II (Importance Sampling Probability): set $p_i^t \propto \|A_i\|^2 + n \lambda \gamma$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to $p^t$;
3. Update $\alpha^t_{i_t}$;
4. Update probability: $p^{t+1} = (p_1^t, \ldots, p_{i_t}^t / m, \ldots, p_n^t)$.
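
A sketch of the AdaSDCA+ heuristic, again for the squared loss: probabilities are recomputed only once per epoch (Option I shown) and the picked coordinate's probability is divided by $m$ after each step. The renormalization after the shrink and the small epsilon are my additions so that sampling always sees a proper distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m_const = 100, 20, 2.0                        # m_const plays the role of the constant m
A = rng.standard_normal((d, n))
b = rng.standard_normal(n)
lam, gamma = 1.0 / n, 1.0
sqnorms = np.sum(A * A, axis=0)

alpha = np.zeros(n)
w = A @ alpha / (lam * n)
p = np.ones(n) / n
for t in range(1, 20 * n + 1):
    if t % n == 0:                                  # recompute the probabilities once per epoch
        kappa = alpha + (A.T @ w - b)               # Option I: adaptive probabilities
        p = np.abs(kappa) * np.sqrt(sqnorms + n * lam * gamma) + 1e-12
        p = p / p.sum()
    i = rng.choice(n, p=p)
    s = sqnorms[i] / (2 * lam * n)
    beta = (b[i] - A[:, i] @ w + 2 * s * alpha[i]) / (1 + 2 * s)
    w += A[:, i] * (beta - alpha[i]) / (lam * n)
    alpha[i] = beta
    p[i] /= m_const                                 # step 4: shrink the picked coordinate's probability
    p = p / p.sum()                                 # renormalize (my addition)

print("max |dual residue|:", np.abs(alpha + (A.T @ w - b)).max())
```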

66 Computational Cost per Epoch
Algorithm   | Cost of one epoch
SDCA        | O(nnz)
Iprox-SDCA  | O(nnz + n log n)
AdaSDCA     | O(n · nnz)
AdaSDCA+    | O(nnz + n log n)
Table 1: One-epoch computational cost of different algorithms.

67 Numerical Experiments Figure 1: w8a dataset, d = 300, n = 49749. Quadratic loss with L2 regularizer, λ = 1/n, γ = 1.

68 Numerical Experiments Figure 2: w8a dataset, d = 300, n = 49749. Quadratic loss with L2 regularizer, λ = 1/n, γ = 1.

69 Numerical Experiments Figure 3: cov1 dataset, d = 54, n = 581,012. Smooth Hinge loss with L2 regularizer, λ = 1/n, γ = 1.

70 Numerical Experiments Figure 4: cov1 dataset, d = 54, n = 581,012. Smooth Hinge loss with L2 regularizer, λ = 1/n, γ = 1.

71 Numerical Experiments Figure 5: cov1 dataset, d = 54, n = 581,012. Smooth Hinge loss with L2 regularizer, λ = 1/n, γ = 1. Comparison of different choices of the constant m.

72 More on ESO Q. and Richtarik. Coordinate descent with arbitrary sampling II: expected separable overapproximation, Optimization methods and software, 2016.

73-77 ESO
The function $f$ admits an expected separable overapproximation (ESO) with respect to $\hat{S}$ and $v \in \mathbb{R}^n_+$, denoted $(f, \hat{S}) \sim \mathrm{ESO}(v)$, if
$$\mathbb{E}[f(x + h_{[\hat{S}]})] \le f(x) + \langle \nabla f(x), h \rangle_p + \tfrac{1}{2} \|h\|^2_{v \circ p}, \quad \forall x, h \in \mathbb{R}^n.$$
Recall the smoothness assumption:
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2} \|Ah\|^2, \quad \forall x, h \in \mathbb{R}^n.$$
Hence $(f, \hat{S}) \sim \mathrm{ESO}(v)$ if
$$\mathbb{E}[\|A h_{[\hat{S}]}\|^2] = h^\top \mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}]\, h \le \|h\|^2_{v \circ p}, \quad \forall h \in \mathbb{R}^n.$$

78-82 Deriving Stepsize
Find $v \in \mathbb{R}^n_+$ such that
$$\mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] \preceq \operatorname{Diag}(p \circ v).$$
We have
$$\mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] = P \circ (A^\top A), \quad \text{where } P_{ij} = \mathbb{P}(i \in \hat{S}, j \in \hat{S}).$$
Let $A_1, \ldots, A_m$ denote the rows of $A$ (so $A^\top = (A_1^\top, \ldots, A_m^\top)$); then
$$P \circ (A^\top A) = \sum_{j=1}^m P \circ (A_j^\top A_j).$$
Denote $J_j := \{ i \in [n] : A_{ji} \ne 0 \}$; then
$$P \circ (A^\top A) = \sum_{j=1}^m P \circ (A_j^\top A_j) = \sum_{j=1}^m P_{[J_j]} \circ (A_j^\top A_j).$$

83 Deriving Stepsize
Theorem (ESO with coupling between sampling and data)
Let $\hat{S}$ be an arbitrary sampling and $v = (v_1, \ldots, v_n)$ be defined by:
$$v_i = \sum_{j=1}^m \lambda'(J_j, \hat{S})\, A_{ji}^2, \quad i = 1, 2, \ldots, n,$$
where
$$\lambda'(J, \hat{S}) := \max_{h \in \mathbb{R}^n} \{ h^\top P_{[J]} h : h^\top \operatorname{Diag}(P_{[J]})\, h \le 1 \}.$$
Then $(f, \hat{S}) \sim \mathrm{ESO}(v)$.
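
A numerical sketch of the theorem for the $\tau$-nice sampling: build the probability matrix $P$, compute $\lambda'(J_j, \hat{S})$ on each row support as a generalized eigenvalue problem, assemble $v$, and verify the matrix inequality $\mathbb{E}[I_{[\hat{S}]}A^\top A I_{[\hat{S}]}] = P \circ (A^\top A) \preceq \operatorname{Diag}(p \circ v)$ behind the ESO. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 12, 8, 3
A = rng.standard_normal((m, n))
A[rng.random((m, n)) < 0.5] = 0.0          # sparsify so the row supports J_j differ

# Probability matrix of the tau-nice sampling: P_ij = P(i in S_hat, j in S_hat)
p = tau / n
P = np.full((n, n), p * (tau - 1) / (n - 1))
np.fill_diagonal(P, p)

def lam_prime(J):
    """lambda'(J, S_hat) = max{ h^T P_[J] h : h^T Diag(P_[J]) h <= 1 }."""
    PJ = P[np.ix_(J, J)]
    D = np.diag(PJ)
    M = PJ / np.sqrt(np.outer(D, D))       # D^{-1/2} P_J D^{-1/2}
    return np.linalg.eigvalsh(M).max()

v = np.zeros(n)
for j in range(m):
    J = np.flatnonzero(A[j])               # J_j = {i in [n] : A_ji != 0}
    if J.size > 0:
        v[J] += lam_prime(J) * A[j, J] ** 2   # v_i = sum_j lambda'(J_j, S_hat) A_ji^2

# ESO check: P o (A^T A) <= Diag(p o v) in the semidefinite order
gap = np.diag(p * v) - P * (A.T @ A)
print("min eigenvalue of Diag(p*v) - P o (A^T A):", np.linalg.eigvalsh(gap).min())  # ~>= 0
```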

84-85 Deriving Stepsize
Tight bounds for:
serial sampling: $\lambda'(J, \hat{S}) = 1$;
uniform distribution over subsets of fixed size $\tau$ (aka $\tau$-nice sampling) ([Richtarik & Takac 13]):
$$\lambda'(J, \hat{S}) = 1 + \frac{(|J| - 1)(\tau - 1)}{\max(n - 1,\, 1)}.$$

86 Deriving Stepsize
Bound for distributed sampling, with the data equally partitioned over $c$ processors, each of which independently draws a $\tau$-nice sampling from its own partition ([Fercoq, Q., Richtarik & Takac 14]):
$$\lambda'(J, \hat{S}) \le \frac{\tau}{\tau - 1} \left( 1 + \frac{(|J| - 1)(\tau - 1)}{\max(n/c - 1,\, 1)} \right).$$

87 Conclusion
Unified convergence analysis for: randomized coordinate descent, accelerated randomized coordinate descent, arbitrary sampling.
Convergence condition (ESO) + formulae for computing admissible stepsizes.
Adaptive sampling using the duality gap.
