Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity


1 Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Zheng Qu, University of Hong Kong
CAM, Aug 2016, Hong Kong
Based on joint work with Peter Richtarik and Dominik Csiba (University of Edinburgh)

2 Outline
First-order methods for composite convex optimization
Randomized coordinate descent method
Adaptive sampling
Expected separable overapproximation

3 Problem and Motivation

4 Problem Setup
$$\min_{x \in \mathbb{R}^n} \; [F(x) := f(x) + \psi(x)]$$
$f : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth:
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2, \quad \forall x, h \in \mathbb{R}^n$$
$\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, closed and separable:
$$\psi(x) = \sum_{i=1}^n \psi_i(x_i)$$

5-7 Motivation: Empirical Risk Minimization
ERM: $$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]$$
supervised learning, image processing, ...; train a linear predictor $w \in \mathbb{R}^d$; $n$ training samples $A_1, \ldots, A_n \in \mathbb{R}^d$;
convex loss functions $\phi_i : \mathbb{R} \to \mathbb{R}$; e.g. squared loss ($\phi_i(a) = \tfrac{1}{2}(a - b_i)^2$), logistic loss ($\phi_i(a) = \log(1 + e^{-a})$), ...
convex regularizer $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$; e.g. $L_1$ regularization ($g(w) = \|w\|_1$), $L_2$ regularization ($g(w) = \tfrac{1}{2}\|w\|_2^2$), ...
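
To make the setup concrete, here is a minimal numpy sketch (not from the slides) of the ERM objective $P(w)$ for the squared loss and $L_2$ regularizer listed above; the data A, b and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20                      # n samples A_1, ..., A_n in R^d
A = rng.standard_normal((n, d))     # row i is A_i^T
b = rng.standard_normal(n)          # targets for the squared loss
lam = 1.0 / n                       # regularization parameter lambda


def P(w):
    """ERM objective: (1/n) sum_i phi_i(A_i^T w) + lambda * g(w),
    with squared loss phi_i(a) = 0.5*(a - b_i)^2 and g(w) = 0.5*||w||^2."""
    margins = A @ w                           # (A_i^T w)_{i=1..n}
    loss = 0.5 * np.mean((margins - b) ** 2)  # (1/n) sum_i phi_i
    return loss + lam * 0.5 * np.dot(w, w)


print(P(np.zeros(d)))
```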

8-9 Primal Dual Formulation
ERM: $$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]$$
Dual problem of ERM:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := \underbrace{-\lambda g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i \right)}_{\text{smooth if } g \text{ strongly convex}} \underbrace{- \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)}_{\text{convex and separable}} \right]$$
Optimality conditions:
$$\text{OPT1}: \; w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A \alpha^* \right), \qquad \text{OPT2}: \; \alpha_i^* = -\phi_i'(A_i^\top w^*), \; i = 1, \ldots, n.$$
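
A small numerical illustration of the primal-dual pair may help. The sketch below (my own, assuming squared loss and the $L_2$ regularizer, for which $g^*(z) = \tfrac{1}{2}\|z\|^2$, $\nabla g^*(z) = z$ and $\phi_i^*(u) = \tfrac{1}{2}u^2 + b_i u$) evaluates $P$, $D$ and the primal point $w(\alpha)$ from OPT1, and checks that the duality gap is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))   # column i is the sample A_i in R^d
b = rng.standard_normal(n)
lam = 1.0 / n

# Squared loss phi_i(a) = 0.5*(a - b_i)^2  =>  phi_i^*(u) = 0.5*u^2 + b_i*u
# L2 regularizer g(w) = 0.5*||w||^2        =>  g^*(z) = 0.5*||z||^2, grad g^*(z) = z

def primal(w):
    return 0.5 * np.mean((A.T @ w - b) ** 2) + lam * 0.5 * w @ w

def dual(alpha):
    z = A @ alpha / (lam * n)
    return -lam * 0.5 * z @ z - np.mean(0.5 * alpha ** 2 - b * alpha)

def primal_from_dual(alpha):      # OPT1: w(alpha) = grad g^*( A alpha / (lam n) )
    return A @ alpha / (lam * n)

alpha = rng.standard_normal(n)
w = primal_from_dual(alpha)
print("duality gap:", primal(w) - dual(alpha))   # always >= 0 by weak duality
```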

10 First-order methods for non-strongly convex composite optimization

11 Problem Setup
$$\min_{x \in \mathbb{R}^n} \; [F(x) := f(x) + \psi(x)]$$
$f : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth:
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2, \quad \forall x, h \in \mathbb{R}^n$$
$\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, closed and separable:
$$\psi(x) = \sum_{i=1}^n \psi_i(x_i)$$

12-14 Proximal Gradient
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$
3: for $k \ge 0$ do
4:   for $i \in [n]$ do
5:     $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \{ \langle \nabla_i f(x_k), x \rangle + \frac{v_i}{2} |x - x_k^i|^2 + \psi_i(x) \}$
6:   end for
7: end for
The proximal operator of each $\psi_i$ is assumed to be easily computable.
a.k.a. explicit-implicit / forward-backward splitting algorithm [Lions & Mercier 79], [Eckstein & Bertsekas 89]
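
A minimal sketch of this loop, assuming $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ and $\psi(x) = \lambda\|x\|_1$ (so each prox is soft-thresholding) and the uniform stepsize $v_i = L \ge \lambda_{\max}(A^\top A)$; the problem data and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1                                  # weight of psi(x) = lam_reg * ||x||_1

L = np.linalg.eigvalsh(A.T @ A).max()          # v_i = L >= lambda_max(A^T A)

def soft_threshold(u, t):                      # prox of t*|.| applied coordinate-wise
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def F(x):                                      # F(x) = f(x) + psi(x)
    return 0.5 * np.sum((A @ x - b) ** 2) + lam_reg * np.abs(x).sum()

x = np.zeros(n)
for k in range(200):
    grad = A.T @ (A @ x - b)                   # gradient of f at x_k
    # coordinate-wise minimization of <grad_i, x> + (L/2)|x - x_k^i|^2 + psi_i(x)
    x = soft_threshold(x - grad / L, lam_reg / L)

print("F(x_200) =", F(x))
```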

15-16 Accelerated Proximal Gradient
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}(\psi)$, set $z_0 = x_0$ and $\theta_0 = 1$
3: for $k \ge 0$ do
4:   for $i \in [n]$ do
5:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \{ \langle \nabla_i f((1-\theta_k)x_k + \theta_k z_k), z \rangle + \frac{\theta_k v_i}{2} |z - z_k^i|^2 + \psi_i(z) \}$
6:   end for
7:   $x_{k+1} = (1-\theta_k) x_k + \theta_k z_{k+1}$
8:   $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
9: end for
[Nesterov 83, 04], [Beck & Teboulle 08] (FISTA), [Tseng 08], [Su, Boyd & Candès 14], [Chambolle & Pock 15], [Chambolle & Dossal 15], [Attouch & Peypouquet 15]
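
The same lasso instance run through the accelerated scheme above, again with $v_i = L$; the $\theta$-update of step 8 is implemented verbatim. This is only a sketch under those assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1

L = np.linalg.eigvalsh(A.T @ A).max()

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x = np.zeros(n)
z = x.copy()
theta = 1.0                                    # theta_0 = 1
for k in range(200):
    y = (1 - theta) * x + theta * z            # the point where the gradient is taken
    grad = A.T @ (A @ y - b)
    # step 5 with v_i = L: prox step on z with stepsize 1/(theta*L)
    z_new = soft_threshold(z - grad / (theta * L), lam_reg / (theta * L))
    x = (1 - theta) * x + theta * z_new        # step 7
    z = z_new
    theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2  # step 8

print("F(x) =", 0.5 * np.sum((A @ x - b) ** 2) + lam_reg * np.abs(x).sum())
```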

17 Convergence Analysis
Theorem
If $v_i = L$ for every $i \in [n]$ with $L \ge \lambda_{\max}(A^\top A)$, then the iterates $\{x_k\}$ of the proximal gradient method satisfy:
$$F(x_k) - F(x^*) \le \frac{L \|x_0 - x^*\|^2}{2k}, \quad \forall k \ge 1.$$
Theorem (Tseng 08)
If $v_i = L$ for every $i \in [n]$ with $L \ge \lambda_{\max}(A^\top A)$, then the iterates $\{x_k\}$ of the accelerated proximal gradient algorithm satisfy:
$$F(x_k) - F(x^*) \le \frac{2L \|x_0 - x^*\|^2}{(k+1)^2}, \quad \forall k \ge 1.$$

18-20 Randomized Coordinate Descent
Randomized coordinate descent
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$
3: for $k \ge 0$ do
4:   Generate random $i \in [n]$ uniformly
5:   $x_{k+1} \leftarrow x_k$
6:   $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \{ \langle \nabla_i f(x_k), x \rangle + \frac{v_i}{2} |x - x_k^i|^2 + \psi_i(x) \}$
7: end for
$v = \operatorname{Diag}(A^\top A)$ [Nesterov 10], [Shalev-Shwartz & Tewari 11], [Richtarik & Takac 11]
Other variants [Wright 15]: cyclic (Gauss-Seidel) [Canutescu & Dunbrack 03]; greedy [Wu & Lange 08], [Nutini et al. 15]
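
A minimal sketch of the method for the lasso instance used earlier, with $v = \operatorname{Diag}(A^\top A)$; the residual $r = Ax - b$ is maintained so that one coordinate step costs $O(m)$. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1

v = np.sum(A * A, axis=0)              # v = Diag(A^T A), i.e. v_i = ||A_{:,i}||^2

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x = np.zeros(n)
r = A @ x - b                          # maintained residual
for k in range(20 * n):
    i = rng.integers(n)                # uniform coordinate
    g_i = A[:, i] @ r                  # nabla_i f(x) for f(x) = 0.5*||Ax - b||^2
    x_new_i = soft_threshold(x[i] - g_i / v[i], lam_reg / v[i])
    r += A[:, i] * (x_new_i - x[i])    # update residual after the coordinate move
    x[i] = x_new_i

print("F(x) =", 0.5 * r @ r + lam_reg * np.abs(x).sum())
```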

21-22 Parallel Randomized Coordinate Descent
Parallel coordinate descent
1: Parameters: $\tau \in [n]$, vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$
3: for $k \ge 0$ do
4:   Generate a random subset $S_k \subseteq [n]$ of size $\tau$ uniformly
5:   $x_{k+1} \leftarrow x_k$
6:   for $i \in S_k$ do
7:     $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \{ \langle \nabla_i f(x_k), x \rangle + \frac{v_i}{2} |x - x_k^i|^2 + \psi_i(x) \}$
8:   end for
9: end for
$$v = \left( 1 + \frac{(\tau - 1)(\omega - 1)}{\max(n-1,\,1)} \right) \operatorname{Diag}(A^\top A) \quad \text{[Richtarik \& Takac 13]}$$
where $\omega$ is the maximal number of nonzero elements in a row of $A$.
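
A sketch of one way to form the stepsize vector $v$ above and run the parallel update, again on an illustrative lasso instance (for a dense $A$ one gets $\omega = n$ and hence $v = \tau \operatorname{Diag}(A^\top A)$).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 60, 40, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam_reg = 0.1

omega = np.max(np.count_nonzero(A, axis=1))    # max nonzeros per row of A
beta = 1 + (tau - 1) * (omega - 1) / max(n - 1, 1)
v = beta * np.sum(A * A, axis=0)               # v = beta * Diag(A^T A)

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x = np.zeros(n)
for k in range(500):
    S = rng.choice(n, size=tau, replace=False)      # tau-nice sampling
    grad_S = A[:, S].T @ (A @ x - b)                # nabla_i f(x) for i in S_k
    x[S] = soft_threshold(x[S] - grad_S / v[S], lam_reg / v[S])

print("F(x) =", 0.5 * np.sum((A @ x - b) ** 2) + lam_reg * np.abs(x).sum())
```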

23 Convergence Analysis
Theorem (Richtarik & Takac 13)
Define the level-set distance
$$\mathcal{R}_v(x_0, x^*) := \max_x \{ \|x - x^*\|_v^2 : F(x) \le F(x_0) \}.$$
Under the assumption $\mathcal{R}_v(x_0, x^*) < +\infty$, we have:
$$\mathbb{E}[F(x_k)] - F(x^*) \le \frac{2n \max\{\mathcal{R}_v(x_0, x^*),\, F(x_0) - F(x^*)\}}{2n \max\{\mathcal{R}_v(x_0, x^*)/(F(x_0) - F(x^*)),\, 1\} + \tau k}.$$

24 Accelerated Parallel Proximal Coordinate Descent
1: Parameters: $\tau \in [n]$, vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}(\psi)$, set $z_0 = x_0$ and $\theta_0 = \tau/n$
3: for $k \ge 0$ do
4:   $y_k = (1-\theta_k) x_k + \theta_k z_k$
5:   Generate a random subset $S_k \subseteq [n]$ of size $\tau$ uniformly
6:   $z_{k+1} \leftarrow z_k$
7:   for $i \in S_k$ do
8:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \{ \langle \nabla_i f(y_k), z \rangle + \frac{\theta_k v_i n}{2\tau} |z - z_k^i|^2 + \psi_i(z) \}$
9:   end for
10:  $x_{k+1} = y_k + \theta_k (n/\tau) (z_{k+1} - z_k)$
11:  $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
12: end for
[Nesterov 10], [Lee & Sidford 13], [Fercoq & Richtarik 13] ...

25 Convergence Analysis
Theorem (Fercoq & Richtarik 13)
Choose
$$v_i = \sum_{j=1}^m \left( 1 + \frac{(\omega_j - 1)(\tau - 1)}{\max(n-1,\,1)} \right) A_{ji}^2, \quad i = 1, 2, \ldots, n,$$
where $\omega_j$ is the number of nonzero elements in the $j$-th row of $A$. Then the iterates $\{x_k\}$ of APPROX satisfy, for all $k \ge 1$:
$$\mathbb{E}[F(x_k) - F(x^*)] \le \frac{4 \left[ \left(1 - \frac{\tau}{n}\right) (F(x_0) - F(x^*)) + \frac{1}{2} \|x_0 - x^*\|_v^2 \right]}{((k-1)\tau/n + 2)^2}.$$
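
The row-wise stepsizes $v_i$ in the theorem can be computed directly from the sparsity pattern of $A$; a small numpy sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 12, 8, 3
A = rng.standard_normal((m, n))
A[rng.random((m, n)) < 0.5] = 0.0            # sparsify so omega_j varies across rows

omega = np.count_nonzero(A, axis=1)          # omega_j = number of nonzeros in row j
weights = 1 + (omega - 1) * (tau - 1) / max(n - 1, 1)
v = (weights[:, None] * A ** 2).sum(axis=0)  # v_i = sum_j weight_j * A_ji^2
print(np.round(v, 3))
```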

26-31 Summary
Parallel Coordinate Descent (choose subset of size $\tau$ uniformly):
  $\tau = 1$: Randomized Coordinate Descent
  $\tau = n$: Proximal Gradient
Accelerated Parallel Proximal Coordinate Descent (choose subset of size $\tau$ uniformly):
  $\tau = n$: Accelerated Proximal Gradient

32 Randomized coordinate descent method with arbitrary sampling Q. and Richtarik. Coordinate descent with arbitrary sampling I: algorithms and complexity, Optimization methods and software, 2016.

33 Sampling
A sampling is a set-valued random variable $\hat{S} \subseteq \{1, \ldots, n\}$.
Probability vector: $p_i = \mathbb{P}(i \in \hat{S}), \; i \in \{1, \ldots, n\}$
Proper sampling: $p_i = \mathbb{P}(i \in \hat{S}) > 0$ for all $i \in \{1, \ldots, n\}$
Serial sampling: $\mathbb{P}(|\hat{S}| = 1) = 1$
Uniform sampling: $p_1 = \cdots = p_n = \mathbb{E}[|\hat{S}|]/n$
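
A quick sketch of a $\tau$-nice sampling (the uniform sampling over subsets of size $\tau$ used earlier) together with an empirical check that its probability vector is constant and equals $\mathbb{E}[|\hat{S}|]/n = \tau/n$; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 10, 3

def tau_nice_sampling():
    """Uniform sampling over subsets of [n] of size tau."""
    return rng.choice(n, size=tau, replace=False)

counts = np.zeros(n)
trials = 20000
for _ in range(trials):
    counts[tau_nice_sampling()] += 1
p_hat = counts / trials                # empirical p_i = P(i in S_hat)

print("empirical p:", np.round(p_hat, 3))
print("tau / n    :", tau / n)
```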

34 Algorithm
1: Parameters: proper sampling $\hat{S}$ with probability vector $p = (p_1, \ldots, p_n) \in [0,1]^n$, $v \in \mathbb{R}^n_{++}$, sequence $\{\theta_k\}_{k \ge 0} \subset (0,1]$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$ and set $z_0 = x_0$
3: for $k \ge 0$ do
4:   $y_k = (1-\theta_k) x_k + \theta_k z_k$
5:   Generate a random set of blocks $S_k \sim \hat{S}$
6:   $z_{k+1} \leftarrow z_k$
7:   for $i \in S_k$ do
8:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \{ \langle \nabla_i f(y_k), z \rangle + \frac{\theta_k v_i}{2 p_i} |z - z_k^i|^2 + \psi_i(z) \}$
9:   end for
10:  $x_{k+1} = y_k + \theta_k\, p^{-1} \circ (z_{k+1} - z_k)$
11: end for
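
A minimal sketch of this scheme in the special case $\psi \equiv 0$ with a serial sampling $p_i \propto \sqrt{L_i}$, $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, $v_i = L_i = \|A_i\|^2$, $\theta_0 = 1$ and the accelerated $\theta$-update used later in the convergence results. It is written naively with full-vector operations; avoiding those is exactly the point of the efficient implementation on the next slide. All concrete choices are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

L = np.sum(A * A, axis=0)                  # v_i = L_i = ||A_i||^2 (i-th column of A)
p = np.sqrt(L) / np.sqrt(L).sum()          # serial sampling with p_i propto sqrt(L_i)

x = np.zeros(n)
z = x.copy()
theta = 1.0                                # theta_0 = 1 (psi = 0)
for k in range(5000):
    y = (1 - theta) * x + theta * z        # step 4
    i = rng.choice(n, p=p)                 # step 5: serial sampling from p
    g_i = A[:, i] @ (A @ y - b)            # nabla_i f(y_k)
    dz = -p[i] * g_i / (theta * L[i])      # step 8 with psi_i = 0
    x = y.copy()
    x[i] = y[i] + theta / p[i] * dz        # step 10 (only coordinate i moves)
    z[i] += dz
    theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2

print("f(x) =", 0.5 * np.sum((A @ x - b) ** 2))
```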

35 Efficient Implementation
1: Parameters: proper sampling $\hat{S}$ with probability vector $p = (p_1, \ldots, p_n)$, $v \in \mathbb{R}^n_{++}$, sequence $\{\theta_k\}_{k \ge 0}$
2: Initialization: choose $x_0 \in \operatorname{dom} \psi$, set $z_0 = x_0$, $u_0 = 0$ and $\alpha_0 = 1$
3: for $k \ge 0$ do
4:   Generate a random set of coordinates $S_k \sim \hat{S}$
5:   $z_{k+1} \leftarrow z_k$, $u_{k+1} \leftarrow u_k$
6:   for $i \in S_k$ do
7:     $\Delta z_k^i = \arg\min_{t \in \mathbb{R}} \{ t\, \nabla_i f(\alpha_k u_k + z_k) + \frac{\theta_k v_i}{2 p_i} t^2 + \psi_i(z_k^i + t) \}$
8:     $z_{k+1}^i \leftarrow z_k^i + \Delta z_k^i$
9:     $u_{k+1}^i \leftarrow u_k^i - \alpha_k^{-1} (1 - \theta_k p_i^{-1}) \Delta z_k^i$
10:  end for
11:  $\alpha_{k+1} = (1 - \theta_{k+1}) \alpha_k$
12: end for
13: OUTPUT: $x_{k+1} = z_k + \alpha_k u_k + \theta_k\, p^{-1} \circ (z_{k+1} - z_k)$

36-37 Convergence Analysis
Lemma
Let $\hat{S}$ be an arbitrary proper sampling and $v \in \mathbb{R}^n_{++}$ be such that
$$\mathbb{E}[f(x + h_{[\hat{S}]})] \le f(x) + \langle \nabla f(x), h \rangle_p + \tfrac{1}{2} \|h\|^2_{v \circ p}, \quad \forall x, h \in \mathbb{R}^n.$$
Let $\{\theta_k\}_{k \ge 0}$ be an arbitrary sequence of numbers in $(0,1]$. Then for the sequence of iterates produced by the algorithm and all $k \ge 0$, the following recursion holds:
$$\mathbb{E}_k\!\left[ \hat{F}_{k+1} + \frac{\theta_k^2}{2} \|z_{k+1} - x^*\|^2_{v \circ p^{-2}} \right] \le \left[ \hat{F}_k + \frac{\theta_k^2}{2} \|z_k - x^*\|^2_{v \circ p^{-2}} \right] - \theta_k (\hat{F}_k - F^*).$$
Moreover, $\hat{F}_k \ge F(x_k)$ if $\psi \equiv 0$ or $\theta_k \le \min_i p_i$.

38 Convergence Results
Assume $(f, \hat{S}) \sim \mathrm{ESO}(v)$ and that $\psi \equiv 0$ or $\theta_k \le \min_i p_i$. Let
$$C = (1 - \theta_0)(F(x_0) - F^*) + \frac{\theta_0^2}{2} \|x_0 - x^*\|^2_{v \circ p^{-2}}.$$
Constant stepsize $\theta_k \equiv \theta_0$:
$$\mathbb{E}\!\left[ F\!\left( \frac{x_k + \theta_0 \sum_{t=1}^{k-1} x_t}{1 + (k-1)\theta_0} \right) \right] - F^* \le \frac{C}{(k-1)\theta_0 + 1}.$$
Accelerated stepsize $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$:
$$\mathbb{E}[F(x_k)] - F^* \le \frac{4C}{((k-1)\theta_0 + 2)^2}.$$

39 Corollaries: Parallel Coordinate Descent
Corollary
The iterates $\{x_k\}$ of Parallel Coordinate Descent satisfy:
$$\mathbb{E}[F(x_k)] - F(x^*) \le \frac{n}{(k-1)\tau + n} \left[ \left( 1 - \frac{\tau}{n} \right) (F(x_0) - F(x^*)) + \frac{1}{2} \|x_0 - x^*\|_v^2 \right].$$
Compare with [Richtarik & Takac 13], which requires $\max_x \{ \|x - x^*\|_v^2 : F(x) \le F(x_0) \} < +\infty$.
[Lu & Xiao 14] ($\tau = 1$):
$$\mathbb{E}[F(x_k)] - F(x^*) \le \frac{n}{n + k} \left[ (F(x_0) - F(x^*)) + \frac{1}{2} \|x_0 - x^*\|_v^2 \right].$$

40-41 Corollaries: Smooth Minimization
Corollary
If $\psi \equiv 0$, then the iterates $\{x_k\}$ of accelerated coordinate descent satisfy:
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 \|x_0 - x^*\|^2_{v \circ p^{-2}}}{(k+1)^2}, \quad \forall k \ge 1.$$
Define $L_i = A_i^\top A_i$ for $i = 1, \ldots, n$.
Corollary
If at each step we update coordinate $i$ with probability $p_i \propto \sqrt{L_i}$, then
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 \left( \sum_i \sqrt{L_i} \right)^2 \|x_0 - x^*\|^2}{(k+1)^2}, \quad \forall k \ge 1.$$

42-43 Corollaries: Smooth Minimization
Serial sampling $\hat{S}$, $v = L$:
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 \|x_0 - x^*\|^2_{L \circ p^{-2}}}{(k+1)^2}, \quad \forall k \ge 1.$$
The probability minimizing the right-hand side is:
$$p_i^* = \frac{(L_i\, |x_i^* - x_i^0|^2)^{1/3}}{\sum_{j=1}^n (L_j\, |x_j^* - x_j^0|^2)^{1/3}}, \quad i = 1, \ldots, n.$$
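
The optimal probabilities depend on the unknown optimum $x^*$, so the sketch below (illustrative data, with $x^*$ drawn at random) only shows the what-if effect: the weighted distance $\sum_i L_i |x_i^0 - x_i^*|^2 / p_i^2$ driving the bound is never larger under $p^*$ than under the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
L = rng.uniform(0.5, 5.0, size=n)          # L_i = ||A_i||^2
x0 = np.zeros(n)
x_star = rng.standard_normal(n)            # pretend optimum (unknown in practice)

# p_i propto (L_i * |x_i^* - x_i^0|^2)^(1/3) minimizes
# sum_i L_i |x_i^0 - x_i^*|^2 / p_i^2 subject to sum_i p_i = 1.
w = (L * (x_star - x0) ** 2) ** (1.0 / 3.0)
p_opt = w / w.sum()

def bound_constant(p):                     # ||x_0 - x^*||^2 with weights L_i / p_i^2
    return np.sum(L * (x0 - x_star) ** 2 / p ** 2)

print("constant, uniform p :", bound_constant(np.full(n, 1.0 / n)))
print("constant, optimal p :", bound_constant(p_opt))   # never larger
```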

44 Stochastic dual coordinate ascent with adaptive sampling
Csiba, Q. and Richtarik. Stochastic dual coordinate ascent with adaptive probabilities, International Conference on Machine Learning, 2015.

45-46 Primal Dual Formulation
ERM: $$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]$$
Dual problem of ERM:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := \underbrace{-\lambda g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i \right)}_{\text{smooth}} \underbrace{- \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)}_{\gamma\text{ strongly convex and separable}} \right]$$
Optimality conditions:
$$\text{OPT1}: \; w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A \alpha^* \right), \qquad \text{OPT2}: \; \alpha_i^* = -\phi_i'(A_i^\top w^*), \; i = 1, \ldots, n.$$

47-48 Stochastic Dual Coordinate Ascent
Primal solution: for $t \ge 0$: $\; w^t = \nabla g^*\!\left( \frac{1}{\lambda n} A \alpha^t \right)$
Dual solution: for $t \ge 0$:
1. $\alpha^{t+1} = \alpha^t$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to a fixed distribution $p$;
3. Update $\alpha^{t+1}_{i_t}$:
$$\alpha^{t+1}_{i_t} = \arg\max_{\beta \in \mathbb{R}} \left\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^t)\,\beta - \frac{\|A_{i_t}\|^2}{2\lambda n} |\beta - \alpha^t_{i_t}|^2 \right\}$$
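
A minimal SDCA sketch, assuming the squared loss and $L_2$ regularizer so that the coordinate maximization in step 3 has the closed form $\beta = (b_i - A_i^\top w + 2s\alpha_i)/(1 + 2s)$ with $s = \|A_i\|^2/(2\lambda n)$; the relation $w = A\alpha/(\lambda n)$ is maintained incrementally. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))            # column i is sample A_i
b = rng.standard_normal(n)
lam = 1.0 / n
sqnorms = np.sum(A * A, axis=0)            # ||A_i||^2

def primal(w):
    return 0.5 * np.mean((A.T @ w - b) ** 2) + lam * 0.5 * w @ w

def dual(alpha):
    z = A @ alpha / (lam * n)
    return -lam * 0.5 * z @ z - np.mean(0.5 * alpha ** 2 - b * alpha)

alpha = np.zeros(n)
w = A @ alpha / (lam * n)                  # w^t = grad g^*(A alpha / (lam n))
for t in range(20 * n):
    i = rng.integers(n)                    # uniform sampling (fixed distribution)
    s = sqnorms[i] / (2 * lam * n)
    # closed-form maximizer of -phi_i^*(-beta) - (A_i^T w) beta - s (beta - alpha_i)^2
    beta = (b[i] - A[:, i] @ w + 2 * s * alpha[i]) / (1 + 2 * s)
    w += A[:, i] * (beta - alpha[i]) / (lam * n)   # keep w = A alpha / (lam n) in sync
    alpha[i] = beta

print("duality gap:", primal(w) - dual(alpha))
```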

49-50 Uniform and Importance Sampling
Uniform sampling (SDCA: [Shalev-Shwartz & Zhang 13], ...):
$$p_i = \mathrm{Prob}(i_t = i) \equiv \frac{1}{n}, \qquad \text{iteration complexity: } \tilde{O}\!\left( n + \frac{\max_i \|A_i\|^2}{\lambda \gamma} \right)$$
Importance sampling (Iprox-SDCA: [Zhao & Zhang 15], ...):
$$p_i = \mathrm{Prob}(i_t = i) \propto \|A_i\|^2 + \lambda \gamma n, \qquad \text{iteration complexity: } \tilde{O}\!\left( n + \frac{\frac{1}{n} \sum_{i=1}^n \|A_i\|^2}{\lambda \gamma} \right)$$
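
A short sketch computing the importance-sampling probabilities and the leading terms of the two complexity bounds on an illustrative data set with heterogeneous column norms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n)) * rng.uniform(0.2, 3.0, size=n)  # heterogeneous norms
lam, gamma = 1.0 / n, 1.0

sqnorms = np.sum(A * A, axis=0)                    # ||A_i||^2

p_uniform = np.full(n, 1.0 / n)
p_importance = sqnorms + lam * gamma * n           # p_i propto ||A_i||^2 + lam*gamma*n
p_importance = p_importance / p_importance.sum()

# Leading terms of the two iteration-complexity bounds (up to log factors):
print("uniform    : n + max_i ||A_i||^2 / (lam*gamma) =", n + sqnorms.max() / (lam * gamma))
print("importance : n + avg_i ||A_i||^2 / (lam*gamma) =", n + sqnorms.mean() / (lam * gamma))
```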

51-53 Adaptive Sampling
Each dual variable has a natural measure of progress, called the dual residue:
$$\kappa_i^t := \alpha_i^t + \phi_i'(A_i^\top w^t), \quad i = 1, \ldots, n.$$
Optimality conditions:
$$\text{OPT1}: \; w^* = \nabla g^*\!\left( \tfrac{1}{\lambda n} A \alpha^* \right), \qquad \text{OPT2}: \; \alpha_i^* = -\phi_i'(A_i^\top w^*), \; i \in [n].$$
A sampling distribution $p$ is coherent with $\kappa^t$ if for all $i \in [n]$: $\kappa_i^t \ne 0 \Rightarrow p_i > 0$.
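
For the squared loss ($\phi_i'(a) = a - b_i$) the dual residue is a one-line computation; the sketch below (illustrative data) also checks that a distribution proportional to $|\kappa|$ is coherent with $\kappa$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))
b = rng.standard_normal(n)
lam = 1.0 / n

alpha = rng.standard_normal(n)
w = A @ alpha / (lam * n)                 # w^t = grad g^*(A alpha^t / (lam n))

# Dual residue for the squared loss phi_i(a) = 0.5*(a - b_i)^2, phi_i'(a) = a - b_i:
kappa = alpha + (A.T @ w - b)             # kappa_i^t = alpha_i^t + phi_i'(A_i^T w^t)

# p is coherent with kappa if p_i > 0 whenever kappa_i != 0:
p = np.abs(kappa) / np.abs(kappa).sum()
print("coherent:", np.all((kappa == 0) | (p > 0)))
```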

54-55 Adaptive Stochastic Dual Coordinate Ascent
Primal solution: for $t \ge 0$: $\; w^t = \nabla g^*\!\left( \frac{1}{\lambda n} A \alpha^t \right)$
Dual solution: for $t \ge 0$:
1. $\alpha^{t+1} = \alpha^t$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to a distribution $p^t$ coherent with the dual residue $\kappa^t$ (instead of a fixed distribution $p$ as in SDCA);
3. Update $\alpha^{t+1}_{i_t}$:
$$\alpha^{t+1}_{i_t} = \arg\max_{\beta \in \mathbb{R}} \left\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^t)\,\beta - \frac{\|A_{i_t}\|^2}{2\lambda n} |\beta - \alpha^t_{i_t}|^2 \right\}$$

56 Convergence Theorem
Theorem (AdaSDCA)
Consider AdaSDCA. If at each iteration $t \ge 0$,
$$\theta(\kappa^t, p^t) := \frac{n \lambda \gamma \sum_i |\kappa_i^t|^2}{\sum_{i: \kappa_i^t \ne 0} (p_i^t)^{-1} |\kappa_i^t|^2 (\|A_i\|^2 + n \lambda \gamma)} \ge \min_{i: \kappa_i^t \ne 0} p_i^t,$$
then
$$\mathbb{E}[P(w^t) - D(\alpha^t)] \le \frac{1}{\tilde{\theta}^t} \prod_{k=0}^{t} (1 - \tilde{\theta}^k) \left( D(\alpha^*) - D(\alpha^0) \right)$$
for all $t \ge 0$, where $\tilde{\theta}^t := \dfrac{\mathbb{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))]}{\mathbb{E}[P(w^t) - D(\alpha^t)]}$.

57-60 Optimal Adaptive Sampling Probability
$$p^*(\kappa^t) = \arg\max_p \; \theta(\kappa^t, p) \quad \text{s.t.} \quad p \in \mathbb{R}^n_+, \; \textstyle\sum_i p_i = 1, \; p \text{ coherent with } \kappa^t, \; \theta(\kappa^t, p) \ge \min_{i: \kappa_i^t \ne 0} p_i$$
Relaxation:
$$p^*(\kappa^t) = \arg\max_p \; \frac{n \lambda \gamma \sum_i |\kappa_i^t|^2}{\sum_{i: \kappa_i^t \ne 0} p_i^{-1} |\kappa_i^t|^2 (\|A_i\|^2 + n \lambda \gamma)} \quad \text{s.t.} \quad p \in \mathbb{R}^n_+, \; \textstyle\sum_{i=1}^n p_i = 1$$
Solution:
$$(p^*(\kappa^t))_i \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}, \quad i \in [n].$$

61-63 Exact Relaxation for Squared Loss
Theorem (AdaSDCA for squared loss)
Consider AdaSDCA. If all the loss functions $\{\phi_i\}$ are squared loss functions, then
$$\mathbb{E}[P(w^t) - D(\alpha^t)] \le \frac{1}{\tilde{\theta}^t} \prod_{k=0}^{t} (1 - \tilde{\theta}^k) \left( D(\alpha^*) - D(\alpha^0) \right)$$
for all $t \ge 0$, where $\tilde{\theta}^t := \dfrac{\mathbb{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))]}{\mathbb{E}[P(w^t) - D(\alpha^t)]}$.
The optimal adaptive sampling probability is given by:
$$(p^*(\kappa^t))_i \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}, \quad i \in [n].$$

64 AdaSDCA
Dual solution: for $t \ge 1$:
1. Compute the dual residue $\kappa^t$: $\kappa_i^t = \alpha_i^t + \phi_i'(A_i^\top w^t)$, and set $p_i^t \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ with probability proportional to $p^t$;
3. Update $\alpha^t_{i_t}$:
$$\alpha^t_{i_t} = \arg\max_{\beta \in \mathbb{R}} \left\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^{t-1})\,\beta - \frac{\|A_{i_t}\|^2}{2\lambda n} |\beta - \alpha^{t-1}_{i_t}|^2 \right\}$$
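
A minimal AdaSDCA sketch for the squared-loss case, reusing the closed-form coordinate update from the SDCA sketch above and the sampling probability $p_i^t \propto |\kappa_i^t|\sqrt{\|A_i\|^2 + n\lambda\gamma}$. Recomputing $\kappa^t$ at every step is what makes one epoch cost $O(n \cdot \mathrm{nnz})$, as recorded in the cost table two slides below. All concrete choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((d, n))
b = rng.standard_normal(n)
lam, gamma = 1.0 / n, 1.0                       # squared loss is 1-smooth (gamma = 1)
sqnorms = np.sum(A * A, axis=0)

alpha = np.zeros(n)
w = A @ alpha / (lam * n)
for t in range(20 * n):
    kappa = alpha + (A.T @ w - b)               # dual residue, phi_i'(a) = a - b_i
    score = np.abs(kappa) * np.sqrt(sqnorms + n * lam * gamma)
    if score.sum() == 0:                        # all residues zero: alpha is optimal
        break
    p = score / score.sum()                     # p_i^t propto |kappa_i^t| sqrt(||A_i||^2 + n*lam*gamma)
    i = rng.choice(n, p=p)
    s = sqnorms[i] / (2 * lam * n)
    beta = (b[i] - A[:, i] @ w + 2 * s * alpha[i]) / (1 + 2 * s)   # same closed form as SDCA
    w += A[:, i] * (beta - alpha[i]) / (lam * n)
    alpha[i] = beta

print("max |dual residue|:", np.abs(alpha + (A.T @ w - b)).max())
```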

65 Heuristic and Efficient Variant of AdaSDCA
AdaSDCA+: for $t \ge 1$:
1. If $\mathrm{mod}(t, n) = 0$, then
   Option I (Adaptive Sampling Probability): compute the dual residue $\kappa^t$: $\kappa_i^t = \alpha_i^t + \phi_i'(A_i^\top w^t)$, and set $p_i^t \propto |\kappa_i^t| \sqrt{\|A_i\|^2 + n \lambda \gamma}$;
   Option II (Importance Sampling Probability): set $p_i^t \propto \|A_i\|^2 + n \lambda \gamma$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to $p^t$;
3. Update $\alpha^t_{i_t}$;
4. Update probability: $p^{t+1} = (p_1^t, \ldots, p_{i_t}^t / m, \ldots, p_n^t)$.
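
A sketch of the AdaSDCA+ heuristic, again for the squared loss: probabilities are recomputed only once per epoch (Option I shown) and the picked coordinate's probability is divided by $m$ after each step. The renormalization after the shrink and the small epsilon are my additions so that sampling always sees a proper distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m_const = 100, 20, 2.0                        # m_const plays the role of the constant m
A = rng.standard_normal((d, n))
b = rng.standard_normal(n)
lam, gamma = 1.0 / n, 1.0
sqnorms = np.sum(A * A, axis=0)

alpha = np.zeros(n)
w = A @ alpha / (lam * n)
p = np.ones(n) / n
for t in range(1, 20 * n + 1):
    if t % n == 0:                                  # recompute the probabilities once per epoch
        kappa = alpha + (A.T @ w - b)               # Option I: adaptive probabilities
        p = np.abs(kappa) * np.sqrt(sqnorms + n * lam * gamma) + 1e-12
        p = p / p.sum()
    i = rng.choice(n, p=p)
    s = sqnorms[i] / (2 * lam * n)
    beta = (b[i] - A[:, i] @ w + 2 * s * alpha[i]) / (1 + 2 * s)
    w += A[:, i] * (beta - alpha[i]) / (lam * n)
    alpha[i] = beta
    p[i] /= m_const                                 # step 4: shrink the picked coordinate's probability
    p = p / p.sum()                                 # renormalize (my addition)

print("max |dual residue|:", np.abs(alpha + (A.T @ w - b)).max())
```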

66 Computational Cost per Epoch
Algorithm   | Cost of one epoch
SDCA        | O(nnz)
Iprox-SDCA  | O(nnz + n log n)
AdaSDCA     | O(n · nnz)
AdaSDCA+    | O(nnz + n log n)
Table 1: One-epoch computational cost of different algorithms.

67 Numerical Experiments Figure 1: w8a dataset, d = 300, n = 49749. Quadratic loss with L2 regularizer, λ = 1/n, γ = 1.

68 Numerical Experiments Figure 2: w8a dataset, d = 300, n = 49749. Quadratic loss with L2 regularizer, λ = 1/n, γ = 1.

69 Numerical Experiments Figure 3: cov1 dataset, d = 54, n = 581,012. Smooth Hinge loss with L2 regularizer, λ = 1/n, γ = 1.

70 Numerical Experiments Figure 4: cov1 dataset, d = 54, n = 581,012. Smooth Hinge loss with L2 regularizer, λ = 1/n, γ = 1.

71 Numerical Experiments Figure 5: cov1 dataset, d = 54, n = 581,012. Smooth Hinge loss with L2 regularizer, λ = 1/n, γ = 1. Comparison of different choices of the constant m.

72 More on ESO Q. and Richtarik. Coordinate descent with arbitrary sampling II: expected separable overapproximation, Optimization methods and software, 2016.

73-77 ESO
The function $f$ admits an expected separable overapproximation (ESO) with respect to $\hat{S}$ and $v \in \mathbb{R}^n_+$, denoted $(f, \hat{S}) \sim \mathrm{ESO}(v)$, if
$$\mathbb{E}[f(x + h_{[\hat{S}]})] \le f(x) + \langle \nabla f(x), h \rangle_p + \tfrac{1}{2} \|h\|^2_{v \circ p}, \quad \forall x, h \in \mathbb{R}^n.$$
Recall the smoothness assumption:
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2} \|Ah\|^2, \quad \forall x, h \in \mathbb{R}^n.$$
Hence $(f, \hat{S}) \sim \mathrm{ESO}(v)$ if
$$\mathbb{E}[\|A h_{[\hat{S}]}\|^2] = h^\top \mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}]\, h \le \|h\|^2_{v \circ p}, \quad \forall h \in \mathbb{R}^n.$$

78-82 Deriving Stepsize
Find $v \in \mathbb{R}^n_+$ such that
$$\mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] \preceq \operatorname{Diag}(p \circ v).$$
We have
$$\mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] = P \circ (A^\top A), \quad \text{where } P_{ij} = \mathbb{P}(i \in \hat{S}, j \in \hat{S}).$$
Let $A_1, \ldots, A_m$ denote the rows of $A$ (so $A^\top = (A_1^\top, \ldots, A_m^\top)$); then
$$P \circ (A^\top A) = \sum_{j=1}^m P \circ (A_j^\top A_j).$$
Denote $J_j := \{ i \in [n] : A_{ji} \ne 0 \}$; then
$$P \circ (A^\top A) = \sum_{j=1}^m P \circ (A_j^\top A_j) = \sum_{j=1}^m P_{[J_j]} \circ (A_j^\top A_j).$$

83 Deriving Stepsize
Theorem (ESO with coupling between sampling and data)
Let $\hat{S}$ be an arbitrary sampling and $v = (v_1, \ldots, v_n)$ be defined by:
$$v_i = \sum_{j=1}^m \lambda'(J_j, \hat{S})\, A_{ji}^2, \quad i = 1, 2, \ldots, n,$$
where
$$\lambda'(J, \hat{S}) := \max_{h \in \mathbb{R}^n} \{ h^\top P_{[J]} h : h^\top \operatorname{Diag}(P_{[J]})\, h \le 1 \}.$$
Then $(f, \hat{S}) \sim \mathrm{ESO}(v)$.
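
A numerical sketch of the theorem for the $\tau$-nice sampling: build the probability matrix $P$, compute $\lambda'(J_j, \hat{S})$ on each row support as a generalized eigenvalue problem, assemble $v$, and verify the matrix inequality $\mathbb{E}[I_{[\hat{S}]}A^\top A I_{[\hat{S}]}] = P \circ (A^\top A) \preceq \operatorname{Diag}(p \circ v)$ behind the ESO. Data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 12, 8, 3
A = rng.standard_normal((m, n))
A[rng.random((m, n)) < 0.5] = 0.0          # sparsify so the row supports J_j differ

# Probability matrix of the tau-nice sampling: P_ij = P(i in S_hat, j in S_hat)
p = tau / n
P = np.full((n, n), p * (tau - 1) / (n - 1))
np.fill_diagonal(P, p)

def lam_prime(J):
    """lambda'(J, S_hat) = max{ h^T P_[J] h : h^T Diag(P_[J]) h <= 1 }."""
    PJ = P[np.ix_(J, J)]
    D = np.diag(PJ)
    M = PJ / np.sqrt(np.outer(D, D))       # D^{-1/2} P_J D^{-1/2}
    return np.linalg.eigvalsh(M).max()

v = np.zeros(n)
for j in range(m):
    J = np.flatnonzero(A[j])               # J_j = {i in [n] : A_ji != 0}
    if J.size > 0:
        v[J] += lam_prime(J) * A[j, J] ** 2   # v_i = sum_j lambda'(J_j, S_hat) A_ji^2

# ESO check: P o (A^T A) <= Diag(p o v) in the semidefinite order
gap = np.diag(p * v) - P * (A.T @ A)
print("min eigenvalue of Diag(p*v) - P o (A^T A):", np.linalg.eigvalsh(gap).min())  # ~>= 0
```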

84-85 Deriving Stepsize
Tight bounds for:
serial sampling: $\lambda'(J, \hat{S}) = 1$;
uniform distribution over subsets of fixed size $\tau$ (aka $\tau$-nice sampling) ([Richtarik & Takac 13]):
$$\lambda'(J, \hat{S}) = 1 + \frac{(|J| - 1)(\tau - 1)}{\max(n - 1,\, 1)}.$$

86 Deriving Stepsize
Bound for distributed sampling, with the data equally partitioned over $c$ processors, each of which independently draws a $\tau$-nice sampling from its own partition ([Fercoq, Q., Richtarik & Takac 14]):
$$\lambda'(J, \hat{S}) \le \frac{\tau}{\tau - 1} \left( 1 + \frac{(|J| - 1)(\tau - 1)}{\max(n/c - 1,\, 1)} \right).$$

87 Conclusion
Unified convergence analysis for: randomized coordinate descent, accelerated randomized coordinate descent, arbitrary sampling.
Convergence condition (ESO) + formulae for computing admissible stepsizes.
Adaptive sampling using the duality gap.
