Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
1 Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity. Zheng Qu, University of Hong Kong. CAM, Aug 2016, Hong Kong. Based on joint work with Peter Richtarik and Dominik Csiba (University of Edinburgh).
2 Outline: First-order methods for composite convex optimization; Randomized coordinate descent method; Adaptive sampling; Expected separable overapproximation.
3 Problem and Motivation
4 Problem Setup
$$\min_{x \in \mathbb{R}^n} \big[ F(x) := f(x) + \psi(x) \big]$$
$f : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth: $f(x+h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2$ for all $x, h \in \mathbb{R}^n$.
$\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, closed and separable: $\psi(x) = \sum_{i=1}^n \psi_i(x_i)$.
5 Motivation: Empirical Risk Minimization
ERM: $\min_{w \in \mathbb{R}^d} \big[ P(w) \overset{\text{def}}{=} \tfrac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \big]$
Supervised learning, image processing, ...; train a linear predictor $w \in \mathbb{R}^d$; $n$ training samples $A_1, \ldots, A_n \in \mathbb{R}^d$;
convex loss functions $\phi_i : \mathbb{R} \to \mathbb{R}$, e.g. squared loss ($\phi_i(a) = \tfrac{1}{2}(a - b_i)^2$), logistic loss ($\phi_i(a) = \log(1 + e^{-a})$), ...;
convex regularizer $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$, e.g. $L_1$ regularization ($g(w) = \|w\|_1$), $L_2$ regularization ($g(w) = \tfrac{1}{2}\|w\|_2^2$), ...
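To make the loss and regularizer choices concrete, here is a minimal Python sketch (the function names and the use of NumPy are assumptions of this illustration, not from the slides) of the two losses and two regularizers together with their derivatives.

```python
import numpy as np

# Illustrative sketch of the losses phi_i and regularizers g from the ERM setup.
def squared_loss(a, b_i):
    # phi_i(a) = 1/2 (a - b_i)^2, derivative a - b_i
    return 0.5 * (a - b_i) ** 2, a - b_i

def logistic_loss(a):
    # phi_i(a) = log(1 + exp(-a)), derivative -1 / (1 + exp(a))
    return np.log1p(np.exp(-a)), -1.0 / (1.0 + np.exp(a))

def l1_reg(w):
    # g(w) = ||w||_1 (non-smooth; handled through its proximal operator)
    return np.abs(w).sum()

def l2_reg(w):
    # g(w) = 1/2 ||w||_2^2, gradient w
    return 0.5 * w @ w, w
```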
8 Primal-Dual Formulation
ERM: $\min_{w \in \mathbb{R}^d} \big[ P(w) \overset{\text{def}}{=} \tfrac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \big]$
Dual problem of ERM:
$$\max_{\alpha \in \mathbb{R}^n} D(\alpha) \overset{\text{def}}{=} \underbrace{-\lambda g^*\Big(\tfrac{1}{\lambda n}\textstyle\sum_{i=1}^n A_i \alpha_i\Big)}_{\text{smooth if } g \text{ strongly convex}} \underbrace{- \tfrac{1}{n}\textstyle\sum_{i=1}^n \phi_i^*(-\alpha_i)}_{\text{separable}}$$
Optimality conditions: OPT1: $w^* = \nabla g^*\big(\tfrac{1}{\lambda n} A\alpha^*\big)$; OPT2: $\alpha_i^* = -\nabla \phi_i(A_i^\top w^*)$, $i = 1, \ldots, n$.
10 First-order methods for non-strongly convex composite optimization
11 Problem Setup
$$\min_{x \in \mathbb{R}^n} \big[ F(x) := f(x) + \psi(x) \big]$$
$f : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth: $f(x+h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2$ for all $x, h \in \mathbb{R}^n$.
$\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, closed and separable: $\psi(x) = \sum_{i=1}^n \psi_i(x_i)$.
12 Proximal Gradient
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}\psi$
3: for $k \geq 0$ do
4:   for $i \in [n]$ do
5:     $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \big\{ \langle \nabla_i f(x_k), x \rangle + \tfrac{v_i}{2}|x - x_k^i|^2 + \psi_i(x) \big\}$
6:   end for
7: end for
Assumption: the proximal operator of each $\psi_i$ is easily computable. Also known as the explicit-implicit / forward-backward splitting method [Lions & Mercier 79], [Eckstein & Bertsekas 89].
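The listing above translates almost line by line into code. Below is a minimal Python/NumPy sketch of the proximal gradient method for $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ with $\psi_i(x_i) = \mu |x_i|$, so the per-coordinate subproblem is soft-thresholding; the function names, the quadratic choice of $f$ and the L1 choice of $\psi$ are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(u, t):
    # prox of t*|.| : argmin_x 1/2 (x - u)^2 + t |x|
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(A, b, mu, iters=100):
    n = A.shape[1]
    v = np.full(n, np.linalg.norm(A, 2) ** 2)   # v_i = L >= lambda_max(A^T A)
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                # gradient of f(x) = 1/2 ||Ax - b||^2
        # per-coordinate model: <grad_i, x> + v_i/2 (x - x_k^i)^2 + mu |x|
        x = soft_threshold(x - grad / v, mu / v)
        # all coordinates are updated in every iteration (the full prox-gradient step)
    return x
```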
15 Accelerated Proximal Gradient
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}(\psi)$, set $z_0 = x_0$ and $\theta_0 = 1$
3: for $k \geq 0$ do
4:   for $i \in [n]$ do
5:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \big\{ \langle \nabla_i f((1-\theta_k)x_k + \theta_k z_k), z \rangle + \tfrac{\theta_k v_i}{2}|z - z_k^i|^2 + \psi_i(z) \big\}$
6:   end for
7:   $x_{k+1} = (1-\theta_k)x_k + \theta_k z_{k+1}$
8:   $\theta_{k+1} = \tfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
9: end for
[Nesterov 83, 04], [Beck & Teboulle 08] (FISTA), [Tseng 08], [Su, Boyd & Candès 14], [Chambolle & Pock 15], [Chambolle & Dossal 15], [Attouch & Peypouquet 15]
17 Convergence Analysis
Theorem. If $v_i = L$ for all $i \in [n]$ with $L \geq \lambda_{\max}(A^\top A)$, then the iterates $\{x_k\}$ of the proximal gradient method satisfy: $F(x_k) - F(x^*) \leq \tfrac{L\|x_0 - x^*\|^2}{2k}$ for all $k \geq 1$.
Theorem (Tseng 08). If $v_i = L$ for all $i \in [n]$ with $L \geq \lambda_{\max}(A^\top A)$, then the iterates $\{x_k\}$ of the accelerated proximal gradient algorithm satisfy: $F(x_k) - F(x^*) \leq \tfrac{2L\|x_0 - x^*\|^2}{(k+1)^2}$ for all $k \geq 1$.
18 Randomized Coordinate Descent
1: Parameters: vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}\psi$
3: for $k \geq 0$ do
4:   Generate a random $i \in [n]$ uniformly
5:   $x_{k+1} \leftarrow x_k$
6:   $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \big\{ \langle \nabla_i f(x_k), x \rangle + \tfrac{v_i}{2}|x - x_k^i|^2 + \psi_i(x) \big\}$
7: end for
$v = \operatorname{Diag}(A^\top A)$ [Nesterov 10], [Shalev-Shwartz & Tewari 11], [Richtarik & Takac 11].
Other variants [Wright 15]: cyclic (Gauss-Seidel) [Canutescu & Dunbrack 03]; greedy [Wu & Lange 08], [Nutini et al. 15].
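A minimal sketch of one possible implementation of the serial randomized coordinate descent step, again for $f(x)=\tfrac{1}{2}\|Ax-b\|^2$ and $\psi_i = \mu|\cdot|$, with $v = \operatorname{Diag}(A^\top A)$ as on the slide; keeping the residual $r = Ax - b$ up to date makes each step cost on the order of one column of $A$ (the residual trick is an assumption of this sketch, not stated on the slide).

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def randomized_cd(A, b, mu, iters=10000, rng=np.random.default_rng(0)):
    n = A.shape[1]
    v = (A ** 2).sum(axis=0)            # v_i = ||A_{:,i}||^2 = (Diag(A^T A))_i
    x = np.zeros(n)
    r = A @ x - b                       # residual, kept up to date coordinate-wise
    for _ in range(iters):
        i = rng.integers(n)             # uniformly chosen coordinate
        g_i = A[:, i] @ r               # i-th partial derivative of f
        x_new = soft_threshold(x[i] - g_i / v[i], mu / v[i])
        r += A[:, i] * (x_new - x[i])   # update residual for the single changed coordinate
        x[i] = x_new
    return x
```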
21 Parallel Randomized Coordinate Descent
1: Parameters: $\tau \in [n]$, vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}\psi$
3: for $k \geq 0$ do
4:   Generate a random subset $S_k \subseteq [n]$ of size $\tau$ uniformly
5:   $x_{k+1} \leftarrow x_k$
6:   for $i \in S_k$ do
7:     $x_{k+1}^i = \arg\min_{x \in \mathbb{R}} \big\{ \langle \nabla_i f(x_k), x \rangle + \tfrac{v_i}{2}|x - x_k^i|^2 + \psi_i(x) \big\}$
8:   end for
9: end for
$v = \Big(1 + \tfrac{(\tau-1)(\omega-1)}{\max(n-1,1)}\Big)\operatorname{Diag}(A^\top A)$ [Richtarik & Takac 13], where $\omega$ is the maximal number of nonzero elements in each row of $A$.
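The only new ingredient relative to the serial method is the step-size vector $v$; a small sketch of how one might compute it from a dense NumPy data matrix (names assumed) is shown below.

```python
import numpy as np

def pcd_stepsizes(A, tau):
    """v = (1 + (tau-1)(omega-1)/max(n-1,1)) * Diag(A^T A), as in [Richtarik & Takac 13]."""
    m, n = A.shape
    omega = int((A != 0).sum(axis=1).max())      # max number of nonzeros per row of A
    beta = 1.0 + (tau - 1) * (omega - 1) / max(n - 1, 1)
    return beta * (A ** 2).sum(axis=0)           # elementwise: beta * ||A_{:,i}||^2
```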
23 Convergence Analysis
Theorem (Richtarik & Takac 13). Define the level-set distance $\mathcal{R}_v(x_0, x^*) \overset{\text{def}}{=} \max_x \{ \|x - x^*\|_v^2 : F(x) \leq F(x_0) \}$. Under the assumption $\mathcal{R}_v(x_0, x^*) < +\infty$, we have:
$$\mathbb{E}[F(x_k)] - F(x^*) \leq \frac{2n \max\{\mathcal{R}_v(x_0, x^*),\, F(x_0) - F(x^*)\}}{2n \max\{\mathcal{R}_v(x_0, x^*)/(F(x_0) - F(x^*)),\, 1\} + \tau k}.$$
24 Accelerated Parallel Proximal Coordinate Descent
1: Parameters: $\tau \in [n]$, vector $v \in \mathbb{R}^n_{++}$
2: Initialization: choose $x_0 \in \operatorname{dom}(\psi)$, set $z_0 = x_0$ and $\theta_0 = \tau/n$
3: for $k \geq 0$ do
4:   $y_k = (1-\theta_k)x_k + \theta_k z_k$
5:   Generate a random subset $S_k \subseteq [n]$ of size $\tau$ uniformly
6:   $z_{k+1} \leftarrow z_k$
7:   for $i \in S_k$ do
8:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \big\{ \langle \nabla_i f(y_k), z \rangle + \tfrac{\theta_k v_i n}{2\tau}|z - z_k^i|^2 + \psi_i(z) \big\}$
9:   end for
10:  $x_{k+1} = y_k + \tfrac{\theta_k n}{\tau}(z_{k+1} - z_k)$
11:  $\theta_{k+1} = \tfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
12: end for
[Nesterov 10], [Lee & Sidford 13], [Fercoq & Richtarik 13], ...
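Below is a compact Python sketch of this accelerated parallel loop for quadratic $f$ with an L1 term, with the $\tau$-nice sampling drawn via NumPy; the step-size vector $v$ can be taken from the sketch above, and all helper names and the specific choice of $f$ and $\psi$ are assumptions of this illustration (a practical implementation would also maintain residuals rather than recompute the full gradient).

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def accelerated_parallel_cd(A, b, mu, tau, v, iters=1000, rng=np.random.default_rng(0)):
    n = A.shape[1]
    x = np.zeros(n); z = x.copy()
    theta = tau / n
    for _ in range(iters):
        y = (1 - theta) * x + theta * z
        S = rng.choice(n, size=tau, replace=False)        # tau-nice sampling
        grad_S = A[:, S].T @ (A @ y - b)                  # nabla_i f(y) for i in S
        step = tau / (theta * v[S] * n)                   # 1 / (theta * v_i * n / tau)
        z_new_S = soft_threshold(z[S] - step * grad_S, mu * step)
        x = y.copy()
        x[S] += (theta * n / tau) * (z_new_S - z[S])      # x_{k+1} = y_k + theta*n/tau*(z_{k+1}-z_k)
        z[S] = z_new_S
        theta = 0.5 * (np.sqrt(theta**4 + 4 * theta**2) - theta**2)
    return x
```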
25 Convergence Analysis
Theorem (Fercoq & Richtarik 13). Choose $v_i = \sum_{j=1}^m \Big(1 + \tfrac{(\tau-1)(\omega_j - 1)}{\max(n-1,1)}\Big) A_{ji}^2$, $i = 1, 2, \ldots, n$. Then the iterates $\{x_k\}$ of APPROX satisfy, for all $k \geq 1$:
$$\mathbb{E}[F(x_k) - F(x^*)] \leq \frac{4\Big[\big(1 - \tfrac{\tau}{n}\big)(F(x_0) - F(x^*)) + \tfrac{1}{2}\|x_0 - x^*\|_v^2\Big]}{((k-1)\tau/n + 2)^2}.$$
26 Summary
Parallel Coordinate Descent (choose a subset of size $\tau$ uniformly): $\tau = 1$ gives Randomized Coordinate Descent; $\tau = n$ gives Proximal Gradient.
Accelerated Parallel Proximal Coordinate Descent (choose a subset of size $\tau$ uniformly): $\tau = n$ gives Accelerated Proximal Gradient.
32 Randomized coordinate descent method with arbitrary sampling Q. and Richtarik. Coordinate descent with arbitrary sampling I: algorithms and complexity, Optimization methods and software, 2016.
33 Sampling
A sampling is a set-valued random variable $\hat{S} \subseteq \{1, \ldots, n\}$.
Probability vector: $p_i = \mathbb{P}(i \in \hat{S})$, $i \in \{1, \ldots, n\}$.
Proper sampling: $p_i = \mathbb{P}(i \in \hat{S}) > 0$ for all $i \in \{1, \ldots, n\}$.
Serial sampling: $\mathbb{P}(|\hat{S}| = 1) = 1$.
Uniform sampling: $p_1 = \cdots = p_n = \mathbb{E}[|\hat{S}|]/n$.
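For concreteness, here is a small Python sketch (assumed, not from the slides) of two samplings and an empirical check of the probability vector $p_i = \mathbb{P}(i \in \hat S)$ for the $\tau$-nice one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 10, 3

def serial_sampling(p):
    # serial sampling: |S| = 1 with probability 1, P(S = {i}) = p_i
    return {rng.choice(len(p), p=p)}

def tau_nice_sampling():
    # uniform over subsets of size tau; p_i = E|S|/n = tau/n (a uniform sampling)
    return set(rng.choice(n, size=tau, replace=False))

# empirical probability vector of the tau-nice sampling
counts = np.zeros(n)
trials = 20000
for _ in range(trials):
    for i in tau_nice_sampling():
        counts[i] += 1
print(counts / trials)   # each entry should be close to tau/n = 0.3
```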
34 Algorithm
1: Parameters: proper sampling $\hat{S}$ with probability vector $p = (p_1, \ldots, p_n) \in [0,1]^n$, $v \in \mathbb{R}^n_{++}$, sequence $\{\theta_k\}_{k \geq 0} \subseteq (0,1]$
2: Initialization: choose $x_0 \in \operatorname{dom}\psi$ and set $z_0 = x_0$
3: for $k \geq 0$ do
4:   $y_k = (1-\theta_k)x_k + \theta_k z_k$
5:   Generate a random set of blocks $S_k \sim \hat{S}$
6:   $z_{k+1} \leftarrow z_k$
7:   for $i \in S_k$ do
8:     $z_{k+1}^i = \arg\min_{z \in \mathbb{R}} \big\{ \langle \nabla_i f(y_k), z \rangle + \tfrac{\theta_k v_i}{2 p_i}|z - z_k^i|^2 + \psi_i(z) \big\}$
9:   end for
10:  $x_{k+1} = y_k + \theta_k\, p^{-1} \circ (z_{k+1} - z_k)$
11: end for
35 Efficient Implementation
1: Parameters: proper sampling $\hat{S}$ with probability vector $p = (p_1, \ldots, p_n)$, $v \in \mathbb{R}^n_{++}$, sequence $\{\theta_k\}_{k \geq 0}$
2: Initialization: choose $x_0 \in \operatorname{dom}\psi$, set $z_0 = x_0$, $u_0 = 0$ and $\alpha_0 = 1$
3: for $k \geq 0$ do
4:   Generate a random set of coordinates $S_k \sim \hat{S}$
5:   $z_{k+1} \leftarrow z_k$, $u_{k+1} \leftarrow u_k$
6:   for $i \in S_k$ do
7:     $\Delta_k^i = \arg\min_{t \in \mathbb{R}} \big\{ t\, \nabla_i f(\alpha_k u_k + z_k) + \tfrac{\theta_k v_i}{2 p_i} t^2 + \psi_i(z_k^i + t) \big\}$
8:     $z_{k+1}^i \leftarrow z_k^i + \Delta_k^i$
9:     $u_{k+1}^i \leftarrow u_k^i - \alpha_k^{-1}(1 - \theta_k p_i^{-1})\, \Delta_k^i$
10:  end for
11:  $\alpha_{k+1} = (1 - \theta_{k+1})\,\alpha_k$
12: end for
13: OUTPUT: $x_{k+1} = z_k + \alpha_k u_k + \theta_k\, p^{-1} \circ (z_{k+1} - z_k)$
36 Convergence Analysis
Lemma. Let $\hat{S}$ be an arbitrary proper sampling and $v \in \mathbb{R}^n_{++}$ be such that
$$\mathbb{E}[f(x + h_{[\hat{S}]})] \leq f(x) + \langle \nabla f(x), h \rangle_p + \tfrac{1}{2}\|h\|^2_{v \circ p}, \quad \forall x, h \in \mathbb{R}^n.$$
Let $\{\theta_k\}_{k \geq 0}$ be an arbitrary sequence of positive numbers in $(0,1]$. Then for the sequence of iterates produced by the algorithm and all $k \geq 0$, the following recursion holds:
$$\mathbb{E}_k\Big[\hat{F}_{k+1} + \tfrac{\theta_k^2}{2}\|z_{k+1} - x^*\|^2_{v \circ p^{-2}}\Big] \leq \hat{F}_k + \tfrac{\theta_k^2}{2}\|z_k - x^*\|^2_{v \circ p^{-2}} - \theta_k(\hat{F}_k - F^*).$$
Moreover, $\hat{F}_k \geq F(x_k)$ if $\psi \equiv 0$ or $\theta_k \leq \min_i p_i$.
38 Convergence Results
Assume $(f, \hat{S}) \sim \mathrm{ESO}(v)$ and that $\psi \equiv 0$ or $\theta_k \leq \min_i p_i$, and let $C = (1-\theta_0)(F(x_0) - F^*) + \tfrac{\theta_0^2}{2}\|x_0 - x^*\|^2_{v \circ p^{-2}}$.
With constant $\theta_k \equiv \theta_0$:
$$\mathbb{E}\Big[F\Big(\tfrac{x_k + \theta_0\sum_{t=1}^{k-1}x_t}{1+(k-1)\theta_0}\Big)\Big] - F^* \leq \frac{C}{1+(k-1)\theta_0}.$$
With $\theta_{k+1} = \tfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$:
$$\mathbb{E}[F(x_k)] - F^* \leq \frac{4C}{((k-1)\theta_0 + 2)^2}.$$
39 Corollaries: Parallel Coordinate Descent
Corollary. The iterates $\{x_k\}$ of Parallel Coordinate Descent satisfy:
$$\mathbb{E}[F(x_k)] - F(x^*) \leq \frac{n}{(k-1)\tau + n}\Big[\big(1 - \tfrac{\tau}{n}\big)(F(x_0) - F(x^*)) + \tfrac{1}{2}\|x_0 - x^*\|_v^2\Big].$$
Compare with [Richtarik & Takac 13], which requires $\max_x\{\|x - x^*\|_v^2 : F(x) \leq F(x_0)\} < +\infty$.
[Lu & Xiao 14] ($\tau = 1$): $\mathbb{E}[F(x_k)] - F(x^*) \leq \frac{n}{n+k}\Big[(F(x_0) - F(x^*)) + \tfrac{1}{2}\|x_0 - x^*\|_v^2\Big]$.
40 Corollaries: Smooth Minimization
Corollary. If $\psi \equiv 0$, then the iterates $\{x_k\}$ of accelerated coordinate descent satisfy:
$$\mathbb{E}[f(x_k)] - f^* \leq \frac{2\|x_0 - x^*\|^2_{v \circ p^{-2}}}{(k+1)^2}, \quad \forall k \geq 1.$$
Define $L_i = A_i^\top A_i$ for $i = 1, \ldots, n$.
Corollary. If at each step we update coordinate $i$ with probability $p_i \propto \sqrt{L_i}$, then
$$\mathbb{E}[f(x_k)] - f^* \leq \frac{2\big(\sum_i \sqrt{L_i}\big)^2 \|x_0 - x^*\|^2}{(k+1)^2}, \quad \forall k \geq 1.$$
42 Corollaries: Smooth Minimization
Serial sampling $\hat{S}$, $v = L$:
$$\mathbb{E}[f(x_k)] - f^* \leq \frac{2\|x_0 - x^*\|^2_{L \circ p^{-2}}}{(k+1)^2}, \quad \forall k \geq 1.$$
The probability vector minimizing the right-hand side is:
$$p_i^* = \frac{(L_i\, |x_i^* - x_i^0|^2)^{1/3}}{\sum_{j=1}^n (L_j\, |x_j^* - x_j^0|^2)^{1/3}}, \quad i = 1, \ldots, n.$$
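The cube-root form of $p^*$ follows from a one-line Lagrangian computation, sketched below (this derivation is not spelled out on the slide; it assumes the bound above is minimized over the probability simplex).

```latex
% minimize  \sum_i L_i p_i^{-2} d_i^2  subject to  \sum_i p_i = 1,  with  d_i := |x_i^0 - x_i^*|
\frac{\partial}{\partial p_i}\Big(\sum_{j=1}^n L_j p_j^{-2} d_j^2 + \mu\big(\sum_{j=1}^n p_j - 1\big)\Big)
  = -2 L_i d_i^2 p_i^{-3} + \mu = 0
\quad\Longleftrightarrow\quad
p_i \propto (L_i d_i^2)^{1/3},
% and normalizing over the simplex gives the displayed formula for p_i^*.
```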
44 Stochastic dual coordinate ascent with adaptive sampling. Csiba, Q. and Richtarik. Stochastic dual coordinate ascent with adaptive sampling, International Conference on Machine Learning, 2015.
45 Primal-Dual Formulation
ERM: $\min_{w \in \mathbb{R}^d} \big[ P(w) \overset{\text{def}}{=} \tfrac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \big]$
Dual problem of ERM:
$$\max_{\alpha \in \mathbb{R}^n} D(\alpha) \overset{\text{def}}{=} \underbrace{-\lambda g^*\Big(\tfrac{1}{\lambda n}\textstyle\sum_{i=1}^n A_i \alpha_i\Big)}_{\text{smooth}} \underbrace{- \tfrac{1}{n}\textstyle\sum_{i=1}^n \phi_i^*(-\alpha_i)}_{\gamma\text{-strongly convex and separable}}$$
Optimality conditions: OPT1: $w^* = \nabla g^*\big(\tfrac{1}{\lambda n} A\alpha^*\big)$; OPT2: $\alpha_i^* = -\nabla \phi_i(A_i^\top w^*)$, $i = 1, \ldots, n$.
47 Stochastic Dual Coordinate Ascent
Dual update, for $t \geq 0$:
1. $\alpha^{t+1} = \alpha^t$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to a fixed distribution $p$;
3. Update $\alpha^{t+1}_{i_t} = \arg\max_{\beta \in \mathbb{R}} \Big\{ -\phi_{i_t}^*(-\beta) - (A_{i_t}^\top w^t)\beta - \tfrac{\|A_{i_t}\|^2}{2\lambda n}|\beta - \alpha^t_{i_t}|^2 \Big\}$.
Primal update, for $t \geq 0$: $w^t = \nabla g^*\big(\tfrac{1}{\lambda n} A\alpha^t\big)$.
49 Uniform and Importance Sampling
Uniform sampling (SDCA: [Shalev-Shwartz & Zhang 13], ...): $p_i = \mathrm{Prob}(i_t = i) \equiv \tfrac{1}{n}$. Iteration complexity: $\tilde{O}\Big(n + \tfrac{\max_i \|A_i\|^2}{\lambda\gamma}\Big)$.
Importance sampling (Iprox-SDCA: [Zhao & Zhang 15], ...): $p_i = \mathrm{Prob}(i_t = i) \propto \|A_i\|^2 + \lambda\gamma n$. Iteration complexity: $\tilde{O}\Big(n + \tfrac{\frac{1}{n}\sum_{i=1}^n \|A_i\|^2}{\lambda\gamma}\Big)$.
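A tiny Python sketch (names assumed) of the two static probability vectors, taking the columns of A to be the samples $A_i$:

```python
import numpy as np

def uniform_probs(n):
    return np.full(n, 1.0 / n)

def importance_probs(A, lam, gamma):
    # p_i proportional to ||A_i||^2 + lambda * gamma * n, as in Iprox-SDCA [Zhao & Zhang 15]
    n = A.shape[1]
    q = (A ** 2).sum(axis=0) + lam * gamma * n
    return q / q.sum()
```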
51 Adaptive Sampling
Each dual variable has a natural measure of progress, called the dual residue:
$$\kappa^t_i \overset{\text{def}}{=} \alpha^t_i + \nabla\phi_i(A_i^\top w^t), \quad i = 1, \ldots, n.$$
Optimality conditions: OPT1: $w^* = \nabla g^*\big(\tfrac{1}{\lambda n}A\alpha^*\big)$; OPT2: $\alpha^*_i = -\nabla\phi_i(A_i^\top w^*)$, $i \in [n]$.
A sampling distribution $p$ is coherent with $\kappa^t$ if for all $i \in [n]$: $\kappa^t_i \neq 0 \Rightarrow p_i > 0$.
55 Adaptive Stochastic Dual Coordinate Ascent
Same as SDCA, except that the sampling distribution changes over time. For $t \geq 0$:
1. $\alpha^{t+1} = \alpha^t$; $w^t = \nabla g^*\big(\tfrac{1}{\lambda n}A\alpha^t\big)$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to a distribution $p^t$ coherent with the dual residue $\kappa^t$;
3. Update $\alpha^{t+1}_{i_t} = \arg\max_{\beta \in \mathbb{R}}\Big\{-\phi^*_{i_t}(-\beta) - (A_{i_t}^\top w^t)\beta - \tfrac{\|A_{i_t}\|^2}{2\lambda n}|\beta - \alpha^t_{i_t}|^2\Big\}$.
56 Convergence Theorem
Theorem (AdaSDCA). Consider AdaSDCA. If at each iteration $t \geq 0$,
$$\theta(\kappa^t, p^t) \overset{\text{def}}{=} \frac{n\lambda\gamma \sum_i |\kappa^t_i|^2}{\sum_{i:\, \kappa^t_i \neq 0} (p^t_i)^{-1} |\kappa^t_i|^2 (\|A_i\|^2 + n\lambda\gamma)} \;\geq\; \min_{i:\, \kappa^t_i \neq 0} p^t_i,$$
then
$$\mathbb{E}[P(w^t) - D(\alpha^t)] \leq \frac{1}{\tilde{\theta}^t}\prod_{k=0}^{t}(1 - \tilde{\theta}^k)\big(D(\alpha^*) - D(\alpha^0)\big),$$
for all $t \geq 0$, where $\tilde{\theta}^t \overset{\text{def}}{=} \dfrac{\mathbb{E}[\theta(\kappa^t, p^t)(P(w^t) - D(\alpha^t))]}{\mathbb{E}[P(w^t) - D(\alpha^t)]}$.
57 Optimal Adaptive Sampling Probability
$$p^*(\kappa^t) = \arg\max_p\ \theta(\kappa^t, p) \quad \text{s.t.}\quad p \in \mathbb{R}^n_+,\ \textstyle\sum_i p_i = 1,\ p \text{ coherent with } \kappa^t,\ \theta(\kappa^t, p) \geq \min_{i:\kappa^t_i \neq 0} p_i.$$
Relaxation (drop the last two constraints):
$$p^*(\kappa^t) = \arg\max_p\ \frac{n\lambda\gamma \sum_i |\kappa^t_i|^2}{\sum_{i:\,\kappa^t_i \neq 0}(p_i)^{-1}|\kappa^t_i|^2(\|A_i\|^2 + n\lambda\gamma)} \quad \text{s.t.}\quad p \in \mathbb{R}^n_+,\ \textstyle\sum_{i=1}^n p_i = 1,$$
whose solution is $\big(p^*(\kappa^t)\big)_i \propto |\kappa^t_i|\sqrt{\|A_i\|^2 + n\lambda\gamma}$, $i \in [n]$.
61 Exact Relaxation for Squared Loss
Theorem (AdaSDCA for squared loss). Consider AdaSDCA. If all the loss functions $\{\phi_i\}$ are squared loss functions, then
$$\mathbb{E}[P(w^t) - D(\alpha^t)] \leq \frac{1}{\tilde{\theta}^t}\prod_{k=0}^{t}(1-\tilde{\theta}^k)\big(D(\alpha^*) - D(\alpha^0)\big),$$
for all $t \geq 0$, where $\tilde{\theta}^t \overset{\text{def}}{=} \dfrac{\mathbb{E}[\theta(\kappa^t,p^t)(P(w^t) - D(\alpha^t))]}{\mathbb{E}[P(w^t) - D(\alpha^t)]}$, and the optimal adaptive sampling probability is given by
$$\big(p^*(\kappa^t)\big)_i \propto |\kappa^t_i|\sqrt{\|A_i\|^2 + n\lambda\gamma}, \quad i \in [n].$$
64 AdaSDCA
For $t \geq 1$:
1. Compute the dual residue $\kappa^t$: $\kappa^t_i = \alpha^t_i + \nabla\phi_i(A_i^\top w^t)$, and set $p^t_i \propto |\kappa^t_i|\sqrt{\|A_i\|^2 + n\lambda\gamma}$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ with probability proportional to $p^t_{i_t}$;
3. Update $\alpha^t_{i_t} = \arg\max_{\beta \in \mathbb{R}}\Big\{-\phi^*_{i_t}(-\beta) - (A_{i_t}^\top w^{t-1})\beta - \tfrac{\|A_{i_t}\|^2}{2\lambda n}|\beta - \alpha^{t-1}_{i_t}|^2\Big\}$.
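A minimal Python sketch (all names assumed) of AdaSDCA specialized to the squared loss $\phi_i(a)=\tfrac12(a-b_i)^2$ and $g(w)=\tfrac12\|w\|_2^2$, for which $\gamma=1$, $w^t=\tfrac{1}{\lambda n}A\alpha^t$, the dual residue is $\kappa^t_i=\alpha^t_i + A_i^\top w^t - b_i$, and the coordinate maximizer in step 3 has a closed form; recomputing all margins every step is what makes AdaSDCA cost $O(n\cdot\mathrm{nnz})$ per epoch, as in the table below.

```python
import numpy as np

def adasdca_squared(A, b, lam, epochs=10, rng=np.random.default_rng(0)):
    """Sketch of AdaSDCA for squared loss and L2 regularizer (gamma = 1).
    Columns of A are the samples A_i; b holds the targets b_i."""
    n = A.shape[1]
    col_sq = (A ** 2).sum(axis=0)                 # ||A_i||^2
    alpha = np.zeros(n)
    w = A @ alpha / (lam * n)                     # w^t = A alpha / (lam n) since grad g* = identity
    for t in range(epochs * n):
        margins = A.T @ w                         # A_i^T w for all i
        kappa = alpha + (margins - b)             # dual residue for squared loss
        p = np.abs(kappa) * np.sqrt(col_sq + n * lam)   # adaptive probabilities (gamma = 1)
        if p.sum() == 0:                          # all residues zero: optimal point reached
            break
        p = p / p.sum()
        i = rng.choice(n, p=p)
        q = col_sq[i] / (lam * n)
        beta = (b[i] - margins[i] + q * alpha[i]) / (1.0 + q)   # closed-form coordinate maximizer
        w += A[:, i] * (beta - alpha[i]) / (lam * n)            # keep w = A alpha/(lam n) in sync
        alpha[i] = beta
    return w, alpha
```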
65 Heuristic and Efficient Variant of AdaSDCA (AdaSDCA+)
For $t \geq 1$:
1. If $\mathrm{mod}(t, n) = 0$, then
   Option I (adaptive sampling probability): compute the dual residue $\kappa^t_i = \alpha^t_i + \nabla\phi_i(A_i^\top w^t)$ and set $p^t_i \propto |\kappa^t_i|\sqrt{\|A_i\|^2 + n\lambda\gamma}$;
   Option II (importance sampling probability): set $p^t_i \propto \|A_i\|^2 + n\lambda\gamma$;
2. Randomly pick $i_t \in \{1, \ldots, n\}$ according to $p^t$;
3. Update $\alpha^t_{i_t}$;
4. Update probability: $p^{t+1} \leftarrow (p^t_1, \ldots, p^t_{i_t}/m, \ldots, p^t_n)$.
66 Computational Cost per Epoch
Table 1: One-epoch computational cost of different algorithms.
  SDCA:       O(nnz)
  Iprox-SDCA: O(nnz + n log(n))
  AdaSDCA:    O(n * nnz)
  AdaSDCA+:   O(nnz + n log(n))
67 Numerical Experiments
Figures 1-2: w8a dataset (d = 300, n = 49,749), quadratic loss with L2 regularizer, λ = 1/n, γ = 1.
Figures 3-4: cov1 dataset (d = 54, n = 581,012), smooth hinge loss with L2 regularizer, λ = 1/n, γ = 1.
Figure 5: cov1 dataset (d = 54, n = 581,012), smooth hinge loss with L2 regularizer, λ = 1/n, γ = 1; comparison of different choices of the constant m.
72 More on ESO Q. and Richtarik. Coordinate descent with arbitrary sampling II: expected separable overapproximation, Optimization methods and software, 2016.
73 ESO
The function $f$ admits an expected separable overapproximation (ESO) with respect to $\hat{S}$ and $v \in \mathbb{R}^n_+$, denoted $(f, \hat{S}) \sim \mathrm{ESO}(v)$, if
$$\mathbb{E}[f(x + h_{[\hat{S}]})] \leq f(x) + \langle \nabla f(x), h \rangle_p + \tfrac{1}{2}\|h\|^2_{v \circ p}, \quad \forall x, h \in \mathbb{R}^n.$$
Recall the smoothness assumption: $f(x+h) \leq f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\|Ah\|^2$ for all $x, h \in \mathbb{R}^n$. Hence $(f, \hat{S}) \sim \mathrm{ESO}(v)$ if
$$\mathbb{E}[\|Ah_{[\hat{S}]}\|^2] = h^\top \mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] h \leq \|h\|^2_{v \circ p}, \quad \forall h \in \mathbb{R}^n.$$
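Since the ESO reduces to the matrix inequality $\mathbb{E}[I_{[\hat S]}A^\top A I_{[\hat S]}] \preceq \mathrm{Diag}(v\circ p)$, it can be sanity-checked numerically; the sketch below (assumed, not from the paper) estimates the left-hand side for a $\tau$-nice sampling by Monte Carlo and checks that the gap is positive semidefinite.

```python
import numpy as np

def eso_check(A, v, tau, trials=20000, rng=np.random.default_rng(0)):
    """Monte-Carlo check that E[I_S A^T A I_S] <= Diag(v * p) for tau-nice sampling."""
    n = A.shape[1]
    p = np.full(n, tau / n)                     # uniform sampling: p_i = tau/n
    G = A.T @ A
    E = np.zeros((n, n))
    for _ in range(trials):
        S = rng.choice(n, size=tau, replace=False)
        mask = np.zeros(n); mask[S] = 1.0
        E += np.outer(mask, mask) * G           # I_S A^T A I_S keeps rows/cols indexed by S
    E /= trials
    gap = np.diag(v * p) - E
    return np.linalg.eigvalsh(gap).min()        # should be >= 0 (up to Monte-Carlo noise)
```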
78 Deriving Stepsize
Find $v \in \mathbb{R}^n_+$ such that $\mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] \preceq \mathrm{Diag}(v \circ p)$.
We have $\mathbb{E}[I_{[\hat{S}]} A^\top A I_{[\hat{S}]}] = P \circ (A^\top A)$, where $P_{ij} = \mathbb{P}(i \in \hat{S}, j \in \hat{S})$.
Write $A^\top = (A_1, \ldots, A_m)$, so that $A_j \in \mathbb{R}^n$ is the $j$-th row of $A$; then $P \circ (A^\top A) = \sum_{j=1}^m P \circ (A_j A_j^\top)$.
Denote $J_j := \{i \in [n] : A_{ji} \neq 0\}$; then
$$P \circ (A^\top A) = \sum_{j=1}^m P \circ (A_j A_j^\top) = \sum_{j=1}^m P_{[J_j]} \circ (A_j A_j^\top).$$
83 Deriving Stepsize
Theorem (ESO with coupling between sampling and data). Let $\hat{S}$ be an arbitrary sampling and $v = (v_1, \ldots, v_n)$ be defined by:
$$v_i = \sum_{j=1}^m \lambda'(J_j, \hat{S})\, A_{ji}^2, \quad i = 1, 2, \ldots, n,$$
where $\lambda'(J, \hat{S}) := \max_{h \in \mathbb{R}^n}\{h^\top P_{[J]} h : h^\top \mathrm{Diag}(P_{[J]})\, h \leq 1\}$. Then $(f, \hat{S}) \sim \mathrm{ESO}(v)$.
84 Deriving Stepsize
Tight bounds for:
serial sampling: $\lambda'(J, \hat{S}) = 1$;
uniform sampling over subsets of fixed size $\tau$ (a.k.a. $\tau$-nice sampling) [Richtarik & Takac 13]: $\lambda'(J, \hat{S}) = 1 + \tfrac{(|J|-1)(\tau-1)}{\max(n-1,1)}$;
distributed sampling, with the data equally partitioned across $c$ processors, each of which independently draws a $\tau$-nice sampling [Fercoq, Q., Richtarik & Takac 14]: $\lambda'(J, \hat{S}) \leq \Big(1 + \tfrac{\tau-1}{\max(n/c-1,1)}\Big)\Big(1 + \tfrac{(|J|-1)(\tau-1)}{\max(n/c-1,1)}\Big)$.
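Combining the theorem with the $\tau$-nice bound gives a direct recipe for the stepsizes; a small Python sketch (assumed, dense data matrix) is shown below.

```python
import numpy as np

def eso_stepsizes_tau_nice(A, tau):
    """v_i = sum_j (1 + (|J_j|-1)(tau-1)/max(n-1,1)) * A_ji^2 for tau-nice sampling."""
    m, n = A.shape
    row_nnz = (A != 0).sum(axis=1)                              # |J_j| for each row j
    lam = 1.0 + (row_nnz - 1) * (tau - 1) / max(n - 1, 1)       # lambda'(J_j, S-hat)
    return (lam[:, None] * A ** 2).sum(axis=0)                  # v_i
```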
87 Conclusion
Unified convergence analysis for: randomized coordinate descent, accelerated randomized coordinate descent, arbitrary sampling.
Convergence condition (ESO) + formulae for computing admissible stepsizes.
Adaptive sampling using the duality gap.