Primal-dual coordinate descent: A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions


1 A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions
Olivier Fercoq and Pascal Bianchi
25 March 2016

2 Problem

Minimize the convex function
    min_{x ∈ R^n} f(x) + g(x) + h(Mx)
where f, g, h are convex, f is differentiable, M : R^n → R^m is linear, and prox_{τ,g} and prox_{σ,h} are easy to compute, with
    prox_{τ,g}(y) = argmin_{x ∈ R^n} g(x) + (1/2) Σ_{i=1}^n (1/τ_i)(x^(i) − y^(i))²  =  argmin_x g(x) + (1/2)‖x − y‖²_{τ^{-1}}.

Equivalent to the saddle point problem (if 0 ∈ ri(M dom g − dom h))
    min_{x ∈ R^n} max_{y ∈ R^m} f(x) + g(x) − h*(y) + ⟨Mx, y⟩
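To make the weighted prox concrete, here is a minimal sketch (not taken from the paper) for the special case g = λ‖·‖₁, where prox_{τ,g} reduces to coordinate-wise soft-thresholding with threshold λτ_i:

```python
import numpy as np

# prox_{tau,g}(y) = argmin_x g(x) + 1/2 * sum_i (x_i - y_i)^2 / tau_i
# for g = lam * ||.||_1: coordinate-wise soft-thresholding at lam * tau_i.
def prox_l1_weighted(y, tau, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam * tau, 0.0)

print(prox_l1_weighted(np.array([3.0, -1.5, 1.2]),
                       tau=np.array([0.1, 1.0, 0.5]), lam=1.0))  # [ 2.9 -0.5  0.7]
```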

3 Examples: Lasso

A is a dictionary, b is a vector we want to explain; the solution x is a sparse representation of b:
    min_{x ∈ R^n} (1/2)‖Ax − b‖²₂ + λ‖x‖₁
with f(x) = (1/2)‖Ax − b‖²₂, g(x) = λ‖x‖₁, h = 0.

4 Examples: L1 + TV regularized regression

Each line of A is a 3D image of the brain, b_j is the result of experiment j, and x indicates which regions best explain the results of the experiments:
    min_{x ∈ R^n} (1/2)‖Ax − b‖²₂ + α₁‖x‖₁ + α₂‖Mx‖₂,₁
with f(x) = (1/2)‖Ax − b‖²₂, g(x) = α₁‖x‖₁, M = discrete gradient, and
    h(y) = α₂‖y‖₂,₁ = α₂ Σ_{i=1}^n sqrt(y²_{i,1} + y²_{i,2} + y²_{i,3}).
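For this h the prox is group-wise shrinkage; below is a minimal sketch (an illustration, not the authors' code). The primal-dual algorithm later in the talk actually applies the prox of the conjugate h*, which can be obtained from this one via Moreau's identity.

```python
import numpy as np

# prox of sigma * h with h(y) = alpha * ||y||_{2,1}: shrink each group
# (y_{i,1}, y_{i,2}, y_{i,3}) towards 0 by alpha * sigma in Euclidean norm.
def prox_l21(y, alpha, sigma):
    # y has shape (n_voxels, 3): one row per group
    norms = np.linalg.norm(y, axis=1, keepdims=True)
    scale = np.maximum(1.0 - alpha * sigma / np.maximum(norms, 1e-12), 0.0)
    return scale * y
```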

5 Examples: dual of Support Vector Machines

    min_{x ∈ R^n} (1/(2λn²)) Σ_{j=1}^m ( Σ_{i=1}^n b_i A_{ji} x^(i) )² − (1/n) Σ_{i=1}^n x^(i)
    s.t. x ∈ [0, 1]^n, b^T x = 0
with f(x) = (1/(2λn²)) Σ_j ( Σ_i b_i A_{ji} x^(i) )² − (1/n) Σ_i x^(i), g(x) = ι_{[0,1]^n}(x), h(y) = ι_{b^⊥}(y) (the indicator of {y : b^T y = 0}), and Mx = x.

6 Part 1: Classical coordinate descent

    min_{x ∈ R^n} f(x) + g(x)
Here h = 0, g is separable, and f has a coordinate-wise Lipschitz gradient:
    ∀x ∈ R^n, ∀i ∈ {1,…,n}, ∀t ∈ R, |∇_i f(x + t e_i) − ∇_i f(x)| ≤ β_i |t|.

7 Coordinate descent

At iteration k:
1. Choose a coordinate i_{k+1} at random
2. x̄_{k+1} = prox_{τ,g}( x_k − D(τ)∇f(x_k) )
3. Update: x^i_{k+1} = x̄^i_{k+1} if i = i_{k+1}, and x^i_{k+1} = x^i_k if i ≠ i_{k+1}

Remarks:
- As g is separable, one only needs to compute the coordinate x̄^{i_{k+1}}_{k+1}
- Convergence if τ_i < 2/β_i for all i [Richtárik, Takáč, 2011]
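A minimal sketch of this scheme for the Lasso of Slide 3 (my own illustration, not the authors' implementation), with β_i = ‖A[:, i]‖² and τ_i = 1/β_i < 2/β_i; the residual Ax − b is kept up to date so that each partial gradient is cheap:

```python
import numpy as np

# Randomized proximal coordinate descent for min 1/2 ||Ax - b||^2 + lam ||x||_1.
def coordinate_descent_lasso(A, b, lam, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    beta = np.sum(A ** 2, axis=0)          # coordinate-wise Lipschitz constants
    tau = 1.0 / beta                       # step sizes, tau_i < 2 / beta_i
    x = np.zeros(n)
    residual = A @ x - b                   # maintained so partial gradients are cheap
    for _ in range(n_iter):
        i = rng.integers(n)                # pick one coordinate at random
        grad_i = A[:, i] @ residual        # i-th partial derivative of f
        z = x[i] - tau[i] * grad_i
        x_new_i = np.sign(z) * max(abs(z) - lam * tau[i], 0.0)   # prox of lam*tau_i*|.|
        residual += A[:, i] * (x_new_i - x[i])
        x[i] = x_new_i
    return x
```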

8 Part 2: Stochastic primal-dual coordinate descent

    min_{x ∈ R^n} f(x) + g(x) + h(Mx)
∇f is L(∇f)-Lipschitz, g and h need not be separable, M ∈ R^{m×n}.

9 Vũ-Condat's algorithm [Vũ + Condat, 2013]

    y_{k+1} = prox_{σh*}( y_k + σ M x_k )
    x_{k+1} = prox_{τg}( x_k − τ M^⊤(2y_{k+1} − y_k) − τ ∇f(x_k) )

- Generalizes Chambolle-Pock, ADMM and the proximal gradient method
- Converges to a saddle point of the Lagrangian
- Convergence as soon as τ < 1 / ( L(∇f)/2 + σ‖M‖² ) (fixed point of a firmly nonexpansive operator)
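One iteration can be sketched as follows (a minimal illustration; the callables grad_f, prox_g and prox_hstar, and their call signatures, are my own assumptions, with prox_hstar the prox of the conjugate h*):

```python
import numpy as np

# One Vu-Condat iteration for min f(x) + g(x) + h(Mx).
def vu_condat_step(x, y, M, grad_f, prox_g, prox_hstar, tau, sigma):
    y_new = prox_hstar(y + sigma * (M @ x), sigma)                          # dual ascent step
    x_new = prox_g(x - tau * (M.T @ (2 * y_new - y)) - tau * grad_f(x), tau)  # primal descent step
    return x_new, y_new
```

The step sizes are chosen so that τ (L(∇f)/2 + σ‖M‖²) < 1, as stated above.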

10 Coordinate descent for firmly nonexpansive operators
[Bianchi, Hachem & Iutzeler + Combettes & Pesquet, 2014]

At iteration k:
1. Choose a (block of) coordinate(s) i_{k+1} at random
2. z̄_{k+1} = T(z_k)
3. Update: z^i_{k+1} = z̄^i_{k+1} if i = i_{k+1}, and z^i_{k+1} = z^i_k if i ≠ i_{k+1}

Convergence: if T = αR + (1 − α)I, where α ∈ (0, 1) and R is nonexpansive in a separable norm.
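A minimal sketch of this generic scheme (the operator T and the block index sets are placeholders, not from the paper); in an efficient implementation only the coordinates of T(z) that are kept would actually be evaluated:

```python
import numpy as np

# Randomized block-coordinate fixed-point iteration: apply T, keep one block.
def coord_fixed_point(T, z0, blocks, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(n_iter):
        i = rng.integers(len(blocks))     # pick a block uniformly at random
        z_bar = T(z)                      # conceptual full evaluation of T
        z[blocks[i]] = z_bar[blocks[i]]   # only block i of z is written
    return z
```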

11 Primal-dual coordinate descent [Bianchi, Hachem & Iutzeler, 2014]

We take Vũ-Condat's fixed point operator
    T(y, x) = (ȳ, x̄),  where  ȳ = prox_{σh*}( y + σ M x ),  x̄ = prox_{τg}( x − τ M^⊤(2ȳ − y) − τ ∇f(x) ).
Assume that M is block diagonal and define blocks of primal-dual variables z = (y, x) accordingly.

The algorithm is: z^i_{k+1} = T_i(z_k) if i = i_{k+1}, and z^i_{k+1} = z^i_k if i ≠ i_{k+1}.

Convergence as soon as τ < 1 / ( L(∇f)/2 + σ‖M‖² ).

12 Duplication trick

    M = [ M_{1,1}  M_{1,2}  0
          0        M_{2,2}  0
          M_{3,1}  M_{3,2}  M_{3,3} ]

    K = [ M_{1,1}  0        0
          0        M_{1,2}  0
          0        M_{2,2}  0
          M_{3,1}  0        0
          0        M_{3,2}  0
          0        0        M_{3,3} ]

Each row of M is duplicated into one row per nonzero block, so that K is block diagonal in the required sense. Define an aggregation operator S and a diagonal scaling D(m) (m_j being the number of copies of row j) such that D(m)SK = M, and set h̃ = h ∘ (D(m)S). Then
    min_{x ∈ R^n} f(x) + g(x) + h̃(Kx) = min_{x ∈ R^n} f(x) + g(x) + h(Mx)
    prox_{mσ,h̃}(y) = ( (1/m_1) prox^{(1)}_{σ,h}(S y), …, (1/m_p) prox^{(p)}_{σ,h}(S y) )

13 Part 3: Stochastic primal-dual coordinate descent with long steps

14 Comparison of coordinate descent algorithms

Classical                                     | Primal-dual
f(x) + Σ_{i=1}^n g_i(x_i)                     | f(x) + g(x) + h(Kx)
g separable                                   | g and h non-separable; M: additional coupling
|∇_i f(x + t e_i) − ∇_i f(x)| ≤ β_i |t|       | based on the operator T
Longer steps (per-coordinate β_i)             | β = L(∇f)
Speed in O(1/k)                               | Speed in O(1/k)

15 Combine advantages of both approaches

    min_{x ∈ R^n} f(x) + g(x) + h(Mx)
f has a coordinate-wise Lipschitz gradient:
    ∀x ∈ R^n, ∀i ∈ {1,…,n}, ∀t ∈ R, |∇_i f(x + t e_i) − ∇_i f(x)| ≤ β_i |t|
g and h need not be separable, M ∈ R^{m×n}.

16 New algorithm

[Suppose that M is block diagonal (duplication trick available).]
At iteration k:
1. ȳ_{k+1} = prox_{σ,h*}( y_k + D(σ) M x_k )
   x̄_{k+1} = prox_{τ,g}( x_k − D(τ)( ∇f(x_k) + M^⊤(2ȳ_{k+1} − y_k) ) )
2. For i = i_{k+1} and all j such that M_{j,i_{k+1}} ≠ 0, update:
   x^i_{k+1} = x̄^i_{k+1},  y^j_{k+1} = ȳ^j_{k+1}
3. Otherwise, set x^i_{k+1} = x^i_k, y^j_{k+1} = y^j_k.
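A minimal sketch of one such iteration (an illustration only: grad_f, prox_g, prox_hstar and their signatures are assumptions, and the full vectors ȳ, x̄ are formed here for readability even though only the touched coordinates need to be computed in practice). rows_of[i] collects the dual rows j with M[j, i] ≠ 0.

```python
import numpy as np

# One coordinate-wise primal-dual step: virtual full update, partial write-back.
def pd_cd_step(x, y, M, grad_f, prox_g, prox_hstar, tau, sigma, i, rows_of):
    # virtual full update (only coordinate i and its rows are needed in practice)
    y_bar = prox_hstar(y + sigma * (M @ x), sigma)
    x_bar = prox_g(x - tau * (grad_f(x) + M.T @ (2 * y_bar - y)), tau)
    # keep primal coordinate i and the dual rows j with M[j, i] != 0
    x_new, y_new = x.copy(), y.copy()
    x_new[i] = x_bar[i]
    y_new[rows_of[i]] = y_bar[rows_of[i]]
    return x_new, y_new
```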

17 Example: iterates need not be feasible

    f(x) = (1/2)‖x‖²,  g(x) = 0 if x_1 = x_2 and +∞ if x_1 ≠ x_2,  h = 0
Start with x_0 = [1, 1] (feasible).
[Figure: dom g (the line x_1 = x_2) and the point x_0.]

18 Example: iterates need not be feasible

Same setting, starting from x_0 = [1, 1] (feasible): the full update gives x̄_1 = [0, 0]. Let i_1 = 1.
[Figure: dom g with the points x_0 and x̄_1.]

19 Example: iterates need not be feasible

Same setting: with x̄_1 = [0, 0] and i_1 = 1, we get x_1 = [0, 1], which is not feasible.
[Figure: dom g with the points x_0, x̄_1 and x_1.]

20 Distance to optimum may not change

    f(x) = (1/2)(x_1 + x_2 − 1)²,  L(∇f) = 2,  β_i = 1
[Figure: x_0, the coordinate updates x_1, and the set of minimizers x_1 + x_2 = 1.]

21 Distance to optimum may not change

    f(x) = (1/2)(x_1 + x_2 − 1)²,  L(∇f) = 2,  β_i = 1
    E[ ‖x_1 − x*‖² ] = 1/2 = ‖x_0 − x*‖²
[Figure: x_0, the coordinate updates x_1, and the set of minimizers.]

22 Distance to optimum may increase

    f(x) = (1/2)(x_1 + x_2 + x_3 − 1)²,  L(∇f) = 3,  β_i = 1
    E[ ‖x_1 − x*‖² ] = 2/3 > ‖x_0 − x*‖² = 1/3
[Figure: x_0, the coordinate updates x_1, and the set of minimizers.]

23 Distance to optimum may increase a lot

    f(x) = (1/2)( Σ_{i=1}^n x_i − 1 )²,  L(∇f) = n,  β_i = 1
    E[ ‖x_1 − x*‖² ] = (n − 1)/n  ≫  ‖x_0 − x*‖² = 1/n
[Figure: x_0, the coordinate updates x_1, and the set of minimizers.]

24 Theorem (Convergence)

If for all i ∈ {1,…,n}, P(i_{k+1} = i) = 1/n and
    τ_i < 1 / ( β_i + ρ( Σ_{j ∈ J(i)} σ_j M_{j,i}^⊤ M_{j,i} ) ),
where J(i) = {j : M_{j,i} ≠ 0}, then there exists a saddle point (x*, y*) of the Lagrangian
    L(x, y) = f(x) + g(x) + ⟨Mx, y⟩ − h*(y)
such that lim_{k→∞} (x_k, y_k) = (x*, y*).

Proof: a stochastic Lyapunov function is
    S_k = f(x_k) − f(x*) − ⟨∇f(x*), x_k − x*⟩ + ‖z_k − z*‖²_P.
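For scalar (size-one) blocks the spectral radius reduces to Σ_{j∈J(i)} σ_j M_{j,i}², so admissible step sizes can be computed in one sweep over the columns of M. A minimal sketch (my own illustration of the bound, not code from the paper), with a safety factor to keep the inequality strict:

```python
import numpy as np
import scipy.sparse as sp

# Admissible primal step sizes tau_i < 1 / (beta_i + sum_{j in J(i)} sigma_j * M[j, i]^2),
# assuming scalar blocks; beta and sigma are 1-D arrays, M can be dense or sparse.
def admissible_tau(M, beta, sigma, safety=0.95):
    M = sp.csc_matrix(M)
    coupling = np.asarray((sp.diags(sigma) @ M.multiply(M)).sum(axis=0)).ravel()
    return safety / (beta + coupling)
```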

25 L1 + TV regularized least squares

    min_{x ∈ R^n} (1/2)‖Ax − b‖²₂ + α( r‖x‖₁ + (1 − r)‖∇x‖₂,₁ )
A: a dense matrix with 65,280 columns; ∇: a 195,840 × 65,280 sparse matrix (3D gradient).
[Figure: FISTA reference runs for several settings, e.g. alpha=0.001, l1_ratio=0.5; alpha=0.1, l1_ratio=0.1; alpha=0.01, l1_ratio=0.5; alpha=0.1, l1_ratio=0.9.]

26 Comparison of algorithms on L1 + TV regularized least squares

[Figure: objective value versus time (s) for several values of alpha and l1_ratio (0.1, 0.5, 0.9), comparing Chambolle-Pock, FISTA, L-BFGS-B, Vu-Condat and dupl_prim_dual_cd.]

27 Dual SVM

    max_{x ∈ R^n} − (1/(2λ)) ‖A D(b) x‖² + e^⊤ x − ι_{[0,C]^n}(x) − ι_{b^⊥}(x)
RCV1 dataset: A is a 47,236 × 20,242 matrix, nnz = 0.157%; C = 1/n, λ = 1/n; 100 passes through the data.
[Figure: duality gap versus CPU time (s) for SDCA (Shalev-Shwartz & Zhang), Primal-dual coordinate descent, and RCD (Necoara & Patrascu).]

28 Larger SVM problem

    max_{x ∈ R^n} − (1/(2λ)) ‖A D(b) x‖² + e^⊤ x − ι_{[0,C]}(x) − ι_{b^⊥}(x)
KDD 2009 dataset: A is an 86,825 × 50,000 matrix, nnz = 1.79%; λ = 1/n, C_i ∈ {C_+, C_−}; 300 passes through the data.
[Figure: duality gap versus CPU time (s) for SDCA (Shalev-Shwartz & Zhang) and Primal-dual coordinate descent.]

29 Conclusion

Summary
- A genuine coordinate descent method for non-separable, non-smooth convex functions
- Promising numerical results

Open questions
- Non-uniform probabilities
- Speed of convergence
- Replace β_i by β_i/2
- Longer steps for non-separable proximal operators, e.g. MISO, projection onto {x : x_1 = … = x_n}
