Primal-dual coordinate descent

Size: px

Start display at page:

Download "Primal-dual coordinate descent"

Avis Cross
5 years ago
Views:

1 Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July /28

2 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x Rn (I + τ g) 1 and (I + σ h ) 1 are easy to compute M : R n R m is linear Equivalent to saddle point problem if 0 ri(mdom g dom h) min x R n max y R m f (x) + g(x) h (y) + Mx, y 2/28

3 Examples: Lasso min x R n Ax b λ x 1 f (x) = Ax b 2 2 g(x) = λ x 1 h(x) = 0 3/28

4 Examples: dual of Support Vector Machines 1 min x R n 2λn 2 ( m n ) 2 b i A ji x (i) 1 n j=1 st :x [0, 1] b T x = 0 i=1 n i=1 x (i) ( N ) 2 f (x) = 1 2λn 2 i=1 b ia ji x (i) 1 n n i=1 x (i) g(x) = I [0,1] N(x) h(y) = I b (y) Mx = x 4/28

5 Examples: L 1 + TV regularized regression min x R n Ax b α 1 x 1 + α 2 x 2,1 f (x) = Ax b 2 2 g(x) = α 1 x 1 = discrete gradient h(y) = α 2 y 2,1 = α 2 n i=1 y 2 i,1 + y 2 i,2 + y 2 i,3 5/28

6 Part 1: Classical coordinate descent h = 0 g is separable min f (x) + g(x) x Rn f has a coordinate-wise Lipschitz gradient: x R n, i {1,..., n}, t R, i f (x + te i ) i f (x) β i t 6/28

7 Coordinate descent At iteration k: Remarks 1. Choose randomly a coordinate i k+1 2. x k+1 = (I + τ g) 1( x k τ f (x k ) ) { x k+1 i if i = i k+1 3. Update: x k+1 = xk i if i i k+1 As g is separable, one only needs to compute x i k+1 k+1 Convergence if 1 τ i < β i 2 for all i [Richtárik, Takáč, 2011] 7/28

8 Part 2: Stochastic primal-dual coordinate descent f is L( f )-Lipschitz min f (x) + g(x) + h(mx) x Rn g and h need not be separable M R m n 8/28

9 Vũ-Condat s algorithm [Vũ + Condat, 2013] y k+1 = (I + σ h ) 1( ) y k + σmx k x k+1 = (I + τ g) 1( x k τm (2y k+1 y k ) τ f (x k ) ) Generalizes Chambolle-Pock, ADMM and proximal gradient Converges to a saddle point of the Lagrangian Proof: fixed point algorithm for a firmly nonexpansive operator 9/28

10 Vũ-Condat s algorithm y k+1 = (I + σ h ) 1( y k + σmx k ) x k+1 = (I + τ g) 1( x k τm (2y k+1 y k ) τ f (x k ) ) The fixed point operator is: T ( ) y = x ȳ {}}{ (I + σ h ) 1( y + σmx ) (I + τ g) 1( x τm (2ȳ y) τ f (x) ) Convergence as soon as 1 τ σ M 2 < L( f ) 2 10/28

11 Coordinate descent for firmly nonexpansive operators [Bianchi, Hachem & Iutzeler + Combettes & Pesquet, 2014] At iteration k: 1. Choose randomly a (block of) coordinate i k+1 2. z k+1 = T (z k ) 3. Update: z k+1 = { z i k+1 if i = i k+1 z i k if i i k+1 Convergence: if T = αr + (1 α)i where α (0, 1) and R is nonexpansive in a separable norm 11/28

12 Primal-dual coordinate descent We take Vũ-Condat s fixed point operator ȳ {}}{ ( ) (I + σ h ) 1( y + σmx ) y T = x (I + τ g) 1( x τm (2ȳ y) τ f (x) ) Assume that M is block diagonal and define blocks of primal-dual variables z = (y, x) accordingly The algorithm is z k+1 = { T i (z k ) if i = i k+1 z i k if i i k+1 Convergence as soon as 1 τ σ M 2 < L( f ) 2 12/28

13 Speed of convergence Theorem Assume that for all i {1,..., n}, p i = P(i k+1 = i) > 0 and τ 1 > σ M 2 + L( f ) 2. Denote ˆ x k+1 = 1( k+1 k l=1 x l) and ˆȳ k+1 = 1( k+1 k l=1 ȳl). Then for all (y, x) R m R n, E[L(ˆ x k+1, y) L(x, ˆȳ k+1 )] 1 ( 2k max 1, 1 + ( 1 κ 1) 1 ) z z 0 2 p 1 1/(2κ) 1 P [ ] τ where κ = (τ 1 σ M 2 )/L( f ) > 1 and P = 1 I M 2 M T σ 1 I 13/28

14 Duplication trick M 1,1 0 0 M 1,1 M 1,2 0 0 M 1,2 0 M = 0 M 2,2 0 K = 0 M 2,2 0 M 3,1 M 3,2 M 3,3 M 3, M 3, M 3, Define S = and D(m) = We have D(m)SK = M. We define h = h (D(m)S) : prox mσ,h (y) = (1 m1 prox (1) σ,h (S(y)),..., 1 mp prox (p) σ,h (S(y))) 14/28

15 Part 3: Stochastic primal-dual coordinate descent with long steps 15/28

16 Comparison of coordinate descent algorithms Classical Primal-dual f (x) + n i=1 g i(x i ) f (x) + g(x) + h(kx) g separable g and h non-separable M: additional coupling i f (x + te i ) i f (x) β i t Bases on operator T : Longer steps β = L( f ) Speed in O(1/k) Speed in O(1/k) 16/28

17 Combine advantages of both approaches min f (x) + g(x) + h(mx) x Rn f has a coordinate-wise Lipschitz gradient: x R n, i {1,..., n}, t R, i f (x + te i ) i f (x) β i t g and h need not be separable M R m n 17/28

18 Iterates need not be feasible Example: f (x) = { 1 2 x 2 0 if x 1 = x 2 g(x) = + if x 1 x 2 h = 0 Start with x 0 = [1, 1] (feasible): x 1 = [0, 0] Let i 1 = 1, we get x 1 = [0, 1] (unfeasible) 18/28

19 Distance to optimum may not change f (x) = 1 2 (x 1 + x 2 1) 2, L( f ) = 2, β i = 1 x x 0 x 1 19/28

20 Distance to optimum may not change f (x) = 1 2 (x 1 + x 2 1) 2, L( f ) = 2, β i = 1 x x 0 x 1 E[ x 1 x 2 ] = 1 2 = x 0 x 2 19/28

21 E[ x 1 x 2 ] = 2 3 > x 0 x 2 19/28 Distance to optimum may increase f (x) = 1 2 (x 1 + x 2 + x 3 1) 2, L( f ) = 3, β i = 1 x x 0 x 1

22 Distance to optimum may increase a lot f (x) = 1 n 2 ( x i 1) 2, L( f ) = n, β i = 1 i=1 x x 0 x 1 E[ x 1 x 2 ] = n n 1 x 0 x 2 = 1 n 19/28

23 New algorithm [Suppose that M is block diagonal (duplication trick available)] At iteration k: 1. y k+1 = prox σ,h (y k + D(σ)Mx k ) x k+1 = prox τ,g (x k D(τ)( f (x k )+M (2y k+1 y k ))) 2. For i = i k+1 and j : M j,ik+1 0, update: x (i) k+1 = x (i) k+1 y (j) k+1 = y (j) k+1 3. Otherwise, set x (i) k+1 = x (i) k, y (j) k+1 = y (j) k 20/28

24 Convergence Theorem If for all i {1,..., n}, P(i k+1 = i) = 1/n and τ i < 1 ( ) β i + ρ j J(i) σ jmj,i M j,i then there exists a saddle point (x, y ) of L such that lim (x k, y k ) = (x, y ) k Proof: We consider the Lyapunov function S k = f (x k ) f (x ) f (x ), x k x z k z P 21/28

25 Primal value Dual SVM max x R n 1 2 AD(b)x e T x I [0,C] n(x) I b (x) RCV1 dataset: A = 47,236 20,242 matrix, n = 47,236 C = Vu-Condat Bianchi et al SDCA primal-dual CD Passes on the data 22/28

26 alpha alpha=0.01 alpha=0.001 alpha=0.1 alpha= L 1 + TV 8500 regularized least squares l1_ratio= FISTA: alpha=0.001,l1_ratio=0.5,nvoxels= min x R n 1 2 Ax b α(r x (1 r) x 2,1 ) A: ,280 dense matrix 9650 : 195,840 65,280 sparse matrix (3D gradient) Lightly regularized problem α = 0.001, l1_ratio=0.5 r = 0.5 Chambolle-Pock FISTA L-BFGS-B Vu-Condat dupl_prim_dual_cd /

27 alpha= alpha=0.001 alpha=0.01 alpha=0.01 alpha=0.1 alpha= L 1 + TV 8500 regularized least squares l1_ratio= FISTA: alpha=0.01,l1_ratio=0.5,nvoxels= min x R n 1 2 Ax b α(r x (1 r) x 2,1 ) A: ,280 dense 9700 matrix : 195,840 65,280 sparse matrix (3D gradient) 9840 Mediumly regularized problem l1_ratio=0.5 α = 0.01, r = 0.5 Chambolle-Pock FISTA L-BFGS-B Vu-Condat dupl_prim_dual_cd /

28 L 1 + TV 8500 regularized least squares min x R n 1 2 Ax b α(r x 1 + (1 r) x 2,1 ) 9400 A: ,280 dense 9700 matrix : 195,840 65,280 sparse matrix (3D gradient) alpha alpha=0.01 FISTA: alpha=0.1,l1_ratio=0.1,nvoxels= alpha= Strongly TV-regularized problem α = 0.1, r = 0.1 Chambolle-Pock FISTA L-BFGS-B Vu-Condat dupl_prim_dual_cd /28

29 L 1 + TV 8500 regularized least squares min x R n 1 2 Ax b α(r x 1 + (1 r) x 2,1 ) A: ,280 dense 9700 matrix 9650 : 195,840 65,280 sparse matrix (3D gradient) FISTA: alpha=0.1,l1_ratio=0.9,nvoxels= Strongly L 1 -regularized problem α = 0.1, r = 0.9 Chambolle-Pock FISTA L-BFGS-B Vu-Condat dupl_prim_dual_cd /28

30 8500 L 1 + TV regularized least squares 9850 min x R n 1 2 Ax b α(r x 1 + (1 r) x 2,1 ) A: ,280 dense 9700 matrix 9650 : 195,840 65,280 sparse matrix (3D gradient) 9600 l1_ratio= l1_ratio= alpha=0.001 alpha=0.01 alpha= Chambolle-Pock FISTA L-BFGS-B Vu-Condat dupl_prim_dual_cd l1_ratio= /28

31 Summary Conclusion Genuine coordinate descent method with non-separable and non-smooth convex function Promising numerical results Open questions Long steps - Non uniform probabilities - Speed of convergence - Replace β i by β i /2 Non-ergodic speed of convergence Longer steps for non-separable proximal operators: eg. MISO, projection on {x : x 1 =... = x n } 28/28

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function