Primal-dual algorithms for the sum of two and three functions


Ming Yan, Michigan State University, CMSE/Mathematics. This work is partially supported by NSF.

optimization problems for primal-dual algorithms

$$\min_x\; f(x) + g(x) + h(Ax)$$

$f$, $g$, and $h$ are convex. $X$ and $Y$ are two Hilbert spaces (e.g., $\mathbb{R}^m$, $\mathbb{R}^n$). $f : X \to \mathbb{R}$ is differentiable with a $1/\beta$-Lipschitz continuous gradient for some $\beta \in (0, +\infty)$. $A : X \to Y$ is a bounded linear operator.

applications: statistics

Elastic net regularization (Zou-Hastie 05):
$$\min_x\; \tfrac{\mu_2}{2}\|x\|_2^2 + \mu_1\|x\|_1 + l(Ax, b),$$
where $x \in \mathbb{R}^p$, $A \in \mathbb{R}^{n \times p}$, $b \in \mathbb{R}^n$, and $l$ is the loss function, which may be nondifferentiable.

Fused lasso (Tibshirani et al. 05):
$$\min_x\; \tfrac{1}{2}\|Ax - b\|_2^2 + \mu_1\|x\|_1 + \mu_2\|Dx\|_1,$$
where $x \in \mathbb{R}^p$, $A \in \mathbb{R}^{n \times p}$, $b \in \mathbb{R}^n$, and
$$D = \begin{bmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix} \in \mathbb{R}^{(p-1) \times p}$$
is the discrete difference matrix.
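As a small illustration of the difference operator $D$ above, a minimal NumPy/SciPy sketch (function and variable names are ours, not from the slides) that builds $D$ as a sparse matrix:

```python
import numpy as np
from scipy import sparse

def difference_matrix(p):
    """Sparse (p-1) x p first-order difference matrix with rows [-1, 1],
    so that (D @ x)[i] = x[i+1] - x[i]."""
    main = -np.ones(p - 1)   # -1 on the main diagonal
    upper = np.ones(p - 1)   # +1 on the first superdiagonal
    return sparse.diags([main, upper], offsets=[0, 1], shape=(p - 1, p))

D = difference_matrix(6)
x = np.arange(6, dtype=float)
print(D @ x)   # consecutive differences of 0,1,...,5 -> all ones
```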

applications: decentralized optimization

$$\min_x\; \sum_{i=1}^n f_i(x) + g_i(x)$$

$f_i$ and $g_i$ are known at node $i$ only. Nodes $1, \dots, n$ are connected in an undirected graph. Each $f_i$ is differentiable with a Lipschitz continuous gradient.

Introduce a copy $x_i$ at node $i$:
$$\min_{\mathbf{x}}\; f(\mathbf{x}) + g(\mathbf{x}) := \sum_{i=1}^n f_i(x_i) + g_i(x_i) \quad \text{s.t. } W\mathbf{x} = \mathbf{x},$$
where $x_i \in \mathbb{R}^p$, $\mathbf{x} = [x_1\; x_2\; \cdots\; x_n] \in \mathbb{R}^{n \times p}$, and $W$ is a symmetric doubly stochastic mixing matrix.

The sum of three functions:
$$\min_{\mathbf{x}}\; f(\mathbf{x}) + g(\mathbf{x}) + \iota_0\big((I - W)^{1/2}\mathbf{x}\big)$$
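The slides do not prescribe a particular mixing matrix; one standard choice that produces a symmetric doubly stochastic $W$ from an undirected graph is the Metropolis-Hastings rule. A minimal NumPy sketch under that assumption (names are ours):

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric doubly stochastic mixing matrix from a 0/1 adjacency matrix:
    W[i, j] = 1 / (1 + max(deg_i, deg_j)) for every edge (i, j),
    W[i, i] = 1 - sum_{j != i} W[i, j], and W[i, j] = 0 otherwise."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    n = adj.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
    W += np.diag(1.0 - W.sum(axis=1))   # each row (hence each column) sums to 1
    return W

# 4-node path graph 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
W = metropolis_weights(adj)
print(np.allclose(W.sum(axis=0), 1.0), np.allclose(W, W.T))   # True True
```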

applications: imaging

Image restoration with two regularizations:
$$\min_x\; \tfrac{1}{2}\|Ax - b\|_2^2 + \iota_C(x) + \mu\|Dx\|_1,$$
where $x \in \mathbb{R}^n$ is the image to be reconstructed, $A \in \mathbb{R}^{m \times n}$ is the forward projection matrix, $b \in \mathbb{R}^m$ is the measured data with noise, $D$ is a discrete gradient operator, and $\iota_C$ is the indicator function that returns zero if $x \in C$ (here, $C$ is the set of nonnegative vectors in $\mathbb{R}^n$) and $+\infty$ otherwise.

Other problems:
$f$: data fitting term (infimal convolution for mixed noise)
$h \circ A$: total variation; other transforms
$g$: nonnegativity; box constraint

primal-dual formulation

Original problem:
$$\min_x\; f(x) + g(x) + h(Ax)$$

Introduce a dual variable $s$:
$$\min_x \max_s\; f(x) + g(x) + \langle Ax, s\rangle - h^*(s)$$
Here $h^*$ is the conjugate function of $h$, defined as $h^*(s) = \max_t\; \langle s, t\rangle - h(t)$.

It is equivalent to (using $s^* \in \partial h(Ax^*) \Leftrightarrow Ax^* \in \partial h^*(s^*)$):
$$\begin{cases} 0 \in \nabla f(x^*) + \partial g(x^*) + A^\top s^* \\ 0 \in \partial h^*(s^*) - Ax^* \end{cases}$$
All primal-dual algorithms try to find $(x^*, s^*)$.
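As a concrete example of the conjugate (a standard computation, not taken from the slides): for $h = \mu\|\cdot\|_1$,
$$h^*(s) = \max_t\; \langle s, t\rangle - \mu\|t\|_1 = \begin{cases} 0 & \text{if } \|s\|_\infty \le \mu,\\ +\infty & \text{otherwise,}\end{cases}$$
so the dual variable $s$ is constrained to the box $\{s : \|s\|_\infty \le \mu\}$ and the resolvent of $\partial h^*$ reduces to the projection onto that box.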

existing algorithms: Condat-Vu, AFBA, and PDFP

Condat-Vu (Condat 13, Vu 13):
Convergence conditions: $\lambda\|AA^\top\| + \gamma/(2\beta) \le 1$
Per-iteration computations: $A$, $A^\top$, $\nabla f$, one $(I + \gamma\partial g)^{-1}$, one $(I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}$

AFBA (Latafat-Patrinos 16):
Convergence conditions: $\lambda\|AA^\top\|/2 + \sqrt{\lambda\|AA^\top\|}/2 + \gamma/(2\beta) \le 1$
Per-iteration computations: $A$, $A^\top$, $\nabla f$, one $(I + \gamma\partial g)^{-1}$, one $(I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}$

PDFP (Chen-Huang-Zhang 16):
Convergence conditions: $\lambda\|AA^\top\| < 1$; $\gamma/(2\beta) < 1$
Per-iteration computations: $A$, $A^\top$, $\nabla f$, two $(I + \gamma\partial g)^{-1}$, one $(I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}$

Here $(I + \gamma\partial g)^{-1}(\bar{x}) = \arg\min_x \gamma g(x) + \frac{1}{2}\|x - \bar{x}\|^2$. This is a backward step (or implicit step) because $(I + \gamma\partial g)^{-1}(\bar{x}) \in \bar{x} - \gamma\partial g\big((I + \gamma\partial g)^{-1}(\bar{x})\big)$.
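To make the backward step concrete, here is a minimal NumPy sketch (ours, not from the slides) of the two resolvents used repeatedly below, for $g = \mu\|\cdot\|_1$: the resolvent of $\gamma\partial g$ (soft-thresholding) and the resolvent of $\tfrac{\lambda}{\gamma}\partial h^*$ obtained through the Moreau decomposition $\mathrm{prox}_{\tau h^*}(s) = s - \tau\,\mathrm{prox}_{h/\tau}(s/\tau)$:

```python
import numpy as np

def prox_l1(x, t):
    """(I + t * subdifferential of ||.||_1)^{-1}(x): soft-thresholding with threshold t >= 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_conj_l1(s, tau, mu):
    """Resolvent of tau * dh^* for h = mu*||.||_1, via the Moreau decomposition
    prox_{tau h^*}(s) = s - tau * prox_{(1/tau) h}(s / tau); for the l1 norm this is
    the projection onto the box {s : |s_i| <= mu}."""
    return s - tau * prox_l1(s / tau, mu / tau)

# Backward (implicit) step: x_plus = x_bar - t * u with u a subgradient at x_plus.
x_bar, t, mu = np.array([3.0, -0.2, 1.0]), 0.5, 1.0
x_plus = prox_l1(x_bar, t * mu)          # resolvent of t * d(mu*||.||_1)
u = (x_bar - x_plus) / t                 # subgradient selected by the resolvent
print(x_plus, u)                         # u[i] = mu*sign(x_plus[i]) wherever x_plus[i] != 0

s = np.array([-5.0, -1.0, 0.3, 4.0])
print(prox_conj_l1(s, tau=0.5, mu=2.0))  # [-2. -1. 0.3 2.], i.e. clipping to [-2, 2]
```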

PDHG (Zhu-Chan 08)

When $f = 0$, we have
$$0 \in \begin{bmatrix} \partial g & A^\top \\ -A & \partial h^* \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix}$$

It is equivalent to
$$\begin{bmatrix} \partial g & 0 \\ 0 & \partial h^* \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix} \ni \begin{bmatrix} 0 & -A^\top \\ A & 0 \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix}$$
and to
$$\begin{bmatrix} \frac{1}{\gamma}I + \partial g & 0 \\ -A & \frac{\gamma}{\lambda}I + \partial h^* \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix} \ni \begin{bmatrix} \frac{1}{\gamma}I & -A^\top \\ 0 & \frac{\gamma}{\lambda}I \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix}$$

Primal-dual hybrid gradient (PDHG):
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s)$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}Ax^+)$$

Chambolle-Pock (Chambolle et al. 09, Esser-Zhang-Chan 10):
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s)$$
$$\bar{x}^+ = 2x^+ - x$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x}^+)$$
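A minimal NumPy sketch (our own toy instance and stepsize choices, not from the slides) of the Chambolle-Pock iteration above, applied to 1D total-variation denoising $\min_x \tfrac12\|x - c\|^2 + \mu\|Dx\|_1$, which fits the $f = 0$ setting with $g(x) = \tfrac12\|x - c\|^2$ and $h = \mu\|\cdot\|_1$:

```python
import numpy as np

def chambolle_pock(prox_g, prox_hstar, A, x0, s0, gamma, lam, iters=500):
    """Chambolle-Pock / PDHG for min_x g(x) + h(Ax) (the f = 0 case),
    with the slides' stepsizes: gamma (primal) and lam/gamma (dual)."""
    x, s = x0.copy(), s0.copy()
    for _ in range(iters):
        x_new = prox_g(x - gamma * (A.T @ s), gamma)           # backward step on g
        x_bar = 2 * x_new - x                                  # extrapolation
        s = prox_hstar(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        x = x_new
    return x, s

# Toy instance: denoise a noisy step signal with a TV penalty.
np.random.seed(0)
n, mu = 50, 2.0
c = np.concatenate([np.zeros(25), 5 * np.ones(25)]) + 0.3 * np.random.randn(n)
D = np.diff(np.eye(n), axis=0)                                 # (n-1) x n difference matrix
prox_g = lambda z, t: (z + t * c) / (1 + t)                    # resolvent of t * d(0.5*||.-c||^2)
prox_hstar = lambda z, t: np.clip(z, -mu, mu)                  # resolvent of t * dh^*, h = mu*||.||_1
lam = 0.99 / np.linalg.norm(D @ D.T, 2)                        # enforce lam*||D D^T|| <= 1
x, _ = chambolle_pock(prox_g, prox_hstar, D, c.copy(), np.zeros(n - 1), gamma=1.0, lam=lam)
print(x[:3].round(2), x[25:28].round(2))                       # roughly piecewise constant
```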

Chambolle-Pock 11 as proximal point

Chambolle-Pock ($x$-$s$ order):
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s)$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big(s + \tfrac{\lambda}{\gamma}A(2x^+ - x)\big)$$

CP is equivalent to the backward operator applied on the KKT system:
$$\left(\begin{bmatrix} \frac{1}{\gamma}I & -A^\top \\ -A & \frac{\gamma}{\lambda}I \end{bmatrix} + \begin{bmatrix} \partial g & A^\top \\ -A & \partial h^* \end{bmatrix}\right)\begin{bmatrix} x^+ \\ s^+ \end{bmatrix} \ni \begin{bmatrix} \frac{1}{\gamma}I & -A^\top \\ -A & \frac{\gamma}{\lambda}I \end{bmatrix}\begin{bmatrix} x \\ s \end{bmatrix}$$

CP is 1/2-averaged under the metric induced by the matrix if $\lambda$ satisfies the condition $\lambda\|AA^\top\| \le 1$.

Condat-Vu (Condat 13, Vu 13)

The optimality condition:
$$0 \in \begin{bmatrix} \partial g & A^\top \\ -A & \partial h^* \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix} + \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix}$$

CV is equivalent to the forward-backward splitting applied on the KKT system:
$$\left(\begin{bmatrix} \frac{1}{\gamma}I & -A^\top \\ -A & \frac{\gamma}{\lambda}I \end{bmatrix} + \begin{bmatrix} \partial g & A^\top \\ -A & \partial h^* \end{bmatrix}\right)\begin{bmatrix} x^+ \\ s^+ \end{bmatrix} \ni \begin{bmatrix} \frac{1}{\gamma}I & -A^\top \\ -A & \frac{\gamma}{\lambda}I \end{bmatrix}\begin{bmatrix} x \\ s \end{bmatrix} - \begin{bmatrix} \nabla f(x) \\ 0 \end{bmatrix}$$

That is:
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma\nabla f(x) - \gamma A^\top s)$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big(s + \tfrac{\lambda}{\gamma}A(2x^+ - x)\big)$$

It is equivalent to (by changing the update order):
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma\nabla f(x) - \gamma A^\top s^+)$$
$$\bar{x}^+ = 2x^+ - x$$

CV is non-expansive (forward-backward) under the metric induced by the matrix if $\gamma$ and $\lambda$ satisfy $\lambda\|AA^\top\| + \gamma/(2\beta) \le 1$.
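A minimal NumPy sketch (our own toy instance, not from the slides) of the reordered Condat-Vu iteration above, applied to $\min_x \tfrac12\|x - c\|^2 + \iota_{\{x \ge 0\}}(x) + \mu\|Dx\|_1$, i.e. the imaging-style problem with all three terms present:

```python
import numpy as np

def condat_vu(grad_f, prox_g, prox_hstar, A, x0, s0, gamma, lam, iters=500):
    """Condat-Vu for min_x f(x) + g(x) + h(Ax), in the reordered (s then x) form;
    convergence needs lam*||A A^T|| + gamma/(2*beta) <= 1."""
    x, s = x0.copy(), s0.copy()
    x_bar = x.copy()
    for _ in range(iters):
        s = prox_hstar(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        x_new = prox_g(x - gamma * grad_f(x) - gamma * (A.T @ s), gamma)
        x_bar = 2 * x_new - x                                   # extrapolation for the next dual step
        x = x_new
    return x, s

np.random.seed(0)
n, mu, beta = 50, 2.0, 1.0                                      # grad f is 1-Lipschitz, so beta = 1
c = np.concatenate([np.zeros(25), 5 * np.ones(25)]) + 0.3 * np.random.randn(n)
D = np.diff(np.eye(n), axis=0)
grad_f = lambda x: x - c                                        # f = 0.5*||x - c||^2
prox_g = lambda z, t: np.maximum(z, 0.0)                        # projection onto the nonnegative orthant
prox_hstar = lambda z, t: np.clip(z, -mu, mu)                   # h = mu*||.||_1
gamma = beta                                                    # gamma/(2*beta) = 1/2
lam = 0.5 / np.linalg.norm(D @ D.T, 2)                          # lam*||D D^T|| = 1/2, so the sum is 1
x, _ = condat_vu(grad_f, prox_g, prox_hstar, D, np.zeros(n), np.zeros(n - 1), gamma, lam)
print(x.min().round(3), x[:3].round(2), x[25:28].round(2))      # nonnegative, roughly piecewise constant
```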

PDFP$^2$O/PAPC (Loris-Verhoeven 11, Chen-Huang-Zhang 13, Drori-Sabach-Teboulle 15)

When $g = 0$, the optimality condition becomes:
$$0 \in \begin{bmatrix} 0 & A^\top \\ -A & \partial h^* \end{bmatrix}\begin{bmatrix} x^* \\ s^* \end{bmatrix} + \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix}$$

PAPC is equivalent to the forward-backward splitting applied on the KKT system:
$$\begin{bmatrix} \frac{1}{\gamma}I & A^\top \\ -A & \frac{\gamma}{\lambda}I - \gamma AA^\top + \partial h^* \end{bmatrix}\begin{bmatrix} x^+ \\ s^+ \end{bmatrix} \ni \begin{bmatrix} \frac{1}{\gamma}I & 0 \\ 0 & \frac{\gamma}{\lambda}I - \gamma AA^\top \end{bmatrix}\begin{bmatrix} x \\ s \end{bmatrix} - \begin{bmatrix} \nabla f(x) \\ 0 \end{bmatrix}$$

Equivalently,
$$\begin{bmatrix} \frac{1}{\gamma}I & A^\top \\ 0 & \frac{\gamma}{\lambda}I + \partial h^* \end{bmatrix}\begin{bmatrix} x^+ \\ s^+ \end{bmatrix} \ni \begin{bmatrix} \frac{1}{\gamma}I & 0 \\ A & \frac{\gamma}{\lambda}I - \gamma AA^\top \end{bmatrix}\begin{bmatrix} x \\ s \end{bmatrix} - \begin{bmatrix} \nabla f(x) \\ \gamma A\nabla f(x) \end{bmatrix}$$

PAPC is non-expansive (forward-backward) under the metric induced by the matrix if $\gamma$ and $\lambda$ satisfy the conditions $\lambda\|AA^\top\| \le 1$ and $\gamma/(2\beta) \le 1$.

PAPC

PAPC can be expressed as
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big((I - \lambda AA^\top)s + \tfrac{\lambda}{\gamma}A(x - \gamma\nabla f(x))\big)$$
$$x^+ = x - \gamma\nabla f(x) - \gamma A^\top s^+$$

It is equivalent to
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = x - \gamma\nabla f(x) - \gamma A^\top s^+$$
$$\bar{x}^+ = x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+$$

PAPC is $\alpha$-averaged under the metric induced by the matrix. PAPC converges if $\gamma$ and $\lambda$ satisfy the conditions $\lambda\|AA^\top\| < 4/3$ and $\gamma/(2\beta) < 1$ (Li-Yan 17).
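A minimal NumPy sketch (our own toy instance, not from the slides) of the PAPC iteration in the $\bar{x}$ form above, applied to $\min_x \tfrac12\|x - c\|^2 + \mu\|Dx\|_1$ (the $g = 0$ case). The stepsizes deliberately use the relaxed dual bound $\lambda\|AA^\top\| < 4/3$ from Li-Yan 17:

```python
import numpy as np

def papc(grad_f, prox_hstar, A, x0, s0, gamma, lam, iters=500):
    """PAPC / PDFP2O for min_x f(x) + h(Ax) (the g = 0 case), in the x-bar form:
    dual step, gradient step with the new dual, then the extrapolated gradient step."""
    x, s = x0.copy(), s0.copy()
    x_bar = x - gamma * grad_f(x) - gamma * (A.T @ s)
    for _ in range(iters):
        s = prox_hstar(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        x = x - gamma * grad_f(x) - gamma * (A.T @ s)
        x_bar = x - gamma * grad_f(x) - gamma * (A.T @ s)
    return x, s

np.random.seed(0)
n, mu = 50, 2.0
c = np.concatenate([np.zeros(25), 5 * np.ones(25)]) + 0.3 * np.random.randn(n)
D = np.diff(np.eye(n), axis=0)
grad_f = lambda x: x - c                                        # beta = 1
prox_hstar = lambda z, t: np.clip(z, -mu, mu)                   # h = mu*||.||_1
gamma = 1.5                                                     # gamma/(2*beta) < 1
lam = 1.2 / np.linalg.norm(D @ D.T, 2)                          # lam*||D D^T|| = 1.2 < 4/3
x, _ = papc(grad_f, prox_hstar, D, np.zeros(n), np.zeros(n - 1), gamma, lam)
print(x[:3].round(2), x[25:28].round(2))
```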

PDFP (Chen-Huang-Zhang 16)

Rewrite PDFP$^2$O as
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = x - \gamma\nabla f(x) - \gamma A^\top s^+$$
$$\bar{x}^+ = x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+$$

PDFP, as a generalization of PDFP$^2$O, is
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma\nabla f(x) - \gamma A^\top s^+)$$
$$\bar{x}^+ = (I + \gamma\partial g)^{-1}(x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+)$$

When $g$ is an indicator function, PDFP reduces to the Preconditioned Alternating Projection Algorithm (PAPA) (Krol-Li-Shen-Xu 12).

AFBA (Latafat-Patrinos 16)

Rewrite PAPC as
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = \bar{x} - \gamma A^\top(s^+ - s)$$
$$\bar{x}^+ = x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+$$

AFBA, as a generalization of PAPC, is
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = \bar{x} - \gamma A^\top(s^+ - s)$$
$$\bar{x}^+ = (I + \gamma\partial g)^{-1}(x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+)$$

Convergence conditions: $\lambda\|AA^\top\|/2 + \sqrt{\lambda\|AA^\top\|}/2 + \gamma/(2\beta) \le 1$

Chambolle-Pock and PAPC

Chambolle-Pock:
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s^+)$$
$$\bar{x}^+ = 2x^+ - x$$

PAPC:
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = x - \gamma\nabla f(x) - \gamma A^\top s^+$$
$$\bar{x}^+ = x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+ = 2x^+ - x - \gamma\nabla f(x^+) + \gamma\nabla f(x)$$

PD3O (Yan 16):
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma\nabla f(x) - \gamma A^\top s^+)$$
$$\bar{x}^+ = 2x^+ - x - \gamma\nabla f(x^+) + \gamma\nabla f(x)$$
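A minimal NumPy sketch of PD3O (our own toy instance, the same one used in the Condat-Vu sketch above). Note how the decoupled stepsize condition $\lambda\|AA^\top\| \le 1$, $\gamma < 2\beta$ lets the primal stepsize approach $2\beta$:

```python
import numpy as np

def pd3o(grad_f, prox_g, prox_hstar, A, x0, s0, gamma, lam, iters=500):
    """PD3O (Yan 16) for min_x f(x) + g(x) + h(Ax);
    stepsizes gamma (primal) and lam/gamma (dual), with lam*||A A^T|| <= 1 and gamma < 2*beta."""
    x, s = x0.copy(), s0.copy()
    gx = grad_f(x)
    x_bar = x.copy()
    for _ in range(iters):
        s = prox_hstar(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        x_new = prox_g(x - gamma * gx - gamma * (A.T @ s), gamma)
        gx_new = grad_f(x_new)
        x_bar = 2 * x_new - x - gamma * gx_new + gamma * gx     # PD3O's corrected extrapolation
        x, gx = x_new, gx_new
    return x, s

np.random.seed(0)
n, mu, beta = 50, 2.0, 1.0
c = np.concatenate([np.zeros(25), 5 * np.ones(25)]) + 0.3 * np.random.randn(n)
D = np.diff(np.eye(n), axis=0)
grad_f = lambda x: x - c
prox_g = lambda z, t: np.maximum(z, 0.0)
prox_hstar = lambda z, t: np.clip(z, -mu, mu)
gamma = 1.9 * beta                                              # larger than Condat-Vu would allow here
lam = 1.0 / np.linalg.norm(D @ D.T, 2)                          # lam*||D D^T|| = 1
x, _ = pd3o(grad_f, prox_g, prox_hstar, D, np.zeros(n), np.zeros(n - 1), gamma, lam)
print(x.min().round(3), x[:3].round(2), x[25:28].round(2))
```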

Chambolle-Pock

Chambolle-Pock (updating $s$ before $x$):
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s^+)$$
$$\bar{x}^+ = 2x^+ - x$$

In terms of $z = x - \gamma A^\top s$:
$$x^+ = (I + \gamma\partial g)^{-1}(z)$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big((I - \lambda AA^\top)s + \tfrac{\lambda}{\gamma}A(2x^+ - z)\big)$$
$$z^+ = z + (2x^+ - z - \gamma A^\top s^+) - x^+$$

[Diagram on the slide: one iteration moves through the points $(z, s)$, $(x^+, s)$, $(2x^+ - z, s)$, $(2x^+ - z - \gamma A^\top s^+, s^+)$, and $(z^+, s^+)$.]

C-P and PAPC

PAPC in terms of $z$ (here $g = 0$, so $x^+ = z$):
$$x^+ = z = x - \gamma\nabla f(x) - \gamma A^\top s$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big((I - \lambda AA^\top)s + \tfrac{\lambda}{\gamma}A(2x^+ - z - \gamma\nabla f(x^+))\big)$$
$$z^+ = z + (2x^+ - z - \gamma\nabla f(x^+) - \gamma A^\top s^+) - x^+$$

PD3O

PD3O in terms of $z$:
$$z = x - \gamma\nabla f(x) - \gamma A^\top s$$
$$x^+ = (I + \gamma\partial g)^{-1}(z)$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big((I - \lambda AA^\top)s + \tfrac{\lambda}{\gamma}A(2x^+ - z - \gamma\nabla f(x^+))\big)$$
$$z^+ = z + (2x^+ - z - \gamma\nabla f(x^+) - \gamma A^\top s^+) - x^+$$

[Diagram on the slides: one iteration moves through the points $(z, s)$, $(x^+, s)$, $(2x^+ - z, s)$, $(2x^+ - z - \gamma\nabla f(x^+), s)$, $(2x^+ - z - \gamma\nabla f(x^+) - \gamma A^\top s^+, s^+)$, and $(z^+, s^+)$.]

PD3O vs Condat-Vu vs AFBA vs PDFP

Algorithms (common first two steps):
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}(s + \tfrac{\lambda}{\gamma}A\bar{x})$$
$$x^+ = (I + \gamma\partial g)^{-1}(x - \gamma\nabla f(x) - \gamma A^\top s^+)$$
followed by
PDFP: $\bar{x}^+ = (I + \gamma\partial g)^{-1}(x^+ - \gamma\nabla f(x^+) - \gamma A^\top s^+)$
Condat-Vu: $\bar{x}^+ = 2x^+ - x$
PD3O: $\bar{x}^+ = 2x^+ - x + \gamma\nabla f(x) - \gamma\nabla f(x^+)$

Parameters (and what each algorithm reduces to when $f = 0$ or $g = 0$):
PDFP: $\lambda\|AA^\top\| < 1$, $\gamma/(2\beta) < 1$; reduces to PAPC when $g = 0$
Condat-Vu: $\lambda\|AA^\top\| + \gamma/(2\beta) \le 1$; reduces to C-P when $f = 0$
AFBA: $\lambda\|AA^\top\|/2 + \sqrt{\lambda\|AA^\top\|}/2 + \gamma/(2\beta) \le 1$; reduces to PAPC when $g = 0$
PD3O: $\lambda\|AA^\top\| \le 1$, $\gamma/(2\beta) < 1$; reduces to C-P when $f = 0$ and to PAPC when $g = 0$

convergence results: summary

Rewrite the iteration in terms of $(z, s)$: let $z = x - \gamma\nabla f(x) - \gamma A^\top s$ and write $x$ for $x^+$; then
$$x = (I + \gamma\partial g)^{-1}(z)$$
$$s^+ = (I + \tfrac{\lambda}{\gamma}\partial h^*)^{-1}\big((I - \lambda AA^\top)s + \tfrac{\lambda}{\gamma}A(2x - z - \gamma\nabla f(x))\big)$$
$$z^+ = x - \gamma\nabla f(x) - \gamma A^\top s^+$$

$\|(z^{k+1}, s^{k+1}) - (z^k, s^k)\|_M^2 = o\big(\tfrac{1}{k+1}\big)$, and $(z^k, s^k)$ weakly converges to a fixed point $(z^*, s^*)$.

Let $L(x, s) = f(x) + g(x) + \langle Ax, s\rangle - h^*(s)$; then
$$L(\bar{x}^k, s) - L(x, \bar{s}^{k+1}) \le \frac{\|(z^1, s^1) - (z, s)\|^2}{k},$$
where $(\bar{x}^k, \bar{s}^{k+1}) = \frac{1}{k}\sum_{i=1}^k (x^i, s^{i+1})$ and $z = x - \gamma\nabla f(x) - \gamma A^\top s$.

Linear convergence holds with additional assumptions on $f$, $g$, and $h$.

convergence analysis: the general case

Let $M = \frac{\gamma^2}{\lambda}(I - \lambda AA^\top)$ be positive definite. Define $\|s\|_M = \sqrt{\langle s, Ms\rangle}$ and $\|(z, s)\|_M^2 = \|z\|^2 + \|s\|_M^2$.

Lemma. The iteration $T$ mapping $(z, s)$ to $(z^+, s^+)$ is a nonexpansive operator under the metric defined by $M$ if $\gamma \le 2\beta$. Furthermore, it is $\alpha$-averaged with $\alpha = \frac{2\beta}{4\beta - \gamma}$.

Chambolle-Pock is firmly nonexpansive under this new metric, which is different from the previous metric.

convergence analysis: the general case

Theorem.
1) Let $(z^*, s^*)$ be any fixed point of $T$. Then $\big(\|(z^k, s^k) - (z^*, s^*)\|_M\big)_{k \ge 0}$ is monotonically nonincreasing.
2) The sequence $\big(\|T(z^k, s^k) - (z^k, s^k)\|_M\big)_{k \ge 0}$ is monotonically nonincreasing and converges to 0.
3) We have the convergence rate $\|T(z^k, s^k) - (z^k, s^k)\|_M^2 = o\big(\tfrac{1}{k+1}\big)$.
4) $(z^k, s^k)$ weakly converges to a fixed point of $T$; if $X$ has finite dimension (e.g., $\mathbb{R}^m$), the convergence is strong.

convergence analysis: linear convergence

Denote:
$$u_h = \tfrac{\gamma}{\lambda}(I - \lambda AA^\top)s + A(2x - z - \gamma\nabla f(x)) - \tfrac{\gamma}{\lambda}s^+ \in \partial h^*(s^+), \qquad u_g = \tfrac{1}{\gamma}(z - x) \in \partial g(x),$$
$$u_h^* = Ax^* \in \partial h^*(s^*), \qquad u_g^* = \tfrac{1}{\gamma}(z^* - x^*) \in \partial g(x^*),$$
and assume
$$g(x) - g(y) \le L_g\|x - y\|, \quad \langle s^+ - s^*, u_h - u_h^*\rangle \ge \tau_h\|s^+ - s^*\|_M^2,$$
$$\langle x - x^*, u_g - u_g^*\rangle \ge \tau_g\|x - x^*\|^2, \quad \langle x - x^*, \nabla f(x) - \nabla f(x^*)\rangle \ge \tau_f\|x - x^*\|^2.$$

convergence analysis: linear convergence

Theorem. We have
$$\|z^+ - z^*\|^2 + (1 + 2\gamma\tau_h)\|s^+ - s^*\|_M^2 \le \rho\left(\|z - z^*\|^2 + (1 + 2\gamma\tau_h)\|s - s^*\|_M^2\right),$$
where
$$\rho = \max\left\{\frac{1}{1 + 2\gamma\tau_h},\; 1 - \frac{\big(2\gamma - \frac{\gamma^2}{\beta}\big)\tau_f + 2\gamma\tau_g}{1 + \gamma L_g}\right\}.$$
When, in addition, $\gamma < 2\beta$, $\tau_h > 0$, and $\tau_f + \tau_g > 0$, we have $\rho < 1$ and the algorithm converges linearly.

numerical experiment: fused lasso

$$\min_x\; \tfrac{1}{2}\|Ax - b\|^2 + \mu_1\|x\|_1 + \mu_2\sum_{i=1}^{p-1}|x_{i+1} - x_i|, \qquad x = (x_1, \dots, x_p) \in \mathbb{R}^p,\; A \in \mathbb{R}^{n \times p},\; b \in \mathbb{R}^n$$

Figure: The true sparse signal and the reconstructed results using PD3O, PDFP, and Condat-Vu. The right figure is a zoom-in of the signal in [3000, 5500].

numerical experiment: fused lasso

[Plots of the relative objective error $(f - f^*)/f^*$ versus iteration for PD3O, PDFP, and Condat-Vu under different stepsizes.]

Figure: In the left figure, we fix $\lambda = 1/8$ and let $\gamma = \beta, 1.5\beta, 1.9\beta$. In the right figure, we fix $\gamma = 1.9\beta$ and let $\lambda = 1/80, 1/8, 1/4$.

applications: decentralized optimization (recap)

$$\min_{\mathbf{x}}\; f(\mathbf{x}) + g(\mathbf{x}) + \iota_0\big((I - W)^{1/2}\mathbf{x}\big), \qquad f(\mathbf{x}) + g(\mathbf{x}) := \sum_{i=1}^n f_i(x_i) + g_i(x_i),$$

where $f_i$ and $g_i$ are known at node $i$ only, the nodes $1, \dots, n$ are connected in an undirected graph, each $f_i$ is differentiable with a Lipschitz continuous gradient, $x_i \in \mathbb{R}^p$ is the copy at node $i$, and $W$ is a symmetric doubly stochastic mixing matrix.

comparing PG-EXTRA and NIDS (Li-Shi-Yan 17)

$$\min_{\mathbf{x}}\; \frac{1}{2}\sum_{i=1}^n\|A_ix_i - b_i\|^2 + \mu_1\sum_{i=1}^n\|x_i\|_1 + \iota_0\big((I - W)^{1/2}\mathbf{x}\big)$$

[Plot: convergence of NIDS (stepsizes 1/L, 1.5/L, 1.9/L) and PG-EXTRA (stepsizes 1/L, 1.2/L, 1.3/L, 1.4/L) versus the number of iterations.]

conclusion

a new primal-dual algorithm (PD3O) for minimizing the sum of three functions.
a new interpretation of Chambolle-Pock: Douglas-Rachford splitting on the KKT system under a new metric induced by a block diagonal matrix.
PAPC is forward-backward splitting applied on the KKT system under the same metric; we proved the optimal bound for the parameters (dual stepsize).
PD3O is a generalization of both Chambolle-Pock and PAPC, and it has the advantages of Condat-Vu (a generalization of Chambolle-Pock) and of AFBA and PDFP (two generalizations of PAPC).
In decentralized consensus optimization, we derive a fast method (NIDS) whose stepsize does not depend on the network structure; we provide an optimal bound for the stepsize in PG-EXTRA (Shi et al. 15).

Thank You!

Paper 1: M. Yan, A new primal-dual method for minimizing the sum of three functions with a linear operator, arXiv:1611.09805.
Paper 2: Z. Li, W. Shi, and M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, arXiv preprint.
Paper 3: Z. Li and M. Yan, A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization, arXiv preprint.
