Proximal splitting methods on convex problems with a quadratic term: Relax!
The slides I presented, with added comments.
Laurent Condat, GIPSA-lab, Univ. Grenoble Alpes, France
Workshop BASP Frontiers, Jan. 2017
Proximal splitting algorithms
A zoo of methods in the literature: ADMM, ISTA, Douglas-Rachford, PDFB, Chambolle-Pock.
[PDFB: this is the primal-dual forward-backward algorithm I proposed, sometimes called the Condat-Vũ algorithm.]
1 / 39
Goal
We consider the problem

  Find x̄ ∈ Arg min Φ(x) = Σ_{m=1}^M g_m(L_m x)

where X and the U_m are real Hilbert spaces, the functions g_m are convex, from U_m to ℝ ∪ {+∞}, and the L_m are linear operators from X to U_m.
2 / 39
Goal
We consider the problem: Find x̄ ∈ Arg min Φ(x) = Σ_{m=1}^M g_m(L_m x).
We want full splitting, with individual activation of L_m, L_m*, and the gradient or proximity operator of g_m: no implicit operation (no inner loop or linear system to solve).
3 / 39
The subdifferential
∂f : X → 2^X. ∂f(x) is the set of gradients of the affine minorants of f at x. If f is smooth at x, then ∂f(x) = {∇f(x)}.
4 / 39
Fermat's rule
For the problem: Find x̄ ∈ Arg min Φ(x) = Σ_{m=1}^M g_m(L_m x),

  x̄ ∈ Arg min Φ  ⇔  0 ∈ ∂Φ(x̄),
  0 ∈ Σ_{m=1}^M L_m* ∂g_m(L_m x̄).

[Here some constraint qualifications are hidden.]
Pierre de Fermat, 1601–1665
5 / 39
The proximity operator
prox_f : X → X,  x ↦ arg min_{z ∈ X} f(z) + ½‖z − x‖².
Explicit forms of prox_f are known for a large class of functions. For instance, f(x) = |x| gives soft-thresholding: prox_f(x) = sgn(x) max(|x| − 1, 0).
6 / 39
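As a concrete check of the soft-thresholding formula above, here is a minimal NumPy sketch; the function name `prox_l1` and the scale parameter `gamma` are mine (the slide states the case γ = 1):

```python
import numpy as np

def prox_l1(x, gamma=1.0):
    """Proximity operator of gamma*|.|: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
```

Each entry is shrunk toward 0 by gamma, and entries smaller than gamma in magnitude are set exactly to 0, which is why this prox promotes sparsity.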
The proximity operator
prox_f : X → X,  x ↦ arg min_{z ∈ X} f(z) + ½‖z − x‖².
∀γ > 0,  prox_{γf} = (Id + γ∂f)^{−1}.
7 / 39
Iterative algorithms
Principle: design an algorithm which iterates x^(i+1) = T(x^(i)) for some operator T : X → X.
Convergence: ‖x^(i) − x̄‖ → 0, for some solution x̄.
8 / 39
Iterative algorithms
Principle: design an algorithm which iterates x^(i+1) = T(x^(i)) for some nonexpansive operator T : X → X, i.e. ∀x, y, ‖T(x) − T(y)‖ ≤ ‖x − y‖, and such that the set S of solutions is the set of fixed points of T, i.e. T(x̄) = x̄.
[Replace y by a fixed point and you get Fejér monotonicity: at every iteration you get closer to the solution set.]
9 / 39
Iterative algorithms
But nonexpansiveness is not sufficient: for instance, take X = ℝ² and T a rotation; the iterates x^(1), x^(2), x^(3), … circle around the fixed point x̄ forever. We need a bit more than nonexpansiveness.
["Firmly nonexpansive" = 1/2-averaged.]
10 / 39
Iterative algorithms
Definition: T is α-averaged if T = αT₀ + (1 − α)Id, for some 0 < α < 1 and nonexpansive operator T₀.
11 / 39
Iterative algorithms
Theorem (Krasnoselskii–Mann). If T is α-averaged, for some 0 < α < 1, and a fixed point of T exists, then the iteration x^(i+1) = T(x^(i)) converges to some fixed point x̄ of T.
12 / 39
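A small numerical illustration of the theorem, built on the rotation example from the previous slides (this sketch and its variable names are mine, not from the talk): a rotation of the plane is nonexpansive and its plain iterates circle forever, but its 1/2-averaged version converges to the fixed point 0.

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # nonexpansive, unique fixed point 0

def T(w, alpha=0.5):
    """1/2-averaged version of the rotation: T = alpha*R + (1-alpha)*Id."""
    return alpha * (R @ w) + (1 - alpha) * w

w = np.array([1.0, 0.0])
for _ in range(200):
    w = T(w)          # Krasnoselskii-Mann iteration: w tends to the fixed point 0
```

Iterating R itself would keep ‖w‖ = 1 forever; the averaged map contracts toward 0.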
Over-relaxation
An algorithm: w^(i+1) = T(w^(i)), with T α-averaged.
Its over-relaxed variant:
  w^(i+1/2) = T(w^(i))
  w^(i+1) = w^(i) + ρ(w^(i+1/2) − w^(i)),  with 1 < ρ < 1/α.
13 / 39

Over-relaxation
A firmly nonexpansive operator (α = 1/2) can be over-relaxed with 1 < ρ < 2.
14 / 39
Goal
Find x̄ ∈ Arg min Φ(x) = Σ_{m=1}^M g_m(L_m x), by iterating x^(i+1) = T(x^(i)) for an α-averaged operator T.
How to build T from the L_m, L_m*, and the gradient or proximity operator of g_m?
15 / 39
Forward-backward splitting
Find x̄ ∈ Arg min { h(x) + f(x) }, where h is smooth with β-Lipschitz continuous gradient. The forward-backward iteration
  x^(i+1) = prox_{γf}( x^(i) − γ∇h(x^(i)) )
converges to a solution x̄, if 0 < γ < 2/β.
16 / 39
Forward-backward splitting
Find x̄ ∈ Arg min { h(x) + f(x) }, where h is smooth with β-Lipschitz continuous gradient. The over-relaxed forward-backward iteration
  x^(i+1/2) = prox_{γf}( x^(i) − γ∇h(x^(i)) )
  x^(i+1) = x^(i) + ρ(x^(i+1/2) − x^(i))
converges to a solution x̄, if 0 < γ < 2/β and 1 ≤ ρ < 2 − γβ/2. For instance, γ = 1/β, ρ = 1.49.
17 / 39
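To make this concrete, here is a sketch (data and names mine, not from the talk) of the over-relaxed forward-backward iteration on a small LASSO problem, with h(x) = ½‖Ax − b‖², so that β = ‖A‖², and f = λ‖·‖₁:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 0.5

beta = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of grad h
gamma, rho = 1.0 / beta, 1.49          # as on the slide: gamma = 1/beta, rho = 1.49

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(10)
for _ in range(3000):
    x_half = prox_l1(x - gamma * A.T @ (A @ x - b), gamma * lam)  # forward-backward step
    x = x + rho * (x_half - x)                                    # over-relaxation

# at a solution, x is a fixed point of the forward-backward map
residual = np.linalg.norm(prox_l1(x - gamma * A.T @ (A @ x - b), gamma * lam) - x)
```

The fixed-point residual vanishing is exactly the optimality condition 0 ∈ ∇h(x̄) + ∂f(x̄).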
Minimization of 2 functions
Find x̄ ∈ Arg min { f(x) + g(x) }, where f and g have simple prox → calls to prox_f and prox_g.
18 / 39
Minimization of 2 functions
Find x̄ ∈ Arg min { f(x) + g(x) }
⇔ find x̄ ∈ X such that 0 ∈ ∂f(x̄) + ∂g(x̄)
⇔ find (x̄, ū) ∈ X × X such that −ū ∈ ∂f(x̄) and ū ∈ ∂g(x̄).
[The dual variable ū can just be viewed as an auxiliary variable.]
19 / 39
Minimization of 2 functions
Find x̄ ∈ Arg min { f(x) + g(x) }
⇔ find x̄ ∈ X such that 0 ∈ ∂f(x̄) + ∂g(x̄)
⇔ find (x̄, ū) ∈ X × X such that −ū ∈ ∂f(x̄) and x̄ ∈ (∂g)^{−1}(ū).
20 / 39
Minimization of 2 functions
Find x̄ ∈ Arg min { f(x) + g(x) }
⇔ find x̄ ∈ X such that 0 ∈ ∂f(x̄) + ∂g(x̄)
⇔ find (x̄, ū) ∈ X × X such that 0 ∈ ∂f(x̄) + ū and 0 ∈ (∂g)^{−1}(ū) − x̄.
21 / 39
Primal-dual monotone inclusion
Find x̄ ∈ Arg min { f(x) + g(x) }
⇔ find x̄ ∈ X such that 0 ∈ ∂f(x̄) + ∂g(x̄)
⇔ find (x̄, ū) ∈ X × X such that
  (0, 0) ∈ ( ∂f(x̄) + ū , −x̄ + (∂g)^{−1}(ū) ).
The operator (x, u) ↦ ( ∂f(x) + u , −x + (∂g)^{−1}(u) ) is maximally monotone.
22 / 39
The proximal point algorithm
To solve the monotone inclusion 0 ∈ M(w̄), the proximal point algorithm is:
  w^(i+1) = (P^{−1}M + Id)^{−1}(w^(i)),  i.e.  0 ∈ M(w^(i+1)) + P(w^(i+1) − w^(i)).
Convergence holds for any linear, self-adjoint, strongly positive P.
[With w = (x, u), be careful that convergence is on the pair (x, u) and with respect to the distorted norm induced by ⟨·, P·⟩. So the variable x alone does not get closer to the solution at every iteration; it can oscillate.]
23 / 39
Douglas-Rachford splitting
Design an algorithm such that, ∀i ∈ ℕ,
  (0, 0) ∈ ( ∂f(x^(i+1)) + u^(i+1) , −x^(i+1) + (∂g)^{−1}(u^(i+1)) ) + P( (x^(i+1), u^(i+1)) − (x^(i), u^(i)) ),
with P = ??
[We need to find this linear operator P to have a primal-dual PPA like on the previous slide.]
24 / 39
Douglas-Rachford algorithm
Design an algorithm such that, ∀i ∈ ℕ,
  (0, 0) ∈ ( ∂f(x^(i+1)) + u^(i+1) , −x^(i+1) + (∂g)^{−1}(u^(i+1)) ) + P( (x^(i+1), u^(i+1)) − (x^(i), u^(i)) ),
with
  P = [ (1/γ)Id  −Id ]
      [ −Id       γId ].
Douglas-Rachford iteration:
  x^(i+1) = prox_{γf}( x^(i) − γu^(i) )
  u^(i+1) = prox_{g*/γ}( u^(i) + (1/γ)(2x^(i+1) − x^(i)) )
[The linear operator P is positive, not strongly positive, but the convergence proofs can be extended to this case.]
25 / 39
Douglas-Rachford algorithm
Equivalent Douglas-Rachford iteration (with s^(i) = x^(i) − γu^(i)):
  x^(i+1) = prox_{γf}( s^(i) )
  z^(i+1) = prox_{γg}( 2x^(i+1) − s^(i) )
  s^(i+1) = s^(i) + z^(i+1) − x^(i+1)
26 / 39
Douglas-Rachford algorithm
Over-relaxed Douglas-Rachford iteration:
  x^(i+1) = prox_{γf}( s^(i) )
  z^(i+1) = prox_{γg}( 2x^(i+1) − s^(i) )
  s^(i+1) = s^(i) + ρ(z^(i+1) − x^(i+1))
Convergence with 1 ≤ ρ < 2. Take ρ = 1.9!
[Over-relaxation does not cost anything here!]
27 / 39
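As a sketch (my own toy example, not from the talk): over-relaxed Douglas-Rachford with ρ = 1.9 on min ½‖x − a‖² + λ‖x‖₁, whose exact minimizer is soft-thresholding of a, so the result can be checked:

```python
import numpy as np

a = np.array([2.0, -0.3, 1.4])
lam, gamma, rho = 1.0, 1.0, 1.9

def prox_f(v):   # f = 0.5*||x - a||^2
    return (v + gamma * a) / (1 + gamma)

def prox_g(v):   # g = lam*||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

s = np.zeros_like(a)
for _ in range(1000):
    x = prox_f(s)
    z = prox_g(2 * x - s)
    s = s + rho * (z - x)     # over-relaxed update, rho = 1.9

x_star = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)  # known minimizer
```

Both prox maps are cheap closed forms here, which is the whole point of splitting: the sum is never handled at once.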
Douglas-Rachford algorithm, version 2
Switching the update order → a different iteration:
  (0, 0) ∈ ( ∂f(x^(i+1)) + u^(i+1) , −x^(i+1) + (∂g)^{−1}(u^(i+1)) ) + P( (x^(i+1), u^(i+1)) − (x^(i), u^(i)) ),
with
  P = [ (1/γ)Id  +Id ]
      [ +Id       γId ].
[This amounts to switching the roles of the primal and dual variables/problems.]
28 / 39
ADMM
Switching the update order → a different iteration:
  u^(i+1) = prox_{g*/γ}( u^(i) + (1/γ)x^(i) )
  x^(i+1) = prox_{γf}( x^(i) − γ(2u^(i+1) − u^(i)) )
Equivalently:
  x^(i) = prox_{γf}( z^(i) − γu^(i) )
  z^(i+1) = prox_{γg}( x^(i) + γu^(i) )
  u^(i+1) = u^(i) + (1/γ)(x^(i) − z^(i+1))
[This is ADMM.]
29 / 39
ADMM
Equivalently, in Douglas-Rachford form:
  z^(i+1) = prox_{γg}( s^(i) )
  x^(i+1) = prox_{γf}( 2z^(i+1) − s^(i) )
  s^(i+1) = s^(i) + x^(i+1) − z^(i+1)
[This is Douglas-Rachford like on slide 26, just with f and g exchanged. Switching the roles of f and g in Douglas-Rachford.]
30 / 39
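The three-step ADMM form above can be sketched on the same kind of toy problem, min ½‖x − a‖² + λ‖x‖₁ (example mine; its closed-form solution lets us check the result):

```python
import numpy as np

a = np.array([2.0, -0.3, 1.4])
lam, gamma = 1.0, 1.0

prox_f = lambda v: (v + gamma * a) / (1 + gamma)                          # f = 0.5*||x-a||^2
prox_g = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # g = lam*||.||_1

z = np.zeros_like(a)
u = np.zeros_like(a)
for _ in range(1000):
    x = prox_f(z - gamma * u)
    z = prox_g(x + gamma * u)
    u = u + (x - z) / gamma      # scaled dual (multiplier) update

x_star = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
```

Both the x- and z-sequences converge to the same minimizer, as expected from the Douglas-Rachford equivalence.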
Generalized Douglas-Rachford
Find (x̄, z̄) ∈ Arg min_{(x,z)} { f(x) + g(z) }  s.t.  Lx + Kz = 0.
Douglas-Rachford iteration:
  x^(i+1) ∈ arg min_x γf(x) + ½‖Lx − s^(i)‖²
  z^(i+1) ∈ arg min_z γg(z) + ½‖Kz + (2Lx^(i+1) − s^(i))‖²
  s^(i+1) = s^(i) − ρ(Lx^(i+1) + Kz^(i+1))
Take ρ = 1.9!
[ADMM is usually expressed with 1 or 2 linear operators, L and K here. We can write it in this completely equivalent Douglas-Rachford form. The advantage is that we can easily over-relax!]
31 / 39
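A sketch of this generalized form (toy problem, data and names mine): min ½‖x − a‖² + λ‖z‖₁ subject to Lx − z = 0, i.e. K = −Id, so the x-step is a small linear solve and the z-step a soft-thresholding:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(5)
L = rng.standard_normal((4, 5))
K = -np.eye(4)
lam, gamma, rho = 0.3, 1.0, 1.9

s = np.zeros(4)
for _ in range(3000):
    # x-step: argmin_x gamma*(0.5*||x - a||^2) + 0.5*||L x - s||^2  -> linear system
    x = np.linalg.solve(gamma * np.eye(5) + L.T @ L, gamma * a + L.T @ s)
    # z-step: argmin_z gamma*lam*||z||_1 + 0.5*||K z + (2 L x - s)||^2 -> soft-thresholding
    v = 2 * L @ x - s
    z = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)
    s = s - rho * (L @ x + K @ z)   # over-relaxed update, rho = 1.9
```

At convergence the constraint Lx + Kz = 0, i.e. Lx = z, is satisfied; note that the x-step is an inner linear solve, exactly what the preconditioned algorithms later in the talk avoid.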
Chambolle-Pock algorithm
Find x̄ ∈ Arg min { f(x) + g(Lx) }.
Design an algorithm such that, ∀i ∈ ℕ,
  (0, 0) ∈ ( ∂f(x^(i+1)) + L*u^(i+1) , −Lx^(i+1) + (∂g)^{−1}(u^(i+1)) ) + P( (x^(i+1), u^(i+1)) − (x^(i), u^(i)) ),
with
  P = [ (1/τ)Id  −L*     ]
      [ −L       (1/σ)Id ].
[Once we understand that Douglas-Rachford is just a primal-dual PPA, there is no price to pay for introducing a linear operator L in the problem. It is the same as the L in the previous slide (with K = −Id), but this time we split. Said otherwise, the Chambolle-Pock algorithm is a preconditioned ADMM/Douglas-Rachford.]
32 / 39
Chambolle-Pock algorithm
Find x̄ ∈ Arg min { f(x) + g(Lx) }.
Chambolle-Pock algorithm:
  x^(i+1) = prox_{τf}( x^(i) − τL*u^(i) )
  u^(i+1) = prox_{σg*}( u^(i) + σL(2x^(i+1) − x^(i)) )
Convergence if στ‖L‖² ≤ 1; set σ = 1/(τ‖L‖²).
[This condition is proved in my paper (JOTA 2013). This way, there is only one parameter, τ, to tune, and we recover Douglas-Rachford exactly when L = Id (slide 25).]
33 / 39
Chambolle-Pock algorithm
Find x̄ ∈ Arg min { f(x) + g(Lx) }.
Over-relaxed Chambolle-Pock algorithm:
  x^(i+1/2) = prox_{τf}( x^(i) − τL*u^(i) )
  u^(i+1/2) = prox_{σg*}( u^(i) + σL(2x^(i+1/2) − x^(i)) )
  x^(i+1) = x^(i) + ρ(x^(i+1/2) − x^(i))
  u^(i+1) = u^(i) + ρ(u^(i+1/2) − u^(i))
Convergence with 1 ≤ ρ < 2. Take ρ = 1.9!
34 / 39
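A sketch of the over-relaxed iteration on a tiny 1D total-variation-type problem (example and names mine): f(x) = ½‖x − y‖², g = λ‖·‖₁, and L = D the finite-difference operator. Since g* is the indicator of the box [−λ, λ], prox_{σg*} is the projection onto that box:

```python
import numpy as np

y = np.array([1.0, 1.2, 0.9, 4.0, 4.1, 3.9])     # noisy piecewise-constant signal
lam = 0.5
n = y.size

D = np.diff(np.eye(n), axis=0)                    # finite differences, L = D
tau = 1.0
sigma = 1.0 / (tau * np.linalg.norm(D, 2) ** 2)   # sigma*tau*||L||^2 = 1, as on slide 33
rho = 1.9

prox_tf = lambda v: (v + tau * y) / (1 + tau)     # prox of tau*f, f = 0.5*||x - y||^2
prox_sg = lambda u: np.clip(u, -lam, lam)         # prox of sigma*g*: projection onto box

x, u = np.zeros(n), np.zeros(n - 1)
for _ in range(5000):
    x_half = prox_tf(x - tau * D.T @ u)
    u_half = prox_sg(u + sigma * D @ (2 * x_half - x))
    x = x + rho * (x_half - x)
    u = u + rho * (u_half - u)
```

Only D, D.T and the two proxes are ever applied: full splitting, no inner loop.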
Preconditioned Chambolle-Pock algo.
Find x̄ ∈ Arg min { ½‖Ax − y‖² + f(x) + g(Lx) }.
Primal-dual PPA:
  (0, 0) ∈ ( A*Ax^(i+1) − A*y + ∂f(x^(i+1)) + L*u^(i+1) , −Lx^(i+1) + ∂g*(u^(i+1)) ) + P( (x^(i+1), u^(i+1)) − (x^(i), u^(i)) ),
with
  P = [ (1/τ)Id − A*A  −L*     ]
      [ −L              (1/σ)Id ].
Convergence if τ < 1/‖A‖² and σ = (1/τ − ‖A‖²)/‖L‖².
35 / 39
Preconditioned Chambolle-Pock algo.
Find x̄ ∈ Arg min { ½‖Ax − y‖² + f(x) + g(Lx) }.
Preconditioned Chambolle-Pock algorithm:
  x^(i+1/2) = prox_{τf}( x^(i) − τL*u^(i) − τ(A*Ax^(i) − A*y) )
  u^(i+1/2) = prox_{σg*}( u^(i) + σL(2x^(i+1/2) − x^(i)) )
  x^(i+1) = x^(i) + ρ(x^(i+1/2) − x^(i))
  u^(i+1) = u^(i) + ρ(u^(i+1/2) − u^(i))
Convergence with 1 ≤ ρ < 2. Take ρ = 1.9!
[This is exactly the algorithm I proposed (JOTA 2013), and exactly to solve this kind of regularized inverse problems. But if you view it as a preconditioned Chambolle-Pock algorithm instead of a primal-dual forward-backward, you can over-relax more! With the conditions in my paper you cannot choose 1.9; this is specific to the smooth term being quadratic.]
36 / 39
Preconditioned PPA
Find x̄ ∈ Arg min { ½‖Ax − y‖² + f(x) }.
  x^(i+1/2) = prox_{τf}( x^(i) − τ(A*Ax^(i) − A*y) )
  x^(i+1) = x^(i) + ρ(x^(i+1/2) − x^(i))
Convergence if τ < 1/‖A‖² and 1 ≤ ρ < 2.
Instead of τ = 1.99/‖A‖², ρ = 1, try τ = 0.99/‖A‖², ρ = 1.9!
[Here also, the forward-backward can be viewed as just preconditioning the proximity operator. There is a continuum between these two regimes (proof unfinished).]
37 / 39
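The two parameter regimes can be compared on a small ½‖Ax − y‖² + λ‖x‖₁ problem (sketch and data mine). Both regimes are covered by the stated conditions, and both converge to the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 15))
y = rng.standard_normal(30)
lam = 1.0
nA2 = np.linalg.norm(A, 2) ** 2      # ||A||^2

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def run(tau, rho, iters=3000):
    x = np.zeros(15)
    for _ in range(iters):
        x_half = prox_l1(x - tau * (A.T @ (A @ x - y)), tau * lam)
        x = x + rho * (x_half - x)
    return x

x_classic = run(1.99 / nA2, 1.0)   # classical regime: large step, no relaxation
x_relaxed = run(0.99 / nA2, 1.9)   # "relax!" regime: smaller step, rho = 1.9
```

The slide's point is that the second regime is often the faster one in practice, which is easy to probe by lowering `iters` and watching which run settles first.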
Conclusion
You can and should over-relax the algorithms, even more when the smooth function is quadratic. RELAX!
38 / 39
Conclusion
You can and should over-relax the algorithms, even more when the smooth function is quadratic.
No time to explain:
- the product-space trick to generalize from g(Lx) to Σ_{m=1}^M g_m(L_m x),
- applications to other splitting methods.
Design an algorithm = play LEGO...
[See the two bonus slides for the Chambolle-Pock algorithm with M functions (one of the two forms; the other one is in my paper (JOTA 2013), Algorithm 5.2).]
39 / 39
The product-space trick
minimize Σ_{m=1}^M g_m(L_m x)  ⇔  minimize g(Lx)
with g : U₁ × ⋯ × U_M → ℝ ∪ {+∞}, (z₁, …, z_M) ↦ Σ_{m=1}^M g_m(z_m),
and L : X → U₁ × ⋯ × U_M, x ↦ (L₁x, …, L_Mx).
41 / 39
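A sketch of the trick with M = 2 blocks (shapes and names mine): the stacked operator is a vertical concatenation, and since g is separable across blocks its prox splits blockwise:

```python
import numpy as np

rng = np.random.default_rng(2)
L1 = rng.standard_normal((4, 6))   # L1 : X -> U1
L2 = rng.standard_normal((3, 6))   # L2 : X -> U2

L = np.vstack([L1, L2])            # L : X -> U1 x U2,  x -> (L1 x, L2 x)

def prox_g(z, t, n1=4):
    """Prox of g(z1, z2) = g1(z1) + g2(z2): apply each block's prox separately.
    Here g1 = g2 = ||.||_1 for illustration."""
    soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    return np.concatenate([soft(z[:n1]), soft(z[n1:])])

x = rng.standard_normal(6)
stacked = L @ x
blockwise = np.concatenate([L1 @ x, L2 @ x])   # same as the stacked application
```

This is exactly why the M-function problem reduces to the single-function form g(Lx): every application of L, L*, or prox_g decomposes into per-block operations.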
Chambolle-Pock algorithm
Minimize f(x) + Σ_{m=1}^M g_m(L_m x).
Over-relaxed Chambolle-Pock algorithm: iterate, for i ∈ ℕ,
  x^(i+1/2) := prox_{τf}( x^(i) − τ Σ_{m=1}^M L_m* u_m^(i) ),
  x^(i+1) := x^(i) + ρ(x^(i+1/2) − x^(i)),
  for m = 1, …, M:
    u_m^(i+1/2) := prox_{σg_m*}( u_m^(i) + σL_m(2x^(i+1/2) − x^(i)) ),
    u_m^(i+1) := u_m^(i) + ρ(u_m^(i+1/2) − u_m^(i)).
42 / 39