Local Linear Convergence Analysis of Primal Dual Splitting Methods


Jingwei Liang^a, Jalal Fadili^b and Gabriel Peyré^c
a DAMTP, University of Cambridge, UK; b Normandie Université, ENSICAEN, CNRS, GREYC, France; c CNRS, DMA, ENS Paris, France

ARTICLE HISTORY: Compiled January 8, 2018

ABSTRACT
In this paper, we study the local linear convergence properties of a versatile class of Primal-Dual splitting methods for minimizing composite non-smooth convex optimization problems. Under the assumption that the non-smooth components of the problem are partly smooth relative to smooth manifolds, we present a unified local convergence analysis framework for these methods. More precisely, within this framework we first show that (i) the sequences generated by Primal-Dual splitting methods identify a pair of primal and dual smooth manifolds in a finite number of iterations, and then (ii) enter a local linear convergence regime, which is characterized in terms of the structure of the underlying active smooth manifolds. We also show how our results for Primal-Dual splitting can be specialized to cover existing ones on Forward-Backward splitting and Douglas-Rachford splitting/ADMM (alternating direction method of multipliers). Moreover, based on the obtained local convergence analysis, several practical acceleration techniques are discussed. To exemplify the usefulness of the obtained results, we consider several concrete numerical experiments arising from fields including signal/image processing, inverse problems and machine learning. The experiments not only verify the local linear convergence behaviour of Primal-Dual splitting methods, but also illustrate how to accelerate them in practice.

KEYWORDS: Primal-Dual splitting, Forward-Backward splitting, Douglas-Rachford/ADMM, partial smoothness, local linear convergence

AMS CLASSIFICATION: 49J52, 65K05, 65K10

CONTACT: Jingwei Liang. Email: jl993@cam.ac.uk. Most of this work was done while Jingwei Liang was at Normandie Université, ENSICAEN, France.

1. Introduction

1.1. Composite optimization problem

In various fields such as inverse problems, signal and image processing, statistics and machine learning, many problems are (eventually) formulated as structured optimization problems (see Section 6 for some specific examples). A typical example of

these optimization problems, given in its primal form, reads

min_{x ∈ ℝ^n} R(x) + F(x) + (J ⊞ G)(Lx),   (P_P)

where (J ⊞ G)(·) := inf_{v ∈ ℝ^m} J(v) + G(· − v) denotes the infimal convolution of J and G. Throughout, we assume the following:

(A.1) R, F ∈ Γ_0(ℝ^n), with Γ_0(ℝ^n) being the class of proper convex and lower semi-continuous functions on ℝ^n, and ∇F is (1/β_F)-Lipschitz continuous for some β_F > 0.
(A.2) J, G ∈ Γ_0(ℝ^m), and G is β_G-strongly convex for some β_G > 0.
(A.3) L : ℝ^n → ℝ^m is a linear mapping.
(A.4) 0 ∈ ran(∂R + ∇F + L^*(∂J ⊞ ∂G)L), where ∂J ⊞ ∂G := (∂J^{-1} + ∂G^{-1})^{-1} is the parallel sum of the subdifferentials ∂J and ∂G, and ran(·) denotes the range of a set-valued operator. See Remark 3 for the reasoning behind this condition.

The main difficulties in solving such a problem are the non-smoothness of the objective, the presence of the linear operator L, and the infimal convolution.

Consider also the Fenchel-Rockafellar dual problem [41] of (P_P),

min_{v ∈ ℝ^m} J^*(v) + G^*(v) + (R + F)^*(−L^* v).   (P_D)

The classical Kuhn-Tucker theory asserts that a pair (x^⋆, v^⋆) ∈ ℝ^n × ℝ^m solves (P_P)-(P_D) if it satisfies the monotone inclusion

0 ∈ [ ∂R   L^* ; −L   ∂J^* ] (x^⋆, v^⋆) + [ ∇F   0 ; 0   ∇G^* ] (x^⋆, v^⋆).   (1)

One can observe that in (1) the composition of the linear operator and the infimal convolution is decoupled, hence providing the possibility to achieve full splitting. This is a key property used by all Primal-Dual algorithms we are about to review. In turn, solving (1) provides a pair of points that are solutions to (P_P) and (P_D) respectively. More complex forms of (P_P), involving for instance a sum of infimal convolutions, can be tackled in a similar way using a product space trick, as we will see in Section 5.

1.2. Primal-Dual splitting methods

Primal-Dual splitting methods for solving more or less complex variants of (P_P)-(P_D) have witnessed a recent wave of interest in the literature [12,13,16,18,21,28,48]. All these methods achieve full splitting: they involve the resolvents of ∂R and ∂J^*, the gradients of F and G^*, and the linear operator L, all applied separately at various points in the course of the iteration. For instance, building on the work of [3], the now-popular scheme proposed in [13] solves (P_P)-(P_D) with F = G = 0. The authors in [28] have shown that the Primal-Dual splitting method of [13] can be seen as a proximal point algorithm (PPA) in ℝ^n × ℝ^m endowed with a suitable norm. Exploiting the same idea, the author in [21] considered (P_P) with G = 0 and proposed an iterative scheme which can be interpreted as Forward-Backward (FB) splitting, again in an appropriately renormed space. This idea is further extended in [48] to solve more complex problems such as (P_P). A variable metric version was proposed in [19]. Motivated by the structure of (1), [12] and [18] proposed a Forward-Backward-Forward

scheme [46] to solve it. In this paper, we focus on the unrelaxed Primal-Dual splitting method summarized in Algorithm 1, which covers [13,16,21,28,48]. Though we omit the details here for brevity, our analysis and conclusions carry over to the method proposed in [12,18].

Algorithm 1: A Primal-Dual splitting method
Initial: Choose γ_R, γ_J > 0 and θ ∈ [−1, +∞[. For k = 0, x_0 ∈ ℝ^n, v_0 ∈ ℝ^m;
repeat

  x_{k+1} = prox_{γ_R R}( x_k − γ_R ∇F(x_k) − γ_R L^* v_k ),
  x̄_{k+1} = x_{k+1} + θ(x_{k+1} − x_k),
  v_{k+1} = prox_{γ_J J^*}( v_k − γ_J ∇G^*(v_k) + γ_J L x̄_{k+1} ),   (2)

  k = k + 1;
until convergence;

Remark 1.
(i) Algorithm 1 is somewhat an interesting extension to the literature given the choice of θ that we advocate. Indeed, the range [−1, +∞[ is larger than the one proposed in [28], which is [−1, 1]. It encompasses the Primal-Dual splitting method in [48] when θ = 1, and the one in [21] when moreover G = 0. When both F = G = 0, it reduces to the Primal-Dual splitting method proposed in [13,28].
(ii) It can also be verified that Algorithm 1 covers Forward-Backward (FB) splitting [37] (J = G = 0) and Douglas-Rachford (DR) splitting [24] (if F = G = 0, L = Id and γ_J = 1/γ_R) as special cases; see Section 4 for a discussion, or [13] and references therein for more details. Exploiting this relation, we build in Section 4 connections with the results provided in [34,35] for FB-type methods and DR/ADMM. It should also be noted that DR splitting is a limiting case of Primal-Dual splitting [13], and the global convergence result of Primal-Dual splitting does not apply to DR.
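To fix ideas, the following is a minimal numerical sketch of iteration (2) of Algorithm 1; it is ours and not part of the paper. The callables prox_gR, prox_gJs, grad_F and grad_Gs, standing for prox_{γ_R R}, prox_{γ_J J^*}, ∇F and ∇G^*, are assumptions supplied by the user for the problem at hand, and the step-sizes γ_R, γ_J are assumed to satisfy the convergence condition (13) below.

```python
import numpy as np

def primal_dual(x0, v0, L, prox_gR, prox_gJs, grad_F, grad_Gs,
                gR, gJ, theta=1.0, n_iter=500):
    """Sketch of iteration (2): primal step, extrapolation, dual step."""
    x, v = x0.copy(), v0.copy()
    for _ in range(n_iter):
        x_new = prox_gR(x - gR * grad_F(x) - gR * (L.T @ v))   # primal update
        x_bar = x_new + theta * (x_new - x)                    # extrapolation
        v = prox_gJs(v - gJ * grad_Gs(v) + gJ * (L @ x_bar))   # dual update
        x = x_new
    return x, v
```

For F = G = 0 one simply passes zero gradients, which recovers the scheme of [13].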

1.3. Contributions

In the literature, most studies on the convergence rate of Primal-Dual splitting methods focus on the global behaviour [9,10,13,14,22,33]. For instance, it is now known that the (partial) duality gap decreases sublinearly (pointwise or in the ergodic sense) at the rate O(1/k) [9,13]. This can be accelerated to O(1/k^2) on the iterate sequence under strong convexity of either the primal or the dual problem [10,13]. Linear convergence is achieved if both R and J^* are strongly convex [10,13]. However, in practice, local linear convergence of the sequence generated by Algorithm 1 has been observed for many problems in the absence of strong convexity (as confirmed by our numerical experiments in Section 6). None of the existing theoretical analyses was able to explain this behaviour so far. Providing the theoretical underpinnings of this local behaviour is the main goal pursued in this paper. Our main findings can be summarized as follows.

Finite time activity identification. For Algorithm 1, let (x^⋆, v^⋆) be a Kuhn-Tucker pair, i.e. a solution of (1). Under a non-degeneracy condition, and provided that both R and J^* are partly smooth relative to C^2-smooth manifolds, respectively M^R_{x^⋆} and M^{J^*}_{v^⋆} near x^⋆ and v^⋆ (see Definition 2.8), we show that the generated primal-dual sequence {(x_k, v_k)}_{k∈ℕ}, which converges to (x^⋆, v^⋆), identifies the manifold M^R_{x^⋆} × M^{J^*}_{v^⋆} in finite time (see Theorem 3.2). In plain words, this means that after a finite number of iterations, say K, we have x_k ∈ M^R_{x^⋆} and v_k ∈ M^{J^*}_{v^⋆} for all k ≥ K.

Local linear convergence. Capitalizing on this finite identification result, we first show in Proposition 3.4 that the globally non-linear iteration (2) locally linearizes along the identified smooth manifolds, and we then deduce that the convergence of the sequence becomes locally linear (see Theorem 3.7). The linear convergence rate is characterized precisely in terms of the properties of the identified partly smooth manifolds and the involved linear operator L. Moreover, when F = G = 0, L = Id and R, J^* are locally polyhedral around (x^⋆, v^⋆), we show that the convergence rate is governed by the cosine of the largest principal angle (see Definition 2.5) between the tangent spaces of the two manifolds at (x^⋆, v^⋆) (see Lemma 3.5). This builds a clear connection between the results of this paper and those we drew in our previous work on DR and ADMM [35,36].

1.4. Related work

For the past few years, increasing attention has been paid to the local linear convergence of first-order proximal splitting methods in the absence of strong convexity. This has been done, for instance, for FB-type splitting [2,11,26,29,45] and DR/ADMM [4,8,23] for special classes of objective functions. In our previous work [32,34-36], based on the powerful framework provided by partial smoothness, we unified all the above-mentioned work and provided even stronger claims. To the best of our knowledge, only one recent paper [44] investigated finite identification and local linear convergence of a Primal-Dual splitting method, and it does so for a very special instance of (P_P). More precisely, it assumes R to be a gauge, F = (1/2)‖·‖^2 (hence strong convexity of the primal problem), G = 0 and J the indicator function of a point. Our work goes much beyond this limited case. It also deepens our current understanding of the local behaviour of proximal splitting algorithms by complementing the picture we started in [34,35] for the FB and DR methods.

Paper organization. The rest of the paper is organized as follows. Some useful prerequisites, including partial smoothness, are collected in Section 2. The main contributions of this paper, i.e. finite time activity identification and local linear convergence of Primal-Dual splitting under partial smoothness, are the core of Section 3. Several discussions on the obtained results are delivered in Section 4. Section 5 extends the results to the case of more than one infimal convolution. In Section 6, we report various numerical experiments to support our theoretical findings.

2. Preliminaries

Throughout the paper, ℕ is the set of non-negative integers and ℝ^n is an n-dimensional real Euclidean space equipped with the scalar product ⟨·, ·⟩ and associated norm ‖·‖. Id_n denotes the identity operator on ℝ^n, where the subscript n is dropped if the dimension is clear from the context.

Sets. For a nonempty convex set C ⊂ ℝ^n, aff(C) denotes its affine hull, and par(C) the smallest subspace parallel to aff(C). ι_C denotes the indicator function of C, and P_C the orthogonal projection operator onto C.

Functions. The subdifferential of a function R ∈ Γ_0(ℝ^n) is the set-valued operator

∂R : ℝ^n ⇉ ℝ^n,  x ↦ { g ∈ ℝ^n : R(x') ≥ R(x) + ⟨g, x' − x⟩, ∀x' ∈ ℝ^n },   (3)

which is maximal monotone (see Definition 2.2). For R ∈ Γ_0(ℝ^n), the proximity operator of R is

prox_R(x) := argmin_{z ∈ ℝ^n} (1/2)‖z − x‖^2 + R(z).   (4)

Given a function J ∈ Γ_0(ℝ^m), its Legendre-Fenchel conjugate is defined as

J^*(y) = sup_{v ∈ ℝ^m} ⟨v, y⟩ − J(v).   (5)

Lemma 2.1 (Moreau identity [39]). Let J ∈ Γ_0(ℝ^m); then for any v ∈ ℝ^m and γ > 0,

v = prox_{γJ^*}(v) + γ prox_{J/γ}(v/γ).   (6)

Using the Moreau identity, it is straightforward to see that the update of v_k in Algorithm 1 can also be obtained from prox_{J/γ_J}.

Operators. Given a set-valued mapping A : ℝ^n ⇉ ℝ^n, its range is ran(A) = { y ∈ ℝ^n : ∃ x ∈ ℝ^n s.t. y ∈ A(x) }, and its graph is gph(A) := { (x, u) ∈ ℝ^n × ℝ^n : u ∈ A(x) }.

Definition 2.2 (Monotone operator). A set-valued mapping A : ℝ^n ⇉ ℝ^n is called monotone if

⟨x_1 − x_2, v_1 − v_2⟩ ≥ 0,  ∀(x_1, v_1) ∈ gph(A) and (x_2, v_2) ∈ gph(A).   (7)

It is moreover maximal monotone if gph(A) cannot be contained in the graph of any other monotone operator. Let β ∈ ]0, +∞[ and B : ℝ^n → ℝ^n; then B is β-cocoercive if

⟨B(x_1) − B(x_2), x_1 − x_2⟩ ≥ β ‖B(x_1) − B(x_2)‖^2,  ∀x_1, x_2 ∈ ℝ^n,   (8)

which implies that B is β^{-1}-Lipschitz continuous. For a maximal monotone operator A, (Id + A)^{-1} is called its resolvent. It is known that, for R ∈ Γ_0(ℝ^n), prox_{γR} = (Id + γ∂R)^{-1} [6, Example 23.3].

Definition 2.3 (Non-expansive operator). An operator F : ℝ^n → ℝ^n is non-expansive if

‖F(x_1) − F(x_2)‖ ≤ ‖x_1 − x_2‖,  ∀x_1, x_2 ∈ ℝ^n.

For any α ∈ ]0, 1[, F is called α-averaged if there exists a non-expansive operator F' such that F = αF' + (1 − α)Id. The class of α-averaged operators is closed under relaxation, convex combination and composition [6,20]. In particular, when α = 1/2, F is called firmly non-expansive.

Let S(ℝ^n) = { V ∈ ℝ^{n×n} : V^T = V } be the set of symmetric matrices acting on ℝ^n. The Loewner partial ordering on S(ℝ^n) is defined by

V_1 ⪰ V_2  ⟺  ⟨(V_1 − V_2)x, x⟩ ≥ 0, ∀x ∈ ℝ^n,

that is, V_1 − V_2 is positive semi-definite. Given any positive constant ν > 0, define

S_ν := { V ∈ S(ℝ^n) : V ⪰ νId },   (9)

i.e. the set of symmetric positive definite matrices whose eigenvalues are bounded below by ν. For any V ∈ S_ν, define the induced scalar product and norm

⟨x, x'⟩_V = ⟨x, Vx'⟩,  ‖x‖_V = √⟨x, Vx⟩,  ∀x, x' ∈ ℝ^n.

By endowing the Euclidean space ℝ^n with the above scalar product and norm, we obtain a Hilbert space which is denoted ℝ^n_V.

Lemma 2.4. Let A : ℝ^n ⇉ ℝ^n be maximal monotone, B : ℝ^n → ℝ^n be β-cocoercive, and V ∈ S_ν. Then for γ ∈ ]0, 2βν[:
(i) (Id + γV^{-1}A)^{-1} : ℝ^n_V → ℝ^n_V is firmly non-expansive;
(ii) (Id − γV^{-1}B) : ℝ^n_V → ℝ^n_V is γ/(2βν)-averaged non-expansive;
(iii) (Id + γV^{-1}A)^{-1}(Id − γV^{-1}B) : ℝ^n_V → ℝ^n_V is 2βν/(4βν − γ)-averaged non-expansive.

Proof. (i) See [19, Lemma 3.7(ii)]. (ii) Since B : ℝ^n → ℝ^n is β-cocoercive, for any x, x' ∈ ℝ^n we have

⟨x − x', V^{-1}B(x) − V^{-1}B(x')⟩_V ≥ β ‖B(x) − B(x')‖^2 = β ‖V( V^{-1}B(x) − V^{-1}B(x') )‖^2 = β ‖V^{1/2}( V^{-1}B(x) − V^{-1}B(x') )‖^2_V ≥ βν ‖V^{-1}B(x) − V^{-1}B(x')‖^2_V,

which means that V^{-1}B : ℝ^n_V → ℝ^n_V is (βν)-cocoercive. The rest of the proof follows [6, Proposition 4.33]. (iii) See [40, Theorem 3].
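As a side illustration (ours, not from the paper), the Moreau identity (6) lets one evaluate the proximity operator of a conjugate function from that of the function itself. The sketch below checks this numerically for J = ‖·‖_1, whose conjugate is the indicator of the ℓ_∞ unit ball, so that prox_{γJ^*} is the projection onto [−1, 1]^m for any γ; the helper names are ours.

```python
import numpy as np

def soft_threshold(w, t):
    # prox of t * ||.||_1
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_conjugate(w, gamma, prox_J):
    # Moreau identity (6): prox_{gamma J*}(w) = w - gamma * prox_{J/gamma}(w / gamma),
    # where prox_J(u, t) evaluates prox_{t J}(u).
    return w - gamma * prox_J(w / gamma, 1.0 / gamma)

w = np.random.randn(5)
gamma = 0.7
lhs = prox_conjugate(w, gamma, soft_threshold)
rhs = np.clip(w, -1.0, 1.0)      # projection onto the l_inf unit ball
print(np.allclose(lhs, rhs))     # True
```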

2.1. Angles between subspaces

Let T_1 and T_2 be two subspaces of ℝ^n. Without loss of generality, we assume 1 ≤ p := dim(T_1) ≤ q := dim(T_2) ≤ n − 1.

Definition 2.5 (Principal angles). The principal angles θ_k ∈ [0, π/2], k = 1, ..., p, between the subspaces T_1 and T_2 are defined by, with u_0 = v_0 = 0,

cos(θ_k) := ⟨u_k, v_k⟩ = max ⟨u, v⟩  s.t.  u ∈ T_1, v ∈ T_2, ‖u‖ = 1, ‖v‖ = 1, ⟨u, u_i⟩ = ⟨v, v_i⟩ = 0, i = 0, ..., k − 1.

The principal angles θ_k are unique and satisfy 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_p ≤ π/2.

Definition 2.6 (Friedrichs angle). The Friedrichs angle θ_F ∈ ]0, π/2] between the two subspaces T_1 and T_2 is defined by

cos( θ_F(T_1, T_2) ) := max ⟨u, v⟩  s.t.  u ∈ T_1 ∩ (T_1 ∩ T_2)^⊥, ‖u‖ = 1, v ∈ T_2 ∩ (T_1 ∩ T_2)^⊥, ‖v‖ = 1.

The following lemma, whose proof can be found in [5, Proposition 3.3], gives the relation between the Friedrichs and principal angles.

Lemma 2.7. The Friedrichs angle is exactly the (d + 1)-th principal angle, where d := dim(T_1 ∩ T_2). Moreover, θ_F(T_1, T_2) > 0.

Remark 2. The principal angles can be obtained from a singular value decomposition (SVD). For instance, let X ∈ ℝ^{n×p} and Y ∈ ℝ^{n×q} form orthonormal bases of the subspaces T_1 and T_2 respectively. Let UΣV^T be the SVD of X^T Y ∈ ℝ^{p×q}; then cos(θ_k) = σ_k, k = 1, 2, ..., p, where σ_k is the k-th largest singular value in Σ.

2.2. Partial smoothness

The concept of partial smoothness was first introduced in [31]. This concept, as well as that of identifiable surfaces [49], captures the essential features of the geometry of non-smoothness along the so-called active/identifiable manifold. Loosely speaking, a partly smooth function behaves smoothly as we move along the identifiable submanifold, and sharply if we move transversal to it.

Let M be a C^2-smooth embedded submanifold of ℝ^n around a point x. To lighten the notation, we simply write C^2-manifold in the following. The natural embedding of a submanifold M into ℝ^n permits to define a Riemannian structure on M, and we simply say that M is a Riemannian manifold. T_M(x') denotes the tangent space to M at any point x' near x in M. More material on manifolds is given in Section A. Below is the definition of partial smoothness for functions in Γ_0(ℝ^n).

Definition 2.8 (Partly smooth function). Let R ∈ Γ_0(ℝ^n) and x ∈ ℝ^n such that ∂R(x) ≠ ∅. R is said to be partly smooth at x relative to a set M containing x if
(i) Smoothness: M is a C^2-manifold around x, and R restricted to M is C^2 around x;
(ii) Sharpness: the tangent space T_M(x) coincides with T_x := par(∂R(x))^⊥;
(iii) Continuity: the set-valued mapping ∂R is continuous at x relative to M.
The class of functions that are partly smooth at x relative to M is denoted PSF_x(M).

Owing to the results of [31], it can be shown that, under transversality assumptions, the class of partly smooth functions is closed under addition and pre-composition by a linear operator. Popular examples of partly smooth functions are summarized in Section 6; their details can be found in [34].
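Following Remark 2 above, the principal angles of Definition 2.5 can be computed from an SVD; the short sketch below (ours, not from the paper) does exactly that for two subspaces given by spanning matrices.

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles (radians) between span(X) and span(Y): per Remark 2,
    cos(theta_k) are the singular values of Qx^T Qy for orthonormal bases Qx, Qy."""
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))   # increasing, since s is decreasing

T1, T2 = np.random.randn(10, 3), np.random.randn(10, 3)
print(principal_angles(T1, T2))
```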

The next lemma gives the expressions of the Riemannian gradient and Hessian (see Section A for definitions) of a partly smooth function.

Lemma 2.9. If R ∈ PSF_x(M), then for any x' ∈ M near x,

∇_M R(x') = P_{T_{x'}}( ∂R(x') ).

In turn, for all h ∈ T_{x'},

∇^2_M R(x') h = P_{T_{x'}} ∇^2 R̃(x') h + 𝔚_{x'}( h, P_{T_{x'}^⊥} ∇R̃(x') ),

where R̃ is any smooth extension (representative) of R on M, and 𝔚_{x'}(·, ·) : T_{x'} × T_{x'}^⊥ → T_{x'} is the Weingarten map of M at x'.

Proof. See [34, Fact 3.3].

3. Local linear convergence of Primal-Dual splitting methods

In this section, we present the main result of the paper, the local linear convergence analysis of Primal-Dual splitting methods. We start with the finite activity identification of the sequence (x_k, v_k) generated by the methods, from which we further show that the fixed-point iteration of Primal-Dual splitting can be locally linearized, and linear convergence then follows naturally.

3.1. Finite activity identification

Let us first recall the result from [17,48] that, under a proper renorming, Algorithm 1 can be written as a Forward-Backward iteration. Let θ = 1. From the definition of the proximity operator (4), we have that (2) is equivalent to

− ( ∇F(x_k), ∇G^*(v_k) ) ∈ [ ∂R   L^* ; −L   ∂J^* ] ( x_{k+1}, v_{k+1} ) + [ Id_n/γ_R   −L^* ; −L   Id_m/γ_J ] ( x_{k+1} − x_k, v_{k+1} − v_k ).   (10)

Let K = ℝ^n × ℝ^m be the product space, Id the identity operator on K, and define the following variable and operators:

z_k := ( x_k, v_k ),  A := [ ∂R   L^* ; −L   ∂J^* ],  B := [ ∇F   0 ; 0   ∇G^* ],  V := [ Id_n/γ_R   −L^* ; −L   Id_m/γ_J ].   (11)

It is easy to verify that A is maximal monotone [12] and that B is min{β_F, β_G}-cocoercive. For V, denote ν = (1 − √(γ_R γ_J ‖L‖^2)) min{1/γ_R, 1/γ_J}; then V is symmetric and ν-positive definite [19,48]. Define K_V the Hilbert space induced by V. Now (10) can be reformulated as

z_{k+1} = (V + A)^{-1}(V − B)(z_k) = (Id + V^{-1}A)^{-1}(Id − V^{-1}B)(z_k).   (12)

Clearly, (12) is the FB splitting on K_V [48]. When F = G = 0, it reduces to the metric PPA discussed in [13,28].
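As a quick numerical illustration (ours, not from the paper), the metric V of (11) can be formed explicitly for small problems, and one can verify that its smallest eigenvalue is indeed bounded below by ν whenever γ_R γ_J ‖L‖^2 < 1.

```python
import numpy as np

n, m = 8, 5
L = np.random.randn(m, n)
norm_L = np.linalg.norm(L, 2)
gR = gJ = 0.9 / norm_L                      # gR * gJ * ||L||^2 = 0.81 < 1

V = np.block([[np.eye(n) / gR, -L.T],
              [-L,             np.eye(m) / gJ]])
nu = (1 - np.sqrt(gR * gJ) * norm_L) * min(1 / gR, 1 / gJ)
print(np.linalg.eigvalsh(V).min() >= nu - 1e-10)   # True: V is nu-positive definite
```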

Before presenting the finite time activity identification under partial smoothness, we first recall the global convergence of Algorithm 1.

Lemma 3.1 (Convergence of Algorithm 1). Consider Algorithm 1 under assumptions (A.1)-(A.4). Let θ = 1 and choose γ_R, γ_J such that

2 min{β_F, β_G} min{1/γ_R, 1/γ_J} (1 − √(γ_R γ_J ‖L‖^2)) > 1;   (13)

then there exists a Kuhn-Tucker pair (x^⋆, v^⋆) such that x^⋆ solves (P_P), v^⋆ solves (P_D), and (x_k, v_k) → (x^⋆, v^⋆).

Proof. See [48, Corollary 4.2].

Remark 3.
(i) Assumption (A.4) is important to ensure the existence of Kuhn-Tucker pairs. There are sufficient conditions which ensure that (A.4) is satisfied. For instance, assuming that (P_P) has at least one solution and that some classical domain qualification condition holds (see e.g. [18, Proposition 4.3]), assumption (A.4) can be shown to be in force.
(ii) It is obvious from (13) that γ_R γ_J ‖L‖^2 < 1, which is also the condition needed in [13] for convergence in the special case F = G = 0. The convergence condition in [21] differs from (13); however, γ_R γ_J ‖L‖^2 < 1 remains a key condition. The values of γ_R, γ_J can also be made to vary along the iterations, and convergence of the iteration is preserved under the rule provided in [19]; for the sake of brevity, we omit the details of this case here.
(iii) Lemma 3.1 addresses the global convergence of the iterates of Algorithm 1 only for the case θ = 1. For the choices θ ∈ [−1, 1[ ∪ ]1, +∞[, the corresponding convergence of the iteration cannot so far be obtained directly, and a correction step as proposed in [28] for θ ∈ [−1, 1[ is needed so that the iteration is a contraction. Unfortunately, such a correction step leads to a new iterative scheme, different from (2); see [28] for more details. In a very recent paper [50], the authors also proved the convergence of the Primal-Dual splitting method of [48] for θ ∈ [−1, 1] with a proper modification of the iterates. Since the main focus of this work is to investigate the local convergence behaviour, the analysis of the global convergence of Algorithm 1 for arbitrary θ ∈ [−1, +∞[ is beyond the scope of this paper. We will therefore mainly consider the case θ = 1 in our analysis. Nevertheless, as we will see later, locally θ > 1 can give a faster convergence rate than θ ∈ [−1, 1] for certain problems. This points out a future direction of research for designing new Primal-Dual splitting methods.

Given a solution pair (x^⋆, v^⋆) of (P_P)-(P_D), we denote M^R_{x^⋆} and M^{J^*}_{v^⋆} the C^2-smooth manifolds containing x^⋆ and v^⋆ respectively, and T^R_{x^⋆}, T^{J^*}_{v^⋆} the tangent spaces of M^R_{x^⋆}, M^{J^*}_{v^⋆} at x^⋆ and v^⋆ respectively.

Theorem 3.2 (Finite activity identification). Consider Algorithm 1 under assumptions (A.1)-(A.4). Let θ = 1 and choose γ_R, γ_J according to Lemma 3.1. Thus (x_k, v_k) → (x^⋆, v^⋆), where (x^⋆, v^⋆) is a Kuhn-Tucker pair that solves (P_P)-(P_D). If moreover

R ∈ PSF_{x^⋆}(M^R_{x^⋆}) and J^* ∈ PSF_{v^⋆}(M^{J^*}_{v^⋆}), and the non-degeneracy condition

−L^* v^⋆ − ∇F(x^⋆) ∈ ri( ∂R(x^⋆) ),
L x^⋆ − ∇G^*(v^⋆) ∈ ri( ∂J^*(v^⋆) ),   (ND)

holds, then there exists a large enough K ∈ ℕ such that for all k ≥ K, (x_k, v_k) ∈ M^R_{x^⋆} × M^{J^*}_{v^⋆}. Moreover,
(i) if M^R_{x^⋆} = x^⋆ + T^R_{x^⋆}, then T^R_{x_k} = T^R_{x^⋆} and x_k ∈ M^R_{x^⋆} hold for k ≥ K;
(ii) if M^{J^*}_{v^⋆} = v^⋆ + T^{J^*}_{v^⋆}, then T^{J^*}_{v_k} = T^{J^*}_{v^⋆} holds for k ≥ K;
(iii) if R is locally polyhedral around x^⋆, then for all k ≥ K, x_k ∈ M^R_{x^⋆} = x^⋆ + T^R_{x^⋆}, T^R_{x_k} = T^R_{x^⋆}, ∇_{M^R_{x^⋆}} R(x_k) = ∇_{M^R_{x^⋆}} R(x^⋆), and ∇^2_{M^R_{x^⋆}} R(x_k) = 0;
(iv) if J^* is locally polyhedral around v^⋆, then for all k ≥ K, v_k ∈ M^{J^*}_{v^⋆} = v^⋆ + T^{J^*}_{v^⋆}, T^{J^*}_{v_k} = T^{J^*}_{v^⋆}, ∇_{M^{J^*}_{v^⋆}} J^*(v_k) = ∇_{M^{J^*}_{v^⋆}} J^*(v^⋆), and ∇^2_{M^{J^*}_{v^⋆}} J^*(v_k) = 0.

Remark 4.
(i) The non-degeneracy condition (ND) is a strengthened version of (1).
(ii) In general, we have no identification guarantees for x_k and v_k if the proximity operators are computed approximately, even if the approximation errors are summable, in which case one can still prove global convergence. The reason behind this is that, in the exact case and under condition (ND), the proximal mapping of the partly smooth function and that of its restriction to M^R_{x^⋆} locally agree near x^⋆ (and similarly for J^* and v^⋆). This property can easily be violated if approximate proximal mappings are involved; see [34] for an example.
(iii) Theorem 3.2 only states the existence of a K after which identification of the sequences happens, but provides no bound on K. In [34,35], lower bounds on K for the FB and DR methods are delivered. Though similar lower bounds can be obtained for the Primal-Dual splitting method, the proof is not a simple adaptation of that in [34], even though the Primal-Dual splitting method can be cast as FB splitting in an appropriate metric. Indeed, the estimate in [34] is intimately tied to the fact that FB was applied to an optimization problem, while in the Primal-Dual context FB is applied to a monotone inclusion. Since, moreover, these lower bounds are only of theoretical interest, we forgo the corresponding details here for the sake of conciseness.

Proof of Theorem 3.2. From the definition of the proximity operator (4) and the update of x_{k+1} in (2), we have

(1/γ_R)( x_k − γ_R ∇F(x_k) − γ_R L^* v_k − x_{k+1} ) ∈ ∂R(x_{k+1}),

which yields

dist( −L^* v^⋆ − ∇F(x^⋆), ∂R(x_{k+1}) )
  ≤ ‖ −L^* v^⋆ − ∇F(x^⋆) − (1/γ_R)( x_k − γ_R ∇F(x_k) − γ_R L^* v_k − x_{k+1} ) ‖
  ≤ (1/γ_R + 1/β_F) ‖x_k − x^⋆‖ + (1/γ_R) ‖x_{k+1} − x^⋆‖ + ‖L‖ ‖v_k − v^⋆‖ → 0.

Similarly, for v_{k+1} we have

(1/γ_J)( v_k − γ_J ∇G^*(v_k) + γ_J L x̄_{k+1} − v_{k+1} ) ∈ ∂J^*(v_{k+1}),

and

dist( L x^⋆ − ∇G^*(v^⋆), ∂J^*(v_{k+1}) )
  ≤ ‖ L x^⋆ − ∇G^*(v^⋆) − (1/γ_J)( v_k − γ_J ∇G^*(v_k) + γ_J L x̄_{k+1} − v_{k+1} ) ‖
  ≤ (1/γ_J + 1/β_G) ‖v_k − v^⋆‖ + (1/γ_J) ‖v_{k+1} − v^⋆‖ + ‖L‖( (1 + θ) ‖x_{k+1} − x^⋆‖ + θ ‖x_k − x^⋆‖ ) → 0.

By assumption, J^* ∈ Γ_0(ℝ^m) and R ∈ Γ_0(ℝ^n), hence they are sub-differentially continuous at every point of their respective domains [42, Example 13.30], and in particular at v^⋆ and x^⋆. It then follows that J^*(v_k) → J^*(v^⋆) and R(x_k) → R(x^⋆). Altogether with the non-degeneracy condition (ND), this shows that the conditions of [27, Theorem 5.3] are fulfilled for J^*(·) − ⟨·, L x^⋆ − ∇G^*(v^⋆)⟩ and R(·) + ⟨·, L^* v^⋆ + ∇F(x^⋆)⟩, and the finite identification claim follows.

(i) In this case, M^R_{x^⋆} is an affine subspace, and x_k ∈ M^R_{x^⋆} follows directly. Then, as R is partly smooth at x^⋆ relative to M^R_{x^⋆}, the sharpness property holds at all nearby points in M^R_{x^⋆} [31, Proposition 2.10]. Thus, for k large enough we indeed have T_{x_k}(M^R_{x^⋆}) = T^R_{x_k} = T^R_{x^⋆}, as claimed.
(ii) Similar to (i).
(iii) It is immediate to verify that a function which is locally polyhedral around x^⋆ is partly smooth relative to the affine subspace x^⋆ + T^R_{x^⋆}; the first claim therefore follows from (i). For the rest, it is sufficient to observe that, by polyhedrality, ∂R(x) = ∂R(x^⋆) for any x ∈ M^R_{x^⋆} near x^⋆. Therefore, combining local normal sharpness [31, Proposition 2.10] and Lemma A.2 yields the second conclusion.
(iv) Similar to (iii).

3.2. Locally linearized iteration

Relying on the identification result, we are now able to show that the globally non-linear fixed-point iteration (12) can be locally linearized along the manifolds M^R_{x^⋆} × M^{J^*}_{v^⋆}. As a result, the convergence rate of the iteration essentially boils down to analyzing the spectral properties of the matrix obtained in the linearization. Given a Kuhn-Tucker pair (x^⋆, v^⋆), define the following two functions

R̄(x) := R(x) + ⟨x, L^* v^⋆ + ∇F(x^⋆)⟩,  J̄^*(y) := J^*(y) − ⟨y, L x^⋆ − ∇G^*(v^⋆)⟩.   (14)

We have the following lemma.

(ii) M x Define, and MJv are affine subspaces. def W = (Id n + H ) 1 def and W J = (Id m + H J ) 1, (16) then both W and W J are firmly non-expansive. Proof. See [35, Lemma 6.1]. For the smooth functions F and G, in addition to (A.1) and (A.2), for the rest of the paper, we assume that (A.5) F and G locally are C 2 -smooth around x and v respectively. Now define the restricted Hessians of F and G, def H F = P T 2 F (x )P x T and H def x G = P T J 2 G (v )P v T J. (17) v def def Denote H F = Id n H F, H G = Id m H G, L def = P T J LP v T and x [ ] def W M PD = H F W L (1 + θ)w J LW H F θ W J L W J H G (1 + θ)w J LW L. (18) We have the following proposition. Proposition 3.4 (Local linearization). Suppose that Algorithm 1 is run under the identification conditions of Theorem 3.2, and moreover assumption (A.5) holds. Then for all k large enough, z k+1 z = M PD (z k z ) + o( z k z ). (19) emark 5. (i) For the case of varying (, ) along iteration, i.e. {(,k,,k)} k. According to the result of [34], (19) remains true if these parameters converge to some constants such that condition (13) still holds. (ii) Taking H G = Id m (i.e. G = 0) in (18), one gets the linearized iteration associated to the Primal Dual splitting method of [21]. If we further let H F = Id n, this will correspond to the linearized version of the method in [13]. Proof of Proposition 3.4. From the update of x k in (2), we have x k F (x k ) L v k x k+1 (x k+1 ), F (x ) L v (x ). Denote τk the parallel translation from T x k to Tx. Then project on to corresponding tangent spaces and apply parallel translation, τ k M x (x k+1) = τ k P T x x k+1(x k F (x k ) L v k x k+1 ) = P T (x x k F (x k ) L v k x k+1 ) + (τk P T x x k+1 P T )(x x k F (x k ) L v k x k+1 ), M (x ) = P x T ( x F (x ) γ L v ), 12

which leads to τk M (x k+1) γ x M (x ) x = P T ((x x k F (x k ) L v k x k+1 ) (x F (x ) L v x )) + (τk P T x x k+1 P T )( γ F x (x ) L v ) }{{} Term 1 + (τk P T x x k+1 P T )((x x k F (x k ) L v k x k+1 ) + ( F (x ) + L v )). }{{} Term 2 (20) Moving Term 1 to the other side leads to τk M (x k+1) x M (x ) (τ x k P T x x k+1 P T )( x F (x ) γ L v ) = τk ( M (x k+1) + (L v + F (x )) ) ( γ x M (x ) + (L v + F (x )) ) x = P T 2 x M (x )P x T (x x k+1 x ) + o( x k+1 x ), where Lemma A.2 is applied. Since x k+1 = prox γ (x k F (x k ) L v k ), prox γ is firmly non-expansive and Id n F is non-expansive (under the parameter setting), then (x k F (x k ) L v k x k+1 ) (x F (x ) L v x ) (Id n F )(x k ) (Id n F )(x ) + L v k L v x k x + L v k v. (21) Therefore, for Term 2, owing to Lemma A.1, we have (τ k P T x x k+1 P T x )((x k F (x k ) L v k x k+1 ) (x F (x ) L v x )) = o( x k x + L v k v ). Therefore, from (20), and apply x k x = P T x (x k x ) + o(x k x ) [32, Lemma 5.1] to (x k+1 x ) and (x k x ), we get (Id n + H )(x k+1 x ) + o( x k+1 x ) = (x k x ) P T x ( F (x k) F (x )) P T x L (v k v ) + o( x k x + L v k v ). Then apply Taylor expansion to F, and apply [32, Lemma 5.1] to (v k v ), (Id n + H )(x k+1 x ) = (Id n H F )(x k x ) L (v k v ) + o( x k x + L v k v ). (22) Then invert (Id n + H ) and apply [32, Lemma 5.1], we get x k+1 x = W H F (x k x ) W L (v k v ) + o( x k x + L v k v ). (23) 13

Now from the update of v k+1 v k G (v k ) + L x k+1 v k+1 J (v k+1 ), v G (v ) + Lx v J (v ). Denote τ J k+1 the parallel translation from T J v k+1 to T J v, then τ J k+1 M J v J (v k+1 ) = τ J k+1 P T J v k+1 (v k G (v k ) + L x k+1 v k+1 ) = P T J v (v k G (v k ) + L x k+1 v k+1 ) + (τ J k+1 P T J v k+1 M J v J (v ) = P T J v (v G (v ) + Lx v ) which leads to P T J v )(v k G (v k ) + L x k+1 v k+1 ), τk+1 J M J J (v k+1 ) γ v J M J J (v ) v = P T J ((v v k G (v k ) + L x k+1 v k+1 ) (v G (v ) + Lx v )) + (τk+1p J T J P v k+1 T J )(v v k G (v k ) + L x k+1 v k+1 ) = P T J ((v v k G (v k ) + L x k+1 v k+1 ) (v G (v ) + Lx v )) + (τk+1p J T J P v k+1 T J )(γ v J Lx G (v )) }{{} Term 3 + (τk+1p J T J P v k+1 T J )((v v k G (v k ) + L x k+1 v k+1 ) + ( G (v ) Lx )). }{{} Term 4 (24) Similarly to the previous analysis, for Term 3, move to the lefthand side of the inequality and apply Lemma A.2, τ J k+1 M J v J (v k+1 ) M J v J (v ) (τ J k+1p T J v k+1 P T J v )( Lx G (v )) = τk+1( J M J J (v k+1 ) (Lx G (v ))) γ v J ( M J J (v ) (Lx G (v ))) v = P T J 2 v M J (v )P J T v J (v v k+1 v ) + o( v k+1 v ). Since θ 1, we have x k+1 x (1 + θ) x k+1 x + θ x k x 2( x k x + L v k v ) + x k x = 3 x k x + 2 L v k v. Then for Term 4, since L 2 < 1, prox γj J G is non-expansive, we have is firmly non-expansive and Id m (τ J k+1p T J v k+1 P T J v )((v k G (v k ) + L x k+1 v k+1 ) (v G (v ) + Lx v )) = o( v k v + L x k x ). 14

Therefore, from (24), apply [32, Lemma 5.1] to (v k+1 v ) and (v k v ), we get (Id m + H J )(v k+1 v ) = (Id m H G )(v k v ) + L( x k+1 x ) + o( v k v + L x k x ). (25) Then similar to (23), we get from (25) v k+1 v = W J H G (v k v ) + W J L( x k+1 x ) + o( v k v + L x k x ) = W J H G (v k v ) + (1 + θ) W J L(x k+1 x ) θ W J L(x k x ) + o( v k v + L x k x ) = W J H G (v k v ) θ W J L(x k x ) + (1 + θ) W J L ( W H F (x k x ) W L (v k v ) ) + o( x k x + L v k v ) + o( v k v + L x k x ) = ( W J H G (1 + θ) W J LW L ) (v k v ) + ( (1 + θ) W J LW H F θ W J L ) (x k x ) + o( x k x + L v k v ) + o( v k v + L x k x ). (26) Now we consider the small o-terms. For the 2 small o-terms in (22) and (25). First, let a 1, a 2 be two constants, then we have a 1 + a 2 = ( a 1 + a 2 ) 2 2(a 2 1 + a 2 2) = 2 Denote b = max{1, L, L }, then ( a1 a 2 ). ( v k v + L x k x ) + ( x k x + L v k v ) 2b( x k x + v k v ) 2 ( ) 2b xk x v k v. Combining this with (23) and (26), and ignoring the constants of the small o-term leads to the claimed result. Now we need to study the spectral properties of M PD. Let p def = dim(tx def ), q = dim(t J v ) be the dimensions of the tangent spaces T x and T J v respectively, define Sx def = (T x ) and S J def v = (T J v ). Assume that q p (alternative situations are discussed in emark 6). Let L = XΣ L Y be the singular value decomposition of L, and define the rank as l def = rank(l). Clearly, we have l p. Denote M k the kth power of M PD PD. Lemma 3.5 (Convergence property of M PD ). The following holds for the matrix M PD : (i) If θ = 1, then there exists a finite matrix M to which M k converges, i.e. PD PD M PD def = lim k M k PD. (27) 15

Moreover, k N, M k PD M PD = (M PD M PD )k and ρ(m PD M PD ) < 1. Given any ρ ]ρ(m PD M ), 1[, there is K large enough such that for all k K, PD M k PD M PD = O(ρk ). (ii) If F = G = 0, and, J are locally polyhedral around (x, v ). Then given any θ ]0, 1], M PD is convergent with M PD = [ Y X ] 0 l Id n l 0 l Id m l [ Y X ]. (28) Moreover, all the eigenvalues of M PD M PD are complex with the spectral radius ρ(m PD M 1 ) = θγ PD σmin 2 < 1, (29) where σ min is the smallest non-zero singular value of L. emark 6. We discuss in short several possible cases of (28) when F = G = 0 and, J are locally polyhedral around (x, v ). (i) If L = Id, then L = P T J P v T and σ x min is the cosine value of the biggest principal angle (yet strictly smaller than π/2) between tangent spaces Tx and T J v. (ii) For the spectral radius formula in (29), let us consider the case of Arrow Hurwicz scheme [3], i.e. θ = 0. Let, J be locally polyhedral, and Σ L = (σ j ) {j=1,...,l} be the singular values of L, then the eigenvalues of M PD are ρ j = 2( 1 (2 γ σj 2 ) ± σj 2( σj 2 4)), j {1,..., l}, (30) which apparently are complex ( σ 2 j L 2 < 1). Moreover, ρ j = 1 2 (2 σj 2)2 σj 2( σj 2 4) = 1. This implies that M PD has multiple eigenvalues with absolute values all equal to 1, then owing to the result of [5], we have M PD is not convergent. Furthermore, for θ [ 1, 0[, we have 1 θ σmin 2 > 1 meaning that M PD is not convergent, this implies that the correction step proposed in [28] is necessary for θ [ 1, 0]. Discussion on θ > 1 is left to Section 4. Proof of Proposition 3.5. (i) When θ = 1, M PD becomes [ ] W M PD = H F W L 2 W J LW H F W J L W J H G 2 W J LW L (31) Next we show that M PD is averaged non-expansive. 16

First define the following matrices A = [ ] [ ] [ ] H / L HF 0 Idn /γ, B = L H J / 0 H, V = L, (32) G L Id m / where we have A is maximal monotone [12], B is min{β F, β G }-cocoercive, and V is ν-positive definite with ν = (1 L 2 ) min{ 1 γ, 1 γ }. J Now we have V + A = [ Idn+H γ 0 Id 2L m+h J ] [ (V + A) 1 = [ 1 ] γ H F L and V B = 1. As a result, we get L γ H G J [ (V + A) 1 γ (V B) = W 0 2 W J LW W J [ W = H F 2 W J LW H F W J L ] [ 1 H F W 0 2 W J LW W J L L 1 H G ] ], W L W J H G 2 W J LW L which is exactly (31). From Lemma 2.4 we know that M PD : K V K V is averaged non-expansive, hence it is convergent [6]. Then we have the induced matrix norm lim M k M PD PD V = lim M M PD PD k k k V = 0. Since we are in the finite dimensional space and V is an isomorphism, then the above limit implies that lim M M PD PD k k = 0, which means that ρ(m PD M ) < 1. The rest of the proof is classical using the PD spectral radius formula, see e.g. [5, Theorem 2.12(i)]. (ii) When and J are locally polyhedral, then W = Id n, W J = Id m, altogether with F = 0, G = 0, for any θ [0, 1], we have With the SVD of L, for M PD, we have [ Idn γ M PD = L ] L Id m (1 + θ)ll. (33) [ ] [ Idn γ M PD = L Y Y γ L Id m (1 + θ) LL = Y Σ ] L X XΣ L Y XX (1 + θ) XΣ 2 L [ ] [ X Y Idn γ = Σ ] [ ] L Y X Σ L Id m (1 + θ) Σ 2 X. L }{{} M Σ ], 17

Since we assume that rank(l) = l p, then Σ L can be represented as Σ L = [ ] Σl 0 l,n l, 0 m l,l 0 m l,n l where Σ l = (σ j ) j=1,...,l. Back to M Σ, we have Id l 0 l,n l Σ l 0 l,m l 0 M Σ = n l,l Id n l 0 n l,l 0 n l,m l Σ l 0 l,n l Id l (1 + θ) Σ 2 l 0. l,m l 0 m l,l 0 m l,n l 0 m l,l Id m l Let s study the eigenvalues of M Σ, M Σ ρid m+n (1 ρ)id l 0 l,n l Σ l 0 l,m l 0 = n l,l (1 ρ)id n l 0 n l,l 0 n l,m l γ J Σ l 0 l,n l (1 ρ)id l (1 + θ) Σ 2 l 0 l,m l 0 m l,l 0 m l,n l 0 m l,l (1 ρ)id m l [ ] = (1 ρ) m+n 2l (1 ρ)idl Σ l Σ l (1 ρ)id l (1 + θ) Σ 2. l Since ( Σ l )((1 ρ)id l ) = ((1 ρ)id l )( Σ l ), then by [43, Theorem 3], we have M Σ ρid a+b = (1 ρ) m+n 2l [ (1 ρ)idl Σ l Σ l (1 ρ)id l (1 + θ) Σ 2 l ] = (1 ρ) m+n 2l [ (1 ρ)((1 ρ)id l (1 + θ) Σ 2 l ) + Σ lσ l ] = (1 ρ) m+n 2l [ (1 ρ) 2 Id l (1 ρ)(1 + θ) Σ 2 l + Σ lσ l ] = (1 ρ) m+n 2l l j=1 (ρ2 (2 (1 + θ) σ 2 j )ρ + (1 θ σ 2 j )). For the eigenvalues ρ, clearly, except the 1 s, we have for j = 1,..., l ρ j = (2 (1 + θ) σ2 j ) ± (1 + θ) 2 γ 2 J γ2 σ4 j 4 σ2 j 2 Since σ 2 j L 2 < 1, then ρ j are complex and ρ j = 1 (2 ) (1 + θ)γj γ 2 σj 2 2 ( (1 + θ) 2 γ 2γ2σ4 J j 4γ ) γ J σj 2 = 1 θ σj 2 < 1.. As a result, we also obtain the M, which reads PD M PD = [ Y X ] 0 l Id n l 0 l Id m l [ Y X ]. 18

Corollary 3.6. Suppose that Algorithm 1 is run under the identification conditions of Theorem 3.2, and that moreover assumption (A.5) holds. Then the following holds:
(i) the linearized iteration (19) is equivalent to

(Id − M_PD^∞)(z_{k+1} − z^⋆) = M_PD (Id − M_PD^∞)(z_k − z^⋆) + o( ‖(Id − M_PD^∞)(z_k − z^⋆)‖ ).   (34)

(ii) If moreover R, J^* are locally polyhedral around (x^⋆, v^⋆), and F, G are quadratic, then M_PD^∞(z_k − z^⋆) = 0 for all k large enough, and (34) becomes

z_{k+1} − z^⋆ = (M_PD − M_PD^∞)(z_k − z^⋆).   (35)

Proof. See [35, Corollary 5.1].

3.3. Local linear convergence

Finally, we are able to present the local linear convergence result.

Theorem 3.7 (Local linear convergence). Suppose that Algorithm 1 is run under the identification conditions of Theorem 3.2, and that moreover assumption (A.5) holds. Then:
(i) given any ρ ∈ ]ρ(M_PD − M_PD^∞), 1[, there exists a K large enough such that for all k ≥ K,

‖(Id − M_PD^∞)(z_k − z^⋆)‖ = O(ρ^{k−K}).   (36)

(ii) If moreover R, J^* are locally polyhedral around (x^⋆, v^⋆), and F, G are quadratic, then there exists a K large enough such that for all k ≥ K, we have directly

‖z_k − z^⋆‖ = O(ρ^{k−K})   (37)

for any ρ ∈ [ρ(M_PD − M_PD^∞), 1[.

Proof. See [35, Theorem 5.1].

Remark 7.
(i) Similarly to Proposition 3.4 and Remark 5, the above result remains valid if (γ_R, γ_J) vary along the iterations yet converge. However, the local convergence rate of ‖z_k − z^⋆‖ then depends on how fast {(γ_{R,k}, γ_{J,k})}_k converge: if they converge only at a sublinear rate, then the convergence rate of ‖z_k − z^⋆‖ will eventually become sublinear. See [35, Section 8.3] for the case of the Douglas-Rachford splitting method.
(ii) When F = G = 0 and both R and J^* are locally polyhedral around (x^⋆, v^⋆), the convergence rate of the Primal-Dual splitting method is controlled by θ and γ_R γ_J, as shown in (29); see the upcoming section for a detailed discussion. In general situations (i.e. F, G^* non-trivial and R, J^* general partly smooth functions), the factors that contribute to the local convergence rate are much more involved, such as the Riemannian Hessians of R and J^*.

4. Discussions

In this part, we present several discussions of the obtained local linear convergence result, including the effect of θ ≥ 1, local oscillations, and connections with the FB and DR methods.

To make the discussion easier to deliver, for the rest of this section we focus on the case where F = G = 0, i.e. the Primal-Dual splitting method of [13], and moreover R, J^* are locally polyhedral around the Kuhn-Tucker pair (x^⋆, v^⋆). In this setting, the matrix defined in (18) becomes

M_PD := [ Id_n   −γ_R L̄^* ;  γ_J L̄   Id_m − (1 + θ) γ_R γ_J L̄ L̄^* ].   (38)

4.1. Choice of θ

Owing to Lemma 3.5, the matrix M_PD in (38) is convergent for θ ∈ ]0, 1], see Eq. (28), with spectral radius

ρ(M_PD − M_PD^∞) = √(1 − θ γ_R γ_J σ_min^2) < 1,   (39)

where σ_min is the smallest non-zero singular value of L̄. In general, given a solution pair (x^⋆, v^⋆), σ_min is fixed, hence the spectral radius ρ(M_PD − M_PD^∞) is controlled only by θ and the product γ_R γ_J. To make the local convergence rate as fast as possible, it is clear that we need to make θ γ_R γ_J as large as possible. Recall from the global convergence of the Primal-Dual splitting method, or from the result of [13], that γ_R γ_J ‖L‖^2 < 1. Denote σ_max the largest singular value of L̄. It is then straightforward that γ_R γ_J σ_max^2 ≤ γ_R γ_J ‖L‖^2 < 1, and moreover

ρ(M_PD − M_PD^∞) = √(1 − θ γ_R γ_J σ_min^2) > √(1 − θ (σ_min/‖L‖)^2) ≥ √(1 − θ (σ_min/σ_max)^2).   (40)

If we define cnd := σ_max/σ_min, the condition number of L̄, then we have ρ(M_PD − M_PD^∞) > √(1 − θ (1/cnd)^2). To this end, it is clear that θ = 1 gives the best convergence rate among θ ∈ [−1, 1].

Next let us look at what happens locally if we choose θ > 1. The spectral radius formula (39) implies that a bigger value of θ yields a smaller spectral radius ρ(M_PD − M_PD^∞). Therefore, locally we should choose θ as big as possible. However, there is an upper bound on θ, which is discussed below. Following Remark 6, let Σ_L̄ = (σ_j)_{j=1,...,l} be the singular values of L̄, and let ρ_j be the eigenvalues of M_PD − M_PD^∞; we have seen that the ρ_j are complex with

ρ_j = (1/2)( (2 − (1 + θ) γ_R γ_J σ_j^2) ± √( (1 + θ)^2 γ_R^2 γ_J^2 σ_j^4 − 4 γ_R γ_J σ_j^2 ) ),  |ρ_j| = √(1 − θ γ_R γ_J σ_j^2).

Now let θ > 1. To ensure that |ρ_j| makes sense for all j ∈ {1, ..., l}, there must hold

1 − θ γ_R γ_J σ_max^2 ≥ 0  ⟺  θ ≤ 1/(γ_R γ_J σ_max^2),

which means that θ is indeed bounded from above. Unfortunately, since L̄ = P_{T^{J^*}_{v^⋆}} L P_{T^R_{x^⋆}}, this upper bound could only be evaluated if we had the solution pair (x^⋆, v^⋆) at hand. However, in practice one can use backtracking or an Armijo-Goldstein-type rule to find a proper θ; see Section 6.4 for an illustration of the online search of θ. It should be noted that such an updating rule can also be applied to γ_R γ_J, since ‖L̄‖ ≤ ‖L‖. Moreover, in practice one can choose to enlarge either θ or γ_R γ_J, as they have very similar acceleration outcomes.
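The following sketch (ours, not from the paper) checks the rate formula (39) and the above upper bound on θ numerically: it builds M_PD from (38) for a synthetic L̄ with prescribed singular values, approximates M_PD^∞ by a large matrix power, and compares the numerical spectral radius with √(1 − θ γ_R γ_J σ_min²).

```python
import numpy as np

n, m, theta = 4, 3, 1.0
gR = gJ = 1.0
U, _ = np.linalg.qr(np.random.randn(m, m))
W, _ = np.linalg.qr(np.random.randn(n, n))
sv = np.array([0.9, 0.7, 0.5])                    # singular values of L_bar
L_bar = U @ np.diag(sv) @ W[:m, :]                # gR*gJ*||L_bar||^2 = 0.81 < 1

M = np.block([[np.eye(n),   -gR * L_bar.T],
              [gJ * L_bar,  np.eye(m) - (1 + theta) * gR * gJ * L_bar @ L_bar.T]])
M_inf = np.linalg.matrix_power(M, 400)            # M^k -> M_PD^inf (Lemma 3.5)
rho_num = np.abs(np.linalg.eigvals(M - M_inf)).max()
rho_thy = np.sqrt(1 - theta * gR * gJ * sv.min() ** 2)
print(rho_num, rho_thy)                           # both close to 0.866
print(1 / (gR * gJ * sv.max() ** 2))              # upper bound on theta, ~1.23 here
```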

Remark 8. It should be noted that the above discussion on the effect of θ > 1 may only be valid for the case F = 0, G = 0, i.e. the Primal-Dual splitting method of [13]. If F and/or G do not vanish, then locally θ < 1 may give a faster convergence rate.

4.2. Oscillations

For the inertial Forward-Backward and FISTA [7] methods, it is shown in [34] that they oscillate locally when the inertial momentum is too high (see [34, Section 4.4] for more details). When solving certain types of problems (i.e. F = G = 0 and R, J^* locally polyhedral around the solution pair (x^⋆, v^⋆)), the Primal-Dual splitting method also oscillates locally (see Figure 6 for an illustration). As revealed in the proof of Lemma 3.5, all the eigenvalues of M_PD − M_PD^∞ in (38) are complex. This means that locally the sequences generated by the Primal-Dual splitting iteration may oscillate. For σ_min, the smallest non-zero singular value of L̄, one of the corresponding eigenvalues of M_PD reads

ρ_{σ_min} = (1/2)( (2 − (1 + θ) γ_R γ_J σ_min^2) + √( (1 + θ)^2 γ_R^2 γ_J^2 σ_min^4 − 4 γ_R γ_J σ_min^2 ) ),

and (1 + θ)^2 γ_R^2 γ_J^2 σ_min^4 − 4 γ_R γ_J σ_min^2 < 0. Denote ω the argument of ρ_{σ_min}; then

cos(ω) = (2 − (1 + θ) γ_R γ_J σ_min^2) / (2 √(1 − θ γ_R γ_J σ_min^2)).   (41)

The oscillation period of the sequence ‖z_k − z^⋆‖ is then exactly π/ω. See Figure 6 for an illustration.

4.3. Relations with FB and DR/ADMM

In this part, we discuss the relation between the obtained results and our previous work on the local linear convergence of Forward-Backward [32,34] and Douglas-Rachford/ADMM [35,36].

4.3.1. Forward-Backward splitting

For problem (P_P), when J = G = 0, Algorithm 1 reduces to (denoting γ = γ_R and β = β_F)

x_{k+1} = prox_{γR}( x_k − γ ∇F(x_k) ),  γ ∈ ]0, 2β[,   (42)

which is the non-relaxed FB splitting [37] with constant step-size. Let x^⋆ ∈ Argmin(R + F) be a global minimizer to which {x_k}_{k∈ℕ} generated by (42) converges; the non-degeneracy condition (ND) for identification then becomes −∇F(x^⋆) ∈

ri( ∂R(x^⋆) ), which recovers the condition of [34, Theorem 4.11]. Following the notation of Section 3, define M_FB := W_R (Id_n − γ H_F); we then have, for all k large enough,

x_{k+1} − x^⋆ = M_FB (x_k − x^⋆) + o(‖x_k − x^⋆‖).

From Theorem 3.7, we obtain the following result for the FB splitting method in the case where γ is fixed. Denote M^R_{x^⋆} the manifold that x^⋆ lives in.

Corollary 4.1. For problem (P_P), let J = G = 0, suppose that (A.1) holds and Argmin(R + F) ≠ ∅, and that the FB iteration (42) generates a sequence x_k → x^⋆ ∈ Argmin(R + F) such that R ∈ PSF_{x^⋆}(M^R_{x^⋆}), F is C^2 near x^⋆, and the condition −∇F(x^⋆) ∈ ri( ∂R(x^⋆) ) holds. Then
(i) given any ρ ∈ ]ρ(M_FB − M_FB^∞), 1[, there exists a K large enough such that for all k ≥ K,

‖(Id − M_FB^∞)(x_k − x^⋆)‖ = O(ρ^{k−K}).   (43)

(ii) If moreover R is locally polyhedral around x^⋆, there exists a K large enough such that for all k ≥ K, we have directly

‖x_k − x^⋆‖ = O(ρ^{k−K})   (44)

for ρ ∈ [ρ(M_FB − M_FB^∞), 1[.

Proof. Owing to [6], M_FB is 2β/(4β − γ)-averaged non-expansive, hence convergent. The convergence rates in (43) and (44) are straightforward from Theorem 3.7.

The results in [32,34] required a so-called restricted injectivity (RI) condition, which means that H_F should be positive definite. Moreover, this RI condition was removed only when the non-smooth function is locally polyhedral around x^⋆ (e.g. see [34, Theorem 4.9]). Here, we show that neither the RI condition nor polyhedrality is needed to obtain local linear convergence. As such, the analysis of this paper generalizes that of [34] even for FB splitting. However, the price of removing those conditions is that the obtained convergence rate applies to a different criterion (i.e. ‖(Id − M_FB^∞)(x_k − x^⋆)‖) rather than to the sequence itself directly (i.e. ‖x_k − x^⋆‖).

4.3.2. Douglas-Rachford splitting and ADMM

Let F = G = 0 and L = Id; then problem (P_P) becomes

min_{x ∈ ℝ^n} R(x) + J(x).

For the above problem, we briefly show below that DR splitting is the limiting case of Primal-Dual splitting obtained by letting γ_R γ_J = 1. First, for the Primal-Dual splitting scheme of Algorithm 1, let θ = 1 and change the order of updating the variables; we obtain the following iteration

v_{k+1} = prox_{γ_J J^*}( v_k + γ_J x̄_k ),
x_{k+1} = prox_{γ_R R}( x_k − γ_R v_{k+1} ),
x̄_{k+1} = 2 x_{k+1} − x_k.   (45)

Then, applying the Moreau identity (6) to prox_{γ_J J^*}, letting γ_J = 1/γ_R and defining z_{k+1} := x_k − γ_R v_{k+1}, iteration (45) becomes

u_{k+1} = prox_{γ_R J}( 2 x_k − z_k ),
z_{k+1} = z_k + u_{k+1} − x_k,
x_{k+1} = prox_{γ_R R}( z_{k+1} ),   (46)

which is the non-relaxed DR splitting method [24]. At convergence, we have u_k, x_k → x^⋆ = prox_{γ_R R}(z^⋆), where z^⋆ is a fixed point of the iteration. See also [13, Section 4.2].
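Before linearizing the two schemes, here is a numerical sanity check of this equivalence on a toy instance (ours, not from the paper): R = (1/2)‖· − b‖², J = ‖·‖_1, L = Id, F = G = 0, θ = 1 and γ_J = 1/γ_R, with the matched initialization z_0 = x_0 − γ_R v_0 coming from the change of variables above.

```python
import numpy as np

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

b = np.random.randn(6)
gR = 0.5
gJ = 1.0 / gR                                    # the DR limit: gR * gJ = 1
prox_R  = lambda u: (u + gR * b) / (1.0 + gR)    # prox of gR * (1/2)||. - b||^2
prox_Js = lambda u: np.clip(u, -1.0, 1.0)        # prox of gJ * J^*, J^* = indicator of the l_inf ball

x_pd = np.zeros(6); v = np.zeros(6); x_bar = x_pd.copy()   # Primal-Dual (45)
x_dr = x_pd.copy(); z = x_pd - gR * v                      # DR (46), matched start

for _ in range(30):
    v = prox_Js(v + gJ * x_bar)                  # (45)
    x_new = prox_R(x_pd - gR * v)
    x_bar = 2 * x_new - x_pd
    x_pd = x_new

    u = soft_threshold(2 * x_dr - z, gR)         # (46): prox of gR * J
    z = z + u - x_dr
    x_dr = prox_R(z)

print(np.allclose(x_pd, x_dr))                   # True: the two schemes produce the same x_k
```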

Specializing the derivation of (18) to (45) and (46), we obtain the following two linearized fixed-point operators for (45) and (46) respectively:

M_PD = [ Id_n   −γ_R P_{T^R_{x^⋆}} P_{T^{J^*}_{v^⋆}} ;  γ_J P_{T^{J^*}_{v^⋆}} P_{T^R_{x^⋆}}   Id_n − 2 γ_R γ_J P_{T^{J^*}_{v^⋆}} P_{T^R_{x^⋆}} P_{T^{J^*}_{v^⋆}} ],

M_DR = [ Id_n   −γ_R P_{T^R_{x^⋆}} P_{T^{J^*}_{v^⋆}} ;  (1/γ_R) P_{T^{J^*}_{v^⋆}} P_{T^R_{x^⋆}}   Id_n − 2 P_{T^{J^*}_{v^⋆}} P_{T^R_{x^⋆}} P_{T^{J^*}_{v^⋆}} ].

Owing to (ii) of Lemma 3.5, M_PD and M_DR are convergent. Let ω be the largest principal angle (yet strictly smaller than π/2) between the tangent spaces T^R_{x^⋆} and T^{J^*}_{v^⋆}; then the spectral radius of M_PD − M_PD^∞ reads ((i) of Remark 6)

ρ(M_PD − M_PD^∞) = √(1 − γ_R γ_J cos^2(ω)) ≥ √(1 − cos^2(ω)) = sin(ω) = cos(π/2 − ω).   (47)

Suppose that the Kuhn-Tucker pair (x^⋆, v^⋆) is unique, and moreover that R and J are polyhedral. Then, if J^* is locally polyhedral near v^⋆ along v^⋆ + T^{J^*}_{v^⋆}, J is locally polyhedral near x^⋆ along x^⋆ + T^J_{x^⋆}, and moreover there holds T^J_{x^⋆} = (T^{J^*}_{v^⋆})^⊥. As a result, the principal angles between T^R_{x^⋆}, T^{J^*}_{v^⋆} and those between T^R_{x^⋆}, T^J_{x^⋆} are complementary, which means that π/2 − ω is the Friedrichs angle between the tangent spaces T^R_{x^⋆} and T^J_{x^⋆}. Thus, following (47), we have

ρ(M_PD − M_PD^∞) = √(1 − γ_R γ_J cos^2(ω)) ≥ cos(π/2 − ω) = ρ(M_DR − M_DR^∞).

We emphasize that such a connection can be drawn only in the polyhedral case, which justifies the different analysis carried out in [35,36]. In addition, for DR we were able to characterize situations where finite convergence provably occurs, while this is not (yet) the case for Primal-Dual splitting, even when R and J are locally polyhedral around (x^⋆, v^⋆) and F = G = 0 but L is non-trivial.

5. Multiple infimal convolutions

In this section, we consider problem (P_P) with more than one infimal convolution. Let m ≥ 1 be a positive integer, and consider the problem of solving

min_{x ∈ ℝ^n} R(x) + F(x) + Σ_{i=1}^m (J_i ⊞ G_i)(L_i x),   (P_P^m)

where (A.1) holds for R and F, and for every i = 1, ..., m the following hold:
(A.2) J_i, G_i ∈ Γ_0(ℝ^{m_i}), with G_i being differentiable and β_{G,i}-strongly convex for some β_{G,i} > 0.
(A.3) L_i : ℝ^n → ℝ^{m_i} is a linear operator.
(A.4) The condition 0 ∈ ran( ∂R + ∇F + Σ_{i=1}^m L_i^* (∂J_i ⊞ ∂G_i) L_i ) holds.

The dual problem of (P_P^m) reads

min_{v_1 ∈ ℝ^{m_1}, ..., v_m ∈ ℝ^{m_m}} (R + F)^*( −Σ_{i=1}^m L_i^* v_i ) + Σ_{i=1}^m ( J_i^*(v_i) + G_i^*(v_i) ).   (P_D^m)

Problem (P_P^m) is considered in [19,48], and a Primal-Dual splitting algorithm is proposed there which is an extension of Algorithm 1 using a product space trick; see Algorithm 2 hereafter for details. In both schemes of [19,48], θ is set to 1.

Algorithm 2: A Primal-Dual splitting method
Initial: Choose γ_R, (γ_{J_i})_i > 0. For k = 0, x_0 ∈ ℝ^n, v_{i,0} ∈ ℝ^{m_i}, i ∈ {1, ..., m};
repeat

  x_{k+1} = prox_{γ_R R}( x_k − γ_R ∇F(x_k) − γ_R Σ_i L_i^* v_{i,k} ),
  x̄_{k+1} = 2 x_{k+1} − x_k,
  For i = 1, ..., m:
    v_{i,k+1} = prox_{γ_{J_i} J_i^*}( v_{i,k} − γ_{J_i} ∇G_i^*(v_{i,k}) + γ_{J_i} L_i x̄_{k+1} ),   (48)

  k = k + 1;
until convergence;

5.1. Product space

The following result is taken from [19]. Define the product space K := ℝ^n × ℝ^{m_1} × ... × ℝ^{m_m}, and let Id be the identity operator on K. Define the following operators:

A := [ ∂R   L_1^*   ...   L_m^* ; −L_1   ∂J_1^*   ...   0 ; ...  ; −L_m   0   ...   ∂J_m^* ],
B := [ ∇F   0   ...   0 ; 0   ∇G_1^*   ...   0 ; ... ; 0   0   ...   ∇G_m^* ],
V := [ Id_n/γ_R   −L_1^*   ...   −L_m^* ; −L_1   Id_{m_1}/γ_{J_1}   ...   0 ; ... ; −L_m   0   ...   Id_{m_m}/γ_{J_m} ].   (49)

Then A is maximal monotone, B is min{β_F, β_{G,1}, ..., β_{G,m}}-cocoercive, and V is symmetric and ν-positive definite with ν = (1 − √(Σ_i γ_R γ_{J_i} ‖L_i‖^2)) min{1/γ_R, 1/γ_{J_1}, ..., 1/γ_{J_m}}. Define z_k = (x_k, v_{1,k}, ..., v_{m,k})^T; then it can be shown that (48) is equivalent to

z_{k+1} = (V + A)^{-1}(V − B) z_k = (Id + V^{-1}A)^{-1}(Id − V^{-1}B) z_k.   (50)
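A minimal numerical sketch of iteration (48) of Algorithm 2 is given below (ours, not part of the paper); as in the sketch of Algorithm 1, the proximity operators and gradients are user-supplied callables, and the step-sizes are assumed to satisfy (54) below.

```python
import numpy as np

def primal_dual_multi(x0, v0_list, L_list, prox_gR, prox_gJs_list,
                      grad_F, grad_Gs_list, gR, gJ_list, n_iter=500):
    """Sketch of iteration (48) (theta = 1): one dual block v_i per term (J_i inf-conv G_i)(L_i x)."""
    x = x0.copy()
    v = [vi.copy() for vi in v0_list]
    for _ in range(n_iter):
        x_new = prox_gR(x - gR * grad_F(x)
                        - gR * sum(L.T @ vi for L, vi in zip(L_list, v)))
        x_bar = 2 * x_new - x
        v = [prox_gJs_list[i](v[i] - gJ_list[i] * grad_Gs_list[i](v[i])
                              + gJ_list[i] * (L_list[i] @ x_bar))
             for i in range(len(v))]
        x = x_new
    return x, v
```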

5.2. Local convergence analysis

Let (x^⋆, v_1^⋆, ..., v_m^⋆) be a Kuhn-Tucker point. Define the following functions

J̄_i^*(v_i) := J_i^*(v_i) − ⟨v_i, L_i x^⋆ − ∇G_i^*(v_i^⋆)⟩,  v_i ∈ ℝ^{m_i}, i ∈ {1, ..., m},   (51)

and the Riemannian Hessian of each J̄_i^*,

H_{J_i^*} := γ_{J_i} P_{T^{J_i^*}_{v_i^⋆}} ∇^2_{M^{J_i^*}_{v_i^⋆}} J̄_i^*(v_i^⋆) P_{T^{J_i^*}_{v_i^⋆}}  and  W_{J_i^*} := (Id_{m_i} + H_{J_i^*})^{-1},  i ∈ {1, ..., m}.   (52)

For each i ∈ {1, ..., m}, owing to Lemma 3.3, W_{J_i^*} is firmly non-expansive if the non-degeneracy condition (ND^m) below holds. Now suppose that
(A.5) F is locally C^2-smooth around x^⋆ and each G_i^* is locally C^2-smooth around v_i^⋆.

Define the restricted Hessians H_{G_i^*} := P_{T^{J_i^*}_{v_i^⋆}} ∇^2 G_i^*(v_i^⋆) P_{T^{J_i^*}_{v_i^⋆}}, and denote H̄_{G_i^*} := Id_{m_i} − γ_{J_i} H_{G_i^*}, L̄_i := P_{T^{J_i^*}_{v_i^⋆}} L_i P_{T^R_{x^⋆}}, and the matrix

M_PD := [ W_R H̄_F   −γ_R W_R L̄_1^*   ...   −γ_R W_R L̄_m^* ;
          γ_{J_1} W_{J_1^*} L̄_1 (2 W_R H̄_F − Id_n)   W_{J_1^*}( H̄_{G_1^*} − 2 γ_R γ_{J_1} L̄_1 W_R L̄_1^* )   ...   ;
          ...   ;
          γ_{J_m} W_{J_m^*} L̄_m (2 W_R H̄_F − Id_n)   ...   W_{J_m^*}( H̄_{G_m^*} − 2 γ_R γ_{J_m} L̄_m W_R L̄_m^* ) ].   (53)

Using the same strategy as in the proof of Lemma 3.5, one can show that M_PD is convergent; its limit is again denoted M_PD^∞, and ρ(M_PD − M_PD^∞) < 1.

Corollary 5.1. Consider Algorithm 2 under assumptions (A.1) and (A.2)-(A.5). Choose γ_R, (γ_{J_i})_i > 0 such that

2 min{β_F, β_{G,1}, ..., β_{G,m}} min{1/γ_R, 1/γ_{J_1}, ..., 1/γ_{J_m}} (1 − √(Σ_i γ_R γ_{J_i} ‖L_i‖^2)) > 1.   (54)

Then (x_k, v_{1,k}, ..., v_{m,k}) → (x^⋆, v_1^⋆, ..., v_m^⋆), where x^⋆ solves (P_P^m) and (v_1^⋆, ..., v_m^⋆) solves (P_D^m). If moreover R ∈ PSF_{x^⋆}(M^R_{x^⋆}) and J_i^* ∈ PSF_{v_i^⋆}(M^{J_i^*}_{v_i^⋆}), i ∈ {1, ..., m}, and the non-degeneracy condition

−Σ_i L_i^* v_i^⋆ − ∇F(x^⋆) ∈ ri( ∂R(x^⋆) ),
L_i x^⋆ − ∇G_i^*(v_i^⋆) ∈ ri( ∂J_i^*(v_i^⋆) ),  i ∈ {1, ..., m},   (ND^m)

holds, then:
(i) there exists K > 0 such that for all k ≥ K, (x_k, v_{1,k}, ..., v_{m,k}) ∈ M^R_{x^⋆} × M^{J_1^*}_{v_1^⋆} × ... × M^{J_m^*}_{v_m^⋆};
(ii) given any ρ ∈ ]ρ(M_PD − M_PD^∞), 1[, there exists a K large enough such that for all k ≥ K,

‖(Id − M_PD^∞)(z_k − z^⋆)‖ = O(ρ^{k−K}).   (55)
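As a practical complement to Theorem 3.7 and Corollary 5.1 (ours, not from the paper), the local linear rate can be estimated directly from a run of the algorithm by fitting the slope of log ‖z_k − z^⋆‖ over the last iterations, and then compared with the theoretical spectral radius ρ(M_PD − M_PD^∞).

```python
import numpy as np

def estimate_linear_rate(errors, tail=50):
    """Least-squares fit of log(errors) versus k over the last `tail` iterations;
    `errors` is a list of positive values ||z_k - z_star||."""
    e = np.asarray(errors[-tail:], dtype=float)
    k = np.arange(e.size)
    slope = np.polyfit(k, np.log(e), 1)[0]
    return np.exp(slope)        # estimated local rate rho

# Usage: collect errors.append(np.linalg.norm(z_k - z_star)) along the run,
# then compare estimate_linear_rate(errors) with the theoretical spectral radius.
```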