Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization


Noname manuscript No. (will be inserted by the editor)

Perturbed Proximal Primal-Dual Algorithm for Nonconvex Nonsmooth Optimization

Davood Hajinezhad and Mingyi Hong

Received: date / Accepted: date

Abstract In this paper we propose a perturbed proximal primal-dual algorithm (PProx-PDA) for an important class of linearly constrained optimization problems whose objective is the sum of a smooth (possibly nonconvex) function and a convex (possibly nonsmooth) function. This family of problems has applications in a number of statistical and engineering settings, for example in high-dimensional subspace estimation and in distributed signal processing and learning over networks. The proposed method is of Uzawa type, in which a primal gradient descent step is performed followed by an (approximate) dual gradient ascent step. One distinctive feature of the proposed algorithm is that the primal and dual steps are both perturbed appropriately using past iterates so that a number of asymptotic convergence and rate of convergence results (to first-order stationary solutions) can be obtained. Finally, we conduct extensive numerical experiments to validate the effectiveness of the proposed algorithm.

AMS(MOS) Subject Classifications: 49, 90.

1 Introduction

1.1 The Problem

Consider the following optimization problem

$$\min_{x\in X}\ f(x)+h(x), \quad \text{s.t.}\ Ax=b, \qquad (1)$$

where $f(x):\mathbb{R}^N\to\mathbb{R}$ is a continuous smooth function (possibly nonconvex); $A\in\mathbb{R}^{M\times N}$ is a matrix (possibly rank deficient); $b\in\mathbb{R}^M$ is a given vector; $X\subseteq\mathbb{R}^N$ is a convex compact set; $h(x):\mathbb{R}^N\to\mathbb{R}$ is a lower semi-continuous nonsmooth convex function. Problem (1) is an interesting class that can be specialized to a number of statistical and engineering applications. We provide a few of these applications in Subsection 1.3.

1.2 The Algorithm

In this section, we present the proposed algorithm. The augmented Lagrangian for problem (1) is given below:

$$L_\rho(x,\lambda) = f(x)+h(x)+\langle\lambda,\,Ax-b\rangle+\frac{\rho}{2}\|Ax-b\|^2, \qquad (2)$$

D. Hajinezhad, Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA. E-mail: davood.hajinezhad@duke.edu
M. Hong, Department of Electrical and Computer Engineering, University of Minnesota, USA. E-mail: mhong@umn.edu

where $\lambda\in\mathbb{R}^M$ is the dual variable associated with the equality constraint $Ax=b$, and $\rho>0$ is the penalty parameter for the augmented term $\frac{\rho}{2}\|Ax-b\|^2$. Define $B\in\mathbb{R}^{M\times N}$ as a scaling matrix, and introduce two new parameters $\gamma\in(0,1)$ and $\beta>0$, where $\gamma$ is a small positive algorithm parameter related to the size of the equality constraint violation that is allowed by the algorithm, and $\beta$ is the proximal parameter that regularizes the primal update. Let us choose $\gamma>0$ and $\rho>0$ such that $\rho\gamma<1$. The steps of the proposed perturbed proximal primal-dual algorithm (PProx-PDA) are given below (Algorithm 1).

Algorithm 1: The perturbed proximal primal-dual algorithm (PProx-PDA)
Initialize: $\lambda^0$ and $x^0$
Repeat: update variables by

$$x^{r+1}=\arg\min_{x\in X}\Big\{\langle\nabla f(x^r),\,x-x^r\rangle+h(x)+\langle(1-\rho\gamma)\lambda^r,\,Ax-b\rangle+\frac{\rho}{2}\|Ax-b\|^2+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}\Big\} \qquad (3a)$$
$$\lambda^{r+1}=(1-\rho\gamma)\lambda^r+\rho\big(Ax^{r+1}-b\big) \qquad (3b)$$

Until convergence.

In contrast to the classic augmented Lagrangian (AL) method [34, 59], in which the primal variable is updated by minimizing the augmented Lagrangian given in (2), in PProx-PDA the primal step minimizes an approximated augmented Lagrangian, where the approximation comes from: 1) replacing the function $f(x)$ with the linear surrogate $\langle\nabla f(x^r),\,x-x^r\rangle$; 2) perturbing the dual variable $\lambda$ by a positive factor $1-\rho\gamma>0$; 3) adding the proximal term $\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$.

We make a few remarks about these algorithmic choices. First, the use of the linear surrogate $\langle\nabla f(x^r),\,x-x^r\rangle$ ensures that only first-order information is used in the primal update. It is also worth mentioning that one can replace this surrogate with a wider class of surrogate functions satisfying certain gradient consistency conditions [60, 64], and our subsequent analysis still holds true. However, in order to stay focused, we choose not to present those variations. Second, the primal and dual perturbations are added to facilitate the convergence analysis. In particular, the analysis of PProx-PDA differs from the recent analyses of nonconvex primal-dual type algorithms, first presented in Ames and Hong [] and later generalized by [8,30,3,37,45,53,69]. Those analyses depend critically on bounding the size of the successive dual variables by that of the successive primal variables. Unfortunately, this can only be done when the primal step immediately preceding the dual step is smooth and unconstrained; therefore the analysis presented in those works cannot be applied to our general formulation with nonsmooth terms and constraints. Our perturbation scheme is strongly motivated by the dual perturbation scheme developed for the convex case, for example in [43]. Conceptually, the perturbed dual step can be viewed as performing dual ascent on a certain regularized Lagrangian in the dual space; see [4, Sec. 3.1]. The main purpose of introducing the dual perturbation/regularization in this reference, and in many related works, is to ensure that the dual update is well behaved and easy to analyze. Intuitively, when adopting and modifying such a perturbation strategy in the nonconvex setting of interest to this work, we no longer need to bound the size of the successive dual variables by that of the successive primal variables, since the change of the dual variables is now well controlled. Third, the proximal term $\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$ is used for two purposes: 1) to make the primal subproblem strongly convex; 2) for certain applications, to ensure that the primal subproblem is decomposable over the variables.
We will discuss how this can be done in the subsequent sections.
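To make the update rules (3a)–(3b) concrete, the following is a minimal sketch of one possible implementation, not the paper's own code. It assumes the simplified setting $X=\mathbb{R}^N$, $h=\nu\|\cdot\|_1$, and a scaling matrix chosen so that $\beta B^TB=\alpha I-\rho A^TA$ with $\alpha\ge\rho\sigma_{\max}(A^TA)$; under this choice the quadratic terms in (3a) cancel and the primal step reduces to a single soft-thresholding. The names `grad_f`, `alpha`, `nu`, and `iters` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximity operator of t * ||.||_1 (elementwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pprox_pda(grad_f, A, b, x0, rho, gamma, alpha, nu, iters=500):
    """Sketch of PProx-PDA (Algorithm 1) for
        min_x f(x) + nu * ||x||_1   s.t.  Ax = b,   X = R^N,
    assuming beta * B^T B = alpha * I - rho * A^T A (alpha >= rho * ||A||^2),
    so that the primal step (3a) collapses to one soft-thresholding."""
    x, lam = x0.copy(), np.zeros(A.shape[0])
    for _ in range(iters):
        # Gradient of the smooth part of (3a) at x^r (after the cancellation).
        g = grad_f(x) + (1 - rho * gamma) * (A.T @ lam) + rho * A.T @ (A @ x - b)
        x = soft_threshold(x - g / alpha, nu / alpha)        # primal step (3a)
        lam = (1 - rho * gamma) * lam + rho * (A @ x - b)    # perturbed dual step (3b)
    return x, lam
```

With this choice of $B$, each iteration costs only two multiplications by $A$ plus a soft-thresholding, which is the decomposability property the proximal term is designed to provide.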

1.3 Motivating Applications

Sparse subspace estimation. Suppose that $\Sigma\in\mathbb{R}^{p\times p}$ is an unknown covariance matrix, and that $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$ and $u_1,u_2,\ldots,u_p$ are its eigenvalues and eigenvectors, respectively, which satisfy $\Sigma=\sum_{i=1}^p\lambda_iu_iu_i^T$. Principal component analysis (PCA) aims to recover $u_1,u_2,\ldots,u_k$, where $k\ll p$, from a sample covariance matrix $\hat\Sigma$ obtained from i.i.d. samples $\{x_i\}_{i=1}^n$. The subspace spanned by $\{u_i\}_{i=1}^k$ is called the $k$-dimensional principal subspace, whose projection matrix is given by $\Pi^*=\sum_{i=1}^ku_iu_i^T$. Therefore, PCA reduces to finding an estimate of $\Pi^*$, denoted by $\hat\Pi$, from the sample covariance matrix $\hat\Sigma$. In the high-dimensional setting, where the number of data points is significantly smaller than the dimension (i.e., $n\ll p$), it is desirable to find a sparse $\hat\Pi$, using the following formulation [6]:

$$\min_{\Pi}\ -\langle\hat\Sigma,\,\Pi\rangle+P_\nu(\Pi), \quad \text{s.t.}\ \Pi\in F^k. \qquad (4)$$

In the above formulation, $F^k$ denotes the Fantope set [68], given by $F^k=\{X:0\preceq X\preceq I,\ \mathrm{trace}(X)=k\}$, which promotes low-rankness in $X$. The function $P_\nu(\Pi)$ is a nonconvex regularizer that enforces sparsity on $\Pi$. Typical forms of this regularization are the smoothly clipped absolute deviation (SCAD) [0] and the minimax concave penalty (MCP) [7]. For example, the MCP with parameters $b$ and $\nu$, for some scalar $\phi$, is given below:

$$P_\nu(\phi)=\mathbb{1}_{\{|\phi|\le b\nu\}}\Big(\nu|\phi|-\frac{\phi^2}{2b}\Big)+\mathbb{1}_{\{|\phi|>b\nu\}}\,\frac{b\nu^2}{2}, \qquad (5)$$

where $\mathbb{1}_S$ denotes the $0/1$ indicator of the event $S$. Below we will also use $\iota_X$ to denote the indicator function of a convex set $X$, defined as

$$\iota_X(y)=0\ \text{when}\ y\in X, \qquad \iota_X(y)=\infty\ \text{otherwise}. \qquad (6)$$

Notice that $P_\nu(\Pi)$ in problem (4) is applied elementwise over all entries of the matrix $\Pi$. One particular characterization of these nonconvex penalties is that they can be decomposed as the sum of an $\ell_1$-norm function (i.e., for $x\in\mathbb{R}^N$, $\|x\|_1=\sum_{i=1}^N|x_i|$) and a concave function $q_\nu$, namely $P_\nu(\phi)=\nu|\phi|+q_\nu(\phi)$ for some $\nu\ge0$. In a recent work [6], it is shown that, with high probability, every first-order stationary solution of problem (4) (denoted $\hat\Pi$) is of high quality; see [6, Theorem 3] for a detailed description. In order to deal with the Fantope constraint and the nonconvex regularizer separately, one can introduce a new variable $\Phi$ and reformulate problem (4) in the following manner [68]:

$$\min_{\Pi,\Phi}\ -\langle\hat\Sigma,\,\Pi\rangle+P_\nu(\Phi) \quad \text{s.t.}\ \Pi\in F^k,\ \Pi-\Phi=0. \qquad (7)$$

Clearly this is a special case of problem (1), with $x=[\Pi;\Phi]$, $f(x)=-\langle\hat\Sigma,\Pi\rangle+q_\nu(\Phi)$, $h(x)=\nu\|\Phi\|_1$, $X=F^k$, $A=[I,-I]$, $b=0$.

The exact consensus problem over networks. Consider a network consisting of $N$ agents that collectively optimize the following problem:

$$\min_{y\in\mathbb{R}}\ f(y)+h(y):=\sum_{i=1}^N\big(f_i(y)+h_i(y)\big), \qquad (8)$$

where $f_i(y):\mathbb{R}\to\mathbb{R}$ is a smooth function and $h_i(y):\mathbb{R}\to\mathbb{R}$ is a convex, possibly nonsmooth regularizer (here $y$ is assumed to be scalar for ease of presentation). Note that both $f_i$ and $h_i$ are only accessible by agent $i$. In particular, each local loss function $f_i$ can represent: 1) a mini-batch of (possibly nonconvex) loss functions modeling data fidelity [4]; 2) nonconvex activation functions of neural networks [1]; 3) nonconvex utility functions used in applications such as resource allocation [11]. The regularization function $h_i$ usually takes the following forms: 1) a convex regularizer such as the nonsmooth $\ell_1$ or the smooth $\ell_2$ function; 2) the indicator function of a closed convex set $X$, i.e., the $\iota_X$ function defined in (6). This problem has found applications in various domains such as distributed statistical learning [5], distributed consensus [67], distributed communication networking [46, 74], distributed and parallel machine learning [, 35], and distributed signal processing [63, 74]; for more applications we refer the readers to a recent survey [4].

To integrate the structure of the network into problem (8), we assume that the agents are connected through a network defined by an undirected, connected graph $G=\{V,E\}$ with $|V|=N$ vertices and $|E|=E$ edges. For agent $i\in V$ the neighborhood set is defined as $N_i:=\{j\in V\ \text{s.t.}\ (i,j)\in E\}$. Each agent can only communicate with its neighbors, and it is responsible for optimizing one component function $f_i$ regularized by $h_i$. Define the incidence matrix $A\in\mathbb{R}^{E\times N}$ as follows: if $e\in E$ connects vertices $i$ and $j$ with $i>j$, then $A_{ev}=1$ if $v=i$, $A_{ev}=-1$ if $v=j$, and $A_{ev}=0$ otherwise. Using this definition, the signed graph Laplacian matrix $L_-$ is given by $L_-:=A^TA\in\mathbb{R}^{N\times N}$. Introducing $N$ new variables $x_i$ as local copies of the global variable $y$, and defining $x:=[x_1;\cdots;x_N]\in\mathbb{R}^N$, problem (8) can be equivalently expressed as

$$\min_{x\in\mathbb{R}^N}\ f(x)+h(x):=\sum_{i=1}^N\big(f_i(x_i)+h_i(x_i)\big), \quad \text{s.t.}\ Ax=0. \qquad (9)$$

This problem is precisely the original problem (1) with the correspondence $X=\mathbb{R}^N$, $b=0$, $f(x):=\sum_{i=1}^Nf_i(x_i)$, and $h(x):=\sum_{i=1}^Nh_i(x_i)$.

For this problem, let us see how the proposed PProx-PDA can be applied. The first observation is that the choice of the scaling matrix $B$ is critical, because an appropriate choice of $B$ ensures that problem (3a) is decomposable over the different variables (or variable blocks), so that PProx-PDA can be performed fully distributedly. Let us define the signless incidence matrix $B:=|A|$, where $A$ is the signed incidence matrix defined above and the absolute value is taken componentwise. Using this choice of $B$, we have $B^TB=L_+\in\mathbb{R}^{N\times N}$, the signless graph Laplacian, whose $(i,i)$th diagonal entry is the degree of node $i$, and whose $(i,j)$th entry is $1$ if $e=(i,j)\in E$ and $0$ otherwise. Further, let us set $\rho=\beta$. Then the $x$-update step (3a) becomes

$$x^{r+1}=\arg\min_x\Big\{\sum_{i=1}^N\Big(\langle\nabla f_i(x_i^r),\,x_i-x_i^r\rangle+h_i(x_i)\Big)+\langle(1-\rho\gamma)\lambda^r,\,Ax-b\rangle+\rho x^TDx-\rho x^TL_+x^r\Big\},$$

where $D:=\mathrm{diag}[d_1,\ldots,d_N]\in\mathbb{R}^{N\times N}$ is the diagonal degree matrix, with $d_i$ denoting the degree of node $i$. Clearly this problem is separable over the variables $x_i$, $i=1,2,\ldots,N$. To perform this update, each agent $i$ only requires local information as well as information from its neighbors $N_i$: $D$ is a diagonal matrix, and the structure of $L_+$ ensures that the $i$th block of $L_+x^r$ depends only on $x_j^r$ with $j\in N_i$.

The partial consensus problem. In the previous application the agents are required to reach exact consensus, a constraint imposed through $Ax=0$ in (9). In practice, however, exact consensus is rarely required, for example due to potential disturbances in network communication; see the detailed discussion in [4]. Further, in applications ranging from distributed estimation to rare event detection, the data obtained by the agents, such as harmful algal blooms, network activities, and local temperatures, often exhibit distinctive spatial structure [15]. The distributed problem in these settings is best formulated using a certain partial consensus model, in which the local variables of an agent are only required to be close to those of its neighbors. To model such a partial consensus constraint, we denote by $\xi$ the permissible tolerance for $e=(i,j)\in E$, and define the link variable $z_e=x_i-x_j$. Then we replace the strict consensus constraint $z_e=0$ with $-\xi\le[z_e]_k\le\xi$, where $[z_e]_k$ denotes the $k$th entry of the vector $z_e$ (for simplicity we assume that the permissible tolerance $\xi$ is identical for all $e\in E$). Setting $z:=\{z_e\}_{e\in E}$ and $Z:=\{z\mid -\xi\le[z_e]_k\le\xi,\ \forall e\in E,\ \forall k\}$, the partial consensus problem can be formulated as

$$\min_{x,z}\ \sum_{i=1}^N\big(f_i(x_i)+h_i(x_i)\big) \quad \text{s.t.}\ Ax-z=0,\ z\in Z, \qquad (10)$$

which is again a special case of problem (1).
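As a concrete illustration of the graph quantities used in the two consensus formulations above, the following small sketch (our own illustration, not from the manuscript) builds the signed incidence matrix $A$ and the signless incidence matrix $B=|A|$ for a path graph, and checks the identity $L_-+L_+=2D$ relating the signed and signless Laplacians to the degree matrix.

```python
import numpy as np

def incidence_matrices(n, edges):
    """Signed and signless incidence matrices of an undirected graph.
    edges: list of pairs (i, j) with i > j, nodes indexed 0..n-1."""
    A = np.zeros((len(edges), n))        # signed incidence matrix
    for e, (i, j) in enumerate(edges):
        A[e, i], A[e, j] = 1.0, -1.0
    return A, np.abs(A)                  # B = |A| taken componentwise

# Small example: a path graph on 3 nodes, edges written with i > j.
n, edges = 3, [(1, 0), (2, 1)]
A, B = incidence_matrices(n, edges)
L_minus, L_plus = A.T @ A, B.T @ B       # signed / signless Laplacians
D = np.diag(np.diag(L_minus))            # diagonal degree matrix
assert np.allclose(L_minus + L_plus, 2 * D)
```

The assertion reflects why the choice $\rho=\beta$, $B=|A|$ makes the quadratic part of (3a) equal to $\rho x^TDx-\rho x^TL_+x^r$ (up to constants), hence separable across agents.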

1.4 Literature Review and Contribution

1.4.1 Literature on Related Algorithms

The augmented Lagrangian (AL) method, also known as the method of multipliers, was pioneered by Hestenes [34] and Powell [59]. It is a classical algorithm for solving nonconvex smooth constrained problems, and its convergence is guaranteed under rather weak assumptions [7, 1, 58]. A modified version of the AL method was developed by Rockafellar [61], in which a proximal term is added to the objective function in order to make it strongly convex at each iteration. Later, Wright [40] specialized this algorithm to the linear programming problem. Many existing packages, such as LANCELOT, are implemented based on the AL method. Recently, due to the need to solve very large-scale nonlinear optimization problems, the AL method and its variants have regained popularity. For example, in [16] a line-search AL method was proposed for solving problem (1) with $h\equiv0$ and $X=\{x:\ell\le x\le u\}$. Reference [13] developed an AL-based algorithm for nonconvex nonsmooth optimization, where subgradients of the augmented Lagrangian are used in the primal update. When the problem is convex and smooth and the constraints are linear, Lan and Monteiro [44] analyzed the iteration complexity of the AL method; more specifically, the authors analyzed the total number of Nesterov-type optimal iterations [57] required to reach high-quality primal-dual solutions. Subsequently, Liu et al. [48] proposed an inexact AL (IAL) algorithm that only requires an $\epsilon$-approximate solution of the primal subproblem at each iteration. Hong et al. [35] proposed a proximal primal-dual algorithm (Prox-PDA), an AL-based method mainly used to solve smooth and unconstrained distributed nonconvex problems [by unconstrained we refer to problem (9) with $h_i\equiv0$ and $X=\mathbb{R}^N$; the consensus constraint $Ax=0$ is always imposed]. Another AL-based algorithm, called ALADIN [38], is designed for nonconvex smooth optimization problems with coupled affine constraints in the distributed setting. In ALADIN the objective function is separable over the different nodes, and the loss function is assumed to be twice differentiable; to implement ALADIN, a fusion center is needed in the network to propagate the global variable to the agents. A comprehensive survey of AL-based methods in both the convex and nonconvex settings can be found in [33]. Overall, AL-based methods often require sophisticated stepsize selection and an accurate oracle for solving the primal subproblem. Further, they cannot deal with problems that have both a nonsmooth regularizer $h(x)$ and a general convex constraint. Therefore, it is not straightforward to apply these methods to problems such as the distributed learning and high-dimensional sparse subspace estimation problems mentioned in the previous subsection. Recently, the alternating direction method of multipliers (ADMM), a variant of the AL method, has gained popularity for decomposing large-scale nonsmooth optimization problems [1]. The method originated in the early 1970s [3, 5] and has since been studied extensively [9, 18, 36]. The main strength of this algorithm is that it is capable of decomposing a large problem into a series of small and simple subproblems, thereby making the overall algorithm scalable and easy to implement.
However, unlike the AL method, the ADMM is designed for convex problems, despite its good numerical performance on nonconvex problems such as nonnegative matrix factorization [66], phase retrieval [70], distributed clustering [], tensor decomposition [47], and so on. Only very recently have researchers begun to rigorously investigate the convergence of ADMM (to first-order stationary solutions) for nonconvex problems. Zhang [73] analyzed a class of splitting algorithms (which includes the ADMM as a special case) for a very special class of nonconvex quadratic problems. Ames and Hong [] developed an analysis of ADMM for a certain $\ell_1$-penalized problem arising in high-dimensional discriminant analysis. Other works along this line include [31, 37, 45, 53] and [69]; see Table 1 in [69] for a comparison of the conditions required in these works. Despite the recent progress, the aforementioned works still pose very restrictive assumptions on the problem types in order to achieve convergence. For example, it is not clear whether the ADMM can be used for the distributed nonconvex optimization problem (9) over an arbitrary connected graph with regularizers and constraints, despite the fact that for convex problems such applications are popular and the resulting algorithms are efficient.

1.4.2 Literature on Applications

The sparse subspace estimation formulations (4) and (7) were first considered in [17, 68] and subsequently considered in [6]. The work [68] proposes a semidefinite convex optimization problem to estimate the principal subspace of a population matrix $\Sigma$ based on a sample covariance matrix. The authors of [6] further show that by utilizing nonconvex regularizers it is possible to significantly improve the estimation accuracy for a given number of data points. However, the algorithm considered in [6] is not guaranteed to reach any stationary solution. The consensus problems (8) and (9) have been studied extensively in the literature when the objective functions are all convex; see for example [6, 49, 54, 55, 65]. Without assuming convexity of the $f_i$'s, the literature has been very scant; see the recent developments in [10, 9, 37, 50]. However, all of these recent results require that the nonsmooth terms $h_i$, if present, be identical for all agents in the network. This assumption is unnecessarily strong, and it defeats the purpose of distributed consensus, since global information about the objective function has to be shared among the agents. Further, in the nonconvex setting we are not aware of any existing distributed algorithm with convergence guarantees that can deal with the more practical problem (10) with partial consensus.

1.4.3 Contributions of This Work

In this paper we develop an AL-based algorithm, named the perturbed proximal primal-dual algorithm (PProx-PDA), for the challenging linearly constrained nonconvex nonsmooth problem (1). The proposed method, listed in Algorithm 1, is of Uzawa type [41] and has a very simple update rule. It is a single-loop algorithm that alternates between a primal (scaled) proximal gradient descent step and an (approximate) dual gradient ascent step. Further, by appropriately selecting the scaling matrix in the primal step, the variables can be easily updated in parallel. These features make the algorithm attractive for applications such as the high-dimensional subspace estimation and distributed learning problems discussed in Section 1.3. One distinctive feature of PProx-PDA is the use of a novel perturbation scheme for both the primal and dual steps, which is designed to ensure a number of asymptotic convergence and rate of convergence properties (to approximate first-order stationary solutions). Specifically, we show that when a certain perturbation parameter remains constant across the iterations, the algorithm converges globally, sublinearly, to the set of approximate first-order stationary solutions. Further, when the perturbation parameter diminishes to zero at an appropriate rate, the algorithm converges to the set of exact first-order stationary solutions. To the best of our knowledge, the proposed algorithm represents one of the first first-order methods with convergence and rate of convergence guarantees (to certain approximate stationary solutions) for problems of the form (1).

Notation. We use $\|\cdot\|$, $\|\cdot\|_1$, and $\|\cdot\|_F$ to denote the Euclidean norm, the $\ell_1$-norm, and the Frobenius norm, respectively. For a given vector $x$ and matrix $H$, we denote $\|x\|^2_H:=x^THx$. For two vectors $a,b$ we use $\langle a,b\rangle$ to denote their inner product. We use $\sigma_{\max}(A)$ to denote the maximum eigenvalue of a matrix $A$, and $I_N$ to denote the $N\times N$ identity matrix. For a nonsmooth convex function $h(x)$, $\partial h(x)$ denotes the subdifferential set defined by

$$\partial h(x)=\big\{v\in\mathbb{R}^N:\ h(y)\ge h(x)+\langle v,\,y-x\rangle,\ \forall\,y\in\mathbb{R}^N\big\}. \qquad (11)$$
For a convex function $h(x)$ and a constant $\alpha>0$, the proximity operator is defined as

$$\mathrm{prox}^{1/\alpha}_h(x):=\operatorname*{argmin}_{z}\ \frac{\alpha}{2}\|x-z\|^2+h(z). \qquad (12)$$
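As a quick sanity check on (12): for $h=\|\cdot\|_1$ the proximity operator has the closed form of soft-thresholding at level $1/\alpha$. A minimal numerical sketch (our own illustration):

```python
import numpy as np

def prox_l1(x, alpha):
    """prox_{||.||_1}^{1/alpha}(x) = argmin_z (alpha/2)||x - z||^2 + ||z||_1,
    i.e., elementwise soft-thresholding at level 1/alpha."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / alpha, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), alpha=2.0))  # -> [1.0, 0.0, 0.2]
```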

2 Convergence Analysis of PProx-PDA

In this section we provide the convergence analysis for PProx-PDA as presented in Algorithm 1. We will frequently use the following identity:

$$\langle b,\,b-a\rangle=\frac{1}{2}\big(\|b-a\|^2+\|b\|^2-\|a\|^2\big). \qquad (13)$$

Also, for notational simplicity we define

$$w^r:=(x^{r+1}-x^r)-(x^r-x^{r-1}). \qquad (14)$$

To proceed, let us make the following blanket assumptions on problem (1).

Assumptions A.

A1. The gradient of the function $f(x)$ is Lipschitz continuous on $X$, i.e., there exists $L>0$ such that

$$\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|, \quad \forall\,x,y\in X. \qquad (15)$$

Further, without loss of generality, assume that $f(x)\ge0$ for all $x\in X$.

A2. The function $h(x)$ is a nonsmooth, lower semi-continuous convex function, lower bounded (for simplicity we assume $h(x)\ge0$ for all $x\in X$), and its subgradients are bounded.

A3. Problem (1) is feasible.

A4. The feasible set $X$ is convex and compact.

A5. The scaling matrix $B$ is chosen such that $A^TA+B^TB\succeq I$.

Our first lemma characterizes the relationship between the primal and dual variables over two consecutive iterations.

Lemma 1. Under Assumptions A, the following holds true for PProx-PDA:

$$\begin{aligned}
&\frac{1-\rho\gamma}{2\rho}\|\lambda^{r+1}-\lambda^r\|^2+\frac{\beta}{2}\|x^{r+1}-x^r\|^2_{B^TB}\\
&\le\frac{1-\rho\gamma}{2\rho}\|\lambda^r-\lambda^{r-1}\|^2+\frac{\beta}{2}\|x^r-x^{r-1}\|^2_{B^TB}+\frac{L}{2}\|x^{r+1}-x^r\|^2+\frac{L}{2}\|x^r-x^{r-1}\|^2-\gamma\|\lambda^{r+1}-\lambda^r\|^2, \quad \forall\,r\ge1. \qquad (16)
\end{aligned}$$

Proof. From the optimality condition of the $x$-update (3a) we obtain

$$\langle\nabla f(x^r)+(1-\rho\gamma)A^T\lambda^r+\rho A^T(Ax^{r+1}-b)+\beta B^TB(x^{r+1}-x^r)+\xi^{r+1},\ x^{r+1}-x\rangle\le0, \quad \forall\,x\in X, \qquad (17)$$

for some $\xi^{r+1}\in\partial h(x^{r+1})$. Using the dual update rule (3b), which gives $(1-\rho\gamma)A^T\lambda^r+\rho A^T(Ax^{r+1}-b)=A^T\lambda^{r+1}$, we obtain

$$\langle\nabla f(x^r)+A^T\lambda^{r+1}+\beta B^TB(x^{r+1}-x^r)+\xi^{r+1},\ x^{r+1}-x\rangle\le0, \quad \forall\,x\in X. \qquad (18)$$

Writing the same inequality for iteration $r-1$, we have

$$\langle\nabla f(x^{r-1})+A^T\lambda^r+\beta B^TB(x^r-x^{r-1})+\xi^r,\ x^r-x\rangle\le0, \quad \forall\,x\in X, \qquad (19)$$

for some $\xi^r\in\partial h(x^r)$. Letting $x=x^r$ in the first inequality and $x=x^{r+1}$ in the second, we can add the resulting inequalities to obtain

$$\langle\nabla f(x^r)-\nabla f(x^{r-1}),\ x^{r+1}-x^r\rangle+\langle A^T(\lambda^{r+1}-\lambda^r),\ x^{r+1}-x^r\rangle+\beta\langle B^TBw^r,\ x^{r+1}-x^r\rangle-\langle\xi^r-\xi^{r+1},\ x^{r+1}-x^r\rangle\le0, \qquad (20)$$

where the dual variables enter through (3b), i.e., $(1-\rho\gamma)A^T\lambda^r+\rho A^T(Ax^{r+1}-b)=A^T\lambda^{r+1}$ and $(1-\rho\gamma)A^T\lambda^{r-1}+\rho A^T(Ax^r-b)=A^T\lambda^r$.

In the last inequality we have utilized the convexity of $h$, which implies $\langle\xi^{r+1}-\xi^r,\,x^{r+1}-x^r\rangle\ge0$. Now let us analyze each term in (20). For the first term we have:

$$\langle\nabla f(x^{r-1})-\nabla f(x^r),\,x^{r+1}-x^r\rangle\le\frac{1}{2\alpha}\|\nabla f(x^{r-1})-\nabla f(x^r)\|^2+\frac{\alpha}{2}\|x^{r+1}-x^r\|^2\le\frac{L^2}{2\alpha}\|x^r-x^{r-1}\|^2+\frac{\alpha}{2}\|x^{r+1}-x^r\|^2=\frac{L}{2}\|x^r-x^{r-1}\|^2+\frac{L}{2}\|x^{r+1}-x^r\|^2, \qquad (21)$$

where in the first inequality we applied Young's inequality with $\alpha>0$, the second inequality is due to the Lipschitz continuity of $\nabla f$, and in the last equality we set $\alpha=L$. For the second term in (20), which is $\langle A^T(\lambda^{r+1}-\lambda^r),\,x^{r+1}-x^r\rangle$, we have the following series of equalities:

$$\begin{aligned}
\langle A^T(\lambda^{r+1}-\lambda^r),\,x^{r+1}-x^r\rangle&=\langle A(x^{r+1}-x^r),\,\lambda^{r+1}-\lambda^r\rangle\\
&=\big\langle(Ax^{r+1}-b-\gamma\lambda^r)-(Ax^r-b-\gamma\lambda^{r-1}),\ \lambda^{r+1}-\lambda^r\big\rangle+\gamma\langle\lambda^r-\lambda^{r-1},\,\lambda^{r+1}-\lambda^r\rangle\\
&\overset{(3b),(13)}{=}\Big(\frac{1}{\rho}-\gamma\Big)\frac{1}{2}\Big(\|\lambda^{r+1}-\lambda^r\|^2-\|\lambda^r-\lambda^{r-1}\|^2+\|(\lambda^{r+1}-\lambda^r)-(\lambda^r-\lambda^{r-1})\|^2\Big)+\gamma\|\lambda^{r+1}-\lambda^r\|^2. \qquad (22)
\end{aligned}$$

For the term $\beta\langle B^TBw^r,\,x^{r+1}-x^r\rangle$, we have

$$\beta\langle B^TBw^r,\,x^{r+1}-x^r\rangle\overset{(13)}{=}\frac{\beta}{2}\Big(\|x^{r+1}-x^r\|^2_{B^TB}-\|x^r-x^{r-1}\|^2_{B^TB}+\|w^r\|^2_{B^TB}\Big)\ge\frac{\beta}{2}\Big(\|x^{r+1}-x^r\|^2_{B^TB}-\|x^r-x^{r-1}\|^2_{B^TB}\Big). \qquad (23)$$

Therefore, combining (21)–(23), we obtain the desired result in (16). Q.E.D.

Next we analyze the behavior of the primal iterations. Toward this end, let us define the following new quantity:

$$T(x,\lambda):=f(x)+h(x)+\langle(1-\rho\gamma)\lambda,\ Ax-b-\gamma\lambda\rangle+\frac{\rho}{2}\|Ax-b\|^2. \qquad (24)$$

Note that this quantity is identical to the augmented Lagrangian when $\gamma=0$; it is constructed to track the behavior of the algorithm. Even though the function $f$ is not convex, below we prove that $T(x,\lambda)+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$ is strongly convex in $x$ with modulus $\beta-L$ when $\rho\ge\beta$ and $\beta\ge L$. First define $g(x,\lambda;x^r)=T(x,\lambda)-h(x)+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$, which is a smooth function. For this function we have

$$\begin{aligned}
\langle\nabla g(x,\lambda;x^r)-\nabla g(y,\lambda;x^r),\,x-y\rangle&=\langle\nabla f(x)-\nabla f(y)+\rho A^TA(x-y)+\beta B^TB(x-y),\,x-y\rangle\\
&\ge\langle\nabla f(x)-\nabla f(y),\,x-y\rangle+\beta\|x-y\|^2_{A^TA+B^TB}\\
&\ge-L\|x-y\|^2+\beta\|x-y\|^2_{A^TA+B^TB}\ \ge\ (\beta-L)\|x-y\|^2, \qquad (25)
\end{aligned}$$

where the first inequality is true because $\rho\ge\beta$, the second inequality uses the Lipschitz continuity of $\nabla f$, and the last inequality is true because we assumed $A^TA+B^TB\succeq I$. This proves that the function $g(x,\lambda;x^r)$ is strongly convex with modulus $\beta-L$ whenever $\beta\ge L$. Since $h(x)$ is assumed to be convex, we immediately conclude that $T(x,\lambda)+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$ is strongly convex with modulus $\beta-L$. The next lemma analyzes the change of $T$ over two successive iterations of the algorithm.

Lemma 2. Suppose that $\beta>3L$ and $\rho\ge\beta$. Then we have

$$T(x^{r+1},\lambda^{r+1})+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^{r+1}\|^2\le T(x^r,\lambda^r)+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^r\|^2+\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}\|\lambda^{r+1}-\lambda^r\|^2-\frac{\beta-3L}{2}\|x^{r+1}-x^r\|^2, \quad \forall\,r\ge0. \qquad (26)$$

Proof. It is easy to see that if $\beta>3L$, then the change of $x$ results in a reduction of $T$:

$$\begin{aligned}
T(x^{r+1},\lambda^r)-T(x^r,\lambda^r)&\overset{\rm(i)}{\le}\langle\nabla f(x^{r+1})+\xi^{r+1}+(1-\rho\gamma)A^T\lambda^r+\rho A^T(Ax^{r+1}-b)+\beta B^TB(x^{r+1}-x^r),\ x^{r+1}-x^r\rangle-\frac{\beta-L}{2}\|x^{r+1}-x^r\|^2\\
&\overset{\rm(ii)}{\le}-\frac{\beta-3L}{2}\|x^{r+1}-x^r\|^2, \qquad (27)
\end{aligned}$$

where (i) is true because, from (25), when $\beta>3L$, $\rho\ge\beta$ and $A^TA+B^TB\succeq I$, the function $T(x,\lambda)+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$ is strongly convex with modulus $\beta-L$; (ii) is true due to the optimality condition (17) of the $x$-subproblem and the assumption that $\nabla f$ is Lipschitz continuous. Second, let us analyze $T(x^{r+1},\lambda^{r+1})-T(x^{r+1},\lambda^r)$ as follows:

$$\begin{aligned}
T(x^{r+1},\lambda^{r+1})-T(x^{r+1},\lambda^r)&=(1-\rho\gamma)\langle\lambda^{r+1}-\lambda^r,\ Ax^{r+1}-b-\gamma\lambda^r\rangle-(1-\rho\gamma)\langle\gamma\lambda^{r+1}-\gamma\lambda^r,\ \lambda^{r+1}\rangle\\
&\overset{(3b),(13)}{=}\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}\|\lambda^{r+1}-\lambda^r\|^2-\frac{(1-\rho\gamma)\gamma}{2}\big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\big). \qquad (28)
\end{aligned}$$

Combining the previous two steps, we obtain the desired inequality in (26). Q.E.D.

Comparing the results of Lemma 1 and Lemma 2: from (16) we observe that the quantity $\frac{1-\rho\gamma}{2\rho}\|\lambda^{r+1}-\lambda^r\|^2+\frac{\beta}{2}\|x^{r+1}-x^r\|^2_{B^TB}$ is descending in the dual differences but ascending in the primal differences, while from (26) we can see that $T(x^{r+1},\lambda^{r+1})+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^{r+1}\|^2$ behaves in the opposite manner. Therefore, let us define the following potential function $P_c$ as a conic combination of these two quantities, chosen so that it descends at each iteration. For some $c>0$,

$$P_c(x^{r+1},\lambda^{r+1};x^r,\lambda^r):=T(x^{r+1},\lambda^{r+1})+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^{r+1}\|^2+c\Big(\frac{1-\rho\gamma}{2\rho}\|\lambda^{r+1}-\lambda^r\|^2+\frac{\beta}{2}\|x^{r+1}-x^r\|^2_{B^TB}+\frac{L}{2}\|x^{r+1}-x^r\|^2\Big). \qquad (29)$$

Then, according to the previous two lemmas, one can conclude that there are constants $a_1,a_2$ such that

$$P_c(x^{r+1},\lambda^{r+1};x^r,\lambda^r)-P_c(x^r,\lambda^r;x^{r-1},\lambda^{r-1})\le-a_1\|\lambda^{r+1}-\lambda^r\|^2-a_2\|x^{r+1}-x^r\|^2, \qquad (30)$$

where $a_1=c\gamma-\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}$ and $a_2=\frac{\beta-3L}{2}-cL$. Therefore, we can verify that in order to make the potential function $P_c$ decrease, it is sufficient to ensure that

$$c\gamma-\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}>0, \quad \text{and} \quad \beta>(3+2c)L. \qquad (31)$$

Therefore a sufficient condition is that

$$\tau:=\rho\gamma\in(0,1), \quad c>\frac{1}{\tau}-1>0, \quad \beta>(3+2c)L, \quad \rho\ge\beta. \qquad (32)$$

From this discussion we can see the necessity of having the perturbation parameter $\gamma>0$. In particular, if $\gamma=0$ the constant in front of $\|\lambda^{r+1}-\lambda^r\|^2$ in (30) would be $\frac{1}{\rho}$, which is always positive; therefore it would be difficult to construct a potential function that descends in the dual variable.

Next, let us show that the potential function $P_c$ is lower bounded when the parameters are chosen according to (32).

Lemma 3. Suppose Assumptions A are satisfied, and the algorithm parameters are chosen according to (32). Then the following statement holds true:

$$\exists\,\underline{P}\ \text{s.t.}\ P_c(x^{r+1},\lambda^{r+1};x^r,\lambda^r)\ge\underline{P}>-\infty, \quad \forall\,r\ge0. \qquad (33)$$

Proof. First, we analyze the terms related to $T(x^{r+1},\lambda^{r+1})$. The inner product term in (24) can be bounded as

$$\langle\lambda^{r+1}-\rho\gamma\lambda^{r+1},\ Ax^{r+1}-b-\gamma\lambda^{r+1}\rangle\overset{(3b),(13)}{=}\frac{(1-\rho\gamma)^2}{2\rho}\Big(\|\lambda^{r+1}-\lambda^r\|^2+\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\Big). \qquad (34)$$

Clearly, the constant in front of the right-hand side is positive. Taking the sum of $T(x^{r+1},\lambda^{r+1})$ over $R$ iterations, we obtain

$$\begin{aligned}
\sum_{r=1}^RT(x^{r+1},\lambda^{r+1})&=\sum_{r=1}^R\Big(f(x^{r+1})+h(x^{r+1})+\frac{\rho}{2}\|Ax^{r+1}-b\|^2\Big)+\frac{(1-\rho\gamma)^2}{2\rho}\Big(\|\lambda^{R+1}\|^2-\|\lambda^1\|^2+\sum_{r=1}^R\|\lambda^{r+1}-\lambda^r\|^2\Big)\\
&\ge\sum_{r=1}^R\Big(f(x^{r+1})+h(x^{r+1})+\frac{\rho}{2}\|Ax^{r+1}-b\|^2\Big)+\frac{(1-\rho\gamma)^2}{2\rho}\|\lambda^{R+1}\|^2-\frac{(1-\rho\gamma)^2}{2\rho}\|\lambda^1\|^2\\
&\ge-\frac{(1-\rho\gamma)^2}{2\rho}\|\lambda^1\|^2, \qquad (35)
\end{aligned}$$

where the last inequality comes from the fact that $f$ and $h$ are both assumed to be lower bounded by $0$. Since $\lambda^1$ is bounded, it follows that the sum of the $T(\cdot,\cdot)$ terms is lower bounded. From (35) we conclude that $\sum_{r=1}^RP_c(x^{r+1},\lambda^{r+1};x^r,\lambda^r)$ is also lower bounded by $-\frac{(1-\rho\gamma)^2}{2\rho}\|\lambda^1\|^2$ for any $R$, because besides the term $\sum_{r=1}^RT(x^{r+1},\lambda^{r+1})$, the remaining terms in (29) are all nonnegative. Combined with the fact that $P_c$ is nonincreasing, we conclude that the potential function is lower bounded; that is,

$$\underline{P}\ge-\frac{(1-\rho\gamma)^2}{2\rho}\|\lambda^1\|^2. \qquad (36)$$

This proves the claim. Q.E.D.

To present the main result on the convergence of PProx-PDA, we need the following notion of approximate stationary solutions for problem (1).

Definition 1 (Stationary solution). Consider problem (1). Given $\epsilon>0$, the tuple $(x^*,\lambda^*)$ is an $\epsilon$-stationary solution if $x^*\in X$ and the following holds:

$$\|Ax^*-b\|^2\le\epsilon, \qquad \langle\nabla f(x^*)+A^T\lambda^*+\xi^*,\ x-x^*\rangle\ge0, \quad \forall\,x\in X, \qquad (37)$$

where $\xi^*$ is some vector satisfying $\xi^*\in\partial h(x^*)$.

We note that an $\epsilon$-stationary solution slightly violates the constraint $Ax-b=0$. This definition is closely related to the approximate KKT (AKKT) condition in the existing literature [3, 19, 7]. It can be verified that when $X=\mathbb{R}^N$ and $h\equiv0$, the condition (37) satisfies the stopping criterion for reaching the AKKT condition, Eqs. (9)–(11) in [3]. We refer the readers to [3, Section 3.1] for a detailed discussion of the relationship between the AKKT and KKT conditions. We show below that by appropriately choosing the algorithm parameters, PProx-PDA converges to the set of approximate stationary solutions.

Theorem 1. Suppose Assumptions A hold. Further assume that the parameters $\gamma,\rho,\beta,c$ satisfy (32). Then the following are true for the sequence $(x^r,\lambda^r)$ generated by PProx-PDA:

1) The sequences $\{x^r\}$ and $\{\lambda^r\}$ are bounded, and $\|\lambda^{r+1}-\lambda^r\|\to0$, $\|x^{r+1}-x^r\|\to0$.

2) Let $(x^*,\lambda^*)$ denote any limit point of the sequence $(x^r,\lambda^r)$. Then $(x^*,\lambda^*)$ is a $(\gamma^2\|\lambda^*\|^2)$-stationary solution of problem (1).

Proof. Using the fact that the set $X$ is compact, we conclude that the sequence $\{x^r\}$ is bounded. Further, combining the bound (30) with the fact that the potential function $P_c$ is decreasing and lower bounded, we have

$$\|\lambda^{r+1}-\lambda^r\|\to0, \quad \|x^{r+1}-x^r\|\to0. \qquad (38)$$

Also, from the dual update (3b) we have $\lambda^{r+1}-\lambda^r=\rho(Ax^{r+1}-b-\gamma\lambda^r)$. Combining this with $\lambda^{r+1}-\lambda^r\to0$ and the boundedness of $\{x^r\}$, we can see that $\{\lambda^r\}$ is also bounded. This proves the first part.

In order to prove the second part, let $(x^*,\lambda^*)$ be any limit point of the sequence $(x^r,\lambda^r)$. From (3b) we have $\lambda^{r+1}-\lambda^r=\rho(Ax^{r+1}-b-\gamma\lambda^r)$. Combining this with (38) we obtain

$$Ax^*-b-\gamma\lambda^*=0. \qquad (39)$$

Thus we have $\|Ax^*-b\|^2=\gamma^2\|\lambda^*\|^2$, which proves the first inequality in (37). To elaborate on the boundedness claims: since $X$ is assumed to be convex and compact, $\{x^r\}$ is bounded; to see the boundedness of $\{\lambda^r\}$, note that (3b) also implies

$$(1-\rho\gamma)(\lambda^{r+1}-\lambda^r)=\rho(Ax^{r+1}-b)-\rho\gamma\lambda^{r+1}, \qquad (40)$$

and from the boundedness of $x^{r+1}$ and of $\lambda^{r+1}-\lambda^r$, we conclude that $\lambda^{r+1}$ is bounded.

To show the second claim in (37), from the optimality condition (18) we have

$$\langle\nabla f(x^r)+(1-\rho\gamma)A^T\lambda^r+\rho A^T(Ax^{r+1}-b)+\beta B^TB(x^{r+1}-x^r),\ x^{r+1}-x\rangle\le\langle\xi^{r+1},\ x-x^{r+1}\rangle, \quad \forall\,x\in X. \qquad (41)$$

From the convexity of the function $h$ we have, for all $x\in X$, $\langle\xi^{r+1},\,x-x^{r+1}\rangle\le h(x)-h(x^{r+1})$. Plugging this inequality into (41), using the update equation (3b), and rearranging the terms, we obtain

$$h(x^{r+1})+\langle\nabla f(x^r)+A^T\lambda^{r+1}+\beta B^TB(x^{r+1}-x^r),\ x^{r+1}-x\rangle\le h(x), \quad \forall\,x\in X. \qquad (42)$$

Further, adding and subtracting $x^r$ in $x^{r+1}-x$ and utilizing relation (13), we get

$$h(x^{r+1})+\langle\nabla f(x^r),\,x^{r+1}-x^r\rangle+\langle\lambda^{r+1},\,Ax^{r+1}\rangle+\frac{\beta}{2}\|B(x^{r+1}-x^r)\|^2\le h(x)+\langle\nabla f(x^r),\,x-x^r\rangle+\langle\lambda^{r+1},\,Ax\rangle+\frac{\beta}{2}\|B(x-x^r)\|^2, \quad \forall\,x\in X.$$

Let $(x^*,\lambda^*)$ be a limit point of the sequence $\{x^{r+1},\lambda^{r+1}\}$. Passing to the limit, and using the fact that $x^{r+1}-x^r\to0$, we have

$$h(x^*)+\langle\lambda^*,\,Ax^*\rangle\le h(x)+\langle\nabla f(x^*),\,x-x^*\rangle+\langle\lambda^*,\,Ax\rangle+\frac{\beta}{2}\|B(x-x^*)\|^2, \quad \forall\,x\in X.$$

The above inequality suggests that $x=x^*$ minimizes the right-hand side. In particular, we have

$$x^*=\arg\min_{x\in X}\ h(x)+\langle\nabla f(x^*),\,x-x^*\rangle+\langle\lambda^*,\,Ax\rangle+\frac{\beta}{2}\|B(x-x^*)\|^2. \qquad (43)$$

The optimality condition of this problem is precisely

$$\langle\nabla f(x^*)+A^T\lambda^*+\xi^*,\ x-x^*\rangle\ge0, \quad \forall\,x\in X, \qquad (44)$$

for some $\xi^*\in\partial h(x^*)$. Q.E.D.

2.1 The Choice of the Perturbation Parameter

In this subsection, we discuss how to obtain an $\epsilon$-stationary solution. First, note that Theorem 1 indicates that if the sequence $\{\lambda^r\}$ is bounded, and the bound is independent of the choice of the parameters $\gamma,\rho,\beta,c$, then one can choose $\gamma=O(\sqrt{\epsilon})$ to reach an $\epsilon$-stationary solution. Such boundedness of $\lambda$ can be ensured by assuming a certain constraint qualification (CQ) at $(x^*,\lambda^*)$; see the related discussion in the Appendix. In the rest of this subsection, we take an alternative approach to arguing for an $\epsilon$-stationary solution. Our general strategy is to let $1/\rho$ and $\gamma$ be proportional to the accuracy parameter $\epsilon$, while keeping $\tau=\rho\gamma\in(0,1)$ and $c$ fixed to some $\epsilon$-independent constants. Let us define the following constants for problem (1):

$$d_1=\max_{x\in X}\|Ax-b\|, \quad d_2=\max_{x,y\in X}\|x-y\|^2, \quad d_3=\max_{x,y\in X}\|x-y\|^2_{B^TB}, \quad d_4=\max_{x\in X}\big(f(x)+h(x)\big). \qquad (45)$$

The lemma below provides a parameter-independent bound for $\frac{\rho}{2}\|Ax^1-b\|^2$.

Lemma 4. Suppose $\lambda^0=0$, $Ax^0=b$, $\rho\ge\beta$, and $\beta-3L>0$. Then we have

$$\frac{\rho}{2}\|Ax^1-b\|^2\le d_4, \qquad \frac{\beta}{2}\|x^1-x^0\|^2\le d_4+\frac{3L}{2}d_2. \qquad (46)$$

Proof. From Lemma 2, using the choices of $x^0$ and $\lambda^0$, we obtain

$$T(x^1,\lambda^1)+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^1\|^2+\frac{\beta-3L}{2}\|x^1-x^0\|^2\le T(x^0,\lambda^0)+\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}\|\lambda^1\|^2.$$

Utilizing the definition of $T(x,\lambda)$ in (24) together with (34), we obtain

$$T(x^1,\lambda^1)=f(x^1)+h(x^1)+\frac{(1-\rho\gamma)^2}{\rho}\|\lambda^1\|^2+\frac{\rho}{2}\|Ax^1-b\|^2, \qquad T(x^0,\lambda^0)=f(x^0)+h(x^0).$$

Combining the above, we obtain

$$\Big(\frac{(1-\rho\gamma)\gamma}{2}+\frac{(1-\rho\gamma)^2}{\rho}-\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}\Big)\|\lambda^1\|^2+\frac{\rho}{2}\|Ax^1-b\|^2+\frac{\beta-3L}{2}\|x^1-x^0\|^2\le T(x^0,\lambda^0)-f(x^1)-h(x^1).$$

By a simple calculation we can show that the coefficient of $\|\lambda^1\|^2$ vanishes:

$$\frac{(1-\rho\gamma)\gamma}{2}+\frac{(1-\rho\gamma)^2}{\rho}-\frac{(1-\rho\gamma)(2-\rho\gamma)}{2\rho}=\frac{1-\rho\gamma}{2\rho}\big(\rho\gamma+2(1-\rho\gamma)-(2-\rho\gamma)\big)=0.$$

By using the assumptions $f(x^1)\ge0$, $h(x^1)\ge0$, together with $T(x^0,\lambda^0)=f(x^0)+h(x^0)\le d_4$, it follows that

$$\frac{\beta-3L}{2}\|x^1-x^0\|^2\le d_4, \qquad \frac{\rho}{2}\|Ax^1-b\|^2\le d_4. \qquad (47)$$

This leads to the desired claim. Q.E.D.

Combining Lemma 4 with the dual update (3b) (recall $\lambda^0=0$, so $\lambda^1=\rho(Ax^1-b)$), we conclude that

$$\frac{1}{2\rho}\|\lambda^1\|^2=\frac{\rho}{2}\|Ax^1-b\|^2\le d_4. \qquad (48)$$

Next, we derive an upper bound for the initial potential function $P_c(x^1,\lambda^1;x^0,\lambda^0)$. Assuming that $Ax^0=b$ and $\lambda^0=0$, we have

$$\begin{aligned}
P_c(x^1,\lambda^1;x^0,\lambda^0)&\overset{(29)}{=}T(x^1,\lambda^1)+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^1\|^2+c\Big(\frac{1-\rho\gamma}{2\rho}\|\lambda^1\|^2+\frac{\beta}{2}\|x^1-x^0\|^2_{B^TB}+\frac{L}{2}\|x^1-x^0\|^2\Big)\\
&\overset{(24),(34)}{=}f(x^1)+h(x^1)+\frac{(1-\rho\gamma)^2}{\rho}\|\lambda^1\|^2+\frac{\rho}{2}\|Ax^1-b\|^2+\frac{(1-\rho\gamma)\gamma}{2}\|\lambda^1\|^2+c\Big(\frac{1-\rho\gamma}{2\rho}\|\lambda^1\|^2+\frac{\beta}{2}\|x^1-x^0\|^2_{B^TB}+\frac{L}{2}\|x^1-x^0\|^2\Big)\\
&\overset{(46),(48)}{\le}\big[2+(1-\rho\gamma)^2+(1-\rho\gamma)(c+\rho\gamma)\big]d_4+c\Big(\sigma_{\max}(B^TB)\big(d_4+\tfrac{3L}{2}d_2\big)+\tfrac{L}{2}d_2\Big):=P_c^0. \qquad (49)
\end{aligned}$$

It is important to note that $P_c^0$ does not depend on $\rho$, $\gamma$, $\beta$ individually, but only on $\rho\gamma$ and $c$, both of which can be chosen as absolute constants. The next lemma bounds the size of $\lambda^{r+1}$.

Lemma 5. Suppose that $(\rho,\gamma,\beta)$ are chosen according to (32), and that the assumptions of Lemma 4 hold true. Then the following holds for all $r\ge0$:

$$\frac{\gamma(1-\rho\gamma)}{2}\|\lambda^r\|^2\le P_c^0. \qquad (50)$$

Proof. We use induction to prove the lemma. The initial step $r=0$ is clearly true. In the inductive step we assume that

$$\frac{\gamma(1-\rho\gamma)}{2}\|\lambda^r\|^2\le P_c^0 \quad \text{for some}\ r\ge1. \qquad (51)$$

Using the fact that the potential function is decreasing (cf. (30)), we have

$$P_c(x^{r+1},\lambda^{r+1};x^r,\lambda^r)\le P_c(x^1,\lambda^1;x^0,\lambda^0)\le P_c^0. \qquad (52)$$

Combining (52) with (34), and using the definition of the $P_c$ function in (29) (all of whose remaining terms are nonnegative), we obtain

$$\frac{(1-\rho\gamma)^2}{2\rho}\big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\big)+\frac{\gamma(1-\rho\gamma)}{2}\|\lambda^{r+1}\|^2\le P_c^0. \qquad (53)$$

If $\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\ge0$, then (53) directly gives

$$\frac{\gamma(1-\rho\gamma)}{2}\|\lambda^{r+1}\|^2\le P_c^0.$$

If $\|\lambda^{r+1}\|^2-\|\lambda^r\|^2<0$, then we have

$$\frac{\gamma(1-\rho\gamma)}{2}\|\lambda^{r+1}\|^2<\frac{\gamma(1-\rho\gamma)}{2}\|\lambda^r\|^2\le P_c^0,$$

where the second inequality comes from the induction hypothesis (51). This concludes the proof of (50). Q.E.D.

From Lemma 5, and the fact that $\rho\gamma=\tau\in(0,1)$, we have

$$\gamma\|\lambda^{r+1}\|^2\le\frac{2}{1-\tau}P_c^0, \quad \forall\,r\ge0. \qquad (54)$$

Therefore, we get

$$\gamma^2\|\lambda^{r+1}\|^2\le\frac{2P_c^0\,\gamma}{1-\tau}, \quad \text{for all}\ r\ge0. \qquad (55)$$

Also note that $\rho$ and $\beta$ should satisfy (32), restated below:

$$\tau:=\rho\gamma\in(0,1), \quad c>\frac{1}{\tau}-1>0, \quad \beta>(3+2c)L, \quad \rho\ge\beta. \qquad (56)$$

Combining the above results, we have the following corollary about the choice of parameters that achieves an $\epsilon$-stationary solution.

Corollary 1. Consider the following choices of the algorithm parameters:

$$\gamma=\min\Big\{\epsilon,\ \frac{1}{\beta}\Big\}, \quad \rho=\frac{1}{2}\max\Big\{\beta,\ \frac{1}{\epsilon}\Big\}, \quad \beta>7L, \quad c=2. \qquad (57)$$

Further suppose Assumptions A are satisfied, and that $Ax^0=b$, $\lambda^0=0$. Then the sequence of dual variables $\{\lambda^r\}$ lies in a bounded set, and $\|\lambda^{r+1}-\lambda^r\|\to0$, $\|x^{r+1}-x^r\|\to0$. Further, every limit point generated by the PProx-PDA algorithm is an $\epsilon$-stationary solution.

Proof. Using the parameters in (57), we have the following relations:

$$\tau=\rho\gamma=\frac{1}{2}, \qquad \frac{\gamma}{1-\rho\gamma}\le2\epsilon,$$

where the first equation is true regardless of whether or not $\epsilon\le1/\beta$. Then we can bound $P_c^0$ as follows:

$$P_c^0=\big[2+(1-\rho\gamma)^2+(1-\rho\gamma)(c+\rho\gamma)\big]d_4+c\Big(\sigma_{\max}(B^TB)\big(d_4+\tfrac{3L}{2}d_2\big)+\tfrac{L}{2}d_2\Big)\le\big(6+2\sigma_{\max}(B^TB)\big)d_4+\big(3\sigma_{\max}(B^TB)L+L\big)d_2.$$

Therefore, using (55), we conclude that

$$\gamma^2\|\lambda^{r+1}\|^2\le\frac{2P_c^0\,\gamma}{1-\rho\gamma}\le4\Big(\big(6+2\sigma_{\max}(B^TB)\big)d_4+\big(3\sigma_{\max}(B^TB)L+L\big)d_2\Big)\epsilon.$$

Note that the constant in front of $\epsilon$ does not depend on the algorithm parameters. This implies that $\gamma^2\|\lambda^{r+1}\|^2=O(\epsilon)$. Q.E.D.
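As a small illustration, the parameter choices (57) can be packaged as in the following sketch; the margin factor placed on $\beta$ is our own choice, since (57) only requires $\beta>7L$.

```python
def pprox_pda_params(L, eps, beta=None):
    """Parameter choices from (57): tau = rho * gamma = 1/2, c = 2.
    Returns (gamma, rho, beta, c)."""
    if beta is None:
        beta = 7.0 * L * 1.01              # any beta > 7L works
    gamma = min(eps, 1.0 / beta)
    rho = 0.5 * max(beta, 1.0 / eps)       # ensures rho * gamma = 1/2
    return gamma, rho, beta, 2
```

Note that whichever branch of the min/max is active, the product $\rho\gamma$ equals $1/2$, matching the first relation used in the proof of Corollary 1.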

Remark 1. First, in the above result the $\epsilon$-stationary solution is obtained by imposing the additional assumption that the initial solution is feasible for the linear constraint (i.e., $Ax^0=b$) and that $\lambda^0=0$. Admittedly, obtaining a feasible initial solution can be challenging in general, but for problems such as the distributed optimization problem (9) and the subspace estimation problem (4), finding a feasible $x^0$ is relatively easy. For the former, the agents can either agree on a trivial solution (such as $x_i=x_j=0$) or run an average-consensus-based algorithm such as [67] to reach consensus; for the latter, one can simply set $\Pi=\Phi=0$. Second, the penalty parameter could be large because it is inversely proportional to the accuracy. Having a large penalty parameter from the beginning can make the algorithm progress slowly. In practice one can start with a smaller $\rho$ and gradually increase it until reaching the predefined threshold. Following this idea, in the next section we will design an algorithm that allows $\rho$ to increase unboundedly, such that in the limit the exact first-order stationary solutions can be obtained.

2.2 Convergence Rate Analysis

In this subsection we briefly discuss the convergence rate of the algorithm. To begin with, assume that the parameters are chosen according to (32), with $Ax^0=b$ and $\lambda^0=0$. Also, we will choose $1/\rho$ and $\gamma$ proportional to a certain accuracy parameter, while keeping $\tau=\rho\gamma\in(0,1)$ and $c$ fixed to some absolute constants. To proceed, let us define the following quantities:

$$H(x^r,\lambda^r):=f(x^r)+h(x^r)+\langle\lambda^r,\,Ax^r-b\rangle, \qquad (58a)$$
$$G(x^r,\lambda^r):=\|\tilde\nabla H(x^r,\lambda^r)\|^2+\frac{1}{\rho^2}\|\lambda^{r+1}-\lambda^r\|^2, \qquad (58b)$$
$$Q(x^r,\lambda^r):=\|\tilde\nabla H(x^r,\lambda^r)\|^2+\|Ax^r-b\|^2, \qquad (58c)$$

where $\tilde\nabla H(x^r,\lambda^r)$ is the proximal gradient defined as

$$\tilde\nabla H(x^r,\lambda^r):=x^r-\mathrm{prox}^{\beta}_{h+\iota_X}\Big[x^r-\frac{1}{\beta}\nabla\big(H(x^r,\lambda^r)-h(x^r)\big)\Big]. \qquad (59)$$

It can be checked that $Q(x^r,\lambda^r)\to0$ if and only if a stationary solution of problem (1) is obtained. Therefore we say that a $\theta$-stationary solution is obtained if $Q(x^r,\lambda^r)\le\theta$. Note that this notion of $\theta$-stationarity has been used in [37] for characterizing the rate of the ADMM method. Compared with the $\epsilon$-stationary solution defined in Definition 1, its progress is easier to quantify. Using the definition of the proximity operator, the optimality condition of the $x$-subproblem (3a) can be equivalently written as

$$x^{r+1}=\mathrm{prox}^{\beta}_{h+\iota_X}\Big[x^{r+1}-\frac{1}{\beta}\big[\nabla f(x^r)+A^T\lambda^{r+1}+\beta B^TB(x^{r+1}-x^r)\big]\Big].$$

Using the nonexpansiveness of the prox operator, we obtain the following:

$$\begin{aligned}
\|\tilde\nabla H(x^r,\lambda^r)\|^2&=\Big\|x^r-\mathrm{prox}^{\beta}_{h+\iota_X}\Big[x^r-\frac{1}{\beta}\nabla\big(H(x^r,\lambda^r)-h(x^r)\big)\Big]\Big\|^2\\
&=\Big\|x^r-x^{r+1}+\mathrm{prox}^{\beta}_{h+\iota_X}\Big[x^{r+1}-\frac{1}{\beta}\big[\nabla f(x^r)+A^T\lambda^{r+1}+\beta B^TB(x^{r+1}-x^r)\big]\Big]-\mathrm{prox}^{\beta}_{h+\iota_X}\Big[x^r-\frac{1}{\beta}\nabla\big(H(x^r,\lambda^r)-h(x^r)\big)\Big]\Big\|^2\\
&\le2\|x^{r+1}-x^r\|^2+\frac{4}{\beta^2}\|A^T(\lambda^{r+1}-\lambda^r)\|^2+4\|(I-B^TB)(x^{r+1}-x^r)\|^2\\
&\le\big(2+4\sigma_{\max}(\hat B^T\hat B)\big)\|x^{r+1}-x^r\|^2+\frac{4\sigma_{\max}(A^TA)}{\beta^2}\|\lambda^{r+1}-\lambda^r\|^2,
\end{aligned}$$

where in the last inequality we defined $\hat B:=I-B^TB$. Therefore,

$$G(x^r,\lambda^r)\le b_1\|\lambda^{r+1}-\lambda^r\|^2+b_2\|x^{r+1}-x^r\|^2, \qquad (60)$$

where $b_1=\frac{4\sigma_{\max}(A^TA)}{\beta^2}+\frac{1}{\rho^2}$ and $b_2=2+4\sigma_{\max}(\hat B^T\hat B)$ are positive constants. Combining (60) with the descent estimate (30) for the potential function $P_c$, we obtain

$$G(x^r,\lambda^r)\le V\big[P_c(x^r,\lambda^r;x^{r-1},\lambda^{r-1})-P_c(x^{r+1},\lambda^{r+1};x^r,\lambda^r)\big], \qquad (61)$$

where we have defined $V:=\frac{\max(b_1,b_2)}{\min(a_1,a_2)}$; one can check that $V$ is of order $O(1/\gamma)$ because $a_1$ is of order $\gamma$; cf. (30). From part 1) of Theorem 1 and equation (60) we conclude that $G(x^r,\lambda^r)\to0$. Let $R$ denote the first time that $G(x^{r+1},\lambda^{r+1})$ reaches below a given number $\theta>0$. Summing both sides of (61) over $R$ iterations, and utilizing the fact that $P_c$ is lower bounded by $\underline{P}$, it follows that

$$\theta\le\frac{V\big(P_c^0-\underline{P}\big)}{R}\overset{(36)}{\le}\frac{V\Big(P_c^0+\frac{(1-\rho\gamma)^2}{2\rho}\|\lambda^1\|^2\Big)}{R}\overset{(48)}{\le}\frac{V\big(P_c^0+(1-\tau)^2d_4\big)}{R},$$

where $d_4$ is given in (45) and $P_c^0$ is given in (49). Note that $G(x^{r+1},\lambda^{r+1})\le\theta$ implies that

$$\frac{1}{\rho^2}\|\lambda^{r+1}-\lambda^r\|^2=\|Ax^{r+1}-b-\gamma\lambda^r\|^2\le\theta.$$

From (50) we have

$$\gamma^2\|\lambda^r\|^2\le\frac{2P_c^0\,\gamma}{1-\rho\gamma}, \quad \forall\,r\ge0.$$

It follows that

$$\|Ax^{r+1}-b\|^2\le2\Big\|\frac{\lambda^{r+1}-\lambda^r}{\rho}\Big\|^2+2\|\gamma\lambda^r\|^2\le2\theta+\frac{4P_c^0\,\gamma}{1-\rho\gamma}.$$

It follows that whenever $G(x^r,\lambda^r)\le\theta$ we have

$$Q(x^r,\lambda^r):=\|\tilde\nabla H(x^r,\lambda^r)\|^2+\|Ax^r-b\|^2\le\theta+2\theta+\frac{4P_c^0\,\gamma}{1-\rho\gamma}. \qquad (62)$$

Let us pick the parameters such that they satisfy (32), (57), and additionally

$$\frac{2P_c^0\,\gamma}{1-\rho\gamma}=\frac{2P_c^0\,\gamma}{1-\tau}=\theta.$$

Then whenever $G(x^r,\lambda^r)\le\theta$ we have $Q(x^r,\lambda^r)\le5\theta$. It follows that the total number of iterations it takes for $Q(x^r,\lambda^r)$ to reach below $5\theta$ is given by

$$R\le\frac{V\big(P_c^0+(1-\tau)^2d_4\big)}{\theta}=O\Big(\frac{1}{\theta^2}\Big), \qquad (63)$$

where the last relation holds because $V$ is of order $O(1/\gamma)$, $\gamma$ is chosen of order $O(\theta)$, and $P_c^0$, $d_4$, and $\tau$ do not depend on the problem accuracy. The result below summarizes our discussion.

Corollary 2. Suppose that $Ax^0=b$ and $\lambda^0=0$. Additionally, for a given $\theta>0$ and $\tau\in(0,1)$, choose $\gamma,\rho,c,\beta$ as follows:

$$\gamma=\frac{\theta(1-\tau)}{2P_c^0}, \quad \rho=\frac{\tau}{\gamma}, \quad c>\frac{1}{\tau}-1, \quad \rho\ge\beta, \quad \beta>(3+2c)L.$$

Let $R$ denote the first iteration at which $Q(x^r,\lambda^r)$ reaches below $5\theta$. Then we have $R=O(1/\theta^2)$.
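The stationarity measure (58c) used in Corollary 2 is straightforward to monitor numerically. Below is a small sketch for the unconstrained $\ell_1$ case ($X=\mathbb{R}^N$, $h=\nu\|\cdot\|_1$), using the standard proximal-gradient scaling for (59); `grad_f` and `nu` are illustrative assumptions, not names from the manuscript.

```python
import numpy as np

def prox_grad_residual(x, lam, grad_f, A, beta, nu):
    """Proximal gradient (59) for h = nu * ||.||_1 and X = R^N:
    x - prox[x - (1/beta) * gradient of the smooth part of H]."""
    g = grad_f(x) + A.T @ lam                  # grad of H - h at x (the b-term drops)
    u = x - g / beta
    prox_u = np.sign(u) * np.maximum(np.abs(u) - nu / beta, 0.0)
    return x - prox_u

def Q_measure(x, lam, grad_f, A, b, beta, nu):
    """Stationarity measure Q(x, lam) from (58c)."""
    r = prox_grad_residual(x, lam, grad_f, A, beta, nu)
    return np.linalg.norm(r) ** 2 + np.linalg.norm(A @ x - b) ** 2
```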

3 An Algorithm with Increasing Accuracy

So far we have shown that PProx-PDA converges to the set of approximate stationary solutions when the problem parameters are properly chosen. The inaccuracy of the algorithm can be attributed to the use of the perturbation parameter $\gamma$. Is it possible to gradually reduce the perturbation so that asymptotically the algorithm reaches the exact stationary solutions? Is it possible to avoid using a very large penalty parameter $\rho$ at the beginning of the algorithm? This section designs an algorithm that addresses these questions. We consider a modified algorithm in which the parameters $(\rho,\beta,\gamma)$ are iteration-dependent. In particular, we choose $\rho^{r+1}$, $\beta^{r+1}$, and $1/\gamma^{r+1}$ to be increasing sequences. The new algorithm, named PProx-PDA with increasing accuracy (PProx-PDA-IA), is listed in Algorithm 2. Below we analyze the convergence of the new algorithm.

Algorithm 2: PProx-PDA with increasing accuracy (PProx-PDA-IA)
Initialize: $\lambda^0$ and $x^0$
Repeat: update variables by

$$x^{r+1}=\arg\min_{x\in X}\Big\{\langle\nabla f(x^r),\,x-x^r\rangle+h(x)+\langle(1-\rho^{r+1}\gamma^{r+1})\lambda^r,\,Ax-b\rangle+\frac{\rho^{r+1}}{2}\|Ax-b\|^2+\frac{\beta^{r+1}}{2}\|x-x^r\|^2_{B^TB}\Big\} \qquad (64a)$$
$$\lambda^{r+1}=(1-\rho^{r+1}\gamma^{r+1})\lambda^r+\rho^{r+1}\big(Ax^{r+1}-b\big). \qquad (64b)$$

Until convergence.

Besides assuming that the optimization problem under consideration satisfies Assumptions A, we make the following additional assumptions:

Assumptions B

B1. Assume that $\rho^{r+1}\gamma^{r+1}=\tau\in(0,1)$ for some fixed constant $\tau$.

B2. The sequence $\{\rho^r\}$ satisfies

$$\rho^{r+1}\to\infty, \quad \sum_{r=1}^\infty\frac{1}{\rho^{r+1}}=\infty, \quad \sum_{r=1}^\infty\frac{1}{(\rho^{r+1})^2}<\infty, \quad \rho^{r+1}-\rho^r=D>0, \quad \text{for some}\ D>0.$$

A simple choice is $\rho^{r+1}=r+1$. Similarly, the sequence $\{\gamma^{r+1}\}$ satisfies

$$\gamma^{r+1}-\gamma^r\to0, \quad \gamma^{r+1}\to0, \quad \sum_{r=1}^\infty\gamma^{r+1}=\infty, \quad \sum_{r=1}^\infty(\gamma^{r+1})^2<\infty. \qquad (65)$$

B3. Assume that there exists $c_0>1$ such that

$$\beta^{r+1}=c_0\rho^{r+1}, \quad \text{for}\ r\ \text{large enough}. \qquad (66)$$

B4. There exists $\Lambda>0$ such that $\|\lambda^r\|\le\Lambda$ for every $r>0$.

We note that Assumption B4 is somewhat restrictive because it depends on the iterates. In the Appendix we will show that such an assumption can be satisfied under some additional regularity conditions on problem (1). We choose to state B4 here to avoid a lengthy discussion of those regularity conditions before the main convergence analysis.
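For concreteness, one schedule satisfying Assumptions B1–B3 is sketched below, with the illustrative choices $\rho^{r+1}=r+1$ (so $D=1$), $\tau=1/2$, and $c_0=2$; these specific values are our own example.

```python
def pprox_pda_ia_schedule(r, tau=0.5, c0=2.0):
    """Iteration-dependent parameters for Algorithm 2 (r = 0, 1, 2, ...).
    rho grows linearly (B2), gamma = tau / rho keeps rho * gamma = tau (B1),
    and beta = c0 * rho (B3)."""
    rho = float(r + 1)
    gamma = tau / rho
    beta = c0 * rho
    return rho, gamma, beta
```

With this schedule, $\sum_r\gamma^{r+1}=\tau\sum_r 1/(r+1)$ diverges while $\sum_r(\gamma^{r+1})^2$ converges, exactly as required by (65).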

The main idea of the proof is similar to that of Theorem 1: we first construct a certain potential function and show that, with appropriate choices of the algorithm parameters, it eventually decreases. Similar to Lemma 1, our first step utilizes the optimality conditions of two consecutive iterates to analyze the change of the primal and dual differences.

Lemma 6. Suppose that Assumptions A and B1–B3 hold true, and that $\tau,D$ are the constants defined in Assumptions B. Then for $r$ large enough, there exists a constant $C_1$ such that

$$\begin{aligned}
&\frac{1-\tau}{2}\|\lambda^{r+1}-\lambda^r\|^2+\frac{\tau}{2}\Big(\frac{\rho^{r+1}}{\rho^{r+2}}-1\Big)\|\lambda^{r+1}\|^2+\frac{\beta^{r+1}\rho^{r+1}}{2}\|x^{r+1}-x^r\|^2_{B^TB}+\frac{\rho^{r+1}L}{2}\|x^{r+1}-x^r\|^2\\
&\le\frac{1-\tau}{2}\|\lambda^r-\lambda^{r-1}\|^2+\frac{\tau}{2}\Big(\frac{\rho^r}{\rho^{r+1}}-1\Big)\|\lambda^r\|^2+\frac{\beta^r\rho^r}{2}\|x^r-x^{r-1}\|^2_{B^TB}+\frac{\rho^rL}{2}\|x^r-x^{r-1}\|^2\\
&\quad-\frac{\tau}{2}\|\lambda^{r+1}-\lambda^r\|^2+C_1(\gamma^{r+1})^2\|\lambda^{r+1}\|^2+\Big(\frac{L\rho^r}{2}+D\big(L+\beta^{r+1}\sigma_{\max}(B^TB)\big)\Big)\|x^{r+1}-x^r\|^2-\frac{\beta^r\rho^r}{2}\|w^r\|^2_{B^TB}. \qquad (67)
\end{aligned}$$

Proof. Suppose that $\xi^{r+1}\in\partial h(x^{r+1})$. From the optimality condition of the $x$-subproblem (64a), we have for all $x\in\mathrm{dom}(h)$

$$\langle\nabla f(x^r)+A^T\lambda^{r+1}+\beta^{r+1}B^TB(x^{r+1}-x^r)+\xi^{r+1},\ x^{r+1}-x\rangle\le0.$$

Writing the same inequality for the $(r-1)$th iteration, we have

$$\langle\nabla f(x^{r-1})+A^T\lambda^r+\beta^rB^TB(x^r-x^{r-1})+\xi^r,\ x^r-x\rangle\le0, \quad \forall\,x\in\mathrm{dom}(h).$$

Plugging $x=x^r$ into the first inequality and $x=x^{r+1}$ into the second, adding them together, and using the fact that $h$ is convex, we obtain

$$\langle\nabla f(x^r)-\nabla f(x^{r-1})+A^T(\lambda^{r+1}-\lambda^r)+\beta^{r+1}B^TB(x^{r+1}-x^r)-\beta^rB^TB(x^r-x^{r-1}),\ x^{r+1}-x^r\rangle\le0. \qquad (68)$$

Let us analyze the above inequality term by term. First, using Young's inequality and the assumption (15) that $f$ is $L$-smooth, we have

$$\langle\nabla f(x^{r-1})-\nabla f(x^r),\ x^{r+1}-x^r\rangle\le\frac{L}{2}\|x^{r+1}-x^r\|^2+\frac{L}{2}\|x^r-x^{r-1}\|^2.$$

Second, note that we have

$$\begin{aligned}
\langle A^T(\lambda^{r+1}-\lambda^r),\ x^{r+1}-x^r\rangle&=\langle\lambda^{r+1}-\lambda^r,\ A(x^{r+1}-x^r)\rangle\\
&=\big\langle\lambda^{r+1}-\lambda^r,\ (Ax^{r+1}-b-\gamma^{r+1}\lambda^r)-(Ax^r-b-\gamma^r\lambda^{r-1})+\gamma^{r+1}\lambda^r-\gamma^r\lambda^{r-1}\big\rangle\\
&\overset{(64b)}{=}\Big\langle\lambda^{r+1}-\lambda^r,\ \frac{\lambda^{r+1}-\lambda^r}{\rho^{r+1}}-\frac{\lambda^r-\lambda^{r-1}}{\rho^r}\Big\rangle+\gamma^r\langle\lambda^{r+1}-\lambda^r,\ \lambda^r-\lambda^{r-1}\rangle+(\gamma^{r+1}-\gamma^r)\langle\lambda^{r+1}-\lambda^r,\ \lambda^r\rangle\\
&=\frac{1}{\rho^r}\big\langle\lambda^{r+1}-\lambda^r,\ (\lambda^{r+1}-\lambda^r)-(\lambda^r-\lambda^{r-1})\big\rangle+\Big(\frac{1}{\rho^{r+1}}-\frac{1}{\rho^r}\Big)\|\lambda^{r+1}-\lambda^r\|^2\\
&\qquad+\gamma^r\langle\lambda^{r+1}-\lambda^r,\ \lambda^r-\lambda^{r-1}\rangle+(\gamma^{r+1}-\gamma^r)\langle\lambda^{r+1}-\lambda^r,\ \lambda^r\rangle\\
&\overset{(13)}{=}\Big(\frac{1}{\rho^r}-\gamma^r\Big)\frac{1}{2}\Big(\|\lambda^{r+1}-\lambda^r\|^2-\|\lambda^r-\lambda^{r-1}\|^2+\|(\lambda^{r+1}-\lambda^r)-(\lambda^r-\lambda^{r-1})\|^2\Big)\\
&\qquad+\Big(\gamma^r+\frac{1}{\rho^{r+1}}-\frac{1}{\rho^r}\Big)\|\lambda^{r+1}-\lambda^r\|^2+\frac{\gamma^{r+1}-\gamma^r}{2}\Big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2-\|\lambda^{r+1}-\lambda^r\|^2\Big).
\end{aligned}$$

Summarizing, we have

$$\begin{aligned}
\langle A^T(\lambda^{r+1}-\lambda^r),\ x^{r+1}-x^r\rangle&\ge\frac{1}{2}\Big(\frac{1}{\rho^r}-\gamma^r\Big)\Big(\|\lambda^{r+1}-\lambda^r\|^2-\|\lambda^r-\lambda^{r-1}\|^2\Big)+\frac{\gamma^{r+1}-\gamma^r}{2}\Big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\Big)\\
&\qquad+\Big(\frac{1}{\rho^{r+1}}-\frac{1}{\rho^r}+\gamma^r-\frac{\gamma^{r+1}-\gamma^r}{2}\Big)\|\lambda^{r+1}-\lambda^r\|^2\\
&\overset{\rm(B1)}{=}\frac{1}{2}\Big(\frac{1}{\rho^r}-\gamma^r\Big)\Big(\|\lambda^{r+1}-\lambda^r\|^2-\|\lambda^r-\lambda^{r-1}\|^2\Big)+\frac{\gamma^{r+1}-\gamma^r}{2}\Big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\Big)\\
&\qquad+\Big(\gamma^r-\Big(\frac{1}{\tau}-\frac{1}{2}\Big)(\gamma^r-\gamma^{r+1})\Big)\|\lambda^{r+1}-\lambda^r\|^2.
\end{aligned}$$

Third, notice that

$$\begin{aligned}
\langle\beta^{r+1}B^TB(x^{r+1}-x^r)-\beta^rB^TB(x^r-x^{r-1}),\ x^{r+1}-x^r\rangle&\overset{(13)}{=}(\beta^{r+1}-\beta^r)\|x^{r+1}-x^r\|^2_{B^TB}+\frac{\beta^r}{2}\Big(\|x^{r+1}-x^r\|^2_{B^TB}-\|x^r-x^{r-1}\|^2_{B^TB}+\|w^r\|^2_{B^TB}\Big)\\
&=\frac{\beta^{r+1}}{2}\|x^{r+1}-x^r\|^2_{B^TB}-\frac{\beta^r}{2}\|x^r-x^{r-1}\|^2_{B^TB}+\frac{\beta^{r+1}-\beta^r}{2}\|x^{r+1}-x^r\|^2_{B^TB}+\frac{\beta^r}{2}\|w^r\|^2_{B^TB}.
\end{aligned}$$

Therefore, from the above three steps, we can bound (68) by

$$\begin{aligned}
&\frac{1-\tau}{2\rho^r}\|\lambda^{r+1}-\lambda^r\|^2+\frac{\gamma^{r+2}-\gamma^{r+1}}{2}\|\lambda^{r+1}\|^2+\frac{\beta^{r+1}}{2}\|x^{r+1}-x^r\|^2_{B^TB}+\frac{L}{2}\|x^{r+1}-x^r\|^2\\
&\le\frac{1-\tau}{2\rho^r}\|\lambda^r-\lambda^{r-1}\|^2+\frac{\gamma^{r+1}-\gamma^r}{2}\|\lambda^r\|^2+\frac{\beta^r}{2}\|x^r-x^{r-1}\|^2_{B^TB}+\frac{L}{2}\|x^r-x^{r-1}\|^2\\
&\quad-\Big(\gamma^r-\Big(\frac{1}{\tau}-\frac{1}{2}\Big)(\gamma^r-\gamma^{r+1})\Big)\|\lambda^{r+1}-\lambda^r\|^2+L\|x^{r+1}-x^r\|^2+\frac{(\gamma^{r+2}-\gamma^{r+1})-(\gamma^{r+1}-\gamma^r)}{2}\|\lambda^{r+1}\|^2-\frac{\beta^r}{2}\|w^r\|^2_{B^TB}. \qquad (69)
\end{aligned}$$

Multiplying both sides by $\rho^r$, and using $\gamma^{r+1}=\tau/\rho^{r+1}$ (Assumption B1) together with the fact that

$$\frac{\rho^r}{\rho^{r+2}}-\frac{\rho^r}{\rho^{r+1}}\ \ge\ \frac{\rho^{r+1}}{\rho^{r+2}}-1,$$

which follows from $\rho^{r+1}-\rho^r=D$, we obtain (70).

Further, by Assumption B we have $\rho^{r+1}-\rho^r=D$, and also $\|x^{r+1}-x^r\|^2_{B^TB}\le\sigma_{\max}(B^TB)\|x^{r+1}-x^r\|^2$. Therefore, we reach

$$\begin{aligned}
&\frac{1-\tau}{2}\|\lambda^{r+1}-\lambda^r\|^2+\frac{\tau}{2}\Big(\frac{\rho^{r+1}}{\rho^{r+2}}-1\Big)\|\lambda^{r+1}\|^2+\frac{\beta^{r+1}\rho^{r+1}}{2}\|x^{r+1}-x^r\|^2_{B^TB}+\frac{\rho^{r+1}L}{2}\|x^{r+1}-x^r\|^2\\
&\le\frac{1-\tau}{2}\|\lambda^r-\lambda^{r-1}\|^2+\frac{\tau}{2}\Big(\frac{\rho^r}{\rho^{r+1}}-1\Big)\|\lambda^r\|^2+\frac{\beta^r\rho^r}{2}\|x^r-x^{r-1}\|^2_{B^TB}+\frac{\rho^rL}{2}\|x^r-x^{r-1}\|^2\\
&\quad-\frac{\tau}{2}\|\lambda^{r+1}-\lambda^r\|^2+C_1(\gamma^{r+1})^2\|\lambda^{r+1}\|^2+\Big(\frac{L\rho^r}{2}+D\big(L+\beta^{r+1}\sigma_{\max}(B^TB)\big)\Big)\|x^{r+1}-x^r\|^2-\frac{\beta^r\rho^r}{2}\|w^r\|^2_{B^TB}, \qquad (70)
\end{aligned}$$

where the last inequality uses the following relation. To bound the second-difference term $(\gamma^{r+2}-\gamma^{r+1})-(\gamma^{r+1}-\gamma^r)$, we have

$$(\gamma^{r+2}-\gamma^{r+1})-(\gamma^{r+1}-\gamma^r)=\Big(\frac{\tau}{\rho^{r+2}}-\frac{\tau}{\rho^{r+1}}\Big)-\Big(\frac{\tau}{\rho^{r+1}}-\frac{\tau}{\rho^r}\Big)=\tau D\,\frac{\rho^{r+2}-\rho^r}{\rho^r\rho^{r+1}\rho^{r+2}}=\frac{2\tau D^2}{\rho^r\rho^{r+1}\rho^{r+2}}.$$

Thus, for large enough $r$, there exists a constant $C_1$ (for instance $C_1=D^2/\tau$) such that

$$\frac{\rho^r}{2}\Big((\gamma^{r+2}-\gamma^{r+1})-(\gamma^{r+1}-\gamma^r)\Big)\le C_1(\gamma^{r+1})^2.$$

The proof of the lemma is complete. Q.E.D.

Now let us analyze the behavior of $T(x,\lambda)$, originally defined in (4), in order to bound the descent of the primal variable. In this case, because $T$ is also a function of $\rho$ and $\gamma$ (which are now time-varying), we denote it by $T(x,\lambda;\rho,\gamma)$.

Lemma 7. Suppose that Assumptions A and B1–B3 hold true, and that $\tau$ and $D$ are the constants defined in Assumptions B. Then, for $r$ large enough, we have

$$\begin{aligned}
&T(x^{r+1},\lambda^{r+1};\rho^{r+2},\gamma^{r+2})+\frac{(1-\tau)\gamma^{r+2}}{2}\|\lambda^{r+1}\|^2\\
&\le T(x^r,\lambda^r;\rho^{r+1},\gamma^{r+1})+\frac{(1-\tau)\gamma^{r+1}}{2}\|\lambda^r\|^2-\frac{\beta^{r+1}-3L}{2}\|x^{r+1}-x^r\|^2\\
&\quad+\Big(\frac{(1-\tau)(2-\tau)}{2\rho^{r+1}}+\frac{D(\gamma^{r+1})^2}{2\tau^2}\Big)\|\lambda^{r+1}-\lambda^r\|^2+\Big(\frac{(1-\tau)(\gamma^{r+1}-\gamma^{r+2})}{2}+\frac{D(\gamma^{r+1})^2}{2\tau}\Big)\|\lambda^{r+1}\|^2+\frac{D(\gamma^{r+1})^2}{2}\|\lambda^r\|^2. \qquad (71)
\end{aligned}$$

Proof. Following the same analysis as in (27), the function $T$ descends when only the primal variable changes:

$$T(x^{r+1},\lambda^r;\rho^{r+1},\gamma^{r+1})-T(x^r,\lambda^r;\rho^{r+1},\gamma^{r+1})\le-\frac{\beta^{r+1}-3L}{2}\|x^{r+1}-x^r\|^2. \qquad (72)$$

Second, following (28), it is easy to verify that

$$T(x^{r+1},\lambda^{r+1};\rho^{r+1},\gamma^{r+1})-T(x^{r+1},\lambda^r;\rho^{r+1},\gamma^{r+1})=\frac{(1-\tau)(2-\tau)}{2\rho^{r+1}}\|\lambda^{r+1}-\lambda^r\|^2-\frac{(1-\tau)\gamma^{r+1}}{2}\big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2\big). \qquad (73)$$

The most involved step is the analysis of the change of $T$ when the parameters $\rho$ and $\gamma$ are changed. We first have the following bound:

$$\begin{aligned}
&T(x^{r+1},\lambda^{r+1};\rho^{r+2},\gamma^{r+2})-T(x^{r+1},\lambda^{r+1};\rho^{r+1},\gamma^{r+1})\\
&=(1-\tau)(\gamma^{r+1}-\gamma^{r+2})\|\lambda^{r+1}\|^2+\frac{\rho^{r+2}-\rho^{r+1}}{2}\|Ax^{r+1}-b\|^2\\
&=(1-\tau)(\gamma^{r+1}-\gamma^{r+2})\|\lambda^{r+1}\|^2+\underbrace{\frac{D}{2}\|Ax^{r+1}-b-\gamma^{r+1}\lambda^r\|^2}_{\rm(a)}-\underbrace{\frac{D}{2}\|\gamma^{r+1}\lambda^r\|^2}_{\rm(b)}+\underbrace{D\langle\gamma^{r+1}\lambda^r,\ Ax^{r+1}-b\rangle}_{\rm(c)}. \qquad (74)
\end{aligned}$$

The term (a) in (74) is given by

$$\frac{D}{2}\|Ax^{r+1}-b-\gamma^{r+1}\lambda^r\|^2=\frac{D}{2(\rho^{r+1})^2}\|\lambda^{r+1}-\lambda^r\|^2. \qquad (75)$$

The term (b) in (74) is given by

$$\frac{D}{2}\|\gamma^{r+1}\lambda^r\|^2=\frac{D(\gamma^{r+1})^2}{2}\|\lambda^r\|^2. \qquad (76)$$

The term (c) in (74) is given by

$$\begin{aligned}
D\langle\gamma^{r+1}\lambda^r,\ Ax^{r+1}-b\rangle&=D\gamma^{r+1}\Big\langle\lambda^r,\ \frac{\lambda^{r+1}-\lambda^r}{\rho^{r+1}}+\gamma^{r+1}\lambda^r\Big\rangle\\
&=D(\gamma^{r+1})^2\|\lambda^r\|^2+\frac{D(\gamma^{r+1})^2}{2\tau}\big(\|\lambda^{r+1}\|^2-\|\lambda^r\|^2-\|\lambda^{r+1}-\lambda^r\|^2\big). \qquad (77)
\end{aligned}$$