Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method


Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method

Zhaosong Lu

November 21, 2015

Abstract. We consider the problem of minimizing a Lipschitz differentiable function over a class of sparse symmetric sets that has wide applications in engineering and science. For this problem, it is known that any accumulation point of the classical projected gradient (PG) method with a constant stepsize $1/L$ satisfies the $L$-stationarity optimality condition that was introduced in [3]. In this paper we introduce a new optimality condition that is stronger than the $L$-stationarity optimality condition. We also propose a nonmonotone projected gradient (NPG) method for this problem by incorporating some support-changing and coordinate-swapping strategies into a projected gradient method with variable stepsizes. It is shown that any accumulation point of NPG satisfies the new optimality condition and moreover is a coordinatewise stationary point. Under some suitable assumptions, we further show that it is a global or a local minimizer of the problem. Numerical experiments are conducted to compare the performance of PG and NPG. The computational results demonstrate that NPG has substantially better solution quality than PG, and moreover, it is at least comparable to, but sometimes can be much faster than, PG in terms of speed.

Keywords: cardinality constraint, sparse optimization, sparse projection, nonmonotone projected gradient method.

1 Introduction

Over the last decade sparse solutions have been sought in numerous applications. For example, in compressed sensing, a large sparse signal is decoded by using a sampling matrix and a relatively low-dimensional measurement vector, which is typically formulated as minimizing a least squares function subject to a cardinality constraint (see, for example, the comprehensive reviews [19, 11]).
As another example, in the financial industry portfolio managers often face business-driven requirements that limit the number of constituents in their tracking portfolio. A natural model for this is to minimize a risk-adjusted return over a cardinality-constrained simplex, which has recently been considered in [17, 15, 24, 3] for finding a sparse index tracking. These models can be viewed as a special case of the following general cardinality-constrained optimization problem:

$$\min_{x \in C_s \cap \Omega} f(x), \qquad (1.1)$$

where $\Omega$ is a closed convex set in $\mathbb{R}^n$ and

$$C_s = \{x \in \mathbb{R}^n : \|x\|_0 \le s\}$$

Department of Mathematics, Simon Fraser University, Canada (Email: zhaosong@sfu.ca). This author was supported in part by NSERC Discovery Grant.

for some $s \in \{1, \ldots, n-1\}$. Here, $\|x\|_0$ denotes the cardinality, that is, the number of nonzero elements of $x$, and $f : \mathbb{R}^n \to \mathbb{R}$ is Lipschitz continuously differentiable, that is, there is a constant $L_f > 0$ such that

$$\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\| \qquad \forall x, y \in \mathbb{R}^n. \qquad (1.2)$$

Problem (1.1) is generally NP-hard. One popular approach to finding an approximate solution of (1.1) is convex relaxation. For example, one can replace the $\ell_0$ norm in (1.1) by the $\ell_1$ norm, and the resulting problem has a convex feasible region. Some well-known models in compressed sensing or sparse linear regression such as the lasso [18], basis pursuit [10], LP decoding [8] and the Dantzig selector [9] were developed in this spirit. In addition, direct approaches have been proposed in the literature for solving some special cases of (1.1). For example, IHT [5, 6] and CoSaMP [16] are two direct methods for solving problem (1.1) with $f$ being a least squares function and $\Omega = \mathbb{R}^n$. Besides, the IHT method was extended and analyzed in [21] for solving $\ell_0$-regularized convex cone programming. The gradient support pursuit (GraSP) method was proposed in [1] for solving (1.1) with a general $f$ and $\Omega = \mathbb{R}^n$. A penalty decomposition method was introduced and studied in [20] for solving (1.1) with general $f$ and $\Omega$. Recently, Beck and Hallak [3] considered problem (1.1) in which $\Omega$ is assumed to be a symmetric set. They introduced three types of optimality conditions, namely basic feasibility, $L$-stationarity and coordinatewise optimality, and established a hierarchy among them. They also proposed methods for generating points satisfying these optimality conditions. Their methods require finding an exact solution of a sequence of subproblems in the form of $\min\{f(x) : x \in \Omega, \ x_i = 0 \ \forall i \in I\}$ for some index set $I \subseteq \{1, \ldots, n\}$. Such a requirement is, however, generally hard to meet unless $f$ and $\Omega$ are both sufficiently simple. This motivates us to develop a method suitable for solving problem (1.1) with more general $f$ and $\Omega$.
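For intuition on the $\ell_1$ relaxation mentioned above, the following sketch (an illustrative comparison, not taken from the paper) contrasts the $\ell_0$ projection onto $C_s$ for $\Omega = \mathbb{R}^n$ (hard thresholding: keep the $s$ largest-magnitude entries) with the proximal operator of the $\ell_1$ norm used by lasso-type models (soft thresholding):

```python
import numpy as np

def hard_threshold(x, s):
    # projection onto C_s when Omega = R^n: keep the s largest-magnitude entries
    y = np.zeros_like(x)
    idx = np.argsort(-np.abs(x))[:s]
    y[idx] = x[idx]
    return y

def soft_threshold(x, lam):
    # prox of lam*||.||_1, the convex surrogate used by lasso-type relaxations
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -1.5, 1.5, 0.1])
h = hard_threshold(x, 2)      # exactly 2 nonzeros, values unchanged
st = soft_threshold(x, 1.0)   # shrinks all entries toward zero
```

Note the qualitative difference: hard thresholding enforces the cardinality constraint exactly but leaves the surviving entries untouched, while soft thresholding biases every entry toward zero, which is one motivation for the direct (non-relaxed) approaches studied here.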
As studied in [3], the orthogonal projection of a point onto $C_s \cap \Omega$ can be efficiently computed for some symmetric closed convex sets $\Omega$. It is thus suitable to apply the projected gradient (PG) method with a fixed stepsize $t \in (0, 1/L_f)$ to solve (1.1) with such $\Omega$. It is easily known that any accumulation point $x^*$ of the sequence generated by PG satisfies

$$x^* \in \operatorname{Arg\,min}\{\|x - (x^* - t\nabla f(x^*))\| : x \in C_s \cap \Omega\}.^1 \qquad (1.3)$$

It is noteworthy that this type of convergence result is weaker than that of the PG method applied to the problem $\min\{f(x) : x \in \mathcal{X}\}$, where $\mathcal{X}$ is a closed convex set in $\mathbb{R}^n$. For the latter problem, it is known that any accumulation point $x^*$ of the sequence generated by PG with a fixed stepsize $t \in (0, 1/L_f)$ satisfies

$$x^* = \operatorname{arg\,min}\{\|x - (x^* - t\nabla f(x^*))\| : x \in \mathcal{X}\}. \qquad (1.4)$$

The uniqueness of the solution to the optimization problem involved in (1.4) is due to the convexity of $\mathcal{X}$. Given that $C_s \cap \Omega$ is generally nonconvex, the solution to the optimization problem involved in (1.3) may not be unique. As shown in a later section of this paper, if it has a distinct solution $\tilde{x}$, that is, $x^* \ne \tilde{x} \in \operatorname{Arg\,min}\{\|x - (x^* - t\nabla f(x^*))\| : x \in C_s \cap \Omega\}$, then $f(\tilde{x}) < f(x^*)$ and thus $x^*$ is certainly not an optimal solution of (1.1). Therefore, a convergence result such as

$$x^* = \operatorname{arg\,min}\{\|x - (x^* - t\nabla f(x^*))\| : x \in C_s \cap \Omega\} \qquad (1.5)$$

is generally stronger than (1.3).

$^1$ By convention, the symbol Arg stands for the set of solutions of the associated optimization problem. When this set is known to be a singleton, we use the symbol arg to stand for it instead.
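The fixed-stepsize PG iteration discussed above can be sketched as follows for the special case $\Omega = \mathbb{R}^n$, where the sparse projection is plain hard thresholding (i.e., the IHT setting). The least-squares instance, matrix sizes, and stepsize choice below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def pg_sparse(grad, x0, s, t, iters=300):
    # x^{k+1} in Proj_{C_s}(x^k - t*grad f(x^k)); for Omega = R^n the
    # projection keeps the s largest-magnitude entries (hard thresholding)
    x = x0.copy()
    for _ in range(iters):
        z = x - t * grad(x)
        x = np.zeros_like(z)
        keep = np.argsort(-np.abs(z))[:s]
        x[keep] = z[keep]
    return x

# illustrative instance: f(x) = 0.5*||Ax - b||^2 with a 2-sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10); x_true[[2, 7]] = [1.0, -2.0]
b = A @ x_true
L_f = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
x_star = pg_sparse(lambda x: A.T @ (A @ x - b), np.zeros(10), s=2, t=0.9 / L_f)
```

With $t < 1/L_f$ the objective values are nonincreasing along the iterates, consistent with the descent inequality used later in Section 4.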

In this paper we first study some properties of the orthogonal projection of a point onto $C_s \cap \Omega$ and propose a new optimality condition for problem (1.1). We then propose a nonmonotone projected gradient (NPG) method$^2$ for solving problem (1.1), which incorporates some support-changing and coordinate-swapping strategies into a PG method with variable stepsizes. It is shown that any accumulation point $x^*$ of the sequence generated by NPG satisfies (1.5) for all $t \in [0, \mathcal{T}]$ for some $\mathcal{T} \in (0, 1/L_f)$. Under some suitable assumptions, we further show that $x^*$ is a coordinatewise stationary point. Furthermore, if $\|x^*\|_0 < s$, then $x^*$ is a global optimal solution of (1.1), and it is a local optimal solution otherwise. We also conduct numerical experiments to compare the performance of NPG and the PG method with a fixed stepsize. The computational results demonstrate that NPG has substantially better solution quality than PG, and moreover, it is at least comparable to, but sometimes can be much faster than, PG in terms of speed.

The rest of the paper is organized as follows. In Section 2, we study some properties of the orthogonal projection of a point onto $C_s \cap \Omega$. In Section 3, we propose a new optimality condition for problem (1.1). In Section 4 we propose an NPG method for solving problem (1.1) and establish its convergence. We conduct numerical experiments in Section 5 to compare the performance of the NPG and PG methods. Finally, we present some concluding remarks in Section 6.

1.1 Notation and terminology

For a real number $a$, $a_+$ denotes its nonnegative part, that is, $a_+ = \max\{a, 0\}$. The symbol $\mathbb{R}^n_+$ denotes the nonnegative orthant of $\mathbb{R}^n$. Given any $x \in \mathbb{R}^n$, $\|x\|$ is the Euclidean norm of $x$ and $|x|$ denotes the componentwise absolute value of $x$, that is, $|x|_i = |x_i|$ for all $i$. In addition, $\|x\|_0$ denotes the number of nonzero entries of $x$. The support set of $x$ is defined as $\mathrm{supp}(x) = \{i : x_i \ne 0\}$.
Given an index set $T \subseteq \{1, \ldots, n\}$, $x_T$ denotes the subvector of $x$ indexed by $T$, $|T|$ denotes the cardinality of $T$, and $T^c$ is the complement of $T$ in $\{1, \ldots, n\}$. For a set $\Omega$, we define $\Omega_T = \{x \in \mathbb{R}^{|T|} : \sum_{i \in T} x_i e_i \in \Omega\}$, where $e_i$ is the $i$th coordinate vector of $\mathbb{R}^n$. Let $s \in \{1, \ldots, n-1\}$ be given. Given any $x \in \mathbb{R}^n$ with $\|x\|_0 \le s$, an index set $T \subseteq \{1, \ldots, n\}$ is called an $s$-super support of $x$ if it satisfies $\mathrm{supp}(x) \subseteq T$ and $|T| \le s$. The set of all $s$-super supports of $x$ is denoted by $\mathcal{T}_s(x)$, that is,

$$\mathcal{T}_s(x) = \{T \subseteq \{1, \ldots, n\} : \mathrm{supp}(x) \subseteq T \text{ and } |T| \le s\}.$$

In addition, $T \in \mathcal{T}_s(x)$ is called an $s$-super support of $x$ with cardinality $s$ if $|T| = s$. The set of all such $s$-super supports of $x$ is denoted by $\bar{\mathcal{T}}_s(x)$, that is, $\bar{\mathcal{T}}_s(x) = \{T \in \mathcal{T}_s(x) : |T| = s\}$. The sign operator $\mathrm{sign} : \mathbb{R}^n \to \{-1, 1\}^n$ is defined as

$$(\mathrm{sign}(x))_i = \begin{cases} 1 & \text{if } x_i \ge 0, \\ -1 & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n.$$

The Hadamard product of any two vectors $x, y \in \mathbb{R}^n$ is denoted by $x \circ y$, that is, $(x \circ y)_i = x_i y_i$ for $i = 1, \ldots, n$. Given a closed set $\mathcal{X} \subseteq \mathbb{R}^n$, the Euclidean projection of $x \in \mathbb{R}^n$ onto $\mathcal{X}$ is defined as the set $\mathrm{Proj}_{\mathcal{X}}(x) = \operatorname{Arg\,min}\{\|y - x\|^2 : y \in \mathcal{X}\}$. If $\mathcal{X}$ is additionally convex, $\mathrm{Proj}_{\mathcal{X}}(x)$ reduces to a singleton, which is treated as a point by convention. Also, the normal cone of $\mathcal{X}$ at any $x \in \mathcal{X}$ is denoted by $N_{\mathcal{X}}(x)$. The permutation group of the set of indices $\{1, \ldots, n\}$ is denoted by $\Sigma_n$. For any $x \in \mathbb{R}^n$ and $\sigma \in \Sigma_n$, the vector $x_\sigma$ resulting from $\sigma$ operating on $x$ is defined as $(x_\sigma)_i = x_{\sigma(i)}$, $i = 1, \ldots, n$.

$^2$ As mentioned in the literature (see, for example, [12, 13, 25]), nonmonotone methods often produce solutions of better quality than their monotone counterparts for nonconvex optimization problems, which motivates us to propose a nonmonotone method in this paper.

Given any $x \in \mathbb{R}^n$, a permutation that sorts the elements of $x$ in non-ascending order is called a sorting permutation for $x$. The set of all sorting permutations for $x$ is denoted by $\Sigma(x)$. It is clear that $\sigma \in \Sigma(x)$ if and only if $x_{\sigma(1)} \ge x_{\sigma(2)} \ge \cdots \ge x_{\sigma(n-1)} \ge x_{\sigma(n)}$. Given any $s \in \{1, \ldots, n-1\}$ and $\sigma \in \Sigma_n$, we define $S^\sigma_{[1,s]} = \{\sigma(1), \ldots, \sigma(s)\}$. A set $\mathcal{X} \subseteq \mathbb{R}^n$ is called a symmetric set if $x_\sigma \in \mathcal{X}$ for all $x \in \mathcal{X}$ and $\sigma \in \Sigma_n$. In addition, $\mathcal{X}$ is referred to as a nonnegative symmetric set if $\mathcal{X}$ is a symmetric set and moreover $x \ge 0$ for all $x \in \mathcal{X}$. $\mathcal{X}$ is called a sign-free symmetric set if $\mathcal{X}$ is a symmetric set and $x \circ y \in \mathcal{X}$ for all $x \in \mathcal{X}$ and $y \in \{-1, 1\}^n$.

2 Projection over some sparse symmetric sets

In this section we study some useful properties of the orthogonal projection of a point onto the set $C_s \cap \Omega$, where $s \in \{1, \ldots, n-1\}$ is given. Throughout this paper, we make the following assumption regarding $\Omega$.

Assumption 1 $\Omega$ is either a nonnegative or a sign-free symmetric closed convex set in $\mathbb{R}^n$.

Let $P : \mathbb{R}^n \to \mathbb{R}^n$ be an operator associated with $\Omega$ that is defined as follows:

$$P(x) = \begin{cases} x & \text{if } \Omega \text{ is nonnegative symmetric}, \\ |x| & \text{if } \Omega \text{ is sign-free symmetric}. \end{cases} \qquad (2.1)$$

One can observe that $P(x) \ge 0$ for all $x \in \Omega$. Moreover, $(P(x))_i = 0$ for some $x \in \mathbb{R}^n$ and $i \in \{1, \ldots, n\}$ if and only if $x_i = 0$. The following two lemmas were established in [3]. The first one presents a monotone property of the orthogonal projection associated with a symmetric set. The second one provides a characterization of the orthogonal projection associated with a sign-free symmetric set.

Lemma 2.1 (Lemma 3.1 of [3]) Let $\mathcal{X}$ be a closed symmetric set in $\mathbb{R}^n$. Let $x \in \mathbb{R}^n$ and $y \in \mathrm{Proj}_{\mathcal{X}}(x)$. Then $(y_i - y_j)(x_i - x_j) \ge 0$ for all $i, j \in \{1, 2, \ldots, n\}$.

Lemma 2.2 (Lemma 3.3 of [3]) Let $\mathcal{X}$ be a closed sign-free symmetric set in $\mathbb{R}^n$. Then $y \in \mathrm{Proj}_{\mathcal{X}}(x)$ if and only if $\mathrm{sign}(x) \circ y \in \mathrm{Proj}_{\mathcal{X} \cap \mathbb{R}^n_+}(|x|)$.

We next establish a monotone property for the orthogonal projection operator associated with $\Omega$.

Lemma 2.3 Let $P$ be the associated operator of $\Omega$ defined in (2.1).
Then for every $x \in \mathbb{R}^n$ and $y \in \mathrm{Proj}_\Omega(x)$, there holds

$$[(P(y))_i - (P(y))_j]\,[(P(x))_i - (P(x))_j] \ge 0 \qquad \forall i, j \in \{1, 2, \ldots, n\}. \qquad (2.2)$$

Proof. If $\Omega$ is a closed nonnegative symmetric set, (2.2) clearly holds due to (2.1) and Lemma 2.1. Now suppose $\Omega$ is a closed sign-free symmetric set. In view of Lemma 2.2 and $y \in \mathrm{Proj}_\Omega(x)$, one can see that $\mathrm{sign}(x) \circ y \in \mathrm{Proj}_{\Omega \cap \mathbb{R}^n_+}(|x|)$. Taking the absolute value on both sides of this relation, and using the definition of sign, we obtain that

$$|y| \in \mathrm{Proj}_{\Omega \cap \mathbb{R}^n_+}(|x|). \qquad (2.3)$$

Observe that $\Omega \cap \mathbb{R}^n_+$ is a closed nonnegative symmetric set. Using this fact, (2.3) and Lemma 2.1, we have

$$(|y|_i - |y|_j)(|x|_i - |x|_j) \ge 0 \qquad \forall i, j \in \{1, 2, \ldots, n\},$$

which, together with (2.1) and the fact that $\Omega$ is sign-free symmetric, implies that (2.2) holds.

The following two lemmas present some useful properties of the orthogonal projection operator associated with $C_s \cap \Omega$. The proof of the first one is similar to that of Lemma 4.1 of [3].

Lemma 2.4 For every $x \in \mathbb{R}^n$ and $y \in \mathrm{Proj}_{C_s \cap \Omega}(x)$, there holds $y_T = \mathrm{Proj}_{\Omega_T}(x_T)$ for all $T \in \mathcal{T}_s(y)$.

Lemma 2.5 (Theorem 4.4 of [3]) Let $P$ be the associated operator of $\Omega$ defined in (2.1). Then for every $x \in \mathbb{R}^n$ and $\sigma \in \Sigma(P(x))$, there exists $y \in \mathrm{Proj}_{C_s \cap \Omega}(x)$ such that $S^\sigma_{[1,s]} \in \bar{\mathcal{T}}_s(y)$, that is, $S^\sigma_{[1,s]}$ is an $s$-super support of $y$ with cardinality $s$.

Combining Lemmas 2.4 and 2.5, we obtain the following theorem, which provides a formula for finding a point in $\mathrm{Proj}_{C_s \cap \Omega}(x)$ for any $x \in \mathbb{R}^n$.

Theorem 2.1 Let $P$ be the associated operator of $\Omega$ defined in (2.1). Given any $x \in \mathbb{R}^n$, let $T = S^\sigma_{[1,s]}$ for some $\sigma \in \Sigma(P(x))$, and define $y \in \mathbb{R}^n$ as follows:

$$y_T = \mathrm{Proj}_{\Omega_T}(x_T), \qquad y_{T^c} = 0.$$

Then $y \in \mathrm{Proj}_{C_s \cap \Omega}(x)$.

In the following two theorems, we provide some sufficient conditions under which the orthogonal projection of a point onto $C_s \cap \Omega$ reduces to a single point.

Theorem 2.2 Given $a \in \mathbb{R}^n$, suppose there exists some $y \in \mathrm{Proj}_{C_s \cap \Omega}(a)$ with $\|y\|_0 < s$. Then $\mathrm{Proj}_{C_s \cap \Omega}(a)$ is a singleton containing $y$.

Proof. For convenience, let $I = \mathrm{supp}(y)$. We first show that

$$(P(a))_i > (P(a))_j \qquad \forall i \in I, \ j \in I^c, \qquad (2.4)$$

where $P$ is defined in (2.1). By the definitions of $I$ and $P$, one can observe that $(P(y))_i > (P(y))_j$ for all $i \in I$ and $j \in I^c$. This together with (2.2) with $x = a$ implies that $(P(a))_i \ge (P(a))_j$ for all $i \in I$ and $j \in I^c$. It then follows that for proving (2.4) it suffices to show $(P(a))_i \ne (P(a))_j$ for all $i \in I$ and $j \in I^c$. Suppose on the contrary that $(P(a))_i = (P(a))_j$ for some $i \in I$ and $j \in I^c$. Let

$$\beta = \begin{cases} y_i & \text{if } \Omega \text{ is nonnegative symmetric}, \\ \mathrm{sign}(a_j)\, y_i & \text{if } \Omega \text{ is sign-free symmetric}, \end{cases}$$

and let $z \in \mathbb{R}^n$ be defined as follows:

$$z_l = \begin{cases} y_j & \text{if } l = i, \\ \beta & \text{if } l = j, \\ y_l & \text{otherwise}, \end{cases} \qquad l = 1, \ldots, n.$$

Since $\Omega$ is either nonnegative or sign-free symmetric, it is not hard to see that $z \in \Omega$. Notice that $y_i \ne 0$ and $y_j = 0$ due to $i \in I$ and $j \in I^c$. This together with $\|y\|_0 < s$ and the definition of $z$ implies $\|z\|_0 < s$. Hence, $z \in C_s \cap \Omega$. In view of Lemma 2.2 with $x = a$ and $\mathcal{X} = \Omega$ and $y \in \mathrm{Proj}_{C_s \cap \Omega}(a)$, one can observe that $y \circ a \ge 0$ when $\Omega$ is sign-free symmetric. Using this fact and the definitions of $P$ and $z$, one can observe that

$$y_i a_i = (P(y))_i (P(a))_i, \qquad z_j a_j = (P(y))_i (P(a))_j,$$

which along with the supposition $(P(a))_i = (P(a))_j$ yields $y_i a_i = z_j a_j$. In addition, one can see that $z_j^2 = y_i^2$. Using these two relations, $z_i = y_j = 0$, and the definition of $z$, we have

$$\|z - a\|^2 = \sum_{l \ne i,j} (z_l - a_l)^2 + (z_i - a_i)^2 + (z_j - a_j)^2 = \sum_{l \ne i,j} (y_l - a_l)^2 + a_i^2 + z_j^2 - 2 z_j a_j + a_j^2$$
$$= \sum_{l \ne i,j} (y_l - a_l)^2 + a_i^2 + y_i^2 - 2 y_i a_i + a_j^2 = \|y - a\|^2. \qquad (2.5)$$

In addition, by the definition of $z$ and the convexity of $\Omega$, it is not hard to observe that $y \ne z$ and $(y + z)/2 \in C_s \cap \Omega$. By the strict convexity of $\|\cdot\|^2$, $y \ne z$ and (2.5), one has

$$\left\| \frac{y + z}{2} - a \right\|^2 < \frac{1}{2}\|y - a\|^2 + \frac{1}{2}\|z - a\|^2 = \|y - a\|^2,$$

which together with $(y + z)/2 \in C_s \cap \Omega$ contradicts the assumption $y \in \mathrm{Proj}_{C_s \cap \Omega}(a)$. Hence, (2.4) holds as desired.

We next show that for any $z \in \mathrm{Proj}_{C_s \cap \Omega}(a)$, it holds that $\mathrm{supp}(z) \subseteq I$, where $I = \mathrm{supp}(y)$. Suppose for contradiction that there exists some $j \in I^c$ such that $z_j \ne 0$, which together with the definition of $P$ yields $(P(z))_j \ne 0$. Clearly, $P(z) \ge 0$ due to $z \in \Omega$ and (2.1). It then follows that $(P(z))_j > 0$. In view of Lemma 2.3, we further have

$$[(P(z))_i - (P(z))_j][(P(a))_i - (P(a))_j] \ge 0 \qquad \forall i \in I,$$

which together with $j \in I^c$ and (2.4) implies $(P(z))_i \ge (P(z))_j$ for all $i \in I$. Using this, $(P(z))_j > 0$ and the definition of $P$, we see that $(P(z))_i > 0$ and hence $z_i \ne 0$ for all $i \in I$. Using this relation, $y, z \in \Omega$, and the convexity of $\Omega$, one can see that $(y + z)/2 \in C_s \cap \Omega$. In addition, since $y, z \in \mathrm{Proj}_{C_s \cap \Omega}(a)$, we have $\|y - a\|^2 = \|z - a\|^2$. Using this and a similar argument as above, one can show that $\|(y + z)/2 - a\|^2 < \|y - a\|^2$, which together with $(y + z)/2 \in C_s \cap \Omega$ contradicts the assumption $y \in \mathrm{Proj}_{C_s \cap \Omega}(a)$.

Let $z \in \mathrm{Proj}_{C_s \cap \Omega}(a)$.
As shown above, $\mathrm{supp}(z) \subseteq \mathrm{supp}(y)$. Let $T \in \bar{\mathcal{T}}_s(y)$. It then follows that $T \in \bar{\mathcal{T}}_s(z)$. Using these two relations and Lemma 2.4, we have $y_T = \mathrm{Proj}_{\Omega_T}(a_T) = z_T$. Notice that $y_{T^c} = z_{T^c} = 0$. It thus follows that $y = z$, which implies that the set $\mathrm{Proj}_{C_s \cap \Omega}(a)$ contains $y$ only.

Theorem 2.3 Given $a \in \mathbb{R}^n$, suppose there exists some $y \in \mathrm{Proj}_{C_s \cap \Omega}(a)$ such that

$$\min_{i \in I} (P(a))_i > \max_{i \in I^c} (P(a))_i, \qquad (2.6)$$

where $I = \mathrm{supp}(y)$. Then $\mathrm{Proj}_{C_s \cap \Omega}(a)$ is a singleton containing $y$.

Proof. We divide the proof into two separate cases as follows.

Case 1): $\|y\|_0 < s$. The conclusion holds due to Theorem 2.2.

Case 2): $\|y\|_0 = s$. This along with $I = \mathrm{supp}(y)$ yields $|I| = s$. Let $z \in \mathrm{Proj}_{C_s \cap \Omega}(a)$. In view of Lemma 2.3 and (2.6), one has

$$\min_{i \in I} (P(z))_i \ge \max_{i \in I^c} (P(z))_i. \qquad (2.7)$$

Notice $z \in C_s \cap \Omega$. Using this and the definition of $P$, we observe that $\|P(z)\|_0 = \|z\|_0 \le s$ and $P(z) \ge 0$. These relations together with $|I| = s$ and (2.7) imply that $(P(z))_i = 0$ for all $i \in I^c$. This yields $z_{I^c} = 0$. It then follows that $\mathrm{supp}(z) \subseteq I$. Hence, $I \in \bar{\mathcal{T}}_s(z)$ and $I \in \bar{\mathcal{T}}_s(y)$ due to $|I| = s$. Using these, Lemma 2.4, and $y, z \in \mathrm{Proj}_{C_s \cap \Omega}(a)$, one has $z_I = \mathrm{Proj}_{\Omega_I}(a_I) = y_I$, which together with $z_{I^c} = y_{I^c} = 0$ implies $z = y$. Thus $\mathrm{Proj}_{C_s \cap \Omega}(a)$ contains only $y$.

3 Optimality conditions

In this section we study some optimality conditions for problem (1.1). We start by reviewing a necessary optimality condition that was established in Theorem 5.3 of [3].

Theorem 3.1 (necessary optimality condition) Suppose that $x^*$ is an optimal solution of problem (1.1). Then there holds

$$x^* \in \mathrm{Proj}_{C_s \cap \Omega}(x^* - t\nabla f(x^*)) \qquad \forall t \in [0, 1/L_f), \qquad (3.1)$$

that is, $x^*$ is an optimal (but possibly not unique) solution to the problems

$$\min_{x \in C_s \cap \Omega} \|x - (x^* - t\nabla f(x^*))\|^2 \qquad \forall t \in [0, 1/L_f).$$

We next establish a necessary optimality condition for (1.1) that is stronger than the one stated above.

Theorem 3.2 (strong necessary optimality condition) Suppose that $x^*$ is an optimal solution of problem (1.1). Then there holds

$$x^* = \mathrm{Proj}_{C_s \cap \Omega}(x^* - t\nabla f(x^*)) \qquad \forall t \in [0, 1/L_f), \qquad (3.2)$$

that is, $x^*$ is the unique optimal solution to the problems

$$\min_{x \in C_s \cap \Omega} \|x - (x^* - t\nabla f(x^*))\|^2 \qquad \forall t \in [0, 1/L_f). \qquad (3.3)$$

Proof. The conclusion clearly holds for $t = 0$. Now let $t \in (0, 1/L_f)$ be arbitrarily chosen. Suppose for contradiction that problem (3.3) has an optimal solution $\tilde{x} \in C_s \cap \Omega$ with $\tilde{x} \ne x^*$. It then follows that

$$\|\tilde{x} - (x^* - t\nabla f(x^*))\|^2 \le \|x^* - (x^* - t\nabla f(x^*))\|^2,$$

which leads to $\nabla f(x^*)^T(\tilde{x} - x^*) \le -\frac{1}{2t}\|\tilde{x} - x^*\|^2$. In view of this relation, (1.2) and the facts that $t \in (0, 1/L_f)$ and $\tilde{x} \ne x^*$, one can obtain that

$$f(\tilde{x}) \le f(x^*) + \nabla f(x^*)^T(\tilde{x} - x^*) + \frac{L_f}{2}\|\tilde{x} - x^*\|^2 \le f(x^*) + \frac{1}{2}\left(L_f - \frac{1}{t}\right)\|\tilde{x} - x^*\|^2 < f(x^*), \qquad (3.4)$$

which contradicts the assumption that $x^*$ is an optimal solution of problem (1.1).

For ease of later reference, we introduce the following definitions.

Definition 3.1 $x^* \in \mathbb{R}^n$ is called a general stationary point of problem (1.1) if it satisfies the necessary optimality condition (3.1).
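Checking conditions such as (3.1) requires computing $\mathrm{Proj}_{C_s \cap \Omega}$. The sketch below instantiates the formula of Theorem 2.1 for the sparse simplex $C_s \cap \Delta$ from the index-tracking example in the introduction ($\Delta$ the unit simplex, a nonnegative symmetric set, so $P(x) = x$): take $T$ to be the indices of the $s$ largest entries of $x$, project $x_T$ onto the simplex in $\mathbb{R}^s$, and zero out $T^c$. The sort-based simplex projection routine is a standard algorithm assumed here, not spelled out in the paper:

```python
import numpy as np

def proj_simplex(v):
    # standard sort-based Euclidean projection onto {y >= 0, sum(y) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - 1)[0][-1]
    theta = (css[k] - 1.0) / (k + 1)
    return np.maximum(v - theta, 0.0)

def proj_sparse_simplex(x, s):
    # Theorem 2.1 with Omega = simplex: T = indices of the s largest
    # entries of P(x) = x, then y_T = Proj_{Omega_T}(x_T), y_{T^c} = 0
    T = np.argsort(-x)[:s]
    y = np.zeros_like(x)
    y[T] = proj_simplex(x[T])
    return y

x = np.array([0.6, 0.1, 0.5, -0.2])
y = proj_sparse_simplex(x, 2)   # supported on the two largest entries {0, 2}
```

The output is feasible for $C_2 \cap \Delta$ and, consistent with Lemma 2.1, preserves the ordering of the selected entries of $x$.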

Definition 3.2 $x^* \in \mathbb{R}^n$ is called a strong stationary point of problem (1.1) if it satisfies the strong necessary optimality condition (3.2).

Clearly, a strong stationary point must be a general stationary point, but the converse may not be true. In addition, from the proof of Theorem 3.2, one can easily improve the quality of a general but not strong stationary point $x^*$ by finding a point $\tilde{x} \in \mathrm{Proj}_{C_s \cap \Omega}(x^* - t\nabla f(x^*))$ with $\tilde{x} \ne x^*$. As seen from above, $f(\tilde{x}) < f(x^*)$, which means $\tilde{x}$ has a better quality than $x^*$ in terms of the objective value of (1.1).

Before ending this section, we present another necessary optimality condition for problem (1.1) that was established in [3].

Theorem 3.3 (Lemma 6.1 of [3]) If $x^*$ is an optimal solution of problem (1.1), there hold:

$$x^*_T = \mathrm{Proj}_{\Omega_T}\big((x^* - t\nabla f(x^*))_T\big) \qquad \forall T \in \mathcal{T}_s(x^*),$$

$$f(x^*) \le \begin{cases} \min\{f(x^* - x^*_i e_i + x^*_i e_j),\ f(x^* - x^*_i e_i - x^*_i e_j)\} & \text{if } \Omega \text{ is sign-free symmetric}; \\ f(x^* - x^*_i e_i + x^*_i e_j) & \text{if } \Omega \text{ is nonnegative symmetric} \end{cases}$$

for some $t > 0$ and some $i, j$ satisfying

$$i \in \operatorname{Arg\,min}\{(P(\nabla f(x^*)))_l : l \in I\}, \qquad j \in \operatorname{Arg\,max}\{(P(\nabla f(x^*)))_l : l \in [\mathrm{supp}(x^*)]^c\},$$

where $I = \operatorname{Arg\,min}_{i \in \mathrm{supp}(x^*)} (P(x^*))_i$.

Definition 3.3 $x^* \in \mathbb{R}^n$ is called a coordinatewise stationary point of problem (1.1) if it satisfies the necessary optimality condition stated in Theorem 3.3.

4 A nonmonotone projected gradient method

As seen from Theorem 2.1, the orthogonal projection of a point onto $C_s \cap \Omega$ can be efficiently computed. Therefore, the classical projected gradient (PG) method with a constant stepsize can be suitably applied to solve problem (1.1). In particular, given $\mathcal{T} \in (0, 1/L_f)$ and $x^0 \in C_s \cap \Omega$, the PG method generates a sequence $\{x^k\} \subseteq C_s \cap \Omega$ according to

$$x^{k+1} \in \mathrm{Proj}_{C_s \cap \Omega}\big(x^k - \mathcal{T}\nabla f(x^k)\big) \qquad \forall k \ge 0. \qquad (4.1)$$

The iterative hard-thresholding (IHT) algorithms [5, 6] are either a special case or a variant of the above method. By a similar argument as in the proof of Theorem 3.2, one can show that

$$f(x^{k+1}) \le f(x^k) - \frac{1}{2}\left(\frac{1}{\mathcal{T}} - L_f\right)\|x^{k+1} - x^k\|^2 \qquad \forall k \ge 0, \qquad (4.2)$$

which implies that $\{f(x^k)\}$ is non-increasing. Suppose that $x^*$ is an accumulation point of $\{x^k\}$.
It then follows that $f(x^k) \to f(x^*)$. In view of this and (4.2), one can see that $\|x^{k+1} - x^k\| \to 0$, which along with (4.1) and [3, Theorem 5.2] yields

$$x^* \in \mathrm{Proj}_{C_s \cap \Omega}(x^* - t\nabla f(x^*)) \qquad \forall t \in [0, \mathcal{T}]. \qquad (4.3)$$

Therefore, when $\mathcal{T}$ is close to $1/L_f$, $x^*$ is nearly a general stationary point of problem (1.1), that is, it nearly satisfies the necessary optimality condition (3.1). It is, however, still possible that there

exists some $\hat{t} \in (0, \mathcal{T}]$ such that $\{x^*\} \ne \mathrm{Proj}_{C_s \cap \Omega}(x^* - \hat{t}\nabla f(x^*))$. As discussed in Section 3, in this case one has $f(\tilde{x}) < f(x^*)$ for any $\tilde{x} \in \mathrm{Proj}_{C_s \cap \Omega}(x^* - \hat{t}\nabla f(x^*))$ with $\tilde{x} \ne x^*$, and thus $x^*$ is clearly not an optimal solution of problem (1.1). To prevent this case from occurring, we propose a nonmonotone projected gradient (NPG) method for solving problem (1.1), which incorporates some support-changing and coordinate-swapping strategies into a projected gradient approach with variable stepsizes. We show that any accumulation point $x^*$ of the sequence generated by NPG satisfies

$$x^* = \mathrm{Proj}_{C_s \cap \Omega}(x^* - t\nabla f(x^*)) \qquad \forall t \in [0, \mathcal{T}]$$

for any pre-chosen $\mathcal{T} \in (0, 1/L_f)$. Therefore, when $\mathcal{T}$ is close to $1/L_f$, $x^*$ is nearly a strong stationary point of problem (1.1), that is, it nearly satisfies the strong necessary optimality condition (3.2). Under some suitable assumptions, we further show that $x^*$ is a coordinatewise stationary point. Furthermore, if $\|x^*\|_0 < s$, then $x^*$ is a global optimal solution of (1.1), and it is a local optimal solution otherwise.

4.1 Algorithm framework of NPG

In this subsection we present an NPG method for solving problem (1.1). To proceed, we first introduce a subroutine called SwapCoordinate that generates a new point $y$ by swapping some coordinates of a given point $x$. The aim of this subroutine is to generate a new point $y$ with a smaller objective value, namely $f(y) < f(x)$, if the given point $x$ violates the second part of the coordinatewise optimality conditions stated in Theorem 3.3. Upon incorporating this subroutine into the NPG method, we show that under some suitable assumptions, any accumulation point of the generated sequence is a coordinatewise stationary point of (1.1).

The subroutine SwapCoordinate(x)

Input: $x \in \mathbb{R}^n$.

1) Set $y = x$ and choose

$$i \in \operatorname{Arg\,min}\{(P(\nabla f(x)))_l : l \in I\}, \qquad j \in \operatorname{Arg\,max}\{(P(\nabla f(x)))_l : l \in [\mathrm{supp}(x)]^c\},$$

where $I = \operatorname{Arg\,min}\{(P(x))_i : i \in \mathrm{supp}(x)\}$.

2) If $\Omega$ is nonnegative symmetric and $f(x) > f(x - x_i e_i + x_i e_j)$, set $y = x - x_i e_i + x_i e_j$.
3) If $\Omega$ is sign-free symmetric and $f(x) > \min\{f(x - x_i e_i + x_i e_j),\ f(x - x_i e_i - x_i e_j)\}$:

3a) if $f(x - x_i e_i + x_i e_j) \le f(x - x_i e_i - x_i e_j)$, set $y = x - x_i e_i + x_i e_j$;

3b) if $f(x - x_i e_i + x_i e_j) > f(x - x_i e_i - x_i e_j)$, set $y = x - x_i e_i - x_i e_j$.

Output: $y$.

We next introduce another subroutine called ChangeSupport that generates a new point by changing some part of the support of a given point. Upon incorporating this subroutine into the NPG method, we show that any accumulation point of the generated sequence is nearly a strong stationary point of (1.1).

The subroutine ChangeSupport(x, t)

Input: $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$.

1) Set $a = x - t\nabla f(x)$, $I = \operatorname{Arg\,min}_{i \in \mathrm{supp}(x)} (P(a))_i$, and $J = \operatorname{Arg\,max}_{j \in [\mathrm{supp}(x)]^c} (P(a))_j$.

2) Choose $S_I \subseteq I$ and $S_J \subseteq J$ such that $|S_I| = |S_J| = \min\{|I|, |J|\}$. Set $S = (\mathrm{supp}(x) \cup S_J) \setminus S_I$.

3) Set $y \in \mathbb{R}^n$ with $y_S = \mathrm{Proj}_{\Omega_S}(a_S)$ and $y_{S^c} = 0$.

Output: $y$.

One can observe that if $0 < \|x\|_0 < n$, the output $y$ of ChangeSupport(x, t) must satisfy $\mathrm{supp}(y) \ne \mathrm{supp}(x)$ and thus $y \ne x$.

We next introduce some notation that will be used subsequently. Given any $x \in \mathbb{R}^n$ with $0 < \|x\|_0 < n$ and $\mathcal{T} > 0$, define

$$\gamma(t; x) = \min_{i \in \mathrm{supp}(x)} \big(P(x - t\nabla f(x))\big)_i - \max_{j \in [\mathrm{supp}(x)]^c} \big(P(x - t\nabla f(x))\big)_j \qquad \forall t \ge 0, \qquad (4.4)$$

$$\beta(\mathcal{T}; x) \in \operatorname{Arg\,min}_{t \in [0, \mathcal{T}]} \gamma(t; x), \qquad \vartheta(\mathcal{T}; x) = \min_{t \in [0, \mathcal{T}]} \gamma(t; x). \qquad (4.5)$$

If $x = 0$ or $\|x\|_0 = n$, define $\beta(\mathcal{T}; x) = \mathcal{T}$ and $\vartheta(\mathcal{T}; x) = 0$. In addition, if the optimization problem (4.5) has multiple optimal solutions, $\beta(\mathcal{T}; x)$ is chosen to be the largest one among them. As seen below, $\beta(\mathcal{T}; x)$ and $\vartheta(\mathcal{T}; x)$ can be evaluated efficiently. To avoid triviality, assume $0 < \|x\|_0 < n$. It follows from (2.1) and (4.4) that

$$\gamma(t; x) = \min_{i \in \mathrm{supp}(x)} \varphi_i(t; x) \qquad \forall t \ge 0, \qquad (4.6)$$

where

$$\varphi_i(t; x) = \big(P(x - t\nabla f(x))\big)_i - \alpha t, \qquad \alpha = \max_{j \in [\mathrm{supp}(x)]^c} \big(P(-\nabla f(x))\big)_j.$$

We now consider two separate cases as follows.

Case 1): $\Omega$ is nonnegative symmetric. In view of (2.1), we see that in this case

$$\varphi_i(t; x) = x_i - \left(\frac{\partial f(x)}{\partial x_i} + \alpha\right) t.$$

This together with (4.6) implies that $\gamma(t; x)$ is concave in $t$. Thus, the minimum value of $\gamma(t; x)$ over $t \in [0, \mathcal{T}]$ must be achieved at $0$ or $\mathcal{T}$. It then follows that $\beta(\mathcal{T}; x)$ and $\vartheta(\mathcal{T}; x)$ can be found by comparing $\gamma(0; x)$ and $\gamma(\mathcal{T}; x)$, which can be evaluated in $O(\|x\|_0)$ cost.

Case 2): $\Omega$ is sign-free symmetric. In view of (4.5) and (4.6), one can observe that

$$\vartheta(\mathcal{T}; x) = \min_{t \in [0, \mathcal{T}]} \left\{ \min_{i \in \mathrm{supp}(x)} \varphi_i(t; x) \right\} = \min_{i \in \mathrm{supp}(x)} \left\{ \min_{t \in [0, \mathcal{T}]} \varphi_i(t; x) \right\}.$$

By the definition of $P$, it is not hard to see that $\varphi_i(\cdot\,; x)$ is a convex piecewise linear function on $[0, \infty)$. Therefore, for a given $x$, one can find a closed-form expression for

$$\varphi^*_i(x) = \min_{t \in [0, \mathcal{T}]} \varphi_i(t; x), \qquad t^*_i(x) \in \operatorname{Arg\,min}_{t \in [0, \mathcal{T}]} \varphi_i(t; x).$$

Moreover, the associated arithmetic operation cost is $O(1)$ for each $i$. Let $i^* \in \mathrm{supp}(x)$ be such that $\varphi^*_{i^*}(x) = \min_{i \in \mathrm{supp}(x)} \varphi^*_i(x)$. Then we obtain $\vartheta(\mathcal{T}; x) = \varphi^*_{i^*}(x)$ and $\beta(\mathcal{T}; x) = t^*_{i^*}(x)$.
Therefore, for a given x, ϑ(T; x) and β(T; x) can be computed in O(||x||_0) cost.

Combining the above two cases, we reach the following conclusion regarding β(T; x) and ϑ(T; x).

Proposition 4.1 For any x and T > 0, β(T; x) and ϑ(T; x) can be computed in O(||x||_0) cost.
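The endpoint comparison in Case 1) can be sketched as follows, assuming (as in the nonnegative symmetric case above) that P acts as the identity on the relevant entries, so each φ_i(t; x) is linear in t, γ(t; x) is concave, and its minimum over [0, T] sits at an endpoint. The helper names are ours.

```python
import numpy as np

def gamma(t, x, grad_f):
    """gamma(t; x) from (4.4), with P taken as the identity
    (nonnegative symmetric case)."""
    a = x - t * grad_f(x)
    supp = np.flatnonzero(x)
    comp = np.flatnonzero(x == 0)
    return a[supp].min() - a[comp].max()

def beta_theta(T, x, grad_f):
    """Return (beta(T; x), theta(T; x)) by comparing the endpoints t = 0 and
    t = T; valid here because gamma is concave in t, so its minimum over
    [0, T] is attained at an endpoint.  Ties pick t = T, matching the
    convention that beta is the largest minimizer."""
    g0, gT = gamma(0.0, x, grad_f), gamma(T, x, grad_f)
    return (T, gT) if gT <= g0 else (0.0, g0)
```

Both calls touch only supp(x) and its complement once, consistent with the O(||x||_0) count in Proposition 4.1 (up to the cost of scanning the complement).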

We are now ready to present an NPG method for solving problem (1.1).

Nonmonotone projected gradient (NPG) method for (1.1)

Let 0 < t_min < t_max, τ ∈ (0, 1), T ∈ (0, 1/L_f), c_1 ∈ (0, 1/T − L_f), c_2 > 0, η > 0, and integers N ≥ 3, 0 ≤ M < N, 0 < q < N be given. Choose an arbitrary x^0 ∈ C_s ∩ Ω and set k = 0.

1) (coordinate swap) If mod(k, N) = 0, do

1a) Compute x̃^{k+1} = SwapCoordinate(x^k, T).

1b) If x̃^{k+1} ≠ x^k, set x^{k+1} = x̃^{k+1} and go to step 4).

1c) If x̃^{k+1} = x^k, go to step 3).

2) (change support) If mod(k, N) = q and ϑ(T; x^k) ≤ η, do

2a) Compute

x̄^{k+1} ∈ Proj_{C_s∩Ω}(x^k − β(T; x^k)∇f(x^k)),  (4.7)

x̂^{k+1} = ChangeSupport(x̄^{k+1}, β(T; x^k)).  (4.8)

2b) If

f(x̂^{k+1}) ≤ f(x̄^{k+1}) − (c_1/2)||x̂^{k+1} − x̄^{k+1}||^2  (4.9)

holds, set x^{k+1} = x̂^{k+1} and go to step 4).

2c) If β(T; x^k) > 0, set x^{k+1} = x̄^{k+1} and go to step 4).

2d) If β(T; x^k) = 0, go to step 3).

3) (projected gradient) Choose t_k^0 ∈ [t_min, t_max]. Set t̄_k = t_k^0.

3a) Solve the subproblem

w ∈ Proj_{C_s∩Ω}(x^k − t̄_k∇f(x^k)).  (4.10)

3b) If

f(w) ≤ max_{[k−M]_+ ≤ i ≤ k} f(x^i) − (c_2/2)||w − x^k||^2  (4.11)

holds, set x^{k+1} = w, t_k = t̄_k, and go to step 4).

3c) Set t̄_k ← τ t̄_k and go to step 3a).

4) Set k ← k + 1, and go to step 1).

end

Remark: i) When M = 0, the sequence {f(x^k)} is decreasing. Otherwise, it may increase at some iterations and thus the above method is generally a nonmonotone method.

ii) A popular choice of t_k^0 is given by the following formula proposed by Barzilai and Borwein [2]:

t_k^0 = min{ t_max, max{ t_min, ||s^k||^2 / ((s^k)^T y^k) } } if (s^k)^T y^k > 0, and t_k^0 = t_max otherwise,

where s^k = x^k − x^{k−1} and y^k = ∇f(x^k) − ∇f(x^{k−1}).
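The safeguarded Barzilai–Borwein choice of t_k^0 in Remark ii) translates directly into code; this is a sketch of that formula alone, not of the full NPG loop.

```python
import numpy as np

def bb_stepsize(x_prev, x_curr, g_prev, g_curr, t_min, t_max):
    """Safeguarded Barzilai-Borwein initial stepsize t_k^0 from Remark ii):
    ||s^k||^2 / (s^k)^T y^k clipped to [t_min, t_max] when the curvature
    estimate (s^k)^T y^k is positive, and t_max otherwise."""
    s = x_curr - x_prev                # s^k = x^k - x^{k-1}
    y = g_curr - g_prev                # y^k = grad f(x^k) - grad f(x^{k-1})
    sy = float(s @ y)
    if sy > 0:
        return min(t_max, max(t_min, float(s @ s) / sy))
    return t_max
```

For a quadratic f(x) = ||x||^2/2 (so ∇f(x) = x), the ratio ||s||^2 / sᵀy equals 1, the reciprocal of the curvature, before safeguarding.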

4.2 Convergence results of NPG

In this subsection we study convergence properties of the NPG method proposed in Subsection 4.1. We first state that the inner termination criterion (4.11) is satisfied within a certain number of inner iterations; the proof is similar to that of [24, Theorem 2.1].

Theorem 4.1 The inner termination criterion (4.11) is satisfied after at most

max{ ⌊(log(L_f + c_2) + log(t_max)) / log(τ^{−1})⌋ + 2, 1 }

inner iterations, and moreover,

min{t_min, τ/(L_f + c_2)} ≤ t_k ≤ t_max,  (4.12)

where t_k is defined in step 3) of the NPG method.

In what follows, we study the convergence of the outer iterations of the NPG method. Throughout the rest of this subsection, we make the following assumption regarding f.

Assumption 2 f is bounded below on C_s ∩ Ω, and moreover it is uniformly continuous in the level set Ω^0 := {x ∈ C_s ∩ Ω : f(x) ≤ f(x^0)}.

We start by establishing a convergence result regarding the sequences {f(x^k)} and {||x^k − x^{k−1}||}. A similar result was established in [23, Lemma 4] for a nonmonotone proximal gradient method for solving a class of optimization problems. Its proof substantially relies on the relation

φ(x^{k+1}) ≤ max_{[k−M]_+ ≤ i ≤ k} φ(x^i) − (c/2)||x^{k+1} − x^k||^2  ∀k ≥ 0

for some constant c > 0, where φ is the associated objective function of the optimization problem considered in [23]. Notice that for our NPG method, this type of inequality holds only for a subset of indices k. Therefore, the proof of [23, Lemma 4] is not directly applicable here and a new proof is required. To make our presentation smooth, we leave the proof of the following result to Subsection 4.3.

Theorem 4.2 Let {x^k} be the sequence generated by the NPG method and

N = {k : x^{k+1} is generated by step 2) or 3) of NPG}.  (4.13)

There hold: i) {f(x^k)} converges as k → ∞; ii) ||x^{k+1} − x^k|| → 0 as k ∈ N → ∞.

The following theorem shows that any accumulation point of {x^k} generated by the NPG method is nearly a strong stationary point of problem (1.1), that is, it nearly satisfies the strong necessary optimality condition (3.2).
Since its proof is quite lengthy and technically involved, we present it in Subsection 4.3 instead.

Theorem 4.3 Let {x^k} be the sequence generated by the NPG method. Suppose that x* is an accumulation point of {x^k}. Then there hold:

i) if ||x*||_0 < s, there exists t̂ ∈ [min{t_min, τ/(L_f + c_2)}, t_max] such that x* = Proj_{C_s∩Ω}(x* − t∇f(x*)) for all t ∈ [0, t̂], that is, x* is the unique optimal solution to the problems

min_{x∈C_s∩Ω} ||x − (x* − t∇f(x*))||^2  ∀t ∈ [0, t̂];

ii) if ||x*||_0 = s, then x* = Proj_{C_s∩Ω}(x* − t∇f(x*)) for all t ∈ [0, T], that is, x* is the unique optimal solution to the problems

min_{x∈C_s∩Ω} ||x − (x* − t∇f(x*))||^2  ∀t ∈ [0, T];

iii) if t_min, τ, c_2 and T are chosen such that

min{t_min, τ/(L_f + c_2)} ≥ T,  (4.14)

then x* = Proj_{C_s∩Ω}(x* − t∇f(x*)) for all t ∈ [0, T];

iv) if ||x*||_0 = s and f is additionally convex in Ω, then x* is a local optimal solution of problem (1.1);

v) if ||x*||_0 = s and (P(x*))_i ≠ (P(x*))_j for all i ≠ j ∈ supp(x*), then x* is a coordinatewise stationary point of problem (1.1).

Before ending this subsection, we establish some stronger results than those given in Theorem 4.3 under an additional assumption regarding Ω, stated below.

Assumption 3 Given any x ∈ Ω with ||x||_0 < s, if v ∈ R^n satisfies

v_T ∈ N_{Ω_T}(x_T)  ∀T ∈ T_s(x),  (4.15)

then v ∈ N_Ω(x).

The following result shows that Assumption 3 holds for some widely used sets Ω.

Proposition 4.2 Suppose that X_i, i = 1, ..., n, are closed intervals in R, a ∈ R^n with a_i ≠ 0 for all i, b ∈ R, and g : R^n_+ → R is a smooth increasing convex function. Let

B = X_1 × ··· × X_n,  C = {x : a^T x − b = 0},  Q = {x : g(|x|) ≤ 0}.

Suppose additionally that g(0) < 0, and moreover, [∇g(|x|)]_{supp(x)} ≠ 0 for any 0 ≠ x ∈ R^n_+. Then Assumption 3 holds for Ω = B, C, Q, C ∩ R^n_+, Q ∩ R^n_+, respectively.

Proof. We only prove the case where Ω = Q ∩ R^n_+ since the other cases can be proved similarly. To this end, suppose that Ω = Q ∩ R^n_+, x ∈ Ω with ||x||_0 < s, and v ∈ R^n satisfies (4.15). We now prove v ∈ N_Ω(x) by considering two separate cases as follows.

Case 1): g(|x|) < 0. It then follows that N_Ω(x) = N_{R^n_+}(x) and N_{Ω_T}(x_T) = N_{R^s_+}(x_T) for all T ∈ T_s(x). Using this and the assumption that v satisfies (4.15), one has v_T ∈ N_{R^s_+}(x_T) for all T ∈ T_s(x), which implies v ∈ N_{R^n_+}(x) = N_Ω(x).

Case 2): g(|x|) = 0. This together with g(0) < 0 implies x ≠ 0. Since g is a smooth increasing convex function on R^n_+, one can show that

∂h(x) = {∇g(|x|) ∘ w : w ∈ ∂|x_1| × ··· × ∂|x_n|},  where h(·) := g(|·|).  (4.16)

In addition, by g(0) < 0 and [22, Theorems 3.18, 3.20], one has

N_Ω(x) = {αu : α ≥ 0, u ∈ ∂h(x)} + N_{R^n_+}(x),  (4.17)

N_{Ω_T}(x_T) = {αu_T : α ≥ 0, u ∈ ∂h(x)} + N_{R^s_+}(x_T)  ∀T ∈ T_s(x).  (4.18)

Recall that 0 < ||x||_0 < s. Let I = supp(x). It follows that I ≠ ∅, and moreover, for any j ∈ I^c, there exists some T_j ∈ T_s(x) such that j ∈ T_j. Since v satisfies (4.15), one can observe from (4.16) and (4.18) that for any j ∈ I^c, there exist some α_j ≥ 0, h^j ∈ N_{R^s_+}(x_{T_j}) and w^j ∈ ∂|x_1| × ··· × ∂|x_n| such that

v_{T_j} = α_j [∇g(|x|) ∘ w^j]_{T_j} + h^j.

Using this, I = supp(x) ⊆ T_j, x_I > 0 and x_{I^c} = 0, we see that

v_I = α_j [∇g(|x|)]_I,  v_j ∈ α_j (∇g(|x|))_j [−1, 1] + q_j  ∀j ∈ I^c  (4.19)

for some q_j ≤ 0 with j ∈ I^c. Since x ≠ 0 and I = supp(x), we have from the assumption that [∇g(|x|)]_I ≠ 0. This along with (4.19) implies that there exists some α ≥ 0 such that α_j = α for all j ∈ I^c. It then follows from this, x_{I^c} = 0 and (4.19) that

v_I = α[∇g(|x|)]_I,  v_{I^c} ∈ α[∇g(|x|)]_{I^c}[−1, 1] + N_{R^{n−|I|}_+}(x_{I^c}),

which together with (4.16), (4.17) and x_I > 0 implies that v ∈ N_Ω(x).

As an immediate consequence of Proposition 4.2, Assumption 3 holds for some sets Ω that were recently considered in [3].

Corollary 4.1 Assumption 3 holds for Ω = R^n, R^n_+, Δ, Δ_+, B_p(0; r) and B^+_p(0; r), where Δ = {x ∈ R^n : Σ^n_{i=1} x_i = 1}, Δ_+ = Δ ∩ R^n_+, B_p(0; r) = {x ∈ R^n : ||x||^p_p ≤ r}, B^+_p(0; r) = B_p(0; r) ∩ R^n_+ for some r > 0 and p ≥ 1, and ||x||^p_p = Σ^n_{i=1} |x_i|^p for all x ∈ R^n.

We are now ready to present some stronger results than those given in Theorem 4.3 under some additional assumptions. Their proofs are left to Subsection 4.3.

Theorem 4.4 Let {x^k} be the sequence generated by the NPG method. Suppose that x* is an accumulation point of {x^k} and Assumption 3 holds. There hold:

i) x* = Proj_{C_s∩Ω}(x* − t∇f(x*)) for all t ∈ [0, T], that is, x* is the unique optimal solution to the problems

min_{x∈C_s∩Ω} ||x − (x* − t∇f(x*))||^2  ∀t ∈ [0, T];

ii) if ||x*||_0 < s and f is additionally convex in Ω, then x* ∈ Arg min_{x∈C_s∩Ω} f(x), that is, x* is a global optimal solution of problem (1.1);

iii) if ||x*||_0 = s and f is additionally convex in Ω, then x* is a local optimal solution of problem (1.1);

iv) if ||x*||_0 = s and (P(x*))_i ≠ (P(x*))_j for all i ≠ j ∈ supp(x*), then x* is a coordinatewise stationary point.
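Statement i) of Theorems 4.3 and 4.4 characterizes x* as a fixed point of the projected-gradient map for a whole range of stepsizes. As a minimal illustration, assume Ω = R^n (one of the sets covered by Corollary 4.1), in which case Proj_{C_s∩Ω} reduces to hard thresholding, i.e. keeping the s largest entries in magnitude. The function names are ours; for general Ω the projection is more involved, and at ties it is set-valued.

```python
import numpy as np

def proj_sparse(z, s):
    """Projection onto C_s = {x : ||x||_0 <= s} when Omega = R^n:
    keep the s entries of largest magnitude, zero out the rest."""
    y = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]   # indices of the s largest |z_i|
    y[idx] = z[idx]
    return y

def is_fixed_point(x, grad_f, s, ts, tol=1e-12):
    """Check x = Proj_{C_s}(x - t grad f(x)) for each stepsize t in ts,
    i.e. the stationarity condition appearing in Theorems 4.3 and 4.4 i)."""
    return all(np.linalg.norm(proj_sparse(x - t * grad_f(x), s) - x) <= tol
               for t in ts)
```

For example, with f(x) = ||x − c||^2/2 and a target c that is itself s-sparse, the point x* = c passes this check for every t, while a point with the wrong support does not.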
4.3 Proof of main results

In this subsection we present proofs of the main results, namely Theorems 4.2, 4.3 and 4.4. We start with the proof of Theorem 4.2.

Proof of Theorem 4.2. i) Let N be defined in (4.13). We first show that

f(x^{k+1}) ≤ max_{[k−M]_+ ≤ i ≤ k} f(x^i) − (σ/2)||x^{k+1} − x^k||^2  ∀k ∈ N  (4.20)

for some σ > 0. Indeed, one can observe that for every k ∈ N, x^{k+1} is generated by step 2) or 3) of NPG. We now divide the proof of (4.20) into these two separate cases.

Case 1): x^{k+1} is generated by step 2). Then x^{k+1} = x̄^{k+1} or x̂^{k+1}, and moreover β(T; x^k) ∈ [0, T]. Using (4.7) and a similar argument as for proving (3.4), one can show that

f(x̄^{k+1}) ≤ f(x^k) − (1/2)(1/T − L_f)||x̄^{k+1} − x^k||^2.  (4.21)

Hence, if x^{k+1} = x̄^{k+1}, then (4.20) holds with σ = 1/T − L_f. Moreover, such a σ is positive due to 0 < T < 1/L_f. We next suppose x^{k+1} = x̂^{k+1}. Using this relation and the convexity of ||·||^2, one has

||x^{k+1} − x^k||^2 = ||x̂^{k+1} − x^k||^2 ≤ 2(||x̄^{k+1} − x^k||^2 + ||x̂^{k+1} − x̄^{k+1}||^2).  (4.22)

Summing up (4.9) and (4.21) and using (4.22), we have

f(x^{k+1}) = f(x̂^{k+1}) ≤ f(x^k) − (1/4) min{1/T − L_f, c_1} ||x^{k+1} − x^k||^2,

and hence (4.20) holds with σ = min{1/T − L_f, c_1}/2.

Case 2): x^{k+1} is generated by step 3). It immediately follows from (4.11) that (4.20) holds with σ = c_2.

Combining the above two cases, we conclude that (4.20) holds for some σ > 0.

Let l(k) be an integer such that [k − M]_+ ≤ l(k) ≤ k and f(x^{l(k)}) = max_{[k−M]_+ ≤ i ≤ k} f(x^i). It follows from (4.20) that f(x^{k+1}) ≤ f(x^{l(k)}) for every k ∈ N. Also, notice that for any k ∉ N, x^{k+1} must be generated by step 1b) and f(x^{k+1}) < f(x^k), which implies f(x^{k+1}) ≤ f(x^{l(k)}). By these facts, it is not hard to observe that {f(x^{l(k)})} is non-increasing. In addition, recall from Assumption 2 that f is bounded below on C_s ∩ Ω. Since {x^k} ⊆ C_s ∩ Ω, we know that {f(x^k)} is bounded below and so is {f(x^{l(k)})}. Hence,

lim_{k→∞} f(x^{l(k)}) = f̂  (4.23)

for some f̂ ∈ R. In addition, it is not hard to observe that f(x^k) ≤ f(x^0) for all k ≥ 0. Thus {x^k} ⊆ Ω^0, where Ω^0 is defined in Assumption 2.

We next show that

lim_{k→∞} f(x^{N(k−1)+1}) = f̂,  (4.24)

where f̂ is given in (4.23). Let K_j = {k ≥ 1 : l(Nk) = Nk − j} for 0 ≤ j ≤ M. Notice that Nk − M ≤ l(Nk) ≤ Nk. This implies that {K_j : 0 ≤ j ≤ M} forms a partition of all positive integers and hence ∪^M_{j=0} K_j = {1, 2, ...}. Let 0 ≤ j ≤ M be arbitrarily chosen such that K_j is an infinite set.
One can show that

lim_{k∈K_j→∞} f(x^{l(Nk)−n_j}) = f̂,  (4.25)

where n_j = N − 1 − j. Indeed, due to 0 ≤ j ≤ M ≤ N − 1, we have n_j ≥ 0. Also, for every k ∈ K_j, we know that l(Nk) − n_j = N(k − 1) + 1. It then follows that

N(k − 1) + 1 ≤ l(Nk) − i ≤ Nk  ∀0 ≤ i ≤ n_j, k ∈ K_j.

Thus for every 0 ≤ i < n_j and k ∈ K_j, we have 1 ≤ mod(l(Nk) − i − 1, N) ≤ N − 1. It follows that x^{l(Nk)−i} must be generated by step 2) or 3) of NPG. This together with (4.20) implies that

f(x^{l(Nk)−i}) ≤ f(x^{l(l(Nk)−i−1)}) − (σ/2)||x^{l(Nk)−i} − x^{l(Nk)−i−1}||^2  ∀0 ≤ i < n_j, k ∈ K_j.  (4.26)

Letting i = 0 in (4.26), one has

f(x^{l(Nk)}) ≤ f(x^{l(l(Nk)−1)}) − (σ/2)||x^{l(Nk)} − x^{l(Nk)−1}||^2  ∀k ∈ K_j.

Using this relation and (4.23), we have lim_{k∈K_j→∞} ||x^{l(Nk)} − x^{l(Nk)−1}|| = 0. By this, (4.23), {x^k} ⊆ Ω^0 and the uniform continuity of f in Ω^0, one has

lim_{k∈K_j→∞} f(x^{l(Nk)−1}) = lim_{k∈K_j→∞} f(x^{l(Nk)}) = f̂.

Using this result and repeating the above arguments recursively for i = 1, ..., n_j − 1, we can conclude that (4.25) holds, which, together with the fact that l(Nk) − n_j = N(k − 1) + 1 for every k ∈ K_j, implies that lim_{k∈K_j→∞} f(x^{N(k−1)+1}) = f̂. In view of this and ∪^M_{j=0} K_j = {1, 2, ...}, one can see that (4.24) holds as desired.

In what follows, we show that

lim_{k→∞} f(x^{Nk}) = f̂.  (4.27)

For convenience, let

N_1 = {k : x^{Nk+1} is generated by step 2) or 3) of NPG},
N_2 = {k : x^{Nk+1} is generated by step 1) of NPG}.

Clearly, at least one of them is an infinite set. We first suppose that N_1 is an infinite set. It follows from (4.20) and the definition of N_1 that

f(x^{Nk+1}) ≤ f(x^{l(Nk)}) − (σ/2)||x^{Nk+1} − x^{Nk}||^2  ∀k ∈ N_1,

which together with (4.23) and (4.24) implies lim_{k∈N_1→∞} ||x^{Nk+1} − x^{Nk}|| = 0. Using this, (4.24), {x^k} ⊆ Ω^0 and the uniform continuity of f in Ω^0, one has

lim_{k∈N_1→∞} f(x^{Nk}) = f̂.  (4.28)

We now suppose that N_2 is an infinite set. By the definition of N_2, we know that

f(x^{Nk+1}) < f(x^{Nk})  ∀k ∈ N_2.

It then follows that

f(x^{Nk+1}) < f(x^{Nk}) ≤ f(x^{l(Nk)})  ∀k ∈ N_2.

This together with (4.23) and (4.24) leads to lim_{k∈N_2→∞} f(x^{Nk}) = f̂. Combining this relation and (4.28), one can conclude that (4.27) holds.

Finally we show that

lim_{k→∞} f(x^{Nk−j}) = f̂  ∀1 ≤ j ≤ N − 2.  (4.29)

One can observe that N(k − 1) + 2 ≤ Nk − j ≤ Nk − 1 for all 1 ≤ j ≤ N − 2. Hence, 2 ≤ mod(Nk − j, N) ≤ N − 1 for every 1 ≤ j ≤ N − 2. It follows that x^{Nk−j+1} is generated by step 2) or 3) of NPG for all 1 ≤ j ≤ N − 2.
This together with (4.20) implies that

f(x^{Nk−j+1}) ≤ f(x^{l(Nk−j)}) − (σ/2)||x^{Nk−j+1} − x^{Nk−j}||^2  ∀1 ≤ j ≤ N − 2.  (4.30)

Letting j = 1 in (4.30), one has

f(x^{Nk}) ≤ f(x^{l(Nk−1)}) − (σ/2)||x^{Nk} − x^{Nk−1}||^2,

which together with (4.23) and (4.27) implies ||x^{Nk} − x^{Nk−1}|| → 0 as k → ∞. By this, (4.27), {x^k} ⊆ Ω^0 and the uniform continuity of f in Ω^0, we conclude that lim_{k→∞} f(x^{Nk−1}) = f̂. Using this result and repeating the above arguments recursively for j = 2, ..., N − 2, we can see that (4.29) holds.

Combining (4.24), (4.27) and (4.29), we conclude that statement i) of this theorem holds.

ii) We now prove statement ii) of this theorem. It follows from (4.20) that

f(x^{k+1}) ≤ f(x^{l(k)}) − (σ/2)||x^{k+1} − x^k||^2  ∀k ∈ N,

which together with (4.23) and statement i) of this theorem immediately implies that statement ii) holds.

We next turn to the proofs of Theorems 4.3 and 4.4. Before proceeding, we establish several lemmas as follows.

Lemma 4.1 Let {x^k} be the sequence generated by the NPG method and x* an accumulation point of {x^k}. There holds

x* ∈ Proj_{C_s∩Ω}(x* − t̂∇f(x*))  (4.31)

for some t̂ ∈ [min{t_min, τ/(L_f + c_2)}, t_max].

Proof. Since x* is an accumulation point of {x^k} and {x^k} ⊆ C_s ∩ Ω, one can observe that x* ∈ C_s ∩ Ω and moreover there exists a subsequence K such that {x^k}_K → x*. We now divide the proof of (4.31) into three cases as follows.

Case 1): {x^{k+1}}_K contains infinitely many x^{k+1} that are generated by step 3) of the NPG method. Considering a subsequence if necessary, we assume for convenience that {x^{k+1}}_K is generated by step 3) of NPG. It follows from (4.10) with t̄_k = t_k that

x^{k+1} ∈ Arg min_{x∈C_s∩Ω} ||x − (x^k − t_k∇f(x^k))||^2  ∀k ∈ K,

which implies that for all k ∈ K and x ∈ C_s ∩ Ω,

||x − (x^k − t_k∇f(x^k))||^2 ≥ ||x^{k+1} − (x^k − t_k∇f(x^k))||^2.  (4.32)

We know from (4.12) that t_k ∈ [min{t_min, τ/(L_f + c_2)}, t_max] for all k ∈ K. Considering a subsequence of K if necessary, we assume without loss of generality that {t_k}_K → t̂ for some t̂ ∈ [min{t_min, τ/(L_f + c_2)}, t_max]. Notice that K ⊆ N, where N is given in (4.13). It thus follows from Theorem 4.2 that ||x^{k+1} − x^k|| → 0 as k ∈ K → ∞.
Using this relation, {x^k}_K → x*, {t_k}_K → t̂ and taking limits on both sides of (4.32) as k ∈ K → ∞, we obtain that

||x − (x* − t̂∇f(x*))||^2 ≥ t̂^2 ||∇f(x*)||^2  ∀x ∈ C_s ∩ Ω.

This together with x* ∈ C_s ∩ Ω implies that (4.31) holds.

Case 2): {x^{k+1}}_K contains infinitely many x^{k+1} that are generated by step 1) of the NPG method. Without loss of generality, we assume for convenience that {x^{k+1}}_K is generated by step 1) of NPG. It then follows from NPG that mod(k, N) = 0 for all k ∈ K. Hence, we have mod(k − 2, N) = N − 2 and mod(k − 1, N) = N − 1 for every k ∈ K, which together with N ≥ 3 implies that {x^{k−1}}_K and {x^k}_K are generated by step 2) or 3) of NPG. By Theorem 4.2, we then have ||x^{k−1} − x^{k−2}|| → 0 and ||x^k − x^{k−1}|| → 0 as k ∈ K → ∞. Using this relation and {x^k}_K → x*, we have {x^{k−2}}_K → x*

and {x k 1 } K x. We now divide the rest of the proof of this case into two separate subcases as follows. Subcase 2a): q N 1. This together with N 3 and modk 1, N) = N 1 for all k K implies that 0 < modk 1, N) q for every k K. Hence, {x k } K must be generated by step 3) of NPG. Using this, {x k 1 } K x and the same argument as in Case 1) with K and k replaced by K 1 and k 1, respectively, one can conclude that 4.31) holds. Subcase 2b): q = N 1. It along with N 3 and modk 2, N) = N 2 for all k K implies that 0 < modk 2, N) q for every k K. Thus {x k 1 } K must be generated by step 3) of NPG. By this, {x k 2 } K x and the same argument as in Case 1) with K and k replaced by K 2 and k 2, respectively, one can see that 4.31) holds. Case 3): {x k+1 } K consists of innite many x k+1 that are generated by step 2) of the NPG method. Without loss of generality, we assume for convenience that {x k+1 } K is generated by step 2) of NPG, which implies that modk, N) = q for all k K. Also, using this and Theorem 4.2, we have x k+1 x k 0 as k K. This together with {x k } K x yields {x k+1 } K x. We now divide the proof of this case into two separate subcases as follows. Subcase 3a): q N 1. It together with modk, N) = q for all k K implies that 0 < modk + 1, N) = q + 1 q for every k K. Hence, {x k+2 } K must be generated by step 3) of NPG. Using this, {x k+1 } K x and the same argument as in Case 1) with K and k replaced by K + 1 and k + 1, respectively, one can see that 4.31) holds. Subcase 3b): q = N 1. This along with N 3 and modk, N) = q for all k K implies that 0 < modk 1, N) = q 1 q for every k K. Thus {x k } K must be generated by step 3) of NPG. The rest of the proof of this subcase is the same as that of Subcase 2a) above. Lemma 4.2 Let {x k } be the sequence generated by the NPG method and x an accumulation point of {x k }. If x 0 = s, then there holds: where ϑ ; ) is dened in 4.5). ϑt; x ) > 0, 4.33) Proof. 
Since x* is an accumulation point of {x^k} and {x^k} ⊆ C_s ∩ Ω, one can observe that x* ∈ C_s ∩ Ω and moreover there exists a subsequence K such that {x^k}_K → x*. Let

i(k) = ⌊k/N⌋N + q if mod(k, N) ≠ 0, and i(k) = k − N + q if mod(k, N) = 0, ∀k ∈ K.

Clearly, mod(i(k), N) = q and |k − i(k)| ≤ N − 1 for all k ∈ K. In addition, one can observe from NPG that for any k ∈ K, if k < i(k), then x^{k+1}, x^{k+2}, ..., x^{i(k)} are generated by step 2) or 3) of NPG; if k > i(k), then x^{i(k)+1}, x^{i(k)+2}, ..., x^k are generated by step 2) or 3) of NPG. This, together with Theorem 4.2, {x^k}_K → x* and |k − i(k)| ≤ N − 1 for all k ∈ K, implies that {x^{i(k)}}_K → x* and {x^{i(k)+1}}_K → x*, that is, {x^k}_K → x* and {x^{k+1}}_K → x*, where, by abuse of notation, K is redefined as {i(k) : k ∈ K}. In view of these, ||x*||_0 = s and ||x^k||_0 ≤ s for all k, one can see that supp(x^k) = supp(x^{k+1}) for sufficiently large k ∈ K. Considering a suitable subsequence of K if necessary, we assume for convenience that

supp(x^k) = supp(x^{k+1}) = supp(x*)  ∀k ∈ K,  (4.34)

||x^k||_0 = ||x*||_0 = s  ∀k ∈ K.  (4.35)

Also, since mod(k, N) = q for all k ∈ K, one knows that {x^{k+1}}_K is generated by step 2) or 3) of NPG, which along with Theorem 4.2 implies

lim_{k∈K→∞} ||x^{k+1} − x^k|| = 0.  (4.36)

We next divide the proof of (4.33) into two separate cases as follows.

Case 1): ϑ(T; x^k) > η holds for infinitely many k ∈ K. Then there exists a subsequence K̄ ⊆ K such that ϑ(T; x^k) > η for all k ∈ K̄. It follows from this, (4.4), (4.5) and (4.34) that for all t ∈ [0, T] and k ∈ K̄,

η < ϑ(T; x^k) ≤ min_{i∈supp(x*)} (P(x^k − t∇f(x^k)))_i − max_{j∈[supp(x*)]^c} (P(x^k − t∇f(x^k)))_j,

where P is given in (2.1). Taking the limit of this inequality as k ∈ K̄ → ∞, and using {x^k}_{K̄} → x* and the continuity of P, we obtain that

min_{i∈supp(x*)} (P(x* − t∇f(x*)))_i − max_{j∈[supp(x*)]^c} (P(x* − t∇f(x*)))_j ≥ η  ∀t ∈ [0, T].

This together with (4.4) and (4.5) yields ϑ(T; x*) ≥ η > 0.

Case 2): ϑ(T; x^k) > η holds only for finitely many k ∈ K. This implies that ϑ(T; x^k) ≤ η holds for infinitely many k ∈ K. Considering a suitable subsequence of K if necessary, we assume for convenience that ϑ(T; x^k) ≤ η for all k ∈ K. This together with the fact that mod(k, N) = q for all k ∈ K implies that x^{k+1} must be generated by step 2) if β(T; x^k) > 0 and by step 3) otherwise. We first show that

lim_{k∈K→∞} x̄^{k+1} = x*,  (4.37)

where x̄^{k+1} is defined in (4.7). One can observe that

x̄^{k+1} = x^k, f(x̄^{k+1}) = f(x^k) if β(T; x^k) = 0, k ∈ K.  (4.38)

Also, notice that if β(T; x^k) > 0 and k ∈ K, then (4.21) holds. Combining this with (4.38), we see that (4.21) holds for all k ∈ K and hence

||x̄^{k+1} − x^k||^2 ≤ 2(1/T − L_f)^{−1}(f(x^k) − f(x̄^{k+1}))  ∀k ∈ K.  (4.39)

This implies f(x̄^{k+1}) ≤ f(x^k) for all k ∈ K. In addition, if β(T; x^k) > 0 and k ∈ K, one has x^{k+1} = x̄^{k+1} or x̂^{k+1}: if x^{k+1} = x̄^{k+1}, we have f(x^{k+1}) = f(x̄^{k+1}); and if x^{k+1} = x̂^{k+1}, then (4.9) must hold, which yields f(x^{k+1}) = f(x̂^{k+1}) ≤ f(x̄^{k+1}). It then follows that

f(x^{k+1}) ≤ f(x̄^{k+1}) ≤ f(x^k) if β(T; x^k) > 0, k ∈ K.  (4.40)

By {x^k}_K → x* and (4.36), we have {x^{k+1}}_K → x*. Hence,

lim_{k∈K→∞} f(x^{k+1}) = lim_{k∈K→∞} f(x^k) = f(x*).

In view of this, (4.38) and (4.40), one can observe that

lim_{k∈K→∞} f(x̄^{k+1}) = lim_{k∈K→∞} f(x^k) = f(x*).

This relation and (4.39) lead to ||x̄^{k+1} − x^k|| → 0 as k ∈ K → ∞, which together with {x^k}_K → x* implies that (4.37) holds as desired.

Notice that ||x̄^{k+1}||_0 ≤ s for all k ∈ K. Using this fact, (4.34), (4.35) and (4.37), one can see that there exists some k_0 such that

supp(x̄^{k+1}) = supp(x^k) = supp(x*)  ∀k ∈ K, k > k_0,  (4.41)

||x̄^{k+1}||_0 = ||x^k||_0 = ||x*||_0 = s  ∀k ∈ K, k > k_0.  (4.42)

By (4.8), (4.42), 0 < s < n, and the definition of ChangeSupport, we can observe that supp(x̂^{k+1}) ≠ supp(x̄^{k+1}) for all k ∈ K and k > k_0. This together with (4.34) and (4.41) implies that

supp(x̂^{k+1}) ≠ supp(x^{k+1})  ∀k ∈ K, k > k_0,

and hence x^{k+1} ≠ x̂^{k+1} for every k ∈ K and k > k_0. Using this and the fact that mod(k, N) = q and ϑ(T; x^k) ≤ η for all k ∈ K, we conclude that (4.9) must fail for all k ∈ K and k > k_0, that is,

f(x̂^{k+1}) > f(x̄^{k+1}) − (c_1/2)||x̂^{k+1} − x̄^{k+1}||^2  ∀k ∈ K, k > k_0.  (4.43)

Notice from (4.5) that β(T; x^k) ∈ [0, T]. Considering a subsequence if necessary, we assume without loss of generality that

lim_{k∈K→∞} β(T; x^k) = t*  (4.44)

for some t* ∈ [0, T]. By the definition of P, one has that for all k ∈ K and k > k_0,

(P(x̄^{k+1}))_i > (P(x̄^{k+1}))_j  ∀i ∈ supp(x̄^{k+1}), j ∈ [supp(x̄^{k+1})]^c.

This together with Lemma 2.3 and (4.7) implies that for all k ∈ K and k > k_0,

min_{i∈supp(x̄^{k+1})} (P(x^k − β(T; x^k)∇f(x^k)))_i ≥ max_{j∈[supp(x̄^{k+1})]^c} (P(x^k − β(T; x^k)∇f(x^k)))_j.

In view of this relation and (4.41), one has

min_{i∈supp(x^k)} (P(x^k − β(T; x^k)∇f(x^k)))_i ≥ max_{j∈[supp(x^k)]^c} (P(x^k − β(T; x^k)∇f(x^k)))_j  ∀k ∈ K, k > k_0,

which along with (4.5) implies that for all k ∈ K, k > k_0 and t ∈ [0, T],

min_{i∈supp(x^k)} (P(x^k − t∇f(x^k)))_i − max_{j∈[supp(x^k)]^c} (P(x^k − t∇f(x^k)))_j ≥ min_{i∈supp(x^k)} (P(x^k − β(T; x^k)∇f(x^k)))_i − max_{j∈[supp(x^k)]^c} (P(x^k − β(T; x^k)∇f(x^k)))_j ≥ 0.

Using this and (4.34), we have that for all k ∈ K, k > k_0 and every t ∈ [0, T],

min_{i∈supp(x*)} (P(x^k − t∇f(x^k)))_i − max_{j∈[supp(x*)]^c} (P(x^k − t∇f(x^k)))_j ≥ min_{i∈supp(x*)} (P(x^k − β(T; x^k)∇f(x^k)))_i − max_{j∈[supp(x*)]^c} (P(x^k − β(T; x^k)∇f(x^k)))_j ≥ 0.

Taking limits on both sides of this inequality as k ∈ K → ∞, and using (4.4), (4.44), {x^k}_K → x* and the continuity of P, one can obtain that γ(t; x*) ≥ γ(t*; x*) ≥ 0 for every t ∈ [0, T]. It then follows from this and (4.5) that ϑ(T; x*) = γ(t*; x*) ≥ 0. To complete the proof of (4.33), it suffices to show γ(t*; x*) ≠ 0.
Suppose on the contrary that γ(t*; x*) = 0, which together with (4.4) implies that

min_{i∈supp(x*)} b*_i = max_{j∈[supp(x*)]^c} b*_j,  (4.45)

where

b* = P(x* − t*∇f(x*)).  (4.46)

Let

a^k = x̄^{k+1} − β(T; x^k)∇f(x̄^{k+1}),  (4.47)

b^k = P(x̄^{k+1} − β(T; x^k)∇f(x̄^{k+1})),  (4.48)

I_k = Arg min_{i∈supp(x̄^{k+1})} b^k_i,  J_k = Arg max_{j∈[supp(x̄^{k+1})]^c} b^k_j,  (4.49)

S_{I_k} ⊆ I_k and S_{J_k} ⊆ J_k such that |S_{I_k}| = |S_{J_k}| = min{|I_k|, |J_k|}. Also, let S_k = (supp(x̄^{k+1}) ∪ S_{J_k}) \ S_{I_k}. Notice that I_k, J_k, S_{I_k}, S_{J_k} and S_k are subsets of {1, ..., n} and only have a finite number of possible choices. Therefore, there exists some subsequence K̂ ⊆ K such that

I_k = I, J_k = J, S_{I_k} = S_I, S_{J_k} = S_J, S_k = S  ∀k ∈ K̂  (4.50)

for some nonempty index sets I, J, S_I, S_J and S. In view of these relations and (4.41), one can observe that

I ⊆ supp(x*), J ⊆ [supp(x*)]^c, S_I ⊆ I, S_J ⊆ J,  (4.51)

|S_I| = |S_J| = min{|I|, |J|},  S = (supp(x*) ∪ S_J) \ S_I,  (4.52)

and moreover, S ≠ supp(x*) and |S| = |supp(x*)| = s. In addition, by (4.37), (4.44), (4.47), (4.48), K̂ ⊆ K and the continuity of P, we see that

lim_{k∈K̂→∞} a^k = a*,  lim_{k∈K̂→∞} b^k = b*,  b* = P(a*),  (4.53)

where b* is defined in (4.46) and a* is defined as

a* = x* − t*∇f(x*).  (4.54)

We claim that

lim_{k∈K̂→∞} x̂^{k+1} = x̂*,  (4.55)

where x̂* is defined as

x̂*_S = Proj_{Ω_S}(a*_S),  x̂*_{S^c} = 0.  (4.56)

Indeed, by the definitions of ChangeSupport and x̂^{k+1}, we can see that

x̂^{k+1}_{S_k} = Proj_{Ω_{S_k}}(a^k_{S_k}), x̂^{k+1}_{(S_k)^c} = 0  ∀k ∈ K̂,

which together with (4.50) yields

x̂^{k+1}_S = Proj_{Ω_S}(a^k_S), x̂^{k+1}_{S^c} = 0  ∀k ∈ K̂.

Using these, (4.53) and the continuity of Proj_{Ω_S}, we immediately see that (4.55) and (4.56) hold as desired.

We next show that

x̂* ∈ Arg min_{x∈C_s∩Ω} ||x − a*||^2,  (4.57)

where a* and x̂* are defined in (4.53) and (4.56), respectively. Indeed, it follows from (4.41), (4.49) and (4.50) that

b^k_i ≤ b^k_j  ∀i ∈ I, j ∈ supp(x*), k ∈ K̂, k > k_0,

b^k_i ≥ b^k_j  ∀i ∈ J, j ∈ [supp(x*)]^c, k ∈ K̂, k > k_0.  (4.58)

Taking limits as k ∈ K̂ → ∞ on both sides of the inequalities in (4.58), and using (4.53), one has

b*_i ≤ b*_j  ∀i ∈ I, j ∈ supp(x*),  b*_i ≥ b*_j  ∀i ∈ J, j ∈ [supp(x*)]^c.

These together with (4.51) imply that

I ⊆ Arg min_{i∈supp(x*)} b*_i,  J ⊆ Arg max_{j∈[supp(x*)]^c} b*_j.