arxiv: v2 [math.oc] 21 Nov 2017

Size: px

Start display at page:

Download "arxiv: v2 [math.oc] 21 Nov 2017"

Gervais Johnson
6 years ago
Views:

1 Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems arxiv: v [math.oc] 1 Nov 017 Lei Yang Abstract In this paper, we consider a class of possibly nonconvex, nonsmooth and non-lipschitz optimization problems arising in many contemporary applications such as machine learning, variable selection and image processing. To solve this class of problems, we propose a proximal gradient method with extrapolation and line search. This method is developed based on a special potential function and successfully incorporates both extrapolation and non-monotone line search, which are two simple and efficient accelerating techniques for the proximal gradient method. Thanks to the line search, this method allows more flexibilities in choosing the extrapolation parameters and updates them adaptively at each iteration if a certain line search criterion is not satisfied. Moreover, with proper choices of parameters, our reduces to many existing algorithms. We also show that, under some mild conditions, our line search criterion is well defined and any cluster point of the sequence generated by is a stationary point of our problem. In addition, by assuming the Kurdyka- Lojasiewicz exponent of the objective in our problem, we further analyze the local convergence rate of two special cases of, including the widely used non-monotone proximal gradient method as one case. inally, we conduct some numerical experiments for solving the l 1 regularized logistic regression problem and the l 1- regularized least squares problem. Our numerical results illustrate the efficiency of and show the potential advantage of combining two accelerating techniques. Keywords: Proximal gradient method; extrapolation; non-monotone; line search; stationary point. 1 Introduction In this paper, we consider the following composite optimization problem: min x := fx+px, 1.1 x Rn where f : R n R and P : R n R + } satisfy Assumption 1.1. We also assume that the proximal mapping of νp is easy to compute for all ν > 0 see the next section for notation and definitions. Assumption 1.1. i f : R n R is a continuously differentiable possibly nonconvex function with Lipschitz continuous gradient, i.e., there exists some Lipschitz constant L f > 0 such that fx fy L f x y, x, y R n. ii P : R n R + } is a proper closed possibly nonconvex, nonsmooth and non-lipschitz function; it is bounded below and continuous on its domain. iii is level-bounded. Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, P.R. China. lei.yang@connect.polyu.hk. 1

2 Problem 1.1 arises in many contemporary applications such as machine learning [14, 34], variable selection [16,, 3, 36, 47] and image processing [10, 30]. In general, f is a loss or fitting function used for measuring the deviation of a solution from the observations. Two commonly used loss functions are least squares loss function: fx = 1 Ax b, logistic loss function: fx = m i=1 log1+exp b ia i x, where A = [a 1,,a m ] R m n is a data matrix and b R m is an observed vector. One can verify that these two loss functions satisfy Assumption 1.1i. On the other hand, the function P is usually a regularizer used for inducing certain structure in the solution. or example, P can be the indicator function for a certain set such as X = x R n : x 0} and X = x R n : n i=1 x i = 1, x 0}; the former choice restricts the elements of the solution to be nonnegative and the latter choice restricts the solution in a simplex. We can also choose P to be a certain sparsity-inducing regularizer such as λ x p p for 0 < p 1 [, 3, 36], λ n i=1 log1 + α x i for α > 0 [30] and λ x 1 x [46], where λ > 0 is a regularization parameter. Note that all the aforementioned examples of P as well as many other widely used regularizerssee [1, 11] and references therein for more regularizers satisfy Assumption 1.1ii. inally, we would like to point out that Assumption 1.1iii is also satisfied by many choices of f and P in practice; see, for example, 5.1 and 5.7. More examples of 1.1 can be found in [10, 14, 34] and references therein. Due to the importance and the popularity of 1.1, various attempts have been made to solve it efficiently, especially when the problem involves a large number of variables. One popular class of methods for solving 1.1 are first-order methods due to their cheap iteration cost and good convergence properties. Among them, the proximal gradient PG method 1 [17, 5] is arguably the most fundamental one, whose basic iteration is x k+1 Argmin fx k, x + µ } x x xk +Px, 1. where µ > 0 is a constant depending on the Lipschitz constant L f of f. However, PG can be slow in practice; see, for example, [5, 3, 39]. Therefore, a large amount of research has been conducted to accelerate PG for solving 1.1. One simple and widely studied strategy is to perform extrapolation in the spirit of Nesterov s extrapolation techniques [7, 8], whose basic idea is to make use of historical information at each iteration. A typical scheme of the proximal gradient method with extrapolation PGe for solving 1.1 is y k = x k +β k x k x k 1, x k+1 Argmin x fy k, x + µ x yk +Px where β k is the extrapolation parameter satisfying certain conditions and µ > 0 is a constant depending on L f. One representative algorithm that takes the form of 1.3 with proper choices of β k } is the fast iterative shrinkage-thresholding algorithm ISTA [5], which is also known as the accelerated proximal gradient method APG independently proposed and studied by Nesterov [9]. This algorithm ISTA or APG is designed for solving 1.1 with f and P being convex and exhibits a faster convergence rate O1/k in terms of objective values see [5, 9] for more details. This motivates the study of PGe and its variants for solving 1.1 under different scenarios; see, for example, [6, 18, 31, 3, 37, 38, 39, 40, 4, 43, 44]. It is worth noting that, when f and P are convex, many existing PGe and its variants [5, 6, 9, 3, 37, 38] choose the extrapolation parameters β k } explicitly or implicitly based on the following updating scheme β k = t k 1 1/t k, t k+1 = t k, }, 1.3 with t 1 = t 0 = 1, PG is also known as the forward-backward splitting algorithm [13] or the iterative shrinkage-thresholding algorithm [5].

3 which originated from Nesterov s work [7, 8] and was shown to be optimal [8]. However, for the nonconvex case, the optimal choices of β k } are still not clear. Although the convergence of PGe and its variants can be guaranteed in theory for some classes of nonconvex problems under certain conditions on β k } see, for example, [18, 31, 39, 40, 4, 43, 44], the choices of β k } are relatively restrictive and hence may not work well for acceleration. Another efficient strategy for accelerating PG is to apply a non-monotone line search to adaptively find a proper µ in 1. at each iteration. The non-monotone line search technique dates back to the non-monotone Newton s method proposed by Grippo et al. [1] and has been applied to many algorithms with good empirical performances; see, for example, [7, 15, 45, 48]. Based on this technique, Wright et al. [41] recently proposed an efficient method called SpaRSA to solve 1.1, whose iteration is roughly given as follows: Choose τ > 1, c > 0 and an integer N 0. Then, at the k-th iteration, choose µ 0 k > 0 and find the smallest nonnegative integer j k such that } u Argmin x fx k, x + τj k µ 0 k x x k +Px u max [k N] xi c + i k u xk. This method is essentially the non-monotone proximal gradient method, namely, the proximal gradient method with a non-monotone line search. Later, was extended for solving 1.1 under more general conditions and has been shown to have promising numerical performances in many applications see, for example, [1, 0, 6]. In view of the above, it is natural to raise a question: Can we derive an efficient method for solving 1.1, which takes advantage of both extrapolation and non-monotone line search? In this paper, we propose such a method for solving 1.1 that successfully incorporates both extrapolation and non-monotone line search and allows more flexibilities in choosing the extrapolation parameters β k }. We call our method the proximal gradient method with extrapolation and line search. This method is developed based on the following potential function specifically constructed for in 1.1: u,v,µ := u+ δµ 4 u v, u, v R n, µ > 0, 1.5 where δ [0, 1 is a given nonnegative constant. Clearly, u,v,µ u if δ = 0. We will see in Section 3 that this potential function is used to establish a new non-monotone line search criterion 3.3 when the extrapolation technique is applied. This allows more choices of β k at each iteration, and will adaptively update µ k and β k at the same time if the line search criterion is not satisfied see Algorithm 1 for more details. The convergence analysis of is also presented in Section 3. Specifically, under Assumption 1.1, we show that our line search criterion 3.3 is well defined and any cluster point of the sequence generated by is a stationary point of 1.1. Moreover, since our reduces to PG, PGe or with proper choices of parameters see Remark 3.1, then we actually obtain a unified convergence analysis for PG, PGe and as a byproduct. In addition, in Section 4, we further study thelocalconvergencerateintermsofobjectivevaluesfortwospecialcasesofincluding asone case under an additional assumption on the Kurdyka- Lojasiewicz exponent of the objective in 1.1. To the best of our knowledge, this is the first local convergence rate analysis of for solving 1.1. inally, we conduct some numerical experiments in Section 5 to evaluate the performance of our method for solving the l 1 regularized logistic regression problem and the l 1- regularized least squares problem. Our computational results illustrate the efficiency of our method and show the potential advantage of combining extrapolation and non-monotone line search. The rest of this paper is organized as follows. In Section, we present notation and preliminaries used in this paper. In Section 3, we describe for solving 1.1 and study its global subsequential convergence. The local convergence rate of two special cases of is analyzed in Section 4 and some numerical results are reported in Section 5. inally, some concluding remarks are given in Section 6., 3

4 Notation and preliminaries In this paper, we present scalars, vectors and matrices in lower case letters, bold lower case letters and upper case letters, respectively. We also use R, R n, R n + and Rm n to denote the set of real numbers, n-dimensional real vectors, n-dimensional real vectors with nonnegative entries and m n real matrices, respectively. or a vector x R n, x i denotes its i-th entry, x denotes its Euclidean norm, x 1 denotes its l 1 norm defined by x 1 := n i=1 x i, x p denotes its l p -quasi-norm 0 < p < 1 defined by x p := n i=1 x i p 1 p and x denotes its l norm given by the largest entry in magnitude. or a matrix A R m n, its spectral norm is denoted by A, which is the largest singular value of A. or an extended-real-valued function h : R n [, ], we say that it is proper if hx > for all x R n and its domain domh := x R n : hx < } is nonempty. A proper function h is said to be closed if it is lower semicontinuous. We also use the notation y h x to denote y x and hy hx. The basic subdifferential see [33, Definition 8.3] of h at x domh used in this paper is hx := d R n : x k h x, d k hy hx k d k,y x k } d with liminf y x k,y x k y x k 0 k. It can be observed from the above definition that } d R n : x k h x, d k d with d k hx k for each k hx..1 When h is continuously differentiable or convex, the above subdifferential coincides with the classical concept of derivative or convex subdifferential of h; see, for example, [33, Exercise 8.8] and [33, Proposition 8.1]. In addition, if h has several groups of variables, we use xi h resp., xi h to denote the partial subdifferential resp., gradient of h with respect to the group of variables x i. or a proper closed function h : R n, ] and ν > 0, the proximal mapping of νh at y R n is defined by Prox νh y := Argmin hx+ 1 } x R n ν x y. Note that this operator is well defined for any y R n if h is bounded below in R n. or a closed set X R n, its indicator function δ X is defined by 0, if x X, δ X x = +, otherwise. We also use distx,x to denote the distance from x to X, i.e., distx,x := inf y X x y. or any local minimizer x of 1.1, it is known from [33, Theorem 10.1] and [33, Exercise 8.8c] that the following first-order necessary condition holds: 0 f x+ P x,. where f denotes the gradient of f. In this paper, we say that x is a stationary point of 1.1 if x satisfies. in place of x. We next recall the Kurdyka- LojasiewiczKL propertysee [, 3, 4, 8, 9] for more details, which plays an important role in our analysis for the local convergence rate in Section 4. or notational simplicity, let Ξ ν ν > 0 denote a class of concave functions ϕ : [0,ν R + satisfying: i ϕ0 = 0; ii ϕ is continuously differentiable on 0,ν and continuous at 0; iii ϕ t > 0 for all t 0,ν. Then, the KL property can be described as follows. Definition.1 KL property and KL function. Let h : R n R + } be a proper closed function. i or x dom h := x R n : hx }, if there exist a ν 0,+ ], a neighborhood V of x and a function ϕ Ξ ν such that for all x V x R n : h x < hx < h x+ν}, it holds that ϕ hx h xdist0, hx 1, then h is said to have the Kurdyka- Lojasiewicz KL property at x. 4

5 ii If h satisfies the KL property at each point of dom h, then h is called a KL function. A large number of functions such as proper closed semialgebraic functions satisfy the KL property [3, 4]. Based on the above definition, we then introduce the KL exponent [3, 4]. Definition. KL exponent. Suppose that h : R n R + } is a proper closed function satisfying the KL property at x dom h with ϕt = a t 1 θ for some a > 0 and θ [0,1, i.e., there exist a,ε,ν > 0 such that dist0, hx ahx h x θ whenever x dom h, x x ε and h x < hx < h x+ν. Then, h is said to have the KL property at x with an exponent θ. If h is a KL function and has the same exponent θ at any x dom h, then h is said to be a KL function with an exponent θ. We also recall the following uniformized KL property, which was established in [9, Lemma 6]. Proposition.1 Uniformized KL property. Suppose that h : R n R + } is a proper closed function and Γ is a compact set. If h ζ on Γ for some constant ζ and satisfies the KL property at each point of Γ, then there exist ε > 0, ν > 0 and ϕ Ξ ν such that ϕ hx ζdist0, hx 1 for all x x R n : distx, Γ < ε} x R n : ζ < hx < ζ +ν}. inally, we recall two useful lemmas, which can be found in [4]. Lemma.1 [4, Lemma.]. Let α > 0. Then, for any w = w 1,,w n R n +, there exist 0 < c 1 c such that c 1 w w α 1 + +wα n 1 α c w. Lemma. [4, Lemma 3.1]. Let 1 < α. Then, for any u,v R n, there exist b 1 > 0 and 0 < b < 1 such that u+v α b 1 u α b v α. 3 Proximal gradient method with extrapolation and line search and its convergence analysis In this section, we present a proximal gradient method with extrapolation and line search for solving 1.1. This method is developed based on a specially constructed potential function, which is defined in 1.5. The complete for solving 1.1 is presented as Algorithm 1. Remark 3.1 Comments on special cases of. In Algorithm 1, if δ = 0, then we have β 0 k = 0 and hence y k = x k for all k 0. In this case, our line search criterion 3.3 reduces to u max [k N] xi c + i k u xk. Thus, our reduces to for solving 1.1 see, for example, [1, 0, 41]. On the other hand, for any 0 < δ < 1, if µ 0 k = µ max L f +c 1 δ and β 0 k δµ max L f µ max 4µ max +L f for all k 0, then it follows from Lemma 3.1 that the line search criterion 3.3 holds trivially for all k 0. Thus, in this case, we do not need to perform the line search loop and hence our reduces to a PGe for solving 1.1. inally, if δ = 0 and µ 0 k = µ max L f +c, then our obviously reduces to PG for solving

6 Algorithm 1 for solving 1.1 Input: x 0 dom, τ > 1, 0 δ < 1, 0 < η < 1, c > 0, µ max L f+c 1 δ µ min > 0, β max 0 and an integer N 0. Set x 1 = x 0, µ 1 = 1 and k = 0. while a termination criterion is not met, do Step 1. Choose µ 0 k [µ min, µ max ] and β 0 k [0, δβ max] arbitrarily. Set µ k = µ 0 k and β k = β 0 k. 1a Compute 1b Solve the subproblem 1c If u Argmin x y k = x k +β k x k x k fy k, x y k + µ } k x yk +Px. 3. u,x k,µ k max x i,x i 1, µ i 1 c [k N] + i k u xk 3.3 is satisfied, then go to Step. 1d Set µ k minτµ k, µ max }, β k ηβ k and go to step 1a. Step. Set x k+1 u, µ k µ k, β k β k, k k +1 and go to Step 1. end while Output: x k Remark 3. Comments on the extrapolation parameters in. Unlike most of the existing PGe and its variants mentioned in Section 1 that should choose the extrapolation parameters under certain schemes or conditions, our can choose any β 0 k [0, δβ max] as an initial guess at each iteration and then updates it and µ k adaptively at the same time if the line search criterion is not satisfied. This strategy actually allows more flexibilities in choosing the extrapolation parameters and works well from our computational results in Section 5. In the following, we will study the convergence properties of. Before proceeding, we present the first-order optimality condition for the subproblem 3. in 1b of Algorithm 1 as follows: 0 fy k +µ k u y k + Pu. 3.4 We now start our convergence analysis by proving the following supporting lemma, which characterizes the descent property of our potential function. Lemma 3.1 Sufficient descent of. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } and µ k } be the sequences generated by Algorithm 1, and let u be the candidate generated by step 1b at the k-th iteration. or any k 0, if δµ k L f µ k 1 µ k > L f and β k 4µ k +L f, then we have u,x k,µ k x k,x k 1, µ k 1 1 δµ k L f u x k, where is the potential function defined in

7 Proof. irst, from 3., we have fy k, u y k + µ k u yk +Pu fy k, x k y k + µ k xk y k +Px k, which implies that Pu Px k + fy k, x k u + µ k xk y k µ k u yk = Px k + fy k, x k u + µ k xk y k µ k u xk +x k y k = Px k + fy k, x k u µ k u xk +µ k x k u, x k y k On the other hand, using the fact that f is Lipschitz continuous with a Lipschitz constant L f Assumption 1.1i, we see from [8, Lemma 1..3] that Summing 3.6 and 3.7, we obtain that 3.6 fu fx k + fx k, u x k + L f u xk. 3.7 fu+pu fx k Px k µ k L f u x k +µ k x k u, x k y k + fx k fy k, u x k µ k L f u x k + µ k x k y k + fx k fy k u x k µ k L f u x k +µ k +L f x k y k u x k µ k L f u x k + µ k L f u x k + µ k +L f x k y k 4 µ k L f = µ k L f u x k + µ k +L f β 4 µ k L k x k x k 1 f µ k L f 4 u x k + δ µ k 1 x k x k 1 4 = 1 δµ k L f u x k δµ k 4 4 u xk + δ µ k 1 x k x k 1, 4 where the second inequality follows from Cauchy Schwarz inequality; the third inequality follows from Lipschitz continuityof f; the fourthinequalityfollowsfromtherelationab a 4s +sb with a = u x k, b = µ k + L f x k y k 1 and s = µ k L f > 0; the first equality follows from 3.1; the last inequality δµk L f µ k 1 follows from β k 4µ k +L f. Then, rearranging terms in above relation and recalling the definition of in 1.5, we obtain 3.5. Remark 3.3 Comments on Lemma 3.1. Note that the descent property in Lemma 3.1 is established for without requiring f or P to be convex or difference-of-convex function. In fact, with additional assumptions e.g., convexity on f or P, one can establish a similar descent property for some other constructed potential function H; see, for example, [39, Lemma 3.1]. Then, one can perform the line search criterion 3.3 with H in place of in Algorithm 1 and the convergence analysis can follow in a similar way as presented in this paper. Thus, one can choose suitable potential function in to fit different scenarios. In this paper, we only focus on under Assumption 1.1. It can be observed from Lemma 3.1 that the sufficient descent of can be guaranteed as long as µ k is sufficiently large and β k is sufficiently small for each k 0. Thus, based on this lemma, we can show in the following proposition that the line search criterion 3.3 in Algorithm 1 is well defined. 7

8 Proposition 3.1 Well-definedness of the line search criterion. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } and µ k } be the sequences generated by Algorithm 1. Then, for each k 0, the line search criterion 3.3 is satisfied after finitely many inner iterations. Proof. We prove this proposition by contradiction. Assume that there exists a k 0 such that the line search criterion 3.3 cannot be satisfied after finitely many inner iterations. Since µ k µ max due to 1d in Algorithm 1, then µ k = µ max must be satisfied after finitely many inner iterations. Let n k denote the number of inner iterations when µ k = µ max is satisfied for the first time. If µ 0 k = µ max, then n k = 1; otherwise, we have µ min τ n k 1 µ 0 kτ n k 1 < µ max, which implies that logµmax logµ min n k logτ +1. Now, we let β k := δµmax L f µ k 1 4µ max+l f for simplicity. Then, from 1d in Algorithm 1, we see that β k is decreasing in the inner loop and hence β k β k must be satisfied after finitely many inner iterations. Similarly, let ˆn k denote the number of inner iterations when β k β k is satisfied for the first time. Note that if δ = 0, we have β 0 k = β k = 0 and hence ˆn k = 1. or 0 < δ < 1, if β 0 k β k, then ˆn k n k ; otherwise, we have β k < β 0 kηˆn k 1 δβ max ηˆn k 1, which implies that logδβmax log β k ˆn k logη +1. Thus, after at most maxn k, ˆn k }+1 inner iterations, we must have µ k µ max and β k β k. Since c > 0 and 0 δ < 1, one can see that µ max L f+c 1 δ > L f and 1 δµ max L f c. Then, using these facts and Lemma 3.1, we have u,x k,µ max x k,x k 1, µ k 1 1 δµ max L f u x k c 4 u xk, which, together with x k,x k 1, µ k 1 max x i,x i 1, µ i 1, [k N] + i k implies that 3.3 must be satisfied after at most maxn k, ˆn k } + 1 inner iterations. This leads to a contradiction. We are now ready to show our first convergence result in the following theorem that characterizes a cluster point of the sequence generated by. Our proof is similar to that of [41, Lemma 4]. However, the arguments involved relies on our potential function 1.5 that contains multiple blocks of variables. This makes our proof more intricate. or notational simplicity, from now on, let lk Argmax x i,x i 1, µ i 1 : i = [k N] +,,k}. 3.8 i Theorem 3.1. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } and µ k } be the sequences generated by Algorithm 1. Then, i boundedness of sequence the sequence x k } is bounded; 8

9 ii non-increase of subsequence of the sequence x lk,x lk 1, µ lk 1 } is non-increasing; iii existence of limit ζ := lim k x lk,x lk 1, µ lk 1 exists; iv diminishing successive changes lim k xk+1 x k = 0; v global subsequential convergence any cluster point x of x k } is a stationary point of 1.1. Proof. Statement i. We first prove by induction that for all k 1. Indeed, for k = 1, it follows from Proposition 3.1 that x k,x k 1, µ k 1 x x 1,x 0, µ 0 x 0,x 1, µ 1 c x1 x 0 0 is satisfied after finitely many inner iterations. This, together with x 1 = x 0, implies that x 1,x 0, µ 0 x 0,x 1, µ 1 = x 0. Hence, 3.9 holds for k = 1. We now suppose that 3.9 holds for all k K for some integer K 1. Next, we show that 3.9 also holds for k = K +1. Indeed, for k = K +1, we have x K+1,x K, µ K x 0 x K+1,x K, µ K max x i,x i 1, µ i 1 [K N] + i K c xk+1 x K 0, where the first inequality follows from the induction hypothesis and the second inequality follows from 3.3. Hence, 3.9 holds for k = K +1. This completes the induction. Then, from 3.9, we have that for any k 1, x 0 x k,x k 1, µ k 1 x k, which, together with Assumption 1.1iii, implies that x k } is bounded. This proves statement i. Statement ii. Recall the definition of lk in 3.8 and let x k := x k+1 x k for simplicity. Then, from the line search criterion 3.3, we have Observe that x k+1,x k, µ k x lk,x lk 1, µ lk 1 c x k x lk+1,x lk+1 1, µ lk+1 1 = max x i,x i 1, µ i 1 [k+1 N] + i k+1 = max x k+1,x k, µ k, max x lk,x lk 1, µ lk 1, max x lk,x lk 1, µ lk 1, max x i,x i 1, µ i 1 [k+1 N] + i k } max x i,x i 1, µ i 1 [k+1 N] + i k } max x i,x i 1, µ i 1 [k N] + i k } = max x lk,x lk 1, µ lk 1, x lk,x lk 1, µ lk 1 = x lk,x lk 1, µ lk 1, where the first inequality follows from 3.10 and the second last equality follows from 3.8. This proves statement ii. } 9

10 Statement iii. It follows from Assumption 1.1iii and the definition of in 1.5 that x lk, x lk 1, µ lk 1 is bounded below. This together with statement ii proves that there exists a number ζ such that lim x lk,x lk 1, µ lk 1 = ζ k Statement iv. We next prove statement iv. To this end, we first show by induction that for all j 1, it holds that lim xlk j = 0, 3.1a k lim k xlk j = ζ. 3.1b We start by proving 3.1a and 3.1b for j = 1. Applying 3.10 with k replaced by lk 1, we obtain x lk,x lk 1, µ lk 1 x llk 1,x llk 1 1, µ llk 1 1 c x lk 1, which, together with 3.11, implies that Then, from 3.11 and 3.13, we have lim xlk 1 = k ζ = lim k x lk,x lk 1, µ lk 1 = lim k xlk 1 + x lk 1+ δ µ lk 1 4 x lk 1 = lim k xlk 1, where the second equality follows from the definition of in 1.5 and the last equality follows because x k } is bounded see statement i, µ k } is bounded since µ min µ k µ max and is uniformly continuous on any compact subset of dom under Assumption 1.1i and ii. Thus, 3.1a and 3.1b hold for j = 1. We next suppose that 3.1a and 3.1b hold for j = J for some J 1. It remains to show that they also hold for j = J +1. Indeed, from 3.10 with k replaced by lk J 1 here, without loss of generality, we assume that k is large enough such that lk J 1 is nonnegative, we have x lk J,x lk J 1, µ lk J 1 x llk J 1,x llk J 1 1, µ llk J 1 1 c x lk J 1, which, together with the definition of in 1.5, implies that c + δ µ lk J 1 4 x lk J 1 x llk J 1,x llk J 1 1, µ llk J 1 1 x lk J. This together with 3.11 and the induction hypothesis implies that lim xlk J+1 = 0. k Thus, 3.1a holds for j = J +1. rom this, we further have lim k xlk J+1 = lim k xlk J x lk J+1 = lim k xlk J = ζ, where the second equality follows because x k } is bounded see statement i, µ k } is bounded since µ min µ k µ max and is uniformly continuous on any compact subset of dom under Assumption 1.1i and ii. Hence, 3.1b also holds for j = J +1. This completes the induction. 10

11 We are now ready to prove the main result in this statement. Indeed, recalling the definition of lk in 3.8, we see that k N lk k without loss of generality, we assume that k is large enough such that k N. Thus, for any k, we must have k N 1 = lk j k for some j k [1, N +1]. Then, we have This together with 3.1a implies that x k N 1 = x lk j k max 1 j N+1 x lk j. lim x k k = lim xk N 1 = 0, k This proves statement iv. Statement v. irst, since x k } is bounded see statement i, there exists at least one cluster point. Suppose that x is a cluster point of x k } and let x ki } be a convergent subsequence such that lim x ki = x. Then, we see from statement iv and the boundedness of β k for any k since i 0 β k βk 0 δβ max that lim = lim x i yki ki +β ki x ki x ki 1 = lim x ki = x i i Thus, passing to the limit along x ki, y ki } in 3.4 with x ki+1 in place of u and µ ki in place of µ k, and invoking Assumption 1.1ii, statement iv, the boundedness of µ k } since µ min µ k µ max,.1 and 3.14, we obtain 0 fx + Px, which implies that x is a stationary point of 1.1. This proves statement v. Based on Theorem 3.1, we can further characterize the sequence of objective values along x k } in the following proposition. This proposition will be useful in the analysis of the local convergence rate in the next section. Proposition 3.. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } be a sequence generated by Algorithm 1 and lk be the index defined in 3.8 for each k. Then, i lim k xlk = ζ, where ζ is given in Theorem 3.1iii; ii ζ on Ω, where Ω is the set of cluster points of the subsequence x lk }. Proof. Statement i follows immediately from Theorem 3.1iii, x k+1 x k 0 see Theorem 3.1iv, the boundedness of µ k } since µ min µ k µ max and the definition of in 1.5. We now prove statement ii. irst, since x lk } is a subsequence of x k }, it follows from Theorem 3.1i and v that Ω X, where X is the set of all stationary points of 1.1. Moreover, we recall from Assumption 1.1i and ii that f is continuously differentiable and P is continuous on its domain. These facts prove statement ii. 4 Local convergence rate of two special cases of In this section, under the additional assumption that the objective in 1.1 is a KL function with an exponent θ, we further study the local convergence rate of two special cases of in terms of objective values. The first case is for with δ = 0, namely, and the other is for with N = 0, where N is the line search gap used in

12 4.1 Local convergence rate of In this subsection, we discuss the local convergence rate of with δ = 0, namely, see Remark 3.1. In this case, we have y k x k and x k x k,x k 1, µ k 1 for all k 0. The main results are presented in the following theorem. This kind of results on local convergence rate have been studied for many existing algorithms; see, for example, [, 40]. However, the analysis there heavily relies on the monotonicity of the objective or certain potential function along the sequence generated and hence cannot be applied for. More intricate analysis is needed for handling the non-monotone line search. Theorem 4.1. Suppose that Assumption 1.1 holds and the objective in 1.1 is a KL function with an exponent θ. Let x k } and µ k } be the sequences generated by Algorithm 1 with δ = 0, and let ζ be given in Theorem 3.1iii. Then, the following statements hold. i If θ = 0, then x k ζ for all large k; ii If θ 0, 1 ], then there exist ρ 0,1 and c1 > 0 such that x k ζ c 1 ρ k for all large k; iii If θ 1,1, then there exists c > 0 such that x k ζ c k 1 θ 1 for all large k. Proof. We start by defining an index sequence ξt} t=0 as follows: ξt = ln +1t, t = 0, 1,,, where lk is defined in 3.8. It is obviously that ξt} t=0 is a subsequence of lk} k=0. Moreover, since k N lk k for any k N, we have ξt = ln +1t N +1t N = N +1t 1+1 ln +1t 1+1 > ξt 1 for any t 1. Therefore, ξt} t=0 is increasing. We now recall from Theorem 3.1ii and Assumption 1.1iii that x lk } k=0 is non-increasing and bounded below. Since xξt } t=0 is a subsequence of x lk } k=0, then it follows that xξt } t=0 is non-increasing and bounded below. Moreover, from Proposition 3., we have lim t xξt = lim k xlk = ζ. In addition, it follows from 3.10 with k replaced by ξt 1 that x ξt x lξt 1 c x ξt 1 x ln+1t 1 c x ξt 1 = x ξt 1 c x ξt 1, 4.1 where the second inequality follows because x lk } k=0 is non-increasing and ξt 1 = ln +1t 1 N +1t 1. We next consider two cases. Case 1. In this case, we suppose that x ξt = ζ for some T 0. Since the sequence x ξt } t=0 is non-increasing, we must have x ξt = ζ for all t T. Then, for all k [N +1t N, N +1t] with any t T, we have x k x ln+1t = x ξt = ζ. Thus, the conclusions of three statements hold. Case. rom now on, we consider the case where x ξt > ζ for all t 0. rom Theorem 3.1i, we see that x ξt } t=0 is bounded and hence must have at least one cluster point. Let Γ denote the set of cluster points of x ξt } t=0. Since xξt } t=0 is a subsequence of xlk } k=0, we have Γ Ω, where Ω is the set of cluster points of x lk } k=0. Then, it follows from Proposition 3. that ζ on Γ. This fact together with our assumption that is a KL function with an exponent θ and Proposition.1, there exist ε,ν > 0 such that ϕ x ζdist0, x 1, where ϕs = as 1 θ for some a > 0, 1

13 for all x satisfying distx, Γ < ε and ζ < x < ζ+ν. On the other hand, since lim distx ξt, Γ = 0 t by the definition of Γ and x ξt ζ, then for such ε and ν, there exists an integer T 0 0 such that distx ξt, Γ < ε and ζ < x ξt < ζ +ν for all t T 0. Thus, for t T 0, we have ϕ x ξt ζ dist 0, x ξt 1, where ϕs = as 1 θ for some a > Next, looking at the subdifferential x k, we have x k = fx k + Px k = fx k 1 + Px k + µ k 1 x k x k 1 + fx k fx k 1 µ k 1 x k x k 1 fx k fx k 1 µ k 1 x k x k 1, where the inclusion follows from the optimality condition for 3. at the k 1-st iteration, i.e., 0 fx k 1 + Px k + µ k 1 x k x k 1. Using this relation together with the global Lipschitz continuity of f and the boundedness of µ k }, there exists d 1 > 0 such that Now, for notational simplicity, let ξt that ξt is non-increasing, ξt 1 ϕ ξt a 1 θ ad 1 1 θ = d dist 0, x k d 1 x k x k := x ξt ζ. Since x ξt } t=0 is non-increasing, we see 0. Then, for all t T 0, we have 0, x ξt θ d1 x ξt x ξt 1 > 0 for t 0 and ξt ξt dist ξt x ξt 1 x c ξt θ ξt θ ξt 1 ξt, where d := ad 1 1 θ > 0, the first inequality follows from 4., the second inequality follows from 4.3 c and the last inequality follows from 4.1. Next, we consider the following three cases. ξt ξt i θ = 0. In this case, we see from 4.4 that ξt 1 0. Thus, Case cannot happen. ξt 1 d 4.4 for all t T 0, which contradicts ii 0 < θ 1. In this case, we have 0 < θ 1. Since ξt 0, there exists T 1 0 such that 1 for all t T := maxt 0, T 1 }. Then, for all t T, we see from 4.4 that which implies that ξt ξt ξt γ ξt 1 θ d ξt 1 ξt, γ t T+1 ξ T 1, where γ := d < 1. Then, for all k [N +1t N, N +1t] with any t T, we have 1+d x k ζ ξt γ t T+1 ξ T 1 γ k N+1 T+1 ξ T 1 = c 1 ρ k, where c 1 := γ T+1 ξ T 1, ρ := γ 1 N+1 < 1 and the last inequality follows from t k N+1. This proves statement ii. iii 1 < θ < 1. We define gs := s θ for s 0,. It is easy to see that g is non-increasing. Then, for any t T 0, we further consider the following two cases. 13

14 If g ξt g ξt 1, it follows from 4.4 that 1 d ξt θ ξt 1 ξt = g ξt ξt 1 ξt g ξt 1 ξt 1 ξt gsds = ξt 1 1 θ ξt 1 θ which, together with 1 θ < 0, implies that ξt 1 θ ξt 1 1 θ, ξt 1 ξt 1 θ θ 1 d. 4.5 If g ξt g ξt 1, one can check that ξt 1 θ θ 1 ξt 1 θ ξt 1 1 θ θ 1 θ 1 ξt 1 1 θ θ ξt 1 θ 1 θ 1 θ. Then, we have 1 ξt0 1 1 θ, 4.6 where the last inequality follows from the facts that ξt is non-increasing and 1 θ < 0. Thus, combining 4.5 and 4.6, we obtain θ 1 1 θ d 3 := min, Then, we have ξt 1 θ ξt 1 ξt 1 θ ξt 1 θ ξt0 1 θ = t j=t 0+1 d θ 1 θ } 1 ξt0 1 1 θ. ξj 1 θ ξj 1 1 θ t T 0 d 3 d 3 t, where the last inequality holds for t T 0. This implies that ξt d 4 t 1 θ 1, where d4 := d 3 Then, for all k [N +1t N, N +1t] with any t T 0, we have x k ζ ξt d 4 t 1 θ 1 c k 1 θ 1, where c := N θ 1 d4 and the last inequality follows from t k 4. Local convergence rate of with N = 0 N+1 1 θ 1.. This proves statement iii. In this subsection, we discuss the local convergence rate of with N = 0, i.e., we set the line search gap N to 0 in 3.3. Before proceeding, we first give the following supporting lemma. Lemma 4.1. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k }, µ k } and β k } be the sequences generated by Algorithm 1. Then, there exist c > 0 and K > 0 such that for any k K. dist 0,0,0, x k,x k 1, µ k 1 c x k x k 1 + x k 1 x k 4.7 Proof. irst, from 3.1 and the first-order optimality condition for 3., we have y k 1 = x k 1 + β k 1 x k 1 x k, 4.8a 0 fy k 1 + µ k 1 x k y k 1 + Px k. 4.8b 14

15 Next, we consider the subdifferential of u,v,µ at the point x k,x k 1, µ k 1 for k 0. Looking at the partial subdifferential with respect to u, we have u x k,x k 1, µ k 1 = x k + δ µ k 1 x k x k 1 = fx k + Px k + δ µ k 1 x k x k 1 = fy k 1 + Px k + µ k 1 x k y k 1 + fx k fy k 1 + δ µ k 1 x k x k 1 µ k 1 x k y k 1 fx k fy k 1 + δ µ k 1 x k x k 1 µ k 1 x k y k 1, where the inclusion follows from 4.8b. Similarly, we have v x k,x k 1, µ k 1 = δ µ k 1 x k x k 1, µ x k,x k 1, µ k 1 = δ 4 xk x k 1. Using the above relations, 4.8a, the global Lipschitz continuity of f and the boundednesses of µ k } and β k }, there exists c > 0 such that dist 0,0,0, x k,x k 1, µ k 1 c x k x k 1 + x k x k 1 + x k 1 x k. 4.9 Since x k+1 x k 0seeTheorem3.1iv, thenthereexistsanintegerk > 0suchthat x k+1 x k 1 and hence x k+1 x k x k+1 x k whenever k K. This together with 4.9 completes the proof. In the next proposition, we discuss the KL exponent of our potential function defined in 1.5. The arguments involved are similar to those for [4, Theorem 3.6], except that we have one more variable µ in. or self-containedness, we provide the proof here. Proposition 4.1 KL exponent of. Suppose that Assumption 1.1 holds and the objective in 1.1 has the KL property at x dom with an exponent θ [ 1,1. Let δ 0 and µ min > 0 be given constants. Then, for any µ µ min, the potential function defined in 1.5 has the KL property at x, x, µ with an exponent θ. Proof. or δ = 0, the statement holds trivially since u,v,µ u if δ = 0. Thus, we only need to consider δ > 0 in the following. Since has the KL property at x dom with an exponent θ [ 1,1, there exist ã,ε,ν > 0 such that for any x satisfying x dom, x x ε and x < x < x+ν, it holds that dist 1 θ 0, x ãx x Without loss of generality, we assume that ε < minµ min, 1 }. Next, consider any u,v,µ satisfying u dom, u x ε, v x ε, µ µ ε and x, x, µ < u,v,µ < x, x, µ+ν. Note that, for any such u,v,µ, we have u u,v,µ < x, x, µ+ν = x+ν. Thus, 4.10 holds for these u if u x, then 4.10 holds trivially. Moreover, for any such 15

16 u,v,µ, we have dist 1 θ 0,0,0, Hδ u,v,µ 1 1 δµ c 0 u v θ δ θ + 4 u v + inf ξ u ξ + δµ 1 u v 1 δµ c 0 u v θ + inf ξ u ξ + δµ 1 u v θ 1 1 δµ c 0 u v θ + inf b 1 ξ 1 δµ θ b ξ u u v θ 1 θ1 b δµ 1 θ 1 = c 0 ãδµ ã 4 u v 1 θ +b1 inf ξ 1 θ ξ u } 1 θ1 b δµ 1 θ 1 ãδµ c 0 min, b 1 ã 4 u v 1 θ + inf ξ 1 θ ξ u } i 1 θ 1 b δµ 1 θ 1 ãδµ c 0 min, b 1 ã 4 u v 1 θ +ãu x } = c 0 min 1 θ 1 b δµ 1 δµ θ 1, ãb 1 4 u v 1 θ +u x ii } c 0 min 1 θ 1 b δµ min ε 1 δµ θ 1, ãb 1 4 u v +u x = c 1 u,v,µ x, x, µ, where the existence of c 0 > 0 in the first inequality follows from Lemma.1; the third inequality follows from Lemma. applied to ξ+ δµ u v 1 θ with b 1 > 0 and 0 < b < 1; the inequality i follows from 4.10; the inequality ii follows from µ µ ε µ min ε > 0, 1 < 1 θ and the observation that u v u x + v x ε < 1. θ This completes the proof. Now, we are ready to discuss the local convergence rate of with N = 0 under the additional assumption on the KL exponent of. The proof is similar to that of Theorem 4.1 but makes use of our potential function. Theorem 4.. Suppose that Assumption 1.1 holds, δ [0, 1 is a nonnegative constant and the objective in 1.1 is a KL function with an exponent θ [ 1,1. Let xk } and µ k } be the sequences generated by Algorithm 1 with N = 0, and let ζ be given in Theorem 3.1iii. Then, the following statements hold. i If θ = 1, then there exist ρ 0,1 and c 1 > 0 such that x k ζ c 1 ρ k for all large k; ii If θ 1,1, then there exists c > 0 such that x k ζ c k 1 1 θ 1 for all large k. Proof. We first recall from 3.3 with N = 0 that x k+1,x k, µ k x k,x k 1, µ k 1 c xk+1 x k for any k 0. Thus, the sequence x k,x k 1, µ k 1 } k=0 with Theorem 3.1iii implies that is obviously non-increasing. This together lim x k,x k 1, µ k 1 = ζ, and x k,x k 1, µ k 1 ζ for all k 0. k 16

17 Moreover, we have x k ζ = x k,x k 1, µ k 1 ζ δ µ k 1 x k x k 1 4 x k,x k 1, µ k 1 ζ + δ µ k 1 x k x k 1 4 x k,x k 1, µ k 1 ζ + δ µ k 1 Hδ x k 1,x k, µ k x k,x k 1, µ k 1 c x k 1,x k, µ k ζ + δ µ k 1 Hδ x k 1,x k, µ k ζ c Hδ x k 1,x k, µ k ζ, 1+ δµmax c 4.1 where the first equality follows from the definition of in 1.5, the first inequality follows from the triangle inequality, the second inequality follows from 4.11 and the last three inequalities follow because x k,x k 1, µ k 1 } k=0 is non-increasing, x k,x k 1, µ k 1 ζ and µ k µ max for all k 0. We next consider two cases. Case 1. In this case, we suppose that x K0,x K0 1, µ K0 1 = ζ for some K 0 0. Since the sequence x k,x k 1, µ k 1 } k=0 isnon-increasing,wemusthavex k,x k 1, µ k 1 = ζ forallk K 0. This together with 4.1 proves statement i and ii. Case. rom now on, we consider the case where x k,x k 1, µ k 1 > ζ for all k 0. rom Theorem 3.1i, we see that x k } is bounded and hence must have at least one cluster point. Let Γ denote the set of cluster points of x k }. Then, it follows from Proposition 3.ii and lk = k that ζ on Γ. Next, we consider the sequence x k,x k 1, µ k 1 } k=0. In view of the boundedness of µ k} since µ min µ k µ max and Theorem 3.1iv which says that x k+1 x k 0, it is not hard to show that the set of cluster points of x k,x k 1, µ k 1 } k=0 is contained in Υ := x,x,µ : x Γ, µ min µ µ max }, which is a compact subset of dom. Moreover, for any x, x, µ Υ hence x Γ and µ min µ µ max, we have x, x, µ = x = ζ. Since x, x, µ Υ is arbitrary, we conclude that ζ on Υ. On the other hand, since is a KL function with an exponent θ [ 1,1, then it follows from Proposition 4.1 that is a KL function with an exponent θ [ 1,1 on Υ. Using these facts and Proposition.1, there exist ε,ν > 0 such that ϕ u,v,µ ζdist0,0,0, u,v,µ 1, where ϕs = as 1 θ for some a > 0, for all u,v,µ satisfying distu,v,µ, Υ < ε and ζ < u,v,µ < ζ + ν. Since Υ contains all the cluster points of x k,x k 1, µ k 1 } k=0, then we have lim k distxk,x k 1, µ k 1, Υ = 0. This together with lim k x k,x k 1, µ k 1 = ζ implies that there exists an integer K 1 0 such that distx k,x k 1, µ k 1, Υ < ε and ζ < x k,x k 1, µ k 1 < ζ + ν whenever k K 1. Thus, for any k K 1, we have ϕ x k,x k 1, µ k 1 ζ dist 0,0,0, x k,x k 1, µ k or notational simplicity, let k := x k,x k 1, µ k 1 ζ and x k := x k+1 x k. Since the sequence x k,x k 1, µ k 1 } k=0 is non-increasing, then k is non-increasing, k > 0 for k 0 and k 0. 17

18 We also let K 1 := maxk 1,K}, where K is given in Lemma 4.1. Then, for all k K 1, we have 1 ϕ k dist 0,0,0, Hδ x k,x k 1, µ k 1 a k θ H 1 θ δ c x k 1 + x k a c k θ H 1 θ δ x k 1 + x k a c k θ H 1 θ δ = a 1 k θ 4 c Hδ x k,x k 3, µ k 3 x k,x k 1, µ k 1 k k 4.14 where a 1 := a c 1 θ > 0, the first inequality follows from 4.13, the second inequality follows from 4.7, c the third inequality follows from p + q p +q for p,q 0 and the last inequality follows from In the following, we consider two cases. i θ = 1. Since k 0, there exists K 0 such that k 1 for all k K. Then, for all k K := maxk, K 1 }, we see from 4.14 that This implies that, for all k K +1, k = k θ a 1 k k. k 1 γ k 3 where γ := a 1 /1+a 1 < 1. Then, from 4.1, we have where c 1 := x k ζ 1+ δµmax c 1+ δµmax c k 1 k K γ K, 1+ δµmax c γ k K K = c 1 ρ k, γ K + K and ρ := γ < 1. This proves statement ii. ii 1 < θ < 1. We define gs := s θ for s 0,. It is easy to see that g is non-increasing. Then, for any k K 1, we further consider the following two cases. If g k g k, it follows from 4.14 that 1 a 1 k θ k k = g k k k gsds = k 1 θ k 1 θ, k 1 θ which, together with 1 θ < 0, implies that k 1 θ k 1 θ k g k k k If g k g k, it is not hard to see that k 1 θ θ 1 θ k k 1 θ k 1 θ θ 1 θ 1 θ 1 a k 1 θ θ 1 θ 1 1 θ. Then, we have K1 1 θ, 4.16 where the last inequality follows from the facts that k is non-increasing and 1 θ < 0. Thus, combining 4.15 and 4.16, we obtain θ 1 k 1 θ k 1 θ a := min, 18 a 1 θ 1 θ } 1 K1 1 θ.

19 Let π k = k K 1 mod for any k K 1. Then, we have k 1 θ k 1 θ K1+π k 1 θ = k j 1 θ j 1 θ j=k 1+π k k K 1 π k a a 4 k, where the last inequality holds whenever k K 1 +1 K 1 +π k. inally, using this relation and 4.1, we see that, for all k K , x k ζ where c := 1+ δµmax c 1+ δµmax c 1 θ 1 4 a k 1 5 Numerical experiments 1+ δµmax c 1 θ 1 4 a. This proves statement ii. k 1 1 θ 1 = c k 1 1 θ 1, In this section, we conduct some numerical experiments to test our for solving the l 1 regularized logistic regression problem and the l 1- regularized least squares problem. All experiments are run in MATLAB R016a on a 64-bit Laptop with an Intel Core i7-5600u CPU.60 GHz and 8 GB of RAM equipped with Windows 10 OS. 5.1 l 1 regularized logistic regression problem In this subsection, we consider the l 1 regularized logistic regression problem min x R n,x 0 R logx := m log1+exp b i a i x+x 0+λ x 1, 5.1 i=1 where x := x,x 0 R n+1, a i R n, b i 1,1} for i = 1,,m with m < n, and λ > 0 is the regularization parameter. We further assume that b 1,,b m are not all the same. Let C R m n+1 be the matrix whose i-th row is given by a i, 1. Then, we can rewrite 5.1 in the form of 1.1 with fx = m log1+exp b i Cx i, Px = λ x 1. i=1 Moreover, one can check that f is Lipschitz continuous with L f = 0.5 C. To apply our, we also need to show that log is level-bounded, i.e., lev α log := x : log x α} is bounded possibly empty for every α R. Since log is nonnegative, then we only need to consider α 0. or any x lev α log with α 0, due to the nonnegativities of f and P, we have fx = m i=1 log1+exp b ia i x+x 0 α, 5.a Px = λ x 1 α. 5.b Now, we define the index sets I := i : b i = 1} and I c := i : b i 1}. Since b 1,,b m are not all the same, then both I and I c are non-empty. Moreover, let M := max a 1,, a m }. Then, for any i = 1,,m, it holds that αm/λ a i x 1 a i x a i x 1 αm/λ, The level-boundedness of log was also shown using different way in the early version of [39, Section 4.1], which can be found in 19

20 where the first and last inequalities follow from 5.b. Using the above relation, we further have log1+exp a i x x 0 log1+exp αm/λ x 0, i I, log1+expa i x+x 0 log1+exp αm/λ+x 0, i I c. 5.3 Thus, we see that log 1+exp αm/λ x 0 1+exp αm/λ+x 0 = log1+exp αm/λ x 0 +log1+exp αm/λ+x 0 i I log1+exp αm/λ x 0+ i I c log1+exp αm/λ+x 0 i I log1+exp a i x x 0+ i I c log1+expa i x+x 0 = fx α, where the first inequality follows because both I and I c are non-empty, the second inequality follows from 5.3 and the last inequality follows from 5.a. The above relation further implies that e α 1+exp αm/λ x 0 1+exp αm/λ+x 0 = 1+e αm/λ +e αm/λ e x0 +e x Next, we consider the following two cases. α = 0. In this case, we see from 5.4 that e x0 +e x0 1, which cannot hold for any x 0. Thus, lev 0 log is empty. α > 0. In this case, it follows from 5.4 that e x0 +e x0 M := e αm/λ e α e αm/λ 1 = x 0 logm. This together with 5.b shows that x is bounded and hence lev α log is bounded. rom the above, we see that log is level-bounded and hence our is applicable. In our experiments, we will evaluate with δ = 0.9 denoted by and with δ = 0 denoted by. or, we choose βk 0} by 1.4 with β0 k in place of β k. Moreover, for both and, we set c = 10 4, τ =, η = 0.8, N =, β max = 10, µ min = 10 6, µ max = L f+c 1 δ, µ0 0 = 1, and y µ 0 k y k 1, fy k fy k 1 } } } k = min max max y k y k 1, 0.5 µ k 1, µ min, µ max for k 1. We also compare and with PG, ISTA and ISTA with restart reista; see, for example, [6, 3, 39]. or ease of future reference, we recall that ISTA for solving 5.1 is given by β k = t k 1 1/t k, y k = x k +β k x k x k 1, y k 1 x k+1 = Prox 1 L f P t k+1 = t k fy k, L f /, with x 1 = x 0 and t 1 = t 0 = 1. Then, PG is given by 5.5 with β k 0 and reista is given by 5.5 with resetting t k = t k+1 = 1 whenever kmod K = 0 or y k x k+1, x k+1 x k > 0. In our experiments, we choose K = 00 for reista this restart interval has been observed in [6] to perform best. In addition, we initialize all algorithms at the origin and set the maximum running time 3 to T max for all algorithms. The specific values of T max are given in ig.1. 3 The maximum running time used in Section 5.1 and 5. does not include the time for computing the Lipschitz constant L f of f

21 In the following experiments, we choose λ 10, 1, 0.1} and consider m,n,s = 100j, 1000j, 0j for j 3,5,10}. or each triple m,n,s, we follow [39, Section 4.1] to randomly generate a trial as follows. irst, we generate a matrix A R m n with i.i.d. standard Gaussian entries. We then choose a subset S 1,,n} of size s uniformly at random and generate an s-sparse vector ˆx R n, which has i.i.d. standard Gaussian entries on S and zeros on S c. inally, we generate the vector b R m by setting b = signaˆx+ˆǫ1, where ˆǫ is chosen uniformly at random from [0,1] and 1 = 1,,1 R m. To evaluate the performances of different algorithms, we follow[19, 45] to use an evolution of objective values. To introduce this evolution, we first define ek := logx k log min log x 0 min, 5.6 where log x k denotestheobjectivevalueatx k obtainedbyan algorithmand min log denotestheminimum of the terminating objective values obtained among all algorithms in a trial generated as above. or an algorithm, let T k denote the total computational time from the beginning when it obtains x k. One can see that T 0 = 0 and T k is non-decreasing with respect to k. We now define the evolution of objective values obtained by a particular algorithm with respect to time t as follows: log := minek : k i : T i t}}. Note that 0 1 since 0 ek 1 for all k and is non-increasing with respect to t. It can be considered as a normalized measure of the reduction of the function value with respect to time. Then, one can take the average of over several independent trials, and plot the average within time t for a given algorithm. ig. 1 shows the average of 10 independent trials of different algorithms for solving 5.1. rom this figure, we can see that performs best in most cases in the sense that it takes less time to return a lower objective value. 5. l 1- regularized least squares problem In this subsection, we consider the l 1- regularized least squares problem [46]: min 1-x := 1 x R n Ax b +λ x 1 x, 5.7 where A R m n, b R m and λ > 0 is the regularization parameter. Obviously, this problem takes the form of 1.1 with fx = 1 Ax b and Px = λ x 1 x. Moreover, f is Lipschitz continuous with L f = A and Px is a difference-of-convex regularizer. We further assume that A does not have zero columns so that 1- is level-bound; see [35, Example 4.1b] and [46, Lemma 3.1]. Thus, our is applicable. In this part of experiments, we compare three algorithms for solving5.7: with δ = 0.9, with δ = 0, and the proximal difference-of-convex algorithm with extrapolation pdca e 4 [40]. The pdca e for solving 5.7 is given as follows: choose proper extrapolation parameters β k }, let x 1 = x 0, and then at the k-th iteration, Take any ξ k λ x k and compute y k = x k +β k x k x k 1, x k+1 = argmin fy k ξ k, x + L } f x x yk +λ x 1. or and, we use the same parameter settings as in Section 5.1. or pdca e, we follow [40] to choose β k } as in reista see more details in Section 5.1. All algorithms are initialized at the origin 4 The MatlabcodesofpDCA e forsolving 5.7areavailable athttp:// 1

Douglas-Rachford splitting for nonconvex feasibility problems

Douglas-Rachford splitting for nonconvex feasibility problems Guoyin Li Ting Kei Pong Jan 3, 015 Abstract We adapt the Douglas-Rachford DR) splitting method to solve nonconvex feasibility problems by studying