arxiv: v2 [math.oc] 21 Nov 2017

Size: px
Start display at page:

Download "arxiv: v2 [math.oc] 21 Nov 2017"

Transcription

1 Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems arxiv: v [math.oc] 1 Nov 017 Lei Yang Abstract In this paper, we consider a class of possibly nonconvex, nonsmooth and non-lipschitz optimization problems arising in many contemporary applications such as machine learning, variable selection and image processing. To solve this class of problems, we propose a proximal gradient method with extrapolation and line search. This method is developed based on a special potential function and successfully incorporates both extrapolation and non-monotone line search, which are two simple and efficient accelerating techniques for the proximal gradient method. Thanks to the line search, this method allows more flexibilities in choosing the extrapolation parameters and updates them adaptively at each iteration if a certain line search criterion is not satisfied. Moreover, with proper choices of parameters, our reduces to many existing algorithms. We also show that, under some mild conditions, our line search criterion is well defined and any cluster point of the sequence generated by is a stationary point of our problem. In addition, by assuming the Kurdyka- Lojasiewicz exponent of the objective in our problem, we further analyze the local convergence rate of two special cases of, including the widely used non-monotone proximal gradient method as one case. inally, we conduct some numerical experiments for solving the l 1 regularized logistic regression problem and the l 1- regularized least squares problem. Our numerical results illustrate the efficiency of and show the potential advantage of combining two accelerating techniques. Keywords: Proximal gradient method; extrapolation; non-monotone; line search; stationary point. 1 Introduction In this paper, we consider the following composite optimization problem: min x := fx+px, 1.1 x Rn where f : R n R and P : R n R + } satisfy Assumption 1.1. We also assume that the proximal mapping of νp is easy to compute for all ν > 0 see the next section for notation and definitions. Assumption 1.1. i f : R n R is a continuously differentiable possibly nonconvex function with Lipschitz continuous gradient, i.e., there exists some Lipschitz constant L f > 0 such that fx fy L f x y, x, y R n. ii P : R n R + } is a proper closed possibly nonconvex, nonsmooth and non-lipschitz function; it is bounded below and continuous on its domain. iii is level-bounded. Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, P.R. China. lei.yang@connect.polyu.hk. 1

2 Problem 1.1 arises in many contemporary applications such as machine learning [14, 34], variable selection [16,, 3, 36, 47] and image processing [10, 30]. In general, f is a loss or fitting function used for measuring the deviation of a solution from the observations. Two commonly used loss functions are least squares loss function: fx = 1 Ax b, logistic loss function: fx = m i=1 log1+exp b ia i x, where A = [a 1,,a m ] R m n is a data matrix and b R m is an observed vector. One can verify that these two loss functions satisfy Assumption 1.1i. On the other hand, the function P is usually a regularizer used for inducing certain structure in the solution. or example, P can be the indicator function for a certain set such as X = x R n : x 0} and X = x R n : n i=1 x i = 1, x 0}; the former choice restricts the elements of the solution to be nonnegative and the latter choice restricts the solution in a simplex. We can also choose P to be a certain sparsity-inducing regularizer such as λ x p p for 0 < p 1 [, 3, 36], λ n i=1 log1 + α x i for α > 0 [30] and λ x 1 x [46], where λ > 0 is a regularization parameter. Note that all the aforementioned examples of P as well as many other widely used regularizerssee [1, 11] and references therein for more regularizers satisfy Assumption 1.1ii. inally, we would like to point out that Assumption 1.1iii is also satisfied by many choices of f and P in practice; see, for example, 5.1 and 5.7. More examples of 1.1 can be found in [10, 14, 34] and references therein. Due to the importance and the popularity of 1.1, various attempts have been made to solve it efficiently, especially when the problem involves a large number of variables. One popular class of methods for solving 1.1 are first-order methods due to their cheap iteration cost and good convergence properties. Among them, the proximal gradient PG method 1 [17, 5] is arguably the most fundamental one, whose basic iteration is x k+1 Argmin fx k, x + µ } x x xk +Px, 1. where µ > 0 is a constant depending on the Lipschitz constant L f of f. However, PG can be slow in practice; see, for example, [5, 3, 39]. Therefore, a large amount of research has been conducted to accelerate PG for solving 1.1. One simple and widely studied strategy is to perform extrapolation in the spirit of Nesterov s extrapolation techniques [7, 8], whose basic idea is to make use of historical information at each iteration. A typical scheme of the proximal gradient method with extrapolation PGe for solving 1.1 is y k = x k +β k x k x k 1, x k+1 Argmin x fy k, x + µ x yk +Px where β k is the extrapolation parameter satisfying certain conditions and µ > 0 is a constant depending on L f. One representative algorithm that takes the form of 1.3 with proper choices of β k } is the fast iterative shrinkage-thresholding algorithm ISTA [5], which is also known as the accelerated proximal gradient method APG independently proposed and studied by Nesterov [9]. This algorithm ISTA or APG is designed for solving 1.1 with f and P being convex and exhibits a faster convergence rate O1/k in terms of objective values see [5, 9] for more details. This motivates the study of PGe and its variants for solving 1.1 under different scenarios; see, for example, [6, 18, 31, 3, 37, 38, 39, 40, 4, 43, 44]. It is worth noting that, when f and P are convex, many existing PGe and its variants [5, 6, 9, 3, 37, 38] choose the extrapolation parameters β k } explicitly or implicitly based on the following updating scheme β k = t k 1 1/t k, t k+1 = t k, }, 1.3 with t 1 = t 0 = 1, PG is also known as the forward-backward splitting algorithm [13] or the iterative shrinkage-thresholding algorithm [5].

3 which originated from Nesterov s work [7, 8] and was shown to be optimal [8]. However, for the nonconvex case, the optimal choices of β k } are still not clear. Although the convergence of PGe and its variants can be guaranteed in theory for some classes of nonconvex problems under certain conditions on β k } see, for example, [18, 31, 39, 40, 4, 43, 44], the choices of β k } are relatively restrictive and hence may not work well for acceleration. Another efficient strategy for accelerating PG is to apply a non-monotone line search to adaptively find a proper µ in 1. at each iteration. The non-monotone line search technique dates back to the non-monotone Newton s method proposed by Grippo et al. [1] and has been applied to many algorithms with good empirical performances; see, for example, [7, 15, 45, 48]. Based on this technique, Wright et al. [41] recently proposed an efficient method called SpaRSA to solve 1.1, whose iteration is roughly given as follows: Choose τ > 1, c > 0 and an integer N 0. Then, at the k-th iteration, choose µ 0 k > 0 and find the smallest nonnegative integer j k such that } u Argmin x fx k, x + τj k µ 0 k x x k +Px u max [k N] xi c + i k u xk. This method is essentially the non-monotone proximal gradient method, namely, the proximal gradient method with a non-monotone line search. Later, was extended for solving 1.1 under more general conditions and has been shown to have promising numerical performances in many applications see, for example, [1, 0, 6]. In view of the above, it is natural to raise a question: Can we derive an efficient method for solving 1.1, which takes advantage of both extrapolation and non-monotone line search? In this paper, we propose such a method for solving 1.1 that successfully incorporates both extrapolation and non-monotone line search and allows more flexibilities in choosing the extrapolation parameters β k }. We call our method the proximal gradient method with extrapolation and line search. This method is developed based on the following potential function specifically constructed for in 1.1: u,v,µ := u+ δµ 4 u v, u, v R n, µ > 0, 1.5 where δ [0, 1 is a given nonnegative constant. Clearly, u,v,µ u if δ = 0. We will see in Section 3 that this potential function is used to establish a new non-monotone line search criterion 3.3 when the extrapolation technique is applied. This allows more choices of β k at each iteration, and will adaptively update µ k and β k at the same time if the line search criterion is not satisfied see Algorithm 1 for more details. The convergence analysis of is also presented in Section 3. Specifically, under Assumption 1.1, we show that our line search criterion 3.3 is well defined and any cluster point of the sequence generated by is a stationary point of 1.1. Moreover, since our reduces to PG, PGe or with proper choices of parameters see Remark 3.1, then we actually obtain a unified convergence analysis for PG, PGe and as a byproduct. In addition, in Section 4, we further study thelocalconvergencerateintermsofobjectivevaluesfortwospecialcasesofincluding asone case under an additional assumption on the Kurdyka- Lojasiewicz exponent of the objective in 1.1. To the best of our knowledge, this is the first local convergence rate analysis of for solving 1.1. inally, we conduct some numerical experiments in Section 5 to evaluate the performance of our method for solving the l 1 regularized logistic regression problem and the l 1- regularized least squares problem. Our computational results illustrate the efficiency of our method and show the potential advantage of combining extrapolation and non-monotone line search. The rest of this paper is organized as follows. In Section, we present notation and preliminaries used in this paper. In Section 3, we describe for solving 1.1 and study its global subsequential convergence. The local convergence rate of two special cases of is analyzed in Section 4 and some numerical results are reported in Section 5. inally, some concluding remarks are given in Section 6., 3

4 Notation and preliminaries In this paper, we present scalars, vectors and matrices in lower case letters, bold lower case letters and upper case letters, respectively. We also use R, R n, R n + and Rm n to denote the set of real numbers, n-dimensional real vectors, n-dimensional real vectors with nonnegative entries and m n real matrices, respectively. or a vector x R n, x i denotes its i-th entry, x denotes its Euclidean norm, x 1 denotes its l 1 norm defined by x 1 := n i=1 x i, x p denotes its l p -quasi-norm 0 < p < 1 defined by x p := n i=1 x i p 1 p and x denotes its l norm given by the largest entry in magnitude. or a matrix A R m n, its spectral norm is denoted by A, which is the largest singular value of A. or an extended-real-valued function h : R n [, ], we say that it is proper if hx > for all x R n and its domain domh := x R n : hx < } is nonempty. A proper function h is said to be closed if it is lower semicontinuous. We also use the notation y h x to denote y x and hy hx. The basic subdifferential see [33, Definition 8.3] of h at x domh used in this paper is hx := d R n : x k h x, d k hy hx k d k,y x k } d with liminf y x k,y x k y x k 0 k. It can be observed from the above definition that } d R n : x k h x, d k d with d k hx k for each k hx..1 When h is continuously differentiable or convex, the above subdifferential coincides with the classical concept of derivative or convex subdifferential of h; see, for example, [33, Exercise 8.8] and [33, Proposition 8.1]. In addition, if h has several groups of variables, we use xi h resp., xi h to denote the partial subdifferential resp., gradient of h with respect to the group of variables x i. or a proper closed function h : R n, ] and ν > 0, the proximal mapping of νh at y R n is defined by Prox νh y := Argmin hx+ 1 } x R n ν x y. Note that this operator is well defined for any y R n if h is bounded below in R n. or a closed set X R n, its indicator function δ X is defined by 0, if x X, δ X x = +, otherwise. We also use distx,x to denote the distance from x to X, i.e., distx,x := inf y X x y. or any local minimizer x of 1.1, it is known from [33, Theorem 10.1] and [33, Exercise 8.8c] that the following first-order necessary condition holds: 0 f x+ P x,. where f denotes the gradient of f. In this paper, we say that x is a stationary point of 1.1 if x satisfies. in place of x. We next recall the Kurdyka- LojasiewiczKL propertysee [, 3, 4, 8, 9] for more details, which plays an important role in our analysis for the local convergence rate in Section 4. or notational simplicity, let Ξ ν ν > 0 denote a class of concave functions ϕ : [0,ν R + satisfying: i ϕ0 = 0; ii ϕ is continuously differentiable on 0,ν and continuous at 0; iii ϕ t > 0 for all t 0,ν. Then, the KL property can be described as follows. Definition.1 KL property and KL function. Let h : R n R + } be a proper closed function. i or x dom h := x R n : hx }, if there exist a ν 0,+ ], a neighborhood V of x and a function ϕ Ξ ν such that for all x V x R n : h x < hx < h x+ν}, it holds that ϕ hx h xdist0, hx 1, then h is said to have the Kurdyka- Lojasiewicz KL property at x. 4

5 ii If h satisfies the KL property at each point of dom h, then h is called a KL function. A large number of functions such as proper closed semialgebraic functions satisfy the KL property [3, 4]. Based on the above definition, we then introduce the KL exponent [3, 4]. Definition. KL exponent. Suppose that h : R n R + } is a proper closed function satisfying the KL property at x dom h with ϕt = a t 1 θ for some a > 0 and θ [0,1, i.e., there exist a,ε,ν > 0 such that dist0, hx ahx h x θ whenever x dom h, x x ε and h x < hx < h x+ν. Then, h is said to have the KL property at x with an exponent θ. If h is a KL function and has the same exponent θ at any x dom h, then h is said to be a KL function with an exponent θ. We also recall the following uniformized KL property, which was established in [9, Lemma 6]. Proposition.1 Uniformized KL property. Suppose that h : R n R + } is a proper closed function and Γ is a compact set. If h ζ on Γ for some constant ζ and satisfies the KL property at each point of Γ, then there exist ε > 0, ν > 0 and ϕ Ξ ν such that ϕ hx ζdist0, hx 1 for all x x R n : distx, Γ < ε} x R n : ζ < hx < ζ +ν}. inally, we recall two useful lemmas, which can be found in [4]. Lemma.1 [4, Lemma.]. Let α > 0. Then, for any w = w 1,,w n R n +, there exist 0 < c 1 c such that c 1 w w α 1 + +wα n 1 α c w. Lemma. [4, Lemma 3.1]. Let 1 < α. Then, for any u,v R n, there exist b 1 > 0 and 0 < b < 1 such that u+v α b 1 u α b v α. 3 Proximal gradient method with extrapolation and line search and its convergence analysis In this section, we present a proximal gradient method with extrapolation and line search for solving 1.1. This method is developed based on a specially constructed potential function, which is defined in 1.5. The complete for solving 1.1 is presented as Algorithm 1. Remark 3.1 Comments on special cases of. In Algorithm 1, if δ = 0, then we have β 0 k = 0 and hence y k = x k for all k 0. In this case, our line search criterion 3.3 reduces to u max [k N] xi c + i k u xk. Thus, our reduces to for solving 1.1 see, for example, [1, 0, 41]. On the other hand, for any 0 < δ < 1, if µ 0 k = µ max L f +c 1 δ and β 0 k δµ max L f µ max 4µ max +L f for all k 0, then it follows from Lemma 3.1 that the line search criterion 3.3 holds trivially for all k 0. Thus, in this case, we do not need to perform the line search loop and hence our reduces to a PGe for solving 1.1. inally, if δ = 0 and µ 0 k = µ max L f +c, then our obviously reduces to PG for solving

6 Algorithm 1 for solving 1.1 Input: x 0 dom, τ > 1, 0 δ < 1, 0 < η < 1, c > 0, µ max L f+c 1 δ µ min > 0, β max 0 and an integer N 0. Set x 1 = x 0, µ 1 = 1 and k = 0. while a termination criterion is not met, do Step 1. Choose µ 0 k [µ min, µ max ] and β 0 k [0, δβ max] arbitrarily. Set µ k = µ 0 k and β k = β 0 k. 1a Compute 1b Solve the subproblem 1c If u Argmin x y k = x k +β k x k x k fy k, x y k + µ } k x yk +Px. 3. u,x k,µ k max x i,x i 1, µ i 1 c [k N] + i k u xk 3.3 is satisfied, then go to Step. 1d Set µ k minτµ k, µ max }, β k ηβ k and go to step 1a. Step. Set x k+1 u, µ k µ k, β k β k, k k +1 and go to Step 1. end while Output: x k Remark 3. Comments on the extrapolation parameters in. Unlike most of the existing PGe and its variants mentioned in Section 1 that should choose the extrapolation parameters under certain schemes or conditions, our can choose any β 0 k [0, δβ max] as an initial guess at each iteration and then updates it and µ k adaptively at the same time if the line search criterion is not satisfied. This strategy actually allows more flexibilities in choosing the extrapolation parameters and works well from our computational results in Section 5. In the following, we will study the convergence properties of. Before proceeding, we present the first-order optimality condition for the subproblem 3. in 1b of Algorithm 1 as follows: 0 fy k +µ k u y k + Pu. 3.4 We now start our convergence analysis by proving the following supporting lemma, which characterizes the descent property of our potential function. Lemma 3.1 Sufficient descent of. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } and µ k } be the sequences generated by Algorithm 1, and let u be the candidate generated by step 1b at the k-th iteration. or any k 0, if δµ k L f µ k 1 µ k > L f and β k 4µ k +L f, then we have u,x k,µ k x k,x k 1, µ k 1 1 δµ k L f u x k, where is the potential function defined in

7 Proof. irst, from 3., we have fy k, u y k + µ k u yk +Pu fy k, x k y k + µ k xk y k +Px k, which implies that Pu Px k + fy k, x k u + µ k xk y k µ k u yk = Px k + fy k, x k u + µ k xk y k µ k u xk +x k y k = Px k + fy k, x k u µ k u xk +µ k x k u, x k y k On the other hand, using the fact that f is Lipschitz continuous with a Lipschitz constant L f Assumption 1.1i, we see from [8, Lemma 1..3] that Summing 3.6 and 3.7, we obtain that 3.6 fu fx k + fx k, u x k + L f u xk. 3.7 fu+pu fx k Px k µ k L f u x k +µ k x k u, x k y k + fx k fy k, u x k µ k L f u x k + µ k x k y k + fx k fy k u x k µ k L f u x k +µ k +L f x k y k u x k µ k L f u x k + µ k L f u x k + µ k +L f x k y k 4 µ k L f = µ k L f u x k + µ k +L f β 4 µ k L k x k x k 1 f µ k L f 4 u x k + δ µ k 1 x k x k 1 4 = 1 δµ k L f u x k δµ k 4 4 u xk + δ µ k 1 x k x k 1, 4 where the second inequality follows from Cauchy Schwarz inequality; the third inequality follows from Lipschitz continuityof f; the fourthinequalityfollowsfromtherelationab a 4s +sb with a = u x k, b = µ k + L f x k y k 1 and s = µ k L f > 0; the first equality follows from 3.1; the last inequality δµk L f µ k 1 follows from β k 4µ k +L f. Then, rearranging terms in above relation and recalling the definition of in 1.5, we obtain 3.5. Remark 3.3 Comments on Lemma 3.1. Note that the descent property in Lemma 3.1 is established for without requiring f or P to be convex or difference-of-convex function. In fact, with additional assumptions e.g., convexity on f or P, one can establish a similar descent property for some other constructed potential function H; see, for example, [39, Lemma 3.1]. Then, one can perform the line search criterion 3.3 with H in place of in Algorithm 1 and the convergence analysis can follow in a similar way as presented in this paper. Thus, one can choose suitable potential function in to fit different scenarios. In this paper, we only focus on under Assumption 1.1. It can be observed from Lemma 3.1 that the sufficient descent of can be guaranteed as long as µ k is sufficiently large and β k is sufficiently small for each k 0. Thus, based on this lemma, we can show in the following proposition that the line search criterion 3.3 in Algorithm 1 is well defined. 7

8 Proposition 3.1 Well-definedness of the line search criterion. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } and µ k } be the sequences generated by Algorithm 1. Then, for each k 0, the line search criterion 3.3 is satisfied after finitely many inner iterations. Proof. We prove this proposition by contradiction. Assume that there exists a k 0 such that the line search criterion 3.3 cannot be satisfied after finitely many inner iterations. Since µ k µ max due to 1d in Algorithm 1, then µ k = µ max must be satisfied after finitely many inner iterations. Let n k denote the number of inner iterations when µ k = µ max is satisfied for the first time. If µ 0 k = µ max, then n k = 1; otherwise, we have µ min τ n k 1 µ 0 kτ n k 1 < µ max, which implies that logµmax logµ min n k logτ +1. Now, we let β k := δµmax L f µ k 1 4µ max+l f for simplicity. Then, from 1d in Algorithm 1, we see that β k is decreasing in the inner loop and hence β k β k must be satisfied after finitely many inner iterations. Similarly, let ˆn k denote the number of inner iterations when β k β k is satisfied for the first time. Note that if δ = 0, we have β 0 k = β k = 0 and hence ˆn k = 1. or 0 < δ < 1, if β 0 k β k, then ˆn k n k ; otherwise, we have β k < β 0 kηˆn k 1 δβ max ηˆn k 1, which implies that logδβmax log β k ˆn k logη +1. Thus, after at most maxn k, ˆn k }+1 inner iterations, we must have µ k µ max and β k β k. Since c > 0 and 0 δ < 1, one can see that µ max L f+c 1 δ > L f and 1 δµ max L f c. Then, using these facts and Lemma 3.1, we have u,x k,µ max x k,x k 1, µ k 1 1 δµ max L f u x k c 4 u xk, which, together with x k,x k 1, µ k 1 max x i,x i 1, µ i 1, [k N] + i k implies that 3.3 must be satisfied after at most maxn k, ˆn k } + 1 inner iterations. This leads to a contradiction. We are now ready to show our first convergence result in the following theorem that characterizes a cluster point of the sequence generated by. Our proof is similar to that of [41, Lemma 4]. However, the arguments involved relies on our potential function 1.5 that contains multiple blocks of variables. This makes our proof more intricate. or notational simplicity, from now on, let lk Argmax x i,x i 1, µ i 1 : i = [k N] +,,k}. 3.8 i Theorem 3.1. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } and µ k } be the sequences generated by Algorithm 1. Then, i boundedness of sequence the sequence x k } is bounded; 8

9 ii non-increase of subsequence of the sequence x lk,x lk 1, µ lk 1 } is non-increasing; iii existence of limit ζ := lim k x lk,x lk 1, µ lk 1 exists; iv diminishing successive changes lim k xk+1 x k = 0; v global subsequential convergence any cluster point x of x k } is a stationary point of 1.1. Proof. Statement i. We first prove by induction that for all k 1. Indeed, for k = 1, it follows from Proposition 3.1 that x k,x k 1, µ k 1 x x 1,x 0, µ 0 x 0,x 1, µ 1 c x1 x 0 0 is satisfied after finitely many inner iterations. This, together with x 1 = x 0, implies that x 1,x 0, µ 0 x 0,x 1, µ 1 = x 0. Hence, 3.9 holds for k = 1. We now suppose that 3.9 holds for all k K for some integer K 1. Next, we show that 3.9 also holds for k = K +1. Indeed, for k = K +1, we have x K+1,x K, µ K x 0 x K+1,x K, µ K max x i,x i 1, µ i 1 [K N] + i K c xk+1 x K 0, where the first inequality follows from the induction hypothesis and the second inequality follows from 3.3. Hence, 3.9 holds for k = K +1. This completes the induction. Then, from 3.9, we have that for any k 1, x 0 x k,x k 1, µ k 1 x k, which, together with Assumption 1.1iii, implies that x k } is bounded. This proves statement i. Statement ii. Recall the definition of lk in 3.8 and let x k := x k+1 x k for simplicity. Then, from the line search criterion 3.3, we have Observe that x k+1,x k, µ k x lk,x lk 1, µ lk 1 c x k x lk+1,x lk+1 1, µ lk+1 1 = max x i,x i 1, µ i 1 [k+1 N] + i k+1 = max x k+1,x k, µ k, max x lk,x lk 1, µ lk 1, max x lk,x lk 1, µ lk 1, max x i,x i 1, µ i 1 [k+1 N] + i k } max x i,x i 1, µ i 1 [k+1 N] + i k } max x i,x i 1, µ i 1 [k N] + i k } = max x lk,x lk 1, µ lk 1, x lk,x lk 1, µ lk 1 = x lk,x lk 1, µ lk 1, where the first inequality follows from 3.10 and the second last equality follows from 3.8. This proves statement ii. } 9

10 Statement iii. It follows from Assumption 1.1iii and the definition of in 1.5 that x lk, x lk 1, µ lk 1 is bounded below. This together with statement ii proves that there exists a number ζ such that lim x lk,x lk 1, µ lk 1 = ζ k Statement iv. We next prove statement iv. To this end, we first show by induction that for all j 1, it holds that lim xlk j = 0, 3.1a k lim k xlk j = ζ. 3.1b We start by proving 3.1a and 3.1b for j = 1. Applying 3.10 with k replaced by lk 1, we obtain x lk,x lk 1, µ lk 1 x llk 1,x llk 1 1, µ llk 1 1 c x lk 1, which, together with 3.11, implies that Then, from 3.11 and 3.13, we have lim xlk 1 = k ζ = lim k x lk,x lk 1, µ lk 1 = lim k xlk 1 + x lk 1+ δ µ lk 1 4 x lk 1 = lim k xlk 1, where the second equality follows from the definition of in 1.5 and the last equality follows because x k } is bounded see statement i, µ k } is bounded since µ min µ k µ max and is uniformly continuous on any compact subset of dom under Assumption 1.1i and ii. Thus, 3.1a and 3.1b hold for j = 1. We next suppose that 3.1a and 3.1b hold for j = J for some J 1. It remains to show that they also hold for j = J +1. Indeed, from 3.10 with k replaced by lk J 1 here, without loss of generality, we assume that k is large enough such that lk J 1 is nonnegative, we have x lk J,x lk J 1, µ lk J 1 x llk J 1,x llk J 1 1, µ llk J 1 1 c x lk J 1, which, together with the definition of in 1.5, implies that c + δ µ lk J 1 4 x lk J 1 x llk J 1,x llk J 1 1, µ llk J 1 1 x lk J. This together with 3.11 and the induction hypothesis implies that lim xlk J+1 = 0. k Thus, 3.1a holds for j = J +1. rom this, we further have lim k xlk J+1 = lim k xlk J x lk J+1 = lim k xlk J = ζ, where the second equality follows because x k } is bounded see statement i, µ k } is bounded since µ min µ k µ max and is uniformly continuous on any compact subset of dom under Assumption 1.1i and ii. Hence, 3.1b also holds for j = J +1. This completes the induction. 10

11 We are now ready to prove the main result in this statement. Indeed, recalling the definition of lk in 3.8, we see that k N lk k without loss of generality, we assume that k is large enough such that k N. Thus, for any k, we must have k N 1 = lk j k for some j k [1, N +1]. Then, we have This together with 3.1a implies that x k N 1 = x lk j k max 1 j N+1 x lk j. lim x k k = lim xk N 1 = 0, k This proves statement iv. Statement v. irst, since x k } is bounded see statement i, there exists at least one cluster point. Suppose that x is a cluster point of x k } and let x ki } be a convergent subsequence such that lim x ki = x. Then, we see from statement iv and the boundedness of β k for any k since i 0 β k βk 0 δβ max that lim = lim x i yki ki +β ki x ki x ki 1 = lim x ki = x i i Thus, passing to the limit along x ki, y ki } in 3.4 with x ki+1 in place of u and µ ki in place of µ k, and invoking Assumption 1.1ii, statement iv, the boundedness of µ k } since µ min µ k µ max,.1 and 3.14, we obtain 0 fx + Px, which implies that x is a stationary point of 1.1. This proves statement v. Based on Theorem 3.1, we can further characterize the sequence of objective values along x k } in the following proposition. This proposition will be useful in the analysis of the local convergence rate in the next section. Proposition 3.. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k } be a sequence generated by Algorithm 1 and lk be the index defined in 3.8 for each k. Then, i lim k xlk = ζ, where ζ is given in Theorem 3.1iii; ii ζ on Ω, where Ω is the set of cluster points of the subsequence x lk }. Proof. Statement i follows immediately from Theorem 3.1iii, x k+1 x k 0 see Theorem 3.1iv, the boundedness of µ k } since µ min µ k µ max and the definition of in 1.5. We now prove statement ii. irst, since x lk } is a subsequence of x k }, it follows from Theorem 3.1i and v that Ω X, where X is the set of all stationary points of 1.1. Moreover, we recall from Assumption 1.1i and ii that f is continuously differentiable and P is continuous on its domain. These facts prove statement ii. 4 Local convergence rate of two special cases of In this section, under the additional assumption that the objective in 1.1 is a KL function with an exponent θ, we further study the local convergence rate of two special cases of in terms of objective values. The first case is for with δ = 0, namely, and the other is for with N = 0, where N is the line search gap used in

12 4.1 Local convergence rate of In this subsection, we discuss the local convergence rate of with δ = 0, namely, see Remark 3.1. In this case, we have y k x k and x k x k,x k 1, µ k 1 for all k 0. The main results are presented in the following theorem. This kind of results on local convergence rate have been studied for many existing algorithms; see, for example, [, 40]. However, the analysis there heavily relies on the monotonicity of the objective or certain potential function along the sequence generated and hence cannot be applied for. More intricate analysis is needed for handling the non-monotone line search. Theorem 4.1. Suppose that Assumption 1.1 holds and the objective in 1.1 is a KL function with an exponent θ. Let x k } and µ k } be the sequences generated by Algorithm 1 with δ = 0, and let ζ be given in Theorem 3.1iii. Then, the following statements hold. i If θ = 0, then x k ζ for all large k; ii If θ 0, 1 ], then there exist ρ 0,1 and c1 > 0 such that x k ζ c 1 ρ k for all large k; iii If θ 1,1, then there exists c > 0 such that x k ζ c k 1 θ 1 for all large k. Proof. We start by defining an index sequence ξt} t=0 as follows: ξt = ln +1t, t = 0, 1,,, where lk is defined in 3.8. It is obviously that ξt} t=0 is a subsequence of lk} k=0. Moreover, since k N lk k for any k N, we have ξt = ln +1t N +1t N = N +1t 1+1 ln +1t 1+1 > ξt 1 for any t 1. Therefore, ξt} t=0 is increasing. We now recall from Theorem 3.1ii and Assumption 1.1iii that x lk } k=0 is non-increasing and bounded below. Since xξt } t=0 is a subsequence of x lk } k=0, then it follows that xξt } t=0 is non-increasing and bounded below. Moreover, from Proposition 3., we have lim t xξt = lim k xlk = ζ. In addition, it follows from 3.10 with k replaced by ξt 1 that x ξt x lξt 1 c x ξt 1 x ln+1t 1 c x ξt 1 = x ξt 1 c x ξt 1, 4.1 where the second inequality follows because x lk } k=0 is non-increasing and ξt 1 = ln +1t 1 N +1t 1. We next consider two cases. Case 1. In this case, we suppose that x ξt = ζ for some T 0. Since the sequence x ξt } t=0 is non-increasing, we must have x ξt = ζ for all t T. Then, for all k [N +1t N, N +1t] with any t T, we have x k x ln+1t = x ξt = ζ. Thus, the conclusions of three statements hold. Case. rom now on, we consider the case where x ξt > ζ for all t 0. rom Theorem 3.1i, we see that x ξt } t=0 is bounded and hence must have at least one cluster point. Let Γ denote the set of cluster points of x ξt } t=0. Since xξt } t=0 is a subsequence of xlk } k=0, we have Γ Ω, where Ω is the set of cluster points of x lk } k=0. Then, it follows from Proposition 3. that ζ on Γ. This fact together with our assumption that is a KL function with an exponent θ and Proposition.1, there exist ε,ν > 0 such that ϕ x ζdist0, x 1, where ϕs = as 1 θ for some a > 0, 1

13 for all x satisfying distx, Γ < ε and ζ < x < ζ+ν. On the other hand, since lim distx ξt, Γ = 0 t by the definition of Γ and x ξt ζ, then for such ε and ν, there exists an integer T 0 0 such that distx ξt, Γ < ε and ζ < x ξt < ζ +ν for all t T 0. Thus, for t T 0, we have ϕ x ξt ζ dist 0, x ξt 1, where ϕs = as 1 θ for some a > Next, looking at the subdifferential x k, we have x k = fx k + Px k = fx k 1 + Px k + µ k 1 x k x k 1 + fx k fx k 1 µ k 1 x k x k 1 fx k fx k 1 µ k 1 x k x k 1, where the inclusion follows from the optimality condition for 3. at the k 1-st iteration, i.e., 0 fx k 1 + Px k + µ k 1 x k x k 1. Using this relation together with the global Lipschitz continuity of f and the boundedness of µ k }, there exists d 1 > 0 such that Now, for notational simplicity, let ξt that ξt is non-increasing, ξt 1 ϕ ξt a 1 θ ad 1 1 θ = d dist 0, x k d 1 x k x k := x ξt ζ. Since x ξt } t=0 is non-increasing, we see 0. Then, for all t T 0, we have 0, x ξt θ d1 x ξt x ξt 1 > 0 for t 0 and ξt ξt dist ξt x ξt 1 x c ξt θ ξt θ ξt 1 ξt, where d := ad 1 1 θ > 0, the first inequality follows from 4., the second inequality follows from 4.3 c and the last inequality follows from 4.1. Next, we consider the following three cases. ξt ξt i θ = 0. In this case, we see from 4.4 that ξt 1 0. Thus, Case cannot happen. ξt 1 d 4.4 for all t T 0, which contradicts ii 0 < θ 1. In this case, we have 0 < θ 1. Since ξt 0, there exists T 1 0 such that 1 for all t T := maxt 0, T 1 }. Then, for all t T, we see from 4.4 that which implies that ξt ξt ξt γ ξt 1 θ d ξt 1 ξt, γ t T+1 ξ T 1, where γ := d < 1. Then, for all k [N +1t N, N +1t] with any t T, we have 1+d x k ζ ξt γ t T+1 ξ T 1 γ k N+1 T+1 ξ T 1 = c 1 ρ k, where c 1 := γ T+1 ξ T 1, ρ := γ 1 N+1 < 1 and the last inequality follows from t k N+1. This proves statement ii. iii 1 < θ < 1. We define gs := s θ for s 0,. It is easy to see that g is non-increasing. Then, for any t T 0, we further consider the following two cases. 13

14 If g ξt g ξt 1, it follows from 4.4 that 1 d ξt θ ξt 1 ξt = g ξt ξt 1 ξt g ξt 1 ξt 1 ξt gsds = ξt 1 1 θ ξt 1 θ which, together with 1 θ < 0, implies that ξt 1 θ ξt 1 1 θ, ξt 1 ξt 1 θ θ 1 d. 4.5 If g ξt g ξt 1, one can check that ξt 1 θ θ 1 ξt 1 θ ξt 1 1 θ θ 1 θ 1 ξt 1 1 θ θ ξt 1 θ 1 θ 1 θ. Then, we have 1 ξt0 1 1 θ, 4.6 where the last inequality follows from the facts that ξt is non-increasing and 1 θ < 0. Thus, combining 4.5 and 4.6, we obtain θ 1 1 θ d 3 := min, Then, we have ξt 1 θ ξt 1 ξt 1 θ ξt 1 θ ξt0 1 θ = t j=t 0+1 d θ 1 θ } 1 ξt0 1 1 θ. ξj 1 θ ξj 1 1 θ t T 0 d 3 d 3 t, where the last inequality holds for t T 0. This implies that ξt d 4 t 1 θ 1, where d4 := d 3 Then, for all k [N +1t N, N +1t] with any t T 0, we have x k ζ ξt d 4 t 1 θ 1 c k 1 θ 1, where c := N θ 1 d4 and the last inequality follows from t k 4. Local convergence rate of with N = 0 N+1 1 θ 1.. This proves statement iii. In this subsection, we discuss the local convergence rate of with N = 0, i.e., we set the line search gap N to 0 in 3.3. Before proceeding, we first give the following supporting lemma. Lemma 4.1. Suppose that Assumption 1.1 holds and δ [0, 1 is a nonnegative constant. Let x k }, µ k } and β k } be the sequences generated by Algorithm 1. Then, there exist c > 0 and K > 0 such that for any k K. dist 0,0,0, x k,x k 1, µ k 1 c x k x k 1 + x k 1 x k 4.7 Proof. irst, from 3.1 and the first-order optimality condition for 3., we have y k 1 = x k 1 + β k 1 x k 1 x k, 4.8a 0 fy k 1 + µ k 1 x k y k 1 + Px k. 4.8b 14

15 Next, we consider the subdifferential of u,v,µ at the point x k,x k 1, µ k 1 for k 0. Looking at the partial subdifferential with respect to u, we have u x k,x k 1, µ k 1 = x k + δ µ k 1 x k x k 1 = fx k + Px k + δ µ k 1 x k x k 1 = fy k 1 + Px k + µ k 1 x k y k 1 + fx k fy k 1 + δ µ k 1 x k x k 1 µ k 1 x k y k 1 fx k fy k 1 + δ µ k 1 x k x k 1 µ k 1 x k y k 1, where the inclusion follows from 4.8b. Similarly, we have v x k,x k 1, µ k 1 = δ µ k 1 x k x k 1, µ x k,x k 1, µ k 1 = δ 4 xk x k 1. Using the above relations, 4.8a, the global Lipschitz continuity of f and the boundednesses of µ k } and β k }, there exists c > 0 such that dist 0,0,0, x k,x k 1, µ k 1 c x k x k 1 + x k x k 1 + x k 1 x k. 4.9 Since x k+1 x k 0seeTheorem3.1iv, thenthereexistsanintegerk > 0suchthat x k+1 x k 1 and hence x k+1 x k x k+1 x k whenever k K. This together with 4.9 completes the proof. In the next proposition, we discuss the KL exponent of our potential function defined in 1.5. The arguments involved are similar to those for [4, Theorem 3.6], except that we have one more variable µ in. or self-containedness, we provide the proof here. Proposition 4.1 KL exponent of. Suppose that Assumption 1.1 holds and the objective in 1.1 has the KL property at x dom with an exponent θ [ 1,1. Let δ 0 and µ min > 0 be given constants. Then, for any µ µ min, the potential function defined in 1.5 has the KL property at x, x, µ with an exponent θ. Proof. or δ = 0, the statement holds trivially since u,v,µ u if δ = 0. Thus, we only need to consider δ > 0 in the following. Since has the KL property at x dom with an exponent θ [ 1,1, there exist ã,ε,ν > 0 such that for any x satisfying x dom, x x ε and x < x < x+ν, it holds that dist 1 θ 0, x ãx x Without loss of generality, we assume that ε < minµ min, 1 }. Next, consider any u,v,µ satisfying u dom, u x ε, v x ε, µ µ ε and x, x, µ < u,v,µ < x, x, µ+ν. Note that, for any such u,v,µ, we have u u,v,µ < x, x, µ+ν = x+ν. Thus, 4.10 holds for these u if u x, then 4.10 holds trivially. Moreover, for any such 15

16 u,v,µ, we have dist 1 θ 0,0,0, Hδ u,v,µ 1 1 δµ c 0 u v θ δ θ + 4 u v + inf ξ u ξ + δµ 1 u v 1 δµ c 0 u v θ + inf ξ u ξ + δµ 1 u v θ 1 1 δµ c 0 u v θ + inf b 1 ξ 1 δµ θ b ξ u u v θ 1 θ1 b δµ 1 θ 1 = c 0 ãδµ ã 4 u v 1 θ +b1 inf ξ 1 θ ξ u } 1 θ1 b δµ 1 θ 1 ãδµ c 0 min, b 1 ã 4 u v 1 θ + inf ξ 1 θ ξ u } i 1 θ 1 b δµ 1 θ 1 ãδµ c 0 min, b 1 ã 4 u v 1 θ +ãu x } = c 0 min 1 θ 1 b δµ 1 δµ θ 1, ãb 1 4 u v 1 θ +u x ii } c 0 min 1 θ 1 b δµ min ε 1 δµ θ 1, ãb 1 4 u v +u x = c 1 u,v,µ x, x, µ, where the existence of c 0 > 0 in the first inequality follows from Lemma.1; the third inequality follows from Lemma. applied to ξ+ δµ u v 1 θ with b 1 > 0 and 0 < b < 1; the inequality i follows from 4.10; the inequality ii follows from µ µ ε µ min ε > 0, 1 < 1 θ and the observation that u v u x + v x ε < 1. θ This completes the proof. Now, we are ready to discuss the local convergence rate of with N = 0 under the additional assumption on the KL exponent of. The proof is similar to that of Theorem 4.1 but makes use of our potential function. Theorem 4.. Suppose that Assumption 1.1 holds, δ [0, 1 is a nonnegative constant and the objective in 1.1 is a KL function with an exponent θ [ 1,1. Let xk } and µ k } be the sequences generated by Algorithm 1 with N = 0, and let ζ be given in Theorem 3.1iii. Then, the following statements hold. i If θ = 1, then there exist ρ 0,1 and c 1 > 0 such that x k ζ c 1 ρ k for all large k; ii If θ 1,1, then there exists c > 0 such that x k ζ c k 1 1 θ 1 for all large k. Proof. We first recall from 3.3 with N = 0 that x k+1,x k, µ k x k,x k 1, µ k 1 c xk+1 x k for any k 0. Thus, the sequence x k,x k 1, µ k 1 } k=0 with Theorem 3.1iii implies that is obviously non-increasing. This together lim x k,x k 1, µ k 1 = ζ, and x k,x k 1, µ k 1 ζ for all k 0. k 16

17 Moreover, we have x k ζ = x k,x k 1, µ k 1 ζ δ µ k 1 x k x k 1 4 x k,x k 1, µ k 1 ζ + δ µ k 1 x k x k 1 4 x k,x k 1, µ k 1 ζ + δ µ k 1 Hδ x k 1,x k, µ k x k,x k 1, µ k 1 c x k 1,x k, µ k ζ + δ µ k 1 Hδ x k 1,x k, µ k ζ c Hδ x k 1,x k, µ k ζ, 1+ δµmax c 4.1 where the first equality follows from the definition of in 1.5, the first inequality follows from the triangle inequality, the second inequality follows from 4.11 and the last three inequalities follow because x k,x k 1, µ k 1 } k=0 is non-increasing, x k,x k 1, µ k 1 ζ and µ k µ max for all k 0. We next consider two cases. Case 1. In this case, we suppose that x K0,x K0 1, µ K0 1 = ζ for some K 0 0. Since the sequence x k,x k 1, µ k 1 } k=0 isnon-increasing,wemusthavex k,x k 1, µ k 1 = ζ forallk K 0. This together with 4.1 proves statement i and ii. Case. rom now on, we consider the case where x k,x k 1, µ k 1 > ζ for all k 0. rom Theorem 3.1i, we see that x k } is bounded and hence must have at least one cluster point. Let Γ denote the set of cluster points of x k }. Then, it follows from Proposition 3.ii and lk = k that ζ on Γ. Next, we consider the sequence x k,x k 1, µ k 1 } k=0. In view of the boundedness of µ k} since µ min µ k µ max and Theorem 3.1iv which says that x k+1 x k 0, it is not hard to show that the set of cluster points of x k,x k 1, µ k 1 } k=0 is contained in Υ := x,x,µ : x Γ, µ min µ µ max }, which is a compact subset of dom. Moreover, for any x, x, µ Υ hence x Γ and µ min µ µ max, we have x, x, µ = x = ζ. Since x, x, µ Υ is arbitrary, we conclude that ζ on Υ. On the other hand, since is a KL function with an exponent θ [ 1,1, then it follows from Proposition 4.1 that is a KL function with an exponent θ [ 1,1 on Υ. Using these facts and Proposition.1, there exist ε,ν > 0 such that ϕ u,v,µ ζdist0,0,0, u,v,µ 1, where ϕs = as 1 θ for some a > 0, for all u,v,µ satisfying distu,v,µ, Υ < ε and ζ < u,v,µ < ζ + ν. Since Υ contains all the cluster points of x k,x k 1, µ k 1 } k=0, then we have lim k distxk,x k 1, µ k 1, Υ = 0. This together with lim k x k,x k 1, µ k 1 = ζ implies that there exists an integer K 1 0 such that distx k,x k 1, µ k 1, Υ < ε and ζ < x k,x k 1, µ k 1 < ζ + ν whenever k K 1. Thus, for any k K 1, we have ϕ x k,x k 1, µ k 1 ζ dist 0,0,0, x k,x k 1, µ k or notational simplicity, let k := x k,x k 1, µ k 1 ζ and x k := x k+1 x k. Since the sequence x k,x k 1, µ k 1 } k=0 is non-increasing, then k is non-increasing, k > 0 for k 0 and k 0. 17

18 We also let K 1 := maxk 1,K}, where K is given in Lemma 4.1. Then, for all k K 1, we have 1 ϕ k dist 0,0,0, Hδ x k,x k 1, µ k 1 a k θ H 1 θ δ c x k 1 + x k a c k θ H 1 θ δ x k 1 + x k a c k θ H 1 θ δ = a 1 k θ 4 c Hδ x k,x k 3, µ k 3 x k,x k 1, µ k 1 k k 4.14 where a 1 := a c 1 θ > 0, the first inequality follows from 4.13, the second inequality follows from 4.7, c the third inequality follows from p + q p +q for p,q 0 and the last inequality follows from In the following, we consider two cases. i θ = 1. Since k 0, there exists K 0 such that k 1 for all k K. Then, for all k K := maxk, K 1 }, we see from 4.14 that This implies that, for all k K +1, k = k θ a 1 k k. k 1 γ k 3 where γ := a 1 /1+a 1 < 1. Then, from 4.1, we have where c 1 := x k ζ 1+ δµmax c 1+ δµmax c k 1 k K γ K, 1+ δµmax c γ k K K = c 1 ρ k, γ K + K and ρ := γ < 1. This proves statement ii. ii 1 < θ < 1. We define gs := s θ for s 0,. It is easy to see that g is non-increasing. Then, for any k K 1, we further consider the following two cases. If g k g k, it follows from 4.14 that 1 a 1 k θ k k = g k k k gsds = k 1 θ k 1 θ, k 1 θ which, together with 1 θ < 0, implies that k 1 θ k 1 θ k g k k k If g k g k, it is not hard to see that k 1 θ θ 1 θ k k 1 θ k 1 θ θ 1 θ 1 θ 1 a k 1 θ θ 1 θ 1 1 θ. Then, we have K1 1 θ, 4.16 where the last inequality follows from the facts that k is non-increasing and 1 θ < 0. Thus, combining 4.15 and 4.16, we obtain θ 1 k 1 θ k 1 θ a := min, 18 a 1 θ 1 θ } 1 K1 1 θ.

19 Let π k = k K 1 mod for any k K 1. Then, we have k 1 θ k 1 θ K1+π k 1 θ = k j 1 θ j 1 θ j=k 1+π k k K 1 π k a a 4 k, where the last inequality holds whenever k K 1 +1 K 1 +π k. inally, using this relation and 4.1, we see that, for all k K , x k ζ where c := 1+ δµmax c 1+ δµmax c 1 θ 1 4 a k 1 5 Numerical experiments 1+ δµmax c 1 θ 1 4 a. This proves statement ii. k 1 1 θ 1 = c k 1 1 θ 1, In this section, we conduct some numerical experiments to test our for solving the l 1 regularized logistic regression problem and the l 1- regularized least squares problem. All experiments are run in MATLAB R016a on a 64-bit Laptop with an Intel Core i7-5600u CPU.60 GHz and 8 GB of RAM equipped with Windows 10 OS. 5.1 l 1 regularized logistic regression problem In this subsection, we consider the l 1 regularized logistic regression problem min x R n,x 0 R logx := m log1+exp b i a i x+x 0+λ x 1, 5.1 i=1 where x := x,x 0 R n+1, a i R n, b i 1,1} for i = 1,,m with m < n, and λ > 0 is the regularization parameter. We further assume that b 1,,b m are not all the same. Let C R m n+1 be the matrix whose i-th row is given by a i, 1. Then, we can rewrite 5.1 in the form of 1.1 with fx = m log1+exp b i Cx i, Px = λ x 1. i=1 Moreover, one can check that f is Lipschitz continuous with L f = 0.5 C. To apply our, we also need to show that log is level-bounded, i.e., lev α log := x : log x α} is bounded possibly empty for every α R. Since log is nonnegative, then we only need to consider α 0. or any x lev α log with α 0, due to the nonnegativities of f and P, we have fx = m i=1 log1+exp b ia i x+x 0 α, 5.a Px = λ x 1 α. 5.b Now, we define the index sets I := i : b i = 1} and I c := i : b i 1}. Since b 1,,b m are not all the same, then both I and I c are non-empty. Moreover, let M := max a 1,, a m }. Then, for any i = 1,,m, it holds that αm/λ a i x 1 a i x a i x 1 αm/λ, The level-boundedness of log was also shown using different way in the early version of [39, Section 4.1], which can be found in 19

20 where the first and last inequalities follow from 5.b. Using the above relation, we further have log1+exp a i x x 0 log1+exp αm/λ x 0, i I, log1+expa i x+x 0 log1+exp αm/λ+x 0, i I c. 5.3 Thus, we see that log 1+exp αm/λ x 0 1+exp αm/λ+x 0 = log1+exp αm/λ x 0 +log1+exp αm/λ+x 0 i I log1+exp αm/λ x 0+ i I c log1+exp αm/λ+x 0 i I log1+exp a i x x 0+ i I c log1+expa i x+x 0 = fx α, where the first inequality follows because both I and I c are non-empty, the second inequality follows from 5.3 and the last inequality follows from 5.a. The above relation further implies that e α 1+exp αm/λ x 0 1+exp αm/λ+x 0 = 1+e αm/λ +e αm/λ e x0 +e x Next, we consider the following two cases. α = 0. In this case, we see from 5.4 that e x0 +e x0 1, which cannot hold for any x 0. Thus, lev 0 log is empty. α > 0. In this case, it follows from 5.4 that e x0 +e x0 M := e αm/λ e α e αm/λ 1 = x 0 logm. This together with 5.b shows that x is bounded and hence lev α log is bounded. rom the above, we see that log is level-bounded and hence our is applicable. In our experiments, we will evaluate with δ = 0.9 denoted by and with δ = 0 denoted by. or, we choose βk 0} by 1.4 with β0 k in place of β k. Moreover, for both and, we set c = 10 4, τ =, η = 0.8, N =, β max = 10, µ min = 10 6, µ max = L f+c 1 δ, µ0 0 = 1, and y µ 0 k y k 1, fy k fy k 1 } } } k = min max max y k y k 1, 0.5 µ k 1, µ min, µ max for k 1. We also compare and with PG, ISTA and ISTA with restart reista; see, for example, [6, 3, 39]. or ease of future reference, we recall that ISTA for solving 5.1 is given by β k = t k 1 1/t k, y k = x k +β k x k x k 1, y k 1 x k+1 = Prox 1 L f P t k+1 = t k fy k, L f /, with x 1 = x 0 and t 1 = t 0 = 1. Then, PG is given by 5.5 with β k 0 and reista is given by 5.5 with resetting t k = t k+1 = 1 whenever kmod K = 0 or y k x k+1, x k+1 x k > 0. In our experiments, we choose K = 00 for reista this restart interval has been observed in [6] to perform best. In addition, we initialize all algorithms at the origin and set the maximum running time 3 to T max for all algorithms. The specific values of T max are given in ig.1. 3 The maximum running time used in Section 5.1 and 5. does not include the time for computing the Lipschitz constant L f of f

21 In the following experiments, we choose λ 10, 1, 0.1} and consider m,n,s = 100j, 1000j, 0j for j 3,5,10}. or each triple m,n,s, we follow [39, Section 4.1] to randomly generate a trial as follows. irst, we generate a matrix A R m n with i.i.d. standard Gaussian entries. We then choose a subset S 1,,n} of size s uniformly at random and generate an s-sparse vector ˆx R n, which has i.i.d. standard Gaussian entries on S and zeros on S c. inally, we generate the vector b R m by setting b = signaˆx+ˆǫ1, where ˆǫ is chosen uniformly at random from [0,1] and 1 = 1,,1 R m. To evaluate the performances of different algorithms, we follow[19, 45] to use an evolution of objective values. To introduce this evolution, we first define ek := logx k log min log x 0 min, 5.6 where log x k denotestheobjectivevalueatx k obtainedbyan algorithmand min log denotestheminimum of the terminating objective values obtained among all algorithms in a trial generated as above. or an algorithm, let T k denote the total computational time from the beginning when it obtains x k. One can see that T 0 = 0 and T k is non-decreasing with respect to k. We now define the evolution of objective values obtained by a particular algorithm with respect to time t as follows: log := minek : k i : T i t}}. Note that 0 1 since 0 ek 1 for all k and is non-increasing with respect to t. It can be considered as a normalized measure of the reduction of the function value with respect to time. Then, one can take the average of over several independent trials, and plot the average within time t for a given algorithm. ig. 1 shows the average of 10 independent trials of different algorithms for solving 5.1. rom this figure, we can see that performs best in most cases in the sense that it takes less time to return a lower objective value. 5. l 1- regularized least squares problem In this subsection, we consider the l 1- regularized least squares problem [46]: min 1-x := 1 x R n Ax b +λ x 1 x, 5.7 where A R m n, b R m and λ > 0 is the regularization parameter. Obviously, this problem takes the form of 1.1 with fx = 1 Ax b and Px = λ x 1 x. Moreover, f is Lipschitz continuous with L f = A and Px is a difference-of-convex regularizer. We further assume that A does not have zero columns so that 1- is level-bound; see [35, Example 4.1b] and [46, Lemma 3.1]. Thus, our is applicable. In this part of experiments, we compare three algorithms for solving5.7: with δ = 0.9, with δ = 0, and the proximal difference-of-convex algorithm with extrapolation pdca e 4 [40]. The pdca e for solving 5.7 is given as follows: choose proper extrapolation parameters β k }, let x 1 = x 0, and then at the k-th iteration, Take any ξ k λ x k and compute y k = x k +β k x k x k 1, x k+1 = argmin fy k ξ k, x + L } f x x yk +λ x 1. or and, we use the same parameter settings as in Section 5.1. or pdca e, we follow [40] to choose β k } as in reista see more details in Section 5.1. All algorithms are initialized at the origin 4 The MatlabcodesofpDCA e forsolving 5.7areavailable athttp:// 1

Douglas-Rachford splitting for nonconvex feasibility problems

Douglas-Rachford splitting for nonconvex feasibility problems Douglas-Rachford splitting for nonconvex feasibility problems Guoyin Li Ting Kei Pong Jan 3, 015 Abstract We adapt the Douglas-Rachford DR) splitting method to solve nonconvex feasibility problems by studying

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

arxiv: v2 [math.oc] 22 Jun 2017

arxiv: v2 [math.oc] 22 Jun 2017 A Proximal Difference-of-convex Algorithm with Extrapolation Bo Wen Xiaojun Chen Ting Kei Pong arxiv:1612.06265v2 [math.oc] 22 Jun 2017 June 17, 2017 Abstract We consider a class of difference-of-convex

More information

arxiv: v2 [math.oc] 21 Nov 2017

arxiv: v2 [math.oc] 21 Nov 2017 Unifying abstract inexact convergence theorems and block coordinate variable metric ipiano arxiv:1602.07283v2 [math.oc] 21 Nov 2017 Peter Ochs Mathematical Optimization Group Saarland University Germany

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean Renato D.C. Monteiro B. F. Svaiter March 17, 2009 Abstract In this paper we analyze the iteration-complexity

More information

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Peter Ochs, Jalal Fadili, and Thomas Brox Saarland University, Saarbrücken, Germany Normandie Univ, ENSICAEN, CNRS, GREYC, France

More information

1. Introduction. In this paper we consider a class of matrix factorization problems, which can be modeled as

1. Introduction. In this paper we consider a class of matrix factorization problems, which can be modeled as 1 3 4 5 6 7 8 9 10 11 1 13 14 15 16 17 A NON-MONOTONE ALTERNATING UPDATING METHOD FOR A CLASS OF MATRIX FACTORIZATION PROBLEMS LEI YANG, TING KEI PONG, AND XIAOJUN CHEN Abstract. In this paper we consider

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms JOTA manuscript No. (will be inserted by the editor) Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms Peter Ochs Jalal Fadili Thomas Brox Received: date / Accepted: date Abstract

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

1. Introduction. We consider the following constrained optimization problem:

1. Introduction. We consider the following constrained optimization problem: SIAM J. OPTIM. Vol. 26, No. 3, pp. 1465 1492 c 2016 Society for Industrial and Applied Mathematics PENALTY METHODS FOR A CLASS OF NON-LIPSCHITZ OPTIMIZATION PROBLEMS XIAOJUN CHEN, ZHAOSONG LU, AND TING

More information

From error bounds to the complexity of first-order descent methods for convex functions

From error bounds to the complexity of first-order descent methods for convex functions From error bounds to the complexity of first-order descent methods for convex functions Nguyen Trong Phong-TSE Joint work with Jérôme Bolte, Juan Peypouquet, Bruce Suter. Toulouse, 23-25, March, 2016 Journées

More information

Sparse Approximation via Penalty Decomposition Methods

Sparse Approximation via Penalty Decomposition Methods Sparse Approximation via Penalty Decomposition Methods Zhaosong Lu Yong Zhang February 19, 2012 Abstract In this paper we consider sparse approximation problems, that is, general l 0 minimization problems

More information

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition Man-Chung Yue Zirui Zhou Anthony Man-Cho So Abstract In this paper we consider the cubic regularization

More information

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition

On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition Man-Chung Yue Zirui Zhou Anthony Man-Cho So Abstract In this paper we consider the cubic regularization

More information

SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS

SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS XIANTAO XIAO, YONGFENG LI, ZAIWEN WEN, AND LIWEI ZHANG Abstract. The goal of this paper is to study approaches to bridge the gap between

More information

FAST FIRST-ORDER METHODS FOR COMPOSITE CONVEX OPTIMIZATION WITH BACKTRACKING

FAST FIRST-ORDER METHODS FOR COMPOSITE CONVEX OPTIMIZATION WITH BACKTRACKING FAST FIRST-ORDER METHODS FOR COMPOSITE CONVEX OPTIMIZATION WITH BACKTRACKING KATYA SCHEINBERG, DONALD GOLDFARB, AND XI BAI Abstract. We propose new versions of accelerated first order methods for convex

More information

Sparse Recovery via Partial Regularization: Models, Theory and Algorithms

Sparse Recovery via Partial Regularization: Models, Theory and Algorithms Sparse Recovery via Partial Regularization: Models, Theory and Algorithms Zhaosong Lu and Xiaorui Li Department of Mathematics, Simon Fraser University, Canada {zhaosong,xla97}@sfu.ca November 23, 205

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

5. Subgradient method

5. Subgradient method L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization Frank E. Curtis Department of Industrial and Systems Engineering, Lehigh University Daniel P. Robinson Department

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

A projection-type method for generalized variational inequalities with dual solutions

A projection-type method for generalized variational inequalities with dual solutions Available online at www.isr-publications.com/jnsa J. Nonlinear Sci. Appl., 10 (2017), 4812 4821 Research Article Journal Homepage: www.tjnsa.com - www.isr-publications.com/jnsa A projection-type method

More information

Proximal Alternating Linearized Minimization for Nonconvex and Nonsmooth Problems

Proximal Alternating Linearized Minimization for Nonconvex and Nonsmooth Problems Proximal Alternating Linearized Minimization for Nonconvex and Nonsmooth Problems Jérôme Bolte Shoham Sabach Marc Teboulle Abstract We introduce a proximal alternating linearized minimization PALM) algorithm

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

COMPLEXITY OF A QUADRATIC PENALTY ACCELERATED INEXACT PROXIMAL POINT METHOD FOR SOLVING LINEARLY CONSTRAINED NONCONVEX COMPOSITE PROGRAMS

COMPLEXITY OF A QUADRATIC PENALTY ACCELERATED INEXACT PROXIMAL POINT METHOD FOR SOLVING LINEARLY CONSTRAINED NONCONVEX COMPOSITE PROGRAMS COMPLEXITY OF A QUADRATIC PENALTY ACCELERATED INEXACT PROXIMAL POINT METHOD FOR SOLVING LINEARLY CONSTRAINED NONCONVEX COMPOSITE PROGRAMS WEIWEI KONG, JEFFERSON G. MELO, AND RENATO D.C. MONTEIRO Abstract.

More information

Acceleration Method for Convex Optimization over the Fixed Point Set of a Nonexpansive Mapping

Acceleration Method for Convex Optimization over the Fixed Point Set of a Nonexpansive Mapping Noname manuscript No. will be inserted by the editor) Acceleration Method for Convex Optimization over the Fixed Point Set of a Nonexpansive Mapping Hideaki Iiduka Received: date / Accepted: date Abstract

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

INERTIAL ACCELERATED ALGORITHMS FOR SOLVING SPLIT FEASIBILITY PROBLEMS. Yazheng Dang. Jie Sun. Honglei Xu

INERTIAL ACCELERATED ALGORITHMS FOR SOLVING SPLIT FEASIBILITY PROBLEMS. Yazheng Dang. Jie Sun. Honglei Xu Manuscript submitted to AIMS Journals Volume X, Number 0X, XX 200X doi:10.3934/xx.xx.xx.xx pp. X XX INERTIAL ACCELERATED ALGORITHMS FOR SOLVING SPLIT FEASIBILITY PROBLEMS Yazheng Dang School of Management

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem

Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem Advances in Operations Research Volume 01, Article ID 483479, 17 pages doi:10.1155/01/483479 Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem

More information

A GENERAL INERTIAL PROXIMAL POINT METHOD FOR MIXED VARIATIONAL INEQUALITY PROBLEM

A GENERAL INERTIAL PROXIMAL POINT METHOD FOR MIXED VARIATIONAL INEQUALITY PROBLEM A GENERAL INERTIAL PROXIMAL POINT METHOD FOR MIXED VARIATIONAL INEQUALITY PROBLEM CAIHUA CHEN, SHIQIAN MA, AND JUNFENG YANG Abstract. In this paper, we first propose a general inertial proximal point method

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

M. Marques Alves Marina Geremia. November 30, 2017

M. Marques Alves Marina Geremia. November 30, 2017 Iteration complexity of an inexact Douglas-Rachford method and of a Douglas-Rachford-Tseng s F-B four-operator splitting method for solving monotone inclusions M. Marques Alves Marina Geremia November

More information

An Alternative Three-Term Conjugate Gradient Algorithm for Systems of Nonlinear Equations

An Alternative Three-Term Conjugate Gradient Algorithm for Systems of Nonlinear Equations International Journal of Mathematical Modelling & Computations Vol. 07, No. 02, Spring 2017, 145-157 An Alternative Three-Term Conjugate Gradient Algorithm for Systems of Nonlinear Equations L. Muhammad

More information

A proximal-like algorithm for a class of nonconvex programming

A proximal-like algorithm for a class of nonconvex programming Pacific Journal of Optimization, vol. 4, pp. 319-333, 2008 A proximal-like algorithm for a class of nonconvex programming Jein-Shan Chen 1 Department of Mathematics National Taiwan Normal University Taipei,

More information

Variable Metric Forward-Backward Algorithm

Variable Metric Forward-Backward Algorithm Variable Metric Forward-Backward Algorithm 1/37 Variable Metric Forward-Backward Algorithm for minimizing the sum of a differentiable function and a convex function E. Chouzenoux in collaboration with

More information

Hedy Attouch, Jérôme Bolte, Benar Svaiter. To cite this version: HAL Id: hal

Hedy Attouch, Jérôme Bolte, Benar Svaiter. To cite this version: HAL Id: hal Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods Hedy Attouch, Jérôme Bolte, Benar Svaiter To cite

More information

Implications of the Constant Rank Constraint Qualification

Implications of the Constant Rank Constraint Qualification Mathematical Programming manuscript No. (will be inserted by the editor) Implications of the Constant Rank Constraint Qualification Shu Lu Received: date / Accepted: date Abstract This paper investigates

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

A user s guide to Lojasiewicz/KL inequalities

A user s guide to Lojasiewicz/KL inequalities Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f

More information

Chapter 2 Convex Analysis

Chapter 2 Convex Analysis Chapter 2 Convex Analysis The theory of nonsmooth analysis is based on convex analysis. Thus, we start this chapter by giving basic concepts and results of convexity (for further readings see also [202,

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

An inexact subgradient algorithm for Equilibrium Problems

An inexact subgradient algorithm for Equilibrium Problems Volume 30, N. 1, pp. 91 107, 2011 Copyright 2011 SBMAC ISSN 0101-8205 www.scielo.br/cam An inexact subgradient algorithm for Equilibrium Problems PAULO SANTOS 1 and SUSANA SCHEIMBERG 2 1 DM, UFPI, Teresina,

More information

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities Caihua Chen Xiaoling Fu Bingsheng He Xiaoming Yuan January 13, 2015 Abstract. Projection type methods

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Maths 212: Homework Solutions

Maths 212: Homework Solutions Maths 212: Homework Solutions 1. The definition of A ensures that x π for all x A, so π is an upper bound of A. To show it is the least upper bound, suppose x < π and consider two cases. If x < 1, then

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018 Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

A Quasi-Newton Algorithm for Nonconvex, Nonsmooth Optimization with Global Convergence Guarantees

A Quasi-Newton Algorithm for Nonconvex, Nonsmooth Optimization with Global Convergence Guarantees Noname manuscript No. (will be inserted by the editor) A Quasi-Newton Algorithm for Nonconvex, Nonsmooth Optimization with Global Convergence Guarantees Frank E. Curtis Xiaocun Que May 26, 2014 Abstract

More information

A first-order primal-dual algorithm with linesearch

A first-order primal-dual algorithm with linesearch A first-order primal-dual algorithm with linesearch Yura Malitsky Thomas Pock arxiv:608.08883v2 [math.oc] 23 Mar 208 Abstract The paper proposes a linesearch for a primal-dual method. Each iteration of

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Numerical Sequences and Series

Numerical Sequences and Series Numerical Sequences and Series Written by Men-Gen Tsai email: b89902089@ntu.edu.tw. Prove that the convergence of {s n } implies convergence of { s n }. Is the converse true? Solution: Since {s n } is

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

NOTES ON VECTOR-VALUED INTEGRATION MATH 581, SPRING 2017

NOTES ON VECTOR-VALUED INTEGRATION MATH 581, SPRING 2017 NOTES ON VECTOR-VALUED INTEGRATION MATH 58, SPRING 207 Throughout, X will denote a Banach space. Definition 0.. Let ϕ(s) : X be a continuous function from a compact Jordan region R n to a Banach space

More information

Assignment 1: From the Definition of Convexity to Helley Theorem

Assignment 1: From the Definition of Convexity to Helley Theorem Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

c 2013 Society for Industrial and Applied Mathematics

c 2013 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 3, No., pp. 109 115 c 013 Society for Industrial and Applied Mathematics AN ACCELERATED HYBRID PROXIMAL EXTRAGRADIENT METHOD FOR CONVEX OPTIMIZATION AND ITS IMPLICATIONS TO SECOND-ORDER

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Second order forward-backward dynamical systems for monotone inclusion problems

Second order forward-backward dynamical systems for monotone inclusion problems Second order forward-backward dynamical systems for monotone inclusion problems Radu Ioan Boţ Ernö Robert Csetnek March 6, 25 Abstract. We begin by considering second order dynamical systems of the from

More information

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems Amir Beck and Shoham Sabach July 6, 2011 Abstract We consider a general class of convex optimization problems

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Abstract This paper presents an accelerated

More information

Relationships between upper exhausters and the basic subdifferential in variational analysis

Relationships between upper exhausters and the basic subdifferential in variational analysis J. Math. Anal. Appl. 334 (2007) 261 272 www.elsevier.com/locate/jmaa Relationships between upper exhausters and the basic subdifferential in variational analysis Vera Roshchina City University of Hong

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Received: 15 December 2010 / Accepted: 30 July 2011 / Published online: 20 August 2011 Springer and Mathematical Optimization Society 2011

Received: 15 December 2010 / Accepted: 30 July 2011 / Published online: 20 August 2011 Springer and Mathematical Optimization Society 2011 Math. Program., Ser. A (2013) 137:91 129 DOI 10.1007/s10107-011-0484-9 FULL LENGTH PAPER Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward backward splitting,

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems

An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems Kim-Chuan Toh Sangwoon Yun March 27, 2009; Revised, Nov 11, 2009 Abstract The affine rank minimization

More information

Zangwill s Global Convergence Theorem

Zangwill s Global Convergence Theorem Zangwill s Global Convergence Theorem A theory of global convergence has been given by Zangwill 1. This theory involves the notion of a set-valued mapping, or point-to-set mapping. Definition 1.1 Given

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

1. Let A R be a nonempty set that is bounded from above, and let a be the least upper bound of A. Show that there exists a sequence {a n } n N

1. Let A R be a nonempty set that is bounded from above, and let a be the least upper bound of A. Show that there exists a sequence {a n } n N Applied Analysis prelim July 15, 216, with solutions Solve 4 of the problems 1-5 and 2 of the problems 6-8. We will only grade the first 4 problems attempted from1-5 and the first 2 attempted from problems

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information