Complexity Analysis of Interior Point Algorithms for Non-Lipschitz and Nonconvex Minimization

Mathematical Programming manuscript No. (will be inserted by the editor)

Complexity Analysis of Interior Point Algorithms for Non-Lipschitz and Nonconvex Minimization

Wei Bian · Xiaojun Chen · Yinyu Ye

July 5, 2012. Received: 8 March 2013; 8 October 2013 / Accepted: date

Abstract We propose a first order interior point algorithm for a class of non-Lipschitz and nonconvex minimization problems with box constraints, which arise from applications in variable selection and regularized optimization. The objective functions of these problems are continuously differentiable typically at interior points of the feasible set. Our first order algorithm is easy to implement and the objective function value is reduced monotonically along the iteration points. We show that the worst-case iteration complexity for finding an ε scaled first order stationary point is O(ε^{-2}). Furthermore, we develop a second order interior point algorithm using the Hessian matrix, and solve a quadratic program with a ball constraint at each iteration. Although the second order interior point algorithm costs more computational time than the first order algorithm in each iteration, its worst-case iteration complexity for finding an ε scaled second order stationary point is reduced to O(ε^{-3/2}). Note that an ε scaled second order stationary point must also be an ε scaled first order stationary point.

Keywords constrained non-Lipschitz optimization · complexity analysis · interior point method · first order algorithm · second order algorithm

This work was supported partly by Hong Kong Research Council Grant PolyU5003/10p, The Hong Kong Polytechnic University Postdoctoral Fellowship Scheme, the NSF foundation of China, and a US AFOSR grant.

Wei Bian, Department of Mathematics, Harbin Institute of Technology, Harbin, China. bianweilvse50@163.com
Xiaojun Chen, Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China. maxjchen@polyu.edu.hk
Yinyu Ye, Department of Management Science and Engineering, Stanford University, Stanford, CA. yinyu-ye@stanford.edu

Mathematics Subject Classification (2000) 90C30 · 90C26 · 65K05 · 49M37

1 Introduction

In this paper, we consider the following optimization problem:

min  f(x) = H(x) + λ Σ_{i=1}^n φ(x_i^p)
s.t. x ∈ Ω = {x : 0 ≤ x ≤ b},   (1)

where H : R^n → R is continuously differentiable, φ : [0, +∞) → [0, +∞) is continuous and concave, λ > 0, 0 < p < 1, b = (b_1, b_2, ..., b_n)^T with b_i ∈ (0, +∞) ∪ {+∞}, and 0 ≤ x ≤ b means that x_i ∈ [0, b_i] if b_i < +∞ and x_i ≥ 0 otherwise, i = 1, 2, ..., n. Moreover, φ is continuously differentiable in (0, +∞) and φ(0) = 0. Without loss of generality, we assume that a minimizer of (1) exists and min_Ω f(x) ≥ 0.

Problem (1) is nonsmooth, nonconvex and non-Lipschitz, and has been used extensively in image restoration, signal processing and variable selection; see, e.g., [2,10,14,15,19,20,24,31]. The function H(x) is often used as a data fitting term, while the function Σ_{i=1}^n φ(x_i^p) is used as a regularization term. The feasible set Ω of (1) includes R^n_+ = {x : x ≥ 0} as a special case. Moreover, the unconstrained problem

min_{x ∈ R^n}  H(x) + λ Σ_{i=1}^n φ(|x_i|^p),   (2)

where the non-Lipschitz points are in the interior of the feasible region, can be equivalently reformulated as the constrained problem

min  H(x⁺ − x⁻) + λ Σ_{i=1}^n φ((x_i⁺)^p) + λ Σ_{i=1}^n φ((x_i⁻)^p)
s.t. x⁺ ≥ 0, x⁻ ≥ 0   (3)

by using the variable splitting x = x⁺ − x⁻. In Section 3, we show that the scaled first and second order stationary points of problems (2) and (3) are in one-to-one correspondence. The advantage of (3) is that the nonsmooth points are only at the boundary of the feasible set, which allows us to use the gradient and Hessian of the objective function in interior point methods.

Numerical algorithms for nonconvex optimization problems have been studied extensively. However, little theoretical complexity or convergence speed analysis of such algorithms is known, in contrast to the complexity study of convex optimization in the past thirty years. We now review a few results on complexity analysis of nonconvex optimization problems.

Smooth, nonconvex. Using an interior point algorithm, Ye [29] proved that an ε KKT or first order stationary point of a general quadratic program

min ½ xᵀQx + cᵀx  s.t. Ax = q, x ≥ 0   (4)

can be computed in O(ε^{-1} log ε^{-1}) iterations, where each iteration solves a ball-constrained or trust-region quadratic program that is equivalent to a simple convex minimization problem. Here Q ∈ R^{n×n}, c ∈ R^n, A ∈ R^{m×n}, q ∈ R^m. Ye [29] also proved that, as ε → 0, the iterative sequence converges to a point satisfying the second order necessary optimality condition; more precisely, the least eigenvalue of Q on the null space of all active constraints is greater than −ε. For general unconstrained nonconvex optimization, it was shown in [17,21] that the standard steepest descent method with line search or trust region can find an ε first order stationary point in O(ε^{-2}) iterations. In [22], Nesterov and Polyak showed that a Newton-type method based on cubic regularization requires at most O(ε^{-3/2}) iterations to find an ε first order stationary point. Instead of using the Hessian and its global Lipschitz constant, Cartis, Gould and Toint [4-6] further proposed an adaptive regularization with cubics (ARC) method in which approximations of the Hessian and its Lipschitz constant are updated at each iteration. They also showed that the ARC method takes at most O(ε^{-3/2}) iterations to find an ε first order stationary point. Then, an algorithm with one-dimensional global optimization of the cubic model is given and the sharpness of the complexity bound O(ε^{-3/2}) is derived in [7].

Lipschitz continuous, nonconvex. Cartis, Gould and Toint [3] estimated the worst-case complexity of a first order trust-region or quadratic regularization method for solving the following unconstrained nonsmooth, nonconvex minimization problem

min Φ_h(x) := H(x) + h(c(x)),   (5)

where h : R^m → R is convex but may be nonsmooth and c : R^n → R^m is continuously differentiable. Their method takes at most O(ε^{-2}) iterations to reduce the size of a first order criticality measure below ε, which is of the same order as the worst-case complexity of steepest-descent methods applied to unconstrained smooth nonconvex optimization. Garmanjani and Vicente [15] proposed a class of smoothing direct-search methods for general unconstrained nonsmooth optimization by applying a direct-search method to a smoothing function f̃ of the objective function f [8]. Such an approach can be considered a zero order method because only function values are used. When f is locally Lipschitz, the smoothing direct-search method [15] takes at most O(ε^{-3} log ε^{-1}) iterations to find an x such that ‖∇f̃(x, μ)‖ ≤ ε and μ ≤ ε, where μ is the smoothing parameter. When μ ↓ 0, f̃(x, μ) → f(x) and ∇f̃(x, μ) → v with v ∈ ∂f(x).

Non-Lipschitz, nonconvex. Ge, Jiang and Ye [16] extended the complexity result of [29] to the following concave non-Lipschitz minimization

min Σ_{i=1}^n x_i^p  s.t. Ax = q, x ≥ 0   (6)

and showed that finding an ε scaled first order stationary point or global minimizer requires at most O(ε^{-1} log ε^{-1}) iterations. Recently, Bian and Chen [1] proposed a smoothing quadratic regularization (SQR) algorithm for solving the unconstrained non-Lipschitz minimization problem (2). At each iteration, the SQR algorithm solves a strongly convex quadratic minimization problem with a diagonal Hessian matrix, which has a simple closed form solution. The SQR algorithm is easy to implement and its worst-case complexity for reaching an ε scaled first order stationary point is O(ε^{-2}). To overcome the nonsmoothness of the second term in the objective function of (2), smoothing methods were used in the SQR algorithm.

In this paper, we propose a first order interior point method and a second order interior point method for solving the constrained non-Lipschitz nonconvex optimization problem (1), using the smoothness of f in {x : 0 < x ≤ b} and keeping all iterates in this set. The former uses the gradient of f to derive a quadratic overestimation and has worst-case complexity O(ε^{-2}) for finding an ε scaled first order stationary point of (1). The latter uses the gradient and the Hessian to derive a cubic overestimation and has worst-case iteration complexity O(ε^{-3/2}) for finding an ε scaled second order stationary point of the following special version of (1):

min_{x ≥ 0}  H(x) + λ Σ_{i=1}^n x_i^p.   (7)

To the best of our knowledge, these two methods are the first methods with state of the art iteration complexity bounds for constrained, non-Lipschitz and nonconvex optimization. In particular, the second order interior point algorithm is the first method to find an ε scaled second order stationary point or an ε global minimizer of the non-Lipschitz, nonconvex optimization problem (7) in no more than O(ε^{-3/2}) iterations. The above results are summarized in Table 1.

Our original goal was to produce the O(ε^{-1} log ε^{-1}) worst-case complexity bound for problem (1), as it was established for problems (4) in [29] and (6) in [16]. But we failed even when H(x) is quadratic and we leave this as an open problem. In developing the O(ε^{-1} log ε^{-1}) bound, potential reduction techniques were used, where the quadratic objective or the concavity of Σ_{i=1}^n x_i^p plays a key role. In either case, a quadratic overestimation of the potential function can be constructed and minimized, which makes the analysis considerably simpler and the convergence rate better. However, when the two objectives in (4) and (6) are added together, which is a special case of (1), namely

min_{x ≥ 0} ½ xᵀQx + cᵀx + λ Σ_{i=1}^n x_i^p,

a quadratic overestimation of the potential function is no longer achievable.

Table 1: Worst-case complexity results for nonconvex optimization

                       Smooth                                  Lipschitz continuous       Non-Lipschitz
O(ε^{-1} log ε^{-1})   [Ye 1998] (4)                                                      [Ge et al 2011] (6)
O(ε^{-3/2})            [Nesterov et al 2006]; [Cartis et al 2011]                         this paper for (7)
O(ε^{-2})              [Nesterov 2004]; [Gratton et al 2008]   [Cartis et al 2011] (5)    [Bian et al 2012] (2); this paper for (1)
O(ε^{-3} log ε^{-1})                                           [Garmanjani et al 2012]

Our paper is organized as follows. In Section 2, a first order interior point algorithm is proposed for solving (1), which only uses ∇f and a Lipschitz constant of ∇H on Ω and is easy to implement. Every iterate x^k > 0 belongs to Ω and the objective function is monotonically decreasing along the generated sequence {x^k}. Moreover, the algorithm produces an ε scaled first order stationary point of (1) in at most O(ε^{-2}) iterations. In Section 3, a second order interior point algorithm is given to solve (7), which can generate an ε scaled second order stationary point in at most O(ε^{-3/2}) iterations. Since our problem has constraints, the ε scaled first order and second order stationary points resemble the complementarity conditions for inequality constrained optimization problems. In Section 4, we present numerical results to illustrate the efficiency of the algorithms and the complexity bound.

Throughout this paper, K = {0, 1, 2, ...}, I = {1, 2, ..., n}, I_b = {i ∈ {1, 2, ..., n} : b_i < +∞} and e_n = (1, 1, ..., 1)^T ∈ R^n. For x ∈ R^n, A = (a_{ij}) ∈ R^{m×n} and p > 0, ‖x‖ = ‖x‖_2, diag(x) = diag(x_1, x_2, ..., x_n), and |A|^p = (|a_{ij}|^p)_{m×n}. For two matrices A, B ∈ R^{n×n}, we write A ⪰ B when A − B is positive semi-definite.

2 First Order Method

In this section, we propose a first order interior point algorithm for solving (1), which uses a first order reduction technique and keeps all iterates x^k > 0 in the feasible set Ω. We show that the objective function value f(x^k) is monotonically decreasing along the sequence {x^k} generated by the algorithm, and that the worst-case complexity of the algorithm for generating an ε global minimizer or ε scaled first order stationary point of (1) is O(ε^{-2}), which is of the same order as the worst-case complexity of steepest-descent methods for smooth nonconvex optimization problems. Moreover, it is worth noting that the proposed first order interior point algorithm is easy to implement, and the computational cost of each iteration is small.

Throughout this section, we need the following assumptions.

Assumption 2.1: ∇H is globally Lipschitz on Ω with a Lipschitz constant β ≥ 1. In particular, when I_b ≠ ∅ we choose β such that β ≥ max_{i ∈ I_b} b_i^{-1}.

Assumption 2.2: For any given x⁰ ∈ Ω, there is R ≥ 1 such that sup{‖x‖ : f(x) ≤ f(x⁰), x ∈ Ω} ≤ R.

When H(x) = ½‖Ax − q‖², Assumption 2.1 holds with β = ‖AᵀA‖. Assumption 2.2 holds if Ω is bounded, or if H(x) ≥ 0 for all x ∈ Ω and φ(s) → ∞ as s → ∞.

2.1 First Order Necessary Condition

Note that for problem (1), when 0 < p < 1, the Clarke generalized gradient of φ(s^p) does not exist at 0. Inspired by the first and second order necessary conditions for local minimizers of unconstrained non-Lipschitz optimization in [11,12], we give the scaled first and second order necessary conditions for local minimizers of the constrained non-Lipschitz optimization problem (1) in this section and the next section, respectively. Then, for any ε ∈ (0, 1], the definitions of ε scaled first order and second order stationary points of (1) can be deduced directly.

First, for ε > 0, an ε global minimizer of (1) is defined as a feasible solution 0 ≤ x_ε ≤ b with f(x_ε) − min_{0 ≤ x ≤ b} f(x) ≤ ε. It has been proved in [9] that finding a global minimizer of the unconstrained l₂-l_p minimization problem is strongly NP hard.

For any fixed x ∈ R^n, denote X = diag(x). Any local minimizer x of the unconstrained l₂-l_p optimization modeled by (2) with H(x) = ½‖Ax − q‖² and φ(s) = s satisfies the first order necessary condition [12]

X Aᵀ(Ax − q) + λp |x|^p = 0.

Similarly, using X as a scaling matrix, if x is a local minimizer of (1), then x ∈ Ω satisfies the first order necessary condition

[X∇f(x)]_i = x_i [∇H(x)]_i + λp φ'(s)|_{s = x_i^p} x_i^p = 0   if x_i < b_i;   (8a)
[∇f(x)]_i = [∇H(x)]_i + λp φ'(s)|_{s = x_i^p} x_i^{p−1} ≤ 0   if x_i = b_i.   (8b)

For (1), if x at which f is differentiable satisfies the first order necessary condition given above, then x ∈ Ω satisfies (i) [∇f(x)]_i = 0 if x_i < b_i; (ii) [∇f(x)]_i ≤ 0 if x_i = b_i. Moreover, although [∇f(x)]_i does not exist when x_i = 0, one can see that, as x_i → 0+, [∇f(x)]_i → +∞. Now we can define an ε scaled first order stationary point of (1).

Definition 1 For a given ε ∈ (0, 1], we call x ∈ Ω an ε scaled first order stationary point of (1) if
(i) |[X∇f(x)]_i| ≤ ε if x_i < (1 − ½ε) b_i;
(ii) [∇f(x)]_i ≤ ε if x_i ≥ (1 − ½ε) b_i.
Definition 1 is consistent with the first order necessary conditions in (8a)-(8b) with ε = 0.

2.2 First Order Interior Point Algorithm

Note that for any x, x⁺ ∈ (0, b], Assumption 2.1 implies that

H(x⁺) ≤ H(x) + ⟨∇H(x), x⁺ − x⟩ + (β/2)‖x⁺ − x‖².

Since φ is concave on [0, +∞), for any s, t ∈ (0, +∞),

φ(t) ≤ φ(s) + φ'(s)(t − s).

Thus, for any x, x⁺ ∈ (0, b], we obtain

f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + (β/2)‖x⁺ − x‖².

Let x⁺ = x + X d_x. We obtain

f(x⁺) ≤ f(x) + ⟨X∇f(x), d_x⟩ + (β/2)‖X d_x‖².   (9)

To achieve a reduction of the objective function at each iteration, when x > 0 we minimize a quadratic function subject to a box constraint:

min ⟨X∇f(x), d_x⟩ + (β/2) d_xᵀ X² d_x
s.t. −½ e_n ≤ d_x ≤ X^{-1}(b − x).   (10)

For any fixed x ∈ (0, b], the objective function in (10) is strongly convex and separable in the components of d_x, and thus the unique solution of (10) has the closed form

d_x = P_{D_x}[−(1/β) X^{-1}∇f(x)],

where D_x = [−½ e_n, X^{-1}(b − x)] and P_{D_x} is the orthogonal projection operator onto the box D_x. Denote X_k = diag(x^k) and use d^k and D_k to denote d_{x^k} and D_{x^k}, respectively.
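Before stating the algorithm formally, the following minimal MATLAB sketch illustrates this closed-form projected step and the resulting iteration (11a)-(11b) given below; the gradient handle gradf, the bound vector b (entries may be Inf) and the constant β of Assumption 2.1 are assumed to be supplied by the user, and the stopping test is only a simplified surrogate for Definition 1.

    % Minimal sketch of the first order interior point iteration (11a)-(11b),
    % assuming gradf(x) returns the gradient of f at an interior point x > 0.
    function x = foipa(x0, b, gradf, beta, epsilon, maxit)
      x = x0;                                   % requires 0 < x0 <= b
      for k = 1:maxit
        g = gradf(x);
        if norm(x .* g, inf) <= epsilon         % simplified scaled first order test
          break;
        end
        d = -(1/beta) * (g ./ x);               % unconstrained minimizer of (10)
        d = min(max(d, -0.5), (b - x) ./ x);    % project onto D_x = [-e_n/2, X^{-1}(b-x)]
        x = x + x .* d;                         % (11b): x^{k+1} = x^k + X_k d^k, stays in (0,b]
      end
    end

Each iteration costs one gradient evaluation plus O(n) arithmetic, consistent with the small per-iteration cost claimed above.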

First Order Interior Point Algorithm
Given ε ∈ (0, 1], choose x⁰ ∈ (0, b]. For k ≥ 0, set

d^k = P_{D_k}[−(1/β) X_k^{-1}∇f(x^k)],   (11a)
x^{k+1} = x^k + X_k d^k.   (11b)

The first order interior point algorithm presented in the current paper is significantly different from the classical interior point methods [28]. The current method is based on a simple gradient projection into the interior of the nonnegative orthant, while the classical one follows the central path of the logarithmic barrier function via the Newton method.

Lemma 1 The proposed First Order Interior Point Algorithm is well defined, that is, 0 < x^k ≤ b for all k ∈ K.

Proof We only need to prove that if 0 < x^k ≤ b, then 0 < x^{k+1} ≤ b. On the one hand, by d^k ≤ X_k^{-1}(b − x^k), we have x^{k+1} = x^k + X_k d^k ≤ x^k + (b − x^k) = b. On the other hand, using d^k ≥ −½ e_n, we obtain x^{k+1} = x^k + X_k d^k ≥ x^k − ½ x^k = ½ x^k > 0. Hence, 0 < x^{k+1} ≤ b.

Lemma 2 Let {x^k} be the sequence generated by the First Order Interior Point Algorithm. Then the sequence {f(x^k)} is monotonically decreasing and satisfies

f(x^{k+1}) − f(x^k) ≤ −(β/2)‖X_k d^k‖² = −(β/2)‖x^{k+1} − x^k‖².

Moreover, we have ‖x^k‖ ≤ R.

Proof From the KKT conditions of (10), the solution d^k of the quadratic program (10) satisfies the following necessary and sufficient conditions:

βX_k² d^k + X_k∇f(x^k) − μ^k + ν^k = 0,
M_k(d^k + ½ e_n) = 0,
N_k(d^k − X_k^{-1}(b − x^k)) = 0,
−½ e_n ≤ d^k ≤ X_k^{-1}(b − x^k),   (12)

where M_k = diag(μ^k) and N_k = diag(ν^k) with μ^k, ν^k ≥ 0.

Moreover, from (9), we obtain

f(x^{k+1}) − f(x^k) ≤ ⟨X_k∇f(x^k), d^k⟩ + (β/2)‖X_k d^k‖²
 = ⟨−βX_k² d^k + μ^k − ν^k, d^k⟩ + (β/2)‖X_k d^k‖²
 = −(β/2)‖X_k d^k‖² + (μ^k)ᵀ d^k − (ν^k)ᵀ d^k   (13)
 = −(β/2)‖X_k d^k‖² − ½ (μ^k)ᵀ e_n − (ν^k)ᵀ X_k^{-1}(b − x^k).

By μ^k, ν^k ≥ 0 and 0 < x^k ≤ b, we obtain the inequality given in this lemma, which implies that f(x^k) ≤ f(x⁰), k ∈ K. From Assumption 2.2, we obtain ‖x^k‖ ≤ R.

Different from some other potential reduction methods, the objective function is monotonically decreasing along the sequence generated by the First Order Interior Point Algorithm.

Theorem 1 For any ε ∈ (0, 1], the First Order Interior Point Algorithm obtains an ε scaled first order stationary point or ε global minimizer of (1) in no more than O(ε^{-2}) iterations.

Proof Let {x^k} be the sequence generated by the proposed First Order Interior Point Algorithm. Then x^k ∈ (0, b] and ‖x^k‖ ≤ R, k ∈ K. Without loss of generality, we suppose that R ≥ 1. In the following, we consider four cases.

Case 1: ‖X_k d^k‖ ≥ ε/(4βR). From Lemma 2, we obtain

f(x^{k+1}) − f(x^k) ≤ −(β/2)(ε/(4Rβ))² = −ε²/(32R²β).

Case 2: (μ^k)ᵀ e_n ≥ ε²/(4β). From (13), we get

f(x^{k+1}) − f(x^k) ≤ −ε²/(8β).

Case 3: (ν^k)ᵀ X_k^{-1}(b − x^k) ≥ ¼ε². From (13), we get

f(x^{k+1}) − f(x^k) ≤ −¼ε².

Case 4: ‖X_k d^k‖ < ε/(4βR), (μ^k)ᵀ e_n < ε²/(4β) and (ν^k)ᵀ X_k^{-1}(b − x^k) < ¼ε². From the first condition in (12), we obtain

βX_k d^k + ∇f(x^k) − X_k^{-1}μ^k + X_k^{-1}ν^k = 0,   (14)

which implies that

X_k∇f(x^k) = −βX_k² d^k + μ^k − ν^k.   (15)

Then, we obtain

|[X_k∇f(x^k)]_i| ≤ β‖X_k‖‖X_k d^k‖ + ‖μ^k‖ + [ν^k]_i ≤ ½ε + [ν^k]_i.   (16)

If i ∉ I_b, then [ν^k]_i = 0 and we have |[X_k∇f(x^k)]_i| ≤ ½ε ≤ ε. Fix i ∈ I_b. We consider two subcases.

Case 4.1: x_i^k < b_i − (b_i/2)ε. Then (b_i − x_i^k)/x_i^k ≥ (b_i ε/2)/b_i = ε/2, which gives [ν^k]_i ≤ ε/2 from (ν^k)ᵀX_k^{-1}(b − x^k) < ¼ε². Then (16) gives |[X_k∇f(x^k)]_i| ≤ ε.

Case 4.2: x_i^k ≥ b_i − (b_i/2)ε. Then x_i^k ≥ b_i/2. By ‖μ^k‖ ≤ ε²/(4β), we have [X_k^{-1}μ^k]_i ≤ ε²/(2βb_i) ≤ ½ε, using β ≥ b_i^{-1} from Assumption 2.1. Since [X_k^{-1}ν^k]_i ≥ 0, we obtain from (14)

[∇f(x^k)]_i = [−βX_k d^k + X_k^{-1}μ^k − X_k^{-1}ν^k]_i ≤ β‖X_k d^k‖ + [X_k^{-1}μ^k]_i ≤ ¼ε + ½ε ≤ ε.

Therefore, from the analysis in Case 4, x^k is an ε scaled first order stationary point of (1). Based on the analysis in Cases 1-3, at least one of the following two facts holds at the k-th iteration: (i) f(x^{k+1}) − f(x^k) ≤ −ε²/(32R²β); (ii) x^k is an ε scaled first order stationary point of (1). Therefore, we produce an ε global minimizer or an ε scaled first order stationary point of (1) in at most 32f(x⁰)R²βε^{-2} iterations.

Remark 1 When λ = 0 or p = 1 in (1), the KKT conditions of (1) can be written as

[∇f(x)]_i ≥ 0 if x_i = 0,  [∇f(x)]_i = 0 if 0 < x_i < b_i,  [∇f(x)]_i ≤ 0 if x_i = b_i,   (17)

which are sufficient but not necessary for the conditions in (8). Denote ε̄ = min{1, min_{i ∈ I_b} b_i}. Similarly to the analysis in Theorem 1, from (14), for ε ∈ (0, ε̄], if ‖X_k d^k‖ < ε/(4βR), (μ^k)ᵀ e_n < ε²/(4β) and (ν^k)ᵀ X_k^{-1}(b − x^k) < ¼ε², we obtain the following estimates:

[∇f(x^k)]_i ≥ −½ε if 0 ≤ x_i^k ≤ ε,
|[∇f(x^k)]_i| ≤ ε if ε < x_i^k < (1 − ½ε)b_i,
[∇f(x^k)]_i ≤ ε if x_i^k ≥ (1 − ½ε)b_i.   (18)

When ε = 0, the conditions in (18) are consistent with the KKT conditions of (1) in (17). Thus, the First Order Interior Point Algorithm can be extended to (1) with λ = 0 or p = 1 for finding an ε KKT point with worst-case complexity O(ε^{-2}), which recovers the complexity bound of the steepest descent method for constrained nonconvex optimization.

3 Second Order Interior Point Algorithm

In this section, we consider problem (7), a special case of (1) with Ω = {x : x ≥ 0}, under Assumption 2.2 and the following assumption on H.

Assumption 3.1: H is twice continuously differentiable and ∇²H is globally Lipschitz on Ω with Lipschitz constant γ.

Using the cubic overestimation idea [4-6,18,22,23], a second order interior point algorithm is proposed for solving (7), which uses the Hessian of H. We show that the worst-case complexity of the second order interior point algorithm for finding an ε scaled second order stationary point of (7) is O(ε^{-3/2}). Compared with the first order interior point algorithm proposed in Section 2, the worst-case complexity of the second order interior point algorithm is better and the generated point satisfies stronger optimality conditions. However, a quadratic program with a ball constraint has to be solved at each iteration.

3.1 Second Order Necessary Condition for (7)

Based on the scaled second order necessary condition for unconstrained non-Lipschitz optimization in [11,12], we know that if x is a local minimizer of (7), then x ∈ Ω satisfies

X∇²f(x)X = X∇²H(x)X + λp(p − 1)X^p ⪰ 0.   (19)

If x > 0, then f is twice differentiable at x and (19) implies that ∇²f(x) ⪰ 0. Now we give the definition of an ε scaled second order stationary point of (7) as follows.

Definition 2 For a given ε ∈ (0, 1], we call x ∈ Ω an ε scaled second order stationary point of (7) if

‖X∇f(x)‖ ≤ ε  and  X∇²f(x)X ⪰ −√ε I_n.

Definition 2 is consistent with the scaled first and second order necessary conditions given above when ε = 0.
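As a concrete reading of Definition 2, the following hypothetical MATLAB fragment tests the two conditions at an interior point x > 0; gradf and hessf are assumed user-supplied handles returning ∇f(x) and ∇²f(x), which exist on the open orthant.

    % Hypothetical check of Definition 2 at a point x > 0.
    X = diag(x);
    cond1 = norm(X * gradf(x)) <= epsilon;                 % ||X grad f(x)|| <= epsilon
    cond2 = min(eig(X * hessf(x) * X)) >= -sqrt(epsilon);  % X hess f(x) X + sqrt(epsilon) I psd
    is_eps_scaled_2nd_order = cond1 && cond2;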

3.2 Second Order Interior Point Algorithm

From the cubic overestimation idea, for any x, x⁺ ∈ Ω, Assumption 3.1 implies that

H(x⁺) ≤ H(x) + ⟨∇H(x), x⁺ − x⟩ + ½⟨∇²H(x)(x⁺ − x), x⁺ − x⟩ + (γ/6)‖x⁺ − x‖³.

Similarly, for any t, s > 0,

t^p ≤ s^p + p s^{p−1}(t − s) + (p(p−1)/2) s^{p−2}(t − s)² + (p(p−1)(p−2)/6) s^{p−3}(t − s)³.

Thus, for any x, x⁺ > 0, we obtain

f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + ½⟨∇²f(x)(x⁺ − x), x⁺ − x⟩ + (γ/6)‖x⁺ − x‖³ + λ(p(p−1)(p−2)/6) Σ_{i=1}^n x_i^{p−3}(x_i⁺ − x_i)³,

which, with x⁺ = x + X d_x, can also be expressed as

f(x⁺) ≤ f(x) + ⟨X∇f(x), d_x⟩ + ½⟨X∇²f(x)X d_x, d_x⟩ + (γ/6)‖X d_x‖³ + λ(p(p−1)(p−2)/6) Σ_{i=1}^n x_i^p [d_x]_i³.   (20)

Since p(1 − p) ≤ ¼, we have p(1 − p)(2 − p) ≤ ½. Combining this estimate with Σ_{i=1}^n |[d_x]_i|³ ≤ ‖d_x‖³, if ‖x‖ ≤ R, then (20) implies

f(x⁺) ≤ f(x) + ⟨X∇f(x), d_x⟩ + ½⟨X∇²f(x)X d_x, d_x⟩ + (1/6)(γR³ + ½λR^p)‖d_x‖³.   (21)

At x > 0, we minimize a quadratic function subject to a ball constraint to achieve a sufficient reduction. For a given ε ∈ (0, 1], we solve the following problem:

min q(d_x) = ⟨X∇f(x), d_x⟩ + ½⟨X∇²f(x)X d_x, d_x⟩
s.t. ‖d_x‖ ≤ ϑ√ε,   (22)

where ϑ = ½ min{1/(γR³ + ½λR^p), 1} and R is the constant in Assumption 2.2. To solve (22), we consider two cases.

Case 1: X∇²f(x)X is positive semi-definite.

From the KKT conditions of (22), the solution d_x of (22) satisfies the following necessary and sufficient conditions:

X∇²f(x)X d_x + X∇f(x) + ρ_x d_x = 0,   (23a)
ρ_x ≥ 0, ‖d_x‖ ≤ ϑ√ε, ρ_x(‖d_x‖ − ϑ√ε) = 0.   (23b)

In this case, (22) is a convex quadratic program with a ball constraint, which can be solved effectively in polynomial time; see [27] and the references therein.

Case 2: X∇²f(x)X has at least one negative eigenvalue.
From [26], the solution d_x of (22) satisfies the following necessary and sufficient conditions:

X∇²f(x)X d_x + X∇f(x) + ρ_x d_x = 0,   (24a)
‖d_x‖ = ϑ√ε, X∇²f(x)X + ρ_x I_n is positive semi-definite.   (24b)

In this case, (22) is a nonconvex quadratic program with a ball constraint. However, by the results in [29,30], (24) can be solved effectively in polynomial time with worst-case complexity O(log(log(ε^{-1}))). The double log is established via a globally convergent Newton method: in the first phase the method selects a point from at most log log(ε^{-1}) candidate points using the bisection method, and it is then proved that, starting from the selected point, the quadratic convergence of the Newton method is guaranteed.

From (23) and (24), if d_x solves (22), then

q(d_x) = ⟨X∇f(x), d_x⟩ + ½⟨X∇²f(x)X d_x, d_x⟩
 = ⟨−X∇²f(x)X d_x − ρ_x d_x, d_x⟩ + ½⟨X∇²f(x)X d_x, d_x⟩   (25)
 = −ρ_x‖d_x‖² − ½ d_xᵀ X∇²f(x)X d_x.

Since X∇²f(x)X + ρ_x I_n is always positive semi-definite, d_xᵀX∇²f(x)Xd_x ≥ −ρ_x‖d_x‖², and thus q(d_x) ≤ −½ρ_x‖d_x‖². Therefore, from (21), we have

f(x⁺) ≤ f(x) − ½ρ_x‖d_x‖² + (1/6)(γR³ + ½λR^p)‖d_x‖³.   (26)

Second Order Interior Point Algorithm
Choose x⁰ ∈ int(Ω) and ε ∈ (0, 1]. For k ≥ 0,

solve (22) with x = x^k for d^k,   (27a)
x^{k+1} = x^k + X_k d^k.   (27b)
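Subproblem (22) is a trust-region (ball-constrained) quadratic program, which the analysis above solves by the polynomial-time schemes of [26,27,29,30]. Purely as an illustration, the following hypothetical dense-matrix MATLAB sketch computes a step from conditions (23)-(24) via the eigendecomposition of the scaled Hessian and a bisection on the multiplier ρ; it ignores the so-called hard case and is not the algorithm referenced in the text.

    % Illustrative solver for min <g,d> + 0.5*d'*B*d s.t. ||d|| <= Delta,
    % with g = X*grad f(x), B = X*hess f(x)*X, Delta = vartheta*sqrt(epsilon).
    function d = ball_qp(g, B, Delta)
      [V, L] = eig((B + B')/2);  lam = diag(L);
      dfun = @(rho) -(V' * g) ./ (lam + rho);    % step in the eigenbasis
      nfun = @(rho) norm(dfun(rho));
      rho_lo = max(0, -min(lam)) + 1e-12;
      if min(lam) >= 0 && nfun(rho_lo) <= Delta  % interior solution of a convex subproblem
        d = V * dfun(rho_lo);  return;
      end
      rho_hi = rho_lo + norm(g)/Delta + 1;       % ||d(rho)|| <= Delta once rho is large enough
      while nfun(rho_hi) > Delta, rho_hi = 2*rho_hi; end
      for it = 1:60                              % bisection on ||d(rho)|| = Delta
        rho = (rho_lo + rho_hi)/2;
        if nfun(rho) > Delta, rho_lo = rho; else rho_hi = rho; end
      end
      d = V * dfun(rho_hi);                      % feasible step; hard case not handled
    end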

The Second Order Interior Point Algorithm is related to the classical interior point methods [28]. The computational work of each iteration is identical to that of the algorithm in [29]. However, in [29] the objective is a quadratic overestimation of the Karmarkar-type potential function, while in the current paper the objective is just the quadratic Taylor expansion of the original objective function. The main differences are: 1) the current paper deals with a more general objective function, while [29] deals only with quadratic functions, which is why different convergence rates are established; for quadratic functions there are no third-order errors, so the analysis is considerably simpler and a better convergence rate was achieved in [29] (in fact, it is a surprise to us that we are able to establish the current convergence rate for such general objectives); 2) [29] needs an a priori known lower bound for the objective function in order to construct a valid potential function, while the current method does not need such information.

From the definition of ϑ in (22), ‖d^k‖ ≤ ½; similarly to the analysis in Lemma 1, the Second Order Interior Point Algorithm is also well defined. Let {x^k} be the sequence generated by it; then x^k > 0. In what follows, we prove that there is κ > 0 such that either f(x^{k+1}) − f(x^k) ≤ −κε^{3/2} or x^{k+1} is an ε scaled second order stationary point of (7).

Lemma 3 If ρ_k > (2/(9ϑ))‖d^k‖ holds for all k ∈ K, then the Second Order Interior Point Algorithm produces an ε global minimizer of (7) in at most O(ε^{-3/2}) iterations. Moreover, ‖x^k‖ ≤ R, k ∈ K.

Proof If ρ_k > (2/(9ϑ))‖d^k‖, then ρ_k > (4/9)(γR³ + ½λR^p)‖d^k‖, so that

(1/6)(γR³ + ½λR^p)‖d^k‖ < (3/8)ρ_k,   (28)

and from (23b) and (24b), we obtain ‖d^k‖ = ϑ√ε. From (26) and (28), we have

f(x^{k+1}) − f(x^k) ≤ −½ρ_k‖d^k‖² + (1/6)(γR³ + ½λR^p)‖d^k‖³ < −½ρ_k‖d^k‖² + (3/8)ρ_k‖d^k‖² = −(1/8)ρ_k‖d^k‖²,

which means f(x^{k+1}) < f(x^k). If this always holds, then {f(x^k)} is strictly monotonically decreasing and, by Assumption 2.2, ‖x^k‖ ≤ R, k ∈ K. Moreover,

(1/8)ρ_k‖d^k‖² > (1/8)(2/(9ϑ))‖d^k‖³ = (1/36)ϑ²ε^{3/2},

which implies that we produce an ε global minimizer of (7) in at most 36f(x⁰)ϑ^{-2}ε^{-3/2} iterations.

In what follows, we prove that x^{k+1} is an ε scaled second order stationary point of (7) when ρ_k ≤ (2/(9ϑ))‖d^k‖ for some k.

Lemma 4 If there is k ∈ K such that ρ_k ≤ (2/(9ϑ))‖d^k‖, then x^{k+1} satisfies ‖X_{k+1}∇f(x^{k+1})‖ ≤ ε.

Proof From (23a) and (24a), the following relation always holds:

−ρ_k d^k = X_k∇²f(x^k)X_k d^k + X_k∇f(x^k)
 = X_k∇²H(x^k)X_k d^k + λp(p−1)X_k^{p−2}X_k·X_k d^k·X_k^{-1}·X_k + X_k∇H(x^k) + λpX_k(x^k)^{p−1}
 = X_k(∇²H(x^k)X_k d^k + λp(p−1)X_k^{p−2}X_k d^k + ∇H(x^k) + λp(x^k)^{p−1}),

which implies

∇²H(x^k)X_k d^k + λp(p−1)X_k^{p−2}X_k d^k + ∇H(x^k) + λp(x^k)^{p−1} + ρ_k X_k^{-1}d^k = 0.

Thus, there is τ ∈ [0, 1] such that

∇H(x^{k+1}) + λp(x^{k+1})^{p−1}
 = ∇H(x^{k+1}) + λp(x^{k+1})^{p−1} − ∇²H(x^k)X_k d^k − λp(p−1)X_k^{p−2}X_k d^k − ∇H(x^k) − λp(x^k)^{p−1} − ρ_k X_k^{-1}d^k
 = [∇²H(τx^k + (1−τ)x^{k+1}) − ∇²H(x^k)]X_k d^k + λp(x^{k+1})^{p−1} − λp(p−1)X_k^{p−2}X_k d^k − λp(x^k)^{p−1} − ρ_k X_k^{-1}d^k.

Therefore, we have

‖X_{k+1}(∇H(x^{k+1}) + λp(x^{k+1})^{p−1})‖
 ≤ ‖X_{k+1}[∇²H(τx^k + (1−τ)x^{k+1}) − ∇²H(x^k)]X_k d^k‖
 + ‖X_{k+1}(λp(x^{k+1})^{p−1} − λp(p−1)X_k^{p−2}X_k d^k − λp(x^k)^{p−1})‖ + ρ_k‖X_{k+1}X_k^{-1}d^k‖.   (29)

From Assumption 3.1, we estimate the first term on the right-hand side of (29) as follows:

‖X_{k+1}[∇²H(τx^k + (1−τ)x^{k+1}) − ∇²H(x^k)]X_k d^k‖ ≤ γ(1−τ)‖X_{k+1}‖‖X_k d^k‖² ≤ γ‖X_{k+1}‖‖X_k d^k‖² ≤ γR³‖d^k‖².   (30)

From X_k D_k = X_{k+1} − X_k with D_k = diag(d^k), we have

X_k^{-1}X_{k+1} = D_k + I_n,   (31)

which implies X_k^{-1}x^{k+1} = d^k + e_n.

Then, we consider the second term in (29):

X_{k+1}(λp(x^{k+1})^{p−1} − λp(p−1)X_k^{p−2}X_k d^k − λp(x^k)^{p−1})
 = λpX_k^p(X_k^{-p}(x^{k+1})^p + (1−p)X_k^{-1}X_{k+1}d^k − X_k^{-1}x^{k+1})
 = λpX_k^p((d^k + e_n)^p + (1−p)(D_k + I_n)d^k − (d^k + e_n))
 = λpX_k^p((d^k + e_n)^p + (1−p)(d^k)² − pd^k − e_n),   (32)

where powers of vectors are taken componentwise. Using the Taylor expansion 1 + pt − (p(1−p)/2)t² ≤ (1 + t)^p ≤ 1 + pt, we obtain

e_n + pd^k − (p(1−p)/2)(d^k)² ≤ (d^k + e_n)^p ≤ e_n + pd^k.   (33)

Adding (1−p)(d^k)² − pd^k − e_n to (33), we have

0 ≤ (d^k + e_n)^p + (1−p)(d^k)² − pd^k − e_n ≤ (1−p)(d^k)².

Thus,

‖(d^k + e_n)^p + (1−p)(d^k)² − pd^k − e_n‖ ≤ (1−p)‖d^k‖².   (34)

Then, from (32) and (34), we get

‖X_{k+1}(λp(x^{k+1})^{p−1} − λp(p−1)X_k^{p−2}X_k d^k − λp(x^k)^{p−1})‖ ≤ λp(1−p)‖X_k^p‖‖d^k‖² ≤ ½λR^p‖d^k‖².   (35)

From ‖d^k‖ ≤ ½ and (31), we have

ρ_k‖X_{k+1}X_k^{-1}d^k‖ ≤ ρ_k‖d^k‖(1 + ‖d^k‖) ≤ (3/2)ρ_k‖d^k‖.   (36)

Therefore, from (29), (30), (35) and (36), we obtain

‖X_{k+1}(∇H(x^{k+1}) + λp(x^{k+1})^{p−1})‖ ≤ (γR³ + ½λR^p)‖d^k‖² + (3/2)ρ_k‖d^k‖ ≤ (ϑ/2)ε + (ϑ/3)ε ≤ ε.

Lemma 5 Under the assumptions in Lemma 4, x^{k+1} satisfies

X_{k+1}∇²f(x^{k+1})X_{k+1} ⪰ −√ε I_n.

Proof From (23b) and (24b), we know that X_k∇²f(x^k)X_k + ρ_k I_n is positive semi-definite. Then,

∇²H(x^k) + λp(p−1)X_k^{p−2} ⪰ −ρ_k X_k^{-2}.   (37)

From Assumption 3.1, we obtain

‖∇²H(x^{k+1}) − ∇²H(x^k)‖ ≤ γ‖x^k − x^{k+1}‖ = γ‖X_k d^k‖ ≤ γR‖d^k‖.   (38)

Since ∇²H(x^{k+1}) and ∇²H(x^k) are symmetric, (38) gives

∇²H(x^{k+1}) ⪰ ∇²H(x^k) − γR‖d^k‖I_n.   (39)

Adding (37) and (39), we get

∇²H(x^{k+1}) ⪰ −λp(p−1)X_k^{p−2} − ρ_k X_k^{-2} − γR‖d^k‖I_n.   (40)

Multiplying (40) on both sides by X_{k+1} and adding λp(p−1)X_{k+1}^p to both sides, we obtain

X_{k+1}∇²H(x^{k+1})X_{k+1} + λp(p−1)X_{k+1}^p ⪰ λp(p−1)X_{k+1}^p − λp(p−1)X_{k+1}X_k^{p−2}X_{k+1} − ρ_k X_{k+1}X_k^{-2}X_{k+1} − γR‖d^k‖X_{k+1}².   (41)

Using (31) again, we get

ρ_k X_{k+1}X_k^{-2}X_{k+1} = ρ_k(D_k + I_n)² ⪯ (9/4)ρ_k I_n.   (42)

On the other hand, using (31), we have

X_{k+1}^p − X_{k+1}X_k^{p−2}X_{k+1} = X_{k+1}^p − X_{k+1}²X_k^{p−2} = X_{k+1}^p(I_n − X_{k+1}^{2−p}X_k^{p−2}) = X_{k+1}^p(I_n − (I_n + D_k)^{2−p}).   (43)

From the Taylor expansion (1 + t)^{2−p} ≥ 1 + (2−p)t − ((2−p)(1−p)/2)t², we get

(I_n + D_k)^{2−p} ⪰ I_n + (2−p)D_k − ((2−p)(1−p)/2)D_k².   (44)

Applying (44), D_k ⪰ −½I_n and 0 < p < 1 to (43), we derive

λp(1−p)(X_{k+1}^p − X_{k+1}X_k^{p−2}X_{k+1}) = λp(1−p)X_{k+1}^p(I_n − (I_n + D_k)^{2−p})
 ⪯ λp(1−p)X_{k+1}^p(−(2−p)D_k + ((2−p)(1−p)/2)D_k²)
 ⪯ λp(1−p)‖x^{k+1}‖^p((2−p)‖d^k‖ + ((2−p)(1−p)/2)‖d^k‖²)I_n
 ⪯ λp(1−p)(2−p)R^p(1 + (1−p)/2)‖d^k‖I_n
 = (p(1−p)(2−p)(3−p)/2)λR^p‖d^k‖I_n ⪯ ½λR^p‖d^k‖I_n.   (45)

From (41), (42), (43), (45), and ρ_k ≤ (2/(9ϑ))‖d^k‖, we obtain

X_{k+1}∇²f(x^{k+1})X_{k+1} = X_{k+1}∇²H(x^{k+1})X_{k+1} + λp(p−1)X_{k+1}^p
 ⪰ −((9/4)ρ_k + γR³‖d^k‖ + ½λR^p‖d^k‖)I_n
 ⪰ −(1/(2ϑ) + γR³ + ½λR^p)‖d^k‖I_n
 ⪰ −(1/ϑ)‖d^k‖I_n ⪰ −√ε I_n.

According to Lemmas 3-5, we obtain the worst-case complexity of the Second Order Interior Point Algorithm for finding an ε scaled second order stationary point of (7).

Theorem 2 For any ε ∈ (0, 1], the proposed Second Order Interior Point Algorithm obtains an ε scaled second order stationary point or ε global minimizer of (7) in no more than O(ε^{-3/2}) iterations.

Remark 2 When H(x) = ½‖Ax − q‖² in (7), then γ = 0 and ϑ in (22) becomes ϑ = 1/max{2, λR^p}. The proposed Second Order Interior Point Algorithm obtains an ε scaled second order stationary point or ε global minimizer of (7) in no more than 36f(x⁰)max{4, λ²R^{2p}}ε^{-3/2} iterations.

Remark 3 Though the worst-case complexity result in Theorem 2 can be extended to (7) with λ = 0 or p = 1, the complexity bound is for finding a scaled second order stationary point of (7). When λ = 0 or p = 1, the general second order optimality condition of (7) is defined as x ≥ 0, ∇f(x) ≥ 0, xᵀ∇f(x) = 0 and ∇²H(x) ⪰ 0; accordingly, we say that x satisfies the ε second order optimality condition of (7) if

x ≥ 0, ∇f(x) ≥ −εe_n, ‖X∇f(x)‖ ≤ ε and ∇²H(x) ⪰ −√ε I_n.   (46)

Due to the second inequality in (46) and the scaling in the second inequality of Definition 2, the conditions given in Definition 2 are necessary but not sufficient for the ε second order optimality conditions in (46). Moreover, the second order interior point algorithm and the complexity bound given in this section can be extended to

min_{x ≥ 0} H(x) + λ Σ_{i=1}^n φ(x_i^p),

where φ is twice continuously differentiable in (0, +∞) and φ'(t) > 0, for instance, φ₂ and φ₃ in Section 5.

We end this section by showing that our interior point algorithms can be applied to solve the unconstrained problem (2). In particular, we show that the scaled first and second order stationary points of (2) and (3) are in one-to-one correspondence. Denote

F^n = {x : x is a scaled first order stationary point of (2)},
F^n_+ = {(x⁺, x⁻) : (x⁺, x⁻) is a scaled first order stationary point of (3)},
S^n = {x : x is a scaled second order stationary point of (2)},
S^n_+ = {(x⁺, x⁻) : (x⁺, x⁻) is a scaled second order stationary point of (3)}.

Theorem 3 (i) Suppose φ'(s) > 0 for s > 0. Then
(x⁺, x⁻) ∈ F^n_+ ⟹ x ∈ F^n with x = x⁺ − x⁻;
x ∈ F^n ⟹ (x⁺, x⁻) ∈ F^n_+ with x⁺ = max(0, x) and x⁻ = max(0, −x).   (47)
(ii) Suppose φ(s) = s and H is twice continuously differentiable. Then
(x⁺, x⁻) ∈ S^n_+ ⟹ x ∈ S^n with x = x⁺ − x⁻;
x ∈ S^n ⟹ (x⁺, x⁻) ∈ S^n_+ with (47).

Proof (i) Suppose (x⁺, x⁻) ∈ F^n_+; then x⁺, x⁻ ≥ 0 and

X⁺∇H(x⁺ − x⁻) + λp[φ'(s)|_{s=(x_i⁺)^p}(x_i⁺)^p]_{i=1}^n = 0,
−X⁻∇H(x⁺ − x⁻) + λp[φ'(s)|_{s=(x_i⁻)^p}(x_i⁻)^p]_{i=1}^n = 0,   (48)

where X⁺ = diag(x⁺) and X⁻ = diag(x⁻). Thus,

X⁻X⁺∇H(x⁺ − x⁻) + λp[φ'(s)|_{s=(x_i⁺)^p}(x_i⁺)^p x_i⁻]_{i=1}^n = 0,
−X⁺X⁻∇H(x⁺ − x⁻) + λp[φ'(s)|_{s=(x_i⁻)^p}(x_i⁻)^p x_i⁺]_{i=1}^n = 0,

which, by adding, gives

φ'(s)|_{s=(x_i⁺)^p} x_i⁻(x_i⁺)^p + φ'(s)|_{s=(x_i⁻)^p} x_i⁺(x_i⁻)^p = 0, i ∈ I.

By φ'(s) > 0 for s > 0 and the above equation, we obtain x_i⁺ x_i⁻ = 0, i ∈ I. Thus, adding the two equations in (48) gives

(X⁺ − X⁻)∇H(x⁺ − x⁻) + λp[φ'(s)|_{s=|x_i⁺−x_i⁻|^p}|x_i⁺ − x_i⁻|^p]_{i=1}^n = 0,

which means that x = x⁺ − x⁻ ∈ F^n. On the other hand, suppose x ∈ F^n; then (x⁺)ᵀx⁻ = 0, where x⁺ and x⁻ are of the form in (47). Thus, (48) holds, which implies (x⁺, x⁻) ∈ F^n_+.

(ii) Suppose (x⁺, x⁻) ∈ S^n_+ and denote x = x⁺ − x⁻. Then

[X⁺ 0; 0 X⁻][∇²H(x) −∇²H(x); −∇²H(x) ∇²H(x)][X⁺ 0; 0 X⁻] + λp(p−1)[(X⁺)^p 0; 0 (X⁻)^p] ⪰ 0,   (49)

which, pre- and post-multiplying by (I_n, I_n) and (I_n, I_n)ᵀ, gives

(X⁺ − X⁻)∇²H(x)(X⁺ − X⁻) + λp(p−1)(X⁺)^p + λp(p−1)(X⁻)^p ⪰ 0.   (50)

From (X⁺)^p + (X⁻)^p ⪰ |X⁺ − X⁻|^p, λp(p−1) < 0 and (50), we obtain

(X⁺ − X⁻)∇²H(x)(X⁺ − X⁻) + λp(p−1)|X⁺ − X⁻|^p ⪰ 0,   (51)

which implies that x ∈ S^n with x = x⁺ − x⁻. On the other hand, suppose x ∈ S^n and (x⁺, x⁻) is of the form (47). Then (51) holds, which gives

[D_n⁺; D_n⁻](X⁺ − X⁻)∇²H(x)(X⁺ − X⁻)[D_n⁺, D_n⁻] + λp(p−1)[D_n⁺; D_n⁻]|X⁺ − X⁻|^p[D_n⁺, D_n⁻] ⪰ 0,   (52)

where D_n⁺ = diag(sign(x⁺)) and D_n⁻ = diag(sign(x⁻)). By the definitions in (47), we obtain

D_n⁺(X⁺ − X⁻) = X⁺, D_n⁻(X⁺ − X⁻) = −X⁻,
D_n⁺|X⁺ − X⁻|^p D_n⁻ = D_n⁻|X⁺ − X⁻|^p D_n⁺ = 0_{n×n},
D_n⁺|X⁺ − X⁻|^p D_n⁺ = (X⁺)^p, D_n⁻|X⁺ − X⁻|^p D_n⁻ = (X⁻)^p.   (53)

From (52) and (53), we obtain (49), which implies that (x⁺, x⁻) ∈ S^n_+.

4 Numerical Experiments

In this section, we present three examples to show the good performance and the worst-case complexity of the First Order Interior Point Algorithm (FOIPA) proposed in this paper for solving (1). The numerical testing is carried out on a Lenovo PC (3.00GHz, 2.00GB of RAM) using Matlab 7.4. In the following three examples, "Iteration number" denotes the number of iterations needed to obtain an ε scaled first order stationary point.

Table 2: Example 1: variable selection by the FOIPA, Lasso and Best subset methods, reporting for each method the values of (λ, p), the estimated coefficients x₁ (lcavol), x₂ (lweight), x₃ (age), x₄ (lbph), x₅ (svi), x₆ (lcp), x₇ (gleason), x₈ (pgg45), the number of nonzero coefficients, the prediction error, and the iteration number.

Example 1 (Prostate Cancer) The prostate cancer data set is downloaded from the web site edu/tibs/elemstatlearn/data.html. It consists of the medical records of 97 men who were about to receive a radical prostatectomy, divided into a training set with 67 observations and a test set with 30 observations. For more details of the data set, see [8,11,12,19]. We use the following constrained l₂-l_p model to solve this problem:

min_{x ≥ 0} ‖Ax − q‖² + λ Σ_{i=1}^8 x_i^p,

where A ∈ R^{67×8} and q ∈ R^{67} are built from the training set. Let x⁰ = 0.1e₈. The numerical results of the FOIPA with β = ‖AᵀA‖ are given in Table 2, in which two well-known methods (Lasso, Best subset) from Table 3.3 in [19] are also listed. Table 2 indicates that the FOIPA with 0 < p < 1 can find fewer main factors with smaller prediction error than the other two methods.

Example 2 (Nonnegative Compressed Sensing) In this example, we test the FOIPA on a compressive sensing problem, where the goal is to reconstruct a length-n nonnegative sparse signal from m observations, with m < n. The purpose of this example is to show the worst-case complexity result and the good performance of the FOIPA. We use the following Matlab code to generate the original signal x_s ∈ R^n, the sensing matrix A ∈ R^{m×n} and the observation signal q ∈ R^m for given positive integers n and T:

    m = n/4; x_s = zeros(n,1); P = randperm(n);
    A = randn(m,n);
    x_s(P(1:T)) = abs(randn(T,1));
    A = orth(A')';
    q = A*x_s;
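To connect the generated data with the first order method, the following hypothetical fragment applies the foipa sketch from Section 2 to the l₂-l_p model with φ = φ₁ and b_i = +∞; the value of λ and the iteration limit are illustrative assumptions, not taken from the paper.

    % Hypothetical driver for min_{x>=0} ||Ax-q||^2 + lambda*sum(x.^p), phi = phi_1.
    n = 4096; T = 40; p = 0.5; epsilon = 1e-4;
    lambda = 1e-3;                                     % illustrative value only
    % ... generate x_s, A and q with the code above ...
    gradf = @(x) 2*A'*(A*x - q) + lambda*p*x.^(p-1);   % gradient of f for x > 0
    beta  = 2;                                         % 2*||A'A||, and ||A'A|| = 1 after orth(A')'
    x     = foipa(0.1*ones(n,1), Inf(n,1), gradf, beta, epsilon, 1e6);
    mse   = norm(x - x_s)^2 / n;                       % mean squared error relative to x_s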

Fig. 1: Example 2: FOIPA for finding an ε scaled first order stationary point of (54): (a) iteration number, (b) CPU time (seconds), versus ε, for φ₁ with p = 0.5, φ₂ with p = 0.3, λ = 0.08, α = 0.2, and φ₃ with p = 0.1, λ = 0.11, α = 0.2, where φ₁, φ₂ and φ₃ are given in Section 5.

In this example, we consider the following constrained optimization problem:

min_{x ≥ 0} ‖Ax − q‖² + Σ_{i=1}^n φ(x_i^p).   (54)

Set n = 4096 and T = 40 in the Matlab code to generate A, q and x_s. Choose x⁰ = 0.1e_n. For different choices of φ and p, the iteration number and CPU time for obtaining an ε scaled first order stationary point of (54) are illustrated in Fig. 1. From this figure, we can see that the total number of iterations for obtaining an ε scaled first order stationary point is much smaller than the worst-case estimate 32f(x⁰)R²βε^{-2} given in the proof of Theorem 1. For the case φ := φ₁ with p = 0.5, the original signal x_s and the ε scaled first order stationary point x_ε with ε = 10⁻⁴ are pictured in Fig. 2(a). Meanwhile, the convergence of x^k, f(x^k) and MSE(x^k) is also illustrated in Fig. 2(b)-(d), where MSE(x^k) is the mean squared error of x^k with respect to the original signal x_s, defined by MSE(x) = ‖x − x_s‖²/n.

Example 3 (Unconstrained Compressed Sensing) In order to support the theoretical analysis in Theorem 3, we test the FOIPA on a typical compressive sensing problem, where the goal is to reconstruct a length-n sparse signal (possibly containing positive and negative entries) from m observations, with m < n. The original signal x_s is generated randomly by the code x_s(P(1:T)) = randn(T,1); A and q are generated as in Example 2 with n = 1024 and T = 10. To solve this problem, we reformulate it as the following constrained l₂-l_p optimization problem:

min_{x⁺ ≥ 0, x⁻ ≥ 0} ‖A(x⁺ − x⁻) − q‖² + λ‖x⁺‖_p^p + λ‖x⁻‖_p^p,   (55)

which can be solved by the algorithms proposed in this paper. With initial point x^{+,0} = x^{−,0} = 0.1e_n and β = 4‖AᵀA‖ = 4, the FOIPA finds an ε (= 10⁻³) scaled first order stationary point in 131 iterations.

Fig. 2: Example 2: (a) original signal x_s and ε scaled first order stationary point x_ε with ε = 10⁻⁴, (b) convergence of the iterates x^k, (c) convergence of the function values f(x^k), (d) convergence of the mean squared errors MSE(x^k).

Fig. 3: Example 3: original and reconstructed signal by the FOIPA with ε = 10⁻³ and x = x⁺ − x⁻: (a) original signal, (b) reconstructed signal.

The original signal and the ε scaled first order stationary point are shown in Fig. 3. The convergence of (x^{+,k}, x^{−,k}) and (x^{+,k})ᵀx^{−,k} is shown in Fig. 4. From Fig. 3 and Fig. 4, we can see that x^k = x^{+,k} − x^{−,k} converges to the original signal.
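As an illustration of how the splitting of Theorem 3 is used in practice, the following hypothetical fragment sets up (55) in the stacked nonnegative variable z = (x⁺; x⁻) and reuses the foipa sketch from Section 2; the values of λ and p are illustrative assumptions.

    % Hypothetical setup for (55): z = [x_plus; x_minus] >= 0 and E = [A, -A],
    % so that A*(x_plus - x_minus) = E*z.
    n = 1024; epsilon = 1e-3;
    lambda = 1e-3; p = 0.5;                            % illustrative values only
    E     = [A, -A];
    gradf = @(z) 2*E'*(E*z - q) + lambda*p*z.^(p-1);   % gradient of (55) for z > 0
    beta  = 4*norm(A'*A);                              % the constant used in the text
    z     = foipa(0.1*ones(2*n,1), Inf(2*n,1), gradf, beta, epsilon, 1e6);
    x     = z(1:n) - z(n+1:end);                       % recover x = x_plus - x_minus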

Fig. 4: Example 3: (a) convergence of (x^{+,k}, x^{−,k}), (b) convergence of (x^{+,k})ᵀx^{−,k}.

5 Final Remarks

This paper proposes two interior point methods for solving constrained non-Lipschitz, nonconvex optimization problems arising in many important applications. The first order interior point method is easy to implement and its worst-case complexity is O(ε^{-2}), which is of the same order as the worst-case complexity of steepest-descent methods applied to unconstrained, nonconvex smooth optimization, and of the trust region and SQR methods applied to unconstrained, nonconvex nonsmooth optimization [1,3]. The second order interior point method has a better complexity order O(ε^{-3/2}) for finding an ε scaled second order stationary point. It is not answered in this paper whether the complexity bounds are sharp for the first and second order methods, which gives us an interesting topic for further research.

The assumptions in this paper are standard and applicable to many regularization models in practice. For example, H(x) = ‖Ax − q‖² and φ is one of the following six penalty functions:
i) soft thresholding penalty function [20,25]: φ₁(s) = s;
ii) logistic penalty function [24]: φ₂(s) = log(1 + αs);
iii) fraction penalty function [13,24]: φ₃(s) = αs/(1 + αs);
iv) hard thresholding penalty function [14]: φ₄(s) = λ − (λ − s)₊²/λ;
v) smoothly clipped absolute deviation penalty function [14]: φ₅(s) = ∫₀ˢ min{1, (α − t/λ)₊/(α − 1)} dt;
vi) minimax concave penalty function [31]: φ₆(s) = ∫₀ˢ (1 − t/(αλ))₊ dt.
Here α and λ are two positive parameters; in particular, α > 2 in φ₅ and α > 1 in φ₆. These six penalty functions are concave on [0, ∞) and continuously differentiable on (0, ∞), and are often used in statistics and sparse reconstruction.
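For reference, the first three penalties above (the ones used in Fig. 1) can be restated as hypothetical MATLAB handles, together with the derivatives needed in the FOIPA gradient; alpha > 0 is the parameter from the text.

    % Hypothetical handles for phi_1, phi_2, phi_3 and two of their derivatives.
    phi1  = @(s) s;                                % soft thresholding
    phi2  = @(s, alpha) log(1 + alpha*s);          % logistic
    phi3  = @(s, alpha) alpha*s ./ (1 + alpha*s);  % fraction
    dphi2 = @(s, alpha) alpha ./ (1 + alpha*s);
    dphi3 = @(s, alpha) alpha ./ (1 + alpha*s).^2;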

Acknowledgements We would like to thank Prof. Sven Leyffer, the anonymous Associate Editor and the referees for their insightful and constructive comments, which helped us to enrich the content and improve the presentation of the results in this paper.

References
1. W. Bian and X. Chen, Worst-case complexity of smoothing quadratic regularization methods for non-Lipschitzian optimization, SIAM J. Optim., 23 (2013).
2. A.M. Bruckstein, D.L. Donoho and M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Review, 51 (2009).
3. C. Cartis, N.I.M. Gould and Ph.L. Toint, On the evaluation complexity of composite function minimization with applications to nonconvex nonlinear programming, SIAM J. Optim., 21 (2011).
4. C. Cartis, N.I.M. Gould and Ph.L. Toint, Adaptive cubic regularisation methods for unconstrained optimization, Part I: motivation, convergence and numerical results, Math. Program., 127 (2011).
5. C. Cartis, N.I.M. Gould and Ph.L. Toint, Adaptive cubic regularisation methods for unconstrained optimization, Part II: worst-case function- and derivative-evaluation complexity, Math. Program., 130 (2011).
6. C. Cartis, N.I.M. Gould and Ph.L. Toint, An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity, IMA J. Numer. Anal., doi:10.1093/imanum/drr035 (2012).
7. C. Cartis, N.I.M. Gould and Ph.L. Toint, Complexity bounds for second-order optimality in unconstrained optimization, J. Complexity, 28 (2012).
8. X. Chen, Smoothing methods for nonsmooth, nonconvex minimization, Math. Program., 134 (2012).
9. X. Chen, D. Ge, Z. Wang and Y. Ye, Complexity of unconstrained L₂-L_p minimization, Math. Program. (2012).
10. X. Chen, M. Ng and C. Zhang, Nonconvex l_p regularization and box constrained model for image restoration, IEEE Trans. Image Processing, 21 (2012).
11. X. Chen, L. Niu and Y. Yuan, Optimality conditions and smoothing trust region Newton method for non-Lipschitz optimization, Preprint.
12. X. Chen, F. Xu and Y. Ye, Lower bound theory of nonzero entries in solutions of l₂-l_p minimization, SIAM J. Sci. Comput., 32 (2010).
13. X. Chen and W. Zhou, Smoothing nonlinear conjugate gradient method for image restoration using nonsmooth nonconvex minimization, SIAM J. Imaging Sci., 3 (2010).
14. J. Fan, Comments on "Wavelets in statistics: a review" by A. Antoniadis, Stat. Method. Appl., 6 (1997).
15. R. Garmanjani and L.N. Vicente, Smoothing and worst case complexity for direct-search methods in nonsmooth optimization, IMA J. Numer. Anal., doi:10.1093/imanum/drs027 (2012).
16. D. Ge, X. Jiang and Y. Ye, A note on the complexity of L_p minimization, Math. Program., 129 (2011).
17. S. Gratton, A. Sartenaer and Ph.L. Toint, Recursive trust-region methods for multiscale nonlinear optimization, SIAM J. Optim., 19 (2008).
18. A. Griewank, The modification of Newton's method for unconstrained optimization by bounding cubic terms, Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, United Kingdom, 1981.

19. T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York (2009).
20. J. Huang, J.L. Horowitz and S. Ma, Asymptotic properties of bridge estimators in sparse high-dimensional regression models, Ann. Statist., 36 (2008).
21. Yu. Nesterov, Introductory Lectures on Convex Optimization, Applied Optimization, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.
22. Yu. Nesterov and B.T. Polyak, Cubic regularization of Newton's method and its global performance, Math. Program., 108 (2006).
23. Yu. Nesterov, Accelerating the cubic regularization of Newton's method on convex problems, Math. Program., 112 (2008).
24. M. Nikolova, M.K. Ng, S. Zhang and W.-K. Ching, Efficient reconstruction of piecewise constant images using nonsmooth nonconvex minimization, SIAM J. Imaging Sci., 1 (2008).
25. R. Tibshirani, Shrinkage and selection via the Lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996).
26. S.A. Vavasis and R. Zippel, Proving polynomial time for sphere-constrained quadratic programming, Technical Report, Department of Computer Science, Cornell University, Ithaca, NY, 1990.
27. S.A. Vavasis, Nonlinear Optimization: Complexity Issues, Oxford Science Publications, New York, 1991.
28. Y. Ye, Interior Point Algorithms: Theory and Analysis, John Wiley & Sons, Inc., New York, 1997.
29. Y. Ye, On the complexity of approximating a KKT point of quadratic programming, Math. Program., 80 (1998).
30. Y. Ye, A new complexity result on minimization of a quadratic function with a sphere constraint, in Recent Advances in Global Optimization, C. Floudas and P.M. Pardalos, eds., Princeton University Press, Princeton, NJ (1992).
31. C.-H. Zhang, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist., 38 (2010).


More information

A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function

A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function Zhongyi Liu, Wenyu Sun Abstract This paper proposes an infeasible interior-point algorithm with

More information

A PREDICTOR-CORRECTOR PATH-FOLLOWING ALGORITHM FOR SYMMETRIC OPTIMIZATION BASED ON DARVAY'S TECHNIQUE

A PREDICTOR-CORRECTOR PATH-FOLLOWING ALGORITHM FOR SYMMETRIC OPTIMIZATION BASED ON DARVAY'S TECHNIQUE Yugoslav Journal of Operations Research 24 (2014) Number 1, 35-51 DOI: 10.2298/YJOR120904016K A PREDICTOR-CORRECTOR PATH-FOLLOWING ALGORITHM FOR SYMMETRIC OPTIMIZATION BASED ON DARVAY'S TECHNIQUE BEHROUZ

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

Numerical Experience with a Class of Trust-Region Algorithms May 23, for 2016 Unconstrained 1 / 30 S. Smooth Optimization

Numerical Experience with a Class of Trust-Region Algorithms May 23, for 2016 Unconstrained 1 / 30 S. Smooth Optimization Numerical Experience with a Class of Trust-Region Algorithms for Unconstrained Smooth Optimization XI Brazilian Workshop on Continuous Optimization Universidade Federal do Paraná Geovani Nunes Grapiglia

More information

Global Quadratic Minimization over Bivalent Constraints: Necessary and Sufficient Global Optimality Condition

Global Quadratic Minimization over Bivalent Constraints: Necessary and Sufficient Global Optimality Condition Global Quadratic Minimization over Bivalent Constraints: Necessary and Sufficient Global Optimality Condition Guoyin Li Communicated by X.Q. Yang Abstract In this paper, we establish global optimality

More information

1 Strict local optimality in unconstrained optimization

1 Strict local optimality in unconstrained optimization ORF 53 Lecture 14 Spring 016, Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Thursday, April 14, 016 When in doubt on the accuracy of these notes, please cross check with the instructor s

More information

1. Introduction. We consider the numerical solution of the unconstrained (possibly nonconvex) optimization problem

1. Introduction. We consider the numerical solution of the unconstrained (possibly nonconvex) optimization problem SIAM J. OPTIM. Vol. 2, No. 6, pp. 2833 2852 c 2 Society for Industrial and Applied Mathematics ON THE COMPLEXITY OF STEEPEST DESCENT, NEWTON S AND REGULARIZED NEWTON S METHODS FOR NONCONVEX UNCONSTRAINED

More information

Apolynomialtimeinteriorpointmethodforproblemswith nonconvex constraints

Apolynomialtimeinteriorpointmethodforproblemswith nonconvex constraints Apolynomialtimeinteriorpointmethodforproblemswith nonconvex constraints Oliver Hinder, Yinyu Ye Department of Management Science and Engineering Stanford University June 28, 2018 The problem I Consider

More information

CORE 50 YEARS OF DISCUSSION PAPERS. Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable 2016/28

CORE 50 YEARS OF DISCUSSION PAPERS. Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable 2016/28 26/28 Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable Functions YURII NESTEROV AND GEOVANI NUNES GRAPIGLIA 5 YEARS OF CORE DISCUSSION PAPERS CORE Voie du Roman Pays 4, L Tel

More information

GENERALIZED second-order cone complementarity

GENERALIZED second-order cone complementarity Stochastic Generalized Complementarity Problems in Second-Order Cone: Box-Constrained Minimization Reformulation and Solving Methods Mei-Ju Luo and Yan Zhang Abstract In this paper, we reformulate the

More information

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen

More information

4TE3/6TE3. Algorithms for. Continuous Optimization

4TE3/6TE3. Algorithms for. Continuous Optimization 4TE3/6TE3 Algorithms for Continuous Optimization (Algorithms for Constrained Nonlinear Optimization Problems) Tamás TERLAKY Computing and Software McMaster University Hamilton, November 2005 terlaky@mcmaster.ca

More information

ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints

ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints Instructor: Prof. Kevin Ross Scribe: Nitish John October 18, 2011 1 The Basic Goal The main idea is to transform a given constrained

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Convex Quadratic Approximation

Convex Quadratic Approximation Convex Quadratic Approximation J. Ben Rosen 1 and Roummel F. Marcia 2 Abstract. For some applications it is desired to approximate a set of m data points in IR n with a convex quadratic function. Furthermore,

More information

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem SIAM J. OPTIM. Vol. 9, No. 4, pp. 1100 1127 c 1999 Society for Industrial and Applied Mathematics NEWTON S METHOD FOR LARGE BOUND-CONSTRAINED OPTIMIZATION PROBLEMS CHIH-JEN LIN AND JORGE J. MORÉ To John

More information

A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints

A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints Journal of Computational and Applied Mathematics 161 (003) 1 5 www.elsevier.com/locate/cam A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality

More information

Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

More information

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf. Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of

More information

About Split Proximal Algorithms for the Q-Lasso

About Split Proximal Algorithms for the Q-Lasso Thai Journal of Mathematics Volume 5 (207) Number : 7 http://thaijmath.in.cmu.ac.th ISSN 686-0209 About Split Proximal Algorithms for the Q-Lasso Abdellatif Moudafi Aix Marseille Université, CNRS-L.S.I.S

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information

Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization

Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization J. M. Martínez M. Raydan November 15, 2015 Abstract In a recent paper we introduced a trust-region

More information

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L.

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L. McMaster University Advanced Optimization Laboratory Title: A Proximal Method for Identifying Active Manifolds Authors: Warren L. Hare AdvOl-Report No. 2006/07 April 2006, Hamilton, Ontario, Canada A Proximal

More information

An Even Order Symmetric B Tensor is Positive Definite

An Even Order Symmetric B Tensor is Positive Definite An Even Order Symmetric B Tensor is Positive Definite Liqun Qi, Yisheng Song arxiv:1404.0452v4 [math.sp] 14 May 2014 October 17, 2018 Abstract It is easily checkable if a given tensor is a B tensor, or

More information

AN EIGENVALUE STUDY ON THE SUFFICIENT DESCENT PROPERTY OF A MODIFIED POLAK-RIBIÈRE-POLYAK CONJUGATE GRADIENT METHOD S.

AN EIGENVALUE STUDY ON THE SUFFICIENT DESCENT PROPERTY OF A MODIFIED POLAK-RIBIÈRE-POLYAK CONJUGATE GRADIENT METHOD S. Bull. Iranian Math. Soc. Vol. 40 (2014), No. 1, pp. 235 242 Online ISSN: 1735-8515 AN EIGENVALUE STUDY ON THE SUFFICIENT DESCENT PROPERTY OF A MODIFIED POLAK-RIBIÈRE-POLYAK CONJUGATE GRADIENT METHOD S.

More information

On the acceleration of augmented Lagrangian method for linearly constrained optimization

On the acceleration of augmented Lagrangian method for linearly constrained optimization On the acceleration of augmented Lagrangian method for linearly constrained optimization Bingsheng He and Xiaoming Yuan October, 2 Abstract. The classical augmented Lagrangian method (ALM plays a fundamental

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

On the complexity of an Inexact Restoration method for constrained optimization

On the complexity of an Inexact Restoration method for constrained optimization On the complexity of an Inexact Restoration method for constrained optimization L. F. Bueno J. M. Martínez September 18, 2018 Abstract Recent papers indicate that some algorithms for constrained optimization

More information

A Power Penalty Method for a Bounded Nonlinear Complementarity Problem

A Power Penalty Method for a Bounded Nonlinear Complementarity Problem A Power Penalty Method for a Bounded Nonlinear Complementarity Problem Song Wang 1 and Xiaoqi Yang 2 1 Department of Mathematics & Statistics Curtin University, GPO Box U1987, Perth WA 6845, Australia

More information

Image restoration: numerical optimisation

Image restoration: numerical optimisation Image restoration: numerical optimisation Short and partial presentation Jean-François Giovannelli Groupe Signal Image Laboratoire de l Intégration du Matériau au Système Univ. Bordeaux CNRS BINP / 6 Context

More information

Research Article Finding Global Minima with a Filled Function Approach for Non-Smooth Global Optimization

Research Article Finding Global Minima with a Filled Function Approach for Non-Smooth Global Optimization Hindawi Publishing Corporation Discrete Dynamics in Nature and Society Volume 00, Article ID 843609, 0 pages doi:0.55/00/843609 Research Article Finding Global Minima with a Filled Function Approach for

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Step lengths in BFGS method for monotone gradients

Step lengths in BFGS method for monotone gradients Noname manuscript No. (will be inserted by the editor) Step lengths in BFGS method for monotone gradients Yunda Dong Received: date / Accepted: date Abstract In this paper, we consider how to directly

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization Yinyu Ye Department is Management Science & Engineering and Institute of Computational & Mathematical Engineering Stanford

More information

Lecture 3. Optimization Problems and Iterative Algorithms

Lecture 3. Optimization Problems and Iterative Algorithms Lecture 3 Optimization Problems and Iterative Algorithms January 13, 2016 This material was jointly developed with Angelia Nedić at UIUC for IE 598ns Outline Special Functions: Linear, Quadratic, Convex

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Advanced Mathematical Programming IE417. Lecture 24. Dr. Ted Ralphs

Advanced Mathematical Programming IE417. Lecture 24. Dr. Ted Ralphs Advanced Mathematical Programming IE417 Lecture 24 Dr. Ted Ralphs IE417 Lecture 24 1 Reading for This Lecture Sections 11.2-11.2 IE417 Lecture 24 2 The Linear Complementarity Problem Given M R p p and

More information

12. Interior-point methods

12. Interior-point methods 12. Interior-point methods Convex Optimization Boyd & Vandenberghe inequality constrained minimization logarithmic barrier function and central path barrier method feasibility and phase I methods complexity

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a

More information

Research Article Identifying a Global Optimizer with Filled Function for Nonlinear Integer Programming

Research Article Identifying a Global Optimizer with Filled Function for Nonlinear Integer Programming Discrete Dynamics in Nature and Society Volume 20, Article ID 7697, pages doi:0.55/20/7697 Research Article Identifying a Global Optimizer with Filled Function for Nonlinear Integer Programming Wei-Xiang

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Convex Optimization and l 1 -minimization

Convex Optimization and l 1 -minimization Convex Optimization and l 1 -minimization Sangwoon Yun Computational Sciences Korea Institute for Advanced Study December 11, 2009 2009 NIMS Thematic Winter School Outline I. Convex Optimization II. l

More information

A class of Smoothing Method for Linear Second-Order Cone Programming

A class of Smoothing Method for Linear Second-Order Cone Programming Columbia International Publishing Journal of Advanced Computing (13) 1: 9-4 doi:1776/jac1313 Research Article A class of Smoothing Method for Linear Second-Order Cone Programming Zhuqing Gui *, Zhibin

More information

A New Trust Region Algorithm Using Radial Basis Function Models

A New Trust Region Algorithm Using Radial Basis Function Models A New Trust Region Algorithm Using Radial Basis Function Models Seppo Pulkkinen University of Turku Department of Mathematics July 14, 2010 Outline 1 Introduction 2 Background Taylor series approximations

More information

Gradient Methods Using Momentum and Memory

Gradient Methods Using Momentum and Memory Chapter 3 Gradient Methods Using Momentum and Memory The steepest descent method described in Chapter always steps in the negative gradient direction, which is orthogonal to the boundary of the level set

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Research Article Modified T-F Function Method for Finding Global Minimizer on Unconstrained Optimization

Research Article Modified T-F Function Method for Finding Global Minimizer on Unconstrained Optimization Mathematical Problems in Engineering Volume 2010, Article ID 602831, 11 pages doi:10.1155/2010/602831 Research Article Modified T-F Function Method for Finding Global Minimizer on Unconstrained Optimization

More information

THE solution of the absolute value equation (AVE) of

THE solution of the absolute value equation (AVE) of The nonlinear HSS-like iterative method for absolute value equations Mu-Zheng Zhu Member, IAENG, and Ya-E Qi arxiv:1403.7013v4 [math.na] 2 Jan 2018 Abstract Salkuyeh proposed the Picard-HSS iteration method

More information

Parameter Optimization in the Nonlinear Stepsize Control Framework for Trust-Region July 12, 2017 Methods 1 / 39

Parameter Optimization in the Nonlinear Stepsize Control Framework for Trust-Region July 12, 2017 Methods 1 / 39 Parameter Optimization in the Nonlinear Stepsize Control Framework for Trust-Region Methods EUROPT 2017 Federal University of Paraná - Curitiba/PR - Brazil Geovani Nunes Grapiglia Federal University of

More information

Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization

Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization Lingchen Kong and Naihua Xiu Department of Applied Mathematics, Beijing Jiaotong University, Beijing, 100044, People s Republic of China E-mail:

More information

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Clément Royer (Université du Wisconsin-Madison, États-Unis) Toulouse, 8 janvier 2019 Nonconvex optimization via Newton-CG

More information