A Filter Active-Set Algorithm for Ball/Sphere Constrained Optimization Problem
Chungen Shen, Lei-Hong Zhang, Wei Hong Yang

September 6, 2014

Abstract. In this paper, we propose a filter active-set algorithm for the minimization problem over a product of multiple ball/sphere constraints. By making effective use of the special structure of the ball/sphere constraints, a new limited memory BFGS (L-BFGS) scheme is presented. The new L-BFGS implementation takes advantage of the sparse structure of the Jacobian of the constraints and generates curvature information of the minimization problem. At each iteration, only two or three reduced linear systems need to be solved for the search direction. The filter technique, combined with a backtracking line search strategy, ensures global convergence, and local superlinear convergence can also be established under mild conditions. The algorithm is applied to two specific applications: the nearest correlation matrix with factor structure and the maximal correlation problem. Our numerical experiments indicate that the proposed algorithm is competitive with some recently custom-designed methods for each individual application.

Keywords. SQP, active set, filter, L-BFGS, ball/sphere constraints, the nearest correlation matrix with factor structure, the maximal correlation problem

AMS subject classification. 65K05, 90C30

1 Introduction

In this paper, we consider a class of optimization problems of minimizing an (at least) twice continuously differentiable function (possibly nonconvex) f(x) : Rⁿ → R over a product of multiple ball/sphere constraints. Upon rescaling the balls/spheres, we cast such minimization problems, without loss of generality, in the following form:

  (BCOP)  min_{x∈Rⁿ} f(x)
          s.t. c_i(x) := ‖x_[i]‖² − 1 = 0, i ∈ E,
               c_i(x) := ‖x_[i]‖² − 1 ≤ 0, i ∈ I,

where E = {1, 2, ..., m₁}, I = {m₁+1, m₁+2, ..., m}, x_[i] ∈ R^{p_i}, x = (x_[1]ᵀ, x_[2]ᵀ, ..., x_[m]ᵀ)ᵀ, and n = Σ_{i=1}^m p_i.
Here, we introduce the notation x_[i] ∈ R^{p_i} to represent the ith sub-vector of x ∈ Rⁿ, and formulate the product of multiple ball/sphere constraints as a set of equality and inequality constraints. To simplify the subsequent presentation, we call the above program the ball/sphere constrained optimization problem (BCOP).

This research is supported by the National Natural Science Foundation of China (Nos. , , and ).

Department of Applied Mathematics, Shanghai Finance University, Shanghai 201209, China.
School of Mathematics, Shanghai University of Finance and Economics, Shanghai 200433, China.
Department of Mathematics, Fudan University, Shanghai 200433, China.
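The block structure of the constraints in (BCOP) is simple enough to state in a few lines of code. The following sketch (Python with NumPy; the function name and the block-size argument p are our own illustrative choices, not from the paper) evaluates c(x) and the Jacobian ∇c(x), whose ith column carries 2x_[i] in block i and zeros elsewhere:

```python
import numpy as np

def bcop_constraints(x, p):
    """Evaluate c_i(x) = ||x_[i]||^2 - 1 for each block, together with the
    Jacobian grad c(x) in R^{n x m}: column i holds 2*x_[i] in the rows of
    block i and zeros elsewhere.  p = (p_1, ..., p_m) lists the block sizes."""
    off = np.concatenate(([0], np.cumsum(p)))   # block offsets, off[-1] = n
    m = len(p)
    c = np.empty(m)
    J = np.zeros((off[-1], m))                  # dense here; sparse in practice
    for i in range(m):
        xi = x[off[i]:off[i + 1]]
        c[i] = xi @ xi - 1.0
        J[off[i]:off[i + 1], i] = 2.0 * xi
    return c, J
```

Since each column of ∇c(x) is supported on a single block, the whole matrix has at most n nonzero entries; this is the sparsity exploited throughout Section 2.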
The reason that we are interested in BCOP is twofold: on the one hand, many practical applications arising recently from, for example, correlation matrix approximation with factor structure [3, ], factor models of asset returns [9], collateralized debt obligations [, 10], multivariate time series [5] and the maximal correlation problem [7, 43, 44] can be recast in this form; on the other hand, general algorithms for nonlinearly constrained optimization may not be efficient, as they generally do not take much advantage of the special structure of BCOP. Therefore, a custom-made algorithm for BCOP can provide a uniform and much more efficient tool for these applications. Relying upon the framework of the sequential quadratic programming (SQP) method, e.g., [4, 16, 17, 18, 4, 7, 35, 36], and making heavy use of the special structure of BCOP, we refine the SQP method into a custom-made implementation.

It is known that SQP is one of the most widely used methods for general nonlinearly constrained optimization. In particular, it generates steps by solving quadratic subproblems (QPs). The traditional SQP method (see, e.g., [16]) takes a certain penalty function as the merit function to determine whether a trial step is accepted. One known problem in this procedure is that a suitable penalty parameter is difficult to set. To get around this trouble, Fletcher and Leyffer [13] introduced the filter technique to globalize the SQP method, which turns out to be very efficient and effective, and is proved to be globally convergent [1, 14]. The filter technique was later applied to various problems and combined with other methods; examples include Ulbrich et al. [37], Karas et al. [1], Ribeiro et al. [3], Wächter and Biegler [38, 39, 40], etc. Unfortunately, when directly applied to solve BCOP, the classical SQP method based on QP subproblems encounters numerical difficulties as m and p_i get large.
For instance, in the problem of the nearest correlation matrix with p = p_i (i = 1, 2, ..., m) factor structure [3, ] to be discussed in Section 5 (see (5.68)), solving the corresponding QP subproblem is both time-consuming and memory-demanding as m and p increase. It is nearly intractable with dimensions, say, m = 500, p = 50. As indicated in [3], both the Newton method and the classical SQP method fail to solve BCOP when m and p are large. The spectral projected gradient method (SPGM) is thus proposed in [3] to alleviate this heavy computational burden, as it uses less memory and lower computational cost at each iteration. The numerical results of [3] show that SPGM is efficient for many medium-scale test instances, but the number of iterations can vary drastically from instance to instance, and SPGM can perform worse when p is close to m than in other situations.

Fortunately, the standard SQP method can be improved substantially for BCOP by exploiting the special structure of the constraints. One remarkable feature is that the Jacobian matrix ∇c(x) is sparse and structured, which can be utilized to reduce the computational cost and memory requirements at each iteration. To do so, we employ the active-set technique [4, 41] to estimate the active set of inequalities associated with the minimizer and then, similar to QP-free methods [6, 15, 9, 30, 34, 41, 4], transform the QP subproblem into relevant linear system(s). As m and p get large, the size of the resulting linear system naturally becomes large too, but the limited memory BFGS (L-BFGS) [3] plus the duality technique [36] can be employed effectively, which dramatically reduces the computational costs and memory requirements for the associated linear systems. By counting the detailed computational complexity of this procedure, we will see that a large number of flops is saved at each iteration.
On the other hand, fast local convergence can be preserved thanks to the SQP framework and the L-BFGS technique, and global convergence is also guaranteed with the aid of the filter technique. We apply this implementation to two specific practical applications, the correlation approximation problem [3, ] and the maximal correlation problem [7], in Section 5; our numerical experiments demonstrate that the proposed method is robust and efficient, and is competitive with some recently custom-designed methods for each individual application, including SPGM, the block relaxation method [3] and the majorization method [3] for the correlation approximation problem, and the Riemannian trust-region method [44] for the maximal correlation problem.

The rest of this paper is organized as follows. In the first part of Section 2, we reformulate the QP subproblem into a relevant linear system by duality, and then introduce the L-BFGS technique to alleviate
the computational burden in solving these linear systems; the detailed implementation exploiting the sparsity of the Jacobian ∇c(x) is stated; we then discuss the filter technique used to globalize the SQP method; the overall algorithm is presented in the last part of Section 2. In Sections 3 and 4, we establish the global convergence and the local convergence rate of the proposed algorithm, respectively. The numerical experiments on the two specific applications are carried out in Section 5, where we report our numerical experiences by comparing the performance of our algorithm with others. Concluding remarks are finally drawn in Section 6.

A few words on notation. We denote the feasible region of BCOP by Ω := {x | c_i(x) = 0, i ∈ E; c_i(x) ≤ 0, i ∈ I}. For the constraint functions c_i(x), i = 1, 2, ..., m, we let c(x) = (c_1(x), ..., c_m(x))ᵀ : Rⁿ → R^m and ∇c(x) = (∇c_1(x), ..., ∇c_m(x)) ∈ R^{n×m}; for a particular index subset J = {i_1, i_2, ..., i_j} of {1, 2, ..., m}, we denote by |J|_c the cardinality of J and write c_J(x) = (c_{i_1}(x), ..., c_{i_j}(x))ᵀ : Rⁿ → R^j and ∇c_J(x) = (∇c_{i_1}(x), ..., ∇c_{i_j}(x)) ∈ R^{n×j}; the definitions of c_E(x) and c_I(x) follow naturally. Finally, suppose {η_k} and {ν_k} are two vanishing sequences, where η_k, ν_k ∈ R, k ∈ N; we write η_k = O(ν_k) if there exists a scalar c > 0 such that |η_k| ≤ c|ν_k| for all k sufficiently large, η_k = o(ν_k) if lim_{k→+∞} η_k/ν_k = 0, and η_k = Θ(ν_k) if both ν_k = O(η_k) and η_k = O(ν_k) hold.

2 Algorithm

2.1 The working set

We begin with the first-order optimality (KKT) conditions, which can be written as

  ∇_x L(x, λ) = ∇f(x) + ∇c(x)λ = 0,  (2.1)
  λ_i c_i(x) = 0, i ∈ I,  (2.2)
  c_i(x) ≤ 0, λ_i ≥ 0, i ∈ I,  (2.3)
  c_i(x) = 0, i ∈ E,  (2.4)

where L(x, λ) := f(x) + c(x)ᵀλ is the Lagrangian function and λ ∈ R^m is the Lagrange multiplier. As our method is based on the active-set approach, we next state the strategy used to identify the active set.
To this end, similar to [11, 19, 8], we first introduce the function φ : R^{n+m} → R,

  φ(x, λ) = ‖Ψ(x, λ)‖,

where Ψ : R^{n+m} → R^{n+m} is defined by

  Ψ(x, λ) = ( ∇_x L(x, λ) ; c_E(x) ; min{−c_I(x), λ_I} ).
Thus the set

  A_I(x, λ) = { i ∈ I | c_i(x) ≥ −min{φ(x, λ), 10⁻⁶} }  (2.5)

provides an estimate of the active set I(x*) = {i | c_i(x*) = 0, i ∈ I} of inequality constraints, where (x*, λ*) is the KKT point at the minimizer of BCOP. When (x, λ) is sufficiently close to (x*, λ*), the estimate A_I(x, λ) is accurate, provided that both the Mangasarian-Fromovitz constraint qualification (MFCQ) and the second-order sufficient condition (SOSC) hold at (x*, λ*) (see [8, Theorem 2.2]). Now, supposing the current iterate (x^k, λ^k) is an approximation to (x*, λ*), we define

  A_k := A_I(x^k, λ^k) ∪ E  (2.6)

as our working set, which includes all equality constraints, the nearly active inequality constraints, and the violated inequality constraints. This choice of the working set is similar to [15, 41, 4] and is based on the following observations: it is reasonable to include i ∈ I whenever c_i(x^k) is close to zero (say c_i(x^k) ≥ −10⁻⁶); as for equality constraints and violated inequality constraints, we include them in the working set in the hope of reducing the violation. After identifying the working set A_k, a QP subproblem can be formulated which, by the QP-free technique [6, 15, 9, 30, 34, 41, 4], can alternatively be solved via relevant linear system(s) (details on the linear systems are discussed in the next subsection). The solution of the resulting linear system yields the search direction and generates curvature information of BCOP at (x^k, λ^k). One issue related to the linear system is its consistency, which is equivalent to the linear independence of the gradients of the constraints in the working set A_k. Due to the structure of BCOP, we prove in Lemma 2.1 that ∇c_{A_k}(x^k) has full column rank as long as x^k is confined to the set Ω_p := {x | ‖x_[i]‖ ≥ 0.5 for all i ∈ E}.
Based on this fact, our choice of the working set A_k does not invoke any complicated procedure like those in [34, 41, 4], where the working set I_k must be determined by calculating the rank of ∇c_{I_k}(x^k) or the determinant of ∇c_{I_k}(x^k)ᵀ∇c_{I_k}(x^k) for each trial estimate I_k until ∇c_{I_k}(x^k) has full column rank.

Lemma 2.1. If x^k ∈ Ω_p, then the vectors ∇c_i(x^k), i ∈ A_k, are linearly independent, where A_k is defined in (2.5)-(2.6).

Proof. Since x^k ∈ Ω_p, it follows that ‖x^k_[i]‖ ≥ 0.5 for all i ∈ E, and therefore x^k_[i] ≠ 0 for all i ∈ E. For i ∈ A_k \ E, c_i(x^k) ≥ −10⁻⁶, and therefore x^k_[i] ≠ 0. Suppose that there exist scalars l_i ∈ R, i ∈ A_k, such that Σ_{i∈A_k} l_i ∇c_i(x^k) = 0. Note that the block of Σ_{i∈A_k} l_i ∇c_i(x^k) corresponding to the ith constraint is 2 l_i x^k_[i]. Because x^k_[i] ≠ 0 for all i ∈ A_k, we have l_i = 0 for all i ∈ A_k, which implies that ∇c_i(x^k), i ∈ A_k, are linearly independent.

Analogously, we have the following lemma.

Lemma 2.2. Let the subsequence {x^{k_l}} of {x^k} with {x^k} ⊂ Ω_p converge to x*, and let A_{k_l} ≡ A for all sufficiently large l. Then ∇c_A(x*) has full column rank.

Proof. Since x^{k_l} ∈ Ω_p and x^{k_l} → x*, we have ‖x*_[i]‖ ≥ 0.5 for all i ∈ E, and therefore x*_[i] ≠ 0 for all i ∈ E. For i ∈ A \ E, c_i(x^{k_l}) ≥ −10⁻⁶, and hence c_i(x*) ≥ −10⁻⁶ as k_l → ∞. By the definition of c(x), we then also have x*_[i] ≠ 0 for all i ∈ A ∩ I. Analogously to the proof of Lemma 2.1, ∇c_i(x*), i ∈ A, are linearly independent, as was to be shown.
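As a small illustration of (2.5)-(2.6), the working-set estimate can be coded directly from the residual function φ. This is a minimal sketch under our own naming conventions (the helper and its argument layout are not from the paper), taking the pieces of Ψ(x, λ) as inputs:

```python
import numpy as np

def working_set(c, lam, gradL, m1, tol=1e-6):
    """Return A_k = A_I(x, lam) union E of (2.5)-(2.6), as a sorted index list.
    c: all m constraint values at x; lam: multiplier estimate in R^m;
    gradL: grad_x L(x, lam); m1: number of equality constraints."""
    cE, cI, lamI = c[:m1], c[m1:], lam[m1:]
    # phi(x, lam) = || (grad_x L, c_E, min{-c_I, lam_I}) ||
    phi = np.sqrt(gradL @ gradL + cE @ cE
                  + np.sum(np.minimum(-cI, lamI) ** 2))
    thresh = min(phi, tol)
    A_I = [i for i in range(m1, len(c)) if c[i] >= -thresh]
    return list(range(m1)) + A_I      # equalities first, then near-active ones
```

Note that no rank computation is needed: by Lemma 2.1, the gradients indexed by the returned set are automatically linearly independent for x ∈ Ω_p.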
2.2 The QP subproblem and its reformulation

In this and the next subsections, we discuss how to compute the search direction at x^k. After the working set A_k is determined, the search direction d^k and its associated Lagrange multiplier λ^k can be determined by solving (probably two or three, with different perturbation vectors w_k ∈ R^m̄, where m̄ = |A_k|_c) equality-constrained QP subproblem(s) of the form

  min_{d∈Rⁿ} (1/2) dᵀB_k d + ∇f(x^k)ᵀd  s.t. ∇c_{A_k}(x^k)ᵀd = w_k,  (2.7)

where B_k ∈ R^{n×n} is symmetric positive definite and approximates the Hessian of the Lagrangian L(x^k, λ^k). We point out that B_k can be updated by the BFGS formula [7]. The strategy of choosing different perturbations w_k is similar to [4, 41]; they correspond to two types of search directions d^k, designed respectively for global convergence and for local superlinear convergence. To simplify the subsequent presentation, we distinguish these two cases by a boolean variable FAST, i.e., FAST=FALSE or FAST=TRUE, respectively. Details of the choice of w_k are postponed to Algorithm 3 and Remark 2.2; we next discuss an efficient procedure for computing the solution d^k of (2.7). It is evident that the equality-constrained quadratic program (2.7) is equivalent to the linear system

  B_k d + ∇c_{A_k}(x^k)λ = −∇f(x^k),
  ∇c_{A_k}(x^k)ᵀd = w_k.  (2.8)

However, as n gets large, solving the linear system (2.8) can be expensive. In addition, without effectively exploiting the underlying sparse structure, the associated coefficient matrix could occupy too much memory. To resolve these numerical difficulties, we make use of the duality technique and solve the dual problem of (2.7),

  max_{λ∈R^m̄} −(1/2) λᵀW_k λ + b_kᵀλ.  (2.9)

Note that (2.9) is an unconstrained optimization problem of relatively small size m̄, where

  W_k = ∇c_{A_k}(x^k)ᵀ B_k⁻¹ ∇c_{A_k}(x^k),  (2.10)
  b_k = −w_k − ∇c_{A_k}(x^k)ᵀ B_k⁻¹ ∇f(x^k).  (2.11)
Note that B_k is positive definite and therefore strong duality holds, which implies that the search direction d^k and the estimate λ^k of the associated Lagrange multiplier can be obtained from (2.9) instead of (2.7). In particular, observing that W_k ∈ R^{m̄×m̄} and m̄ ≤ m is much smaller than n, solving the KKT condition of (2.9) or, equivalently, solving the much smaller linear system

  W_k λ = b_k  (2.12)

is inexpensive. Once λ^k is obtained from (2.12), substituting it into the first equation of (2.8) yields

  d^k = −B_k⁻¹( ∇f(x^k) + ∇c_{A_k}(x^k)λ^k ).  (2.13)

The above procedure resolves most numerical difficulties. The last issue is how to calculate W_k efficiently. The idea is to adopt the L-BFGS technique, which is the topic of the next subsection.
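For moderate sizes, the chain (2.10)-(2.13) can be checked with dense linear algebra. The sketch below is our own illustrative code (in the algorithm proper, B_k⁻¹ is never formed but applied via L-BFGS as in the next subsection): it forms W_k and b_k, solves (2.12), and recovers d^k by (2.13):

```python
import numpy as np

def qp_via_dual(B, A, g, w):
    """Solve min 1/2 d'Bd + g'd  s.t.  A'd = w  through the dual (2.9):
    W = A' B^{-1} A and b = -w - A' B^{-1} g as in (2.10)-(2.11),
    solve W lam = b (eq. (2.12)), then d = -B^{-1}(g + A lam) (eq. (2.13))."""
    Binv_g = np.linalg.solve(B, g)      # B^{-1} grad f
    Binv_A = np.linalg.solve(B, A)      # B^{-1} grad c_{A_k}
    W = A.T @ Binv_A
    b = -w - A.T @ Binv_g
    lam = np.linalg.solve(W, b)
    d = -(Binv_g + Binv_A @ lam)
    return d, lam
```

One can verify that the returned pair satisfies the primal-dual system (2.8): B d + ∇c λ = −∇f and ∇cᵀ d = w.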
2.3 Computing the search direction based on the L-BFGS formula

The limited memory BFGS method [7, Chapter 9] is one of the most effective and widely used methods in the field of large-scale unconstrained optimization. Its main advantage is that the L-BFGS approach does not require calculating or storing a full Hessian matrix, which might be too expensive for large-scale problems. For BCOP, we have pointed out that the matrix W_k = ∇c_{A_k}(x^k)ᵀB_k⁻¹∇c_{A_k}(x^k) in (2.10) needs to be computed. Note that ∇c_{A_k}(x^k) is large but sparse and structured, and if we adopt the L-BFGS formula to update the inverse of the Hessian approximation B_k, much storage space and computational cost can be saved. To describe the detailed procedure, let

  S_k = [s_{k−l}, ..., s_{k−1}],  Y_k = [y_{k−l}, ..., y_{k−1}],

where s_i = x^{i+1} − x^i and y_i = ∇_x L(x^{i+1}, λ^i) − ∇_x L(x^i, λ^i), i = k−l, ..., k−1. One may notice that the solution λ^i of (2.12) lies in R^m̄ rather than in R^m, so plugging λ^i into ∇_x L(x^i, λ^i) is, strictly speaking, inappropriate. Nevertheless, we can augment λ^i by setting λ^i_j = 0 for j ∈ I \ A_i. With this augmentation, in what follows we use λ^i to denote the estimated multiplier in R^m whenever no confusion is caused. By the L-BFGS formula, the matrix B_k resulting from l updates to the basic matrix B_k^0 = ν_k I is given by

  B_k = ν_k I − [ ν_k S_k  Y_k ] [ ν_k S_kᵀS_k  L_k ; L_kᵀ  −D_k ]⁻¹ [ ν_k S_kᵀ ; Y_kᵀ ],

where L_k, D_k ∈ R^{l×l} are defined by

  (L_k)_{i,j} = { s_{k−l−1+i}ᵀ y_{k−l−1+j}, if i > j; 0, otherwise },
  D_k = diag(s_{k−l}ᵀy_{k−l}, ..., s_{k−1}ᵀy_{k−1}),

and ν_k = (y_{k−1}ᵀy_{k−1})/(s_{k−1}ᵀy_{k−1}). To ensure the positive definiteness of B_{k+1}, we adopt the so-called damped BFGS technique to modify y_k so that s_kᵀy_k is sufficiently positive. Let y_k ← θ_k y_k + (1 − θ_k)B_k s_k, where the scalar θ_k is defined as

  θ_k = { 1, if s_kᵀy_k ≥ 0.02 s_kᵀB_k s_k;
          (0.98 s_kᵀB_k s_k)/(s_kᵀB_k s_k − s_kᵀy_k), if s_kᵀy_k < 0.02 s_kᵀB_k s_k. }

We then use s_k and the modified y_k to update S_{k+1} and Y_{k+1}, respectively.
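The damping rule above is Powell's modification; a direct transcription (a sketch, with our own function name) reads:

```python
import numpy as np

def damped_y(s, y, B):
    """Damped BFGS correction: return theta*y + (1-theta)*B@s so that the
    modified pair satisfies s'y >= 0.02 * s'Bs, keeping the BFGS update
    positive definite (thresholds 0.02 / 0.98 as in the text)."""
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    if sy >= 0.02 * sBs:
        return y                          # no damping needed, theta = 1
    theta = 0.98 * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * Bs
```

By construction, whenever damping is active the modified y satisfies s_kᵀy_k = 0.02 s_kᵀB_k s_k exactly, so the curvature condition never degenerates.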
Let H_k denote the inverse of B_k; the update formula for H_k is

  H_{k+1} = V_kᵀ H_k V_k + ρ_k s_k s_kᵀ,  (2.14)

where ρ_k = 1/(y_kᵀs_k) and V_k = I − ρ_k y_k s_kᵀ. Using the information (S_k and Y_k) of the last l iterations and choosing δ_k I with δ_k = 1/ν_k as the initial approximation H_k^0, we obtain by repeatedly applying (2.14) that H_k = H_k^f + H_k^s, where

  H_k^f = δ_k (V_{k−1}ᵀ ··· V_{k−l}ᵀ)(V_{k−l} ··· V_{k−1})

and

  H_k^s = ρ_{k−l} (V_{k−1}ᵀ ··· V_{k−l+1}ᵀ) s_{k−l} s_{k−l}ᵀ (V_{k−l+1} ··· V_{k−1})
        + ρ_{k−l+1} (V_{k−1}ᵀ ··· V_{k−l+2}ᵀ) s_{k−l+1} s_{k−l+1}ᵀ (V_{k−l+2} ··· V_{k−1})
        + ··· + ρ_{k−1} s_{k−1} s_{k−1}ᵀ.
For simplicity, we denote ∇c_{A_k}(x^k) by A_k (the context distinguishes the matrix from the index set). It then follows from (2.10) that

  W_k = A_kᵀ H_k A_k = A_kᵀ H_k^f A_k + A_kᵀ H_k^s A_k.  (2.15)

Since the matrix A_k is sparse (no more than n nonzero elements) and V_k is structured, we can carry out the matrix-chain multiplications for A_kᵀH_k^fA_k and A_kᵀH_k^sA_k rather efficiently by transforming the right-hand side of (2.15). In particular, it is straightforward that

  (V_{k−l} ··· V_{k−1}) A_k = A_k − ρ_{k−1} y_{k−1} s_{k−1}ᵀ A_k − ··· − ρ_{k−l+1} y_{k−l+1} s_{k−l+1}ᵀ (V_{k−l+2} ··· V_{k−1}) A_k − ρ_{k−l} y_{k−l} s_{k−l}ᵀ (V_{k−l+1} ··· V_{k−1}) A_k.

Let q_i = ρ_i s_iᵀ (V_{i+1} ··· V_{k−1}) A_k for i = k−l, ..., k−2 and q_{k−1} = ρ_{k−1} s_{k−1}ᵀ A_k. It then follows that

  A_kᵀ H_k^f A_k = δ_k A_kᵀA_k + Σ_{i=k−l}^{k−1} Σ_{j=k−l}^{k−1} δ_k (y_iᵀy_j) q_iᵀq_j − Σ_{i=k−l}^{k−1} δ_k ( q_iᵀ y_iᵀ A_k + A_kᵀ y_i q_i ).  (2.16)

Using the q_i, the last term in (2.15) can be rewritten as

  A_kᵀ H_k^s A_k = Σ_{i=k−l}^{k−1} q_iᵀ q_i / ρ_i.  (2.17)

Consequently, based on (2.16) and (2.17), the whole procedure for computing W_k = A_kᵀH_kA_k is summarized by the pseudo-code in Algorithm 1. The procedure between lines 2-13 computes W_k^s = A_kᵀH_k^sA_k, lines 15-25 compute W_k^f = A_kᵀH_k^fA_k, and line 26 finally forms W_k.

Remark 2.1. We finally estimate the computational complexity of computing W_k in Algorithm 1, assuming for simplicity that p_i = p for i = 1, 2, ..., m. Computing W_k^s = A_kᵀH_k^sA_k (lines 2-13) costs O(l²m̄p + lm̄²) flops, and computing W_k^f = A_kᵀH_k^fA_k (lines 15-25) costs O(l²m̄p + l²m̄²) flops (recall m̄ ≤ m). Note that mp = n, and this implies that for l ≪ n, the computation of W_k requires at most O(m̄² + n) flops. As for b_k and d^k, the main computational effort is to compute the matrix-vector product H_k z.
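The product H_k z itself is computed by the standard two-loop recursion (cf. [7, Algorithm 9.1]); a compact sketch (our own code, with the initialization H_k^0 = δ_k I) is:

```python
import numpy as np

def lbfgs_Hz(z, S, Y, delta):
    """Two-loop recursion for H_k @ z, where H_k is the L-BFGS inverse
    Hessian built from stored pairs (s_i, y_i), oldest first, H_k^0 = delta*I."""
    q = z.astype(float).copy()
    rhos = [1.0 / (y @ s) for s, y in zip(S, Y)]
    alphas = []
    for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)                 # stored newest pair first
    r = delta * q
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

For a single stored pair this reproduces (2.14) applied once to H^0 = δI, which is a convenient unit check.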
Applying [7, Algorithm 9.1], it is easy to see that 6lmp = 6ln flops are required for computing H_k z, and therefore the computation of b_k in (2.11) and of d^k in (2.13) needs at most 12lmp + 6mp = (12l + 6)n flops.

2.4 The NLP filter

Given the search direction d^k, the step size α_k is the next important ingredient, determining the iterate x^{k+1} := x^k + α_k d^k.
Algorithm 1: Procedure for computing W_k based on the L-BFGS formula

Data: S_k, Y_k, A_k, δ_k
Result: W_k
1  % Compute W_k^s = A_kᵀH_k^sA_k
2  for i = k−l, ..., k−1 do
3    ρ_i = 1/(y_iᵀs_i);
4  end
5  W_k^s = 0;
6  for i = k−1, ..., k−l do
7    u = s_iᵀ;
8    for j = i, ..., k−2 do
9      u = u − ρ_{j+1}(u y_{j+1}) s_{j+1}ᵀ;
10   end
11   q_i = ρ_i u A_k;  % q_i = ρ_i s_iᵀ(V_{i+1} ··· V_{k−1})A_k for i = k−l, ..., k−2; q_{k−1} = ρ_{k−1} s_{k−1}ᵀ A_k
12   W_k^s = W_k^s + q_iᵀ(q_i/ρ_i);
13 end
14 % Compute W_k^f = A_kᵀH_k^fA_k
15 W_k^f = δ_k A_kᵀA_k;
16 for i = k−l, ..., k−1 do
17   for j = k−l, ..., i do
18     β = δ_k (y_iᵀy_j);
19     W_k^f = W_k^f + (βq_i)ᵀq_j;
20     if j < i then
21       W_k^f = W_k^f + q_jᵀ(βq_i);
22     end
23   end
24   W_k^f = W_k^f − (δ_k q_iᵀ)(y_iᵀA_k) − (A_kᵀy_i)(δ_k q_i);
25 end
26 W_k = W_k^f + W_k^s;

In choosing α_k, we use the filter method and a backtracking line search procedure. In particular, we generate a decreasing sequence of trial values α_k ∈ (α_min^k, 1] until a preset acceptance criterion is fulfilled or the feasibility restoration phase (Section 2.5) is called. Here, α_min^k ≥ 0 is a lower bound on α_k, for which an explicit formula is given in the next subsection. Let x̂ := x^k + α̂d^k, α̂ ∈ (α_min^k, 1], denote a trial point. Using

  h(x) = ‖( c_E(x) ; max{c_I(x), 0} )‖

as a measure of infeasibility at the point x, we now give the relevant definitions for the filter. The first one, Definition 2.1, is a variant of [14, (2.6)].

Definition 2.1. For given β ∈ (0, 1) and γ ∈ (0, 1), a trial point x̂ (or, equivalently, the pair (h(x̂), f(x̂))) is
acceptable to x^l (or, equivalently, to the pair (h(x^l), f(x^l))), if

  h(x̂) ≤ βh(x^l)  (2.18)

or

  f(x̂) ≤ f(x^l) − γ min{h(x̂), h(x̂)²}.  (2.19)

In the original paper of Fletcher and Leyffer [13], a pair (h(x̂), f(x̂)) is said to dominate (h(x^l), f(x^l)) if both (2.18) and (2.19) hold with β = 1 and γ = 0, and a filter is defined as a list of pairs (h(x^l), f(x^l)) such that no pair dominates any other in this filter [13, Definition 2]. The condition (2.19) is a variant of [14, (2.6)], where f(x̂) ≤ f(x^l) − γh(x̂). Note that (2.19) is equivalent to: f(x̂) ≤ f(x^l) − γh(x̂)² if h(x̂) ≤ 1, and f(x̂) ≤ f(x^l) − γh(x̂) otherwise. The reason for introducing this modified condition on h(x̂) is that we prefer to accept the trial point x̂, for the purpose of convergence, whenever the violation of feasibility is not severe, i.e., h(x̂) < 1.

Similar to the original definition of the filter in [13] and based on Definition 2.1, we define our filter, denoted by F_k at iteration k, as a set of pairs (h(x^l), f(x^l)) such that any pair in the filter is acceptable to all previous pairs in F_k in the sense of Definition 2.1. Initially, with k = 0, the filter F_k begins with the pair (χ, −∞), where χ > 0 is imposed on h(x̂) as an upper bound to control the constraint violation [13]. At the start of iteration k, the current pair (h(x^k), f(x^k)) ∉ F_k but must be acceptable to it, while at the end of iteration k the pair (h(x^k), f(x^k)) may or may not be added to F_k, depending on our acceptance rule to be discussed in Remark 2.3. Once (h(x^k), f(x^k)) is added to F_k, we remove all pairs in the current filter F_k that are worse than (h(x^k), f(x^k)) with respect to both the objective function value and the constraint violation; the detailed procedure for updating the filter F_k is described in Algorithm 3 and Remark 2.3.

Definition 2.2.
A trial point x̂ (or a pair (h(x̂), f(x̂))) is acceptable to the filter F_k if x̂ (or the pair (h(x̂), f(x̂))) is acceptable to x^l in the sense of Definition 2.1 for all l ∈ F̄_k := {l | (h(x^l), f(x^l)) ∈ F_k}.

The trial point x̂ is to be accepted as the next iterate if it is acceptable both to x^k (by Definition 2.1) and to the filter F_k (by Definition 2.2). Nevertheless, this acceptance rule for the trial x̂ may cause the following situation: we always accept points that satisfy (2.18) alone, but not (2.19). This would result in an iterative sequence converging to a feasible but non-optimal point. To avoid this situation, we impose additional conditions on x̂:

Case 1. When FAST=FALSE or α̂ < 1: if

  −α̂∇f(x^k)ᵀd^k > δh²(x^k),  (2.20)

then accepting x̂ as the next iterate x^{k+1} requires

  f(x̂) ≤ f(x^k) + α̂η∇f(x^k)ᵀd^k;  (2.21)

Case 2. When FAST=TRUE and α̂ = 1: if

  −∇f(x^k)ᵀd^k > δh²(x^k) and h(x^k) ≤ ζ₁‖d^k‖^{ζ₂},  (2.22)

then accepting x̂ as the next iterate x^{k+1} requires

  f(x̂) ≤ f(x^k) − η min{ −∇f(x^k)ᵀd^k, ξ‖d^k‖^{ζ₂} },  (2.23)

where ζ₁ > 0, ζ₂ ∈ (2, 3), ξ > 0, η ∈ (0, 1/2), and δ > 0 is chosen to satisfy δ ≥ γ/η. Note that Case 1 and Case 2 are mutually exclusive. The motivation for these conditions comes from [33, Section 2]. The switching conditions for Case 1 and Case 2 and the sufficient reduction conditions (2.21) and
(2.23) are useful for global convergence and for fast local convergence as well: if (2.20) for Case 1 is satisfied, then the direction d^k is a descent direction for f(x), and thereby imposing the reduction condition (2.21) on f(x) is helpful for global convergence; if (2.22) for Case 2 is satisfied, implying that d^k is a search direction for fast local convergence, the full step (i.e., α̂ = 1) is expected so that fast local convergence can be achieved. Note that condition (2.23) is more relaxed than (2.21), as we prefer to accept the full step. Finally, we are able to state our rule for accepting the trial point x̂ as the next iterate.

Acceptance Rule: A trial point x̂ is accepted as the next iterate x^{k+1} if it is acceptable to F_k ∪ {(h(x^k), f(x^k))}, and one of the following two conditions holds:

(i) either (2.20) and (2.21) for Case 1, or (2.22) and (2.23) for Case 2, are satisfied;

(ii) (2.20) for Case 1 or (2.22) for Case 2 is not satisfied.

If the trial point x̂ does not satisfy x̂ ∈ Ω_p or the Acceptance Rule, we shrink α̂ until the trial point is accepted or α̂ ≤ α_min^k. Once the latter occurs, the feasibility restoration phase is called, which is discussed in the next subsection.

2.5 Feasibility restoration phase

Motivated by [38], we define the lower bound α_min^k of α̂ by

  α_min^k = { α_φ · min{ 1 − β, γh(x^k)/(−∇f(x^k)ᵀd^k), δh²(x^k)/(−∇f(x^k)ᵀd^k) }, if ∇f(x^k)ᵀd^k < 0;
              α_φ, otherwise, }  (2.24)

where α_φ is a positive scalar. If, by shrinking α̂, we cannot find a step size α̂ ∈ (α_min^k, 1] such that the trial point x̂ is accepted by the Acceptance Rule, we turn to the feasibility restoration phase. Note that when the iteration enters the restoration phase, x^k is infeasible; indeed, if x^k is feasible, then h(x^k) = 0 and there must be some α̂ ∈ (α_min^k, 1] such that x̂ is accepted (see Lemma 3.9). Based on these facts, in the restoration phase we project x^k onto Ω to get the next iterate x^{k+1} = P_Ω(x^k).
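The acceptability tests of Definitions 2.1 and 2.2 reduce to a few comparisons. The sketch below (our own helper names; the default parameter values are illustrative, not from the paper) mirrors (2.18)-(2.19) and the filter test:

```python
def acceptable_to_pair(h_hat, f_hat, h_l, f_l, beta=0.99, gamma=1e-5):
    """Definition 2.1: (2.18) sufficient infeasibility reduction, or
    (2.19) sufficient objective reduction with the min{h, h^2} margin."""
    return (h_hat <= beta * h_l or
            f_hat <= f_l - gamma * min(h_hat, h_hat ** 2))

def acceptable_to_filter(h_hat, f_hat, filter_pairs, beta=0.99, gamma=1e-5):
    """Definition 2.2: acceptable to every pair currently stored in F_k."""
    return all(acceptable_to_pair(h_hat, f_hat, h, f, beta, gamma)
               for h, f in filter_pairs)
```

With the initial filter {(χ, −∞)}, the objective branch of (2.19) can never fire against that entry, so any trial with h(x̂) > βχ is rejected; this is exactly the upper-bound role played by χ.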
Since the feasible set Ω has special structure, projecting x^k onto Ω (Algorithm 2) is easy and costs at most 3n flops.

Algorithm 2: P_Ω(x^k): projection of x^k onto Ω

1 Given x^k;
2 for i = 1, ..., m do
3   if (i ≤ m₁ & ‖x^k_[i]‖ ≠ 1) or (i > m₁ & ‖x^k_[i]‖ > 1) then
4     x^k_[i] ← x^k_[i]/‖x^k_[i]‖;
5   end
6 end
7 return x^k;
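Algorithm 2 is a handful of block normalizations; a runnable version (our own code, using a tolerance to test ‖x_[i]‖ ≠ 1 in floating point, and assuming nonzero sphere blocks, which holds for x ∈ Ω_p) is:

```python
import numpy as np

def project_onto_Omega(x, p, m1):
    """Algorithm 2: project x onto Omega.  Sphere blocks (first m1 blocks,
    i < m1 in 0-based indexing) are normalized whenever their norm differs
    from 1; ball blocks are normalized only when their norm exceeds 1.
    Costs O(n) flops.  Assumes sphere blocks are nonzero (true on Omega_p)."""
    x = np.array(x, dtype=float)
    off = np.concatenate(([0], np.cumsum(p)))
    for i in range(len(p)):
        xi = x[off[i]:off[i + 1]]
        nrm = np.linalg.norm(xi)
        if (i < m1 and abs(nrm - 1.0) > 1e-15) or (i >= m1 and nrm > 1.0):
            x[off[i]:off[i + 1]] = xi / nrm
    return x
```

The per-block cost is one norm and at most one scaling, which is where the 3n flop count in the text comes from.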
2.6 The statement of the algorithm

We now state the overall algorithm.

Algorithm 3: Filter active-set method (FilterASM)

1  Given x⁰ ∈ Ω_p, χ > h(x⁰), ν ∈ (2, 3), β ∈ (0, 1), γ ∈ (0, 1), η ∈ (0, 1/2), δ ≥ γ/η, ξ > 0, α_φ ∈ (0, 1/2), ζ₁ > 0, ζ₂ ∈ (2, 3), r ∈ (0, 1). Initialize F₀ with the pair (χ, −∞);
2  for k = 0, 1, 2, ..., maxit do
3    Determine the working set A_k;
4    Compute λ^{k,0} by (2.12) with w_k = −c_{A_k}(x^k), and d^{k,0} by (2.13) with λ^k = λ^{k,0};
5    if d^{k,0} = 0 and λ^{k,0}_i ≥ 0 (∀ i ∈ A_k ∩ I), stop; % Termination condition
6    if
       λ^{k,0}_i ≥ 0, ∀ i ∈ A_k ∩ I,  (2.25)
     then
7      Set FAST=TRUE, d^k = d^{k,0}, λ^k = λ^{k,0}, and w_k = −c_{A_k}(x^k) − c_{A_k}(x^k + d^k) − ‖d^{k,0}‖^ν e;
8    else
9      Set FAST=FALSE, and w_k = 0;
10   end
11   Compute λ^{k,1} by (2.12) with w_k, and d^{k,1} by (2.13) with λ^k = λ^{k,1};
12   if FAST=TRUE then
13     Set d̂^k = { 0, if ‖d^{k,1} − d^{k,0}‖ > ‖d^{k,0}‖; d^{k,1} − d^{k,0}, otherwise };
14   else
15     Set [u_{A_k}]_i = { min{c_{j_i}(x^k), 0} + λ^{k,1}_{j_i}, if λ^{k,1}_{j_i} < 0 (j_i ∈ A_k ∩ I); −c_{j_i}(x^k), otherwise }, where A_k = {j₁, ..., j_{|A_k|_c}};
16     Compute λ^{k,2} by (2.12) with w_k = u_{A_k}, and d^{k,2} by (2.13) with λ^k = λ^{k,2};
17     Set d^k = d^{k,2}, λ^k = λ^{k,1};
18   end
19   if FAST=FALSE, or x^k + d^k + d̂^k does not satisfy the Acceptance Rule, or x^k + d^k + d̂^k ∉ Ω_p then
20     Find α^k > α_min^k, the first number of the sequence {1, r, r², ...} such that x̂ = x^k + α^k d^k satisfies the Acceptance Rule and x̂ ∈ Ω_p;
21   else
22     Set x̂ = x^k + d^k + d̂^k and α^k = 1;
23   end
24   if such α^k (i.e., α^k > α_min^k) does not exist then
25     Go to the feasibility restoration phase to get x^{k+1} = P_Ω(x^k), and add (h(x^k), f(x^k)) to F_k;
26   else
27     if (2.20) for Case 1 or (2.22) for Case 2 does not hold, then add (h(x^k), f(x^k)) to F_k;
28     Set x^{k+1} = x̂, s_k = x^{k+1} − x^k, y_k = ∇_x L(x^{k+1}, λ^k) − ∇_x L(x^k, λ^k), and update S_k, Y_k to S_{k+1}, Y_{k+1}.
29   end
30 end

Remark 2.2. In Algorithm 3, lines 3-18 state the procedure for computing the search direction d^k and the Lagrange multiplier estimate λ^k, together with some related quantities (d^{k,0}, λ^{k,0}, d^{k,1}, λ^{k,1}, d^{k,2}, λ^{k,2}, etc.)
related to d^k and λ^k, while lines 19-23 describe the procedure for choosing the step size α^k. In computing the search direction between lines 3 and 18, there are two different cases:
(i) FAST=TRUE. The pair (d^k, λ^k) = (d^{k,0}, λ^{k,0}) solves

  B_k d^{k,0} + ∇c_{A_k}(x^k)λ^{k,0} = −∇f(x^k),
  ∇c_{A_k}(x^k)ᵀd^{k,0} = −c_{A_k}(x^k),  (2.26)

which is a quasi-Newton equation of the KKT system (2.1)-(2.4) restricted to the working set A_k. To achieve fast local convergence and to overcome the Maratos effect, we adopt the second-order correction technique. In particular, we compute the second-order correction step by setting d̂^k = d^{k,1} − d^{k,0}, where d^{k,1} is obtained from

  B_k d^{k,1} + ∇c_{A_k}(x^k)λ^{k,1} = −∇f(x^k),
  ∇c_{A_k}(x^k)ᵀd^{k,1} = −( c_{A_k}(x^k + d^k) + c_{A_k}(x^k) + ‖d^{k,0}‖^ν e ).  (2.27)

Here, e = (1, 1, ..., 1)ᵀ with appropriate dimension. Then we check whether x̂ = x^k + d^k + d̂^k satisfies the Acceptance Rule. If it fails, the second-order correction step d̂^k is discarded, and the backtracking technique is invoked to find a step size α^k such that x^k + α^k d^k is accepted.

(ii) FAST=FALSE. The search direction d^k = d^{k,2} is computed by solving

  B_k d^{k,2} + ∇c_{A_k}(x^k)λ^{k,2} = −∇f(x^k),
  ∇c_{A_k}(x^k)ᵀd^{k,2} = u_{A_k},  (2.28)

where u_{A_k} (line 15) uses the information of λ^{k,1} from the system

  B_k d^{k,1} + ∇c_{A_k}(x^k)λ^{k,1} = −∇f(x^k),
  ∇c_{A_k}(x^k)ᵀd^{k,1} = 0.  (2.29)

We explain these two linear systems as follows: the solution d^{k,1} of (2.29) lies in the null space of ∇c_{A_k}(x^k)ᵀ and targets improving f(x) rather than h(x); because d^{k,1} may be close to zero with a negative multiplier λ^{k,1}, a slightly perturbed system (2.28) of (2.29) is solved instead and yields a new direction d^{k,2}, which aims at improving h(x) and prevents the unwelcome effect caused by a negative multiplier. In all, d^k in this case contributes to the global convergence.

Remark 2.3. The filter F_k is updated either in line 25 or in line 27. In other words, the pair (h(x^k), f(x^k)) is added to F_k, and all other pairs in F_k dominated by (h(x^k), f(x^k)) are removed, if (2.20) for Case 1 or (2.22) for Case 2 is not fulfilled, or if the restoration phase is invoked.

Remark 2.4.
For the convenience of the convergence analysis, we borrow the terminology of Fletcher, Leyffer and Toint [14]: we call an iterate an f-type iterate if x^{k+1} = x^k + α^k d^k (or x^{k+1} = x^k + d^k + d̂^k) is accepted according to (i) of the Acceptance Rule; otherwise, we call it an h-type iterate, meaning that x^{k+1} is accepted according to (ii) of the Acceptance Rule or is recovered from the feasibility restoration phase.

3 Global convergence

In this section we show the global convergence of Algorithm 3 under the following two assumptions:

(A1) The objective function f(x) is twice continuously differentiable;

(A2) The matrix B_k is bounded and uniformly positive definite for all k; that is, there exists a scalar τ > 0 such that (1/τ)‖d‖² ≤ dᵀB_k d ≤ τ‖d‖² holds for any d ∈ Rⁿ and any k.

We begin with the boundedness of the iterates.

Lemma 3.1. The sequence {x^k} generated by Algorithm 3 is bounded.
Proof. Since all iterates from Algorithm 3 satisfy the upper bound condition h(x^k) ≤ χ because F_0 = {(χ, −∞)}, combining with the definition of h(x) directly leads to the boundedness of {x^k}.

Theorem 3.2. Suppose that Assumption (A1) holds. Let {x^{k_l}} be an infinite subsequence of {x^k} on which (h(x^{k_l}), f(x^{k_l})) is added into the filter. Then lim_{k_l→∞} h(x^{k_l}) = 0.

Proof. From Assumption (A1) and Lemma 3.1, we know that {f(x^{k_l})} is bounded from below. Applying [33, Lemma 3.1] yields the assertion.

Theorem 3.2 implies that all accumulation points of {x^{k_l}}, on which (h(x^{k_l}), f(x^{k_l})) is added into the filter, are feasible points for BCOP.

Lemma 3.3. Suppose that Assumptions (A1)-(A2) hold. If FAST=TRUE, then the sequence {(d^{k,0}, λ^{k,0})} is bounded; if FAST=FALSE, then both sequences {(d^{k,1}, λ^{k,1})} and {(d^{k,2}, λ^{k,2})} are bounded.

Proof. From Algorithm 3, λ^{k,0} = −W_k^{−1} b_k with b_k = −c_{A_k}(x^k) + ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k) in the case of FAST=TRUE, where W_k = ∇c_{A_k}(x^k)^T B_k^{−1} ∇c_{A_k}(x^k) is uniformly positive definite for all k due to Lemmas 2.2, 3.1 and Assumption (A2). Again using Lemma 3.1 and Assumption (A2), b_k is bounded and therefore λ^{k,0} is bounded too, which together with the boundedness of B_k^{−1}, x^k and λ^{k,0} implies that d^k in (2.13) is bounded for all k. Analogously, in the case of FAST=FALSE, W_k and its inverse are bounded for all k. Lemma 3.1 and Assumption (A2) ensure the boundedness of ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k). Since λ^{k,1} = −W_k^{−1} ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k) and d^{k,1} = −B_k^{−1}(∇f(x^k) + ∇c_{A_k}(x^k) λ^{k,1}), it follows that both λ^{k,1} and d^{k,1} are bounded for all k. In view of the definition of u_{A_k} (see line 15 of Algorithm 3) and the boundedness of {x^k}, u_{A_k} is bounded too, which implies the boundedness of λ^{k,2} = −W_k^{−1}(u_{A_k} + ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k)). Consequently, d^{k,2} in (2.13) with λ^{k,2} is bounded for all k.

Remark 3.1.
Based on the previous lemmas, for the convenience of further reference, we assume ‖d^{k,j}‖ ≤ M_d, j = 0, 1, 2, and ‖λ^{k,j}‖ ≤ M_λ, j = 0, 1, 2, for all k, where M_d > 0 and M_λ > 0 are two constants.

Lemma 3.4. Under Assumptions (A1)-(A2), the following two statements are true.

(i) If FAST=TRUE and d^k = 0, then x^k is a KKT point of BCOP.

(ii) If FAST=FALSE, h(x^k) = 0 and ∇f(x^k)^T d^k = 0, then x^k is a KKT point of BCOP.

Proof. (i) Since λ^{k,0} is from (2.12) with b_k = −c_{A_k}(x^k) + ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k), rearranging (2.12) leads to

  c_{A_k}(x^k) = W_k λ^{k,0} + ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k),

which, using (2.13) and the definition of W_k, gives

  c_{A_k}(x^k) = ∇c_{A_k}(x^k)^T B_k^{−1} (∇c_{A_k}(x^k) λ^{k,0} + ∇f(x^k)) = −∇c_{A_k}(x^k)^T d^{k,0}.

Putting d^{k,0} = d^k = 0 into the above equation yields c_{A_k}(x^k) = 0; now combining with the definition of A_k implies that x^k is feasible, that is, c_E(x^k) = 0 and c_I(x^k) ≤ 0. From Assumption (A2) and (2.13), d^{k,0} = 0 leads to ∇c_{A_k}(x^k) λ^{k,0} + ∇f(x^k) = 0, which shows the dual feasibility at x^k. In addition, the nonnegativeness of λ^{k,0} is guaranteed by the mechanism of Algorithm 3 (in the case of FAST=TRUE). Thus, x^k satisfies a variant of the KKT conditions (2.1)-(2.4) and therefore is a KKT point.

(ii) By Algorithm 3, if FAST=FALSE, then

  λ^{k,1} = −W_k^{−1} ∇c_{A_k}(x^k)^T B_k^{−1} ∇f(x^k),   (3.30)
  d^{k,1} = −B_k^{−1}(∇f(x^k) + ∇c_{A_k}(x^k) λ^{k,1}),   (3.31)
  λ^{k,2} = −W_k^{−1} u_{A_k} + λ^{k,1},   (3.32)
  d^{k,2} = d^{k,1} + B_k^{−1} ∇c_{A_k}(x^k) W_k^{−1} u_{A_k}.   (3.33)
From (3.33) and (3.30), we have that

  ∇f(x^k)^T d^{k,2} = ∇f(x^k)^T d^{k,1} + ∇f(x^k)^T B_k^{−1} ∇c_{A_k}(x^k) W_k^{−1} u_{A_k} = ∇f(x^k)^T d^{k,1} − (λ^{k,1})^T u_{A_k}.   (3.34)

By premultiplying the first equation of (2.9) by (d^{k,1})^T and using the second equation of (2.9), we get ∇f(x^k)^T d^{k,1} = −(d^{k,1})^T B_k d^{k,1}. Substituting it into (3.34) yields

  ∇f(x^k)^T d^{k,2} = −(d^{k,1})^T B_k d^{k,1} − (λ^{k,1})^T u_{A_k}.   (3.35)

According to the hypothesis of (ii) of this lemma, c_E(x^k) = 0, c_I(x^k) ≤ 0 and ∇f(x^k)^T d^{k,2} = 0. Combining with the definition of u_{A_k}, the second term in the right-hand side of (3.35) can be changed to

  (λ^{k,1})^T u_{A_k} = Σ_{λ^{k,1}_i<0, i∈A_k∩I} [(λ^{k,1}_i)² + max{λ^{k,1}_i c_i(x^k), 0}] − Σ_{λ^{k,1}_i≥0, i∈A_k∩I} λ^{k,1}_i c_i(x^k),

and then

  0 = −(d^{k,1})^T B_k d^{k,1} − Σ_{λ^{k,1}_i<0, i∈A_k∩I} [(λ^{k,1}_i)² + max{λ^{k,1}_i c_i(x^k), 0}] + Σ_{λ^{k,1}_i≥0, i∈A_k∩I} λ^{k,1}_i c_i(x^k).

It is easy to see that the first two terms (excluding the sign) in the right-hand side are non-negative and the last term is non-positive, which implies that all terms in the right-hand side must be zero. In particular, the first term (d^{k,1})^T B_k d^{k,1} = 0 implies the primal optimality condition ∇c_{A_k}(x^k) λ^{k,1} + ∇f(x^k) = 0 due to Assumption (A2) and (3.31); the second term Σ_{λ^{k,1}_i<0, i∈A_k∩I} [(λ^{k,1}_i)² + max{λ^{k,1}_i c_i(x^k), 0}] = 0 implies λ^{k,1} ≥ 0; and the third term Σ_{λ^{k,1}_i≥0, i∈A_k∩I} λ^{k,1}_i c_i(x^k) = 0 implies λ^{k,1}_i c_i(x^k) = 0, i ∈ A_k ∩ I, which gives the complementarity condition. Thus, x^k is a KKT point of BCOP.

Remark 3.2. Since B_k is uniformly positive definite and uniformly bounded, by Lemma 2.2, the conclusion of Lemma 3.4 can be extended to its limit form: (i) if FAST=TRUE and d^{k_l} → 0, then any limit point x* of {x^{k_l}} is a KKT point of BCOP, where {k_l} is an infinite subsequence of {k}; (ii) if FAST=FALSE, h(x^{k_l}) → 0 and ∇f(x^{k_l})^T d^{k_l} → 0, then any limit point x* of {x^{k_l}} is a KKT point of BCOP, where {k_l} is an infinite subsequence of {k}.

We next establish a series of lemmas concerning the f-type iterates.

Lemma 3.5.
Suppose that Assumptions (A1)-(A2) hold. Then there exist scalars M_h, M_f > 0 and α_u^k ∈ (0, 1] such that

  h(x^k + αd^k) ≤ (1 − α)h(x^k) + M_h α²‖d^k‖²   (3.36)

holds for all α ∈ (0, α_u^k], and

  f(x^k + αd^k) ≤ f(x^k) + α∇f(x^k)^T d^k + M_f α²‖d^k‖²   (3.37)

holds for all α ∈ (0, 1], where d^k is generated by Algorithm 3.

Proof. If FAST=TRUE, (d^k, λ^k) = (d^{k,0}, λ^{k,0}) solves (2.6), implying that

  c_{A_k}(x^k) + ∇c_{A_k}(x^k)^T d^k = 0,   (3.38)
and if FAST=FALSE, (d^k, λ^k) = (d^{k,2}, λ^{k,2}) solves (2.8), which together with the definition of u_{A_k} yields

  c_i(x^k) + ∇c_i(x^k)^T d^k = c_i(x^k) + u_i  { = 0, i ∈ E;  ≤ 0, i ∈ A_k ∩ I }.   (3.39)

Since c_i(x), i ∈ A_k, are quadratic functions, it follows that for i ∈ A_k

  c_i(x^k + αd^k) = c_i(x^k) + α∇c_i(x^k)^T d^k + (α²/2)(d^k)^T Q_i d^k,

where Q_i is the Hessian of c_i(x). As a result, for either FAST=TRUE or FAST=FALSE, using (3.38) and (3.39) we have

  c_i(x^k + αd^k) = (1 − α)c_i(x^k) + (α²/2)(d^k)^T Q_i d^k, i ∈ E,
  c_i(x^k + αd^k) ≤ (1 − α)c_i(x^k) + (α²/2)(d^k)^T Q_i d^k, i ∈ A_k ∩ I.

Therefore, it is straightforward to get that for all i ∈ E

  |c_i(x^k + αd^k)| ≤ (1 − α)|c_i(x^k)| + M_h α²‖d^k‖²,   (3.40)

and for all i ∈ A_k ∩ I

  max{0, c_i(x^k + αd^k)} ≤ (1 − α) max{0, c_i(x^k)} + M_h α²‖d^k‖²,   (3.41)

where M_h > 0 is a scalar satisfying ‖Q_i‖ ≤ 2M_h for all i ∈ A_k. On the other hand, for i ∈ I\A_k, c_i(x^k) < 0 due to the definition of A_k; by the continuity of c_i(x), there exists a scalar α_u^k ∈ (0, 1] such that c_i(x^k + αd^k) < 0 for all i ∈ I\A_k and all α ∈ (0, α_u^k]. Consequently, in view of the definition of h(x),

  h(x^k) = ‖( c_E(x^k); max{c_{A_k}(x^k), 0} )‖  and  h(x^k + αd^k) = ‖( c_E(x^k + αd^k); max{c_{A_k}(x^k + αd^k), 0} )‖,  α ∈ (0, α_u^k],

which together with (3.40) and (3.41) gives (3.36). As for (3.37), it readily follows from Taylor's theorem that

  f(x^k + αd^k) − f(x^k) − α∇f(x^k)^T d^k = (α²/2)(d^k)^T ∇²f(ξ^k) d^k,   (3.42)

where ξ^k ∈ R^n lies in the line segment from x^k to x^k + d^k. Since x^k and d^k are bounded for all k, and the objective function f(x) is twice continuously differentiable, there exists a scalar M_f > 0 such that ‖∇²f(ξ^k)‖ ≤ 2M_f for all ξ^k, and thus using (3.42) gives (3.37).

We remark that α_u^k in Lemma 3.5 is related to x^k; however, with some additional conditions, α_u^k in the conclusion (3.36) can be reduced to a constant, which is shown in the following corollary.

Corollary 3.6. Suppose that Assumptions (A1)-(A2) hold.
Let {x^{k_l}} converge to a non-optimal point x* and let A_{k_l} keep unchanged for all k_l. Then there exist scalars M_h > 0 and α_u ∈ (0, 1] such that

  h(x^{k_l} + αd^{k_l}) ≤ (1 − α)h(x^{k_l}) + M_h α²‖d^{k_l}‖²   (3.43)

holds for all α ∈ (0, α_u], where d^{k_l} is generated by Algorithm 3.
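Because each ball/sphere constraint is a quadratic with Hessian Q_i = 2I on its block, the expansion behind the feasibility bounds above is exact, not just an estimate. The following small numeric check (illustrative data only, not from the paper) verifies that for c(x) = ‖x‖² − 1 and a direction d satisfying the linearization c(x) + ∇c(x)^T d = 0, one has c(x + αd) = (1 − α)c(x) + α²‖d‖² exactly, i.e. the bound of form (3.43) holds with M_h = 1 for this constraint:

```python
import numpy as np

# Exact quadratic expansion for a ball constraint c(x) = ||x||^2 - 1:
# its Hessian is Q = 2I, so (1/2) d^T Q d = ||d||^2 and the expansion
# c(x + a*d) = (1 - a)*c(x) + a^2*||d||^2 holds exactly whenever
# grad_c(x)^T d = -c(x).  Data below are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
c = lambda y: float(y @ y) - 1.0
grad_c = 2.0 * x

# project a random d onto the affine set {d : grad_c^T d = -c(x)}
d = rng.normal(size=4)
d -= (grad_c @ d + c(x)) / (grad_c @ grad_c) * grad_c

for alpha in (0.1, 0.5, 1.0):
    lhs = c(x + alpha * d)
    rhs = (1.0 - alpha) * c(x) + alpha**2 * float(d @ d)
    assert abs(lhs - rhs) < 1e-12
```

This exactness is what makes the constant M_h in (3.36) and (3.43) available uniformly over the ball/sphere constraints.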
Proof. According to the hypothesis of this corollary, A_{k_l} ≡ A for all k_l, where A is a finite index set independent of k_l. Recalling the definition of A (i.e., A_{k_l}) and x^{k_l} → x*, we obtain that c_i(x*) < 0 for all i ∈ I\A, and by continuity of c_i(x) there exists an open ball B(x*; r) of radius r > 0 centered at x* such that for any y ∈ B(x*; r), c_i(y) < 0, i ∈ I\A. Again using x^{k_l} → x* and ‖d^{k_l}‖ ≤ M_d due to Remark 3.1, there exist a scalar ᾱ > 0 and an integer k̄_l > 0 such that c_i(x^{k_l} + αd^{k_l}) < 0, i ∈ I\A, for all α ∈ (0, ᾱ] and all k_l ≥ k̄_l. Thus for all α ∈ (0, ᾱ] and k_l ≥ k̄_l,

  h(x^{k_l}) = ‖( c_E(x^{k_l}); max{c_A(x^{k_l}), 0} )‖  and  h(x^{k_l} + αd^{k_l}) = ‖( c_E(x^{k_l} + αd^{k_l}); max{c_A(x^{k_l} + αd^{k_l}), 0} )‖.

Following the proof of Lemma 3.5, for all i ∈ E and for all i ∈ A ∩ I,

  |c_i(x^{k_l} + αd^{k_l})| ≤ (1 − α)|c_i(x^{k_l})| + M_h α²‖d^{k_l}‖²,
  max{0, c_i(x^{k_l} + αd^{k_l})} ≤ (1 − α) max{0, c_i(x^{k_l})} + M_h α²‖d^{k_l}‖²,

and therefore (3.43) holds for all α ∈ (0, ᾱ] and k_l ≥ k̄_l. On the other hand, for those iterations with k_l < k̄_l, it follows from Lemma 3.5 that (3.43) holds for all α ∈ (0, α_u^{k_l}]. Define α_u := min{α_u^{k_1}, ..., α_u^{k̄_l−1}, ᾱ}. We therefore conclude that (3.43) holds for all α ∈ (0, α_u], which completes the proof.

Define the quantity

  Υ_k := ‖d^{k,0}‖ if FAST=TRUE;   Υ_k := h(x^k) + |∇f(x^k)^T d^{k,2}| if FAST=FALSE,

which is actually another first-order optimality measure due to Lemma 3.4. The proofs of the following lemmas and theorem are related to the optimality measure Υ_k. In particular, the next lemma reveals that the search direction d^k generated by Algorithm 3 is descent for the objective function if a point is nearly feasible but non-optimal.

Lemma 3.7. Suppose that Assumptions (A1)-(A2) hold. Let {x^{k_l}} be a subsequence of {x^k} for which Υ_{k_l} ≥ ɛ with a constant ɛ > 0. Then there exist two scalars ɛ₁ > 0 and ɛ₂ > 0 such that the following statement is true: if h(x^{k_l}) ≤ ɛ₁, then ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂.

Proof. We first consider the case FAST=TRUE. In this situation, Υ_{k_l} = ‖d^{k_l,0}‖ ≥ ɛ, and (d^{k_l}, λ^{k_l}) = (d^{k_l,0}, λ^{k_l,0}) solves (2.6).
Premultiplying the first equation of (2.6) by (d^{k_l,0})^T, we have that

  ∇f(x^{k_l})^T d^{k_l} = −(d^{k_l,0})^T B_{k_l} d^{k_l,0} − (d^{k_l,0})^T ∇c_{A_{k_l}}(x^{k_l}) λ^{k_l,0},

while premultiplying the second equation of (2.6) by (λ^{k_l,0})^T and substituting it into the above equation yields

  ∇f(x^{k_l})^T d^{k_l} = −(d^{k_l,0})^T B_{k_l} d^{k_l,0} + Σ_{i∈A_{k_l}} λ^{k_l,0}_i c_i(x^{k_l}).   (3.44)

Due to FAST=TRUE, we have λ^{k_l,0} ≥ 0, and using Remark 3.1 gives ‖λ^{k_l,0}‖ ≤ M_λ. It is straightforward that

  Σ_{i∈A_{k_l}} λ^{k_l,0}_i c_i(x^{k_l}) ≤ m h(x^{k_l}) ‖λ^{k_l,0}‖ ≤ mM_λ h(x^{k_l}),

which together with (3.44), Assumption (A2) and ‖d^{k_l,0}‖ ≥ ɛ gives

  ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ²/τ + mM_λ h(x^{k_l}).
Let ɛ₁ := ɛ²/(2mτM_λ). If h(x^{k_l}) ≤ ɛ₁, we then obtain that ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂, where ɛ₂ := ɛ²/(2τ).

Next, we show the assertion for the case FAST=FALSE. In this situation, d^{k_l} = d^{k_l,2} and Υ_{k_l} = h(x^{k_l}) + |∇f(x^{k_l})^T d^{k_l,2}| ≥ ɛ. If h(x^{k_l}) ≤ ɛ/2, then

  |∇f(x^{k_l})^T d^{k_l}| = |∇f(x^{k_l})^T d^{k_l,2}| ≥ ɛ/2.   (3.45)

From (3.35) and the definition of u_{A_k},

  ∇f(x^{k_l})^T d^{k_l,2} = −(d^{k_l,1})^T B_{k_l} d^{k_l,1} − Σ_{λ^{k_l,1}_i<0, i∈A_{k_l}∩I} [(λ^{k_l,1}_i)² + max{λ^{k_l,1}_i c_i(x^{k_l}), 0}] + Σ_{λ^{k_l,1}_i≥0, i∈A_{k_l}∩I} λ^{k_l,1}_i c_i(x^{k_l}) + Σ_{i∈E} λ^{k_l,1}_i c_i(x^{k_l}) ≤ m h(x^{k_l}) ‖λ^{k_l,1}‖ ≤ mM_λ h(x^{k_l}),   (3.46)

where the last inequality follows from Remark 3.1. Let ɛ₁ := min{ɛ/2, ɛ/(3mM_λ)} and ɛ₂ := ɛ/2. If h(x^{k_l}) ≤ ɛ₁, then mM_λ h(x^{k_l}) ≤ ɛ/3, which combining with (3.46) and (3.45) yields ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂.

Lemma 3.8. Suppose that Assumptions (A1)-(A2) hold. If h(x^{k_l}) > 0 and ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂ (ɛ₂ is from Lemma 3.7), then x^{k_l} + αd^{k_l} is acceptable to the k_l-th filter for all α ≤ ᾱ^{k_l}, where ᾱ^{k_l} = min{q₁h(x^{k_l}), q₂, α_u^{k_l}}, q₁ = 1/(M_h M_d²) and q₂ = ɛ₂/(M_f M_d²).

Proof. The mechanism of Algorithm 3 (lines 19-23) ensures that (h(x^{k_l}), f(x^{k_l})) is acceptable to the k_l-th filter. We now show that x^{k_l} + αd^{k_l} is no worse than x^{k_l} for all sufficiently small α > 0 in both feasibility and the objective function, implying that x^{k_l} + αd^{k_l} is acceptable to the k_l-th filter. Since ‖d^{k_l}‖ ≤ M_d due to Remark 3.1, it follows from (3.36) in Lemma 3.5 that for α ∈ (0, α_u^{k_l}]

  h(x^{k_l} + αd^{k_l}) ≤ h(x^{k_l}) − αh(x^{k_l}) + α²M_h M_d²,

which turns out to be h(x^{k_l} + αd^{k_l}) ≤ h(x^{k_l}) if 0 ≤ α ≤ min{q₁h(x^{k_l}), α_u^{k_l}} with q₁ := 1/(M_h M_d²). Similarly, using (3.37) in Lemma 3.5 and the boundedness of d^{k_l}, we have that

  f(x^{k_l} + αd^{k_l}) ≤ f(x^{k_l}) + α∇f(x^{k_l})^T d^{k_l} + α²M_f M_d²,

which together with the assumption ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂ yields

  f(x^{k_l} + αd^{k_l}) ≤ f(x^{k_l}) − αɛ₂ + α²M_f M_d².

Define q₂ := ɛ₂/(M_f M_d²). If 0 ≤ α ≤ q₂, then f(x^{k_l} + αd^{k_l}) ≤ f(x^{k_l}).
Therefore, x^{k_l} + αd^{k_l} is acceptable to the k_l-th filter for all α ≤ ᾱ^{k_l} := min{q₁h(x^{k_l}), α_u^{k_l}, q₂}.

With the help of Lemma 3.8, the following two lemmas show that there always exists some acceptable step size α such that x^k + αd^k is accepted as an f-type iteration point under certain conditions.
Lemma 3.9. Suppose that Assumptions (A1)-(A2) hold. If x^{k_l} is feasible but not optimal, then either x^{k_l} + d^{k_l} + d̂^{k_l} is an f-type iteration point or there exists α_0^{k_l} > α_min^{k_l} such that x^{k_l} + α_0^{k_l}d^{k_l} is an f-type iteration point.

Proof. The conclusion follows immediately if x^{k_l} + d^{k_l} + d̂^{k_l} is an f-type iteration point. Otherwise, we need to prove that x^{k_l} + α_0^{k_l}d^{k_l} is an f-type iteration point for some α_0^{k_l} > α_min^{k_l}. Since x^{k_l} is feasible but not optimal, we must have that h(x^{k_l}) = 0 and Υ_{k_l} ≥ ɛ with some scalar ɛ > 0. By the mechanism of Algorithm 3 (line 7) and Lemma 3.7, the condition (2.20) is always true if h(x^l) = 0, and therefore only pairs with h(x^l) > 0 can be added into the k_l-th filter. Let

  π^{k_l} := min{h(x^l) : (h(x^l), f(x^l)) ∈ F_{k_l}}.

According to Lemma 3.5 and ‖d^{k_l}‖ ≤ M_d,

  h(x^{k_l} + αd^{k_l}) ≤ α²M_h M_d²   (3.47)

holds for all α ∈ (0, α_u^{k_l}]. If 0 ≤ α ≤ min{α_u^{k_l}, √(βπ^{k_l}/(M_h M_d²))}, then h(x^{k_l} + αd^{k_l}) ≤ βπ^{k_l}, which implies that x^{k_l} + αd^{k_l} is acceptable to the k_l-th filter. Since x^{k_l} is feasible, it follows from the definition of Ω_p that x^{k_l} is in the interior of Ω_p, which together with the boundedness of d^{k_l} shows x^{k_l} + αd^{k_l} ∈ Ω_p for all α in some subinterval of (0, 1]; therefore, we can assume without loss of generality that x^{k_l} + αd^{k_l} ∈ Ω_p for all α ∈ (0, α_u^{k_l}]. By Lemma 3.7, h(x^{k_l}) = 0 implies

  ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂,   (3.48)

which means that the switching condition for Case 1 and Case 2 holds trivially no matter whether α < 1 or α = 1. It follows from (3.37) in Lemma 3.5 and the boundedness of d^{k_l} that

  f(x^{k_l} + αd^{k_l}) − f(x^{k_l}) − αη∇f(x^{k_l})^T d^{k_l} ≤ −α(1 − η)ɛ₂ + α²M_f M_d².

Thus, the sufficient reduction condition (2.21) holds if 0 ≤ α ≤ (1 − η)ɛ₂/(M_f M_d²). When 0 ≤ α ≤ min{α_u^{k_l}, ηɛ₂/(γM_h M_d²)}, it is true from (3.47) that h(x^{k_l} + αd^{k_l}) ≤ αηɛ₂/γ. Combining with (2.21) and (3.48) yields f(x^{k_l} + αd^{k_l}) ≤ f(x^{k_l}) − γh(x^{k_l} + αd^{k_l}), i.e., x^{k_l} + αd^{k_l} is acceptable to x^{k_l}.
From (2.24) and the above proof, we have α_min^{k_l} = 0, and we can choose any α in (0, ᾱ^{k_l}] as α_0^{k_l} such that x^{k_l} + α_0^{k_l}d^{k_l} is an f-type iteration point, where

  ᾱ^{k_l} := min{ α_u^{k_l}, √(βπ^{k_l}/(M_h M_d²)), (1 − η)ɛ₂/(M_f M_d²), ηɛ₂/(γM_h M_d²) }.

Lemma 3.10. Suppose that Assumptions (A1)-(A2) hold. Let {x^{k_l}} be an infinite subsequence of {x^k} on which (h(x^{k_l}), f(x^{k_l})) is added into the filter, and assume that {x^{k_l}} converges to x* and A_{k_l} keeps unchanged for all k_l. If x* is not a KKT point, then for all sufficiently large k_l, either x^{k_l} + d^{k_l} + d̂^{k_l} is an f-type iteration point or there exists α_0^{k_l} > α_min^{k_l} such that x^{k_l} + α_0^{k_l}d^{k_l} is an f-type iteration point.

Proof. If x^{k_l} + d^{k_l} + d̂^{k_l} is an f-type iteration point, the conclusion follows immediately. It suffices to prove the assertion for x^{k_l} + α_0^{k_l}d^{k_l}. Since x* is not a KKT point, it follows from Remark 3.2 that there exists a scalar
ɛ > 0 such that Υ_{k_l} ≥ ɛ for all sufficiently large k_l. In the case of h(x^{k_l}) = 0, the conclusion follows from Lemma 3.9. We now consider the remaining iterations k_l with h(x^{k_l}) > 0. As Υ_{k_l} ≥ ɛ, if h(x^{k_l}) ≤ ɛ₁, then by Lemma 3.7,

  ∇f(x^{k_l})^T d^{k_l} ≤ −ɛ₂.   (3.49)

If h(x^{k_l}) < ɛ₁ and α ≤ min{q₁h(x^{k_l}), α_u^{k_l}, q₂}, it follows from Lemma 3.8 that x^{k_l} + αd^{k_l} is acceptable to the k_l-th filter. Since {x^{k_l}} converges to x* and A_{k_l} keeps unchanged for all k_l, Corollary 3.6 implies that α_u^{k_l} is independent of k_l, and we thereby drop the superscript k_l in α_u^{k_l} for the simplicity of the following proof. Analogous to the proof of Lemma 3.9, if 0 ≤ α ≤ (1 − η)ɛ₂/(M_f M_d²), the sufficient reduction condition (2.21) is fulfilled. Again using Corollary 3.6 and the boundedness of d^{k_l}, for 0 ≤ α ≤ α_u,

  h(x^{k_l} + αd^{k_l}) ≤ h(x^{k_l}) − αh(x^{k_l}) + α²M_h M_d²,

and therefore h(x^{k_l} + αd^{k_l}) ≤ h(x^{k_l}) if 0 ≤ α ≤ min{q₁h(x^{k_l}), α_u}, where q₁ is defined as in Lemma 3.8. On the other hand, if (2.20) is true, it follows from (2.21) that

  f(x^{k_l} + αd^{k_l}) ≤ f(x^{k_l}) + αη∇f(x^{k_l})^T d^{k_l} ≤ f(x^{k_l}) − ηδh²(x^{k_l}) ≤ f(x^{k_l}) − γh²(x^{k_l}),

where the last inequality follows from δ ≥ γ/η. Hence, x^{k_l} + αd^{k_l} is acceptable to x^{k_l}. Since h(x^{k_l}) → 0 due to Theorem 3.2, according to the definition of Ω_p, x^{k_l} is in the interior of Ω_p for all sufficiently large k_l, which together with the boundedness of d^{k_l} implies x^{k_l} + αd^{k_l} ∈ Ω_p for all α in some subinterval of (0, 1] and all sufficiently large k_l; we assume without loss of generality that x^{k_l} + αd^{k_l} ∈ Ω_p for all α ∈ (0, α_u] for all sufficiently large k_l. Therefore, we have now shown that for all sufficiently large k_l, x^{k_l} + αd^{k_l} is acceptable to x^{k_l} and the k_l-th filter, x^{k_l} + αd^{k_l} ∈ Ω_p, and the sufficient reduction condition (2.21) holds if (2.20) is satisfied, provided 0 ≤ α ≤ ᾱ^{k_l}, where

  ᾱ^{k_l} := min{ q₁h(x^{k_l}), q₂, α_u, (1 − η)ɛ₂/(M_f M_d²) },

and

  h(x^{k_l}) ≤ min{ 1, ɛ₁, rq₁ɛ₂/δ },   (3.50)

and r is from line 20 in Algorithm 3. Let α_0^{k_l} denote the first trial step size in the sequence {1, r, r², ...} that satisfies α ≤ ᾱ^{k_l}.
In view of Theorem 3.2, h(x^{k_l}) tends to zero as k_l → ∞, and therefore ᾱ^{k_l} = q₁h(x^{k_l}) and (3.50) is satisfied for all sufficiently large k_l. It is evident that

  α_0^{k_l} ≥ rᾱ^{k_l} = rq₁h(x^{k_l})   (3.51)

for all sufficiently large k_l. Using (3.49) and (3.51) we have

  −α_0^{k_l}∇f(x^{k_l})^T d^{k_l} ≥ rq₁ɛ₂h(x^{k_l}),

which together with (3.50) implies that the switching condition (2.20) for Case 1 is satisfied. Lastly, we show α_0^{k_l} > α_min^{k_l}. Noting the definition (2.24) of α_min^{k_l}, and using (3.49) and Theorem 3.2, we know

  α_min^{k_l} = δh²(x^{k_l}) / (−∇f(x^{k_l})^T d^{k_l})
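The f-type direction system (2.6) that drives this analysis can be solved, as in the proof of Lemma 3.3, through the reduced system with W_k = ∇c_{A_k}^T B_k^{−1} ∇c_{A_k}. The sketch below uses plain dense algebra and is only illustrative: the paper's actual implementation instead exploits the L-BFGS representation of B_k and the sparse block Jacobian of the ball/sphere constraints (one gradient 2x_[i] per active block).

```python
import numpy as np

# Illustrative dense solve of the quasi-Newton KKT system (2.6):
#   B d + A lam = -grad_f,   A^T d = -c,
# via the reduced ("Schur") system  W lam = c - A^T B^{-1} grad_f,
# where W = A^T B^{-1} A and A collects the active constraint gradients.
def kkt_direction(B, A, grad_f, c):
    Binv_g = np.linalg.solve(B, grad_f)
    Binv_A = np.linalg.solve(B, A)
    W = A.T @ Binv_A                       # small m x m reduced system
    lam = np.linalg.solve(W, c - A.T @ Binv_g)
    d = -Binv_g - Binv_A @ lam
    return d, lam

# Tiny check on one active ball constraint c(x) = ||x||^2 - 1 at x = (2, 0):
B = np.eye(2)
x = np.array([2.0, 0.0])
A = (2 * x).reshape(2, 1)                  # grad c = 2x, one column
c = np.array([x @ x - 1.0])
g = np.array([1.0, 1.0])
d, lam = kkt_direction(B, A, g, c)
assert np.allclose(B @ d + A @ lam, -g)    # stationarity equation of (2.6)
assert np.allclose(A.T @ d, -c)            # linearized constraint of (2.6)
```

Since W_k has the dimension of the working set (at most the number m of balls/spheres), only small reduced systems need to be solved per iteration, which is the "two or three reduced linear systems" cost mentioned in the abstract.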
REGULARIZED SEQUENTIAL QUADRATIC PROGRAMMING METHODS Philip E. Gill Daniel P. Robinson UCSD Department of Mathematics Technical Report NA-11-02 October 2011 Abstract We present the formulation and analysis
More informationSome new facts about sequential quadratic programming methods employing second derivatives
To appear in Optimization Methods and Software Vol. 00, No. 00, Month 20XX, 1 24 Some new facts about sequential quadratic programming methods employing second derivatives A.F. Izmailov a and M.V. Solodov
More informationA PRIMAL-DUAL TRUST REGION ALGORITHM FOR NONLINEAR OPTIMIZATION
Optimization Technical Report 02-09, October 2002, UW-Madison Computer Sciences Department. E. Michael Gertz 1 Philip E. Gill 2 A PRIMAL-DUAL TRUST REGION ALGORITHM FOR NONLINEAR OPTIMIZATION 7 October
More informationOptimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30
Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained
More informationNumerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen
Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen
More informationSurvey of NLP Algorithms. L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA
Survey of NLP Algorithms L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA NLP Algorithms - Outline Problem and Goals KKT Conditions and Variable Classification Handling
More information5 Quasi-Newton Methods
Unconstrained Convex Optimization 26 5 Quasi-Newton Methods If the Hessian is unavailable... Notation: H = Hessian matrix. B is the approximation of H. C is the approximation of H 1. Problem: Solve min
More information12. Interior-point methods
12. Interior-point methods Convex Optimization Boyd & Vandenberghe inequality constrained minimization logarithmic barrier function and central path barrier method feasibility and phase I methods complexity
More information4TE3/6TE3. Algorithms for. Continuous Optimization
4TE3/6TE3 Algorithms for Continuous Optimization (Algorithms for Constrained Nonlinear Optimization Problems) Tamás TERLAKY Computing and Software McMaster University Hamilton, November 2005 terlaky@mcmaster.ca
More informationA new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints
Journal of Computational and Applied Mathematics 161 (003) 1 5 www.elsevier.com/locate/cam A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality
More informationLecture 15: SQP methods for equality constrained optimization
Lecture 15: SQP methods for equality constrained optimization Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lecture 15: SQP methods for equality constrained
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More informationHYBRID FILTER METHODS FOR NONLINEAR OPTIMIZATION. Yueling Loh
HYBRID FILTER METHODS FOR NONLINEAR OPTIMIZATION by Yueling Loh A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy. Baltimore,
More informationA Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties
A Primal-Dual Interior-Point Method for Nonlinear Programming with Strong Global and Local Convergence Properties André L. Tits Andreas Wächter Sasan Bahtiari Thomas J. Urban Craig T. Lawrence ISR Technical
More informationConstrained Nonlinear Optimization Algorithms
Department of Industrial Engineering and Management Sciences Northwestern University waechter@iems.northwestern.edu Institute for Mathematics and its Applications University of Minnesota August 4, 2016
More informationSpectral gradient projection method for solving nonlinear monotone equations
Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department
More informationPOWER SYSTEMS in general are currently operating
TO APPEAR IN IEEE TRANSACTIONS ON POWER SYSTEMS 1 Robust Optimal Power Flow Solution Using Trust Region and Interior-Point Methods Andréa A. Sousa, Geraldo L. Torres, Member IEEE, Claudio A. Cañizares,
More informationA SHIFTED PRIMAL-DUAL INTERIOR METHOD FOR NONLINEAR OPTIMIZATION
A SHIFTED RIMAL-DUAL INTERIOR METHOD FOR NONLINEAR OTIMIZATION hilip E. Gill Vyacheslav Kungurtsev Daniel. Robinson UCSD Center for Computational Mathematics Technical Report CCoM-18-1 February 1, 2018
More informationNonlinear Programming
Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week
More informationMultidisciplinary System Design Optimization (MSDO)
Multidisciplinary System Design Optimization (MSDO) Numerical Optimization II Lecture 8 Karen Willcox 1 Massachusetts Institute of Technology - Prof. de Weck and Prof. Willcox Today s Topics Sequential
More informationMODIFYING SQP FOR DEGENERATE PROBLEMS
PREPRINT ANL/MCS-P699-1097, OCTOBER, 1997, (REVISED JUNE, 2000; MARCH, 2002), MATHEMATICS AND COMPUTER SCIENCE DIVISION, ARGONNE NATIONAL LABORATORY MODIFYING SQP FOR DEGENERATE PROBLEMS STEPHEN J. WRIGHT
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)
More informationLecture 15 Newton Method and Self-Concordance. October 23, 2008
Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationCONSTRAINED NONLINEAR PROGRAMMING
149 CONSTRAINED NONLINEAR PROGRAMMING We now turn to methods for general constrained nonlinear programming. These may be broadly classified into two categories: 1. TRANSFORMATION METHODS: In this approach
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationAn Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints
An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:
More information1. Introduction. We analyze a trust region version of Newton s method for the optimization problem
SIAM J. OPTIM. Vol. 9, No. 4, pp. 1100 1127 c 1999 Society for Industrial and Applied Mathematics NEWTON S METHOD FOR LARGE BOUND-CONSTRAINED OPTIMIZATION PROBLEMS CHIH-JEN LIN AND JORGE J. MORÉ To John
More information1. Introduction. We consider the general smooth constrained optimization problem:
OPTIMIZATION TECHNICAL REPORT 02-05, AUGUST 2002, COMPUTER SCIENCES DEPT, UNIV. OF WISCONSIN TEXAS-WISCONSIN MODELING AND CONTROL CONSORTIUM REPORT TWMCC-2002-01 REVISED SEPTEMBER 2003. A FEASIBLE TRUST-REGION
More informationA STABILIZED SQP METHOD: GLOBAL CONVERGENCE
A STABILIZED SQP METHOD: GLOBAL CONVERGENCE Philip E. Gill Vyacheslav Kungurtsev Daniel P. Robinson UCSD Center for Computational Mathematics Technical Report CCoM-13-4 Revised July 18, 2014, June 23,
More informationPart 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)
Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x
More informationIBM Research Report. Line Search Filter Methods for Nonlinear Programming: Motivation and Global Convergence
RC23036 (W0304-181) April 21, 2003 Computer Science IBM Research Report Line Search Filter Methods for Nonlinear Programming: Motivation and Global Convergence Andreas Wächter, Lorenz T. Biegler IBM Research
More informationRecent Adaptive Methods for Nonlinear Optimization
Recent Adaptive Methods for Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with James V. Burke (U. of Washington), Richard H. Byrd (U. of Colorado), Nicholas I. M. Gould
More informationUnconstrained optimization
Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationYou should be able to...
Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set
More informationA globally convergent Levenberg Marquardt method for equality-constrained optimization
Computational Optimization and Applications manuscript No. (will be inserted by the editor) A globally convergent Levenberg Marquardt method for equality-constrained optimization A. F. Izmailov M. V. Solodov
More informationLectures 9 and 10: Constrained optimization problems and their optimality conditions
Lectures 9 and 10: Constrained optimization problems and their optimality conditions Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lectures 9 and 10: Constrained
More informationWritten Examination
Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationOptimization and Root Finding. Kurt Hornik
Optimization and Root Finding Kurt Hornik Basics Root finding and unconstrained smooth optimization are closely related: Solving ƒ () = 0 can be accomplished via minimizing ƒ () 2 Slide 2 Basics Root finding
More informationNumerical Optimization
Constrained Optimization Computer Science and Automation Indian Institute of Science Bangalore 560 012, India. NPTEL Course on Constrained Optimization Constrained Optimization Problem: min h j (x) 0,
More informationConstrained Optimization Theory
Constrained Optimization Theory Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Constrained Optimization Theory IMA, August
More informationLIMITED MEMORY BUNDLE METHOD FOR LARGE BOUND CONSTRAINED NONSMOOTH OPTIMIZATION: CONVERGENCE ANALYSIS
LIMITED MEMORY BUNDLE METHOD FOR LARGE BOUND CONSTRAINED NONSMOOTH OPTIMIZATION: CONVERGENCE ANALYSIS Napsu Karmitsa 1 Marko M. Mäkelä 2 Department of Mathematics, University of Turku, FI-20014 Turku,
More informationOn the complexity of an Inexact Restoration method for constrained optimization
On the complexity of an Inexact Restoration method for constrained optimization L. F. Bueno J. M. Martínez September 18, 2018 Abstract Recent papers indicate that some algorithms for constrained optimization
More information