Mesh adaptive direct search with second directional derivative-based Hessian update


Comput Optim Appl
© Springer Science+Business Media New York 2015

Árpád Bűrmen (1), Jernej Olenšek (1), Tadej Tuma (1)
Received: 28 February 2014
(1) Faculty of Electrical Engineering, University of Ljubljana, Tržaška cesta 25, 1000 Ljubljana, Slovenia
Corresponding author: Árpád Bűrmen, arpad.buermen@fe.uni-lj.si

Abstract  The subject of this paper is inequality constrained black-box optimization with mesh adaptive direct search (MADS). The MADS search step can include additional strategies for accelerating the convergence and improving the accuracy of the solution. The strategy proposed in this paper involves building a quadratic model of the function and linear models of the constraints. The quadratic model is built by means of a second directional derivative-based Hessian update. The linear terms are obtained by linear regression. The resulting quadratic programming (QP) problem is solved with a dedicated solver and the original functions are evaluated at the QP solution. The proposed search strategy is computationally less expensive than the quadratically constrained QP strategy in the state-of-the-art MADS implementation (NOMAD). The proposed MADS variant (QPMADS) and NOMAD are compared on four sets of test problems. QPMADS outperforms NOMAD on all four of them for all but the smallest computational budgets.

Keywords  Black-box optimization · Constrained optimization · Mesh adaptive direct search · Second directional derivative · Hessian update · Quadratic models · Quadratic programming

Mathematics Subject Classification  90C30 · 90C56 · 65K05 · 90C20

1 Introduction

In constrained black-box optimization the derivatives are usually not available. The optimization problem is described in the form of a black box capable of computing objective and constraint function values. One of the algorithmic frameworks for solving such problems is mesh adaptive direct search (MADS) [6]. The optimization problem is given by

    \min_{x \in \Omega} f(x),    (1)

where f : \mathbb{R}^n \to \mathbb{R} and \Omega \subseteq \mathbb{R}^n is the feasible region defined by nonlinear constraints. In this paper we limit ourselves to problems where f and the nonlinear constraints cannot be evaluated separately. This is often the case when f and the constraints are obtained from the results of a simulation. One of the advantages of MADS is its capability of handling general nonlinear constraints with the extreme barrier approach, i.e. by replacing f(x) with f_\Omega(x), which is equal to f(x) if x \in \Omega and +\infty otherwise.

Some of the constraints defining \Omega can be of the form

    x_L \le x \le x_H,    (2)

where the vectors x_L and x_H represent lower and upper bounds on the optimization variables. These constraints are trivial to evaluate. We assume the function and the remaining constraints can only be evaluated if (2) is satisfied. This is often the case in simulation-based optimization where points outside the bounds cannot be evaluated due to the inherent limitations of the underlying simulator.

MADS relies on two basic ingredients to deliver certain convergence properties [1,6,27] for nonsmooth functions in the Clarke sense [10] and in the Rockafellar sense [23]. The first one is the restriction of the visited points to a discrete set (mesh). This restriction guarantees the existence of certain subsequences for which favourable convergence properties exist. The second ingredient is the exploration of the neighbourhood of the incumbent solution (also referred to as polling). The normalized directions in which MADS explores the neighbourhood of the incumbent solution are asymptotically dense on the unit sphere.

The exploration mechanism provided by polling is slow, particularly on higher dimensional problems. To accelerate convergence MADS allows for the evaluation of arbitrary points on the current mesh (search). The points can be chosen by an arbitrary algorithm that can also use curvature information to significantly accelerate the convergence towards a solution. The MADS instance in [12] (implemented in the NOMAD software [18]) constructs a quadratic model of the function and quadratic models of the constraints from previously evaluated points in the neighbourhood of the incumbent solution. The obtained subproblem is solved and its solution is used as a trial point for the search. The model can also be used for ordering the poll directions so that directions resulting in the largest incumbent solution improvement predicted by the model are probed first [12]. Sampling and simplex derivatives were used for poll direction ordering and adjusting the mesh size parameter in [14]. Simplex derivatives can be obtained by interpolation or regression of a sufficiently large set of sample points. The smallest

number of points required for computing a full quadratic model (all first and second simplex derivatives) is (n+1)(n+2)/2. This requirement was relaxed in [12], where minimum Frobenius norm models were built using fewer than (n+1)(n+2)/2 points. An alternative approach is the use of the curvature information matrix (CIM) that approximates the Hessian for smooth functions [17]. Finite difference approximations of the second directional derivatives along coordinate directions yield the diagonal elements of the CIM. Extradiagonal elements are obtained by computing finite difference approximations of mixed derivatives in 2-dimensional subspaces defined by pairs of coordinate directions.

For building a quadratic model one needs to compute the gradient and the Hessian matrix (i.e. the first and the second derivatives of the function). The Hessian matrix can be used in many different ways. One of the more straightforward approaches for exploiting the information in this matrix is to align the search directions with its eigenvectors [3]. This helps the algorithm to converge to a second order stationary point.

Computing a complete quadratic model from sample points involves solving a system of (n+1)(n+2)/2 linear equations. To avoid this one can construct a sequence of Hessian approximations using various update formulas. In derivative-based quasi-Newton optimization methods, update formulas for approximating the Hessian matrix have been known since 1959 (see [16]). These update formulas are not suitable for derivative-free methods like MADS, because they require knowledge of the function's gradient. The gradient of a function of n variables can be approximated by evaluating at least n+1 points. If these points are not positioned in a particular manner (i.e. n points must lie along coordinate directions with respect to the (n+1)-th point), a linear system of n equations must be solved to obtain the gradient.

Recently, a simple Hessian matrix update formula based on the second directional derivative was analysed in [19]. By continually applying this update the Hessian approximation converges to the Hessian of the function. The second directional derivative can be approximated using only three distinct collinear points. Such a constellation of points is common in some variants of MADS. For the Hessian update formula in [19] to perform optimally, the directions along which the second directional derivative is approximated must be uniformly distributed. Unfortunately, no known variant of MADS produces uniformly distributed poll directions. The first published MADS variant (LTMADS [6]) tends to distribute most of its search directions along lower dimensional subspaces aligned with the coordinate directions [2,25]. The OrthoMADS variant of MADS [2] was developed with this shortcoming in mind. In [25] it was shown that the poll directions in OrthoMADS tend to concentrate around coordinate directions as the problem dimension increases. A solution that generates the poll directions by means of a QR decomposition of a random matrix was proposed in [25]. It produces seemingly uniformly distributed poll directions, although the uniformity of the distribution was not proven mathematically. Furthermore, the distribution used for generating the random matrix subject to QR decomposition was also not given.

In [12], quadratic models of the objective and constraint functions are constructed. The obtained model of the original optimization problem (also referred to as the subproblem) is solved by a simple MADS algorithm. Because this algorithm uses

only the poll step, the obtained solution of the subproblem may be of very low quality, particularly for higher dimensional problems. The approach proposed in this paper relies on a dedicated quadratic programming (QP) solver, and the proposed algorithm is referred to as QPMADS. Because QP solvers handle only linear constraints, QPMADS uses a linear model of the constraints.

The remainder of the paper is divided as follows. Section 2 introduces the second directional derivative-based Hessian update. Section 3 gives an overview of MADS and outlines the QPMADS algorithm. The approach for generating uniformly distributed poll directions and the update rules for the mesh and the poll size parameters are given in Sect. 4. Section 5 explains the model-based poll direction ordering and the application of the second directional derivative-based Hessian update. Construction of the model and its role in the search step are the subject of Sect. 6. The convergence proof for QPMADS is presented in Sect. 7. QPMADS was tested on several sets of test problems from [12,25] and its performance was compared to the performance of the state-of-the-art MADS implementation in the NOMAD software [12,18]. The results are discussed in Sect. 8.

Notation. \mathbb{N}, \mathbb{Z}, and \mathbb{R} denote, respectively, the sets of natural, integer, and real numbers. Vectors and matrices are denoted by lower- and uppercase bold letters (e.g. x, A), respectively. The identity matrix of appropriate size is denoted by I. Vectors are treated as column vectors and the dot product of two vectors is expressed as a product of two matrices (x^T y). Inequality relations are applied to vectors componentwise (e.g. x < 0 means all components of x are negative). Members of the standard orthonormal basis are denoted by e_i. The expression x \in A is used for specifying that x is a column of matrix A. \|\cdot\|, \|\cdot\|_\infty, and \|\cdot\|_F denote, respectively, the Euclidean, maximum, and Frobenius norms. Sequences are denoted by \{x_k\}_{k=1}^\infty. N(\mu, \sigma^2) denotes the normal distribution with mean \mu and variance \sigma^2. The rounding operator R_k : \mathbb{R}^n \to \mathbb{R}^n is defined as

    R_k(x) \in \arg\min_{y \in G} \|x - y\|,    (3)

where G is the set of available rounded values. When arg min results in multiple points a tie breaking rule chooses a unique point for every x and every G. A tie breaking rule is implicitly applied every time the rounding function of the underlying mathematical library is invoked.

2 The Hessian update

We consider the construction of a quadratic model of a nonlinear function f(x). Let g and B denote the approximate gradient and the approximate Hessian of f(x) at x_0. Then the model is given by

    m(x) = (1/2)(x - x_0)^T B (x - x_0) + g^T (x - x_0) + f(x_0).    (4)

Suppose an approximation of the second directional derivative (d^2/d\alpha^2) f(x_0 + \alpha p) is available. Such an approximation can be obtained from the first three terms of the Taylor series expansion of f:

    f_{2,p}(x_0) = \frac{d^2}{d\alpha^2} f(x_0 + \alpha p) \approx \frac{2}{\alpha_+ - \alpha_-}\left(\frac{f_+ - f_0}{\alpha_+} - \frac{f_- - f_0}{\alpha_-}\right),    (5)

where f_0, f_+, and f_- denote the values of f(x_0 + \alpha p) at \alpha = 0, \alpha = \alpha_+, and \alpha = \alpha_-, respectively. The formula is exact if f is a quadratic function. Three collinear points are often available such that \alpha_+ = -\alpha_- = 1. In this case (5) can be simplified to

    f_{2,p}(x_0) \approx f_+ + f_- - 2 f_0.    (6)

With decreasing \|p\| the approximation becomes more accurate. The second directional derivative of a quadratic function depends only on p and can be expressed as

    f_{2,p} = p^T H p,    (7)

where H is the Hessian of f. In [19] the second directional derivative information was used for updating the Hessian approximation. Let B_+ denote the updated Hessian approximation. By minimizing the Frobenius norm of B_+ - B subject to the linear constraint

    p^T B_+ p = f_{2,p},    (8)

the following update formula is obtained:

    B_+ = B + \frac{f_{2,p} - p^T B p}{\|p\|^4} p p^T.    (9)

The following properties of the update formula (9) justify its use (see [19]).

Lemma 1  Suppose H is the Hessian of a quadratic function f, B is a symmetric matrix, and p \in \mathbb{R}^n is a random vector such that p/\|p\| is uniformly distributed on the unit sphere. Then the update formula given by (9) satisfies

    \|B_+ - H\|_F \le \|B - H\|_F    (10)

and

    E[\|B_+ - H\|_F^2] \le \left(1 - \frac{2}{n(n+2)}\right)\|B - H\|_F^2.    (11)

Proof  See the proof of Theorem 2.1 in [19].

Lemma 1 indicates that the Hessian approximation obtained with (9) converges linearly to H if p/\|p\| is uniformly distributed on the unit sphere.
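The three-point approximation (6) and the rank-one update (9) translate directly into a few lines of NumPy. The sketch below is ours and is not taken from the QPMADS implementation; the function names and the assumption that f accepts a NumPy vector are illustrative only.

    import numpy as np

    def second_directional_derivative(f, x0, p, f0=None):
        # Approximate f_{2,p}(x0) = d^2/da^2 f(x0 + a*p) from the three
        # collinear points x0 - p, x0, x0 + p, as in Eq. (6).
        if f0 is None:
            f0 = f(x0)
        return f(x0 + p) + f(x0 - p) - 2.0 * f0

    def hessian_update(B, p, f2p):
        # Update (9): the minimum Frobenius-norm change of B that satisfies
        # the constraint p^T B_+ p = f2p.
        coeff = (f2p - p @ B @ p) / np.dot(p, p) ** 2
        return B + coeff * np.outer(p, p)

An update of this kind can be applied whenever three collinear points have been evaluated, which is exactly the situation QPMADS engineers in its poll and speculative search steps (Sect. 5).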

In fact, we can relax the requirements for linear convergence.

Lemma 2  Suppose H is the Hessian of a quadratic function f, B is a symmetric matrix, and p \in \mathbb{R}^n is a random vector such that

    E\left[\frac{(p^T (B - H) p)^2}{\|p\|^4}\right] \ge \alpha \|B - H\|_F^2    (12)

for some \alpha > 0. Then the update formula given by (9) satisfies (10) and

    E[\|B_+ - H\|_F^2] \le (1 - \alpha)\|B - H\|_F^2.    (13)

Proof  The proof of (10) in [19] does not depend on the distribution of p. Therefore (10) holds under the assumptions of Lemma 2. To prove (13) we begin by subtracting H from (9). Taking into account (7) results in

    B_+ - H = B - H - \frac{p^T (B - H) p}{\|p\|^4} p p^T.    (14)

Matrix B_+ - H is orthogonal to p p^T in the Frobenius product sense because

    B_+ - B = \frac{p^T (H - B) p}{\|p\|^4} p p^T    (15)

has minimal Frobenius norm subject to constraint (8). Thus we have

    \|B_+ - H\|_F^2 = \|B - H\|_F^2 - \frac{(p^T (B - H) p)^2}{\|p\|^4}.    (16)

Computing the expected value and taking into account (12) yields the desired result.

Lemma 3  Let B_0 be an arbitrary symmetric matrix and let \{H_k\} denote the sequence of Hessians of f corresponding to a sequence of random points \{x_k\} such that E[\|H_k\|_F] is finite. Let the members of the sequence of vectors \{p_k/\|p_k\|\} be independent and uniformly distributed on the unit sphere. Construct a sequence of random matrices using

    B_{k+1} = B_k + \frac{f_{2,p_k}(x_k) - p_k^T B_k p_k}{\|p_k\|^4} p_k p_k^T.    (17)

Suppose H_k converges in squared Frobenius norm to H (i.e. E[\|H_k - H\|_F^2] \to 0). Then E[\|B_k - H\|_F^2] \to 0.

Proof  See the proof of Theorem 2.3 in [19]. The proof relies on Lemma 1.

If we replace Lemma 1 with Lemma 2 we obtain a more general result that requires the directions to satisfy (12). The final result of Lemma 3 justifies the use of update (9) for approximating the Hessian of a general nonlinear function.
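The linear convergence promised by Lemmas 1 and 3 is easy to observe numerically for a fixed quadratic. The following self-contained check is only an illustration of the lemmas; the dimension, seed, and iteration count are arbitrary choices of ours.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))
    H = (A + A.T) / 2.0                 # Hessian of the test quadratic f(x) = 0.5 x^T H x
    B = np.zeros((n, n))                # arbitrary symmetric starting approximation

    for k in range(2000):
        p = rng.standard_normal(n)      # p/||p|| is uniformly distributed on the unit sphere
        f2p = p @ H @ p                 # exact second directional derivative, Eq. (7)
        coeff = (f2p - p @ B @ p) / np.dot(p, p) ** 2
        B = B + coeff * np.outer(p, p)  # update (9)

    print(np.linalg.norm(B - H, 'fro'))  # decays roughly like (1 - 2/(n(n+2)))^(k/2) in expectation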

3 Algorithm outline

In iteration k MADS examines points that lie in scaled directions from the incumbent solution x_k. These scaled directions are members of the set

    G_k = \{\Delta_k^m D z : z \in \mathbb{N}^{n_D}\},    (18)

where the n_D columns of D = GZ form a positive spanning set for \mathbb{R}^n [11], G is an n \times n real matrix, and Z is an n \times n_D integer matrix. Points visited by the algorithm in the k-th iteration lie on the mesh

    M_k = \{x + p : x \in S_k, p \in G_k\},    (19)

where S_k is the set of all points visited by the algorithm in past iterations. The scaling coefficient \Delta_k^m in (18) is also referred to as the mesh size. Let x_k denote the point with the lowest value of f_\Omega found in iterations 1, ..., k-1 and f_k the corresponding value of f_\Omega. One iteration of MADS is given by Algorithm 1.

Algorithm 1  k-th iteration of the MADS framework.
1. Search. Evaluate f_\Omega on a finite subset of M_k.
2. Poll. Select a finite set D_k \subset G_k. Evaluate f_\Omega(x_k + d) for d \in D_k until f_{k+1} < f_k or D_k is exhausted.
3. Update. Choose \Delta_{k+1}^m and \Delta_{k+1}^p.

The length of the poll directions (d) is determined by the poll size (\Delta_k^p). The search (or the poll, for that matter) fails if it does not find a point x for which f_\Omega(x) < f_k. An iteration is deemed successful if either the search or the poll is successful (i.e. does not fail). The poll may be omitted if the search succeeds. The algorithm starts with a given x_0 \in \Omega, f_0 = f_\Omega(x_0) < +\infty, and k = 0.

Let \bar{D}_k denote the set of normalized poll directions \{d/\|d\| : d \in D_k\}. A minimal frame center is a point x_k for which all poll directions d \in D_k fail to satisfy f_\Omega(x_k + d) < f_k. A refining subsequence is any sequence of minimal frame centers \{x_k\}_{k \in K} for which \{\Delta_k^p\}_{k \in K} converges to zero. MADS guarantees favourable convergence properties [6] for limit points of certain refining subsequences if the following requirements are satisfied:

(A) \Delta_k^m/\Delta_{k+1}^m is an integer power of a given nonzero rational constant \tau > 1.
(B) \Delta_{k+1}^m < \Delta_k^m if f_{k+1} = f_k, otherwise \Delta_{k+1}^m \ge \Delta_k^m.
(C) For some C > 0 and all d \in D_k the bound \|d\| \le C \Delta_k^p is satisfied.
(D) \lim_{k \to \infty} \Delta_k^p = 0 if and only if \lim_{k \to \infty} \Delta_k^m = 0.
(E) Limit points of \bar{D}_k in the sense of [13] are positive spanning sets.
(F) The set \bigcup_{k \in K} \bar{D}_k, where K corresponds to all failed iterations, is dense on the unit sphere.

Requirements A-D guarantee the existence of at least one refining subsequence. Requirement E ensures the limit points of refining subsequences are also stationary points if f is smooth and \Omega = \mathbb{R}^n, even when Requirement F is not satisfied (see [5]). Finally, F bestows MADS with convergence properties on nonsmooth functions and constrained problems (see [6]).

The outline of the proposed approach (QPMADS) is given by Algorithm 2. Operator R_k rounds a point to the nearest point in G_k. The feasible region \Omega is defined as

    \Omega = \{x \in \mathbb{R}^n : c(x) \le 0\},    (20)

where the vector valued function c is a map \mathbb{R}^n \to \mathbb{R}^m representing m inequality constraints. To simplify notation we introduce s = x - x_k and c_k = c(x_k). Note that every evaluation of f_\Omega(x) involves the evaluation of f(x) and c(x).

Algorithm 2  k-th iteration of the QPMADS algorithm.
1. QP search step. Use linear regression to build a linear model m_c(s) of c(x) in the neighbourhood of x_k. For given B use linear regression to obtain g for the model m_f(s) = (1/2) s^T B s + g^T s + f_k. Replace B with a positive definite matrix B + \beta I by choosing a sufficiently large \beta \ge 0. Obtain the step s by minimizing m_f(s) subject to m_c(s) \le 0 (use a QP solver). Evaluate f_\Omega at x_k + R_k(s). If f_\Omega(x_k + R_k(s)) < f_k go to step 4.
2. Poll step (performed if the QP search step fails). Select a positive spanning set D_k \subset G_k. Evaluate f_\Omega(x_k + d) for d \in D_k until f_\Omega(x_k + d) < f_k or D_k is exhausted. If a d \in D_k is found satisfying f_\Omega(x_k + d) < f_k set p_k := d, otherwise go to step 4.
3. Speculative search step. Evaluate f_\Omega(x_k + 2 p_k).
4. Update. Compute \Delta_{k+1}^m and \Delta_{k+1}^p.

The quadratic programming (QP) search step first builds a quadratic model of f(x),

    m_f(s) = (1/2) s^T B s + g^T s + f_k,    (21)

and a linear model of c(x),

    m_c(s) = A s + c_k.    (22)

Note that the Hessian approximation B is available because (9) is applied frequently, i.e. every time three collinear points are evaluated. The update formula (9) does not guarantee positive definiteness and therefore B can be indefinite. Because QP solvers are usually capable of handling only convex problems, a positive definite matrix B + \beta I is used instead of B in (21). The modified quadratic model of f and the linear model of c constitute a convex QP subproblem that approximates the original optimization problem in the neighbourhood of x_k along some descent direction of the original QP problem. The modified subproblem can be solved using one of the many available QP solvers. The obtained step s is rounded to the nearest point from G_k (denoted by R_k(s)) and f_\Omega is evaluated at x_k + R_k(s). The QP search fails if f_\Omega(x_k + R_k(s)) \ge f_k.

A failed QP search is followed by a poll that examines the points x_k + d, d \in D_k. If the poll step finds a point x_k + p_k for which f_\Omega(x_k + p_k) < f_k, the speculative search step evaluates the point x_k + 2 p_k further along the direction of the last success in the hope of finding an even lower value of f_\Omega. We also experimented with a speculative step performed after a successful search step, but found that in most cases it had a negative impact on the performance of the algorithm. This is probably due to the fact that most of the information contained in the model is already exploited by the search step.

Note that Algorithm 2 does not completely adhere to the MADS framework given by Algorithm 1 because the speculative search is performed at the end of the corresponding iteration. This does not invalidate the convergence theory given in [6]. See Sect. 7 for the details regarding this issue.

4 Poll direction generation and the update

In QPMADS the second directional derivative is approximated along search and poll directions. Lemma 1 guarantees linear convergence of the approximate Hessian to the actual Hessian if these directions are uniformly distributed. The algorithm in [2] generates sets of orthogonal directions that are not uniformly distributed [25]. To generate sets of uniformly distributed directions we give up orthogonality in favour of almost orthogonal search directions [25].

Algorithm 3  Generating poll directions for the k-th iteration of QPMADS. \{N_t\}_{t=0}^\infty is a sequence of realizations of an n \times n random matrix with independent identically distributed elements from N(0, 1).
1. QR-decompose N_{t_k} into an orthogonal matrix Q_{t_k} and an upper triangular matrix R_{t_k}.
2. Construct a diagonal matrix D_{t_k} with elements d_{ii} = sign(r_{ii}).
3. U_{t_k} := Q_{t_k} D_{t_k}.
4. Construct V_k by rounding the columns of \Delta_k^p U_{t_k} to the nearest points in G_k.
5. The 2n members of D_k are the columns of V_k and -V_k.

The basic idea for generating uniformly distributed poll directions given by Algorithm 3 is based on [24]. Every iteration of QPMADS is assigned an integer index t_k corresponding to the matrix N_{t_k} from which the set of poll directions D_k is constructed. Matrix U is a random orthogonal matrix from the Haar measure on the orthogonal group O_n (see [24] for a proof) and U_{t_k} is its t_k-th realization. The distribution of such a random matrix is invariant to orthogonal transformations, i.e. the distribution of OU is identical to the distribution of U for every orthogonal matrix O. The following theorem is the basis for generating random vectors that are uniformly distributed on the unit sphere (i.e. vectors that point with equal probability in all directions).

Theorem 1  Let U be a random matrix chosen from the Haar measure on the orthogonal group O_n and let a be a unit vector. Then Ua is distributed uniformly on the unit sphere.

Proof  It is sufficient to prove that the distribution of Ua does not change under orthogonal transformations. Let O be an arbitrary orthogonal matrix. The distribution of OU is by assumption identical to the distribution of U. From what was previously said and from O(Ua) = (OU)a we conclude that the distribution of (OU)a is identical to the distribution of Ua, which completes the proof.

By replacing a with a basis vector e_i we can see that every column of U is a random vector uniformly distributed on the unit sphere. By scaling the columns of U_{t_k} with \Delta_k^p and then rounding the results to the nearest points in G_k an almost orthogonal matrix V_k is obtained. Rounding a vector d \in \mathbb{R}^n to the nearest point from G_k introduces a rounding error denoted by \delta. This error is bounded in norm (see [9]):

    \|\delta\| \le n^{1/2} \Delta_k^m/2 = \delta_0.    (23)
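Steps 1-3 of Algorithm 3 have a direct NumPy transcription. In the sketch below the rounding of step 4 is simplified to componentwise rounding onto the mesh spanned by D = [I  -I] with mesh size delta_m; this simplification and the function names are our assumptions, not code from the QPMADS implementation.

    import numpy as np

    def haar_orthogonal(n, rng):
        # Steps 1-3 of Algorithm 3: QR-decompose a Gaussian matrix and fix the
        # column signs with sign(r_ii) so that U is Haar-distributed on O_n.
        N = rng.standard_normal((n, n))
        Q, R = np.linalg.qr(N)
        return Q * np.sign(np.diag(R))

    def poll_directions(n, delta_p, delta_m, rng):
        # Steps 4-5: scale the columns by the poll size, round them to the mesh
        # of size delta_m, and return the 2n directions (columns of V_k and -V_k).
        U = haar_orthogonal(n, rng)
        V = np.round(delta_p * U / delta_m) * delta_m
        return np.hstack([V, -V])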

The effect of scaling and rounding on the distribution of the normalized columns of U can be summarized in the following theorem.

Theorem 2  Let U be a random matrix chosen from the Haar measure on the orthogonal group O_n and let V be obtained by rounding the columns of \Delta_k^p U to the nearest points in G_k. As \Delta_k^m/\Delta_k^p approaches zero, the distribution of the normalized columns of V approaches the uniform distribution on the unit sphere.

Proof  Let u denote a column of U. By assumption we have \|u\| = 1. The corresponding column of V can be expressed as

    v = R_k(\Delta_k^p u) = \Delta_k^p u + \delta.    (24)

The normalized vector v is then given by

    \frac{v}{\|v\|} = \frac{\Delta_k^p u + \delta}{\|\Delta_k^p u + \delta\|} = \frac{\Delta_k^p}{\|\Delta_k^p u + \delta\|} u + \frac{\delta}{\|\Delta_k^p u + \delta\|}.    (25)

From \Delta_k^m/\Delta_k^p \to 0 and (23) we see that \|\delta\|/\Delta_k^p \to 0. Consequently

    \frac{\Delta_k^p}{\|\Delta_k^p u + \delta\|} = \frac{1}{\|u + \delta/\Delta_k^p\|} \to 1.    (26)

The second term in (25) can be rewritten as

    \frac{\delta}{\|\Delta_k^p u + \delta\|} = \frac{\delta}{\Delta_k^p} \cdot \frac{1}{\|u + \delta/\Delta_k^p\|} \to 0.    (27)

By taking into account (25), (26), and (27) we arrive at

    \frac{v}{\|v\|} \to u.    (28)

Vector u is by assumption uniformly distributed on the unit sphere.

The columns of V_k become mutually orthogonal as \Delta_k^m/\Delta_k^p approaches zero. In the last step of Algorithm 3, the set of poll directions D_k is constructed from the columns of V_k such that for every poll direction d it also contains -d.

Every iteration is associated with an index l_k defining the mesh and the poll size. Greater values of l_k correspond to shorter poll directions and finer meshes. The update rules for l_k and t_k are based on those from [2,25]:

    l_{k+1} = l_k + 1   if f_{k+1} = f_k,
    l_{k+1} = l_k - 1   if f_{k+1} < f_k,    (29)

    t_{k+1} = l_{k+1}                  if l_{k+1} > \max_{i \le k} l_i,
    t_{k+1} = 1 + \max_{i \le k} t_i   otherwise.    (30)

The update rule for t_k guarantees that every sequence of minimal frame centers \{x_k\}_{k \in K} for which \{l_k\}_{k \in K} = \{0, 1, 2, ...\} corresponds to the sequence \{N_t\}_{t=0}^\infty. In the first iteration of QPMADS (k = 0), both t_0 and l_0 are set to zero.
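A literal transcription of (29)-(30) as a helper function; l_hist and t_hist are hypothetical lists holding the values l_i and t_i for i \le k.

    def update_indices(l_k, f_next, f_k, l_hist, t_hist):
        # (29): refine the mesh on failure (l+1), coarsen it on success (l-1).
        l_next = l_k + 1 if f_next == f_k else l_k - 1
        # (30): take t = l when the new mesh is finer than any used so far,
        # otherwise move on to a fresh random matrix N_t.
        t_next = l_next if l_next > max(l_hist) else 1 + max(t_hist)
        return l_next, t_next

With l_0 = t_0 = 0 this reproduces the indexing described above.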

The set G_k in QPMADS is obtained from D = [I  -I]. Rounding transforms the orthogonal columns of \Delta_k^p U_{t_k} into the almost orthogonal columns of V_k. If the poll size \Delta_k^p is not sufficiently large compared to \Delta_k^m, then rounding can result in a singular V_k, and consequently D_k no longer positively spans \mathbb{R}^n. Due to this, the mesh and the poll size parameters are updated in the following manner:

    \Delta_k^m = \min\{1, 4^{-l_k}\}/\gamma,    (31)
    \Delta_k^p = 2^{-l_k}.    (32)

The constant \gamma ensures that \Delta_k^p/\Delta_k^m = 2^{|l_k|}\gamma is sufficiently large for all l_k \in \mathbb{Z}. It depends on the problem dimension and can be obtained via the cosine measure of D_k, denoted by cm(D_k) [11]. The cosine measure of a positive basis B_k comprising an orthogonal linear basis for \mathbb{R}^n and its negative is cm(B_k) = n^{-1/2} [11]. By rounding the members of B_k to the nearest points from G_k, a new set of vectors D_k is obtained. Its cosine measure is bounded by [9]

    cm(D_k) \ge \frac{cm(B_k) - \delta_0/\|b_{min}\|}{1 + \delta_0/\|b_{min}\|},    (33)

where b_{min} is the shortest vector in B_k (i.e. the column of \Delta_k^p U_{t_k} with the smallest norm). The set D_k positively spans \mathbb{R}^n if cm(D_k) > 0. Consequently, the rounding error must satisfy [9]

    \delta_0 < n^{-1/2} \|b_{min}\|.    (34)

The columns of U_{t_k} are unit length vectors, so \|b_{min}\| = \Delta_k^p. Therefore (34) can be satisfied with

    n^{1/2} \Delta_k^m/2 < n^{-1/2} \Delta_k^p.    (35)

Rearranging (35) leads to

    \Delta_k^p/\Delta_k^m > n/2.    (36)

For every \gamma > n/2, (36) is satisfied for all l_k \in \mathbb{Z}. In QPMADS we use

    \gamma = 1 + n/2.    (37)
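The parameter updates (31), (32), and (37) are one-liners; a minimal sketch:

    def mesh_and_poll_size(l_k, n):
        # Mesh size (31) and poll size (32) with gamma chosen as in (37);
        # then delta_p/delta_m = gamma * 2**abs(l_k) >= gamma > n/2, so the
        # rounded poll directions keep positively spanning R^n.
        gamma = 1.0 + n / 2.0
        delta_m = min(1.0, 4.0 ** (-l_k)) / gamma
        delta_p = 2.0 ** (-l_k)
        return delta_m, delta_p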

In problems without constraints the set of normalized poll directions d \in D_k resulting in feasible points x_k + d \in \Omega is asymptotically dense in the unit sphere. Let us now consider the case where (2) are the only constraints. Every active bound on the optimization variables halves the set of normalized poll directions that result in feasible points. Because poll directions are generated in a random way, the probability of generating a feasible point decreases exponentially with the number of active bounds. The Hessian update described in Sect. 2 uses the points x_k + d, x_k, and x_k - d, which only makes things worse. The set of directions d for which all three points are feasible reduces to \mathbb{R}^{n-m} when m bounds are active.

For now let us consider the case when both x_k + d and x_k - d violate (2). Then the poll direction d is modified to obtain \tilde{d}. The components of \tilde{d} are equal to the components of d when the corresponding components of x_k + d satisfy (2). The remaining components of \tilde{d} are set to the corresponding components of -d. Poll directions d and -d are removed from D_k and replaced with \tilde{d} and -\tilde{d}. Now in every pair of collinear poll directions there is at least one that generates a point satisfying (2). The modified Hessian update (used when one of the points x_k + \tilde{d} or x_k - \tilde{d} violates (2)) is described in Sect. 5.

5 Ordering the poll steps and applying the Hessian update

Every time three collinear points are evaluated, the corresponding second directional derivative can be approximated using (5), and update (9) can be applied. Lemma 3 assures us that the approximate Hessian will converge to the actual Hessian of f. To maximize the number of times the update is applied, the poll directions d \in D_k are examined in a particular order. Let d_1, d_2, ..., d_n denote the n columns of V_k. The poll directions from D_k are evaluated in the following order:

    d_1, -d_1, d_2, -d_2, ...    (38)

The proposed ordering makes it possible to update the Hessian after every even poll direction using the function values f(x_k - d_i), f(x_k), and f(x_k + d_i). Note that whenever one of the three collinear points lies outside \Omega the function f is still evaluated, as long as the point satisfies (2). This was our initial assumption (i.e. the function and the constraints cannot be evaluated separately). When the bounds given by (2) are not satisfied, the function and the constraints cannot be evaluated.

Suppose x_k + d fails to satisfy (2). Then d is replaced by \tilde{d} (see the last two paragraphs of Sect. 4) such that x_k + \tilde{d} satisfies (2). If x_k - \tilde{d} also satisfies (2), the function and the constraints can be evaluated at both points and the Hessian update can be applied. It is more common that x_k - \tilde{d} fails to satisfy (2). In this case it is replaced with x_k + 2\tilde{d} \in M_k. As the step size becomes sufficiently small, x_k + 2\tilde{d} always satisfies (2), and again we have three collinear points for the Hessian update (x_k, x_k + \tilde{d}, and x_k + 2\tilde{d}).
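The componentwise mirroring that produces \tilde{d} can be sketched in a few lines of NumPy; the function name and the explicit bound vectors are our notation, not the paper's.

    import numpy as np

    def mirror_into_bounds(x_k, d, x_L, x_H):
        # Keep the components of d for which x_k + d respects the bounds (2)
        # and flip the remaining components, following the construction in Sect. 4.
        trial = x_k + d
        keep = (trial >= x_L) & (trial <= x_H)
        return np.where(keep, d, -d)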

One could argue that replacing d with \tilde{d} changes the distribution of the normalized random direction q = \tilde{d}/\|\tilde{d}\| along which the Hessian is updated to such an extent that it is no longer uniform on the unit sphere and Lemma 1 no longer holds. The following theorem shows that the convergence remains linear.

Theorem 3  Suppose d is a random vector such that d/\|d\| is uniformly distributed on the unit sphere and the Hessian update

    B_+ = B + \frac{f_{2,d} - d^T B d}{\|d\|^4} d d^T    (39)

converges linearly to the true Hessian of a quadratic function. If d is replaced with \tilde{d}, the convergence of the Hessian update remains linear.

Proof  According to Lemma 2 the Hessian approximation converges linearly as long as the expected value of (\tilde{d}^T (H - B) \tilde{d})^2/\|\tilde{d}\|^4 = (q^T (H - B) q)^2 satisfies

    E[(q^T (H - B) q)^2] \ge \alpha \|H - B\|_F^2    (40)

for some \alpha > 0. Without loss of generality we can assume H = 0 and \|B\|_F = 1. Now we need to verify

    E[(q^T B q)^2] \ge \alpha > 0  for all B with \|B\|_F = 1.    (41)

Because of the way direction d is replaced with \tilde{d}, the set of normalized directions corresponds to the part S of the unit sphere satisfying inequalities of the form e_i^T q \ge 0 (active lower bound on the i-th variable) or e_i^T q \le 0 (active upper bound on the i-th variable). Within S the normalized random direction is distributed uniformly. This is because changing the sign of one component of d maps one half of the hypersphere to the other. To compute E[(q^T B q)^2] one has to integrate (q^T B q)^2 \ge 0 over S and divide the result by the surface area of S. Suppose the integral over S is not bounded away from zero. Then it must vanish for some \|B\|_F = 1. This is only possible if q^T B q is zero almost everywhere on S. From q^T B q being a quadratic function we conclude B = 0. The obtained contradiction confirms (41).

In [8] it was pointed out that MADS performance can be improved if the poll directions are ordered by ascending angle between d \in D_k and the last direction of success p_{k_0} (angle-based ordering). Let \hat{d}_1, ..., \hat{d}_{2n} denote the ordered poll directions. Because of the way D_k is constructed, the first n vectors \hat{d}_1, ..., \hat{d}_n linearly span \mathbb{R}^n and every member of D_k appears exactly once in the sequence \{\hat{d}_1, -\hat{d}_1, \hat{d}_2, -\hat{d}_2, ...\}. QPMADS uses this sequence instead of (38).

When m_f and m_c are available the poll directions can be ordered according to the predicted constraint violation obtained from m_c and the predicted function decrease obtained from m_f (model-based ordering). This type of poll direction ordering is preferred over the simpler angle-based ordering. The constraint violation corresponding to a step s is given by

    \phi(s) = \sum_{i \in I} (e_i^T (A s + c_k))^2,    (42)

where I denotes the set of constraint indices corresponding to violated constraints,

    I = \{i \in \{1, 2, ..., m\} : e_i^T (A s + c_k) \ge 0\}.    (43)

The primary criterion for the model-based poll direction ordering is the value of \phi(s); i.e., steps corresponding to a lower value of \phi are ordered before steps corresponding to a higher value of \phi. The secondary criterion is based on m_f and is applied only when two poll directions result in the same value of \phi. Poll directions with a lower value of m_f are ordered before poll directions with a higher value of m_f.
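The ordering keys can be computed directly from the models (21) and (22); a minimal sketch, assuming A, c_k, B, g, and f_k are already available:

    import numpy as np

    def constraint_violation(s, A, c_k):
        # phi(s) from (42)-(43): sum of squared violated linear-model constraints.
        r = A @ s + c_k
        return float(np.sum(np.maximum(r, 0.0) ** 2))

    def model_value(s, B, g, f_k):
        # Quadratic model (21), used as the secondary criterion.
        return 0.5 * s @ B @ s + g @ s + f_k

    def ordering_key(s, A, c_k, B, g, f_k):
        # Primary key: predicted violation; secondary key: predicted objective.
        return (constraint_violation(s, A, c_k), model_value(s, B, g, f_k))

Sorting candidate steps with this key (e.g. sorted(steps, key=...)) reproduces the two-level comparison described above.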

Model-based ordering is performed in two stages. In the first stage, pairs of poll directions \{d, -d\} are compared based on \phi and m_f. The resulting n steps, denoted by \check{d}_1, ..., \check{d}_n, form a basis for \mathbb{R}^n. In the second stage these steps are ordered again according to \phi and m_f to obtain the n ordered steps \hat{d}_1, ..., \hat{d}_n. The final ordering of the poll directions is then \hat{d}_1, -\hat{d}_1, \hat{d}_2, -\hat{d}_2, ... Every member of D_k appears exactly once in this sequence.

Finally, after every speculative search where x_k + 2 p_k satisfies the bounds given by (2), the function value is available at three collinear points (x_k, x_k + p_k, and x_k + 2 p_k) and the Hessian approximation can be updated using (9).

6 The quadratic programming search

At the beginning of every QP search a quadratic model of f and a linear model of c are constructed. The models are given by Eqs. (21) and (22). The Hessian approximation B is gradually built using the update formula (9). It corresponds to the quadratic terms in model (21). The linear terms represented by g can be obtained from the computed values of f in the neighbourhood of x_k using linear regression.

QPMADS keeps a list of tuples of the form (x, f(x), c(x)). Every time f_\Omega is evaluated at some point x not in the list, a new tuple (x, f(x), c(x)) is added to the list. The length of the list is limited to 2n+1. If it is exceeded, tuples are purged from the list according to their age (oldest first) until the number of list members drops to 2n+1. No attempts are made to keep the set of stored points well poised [14].

Linear regression is applied to obtain g and A from the list of stored points whenever it has at least n+1 members. The value of g is obtained by minimizing the summed square distance between m_f(x) and f(x) for the points in the list. This problem can be formulated as a linear least squares problem,

    \arg\min_{g \in \mathbb{R}^n} \|D_x g - d_f\|^2,    (44)

where the rows of D_x are the vectors (x - x_k)^T and the components of the vector d_f are the corresponding values of f(x) - f_k - (1/2)(x - x_k)^T B (x - x_k), so that the components of D_x g - d_f are the model errors m_f(x - x_k) - f(x). Only points x \ne x_k satisfying \|x - x_k\| \le \rho \Delta_k^p are used in the regression. Based on numerical experiments we have chosen \rho = 4. Problem (44) can be solved via the singular value decomposition of D_x. Similarly to [15] we replace the singular values that are smaller than 2^{-52} with 2^{-52}.

Similarly, A can be obtained by first formulating a linear least squares problem,

    \arg\min_{A \in \mathbb{R}^{m \times n}} \|D_x A^T - D_c\|_F^2,    (45)

where the rows of D_c are the corresponding values of (c(x) - c_k)^T, so that the rows of D_x A^T - D_c are the constraint model errors. The singular value decomposition of D_x needs to be computed only once for both regressions.
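A sketch of the two regressions (44) and (45) sharing one SVD of D_x. Here X, F, and C are hypothetical arrays holding the stored points (as rows), their function values, and their constraint values, and the sign conventions for d_f and D_c follow the reconstruction given above rather than the original implementation.

    import numpy as np

    def regress_models(X, F, C, x_k, f_k, c_k, B, eps=2.0 ** -52):
        D_x = X - x_k                                   # rows are (x - x_k)^T
        quad = 0.5 * np.einsum('ij,jk,ik->i', D_x, B, D_x)
        d_f = F - f_k - quad                            # right-hand side of (44)
        D_c = C - c_k                                   # right-hand side of (45)
        U, s, Vt = np.linalg.svd(D_x, full_matrices=False)
        s = np.maximum(s, eps)                          # clamp tiny singular values
        pinv = Vt.T @ np.diag(1.0 / s) @ U.T            # regularized pseudoinverse of D_x
        g = pinv @ d_f                                  # solves (44)
        A = (pinv @ D_c).T                              # solves (45); A is m x n
        return g, A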

If both linear regressions succeed, then B, g, A, and c_k define the QP problem

    \arg\min_{s \in \mathbb{R}^n} (1/2) s^T B s + g^T s,    (46)
    subject to  A s + c_k \le 0.    (47)

If B is indefinite, the corresponding QP problem is NP-hard. QP problems with a positive semi-definite B are convex and can be solved in polynomial time. There exist several software packages for solving such problems. Our implementation of QPMADS uses the CVXOPT package [4].

When B is indefinite the QP problem given by (46) and (47) is modified to obtain a convex QP problem. This modified QP problem approximates the original one in the neighbourhood of s = 0 along some descent direction. For this purpose a Hessian modification ([21], Algorithm 3.3) is applied by replacing B with B + \beta I, where \beta > 0 is sufficiently large so that B + \beta I is positive definite. The value of \beta is found by attempting to apply the Cholesky algorithm to B + \beta I for increasing values of \beta until the decomposition succeeds. When \beta > 0, the constraint

    \|s\| \le \rho_{neg} \Delta_k^p    (48)

is added to the constraints (47) to prevent s from becoming too large and resulting in points where the modified QP problem differs significantly from the original one. QPMADS uses \rho_{neg} = 1 < \rho. Note that problem (46)-(47) is just an approximation. Its purpose is to produce a QP search step that additionally decreases f_\Omega. The (possibly modified) QP problem is solved by a dedicated solver. The obtained step s is rounded to the grid G_k and f_\Omega is evaluated at x_k + R_k(s).

Note that (48) is actually a trust region similar to the one used in [15]. We also considered applying a trust region for the case when B is positive definite, but obtained no significant improvement of the algorithm's performance. In the unconstrained case one could also use a standard trust region solver for the quadratic subproblem and thereby avoid the need to modify B (trust region solvers can handle indefinite Hessians). But even with the proposed crude Hessian modification scheme we still managed to outperform the state-of-the-art MADS solver [18]. Therefore, we did not probe further in the direction of trust region solvers for the unconstrained subproblem. Applying the trust region approach to constrained problems is significantly more complicated, which is why we decided in favour of QP.

If there are insufficient points in the list for linear regression (i.e. fewer than n+1), the linear regression fails, or the QP solver fails to compute s, then the QP search is considered as failed.
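The \beta-search used to convexify the subproblem can be written directly in terms of repeated Cholesky attempts. This is only a sketch in the spirit of [21], Algorithm 3.3; the initial value and the growth factor are our choices, not those of the QPMADS implementation.

    import numpy as np

    def convexify(B, beta0=1e-3, growth=10.0):
        # Return (B + beta*I, beta) for the smallest tried beta >= 0 such that
        # the Cholesky factorization succeeds (B itself is tried first).
        n = B.shape[0]
        beta = 0.0
        while True:
            try:
                np.linalg.cholesky(B + beta * np.eye(n))
                return B + beta * np.eye(n), beta
            except np.linalg.LinAlgError:
                beta = beta0 if beta == 0.0 else beta * growth

When the returned beta is positive, the step-length constraint (48) is added before the subproblem is handed to the QP solver.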

Proof  The update rules (29) and (30) guarantee the existence of a subsequence of iteration indices K' \subseteq K such that t_k covers the whole set \{0, 1, 2, ...\} and \Delta_k^m \to 0 for k \in K' and k \to \infty. One half of the members of D_k are the columns of U_{t_k} scaled by \Delta_k^p and rounded to G_k. The remaining members are their corresponding negatives. The columns of U_{t_k} are uniformly distributed on the unit sphere and are dense with probability 1. All statements in the remainder of the proof hold with probability 1.

For any x satisfying \|x\| = 1 and any \epsilon > 0 there exists an infinite subsequence of indices K'' \subseteq K' such that for every k \in K'' there exists y_k \in \Delta_k^p U_{t_k} for which

    \|x - y_k/\|y_k\|\| < \epsilon/2.    (49)

Rounding y_k to G_k introduces an error \delta with an upper bound on its norm \|\delta\| \le \delta_0 = n^{1/2} \Delta_k^m/2. Let y'_k denote R_k(y_k) \in D_k. It follows that

    \|y_k/\|y_k\| - y'_k/\|y'_k\|\| = \|y_k/\|y_k\| - (y_k + \delta)/\|y_k + \delta\|\| \le \delta_0/\min(\|y_k\|, \|y_k + \delta\|).

Due to the existence of at least one refining subsequence, \Delta_k^m can be made arbitrarily small for all k > k' by choosing a sufficiently large k'. Therefore for all k \in K'' and k > k' we have

    \|y_k/\|y_k\| - y'_k/\|y'_k\|\| \le \epsilon/2.    (50)

Merging (49) and (50) results in

    \|x - y'_k/\|y'_k\|\| \le \epsilon/2 + \epsilon/2 = \epsilon.    (51)

Because y'_k/\|y'_k\| \in \bar{D}_k the proof is complete.

Now we can state the final result.

Theorem 4  QPMADS is a valid MADS instance.

Proof  We begin by showing that Algorithm 2 adheres to the MADS framework given by Algorithm 1. Then we show that Requirements A-F are satisfied.

The QP search step in the k-th iteration is a member of G_k. QPMADS follows the MADS algorithmic framework given by Algorithm 1 with one exception: the speculative search is performed after a successful poll. The convergence proof for the MADS framework applies to refining subsequences. These subsequences correspond to iterations for which polling fails to decrease the value of f_\Omega. A failed poll is not followed by a speculative search. Thus the placement of the speculative search step after the poll step does not invalidate the convergence theory inherited from the MADS framework. Finally, by choosing D = [I  -I] we see that QPMADS corresponds to the MADS framework given by Algorithm 1.

Requirements A and D are satisfied by construction (see (31), (32), and choose \tau = 4). The update rule for l_k given by (29), which is the basis for computing \Delta_k^m, ensures that Requirement B is satisfied. One half of the members of D_k are columns of the orthogonal matrix U_{t_k} scaled by \Delta_k^p and rounded to the nearest members of G_k, while the other half

are their negatives. Rounding introduces an error bounded in norm by \delta_0 = n^{1/2} \Delta_k^m/2 (see (23)). Thus we have

    \|d\| \le \Delta_k^p + \delta_0 \le \Delta_k^p \left(1 + \frac{n^{1/2}}{2} \frac{\Delta_k^m}{\Delta_k^p}\right) = \Delta_k^p \left(1 + \frac{n^{1/2}}{2} \cdot \frac{2^{-|l_k|}}{\gamma}\right).    (52)

Requirement C is satisfied by noting that the last parenthetical expression in (52) is bounded above by C = 1 + n^{1/2}/(2\gamma).

The lower bound on the cosine measure of D_k is given by (33). Because of the way the set of unrounded poll directions is constructed, the lower bound on the cosine measure of D_k is

    cm(D_k) \ge \frac{n^{-1/2} - \delta_0/\Delta_k^p}{1 + \delta_0/\Delta_k^p} = \frac{n^{-1/2} - n^{1/2}\Delta_k^m/(2\Delta_k^p)}{1 + n^{1/2}\Delta_k^m/(2\Delta_k^p)}.    (53)

By taking into account \Delta_k^m/\Delta_k^p \le \gamma^{-1} and \gamma = n/2 + 1 we have

    cm(D_k) \ge \frac{\gamma n^{-1/2} - n^{1/2}/2}{\gamma + n^{1/2}/2} = \frac{2 n^{-1/2}}{n + n^{1/2} + 2} > 0.    (54)

Since the cosine measure of D_k is uniformly bounded away from zero, all limit points of \bar{D}_k must be positive spanning sets (Requirement E). Finally, Lemma 4 ensures that Requirement F is satisfied with probability 1.

Theorem 4 guarantees all of the convergence properties given in [6] with probability 1.

8 Results and discussion

QPMADS was implemented in Python [26]. For numerical computations the NumPy and SciPy libraries were used [22]. The performance of QPMADS was compared to the performance of the NOMAD black-box optimization software [12,18]. All algorithmic parameters were kept at their respective default values, except for the following ones:

1. the set of poll directions comprised an orthogonal linear basis for \mathbb{R}^n and its negative,
2. the algorithm was stopped when \Delta_k^p fell below a prescribed threshold.

With this choice of algorithmic parameters NOMAD uses the OrthoMADS instance of MADS [2]. Its main drawback is the non-uniform distribution of poll directions [25]. The poll directions in OrthoMADS tend to concentrate around coordinate directions as the problem dimension increases.

All algorithms were stopped when the poll size parameter \Delta_k^p dropped below this threshold. The number of function evaluations was limited to 1000(n+1). For every problem and every tested algorithm two runs were performed. The first run didn't use quadratic models, implying the QP search was omitted and a

simple poll direction ordering was used. In the second run, the QP search was included and model-based poll direction ordering was used whenever this was possible (i.e. a model was successfully computed). The four MADS instances were tested on four sets of test problems. The first three [25] comprised 60 smooth problems (problem dimensions between 3 and 40), 62 nonsmooth problems (problem dimensions between 4 and 40), and 28 constrained problems (problem dimensions between 4 and 40). The fourth set (also referred to as the MADSMODEL set) comprised a mix of 48 test problems (smooth, nonsmooth, and constrained) from [12] with problem dimensions between 2 and 50.

The performance of the tested algorithms was expressed in terms of data profiles [20]. A data profile visualizes the share of the problems solved by an algorithm with respect to the computational budget expressed in terms of simplex gradient evaluations (one simplex gradient evaluation corresponds to n+1 function evaluations). A problem is considered as solved when a feasible point is found for which the corresponding value of the function subject to optimization satisfies

    f < f_L + \epsilon (f_0 - f_L),    (55)

where f_0 denotes the function value at the initial point, and f_L the lowest function value found by all considered algorithms given the maximal allowed budget (in our case 1000(n+1) function evaluations). The accuracy given by \epsilon was set to 10^{-3}. A data profile is a monotonically increasing function of the computational budget. It is bounded from above by 1. The data profiles are depicted in Fig. 1.

Without quadratic models the use of uniformly distributed search directions improves the performance of MADS with respect to OrthoMADS for smooth and nonsmooth problems. The same can be said for MADSMODEL problems and computational budgets up to 500(n+1). Smooth unconstrained problems admit an open halfspace of descent directions at every point in \mathbb{R}^n. Nonsmooth and constrained problems have points where the set of descent directions can be significantly smaller. For such functions choosing a uniformly distributed random direction is expected to perform better than choosing a non-uniformly distributed random direction (like in OrthoMADS). This is consistent with the observed improvement of QPMADS compared to NOMAD on nonsmooth and MADSMODEL test problems. On the other hand, we found no simple explanation for the worse performance of QPMADS observed on the set of constrained problems.

When quadratic models are used QPMADS performs significantly better than NOMAD on constrained problems. In our numerical experiments both algorithms used the extreme barrier approach for handling constraints. Due to this one might think the performance of NOMAD would improve if the progressive barrier approach had been used [7]. To clarify this issue we compared the performance of QPMADS, NOMAD with extreme barrier, and NOMAD with progressive barrier on the set of constrained problems and on the MADSMODEL set of problems. The former comprises nonlinearly constrained problems only, while the latter contains 10 nonlinearly constrained problems. The data profiles are depicted in Fig. 2. The profiles show that the use of the progressive barrier improves the performance of NOMAD on constrained problems to some extent, but not enough to compete with QPMADS.
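The convergence test (55) behind the data profiles is straightforward to implement. In the sketch below, history is a hypothetical list of the best feasible objective values indexed by the number of function evaluations (None until a feasible point is found).

    def evaluations_to_solve(history, f0, fL, eps=1e-3):
        # Smallest number of function evaluations after which the best feasible
        # value satisfies f < fL + eps*(f0 - fL), as in (55); None if never.
        target = fL + eps * (f0 - fL)
        for n_eval, best_f in enumerate(history, start=1):
            if best_f is not None and best_f < target:
                return n_eval
        return None

A data profile is then the fraction of problems for which evaluations_to_solve does not exceed kappa*(n+1), plotted as a function of the budget kappa expressed in simplex gradient evaluations.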

Fig. 1  Data profiles for QPMADS and NOMAD for 60 smooth (top left), 62 nonsmooth (top right), and 28 constrained (bottom left) problems from [25]. The bottom right data profile was obtained with the set of 48 MADSMODEL problems from [12].

Fig. 2  Data profiles for QPMADS, NOMAD with extreme barrier, and NOMAD with progressive barrier obtained on the set of constrained problems (left) and on the MADSMODEL set (right).

As expected, the use of quadratic models significantly improves the performance of NOMAD (see [12]). NOMAD builds the quadratic models from a set of nearby points evaluated in the past. Building a full quadratic model requires (n+1)(n+2)/2 points. When there are insufficient points available the quadratic model is obtained by finding the matrix B with the smallest Frobenius norm (smallest curvature model). Unlike QPMADS, NOMAD builds quadratic models for all constraints.

Table 1  Percentage of problems solved with a budget of 1000(n+1) function evaluations

                      Smooth    Nonsmooth    Constrained    MADSMODEL
NOMAD, no models
QPMADS, no models
NOMAD
QPMADS

NOMAD's approach to building quadratic models has a disadvantage due to its computational complexity. It requires the solution of a linear system with up to (n+1)(n+2)/2 equations. Thus the computational complexity of the underlying linear subproblems can grow proportionally to n^6. On the other hand, QPMADS approximates the Hessian by updating B. The update itself can be expressed in terms of vector dot products and matrix-vector products with computational complexity proportional to n^2. The computation of the vector g and the matrix A involves linear regression with up to 2n+1 points computed via singular value decomposition (computational complexity proportional to n^3).

The obtained QCQP subproblem in NOMAD is solved by means of a simple MADS algorithm. In constrained problems there are often points where the set of descent directions is much smaller than an open halfspace. Guessing a descent direction with a given step length (which is actually what MADS does in the poll step) can be hard in the neighbourhood of such points. Consequently the quality of the subproblem solution can be low. On the other hand, QPMADS uses a dedicated QP solver that reliably solves the QP subproblem.

When quadratic models were used QPMADS outperformed NOMAD on all four sets of test functions. For small computational budgets (less than 200 simplex gradient evaluations) NOMAD performed slightly better on smooth, nonsmooth, and MADSMODEL problems. We attribute this to the fact that NOMAD uses much more information for building the quadratic models than QPMADS, where the quadratic part of the model is built by means of second directional derivative-based updates.

The percentage of the problems solved with a large computational budget is given in Table 1. QPMADS solved 97% of the smooth problems. On the remaining three sets of test problems QPMADS solved between 74 and 89% of all problems. On constrained problems the success rate of QPMADS was more than two times greater than the success rate of NOMAD.

Table 2 lists the number of function evaluations required for solving selected problems from [25] with accuracy \epsilon = 10^{-3} and \epsilon = 10^{-7} when f_L = f* is used as the lowest function value in (55). The first four problems are smooth, the next three are nonsmooth, and the last two are constrained. The computational budget was set to 1000(n+1) function evaluations. The table illustrates the effect of the required solution accuracy on the number of function evaluations that are needed for finding the solution.

Table 2  Number of function evaluations required for reaching the prescribed precision (\epsilon) for selected problems

Function/constraint      n    f*    With models              Without models
                                    QPMADS      NOMAD        QPMADS      NOMAD
Broyden trid.
Penalty I
Trigonometric
Discrete int. eq.
ElAttar
Active faces
Gen. MAXQ
CB3 I / MAD1 I
LQ / MAD1 I

The first and the second value in every pair correspond to \epsilon = 10^{-3} and \epsilon = 10^{-7}, respectively. A dash indicates a failure to solve a problem with the corresponding precision.


More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods Quasi-Newton Methods General form of quasi-newton methods: x k+1 = x k α

More information

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with Travis Johnson, Northwestern University Daniel P. Robinson, Johns

More information

Self-Concordant Barrier Functions for Convex Optimization

Self-Concordant Barrier Functions for Convex Optimization Appendix F Self-Concordant Barrier Functions for Convex Optimization F.1 Introduction In this Appendix we present a framework for developing polynomial-time algorithms for the solution of convex optimization

More information

Penalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Optimization Methods

Optimization Methods Optimization Methods Decision making Examples: determining which ingredients and in what quantities to add to a mixture being made so that it will meet specifications on its composition allocating available

More information

Numerical optimization

Numerical optimization Numerical optimization Lecture 4 Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Numerical geometry of non-rigid shapes Stanford University, Winter 2009 2 Longest Slowest Shortest Minimal Maximal

More information

MESH ADAPTIVE DIRECT SEARCH ALGORITHMS FOR CONSTRAINED OPTIMIZATION

MESH ADAPTIVE DIRECT SEARCH ALGORITHMS FOR CONSTRAINED OPTIMIZATION SIAM J. OPTIM. Vol. 17, No. 1, pp. 188 217 c 2006 Society for Industrial and Applied Mathematics MESH ADAPTIVE DIRECT SEARCH ALGORITHMS FOR CONSTRAINED OPTIMIZATION CHARLES AUDET AND J. E. DENNIS, JR.

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

Chapter 5 HIGH ACCURACY CUBIC SPLINE APPROXIMATION FOR TWO DIMENSIONAL QUASI-LINEAR ELLIPTIC BOUNDARY VALUE PROBLEMS

Chapter 5 HIGH ACCURACY CUBIC SPLINE APPROXIMATION FOR TWO DIMENSIONAL QUASI-LINEAR ELLIPTIC BOUNDARY VALUE PROBLEMS Chapter 5 HIGH ACCURACY CUBIC SPLINE APPROXIMATION FOR TWO DIMENSIONAL QUASI-LINEAR ELLIPTIC BOUNDARY VALUE PROBLEMS 5.1 Introduction When a physical system depends on more than one variable a general

More information

Review of Classical Optimization

Review of Classical Optimization Part II Review of Classical Optimization Multidisciplinary Design Optimization of Aircrafts 51 2 Deterministic Methods 2.1 One-Dimensional Unconstrained Minimization 2.1.1 Motivation Most practical optimization

More information

A Proximal Method for Identifying Active Manifolds

A Proximal Method for Identifying Active Manifolds A Proximal Method for Identifying Active Manifolds W.L. Hare April 18, 2006 Abstract The minimization of an objective function over a constraint set can often be simplified if the active manifold of the

More information

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems 1 Numerical optimization Alexander & Michael Bronstein, 2006-2009 Michael Bronstein, 2010 tosca.cs.technion.ac.il/book Numerical optimization 048921 Advanced topics in vision Processing and Analysis of

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

USING SIMPLEX GRADIENTS OF NONSMOOTH FUNCTIONS IN DIRECT SEARCH METHODS

USING SIMPLEX GRADIENTS OF NONSMOOTH FUNCTIONS IN DIRECT SEARCH METHODS USING SIMPLEX GRADIENTS OF NONSMOOTH FUNCTIONS IN DIRECT SEARCH METHODS A. L. CUSTÓDIO, J. E. DENNIS JR., AND L. N. VICENTE Abstract. It has been shown recently that the efficiency of direct search methods

More information

Gradient-Based Optimization

Gradient-Based Optimization Multidisciplinary Design Optimization 48 Chapter 3 Gradient-Based Optimization 3. Introduction In Chapter we described methods to minimize (or at least decrease) a function of one variable. While problems

More information

Lecture 6. Numerical methods. Approximation of functions

Lecture 6. Numerical methods. Approximation of functions Lecture 6 Numerical methods Approximation of functions Lecture 6 OUTLINE 1. Approximation and interpolation 2. Least-square method basis functions design matrix residual weighted least squares normal equation

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

Lecture 13: Constrained optimization

Lecture 13: Constrained optimization 2010-12-03 Basic ideas A nonlinearly constrained problem must somehow be converted relaxed into a problem which we can solve (a linear/quadratic or unconstrained problem) We solve a sequence of such problems

More information

Notes on Cellwise Data Interpolation for Visualization Xavier Tricoche

Notes on Cellwise Data Interpolation for Visualization Xavier Tricoche Notes on Cellwise Data Interpolation for Visualization Xavier Tricoche urdue University While the data (computed or measured) used in visualization is only available in discrete form, it typically corresponds

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Linear Algebra A Brief Reminder Purpose. The purpose of this document

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St Structured Lower Rank Approximation by Moody T. Chu (NCSU) joint with Robert E. Funderlic (NCSU) and Robert J. Plemmons (Wake Forest) March 5, 1998 Outline Introduction: Problem Description Diculties Algebraic

More information

Trust Regions. Charles J. Geyer. March 27, 2013

Trust Regions. Charles J. Geyer. March 27, 2013 Trust Regions Charles J. Geyer March 27, 2013 1 Trust Region Theory We follow Nocedal and Wright (1999, Chapter 4), using their notation. Fletcher (1987, Section 5.1) discusses the same algorithm, but

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

On Lagrange multipliers of trust-region subproblems

On Lagrange multipliers of trust-region subproblems On Lagrange multipliers of trust-region subproblems Ladislav Lukšan, Ctirad Matonoha, Jan Vlček Institute of Computer Science AS CR, Prague Programy a algoritmy numerické matematiky 14 1.- 6. června 2008

More information

Quadratic Programming

Quadratic Programming Quadratic Programming Outline Linearly constrained minimization Linear equality constraints Linear inequality constraints Quadratic objective function 2 SideBar: Matrix Spaces Four fundamental subspaces

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

1 Column Generation and the Cutting Stock Problem

1 Column Generation and the Cutting Stock Problem 1 Column Generation and the Cutting Stock Problem In the linear programming approach to the traveling salesman problem we used the cutting plane approach. The cutting plane approach is appropriate when

More information

Nonlinear Manifold Learning Summary

Nonlinear Manifold Learning Summary Nonlinear Manifold Learning 6.454 Summary Alexander Ihler ihler@mit.edu October 6, 2003 Abstract Manifold learning is the process of estimating a low-dimensional structure which underlies a collection

More information

Basic Math for

Basic Math for Basic Math for 16-720 August 23, 2002 1 Linear Algebra 1.1 Vectors and Matrices First, a reminder of a few basic notations, definitions, and terminology: Unless indicated otherwise, vectors are always

More information

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:

More information

Multidisciplinary System Design Optimization (MSDO)

Multidisciplinary System Design Optimization (MSDO) Multidisciplinary System Design Optimization (MSDO) Numerical Optimization II Lecture 8 Karen Willcox 1 Massachusetts Institute of Technology - Prof. de Weck and Prof. Willcox Today s Topics Sequential

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS)

Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS) Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno (BFGS) Limited memory BFGS (L-BFGS) February 6, 2014 Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb

More information

47-831: Advanced Integer Programming Lecturer: Amitabh Basu Lecture 2 Date: 03/18/2010

47-831: Advanced Integer Programming Lecturer: Amitabh Basu Lecture 2 Date: 03/18/2010 47-831: Advanced Integer Programming Lecturer: Amitabh Basu Lecture Date: 03/18/010 We saw in the previous lecture that a lattice Λ can have many bases. In fact, if Λ is a lattice of a subspace L with

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Arc Search Algorithms

Arc Search Algorithms Arc Search Algorithms Nick Henderson and Walter Murray Stanford University Institute for Computational and Mathematical Engineering November 10, 2011 Unconstrained Optimization minimize x D F (x) where

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee J. Korean Math. Soc. 0 (0), No. 0, pp. 1 0 https://doi.org/10.4134/jkms.j160152 pissn: 0304-9914 / eissn: 2234-3008 DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 5 Nonlinear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

8. Diagonalization.

8. Diagonalization. 8. Diagonalization 8.1. Matrix Representations of Linear Transformations Matrix of A Linear Operator with Respect to A Basis We know that every linear transformation T: R n R m has an associated standard

More information

6. Iterative Methods for Linear Systems. The stepwise approach to the solution...

6. Iterative Methods for Linear Systems. The stepwise approach to the solution... 6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse

More information

Directional Field. Xiao-Ming Fu

Directional Field. Xiao-Ming Fu Directional Field Xiao-Ming Fu Outlines Introduction Discretization Representation Objectives and Constraints Outlines Introduction Discretization Representation Objectives and Constraints Definition Spatially-varying

More information

Interpolation-Based Trust-Region Methods for DFO

Interpolation-Based Trust-Region Methods for DFO Interpolation-Based Trust-Region Methods for DFO Luis Nunes Vicente University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg) July 27, 2010 ICCOPT, Santiago http//www.mat.uc.pt/~lnv

More information

USING SIMPLEX GRADIENTS OF NONSMOOTH FUNCTIONS IN DIRECT SEARCH METHODS

USING SIMPLEX GRADIENTS OF NONSMOOTH FUNCTIONS IN DIRECT SEARCH METHODS Pré-Publicações do Departamento de Matemática Universidade de Coimbra Preprint Number 06 48 USING SIMPLEX GRADIENTS OF NONSMOOTH FUNCTIONS IN DIRECT SEARCH METHODS A. L. CUSTÓDIO, J. E. DENNIS JR. AND

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

Positive semidefinite matrix approximation with a trace constraint

Positive semidefinite matrix approximation with a trace constraint Positive semidefinite matrix approximation with a trace constraint Kouhei Harada August 8, 208 We propose an efficient algorithm to solve positive a semidefinite matrix approximation problem with a trace

More information

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion:

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: Unconstrained Convex Optimization 21 4 Newton Method H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: f(x + p) f(x)+p T f(x)+ 1 2 pt H(x)p ˆf(p) In general, ˆf(p) won

More information

MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2. Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year

MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2. Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2 1 Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year 2013-14 OUTLINE OF WEEK 8 topics: quadratic optimisation, least squares,

More information

Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

More information

1 Non-negative Matrix Factorization (NMF)

1 Non-negative Matrix Factorization (NMF) 2018-06-21 1 Non-negative Matrix Factorization NMF) In the last lecture, we considered low rank approximations to data matrices. We started with the optimal rank k approximation to A R m n via the SVD,

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES IJMMS 25:6 2001) 397 409 PII. S0161171201002290 http://ijmms.hindawi.com Hindawi Publishing Corp. A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

More information

Lecture 7: CS395T Numerical Optimization for Graphics and AI Trust Region Methods

Lecture 7: CS395T Numerical Optimization for Graphics and AI Trust Region Methods Lecture 7: CS395T Numerical Optimization for Graphics and AI Trust Region Methods Qixing Huang The University of Texas at Austin huangqx@cs.utexas.edu 1 Disclaimer This note is adapted from Section 4 of

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L.

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L. McMaster University Advanced Optimization Laboratory Title: A Proximal Method for Identifying Active Manifolds Authors: Warren L. Hare AdvOl-Report No. 2006/07 April 2006, Hamilton, Ontario, Canada A Proximal

More information

Iterative methods for Linear System

Iterative methods for Linear System Iterative methods for Linear System JASS 2009 Student: Rishi Patil Advisor: Prof. Thomas Huckle Outline Basics: Matrices and their properties Eigenvalues, Condition Number Iterative Methods Direct and

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information