Direct Search Based on Probabilistic Descent

S. Gratton, C. W. Royer, L. N. Vicente, Z. Zhang

January 22, 2015

Abstract

Direct-search methods are a class of popular derivative-free algorithms characterized by evaluating the objective function using a step size and a number of polling directions. When applied to the minimization of smooth functions, the polling directions are typically taken from positive spanning sets, which in turn must have at least $n+1$ vectors in an $n$-dimensional variable space. In addition, to ensure the global convergence of these algorithms, the positive spanning sets used throughout the iterations are required to be uniformly non-degenerate, in the sense of having a positive cosine measure bounded away from zero. However, recent numerical results indicated that randomly generating the polling directions without imposing the positive spanning property can improve the performance of these methods, especially when the number of directions is chosen considerably less than $n+1$. In this paper, we analyze direct-search algorithms when the polling directions are probabilistically descent, meaning that with a certain probability at least one of them is of descent type. Such a framework enjoys almost-sure global convergence. More interestingly, we will show a global decaying rate of $1/\sqrt{k}$ for the gradient size, with overwhelmingly high probability, matching the corresponding rate for the deterministic versions of the gradient method or of direct search. Our analysis helps to understand numerical behavior and the choice of the number of polling directions.

Keywords: Derivative-free optimization, direct-search methods, polling, positive spanning sets, probabilistic descent, random directions.
1 Introduction

Minimizing a function without using derivatives has been the subject of recent intensive research, as it poses a number of mathematical and numerical challenges and it appears in various

ENSEEIHT, INPT, rue Charles Camichel, B.P., Toulouse Cedex 7, France (serge.gratton@enseeiht.fr). ENSEEIHT, INPT, rue Charles Camichel, B.P., Toulouse Cedex 7, France (clement.royer@enseeiht.fr); support for this author comes from a doctoral grant of Université Toulouse III Paul Sabatier. CMUC, Department of Mathematics, University of Coimbra, Coimbra, Portugal (lnv@mat.uc.pt, zhang@mat.uc.pt); support for this research was provided by FCT under grants PTDC/MAT/116736/2010 and PEst-C/MAT/UI0324/2011 and by the Réseau Thématique de Recherche Avancée, Fondation de Coopération Sciences et Technologies pour l'Aéronautique et l'Espace, under the grant ADTAO. CERFACS-IRIT joint lab, Toulouse, France (zaikun.zhang@irit.fr); this author works within the framework of the project FILAOS funded by RTRA STAE.

applications of optimization [7]. In a simplified way, there are essentially two paradigms, or main approaches, to design an algorithm for Derivative-Free Optimization with rigorous global convergence properties. One possibility consists of building models based on sampling (typically by quadratic interpolation) for use in a trust-region algorithm, where the accuracy of the models ensures descent for small enough step sizes. The other main possibility is to make use of a number of directions to ensure descent when the size of the step is relatively small, and direct-search methods [7, 20] offer a well-studied framework to use multiple directions in a single iteration. If the objective function is smooth (continuously differentiable), it is well known that at least one of the directions of a positive spanning set (PSS) is a descent direction, where by a PSS we mean a set of directions that spans $\mathbb{R}^n$ with non-negative coefficients. Based on this guarantee of descent, direct-search methods, at each iteration, evaluate the function along the directions in a PSS (a process called polling), using a step size parameter that is reduced when none of the directions leads to some form of decrease. There are a variety of globally convergent methods of this form, depending essentially on the choice of PSS and on the condition to declare an iteration successful (simple or sufficient decrease). Coordinate or compass search, for instance, uses the PSS formed by the $2n$ coordinate directions. Multiple PSSs can be used in direct search, in a finite number when accepting new iterates based on simple decrease (see pattern search [27] and generalized pattern search [3]), or even in an infinite number when accepting new iterates based on sufficient decrease (see generating set search [20]). In any of these cases, the PSSs in question are required to be uniformly non-degenerate, in the sense of not being close to losing their defining property.
More precisely, their cosine measure is required to be uniformly bounded away from zero, which in turn implies the existence, in any PSS used in a direct-search iteration, of a descent direction uniformly bounded away from being orthogonal to the negative gradient. If the objective function is non-smooth, say only assumed locally Lipschitz continuous, the polling directions asymptotically used in a direct-search run are required, in some normalized form, to densely cover the unit sphere. Mesh adaptive direct search [4] encompasses a process of dense generation that leads to global convergence. Such a process can be significantly simplified if the iterates are only accepted based on a sufficient decrease condition [29]. Anyhow, generating polling directions densely in the unit sphere naturally led to the use of randomly generated PSSs [4, 29]. However, one can take the issue of random generation one step further, and ask whether one can randomly generate a set of directions at each direct-search iteration and use it for polling, without checking if it is a PSS. Moreover, one knows that a PSS must have at least $n+1$ elements in $\mathbb{R}^n$, and thus one even questions the need to generate so many directions and asks how many of them are appropriate to use. We were motivated to address this problem given the obvious connection to the trust-region methods based on probabilistic models recently studied in [5], as we will see later in our paper. Simultaneously, we were surprised by the numerical experiments reported in [18], where generating the polling directions free of PSS rules, and possibly in a number less than $n+1$, was indeed beneficial. Similarly to the notion of probabilistically fully linear models in [5], we then introduce in this paper the companion notion of a probabilistically descent set of directions, by requiring at least one of them to make an acute angle with the negative gradient with a certain probability, uniformly across all iterations.
We then analyze what direct search is capable of delivering when the set of polling directions is probabilistically descent. Our algorithmic framework is extremely simple and can be simplistically reduced here to the following: at each iteration $k$, generate a finite set $D_k$ of polling directions that is probabilistically descent; if a $d_k \in D_k$ is found such that
$$f(x_k + \alpha_k d_k) < f(x_k) - \rho(\alpha_k),$$
where $\rho(\alpha_k) = \alpha_k^2/2$ and $\alpha_k$ is the step size, then set $x_{k+1} = x_k + \alpha_k d_k$ and $\alpha_{k+1} = 2\alpha_k$; otherwise set $x_{k+1} = x_k$ and $\alpha_{k+1} = \alpha_k/2$. We will then prove that such a scheme enjoys, with overwhelmingly high probability, a gradient decaying rate of $1/\sqrt{k}$ or, equivalently, that the number of iterations taken to reach a gradient of size $\epsilon$ is $O(\epsilon^{-2})$. The rate is global since no assumption is made on the starting point. For this purpose, we first derive a bound, valid for all realizations of the algorithm, on the number of iterations at which the set of polling directions is descent. Such a bound is shown to be a fraction of $k$ when $k$ is larger than the inverse of the square of the minimum gradient norm attained up to the $k$-th iteration. When descent is probabilistically conditioned on the past, one can then apply a Chernoff-type bound to prove that the probability of the minimum gradient norm decaying with a global rate of $1/\sqrt{k}$ approaches one exponentially. The analysis is carried out for a more general forcing function $\rho$, with the help of an auxiliary function $\varphi$ which is a multiple of the identity when $\rho(\alpha_k) = \alpha_k^2/2$. Although such a direct-search framework, where a set of polling directions is independently randomly generated at each iteration, shares similarities with other randomized algorithmic approaches, the differences are significant. Random search methods typically generate a single direction at each iteration, take steps independently of decrease conditions, and update the step size using some deterministic or probabilistic formula. The generated direction can first be premultiplied by an approximation to the directional derivative along the direction itself [25], and it was shown in [22] how to frame this technique using the concept of smoothing to arrive at global rates of appropriate order.
Methods using multiple points or directions at each iteration, such as evolution strategies (see [11] for a globally convergent version), typically impose some correlation among the direction sets from one iteration to another, as in covariance matrix adaptation [19], and, again, take action and update step sizes independently of decrease conditions. More related to our work is the contribution of [10], a derivative-free line search method that uses random search directions, where at each iteration the step size is defined by a tolerant nonmonotone backtracking line search that always starts from the unitary step. The authors proved global convergence with probability one in [10], but no global rate was provided. In addition, global convergence in [10] is much easier to establish than in our paper, since the backtracking line search always starts from the unitary step, independently of the previous iterations, and consequently the step size is only loosely coupled to the history of the computation. The organization of the paper is the following. First, we recall in Section 2 the main elements of direct-search methods using PSSs and start a numerical illustration of the issues at stake in this paper. Then, we introduce in Section 3 the notion of probabilistic descent and show how it leads direct search to almost-sure global convergence. The main result of our paper is presented in Section 4, where we show that direct search using probabilistic descent conditioned on the past attains the desired global rate with overwhelmingly high probability. In Section 5 we cover a number of related issues, among which how to derive the main result in expectation and what can still be achieved without conditioning on the past. Having already presented our findings, we then revisit the numerical illustrations with additional insight. Section 6 discusses how to extend our analysis to other algorithmic contexts, in particular to trust-region methods based on probabilistic models [5].
The paper is concluded with some final remarks in Section 7.
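To make the simplified scheme described above concrete, the following is a minimal Python sketch of it (the paper's experiments were run in Matlab; the function name, the choice $m = 2$, and the fixed iteration budget are illustrative assumptions, not prescribed by the paper):

```python
import numpy as np

def direct_search(f, x0, m=2, alpha0=1.0, gamma=2.0, theta=0.5,
                  max_iters=500, rng=None):
    """Sketch of the simplified scheme: rho(alpha) = alpha^2/2, the step size
    is doubled on success and halved on failure, and D_k consists of m
    directions drawn uniformly on the unit sphere at every iteration."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    alpha = alpha0
    for _ in range(max_iters):
        rho = 0.5 * alpha ** 2          # forcing function rho(alpha_k)
        # D_k: m i.i.d. directions uniformly distributed on the unit sphere.
        D = rng.standard_normal((m, x.size))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        for d in D:                     # poll step
            if f(x + alpha * d) < f(x) - rho:   # sufficient decrease
                x = x + alpha * d       # successful iteration
                alpha *= gamma
                break
        else:                           # unsuccessful iteration
            alpha *= theta
    return x
```

On a simple smooth test function such as $f(x) = \|x\|^2$, this sketch typically makes progress even with only $m = 2$ polling directions per iteration in $\mathbb{R}^{10}$, which is the phenomenon the paper sets out to explain.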

2 Direct search based on deterministic descent

In this paper we consider the minimization of a function $f : \mathbb{R}^n \to \mathbb{R}$, without imposing any constraints, and assuming that $f$ is smooth, say continuously differentiable. As in [7], we first consider a quite general direct-search algorithmic framework, but without a search step.

Algorithm 2.1 (Direct search) Select $x_0 \in \mathbb{R}^n$, $\alpha_{\max} \in (0, \infty]$, $\alpha_0 \in (0, \alpha_{\max})$, $\theta \in (0,1)$, $\gamma \in [1, \infty)$, and a forcing function $\rho : (0, \infty) \to (0, \infty)$. For each iteration $k = 0, 1, \ldots$:

Poll step: Choose a finite set $D_k$ of non-zero vectors. Start evaluating $f$ at the polling points $\{x_k + \alpha_k d : d \in D_k\}$ following a chosen order. If a poll point $x_k + \alpha_k d_k$ is found such that $f(x_k + \alpha_k d_k) < f(x_k) - \rho(\alpha_k)$, then stop polling, set $x_{k+1} = x_k + \alpha_k d_k$, and declare the iteration successful. Otherwise declare the iteration unsuccessful and set $x_{k+1} = x_k$.

Step size update: If the iteration was successful, set $\alpha_{k+1} = \min\{\gamma \alpha_k, \alpha_{\max}\}$. Otherwise, set $\alpha_{k+1} = \theta \alpha_k$.

End for

An optional search step could be taken, before the poll one, by testing a finite number of points, looking for an $x$ such that $f(x) < f(x_k) - \rho(\alpha_k)$. If such an $x$ were found, one would define $x_{k+1} = x$, declare the iteration successful, and skip the poll step. However, ignoring such a search step allows us to focus on the polling mechanism of Algorithm 2.1, which is the essential part of the algorithm.

2.1 Deterministic descent and positive spanning

The choice of the direction sets $D_k$ in Algorithm 2.1 is a major issue. To guarantee a successful iteration in a finite number of attempts, it is sufficient that at least one of the directions in $D_k$ is a descent one. But global convergence to stationary points requires more, as we cannot have all directions in $D_k$ becoming arbitrarily close to being orthogonal to the negative gradient $-g_k = -\nabla f(x_k)$, and so one must have that
$$\forall k,\ \exists d_k \in D_k, \quad \frac{d_k^\top (-g_k)}{\|d_k\|\,\|g_k\|} \ge \kappa > 0, \qquad (1)$$
for some $\kappa$ that does not depend on $k$.
On the other hand, it is well known [9] (see also [7, 20]) that if $D$ is a PSS, then
$$\forall v \in \mathbb{R}^n,\ \exists d \in D, \quad \frac{d^\top v}{\|d\|\,\|v\|} > 0. \qquad (2)$$
One can then see that condition (1) is easily verified if $\{D_k\}$ is chosen from a finite number of PSSs. When passing from a finite to an infinite number of PSSs, some uniform bound must be

imposed in (2), which essentially amounts to saying that the cosine measure of all the $D_k$ is at least $\kappa$, where by the cosine measure [20] of a PSS $D$ with non-zero vectors we mean the positive quantity
$$\mathrm{cm}(D) = \min_{v \in \mathbb{R}^n \setminus \{0\}} \max_{d \in D} \frac{d^\top v}{\|d\|\,\|v\|}.$$
One observes that imposing $\mathrm{cm}(D_k) \ge \kappa$ is much stronger than (1), and that it is the lack of knowledge of the negative gradient that leads us to the imposition of such a condition. Had we known $g_k = \nabla f(x_k)$, we would have just required $\mathrm{cm}(D_k, -g_k) \ge \kappa$, where $\mathrm{cm}(D, v)$ is the cosine measure of $D$ given $v$, defined by
$$\mathrm{cm}(D, v) = \max_{d \in D} \frac{d^\top v}{\|d\|\,\|v\|}.$$
To avoid an ill-posed definition when $g_k = 0$, we assume by convention that $\mathrm{cm}(D, v) = 1$ when $v = 0$. The condition $\mathrm{cm}(D_k, -g_k) \ge \kappa$ is all one needs to guarantee a successful step for a sufficiently small step size $\alpha_k$. For completeness we show such a result in Lemma 2.1 below, under Assumptions 2.1–2.3, which are assumed here and throughout the paper.

Assumption 2.1 The objective function $f$ is bounded from below and continuously differentiable in $\mathbb{R}^n$, and $\nabla f$ is Lipschitz continuous in $\mathbb{R}^n$. Let then $f_{\mathrm{low}} > -\infty$ be a lower bound of $f$, and $\nu > 0$ be a Lipschitz constant of $\nabla f$.

Assumption 2.2 The forcing function $\rho$ is positive, non-decreasing, and $\rho(\alpha) = o(\alpha)$ when $\alpha \to 0^+$.

The following function $\varphi$ will make the algebra conveniently more concise. For each $t > 0$, let
$$\varphi(t) = \inf\left\{\alpha : \alpha > 0,\ \frac{\rho(\alpha)}{\alpha} + \frac{1}{2}\nu\alpha \ge t\right\}. \qquad (3)$$
It is clear that $\varphi$ is a well-defined non-decreasing function. Note that Assumption 2.2 ensures that $\varphi(t) > 0$ when $t > 0$. When $\rho(\alpha) = c\,\alpha^2/2$, one obtains $\varphi(t) = 2t/(c + \nu)$, which is essentially a multiple of the identity.

Assumption 2.3 For each $k \ge 0$, $D_k$ is a finite set of normalized vectors.

Assumption 2.3 is made for the sake of simplicity; it would actually be enough to assume that the norms of all the polling directions are uniformly bounded away from zero and infinity.

Lemma 2.1 The $k$-th iteration of Algorithm 2.1 is successful if $\mathrm{cm}(D_k, -g_k) \ge \kappa$ and $\alpha_k < \varphi(\kappa \|g_k\|)$.

Proof.
According to the definition of $\mathrm{cm}(D_k, -g_k)$, there exists $d_k \in D_k$ satisfying
$$-d_k^\top g_k = \mathrm{cm}(D_k, -g_k)\,\|d_k\|\,\|g_k\| \ge \kappa\,\|g_k\|.$$

Thus, by a Taylor expansion,
$$f(x_k + \alpha_k d_k) \le f(x_k) + \alpha_k d_k^\top g_k + \frac{\nu}{2}\alpha_k^2 \le f(x_k) - \kappa\alpha_k\|g_k\| + \frac{\nu}{2}\alpha_k^2. \qquad (4)$$
Using the definition of $\varphi$, we obtain from $\alpha_k < \varphi(\kappa\|g_k\|)$ that $\rho(\alpha_k)/\alpha_k + (\nu/2)\alpha_k < \kappa\|g_k\|$. Hence (4) implies $f(x_k + \alpha_k d_k) < f(x_k) - \rho(\alpha_k)$, and thus the $k$-th iteration is successful.

2.2 A numerical illustration

A natural question that arises is whether one can gain by generating the vectors in $D_k$ randomly, hoping that (1) will then be satisfied frequently enough. For this purpose we ran Algorithm 2.1 in Matlab for different choices of the polling directions. The test problems were taken from CUTEr [17] with dimension $n = 40$. We set $\alpha_{\max} = \infty$, $\alpha_0 = 1$, $\theta = 0.5$, $\gamma = 2$, and $\rho(\alpha) = 10^{-3}\alpha^2$. The algorithm terminated when either $\alpha_k$ fell below a given tolerance or a budget of $2000n$ function evaluations was exhausted.

Table 1: Relative performance for different sets of polling directions ($n = 40$). The columns correspond to the polling choices $[I\ {-}I]$, $[Q\ {-}Q]$, $[Q_k\ {-}Q_k]$, and $m = 2n$, $n+1$, $n/2$, $n/4$, $2$, $1$ random directions; the rows to the test problems arglina, arglinb, broydn3d, dqrtic, engval, freuroth, integreq, nondquar, sinquad, and vardim. (The numerical entries of the table were not preserved in this transcription.)

The first three columns of Table 1 correspond to the following PSS choices of the polling directions $D_k$: $[I\ {-}I]$ represents the columns of $I$ and $-I$, where $I$ is the identity matrix of size $n$; $[Q\ {-}Q]$ means the columns of $Q$ and $-Q$, where $Q$ is an orthogonal matrix obtained by the QR decomposition of a random vector, uniformly distributed on the unit sphere of $\mathbb{R}^n$, generated before the start of the algorithm and then fixed throughout the iterations; $[Q_k\ {-}Q_k]$ consists of the columns of $Q_k$ and $-Q_k$, where $Q_k$ is a matrix obtained in the same way as $Q$, except that it is generated independently at each iteration. In the cases of $[I\ {-}I]$ and $[Q\ {-}Q]$, a cyclic polling procedure was applied to accelerate descent, by starting polling at the direction that led to the previous success (if the last iteration was successful) or at the direction right after the last one used (if the last iteration was unsuccessful).
In the remaining columns, $D_k$ consists of $m$ (with $m = 2n$, $n+1$, $n/2$, $n/4$, $2$, $1$) random vectors uniformly distributed on the unit sphere of $\mathbb{R}^n$, generated independently at every iteration.
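The polling sets compared in Table 1 can be generated along the following lines. This is a hypothetical Python sketch (the experiments were run in Matlab), with directions stored as rows; the QR factorization of a Gaussian matrix stands in here for the paper's construction of the orthogonal matrix $Q$ from a random unit vector:

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian
    matrix (an assumed stand-in for the paper's construction of Q)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def polling_set(kind, n, rng, m=None):
    """Return polling directions (as rows) for the variants of Table 1."""
    if kind == "I":          # [I -I]: the 2n coordinate directions
        I = np.eye(n)
        return np.vstack([I, -I])
    if kind == "Q":          # [Q -Q]: columns of an orthogonal Q and of -Q
        Q = random_orthogonal(n, rng)
        return np.vstack([Q.T, -Q.T])
    if kind == "sphere":     # m i.i.d. uniform directions on the unit sphere
        D = rng.standard_normal((m, n))
        return D / np.linalg.norm(D, axis=1, keepdims=True)
    raise ValueError(kind)
```

For the $[Q\ {-}Q]$ variant one would call `polling_set("Q", n, rng)` once and reuse the result, while for $[Q_k\ {-}Q_k]$ and the sphere variants the set is regenerated at every iteration.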

We counted the number of function evaluations taken to drive the function value below $f_{\mathrm{low}} + \varepsilon\,[f(x_0) - f_{\mathrm{low}}]$, where $f_{\mathrm{low}}$ is the true minimal value of the objective function $f$, and $\varepsilon$ is a small tolerance. Given a test problem, we present in each row the ratio between the number of function evaluations taken by the corresponding version and the number of function evaluations taken by the best version. Because of the random nature of the computations, the number of function evaluations was obtained by averaging over ten independent runs, except for the first column. The symbol indicates that the algorithm failed to solve the problem to the required precision at least once in the ten runs. When $D_k$ is randomly generated on the unit sphere (columns $m = 2n$, $n+1$, $n/2$, $n/4$, $2$, $1$ in Table 1), the vectors in $D_k$ do not form a PSS when $m \le n$, and are not guaranteed to do so when $m > n$. However, we can see clearly from Table 1 that randomly generating the polling directions in this way performed evidently better, despite the fact that their use is not covered by the classical theory of direct search. This evidence motivates us to establish the convergence theory of Algorithm 2.1 when the polling directions are randomly generated.

3 Probabilistic descent and global convergence

From now on, we suppose that the polling directions in Algorithm 2.1 are not defined deterministically but generated by a random process $\{D_k\}$. Due to the randomness of $\{D_k\}$, the iterates and step sizes are also random processes, and we denote the random iterates by $\{X_k\}$, with lower-case letters (such as $x_k$) reserved for realizations. We notice, however, that the starting point $x_0$, and thus the initial function value $f(x_0)$ and the initial step size $\alpha_0$, are not random.

3.1 Probabilistic descent

The following concept of probabilistically descent sets of polling directions is critical to our analysis. We use it to describe the quality of the random polling directions in Algorithm 2.1.
Recalling that $g_k = \nabla f(x_k)$, let $G_k$ be the random variable corresponding to $g_k$.

Definition 3.1 The sequence $\{D_k\}$ in Algorithm 2.1 is said to be $p$-probabilistically $\kappa$-descent if $P(\mathrm{cm}(D_0, -G_0) \ge \kappa) \ge p$ and, for each $k \ge 1$,
$$P\big(\mathrm{cm}(D_k, -G_k) \ge \kappa \,\big|\, D_0, \ldots, D_{k-1}\big) \ge p. \qquad (5)$$

Definition 3.1 requires the cosine measure $\mathrm{cm}(D_k, -G_k)$ to be favorable in a probabilistic sense rather than deterministically. We will see that the role of probabilistically descent sets in the analysis of direct search based on probabilistic descent is similar to that of positive spanning sets in the theory of deterministic direct search. Definition 3.1 is inspired by the definition of probabilistically fully linear models [5]. Inequality (5) involves the notion of conditional probability (see [26, Chapter II]) and says essentially that the probability of the event $\{\mathrm{cm}(D_k, -G_k) \ge \kappa\}$ is not smaller than $p$, no matter what happened with $D_0, \ldots, D_{k-1}$. It is stronger than assuming merely that $P(\mathrm{cm}(D_k, -G_k) \ge \kappa) \ge p$. The analysis in this paper would still hold even if a search step were included in the algorithm and possibly taken; the analysis would consider $D_k$ as defined at all iterations, even when the poll step is skipped.
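The probability $p$ in Definition 3.1 can be estimated empirically for a given direction-generation scheme. The sketch below (hypothetical helper names) estimates $P(\mathrm{cm}(D_k, -g) \ge \kappa)$ by Monte Carlo for $m$ directions uniformly distributed on the unit sphere; by rotational invariance of this distribution, the gradient direction can be fixed without loss of generality:

```python
import numpy as np

def cosine_measure_given_v(D, v):
    """cm(D, v) = max_{d in D} d.v / (|d| |v|); by convention 1 if v = 0."""
    nv = np.linalg.norm(v)
    if nv == 0:
        return 1.0
    return max(d @ v / (np.linalg.norm(d) * nv) for d in D)

def descent_probability(m, n, kappa, trials=20000, seed=0):
    """Monte Carlo estimate of p = P(cm(D_k, -g) >= kappa) when D_k holds
    m i.i.d. directions uniformly distributed on the unit sphere of R^n."""
    rng = np.random.default_rng(seed)
    g = np.zeros(n)
    g[0] = 1.0                       # any fixed gradient direction
    hits = 0
    for _ in range(trials):
        D = rng.standard_normal((m, n))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        hits += cosine_measure_given_v(D, -g) >= kappa
    return hits / trials
```

For $\kappa = 0$ the estimate can be checked against the exact value $1 - 2^{-m}$ (the probability that at least one of $m$ independent directions makes an acute angle with $-g$).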

3.2 Global convergence

The global convergence analysis of direct search based on probabilistic descent follows what was done in [5] for trust-region methods based on probabilistic models, and is no more than a reorganization of the arguments there, now applied to direct search when no search step is taken. It is well known that $\alpha_k \to 0$ in deterministic direct search based on sufficient decrease, as in Algorithm 2.1, no matter how the polling directions are defined [20]. A similar result can be derived by applying exactly the same argument to a realization of direct search now based on probabilistic descent.

Lemma 3.1 For each realization of Algorithm 2.1, $\lim_{k \to \infty} \alpha_k = 0$.

For each $k \ge 0$, we now define $Y_k$ as the indicator function of the event {the $k$-th iteration is successful}. Furthermore, $Z_k$ will be the indicator function of the event
$$\{\mathrm{cm}(D_k, -G_k) \ge \kappa\}, \qquad (6)$$
where, again, $\kappa$ is a positive constant independent of the iteration counter. The Bernoulli processes $\{Y_k\}$ and $\{Z_k\}$ will play a major role in our analysis. Their realizations are denoted by $\{y_k\}$ and $\{z_k\}$. Notice, then, that Lemma 2.1 can be restated as follows: given a realization of Algorithm 2.1 and $k \ge 0$, if $\alpha_k < \varphi(\kappa\|g_k\|)$, then $y_k \ge z_k$. Lemmas 2.1 and 3.1 lead to a critical observation, presented below as Lemma 3.2. We observe that such a result holds without any assumption on the probabilistic behavior of $\{D_k\}$.

Lemma 3.2 For the stochastic processes $\{G_k\}$ and $\{Z_k\}$, where $G_k = \nabla f(X_k)$ and $Z_k$ is the indicator of the event (6), it holds that
$$\Big\{\liminf_{k \to \infty} \|G_k\| > 0\Big\} \subseteq \Big\{\sum_{k=0}^{\infty} \big[Z_k \ln\gamma + (1 - Z_k)\ln\theta\big] = -\infty\Big\}. \qquad (7)$$

Proof. Consider a realization of Algorithm 2.1 for which $\liminf_{k \to \infty} \|g_k\|$ is not zero but a positive number $\epsilon$. There exists a positive integer $k_0$ such that, for each $k \ge k_0$, it holds that $\|g_k\| \ge \epsilon/2$ and $\alpha_k < \varphi(\kappa\epsilon/2)$ (because $\alpha_k \to 0$ and $\varphi(\kappa\epsilon/2) > 0$), and consequently $\alpha_k < \varphi(\kappa\|g_k\|)$. Hence we can obtain from Lemma 2.1 that $y_k \ge z_k$. Additionally, we can assume that $k_0$ is large enough to ensure $\alpha_k \le \gamma^{-1}\alpha_{\max}$ for each $k \ge k_0$.
Then the step size update of Algorithm 2.1 gives us
$$\alpha_k = \alpha_{k_0} \prod_{l=k_0}^{k-1} \gamma^{y_l}\,\theta^{1-y_l} \ \ge\ \alpha_{k_0} \prod_{l=k_0}^{k-1} \gamma^{z_l}\,\theta^{1-z_l}$$
for all $k > k_0$. This leads to $\lim_{k \to \infty} \prod_{l=k_0}^{k-1} \gamma^{z_l}\,\theta^{1-z_l} = 0$, since $\alpha_{k_0} > 0$ and $\alpha_k \to 0$. Taking logarithms, we conclude that
$$\sum_{l=k_0}^{\infty} \big[z_l \ln\gamma + (1 - z_l)\ln\theta\big] = -\infty,$$
which completes the proof.

If $\{D_k\}$ is $p_0$-probabilistically $\kappa$-descent, with
$$p_0 = \frac{\ln\theta}{\ln(\gamma^{-1}\theta)}, \qquad (8)$$
then, similarly to [5, Theorem 4.2], it can be checked that the random process $\big\{\sum_{l=0}^{k} [Z_l \ln\gamma + (1 - Z_l)\ln\theta]\big\}$ is a submartingale with bounded increments. Hence the event on the right-hand side of (7) has probability zero (see [5, Theorem 4.1]). Thereby we obtain the confirmation of global convergence of probabilistic direct search provided below.

Theorem 3.1 If $\{D_k\}$ is $p_0$-probabilistically $\kappa$-descent in Algorithm 2.1, then
$$P\Big(\liminf_{k \to \infty} \|G_k\| = 0\Big) = 1.$$

We point out that this liminf-type global convergence result is also implied by our global rate theory of Section 4, under a slightly stronger assumption (see Section 5).

4 Global rate for direct search based on probabilistic descent

For measuring the global rate of decay of the gradient, we consider the gradient $\tilde g_k$ with minimum norm among $g_0, g_1, \ldots, g_k$, and use $\tilde G_k$ to represent the corresponding random variable. Given a positive number $\epsilon$, we define $k_\epsilon$ as the smallest integer $k$ such that $\|g_k\| \le \epsilon$, and denote the corresponding random variable by $K_\epsilon$. It is easy to check that $k_\epsilon \le k$ if and only if $\|\tilde g_k\| \le \epsilon$. In the deterministic case, one either looks at the global decaying rate of $\|\tilde g_k\|$ or one counts the number of iterations (a worst-case complexity bound) needed to drive the norm of the gradient below a given tolerance $\epsilon > 0$. When $\rho(\alpha)$ is a positive multiple of $\alpha^2$, it can be shown [28] that the global rate is of the order of $1/\sqrt{k}$ and that the worst-case complexity bound in number of iterations is of the order of $\epsilon^{-2}$. Since the algorithms are now probabilistic, what interests us are lower bounds, as close as possible to one, on the probabilities $P\big(\|\tilde G_k\| \le O(1/\sqrt{k})\big)$ (global rate) and $P\big(K_\epsilon \le O(\epsilon^{-2})\big)$ (worst-case complexity bound). Our analysis is carried out in two major steps.
In Subsection 4.1, we concentrate on the intrinsic mechanisms of Algorithm 2.1, without assuming any probabilistic property of the polling directions $\{D_k\}$, establishing a bound on $\sum_{l=0}^{k-1} z_l$ for all realizations of the algorithm. In Subsection 4.2, we will see how this property can be used to relate $P(\|\tilde G_k\| \le \epsilon)$ to the lower tail of the random variable $\sum_{l=0}^{k-1} Z_l$; we will then concentrate on the probabilistic behavior of Algorithm 2.1, using a Chernoff bound for the lower tail of $\sum_{l=0}^{k-1} Z_l$ when $\{D_k\}$ is probabilistically descent, and prove the desired lower bounds on $P(\|\tilde G_k\| \le \epsilon)$ and $P(K_\epsilon \le k)$. In order to achieve our goal, we henceforth make an additional assumption on the forcing function $\rho$.

Assumption 4.1 There exist constants $\bar\theta$ and $\bar\gamma$ satisfying $0 < \bar\theta < 1 \le \bar\gamma$ such that, for each $\alpha > 0$, $\rho(\theta\alpha) \le \bar\theta\,\rho(\alpha)$ and $\rho(\gamma\alpha) \le \bar\gamma\,\rho(\alpha)$.

Such an assumption is not restrictive, as it holds in particular for the classical forcing functions of the form $\rho(\alpha) = c\,\alpha^q$, with $c > 0$ and $q > 1$.
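As a quick numerical sanity check of Assumption 4.1 and of the closed form $\varphi(t) = 2t/(c+\nu)$ for $\rho(\alpha) = c\,\alpha^2/2$, one can run the following sketch (the constant values, and the admissible choices $\bar\theta = \theta$ and $\bar\gamma = \gamma^q$, are illustrative assumptions):

```python
import numpy as np

c, nu, theta, gamma, q = 2.0, 3.0, 0.5, 2.0, 2.0
rho = lambda a: 0.5 * c * a ** q        # rho(alpha) = c * alpha^2 / 2

# Assumption 4.1 for this rho, with the assumed admissible constants
# theta_bar = theta (since theta**q <= theta) and gamma_bar = gamma**q:
alphas = np.linspace(1e-3, 10.0, 1000)
assert np.all(rho(theta * alphas) <= theta * rho(alphas))
assert np.all(rho(gamma * alphas) <= gamma ** q * rho(alphas) * (1 + 1e-9))

# phi(t) = inf{alpha > 0 : rho(alpha)/alpha + nu*alpha/2 >= t}; for this rho
# the infimum is attained at alpha = 2t/(c + nu):
phi = lambda t: 2.0 * t / (c + nu)
for t in (0.1, 1.0, 7.5):
    a = phi(t)
    assert abs(rho(a) / a + 0.5 * nu * a - t) < 1e-9
```

The last loop verifies that $\alpha = \varphi(t)$ makes $\rho(\alpha)/\alpha + \nu\alpha/2$ equal to $t$, i.e., that the constraint in definition (3) is active at the infimum.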

4.1 Analysis of step size behavior and number of iterations with descent

The goal of this subsection is to establish a bound on $\sum_{l=0}^{k-1} z_l$ in terms of $k$ and $\|\tilde g_k\|$ for an arbitrary realization of Algorithm 2.1, as presented in Lemma 4.2. To do this, we first show the boundedness of $\sum_{k=0}^{\infty} \rho(\alpha_k)$.

Lemma 4.1 For each realization of Algorithm 2.1,
$$\sum_{k=0}^{\infty} \rho(\alpha_k) \ \le\ \frac{\bar\gamma}{1 - \bar\theta}\,\big[\rho(\gamma^{-1}\alpha_0) + f(x_0) - f_{\mathrm{low}}\big].$$

Proof. Consider a realization of Algorithm 2.1. We assume that there are infinitely many successful iterations, as it is trivial to adapt the argument otherwise. Let $k_i$ be the index of the $i$-th successful iteration ($i \ge 1$). Define $k_0 = -1$ and $\alpha_{-1} = \gamma^{-1}\alpha_0$ for convenience. Let us rewrite $\sum_{k=0}^{\infty} \rho(\alpha_k)$ as
$$\sum_{k=0}^{\infty} \rho(\alpha_k) = \sum_{i=0}^{\infty}\ \sum_{k=k_i+1}^{k_{i+1}} \rho(\alpha_k), \qquad (9)$$
and study first $\sum_{k=k_i+1}^{k_{i+1}} \rho(\alpha_k)$. According to Algorithm 2.1 and the definition of $k_i$, it holds that
$$\alpha_{k+1} \le \gamma\alpha_k \ \text{ for } k = k_i, \qquad \alpha_{k+1} = \theta\alpha_k \ \text{ for } k = k_i+1, \ldots, k_{i+1}-1,$$
which gives
$$\alpha_k \le \gamma\,\theta^{\,k-k_i-1}\,\alpha_{k_i}, \qquad k = k_i+1, \ldots, k_{i+1}.$$
Hence, by the monotonicity of $\rho$ and Assumption 4.1, we have
$$\rho(\alpha_k) \le \bar\gamma\,\bar\theta^{\,k-k_i-1}\,\rho(\alpha_{k_i}), \qquad k = k_i+1, \ldots, k_{i+1}. \qquad (10)$$
Thus (9) and (10) imply
$$\sum_{k=0}^{\infty} \rho(\alpha_k) \ \le\ \sum_{i=0}^{\infty}\ \sum_{k=k_i+1}^{k_{i+1}} \bar\gamma\,\bar\theta^{\,k-k_i-1}\,\rho(\alpha_{k_i}) \ \le\ \frac{\bar\gamma}{1 - \bar\theta} \sum_{i=0}^{\infty} \rho(\alpha_{k_i}). \qquad (11)$$
Inequality (11) is sufficient to conclude the proof because $\alpha_{k_0} = \gamma^{-1}\alpha_0$ and $\sum_{i=1}^{\infty} \rho(\alpha_{k_i}) \le f(x_0) - f_{\mathrm{low}}$, according to the definition of $k_i$.

The bound on the sum of the series is denoted by
$$\beta = \frac{\bar\gamma}{1 - \bar\theta}\,\big[\rho(\gamma^{-1}\alpha_0) + f(x_0) - f_{\mathrm{low}}\big]$$
and is used next to bound the number of iterations with descent.

Lemma 4.2 Given a realization of Algorithm 2.1 and a positive integer $k$,
$$\sum_{l=0}^{k-1} z_l \ \le\ \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}\big)} + p_0\,k.$$

Proof. Consider a realization of Algorithm 2.1. For each $l \in \{0, 1, \ldots, k-1\}$, define
$$v_l = \begin{cases} 1 & \text{if } \alpha_l < \min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}, \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$
A key observation for proving the lemma is
$$z_l \le 1 - v_l + v_l\,y_l. \qquad (13)$$
When $v_l = 0$, inequality (13) is trivial; when $v_l = 1$, Lemma 2.1 implies that $y_l \ge z_l$ (since $\|\tilde g_k\| \le \|\tilde g_{k-1}\| \le \|g_l\|$ for $l \le k-1$), and hence inequality (13) holds. It suffices then to separately prove
$$\sum_{l=0}^{k-1} (1 - v_l) \ \le\ \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}\big)} \qquad (14)$$
and
$$\sum_{l=0}^{k-1} v_l\,y_l \ \le\ p_0\,k. \qquad (15)$$
Because of Lemma 4.1, inequality (14) is justified by the fact that
$$1 - v_l \ \le\ \frac{\rho(\alpha_l)}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}\big)},$$
which in turn is guaranteed by definition (12) and the monotonicity of $\rho$. Now consider inequality (15). If $v_l = 0$ for all $l \in \{0, 1, \ldots, k-1\}$, then (15) holds trivially. Consider then that $v_l = 1$ for some $l \in \{0, 1, \ldots, k-1\}$, and let $\bar l$ be the largest such integer. Then
$$\sum_{l=0}^{k-1} v_l\,y_l = \sum_{l=0}^{\bar l} v_l\,y_l. \qquad (16)$$
Let us estimate the sum on the right-hand side. For each $l \in \{0, 1, \ldots, \bar l\}$, Algorithm 2.1 together with the definitions of $v_l$ and $y_l$ gives
$$\alpha_{l+1} = \min\{\gamma\alpha_l, \alpha_{\max}\} = \gamma\alpha_l \ \text{ if } v_l\,y_l = 1, \qquad \alpha_{l+1} \ge \theta\alpha_l \ \text{ if } v_l\,y_l = 0,$$
which implies
$$\alpha_{\bar l+1} \ \ge\ \alpha_0 \prod_{l=0}^{\bar l} \gamma^{v_l y_l}\,\theta^{1 - v_l y_l}. \qquad (17)$$

On the other hand, since $v_{\bar l} = 1$, we have $\alpha_{\bar l} < \gamma^{-1}\alpha_0$, and hence $\alpha_{\bar l+1} \le \alpha_0$. Consequently, by taking logarithms, one can obtain from inequality (17) that
$$0 \ \ge\ \ln(\gamma\theta^{-1}) \sum_{l=0}^{\bar l} v_l\,y_l + (\bar l + 1)\ln\theta,$$
which leads to
$$\sum_{l=0}^{\bar l} v_l\,y_l \ \le\ \frac{\ln\theta}{\ln(\gamma^{-1}\theta)}\,(\bar l + 1) = p_0\,(\bar l + 1) \ \le\ p_0\,k, \qquad (18)$$
since $\ln(\gamma^{-1}\theta) < 0$. Inequality (15) is then obtained by combining inequalities (16) and (18).

Lemma 4.2 allows us to generalize the worst-case complexity analysis of deterministic direct search [28] to more general forcing functions, when $\gamma > 1$. As we prove such a result, we also gain momentum for the derivation of the global rates for direct search based on probabilistic descent.

Proposition 4.1 Assume that $\gamma > 1$ and consider a realization of Algorithm 2.1. If, for each $k \ge 0$,
$$\mathrm{cm}(D_k, -g_k) \ge \kappa \qquad (19)$$
and
$$\epsilon \ \le\ \frac{\gamma\,\rho(\gamma^{-1}\alpha_0)}{\kappa\,\alpha_0} + \frac{\nu\,\alpha_0}{2\kappa\gamma}, \qquad (20)$$
then
$$k_\epsilon \ \le\ \frac{\beta}{(1 - p_0)\,\rho[\varphi(\kappa\epsilon)]}.$$

Proof. By the definition of $k_\epsilon$, we have $\|\tilde g_{k_\epsilon-1}\| > \epsilon$. Therefore, from Lemma 4.2 (which also holds with $\|\tilde g_k\|$ replaced by $\|\tilde g_{k-1}\|$) and the monotonicity of $\rho$ and $\varphi$, we obtain
$$\sum_{l=0}^{k_\epsilon-1} z_l \ \le\ \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\epsilon)\}\big)} + p_0\,k_\epsilon. \qquad (21)$$
According to (19), $z_k = 1$ for each $k \ge 0$. By the definition of $\varphi$, inequality (20) implies
$$\varphi(\kappa\epsilon) \ \le\ \varphi\Big(\frac{\rho(\gamma^{-1}\alpha_0)}{\gamma^{-1}\alpha_0} + \frac{\nu}{2}\,\gamma^{-1}\alpha_0\Big) \ \le\ \gamma^{-1}\alpha_0. \qquad (22)$$
Hence (21) reduces to
$$k_\epsilon \ \le\ \frac{\beta}{\rho[\varphi(\kappa\epsilon)]} + p_0\,k_\epsilon.$$
Since $\gamma > 1$, one has $p_0 < 1$ from (8), and the proof is completed.

4.2 Global rate with conditioning on the past

In this subsection, we study the probability $P(\|\tilde G_k\| \le \epsilon)$ (and, equivalently, $P(K_\epsilon \le k)$) with the help of Lemma 4.2. First we present a universal lower bound for this probability, which holds without any assumption on the probabilistic behavior of $\{D_k\}$ (Lemma 4.3). Using this bound, we prove that $P(\|\tilde G_k\| \le O(1/\sqrt{k}))$ and $P(K_\epsilon \le O(\epsilon^{-2}))$ are overwhelmingly high when $\rho(\alpha) = c\,\alpha^2/2$ (Corollaries 4.1 and 4.2), which will be given as special cases of the results for general forcing functions (Theorems 4.1 and 4.2).

Lemma 4.3 If
$$\epsilon \ \le\ \frac{\gamma\,\rho(\gamma^{-1}\alpha_0)}{\kappa\,\alpha_0} + \frac{\nu\,\alpha_0}{2\kappa\gamma}, \qquad (23)$$
then
$$P(\|\tilde G_k\| \le \epsilon) \ \ge\ 1 - \pi_k\Big(\frac{\beta}{k\,\rho[\varphi(\kappa\epsilon)]} + p_0\Big), \qquad (24)$$
where $\pi_k(\lambda) = P\big(\sum_{l=0}^{k-1} Z_l \le \lambda k\big)$.

Proof. According to Lemma 4.2 and the monotonicity of $\rho$ and $\varphi$, we have
$$\{\|\tilde G_k\| > \epsilon\} \ \subseteq\ \Big\{\sum_{l=0}^{k-1} Z_l \le \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\epsilon)\}\big)} + p_0\,k\Big\}. \qquad (25)$$
Again, as in (22), inequality (23) implies, by the definition of $\varphi$, that $\varphi(\kappa\epsilon) \le \gamma^{-1}\alpha_0$. Thus we can rewrite (25) as
$$\{\|\tilde G_k\| > \epsilon\} \ \subseteq\ \Big\{\sum_{l=0}^{k-1} Z_l \le \frac{\beta}{\rho[\varphi(\kappa\epsilon)]} + p_0\,k\Big\},$$
which gives us inequality (24) according to the definition of $\pi_k$.

This lemma enables us to lower bound $P(\|\tilde G_k\| \le \epsilon)$ by just focusing on the function $\pi_k$, which is a classical object in probability theory. Various lower bounds can then be established under different assumptions on $\{D_k\}$. Given the assumption that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent, Definition 3.1 implies that
$$P(Z_0 = 1) \ge p \quad \text{and} \quad P(Z_k = 1 \mid Z_0, \ldots, Z_{k-1}) \ge p. \qquad (26)$$
It is known that the lower tail of $\sum_{l=0}^{k-1} Z_l$ obeys a Chernoff-type bound, even when conditioning on the past replaces the more traditional assumption of independence of the $Z_k$'s (see, for instance, [15, Problem 1.7] and [13, Lemma 1.18]). We present such a bound in Lemma 4.4 below and give a proof in Appendix A, where the role of the concept of probabilistic descent can be seen.

Lemma 4.4 Suppose that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent and $\lambda \in (0, p)$. Then
$$\pi_k(\lambda) \ \le\ \exp\Big[-\frac{(p - \lambda)^2}{2p}\,k\Big]. \qquad (27)$$
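Lemma 4.4 can be checked numerically in the simplest setting where the $Z_l$ are i.i.d. Bernoulli($p$), independence being a special case of the conditional requirement (26). The function below (a sketch with illustrative parameter values) compares the empirical lower tail $\pi_k(\lambda)$ with the Chernoff bound:

```python
import numpy as np

def chernoff_lower_tail_check(p=0.6, lam=0.4, k=200, trials=20000, seed=0):
    """Empirical pi_k(lambda) = P(sum_{l<k} Z_l <= lambda*k) for i.i.d.
    Bernoulli(p) variables, versus the bound exp(-(p - lambda)^2 k / (2p))."""
    rng = np.random.default_rng(seed)
    sums = (rng.random((trials, k)) < p).sum(axis=1)
    pi_k = float(np.mean(sums <= lam * k))
    bound = float(np.exp(-(p - lam) ** 2 * k / (2.0 * p)))
    return pi_k, bound

# For the canonical choices gamma = 2, theta = 1/2, the threshold (8) is
# p_0 = ln(theta) / ln(theta/gamma) = 1/2, so any p > 1/2 is admissible.
p0 = np.log(0.5) / np.log(0.5 / 2.0)
```

With the default values above, the empirical tail is far below the bound, consistent with the exponential decay in $k$ asserted by (27).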

Now we are ready to present the main results of this section. In these results we will assume that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent with $p > p_0$, which cannot be fulfilled unless $\gamma > 1$ (since $p_0 = 1$ if $\gamma = 1$).

Theorem 4.1 Suppose that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent with $p > p_0$ and that
$$k \ \ge\ \frac{(1+\delta)\,\beta}{(p - p_0)\,\rho[\varphi(\kappa\epsilon)]} \qquad (28)$$
for some positive number $\delta$, where $\epsilon$ satisfies
$$\epsilon \ \le\ \frac{\gamma\,\rho(\gamma^{-1}\alpha_0)}{\kappa\,\alpha_0} + \frac{\nu\,\alpha_0}{2\kappa\gamma}. \qquad (29)$$
Then
$$P(\|\tilde G_k\| \le \epsilon) \ \ge\ 1 - \exp\Big[-\frac{(p - p_0)^2\,\delta^2}{2p\,(1+\delta)^2}\,k\Big]. \qquad (30)$$

Proof. According to Lemma 4.3 and the monotonicity of $\pi_k$,
$$P(\|\tilde G_k\| \le \epsilon) \ \ge\ 1 - \pi_k\Big(\frac{\beta}{k\,\rho[\varphi(\kappa\epsilon)]} + p_0\Big) \ \ge\ 1 - \pi_k\Big(\frac{p - p_0}{1+\delta} + p_0\Big).$$
Then inequality (30) follows directly from Lemma 4.4.

Theorem 4.1 reveals that, when $\epsilon$ is an arbitrary number small enough to verify (29) and the iteration counter satisfies (28) for some positive number $\delta$, the norm of the gradient is below $\epsilon$ with overwhelmingly high probability. Note that $\epsilon$ is related to $k$ through (28). In fact, if $\rho \circ \varphi$ is invertible (which is true when the forcing function is a multiple of $\alpha^q$ with $q > 1$), then, for any positive integer $k$ sufficiently large to satisfy (28) and (29) all together, one can set
$$\epsilon = \frac{1}{\kappa}\,\varphi^{-1}\Big(\rho^{-1}\Big(\frac{(1+\delta)\,\beta}{(p - p_0)\,k}\Big)\Big), \qquad (31)$$
and what we have in (30) can then be read as (since $\epsilon$ and $k$ satisfy the assumptions of Theorem 4.1)
$$P\Big(\|\tilde G_k\| \le \kappa^{-1}\,\varphi^{-1}\big(\rho^{-1}(O(1/k))\big)\Big) \ \ge\ 1 - \exp(-Ck),$$
with $C$ a positive constant. A particular case is when the forcing function is a multiple of the square of the step size,
$$\rho(\alpha) = \frac{1}{2}\,c\,\alpha^2, \qquad \varphi(t) = \frac{2t}{c+\nu},$$
where it is easy to check the expression for $\varphi$. One can then obtain the global rate $\|\tilde G_k\| \le O(1/\sqrt{k})$ with overwhelmingly high probability, matching [28] for deterministic direct search, as shown below in Corollary 4.1, which is just an application of Theorem 4.1 to this particular forcing function. For simplicity, we will set $\delta = 1$.

Corollary 4.1 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀, ρ(α) = cα²/2, and

    k ≥ 4γ²β / (c(p − p₀)α₀²).    (32)

Then

    P( G_k ≤ β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ k^{1/2}) ) ≥ 1 − exp[ −(p − p₀)²k/(8p) ].    (33)

Proof. As in (31), with δ = 1, let

    ε = κ⁻¹ ϕ⁻¹( ρ⁻¹( 2β / ((p − p₀)k) ) ).    (34)

Then, by straightforward calculations, we have

    ε = β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ k^{1/2}).    (35)

Moreover, inequality (32) gives us

    ε ≤ [β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ)] · [c(p − p₀)α₀² / (4γ²β)]^{1/2} = (c + ν)α₀ / (2κγ).    (36)

Definition (34) and inequality (36) guarantee that k and ε satisfy (28) and (29) for ρ(α) = cα²/2 and δ = 1. Hence we can plug (35) into (30), yielding (33).

Based on Lemmas 4.3 and 4.4 (or directly on Theorem 4.1), one can lower bound P(K_ε ≤ k) and arrive at a worst case complexity result.

Theorem 4.2 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀ and

    ε ≤ γρ(γ⁻¹α₀)/(κα₀) + να₀/(2κγ).

Then, for each δ > 0,

    P( K_ε ≤ ⌈(1 + δ)β / ((p − p₀)ρ(ϕ(κε)))⌉ ) ≥ 1 − exp[ −β(p − p₀)δ² / (2p(1 + δ)ρ(ϕ(κε))) ].    (37)

Proof. Letting

    k = ⌈(1 + δ)β / ((p − p₀)ρ(ϕ(κε)))⌉,

we have P(K_ε ≤ k) = P(G_k ≤ ε), and then inequality (37) follows from Theorem 4.1 as k ≥ (1 + δ)β/((p − p₀)ρ(ϕ(κε))).

Since ρ(α) = o(α) (from the definition of the forcing function ρ) and ϕ(κε) ≤ 2ν⁻¹κε (from definition (3) of ϕ), it holds that the lower bound in (37) goes to one, as ε → 0, faster than 1 − exp(−ε⁻¹). Hence, we conclude from Theorem 4.2 that direct search based on probabilistic descent exhibits a worst case complexity bound, in number of iterations, of the order of 1/ρ(ϕ(κε)) with overwhelmingly high probability, matching Proposition 4.1 for deterministic direct search. To see this effect more clearly, we present in Corollary 4.2 the particularization of Theorem 4.2 when ρ(α) = cα²/2, taking δ = 1 as in Corollary 4.1. The worst case complexity bound is then of the order of 1/ε² with overwhelmingly high probability, matching [28] for deterministic direct search. The same matching happens when ρ(α) is a power of α with exponent q, where the bound is O(ε^{−q/min{q−1,1}}).

Corollary 4.2 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀, ρ(α) = cα²/2, and

    ε ≤ (c + ν)α₀ / (2κγ).

Then

    P( K_ε ≤ β(c + ν)² / [c(p − p₀)κ²] ε⁻² ) ≥ 1 − exp[ −β(p − p₀)(c + ν)² / (8cpκ²) ε⁻² ].    (38)

It is important to understand how the worst case complexity bound (38) depends on the dimension n of the problem. For this purpose, we first need to make explicit the dependence on n of the constant in K_ε ≤ β(c + ν)²/[c(p − p₀)κ²]ε⁻², which can only come from p and κ and is related to the choice of D_k. One can choose D_k as m directions independently and uniformly distributed on the unit sphere, with m independent of n, in which case p is a constant larger than p₀ and κ = τ/√n for some constant τ > 0 (both p and τ are totally determined by γ and θ, without dependence on m or n; see Corollary B.1 in Appendix B and the remarks after it). In such a case, from Corollary 4.2,

    P( K_ε ≤ β(c + ν)² / [c(p − p₀)τ²] n ε⁻² ) ≥ 1 − exp[ −β(p − p₀)(c + ν)² / (8cpκ²) ε⁻² ].

To derive a worst case complexity bound in terms of the number of function evaluations, one just needs then to see that each iteration of Algorithm 2.1 costs at most m function evaluations.
Thus, if K_ε^f represents the number of function evaluations within K_ε iterations, we obtain

    P( K_ε^f ≤ mβ(c + ν)² / [c(p − p₀)τ²] n ε⁻² ) ≥ 1 − exp[ −β(p − p₀)(c + ν)² / (8cpκ²) ε⁻² ].    (39)

The worst case complexity bound is then O(mnε⁻²) with overwhelmingly high probability, which is clearly better than the corresponding bound O(n²ε⁻²) for deterministic direct search [28] if m is chosen an order of magnitude smaller than n.

4.3 High probability iteration complexity

Given a confidence level P, the following theorem presents an explicit bound on the number of iterations which guarantees that G_k ≤ ε holds with probability at least P. Bounds of this type are interesting in practice and have been considered in the theoretical analysis of probabilistic algorithms (see, for instance, [23, 24]).
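To make the comparison between the two evaluation-complexity bounds at the end of Subsection 4.2 concrete, the following sketch (hypothetical values of n, m, and ε; all constant factors dropped) evaluates the dominant terms of O(mnε⁻²) and O(n²ε⁻²):

```python
def eval_complexity_terms(n, m, eps):
    """Dominant terms of the evaluation-complexity bounds: m*n/eps^2 for
    probabilistic descent with m uniform directions (cf. (39)) versus
    n^2/eps^2 for deterministic direct search [28]; constants dropped."""
    return m * n / eps ** 2, n ** 2 / eps ** 2

prob, det = eval_complexity_terms(n=100, m=2, eps=1e-2)
assert prob < det  # with m = 2 << n = 100 the gap is a factor of n/m = 50
```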

Theorem 4.3 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀. Then for any

    ε ≤ γρ(γ⁻¹α₀)/(κα₀) + να₀/(2κγ)

and P ∈ (0, 1), it holds P(G_k ≤ ε) ≥ P whenever

    k ≥ 3β / [2(p − p₀)ρ(ϕ(κε))] − 3p ln(1 − P) / (p − p₀)².    (40)

Proof. By Theorem 4.1, P(G_k ≤ ε) ≥ P is achieved when

    k ≥ max{ (1 + δ)β / [(p − p₀)ρ(ϕ(κε))], −2p(1 + δ)² ln(1 − P) / [δ²(p − p₀)²] }    (41)

for some positive number δ. Hence it suffices to show that the right-hand side of (40) is no smaller than that of (41) for a properly chosen δ. For simplicity, denote

    c₁ = β / [(p − p₀)ρ(ϕ(κε))],    c₂ = −2p ln(1 − P) / (p − p₀)².

Let us consider the positive number δ such that

    (1 + δ)c₁ = [(1 + δ)²/δ²] c₂.

It is easy to check that

    δ = [c₂ + (c₂² + 4c₁c₂)^{1/2}] / (2c₁).

Thus

    max{ (1 + δ)c₁, [(1 + δ)²/δ²] c₂ } = (1 + δ)c₁ ≤ (3/2)(c₁ + c₂),

which completes the proof.

5 Discussion and extensions

5.1 Establishing global convergence from global rate

It is interesting to notice that Theorem 4.1 implies a form of global convergence of direct search based on probabilistic descent, as we mentioned at the end of Subsection 3.2.

Proposition 5.1 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀. Then

    P( inf_{k≥0} G_k = 0 ) = 1.

Proof. We prove the result by contradiction. Suppose that

    P( inf_{k≥0} G_k > 0 ) > 0.

Then there exists a positive constant ε > 0 satisfying (29) and

    P( inf_{k≥0} G_k ≥ ε ) > 0.    (42)

But Theorem 4.1 implies that

    lim_{l→∞} P( G_l ≥ ε ) = 0,

which then contradicts (42), because P(G_l ≥ ε) ≥ P(inf_{k≥0} G_k ≥ ε) for each l ≥ 0.

If we assume, for all realizations of Algorithm 2.1, that the iterates never arrive at a stationary point in a finite number of iterations, then the events {lim inf_{k→∞} G_k = 0} and {inf_{k≥0} G_k = 0} are identical. Thus, in such a case, Proposition 5.1 reveals that

    P( lim inf_{k→∞} G_k = 0 ) = 1

if {D_k} is p-probabilistically κ-descent with p > p₀.

5.2 The behavior of the expected minimum norm gradient

In this subsection we study how E(G_k) behaves with the iteration counter k. For simplicity, we consider the special case of the forcing function ρ(α) = cα²/2.

Proposition 5.2 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀, ρ(α) = cα²/2, and k ≥ 4γ²β/(c(p − p₀)α₀²). Then

    E(G_k) ≤ c₃ k^{−1/2} + ‖g₀‖ exp(−c₄k),

where

    c₃ = β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ),    c₄ = (p − p₀)² / (8p).

Proof. Let us define a random variable H_k as

    H_k = c₃ k^{−1/2}  if G_k ≤ c₃ k^{−1/2},    H_k = ‖g₀‖  otherwise.

Then G_k ≤ H_k, and hence

    E(G_k) ≤ E(H_k) ≤ c₃ k^{−1/2} + ‖g₀‖ P( G_k > c₃ k^{−1/2} ).

Therefore it suffices to notice that

    P( G_k > c₃ k^{−1/2} ) ≤ exp(−c₄k),

which is a straightforward application of Corollary 4.1.

In deterministic direct search, when a forcing function ρ(α) = cα²/2 is used, ‖g_k‖ decays with O(k^{−1/2}) when k tends to infinity [28]. Proposition 5.2 shows that E(G_k) behaves in a similar way in direct search based on probabilistic descent.

5.3 Global rate without conditioning to the past

Assuming that {D_k} is p-probabilistically κ-descent as defined in Definition 3.1, we obtained the Chernoff-type bound (27) for π_k, and then established the worst case complexity of Algorithm 2.1 when p > p₀. As mentioned before, inequality (5) in Definition 3.1 is stronger than the one without conditioning to the past. A natural question is then to see what can be obtained if we weaken this requirement by not conditioning to the past. It turns out that we can still establish a lower bound for P(G_k ≤ ε), but one much weaker than inequality (30) obtained in Theorem 4.1, and in particular not approaching one.

Proposition 5.3 If, for each k ≥ 0,

    P( cm(D_k, −G_k) ≥ κ ) ≥ p    (43)

and

    ε ≤ γρ(γ⁻¹α₀)/(κα₀) + να₀/(2κγ),

then

    P( G_k ≤ ε ) ≥ (p − p₀)/(1 − p₀) − β / [(1 − p₀)kρ(ϕ(κε))].

Proof. Due to Lemma 4.3, it suffices to show that

    π_k( β/(kρ(ϕ(κε))) + p₀ ) ≤ [1 − p + β/(kρ(ϕ(κε)))] / (1 − p₀).    (44)

Let us study π_k(λ). According to (43) and to the definition of Z_k, P(Z_k = 1) ≥ p for each k ≥ 0. Hence, by Markov's inequality,

    π_k(λ) = P( Σ_{l=0}^{k−1} Z_l ≤ λk ) = P( k − Σ_{l=0}^{k−1} Z_l ≥ k − λk ) ≤ E( k − Σ_{l=0}^{k−1} Z_l ) / (k − λk) ≤ (k − pk)/(k − λk) = (1 − p)/(1 − λ).

Thus

    π_k( β/(kρ(ϕ(κε))) + p₀ ) ≤ (1 − p) / [1 − p₀ − β/(kρ(ϕ(κε)))].    (45)

If β/(kρ(ϕ(κε))) ≤ p − p₀, then (44) follows from (45) and the fact that a/b ≤ (a + c)/(b + c) for 0 < a ≤ b and c ≥ 0. If β/(kρ(ϕ(κε))) > p − p₀, then (44) is trivial since π_k(λ) ≤ 1.
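The decay of the minimum gradient norm analyzed above can be observed on a toy run. The following self-contained sketch of Algorithm 2.1 (not the authors' implementation; the quadratic test function and every parameter value are illustrative assumptions) polls m directions uniformly distributed on the unit sphere, imposes the sufficient decrease f(x + αd) < f(x) − cα²/2, and records the running minimum gradient norm:

```python
import math
import random

def direct_search_probabilistic(f, grad, x0, m=2, gamma=2.0, theta=0.5,
                                alpha0=1.0, c=1e-3, iters=500, seed=1):
    """Sketch of Algorithm 2.1 with D_k made of m independent directions
    uniformly distributed on the unit sphere and rho(alpha) = c*alpha^2/2.
    Returns the history of the running minimum gradient norm."""
    rng = random.Random(seed)
    n = len(x0)
    x, alpha = list(x0), alpha0
    best = math.sqrt(sum(g * g for g in grad(x)))
    history = []
    for _ in range(iters):
        success = False
        for _ in range(m):
            d = [rng.gauss(0.0, 1.0) for _ in range(n)]
            norm = math.sqrt(sum(di * di for di in d))
            d = [di / norm for di in d]           # uniform on the unit sphere
            y = [xi + alpha * di for xi, di in zip(x, d)]
            if f(y) < f(x) - c * alpha ** 2 / 2:  # sufficient decrease test
                x, success = y, True
                break
        alpha = gamma * alpha if success else theta * alpha
        best = min(best, math.sqrt(sum(g * g for g in grad(x))))
        history.append(best)
    return history

f = lambda x: sum(xi * xi for xi in x)       # toy smooth objective
grad = lambda x: [2.0 * xi for xi in x]
hist = direct_search_probabilistic(f, grad, x0=[1.0] * 10)
assert hist[-1] < hist[0]  # the minimum gradient norm decreases, as expected
```

Plotting this history against 1/√k on the toy problem makes the predicted rate visible.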

5.4 A more detailed look at the numerical experiments

An understanding of what is at stake in the global analysis of direct search based on probabilistic descent allows us to revisit the numerical experiments of Subsection 2.2 with additional insight. Remember that we tested Algorithm 2.1 by choosing D_k as m independent random vectors uniformly distributed on the unit sphere in R^n. In such a case, the almost-sure global convergence of Algorithm 2.1 is guaranteed as long as m > log₂(1 − ln θ/ln γ), as can be concluded from Theorem 3.1 (see Corollary B.1 in Appendix B). For example, when γ = 2 and θ = 0.5, the algorithm converges with probability one when m ≥ 2, even if such values of m are much smaller than the number of elements of the positive spanning sets with smallest cardinality in R^n, which is n + 1, let alone the cardinality of the sets tested in Subsection 2.2.

We point out here that our theory is applicable only if γ > 1. Moreover, for deterministic direct search based on positive spanning sets (PSSs), setting γ = 1 tends to lead to better numerical performance (see, for instance, [8]). In this sense, the experiment in Subsection 2.2 is biased in favor of direct search based on probabilistic descent. To be fairer, we designed a new experiment by keeping γ = 1 when the sets of polling directions are guaranteed PSSs (which is true for the versions corresponding to [I −I], [Q −Q], and [Q_k −Q_k]), while setting γ > 1 for direct search based on probabilistic descent. All the other parameters were selected as in Subsection 2.2. In the case of direct search based on probabilistic descent, we now pick γ = 2 and γ = 1.1 as illustrations, and for the cardinality m of D_k we simply take the smallest integers satisfying m > log₂(1 − ln θ/ln γ), which are 2 and 4, respectively. Table 2 presents the results of the redesigned experiment with n = 40. Table 3 shows what happened for n = 100. The data is organized in the same way as in Table 1.
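The threshold on m quoted above can be computed directly; a minimal sketch (the function name is ours):

```python
import math

def min_polling_directions(gamma, theta):
    """Smallest integer m with m > log2(1 - ln(theta)/ln(gamma)), the
    condition guaranteeing almost-sure convergence for m directions
    uniformly distributed on the unit sphere (Corollary B.1)."""
    return math.floor(math.log2(1.0 - math.log(theta) / math.log(gamma))) + 1

assert min_polling_directions(2.0, 0.5) == 2    # the gamma = 2 case above
assert min_polling_directions(1.1, 0.5) == 4    # the gamma = 1.1 case above
```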
We can see from the tables that direct search based on probabilistic descent still outperforms, for these problems, the direct-search versions using PSSs, even though the difference is not as considerable as in Table 1. We note that such an effect is even more visible when the dimension is higher (n = 100), which is somehow in agreement with the fact that (39) reveals a worst case complexity in function evaluations of O(mnε⁻²), more favorable than the O(n²ε⁻²) of deterministic direct search based on PSSs when m is significantly smaller than n.

Table 2: Relative performance for different sets of polling directions (n = 40).

             [I −I]   [Q −Q]   [Q_k −Q_k]   2 (γ = 2)   4 (γ = 1.1)
    arglina
    arglinb
    broydn3d
    dqrtic
    engval
    freuroth
    integreq
    nondquar
    sinquad
    vardim

Table 3: Relative performance for different sets of polling directions (n = 100).

             [I −I]   [Q −Q]   [Q_k −Q_k]   2 (γ = 2)   4 (γ = 1.1)
    arglina
    arglinb
    broydn3d
    dqrtic
    engval
    freuroth
    integreq
    nondquar
    sinquad
    vardim

6 Other algorithmic contexts

Our analysis is wide enough to be carried over to other algorithmic contexts where no derivatives are used and some randomization is applied to probabilistically induce descent. In particular, it will also render a global rate of 1/√k, with overwhelmingly high probability, for trust-region methods based on probabilistic models [5]. We recall that Lemma 4.2 is the key result leading to the global rate results in Subsection 4.2. Lemma 4.2 was based only on the two following elements about a realization of Algorithm 2.1:

1. if the k-th iteration is successful, then f(x_k) − f(x_{k+1}) ≥ ρ(α_k), in which case α_k is increased; and

2. if cm(D_k, −g_k) ≥ κ and α_k < ϕ(κ‖g_k‖), then the k-th iteration is successful (Lemma 2.1).

One can easily identify similar elements in a realization of the trust-region algorithm proposed in [5, Algorithm 3.1]. In fact, using the notation of [5] and denoting

    K = { k ∈ ℕ : ρ_k ≥ η₁ and ‖g_k‖ ≥ η₂δ_k },

one can find positive constants μ₁ and μ₂ such that:

1. if k ∈ K, then f(x_k) − f(x_{k+1}) ≥ μ₁δ_k² and δ_k is increased (by [5, Algorithm 3.1]); and

2. if m_k is (κ_eg, κ_ef)-fully linear and δ_k < μ₂‖g_k‖, then k ∈ K (by [5, Lemma 3.2]).

One sees that δ_k and K play the same roles as α_k and the set of successful iterations in our paper. Based on these facts, one can first follow Lemma 4.1 to obtain a uniform upper bound μ₃ on Σ_{k=0}^{∞} δ_k², and then prove that (κ_eg, κ_ef)-fully linear models appear at most

    μ₃ / min{ γ⁻²δ₀², μ₂²‖g_k‖² } + k/2    (46)

times in the first k iterations, a conclusion similar to Lemma 4.2. It is then straightforward to mimic the analysis in Subsection 4.2 and prove a 1/√k rate for the gradient, with overwhelmingly high probability, if the random models are probabilistically fully linear (see [5, Definition 3.2]).

7 Concluding remarks

We introduced a new proof technique for establishing global rates (or worst case complexity bounds) for randomized algorithms where the choice of the new iterate depends on a decrease condition and where the quality of the object (directions, models) used for descent is favorable with a certain probability. The proof technique separates the counting of the number of iterations for which descent holds from the probabilistic properties of such a number.

In the theory of direct search for smooth functions, the polling directions are typically required to be uniformly non-degenerate positive spanning sets. However, numerical observation suggests that the performance of direct search can be improved by instead randomly generating the polling directions. To understand this phenomenon, we proposed the concept of probabilistically descent sets of directions and studied direct search based upon it. An argument inspired by [5] shows that direct search (Algorithm 2.1) converges almost surely if the sets of polling directions are p₀-probabilistically κ-descent for some κ > 0, where p₀ = ln θ/ln(γ⁻¹θ), with γ and θ the expanding and contracting parameters of the step size. But, more insightfully, we established in this paper the global rate and worst case complexity of direct search when the sets of polling directions are p-probabilistically κ-descent for some p > p₀ and κ > 0. It was proved that in such a situation direct search enjoys the global rate and worst case complexity of deterministic direct search with overwhelmingly high probability.
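The threshold p₀ is a simple function of the step-size parameters; a one-line sketch (parameter values below are illustrative):

```python
import math

def p0(gamma, theta):
    """p0 = ln(theta) / ln(theta/gamma): polling must yield descent with
    probability exceeding this threshold for the rate analysis to apply."""
    return math.log(theta) / math.log(theta / gamma)

# Without step expansion (gamma = 1), p0 = 1 and p > p0 is impossible,
# which is why the theory requires gamma > 1.
assert abs(p0(1.0, 0.5) - 1.0) < 1e-12
assert abs(p0(2.0, 0.5) - 0.5) < 1e-12
```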
In particular, when the forcing function is ρ(α) = cα²/2 (c > 0), the norm of the gradient is driven below O(1/√k) in k iterations with a probability that tends to 1 exponentially as k → ∞, or, equivalently, the norm is below ε in O(ε⁻²) iterations with a probability that tends to 1 exponentially as ε → 0. Based on these conclusions, we also showed that the expected minimum norm gradient decays with a rate of 1/√k, which matches the behavior of the gradient norm in deterministic direct search or steepest descent. A more precise output of this paper is then a worst case complexity bound of O(mnε⁻²), in terms of function evaluations, for this class of methods, where m is the number of independently and uniformly distributed random directions used for polling. As said above, such a bound does not hold deterministically, but under a probability approaching 1 exponentially as ε → 0; still, this is clearly better than the O(n²ε⁻²) known for deterministic direct search in a serial environment, provided m is chosen significantly smaller than n.

Open issues

When we were finishing this paper, Professor Nick Trefethen brought to our attention the possibility of setting D_k = {d, −d}, where d is a random vector independent of the previous iterations and uniformly distributed on the unit sphere of R^n. Given a unit vector v ∈ R^n and a number κ ∈ [0, 1], it is easy to see that the event {cm(D_k, v) ≥ κ} is the union of {dᵀv ≥ κ} and {−dᵀv ≥ κ}, whose intersection has probability zero, and therefore

    P( cm(D_k, v) ≥ κ ) = P( dᵀv ≥ κ ) + P( −dᵀv ≥ κ ) = 2ϱ,

ϱ being the probability of {dᵀv ≥ κ}. By the same argument as in the proof of Proposition B.1, one then sees that {D_k} defined in this way is 2ϱ-probabilistically κ-descent. Given any constants γ and θ satisfying 0 < θ < 1 < γ, we can pick κ > 0 sufficiently small so that 2ϱ > ln θ/ln(γ⁻¹θ), and then Algorithm 2.1 conforms to the theory presented in Subsection 3.2 and Section 4. Moreover, the set {d, −d} turns out to be optimal among all the sets D consisting of 2 random vectors uniformly distributed on the unit sphere, in the sense that it maximizes the probability P(cm(D, v) ≥ κ) for each κ ∈ [0, 1]. In fact, if D = {d₁, d₂} with d₁ and d₂ uniformly distributed on the unit sphere, then

    P( cm(D, v) ≥ κ ) = 2ϱ − P( {d₁ᵀv ≥ κ} ∩ {d₂ᵀv ≥ κ} ) ≤ 2ϱ,

and the maximal value 2ϱ is attained when D = {d, −d}, as already discussed. We tested the set {d, −d} numerically with γ = 2 and θ = 1/2, and it performed even better than the set of 2 independent vectors uniformly distributed on the unit sphere (yet the difference was not substantial), which illustrates again our theory of direct search based on probabilistic descent. It remains to know how to define D so that it maximizes P(cm(D, v) ≥ κ) for each κ ∈ [0, 1] when D consists of m > 2 random vectors uniformly distributed on the unit sphere, but this is out of the scope of this paper.

A number of other issues also remain to be investigated, related to how known properties of deterministic direct search extend to probabilistic descent. For instance, one knows that in some convex instances [12] the worst case complexity bound of deterministic direct search can be improved to the order of 1/ε, corresponding to a global rate of 1/k. One also knows that some deterministic direct-search methods exhibit an r-linear rate of local convergence [14]. Finally, extending our results to the presence of constraints and/or non-smoothness may also be of interest.

A Proof of Lemma 4.4

Proof. The result can be proved by standard techniques of large deviations.
Let t be an arbitrary positive number. By Markov's inequality,

    π_k(λ) = P( exp(−t Σ_{l=0}^{k−1} Z_l) ≥ exp(−tλk) ) ≤ exp(tλk) E( Π_{l=0}^{k−1} e^{−tZ_l} ).    (48)

Now let us study E( Π_{l=0}^{k−1} e^{−tZ_l} ). By Properties G and K of Shiryaev [26, page 216], we have

    E( Π_{l=0}^{k−1} e^{−tZ_l} ) = E( E( e^{−tZ_{k−1}} | Z₀, Z₁, ..., Z_{k−2} ) Π_{l=0}^{k−2} e^{−tZ_l} ).    (49)

According to (26) and the fact that the function r ↦ re^{−t} + 1 − r is monotonically decreasing in r, it holds with p̄ = P( Z_{k−1} = 1 | Z₀, Z₁, ..., Z_{k−2} ) ≥ p that

    E( e^{−tZ_{k−1}} | Z₀, Z₁, ..., Z_{k−2} ) = p̄e^{−t} + 1 − p̄ ≤ pe^{−t} + 1 − p ≤ exp( pe^{−t} − p ),

which implies, from equality (49), that

    E( Π_{l=0}^{k−1} e^{−tZ_l} ) ≤ exp( pe^{−t} − p ) E( Π_{l=0}^{k−2} e^{−tZ_l} ).


More information

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent 10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for

More information

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Lecture 6: Non-stochastic best arm identification

Lecture 6: Non-stochastic best arm identification CSE599i: Online and Adaptive Machine Learning Winter 08 Lecturer: Kevin Jamieson Lecture 6: Non-stochastic best arm identification Scribes: Anran Wang, eibin Li, rian Chan, Shiqing Yu, Zhijin Zhou Disclaimer:

More information

MATH 117 LECTURE NOTES

MATH 117 LECTURE NOTES MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 )

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 ) A line-search algorithm inspired by the adaptive cubic regularization framewor with a worst-case complexity Oɛ 3/ E. Bergou Y. Diouane S. Gratton December 4, 017 Abstract Adaptive regularized framewor

More information

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity Mohammadreza Samadi, Lehigh University joint work with Frank E. Curtis (stand-in presenter), Lehigh University

More information

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by:

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by: Newton s Method Suppose we want to solve: (P:) min f (x) At x = x, f (x) can be approximated by: n x R. f (x) h(x) := f ( x)+ f ( x) T (x x)+ (x x) t H ( x)(x x), 2 which is the quadratic Taylor expansion

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Solution. If we does not need the pointwise limit of

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Week 2: Sequences and Series

Week 2: Sequences and Series QF0: Quantitative Finance August 29, 207 Week 2: Sequences and Series Facilitator: Christopher Ting AY 207/208 Mathematicians have tried in vain to this day to discover some order in the sequence of prime

More information

Analysis II - few selective results

Analysis II - few selective results Analysis II - few selective results Michael Ruzhansky December 15, 2008 1 Analysis on the real line 1.1 Chapter: Functions continuous on a closed interval 1.1.1 Intermediate Value Theorem (IVT) Theorem

More information

Distributed MAP probability estimation of dynamic systems with wireless sensor networks

Distributed MAP probability estimation of dynamic systems with wireless sensor networks Distributed MAP probability estimation of dynamic systems with wireless sensor networks Felicia Jakubiec, Alejandro Ribeiro Dept. of Electrical and Systems Engineering University of Pennsylvania https://fling.seas.upenn.edu/~life/wiki/

More information

6.1 Moment Generating and Characteristic Functions

6.1 Moment Generating and Characteristic Functions Chapter 6 Limit Theorems The power statistics can mostly be seen when there is a large collection of data points and we are interested in understanding the macro state of the system, e.g., the average,

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM

ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM ALEXANDER KUPERS Abstract. These are notes on space-filling curves, looking at a few examples and proving the Hahn-Mazurkiewicz theorem. This theorem

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

A note on the convex infimum convolution inequality

A note on the convex infimum convolution inequality A note on the convex infimum convolution inequality Naomi Feldheim, Arnaud Marsiglietti, Piotr Nayar, Jing Wang Abstract We characterize the symmetric measures which satisfy the one dimensional convex

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

CONVERGENCE OF RANDOM SERIES AND MARTINGALES

CONVERGENCE OF RANDOM SERIES AND MARTINGALES CONVERGENCE OF RANDOM SERIES AND MARTINGALES WESLEY LEE Abstract. This paper is an introduction to probability from a measuretheoretic standpoint. After covering probability spaces, it delves into the

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 18, No. 1, pp. 106 13 c 007 Society for Industrial and Applied Mathematics APPROXIMATE GAUSS NEWTON METHODS FOR NONLINEAR LEAST SQUARES PROBLEMS S. GRATTON, A. S. LAWLESS, AND N. K.

More information

On the Converse Law of Large Numbers

On the Converse Law of Large Numbers On the Converse Law of Large Numbers H. Jerome Keisler Yeneng Sun This version: March 15, 2018 Abstract Given a triangular array of random variables and a growth rate without a full upper asymptotic density,

More information

12.1 A Polynomial Bound on the Sample Size m for PAC Learning

12.1 A Polynomial Bound on the Sample Size m for PAC Learning 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Sub-Sampled Newton Methods I: Globally Convergent Algorithms Sub-Sampled Newton Methods I: Globally Convergent Algorithms arxiv:1601.04737v3 [math.oc] 26 Feb 2016 Farbod Roosta-Khorasani February 29, 2016 Abstract Michael W. Mahoney Large scale optimization problems

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

A recursive model-based trust-region method for derivative-free bound-constrained optimization.

A recursive model-based trust-region method for derivative-free bound-constrained optimization. A recursive model-based trust-region method for derivative-free bound-constrained optimization. ANKE TRÖLTZSCH [CERFACS, TOULOUSE, FRANCE] JOINT WORK WITH: SERGE GRATTON [ENSEEIHT, TOULOUSE, FRANCE] PHILIPPE

More information

Self-Concordant Barrier Functions for Convex Optimization

Self-Concordant Barrier Functions for Convex Optimization Appendix F Self-Concordant Barrier Functions for Convex Optimization F.1 Introduction In this Appendix we present a framework for developing polynomial-time algorithms for the solution of convex optimization

More information

On the Properties of Positive Spanning Sets and Positive Bases

On the Properties of Positive Spanning Sets and Positive Bases Noname manuscript No. (will be inserted by the editor) On the Properties of Positive Spanning Sets and Positive Bases Rommel G. Regis Received: May 30, 2015 / Accepted: date Abstract The concepts of positive

More information

Isomorphisms between pattern classes

Isomorphisms between pattern classes Journal of Combinatorics olume 0, Number 0, 1 8, 0000 Isomorphisms between pattern classes M. H. Albert, M. D. Atkinson and Anders Claesson Isomorphisms φ : A B between pattern classes are considered.

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

An interior-point gradient method for large-scale totally nonnegative least squares problems

An interior-point gradient method for large-scale totally nonnegative least squares problems An interior-point gradient method for large-scale totally nonnegative least squares problems Michael Merritt and Yin Zhang Technical Report TR04-08 Department of Computational and Applied Mathematics Rice

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Measures and Measure Spaces

Measures and Measure Spaces Chapter 2 Measures and Measure Spaces In summarizing the flaws of the Riemann integral we can focus on two main points: 1) Many nice functions are not Riemann integrable. 2) The Riemann integral does not

More information

Constructive bounds for a Ramsey-type problem

Constructive bounds for a Ramsey-type problem Constructive bounds for a Ramsey-type problem Noga Alon Michael Krivelevich Abstract For every fixed integers r, s satisfying r < s there exists some ɛ = ɛ(r, s > 0 for which we construct explicitly an

More information

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract

More information

DOUBLE SERIES AND PRODUCTS OF SERIES

DOUBLE SERIES AND PRODUCTS OF SERIES DOUBLE SERIES AND PRODUCTS OF SERIES KENT MERRYFIELD. Various ways to add up a doubly-indexed series: Let be a sequence of numbers depending on the two variables j and k. I will assume that 0 j < and 0

More information

Introduction to Bases in Banach Spaces

Introduction to Bases in Banach Spaces Introduction to Bases in Banach Spaces Matt Daws June 5, 2005 Abstract We introduce the notion of Schauder bases in Banach spaces, aiming to be able to give a statement of, and make sense of, the Gowers

More information