Direct Search Based on Probabilistic Descent

S. Gratton, C. W. Royer, L. N. Vicente, Z. Zhang

January 22, 2015

Abstract

Direct-search methods are a class of popular derivative-free algorithms characterized by evaluating the objective function using a step size and a number of polling directions. When applied to the minimization of smooth functions, the polling directions are typically taken from positive spanning sets, which in turn must have at least $n+1$ vectors in an $n$-dimensional variable space. In addition, to ensure the global convergence of these algorithms, the positive spanning sets used throughout the iterations are required to be uniformly non-degenerate, in the sense of having a positive cosine measure bounded away from zero. However, recent numerical results indicated that randomly generating the polling directions without imposing the positive spanning property can improve the performance of these methods, especially when the number of directions is chosen considerably less than $n+1$. In this paper, we analyze direct-search algorithms when the polling directions are probabilistically descent, meaning that with a certain probability at least one of them is of descent type. Such a framework enjoys almost-sure global convergence. More interestingly, we will show a global decaying rate of $1/\sqrt{k}$ for the gradient size, with overwhelmingly high probability, matching the corresponding rate for the deterministic versions of the gradient method or of direct search. Our analysis helps to understand numerical behavior and the choice of the number of polling directions.

Keywords: Derivative-free optimization, direct-search methods, polling, positive spanning sets, probabilistic descent, random directions.
1 Introduction

Minimizing a function without using derivatives has been the subject of recent intensive research, as it poses a number of mathematical and numerical challenges and it appears in various

ENSEEIHT, INPT, rue Charles Camichel, B.P., Toulouse Cedex 7, France (serge.gratton@enseeiht.fr). ENSEEIHT, INPT, rue Charles Camichel, B.P., Toulouse Cedex 7, France (clement.royer@enseeiht.fr); support for this author comes from a doctoral grant of Université Toulouse III Paul Sabatier. CMUC, Department of Mathematics, University of Coimbra, Coimbra, Portugal (lnv@mat.uc.pt, zhang@mat.uc.pt); support for this research was provided by FCT under grants PTDC/MAT/116736/2010 and PEst-C/MAT/UI0324/2011 and by the Réseau Thématique de Recherche Avancée, Fondation de Coopération Sciences et Technologies pour l'Aéronautique et l'Espace, under the grant ADTAO. CERFACS-IRIT joint lab, Toulouse, France (zaikun.zhang@irit.fr); this author works within the framework of the project FILAOS funded by RTRA STAE.

applications of optimization [7]. In a simplified way, there are essentially two paradigms, or main approaches, to design an algorithm for Derivative-Free Optimization with rigorous global convergence properties. One possibility consists of building models based on sampling (typically by quadratic interpolation) for use in a trust-region algorithm, where the accuracy of the models ensures descent for small enough step sizes. The other main possibility is to make use of a number of directions to ensure descent when the size of the step is relatively small, and direct-search methods [7, 20] offer a well-studied framework to use multiple directions in a single iteration. If the objective function is smooth (continuously differentiable), it is well known that at least one of the directions of a positive spanning set (PSS) is a descent direction, where by a PSS we mean a set of directions that spans $\mathbb{R}^n$ with non-negative coefficients. Based on this guarantee of descent, direct-search methods, at each iteration, evaluate the function along the directions in a PSS (a process called polling), using a step size parameter that is reduced when none of the directions leads to some form of decrease. There are a variety of globally convergent methods of this form, depending essentially on the choice of PSS and on the condition to declare an iteration successful (simple or sufficient decrease). Coordinate or compass search, for instance, uses the PSS formed by the $2n$ coordinate directions. Multiple PSSs can be used in direct search, in a finite number when accepting new iterates based on simple decrease (see pattern search [27] and generalized pattern search [3]), or even in an infinite number when accepting new iterates based on sufficient decrease (see generating set search [20]). In any of these cases, the PSSs in question are required to be uniformly non-degenerate, in the sense of not being close to losing their defining property.
More precisely, their cosine measure is required to be uniformly bounded away from zero, which in turn implies the existence, in any PSS used in a direct-search iteration, of a descent direction uniformly bounded away from being orthogonal to the negative gradient. If the objective function is non-smooth, say only assumed locally Lipschitz continuous, the polling directions asymptotically used in a direct-search run are required, in some normalized form, to densely cover the unit sphere. Mesh adaptive direct search [4] encompasses a process of dense generation that leads to global convergence. Such a process can be significantly simplified if the iterates are only accepted based on a sufficient decrease condition [29]. Anyhow, generating polling directions densely in the unit sphere naturally led to the use of randomly generated PSSs [4, 29]. However, one can take the issue of random generation one step further, and ask whether one can randomly generate a set of directions at each direct-search iteration and use it for polling, without checking if it is a PSS. Moreover, one knows that a PSS must have at least $n+1$ elements in $\mathbb{R}^n$, and thus one even questions the need to generate so many directions and asks how many of them are appropriate to use. We were motivated to address this problem given the obvious connection to the trust-region methods based on probabilistic models recently studied in [5], as we will see later in our paper. Simultaneously, we were surprised by the numerical experiments reported in [18], where generating the polling directions free of PSS rules, and possibly in a number less than $n+1$, was indeed beneficial. Similarly to the notion of probabilistically fully linear models in [5], we then introduce in this paper the companion notion of a probabilistically descent set of directions, by requiring at least one of them to make an acute angle with the negative gradient with a certain probability, uniformly across all iterations.
We then analyze what direct search is capable of delivering when the set of polling directions is probabilistically descent. Our algorithmic framework is extremely simple and can be simplistically reduced here to the following: at each iteration $k$, generate a finite set $D_k$ of polling directions that is probabilistically descent; if a $d_k \in D_k$ is found such that
$$f(x_k + \alpha_k d_k) < f(x_k) - \rho(\alpha_k),$$
where $\rho(\alpha_k) = \alpha_k^2/2$ and $\alpha_k$ is the step size, then set $x_{k+1} = x_k + \alpha_k d_k$ and $\alpha_{k+1} = 2\alpha_k$; otherwise set $x_{k+1} = x_k$ and $\alpha_{k+1} = \alpha_k/2$. We will then prove that such a scheme enjoys, with overwhelmingly high probability, a gradient decaying rate of $1/\sqrt{k}$ or, equivalently, that the number of iterations taken to reach a gradient of size $\epsilon$ is $O(\epsilon^{-2})$. The rate is global since no assumption is made on the starting point. For this purpose, we first derive a bound, valid for all realizations of the algorithm, on the number of iterations at which the set of polling directions is descent. Such a bound is shown to be a fraction of $k$ when $k$ is larger than the inverse of the square of the minimum gradient norm attained up to the $k$-th iteration. When descent is probabilistically conditioned on the past, one can then apply a Chernoff-type bound to prove that the probability of the minimum gradient norm decaying with a global rate of $1/\sqrt{k}$ approaches one exponentially. The analysis is carried out for a more general forcing function $\rho$, with the help of an auxiliary function $\varphi$ which is a multiple of the identity when $\rho(\alpha_k) = \alpha_k^2/2$. Although such a direct-search framework, where a set of polling directions is independently randomly generated at each iteration, shares similarities with other randomized algorithmic approaches, the differences are significant. Random search methods typically generate a single direction at each iteration, take steps independently of decrease conditions, and update the step size using some deterministic or probabilistic formula. The generated direction can first be premultiplied by an approximation to the directional derivative along the direction itself [25], and it was shown in [22] how to frame this technique using the concept of smoothing to arrive at global rates of appropriate order.
Methods using multiple points or directions at each iteration, such as evolution strategies (see [11] for a globally convergent version), typically impose some correlation among the direction sets from one iteration to another, as in covariance matrix adaptation [19], and, again, take action and update step sizes independently of decrease conditions. More related to our work is the contribution of [10], a derivative-free line search method that uses random search directions, where at each iteration the step size is defined by a tolerant nonmonotone backtracking line search that always starts from the unitary step. The authors proved global convergence with probability one in [10], but no global rate was provided. In addition, global convergence in [10] is much easier to establish than in our paper, since the backtracking line search always starts from the unitary step, independently of the previous iterations, and consequently the step size is only loosely coupled to the history of the computation. The organization of the paper is the following. First, we recall in Section 2 the main elements of direct-search methods using PSSs and start a numerical illustration of the issues at stake in this paper. Then, we introduce in Section 3 the notion of probabilistic descent and show how it leads direct search to almost-sure global convergence. The main result of our paper is presented in Section 4, where we show that direct search using probabilistic descent conditioned on the past attains the desired global rate with overwhelmingly high probability. In Section 5 we cover a number of related issues, among which how to derive the main result in expectation and what can still be achieved without conditioning on the past. Having already presented our findings, we then revisit the numerical illustrations with additional insight. Section 6 discusses how to extend our analysis to other algorithmic contexts, in particular to trust-region methods based on probabilistic models [5].
The paper is concluded with some final remarks in Section 7.
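To make the simplified scheme described above concrete, the following is a minimal Python sketch of it (the paper's experiments were run in Matlab; the function name, the choice $m = 2$, and the fixed iteration budget are illustrative assumptions, not prescribed by the paper):

```python
import numpy as np

def direct_search(f, x0, m=2, alpha0=1.0, gamma=2.0, theta=0.5,
                  max_iters=500, rng=None):
    """Sketch of the simplified scheme: rho(alpha) = alpha^2/2, the step size
    is doubled on success and halved on failure, and D_k consists of m
    directions drawn uniformly on the unit sphere at every iteration."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    alpha = alpha0
    for _ in range(max_iters):
        rho = 0.5 * alpha ** 2          # forcing function rho(alpha_k)
        # D_k: m i.i.d. directions uniformly distributed on the unit sphere.
        D = rng.standard_normal((m, x.size))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        for d in D:                     # poll step
            if f(x + alpha * d) < f(x) - rho:   # sufficient decrease
                x = x + alpha * d       # successful iteration
                alpha *= gamma
                break
        else:                           # unsuccessful iteration
            alpha *= theta
    return x
```

On a simple smooth test function such as $f(x) = \|x\|^2$, this sketch typically makes progress even with only $m = 2$ polling directions per iteration in $\mathbb{R}^{10}$, which is the phenomenon the paper sets out to explain.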

2 Direct search based on deterministic descent

In this paper we consider the minimization of a function $f : \mathbb{R}^n \to \mathbb{R}$, without imposing any constraints, and assuming that $f$ is smooth, say continuously differentiable. As in [7], we first consider a quite general direct-search algorithmic framework, but without a search step.

Algorithm 2.1 (Direct search) Select $x_0 \in \mathbb{R}^n$, $\alpha_{\max} \in (0, \infty]$, $\alpha_0 \in (0, \alpha_{\max})$, $\theta \in (0,1)$, $\gamma \in [1, \infty)$, and a forcing function $\rho : (0, \infty) \to (0, \infty)$. For each iteration $k = 0, 1, \ldots$:

Poll step: Choose a finite set $D_k$ of non-zero vectors. Start evaluating $f$ at the polling points $\{x_k + \alpha_k d : d \in D_k\}$ following a chosen order. If a poll point $x_k + \alpha_k d_k$ is found such that $f(x_k + \alpha_k d_k) < f(x_k) - \rho(\alpha_k)$, then stop polling, set $x_{k+1} = x_k + \alpha_k d_k$, and declare the iteration successful. Otherwise declare the iteration unsuccessful and set $x_{k+1} = x_k$.

Step size update: If the iteration was successful, set $\alpha_{k+1} = \min\{\gamma \alpha_k, \alpha_{\max}\}$. Otherwise, set $\alpha_{k+1} = \theta \alpha_k$.

End for

An optional search step could be taken, before the poll one, by testing a finite number of points, looking for an $x$ such that $f(x) < f(x_k) - \rho(\alpha_k)$. If such an $x$ were found, one would define $x_{k+1} = x$, declare the iteration successful, and skip the poll step. However, ignoring such a search step allows us to focus on the polling mechanism of Algorithm 2.1, which is the essential part of the algorithm.

2.1 Deterministic descent and positive spanning

The choice of the direction sets $D_k$ in Algorithm 2.1 is a major issue. To guarantee a successful iteration in a finite number of attempts, it is sufficient that at least one of the directions in $D_k$ is a descent one. But global convergence to stationary points requires more, as we cannot have all directions in $D_k$ becoming arbitrarily close to being orthogonal to the negative gradient $-g_k = -\nabla f(x_k)$, and so one must have that
$$\forall k,\ \exists d_k \in D_k, \quad \frac{d_k^\top (-g_k)}{\|d_k\|\,\|g_k\|} \ge \kappa > 0, \qquad (1)$$
for some $\kappa$ that does not depend on $k$.
On the other hand, it is well known [9] (see also [7, 20]) that if $D$ is a PSS, then
$$\forall v \in \mathbb{R}^n,\ \exists d \in D, \quad \frac{d^\top v}{\|d\|\,\|v\|} > 0. \qquad (2)$$
One can then see that condition (1) is easily verified if $\{D_k\}$ is chosen from a finite number of PSSs. When passing from a finite to an infinite number of PSSs, some uniform bound must be

imposed in (2), which essentially amounts to saying that the cosine measure of all the $D_k$ is at least $\kappa$, where by the cosine measure [20] of a PSS $D$ with non-zero vectors we mean the positive quantity
$$\mathrm{cm}(D) = \min_{v \in \mathbb{R}^n \setminus \{0\}} \max_{d \in D} \frac{d^\top v}{\|d\|\,\|v\|}.$$
One observes that imposing $\mathrm{cm}(D_k) \ge \kappa$ is much stronger than (1), and that it is the lack of knowledge of the negative gradient that leads us to the imposition of such a condition. Had we known $g_k = \nabla f(x_k)$, we would have just required $\mathrm{cm}(D_k, -g_k) \ge \kappa$, where $\mathrm{cm}(D, v)$ is the cosine measure of $D$ given $v$, defined by
$$\mathrm{cm}(D, v) = \max_{d \in D} \frac{d^\top v}{\|d\|\,\|v\|}.$$
To avoid an ill-posed definition when $g_k = 0$, we assume by convention that $\mathrm{cm}(D, v) = 1$ when $v = 0$. The condition $\mathrm{cm}(D_k, -g_k) \ge \kappa$ is all one needs to guarantee a successful step for a sufficiently small step size $\alpha_k$. For completeness we show such a result in Lemma 2.1 below, under Assumptions 2.1–2.3, which are assumed here and throughout the paper.

Assumption 2.1 The objective function $f$ is bounded from below and continuously differentiable in $\mathbb{R}^n$, and $\nabla f$ is Lipschitz continuous in $\mathbb{R}^n$. Let then $f_{\mathrm{low}} > -\infty$ be a lower bound of $f$, and $\nu > 0$ be a Lipschitz constant of $\nabla f$.

Assumption 2.2 The forcing function $\rho$ is positive, non-decreasing, and $\rho(\alpha) = o(\alpha)$ when $\alpha \to 0^+$.

The following function $\varphi$ will make the algebra conveniently more concise. For each $t > 0$, let
$$\varphi(t) = \inf\left\{\alpha : \alpha > 0,\ \frac{\rho(\alpha)}{\alpha} + \frac{1}{2}\nu\alpha \ge t\right\}. \qquad (3)$$
It is clear that $\varphi$ is a well-defined non-decreasing function. Note that Assumption 2.2 ensures that $\varphi(t) > 0$ when $t > 0$. When $\rho(\alpha) = c\,\alpha^2/2$, one obtains $\varphi(t) = 2t/(c + \nu)$, which is essentially a multiple of the identity.

Assumption 2.3 For each $k \ge 0$, $D_k$ is a finite set of normalized vectors.

Assumption 2.3 is made for the sake of simplicity; it would actually be enough to assume that the norms of all the polling directions are uniformly bounded away from zero and infinity.

Lemma 2.1 The $k$-th iteration of Algorithm 2.1 is successful if $\mathrm{cm}(D_k, -g_k) \ge \kappa$ and $\alpha_k < \varphi(\kappa \|g_k\|)$.

Proof.
According to the definition of $\mathrm{cm}(D_k, -g_k)$, there exists $d_k \in D_k$ satisfying
$$-d_k^\top g_k = \mathrm{cm}(D_k, -g_k)\,\|d_k\|\,\|g_k\| \ge \kappa\,\|g_k\|.$$

Thus, by a Taylor expansion,
$$f(x_k + \alpha_k d_k) \le f(x_k) + \alpha_k d_k^\top g_k + \frac{\nu}{2}\alpha_k^2 \le f(x_k) - \kappa\alpha_k\|g_k\| + \frac{\nu}{2}\alpha_k^2. \qquad (4)$$
Using the definition of $\varphi$, we obtain from $\alpha_k < \varphi(\kappa\|g_k\|)$ that $\rho(\alpha_k)/\alpha_k + (\nu/2)\alpha_k < \kappa\|g_k\|$. Hence (4) implies $f(x_k + \alpha_k d_k) < f(x_k) - \rho(\alpha_k)$, and thus the $k$-th iteration is successful.

2.2 A numerical illustration

A natural question that arises is whether one can gain by generating the vectors in $D_k$ randomly, hoping that (1) will then be satisfied frequently enough. For this purpose we ran Algorithm 2.1 in Matlab for different choices of the polling directions. The test problems were taken from CUTEr [17] with dimension $n = 40$. We set $\alpha_{\max} = \infty$, $\alpha_0 = 1$, $\theta = 0.5$, $\gamma = 2$, and $\rho(\alpha) = 10^{-3}\alpha^2$. The algorithm terminated when either $\alpha_k$ fell below a given tolerance or a budget of $2000n$ function evaluations was exhausted.

Table 1: Relative performance for different sets of polling directions ($n = 40$). The columns correspond to the polling choices $[I\ {-}I]$, $[Q\ {-}Q]$, $[Q_k\ {-}Q_k]$, and $m = 2n$, $n+1$, $n/2$, $n/4$, $2$, $1$ random directions; the rows to the test problems arglina, arglinb, broydn3d, dqrtic, engval, freuroth, integreq, nondquar, sinquad, and vardim. (The numerical entries of the table were not preserved in this transcription.)

The first three columns of Table 1 correspond to the following PSS choices of the polling directions $D_k$: $[I\ {-}I]$ represents the columns of $I$ and $-I$, where $I$ is the identity matrix of size $n$; $[Q\ {-}Q]$ means the columns of $Q$ and $-Q$, where $Q$ is an orthogonal matrix obtained by the QR decomposition of a random vector, uniformly distributed on the unit sphere of $\mathbb{R}^n$, generated before the start of the algorithm and then fixed throughout the iterations; $[Q_k\ {-}Q_k]$ consists of the columns of $Q_k$ and $-Q_k$, where $Q_k$ is a matrix obtained in the same way as $Q$, except that it is generated independently at each iteration. In the cases of $[I\ {-}I]$ and $[Q\ {-}Q]$, a cyclic polling procedure was applied to accelerate descent, by starting polling at the direction that led to the previous success (if the last iteration was successful) or at the direction right after the last one used (if the last iteration was unsuccessful).
In the remaining columns, $D_k$ consists of $m$ (with $m = 2n$, $n+1$, $n/2$, $n/4$, $2$, $1$) random vectors uniformly distributed on the unit sphere of $\mathbb{R}^n$, generated independently at every iteration.
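The polling sets compared in Table 1 can be generated along the following lines. This is a hypothetical Python sketch (the experiments were run in Matlab), with directions stored as rows; the QR factorization of a Gaussian matrix stands in here for the paper's construction of the orthogonal matrix $Q$ from a random unit vector:

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian
    matrix (an assumed stand-in for the paper's construction of Q)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def polling_set(kind, n, rng, m=None):
    """Return polling directions (as rows) for the variants of Table 1."""
    if kind == "I":          # [I -I]: the 2n coordinate directions
        I = np.eye(n)
        return np.vstack([I, -I])
    if kind == "Q":          # [Q -Q]: columns of an orthogonal Q and of -Q
        Q = random_orthogonal(n, rng)
        return np.vstack([Q.T, -Q.T])
    if kind == "sphere":     # m i.i.d. uniform directions on the unit sphere
        D = rng.standard_normal((m, n))
        return D / np.linalg.norm(D, axis=1, keepdims=True)
    raise ValueError(kind)
```

For the $[Q\ {-}Q]$ variant one would call `polling_set("Q", n, rng)` once and reuse the result, while for $[Q_k\ {-}Q_k]$ and the sphere variants the set is regenerated at every iteration.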

We counted the number of function evaluations taken to drive the function value below $f_{\mathrm{low}} + \varepsilon\,[f(x_0) - f_{\mathrm{low}}]$, where $f_{\mathrm{low}}$ is the true minimal value of the objective function $f$, and $\varepsilon$ is a small tolerance. Given a test problem, we present in each row the ratio between the number of function evaluations taken by the corresponding version and the number of function evaluations taken by the best version. Because of the random nature of the computations, the number of function evaluations was obtained by averaging over ten independent runs, except for the first column. The symbol indicates that the algorithm failed to solve the problem to the required precision at least once in the ten runs. When $D_k$ is randomly generated on the unit sphere (columns $m = 2n$, $n+1$, $n/2$, $n/4$, $2$, $1$ in Table 1), the vectors in $D_k$ do not form a PSS when $m \le n$, and are not guaranteed to do so when $m > n$. However, we can see clearly from Table 1 that randomly generating the polling directions in this way performed evidently better, despite the fact that their use is not covered by the classical theory of direct search. This evidence motivates us to establish the convergence theory of Algorithm 2.1 when the polling directions are randomly generated.

3 Probabilistic descent and global convergence

From now on, we suppose that the polling directions in Algorithm 2.1 are not defined deterministically but generated by a random process $\{D_k\}$. Due to the randomness of $\{D_k\}$, the iterates and step sizes are also random processes, and we denote the random iterates by $\{X_k\}$, with lower-case letters (such as $x_k$) reserved for realizations. We notice, however, that the starting point $x_0$, and thus the initial function value $f(x_0)$ and the initial step size $\alpha_0$, are not random.

3.1 Probabilistic descent

The following concept of probabilistically descent sets of polling directions is critical to our analysis. We use it to describe the quality of the random polling directions in Algorithm 2.1.
Recalling that $g_k = \nabla f(x_k)$, let $G_k$ be the random variable corresponding to $g_k$.

Definition 3.1 The sequence $\{D_k\}$ in Algorithm 2.1 is said to be $p$-probabilistically $\kappa$-descent if $P(\mathrm{cm}(D_0, -G_0) \ge \kappa) \ge p$ and, for each $k \ge 1$,
$$P\big(\mathrm{cm}(D_k, -G_k) \ge \kappa \,\big|\, D_0, \ldots, D_{k-1}\big) \ge p. \qquad (5)$$

Definition 3.1 requires the cosine measure $\mathrm{cm}(D_k, -G_k)$ to be favorable in a probabilistic sense rather than deterministically. We will see that the role of probabilistically descent sets in the analysis of direct search based on probabilistic descent is similar to that of positive spanning sets in the theory of deterministic direct search. Definition 3.1 is inspired by the definition of probabilistically fully linear models [5]. Inequality (5) involves the notion of conditional probability (see [26, Chapter II]) and says essentially that the probability of the event $\{\mathrm{cm}(D_k, -G_k) \ge \kappa\}$ is not smaller than $p$, no matter what happened with $D_0, \ldots, D_{k-1}$. It is stronger than assuming merely that $P(\mathrm{cm}(D_k, -G_k) \ge \kappa) \ge p$. The analysis in this paper would still hold even if a search step were included in the algorithm and possibly taken; the analysis would consider $D_k$ as defined at all iterations, even when the poll step is skipped.
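The probability $p$ in Definition 3.1 can be estimated empirically for a given direction-generation scheme. The sketch below (hypothetical helper names) estimates $P(\mathrm{cm}(D_k, -g) \ge \kappa)$ by Monte Carlo for $m$ directions uniformly distributed on the unit sphere; by rotational invariance of this distribution, the gradient direction can be fixed without loss of generality:

```python
import numpy as np

def cosine_measure_given_v(D, v):
    """cm(D, v) = max_{d in D} d.v / (|d| |v|); by convention 1 if v = 0."""
    nv = np.linalg.norm(v)
    if nv == 0:
        return 1.0
    return max(d @ v / (np.linalg.norm(d) * nv) for d in D)

def descent_probability(m, n, kappa, trials=20000, seed=0):
    """Monte Carlo estimate of p = P(cm(D_k, -g) >= kappa) when D_k holds
    m i.i.d. directions uniformly distributed on the unit sphere of R^n."""
    rng = np.random.default_rng(seed)
    g = np.zeros(n)
    g[0] = 1.0                       # any fixed gradient direction
    hits = 0
    for _ in range(trials):
        D = rng.standard_normal((m, n))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        hits += cosine_measure_given_v(D, -g) >= kappa
    return hits / trials
```

For $\kappa = 0$ the estimate can be checked against the exact value $1 - 2^{-m}$ (the probability that at least one of $m$ independent directions makes an acute angle with $-g$).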

3.2 Global convergence

The global convergence analysis of direct search based on probabilistic descent follows what was done in [5] for trust-region methods based on probabilistic models, and is no more than a reorganization of the arguments there, now applied to direct search when no search step is taken. It is well known that $\alpha_k \to 0$ in deterministic direct search based on sufficient decrease, as in Algorithm 2.1, no matter how the polling directions are defined [20]. A similar result can be derived by applying exactly the same argument to a realization of direct search now based on probabilistic descent.

Lemma 3.1 For each realization of Algorithm 2.1, $\lim_{k \to \infty} \alpha_k = 0$.

For each $k \ge 0$, we now define $Y_k$ as the indicator function of the event {the $k$-th iteration is successful}. Furthermore, $Z_k$ will be the indicator function of the event
$$\{\mathrm{cm}(D_k, -G_k) \ge \kappa\}, \qquad (6)$$
where, again, $\kappa$ is a positive constant independent of the iteration counter. The Bernoulli processes $\{Y_k\}$ and $\{Z_k\}$ will play a major role in our analysis. Their realizations are denoted by $\{y_k\}$ and $\{z_k\}$. Notice, then, that Lemma 2.1 can be restated as follows: given a realization of Algorithm 2.1 and $k \ge 0$, if $\alpha_k < \varphi(\kappa\|g_k\|)$, then $y_k \ge z_k$. Lemmas 2.1 and 3.1 lead to a critical observation, presented below as Lemma 3.2. We observe that such a result holds without any assumption on the probabilistic behavior of $\{D_k\}$.

Lemma 3.2 For the stochastic processes $\{G_k\}$ and $\{Z_k\}$, where $G_k = \nabla f(X_k)$ and $Z_k$ is the indicator of the event (6), it holds that
$$\Big\{\liminf_{k \to \infty} \|G_k\| > 0\Big\} \subseteq \Big\{\sum_{k=0}^{\infty} \big[Z_k \ln\gamma + (1 - Z_k)\ln\theta\big] = -\infty\Big\}. \qquad (7)$$

Proof. Consider a realization of Algorithm 2.1 for which $\liminf_{k \to \infty} \|g_k\|$ is not zero but a positive number $\epsilon$. There exists a positive integer $k_0$ such that, for each $k \ge k_0$, it holds that $\|g_k\| \ge \epsilon/2$ and $\alpha_k < \varphi(\kappa\epsilon/2)$ (because $\alpha_k \to 0$ and $\varphi(\kappa\epsilon/2) > 0$), and consequently $\alpha_k < \varphi(\kappa\|g_k\|)$. Hence we can obtain from Lemma 2.1 that $y_k \ge z_k$. Additionally, we can assume that $k_0$ is large enough to ensure $\alpha_k \le \gamma^{-1}\alpha_{\max}$ for each $k \ge k_0$.
Then the step size update of Algorithm 2.1 gives us
$$\alpha_k = \alpha_{k_0} \prod_{l=k_0}^{k-1} \gamma^{y_l}\,\theta^{1-y_l} \ \ge\ \alpha_{k_0} \prod_{l=k_0}^{k-1} \gamma^{z_l}\,\theta^{1-z_l}$$
for all $k > k_0$. This leads to $\lim_{k \to \infty} \prod_{l=k_0}^{k-1} \gamma^{z_l}\,\theta^{1-z_l} = 0$, since $\alpha_{k_0} > 0$ and $\alpha_k \to 0$. Taking logarithms, we conclude that
$$\sum_{l=k_0}^{\infty} \big[z_l \ln\gamma + (1 - z_l)\ln\theta\big] = -\infty,$$
which completes the proof.

If $\{D_k\}$ is $p_0$-probabilistically $\kappa$-descent, with
$$p_0 = \frac{\ln\theta}{\ln(\gamma^{-1}\theta)}, \qquad (8)$$
then, similarly to [5, Theorem 4.2], it can be checked that the random process $\big\{\sum_{l=0}^{k} [Z_l \ln\gamma + (1 - Z_l)\ln\theta]\big\}$ is a submartingale with bounded increments. Hence the event on the right-hand side of (7) has probability zero (see [5, Theorem 4.1]). Thereby we obtain the confirmation of global convergence of probabilistic direct search provided below.

Theorem 3.1 If $\{D_k\}$ is $p_0$-probabilistically $\kappa$-descent in Algorithm 2.1, then
$$P\Big(\liminf_{k \to \infty} \|G_k\| = 0\Big) = 1.$$

We point out that this liminf-type global convergence result is also implied by our global rate theory of Section 4, under a slightly stronger assumption (see Section 5).

4 Global rate for direct search based on probabilistic descent

For measuring the global rate of decay of the gradient, we consider the gradient $\tilde g_k$ with minimum norm among $g_0, g_1, \ldots, g_k$, and use $\tilde G_k$ to represent the corresponding random variable. Given a positive number $\epsilon$, we define $k_\epsilon$ as the smallest integer $k$ such that $\|g_k\| \le \epsilon$, and denote the corresponding random variable by $K_\epsilon$. It is easy to check that $k_\epsilon \le k$ if and only if $\|\tilde g_k\| \le \epsilon$. In the deterministic case, one either looks at the global decaying rate of $\|\tilde g_k\|$ or one counts the number of iterations (a worst-case complexity bound) needed to drive the norm of the gradient below a given tolerance $\epsilon > 0$. When $\rho(\alpha)$ is a positive multiple of $\alpha^2$, it can be shown [28] that the global rate is of the order of $1/\sqrt{k}$ and that the worst-case complexity bound in number of iterations is of the order of $\epsilon^{-2}$. Since the algorithms are now probabilistic, what interests us are lower bounds, as close as possible to one, on the probabilities $P\big(\|\tilde G_k\| \le O(1/\sqrt{k})\big)$ (global rate) and $P\big(K_\epsilon \le O(\epsilon^{-2})\big)$ (worst-case complexity bound). Our analysis is carried out in two major steps.
In Subsection 4.1, we concentrate on the intrinsic mechanisms of Algorithm 2.1, without assuming any probabilistic property of the polling directions $\{D_k\}$, establishing a bound on $\sum_{l=0}^{k-1} z_l$ for all realizations of the algorithm. In Subsection 4.2, we will see how this property can be used to relate $P(\|\tilde G_k\| \le \epsilon)$ to the lower tail of the random variable $\sum_{l=0}^{k-1} Z_l$; we will then concentrate on the probabilistic behavior of Algorithm 2.1, using a Chernoff bound for the lower tail of $\sum_{l=0}^{k-1} Z_l$ when $\{D_k\}$ is probabilistically descent, and prove the desired lower bounds on $P(\|\tilde G_k\| \le \epsilon)$ and $P(K_\epsilon \le k)$. In order to achieve our goal, we henceforth make an additional assumption on the forcing function $\rho$.

Assumption 4.1 There exist constants $\bar\theta$ and $\bar\gamma$ satisfying $0 < \bar\theta < 1 \le \bar\gamma$ such that, for each $\alpha > 0$, $\rho(\theta\alpha) \le \bar\theta\,\rho(\alpha)$ and $\rho(\gamma\alpha) \le \bar\gamma\,\rho(\alpha)$.

Such an assumption is not restrictive, as it holds in particular for the classical forcing functions of the form $\rho(\alpha) = c\,\alpha^q$, with $c > 0$ and $q > 1$.
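As a quick numerical sanity check of Assumption 4.1 and of the closed form $\varphi(t) = 2t/(c+\nu)$ for $\rho(\alpha) = c\,\alpha^2/2$, one can run the following sketch (the constant values, and the admissible choices $\bar\theta = \theta$ and $\bar\gamma = \gamma^q$, are illustrative assumptions):

```python
import numpy as np

c, nu, theta, gamma, q = 2.0, 3.0, 0.5, 2.0, 2.0
rho = lambda a: 0.5 * c * a ** q        # rho(alpha) = c * alpha^2 / 2

# Assumption 4.1 for this rho, with the assumed admissible constants
# theta_bar = theta (since theta**q <= theta) and gamma_bar = gamma**q:
alphas = np.linspace(1e-3, 10.0, 1000)
assert np.all(rho(theta * alphas) <= theta * rho(alphas))
assert np.all(rho(gamma * alphas) <= gamma ** q * rho(alphas) * (1 + 1e-9))

# phi(t) = inf{alpha > 0 : rho(alpha)/alpha + nu*alpha/2 >= t}; for this rho
# the infimum is attained at alpha = 2t/(c + nu):
phi = lambda t: 2.0 * t / (c + nu)
for t in (0.1, 1.0, 7.5):
    a = phi(t)
    assert abs(rho(a) / a + 0.5 * nu * a - t) < 1e-9
```

The last loop verifies that $\alpha = \varphi(t)$ makes $\rho(\alpha)/\alpha + \nu\alpha/2$ equal to $t$, i.e., that the constraint in definition (3) is active at the infimum.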

4.1 Analysis of step size behavior and number of iterations with descent

The goal of this subsection is to establish a bound on $\sum_{l=0}^{k-1} z_l$ in terms of $k$ and $\|\tilde g_k\|$ for an arbitrary realization of Algorithm 2.1, as presented in Lemma 4.2. To do this, we first show the boundedness of $\sum_{k=0}^{\infty} \rho(\alpha_k)$.

Lemma 4.1 For each realization of Algorithm 2.1,
$$\sum_{k=0}^{\infty} \rho(\alpha_k) \ \le\ \frac{\bar\gamma}{1 - \bar\theta}\,\big[\rho(\gamma^{-1}\alpha_0) + f(x_0) - f_{\mathrm{low}}\big].$$

Proof. Consider a realization of Algorithm 2.1. We assume that there are infinitely many successful iterations, as it is trivial to adapt the argument otherwise. Let $k_i$ be the index of the $i$-th successful iteration ($i \ge 1$). Define $k_0 = -1$ and $\alpha_{-1} = \gamma^{-1}\alpha_0$ for convenience. Let us rewrite $\sum_{k=0}^{\infty} \rho(\alpha_k)$ as
$$\sum_{k=0}^{\infty} \rho(\alpha_k) = \sum_{i=0}^{\infty}\ \sum_{k=k_i+1}^{k_{i+1}} \rho(\alpha_k), \qquad (9)$$
and study first $\sum_{k=k_i+1}^{k_{i+1}} \rho(\alpha_k)$. According to Algorithm 2.1 and the definition of $k_i$, it holds that
$$\alpha_{k+1} \le \gamma\alpha_k \ \text{ for } k = k_i, \qquad \alpha_{k+1} = \theta\alpha_k \ \text{ for } k = k_i+1, \ldots, k_{i+1}-1,$$
which gives
$$\alpha_k \le \gamma\,\theta^{\,k-k_i-1}\,\alpha_{k_i}, \qquad k = k_i+1, \ldots, k_{i+1}.$$
Hence, by the monotonicity of $\rho$ and Assumption 4.1, we have
$$\rho(\alpha_k) \le \bar\gamma\,\bar\theta^{\,k-k_i-1}\,\rho(\alpha_{k_i}), \qquad k = k_i+1, \ldots, k_{i+1}. \qquad (10)$$
Thus (9) and (10) imply
$$\sum_{k=0}^{\infty} \rho(\alpha_k) \ \le\ \sum_{i=0}^{\infty}\ \sum_{k=k_i+1}^{k_{i+1}} \bar\gamma\,\bar\theta^{\,k-k_i-1}\,\rho(\alpha_{k_i}) \ \le\ \frac{\bar\gamma}{1 - \bar\theta} \sum_{i=0}^{\infty} \rho(\alpha_{k_i}). \qquad (11)$$
Inequality (11) is sufficient to conclude the proof because $\alpha_{k_0} = \gamma^{-1}\alpha_0$ and $\sum_{i=1}^{\infty} \rho(\alpha_{k_i}) \le f(x_0) - f_{\mathrm{low}}$, according to the definition of $k_i$.

The bound on the sum of the series is denoted by
$$\beta = \frac{\bar\gamma}{1 - \bar\theta}\,\big[\rho(\gamma^{-1}\alpha_0) + f(x_0) - f_{\mathrm{low}}\big]$$
and is used next to bound the number of iterations with descent.

Lemma 4.2 Given a realization of Algorithm 2.1 and a positive integer $k$,
$$\sum_{l=0}^{k-1} z_l \ \le\ \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}\big)} + p_0\,k.$$

Proof. Consider a realization of Algorithm 2.1. For each $l \in \{0, 1, \ldots, k-1\}$, define
$$v_l = \begin{cases} 1 & \text{if } \alpha_l < \min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}, \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$
A key observation for proving the lemma is
$$z_l \le 1 - v_l + v_l\,y_l. \qquad (13)$$
When $v_l = 0$, inequality (13) is trivial; when $v_l = 1$, Lemma 2.1 implies that $y_l \ge z_l$ (since $\|\tilde g_k\| \le \|\tilde g_{k-1}\| \le \|g_l\|$ for $l \le k-1$), and hence inequality (13) holds. It suffices then to separately prove
$$\sum_{l=0}^{k-1} (1 - v_l) \ \le\ \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}\big)} \qquad (14)$$
and
$$\sum_{l=0}^{k-1} v_l\,y_l \ \le\ p_0\,k. \qquad (15)$$
Because of Lemma 4.1, inequality (14) is justified by the fact that
$$1 - v_l \ \le\ \frac{\rho(\alpha_l)}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\|\tilde g_k\|)\}\big)},$$
which in turn is guaranteed by definition (12) and the monotonicity of $\rho$. Now consider inequality (15). If $v_l = 0$ for all $l \in \{0, 1, \ldots, k-1\}$, then (15) holds trivially. Consider then that $v_l = 1$ for some $l \in \{0, 1, \ldots, k-1\}$, and let $\bar l$ be the largest such integer. Then
$$\sum_{l=0}^{k-1} v_l\,y_l = \sum_{l=0}^{\bar l} v_l\,y_l. \qquad (16)$$
Let us estimate the sum on the right-hand side. For each $l \in \{0, 1, \ldots, \bar l\}$, Algorithm 2.1 together with the definitions of $v_l$ and $y_l$ gives
$$\alpha_{l+1} = \min\{\gamma\alpha_l, \alpha_{\max}\} = \gamma\alpha_l \ \text{ if } v_l\,y_l = 1, \qquad \alpha_{l+1} \ge \theta\alpha_l \ \text{ if } v_l\,y_l = 0,$$
which implies
$$\alpha_{\bar l+1} \ \ge\ \alpha_0 \prod_{l=0}^{\bar l} \gamma^{v_l y_l}\,\theta^{1 - v_l y_l}. \qquad (17)$$

On the other hand, since $v_{\bar l} = 1$, we have $\alpha_{\bar l} < \gamma^{-1}\alpha_0$, and hence $\alpha_{\bar l+1} \le \alpha_0$. Consequently, by taking logarithms, one can obtain from inequality (17) that
$$0 \ \ge\ \ln(\gamma\theta^{-1}) \sum_{l=0}^{\bar l} v_l\,y_l + (\bar l + 1)\ln\theta,$$
which leads to
$$\sum_{l=0}^{\bar l} v_l\,y_l \ \le\ \frac{\ln\theta}{\ln(\gamma^{-1}\theta)}\,(\bar l + 1) = p_0\,(\bar l + 1) \ \le\ p_0\,k, \qquad (18)$$
since $\ln(\gamma^{-1}\theta) < 0$. Inequality (15) is then obtained by combining inequalities (16) and (18).

Lemma 4.2 allows us to generalize the worst-case complexity analysis of deterministic direct search [28] to more general forcing functions, when $\gamma > 1$. As we prove such a result, we also gain momentum for the derivation of the global rates for direct search based on probabilistic descent.

Proposition 4.1 Assume that $\gamma > 1$ and consider a realization of Algorithm 2.1. If, for each $k \ge 0$,
$$\mathrm{cm}(D_k, -g_k) \ge \kappa \qquad (19)$$
and
$$\epsilon \ \le\ \frac{\gamma\,\rho(\gamma^{-1}\alpha_0)}{\kappa\,\alpha_0} + \frac{\nu\,\alpha_0}{2\kappa\gamma}, \qquad (20)$$
then
$$k_\epsilon \ \le\ \frac{\beta}{(1 - p_0)\,\rho[\varphi(\kappa\epsilon)]}.$$

Proof. By the definition of $k_\epsilon$, we have $\|\tilde g_{k_\epsilon-1}\| > \epsilon$. Therefore, from Lemma 4.2 (which also holds with $\|\tilde g_k\|$ replaced by $\|\tilde g_{k-1}\|$) and the monotonicity of $\rho$ and $\varphi$, we obtain
$$\sum_{l=0}^{k_\epsilon-1} z_l \ \le\ \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\epsilon)\}\big)} + p_0\,k_\epsilon. \qquad (21)$$
According to (19), $z_k = 1$ for each $k \ge 0$. By the definition of $\varphi$, inequality (20) implies
$$\varphi(\kappa\epsilon) \ \le\ \varphi\Big(\frac{\rho(\gamma^{-1}\alpha_0)}{\gamma^{-1}\alpha_0} + \frac{\nu}{2}\,\gamma^{-1}\alpha_0\Big) \ \le\ \gamma^{-1}\alpha_0. \qquad (22)$$
Hence (21) reduces to
$$k_\epsilon \ \le\ \frac{\beta}{\rho[\varphi(\kappa\epsilon)]} + p_0\,k_\epsilon.$$
Since $\gamma > 1$, one has $p_0 < 1$ from (8), and the proof is completed.

4.2 Global rate with conditioning on the past

In this subsection, we study the probability $P(\|\tilde G_k\| \le \epsilon)$ (and, equivalently, $P(K_\epsilon \le k)$) with the help of Lemma 4.2. First we present a universal lower bound for this probability, which holds without any assumption on the probabilistic behavior of $\{D_k\}$ (Lemma 4.3). Using this bound, we prove that $P(\|\tilde G_k\| \le O(1/\sqrt{k}))$ and $P(K_\epsilon \le O(\epsilon^{-2}))$ are overwhelmingly high when $\rho(\alpha) = c\,\alpha^2/2$ (Corollaries 4.1 and 4.2), which will be given as special cases of the results for general forcing functions (Theorems 4.1 and 4.2).

Lemma 4.3 If
$$\epsilon \ \le\ \frac{\gamma\,\rho(\gamma^{-1}\alpha_0)}{\kappa\,\alpha_0} + \frac{\nu\,\alpha_0}{2\kappa\gamma}, \qquad (23)$$
then
$$P(\|\tilde G_k\| \le \epsilon) \ \ge\ 1 - \pi_k\Big(\frac{\beta}{k\,\rho[\varphi(\kappa\epsilon)]} + p_0\Big), \qquad (24)$$
where $\pi_k(\lambda) = P\big(\sum_{l=0}^{k-1} Z_l \le \lambda k\big)$.

Proof. According to Lemma 4.2 and the monotonicity of $\rho$ and $\varphi$, we have
$$\{\|\tilde G_k\| > \epsilon\} \ \subseteq\ \Big\{\sum_{l=0}^{k-1} Z_l \le \frac{\beta}{\rho\big(\min\{\gamma^{-1}\alpha_0,\ \varphi(\kappa\epsilon)\}\big)} + p_0\,k\Big\}. \qquad (25)$$
Again, as in (22), inequality (23) implies, by the definition of $\varphi$, that $\varphi(\kappa\epsilon) \le \gamma^{-1}\alpha_0$. Thus we can rewrite (25) as
$$\{\|\tilde G_k\| > \epsilon\} \ \subseteq\ \Big\{\sum_{l=0}^{k-1} Z_l \le \frac{\beta}{\rho[\varphi(\kappa\epsilon)]} + p_0\,k\Big\},$$
which gives us inequality (24) according to the definition of $\pi_k$.

This lemma enables us to lower bound $P(\|\tilde G_k\| \le \epsilon)$ by just focusing on the function $\pi_k$, which is a classical object in probability theory. Various lower bounds can then be established under different assumptions on $\{D_k\}$. Given the assumption that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent, Definition 3.1 implies that
$$P(Z_0 = 1) \ge p \quad \text{and} \quad P(Z_k = 1 \mid Z_0, \ldots, Z_{k-1}) \ge p. \qquad (26)$$
It is known that the lower tail of $\sum_{l=0}^{k-1} Z_l$ obeys a Chernoff-type bound, even when conditioning on the past replaces the more traditional assumption of independence of the $Z_k$'s (see, for instance, [15, Problem 1.7] and [13, Lemma 1.18]). We present such a bound in Lemma 4.4 below and give a proof in Appendix A, where the role of the concept of probabilistic descent can be seen.

Lemma 4.4 Suppose that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent and $\lambda \in (0, p)$. Then
$$\pi_k(\lambda) \ \le\ \exp\Big[-\frac{(p - \lambda)^2}{2p}\,k\Big]. \qquad (27)$$
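Lemma 4.4 can be checked numerically in the simplest setting where the $Z_l$ are i.i.d. Bernoulli($p$), independence being a special case of the conditional requirement (26). The function below (a sketch with illustrative parameter values) compares the empirical lower tail $\pi_k(\lambda)$ with the Chernoff bound:

```python
import numpy as np

def chernoff_lower_tail_check(p=0.6, lam=0.4, k=200, trials=20000, seed=0):
    """Empirical pi_k(lambda) = P(sum_{l<k} Z_l <= lambda*k) for i.i.d.
    Bernoulli(p) variables, versus the bound exp(-(p - lambda)^2 k / (2p))."""
    rng = np.random.default_rng(seed)
    sums = (rng.random((trials, k)) < p).sum(axis=1)
    pi_k = float(np.mean(sums <= lam * k))
    bound = float(np.exp(-(p - lam) ** 2 * k / (2.0 * p)))
    return pi_k, bound

# For the canonical choices gamma = 2, theta = 1/2, the threshold (8) is
# p_0 = ln(theta) / ln(theta/gamma) = 1/2, so any p > 1/2 is admissible.
p0 = np.log(0.5) / np.log(0.5 / 2.0)
```

With the default values above, the empirical tail is far below the bound, consistent with the exponential decay in $k$ asserted by (27).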

Now we are ready to present the main results of this section. In these results we will assume that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent with $p > p_0$, which cannot be fulfilled unless $\gamma > 1$ (since $p_0 = 1$ if $\gamma = 1$).

Theorem 4.1 Suppose that $\{D_k\}$ is $p$-probabilistically $\kappa$-descent with $p > p_0$ and that
$$k \ \ge\ \frac{(1+\delta)\,\beta}{(p - p_0)\,\rho[\varphi(\kappa\epsilon)]} \qquad (28)$$
for some positive number $\delta$, where $\epsilon$ satisfies
$$\epsilon \ \le\ \frac{\gamma\,\rho(\gamma^{-1}\alpha_0)}{\kappa\,\alpha_0} + \frac{\nu\,\alpha_0}{2\kappa\gamma}. \qquad (29)$$
Then
$$P(\|\tilde G_k\| \le \epsilon) \ \ge\ 1 - \exp\Big[-\frac{(p - p_0)^2\,\delta^2}{2p\,(1+\delta)^2}\,k\Big]. \qquad (30)$$

Proof. According to Lemma 4.3 and the monotonicity of $\pi_k$,
$$P(\|\tilde G_k\| \le \epsilon) \ \ge\ 1 - \pi_k\Big(\frac{\beta}{k\,\rho[\varphi(\kappa\epsilon)]} + p_0\Big) \ \ge\ 1 - \pi_k\Big(\frac{p - p_0}{1+\delta} + p_0\Big).$$
Then inequality (30) follows directly from Lemma 4.4.

Theorem 4.1 reveals that, when $\epsilon$ is an arbitrary number small enough to verify (29) and the iteration counter satisfies (28) for some positive number $\delta$, the norm of the gradient is below $\epsilon$ with overwhelmingly high probability. Note that $\epsilon$ is related to $k$ through (28). In fact, if $\rho \circ \varphi$ is invertible (which is true when the forcing function is a multiple of $\alpha^q$ with $q > 1$), then, for any positive integer $k$ sufficiently large to satisfy (28) and (29) all together, one can set
$$\epsilon = \frac{1}{\kappa}\,\varphi^{-1}\Big(\rho^{-1}\Big(\frac{(1+\delta)\,\beta}{(p - p_0)\,k}\Big)\Big), \qquad (31)$$
and what we have in (30) can then be read as (since $\epsilon$ and $k$ satisfy the assumptions of Theorem 4.1)
$$P\Big(\|\tilde G_k\| \le \kappa^{-1}\,\varphi^{-1}\big(\rho^{-1}(O(1/k))\big)\Big) \ \ge\ 1 - \exp(-Ck),$$
with $C$ a positive constant. A particular case is when the forcing function is a multiple of the square of the step size,
$$\rho(\alpha) = \frac{1}{2}\,c\,\alpha^2, \qquad \varphi(t) = \frac{2t}{c+\nu},$$
where it is easy to check the expression for $\varphi$. One can then obtain the global rate $\|\tilde G_k\| \le O(1/\sqrt{k})$ with overwhelmingly high probability, matching [28] for deterministic direct search, as shown below in Corollary 4.1, which is just an application of Theorem 4.1 to this particular forcing function. For simplicity, we will set $\delta = 1$.

Corollary 4.1 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀, ρ(α) = cα²/2, and

    k ≥ 4γ²β / (c(p − p₀)α₀²).    (32)

Then

    P( G_k ≤ β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ k^{1/2}) ) ≥ 1 − exp[ −(p − p₀)²k/(8p) ].    (33)

Proof. As in (31), with δ = 1, let

    ε = κ⁻¹ ϕ⁻¹( ρ⁻¹( 2β / ((p − p₀)k) ) ).    (34)

Then, by straightforward calculations, we have

    ε = β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ k^{1/2}).    (35)

Moreover, inequality (32) gives us

    ε ≤ [β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ)] · [c(p − p₀)α₀² / (4γ²β)]^{1/2} = (c + ν)α₀ / (2κγ).    (36)

Definition (34) and inequality (36) guarantee that k and ε satisfy (28) and (29) for ρ(α) = cα²/2 and δ = 1. Hence we can plug (35) into (30), yielding (33).

Based on Lemmas 4.3 and 4.4 (or directly on Theorem 4.1), one can lower bound P(K_ε ≤ k) and arrive at a worst case complexity result.

Theorem 4.2 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀ and

    ε ≤ γρ(γ⁻¹α₀)/(κα₀) + να₀/(2κγ).

Then, for each δ > 0,

    P( K_ε ≤ ⌈(1 + δ)β / ((p − p₀)ρ(ϕ(κε)))⌉ ) ≥ 1 − exp[ −β(p − p₀)δ² / (2p(1 + δ)ρ(ϕ(κε))) ].    (37)

Proof. Letting

    k = ⌈(1 + δ)β / ((p − p₀)ρ(ϕ(κε)))⌉,

we have P(K_ε ≤ k) = P(G_k ≤ ε), and then inequality (37) follows from Theorem 4.1 as k ≥ (1 + δ)β/((p − p₀)ρ(ϕ(κε))).

Since ρ(α) = o(α) (from the definition of the forcing function ρ) and ϕ(κε) ≤ 2ν⁻¹κε (from definition (3) of ϕ), it holds that the lower bound in (37) goes to one, as ε → 0, faster than 1 − exp(−ε⁻¹). Hence, we conclude from Theorem 4.2 that direct search based on probabilistic descent exhibits a worst case complexity bound, in number of iterations, of the order of 1/ρ(ϕ(κε)) with overwhelmingly high probability, matching Proposition 4.1 for deterministic direct search. To see this effect more clearly, we present in Corollary 4.2 the particularization of Theorem 4.2 when ρ(α) = cα²/2, taking δ = 1 as in Corollary 4.1. The worst case complexity bound is then of the order of 1/ε² with overwhelmingly high probability, matching [28] for deterministic direct search. The same matching happens when ρ(α) is a power of α with exponent q, where the bound is O(ε^{−q/min{q−1,1}}).

Corollary 4.2 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀, ρ(α) = cα²/2, and

    ε ≤ (c + ν)α₀ / (2κγ).

Then

    P( K_ε ≤ β(c + ν)² / [c(p − p₀)κ²] ε⁻² ) ≥ 1 − exp[ −β(p − p₀)(c + ν)² / (8cpκ²) ε⁻² ].    (38)

It is important to understand how the worst case complexity bound (38) depends on the dimension n of the problem. For this purpose, we first need to make explicit the dependence on n of the constant in K_ε ≤ β(c + ν)²/[c(p − p₀)κ²]ε⁻², which can only come from p and κ and is related to the choice of D_k. One can choose D_k as m directions independently and uniformly distributed on the unit sphere, with m independent of n, in which case p is a constant larger than p₀ and κ = τ/√n for some constant τ > 0 (both p and τ are totally determined by γ and θ, without dependence on m or n; see Corollary B.1 in Appendix B and the remarks after it). In such a case, from Corollary 4.2,

    P( K_ε ≤ β(c + ν)² / [c(p − p₀)τ²] n ε⁻² ) ≥ 1 − exp[ −β(p − p₀)(c + ν)² / (8cpκ²) ε⁻² ].

To derive a worst case complexity bound in terms of the number of function evaluations, one just needs then to see that each iteration of Algorithm 2.1 costs at most m function evaluations.
Thus, if K_ε^f represents the number of function evaluations within K_ε iterations, we obtain

    P( K_ε^f ≤ mβ(c + ν)² / [c(p − p₀)τ²] n ε⁻² ) ≥ 1 − exp[ −β(p − p₀)(c + ν)² / (8cpκ²) ε⁻² ].    (39)

The worst case complexity bound is then O(mnε⁻²) with overwhelmingly high probability, which is clearly better than the corresponding bound O(n²ε⁻²) for deterministic direct search [28] if m is chosen an order of magnitude smaller than n.

4.3 High probability iteration complexity

Given a confidence level P, the following theorem presents an explicit bound on the number of iterations which guarantees that G_k ≤ ε holds with probability at least P. Bounds of this type are interesting in practice and have been considered in the theoretical analysis of probabilistic algorithms (see, for instance, [23, 24]).
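To make the comparison between the two evaluation-complexity bounds at the end of Subsection 4.2 concrete, the following sketch (hypothetical values of n, m, and ε; all constant factors dropped) evaluates the dominant terms of O(mnε⁻²) and O(n²ε⁻²):

```python
def eval_complexity_terms(n, m, eps):
    """Dominant terms of the evaluation-complexity bounds: m*n/eps^2 for
    probabilistic descent with m uniform directions (cf. (39)) versus
    n^2/eps^2 for deterministic direct search [28]; constants dropped."""
    return m * n / eps ** 2, n ** 2 / eps ** 2

prob, det = eval_complexity_terms(n=100, m=2, eps=1e-2)
assert prob < det  # with m = 2 << n = 100 the gap is a factor of n/m = 50
```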

Theorem 4.3 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀. Then for any

    ε ≤ γρ(γ⁻¹α₀)/(κα₀) + να₀/(2κγ)

and P ∈ (0, 1), it holds P(G_k ≤ ε) ≥ P whenever

    k ≥ 3β / [2(p − p₀)ρ(ϕ(κε))] − 3p ln(1 − P) / (p − p₀)².    (40)

Proof. By Theorem 4.1, P(G_k ≤ ε) ≥ P is achieved when

    k ≥ max{ (1 + δ)β / [(p − p₀)ρ(ϕ(κε))], −2p(1 + δ)² ln(1 − P) / [δ²(p − p₀)²] }    (41)

for some positive number δ. Hence it suffices to show that the right-hand side of (40) is no smaller than that of (41) for a properly chosen δ. For simplicity, denote

    c₁ = β / [(p − p₀)ρ(ϕ(κε))],    c₂ = −2p ln(1 − P) / (p − p₀)².

Let us consider the positive number δ such that

    (1 + δ)c₁ = [(1 + δ)²/δ²] c₂.

It is easy to check that

    δ = [c₂ + (c₂² + 4c₁c₂)^{1/2}] / (2c₁).

Thus

    max{ (1 + δ)c₁, [(1 + δ)²/δ²] c₂ } = (1 + δ)c₁ ≤ (3/2)(c₁ + c₂),

which completes the proof.

5 Discussion and extensions

5.1 Establishing global convergence from global rate

It is interesting to notice that Theorem 4.1 implies a form of global convergence of direct search based on probabilistic descent, as we mentioned at the end of Subsection 3.2.

Proposition 5.1 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀. Then

    P( inf_{k≥0} G_k = 0 ) = 1.

Proof. We prove the result by contradiction. Suppose that

    P( inf_{k≥0} G_k > 0 ) > 0.

Then there exists a positive constant ε > 0 satisfying (29) and

    P( inf_{k≥0} G_k ≥ ε ) > 0.    (42)

But Theorem 4.1 implies that

    lim_{l→∞} P( G_l ≥ ε ) = 0,

which then contradicts (42), because P(G_l ≥ ε) ≥ P(inf_{k≥0} G_k ≥ ε) for each l ≥ 0.

If we assume, for all realizations of Algorithm 2.1, that the iterates never arrive at a stationary point in a finite number of iterations, then the events {lim inf_{k→∞} G_k = 0} and {inf_{k≥0} G_k = 0} are identical. Thus, in such a case, Proposition 5.1 reveals that

    P( lim inf_{k→∞} G_k = 0 ) = 1

if {D_k} is p-probabilistically κ-descent with p > p₀.

5.2 The behavior of the expected minimum norm gradient

In this subsection we study how E(G_k) behaves with the iteration counter k. For simplicity, we consider the special case of the forcing function ρ(α) = cα²/2.

Proposition 5.2 Suppose that {D_k} is p-probabilistically κ-descent with p > p₀, ρ(α) = cα²/2, and k ≥ 4γ²β/(c(p − p₀)α₀²). Then

    E(G_k) ≤ c₃ k^{−1/2} + ‖g₀‖ exp(−c₄k),

where

    c₃ = β^{1/2}(c + ν) / (c^{1/2}(p − p₀)^{1/2} κ),    c₄ = (p − p₀)² / (8p).

Proof. Let us define a random variable H_k as

    H_k = c₃ k^{−1/2}  if G_k ≤ c₃ k^{−1/2},    H_k = ‖g₀‖  otherwise.

Then G_k ≤ H_k, and hence

    E(G_k) ≤ E(H_k) ≤ c₃ k^{−1/2} + ‖g₀‖ P( G_k > c₃ k^{−1/2} ).

Therefore it suffices to notice that

    P( G_k > c₃ k^{−1/2} ) ≤ exp(−c₄k),

which is a straightforward application of Corollary 4.1.

In deterministic direct search, when a forcing function ρ(α) = cα²/2 is used, ‖g_k‖ decays with O(k^{−1/2}) when k tends to infinity [28]. Proposition 5.2 shows that E(G_k) behaves in a similar way in direct search based on probabilistic descent.

5.3 Global rate without conditioning to the past

Assuming that {D_k} is p-probabilistically κ-descent as defined in Definition 3.1, we obtained the Chernoff-type bound (27) for π_k, and then established the worst case complexity of Algorithm 2.1 when p > p₀. As mentioned before, inequality (5) in Definition 3.1 is stronger than the one without conditioning to the past. A natural question is then to see what can be obtained if we weaken this requirement by not conditioning to the past. It turns out that we can still establish a lower bound for P(G_k ≤ ε), but one much weaker than inequality (30) obtained in Theorem 4.1, and in particular not approaching one.

Proposition 5.3 If, for each k ≥ 0,

    P( cm(D_k, −G_k) ≥ κ ) ≥ p    (43)

and

    ε ≤ γρ(γ⁻¹α₀)/(κα₀) + να₀/(2κγ),

then

    P( G_k ≤ ε ) ≥ (p − p₀)/(1 − p₀) − β / [(1 − p₀)kρ(ϕ(κε))].

Proof. Due to Lemma 4.3, it suffices to show that

    π_k( β/(kρ(ϕ(κε))) + p₀ ) ≤ [1 − p + β/(kρ(ϕ(κε)))] / (1 − p₀).    (44)

Let us study π_k(λ). According to (43) and to the definition of Z_k, P(Z_k = 1) ≥ p for each k ≥ 0. Hence, by Markov's inequality,

    π_k(λ) = P( Σ_{l=0}^{k−1} Z_l ≤ λk ) = P( k − Σ_{l=0}^{k−1} Z_l ≥ k − λk ) ≤ E( k − Σ_{l=0}^{k−1} Z_l ) / (k − λk) ≤ (k − pk)/(k − λk) = (1 − p)/(1 − λ).

Thus

    π_k( β/(kρ(ϕ(κε))) + p₀ ) ≤ (1 − p) / [1 − p₀ − β/(kρ(ϕ(κε)))].    (45)

If β/(kρ(ϕ(κε))) ≤ p − p₀, then (44) follows from (45) and the fact that a/b ≤ (a + c)/(b + c) for 0 < a ≤ b and c ≥ 0. If β/(kρ(ϕ(κε))) > p − p₀, then (44) is trivial since π_k(λ) ≤ 1.
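The decay of the minimum gradient norm analyzed above can be observed on a toy run. The following self-contained sketch of Algorithm 2.1 (not the authors' implementation; the quadratic test function and every parameter value are illustrative assumptions) polls m directions uniformly distributed on the unit sphere, imposes the sufficient decrease f(x + αd) < f(x) − cα²/2, and records the running minimum gradient norm:

```python
import math
import random

def direct_search_probabilistic(f, grad, x0, m=2, gamma=2.0, theta=0.5,
                                alpha0=1.0, c=1e-3, iters=500, seed=1):
    """Sketch of Algorithm 2.1 with D_k made of m independent directions
    uniformly distributed on the unit sphere and rho(alpha) = c*alpha^2/2.
    Returns the history of the running minimum gradient norm."""
    rng = random.Random(seed)
    n = len(x0)
    x, alpha = list(x0), alpha0
    best = math.sqrt(sum(g * g for g in grad(x)))
    history = []
    for _ in range(iters):
        success = False
        for _ in range(m):
            d = [rng.gauss(0.0, 1.0) for _ in range(n)]
            norm = math.sqrt(sum(di * di for di in d))
            d = [di / norm for di in d]           # uniform on the unit sphere
            y = [xi + alpha * di for xi, di in zip(x, d)]
            if f(y) < f(x) - c * alpha ** 2 / 2:  # sufficient decrease test
                x, success = y, True
                break
        alpha = gamma * alpha if success else theta * alpha
        best = min(best, math.sqrt(sum(g * g for g in grad(x))))
        history.append(best)
    return history

f = lambda x: sum(xi * xi for xi in x)       # toy smooth objective
grad = lambda x: [2.0 * xi for xi in x]
hist = direct_search_probabilistic(f, grad, x0=[1.0] * 10)
assert hist[-1] < hist[0]  # the minimum gradient norm decreases, as expected
```

Plotting this history against 1/√k on the toy problem makes the predicted rate visible.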

5.4 A more detailed look at the numerical experiments

An understanding of what is at stake in the global analysis of direct search based on probabilistic descent allows us to revisit the numerical experiments of Subsection 2.2 with additional insight. Remember that we tested Algorithm 2.1 by choosing D_k as m independent random vectors uniformly distributed on the unit sphere in R^n. In such a case, the almost-sure global convergence of Algorithm 2.1 is guaranteed as long as m > log₂(1 − ln θ/ln γ), as can be concluded from Theorem 3.1 (see Corollary B.1 in Appendix B). For example, when γ = 2 and θ = 0.5, the algorithm converges with probability one when m ≥ 2, even if such values of m are much smaller than the number of elements of the positive spanning sets with smallest cardinality in R^n, which is n + 1, let alone the cardinality of the sets tested in Subsection 2.2.

We point out here that our theory is applicable only if γ > 1. Moreover, for deterministic direct search based on positive spanning sets (PSSs), setting γ = 1 tends to lead to better numerical performance (see, for instance, [8]). In this sense, the experiment in Subsection 2.2 is biased in favor of direct search based on probabilistic descent. To be fairer, we designed a new experiment by keeping γ = 1 when the sets of polling directions are guaranteed PSSs (which is true for the versions corresponding to [I −I], [Q −Q], and [Q_k −Q_k]), while setting γ > 1 for direct search based on probabilistic descent. All the other parameters were selected as in Subsection 2.2. In the case of direct search based on probabilistic descent, we now pick γ = 2 and γ = 1.1 as illustrations, and for the cardinality m of D_k we simply take the smallest integers satisfying m > log₂(1 − ln θ/ln γ), which are 2 and 4, respectively. Table 2 presents the results of the redesigned experiment with n = 40. Table 3 shows what happened for n = 100. The data is organized in the same way as in Table 1.
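The threshold on m quoted above can be computed directly; a minimal sketch (the function name is ours):

```python
import math

def min_polling_directions(gamma, theta):
    """Smallest integer m with m > log2(1 - ln(theta)/ln(gamma)), the
    condition guaranteeing almost-sure convergence for m directions
    uniformly distributed on the unit sphere (Corollary B.1)."""
    return math.floor(math.log2(1.0 - math.log(theta) / math.log(gamma))) + 1

assert min_polling_directions(2.0, 0.5) == 2    # the gamma = 2 case above
assert min_polling_directions(1.1, 0.5) == 4    # the gamma = 1.1 case above
```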
We can see from the tables that direct search based on probabilistic descent still outperforms, for these problems, the direct-search versions using PSSs, even though the difference is not as considerable as in Table 1. We note that such an effect is even more visible when the dimension is higher (n = 100), which is somehow in agreement with the fact that (39) reveals a worst case complexity in function evaluations of O(mnε⁻²), more favorable than the O(n²ε⁻²) of deterministic direct search based on PSSs when m is significantly smaller than n.

Table 2: Relative performance for different sets of polling directions (n = 40).

             [I −I]   [Q −Q]   [Q_k −Q_k]   2 (γ = 2)   4 (γ = 1.1)
    arglina
    arglinb
    broydn3d
    dqrtic
    engval
    freuroth
    integreq
    nondquar
    sinquad
    vardim

Table 3: Relative performance for different sets of polling directions (n = 100).

             [I −I]   [Q −Q]   [Q_k −Q_k]   2 (γ = 2)   4 (γ = 1.1)
    arglina
    arglinb
    broydn3d
    dqrtic
    engval
    freuroth
    integreq
    nondquar
    sinquad
    vardim

6 Other algorithmic contexts

Our analysis is wide enough to be carried over to other algorithmic contexts where no derivatives are used and some randomization is applied to probabilistically induce descent. In particular, it will also render a global rate of 1/√k, with overwhelmingly high probability, for trust-region methods based on probabilistic models [5]. We recall that Lemma 4.2 is the key result leading to the global rate results in Subsection 4.2. Lemma 4.2 was based only on the two following elements about a realization of Algorithm 2.1:

1. if the k-th iteration is successful, then f(x_k) − f(x_{k+1}) ≥ ρ(α_k), in which case α_k is increased; and

2. if cm(D_k, −g_k) ≥ κ and α_k < ϕ(κ‖g_k‖), then the k-th iteration is successful (Lemma 2.1).

One can easily identify similar elements in a realization of the trust-region algorithm proposed in [5, Algorithm 3.1]. In fact, using the notation of [5] and denoting

    K = { k ∈ ℕ : ρ_k ≥ η₁ and ‖g_k‖ ≥ η₂δ_k },

one can find positive constants μ₁ and μ₂ such that:

1. if k ∈ K, then f(x_k) − f(x_{k+1}) ≥ μ₁δ_k² and δ_k is increased (by [5, Algorithm 3.1]); and

2. if m_k is (κ_eg, κ_ef)-fully linear and δ_k < μ₂‖g_k‖, then k ∈ K (by [5, Lemma 3.2]).

One sees that δ_k and K play the same roles as α_k and the set of successful iterations in our paper. Based on these facts, one can first follow Lemma 4.1 to obtain a uniform upper bound μ₃ on Σ_{k=0}^{∞} δ_k², and then prove that (κ_eg, κ_ef)-fully linear models appear at most

    μ₃ / min{ γ⁻²δ₀², μ₂²‖g_k‖² } + k/2    (46)

times in the first k iterations, a conclusion similar to Lemma 4.2. It is then straightforward to mimic the analysis in Subsection 4.2 and prove a 1/√k rate for the gradient, with overwhelmingly high probability, if the random models are probabilistically fully linear (see [5, Definition 3.2]).

7 Concluding remarks

We introduced a new proof technique for establishing global rates (or worst case complexity bounds) for randomized algorithms where the choice of the new iterate depends on a decrease condition and where the quality of the object (directions, models) used for descent is favorable with a certain probability. The proof technique separates the counting of the number of iterations for which descent holds from the probabilistic properties of such a number.

In the theory of direct search for smooth functions, the polling directions are typically required to be uniformly non-degenerate positive spanning sets. However, numerical observation suggests that the performance of direct search can be improved by instead randomly generating the polling directions. To understand this phenomenon, we proposed the concept of probabilistically descent sets of directions and studied direct search based upon it. An argument inspired by [5] shows that direct search (Algorithm 2.1) converges almost surely if the sets of polling directions are p₀-probabilistically κ-descent for some κ > 0, where p₀ = ln θ/ln(γ⁻¹θ), with γ and θ the expanding and contracting parameters of the step size. But, more insightfully, we established in this paper the global rate and worst case complexity of direct search when the sets of polling directions are p-probabilistically κ-descent for some p > p₀ and κ > 0. It was proved that in such a situation direct search enjoys the global rate and worst case complexity of deterministic direct search with overwhelmingly high probability.
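The threshold p₀ is a simple function of the step-size parameters; a one-line sketch (parameter values below are illustrative):

```python
import math

def p0(gamma, theta):
    """p0 = ln(theta) / ln(theta/gamma): polling must yield descent with
    probability exceeding this threshold for the rate analysis to apply."""
    return math.log(theta) / math.log(theta / gamma)

# Without step expansion (gamma = 1), p0 = 1 and p > p0 is impossible,
# which is why the theory requires gamma > 1.
assert abs(p0(1.0, 0.5) - 1.0) < 1e-12
assert abs(p0(2.0, 0.5) - 0.5) < 1e-12
```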
In particular, when the forcing function is ρ(α) = cα²/2 (c > 0), the norm of the gradient is driven below O(1/√k) in k iterations with a probability that tends to 1 exponentially as k → ∞, or, equivalently, the norm is below ε in O(ε⁻²) iterations with a probability that tends to 1 exponentially as ε → 0. Based on these conclusions, we also showed that the expected minimum norm gradient decays with a rate of 1/√k, which matches the behavior of the gradient norm in deterministic direct search or steepest descent. A more precise output of this paper is then a worst case complexity bound of O(mnε⁻²), in terms of function evaluations, for this class of methods, where m is the number of independently and uniformly distributed random directions used for polling. As said above, such a bound does not hold deterministically, but under a probability approaching 1 exponentially as ε → 0; still, this is clearly better than the O(n²ε⁻²) known for deterministic direct search in a serial environment, provided m is chosen significantly smaller than n.

Open issues

When we were finishing this paper, Professor Nick Trefethen brought to our attention the possibility of setting D_k = {d, −d}, where d is a random vector independent of the previous iterations and uniformly distributed on the unit sphere of R^n. Given a unit vector v ∈ R^n and a number κ ∈ [0, 1], it is easy to see that the event {cm(D_k, v) ≥ κ} is the union of {dᵀv ≥ κ} and {−dᵀv ≥ κ}, whose intersection has probability zero, and therefore

    P( cm(D_k, v) ≥ κ ) = P( dᵀv ≥ κ ) + P( −dᵀv ≥ κ ) = 2ϱ,

ϱ being the probability of {dᵀv ≥ κ}. By the same argument as in the proof of Proposition B.1, one then sees that {D_k} defined in this way is 2ϱ-probabilistically κ-descent. Given any constants γ and θ satisfying 0 < θ < 1 < γ, we can pick κ > 0 sufficiently small so that 2ϱ > ln θ/ln(γ⁻¹θ), and then Algorithm 2.1 conforms to the theory presented in Subsection 3.2 and Section 4. Moreover, the set {d, −d} turns out to be optimal among all the sets D consisting of 2 random vectors uniformly distributed on the unit sphere, in the sense that it maximizes the probability P(cm(D, v) ≥ κ) for each κ ∈ [0, 1]. In fact, if D = {d₁, d₂} with d₁ and d₂ uniformly distributed on the unit sphere, then

    P( cm(D, v) ≥ κ ) = 2ϱ − P( {d₁ᵀv ≥ κ} ∩ {d₂ᵀv ≥ κ} ) ≤ 2ϱ,

and the maximal value 2ϱ is attained when D = {d, −d}, as already discussed. We tested the set {d, −d} numerically with γ = 2 and θ = 1/2, and it performed even better than the set of 2 independent vectors uniformly distributed on the unit sphere (yet the difference was not substantial), which illustrates again our theory of direct search based on probabilistic descent. It remains to know how to define D so that it maximizes P(cm(D, v) ≥ κ) for each κ ∈ [0, 1] when D consists of m > 2 random vectors uniformly distributed on the unit sphere, but this is out of the scope of this paper.

A number of other issues also remain to be investigated, related to how known properties of deterministic direct search extend to probabilistic descent. For instance, one knows that in some convex instances [12] the worst case complexity bound of deterministic direct search can be improved to the order of 1/ε, corresponding to a global rate of 1/k. One also knows that some deterministic direct-search methods exhibit an r-linear rate of local convergence [14]. Finally, extending our results to the presence of constraints and/or non-smoothness may also be of interest.

A Proof of Lemma 4.4

Proof. The result can be proved by standard techniques of large deviations.
Let t be an arbitrary positive number. By Markov's inequality,

    π_k(λ) = P( exp(−t Σ_{l=0}^{k−1} Z_l) ≥ exp(−tλk) ) ≤ exp(tλk) E( Π_{l=0}^{k−1} e^{−tZ_l} ).    (48)

Now let us study E( Π_{l=0}^{k−1} e^{−tZ_l} ). By Properties G and K of Shiryaev [26, page 216], we have

    E( Π_{l=0}^{k−1} e^{−tZ_l} ) = E( E( e^{−tZ_{k−1}} | Z₀, Z₁, ..., Z_{k−2} ) Π_{l=0}^{k−2} e^{−tZ_l} ).    (49)

According to (26) and the fact that the function r ↦ re^{−t} + 1 − r is monotonically decreasing in r, it holds with p̄ = P( Z_{k−1} = 1 | Z₀, Z₁, ..., Z_{k−2} ) ≥ p that

    E( e^{−tZ_{k−1}} | Z₀, Z₁, ..., Z_{k−2} ) = p̄e^{−t} + 1 − p̄ ≤ pe^{−t} + 1 − p ≤ exp( pe^{−t} − p ),

which implies, from equality (49), that

    E( Π_{l=0}^{k−1} e^{−tZ_l} ) ≤ exp( pe^{−t} − p ) E( Π_{l=0}^{k−2} e^{−tZ_l} ).


More information

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent 10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for

More information

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Lecture 6: Non-stochastic best arm identification

Lecture 6: Non-stochastic best arm identification CSE599i: Online and Adaptive Machine Learning Winter 08 Lecturer: Kevin Jamieson Lecture 6: Non-stochastic best arm identification Scribes: Anran Wang, eibin Li, rian Chan, Shiqing Yu, Zhijin Zhou Disclaimer:

More information

MATH 117 LECTURE NOTES

MATH 117 LECTURE NOTES MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 )

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 ) A line-search algorithm inspired by the adaptive cubic regularization framewor with a worst-case complexity Oɛ 3/ E. Bergou Y. Diouane S. Gratton December 4, 017 Abstract Adaptive regularized framewor

More information

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity

A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity A Trust Funnel Algorithm for Nonconvex Equality Constrained Optimization with O(ɛ 3/2 ) Complexity Mohammadreza Samadi, Lehigh University joint work with Frank E. Curtis (stand-in presenter), Lehigh University

More information

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by:

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by: Newton s Method Suppose we want to solve: (P:) min f (x) At x = x, f (x) can be approximated by: n x R. f (x) h(x) := f ( x)+ f ( x) T (x x)+ (x x) t H ( x)(x x), 2 which is the quadratic Taylor expansion

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Solution. If we does not need the pointwise limit of

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Week 2: Sequences and Series

Week 2: Sequences and Series QF0: Quantitative Finance August 29, 207 Week 2: Sequences and Series Facilitator: Christopher Ting AY 207/208 Mathematicians have tried in vain to this day to discover some order in the sequence of prime

More information

Analysis II - few selective results

Analysis II - few selective results Analysis II - few selective results Michael Ruzhansky December 15, 2008 1 Analysis on the real line 1.1 Chapter: Functions continuous on a closed interval 1.1.1 Intermediate Value Theorem (IVT) Theorem

More information

Distributed MAP probability estimation of dynamic systems with wireless sensor networks

Distributed MAP probability estimation of dynamic systems with wireless sensor networks Distributed MAP probability estimation of dynamic systems with wireless sensor networks Felicia Jakubiec, Alejandro Ribeiro Dept. of Electrical and Systems Engineering University of Pennsylvania https://fling.seas.upenn.edu/~life/wiki/

More information

6.1 Moment Generating and Characteristic Functions

6.1 Moment Generating and Characteristic Functions Chapter 6 Limit Theorems The power statistics can mostly be seen when there is a large collection of data points and we are interested in understanding the macro state of the system, e.g., the average,

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM

ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM ON SPACE-FILLING CURVES AND THE HAHN-MAZURKIEWICZ THEOREM ALEXANDER KUPERS Abstract. These are notes on space-filling curves, looking at a few examples and proving the Hahn-Mazurkiewicz theorem. This theorem

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

A note on the convex infimum convolution inequality

A note on the convex infimum convolution inequality A note on the convex infimum convolution inequality Naomi Feldheim, Arnaud Marsiglietti, Piotr Nayar, Jing Wang Abstract We characterize the symmetric measures which satisfy the one dimensional convex

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

CONVERGENCE OF RANDOM SERIES AND MARTINGALES

CONVERGENCE OF RANDOM SERIES AND MARTINGALES CONVERGENCE OF RANDOM SERIES AND MARTINGALES WESLEY LEE Abstract. This paper is an introduction to probability from a measuretheoretic standpoint. After covering probability spaces, it delves into the

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 18, No. 1, pp. 106 13 c 007 Society for Industrial and Applied Mathematics APPROXIMATE GAUSS NEWTON METHODS FOR NONLINEAR LEAST SQUARES PROBLEMS S. GRATTON, A. S. LAWLESS, AND N. K.

More information

On the Converse Law of Large Numbers

On the Converse Law of Large Numbers On the Converse Law of Large Numbers H. Jerome Keisler Yeneng Sun This version: March 15, 2018 Abstract Given a triangular array of random variables and a growth rate without a full upper asymptotic density,

More information

12.1 A Polynomial Bound on the Sample Size m for PAC Learning

12.1 A Polynomial Bound on the Sample Size m for PAC Learning 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Sub-Sampled Newton Methods I: Globally Convergent Algorithms Sub-Sampled Newton Methods I: Globally Convergent Algorithms arxiv:1601.04737v3 [math.oc] 26 Feb 2016 Farbod Roosta-Khorasani February 29, 2016 Abstract Michael W. Mahoney Large scale optimization problems

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

A recursive model-based trust-region method for derivative-free bound-constrained optimization.

A recursive model-based trust-region method for derivative-free bound-constrained optimization. A recursive model-based trust-region method for derivative-free bound-constrained optimization. ANKE TRÖLTZSCH [CERFACS, TOULOUSE, FRANCE] JOINT WORK WITH: SERGE GRATTON [ENSEEIHT, TOULOUSE, FRANCE] PHILIPPE

More information

Self-Concordant Barrier Functions for Convex Optimization

Self-Concordant Barrier Functions for Convex Optimization Appendix F Self-Concordant Barrier Functions for Convex Optimization F.1 Introduction In this Appendix we present a framework for developing polynomial-time algorithms for the solution of convex optimization

More information

On the Properties of Positive Spanning Sets and Positive Bases

On the Properties of Positive Spanning Sets and Positive Bases Noname manuscript No. (will be inserted by the editor) On the Properties of Positive Spanning Sets and Positive Bases Rommel G. Regis Received: May 30, 2015 / Accepted: date Abstract The concepts of positive

More information

Isomorphisms between pattern classes

Isomorphisms between pattern classes Journal of Combinatorics olume 0, Number 0, 1 8, 0000 Isomorphisms between pattern classes M. H. Albert, M. D. Atkinson and Anders Claesson Isomorphisms φ : A B between pattern classes are considered.

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

An interior-point gradient method for large-scale totally nonnegative least squares problems

An interior-point gradient method for large-scale totally nonnegative least squares problems An interior-point gradient method for large-scale totally nonnegative least squares problems Michael Merritt and Yin Zhang Technical Report TR04-08 Department of Computational and Applied Mathematics Rice

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Measures and Measure Spaces

Measures and Measure Spaces Chapter 2 Measures and Measure Spaces In summarizing the flaws of the Riemann integral we can focus on two main points: 1) Many nice functions are not Riemann integrable. 2) The Riemann integral does not

More information

Constructive bounds for a Ramsey-type problem

Constructive bounds for a Ramsey-type problem Constructive bounds for a Ramsey-type problem Noga Alon Michael Krivelevich Abstract For every fixed integers r, s satisfying r < s there exists some ɛ = ɛ(r, s > 0 for which we construct explicitly an

More information

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract

More information

DOUBLE SERIES AND PRODUCTS OF SERIES

DOUBLE SERIES AND PRODUCTS OF SERIES DOUBLE SERIES AND PRODUCTS OF SERIES KENT MERRYFIELD. Various ways to add up a doubly-indexed series: Let be a sequence of numbers depending on the two variables j and k. I will assume that 0 j < and 0

More information

Introduction to Bases in Banach Spaces

Introduction to Bases in Banach Spaces Introduction to Bases in Banach Spaces Matt Daws June 5, 2005 Abstract We introduce the notion of Schauder bases in Banach spaces, aiming to be able to give a statement of, and make sense of, the Gowers

More information