A Model Reference Adaptive Search Method for Global Optimization


Jiaqiao Hu, Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD 20742

Michael C. Fu, Robert H. Smith School of Business & Institute for Systems Research, University of Maryland, College Park, MD 20742

Steven I. Marcus, Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD 20742

March 2005; revised September 2005

Abstract

We introduce a new randomized method called Model Reference Adaptive Search (MRAS) for solving global optimization problems. The method works with a parameterized probabilistic model on the solution space and generates at each iteration a group of candidate solutions. These candidate solutions are then used to update the parameters associated with the probabilistic model in such a way that the future search will be biased toward the region containing high-quality solutions. The parameter updating procedure in MRAS is guided by a sequence of implicit reference models. We provide a particular instantiation of the sequence of reference models and describe an algorithm that implements the idea. We prove global convergence of the proposed algorithm in both continuous and combinatorial domains. In addition, we show that the model reference framework can also be used to describe the recently proposed cross-entropy (CE) method for optimization and study its properties. Numerical studies are also carried out to illustrate the performance of the algorithm.

Keywords: global optimization, combinatorial optimization, cross-entropy method (CE), estimation of distribution algorithm (EDA).

1 Introduction

Global optimization problems arise in a wide range of applications and are often extremely difficult to solve. Following Zlochin et al.
(2004), we classify the solution methods for both continuous and combinatorial problems as being either instance-based or model-based. In instance-based methods, searches for new candidate solutions depend directly on previously generated solutions. Some well-known instance-based methods are simulated annealing (SA) (Kirkpatrick et al. 1983), genetic algorithms (GAs) (Srinivas and Patnaik 1994), tabu search (Glover 1990), and the recently proposed nested partitions (NP) method (Shi and Ólafsson 2000). In model-based algorithms, new solutions are generated via an intermediate probabilistic model that is updated or induced from the previous solutions. Model-based search methods are a class of solution techniques introduced fairly recently. In general, most of the algorithms that fall into this category share a similar framework and usually involve the following two phases:

1. Generate candidate solutions (e.g., random samples) according to a specified probabilistic model.

2. Update the parameters associated with the probabilistic model, on the basis of the data collected in the previous step, in order to bias the future search toward better solutions.
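The two-phase loop can be made concrete with a short sketch; the names `model_based_search` and `normal_refit`, the normal model, and the toy objective below are illustrative choices, not part of any specific published algorithm.

```python
import random

def model_based_search(sample_model, update_model, score, theta0,
                       pop_size=50, elite_frac=0.1, iterations=100):
    """Generic two-phase model-based search loop (a sketch).

    sample_model(theta): draw one candidate from the probabilistic model.
    update_model(theta, elites): refit the model parameters to the elites.
    score(x): objective value; higher is better.
    """
    theta = theta0
    best_x, best_val = None, float("-inf")
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(iterations):
        # Phase 1: generate candidate solutions from the current model.
        population = [sample_model(theta) for _ in range(pop_size)]
        population.sort(key=score, reverse=True)
        if score(population[0]) > best_val:
            best_x, best_val = population[0], score(population[0])
        # Phase 2: update the model toward the better-performing candidates.
        theta = update_model(theta, population[:n_elite])
    return best_x, best_val

# Illustrative instantiation: a normal(mu, sigma) model on the real line,
# refit to the elite sample mean/std (sigma floored to keep exploring).
def normal_refit(theta, elites):
    mu = sum(elites) / len(elites)
    var = sum((x - mu) ** 2 for x in elites) / len(elites)
    return (mu, max(0.1, var ** 0.5))

random.seed(0)
x_best, v_best = model_based_search(
    lambda th: random.gauss(th[0], th[1]),
    normal_refit,
    lambda x: -(x - 3.0) ** 2,   # maximized at x = 3
    theta0=(0.0, 5.0),
)
```

The loop keeps the best solution ever sampled, since the model update alone does not guarantee monotone improvement of the incumbent.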

Some well-established model-based methods are ant colony optimization (ACO) (Dorigo and Gambardella 1997), the cross-entropy (CE) method (Rubinstein and Kroese 2004, De Boer et al. 2005), and the estimation of distribution algorithms (EDAs) (Mühlenbein and Paaß 1996). Among the above approaches, those that are most relevant to our work are the cross-entropy (CE) method and the estimation of distribution algorithms (EDAs). The CE method was motivated by an adaptive algorithm for estimating probabilities of rare events (Rubinstein 1997) in stochastic networks. It was later realized (Rubinstein 1999, 2001) that the method can be modified to solve combinatorial and continuous optimization problems. The CE method starts with a family of parameterized probability distributions on the solution space and tries to find the parameter of the distribution that assigns maximum probability to the set of optimal solutions. Implicit in CE is an optimal (importance sampling) distribution concentrated only on the set of optimal solutions (i.e., zero variance), and the key idea is to use an iterative scheme to successively estimate the optimal parameter that minimizes the Kullback-Leibler (KL) divergence between the optimal distribution and the family of parameterized distributions. In the context of estimation of rare event probabilities, Homem-de-Mello (2004) shows the convergence of an adaptive version of CE to an estimate of the optimal (possibly local) CE parameter with probability one. Rubinstein (1999) shows the probability one convergence of a modified version of the CE method to the optimal solution for combinatorial optimization problems. The estimation of distribution algorithms (EDAs) were first introduced in the field of evolutionary computation in Mühlenbein and Paaß (1996). They inherit the spirit of the well-known genetic algorithms (GAs), but eliminate the crossover and mutation operators in order to avoid the disruption of partial solutions.
In EDAs, a new population of candidate solutions is generated according to the probability distribution induced or estimated from the promising solutions selected from the previous generation. Unlike CE, EDAs often take into account the interaction effects between the underlying decision variables needed to represent the individual candidate solutions, which are often expressed explicitly through the use of different probabilistic models. We refer the reader to Larrañaga et al. (1999) for a review of the ways in which different probabilistic models are used as EDA instantiations. The convergence of a class of EDAs, under the infinite population assumption, to the global optimum can be found in Zhang and Mühlenbein (2004). In this paper, we introduce a new randomized method, called model reference adaptive search (MRAS), for solving both continuous and combinatorial optimization problems. MRAS resembles CE in that it works with a family of parameterized distributions on the solution space. However, the method has a particular working mechanism which, to the best of our knowledge, has never been proposed before. The motivation behind the method is to use a sequence of intermediate reference distributions to facilitate and guide the updating of the parameters associated with the family of parameterized distributions during the search process. At each iteration of MRAS, candidate solutions are generated from the distribution (among the prescribed family of distributions) that possesses the minimum KL-divergence with respect to the reference model corresponding to the previous iteration. These candidate solutions are in turn used to construct the next distribution by minimizing the KL-divergence with respect to the current reference model, from which future candidate solutions will be generated.
In contrast, as mentioned previously, the CE method targets a single optimal (importance sampling) distribution to direct its parameter updating, where new parameters are obtained as the solutions to the problems of estimating the optimal reference parameters. We propose an instantiation of the MRAS method, where the sequence of reference distributions can be viewed as the generalized probability distribution models for EDAs with proportional selection scheme (Zhang and Mühlenbein 2004) and can be shown to converge to a degenerate distribution concentrated only on the set of optimal solutions. An attractive feature is that the sequence of reference distributions depends directly on the performance of the candidate solutions; thus the method automatically takes into account the correlations between the underlying decision variables, so that the random samples generated at each stage can be efficiently utilized. However, in EDAs (with proportional selection scheme), these distribution models are directly constructed in order to generate new candidate solutions, which is in general a difficult and computationally intensive task, whereas in our approach, the sequence of reference models is only used

implicitly to guide the parameter updating procedure; thus there is no need to build them explicitly. We show that for a class of parameterized probability distributions, the so-called Natural Exponential Family (NEF), the proposed algorithm converges to an optimal solution with probability one. In sum, our main contributions in this paper include: (i) We introduce a new framework for global optimization, which allows considerable flexibility in the choices of the reference models. Consequently, by carefully analyzing and selecting the sequence of reference models, one can design different (perhaps even more efficient) versions of our proposed approach. (ii) We propose an instantiation of the MRAS method, analyze its convergence properties in both continuous and combinatorial domains, and test its performance on some widely-used benchmark problems. (iii) We also explore the relationship between CE and MRAS. In particular, we show that the CE method can also be interpreted as an instance of the model reference framework. Based on this observation, we establish and discuss some important properties of CE. (iv) In addition, we extend the recent results on the convergence of quantile estimates in Homem-de-Mello (2004) to the cases where the random samples are drawn from a sequence of distributions rather than a single fixed distribution. The rest of the paper is organized as follows. In Section 2, we describe the MRAS method and motivate the use of a specific sequence of distributions as the reference models. In Section 3, we present a detailed implementation of the deterministic version of MRAS and establish its global convergence properties. In Section 4, we provide a unified view of CE and MRAS, and study the properties of the CE method. In Section 5, we implement the Monte Carlo version of MRAS and show its asymptotic convergence to a global optimum with probability one.
Illustrative numerical studies on both continuous and combinatorial optimization problems are given in Section 6. Finally, some future research topics are outlined in Section 7.

2 The Model Reference Adaptive Search Method

We consider the following global optimization problem:
$$x^* \in \arg\max_{x \in \mathcal{X}} H(x), \quad \mathcal{X} \subseteq \mathbb{R}^n, \qquad (1)$$
where the solution space $\mathcal{X}$ is a non-empty set in $\mathbb{R}^n$, and $H(\cdot): \mathcal{X} \to \mathbb{R}$ is a deterministic function that is bounded from below, i.e., $\exists\, M > -\infty$ such that $H(x) \geq M\ \forall x \in \mathcal{X}$. Throughout this paper, we assume that problem (1) has a unique global optimal solution, i.e., $\exists\, x^* \in \mathcal{X}$ such that $H(x) < H(x^*)\ \forall x \neq x^*$, $x \in \mathcal{X}$. In addition to the general regularity conditions stated above, we also make the following assumptions on the objective function.

Assumptions:

A1. For any given constant $\xi < H(x^*)$, the set $\{x : H(x) \geq \xi\} \cap \mathcal{X}$ has a strictly positive Lebesgue or discrete measure.

A2. For any given constant $\delta > 0$, $\sup_{x \in A_\delta} H(x) < H(x^*)$, where $A_\delta := \{x : \|x - x^*\| \geq \delta\} \cap \mathcal{X}$, and we define the supremum over the empty set to be $-\infty$.

Intuitively, A1 ensures that any neighborhood of the optimal solution $x^*$ will have a positive probability of being sampled. For ease of exposition, A1 restricts the class of problems under consideration to either continuous or discrete problems; however, we remark that the work of this paper can easily be extended to problems with a mixture of both continuous and discrete variables. Since $H(\cdot)$ has a unique global optimizer, A2 is satisfied by many functions encountered in practice. Note that both A1 and A2 hold trivially when $\mathcal{X}$ is (discrete) finite and the counting measure is used.

The MRAS method approaches the above problem as follows. It works with a family of parameterized distributions $\{f(\cdot, \theta), \theta \in \Theta\}$ on the solution space, where $\Theta$ is the parameter space. Assume that at the $k$th

iteration of the method, we have a sampling distribution $f(\cdot, \theta_k)$. We generate candidate solutions from this sampling distribution. The performances of these randomly generated solutions are then evaluated and used to calculate a new parameter vector $\theta_{k+1} \in \Theta$ according to a specified parameter updating rule. The above steps are performed repeatedly until a termination criterion is satisfied. The idea is that if the parameter updating rule is chosen appropriately, the future sampling process will be more and more concentrated on regions containing high-quality solutions.

In MRAS, the parameter updating is determined by another sequence of distributions $\{g_k(\cdot)\}$, called the reference distributions. In particular, at each iteration $k$, we look at the projection of $g_k(\cdot)$ on the family of distributions $\{f(\cdot, \theta), \theta \in \Theta\}$ and compute the new parameter vector $\theta_{k+1}$ that minimizes the Kullback-Leibler (KL) divergence
$$\mathcal{D}(g_k, f(\cdot, \theta)) := E_{g_k}\left[\ln \frac{g_k(X)}{f(X, \theta)}\right] = \int_{\mathcal{X}} \ln \frac{g_k(x)}{f(x, \theta)}\, g_k(x)\, \nu(dx),$$
where $\nu$ is the Lebesgue/counting measure defined on $\mathcal{X}$, $X = (X_1, \ldots, X_n)$ is a random vector taking values in $\mathcal{X}$, and $E_{g_k}[\cdot]$ denotes the expectation taken with respect to $g_k(\cdot)$. Intuitively speaking, $f(\cdot, \theta_{k+1})$ can be viewed as a compact representation (approximation) of the reference distribution $g_k(\cdot)$; consequently, the feasibility and effectiveness of the method will, to a large extent, depend on the choices of the reference distributions.

There are many different ways to construct the sequence of reference distributions $\{g_k(\cdot)\}$. Here, we propose to use the following simple iterative scheme. Let $g_0(x) > 0\ \forall x \in \mathcal{X}$ be an initial probability density/mass function (p.d.f./p.m.f.) on the solution space $\mathcal{X}$. At each iteration $k$, we compute a new p.d.f./p.m.f. by tilting the old p.d.f./p.m.f. $g_{k-1}(x)$ with the performance function $H(x)$ (for simplicity, here we assume $H(x) > 0\ \forall x \in \mathcal{X}$), i.e.,
$$g_k(x) = \frac{H(x)\, g_{k-1}(x)}{\int_{\mathcal{X}} H(x)\, g_{k-1}(x)\, \nu(dx)}, \quad \forall x \in \mathcal{X}. \qquad (2)$$
By doing so, we are assigning more weight to the solutions that have better performance. One direct consequence is that each iteration of (2) improves the expected performance. To be precise,
$$E_{g_k}[H(X)] = \frac{E_{g_{k-1}}[(H(X))^2]}{E_{g_{k-1}}[H(X)]} \geq E_{g_{k-1}}[H(X)].$$
Furthermore, it is possible to show that the sequence $\{g_k(\cdot), k = 0, 1, \ldots\}$ will converge to a distribution that concentrates only on the optimal solution for arbitrary $g_0(\cdot)$, so we will have $\lim_{k \to \infty} E_{g_k}[H(X)] = H(x^*)$.

The above idea has previously been used, for example, in EDAs with proportional selection schemes (cf. e.g., Zhang and Mühlenbein 2004) and in randomized algorithms for solving Markov decision processes (Chang et al. 2003). However, in those approaches, the construction of $g_k(\cdot)$ in (2) needs to be carried out explicitly to generate new samples; moreover, since $g_k(\cdot)$ may not have any structure, sampling from it could be computationally expensive. In MRAS, these difficulties are circumvented by projecting $g_k(\cdot)$ on the family of parameterized distributions $\{f(\cdot, \theta)\}$. On the one hand, $f(\cdot, \theta_k)$ often has some special structure and therefore could be much easier to handle; on the other hand, the sequence $\{f(\cdot, \theta_{k+1}), k = 0, 1, \ldots\}$ may retain some nice properties of $\{g_k(\cdot)\}$ and also converge to a degenerate distribution concentrated on the optimal solution.
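The tilting recursion (2) and the monotone improvement of the expected performance can be verified directly on a small finite solution space; the four solutions and performance values below are illustrative.

```python
def tilt(g, H):
    """One step of recursion (2): g_k(x) is proportional to H(x) * g_{k-1}(x)."""
    w = {x: H[x] * g[x] for x in g}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

def expected_perf(g, H):
    """E_{g}[H(X)] on a finite solution space."""
    return sum(H[x] * g[x] for x in g)

# Toy finite solution space with strictly positive performance values;
# the unique optimum is "d" with H = 10.
H = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0}
g = {x: 0.25 for x in H}  # arbitrary initial p.m.f. g_0

vals = [expected_perf(g, H)]
for _ in range(50):
    g = tilt(g, H)
    vals.append(expected_perf(g, H))
```

After the loop, `vals` is non-decreasing and `g` has essentially all of its mass on the optimal solution, illustrating both claims above.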

3 The MRAS$_0$ Algorithm (Deterministic Version)

We now present the deterministic version of the MRAS algorithm that uses the reference distributions proposed in Section 2. Throughout the analysis, we use $P_{\theta_k}(\cdot)$ and $E_{\theta_k}[\cdot]$ to denote the probability and expectation taken with respect to the p.d.f./p.m.f. $f(\cdot, \theta_k)$, and $I_{\{\cdot\}}$ to denote the indicator function, i.e.,
$$I_{\{A\}} := \begin{cases} 1 & \text{if event } A \text{ holds}, \\ 0 & \text{otherwise}. \end{cases}$$
Thus, under our notational convention,
$$P_{\theta_k}(H(X) \geq \gamma) = \int_{\mathcal{X}} I_{\{H(x) \geq \gamma\}} f(x, \theta_k)\, \nu(dx) \quad \text{and} \quad E_{\theta_k}[H(X)] = \int_{\mathcal{X}} H(x)\, f(x, \theta_k)\, \nu(dx).$$

3.1 Algorithm Description

Algorithm MRAS$_0$ (deterministic version)

Initialization: Specify $\rho \in (0, 1]$, a small number $\varepsilon \geq 0$, a strictly increasing function $S(\cdot): \mathbb{R} \to \mathbb{R}^+$, and an initial p.d.f./p.m.f. $f(x, \theta_0) > 0\ \forall x \in \mathcal{X}$. Set the iteration counter $k = 0$.

Repeat until a specified stopping rule is satisfied:

1. Calculate the $(1 - \rho)$-quantile
$$\gamma_{k+1} := \sup_l \{l : P_{\theta_k}(H(X) \geq l) \geq \rho\}.$$

2. if $k = 0$, then set $\bar{\gamma}_{k+1} = \gamma_{k+1}$.
   elseif $k \geq 1$,
      if $\gamma_{k+1} \geq \bar{\gamma}_k + \varepsilon$, then set $\bar{\gamma}_{k+1} = \gamma_{k+1}$,
      else set $\bar{\gamma}_{k+1} = \bar{\gamma}_k$.
   endif

3. Compute the parameter vector $\theta_{k+1}$ as
$$\theta_{k+1} := \arg\max_{\theta \in \Theta} E_{\theta_k}\left[\frac{[S(H(X))]^k}{f(X, \theta_k)}\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}} \ln f(X, \theta)\right]. \qquad (3)$$

4. Set $k = k + 1$.

The MRAS$_0$ algorithm requires specification of a parameter $\rho$, which determines the proportion of samples that will be used to update the probabilistic model. At successive iterations of the algorithm, a sequence $\{\gamma_k, k = 1, 2, \ldots\}$, i.e., the $(1 - \rho)$-quantiles with respect to the sequence of p.d.f.'s/p.m.f.'s $\{f(\cdot, \theta_k)\}$, is calculated at step 1 of MRAS$_0$. These quantile values are then used in step 2 to construct a sequence of non-decreasing thresholds $\{\bar{\gamma}_k, k = 1, 2, \ldots\}$, and only those candidate solutions that have performances better than these thresholds will be used in the parameter updating (cf. (3)). As we will see, the theoretical convergence of MRAS$_0$ is unaffected by the value of the parameter $\rho$.
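A Monte Carlo sketch of the MRAS$_0$ loop in one dimension can illustrate the mechanics, assuming a normal sampling p.d.f., $S(y) = e^y$ (valid here because the illustrative objective is bounded above by 0), and sample-average versions of steps 1 through 3; all constants and function names are illustrative, not prescribed by the algorithm.

```python
import math
import random

def mras0_normal(H, mu=0.0, sigma=5.0, rho=0.1, eps=0.01,
                 n=200, iters=60, seed=1):
    """Monte Carlo sketch of MRAS_0 with a univariate normal model.

    Elite samples are weighted by [S(H(x))]^k / f(x, theta_k); for the
    normal family, the weighted mean/variance give the closed-form update.
    """
    rng = random.Random(seed)
    gamma_bar = -float("inf")
    for k in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        # Step 1: estimate the (1 - rho)-quantile of H under f(., theta_k).
        hs = sorted((H(x) for x in xs), reverse=True)
        gamma = hs[max(0, int(rho * n) - 1)]
        # Step 2: monotone threshold with minimum increment eps.
        if k == 0 or gamma >= gamma_bar + eps:
            gamma_bar = gamma
        # Step 3: weighted closed-form update of (mu, sigma) on the elites.
        elite = [x for x in xs if H(x) >= gamma_bar]
        if not elite:
            continue
        def f(x):  # current sampling density f(x, theta_k)
            z = (x - mu) / sigma
            return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
        w = [math.exp(k * H(x)) / f(x) for x in elite]
        tot = sum(w)
        mu = sum(wi * x for wi, x in zip(w, elite)) / tot
        var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, elite)) / tot
        sigma = max(0.05, math.sqrt(var))  # floor keeps sampling well defined
    return mu, sigma

# Maximize H(x) = -(x - 2)^2, optimum at x* = 2.
mu_star, sigma_star = mras0_normal(lambda x: -(x - 2.0) ** 2)
```

The mean drifts toward the optimizer while the variance shrinks, mirroring the degenerate limiting behavior established below for the normal family.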
We use $\rho$ in our approach to concentrate the computational effort on the set of elite/promising samples, a standard technique employed in most population-based approaches, such as GAs and EDAs. During the initialization step of MRAS$_0$, a small number $\varepsilon$ and a strictly increasing function $S(\cdot): \mathbb{R} \to \mathbb{R}^+$ are also specified. The function $S(\cdot)$ is used to account for cases where the values of $H(x)$ are negative for some $x$, and the parameter $\varepsilon$ ensures that each strict increment in the sequence $\{\bar{\gamma}_k\}$ is lower bounded,

i.e.,
$$\inf_{k:\, \bar{\gamma}_{k+1} \neq \bar{\gamma}_k} \left(\bar{\gamma}_{k+1} - \bar{\gamma}_k\right) \geq \varepsilon.$$
We require $\varepsilon$ to be strictly positive for continuous problems, and non-negative for discrete (finite) problems. In continuous domains, the division by $f(x, \theta_k)$ in the performance function in step 3 is well defined if $f(x, \theta_k)$ has infinite support (e.g., normal p.d.f.), whereas in discrete/combinatorial domains, the division is still valid as long as each point $x$ in the solution space has a positive probability of being sampled. Additional regularity conditions on $f(x, \theta_k)$ in Section 5 will ensure that step 3 of MRAS$_0$ can be used interchangeably with the following equation:
$$\theta_{k+1} = \arg\max_{\theta \in \Theta} \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x) \geq \bar{\gamma}_{k+1}\}} \ln f(x, \theta)\, \nu(dx).$$
The following lemma shows that there is a sequence of reference models $\{g_k(\cdot), k = 1, 2, \ldots\}$ implicit in MRAS$_0$, and that the parameter $\theta_{k+1}$ computed at step 3 indeed minimizes the KL-divergence $\mathcal{D}(g_{k+1}, f(\cdot, \theta))$.

Lemma 3.1 The parameter $\theta_{k+1}$ computed at the $k$th iteration of the MRAS$_0$ algorithm minimizes the KL-divergence $\mathcal{D}(g_{k+1}, f(\cdot, \theta))$, where
$$g_{k+1}(x) := \frac{S(H(x))\, I_{\{H(x) \geq \bar{\gamma}_{k+1}\}}\, g_k(x)}{E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]} \quad \forall x \in \mathcal{X},\ k = 1, 2, \ldots, \quad \text{and} \quad g_1(x) := \frac{I_{\{H(x) \geq \bar{\gamma}_1\}}}{E_{\theta_0}\left[I_{\{H(X) \geq \bar{\gamma}_1\}} / f(X, \theta_0)\right]}.$$

3.2 Global Convergence

Global convergence of the MRAS$_0$ algorithm clearly depends on the choice of the parameterized distribution family. Throughout this paper, we restrict our analysis and discussion to a particular family of distributions called the natural exponential family (NEF), for which the global convergence properties can be established. We start by stating the definition of NEF and some regularity conditions.

Definition 3.1 A parameterized family of p.d.f.'s/p.m.f.'s $\{f(\cdot, \theta), \theta \in \Theta \subseteq \mathbb{R}^m\}$ on $\mathcal{X}$ is said to belong to the natural exponential family (NEF) if there exist functions $h(\cdot): \mathbb{R}^n \to \mathbb{R}$, $\Gamma(\cdot): \mathbb{R}^n \to \mathbb{R}^m$, and $K(\cdot): \mathbb{R}^m \to \mathbb{R}$ such that
$$f(x, \theta) = \exp\left\{\theta^T \Gamma(x) - K(\theta)\right\} h(x), \quad \forall \theta \in \Theta, \qquad (4)$$
where $K(\theta) = \ln \int_{\mathcal{X}} \exp\{\theta^T \Gamma(x)\}\, h(x)\, \nu(dx)$, and the superscript $T$ denotes vector transposition.
For the case where $f(\cdot, \theta)$ is a p.d.f., we assume that $\Gamma(\cdot)$ is a continuous mapping. Many common p.d.f.'s/p.m.f.'s belong to the NEF, e.g., Gaussian, Poisson, binomial, geometric, and certain multivariate forms of them.

Assumptions:

A3. There exists a compact set $\Pi$ such that the level set $\{x : H(x) \geq \bar{\gamma}_1\} \cap \mathcal{X} \subseteq \Pi$, where $\bar{\gamma}_1 = \sup_l \{l : P_{\theta_0}(H(X) \geq l) \geq \rho\}$ is defined as in the MRAS$_0$ algorithm.

A4. The maximizer of equation (3) is an interior point of $\Theta$ for all $k$.

A5. $\sup_{\theta \in \Theta} \left\|\exp\{\theta^T \Gamma(x)\}\, \Gamma(x)\, h(x)\right\|$ is integrable/summable with respect to $x$, where $\theta$, $\Gamma(\cdot)$, and $h(\cdot)$ are defined as in Definition 3.1.

Assumption A3 restricts the search of the MRAS$_0$ algorithm to some compact set; it is satisfied if the function $H(\cdot)$ has compact level sets or the solution space $\mathcal{X}$ is compact. In actual implementation of the

algorithm, step 3 of MRAS$_0$ is often posed as an unconstrained optimization problem, i.e., $\Theta = \mathbb{R}^m$, in which case A4 is automatically satisfied. It is also easy to verify that A5 is satisfied by most NEFs.

To show the convergence of MRAS$_0$, we will need the following key observation.

Lemma 3.2 If assumptions A3–A5 hold, then we have
$$E_{\theta_{k+1}}[\Gamma(X)] = E_{g_{k+1}}[\Gamma(X)], \quad \forall k = 0, 1, \ldots,$$
where $E_{\theta_{k+1}}[\cdot]$ and $E_{g_{k+1}}[\cdot]$ denote the expectations taken with respect to $f(\cdot, \theta_{k+1})$ and $g_{k+1}(\cdot)$, respectively.

We have the following convergence result for the MRAS$_0$ algorithm.

Theorem 3.1 Let $\{\theta_k, k = 1, 2, \ldots\}$ be the sequence of parameters generated by MRAS$_0$. If $\varepsilon > 0$ and assumptions A1–A5 are satisfied, then
$$\lim_{k \to \infty} E_{\theta_k}[\Gamma(X)] = \Gamma(x^*), \qquad (5)$$
where the limit is component-wise.

Remark 1: The convergence result in Theorem 3.1 is much stronger than it appears to be. For example, when $\Gamma(x)$ is a one-to-one function (which is the case for many NEFs encountered in practice), the convergence result (5) can be equivalently written as $\Gamma^{-1}\left(\lim_{k \to \infty} E_{\theta_k}[\Gamma(X)]\right) = x^*$. Also note that for some particular p.d.f.'s/p.m.f.'s, the solution vector $x$ itself will be a component of $\Gamma(x)$ (e.g., multivariate normal distribution). Under these circumstances, we can interpret (5) as $\lim_{k \to \infty} E_{\theta_k}[X] = x^*$. Another special case of particular interest is when the components of the random vector $X = (X_1, \ldots, X_n)$ are independent, i.e., each has a univariate p.d.f./p.m.f. of the form
$$f(x_i, \vartheta_i) = \exp(x_i \vartheta_i - K(\vartheta_i))\, h(x_i), \quad \vartheta_i \in \mathbb{R},\ i = 1, \ldots, n.$$
In this case, since the distribution of the random vector $X$ is simply the product of the marginal distributions, we clearly have $\Gamma(x) = x$. Thus, (5) is again equivalent to $\lim_{k \to \infty} E_{\theta_k}[X] = x^*$, where $\theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k)$, and $\vartheta_i^k$ is the value of $\vartheta_i$ at the $k$th iteration.

In Lemma 3.2, we have already established a relationship between the reference models $\{g_k(\cdot)\}$ and the sequence of sampling distributions $\{f(\cdot, \theta_k)\}$. Therefore, proving Theorem 3.1
amounts to showing that $\lim_{k \to \infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*)$.

Proof of Theorem 3.1: Recall from Lemma 3.1 that $g_{k+1}(\cdot)$ can be expressed recursively as
$$g_{k+1}(x) := \frac{S(H(x))\, I_{\{H(x) \geq \bar{\gamma}_{k+1}\}}\, g_k(x)}{E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]} \quad \forall x \in \mathcal{X},\ k = 1, 2, \ldots.$$
Thus
$$E_{g_{k+1}}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right] = \frac{E_{g_k}\left[[S(H(X))]^2\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]}{E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]} \geq E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]. \qquad (6)$$
Since $\bar{\gamma}_k \leq H(x^*)\ \forall k$, and each strict increment in the sequence $\{\bar{\gamma}_k\}$ is lower bounded by the quantity $\varepsilon > 0$, there exists a finite $N$ such that $\bar{\gamma}_{k+1} = \bar{\gamma}_k\ \forall k \geq N$. Before we proceed any further, we need to distinguish between two cases, $\bar{\gamma}_N = H(x^*)$ and $\bar{\gamma}_N < H(x^*)$.

8 Case. If γ N = H(x ) (note that since ρ > 0, this could only happen when the solution space is discrete), then from the definition of g k+ ( ) (see Lemma 3.), we obviously have g k+ (x) = 0, x x, and g k+ (x ) = [S(H(x ))] k I {H(x)=H(x )} X [S(H(x))]k I {H(x)=H(x )}ν(dx) = k N. Hence it follows immediately that E gk+ [Γ(X)] = Γ(x ) k N. Case 2. If γ N < H(x ), then from (6), we have E gk+ [ S(H(X))I{H(X) γk+2 }] Egk [ S(H(X))I{H(X) γk+ }], k N, (7) i.e., the sequence { [ } E gk S(H(X))I{H(X) γk+ }], k =, 2,... converges. Now we show that the limit of the above sequence is S(H(x )). To do so, we proceed by contradiction and assume that S := lim E [ ] g k S(H(X))I{H(X) γk+ } < S := S(H(x )). (8) k Define the set A as { A := {x : H(x) γ N } x : S(H(x)) S + S 2 } X. Since S( ) is strictly increasing, its inverse S ( ) exists. Thus A can be reformulated as A = { { x : H(x) max γ N, S ( S + S 2 )}} X. And since γ N < H(x ), A has a strictly positive Lebesgue/discrete measure by A. Notice that g k ( ) can be rewritten as g k (x) = k i= S(H(x))I {H(x) γi+ } [ ] g (x). E gi S(H(X))I{H(X) γi+ } S(H(x))I {H(x) γk+ } Since lim k [ ] = S(H(x))I {H(x) γ N } E gk S(H(X))I S >, x A, we conclude that {H(X) γk+ } Thus, by Fatou s lemma, we have = lim inf g k (x)ν(dx) lim inf k k which is a contradiction. Hence, it follows that X lim g k(x) =, x A. k A g k (x)ν(dx) A lim inf k g k(x)ν(dx) =, lim E [ ] g k S(H(X))I{H(X) γk+ } = S. (9) k In order to show that lim k E gk [Γ(X)] = Γ(x ), we now bound the difference between E gk [Γ(X)] and 8

$\Gamma(x^*)$. Note that $\forall k \geq N$, we have
$$\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| = \left\|\int_{\mathcal{X}} \left(\Gamma(x) - \Gamma(x^*)\right) g_k(x)\, \nu(dx)\right\| \leq \int_{\mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx), \qquad (10)$$
where $\mathcal{C} := \{x : H(x) \geq \bar{\gamma}_N\} \cap \mathcal{X}$ is the support of $g_k(\cdot)$, $\forall k \geq N$. By the assumption on $\Gamma(\cdot)$ in Definition 3.1, for any given $\zeta > 0$, there exists a $\delta > 0$ such that $\|x - x^*\| < \delta$ implies $\|\Gamma(x) - \Gamma(x^*)\| < \zeta$. With $A_\delta$ defined from assumption A2, we have from (10),
$$\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| \leq \int_{A_\delta^c \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx) + \int_{A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx)$$
$$\leq \zeta + \int_{A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx), \quad \forall k \geq N. \qquad (11)$$
The rest of the proof amounts to showing that the second term in (11) is also bounded. Clearly the term $\|\Gamma(x) - \Gamma(x^*)\|$ is bounded on the set $A_\delta \cap \mathcal{C}$; we only need to find a bound for $g_k(x)$. By A2, we have
$$\sup_{x \in A_\delta \cap \mathcal{C}} H(x) \leq \sup_{x \in A_\delta} H(x) < H(x^*).$$
Define $S_\delta := S^* - S\left(\sup_{x \in A_\delta} H(x)\right)$. Since $S(\cdot)$ is strictly increasing, we have $S_\delta > 0$; thus, it follows that
$$S(H(x)) \leq S^* - S_\delta, \quad \forall x \in A_\delta \cap \mathcal{C}. \qquad (12)$$
On the other hand, from (7) and (9), there exists $\bar{N} \geq N$ such that $\forall k \geq \bar{N}$,
$$E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right] \geq S^* - \frac{S_\delta}{2}. \qquad (13)$$
Observe that $g_k(x)$ can be alternatively expressed as
$$g_k(x) = \prod_{i=\bar{N}}^{k-1} \frac{S(H(x))\, I_{\{H(x) \geq \bar{\gamma}_{i+1}\}}}{E_{g_i}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{i+1}\}}\right]}\, g_{\bar{N}}(x), \quad \forall k \geq \bar{N}.$$
Thus, it follows from (12) and (13) that
$$g_k(x) \leq \left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)^{k - \bar{N}} g_{\bar{N}}(x), \quad \forall x \in A_\delta \cap \mathcal{C},\ \forall k \geq \bar{N}.$$
Therefore,
$$\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| \leq \zeta + \sup_{x \in A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| \left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)^{k - \bar{N}} \leq \left(1 + \sup_{x \in A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\|\right) \zeta, \quad \forall k \geq \hat{N},$$
where $\hat{N}$ is given by $\hat{N} := \max\left\{N,\ \bar{N} + \ln \zeta \big/ \ln\left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)\right\}$.

And since $\zeta$ is arbitrary, we have
$$\lim_{k \to \infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*).$$
Finally, the proof is completed by applying Lemma 3.2 to both Case 1 and Case 2.

Remark 2: Note that, as mentioned in Section 2, for problems with finite solution spaces, assumptions A1 and A2 are automatically satisfied. Furthermore, if we take the input parameter $\varepsilon = 0$, then step 2 of MRAS$_0$ is equivalent to $\bar{\gamma}_{k+1} = \max_{i \leq k+1} \gamma_i$. Thus, $\{\bar{\gamma}_k\}$ is non-decreasing and each strict increment in the sequence is bounded from below by
$$\min_{\substack{x, y \in \mathcal{X} \\ H(x) \neq H(y)}} \left|H(x) - H(y)\right|.$$
Therefore, the $\varepsilon > 0$ assumption in Theorem 3.1 can be relaxed to $\varepsilon \geq 0$.

We now address some of the special cases discussed in Remark 1.

Corollary 3.1 (Multivariate Normal) For continuous optimization problems in $\mathbb{R}^n$, if multivariate normal p.d.f.'s are used in MRAS$_0$, i.e.,
$$f(x, \theta_k) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_k|}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right), \qquad (14)$$
where $\theta_k := (\mu_k; \Sigma_k)$, $\varepsilon > 0$, and assumptions A1–A4 are satisfied, then
$$\lim_{k \to \infty} \mu_k = x^*, \quad \text{and} \quad \lim_{k \to \infty} \Sigma_k = 0_{n \times n},$$
where $0_{n \times n}$ represents an $n$-by-$n$ zero matrix.

Proof: By Lemma 3.2, it is easy to show that
$$\mu_{k+1} = E_{g_{k+1}}[X] \quad \text{and} \quad \Sigma_{k+1} = E_{g_{k+1}}\left[(X - \mu_{k+1})(X - \mu_{k+1})^T\right], \quad k = 0, 1, \ldots.$$
The rest of the proof amounts to showing that
$$\lim_{k \to \infty} E_{g_k}[X] = x^*, \quad \text{and} \quad \lim_{k \to \infty} E_{g_k}\left[(X - \mu_k)(X - \mu_k)^T\right] = 0_{n \times n},$$
which is the same as the proof of Theorem 3.1.

Remark 3: Corollary 3.1 shows that in the multivariate normal case, the sequence of parameterized p.d.f.'s will converge to a degenerate p.d.f. concentrated only on the optimal solution. In this case the parameters are updated as
$$\mu_{k+1} = \frac{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\, X\right]}{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]}, \qquad (15)$$
and
$$\Sigma_{k+1} = \frac{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\, (X - \mu_{k+1})(X - \mu_{k+1})^T\right]}{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]}, \qquad (16)$$
where $f(x, \theta_k)$ is given by (14). Note that when the solution space $\mathcal{X}$ is a (simple) constrained region in $\mathbb{R}^n$, one straightforward approach is to use the acceptance-rejection method (cf. e.g., Kroese et al. 2004).
And it is easy to verify that the parameter updating rules remain the same.
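A minimal sketch of the acceptance-rejection approach mentioned above, assuming a one-dimensional normal model truncated to an illustrative region X = [0, 4]; the function name and constants are illustrative.

```python
import random

def sample_constrained_normal(mu, sigma, in_X, rng, max_tries=10000):
    """Acceptance-rejection: draw from N(mu, sigma^2) conditioned on X.

    in_X(x) -> bool is the membership test for the constrained region;
    accepted draws follow the normal density truncated to the region, and
    the parameter updating rules are applied to them unchanged.
    """
    for _ in range(max_tries):
        x = rng.gauss(mu, sigma)
        if in_X(x):
            return x
    raise RuntimeError("acceptance rate too low")

rng = random.Random(7)
draws = [sample_constrained_normal(1.0, 2.0, lambda x: 0.0 <= x <= 4.0, rng)
         for _ in range(2000)]
```

This is practical only when the region carries non-negligible probability under the current normal; otherwise the acceptance rate collapses and a different sampler is needed.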

Corollary 3.2 (Independent Univariate) If the components of the random vector $X = (X_1, \ldots, X_n)$ are independent, each has a univariate p.d.f./p.m.f. of the form
$$f(x_i, \vartheta_i) = \exp(x_i \vartheta_i - K(\vartheta_i))\, h(x_i), \quad \vartheta_i \in \mathbb{R},\ i = 1, \ldots, n,$$
$\varepsilon > 0$, and A1–A5 are satisfied, then
$$\lim_{k \to \infty} E_{\theta_k}[X] = x^*, \quad \text{where } \theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k).$$

4 An Alternative View of the Cross-Entropy Method

In this section, we give an alternative interpretation of the CE method for optimization and discuss its similarities and differences with the MRAS$_0$ algorithm. Specifically, we show that the CE method can also be viewed as a search strategy guided by a sequence of reference models. From this particular point of view, we establish some important properties of the CE method. The deterministic version of the CE method for solving (1) can be summarized as follows.

Algorithm CE$_0$: Deterministic Version of the CE Method

1. Choose the initial p.d.f./p.m.f. $f(\cdot, \theta_0)$, $\theta_0 \in \Theta$. Specify the parameter $\rho \in (0, 1]$ and a non-decreasing function $\varphi(\cdot): \mathbb{R} \to \mathbb{R}^+ \cup \{0\}$. Set $k = 0$.

2. Calculate the $(1 - \rho)$-quantile $\gamma_{k+1}$ as
$$\gamma_{k+1} := \sup_l \{l : P_{\theta_k}(H(X) \geq l) \geq \rho\}.$$

3. Compute the new parameter
$$\theta_{k+1} := \arg\max_{\theta \in \Theta} E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X) \geq \gamma_{k+1}\}} \ln f(X, \theta)\right].$$

4. If a specified stopping rule is satisfied, then terminate; otherwise set $k = k + 1$ and go to Step 2.

In CE$_0$, choosing $\varphi(H(x)) = 1$ gives the standard CE method, whereas choosing $\varphi(H(x)) = H(x)$ (if $H(x) \geq 0\ \forall x \in \mathcal{X}$) gives an extended version of the standard CE method (cf. e.g., De Boer et al. 2005). One resemblance between CE and MRAS$_0$ is the use of the parameter $\rho$ and the $(1 - \rho)$-quantile in both algorithms. However, the fundamental difference is that in CE, the problem of estimating the optimal value of the parameter is broken down into a sequence of simple estimation problems, in which the parameter $\rho$ assumes a crucial role.
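The CE$_0$ recursion above can be sketched in Monte Carlo form for an independent Bernoulli model on binary strings; the $\alpha$-smoothing of the parameter vector is a common practical safeguard against premature degeneration, not part of the deterministic description above, and all constants are illustrative.

```python
import random

def ce_bernoulli(score, n_bits, rho=0.1, pop=100, iters=50,
                 alpha=0.9, seed=3):
    """Monte Carlo sketch of the standard CE method (CE_0 with phi = 1)
    under an independent Bernoulli model p = (p_1, ..., p_n).

    The KL projection for this family is the component-wise mean of the
    elite samples (those at or above the estimated (1 - rho)-quantile).
    """
    rng = random.Random(seed)
    p = [0.5] * n_bits
    n_elite = max(1, int(rho * pop))
    for _ in range(iters):
        xs = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop)]
        xs.sort(key=score, reverse=True)
        elite = xs[:n_elite]
        means = [sum(x[i] for x in elite) / n_elite for i in range(n_bits)]
        p = [alpha * m + (1 - alpha) * pi for m, pi in zip(means, p)]
    return p

# OneMax: H(x) = number of ones; the model should degenerate toward 1...1.
p_final = ce_bernoulli(lambda x: sum(x), n_bits=20)
```

On this easy separable objective the parameter vector concentrates near the optimum; the examples below show that on other problems the outcome depends critically on $\rho$.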
Since a small change in the value of $\rho$ may disturb the whole estimation process and affect the quality of the resulting estimates, the convergence of CE cannot always be guaranteed unless the value of $\rho$ is chosen sufficiently small (cf. De Boer et al. 2005, Homem-de-Mello 2004; see also Example 4.1 below), whereas the theoretical convergence of MRAS$_0$ is unaffected by the parameter $\rho$. The following lemma provides a unified view of MRAS and CE; it shows that by appropriately defining a sequence of implicit reference models $\{g^{ce}_k(\cdot) : k = 1, 2, \ldots\}$, the CE method can be recovered, and the parameter updating in CE is guided by this sequence of models.

Lemma 4.1 The parameter $\theta_{k+1}$ computed at the $k$th iteration of the CE$_0$ algorithm minimizes the KL-divergence $\mathcal{D}\left(g^{ce}_{k+1}, f(\cdot, \theta)\right)$, where
$$g^{ce}_{k+1}(x) := \frac{\varphi(H(x))\, I_{\{H(x) \geq \gamma_{k+1}\}}\, f(x, \theta_k)}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X) \geq \gamma_{k+1}\}}\right]} \quad \forall x \in \mathcal{X},\ k = 0, 1, \ldots. \qquad (17)$$

12 Proof: Similar to the proof of Lemma 3.. The key observation to note is that in contrast to MRAS 0, the sequence of reference models in CE depends explicitly on the family of parameterized p.d.f s/p.m.f s {f(, θ k )} used. Since gk+ ce ( ) is obtained by tilting f(, θ k ) with the performance function, it improves the expected performance in the sense that [ [ ] E θk (ϕ(h(x))i{h(x) γk+ }) 2] [ E g ce ϕ(h(x))i{h(x) γk+} k+ = [ ] E θk ϕ(h(x))i{h(x) γk+ }]. E θk ϕ(h(x))i{h(x) γk+ } Thus, it is reasonable to expect that the projection of g ce k+ ( ) on {f(, θ) : θ Θ} (i.e., f(, θ k+)) also improves the expected performance. This result is formalized in the following theorem. Theorem 4. For the CE 0 algorithm, we have E θk+ [ ϕ(h(x))i{h(x) γk+ }] Eθk [ ϕ(h(x))i{h(x) γk+ }], k = 0,,.... In the standard CE method, Theorem 4. implies the monotonicity of the sequence {γ k : k =, 2,...}. Lemma 4.2 For the standard CE method (i.e., CE 0 with ϕ(h(x)) = ), we have γ k+2 γ k+, k = 0,,.... Proof: By Theorem 4., we have E θk+ [I {H(X) γk+ }] E θk [I {H(X) γk+ }], i.e., P θk+ (H(X) γ k+ ) P θk (H(X) γ k+ ) ρ. The result follows by the definition of γ k+2 (See Step 2 of the CE 0 algorithm). Note that since γ k H(x ) for all k, Lemma 4.2 implies that the sequence {γ k : k =, 2,...} generated by the standard CE method converges. However, depending on the p.d.f s/p.m.f s and the parameter ρ used, the sequence {γ k } may not converge to H(x ) or even to a small neighborhood of H(x ) (cf. Examples 4. and 4.2 below). Similar to MRAS 0 (cf. Lemma 3.2), when f(, θ) belongs to the natural exponential families, the following lemma relates the sequence {f(, θ k ), k =, 2,...} to the sequence of reference models {gk ce ( ) : k =, 2,...}. Lemma 4.3 Assume that: Then. There exists a compact set Π such that the level set {x : H(x) γ k } X Π for all k =, 2,..., where γ k = sup l {l : P θk (H(X) l) ρ} is defined as in the CE 0 algorithm. 2. 
The parameter θ_{k+1} computed at Step 3 of the CE₀ algorithm is an interior point of Θ for all k.

3. Assumption A5 is satisfied.

Then

$$E_{\theta_{k+1}}[\Gamma(X)] = E_{g^{ce}_{k+1}}[\Gamma(X)], \quad \forall k = 0, 1, \ldots.$$

The above lemma indicates that the behavior of the sequence of p.d.f.'s/p.m.f.'s {f(·, θ_k)} is closely related to the properties of the sequence of reference models. To understand this, consider the particular case where Γ(x) = x. If the CE method converges to the optimal solution in the sense that lim_{k→∞} E_{θ_k}[H(X)] = H(x*), then we must have lim_{k→∞} E_{θ_k}[X] = x*, since H(x) < H(x*) ∀x ≠ x*. Thus, by Lemma 4.3, a necessary

condition for this convergence is lim_{k→∞} E_{g^{ce}_k}[X] = x*. However, unlike MRAS₀, where the convergence of the sequence of reference models to an optimal degenerate distribution is guaranteed, the convergence of the sequence {g^ce_k(·) : k = 1, 2, ...} relies on the choice of the family of distributions {f(·, θ)} and the value of the parameter ρ used (cf. (17)). We now illustrate this issue with two simple examples.

Example 4.1 (The Standard CE Method) Consider maximizing the function H(x) given by

$$H(x) = \begin{cases} 0 & x \in \{(0,1), (1,0)\}, \\ 1 & x = (0,0), \\ a & x = (1,1), \end{cases} \tag{18}$$

where a > 1, and x := (x_1, x_2) ∈ X := {(0,0), (0,1), (1,0), (1,1)}. If we take 0.25 < ρ ≤ 0.5 and an initial p.m.f. f(x, θ_0) = p_0^{x_1}(1 − p_0)^{1−x_1} q_0^{x_2}(1 − q_0)^{1−x_2} with θ_0 = (p_0, q_0) = (0.5, 0.5), then since P_{θ_0}(x ∈ {(0,0), (1,1)}) = 0.5 ≥ ρ, we have γ_1 = 1. It is also straightforward to see that

$$g^{ce}_1(x) = \begin{cases} 0.5 & x = (0,0) \text{ or } (1,1), \\ 0 & \text{otherwise}, \end{cases}$$

and the parameter θ_1 computed at Step 3 (with ϕ(H(x)) ≡ 1) of CE₀ is given by θ_1 = (0.5, 0.5). Proceeding iteratively, we have γ_k = 1 and g^ce_k(x) = g^ce_1(x) ∀k = 1, 2, ..., i.e., the algorithm does not converge to a degenerate distribution at the optimal solution. On the other hand, if we choose ρ ≤ 0.25, then it turns out that γ_k = a and

$$g^{ce}_k(x) = \begin{cases} 1 & x = (1,1), \\ 0 & \text{otherwise}, \end{cases}$$

for all k = 1, 2, ..., which means the algorithm converges to the optimum.

Example 4.2 (The Extended Version of the CE Method) Consider solving problem (18) by CE₀ with the performance function ϕ(H(x)) = H(x). We use the same family of p.m.f.'s as in Example 4.1 with the initial parameter θ_0 = (1/(1+a), 1/(1+a)). If the values of ρ are chosen from the interval (a/(1+a)², 1/(1+a)), then we have θ_k = (1/(1+a), 1/(1+a)), γ_k = 1, and

$$g^{ce}_k(x) = \begin{cases} \dfrac{a}{1+a} & x = (0,0), \\[4pt] \dfrac{1}{1+a} & x = (1,1), \\[4pt] 0 & \text{otherwise}, \end{cases}$$

for all k = 1, 2, .... On the other hand, if we choose ρ = 0.5 and θ_0 = (0.5, 0.5), then it is easy to verify that lim_{k→∞} γ_k = a and

$$\lim_{k\to\infty} g^{ce}_k(x) = \begin{cases} 1 & x = (1,1), \\ 0 & \text{otherwise}. \end{cases}$$
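The first example above can be replayed exactly, with no sampling, since the p.m.f. has only four support points. The sketch below is illustrative only (the choice a = 3 and the 20-iteration budget are arbitrary assumptions): it iterates the deterministic standard CE recursion and exhibits both behaviors, with ρ = 0.5 stalling at the non-degenerate fixed point and ρ = 0.2 ≤ 0.25 locking onto the optimum (1, 1).

```python
import numpy as np

# Exact (sampling-free) iteration of standard CE (phi = 1) on the p.m.f. of
# the example above, illustrating the role of rho.  Sketch for a = 3.
a = 3.0
X = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
H = np.array([1.0, 0.0, 0.0, a])

def ce_iteration(theta, rho):
    p, q = theta
    f = np.array([(p**x)*((1-p)**(1-x))*(q**y)*((1-q)**(1-y)) for x, y in X])
    # (1 - rho)-quantile: the largest level l with P_theta(H >= l) >= rho
    levels = np.sort(np.unique(H))[::-1]
    gamma = next(l for l in levels if f[H >= l].sum() >= rho)
    g = f * (H >= gamma); g /= g.sum()        # reference model g_{k+1}
    return g @ X, gamma                       # theta_{k+1} = E_g[X]

theta = np.array([0.5, 0.5])
for _ in range(20):
    theta, gamma = ce_iteration(theta, rho=0.5)
print(theta, gamma)                           # stuck: theta = (0.5, 0.5), gamma = 1

theta = np.array([0.5, 0.5])
for _ in range(20):
    theta, gamma = ce_iteration(theta, rho=0.2)
print(theta, gamma)                           # converges: theta = (1, 1), gamma = a
```

The first run reproduces the non-degenerate fixed point of Example 4.1; the second degenerates onto the optimal solution.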
5 The MRAS₁ Algorithm (Monte Carlo Version)

The MRAS₀ algorithm describes the idealized situation where quantile values and expectations can be evaluated exactly. In practice, we will usually resort to its stochastic counterpart, in which only a finite number of samples are used and expected values are replaced by their corresponding sample averages. For

example, Step 3 of MRAS₀ will be replaced with

$$\bar\theta_{k+1} = \arg\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \frac{[S(H(X_i))]^k}{f(X_i, \bar\theta_k)}\, I_{\{H(X_i) \ge \bar\gamma_{k+1}\}} \ln f(X_i, \theta), \tag{19}$$

where X_1, ..., X_N are i.i.d. random samples generated from f(·, θ̄_k), θ̄_k is the estimated parameter vector computed at the previous iteration, and γ̄_{k+1} is a threshold determined by the sample (1 − ρ)-quantile of H(X_1), ..., H(X_N). However, theoretical convergence can no longer be guaranteed for this simple stochastic counterpart of MRAS₀. In particular, the set {x : H(x) ≥ γ̄_{k+1}} involved in (19) may be empty, since all the random samples generated at the current iteration may be much worse than those generated at the previous iteration. Thus, we can only expect the algorithm to converge if the expected values in the MRAS₀ algorithm are closely approximated. Obviously, the quality of the approximation will depend on the number of samples used in the simulation, but it is difficult to determine in advance the appropriate number of samples. A sample size that is too small will cause the algorithm to fail to converge and result in poor quality solutions, whereas a sample size that is too large may lead to high computational cost. As mentioned earlier, the parameter ρ will, to some extent, affect the performance of the algorithm. Large values of ρ mean that almost all samples generated, regardless of their performance, will be used to update the probabilistic model, which could slow down the convergence process. On the other hand, since a good estimate necessarily requires a reasonable number of valid samples, the quantity ρN (i.e., the approximate number of samples used in parameter updating) cannot be too small. Thus, small values of ρ will require a large number of samples to be generated at each iteration and may result in significant simulation effort.
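For a concrete instance of the sample-average update above: when {f(·, θ)} is the univariate normal family, the arg max is available in closed form as a weighted maximum-likelihood estimate (a weighted sample mean and variance). The sketch below is illustrative only; the objective H, the choice S(h) = e^h, the power k, and all parameter values are assumptions, not settings from the paper's experiments.

```python
import numpy as np

# One Monte Carlo update of the tilted-likelihood form above for a univariate
# normal family f(., theta) = N(mu, sigma^2).  For NEFs the arg max is a
# weighted MLE, i.e. a weighted sample mean/variance.
rng = np.random.default_rng(1)
mu, sigma = 0.0, 5.0                          # current parameter theta_k (assumed)
S = lambda h: np.exp(h)                       # strictly increasing S : R -> R+ (assumed)
Hfun = lambda x: -x**2 + 2*x                  # toy objective, maximized at x = 1
N, rho, k = 1000, 0.1, 3                      # assumed sample size, quantile, iteration

X = rng.normal(mu, sigma, N)
Hvals = Hfun(X)
gamma = np.quantile(Hvals, 1 - rho)           # sample (1 - rho)-quantile threshold
fvals = np.exp(-(X - mu)**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))
w = S(Hvals)**k / fvals * (Hvals >= gamma)    # tilting weights of the update

mu_new = np.sum(w * X) / np.sum(w)            # weighted MLE = closed-form arg max
sigma_new = np.sqrt(np.sum(w * (X - mu_new)**2) / np.sum(w))
print(mu_new, sigma_new)                      # mu_new near 1, sigma_new shrunk
```

One update already pulls the mean toward the maximizer x = 1 and contracts the standard deviation, which is the qualitative behavior the convergence analysis formalizes.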
For a given problem, although it is clear that we should avoid values of ρ that are either too close to 1 or too close to 0, determining a priori which ρ gives satisfactory performance may be difficult. To address the above difficulties, we adopt the same idea as in Homem-de-Mello (2004) and propose a modified Monte Carlo version of MRAS₀ in which the sample size N is adaptively increasing and the parameter ρ is adaptively decreasing.

5.1 Algorithm Description

Roughly speaking, the MRAS₁ algorithm is essentially a Monte Carlo version of MRAS₀, except that the parameter ρ and the sample size N may change from one iteration to another. The rate of increase in the sample size is controlled by an extra parameter α > 1, specified during the initialization step. For example, if the initial sample size is N_0, then after k increments, the sample size will be approximately α^k N_0. At each iteration k, random samples are drawn from the density/mass function f̃(·, θ̄_k), which is a mixture of the initial density/mass f(·, θ_0) and the density/mass f(·, θ̄_k) calculated from the previous iteration (cf., e.g., Auer et al. for a similar idea in the context of multiarmed bandit models). We assume that f(·, θ_0) satisfies the following condition:

Assumption A3. There exists a compact set Π_ε such that {x : H(x) ≥ H(x*) − ε} ∩ X ⊆ Π_ε. Moreover, the initial density/mass function f(x, θ_0) is bounded away from zero on Π_ε, i.e., f_* := inf_{x∈Π_ε} f(x, θ_0) > 0.

In practice, the initial density f(·, θ_0) can be chosen according to some prior knowledge of the problem structure (cf. Section 6.2); however, if nothing is known about where the good solutions are, this density should be chosen in such a way that each region of the solution space has an (approximately) equal probability of being sampled. For instance, when X is finite, one simple choice of f(·, θ_0) is the uniform distribution.
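The mixture sampling just described is easy to implement: flip a λ-coin per sample to decide whether the draw comes from the fixed initial density or the current one. A minimal sketch, assuming a univariate normal family and arbitrary parameter values:

```python
import numpy as np

# Minimal sketch of the mixture sampling used by MRAS_1: with probability
# lambda draw from the fixed initial density f(., theta_0), otherwise from the
# current density f(., theta_k).  The normal family and all values are assumed.
rng = np.random.default_rng(2)

def sample_mixture(n, lam, theta_k, theta_0):
    """Draw n samples from (1 - lam) f(., theta_k) + lam f(., theta_0)."""
    mu_k, sig_k = theta_k
    mu_0, sig_0 = theta_0
    from_init = rng.random(n) < lam           # Bernoulli(lambda) component picks
    x = rng.normal(mu_k, sig_k, n)
    x[from_init] = rng.normal(mu_0, sig_0, from_init.sum())
    return x

x = sample_mixture(10000, lam=0.1, theta_k=(3.0, 0.5), theta_0=(0.0, 10.0))
print(x.mean())                               # close to 0.9*3 + 0.1*0 = 2.7
```

Even when the current density has nearly collapsed (small sig_k), the λ-fraction of wide initial-density draws keeps every region of the space reachable, which is exactly the exploration role the text assigns to the mixture.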
Intuitively, mixing in the initial density forces the algorithm to explore the entire solution space and to maintain a global perspective during the search process. Also note that if λ = 1, then random samples will always be drawn from the initial density, in which case MRAS₁ becomes a pure random sampling

approach.

Algorithm MRAS₁ (Monte Carlo version)

Initialization: Specify ρ_0 ∈ (0, 1], an initial sample size N_0 > 1, ε ≥ 0, α > 1, a mixing coefficient λ ∈ (0, 1], a strictly increasing function S(·) : ℝ → ℝ⁺, and an initial p.d.f. f(x, θ_0) > 0 ∀x ∈ X. Set θ̄_0 ← θ_0, k ← 0.

Repeat until a specified stopping rule is satisfied:

1. Generate N_k i.i.d. samples X_1^k, ..., X_{N_k}^k according to f̃(·, θ̄_k) := (1 − λ) f(·, θ̄_k) + λ f(·, θ_0).

2. Compute the sample (1 − ρ_k)-quantile γ̄_{k+1}(ρ_k, N_k) := H_(⌈(1−ρ_k)N_k⌉), where ⌈a⌉ is the smallest integer greater than or equal to a, and H_(i) is the ith order statistic of the sequence {H(X_i^k), i = 1, ..., N_k}.

3. If k = 0 or γ̄_{k+1}(ρ_k, N_k) ≥ γ̄_k + ε/2, then
3a. set γ̄_{k+1} ← γ̄_{k+1}(ρ_k, N_k), ρ_{k+1} ← ρ_k, N_{k+1} ← N_k;
else, find the largest ρ̄ ∈ (0, ρ_k) such that γ̄_{k+1}(ρ̄, N_k) ≥ γ̄_k + ε/2.
3b. If such a ρ̄ exists, then set γ̄_{k+1} ← γ̄_{k+1}(ρ̄, N_k), ρ_{k+1} ← ρ̄, N_{k+1} ← N_k;
3c. else (if no such ρ̄ exists), set γ̄_{k+1} ← γ̄_k, ρ_{k+1} ← ρ_k, N_{k+1} ← ⌈αN_k⌉.
endif

4. Compute θ̄_{k+1} as

$$\bar\theta_{k+1} = \arg\max_{\theta \in \Theta} \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{[S(H(X_i^k))]^k}{\tilde f(X_i^k, \bar\theta_k)}\, I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}} \ln f(X_i^k, \theta). \tag{20}$$

5. Set k ← k + 1.

At Step 2, the sample (1 − ρ_k)-quantile γ̄_{k+1} is calculated by first ordering the sample performances H(X_i^k), i = 1, ..., N_k, from smallest to largest, H_(1) ≤ H_(2) ≤ ··· ≤ H_(N_k), and then taking the ⌈(1 − ρ_k)N_k⌉th order statistic. We write γ̄_{k+1}(ρ_k, N_k) to emphasize the dependence of γ̄_{k+1} on both ρ_k and N_k, so that different sample quantile values used during one iteration can be distinguished by their arguments. Step 3 of MRAS₁ is used to extract a sequence of non-decreasing thresholds {γ̄_k, k = 1, 2, ...} from the sequence of sample quantiles, and to determine the appropriate values of ρ_{k+1} and N_{k+1} to be used in subsequent iterations. This step is carried out as follows. At each iteration k, we first check whether the inequality γ̄_{k+1}(ρ_k, N_k) ≥ γ̄_k + ε/2 is satisfied, where γ̄_k is the threshold value used in the previous iteration.
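The Step 3 bookkeeping just described can be sketched directly in code. This is an illustrative reimplementation, not the authors' code; note that on a finite sample the quantile only changes at the grid ρ̄ ∈ {j/N}, so the search for the largest feasible ρ̄ reduces to a scan over that grid.

```python
import math

def sample_quantile(hvals, rho):
    """H_(ceil((1 - rho) * N)): the sample (1 - rho)-quantile."""
    h = sorted(hvals)
    return h[math.ceil((1 - rho) * len(h)) - 1]

def step3(hvals, gamma_prev, rho, N, k, eps, alpha):
    cand = sample_quantile(hvals, rho)
    if k == 0 or cand >= gamma_prev + eps / 2:            # Step 3a
        return cand, rho, N
    # Step 3b: largest rho_bar < rho restoring the eps/2 increase; scan the
    # grid rho_bar = j/N, the only points where the sample quantile changes.
    for j in range(math.ceil(rho * N) - 1, 0, -1):
        rho_bar = j / N
        cand = sample_quantile(hvals, rho_bar)
        if cand >= gamma_prev + eps / 2:
            return cand, rho_bar, N
    # Step 3c: keep the old threshold and rho, grow the sample size by alpha.
    return gamma_prev, rho, math.ceil(alpha * N)

hvals = [0.1, 0.4, 0.5, 0.9, 1.2]
print(step3(hvals, gamma_prev=0.8, rho=0.4, N=5, k=1, eps=0.2, alpha=1.5))
# -> (0.9, 0.2, 5): shrinking rho restored the required eps/2 increase
```

With a higher previous threshold (e.g. `gamma_prev=1.5`) no ρ̄ works, and the sketch falls through to Step 3c, returning the old threshold with an enlarged sample size.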
If the inequality holds, it means that both the current value of ρ_k and the current sample size N_k are satisfactory; thus we proceed to Step 3a and update the parameter vector θ̄_{k+1} in Step 4 by using γ̄_{k+1}(ρ_k, N_k). Otherwise, it indicates that either ρ_k is too large or the sample size N_k is too small. To determine which, we fix the sample size and check whether there exists a smaller ρ̄ < ρ_k such that the above inequality is satisfied with the new sample (1 − ρ̄)-quantile. If such a ρ̄ does exist, then the current sample size is still deemed acceptable, and we only need to decrease the value of ρ_k. Accordingly, the parameter vector is updated in Step 4 by using the sample (1 − ρ̄)-quantile. On the other hand, if no such ρ̄ can be found, then the parameter vector is updated by using the threshold γ̄_k calculated during the previous iteration, and the sample size is increased by a factor of α.

We make the following assumption about the parameter vector θ̄_{k+1} computed at Step 4:

Assumption A4. The parameter vector θ̄_{k+1} computed at Step 4 of MRAS₁ is an interior point of Θ for all k.

It is important to note that the set {x : H(x) ≥ γ̄_{k+1}, x ∈ {X_1^k, ..., X_{N_k}^k}} could be empty if Step 3c is visited. If this happens, the right-hand side of (20) will be equal to zero, so any θ ∈ Θ is a maximizer, and

we define θ̄_{k+1} := θ̄_k in this case.

5.2 Global Convergence

In this section, we discuss the convergence properties of the MRAS₁ algorithm for natural exponential families (NEFs). To be specific, we will explore the relations between MRAS₁ and MRAS₀ and show that, with high probability, the gaps (e.g., approximation errors incurred by replacing expected values with sample averages) between the two algorithms can be made small enough that the convergence analysis of MRAS₁ can be ascribed to the convergence analysis of the MRAS₀ algorithm; thus, our analysis relies heavily on the results obtained in Section 3.2. Throughout this section, we denote by P_{θ̄_k}(·) and E_{θ̄_k}[·] the respective probability and expectation taken with respect to the p.d.f./p.m.f. f(·, θ̄_k), and by P̃_{θ̄_k}(·) and Ẽ_{θ̄_k}[·] the respective probability and expectation taken with respect to f̃(·, θ̄_k). Note that since the sequence {θ̄_k} results from random samples generated at each iteration of MRAS₁, these quantities are also random.

Let g̃_{k+1}(·), k = 0, 1, ..., be defined by

$$\tilde g_{k+1}(x) := \begin{cases} \dfrac{\left[[S(H(x))]^k / \tilde f(x, \bar\theta_k)\right] I_{\{H(x) \ge \bar\gamma_{k+1}\}}}{\sum_{i=1}^{N_k} \left[[S(H(X_i^k))]^k / \tilde f(X_i^k, \bar\theta_k)\right] I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}}} & \text{if } \left\{x : H(x) \ge \bar\gamma_{k+1},\ x \in \{X_1^k, \ldots, X_{N_k}^k\}\right\} \ne \emptyset, \\[8pt] \tilde g_k(x) & \text{otherwise}, \end{cases} \tag{21}$$

where γ̄_{k+1} is given by

$$\bar\gamma_{k+1} := \begin{cases} \bar\gamma_{k+1}(\rho_k, N_k) & \text{if Step 3a is visited}, \\ \bar\gamma_{k+1}(\bar\rho, N_k) & \text{if Step 3b is visited}, \\ \bar\gamma_k & \text{if Step 3c is visited}. \end{cases}$$

Similar to Lemma 3.2, the following lemma shows the connection between f(·, θ̄_{k+1}) and g̃_{k+1}(·).

Lemma 5.1 If Assumptions A4 and A5 hold, then the parameter θ̄_{k+1} computed at Step 4 of MRAS₁ satisfies

$$E_{\bar\theta_{k+1}}[\Gamma(X)] = E_{\tilde g_{k+1}}[\Gamma(X)], \quad \forall k = 0, 1, \ldots.$$

Note that the region {x : H(x) ≥ γ̄_{k+1}} becomes smaller and smaller as γ̄_{k+1} increases. Lemma 5.1 shows that the sequence of sampling p.d.f.'s/p.m.f.'s {f(·, θ̄_{k+1})} is adapted to this sequence of shrinking regions. For example, consider the case where {x : H(x) ≥ γ̄_{k+1}} is convex and Γ(x) = x. Since E_{g̃_{k+1}}[X] is a convex combination of X_1^k, ..., X_{N_k}^k, the lemma implies that E_{θ̄_{k+1}}[X] ∈ {x : H(x) ≥ γ̄_{k+1}}.
Thus, it is natural to expect that the random samples generated at the next iteration will fall in the region {x : H(x) ≥ γ̄_{k+1}} with large probability (e.g., consider the normal p.d.f., whose mode is equal to its mean). In contrast, if we use a fixed sampling distribution for all iterations, as in pure random sampling (i.e., the λ = 1 case), then sampling from this sequence of shrinking regions could become a substantially difficult problem in practice. Next, we present a useful lemma, which shows the convergence of the quantile estimates when random samples are generated from a sequence of different distributions.

Lemma 5.2 For any given ρ ∈ (0, 1), let γ_k* denote the set of true (1 − ρ)-quantiles of H(X) with respect to the p.d.f./p.m.f. f̃(·, θ̄_k), and let γ̄_k(ρ, N_k) be the corresponding sample quantile of H(X_1^k), ..., H(X_{N_k}^k), where f̃(·, θ̄_k) and N_k are defined as in MRAS₁, and X_1^k, ..., X_{N_k}^k are i.i.d. with common density f̃(·, θ̄_k). Then the distance from γ̄_k(ρ, N_k) to γ_k* tends to zero as k → ∞ w.p.1.

We are now ready to state the main theorem.

Theorem 5.1 Let ε > 0, and define the ε-optimal set O_ε := {x : H(x) ≥ H(x*) − ε} ∩ X. If Assumptions A1, A3, A4, and A5 are satisfied, then there exists a random variable K such that w.p.1, K < ∞, and

1. γ̄_k > H(x*) − ε, ∀k ≥ K;

2. E_{θ̄_{k+1}}[Γ(X)] ∈ CONV{Γ(O_ε)}, ∀k ≥ K, where CONV{Γ(O_ε)} denotes the convex hull of the set Γ(O_ε).

Furthermore, let β be a positive constant satisfying the condition that the set {x : S(H(x)) ≥ 1/β} has a strictly positive Lebesgue/counting measure. If Assumptions A1, A2, A3, A4, and A5 are all satisfied and α > (βS*)², where S* := S(H(x*)), then

3. lim_{k→∞} E_{θ̄_k}[Γ(X)] = Γ(x*) w.p.1.

Remark 4: Roughly speaking, the second result can be understood as finite-time ε-optimality. To see this, consider the special case where H(x) is locally concave on the set O_ε. Let x, y ∈ O_ε and η ∈ [0, 1] be arbitrary. By the definition of concavity, we have H(ηx + (1 − η)y) ≥ ηH(x) + (1 − η)H(y) ≥ H(x*) − ε, which implies that the set O_ε is convex. If, in addition, Γ(x) is also convex and one-to-one on O_ε (e.g., the multivariate normal p.d.f.), then CONV{Γ(O_ε)} = Γ(O_ε). Thus it follows that Γ^{−1}(E_{θ̄_{k+1}}[Γ(X)]) ∈ O_ε, ∀k ≥ K w.p.1.

Proof of Theorem 5.1: (1) The first part of the proof is an extension of the proofs given in Homem-de-Mello (2004). First we claim that, given ρ_k and γ̄_k, if γ̄_k ≤ H(x*) − ε, then w.p.1 there exist K < ∞ and ρ̄ ∈ (0, ρ_k) such that γ̄_{k+1}(ρ̄, N_k) ≥ γ̄_k + ε/2 ∀k ≥ K. To show this, we proceed by contradiction. Let ρ_k* := P̃_{θ̄_k}(H(X) ≥ γ̄_k + 2ε/3). If γ̄_k ≤ H(x*) − ε, then γ̄_k + 2ε/3 ≤ H(x*) − ε/3. By A1 and A3, we have

$$\rho_k^* \ge \tilde P_{\bar\theta_k}\!\left(H(X) \ge H(x^*) - \frac{\varepsilon}{3}\right) \ge \lambda\, C(\varepsilon, \theta_0) > 0, \tag{22}$$

where C(ε, θ_0) := ∫_X I_{{H(x) ≥ H(x*) − ε/3}} f(x, θ_0) ν(dx) is a constant. Now assume that there exists ρ ∈ (0, ρ_k*) such that γ_{k+1}(ρ, θ̄_k) < γ̄_k + 2ε/3, where γ_{k+1}(ρ, θ̄_k) is the true (1 − ρ)-quantile of H(X) with respect to f̃(·, θ̄_k). By the definition of quantiles, we have

$$\tilde P_{\bar\theta_k}\!\left(H(X) \ge \gamma_{k+1}(\rho, \bar\theta_k)\right) \ge \rho, \quad \text{and} \quad \tilde P_{\bar\theta_k}\!\left(H(X) \le \gamma_{k+1}(\rho, \bar\theta_k)\right) \ge 1 - \rho > 1 - \rho_k^*. \tag{23}$$

It follows that

$$\tilde P_{\bar\theta_k}\!\left(H(X) \le \gamma_{k+1}(\rho, \bar\theta_k)\right) \le \tilde P_{\bar\theta_k}\!\left(H(X) < \bar\gamma_k + \frac{2\varepsilon}{3}\right) = 1 - \rho_k^*$$

by the definition of ρ_k*, which contradicts (23); thus we must have that if γ̄_k ≤ H(x*) − ε, then γ_{k+1}(ρ, θ̄_k) ≥ γ̄_k + 2ε/3, ∀ρ ∈ (0, ρ_k*).
Therefore, by (22), there exists ρ̄ ∈ (0, min{ρ_k, λC(ε, θ_0)}) ⊆ (0, ρ_k) such that γ_{k+1}(ρ̄, θ̄_k) ≥ γ̄_k + 2ε/3 whenever γ̄_k ≤ H(x*) − ε. By Lemma 5.2, the distance from the sample (1 − ρ̄)-quantile γ̄_{k+1}(ρ̄, N_k) to the set of true (1 − ρ̄)-quantiles γ_{k+1}(ρ̄, θ̄_k) goes to zero as k → ∞ w.p.1; thus there exists K < ∞ w.p.1 such that γ̄_{k+1}(ρ̄, N_k) ≥ γ̄_k + ε/2 ∀k ≥ K. Notice that from the MRAS₁ algorithm, if neither Step 3a nor Step 3b is visited at the kth iteration, we will have ρ_{k+1} = ρ_k and γ̄_{k+1} = γ̄_k. Thus, whenever γ̄_k ≤ H(x*) − ε, w.p.1 Step 3a/3b will be visited after a finite number of iterations. Furthermore, since the total number of visits to Steps 3a and 3b is finite (i.e., bounded by 2[H(x*) − M]/ε, where recall that M is a lower bound for H(x)), we conclude that there exists K < ∞ w.p.1 such that γ̄_k > H(x*) − ε, ∀k ≥ K w.p.1.

(2) From the MRAS₁ algorithm, it is easy to see that γ̄_{k+1} ≥ γ̄_k, ∀k = 0, 1, .... By part (1), we have γ̄_{k+1} > H(x*) − ε, ∀k ≥ K w.p.1. Thus, by the definition of g̃_{k+1}(x) (cf. (21)), it follows immediately that if {x : H(x) ≥ γ̄_{k+1}, x ∈ {X_1^k, ..., X_{N_k}^k}} ≠ ∅, then the support of g̃_{k+1}(x) satisfies supp{g̃_{k+1}} ⊆ O_ε ∀k ≥ K

w.p.1; otherwise, if {x : H(x) ≥ γ̄_{k+1}, x ∈ {X_1^k, ..., X_{N_k}^k}} = ∅, then g̃_{k+1} = g̃_k, so supp{g̃_{k+1}} = supp{g̃_k}. We now discuss these two cases separately.

Case 1. If supp{g̃_{k+1}} ⊆ O_ε, then we have Γ(supp{g̃_{k+1}}) ⊆ Γ(O_ε). Since E_{g̃_{k+1}}[Γ(X)] is a convex combination of Γ(X_1^k), ..., Γ(X_{N_k}^k), it follows that

$$E_{\tilde g_{k+1}}[\Gamma(X)] \in \text{CONV}\{\Gamma(\text{supp}\{\tilde g_{k+1}\})\} \subseteq \text{CONV}\{\Gamma(O_\varepsilon)\}.$$

Thus, by A4, A5, and Lemma 5.1,

$$E_{\bar\theta_{k+1}}[\Gamma(X)] \in \text{CONV}\{\Gamma(O_\varepsilon)\}.$$

Case 2. If supp{g̃_{k+1}} = supp{g̃_k} (note that this could only happen if Step 3c is visited), then from the algorithm, there exists some k̄ < k + 1 such that γ̄_{k+1} = γ̄_{k̄} and supp{g̃_{k̄}} ≠ ∅. Without loss of generality, let k̄ be the largest iteration counter such that the preceding properties hold. Since γ̄_{k̄} = γ̄_{k+1} > H(x*) − ε ∀k ≥ K w.p.1, we have supp{g̃_{k̄}} ⊆ O_ε w.p.1. By following the discussion in Case 1, it is clear that E_{θ̄_{k̄}}[Γ(X)] ∈ CONV{Γ(O_ε)} w.p.1. Furthermore, since θ̄_{k̄} = θ̄_{k̄+1} = ··· = θ̄_{k+1} (see the discussion in Section 5.1), we will again have

$$E_{\bar\theta_{k+1}}[\Gamma(X)] \in \text{CONV}\{\Gamma(O_\varepsilon)\}, \quad \forall k \ge K \ \text{w.p.1}.$$

(3) Define ĝ_{k+1}(x) as

$$\hat g_{k+1}(x) := \frac{[S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}}{\int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)}, \quad k = 1, 2, \ldots,$$

where γ̄_k is defined as in MRAS₁. Note that since γ̄_k is a random variable, ĝ_{k+1}(x) is also a random variable. It follows that

$$E_{\hat g_{k+1}}[\Gamma(X)] = \frac{\int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \Gamma(x)\, \nu(dx)}{\int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)}.$$

Let ω = (X_1^0, ..., X_{N_0}^0, X_1^1, ..., X_{N_1}^1, ...) be a particular sample path generated by the algorithm. For each ω, the sequence {γ̄_k(ω), k = 1, 2, ...} is non-decreasing, bounded above by H(x*), and each strict increase is lower bounded by ε/2; thus, there exists Ñ(ω) > 0 such that γ̄_{k+1}(ω) = γ̄_k(ω) ∀k ≥ Ñ(ω). Now define Ω_1 := {ω : lim_{k→∞} γ̄_k(ω) = H(x*)}. By the definition of g̃_{k+1}(·) (cf. (21)), for each ω ∈ Ω_1 we clearly have lim_{k→∞} E_{g̃_k(ω)}[Γ(X)] = Γ(x*); thus, it follows from Lemma 5.1 that lim_{k→∞} E_{θ̄_k(ω)}[Γ(X)] = Γ(x*), ∀ω ∈ Ω_1. The rest of the proof amounts to showing that the result also holds almost surely (a.s.) on the set Ω_1^c.
Since lim_{k→∞} γ̄_k(ω) = γ̄_{Ñ(ω)}(ω) < H(x*) ∀ω ∈ Ω_1^c, we have by Fatou's lemma

$$\liminf_{k\to\infty} \int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx) \ \ge\ \int_{\mathcal{X}} \liminf_{k\to\infty}\, [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx) > 0, \quad \forall\omega \in \Omega_1^c, \tag{24}$$

where the last inequality follows from the fact that βS(H(x)) ≥ 1 ∀x ∈ {x : H(x) ≥ max{S^{−1}(1/β), γ̄_{Ñ}}} and Assumption A1. Since f(x, θ_0) > 0 ∀x ∈ X, we have X ⊆ supp{f̃(·, θ̄_k)} ∀k; thus

$$E_{\hat g_{k+1}}[\Gamma(X)] = \frac{\tilde E_{\bar\theta_k}\!\left[\beta^k\, \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\, \Gamma(X)\right]}{\tilde E_{\bar\theta_k}\!\left[\beta^k\, \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\right]}, \quad k = 1, 2, \ldots,$$

where S̃_k(H(x)) := [S(H(x))]^k / f̃(x, θ̄_k). We now show that E_{g̃_{k+1}}[Γ(X)] → E_{ĝ_{k+1}}[Γ(X)] a.s. on Ω_1^c as

Published version: A Model Reference Adaptive Search Method for Global Optimization, Operations Research, Vol. 55, No. 3, May–June 2007, pp. 549–568, doi 10.1287/opre.1060.0367, © 2007 INFORMS.

SUPPLEMENTARY MATERIAL TO IRONING WITHOUT CONTROL SUPPLEMENTARY MATERIAL TO IRONING WITHOUT CONTROL JUUSO TOIKKA This document contains omitted proofs and additional results for the manuscript Ironing without Control. Appendix A contains the proofs for

More information

Some Background Math Notes on Limsups, Sets, and Convexity

Some Background Math Notes on Limsups, Sets, and Convexity EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. xx, No. x, Xxxxxxx 00x, pp. xxx xxx ISSN 0364-765X EISSN 156-5471 0x xx0x 0xxx informs DOI 10.187/moor.xxxx.xxxx c 00x INFORMS On the Power of Robust Solutions in

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Stochastic Dynamic Programming: The One Sector Growth Model

Stochastic Dynamic Programming: The One Sector Growth Model Stochastic Dynamic Programming: The One Sector Growth Model Esteban Rossi-Hansberg Princeton University March 26, 2012 Esteban Rossi-Hansberg () Stochastic Dynamic Programming March 26, 2012 1 / 31 References

More information

Concepts and Applications of Stochastically Weighted Stochastic Dominance

Concepts and Applications of Stochastically Weighted Stochastic Dominance Concepts and Applications of Stochastically Weighted Stochastic Dominance Jian Hu Department of Industrial Engineering and Management Sciences Northwestern University jianhu@northwestern.edu Tito Homem-de-Mello

More information

Complexity of two and multi-stage stochastic programming problems

Complexity of two and multi-stage stochastic programming problems Complexity of two and multi-stage stochastic programming problems A. Shapiro School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA The concept

More information

On Finding Optimal Policies for Markovian Decision Processes Using Simulation

On Finding Optimal Policies for Markovian Decision Processes Using Simulation On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation

More information

Quantifying Stochastic Model Errors via Robust Optimization

Quantifying Stochastic Model Errors via Robust Optimization Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage

More information

Metaheuristics and Local Search

Metaheuristics and Local Search Metaheuristics and Local Search 8000 Discrete optimization problems Variables x 1,..., x n. Variable domains D 1,..., D n, with D j Z. Constraints C 1,..., C m, with C i D 1 D n. Objective function f :

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

A minimalist s exposition of EM

A minimalist s exposition of EM A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability

More information

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches Discrete Mathematics for Bioinformatics WS 07/08, G. W. Klau, 31. Januar 2008, 11:55 1 Metaheuristics and Local Search Discrete optimization problems Variables x 1,...,x n. Variable domains D 1,...,D n,

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Measure Theory and Lebesgue Integration. Joshua H. Lifton

Measure Theory and Lebesgue Integration. Joshua H. Lifton Measure Theory and Lebesgue Integration Joshua H. Lifton Originally published 31 March 1999 Revised 5 September 2004 bstract This paper originally came out of my 1999 Swarthmore College Mathematics Senior

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Line search methods with variable sample size for unconstrained optimization

Line search methods with variable sample size for unconstrained optimization Line search methods with variable sample size for unconstrained optimization Nataša Krejić Nataša Krklec June 27, 2011 Abstract Minimization of unconstrained objective function in the form of mathematical

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: October 16, 2014

More information

1. Introduction. In this paper we consider stochastic optimization problems of the form

1. Introduction. In this paper we consider stochastic optimization problems of the form O RATES OF COVERGECE FOR STOCHASTIC OPTIMIZATIO PROBLEMS UDER O-I.I.D. SAMPLIG TITO HOMEM-DE-MELLO Abstract. In this paper we discuss the issue of solving stochastic optimization problems by means of sample

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Cross entropy-based importance sampling using Gaussian densities revisited

Cross entropy-based importance sampling using Gaussian densities revisited Cross entropy-based importance sampling using Gaussian densities revisited Sebastian Geyer a,, Iason Papaioannou a, Daniel Straub a a Engineering Ris Analysis Group, Technische Universität München, Arcisstraße

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

Probabilistic Graphical Models. Theory of Variational Inference: Inner and Outer Approximation. Lecture 15, March 4, 2013

Probabilistic Graphical Models. Theory of Variational Inference: Inner and Outer Approximation. Lecture 15, March 4, 2013 School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Junming Yin Lecture 15, March 4, 2013 Reading: W & J Book Chapters 1 Roadmap Two

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Verifying Regularity Conditions for Logit-Normal GLMM

Verifying Regularity Conditions for Logit-Normal GLMM Verifying Regularity Conditions for Logit-Normal GLMM Yun Ju Sung Charles J. Geyer January 10, 2006 In this note we verify the conditions of the theorems in Sung and Geyer (submitted) for the Logit-Normal

More information

Convex relaxations of chance constrained optimization problems

Convex relaxations of chance constrained optimization problems Convex relaxations of chance constrained optimization problems Shabbir Ahmed School of Industrial & Systems Engineering, Georgia Institute of Technology, 765 Ferst Drive, Atlanta, GA 30332. May 12, 2011

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

POL502: Foundations. Kosuke Imai Department of Politics, Princeton University. October 10, 2005

POL502: Foundations. Kosuke Imai Department of Politics, Princeton University. October 10, 2005 POL502: Foundations Kosuke Imai Department of Politics, Princeton University October 10, 2005 Our first task is to develop the foundations that are necessary for the materials covered in this course. 1

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Whitney s Extension Problem for C m

Whitney s Extension Problem for C m Whitney s Extension Problem for C m by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Supported by Grant

More information

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES IJMMS 25:6 2001) 397 409 PII. S0161171201002290 http://ijmms.hindawi.com Hindawi Publishing Corp. A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Introduction to Maximum Likelihood Estimation

Introduction to Maximum Likelihood Estimation Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:

More information