A Model Reference Adaptive Search Method for Global Optimization


Jiaqiao Hu, Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD 20742

Michael C. Fu, Robert H. Smith School of Business & Institute for Systems Research, University of Maryland, College Park, MD 20742

Steven I. Marcus, Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD 20742

March 2005; revised September 2005

Abstract

We introduce a new randomized method called Model Reference Adaptive Search (MRAS) for solving global optimization problems. The method works with a parameterized probabilistic model on the solution space and generates at each iteration a group of candidate solutions. These candidate solutions are then used to update the parameters associated with the probabilistic model in such a way that the future search will be biased toward the region containing high-quality solutions. The parameter updating procedure in MRAS is guided by a sequence of implicit reference models. We provide a particular instantiation of the sequence of reference models and describe an algorithm that implements the idea. We prove global convergence of the proposed algorithm in both continuous and combinatorial domains. In addition, we show that the model reference framework can also be used to describe the recently proposed cross-entropy (CE) method for optimization and study its properties. Numerical studies are also carried out to illustrate the performance of the algorithm.

Keywords: global optimization, combinatorial optimization, cross-entropy method (CE), estimation of distribution algorithm (EDA).

1 Introduction

Global optimization problems arise in a wide range of applications and are often extremely difficult to solve. Following Zlochin et al.
(2004), we classify the solution methods for both continuous and combinatorial problems as being either instance-based or model-based. In instance-based methods, searches for new candidate solutions depend directly on previously generated solutions. Some well-known instance-based methods are simulated annealing (SA) (Kirkpatrick et al. 1983), genetic algorithms (GAs) (Srinivas and Patnaik 1994), tabu search (Glover 1990), and the recently proposed nested partitions (NP) method (Shi and Ólafsson 2000). In model-based algorithms, new solutions are generated via an intermediate probabilistic model that is updated or induced from the previous solutions. Model-based search methods are a class of solution techniques introduced fairly recently. In general, most of the algorithms that fall into this category share a similar framework and usually involve the following two phases:

1. Generate candidate solutions (e.g., random samples) according to a specified probabilistic model.

2. Update the parameters associated with the probabilistic model, on the basis of the data collected in the previous step, in order to bias the future search toward better solutions.
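The two-phase loop can be made concrete with a short sketch; the names `model_based_search` and `normal_refit`, the normal model, and the toy objective below are illustrative choices, not part of any specific published algorithm.

```python
import random

def model_based_search(sample_model, update_model, score, theta0,
                       pop_size=50, elite_frac=0.1, iterations=100):
    """Generic two-phase model-based search loop (a sketch).

    sample_model(theta): draw one candidate from the probabilistic model.
    update_model(theta, elites): refit the model parameters to the elites.
    score(x): objective value; higher is better.
    """
    theta = theta0
    best_x, best_val = None, float("-inf")
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(iterations):
        # Phase 1: generate candidate solutions from the current model.
        population = [sample_model(theta) for _ in range(pop_size)]
        population.sort(key=score, reverse=True)
        if score(population[0]) > best_val:
            best_x, best_val = population[0], score(population[0])
        # Phase 2: update the model toward the better-performing candidates.
        theta = update_model(theta, population[:n_elite])
    return best_x, best_val

# Illustrative instantiation: a normal(mu, sigma) model on the real line,
# refit to the elite sample mean/std (sigma floored to keep exploring).
def normal_refit(theta, elites):
    mu = sum(elites) / len(elites)
    var = sum((x - mu) ** 2 for x in elites) / len(elites)
    return (mu, max(0.1, var ** 0.5))

random.seed(0)
x_best, v_best = model_based_search(
    lambda th: random.gauss(th[0], th[1]),
    normal_refit,
    lambda x: -(x - 3.0) ** 2,   # maximized at x = 3
    theta0=(0.0, 5.0),
)
```

The loop keeps the best solution ever sampled, since the model update alone does not guarantee monotone improvement of the incumbent.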

Some well-established model-based methods are ant colony optimization (ACO) (Dorigo and Gambardella 1997), the cross-entropy (CE) method (Rubinstein and Kroese 2004, De Boer et al. 2005), and the estimation of distribution algorithms (EDAs) (Mühlenbein and Paaß 1996). Among the above approaches, those that are most relevant to our work are the cross-entropy (CE) method and the estimation of distribution algorithms (EDAs). The CE method was motivated by an adaptive algorithm for estimating probabilities of rare events (Rubinstein 1997) in stochastic networks. It was later realized (Rubinstein 1999, 2001) that the method can be modified to solve combinatorial and continuous optimization problems. The CE method starts with a family of parameterized probability distributions on the solution space and tries to find the parameter of the distribution that assigns maximum probability to the set of optimal solutions. Implicit in CE is an optimal (importance sampling) distribution concentrated only on the set of optimal solutions (i.e., zero variance), and the key idea is to use an iterative scheme to successively estimate the optimal parameter that minimizes the Kullback-Leibler (KL) divergence between the optimal distribution and the family of parameterized distributions. In the context of estimation of rare event probabilities, Homem-de-Mello (2004) shows the convergence of an adaptive version of CE to an estimate of the optimal (possibly local) CE parameter with probability one. Rubinstein (1999) shows the probability one convergence of a modified version of the CE method to the optimal solution for combinatorial optimization problems. The estimation of distribution algorithms (EDAs) were first introduced in the field of evolutionary computation in Mühlenbein and Paaß (1996). They inherit the spirit of the well-known genetic algorithms (GAs), but eliminate the crossover and mutation operators in order to avoid the disruption of partial solutions.
In EDAs, a new population of candidate solutions is generated according to the probability distribution induced or estimated from the promising solutions selected from the previous generation. Unlike CE, EDAs often take into account the interaction effects between the underlying decision variables needed to represent the individual candidate solutions, which are often expressed explicitly through the use of different probabilistic models. We refer the reader to Larrañaga et al. (1999) for a review of the ways in which different probabilistic models are used as EDA instantiations. The convergence of a class of EDAs, under the infinite population assumption, to the global optimum can be found in Zhang and Mühlenbein (2004). In this paper, we introduce a new randomized method, called model reference adaptive search (MRAS), for solving both continuous and combinatorial optimization problems. MRAS resembles CE in that it works with a family of parameterized distributions on the solution space. However, the method has a particular working mechanism which, to the best of our knowledge, has never been proposed before. The motivation behind the method is to use a sequence of intermediate reference distributions to facilitate and guide the updating of the parameters associated with the family of parameterized distributions during the search process. At each iteration of MRAS, candidate solutions are generated from the distribution (among the prescribed family of distributions) that possesses the minimum KL-divergence with respect to the reference model corresponding to the previous iteration. These candidate solutions are in turn used to construct the next distribution by minimizing the KL-divergence with respect to the current reference model, from which future candidate solutions will be generated.
In contrast, as mentioned previously, the CE method targets a single optimal (importance sampling) distribution to direct its parameter updating, where new parameters are obtained as the solutions to the problems of estimating the optimal reference parameters. We propose an instantiation of the MRAS method, where the sequence of reference distributions can be viewed as the generalized probability distribution models for EDAs with proportional selection scheme (Zhang and Mühlenbein 2004) and can be shown to converge to a degenerate distribution concentrated only on the set of optimal solutions. An attractive feature is that the sequence of reference distributions depends directly on the performance of the candidate solutions; thus the method automatically takes into account the correlations between the underlying decision variables, so that the random samples generated at each stage can be efficiently utilized. However, in EDAs (with proportional selection scheme), these distribution models are directly constructed in order to generate new candidate solutions, which is in general a difficult and computationally intensive task, whereas in our approach, the sequence of reference models is only used

implicitly to guide the parameter updating procedure; thus there is no need to build them explicitly. We show that for a class of parameterized probability distributions, the so-called Natural Exponential Family (NEF), the proposed algorithm converges to an optimal solution with probability one. In sum, our main contributions in this paper include: (i) We introduce a new framework for global optimization, which allows considerable flexibility in the choices of the reference models. Consequently, by carefully analyzing and selecting the sequence of reference models, one can design different (perhaps even more efficient) versions of our proposed approach. (ii) We propose an instantiation of the MRAS method, analyze its convergence properties in both continuous and combinatorial domains, and test its performance on some widely-used benchmark problems. (iii) We also explore the relationship between CE and MRAS. In particular, we show that the CE method can also be interpreted as an instance of the model reference framework. Based on this observation, we establish and discuss some important properties of CE. (iv) In addition, we extend the recent results on the convergence of quantile estimates in Homem-de-Mello (2004) to the cases where the random samples are drawn from a sequence of distributions rather than a single fixed distribution. The rest of the paper is organized as follows. In Section 2, we describe the MRAS method and motivate the use of a specific sequence of distributions as the reference models. In Section 3, we present a detailed implementation of the deterministic version of MRAS and establish its global convergence properties. In Section 4, we provide a unified view of CE and MRAS, and study the properties of the CE method. In Section 5, we implement the Monte Carlo version of MRAS and show its asymptotic convergence to a global optimum with probability one.
Illustrative numerical studies on both continuous and combinatorial optimization problems are given in Section 6. Finally, some future research topics are outlined in Section 7.

2 The Model Reference Adaptive Search Method

We consider the following global optimization problem:
$$x^* \in \arg\max_{x \in \mathcal{X}} H(x), \quad \mathcal{X} \subseteq \mathbb{R}^n, \qquad (1)$$
where the solution space $\mathcal{X}$ is a non-empty set in $\mathbb{R}^n$, and $H(\cdot): \mathcal{X} \to \mathbb{R}$ is a deterministic function that is bounded from below, i.e., $\exists\, M > -\infty$ such that $H(x) \geq M\ \forall x \in \mathcal{X}$. Throughout this paper, we assume that problem (1) has a unique global optimal solution, i.e., $\exists\, x^* \in \mathcal{X}$ such that $H(x) < H(x^*)\ \forall x \neq x^*$, $x \in \mathcal{X}$. In addition to the general regularity conditions stated above, we also make the following assumptions on the objective function.

Assumptions:

A1. For any given constant $\xi < H(x^*)$, the set $\{x : H(x) \geq \xi\} \cap \mathcal{X}$ has a strictly positive Lebesgue or discrete measure.

A2. For any given constant $\delta > 0$, $\sup_{x \in A_\delta} H(x) < H(x^*)$, where $A_\delta := \{x : \|x - x^*\| \geq \delta\} \cap \mathcal{X}$, and we define the supremum over the empty set to be $-\infty$.

Intuitively, A1 ensures that any neighborhood of the optimal solution $x^*$ will have a positive probability of being sampled. For ease of exposition, A1 restricts the class of problems under consideration to either continuous or discrete problems; however, we remark that the work of this paper can easily be extended to problems with a mixture of both continuous and discrete variables. Since $H(\cdot)$ has a unique global optimizer, A2 is satisfied by many functions encountered in practice. Note that both A1 and A2 hold trivially when $\mathcal{X}$ is (discrete) finite and the counting measure is used.

The MRAS method approaches the above problem as follows. It works with a family of parameterized distributions $\{f(\cdot, \theta), \theta \in \Theta\}$ on the solution space, where $\Theta$ is the parameter space. Assume that at the $k$th

iteration of the method, we have a sampling distribution $f(\cdot, \theta_k)$. We generate candidate solutions from this sampling distribution. The performances of these randomly generated solutions are then evaluated and used to calculate a new parameter vector $\theta_{k+1} \in \Theta$ according to a specified parameter updating rule. The above steps are performed repeatedly until a termination criterion is satisfied. The idea is that if the parameter updating rule is chosen appropriately, the future sampling process will be more and more concentrated on regions containing high-quality solutions.

In MRAS, the parameter updating is determined by another sequence of distributions $\{g_k(\cdot)\}$, called the reference distributions. In particular, at each iteration $k$, we look at the projection of $g_k(\cdot)$ on the family of distributions $\{f(\cdot, \theta), \theta \in \Theta\}$ and compute the new parameter vector $\theta_{k+1}$ that minimizes the Kullback-Leibler (KL) divergence
$$\mathcal{D}(g_k, f(\cdot, \theta)) := E_{g_k}\left[\ln \frac{g_k(X)}{f(X, \theta)}\right] = \int_{\mathcal{X}} \ln \frac{g_k(x)}{f(x, \theta)}\, g_k(x)\, \nu(dx),$$
where $\nu$ is the Lebesgue/counting measure defined on $\mathcal{X}$, $X = (X_1, \ldots, X_n)$ is a random vector taking values in $\mathcal{X}$, and $E_{g_k}[\cdot]$ denotes the expectation taken with respect to $g_k(\cdot)$. Intuitively speaking, $f(\cdot, \theta_{k+1})$ can be viewed as a compact representation (approximation) of the reference distribution $g_k(\cdot)$; consequently, the feasibility and effectiveness of the method will, to a large extent, depend on the choices of the reference distributions.

There are many different ways to construct the sequence of reference distributions $\{g_k(\cdot)\}$. Here, we propose to use the following simple iterative scheme. Let $g_0(x) > 0\ \forall x \in \mathcal{X}$ be an initial probability density/mass function (p.d.f./p.m.f.) on the solution space $\mathcal{X}$. At each iteration $k$, we compute a new p.d.f./p.m.f. by tilting the old p.d.f./p.m.f. $g_{k-1}(x)$ with the performance function $H(x)$ (for simplicity, here we assume $H(x) > 0\ \forall x \in \mathcal{X}$), i.e.,
$$g_k(x) = \frac{H(x)\, g_{k-1}(x)}{\int_{\mathcal{X}} H(x)\, g_{k-1}(x)\, \nu(dx)}, \quad \forall x \in \mathcal{X}. \qquad (2)$$
By doing so, we are assigning more weight to the solutions that have better performance. One direct consequence is that each iteration of (2) improves the expected performance. To be precise,
$$E_{g_k}[H(X)] = \frac{E_{g_{k-1}}[(H(X))^2]}{E_{g_{k-1}}[H(X)]} \geq E_{g_{k-1}}[H(X)].$$
Furthermore, it is possible to show that the sequence $\{g_k(\cdot), k = 0, 1, \ldots\}$ will converge to a distribution that concentrates only on the optimal solution for arbitrary $g_0(\cdot)$, so we will have $\lim_{k \to \infty} E_{g_k}[H(X)] = H(x^*)$.

The above idea has previously been used, for example, in EDAs with proportional selection schemes (cf. e.g., Zhang and Mühlenbein 2004) and in randomized algorithms for solving Markov decision processes (Chang et al. 2003). However, in those approaches, the construction of $g_k(\cdot)$ in (2) needs to be carried out explicitly to generate new samples; moreover, since $g_k(\cdot)$ may not have any structure, sampling from it could be computationally expensive. In MRAS, these difficulties are circumvented by projecting $g_k(\cdot)$ on the family of parameterized distributions $\{f(\cdot, \theta)\}$. On the one hand, $f(\cdot, \theta_k)$ often has some special structure and therefore could be much easier to handle; on the other hand, the sequence $\{f(\cdot, \theta_{k+1}), k = 0, 1, \ldots\}$ may retain some nice properties of $\{g_k(\cdot)\}$ and also converge to a degenerate distribution concentrated on the optimal solution.
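The tilting recursion (2) and the monotone improvement of the expected performance can be verified directly on a small finite solution space; the four solutions and performance values below are illustrative.

```python
def tilt(g, H):
    """One step of recursion (2): g_k(x) is proportional to H(x) * g_{k-1}(x)."""
    w = {x: H[x] * g[x] for x in g}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

def expected_perf(g, H):
    """E_{g}[H(X)] on a finite solution space."""
    return sum(H[x] * g[x] for x in g)

# Toy finite solution space with strictly positive performance values;
# the unique optimum is "d" with H = 10.
H = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0}
g = {x: 0.25 for x in H}  # arbitrary initial p.m.f. g_0

vals = [expected_perf(g, H)]
for _ in range(50):
    g = tilt(g, H)
    vals.append(expected_perf(g, H))
```

After the loop, `vals` is non-decreasing and `g` has essentially all of its mass on the optimal solution, illustrating both claims above.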

3 The MRAS$_0$ Algorithm (Deterministic Version)

We now present the deterministic version of the MRAS algorithm that uses the reference distributions proposed in Section 2. Throughout the analysis, we use $P_{\theta_k}(\cdot)$ and $E_{\theta_k}[\cdot]$ to denote the probability and expectation taken with respect to the p.d.f./p.m.f. $f(\cdot, \theta_k)$, and $I_{\{\cdot\}}$ to denote the indicator function, i.e.,
$$I_{\{A\}} := \begin{cases} 1 & \text{if event } A \text{ holds}, \\ 0 & \text{otherwise}. \end{cases}$$
Thus, under our notational convention,
$$P_{\theta_k}(H(X) \geq \gamma) = \int_{\mathcal{X}} I_{\{H(x) \geq \gamma\}} f(x, \theta_k)\, \nu(dx) \quad \text{and} \quad E_{\theta_k}[H(X)] = \int_{\mathcal{X}} H(x)\, f(x, \theta_k)\, \nu(dx).$$

3.1 Algorithm Description

Algorithm MRAS$_0$ (deterministic version)

Initialization: Specify $\rho \in (0, 1]$, a small number $\varepsilon \geq 0$, a strictly increasing function $S(\cdot): \mathbb{R} \to \mathbb{R}^+$, and an initial p.d.f./p.m.f. $f(x, \theta_0) > 0\ \forall x \in \mathcal{X}$. Set the iteration counter $k = 0$.

Repeat until a specified stopping rule is satisfied:

1. Calculate the $(1 - \rho)$-quantile
$$\gamma_{k+1} := \sup_l \{l : P_{\theta_k}(H(X) \geq l) \geq \rho\}.$$

2. if $k = 0$, then set $\bar{\gamma}_{k+1} = \gamma_{k+1}$.
   elseif $k \geq 1$,
      if $\gamma_{k+1} \geq \bar{\gamma}_k + \varepsilon$, then set $\bar{\gamma}_{k+1} = \gamma_{k+1}$,
      else set $\bar{\gamma}_{k+1} = \bar{\gamma}_k$.
   endif

3. Compute the parameter vector $\theta_{k+1}$ as
$$\theta_{k+1} := \arg\max_{\theta \in \Theta} E_{\theta_k}\left[\frac{[S(H(X))]^k}{f(X, \theta_k)}\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}} \ln f(X, \theta)\right]. \qquad (3)$$

4. Set $k = k + 1$.

The MRAS$_0$ algorithm requires specification of a parameter $\rho$, which determines the proportion of samples that will be used to update the probabilistic model. At successive iterations of the algorithm, a sequence $\{\gamma_k, k = 1, 2, \ldots\}$, i.e., the $(1 - \rho)$-quantiles with respect to the sequence of p.d.f.'s/p.m.f.'s $\{f(\cdot, \theta_k)\}$, is calculated at step 1 of MRAS$_0$. These quantile values are then used in step 2 to construct a sequence of non-decreasing thresholds $\{\bar{\gamma}_k, k = 1, 2, \ldots\}$, and only those candidate solutions that have performances better than these thresholds will be used in the parameter updating (cf. (3)). As we will see, the theoretical convergence of MRAS$_0$ is unaffected by the value of the parameter $\rho$.
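A Monte Carlo sketch of the MRAS$_0$ loop in one dimension can illustrate the mechanics, assuming a normal sampling p.d.f., $S(y) = e^y$ (valid here because the illustrative objective is bounded above by 0), and sample-average versions of steps 1 through 3; all constants and function names are illustrative, not prescribed by the algorithm.

```python
import math
import random

def mras0_normal(H, mu=0.0, sigma=5.0, rho=0.1, eps=0.01,
                 n=200, iters=60, seed=1):
    """Monte Carlo sketch of MRAS_0 with a univariate normal model.

    Elite samples are weighted by [S(H(x))]^k / f(x, theta_k); for the
    normal family, the weighted mean/variance give the closed-form update.
    """
    rng = random.Random(seed)
    gamma_bar = -float("inf")
    for k in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        # Step 1: estimate the (1 - rho)-quantile of H under f(., theta_k).
        hs = sorted((H(x) for x in xs), reverse=True)
        gamma = hs[max(0, int(rho * n) - 1)]
        # Step 2: monotone threshold with minimum increment eps.
        if k == 0 or gamma >= gamma_bar + eps:
            gamma_bar = gamma
        # Step 3: weighted closed-form update of (mu, sigma) on the elites.
        elite = [x for x in xs if H(x) >= gamma_bar]
        if not elite:
            continue
        def f(x):  # current sampling density f(x, theta_k)
            z = (x - mu) / sigma
            return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
        w = [math.exp(k * H(x)) / f(x) for x in elite]
        tot = sum(w)
        mu = sum(wi * x for wi, x in zip(w, elite)) / tot
        var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, elite)) / tot
        sigma = max(0.05, math.sqrt(var))  # floor keeps sampling well defined
    return mu, sigma

# Maximize H(x) = -(x - 2)^2, optimum at x* = 2.
mu_star, sigma_star = mras0_normal(lambda x: -(x - 2.0) ** 2)
```

The mean drifts toward the optimizer while the variance shrinks, mirroring the degenerate limiting behavior established below for the normal family.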
We use $\rho$ in our approach to concentrate the computational effort on the set of elite/promising samples, a standard technique employed in most population-based approaches, such as GAs and EDAs. During the initialization step of MRAS$_0$, a small number $\varepsilon$ and a strictly increasing function $S(\cdot): \mathbb{R} \to \mathbb{R}^+$ are also specified. The function $S(\cdot)$ is used to account for cases where the values of $H(x)$ are negative for some $x$, and the parameter $\varepsilon$ ensures that each strict increment in the sequence $\{\bar{\gamma}_k\}$ is lower bounded,

i.e.,
$$\inf_{k:\, \bar{\gamma}_{k+1} \neq \bar{\gamma}_k} \left(\bar{\gamma}_{k+1} - \bar{\gamma}_k\right) \geq \varepsilon.$$
We require $\varepsilon$ to be strictly positive for continuous problems, and non-negative for discrete (finite) problems. In continuous domains, the division by $f(x, \theta_k)$ in the performance function in step 3 is well defined if $f(x, \theta_k)$ has infinite support (e.g., normal p.d.f.), whereas in discrete/combinatorial domains, the division is still valid as long as each point $x$ in the solution space has a positive probability of being sampled. Additional regularity conditions on $f(x, \theta_k)$ in Section 5 will ensure that step 3 of MRAS$_0$ can be used interchangeably with the following equation:
$$\theta_{k+1} = \arg\max_{\theta \in \Theta} \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x) \geq \bar{\gamma}_{k+1}\}} \ln f(x, \theta)\, \nu(dx).$$
The following lemma shows that there is a sequence of reference models $\{g_k(\cdot), k = 1, 2, \ldots\}$ implicit in MRAS$_0$, and that the parameter $\theta_{k+1}$ computed at step 3 indeed minimizes the KL-divergence $\mathcal{D}(g_{k+1}, f(\cdot, \theta))$.

Lemma 3.1 The parameter $\theta_{k+1}$ computed at the $k$th iteration of the MRAS$_0$ algorithm minimizes the KL-divergence $\mathcal{D}(g_{k+1}, f(\cdot, \theta))$, where
$$g_{k+1}(x) := \frac{S(H(x))\, I_{\{H(x) \geq \bar{\gamma}_{k+1}\}}\, g_k(x)}{E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]} \quad \forall x \in \mathcal{X},\ k = 1, 2, \ldots, \quad \text{and} \quad g_1(x) := \frac{I_{\{H(x) \geq \bar{\gamma}_1\}}}{E_{\theta_0}\left[I_{\{H(X) \geq \bar{\gamma}_1\}} / f(X, \theta_0)\right]}.$$

3.2 Global Convergence

Global convergence of the MRAS$_0$ algorithm clearly depends on the choice of the parameterized distribution family. Throughout this paper, we restrict our analysis and discussion to a particular family of distributions called the natural exponential family (NEF), for which the global convergence properties can be established. We start by stating the definition of NEF and some regularity conditions.

Definition 3.1 A parameterized family of p.d.f.'s/p.m.f.'s $\{f(\cdot, \theta), \theta \in \Theta \subseteq \mathbb{R}^m\}$ on $\mathcal{X}$ is said to belong to the natural exponential family (NEF) if there exist functions $h(\cdot): \mathbb{R}^n \to \mathbb{R}$, $\Gamma(\cdot): \mathbb{R}^n \to \mathbb{R}^m$, and $K(\cdot): \mathbb{R}^m \to \mathbb{R}$ such that
$$f(x, \theta) = \exp\left\{\theta^T \Gamma(x) - K(\theta)\right\} h(x), \quad \forall \theta \in \Theta, \qquad (4)$$
where $K(\theta) = \ln \int_{\mathcal{X}} \exp\{\theta^T \Gamma(x)\}\, h(x)\, \nu(dx)$, and the superscript $T$ denotes vector transposition.
For the case where $f(\cdot, \theta)$ is a p.d.f., we assume that $\Gamma(\cdot)$ is a continuous mapping. Many common p.d.f.'s/p.m.f.'s belong to the NEF, e.g., Gaussian, Poisson, binomial, geometric, and certain multivariate forms of them.

Assumptions:

A3. There exists a compact set $\Pi$ such that the level set $\{x : H(x) \geq \bar{\gamma}_1\} \cap \mathcal{X} \subseteq \Pi$, where $\bar{\gamma}_1 = \sup_l \{l : P_{\theta_0}(H(X) \geq l) \geq \rho\}$ is defined as in the MRAS$_0$ algorithm.

A4. The maximizer of equation (3) is an interior point of $\Theta$ for all $k$.

A5. $\sup_{\theta \in \Theta} \left\|\exp\{\theta^T \Gamma(x)\}\, \Gamma(x)\, h(x)\right\|$ is integrable/summable with respect to $x$, where $\theta$, $\Gamma(\cdot)$, and $h(\cdot)$ are defined as in Definition 3.1.

Assumption A3 restricts the search of the MRAS$_0$ algorithm to some compact set; it is satisfied if the function $H(\cdot)$ has compact level sets or the solution space $\mathcal{X}$ is compact. In actual implementation of the

algorithm, step 3 of MRAS$_0$ is often posed as an unconstrained optimization problem, i.e., $\Theta = \mathbb{R}^m$, in which case A4 is automatically satisfied. It is also easy to verify that A5 is satisfied by most NEFs.

To show the convergence of MRAS$_0$, we will need the following key observation.

Lemma 3.2 If assumptions A3–A5 hold, then we have
$$E_{\theta_{k+1}}[\Gamma(X)] = E_{g_{k+1}}[\Gamma(X)], \quad \forall k = 0, 1, \ldots,$$
where $E_{\theta_{k+1}}[\cdot]$ and $E_{g_{k+1}}[\cdot]$ denote the expectations taken with respect to $f(\cdot, \theta_{k+1})$ and $g_{k+1}(\cdot)$, respectively.

We have the following convergence result for the MRAS$_0$ algorithm.

Theorem 3.1 Let $\{\theta_k, k = 1, 2, \ldots\}$ be the sequence of parameters generated by MRAS$_0$. If $\varepsilon > 0$ and assumptions A1–A5 are satisfied, then
$$\lim_{k \to \infty} E_{\theta_k}[\Gamma(X)] = \Gamma(x^*), \qquad (5)$$
where the limit is component-wise.

Remark 1: The convergence result in Theorem 3.1 is much stronger than it appears to be. For example, when $\Gamma(x)$ is a one-to-one function (which is the case for many NEFs encountered in practice), the convergence result (5) can be equivalently written as $\Gamma^{-1}\left(\lim_{k \to \infty} E_{\theta_k}[\Gamma(X)]\right) = x^*$. Also note that for some particular p.d.f.'s/p.m.f.'s, the solution vector $x$ itself will be a component of $\Gamma(x)$ (e.g., multivariate normal distribution). Under these circumstances, we can interpret (5) as $\lim_{k \to \infty} E_{\theta_k}[X] = x^*$. Another special case of particular interest is when the components of the random vector $X = (X_1, \ldots, X_n)$ are independent, i.e., each has a univariate p.d.f./p.m.f. of the form
$$f(x_i, \vartheta_i) = \exp(x_i \vartheta_i - K(\vartheta_i))\, h(x_i), \quad \vartheta_i \in \mathbb{R},\ i = 1, \ldots, n.$$
In this case, since the distribution of the random vector $X$ is simply the product of the marginal distributions, we clearly have $\Gamma(x) = x$. Thus, (5) is again equivalent to $\lim_{k \to \infty} E_{\theta_k}[X] = x^*$, where $\theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k)$, and $\vartheta_i^k$ is the value of $\vartheta_i$ at the $k$th iteration.

In Lemma 3.2, we have already established a relationship between the reference models $\{g_k(\cdot)\}$ and the sequence of sampling distributions $\{f(\cdot, \theta_k)\}$. Therefore, proving Theorem 3.1
amounts to showing that $\lim_{k \to \infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*)$.

Proof of Theorem 3.1: Recall from Lemma 3.1 that $g_{k+1}(\cdot)$ can be expressed recursively as
$$g_{k+1}(x) := \frac{S(H(x))\, I_{\{H(x) \geq \bar{\gamma}_{k+1}\}}\, g_k(x)}{E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]} \quad \forall x \in \mathcal{X},\ k = 1, 2, \ldots.$$
Thus
$$E_{g_{k+1}}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right] = \frac{E_{g_k}\left[[S(H(X))]^2\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]}{E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]} \geq E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]. \qquad (6)$$
Since $\bar{\gamma}_k \leq H(x^*)\ \forall k$, and each strict increment in the sequence $\{\bar{\gamma}_k\}$ is lower bounded by the quantity $\varepsilon > 0$, there exists a finite $N$ such that $\bar{\gamma}_{k+1} = \bar{\gamma}_k\ \forall k \geq N$. Before we proceed any further, we need to distinguish between two cases, $\bar{\gamma}_N = H(x^*)$ and $\bar{\gamma}_N < H(x^*)$.

8 Case. If γ N = H(x ) (note that since ρ > 0, this could only happen when the solution space is discrete), then from the definition of g k+ ( ) (see Lemma 3.), we obviously have g k+ (x) = 0, x x, and g k+ (x ) = [S(H(x ))] k I {H(x)=H(x )} X [S(H(x))]k I {H(x)=H(x )}ν(dx) = k N. Hence it follows immediately that E gk+ [Γ(X)] = Γ(x ) k N. Case 2. If γ N < H(x ), then from (6), we have E gk+ [ S(H(X))I{H(X) γk+2 }] Egk [ S(H(X))I{H(X) γk+ }], k N, (7) i.e., the sequence { [ } E gk S(H(X))I{H(X) γk+ }], k =, 2,... converges. Now we show that the limit of the above sequence is S(H(x )). To do so, we proceed by contradiction and assume that S := lim E [ ] g k S(H(X))I{H(X) γk+ } < S := S(H(x )). (8) k Define the set A as { A := {x : H(x) γ N } x : S(H(x)) S + S 2 } X. Since S( ) is strictly increasing, its inverse S ( ) exists. Thus A can be reformulated as A = { { x : H(x) max γ N, S ( S + S 2 )}} X. And since γ N < H(x ), A has a strictly positive Lebesgue/discrete measure by A. Notice that g k ( ) can be rewritten as g k (x) = k i= S(H(x))I {H(x) γi+ } [ ] g (x). E gi S(H(X))I{H(X) γi+ } S(H(x))I {H(x) γk+ } Since lim k [ ] = S(H(x))I {H(x) γ N } E gk S(H(X))I S >, x A, we conclude that {H(X) γk+ } Thus, by Fatou s lemma, we have = lim inf g k (x)ν(dx) lim inf k k which is a contradiction. Hence, it follows that X lim g k(x) =, x A. k A g k (x)ν(dx) A lim inf k g k(x)ν(dx) =, lim E [ ] g k S(H(X))I{H(X) γk+ } = S. (9) k In order to show that lim k E gk [Γ(X)] = Γ(x ), we now bound the difference between E gk [Γ(X)] and 8

$\Gamma(x^*)$. Note that $\forall k \geq N$, we have
$$\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| = \left\|\int_{\mathcal{X}} \left(\Gamma(x) - \Gamma(x^*)\right) g_k(x)\, \nu(dx)\right\| \leq \int_{\mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx), \qquad (10)$$
where $\mathcal{C} := \{x : H(x) \geq \bar{\gamma}_N\} \cap \mathcal{X}$ is the support of $g_k(\cdot)$, $\forall k \geq N$. By the assumption on $\Gamma(\cdot)$ in Definition 3.1, for any given $\zeta > 0$, there exists a $\delta > 0$ such that $\|x - x^*\| < \delta$ implies $\|\Gamma(x) - \Gamma(x^*)\| < \zeta$. With $A_\delta$ defined from assumption A2, we have from (10),
$$\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| \leq \int_{A_\delta^c \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx) + \int_{A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx)$$
$$\leq \zeta + \int_{A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| g_k(x)\, \nu(dx), \quad \forall k \geq N. \qquad (11)$$
The rest of the proof amounts to showing that the second term in (11) is also bounded. Clearly the term $\|\Gamma(x) - \Gamma(x^*)\|$ is bounded on the set $A_\delta \cap \mathcal{C}$; we only need to find a bound for $g_k(x)$. By A2, we have
$$\sup_{x \in A_\delta \cap \mathcal{C}} H(x) \leq \sup_{x \in A_\delta} H(x) < H(x^*).$$
Define $S_\delta := S^* - S\left(\sup_{x \in A_\delta} H(x)\right)$. Since $S(\cdot)$ is strictly increasing, we have $S_\delta > 0$; thus, it follows that
$$S(H(x)) \leq S^* - S_\delta, \quad \forall x \in A_\delta \cap \mathcal{C}. \qquad (12)$$
On the other hand, from (7) and (9), there exists $\bar{N} \geq N$ such that $\forall k \geq \bar{N}$,
$$E_{g_k}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right] \geq S^* - \frac{S_\delta}{2}. \qquad (13)$$
Observe that $g_k(x)$ can be alternatively expressed as
$$g_k(x) = \prod_{i=\bar{N}}^{k-1} \frac{S(H(x))\, I_{\{H(x) \geq \bar{\gamma}_{i+1}\}}}{E_{g_i}\left[S(H(X))\, I_{\{H(X) \geq \bar{\gamma}_{i+1}\}}\right]}\, g_{\bar{N}}(x), \quad \forall k \geq \bar{N}.$$
Thus, it follows from (12) and (13) that
$$g_k(x) \leq \left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)^{k - \bar{N}} g_{\bar{N}}(x), \quad \forall x \in A_\delta \cap \mathcal{C},\ \forall k \geq \bar{N}.$$
Therefore,
$$\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| \leq \zeta + \sup_{x \in A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\| \left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)^{k - \bar{N}} \leq \left(1 + \sup_{x \in A_\delta \cap \mathcal{C}} \left\|\Gamma(x) - \Gamma(x^*)\right\|\right) \zeta, \quad \forall k \geq \hat{N},$$
where $\hat{N}$ is given by $\hat{N} := \max\left\{N,\ \bar{N} + \ln \zeta \big/ \ln\left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)\right\}$.

And since $\zeta$ is arbitrary, we have
$$\lim_{k \to \infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*).$$
Finally, the proof is completed by applying Lemma 3.2 to both Case 1 and Case 2.

Remark 2: Note that, as mentioned in Section 2, for problems with finite solution spaces, assumptions A1 and A2 are automatically satisfied. Furthermore, if we take the input parameter $\varepsilon = 0$, then step 2 of MRAS$_0$ is equivalent to $\bar{\gamma}_{k+1} = \max_{i \leq k+1} \gamma_i$. Thus, $\{\bar{\gamma}_k\}$ is non-decreasing and each strict increment in the sequence is bounded from below by
$$\min_{\substack{x, y \in \mathcal{X} \\ H(x) \neq H(y)}} \left|H(x) - H(y)\right|.$$
Therefore, the $\varepsilon > 0$ assumption in Theorem 3.1 can be relaxed to $\varepsilon \geq 0$.

We now address some of the special cases discussed in Remark 1.

Corollary 3.1 (Multivariate Normal) For continuous optimization problems in $\mathbb{R}^n$, if multivariate normal p.d.f.'s are used in MRAS$_0$, i.e.,
$$f(x, \theta_k) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_k|}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right), \qquad (14)$$
where $\theta_k := (\mu_k; \Sigma_k)$, $\varepsilon > 0$, and assumptions A1–A4 are satisfied, then
$$\lim_{k \to \infty} \mu_k = x^*, \quad \text{and} \quad \lim_{k \to \infty} \Sigma_k = 0_{n \times n},$$
where $0_{n \times n}$ represents an $n$-by-$n$ zero matrix.

Proof: By Lemma 3.2, it is easy to show that
$$\mu_{k+1} = E_{g_{k+1}}[X] \quad \text{and} \quad \Sigma_{k+1} = E_{g_{k+1}}\left[(X - \mu_{k+1})(X - \mu_{k+1})^T\right], \quad k = 0, 1, \ldots.$$
The rest of the proof amounts to showing that
$$\lim_{k \to \infty} E_{g_k}[X] = x^*, \quad \text{and} \quad \lim_{k \to \infty} E_{g_k}\left[(X - \mu_k)(X - \mu_k)^T\right] = 0_{n \times n},$$
which is the same as the proof of Theorem 3.1.

Remark 3: Corollary 3.1 shows that in the multivariate normal case, the sequence of parameterized p.d.f.'s will converge to a degenerate p.d.f. concentrated only on the optimal solution. In this case the parameters are updated as
$$\mu_{k+1} = \frac{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\, X\right]}{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]}, \qquad (15)$$
and
$$\Sigma_{k+1} = \frac{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\, (X - \mu_{k+1})(X - \mu_{k+1})^T\right]}{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X, \theta_k)\right\} I_{\{H(X) \geq \bar{\gamma}_{k+1}\}}\right]}, \qquad (16)$$
where $f(x, \theta_k)$ is given by (14). Note that when the solution space $\mathcal{X}$ is a (simple) constrained region in $\mathbb{R}^n$, one straightforward approach is to use the acceptance-rejection method (cf. e.g., Kroese et al. 2004).
And it is easy to verify that the parameter updating rules remain the same.
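A minimal sketch of the acceptance-rejection approach mentioned above, assuming a one-dimensional normal model truncated to an illustrative region X = [0, 4]; the function name and constants are illustrative.

```python
import random

def sample_constrained_normal(mu, sigma, in_X, rng, max_tries=10000):
    """Acceptance-rejection: draw from N(mu, sigma^2) conditioned on X.

    in_X(x) -> bool is the membership test for the constrained region;
    accepted draws follow the normal density truncated to the region, and
    the parameter updating rules are applied to them unchanged.
    """
    for _ in range(max_tries):
        x = rng.gauss(mu, sigma)
        if in_X(x):
            return x
    raise RuntimeError("acceptance rate too low")

rng = random.Random(7)
draws = [sample_constrained_normal(1.0, 2.0, lambda x: 0.0 <= x <= 4.0, rng)
         for _ in range(2000)]
```

This is practical only when the region carries non-negligible probability under the current normal; otherwise the acceptance rate collapses and a different sampler is needed.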

Corollary 3.2 (Independent Univariate) If the components of the random vector $X = (X_1, \ldots, X_n)$ are independent, each has a univariate p.d.f./p.m.f. of the form
$$f(x_i, \vartheta_i) = \exp(x_i \vartheta_i - K(\vartheta_i))\, h(x_i), \quad \vartheta_i \in \mathbb{R},\ i = 1, \ldots, n,$$
$\varepsilon > 0$, and A1–A5 are satisfied, then
$$\lim_{k \to \infty} E_{\theta_k}[X] = x^*, \quad \text{where } \theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k).$$

4 An Alternative View of the Cross-Entropy Method

In this section, we give an alternative interpretation of the CE method for optimization and discuss its similarities and differences with the MRAS$_0$ algorithm. Specifically, we show that the CE method can also be viewed as a search strategy guided by a sequence of reference models. From this particular point of view, we establish some important properties of the CE method. The deterministic version of the CE method for solving (1) can be summarized as follows.

Algorithm CE$_0$: Deterministic Version of the CE Method

1. Choose the initial p.d.f./p.m.f. $f(\cdot, \theta_0)$, $\theta_0 \in \Theta$. Specify the parameter $\rho \in (0, 1]$ and a non-decreasing function $\varphi(\cdot): \mathbb{R} \to \mathbb{R}^+ \cup \{0\}$. Set $k = 0$.

2. Calculate the $(1 - \rho)$-quantile $\gamma_{k+1}$ as
$$\gamma_{k+1} := \sup_l \{l : P_{\theta_k}(H(X) \geq l) \geq \rho\}.$$

3. Compute the new parameter
$$\theta_{k+1} := \arg\max_{\theta \in \Theta} E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X) \geq \gamma_{k+1}\}} \ln f(X, \theta)\right].$$

4. If a specified stopping rule is satisfied, then terminate; otherwise set $k = k + 1$ and go to Step 2.

In CE$_0$, choosing $\varphi(H(x)) = 1$ gives the standard CE method, whereas choosing $\varphi(H(x)) = H(x)$ (if $H(x) \geq 0\ \forall x \in \mathcal{X}$) gives an extended version of the standard CE method (cf. e.g., De Boer et al. 2005). One resemblance between CE and MRAS$_0$ is the use of the parameter $\rho$ and the $(1 - \rho)$-quantile in both algorithms. However, the fundamental difference is that in CE, the problem of estimating the optimal value of the parameter is broken down into a sequence of simple estimation problems, in which the parameter $\rho$ assumes a crucial role.
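The CE$_0$ recursion above can be sketched in Monte Carlo form for an independent Bernoulli model on binary strings; the $\alpha$-smoothing of the parameter vector is a common practical safeguard against premature degeneration, not part of the deterministic description above, and all constants are illustrative.

```python
import random

def ce_bernoulli(score, n_bits, rho=0.1, pop=100, iters=50,
                 alpha=0.9, seed=3):
    """Monte Carlo sketch of the standard CE method (CE_0 with phi = 1)
    under an independent Bernoulli model p = (p_1, ..., p_n).

    The KL projection for this family is the component-wise mean of the
    elite samples (those at or above the estimated (1 - rho)-quantile).
    """
    rng = random.Random(seed)
    p = [0.5] * n_bits
    n_elite = max(1, int(rho * pop))
    for _ in range(iters):
        xs = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop)]
        xs.sort(key=score, reverse=True)
        elite = xs[:n_elite]
        means = [sum(x[i] for x in elite) / n_elite for i in range(n_bits)]
        p = [alpha * m + (1 - alpha) * pi for m, pi in zip(means, p)]
    return p

# OneMax: H(x) = number of ones; the model should degenerate toward 1...1.
p_final = ce_bernoulli(lambda x: sum(x), n_bits=20)
```

On this easy separable objective the parameter vector concentrates near the optimum; the examples below show that on other problems the outcome depends critically on $\rho$.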
Since a small change in the value of $\rho$ may disturb the whole estimation process and affect the quality of the resulting estimates, the convergence of CE cannot always be guaranteed unless the value of $\rho$ is chosen sufficiently small (cf. De Boer et al. 2005, Homem-de-Mello 2004; see also Example 4.1 below), whereas the theoretical convergence of MRAS$_0$ is unaffected by the parameter $\rho$. The following lemma provides a unified view of MRAS and CE; it shows that by appropriately defining a sequence of implicit reference models $\{g^{ce}_k(\cdot) : k = 1, 2, \ldots\}$, the CE method can be recovered, and the parameter updating in CE is guided by this sequence of models.

Lemma 4.1 The parameter $\theta_{k+1}$ computed at the $k$th iteration of the CE$_0$ algorithm minimizes the KL-divergence $\mathcal{D}\left(g^{ce}_{k+1}, f(\cdot, \theta)\right)$, where
$$g^{ce}_{k+1}(x) := \frac{\varphi(H(x))\, I_{\{H(x) \geq \gamma_{k+1}\}}\, f(x, \theta_k)}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X) \geq \gamma_{k+1}\}}\right]} \quad \forall x \in \mathcal{X},\ k = 0, 1, \ldots. \qquad (17)$$

12 Proof: Similar to the proof of Lemma 3.. The key observation to note is that in contrast to MRAS 0, the sequence of reference models in CE depends explicitly on the family of parameterized p.d.f s/p.m.f s {f(, θ k )} used. Since gk+ ce ( ) is obtained by tilting f(, θ k ) with the performance function, it improves the expected performance in the sense that [ [ ] E θk (ϕ(h(x))i{h(x) γk+ }) 2] [ E g ce ϕ(h(x))i{h(x) γk+} k+ = [ ] E θk ϕ(h(x))i{h(x) γk+ }]. E θk ϕ(h(x))i{h(x) γk+ } Thus, it is reasonable to expect that the projection of g ce k+ ( ) on {f(, θ) : θ Θ} (i.e., f(, θ k+)) also improves the expected performance. This result is formalized in the following theorem. Theorem 4. For the CE 0 algorithm, we have E θk+ [ ϕ(h(x))i{h(x) γk+ }] Eθk [ ϕ(h(x))i{h(x) γk+ }], k = 0,,.... In the standard CE method, Theorem 4. implies the monotonicity of the sequence {γ k : k =, 2,...}. Lemma 4.2 For the standard CE method (i.e., CE 0 with ϕ(h(x)) = ), we have γ k+2 γ k+, k = 0,,.... Proof: By Theorem 4., we have E θk+ [I {H(X) γk+ }] E θk [I {H(X) γk+ }], i.e., P θk+ (H(X) γ k+ ) P θk (H(X) γ k+ ) ρ. The result follows by the definition of γ k+2 (See Step 2 of the CE 0 algorithm). Note that since γ k H(x ) for all k, Lemma 4.2 implies that the sequence {γ k : k =, 2,...} generated by the standard CE method converges. However, depending on the p.d.f s/p.m.f s and the parameter ρ used, the sequence {γ k } may not converge to H(x ) or even to a small neighborhood of H(x ) (cf. Examples 4. and 4.2 below). Similar to MRAS 0 (cf. Lemma 3.2), when f(, θ) belongs to the natural exponential families, the following lemma relates the sequence {f(, θ k ), k =, 2,...} to the sequence of reference models {gk ce ( ) : k =, 2,...}. Lemma 4.3 Assume that: Then. There exists a compact set Π such that the level set {x : H(x) γ k } X Π for all k =, 2,..., where γ k = sup l {l : P θk (H(X) l) ρ} is defined as in the CE 0 algorithm. 2. 
The parameter θ_{k+1} computed at Step 3 of the CE₀ algorithm is an interior point of Θ for all k.

3. Assumption A5 is satisfied.

Then

$$E_{\theta_{k+1}}[\Gamma(X)] = E_{g^{ce}_{k+1}}[\Gamma(X)], \quad \forall k = 0, 1, \ldots.$$

The above lemma indicates that the behavior of the sequence of p.d.f.'s/p.m.f.'s {f(·, θ_k)} is closely related to the properties of the sequence of reference models. To understand this, consider the particular case where Γ(x) = x. If the CE method converges to the optimal solution in the sense that lim_{k→∞} E_{θ_k}[H(X)] = H(x*), then we must have lim_{k→∞} E_{θ_k}[X] = x*, since H(x) < H(x*) ∀x ≠ x*. Thus, by Lemma 4.3, a necessary

condition for this convergence is lim_{k→∞} E_{g^{ce}_k}[X] = x*. However, unlike MRAS₀, where the convergence of the sequence of reference models to an optimal degenerate distribution is guaranteed, the convergence of the sequence {g^ce_k(·) : k = 1, 2, ...} relies on the choice of the family of distributions {f(·, θ)} and the value of the parameter ρ used (cf. (17)). We now illustrate this issue with two simple examples.

Example 4.1 (The Standard CE Method) Consider maximizing the function H(x) given by

$$H(x) = \begin{cases} 0 & x \in \{(0,1), (1,0)\}, \\ 1 & x = (0,0), \\ a & x = (1,1), \end{cases} \tag{18}$$

where a > 1, and x := (x_1, x_2) ∈ X := {(0,0), (0,1), (1,0), (1,1)}. If we take 0.25 < ρ ≤ 0.5 and an initial p.m.f. f(x, θ_0) = p_0^{x_1}(1 − p_0)^{1−x_1} q_0^{x_2}(1 − q_0)^{1−x_2} with θ_0 = (p_0, q_0) = (0.5, 0.5), then since P_{θ_0}(x ∈ {(0,0), (1,1)}) = 0.5 ≥ ρ, we have γ_1 = 1. It is also straightforward to see that

$$g^{ce}_1(x) = \begin{cases} 0.5 & x = (0,0) \text{ or } (1,1), \\ 0 & \text{otherwise}, \end{cases}$$

and the parameter θ_1 computed at Step 3 (with ϕ(H(x)) ≡ 1) of CE₀ is given by θ_1 = (0.5, 0.5). Proceeding iteratively, we have γ_k = 1 and g^ce_k(x) = g^ce_1(x) ∀k = 1, 2, ..., i.e., the algorithm does not converge to a degenerate distribution at the optimal solution. On the other hand, if we choose ρ ≤ 0.25, then it turns out that γ_k = a and

$$g^{ce}_k(x) = \begin{cases} 1 & x = (1,1), \\ 0 & \text{otherwise}, \end{cases}$$

for all k = 1, 2, ..., which means the algorithm converges to the optimum.

Example 4.2 (The Extended Version of the CE Method) Consider solving problem (18) by CE₀ with the performance function ϕ(H(x)) = H(x). We use the same family of p.m.f.'s as in Example 4.1 with the initial parameter θ_0 = (1/(1+a), 1/(1+a)). If the values of ρ are chosen from the interval (a/(1+a)², 1/(1+a)), then we have θ_k = (1/(1+a), 1/(1+a)), γ_k = 1, and

$$g^{ce}_k(x) = \begin{cases} \dfrac{a}{1+a} & x = (0,0), \\[4pt] \dfrac{1}{1+a} & x = (1,1), \\[4pt] 0 & \text{otherwise}, \end{cases}$$

for all k = 1, 2, .... On the other hand, if we choose ρ = 0.5 and θ_0 = (0.5, 0.5), then it is easy to verify that lim_{k→∞} γ_k = a and

$$\lim_{k\to\infty} g^{ce}_k(x) = \begin{cases} 1 & x = (1,1), \\ 0 & \text{otherwise}. \end{cases}$$
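The first example above can be replayed exactly, with no sampling, since the p.m.f. has only four support points. The sketch below is illustrative only (the choice a = 3 and the 20-iteration budget are arbitrary assumptions): it iterates the deterministic standard CE recursion and exhibits both behaviors, with ρ = 0.5 stalling at the non-degenerate fixed point and ρ = 0.2 ≤ 0.25 locking onto the optimum (1, 1).

```python
import numpy as np

# Exact (sampling-free) iteration of standard CE (phi = 1) on the p.m.f. of
# the example above, illustrating the role of rho.  Sketch for a = 3.
a = 3.0
X = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
H = np.array([1.0, 0.0, 0.0, a])

def ce_iteration(theta, rho):
    p, q = theta
    f = np.array([(p**x)*((1-p)**(1-x))*(q**y)*((1-q)**(1-y)) for x, y in X])
    # (1 - rho)-quantile: the largest level l with P_theta(H >= l) >= rho
    levels = np.sort(np.unique(H))[::-1]
    gamma = next(l for l in levels if f[H >= l].sum() >= rho)
    g = f * (H >= gamma); g /= g.sum()        # reference model g_{k+1}
    return g @ X, gamma                       # theta_{k+1} = E_g[X]

theta = np.array([0.5, 0.5])
for _ in range(20):
    theta, gamma = ce_iteration(theta, rho=0.5)
print(theta, gamma)                           # stuck: theta = (0.5, 0.5), gamma = 1

theta = np.array([0.5, 0.5])
for _ in range(20):
    theta, gamma = ce_iteration(theta, rho=0.2)
print(theta, gamma)                           # converges: theta = (1, 1), gamma = a
```

The first run reproduces the non-degenerate fixed point of Example 4.1; the second degenerates onto the optimal solution.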
5 The MRAS₁ Algorithm (Monte Carlo Version)

The MRAS₀ algorithm describes the idealized situation where quantile values and expectations can be evaluated exactly. In practice, we will usually resort to its stochastic counterpart, in which only a finite number of samples are used and expected values are replaced by their corresponding sample averages. For

example, Step 3 of MRAS₀ will be replaced with

$$\bar\theta_{k+1} = \arg\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \frac{[S(H(X_i))]^k}{f(X_i, \bar\theta_k)}\, I_{\{H(X_i) \ge \bar\gamma_{k+1}\}} \ln f(X_i, \theta), \tag{19}$$

where X_1, ..., X_N are i.i.d. random samples generated from f(·, θ̄_k), θ̄_k is the estimated parameter vector computed at the previous iteration, and γ̄_{k+1} is a threshold determined by the sample (1 − ρ)-quantile of H(X_1), ..., H(X_N). However, theoretical convergence can no longer be guaranteed for this simple stochastic counterpart of MRAS₀. In particular, the set {x : H(x) ≥ γ̄_{k+1}} involved in (19) may be empty, since all the random samples generated at the current iteration may be much worse than those generated at the previous iteration. Thus, we can only expect the algorithm to converge if the expected values in the MRAS₀ algorithm are closely approximated. Obviously, the quality of the approximation will depend on the number of samples used in the simulation, but it is difficult to determine in advance the appropriate number of samples. A sample size that is too small will cause the algorithm to fail to converge and result in poor quality solutions, whereas a sample size that is too large may lead to high computational cost. As mentioned earlier, the parameter ρ will, to some extent, affect the performance of the algorithm. Large values of ρ mean that almost all samples generated, regardless of their performance, will be used to update the probabilistic model, which could slow down the convergence process. On the other hand, since a good estimate necessarily requires a reasonable number of valid samples, the quantity ρN (i.e., the approximate number of samples used in parameter updating) cannot be too small. Thus, small values of ρ will require a large number of samples to be generated at each iteration and may result in significant simulation effort.
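For a concrete instance of the sample-average update above: when {f(·, θ)} is the univariate normal family, the arg max is available in closed form as a weighted maximum-likelihood estimate (a weighted sample mean and variance). The sketch below is illustrative only; the objective H, the choice S(h) = e^h, the power k, and all parameter values are assumptions, not settings from the paper's experiments.

```python
import numpy as np

# One Monte Carlo update of the tilted-likelihood form above for a univariate
# normal family f(., theta) = N(mu, sigma^2).  For NEFs the arg max is a
# weighted MLE, i.e. a weighted sample mean/variance.
rng = np.random.default_rng(1)
mu, sigma = 0.0, 5.0                          # current parameter theta_k (assumed)
S = lambda h: np.exp(h)                       # strictly increasing S : R -> R+ (assumed)
Hfun = lambda x: -x**2 + 2*x                  # toy objective, maximized at x = 1
N, rho, k = 1000, 0.1, 3                      # assumed sample size, quantile, iteration

X = rng.normal(mu, sigma, N)
Hvals = Hfun(X)
gamma = np.quantile(Hvals, 1 - rho)           # sample (1 - rho)-quantile threshold
fvals = np.exp(-(X - mu)**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))
w = S(Hvals)**k / fvals * (Hvals >= gamma)    # tilting weights of the update

mu_new = np.sum(w * X) / np.sum(w)            # weighted MLE = closed-form arg max
sigma_new = np.sqrt(np.sum(w * (X - mu_new)**2) / np.sum(w))
print(mu_new, sigma_new)                      # mu_new near 1, sigma_new shrunk
```

One update already pulls the mean toward the maximizer x = 1 and contracts the standard deviation, which is the qualitative behavior the convergence analysis formalizes.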
For a given problem, although it is clear that we should avoid values of ρ that are either too close to 1 or too close to 0, determining a priori which ρ gives satisfactory performance may be difficult. To address the above difficulties, we adopt the same idea as in Homem-de-Mello (2004) and propose a modified Monte Carlo version of MRAS₀ in which the sample size N is adaptively increasing and the parameter ρ is adaptively decreasing.

5.1 Algorithm Description

Roughly speaking, the MRAS₁ algorithm is essentially a Monte Carlo version of MRAS₀, except that the parameter ρ and the sample size N may change from one iteration to another. The rate of increase in the sample size is controlled by an extra parameter α > 1, specified during the initialization step. For example, if the initial sample size is N_0, then after k increments, the sample size will be approximately α^k N_0. At each iteration k, random samples are drawn from the density/mass function f̃(·, θ̄_k), which is a mixture of the initial density/mass f(·, θ_0) and the density/mass f(·, θ̄_k) calculated from the previous iteration (cf., e.g., Auer et al. for a similar idea in the context of multiarmed bandit models). We assume that f(·, θ_0) satisfies the following condition:

Assumption A3. There exists a compact set Π_ε such that {x : H(x) ≥ H(x*) − ε} ∩ X ⊆ Π_ε. Moreover, the initial density/mass function f(x, θ_0) is bounded away from zero on Π_ε, i.e., f_* := inf_{x∈Π_ε} f(x, θ_0) > 0.

In practice, the initial density f(·, θ_0) can be chosen according to some prior knowledge of the problem structure (cf. Section 6.2); however, if nothing is known about where the good solutions are, this density should be chosen in such a way that each region of the solution space has an (approximately) equal probability of being sampled. For instance, when X is finite, one simple choice of f(·, θ_0) is the uniform distribution.
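The mixture sampling just described is easy to implement: flip a λ-coin per sample to decide whether the draw comes from the fixed initial density or the current one. A minimal sketch, assuming a univariate normal family and arbitrary parameter values:

```python
import numpy as np

# Minimal sketch of the mixture sampling used by MRAS_1: with probability
# lambda draw from the fixed initial density f(., theta_0), otherwise from the
# current density f(., theta_k).  The normal family and all values are assumed.
rng = np.random.default_rng(2)

def sample_mixture(n, lam, theta_k, theta_0):
    """Draw n samples from (1 - lam) f(., theta_k) + lam f(., theta_0)."""
    mu_k, sig_k = theta_k
    mu_0, sig_0 = theta_0
    from_init = rng.random(n) < lam           # Bernoulli(lambda) component picks
    x = rng.normal(mu_k, sig_k, n)
    x[from_init] = rng.normal(mu_0, sig_0, from_init.sum())
    return x

x = sample_mixture(10000, lam=0.1, theta_k=(3.0, 0.5), theta_0=(0.0, 10.0))
print(x.mean())                               # close to 0.9*3 + 0.1*0 = 2.7
```

Even when the current density has nearly collapsed (small sig_k), the λ-fraction of wide initial-density draws keeps every region of the space reachable, which is exactly the exploration role the text assigns to the mixture.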
Intuitively, mixing in the initial density forces the algorithm to explore the entire solution space and to maintain a global perspective during the search process. Also note that if λ = 1, then random samples will always be drawn from the initial density, in which case MRAS₁ becomes a pure random sampling

approach.

Algorithm MRAS₁ (Monte Carlo version)

Initialization: Specify ρ_0 ∈ (0, 1], an initial sample size N_0 > 1, ε ≥ 0, α > 1, a mixing coefficient λ ∈ (0, 1], a strictly increasing function S(·) : ℝ → ℝ⁺, and an initial p.d.f. f(x, θ_0) > 0 ∀x ∈ X. Set θ̄_0 ← θ_0, k ← 0.

Repeat until a specified stopping rule is satisfied:

1. Generate N_k i.i.d. samples X_1^k, ..., X_{N_k}^k according to f̃(·, θ̄_k) := (1 − λ) f(·, θ̄_k) + λ f(·, θ_0).

2. Compute the sample (1 − ρ_k)-quantile γ̄_{k+1}(ρ_k, N_k) := H_(⌈(1−ρ_k)N_k⌉), where ⌈a⌉ is the smallest integer greater than or equal to a, and H_(i) is the ith order statistic of the sequence {H(X_i^k), i = 1, ..., N_k}.

3. If k = 0 or γ̄_{k+1}(ρ_k, N_k) ≥ γ̄_k + ε/2, then
3a. set γ̄_{k+1} ← γ̄_{k+1}(ρ_k, N_k), ρ_{k+1} ← ρ_k, N_{k+1} ← N_k;
else, find the largest ρ̄ ∈ (0, ρ_k) such that γ̄_{k+1}(ρ̄, N_k) ≥ γ̄_k + ε/2.
3b. If such a ρ̄ exists, then set γ̄_{k+1} ← γ̄_{k+1}(ρ̄, N_k), ρ_{k+1} ← ρ̄, N_{k+1} ← N_k;
3c. else (if no such ρ̄ exists), set γ̄_{k+1} ← γ̄_k, ρ_{k+1} ← ρ_k, N_{k+1} ← ⌈αN_k⌉.
endif

4. Compute θ̄_{k+1} as

$$\bar\theta_{k+1} = \arg\max_{\theta \in \Theta} \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{[S(H(X_i^k))]^k}{\tilde f(X_i^k, \bar\theta_k)}\, I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}} \ln f(X_i^k, \theta). \tag{20}$$

5. Set k ← k + 1.

At Step 2, the sample (1 − ρ_k)-quantile γ̄_{k+1} is calculated by first ordering the sample performances H(X_i^k), i = 1, ..., N_k, from smallest to largest, H_(1) ≤ H_(2) ≤ ··· ≤ H_(N_k), and then taking the ⌈(1 − ρ_k)N_k⌉th order statistic. We write γ̄_{k+1}(ρ_k, N_k) to emphasize the dependence of γ̄_{k+1} on both ρ_k and N_k, so that different sample quantile values used during one iteration can be distinguished by their arguments. Step 3 of MRAS₁ is used to extract a sequence of non-decreasing thresholds {γ̄_k, k = 1, 2, ...} from the sequence of sample quantiles, and to determine the appropriate values of ρ_{k+1} and N_{k+1} to be used in subsequent iterations. This step is carried out as follows. At each iteration k, we first check whether the inequality γ̄_{k+1}(ρ_k, N_k) ≥ γ̄_k + ε/2 is satisfied, where γ̄_k is the threshold value used in the previous iteration.
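The Step 3 bookkeeping just described can be sketched directly in code. This is an illustrative reimplementation, not the authors' code; note that on a finite sample the quantile only changes at the grid ρ̄ ∈ {j/N}, so the search for the largest feasible ρ̄ reduces to a scan over that grid.

```python
import math

def sample_quantile(hvals, rho):
    """H_(ceil((1 - rho) * N)): the sample (1 - rho)-quantile."""
    h = sorted(hvals)
    return h[math.ceil((1 - rho) * len(h)) - 1]

def step3(hvals, gamma_prev, rho, N, k, eps, alpha):
    cand = sample_quantile(hvals, rho)
    if k == 0 or cand >= gamma_prev + eps / 2:            # Step 3a
        return cand, rho, N
    # Step 3b: largest rho_bar < rho restoring the eps/2 increase; scan the
    # grid rho_bar = j/N, the only points where the sample quantile changes.
    for j in range(math.ceil(rho * N) - 1, 0, -1):
        rho_bar = j / N
        cand = sample_quantile(hvals, rho_bar)
        if cand >= gamma_prev + eps / 2:
            return cand, rho_bar, N
    # Step 3c: keep the old threshold and rho, grow the sample size by alpha.
    return gamma_prev, rho, math.ceil(alpha * N)

hvals = [0.1, 0.4, 0.5, 0.9, 1.2]
print(step3(hvals, gamma_prev=0.8, rho=0.4, N=5, k=1, eps=0.2, alpha=1.5))
# -> (0.9, 0.2, 5): shrinking rho restored the required eps/2 increase
```

With a higher previous threshold (e.g. `gamma_prev=1.5`) no ρ̄ works, and the sketch falls through to Step 3c, returning the old threshold with an enlarged sample size.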
If the inequality holds, it means that both the current value of ρ_k and the current sample size N_k are satisfactory; thus we proceed to Step 3a and update the parameter vector θ̄_{k+1} in Step 4 by using γ̄_{k+1}(ρ_k, N_k). Otherwise, it indicates that either ρ_k is too large or the sample size N_k is too small. To determine which, we fix the sample size and check whether there exists a smaller ρ̄ < ρ_k such that the above inequality is satisfied with the new sample (1 − ρ̄)-quantile. If such a ρ̄ does exist, then the current sample size is still deemed acceptable, and we only need to decrease the value of ρ_k. Accordingly, the parameter vector is updated in Step 4 by using the sample (1 − ρ̄)-quantile. On the other hand, if no such ρ̄ can be found, then the parameter vector is updated by using the threshold γ̄_k calculated during the previous iteration, and the sample size is increased by a factor of α.

We make the following assumption about the parameter vector θ̄_{k+1} computed at Step 4:

Assumption A4. The parameter vector θ̄_{k+1} computed at Step 4 of MRAS₁ is an interior point of Θ for all k.

It is important to note that the set {x : H(x) ≥ γ̄_{k+1}, x ∈ {X_1^k, ..., X_{N_k}^k}} could be empty if Step 3c is visited. If this happens, the right-hand side of (20) will be equal to zero, so any θ ∈ Θ is a maximizer, and

we define θ̄_{k+1} := θ̄_k in this case.

5.2 Global Convergence

In this section, we discuss the convergence properties of the MRAS₁ algorithm for natural exponential families (NEFs). To be specific, we will explore the relations between MRAS₁ and MRAS₀ and show that, with high probability, the gaps (e.g., approximation errors incurred by replacing expected values with sample averages) between the two algorithms can be made small enough that the convergence analysis of MRAS₁ can be ascribed to the convergence analysis of the MRAS₀ algorithm; thus, our analysis relies heavily on the results obtained in Section 3.2. Throughout this section, we denote by P_{θ̄_k}(·) and E_{θ̄_k}[·] the respective probability and expectation taken with respect to the p.d.f./p.m.f. f(·, θ̄_k), and by P̃_{θ̄_k}(·) and Ẽ_{θ̄_k}[·] the respective probability and expectation taken with respect to f̃(·, θ̄_k). Note that since the sequence {θ̄_k} results from random samples generated at each iteration of MRAS₁, these quantities are also random.

Let g̃_{k+1}(·), k = 0, 1, ..., be defined by

$$\tilde g_{k+1}(x) := \begin{cases} \dfrac{\left[[S(H(x))]^k / \tilde f(x, \bar\theta_k)\right] I_{\{H(x) \ge \bar\gamma_{k+1}\}}}{\sum_{i=1}^{N_k} \left[[S(H(X_i^k))]^k / \tilde f(X_i^k, \bar\theta_k)\right] I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}}} & \text{if } \left\{x : H(x) \ge \bar\gamma_{k+1},\ x \in \{X_1^k, \ldots, X_{N_k}^k\}\right\} \ne \emptyset, \\[8pt] \tilde g_k(x) & \text{otherwise}, \end{cases} \tag{21}$$

where γ̄_{k+1} is given by

$$\bar\gamma_{k+1} := \begin{cases} \bar\gamma_{k+1}(\rho_k, N_k) & \text{if Step 3a is visited}, \\ \bar\gamma_{k+1}(\bar\rho, N_k) & \text{if Step 3b is visited}, \\ \bar\gamma_k & \text{if Step 3c is visited}. \end{cases}$$

Similar to Lemma 3.2, the following lemma shows the connection between f(·, θ̄_{k+1}) and g̃_{k+1}(·).

Lemma 5.1 If Assumptions A4 and A5 hold, then the parameter θ̄_{k+1} computed at Step 4 of MRAS₁ satisfies

$$E_{\bar\theta_{k+1}}[\Gamma(X)] = E_{\tilde g_{k+1}}[\Gamma(X)], \quad \forall k = 0, 1, \ldots.$$

Note that the region {x : H(x) ≥ γ̄_{k+1}} becomes smaller and smaller as γ̄_{k+1} increases. Lemma 5.1 shows that the sequence of sampling p.d.f.'s/p.m.f.'s {f(·, θ̄_{k+1})} is adapted to this sequence of shrinking regions. For example, consider the case where {x : H(x) ≥ γ̄_{k+1}} is convex and Γ(x) = x. Since E_{g̃_{k+1}}[X] is a convex combination of X_1^k, ..., X_{N_k}^k, the lemma implies that E_{θ̄_{k+1}}[X] ∈ {x : H(x) ≥ γ̄_{k+1}}.
Thus, it is natural to expect that the random samples generated at the next iteration will fall in the region {x : H(x) ≥ γ̄_{k+1}} with large probability (e.g., consider the normal p.d.f., whose mode is equal to its mean). In contrast, if we use a fixed sampling distribution for all iterations, as in pure random sampling (i.e., the λ = 1 case), then sampling from this sequence of shrinking regions could become a substantially difficult problem in practice. Next, we present a useful lemma, which shows the convergence of the quantile estimates when random samples are generated from a sequence of different distributions.

Lemma 5.2 For any given ρ ∈ (0, 1), let γ_k* denote the set of true (1 − ρ)-quantiles of H(X) with respect to the p.d.f./p.m.f. f̃(·, θ̄_k), and let γ̄_k(ρ, N_k) be the corresponding sample quantile of H(X_1^k), ..., H(X_{N_k}^k), where f̃(·, θ̄_k) and N_k are defined as in MRAS₁, and X_1^k, ..., X_{N_k}^k are i.i.d. with common density f̃(·, θ̄_k). Then the distance from γ̄_k(ρ, N_k) to γ_k* tends to zero as k → ∞ w.p.1.

We are now ready to state the main theorem.

Theorem 5.1 Let ε > 0, and define the ε-optimal set O_ε := {x : H(x) ≥ H(x*) − ε} ∩ X. If Assumptions A1, A3, A4, and A5 are satisfied, then there exists a random variable K such that w.p.1, K < ∞, and

1. γ̄_k > H(x*) − ε, ∀k ≥ K;

2. E_{θ̄_{k+1}}[Γ(X)] ∈ CONV{Γ(O_ε)}, ∀k ≥ K, where CONV{Γ(O_ε)} denotes the convex hull of the set Γ(O_ε).

Furthermore, let β be a positive constant satisfying the condition that the set {x : S(H(x)) ≥ 1/β} has a strictly positive Lebesgue/counting measure. If Assumptions A1, A2, A3, A4, and A5 are all satisfied and α > (βS*)², where S* := S(H(x*)), then

3. lim_{k→∞} E_{θ̄_k}[Γ(X)] = Γ(x*) w.p.1.

Remark 4: Roughly speaking, the second result can be understood as finite-time ε-optimality. To see this, consider the special case where H(x) is locally concave on the set O_ε. Let x, y ∈ O_ε and η ∈ [0, 1] be arbitrary. By the definition of concavity, we have H(ηx + (1 − η)y) ≥ ηH(x) + (1 − η)H(y) ≥ H(x*) − ε, which implies that the set O_ε is convex. If, in addition, Γ(x) is also convex and one-to-one on O_ε (e.g., the multivariate normal p.d.f.), then CONV{Γ(O_ε)} = Γ(O_ε). Thus it follows that Γ^{−1}(E_{θ̄_{k+1}}[Γ(X)]) ∈ O_ε, ∀k ≥ K w.p.1.

Proof of Theorem 5.1: (1) The first part of the proof is an extension of the proofs given in Homem-de-Mello (2004). First we claim that, given ρ_k and γ̄_k, if γ̄_k ≤ H(x*) − ε, then w.p.1 there exist K < ∞ and ρ̄ ∈ (0, ρ_k) such that γ̄_{k+1}(ρ̄, N_k) ≥ γ̄_k + ε/2 ∀k ≥ K. To show this, we proceed by contradiction. Let ρ_k* := P̃_{θ̄_k}(H(X) ≥ γ̄_k + 2ε/3). If γ̄_k ≤ H(x*) − ε, then γ̄_k + 2ε/3 ≤ H(x*) − ε/3. By A1 and A3, we have

$$\rho_k^* \ge \tilde P_{\bar\theta_k}\!\left(H(X) \ge H(x^*) - \frac{\varepsilon}{3}\right) \ge \lambda\, C(\varepsilon, \theta_0) > 0, \tag{22}$$

where C(ε, θ_0) := ∫_X I_{{H(x) ≥ H(x*) − ε/3}} f(x, θ_0) ν(dx) is a constant. Now assume that there exists ρ ∈ (0, ρ_k*) such that γ_{k+1}(ρ, θ̄_k) < γ̄_k + 2ε/3, where γ_{k+1}(ρ, θ̄_k) is the true (1 − ρ)-quantile of H(X) with respect to f̃(·, θ̄_k). By the definition of quantiles, we have

$$\tilde P_{\bar\theta_k}\!\left(H(X) \ge \gamma_{k+1}(\rho, \bar\theta_k)\right) \ge \rho, \quad \text{and} \quad \tilde P_{\bar\theta_k}\!\left(H(X) \le \gamma_{k+1}(\rho, \bar\theta_k)\right) \ge 1 - \rho > 1 - \rho_k^*. \tag{23}$$

It follows that

$$\tilde P_{\bar\theta_k}\!\left(H(X) \le \gamma_{k+1}(\rho, \bar\theta_k)\right) \le \tilde P_{\bar\theta_k}\!\left(H(X) < \bar\gamma_k + \frac{2\varepsilon}{3}\right) = 1 - \rho_k^*$$

by the definition of ρ_k*, which contradicts (23); thus we must have that if γ̄_k ≤ H(x*) − ε, then γ_{k+1}(ρ, θ̄_k) ≥ γ̄_k + 2ε/3, ∀ρ ∈ (0, ρ_k*).
Therefore, by (22), there exists ρ̄ ∈ (0, min{ρ_k, λC(ε, θ_0)}) ⊆ (0, ρ_k) such that γ_{k+1}(ρ̄, θ̄_k) ≥ γ̄_k + 2ε/3 whenever γ̄_k ≤ H(x*) − ε. By Lemma 5.2, the distance from the sample (1 − ρ̄)-quantile γ̄_{k+1}(ρ̄, N_k) to the set of true (1 − ρ̄)-quantiles γ_{k+1}(ρ̄, θ̄_k) goes to zero as k → ∞ w.p.1; thus there exists K < ∞ w.p.1 such that γ̄_{k+1}(ρ̄, N_k) ≥ γ̄_k + ε/2 ∀k ≥ K. Notice that from the MRAS₁ algorithm, if neither Step 3a nor Step 3b is visited at the kth iteration, we will have ρ_{k+1} = ρ_k and γ̄_{k+1} = γ̄_k. Thus, whenever γ̄_k ≤ H(x*) − ε, w.p.1 Step 3a/3b will be visited after a finite number of iterations. Furthermore, since the total number of visits to Steps 3a and 3b is finite (i.e., bounded by 2[H(x*) − M]/ε, where recall that M is a lower bound for H(x)), we conclude that there exists K < ∞ w.p.1 such that γ̄_k > H(x*) − ε, ∀k ≥ K w.p.1.

(2) From the MRAS₁ algorithm, it is easy to see that γ̄_{k+1} ≥ γ̄_k, ∀k = 0, 1, .... By part (1), we have γ̄_{k+1} > H(x*) − ε, ∀k ≥ K w.p.1. Thus, by the definition of g̃_{k+1}(x) (cf. (21)), it follows immediately that if {x : H(x) ≥ γ̄_{k+1}, x ∈ {X_1^k, ..., X_{N_k}^k}} ≠ ∅, then the support of g̃_{k+1}(x) satisfies supp{g̃_{k+1}} ⊆ O_ε ∀k ≥ K

w.p.1; otherwise, if {x : H(x) ≥ γ̄_{k+1}, x ∈ {X_1^k, ..., X_{N_k}^k}} = ∅, then g̃_{k+1} = g̃_k, so supp{g̃_{k+1}} = supp{g̃_k}. We now discuss these two cases separately.

Case 1. If supp{g̃_{k+1}} ⊆ O_ε, then we have Γ(supp{g̃_{k+1}}) ⊆ Γ(O_ε). Since E_{g̃_{k+1}}[Γ(X)] is a convex combination of Γ(X_1^k), ..., Γ(X_{N_k}^k), it follows that

$$E_{\tilde g_{k+1}}[\Gamma(X)] \in \text{CONV}\{\Gamma(\text{supp}\{\tilde g_{k+1}\})\} \subseteq \text{CONV}\{\Gamma(O_\varepsilon)\}.$$

Thus, by A4, A5, and Lemma 5.1,

$$E_{\bar\theta_{k+1}}[\Gamma(X)] \in \text{CONV}\{\Gamma(O_\varepsilon)\}.$$

Case 2. If supp{g̃_{k+1}} = supp{g̃_k} (note that this could only happen if Step 3c is visited), then from the algorithm, there exists some k̄ < k + 1 such that γ̄_{k+1} = γ̄_{k̄} and supp{g̃_{k̄}} ≠ ∅. Without loss of generality, let k̄ be the largest iteration counter such that the preceding properties hold. Since γ̄_{k̄} = γ̄_{k+1} > H(x*) − ε ∀k ≥ K w.p.1, we have supp{g̃_{k̄}} ⊆ O_ε w.p.1. By following the discussion in Case 1, it is clear that E_{θ̄_{k̄}}[Γ(X)] ∈ CONV{Γ(O_ε)} w.p.1. Furthermore, since θ̄_{k̄} = θ̄_{k̄+1} = ··· = θ̄_{k+1} (see the discussion in Section 5.1), we will again have

$$E_{\bar\theta_{k+1}}[\Gamma(X)] \in \text{CONV}\{\Gamma(O_\varepsilon)\}, \quad \forall k \ge K \ \text{w.p.1}.$$

(3) Define ĝ_{k+1}(x) as

$$\hat g_{k+1}(x) := \frac{[S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}}{\int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)}, \quad k = 1, 2, \ldots,$$

where γ̄_k is defined as in MRAS₁. Note that since γ̄_k is a random variable, ĝ_{k+1}(x) is also a random variable. It follows that

$$E_{\hat g_{k+1}}[\Gamma(X)] = \frac{\int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \Gamma(x)\, \nu(dx)}{\int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)}.$$

Let ω = (X_1^0, ..., X_{N_0}^0, X_1^1, ..., X_{N_1}^1, ...) be a particular sample path generated by the algorithm. For each ω, the sequence {γ̄_k(ω), k = 1, 2, ...} is non-decreasing, bounded above by H(x*), and each strict increase is lower bounded by ε/2; thus, there exists Ñ(ω) > 0 such that γ̄_{k+1}(ω) = γ̄_k(ω) ∀k ≥ Ñ(ω). Now define Ω_1 := {ω : lim_{k→∞} γ̄_k(ω) = H(x*)}. By the definition of g̃_{k+1}(·) (cf. (21)), for each ω ∈ Ω_1 we clearly have lim_{k→∞} E_{g̃_k(ω)}[Γ(X)] = Γ(x*); thus, it follows from Lemma 5.1 that lim_{k→∞} E_{θ̄_k(ω)}[Γ(X)] = Γ(x*), ∀ω ∈ Ω_1. The rest of the proof amounts to showing that the result also holds almost surely (a.s.) on the set Ω_1^c.
Since lim_{k→∞} γ̄_k(ω) = γ̄_{Ñ(ω)}(ω) < H(x*) ∀ω ∈ Ω_1^c, we have by Fatou's lemma

$$\liminf_{k\to\infty} \int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx) \ \ge\ \int_{\mathcal{X}} \liminf_{k\to\infty}\, [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx) > 0, \quad \forall\omega \in \Omega_1^c, \tag{24}$$

where the last inequality follows from the fact that βS(H(x)) ≥ 1 ∀x ∈ {x : H(x) ≥ max{S^{−1}(1/β), γ̄_{Ñ}}} and Assumption A1. Since f(x, θ_0) > 0 ∀x ∈ X, we have X ⊆ supp{f̃(·, θ̄_k)} ∀k; thus

$$E_{\hat g_{k+1}}[\Gamma(X)] = \frac{\tilde E_{\bar\theta_k}\!\left[\beta^k\, \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\, \Gamma(X)\right]}{\tilde E_{\bar\theta_k}\!\left[\beta^k\, \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\right]}, \quad k = 1, 2, \ldots,$$

where S̃_k(H(x)) := [S(H(x))]^k / f̃(x, θ̄_k). We now show that E_{g̃_{k+1}}[Γ(X)] → E_{ĝ_{k+1}}[Γ(X)] a.s. on Ω_1^c as

Published version: A Model Reference Adaptive Search Method for Global Optimization, Operations Research, Vol. 55, No. 3, May–June 2007, pp. 549–568, doi 10.1287/opre.1060.0367, © 2007 INFORMS.

SUPPLEMENTARY MATERIAL TO IRONING WITHOUT CONTROL SUPPLEMENTARY MATERIAL TO IRONING WITHOUT CONTROL JUUSO TOIKKA This document contains omitted proofs and additional results for the manuscript Ironing without Control. Appendix A contains the proofs for

More information

Some Background Math Notes on Limsups, Sets, and Convexity

Some Background Math Notes on Limsups, Sets, and Convexity EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. xx, No. x, Xxxxxxx 00x, pp. xxx xxx ISSN 0364-765X EISSN 156-5471 0x xx0x 0xxx informs DOI 10.187/moor.xxxx.xxxx c 00x INFORMS On the Power of Robust Solutions in

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Stochastic Dynamic Programming: The One Sector Growth Model

Stochastic Dynamic Programming: The One Sector Growth Model Stochastic Dynamic Programming: The One Sector Growth Model Esteban Rossi-Hansberg Princeton University March 26, 2012 Esteban Rossi-Hansberg () Stochastic Dynamic Programming March 26, 2012 1 / 31 References

More information

Concepts and Applications of Stochastically Weighted Stochastic Dominance

Concepts and Applications of Stochastically Weighted Stochastic Dominance Concepts and Applications of Stochastically Weighted Stochastic Dominance Jian Hu Department of Industrial Engineering and Management Sciences Northwestern University jianhu@northwestern.edu Tito Homem-de-Mello

More information

Complexity of two and multi-stage stochastic programming problems

Complexity of two and multi-stage stochastic programming problems Complexity of two and multi-stage stochastic programming problems A. Shapiro School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA The concept

More information

On Finding Optimal Policies for Markovian Decision Processes Using Simulation

On Finding Optimal Policies for Markovian Decision Processes Using Simulation On Finding Optimal Policies for Markovian Decision Processes Using Simulation Apostolos N. Burnetas Case Western Reserve University Michael N. Katehakis Rutgers University February 1995 Abstract A simulation

More information

Quantifying Stochastic Model Errors via Robust Optimization

Quantifying Stochastic Model Errors via Robust Optimization Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage

More information

Metaheuristics and Local Search

Metaheuristics and Local Search Metaheuristics and Local Search 8000 Discrete optimization problems Variables x 1,..., x n. Variable domains D 1,..., D n, with D j Z. Constraints C 1,..., C m, with C i D 1 D n. Objective function f :

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

A minimalist s exposition of EM

A minimalist s exposition of EM A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability

More information

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches Discrete Mathematics for Bioinformatics WS 07/08, G. W. Klau, 31. Januar 2008, 11:55 1 Metaheuristics and Local Search Discrete optimization problems Variables x 1,...,x n. Variable domains D 1,...,D n,

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Measure Theory and Lebesgue Integration. Joshua H. Lifton

Measure Theory and Lebesgue Integration. Joshua H. Lifton Measure Theory and Lebesgue Integration Joshua H. Lifton Originally published 31 March 1999 Revised 5 September 2004 bstract This paper originally came out of my 1999 Swarthmore College Mathematics Senior

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Line search methods with variable sample size for unconstrained optimization

Line search methods with variable sample size for unconstrained optimization Line search methods with variable sample size for unconstrained optimization Nataša Krejić Nataša Krklec June 27, 2011 Abstract Minimization of unconstrained objective function in the form of mathematical

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: October 16, 2014

More information

1. Introduction. In this paper we consider stochastic optimization problems of the form

1. Introduction. In this paper we consider stochastic optimization problems of the form O RATES OF COVERGECE FOR STOCHASTIC OPTIMIZATIO PROBLEMS UDER O-I.I.D. SAMPLIG TITO HOMEM-DE-MELLO Abstract. In this paper we discuss the issue of solving stochastic optimization problems by means of sample

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Cross entropy-based importance sampling using Gaussian densities revisited

Cross entropy-based importance sampling using Gaussian densities revisited Cross entropy-based importance sampling using Gaussian densities revisited Sebastian Geyer a,, Iason Papaioannou a, Daniel Straub a a Engineering Ris Analysis Group, Technische Universität München, Arcisstraße

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

Probabilistic Graphical Models. Theory of Variational Inference: Inner and Outer Approximation. Lecture 15, March 4, 2013

Probabilistic Graphical Models. Theory of Variational Inference: Inner and Outer Approximation. Lecture 15, March 4, 2013 School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Junming Yin Lecture 15, March 4, 2013 Reading: W & J Book Chapters 1 Roadmap Two

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Verifying Regularity Conditions for Logit-Normal GLMM

Verifying Regularity Conditions for Logit-Normal GLMM Verifying Regularity Conditions for Logit-Normal GLMM Yun Ju Sung Charles J. Geyer January 10, 2006 In this note we verify the conditions of the theorems in Sung and Geyer (submitted) for the Logit-Normal

More information

Convex relaxations of chance constrained optimization problems

Convex relaxations of chance constrained optimization problems Convex relaxations of chance constrained optimization problems Shabbir Ahmed School of Industrial & Systems Engineering, Georgia Institute of Technology, 765 Ferst Drive, Atlanta, GA 30332. May 12, 2011

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

POL502: Foundations. Kosuke Imai Department of Politics, Princeton University. October 10, 2005

POL502: Foundations. Kosuke Imai Department of Politics, Princeton University. October 10, 2005 POL502: Foundations Kosuke Imai Department of Politics, Princeton University October 10, 2005 Our first task is to develop the foundations that are necessary for the materials covered in this course. 1

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Whitney s Extension Problem for C m

Whitney s Extension Problem for C m Whitney s Extension Problem for C m by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Supported by Grant

More information

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES IJMMS 25:6 2001) 397 409 PII. S0161171201002290 http://ijmms.hindawi.com Hindawi Publishing Corp. A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Introduction to Maximum Likelihood Estimation

Introduction to Maximum Likelihood Estimation Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:

More information