On the Convergence of Adaptive Stochastic Search Methods for Constrained and Multi-Objective Black-Box Global Optimization


On the Convergence of Adaptive Stochastic Search Methods for Constrained and Multi-Objective Black-Box Global Optimization

Rommel G. Regis

Citation: R. G. Regis. On the convergence of adaptive stochastic search methods for constrained and multi-objective black-box optimization. Journal of Optimization Theory and Applications, 170(3), 2016.

The Version of Record of this manuscript has been published and is available in the Journal of Optimization Theory and Applications (2016).

JOTA manuscript No. (will be inserted by the editor)

On the Convergence of Adaptive Stochastic Search Methods for Constrained and Multi-Objective Black-Box Global Optimization

Rommel G. Regis

June 30, 2016

Abstract Stochastic search methods for global optimization and multi-objective optimization are widely used in practice, especially on problems with black-box objective and constraint functions. Although there are many theoretical results on the convergence of stochastic search methods, relatively few deal with black-box constraints and multiple black-box objectives, and previous convergence analyses require feasible iterates. Moreover, some of the convergence conditions are difficult to verify for practical stochastic algorithms, and some of the theoretical results apply only to specific algorithms. First, this article presents some technical conditions that guarantee the convergence of a general class of adaptive stochastic algorithms for constrained black-box global optimization that do not require the iterates to be always feasible, and applies them to practical algorithms, including an evolutionary algorithm. The conditions are required only for a subsequence of the iterations and provide a recipe for making any algorithm converge to the global minimum in a probabilistic sense. Second, it uses the results for constrained optimization to derive convergence results for stochastic search methods for constrained multi-objective optimization.

Keywords Constrained optimization · Multi-objective optimization · Random search · Convergence · Evolutionary programming

Mathematics Subject Classification (2000) 65K05 · 90C29

Rommel G. Regis
Department of Mathematics, Saint Joseph's University, Philadelphia, Pennsylvania 19131, USA
rregis@sju.edu

1 Introduction

Stochastic search methods have been widely used to solve constrained global optimization and multi-objective optimization problems in situations where the objective or constraint functions are black-box. Here, black-box means that the mathematical expression defining the function is not provided; instead, the function values are obtained via a computer simulation. Stochastic search methods for constrained black-box global optimization include random search techniques (e.g., [1, 2]), evolutionary algorithms and swarm algorithms. Moreover, the most popular stochastic search methods for multi-objective optimization are multi-objective evolutionary algorithms (MOEAs) (e.g., [3, 4]).

Many papers have been written on the convergence of stochastic search methods to the global minimum in a probabilistic sense (e.g., [5-10]). However, relatively few papers have dealt with black-box constraints and infeasible start points. For example, Baba [1] and Pinter [11] proved convergence results for some stochastic search algorithms for constrained black-box global optimization. However, their algorithms need a feasible starting point and they require the iterates to remain feasible. Unfortunately, finding feasible starting points and maintaining feasibility can be challenging for highly constrained black-box problems, so this requirement can be very restrictive in practice. Moreover, their convergence conditions are not easy to verify for some practical stochastic search algorithms. Also, relatively few papers have addressed the convergence of stochastic search methods to the Pareto set. For example, Baba et al. [12] proposed a random search algorithm for multi-objective constrained black-box optimization and proved that it converges with probability 1 to an approximate Pareto-optimal solution under certain conditions. However, the algorithmic framework they used again requires feasible iterates. Moreover, other papers (e.g., [13-15]) analyzed the convergence of MOEAs. In addition, Schütze et al. [16] proved the convergence of stochastic search algorithms to finite size Pareto set approximations using the concept of $\epsilon$-dominance. For a review of convergence results for MOEAs, see [17].

The first part of this paper presents some technical conditions that guarantee the convergence of a general class of adaptive stochastic algorithms for constrained black-box global optimization and applies them to some practical algorithms, including an Evolutionary Programming (EP) algorithm. Although the algorithms are stochastic, this paper focuses on problems with deterministic objective and constraint functions. These results extend the ones given in [18] and [8]. The convergence results provided here do not require iterates to be always feasible, and some of the convergence conditions are stated mainly in terms of the conditional densities of the random vector iterates.

In addition, the algorithmic framework is flexible in that the convergence conditions are required only for a subsequence of the iterations. In fact, it provides a recipe for making any algorithm converge to the global minimum in a probabilistic sense by simply inserting iterations that satisfy some global search conditions.

The second part of this paper uses the results for constrained black-box optimization to derive convergence results for stochastic search methods in a multi-objective setting. It applies the convergence results for the constrained case to single-objective reformulations of the multi-objective problem. Moreover, it provides some analysis of stochastic algorithms that work directly on convex constrained multi-objective problems. Some of the main differences between the multi-objective results in this paper and previous work in [12] are that the algorithmic framework presented does not require feasible iterates, the type of convergence result presented is different, and the convergence conditions are again required only on a subsequence of iterations. This paper is among the few to prove the convergence of random search methods for constrained and multi-objective optimization in a probabilistic sense.

This paper is organized as follows. Section 2 deals with convergence results for constrained black-box optimization. Section 3 provides the convergence results for adaptive random search for multi-objective optimization. Finally, Section 4 provides a summary and conclusions.

2 Constrained Black-Box Global Optimization

2.1 Preliminaries and Notations

This section focuses on the convergence of adaptive stochastic search methods for the following constrained black-box global optimization problem (CBOP):

$$\min f(x) \quad \text{subject to:} \quad G(x) := (g_1(x), \ldots, g_m(x)) \le 0 \ \text{ and } \ l \le x \le u \qquad (1)$$

Here, $f: \mathbb{R}^d \to \mathbb{R}$ and $g_j: \mathbb{R}^d \to \mathbb{R}$, $j = 1, \ldots, m$, are deterministic black-box measurable functions and $l, u \in \mathbb{R}^d$ are the bounds on the decision variables. In many practical applications, the values of the objective and constraint functions are obtained by running computer simulations and their derivatives are unavailable.

Hence, assume that one simulation at a point in $[l, u]$ yields the values of $f, g_1, \ldots, g_m$ at that point. For convenience, problem (1) is denoted by CBOP$(f, G, [l, u])$.

Let $D := \{x \in \mathbb{R}^d : l \le x \le u, \ G(x) \le 0\}$ be the feasible region of problem (1). If $D$ is compact and $f$ is continuous over $D$, then problem (1) is guaranteed to have an optimal solution over $D$. Note that $D$ is compact if each constraint function $g_j$ is also continuous. As mentioned earlier, this paper focuses on stochastic algorithms for problem (1), but the objective and constraint functions are all assumed to be deterministic. For problems where the objective or constraint functions are stochastic, see [8].

Since the constraint functions in (1) are black-box, the feasible region $D$ is unknown. Hence, the algorithms for finding the global minimum of $f$ over $D$ will be searching throughout the box-shaped region $[l, u]$ defined by the bound constraints. We refer to the region $[l, u]$ as the search space of problem (1).

Definition 2.1 Let $S$ be a proper subset of $\mathbb{R}^d$. A mapping $\rho_S : \mathbb{R}^d \to S$ such that $\rho_S(x) = x$ for all $x \in S$ is said to be an absorbing transformation.

Absorbing transformations are necessary for stochastic search methods because most search spaces in practical problems are compact subsets of some Euclidean space and some probability distributions yield iterates that are outside the search space. For example, Gaussian iterates might fall outside the search space even if the center of the distribution is well inside the search space. An absorbing transformation is meant to absorb runaway iterates into the search space. Commonly used absorbing transformations are projections onto the search space or successive reflections about a boundary.

Definition 2.2 Consider a CBOP$(f, G, [l, u])$ and let $D := \{x \in \mathbb{R}^d : l \le x \le u, \ G(x) \le 0\}$. A constraint violation (CV) function for $G$ over $[l, u]$ is a function $V_G : [l, u] \to \mathbb{R}_+$ with the following properties: (i) $V_G(x) = 0$ for all $x \in D$; (ii) $V_G(x) > 0$ for all $x \notin D$; and (iii) if $G(x) \le G(y)$, then $V_G(x) \le V_G(y)$.

A CV function is a measure of the degree of constraint violation of a point in the search space $[l, u]$. Two examples are $V_G(x) = \sum_{j=1}^m [\max\{g_j(x), 0\}]$ and $V_G(x) = \sum_{j=1}^m [\max\{g_j(x), 0\}]^2$.

Before defining the concept of convergence of a stochastic algorithm to the global minimum, we first recall the definitions of convergence in probability and almost sure convergence (e.g., see [19]). In the definitions below, $\{X_n\}_{n \ge 1}$ is a sequence of random vectors and $X$ is another random vector defined on the same probability space $(\Omega, \mathcal{B}, P)$, where $\Omega$ is the sample space, $\mathcal{B}$ is a $\sigma$-field of subsets of $\Omega$, and $P$ is the probability measure.
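For concreteness, here is a minimal Python sketch (assuming NumPy; the constraint in the usage example is hypothetical) of the second CV function above and of the projection onto $[l, u]$, which is among the most common absorbing transformations:

```python
import numpy as np

def cv_quadratic(g_values):
    """Quadratic CV function V_G(x) = sum_j [max{g_j(x), 0}]^2.

    g_values: array of constraint values G(x) = (g_1(x), ..., g_m(x)).
    Returns 0 exactly when G(x) <= 0, i.e., when x is feasible.
    """
    return float(np.sum(np.maximum(g_values, 0.0) ** 2))

def project_to_box(y, l, u):
    """Projection onto [l, u]: an absorbing transformation, since points
    already inside the box are left fixed."""
    return np.minimum(np.maximum(y, l), u)

# Usage: a runaway trial iterate is absorbed back into [0, 1]^2.
l, u = np.zeros(2), np.ones(2)
x = project_to_box(np.array([1.7, -0.3]), l, u)      # -> [1.0, 0.0]
print(cv_quadratic(np.array([x[0] + x[1] - 0.5])))   # g(x) = x1 + x2 - 0.5 -> 0.25
```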

That is, the $X_n$'s and $X$ are mappings $X_n : (\Omega, \mathcal{B}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and $X : (\Omega, \mathcal{B}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, where $\mathcal{B}(\mathbb{R}^d)$ consists of the Borel subsets of $\mathbb{R}^d$. The definitions below also apply to the special case where $\{X_n\}_{n \ge 1}$ and $X$ are random variables instead of random vectors. Throughout this paper, $\|\cdot\|$ denotes the 2-norm. Moreover, if $E$ is a collection of random elements defined on a probability space, then $\sigma(E)$ is the $\sigma$-field generated by $E$. We can think of $\sigma(E)$ as representing all the information that can be derived from the random elements in $E$.

Definition 2.3 Let $\{X_n\}_{n \ge 1}$ and $X$ be random vectors defined on the same probability space $(\Omega, \mathcal{B}, P)$. The sequence $\{X_n\}_{n \ge 1}$ converges in probability to $X$ iff for any $\epsilon > 0$, $\lim_{n \to \infty} P[\|X_n - X\| > \epsilon] = 0$. Moreover, $\{X_n\}_{n \ge 1}$ converges almost surely (a.s.) to $X$, written $X_n \to X$ a.s., iff there exists an event $N \in \mathcal{B}$ such that $P(N) = 0$ and $\lim_{n \to \infty} X_n(\omega) = X(\omega)$ for all $\omega \in \Omega \setminus N$.

It is well-known that convergence a.s. implies convergence in probability but the converse is false.

One of the goals of this paper is to provide simple conditions that guarantee the convergence of a general class of adaptive stochastic algorithms for problem (1) that is described in the framework in Section 2.3. Because the algorithms that follow this framework are stochastic, the iterates are treated as $d$-dimensional random vectors. Consider a stochastic algorithm whose iterates are given by $\{Y_n\}_{n \ge 1}$ defined on a probability space $(\Omega, \mathcal{B}, P)$. Here, the random vector $Y_n : (\Omega, \mathcal{B}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ represents the $n$th trial iterate generated by some probability distribution such as the uniform distribution over $[l, u]$ or a multivariate Normal distribution centered at a current best solution. As mentioned above, the trial iterate $Y_n$ could fall outside of $[l, u]$, so it needs to be brought back to $[l, u]$. Hence, define $\{X_n\}_{n \ge 1}$ to be the sequence of actual random vector iterates, where $X_n := \rho_{[l,u]}(Y_n)$ for some absorbing transformation $\rho_{[l,u]}$. Note that the objective and constraint function evaluations are carried out on the sequence $\{X_n\}_{n \ge 1}$ rather than the trial sequence $\{Y_n\}_{n \ge 1}$. We refer to $\{Y_n\}_{n \ge 1}$ as the trial random vector iterates while $\{X_n\}_{n \ge 1}$ are the actual random vector iterates.

Next, we clarify what it means for a stochastic algorithm to converge to the global minimum in a probabilistic sense. Given the sequence $\{X_n\}_{n \ge 1}$ of actual random vector iterates, we define the sequence $\{X_n^*\}_{n \ge 1}$ of the best points visited by the algorithm with respect to the objective function $f(x)$ and a constraint violation function $V_G(x)$ as follows: set $X_1^* := X_1$, and for any $n > 1$, set

$$X_n^* := \begin{cases} X_n & \text{if } X_n \text{ and } X_{n-1}^* \text{ are feasible and } f(X_n) < f(X_{n-1}^*), \\ X_n & \text{if } X_{n-1}^* \text{ is infeasible and } V_G(X_n) < V_G(X_{n-1}^*), \\ X_{n-1}^* & \text{otherwise.} \end{cases} \qquad (2)$$
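Update (2) can be transcribed directly into code. The sketch below is a hypothetical helper (not from the paper): it represents feasibility by a zero constraint violation and tracks the triple $(X_{n-1}^*, f(X_{n-1}^*), V_G(X_{n-1}^*))$:

```python
def update_best(x_best, f_best, v_best, x_new, f_new, v_new):
    """One step of the best-point update (2); v == 0 means feasible.

    Returns the new triple (X_n^*, f(X_n^*), V_G(X_n^*)).
    """
    if v_best == 0.0:
        # Current best is feasible: accept only a better feasible point.
        if v_new == 0.0 and f_new < f_best:
            return x_new, f_new, v_new
    elif v_new < v_best:
        # Current best is infeasible: accept any point with smaller violation
        # (in particular, any feasible point, since then v_new == 0).
        return x_new, f_new, v_new
    return x_best, f_best, v_best
```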

Definition 2.4 Let $f$ be a real-valued measurable function defined on a measurable set $D \subseteq \mathbb{R}^d$ such that the global minimum value $f^* := \min_{x \in D} f(x)$ exists. A stochastic algorithm whose actual iterates are denoted by the sequence of random vectors $\{X_n\}_{n \ge 1}$ is said to converge to the global minimum of $f$ over $D$ almost surely (or in probability) iff the sequence of random variables $\{f(X_n^*)\}_{n \ge 1}$, where $X_n^*$ is defined in (2), converges to $f^*$ almost surely (or in probability).

2.2 Simple Stochastic Search Algorithms for Constrained Optimization

Before presenting a general framework for adaptive stochastic search methods for constrained black-box optimization, we first present two simple stochastic algorithms for constrained optimization that are easy to use in practice. The first one uses uniform random search over the search space while the second one uses Gaussian steps centered at the current best solution.

Algorithm A1. Uniform Random Search for Constrained Optimization

Inputs: (i) CBOP$(f, G, [l, u])$; (ii) CV function $V_G : [l, u] \to \mathbb{R}_+$.

Step 0. Set $n = 1$.
Step 1. Generate a realization of $X_n \sim U[l, u]$.
Step 2. Evaluate $f(X_n)$ and $G(X_n)$. (This is equivalent to one simulation.)
Step 3. Let $X_n^*$ be the best point found so far with respect to $f(x)$ and $V_G(x)$ as defined in (2).
Step 4. Increment $n \leftarrow n + 1$ and go back to Step 1.

Algorithm A1 samples uniformly at random over the search space $[l, u]$. Then it updates the current best solution $X_n^*$ after the objective and constraint function values of $X_n$ are obtained. This is the simplest conceivable stochastic search algorithm for problem (1). Since the $X_n$'s are independent and identically distributed (iid), this algorithm is not really adaptive, but it can be shown to converge to the global minimum under some simple assumptions, as will be seen in the next section.
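A compact Python sketch of Algorithm A1, reusing `cv_quadratic` and `update_best` from the sketches above (`objective` and `constraints` stand in for the two outputs of one black-box simulation):

```python
import numpy as np

def uniform_random_search(objective, constraints, l, u, n_sim, seed=None):
    """Algorithm A1: uniform random search over [l, u] for a CBOP."""
    rng = np.random.default_rng(seed)
    x_best, f_best, v_best = None, np.inf, np.inf          # no point found yet
    for _ in range(n_sim):
        x = rng.uniform(l, u)                              # Step 1: X_n ~ U[l, u]
        f, v = objective(x), cv_quadratic(constraints(x))  # Step 2: one simulation
        x_best, f_best, v_best = update_best(              # Step 3: update X_n^*
            x_best, f_best, v_best, x, f, v)
    return x_best, f_best, v_best
```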

To make the algorithm somewhat adaptive, one can use Gaussian distributions centered at the current best solution. By localizing the search in this manner, the chances of finding better iterates are improved.

Algorithm A2. Localized Random Search for Constrained Optimization

Inputs: (i) CBOP$(f, G, [l, u])$; (ii) CV function $V_G : [l, u] \to \mathbb{R}_+$; (iii) Initial covariance matrix for the Gaussian steps $C_1$; (iv) Absorbing transformation $\rho_{[l,u]} : \mathbb{R}^d \to [l, u]$; (v) Initial solution $X_0 \in [l, u]$.

Step 0. Initialize $X_0^* := X_0$ and set $n = 1$.
Step 1. Generate a realization of $Z_n$ such that $Z_n \mid \sigma(\{Z_1, \ldots, Z_{n-1}\}) \sim N(0_{d \times 1}, C_n)$, and set $Y_n := X_{n-1}^* + Z_n$.
Step 2. Set $X_n := \rho_{[l,u]}(Y_n)$.
Step 3. Evaluate $f(X_n)$ and $G(X_n)$. (This is equivalent to one simulation.)
Step 4. Let $X_n^*$ be the best point found so far with respect to $f(x)$ and $V_G(x)$ as defined in (2).
Step 5. Determine the next covariance matrix $C_{n+1}$, possibly using information obtained so far.
Step 6. Increment $n \leftarrow n + 1$ and go back to Step 1.

In Step 5 of Algorithm A2, the covariance matrix of the Gaussian random vector iterates may be kept constant, but the step also allows for the possibility of adjusting this covariance matrix based on information obtained so far. For example, such an adjustment strategy is commonly employed by evolution strategies (ES) such as CMA-ES [20] and its variants. In the next section, we provide a general framework that includes Algorithms A1 and A2 as special cases and we prove the convergence of algorithms that follow this framework.
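A sketch of Algorithm A2 with an isotropic covariance $C_n = \sigma_n^2 I_d$ and, as one possible Step 5 rule, a shrink factor after non-improving iterations. The floor `sigma_min` is an addition not in the algorithm statement, included because the convergence guarantee discussed later (Proposition 2.6) requires the covariances to stay bounded away from zero:

```python
import numpy as np

def localized_random_search(objective, constraints, l, u, x0, n_sim,
                            sigma=0.1, shrink=0.9, sigma_min=1e-3, seed=None):
    """Algorithm A2 sketch: Gaussian steps centered at the current best."""
    rng = np.random.default_rng(seed)
    x_best = np.asarray(x0, dtype=float)
    f_best, v_best = objective(x_best), cv_quadratic(constraints(x_best))
    for _ in range(n_sim):
        z = rng.normal(0.0, sigma, size=len(x_best))        # Step 1: Z_n ~ N(0, sigma^2 I)
        x = project_to_box(x_best + z, l, u)                # Step 2: absorb Y_n into [l, u]
        f, v = objective(x), cv_quadratic(constraints(x))   # Step 3: one simulation
        new = update_best(x_best, f_best, v_best, x, f, v)  # Step 4: update X_n^*
        if new[0] is x_best:                                # Step 5: e.g. C_{n+1} := alpha*C_n
            sigma = max(shrink * sigma, sigma_min)          # keep inf_n sigma_n > 0
        x_best, f_best, v_best = new
    return x_best, f_best, v_best
```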

2.3 General Framework for Adaptive Stochastic Search for Constrained Optimization

In Algorithm A1, the actual random vector iterates $\{X_n\}_{n \ge 1}$ are iid and $X_n \sim U[l, u]$ for all $n \ge 1$. This paper focuses on the more general case of adaptive algorithms where each (trial or actual) random vector iterate possibly depends on previous random vector iterates, i.e., the trial random vector iterates $\{Y_n\}_{n \ge 1}$ are dependent, and so, $\{X_n\}_{n \ge 1}$ are also dependent. For example, in Algorithm A2, the trial iterate $Y_n$ is obtained by adding random perturbations to the current best solution that are Normally distributed with zero mean. Since the current best solution $X_{n-1}^*$ depends on the previous trial iterates $Y_1, \ldots, Y_{n-1}$ and other associated random vectors, so does $Y_n$. However, the results in this paper also apply to the case where the actual random vectors are iid, as in the case of Algorithm A1. Moreover, as in [18], the framework below allows for the possibility that the trial iterate $Y_n$ is a deterministic function of current and previous intermediate random elements, denoted by $\{\Lambda_{i,j} : i = 1, \ldots, n,\ j = 1, \ldots, k_i\}$, in order to encompass many practical stochastic algorithms. These $\Lambda_{i,j}$'s could be random variables, random vectors or other types of random elements defined on the same probability space $(\Omega, \mathcal{B}, P)$, and they are meant to capture all random decisions prior to generating a trial iterate.

Below is the Generalized Adaptive Random Search for Constrained Optimization (GARSCO) framework for solving problem (1). This framework is an extension of the GARS framework in [18] that can handle black-box inequality constraints. In the notation below, $n$ is the number of simulations, where one simulation yields the values of $f, g_1, \ldots, g_m$ at a given point in the search space $[l, u]$.

Algorithm A3. Generalized Adaptive Random Search for Constrained Optimization (GARSCO)

Inputs: (i) CBOP$(f, G, [l, u])$; (ii) CV function $V_G : [l, u] \to \mathbb{R}_+$; (iii) Collection of intermediate random elements $\{\Lambda_{i,j} : (\Omega, \mathcal{B}) \to (\Omega_{i,j}, \mathcal{B}_{i,j}) : i \ge 1 \text{ and } j = 1, 2, \ldots, k_i\}$ used to determine the trial random vector iterates; (iv) Absorbing transformation $\rho_{[l,u]} : \mathbb{R}^d \to [l, u]$.

Step 0. Set $n = 1$.
Step 1. Generate a realization of the random vector $Y_n : (\Omega, \mathcal{B}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ as follows:
Step 1.1. For each $j = 1, \ldots, k_n$, generate a realization of the intermediate random element $\Lambda_{n,j} : (\Omega, \mathcal{B}) \to (\Omega_{n,j}, \mathcal{B}_{n,j})$ according to some probability distribution.
Step 1.2. Set $Y_n := \Theta_n(E_n)$ for some deterministic function $\Theta_n$, where $E_n := \{\Lambda_{i,j} : i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k_i\}$.
Step 2. Set $X_n := \rho_{[l,u]}(Y_n)$.
Step 3. Evaluate $f(X_n)$ and $G(X_n)$. (This is equivalent to one simulation.)
Step 4. Let $X_n^*$ be the best point found so far with respect to $f(x)$ and $V_G(x)$ as defined in (2).
Step 5. Increment $n \leftarrow n + 1$ and go back to Step 1.

To see how Algorithm A1 follows the GARSCO framework, set $k_n := 1$ and $Y_n := \Lambda_{n,1} \sim U[l, u]$ for all $n \ge 1$. Moreover, the choice of absorbing transformation does not matter since $X_n \equiv Y_n$ for all $n \ge 1$. For Algorithm A2, the intermediate random elements depend on how the covariance matrix $C_n$ is updated. The simplest case is when the covariance update is deterministic given a fixed setting of all previous random elements.

This includes the case where $C_n$ is constant for all $n \ge 1$. It also includes the case where $C_{n+1}$ possibly depends only on $C_1, \ldots, C_n$ and the realizations of the random vectors in $\bigcup_{i=1}^n \{Z_i, Y_i, X_i, f(X_i), G(X_i)\}$, which represents the history of the algorithm, and not on some other independent random elements. For example, if the current iterate $X_n$ is not an improvement over $X_{n-1}^*$ in terms of $f(x)$ and $V_G(x)$, one can set $C_{n+1} := \alpha C_n$, where $0 < \alpha < 1$, in the hopes of generating an iterate closer to $X_{n-1}^*$. In this case, $k_n := 1$, $\Lambda_{n,1} := Z_n$, where $Z_n \mid \sigma(E_{n-1}) \sim N(0_{d \times 1}, C_n)$, and $Y_n := X_{n-1}^* + Z_n$ is a function of $E_n := \{Z_1, \ldots, Z_n\}$ for all $n \ge 1$. In the more complex case, the covariance matrix update involves additional random elements beyond the $Z_n$'s, such as in the case of the Evolutionary Programming algorithm in Section 2.4.

The following convergence result for a GARSCO algorithm for problem (1) is similar to Theorem 1 in [18], which is a generalization of the theorem on page 40 of [8]. Below, $\sigma(E_{(n_k)-1})$ is the $\sigma$-field generated by the random elements in $E_{(n_k)-1}$ for all $k \ge 1$. One big difference between the following result and the ones in [1, 11, 18] is that, in these earlier papers, the actual iterates $X_n$ at which $f$ and $G$ are evaluated are all feasible. However, this is difficult, if not impossible, to ensure when the constraint functions are black-box since $D$ is unknown. The GARSCO framework is more flexible in that it allows the iterates to be infeasible at the beginning. In particular, $X_n^*$ is the best solution so far according to the given objective function $f(x)$ and the measure of constraint violation $V_G(x)$. That is, if $X_n^*$ is feasible, then it has the best value of $f(x)$ among all feasible realizations of $X_1, \ldots, X_n$. On the other hand, if $X_n^*$ is infeasible, then $X_1, \ldots, X_n$ are all infeasible and $X_n^*$ has the best value of $V_G(x)$ among the realizations of these random vector iterates. The proof for the proposition below is given in the Appendix.

Proposition 2.1 Consider a CBOP$(f, G, [l, u])$ with feasible region $D$ and assume that $f^* := \inf_{x \in D} f(x) > -\infty$. Suppose that a GARSCO algorithm applied to this problem satisfies the condition: For every $\epsilon > 0$, there exists $0 < L(\epsilon) < 1$ such that

$$P[Y_{n_k} \in D : f(X_{n_k}) < f^* + \epsilon \mid \sigma(E_{(n_k)-1})] \ge L(\epsilon) \qquad (3)$$

for some subsequence $\{n_k\}_{k \ge 1}$. If $V_G(x)$ is a constraint violation function for $G$ over $[l, u]$ and $\{X_n^*\}_{n \ge 1}$ is defined by (2), then $f(X_n^*) \to f^*$ almost surely (a.s.).

Note that the condition (3) above is expressed in terms of the trial random vectors $\{Y_{n_k}\}_{k \ge 1}$, which are random vectors generated by the GARSCO algorithm. The sequence $\{X_{n_k}\}_{k \ge 1}$ are the images of $\{Y_{n_k}\}_{k \ge 1}$ under the chosen absorbing transformation.

The next proposition deals with the case where the objective function has a unique global minimizer over the feasible region. It is similar to Theorem 2 in [18], but the meaning of $X_n^*$ in this proposition is different. The proof of this proposition is also given in the Appendix.

Proposition 2.2 Consider a GARSCO algorithm applied to problem (1) and suppose the assumptions in Proposition 2.1 hold. Moreover, suppose $x^*$ is the unique global minimizer of $f$ over $D$ in the sense that $f(x^*) := \inf_{x \in D} f(x) > -\infty$ and $\inf_{x \in D,\, \|x - x^*\| \ge \eta} f(x) > f(x^*)$ for all $\eta > 0$. Then $X_n^* \to x^*$ a.s.

Verifying condition (3) in Proposition 2.1 is difficult, if not impossible, in a practical setting where the constraint functions are black-box because the feasible region $D$ is not known. A much simpler condition to verify is given by the next proposition. Before we proceed, we make the following assumptions.

Assumptions A. Consider a CBOP$(f, G, [l, u])$ with a compact feasible region $D := \{x \in \mathbb{R}^d : l \le x \le u, \ G(x) \le 0\}$. Suppose $f^* := \inf_{x \in D} f(x) > -\infty$ and that $f$ is continuous at a global minimizer of the problem. Moreover, suppose $D$ has a nonempty interior and that every neighborhood of a boundary point of $D$ intersects the interior of $D$.

Throughout this paper, a neighborhood of a point $z \in \mathbb{R}^d$ is a closed ball of some radius $\delta > 0$ centered at $z$, denoted by $B(z, \delta) := \{x \in \mathbb{R}^d : \|x - z\| \le \delta\}$.

Proposition 2.3 Suppose Assumptions A hold for a CBOP$(f, G, [l, u])$ and suppose that a GARSCO algorithm applied to the problem satisfies the condition: $\forall z \in [l, u]$ and $\delta > 0$, $\exists\, 0 < \nu(z, \delta) < 1$ such that $P[Y_{n_k} \in B(z, \delta) \cap [l, u] \mid \sigma(E_{(n_k)-1})] \ge \nu(z, \delta)$ for some subsequence $\{n_k\}_{k \ge 1}$. If $V_G(x)$ is a constraint violation function for $G$ over $[l, u]$ and $\{X_n^*\}_{n \ge 1}$ is defined by (2), then $f(X_n^*) \to f^*$ a.s. Moreover, if $x^*$ is the unique global minimizer of $f$ over $D$ in the sense of Proposition 2.2, then $X_n^* \to x^*$ a.s.

Proof We show that the conditions of Proposition 2.1 are satisfied. Suppose $f$ is continuous at a global minimizer $x^*$ of the CBOP. Fix $\epsilon > 0$ and an integer $k \ge 1$. Since $f$ is continuous at $x^*$, $\exists\, \delta(\epsilon) > 0$ such that $|f(x) - f(x^*)| < \epsilon$ whenever $\|x - x^*\| < \delta(\epsilon)$.

Hence,

$$P[X_{n_k} \in D : f(X_{n_k}) < f(x^*) + \epsilon \mid \sigma(E_{(n_k)-1})] \ge P[X_{n_k} \in D : \|X_{n_k} - x^*\| < \delta(\epsilon) \mid \sigma(E_{(n_k)-1})]$$
$$= P[X_{n_k} \in B(x^*, \delta(\epsilon)) \cap D \mid \sigma(E_{(n_k)-1})] \ge P[Y_{n_k} \in B(x^*, \delta(\epsilon)) \cap D \mid \sigma(E_{(n_k)-1})].$$

Next, since $D$ is closed, it follows that $D = \mathrm{int}(D) \cup \mathrm{bd}(D)$, where $\mathrm{int}(D)$ is the interior of $D$ and $\mathrm{bd}(D)$ is the boundary of $D$. There are two cases to consider: (1) $x^* \in \mathrm{int}(D)$; and (2) $x^* \in \mathrm{bd}(D)$.

First, suppose $x^* \in \mathrm{int}(D)$. Then $\exists\, 0 < \eta_1 \le \delta(\epsilon)$, where $\eta_1$ depends on $\delta(\epsilon)$, such that $B(x^*, \eta_1) \subseteq D$. Now observe that

$$P[Y_{n_k} \in B(x^*, \delta(\epsilon)) \cap D \mid \sigma(E_{(n_k)-1})] \ge P[Y_{n_k} \in B(x^*, \eta_1) \cap [l, u] \mid \sigma(E_{(n_k)-1})] \ge \nu(x^*, \eta_1) =: L_1(\epsilon) > 0$$

(since $\eta_1$ depends only on $\epsilon$). Next, suppose $x^* \in \mathrm{bd}(D)$. By assumption, $\exists\, z \in \mathrm{int}(D) \cap B(x^*, \delta(\epsilon))$, where $z$ depends on $\delta(\epsilon)$. Since $z \in \mathrm{int}(D)$, $\exists\, 0 < \eta_2 \le \delta(\epsilon)$, where $\eta_2$ also depends on $z$ and $\delta(\epsilon)$, such that $B(z, \eta_2) \subseteq D$. Again,

$$P[Y_{n_k} \in B(x^*, \delta(\epsilon)) \cap D \mid \sigma(E_{(n_k)-1})] \ge P[Y_{n_k} \in B(z, \eta_2) \cap [l, u] \mid \sigma(E_{(n_k)-1})] \ge \nu(z, \eta_2) =: L_2(\epsilon) > 0$$

(since $z$ and $\eta_2$ depend only on $\epsilon$). Define $L(\epsilon) := L_1(\epsilon)$ if $x^* \in \mathrm{int}(D)$ and $L(\epsilon) := L_2(\epsilon)$ if $x^* \in \mathrm{bd}(D)$. Clearly, $0 < L(\epsilon) < 1$ and

$$P[X_{n_k} \in D : f(X_{n_k}) < f(x^*) + \epsilon \mid \sigma(E_{(n_k)-1})] \ge P[Y_{n_k} \in B(x^*, \delta(\epsilon)) \cap D \mid \sigma(E_{(n_k)-1})] \ge L(\epsilon) > 0.$$

By Proposition 2.1, $f(X_n^*) \to f^*$ a.s.

Proposition 2.3 says that a GARSCO algorithm converges to the global minimum almost surely if, in addition to the conditions in Assumptions A, it has a subsequence of iterations $\{n_k\}_{k \ge 1}$ where the random trial iterate $Y_{n_k}$ hits any ball of positive radius centered at a point in the search space $[l, u]$ with positive probability. These conditions are satisfied in the special case where the random vector $Y_{n_k}$ has a uniform distribution over the search space $[l, u]$, as shown in Corollary 2.1. Consequently, any search method for problem (1) can be made to converge to the global minimum almost surely if one interjects a uniform random sampling step over the search space $[l, u]$ for a subsequence of the iterations, regardless of what the method is doing in its actual iterations. In practice though, it would be more efficient if the subsequence of iterations responsible for convergence somehow interacts with the rest of the search method. For example, one can use Gaussian iterations where the covariance matrix is chosen so that the trial random vector iterates are more likely to be generated in promising regions based on the history of points visited by the algorithm.
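This recipe is easy to implement: wrap any proposal mechanism and overwrite every, say, tenth trial iterate with a uniform draw. The sketch below is schematic (the `inner_step` proposal rule is hypothetical, and the simulation and best-point bookkeeping of Algorithms A1-A2 are omitted):

```python
import numpy as np

def trial_iterates_with_uniform_steps(inner_step, l, u, n_sim, every=10, seed=None):
    """Generate trial iterates Y_1, Y_2, ... where every `every`-th iterate is
    drawn uniformly from [l, u]. By Corollary 2.1 (via Proposition 2.3), the
    uniform subsequence alone guarantees f(X_n^*) -> f^* a.s. under
    Assumptions A, regardless of what `inner_step` proposes in between.
    """
    rng = np.random.default_rng(seed)
    history = []
    for n in range(1, n_sim + 1):
        if n % every == 0:
            y = rng.uniform(l, u)      # the convergence-guaranteeing subsequence n_k
        else:
            y = inner_step(history)    # arbitrary heuristic proposal
        history.append(y)
        yield y
```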

The condition in Assumptions A that $f$ is continuous at a global minimizer over $D$ is important. Suppose the function $h$ has a unique global minimizer $x^*$ over $D$ and that $h$ is continuous at $x^*$. Define the function $f$ over $D$ as follows: $f(x) := h(x)$ for all $x \in D$, $x \ne x^*$, and $f(x^*) := h(x^*) - 1$. Clearly, $f$ has a unique global minimizer at $x^*$, but it is discontinuous at that point. Now even if the rest of Assumptions A and the conditions in Proposition 2.3 hold, $f(X_n^*)$ cannot converge to $f(x^*)$ a.s.

The condition in Assumptions A that every neighborhood around every boundary point of $D$ intersects $\mathrm{int}(D)$ is not difficult to satisfy. For example, the following proposition shows that this condition is satisfied under some mild assumptions. The proof is also in the Appendix.

Proposition 2.4 If $S$ is a nonempty set in $\mathbb{R}^d$ such that $\mathrm{cl}(\mathrm{int}(S)) = \mathrm{cl}(S)$, then every neighborhood of a boundary point of $S$ intersects the interior of $S$. In particular, if $C$ is a closed convex set in $\mathbb{R}^d$ with a nonempty interior, then every neighborhood of a boundary point of $C$ intersects the interior of $C$.

Although the condition in Proposition 2.4 is satisfied by convex sets $D$ with a nonempty interior, convexity of $D$ is not necessary. It is easy to create examples where $\mathrm{cl}(\mathrm{int}(D)) = \mathrm{cl}(D)$ but $D$ is nonconvex.

The next proposition is a modification of Theorem 4 from [18] applied to problem (1). Since $D$ is not known, some of the conditions of the proposition use the search space $[l, u]$ instead of $D$.

Proposition 2.5 Suppose Assumptions A hold for problem (1). Moreover, suppose a GARSCO algorithm applied to the problem has the property that there is a subsequence $\{n_k\}_{k \ge 1}$ such that for each $k \ge 1$, $Y_{n_k}$ has a conditional density $g_{n_k}(y \mid \sigma(E_{(n_k)-1}))$ satisfying the condition: $\mu(\{y \in [l, u] : h(y) = 0\}) = 0$, where $h(y) := \inf_{k \ge 1} g_{n_k}(y \mid \sigma(E_{(n_k)-1}))$ and $\mu$ is the Lebesgue measure on $\mathbb{R}^d$. Then $f(X_n^*) \to f^*$ a.s.

Proof Fix $\delta > 0$ and $z \in [l, u]$. For all $k \ge 1$,

$$P[Y_{n_k} \in B(z, \delta) \cap [l, u] \mid \sigma(E_{(n_k)-1})] = \int_{B(z,\delta) \cap [l,u]} g_{n_k}(y \mid \sigma(E_{(n_k)-1}))\, dy \ge \int_{B(z,\delta) \cap [l,u]} h(y)\, dy =: \nu(z, \delta).$$

Since $h(y)$ is a nonnegative function on $[l, u]$, $\mu(\{y \in [l, u] : h(y) = 0\}) = 0$ and $\mu(B(z, \delta) \cap [l, u]) > 0$, it follows that $\nu(z, \delta) > 0$. By Proposition 2.3, $f(X_n^*) \to f^*$ a.s.

Next, we apply the previous propositions to distributions commonly used in practice, such as the uniform distribution, the Gaussian distribution, or the more general class of elliptical distributions.

Corollary 2.1 Suppose Assumptions A hold for problem (1). Moreover, suppose a GARSCO algorithm applied to the problem has the property that there is a subsequence $\{n_k\}_{k \ge 1}$ such that for each $k \ge 1$, $Y_{n_k}$ has a uniform distribution on $[l, u]$. Then $f(X_n^*) \to f^*$ a.s. In particular, Algorithm A1 converges to the global minimum of $f$ over $D$ a.s.

Next, we consider GARSCO algorithms that use elliptical distributions to generate their iterates. The class of elliptical distributions includes the multivariate Normal and Cauchy distributions. Let $Z : (\Omega, \mathcal{B}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ be a random vector that has an elliptical distribution. If $Z$ has a density, then it has the form ([21])

$$g(z) = \gamma\, [\det(C)]^{-1/2}\, \Psi((z - u)^T C^{-1} (z - u)), \quad z \in \mathbb{R}^d,$$

where $u \in \mathbb{R}^d$, $C$ is a symmetric and positive definite matrix, $\Psi$ is a nonnegative function over the positive reals such that $\int_0^\infty z^{(d/2)-1} \Psi(z)\, dz < \infty$, and $\gamma$ is the normalizing constant given by

$$\gamma := \left( \frac{2 \pi^{d/2}}{\Gamma(d/2)} \int_0^\infty z^{d-1} \Psi(z^2)\, dz \right)^{-1}. \qquad (4)$$

Elliptical distributions generalize widely used probability distributions, including the multivariate Gaussian distribution ($\Psi(y) = e^{-y/2}$). The following result states that GARSCO algorithms that use elliptical distributions, where $\Psi$ is monotonically nonincreasing and the eigenvalues of the symmetric and positive definite matrix $C$ are bounded away from 0, converge to the global minimum almost surely. In the notation below, $\lambda_{\min}(C)$ and $\lambda_{\max}(C)$ denote the smallest and largest eigenvalues of $C$. Moreover, $U_k = \Phi_k(E_{(n_k)-1})$ represents a deterministic function of the random elements in $E_{(n_k)-1}$. A typical setting in practice is $U_k = X_{(n_k)-1}^*$, which is the best solution after $(n_k) - 1$ simulations. However, $U_k$ could also be any random vector whose realization is in the search space $[l, u]$.
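Before stating the result, note that its eigenvalue condition (property (P2) below) is easy to enforce in an implementation by flooring the spectrum of each covariance matrix. A small sketch, assuming NumPy, for the Gaussian member of the family ($\Psi(y) = e^{-y/2}$):

```python
import numpy as np

def floor_eigenvalues(C, lam_min=1e-6):
    """Return a symmetric positive definite matrix with all eigenvalues
    >= lam_min, so that inf_k lambda_min(C_k) > 0 (property (P2) below)."""
    w, V = np.linalg.eigh(C)                      # C is symmetric
    return (V * np.maximum(w, lam_min)) @ V.T     # V diag(max(w, lam_min)) V^T

def elliptical_gaussian_trial(u_center, C, seed=None):
    """Draw Y = U + Z with Z ~ N(0, C): the elliptical case with
    Psi(y) = exp(-y/2), which is monotonically nonincreasing (property (P1))."""
    rng = np.random.default_rng(seed)
    return np.asarray(u_center) + rng.multivariate_normal(np.zeros(C.shape[0]), C)
```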

Proposition 2.6 Suppose Assumptions A hold for problem (1). Moreover, suppose a GARSCO algorithm applied to the problem satisfies the condition: There is a subsequence $\{n_k\}_{k \ge 1}$ such that for each $k \ge 1$, $Y_{n_k} = U_k + Z_k$, where $U_k = \Phi_k(E_{(n_k)-1})$ for some deterministic function $\Phi_k$ and $Z_k$ is a random vector whose conditional distribution given $\sigma(E_{(n_k)-1})$ is an elliptical distribution with conditional density

$$q_k(z \mid \sigma(E_{(n_k)-1})) = \gamma\, [\det(C_k)]^{-1/2}\, \Psi(z^T C_k^{-1} z), \quad z \in \mathbb{R}^d,$$

where $\gamma$ is defined in (4). Furthermore, suppose the following properties hold: (P1) $\Psi$ is monotonically nonincreasing; and (P2) $\inf_{k \ge 1} \lambda_{\min}(C_k) > 0$. Then $f(X_n^*) \to f^*$ a.s. In particular, Algorithm A2 converges to the global minimum of $f$ over $D$ a.s. provided $\inf_{k \ge 1} \lambda_{\min}(C_k) > 0$.

2.4 An Evolutionary Algorithm for Constrained Optimization

In this section, one of the convergence results from the previous section is applied to an evolutionary programming (EP) algorithm for constrained optimization. Below is a pseudo-code of the $(\mu + \mu)$-EP for constrained black-box optimization described in [22]. This evolutionary algorithm solves a problem (1) by using only a Gaussian mutation operator, and the mutations on the components of a parent solution vector are independent (i.e., the covariance matrix of the random vector of Gaussian mutations is a diagonal matrix). The notation in Section 2.1 is used in the algorithm below.

Each individual in the EP below is a pair $(X_n, C_n)$, where $X_n$ is the $n$th point where $f$ and $G$ are evaluated and $C_n$ is the diagonal covariance matrix associated with $X_n$ that is used to generate an offspring. The initial parent population (generation 0) is denoted by $\mathcal{P}(0) := \{P_1(0), \ldots, P_\mu(0)\} := \{(X_1, C_1), \ldots, (X_\mu, C_\mu)\}$. Moreover, for $t \ge 1$, the offspring population of generation $t$ is denoted by $\mathcal{M}(t) := \{(X_{t\mu+1}, C_{t\mu+1}), \ldots, (X_{t\mu+\mu}, C_{t\mu+\mu})\}$ and the parent population at the end of generation $t$ is denoted by $\mathcal{P}(t) := \{P_1(t), \ldots, P_\mu(t)\}$. The offspring population $\mathcal{M}(t)$ is generated by applying a mutation operator to each of the $\mu$ parents of the previous generation $\mathcal{P}(t-1)$, and the parent population $\mathcal{P}(t)$ is obtained from the offspring of the current generation $\mathcal{M}(t)$ and the parent population at the end of the previous generation $\mathcal{P}(t-1)$. For convenience, the individuals in the parent population $\mathcal{P}(t)$ are denoted by $\mathcal{P}(t) = \{P_1(t), \ldots, P_\mu(t)\} := \{(\widetilde{X}_{t\mu+1}, \widetilde{C}_{t\mu+1}), \ldots, (\widetilde{X}_{t\mu+\mu}, \widetilde{C}_{t\mu+\mu})\}$. In addition, the solution vector and the covariance matrix associated with the $i$th parent at the end of generation $t$ are denoted by $X(P_i(t)) := \widetilde{X}_{t\mu+i}$ and $C(P_i(t)) := \widetilde{C}_{t\mu+i}$.

Algorithm A4: Evolutionary Programming for Constrained Black-Box Optimization

Inputs: (i) CBOP$(f, G, [l, u])$; (ii) CV function $V_G : [l, u] \to \mathbb{R}_+$; (iii) Number of offspring (also the number of parents) in every generation, denoted by $\mu$; (iv) Initial and minimum standard deviations of the Gaussian mutations, denoted by $\sigma_{\mathrm{init}} > 0$ and $\sigma_{\min} > 0$, respectively; (v) Absorbing transformation $\rho_{[l,u]} : \mathbb{R}^d \to [l, u]$.

Step 1. (Initialize Parent Population) Set $t = 0$ and for each $i = 1, 2, \ldots, \mu$, generate $Y_i$ according to some probability distribution whose realizations are on $\mathbb{R}^d$, where $Y_i$ possibly depends on $Y_1, \ldots, Y_{i-1}$, and set $X_i := \rho_{[l,u]}(Y_i)$. Moreover, for each $i = 1, 2, \ldots, \mu$, set $P_i(0) := (X_i, C_i)$, where $C_i := \sigma_{\mathrm{init}}^2 I_d$.
Step 2. (Evaluate the Initial Parent Population) For each $i = 1, 2, \ldots, \mu$, evaluate $f(X_i)$ and $G(X_i)$.
Step 3. (Iterate) While the termination condition is not satisfied do
Step 3.1. (Update Generation Counter) Reset $t := t + 1$.
Step 3.2. (Generate Offspring by Mutation) For each $i = 1, 2, \ldots, \mu$, set $(Y_{t\mu+i}, C_{t\mu+i}) := \mathrm{Mut}(P_i(t-1))$ and $X_{t\mu+i} := \rho_{[l,u]}(Y_{t\mu+i})$.
Step 3.3. (Evaluate the Offspring) For each $i = 1, 2, \ldots, \mu$, evaluate $f(X_{t\mu+i})$ and $G(X_{t\mu+i})$.
Step 3.4. (Select New Parent Population) $\mathcal{P}(t) := \mathrm{Sel}(\mathcal{P}(t-1) \cup \mathcal{M}(t))$ (see below for explanation).
End.

Recall that for $t \ge 1$, $\mathcal{P}(t) = \{P_1(t), \ldots, P_\mu(t)\} = \{(\widetilde{X}_{t\mu+1}, \widetilde{C}_{t\mu+1}), \ldots, (\widetilde{X}_{t\mu+\mu}, \widetilde{C}_{t\mu+\mu})\}$. Now in Step 3.2, the mutation operator is defined as follows: For each $t \ge 1$ and $i = 1, 2, \ldots, \mu$,

$$Y_{t\mu+i} := X(P_i(t-1)) + Z_{t\mu+i} = \widetilde{X}_{(t-1)\mu+i} + Z_{t\mu+i},$$

where $Z_{t\mu+i}$ is a random vector whose conditional distribution given $\sigma(E_{t\mu+i-1})$ (defined below) is a Gaussian distribution with mean vector $0$ and diagonal covariance matrix

$$\mathrm{Cov}(Z_{t\mu+i}) = C(P_i(t-1)) = \widetilde{C}_{(t-1)\mu+i} = \mathrm{diag}\left( \big(\sigma^{(1)}_{t\mu+i}\big)^2, \big(\sigma^{(2)}_{t\mu+i}\big)^2, \ldots, \big(\sigma^{(d)}_{t\mu+i}\big)^2 \right).$$

Note that $\sigma^{(1)}_{t\mu+i}, \ldots, \sigma^{(d)}_{t\mu+i}$ are the standard deviations of the Gaussian mutations for the $d$ components of the solution vector $X(P_i(t-1))$. Moreover, for $t \ge 1$ and $i = 1, \ldots, \mu$,

$$C_{t\mu+i} := C(P_i(t-1)) \cdot \mathrm{diag}\left( \exp(\tau' \xi^{(0)}_{t,i} + \tau \xi^{(1)}_{t,i}),\ \exp(\tau' \xi^{(0)}_{t,i} + \tau \xi^{(2)}_{t,i}),\ \ldots,\ \exp(\tau' \xi^{(0)}_{t,i} + \tau \xi^{(d)}_{t,i}) \right)$$
$$= \widetilde{C}_{(t-1)\mu+i} \cdot \exp(\tau' \xi^{(0)}_{t,i}) \cdot \mathrm{diag}\left( \exp(\tau \xi^{(1)}_{t,i}),\ \exp(\tau \xi^{(2)}_{t,i}),\ \ldots,\ \exp(\tau \xi^{(d)}_{t,i}) \right),$$

where $\tau' = 1/\sqrt{2\sqrt{d}}$ and $\tau = 1/\sqrt{2d}$ (Bäck 1993), and $\xi^{(0)}_{t,i}, \xi^{(1)}_{t,i}, \ldots, \xi^{(d)}_{t,i}$ are iid standard Normal random variables. For convenience, define the random vector $\Xi_{t,i} := [\xi^{(0)}_{t,i}, \xi^{(1)}_{t,i}, \ldots, \xi^{(d)}_{t,i}]$ for all $t \ge 1$, $i = 1, \ldots, \mu$. In addition, to prevent the standard deviations of the Gaussian mutations from becoming too small, a minimum standard deviation $\sigma_{\min}$ is used. That is, for $t \ge 1$, $i = 1, \ldots, \mu$ and $k = 1, \ldots, d$, $C_{t\mu+i}(k, k) = \max(C_{t\mu+i}(k, k), \sigma_{\min}^2)$. It will be shown later (see proof of Proposition 2.7) that

$$E_{t\mu+i-1} = \{Y_1, \ldots, Y_\mu\} \cup \left( \bigcup_{s=1}^{t-1} \bigcup_{j=1}^{\mu} \{Z_{s\mu+j}, \Xi_{s,j}\} \right) \cup \left( \bigcup_{j=1}^{i-1} \{Z_{t\mu+j}, \Xi_{t,j}\} \right).$$

In Step 3.4, the selection of the parent solutions for the next generation is usually accomplished by probabilistic $q$-tournament selection as described in Bäck (1993). As $q$ increases, this $q$-tournament selection procedure becomes more and more greedy. For simplicity, we assume that the selection of parent solutions proceeds in a completely greedy manner. That is, $\mathcal{P}(t)$ is simply the collection of the $\mu$ best solutions from $\mathcal{P}(t-1) \cup \mathcal{M}(t)$ in terms of the objective function $f(x)$ and the constraint violation function $V_G(x)$.

The next result, whose proof is in the Appendix, shows that the above EP follows the GARSCO framework, and it also shows almost sure convergence to the global minimum.

Proposition 2.7 The EP in Algorithm A4 follows the GARSCO framework. Moreover, if this EP is applied to a CBOP$(f, G, [l, u])$ such that Assumptions A hold, then $f(X_n^*) \to f^*$ a.s.
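A Python sketch of the mutation operator Mut above, with diagonal covariances stored as variance vectors (the helper name and this representation are illustrative, not from [22]):

```python
import numpy as np

def ep_mutate(x_parent, c_diag, l, u, sigma_min, seed=None):
    """One application of Mut (Step 3.2 of Algorithm A4), sketched.

    c_diag is the diagonal of the parent's covariance matrix C(P_i(t-1)),
    i.e., the current mutation variances. Z is drawn with these variances,
    and the offspring's variances are self-adapted log-normally with
    tau' = 1/sqrt(2*sqrt(d)) and tau = 1/sqrt(2*d), then floored at
    sigma_min^2 so the Gaussian mutations cannot collapse.
    """
    rng = np.random.default_rng(seed)
    d = len(x_parent)
    tau_p = 1.0 / np.sqrt(2.0 * np.sqrt(d))
    tau = 1.0 / np.sqrt(2.0 * d)
    z = np.sqrt(c_diag) * rng.standard_normal(d)     # Z ~ N(0, C(P_i(t-1)))
    x = np.minimum(np.maximum(x_parent + z, l), u)   # absorb Y = X + Z into [l, u]
    xi0 = rng.standard_normal()                      # common factor xi^(0)_{t,i}
    xi = rng.standard_normal(d)                      # xi^(1..d)_{t,i}
    c_new = np.maximum(c_diag * np.exp(tau_p * xi0 + tau * xi),
                       sigma_min ** 2)               # offspring covariance diagonal
    return x, c_new
```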

3 Multi-Objective Constrained Black-Box Optimization

3.1 Problem Statement and Preliminaries

The goal of this section is to prove some convergence results for adaptive stochastic search methods for solving the following multi-objective constrained black-box optimization problem (MCBOP):

$$\min F(x) := (f_1(x), \ldots, f_k(x)) \quad \text{s.t.} \quad G(x) := (g_1(x), \ldots, g_m(x)) \le 0 \ \text{ and } \ l \le x \le u \qquad (5)$$

Here, $f_i : \mathbb{R}^d \to \mathbb{R}$, $i = 1, \ldots, k$, and $g_j : \mathbb{R}^d \to \mathbb{R}$, $j = 1, \ldots, m$, are again deterministic black-box measurable functions and $l, u \in \mathbb{R}^d$. As before, let $D$ be the feasible region and assume that one simulation yields the values of $F$ and $G$ at a given input. This problem is denoted by MCBOP$(F, G, [l, u])$. In the event that the objective and constraint functions are not black-box and their mathematical forms are actually known, it might be more efficient to take advantage of the mathematical structures of these functions and possibly their gradients. The reader is referred to a standard textbook on nonlinear multi-objective optimization (e.g., [23]) for more suitable methods that can be used.

Consider an MCBOP$(F, G, [l, u])$ of the form (5). We employ some terminology found in standard texts on multi-objective optimization (e.g., [23]). Below are some basic terms.

Definition 3.1 Given $x, y \in D$, we say that $x$ dominates $y$, written $x \prec y$, iff $f_i(x) \le f_i(y)$ for all $i = 1, \ldots, k$ and $f_j(x) < f_j(y)$ for some $j$.

Definition 3.2 A point $x^* \in D$ is said to be a (global) Pareto minimizer of $F$ over $D$ iff there is no $y \in D$ s.t. $y \prec x^*$. The Pareto set of $F$ over $D$ is the set of all global Pareto minimizers of $F$ over $D$. The Pareto front of $F$ over $D$ is the image of the Pareto set under the mapping $F$, i.e., it is the set of objective vectors $\{F(x^*) : x^* \text{ is a global Pareto minimizer of } F \text{ over } D\}$.

Ideally, we wish to determine, or at least characterize, the entire Pareto set and Pareto front of $F$ over $D$. However, for many practical problems, the Pareto front is an infinite set, and so, one can only hope to find a finite representative subset of the Pareto set and Pareto front. In practice, many stochastic algorithms strive to find a non-dominated subset of objective vectors, sometimes with no guarantee of Pareto optimality. The solutions found are then presented to a decision maker who selects one or a few non-dominated solutions for implementation. In this paper, we would like to develop stochastic algorithms that are guaranteed to converge to Pareto optimal solutions. Even better would be to design algorithms that can, in principle, find every Pareto optimal point. In some situations, it is convenient to focus on convex MCBOPs, which are defined next.

Definition 3.3 The above MCBOP is convex iff all the objective functions $f_1, \ldots, f_k$ and the feasible region $D := \{x \in \mathbb{R}^d : l \le x \le u, \ G(x) \le 0\}$ are convex.

For example, the feasible region $D$ is convex if the inequality constraint functions $g_1, \ldots, g_m$ are all quasi-convex. A special case of a convex MCBOP occurs when $F : \mathbb{R}^d \to \mathbb{R}^k$ is a linear function ($F(x) = Mx$ for some matrix $M \in \mathbb{R}^{k \times d}$) and $G : \mathbb{R}^d \to \mathbb{R}^m$ is an affine function (i.e., $G(x) = Ax + v$ for some matrix $A \in \mathbb{R}^{m \times d}$ and vector $v \in \mathbb{R}^m$). This is equivalent to the case where each of the objective functions $f_1, \ldots, f_k$ and each of the inequality constraint functions $g_1, \ldots, g_m$ is linear.

Since this paper assumes that the objective and constraint functions in (5) are black-box, it is not really possible to know whether the given MCBOP is convex. However, it would still be valuable to explore convergence results for the convex case since it is more mathematically tractable. Moreover, any convergence result proved for the convex case will hold in the event that the black-box MCBOP happens to be convex (even if this is unknown to the user of the algorithm).

3.2 Weighted Combination of Objective Functions

A basic method for finding Pareto optimal solutions is to convert problem (5) into a single-objective optimization problem where the objective function is a weighted combination of $f_1, \ldots, f_k$. More precisely, consider an MCBOP$(F, G, [l, u])$ and for each $\lambda \in \mathbb{R}^k$, $\lambda \ge 0$, define problem MCBOP$'(\lambda)$ as follows:

$$\text{(MCBOP}'(\lambda))\qquad \min\ \lambda^T F(x) = \sum_{i=1}^k \lambda_i f_i(x) \quad \text{s.t.} \quad x \in D := \{x \in \mathbb{R}^d : l \le x \le u, \ G(x) \le 0\}$$

The following result, from Part II of [23], has been well-known since the time of Geoffrion [24]. It says that by choosing weights that are strictly positive, an optimal solution to the above problem with the weighted combination of objectives will always yield a Pareto optimal solution.

Proposition 3.1 For any $\lambda \in \mathbb{R}^k$, $\lambda > 0$, an optimal solution to the problem MCBOP$'(\lambda)$ is a Pareto minimizer for the original MCBOP.

Unfortunately, it is also well-known that the previous result is not guaranteed to yield all possible Pareto optimal points unless the MCBOP is convex. Below is a result from Part II of [23].

Proposition 3.2 Suppose an MCBOP$(F, G, [l, u])$ is convex. For any Pareto optimal solution $x^*$ to this MCBOP, $\exists\, \lambda \in \mathbb{R}^k$, $\lambda > 0$, such that $x^*$ is an optimal solution of MCBOP$'(\lambda)$.

However, Soland [25] proved that by adding upper bounds on the objective functions, the resulting problem will be able to generate the entire Pareto set. More precisely, consider an MCBOP$(F, G, [l, u])$. For each $\lambda, b \in \mathbb{R}^k$, $\lambda > 0$, define problem MCBOP$'(\lambda, b)$ as follows:

$$\text{(MCBOP}'(\lambda, b))\qquad \min\ \lambda^T F(x) = \sum_{i=1}^k \lambda_i f_i(x) \quad \text{s.t.} \quad x \in D, \ F(x) \le b$$

The following result characterizes all Pareto optimal points of the given MCBOP$(F, G, [l, u])$.

Proposition 3.3 (Soland [25]) Fix $\lambda \in \mathbb{R}^k$, $\lambda > 0$. Then $x^*$ is a Pareto optimal solution to the MCBOP$(F, G, [l, u])$ problem if and only if $x^*$ is optimal in problem MCBOP$'(\lambda, b)$ for some $b \in \mathbb{R}^k$.

Because of Propositions 3.2 and 3.3, we can obtain Pareto optimal solutions by fixing $\lambda, b \in \mathbb{R}^k$, $\lambda > 0$, and then applying a GARSCO algorithm from the previous section. In particular, the following procedure can be used to generate Pareto optimal solutions for the MCBOP in (5):

Algorithm B0. Generalized Adaptive Random Search on Weighted Objectives with Constraints

Inputs: (i) MCBOP$(F, G, [l, u])$; (ii) A GARSCO algorithm for finding the global minimum of a weighted combination of the objective functions over $[l, u]$ subject to the given inequality constraints $G(x) \le 0$; (iii) Number of iterations (or maximum number of Pareto optimal solutions to obtain), denoted by $T$.

Step 0. Initialize the Pareto set $\mathcal{X}_0 := \emptyset$ and the Pareto front $\mathcal{F}_0 := \emptyset$. Set the iteration counter $t = 1$.
Step 1. Select a particular $\lambda_t \in \mathbb{R}^k$, $\lambda_t > 0$.
Step 2. Randomly generate points uniformly over $[l, u]$ until a feasible point $\bar{x}_t$ is obtained.
Step 3. Calculate the objective vector $b_t$ of the feasible point found in Step 2, i.e., $b_t := F(\bar{x}_t)$.
Step 4. Run the GARSCO algorithm to solve MCBOP$'(\lambda_t, b_t)$ and let $x_t^*$ be the solution found.
Step 5. Update $\mathcal{X}_t := \mathcal{X}_{t-1} \cup \{x_t^*\}$ and $\mathcal{F}_t := \mathcal{F}_{t-1} \cup \{F(x_t^*)\}$.
Step 6. If $t \ge T$, then stop and return $\mathcal{X}_t$ and $\mathcal{F}_t$; else, set $t \leftarrow t + 1$ and go back to Step 1.
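A sketch of Algorithm B0 in Python, where `garsco_solve` is a placeholder for any GARSCO algorithm from Section 2; it is assumed to return a global minimizer of its first argument subject to its second argument being $\le 0$ and the bounds:

```python
import numpy as np

def algorithm_b0(F, G, l, u, garsco_solve, T, seed=None):
    """Generate up to T Pareto optimal points by solving MCBOP'(lambda_t, b_t)."""
    rng = np.random.default_rng(seed)
    k = len(F(l))                                  # number of objectives
    pareto_set, pareto_front = [], []              # Step 0
    for _ in range(T):
        lam = rng.random(k) + 1e-12                # Step 1: weights lambda_t > 0
        while True:                                # Step 2: feasible point by
            x_bar = rng.uniform(l, u)              #         uniform sampling
            if np.all(G(x_bar) <= 0):
                break
        b = F(x_bar)                               # Step 3: upper bounds b_t
        # Step 4: the bounds F(x) <= b are appended to G(x) <= 0 as extra
        # inequality constraints of the single-objective subproblem.
        x_t = garsco_solve(lambda x: float(lam @ F(x)),
                           lambda x: np.concatenate([G(x), F(x) - b]),
                           l, u)
        pareto_set.append(x_t)                     # Step 5
        pareto_front.append(F(x_t))
    return pareto_set, pareto_front                # Step 6
```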

In Step 0, the Pareto set, Pareto front and iteration counter are initialized. Then in Step 1, the vector of weights for the objective functions (denoted by $\lambda_t$) is chosen. Proposition 3.3 says that any choice of $\lambda_t > 0$ should be able to generate all Pareto optimal solutions using suitable upper bounds. These weights could be the same for all iterations, they may be chosen uniformly at random over the unit simplex $\{\lambda \in \mathbb{R}^k : \lambda > 0, \ \lambda^T 1_k = 1\}$, or they may incorporate some information about a decision maker's preferences. In practice, it makes sense to set the weights according to the relative importance of the objective functions if such information is available. Next, Step 2 repeatedly generates a point uniformly at random over the search space $[l, u]$ until a feasible point is obtained. Since it was assumed that the feasible region has a nonempty interior, a feasible point will be obtained in a finite number of trials with probability 1. The purpose of this step is to find suitable upper bounds for the objective functions so that the resulting single-objective optimization problem in Step 4 will be feasible. In Step 3, we calculate the objective vector $b_t$ corresponding to the feasible point obtained in Step 2. Then, in Step 4, a GARSCO algorithm is used to solve the MCBOP$'(\lambda_t, b_t)$ problem. Proposition 3.3 says that any optimal solution $x_t^*$ to MCBOP$'(\lambda_t, b_t)$ is guaranteed to be Pareto optimal. Next, the Pareto set and Pareto front are updated in Step 5. Finally, in Step 6, the Pareto set and Pareto front are returned if the number of iterations has reached $T$; otherwise, the iteration counter is incremented and the algorithm goes back to Step 1.

If the given MCBOP is known to be convex (though perhaps the exact mathematical forms of its objective and constraint functions are unknown), Step 3 of Algorithm B0 can be removed. Moreover, Step 4 then uses the chosen GARSCO algorithm to solve the MCBOP$'(\lambda_t)$ problem. This applies to the special case where the objective and constraint functions are all linear. However, if this is known in advance, then there are more suitable and efficient approaches that can be used (e.g., see [23]).

3.3 Adaptive Stochastic Search Algorithms for Constrained Multi-Objective Optimization

In the previous section, we applied a GARSCO algorithm to the weighted combination of the objectives subject to additional constraints giving upper bounds on the objectives. However, we would also like to analyze stochastic algorithms that work directly on the above MCBOP problem without scalarization. To do this, we extend the notion of domination to infeasible points, as given in the next definition.

Definition 3.4 Consider an MCBOP$(F, G, [l, u])$ with feasible region $D$ and let $V_G(x)$ be a constraint violation function for $G$ over $[l, u]$. Given $x, y \in [l, u]$, we say that $x$ dominates $y$, written $x \prec y$, iff any one of the following conditions holds: (i) $x, y \in D$ and $f_i(x) \le f_i(y)$ for all $i = 1, \ldots, k$ and $f_j(x) < f_j(y)$ for some $j$; (ii) $x \in D$ but $y \notin D$; or (iii) $x, y \notin D$ and $V_G(x) < V_G(y)$.
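The extended domination of Definition 3.4 translates directly into code (a sketch; feasibility is represented by a zero CV value):

```python
def dominates(fx, vx, fy, vy):
    """Extended domination of Definition 3.4: does x dominate y?

    fx, fy: objective vectors F(x), F(y); vx, vy: violations V_G(x), V_G(y),
    where v == 0 means the point is feasible.
    """
    if vx == 0.0 and vy == 0.0:      # (i) both feasible: usual Pareto domination
        return (all(a <= b for a, b in zip(fx, fy))
                and any(a < b for a, b in zip(fx, fy)))
    if vx == 0.0:                    # (ii) x feasible, y infeasible
        return True
    if vy == 0.0:                    # y feasible, x infeasible: x cannot dominate
        return False
    return vx < vy                   # (iii) both infeasible: smaller violation
```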

The next proposition says that the sequence of best solutions for any of the single-objective weighted combination problems obtained from (5), for any choice of positive weights, is always captured by the set of all nondominated points at any given iteration.

Proposition 3.4 Consider any algorithm for an MCBOP$(F, G, [l, u])$ that evaluates $F(x)$ and $G(x)$ at a sequence of points $\{x_n\}_{n \ge 1}$. Let $D$ be the feasible region and let $V_G(x)$ be the constraint violation function for $G$ over $[l, u]$ used by the algorithm. Moreover, for any $n \ge 1$, let $A_n := \{x_1, \ldots, x_n\}$ be the set of all points that have been evaluated after $n$ simulations and let $A_n^*$ be the set of nondominated points in $A_n$. Fix $\lambda \in \mathbb{R}^k$, $\lambda > 0$. For $n \ge 1$, let $\bar{x}_n \in A_n$ be the best point among all points in $A_n$ in terms of the objective function $\lambda^T F(x)$ and constraint violation function $V_G(x)$ in the following sense: If $A_n$ has a feasible point, then $\bar{x}_n$ is feasible and $\lambda^T F(\bar{x}_n) \le \lambda^T F(x)$ for all $x \in A_n \cap D$; else $V_G(\bar{x}_n) \le V_G(x)$ for all $x \in A_n$. Then $\bar{x}_n \in A_n^*$ for any $n \ge 1$.

Proof Fix $n \ge 1$. Consider two cases. First, suppose $\bar{x}_n \in A_n$ is infeasible. In this case, all points in $A_n$ are infeasible (if one of them were feasible, then $\bar{x}_n$ would not be the best) and $\bar{x}_n$ has the best value of $V_G(x)$ among all points in $A_n$. This means $\bar{x}_n$ is nondominated by any other point in $A_n$. Next, suppose $\bar{x}_n \in A_n$ is feasible. If $\bar{x}_n$ is the only feasible point in $A_n$, then $A_n^* = \{\bar{x}_n\}$. So assume $A_n$ has at least two feasible points. To show that $\bar{x}_n \in A_n^*$, argue by contradiction. Suppose $\bar{x}_n \notin A_n^*$. Then there exists $x \in A_n$ that dominates $\bar{x}_n$; since $\bar{x}_n$ is feasible, $x$ must also be feasible, i.e., $f_i(x) \le f_i(\bar{x}_n)$ for all $i$ and $f_j(x) < f_j(\bar{x}_n)$ for some index $j$. Note that this implies that $\lambda^T F(x) < \lambda^T F(\bar{x}_n) \le \lambda^T F(y)$ for all $y \in A_n \cap D$. Taking $y = x$ yields $\lambda^T F(x) < \lambda^T F(x)$, which is a contradiction since $x \in A_n \cap D$.

Again, before presenting a general framework for adaptive stochastic search methods for constrained multi-objective optimization, we first present two simple stochastic algorithms that are easy to use in practice. Algorithm B1 uses uniform random search over the region defined by the bound constraints while Algorithm B2 uses Gaussian steps centered at a current non-dominated point. Below, $A_n$ is the set of all previously evaluated points and $A_n^*$ is the set of all nondominated points after $n$ simulations that yield the objective and constraint function values.

Algorithm B1. Uniform Random Search for Constrained Multi-Objective Optimization

Inputs: (i) MCBOP$(F, G, [l, u])$; (ii) CV function $V_G : [l, u] \to \mathbb{R}_+$.

Step 0. Initialize $A_0 := \emptyset$ and $A_0^* := \emptyset$. Set $n = 1$.
Step 1. Generate a realization of $X_n \sim U([l, u])$.
Step 2. Evaluate $F(X_n)$ and $G(X_n)$. (This is equivalent to one simulation.)
Step 3. Set $A_n := A_{n-1} \cup \{X_n\}$ and $A_n^* := \mathrm{Update}(A_{n-1}^*, \{X_n\})$.
Step 4. Increment $n \leftarrow n + 1$ and go back to Step 1.

To make the previous algorithm somewhat adaptive, we again use Gaussian distributions centered at one of the current non-dominated points. The selection of the non-dominated point to be the center of the Gaussian random vector iterates can be done uniformly at random or in some deterministic fashion (e.g., choose the most isolated non-dominated point). By localizing the search in the neighborhoods of current non-dominated points, the chances of finding better non-dominated solutions are improved.

Algorithm B2. Localized Random Search for Constrained Multi-Objective Optimization

Inputs: (i) MCBOP$(F, G, [l, u])$; (ii) CV function $V_G : [l, u] \to \mathbb{R}_+$; (iii) Initial covariance matrix for the Gaussian steps $C_1$; (iv) Absorbing transformation $\rho_{[l,u]} : \mathbb{R}^d \to [l, u]$; (v) Initial solution $X_0 \in [l, u]$.

Step 0. Initialize $X_0^* := X_0$, $A_0 := \emptyset$ and $A_0^* := \emptyset$. Set $n = 1$.
Step 1. Generate a realization of $Z_n$ such that $Z_n \mid \sigma(\{Z_1, \ldots, Z_{n-1}\}) \sim N(0_{d \times 1}, C_n)$, and set $Y_n := X_{n-1}^* + Z_n$.
Step 2. Set $X_n := \rho_{[l,u]}(Y_n)$.
Step 3. Evaluate $F(X_n)$ and $G(X_n)$. (This is equivalent to one simulation.)
Step 4. Set $A_n := A_{n-1} \cup \{X_n\}$ and $A_n^* := \mathrm{Update}(A_{n-1}^*, \{X_n\})$.
Step 5. Select $X_n^*$ from $A_n^*$ (uniformly at random or in some deterministic fashion).
Step 6. Determine the next covariance matrix $C_{n+1}$, possibly using information obtained so far.
Step 7. Increment $n \leftarrow n + 1$ and go back to Step 1.

Algorithms B1 and B2 are special cases of a more general class of adaptive stochastic search methods for problem (5) that we refer to as Generalized Adaptive Random Search for CONstrained Multi-Objective Optimization (GARSCOM) and that is given in the framework below. This framework is also a modification
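For concreteness, the Update operation used in Step 3 of Algorithm B1 and Step 4 of Algorithm B2 maintains the archive of nondominated points under Definition 3.4; a sketch, reusing `dominates` from above:

```python
def update_archive(archive, x_new, fx_new, v_new):
    """Update(A*_{n-1}, {X_n}): return the nondominated subset of the archive
    together with the new point, under the extended domination.

    archive: list of (x, F(x), V_G(x)) triples that are mutually nondominated.
    """
    for _, fx, v in archive:
        if dominates(fx, v, fx_new, v_new):       # X_n is dominated: archive unchanged
            return archive
    kept = [(x, fx, v) for (x, fx, v) in archive
            if not dominates(fx_new, v_new, fx, v)]
    kept.append((x_new, fx_new, v_new))           # add X_n, drop newly dominated points
    return kept
```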


More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Convex relaxations of chance constrained optimization problems

Convex relaxations of chance constrained optimization problems Convex relaxations of chance constrained optimization problems Shabbir Ahmed School of Industrial & Systems Engineering, Georgia Institute of Technology, 765 Ferst Drive, Atlanta, GA 30332. May 12, 2011

More information

Lecture 19 L 2 -Stochastic integration

Lecture 19 L 2 -Stochastic integration Lecture 19: L 2 -Stochastic integration 1 of 12 Course: Theory of Probability II Term: Spring 215 Instructor: Gordan Zitkovic Lecture 19 L 2 -Stochastic integration The stochastic integral for processes

More information

Quick Tour of the Topology of R. Steven Hurder, Dave Marker, & John Wood 1

Quick Tour of the Topology of R. Steven Hurder, Dave Marker, & John Wood 1 Quick Tour of the Topology of R Steven Hurder, Dave Marker, & John Wood 1 1 Department of Mathematics, University of Illinois at Chicago April 17, 2003 Preface i Chapter 1. The Topology of R 1 1. Open

More information

Chapter 2 Convex Analysis

Chapter 2 Convex Analysis Chapter 2 Convex Analysis The theory of nonsmooth analysis is based on convex analysis. Thus, we start this chapter by giving basic concepts and results of convexity (for further readings see also [202,

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Necessary optimality conditions for optimal control problems with nonsmooth mixed state and control constraints

Necessary optimality conditions for optimal control problems with nonsmooth mixed state and control constraints Necessary optimality conditions for optimal control problems with nonsmooth mixed state and control constraints An Li and Jane J. Ye Abstract. In this paper we study an optimal control problem with nonsmooth

More information

4. Convex optimization problems

4. Convex optimization problems Convex Optimization Boyd & Vandenberghe 4. Convex optimization problems optimization problem in standard form convex optimization problems quasiconvex optimization linear optimization quadratic optimization

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008. 1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function

More information

7 Complete metric spaces and function spaces

7 Complete metric spaces and function spaces 7 Complete metric spaces and function spaces 7.1 Completeness Let (X, d) be a metric space. Definition 7.1. A sequence (x n ) n N in X is a Cauchy sequence if for any ɛ > 0, there is N N such that n, m

More information

Zangwill s Global Convergence Theorem

Zangwill s Global Convergence Theorem Zangwill s Global Convergence Theorem A theory of global convergence has been given by Zangwill 1. This theory involves the notion of a set-valued mapping, or point-to-set mapping. Definition 1.1 Given

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information

Lecture 1. Stochastic Optimization: Introduction. January 8, 2018

Lecture 1. Stochastic Optimization: Introduction. January 8, 2018 Lecture 1 Stochastic Optimization: Introduction January 8, 2018 Optimization Concerned with mininmization/maximization of mathematical functions Often subject to constraints Euler (1707-1783): Nothing

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

On the Properties of Positive Spanning Sets and Positive Bases

On the Properties of Positive Spanning Sets and Positive Bases Noname manuscript No. (will be inserted by the editor) On the Properties of Positive Spanning Sets and Positive Bases Rommel G. Regis Received: May 30, 2015 / Accepted: date Abstract The concepts of positive

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

A Parametric Simplex Algorithm for Linear Vector Optimization Problems

A Parametric Simplex Algorithm for Linear Vector Optimization Problems A Parametric Simplex Algorithm for Linear Vector Optimization Problems Birgit Rudloff Firdevs Ulus Robert Vanderbei July 9, 2015 Abstract In this paper, a parametric simplex algorithm for solving linear

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

ERRATA: Probabilistic Techniques in Analysis

ERRATA: Probabilistic Techniques in Analysis ERRATA: Probabilistic Techniques in Analysis ERRATA 1 Updated April 25, 26 Page 3, line 13. A 1,..., A n are independent if P(A i1 A ij ) = P(A 1 ) P(A ij ) for every subset {i 1,..., i j } of {1,...,

More information

Topology, Math 581, Fall 2017 last updated: November 24, Topology 1, Math 581, Fall 2017: Notes and homework Krzysztof Chris Ciesielski

Topology, Math 581, Fall 2017 last updated: November 24, Topology 1, Math 581, Fall 2017: Notes and homework Krzysztof Chris Ciesielski Topology, Math 581, Fall 2017 last updated: November 24, 2017 1 Topology 1, Math 581, Fall 2017: Notes and homework Krzysztof Chris Ciesielski Class of August 17: Course and syllabus overview. Topology

More information

Appendix B Convex analysis

Appendix B Convex analysis This version: 28/02/2014 Appendix B Convex analysis In this appendix we review a few basic notions of convexity and related notions that will be important for us at various times. B.1 The Hausdorff distance

More information

Math 117: Topology of the Real Numbers

Math 117: Topology of the Real Numbers Math 117: Topology of the Real Numbers John Douglas Moore November 10, 2008 The goal of these notes is to highlight the most important topics presented in Chapter 3 of the text [1] and to provide a few

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

APPLICATIONS OF DIFFERENTIABILITY IN R n.

APPLICATIONS OF DIFFERENTIABILITY IN R n. APPLICATIONS OF DIFFERENTIABILITY IN R n. MATANIA BEN-ARTZI April 2015 Functions here are defined on a subset T R n and take values in R m, where m can be smaller, equal or greater than n. The (open) ball

More information

JUSTIN HARTMANN. F n Σ.

JUSTIN HARTMANN. F n Σ. BROWNIAN MOTION JUSTIN HARTMANN Abstract. This paper begins to explore a rigorous introduction to probability theory using ideas from algebra, measure theory, and other areas. We start with a basic explanation

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε 1. Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

2 Sequences, Continuity, and Limits

2 Sequences, Continuity, and Limits 2 Sequences, Continuity, and Limits In this chapter, we introduce the fundamental notions of continuity and limit of a real-valued function of two variables. As in ACICARA, the definitions as well as proofs

More information

REVIEW OF ESSENTIAL MATH 346 TOPICS

REVIEW OF ESSENTIAL MATH 346 TOPICS REVIEW OF ESSENTIAL MATH 346 TOPICS 1. AXIOMATIC STRUCTURE OF R Doğan Çömez The real number system is a complete ordered field, i.e., it is a set R which is endowed with addition and multiplication operations

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1 Lectures - Week 11 General First Order ODEs & Numerical Methods for IVPs In general, nonlinear problems are much more difficult to solve than linear ones. Unfortunately many phenomena exhibit nonlinear

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Mathematics 530. Practice Problems. n + 1 }

Mathematics 530. Practice Problems. n + 1 } Department of Mathematical Sciences University of Delaware Prof. T. Angell October 19, 2015 Mathematics 530 Practice Problems 1. Recall that an indifference relation on a partially ordered set is defined

More information

Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS

Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Here we consider systems of linear constraints, consisting of equations or inequalities or both. A feasible solution

More information

Convex optimization problems. Optimization problem in standard form

Convex optimization problems. Optimization problem in standard form Convex optimization problems optimization problem in standard form convex optimization problems linear optimization quadratic optimization geometric programming quasiconvex optimization generalized inequality

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

A Concise Course on Stochastic Partial Differential Equations

A Concise Course on Stochastic Partial Differential Equations A Concise Course on Stochastic Partial Differential Equations Michael Röckner Reference: C. Prevot, M. Röckner: Springer LN in Math. 1905, Berlin (2007) And see the references therein for the original

More information

MATH 202B - Problem Set 5

MATH 202B - Problem Set 5 MATH 202B - Problem Set 5 Walid Krichene (23265217) March 6, 2013 (5.1) Show that there exists a continuous function F : [0, 1] R which is monotonic on no interval of positive length. proof We know there

More information

4TE3/6TE3. Algorithms for. Continuous Optimization

4TE3/6TE3. Algorithms for. Continuous Optimization 4TE3/6TE3 Algorithms for Continuous Optimization (Algorithms for Constrained Nonlinear Optimization Problems) Tamás TERLAKY Computing and Software McMaster University Hamilton, November 2005 terlaky@mcmaster.ca

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

The small ball property in Banach spaces (quantitative results)

The small ball property in Banach spaces (quantitative results) The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

Math 341: Convex Geometry. Xi Chen

Math 341: Convex Geometry. Xi Chen Math 341: Convex Geometry Xi Chen 479 Central Academic Building, University of Alberta, Edmonton, Alberta T6G 2G1, CANADA E-mail address: xichen@math.ualberta.ca CHAPTER 1 Basics 1. Euclidean Geometry

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

Lecture: Convex Optimization Problems

Lecture: Convex Optimization Problems 1/36 Lecture: Convex Optimization Problems http://bicmr.pku.edu.cn/~wenzw/opt-2015-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Introduction 2/36 optimization

More information

CHAPTER 6. Limits of Functions. 1. Basic Definitions

CHAPTER 6. Limits of Functions. 1. Basic Definitions CHAPTER 6 Limits of Functions 1. Basic Definitions DEFINITION 6.1. Let D Ω R, x 0 be a limit point of D and f : D! R. The limit of f (x) at x 0 is L, if for each " > 0 there is a ± > 0 such that when x

More information

Introduction to Proofs in Analysis. updated December 5, By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION

Introduction to Proofs in Analysis. updated December 5, By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION Introduction to Proofs in Analysis updated December 5, 2016 By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION Purpose. These notes intend to introduce four main notions from

More information

An Alternative Proof of Primitivity of Indecomposable Nonnegative Matrices with a Positive Trace

An Alternative Proof of Primitivity of Indecomposable Nonnegative Matrices with a Positive Trace An Alternative Proof of Primitivity of Indecomposable Nonnegative Matrices with a Positive Trace Takao Fujimoto Abstract. This research memorandum is aimed at presenting an alternative proof to a well

More information

In English, this means that if we travel on a straight line between any two points in C, then we never leave C.

In English, this means that if we travel on a straight line between any two points in C, then we never leave C. Convex sets In this section, we will be introduced to some of the mathematical fundamentals of convex sets. In order to motivate some of the definitions, we will look at the closest point problem from

More information

Modulation of symmetric densities

Modulation of symmetric densities 1 Modulation of symmetric densities 1.1 Motivation This book deals with a formulation for the construction of continuous probability distributions and connected statistical aspects. Before we begin, a

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization Frank E. Curtis Department of Industrial and Systems Engineering, Lehigh University Daniel P. Robinson Department

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

minimize x subject to (x 2)(x 4) u,

minimize x subject to (x 2)(x 4) u, Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for

More information

Assignment 1: From the Definition of Convexity to Helley Theorem

Assignment 1: From the Definition of Convexity to Helley Theorem Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x

More information

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5.

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5. VISCOSITY SOLUTIONS PETER HINTZ We follow Han and Lin, Elliptic Partial Differential Equations, 5. 1. Motivation Throughout, we will assume that Ω R n is a bounded and connected domain and that a ij C(Ω)

More information

ECE 275B Homework # 1 Solutions Winter 2018

ECE 275B Homework # 1 Solutions Winter 2018 ECE 275B Homework # 1 Solutions Winter 2018 1. (a) Because x i are assumed to be independent realizations of a continuous random variable, it is almost surely (a.s.) 1 the case that x 1 < x 2 < < x n Thus,

More information

From now on, we will represent a metric space with (X, d). Here are some examples: i=1 (x i y i ) p ) 1 p, p 1.

From now on, we will represent a metric space with (X, d). Here are some examples: i=1 (x i y i ) p ) 1 p, p 1. Chapter 1 Metric spaces 1.1 Metric and convergence We will begin with some basic concepts. Definition 1.1. (Metric space) Metric space is a set X, with a metric satisfying: 1. d(x, y) 0, d(x, y) = 0 x

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information