A JOINT+MARGINAL APPROACH TO PARAMETRIC POLYNOMIAL OPTIMIZATION

JEAN B. LASSERRE*

Abstract. Given a compact parameter set Y ⊂ R^p, we consider polynomial optimization problems (P_y) on R^n whose description depends on the parameter y ∈ Y. We assume that one can compute all moments of some probability measure ϕ on Y, absolutely continuous with respect to the Lebesgue measure (e.g. Y is a box or a simplex and ϕ is uniformly distributed). We then provide a hierarchy of semidefinite relaxations whose associated sequence of optimal solutions converges to the moment vector of a probability measure that encodes all information about all global optimal solutions x*(y) of P_y, as y ranges over Y. In particular, one may approximate as closely as desired any polynomial functional of the optimal solutions, like e.g. their ϕ-mean. In addition, using this knowledge on moments, the measurable function y ↦ x*_k(y) of the k-th coordinate of optimal solutions can be estimated, e.g. by maximum entropy methods. Also, for a boolean variable x_k, one may approximate as closely as desired its persistency ϕ({y : x*_k(y) = 1}), i.e. the probability that in an optimal solution x*(y), the coordinate x*_k(y) takes the value 1. Last but not least, from an optimal solution of the dual semidefinite relaxations, one obtains a sequence of polynomial (resp. piecewise polynomial) lower approximations with L_1(ϕ) (resp. ϕ-almost uniform) convergence to the optimal value function.

Key words. Parametric and polynomial optimization; semidefinite relaxations

AMS subject classifications. 65D15, 65K05, 46N10, 90C22

1. Introduction. Roughly speaking, given a set of parameters Y and an optimization problem whose description depends on y ∈ Y (call it P_y), parametric optimization is concerned with the behavior and properties of the optimal value as well as primal (and possibly dual) optimal solutions of P_y, when y varies in Y. This is quite a challenging problem and in general one may only obtain information locally around some nominal value y_0 of the parameter. There is a vast and rich literature on the topic, and for a detailed treatment the interested reader is referred to e.g. Bonnans and Shapiro [5] and the many references therein. Sometimes, in the context of optimization with data uncertainty, some probability distribution ϕ on the parameter set Y is available, and in this context one is also interested in e.g. the distribution of the optimal value and of the optimal solutions, all viewed as random variables. In fact, very often only some moments of ϕ are known (typically first- and second-order moments), and the goal of this more realistic moment-based approach is to obtain optimal bounds over all distributions that share this moment information. In particular, for discrete optimization problems where the coefficients of the cost vector (of the objective function to optimize) are random variables with joint distribution ϕ, some bounds on the expected optimal value have been obtained. More recently, Natarajan et al. [20] extended the earlier work in [4] to provide a convex optimization problem for computing the so-called persistency values¹ of (discrete) variables, for a particular distribution ϕ in a certain set Φ of distributions that share some common moment information. However, the resulting second-order cone program requires knowledge of the convex hull of some discrete set of points, which

* LAAS-CNRS and Institute of Mathematics, University of Toulouse, 7 Avenue du Colonel Roche, 31077 Toulouse Cédex 4, France. Tel: +33 5 61 33 64 15; Fax: +33 5 61 33 69 36; email: lasserre@laas.fr

¹ Given a 0-1 optimization problem max{c'x : x ∈ X ⊂ {0, 1}^n} and a distribution ϕ on c, the persistency value of the variable x_i is Prob_ϕ(x*_i = 1) at an optimal solution x*(c) = (x*_i).

is possible when the number of vertices is small. The approach is nicely illustrated on a discrete choice problem and a stochastic knapsack problem. For more details on persistency in discrete optimization, the interested reader is referred to [20] and the references therein. Recently, Natarajan et al. [21] have considered mixed zero-one linear programs with uncertainty in the objective function and first- and second-order moment information. They show that computing the supremum of the expected optimal value (where the supremum is over all distributions sharing this moment information) reduces to solving a completely positive program.

Moment-based approaches also appear in robust optimization and stochastic programming under data uncertainty, where the goal is different since in this context decisions of interest must be taken before the uncertain data is known. For these min-max type problems, a popular approach initiated in Arrow et al. [2] is to optimize decisions against the worst possible distribution θ (on uncertain data) taken in some set Φ of candidate distributions that share some common moment information (in general first- and second-order moments). Recently, Delage and Ye [9] have even considered the case where only a confidence interval is known for first- and second-order moments. For a nice discussion the interested reader is referred to [9] and the many references therein.

In the context of solving systems of polynomial equations whose coefficients are themselves polynomials of some parameter y, specific parametric methods exist. For instance, one may compute symbolically, once and for all, what is called a comprehensive Gröbner basis, i.e., a fixed basis that, for all instances of y, is a Gröbner basis of the ideal associated with the polynomials in the system of equations; see Weispfenning [29] and more recently Rostalski [23] for more details. Then, when needed, and for any specific value of the parameter y, one may compute all complex solutions of the system of equations, e.g. by the eigenvalue method of Möller and Stetter [19]. However, one still needs to apply the latter method for each value of the parameter y. A similar two-step approach is also proposed for homotopy (instead of Gröbner bases) methods in [23]. The purpose of this paper, devoted to parametric optimization, is to show that in the case of polynomial optimization, all information about the optimal value and optimal solutions can be obtained, or at least approximated as closely as desired.

Contribution. We here restrict our attention to parametric polynomial optimization, that is, when P_y is described by polynomial equality and inequality constraints on both the parameter vector y and the optimization variables x. Moreover, the set Y is restricted to be a compact basic semi-algebraic set of R^p, and preferably a set sufficiently simple so that one may obtain the moments of some probability measure ϕ on Y, absolutely continuous with respect to the Lebesgue measure (or with respect to the counting measure if Y is discrete). For instance, if Y is a simple set (like a simplex or a box) one may choose ϕ to be the probability measure uniformly distributed on Y; typical candidates are polyhedra. Or sometimes, in the context of optimization with data uncertainty, ϕ is already specified. In this specific context we are going to show that one may get insightful information on the set of all global minimizers of P_y and on the global optimum, via what we call a joint+marginal approach.

Our contribution is as follows:

(a) Call J(y) (resp. X*_y ⊂ R^n) the optimal value (resp. the set of optimal solutions) of P_y for the value y ∈ Y of the parameter. We first define an infinite-

dimensional optimization problem P whose optimal value is exactly ρ = ∫_Y J(y) dϕ(y), i.e. the ϕ-mean of the global optimum. Any optimal solution µ* of P is a probability measure on R^n × R^p with marginal ϕ on R^p. It turns out that µ* encodes all information on the sets of global minimizers X*_y, y ∈ Y. Whence the name joint+marginal, as µ* is a joint distribution of x and y, and ϕ is the marginal of µ* on R^p.

(b) Next, we provide a hierarchy of semidefinite relaxations of P with associated sequence of optimal values (ρ_i)_i, in the spirit of the hierarchy defined in [15]. An optimal solution of the i-th semidefinite relaxation is a sequence z^i = (z^i_{αβ}) indexed in the monomial basis (x^α y^β) of the subspace R[x, y]_{2i} of polynomials of degree at most 2i. If for ϕ-almost all y ∈ Y, P_y has a unique global minimizer x*(y) ∈ R^n, then, as i → ∞, z^i converges pointwise to the sequence of moments of µ* defined in (a). In particular, one obtains the distribution of the optimal solution x*(y), and therefore one may approximate as closely as desired any polynomial functional of the solution x*(y), like e.g. the ϕ-mean or variance of x*(y). In addition, if the optimization variable x_k is boolean, then one may approximate as closely as desired its persistency ϕ({y : x*_k(y) = 1}) (i.e., the probability that x*_k(y) = 1 in an optimal solution x*(y)), as well as a necessary and sufficient condition for this persistency to be 1.

(c) Let e(k) ∈ N^n be the vector with e(k)_j = δ_{j=k}, j = 1,...,n. Then, as i → ∞, and for every β ∈ N^p, the sequence (z^i_{e(k)β}) converges to z*_{kβ} := ∫_Y y^β g_k(y) dϕ(y) for the measurable function y ↦ g_k(y) := x*_k(y). In other words, the sequence (z*_{kβ})_{β∈N^p} is the moment sequence of the measure dψ(y) := x*_k(y) dϕ(y) on Y. And so the k-th coordinate function y ↦ x*_k(y) of the global minimizer of P_y, y ∈ Y, can be estimated, e.g. by maximum entropy methods. Of course, the latter estimation is not pointwise, but it still provides useful information on optimal solutions, e.g. the shape of the function y ↦ x*_k(y), especially if the function x*_k(·) is continuous, as illustrated on some simple examples. For instance, for parametric polynomial equations, one may use this estimation of x*(y) as an initial point for Newton's method for any given value of the parameter y.

(d) Last but not least, from an optimal solution of the dual of the i-th semidefinite relaxation, one obtains a piecewise polynomial approximation of the optimal value function y ↦ J(y) that converges ϕ-almost surely to J.

Finally, the computational complexity of the above methodology is roughly the same as that of the moment approach described in [15] for an optimization problem with n + p variables, since we consider the joint distribution of the n variables x and the p parameters y. Hence, the approach is particularly interesting when the number of parameters is small, say 1 or 2. In addition, in the latter case the max-entropy estimation has been shown to be very efficient in several examples in the literature; see e.g. [6, 26, 27]. However, in view of the present status of SDP solvers, if no sparsity or symmetry is taken into account, as proposed in e.g. [17], the approach is limited to small to medium size polynomial optimization problems. Alternatively, one may also use LP-relaxations, which can handle larger size problems but with probably less precise results because of their poor convergence properties in general. But this computational price may not seem that high in view of the ambitious goal of the approach.

After all, keep in mind that by applying the moment approach to a single (n + p)-variable problem, one obtains information on the global optimal solutions of an n-variable problem that depends on p parameters; that is, one approximates n functions of p variables!

2. A related linear program. For a Borel space X let B(X) denote the Borel σ-field associated with X. Let R[x, y] denote the ring of polynomials in the variables x = (x_1,...,x_n) and the variables y = (y_1,...,y_p), whereas R[x, y]_k denotes its subspace of polynomials of degree at most k. Let Σ[x, y] ⊂ R[x, y] denote the subset of polynomials that are sums of squares (in short s.o.s.). For a real symmetric matrix A, the notation A ⪰ 0 stands for "A is positive semidefinite".

The parametric optimization problem. Let Y ⊂ R^p be a compact set, called the parameter set, and let f, h_j : R^n × R^p → R, j = 1,...,m, be continuous. For each y ∈ Y, fixed, consider the following optimization problem:

J(y) := inf_x { f_y(x) : h_{jy}(x) ≥ 0, j = 1,...,m },   (2.1)

where the functions f_y, h_{jy} : R^n → R are defined via

x ↦ f_y(x) := f(x, y);  x ↦ h_{jy}(x) := h_j(x, y), j = 1,...,m,   x ∈ R^n, y ∈ R^p.

Next, let K ⊂ R^n × R^p be the set

K := { (x, y) : y ∈ Y; h_j(x, y) ≥ 0, j = 1,...,m },   (2.2)

and for each y ∈ Y, let

K_y := { x ∈ R^n : h_j(x, y) ≥ 0, j = 1,...,m }.   (2.3)

The interpretation is as follows: Y is a set of parameters and for each instance y ∈ Y of the parameter, one wishes to compute an optimal decision vector x*(y) that solves problem (2.1). Let ϕ be a Borel probability measure on Y, with a positive density with respect to the Lebesgue measure on the smallest affine variety that contains Y. For instance, choose for ϕ the probability measure

ϕ(B) := ( ∫_Y dy )^{-1} ∫_B dy,   B ∈ B(Y),

uniformly distributed on Y (assuming of course that Y has nonempty interior). Of course, one may also treat the case of a discrete set of parameters (finite or countable) by taking for ϕ a discrete probability measure on Y with strictly positive weight at each point of the support. Sometimes, e.g. in the context of optimization with data uncertainty, ϕ is already specified. We will use ϕ (or more precisely, its moments) to get information on the distribution of optimal solutions x*(y) of P_y, viewed as random vectors. In the rest of the paper we assume that for every y ∈ Y, the set K_y in (2.3) is nonempty.

2.1. A related infinite-dimensional linear program. Let M(K)_+ be the set of finite Borel measures on K, and consider the following infinite-dimensional linear program P:

ρ := inf_{µ ∈ M(K)_+} { ∫_K f dµ : πµ = ϕ },   (2.4)

where πµ denotes the marginal of µ on R^p, that is, πµ is a probability measure on R^p defined by

πµ(B) := µ(R^n × B),   ∀B ∈ B(R^p).

Notice that µ(K) = 1 for any feasible solution µ of P. Indeed, as ϕ is a probability measure and πµ = ϕ, one has 1 = ϕ(Y) = µ(R^n × R^p) = µ(K).

Recall that for two Borel spaces X, Y, the graph Gr ψ ⊂ X × Y of a set-valued mapping ψ : X ⇉ Y is the set

Gr ψ := { (x, y) : x ∈ X; y ∈ ψ(x) }.

If ψ is measurable, then any measurable function h : X → Y with h(x) ∈ ψ(x) for every x ∈ X is called a measurable selector. (See §4.1 for more details.)

Lemma 2.1. Let both Y and K ⊂ R^n × R^p in (2.2) be compact. Then the set-valued mapping y ↦ K_y is Borel-measurable. In addition:
(a) The mapping y ↦ J(y) is measurable.
(b) There exists a measurable selector g of y ↦ K_y such that J(y) = f(g(y), y) for every y ∈ Y.

Proof. As K and Y are both compact, the set-valued mapping y ↦ K_y ⊂ R^n is compact-valued. Moreover, the graph of y ↦ K_y is by definition the set K, which is a Borel subset of R^n × R^p. Next, since x ↦ f_y(x) is continuous for every y ∈ Y, (a) and (b) follow from Propositions 4.3 and 4.4.

Theorem 2.2. Let both Y ⊂ R^p and K in (2.2) be compact, and assume that for every y ∈ Y the set K_y ⊂ R^n in (2.3) is nonempty. Let P be the optimization problem (2.4) and let X*_y := {x ∈ K_y : f(x, y) = J(y)}, y ∈ Y. Then:
(a) ρ = ∫_Y J(y) dϕ(y) and P has an optimal solution.
(b) For every optimal solution µ* of P, and for ϕ-almost all y ∈ Y, there is a probability measure ψ*(dx | y) on R^n, concentrated on X*_y, such that

µ*(C × B) = ∫_B ψ*(C | y) dϕ(y),   ∀B ∈ B(Y), C ∈ B(R^n).   (2.5)

(c) Assume that for ϕ-almost all y ∈ Y, the set of minimizers X*_y is the singleton {x*(y)} for some x*(y) ∈ K_y. Then there is a measurable mapping g : Y → R^n such that g(y) = x*(y) for every y ∈ Y;

ρ = ∫_Y f(g(y), y) dϕ(y),   (2.6)

and for every α ∈ N^n and β ∈ N^p:

∫_K x^α y^β dµ*(x, y) = ∫_Y y^β g(y)^α dϕ(y).   (2.7)

Proof. (a) As K is compact, then so is K_y for every y ∈ Y. Next, as K_y ≠ ∅ for every y ∈ Y and f is continuous, the set X*_y := {x ∈ K_y : f(x, y) = J(y)} is nonempty for every y ∈ Y. Let µ be any feasible solution of P, so that by definition its marginal on R^p is just ϕ. Since X*_y ≠ ∅ for every y ∈ Y, one has f_y(x) ≥ J(y) for all x ∈ K_y and all y ∈ Y. So f(x, y) ≥ J(y) for all (x, y) ∈ K, and therefore

∫_K f dµ ≥ ∫_K J(y) dµ = ∫_Y J(y) dϕ,

which proves that ρ ≥ ∫_Y J(y) dϕ. On the other hand, recall that K_y ≠ ∅ for every y ∈ Y. Consider the set-valued mapping y ↦ X*_y ⊂ K_y. As f is continuous and K is compact, X*_y is compact-valued. In addition, as f_y is continuous, by Proposition 4.4 there exists a measurable selector g : Y → R^n with g(y) ∈ X*_y (and so f(g(y), y) = J(y)). Therefore, for every y ∈ Y, let ψ_y := δ_{g(y)} be the Dirac probability measure with support on the singleton {g(y)} ⊂ X*_y, and let µ be the probability measure on K defined by

µ(C × B) := ∫_B 1_C(g(y)) ϕ(dy),   ∀B ∈ B(R^p), C ∈ B(R^n).

(The measure µ is well-defined because g is measurable.) Then µ is feasible for P and

ρ ≤ ∫_K f dµ = ∫_Y [ ∫_{K_y} f(x, y) dδ_{g(y)} ] dϕ(y) = ∫_Y f(g(y), y) dϕ(y) = ∫_Y J(y) dϕ(y),

which shows that µ is an optimal solution of P and ρ = ∫_Y J(y) dϕ(y).

(b) Let µ* be an arbitrary optimal solution of P, hence a probability measure on R^n × Y concentrated on Gr(y ↦ K_y) = K. Therefore, by Proposition 4.6, the probability measure µ* can be disintegrated as

µ*(C × B) := ∫_B ψ*(C | y) dϕ(y),   ∀B ∈ B(Y), C ∈ B(R^n),

where for all y ∈ Y, ψ*(· | y) is a probability measure on K_y. (The object ψ*(· | ·) is called a stochastic kernel; see Proposition 4.6.) Hence from (a),

ρ = ∫_Y J(y) dϕ(y) = ∫_K f(x, y) dµ*(x, y) = ∫_Y ( ∫_{K_y} f(x, y) ψ*(dx | y) ) dϕ(y).

Therefore, using f(x, y) ≥ J(y) on K,

0 = ∫_Y [ ∫_{K_y} ( f(x, y) − J(y) ) ψ*(dx | y) ] dϕ(y),

where the inner integrand is nonnegative, which implies ψ*(X*_y | y) = 1 for ϕ-almost all y ∈ Y.

(c) Let g : Y → R^n be the measurable mapping of Lemma 2.1(b). As J(y) = f(g(y), y) and (g(y), y) ∈ K, then necessarily g(y) ∈ X*_y for every y ∈ Y. Next, let µ* be an optimal solution of P, and let α ∈ N^n, β ∈ N^p. Then

∫_K x^α y^β dµ*(x, y) = ∫_Y y^β ( ∫_{X*_y} x^α ψ*(dx | y) ) dϕ(y) = ∫_Y y^β g(y)^α dϕ(y),

the desired result.

An optimal solution µ* of P encodes all information on the optimal solutions x*(y) of P_y. For instance, let B be a given Borel set of R^n. Then from Theorem 2.2,

Prob(x*(y) ∈ B) = µ*(B × R^p) = ∫_Y ψ*(B | y) dϕ(y),

with ψ* as in Theorem 2.2(b). Consequently, if one knows an optimal solution µ* of P, then one may evaluate functionals on the solutions of P_y, y ∈ Y. That is, assuming that for ϕ-almost all y ∈ Y, problem P_y has a unique optimal solution x*(y), and given a measurable mapping h : R^n → R^q, one may evaluate the functional ∫_Y h(x*(y)) dϕ(y). For instance, with x ↦ h(x) := x one obtains the mean vector E_ϕ(x*(y)) := ∫_Y x*(y) dϕ(y) of optimal solutions x*(y), y ∈ Y.

Corollary 2.3. Let both Y ⊂ R^p and K in (2.2) be compact. Assume that for every y ∈ Y the set K_y ⊂ R^n in (2.3) is nonempty, and for ϕ-almost all y ∈ Y, the set X*_y := {x ∈ K_y : J(y) = f(x, y)} is the singleton {x*(y)}. Then for every measurable mapping h : R^n → R^q,

∫_Y h(x*(y)) dϕ(y) = ∫_K h(x) dµ*(x, y),   (2.8)

where µ* is an optimal solution of P.

Proof. By Theorem 2.2(c),

∫_K h(x) dµ*(x, y) = ∫_Y [ ∫_{X*_y} h(x) ψ*(dx | y) ] dϕ(y) = ∫_Y h(x*(y)) dϕ(y).

Remark 2.4. If the set X*_y is not a singleton on some set of positive ϕ-measure, then Theorem 2.2(c) becomes: for each α ∈ N^n, there exists a measurable mapping g_α : Y → R such that

∫_K x^α y^β dµ* = ∫_Y y^β ( ∫_{X*_y} x^α ψ*(dx | y) ) dϕ(y) = ∫_Y y^β g_α(y) dϕ(y),

where

y ↦ g_α(y) = ∫_{X*_y} x^α ψ*(dx | y) = E_{µ*}[x^α | y],   y ∈ Y,

and E_{µ*}[· | y] denotes the conditional expectation operator associated with µ*.

2.2. Duality. Consider the following infinite-dimensional linear program P*:

ρ* := sup_{p ∈ R[y]} { ∫_Y p dϕ : f(x, y) − p(y) ≥ 0   ∀(x, y) ∈ K }.   (2.9)

Then P* is a dual of P.

Lemma 2.5. Let both Y ⊂ R^p and K in (2.2) be compact, and let P and P* be as in (2.4) and (2.9), respectively. Then there is no duality gap, i.e., ρ = ρ*.

Proof. For a topological space X denote by C(X) the space of bounded continuous functions on X. Let M(K) be the vector space of finite signed Borel measures on K (so that M(K)_+ is its positive cone). Let π : M(K) → M(Y) be defined by (πµ)(B) = µ((R^n × B) ∩ K) for all B ∈ B(Y), with adjoint mapping π* : C(Y) → C(K) defined as (x, y) ↦ (π*h)(x, y) := h(y), h ∈ C(Y). Put (2.4) in the framework of infinite-dimensional linear programs on vector spaces, as described in e.g. [1]. That is,

ρ = inf_{µ ∈ M(K)_+} { ⟨f, µ⟩ : πµ = ϕ, µ ≥ 0 },

with dual

ρ' = sup_{h ∈ C(Y)} { ⟨h, ϕ⟩ : f − π*h ≥ 0 on K }.

Endow M(K) (respectively M(Y)) with the weak topology σ(M(K), C(K)) (respectively σ(M(Y), C(Y))). One first proves that ρ = ρ' and then ρ' = ρ*. By [1, Theor. 3.1], to get ρ = ρ' it suffices to prove that the set D := {(πµ, ⟨f, µ⟩) : µ ∈ M(K)_+} is closed for the respective weak topologies σ(M(Y) × R, C(Y) × R) of M(Y) × R and σ(M(K), C(K)) of M(K). Therefore, consider a converging sequence πµ_n → a with µ_n ∈ M(K)_+. The sequence (µ_n) is uniformly bounded because

µ_n(K) = (πµ_n)(Y) = ⟨1, πµ_n⟩ → a(Y).

But by the Banach-Alaoglu theorem (see e.g. [3, Theor. 3.5.16]), the bounded closed sets of M(K) are compact in the weak topology. And so µ_{n_k} → µ for some µ ∈ M(K)_+ and some subsequence (n_k). Next, observe that for arbitrary h ∈ C(Y),

⟨h, πµ_{n_k}⟩ = ⟨π*h, µ_{n_k}⟩ → ⟨π*h, µ⟩ = ⟨h, πµ⟩,

where we have used that π*h ∈ C(K). Hence, combining the above with πµ_{n_k} → a, we obtain πµ = a. Similarly, ⟨f, µ_{n_k}⟩ → ⟨f, µ⟩ because f ∈ C(K). Hence D is closed and the desired result ρ = ρ' follows.

We next prove that ρ' = ρ*. Given ε > 0 fixed arbitrary, there is a function h_ε ∈ C(Y) such that f − h_ε ≥ 0 on K and ∫_Y h_ε dϕ ≥ ρ' − ε. By compactness of Y and the Stone-Weierstrass theorem, there is p_ε ∈ R[y] such that sup_{y∈Y} |h_ε(y) − p_ε(y)| ≤ ε. Hence the polynomial p'_ε := p_ε − ε is feasible for (2.9), with value ∫_Y p'_ε dϕ ≥ ρ' − 3ε, and as ε was arbitrary, the result ρ' = ρ* follows.

As shown next, optimal or nearly optimal solutions of P* provide us with polynomial lower approximations of the optimal value function y ↦ J(y) that converge to J(·) in the L_1(ϕ) norm. Moreover, one may also obtain a piecewise polynomial approximation that converges to J(·) ϕ-almost uniformly. (Recall that a sequence of measurable functions (g_n) on a measure space (Y, B(Y), ϕ) converges to g ϕ-almost uniformly if and only if for every ε > 0 there is a set A ∈ B(Y) such that ϕ(A) < ε and g_n → g uniformly on Y \ A.)

Corollary 2.6. Let both Y ⊂ R^p and K in (2.2) be compact, and assume that for every y ∈ Y the set K_y is nonempty. Let P* be as in (2.9). If (p_i)_{i∈N} ⊂ R[y] is a maximizing sequence of (2.9), then

∫_Y |J(y) − p_i(y)| dϕ → 0   as i → ∞.   (2.10)

Moreover, define the functions (p̃_i) as follows:

p̃_0 := p_0,   y ↦ p̃_i(y) := max[ p̃_{i-1}(y), p_i(y) ],   i = 1, 2,....

Then p̃_i → J(·) ϕ-almost uniformly.

Proof. By Lemma 2.5 we already know that ρ = ρ*, and so

∫_Y p_i(y) dϕ(y) → ρ* = ρ = ∫_Y J(y) dϕ.

Next, by feasibility of p_i in (2.9),

f(x, y) − p_i(y) ≥ 0   ∀(x, y) ∈ K   ⟹   inf_{x ∈ K_y} f(x, y) = J(y) ≥ p_i(y)   ∀y ∈ Y.

Hence (2.10) follows from p_i(y) ≤ J(y) on Y. With y ∈ Y fixed, the sequence (p̃_i(y))_i is obviously monotone nondecreasing and bounded above by J(y), hence with a limit p*(y) ≤ J(y). Therefore p̃_i has the pointwise limit y ↦ p*(y) ≤ J(y). Also, by the monotone convergence theorem, ∫_Y p̃_i(y) dϕ(y) → ∫_Y p*(y) dϕ(y). This latter fact combined with (2.10) and p_i(y) ≤ p̃_i(y) ≤ J(y) yields

0 = ∫_Y (J(y) − p*(y)) dϕ(y),

which in turn implies that p*(y) = J(y) for ϕ-almost all y ∈ Y. Therefore p̃_i(y) → J(y) for ϕ-almost all y ∈ Y, and so by Egoroff's theorem [3, Theor. 2.5.5], p̃_i → J(·) ϕ-almost uniformly.

3. A hierarchy of semidefinite relaxations. In general, solving the infinite-dimensional problem P and getting an optimal solution µ* is impossible. One possibility is to use numerical discretization schemes on a box containing K; see for instance [14]. But in the present context of parametric optimization, if one selects finitely many grid points (x, y), one is implicitly considering solving (or rather approximating) P_y for finitely many points y in a grid of Y, which we want to avoid. To avoid this numerical discretization scheme we will use specific features of P when its data f (resp. K) is a polynomial (resp. a compact basic semi-algebraic set). Therefore, in this section we consider a polynomial parametric optimization problem, a special case of (2.1), as we assume the following: f ∈ R[x, y] and h_j ∈ R[x, y] for every j = 1,...,m; K is compact and Y ⊂ R^p is a compact basic semi-algebraic set; hence the set K ⊂ R^n × R^p in (2.2) is a compact basic semi-algebraic set. We also assume that there is a probability measure ϕ on Y, absolutely continuous with respect to the Lebesgue measure, whose moments γ = (γ_β), β ∈ N^p, are available. As already mentioned, if Y is a simple set (like e.g. a simplex or a box) then one may choose ϕ to be the probability measure uniformly distributed on Y, for which all moments can be computed easily. Sometimes, in the context of optimization with data uncertainty, the probability measure ϕ is already specified, and in this case we assume that its moments γ = (γ_β), β ∈ N^p, are available.
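For instance, when Y is a box and ϕ is the uniform probability measure, the data γ = (γ_β) is available in closed form as a product of one-dimensional integrals. The following small sketch (our own illustration, not code from the paper; the helper name uniform_box_moments is hypothetical) computes these moments.

from itertools import product
import numpy as np

def uniform_box_moments(bounds, max_deg):
    """Return {beta: gamma_beta} for the uniform probability measure on prod_j [a_j, b_j]."""
    def mono_moment(a, b, k):
        # (1/(b-a)) * int_a^b t^k dt
        return (b**(k + 1) - a**(k + 1)) / ((k + 1) * (b - a))
    p = len(bounds)
    moments = {}
    for beta in product(range(max_deg + 1), repeat=p):
        if sum(beta) <= max_deg:
            moments[beta] = float(np.prod([mono_moment(a, b, k)
                                           for (a, b), k in zip(bounds, beta)]))
    return moments

# Example: Y = [0, 1], moments up to order 8 (enough for an order-4 relaxation);
# here gamma[(k,)] = 1/(k+1).
gamma = uniform_box_moments([(0.0, 1.0)], 8)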

3.1. Notation and preliminaries. Let N^n_i := {α ∈ N^n : |α| ≤ i}, with |α| = Σ_i α_i. With a sequence z = (z_{αβ}), α ∈ N^n, β ∈ N^p, indexed in the canonical basis (x^α y^β) of R[x, y], let L_z : R[x, y] → R be the linear mapping

f ( = Σ_{αβ} f_{αβ} x^α y^β ) ↦ L_z(f) := Σ_{αβ} f_{αβ} z_{αβ},   f ∈ R[x, y].

Moment matrix. The moment matrix M_i(z) associated with a sequence z = (z_{αβ}) has its rows and columns indexed in the canonical basis (x^α y^β), with entries

M_i(z)((α, β), (δ, γ)) = L_z(x^α y^β · x^δ y^γ) = z_{(α+δ)(β+γ)},

for every α, δ ∈ N^n_i and every β, γ ∈ N^p_i.

Localizing matrix. Let q be the polynomial (x, y) ↦ q(x, y) := Σ_{u,v} q_{uv} x^u y^v. The localizing matrix M_i(q z) associated with q ∈ R[x, y] and a sequence z = (z_{αβ}) has its rows and columns indexed in the canonical basis (x^α y^β), with entries

M_i(q z)((α, β), (δ, γ)) = L_z(q(x, y) x^α y^β x^δ y^γ) = Σ_{u ∈ N^n, v ∈ N^p} q_{uv} z_{(α+δ+u)(β+γ+v)},

for every α, δ ∈ N^n_i and every β, γ ∈ N^p_i.

A sequence z = (z_{αβ}) has a representing finite Borel measure supported on K if there exists a finite Borel measure µ on K such that

z_{αβ} = ∫_K x^α y^β dµ,   ∀α ∈ N^n, β ∈ N^p.

The next important result states a necessary and sufficient condition when K is compact and its defining polynomials (h_k) ⊂ R[x, y] satisfy some condition.

Assumption 3.1. Let (h_j)_{j=1}^t ⊂ R[x, y] be a given family of polynomials. There is some N such that the quadratic polynomial (x, y) ↦ N − ‖(x, y)‖² can be written

N − ‖(x, y)‖² = σ_0 + Σ_{j=1}^t σ_j h_j,

for some s.o.s. polynomials (σ_j)_{j=0}^t ⊂ Σ[x, y].

Theorem 3.2. Let K := {(x, y) : h_k(x, y) ≥ 0, k = 1,...,t} and let (h_k)_{k=1}^t satisfy Assumption 3.1. A sequence z = (z_{αβ}) has a representing measure on K if and only if, for all i = 0, 1,...,

M_i(z) ⪰ 0;   M_i(h_k z) ⪰ 0,   k = 1,...,t.

Theorem 3.2 is a direct consequence of Putinar's Positivstellensatz [22] and [25]. Of course, when Assumption 3.1 holds then K is compact. On the other hand, if K is compact and one knows a bound N on ‖(x, y)‖ on K, then it suffices to add the redundant quadratic constraint h_{t+1}(x, y) (:= N² − ‖(x, y)‖²) ≥ 0 to the definition of K, and Assumption 3.1 holds.
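For concreteness, here is a minimal sketch (not the paper's implementation; indexing scheme and function names are our own choices) of how M_i(z) and M_i(q z) can be assembled from a truncated moment sequence stored as a dictionary mapping exponent tuples of length n + p to values. One can then check the conditions of Theorem 3.2 numerically on a given z by inspecting the smallest eigenvalues of these matrices.

from itertools import product
import numpy as np

def monomials(num_vars, max_deg):
    """All exponent tuples of total degree <= max_deg, in a fixed order."""
    return [e for e in product(range(max_deg + 1), repeat=num_vars)
            if sum(e) <= max_deg]

def moment_matrix(z, num_vars, i):
    """M_i(z): entry indexed by (a, b) equals z_{a+b}; z must contain all moments up to degree 2i."""
    basis = monomials(num_vars, i)
    M = np.empty((len(basis), len(basis)))
    for r, a in enumerate(basis):
        for c, b in enumerate(basis):
            M[r, c] = z[tuple(ai + bi for ai, bi in zip(a, b))]
    return M

def localizing_matrix(z, q, num_vars, i):
    """M_i(q z): entry (a, b) equals sum_u q_u z_{a+b+u}, with q given as {exponent tuple: coefficient}."""
    basis = monomials(num_vars, i)
    M = np.empty((len(basis), len(basis)))
    for r, a in enumerate(basis):
        for c, b in enumerate(basis):
            M[r, c] = sum(cu * z[tuple(ai + bi + ui for ai, bi, ui in zip(a, b, u))]
                          for u, cu in q.items())
    return M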

3.2. Semidefinite relaxations. To compute (or at least approximate) the optimal value ρ of problem P in (2.4), we now provide a hierarchy of semidefinite relaxations in the spirit of those defined in [15]. Let K ⊂ R^n × R^p be as in (2.2), and let Y ⊂ R^p be the compact basic semi-algebraic set defined by

Y := { y ∈ R^p : h_k(y) ≥ 0, k = m+1,...,t }   (3.1)

for some polynomials (h_k)_{k=m+1}^t ⊂ R[y]; let v_k := ⌈(deg h_k)/2⌉ for every k = 1,...,t. Next, let γ = (γ_β), with γ_β = ∫_Y y^β dϕ(y), β ∈ N^p, be the moments of a probability measure ϕ on Y, absolutely continuous with respect to the Lebesgue measure, and let i_0 := max[⌈(deg f)/2⌉, max_k v_k]. For i ≥ i_0, consider the following semidefinite relaxations:

ρ_i = inf_z L_z(f)
s.t.  M_i(z) ⪰ 0,
      M_{i-v_j}(h_j z) ⪰ 0,   j = 1,...,t,   (3.2)
      L_z(y^β) = γ_β,   ∀β ∈ N^p_{2i}.

Theorem 3.3. Let K, Y be as in (2.2) and (3.1) respectively, and let (h_k)_{k=1}^t satisfy Assumption 3.1. Assume that for every y ∈ Y the set K_y is nonempty, and consider the semidefinite relaxations (3.2). Then:
(a) ρ_i ↑ ρ as i → ∞.
(b) Let z^i be a nearly optimal solution of (3.2), e.g. such that L_{z^i}(f) ≤ ρ_i + 1/i, and let g : Y → R^n be the measurable mapping in Theorem 2.2(c). If for ϕ-almost all y ∈ Y, J(y) is attained at a unique optimal solution x*(y), then

lim_{i→∞} z^i_{αβ} = ∫_Y y^β g(y)^α dϕ(y),   ∀α ∈ N^n, β ∈ N^p.   (3.3)

In particular, for every k = 1,...,n,

lim_{i→∞} z^i_{e(k)β} = ∫_Y y^β g_k(y) dϕ(y),   ∀β ∈ N^p,   (3.4)

where e(k)_j = δ_{j=k}, j = 1,...,n (with δ being the Kronecker symbol).

The proof is postponed to §4.2.

Remark 3.4. Observe that if ρ_i = +∞ for some index i in the hierarchy (and hence for all larger i), then the set K_y is empty for all y in some Borel set B of Y with ϕ(B) > 0. Conversely, one may prove that if K_y is empty for all y ∈ B with ϕ(B) > 0, then necessarily ρ_i = +∞ for all i sufficiently large. In other words, the hierarchy of semidefinite relaxations (3.2) may also provide a certificate of emptiness of K_y for some Borel set of Y with positive ϕ-measure.

3.3. The dual semidefinite relaxations. The dual of the semidefinite relaxation (3.2) reads:

ρ*_i = sup_{p,(σ_j)}  ∫_Y p dϕ  ( = Σ_{β ∈ N^p_{2i}} p_β γ_β )
s.t.  f − p = σ_0 + Σ_{j=1}^t σ_j h_j,
      p ∈ R[y];  σ_j ∈ Σ[x, y],  j = 0,...,t,   (3.5)
      deg p ≤ 2i;  deg σ_j h_j ≤ 2i,  j = 1,...,t.

Observe that (3.5) is a strengthening of (2.9), as one restricts to polynomials p ∈ R[y] of degree at most 2i and the nonnegativity of f − p in (2.9) is replaced with the stronger weighted s.o.s. representation in (3.5). Therefore ρ*_i ≤ ρ* for every i.

Theorem 3.5. Let K, Y be as in (2.2) and (3.1) respectively, and let (h_k)_{k=1}^t satisfy Assumption 3.1. Assume that for every y ∈ Y the set K_y is nonempty, and consider the semidefinite relaxations (3.5). Then:
(a) ρ*_i ↑ ρ as i → ∞.
(b) Let (p_i, (σ^i_j)) be a nearly optimal solution of (3.5), e.g. such that ∫_Y p_i dϕ ≥ ρ*_i − 1/i. Then p_i ≤ J(·) and

lim_{i→∞} ∫_Y |J(y) − p_i(y)| dϕ(y) = 0.   (3.6)

Moreover, if one defines

p̃_0 := p_0,   y ↦ p̃_i(y) := max[ p̃_{i-1}(y), p_i(y) ],   i = 1, 2,...,

then p̃_i → J(·) ϕ-almost uniformly on Y.

Proof. Recall that by Lemma 2.5, ρ = ρ*. Moreover, let (p_k) ⊂ R[y] be a maximizing sequence of (2.9) as in Corollary 2.6 with value s_k := ∫_Y p_k dϕ, and let p'_k := p_k − 1/k for every k, so that f − p'_k > 1/k on K. By Putinar's Positivstellensatz [22] (see Theorem 3.2), there exist s.o.s. polynomials (σ^k_j) ⊂ Σ[x, y] such that f − p'_k = σ^k_0 + Σ_j σ^k_j h_j. Letting d_k be the maximum degree of σ^k_0 and σ^k_j h_j, j = 1,...,t, it follows that (p_k − 1/k, (σ^k_j)) is a feasible solution of (3.5) with i := d_k, of value s_k − 1/k. Hence ρ ≥ ρ*_{d_k} ≥ s_k − 1/k and the result (a) follows because s_k → ρ and the sequence (ρ*_i) is monotone. Then (b) follows from Corollary 2.6.

Hence in Theorem 3.5, p_i ∈ R[y] provides a polynomial lower approximation of the optimal value function J(·), of degree at most 2i (the order of the moments (γ_β) of ϕ taken into account in the semidefinite relaxation (3.5)). Moreover, one may even define a piecewise polynomial lower approximation p̃_i that converges ϕ-almost uniformly to J(·) on Y.

Functionals of the optimal solutions. Theorem 3.3 provides a means of approximating any polynomial functional of the global minimizers of P_y, y ∈ Y. Indeed:

Corollary 3.6. Let K, Y be as in (2.2) and (3.1) respectively, and let (h_k)_{k=1}^t satisfy Assumption 3.1. Assume that for every y ∈ Y the set K_y is nonempty, and for ϕ-almost all y ∈ Y, J(y) is attained at a unique optimal solution x*(y) ∈ X*_y. Let

x ↦ h(x) := Σ_{α ∈ N^n} h_α x^α,

and let z^i be a nearly optimal solution of the semidefinite relaxations (3.2). Then, for i sufficiently large,

∫_Y h(x*(y)) dϕ(y) ≈ Σ_{α ∈ N^n} h_α z^i_{α0}.

Proof. The proof is an immediate consequence of Theorem 3.3 and Corollary 2.3.

3.4. Persistency for Boolean variables. One interesting and potentially useful application is in Boolean optimization. Indeed, suppose that for some subset I ⊆ {1,...,n}, the variables (x_i), i ∈ I, are boolean, that is, the definition of K in (2.2) includes the quadratic constraints x_i² − x_i = 0 for every i ∈ I. Then, for instance, one might be interested to determine whether in any optimal solution x*(y) of P_y, and for some index i ∈ I, one has x*_i(y) = 1 (or x*_i(y) = 0) for ϕ-almost all values of the parameter y. In [4, 20] the probability that x*_k(y) is 1 is called the persistency of the boolean variable x*_k(y).

Corollary 3.7. Let K, Y be as in (2.2) and (3.1) respectively. Let (h_k)_{k=1}^t satisfy Assumption 3.1. Assume that for every y ∈ Y the set K_y is nonempty. Let z^i be a nearly optimal solution of the semidefinite relaxations (3.2). Then for k ∈ I fixed:
(a) x*_k(y) = 1 in any optimal solution and for ϕ-almost all y ∈ Y, only if lim_{i→∞} z^i_{e(k)0} = 1.
(b) x*_k(y) = 0 in any optimal solution and for ϕ-almost all y ∈ Y, only if lim_{i→∞} z^i_{e(k)0} = 0.
Assume that for ϕ-almost all y ∈ Y, J(y) is attained at a unique optimal solution x*(y) ∈ X*_y. Then Prob(x*_k(y) = 1) = lim_{i→∞} z^i_{e(k)0}, and so:
(c) x*_k(y) = 1 for ϕ-almost all y ∈ Y if and only if lim_{i→∞} z^i_{e(k)0} = 1.
(d) x*_k(y) = 0 for ϕ-almost all y ∈ Y if and only if lim_{i→∞} z^i_{e(k)0} = 0.

Proof. (a) The "only if" part. Let α := e(k) ∈ N^n. From the proof of Theorem 3.3, for an arbitrary converging subsequence (i_l)_l ⊂ (i)_i as in (4.5),

lim_{l→∞} z^{i_l}_{e(k)0} = ∫_K x_k dµ*,

where µ* is some optimal solution of P. Hence, by Theorem 2.2(b), µ* can be disintegrated into ψ*(dx | y) dϕ(y), where ψ*(dx | y) is a probability measure on X*_y for every y ∈ Y. Therefore,

lim_{l→∞} z^{i_l}_{e(k)0} = ∫_Y ( ∫_{X*_y} x_k ψ*(dx | y) ) dϕ(y)
                        = ∫_Y ψ*(X*_y | y) dϕ(y)   [because x_k = 1 on X*_y]
                        = ∫_Y dϕ(y) = 1,

and as the converging subsequence (i_l)_l in (4.5) was arbitrary, the whole sequence (z^i_{e(k)0}) converges to 1, the desired result. The proof of (b), being exactly the same, is

omitted. Next, if for ϕ-almost all y ∈ Y, J(y) is attained at a singleton, then by Theorem 3.3(b),

lim_{i→∞} z^i_{e(k)0} = ∫_Y x*_k(y) dϕ(y) = ϕ({y : x*_k(y) = 1}) = Prob(x*_k(y) = 1),

from which (c) and (d) follow.

3.5. Estimating the density g_k(y). By Corollary 3.6, one may approximate any polynomial functional of the optimal solutions, like for instance the mean, variance, etc. (with respect to the probability measure ϕ). However, one may also wish to approximate (in some sense) the function y ↦ g_k(y), that is, the curve described by the k-th coordinate x*_k(y) of the optimal solution x*(y) when y varies in Y. So let g : Y → R^n be the measurable mapping in Theorem 3.3, and suppose that one knows some lower bound vector a = (a_k) ∈ R^n, where

a_k ≤ inf { x_k : (x, y) ∈ K },   k = 1,...,n.

Then for every k = 1,...,n, the measurable function ĝ_k : Y → R defined by

y ↦ ĝ_k(y) := g_k(y) − a_k,   y ∈ Y,   (3.7)

is nonnegative and ϕ-integrable. Hence for every k = 1,...,n, one may consider dλ := ĝ_k dϕ as a Borel measure on Y with unknown density ĝ_k with respect to ϕ, but with known moments u = (u_β). Indeed, using (3.4),

u_β := ∫_Y y^β dλ(y) = −a_k ∫_Y y^β dϕ(y) + ∫_Y y^β g_k(y) dϕ(y) = −a_k γ_β + z*_{e(k)β},   β ∈ N^p,   (3.8)

where for every k = 1,...,n, z*_{e(k)β} = lim_{i→∞} z^i_{e(k)β}, β ∈ N^p, with z^i an optimal (or nearly optimal) solution of the semidefinite relaxation (3.2). Hence we are now faced with a density estimation problem, that is: given the sequence of moments u_β = ∫_Y y^β ĝ_k(y) dϕ, β ∈ N^p, of the unknown nonnegative measurable function ĝ_k on Y, estimate ĝ_k. One possibility is to use the so-called maximum entropy approach, briefly described in the next section.

Maximum-entropy estimation. We briefly describe the maximum entropy estimation technique in the univariate case; the multivariate case generalizes easily. Let g ∈ L_1([0, 1]) be a nonnegative function² known only via the first 2d + 1 moments u = (u_j)_{j=0}^{2d} of its associated measure dϕ = g dx on [0, 1]. (In the context of the previous section, the function g to estimate is y ↦ ĝ_k(y) in (3.7), from the sequence u in (3.8) of its (multivariate) moments.)

² L_1([0, 1]) denotes the Banach space of integrable functions on the interval [0, 1] of the real line, equipped with the norm ‖g‖_1 = ∫_0^1 |g(x)| dx.

From that partial knowledge one wishes (a) to provide an estimate h_d of g such that the first 2d + 1 moments of the measure h_d dx match those of g dx, and (b) to analyze the asymptotic behavior of h_d as d → ∞. This problem has important applications in various areas of physics, engineering, and signal processing in particular. An elegant methodology is to search for h_d in a (finitely) parametrized family {h_d(λ, x)} of functions, and to optimize over the unknown parameters λ via a suitable criterion. For instance, one may wish to select an estimate h_d that maximizes some appropriate entropy. Several choices of entropy functional are possible, as long as one obtains a convex optimization problem in the finitely many coefficients λ_i. For more details the interested reader is referred to e.g. Borwein and Lewis [7, 8] and the many references therein. We here choose the Boltzmann-Shannon entropy H : L_1([0, 1]) → R ∪ {−∞}:

h ↦ H[h] := −∫_0^1 h(x) ln h(x) dx,   (3.9)

a strictly concave functional. Therefore, the problem reduces to

sup_h { H[h] : ∫_0^1 x^j h(x) dx = u_j,  j = 0,...,2d }.   (3.10)

The structure of this infinite-dimensional convex optimization problem permits to search for an optimal solution h*_d of the form

x ↦ h*_d(x) = exp( Σ_{j=0}^{2d} λ_j x^j ),   (3.11)

and so λ* is an optimal solution of the finite-dimensional unconstrained convex problem

θ(u) := sup_λ { ⟨u, λ⟩ − ∫_0^1 exp( Σ_{j=0}^{2d} λ_j x^j ) dx }.

Notice that the above function θ is just the Legendre-Fenchel transform of the convex function λ ↦ ∫_0^1 exp( Σ_{j=0}^{2d} λ_j x^j ) dx. An optimal solution can be calculated by applying first-order methods, in which case the gradient ∇v_d of the function

λ ↦ v_d(λ) := ⟨u, λ⟩ − ∫_0^1 exp( Σ_{j=0}^{2d} λ_j x^j ) dx

is provided by

∂v_d(λ)/∂λ_k = u_k − ∫_0^1 x^k exp( Σ_{j=0}^{2d} λ_j x^j ) dx,   k = 0,...,2d.

If one applies second-order methods, e.g. Newton's method, then computing the Hessian ∇²v_d at the current iterate λ reduces to computing

∂²v_d(λ)/∂λ_k ∂λ_j = −∫_0^1 x^{k+j} exp( Σ_{l=0}^{2d} λ_l x^l ) dx,   k, j = 0,...,2d.
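The Newton iteration just described fits in a few lines. Below is a minimal sketch (ours, not the paper's code; the name max_entropy_fit is hypothetical) that approximates the integrals by a simple Riemann sum on a uniform grid and takes plain undamped Newton steps; a robust implementation would use the cubature formulas discussed next and add damping or a line search.

import numpy as np

def max_entropy_fit(u, n_iter=50, n_quad=2000):
    """Fit h(x) = exp(sum_j lam[j] x^j) on [0,1] to the moments u_0,...,u_{2d} by Newton's method."""
    m = len(u)                                    # m = 2d + 1 moments
    x = np.linspace(0.0, 1.0, n_quad)
    dx = x[1] - x[0]
    V = np.vander(x, m, increasing=True)          # V[:, j] = x^j
    lam = np.zeros(m)
    lam[0] = np.log(max(u[0], 1e-12))             # start from the constant density u_0
    for _ in range(n_iter):
        h = np.exp(V @ lam)                                              # current density on the grid
        grad = np.asarray(u) - (V * h[:, None]).sum(axis=0) * dx         # u_k - int x^k h(x) dx
        hess = -(V[:, :, None] * V[:, None, :] * h[:, None, None]).sum(axis=0) * dx
        lam = lam - np.linalg.solve(hess, grad)                          # Newton step
    return lam

# Example: fit the first 5 moments of g(y) = sqrt(1 - y^2) on [0, 1] (cf. Example 3.9 below).
t = np.linspace(0.0, 1.0, 5000)
dt = t[1] - t[0]
u = np.array([(t**k * np.sqrt(1.0 - t**2)).sum() * dt for k in range(5)])
lam = max_entropy_fit(u)      # exponent coefficients of the degree-4 estimate, cf. (3.11)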

In simple cases like a box [a, b] (or [a, b]^n in the multivariate case), such quantities can be approximated quite accurately via cubature formulas, as described in e.g. [11]. In particular, several cubature formulas behave very well for exponentials of polynomials, as shown in e.g. Bender et al. [6]. An alternative with no cubature formula is also proposed in [18]. One has the following convergence result, which follows directly from [7, Theor. 1.7 and p. 259].

Proposition 3.8. Let g ∈ L_1([0, 1]) and for every d ∈ N let h*_d in (3.11) be an optimal solution of (3.10). Then, as d → ∞,

∫_0^1 ψ(x) (h*_d(x) − g(x)) dx → 0,

for every bounded measurable function ψ : [0, 1] → R which is continuous ϕ-almost everywhere.

Hence, the max-entropy estimate we obtain is not a pointwise estimate of g, and so, at some points of [0, 1], the max-entropy density h*_d and the density g to estimate may differ significantly. However, for sufficiently large d, both curves of h*_d and g are close to each other. For instance, in our context and with a one-dimensional parameter y in say Y = [0, 1], recall that g is the mapping y ↦ x*_k(y), and so in general, for fixed y, h*_d(y) is close to x*_k(y) and might be chosen as the k-th coordinate of an initial point x_0, input of a local minimization algorithm to find a local minimizer (with reasonable hope that it is the global minimizer x*(y)).

3.6. LP-relaxations. Of course, in view of the present status of semidefinite programming solvers, the proposed approach is limited to small to medium size problems because the size of the semidefinite relaxations (3.2) grows like O((n + p)^i). On the other hand, if there is some structured sparsity pattern in the original problem, then one may use alternative sparse semidefinite relaxations as defined in e.g. [28], which increases significantly the size of problems that can be addressed by this technique. Alternatively, instead of the hierarchy of semidefinite relaxations (3.2), one may also use an analogous hierarchy of LP-relaxations as defined in [16], whose convergence is also guaranteed. In view of the status of LP solvers, and although the former relaxations have much better convergence properties, sometimes it may be better to use LP-relaxations, e.g. if one wishes to obtain only crude approximations for larger size problems.

3.7. Illustrative examples. In this section we provide some simple illustrative examples. To show the potential of the approach, we have voluntarily chosen very simple examples for which one knows the solutions exactly, so as to compare the results we obtain with the exact optimal value and optimal solutions. The semidefinite relaxations (3.2) were implemented by using the software package GloptiPoly [12]. The max-entropy estimate h_d of g_k was computed by using Newton's method, where at each iterate λ^{(k)}:

λ^{(k+1)} = λ^{(k)} − ( ∇²v_d(λ^{(k)}) )^{-1} ∇v_d(λ^{(k)}).

Example 3.9. For illustration purposes, consider the toy example where Y := [0, 1],

K := { (x, y) : 1 − x² − y² ≥ 0; x, y ≥ 0 } ⊂ R²,   (x, y) ↦ f(x, y) := −x²y.

Hence for each value of the parameter y, the unique optimal solution is x*(y) := √(1 − y²), and so in Theorem 3.3(b), y ↦ g(y) = √(1 − y²).

Let ϕ be the probability measure uniformly distributed on [0, 1]. Therefore,

ρ = ∫_0^1 J(y) dϕ(y) = −∫_0^1 y (1 − y²) dy = −1/4.

Solving (3.2) with i := 3, that is, with moments up to order 6, one obtains the optimal value −0.25146. Solving (3.2) with i := 4, one obtains the optimal value −0.251786 and the moment sequence

z = (1, 0.7812, 0.5, 0.664, 0.3334, 0.3333, 0.5813, 0.25, 0.1964, 0.25, 0.5244, 0.2, 0.1333, 0.1334, 0.2, 0.481, 0.1667, 0.981, 0.833, 0.983, 0.1667).

Observe that

|z_{1k} − ∫_0^1 y^k √(1 − y²) dy| ≈ O(10^{-6}),   k = 0,...,4,
|z_{1k} − ∫_0^1 y^k √(1 − y²) dy| ≈ O(10^{-5}),   k = 5, 6, 7.

Using the max-entropy approach to approximate the density y ↦ g(y) on [0, 1], with the first 5 moments z_{1k}, k = 0,...,4, we find that the optimal function h_4 in (3.11) is obtained with

λ = (0.1564, 2.5316, 12.2194, 2.3835, 12.1867).

Both curves of g and h_4 are displayed in Figure 3.1. Observe that with only 5 moments, the max-entropy solution h_4 approximates g relatively well, even if it differs significantly at some points. Indeed, the shape of h_4 resembles very much that of g. Finally, from an optimal solution of (3.5) one obtains for p ∈ R[y] the degree-8 univariate polynomial

y ↦ p(y) = 0.4 − 0.999y − 0.876y² + 1.4364y³ − 1.2481y⁴ + 2.1261y⁵ − 2.139y⁶ + 1.1593y⁷ − 0.2641y⁸,

and Figure 3.2 displays the curve y ↦ J(y) − p(y) on [0, 1]. One observes that J ≥ p and the maximum difference is about 3·10^{-4}, attained close to 0, and much less for y ≥ 0.1, a good precision with only 8 moments.
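The exact quantities quoted above are easy to reproduce numerically. The following short sketch (ours, not part of the paper) recomputes ρ = −1/4 and the reference moments ∫_0^1 y^k √(1 − y²) dy against which the entries z_{1k} of the relaxation are compared.

from scipy.integrate import quad
import numpy as np

# phi-mean of the optimal value: rho = -int_0^1 y (1 - y^2) dy = -0.25
rho = -quad(lambda y: y * (1.0 - y**2), 0.0, 1.0)[0]
print(rho)

# Reference moments of g(y) = sqrt(1 - y^2), i.e. the limits of z_{1k}, k = 0,...,7
g_moments = [quad(lambda y, k=k: y**k * np.sqrt(1.0 - y**2), 0.0, 1.0)[0]
             for k in range(8)]
print(np.round(g_moments, 6))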

Fig. 3.1. Example 3.9: g(y) = √(1 − y²) versus h_4(y)

Fig. 3.2. Example 3.9: J(y) − p(y) on [0, 1]

Example 3.10. Again with Y := [0, 1], let

K := { (x, y) : 1 − x_1² − x_2² ≥ 0 } ⊂ R² × Y,   (x, y) ↦ f(x, y) := yx_1 + (1 − y)x_2.

For each value of the parameter y, the unique optimal solution x*(y) satisfies x*_1(y)² + x*_2(y)² = 1, with

x*_1(y)² = y² / (y² + (1 − y)²),   x*_2(y)² = (1 − y)² / (y² + (1 − y)²),

and optimal value

J(y) = −y²/√(y² + (1 − y)²) − (1 − y)²/√(y² + (1 − y)²) = −√(y² + (1 − y)²).

So in Theorem 3.3(b),

y ↦ g_1(y) = −y/√(y² + (1 − y)²),   y ↦ g_2(y) = −(1 − y)/√(y² + (1 − y)²),

and with ϕ being the probability measure uniformly distributed on [0, 1],

ρ = ∫_0^1 J(y) dϕ(y) = −∫_0^1 √(y² + (1 − y)²) dy ≈ −0.81162.

Solving (3.2) with i := 3, that is, with moments up to order 6, one obtains ρ_3 ≈ −0.8117 with |ρ_3 − ρ| ≈ O(10^{-5}). Solving (3.2) with i := 4, one obtains ρ_4 ≈ −0.81162 with |ρ_4 − ρ| ≈ O(10^{-6}), and the moment sequence (z_{e(1)k}), k = 0, 1, 2, 3, 4:

z_{e(1)k} = (−0.6232, −0.458, −0.2971, −0.2328, −0.197),

with

|z_{e(1)k} − ∫_0^1 y^k g_1(y) dy| ≈ O(10^{-5}),   k = 0,...,4.

Using the max-entropy approach to approximate the density y ↦ −g_1(y) on [0, 1], with the first 5 moments z_{e(1)k}, k = 0,...,4, we find that the optimal function h_4 in (3.11) is obtained with

λ = (3.61284, 15.66153266, 29.439127, 0.326347, 9.9884452),

and we find that

|z_{e(1)k} + ∫_0^1 y^k h_4(y) dy| ≈ O(10^{-11}),   k = 0,...,4.

In Figure 3.3 the two functions −g_1 and h_4 are displayed, and one observes a very good concordance.

Fig. 3.3. Example 3.10: h_4(y) versus −g_1(y) = y/√(y² + (1 − y)²)

Finally, from an optimal solution of (3.5) one obtains for p ∈ R[y] the following degree-8 univariate polynomial

y ↦ p(y) := −1.0 + 0.9983y − 0.4537y² − 0.9941y³ + 2.2488y⁴ − 7.6739y⁵ + 11.8448y⁶ − 7.966y⁷ + 1.993y⁸,

and Figure 3.4 displays the curve y ↦ J(y) − p(y) on [0, 1]. One observes that J ≥ p and the maximum difference is about 10^{-4}, a good precision with only 8 moments.

Fig. 3.4. Example 3.10: J(y) − p(y) on [0, 1]

Example 3.11. In this example one has Y = [0, 1], (x, y) ↦ f(x, y) := yx_1 + (1 − y)x_2, and

K := { (x, y) : yx_1² + x_2² − y ≤ 0;  x_1² + yx_2² − y ≤ 0 }.

That is, for each y the set K_y is the intersection of two ellipsoids. It is easy to check that 1 + x*_i(y) ≥ 0 for all y ∈ Y, i = 1, 2. With i = 4, the max-entropy estimate y ↦ h_4(y) for 1 + x*_1(y) is obtained with

λ = (0.2894, 1.7192, 19.8381, 36.8285, 18.4828),

whereas the max-entropy estimate y ↦ h_4(y) for 1 + x*_2(y) is obtained with

λ = (0.118, 3.928, 4.468, 1.796, 7.5782).

Figure 3.5 displays the curves of x*_1(y) and x*_2(y), as well as the constraint h_1(x*(y), y). Observe that h_1(x*(y), y) ≈ 0 on [0, 1], which means that for ϕ-almost all y ∈ [0, 1], at an optimal solution x*(y) the constraint h_1 is saturated. Figure 3.6 displays the curves of h_1(x*(y), y) and h_2(x*(y), y).

Example 3.12. This time Y = [0, 1], (x, y) ↦ f(x, y) := (1 − 2y)(x_1 + x_2), and

K := { (x, y) : yx_1² + x_2² − y = 0;  x_1² + yx_2² − y = 0 }.

Fig. 3.5. Example 3.11: x*_1(y), x*_2(y) and h_1(x*(y), y) on [0, 1]

Fig. 3.6. Example 3.11: h_1(x*(y), y) and h_2(x*(y), y) on [0, 1]

That is, for each y the set K_y is the intersection of two ellipses, and

x* = ( ±√(y/(1+y)), ±√(y/(1+y)) );   J(y) = −2 |1 − 2y| √(y/(1+y)).

With i = 4, the max-entropy estimate y ↦ h_4(y) for 1 + x*_1(y) is obtained with

λ = (0.371151, 12.51867, 43.21597, 46.985733, 16.395944).

In Figure 3.7 the curves y ↦ p(y) and y ↦ J(y) are displayed, whereas Figure 3.8 displays the curve y ↦ p(y) − J(y). One may see that p is a good lower approximation of J even with only 8 moments.

Fig. 3.7. Example 3.12: p(y) and J(y) on [0, 1]

Fig. 3.8. Example 3.12: the curve p(y) − J(y) on [0, 1]

On the other hand, Figure 3.9 displays h_4(y) − 1 versus x*_1(y), where the latter is −√(y/(1+y)) on [0, 1/2] and √(y/(1+y)) on [1/2, 1]. Here we see that the discontinuity

of x*_1(y) is difficult to approximate pointwise with few moments, despite a very good precision on the first five moments. Indeed:

| ∫_0^1 y^k (h_4(y) − 1) dy − ∫_0^1 y^k x*_1(y) dy | = O(10^{-14}),   k = 0,...,4.

Fig. 3.9. Example 3.12: h_4(y) − 1 and x*_1(y) on [0, 1]

Example 3.13. Consider the following system of 4 quadratic equations in 4 variables and one parameter y ∈ Y = [0, 1]:

x_1 x_2 − x_1 x_3 − x_4 = y
x_2 x_3 − x_2 x_4 − x_1 = y
−x_1 x_3 + x_3 x_4 − x_2 = y
x_1 x_4 − x_2 x_4 − x_3 = y,

for which one wishes to compute the minimum norm J(y) of real solutions as a function of y. We have computed the polynomial y ↦ p_k(y) from (3.5) with degree k = 6 and k = 8, with ϕ being the probability measure with uniform distribution on [0, 1]. The results displayed in Table 3.1 show that a pretty good approximation is obtained with few moments of ϕ.

Table 3.1
J(y) versus p_k(y) in Example 3.13

          y = 0.1    y = 0.5    y = 1
J(y)      0.4        1.         2.
p_6(y)    0.384      0.9264     1.9887
p_8(y)    0.39       0.9395     2.
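As a rough independent check of Table 3.1, one can solve the system numerically for a fixed y from many random starting points and record the smallest norm found among the real solutions. The sketch below (ours, not from the paper) does this under our reading of the system as the cyclic family x_k x_{k+1} − x_k x_{k+2} − x_{k+3} = y (indices mod 4); that reading of the garbled source should be double-checked, so the snippet is only illustrative.

import numpy as np
from scipy.optimize import fsolve

def residual(x, y):
    # Cyclic quadratic system: x_k x_{k+1} - x_k x_{k+2} - x_{k+3} = y, k = 0,...,3 (our assumption).
    return [x[k] * x[(k + 1) % 4] - x[k] * x[(k + 2) % 4] - x[(k + 3) % 4] - y
            for k in range(4)]

def min_norm_solution(y, n_starts=200, seed=0):
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(-2.0, 2.0, size=4)
        x, _, ier, _ = fsolve(residual, x0, args=(y,), full_output=True)
        if ier == 1 and np.max(np.abs(residual(x, y))) < 1e-9:
            best = min(best, float(np.linalg.norm(x)))
    return best

for y in (0.1, 0.5, 1.0):
    print(y, min_norm_solution(y))   # to be compared with the J(y) row of Table 3.1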

We end this section with the case where the density g_k to estimate is a step function, which would be the case in an optimization problem P_y with boolean variables (e.g. the variable x_k takes values in {0, 1}).

Example 3.14. Assume that, with a single parameter y ∈ [0, 1], the density g_k to estimate is the step function

y ↦ g_k(y) := 1 if y ∈ [0, 1/3] ∪ [2/3, 1];   0 otherwise.

The max-entropy estimate h_4 in (3.11) with 5 moments is obtained with

λ = [0.6547367219, 0.17724, 115.39354192, 0.4493171655, 96.226948865],

and we have

| ∫_0^1 y^k h_4(y) dy − ∫_0^1 y^k g_k(y) dy | ≈ O(10^{-8}),   k = 0,...,4.

In particular, the persistency ∫_0^1 g_k(y) dy = 2/3 of the variable x_k(y) is very well approximated (up to 10^{-8} precision) by ∫_0^1 h_4(y) dy, with only 5 moments.

Fig. 3.10. Example 3.14: Max-entropy estimate h_4(y) of 1_{[0,1/3]∪[2/3,1]}

Of course, in this case and with only 5 moments, the density h_4 is not a good pointwise approximation of the step function g_k(y) = 1_{[0,1/3]∪[2/3,1]}; however, its shape in Figure 3.10 reveals the two steps of value 1 separated by a step of value 0. A better pointwise approximation would require more moments.
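Since the moments of this step density are available in closed form, the experiment is easy to reproduce with the max_entropy_fit helper sketched in Section 3.5 above (a hypothetical routine of ours, not the paper's code); the grid-based integral for the persistency is likewise only an illustrative choice.

import numpy as np

def step_moments(d):
    # u_k = int_0^{1/3} y^k dy + int_{2/3}^1 y^k dy, k = 0,...,2d
    return np.array([((1/3)**(k + 1) + 1.0 - (2/3)**(k + 1)) / (k + 1)
                     for k in range(2 * d + 1)])

u = step_moments(2)              # 5 moments, as in Example 3.14
lam = max_entropy_fit(u)         # exponent coefficients of the estimate h_4

# Approximate persistency int_0^1 h_4(y) dy, expected to be close to 2/3.
t = np.linspace(0.0, 1.0, 5000)
dt = t[1] - t[0]
persistency = np.exp(np.vander(t, len(lam), increasing=True) @ lam).sum() * dt
print(persistency)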

4. Appendix.

4.1. Set-valued functions and measurable selectors. The material of this subsection is taken from [13, Appendix D]. Let X ⊂ R^p and Y ⊂ R^n be Borel spaces. A set-valued mapping ψ : X ⇉ Y is a function such that ψ(x) is a nonempty subset of Y for all x ∈ X. The graph Gr ψ of the set-valued function ψ is the subset of X × Y defined by

Gr ψ := { (x, y) : x ∈ X, y ∈ ψ(x) }.

Definition 4.1. A set-valued function ψ : X ⇉ Y is said to be:
(a) Borel-measurable if ψ^{-1}(B) is a Borel subset of X for every open set B ⊂ Y;
(b) compact-valued if ψ(x) is compact for every x ∈ X;
(c) closed if its graph Gr ψ is closed.

Definition 4.2. Let ψ : X ⇉ Y be a Borel-measurable set-valued function, and let F be the set of all measurable functions f : X → Y with f(x) ∈ ψ(x) for all x ∈ X. A function f ∈ F is called a measurable selector.

Let v : Gr ψ → R be a measurable function and let v* : X → R be defined by

v*(x) := inf_y { v(x, y) : y ∈ ψ(x) },   x ∈ X.

Proposition 4.3 ([13, D.4]). Let ψ : X ⇉ Y be compact-valued. Then the following are equivalent:
(a) ψ is Borel-measurable;
(b) ψ^{-1}(F) is a Borel subset of X for every closed set F ⊂ Y;
(c) Gr ψ is a Borel subset of X × Y;
(d) ψ is a measurable function from X to the space of nonempty compact subsets of Y, topologized by the Hausdorff metric.

Proposition 4.4 ([13, D.5]). Suppose that ψ : X ⇉ Y is compact-valued and the function y ↦ v(x, y) is lower semicontinuous on ψ(x) for every x ∈ X. Then there exists a measurable selector f ∈ F such that

v(x, f(x)) = v*(x) = min_y { v(x, y) : y ∈ ψ(x) },   x ∈ X.

Definition 4.5. For two Borel spaces X, Y, a stochastic kernel on Y given X is a function P(· | ·) such that:
(a) P(· | x) is a probability measure on Y for each fixed x ∈ X;
(b) P(B | ·) is a measurable function on X for each fixed Borel subset B of Y.

Let ψ : X ⇉ Y be a Borel-measurable set-valued function such that F is nonempty (equivalently, there is a measurable function f : X → Y whose graph is contained in Gr ψ). Let Φ be the class of stochastic kernels ϕ on Y given X such that ϕ(ψ(x) | x) = 1 for every x ∈ X. Finally, for a probability measure µ on X × Y, denote by µ_1 the marginal (or projection) of µ on X, i.e., µ_1(B) := µ(B × Y), B ∈ B(X), where B(X) is the Borel σ-field associated with X.

Proposition 4.6 ([13, D.8]). If µ is a probability measure on X × Y concentrated on the graph Gr ψ of ψ, then there exists a stochastic kernel ϕ ∈ Φ such that

µ(B × C) = ∫_B ϕ(C | x) µ_1(dx),   B ∈ B(X), C ∈ B(Y).

See also [1, p. 88-89].

4.2. Proof of Theorem 3.3. We already know that ρ_i ≤ ρ for all i ≥ i_0. We also need to prove that ρ_i > −∞ for sufficiently large i. Let Q ⊂ R[x, y] be the quadratic module generated by the polynomials {h_j} ⊂ R[x, y] that define K, i.e.,

Q := { σ ∈ R[x, y] : σ = σ_0 + Σ_{j=1}^t σ_j h_j   with {σ_j}_{j=0}^t ⊂ Σ[x, y] }.

In addition, let Q(l) ⊂ Q be the set of elements σ ∈ Q which have a representation σ_0 + Σ_{j=1}^t σ_j h_j for some s.o.s. family {σ_j} ⊂ Σ[x, y] with deg σ_0 ≤ 2l and deg σ_j h_j ≤ 2l for all j = 1,...,t.

Let i ∈ N be fixed. As K is compact, there exists N such that N ± x^α y^β > 0 on K, for all α ∈ N^n and β ∈ N^p with |α + β| ≤ 2i. Therefore, under Assumption 3.1, the polynomial N ± x^α y^β belongs to Q; see Putinar [22]. But there is even some l(i) such that N ± x^α y^β ∈ Q(l(i)) for every |α + β| ≤ 2i. Of course we also have N ± x^α y^β ∈ Q(l) for every |α + β| ≤ 2i, whenever l ≥ l(i). Therefore, let us take l(i) ≥ i. For every feasible solution z of the relaxation (3.2) of order l(i), one has

|z_{αβ}| = |L_z(x^α y^β)| ≤ N,   |α + β| ≤ 2i.

This follows from z_{00} = 1, M_{l(i)}(z) ⪰ 0 and M_{l(i)-v_j}(h_j z) ⪰ 0, which implies

N z_{00} ± z_{αβ} = L_z(N ± x^α y^β) = L_z(σ_0) + Σ_{j=1}^t L_z(σ_j h_j) ≥ 0

for some {σ_j} ⊂ Σ[x, y] with deg σ_j h_j ≤ 2 l(i). In particular, L_z(f) ≥ −N Σ_{α,β} |f_{αβ}|, which proves that ρ_{l(i)} > −∞, and so ρ_i > −∞ for all sufficiently large i.

From what precedes, and with k ∈ N arbitrary, let l(k) ≥ k and N_k be such that

N_k ± x^α y^β ∈ Q(l(k))   ∀α ∈ N^n, β ∈ N^p with |α + β| ≤ 2k.   (4.1)

Let i ≥ l(i_0), and let z^i be a nearly optimal solution of (3.2) with value

ρ_i ≤ L_{z^i}(f) ≤ ρ_i + 1/i ( ≤ ρ + 1/i ).   (4.2)

Fix k ∈ N. Notice that from (4.1), for every i ≥ l(k), one has

|L_{z^i}(x^α y^β)| ≤ N_k z^i_{00} = N_k,   ∀α ∈ N^n, β ∈ N^p with |α + β| ≤ 2k.

Therefore, for all i ≥ l(i_0),

|z^i_{αβ}| = |L_{z^i}(x^α y^β)| ≤ N'_k,   ∀α ∈ N^n, β ∈ N^p with |α + β| ≤ 2k,   (4.3)

where N'_k = max[N_k, V_k], with V_k := max_{α,β,i} { |z^i_{αβ}| : |α + β| ≤ 2k; l(i_0) ≤ i ≤ l(k) }. Complete each vector z^i with zeros to make it an infinite bounded sequence in l_∞, indexed in the canonical basis (x^α y^β) of R[x, y]. In view of (4.3),

|z^i_{αβ}| ≤ N'_k   ∀α ∈ N^n, β ∈ N^p with 2k − 1 ≤ |α + β| ≤ 2k,   (4.4)