Optimal Design in Regression and Spline Smoothing


Optimal Design in Regression and Spline Smoothing

by

Jaerin Cho

A thesis submitted to the Department of Mathematics and Statistics in conformity with the requirements for the degree of Doctor of Philosophy

Queen's University
Kingston, Ontario, Canada
July 2007

Copyright © Jaerin Cho, 2007

ABSTRACT

This thesis represents an attempt to generalize the classical Theory of Optimal Design to popular regression models based on rational and spline approximations. The problem of finding optimal designs for such models can be reduced to solving certain minimax problems. Explicit solutions to such problems can be obtained only in a few selected models, such as polynomial regression. Even when an optimal design can be found, it has, from the point of view of modern nonparametric regression, certain drawbacks. For example, in the polynomial regression case, the optimal design depends crucially on the degree $m$ of the approximating polynomial; hence it can be used only when this degree is given or known in advance. We present a partial, but practical, solution to this problem. Namely, the so-called Super Chebyshev Design has been found, which does not depend on the degree $m$ of the underlying polynomial regression over a large range of $m$, and at the same time is asymptotically more than 90% efficient. Similar results are obtained in the case of rational regression, even though the exact form of the optimal design in this case remains unknown. Optimal designs in the case of spline interpolation are also currently unknown. This problem, however, has a simple solution in the case of cardinal spline interpolation. Until recently, this model had been practically unknown in modern nonparametric regression. We demonstrate the usefulness of Cardinal Kernel Spline Estimates in nonparametric regression by proving their asymptotic optimality in certain classes of smooth functions. In this way, we have found, for the first time, a theoretical justification of the well-known empirical observation that cubic splines suffice in most practical applications.

ACKNOWLEDGEMENT

I am thankful to my supervisor, Professor Boris Levit, for guiding me throughout the writing of this thesis. I am also thankful to the committee for providing comments and suggestions. Finally, I would like to express my love to my family.

CONTENTS

Abstract
Acknowledgement

Chapter 1. Introduction
1.1. Brief history of optimal design
1.2. The scope of the thesis
1.2.1. Extension of the concept of optimal design
1.2.2. The super Chebyshev design
1.2.3. The rational regression model
1.2.4. Cardinal spline smoothing

Chapter 2. Overview of the optimal design theory
2.1. Introduction
2.1.1. The statistical model
2.1.2. The classical concept of optimal designs
2.1.3. Extended concept of optimal designs
2.2. Criteria of Optimal Designs
2.2.1. Fisher information and maximum likelihood estimators
2.2.2. Minimax estimation
2.2.3. An application to Optimal Designs
2.2.4. An extension of the class of designs $\Xi_n$
2.3. Some properties of the information matrix $M(\xi)$
2.3.1. Convex compactness of $\mathcal{M}$
2.3.2. The quadratic risk $d(x, \xi)$
2.3.3. Strict concavity of $\log|M|$
2.3.4. Strict convexity of $M^{-1}$
2.4. The equivalence theorem
2.4.1. A stronger version of the equivalence theorem

Chapter 3. Designs in polynomial regression
3.1. Introduction
3.1.1. The goals of this chapter
3.2. Optimal design in the polynomial regression of a given order
3.2.1. The existence and uniqueness of the optimal design
3.2.2. Conditions defining the optimal design
3.2.3. Introduction of Jacobi polynomials
3.2.4. Derivation of the optimal design
3.2.5. Another method of deriving the optimal design
3.2.6. The quadratic risk of the optimal design
3.3. The Chebyshev design
3.4. The super Chebyshev design
3.4.1. $(\xi_o, a)$-systems
3.4.2. The super Chebyshev system
3.4.3. The efficiency of the super Chebyshev design

Chapter 4. Designs in Rational Regression
4.1. Introduction
4.1.1. The goals of this chapter
4.2. Optimal design in the rational regression of a given pole set
4.2.1. The existence and uniqueness of the optimal design
4.2.2. Conditions defining the optimal design
4.2.3. Derivation of the optimal design
4.2.4. The quadratic risk of the optimal design
4.3. The Chebyshev design in rational regression
4.3.1. Orthogonal system of rational functions on $U = \{z : |z| = 1\}$
4.3.2. An orthogonal basis for the Chebyshev design
4.3.3. The quadratic risk of the Chebyshev design
4.3.4. Asymptotic behavior of the quadratic risk $d(x, \xi_c)$
4.4. The super Chebyshev design
4.4.1. The existence of $(\xi_o, a)$-systems
4.4.2. Relative efficiency of the super Chebyshev design

Chapter 5. Nonparametric regression based on fundamental splines of cardinal interpolation
5.1. Introduction
5.1.1. The observation model
5.1.2. Kernel type estimators
5.1.3. Goals of this chapter
5.2. B-splines with equally spaced knots
5.2.1. Definitions
5.2.2. B-splines as probability densities
5.2.3. Asymptotic behavior of B-splines for large $m$
5.2.4. Generating functions
5.3. C-splines and their properties
5.3.1. Definitions
5.3.2. Examples
5.3.3. Asymptotic behavior of C-splines for large $x$
5.3.4. Fourier transforms $\hat{L}_m$
5.3.5. Further properties of $\hat{L}_m$
5.3.6. Asymptotic behavior of C-splines for large $m$
5.4. Kernel spline estimates
5.4.1. Poisson summation formula
5.4.2. A bound for the remainder terms
5.4.3. Quadratic risk of C-spline estimates
5.5. Asymptotic optimality of C-splines
5.5.1. Asymptotically optimal bandwidth
5.5.2. The main result
5.5.3. The simulation result
5.6. An application of the method of GCV
5.6.1. Definitions
5.6.2. Properties of GCV
5.6.3. An application to C-spline

References

9 Chapter. Introduction In the classical theory of regression, a set of values x, x 2,..., x n of an independent variable x is selected, and observations are made on a related response variable y corresponding to the selected x-values. If y i denotes the response corresponding to x i, it is then assumed that y, y 2,..., y n are uncorrelated variables with common variance σ 2. If it is assumed additionally that there is a vector of parameters θ = (θ, θ 2,..., θ m ) Θ R m such that the trend function f(x) = E(y x) is completely determined by the parameter θ, such models are referred to as parametric regression models. More specifically, if it is assumed that the trend function f(x) belongs to the linear subspace spanned by a given set of linearly independent functions f (x), f 2 (x),..., f m (x) the model is called a linear regression model. The most popular linear regression model is the polynomial regression model which assumes that f(x) = m j= θ jx j. Any linear regression model can be represented in the form of a standard linear model, where y = Xθ + ε, y = (y i ) n, X = (X ij ) = (f j (x i )) n m, θ = (θ j ) m and ε = (ε i ) n. () It is well known that if the matrix X X is non-singular, the least squares estimates ˆθ = (X X) X y are the unique minimum variance unbiased estimates of the corresponding components of θ and Cov(ˆθ θ) = σ 2 (X X). The corresponding trend estimate is ˆf(x) = m j= ˆθ j f j (x), and its error, at a given point x, has mean 0 and variance σ 2 h (x)(x X) h(x), where h(x) = (f (x), f 2 (x),..., f m (x)). The collection ξ = {x i, i =,..., n} of the observation points is commonly referred to as the design of the model. The goal of the theory of optimal designs is to determine the best designs in a way that is both practically meaningful and theoretically feasible. There are many optimality criteria. An optimality criterion which has been widely studied is that of D-optimality in which the determinant of the covariance matrix σ 2 (X X) is minimized (or, equivalently, the determinant of the Fisher information matrix σ 2 X X is maximized). Another criterion, so-called G-optimality, is concerned with the variance of the predicted response. A G-optimal design (or minimax design) is one which minimizes the maximum variance max x E( ˆf(x) f(x)) 2.

The minimax design has the clearest practical meaning. In this thesis, the term optimal design will be synonymous with minimax design.

1.1. Brief history of optimal design.

Historically, polynomial regression was the first model for which optimal designs were found, independently by different authors, whose findings later led to the foundation of the general theory of optimal design. It is one of the simplest cases of the general theory, and represents one of a relatively small number of examples for which optimal designs can be found explicitly. In the case $x \in [-1, 1]$, the optimal designs were worked out numerically for lower degree polynomials by Smith (1918) [44], up to the 5th degree. De La Garza (1954) [4] showed that for a polynomial of degree $m - 1$ on any given interval, the information matrix $X'X$ corresponding to observations made at any $k$ points, $k \ge m$, can always be attained from observations made at exactly $m$ points in the same interval. There are certain advantages when we apply this result to determining the optimal designs or the D-optimal designs. This result shows that in finding optimal designs, it is sufficient to consider designs $x_1, \ldots, x_n$ which contain only $m$ different points (cf. Lemma 3.1). Assuming that the design contains only $m$ different points $x_1, \ldots, x_m$, with proportions $p_i$ of the number of observations at each $x_i$,

$$\det(X'X) \propto \Big(\prod_{i=1}^{m} p_i\Big) (\det F)^2, \quad \text{where} \quad \det F = \det\big(x_i^{j-1}\big)_{i,j=1}^{m} = \prod_{1 \le k < l \le m} (x_k - x_l)$$

is the so-called Vandermonde determinant. Since $\prod_{i=1}^{m} p_i$ is maximal when $p_i = 1/m$ $(i = 1, 2, \ldots, m)$, the problem is to determine the observation points $x_1, \ldots, x_m \in [-1, 1]$ maximizing the Vandermonde determinant. The solution of this classical problem, also known as the problem of electrostatic equilibrium, goes back to Stieltjes (1885); see e.g. Szegő [45]. It turns out that the D-optimal design is given by the zeros of the $m$th degree polynomial $(1 - x^2)P'_{m-1}(x)$, where $P_{m-1}(x)$ is the classical Legendre polynomial of degree $m - 1$ (cf. Theorem 3.7).
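As a quick numerical companion to this classical result, the following sketch (NumPy; the helper name is ours) computes the support of the D-optimal design as the zeros of $(1 - x^2)P'_{m-1}(x)$:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def d_optimal_nodes(m):
    """Support of the D-optimal design for polynomial regression with
    m parameters (degree m - 1) on [-1, 1]: the m zeros of the
    degree-m polynomial (1 - x^2) P'_{m-1}(x)."""
    c = np.zeros(m)
    c[m - 1] = 1.0                           # Legendre series for P_{m-1}
    interior = leg.legroots(leg.legder(c))   # zeros of P'_{m-1}
    return np.sort(np.concatenate(([-1.0], interior, [1.0])))

print(d_optimal_nodes(5))
# [-1.     -0.6547  0.      0.6547  1.    ]   (interior zeros: +-sqrt(3/7))
```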

Now let us assume that the observations $y_i$ are made at $m$ different observation points (nodes) $x_i \in [-1, 1]$, $i = 1, \ldots, m$. Note that in this case the problem of fitting a polynomial of degree $m - 1$ by least squares is equivalent to interpolating the observations $y_i$ by such a polynomial. If we replace the original power basis $1, x, \ldots, x^{m-1}$ by the basis provided by the fundamental polynomials of Lagrange interpolation with nodes $x_1, \ldots, x_m$,

$$L_i(x) = \frac{L(x)}{L'(x_i)(x - x_i)}, \quad \text{where} \quad L(x) = (x - x_1)(x - x_2)\cdots(x - x_m),$$

we can easily obtain

$$\hat f(x) = \sum_{j=1}^{m} \hat\theta_j x^{j-1} = \sum_{j=1}^{m} y_j L_j(x).$$

It then follows easily from (1) that

$$E(\hat f(x) - f(x))^2 = \sigma^2 \sum_{i=1}^{m} L_i^2(x).$$

Therefore $\xi = \{x_i\}$ is an optimal design if and only if it minimizes $\max_x \sum_{i=1}^{m} L_i^2(x)$ among all such designs (see Sec. 3.2). The problem of minimizing $\max_x \sum_{i=1}^{m} L_i^2(x)$ was first considered and solved by Fejér (1932) [10]. He proved that the design $\xi = \{x_i\}$ minimizing $\max_x \sum_{i=1}^{m} L_i^2(x)$ is again given by the zeros of the polynomial $(1 - x^2)P'_{m-1}(x)$. Using the De La Garza result [4], Guest [16] rediscovered in 1958 the explicit formula for the optimal polynomial design, obtained earlier by Fejér. At the same time, Hoel [17] found the D-optimal design for polynomial regression by combining the De La Garza result with the solution to the electrostatic equilibrium problem already known to Stieltjes. Thus he concluded that the optimal and the D-optimal designs are the same in the case of polynomial regression. This result prompted Kiefer and Wolfowitz to present their celebrated result, the so-called Equivalence Theorem. Surprisingly, it says that for any compact set $\mathcal{X}$, and for any system of continuous linearly independent functions $f_i(x)$, $i = 1, \ldots, m$, defined on $\mathcal{X}$, optimal designs are equivalent to D-optimal designs, and the minimax risk depends neither on the set $\mathcal{X}$, nor on the functions $f_i$ themselves, but exclusively on their number $m$. First, in [25], they proved this equivalence under the additional restriction that the design consists of $m$ points and puts equal mass at each point. Later, in [26], they proved the theorem in its most general form (see Theorem 2.15). A more general approach to the Equivalence Theorem was later proposed by Karlin and Studden [21]. They have shown that the existence of optimal designs in a general linear regression model follows, essentially, from von Neumann's classical Minimax Theorem; see Lemma 2.18.
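Fejér's characterization is easy to probe numerically. The sketch below (NumPy; the helper name is ours) evaluates $\max_x \sum_i L_i^2(x)$ for the Fejér nodes, where the maximum equals 1, and for equally spaced nodes, where it is larger and grows with $m$:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def max_sum_lagrange_sq(nodes, grid=np.linspace(-1, 1, 4001)):
    """max over [-1, 1] of sum_i L_i(x)^2 for the given nodes."""
    nodes = np.asarray(nodes, float)
    total = np.zeros_like(grid)
    for i, xi in enumerate(nodes):
        others = np.delete(nodes, i)
        Li = np.prod((grid[:, None] - others) / (xi - others), axis=1)
        total += Li ** 2
    return total.max()

m = 7
c = np.zeros(m); c[-1] = 1.0
fejer = np.concatenate(([-1.0], leg.legroots(leg.legder(c)), [1.0]))
print(max_sum_lagrange_sq(fejer))                  # = 1 (Fejer's theorem)
print(max_sum_lagrange_sq(np.linspace(-1, 1, m)))  # > 1 for equispaced nodes
```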

The results of Guest and Hoel (or the corresponding earlier results of Fejér and Stieltjes), together with the Equivalence Theorem, form the basis of the classical Theory of Optimal Design. Detailed discussions of this theory and its foundations are presented in [9], [20] and [34]. Guest [16] also compared the variances of the fitted polynomials corresponding to the discrete uniform design and the optimal design. He showed that for most of the region the advantage lies with the uniform spacing method, whereas at the extremes of the region of interpolation the advantage lies decidedly with the optimal design. Generalizing earlier results going back to Stieltjes, Szegő ([45], p. 232) showed that, in general, the zeros $\theta_v$ of the $m$th degree Jacobi polynomial $P_m^{(\alpha,\beta)}(\cos\theta)$ satisfy the asymptotic relation

$$\theta_v = \frac{1}{m}\,\{v\pi + O(1)\},$$

with $O(1)$ uniformly bounded for all values of $v = 1, 2, \ldots, m$ and $m = 1, 2, \ldots$. In particular, this applies to the Legendre polynomials, since $P'_m(x) = \mathrm{const} \cdot P_{m-1}^{(1,1)}(x)$. Using this result, Fedorov ([9], pp. 91–92) notes that the sequence of optimal designs $\xi_m^*$ converges weakly to the (continuous) Chebyshev design $\xi_c$, where $\xi_c$ has the density $\pi^{-1}(1 - x^2)^{-1/2}$. Let us define the G-efficiency of a given design $\xi$ with respect to the optimal design $\xi_m^*$ as $d_m(\xi_m^*)/d_m(\xi)$, where $d_m(\xi) = \max_x E(\hat f(x) - f(x))^2$. Of course, here $\hat f(x)$ depends on the design $\xi$. Also denote the D-efficiency of a design $\xi$ as $(D_m(\xi)/D_m(\xi_m^*))^{1/m}$, where $D_m(\xi) = \det(X'X)$. A natural question is whether the limiting design $\xi_c$ is asymptotically efficient in the sense of the thus introduced G- and D-efficiencies. It turns out that the answer to this question is negative in the first case and positive in the second. Namely, Kiefer and Studden [24] showed that the limiting G-efficiency of $\xi_c$ is

$$\lim_{m \to \infty} \frac{d_m(\xi_m^*)}{d_m(\xi_c)} = \frac{1}{2}, \qquad (2)$$

while the limiting D-efficiency of $\xi_c$ is

$$\lim_{m \to \infty} \left( \frac{D_m(\xi_c)}{D_m(\xi_m^*)} \right)^{1/m} = 1.$$

Thus, surprisingly, limiting D-efficiency does not guarantee limiting G-efficiency, even though, for any given $m$, D-efficiency is equivalent to G-efficiency by the Equivalence Theorem. On the contrary, asymptotic G-efficiency of a sequence of designs guarantees their asymptotic D-efficiency (see [24]).
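The limit (2) can be checked numerically. In the sketch below (NumPy; the helper name is ours), the information matrix of $\xi_c$ is computed by Gauss–Chebyshev quadrature, which is exact for the polynomial entries involved, and the ratio $d_m(\xi_c)/m$ is seen to approach 2; recall that $d_m(\xi_m^*) = m$, by the Equivalence Theorem:

```python
import numpy as np

def max_risk_ratio_chebyshev(m, n_quad=400):
    """max_x d(x, xi_c) / m for the arcsine (Chebyshev) design; the
    information matrix is computed by Gauss-Chebyshev quadrature."""
    k = np.arange(1, n_quad + 1)
    x = np.cos((2 * k - 1) * np.pi / (2 * n_quad))   # quadrature nodes
    H = np.vander(x, m, increasing=True)
    M = H.T @ H / n_quad                             # M(xi_c)
    grid = np.linspace(-1, 1, 4001)
    G = np.vander(grid, m, increasing=True)
    d = np.einsum('ij,jk,ik->i', G, np.linalg.inv(M), G)
    return d.max() / m

for m in (5, 8, 12):    # power basis: keep m moderate for conditioning
    print(m, max_risk_ratio_chebyshev(m))   # (2m - 1)/m, approaching 2
```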

Kiefer and Studden [24] also showed that, for any $\varepsilon > 0$, a design $\xi_\varepsilon$ which doesn't depend on $m$ and satisfies

$$\liminf_{m \to \infty} \frac{d_m(\xi_m^*)}{d_m(\xi_\varepsilon)} \ge 1 - \varepsilon \qquad (3)$$

can be obtained by just a slight modification of the measure $\xi_c$, namely, by adding a small additional mass near the end-points $\pm 1$.

1.2. The scope of the thesis.

1.2.1. Extension of the concept of optimal design.

The classical Theory of Optimal Design was developed under the tacit assumption that only least squares estimates would be used. This leaves open the question of to what degree its optimality results hold when no such restrictive assumption is made about the class of possible estimates. Let us assume that the error terms $\varepsilon_i$ are normally distributed independent random variables, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. In this thesis, the classical theory of optimal designs for the Gaussian observation model is generalized to arbitrary possible estimates under squared losses (see Sec. 2.2). We have chosen to present complete proofs of all the relevant results in the theory of optimal designs and, where possible, to indicate further simplifications of the classical results (see Ch. 2). In this context, we first discuss applications of the general theory of optimal designs to polynomial regression (see Sec. 3.2).

1.2.2. The super Chebyshev design.

As we already mentioned, in the polynomial regression of a given order $m$, there exists a unique optimal design $\xi_m^*$. However, $\xi_m^*$ depends crucially on the order $m$; consequently, it can be used only when $m$ is given a priori. Actually, the maximal risk can be very sensitive to the design: even a relatively small change of design can increase the risk several times. From the modern perspective, $m$ is hardly ever exactly known. Even though parametric regression undoubtedly remains the most prevalent approach to regression analysis, it requires very specific information about the degree of the polynomial regression function $f(x)$. Thus parametric regression is appropriate only when there is adequate information about the degree of $f(x)$. On the other hand, non-parametric regression generally only assumes that the regression curve belongs to some infinite dimensional collection of functions. Modern approaches to non-parametric regression require computation and careful comparison of a collection of different estimates, often using polynomial estimators of different, possibly high, degrees.

This leads to the problem of finding designs which would be nearly optimal for a larger set of possible values of the order $m$. To deal with this problem, we will show first that, in the case of the Chebyshev design $\xi_c$, which does not require the knowledge of $m$, the quadratic risk uniformly converges to the optimal risk value $m$ on any closed interval $[a, b] \subset (-1, 1)$, whereas its corresponding maximal risk on $[-1, 1]$ is only twice as large (cf. (2)) as that of the optimal design. Note that according to (3), there exists a design $\xi_\varepsilon$ which is almost optimal for all sufficiently large $m$. Unfortunately, the proof of the existence of such designs is not constructive; cf. [24]. Therefore, in Sec. 3.4, we introduce the so-called super Chebyshev design

$$\xi_s = \alpha\,\xi_c + (1 - \alpha)\,\xi_e,$$

where $\alpha$ is a constant, $\xi_c$ is the Chebyshev design and $\xi_e$ is a measure concentrated at the end-points: $\xi_e(-1) = \xi_e(1) = 1/2$. Since the maximal risk of the Chebyshev design $\xi_c$ occurs at the end-points, it is natural to try to improve it by adding more weight to the end-points. It will be demonstrated that, by choosing an appropriate $\alpha$, asymptotically as $m \to \infty$, $\xi_s$ reduces the maximal risk of the Chebyshev design roughly by 45%. In other words, $\xi_s$ is about 91% asymptotically efficient with respect to the maximal risk, compared to the Chebyshev design, which is only 50% efficient (see the numerical sketch below).

1.2.3. The rational regression model.

In Ch. 4, we venture into the intriguing area of optimal designs for the rational regression model, about which little was known. Rational functions are a classical tool in Approximation Theory. In many cases, they have proved to be a much more efficient tool for approximation than algebraic and trigonometric polynomials. In 1964, D. Newman [32] showed that the function $|x|$ can be uniformly approximated by rational functions much better than by polynomials. This result stimulated a great interest and substantial progress in the field of rational approximation. In this direction, we outline a general class of estimation procedures and characterize for them the optimal designs (see Sec. 4.2). We expect that models based on rational approximation may also be of significant value in non-parametric regression. If in a rational regression model the set of poles $\{w_i\}_{i=1}^{m}$ is given (see the definitions in Sec. 4.1), there exists a unique optimal design $\xi^*$ (see Sec. 4.2).
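Returning to the super Chebyshev design of Sec. 1.2.2, the sketch below (NumPy; the helper name is ours) scans the maximal risk of the mixture $\xi_s = \alpha\xi_c + (1-\alpha)\xi_e$ over a few values of $\alpha$ at a moderate $m$. The optimal choice of $\alpha$ and the asymptotics as $m \to \infty$ are derived in Ch. 3; the scan merely illustrates that adding end-point mass lowers the maximal risk below that of the pure Chebyshev design:

```python
import numpy as np

def max_risk_super_chebyshev(alpha, m, n_quad=400):
    """max_x d(x, xi_s) for xi_s = alpha*xi_c + (1 - alpha)*xi_e,
    where xi_e puts mass 1/2 at each end-point."""
    k = np.arange(1, n_quad + 1)
    x = np.cos((2 * k - 1) * np.pi / (2 * n_quad))
    H = np.vander(x, m, increasing=True)
    Mc = H.T @ H / n_quad                            # M(xi_c)
    He = np.vander(np.array([-1.0, 1.0]), m, increasing=True)
    Me = 0.5 * He.T @ He                             # M(xi_e)
    M = alpha * Mc + (1 - alpha) * Me
    grid = np.linspace(-1, 1, 4001)
    G = np.vander(grid, m, increasing=True)
    return np.einsum('ij,jk,ik->i', G, np.linalg.inv(M), G).max()

m = 10
for a in (1.0, 0.9, 0.8, 0.7):
    print(a, max_risk_super_chebyshev(a, m) / m)  # dips below the a = 1 value
```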

However, here again $\xi^*$ depends crucially on the set of poles $\{w_i\}_{i=1}^{m}$; consequently, it can be used only when this set is known a priori. As in polynomial regression, this leads to the problem of finding designs which would be nearly optimal for a larger collection of possible pole sets $\{w_i\}_{i=1}^{m}$. To deal with this problem, we generalize the approach used in polynomial regression to the more general class of rational regression models. In Sec. 4.3, we discuss those properties of the Chebyshev design $\xi_c$ which do not require specific knowledge of the pole set $\{w_j\}_{j=1}^{m}$. In particular, we show that the properties of $\xi_c$ in the rational regression model are similar to those in polynomial regression. Further, in Sec. 4.4, theoretical and numerical evidence will be presented showing that, by choosing an appropriate $\alpha$, asymptotically as $m \to \infty$, $\xi_s$ reduces the maximal risk of the Chebyshev design roughly by 45%.

1.2.4. Cardinal spline smoothing.

It is well known that when the degree of an interpolating polynomial increases, the overall approximation error may, in some cases, tend to infinity. In numerical analysis, this effect is known as Runge's phenomenon (see [4], p. 102). The problem can be avoided by using spline curves, which are piecewise polynomials. When trying to improve upon the approximation error, one can increase the number of polynomial pieces of which an interpolating spline is made up, instead of increasing the degree of these polynomials. It is generally accepted that the first mathematical reference to splines is the 1946 paper by Schoenberg [39], which is probably the first publication using the word "spline" in connection with smooth piecewise polynomial functions. Before computers were used, numerical calculations were done by hand. Although piecewise-defined functions, like the sign function or the step function, were of course well known, polynomials were generally preferred because they were easier to work with. With the advent of computers, splines have gained importance. They were first used as a replacement for polynomials in interpolation, then as a tool to construct smooth and flexible shapes in computer graphics. Splines are also broadly used in statistical applications requiring data smoothing. In this area, they successfully compete with yet another well-known statistical tool, kernel smoothing. In Ch. 5, we introduce a special type of kernel based on the so-called fundamental splines of cardinal interpolation. These functions are well known in Approximation Theory, where there is a huge literature devoted to their numerous properties (see e.g. [37], [38] and [39]). While many other kinds of splines, such as B-splines and natural splines, have long since found their way into Nonparametric Statistics, cardinal splines, to the best of our knowledge, have never been discussed in this context. This

is all the more surprising since, as we will demonstrate in Ch. 5, they have excellent statistical properties and represent quite a natural and useful tool for nonparametric estimation. Asymptotic optimality of the resulting nonparametric spline estimators will be discussed in the context of the functional classes

$$\mathcal{F} = \mathcal{F}(C, \gamma) = \big\{ f \in C(\mathbb{R}) : |\hat f(t)| \le C e^{-\gamma |t|} \big\},$$

where $\hat f(t)$ is the Fourier transform of $f(x)$, and the positive constants $C, \gamma > 0$, for the most part, will be considered given. These functional classes are well known in nonparametric estimation; see [2] and [29]. Related questions, such as comparative properties of B- and cardinal splines, numerical bounds on the accuracy of these splines, as well as the well-known generalized method of cross validation, will also be discussed in Ch. 5.
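As a closing illustration of Runge's phenomenon and of the way splines avoid it (cf. Sec. 1.2.4), the following sketch (Python with NumPy and SciPy, both assumed available) interpolates Runge's example $f(x) = 1/(1 + 25x^2)$ at equally spaced nodes, once with a single polynomial and once with a cubic spline:

```python
import numpy as np
from scipy.interpolate import CubicSpline

f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)    # Runge's example
xg = np.linspace(-1, 1, 2001)

for n in (11, 21, 41):
    xn = np.linspace(-1, 1, n)                # equally spaced nodes
    poly = np.polynomial.Polynomial.fit(xn, f(xn), n - 1)
    spline = CubicSpline(xn, f(xn))
    print(n,
          np.abs(poly(xg) - f(xg)).max(),     # polynomial error blows up
          np.abs(spline(xg) - f(xg)).max())   # spline error shrinks
```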

Chapter 2. Overview of the optimal design theory

2.1. Introduction.

In this chapter the classical theory of optimal designs for the Gaussian observation model ([9], [34]) is generalized to arbitrary possible estimates under squared losses (see Sec. 2.2). We present complete proofs of all relevant results in the theory of optimal designs, and indicate some possible simplifications, where possible.

2.1.1. The statistical model.

Let $F = (f_1(x), f_2(x), \ldots, f_m(x))$ denote a system of linearly independent continuous real-valued functions defined on a compact set $\mathcal{X} \subset \mathbb{R}^d$. Consider the regression model

$$y_i = f(x_i) + \varepsilon_i = \sum_{j=1}^{m} \theta_j f_j(x_i) + \varepsilon_i, \quad i = 1, \ldots, n. \qquad (4)$$

Here the $y_i$ are observations, $f(x)$ is an unknown regression function which belongs to the linear subspace $\mathcal{R}_F$ spanned by the $f_j$; $\theta_j$, $j = 1, 2, \ldots, m$, are regression parameters; $x_i \in \mathcal{X}$ are given observation points; and the $\varepsilon_i$ are independent normally distributed error terms, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. We are interested in estimators exhibiting good accuracy over the whole class $\mathcal{R}_F$ of regression functions in (4). The collection $\xi_n = \{x_i\}$ of the observation points used in model (4) will be called the design. Here the $x_i$ are not necessarily distinct. The definition of the design $\xi_n$ will be generalized further in Sec. 2.2.4. The set of all designs will be denoted $\Xi_n$.

2.1.2. The classical concept of optimal designs.

Let us start with some known results. First, we will write equation (4) in the form of a standard linear model. Denote $y = (y_i)_{n \times 1}$, $X = (X_{ij}) = (f_j(x_i))_{n \times m}$, $\theta = (\theta_j)_{m \times 1}$ and $\varepsilon = (\varepsilon_i)_{n \times 1}$. Then equation (4) can be written in the matrix form of a standard linear model,

$$y = X\theta + \varepsilon. \qquad (5)$$

Let $\hat f_n(x) = h'(x)\hat\theta$ be the least squares estimator of $f(x)$, corresponding to a given design $\xi_n$, with the column-vector $h(x)$ given by $h(x) = (f_1(x), f_2(x), \ldots, f_m(x))'$. It is well known that, if the matrix $X'X$ is non-singular, the least squares estimator of the vector $\theta$ is $\hat\theta = (X'X)^{-1}X'y$, and the mean squared risk of estimating $f(x)$, at a given $x \in \mathcal{X}$, is

$$E(\hat f_n(x) - f(x))^2 = \sigma^2\, h'(x)(X'X)^{-1}h(x).$$

Note that this mean squared risk does not depend on the underlying function $f$. Denote the maximal risk, corresponding to a given design $\xi_n$,

$$R(F, \xi_n) = \max_{x \in \mathcal{X}} E(\hat f_n(x) - f(x))^2 = \max_{x \in \mathcal{X}} \sigma^2\, h'(x)(X'X)^{-1}h(x).$$

In the classical theory (see [9], [34]), the optimal design $\xi_n^o$ is any design minimizing the maximal risk, i.e. a design $\xi_n^o$ achieving

$$R(F) = \min_{\xi_n \in \Xi_n} R(F, \xi_n) = \min_{\xi_n \in \Xi_n}\ \max_{x \in \mathcal{X}} E(\hat f_n(x) - f(x))^2.$$

2.1.3. Extended concept of optimal designs.

Now let us extend the above approach from least squares estimators to arbitrary estimators. Let $\tilde f_n(x)$ be an arbitrary estimator of the unknown function based on the set of observations $y$. To indicate the dependence of this estimator on the underlying design $\xi_n$, we will use the notation $\tilde f_n \dashv \xi_n$. The accuracy of any estimator $\tilde f_n$ can be measured by

$$R(\tilde f_n, f) = \max_{x \in \mathcal{X}} E(\tilde f_n(x) - f(x))^2. \qquad (6)$$

Therefore, it is natural to look at the following minimax risk:

$$r(F, \xi_n) = \min_{\tilde f_n \dashv \xi_n}\ \max_{f \in \mathcal{R}_F}\ \max_{x \in \mathcal{X}} E(\tilde f_n(x) - f(x))^2.$$

Our final goal is to find an optimal design which minimizes this risk, i.e. a design $\xi_n^*$ achieving

$$r(F) = \min_{\xi_n \in \Xi_n} r(F, \xi_n) = \min_{\xi_n \in \Xi_n}\ \min_{\tilde f_n \dashv \xi_n}\ \max_{f \in \mathcal{R}_F}\ \max_{x \in \mathcal{X}} E(\tilde f_n(x) - f(x))^2. \qquad (7)$$

2.2. Criteria of Optimal Designs.

Preceding our further discussion of Optimal Designs, we introduce some basic statistical tools related to linear models. Under our assumptions, the joint density of the observations can be written as

$$p(y, \theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\|y - X\theta\|^2}{2\sigma^2} \right).$$

2.2.1. Fisher information and maximum likelihood estimators.

Consider the score function

$$l(y, \theta) = \left( \frac{\partial \ln p(y, \theta)}{\partial \theta} \right)_{m \times 1}.$$

The Fisher information matrix, related to the density $p(y, \theta)$, is defined as

$$I(\theta) = E_\theta\, l(y, \theta)\, l'(y, \theta),$$

where the index $\theta$ indicates that the expectation is calculated under the density $p(y, \theta)$. An elementary calculation shows that, in the case of the normal density $p(y, \theta)$, the Fisher information matrix does not depend on $\theta$ and is given by $I = \sigma^{-2} X'X$. To indicate the dependence of this matrix on the design $\xi_n$, it is convenient to introduce the notation $X'X = nM(\xi_n)$.

In the terminology commonly used in Optimal Design Theory, $M(\xi_n)$ is also referred to as the information matrix. Note that $M(\xi_n)$ is an $m \times m$ square matrix, with entries

$$M(\xi_n)_{kl} = \frac{1}{n} \sum_{i=1}^{n} f_k(x_i) f_l(x_i). \qquad (8)$$

Let $\theta \in \Theta \subseteq \mathbb{R}^m$ be a vector of unknown parameters. Denote by $\hat\theta$ the Maximum Likelihood Estimator (MLE) of $\theta$. Clearly, in the case of the linear model (5), $\hat\theta$ coincides with the Least Squares Estimator (LSE) of $\theta$. It is well known that if the matrix $X'X$ is non-singular, the least squares estimator $\hat\theta = (X'X)^{-1}X'y$ is normally distributed: $\hat\theta \sim \mathcal{N}(\theta, I^{-1})$.

2.2.2. Minimax estimation.

One of the most compelling justifications of the use of the LSE is that, in the case of normal linear regression models such as (4) and (5), it can be shown to be minimax simultaneously for a broad variety of loss functions. Although the related results are well known, there seem to be no easy references, nor sufficiently simple proofs, specifically oriented towards linear regression. In this section we demonstrate that $\hat\theta$ is a minimax estimator for a variety of quadratic loss functions. In particular, we might be interested in the squared risk function

$$R_0(\tilde\theta, \theta) = E_\theta \|\tilde\theta - \theta\|^2. \qquad (9)$$

Note that in this case

$$R_0(\hat\theta, \theta) = \sigma^2\, \operatorname{Tr}\big((X'X)^{-1}\big). \qquad (10)$$

We might also be interested in estimating a linear functional of the form $\Psi = a'\theta$. Then, with the mean squared risk defined as

$$R_a(\tilde\Psi, \theta) = E_\theta (\tilde\Psi - \Psi)^2, \qquad (11)$$

the maximum likelihood estimate $\hat\Psi = a'\hat\theta$ satisfies $R_a(\hat\Psi, \theta) = \sigma^2 a'(X'X)^{-1}a$.
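Both risk expressions above are driven by $X'X = nM(\xi_n)$. As a small illustration of (8) (NumPy; the helper name and the basis chosen are ours):

```python
import numpy as np

def information_matrix(x_points, basis):
    """M(xi_n)_{kl} = (1/n) sum_i f_k(x_i) f_l(x_i)   -- cf. (8)."""
    x = np.asarray(x_points, float)
    H = np.column_stack([f(x) for f in basis])
    return H.T @ H / len(x)

# quadratic regression basis: f_1 = 1, f_2 = x, f_3 = x^2
basis = [np.ones_like, lambda x: x, lambda x: x ** 2]
M = information_matrix([-1.0, -1.0, 0.0, 0.0, 1.0, 1.0], basis)
print(M)   # the Fisher information of the sample is then n * M / sigma^2
```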

Theorem 2.1. If $\Theta = \mathbb{R}^m$, then the MLE $\hat\theta$ is minimax with respect to the risk functions (9) and (11).

Proof. Since the risk of the MLE does not depend on $\theta$, we can see from (10) that the minimax risk satisfies

$$r_0 := \min_{\tilde\theta}\ \max_{\theta \in \Theta} R_0(\tilde\theta, \theta) \le \sigma^2\, \operatorname{Tr}\big((X'X)^{-1}\big).$$

Hence we need only to show that

$$r_0 \ge \sigma^2\, \operatorname{Tr}\big((X'X)^{-1}\big). \qquad (12)$$

Note that if $\Pi$ is a prior distribution for $\theta$ such that

$$\Pi(\Theta) = 1, \qquad (13)$$

then the corresponding Bayes risk of an estimate $\tilde\theta$,

$$R_0(\tilde\theta, \Pi) = \int_\Theta R_0(\tilde\theta, \theta)\, d\Pi(\theta),$$

satisfies

$$R_0(\tilde\theta, \Pi) \le \max_{\theta \in \Theta} R_0(\tilde\theta, \theta). \qquad (14)$$

Consider the corresponding Bayes risk

$$R_0(\Pi) = \min_{\tilde\theta} R_0(\tilde\theta, \Pi).$$

It follows from (14) that, for any estimate $\tilde\theta$, $R_0(\Pi) \le \max_{\theta \in \Theta} R_0(\tilde\theta, \theta)$. Hence, for any prior distribution $\Pi$ satisfying (13),

$$R_0(\Pi) \le r_0. \qquad (15)$$

We will use this fact to demonstrate the validity of (12). For that, let us specify the prior distribution as

$$\Pi = \mathcal{N}\big(0,\ \sigma^2 \kappa^{-2} (X'X)^{-1}\big),$$

where the parameter $\kappa > 0$ will be chosen later. Under our assumptions, condition (13) holds trivially. Denote by $\pi(\theta)$ the density of $\Pi$ and by

$$f(y \mid \theta) \propto \exp\left( -\frac{\|y - X\theta\|^2}{2\sigma^2} \right)$$

the density of the observations $y$. Then, treating $y$ and $\theta$ as jointly random elements, the posterior density of $\theta$ is

$$f(\theta \mid y) \propto f(y \mid \theta)\,\pi(\theta) \propto \exp\left( -\frac{\|y - X\theta\|^2 + \kappa^2\, \theta' X'X \theta}{2\sigma^2} \right).$$

Simple algebra yields

$$f(\theta \mid y) \propto \exp\left( -\frac{(1 + \kappa^2)\,\theta' X'X \theta - 2\,\theta' X'y}{2\sigma^2} \right) \propto \exp\left( -\frac{1 + \kappa^2}{2\sigma^2} \left( \theta - \frac{(X'X)^{-1}X'y}{1 + \kappa^2} \right)' X'X \left( \theta - \frac{(X'X)^{-1}X'y}{1 + \kappa^2} \right) \right) = \exp\left( -\frac{1 + \kappa^2}{2\sigma^2} \left( \theta - \frac{\hat\theta}{1 + \kappa^2} \right)' X'X \left( \theta - \frac{\hat\theta}{1 + \kappa^2} \right) \right).$$

From Decision Theory it is well known that the Bayes estimate $\theta^*$, with respect to the mean squared risk (9), is the posterior mean

$$\theta^* = \frac{\hat\theta}{1 + \kappa^2},$$

and its Bayes risk $R_0(\Pi) = R_0(\theta^*, \Pi)$

coincides with the posterior risk

$$R_0(\Pi) = \frac{\sigma^2}{1 + \kappa^2}\, \operatorname{Tr}\big((X'X)^{-1}\big).$$

Using (15) and letting $\kappa \to 0$ proves the inequality (12), and thus our theorem, in the case of the risk function (9). In the case of the risk function (11), the corresponding Bayes estimate again coincides with the posterior mean

$$\Psi^* = \frac{a'\hat\theta}{1 + \kappa^2},$$

while both its posterior quadratic risk and Bayes risk are given by

$$R_a(\Pi) = \frac{\sigma^2}{1 + \kappa^2}\, a'(X'X)^{-1}a.$$

Our theorem follows from this equation as before.

Applying Theorem 2.1 to model (4), with $\hat f_n(x) = h'(x)\hat\theta$, where $h(x) = (f_1(x), f_2(x), \ldots, f_m(x))'$, we obtain the following minimax property of the LSE $\hat\theta$: for any $x_0 \in \mathcal{X}$,

$$\min_{\tilde f_n \dashv \xi_n}\ \max_{f \in \mathcal{R}_F} E(\tilde f_n(x_0) - f(x_0))^2 = \max_{f \in \mathcal{R}_F} E(\hat f_n(x_0) - f(x_0))^2 \qquad (16)$$

$$= \frac{\sigma^2}{n}\, h'(x_0) M^{-1}(\xi_n) h(x_0).$$

2.2.3. An application to Optimal Designs.

This minimax property of the maximum likelihood estimate can be easily extended to the risk function $R(\tilde f_n, f)$ in (6). Using (16), we obtain

$$r(F, \xi_n) = \min_{\tilde f_n \dashv \xi_n}\ \max_{f \in \mathcal{R}_F}\ \max_{x \in \mathcal{X}} E(\tilde f_n(x) - f(x))^2 \ge \min_{\tilde f_n \dashv \xi_n}\ \max_{f \in \mathcal{R}_F} E(\tilde f_n(x_0) - f(x_0))^2 = \max_{f \in \mathcal{R}_F} E(\hat f_n(x_0) - f(x_0))^2 = \frac{\sigma^2}{n}\, h'(x_0) M^{-1}(\xi_n) h(x_0).$$

Therefore,

$$r(F, \xi_n) \ge \frac{\sigma^2}{n}\, \max_{x \in \mathcal{X}} h'(x) M^{-1}(\xi_n) h(x).$$

On the other hand,

$$r(F, \xi_n) \le \max_{f \in \mathcal{R}_F}\ \max_{x \in \mathcal{X}} E(\hat f_n(x) - f(x))^2 = \frac{\sigma^2}{n}\, \max_{x \in \mathcal{X}} h'(x) M^{-1}(\xi_n) h(x).$$

It follows that the minimax risk (7) can be represented as

$$r(F) = \frac{\sigma^2}{n}\, \min_{\xi_n \in \Xi_n}\ \max_{x \in \mathcal{X}} h'(x) M^{-1}(\xi_n) h(x). \qquad (17)$$

This result shows that for the Gaussian observation model the optimal design $\xi_n^o$, in the traditional sense commonly used in the existing theory of Optimal Designs, essentially coincides with the design produced by a more rigorous approach which does not restrict the class of possible estimators exclusively to the LSE.

2.2.4. An extension of the class of designs $\Xi_n$.

Obviously, any design $\xi_n \in \Xi_n$ can be identified with the probability measure $\xi_n$ on $\mathcal{X}$ of the following special form:

$$\xi_n(x_i) = \frac{1}{n}, \quad i = 1, \ldots, n.$$

In terms of this measure, the matrix $M(\xi_n)$ has the form, cf. (8),

$$M(\xi_n) = \int_{\mathcal{X}} h(x) h'(x)\, d\xi_n(x).$$
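A small numerical sketch of the quantity appearing in (17) (NumPy; the helper name and the family of candidate designs are ours). For a quadratic trend on $[-1, 1]$, scanning symmetric three-point design measures shows that $\max_x h'(x)M^{-1}(\xi)h(x)$ is smallest, and equal to $m$, at equal weights:

```python
import numpy as np

def worst_case_variance(points, weights, m=3, grid=np.linspace(-1, 1, 2001)):
    """max_x h'(x) M^{-1}(xi) h(x) for a discrete design measure,
    with the polynomial basis h(x) = (1, x, ..., x^{m-1})' -- cf. (17)."""
    H = np.vander(np.asarray(points, float), m, increasing=True)
    M = H.T @ (np.asarray(weights, float)[:, None] * H)
    G = np.vander(grid, m, increasing=True)
    return np.einsum('ij,jk,ik->i', G, np.linalg.inv(M), G).max()

pts = [-1.0, 0.0, 1.0]
for w in (0.2, 1/3, 0.5):                    # weight placed at x = 0
    p = [(1 - w) / 2, w, (1 - w) / 2]
    print(w, worst_case_variance(pts, p))    # minimal value m = 3 at w = 1/3
```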

This allows us to extend the definition of the information matrix $M(\xi_n)$ to an arbitrary probability measure $\xi$ with support in $\mathcal{X}$, $\operatorname{supp}(\xi) \subseteq \mathcal{X}$, as

$$M(\xi) = \int_{\mathcal{X}} h(x) h'(x)\, d\xi(x). \qquad (18)$$

Denote by $\Xi$ the collection of all such designs. This generalization is quite natural and, in fact, is technically easier to work with. In the classical concept of Optimal Designs ([9], [34]), this has led to the study of the following minimax risk:

$$\min_{\xi \in \Xi}\ \max_{x \in \mathcal{X}} E(\hat f(x) - f(x))^2.$$

Under our more general approach, this can be replaced by a more general risk, without any restrictions on the class of possible estimates $\tilde f(x)$:

$$\min_{\xi \in \Xi}\ \min_{\tilde f \dashv \xi}\ \max_{f \in \mathcal{R}_F}\ \max_{x \in \mathcal{X}} E(\tilde f(x) - f(x))^2.$$

Note. The introduction of an arbitrary design measure $\xi$ may be a bit difficult to interpret in terms of practical applications. This can be helped, however, by Corollary 2.4 below, according to which a regression model with an arbitrary design measure can be replaced by an equivalent model with a discrete design measure $\xi$ supported on at most $j \le [(m+1)m/2] + 1$ design points. Applying a sufficiency argument to normal data reduces any such discrete design model to an equivalent inhomogeneous regression model, with a finite set of observation points $x_i$, $i = 1, \ldots, j$, and variances of the error terms varying with $\xi(x_i)$. As a result, for an arbitrary design model (4), there exists an equivalent design model, with a finite support set $x_i$, $i = 1, \ldots, j$, for some $j \le [(m+1)m/2] + 1$, having the following form:

$$y_i = f(x_i) + \frac{\varepsilon_i}{\sqrt{\xi(x_i)}} = \sum_{j=1}^{m} \theta_j f_j(x_i) + \frac{\varepsilon_i}{\sqrt{\xi(x_i)}}.$$

This equation exhibits a direct relation between the design measure $\xi(x_i)$ and the variance of the error terms. Obviously, in an inhomogeneous regression model, the latter weight can be an arbitrary number $\xi(x_i) \ge 0$.

2.3. Some properties of the information matrix $M(\xi)$.

The above relation (17) makes it obvious that the information matrix $M(\xi)$, defined in the general case by (18), plays a central role in determining the optimal designs $\xi_n^o$. In this subsection, we will discuss some properties of $M(\xi)$. In particular, these properties will be used below to prove the celebrated Equivalence Theorem and its ramifications. Denote by $\mathcal{M}$ the family of matrices $M(\xi)$ defined by (18) for arbitrary design measures $\xi$.

2.3.1. Convex compactness of $\mathcal{M}$.

Recall that a subset $S$ of a linear space is called convex if, for any $s_1 \in S$, $s_2 \in S$, and $0 \le \alpha \le 1$, the point $s = \alpha s_1 + (1 - \alpha)s_2$ again belongs to $S$. If $S$ is an arbitrary set in a linear space, then the set $\operatorname{conv}(S)$ of points

$$s = \sum_{i=1}^{k} \alpha_i s_i, \quad \text{where} \quad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0, \quad s_i \in S \quad (i = 1, 2, \ldots, k;\ k = 1, 2, \ldots),$$

is called the convex hull of the set $S$. Note that if $S$ is a compact set (in a finite-dimensional space), then $\operatorname{conv}(S)$ is also compact. Denote by $\operatorname{cone}(S)$ the set of points

$$s = \sum_{i=1}^{k} \alpha_i s_i, \quad \alpha_i \ge 0, \quad s_i \in S \quad (i = 1, 2, \ldots, k;\ k = 1, 2, \ldots);$$

$\operatorname{cone}(S)$ is called the convex cone of the set $S$. Let us demonstrate first that $\mathcal{M}$ constitutes a convex compact set. Recall that we consider the regression model (4), for which a design measure $\xi$ has its support in a given compact subset $\mathcal{X} \subset \mathbb{R}^d$.

Lemma 2.2. Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact set. Then $\mathcal{M}$ is the convex hull of the matrices $M(\xi[x])$ corresponding to those designs $\xi = \xi[x]$ whose support consists of a single point $x \in \mathcal{X}$. Moreover, $\mathcal{M}$ is a convex compact set in the linear space of all $m \times m$ matrices.

Proof. For the reader's convenience, we present below a short proof of this lemma; for more details see [9]. Let $S = \{ M(\xi[x]) : x \in \mathcal{X} \}$ and let $\operatorname{conv}(S)$ be the convex hull of the set $S$. Since the matrix $M(\xi)$ is linear in $\xi$, and the set $\Xi$ of all probability measures on $\mathcal{X}$ is convex, $\operatorname{conv}(S) \subseteq \mathcal{M}$. Note that the entries of the matrix $M(\xi[x])$ have the form $f_i(x)f_j(x)$, cf. (8). Since the functions $f_1, \ldots, f_m$ are continuous, so is the matrix $M(\xi[x])$ as a function of $x$. Since $\mathcal{X}$ is compact, the set $S$ is also compact, and therefore $\operatorname{conv}(S)$ is compact. Now let us demonstrate that $\mathcal{M} = \operatorname{conv}(S)$. In fact, it only remains to show that $\mathcal{M} \subseteq \operatorname{conv}(S)$. Since $\operatorname{conv}(S)$ is a closed convex set, there is a set $\mathcal{L}$ of linear functionals $L(\cdot)$ such that $s \in \operatorname{conv}(S)$ if and only if $L(s) \ge 0$ for all $L \in \mathcal{L}$. Thus we need to show that $L(M(\xi)) \ge 0$ for any $L \in \mathcal{L}$. Since any linear matrix functional $L(M)$ is of the form $L(M) = \sum_{i,j} a_{ij} M_{ij}$, we have, for any $L \in \mathcal{L}$,

$$L\big(M(\xi)\big) = \sum_{i,j} a_{ij} M_{ij}(\xi) = \sum_{i,j} a_{ij} \int f_i(x) f_j(x)\, d\xi(x) = \int \sum_{i,j=1}^{m} a_{ij} f_i(x) f_j(x)\, d\xi(x) = \int L\big(M(\xi[x])\big)\, d\xi(x) \ge 0.$$

The following classical result plays an important role in the Theory of Optimal Designs, cf. Corollary 2.4 below.

Theorem 2.3. [Carathéodory's Theorem.] Let $S$ be any subset of $\mathbb{R}^l$, $l \ge 1$. Then for any point $s$ in the convex hull $\operatorname{conv}(S)$, there exist $s_i \in S$, $i = 1, \ldots, l+1$, and a set of

$$\alpha_i \ge 0, \quad \sum_{i=1}^{l+1} \alpha_i = 1,$$

such that

$$s = \sum_{i=1}^{l+1} \alpha_i s_i.$$

Proof. For the reader's convenience, we present below a proof of this; for more details, see e.g. [43]. Given a set $S$ in $\mathbb{R}^l$, let $\tilde S$ be the subset of $\mathbb{R}^{l+1}$ which consists of all vectors of the form $(1, s)$ with $s \in S$, and let $\operatorname{cone}(\tilde S)$ be the convex cone generated by $\tilde S$. Let $C$ be the set of vectors $c$ in $\mathbb{R}^l$ such that $(1, c) \in \operatorname{cone}(\tilde S)$. Then it is easily seen that $C = \operatorname{conv}(S)$. Let $\tilde s \in \operatorname{cone}(\tilde S)$, so that $\tilde s = \lambda_1 \tilde s_1 + \cdots + \lambda_k \tilde s_k$, where $\lambda_i > 0$ and each $\tilde s_i \in \tilde S$. If $k > l + 1$, then the vectors $\tilde s_i$ are linearly dependent. Thus there exist $\mu_1, \ldots, \mu_k$, not all zero, such that

$$\mu_1 \tilde s_1 + \cdots + \mu_k \tilde s_k = 0.$$

Since the first component of each vector $\tilde s_i$ is $1$, we have $\mu_1 + \cdots + \mu_k = 0$, and so at least one of the coefficients $\mu_i$ is positive. Let $\lambda$ be the largest number such that $\lambda \mu_i \le \lambda_i$, $i = 1, \ldots, k$. Obviously, $\lambda$ is finite, since at least one of the $\mu_i$ is positive. Denote $\lambda_i' = \lambda_i - \lambda \mu_i$. Then

$$\lambda_1' \tilde s_1 + \cdots + \lambda_k' \tilde s_k = \lambda_1 \tilde s_1 + \cdots + \lambda_k \tilde s_k - \lambda(\mu_1 \tilde s_1 + \cdots + \mu_k \tilde s_k) = \tilde s.$$

Hence at least one of the coefficients $\lambda_i' = 0$. We have thus expressed $\tilde s$ as a positive linear combination of fewer than $k$ elements of $\tilde S$. This argument can clearly be repeated until $\tilde s$ has been expressed as a positive linear combination of at most $l + 1$ elements of $\tilde S$, since more than $l + 1$ elements are necessarily linearly dependent. In particular, all elements of the form $(1, c)$ can be expressed in this way.
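The argument above is constructive, and a direct transcription runs in a few lines. The sketch below (NumPy; the helper name and the tolerance-based bookkeeping are ours) repeatedly removes one point from a convex combination in $\mathbb{R}^l$ until at most $l + 1$ points remain:

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Reduce a convex combination of points in R^l to one that uses
    at most l+1 of them, following the linear-dependence argument."""
    P = np.asarray(points, float)                       # k x l
    lam = np.asarray(weights, float)
    l = P.shape[1]
    while len(P) > l + 1:
        # lifted vectors (1, s_i) are linearly dependent when k > l + 1
        A = np.hstack([np.ones((len(P), 1)), P]).T      # (l+1) x k
        mu = np.linalg.svd(A)[2][-1]                    # kernel vector of A
        if mu.max() <= 0:
            mu = -mu                                    # ensure some mu_i > 0
        pos = mu > tol
        step = np.min(lam[pos] / mu[pos])               # largest feasible step
        lam = lam - step * mu                           # >= 0, one (near) zero
        keep = lam > tol
        P, lam = P[keep], lam[keep]
    return P, lam

# five points on a segment in R^1 collapse to at most two,
# with the barycenter 0.5 preserved
P, lam = caratheodory_reduce([[0.0], [0.25], [0.5], [0.75], [1.0]], [0.2] * 5)
print(P.ravel(), lam, float(lam @ P[:, 0]))
```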

From this result it follows easily that the information matrix $M(\xi)$ of an arbitrary design $\xi$ coincides with the information matrix corresponding to some design with a finite support.

Corollary 2.4. Let $\xi \in \Xi$ be an arbitrary design. Then there exist a finite subset $x_1, \ldots, x_j \in \mathcal{X}$ and a set $p_1, \ldots, p_j \ge 0$, $\sum_{i=1}^{j} p_i = 1$, with $j \le [(m+1)m/2] + 1$, for which the matrix $M(\xi)$ can be represented in the form

$$M(\xi) = \sum_{i=1}^{j} p_i\, h(x_i) h'(x_i).$$

Proof. This result follows directly from Lemma 2.2 and the Carathéodory theorem.

By Corollary 2.4, in analyzing the set of all information matrices $M(\xi)$, $\xi \in \Xi$, we may assume, without loss of generality, that the design measure $\xi$ has a finite support.

2.3.2. The quadratic risk $d(x, \xi)$.

Recall that, for any design $\xi \in \Xi$, the minimax risk in (7) is determined by the $m \times m$ matrix $M(\xi)$ in (18) through the function $h'(x) M^{-1}(\xi) h(x)$, where $h(x) = (f_1(x), f_2(x), \ldots, f_m(x))'$. In what follows, we will denote the determinant of a square matrix $M$ by $|M|$. Since the information matrix $M(\xi)$ may happen to be singular, we define the risk function, in general, as

$$d(x, \xi) = \begin{cases} h'(x) M^{-1}(\xi) h(x), & \text{if } |M(\xi)| \neq 0, \\ \infty, & \text{if } |M(\xi)| = 0. \end{cases}$$

Note that in the special case when the functions $\{f_i(x),\ i = 1, \ldots, m\}$ are orthonormal with respect to the design $\xi$,

$$\int_{\mathcal{X}} f_i(x) f_j(x)\, d\xi(x) = \delta_{ij},$$

the quadratic risk can be represented as

$$d(x, \xi) = \sum_{i=1}^{m} f_i^2(x). \qquad (19)$$

By definition, a design $\xi$ is optimal if it minimizes $\max_{x \in \mathcal{X}} d(x, \xi)$. To carry our discussion of optimal designs further, we first establish some properties of $d(x, \xi)$.

Lemma 2.5. Let $\xi$ be an arbitrary design. If the matrix $M(\xi)$ is not singular, then

$$\int_{\mathcal{X}} d(x, \xi)\, d\xi(x) = m,$$

where $m$ is the number of unknown parameters.
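Before the proof, Lemma 2.5 is easy to verify numerically: for a discrete design, $\sum_i p_i\, d(x_i, \xi) = m$ exactly (a NumPy sketch with an arbitrary random design):

```python
import numpy as np

# Numerical check of Lemma 2.5: for a discrete design,
# sum_i p_i d(x_i, xi) = m exactly.
rng = np.random.default_rng(0)
m = 4
pts = rng.uniform(-1, 1, size=7)             # an arbitrary 7-point design
p = rng.dirichlet(np.ones(7))                # arbitrary design weights
H = np.vander(pts, m, increasing=True)       # basis 1, x, x^2, x^3
M = H.T @ (p[:, None] * H)
d = np.einsum('ij,jk,ik->i', H, np.linalg.inv(M), H)
print(p @ d)                                 # = m = 4, up to rounding
```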

Proof. For the reader's convenience, we present below a proof of this lemma; for more details, see e.g. [9], p. 69. Since $h'(x)M^{-1}(\xi)h(x)$ is a real number, obviously

$$\int_{\mathcal{X}} d(x, \xi)\, d\xi(x) = \int_{\mathcal{X}} h'(x) M^{-1}(\xi) h(x)\, d\xi(x) = \int_{\mathcal{X}} \operatorname{Tr}\big( h'(x) M^{-1}(\xi) h(x) \big)\, d\xi(x).$$

Moreover, since $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$ for any square matrices $A$ and $B$,

$$\operatorname{Tr}\big( h'(x) M^{-1}(\xi) h(x) \big) = \operatorname{Tr}\big( M^{-1}(\xi)\, h(x) h'(x) \big).$$

Thus, by the definition of $M(\xi)$,

$$\int_{\mathcal{X}} \operatorname{Tr}\big( h'(x) M^{-1}(\xi) h(x) \big)\, d\xi(x) = \operatorname{Tr}\left( M^{-1}(\xi) \int_{\mathcal{X}} h(x) h'(x)\, d\xi(x) \right) = \operatorname{Tr}\big( M^{-1}(\xi) M(\xi) \big) = m.$$

Corollary 2.6. For any design $\xi$, $\max_{x \in \mathcal{X}} d(x, \xi) \ge m$.

Lemma 2.7. Let $\xi$ be an arbitrary design. If the matrix $M(\xi)$ is not singular, then

$$\max_x d(x, \xi) = \max_\theta \operatorname{Tr}\big( M^{-1}(\xi) M(\theta) \big),$$

where the maximum on the right is taken over all designs $\theta$.

Proof. Similarly to the proof of Lemma 2.5,

$$d(x, \xi) = \operatorname{Tr}\big( M^{-1}(\xi) M(\xi[x]) \big),$$

where $\xi[x]$ is a design whose support consists of the single point $x \in \mathcal{X}$. So, trivially,

$$\max_x d(x, \xi) \le \max_\theta \operatorname{Tr}\big( M^{-1}(\xi) M(\theta) \big).$$

On the other hand,

$$\operatorname{Tr}\big( M^{-1}(\xi) M(\theta) \big) = \int_{\mathcal{X}} \operatorname{Tr}\big( M^{-1}(\xi) M(\xi[x]) \big)\, d\theta(x) = \int_{\mathcal{X}} d(x, \xi)\, d\theta(x) \le \max_x d(x, \xi).$$

Thus $\max_x d(x, \xi) \ge \max_\theta \operatorname{Tr}\big( M^{-1}(\xi) M(\theta) \big)$, and the lemma follows.

2.3.3. Strict concavity of $\log|M|$.

Here we establish some properties of the function $\log|M|$, $M \in \mathcal{M}$, which will be used below to prove the celebrated Equivalence Theorem. Recall that $\mathcal{M}$ is the convex set of all information matrices $M(\xi)$, $\xi \in \Xi$. Denote further

$$\mathcal{M}^+ = \{ M \in \mathcal{M} : |M| \neq 0 \} \subseteq \mathcal{M}$$

and

$$\mathcal{M}_\varepsilon = \{ M \in \mathcal{M} : \text{all eigenvalues of } M \text{ are} \ge \varepsilon \} \subseteq \mathcal{M}.$$

Lemma 2.8. $\mathcal{M}^+$ and $\mathcal{M}_\varepsilon$ are convex sets.

Proof. A symmetric semi-definite matrix $M$ is non-singular if and only if its smallest eigenvalue is positive:

$$\lambda_{\min} = \min_v\, (v, Mv) > 0,$$

where the minimum is taken over all normalized vectors $v$ such that $(v, v) = 1$; see e.g. [2]. Let $M_1, M_2 \in \mathcal{M}^+$. Then obviously, for any $0 \le \alpha \le 1$,

$$\min_v\, \big(v, (\alpha M_1 + (1-\alpha) M_2)v\big) = \min_v\, \big( \alpha (v, M_1 v) + (1-\alpha)(v, M_2 v) \big) \ge \alpha \min_v\, (v, M_1 v) + (1-\alpha) \min_v\, (v, M_2 v) > 0.$$

Similarly, if $M_1, M_2 \in \mathcal{M}_\varepsilon$,

$$\min_v\, \big(v, (\alpha M_1 + (1-\alpha) M_2)v\big) \ge \varepsilon.$$

Remark 1. It follows from the above proof that the matrix $\alpha M_1 + (1-\alpha) M_2$ is non-singular even if only one of the matrices, say $M_2$, is non-singular, and $0 \le \alpha < 1$.

Definition. Let $S$ be a convex set. A numerical function $f$ defined on $S$ is called concave if, for any $s_1, s_2 \in S$, $s_1 \neq s_2$, and any $0 < \alpha < 1$,

$$f(\alpha s_1 + (1-\alpha) s_2) \ge \alpha f(s_1) + (1-\alpha) f(s_2);$$

convex, if

$$f(\alpha s_1 + (1-\alpha) s_2) \le \alpha f(s_1) + (1-\alpha) f(s_2);$$

and strictly concave (respectively, strictly convex) if the corresponding inequality is strict. First we show that the function $\log|M|$, defined on the convex set $\mathcal{M}^+$, is strictly concave.

Lemma 2.9. The function $\log|M|$, $M \in \mathcal{M}^+$, is strictly concave.

Proof. For the reader's convenience, we present below a simplified proof of this lemma; for more details, see [9]. By Lemma 2.8, $\mathcal{M}^+$ is a convex set. Therefore, it is sufficient to show that for any $M_1, M_2 \in \mathcal{M}^+$ with $M_1 \neq M_2$, and any $0 < \alpha < 1$,

$$\log|M| > \alpha \log|M_1| + (1-\alpha) \log|M_2|,$$

where $M = \alpha M_1 + (1-\alpha) M_2$. Since $M_1$ and $M_2$ are both symmetric positive-definite, there exists a non-singular matrix $A$ such that $A M_1 A' = I$ and $A M_2 A' = B$, where $I$ is the identity matrix and $B = \operatorname{diag}(b_1, \ldots, b_m)$, $b_i > 0$, is a diagonal matrix (see [8], p. 466). Thus

$$|M| = |\alpha M_1 + (1-\alpha) M_2| = \big| A^{-1} (\alpha I + (1-\alpha) B)(A')^{-1} \big| = |A|^{-2} \prod_{i=1}^{m} \big( \alpha + (1-\alpha) b_i \big). \qquad (20)$$

On the other hand,

$$|M_1|^\alpha\, |M_2|^{1-\alpha} = \big| A^{-1}(A')^{-1} \big|^\alpha\, \big| A^{-1} B (A')^{-1} \big|^{1-\alpha} = |A|^{-2} \prod_{i=1}^{m} b_i^{1-\alpha}. \qquad (21)$$

Using the classical Young inequality in the form $x^\alpha y^{1-\alpha} \le \alpha x + (1-\alpha) y$, with equality only for $x = y$, we obtain from (20) and (21) that

$$|M| > |M_1|^\alpha\, |M_2|^{1-\alpha}.$$

Therefore,

$$\log|M| > \alpha \log|M_1| + (1-\alpha) \log|M_2|.$$

Let $A = A(t)$, $t \in \mathbb{R}$, be a smooth family of symmetric positive definite matrices, and denote $\dot A = \frac{dA}{dt}$. The following result shows how to differentiate the determinant $|A|$.

Lemma 2.10.

$$\frac{d}{dt} \log|A(t)| = \operatorname{Tr}\big( A^{-1} \dot A \big).$$

Proof. Note that, together with $A$, the matrix $A^{-1}$ is symmetric positive definite. Consider the integral of the density of a multivariate normal distribution with mean vector $0$ and covariance matrix $V = A^{-1}$:

$$\int_{\mathbb{R}^m} \frac{|A|^{1/2}}{(2\pi)^{m/2}} \exp\left( -\frac{1}{2}\, x' A x \right) dx = 1.$$

Differentiating under the integral sign (which is easy to justify in this case) gives

$$\int_{\mathbb{R}^m} \frac{|A|^{1/2}}{(2\pi)^{m/2}} \exp\left( -\frac{1}{2}\, x' A x \right) \left( \frac{1}{2}\, \frac{d \log|A|}{dt} - \frac{1}{2}\, x' \dot A x \right) dx = \frac{1}{2} \left( \frac{d \log|A|}{dt} - \operatorname{Tr}(V \dot A) \right) = 0.$$
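Lemma 2.10 admits a quick numerical sanity check by finite differences (NumPy; the matrices are random and serve only as an illustration):

```python
import numpy as np

# Numerical check of Lemma 2.10: d/dt log|A(t)| = Tr(A^{-1} dA/dt).
rng = np.random.default_rng(1)
G = rng.normal(size=(4, 4))
B = G @ G.T + 4.0 * np.eye(4)               # symmetric positive definite
C = rng.normal(size=(4, 4)); C = C + C.T    # symmetric direction dA/dt

A = lambda t: B + t * C                     # positive definite for small t
t, h = 0.1, 1e-6
lhs = (np.linalg.slogdet(A(t + h))[1]
       - np.linalg.slogdet(A(t - h))[1]) / (2.0 * h)
rhs = np.trace(np.linalg.inv(A(t)) @ C)
print(lhs, rhs)                             # agree to ~1e-8
```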

From the above lemma we obtain the following result, which will be essential below in proving the Equivalence Theorem.

Lemma 2.11. Let there be two design measures $\xi_1$ and $\xi_2$ with information matrices $M(\xi_1)$ and $M(\xi_2)$, respectively, such that $M(\xi_2) \in \mathcal{M}^+$. Denote $\xi = \alpha \xi_1 + (1-\alpha)\xi_2$, $0 \le \alpha < 1$. Then

$$\frac{d}{d\alpha} \log|M(\xi)| = \operatorname{Tr}\big( M^{-1}(\xi)\, (M(\xi_1) - M(\xi_2)) \big),$$

and

$$\frac{d}{d\alpha} \log|M(\xi)| \Big|_{\alpha=0} = \int_{\mathcal{X}} d(x, \xi_2)\, d\xi_1(x) - m.$$

Proof. For the reader's convenience, we present below a proof of this lemma; for more details, see e.g. [9]. Note that

$$M(\xi) = \int_{\mathcal{X}} h(x) h'(x)\, d\xi(x) = \alpha \int_{\mathcal{X}} h(x) h'(x)\, d\xi_1(x) + (1-\alpha) \int_{\mathcal{X}} h(x) h'(x)\, d\xi_2(x) = \alpha M(\xi_1) + (1-\alpha) M(\xi_2),$$

and, by Remark 1, $M(\xi)$ is non-singular. Using Lemma 2.10 we obtain

$$\frac{d}{d\alpha} \log|M(\xi)| = \operatorname{Tr}\left( M^{-1}(\xi)\, \frac{d}{d\alpha} M(\xi) \right) = \operatorname{Tr}\big( M^{-1}(\xi)\, (M(\xi_1) - M(\xi_2)) \big).$$

Moreover, since $M(\xi) = M(\xi_2)$ for $\alpha = 0$, we have

$$\frac{d}{d\alpha} \log|M(\xi)| \Big|_{\alpha=0} = \operatorname{Tr}\big( M^{-1}(\xi_2)(M(\xi_1) - M(\xi_2)) \big) = \operatorname{Tr}\big( M^{-1}(\xi_2) M(\xi_1) \big) - \operatorname{Tr}\big( M^{-1}(\xi_2) M(\xi_2) \big) = \operatorname{Tr}\big( M^{-1}(\xi_2) M(\xi_1) \big) - m = \int_{\mathcal{X}} d(x, \xi_2)\, d\xi_1(x) - m.$$

2.3.4. Strict convexity of $M^{-1}$.

Let $M_1$ and $M_2$ be two $m \times m$ matrices. We say that $M_1 > M_2$ ($M_1 \ge M_2$) if the matrix $M_1 - M_2$ is positive (semi-)definite. Note that this partial order is preserved under pre- and post-multiplication by any non-singular $m \times m$ matrix (respectively, any matrix) $A$: if $M_1 \ge M_2$, then $A M_1 A' \ge A M_2 A'$. In this subsection we will show that the map $M \mapsto M^{-1}$ is strictly convex, in the sense of the above partial ordering, on the set $\mathcal{M}^+$.

Lemma 2.12. Let $M_1$ and $M_2$ be symmetric positive-definite matrices. Then

a) for any $0 < \alpha < 1$,

$$\alpha M_1^{-1} + (1-\alpha) M_2^{-1} \ge \big( \alpha M_1 + (1-\alpha) M_2 \big)^{-1},$$

with equality if and only if $M_1 = M_2$;

b) for any symmetric positive semi-definite $m \times m$ matrix $A$ and any $0 < \alpha < 1$,

$$\alpha \operatorname{Tr}(M_1^{-1} A) + (1-\alpha) \operatorname{Tr}(M_2^{-1} A) \ge \operatorname{Tr}\big( (\alpha M_1 + (1-\alpha) M_2)^{-1} A \big),$$

with strict inequality if $A$ is non-singular.

Proof. a) Since $M_1$ and $M_2$ are positive definite, there exists a non-singular matrix $A$ such that $A M_1 A' = I$ and $A M_2 A' = B$, where $I$ is the identity matrix and $B = \operatorname{diag}(b_1, \ldots, b_m)$, $b_i > 0$, is a diagonal matrix (see [8], p. 466). Then

$$\alpha M_1^{-1} + (1-\alpha) M_2^{-1} = A' \operatorname{diag}\big( \alpha + (1-\alpha) b_i^{-1} \big) A, \qquad (22)$$

and

$$\big( \alpha M_1 + (1-\alpha) M_2 \big)^{-1} = A' \operatorname{diag}\big( (\alpha + (1-\alpha) b_i)^{-1} \big) A. \qquad (23)$$

Note that

$$\big( \alpha + (1-\alpha) b_i \big)\big( \alpha + (1-\alpha) b_i^{-1} \big) = \alpha^2 + (b_i + b_i^{-1})\,\alpha(1-\alpha) + (1-\alpha)^2 \ge \alpha^2 + 2\alpha(1-\alpha) + (1-\alpha)^2 = 1,$$

or

$$\alpha + (1-\alpha) b_i^{-1} \ge \big( \alpha + (1-\alpha) b_i \big)^{-1}, \qquad (24)$$

with equality if and only if $b_i = 1$. Therefore, by (22), (23) and (24), for any $0 < \alpha < 1$,

$$\alpha M_1^{-1} + (1-\alpha) M_2^{-1} \ge \big( \alpha M_1 + (1-\alpha) M_2 \big)^{-1}, \qquad (25)$$

with equality if and only if $M_1 = M_2$.

b) Pre- and post-multiplying (25) by $A^{1/2}$ shows that the matrix

$$A^{1/2} \big( \alpha M_1^{-1} + (1-\alpha) M_2^{-1} \big) A^{1/2} - A^{1/2} \big( \alpha M_1 + (1-\alpha) M_2 \big)^{-1} A^{1/2}$$

is positive semi-definite (positive definite, if $A$ is non-singular). Since $\operatorname{Tr}$ is a linear operator and $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$,

$$\alpha \operatorname{Tr}(M_1^{-1} A) + (1-\alpha) \operatorname{Tr}(M_2^{-1} A) - \operatorname{Tr}\big( (\alpha M_1 + (1-\alpha) M_2)^{-1} A \big) \ge 0,$$

with strict inequality if $A$ is non-singular.

Lemma 2.13. Let $M$ be a symmetric $m \times m$ positive-definite matrix. Then

$$\frac{1}{m} \operatorname{Tr} M \ge |M|^{1/m}.$$

Moreover, the equality holds if and only if $M$ is a multiple of the identity.

Proof. Denote by $\lambda_i$ the eigenvalues of $M$. Since $\operatorname{Tr} M = \sum_{i=1}^{m} \lambda_i$ and $|M| = \prod_{i=1}^{m} \lambda_i$, the result follows from the arithmetic–geometric means inequality.

Corollary 2.14. Let there be two design measures $\xi_1$ and $\xi_2$ with information matrices $M(\xi_1)$ and $M(\xi_2)$, respectively, such that $M(\xi_1), M(\xi_2) \in \mathcal{M}^+$. Then

$$\operatorname{Tr}\big( M^{-1}(\xi_1) M(\xi_2) \big) \ge m\, \frac{|M(\xi_2)|^{1/m}}{|M(\xi_1)|^{1/m}},$$

and equality occurs if and only if $M(\xi_2)$ is proportional to $M(\xi_1)$.

2.4. The equivalence theorem.

The purpose of this subsection is to provide a proof of the following theorem, due to Kiefer and Wolfowitz [26], which plays an important role in the Theory of Optimal Designs.

Theorem 2.15. [Equivalence Theorem.] The following assertions are equivalent:

(1) the design $\xi^*$ maximizes $|M(\xi)|$;

(2) the design $\xi^*$ minimizes $\max_x d(x, \xi)$;

(3) $\max_x d(x, \xi^*) = m$.

Proof. For the reader's convenience, we present below a proof of this theorem; for more details, see [26].

(1) $\Rightarrow$ (2) and (3). Since the set of matrices $M(\xi)$ is compact and the determinant is a continuous function, there exists a design which maximizes $|M(\xi)|$; let $\xi^*$ be such a design. Consider another design $\bar\xi$, and let $\xi = \alpha \bar\xi + (1-\alpha)\xi^*$. Since $|M(\xi^*)| \ge |M(\xi)|$, by Lemma 2.11 we have

$$\frac{d}{d\alpha} \log|M(\xi)| \Big|_{\alpha=0} = \int_{\mathcal{X}} d(x, \xi^*)\, d\bar\xi(x) - m \le 0.$$

Choosing $\bar\xi$ to concentrate its mass at one point $x \in \mathcal{X}$, we obtain

$$\int_{\mathcal{X}} d(x, \xi^*)\, d\bar\xi(x) - m = d(x, \xi^*) - m \le 0.$$

This means that $\max_x d(x, \xi^*) \le m$. On the other hand, by Lemma 2.5, for any design $\xi$, $\max_x d(x, \xi) \ge m$. Therefore $\xi^*$ minimizes $\max_x d(x, \xi)$, and $\max_x d(x, \xi^*) = m$.

(2) $\Rightarrow$ (1). Let a design $\xi^*$ minimize $\max_x d(x, \xi)$. In the previous step of the proof we have shown that $\min_\xi \max_x d(x, \xi) = m$; it follows that $\max_x d(x, \xi^*) = m$. Thus, for any other design $\bar\xi$,

$$\int_{\mathcal{X}} d(x, \xi^*)\, d\bar\xi(x) - m \le 0.$$

Assume that $\xi^*$ does not maximize $|M(\xi)|$. Since, by Lemma 2.9, $\log|M(\xi)|$ is a strictly concave function, there is another design $\bar\xi$ such that $\log|M(\tilde\xi)|$, where $\tilde\xi = \alpha \bar\xi + (1-\alpha)\xi^*$, is strictly increasing in $\alpha$, for all sufficiently small $\alpha$. Thus

$$\frac{d}{d\alpha} \log|M(\tilde\xi)| \Big|_{\alpha=0} = \int_{\mathcal{X}} d(x, \xi^*)\, d\bar\xi(x) - m > 0.$$

We have arrived at a contradiction.

(3) $\Rightarrow$ (2). The result follows trivially from Corollary 2.6.
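The three assertions of the theorem can be watched coinciding on a small example. For quadratic regression on $[-1, 1]$, the optimal design puts equal mass on $\{-1, 0, 1\}$ (the zeros of $(1 - x^2)P'_{m-1}(x)$ with $m = 3$; cf. Ch. 1); the sketch below (NumPy; the helper name is ours) shows that equal weights simultaneously maximize $|M(\xi)|$ and bring $\max_x d(x, \xi)$ down to $m = 3$:

```python
import numpy as np

def det_and_maxd(weights, pts=np.array([-1.0, 0.0, 1.0]), m=3,
                 grid=np.linspace(-1, 1, 2001)):
    """|M(xi)| and max_x d(x, xi) for a weighted design on three points."""
    H = np.vander(pts, m, increasing=True)
    M = H.T @ (np.asarray(weights)[:, None] * H)
    G = np.vander(grid, m, increasing=True)
    d = np.einsum('ij,jk,ik->i', G, np.linalg.inv(M), G)
    return np.linalg.det(M), d.max()

for w in ([1/3, 1/3, 1/3], [0.25, 0.5, 0.25], [0.4, 0.2, 0.4]):
    detM, maxd = det_and_maxd(np.array(w))
    print(w, round(detM, 4), round(maxd, 3))
# equal weights give the largest determinant and max_x d = m = 3
```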

The above result is somewhat surprising. It shows, in particular, that for any compact set $\mathcal{X}$, and for any system of continuous linearly independent functions $f_i(x)$, $i = 1, \ldots, m$, defined on $\mathcal{X}$, there always exists an optimal design $\xi^*$, and the minimax risk depends neither on the set $\mathcal{X}$ nor on the functions $f_i$ themselves, but exclusively on their number! On the other hand, finding an optimal design $\xi^*$ is often quite a formidable task. In many cases, solving this problem can be helped by the following result.

Corollary 2.16. For an optimal design $\xi^*$,

$$\xi^*\{ x : d(x, \xi^*) = m \} = 1.$$

Proof. Let $A = \{ x : d(x, \xi^*) < m \}$. Suppose $\xi^*(A) > 0$. Then, in view of part (3) of Theorem 2.15,

$$\int_{\mathcal{X}} d(x, \xi^*)\, d\xi^*(x) = \int_A d(x, \xi^*)\, d\xi^*(x) + \int_{\mathcal{X} \setminus A} d(x, \xi^*)\, d\xi^*(x) < \int_A m\, d\xi^*(x) + \int_{\mathcal{X} \setminus A} m\, d\xi^*(x) = m \int_{\mathcal{X}} d\xi^*(x) = m.$$

However, by Lemma 2.5, $\int_{\mathcal{X}} d(x, \xi^*)\, d\xi^*(x) = m$. This contradiction proves our result.

We will use the above corollary to prove a result which will be needed later for proving the uniqueness of the optimal design $\xi^*$ in the cases of polynomial and rational regression. We call $x$ an $a$-point of a given function $f$ if $f(x) = a$. The number of $a$-points of $f$, in a given interval, will be counted according to their multiplicities.

Lemma 2.17. Let $\mathcal{X} = [-1, 1]$. Assume that the functions $f_1, \ldots, f_m$ are continuously differentiable on $(-1, 1)$, and let $\xi^*$ be an optimal design. If the number of $m$-points of $d(x, \xi^*)$ in the closed interval $[-1, 1]$ does not exceed $2m - 2$, then $\xi^*$ concentrates

its mass at exactly $m$ points in the interval $[-1, 1]$, two of which coincide with the end-points.

Proof. Obviously, the support of $\xi^*$ must contain at least $m$ points, since otherwise the functions $f_1, \ldots, f_m$ are linearly dependent $\xi^*$-a.s., the matrix $M(\xi^*)$ is singular, and $|M(\xi^*)| = 0$. In particular, the support of $\xi^*$ must contain at least $m - 2$ points in the open interval $(-1, 1)$. There are only three alternatives: the support of $\xi^*$ contains (1) at least $m - 2$ points in $(-1, 1)$ plus the two end-points $\pm 1$; (2) at least $m - 1$ points in $(-1, 1)$ and one of the end-points; (3) at least $m$ points in $(-1, 1)$. Note that by Corollary 2.16, all points in the support of $\xi^*$ are $m$-points of the risk function $d(x, \xi^*)$. By part (3) of Theorem 2.15, they are also maximum points of $d(x, \xi^*)$; hence every such point lying in the open interval is an $m$-point of multiplicity at least $2$. This shows that in case (1), the support of $\xi^*$ must contain exactly $m - 2$ points in the open interval $(-1, 1)$. In case (2), the number of $m$-points of $d(x, \xi^*)$ would be at least $2(m-1) + 1$, contradicting the assumption. In case (3), the number of $m$-points of $d(x, \xi^*)$ would be at least $2m$, which again is impossible.

2.4.1. A stronger version of the equivalence theorem.

In view of the importance of the Equivalence Theorem (Theorem 2.15), we prove below a stronger version of this theorem using ideas of Game Theory. This approach was first introduced in [20]. In general, a game with two players, I and II, is defined by a pay-off function $K = K(\xi, \theta)$, here interpreted as the gain of Player II when Player I chooses a strategy $\xi$ and Player II independently chooses a strategy $\theta$. Denote by $\Xi$ the set of all strategies available to Player I and by $\Theta$ the set of all strategies for Player II.

Lemma 2.18 (The Main Theorem of Game Theory; see e.g. [20], pp. 4-6). Let $\Xi$ and $\Theta$ be compact convex sets, and let the pay-off function $K = K(\xi, \theta)$ be continuous on $\Xi \times \Theta$, concave in $\theta$ for any given $\xi \in \Xi$, and convex in $\xi$ for any given $\theta \in \Theta$. Then

$$\max_{\theta \in \Theta}\ \min_{\xi \in \Xi} K(\xi, \theta) = \min_{\xi \in \Xi}\ \max_{\theta \in \Theta} K(\xi, \theta) := v.$$

Furthermore, for both players there exist optimal strategies $\xi^o \in \Xi$ and $\theta^o \in \Theta$, such that

$$K(\xi^o, \theta) \le v \le K(\xi, \theta^o), \quad \text{for all } \theta \in \Theta \text{ and all } \xi \in \Xi.$$


More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

We consider the problem of finding a polynomial that interpolates a given set of values:

We consider the problem of finding a polynomial that interpolates a given set of values: Chapter 5 Interpolation 5. Polynomial Interpolation We consider the problem of finding a polynomial that interpolates a given set of values: x x 0 x... x n y y 0 y... y n where the x i are all distinct.

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

Assignment 1: From the Definition of Convexity to Helley Theorem

Assignment 1: From the Definition of Convexity to Helley Theorem Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Lecture Notes in Mathematics. Arkansas Tech University Department of Mathematics. The Basics of Linear Algebra

Lecture Notes in Mathematics. Arkansas Tech University Department of Mathematics. The Basics of Linear Algebra Lecture Notes in Mathematics Arkansas Tech University Department of Mathematics The Basics of Linear Algebra Marcel B. Finan c All Rights Reserved Last Updated November 30, 2015 2 Preface Linear algebra

More information

Quadratic reciprocity (after Weil) 1. Standard set-up and Poisson summation

Quadratic reciprocity (after Weil) 1. Standard set-up and Poisson summation (September 17, 010) Quadratic reciprocity (after Weil) Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ I show that over global fields (characteristic not ) the quadratic norm residue

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

CHAPTER 3 Further properties of splines and B-splines

CHAPTER 3 Further properties of splines and B-splines CHAPTER 3 Further properties of splines and B-splines In Chapter 2 we established some of the most elementary properties of B-splines. In this chapter our focus is on the question What kind of functions

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Optimal designs for estimating the slope of a regression

Optimal designs for estimating the slope of a regression Optimal designs for estimating the slope of a regression Holger Dette Ruhr-Universität Bochum Fakultät für Mathematik 4480 Bochum, Germany e-mail: holger.dette@rub.de Viatcheslav B. Melas St. Petersburg

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

Checking Consistency. Chapter Introduction Support of a Consistent Family

Checking Consistency. Chapter Introduction Support of a Consistent Family Chapter 11 Checking Consistency 11.1 Introduction The conditions which define a consistent family of histories were stated in Ch. 10. The sample space must consist of a collection of mutually orthogonal

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 7 Interpolation Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

2. Intersection Multiplicities

2. Intersection Multiplicities 2. Intersection Multiplicities 11 2. Intersection Multiplicities Let us start our study of curves by introducing the concept of intersection multiplicity, which will be central throughout these notes.

More information

4.6 Bases and Dimension

4.6 Bases and Dimension 46 Bases and Dimension 281 40 (a) Show that {1,x,x 2,x 3 } is linearly independent on every interval (b) If f k (x) = x k for k = 0, 1,,n, show that {f 0,f 1,,f n } is linearly independent on every interval

More information

Chapter 7. Extremal Problems. 7.1 Extrema and Local Extrema

Chapter 7. Extremal Problems. 7.1 Extrema and Local Extrema Chapter 7 Extremal Problems No matter in theoretical context or in applications many problems can be formulated as problems of finding the maximum or minimum of a function. Whenever this is the case, advanced

More information

A Do It Yourself Guide to Linear Algebra

A Do It Yourself Guide to Linear Algebra A Do It Yourself Guide to Linear Algebra Lecture Notes based on REUs, 2001-2010 Instructor: László Babai Notes compiled by Howard Liu 6-30-2010 1 Vector Spaces 1.1 Basics Definition 1.1.1. A vector space

More information

On John type ellipsoids

On John type ellipsoids On John type ellipsoids B. Klartag Tel Aviv University Abstract Given an arbitrary convex symmetric body K R n, we construct a natural and non-trivial continuous map u K which associates ellipsoids to

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

A tailor made nonparametric density estimate

A tailor made nonparametric density estimate A tailor made nonparametric density estimate Daniel Carando 1, Ricardo Fraiman 2 and Pablo Groisman 1 1 Universidad de Buenos Aires 2 Universidad de San Andrés School and Workshop on Probability Theory

More information

2.2. Show that U 0 is a vector space. For each α 0 in F, show by example that U α does not satisfy closure.

2.2. Show that U 0 is a vector space. For each α 0 in F, show by example that U α does not satisfy closure. Hints for Exercises 1.3. This diagram says that f α = β g. I will prove f injective g injective. You should show g injective f injective. Assume f is injective. Now suppose g(x) = g(y) for some x, y A.

More information

VARIETIES WITHOUT EXTRA AUTOMORPHISMS I: CURVES BJORN POONEN

VARIETIES WITHOUT EXTRA AUTOMORPHISMS I: CURVES BJORN POONEN VARIETIES WITHOUT EXTRA AUTOMORPHISMS I: CURVES BJORN POONEN Abstract. For any field k and integer g 3, we exhibit a curve X over k of genus g such that X has no non-trivial automorphisms over k. 1. Statement

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

CS 450 Numerical Analysis. Chapter 8: Numerical Integration and Differentiation

CS 450 Numerical Analysis. Chapter 8: Numerical Integration and Differentiation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

x x2 2 + x3 3 x4 3. Use the divided-difference method to find a polynomial of least degree that fits the values shown: (b)

x x2 2 + x3 3 x4 3. Use the divided-difference method to find a polynomial of least degree that fits the values shown: (b) Numerical Methods - PROBLEMS. The Taylor series, about the origin, for log( + x) is x x2 2 + x3 3 x4 4 + Find an upper bound on the magnitude of the truncation error on the interval x.5 when log( + x)

More information

0. Introduction 1 0. INTRODUCTION

0. Introduction 1 0. INTRODUCTION 0. Introduction 1 0. INTRODUCTION In a very rough sketch we explain what algebraic geometry is about and what it can be used for. We stress the many correlations with other fields of research, such as

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Math 3108: Linear Algebra

Math 3108: Linear Algebra Math 3108: Linear Algebra Instructor: Jason Murphy Department of Mathematics and Statistics Missouri University of Science and Technology 1 / 323 Contents. Chapter 1. Slides 3 70 Chapter 2. Slides 71 118

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

Optimal discrimination designs

Optimal discrimination designs Optimal discrimination designs Holger Dette Ruhr-Universität Bochum Fakultät für Mathematik 44780 Bochum, Germany e-mail: holger.dette@ruhr-uni-bochum.de Stefanie Titoff Ruhr-Universität Bochum Fakultät

More information

Min-Rank Conjecture for Log-Depth Circuits

Min-Rank Conjecture for Log-Depth Circuits Min-Rank Conjecture for Log-Depth Circuits Stasys Jukna a,,1, Georg Schnitger b,1 a Institute of Mathematics and Computer Science, Akademijos 4, LT-80663 Vilnius, Lithuania b University of Frankfurt, Institut

More information

n 1 f n 1 c 1 n+1 = c 1 n $ c 1 n 1. After taking logs, this becomes

n 1 f n 1 c 1 n+1 = c 1 n $ c 1 n 1. After taking logs, this becomes Root finding: 1 a The points {x n+1, }, {x n, f n }, {x n 1, f n 1 } should be co-linear Say they lie on the line x + y = This gives the relations x n+1 + = x n +f n = x n 1 +f n 1 = Eliminating α and

More information

ON THE CHEBYSHEV POLYNOMIALS. Contents. 2. A Result on Linear Functionals on P n 4 Acknowledgments 7 References 7

ON THE CHEBYSHEV POLYNOMIALS. Contents. 2. A Result on Linear Functionals on P n 4 Acknowledgments 7 References 7 ON THE CHEBYSHEV POLYNOMIALS JOSEPH DICAPUA Abstract. This paper is a short exposition of several magnificent properties of the Chebyshev polynomials. The author illustrates how the Chebyshev polynomials

More information

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education MTH 3 Linear Algebra Study Guide Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education June 3, ii Contents Table of Contents iii Matrix Algebra. Real Life

More information

Vectors in Function Spaces

Vectors in Function Spaces Jim Lambers MAT 66 Spring Semester 15-16 Lecture 18 Notes These notes correspond to Section 6.3 in the text. Vectors in Function Spaces We begin with some necessary terminology. A vector space V, also

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Semidefinite Programming

Semidefinite Programming Semidefinite Programming Notes by Bernd Sturmfels for the lecture on June 26, 208, in the IMPRS Ringvorlesung Introduction to Nonlinear Algebra The transition from linear algebra to nonlinear algebra has

More information

GEOMETRIC CONSTRUCTIONS AND ALGEBRAIC FIELD EXTENSIONS

GEOMETRIC CONSTRUCTIONS AND ALGEBRAIC FIELD EXTENSIONS GEOMETRIC CONSTRUCTIONS AND ALGEBRAIC FIELD EXTENSIONS JENNY WANG Abstract. In this paper, we study field extensions obtained by polynomial rings and maximal ideals in order to determine whether solutions

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

GAUSS CIRCLE PROBLEM

GAUSS CIRCLE PROBLEM GAUSS CIRCLE PROBLEM 1. Gauss circle problem We begin with a very classical problem: how many lattice points lie on or inside the circle centered at the origin and with radius r? (In keeping with the classical

More information

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,

More information

TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM

TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM H. E. Krogstad, IMF, Spring 2012 Karush-Kuhn-Tucker (KKT) Theorem is the most central theorem in constrained optimization, and since the proof is scattered

More information

Z-estimators (generalized method of moments)

Z-estimators (generalized method of moments) Z-estimators (generalized method of moments) Consider the estimation of an unknown parameter θ in a set, based on data x = (x,...,x n ) R n. Each function h(x, ) on defines a Z-estimator θ n = θ n (x,...,x

More information

A NICE PROOF OF FARKAS LEMMA

A NICE PROOF OF FARKAS LEMMA A NICE PROOF OF FARKAS LEMMA DANIEL VICTOR TAUSK Abstract. The goal of this short note is to present a nice proof of Farkas Lemma which states that if C is the convex cone spanned by a finite set and if

More information

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms.

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. Vector Spaces Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. For each two vectors a, b ν there exists a summation procedure: a +

More information

Coercive polynomials and their Newton polytopes

Coercive polynomials and their Newton polytopes Coercive polynomials and their Newton polytopes Tomáš Bajbar Oliver Stein # August 1, 2014 Abstract Many interesting properties of polynomials are closely related to the geometry of their Newton polytopes.

More information

Taylor and Laurent Series

Taylor and Laurent Series Chapter 4 Taylor and Laurent Series 4.. Taylor Series 4... Taylor Series for Holomorphic Functions. In Real Analysis, the Taylor series of a given function f : R R is given by: f (x + f (x (x x + f (x

More information

NOTES ON CALCULUS OF VARIATIONS. September 13, 2012

NOTES ON CALCULUS OF VARIATIONS. September 13, 2012 NOTES ON CALCULUS OF VARIATIONS JON JOHNSEN September 13, 212 1. The basic problem In Calculus of Variations one is given a fixed C 2 -function F (t, x, u), where F is defined for t [, t 1 ] and x, u R,

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

Definition 5.1. A vector field v on a manifold M is map M T M such that for all x M, v(x) T x M.

Definition 5.1. A vector field v on a manifold M is map M T M such that for all x M, v(x) T x M. 5 Vector fields Last updated: March 12, 2012. 5.1 Definition and general properties We first need to define what a vector field is. Definition 5.1. A vector field v on a manifold M is map M T M such that

More information

CONCEPT OF DENSITY FOR FUNCTIONAL DATA

CONCEPT OF DENSITY FOR FUNCTIONAL DATA CONCEPT OF DENSITY FOR FUNCTIONAL DATA AURORE DELAIGLE U MELBOURNE & U BRISTOL PETER HALL U MELBOURNE & UC DAVIS 1 CONCEPT OF DENSITY IN FUNCTIONAL DATA ANALYSIS The notion of probability density for a

More information

Some Remarks on the Discrete Uncertainty Principle

Some Remarks on the Discrete Uncertainty Principle Highly Composite: Papers in Number Theory, RMS-Lecture Notes Series No. 23, 2016, pp. 77 85. Some Remarks on the Discrete Uncertainty Principle M. Ram Murty Department of Mathematics, Queen s University,

More information

A matrix over a field F is a rectangular array of elements from F. The symbol

A matrix over a field F is a rectangular array of elements from F. The symbol Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted

More information

Math 321 Final Examination April 1995 Notation used in this exam: N. (1) S N (f,x) = f(t)e int dt e inx.

Math 321 Final Examination April 1995 Notation used in this exam: N. (1) S N (f,x) = f(t)e int dt e inx. Math 321 Final Examination April 1995 Notation used in this exam: N 1 π (1) S N (f,x) = f(t)e int dt e inx. 2π n= N π (2) C(X, R) is the space of bounded real-valued functions on the metric space X, equipped

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

1 Math 241A-B Homework Problem List for F2015 and W2016

1 Math 241A-B Homework Problem List for F2015 and W2016 1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let

More information

Lectures 9-10: Polynomial and piecewise polynomial interpolation

Lectures 9-10: Polynomial and piecewise polynomial interpolation Lectures 9-1: Polynomial and piecewise polynomial interpolation Let f be a function, which is only known at the nodes x 1, x,, x n, ie, all we know about the function f are its values y j = f(x j ), j

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

THESIS. Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

THESIS. Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University The Hasse-Minkowski Theorem in Two and Three Variables THESIS Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University By

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations March 3, 203 6-6. Systems of Linear Equations Matrices and Systems of Linear Equations An m n matrix is an array A = a ij of the form a a n a 2 a 2n... a m a mn where each a ij is a real or complex number.

More information

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved Fundamentals of Linear Algebra Marcel B. Finan Arkansas Tech University c All Rights Reserved 2 PREFACE Linear algebra has evolved as a branch of mathematics with wide range of applications to the natural

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information