Chapter 1

VARIATIONAL CALCULUS IN SPACE OF MEASURES AND OPTIMAL DESIGN

Ilya Molchanov
Department of Statistics, University of Glasgow
ilya@stats.gla.ac.uk, www.stats.gla.ac.uk/~ilya

Sergei Zuyev
Department of Statistics and Modelling Science, University of Strathclyde
sergei@stams.strath.ac.uk, www.stams.strath.ac.uk/~sergei

Abstract: The paper applies abstract optimisation principles in the space of measures within the context of optimal design problems. It is shown that within this framework it is possible to treat various design criteria and constraints in a unified manner, providing a universal variant of the Kiefer-Wolfowitz theorem and giving a full spectrum of optimality criteria for particular cases. The described steepest descent algorithm uses the true direction of steepest descent and descends faster than the conventional sequential algorithms that involve renormalisation on every step.

Keywords: design of experiments, gradient methods, optimal design, regression, space of measures

Introduction

Consider a standard linear optimal design problem on a design space $\mathcal{X}$ (assumed to be a locally compact separable metric space) with regressors $f(x) = (f_1(x), \dots, f_k(x))^\top$ and unknown coefficients $\theta = (\theta_1, \dots, \theta_k)$. The optimal design is described as a probability measure $\xi$ on $\mathcal{X}$ that determines the frequencies of making an observation at particular points.

The information matrix is given by

$$M(\xi) = \int f(x) f(x)^\top \, d\xi(x), \qquad (1.1)$$

see, e.g., (Atkinson and Donev, 1992) for details. The measure $\xi$ that minimises $\det M^{-1}(\xi)$ is called the D-optimal design measure. To create a convex optimisation criterion it is convenient to minimise $\Phi(M(\xi)) = -\log\det M(\xi)$, which is a convex functional on the space of matrices. In general, a measure $\xi$ that minimises a given differentiable functional $\Phi(M(\xi))$ is called a $\Phi$-optimal design measure.

It is common to handle the above optimisation problem by first finding an optimal information matrix from the class defined by all admissible design measures and afterwards by relating it to the corresponding design measure. Schematically, such methods of finding the optimal solution involve the following two stages:

$$\Phi(M(\xi)) \ \xrightarrow{\ \text{Step I}\ }\ M(\xi^*) \ \xrightarrow{\ \text{Step II}\ }\ \xi^* \qquad (1.2)$$

Step I concerns optimisation in the Euclidean space of dimension $k(k+1)/2$, while Step II aims to identify the measure from the information matrix obtained in the previous step. However, the chain approach based on (1.2) cannot be easily amended when the formulation of the problem changes, e.g., when constraints on $\xi$ appear. This usually calls for major changes in proofs, since a new family of matrices has to be analysed in Step I. Therefore, for each new type of constraints on $\xi$ not only Step II, but also Step I, has to be reworked, for example, as in (Cook and Fedorov, 1995) and (Fedorov and Hackl, 1997), where a number of different types of constraint are analysed.

Although the family of non-negative measures is not a linear space, many objective functionals can be extended onto the linear space of signed measures, as the definition (1.1) of the information matrix $M(\xi)$ applies literally to signed measures. This allows us to treat the design problems described above as a special case of the optimisation of functionals defined on signed measures. We show that this abstract setting naturally incorporates many specific optimal design problems and leads to new (apparently more efficient) steepest descent algorithms for the numerical computation of optimal designs.

Recall that the directional (or Gateaux) derivative of a real-valued functional $F$ on a Banach space $E$ is defined by

$$DF(x)[v] = \lim_{t\downarrow 0} t^{-1}\bigl(F(x+tv) - F(x)\bigr), \qquad (1.3)$$

where $v \in E$ is a direction (Hille and Phillips, 1957).
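On a discretised design space these objects are finite-dimensional and easy to compute. The following is a minimal R sketch of (1.1) and of the D-criterion $-\log\det M(\xi)$ for a design measure stored as a weight vector; the function and argument names are ours, not code from the paper's software.

```r
# Sketch: information matrix (1.1) and D-criterion on a finite grid.
# 'f' is a k x n matrix whose columns are f(x) at the n grid points,
# 'xi' is a non-negative weight vector summing to 1 (the design measure).
info.matrix <- function(f, xi) f %*% (t(f) * xi)   # M(xi) = sum_x xi_x f(x) f(x)^T
D.criterion <- function(f, xi) {
  -determinant(info.matrix(f, xi))$modulus         # -log det M(xi)
}
```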

The functional $F$ is said to have a Fréchet (or strong) derivative if

$$F(x+v) - F(x) = DF(x)[v] + o(\|v\|) \quad \text{as } \|v\| \to 0,$$

where $DF(x)[v]$ is a bounded linear continuous functional of $v$. In the design context, $F$ is a functional on the space of measures, such as $F(\xi) = \Phi(M(\xi))$, and $\|\cdot\|$ is a norm defined for signed measures, e.g. the total variation norm. Then the Fréchet derivative is written as $DF(\xi)[\eta]$, where $\eta$ is a signed measure. Clearly, if $\xi$ is a probability measure, then $\xi + t\eta$ as in (1.3) above is not necessarily a probability measure or even a positive measure. Therefore, $F(\xi + t\eta)$ may not have a direct interpretation within the design framework for an arbitrary $\eta$. To circumvent this difficulty, it is quite typical in the optimal design literature to replace (1.3) by the following definition of the directional derivative:

$$dF(\xi, \nu) = \lim_{t\downarrow 0} t^{-1}\bigl(F((1-t)\xi + t\nu) - F(\xi)\bigr) \qquad (1.4)$$

(Atkinson and Donev, 1992; Wynn, 1972). Now $(1-t)\xi + t\nu$ is a probability measure for $t \in [0,1]$ if both $\xi$ and $\nu$ are probability measures. This definition of the directional derivative is used to construct the steepest (with respect to the differential operator $d$) descent algorithms for finding optimal designs when it is not possible to obtain analytical solutions (Wynn, 1972). However, the descent related to (1.4) is only asymptotically equivalent to the true steepest descent as defined by $D$ in (1.3). Indeed, it is easy to see from the definitions that

$$dF(\xi, \nu) = DF(\xi)[\nu - \xi].$$

Thus, the descent direction determined by minimising $dF(\xi, \nu)$ over all probability measures $\nu$ is not the true steepest descent direction obtained by minimising $DF(\xi)[\eta]$ over all $\eta$ with the given norm.

The steepest descent algorithm described in Section 3 emerges from our theoretical results on constrained optimisation in the space of measures presented in Section 1. In contrast to the classical sequential algorithms in the optimal design literature (Wu, 1978a,b; Wu and Wynn, 1978; Wynn, 1970), we do not renormalise the obtained design measure on each step. Instead, the algorithm adds a signed measure chosen so as to minimise the Fréchet derivative of the goal function among all measures satisfying the imposed constraints. This extends the ideas of Atwood (1973, 1976), Fedorov (1972) and Ford (1976, pp. 59–64) and puts the whole algorithm into the conventional optimisation framework. Working on the linear space of signed measures rather than on the cone of positive measures makes this algorithm just a special case of the general steepest descent algorithm known from the optimisation literature. To establish its required convergence properties it is now sufficient to refer to the general results for the steepest descent method, see (Polak, 1997).

Section 1 surveys several necessary concepts of abstract optimisation for functionals defined on the cone of positive measures, describing a number of results adapted from (Molchanov and Zuyev, 1997), where proofs and further generalisations can be found. Applications to optimal designs are discussed in Section 2. Section 3 is devoted to steepest descent type algorithms.

1. OPTIMISATION IN THE SPACE OF MEASURES

Let $\mathbb{M}^+$ be the family of all non-negative finite measures on a Polish space $\mathcal{X}$ and $\mathbb{M}$ be the linear space of all signed measures with bounded total variation on its Borel sets. In numerical implementations $\mathcal{X}$ becomes a grid in a Euclidean space $\mathbb{R}^d$ and $\mu \in \mathbb{M}^+$ is a non-negative array indexed by the grid's nodes. For every $\mu \in \mathbb{M}$, denote by $\mu^+$ (resp. $\mu^-$) the positive (resp. negative) part in the Jordan decomposition of $\mu$.

Consider the following optimisation problem with finitely many constraints of equality and inequality type:

$$F(\mu) \to \min, \quad \mu \in \mathbb{M}^+, \qquad (1.5)$$

subject to

$$H_j(\mu) = 0, \ j = 1, \dots, m; \qquad H_j(\mu) \le 0, \ j = m+1, \dots, n. \qquad (1.6)$$

It is always assumed that $F: \mathbb{M} \to \mathbb{R}$ and $H = (H_1, \dots, H_n): \mathbb{M} \to \mathbb{R}^n$ are Fréchet differentiable functions. Most differentiable functionals of measures met in practice have derivatives which can be represented in integral form. We consider here only this common case, so that there exist measurable real-valued functions $dF(x; \mu)$ and $dH_j(x; \mu)$, $j = 1, \dots, n$, such that for all $\eta \in \mathbb{M}$

$$DF(\mu)[\eta] = \int dF(x; \mu)\, d\eta(x) \quad \text{and} \quad DH(\mu)[\eta] = \int dH(x; \mu)\, d\eta(x), \qquad (1.7)$$

where $dH = (dH_1, \dots, dH_n)$. Note that all integrals are over $\mathcal{X}$ unless specified otherwise. Furthermore, assume that the solution $\mu$ of Problem (1.5) is regular, i.e. the functions $dH_1(\cdot; \mu), \dots, dH_m(\cdot; \mu)$ are linearly independent and there exists $\nu \in \mathbb{M}^+$ such that

$$\int dH_j(x; \mu)\, d(\nu - \mu)(x) = 0 \quad \text{for all } j = 1, \dots, m,$$
$$\int dH_j(x; \mu)\, d(\nu - \mu)(x) < 0 \quad \text{for all } j \in \{m+1, \dots, n\} \text{ verifying } H_j(\mu) = 0 \qquad (1.8)$$

(without the inequality constraints, (1.8) trivially holds with $\nu - \mu$ being the zero measure). The regularity condition above guarantees the existence and boundedness of the Lagrange multipliers (Zowe and Kurcyusz, 1979) for the problem (1.5).

Theorem 1 (see Molchanov and Zuyev, 1997). Let $\mu$ be a regular local minimum of $F$ subject to (1.6). Then there exists $u = (u_1, \dots, u_n)$ with $u_j \ge 0$ if $H_j(\mu) = 0$ and $u_j = 0$ if $H_j(\mu) < 0$ for $j \in \{m+1, \dots, n\}$, such that

$$dF(x; \mu) = \sum_{j=1}^n u_j\, dH_j(x; \mu) \quad \mu\text{-almost everywhere}, \qquad (1.9)$$
$$dF(x; \mu) \ge \sum_{j=1}^n u_j\, dH_j(x; \mu) \quad \text{for all } x \in \mathcal{X}.$$

Example 1. The simplest (but extremely important) example concerns minimisation of $F(\mu)$ over all $\mu \in \mathbb{M}^+$ with the fixed total mass $\mu(\mathcal{X}) = 1$. In the above framework, this corresponds to the case when $n = m = 1$ and the only constraint is $H_1(\mu) = \int d\mu(x) - 1 = 0$. Then (1.9) becomes

$$dF(x; \mu) = u \quad \mu\text{-almost everywhere}, \qquad (1.10)$$
$$dF(x; \mu) \ge u \quad \text{for all } x \in \mathcal{X}.$$

In the sequel we call (1.10) a necessary condition for a minimum in the fixed total mass problem.

2. APPLICATIONS TO OPTIMAL DESIGN

Optimal design problems can be naturally treated within the above described general framework of optimisation of functionals defined on finite measures. Theorem 1 directly applies to functionals of measures typical in the optimal design literature and under quite general differentiable constraints. Although we do not impose any convexity assumptions on $F$ and work exclusively with necessary optimality conditions, convexity of $F$ immediately ensures the existence of the optimal design.

Theorem 1 can easily be specialised to functionals that effectively depend on the information matrix. Given a function $\Phi: \mathbb{R}^{k \times k} \to \mathbb{R}$, a $\Phi$-optimal design measure minimises $F(\xi) = \Phi(M(\xi))$ over all probability measures $\xi$ on $\mathcal{X}$, possibly with some additional constraints. Direct computation of the derivative of $F$ leads to the following result that can be considered as a generalised Kiefer-Wolfowitz theorem.

Corollary 2. Let $\xi$ provide a $\Phi$-optimal design subject to constraints (1.6). Then (1.9) holds with

$$dF(x; \xi) = f(x)^\top \nabla\Phi(M(\xi))\, f(x), \qquad \nabla\Phi(M(\xi)) = \Bigl(\frac{\partial \Phi(M)}{\partial m_{ij}}\Bigr)\Big|_{M = M(\xi)},$$

where $m_{ij} = \int f_i(x) f_j(x)\, d\xi(x)$ is the $(i,j)$-th entry of the information matrix.
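On a finite grid, condition (1.10) can be tested directly: $dF(\cdot; \mu)$ must be (numerically) constant on the support of $\mu$ and no smaller elsewhere. A small R sketch under the same conventions as above (names ours); a check of exactly this kind reappears as the stopping rule is.optim in the algorithms of Section 3.

```r
# Sketch: numerical test of the optimality condition (1.10) on a grid.
# 'mu' >= 0 with sum(mu) = 1; 'dF' holds dF(x; mu) at the grid points.
is.optim <- function(mu, dF, tol = 1e-2, supp.tol = 1e-8) {
  supp <- mu > supp.tol                 # support of mu up to supp.tol
  max(dF[supp]) - min(dF) <= tol        # dF constant and minimal on supp
}
```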

Example 2. (D-optimal design) Put $\Phi(M) = -\log\det M$. Then $\nabla\Phi(M) = -M^{-1}$, and the only constraint $H(\xi) = \xi(\mathcal{X}) - 1$ has the derivative $dH \equiv 1$. Since $dF(x; \xi) = -f(x)^\top M^{-1}(\xi) f(x)$, Theorem 1 yields the necessary condition in the Kiefer-Wolfowitz characterisation of D-optimal designs.

Example 3. (D-optimal design with fixed moments) Let $\mathcal{X} \subset \mathbb{R}^d$ and assume that along with the constraint on the total mass $\xi(\mathcal{X}) = 1$ we fix the expectation of $\xi$, which is a vector $m = \int x\, d\xi(x)$. These constraints can be written as $H(\xi) - (m, 1) = 0$, where $H(\xi)$ is a $(d+1)$-dimensional vector function with the components $H_i(\xi) = \int x_i\, d\xi(x)$ for $i = 1, \dots, d$ and $H_{d+1}(\xi) = \xi(\mathcal{X})$. Clearly, (1.7) holds with $dH(x) = (x, 1)$. By Theorem 1, if $\xi$ minimises $F(\xi)$ under the condition $H(\xi) = (m, 1)$, then

$$dF(x; \xi) = \langle v, x\rangle + u \quad \xi\text{-almost everywhere},$$
$$dF(x; \xi) \ge \langle v, x\rangle + u \quad \text{for all } x \in \mathcal{X}$$

for some $v \in \mathbb{R}^d$ and $u \in \mathbb{R}$. In other words, $dF(x; \xi)$ is affine for $\xi$-almost all $x$.

Example 4. (D-optimal designs with a limited cost) Let $c(x)$ determine the cost of making an observation at point $x$. If the expected total cost is bounded by $C$, then the design measure $\xi$ should, in addition to $\xi(\mathcal{X}) = 1$, satisfy the constraint $\int c(x)\, d\xi(x) \le C$. If $\xi$ provides a $\Phi$-optimal design under such constraints, then

$$dF(x; \xi) = u + w\, c(x) \quad \xi\text{-a.e.},$$
$$dF(x; \xi) \ge u + w\, c(x) \quad \text{for all } x \in \mathcal{X}$$

for some $u \in \mathbb{R}$ and $w \ge 0$, with $w = 0$ if $\int c(x)\, d\xi(x) < C$.

Since the function $F(\xi) = \Phi(M(\xi))$ in the D-optimality problem is a convex function of $\xi$, the above necessary conditions become necessary and sufficient, thus providing a full characterisation of the optimal designs. Other examples, like D-optimal designs with bounded densities, designs on product spaces, etc., can be treated similarly.
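To make the link between Example 2 and the classical Kiefer-Wolfowitz theorem explicit, here is a short derivation of ours using only (1.10): integrating the first line of (1.10) against $\xi$ gives

$$u = \int dF(x;\xi)\, d\xi(x) = -\int f(x)^\top M^{-1}(\xi) f(x)\, d\xi(x) = -\operatorname{tr}\bigl(M^{-1}(\xi)\, M(\xi)\bigr) = -k,$$

so the inequality in (1.10) becomes $f(x)^\top M^{-1}(\xi) f(x) \le k$ for all $x \in \mathcal{X}$, with equality $\xi$-almost everywhere — the familiar Kiefer-Wolfowitz condition for D-optimality.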

3. GRADIENT METHODS

Direction of the steepest descent. The gradient descent method relies on knowledge of the derivative of the goal functional. As before, it is assumed that $F(\mu)$ is Fréchet differentiable and its derivative has the integral representation (1.7). We first consider the general setup, which will be specialised later for particular constraints and objective functionals.

The most basic method of the gradient descent type used in optimal design suggests moving from $\xi_n$ (the approximation on step $n$) to $\xi_{n+1} = (1 - \alpha_n)\xi_n + \alpha_n \nu_n$, where $0 < \alpha_n < 1$ and $\nu_n$ minimises $DF(\xi_n)[\nu]$ over all probability measures $\nu$ (Wynn, 1970). It is easy to see that such $\nu_n$ is concentrated at the points where the corresponding gradient function $dF(x; \xi_n)$ is minimised. Rearranging the terms, we obtain

$$\xi_{n+1} = \xi_n + \alpha_n(\nu_n - \xi_n). \qquad (1.11)$$

In this form the algorithm looks like a conventional descent algorithm that descends along the direction $\eta_n = \alpha_n(\nu_n - \xi_n)$. In this particular case, the step size is $\|\eta_n\| = 2\alpha_n$, and $\eta_n = \eta_n^+ - \eta_n^-$ with

$$\eta_n^+ = \alpha_n \nu_n, \qquad \eta_n^- = \alpha_n \xi_n.$$

Such a choice of $\eta_n$ ensures that $\xi_n + \eta_n$ remains a probability measure, since the negative part of $\eta_n$ is proportional to $\xi_n$ with $\alpha_n < 1$. However, if we do not restrict the choice of the descent direction to measures with the negative part proportional to $\xi_n$, then it is possible to find a steeper descent direction $\eta_n$ than the one suggested by (1.11). If the current value is $\xi_n = \mu$, then the steepness of the descent direction $\eta$ is commonly characterised by the value of the derivative

$$DF(\mu)[\eta] = \int dF(x; \mu)\, d\eta(x).$$

The true steepest descent direction must be chosen to minimise $DF(\mu)[\eta]$ over all signed measures $\eta$ with total variation $\|\eta\| = 2\alpha_n$ such that $\mu + \eta$ belongs to the family $\mathbb{M}^+$ of non-negative measures and satisfies all specified constraints. For instance, if the only constraint is that the total mass $\mu(\mathcal{X}) = 1$, then any signed measure $\eta$ with $\eta(\mathcal{X}) = 0$ and the negative part dominated by $\mu$ would ensure that $\mu + \eta$ stays in $\mathbb{M}^+$ and maintains the required total mass.

Consider the problem with only linear constraints of equality type:

$$H_i(\mu) = \int h_i(x)\, d\mu(x) = c_i, \quad i = 1, \dots, k, \qquad (1.12)$$

where $c_1, \dots, c_k$ are given real numbers. In vector form, $H(\mu) = \int h(x)\, d\mu(x) = c$, where $H = (H_1, \dots, H_k)$, $h = (h_1, \dots, h_k)$ and $c = (c_1, \dots, c_k)$, implying that $DH(\mu)[\eta] = \int h(x)\, d\eta(x)$. For a $\mu \in \mathbb{M}^+$ denote by $\mathcal{E}_\mu$ the family of all signed measures $\eta \in \mathbb{M}$ such that $\mu + \eta \in \mathbb{M}^+$ and $\mu + \eta$ satisfies the constraints (1.12). The family $\mathcal{E}_\mu$ represents the admissible directions of descent. Let $\mu|_A$ denote the restriction of a measure $\mu$ onto a Borel set $A$, i.e. $\mu|_A(\cdot) = \mu(\cdot \cap A)$. The following results, which we give without proof, characterise the steepest direction; on a finite grid the direction-finding problem is a linear programme, as sketched below.
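Indeed, on a finite grid the problem of minimising $DF(\mu)[\eta]$ over $\eta \in \mathcal{E}_\mu$ with $\|\eta\| = \varepsilon$ can be posed as a linear programme in the grid weights (a fact used again in the closing discussion of this section). A minimal sketch, assuming the R package lpSolve; the function name and argument conventions are ours:

```r
# Sketch: steepest admissible direction on a finite grid as an LP.
# Write eta = a - b with a, b >= 0 and b <= mu; minimise sum(dF * eta)
# subject to sum(h_i * eta) = 0 (i = 1..k) and sum(a + b) = eps.
library(lpSolve)
steepest.direction <- function(dF, h, mu, eps) {
  n <- length(mu)                  # grid size; 'h' is a k x n matrix
  obj <- c(dF, -dF)                # variables ordered (a_1..a_n, b_1..b_n)
  con <- rbind(cbind(h, -h),       # keep the linear constraints (1.12)
               rep(1, 2 * n),      # total variation |eta| = eps
               cbind(matrix(0, n, n), diag(n)))   # b <= mu componentwise
  dir <- c(rep("=", nrow(h) + 1), rep("<=", n))
  rhs <- c(rep(0, nrow(h)), eps, mu)
  sol <- lp("min", obj, con, dir, rhs)$solution
  sol[1:n] - sol[(n + 1):(2 * n)]  # the signed increment eta
}
```

With the single total-mass constraint, h = matrix(1, 1, n), the LP solution coincides with the closed-form direction of Corollary 4 below.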

Recall that vectors $w_1, \dots, w_{k+1}$ are called affinely independent if $w_2 - w_1, \dots, w_{k+1} - w_1$ are linearly independent.

Theorem 3 (see Molchanov and Zuyev, 2000). The minimum of $DF(\mu)[\eta]$ over all $\eta \in \mathcal{E}_\mu$ with $\|\eta\| = \varepsilon$ is achieved on a signed measure $\eta = \eta^+ - \eta^-$ such that $\eta^+$ has at most $k$ atoms and $\eta^- = \sum_{i=1}^{k+1} t_i\, \mu|_{A_i}$ for some $0 \le t_i \le 1$ with $t_1 + \dots + t_{k+1} = 1$ and some measurable sets $A_1, \dots, A_{k+1}$ such that the vectors $H(\mu|_{A_i})$, $i = 1, \dots, k+1$, are affinely independent.

Corollary 4. If the only constraint is $\mu(\mathcal{X}) = 1$, then the minimum of $DF(\mu)[\eta]$ over all $\eta \in \mathcal{E}_\mu$ with $\|\eta\| = \varepsilon$ is achieved on a signed measure $\eta = \eta^+ - \eta^-$ such that $\eta^+$ is the positive measure of total mass $\varepsilon/2$ concentrated on the points of the global minima of $dF(x; \mu)$ and $\eta^- = \mu|_{M(t_\varepsilon)} + c\, \mu|_{\bar M(s_\varepsilon)\setminus M(t_\varepsilon)}$, where $M(p) = \{x \in \mathcal{X}: dF(x; \mu) > p\}$, $\bar M(p) = \{x \in \mathcal{X}: dF(x; \mu) \ge p\}$, and

$$t_\varepsilon = \inf\{p: \mu(M(p)) \le \varepsilon/2\}, \qquad (1.13)$$
$$s_\varepsilon = \sup\{p: \mu(\bar M(p)) \ge \varepsilon/2\}. \qquad (1.14)$$

The factor $c$ is chosen in $[0, 1]$ so that $\eta^-(\mathcal{X}) = \mu(M(t_\varepsilon)) + c\, \mu(\bar M(s_\varepsilon)\setminus M(t_\varepsilon)) = \varepsilon/2$.

It is interesting to note that, without a constraint on the total variation norm of the increment measure, the steepest direction preserving the total mass is the measure $\eta = \delta_{x_0} - \mu$, where $\delta_{x_0}$ is the unit measure concentrated at a global minimum point $x_0$ of $dF(x; \mu)$. Thus the classical algorithm based on the modified directional derivative (1.4) uses, in fact, a scaled variant of this direction.
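In the discrete case the direction of Corollary 4 can be built directly, without solving a linear programme: add $\varepsilon/2$ at a global minimum of $dF$ and strip $\varepsilon/2$ of mass from the points with the largest values of $dF$. A sketch in R (names ours; the partially stripped atom realises the factor $c$):

```r
# Sketch: steepest direction of total variation eps under the single
# constraint mu(X) = 1, following Corollary 4 on a finite grid.
corollary4.direction <- function(dF, mu, eps) {
  eta <- numeric(length(mu))
  eta[which.min(dF)] <- eps / 2          # eta^+: mass eps/2 at a global
                                         # minimum point of dF(.; mu)
  ord <- order(dF, decreasing = TRUE)    # eta^-: strip atoms from the
  need <- eps / 2                        # largest values of dF downwards
  for (i in ord) {
    take <- min(mu[i], need)             # the last atom is taken only
    eta[i] <- eta[i] - take              # partially (the factor c)
    need <- need - take
    if (need <= 0) break
  }
  eta                                    # a signed measure with eta(X) = 0
}
```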

Steepest descent in the fixed total mass problem. The necessary condition given in Corollary 2 can be used as a stopping rule for descent algorithms. Corollary 4 above, which describes the steepest descent direction in the minimisation problem with a fixed total mass, justifies the following algorithm.

Procedure do.steep
Input. Initial measure $\mu$.
Step 0. Compute $F(\mu)$.
Step 1. Compute $dF(x; \mu)$. If is.optim($\mu$, $dF$), stop. Otherwise, choose the step size $\varepsilon$.
Step 2. Compute $\mu_1 = $ st.step($\varepsilon$, $\mu$, $dF$).
Step 3. If $F(\mu_1) < F(\mu)$, then put $\mu \leftarrow \mu_1$, $F(\mu) \leftarrow F(\mu_1)$, and go to Step 2. Otherwise, go to Step 1.

The necessary condition for the optimum, which is used as the stopping rule in Step 1 above, is given by (1.10): the function $dF(x; \mu)$ is constant and takes its minimal value on the support of an optimal $\mu$. Strictly speaking, the support of $\mu$ on the discrete space $\mathcal{X}$ is the set $S_\mu = \{x \in \mathcal{X}: \mu(x) > 0\}$, but in practice one may wish to ignore atoms of mass less than a certain small threshold supp.tol. The boolean procedure is.optim has the following structure:

Procedure is.optim
Input. Measure $\mu$, gradient function $dF(x; \mu)$, tolerances tol and supp.tol.
Step 1. Determine the support $S_\mu$ of $\mu$ up to tolerance supp.tol.
Step 2. If $\max_{x \in S_\mu} dF(x; \mu) - \min_{x \in \mathcal{X}} dF(x; \mu) \le$ tol, return TRUE, and otherwise FALSE.

The procedure st.step returns the updated measure $\mu_1 = \mu + \eta$, where $\eta$ is an increment measure with total mass 0 and total variation $\varepsilon$ along the steepest direction described in Corollary 4.

Procedure st.step
Input. Step size $\varepsilon$, measure $\mu$, gradient function $dF(x; \mu)$.
Step 0. Put $\mu_1(x) = \mu(x)$ for each point $x \in \mathcal{X}$.
Step 1. Find a point $x_0$ of the global minimum of $dF(x; \mu)$ and put $\mu_1(x_0) = \mu(x_0) + \varepsilon/2$.
Step 2. Find $t_\varepsilon$ and $c$ from (1.13) and (1.14), put $\mu_1(x) = 0$ for all $x \in M(t_\varepsilon)$, decrease the total $\mu_1$-mass of the points from $\bar M(s_\varepsilon) \setminus M(t_\varepsilon)$ by the value $\varepsilon/2 - \mu(M(t_\varepsilon))$, and return the thus obtained $\mu_1$.

In Step 1 above the mass $\varepsilon/2$ can also be spread uniformly, or in any other manner, over the points of the global minimum of $dF(x; \mu)$. The described algorithm leads to a global minimum when applied to convex objective functions. In the general case it may get stuck in a local minimum, a feature common to gradient algorithms applied in the context of global optimisation.

There are many possible methods suitable for choosing the step size $\varepsilon$ in Step 1 of procedure do.steep. Many aspects can be taken into consideration: the previous step size and/or the difference between the supremum and infimum of $dF(x; \mu)$ over the support of $\mu$. The Armijo method widely used for general gradient descent algorithms (Polak, 1997) defines the new step size to be $\varepsilon\beta^m$, where the integer $m$ is such that

$$F(\mu + \eta_{\varepsilon\beta^m}) - F(\mu) \le \alpha \int dF(x; \mu)\, d\eta_{\varepsilon\beta^m}(x),$$
$$F(\mu + \eta_{\varepsilon\beta^{m-1}}) - F(\mu) > \alpha \int dF(x; \mu)\, d\eta_{\varepsilon\beta^{m-1}}(x),$$

where $0 < \alpha < 1$, $0 < \beta < 1$, and $\eta_s$ is the steepest descent measure with total variation $s$ described in Corollary 4.
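For concreteness, the following is a minimal, self-contained R sketch of a do.steep-style loop for the fixed-total-mass problem with the D-criterion, reusing is.optim and corollary4.direction from the earlier sketches; a crude step-halving rule stands in for the Armijo rule, and all names are ours, not those of the mefista library mentioned below.

```r
# Minimal sketch of the descent loop (our names, not mefista's) for the
# D-criterion F(mu) = -log det M(mu), M(mu) = sum_x mu(x) f(x) f(x)^T.
do.steep <- function(f, mu, eps = 0.5, tol = 1e-2, supp.tol = 1e-8) {
  X <- t(f)                                   # row x of X is f(x)^T
  Fval <- function(m) -determinant(crossprod(X * sqrt(m)))$modulus
  grad <- function(m) -rowSums((X %*% solve(crossprod(X * sqrt(m)))) * X)
  repeat {
    dF <- grad(mu)
    if (is.optim(mu, dF, tol, supp.tol)) return(mu)   # stopping rule (1.10)
    repeat {                                  # step-size halving in place
      mu1 <- mu + corollary4.direction(dF, mu, eps)   # of the Armijo rule
      if (Fval(mu1) < Fval(mu)) { mu <- mu1; break }
      eps <- eps / 2
      if (eps < 1e-8) return(mu)              # give up on vanishing steps
    }
  }
}
```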

The corresponding steepest descent algorithms are applicable to general differentiable objective functions and are realised in the S-plus/R library mefista (for MEasures with FIxed mass STeepest Ascent/descent) from the bundle mesop, which can be obtained from the authors' web-pages. The chosen optimality criterion and the regressors should be passed as arguments to the descent procedure as (appropriately coded) functions $\Phi$ and $f$.

A numerical example: cubic regression through the origin. Let $\mathcal{X} = [0, 1]$ and let $f(x) = (x, x^2, x^3)$. For the D-optimal design in this model an exact solution is available that assigns equal weights to the three points $(5 \mp \sqrt{5})/10$ and 1 (Atkinson and Donev, 1992, p. 119). The initial measure is taken to be uniform on $[0, 1]$, discretised with grid size 0.01.

In the algorithm described in (Atkinson and Donev, 1992), further referred to as A0, a mass equal to the step size is added at the point of minimum of the gradient function $dF(x; \xi)$, and the measure is then rescaled to the original total mass. The algorithm A1 is based on Corollary 4 and described above. We run both algorithms A0 and A1 until either the objective function no longer decreases or the stopping condition is satisfied: the standardised response function attains its minimum and is constant (within the predetermined tolerance level of 0.01) on the support of the current measure.

It takes 28 steps for our algorithm A1 to obtain an approximation to the optimal measure supported on the grid points 0.27, 0.28, 0.72, 0.73 and 1. Note that the distribution of the mass over the nearest grid points corresponds exactly to the true atoms positioned at the irrational points $(5 \mp \sqrt{5})/10$: the masses at 0.27 and 0.28, and likewise those at 0.72 and 0.73, are split so that their barycentres fall at $(5 - \sqrt{5})/10 \approx 0.2764$ and $(5 + \sqrt{5})/10 \approx 0.7236$, respectively. The final step size is 0.00078125, and the range of $dF(x; \xi)$ on the support of $\xi$ is below tol $= 0.01$. Thus, algorithm A1 achieves the optimal design measure. In contrast, it takes 87 steps for A0 to arrive at a similar result (which still has a number of support points with negligible mass, but not true zeroes). Although our algorithm makes more calculations at each step, it took 1.96 seconds of system time on a Pentium-II PC with 256 Mb RAM running at 450 MHz under Linux to finalise the task, compared to 3.70 seconds for A0. The difference becomes even more spectacular for smaller tolerance levels or finer grids.
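Under the conventions of the sketches above (and with the same caveat that the code is ours, not the mefista library), this example can be reproduced along the following lines:

```r
# Reproducing the cubic-regression example with the sketches above
# (grid of step 0.01 on [0,1]; regressors x, x^2, x^3; uniform start).
x <- seq(0.01, 1, by = 0.01)           # exclude 0, where f(0) = 0
f <- rbind(x, x^2, x^3)                # k x n matrix of regressors
mu <- rep(1 / length(x), length(x))    # uniform initial measure
xi <- do.steep(f, mu, tol = 0.01)
round(data.frame(x = x[xi > 1e-3], mass = xi[xi > 1e-3]), 4)
# The mass should concentrate near (5 -/+ sqrt(5))/10 and at 1,
# with roughly one third of the total mass around each point.
```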

Multiple linear constraints. Although the steepest direction for optimisation with many linear constraints given by (1.12) is characterised in Theorem 3, its practical determination becomes a difficult problem. Indeed, it is easy to see that for a discrete space $\mathcal{X}$ (used in numerical methods) the minimisation of $DF(\mu)[\eta]$ over all signed measures $\eta \in \mathcal{E}_\mu$ with $\|\eta\| = \varepsilon$ is a linear programming problem of dimension equal to the cardinality of $\mathcal{X}$. Therefore, in the presence of many constraints, it might be computationally more efficient to use an approximation to the exact steepest direction. For instance, it is possible to fix the negative component of the increment at every step and vary only its positive part.

By Theorem 3, the positive part of the steepest increment measure consists of at most $k$ atoms. With the negative part being proportional to $\mu$, we suggest moving from the current measure $\mu$ to $\mu + \eta$ with $\eta = \nu - \beta\mu$ for some $\beta > 0$. This is equivalent to the renormalisation $\mu_1 = (1 - \beta)\mu + \nu$ with $0 < \beta < 1$. The measure $\nu$ has $k$ atoms of masses $p_1, \dots, p_k$ located at points $x_1, \dots, x_k$ chosen so as to minimise the directional derivative $DF(\mu)[\eta]$ (or, equivalently, $DF(\mu)[\nu]$) and to maintain the imposed constraints (1.12). The value of $\beta$ characterises the size of the step, although it is not equal to the total variation of $\eta$. To satisfy the linear constraints $H(\mu + \eta) = c$ we need to have $H(\nu) = \beta c$, which can be written in matrix form as

$$\sum_{i=1}^k p_i\, h(x_i) = H(x_1, \dots, x_k)\, p = \beta c$$

with $p = (p_1, \dots, p_k)^\top$ and $H(x_1, \dots, x_k)$ the $k \times k$ matrix whose $i$-th column is $h(x_i)$. This implies

$$p = \beta\, H(x_1, \dots, x_k)^{-1} c. \qquad (1.15)$$

Since $\nu$ minimises $DF(\mu)[\nu] = \sum_i p_i\, dF(x_i; \mu)$, the directional derivative $DF(\mu)[\eta]$ is minimised if $(x_1, \dots, x_k)$ minimises

$$\beta\, dF(x_1, \dots, x_k)\, H(x_1, \dots, x_k)^{-1} c, \qquad (1.16)$$

where $dF(x_1, \dots, x_k) = (dF(x_1; \mu), \dots, dF(x_k; \mu))$. The right-hand side of (1.16) is a function of the $k$ variables $x_1, \dots, x_k$ that should be minimised to find the optimal locations of the atoms. Their masses $p_1, \dots, p_k$ are then determined by (1.15). Note that the minimisation is restricted to only those $k$-tuples $(x_1, \dots, x_k)$ that provide $p$ with all non-negative components.

Since this descent differs from the steepest descent given in Theorem 3, an additional analysis is necessary to ensure that an algorithm of the described kind does converge to the desired solution. Using the same arguments as in (Wu and Wynn, 1978), it is possible to prove a dichotomous theorem for this situation.

The descent algorithm for many constraints based on the renormalisation procedure has been programmed in the S-plus and R languages. The corresponding library medea (for MEasure DEscent/Ascent) from the bundle mesop and related examples can be obtained from the authors' web-pages. In the case of a single constraint on the total mass, the described approach turns into a renormalisation method: $\nu$ has a single atom placed at the point of the global minimum of $dF(x; \mu)$ so as to minimise $DF(\mu)[\eta]$.

References

Atkinson, A. C. and Donev, A. N. (1992). Optimum Experimental Designs. Clarendon Press, Oxford.
Atwood, C. L. (1973). Sequences converging to D-optimal designs of experiments. Ann. Statist., 1:342–352.
Atwood, C. L. (1976). Convergent design sequences, for sufficiently regular optimality criteria. Ann. Statist., 4:1124–1138.
Cook, D. and Fedorov, V. (1995). Constrained optimization of experimental design. Statistics, 26:129–178.
Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press, N.Y.
Fedorov, V. V. and Hackl, P. (1997). Model-Oriented Design of Experiments, volume 125 of Lect. Notes Statist. Springer, New York.
Ford, I. (1976). Optimal Static and Sequential Design: a Critical Review. PhD thesis, Department of Statistics, University of Glasgow, Glasgow.
Hille, E. and Phillips, R. S. (1957). Functional Analysis and Semigroups, volume XXXI of AMS Colloquium Publications. American Mathematical Society, Providence.
Molchanov, I. and Zuyev, S. (1997). Variational analysis of functionals of a Poisson process. Rapport de Recherche 3302, INRIA, Sophia-Antipolis. To appear in Math. Oper. Res.
Molchanov, I. and Zuyev, S. (2000). Steepest descent algorithms in space of measures. To appear.
Polak, E. (1997). Optimization. Springer, New York.
Wu, C.-F. (1978a). Some algorithmic aspects of the theory of optimal design. Ann. Statist., 6:1286–1301.
Wu, C.-F. (1978b). Some iterative procedures for generating nonsingular optimal designs. Comm. Statist. Theory Methods, A7(14):1399–1412.
Wu, C.-F. and Wynn, H. P. (1978). The convergence of general step-length algorithms for regular optimum design criteria. Ann. Statist., 6:1273–1285.
Wynn, H. P. (1970). The sequential generation of D-optimum experimental designs. Ann. Math. Statist., 41:1655–1664.
Wynn, H. P. (1972). Results in the theory and construction of D-optimum experimental designs. J. Roy. Statist. Soc. Ser. B, 34:133–147.
Zowe, J. and Kurcyusz, S. (1979). Regularity and stability for the mathematical programming problem in Banach spaces. Appl. Math. Optim., 5:49–62.

12 library Ñ (for MEasure DEscent/Ascent) from bundle Ñ ÓÔ and related examples can be obtained from the authors web-pages. In the case of a single constraint on the total mass the described approach turns into a renormalisation method: has a single atom that is placed at the point of global minimum of µ to minimise µ. References Atkinson, A. C. and Donev, A. N. (1992). Optimum Experimental Designs. Clarendon Press, Oxford. Atwood, C. L. (1973). Sequences converging to D-optimal designs of experiments. Ann. Statist., 1:342 352. Atwood, C. L. (1976). Convergent design sequences, for sufficiently regular optimality criteria. Ann. Statist., 4:1124 1138. Cook, D. and Fedorov, V. (1995). Constrained optimization of experimental design. Statistics, 26:129 178. Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press, N.Y. Fedorov, V. V. and Hackl, P. (1997). Model-Oriented Design of Experiments, volume 125 of Lect. Notes Statist. Springer, New York. Ford, I. (1976). Optimal Static and Sequential Design: a Critical Review. PhD Thesis, Department of Statistics, University of Glasgow, Glasgow. Hille, E. and Phillips, R. S. (1957). Functional Analysis and Semigroups, volume XXXI of AMS Colloquium Publications. American Mathematical Society, Providence. Molchanov, I. and Zuyev, S. (1997). Variational analysis of functionals of a Poisson process. Rapport de Recherche 3302, INRIA, Sophia-Antipolis. To appear in Math. Oper. Res. Molchanov, I. and Zuyev, S. (2000). Steepest descent algorithms in space of measures. To appear. Polak, E. (1997). Optimization. Springer, New York. Wu, C.-F. (1978a). Some algorithmic aspects of the theory of optimal design. Ann. Statist., 6:1286 1301. Wu, C.-F. (1978b). Some iterative procedures for generating nonsingular optimal designs. Comm. Statist. Theory Methods, A7(14):1399 1412. Wu, C.-F. and Wynn, H. P. (1978). The convergence of general step-length algorithms for regular optimum design criteria. Ann. Statist., 6:1273 1285. Wynn, H. P. (1970). The sequential generation of D-optimum experimental designs. Ann. Math. Statist., 41:1655 1664. Wynn, H. P. (1972). Results in the theory and construction of D-optimum experimental designs. J. Roy. Statist. Soc. Ser. B, 34:133 147. Zowe, J. and Kurcyusz, S. (1979). Regularity and stability for the mathematical programming problem in Banach spaces. Appl. Math. Optim., 5:49 62.