Lattice Based Derivative-free Optimization via Global Surrogates


Lattice Based Derivative-free Optimization via Global Surrogates Paul Belitz and Thomas Bewley Abstract Derivative-free algorithms are frequently required for the optimization of nonsmooth scalar functions in n variables, as defined by physical experiments or by averaging of the statistics of numerical simulations of chaotic systems such as turbulent flows. The core idea of all efficient algorithms for problems of this class is to keep function evaluations far apart until convergence is approached. Generalized Pattern Search (GPS) algorithms accomplish this by coordinating the search with an underlying grid which is refined, and coarsened, as appropriate. Rather than using the cubic grid (the typical choice), the present work introduces for this purpose the use of lattices derived from n-dimensional sphere packings. Such lattices are significantly more uniform and have much higher kissing numbers (that is, they have many more nearest neighbors) than their cubic grid counterparts; both of these facts make them much better suited for coordinating GPS algorithms. One of the most efficient subclasses of GPS algorithms, known as the Surrogate Management Framework (SMF), alternates between an exploratory search over an interpolating function which summarizes the trends exhibited by the existing function evaluations, and an exhaustive poll which checks the function on neighboring points to confirm or confute the local optimality of any given Candidate Minimum Point (CMP) on the underlying grid. The present work uses efficient lattices based on n-dimensional sphere packings to coordinate such surrogate-based optimizations while incorporating an efficient global search strategy based on both the predictor and the uncertainty of a kriging model of the function, thereby developing an extremely efficient algorithm for Lattice Based Derivative-free Optimization via Global Surrogates, dubbed LABDOGS.
Our code implementing this algorithm, dubbed Checkers, compares favorably to competing algorithms on a range of well-known optimization test problems when implemented on the root lattices and tested up to dimension n = 8. In addition to introducing this efficient new algorithm and its implementation, the present paper contains a comprehensive review of the component ideas upon which they are built: namely, lattice theory (à la Conway & Sloane 1998) and derivative-free optimization via global surrogates (à la Cox & John 1997, Booker et al. 1999, and Jones 2001). I. BACKGROUND The minimization of computationally expensive, high-dimensional functions is often most efficiently performed via gradient-based optimization algorithms such as Steepest Descent, Conjugate Gradient, and L-BFGS. In complex systems for which an accurate computer model is available, the gradient required by such algorithms may often be found by adjoint analysis. However, when the function in question is not sufficiently smooth to leverage gradient information effectively during its optimization (see, e.g., Figure 1), a derivative-free approach is necessary. Such a scenario is evident, for example, when optimizing a finite-time-average approximation of an infinite-time-average statistic of a chaotic system such as a turbulent flow. Such an approximation may be determined via simulation or experiment. The truncation of the averaging window used to determine this approximation renders derivative-based optimization strategies ill suited, as the truncation error, though small, is effectively decorrelated from one flow simulation/experiment to the next. This effective decorrelation of the truncation error is reflected, for example, by the exponential growth, over the entire finite time horizon considered, of the adjoint field related to the optimization problem of interest in the simulation-based setting.
Due to the sometimes significant expense associated with performing repeated function evaluations (in the above example, turbulent flow simulations or experiments), a derivative-free optimization algorithm which, with reasonable confidence, converges to within an accurate tolerance of the global minimum of a nonconvex function of interest with a minimum number of function evaluations is desired. It is noted that, in the general case, proof of convergence of an optimization algorithm to a global minimum is possible only when, in the limit that the total number of function evaluations, N, approaches infinity, the function evaluations become dense in the feasible region of parameter space (Torn & Zilinskas, 1987). Though the algorithm developed in the present work, when implemented properly, satisfies this condition, so do far inferior approaches, such as a rather unintelligent algorithm which we call Exhaustive Sampling (ES), which simply covers the feasible parameter space with a grid, evaluates the function at every gridpoint, refines the grid by a factor of two, and repeats until terminated. Thus, a guarantee of global convergence is not sufficient to establish the efficiency of an optimization algorithm. If function evaluations are relatively expensive, and thus only a relatively small number of function evaluations can be afforded, effective heuristics for rapid convergence are certainly just as important, if not significantly more important, than rigorous proofs of the behavior of the optimization algorithm in the limit that N → ∞, a limit that might be argued to be of limited relevance when function evaluations are expensive. Careful attention to such heuristics thus forms an important foundation for the present study. If, for the moment, we give up on the goal of global convergence, perhaps the simplest grid-based derivative-free optimization algorithm, which we will identify in this paper with the name Successive Polling (SP), proceeds as follows.
Start with a coarse grid and evaluate the function at some starting point on this grid, identified as the first candidate minimum point (CMP). Then, poll (that is, evaluate) the function values on gridpoints which neighbor the CMP in parameter space, at a sufficient number of gridpoints to positively span the feasible neighborhood of the CMP [this step ensures convergence, as discussed further in Torczon 1997, Booker et al. 1999, and Coope & Price 2001]. When polling: (a) If any poll point is found to have a function value lower than that of the CMP, immediately consider this new point the new CMP and terminate the present poll step. (b) If all poll points are found to have function values higher than that of the CMP, refine the grid by a factor of two. A new poll step is then initiated, either around the new CMP or on the refined grid, and the process repeated until terminated.

Fig. 1. Prototypical nonsmooth optimization problem for which local gradient information is ill-suited to accelerate the optimization algorithm.

Though the basic SP algorithm described above, on its own, is not very efficient, there are a variety of effective techniques for accelerating it. All grid-based schemes which effectively build on the basic SP idea described above are classified as Generalized Pattern Search (GPS) algorithms. The most efficient subclass of GPS algorithms, known as the Surrogate Management Framework (SMF; see Booker et al., 1999), leverages inexpensive interpolating surrogate functions (often, kriging interpolations are used) to interpolate the available function evaluations and provide suggested regions of parameter space in which to perform new function evaluations between each poll step. SMF algorithms thus alternate between two steps: (i) Search over the inexpensive interpolating function to identify, based on the existing function evaluations, the most promising gridpoint at which to perform a new function evaluation. Perform a function evaluation at this point, update the interpolating function, and repeat.
The search step is terminated when this search algorithm returns a gridpoint at which either the function has already been evaluated or the function, once evaluated, has a value greater than that of the CMP. (ii) Poll the neighborhood of the new CMP identified by the search algorithm, following rules (a) and (b) above. There is substantial flexibility during the search step described above. An effective search strategy is essential for an efficient SMF algorithm. In the case that the search behaves poorly and fails to return improved function values, which often happens when the function of interest is very flat (such as near the minimum of the Rosenbrock test function), the SMF algorithm essentially reduces to the SP algorithm. If, however, the surrogate-based search is effective, the SMF algorithm will converge to a minimum far faster than the simple SP search. As the search and poll steps are essentially independent of one another, we will discuss them each in turn in the sections that follow, then present how we have combined them in our highly efficient new optimization algorithm and its realization in numerical code. Note that, if the search produces a new CMP which is several gridpoints away from all previous function evaluations, which often happens when exploring functions with multiple minima, the grid may be coarsened appropriately in order to explore the vicinity of this new CMP efficiently (that is, with a coarse grid first, then refined as necessary). Note also that the interpolating surrogate function of the SMF may be used to order the function evaluations at the poll points such that those poll points which are most likely to have a function value lower than that of the CMP are evaluated first. By so doing, the poll steps will, on average, terminate much sooner, and the computational cost of the overall algorithm may be substantially reduced. 
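The basic SP loop underlying this framework [rules (a) and (b) above] is compact enough to sketch directly. The following Python fragment is an illustration of ours, not code from the paper: it polls the 2n nearest neighbors of the CMP on a cubic grid and halves the grid spacing on poll failure, omitting the surrogate-based search step of the SMF.

```python
def successive_polling(f, x0, h0, min_h=1e-3):
    """Minimal sketch of the basic Successive Polling (SP) loop: poll the
    2n nearest grid neighbors of the candidate minimum point (CMP); move to
    the first improvement found [rule (a)], and if no poll point improves,
    refine the grid by a factor of two [rule (b)]."""
    n = len(x0)
    cmp_x, cmp_f, h = tuple(x0), f(x0), h0
    evals = 1
    while h > min_h:
        improved = False
        for i in range(n):                       # poll coordinate directions
            for s in (+1, -1):
                y = tuple(c + s*h if j == i else c
                          for j, c in enumerate(cmp_x))
                fy = f(y)
                evals += 1
                if fy < cmp_f:                   # rule (a): new CMP, re-poll
                    cmp_x, cmp_f, improved = y, fy, True
                    break
            if improved:
                break
        if not improved:                         # rule (b): refine the grid
            h /= 2
    return cmp_x, cmp_f, evals

# Anisotropic convex quadratic with minimizer (1, -0.5), a grid point here.
quad = lambda x: (x[0] - 1.0)**2 + 4.0*(x[1] + 0.5)**2
x, fx, n_evals = successive_polling(quad, x0=(0.0, 0.0), h0=1.0)
```

On this convex test function the poll step finds descent at coarse grid levels and rule (b) then localizes the minimizer; an efficient SMF implementation would interleave surrogate-based search steps between these polls.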
At the heart of the SMF algorithm lies the discretizing grid or lattice to which all function evaluations are restricted, and which defines the set of points from which the poll set is selected at each poll step. To the best of the authors' knowledge, all previous algorithms using such grid-based optimization strategies have been based on cubic grids. However, as in the game of Checkers (contrast American Checkers with Chinese Checkers), cubic grids are not the only possibility for discretizing parameter space in such an application. As the underlying lattice is the foundation for any GPS algorithm, we first define and compare the characteristics of various lattice alternatives to cubic grids.

Footnote: That is, such that any feasible point in the neighborhood of the CMP can be reached via a linear combination with non-negative coefficients of the vectors from the CMP to the poll points.

Fig. 2. The packing (a and b), covering (c and d), typical Voronoi cells (e and f), and kissing (g and h) of the 2D square and hexagonal lattices.

n   lattice  name       Δ         Θ          G       τ
2   Z^2      square     0.7854    1.5708     0.0833  4
2   A_2      hexagonal  0.9069    1.2092     0.0802  6
8   Z^8      cubic      0.0159    64.94      0.0833  16
8   E_8      Gosset     0.2537    4.059      0.0717  240
24  Z^24     cubic      1.15e-10  4,200,263  0.0833  48
24  Λ_24     Leech      0.00193   7.9035     0.0658  196,560

Table 1. Characteristics of selected lattices in dimensions n = 2, 8, and 24. Boldface denotes values known, or believed, to be optimal among all lattices at that n.

II. A REVIEW OF LATTICE THEORY

A. Introduction: a tale of two lattices

We now consider the problem of characterizing an n-dimensional lattice by focusing our attention first on the n = 2 case; specifically, on the square (Z^2) and hexagonal (A_2) lattices. The characteristics of such lattices may be quantified by the following measures [Conway & Sloane 1998]: The packing radius of a lattice, ρ, is the maximal radius of the spheres in a set of identical nonoverlapping spheres centered at each nodal point (see Figures 2a and 2b). The packing density of a lattice, Δ, is the fraction of the volume of the feasible domain included within a set of identical non-overlapping spheres of radius ρ centered at each nodal point on the lattice. Lattices that maximize this metric are referred to as close-packed. The covering radius of a lattice, R, is the maximum distance between any point in the feasible domain and its nearest nodal point on the lattice. The deep holes of a lattice are those points which are at a distance R from all of their nearest nodal points. Typical vectors from a lattice point to the nearest deep holes are often denoted [1], [2], etc. The covering thickness of a lattice, Θ, is the number of spheres of radius R centered at each nodal point (see Figures 2c and 2d) containing an arbitrary point in the domain, averaged over the entire domain. The Voronoi cell of a nodal point on a lattice, V(P_i), consists of all points in the domain that are at least as close to the nodal point P_i as they are to any other nodal point P_j (see Figures 2e and 2f).
The mean squared quantization error per dimension of a lattice, G, is the average mean square distance of any point in the domain to its nearest nodal point, normalized by the appropriate power of the volume of the Voronoi cell, and divided

by n. Shifting the origin to be at the centroid of a Voronoi cell V(P_i), it is given by G = (1/n) ∫_{V(P_i)} ‖x‖² dx / [ ∫_{V(P_i)} dx ]^{1+2/n}. The kissing number of a lattice, τ, is defined as the number of nearest neighbors to any given nodal point in the lattice. In other words, it is the number of spheres of radius ρ centered at the nodal points that touch, or kiss, the sphere of radius ρ centered at the origin (see Figures 2g and 2h).

Fig. 3. Configuration of the nearest-neighbor gridpoints in the 2D (a) square and (b) hexagonal lattices.

There are two key drawbacks with cubic approaches for the coordination of derivative-free optimization algorithms. First, the discretization of the optimization space is less uniform when using the cubic grid as opposed to the available alternatives, as measured by the packing density Δ, the covering thickness Θ, and the normalized mean-squared quantization error G, as summarized in Table 1. Second, the configuration of nearest-neighbor gridpoints is poor when using the cubic grid, as measured by the kissing number τ, which is an indicator of the degree of flexibility available when selecting a positive basis from nearest neighbors on the lattice. As seen by comparing the n = 2 and n = 24 cases in Table 1, these drawbacks become increasingly substantial as the dimension n is increased. By dimension n = 24, the cubic grid has a factor of 0.00193/1.15e-10 ≈ 1.7 × 10^7 worse packing density, a factor of 4,200,263/7.9035 ≈ 5.3 × 10^5 worse covering thickness, a factor of (0.0833/0.0658)^24 ≈ 290 worse total mean-squared quantization error, and a factor of 196,560/48 ≈ 4095 worse kissing number than the best available alternative lattice, the Leech lattice Λ_24. In light of this, the selection of the cubic grid, by default, for the coordination of n-dimensional GPS algorithms is simply untenable.
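A quick way to make the quantization-error metric G concrete is a Monte Carlo estimate. The sketch below is an illustration of ours, not from the paper; it estimates G for the cubic lattice Z^n, for which the exact value is 1/12 ≈ 0.0833 in every dimension.

```python
import random

def G_cubic(n, samples=200_000, seed=0):
    """Monte Carlo estimate of the mean squared quantization error per
    dimension, G, for the cubic lattice Z^n: quantize uniform samples by
    rounding each coordinate to its nearest integer. The Voronoi cell of
    Z^n has unit volume, so no rescaling by the cell volume is needed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.random() for _ in range(n)]
        total += sum((xi - round(xi))**2 for xi in x) / n
    return total / samples

estimate = G_cubic(2)   # exact value is 1/12 = 0.0833..., for every n
```

Replacing the rounding step with a nearest-point quantizer for A_2 would reproduce the smaller hexagonal value G ≈ 0.0802 quoted in Table 1.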
Recall in particular that the poll points must be selected to form a positive basis around the CMP (that is, a set of vectors such that any point in the feasible parameter space neighboring the CMP can be reached by a linear combination of these vectors with non-negative coefficients). Assuming computationally expensive function evaluations, minimizing the number of poll points while maintaining a positive basis around the CMP is of key importance in maximizing the efficiency of the SP algorithm. This highlights an obvious shortcoming of defining the poll points based on a cubic grid, where a complete poll step performed on a positive basis based on the nearest neighbors of a CMP requires 2n function evaluations; in more well-behaved lattices such as A_n, the positive basis requires only n+1 function evaluations (see Figure 3). In fact, most alternative lattices developed as n-dimensional sphere packings also require only n+1 nearest-neighbor points to form a positive basis. Thus, independent of the benefits in increased uniformity and decreased mean-square quantization error provided by the alternative lattices considered here, a factor of nearly 2 increase in SP efficiency is realized immediately as a direct consequence of the more convenient configuration of the nearest-neighbor points on these alternative lattices. Note that it is possible to construct a positive basis with only n+1 points, referred to as a minimal positive basis, in the n-dimensional cubic case if one poll point is used which is not a nearest-neighbor point, as indicated in n = 2 dimensions in Figure 3a. However, the vector to this oddball point is a factor of √n longer than the remaining n vectors. Additionally, this oddball vector is at a much larger angle to all of the other vectors in the positive basis than these vectors are to themselves.
As a consequence, the region to which the optimal point is effectively localized via polling a set of points distributed in such a fashion is increased greatly from that tight region resulting from a poll on a well-distributed positive basis on nearest-neighbor points, as possible when using the A_2 configuration depicted in Figure 3b. By providing a poorly distributed poll set, the oddball vector approach depicted in Figure 3a significantly compromises the efficiency of the SP algorithm. Taking this idea one step further, a relatively new class of methods, referred to as Mesh Adaptive Direct Search (MADS), polls based on a cubic grid but using several non-nearest-neighbor gridpoints. Though this approach has received much attention in recent years (see, e.g., Abramson, Audet, & Dennis 2005) and shows some promise, a poll of this sort has the unfortunate consequence of effectively localizing the minimum point to a much larger region of parameter space than does a poll based on nearest-neighbor points on a grid of the same density. We believe that a MADS-type approach is rendered unnecessary when a lattice with a significantly higher kissing number than that of the cubic grid is used.
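The positive-spanning property discussed above is easy to verify numerically: a finite set of vectors positively spans the plane if and only if no direction has a non-positive inner product with every vector in the set. The following sketch (illustrative, with a sampled-direction test standing in for an exact check) confirms that n+1 = 3 hexagonal nearest neighbors suffice, while three of the four cubic nearest neighbors do not.

```python
import math

def positively_spans(vectors, n_dirs=3600):
    """A finite set positively spans R^2 iff for every unit direction d
    some vector v has d.v > 0 (no closed half-plane contains them all).
    Here the directions d are sampled densely around the circle."""
    for k in range(n_dirs):
        t = 2 * math.pi * k / n_dirs
        d = (math.cos(t), math.sin(t))
        if all(d[0]*vx + d[1]*vy <= 1e-12 for vx, vy in vectors):
            return False
    return True

# Three of the four nearest neighbors on the square grid do NOT positively
# span the plane, so a full nearest-neighbor poll there needs all 2n = 4...
square3 = [(1, 0), (0, 1), (-1, 0)]
# ...while n+1 = 3 alternating nearest neighbors of the hexagonal lattice do.
hex3 = [(1, 0), (-0.5, math.sqrt(3)/2), (-0.5, -math.sqrt(3)/2)]
assert not positively_spans(square3)
assert positively_spans(hex3)
```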

Note that it is not yet clear which of the four standard metrics introduced above (that is, Δ, Θ, G, τ) is/are in fact most relevant when selecting a lattice for coordinating a derivative-free optimization algorithm, and some experimentation will thus be required to make this selection optimally for each n. The one thing that is clear, however, is that the cubic grid is inferior (by orders of magnitude for even modest values of n) to the available alternative lattices by all four of these metrics, as discussed further and tabulated comprehensively in II-C.

B. Definitions

An n-dimensional real lattice is a regular array of nodal points in R^n, often defined via the dense packing of identical n-dimensional spheres, which is shift invariant (that is, which looks identical upon shifting any nodal point to the origin). This is a special subclass of sphere packings, and there are many regular sphere packings which are not shift invariant [hexagonal close packing (hcp) and diamond packing, denoted D_3^+, are two well-known examples]. For simplicity, the present work focuses primarily on the utility of (notationally straightforward) lattice packings for the coordination of GPS algorithms; the potential advantages in certain dimensions n, in terms of the packing characteristics introduced above, which are available via expanding the scope of this study to include the (notationally more complex) non-lattice packings appear to be relatively slight. We thus make only brief mention of such non-lattice packings in the present introduction. There are three primary ways to define any given n-dimensional real lattice: As an explicit description of the points included in the lattice. As an integer linear combination (that is, a linear combination with integer coefficients) of a set of n basis vectors b_i defined in R^{n+m} for m ≥ 0; for convenience, we arrange these basis vectors as the columns of a basis matrix B (see footnote 3).
As a union of cosets, or sets of nodal points, which themselves may or may not be lattices. The standard form of these definitions as used below makes it straightforward to develop a single derivative-free optimization code that can build easily upon any of the lattices so described. Note that any lattice L_n also has associated with it a dual lattice, denoted L_n*, which is defined such that L_n* = { x ∈ R^n : x·u ∈ Z for all u ∈ L_n }, where Z denotes the set of all integers and x·u is the usual scalar product. If B is a square basis matrix for L_n, then B^{-T} is a square basis matrix for L_n*. The notation L_n ≅ M_n denotes that the two lattices L_n and M_n are equivalent (when appropriately rotated and scaled) at the specified dimension n. Also note that the four perhaps most basic families of lattices we introduce below, denoted Z^n, A_n, D_n, and E_n, are often referred to as the root lattices due to their relation to the root systems of certain Lie algebras. Unless specified otherwise, the word lattice in this article implies a real lattice, defined in R^n. However, note that it is straightforward to extend this work to complex lattices defined in C^n. To accomplish this extension, it is necessary to extend the concept of the integers, which are used to construct the lattice via the integer linear combination of the basis vectors in the basis matrix B, as described above. There are two primary such extensions: The Gaussian integers, defined as G = {a+bı : a,b ∈ Z} where ı = √−1, which lie on a square array in the complex plane C. The Eisenstein integers, defined as E = {a+bω : a,b ∈ Z} where ω = (−1+ı√3)/2 [note that ω³ = 1], which lie on a hexagonal array in the complex plane C. We may thus define two types of complex lattices L from an appropriate basis matrix B: a G-lattice, denoted L[ı], defined as a linear combination of the columns of B with Gaussian integers as weights,
and an E-lattice, denoted L[ω], defined as a linear combination of the columns of B with Eisenstein integers as weights. Note also that, if L is a G-lattice or E-lattice with z ∈ L ⊂ C^n, then there is a corresponding real lattice L_real with x ∈ L_real ⊂ R^{2n} such that x = ( ℜ{z_1} ℑ{z_1} ... ℜ{z_n} ℑ{z_n} )^T. The present work focuses on the use of real lattices to minimize scalar real functions of n real parameters; a straightforward extension of this work is to use complex lattices, as defined above, to minimize scalar real functions of n complex parameters. In the present introduction, we will only make brief use of a complex lattice to simplify the construction of, and quantization to, E_6 and E_6*. Note that the definitive comprehensive reference on this topic is Conway & Sloane (1998), and all results in this section are drawn from the articles compiled in this text unless indicated otherwise.

Footnote 3: In the literature on this subject, it is more common to use a generator matrix M to describe how to construct lattices. The basis matrix convention B used here is related simply to the corresponding generator matrix such that B = M^T; we find the basis matrix convention to be more natural in terms of its linear algebraic interpretation. Note that integer linear combinations of the columns of most matrices do not produce lattices (as defined above). The matrices listed in this section as basis matrices are special in this regard. Note also that basis matrices are not at all unique, but the lattices constructed from alternative forms of them are equivalent; the basis matrices listed in the discussion in this section were selected based on their simplicity.
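The dual-lattice relation stated above (B a square basis matrix for L implies B^{-T} is one for L*) is easy to check numerically, since (Bm)·(B^{-T}k) = m·k is an integer for any integer vectors m and k. The sketch below is illustrative, not from the paper; the hexagonal basis matrix used is one conventional choice.

```python
import math

# Hexagonal (A_2-type) basis in the plane: one conventional choice.
B = [[1.0, 0.5],
     [0.0, math.sqrt(3)/2]]

# Invert and transpose B by hand to obtain the dual basis B^{-T}.
det = B[0][0]*B[1][1] - B[0][1]*B[1][0]
Binv = [[ B[1][1]/det, -B[0][1]/det],
        [-B[1][0]/det,  B[0][0]/det]]
BinvT = [[Binv[0][0], Binv[1][0]],
         [Binv[0][1], Binv[1][1]]]

def mat_vec(M, v):
    return (M[0][0]*v[0] + M[0][1]*v[1],
            M[1][0]*v[0] + M[1][1]*v[1])

# Every inner product between a lattice point B m and a dual-lattice point
# B^{-T} k equals m.k, which is an integer -- the defining property of L*.
for m in [(-2, 3), (1, 0), (5, -4)]:
    for k in [(2, 2), (-1, 3), (0, -7)]:
        x, u = mat_vec(B, m), mat_vec(BinvT, k)
        dot = x[0]*u[0] + x[1]*u[1]
        assert abs(dot - round(dot)) < 1e-9
```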

C. Some important n-dimensional lattices

There are many lattices more complex than the cubic lattice that offer superior uniformity and nearest-neighbor configuration, as quantified by the standard metrics introduced in II-A (namely, packing density, covering thickness, mean-square quantization error, and kissing number). This section reviews some of the most important of these lattices.

1) The cubic lattice Z^n: The cubic lattice, Z^n, is defined Z^n = { (x_1, ..., x_n) : x_i ∈ Z }, and may be constructed via an integer linear combination of the columns of the basis matrix B_Z^n = I_{n×n}. The cubic lattice is self dual [that is, (Z^n)* = Z^n] for all n.

2) The checkerboard lattice D_n and its dual D_n*: The checkerboard lattice, D_n, is an n-dimensional analog of the 3-dimensional face-centered cubic (fcc) lattice. It is defined D_n = { (x_1, ..., x_n) ∈ Z^n : x_1 + ... + x_n even }, and may be constructed via an integer linear combination of the columns of an n × n basis matrix B_Dn whose columns are, for example, (1,1,0,...,0)^T, (0,1,1,0,...,0)^T, ..., (0,...,0,1,1)^T, and (0,...,0,2)^T. The dual of the checkerboard lattice, denoted D_n* and reasonably identified as the offset cubic lattice, is an n-dimensional analog of the 3-dimensional body-centered cubic (bcc) lattice. It may be written as D_n* = D_n ∪ ([1]+D_n) ∪ ([2]+D_n) ∪ ([3]+D_n) = Z^n ∪ ([1]+Z^n), where the coset representatives [1], [2], and [3] are defined such that [1] = (1/2, ..., 1/2, 1/2)^T, [2] = (0, ..., 0, 1)^T, and [3] = (1/2, ..., 1/2, −1/2)^T. Note that, in this notation, ([1] + Z^n) simply denotes a Z^n lattice with all nodal points shifted by the vector [1], and Z^n ∪ ([1]+Z^n) simply denotes the union of all nodal points in Z^n together with all nodal points in ([1]+Z^n). The D_n* lattice may be constructed via an integer linear combination of the columns of an n × n basis matrix B_Dn* whose columns are the first n−1 columns of the identity matrix together with the column (1/2, ..., 1/2)^T. The packing D_n^+, reasonably identified as the offset checkerboard packing, is defined simply as D_n^+ = D_n ∪ ([1]+D_n); note that D_n^+ is a lattice packing only for even n, and that D_3^+ is the diamond packing.
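As a small sanity check on the definition of D_n (illustrative code of ours, not from the paper), one can enumerate its minimal vectors, all of squared length 2, and recover the kissing number τ = 2n(n−1).

```python
from itertools import product

def dn_kissing(n):
    """Count the minimal vectors of the checkerboard lattice D_n: integer
    vectors with even coordinate sum and squared norm 2. These are the
    vectors with exactly two entries equal to +/-1, so the count (the
    kissing number) is 2 n (n-1)."""
    count = 0
    for v in product((-1, 0, 1), repeat=n):   # all minimal-vector candidates
        if sum(v) % 2 == 0 and sum(x*x for x in v) == 2:
            count += 1
    return count

assert dn_kissing(3) == 12   # fcc: 12 nearest neighbors
assert dn_kissing(4) == 24
```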

Fig. 4. The A_2 lattice as defined on a plane in R^3. Note that the normal vector n_A2 = (1 1 1)^T points directly out of the page in this view.

3) The zero-sum lattice A_n and its dual A_n*: The zero-sum lattice, A_n, is an n-dimensional analog of the 2-dimensional hexagonal lattice. It is defined A_n = { (x_0, x_1, ..., x_n) ∈ Z^{n+1} : x_0 + x_1 + ... + x_n = 0 }, (5a) and may be constructed via an integer linear combination of the columns of an (n+1) × n basis matrix B_An whose columns are, for example, (−1,1,0,...,0)^T, (0,−1,1,0,...,0)^T, ..., (0,...,0,−1,1)^T, with n_An = (1, 1, ..., 1)^T. (5b) Notice that A_n is constructed here via n basis vectors in n+1 dimensions. The resulting lattice lies in an n-dimensional subspace of R^{n+1}; this subspace is normal to the vector n_An. An illustrative example is A_2, the hexagonal 2D lattice, which may conveniently be constructed on a plane in R^3 (see Figure 4). Note that, starting from a (2D) hexagonal configuration of oranges on a table at the market, one can stack additional layers of oranges in a hexagonal configuration on top, appropriately offset from the base layer, to build up the (3D) fcc configuration mentioned previously. This idea is referred to as lamination, and will be extended further in the discussion below when considering the Λ_n family of lattices. Also note that, in the special case of n = 2, the A_2 lattice may also be written as A_2 ≅ R ∪ (a + R), where a = (1/2, √3/2)^T (6) and R is the rectangular grid (not a lattice) obtained by stretching the Z^2 grid in the second element by a factor of √3. The dual of the zero-sum lattice, denoted A_n*, may be written as A_n* = ∪_{i=0}^{n} ([i] + A_n), (7a) where the n+1 coset representatives [i], for i = 0, ..., n, are defined such that the k-th component of the vector [i] is [i]_k = i/(n+1) for k ≤ n+1−i, and [i]_k = i/(n+1) − 1 otherwise. The A_n* lattice may be constructed via an integer linear combination of the columns of an (n+1) × n basis matrix B_An* whose columns each have one component equal to n/(n+1) and all remaining components equal to −1/(n+1), again with n_An* = n_An. (7b)
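A similar enumeration (again an illustration of ours, not from the paper) recovers the kissing number τ = n(n+1) of the zero-sum lattice A_n directly from its definition as the sum-zero sublattice of Z^{n+1}.

```python
from itertools import product

def an_kissing(n):
    """Count the minimal vectors of the zero-sum lattice A_n: integer
    vectors in Z^(n+1) with coordinate sum 0 and squared norm 2. These are
    the permutations of (1, -1, 0, ..., 0), so the count is n (n+1)."""
    return sum(1 for v in product((-1, 0, 1), repeat=n+1)
               if sum(v) == 0 and sum(x*x for x in v) == 2)

assert an_kissing(2) == 6    # hexagonal lattice: 6 nearest neighbors
assert an_kissing(3) == 12   # consistent with A_3 = D_3 (fcc)
```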

4) The Gosset lattice E_8 = E_8*, and E_7, E_7*, E_6, E_6*: The Gosset lattice E_8 = E_8*, which has a (remarkable) kissing number of τ = 240, may be defined simply as E_8 = D_8^+, (8a) and may be constructed via an integer linear combination of the columns of an 8 × 8 basis matrix B_E8, formed, for example, from a basis matrix of D_8 with one column replaced by (1/2, 1/2, ..., 1/2)^T. (8b) The lattice E_7 is defined by restricting E_8 to a 7-dimensional subspace, E_7 = { (x_1, ..., x_8) ∈ E_8 : x_1 + ... + x_8 = 0 }, (9a) and may be constructed via an integer linear combination of the columns of an 8 × 7 basis matrix B_E7, with normal vector n_E7 = (1, 1, ..., 1)^T. (9b) A convenient coset construction of E_7 is given by E_7 ≅ ∪_{i=0}^{7} (a_i + 2Z^7), (9c) where the 8 coset representatives a_i are the codewords of a [7,3,4] Hamming code given by the columns of its generator matrix A. The dual of the E_7 lattice may be written as E_7* = E_7 ∪ ([1]+E_7), where [1] = (1/4, ..., 1/4, −3/4, −3/4)^T, (10a) and may be constructed via an integer linear combination of the columns of an 8 × 7 basis matrix B_E7*, with n_E7* = n_E7. (10b) A convenient coset construction of E_7* is given by E_7* ≅ ∪_{i=0}^{15} (a_i + 2Z^7), (10c)

where the 16 coset representatives a_i are the codewords of a [7,4,3] Hamming code given by the columns of its generator matrix A. The lattice E_6 is defined by further restricting E_7 to a 6-dimensional subspace, E_6 = { (x_1, ..., x_8) ∈ E_7 : x_1 + x_8 = 0 }, (11a) and may be constructed via an integer linear combination of the columns of an 8 × 6 basis matrix B_E6, with N_E6 = ( n_E7 n_E6 ). (11b) A convenient coset construction of E_6 is given by the real lattice E_6 ≅ L_real corresponding to the complex lattice L constructed via the union of three scaled and shifted complex E-lattices Z[ω]^3 [themselves constructed via linear combination of the columns of the basis matrix B = I_{3×3} with the complex Eisenstein integers as coefficients] such that L = ∪_{i=0}^{2} (a_i + θ Z[ω]^3), (11c) where θ = ω − ω̄ = ı√3 and the coset representatives a_0, a_1, and a_2 are given by the columns of a 3 × 3 matrix A. The dual of the E_6 lattice may be written as E_6* = E_6 ∪ ([1]+E_6) ∪ ([2]+E_6), where the components of [1] are integer multiples of 1/3 and [2] = −[1], (12a) and may be constructed via an integer linear combination of the columns of an 8 × 6 basis matrix B_E6*, with N_E6* = N_E6. (12b) A convenient coset construction of E_6* is given by the real lattice E_6* ≅ L_real corresponding to the complex lattice L constructed via the union of nine scaled and shifted complex E-lattices Z[ω]^3 such that L = ∪_{i=0}^{8} (a_i + θ Z[ω]^3), (12c) where the 9 coset representatives a_i are given by the columns of a 3 × 9 matrix A.
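As a concrete check on the construction E_8 = D_8^+ given above (illustrative code of ours, not from the paper), one can count the minimal vectors of E_8 and recover the kissing number τ = 240: 112 integer vectors with exactly two entries ±1, plus 128 vectors with all entries ±1/2 and an even number of minus signs.

```python
from itertools import product

# E_8 = D_8^+ = D_8 union ([1] + D_8): integer vectors with even coordinate
# sum, together with their translates by (1/2, ..., 1/2). All minimal
# vectors have squared norm 2.
count = 0
for v in product((-1, 0, 1), repeat=8):          # integer part (D_8)
    if sum(v) % 2 == 0 and sum(x*x for x in v) == 2:
        count += 1                               # two entries +/-1: 112 total
for signs in product((-1, 1), repeat=8):         # sign pattern of +/-1/2 entries
    if signs.count(-1) % 2 == 0:                 # even # of minus signs => in E_8
        count += 1                               # squared norm 8*(1/4) = 2: 128 total
assert count == 240
```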

The characteristics of all of the lattices introduced thus far are summarized, for n ≤ 8, in Table 2.

5) The laminated lattices Λ_n and K_n: The lattices in the Λ_n and K_n families can be built up one dimension, or laminate, at a time, starting from the integer lattice (Z = Λ_1 = K_1) at n = 1, to hexagonal (A_2 = Λ_2 = K_2) at n = 2, to fcc (D_3 = Λ_3 = K_3) at n = 3, all the way up (one layer at a time) to the Leech lattice (Λ_24 = K_24), defined below, at n = 24. Both families of lattices may in fact be extended (but not uniquely) to n = 48. The Leech lattice, Λ_24, is the unique lattice in n = 24 dimensions with a (remarkable) kissing number of τ = 196,560. It may be constructed via an integer linear combination of the columns of a 24 × 24 basis matrix B_Λ24, conventionally depicted in the celebrated Miracle Octad Generator (MOG) coordinates (see Curtis 1976 and Conway & Sloane 1998). As in the E_8 → E_7 → E_6 progression described in II-C.4, the Λ_n lattices for n = 23, 22, ... may all be constructed by restricting the Λ_24 lattice to smaller and smaller subspaces via the normal vectors assembled in the matrix N_Λ = ( n_Λ23 ... n_Λ3 ).

For example, the Λ_23 lattice is obtained from the points of the Λ_24 lattice in R^24 which lie in the 23-dimensional subspace orthogonal to n_Λ23; the Λ_22 lattice is obtained from the points of the Λ_24 lattice in R^24 which lie in the 22-dimensional subspace orthogonal to both n_Λ23 and n_Λ22; etc. Noting the structure of the B_Λ24 and N_Λ matrices given above, all points of the Λ_23 lattice as constructed in this manner are characterized by being zero in their twenty-fourth component, and thus Λ_23 may in fact be constructed using the basis matrix, denoted B_Λ23, given by the 23 × 23 submatrix in the upper-left corner of B_Λ24. Noting the block diagonal structure of N_Λ, it follows similarly that Λ_n may be constructed using the basis matrix, denoted B_Λn, given by the n × n submatrix in the upper-left corner of B_Λ24 for any n ∈ N = {23, 22, 21, 20, 16, 12, 11, 10, 9, 8, 7, 6, 5, 4}. For the remaining dimensions, n ∉ N, that is n ∈ {19, 18, 17, 15, 14, 13, 3, 2, 1}, Λ_n may be constructed via the appropriate restriction of the lattice generated by the next larger basis matrix in the set N; for example, Λ_14 may be constructed in R^16 via restriction of the lattice generated by the basis matrix B_Λ16 to the subspace normal to the vectors (in R^16) given by the first 16 elements of n_Λ15 and n_Λ14. Note that the K_n lattices may be constructed via restriction of the Leech lattice in a similar fashion. As partially summarized in Tables 2 and 3, lattices from the Λ_n and/or K_n families have the optimal packing densities and kissing numbers for the entire range considered, n ≤ 24. Note that the Λ_n and K_n families are not equivalent in the range 7 ≤ n ≤ 17, with Λ_n being superior to K_n by all four metrics introduced in II-A at most values of n in this range, except for the range 11 ≤ n ≤ 13, where in fact K_n has a distinct advantage.
Note also that there is some flexibility in the definition of Λ_11, Λ_12, and Λ_13; the branch of the Λ_n family considered here is that which maximizes the kissing number τ in this range of n, and thus the corresponding lattices are denoted Λ_11^max, Λ_12^max, and Λ_13^max.
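The restriction construction above can be sanity-checked in a low-dimensional setting: the zero-sum lattice A_2 (the hexagonal lattice) arises from Z^3 in exactly this way, as the integer points lying in the plane orthogonal to the normal vector (1, 1, 1). The following is an illustrative sketch, not code from the paper; the brute-force search window and variable names are our own choices, and the kissing number τ = 6 for the hexagonal lattice is a standard fact.

```python
import itertools

# Enumerate small integer points of Z^3 lying in the plane orthogonal
# to n = (1, 1, 1); these zero-sum points form a copy of the hexagonal
# lattice A2, exactly as in the restriction construction for the Λn family.
pts = [p for p in itertools.product(range(-2, 3), repeat=3)
       if sum(p) == 0 and p != (0, 0, 0)]

# Nearest neighbors of the origin within this restricted lattice.
d2min = min(x * x + y * y + z * z for (x, y, z) in pts)
kissing = sum(1 for (x, y, z) in pts if x * x + y * y + z * z == d2min)
print(kissing)  # 6, the kissing number of the hexagonal lattice A2
```

The six minimal vectors found are the permutations of (1, −1, 0), at squared distance 2; restricting by a second normal vector would reduce the lattice by one further dimension, mirroring the Λ_24 to Λ_23 to Λ_22 chain.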

[Table 2: for each n ≤ 8 the table lists, lattice by lattice (integer, square, hexagonal, cubic, fcc, bcc, diamond, zero-sum, checkerboard, offset cubic, offset checkerboard, and the Gosset lattice E_8 = E_8* = D_8^+ = Λ_8, together with the relevant duals and equivalences such as D_3 = A_3 = Λ_3 = K_3 and D_3* = A_3*), the packing density, covering thickness Θ, quantization error G, and kissing number τ; the numeric entries are not recoverable from this transcription.]

Table 2. Characteristics up to n = 8 of the root lattices (Z^n, A_n, D_n, and E_n), their duals, and the offset checkerboard packing D_n^+ (which is a lattice only for n even). Listed are the packing density, the covering thickness Θ, the mean squared quantization error per dimension G, quantifying the lattice uniformity, and the kissing number τ, indicating the flexibility available in selecting the positive basis from nearest neighbors. Boldface denotes values known, or believed, to be optimal among all lattices at that n. All results are based on formulae taken from the articles compiled in Conway & Sloane (1998). Blank entries appear to be unavailable in the published literature on this subject.

[Table 3: for each 9 ≤ n ≤ 24 the table lists the packing density, covering thickness Θ, quantization error G, and kissing number τ of the lattices Z^n, A_n, and Λ_n (and, where distinct, K_n); the numeric entries are not recoverable from this transcription.]

Table 3. Known characteristics of selected lattices in dimension 9 ≤ n ≤ 24. Boldface denotes values known, or believed, to be optimal among all lattices at that dimension n. Note that Z^n is referred to as the cubic lattice, K_12 as the Coxeter-Todd lattice, Λ_16 as the Barnes-Wall lattice, and Λ_24 as the Leech lattice. Approximate values in the original table were estimated via Monte Carlo integration; all other results reported have been determined analytically in the literature (see Conway & Sloane 1998). Certain entries are bounds, not exact values. Blank entries appear to be unavailable in the published literature on this subject.

D. Quantization (that is, moving onto the Lattice)

We now consider the problem of quantizing from an arbitrary point x in parameter space R^n onto a point x̃ on the discrete lattice, which is defined via an integer linear combination of the columns of the corresponding basis matrix B. The solution to this problem is lattice specific, and thus is treated lattice by lattice in the subsections below. Note that we neglect the problem of scaling of the lattices in this discussion, which is trivial to implement in code.

1) Quantization to Z^n: Quantize to Z^n simply by rounding each element of x to the nearest integer.

2) Quantization to D_n: Quantize to D_n by rounding x two different ways: round each element of x to the nearest integer, and call the result x̂; round each element of x to the nearest integer except that element of x which is furthest from an integer, and round that element the wrong way (that is, round it down instead of up, or up instead of down), and call the result x̂′. Compute the sum s of the individual elements of x̂; the desired quantization is x̃ = x̂ if s is even, and x̃ = x̂′ if s is odd.

3) Quantization to A_n: The A_n lattice is defined in an n-dimensional subspace C of Y = R^{n+1}. The subspace C is spanned by the n columns of the corresponding basis matrix B_An, and the orthogonal complement of C is spanned by the vector n_An. Thus, the closest point in the subspace, y_C ∈ C, to any given point y ∈ Y is given by y_C = y − (y, n_An) n_An. More significantly for our present purposes, an orthogonal basis B̂_An of C may easily be determined from B_An via Gram-Schmidt orthogonalization. With this orthogonal basis, the n-dimensional parameter space in which we are actually performing the optimization may be identified with the n elements of x, which are related to y_C via the equation

y_C = B̂_An x.
(3a)

Thus, starting from some point x in parameter space, but not yet quantized onto the lattice, we can easily determine the corresponding (n+1)-dimensional vector y_C which lies within the n-dimensional subspace C of R^{n+1} via (3a). Given this value of y_C ∈ C, we now need to quantize onto the lattice. We may accomplish this with the following simple steps: Round each component of y_C to the nearest integer, and call the result ŷ. Define the deficiency Δ = Σ_i ŷ_i, which quantifies the orthogonal distance of the point ŷ from the subspace C. If Δ = 0, then ỹ = ŷ. If not, define d = y_C − ŷ, and distribute the integers 1, ..., n+1 among the indices i_1, ..., i_{n+1} such that −1/2 ≤ d(ŷ_{i_1}) ≤ d(ŷ_{i_2}) ≤ ... ≤ d(ŷ_{i_{n+1}}) ≤ 1/2. If Δ > 0, then nudge ŷ back onto the C subspace by defining ỹ_{i_k} = ŷ_{i_k} − 1 if k ≤ Δ, and ỹ_{i_k} = ŷ_{i_k} otherwise. If Δ < 0, then nudge ŷ back onto the C subspace by defining ỹ_{i_k} = ŷ_{i_k} + 1 if k > n+1+Δ, and ỹ_{i_k} = ŷ_{i_k} otherwise. Back in n-dimensional parameter space, the quantized value ỹ ∈ C corresponds to

x̃ = B̂_An^T ỹ. (3b)

4) Quantization to the union of cosets: The dual lattices D_n* and A_n*, the hexagonal lattice A_2, the packing D_n^+, and the lattices E_8 = E_8*, E_7, and E_7* are each described earlier in this paper via the union of (real) cosets. To quantize to a lattice defined as a union of k cosets, simply quantize to each coset independently, then select from these k quantization points that point which is closest to the original point x. The lattices E_6 and E_6* are described via the union of (complex) cosets [which are scaled and shifted complex Eisenstein lattices Z[ω]^3]. Following Conway & Sloane (1982), to discretize a point x to coset i in these cases: Determine the complex vector z ∈ C^3 corresponding to x ∈ R^6. Shift and scale such that ẑ = (z − a_i)/θ. Determine the real vector x̂ ∈ R^6 corresponding to ẑ ∈ C^3. Quantize the first, second, and third pairs of elements of x̂ to the real hexagonal A_2 lattice to create the quantized vector x̂′.
Determine the complex vector ẑ′ ∈ C^3 corresponding to x̂′ ∈ R^6. Unscale and unshift such that z̃ = θ ẑ′ + a_i. Determine the real vector x̃ ∈ R^6 corresponding to z̃ ∈ C^3.

5) Quantization to the laminated lattices Λ_n and K_n: Quantization to the laminated lattices Λ_n and K_n in dimension 9 ≤ n ≤ 24 is a more difficult problem, and has been the subject of numerous papers in the area of coding theory in recent decades, in which many sophisticated algorithms for such problems have been proposed and tuned. A future paper (currently in preparation) will survey the best available algorithms of this type for our present purposes in dimension 9 ≤ n ≤ 24. For simplicity, the remainder of this paper will focus more narrowly on the use of the lattices reviewed above in dimension n ≤ 8, as characterized in Table 2, for the coordination of lattice-based derivative-free optimization algorithms.
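The Z^n and D_n quantizers described above are simple enough to sketch directly. The following Python rendering is illustrative only; the function names are ours, not the paper's, and lattice scaling is neglected, as in the text.

```python
import numpy as np

def quantize_Zn(x):
    """Quantize x onto the integer lattice Z^n by componentwise rounding."""
    return np.rint(x)

def quantize_Dn(x):
    """Quantize x onto D_n (the integer points with even coordinate sum)."""
    x = np.asarray(x, dtype=float)
    xhat = np.rint(x)                      # round every component normally
    # Round the component furthest from an integer the "wrong" way.
    k = np.argmax(np.abs(x - xhat))
    xalt = xhat.copy()
    xalt[k] += np.sign(x[k] - xhat[k]) if x[k] != xhat[k] else 1.0
    # Exactly one of the two candidates has an even coordinate sum.
    return xhat if int(np.sum(xhat)) % 2 == 0 else xalt

print(quantize_Dn([0.6, 0.6, 0.1]))  # [1. 1. 0.]
```

For example, quantizing (0.6, 0.1, 0.1) yields the origin: naive rounding gives (1, 0, 0), whose coordinate sum is odd, so the first component (the one furthest from an integer) is re-rounded downward.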

III. EXTENDING LATTICE THEORY FOR THE COORDINATION OF DERIVATIVE-FREE OPTIMIZATION

To extend lattice theory, as summarized above, to coordinate a derivative-free optimization algorithm, a few additional steps are needed, as described in this section.

A. Enumerating nearest-neighbor lattice points

All lattice points x_i ∈ R^n which are nearest neighbors of the origin x = 0 in any real lattice defined by a basis matrix B may be enumerated via the following algorithm:

0. Initialize m = 1.
1. Define a distribution of points z_i such that each element of each of these vectors is selected from the set of integers {−m, ..., 0, ..., m}, and such that all possible vectors that can be created in such a fashion, except the origin, are present (without duplication) in this distribution.
2. Compute the distance of each transformed point ỹ_i = B z_i in this distribution from the origin, and eliminate those points in the distribution that are farther from the origin than the minimum distance computed in the set.
3. Count the number of points remaining in the distribution. If this number equals the (known) kissing number of the lattice under consideration, as listed in Table 2 or 3, then determine an orthogonal B̂ from B via Gram-Schmidt orthogonalization, set x_i = B̂^T ỹ_i for all i, and exit; otherwise, increment m and repeat from step 1.

Though this simple algorithm is not particularly efficient, it need not be, as the nearest-neighbor distribution is identical around every lattice point, and thus this algorithm need only be run once, during the initialization of the optimization code.

B. Testing for a positive basis

Given a subset of the nearest-neighbor lattice points, we will at times need an efficient test to determine whether or not the vectors to these points from the CMP form a positive basis of the feasible domain around the CMP. Without loss of generality, we shift this problem so that the CMP corresponds to the origin in the discussion that follows.
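The enumeration loop of III-A can be sketched as follows. This illustrative version (function names are ours) omits the final Gram-Schmidt change of coordinates, since the example basis is already square and full rank, and it takes the known kissing number τ as an input, exactly as the algorithm above does.

```python
import itertools
import numpy as np

def nearest_neighbors(B, tau):
    """Enumerate the lattice points nearest the origin for the lattice
    generated by the columns of B, assuming its kissing number tau is known."""
    m = 1
    while True:
        # All integer coefficient vectors with entries in {-m, ..., m}, origin excluded.
        zs = [z for z in itertools.product(range(-m, m + 1), repeat=B.shape[1])
              if any(z)]
        ys = np.array([B @ np.array(z, dtype=float) for z in zs])
        d = np.linalg.norm(ys, axis=1)
        nearest = ys[np.isclose(d, d.min())]
        if len(nearest) == tau:        # count matches the known kissing number
            return nearest
        m += 1                         # otherwise widen the search window

# Hexagonal lattice A2: basis columns (1, 0) and (1/2, sqrt(3)/2), tau = 6.
B = np.array([[1.0, 0.5], [0.0, np.sqrt(3) / 2]])
print(len(nearest_neighbors(B, 6)))   # 6
```

For the hexagonal lattice, m = 1 already suffices: of the eight candidate coefficient vectors, exactly six map to points at the minimum distance.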
A set of vectors {x̃_1, ..., x̃_k} for k ≥ n+1 is said to positively span R^n if any point in R^n may be reached via a linear combination of these vectors with non-negative coefficients. Since the 2n vectors {e_1, ..., e_n, −e_1, ..., −e_n} positively span R^n, a convenient test for whether or not the vectors {x̃_1, ..., x̃_k} positively span R^n is to determine whether or not each vector in the set {e_1, ..., e_n, −e_1, ..., −e_n} can be reached by a positive linear combination of the vectors {x̃_1, ..., x̃_k}. That is, for each vector e in the set {e_1, ..., e_n, −e_1, ..., −e_n}, a solution z, with z_i ≥ 0 for i = 1, ..., k, to the equation X z = e is desired, where X = (x̃_1 ... x̃_k). If such a z exists for each vector e, then the vectors {x̃_1, ..., x̃_k} positively span R^n; if such a z does not exist, then they do not. Thus, testing a set of vectors to determine whether or not it positively spans R^n may be reduced simply to testing for the existence of a solution to 2n well-defined linear programs in standard form. Techniques to perform such tests, such as Matlab's linprog algorithm, are well developed and readily available. Further, if a set of vectors positively spans R^n, it is a simple matter to check whether or not this set of vectors is also a positive basis of R^n, if such a check is necessary, simply by checking whether or not any subset of k−1 vectors chosen from this set also positively spans R^n. Note that a positive basis with k vectors will necessarily have k in the range n+1 ≤ k ≤ 2n.

C. Selecting a positive basis from the nearest-neighbor lattice points

Section III-A described how to enumerate all points which are nearest neighbors of the origin of a lattice (and thus, with the appropriate shift, all points which are nearest neighbors of any CMP on a lattice), and Section III-B described how to test a subset of such points to see if the vectors to these points form a positive basis around the CMP.
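The positive-spanning test of III-B reduces, as described above, to 2n linear feasibility problems. A sketch follows, using SciPy's linprog in place of the Matlab routine mentioned in the text; the function name and test vectors are our own.

```python
import numpy as np
from scipy.optimize import linprog

def positively_spans(X):
    """Test whether the columns of the n-by-k matrix X positively span R^n,
    by checking that each of +e_1..+e_n, -e_1..-e_n is reachable as X z
    with z >= 0 (2n linear feasibility problems in standard form)."""
    n, k = X.shape
    for i in range(n):
        for s in (1.0, -1.0):
            e = np.zeros(n)
            e[i] = s
            # Zero objective: we only care whether the constraints are feasible.
            res = linprog(c=np.zeros(k), A_eq=X, b_eq=e,
                          bounds=[(0, None)] * k)
            if not res.success:
                return False
    return True

# The 2n signed unit vectors always positively span R^n ...
X = np.hstack([np.eye(3), -np.eye(3)])
print(positively_spans(X))            # True
# ... but dropping -e_3 destroys the positive span.
print(positively_spans(X[:, :-1]))    # False
```

To upgrade this to a positive-basis check, one would additionally verify, as in the text, that no subset of k−1 of the columns also positively spans R^n.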
We now present a general algorithm to solve the problem of selecting a positive basis from the nearest-neighbor points using the minimum number of new poll points possible, while creating the maximum achievable angular uniformity between the vectors from the CMP to each of these points. This problem has an interesting connection to Tammes' problem (Tammes 1930), which may be summarized as the question: Where should k repulsive individuals settle on a planet in order to be as far away from each other as possible? In the present incarnation of this problem, we have an n-dimensional planet, and we need to distribute k = n+1+m such repulsive individuals on a well-selected subset of a discrete set of locations (that is, the nearest-neighbor lattice points) at which the individuals are allowed to settle. Ideally(5), for m = 0, the solution to this discrete Tammes problem will produce a positive basis with good angular uniformity; if it does not, we may successively increment m by one and try again until we succeed in producing a positive basis. We have studied three algorithms for solving the problem of finding a positive basis while leveraging this discrete Tammes formulation:

Algorithm A. If the kissing number τ of the lattice under consideration is relatively large (that is, if τ ≫ n; for example, for the Leech lattice Λ_24), then a straightforward algorithm can first be used to solve Tammes' problem on a continuous sphere

(5) That is, for a good lattice, such as that depicted in Figure 5.

Fig. 5. Two different positive bases on the bcc lattice D_3*, shown in green and red around the blue CMP. Note the complete radial and angular uniformity, as well as the flexibility in the orientation of the basis.

in n dimensions. This can be done simply and quickly by modeling the n+1+m repulsive individuals with identical negatively charged particles, initializing the location of each such particle on the sphere randomly, and then, at each iteration, using a straightforward force-based algorithm(6) to move each particle along the surface of the sphere a small amount in the direction that the other particles are tending to push it, iterating until the set of particles approaches an equilibrium. Then, each equilibrium point so determined may be quantized to the closest nearest-neighbor lattice point, as enumerated in III-A.

Algorithm B. If the kissing number τ of the lattice under consideration is relatively small (that is, if τ is not well over an order of magnitude larger than n), then it turns out to be more expedient to solve the discrete Tammes problem directly. To accomplish this, we distribute the n+1+m negatively charged particles randomly on n+1+m nearest-neighbor points, and then, at each iteration, move a few (two or three(7)) of these particles that are furthest from equilibrium in the force-based model described above (that is, those particles which have the highest force component projected onto the surface of the sphere) into new positions, selected from the available locations (enumerated in III-A), which minimize the maximum force (projected onto the sphere) over the entire set of particles. Though each iteration of this algorithm involves an exhaustive search for placing the two or three particles in question, it converges quickly for such modest values of τ.

Algorithm C.
For intermediate kissing numbers τ, a hybrid approach may be used: a good initial distribution may be found using Algorithm A, and this distribution may then be refined using Algorithm B.

In each of these algorithms, to minimize the number of new function evaluations required at each poll step, a check is first made to determine whether any previous function evaluations have already been performed on the set of nearest-neighbor lattice points. If so, then negatively charged particles are fixed at these locations, while the remaining negatively charged particles are adjusted via one of the three algorithms described above. By so doing, previously calculated function values may be used with maximum effectiveness during the polling procedure. When performing the poll step of a surrogate-based search, in order to orient the new poll set favorably, a negatively charged particle is also fixed at the nearest-neighbor point with the lowest value of the surrogate function; when polling, this poll point is evaluated first.

The iterative algorithms described above, though in practice quite effective, are not guaranteed to converge from arbitrary initial conditions to a positive basis for a given value of m, even if such a positive basis exists. To address this issue, if either algorithm fails to produce a positive basis, the algorithm may be repeated using a new random starting distribution. Our numerical tests have indicated that this repeated random initialization scheme usually generates a positive basis within a few initializations when such a positive basis indeed exists. Since at times there exists no minimal positive basis on the nearest-neighbor lattice points, particularly when the previous function evaluations being leveraged are poorly configured, the number of new random initializations is limited to a prespecified value. Once this value is reached, m is increased by one and the process repeated.
As the cost of each function evaluation increases, the user can increase the number of random initializations attempted using one of the above algorithms for each value of m, in order to avoid the computation of extraneous poll points that might in fact be unnecessary if sufficient exploration by the discrete Tammes algorithm is performed. Numerical tests have demonstrated the efficacy of this rather simple basis-finding strategy, which reliably generates a positive basis, even when leveraging a relatively poor configuration of previous function evaluations, while keeping computational costs to a minimum. Additionally, the strategy itself lacks any explicit dependence on the lattice being used; its only inputs are the dimension of the problem, the locations of the nearest-neighbor lattice points, and the identification of those nearest-neighbor lattice points for which previous function evaluations are available.

(6) In this model, the repulsive force exerted by any two particles on each other is proportional to the inverse square of the distance between the particles.
(7) Moving more than two or three particles at a time in this algorithm makes each iteration computationally intensive, and has little impact on the overall convergence of the algorithm.
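The force-based continuous Tammes solver of Algorithm A can be sketched as follows. This is an illustrative rendering under the inverse-square force model of footnote 6; the step size, iteration count, random seed, and function name are all our own assumptions, not values from the paper.

```python
import numpy as np

def tammes_points(n, k, steps=2000, dt=0.01, seed=0):
    """Spread k mutually repelling particles on the unit sphere in R^n:
    inverse-square repulsion, with each step's net force projected onto
    the sphere's tangent plane before the particle is moved."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((k, n))
    P /= np.linalg.norm(P, axis=1, keepdims=True)   # random start on the sphere
    for _ in range(steps):
        F = np.zeros_like(P)
        for i in range(k):
            diff = P[i] - P                          # vectors pushing particle i away
            dist = np.linalg.norm(diff, axis=1)
            dist[i] = np.inf                         # no self-interaction
            F[i] = np.sum(diff / dist[:, None] ** 3, axis=0)   # inverse-square law
        # Keep only the tangential force component, step, and renormalize.
        F -= np.sum(F * P, axis=1, keepdims=True) * P
        P += dt * F
        P /= np.linalg.norm(P, axis=1, keepdims=True)
    return P

# Four particles on the sphere in R^3 approach a regular tetrahedron,
# whose pairwise inner products all tend to -1/3.
P = tammes_points(3, 4)
print(np.round(P @ P.T, 2))
```

In the full algorithm, each equilibrium point would then be quantized to the closest nearest-neighbor lattice point, and particles corresponding to previously evaluated points would be held fixed rather than moved.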


More information

Tilburg University. Two-dimensional maximin Latin hypercube designs van Dam, Edwin. Published in: Discrete Applied Mathematics

Tilburg University. Two-dimensional maximin Latin hypercube designs van Dam, Edwin. Published in: Discrete Applied Mathematics Tilburg University Two-dimensional maximin Latin hypercube designs van Dam, Edwin Published in: Discrete Applied Mathematics Document version: Peer reviewed version Publication date: 2008 Link to publication

More information

MATH 590: Meshfree Methods

MATH 590: Meshfree Methods MATH 590: Meshfree Methods Chapter 34: Improving the Condition Number of the Interpolation Matrix Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Fall 2010 fasshauer@iit.edu

More information

Chapter 2. Error Correcting Codes. 2.1 Basic Notions

Chapter 2. Error Correcting Codes. 2.1 Basic Notions Chapter 2 Error Correcting Codes The identification number schemes we discussed in the previous chapter give us the ability to determine if an error has been made in recording or transmitting information.

More information

Physics 211B : Problem Set #0

Physics 211B : Problem Set #0 Physics 211B : Problem Set #0 These problems provide a cross section of the sort of exercises I would have assigned had I taught 211A. Please take a look at all the problems, and turn in problems 1, 4,

More information

Notes on Mathematics Groups

Notes on Mathematics Groups EPGY Singapore Quantum Mechanics: 2007 Notes on Mathematics Groups A group, G, is defined is a set of elements G and a binary operation on G; one of the elements of G has particularly special properties

More information

MATHEMATICS 23a/E-23a, Fall 2015 Linear Algebra and Real Analysis I Module #1, Week 4 (Eigenvectors and Eigenvalues)

MATHEMATICS 23a/E-23a, Fall 2015 Linear Algebra and Real Analysis I Module #1, Week 4 (Eigenvectors and Eigenvalues) MATHEMATICS 23a/E-23a, Fall 205 Linear Algebra and Real Analysis I Module #, Week 4 (Eigenvectors and Eigenvalues) Author: Paul Bamberg R scripts by Paul Bamberg Last modified: June 8, 205 by Paul Bamberg

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

The Not-Formula Book for C1

The Not-Formula Book for C1 Not The Not-Formula Book for C1 Everything you need to know for Core 1 that won t be in the formula book Examination Board: AQA Brief This document is intended as an aid for revision. Although it includes

More information

Stochastic Histories. Chapter Introduction

Stochastic Histories. Chapter Introduction Chapter 8 Stochastic Histories 8.1 Introduction Despite the fact that classical mechanics employs deterministic dynamical laws, random dynamical processes often arise in classical physics, as well as in

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

DRAFT. Renaissance Packings. symmetry & structure in a multidimensional world. by Thomas Bewley, Paul Belitz, & Joseph Cessna

DRAFT. Renaissance Packings. symmetry & structure in a multidimensional world. by Thomas Bewley, Paul Belitz, & Joseph Cessna Renaissance Packings symmetry & structure in a multidimensional world by Thomas Bewley, Paul Belitz, & Joseph Cessna Renaissance Packings: symmetry & structure in a multidimensional world, by Thomas Bewley,

More information

AM205: Assignment 2. i=1

AM205: Assignment 2. i=1 AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although

More information

Writing Circuit Equations

Writing Circuit Equations 2 C H A P T E R Writing Circuit Equations Objectives By the end of this chapter, you should be able to do the following: 1. Find the complete solution of a circuit using the exhaustive, node, and mesh

More information

Math 123, Week 2: Matrix Operations, Inverses

Math 123, Week 2: Matrix Operations, Inverses Math 23, Week 2: Matrix Operations, Inverses Section : Matrices We have introduced ourselves to the grid-like coefficient matrix when performing Gaussian elimination We now formally define general matrices

More information

Introduction - Motivation. Many phenomena (physical, chemical, biological, etc.) are model by differential equations. f f(x + h) f(x) (x) = lim

Introduction - Motivation. Many phenomena (physical, chemical, biological, etc.) are model by differential equations. f f(x + h) f(x) (x) = lim Introduction - Motivation Many phenomena (physical, chemical, biological, etc.) are model by differential equations. Recall the definition of the derivative of f(x) f f(x + h) f(x) (x) = lim. h 0 h Its

More information

Least Squares Approximation

Least Squares Approximation Chapter 6 Least Squares Approximation As we saw in Chapter 5 we can interpret radial basis function interpolation as a constrained optimization problem. We now take this point of view again, but start

More information

Vector Space Concepts

Vector Space Concepts Vector Space Concepts ECE 174 Introduction to Linear & Nonlinear Optimization Ken Kreutz-Delgado ECE Department, UC San Diego Ken Kreutz-Delgado (UC San Diego) ECE 174 Fall 2016 1 / 25 Vector Space Theory

More information

Chapter Two Elements of Linear Algebra

Chapter Two Elements of Linear Algebra Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to

More information

Crystals, Quasicrystals and Shape

Crystals, Quasicrystals and Shape Crystals, Quasicrystals and Shape by Azura Newman Table of contents 1) Introduction 2)Diffraction Theory 3)Crystals 4)Parametric Density 5) Surface Energy and Wulff-Shape 6) Quasicrystals 7) The Bound

More information

A STAFFING ALGORITHM FOR CALL CENTERS WITH SKILL-BASED ROUTING: SUPPLEMENTARY MATERIAL

A STAFFING ALGORITHM FOR CALL CENTERS WITH SKILL-BASED ROUTING: SUPPLEMENTARY MATERIAL A STAFFING ALGORITHM FOR CALL CENTERS WITH SKILL-BASED ROUTING: SUPPLEMENTARY MATERIAL by Rodney B. Wallace IBM and The George Washington University rodney.wallace@us.ibm.com Ward Whitt Columbia University

More information

Foundations of Matrix Analysis

Foundations of Matrix Analysis 1 Foundations of Matrix Analysis In this chapter we recall the basic elements of linear algebra which will be employed in the remainder of the text For most of the proofs as well as for the details, the

More information

The Golay code. Robert A. Wilson. 01/12/08, QMUL, Pure Mathematics Seminar

The Golay code. Robert A. Wilson. 01/12/08, QMUL, Pure Mathematics Seminar The Golay code Robert A. Wilson 01/12/08, QMUL, Pure Mathematics Seminar 1 Introduction This is the third talk in a projected series of five. It is more-or-less independent of the first two talks in the

More information

Chapter 1. Crystal structure. 1.1 Crystal lattices

Chapter 1. Crystal structure. 1.1 Crystal lattices Chapter 1 Crystal structure 1.1 Crystal lattices We will concentrate as stated in the introduction, on perfect crystals, i.e. on arrays of atoms, where a given arrangement is repeated forming a periodic

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Phys 460 Describing and Classifying Crystal Lattices

Phys 460 Describing and Classifying Crystal Lattices Phys 460 Describing and Classifying Crystal Lattices What is a material? ^ crystalline Regular lattice of atoms Each atom has a positively charged nucleus surrounded by negative electrons Electrons are

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

The Leech Lattice. Balázs Elek. November 8, Cornell University, Department of Mathematics

The Leech Lattice. Balázs Elek. November 8, Cornell University, Department of Mathematics The Leech Lattice Balázs Elek Cornell University, Department of Mathematics November 8, 2016 Consider the equation 0 2 + 1 2 + 2 2 +... + n 2 = m 2. How many solutions does it have? Okay, 0 2 + 1 2 = 1

More information

Physics 140A WINTER 2013

Physics 140A WINTER 2013 PROBLEM SET 1 Solutions Physics 140A WINTER 2013 [1.] Sidebottom Problem 1.1 Show that the volume of the primitive cell of a BCC crystal lattice is a 3 /2 where a is the lattice constant of the conventional

More information

Linear Algebra- Final Exam Review

Linear Algebra- Final Exam Review Linear Algebra- Final Exam Review. Let A be invertible. Show that, if v, v, v 3 are linearly independent vectors, so are Av, Av, Av 3. NOTE: It should be clear from your answer that you know the definition.

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

NOTES ON LINEAR ALGEBRA CLASS HANDOUT

NOTES ON LINEAR ALGEBRA CLASS HANDOUT NOTES ON LINEAR ALGEBRA CLASS HANDOUT ANTHONY S. MAIDA CONTENTS 1. Introduction 2 2. Basis Vectors 2 3. Linear Transformations 2 3.1. Example: Rotation Transformation 3 4. Matrix Multiplication and Function

More information

Eigenspaces in Recursive Sequences

Eigenspaces in Recursive Sequences Eigenspaces in Recursive Sequences Ben Galin September 5, 005 One of the areas of study in discrete mathematics deals with sequences, in particular, infinite sequences An infinite sequence can be defined

More information

Domain decomposition on different levels of the Jacobi-Davidson method

Domain decomposition on different levels of the Jacobi-Davidson method hapter 5 Domain decomposition on different levels of the Jacobi-Davidson method Abstract Most computational work of Jacobi-Davidson [46], an iterative method suitable for computing solutions of large dimensional

More information

Fundamentals of Engineering Analysis (650163)

Fundamentals of Engineering Analysis (650163) Philadelphia University Faculty of Engineering Communications and Electronics Engineering Fundamentals of Engineering Analysis (6563) Part Dr. Omar R Daoud Matrices: Introduction DEFINITION A matrix is

More information

Unsupervised learning: beyond simple clustering and PCA

Unsupervised learning: beyond simple clustering and PCA Unsupervised learning: beyond simple clustering and PCA Liza Rebrova Self organizing maps (SOM) Goal: approximate data points in R p by a low-dimensional manifold Unlike PCA, the manifold does not have

More information

Linear Algebra. Min Yan

Linear Algebra. Min Yan Linear Algebra Min Yan January 2, 2018 2 Contents 1 Vector Space 7 1.1 Definition................................. 7 1.1.1 Axioms of Vector Space..................... 7 1.1.2 Consequence of Axiom......................

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

DOMINO TILINGS INVARIANT GIBBS MEASURES

DOMINO TILINGS INVARIANT GIBBS MEASURES DOMINO TILINGS and their INVARIANT GIBBS MEASURES Scott Sheffield 1 References on arxiv.org 1. Random Surfaces, to appear in Asterisque. 2. Dimers and amoebae, joint with Kenyon and Okounkov, to appear

More information

Gradient Descent Methods

Gradient Descent Methods Lab 18 Gradient Descent Methods Lab Objective: Many optimization methods fall under the umbrella of descent algorithms. The idea is to choose an initial guess, identify a direction from this point along

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Bare-bones outline of eigenvalue theory and the Jordan canonical form

Bare-bones outline of eigenvalue theory and the Jordan canonical form Bare-bones outline of eigenvalue theory and the Jordan canonical form April 3, 2007 N.B.: You should also consult the text/class notes for worked examples. Let F be a field, let V be a finite-dimensional

More information

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Copyright 2018 by James A. Bernhard Contents 1 Vector spaces 3 1.1 Definitions and basic properties.................

More information

Chapter 1 The Real Numbers

Chapter 1 The Real Numbers Chapter 1 The Real Numbers In a beginning course in calculus, the emphasis is on introducing the techniques of the subject;i.e., differentiation and integration and their applications. An advanced calculus

More information

Solving the Generalized Poisson Equation Using the Finite-Difference Method (FDM)

Solving the Generalized Poisson Equation Using the Finite-Difference Method (FDM) Solving the Generalized Poisson Equation Using the Finite-Difference Method (FDM) James R. Nagel September 30, 2009 1 Introduction Numerical simulation is an extremely valuable tool for those who wish

More information

Physics 221A Fall 1996 Notes 14 Coupling of Angular Momenta

Physics 221A Fall 1996 Notes 14 Coupling of Angular Momenta Physics 1A Fall 1996 Notes 14 Coupling of Angular Momenta In these notes we will discuss the problem of the coupling or addition of angular momenta. It is assumed that you have all had experience with

More information

2.3 Band structure and lattice symmetries: example of diamond

2.3 Band structure and lattice symmetries: example of diamond 2.2.9 Product of representaitons Besides the sums of representations, one can also define their products. Consider two groups G and H and their direct product G H. If we have two representations D 1 and

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

A Highly Symmetric Four-Dimensional Quasicrystal * Veit Elser and N. J. A. Sloane AT&T Bell Laboratories Murray Hill, New Jersey

A Highly Symmetric Four-Dimensional Quasicrystal * Veit Elser and N. J. A. Sloane AT&T Bell Laboratories Murray Hill, New Jersey A Highly Symmetric Four-Dimensional Quasicrystal * Veit Elser and N. J. A. Sloane AT&T Bell Laboratories Murray Hill, New Jersey 7974 Abstract A quasiperiodic pattern (or quasicrystal) is constructed in

More information