
VICTORIA UNIVERSITY OF WELLINGTON
Te Whare Wananga o te Upoko o te Ika a Maui

School of Engineering and Computer Science
Te Kura Mātai Pūkaha, Pūrorohiko
PO Box 600, Wellington, New Zealand
Tel: Fax: Internet: office@ecs.vuw.ac.nz

Evaluate and Improve an Optimization Algorithm

Thomas Robinson

Supervisor: Marcus Frean

Submitted in partial fulfilment of the requirements for Bachelor of Science with Honours in Computer Science.

Abstract

Global Optimization is a difficult task with many applications. This report describes one method of Global Optimization known as Gaussian Process Optimization. This is a response-surface-based methodology which constructs its own model of the underlying surface and determines the next best place to sample using an Expected Improvement measure. Several improvements were investigated by Michael Mudge in an honours project in 2010, and these are thoroughly evaluated in this report. A further improvement is made to the algorithm in the form of a new covariance function, which is also described and thoroughly evaluated.


Acknowledgments

I acknowledge the helpful and constructive guidance of my supervisor, Marcus Frean, who gave freely of his time and advice throughout the project.


Contents

1 Introduction
2 Background Information
   2.1 Gaussian Processes
   2.2 Covariance
   2.3 Hyperparameters
   2.4 Expected Improvement
   2.5 Algorithm Improvements
       2.5.1 Background Mean
       2.5.2 EI of the Mean over the Best Mean
       2.5.3 Local EI
   2.6 Related Work
3 Evaluation Method
   3.1 Hard Optimization Problems
   3.2 Test Problems
   3.3 Evaluation Measure
4 Evaluation of Improvements
   4.1 Results
5 A New Improvement to GPO
   5.1 Motivation 1
   5.2 Motivation 2
   5.3 New Improvement to GPO
   5.4 Evaluation
6 Towards a Combined Algorithm
   6.1 Combination: Superposition with Mean EI GPO
   6.2 Combination: Superposition with Local EI GPO
   6.3 Combination: All Improvements
7 Future Work and Conclusion
   7.1 Unanswered Questions
       Learning the Hyperparameters
       Matrix Inversion
       Other Covariance Functions
       Other Expected Improvement Measures
   7.2 Conclusion


Chapter 1 Introduction

Many real world applications require one to optimize a set of parameters in order to find the best, or most desirable, combination. For example, consider designing an efficient aerofoil: there are several variables (for example, length, width, curvature, the drag coefficient of the material and so on), all of which affect the lift-to-drag ratio in different ways. Finding the combination of variables which produces a wing with the highest lift-to-drag ratio is desirable. This problem is equivalent to finding the highest point on a surface of several dimensions which relates the variables of the wing to that ratio. Finding the highest point is possible with a process known as Global Optimization.

Global Optimization is, in general, a very difficult problem to solve. If one has the mathematical function relating the variables to the quantity being optimized, it is likely the gradient of the surface can be calculated. A simple method for locating an optimum of the surface is to set the gradient to zero and solve the resulting equation. This may give several points (the stationary points of the surface), each of which should be evaluated (by passing it to the original function) and the results compared. While this solution is exact, in many real world problems it is not feasible because:

- the underlying mathematical function may not be available;
- the gradient may not be available and/or may be impossible to calculate;
- the number of optima on the surface may be very large.

Therefore, techniques have been developed for finding the global optimum of a surface. These techniques aim to both explore and exploit a surface. Exploration means that the algorithm will look in areas it has not visited before, and exploitation means that the algorithm will pursue promising points.

One of these techniques is known as simulated annealing [7]. This technique involves a parameter T known as the temperature, which starts at a large value. Early in the search, the algorithm behaves much like a random search, performing random exploration. The temperature is slowly annealed down to a lower value at which the search will only move to another point if it is better than the best point seen so far. That is, it exploits the surface.

Another approach to global optimization is particle swarm optimization [6]. This algorithm involves a number of particles which are essentially samples. Since there is usually a large number of them, this alone covers the exploration aspect of the optimization. Each particle has a velocity associated with it, which determines how it will move within the search space in each iteration. This velocity is updated based on how good the best point that particle has seen so far is, and how that point compares to the best point seen by any particle. This is the exploitation aspect of the algorithm.
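As a concrete point of comparison for the sample-efficiency discussion that follows, here is a minimal simulated-annealing sketch in Python; the proposal distribution, cooling schedule and example objective are illustrative assumptions, not taken from [7] or from this project's code.

import numpy as np

def simulated_annealing(f, x0, n_iters=1000, T0=1.0, cooling=0.995, step=0.1, rng=None):
    """Maximize f by simulated annealing: always accept uphill moves, and accept
    downhill moves with probability exp(delta / T) while T is annealed towards zero."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    fx, T = f(x), T0
    best_x, best_f = x.copy(), fx
    for _ in range(n_iters):
        x_new = x + step * rng.standard_normal(x.shape)   # random proposal
        f_new = f(x_new)
        delta = f_new - fx
        if delta > 0 or rng.random() < np.exp(delta / T):  # Metropolis-style acceptance
            x, fx = x_new, f_new
            if fx > best_f:
                best_x, best_f = x.copy(), fx
        T *= cooling                                       # anneal the temperature
    return best_x, best_f

# Example: maximize a bumpy 1-D function (hypothetical objective)
best_x, best_f = simulated_annealing(lambda x: np.sin(3 * x[0]) - 0.1 * x[0] ** 2, x0=[2.0])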

However, these methods are not necessarily the most efficient in terms of how many samples they take. This is an important consideration in global optimization because in many cases it can be very expensive to actually take the samples. In the aerofoil example above, there could be a large development cost (time, money) associated with manufacturing a wing which is slightly longer, only for its drag coefficient to be measured once. Clearly, the fewer prototype wings that need to be created in order to find the optimal one, the better!

This report outlines a data-efficient global optimization method which formed the basis of an honours project. The method is based on the general idea of response surfaces [4], in which the algorithm creates its own model of the surface being optimized. The response surface approach works by first modelling the current known data to create a response surface or, better still, a predictive distribution. The predictive distribution is then analyzed to determine the next place to sample. Finally, an actual sample is taken at this point and the point and its value are added to the known data (see Figure 1.1). The assumption is that sampling from the real world is expensive, but sampling from the model is relatively cheap.

Figure 1.1: The response surface methodology: produce a model from the data set, determine where to sample, take the sample, and add the sample to the data set.

The optimization method being investigated here is known as Gaussian Process Optimization (GPO), which aims to be as efficient as possible in terms of the number of samples needed to locate the global optimum. In the case of GPO, the response surface model is a Gaussian Process and the measure for determining the next point to sample at is known as the Expected Improvement (EI). In 2010, three improvements to the GPO algorithm were investigated by Michael Mudge, but these were not evaluated thoroughly. The contributions I have made to GPO in this project are:

- A thorough evaluation of the improvements to GPO investigated in 2010, using test problems which allow the difficulty to be controlled.
- An addition to the algorithm in the form of a new covariance function.
- A thorough evaluation of this addition.
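Before moving on, the loop of Figure 1.1 can be written down in a few lines. The sketch below paraphrases the methodology; fit_gp, maximize_ei and sample_surface are hypothetical placeholders rather than functions from this project's code.

import numpy as np

def gpo_loop(sample_surface, x_init, y_init, n_samples, fit_gp, maximize_ei):
    """Response-surface optimization loop: model the data, pick the point that
    maximizes expected improvement, take the (expensive) sample, and repeat."""
    X, y = list(x_init), list(y_init)
    for _ in range(n_samples):
        gp = fit_gp(np.array(X), np.array(y))                # 1. produce a model from the data
        x_next = maximize_ei(gp, np.array(X), np.array(y))   # 2. determine where to sample
        y_next = sample_surface(x_next)                      # 3. take the sample
        X.append(x_next)                                     # 4. add the sample to the data set
        y.append(y_next)
    best = int(np.argmax(y))
    return X[best], y[best]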

Chapter 2 Background Information

This chapter introduces Gaussian Processes and how they can be used for optimization. It also gives an explanation of the improvements that were investigated by Michael Mudge in 2010 and which are being evaluated in this project.

2.1 Gaussian Processes

Consider a surface defined by y = f(x), where x is a vector in a continuous d-dimensional input space. Let D be the set of samples ((x, y) pairs) that have been taken on the surface f. A Gaussian Process (GP) is defined to be a model which provides a Gaussian predictive distribution p(y | x, D) over y at all points x. The assumption is that p(y | x, D), the posterior distribution, is a Gaussian distribution. This means that it is possible to obtain the mean and variance of a Gaussian distribution for any input x via a Gaussian Process. A single-variable Gaussian distribution is parameterized by a mean and a variance. A multivariate Gaussian distribution is parameterized by a mean vector and a covariance matrix. It follows that a Gaussian Process is parameterized by a mean function and a covariance function [15].

2.2 Covariance

A central part of the mechanics of a GP is the covariance function. Given two input vectors x and x′, this function expresses what we believe about how the two corresponding values y = f(x) and y′ = f(x′) are related. This is known as the covariance between x and x′ and is denoted by cov(x, x′). In general, one would expect two inputs that are close to each other in x-space to have similar y values, that is, a high covariance. Similarly, two inputs that are far apart would be expected to have unrelated y values, or a low covariance. For this reason, the most common choice of covariance function is the squared exponential. Figure 2.1 shows the one-dimensional squared exponential function. Two input values that are far apart can be seen to have a low covariance, while input values that are close together can be seen to have a high covariance. The squared exponential provides a smooth transition across these different levels of covariance.

Recall the first step of the response surface methodology: producing a model from the data at hand. Consider a matrix C defined as follows:

  C_ij = cov(x_i, x_j) + δ_ij θ_noise    (2.1)

where x_i, x_j ∈ D and δ_ij is the Kronecker delta, which is 1 when i = j and 0 otherwise. This matrix C is known as a covariance matrix and is a table of covariance values between all pairs of data points in the data set D.

Figure 2.1: A 1-dimensional squared-exponential curve, cov(x1, x2) ∝ exp[−(x1 − x2)²].

θ_noise is a hyperparameter which is discussed in more detail in the next section; it represents the assumed measurement noise of the samples. The Kronecker delta ensures that this noise is only added to the diagonal of the matrix (i.e. to the covariance of a point with itself), since it refers to the noise inherent in the measurement of that sample, not to the covariance between two points. Note that C will be symmetric, because cov(x_i, x_j) = cov(x_j, x_i).

Similarly to the way in which C was defined, k is defined to be the vector of covariance values between some point x in the search space and every data point in D. The mean of the Gaussian Process at this point x is then defined to be

  µ(x) = kᵀ C⁻¹ y    (2.2)

and the variance of the Gaussian Process is

  σ²(x) = κ − kᵀ C⁻¹ k    (2.3)

where κ = cov(x, x) is the covariance of x with itself.

2.3 Hyperparameters

More generally, the squared exponential covariance function for any number of input dimensions has the formula

  cov(x, x′; θ) = θ_h exp[ −(1/2) Σ_{i=1..I} (x_i − x′_i)² / θ_li ]    (2.4)

It is parameterized by a set of hyperparameters, θ. These hyperparameters define several characteristics of the covariance function:

- θ_h is the vertical scale and defines the height of the squared exponential. This represents the overall scale of variation in the height of the surface.
- θ_li is the length scale in the i-th input dimension and defines the width of the squared exponential in that dimension. This represents how rapidly the covariance dies away with distance in each dimension.

- θ_noise is a third component of θ (mentioned earlier), representing our beliefs about the level of measurement noise in the samples.

By providing different values for these hyperparameters, the shape of the covariance function can be changed. Consequently, by changing the shape of the covariance function, we are changing our model of the underlying surface. For example, a shorter length scale means there will be a lower covariance between any two points. Essentially, the covariance function expresses our beliefs about the smoothness of the surface.

It is possible to learn the hyperparameters, θ, from the sampled data, D. The best way of doing this would be to define a prior distribution on the hyperparameters and integrate over them:

  P(y_{n+1} | x_{n+1}, D) = ∫ P(y_{n+1} | x_{n+1}, θ, D) P(θ | D) dθ    (2.5)

Unfortunately, this integral is almost always completely intractable [10]. There is, however, a common way to approximate it:

  P(y_{n+1} | x_{n+1}, D) ≈ P(y_{n+1} | x_{n+1}, D, θ_MAP)    (2.6)

where θ_MAP is the maximum a posteriori value for θ, that is

  θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)    (2.7)

which can be found by applying the conjugate gradients algorithm to the posterior P(θ | D). Thus, we can learn the values of θ which best model the data.

2.4 Expected Improvement

The final piece of the puzzle required to make an optimization algorithm is determining where the next sample should be taken. This is calculated via a measure known as the Expected Improvement (EI). Let the best value we have seen so far in D be denoted by y_best. The expected improvement over y_best at some new point (x_{n+1}, y_{n+1}) is

  EI(x_{n+1}) = ∫_{y_best}^{∞} I(y_{n+1}) p(y_{n+1} | x_{n+1}, D) dy_{n+1}    (2.8)

where I is the improvement over y_best at x_{n+1}:

  I(y_{n+1}) = y_{n+1} − y_best  if  y_{n+1} > y_best,  and  0  otherwise    (2.9)

Thus we can calculate the EI of y over y_best for any point x in the search space. To determine where to sample next, we simply choose the point x in the input space which maximizes the EI. There are a variety of ways to perform this maximization. In what follows, the conjugate gradients algorithm is used, since not only is the function EI(x) available, but so is its gradient. Figure 2.2 shows an illustrative example of Gaussian Process Optimization in action.

Figure 2.2: The Gaussian Process Optimization algorithm in action, shown over twelve successive samples (panels (a)-(l)). The blue curve is the function to be optimized. The black curve shows the Gaussian Process mean and the green area is the Gaussian Process variance. The best point seen so far is indicated by the horizontal yellow line. The expected improvement over this best point is indicated by the red curve. The highest point of the red curve is the place where the next sample will be taken.
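To make Sections 2.2-2.4 concrete, the following is a minimal Python/NumPy sketch of the squared-exponential covariance (Equation 2.4), the covariance matrix (Equation 2.1), the predictive mean and variance (Equations 2.2-2.3), and the expected improvement. It is an illustration rather than the implementation used in this project; the closed-form EI below is the standard result for a Gaussian predictive distribution (the report states EI only as the integral in Equation 2.8), and the hyperparameter values in the usage example are assumptions.

import numpy as np
from scipy.stats import norm

def sq_exp_cov(X1, X2, theta_h, theta_l):
    """Squared-exponential covariance (Eq. 2.4): theta_h * exp(-0.5 * sum_i (x_i - x'_i)^2 / theta_l_i)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / theta_l).sum(axis=-1)
    return theta_h * np.exp(-0.5 * d2)

def gp_predict(X, y, X_star, theta_h, theta_l, theta_noise):
    """Predictive mean (Eq. 2.2) and variance (Eq. 2.3) of a zero-mean GP."""
    C = sq_exp_cov(X, X, theta_h, theta_l) + theta_noise * np.eye(len(X))  # Eq. 2.1
    K_star = sq_exp_cov(X, X_star, theta_h, theta_l)       # the vector k for each test point (columns)
    alpha = np.linalg.solve(C, y)                           # C^{-1} y
    mu = K_star.T @ alpha                                   # mu(x) = k^T C^{-1} y
    v = np.linalg.solve(C, K_star)                          # C^{-1} k
    var = theta_h - np.sum(K_star * v, axis=0)              # sigma^2(x) = kappa - k^T C^{-1} k
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """Closed-form EI over y_best under a Gaussian predictive distribution (maximization)."""
    sigma = np.sqrt(var)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Tiny usage example with assumed hyperparameters
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(6, 1))
y = np.sin(X[:, 0])
X_star = np.linspace(-3, 3, 200)[:, None]
mu, var = gp_predict(X, y, X_star, theta_h=1.0, theta_l=np.array([1.0]), theta_noise=1e-6)
ei = expected_improvement(mu, var, y.max())
x_next = X_star[np.argmax(ei)]   # the next place to sample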

2.5 Algorithm Improvements

The previous sections have described the basic GPO algorithm. In 2010, Michael Mudge investigated three improvements to this algorithm, which are outlined in this section.

2.5.1 Background Mean

The Gaussian Process prior, as described above, has a mean of zero. As a result, the predictive mean, µ, away from the samples will be zero, since there is no data to update those prior beliefs. This means that if the surface being optimized is in fact predominantly above (or below) zero, the Gaussian Process model away from the samples will not accurately reflect the underlying surface. If the surface genuinely has a mean of zero, then the GP prior with a zero mean will probably perform adequately. However, simply by translating the surface in the y direction, the GP prior will no longer model the surface accurately away from the samples, putting the algorithm at a disadvantage even though the optimization problem is effectively the same.

The improvement made was to incorporate a non-zero mean for the prior. The first idea was to use the mean value of the sampled points as the mean for the prior. However, since we are aiming to optimize, the samples are more likely to be taken around the peaks of the function; if this value were used as the prior's mean, it would be higher than the mean value of the surface. The second idea (and the one which was implemented) was to use the value for the mean which gives the maximum likelihood for the Gaussian Process distribution. GPO with this calculated mean for the GP prior will be referred to from now on as Vanilla GPO, as we believe no algorithm should be disadvantaged by the fact that a surface is much higher or lower than zero.

2.5.2 EI of the Mean over the Best Mean

The Expected Improvement measure described in Section 2.4 is one of four possible EI measures (see Figure 2.3). It aims to get a sample higher than the best sample so far. This could lead to the algorithm sampling where it got lucky once (an anomalously high sample) [12]. In places where many samples have been taken, the chance of improving the outcome of the optimization by sampling in that place again is low. Ideally, the algorithm should sample at a point where the GP's predictive mean is likely to be high.

Figure 2.3: Diagram showing the different possibilities for the EI measure. EI in y over the best y is the standard GPO EI measure. EI in µ over the best µ is referred to as Mean EI. The other two combinations have not been investigated.

The improvement that was made to the EI measure was to instead use the bottom-most EI measure in Figure 2.3: the EI of the GP mean, µ, over the best GP mean value so far, µ_best. This improves the behaviour of the algorithm because, while sampling near the other samples again might get lucky, it is unlikely to improve the mean at that point very significantly. Sampling away from other points is more likely to improve the mean. Hence, this improvement tends to explore the surface more.

2.5.3 Local EI

GPO takes a greedy approach to optimization by choosing the next best point (according to the model) on the very next sample. However, if the next sample truly were the best point, we would only ever need to take one sample (rendering an entire area of machine learning completely solved!). The observation is that narrow EI peaks generally do not lead to much improvement in the model, but sampling at wider EI peaks does. The aim of the next sample should be to ensure there is a higher expected improvement for the sample following it. It is also desirable to reward the algorithm for sampling in a region which has good values: if we take a sample in a region of high EI and it happens to be a good point, then the algorithm should be encouraged to exploit that region.

These behaviours have been implemented by considering the amount of EI around a particular point x (hence "Local EI"). It works by taking the covariance function and centring it on the point x. The covariance function is then treated as a distribution and a number of samples are taken from underneath it. The EI is then measured at each of those sample points (as well as at x itself) and an average is taken; the result is the Local EI for x. (A sketch of this averaging procedure is given at the end of this chapter.) Recall that the covariance function defines our beliefs about the underlying surface. The reason for using the covariance function in this process is to define the region which we should consider to be local: in order to exploit a local region around x, we should focus on the points W which have high covariance with x, because those points are likely to have similar values to x.

2.6 Related Work

Gaussian Processes are a well studied area of statistical modelling, particularly in the area of regression. David MacKay's introduction [10] provides a good overview of Gaussian Processes in machine learning. It gives a lot of detail about Gaussian Processes, including how they can be used for different tasks such as regression and classification. It describes all of the equations necessary for every facet of GPs, including learning the hyperparameters and generating covariance functions.

The most rigorous study of Gaussian Processes is Rasmussen and Williams' book, Gaussian Processes for Machine Learning [15]. This covers similar content to David MacKay's introduction but goes into much more detail, provides more thoroughly worked examples and also contains a discussion comparing GPs with other statistical models. The book briefly discusses using Gaussian Processes for global optimization but does not go into detail.

The response surface approach to global optimization has been well studied by Jones [5]. His paper provides a good introduction using a technique known as kriging; this technique is simply another name for Gaussian Process regression. Applications of kriging, focusing on its use in engineering design, are discussed in [2].

Osborne et al. [13] further investigated the idea of Gaussian Process Optimization.
They experimented with different degrees of look-ahead. That is, in a world where the number of samples you are allowed to take is limited, if you had only one sample remaining, the place

you take this sample would be different than if you had many samples remaining. They also investigate the use of a periodic covariance function (in order to exploit periodicity in surfaces). They find that their version of the algorithm significantly outperformed the tested competitors.

There are also a number of accounts of GPO being used with very successful results (including [8, 3]). In [8], applications in robotics are considered. In particular, learning the correct parameters for a robot's gait is a difficult problem; it depends on a number of parameters such as the velocity of the leg movements and how soft the surface is. The authors report using GPO to maximize the walking velocity of the robot, and to maximize the smoothness of the gait. The results are very positive and they note that GPO solved the problem in fewer samples than a local gradient algorithm. In [3], the concept of Gaussian Processes is introduced and another example of its application is provided: solving the double pole balancing problem, which is a well known optimization problem. Again, the authors note that GPO was able to solve the problem in an order of magnitude fewer samples than reported elsewhere.

In 2010, Michael Mudge investigated several improvements (those in Section 2.5) to the basic algorithm in his honours project [12]. His paper described those improvements in detail and gave an evaluation on a number of commonly used 2-dimensional problems from the literature. Those improvements showed promise on certain problems; however, the problems were only 2-dimensional and their difficulty was not characterized. My project aims to further evaluate those improvements on higher dimensional problems and to characterize what makes those problems difficult. I also investigate another possible improvement to the algorithm in the form of a different covariance function.
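The improvements in Sections 2.5.1 and 2.5.3 can be sketched in the same style, reusing sq_exp_cov, gp_predict and expected_improvement from the earlier sketch. Neither function below is from the project's code: the constant-mean formula is the standard maximum-likelihood estimate for a GP with covariance matrix C (it is not spelled out in this report), and the number of local samples (25) is an assumption suggested by the "Local25EI" label in the result figures.

import numpy as np

def ml_prior_mean(C, y):
    """Maximum-likelihood constant mean for a GP prior with covariance matrix C:
    m = (1^T C^{-1} y) / (1^T C^{-1} 1). Subtract m from y before Eq. 2.2 and add it back to mu."""
    ones = np.ones_like(y)
    return (ones @ np.linalg.solve(C, y)) / (ones @ np.linalg.solve(C, ones))

def local_ei(x, X, y, y_best, theta_h, theta_l, theta_noise, n_local=25, rng=None):
    """Local EI (Section 2.5.3): treat the covariance function centred on x as a
    distribution, draw points under it, and average the EI at x and at those points."""
    rng = np.random.default_rng() if rng is None else rng
    # For a single squared exponential (Eq. 2.4), the induced distribution is Gaussian
    # with variance theta_l_i in each dimension under this sketch's parameterization.
    W = x + np.sqrt(theta_l) * rng.standard_normal((n_local, len(x)))
    pts = np.vstack([x[None, :], W])
    mu, var = gp_predict(X, y, pts, theta_h, theta_l, theta_noise)
    return expected_improvement(mu, var, y_best).mean()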

Chapter 3 Evaluation Method

This chapter describes how I evaluated the improvements which were investigated in 2010. It describes what makes an optimization problem hard, the test problems used and the measures taken.

3.1 Hard Optimization Problems

In order to provide a thorough evaluation of the improvements to GPO, we must find a set of problems which range in difficulty, to determine how good the improvements actually are. This requires us to define what we mean by hard optimization problems.

To start the investigation of this concept, one should first consider the No Free Lunch theorems for optimization [16]. These theorems essentially state that if an algorithm performs well on one class of problems, its performance will be offset by poorer performance on another class.

Theorem 1. The sum of the performance of algorithm A, on all test problems, is equal to the sum of the performance of algorithm B, on all test problems.

This theorem suggests that there is no point in even doing an evaluation, because averaged over all problems our algorithm will be no better than a random search! However, the theorem sums performance over all test problems, of which there are an infinite number; this includes completely random surfaces which are never actually encountered in the real world. So while we must accept that we can never do better than random search over all test problems, we can try to do better than a random search on test problems which simulate real world problems; in particular, hard real world problems.

When we think of problems which are hard for humans to optimize, we commonly think of high dimensional problems, as they are hard for humans to visualize. High dimensionality also makes a problem difficult for a computer. The reason for this is the curse of dimensionality: in order to locate a particular point to within some error, the number of samples that must be taken grows exponentially with the dimension of the space. In the case of optimization, this means the computer is required to take exponentially more samples in order to find the global optimum. Recall that the number of hyperparameters required for an n-dimensional problem is n + 2, because we have n length scales, the vertical scale and the noise parameter. Since GPO also performs a search in this space to learn the hyperparameters, the computational complexity of learning the hyperparameters in higher dimensions increases exponentially because of the curse of dimensionality.

Another property which makes test problems difficult is the number of local optima. Naive algorithms often get stuck at local optima. Slightly smarter algorithms will

begin to explore the space. However, if there are a large number of local optima and a limited number of samples, these algorithms will also find the problem difficult. If the local optima are laid out in a regular pattern, some algorithms may be able to detect and consequently exploit this pattern. Therefore, another property which makes an optimization problem difficult is a more chaotic layout of the positions of the local optima.

3.2 Test Problems

Ideally, we would like to use a class of test problems which allows the difficulty described above to be controlled. The test problems chosen for the evaluation are a class of problems proposed by Bernardetta Addis and Marco Locatelli [1]. These test functions allow us to control the difficulty of the problem as described in Section 3.1. They are parameterized by the dimension and the number of local optima. The number of local optima is controlled by two separate parameters: the number of local optima at level two and the number of local optima at level three. A description of what it means to be a local optimum at a certain level is given below, in a maximization context.

Consider a surface f with a number of maxima, and consider these maxima to be the nodes of a graph. These are the local maxima at level one. Let there be a directed edge (X, Y) between two nodes X and Y if:

- f(Y) ≥ f(X);
- there exists a point Z ≠ X such that ||X − Z|| ≤ r_1 and f(X) ≤ f(Z), for some threshold distance r_1;
- there is a non-descending continuous path from Z to Y along the surface f.

The nodes which only have edges terminating at them are called local maxima at level two. The same relation can then be applied to another graph whose nodes are the level two maxima, in order to find the level three maxima. This can be continued indefinitely; however, three levels is usually enough to define a unique global maximum.

Figure 3.1(a) shows a function f(x) with two maxima, X and Y, outlined in blue boxes. The threshold distance r_1 is shown as the green line. To determine whether there is an edge between X and Y, we check the three conditions:

- Clearly f(Y) ≥ f(X).
- The point Z (marked by the blue cross) is within the threshold distance r_1 and f(X) ≤ f(Z), so a point Z which satisfies the required condition does exist.
- Following f(x) (the red curve) from Z to Y is a non-descending continuous path.

So in this case, there would be an edge from X to Y in the resulting graph. Figure 3.1(b) shows an example where the above relation has been applied to a complete set of optima. All nodes of this graph are local optima at level one. The red nodes do not have any edges originating from them, only edges terminating at them; these nodes are defined to be the local optima at level two. Intuitively, the optima represented by these nodes are the best points in their neighbourhood.

Increasing the number of level two optima of a surface means that there are more neighbourhoods with best points. This means that an algorithm which quickly moves on from a local optimum at level one may still only find a local optimum at level two. However, out of

all the local optima at level two, there will be some which are better than all the others in their suburb. These are the local optima at level three, which are even more difficult to find.

Figure 3.1: The two stages of determining which optima are local optima at level one and which are local optima at level two. (a) Creating the graph structure requires one to determine whether an edge should be placed between X and Y using the above relation. (b) Once the graph has been constructed (here over nodes m1 to m16), we can determine which nodes correspond to local optima at level two; in this case the red nodes are the local optima at level two.

In order to construct test functions with a specified dimension and number of local optima at each level, a number of basic components were developed and combined by Addis and Locatelli. First, two basic one-dimensional components are introduced, s and d, with known numbers of local minimizers at level one and level two. Next, a basic n-dimensional component F is introduced, again with known numbers of local optima at level one and level two. Thirdly, these basic components can be combined into a surface G with a parameterizable number of local optima at level two and precisely one local optimum at level three. Finally, two G components can be combined into a Γ component with parameterizable numbers of local optima at level two and level three. The details of these constructions are outside the scope of this project and can be found in [1].

In order to focus on the more difficult problems, I chose a selection of high dimensional problems from this class (see Table 3.1). Since the functions do not have names, I have developed a naming scheme for identifying them. Each function is called LTFxDyyz, where x is the number of input dimensions, yy is the number of level 2 optima and z is the number of level 3 optima.

Figure 3.2: Two example curves from the class of test problems being used for evaluation: (a) an example of a 1D test function, and (b) an example of a 2D test function. Note the multiple local optima and, in the 2D case, the different characteristic length scales in the two dimensions.

Table 3.1: Features of the problems chosen for the evaluation (name, dimension, number of level 2 optima and number of level 3 optima), sorted by ascending difficulty. The nine problems comprise one 2-dimensional, one 3-dimensional, one 4-dimensional, two 5-dimensional and four 6-dimensional functions.

3.3 Evaluation Measure

For the initial evaluation, three algorithms are compared, corresponding to the improvements outlined in Section 2.5:

- Vanilla GPO uses the standard expected improvement measure (the expected improvement of y over the best y).
- Mean EI GPO uses the expected improvement of the mean over the best mean.
- Local EI GPO uses the local expected improvement measure.

The algorithms are run one hundred times on each test problem. Each run begins with two random pre-samples which are shared by all algorithms being compared for that particular run; the next run begins with two different random pre-samples. The number of samples taken per run depends on the problem, due to the time it takes to actually perform one hundred runs with three different algorithms. For four input dimensions and lower, 25 samples were taken in each run; for more than four input dimensions, 20 samples were taken. This is an unfortunate limitation of the code which is discussed later in Future Work (Chapter 7).

After each sample, a measure of performance is taken. Several different measures were considered:

1. The value of the best sample point seen so far (i.e. y_best).
2. The value of the underlying surface at the point of µ_best (the highest value of the GP mean).
3. The value of the underlying surface at a sampled point where µ is the highest.

The first measure, the best sample point seen so far, is exactly that: after each sample, the best y value seen so far is recorded. This is a bad idea for very noisy samples, because the actual value of the surface being optimized could be considerably worse. This measure also does not reflect the quality of the learned model, which is another factor which should be considered by the measurement.

One way to take the learned model into account is to measure the value of the underlying surface at the highest point of the Gaussian Process mean (µ_best). This measures the quality of the learned model because if the ground truth value of the surface at µ_best is very low, then the GP cannot be modelling the surface very accurately. When we tried this measure, the highest point of the GP mean was often very far away from where any samples had been taken, where the model was a very poor representation of the underlying surface. Users of the algorithm would want to be confident in the point returned from the optimization; thus, we believe the x vector being returned as the optimum should at least have been sampled at.

In order to avoid the problems with the two approaches above, we decided to combine them. This is the third measure we considered: the true value of the underlying surface at the sampled point x ∈ D where µ(x) is the highest over all x ∈ D. This avoids the problem of the second measure by ensuring we have at least sampled at the point we report as the best.
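A minimal sketch of this third measure, assuming access to the ground-truth surface f_true (available for the synthetic test problems) and reusing gp_predict from the Chapter 2 sketch; it is an illustration, not the project's evaluation code.

import numpy as np

def performance_measure(f_true, X_sampled, y_sampled, theta_h, theta_l, theta_noise):
    """True surface value at the *sampled* point where the GP mean is highest."""
    mu, _ = gp_predict(X_sampled, y_sampled, X_sampled, theta_h, theta_l, theta_noise)
    x_reported = X_sampled[np.argmax(mu)]   # only points we have actually sampled qualify
    return f_true(x_reported)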


Chapter 4 Evaluation of Improvements

4.1 Results

The results of the evaluation of Michael's improvements are summarized in Figures 4.1 to 4.3. The results show that both improvements are fairly significant. On the 2D and 3D results there is not much difference between the three algorithms. In these low dimensions, search is quite easy for most algorithms, so we do not expect the improvements to really stand out here. What is important is that they perform no worse than the Vanilla algorithm.

Figure 4.1: Results of the improvements on the 2 and 3 dimensional test functions: (a) d = 2, L2 = 2, L3 = 1; (b) d = 3, L2 = 4, L3 = 1.

The 4D problem shows the Local EI measure doing clearly much better than the other two algorithms; even the standard EI measure does better than the Mean EI measure. This is somewhat contradictory to the first 5D problem, where Mean EI performs far better than the other two algorithms. I think this is a good indication that combining the Local EI and Mean EI measures into one will produce a very efficient optimization algorithm, because each part of this combination should be robust against the deficiencies of the other.

The 6 dimensional results indicate similar performance. The easiest 6D problem shows the Mean EI improvement doing considerably better than the other two algorithms, which have fairly similar performance. The three more difficult 6D problems show the Local EI improvement, again, performing considerably better than the other two. It is also worth noting that on the 4-6 dimensional problems, the Mean EI improvement performs better than both of the other variations on average at the first sample. It would appear that when

few samples have been taken, Mean EI selects better points than the other two algorithms simply by trying to improve the mean of the GP, µ, instead of improving the best sample, y. From these results, it would seem that when the Local EI and standard GPO EI have similar performance, the Mean EI does considerably better. This indicates that for these particular test problems, the EI of y over the best y is not a good measure, since both algorithms are affected in the same way. The Mean EI, which aims to improve the value of the GP mean over the current best GP mean value, is a better measure because if two points have the same or similar EI values, it will choose to sample at the point which is further away from any other samples, leading to more exploratory behaviour.

Figure 4.2: Results of the improvements on the 4 and 5 dimensional test functions: (a) d = 4, L2 = 3, L3 = 1; (b) d = 5, L2 = 4; (c) d = 5, L2 = 4, L3 = 2.

In Figure 4.3(c), all three algorithms seem to dip down between samples 2 and 7. This means that the value of the underlying surface at the sampled point where the GP mean is highest has decreased compared to where it was at sample 2. The reason for this is that the Gaussian Process model is learning a θ_noise value which is larger than it was before: it is explaining the variations in the data as noise because the model is not powerful enough to explain them in any other way. In a real world situation, increasing the noise may or may not be the correct thing to do. If the samples on a surface are known to have no noise whatsoever, then the hyperprior on θ_noise could be set accordingly to ensure the model never incorporates noise. This would eliminate the dipping, since the value of the highest point of the GP mean would always be the value of the best sample so far, because the mean would have to pass exactly through

every sample point. This is not necessarily a problem, because the algorithm is simply exploring different models. However, since the manner in which the next sample is taken depends on the predictive model, it could be beneficial not to consider these noisy models, particularly in the case where the samples are known to be noise-free.

Figure 4.3: Results of the improvements on the 6 dimensional test functions: (a) d = 6, L2 = 1, L3 = 1; (b) d = 6, L2 = 12; (c) d = 6, L2 = 24, L3 = 2; (d) d = 6, L2 = 32, L3 = 2.

Another interesting point is the performance of Mean EI. In every single test problem, the Mean EI algorithm performs the best after the first sample. Mean EI was designed to introduce more exploratory behaviour into the algorithm, so this result is somewhat unexpected; we would have expected Mean EI to show improvement only once more samples had been taken.

The results presented here are consistent with Michael's results [12]. Considering the performance on the easy problems (2D, 3D), there is very little difference between the three algorithms. However, when the difficulty increases, both improvements increase the performance of the algorithm, in different situations. Michael also found the improvements were of more benefit on harder problems, even in 2 dimensions.


Chapter 5 A New Improvement to GPO

Depending on the choice of covariance function, a GP may have difficulty modelling surfaces which have multiple trends. In many cases, the GP will simply model the variations in the y direction as noise, even when the samples are not noisy. GPO can be improved by using a different covariance function.

5.1 Motivation 1

The results in the previous chapter show some cases where more data is added but the optimization appears to perform worse (the dipping problem). The reason for this is that the hyperparameter model is too simple to explain the variations in the data, so the only way the algorithm can explain the data is by increasing the amount of assumed noise in the model. This causes the mean of the GP to be considerably smoother in the y direction, which has the effect of raising the GP mean where it was once very low, and lowering it where it was once very high (see Figure 5.1).

Figure 5.1: The GP model with (a) a small θ_noise value and (b) a large θ_noise value. The black line is the GP mean, the green region is the GP variance and the black dots are the samples.

Within the space of one sample, the predictive distribution can change its beliefs about the underlying surface dramatically. It seems unreasonable to change one's beliefs about the surface from low noise to high noise so drastically in the space of a single sample.
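The effect in Figure 5.1 is easy to reproduce with the gp_predict helper from the Chapter 2 sketch: fitting the same data with a small versus a large assumed θ_noise gives very different predictive means (the data and hyperparameter values below are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(8, 1))
y = np.sin(4 * X[:, 0])                  # noise-free but rapidly varying samples
X_star = np.linspace(-2, 2, 100)[:, None]

# Small assumed noise: the mean passes (almost) exactly through every sample.
mu_small, _ = gp_predict(X, y, X_star, theta_h=1.0, theta_l=np.array([1.0]), theta_noise=1e-6)

# Large assumed noise: the variation is "explained away" and the mean is much smoother.
mu_large, _ = gp_predict(X, y, X_star, theta_h=1.0, theta_l=np.array([1.0]), theta_noise=0.5)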

5.2 Motivation 2

It is possible to take samples from the Gaussian Process prior by taking a set of input points D and a covariance function, producing the covariance matrix for those points and, finally, drawing sample values from the corresponding Gaussian distribution. Examples of such samples are given in Figure 5.2(a). By changing the covariance function, we can change the kinds of samples produced by the Gaussian Process prior. Samples taken from a GP prior with a superposition (sum) of two squared exponential covariance functions are shown in Figure 5.2(b).

Figure 5.2: Samples from a Gaussian Process prior (a) using a squared exponential covariance function, and (b) using a superposition of squared exponential covariance functions.

Looking at the differences between these samples, one could argue that the superposition samples look like more difficult problems to solve than the standard covariance function samples. The superposition samples have more local optima and these are laid out in a more chaotic fashion; these are two of the properties mentioned earlier which make optimization difficult. This shows that if the model has more free hyperparameters to learn, it can model more complex surfaces, which are more representative of difficult optimization problems.

5.3 New Improvement to GPO

This improvement aims to provide more parameters so that the Gaussian Process has more freedom to correctly model the surface. The simplest way of providing these extra parameters is to use a superposition of covariance functions. The superposition (or sum) of two covariance functions is itself a valid covariance function [10]. So instead of a standard squared exponential, the covariance function will have the equation

  cov(x, x′; θ) = θ_h1 exp[ −(1/2) Σ_{i=1..I} (x_i − x′_i)² / θ_l1i ] + θ_h2 exp[ −(1/2) Σ_{i=1..I} (x_i − x′_i)² / θ_l2i ]    (5.1)

This in turn requires a change to the hyperparameter representation. For an n-dimensional problem, θ now consists of 2n length scales, two vertical scales and one noise parameter. This is effectively a doubling in the number of free hyperparameters the algorithm must learn.
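The prior draws in Figure 5.2 and the superposition in Equation 5.1 can be reproduced with a short sketch, reusing sq_exp_cov from the Chapter 2 sketch; the grid and hyperparameter values are illustrative assumptions, not those used to generate the figure.

import numpy as np

def sample_gp_prior(X_grid, cov_fn, n_draws=3, jitter=1e-6, rng=None):
    """Draw functions from a zero-mean GP prior: build the covariance matrix over a grid
    and sample from the corresponding multivariate Gaussian via a Cholesky factor."""
    rng = np.random.default_rng() if rng is None else rng
    K = cov_fn(X_grid, X_grid) + jitter * np.eye(len(X_grid))
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(X_grid), n_draws))).T

X_grid = np.linspace(0, 10, 300)[:, None]

# Single squared exponential (as in Figure 5.2(a))
se = lambda A, B: sq_exp_cov(A, B, theta_h=1.0, theta_l=np.array([4.0]))

# Superposition of two squared exponentials (as in Figure 5.2(b)): a broad trend plus fine structure
superpos = lambda A, B: (sq_exp_cov(A, B, theta_h=1.0, theta_l=np.array([4.0]))
                         + sq_exp_cov(A, B, theta_h=0.2, theta_l=np.array([0.05])))

draws_se = sample_gp_prior(X_grid, se)
draws_superpos = sample_gp_prior(X_grid, superpos)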

Figure 5.3: A superposition of two squared exponential covariance functions, with the dimensions to which the hyperparameters refer labelled. θ_h1 and θ_l1 refer to the height and length scale of the larger squared exponential; θ_h2 and θ_l2 refer to the height and length scale of the smaller squared exponential.

This could result in an algorithm which struggles when few samples have been taken, because there are so many parameters to learn. However, once a decent amount of data has been sampled, the increased number of hyperparameters should provide the ability to form a better model of complex surfaces. The time complexity of the algorithm will also be affected by this change; however, the increased modelling power outweighs this because we are explicitly interested in the case of expensive data, where the cost of the model itself can essentially be ignored.

Combining the superposition covariance function with the Vanilla GPO and Mean EI GPO improvements is a trivial task, because these two algorithms are not constrained to any particular covariance function; they only rely on a covariance function being available. When we combine Local EI GPO with the superposition covariance function, however, some slight adjustments are required. Recall that Local EI GPO takes into account the EI around a particular point by treating the covariance function centred on that point as a distribution, taking a number of samples underneath it, and averaging the EI at those sample points. When using the standard squared exponential covariance function, the distribution the samples are taken under is essentially a Gaussian distribution, which is easy to sample from. In the case of a superposition of squared exponentials, however, sampling is not as easy. In order to take samples under the superposition of squared exponentials, a rejection sampling technique was employed. Rejection sampling is a standard technique that enables us to take samples under a curve

P*(x), where the asterisk denotes an unnormalized probability distribution. First, one finds an easy-to-sample distribution, such as a Gaussian distribution, which we denote Q(x). Then, we need to find a curve Q*(x) = cQ(x), for some constant c, such that Q*(x) ≥ P*(x) for all x. The process is then as follows:

1. Take a sample x_sample from the easy-to-sample distribution, Q.
2. Take a random uniform sample between 0 and Q*(x_sample). Call this value u.
3. If u ≤ P*(x_sample) then we accept the sample x_sample. Otherwise, we reject it.
4. Repeat steps 1 to 3 until the desired number of samples have been collected.

Figure 5.4: Diagram of the rejection sampling technique. The red curve is the distribution that we want samples under (P*) and the green curve is Q*. The vertical line indicates the x position of a sample under Q. A random sample u is then taken on the vertical line between 0 and Q*. If u falls in the cyan section of the line, it is accepted; if u is in the magenta segment, it is rejected.

To sample from the density corresponding to the superposition covariance function, we set P* = cov(d), where d = x − x′ and cov(d) is the superposition of squared exponentials parameterized with some hyperparameters. An easy-to-sample distribution Q is the Gaussian distribution with a variance equal to the larger of the two superposition length scales; this ensures that when we find Q*, it will definitely cover P*. The constant c changes the height of Q to ensure Q* is at least as tall as P*; the height of P* is equal to the sum of the two vertical scale hyperparameters (see Figure 5.3).
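A one-dimensional sketch of this sampler, following the four steps above. The choice of Q (a Gaussian whose variance is the larger length scale) and of c (scaling Q so that Q* reaches the height θ_h1 + θ_h2 of P* at d = 0) follows the description in this section; the concrete hyperparameter values are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def superposition_cov_1d(d, th_h1, th_l1, th_h2, th_l2):
    """Unnormalized target P*(d): superposition of two squared exponentials in the distance d."""
    return th_h1 * np.exp(-0.5 * d ** 2 / th_l1) + th_h2 * np.exp(-0.5 * d ** 2 / th_l2)

def rejection_sample(n, th_h1, th_l1, th_h2, th_l2, rng=None):
    """Draw n samples under the superposition covariance curve by rejection sampling."""
    rng = np.random.default_rng() if rng is None else rng
    var_q = max(th_l1, th_l2)                  # proposal variance = larger length scale
    # Scale Q so that Q*(d) = c*Q(d) dominates P*; P* peaks at th_h1 + th_h2 at d = 0.
    c = (th_h1 + th_h2) / norm.pdf(0.0, scale=np.sqrt(var_q))
    samples = []
    while len(samples) < n:
        d = rng.normal(0.0, np.sqrt(var_q))                           # step 1: sample from Q
        u = rng.uniform(0.0, c * norm.pdf(d, scale=np.sqrt(var_q)))   # step 2: uniform under Q*
        if u <= superposition_cov_1d(d, th_h1, th_l1, th_h2, th_l2):  # step 3: accept or reject
            samples.append(d)
    return np.array(samples)

# e.g. offsets around a point x, for estimating Local EI under the superposition covariance
offsets = rejection_sample(25, th_h1=1.0, th_l1=4.0, th_h2=0.2, th_l2=0.05)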

5.4 Evaluation

This section compares and evaluates the performance of the superposition covariance function against the standard squared exponential covariance function when used with Vanilla GPO. The results in this case are very positive towards the superposition covariance function: in every case, the superposition outperforms the Vanilla GPO algorithm. It is especially convincing on the 4, 5 and easier 6 dimensional problems.

Figure 5.5: Results comparing the standard covariance function against the superposition covariance function when used with the Vanilla GPO EI on the 2 and 3 dimensional problems: (a) d = 2, L2 = 2, L3 = 1; (b) d = 3, L2 = 4, L3 = 1.

The 2 and 3 dimensional problems show the superposition performing slightly better than the standard covariance function. In the 3 dimensional problem, the superposition appears to perform much better when very few samples have been taken. This is quite different from the behaviour we were expecting.

The 4 and 5 dimensional problems are where the superposition really shows itself to be a strong improvement over the Vanilla algorithm. It finds points on the surface which are far superior to those found by the standard covariance function. As with the 3D case, the superposition performs better from very early on in the optimization process.

In Figure 5.7(c), the superposition does not exhibit the same dipping problem that the single squared exponential is susceptible to. The superposition curve does not rise as steeply as the squared exponential curve because there are more hyperparameters that must be learned; learning so many hyperparameters accurately with a limited amount of data is quite difficult. However, the superposition climbs consistently and reasonably quickly, clearly showing that the extra parameters are not a huge burden on the algorithm.

The most difficult problem, LTF6D322, shows an example where the extra hyperparameters are quite difficult for the algorithm to learn. However, once it finds a good set of hyperparameters, it ultimately ends up explaining the surface much better than the standard algorithm and finding a higher point. This is the behaviour we expected from the algorithm (since there are more hyperparameters to learn), so it is surprising that it only occurred in one of the test problems.


More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Gaussian process for nonstationary time series prediction

Gaussian process for nonstationary time series prediction Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Soane Brahim-Belhouari, Amine Bermak EEE Department, Hong

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Nonparametric Regression With Gaussian Processes

Nonparametric Regression With Gaussian Processes Nonparametric Regression With Gaussian Processes From Chap. 45, Information Theory, Inference and Learning Algorithms, D. J. C. McKay Presented by Micha Elsner Nonparametric Regression With Gaussian Processes

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Prediction of Data with help of the Gaussian Process Method

Prediction of Data with help of the Gaussian Process Method of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

Gaussian with mean ( µ ) and standard deviation ( σ)

Gaussian with mean ( µ ) and standard deviation ( σ) Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Design and Optimization of Energy Systems Prof. C. Balaji Department of Mechanical Engineering Indian Institute of Technology, Madras

Design and Optimization of Energy Systems Prof. C. Balaji Department of Mechanical Engineering Indian Institute of Technology, Madras Design and Optimization of Energy Systems Prof. C. Balaji Department of Mechanical Engineering Indian Institute of Technology, Madras Lecture - 09 Newton-Raphson Method Contd We will continue with our

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search

More information

Logistic Regression Logistic

Logistic Regression Logistic Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Rafdord M. Neal and Jianguo Zhang Presented by Jiwen Li Feb 2, 2006 Outline Bayesian view of feature

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Predicting AGI: What can we say when we know so little?

Predicting AGI: What can we say when we know so little? Predicting AGI: What can we say when we know so little? Fallenstein, Benja Mennen, Alex December 2, 2013 (Working Paper) 1 Time to taxi Our situation now looks fairly similar to our situation 20 years

More information

SYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I

SYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I SYDE 372 Introduction to Pattern Recognition Probability Measures for Classification: Part I Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 Why use probability

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation Vu Malbasa and Slobodan Vucetic Abstract Resource-constrained data mining introduces many constraints when learning from

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

Fundamentals of Metaheuristics

Fundamentals of Metaheuristics Fundamentals of Metaheuristics Part I - Basic concepts and Single-State Methods A seminar for Neural Networks Simone Scardapane Academic year 2012-2013 ABOUT THIS SEMINAR The seminar is divided in three

More information

Using Gaussian Processes to Optimize Expensive Functions.

Using Gaussian Processes to Optimize Expensive Functions. Using Gaussian Processes to Optimize Expensive Functions. Marcus Frean and Phillip Boyle Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand marcus@mcs.vuw.ac.nz http://www.mcs.vuw.ac.nz/

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

Pengju

Pengju Introduction to AI Chapter04 Beyond Classical Search Pengju Ren@IAIR Outline Steepest Descent (Hill-climbing) Simulated Annealing Evolutionary Computation Non-deterministic Actions And-OR search Partial

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Please bring the task to your first physics lesson and hand it to the teacher.

Please bring the task to your first physics lesson and hand it to the teacher. Pre-enrolment task for 2014 entry Physics Why do I need to complete a pre-enrolment task? This bridging pack serves a number of purposes. It gives you practice in some of the important skills you will

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

DD Advanced Machine Learning

DD Advanced Machine Learning Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Parameter estimation Conditional risk

Parameter estimation Conditional risk Parameter estimation Conditional risk Formalizing the problem Specify random variables we care about e.g., Commute Time e.g., Heights of buildings in a city We might then pick a particular distribution

More information

Probabilistic numerics for deep learning

Probabilistic numerics for deep learning Presenter: Shijia Wang Department of Engineering Science, University of Oxford rning (RLSS) Summer School, Montreal 2017 Outline 1 Introduction Probabilistic Numerics 2 Components Probabilistic modeling

More information