
VICTORIA UNIVERSITY OF WELLINGTON
Te Whare Wananga o te Upoko o te Ika a Maui

School of Engineering and Computer Science
Te Kura Mātai Pūkaha, Pūrorohiko
PO Box 600, Wellington, New Zealand
Tel: Fax: Internet: office@ecs.vuw.ac.nz

Evaluate and Improve an Optimization Algorithm

Thomas Robinson

Supervisor: Marcus Frean

Submitted in partial fulfilment of the requirements for Bachelor of Science with Honours in Computer Science.

Abstract

Global Optimization is a difficult task with many applications. This report describes one method of Global Optimization known as Gaussian Process Optimization. This is a response-surface-based methodology which constructs its own model of the underlying surface and determines the next best place to sample using an Expected Improvement measure. Several improvements were investigated by Michael Mudge in an honours project in 2010, and these are thoroughly evaluated in this report. A further improvement is made to the algorithm in the form of a new covariance function, which is also described and thoroughly evaluated.


Acknowledgments

I acknowledge the helpful and constructive guidance of my supervisor, Marcus Frean, who gave freely of his time and advice throughout the project.


Contents

1 Introduction
2 Background Information
   2.1 Gaussian Processes
   2.2 Covariance
   2.3 Hyperparameters
   2.4 Expected Improvement
   2.5 Algorithm Improvements
       2.5.1 Background Mean
       2.5.2 EI of the Mean over the Best Mean
       2.5.3 Local EI
   2.6 Related Work
3 Evaluation Method
   3.1 Hard Optimization Problems
   3.2 Test Problems
   3.3 Evaluation Measure
4 Evaluation of Improvements
   4.1 Results
5 A New Improvement to GPO
   5.1 Motivation 1
   5.2 Motivation 2
   5.3 New Improvement to GPO
   5.4 Evaluation
6 Towards a Combined Algorithm
   6.1 Combination: Superposition with Mean EI GPO
   6.2 Combination: Superposition with Local EI GPO
   6.3 Combination: All Improvements
7 Future Work and Conclusion
   7.1 Unanswered Questions
       Learning the Hyperparameters
       Matrix Inversion
       Other Covariance Functions
       Other Expected Improvement Measures
   7.2 Conclusion


Chapter 1 Introduction

Many real world applications require one to optimize a set of parameters in order to find the best, or most desirable, combination. For example, consider designing an efficient aerofoil: there are several variables (for example, length, width, curvature, the drag coefficient of the material and so on), all of which affect the lift-to-drag ratio in different ways. Finding the combination of variables which produces a wing with the highest lift-to-drag ratio is desirable. This problem is equivalent to finding the highest point on a surface of several dimensions which relates the variables of the wing to that ratio. Finding the highest point is possible with a process known as Global Optimization.

Global Optimization is, in general, a very difficult problem to solve. If one has the mathematical function relating the variables to the quantity being optimized, it is likely the gradient of the surface can be calculated. A simple method for locating an optimum of the surface is to set the gradient to zero and solve the resulting equation. This may give several points (the stationary points of the surface), each of which should be evaluated (by passing it to the original function) and the results compared. While this solution is exact, in many real world problems it is not feasible because:

- the underlying mathematical function may not be available;
- the gradient may not be available and/or may be impossible to calculate;
- the number of optima on the surface may be very large.

Therefore, techniques have been developed for finding the global optimum of a surface. These techniques aim to both explore and exploit a surface. Exploration means that the algorithm will look in areas it has not visited before, and exploitation means that the algorithm will pursue promising points.

One of these techniques is known as simulated annealing [7]. This technique involves a parameter T known as the temperature, which starts at a large value. Early in the search, the algorithm behaves much like a random search, performing random exploration. The temperature is slowly annealed down to a lower value at which the search will only move to another point if it is better than the best point seen so far. That is, it exploits the surface.

Another approach to global optimization is particle swarm optimization [6]. This algorithm involves a number of particles which are essentially samples. Since there is usually a large number of them, this alone covers the exploration aspect of the optimization. Each particle has a velocity associated with it, which determines how it will move within the search space in each iteration. This velocity is updated based on how good the best point that particle has seen so far is, and how that point compares to the best point seen by any particle. This is the exploitation aspect of the algorithm.
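As a concrete point of comparison for the sample-efficiency discussion that follows, here is a minimal simulated-annealing sketch in Python; the proposal distribution, cooling schedule and example objective are illustrative assumptions, not taken from [7] or from this project's code.

import numpy as np

def simulated_annealing(f, x0, n_iters=1000, T0=1.0, cooling=0.995, step=0.1, rng=None):
    """Maximize f by simulated annealing: always accept uphill moves, and accept
    downhill moves with probability exp(delta / T) while T is annealed towards zero."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    fx, T = f(x), T0
    best_x, best_f = x.copy(), fx
    for _ in range(n_iters):
        x_new = x + step * rng.standard_normal(x.shape)   # random proposal
        f_new = f(x_new)
        delta = f_new - fx
        if delta > 0 or rng.random() < np.exp(delta / T):  # Metropolis-style acceptance
            x, fx = x_new, f_new
            if fx > best_f:
                best_x, best_f = x.copy(), fx
        T *= cooling                                       # anneal the temperature
    return best_x, best_f

# Example: maximize a bumpy 1-D function (hypothetical objective)
best_x, best_f = simulated_annealing(lambda x: np.sin(3 * x[0]) - 0.1 * x[0] ** 2, x0=[2.0])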

However, these methods are not necessarily the most efficient in terms of how many samples they take. This is an important consideration in global optimization because in many cases it can be very expensive to actually take the samples. In the aerofoil example above, there could be a large development cost (time, money) associated with manufacturing a wing which is slightly longer, only for its drag coefficient to be measured once. Clearly, the fewer prototype wings that need to be created in order to find the optimal one, the better!

This report outlines a data-efficient global optimization method which formed the basis of an honours project. The method is based on the general idea of response surfaces [4], in which the algorithm creates its own model of the surface being optimized. The response surface approach works by first modelling the current known data to create a response surface or, better still, a predictive distribution. The predictive distribution is then analyzed to determine the next place to sample. Finally, an actual sample is taken at this point and the point and its value are added to the known data (see Figure 1.1). The assumption is that sampling from the real world is expensive, but sampling from the model is relatively cheap.

Figure 1.1: The response surface methodology: produce a model from the data set, determine where to sample, take the sample, and add the sample to the data set.

The optimization method being investigated here is known as Gaussian Process Optimization (GPO), which aims to be as efficient as possible in terms of the number of samples needed to locate the global optimum. In the case of GPO, the response surface model is a Gaussian Process and the measure for determining the next point to sample at is known as the Expected Improvement (EI). In 2010, three improvements to the GPO algorithm were investigated by Michael Mudge, but these were not evaluated thoroughly. The contributions I have made to GPO in this project are:

- A thorough evaluation of the improvements to GPO investigated in 2010, using test problems which allow the difficulty to be controlled.
- An addition to the algorithm in the form of a new covariance function.
- A thorough evaluation of this addition.
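Before moving on, the loop of Figure 1.1 can be written down in a few lines. The sketch below paraphrases the methodology; fit_gp, maximize_ei and sample_surface are hypothetical placeholders rather than functions from this project's code.

import numpy as np

def gpo_loop(sample_surface, x_init, y_init, n_samples, fit_gp, maximize_ei):
    """Response-surface optimization loop: model the data, pick the point that
    maximizes expected improvement, take the (expensive) sample, and repeat."""
    X, y = list(x_init), list(y_init)
    for _ in range(n_samples):
        gp = fit_gp(np.array(X), np.array(y))                # 1. produce a model from the data
        x_next = maximize_ei(gp, np.array(X), np.array(y))   # 2. determine where to sample
        y_next = sample_surface(x_next)                      # 3. take the sample
        X.append(x_next)                                     # 4. add the sample to the data set
        y.append(y_next)
    best = int(np.argmax(y))
    return X[best], y[best]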

Chapter 2 Background Information

This chapter introduces Gaussian Processes and how they can be used for optimization. It also gives an explanation of the improvements that were investigated by Michael Mudge in 2010 and which are being evaluated in this project.

2.1 Gaussian Processes

Consider a surface defined by y = f(x), where x is a vector in a continuous d-dimensional input space. Let D be the set of samples ((x, y) pairs) that have been taken on the surface f. A Gaussian Process (GP) is defined to be a model which provides a Gaussian predictive distribution p(y | x, D) over y at all points x. The assumption is that p(y | x, D), the posterior distribution, is a Gaussian distribution. This means that it is possible to obtain the mean and variance of a Gaussian distribution for any input x via a Gaussian Process. A single-variable Gaussian distribution is parameterized by a mean and a variance. A multivariate Gaussian distribution is parameterized by a mean vector and a covariance matrix. It follows that a Gaussian Process is parameterized by a mean function and a covariance function [15].

2.2 Covariance

A central part of the mechanics of a GP is the covariance function. Given two input vectors x and x′, this function expresses what we believe about how the two corresponding values y = f(x) and y′ = f(x′) are related. This is known as the covariance between x and x′ and is denoted by cov(x, x′). In general, one would expect two inputs that are close to each other in x-space to have similar y values, that is, a high covariance. Similarly, two inputs that are far apart would be expected to have unrelated y values, or a low covariance. For this reason, the most common choice of covariance function is the squared exponential. Figure 2.1 shows the one-dimensional squared exponential function. Two input values that are far apart can be seen to have a low covariance, while input values that are close together can be seen to have a high covariance. The squared exponential provides a smooth transition across these different levels of covariance.

Recall the first step of the response surface methodology: producing a model from the data at hand. Consider a matrix C defined as follows:

  C_ij = cov(x_i, x_j) + δ_ij θ_noise    (2.1)

where x_i, x_j ∈ D and δ_ij is the Kronecker delta, which is 1 when i = j and 0 otherwise. This matrix C is known as a covariance matrix and is a table of covariance values between all pairs of data points in the data set D.

Figure 2.1: A 1-dimensional squared-exponential curve, cov(x1, x2) ∝ exp[−(x1 − x2)²].

θ_noise is a hyperparameter which is discussed in more detail in the next section; it represents the assumed measurement noise of the samples. The Kronecker delta ensures that this noise is only added to the diagonal of the matrix (i.e. to the covariance of a point with itself), since it refers to the noise inherent in the measurement of that sample, not to the covariance between two points. Note that C will be symmetric, because cov(x_i, x_j) = cov(x_j, x_i).

Similarly to the way in which C was defined, k is defined to be the vector of covariance values between some point x in the search space and every data point in D. The mean of the Gaussian Process at this point x is then defined to be

  µ(x) = kᵀ C⁻¹ y    (2.2)

and the variance of the Gaussian Process is

  σ²(x) = κ − kᵀ C⁻¹ k    (2.3)

where κ = cov(x, x) is the covariance of x with itself.

2.3 Hyperparameters

More generally, the squared exponential covariance function for any number of input dimensions has the formula

  cov(x, x′; θ) = θ_h exp[ −(1/2) Σ_{i=1..I} (x_i − x′_i)² / θ_li ]    (2.4)

It is parameterized by a set of hyperparameters, θ. These hyperparameters define several characteristics of the covariance function:

- θ_h is the vertical scale and defines the height of the squared exponential. This represents the overall scale of variation in the height of the surface.
- θ_li is the length scale in the i-th input dimension and defines the width of the squared exponential in that dimension. This represents how rapidly the covariance dies away with distance in each dimension.

- θ_noise is a third component of θ (mentioned earlier), representing our beliefs about the level of measurement noise in the samples.

By providing different values for these hyperparameters, the shape of the covariance function can be changed. Consequently, by changing the shape of the covariance function, we are changing our model of the underlying surface. For example, a shorter length scale means there will be a lower covariance between any two points. Essentially, the covariance function expresses our beliefs about the smoothness of the surface.

It is possible to learn the hyperparameters, θ, from the sampled data, D. The best way of doing this would be to define a prior distribution on the hyperparameters and integrate over them:

  P(y_{n+1} | x_{n+1}, D) = ∫ P(y_{n+1} | x_{n+1}, θ, D) P(θ | D) dθ    (2.5)

Unfortunately, this integral is almost always completely intractable [10]. There is, however, a common way to approximate it:

  P(y_{n+1} | x_{n+1}, D) ≈ P(y_{n+1} | x_{n+1}, D, θ_MAP)    (2.6)

where θ_MAP is the maximum a posteriori value for θ, that is

  θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)    (2.7)

which can be found by applying the conjugate gradients algorithm to the posterior P(θ | D). Thus, we can learn the values of θ which best model the data.

2.4 Expected Improvement

The final piece of the puzzle required to make an optimization algorithm is determining where the next sample should be taken. This is calculated via a measure known as the Expected Improvement (EI). Let the best value we have seen so far in D be denoted by y_best. The expected improvement over y_best at some new point (x_{n+1}, y_{n+1}) is

  EI(x_{n+1}) = ∫_{y_best}^{∞} I(y_{n+1}) p(y_{n+1} | x_{n+1}, D) dy_{n+1}    (2.8)

where I is the improvement over y_best at x_{n+1}:

  I(y_{n+1}) = y_{n+1} − y_best  if  y_{n+1} > y_best,  and  0  otherwise    (2.9)

Thus we can calculate the EI of y over y_best for any point x in the search space. To determine where to sample next, we simply choose the point x in the input space which maximizes the EI. There are a variety of ways to perform this maximization. In what follows, the conjugate gradients algorithm is used, since not only is the function EI(x) available, but so is its gradient. Figure 2.2 shows an illustrative example of Gaussian Process Optimization in action.

Figure 2.2: The Gaussian Process Optimization algorithm in action, shown over twelve successive samples (panels (a)-(l)). The blue curve is the function to be optimized. The black curve shows the Gaussian Process mean and the green area is the Gaussian Process variance. The best point seen so far is indicated by the horizontal yellow line. The expected improvement over this best point is indicated by the red curve. The highest point of the red curve is the place where the next sample will be taken.
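To make Sections 2.2-2.4 concrete, the following is a minimal Python/NumPy sketch of the squared-exponential covariance (Equation 2.4), the covariance matrix (Equation 2.1), the predictive mean and variance (Equations 2.2-2.3), and the expected improvement. It is an illustration rather than the implementation used in this project; the closed-form EI below is the standard result for a Gaussian predictive distribution (the report states EI only as the integral in Equation 2.8), and the hyperparameter values in the usage example are assumptions.

import numpy as np
from scipy.stats import norm

def sq_exp_cov(X1, X2, theta_h, theta_l):
    """Squared-exponential covariance (Eq. 2.4): theta_h * exp(-0.5 * sum_i (x_i - x'_i)^2 / theta_l_i)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / theta_l).sum(axis=-1)
    return theta_h * np.exp(-0.5 * d2)

def gp_predict(X, y, X_star, theta_h, theta_l, theta_noise):
    """Predictive mean (Eq. 2.2) and variance (Eq. 2.3) of a zero-mean GP."""
    C = sq_exp_cov(X, X, theta_h, theta_l) + theta_noise * np.eye(len(X))  # Eq. 2.1
    K_star = sq_exp_cov(X, X_star, theta_h, theta_l)       # the vector k for each test point (columns)
    alpha = np.linalg.solve(C, y)                           # C^{-1} y
    mu = K_star.T @ alpha                                   # mu(x) = k^T C^{-1} y
    v = np.linalg.solve(C, K_star)                          # C^{-1} k
    var = theta_h - np.sum(K_star * v, axis=0)              # sigma^2(x) = kappa - k^T C^{-1} k
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """Closed-form EI over y_best under a Gaussian predictive distribution (maximization)."""
    sigma = np.sqrt(var)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Tiny usage example with assumed hyperparameters
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(6, 1))
y = np.sin(X[:, 0])
X_star = np.linspace(-3, 3, 200)[:, None]
mu, var = gp_predict(X, y, X_star, theta_h=1.0, theta_l=np.array([1.0]), theta_noise=1e-6)
ei = expected_improvement(mu, var, y.max())
x_next = X_star[np.argmax(ei)]   # the next place to sample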

2.5 Algorithm Improvements

The previous sections have described the basic GPO algorithm. In 2010, Michael Mudge investigated three improvements to this algorithm, which are outlined in this section.

2.5.1 Background Mean

The Gaussian Process prior, as described above, has a mean of zero. As a result, the predictive mean, µ, away from the samples will be zero, since there is no data to update those prior beliefs. This means that if the surface being optimized is in fact predominantly above (or below) zero, the Gaussian Process model away from the samples will not accurately reflect the underlying surface. If the surface genuinely has a mean of zero, then the GP prior with a zero mean will probably perform adequately. However, simply by translating the surface in the y direction, the GP prior will no longer model the surface accurately away from the samples, putting the algorithm at a disadvantage even though the optimization problem is effectively the same.

The improvement made was to incorporate a non-zero mean for the prior. The first idea was to use the mean value of the sampled points as the mean for the prior. However, since we are aiming to optimize, the samples are more likely to be taken around the peaks of the function; if this value were used as the prior's mean, it would be higher than the mean value of the surface. The second idea (and the one which was implemented) was to use the value for the mean which gives the maximum likelihood for the Gaussian Process distribution. GPO with this calculated mean for the GP prior will be referred to from now on as Vanilla GPO, as we believe no algorithm should be disadvantaged by the fact that a surface is much higher or lower than zero.

2.5.2 EI of the Mean over the Best Mean

The Expected Improvement measure described in Section 2.4 is one of four possible EI measures (see Figure 2.3). It aims to get a sample higher than the best sample so far. This could lead to the algorithm sampling where it got lucky once (an anomalously high sample) [12]. In places where many samples have been taken, the chance of improving the outcome of the optimization by sampling in that place again is low. Ideally, the algorithm should sample at a point where the GP's predictive mean is likely to be high.

Figure 2.3: Diagram showing the different possibilities for the EI measure. EI in y over the best y is the standard GPO EI measure. EI in µ over the best µ is referred to as Mean EI. The other two combinations have not been investigated.

The improvement that was made to the EI measure was to instead use the bottom-most EI measure in Figure 2.3: the EI of the GP mean, µ, over the best GP mean value so far, µ_best. This improves the behaviour of the algorithm because, while sampling near the other samples again might get lucky, it is unlikely to improve the mean at that point very significantly. Sampling away from other points is more likely to improve the mean. Hence, this improvement tends to explore the surface more.

2.5.3 Local EI

GPO takes a greedy approach to optimization by choosing the next best point (according to the model) on the very next sample. However, if the next sample truly were the best point, we would only ever need to take one sample (rendering an entire area of machine learning completely solved!). The observation is that narrow EI peaks generally do not lead to much improvement in the model, but sampling at wider EI peaks does. The aim of the next sample should be to ensure there is a higher expected improvement for the sample following it. It is also desirable to reward the algorithm for sampling in a region which has good values: if we take a sample in a region of high EI and it happens to be a good point, then the algorithm should be encouraged to exploit that region.

These behaviours have been implemented by considering the amount of EI around a particular point x (hence "Local EI"). It works by taking the covariance function and centring it on the point x. The covariance function is then treated as a distribution and a number of samples are taken from underneath it. The EI is then measured at each of those sample points (as well as at x itself) and an average is taken; the result is the Local EI for x. (A sketch of this averaging procedure is given at the end of this chapter.) Recall that the covariance function defines our beliefs about the underlying surface. The reason for using the covariance function in this process is to define the region which we should consider to be local: in order to exploit a local region around x, we should focus on the points W which have high covariance with x, because those points are likely to have similar values to x.

2.6 Related Work

Gaussian Processes are a well studied area of statistical modelling, particularly in the area of regression. David MacKay's introduction [10] provides a good overview of Gaussian Processes in machine learning. It gives a lot of detail about Gaussian Processes, including how they can be used for different tasks such as regression and classification. It describes all of the equations necessary for every facet of GPs, including learning the hyperparameters and generating covariance functions.

The most rigorous study of Gaussian Processes is Rasmussen and Williams' book, Gaussian Processes for Machine Learning [15]. This covers similar content to David MacKay's introduction but goes into much more detail, provides more thoroughly worked examples and also contains a discussion comparing GPs with other statistical models. The book briefly discusses using Gaussian Processes for global optimization but does not go into detail.

The response surface approach to global optimization has been well studied by Jones [5]. His paper provides a good introduction using a technique known as kriging; this technique is simply another name for Gaussian Process regression. Applications of kriging, focusing on its use in engineering design, are discussed in [2].

Osborne et al. [13] further investigated the idea of Gaussian Process Optimization.
They experimented with different degrees of look-ahead. That is, in a world where the number of samples you are allowed to take is limited, if you had only one sample remaining, the place

you take this sample would be different than if you had many samples remaining. They also investigate the use of a periodic covariance function (in order to exploit periodicity in surfaces). They find that their version of the algorithm significantly outperformed the tested competitors.

There are also a number of accounts of GPO being used with very successful results (including [8, 3]). In [8], applications in robotics are considered. In particular, learning the correct parameters for a robot's gait is a difficult problem; it depends on a number of parameters such as the velocity of the leg movements and how soft the surface is. The authors report using GPO to maximize the walking velocity of the robot, and to maximize the smoothness of the gait. The results are very positive and they note that GPO solved the problem in fewer samples than a local gradient algorithm. In [3], the concept of Gaussian Processes is introduced and another example of its application is provided: solving the double pole balancing problem, which is a well known optimization problem. Again, the authors note that GPO was able to solve the problem in an order of magnitude fewer samples than reported elsewhere.

In 2010, Michael Mudge investigated several improvements (those in Section 2.5) to the basic algorithm in his honours project [12]. His paper described those improvements in detail and gave an evaluation on a number of commonly used 2-dimensional problems from the literature. Those improvements showed promise on certain problems; however, the problems were only 2-dimensional and their difficulty was not characterized. My project aims to further evaluate those improvements on higher dimensional problems and to characterize what makes those problems difficult. I also investigate another possible improvement to the algorithm in the form of a different covariance function.
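The improvements in Sections 2.5.1 and 2.5.3 can be sketched in the same style, reusing sq_exp_cov, gp_predict and expected_improvement from the earlier sketch. Neither function below is from the project's code: the constant-mean formula is the standard maximum-likelihood estimate for a GP with covariance matrix C (it is not spelled out in this report), and the number of local samples (25) is an assumption suggested by the "Local25EI" label in the result figures.

import numpy as np

def ml_prior_mean(C, y):
    """Maximum-likelihood constant mean for a GP prior with covariance matrix C:
    m = (1^T C^{-1} y) / (1^T C^{-1} 1). Subtract m from y before Eq. 2.2 and add it back to mu."""
    ones = np.ones_like(y)
    return (ones @ np.linalg.solve(C, y)) / (ones @ np.linalg.solve(C, ones))

def local_ei(x, X, y, y_best, theta_h, theta_l, theta_noise, n_local=25, rng=None):
    """Local EI (Section 2.5.3): treat the covariance function centred on x as a
    distribution, draw points under it, and average the EI at x and at those points."""
    rng = np.random.default_rng() if rng is None else rng
    # For a single squared exponential (Eq. 2.4), the induced distribution is Gaussian
    # with variance theta_l_i in each dimension under this sketch's parameterization.
    W = x + np.sqrt(theta_l) * rng.standard_normal((n_local, len(x)))
    pts = np.vstack([x[None, :], W])
    mu, var = gp_predict(X, y, pts, theta_h, theta_l, theta_noise)
    return expected_improvement(mu, var, y_best).mean()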

Chapter 3 Evaluation Method

This chapter describes how I evaluated the improvements which were investigated in 2010. It describes what makes an optimization problem hard, the test problems used and the measures taken.

3.1 Hard Optimization Problems

In order to provide a thorough evaluation of the improvements to GPO, we must find a set of problems which range in difficulty, to determine how good the improvements actually are. This requires us to define what we mean by hard optimization problems.

To start the investigation of this concept, one should first consider the No Free Lunch theorems for optimization [16]. These theorems essentially state that if an algorithm performs well on one class of problems, its performance will be offset by poorer performance on another class.

Theorem 1. The sum of the performance of algorithm A, on all test problems, is equal to the sum of the performance of algorithm B, on all test problems.

This theorem suggests that there is no point in even doing an evaluation, because averaged over all problems our algorithm will be no better than a random search! However, the theorem sums performance over all test problems, of which there are an infinite number; this includes completely random surfaces which are never actually encountered in the real world. So while we must accept that we can never do better than random search over all test problems, we can try to do better than a random search on test problems which simulate real world problems; in particular, hard real world problems.

When we think of problems which are hard for humans to optimize, we commonly think of high dimensional problems, as they are hard for humans to visualize. High dimensionality also makes a problem difficult for a computer. The reason for this is the curse of dimensionality: in order to locate a particular point to within some error, the number of samples that must be taken grows exponentially with the dimension of the space. In the case of optimization, this means the computer is required to take exponentially more samples in order to find the global optimum. Recall that the number of hyperparameters required for an n-dimensional problem is n + 2, because we have n length scales, the vertical scale and the noise parameter. Since GPO also performs a search in this space to learn the hyperparameters, the computational complexity of learning the hyperparameters in higher dimensions increases exponentially because of the curse of dimensionality.

Another property which makes test problems difficult is the number of local optima. Naive algorithms often get stuck at local optima. Slightly smarter algorithms will

begin to explore the space. However, if there are a large number of local optima and a limited number of samples, these algorithms will also find the problem difficult. If the local optima are laid out in a regular pattern, some algorithms may be able to detect and consequently exploit this pattern. Therefore, another property which makes an optimization problem difficult is a more chaotic layout of the positions of the local optima.

3.2 Test Problems

Ideally, we would like to use a class of test problems which allows the difficulty described above to be controlled. The test problems chosen for the evaluation are a class of problems proposed by Bernardetta Addis and Marco Locatelli [1]. These test functions allow us to control the difficulty of the problem as described in Section 3.1. They are parameterized by the dimension and the number of local optima. The number of local optima is controlled by two separate parameters: the number of local optima at level two and the number of local optima at level three. A description of what it means to be a local optimum at a certain level is given below, in a maximization context.

Consider a surface f with a number of maxima, and consider these maxima to be the nodes of a graph. These are the local maxima at level one. Let there be a directed edge (X, Y) between two nodes X and Y if:

- f(Y) ≥ f(X);
- there exists a point Z ≠ X such that ||X − Z|| ≤ r_1 and f(X) ≤ f(Z), for some threshold distance r_1;
- there is a non-descending continuous path from Z to Y along the surface f.

The nodes which only have edges terminating at them are called local maxima at level two. The same relation can then be applied to another graph whose nodes are the level two maxima, in order to find the level three maxima. This can be continued indefinitely; however, three levels is usually enough to define a unique global maximum.

Figure 3.1(a) shows a function f(x) with two maxima, X and Y, outlined in blue boxes. The threshold distance r_1 is shown as the green line. To determine whether there is an edge between X and Y, we check the three conditions:

- Clearly f(Y) ≥ f(X).
- The point Z (marked by the blue cross) is within the threshold distance r_1 and f(X) ≤ f(Z), so a point Z which satisfies the required condition does exist.
- Following f(x) (the red curve) from Z to Y is a non-descending continuous path.

So in this case, there would be an edge from X to Y in the resulting graph. Figure 3.1(b) shows an example where the above relation has been applied to a complete set of optima. All nodes of this graph are local optima at level one. The red nodes do not have any edges originating from them, only edges terminating at them; these nodes are defined to be the local optima at level two. Intuitively, the optima represented by these nodes are the best points in their neighbourhood.

Increasing the number of level two optima of a surface means that there are more neighbourhoods with best points. This means that an algorithm which quickly moves on from a local optimum at level one may still only find a local optimum at level two. However, out of

all the local optima at level two, there will be some which are better than all the others in their suburb. These are the local optima at level three, which are even more difficult to find.

Figure 3.1: The two stages of determining which optima are local optima at level one and which are local optima at level two. (a) Creating the graph structure requires one to determine whether an edge should be placed between X and Y using the above relation. (b) Once the graph has been constructed (here over nodes m1 to m16), we can determine which nodes correspond to local optima at level two; in this case the red nodes are the local optima at level two.

In order to construct test functions with a specified dimension and number of local optima at each level, a number of basic components were developed and combined by Addis and Locatelli. First, two basic one-dimensional components are introduced, s and d, with known numbers of local minimizers at level one and level two. Next, a basic n-dimensional component F is introduced, again with known numbers of local optima at level one and level two. Thirdly, these basic components can be combined into a surface G with a parameterizable number of local optima at level two and precisely one local optimum at level three. Finally, two G components can be combined into a Γ component with parameterizable numbers of local optima at level two and level three. The details of these constructions are outside the scope of this project and can be found in [1].

In order to focus on the more difficult problems, I chose a selection of high dimensional problems from this class (see Table 3.1). Since the functions do not have names, I have developed a naming scheme for identifying them. Each function is called LTFxDyyz, where x is the number of input dimensions, yy is the number of level 2 optima and z is the number of level 3 optima.

Figure 3.2: Two example curves from the class of test problems being used for evaluation: (a) an example of a 1D test function, and (b) an example of a 2D test function. Note the multiple local optima and, in the 2D case, the different characteristic length scales in the two dimensions.

Table 3.1: Features of the problems chosen for the evaluation (name, dimension, number of level 2 optima and number of level 3 optima), sorted by ascending difficulty. The nine problems comprise one 2-dimensional, one 3-dimensional, one 4-dimensional, two 5-dimensional and four 6-dimensional functions.

3.3 Evaluation Measure

For the initial evaluation, three algorithms are compared, corresponding to the improvements outlined in Section 2.5:

- Vanilla GPO uses the standard expected improvement measure (the expected improvement of y over the best y).
- Mean EI GPO uses the expected improvement of the mean over the best mean.
- Local EI GPO uses the local expected improvement measure.

The algorithms are run one hundred times on each test problem. Each run begins with two random pre-samples which are shared by all algorithms being compared for that particular run; the next run begins with two different random pre-samples. The number of samples taken per run depends on the problem, due to the time it takes to actually perform one hundred runs with three different algorithms. For four input dimensions and lower, 25 samples were taken in each run; for more than four input dimensions, 20 samples were taken. This is an unfortunate limitation of the code which is discussed later in Future Work (Chapter 7).

After each sample, a measure of performance is taken. Several different measures were considered:

1. The value of the best sample point seen so far (i.e. y_best).
2. The value of the underlying surface at the point of µ_best (the highest value of the GP mean).
3. The value of the underlying surface at a sampled point where µ is the highest.

The first measure, the best sample point seen so far, is exactly that: after each sample, the best y value seen so far is recorded. This is a bad idea for very noisy samples, because the actual value of the surface being optimized could be considerably worse. This measure also does not reflect the quality of the learned model, which is another factor which should be considered by the measurement.

One way to take the learned model into account is to measure the value of the underlying surface at the highest point of the Gaussian Process mean (µ_best). This measures the quality of the learned model because if the ground truth value of the surface at µ_best is very low, then the GP cannot be modelling the surface very accurately. When we tried this measure, the highest point of the GP mean was often very far away from where any samples had been taken, where the model was a very poor representation of the underlying surface. Users of the algorithm would want to be confident in the point returned from the optimization; thus, we believe the x vector being returned as the optimum should at least have been sampled at.

In order to avoid the problems with the two approaches above, we decided to combine them. This is the third measure we considered: the true value of the underlying surface at the sampled point x ∈ D where µ(x) is the highest over all x ∈ D. This avoids the problem of the second measure by ensuring we have at least sampled at the point we report as the best.
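A minimal sketch of this third measure, assuming access to the ground-truth surface f_true (available for the synthetic test problems) and reusing gp_predict from the Chapter 2 sketch; it is an illustration, not the project's evaluation code.

import numpy as np

def performance_measure(f_true, X_sampled, y_sampled, theta_h, theta_l, theta_noise):
    """True surface value at the *sampled* point where the GP mean is highest."""
    mu, _ = gp_predict(X_sampled, y_sampled, X_sampled, theta_h, theta_l, theta_noise)
    x_reported = X_sampled[np.argmax(mu)]   # only points we have actually sampled qualify
    return f_true(x_reported)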


Chapter 4 Evaluation of Improvements

4.1 Results

The results of the evaluation of Michael's improvements are summarized in Figures 4.1 to 4.3. The results show that both improvements are fairly significant. On the 2D and 3D results there is not much difference between the three algorithms. In these low dimensions, search is quite easy for most algorithms, so we do not expect the improvements to really stand out here. What is important is that they perform no worse than the Vanilla algorithm.

Figure 4.1: Results of the improvements on the 2 and 3 dimensional test functions: (a) d = 2, L2 = 2, L3 = 1; (b) d = 3, L2 = 4, L3 = 1.

The 4D problem shows the Local EI measure doing clearly much better than the other two algorithms; even the standard EI measure does better than the Mean EI measure. This is somewhat contradictory to the first 5D problem, where Mean EI performs far better than the other two algorithms. I think this is a good indication that combining the Local EI and Mean EI measures into one will produce a very efficient optimization algorithm, because each part of this combination should be robust against the deficiencies of the other.

The 6 dimensional results indicate similar performance. The easiest 6D problem shows the Mean EI improvement doing considerably better than the other two algorithms, which have fairly similar performance. The three more difficult 6D problems show the Local EI improvement, again, performing considerably better than the other two. It is also worth noting that on the 4-6 dimensional problems, the Mean EI improvement performs better than both of the other variations on average at the first sample. It would appear that when

few samples have been taken, Mean EI selects better points than the other two algorithms simply by trying to improve the mean of the GP, µ, instead of improving the best sample, y. From these results, it would seem that when the Local EI and standard GPO EI have similar performance, the Mean EI does considerably better. This indicates that for these particular test problems, the EI of y over the best y is not a good measure, since both algorithms are affected in the same way. The Mean EI, which aims to improve the value of the GP mean over the current best GP mean value, is a better measure because if two points have the same or similar EI values, it will choose to sample at the point which is further away from any other samples, leading to more exploratory behaviour.

Figure 4.2: Results of the improvements on the 4 and 5 dimensional test functions: (a) d = 4, L2 = 3, L3 = 1; (b) d = 5, L2 = 4; (c) d = 5, L2 = 4, L3 = 2.

In Figure 4.3(c), all three algorithms seem to dip down between samples 2 and 7. This means that the value of the underlying surface at the sampled point where the GP mean is highest has decreased compared to where it was at sample 2. The reason for this is that the Gaussian Process model is learning a θ_noise value which is larger than it was before: it is explaining the variations in the data as noise because the model is not powerful enough to explain them in any other way. In a real world situation, increasing the noise may or may not be the correct thing to do. If the samples on a surface are known to have no noise whatsoever, then the hyperprior on θ_noise could be set accordingly to ensure the model never incorporates noise. This would eliminate the dipping, since the value of the highest point of the GP mean would always be the value of the best sample so far, because the mean would have to pass exactly through

every sample point. This is not necessarily a problem, because the algorithm is simply exploring different models. However, since the manner in which the next sample is taken depends on the predictive model, it could be beneficial not to consider these noisy models, particularly in the case where the samples are known to be noise-free.

Figure 4.3: Results of the improvements on the 6 dimensional test functions: (a) d = 6, L2 = 1, L3 = 1; (b) d = 6, L2 = 12; (c) d = 6, L2 = 24, L3 = 2; (d) d = 6, L2 = 32, L3 = 2.

Another interesting point is the performance of Mean EI. In every single test problem, the Mean EI algorithm performs the best after the first sample. Mean EI was designed to introduce more exploratory behaviour into the algorithm, so this result is somewhat unexpected; we would have expected Mean EI to show improvement only once more samples had been taken.

The results presented here are consistent with Michael's results [12]. Considering the performance on the easy problems (2D, 3D), there is very little difference between the three algorithms. However, when the difficulty increases, both improvements increase the performance of the algorithm, in different situations. Michael also found the improvements were of more benefit on harder problems, even in 2 dimensions.


Chapter 5 A New Improvement to GPO

Depending on the choice of covariance function, a GP may have difficulty modelling surfaces which have multiple trends. In many cases, the GP will simply model the variations in the y direction as noise, even when the samples are not noisy. GPO can be improved by using a different covariance function.

5.1 Motivation 1

The results in the previous chapter show some cases where more data is added but the optimization appears to perform worse (the dipping problem). The reason for this is that the hyperparameter model is too simple to explain the variations in the data, so the only way the algorithm can explain the data is by increasing the amount of assumed noise in the model. This causes the mean of the GP to be considerably smoother in the y direction, which has the effect of raising the GP mean where it was once very low, and lowering it where it was once very high (see Figure 5.1).

Figure 5.1: The GP model with (a) a small θ_noise value and (b) a large θ_noise value. The black line is the GP mean, the green region is the GP variance and the black dots are the samples.

Within the space of one sample, the predictive distribution can change its beliefs about the underlying surface dramatically. It seems unreasonable to change one's beliefs about the surface from low noise to high noise so drastically in the space of a single sample.
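The effect in Figure 5.1 is easy to reproduce with the gp_predict helper from the Chapter 2 sketch: fitting the same data with a small versus a large assumed θ_noise gives very different predictive means (the data and hyperparameter values below are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(8, 1))
y = np.sin(4 * X[:, 0])                  # noise-free but rapidly varying samples
X_star = np.linspace(-2, 2, 100)[:, None]

# Small assumed noise: the mean passes (almost) exactly through every sample.
mu_small, _ = gp_predict(X, y, X_star, theta_h=1.0, theta_l=np.array([1.0]), theta_noise=1e-6)

# Large assumed noise: the variation is "explained away" and the mean is much smoother.
mu_large, _ = gp_predict(X, y, X_star, theta_h=1.0, theta_l=np.array([1.0]), theta_noise=0.5)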

5.2 Motivation 2

It is possible to take samples from the Gaussian Process prior by taking a set of input points D and a covariance function, producing the covariance matrix for those points and, finally, drawing sample values from the corresponding Gaussian distribution. Examples of such samples are given in Figure 5.2(a). By changing the covariance function, we can change the kinds of samples produced by the Gaussian Process prior. Samples taken from a GP prior with a superposition (sum) of two squared exponential covariance functions are shown in Figure 5.2(b).

Figure 5.2: Samples from a Gaussian Process prior (a) using a squared exponential covariance function, and (b) using a superposition of squared exponential covariance functions.

Looking at the differences between these samples, one could argue that the superposition samples look like more difficult problems to solve than the standard covariance function samples. The superposition samples have more local optima and these are laid out in a more chaotic fashion; these are two of the properties mentioned earlier which make optimization difficult. This shows that if the model has more free hyperparameters to learn, it can model more complex surfaces, which are more representative of difficult optimization problems.

5.3 New Improvement to GPO

This improvement aims to provide more parameters so that the Gaussian Process has more freedom to correctly model the surface. The simplest way of providing these extra parameters is to use a superposition of covariance functions. The superposition (or sum) of two covariance functions is itself a valid covariance function [10]. So instead of a standard squared exponential, the covariance function will have the equation

  cov(x, x′; θ) = θ_h1 exp[ −(1/2) Σ_{i=1..I} (x_i − x′_i)² / θ_l1i ] + θ_h2 exp[ −(1/2) Σ_{i=1..I} (x_i − x′_i)² / θ_l2i ]    (5.1)

This in turn requires a change to the hyperparameter representation. For an n-dimensional problem, θ now consists of 2n length scales, two vertical scales and one noise parameter. This is effectively a doubling in the number of free hyperparameters the algorithm must learn.
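The prior draws in Figure 5.2 and the superposition in Equation 5.1 can be reproduced with a short sketch, reusing sq_exp_cov from the Chapter 2 sketch; the grid and hyperparameter values are illustrative assumptions, not those used to generate the figure.

import numpy as np

def sample_gp_prior(X_grid, cov_fn, n_draws=3, jitter=1e-6, rng=None):
    """Draw functions from a zero-mean GP prior: build the covariance matrix over a grid
    and sample from the corresponding multivariate Gaussian via a Cholesky factor."""
    rng = np.random.default_rng() if rng is None else rng
    K = cov_fn(X_grid, X_grid) + jitter * np.eye(len(X_grid))
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(X_grid), n_draws))).T

X_grid = np.linspace(0, 10, 300)[:, None]

# Single squared exponential (as in Figure 5.2(a))
se = lambda A, B: sq_exp_cov(A, B, theta_h=1.0, theta_l=np.array([4.0]))

# Superposition of two squared exponentials (as in Figure 5.2(b)): a broad trend plus fine structure
superpos = lambda A, B: (sq_exp_cov(A, B, theta_h=1.0, theta_l=np.array([4.0]))
                         + sq_exp_cov(A, B, theta_h=0.2, theta_l=np.array([0.05])))

draws_se = sample_gp_prior(X_grid, se)
draws_superpos = sample_gp_prior(X_grid, superpos)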

Figure 5.3: A superposition of two squared exponential covariance functions, with the dimensions to which the hyperparameters refer labelled. θ_h1 and θ_l1 refer to the height and length scale of the larger squared exponential; θ_h2 and θ_l2 refer to the height and length scale of the smaller squared exponential.

This could result in an algorithm which struggles when few samples have been taken, because there are so many parameters to learn. However, once a decent amount of data has been sampled, the increased number of hyperparameters should provide the ability to form a better model of complex surfaces. The time complexity of the algorithm will also be affected by this change; however, the increased modelling power outweighs this because we are explicitly interested in the case of expensive data, where the cost of the model itself can essentially be ignored.

Combining the superposition covariance function with the Vanilla GPO and Mean EI GPO improvements is a trivial task, because these two algorithms are not constrained to any particular covariance function; they only rely on a covariance function being available. When we combine Local EI GPO with the superposition covariance function, however, some slight adjustments are required. Recall that Local EI GPO takes into account the EI around a particular point by treating the covariance function centred on that point as a distribution, taking a number of samples underneath it, and averaging the EI at those sample points. When using the standard squared exponential covariance function, the distribution the samples are taken under is essentially a Gaussian distribution, which is easy to sample from. In the case of a superposition of squared exponentials, however, sampling is not as easy. In order to take samples under the superposition of squared exponentials, a rejection sampling technique was employed. Rejection sampling is a standard technique that enables us to take samples under a curve

P*(x), where the asterisk denotes an unnormalized probability distribution. First, one finds an easy-to-sample distribution, such as a Gaussian distribution, which we denote Q(x). Then, we need to find a curve Q*(x) = cQ(x), for some constant c, such that Q*(x) ≥ P*(x) for all x. The process is then as follows:

1. Take a sample x_sample from the easy-to-sample distribution, Q.
2. Take a random uniform sample between 0 and Q*(x_sample). Call this value u.
3. If u ≤ P*(x_sample) then we accept the sample x_sample. Otherwise, we reject it.
4. Repeat steps 1 to 3 until the desired number of samples have been collected.

Figure 5.4: Diagram of the rejection sampling technique. The red curve is the distribution that we want samples under (P*) and the green curve is Q*. The vertical line indicates the x position of a sample under Q. A random sample u is then taken on the vertical line between 0 and Q*. If u falls in the cyan section of the line, it is accepted; if u is in the magenta segment, it is rejected.

To sample from the density corresponding to the superposition covariance function, we set P* = cov(d), where d = x − x′ and cov(d) is the superposition of squared exponentials parameterized with some hyperparameters. An easy-to-sample distribution Q is the Gaussian distribution with a variance equal to the larger of the two superposition length scales; this ensures that when we find Q*, it will definitely cover P*. The constant c changes the height of Q to ensure Q* is at least as tall as P*; the height of P* is equal to the sum of the two vertical scale hyperparameters (see Figure 5.3).
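A one-dimensional sketch of this sampler, following the four steps above. The choice of Q (a Gaussian whose variance is the larger length scale) and of c (scaling Q so that Q* reaches the height θ_h1 + θ_h2 of P* at d = 0) follows the description in this section; the concrete hyperparameter values are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def superposition_cov_1d(d, th_h1, th_l1, th_h2, th_l2):
    """Unnormalized target P*(d): superposition of two squared exponentials in the distance d."""
    return th_h1 * np.exp(-0.5 * d ** 2 / th_l1) + th_h2 * np.exp(-0.5 * d ** 2 / th_l2)

def rejection_sample(n, th_h1, th_l1, th_h2, th_l2, rng=None):
    """Draw n samples under the superposition covariance curve by rejection sampling."""
    rng = np.random.default_rng() if rng is None else rng
    var_q = max(th_l1, th_l2)                  # proposal variance = larger length scale
    # Scale Q so that Q*(d) = c*Q(d) dominates P*; P* peaks at th_h1 + th_h2 at d = 0.
    c = (th_h1 + th_h2) / norm.pdf(0.0, scale=np.sqrt(var_q))
    samples = []
    while len(samples) < n:
        d = rng.normal(0.0, np.sqrt(var_q))                           # step 1: sample from Q
        u = rng.uniform(0.0, c * norm.pdf(d, scale=np.sqrt(var_q)))   # step 2: uniform under Q*
        if u <= superposition_cov_1d(d, th_h1, th_l1, th_h2, th_l2):  # step 3: accept or reject
            samples.append(d)
    return np.array(samples)

# e.g. offsets around a point x, for estimating Local EI under the superposition covariance
offsets = rejection_sample(25, th_h1=1.0, th_l1=4.0, th_h2=0.2, th_l2=0.05)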

5.4 Evaluation

This section compares and evaluates the performance of the superposition covariance function against the standard squared exponential covariance function when used with Vanilla GPO. The results in this case are very positive towards the superposition covariance function: in every case, the superposition outperforms the Vanilla GPO algorithm. It is especially convincing on the 4, 5 and easier 6 dimensional problems.

Figure 5.5: Results comparing the standard covariance function against the superposition covariance function when used with the Vanilla GPO EI on the 2 and 3 dimensional problems: (a) d = 2, L2 = 2, L3 = 1; (b) d = 3, L2 = 4, L3 = 1.

The 2 and 3 dimensional problems show the superposition performing slightly better than the standard covariance function. In the 3 dimensional problem, the superposition appears to perform much better when very few samples have been taken. This is quite different from the behaviour we were expecting.

The 4 and 5 dimensional problems are where the superposition really shows itself to be a strong improvement over the Vanilla algorithm. It finds points on the surface which are far superior to those found by the standard covariance function. As with the 3D case, the superposition performs better from very early on in the optimization process.

In Figure 5.7(c), the superposition does not exhibit the same dipping problem that the single squared exponential is susceptible to. The superposition curve does not rise as steeply as the squared exponential curve because there are more hyperparameters that must be learned; learning so many hyperparameters accurately with a limited amount of data is quite difficult. However, the superposition climbs consistently and reasonably quickly, clearly showing that the extra parameters are not a huge burden on the algorithm.

The most difficult problem, LTF6D322, shows an example where the extra hyperparameters are quite difficult for the algorithm to learn. However, once it finds a good set of hyperparameters, it ultimately ends up explaining the surface much better than the standard algorithm and finding a higher point. This is the behaviour we expected from the algorithm (since there are more hyperparameters to learn), so it is surprising that it only occurred in one of the test problems.


More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Gaussian process for nonstationary time series prediction

Gaussian process for nonstationary time series prediction Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Soane Brahim-Belhouari, Amine Bermak EEE Department, Hong

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Nonparametric Regression With Gaussian Processes

Nonparametric Regression With Gaussian Processes Nonparametric Regression With Gaussian Processes From Chap. 45, Information Theory, Inference and Learning Algorithms, D. J. C. McKay Presented by Micha Elsner Nonparametric Regression With Gaussian Processes

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Prediction of Data with help of the Gaussian Process Method

Prediction of Data with help of the Gaussian Process Method of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

Gaussian with mean ( µ ) and standard deviation ( σ)

Gaussian with mean ( µ ) and standard deviation ( σ) Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Design and Optimization of Energy Systems Prof. C. Balaji Department of Mechanical Engineering Indian Institute of Technology, Madras

Design and Optimization of Energy Systems Prof. C. Balaji Department of Mechanical Engineering Indian Institute of Technology, Madras Design and Optimization of Energy Systems Prof. C. Balaji Department of Mechanical Engineering Indian Institute of Technology, Madras Lecture - 09 Newton-Raphson Method Contd We will continue with our

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search

More information

Logistic Regression Logistic

Logistic Regression Logistic Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Rafdord M. Neal and Jianguo Zhang Presented by Jiwen Li Feb 2, 2006 Outline Bayesian view of feature

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Predicting AGI: What can we say when we know so little?

Predicting AGI: What can we say when we know so little? Predicting AGI: What can we say when we know so little? Fallenstein, Benja Mennen, Alex December 2, 2013 (Working Paper) 1 Time to taxi Our situation now looks fairly similar to our situation 20 years

More information

SYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I

SYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I SYDE 372 Introduction to Pattern Recognition Probability Measures for Classification: Part I Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 Why use probability

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer

Gaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation Vu Malbasa and Slobodan Vucetic Abstract Resource-constrained data mining introduces many constraints when learning from

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

Fundamentals of Metaheuristics

Fundamentals of Metaheuristics Fundamentals of Metaheuristics Part I - Basic concepts and Single-State Methods A seminar for Neural Networks Simone Scardapane Academic year 2012-2013 ABOUT THIS SEMINAR The seminar is divided in three

More information

Using Gaussian Processes to Optimize Expensive Functions.

Using Gaussian Processes to Optimize Expensive Functions. Using Gaussian Processes to Optimize Expensive Functions. Marcus Frean and Phillip Boyle Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand marcus@mcs.vuw.ac.nz http://www.mcs.vuw.ac.nz/

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

Pengju

Pengju Introduction to AI Chapter04 Beyond Classical Search Pengju Ren@IAIR Outline Steepest Descent (Hill-climbing) Simulated Annealing Evolutionary Computation Non-deterministic Actions And-OR search Partial

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Please bring the task to your first physics lesson and hand it to the teacher.

Please bring the task to your first physics lesson and hand it to the teacher. Pre-enrolment task for 2014 entry Physics Why do I need to complete a pre-enrolment task? This bridging pack serves a number of purposes. It gives you practice in some of the important skills you will

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

DD Advanced Machine Learning

DD Advanced Machine Learning Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Parameter estimation Conditional risk

Parameter estimation Conditional risk Parameter estimation Conditional risk Formalizing the problem Specify random variables we care about e.g., Commute Time e.g., Heights of buildings in a city We might then pick a particular distribution

More information

Probabilistic numerics for deep learning

Probabilistic numerics for deep learning Presenter: Shijia Wang Department of Engineering Science, University of Oxford rning (RLSS) Summer School, Montreal 2017 Outline 1 Introduction Probabilistic Numerics 2 Components Probabilistic modeling

More information