CS 7140: Advanced Machine Learning
Lecture 3: Gaussian Processes (17 Jan, 2018)
Instructor: Jan-Willem van de Meent
Scribes: Mo Han, Guillem Reus Muns, Somanshu Singh

1 Gaussian Processes

In ridge regression, our goal is to compute a point estimate, which is the expected value of the posterior predictive distribution given previous observations. When we additionally want to obtain a confidence interval for predictions, we can use Gaussian processes, which compute the full posterior predictive distribution p(f* | y) for a new input x*, using previous inputs x_{1:N} and labels y, where f(x*) is the function value at the input x*.

1.1 Formal View: Non-parametric Distribution on Functions

In a prediction problem, the new input x* is eventually a known constant value or vector. But from a formal point of view, before we receive the new input x*, we only have the previous inputs and outputs and do not know its exact value; the new input is a variable, and the prediction is a function of this variable. So our goal here is to calculate a non-parametric distribution p(f | y) on functions f(x*), which have infinite degrees of freedom. Furthermore, for each new input, we can evaluate a new posterior distribution based on previous data. The posterior distribution p(f | y) can be defined using Bayes' rule,

    p(f | y) = p(y | f) p(f) / p(y).    (1)

In Gaussian processes, we assume the prior to be a Gaussian distribution over functions of the input x,

    f ~ GP(µ(x), k(x, x')),    (2)

where µ(x) is a mean function and k(x, x') is a covariance function. The likelihood of the observed labels is given by

    y_n | f ~ Norm(f(x_n), σ^2).    (3)

In many machine learning applications, knowledge about the true underlying mechanism behind the data-generating process is limited. Instead, one relies on generic smoothness assumptions; for example, we might wish that for two inputs x and x' that are close, the corresponding outputs y and y' should be similar. Many generic techniques in machine learning can be viewed as encoding different characterizations of smoothness.
The kernel function, used as the covariance, encodes this smoothness: the closer two inputs are, the larger the value of the covariance function, and the more similar their corresponding outputs.
1.2 Practical View: Generalization of the Multivariate Normal

From a formal point of view, the function value f is a random variable. In the absence of further assumptions, a function on R^n has an uncountably infinite number of degrees of freedom. However, in practice we will only ever need to reason about the function values at a finite set of inputs. For any set of inputs X = [x_1, ..., x_N], a Gaussian process defines a joint distribution on function values that is a multivariate Gaussian,

    f ~ Norm(µ(X), k(X, X)).    (4)

Here we use f and µ(X) as shorthands for the vectors of function values

    f := (f(x_1), ..., f(x_N)),    (5)
    µ(X) := (µ(x_1), ..., µ(x_N)),    (6)

and k(X, X) is similarly a shorthand for the covariance matrix

    k(X, X) := [ k(x_1, x_1) ... k(x_1, x_N) ;
                 ...
                 k(x_N, x_1) ... k(x_N, x_N) ].    (7)

1.3 Regression with the Predictive Distribution

Suppose that we have observed a vector of values y = (y_1, ..., y_N) distributed as

    y | f ~ Norm(f, σ^2 I).    (8)

The goal of Gaussian process regression is to reason about the function values f* := (f(x*_1), ..., f(x*_M)) at some new set of inputs X* = [x*_1, ..., x*_M]. The predictive distribution on f* is

    p(f* | y) = p(y, f*) / p(y) = ( integral p(y | f) p(f*, f) df ) / p(y).    (9)

In this equation, the joint distribution p(f, f*) is a multivariate Gaussian on the function values at N + M points, with mean and covariance matrix

    [f; f*] ~ Norm( [µ(X); µ(X*)], [k(X, X), k(X, X*); k(X*, X), k(X*, X*)] ),    (10)

where the notation k(X, X*) is used to refer to the N × M (sub-)matrix

    k(X, X*) := [ k(x_1, x*_1) ... k(x_1, x*_M) ;
                  ...
                  k(x_N, x*_1) ... k(x_N, x*_M) ],    (11)

and k(X*, X) and k(X*, X*) are defined analogously as M × N and M × M (sub-)matrices. When we assume Gaussian distributed additive noise,

    y = f + ε,   ε ~ Norm(0, σ^2 I),    (12)
the distribution on y is once again a multivariate Gaussian,

    y ~ Norm(µ(X), k(X, X) + σ^2 I).    (13)

If we substitute y for f in (10), then we see that the joint distribution p(y, f*) is

    [y; f*] ~ Norm( [µ(X); µ(X*)], [k(X, X) + σ^2 I, k(X, X*); k(X*, X), k(X*, X*)] ).    (14)

Given the joint distribution p(y, f*), we can make use of standard Gaussian identities, which state that for any joint distribution of the form

    p(α, β) = Norm( [α; β]; [a; b], [A, C; C^T, B] ),    (15)

the conditional distribution is once more a multivariate normal, with mean and covariance

    p(α | β) = Norm( α; a + C B^{-1} (β - b), A - C B^{-1} C^T ),    (16)

and the marginal distributions are likewise multivariate Gaussians of the form

    p(α) = Norm(α; a, A),   p(β) = Norm(β; b, B).    (17)

When we substitute the form of the joint (14) into these identities, we obtain a predictive distribution in which the predictive mean and covariance are

    f* | y ~ Norm( f*; µ*(X*), k*(X*, X*) ),    (18)
    µ*(X*) = µ(X*) + k(X*, X) (k(X, X) + σ^2 I)^{-1} (y - µ(X)),    (19)
    k*(X*, X*) = k(X*, X*) - k(X*, X) (k(X, X) + σ^2 I)^{-1} k(X, X*).    (20)

In practice we can always pre-process our data into a detrended form by defining y' = y - µ(X) and f'(x) = f(x) - µ(x). For this reason, software implementations of Gaussian processes typically assume a zero-mean distribution,

    [y; f*] ~ Norm( 0, [k(X, X) + σ^2 I, k(X, X*); k(X*, X), k(X*, X*)] ),    (21)

for which the predictive mean simplifies to

    µ*(X*) = k(X*, X) (k(X, X) + σ^2 I)^{-1} y.    (22)

When we want to perform regression with a non-zero mean, we can simply pre-process the data to compute y', perform regression for the zero-mean function values f'*, and finally add the mean back in to define the non-zero-mean function values f* := f'* + µ(X*).
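The zero-mean predictive equations (20) and (22) translate almost line by line into code. Below is a minimal numpy sketch; the squared-exponential kernel, the toy sine data, and all parameter values are illustrative assumptions, not part of the notes:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X, y, X_star, sigma=0.1, length_scale=1.0):
    """Zero-mean GP predictive mean and covariance, eqs. (22) and (20)."""
    K = rbf_kernel(X, X, length_scale)               # k(X, X)
    K_s = rbf_kernel(X_star, X, length_scale)        # k(X*, X)
    K_ss = rbf_kernel(X_star, X_star, length_scale)  # k(X*, X*)
    # Solve (k(X, X) + sigma^2 I)^{-1} y via a Cholesky factor for stability
    L = np.linalg.cholesky(K + sigma**2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha                               # predictive mean, eq. (22)
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v                             # predictive covariance, eq. (20)
    return mean, cov

# Toy data: noisy observations of sin(x)
X = np.linspace(0, 5, 10)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.default_rng(0).normal(size=10)
mean, cov = gp_predict(X, y, np.linspace(0, 5, 50)[:, None])
```

The diagonal of `cov` gives the predictive variance at each test point, which is exactly what a confidence interval around the point estimate requires.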
2 Relationship to Kernel Ridge Regression

In the previous set of lecture notes, we considered kernel ridge regression, which computes the expected function value f̄* := f(x*) at a previously unseen input x*, conditioned on previously observed values y,

    f̄* = E[f* | Y = y],    (23)

under the assumption of a linear regression model with a Gaussian prior on the weights,

    y = f + ε,   f = sum_{d=1}^D w_d φ_d(X),   w ~ Norm(0, s^2 I),   ε ~ Norm(0, σ^2 I).    (24)

In the normal formulation of kernel ridge regression, we define a kernel function

    k(x_i, x_j) = φ(x_i)^T φ(x_j),    (25)

and compute the expected function value in terms of the kernel evaluations,

    f̄* = k(x*, X) ( k(X, X) + (σ^2 / s^2) I )^{-1} y.    (26)

Note that this expression is identical to the expression for the predictive mean in (22), with two minor distinctions. The first is that the expression above predicts a single function value f̄* = f(x*), rather than a vector of function values f̄* = f̄(X*). The second is that the expression above contains a regularization constant λ = σ^2 / s^2, whereas equation (22) only contains the term σ^2. The reason for this is that in Gaussian process regression, we implicitly absorb the variance of the weights into the definition of the kernel function. To see what we mean by "absorb", let us consider the more general case of a prior on the weights with a full covariance matrix S, rather than the diagonal covariance s^2 I,

    w ~ Norm(0, S).    (27)

For this prior, the predictive mean in ridge regression is

    f̄* = φ(x*)^T S φ(X) ( φ(X)^T S φ(X) + σ^2 I )^{-1} y.    (28)

Note that this expression reduces to the one in equation (26) when S = s^2 I. We now also see that when we define the kernel

    k'(x_i, x_j) = φ(x_i)^T S φ(x_j) = sum_{d,e=1}^D φ_d(x_i) S_{de} φ_e(x_j),    (29)

we can rewrite equation (28) as

    f̄* = k'(x*, X) ( k'(X, X) + σ^2 I )^{-1} y,    (30)

which recovers the predictive mean in equation (22).
In other words, kernel ridge regression is equivalent to computing the mean in Gaussian process regression, when we employ a kernel of the form in equation (29).
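This equivalence can be checked numerically: ridge regression with an explicit feature map and regularizer σ^2 / s^2 (eq. (26)) yields exactly the same prediction as the GP mean (eq. (30)) with the weight variance absorbed into the kernel. The polynomial features and toy data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=8)
y = np.sin(2 * X) + 0.1 * rng.normal(size=8)
x_star = 0.3
s2, sigma2 = 2.0, 0.01               # weight variance s^2 and noise variance

# Explicit feature map: polynomial features phi(x) = (1, x, x^2, x^3)
def phi(x):
    return np.stack([np.ones_like(x), x, x**2, x**3])

# Kernel ridge regression, eq. (26): k = phi^T phi, regularizer sigma^2 / s^2
K = phi(X).T @ phi(X)
k_star = phi(np.atleast_1d(x_star)).T @ phi(X)
f_ridge = k_star @ np.linalg.solve(K + (sigma2 / s2) * np.eye(8), y)

# GP predictive mean, eq. (30), with the variance absorbed: k' = s^2 phi^T phi
K2 = s2 * K
k2_star = s2 * k_star
f_gp = k2_star @ np.linalg.solve(K2 + sigma2 * np.eye(8), y)

assert np.allclose(f_ridge, f_gp)    # identical predictions
```

Algebraically, multiplying the kernel by s^2 and dividing the regularizer by s^2 cancel exactly, which is why the two predictions agree to machine precision.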
[Figure 1: Estimated functions for different values of λ, going from small values (green curves) to large values (red curves).]

The relationship between Gaussian processes and kernel ridge regression has consequences for how we should think about regularization in both of these methods. As explained in the previous lecture, λ is the parameter that tunes regularization, making the estimated functions more or less smooth. Taking Fig. 1 as an example, we can easily see how high values of λ (strong regularization) considerably smooth the estimated functions, whereas the exact opposite is observed for small values of λ (weak regularization). In the examples of regression considered previously in class, the parameter λ was tuned explicitly via

    λ = σ^2 / s^2.    (31)

However, once any kernel-based regression is applied, the kernel function that is employed intrinsically implies a choice of λ, with the weight covariance S playing the role of s^2 I. Looking back at Eq. (29), we see that choosing the kernel implies choosing S and, in turn, choosing λ. This means that when we are performing kernelized regression with respect to some explicit set of features φ_d(x), we need to define a kernel k(x, x') = s^2 φ(x)^T φ(x') in order to obtain results that are fully equivalent to performing ridge regression. More generally, any choice of kernel in Gaussian process regression implies regularization in the form of smoothness assumptions that are encoded by the kernel.

3 Kernel Hyperparameters

The degree of smoothness imposed by a kernel is controlled by the kernel hyperparameters. A couple of examples of different kernels will be presented in order to show how the choice of parameters affects smoothness.

Example: fitting the length-scale parameter. The squared-exponential kernel is defined by the expression

    k(x, x') = exp( -(x - x')^2 / (2 l^2) ).    (33)

[Figure 2: Squared-exponential kernel for different values of l (slide from Rasmussen, "GP Marginal Likelihood and Hyperparameters", October 2016). The mean posterior predictive function is plotted for three length scales: too long, about right, and too short, where the "about right" curve corresponds to optimizing the marginal likelihood. An almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this.]

Fig. 2 shows different estimated functions for different values of l. We can see how the chosen kernel impacts the regularization of the model. For this particular kernel, large values of l mean stronger regularization.

The Matérn kernel is defined by the expression

    k_ν(x, x') = σ_f^2 (2^{1-ν} / Γ(ν)) ( sqrt(2ν) ||x - x'|| / l )^ν K_ν( sqrt(2ν) ||x - x'|| / l ),   ν = 1/2, 3/2, 5/2, ...,    (34)

where Γ(ν) and K_ν are the Gamma and modified Bessel functions, respectively.

[Figure 3: Matérn kernel for different values of l; the effect of changing the length scale for the Matérn-5/2 kernel.]

Observing Fig. 3, we can see again that the kernel function is the one responsible for the regularization. The figure shows that the qualitative behavior of the GP changes substantially with changes to the kernel.
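As a small sketch, both kernels above can be implemented directly; for the Matérn kernel we use the well-known closed form of the ν = 5/2 special case rather than the general Gamma/Bessel expression (the function names and parameter values are our own):

```python
import numpy as np

def squared_exponential(r, l=1.0):
    """Eq. (33), written as a function of the distance r = |x - x'|."""
    return np.exp(-r**2 / (2 * l**2))

def matern52(r, l=1.0, sigma_f=1.0):
    """The nu = 5/2 special case of the Matern kernel (34), in closed form."""
    z = np.sqrt(5.0) * r / l
    return sigma_f**2 * (1.0 + z + z**2 / 3.0) * np.exp(-z)

# A longer length scale keeps the covariance between two fixed inputs higher,
# which corresponds to stronger smoothing (more regularization):
for k in (squared_exponential, matern52):
    assert k(1.0, l=2.0) > k(1.0, l=0.5)
    assert np.isclose(k(0.0), 1.0)   # covariance of a point with itself
```

The assertions mirror the discussion of Figs. 2 and 3: increasing l raises the correlation between distant inputs, so the posterior mean varies more slowly.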
4 Basic Kernels and Combinations of Kernels

Depending on the problem that is being tackled, it may be of interest to use a kernel whose properties reflect our prior knowledge about the problem domain; the choice of kernel is therefore critical to the performance of GPs. Although there is work that examines methods for learning the kernel directly [Janz et al., 2016; Duvenaud et al., 2013; Lloyd et al., 2014; Wilson et al., 2014], this is typically too computationally intensive to carry out in practice. One must therefore either use prior knowledge about the problem, or use a relatively general-purpose kernel to ensure that the nuances of the data are not overwhelmed. See Fig. 4 for a few examples of basic kernels and their main properties. For a particular problem, we may want to specify function behavior using a combination of kernels. Kernel functions have a set of properties that allow combining them in multiple ways. The most important ones are:
[Figure 4: Examples of basic kernels (Duvenaud, 2014, Figure 2.1): the squared-exponential (SE), periodic (Per), and linear (Lin) kernels, with k(x, x') = σ_f^2 exp(-(x - x')^2 / (2 l^2)), σ_f^2 exp(-(2 / l^2) sin^2(π |x - x'| / p)), and σ_f^2 (x - c)(x' - c) respectively, together with functions sampled from the corresponding GP priors. SE encodes local variation, Per repeating structure, and Lin linear functions. Each covariance function corresponds to a different set of assumptions about the function we wish to model; for example, the SE kernel implies that the function has infinitely many derivatives. The kernel parameters, sometimes referred to as hyper-parameters, specify the precise shape of the covariance function.]

- Sum: k(x, x') = k_1(x, x') + k_2(x, x')
- Product: k(x, x') = k_1(x, x') k_2(x, x')
- Product spaces: for any z = (x, y), k(z, z') = k_1(x, x') + k_2(y, y') and k(z, z') = k_1(x, x') k_2(y, y')
- Vertical rescaling: k(x, x') = a(x) k_1(x, x') a(x'), for any function a(x)

In Fig. 5, different kernel combinations and the properties they produce can be observed.
[Figure 5: Examples of one-dimensional structures expressible by multiplying kernels (Duvenaud, 2014, Figure 2.2): Lin × Lin gives quadratic functions, SE × Per gives locally periodic structure, Lin × SE gives functions with linearly increasing variation, and Lin × Per gives periodic functions with growing amplitude.]

Multiplying two positive-definite kernels together always results in another positive-definite kernel. Working with kernels, rather than with the parametric form of the function itself, allows us to express high-level properties of functions that do not necessarily have a simple parametric form. A few examples:

- Polynomial regression: multiplying together T linear kernels gives a prior on polynomials of degree T; the first column of Fig. 5 shows the quadratic case.
- Locally periodic functions: in univariate data, multiplying a kernel by SE gives a way of converting global structure to local structure. For example, Per corresponds to exactly periodic structure, whereas Per × SE corresponds to locally periodic structure.
- Functions with growing amplitude: multiplying by a linear kernel means that the marginal standard deviation of the function being modeled grows linearly.
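A brief sketch of these combination rules (the kernel forms follow Fig. 4; the parameter choices and grid are illustrative): sums and products of positive-definite kernels are again positive definite, which we can verify numerically on a Gram matrix:

```python
import numpy as np

# Basic kernels on scalar inputs (forms as in Fig. 4)
def se(x, y, l=1.0):
    return np.exp(-(x - y)**2 / (2 * l**2))

def per(x, y, p=1.0):
    return np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p)**2)

def lin(x, y, c=0.0):
    return (x - c) * (y - c)

# Combinations preserve positive definiteness:
def locally_periodic(x, y):   # Per x SE
    return per(x, y) * se(x, y, l=3.0)

def quadratic(x, y):          # Lin x Lin: prior on quadratics
    return lin(x, y) * lin(x, y)

def sum_kernel(x, y):         # SE + Per
    return se(x, y) + per(x, y)

# A Gram matrix built from a product of PD kernels is still PD
xs = np.linspace(0, 4, 20)
K = np.array([[locally_periodic(a, b) for b in xs] for a in xs])
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-9   # no negative eigenvalues (up to round-off)
```

Sampling from Norm(0, K) for each of these Gram matrices reproduces the qualitative behavior in Fig. 5: quadratics, locally periodic curves, and so on.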
5 Inner Product

To motivate the concept of an inner product, think of vectors in R^2 and R^3 as arrows with initial point at the origin. The length of a vector x in R^2 or R^3 is called the norm of x, denoted by ||x||. Thus for x = (x_1, x_2) in R^2, we have ||x|| = sqrt(x_1^2 + x_2^2). Similarly, if x = (x_1, x_2, x_3) in R^3, then ||x|| = sqrt(x_1^2 + x_2^2 + x_3^2). More generally, we define the norm of x = (x_1, ..., x_n) in R^n by

    ||x|| = sqrt(x_1^2 + ... + x_n^2).    (35)

The norm is not linear on R^n. To inject linearity into the discussion, we introduce the dot product.

Definition 1. For x, y in R^n, the dot product of x and y, denoted x · y, is defined by

    x · y := x_1 y_1 + ... + x_n y_n,    (36)

where x := (x_1, ..., x_n) and y := (y_1, ..., y_n). The dot product on R^n has the following properties:

- x · x >= 0 for all x in R^n;
- x · x = 0 if and only if x = 0;
- for y in R^n fixed, the map from R^n to R that sends x in R^n to x · y is linear;
- x · y = y · x for all x, y in R^n.

An inner product is a generalization of the dot product to all kinds of vector spaces, not just real vector spaces.

Definition 2. An inner product on V is a function that maps each ordered pair (u, v) of elements of V to a scalar <u, v> in F and has the following properties:

- positivity: <v, v> >= 0 for all v in V;
- definiteness: <v, v> = 0 if and only if v = 0;
- additivity in the first slot: <u + v, w> = <u, w> + <v, w> for all u, v, w in V;
- homogeneity in the first slot: <λu, v> = λ<u, v> for all λ in F and all u, v in V;
- conjugate symmetry: <u, v> equals the complex conjugate of <v, u> for all u, v in V;

where V is any vector space and F is any scalar field (real or complex).
5.1 Examples: Inner Product

1. The Euclidean inner product on F^n is defined by

    <(w_1, ..., w_n), (z_1, ..., z_n)> = w_1 z̄_1 + ... + w_n z̄_n.    (37)

2. An inner product can be defined on the vector space of continuous real-valued functions on the interval [a, b] by

    <f, g> = integral_a^b f(x) g(x) dx.    (38)

An inner product space is a vector space on which an inner product is defined.

6 Hilbert Spaces and Kernels

6.1 Hilbert Spaces

Any vector space V with an inner product <·, ·> defines a norm ||v|| = sqrt(<v, v>). Similarly, it also defines a metric, d(x, y) = ||x - y||. A sequence of elements {v_n} in V is called a Cauchy sequence if, for every positive real number ε, there is a positive integer N such that for all m, n > N,

    ||v_m - v_n|| < ε.    (39)

An inner product space is also called a pre-Hilbert space.

Definition 3. An inner product space H is called a Hilbert space if it is a complete metric space, i.e., if {h_n} is a Cauchy sequence in H, then there exists h in H with

    ||h - h_n|| → 0 as n → ∞.    (40)

6.2 Kernels

Definition 4. A function k : X × X → R is a kernel if there is a Hilbert space H and a function φ : X → H such that for all x, x' in X,

    k(x, x') = <φ(x), φ(x')>_H.    (41)

Definition 5. Let K denote the scalar field of either the real or the complex numbers. Let K^N be the set of all sequences of scalars

    (x_n)_{n in N},   x_n in K.    (42)

A sequence space is defined as any linear subspace of K^N for which vector addition and scalar multiplication are defined elementwise,

    (x_n)_{n in N} + (y_n)_{n in N} := (x_n + y_n)_{n in N},    (43)
    α (x_n)_{n in N} := (α x_n)_{n in N}.    (44)

Definition 6. l^2 is the subspace of K^N consisting of all sequences x = (x_n) satisfying

    sum_{n=1}^∞ |x_n|^2 < ∞.    (45)

Given a sequence {φ_d(x)}_{d >= 1} in l^2, where φ_d : X → R is the d-th coordinate,

    k(x, x') := sum_{d=1}^∞ φ_d(x) φ_d(x') = <φ(x), φ(x')>_{l^2}.    (46)
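To make eq. (46) concrete, here is a toy example of our own choosing (not from the notes): the feature sequence φ_d(x) = x^d / sqrt(d!), d = 0, 1, 2, ..., lies in l^2 and induces the kernel k(x, x') = exp(x x'), since the sum becomes the Taylor series of the exponential. A truncated sum recovers the kernel to high accuracy:

```python
import math
import numpy as np

# Feature sequence phi_d(x) = x^d / sqrt(d!), d = 0, 1, 2, ...
# Eq. (46) gives k(x, x') = sum_d phi_d(x) phi_d(x') = sum_d (x x')^d / d! = exp(x x'),
# so the exponential kernel corresponds to an infinite-dimensional feature map in l^2.
def phi(x, D=30):
    return np.array([x**d / math.sqrt(math.factorial(d)) for d in range(D)])

def k_truncated(x, x_prime, D=30):
    """Truncate the infinite sum in eq. (46) after D terms."""
    return phi(x, D) @ phi(x_prime, D)

x, x_prime = 0.7, -1.2
assert abs(k_truncated(x, x_prime) - math.exp(x * x_prime)) < 1e-10
```

This illustrates the general pattern: a kernel can encode an inner product in a feature space of infinite dimension without ever materializing the features.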
6.3 Positive Definiteness

Theorem 1. If H is a Hilbert space, X is a non-empty set, and φ : X → H is a feature map, then

    k(x, x') := <φ(x), φ(x')>_H    (47)

is positive definite.

6.4 Reproducing Kernels

Definition 7. Suppose that H is a Hilbert space of functions f : X → R. Then H is a Reproducing Kernel Hilbert Space (RKHS) when

    k(·, x) in H, for all x in X,    (48)
    <f(·), k(·, x)>_H = f(x).    (49)

Property (49) is called the reproducing property (or the kernel trick).

Theorem 2 (Moore-Aronszajn). Suppose k is a symmetric, positive definite kernel on a set X. Then there is a unique Hilbert space of functions on X for which k is a reproducing kernel.

6.5 Function Space Equivalence Classes

There is an equivalence relation between reproducing kernels, positive definite functions, and Hilbert function spaces with bounded point evaluation. Every reproducing kernel is also a positive definite function and corresponds to a Hilbert function space with bounded point evaluation.

6.6 Example: XOR

We can see what a reproducing kernel Hilbert space is using a simple XOR example. Let us consider a feature map

    φ : R^2 → R^3,    (50)

defined on inputs

    x = [x_1; x_2]    (51)

by

    φ(x) = [x_1; x_2; x_1 x_2],    (52)
with the kernel defined as

    k(x, y) = [x_1, x_2, x_1 x_2] [y_1; y_2; y_1 y_2] = <φ(x), φ(y)>_{R^3}.    (53)

This feature space is a Hilbert space H because an inner product is defined on it, given by the dot product. Let us now define a function of the features x_1, x_2, x_1 x_2 of x as

    f(x) = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 = <w, φ(x)>_{R^3}.    (54)

This function is a member of a space of functions mapping from R^2 to R. We can define an equivalent representation for f,

    f(·) = [w_1; w_2; w_3],    (55)

where the notation f(·) refers to the function itself, and the notation f(x) in R refers to the function evaluated at a particular point. Then we can write

    f(x) = f(·)^T φ(x),    (56)
    f(x) := <f(·), φ(x)>_H.    (57)

In other words, the evaluation of f at x can be written as an inner product in feature space. Moreover, we can express the kernel function in terms of the feature map using the same convention,

    k(·, y) = [y_1; y_2; y_1 y_2] = φ(y),    (58)

where w_1 = y_1, w_2 = y_2, and w_3 = y_1 y_2. Due to symmetry, we can also write

    <k(·, x), φ(y)> = x_1 y_1 + x_2 y_2 + x_1 x_2 y_1 y_2 = k(x, y).    (59)

In other words, φ(x) = k(·, x) and φ(y) = k(·, y). This way of writing the feature map is called the canonical feature map.

Reproducing property: f(x) = <w, φ(x)>_{R^3} = <f(·), φ(x)>_H, with φ(x) = k(·, x).

Here, we use positive definite kernels to define functions on X. The space of such functions is known as a reproducing kernel Hilbert space.

6.7 Implications of RKHS for Gaussian Processes

Suppose that we have a random function f that is distributed according to a Gaussian process prior,

    f ~ GP(0, k(x, x')).    (60)

Since k(x, x') is a positive definite kernel, f lies in an RKHS. To see this, note that we can express the GP posterior as

    f | y ~ GP(µ(x), k'(x, x')),    (61)
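The XOR feature map and its kernel can be written out directly. The sketch below (with our own variable names) checks that the third feature x_1 x_2 separates the XOR classes, and that evaluating f is an inner product with the representation w = f(·), as in eqs. (54)-(57):

```python
import numpy as np

def phi(x):
    """Feature map of eq. (52): (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """Kernel of eq. (53), as an inner product in R^3."""
    return phi(x) @ phi(y)

# The XOR points are not linearly separable in R^2, but the third
# feature x1*x2 is +1 on one class and -1 on the other:
pos = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
neg = [np.array([1.0, -1.0]), np.array([-1.0, 1.0])]
assert all(phi(x)[2] == 1.0 for x in pos)
assert all(phi(x)[2] == -1.0 for x in neg)

# Evaluation as an inner product, eq. (57): f(x) = <f(.), phi(x)> with f(.) = w
w = np.array([0.0, 0.0, 1.0])         # represents f(x) = x1 * x2
x = np.array([0.5, -2.0])
assert np.isclose(w @ phi(x), x[0] * x[1])
```

A linear classifier on the three features therefore solves XOR, even though no linear classifier on the two raw inputs can.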
where the posterior mean is defined as

    µ(·) = k(·, X) ( k(X, X) + σ^2 I )^{-1} y    (62)
         = sum_{n,m} k(·, x_n) ( k(X, X) + σ^2 I )^{-1}_{nm} y_m.    (63)

In other words, we can express the mean function of the posterior as

    µ(·) = sum_n k(·, x_n) w_n,    (64)
    w_n = sum_m ( k(X, X) + σ^2 I )^{-1}_{nm} y_m.    (65)

In summary, the choice of kernel in a Gaussian process implies a choice of RKHS, which constrains GP regression to solutions that lie inside this RKHS.
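A minimal sketch of eqs. (62)-(65): the posterior mean is a weighted sum of kernel functions centered at the training inputs, and this "representer" form agrees with the matrix form. The kernel, data, and noise level below are illustrative assumptions:

```python
import numpy as np

def k(a, b, l=1.0):
    """Squared-exponential kernel, eq. (33)."""
    return np.exp(-(a - b)**2 / (2 * l**2))

rng = np.random.default_rng(0)
X = np.linspace(0, 3, 6)
y = np.cos(X) + 0.05 * rng.normal(size=6)
sigma2 = 0.01

# Weights of eq. (65): w = (k(X, X) + sigma^2 I)^{-1} y
K = k(X[:, None], X[None, :])
w = np.linalg.solve(K + sigma2 * np.eye(6), y)

# Posterior mean of eq. (64): a weighted sum of kernels centered at the data
def mu(x_star):
    return sum(w_n * k(x_star, x_n) for w_n, x_n in zip(w, X))

# This agrees with the matrix form of eq. (62)
x_star = 1.7
mean_matrix = k(x_star, X) @ w
assert np.isclose(mu(x_star), mean_matrix)
```

Since µ(·) is a finite linear combination of the functions k(·, x_n), it is manifestly an element of the RKHS induced by k, which is the point of this section.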
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationSupervised Learning Coursework
Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session
More informationMTTTS16 Learning from Multiple Sources
MTTTS16 Learning from Multiple Sources 5 ECTS credits Autumn 2018, University of Tampere Lecturer: Jaakko Peltonen Lecture 6: Multitask learning with kernel methods and nonparametric models On this lecture:
More information10-701/ Recitation : Kernels
10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationCOMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation
COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted
More informationEECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels
EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved
More informationMachine Learning Support Vector Machines. Prof. Matteo Matteucci
Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way
More informationKernel Methods. Charles Elkan October 17, 2007
Kernel Methods Charles Elkan elkan@cs.ucsd.edu October 17, 2007 Remember the xor example of a classification problem that is not linearly separable. If we map every example into a new representation, then
More informationBayesian Linear Regression. Sargur Srihari
Bayesian Linear Regression Sargur srihari@cedar.buffalo.edu Topics in Bayesian Regression Recall Max Likelihood Linear Regression Parameter Distribution Predictive Distribution Equivalent Kernel 2 Linear
More informationHow to build an automatic statistician
How to build an automatic statistician James Robert Lloyd 1, David Duvenaud 1, Roger Grosse 2, Joshua Tenenbaum 2, Zoubin Ghahramani 1 1: Department of Engineering, University of Cambridge, UK 2: Massachusetts
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationPractical Bayesian Optimization of Machine Learning. Learning Algorithms
Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that
More informationKernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1
Kernel Methods Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 6 of Pattern Recognition and Machine Learning by Bishop Chapter 12 of The Elements of Statistical Learning by Hastie,
More informationWorst-Case Bounds for Gaussian Process Models
Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis
More informationNeutron inverse kinetics via Gaussian Processes
Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques
More informationKernels and the Kernel Trick. Machine Learning Fall 2017
Kernels and the Kernel Trick Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem Support vectors, duals and kernels
More informationThe Multivariate Gaussian Distribution
The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)
More informationGaussian Process Regression with K-means Clustering for Very Short-Term Load Forecasting of Individual Buildings at Stanford
Gaussian Process Regression with K-means Clustering for Very Short-Term Load Forecasting of Individual Buildings at Stanford Carol Hsin Abstract The objective of this project is to return expected electricity
More informationADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II
1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationBuilding an Automatic Statistician
Building an Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ ALT-Discovery Science Conference, October
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationModel Selection for Gaussian Processes
Institute for Adaptive and Neural Computation School of Informatics,, UK December 26 Outline GP basics Model selection: covariance functions and parameterizations Criteria for model selection Marginal
More informationMultivariate Bayesian Linear Regression MLAI Lecture 11
Multivariate Bayesian Linear Regression MLAI Lecture 11 Neil D. Lawrence Department of Computer Science Sheffield University 21st October 2012 Outline Univariate Bayesian Linear Regression Multivariate
More informationLinear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction
Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationClassifier Complexity and Support Vector Classifiers
Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl
More informationCausal Inference by Minimizing the Dual Norm of Bias. Nathan Kallus. Cornell University and Cornell Tech
Causal Inference by Minimizing the Dual Norm of Bias Nathan Kallus Cornell University and Cornell Tech www.nathankallus.com Matching Zoo It s a zoo of matching estimators for causal effects: PSM, NN, CM,
More informationState Space Representation of Gaussian Processes
State Space Representation of Gaussian Processes Simo Särkkä Department of Biomedical Engineering and Computational Science (BECS) Aalto University, Espoo, Finland June 12th, 2013 Simo Särkkä (Aalto University)
More informationLecture 4 February 2
4-1 EECS 281B / STAT 241B: Advanced Topics in Statistical Learning Spring 29 Lecture 4 February 2 Lecturer: Martin Wainwright Scribe: Luqman Hodgkinson Note: These lecture notes are still rough, and have
More information4 Bias-Variance for Ridge Regression (24 points)
Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
More informationGaussian processes in python
Gaussian processes in python John Reid 17th March 2009 1 What are Gaussian processes? Often we have an inference problem involving n data, D = {(x i, y i i = 1,..., n, x i X, y i R} where the x i are the
More information01 Probability Theory and Statistics Review
NAVARCH/EECS 568, ROB 530 - Winter 2018 01 Probability Theory and Statistics Review Maani Ghaffari January 08, 2018 Last Time: Bayes Filters Given: Stream of observations z 1:t and action data u 1:t Sensor/measurement
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationPattern Recognition 2018 Support Vector Machines
Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht
More informationLecture 10: Support Vector Machine and Large Margin Classifier
Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes 1 Objectives to express prior knowledge/beliefs about model outputs using Gaussian process (GP) to sample functions from the probability measure defined by GP to build
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationTokamak profile database construction incorporating Gaussian process regression
Tokamak profile database construction incorporating Gaussian process regression A. Ho 1, J. Citrin 1, C. Bourdelle 2, Y. Camenen 3, F. Felici 4, M. Maslov 5, K.L. van de Plassche 1,4, H. Weisen 6 and JET
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationProbabilistic Graphical Models Lecture 20: Gaussian Processes
Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationLeast Absolute Shrinkage is Equivalent to Quadratic Penalization
Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More information