CS 7140: Advanced Machine Learning


Lecture 3: Gaussian Processes (17 Jan, 2018)

Instructor: Jan-Willem van de Meent
Scribes: Mo Han, Guillem Reus Muns, Somanshu Singh

1 Gaussian Processes

In ridge regression, our goal is to compute a point estimate, which is the expected value of the posterior predictive distribution given previous observations. When we additionally want a confidence interval for predictions, we can use Gaussian processes, which compute the full posterior predictive distribution p(f | \vec{y}) for a new input \vec{x} using previous inputs \vec{x}_{1:N} and labels \vec{y}, where f(\vec{x}) is the function value at input \vec{x}.

1.1 Formal View: Non-parametric Distribution on Functions

In a prediction problem, the new input \vec{x} is a known, constant value or vector. From a formal point of view, however, before we receive the new input we do not yet know its exact value; the new input is a variable, and the prediction is a function of this variable. Our goal is therefore to compute a non-parametric distribution p(f | \vec{y}) on functions f(\vec{x}), which has infinitely many degrees of freedom. For each new input, we can then evaluate the posterior distribution based on the previous data. The posterior distribution p(f | \vec{y}) is defined using Bayes' rule

    p(f | \vec{y}) = \frac{p(\vec{y} | f)\, p(f)}{p(\vec{y})}.        (1)

In Gaussian processes, we assume the prior to be a Gaussian distribution over functions of the input \vec{x},

    f \sim \mathrm{GP}(\mu(\vec{x}), k(\vec{x}, \vec{x}')),        (2)

where \mu(\vec{x}) is a mean function and k(\vec{x}, \vec{x}') is a covariance function. The likelihood of the observed labels is

    y_n | f \sim \mathrm{Norm}(f(\vec{x}_n), \sigma^2).        (3)

In many machine learning applications, knowledge about the true mechanism behind the data-generating process is limited. Instead, one relies on generic smoothness assumptions; for example, we might require that for two inputs \vec{x} and \vec{x}' that are close, the corresponding outputs y and y' should be similar. Many generic techniques in machine learning can be viewed as encoding different notions of smoothness. The kernel function, used as the covariance, represents the smoothness of two inputs: the closer the inputs, the higher the covariance, and the more similar their corresponding outputs.
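To make this last point concrete, the short sketch below (an illustration added to these notes, not part of the original lecture; the squared-exponential form and the unit length scale are assumptions) evaluates a covariance function for pairs of inputs at increasing distance, showing that nearby inputs receive high covariance while distant inputs receive covariance near zero.

```python
import numpy as np

def sq_exp_kernel(x, x_prime, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return np.exp(-(x - x_prime) ** 2 / (2.0 * length_scale ** 2))

# Covariance decays as the inputs move apart: close inputs -> similar outputs.
for d in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(f"distance {d:.1f}: k = {sq_exp_kernel(0.0, d):.4f}")
```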

1.2 Practical View: Generalization of the Multivariate Normal

From a formal point of view, the function value f is a random variable. In the absence of further assumptions, a function on R^n has an uncountably infinite number of degrees of freedom. In practice, however, we only ever need to reason about the function values at a finite set of inputs. For any set of inputs X = [\vec{x}_1, ..., \vec{x}_N], a Gaussian process defines a joint distribution on function values that is a multivariate Gaussian

    \vec{f} \sim \mathrm{Norm}(\mu(X), k(X, X)).        (4)

Here we use \vec{f} and \mu(X) as shorthands for the vectors of function values

    \vec{f} := (f(\vec{x}_1), ..., f(\vec{x}_N)),        (5)
    \mu(X) := (\mu(\vec{x}_1), ..., \mu(\vec{x}_N)),        (6)

and k(X, X) is similarly a shorthand for the covariance matrix

    k(X, X) := \begin{bmatrix} k(\vec{x}_1, \vec{x}_1) & \cdots & k(\vec{x}_1, \vec{x}_N) \\ \vdots & \ddots & \vdots \\ k(\vec{x}_N, \vec{x}_1) & \cdots & k(\vec{x}_N, \vec{x}_N) \end{bmatrix}.        (7)
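A minimal sketch of this finite-dimensional view (added for illustration; the zero mean, the squared-exponential kernel, and the jitter term are assumptions, not part of the lecture): build the kernel matrix k(X, X) at a grid of inputs and draw function values from the corresponding multivariate normal.

```python
import numpy as np

def sq_exp_kernel_matrix(X, X_prime, length_scale=1.0):
    """Covariance matrix k(X, X') for 1-d inputs under a squared-exponential kernel."""
    diff = X[:, None] - X_prime[None, :]
    return np.exp(-diff ** 2 / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 50)           # finite set of inputs
K = sq_exp_kernel_matrix(X, X)          # k(X, X), eq. (7)
K += 1e-8 * np.eye(len(X))              # jitter for numerical stability

# Draw three sample functions from the zero-mean GP prior, eq. (4).
f_samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
print(f_samples.shape)                  # (3, 50): three functions evaluated at 50 inputs
```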

1.3 Regression with the Predictive Distribution

Suppose that we have observed a vector of values \vec{y} = (y_1, ..., y_N) distributed as

    \vec{y} | \vec{f} \sim \mathrm{Norm}(\vec{f}, \sigma^2 I).        (8)

The goal of Gaussian process regression is to reason about the function values \vec{f}^* := (f(\vec{x}^*_1), ..., f(\vec{x}^*_M)) at some new set of inputs X^* = [\vec{x}^*_1, ..., \vec{x}^*_M]. The predictive distribution on \vec{f}^* is

    p(\vec{f}^* | \vec{y}) = \frac{p(\vec{y}, \vec{f}^*)}{p(\vec{y})} = \frac{\int p(\vec{y} | \vec{f})\, p(\vec{f}^*, \vec{f})\, d\vec{f}}{p(\vec{y})}.        (9)

In this equation, the joint distribution p(\vec{f}, \vec{f}^*) is a multivariate Gaussian on function values at N + M points with mean and covariance matrix

    \begin{bmatrix} \vec{f} \\ \vec{f}^* \end{bmatrix} \sim \mathrm{Norm}\left( \begin{bmatrix} \mu(X) \\ \mu(X^*) \end{bmatrix}, \begin{bmatrix} k(X, X) & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{bmatrix} \right),        (10)

where the notation k(X, X^*) is used to refer to an N x M (sub-)matrix

    k(X, X^*) := \begin{bmatrix} k(\vec{x}_1, \vec{x}^*_1) & \cdots & k(\vec{x}_1, \vec{x}^*_M) \\ \vdots & \ddots & \vdots \\ k(\vec{x}_N, \vec{x}^*_1) & \cdots & k(\vec{x}_N, \vec{x}^*_M) \end{bmatrix},        (11)

and k(X^*, X) and k(X^*, X^*) are defined analogously as M x N and M x M (sub-)matrices.

When we assume Gaussian distributed additive noise

    \vec{y} = \vec{f} + \vec{\epsilon}, \qquad \vec{\epsilon} \sim \mathrm{Norm}(0, \sigma^2 I),        (12)

the distribution on \vec{y} is once again a multivariate Gaussian

    \vec{y} \sim \mathrm{Norm}(\mu(X), k(X, X) + \sigma^2 I).        (13)

If we substitute \vec{y} for \vec{f} in (10), then we see that the joint distribution p(\vec{y}, \vec{f}^*) is

    \begin{bmatrix} \vec{y} \\ \vec{f}^* \end{bmatrix} \sim \mathrm{Norm}\left( \begin{bmatrix} \mu(X) \\ \mu(X^*) \end{bmatrix}, \begin{bmatrix} k(X, X) + \sigma^2 I & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{bmatrix} \right).        (14)

Given the joint distribution p(\vec{y}, \vec{f}^*) we can make use of standard Gaussian identities, which state that for any joint distribution of the form

    p(\vec{\alpha}, \vec{\beta}) = \mathrm{Norm}\left( \begin{bmatrix} \vec{\alpha} \\ \vec{\beta} \end{bmatrix};\ \begin{bmatrix} \vec{a} \\ \vec{b} \end{bmatrix},\ \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right),        (15)

the conditional distribution is once more a multivariate normal with mean and covariance

    p(\vec{\alpha} | \vec{\beta}) = \mathrm{Norm}(\vec{\alpha};\ \vec{a} + C B^{-1} (\vec{\beta} - \vec{b}),\ A - C B^{-1} C^\top),        (16)

and the marginal distributions are likewise multivariate Gaussians of the form

    p(\vec{\alpha}) = \mathrm{Norm}(\vec{\alpha}; \vec{a}, A), \qquad p(\vec{\beta}) = \mathrm{Norm}(\vec{\beta}; \vec{b}, B).        (17)

When we substitute the form of the joint (14) into these identities we obtain a predictive distribution

    \vec{f}^* | \vec{y} \sim \mathrm{Norm}(\vec{f}^*;\ \mu^*(X^*),\ k^*(X^*, X^*)),        (18)

in which the predictive mean and covariance are

    \mu^*(X^*) = \mu(X^*) + k(X^*, X) \left( k(X, X) + \sigma^2 I \right)^{-1} (\vec{y} - \mu(X)),        (19)
    k^*(X^*, X^*) = k(X^*, X^*) - k(X^*, X) \left( k(X, X) + \sigma^2 I \right)^{-1} k(X, X^*).        (20)

In practice we can always pre-process our data into a detrended form by defining \vec{y}' = \vec{y} - \mu(X) and f'(\vec{x}) = f(\vec{x}) - \mu(\vec{x}). For this reason, software implementations of Gaussian processes typically assume a zero-mean distribution

    \begin{bmatrix} \vec{y} \\ \vec{f}^* \end{bmatrix} \sim \mathrm{Norm}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} k(X, X) + \sigma^2 I & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{bmatrix} \right),        (21)

for which the predictive mean simplifies to

    \mu^*(X^*) = k(X^*, X) \left( k(X, X) + \sigma^2 I \right)^{-1} \vec{y}.        (22)

When we want to perform regression with a non-zero mean, we can simply pre-process the data to compute \vec{y}', perform regression for the zero-mean function values \vec{f}'^*, and finally add the mean back in to define the non-zero-mean function values \vec{f}^* := \vec{f}'^* + \mu(X^*).
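The zero-mean predictive equations translate directly into a few lines of linear algebra. The sketch below is an illustration added to these notes (the kernel choice, noise level, and toy data are assumptions); it computes the predictive mean of equation (22) and the predictive covariance of equation (20).

```python
import numpy as np

def sq_exp(X, X_prime, length_scale=1.0):
    diff = X[:, None] - X_prime[None, :]
    return np.exp(-diff ** 2 / (2.0 * length_scale ** 2))

def gp_predict(X, y, X_star, kernel, noise_var):
    """Zero-mean GP predictive mean (22) and covariance (20)."""
    K = kernel(X, X) + noise_var * np.eye(len(X))   # k(X, X) + sigma^2 I
    K_star = kernel(X, X_star)                      # k(X, X*)
    alpha = np.linalg.solve(K, y)                   # (k(X, X) + sigma^2 I)^{-1} y
    mean = K_star.T @ alpha                         # eq. (22)
    cov = kernel(X_star, X_star) - K_star.T @ np.linalg.solve(K, K_star)  # eq. (20)
    return mean, cov

# Toy data: noisy observations of a sine function (assumed example).
rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))
X_star = np.linspace(0.0, 5.0, 100)

mean, cov = gp_predict(X, y, X_star, sq_exp, noise_var=0.01)
std = np.sqrt(np.diag(cov))          # pointwise predictive standard deviation
print(mean[:3], std[:3])
```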

2 Relationship to Kernel Ridge Regression

In the previous set of lecture notes, we considered kernel ridge regression, which computes the expected function value f^* := f(\vec{x}^*) at a previously unseen input \vec{x}^*, conditioned on previously observed values \vec{y},

    f^* = E[Y^* | Y = \vec{y}],        (23)

under the assumption of a linear regression model with a Gaussian prior on the weights

    \vec{y} = \vec{f} + \vec{\epsilon}, \qquad \vec{f} = \sum_{d=1}^{D} w_d \phi_d(X), \qquad \vec{w} \sim N(0, s^2 I), \qquad \vec{\epsilon} \sim N(0, \sigma^2 I).        (24)

In the usual formulation of kernel ridge regression, we define a kernel function

    k(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^\top \phi(\vec{x}_j),        (25)

and compute the expected function value in terms of the kernel evaluations

    f^* = k(\vec{x}^*, X) \left( k(X, X) + \frac{\sigma^2}{s^2} I \right)^{-1} \vec{y}.        (26)

Note that this expression is identical to the expression for the predictive mean in (22), with two minor distinctions. The first is that the expression above predicts a single function value f^* = f(\vec{x}^*), rather than a vector of function values \vec{f}^* = f(X^*). The second is that the expression above contains a regularization constant \lambda = \sigma^2 / s^2, whereas equation (22) only contains the term \sigma^2. The reason for this is that in Gaussian process regression, we implicitly absorb the variance of the weights into the definition of the kernel function. To see what we mean by "absorb", let us consider the more general case of a prior on the weights with a full covariance matrix S, rather than the diagonal covariance s^2 I,

    \vec{w} \sim N(0, S).        (27)

For this prior, the predictive mean in ridge regression is

    f^* = \phi(\vec{x}^*)^\top S\, \phi(X)^\top \left( \phi(X)\, S\, \phi(X)^\top + \sigma^2 I \right)^{-1} \vec{y}.        (28)

Note that this expression reduces to the one in equation (26) when S = s^2 I. We now also see that when we define the kernel

    k'(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^\top S\, \phi(\vec{x}_j) = \sum_{d,e=1}^{D} \phi_d(\vec{x}_i)\, S_{d,e}\, \phi_e(\vec{x}_j),        (29)

we can rewrite equation (28) as

    f^* = k'(\vec{x}^*, X) \left( k'(X, X) + \sigma^2 I \right)^{-1} \vec{y},        (30)

which recovers the predictive mean in equation (22). In other words, kernel ridge regression is equivalent to computing the mean in Gaussian process regression when we employ a kernel of the form in equation (29).
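A minimal numerical check of this equivalence (added for illustration; the random features and the values of s^2 and sigma^2 are assumptions): with the kernel k(x, x') = s^2 phi(x)^T phi(x'), the GP predictive mean (22) matches the kernel ridge prediction (26) that uses the unscaled kernel and the regularizer lambda = sigma^2 / s^2.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 5
s2, sigma2 = 2.0, 0.1

Phi = rng.standard_normal((N, D))        # features phi(X); rows are phi(x_n)
phi_star = rng.standard_normal(D)        # features of a new input x*
w = np.sqrt(s2) * rng.standard_normal(D) # weights drawn from N(0, s^2 I)
y = Phi @ w + np.sqrt(sigma2) * rng.standard_normal(N)

# Kernel ridge regression, eq. (26): unscaled kernel, regularizer lambda = sigma^2 / s^2.
K = Phi @ Phi.T
f_krr = (phi_star @ Phi.T) @ np.linalg.solve(K + (sigma2 / s2) * np.eye(N), y)

# GP predictive mean, eq. (22): the variance of the weights is absorbed into the kernel.
K_gp = s2 * Phi @ Phi.T
f_gp = (s2 * phi_star @ Phi.T) @ np.linalg.solve(K_gp + sigma2 * np.eye(N), y)

print(np.allclose(f_krr, f_gp))          # True: the two predictions agree
```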

Figure 1: Estimated functions for different values of λ, going from small values (green curves) to large values (red curves).

The relationship between Gaussian processes and kernel ridge regression has consequences for how we should think about regularization in both methods. As explained in the previous lecture, λ is the parameter that tunes the amount of regularization, making the estimated functions smoother or less smooth. Taking Fig. 1 as an example, we see that large values of λ (strong regularization) yield considerably smoother estimated functions, whereas the opposite holds for small values of λ (weak regularization). In the examples of regression previously considered in class, the parameter λ was tuned directly via

    λ = \sigma^2 / s^2.        (31)

Once a kernel-based regression is applied, however, the kernel function that is employed is the one that intrinsically implies a choice of λ:

    λ = \sigma^2 / S^2.        (32)

Looking back at Eq. (29), we see that choosing the kernel implies choosing S, and in turn choosing λ. This means that when we are performing kernelized regression with respect to some explicit set of features \phi_d(\vec{x}), we need to define a kernel k(\vec{x}, \vec{x}') = s^2 \phi(\vec{x})^\top \phi(\vec{x}') in order to obtain results that are fully equivalent to performing ridge regression. More generally, any choice of kernel in Gaussian process regression implies regularization in the form of smoothness assumptions that are encoded by the kernel.

3 Kernel Hyperparameters

The degree of smoothness imposed by a kernel is controlled by the kernel hyperparameters. We will consider a couple of example kernels in order to show how the choice of hyperparameters affects smoothness.

The Squared-Exponential kernel is defined by the expression

    k(x, x') = \exp\left( -\frac{(x - x')^2}{2 l^2} \right).        (33)

Figure 2: Squared-Exponential kernel for different values of the length scale l (mean posterior predictive functions for length scales that are too long, about right, and too short; adapted from Carl Edward Rasmussen's lecture slides on GP marginal likelihood and hyperparameters).

Fig. 2 shows different estimated functions for different values of l. We can see how the chosen kernel impacts the regularization of the model: for this particular kernel, larger values of l mean stronger regularization.

The Matern kernel is defined by the expression

    k_\nu(\vec{x}, \vec{x}') = \sigma_f^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|\vec{x} - \vec{x}'\|}{l} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, \|\vec{x} - \vec{x}'\|}{l} \right), \qquad \nu = \tfrac{1}{2}, \tfrac{3}{2}, \tfrac{5}{2}, \ldots        (34)

where \Gamma(\nu) and K_\nu are the Gamma and modified Bessel functions, respectively.

Figure 3: Matern kernel for different values of l (effect of changing the length scale for a Matérn-5/2 kernel).

Observing Fig. 3, we again see that the kernel function is what controls the tuning of the regularization.
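As a concrete sketch (added to these notes; the choice of ν = 5/2, the length scales, and the clipping of r at zero are assumptions), the squared-exponential kernel (33) and the Matern kernel (34) can be implemented directly, the latter using the Gamma and modified Bessel functions from SciPy.

```python
import numpy as np
from scipy.special import gamma, kv   # Gamma function and modified Bessel K_nu

def sq_exp(r, length_scale=1.0):
    """Squared-exponential kernel (33) as a function of the distance r = |x - x'|."""
    return np.exp(-r ** 2 / (2.0 * length_scale ** 2))

def matern(r, nu=2.5, length_scale=1.0, sigma_f=1.0):
    """Matern kernel (34) as a function of the distance r = ||x - x'||."""
    r = np.maximum(r, 1e-12)                      # avoid 0 * inf at r = 0
    z = np.sqrt(2.0 * nu) * r / length_scale
    return sigma_f ** 2 * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

r = np.linspace(0.0, 3.0, 7)
print(sq_exp(r, length_scale=0.5))    # shorter length scale: covariance decays faster
print(matern(r, nu=2.5, length_scale=0.5))
```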

4 Basic Kernels and Combinations of Kernels

Depending on the problem being tackled, it may be of interest to use a kernel whose properties reflect our prior knowledge about the problem domain. See Fig. 4 for a few examples of basic kernels and their main properties. For a particular problem, we may want to specify some critical behavior of the function using a combination of kernels. Kernel functions have a set of properties that allow combining them in multiple ways; the most important ones, illustrated in the sketch after Fig. 5, are:

Sum: k(\vec{x}, \vec{x}') = k_1(\vec{x}, \vec{x}') + k_2(\vec{x}, \vec{x}')

Product: k(\vec{x}, \vec{x}') = k_1(\vec{x}, \vec{x}')\, k_2(\vec{x}, \vec{x}')

Product Spaces: for any \vec{z} = (\vec{x}, \vec{y}), k(\vec{z}, \vec{z}') = k_1(\vec{x}, \vec{x}') + k_2(\vec{y}, \vec{y}') and k(\vec{z}, \vec{z}') = k_1(\vec{x}, \vec{x}')\, k_2(\vec{y}, \vec{y}')

Vertical Rescaling: k(\vec{x}, \vec{x}') = a(\vec{x})\, k_1(\vec{x}, \vec{x}')\, a(\vec{x}'), for any function a(\vec{x})

Figure 4: Examples of basic kernels (squared-exponential, periodic, and linear kernels and the type of structure each expresses: local variation, repeating structure, and linear functions; reproduced from Duvenaud [2014], Figure 2.1).

In Fig. 5, different kernel combinations and the properties they produce can be observed.

Figure 5: Examples of different combinations of kernels (Lin x Lin: quadratic functions; SE x Per: locally periodic; Lin x SE: increasing variation; Lin x Per: growing amplitude; reproduced from Duvenaud [2014], Figure 2.2).
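The sketch below (added for illustration; the particular base kernels and the choice of a(x) are assumptions) expresses the sum, product, and vertical-rescaling rules as higher-order functions, and builds a locally periodic kernel as the product of a periodic and a squared-exponential kernel.

```python
import numpy as np

def sq_exp(x, x_prime, l=1.0):
    return np.exp(-(x - x_prime) ** 2 / (2.0 * l ** 2))

def periodic(x, x_prime, p=1.0, l=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - x_prime) / p) ** 2 / l ** 2)

# Combination rules as higher-order functions.
def k_sum(k1, k2):
    return lambda x, xp: k1(x, xp) + k2(x, xp)

def k_prod(k1, k2):
    return lambda x, xp: k1(x, xp) * k2(x, xp)

def k_rescale(k1, a):
    return lambda x, xp: a(x) * k1(x, xp) * a(xp)    # vertical rescaling

# A locally periodic kernel: periodic structure modulated by local variation.
locally_periodic = k_prod(periodic, sq_exp)
print(locally_periodic(0.0, 0.3), locally_periodic(0.0, 3.3))
```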

5 Inner Product

To motivate the concept of an inner product, think of vectors in R^2 and R^3 as arrows with initial point at the origin. The length of a vector x in R^2 or R^3 is called the norm of x, denoted ||x||. Thus for x = (x_1, x_2) in R^2 we have ||x|| = \sqrt{x_1^2 + x_2^2}, and if x = (x_1, x_2, x_3) in R^3, then ||x|| = \sqrt{x_1^2 + x_2^2 + x_3^2}. We define the norm of x = (x_1, ..., x_n) in R^n by

    ||x|| = \sqrt{x_1^2 + \cdots + x_n^2}.        (35)

The norm is not linear on R^n. To inject linearity into the discussion, we introduce the dot product.

Definition 1. For x, y in R^n, the dot product of x and y, denoted x \cdot y, is defined by

    x \cdot y := x_1 y_1 + \cdots + x_n y_n,        (36)

where x := (x_1, ..., x_n) and y := (y_1, ..., y_n). The dot product on R^n has the following properties:

- x \cdot x \geq 0 for all x in R^n;
- x \cdot x = 0 if and only if x = 0;
- for fixed y in R^n, the map from R^n to R that sends x to x \cdot y is linear;
- x \cdot y = y \cdot x for all x, y in R^n.

An inner product is a generalization of the dot product to all kinds of vector spaces, not just real vector spaces.

Definition 2. An inner product on V is a function that maps each ordered pair (u, v) of elements of V to a scalar \langle u, v \rangle in F and has the following properties:

- positivity: \langle v, v \rangle \geq 0 for all v in V;
- definiteness: \langle v, v \rangle = 0 if and only if v = 0;
- additivity in the first slot: \langle u + v, w \rangle = \langle u, w \rangle + \langle v, w \rangle for all u, v, w in V;
- homogeneity in the first slot: \langle \lambda u, v \rangle = \lambda \langle u, v \rangle for all \lambda in F and all u, v in V;
- conjugate symmetry: \langle u, v \rangle = \overline{\langle v, u \rangle} for all u, v in V;

where V is any vector space and F is any scalar field (real or complex).
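A quick numerical illustration of the dot product properties in Definition 1 (added to these notes; the specific vectors and scalar are arbitrary assumptions): the dot product is nonnegative on the diagonal, symmetric, and linear in its first argument.

```python
import numpy as np

rng = np.random.default_rng(3)
x, y, z = rng.standard_normal((3, 4))   # three vectors in R^4
lam = 2.5

print(np.dot(x, x) >= 0)                              # positivity
print(np.isclose(np.dot(x, y), np.dot(y, x)))         # symmetry
print(np.isclose(np.dot(lam * x + z, y),
                 lam * np.dot(x, y) + np.dot(z, y)))  # linearity in the first slot
```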

5.1 Examples: Inner Products

1. The Euclidean inner product on F^n is defined by

    \langle (w_1, ..., w_n), (z_1, ..., z_n) \rangle = w_1 \bar{z}_1 + \cdots + w_n \bar{z}_n.        (37)

2. An inner product can be defined on the vector space of continuous real-valued functions on the interval [a, b] by

    \langle f, g \rangle = \int_a^b f(x) g(x)\, dx.        (38)

An inner product space is a vector space on which an inner product is defined.

6 Hilbert Spaces and Kernels

6.1 Hilbert Spaces

Any vector space V with an inner product \langle \cdot, \cdot \rangle defines a norm by ||v|| = \sqrt{\langle v, v \rangle}. Similarly, it also defines a metric, d(x, y) = ||x - y||. A sequence of elements {v_n} in V is called a Cauchy sequence if, for every positive real number \epsilon, there is a positive integer N such that for all m, n > N,

    ||v_m - v_n|| < \epsilon.        (39)

An inner product space is also called a pre-Hilbert space.

Definition 3. An inner product space H is called a Hilbert space if it is a complete metric space, i.e., if {h_n} is a Cauchy sequence in H, then there exists h in H with

    ||h - h_n|| \to 0 \quad \text{as } n \to \infty.        (40)

6.2 Kernels

Definition 4. A function k : X \times X \to R is a kernel if there is a function \phi : X \to H such that for all x, x' in X,

    k(x, x') = \langle \phi(x), \phi(x') \rangle_H.        (41)

Definition 5. Let K denote the scalar field, either of real or complex numbers. Let K^N be the set of all sequences of scalars

    (x_n)_{n \in N}, \quad x_n \in K.        (42)

A sequence space is defined as any linear subspace of K^N, for which vector addition and scalar multiplication are defined by

    (x_n)_{n \in N} + (y_n)_{n \in N} := (x_n + y_n)_{n \in N},        (43)
    \alpha (x_n)_{n \in N} := (\alpha x_n)_{n \in N}.        (44)

Definition 6. \ell^2 is the subspace of K^N consisting of all sequences x = (x_n) satisfying

    \sum_{n=1}^{\infty} |x_n|^2 < \infty.        (45)

Given a sequence {\phi_d(x)}_{d \geq 1} in \ell^2, where \phi_d : X \to R is the d-th coordinate,

    k(x, x') := \sum_{d=1}^{\infty} \phi_d(x)\, \phi_d(x') = \langle \phi(x), \phi(x') \rangle_H.        (46)
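A small sketch of equation (46) with a finite feature expansion (added to these notes; the scaled-monomial features and the truncation at D = 6 are assumptions): defining phi(x) explicitly, the kernel value is exactly the inner product of the feature vectors, and here approximates exp(x x').

```python
import math
import numpy as np

def phi(x, D=6):
    """A finite feature expansion phi_1(x), ..., phi_D(x) (here: scaled monomials)."""
    return np.array([x ** d / math.sqrt(math.factorial(d)) for d in range(D)])

def k(x, x_prime):
    """Kernel defined as the inner product of feature vectors, eq. (46)."""
    return float(np.dot(phi(x), phi(x_prime)))

# With these features, k(x, x') is a truncated series for exp(x * x').
print(k(0.5, 0.8), np.exp(0.5 * 0.8))
```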

6.3 Positive Definiteness

Theorem 1. If H is a Hilbert space and \phi : X \to H is a feature map (with X a non-empty set), then

    k(x, x') := \langle \phi(x), \phi(x') \rangle_H        (47)

is positive definite.

6.4 Reproducing Kernels

Definition 7. Suppose that H is a Hilbert space of functions f : X \to R. Then H is a Reproducing Kernel Hilbert Space (RKHS) when

    k(\cdot, x) \in H, \quad \forall x \in X,        (48)
    \langle f(\cdot), k(\cdot, x) \rangle_H = f(x).        (49)

Property (49) is also called the kernel trick or the reproducing property.

Theorem 2 (Moore-Aronszajn). Suppose k is a symmetric, positive definite kernel on a set X. Then there is a unique Hilbert space of functions on X for which k is a reproducing kernel.

6.5 Function Space Equivalence Classes

There is an equivalence relation between reproducing kernels, positive definite functions, and Hilbert function spaces with bounded point evaluation: every reproducing kernel is also a positive definite function and corresponds to a Hilbert function space with bounded point evaluation.
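A minimal numerical check of Theorem 1 (added to these notes; the feature map and sample points are arbitrary assumptions): the Gram matrix built from inner products of features is symmetric with nonnegative eigenvalues, i.e., positive semi-definite.

```python
import numpy as np

def phi(x):
    """An arbitrary feature map X -> R^3, used only for illustration."""
    return np.array([1.0, x, np.sin(x)])

X = np.linspace(-2.0, 2.0, 8)
Phi = np.stack([phi(x) for x in X])       # rows are feature vectors
G = Phi @ Phi.T                           # Gram matrix G_ij = <phi(x_i), phi(x_j)>

eigvals = np.linalg.eigvalsh(G)
print(np.all(eigvals >= -1e-10))          # True: positive semi-definite
```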

6.6 Example: XOR

We can see what a reproducing kernel Hilbert space is using a simple XOR example. Consider the feature map

    \phi : R^2 \to R^3        (50)

applied to an input

    \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},        (51)

defined as

    \phi(\vec{x}) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix},        (52)

with the kernel defined as

    k(\vec{x}, \vec{y}) = \begin{bmatrix} x_1 & x_2 & x_1 x_2 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix} = \langle \phi(\vec{x}), \phi(\vec{y}) \rangle_{R^3}.        (53)

This feature space is a Hilbert space H because an inner product is defined, given by the dot product. Let us now define a function of the features x_1, x_2, x_1 x_2 of \vec{x} as

    f(\vec{x}) = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 = \langle \vec{w}, \phi(\vec{x}) \rangle_{R^3}.        (54)

This function is a member of a space of functions mapping from R^2 to R. We can define an equivalent representation for f,

    f(\cdot) = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix},        (55)

where the notation f(\cdot) refers to the function itself and the notation f(\vec{x}) \in R refers to the function evaluated at a particular point. Then we can write

    f(\vec{x}) = f(\cdot)^\top \phi(\vec{x}),        (56)
    f(\vec{x}) := \langle f(\cdot), \phi(\vec{x}) \rangle_H.        (57)

In other words, the evaluation of f at \vec{x} can be written as an inner product in feature space. Moreover, we can express the kernel function in terms of the feature map using the same convention,

    k(\cdot, \vec{y}) = \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix} = \phi(\vec{y}),        (58)

where w_1 = y_1, w_2 = y_2, and w_3 = y_1 y_2. Due to symmetry we can also write

    \langle k(\cdot, \vec{x}), \phi(\vec{y}) \rangle = x_1 y_1 + x_2 y_2 + x_1 x_2\, y_1 y_2 = k(\vec{x}, \vec{y}).        (59)

In other words, \phi(\vec{x}) = k(\cdot, \vec{x}) and \phi(\vec{y}) = k(\cdot, \vec{y}). This way of writing the feature map is called the canonical feature map.

Reproducing property:

    f(\vec{x}) = \langle \vec{w}, \phi(\vec{x}) \rangle_{R^3} = \langle f(\cdot), \phi(\vec{x}) \rangle_H, \qquad \phi(\vec{x}) = k(\cdot, \vec{x}).

Here, we use positive definite kernels to define functions on X. The space of such functions is known as a reproducing kernel Hilbert space.
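The sketch below (an illustration added to these notes; the particular weights and inputs are arbitrary) checks equations (53)-(57) numerically: the kernel equals the inner product of the feature vectors, and evaluating f at a point equals the inner product of its coefficient vector with the canonical feature map.

```python
import numpy as np

def phi(x):
    """Feature map of eq. (52): (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def kernel(x, y):
    """Kernel of eq. (53): inner product of feature vectors."""
    return float(np.dot(phi(x), phi(y)))

x = np.array([0.7, -1.2])
y = np.array([-0.4, 2.0])
w = np.array([1.0, 0.5, -2.0])            # coefficients of f, eq. (55)

f_x = w[0] * x[0] + w[1] * x[1] + w[2] * x[0] * x[1]     # eq. (54)
print(np.isclose(f_x, np.dot(w, phi(x))))                # eqs. (56)-(57): f(x) = <f(.), phi(x)>
print(np.isclose(kernel(x, y), np.dot(phi(x), phi(y))))  # eqs. (53) and (59)
```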

6.7 Implications of RKHS for Gaussian Processes

Suppose that we have a random function f that is distributed according to a Gaussian process prior

    f \sim \mathrm{GP}(0, k(\vec{x}, \vec{x}')).        (60)

Since k(\vec{x}, \vec{x}') is a positive definite kernel, the posterior mean function lies in the corresponding RKHS. To see this, note that we can express the GP posterior as

    f | \vec{y} \sim \mathrm{GP}(\mu^*(\vec{x}), k^*(\vec{x}, \vec{x}')),        (61)

where the posterior mean is defined as

    \mu^*(\cdot) = k(\cdot, X) \left( k(X, X) + \sigma^2 I \right)^{-1} \vec{y}        (62)
                 = \sum_{n,m} k(\cdot, \vec{x}_n) \left[ \left( k(X, X) + \sigma^2 I \right)^{-1} \right]_{nm} y_m.        (63)

In other words, we can express the mean function of the posterior as

    \mu^*(\cdot) = \sum_n k(\cdot, \vec{x}_n)\, w_n,        (64)
    w_n = \sum_m \left[ \left( k(X, X) + \sigma^2 I \right)^{-1} \right]_{nm} y_m.        (65)

In summary, the choice of kernel in a Gaussian process implies a choice of RKHS, which constrains GP regression to solutions that lie inside this RKHS.
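As a final sketch (added to these notes; the kernel, noise level, and toy data are assumptions), the posterior mean can be evaluated either in the matrix form (62) or as the kernel expansion of equations (64)-(65); the two agree.

```python
import numpy as np

def sq_exp(X, X_prime, l=1.0):
    diff = np.subtract.outer(X, X_prime)
    return np.exp(-diff ** 2 / (2.0 * l ** 2))

rng = np.random.default_rng(4)
X = np.linspace(0.0, 5.0, 15)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))
sigma2 = 0.01

w = np.linalg.solve(sq_exp(X, X) + sigma2 * np.eye(len(X)), y)   # weights, eq. (65)

def posterior_mean(x_new):
    """Posterior mean as a kernel expansion, eq. (64): sum_n w_n k(x_new, x_n)."""
    return np.sum(w * sq_exp(np.atleast_1d(x_new), X), axis=1)

# The matrix form (62) at a grid of new inputs agrees with the expansion (64).
X_star = np.linspace(0.0, 5.0, 7)
mu_matrix = sq_exp(X_star, X) @ w
print(np.allclose(mu_matrix, posterior_mean(X_star)))            # True
```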
