Gaussian Processes for Machine Learning


1 Gaussian Processes for Machine Learning. Carl Edward Rasmussen, Max Planck Institute for Biological Cybernetics, Tübingen, Germany. Carlos III, Madrid, May 2006. "The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind." James Clerk Maxwell [1850]

2 Course Goals. To understand: how learning for regression and classification can be based on Gaussian processes; how GPs relate to alternative methods; the advantages and shortcomings of GPs; and to gain some practical familiarity using matlab functions.

3 Today's Outline. Brief background on probability; Gaussian processes; Bayesian inference; two views: the process view and Occam's Razor, and the weight-space view; on the equivalence between the two views.

4 Why Gaussian Processes? Gaussian processes are not new! They have long been used in spatial statistics, geostatistics (kriging), and meteorology. Why are GPs not used more often? Computational reasons, and because they are Bayesian.

5 Probabilities and Probability Densities. Probabilities are non-negative: p(x) ≥ 0. Probabilities normalize: Σ_x p(x) = 1 (discrete) or ∫ p(x) dx = 1 (continuous). The joint probability of x and y is p(x, y). The marginal probability of x is p(x) = ∫ p(x, y) dy. The conditional probability of x given y is p(x|y) = p(x, y)/p(y). Bayes' rule follows from p(x, y) = p(x|y) p(y) = p(y|x) p(x): p(y|x) = p(x|y) p(y)/p(x) ∝ p(x|y) p(y). Interpretations of p(x): long-run frequencies (frequentist, classical) or subjective degree of belief (Bayesian).

6 Expectation and Variance (Moments). The expectation (or mean, or average) of a random variable is µ = E[x] = ∫ x p(x) dx (continuous) or Σ_x x p(x) (discrete). The variance (or second central moment) is σ² = V[x] = ∫ (x − µ)² p(x) dx = E[x²] − (E[x])². The covariance between x and y is cov(x, y) = E[(x − E[x])(y − E[y])]. If x and y are independent, then their covariance is zero.

7 The Gaussian Distribution. A joint, or multivariate, Gaussian distribution in D dimensions, y ~ N(µ, Σ), has density p(y) = (2π)^(−D/2) |Σ|^(−1/2) exp(−(1/2)(y − µ)ᵀΣ⁻¹(y − µ)), and is fully specified by its mean vector µ and covariance matrix Σ. Let x and y be jointly Gaussian random vectors, [x; y] ~ N([µ_x; µ_y], [A C; Cᵀ B]) = N([µ_x; µ_y], [Ã C̃; C̃ᵀ B̃]⁻¹), where the second form parameterizes the joint by its precision matrix. Then the marginal distribution of x and the conditional distribution of x given y are x ~ N(µ_x, A), and x|y ~ N(µ_x + CB⁻¹(y − µ_y), A − CB⁻¹Cᵀ), or equivalently x|y ~ N(µ_x − Ã⁻¹C̃(y − µ_y), Ã⁻¹). (1)
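
A quick numerical check of the conditioning result (1), not from the slides: the matlab/Octave sketch below conditions a small joint Gaussian on an observed y. All numbers (means, covariance blocks, the observed value) are made up for illustration.

mu_x = [0; 1];  mu_y = 2;                    % means of x (2-d) and y (1-d)
A = [1.0 0.3; 0.3 0.5];                      % cov(x)
B = 0.8;                                     % cov(y)
C = [0.2; 0.1];                              % cov(x, y)
y_obs = 2.5;                                 % observed value of y
mu_cond  = mu_x + C * (B \ (y_obs - mu_y));  % E[x | y], first form in (1)
cov_cond = A - C * (B \ C');                 % cov(x | y)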

8 Products of Gaussians. The product of two Gaussians gives another (un-normalized) Gaussian: N(x|a, A) N(x|b, B) = Z⁻¹ N(x|c, C), (2) where c = C(A⁻¹a + B⁻¹b) and C = (A⁻¹ + B⁻¹)⁻¹. Notice that the resulting Gaussian has a precision (inverse variance) equal to the sum of the precisions, and a mean equal to the convex combination of the means, weighted by the precisions. The normalizing constant itself looks like a Gaussian (in a or b): Z⁻¹ = (2π)^(−D/2) |A + B|^(−1/2) exp(−(1/2)(a − b)ᵀ(A + B)⁻¹(a − b)).
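
A small numerical sanity check of equation (2) in one dimension, assuming made-up means a, b and variances A, B (a matlab/Octave sketch, not part of the slides):

npdf = @(x, m, v) exp(-0.5 * (x - m).^2 ./ v) ./ sqrt(2*pi*v);   % Gaussian density N(x; m, v)
a = 1; A = 2; b = 3; B = 0.5;                % made-up means and variances
C = 1 / (1/A + 1/B);                         % precisions add
c = C * (a/A + b/B);                         % precision-weighted mean
x = linspace(-4, 8, 500);
Zinv = npdf(a, b, A + B);                    % normalizing constant Z^{-1}
max(abs(npdf(x, a, A) .* npdf(x, b, B) - Zinv * npdf(x, c, C)))  % ~ 0 up to round-off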

9 Gaussian Processes. Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions. Key observation: a Gaussian process is a natural generalization from vectors to functions, f(x) ~ GP(m(x), k(x, x')), and is fully specified by its mean function m(x) and covariance function k(x, x'). This may sound very impractical... but we are saved by the marginalization property. Recall p(x) = ∫ p(x, y) dy; for Gaussians: p(x, y) = N([a; b], [A B; Bᵀ C]) ⟹ p(x) = N(a, A).

10 Gaussian Processes specify distributions over functions. To generate a random sample from a D-dimensional joint Gaussian with covariance matrix S and mean vector m (in matlab): x = randn(D,1); y = chol(S)'*x + m; where chol returns the Cholesky factor R such that RᵀR = S. Thus the covariance of y is E[(y − m)(y − m)ᵀ] = E[Rᵀx xᵀR] = Rᵀ E[x xᵀ] R = Rᵀ I R = S. Example: a smooth, stationary, squared exponential (or Gaussian) GP: f(x) ~ GP(m(x) = 0, k(x, x') = exp(−(1/2)(x − x')²)). Do try this at home!
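
Expanding the one-line recipe above into a runnable matlab/Octave sketch (the grid, jitter value and number of samples are arbitrary choices, and the expression x - x' relies on implicit expansion, MATLAB R2016b+ or Octave):

x = linspace(-5, 5, 201)';                   % input grid
K = exp(-0.5 * (x - x').^2);                 % squared exponential covariance, k(x,x') = exp(-(x-x')^2/2)
R = chol(K + 1e-9 * eye(numel(x)));          % upper-triangular R with R'*R = K (tiny jitter for stability)
f = R' * randn(numel(x), 3);                 % three independent sample functions from the GP prior
plot(x, f);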

11 [Figure: a random function drawn from the Gaussian process; axes: input x vs. output f(x).]

12 [Figure: function drawn at random from a Gaussian process with Gaussian (squared exponential) covariance.]

13 Bayesian Inference, parametric model. Supervised parametric learning: data x, y; model y = f_w(x) + ε. Gaussian likelihood: p(y|x, w, M_i) ∝ Π_c exp(−(1/2)(y_c − f_w(x_c))²/σ²_noise). Parameter prior: p(w|M_i). Posterior parameter distribution by Bayes' rule, p(a|b) = p(b|a)p(a)/p(b): p(w|x, y, M_i) = p(w|M_i) p(y|x, w, M_i) / p(y|x, M_i).

14 Bayesian Inference, parametric model, cont. Making predictions: p(y*|x*, x, y, M_i) = ∫ p(y*|w, x*, M_i) p(w|x, y, M_i) dw. Marginal likelihood: p(y|x, M_i) = ∫ p(w|M_i) p(y|x, w, M_i) dw. Model probability: p(M_i|x, y) = p(M_i) p(y|x, M_i) / p(y|x). Problem: these integrals are intractable for most interesting models!

15 Non-parametric Gaussian process models. In our non-parametric model, the "parameters" are the function itself! Gaussian likelihood: y|x, f(x), M_i ~ N(f, σ²_noise I). (Zero mean) Gaussian process prior: f(x)|M_i ~ GP(m(x) ≡ 0, k(x, x')). This leads to a Gaussian process posterior f(x)|x, y, M_i ~ GP(m_post(x) = k(x, x)[K(x, x) + σ²_noise I]⁻¹y, k_post(x, x') = k(x, x') − k(x, x)[K(x, x) + σ²_noise I]⁻¹k(x, x')), where k(x, x) denotes the vector of covariances between the test input x and the training inputs x, and K(x, x) the covariance matrix of the training inputs. And a Gaussian predictive distribution: y*|x*, x, y, M_i ~ N(k(x*, x)ᵀ[K + σ²_noise I]⁻¹y, k(x*, x*) + σ²_noise − k(x*, x)ᵀ[K + σ²_noise I]⁻¹k(x*, x)).
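
A minimal matlab/Octave sketch of these predictive equations, assuming unit signal variance and length scale and made-up training data (X, y) and noise variance sn2; it is an illustration, not the slides' own code:

k   = @(A, B) exp(-0.5 * (A - B').^2);       % squared exponential covariance (implicit expansion)
X   = [-4; -3; -1; 0; 2];  y = sin(X);       % hypothetical observations
xs  = linspace(-5, 5, 101)';                 % test inputs x*
sn2 = 0.01;                                  % noise variance sigma_noise^2
Ky  = k(X, X) + sn2 * eye(numel(X));         % K + sigma_noise^2 I
Ks  = k(xs, X);                              % k(x*, x)
mu  = Ks * (Ky \ y);                         % predictive mean
s2f = diag(k(xs, xs)) - sum(Ks .* (Ky \ Ks')', 2);   % predictive variance of f*
s2y = s2f + sn2;                             % predictive variance of y*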

16 Prior and Posterior. [Figure: sample functions from the GP prior (left) and from the posterior after conditioning on data (right); axes: input x vs. output f(x).] Predictive distribution: p(y*|x*, x, y) = N(k(x*, x)ᵀ[K + σ²_noise I]⁻¹y, k(x*, x*) + σ²_noise − k(x*, x)ᵀ[K + σ²_noise I]⁻¹k(x*, x)).

17 Graphical model for Gaussian Process. [Figure: graphical model with observations y⁽¹⁾, ..., y⁽ᶜ⁾, the underlying Gaussian field f⁽¹⁾, ..., f⁽ᶜ⁾, and inputs x⁽¹⁾, x⁽²⁾, ..., x⁽ᶜ⁾.] Square nodes are clamped (observed), round nodes are stochastic (free). The thick black line indicates that everyone is connected to everyone. Notice that adding a triplet x⁽ⁱ⁾, f⁽ⁱ⁾, y⁽ⁱ⁾ (where the input x⁽ⁱ⁾ is clamped) does not influence the distribution, as long as y⁽ⁱ⁾ is not clamped. This is guaranteed by the consistency of the GP, and it explains why we can make inference using a finite amount of computation!

18 The marginal likelihood. The log marginal likelihood, log p(y|x, M_i) = −(1/2) yᵀK⁻¹y − (1/2) log|K| − (n/2) log(2π), is the combination of a data-fit term and a complexity penalty: Occam's Razor is automatic. Learning in Gaussian process models involves finding the form of the covariance function, and any unknown (hyper-)parameters θ. This can be done by optimizing the marginal likelihood: ∂ log p(y|x, θ, M_i)/∂θ_j = (1/2) yᵀK⁻¹(∂K/∂θ_j)K⁻¹y − (1/2) trace(K⁻¹ ∂K/∂θ_j).
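
A sketch of how the log marginal likelihood can be computed stably via a Cholesky factorization; the data and hyperparameters are made up, and Ky denotes the covariance of the observations (K plus the noise variance, as on the slides):

X   = [-4; -3; -1; 0; 2];  y = sin(X);  sn2 = 0.01;   % made-up data and noise variance
Ky  = exp(-0.5 * (X - X').^2) + sn2 * eye(numel(X));  % covariance of the observations
L   = chol(Ky, 'lower');
alpha = L' \ (L \ y);                                 % Ky^{-1} y
lml = -0.5 * (y' * alpha) - sum(log(diag(L))) - 0.5 * numel(y) * log(2*pi);
% gradient w.r.t. a hyperparameter theta_j, with dK = dKy/dtheta_j:
%   0.5 * alpha' * dK * alpha - 0.5 * trace(Ky \ dK)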

19 Example: Fitting the length scale parameter. Parameterized covariance function: k(x, x') = v² exp(−(x − x')²/(2ℓ²)) + σ²_n δ_{xx'}. [Figure: observations with the mean posterior predictive function for three length scales: too short, a good length scale, and too long.] The mean posterior predictive function is plotted for 3 different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this!
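
One way to reproduce this kind of fit is to minimize the negative log marginal likelihood over the (log) length scale with a generic optimizer; the sketch below uses fminsearch and made-up data, holding the signal and noise variances fixed:

X = [-4; -3; -1; 0; 2];  y = sin(X);  v2 = 1; sn2 = 0.01;  n = numel(X);   % made-up data
Kfun = @(ell) v2 * exp(-0.5 * (X - X').^2 / ell^2) + sn2 * eye(n);
nlml = @(t) 0.5 * y' * (Kfun(exp(t)) \ y) + 0.5 * log(det(Kfun(exp(t)))) + 0.5 * n * log(2*pi);
ell_hat = exp(fminsearch(nlml, 0))           % optimize over t = log(ell), starting at ell = 1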

20 Why, in principle, does Bayesian inference work? Occam's Razor. [Figure: the marginal likelihood P(Y|M_i) plotted over all possible data sets Y for a model that is too simple, one that is "just right", and one that is too complex.]

21 The weight-space view. Linear model: y = xᵀw + ε = f(x) + ε, where ε ~ N(0, σ²_noise) is assumed to be independent Gaussian noise. Likelihood: p(y|X, w) = Π_{i=1..n} p(y_i|x_i, w) = Π_{i=1..n} (2πσ²_n)^(−1/2) exp(−(y_i − x_iᵀw)²/(2σ²_n)) = (2πσ²_n)^(−n/2) exp(−(1/(2σ²_n)) ||y − Xᵀw||²) = N(Xᵀw, σ²_n I). Note that the likelihood also has a Gaussian shape with respect to the parameters w, although it is not normalized as a function of w.

22 Maximum Likelihood. Maximizing the likelihood over the weights: ∂ log p(y|X, w)/∂w = 0 ⟹ w_ML = (XXᵀ)⁻¹Xy. (The solution could be ill-determined if XXᵀ is close to singular.) Predictive distribution: p(f*|x*, D, w_ML) = N(x*ᵀw_ML, σ²_n).
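
A tiny matlab/Octave illustration of the maximum likelihood solution, using the slides' convention that the columns of X are the input vectors; the data and the "true" weights are invented:

X = [1 2 3 4 5; 1 1 1 1 1];                  % D x n design matrix (second row is a constant feature)
w_true = [0.7; -0.3];                        % made-up generating weights
y = X' * w_true + 0.1 * randn(5, 1);         % noisy targets
w_ml = (X * X') \ (X * y);                   % w_ML = (X X')^{-1} X y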

23 The Bayesian Approach. Define the prior distribution p(w) = N(0, Σ_p), e.g. Σ_p = σ²_p I for some σ²_p. What does this prior mean? Posterior ∝ Likelihood × Prior: p(w|X, y) ∝ exp(−(1/(2σ²_n))(y − Xᵀw)ᵀ(y − Xᵀw)) exp(−(1/2) wᵀΣ_p⁻¹w) ∝ exp(−(1/2)(w − w̄)ᵀ(σ_n⁻²XXᵀ + Σ_p⁻¹)(w − w̄)). I.e., the posterior is Gaussian: p(w|X, y) = N(w̄ = σ_n⁻²A⁻¹Xy, A⁻¹), where A = σ_n⁻²XXᵀ + Σ_p⁻¹.

24 Making Predictions. The predictive distribution is again Gaussian: p(f*|x*, X, y) = ∫ p(f*|x*, w) p(w|X, y) dw, with f* = x*ᵀw, giving p(f*|x*, X, y) = N(σ_n⁻²x*ᵀA⁻¹Xy, x*ᵀA⁻¹x*). Notice how the error bars increase when predicting further away from the origin. Why? Outstanding problems: the model is still too limited (only linear), and what about σ²_n and Σ_p?
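
Putting the posterior and the predictive distribution together in one matlab/Octave sketch (the prior covariance, noise variance and data are all made-up illustrations):

X = [1 2 3 4 5; 1 1 1 1 1];                  % D x n inputs, as on the slides
y = X' * [0.7; -0.3] + 0.1 * randn(5, 1);    % made-up noisy targets
sn2 = 0.01;  Sigma_p = eye(2);               % noise variance and prior covariance
A     = X * X' / sn2 + inv(Sigma_p);         % A = sigma_n^{-2} X X' + Sigma_p^{-1}
w_bar = A \ (X * y) / sn2;                   % posterior mean of the weights
xs    = [6; 1];                              % a test input (with constant feature)
mu_s  = xs' * w_bar;                         % predictive mean
var_s = xs' * (A \ xs);                      % predictive variance, grows away from the data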

25 Nonlinear Regression. Instead of x, we can use a linear model in some fixed function basis φ(x) = (ϕ₁(x), ϕ₂(x), ...). Example: φ(x) = (1, x, x², ...) turns f(x) = φ(x)ᵀw into polynomial regression, which is more flexible. Note: the analysis from the previous slides can be reused by replacing x by φ(x). The maximum likelihood parameters become w_ML = (Φ(X)Φ(X)ᵀ)⁻¹Φ(X)y, with design matrix Φ(X) = (φ(x₁), φ(x₂), ...).
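
For example, a polynomial basis expansion and its maximum likelihood fit might look as follows (the degree and data are made up; here Phi is stored with one row of features per input, so the normal equations appear transposed relative to the slide's convention):

x = linspace(-1, 1, 20)';  y = sin(3*x) + 0.1 * randn(20, 1);   % made-up data
deg = 3;
Phi = x .^ (0:deg);                          % n x (deg+1) matrix with columns 1, x, x^2, x^3
w_ml = (Phi' * Phi) \ (Phi' * y);            % same formula as before, with x replaced by phi(x)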

26 Re-introducing the Feature Space. The predictive distribution: p(f*|x*, X, y) = ∫ p(f*|x*, w) p(w|X, y) dw = N(σ_n⁻²φ(x*)ᵀA⁻¹Φ(X)y, φ(x*)ᵀA⁻¹φ(x*)), where A = σ_n⁻²Φ(X)Φ(X)ᵀ + Σ_p⁻¹. This can be re-written as f*|x*, X, y ~ N(φ*ᵀΣ_pΦ(K + σ²_n I)⁻¹y, φ*ᵀΣ_pφ* − φ*ᵀΣ_pΦ(K + σ²_n I)⁻¹ΦᵀΣ_pφ*), where K = ΦᵀΣ_pΦ and φ* = φ(x*). Notice that this last expression contains φ(x) only in the form of inner products, which can possibly be replaced by kernel functions k(x, x') = φ(x)ᵀΣ_pφ(x')!
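
The following sketch checks numerically that the weight-space and kernelized forms of the predictive mean agree, using a small polynomial feature map and invented data (the feature map, prior and noise level are arbitrary choices):

x = linspace(-1, 1, 15)';  y = sin(3*x) + 0.1 * randn(15, 1);  sn2 = 0.01;   % made-up data
phi = @(z) (z .^ (0:3))';                    % feature map, D = 4 features per input (columns)
Phi = phi(x);  Sigma_p = eye(4);             % D x n feature matrix and prior covariance
xs  = 0.5;  phis = phi(xs);                  % test input and its features
A    = Phi * Phi' / sn2 + inv(Sigma_p);      % weight-space form
mu_w = phis' * (A \ (Phi * y)) / sn2;
K    = Phi' * Sigma_p * Phi;                 % kernel form, k(x,x') = phi(x)' Sigma_p phi(x')
mu_f = phis' * Sigma_p * Phi * ((K + sn2 * eye(numel(x))) \ y);
% mu_w and mu_f agree up to numerical round-off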

27 On the correspondence between weight-space and function-space views. There is an exact correspondence between the two views. Given a set of m basis functions φ(x), it is obvious that we can construct the corresponding covariance function: k(x, x') = φ(x)ᵀΣ_pφ(x'). Conversely, given a positive definite covariance function k(x, x'), we can decompose it (Mercer's theorem): k(x, x') = Σ_{i=1..∞} λ_i φ_i(x) φ_i(x'), where λ_i and φ_i are the eigenvalues and eigenfunctions respectively, obeying ∫ k(x, x') φ_i(x) dx = λ_i φ_i(x'). The function-space view is valid irrespective of whether the corresponding eigendecomposition is finite. The squared exponential covariance function does not have a finite decomposition.

28 From random functions to covariance functions. Consider the class of linear functions f(x) = ax + b, where a ~ N(0, α) and b ~ N(0, β). We can compute the mean function, µ(x) = E[f(x)] = ∫∫ f(x) p(a) p(b) da db = ∫ ax p(a) da + ∫ b p(b) db = 0, and the covariance function, k(x, x') = E[(f(x) − 0)(f(x') − 0)] = ∫∫ (ax + b)(ax' + b) p(a) p(b) da db = ∫ a²xx' p(a) da + ∫ b² p(b) db + (x + x') ∫∫ ab p(a) p(b) da db = αxx' + β.
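
This calculation can also be checked empirically: drawing many random (a, b) pairs and averaging the product of the resulting function values at two fixed inputs should approximate αxx' + β. A matlab/Octave sketch with made-up α, β and inputs:

alpha = 2; beta = 0.5;  N = 1e5;             % made-up prior variances and sample count
a = sqrt(alpha) * randn(N, 1);  b = sqrt(beta) * randn(N, 1);
x1 = 1.3;  x2 = -0.7;                        % two fixed inputs
f1 = a * x1 + b;  f2 = a * x2 + b;           % function values under each random draw
mean(f1 .* f2)                               % approximately alpha*x1*x2 + beta = -1.32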

29 Predictions from a noise free Gaussian Process. Imagine that we have observed the value of the function at some locations {X, f}, and seek to make predictions f* at (one or more) test inputs X*. The joint distribution under the GP is [f; f*] ~ N(0, [K(X, X) K(X, X*); K(X*, X) K(X*, X*)]). Now we can condition on the observations to obtain f*|X*, X, f ~ N(K(X*, X)K(X, X)⁻¹f, K(X*, X*) − K(X*, X)K(X, X)⁻¹K(X, X*)). Remark: conditioning a Gaussian gives another Gaussian: [x; y] ~ N(0, [A Cᵀ; C B]) ⟹ x|y ~ N(CᵀB⁻¹y, A − CᵀB⁻¹C).
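
A matlab/Octave sketch of the noise-free case with made-up observations; a tiny jitter is added only for numerical stability, so the posterior mean interpolates the data and the posterior variance vanishes at the observed inputs:

k  = @(A, B) exp(-0.5 * (A - B').^2);        % squared exponential covariance
X  = [-3; -1; 0; 2];  f = sin(X);            % exact (noise-free) function values
xs = linspace(-5, 5, 101)';                  % test inputs X*
Kxx = k(X, X) + 1e-10 * eye(numel(X));       % jitter for numerical stability only
mu  = k(xs, X) * (Kxx \ f);                  % posterior mean, interpolates the data
V   = k(xs, xs) - k(xs, X) * (Kxx \ k(X, xs));   % posterior covariance, ~0 at the data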

30 Predictions from a noisy Gaussian Process. More commonly, we have access to noisy observations {X, y}. Assuming independent Gaussian noise, the joint distribution of observations and test predictions is [y; f*] ~ N(0, [K(X, X) + σ²_n I, K(X, X*); K(X*, X), K(X*, X*)]). Conditioning on the observations: f*|X*, X, y ~ N(K(X*, X)[K(X, X) + σ²_n I]⁻¹y, K(X*, X*) − K(X*, X)[K(X, X) + σ²_n I]⁻¹K(X, X*)). Note: equating K(X, Z) with Φ(X)ᵀΣ_pΦ(Z) for the Bayesian linear model gives exactly identical results!

31 Some Interpretation. Recall our main result: f*|X*, X, y ~ N(K(X*, X)[K(X, X) + σ²_n I]⁻¹y, K(X*, X*) − K(X*, X)[K(X, X) + σ²_n I]⁻¹K(X, X*)). The mean is linear in two ways: µ(x*) = k(x*, X)[K(X, X) + σ²_n I]⁻¹y = Σ_{c=1..n} β_c y⁽ᶜ⁾ = Σ_{c=1..n} α_c k(x*, x⁽ᶜ⁾). The last form is the one most commonly encountered in the kernel literature. The variance is the difference between two terms: V(x*) = k(x*, x*) − k(x*, X)[K(X, X) + σ²_n I]⁻¹k(X, x*), where the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
