COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes. Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca). Slides mostly by: Herke van Hoof (herke.vanhoof@mcgill.ca). Class web page: www.cs.mcgill.ca/~hvanho2/comp551. Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

Announcements. Change in office hours next week: Wednesday morning until 2pm, MC 232. Project 4 Kaggle submission due today! Written report due tomorrow. No hard copy needs to be submitted, just submit on MyCourses. Number of Kaggle submissions increased to 4/day.

Beyond linear regression. Relying on features can be problematic, and we tried to avoid using features before: Lecture 8, instance-based learning (use distances!), and the earlier lecture on support vector machines (use kernels!). This class: extend regression to nonparametric models with Gaussian processes!

Recall: Kernels. A kernel is a function of two arguments which corresponds to a dot product in some feature space: $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Advantages of using kernels: sometimes evaluating k() is cheaper than evaluating features and taking the dot product; sometimes k() corresponds to an inner product in a feature space with infinite dimensions. Examples: $k(x_i, x_j) = (x_i^T x_j)^d$, $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2)$.
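To make the dot-product view concrete, here is a minimal sketch (not from the slides; the explicit quadratic feature map is chosen for illustration) showing that the polynomial kernel with d = 2 gives the same number as explicitly mapping to features and taking the dot product:

```python
import numpy as np

def poly_kernel(xi, xj, d=2):
    # k(xi, xj) = (xi^T xj)^d, computed without ever building the feature vectors
    return (xi @ xj) ** d

def quad_features(x):
    # explicit feature map phi(x) for the d = 2 polynomial kernel:
    # all pairwise products x_a * x_b over the input dimensions
    return np.outer(x, x).ravel()

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])

print(poly_kernel(xi, xj, d=2))               # kernel evaluation: O(M) work
print(quad_features(xi) @ quad_features(xj))  # same value via explicit features: O(M^2) work
```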

Recall: Kernels. To kernelize an algorithm: try to formulate the algorithm so that feature vectors only ever occur in inner products, then replace the inner products by kernel evaluations (the kernel trick).

Recall: kernel regression. Given a dataset, how do we calculate the y value for a new input? Regression: learn a weighted function of features, $y = w^T x$. Kernel regression: don't learn any parameters! Instead, use the y's of neighbouring data points. Image source: http://mccormickml.com/

Recall: kernel regression. What y should we predict for x = 6? We could, e.g., take an average of the surrounding y values. In general: calculate a weight for how much each data point contributes, and take the weighted average. The kernel is just a weighting function. Image source: http://mccormickml.com/

Recall: kernel regression. A common kernel is the Gaussian: points nearby contribute more, points further away contribute less. The variance controls how many neighbouring points are used; higher sigma gives a smoother function. Image source: http://mccormickml.com/

Recall: kernel regression. Kernel regression is non-parametric: no parameters are explicitly learned; we just use nearby datapoints to make predictions. The kernel can be thought of as a distance measure, defining which points are considered nearby for each input. We kernelized linear regression; can we kernelize Bayesian linear regression? Start with just the mean.
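A minimal sketch of this weighted-average view of kernel regression with a Gaussian kernel (the toy data and the bandwidth value below are made up for illustration):

```python
import numpy as np

def gaussian_kernel(x_star, x, sigma=1.0):
    # weight for each training point: nearby points get weight near 1, far points near 0
    return np.exp(-(x_star - x) ** 2 / (2 * sigma ** 2))

def kernel_regression_predict(x_star, X_train, y_train, sigma=1.0):
    # prediction is the kernel-weighted average of the training targets
    w = gaussian_kernel(x_star, X_train, sigma)
    return np.sum(w * y_train) / np.sum(w)

# toy 1-D data (made up for illustration)
X_train = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y_train = np.array([1.2, 1.9, 3.1, 2.8, 0.5])

print(kernel_regression_predict(6.0, X_train, y_train, sigma=1.0))
```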

Kernelizing the mean function. Inspect the solution mean from Bayesian linear regression (from last class): $p(y_* \mid \mathcal{D}) = \mathcal{N}\left(\sigma^{-2}\, x_*^T S_N X^T y,\ \sigma^2 + x_*^T S_N x_*\right)$, with $S_N = (\lambda I + \sigma^{-2} X^T X)^{-1}$. The vector $y$ concatenates the training outputs. The matrix $X$ has one column for each feature (length N) and one row for each datapoint (length M). The mean prediction is the mean of the Gaussian: $y_* = \sigma^{-2}\, x_*^T (\lambda I + \sigma^{-2} X^T X)^{-1} X^T y$.

Kernelizing the mean function. Step 2: Reformulate to only have inner products of features: $y_* = \sigma^{-2}\, x_*^T (\lambda I + \sigma^{-2} X^T X)^{-1} X^T y = \sigma^{-2}\, x_*^T X^T (\lambda I + \sigma^{-2} X X^T)^{-1} y$. The first inverse is of a (#features x #features) matrix, the second of a (#datapoints x #datapoints) matrix. In the second form, $x_*^T X^T$ plays the role of $k(x_*)^T$ (element $i$ of this vector is $\phi(x_i)^T \phi(x_*)$) and $X X^T$ plays the role of $K$ (element $i,j$ of this matrix is $\phi(x_i)^T \phi(x_j)$).
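The matrix identity behind this step, spelled out (a standard push-through identity; the intermediate line is added here, it is not on the slide):

```latex
% For any X (N x M) and scalars \lambda, \sigma^2 > 0,
%   (\lambda I_M + \sigma^{-2} X^T X)^{-1} X^T = X^T (\lambda I_N + \sigma^{-2} X X^T)^{-1},
% which follows from (\lambda I_M + \sigma^{-2} X^T X) X^T = X^T (\lambda I_N + \sigma^{-2} X X^T). Hence
\begin{align}
  y_* &= \sigma^{-2}\, x_*^T (\lambda I_M + \sigma^{-2} X^T X)^{-1} X^T y \\
      &= \sigma^{-2}\, x_*^T X^T (\lambda I_N + \sigma^{-2} X X^T)^{-1} y,
\end{align}
% so only the inner products in x_*^T X^T and X X^T are needed.
```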

Kernelizing the mean function. Step 3: Replace inner products by kernel evaluations: $y_* = k(x_*)^T (\alpha I + K)^{-1} y$, where element $i,j$ of the matrix $K$ is $k(x_i, x_j)$ and element $i$ of the vector $k(x_*)$ is $k(x_i, x_*)$. Remember: this mean function is the same as ridge regression, so this is kernel ridge regression.
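A minimal sketch of this prediction rule with a Gaussian kernel (the toy data and the values of alpha and sigma are placeholders, not values from the lecture):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def kernel_ridge_predict(X_star, X_train, y_train, alpha=0.1, sigma=1.0):
    # y_* = k(x_*)^T (alpha I + K)^{-1} y
    K = rbf_kernel(X_train, X_train, sigma)        # N x N
    k_star = rbf_kernel(X_star, X_train, sigma)    # N_* x N
    coef = np.linalg.solve(alpha * np.eye(len(X_train)) + K, y_train)
    return k_star @ coef

# toy data (made up for illustration)
X_train = np.array([[1.0], [2.0], [4.0], [5.0], [7.0]])
y_train = np.array([1.2, 1.9, 3.1, 2.8, 0.5])
X_star = np.array([[3.0], [6.0]])

print(kernel_ridge_predict(X_star, X_train, y_train, alpha=0.1, sigma=1.0))
```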

Recall: Ridge regression. Bayesian linear regression: $\max_w\ -\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - w^T x_n)^2 - \frac{\lambda}{2} w^T w + \text{const}$. Ridge regression: $\min_w\ \sum_{n=1}^N (y_n - w^T x_n)^2 + \alpha\, w^T w$. Difference between the two? Bayesian linear regression learns a distribution over parameters. So kernelized mean prediction with Bayesian linear regression <=> kernel ridge regression, with $\alpha = \lambda \sigma^2$.
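Spelling out the correspondence (the intermediate step is filled in here; the slide only states the result):

```latex
% MAP for Bayesian linear regression (noise variance \sigma^2, prior precision \lambda):
\begin{align}
  \max_w\ & -\tfrac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - w^T x_n)^2 - \tfrac{\lambda}{2}\, w^T w \\
  \Longleftrightarrow\ \min_w\ & \sum_{n=1}^{N}(y_n - w^T x_n)^2 + \lambda\sigma^2\, w^T w,
\end{align}
% i.e. ridge regression with regularisation parameter \alpha = \lambda\sigma^2.
```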

Kernel ridge regression. Choosing a kernel: [three panels showing fits with different kernels] $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, $k(x_i, x_j) = x_i^T x_j$, $k(x_i, x_j) = \exp(-\|x_i - x_j\|)$.

Kernel ridge regression. Setting parameters: sigma controls which data points are considered close. [three panels showing fits for different values of $\sigma$] $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$.

Kernel ridge regression. Setting parameters: alpha controls smoothness. [three panels: small alpha, medium alpha, large alpha] $y_* = k(x_*)^T (\alpha I + K)^{-1} y$.

Why add the ridge? As before, kernel regression can easily overfit: regularisation is critical! [plot of the fit with $\alpha = 0$ (aka kernel regression)]

Kernel regression: Practical issues. Compare ridge regression, $w = (\alpha I + X^T X)^{-1} X^T y$: inverse $O(d^3)$, matrix-vector product $O(d^2 N)$, prediction $O(d)$, memory $O(d)$. Kernel ridge regression, $y_* = k(x_*)^T (\alpha I + K)^{-1} y$: inverse and product $O(N^3)$, prediction $O(N)$, memory $O(N)$. (d = feature dimension, N = # datapoints.)

Kernel regression: Practical issues. If we have a small set of good features, it's faster to do regression in feature space. However, if no good features are available (or we need a very big set of features), kernel regression might yield better results. Often, it is easier to pick a kernel than to choose a good set of features.

Kernelizing Bayesian linear regression. We have now kernelized ridge regression. Can we kernelize Bayesian linear regression, too? I.e., can we kernelize the covariance / uncertainty? [diagram: linear regression -> (kernel regression); ridge regression -> kernel ridge regression; Bayesian linear regression -> ?]

Kernelizing Bayesian linear regression. We have now kernelized ridge regression. Can we kernelize Bayesian linear regression, too? I.e., can we kernelize the covariance / uncertainty? Yes, and this is called Gaussian process regression (GPR). [diagram: linear regression -> (kernel regression); ridge regression -> kernel ridge regression; Bayesian linear regression -> Gaussian process]

Gaussian processes: high level. GPs are defined by a mean function and a covariance function. The mean function is derived in the same way as kernel ridge regression (based on surrounding data points). The covariance is defined by the kernel: $\text{Cov}(f(x), f(x')) = k(x, x')$. GPs are a Bayesian method, so we need to specify a prior distribution.

Gaussian processes. The mean function was derived already; the variance can be similarly derived. Formal definition: a function f is a GP if any finite set of values $f(x_1), \ldots, f(x_N)$ follows a multivariate Gaussian distribution. Assumption: outputs are correlated. Copyright C.M. Bishop, PRML.

Deriving GP equations. Model: we are interested in the function values $y_1, y_2, \ldots$ at a set of points $x_1, x_2, \ldots$. We observe target values $t$ for the training set, but we assume these are noisy: $t_n = y_n + \epsilon_n$. Prior: $y \sim \mathcal{N}(0, K)$, with $y$ a vector of function values and $K$ the kernel matrix. Likelihood (Gaussian noise on output): $t \sim \mathcal{N}(y, \sigma^2 I)$. [figure: red: $y_n$, green: $t_n$] Copyright C.M. Bishop, PRML.

Examples from the prior $y \sim \mathcal{N}(0, K)$: [sample functions drawn with two kernels] $k(x, y) = \exp(-(x - y)^2)$ and $k(x, y) = \exp(-|x - y|)$. Copyright C.M. Bishop, PRML.
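A minimal sketch of drawing sample functions from the prior $y \sim \mathcal{N}(0, K)$ with a squared-exponential kernel (the grid, seed, and jitter term are implementation choices, not from the slide):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-(A_i - B_j)^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)                    # grid where we evaluate the function
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

# each sample is one random function from the prior, evaluated on the grid
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 100): three functions drawn from the prior
```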

GP Regression. The prior and likelihood are Gaussian, so we again obtain a closed-form solution: $\mathbb{E}[y_*] = y^T (K + \sigma^2 I)^{-1} k(x_*)$ (same as kernel ridge regression); $\text{Cov}[y_*] = k(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*)$ (prior variance minus the reduction in variance due to close training points); for prediction of new observations, $\text{Cov}[t_*] = k(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*) + \sigma^2$ (add the noise term). Easy to implement!
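The closed-form equations above translate directly into code; a minimal sketch using the observed targets t (the kernel hyperparameters and toy data are placeholders):

```python
import numpy as np

def rbf_kernel(A, B, s=1.0, ell=1.0):
    # k(x_i, x_j) = s^2 exp(-(x_i - x_j)^2 / (2 ell^2)) for 1-D inputs
    return s**2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell**2))

def gp_predict(x_star, x_train, t_train, noise_var=0.1, s=1.0, ell=1.0):
    K = rbf_kernel(x_train, x_train, s, ell)       # N x N
    k_star = rbf_kernel(x_train, x_star, s, ell)   # N x N_*
    A = K + noise_var * np.eye(len(x_train))       # K + sigma^2 I
    weights = np.linalg.solve(A, t_train)          # (K + sigma^2 I)^{-1} t
    mean = k_star.T @ weights                      # E[y_*]
    v = np.linalg.solve(A, k_star)
    # diagonal of Cov[y_*]: prior variance minus reduction from nearby training points
    var_f = rbf_kernel(x_star, x_star, s, ell).diagonal() - np.sum(k_star * v, axis=0)
    return mean, var_f, var_f + noise_var          # add noise term for Cov[t_*]

# toy data (made up for illustration)
x_train = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
t_train = np.array([1.2, 1.9, 3.1, 2.8, 0.5])
x_star = np.linspace(0, 8, 5)

mean, var_f, var_t = gp_predict(x_star, x_train, t_train)
print(mean)
print(var_f)   # uncertainty about the function value
print(var_t)   # uncertainty about a new noisy observation
```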

GP Regression. Results of GP regression: [plot showing $\mathbb{E}[y_* \mid t]$ together with $\mathbb{E}[y_* \mid t] \pm \sqrt{\text{Cov}[y_* \mid t]}$, calculated for many possible inputs $x_*$; $t$: set of observed points]

GP Regression: hyperparameters. The hyperparameters are the assumed noise (variance of the likelihood), $t \sim \mathcal{N}(y, \sigma^2 I)$, and any parameters of the kernel. Typical kernel: $k(x_i, x_j) = s^2 \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)$, where $s$ is the scale (standard deviation prior to seeing data) and $\ell$ is the bandwidth (which datapoints are considered close). Effective regularisation: $\sigma^2 / s^2$. Knowing the meaning of the parameters helps tune them.

GP Regression: hyperparameters. Assumed noise (variance of the likelihood): $t \sim \mathcal{N}(y, \sigma^2 I)$; effective regularisation $\sigma^2 / s^2$. [three panels showing fits for different values of $\sigma$] Mostly changes behaviour close to training points.

GP Regression: hyperparameters. Kernel scale: $k(x_i, x_j) = s^2 \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)$; effective regularisation $\sigma^2 / s^2$. [three panels showing fits for different values of $s$] Mostly changes behaviour further away from training points.

GP Regression: hyperparameters. Kernel bandwidth: $k(x_i, x_j) = s^2 \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)$. [three panels showing fits for different values of $\ell$] Changes what is considered close or far.

GPs: Practical issues. Complexity is pretty much similar to kernel regression, except for calculating the predictive variance: $\mathbb{E}[y_*] = y^T (K + \sigma^2 I)^{-1} k(x_*)$, $\text{Cov}[y_*] = k(x_*, x_*) - k(x_*)^T (K + \sigma^2 I)^{-1} k(x_*)$. Inverse and product $O(N^3)$; prediction $O(N)$ for the mean and $O(N^2)$ for the variance; memory $O(N)$.

GPs: Practical issues. For small datasets, GPR is a state-of-the-art method! Advantages: it provides uncertainty, and it is flexible yet can control overfitting. Computational costs are acceptable for small datasets. It has been applied to robotics & control, hyperparameter optimization, MRI data, weather prediction, ... For large datasets, uncertainty is not as important and GPs are expensive, but good approximations exist. Specifying the right prior (kernel!) is important!

More resources on GPs. Lectures by Nando de Freitas: https://www.youtube.com/watch?v=4vgihc35j9s&t=s&index=8&list=PLE6wd9fr--EdyJ5lbFl8UuGjecvVw66F6. Gaussian processes for dummies: http://katbailey.github.io/post/gaussian-processes-for-dummies/. Gaussian processes textbook: http://www.gaussianprocess.org/gpml/ (free download).

Bayesian methods in practice. Time complexity varies compared to frequentist methods; memory complexity can be higher (e.g. needing to store the mean plus uncertainty can be quadratic rather than linear). With lots of data everywhere, the posterior is close to a point estimate (might as well use frequentist methods). With little data everywhere, prior information helps the bias/variance trade-off. With some areas with little data and some areas with lots of data, uncertainty helps to decide where predictions are reliable.

Inference in more complex models. We saw some examples with a closed-form posterior; in many complex models there is no closed-form representation. Variational inference (deterministic): consider a family of distributions we can represent (e.g. Gaussian) and use optimisation techniques to find the best of these. Sampling (stochastic): try to directly sample from the posterior; expectations can be approximated using the samples. Maximum a posteriori (point estimate).

What you should know. Previous lectures: What is the Bayesian view of probability? Why can the Bayesian view be beneficial? The role of the following distributions: likelihood, prior, posterior, posterior predictive. The key idea of Bayesian regression and its properties. This lecture: the key idea of kernel regression and its properties, and the main idea behind Gaussian process regression.