Introduction to Gaussian Process

Introduction to Gaussian Process
CS 778
Chris Tensmeyer

What Topic?
Machine Learning → Regression → Bayesian ML → Bayesian Regression → Bayesian Non-parametrics → Gaussian Process (GP) → GP Regression → GP Regression with a Squared Exponential Kernel

Regression
Predict a real value as output.
Learn an underlying target function from data samples of that function (with noise).
E.g. predict a credit score.
Estimate model parameters from data points: for y = w·x, estimate w to minimize a loss.

Paper Titles
Gaussian process classification for segmenting and annotating sequences
Discovering hidden features with Gaussian processes regression
Active learning with Gaussian processes for object categorization
Semi-supervised learning via Gaussian processes
Discriminative Gaussian process latent variable model for classification
Computation with infinite neural networks
Fast sparse Gaussian process methods: The informative vector machine
Approximate Dynamic Programming with Gaussian Processes

Paper Titles (continued)
Learning to control an octopus arm with Gaussian process temporal difference methods
Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp
Gaussian process latent variable models for visualization of high dimensional data
Bayesian unsupervised signal classification by Dirichlet process mixtures of Gaussian processes

Bayesian ML
Probability theory: what makes a function a probability distribution?
Estimate a distribution over the model parameters θ that generated D (θ defines the function or hypothesis).
Important distributions:
Prior: p(θ)
Data Likelihood: p(D | θ)
Posterior: p(θ | D) = p(D | θ) p(θ) / p(D)
Posterior Predictive: p(y | x) = ∫ p(y | x, θ) p(θ | D) dθ
Let all possible functions (one per θ) vote for the correct value.

Running Example
Predict weight given height (lbs and inches).
Let's use a linear model with Gaussian noise: 2 parameters, m and σ².
Let's assume σ² = 5, so Weight ~ N(m · height, σ²), where
p(x; μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
Let's use a prior for m.

Prior
GENERAL: p(θ)
Your own bias about what makes a good solution; it scores solutions without seeing data.
Encoded as a distribution over parameters.
EXAMPLE: Let's make 5 good guesses about what m is (units are lbs/inch) and set p(m = guess_i) = 0.2.
Other possible values of m get 0 probability under our prior.
Normally we use a prior which assigns everything a non-zero probability.

Data Likelihood
GENERAL: p(D | θ)
θ describes the process of creating D. Assuming θ reflects reality, out of all possible datasets, how likely is the D we got?
Maximizing this over θ gives Maximum Likelihood Estimation (MLE).
EXAMPLE: Data: height h_i -> weight w_i
70in -> 150lbs
80in -> 200lbs
60in -> 130lbs
p(D | m, σ²) = ∏_{i=1..|D|} N(w_i; h_i·m, σ²)
The m that maximizes this likelihood is the MLE solution. Is it one of our 5 guesses? No. What is it? m = 686/298 = 2.302
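To make the MLE concrete, here is a minimal numpy sketch that recovers the slide's estimate for the linear-through-origin model; the closed form m̂ = Σ_i h_i w_i / Σ_i h_i² is standard, and the data are the three points above.

```python
import numpy as np

# Heights (inches) and weights (lbs) from the running example
h = np.array([70.0, 80.0, 60.0])
w = np.array([150.0, 200.0, 130.0])

# For Weight ~ N(m * height, sigma^2), the MLE of m has the closed form
#   m_hat = sum_i h_i * w_i / sum_i h_i^2
m_hat = (h @ w) / (h @ h)
print(m_hat)  # ~2.302, matching 686/298 on the slide
```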

Posterior
GENERAL: Bayes' Theorem
p(θ | D) = p(D | θ) p(θ) / p(D)
Combines the prior with the data likelihood.
Often we find the θ that maximizes p(θ | D): the maximum a posteriori (MAP) estimate. We can ignore p(D) in this case.
EXAMPLE: Plug each of our guesses for m into the numerator.
Get the denominator by requiring Σ_i p(m = guess_i | D) = 1.
The guess with the highest posterior probability is the MAP estimate.

Posterior Predictive
GENERAL: p(y | x) = ∫ p(y | x, θ) p(θ | D) dθ
Let all θ vote, weighted by posterior probability: a Bayes-optimal ensemble.
The integral is rarely tractable; MLE or MAP is easier.
Really we want argmax_y p(y | x); we can also compute confidence intervals.
EXAMPLE: Given a test point with height = 75in, predict weight.
Using MLE? Using MAP?
Posterior predictive: guess a weight and score it under each of the 5 values of m, weight the scores by the posterior over m, and search for argmax_w p(w | h).
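A small sketch of this whole pipeline for the running example follows. The five guesses for m are hypothetical values of my choosing (the slides do not list them); the sketch computes the posterior over the guesses, the MAP estimate, and the posterior predictive for a 75in test point.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values for the 5 guesses (the slides do not list the actual ones)
guesses = np.array([1.5, 2.0, 2.5, 3.0, 3.5])    # lbs per inch
prior = np.full(len(guesses), 0.2)               # uniform prior over the guesses
sigma2 = 5.0

h = np.array([70.0, 80.0, 60.0])                 # heights (in)
w = np.array([150.0, 200.0, 130.0])              # weights (lbs)

# Likelihood of the data under each guess: prod_i N(w_i; m * h_i, sigma2)
# (for larger datasets, work with log-likelihoods to avoid underflow)
lik = np.array([norm.pdf(w, m * h, np.sqrt(sigma2)).prod() for m in guesses])

# Posterior via Bayes' theorem; normalizing enforces sum_i p(m = guess_i | D) = 1
post = prior * lik
post /= post.sum()
m_map = guesses[np.argmax(post)]                 # MAP estimate

# Posterior predictive for a 75in person: a mixture of Gaussians, one per guess
w_grid = np.linspace(100, 300, 2001)
pred = sum(p * norm.pdf(w_grid, m * 75.0, np.sqrt(sigma2))
           for p, m in zip(post, guesses))
print(m_map, w_grid[np.argmax(pred)])            # MAP slope, most probable weight
```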

Other Regression Methods
Simple Linear Regression (LR)
y = w·x, or in Bayesian land: y ~ N(w·x, σ²)
Ordinary Least Squares: find w to minimize SSE = Σ_t (w·x_t − y_t)²
Equivalent to maximizing p(D | w) = ∏_t p(y_t | x_t, w) = ∏_t N(y_t; w·x_t, σ²)
Closed form solution: w = (XᵀX)⁻¹ Xᵀ y
Kernelized Linear Regression (KLR)
In LR, replace x with φ(x), where φ is a non-linear feature transform.
The kernel trick lets you use φ(x) implicitly, for efficiency.
E.g. learn the coefficients of an order-N polynomial.
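As a quick illustration of the closed form (on made-up toy data, not from the slides), the following sketch fits plain OLS and then a polynomial basis expansion with the same formula.

```python
import numpy as np

# Toy 1-D data, y ~ 2x + noise (made-up values, not from the slides)
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=50)
y = 2.0 * x + rng.normal(0, 0.5, size=50)

# Ordinary least squares, closed form: w = (X^T X)^{-1} X^T y
X = x[:, None]                          # design matrix with a single feature
w = np.linalg.solve(X.T @ X, X.T @ y)   # solve() is preferable to an explicit inverse
print(w)                                # ~[2.0]

# Basis-expanded flavour: replace x with a feature map phi(x), e.g. degree-3
# polynomial features, and reuse exactly the same formula
Phi = np.vander(x, N=4, increasing=True)          # columns: 1, x, x^2, x^3
w_poly = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```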

Other Regression Methods
Ridge Regression (RR)
Like LR, but with an L2 weight penalty to encourage small weights (less prone to overfitting).
The L2 penalty corresponds to a Gaussian prior on the weights:
w_i ~ N(0, σ_w²)  =>  p(w) ∝ exp(−Σ_i w_i² / (2σ_w²))  =>  −log p(w) ∝ Σ_i w_i²
Find w to minimize Obj(w) = Σ_t (w·x_t − y_t)² + λ Σ_i w_i²
Lasso Regression (Lasso)
LR with an L1 penalty, corresponding to a Laplacian prior: p(w_i) ∝ exp(−|w_i|)
Yields fewer non-zero weights (sparse).
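For intuition about the two penalties, here is a small scikit-learn sketch on made-up data with several irrelevant features; the data and the alpha values are arbitrary choices, not from the slides.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data where only 2 of 10 features matter (not from the slides)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all weights a little
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives many weights exactly to zero
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
```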

Other Regression Methods
Bayesian Regression (BR)
Like RR, but λ is estimated from the data via a prior over the prior:
w_i ~ N(0, σ_w²), with σ_w² ~ Γ(α, β), where α, β are user-set.
Automatic Relevance Determination (ARD)
Like BR, but w_i ~ N(0, σ_i²) with σ_i² ~ Γ(α, β): each weight comes from its own prior distribution.
Good for feature selection.
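scikit-learn's BayesianRidge and ARDRegression implement these two ideas (strictly, they place Gamma priors on weight precisions rather than variances, but the spirit is the same). Continuing the sketch above:

```python
from sklearn.linear_model import BayesianRidge, ARDRegression

# Reusing X and y from the ridge/lasso sketch above
br = BayesianRidge().fit(X, y)      # one shared (learned) prior scale for all weights
ard = ARDRegression().fit(X, y)     # a separate prior scale per weight
print(br.coef_.round(2))            # small weights everywhere
print(ard.coef_.round(2))           # irrelevant weights pushed toward exactly zero
```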

Gaussian Process
A GP is a stochastic process whose random variables are Gaussian distributed.
(Figure: at the input x = 1, the distribution of f(1) is a Gaussian.)

Gaussian Process - MVN
Generalization of the Multi-Variate Normal (MVN) to infinitely many dimensions.

Gaussian Process - Background
First theorized in the 1940s.
Used in geostatistics and meteorology in the 1970s for time series prediction, under the name kriging.

Smooth Functions
How would you measure the smoothness of a function?
Smoothness is a common prior or bias: function values close in input space should be similar.
How similar is f(1) to f(1.1)? Compared to f(1) and f(10)?
For any finite set of points {x_i}, we can fit an MVN to a training set of smooth functions to learn (μ, Σ).
Smoothness(f) ≈ p_MVN(f(x_0), ..., f(x_i) | μ, Σ)

Dist. Over Functions
A function == an infinite lookup table == an infinite vector of values, where each entry is the value of the function at some input.
Problem with our MVN smoothness at a finite number of points: functions score well even if they are not smooth between those points.
A GP solves this by being non-parametric:
Σ is replaced by a covariance function k(x, x′)
μ is replaced by a mean function m(x)
A GP is infinite-dimensional, but any finite subset of dimensions is MVN.
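One way to see "a distribution over functions" is to sample from the GP prior: evaluate the mean and covariance on any finite grid of inputs and draw from the resulting MVN. A minimal sketch with a zero mean and a squared exponential kernel (the kernel form and length-scale are my choices):

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=1.0):
    """Squared exponential kernel k(x, x') = exp(-0.5 * (x - x')^2 / l^2)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Any finite set of inputs gives an MVN with covariance K and (here) zero mean
x = np.linspace(-5, 5, 200)
K = sq_exp_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)   # (3, 200): three sampled "functions" evaluated at x
```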

Covariance Function
k(x_i, x_j) yields the Σ for the MVN over any finite subset {f(x_i)}: Σ_ij = k(x_i, x_j).
For any choice of {x_i}, k must make Σ positive definite; otherwise the MVN is not defined.
It is common to subtract out the mean function from the train/test points; it is less common to learn the mean function from data.
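A quick way to sanity-check the positive-definiteness requirement for a candidate covariance matrix is to attempt a Cholesky factorization, which succeeds exactly when the matrix is (numerically) positive definite. A small sketch; the jitter term is a common practical addition, not something from the slides:

```python
import numpy as np

def is_positive_definite(Sigma, jitter=1e-10):
    """Return True if Sigma (plus a tiny diagonal jitter) is positive definite."""
    try:
        np.linalg.cholesky(Sigma + jitter * np.eye(len(Sigma)))
        return True
    except np.linalg.LinAlgError:
        return False
```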

Covariance Functions
(Figure: examples of common covariance functions.)

Covariance Functions
Learning in a GP means finding hyperparameters that fit the data well!
Many design choices: kernel; fixed vs. learned; global vs. local; per-dimension.
The math is messy: do (S)GD on the hyperparameters to maximize the log probability of the training data.
Observation Noise
A universal hyperparameter: σ_n is our estimate of how noisily the training targets were measured.
Similar to Bayesian LR; helps prevent overfitting.

Hyperparameters - Length-Scale
(Figure: the effect of the length-scale hyperparameter on sampled functions.)

Inference/Prediction
The joint distribution of the training targets and a test target is MVN.
We condition this MVN on the observed training targets.
The resulting conditional distribution is Gaussian; use its mean/covariance for prediction.

Magic Math (from Wikipedia)
Conditioning the joint Gaussian on the training targets t gives the standard GP regression equations:
μ* = k*ᵀ K⁻¹ t
σ*² = k** − k*ᵀ K⁻¹ k*
where K = k(X, X) + σ_n I includes the observation noise, k* is the vector of covariances between the test point and the training points, and k** = k(x*, x*).

Discussion
PROS
The learned function is smooth but not constrained to a particular form.
Confidence intervals come for free.
A good prior for other models (e.g. biasing degree-N polynomials to be smooth).
Can predict at multiple test points that co-vary appropriately.
CONS
The math/optimization is messy; lots of approximations are used.
Requires an NxN matrix inversion and has numerical stability issues.
Scales poorly in the number of instances (like SVMs).
High dimensional data is challenging.

Implementation
Compute K⁻¹ only once, using numerically stable methods:
Cholesky decomposition: K = LLᵀ, where L is lower triangular.
For each test instance, compute the covariance vector k*.
σ_n I is our observation noise, added to the raw k(X, X).
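A minimal numpy/scipy sketch of this recipe, assuming the same squared exponential kernel and noise term used in the worked example that follows (the function and variable names are mine):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sq_exp(xa, xb):
    """k(x, x') = exp(-0.5 * (x - x')^2), as in the worked example."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2)

def gp_predict(x_train, y_train, x_test, noise=0.25):
    """Predictive mean and variance of GP regression, via Cholesky."""
    K = sq_exp(x_train, x_train) + noise * np.eye(len(x_train))
    L = cho_factor(K, lower=True)       # factor K = L L^T once
    alpha = cho_solve(L, y_train)       # alpha = K^{-1} t
    k_star = sq_exp(x_train, x_test)    # train-test covariances
    mean = k_star.T @ alpha             # k*^T K^{-1} t
    v = cho_solve(L, k_star)            # K^{-1} k*
    var = 1.0 - np.sum(k_star * v, axis=0)   # k** = 1 for this kernel
    return mean, var
```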

Example
Using a squared exponential kernel: k(x, x′) = exp(−0.5 (x − x′)²)
Task: given the data below, predict f(2) and f(4), with σ_n = 0.25.

X    f(x)
0    -5
1    0
2    ?
3    5
4    ?

Example - Covariance Matrix
Begin by calculating the covariance matrix, using f(0), f(1), f(3) as the dimensions of the MVN.
k(x, x′) = exp(−0.5 (x − x′)²), so k(x, x) = 1
k(0,1) = exp(−0.5 · 1²) = e^(−0.5)
k(0,3) = exp(−0.5 · 3²) = e^(−4.5)
k(1,3) = exp(−0.5 · 2²) = e^(−2)

K = [ 1         e^(−0.5)  e^(−4.5) ]
    [ e^(−0.5)  1         e^(−2)   ]
    [ e^(−4.5)  e^(−2)    1        ]  + σ_n I

Example - Matrix Inversion
K = [ 1         e^(−0.5)  e^(−4.5) ]       [ 1.25   0.607  0.011 ]
    [ e^(−0.5)  1         e^(−2)   ]  ≈    [ 0.607  1.25   0.135 ]
    [ e^(−4.5)  e^(−2)    1        ] + σ_n I   [ 0.011  0.135  1.25  ]

Next step, invert K:
K⁻¹ ≈ [  1.049  −0.514   0.046 ]
      [ −0.514   1.061  −0.110 ]
      [  0.046  −0.110   0.812 ]

Example - Multiply by Targets
We need K⁻¹ t, where t = [−5, 0, 5]ᵀ:
K⁻¹ t ≈ [−5.013, 2.018, 3.826]ᵀ

Example - Predicting f(2)
Now we need k* for f(2):
k(2,0) = exp(−0.5 · 2²) = e^(−2)
k(2,1) = k(2,3) = e^(−0.5)
k* = [0.135, 0.607, 0.607]ᵀ
k** = k(2,2) = 1 (no observation noise on the test point)
K⁻¹ t ≈ [−5.013, 2.018, 3.826]ᵀ (from before)

Example - Predicting f(2)
k* = [0.135, 0.607, 0.607]ᵀ, K⁻¹ t ≈ [−5.013, 2.018, 3.826]ᵀ
Predictive mean: μ = k*ᵀ K⁻¹ t ≈ 2.866
k*ᵀ K⁻¹ ≈ [−0.142, 0.507, 0.432], so k*ᵀ K⁻¹ k* ≈ 0.550
Predictive variance: σ² = k** − k*ᵀ K⁻¹ k* = 1 − 0.55 = 0.45

Example - Predicting f(4)
k* = [k(4,0), k(4,1), k(4,3)]ᵀ = [0.00033, 0.011, 0.607]ᵀ, K⁻¹ t ≈ [−5.013, 2.018, 3.826]ᵀ
Predictive mean: μ = k*ᵀ K⁻¹ t ≈ 2.341
k*ᵀ K⁻¹ ≈ [0.023, −0.055, 0.491], so k*ᵀ K⁻¹ k* ≈ 0.297
Predictive variance: σ² = k** − k*ᵀ K⁻¹ k* = 1 − 0.297 = 0.703
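Running the gp_predict sketch from the Implementation slide on this data reproduces these numbers (up to rounding); this usage snippet assumes that earlier block (and its numpy import) is in scope.

```python
x_train = np.array([0.0, 1.0, 3.0])
y_train = np.array([-5.0, 0.0, 5.0])
x_test  = np.array([2.0, 4.0])

mean, var = gp_predict(x_train, y_train, x_test, noise=0.25)
print(mean)  # ~[2.87, 2.34]
print(var)   # ~[0.45, 0.70]
```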

Example - Plot
X    f(x)
0    -5
1    0
2    2.8 (predicted)
3    5
4    2.3 (predicted)

My Experiments
1-D synthetic datasets generated with Gaussian noise: Linear, Cubic, Periodic, Sigmoid, Complex.
40 training instances on the interval [-5, 5], with additive Gaussian noise, σ = 0.5.
Test on 1000 evenly spaced instances without noise (report MSE).
Use 3-fold CV on the training data to set hyperparameters with grid search.
Compare with kernelized Lasso regression (polynomial and sine function basis) and 7-NN with inverse distance weighting.
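A rough sketch of how one of these synthetic experiments might look with scikit-learn's GP implementation; the target below assumes the linear dataset y = 0.5x − 1, and the exact data generation, CV procedure, and hyperparameter grids from the slides are not reproduced.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error

# Assumed setup for the "Linear" dataset: y = 0.5x - 1 plus Gaussian noise
rng = np.random.default_rng(0)
x_train = rng.uniform(-5, 5, size=(40, 1))
y_train = 0.5 * x_train[:, 0] - 1 + rng.normal(0, 0.5, size=40)
x_test = np.linspace(-5, 5, 1000)[:, None]
y_test = 0.5 * x_test[:, 0] - 1                      # noise-free test targets

# Squared exponential kernel plus a learned observation-noise term
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.25)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x_train, y_train)
print(mean_squared_error(y_test, gp.predict(x_test)))
```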

Results - Linear
y = 0.5x − 1

Model   MSE
GP      0.009
Lasso   0.019
KNN     0.058

Results - Cubic
y = 0.02x³ − 0.1x² + 0.5x + 2

Model   MSE
GP      0.046
Lasso   0.037
KNN     0.101

Results - Periodic
y = 2 sin(1.75x − 1) + 1

Model   MSE
GP      0.087
Lasso   1.444
KNN     0.350

Results - Sigmoid
y = 4 / (1.5 + e^(−0.75x − 1)) + 1

Model   MSE
GP      0.034
Lasso   0.045
KNN     0.060

Results - Complex
y = a linear combination of the previous functions

Model   MSE
GP      0.194
Lasso   0.574
KNN     0.327

Real Dataset - Concrete Slump Test
How fast does concrete flow, given its materials?
103 instances; 7 input features; 3 outputs; all real-valued.
Input features are kg per cubic meter of concrete: Cement, Slag, Fly Ash, Water, SP, Coarse Agg., Fine Agg.
Output variables are Slump, Flow, and 28-day Compressive Strength.
Ran 10-fold CV with our three models on each output variable.

Concrete Results
Slump
Model   Avg MSE ± Std
GP      0.103 ± 0.037
Lasso   0.072 ± 0.024
KNN     0.067 ± 0.040

Flow
Model   Avg MSE ± Std
GP      0.091 ± 0.026
Lasso   0.051 ± 0.020
KNN     0.063 ± 0.036

Strength
Model   Avg MSE ± Std
GP      0.030 ± 0.015
Lasso   0.004 ± 0.002
KNN     0.008 ± 0.005

Learning Hyperparameters
Find hyperparameters θ that maximize the (log) marginal likelihood of the training data.
We can use gradient descent by computing, for each hyperparameter θ_j,
∂/∂θ_j log p(y | X, θ) = ½ yᵀ K⁻¹ (∂K/∂θ_j) K⁻¹ y − ½ tr(K⁻¹ ∂K/∂θ_j)
where ∂K/∂θ_j is just the elementwise derivative of the covariance matrix, i.e. the derivative of the covariance function with respect to θ_j.

Learning Hyperparameters
For the squared exponential covariance function (plus an observation-noise term), we get the corresponding derivatives of k with respect to the length-scale, signal, and noise hyperparameters; see the sketch below.
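A sketch of these derivatives and the resulting gradient, assuming the common parameterization k(x, x′) = σ_f² exp(−(x − x′)²/(2ℓ²)) with σ_n² added on the diagonal (the slide's exact form may differ, so treat this parameterization as an assumption):

```python
import numpy as np

def neg_log_marginal_likelihood_and_grad(x, y, l, sf, sn):
    """NLL of a GP with an assumed SE kernel, and its gradient w.r.t. (l, sf, sn)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    E = np.exp(-d2 / (2 * l ** 2))
    K = sf ** 2 * E + sn ** 2 * np.eye(len(x))
    Ki = np.linalg.inv(K)               # fine for a sketch; use Cholesky in practice
    alpha = Ki @ y

    # log p(y | X, theta) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi)
    _, logdet = np.linalg.slogdet(K)
    nll = 0.5 * y @ alpha + 0.5 * logdet + 0.5 * len(x) * np.log(2 * np.pi)

    # dK/dtheta: elementwise derivatives of the covariance function
    dK_dl  = sf ** 2 * E * d2 / l ** 3
    dK_dsf = 2 * sf * E
    dK_dsn = 2 * sn * np.eye(len(x))

    # d/dtheta log p(y|X) = 1/2 y^T K^{-1} dK K^{-1} y - 1/2 tr(K^{-1} dK)
    def grad(dK):
        return -(0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Ki @ dK))
    return nll, np.array([grad(dK_dl), grad(dK_dsf), grad(dK_dsn)])
```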

Covariance Function Tradeoffs

Classification
Many regression methods become binary classifiers by passing their output through a sigmoid, softmax, or other threshold (e.g. MLPs, SVMs, logistic regression).
The same is true of GPs, but the math is scarier.

Classification Math
p(y* = 1 | X, y, x*) = ∫ σ(f*) p(f* | X, y, x*) df*
Here σ(f*) is the vote for class 1 and p(f* | X, y, x*) is the weight for that vote.
p(f* | X, y, x*) = ∫ p(f* | X, x*, f) p(f | X, y) df, where p(f* | X, x*, f) is just GP regression.
p(f | X, y) ∝ p(y | f) p(f | X): the GP prior times the likelihood p(y_i | f_i) = σ(f_i), with the normalizer ignored.
Because this likelihood is non-Gaussian, the integrals are intractable and must be approximated.
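In practice these approximations are packaged up in libraries; for instance, a minimal scikit-learn sketch (the dataset and kernel choices are arbitrary, not from the slides):

```python
from sklearn.datasets import make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Toy 2-D binary classification problem
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# GP classification with a squared exponential (RBF) kernel; the non-Gaussian
# posterior over f is handled internally via a Laplace approximation
clf = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))   # class probabilities, i.e. sigma(f*) averaged over f*
```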

Bells and Whistles
Multiple-output regression: assume correlation between multiple outputs.
Correlated noise (especially useful for time series).
Training on function values plus derivatives (e.g. encoding that a max or min occurs at a known point by observing a zero derivative there).
Noise in the x-direction (input noise).
Mixtures (ensembles) of local GP experts.
Integral approximation with GPs.
Covariances over structured inputs (strings, trees, etc.).

Bibliography
Wikipedia: Gaussian Process; Multivariate Normal Distribution.
Scikit-Learn documentation.
www.gaussianprocess.org
C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006, ISBN 026218253X. www.gaussianprocess.org/gpml
S. Marsland, Machine Learning: An Algorithmic Perspective, 2nd Edition, CRC Press, 2015.