Lecture: Gaussian Process Regression
STAT 6474
Instructor: Hongxiao Zhu
Motivation. Reference: Marc Deisenroth's tutorial on Robot Learning.
Fast Learning for Autonomous Robots with Gaussian Processes. Demo 1: Cart-Pole Swing-up. Swing up and balance a freely swinging pendulum on a cart. No knowledge about the nonlinear dynamics, so the system must learn from scratch.
Fast Learning for Autonomous Robots with Gaussian Processes. Demo 2: Learning to Control a Low-Cost Manipulator.
Idea: Reinforcement Learning. The difference between optimal control and reinforcement learning is that in optimal control one assumes the dynamics $f(\cdot)$ are known.
At $x = 7$, what is going on? The point prediction is $f(7) = -1.5$, and we need to make a decision based on this prediction.
We need to characterize the model errors: use a Gaussian process to characterize the uncertainty of the prediction.
From a Statistical Perspective
Linear vs. Nonlinear Regression. We have i.i.d. data pairs $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, and want to make inference about the relation between input and output. Linear relationship: $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, $\epsilon \sim N(0, \sigma_n^2)$.
One puts a prior on $\mathbf{w}$ (e.g., normal) and derives the posterior $p(\mathbf{w} \mid X, \mathbf{y})$. To make predictions at a new input $\mathbf{x}_*$, we use the posterior predictive distribution $p(y_* \mid \mathbf{x}_*, X, \mathbf{y})$. Alternatively, one could first map $\mathbf{x}$ to basis functions $\phi(\mathbf{x})$ and let $y = \mathbf{w}^\top \phi(\mathbf{x}) + \epsilon$. This is still restrictive due to sensitivity to the choice of $\phi(\cdot)$.
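For reference (a standard result for Bayesian linear regression, not reconstructed from this slide verbatim): with prior $\mathbf{w} \sim N(\mathbf{0}, \Sigma_p)$, noise $\epsilon \sim N(0, \sigma_n^2)$, and inputs collected as the columns of $X$, both the posterior and the posterior predictive are Gaussian:
\[
\mathbf{w} \mid X, \mathbf{y} \sim N\!\left( \sigma_n^{-2} A^{-1} X \mathbf{y},\; A^{-1} \right), \qquad A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1},
\]
\[
y_* \mid \mathbf{x}_*, X, \mathbf{y} \sim N\!\left( \sigma_n^{-2} \mathbf{x}_*^\top A^{-1} X \mathbf{y},\; \mathbf{x}_*^\top A^{-1} \mathbf{x}_* + \sigma_n^2 \right).
\]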
We want a more flexible form for $f(\mathbf{x})$: treat the whole function $f(\cdot)$ as the parameter. Assume $f(\cdot)$ takes values in a function space and is random; in particular, assume it is a Gaussian process. What does a Gaussian process look like?
Gaussian Process. Assume that $f(\cdot)$ is a Gaussian process. Definition: a Gaussian process is an (infinite) collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is completely specified by its mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$.
Gaussian Process (cont'd). We write $f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$, with $m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$ and $k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}(f(\mathbf{x}), f(\mathbf{x}'))$. Interpretation: any arbitrary finite projection $(f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))$ is jointly Gaussian, and these finite-dimensional distributions are consistent, so the process is well defined.
Parametric Forms of Covariance Functions.
Squared Exponential (SE): $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2} \right)$.
Matérn: $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\,\|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)$, where $K_\nu$ is the modified Bessel function of the second kind.
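As an illustration (not part of the original slides), a minimal R sketch of drawing sample paths from a zero-mean GP prior with the SE covariance; the helper name se_kernel and the parameter defaults are illustrative choices:

# Squared-exponential covariance matrix between two vectors of scalar inputs
se_kernel <- function(x1, x2, sigma_f = 1, ell = 1) {
  d2 <- outer(x1, x2, function(a, b) (a - b)^2)   # squared distances
  sigma_f^2 * exp(-d2 / (2 * ell^2))
}

set.seed(1)
x <- seq(-5, 5, length.out = 200)                  # grid of inputs
K <- se_kernel(x, x)                               # prior covariance matrix
L <- t(chol(K + 1e-8 * diag(length(x))))           # small jitter for numerical stability
f <- L %*% matrix(rnorm(length(x) * 3), ncol = 3)  # three draws from GP(0, K)
matplot(x, f, type = "l", lty = 1, ylab = "f(x)")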
Reminder: conditional distribution of MVN. Reference: https://en.wikipedia.org/wiki/multivariate_normal_distribution#conditional_distributions
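For reference, the standard result: if
\[
\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),
\]
then
\[
\mathbf{x}_1 \mid \mathbf{x}_2 \sim N\!\left( \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).
\]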
Case 1: GP Regression with Noise-Free Observations
Case 1: GP Regression with Noise-Free Observations (cont'd)
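The slide's equations did not survive extraction; the standard noise-free predictive distribution they correspond to is as follows. Given training pairs $(X, \mathbf{f})$ and test inputs $X_*$, conditioning the joint Gaussian prior over $(\mathbf{f}, \mathbf{f}_*)$ on the observed $\mathbf{f}$ gives
\[
\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim N\big( K(X_*, X) K(X, X)^{-1} \mathbf{f},\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \big).
\]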
Case 2: GP Regression with Noisy Observations
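With observation noise $y_i = f(\mathbf{x}_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma_n^2)$, the standard predictive equations (reconstructed here for reference) become
\[
\bar{\mathbf{f}}_* = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} \mathbf{y},
\]
\[
\mathrm{Cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*).
\]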
Model Selection: Hyperparameters
Optimizing the Marginal Likelihood. Note: $f(\cdot)$ has been marginalized out of the likelihood.
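For a GP with Gaussian noise, the marginal likelihood is available in closed form (a standard result): the hyperparameters $\theta$ (e.g., $\sigma_f$, $\ell$, $\sigma_n$) are chosen by maximizing
\[
\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\, \mathbf{y}^\top [K_\theta + \sigma_n^2 I]^{-1} \mathbf{y} - \tfrac{1}{2} \log \big| K_\theta + \sigma_n^2 I \big| - \tfrac{n}{2} \log 2\pi.
\]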
Example 1: R Code Demo of GP Regression
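The original demo script is not reproduced in the extracted notes; the following is a minimal sketch consistent with the Case 2 equations above. It reuses the illustrative se_kernel helper from the prior-sampling sketch, and the data, kernel, and noise settings are assumptions for illustration:

set.seed(2)
# Toy data: noisy observations of a smooth underlying function
x_train <- seq(-4, 4, length.out = 8)
y_train <- sin(x_train) + rnorm(length(x_train), sd = 0.1)
x_test  <- seq(-5, 5, length.out = 200)
sigma_n <- 0.1                                    # assumed noise standard deviation

K    <- se_kernel(x_train, x_train)               # training covariance
K_s  <- se_kernel(x_test, x_train)                # test-vs-training covariance
K_ss <- se_kernel(x_test, x_test)                 # test covariance

# Predictive mean and covariance from the noisy-observation equations above
A       <- solve(K + sigma_n^2 * diag(length(x_train)))
mu_star <- as.vector(K_s %*% A %*% y_train)
V_star  <- K_ss - K_s %*% A %*% t(K_s)
sd_star <- sqrt(pmax(diag(V_star), 0))            # guard against tiny negative variances

# Plot the posterior mean with 95% pointwise bands and the training points
plot(x_test, mu_star, type = "l", xlab = "x", ylab = "f(x)",
     ylim = range(mu_star - 2 * sd_star, mu_star + 2 * sd_star, y_train))
lines(x_test, mu_star + 1.96 * sd_star, lty = 2)
lines(x_test, mu_star - 1.96 * sd_star, lty = 2)
points(x_train, y_train, pch = 19)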
GP Classification: The Binary Case
GP Classification: The Binary Case (cont'd). Let $y \in \{0, 1\}$ denote the class labels. Assume a latent function $f(\mathbf{x})$ with a GP prior, $f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$, and model the class probability through a link: $\pi(\mathbf{x}) = p(y = 1 \mid \mathbf{x}) = \sigma(f(\mathbf{x}))$, where $\sigma(\cdot)$ is a sigmoid (e.g., logistic or probit).
$f$ is a nuisance function (a latent variable): we do not observe values of $f$ itself (we observe only the inputs $X$ and the class labels $\mathbf{y}$), and we are not particularly interested in the values of $f$ but rather in $\pi$, in particular for test cases $\pi(\mathbf{x}_*)$. The purpose of $f$ is solely to allow a convenient formulation of the model.
Steps for predicting $y_*$ given $\mathbf{x}_*$ (written out in symbols below): 1. First compute the distribution of the latent variable corresponding to the test case. 2. Use this distribution over the latent $f_*$ to produce a probabilistic prediction.
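In symbols, the standard form of these two steps is
\[
p(f_* \mid X, \mathbf{y}, \mathbf{x}_*) = \int p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, p(\mathbf{f} \mid X, \mathbf{y})\, d\mathbf{f},
\]
\[
\bar{\pi}_* = p(y_* = 1 \mid X, \mathbf{y}, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid X, \mathbf{y}, \mathbf{x}_*)\, df_*.
\]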
In classification, $p(y \mid f)$ involves a link function, the conjugacy of $f$ is lost, and the integrations on the previous slide are intractable. Thus we need either analytic approximations of the integrals to approximate $p(\mathbf{f} \mid X, \mathbf{y})$, or solutions based on Monte Carlo sampling, e.g., the Laplace approximation, expectation propagation (EP), INLA, etc.
Laplace's method utilizes a Gaussian approximation $q(\mathbf{f} \mid X, \mathbf{y})$ to the posterior $p(\mathbf{f} \mid X, \mathbf{y})$. The Laplace approximation is a local (normal) approximation to the posterior obtained by matching the mode. We will discuss this further in the non-MCMC methods lecture.
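Concretely, the standard form of the Laplace approximation is
\[
q(\mathbf{f} \mid X, \mathbf{y}) = N(\hat{\mathbf{f}}, A^{-1}), \qquad \hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f} \mid X, \mathbf{y}), \qquad A = -\nabla\nabla \log p(\mathbf{f} \mid X, \mathbf{y}) \big|_{\mathbf{f} = \hat{\mathbf{f}}}.
\]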
Posterior predictive mean and variance based on the Laplace approximation:
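The slide's formulas did not survive extraction; the standard expressions (e.g., Rasmussen and Williams, Gaussian Processes for Machine Learning, Ch. 3) are
\[
\mathbb{E}_q[f_* \mid X, \mathbf{y}, \mathbf{x}_*] = \mathbf{k}_*^\top \nabla \log p(\mathbf{y} \mid \hat{\mathbf{f}}),
\]
\[
\mathbb{V}_q[f_* \mid X, \mathbf{y}, \mathbf{x}_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + W^{-1})^{-1} \mathbf{k}_*,
\]
where $W = -\nabla\nabla \log p(\mathbf{y} \mid \hat{\mathbf{f}})$ and $\mathbf{k}_* = K(X, \mathbf{x}_*)$.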