Lecture 5: GPs and Streaming regression

COMP-652 and ECSE-608, September 19, 2017

Outline: Gaussian processes, information gain, confidence intervals.

Recall: Non-parametric regression
- Input space X \subseteq R^n, target space Y = R
- Input matrix X of size m \times n, target vector y of size m \times 1
- Feature mapping \phi : X \to R^d
- Assumption: y = \Phi w
- Minimize J_\lambda(w) = \frac{1}{2}(\Phi w - y)^\top (\Phi w - y) + \frac{\lambda}{2} w^\top w, with \lambda \ge 0
- The solution is

  w = (\Phi^\top \Phi + \lambda I_d)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I_m)^{-1} y
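As a quick numerical sanity check (a sketch, not part of the lecture), the two forms of the solution above can be verified to coincide; the variable names (`Phi`, `lam`) and the random data are illustrative:

```python
import numpy as np

# Primal vs dual ridge solution: w = (Phi^T Phi + lam I_d)^{-1} Phi^T y
#                                  = Phi^T (Phi Phi^T + lam I_m)^{-1} y
rng = np.random.default_rng(0)
m, d, lam = 20, 5, 0.5
Phi = rng.normal(size=(m, d))          # feature matrix, m x d
y = rng.normal(size=m)                 # target vector, length m

w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(m), y)
print(np.allclose(w_primal, w_dual))   # True, up to numerical precision
```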

Recall: Kernel regression
- Let K = \Phi \Phi^\top and k(x) = \Phi \phi(x) be the kernel matrix/vector:

  K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \dots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \dots & k(x_2, x_m) \\ \vdots & & & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \dots & k(x_m, x_m) \end{pmatrix},
  \quad
  k(x) = \begin{pmatrix} k(x, x_1) \\ k(x, x_2) \\ \vdots \\ k(x, x_m) \end{pmatrix}

- The predictions for the input data are given by

  \hat{y} = \Phi \Phi^\top (\Phi \Phi^\top + \lambda I_m)^{-1} y = K (K + \lambda I_m)^{-1} y

- The prediction for a new input point x is given by

  \hat{f}(x) = \phi(x)^\top \Phi^\top (K + \lambda I_m)^{-1} y = k(x)^\top (K + \lambda I_m)^{-1} y
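A minimal sketch of these predictions in code, assuming a Gaussian (RBF) kernel with bandwidth `rho`; the kernel choice and names are illustrative, not prescribed by the slides:

```python
import numpy as np

def rbf_kernel(A, B, rho=0.3):
    """Gaussian (RBF) kernel k(a, b) = exp(-||a - b||^2 / (2 rho^2)); A is (n, d), B is (m, d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * rho ** 2))

def kernel_regression_predict(X, y, X_new, lam=0.1, rho=0.3):
    """Kernel regression prediction f_hat(x) = k(x)^T (K + lam I_m)^{-1} y."""
    K = rbf_kernel(X, X, rho)                                 # m x m kernel matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)      # (K + lam I_m)^{-1} y
    return rbf_kernel(X_new, X, rho) @ alpha                  # one prediction per row of X_new
```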

Recall: Bayesian view of regression
- Consider noisy observations y = f(x) + \epsilon = \phi(x)^\top w + \epsilon
- Recall Bayes' rule: posterior = (likelihood \times prior) / marginal likelihood

  P_\phi(w \mid y, X) = \frac{P_\phi(y \mid X, w) P(w)}{P_\phi(y \mid X)}

- The marginal likelihood is independent of the weights w
- With Gaussian noise \epsilon \sim N(0, \sigma^2):

  P_\phi(y \mid X, w) = \prod_{i=1}^m P_\phi(y_i \mid x_i, w) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \phi(x_i)^\top w)^2}{2\sigma^2} \right)
  = \frac{1}{(\sqrt{2\pi}\,\sigma)^m} \exp\left( -\frac{\|y - \Phi w\|^2}{2\sigma^2} \right) = N_m(\Phi w, \sigma^2 I_m)

Posterior distribution on parameters
- With a Gaussian prior on the parameters, w \sim N_d(0, \Sigma_w):

  P_\phi(w \mid y, X) \propto \exp\left( -\frac{\|y - \Phi w\|^2}{2\sigma^2} \right) \exp\left( -\frac{w^\top \Sigma_w^{-1} w}{2} \right)
  = \exp\left( -\frac{y^\top y - y^\top \Phi w - w^\top \Phi^\top y + w^\top \Phi^\top \Phi w + \sigma^2 w^\top \Sigma_w^{-1} w}{2\sigma^2} \right)
  = \exp\left( -\frac{y^\top y - y^\top \Phi w - w^\top \Phi^\top y + w^\top (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1}) w}{2\sigma^2} \right)
  \propto \exp\left( -\frac{1}{2} (w - b)^\top A^{-1} (w - b) \right)

  where A = \sigma^2 (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} and b = (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} \Phi^\top y

- The posterior distribution is Gaussian: w \mid y, X \sim N_d(b, A)!

Predictive distribution
- The pointwise posterior predictive distribution is a normal distribution

  f(x) \mid x_1, \dots, x_m, y_1, \dots, y_m \sim N(\hat{f}(x), s^2(x))

  with expectation and variance

  \hat{f}(x) = \phi(x)^\top (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} \Phi^\top y = \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} y

  s^2(x) = \sigma^2 \phi(x)^\top (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} \phi(x)
         = \phi(x)^\top \Sigma_w \phi(x) - \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} \Phi \Sigma_w \phi(x)

  using the Sherman-Morrison-Woodbury (matrix inversion) identity

Reinterpreting regularization
- Recall kernel regression predictions: \hat{f}(x) = k(x)^\top (K + \lambda I_m)^{-1} y
- Using the prior \Sigma_w = \frac{\sigma^2}{\lambda} I_d, the predictive mean rewrites as:

  \hat{f}(x) = \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} y
  = \phi(x)^\top \frac{\sigma^2}{\lambda} \Phi^\top \left( \frac{\sigma^2}{\lambda} \Phi \Phi^\top + \sigma^2 I_m \right)^{-1} y
  = k(x)^\top (K + \lambda I_m)^{-1} y

- \Rightarrow \lambda encodes a prior on the weights w

Reinterpreting regularization (cont'd)
- Still using \Sigma_w = \frac{\sigma^2}{\lambda} I_d, the predictive variance rewrites as:

  s^2(x) = \phi(x)^\top \Sigma_w \phi(x) - \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} \Phi \Sigma_w \phi(x)
  = \frac{\sigma^2}{\lambda} \phi(x)^\top \phi(x) - \frac{\sigma^2}{\lambda} \phi(x)^\top \Phi^\top \left( \frac{\sigma^2}{\lambda} \Phi \Phi^\top + \sigma^2 I_m \right)^{-1} \frac{\sigma^2}{\lambda} \Phi \phi(x)
  = \frac{\sigma^2}{\lambda} k_\lambda(x, x)

  with k_\lambda(x, x') = k(x, x') - k(x)^\top (K + \lambda I_m)^{-1} k(x')

Summary
- Using the prior \Sigma_w = \frac{\sigma^2}{\lambda} I_d, we have

  \hat{f}(x) = k(x)^\top (K + \lambda I_m)^{-1} y
  s^2(x) = \frac{\sigma^2}{\lambda} k_\lambda(x, x), \quad with k_\lambda(x, x') = k(x, x') - k(x)^\top (K + \lambda I_m)^{-1} k(x')

- providing the pointwise posterior prediction f(x) \mid x_1, \dots, x_m, y_1, \dots, y_m \sim N(\hat{f}(x), s^2(x))
- What does it mean to use \lambda \in R_{\ge 0}?
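A sketch of these two formulas in code; it reuses a kernel function such as the `rbf_kernel` above, and the default values of `lam` and `sigma2` are arbitrary:

```python
import numpy as np

def gp_posterior(X, y, X_new, kernel, lam=0.1, sigma2=0.1):
    """Posterior mean f_hat(x) and variance s^2(x) = (sigma^2 / lam) * k_lam(x, x)."""
    m = len(y)
    A = np.linalg.inv(kernel(X, X) + lam * np.eye(m))          # (K + lam I_m)^{-1}
    k_new = kernel(X_new, X)                                   # rows are k(x)^T for each new x
    mean = k_new @ A @ y
    # k_lam(x, x) = k(x, x) - k(x)^T (K + lam I_m)^{-1} k(x)
    k_lam_diag = np.diag(kernel(X_new, X_new)) - np.einsum('ij,jk,ik->i', k_new, A, k_new)
    return mean, (sigma2 / lam) * k_lam_diag
```

For example, `gp_posterior(X, y, X_new, rbf_kernel)` returns the posterior mean and pointwise variance at the query points.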

Pointwise posterior distribution
- At each point x \in X, we have a distribution N(\hat{f}(x), s^2(x))
- We can sample from these: f(x) \sim N(\hat{f}(x), s^2(x))

[Figure: posterior mean \hat{f} with a shaded \pm s band over X = [-1, 1]]

Joint distribution
- Suppose you query your model at locations X_*
- Extend the prior to include the query points:

  \begin{bmatrix} f \\ f_* \end{bmatrix} \sim N_{m + m_*}\left( 0, \begin{bmatrix} K_{X,X} & K_{X,X_*} \\ K_{X_*,X} & K_{X_*,X_*} \end{bmatrix} \right), \quad y \mid f \sim N_m(f, \sigma^2 I_m)

- Using the joint normality of f_* and y:

  \begin{bmatrix} y \\ f_* \end{bmatrix} \sim N_{m + m_*}\left( 0, \begin{bmatrix} K_{X,X} + \sigma^2 I_m & K_{X,X_*} \\ K_{X_*,X} & K_{X_*,X_*} \end{bmatrix} \right)

Gaussian Process (GP)
- By considering the covariance between every pair of points in the space, we get a distribution over functions!
- Posterior distribution on f:

  P[f \mid X, y] = N_{|X|}\left( [\hat{f}(x)]_{x \in X}, \; \frac{\sigma^2}{\lambda} [k_\lambda(x, x')]_{x, x' \in X} \right)

[Figure: GP posterior mean \hat{f} with a shaded \pm s band over X = [-1, 1]]

Sampling from a Gaussian Process
- Generalization of the normal probability distribution to function space
- From a normal distribution we sample variables; from a GP we sample functions!

[Figure: functions sampled from a GP over X = [-1, 1]]

Sampling from a Gaussian Process: How to
- Observe that N_{|X|}\left( [\hat{f}(x)]_{x \in X}, \; \frac{\sigma^2}{\lambda} [k_\lambda(x, x')]_{x, x' \in X} \right) defines an |X|-dimensional multivariate Gaussian distribution
- What if |X| = \infty (e.g. X = [-1, 1])?
- We can consider a discrete, finite set \tilde{X} \subset X and sample from

  N_{|\tilde{X}|}\left( [\hat{f}(x)]_{x \in \tilde{X}}, \; \frac{\sigma^2}{\lambda} [k_\lambda(x, x')]_{x, x' \in \tilde{X}} \right)

- This results in a function f evaluated at every x \in \tilde{X}
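A sketch of this discretization in code: build the posterior mean vector and covariance matrix on a finite grid `X_tilde` and draw from the corresponding multivariate Gaussian (the jitter term and default values are arbitrary choices for numerical stability):

```python
import numpy as np

def sample_gp_posterior(X, y, X_tilde, kernel, lam=0.1, sigma2=0.1, n_samples=3, jitter=1e-8):
    """Draw sample functions, evaluated on the grid X_tilde, from the GP posterior."""
    A = np.linalg.inv(kernel(X, X) + lam * np.eye(len(y)))     # (K + lam I_m)^{-1}
    k_new = kernel(X_tilde, X)
    mean = k_new @ A @ y                                       # [f_hat(x)]_{x in X_tilde}
    cov = (sigma2 / lam) * (kernel(X_tilde, X_tilde) - k_new @ A @ k_new.T)
    cov += jitter * np.eye(len(X_tilde))                       # keep the covariance numerically PSD
    return np.random.default_rng(0).multivariate_normal(mean, cov, size=n_samples)
```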

Learning the hyperparameters
- If we assume that \Sigma_w = I_d, then we have \lambda = \sigma^2
- Let \theta denote the kernel hyperparameters (e.g. \rho)
- In practice you may not know the noise level or the optimal kernel hyperparameters
- Recall the multivariate normal density:

  P(y \mid \theta) = \frac{\exp\left( -\frac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} y \right)}{\sqrt{(2\pi)^m \, |K_\theta + \sigma^2 I_m|}}

- Maximize the log marginal likelihood L = \log P(y \mid \theta):

  L = -\frac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} y - \frac{m}{2} \log(2\pi) - \frac{1}{2} \log |K_\theta + \sigma^2 I_m|
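A sketch of the negative log marginal likelihood for a single RBF bandwidth hyperparameter; the kernel form and parameterization are assumptions for illustration, and a Cholesky factorization gives both the quadratic term and the log-determinant:

```python
import numpy as np

def neg_log_marginal_likelihood(theta, X, y, sigma2):
    """-log P(y | theta) for an RBF kernel with bandwidth rho = theta[0] (illustrative)."""
    rho = theta[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2 * rho ** 2))                     # K_theta
    C = K + sigma2 * np.eye(len(y))                            # K_theta + sigma^2 I_m
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # C^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))                 # log |C|
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * len(y) * np.log(2 * np.pi)
```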

Anatomy of the marginal likelihood

  L = \log P(y \mid \theta) = -\frac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} y - \frac{1}{2} \log |K_\theta + \sigma^2 I_m| + \text{const}

- 1st term: quality of predictions (data fit); 2nd term: model complexity penalty
- Trade-off (from Rasmussen & Williams, Gaussian Processes for Machine Learning, MIT Press, 2006):

[Figure 5.3 from Rasmussen & Williams (2006): panel (a) decomposes the log marginal likelihood into its constituents, data fit and (minus) complexity penalty, as a function of the characteristic length-scale; panel (b) shows the corresponding log marginal likelihood with a 95% confidence interval]

Gradient-based optimization
- Compute gradients:

  \frac{\partial L}{\partial \theta_i} = \frac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} \frac{\partial (K_\theta + \sigma^2 I_m)}{\partial \theta_i} (K_\theta + \sigma^2 I_m)^{-1} y - \frac{1}{2} \mathrm{Tr}\left( (K_\theta + \sigma^2 I_m)^{-1} \frac{\partial (K_\theta + \sigma^2 I_m)}{\partial \theta_i} \right)

- Minimize the negative log marginal likelihood
- Non-convex optimization task
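One way to carry out this optimization, sketched under the assumption that the `neg_log_marginal_likelihood` function above and the data `X`, `y`, `sigma2` are already defined: let `scipy.optimize.minimize` approximate the gradient numerically, and use a few restarts to cope with the non-convexity.

```python
import numpy as np
from scipy.optimize import minimize

best = None
for rho0 in [0.1, 0.3, 1.0, 3.0]:                  # restarts from several initial bandwidths
    res = minimize(neg_log_marginal_likelihood, x0=np.array([rho0]),
                   args=(X, y, sigma2), method='L-BFGS-B', bounds=[(1e-3, None)])
    if best is None or res.fun < best.fun:
        best = res
print(best.x)                                      # selected bandwidth rho
```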

Summary
- Normal prior on the weights distribution \Rightarrow Gaussian process
- Regularization \leftrightarrow prior on the weights covariance
- A GP provides a posterior distribution on functions:
  - expectation: the kernel regression model
  - covariance: confidence intervals
- We can sample discretized functions from a GP

Typical supervised setting
- Have a dataset (X, y) of previously acquired data
- Learn a model w on (X, y)
- Apply the model w to provide predictions at new data points
- Samples (X, y) are often assumed to be i.i.d.

Streaming setting
- Start with a (possibly empty) dataset (X_0, y_0)
- For each time step t = 1, 2, ...:
  - Fit model w_t on X_{t-1} and y_{t-1}
  - Acquire a new sample (x_t, y_t) (possibly using w_t)
  - Define X_t = [x_i]_{i=1...t}, y_t = [y_i]_{i=1...t}
- Samples may be dependent on the model (not i.i.d.)
- Samples influence the model / samples depend on the model
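A runnable sketch of this loop; `fit_model` and `select_next_location` are placeholders standing in for, e.g., a GP fit and an acquisition rule, and the target function and noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                          # unknown target function (illustrative)

def fit_model(X, y):
    """Placeholder for fitting w_t on (X_{t-1}, y_{t-1}), e.g. a GP posterior."""
    return (X, y)

def select_next_location(model, t):
    """Placeholder acquisition rule; here simply a random location in [0, 1]."""
    return rng.uniform(0.0, 1.0)

X_t, y_t = np.empty((0, 1)), np.empty(0)             # (X_0, y_0): possibly empty
for t in range(1, 101):
    model = fit_model(X_t, y_t)                      # fit on the data gathered so far
    x_new = select_next_location(model, t)           # may depend on the model: not i.i.d.
    y_new = f(x_new) + rng.normal(0.0, 0.1)          # noisy observation y_t
    X_t = np.vstack([X_t, [[x_new]]])                # X_t = [x_i]_{i=1..t}
    y_t = np.append(y_t, y_new)                      # y_t = [y_i]_{i=1..t}
```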

Application: Online function optimization

[Figure: unknown function over X = [0, 1] with a queried location marked "?"]

- Unknown function f : X \to R
- Sequentially select locations (x_t)_{t \ge 1} at which to observe the function, yielding observations y_t
- Try to maximize/minimize the observations

Example
- What is the optimal treatment dosage for a given disease?
- For each time step t = 1, 2, ...:
  - a new patient t comes in
  - we decide on a treatment dosage x_t
  - we observe the patient's response to the treatment, y_t
- Our goal is to cure patients as effectively as possible
- What you don't want: give dosages that are known to be bad
- What you want: give good dosages and try informative dosages

Information
- An informative sample improves the model by reducing its uncertainty

[Figure: GP posterior before and after adding an informative observation; the uncertainty band shrinks]

- How do we quantify the reduction of uncertainty after observing y_1, ..., y_t at locations x_1, ..., x_t?

A little bit of information theory
- Mutual information between the underlying function f and the observations y_1, ..., y_t at locations x_1, ..., x_t:

  I(y_1, \dots, y_t; f) = \underbrace{H(y_1, \dots, y_t)}_{\text{marginal entropy}} - \underbrace{H(y_1, \dots, y_t \mid f)}_{\text{conditional entropy}}

- It quantifies the amount of information obtained about one random variable through the other random variable
- Entropy is the amount of information held by a random variable
- H(Y \mid X) = 0 if and only if the value of Y is completely determined by the value of X
- H(Y \mid X) = H(Y) if and only if Y and X are independent random variables

Amount of information

  H(X) = E[-\ln P_X] = -\sum_{i=1}^n p(x_i) \log_b p(x_i)

- Example: tossing a coin
- A fair coin has maximal entropy: it is the least predictable!
- Every new sample that you get changes your model the most
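To make the coin example concrete, a short worked computation: with b = 2, a fair coin gives

  H = -\left( \tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2} \right) = 1 \text{ bit},

whereas a biased coin with P(\text{heads}) = 0.9 gives H = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.47 bits, so each of its outcomes is less informative on average.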

Mutual information
- Entropy of X \sim N_D(\mu, \Sigma):

  H(X) = E\left[ -\ln \frac{\exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)}{\sqrt{(2\pi)^D |\Sigma|}} \right] = \frac{D}{2} + \frac{D}{2} \ln 2\pi + \frac{1}{2} \ln |\Sigma|

  using E[a^\top M^{-1} a] = E[\mathrm{Tr}(a^\top M^{-1} a)] = E[\mathrm{Tr}(a a^\top M^{-1})] = D, with a = x - \mu and M = \Sigma

- Recall the joint distribution between the m observations y and m_* query points X_*:

  \begin{bmatrix} y \\ f_* \end{bmatrix} \sim N_{m + m_*}\left( 0, \begin{bmatrix} K_{X,X} + \sigma^2 I_m & K_{X,X_*} \\ K_{X_*,X} & K_{X_*,X_*} \end{bmatrix} \right)

- Then

  I(y_1, \dots, y_t; f) = \frac{1}{2} \ln |K_t + \sigma^2 I_t| - \frac{1}{2} \ln |\sigma^2 I_t| = \frac{1}{2} \ln |I_t + \sigma^{-2} K_t|

  where K_t = [k(x_i, x_j)]_{i,j \le t}
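A sketch of this quantity in code, computed via a log-determinant for numerical stability; `K_t` is the kernel matrix of the first t inputs:

```python
import numpy as np

def mutual_information(K_t, sigma2):
    """I(y_1,...,y_t; f) = 0.5 * ln det(I_t + K_t / sigma^2)."""
    t = K_t.shape[0]
    _, log_det = np.linalg.slogdet(np.eye(t) + K_t / sigma2)
    return 0.5 * log_det
```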

Another decomposition of mutual information
- Recall that y_1, \dots, y_t \mid f(x_1), \dots, f(x_t) \sim N_t\left( (f(x_1), \dots, f(x_t)), \; \sigma^2 I_t \right)
- Using that y_i = f(x_i) + \epsilon with \epsilon \sim N(0, \sigma^2)
- Plugging in the conditional entropy of a multivariate normal distribution:

  I(y_1, \dots, y_t; f) = H(y_1, \dots, y_t) - \frac{1}{2} \ln |2\pi e \sigma^2 I_t| = H(y_1, \dots, y_t) - \frac{t}{2} \ln(2\pi e \sigma^2)

- What is the marginal entropy H(y_1, \dots, y_t)?

Entropy of observations
- Recursively decompose:

  H(y_1, \dots, y_t) = H(y_1, \dots, y_{t-1}) + H(y_t \mid y_1, \dots, y_{t-1})
  = H(y_1, \dots, y_{t-1}) + \frac{1}{2} \ln\left[ 2\pi e \left( \sigma^2 + \frac{\sigma^2}{\lambda} k_{\lambda, t-1}(x_t, x_t) \right) \right]
  \vdots
  = H(y_1) + H(y_2 \mid y_1) + \dots + \frac{1}{2} \ln\left[ 2\pi e \left( \sigma^2 + \frac{\sigma^2}{\lambda} k_{\lambda, t-1}(x_t, x_t) \right) \right]
  = \sum_{s=1}^t \frac{1}{2} \ln\left[ 2\pi e \sigma^2 \left( 1 + \frac{1}{\lambda} k_{\lambda, s-1}(x_s, x_s) \right) \right]

- Still using that y_i = f(x_i) + \epsilon with \epsilon \sim N(0, \sigma^2)
- Also using the uncertainty about the location of the true f

Information gain

  I(y_1, \dots, y_t; f) = \sum_{s=1}^t \frac{1}{2} \ln\left[ 2\pi e \sigma^2 \left( 1 + \frac{1}{\lambda} k_{\lambda, s-1}(x_s, x_s) \right) \right] - \frac{t}{2} \ln(2\pi e \sigma^2)
  = \sum_{s=1}^t \frac{1}{2} \ln\left[ 1 + \frac{1}{\lambda} k_{\lambda, s-1}(x_s, x_s) \right]

- Information gain \gamma_t(\lambda) = I(y_1, \dots, y_t; f): the reduction of uncertainty on f after observing y_1, \dots, y_t
- The information gain is inversely proportional to \lambda
- Limiting changes in the function limits the contribution of samples
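A sketch of the sequential computation of \gamma_t(\lambda), following the sum above; the `kernel` argument is a function such as the `rbf_kernel` sketched earlier:

```python
import numpy as np

def information_gain(X, kernel, lam):
    """gamma_t(lam) = sum_{s<=t} 0.5 * ln(1 + k_{lam,s-1}(x_s, x_s) / lam), for t = 1..len(X)."""
    gains, total = [], 0.0
    for s in range(len(X)):
        X_prev, x_s = X[:s], X[s:s + 1]
        k_ss = kernel(x_s, x_s)[0, 0]                            # k(x_s, x_s)
        if s > 0:
            K_prev = kernel(X_prev, X_prev)
            k_vec = kernel(x_s, X_prev)[0]                       # k(x_s) against previous points
            # k_{lam,s-1}(x_s, x_s) = k(x_s, x_s) - k(x_s)^T (K_{s-1} + lam I)^{-1} k(x_s)
            k_ss -= k_vec @ np.linalg.solve(K_prev + lam * np.eye(s), k_vec)
        total += 0.5 * np.log(1.0 + k_ss / lam)
        gains.append(total)
    return np.array(gains)                                       # gamma_1(lam), ..., gamma_t(lam)
```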

Information gain vs regularization

[Figure: GP posterior fits for \lambda = 0.1, 1, 10 on two example datasets (top), and the corresponding information gain \gamma_t(\lambda) as a function of t (bottom); smaller \lambda yields a larger information gain]

Summary
- Streaming regression: sequentially gather (potentially non-i.i.d.) samples
- The information gain measures the total information that could result from adding a new sample (observation) to the model
- The information gain is controlled by the information-sharing capability of the kernel
- The information gain is controlled by the changes in the model admitted by regularization
- The information gain will play a part in confidence intervals

Confidence intervals
- Given that we have gathered t samples under the streaming setting, what kind of guarantees can we have on the resulting model?
- More specifically, can we guarantee that

  |f(x) - \hat{f}_t(x)| \le \text{something}

  simultaneously for all x \in X and for all t \ge 0?
- Motivations:
  - Control bad behaviours in critical applications
  - Help select the next sample location (maximize/minimize the function? maximize model improvement?)
  - Derive sound algorithms

Result (Maillard, 2016)
- Under the assumption of subgaussian noise,

  |f(x) - \hat{f}_t(x)| \le \sqrt{\frac{k_{\lambda,t}(x, x)}{\lambda}} \left[ \sqrt{\lambda}\, \|f\|_K + \sigma \sqrt{2 \ln(1/\delta) + 2\gamma_t(\lambda)} \right]

- with probability higher than 1 - \delta
- simultaneously for all t \ge 0 and for all x \in X
- Observe that the error bound scales with the information gain!
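A sketch of the resulting envelope width at a single point, written directly from the bound as reconstructed above (so it inherits that reading of the result); `f_norm_bound` is an assumed upper bound on \|f\|_K:

```python
import numpy as np

def confidence_width(k_lam_xx, gamma_t, lam, sigma, f_norm_bound, delta=0.05):
    """Half-width of the confidence interval around f_hat_t(x) at a point x."""
    return np.sqrt(k_lam_xx / lam) * (np.sqrt(lam) * f_norm_bound
                                      + sigma * np.sqrt(2 * np.log(1 / delta) + 2 * gamma_t))
```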

Subgaussian noise
- A real-valued random variable X is \sigma^2-subgaussian if E[e^{\gamma X}] \le e^{\gamma^2 \sigma^2 / 2} for all \gamma \in R
- The Laplace transform of X is dominated by the Laplace transform of a random variable sampled from N(0, \sigma^2)
- Requires that the tails of the noise distribution are dominated by the tails of a Gaussian distribution
- For example, this is true for:
  - Gaussian noise
  - Bounded noise

Confidence envelope

[Figure: confidence envelopes around \hat{f} after 5, 10, 50, and 100 observations, for \lambda = \sigma^2 and \lambda = \sigma^2 / \|f\|_K^2, together with the true f; the envelope tightens as observations accumulate]

Summary
- It is possible to have guarantees even with non-i.i.d. data
- The prediction error depends on:
  - how well your model shares information across observations
  - how well your model is adapted to the function
  - how noisy the observations are
  - the regularization / prior on \Sigma_w
- Confidence intervals/envelopes will be really useful for deriving algorithms (we will see that later)