Lecture 5: GPs and Streaming Regression

Gaussian Processes · Information gain · Confidence intervals

COMP-652 and ECSE-608, September 19, 2017
Recall: Non-parametric regression

Input space $\mathcal{X} \subseteq \mathbb{R}^n$, target space $\mathcal{Y} = \mathbb{R}$.
Input matrix $X$ of size $m \times n$, target vector $y$ of size $m \times 1$.
Feature mapping $\phi : \mathcal{X} \to \mathbb{R}^d$.
Assumption: $y = \Phi w$.
Minimize
$$J_\lambda(w) = \tfrac{1}{2} (\Phi w - y)^\top (\Phi w - y) + \tfrac{\lambda}{2} w^\top w, \qquad \lambda \geq 0.$$
The solution is
$$w = (\Phi^\top \Phi + \lambda I_d)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I_m)^{-1} y.$$
Recall: Kernel regression

Let $K = \Phi \Phi^\top$ and $k(x) = \Phi \phi(x)$ be the kernel matrix/vector:
$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \dots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \dots & k(x_2, x_m) \\ \vdots & \vdots & & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \dots & k(x_m, x_m) \end{bmatrix}, \qquad k(x) = \begin{bmatrix} k(x, x_1) \\ k(x, x_2) \\ \vdots \\ k(x, x_m) \end{bmatrix}$$
The predictions for the input data are given by
$$\hat{y} = \Phi \Phi^\top (\Phi \Phi^\top + \lambda I_m)^{-1} y = K (K + \lambda I_m)^{-1} y.$$
The prediction for a new input point $x$ is given by
$$\hat{f}(x) = \phi(x)^\top \Phi^\top (K + \lambda I_m)^{-1} y = k(x)^\top (K + \lambda I_m)^{-1} y.$$
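To make these formulas concrete, here is a minimal numpy sketch of the kernel ridge predictor $\hat{f}(x) = k(x)^\top (K + \lambda I_m)^{-1} y$. The RBF kernel, the lengthscale value, and all function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rbf_kernel(A, B, rho=0.5):
    # k(x, x') = exp(-||x - x'||^2 / (2 rho^2)); rho is a hypothetical lengthscale
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * rho ** 2))

def kernel_ridge_predict(X, y, X_new, lam=1.0):
    # f_hat(x) = k(x)^T (K + lam I_m)^{-1} y
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return rbf_kernel(X_new, X) @ alpha

# Usage on a toy 1-d problem
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(20)
print(kernel_ridge_predict(X, y, np.linspace(-1, 1, 5)[:, None]))
```

Solving the linear system once and reusing `alpha` avoids forming an explicit matrix inverse.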
Recall: Bayesian view of regression

Consider noisy observations $y = f(x) + \epsilon = \phi(x)^\top w + \epsilon$.
Recall Bayes' rule: posterior = likelihood × prior / marginal likelihood,
$$P_\phi(w \mid y, X) = \frac{P_\phi(y \mid X, w) \, P(w)}{P_\phi(y \mid X)}.$$
The marginal likelihood is independent of the weights $w$.
With Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$P_\phi(y \mid X, w) = \prod_{i=1}^m P_\phi(y_i \mid x_i, w) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \phi(x_i)^\top w)^2}{2\sigma^2} \right) = \frac{1}{(\sqrt{2\pi}\sigma)^m} \exp\left( -\frac{\|y - \Phi w\|^2}{2\sigma^2} \right) = \mathcal{N}_m(\Phi w, \sigma^2 I_m)$$
Posterior distribution on parameters

With a Gaussian prior on the parameters, $w \sim \mathcal{N}_d(0, \Sigma_w)$:
$$P_\phi(w \mid y, X) \propto \exp\left( -\frac{\|y - \Phi w\|^2}{2\sigma^2} \right) \exp\left( -\frac{w^\top \Sigma_w^{-1} w}{2} \right)$$
$$= \exp\left( -\frac{y^\top y - y^\top \Phi w - w^\top \Phi^\top y + w^\top \Phi^\top \Phi w + \sigma^2 w^\top \Sigma_w^{-1} w}{2\sigma^2} \right)$$
$$= \exp\left( -\frac{y^\top y - y^\top \Phi w - w^\top \Phi^\top y + w^\top (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1}) w}{2\sigma^2} \right)$$
$$\propto \exp\left( -\tfrac{1}{2} (w - b)^\top A^{-1} (w - b) \right)$$
where $A^{-1} = \sigma^{-2} (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})$ and $b = (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} \Phi^\top y$.
The posterior distribution is Gaussian!
Predictive distribution

The pointwise posterior predictive distribution is a normal distribution,
$$f(x) \mid x_1, \dots, x_m, y_1, \dots, y_m \sim \mathcal{N}\left( \hat{f}(x), s^2(x) \right),$$
with expectation and variance
$$\hat{f}(x) = \phi(x)^\top (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} \Phi^\top y = \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} y$$
$$s^2(x) = \sigma^2 \phi(x)^\top (\Phi^\top \Phi + \sigma^2 \Sigma_w^{-1})^{-1} \phi(x) = \phi(x)^\top \Sigma_w \phi(x) - \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} \Phi \Sigma_w \phi(x)$$
using the Sherman-Morrison-Woodbury identity.
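To see that the two equivalent forms really agree, here is a minimal numpy sketch (an illustration under assumed names, not the course's reference implementation) that evaluates the predictive mean and variance in both weight space and the dual form.

```python
import numpy as np

def blr_predict_primal(Phi, y, phi_x, sigma2, Sigma_w):
    # Weight-space form: works with the d x d matrix Phi^T Phi + sigma^2 Sigma_w^{-1}
    A = Phi.T @ Phi + sigma2 * np.linalg.inv(Sigma_w)
    mean = phi_x @ np.linalg.solve(A, Phi.T @ y)
    var = sigma2 * phi_x @ np.linalg.solve(A, phi_x)
    return mean, var

def blr_predict_dual(Phi, y, phi_x, sigma2, Sigma_w):
    # Function-space form: works with the m x m matrix Phi Sigma_w Phi^T + sigma^2 I_m
    G = Phi @ Sigma_w @ Phi.T + sigma2 * np.eye(Phi.shape[0])
    k_x = Phi @ Sigma_w @ phi_x
    mean = k_x @ np.linalg.solve(G, y)
    var = phi_x @ Sigma_w @ phi_x - k_x @ np.linalg.solve(G, k_x)
    return mean, var

# Sanity check on random data: both forms give the same answer
rng = np.random.default_rng(0)
Phi, y, phi_x = rng.standard_normal((10, 3)), rng.standard_normal(10), rng.standard_normal(3)
print(np.allclose(blr_predict_primal(Phi, y, phi_x, 0.1, np.eye(3)),
                  blr_predict_dual(Phi, y, phi_x, 0.1, np.eye(3))))   # True
```

The primal form is cheaper when $d \ll m$; the dual form only needs inner products of features, which is what makes the kernel view possible.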
Reinterpreting regularization

Recall the kernel regression predictions: $\hat{f}(x) = k(x)^\top (K + \lambda I_m)^{-1} y$.
Using the prior $\Sigma_w = \frac{\sigma^2}{\lambda} I_d$, the predictive mean rewrites as:
$$\hat{f}(x) = \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} y = \phi(x)^\top \frac{\sigma^2}{\lambda} \Phi^\top \left( \frac{\sigma^2}{\lambda} \Phi \Phi^\top + \sigma^2 I_m \right)^{-1} y = k(x)^\top (K + \lambda I_m)^{-1} y$$
$\lambda$ encodes a prior on the weights $w$.
Reinterpreting regularization (cont'd)

Still using $\Sigma_w = \frac{\sigma^2}{\lambda} I_d$, the predictive variance rewrites as:
$$s^2(x) = \phi(x)^\top \Sigma_w \phi(x) - \phi(x)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma^2 I_m)^{-1} \Phi \Sigma_w \phi(x)$$
$$= \frac{\sigma^2}{\lambda} \phi(x)^\top \phi(x) - \frac{\sigma^2}{\lambda} \phi(x)^\top \Phi^\top \left( \frac{\sigma^2}{\lambda} \Phi \Phi^\top + \sigma^2 I_m \right)^{-1} \Phi \frac{\sigma^2}{\lambda} \phi(x) = \frac{\sigma^2}{\lambda} k_\lambda(x, x)$$
with $k_\lambda(x, x') = k(x, x') - k(x)^\top (K + \lambda I_m)^{-1} k(x')$.
Summary

Using the prior $\Sigma_w = \frac{\sigma^2}{\lambda} I_d$, we have
$$\hat{f}(x) = k(x)^\top (K + \lambda I_m)^{-1} y, \qquad s^2(x) = \frac{\sigma^2}{\lambda} k_\lambda(x, x)$$
with $k_\lambda(x, x') = k(x, x') - k(x)^\top (K + \lambda I_m)^{-1} k(x')$,
providing the pointwise posterior prediction
$$f(x) \mid x_1, \dots, x_m, y_1, \dots, y_m \sim \mathcal{N}(\hat{f}(x), s^2(x)).$$
What does it mean to use $\lambda \in \mathbb{R}_{>0}$?
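A minimal numpy sketch of these two formulas, assuming the kernel matrices have already been computed (the routine name and argument layout are illustrative):

```python
import numpy as np

def gp_posterior(K, K_star, k_star_diag, y, lam, sigma2):
    """Posterior mean f_hat(x) and variance s^2(x) at query points.
    K: m x m train kernel, K_star: q x m cross kernel (rows are k(x)^T),
    k_star_diag: length-q vector of k(x, x) at the query points."""
    G_inv = np.linalg.inv(K + lam * np.eye(K.shape[0]))
    f_hat = K_star @ G_inv @ y                                      # k(x)^T (K + lam I)^{-1} y
    k_lam = k_star_diag - np.einsum('qm,mn,qn->q', K_star, G_inv, K_star)
    s2 = (sigma2 / lam) * k_lam                                     # s^2(x) = (sigma^2/lam) k_lam(x, x)
    return f_hat, s2
```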
Pointwise posterior distribution

At each point $x \in \mathcal{X}$, we have a distribution $\mathcal{N}(\hat{f}(x), s^2(x))$.
We can sample from these: $f(x) \sim \mathcal{N}(\hat{f}(x), s^2(x))$.

[Figure: posterior mean $\hat{f}$ with $\pm s$ band over $\mathcal{X} = [-1, 1]$]
Joint distribution

Suppose you query your model at locations $X_\star$.
Extend the prior to include the query points:
$$\begin{bmatrix} f \\ f_\star \end{bmatrix} \sim \mathcal{N}_{m + m_\star}\left( 0, \begin{bmatrix} K_{X,X} & K_{X,X_\star} \\ K_{X_\star,X} & K_{X_\star,X_\star} \end{bmatrix} \right), \qquad y \mid f \sim \mathcal{N}_m(f, \sigma^2 I_m)$$
Using the joint normality of $f_\star$ and $y$:
$$\begin{bmatrix} y \\ f_\star \end{bmatrix} \sim \mathcal{N}_{m + m_\star}\left( 0, \begin{bmatrix} K_{X,X} + \sigma^2 I_m & K_{X,X_\star} \\ K_{X_\star,X} & K_{X_\star,X_\star} \end{bmatrix} \right)$$
Gaussian Process (GP)

By considering the covariance between every pair of points in the space, we get a distribution over functions!
Posterior distribution on $f$:
$$P[f \mid X, y] = \mathcal{N}_{\mathcal{X}}\left( [\hat{f}(x)]_{x \in \mathcal{X}}, \ \frac{\sigma^2}{\lambda} [k_\lambda(x, x')]_{x, x' \in \mathcal{X}} \right)$$

[Figure: posterior mean $\hat{f}$ with $\pm s$ envelope over $\mathcal{X} = [-1, 1]$]
Sampling from a Gaussian Process

A GP is the generalization of the normal probability distribution to function space.
From a normal distribution we sample variables; from a GP we sample functions!

[Figure: functions sampled from the GP posterior over $\mathcal{X} = [-1, 1]$]
Sampling from a Gaussian Process: How to

Observe that $\mathcal{N}_{\mathcal{X}}\left( [\hat{f}(x)]_{x \in \mathcal{X}}, \ \frac{\sigma^2}{\lambda} [k_\lambda(x, x')]_{x, x' \in \mathcal{X}} \right)$ defines a $|\mathcal{X}|$-dimensional multivariate Gaussian distribution.
What if $|\mathcal{X}| = \infty$ (e.g. $\mathcal{X} = [-1, 1]$)?
We can consider a discrete, finite set $\tilde{\mathcal{X}} \subset \mathcal{X}$ and sample from
$$\mathcal{N}_{\tilde{\mathcal{X}}}\left( [\hat{f}(x)]_{x \in \tilde{\mathcal{X}}}, \ \frac{\sigma^2}{\lambda} [k_\lambda(x, x')]_{x, x' \in \tilde{\mathcal{X}}} \right)$$
This results in a function $f$ evaluated at every $x \in \tilde{\mathcal{X}}$.
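A minimal sketch of this recipe, assuming the mean vector and posterior covariance on the grid have already been computed (e.g. with a `gp_posterior`-style routine as above, extended to return the full covariance). The jitter term is a standard numerical-stability trick, not something from the slides.

```python
import numpy as np

def sample_gp_functions(f_hat, cov, n_samples=3, jitter=1e-8, seed=0):
    """Draw n_samples functions from N(f_hat, cov) on a finite grid via Cholesky."""
    L = np.linalg.cholesky(cov + jitter * np.eye(len(f_hat)))   # jitter keeps cov numerically PSD
    z = np.random.default_rng(seed).standard_normal((len(f_hat), n_samples))
    return f_hat[:, None] + L @ z      # each column is one sampled function on the grid
```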
Learning the hyperparameters

If we assume that $\Sigma_w = I_d$, then we have $\lambda = \sigma^2$.
Let $\theta$ denote the kernel hyperparameters (e.g. $\rho$).
In practice you may not know the noise level or the optimal kernel hyperparameters.
Recall the multivariate normal density (here the dimension is $m$):
$$P(y \mid \theta) = \frac{\exp\left( -\tfrac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} y \right)}{\sqrt{(2\pi)^m |K_\theta + \sigma^2 I_m|}}$$
Maximize the log marginal likelihood $\mathcal{L} = \log P(y \mid \theta)$:
$$\mathcal{L} = -\tfrac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} y - \tfrac{1}{2} \log |K_\theta + \sigma^2 I_m| - \tfrac{m}{2} \log(2\pi)$$
Anatomy of the marginal likelihood

$$\mathcal{L} = \log P(y \mid \theta) = -\tfrac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} y - \tfrac{1}{2} \log |K_\theta + \sigma^2 I_m| + \text{const}$$
1st term: quality of predictions (data fit); 2nd term: model complexity penalty.

Trade-off (from C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006, Figure 5.3): [Figure: decomposition of the log marginal likelihood into data-fit and complexity-penalty terms as a function of the characteristic lengthscale, alongside the resulting log marginal likelihood curve]
Gradient-based optimization

Compute gradients:
$$\frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{1}{2} y^\top (K_\theta + \sigma^2 I_m)^{-1} \frac{\partial (K_\theta + \sigma^2 I_m)}{\partial \theta_i} (K_\theta + \sigma^2 I_m)^{-1} y - \frac{1}{2} \mathrm{Tr}\left( (K_\theta + \sigma^2 I_m)^{-1} \frac{\partial (K_\theta + \sigma^2 I_m)}{\partial \theta_i} \right)$$
Minimize the negative log marginal likelihood.
This is a non-convex optimization task.
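A minimal sketch of this procedure for a single hyperparameter, using an illustrative RBF lengthscale $\rho$ (the kernel choice and all names here are assumptions, not from the slides). The returned pair (value, gradient) can be handed to a gradient-based optimizer, e.g. `scipy.optimize.minimize` with `jac=True`.

```python
import numpy as np

def rbf_and_grad(X, rho):
    # RBF kernel matrix and its derivative with respect to the lengthscale rho
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * rho ** 2))
    return K, K * sq / rho ** 3                     # d/d rho of exp(-sq / (2 rho^2))

def neg_log_marginal_likelihood(rho, X, y, sigma2):
    # -log P(y | rho) and its gradient, following the slide's formulas
    m = len(y)
    K, dK = rbf_and_grad(X, rho)
    G = K + sigma2 * np.eye(m)
    Ginv_y = np.linalg.solve(G, y)
    _, logdet = np.linalg.slogdet(G)
    nll = 0.5 * y @ Ginv_y + 0.5 * logdet + 0.5 * m * np.log(2 * np.pi)
    grad = -0.5 * Ginv_y @ dK @ Ginv_y + 0.5 * np.trace(np.linalg.solve(G, dK))
    return nll, grad
```

Because the objective is non-convex, restarting the optimizer from several initial lengthscales is a common safeguard.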
Summary

Normal prior on the weights → Gaussian process distribution.
Regularization ↔ prior on the weights covariance.
A GP provides a posterior distribution on functions:
its expectation is the kernel regression model; its covariance yields confidence intervals.
We can sample discretized functions from a GP.
Typical supervised setting

Have a dataset $(X, y)$ of previously acquired data.
Learn a model $w$ on $(X, y)$.
Apply the model $w$ to provide predictions at new data points.
Samples $(X, y)$ are often assumed to be i.i.d.
Streaming setting

Start with a (possibly empty) dataset $(X_0, y_0)$.
For each time step $t = 1, 2, \dots$:
  Fit a model $w_t$ on $X_{t-1}$ and $y_{t-1}$.
  Acquire a new sample $(x_t, y_t)$ (possibly using $w_t$).
  Define $X_t = [x_i]_{i=1 \dots t}$, $y_t = [y_i]_{i=1 \dots t}$.
Samples may depend on the model (not i.i.d.): samples influence the model, and the model influences which samples are acquired.
Application: Online function optimization

[Figure: partially observed function over $\mathcal{X}$, with the next query location marked "?"]

Unknown function $f : \mathcal{X} \to \mathbb{R}$.
Sequentially select locations $(x_t)_{t \geq 1}$ at which to observe the function values $y_t$.
Try to maximize/minimize the observations.
Example

What is the optimal treatment dosage for a given disease?
For each time step $t = 1, 2, \dots$:
  A new patient $t$ comes in.
  We decide on a treatment dosage $x_t$.
  We observe the patient's response to the treatment, $y_t$.
Our goal is to cure patients as effectively as possible.
What you don't want: give dosages that are known to be bad.
What you want: give good dosages and try informative dosages.
Information

An informative sample improves the model by reducing its uncertainty.

[Figure: two posteriors over $\mathcal{X} = [-1, 1]$, before and after adding an informative observation]

How do we quantify the reduction of uncertainty after observing $y_1, \dots, y_t$ at locations $x_1, \dots, x_t$?
A little bit of information theory

Mutual information between the underlying function $f$ and the observations $y_1, \dots, y_t$ at locations $x_1, \dots, x_t$:
$$I(y_1, \dots, y_t; f) = \underbrace{H(y_1, \dots, y_t)}_{\text{marginal entropy}} - \underbrace{H(y_1, \dots, y_t \mid f)}_{\text{conditional entropy}}$$
It quantifies the amount of information obtained about one random variable through the other random variable.
Entropy is the amount of information held by a random variable:
$H(Y \mid X) = 0$ if and only if the value of $Y$ is completely determined by the value of $X$;
$H(Y \mid X) = H(Y)$ if and only if $Y$ and $X$ are independent random variables.
Amount of information

$$H(X) = \mathbb{E}[-\ln P_X] = -\sum_{i=1}^n p(x_i) \log_b(p(x_i))$$
Example: tossing a coin.
A fair coin has maximal entropy: it is the least predictable!
Every new sample that you get changes your model the most.
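A quick numerical illustration of the coin example (the helper name is hypothetical):

```python
import numpy as np

def bernoulli_entropy(p):
    # Entropy in bits of a coin with P(heads) = p
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in (0.5, 0.9, 0.99):
    print(p, bernoulli_entropy(p))   # 1.0, ~0.47, ~0.08 bits: the fair coin is maximal
```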
Mutual information

Entropy of $X \sim \mathcal{N}_D(\mu, \Sigma)$:
$$H(X) = \mathbb{E}\left[ -\ln \frac{\exp\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)}{\sqrt{(2\pi)^D |\Sigma|}} \right] = \frac{D}{2} + \frac{D}{2} \ln 2\pi + \frac{1}{2} \ln |\Sigma|$$
using $\mathbb{E}\left[ a^\top M^{-1} a \right] = \mathbb{E}\left[ \mathrm{Tr}\left( a^\top M^{-1} a \right) \right] = \mathbb{E}\left[ \mathrm{Tr}\left( a a^\top M^{-1} \right) \right] = D$ with $a = x - \mu$, $M = \Sigma$.
Recall the joint distribution between the $m$ observations $y$ and $m_\star$ query points $X_\star$:
$$\begin{bmatrix} y \\ f_\star \end{bmatrix} \sim \mathcal{N}_{m + m_\star}\left( 0, \begin{bmatrix} K_{X,X} + \sigma^2 I_m & K_{X,X_\star} \\ K_{X_\star,X} & K_{X_\star,X_\star} \end{bmatrix} \right)$$
Then
$$I(y_1, \dots, y_t; f) = \frac{1}{2} \ln \left| I_t + \sigma^{-2} K_t \right|$$
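A one-line sketch of this quantity (the function name is illustrative):

```python
import numpy as np

def information_gain(K, sigma2):
    # I(y_1..t; f) = 1/2 ln |I_t + K / sigma^2| for prior covariance K and noise sigma^2
    _, logdet = np.linalg.slogdet(np.eye(K.shape[0]) + K / sigma2)
    return 0.5 * logdet
```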
Another decomposition of mutual information

Recall that $y_1, \dots, y_t \mid f(x_1), \dots, f(x_t) \sim \mathcal{N}_t\left( (f(x_1), \dots, f(x_t)), \sigma^2 I_t \right)$,
using that $y_i = f(x_i) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Plugging in the conditional entropy of a multivariate normal distribution:
$$I(y_1, \dots, y_t; f) = H(y_1, \dots, y_t) - \frac{1}{2} \ln \left| 2\pi e \sigma^2 I_t \right| = H(y_1, \dots, y_t) - \frac{t}{2} \ln(2\pi e) - \frac{t}{2} \ln \sigma^2$$
What is the marginal entropy $H(y_1, \dots, y_t)$?
Entropy of the observations

Recursively decompose:
$$H(y_1, \dots, y_t) = H(y_1, \dots, y_{t-1}) + H(y_t \mid y_1, \dots, y_{t-1})$$
$$= H(y_1, \dots, y_{t-1}) + \frac{1}{2} \ln\left[ 2\pi e \left( \sigma^2 + \frac{\sigma^2}{\lambda} k_{\lambda, t-1}(x_t, x_t) \right) \right]$$
$$\vdots$$
$$= H(y_1) + H(y_2 \mid y_1) + \dots + \frac{1}{2} \ln\left[ 2\pi e \left( \sigma^2 + \frac{\sigma^2}{\lambda} k_{\lambda, t-1}(x_t, x_t) \right) \right] = \sum_{s=1}^t \frac{1}{2} \ln\left[ 2\pi e \sigma^2 \left( 1 + \frac{1}{\lambda} k_{\lambda, s-1}(x_s, x_s) \right) \right]$$
Still using that $y_i = f(x_i) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$,
also using the uncertainty about the location of the true $f$.
Information gain

$$I(y_1, \dots, y_t; f) = \sum_{s=1}^t \frac{1}{2} \ln\left[ 2\pi e \sigma^2 \left( 1 + \frac{1}{\lambda} k_{\lambda, s-1}(x_s, x_s) \right) \right] - \frac{t}{2} \ln(2\pi e) - \frac{t}{2} \ln \sigma^2 = \sum_{s=1}^t \frac{1}{2} \ln\left[ 1 + \frac{1}{\lambda} k_{\lambda, s-1}(x_s, x_s) \right]$$
Information gain $\gamma_t(\lambda) = I(y_1, \dots, y_t; f)$: the reduction of uncertainty on $f$ after observing $y_1, \dots, y_t$.
The information gain decreases as $\lambda$ grows:
limiting changes in the function limits the contribution of samples.
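A minimal sketch computing $\gamma_t(\lambda)$ sample by sample from this sum (names illustrative); by the telescoping-determinant argument it matches $\tfrac{1}{2} \ln |I_t + K_t / \lambda|$:

```python
import numpy as np

def gamma_t(K, lam):
    # gamma_t(lam) = sum_s 1/2 ln(1 + k_{lam,s-1}(x_s, x_s) / lam)
    gamma = 0.0
    for s in range(K.shape[0]):
        if s == 0:
            k_post = K[0, 0]                # no previous samples: prior variance term
        else:
            ks = K[:s, s]
            # Posterior kernel term k_{lam,s-1}(x_s, x_s) given the first s samples
            k_post = K[s, s] - ks @ np.linalg.solve(K[:s, :s] + lam * np.eye(s), ks)
        gamma += 0.5 * np.log(1 + k_post / lam)
    return gamma    # equals 0.5 * slogdet(I + K/lam)[1]
```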
Information gain vs regularization

[Figure: posterior fits and information-gain curves $\gamma_t(\lambda)$ over $t = 0, \dots, 100$ for $\lambda \in \{0.1, 1, 10\}$, on two example functions]
Summary

Streaming regression: sequentially gather (potentially non-i.i.d.) samples.
The information gain measures the total information about $f$ carried by the observations gathered so far.
The information gain is controlled by the information-sharing capability of the kernel,
and by the changes in the model admitted by the regularization.
The information gain will play a part in confidence intervals.
Confidence intervals

Given that we have gathered $t$ samples under the streaming setting, what kind of guarantees can we have on the resulting model?
More specifically, could we guarantee that
$$|f(x) - \hat{f}_t(x)| \leq \text{something}$$
simultaneously for all $x \in \mathcal{X}$ and for all $t \geq 0$?
Motivations:
  Control bad behaviours in critical applications.
  Help select the next sample location: maximize/minimize the function? maximize model improvement?
  Derive sound algorithms.
Result (Maillard, 2016)

Under the assumption of subgaussian noise,
$$|f(x) - \hat{f}_t(x)| \leq \sqrt{\frac{k_{\lambda,t}(x, x)}{\lambda}} \left[ \sqrt{\lambda} \, \|f\|_K + \sigma \sqrt{2 \ln(1/\delta) + 2 \gamma_t(\lambda)} \right]$$
with probability higher than $1 - \delta$,
simultaneously for all $t \geq 0$ and for all $x \in \mathcal{X}$.
Observe that the error bound scales with the information gain!
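A direct transcription of this bound into code, assuming the reconstruction above and treating all arguments as precomputed quantities (a sketch, not the paper's implementation):

```python
import numpy as np

def confidence_width(k_post_xx, lam, sigma, f_norm_K, gamma, delta=0.05):
    """Half-width of the (1 - delta) confidence interval around f_hat_t(x).
    k_post_xx = k_{lam,t}(x, x); f_norm_K upper-bounds ||f||_K; gamma = gamma_t(lam)."""
    return np.sqrt(k_post_xx / lam) * (
        np.sqrt(lam) * f_norm_K + sigma * np.sqrt(2 * np.log(1 / delta) + 2 * gamma)
    )
```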
Subgaussian noise

A real-valued random variable $X$ is $\sigma^2$-subgaussian if
$$\mathbb{E}\left[ e^{\gamma X} \right] \leq e^{\gamma^2 \sigma^2 / 2} \quad \text{for all } \gamma \in \mathbb{R}.$$
The Laplace transform of $X$ is dominated by the Laplace transform of a random variable sampled from $\mathcal{N}(0, \sigma^2)$.
This requires the tails of the noise distribution to be dominated by the tails of a Gaussian distribution.
For example, this holds for Gaussian noise and for bounded noise.
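As a quick sanity check of the definition on a bounded example (not from the slides): Rademacher noise, $X = \pm 1$ each with probability $\tfrac{1}{2}$, has $\mathbb{E}[e^{\gamma X}] = \cosh\gamma$ and is 1-subgaussian.

```python
import numpy as np

# cosh(gamma) <= exp(gamma^2 / 2) for every gamma: Rademacher noise is 1-subgaussian
gammas = np.linspace(-5, 5, 101)
print(np.all(np.cosh(gammas) <= np.exp(gammas ** 2 / 2)))   # True
```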
Confidence envelope

[Figure: confidence envelopes around the true $f$ after 5, 10, 50, and 100 observations, for $\lambda = \sigma^2$ and $\lambda = \sigma^2 / \|f\|_K^2$]
Summary

It is possible to have guarantees even with non-i.i.d. data.
The prediction error depends on:
  how well your model shares information across observations;
  how well your model is adapted to the function;
  how noisy the observations are.
Regularization ↔ prior on $\Sigma_w$.
Confidence intervals/envelopes will be really useful for deriving algorithms (we will see that later).