Lecture 5: GPs and Streaming regression

Size: px

Start display at page:

Download "Lecture 5: GPs and Streaming regression"

Laureen Samantha Sutton
5 years ago
Views:

1 Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19,

2 Recall: Non-parametric regression Input space X R n, target space Y = R Input matrix X of size m n, target vector y of size m 1 Feature mapping φ : X R d Assumption: y = Φw Minimize J λ (w) = 1 2 (Φw y) (Φw y) + λ 2 w w λ 0 The solution is w = (Φ Φ + λi n ) 1 Φ y = Φ (ΦΦ + λi m ) 1 y COMP-652 and ECSE-608, Lecture 5 - September 19,

3 Recall: Kernel regression Let K = ΦΦ and k(x) = φ(x) Φ be the kernel matrix/vector: K = k(x 1, x 1 ) k(x 1, x 2 )... k(x 1, x m ) k(x 2, x 1 ). k(x 2, x 2 ).... k(x 2, x m ). k(x m, x 1 ) k(x m, x 2 )... k(x m, x m ) The predictions for the input data are given by k(x) = ŷ = ΦΦ (ΦΦ + λi m ) 1 y = K(K + λi m ) 1 y k(x, x 1 ) k(x, x 2 ). k(x, x m ) The prediction for a new input point x is given by ˆf(x) = φ(x) Φ (K + λi m ) 1 y = k(x)(k + λi m ) 1 y COMP-652 and ECSE-608, Lecture 5 - September 19,

4 Recall: Bayesian view of regression Consider noisy observations y = f(x) + ɛ = φ(x) w + ɛ Recall Bayes rule: posterior = likelihood prior marginal likelihood P φ (w y, X) = P φ(y X, w)p (w) P φ (y X) Marginal likelihood is independent of weights w With Gaussian noise ɛ N (0, σ 2 ) P φ (y X, w) = = m P φ (y i x i, w) = i=1 1 ( 2πσ) m exp ( m i=1 1 2πσ exp y Φw 2 2σ 2 ) ( (y i φ(x i ) w) 2 ) 2σ 2 = N m ( Φw, σ 2 I m ) COMP-652 and ECSE-608, Lecture 5 - September 19,

5 Posterior distribution on parameters With Gaussian prior on parameters w N d (0, Σ w ) ) y Φw 2 P φ (w y, X) exp ( 2σ 2 exp ( w Σ 1 ) w w 2 = exp ( y y y Φw w Φy + w Φ Φw + σ 2 w Σ 1 2σ 2 = exp ( y y y Φw w Φy + w (Φ Φ + σ 2 Σ 1 ) w )w 2σ 2 exp ( (w b) A 1 (w b) ) where A 1 = σ 2 (Φ Φ + σ 2 Σ 1 w ) and b = (Φ Φ + σ 2 Σ 1 w ) 1 Φ y The posterior distribution is Gaussian! w w ) COMP-652 and ECSE-608, Lecture 5 - September 19,

6 Predictive distribution The pointwise posterior predictive distribution is a normal distribution ( f(x) x 1,..., x m, y 1,..., y m N ˆf(x), s (x)) 2 of expectation and variance ˆf(x) = φ(x) (Φ Φ + σ 2 Σ 1 w ) 1 Φ y = φ(x) Σ w Φ (ΦΣ w Φ + σ 2 I m ) 1 y s 2 (x) = σ 2 φ(x) (Φ Φ + σ 2 Σ 1 w ) 1 φ(x) = φ(x) Σ w φ(x) φ(x) Σ w Φ (Φ Σ w Φ + σ 2 I m ) 1 ΦΣ w φ(x) using Sherman-Morrison COMP-652 and ECSE-608, Lecture 5 - September 19,

7 Reinterpreting regularization Recall kernel regression predictions: ˆf(x) = k(x) (K + λi m ) 1 y Using prior Σ w = σ2 λ I d, the predictive mean rewrites as: ˆf(x) = φ(x) Σ w Φ (ΦΣ w Φ + σ 2 I m ) 1 y ) 1 = φ(x) (Φ σ2 λ Φ σ2 λ Φ + σ 2 I m y = k(x) (K + λi m ) 1 y λ encodes some prior on weights w COMP-652 and ECSE-608, Lecture 5 - September 19,

8 Reinterpreting regularization (cont d) Still using Σ w = σ2 λ I d, the predictive variance rewrites as: s 2 (x) = φ(x) Σ w φ(x) φ(x) Σ w Φ (Φ Σ w Φ + σ 2 I m ) 1 ΦΣ w φ(x) ) 1 = φ(x) σ2 φ(x) φ(x) σ2 (Φ λ λ Φ σ2 λ Φ + σ2 I m Φ σ2 λ φ(x) = σ2 λ k λ(x, x) with k λ (x, x ) = k(x, x ) k(x) (K + λi m ) 1 k(x ) COMP-652 and ECSE-608, Lecture 5 - September 19,

9 Summary Using prior Σ w = σ2 λ I d, we have ˆf(x) = k(x) (K + λi m ) 1 y s 2 (x) = σ2 λ k λ(x, x) with k λ (x, x ) = k(x, x ) k(x) (K + λi m ) 1 k(x ) Providing pointwise posterior prediction f(x) x 1,..., x m, y 1,..., y m N ( ˆf(x), s 2 (x)) What does it mean to use λ R 0? COMP-652 and ECSE-608, Lecture 5 - September 19,

10 Pointwise posterior distribution ( At each point x X, we have a distribution N ˆf(x), s (x)) 2 We can sample from these f(x) ( N ˆf(x), s (x)) 2 Y ˆf s X COMP-652 and ECSE-608, Lecture 5 - September 19,

11 Joint distribution Suppose you query your model at locations X Extend the prior to include query points: [ f f ] [ ]) KX,X K X, X N m+m (0, X,X K X,X K X,X y f N m (f, σ 2 I m ) Using joint normality of f and y: [ ( [ ]) y KX,X + σ N f ] m+m 0, 2 I m K X,X K X,X K X,X COMP-652 and ECSE-608, Lecture 5 - September 19,

12 Gaussian Process (GP) By considering the covariance between every points in the space, we get a distribution over functions! Posterior distribution on f: P [f X, y] N X ( [ ˆf(x) ] x X ), σ2 λ [k λ(x, x )] x,x X Y ˆf s X COMP-652 and ECSE-608, Lecture 5 - September 19,

13 Sampling from a Gaussian Process Generalization of normal probability distribution to the function space From a normal distribution we sample variables From a GP we sample functions! 2 1 Y 0 1 Sampled f X COMP-652 and ECSE-608, Lecture 5 - September 19,

14 Sampling from a Gaussian Process: How to Observe that N X ( [ ˆf(x) ] x X ), σ2 λ [k λ(x, x )] x,x X defines a X -dimensional multivariate Gaussian distribution What if X = (e.g. X = [ 1, 1])? We can consider a discrete, finite, set X X and sample from N X ( [ ˆf(x) ] x X ), σ2 λ [k λ(x, x )] x,x X This will result in a function f evaluated at every x X COMP-652 and ECSE-608, Lecture 5 - September 19,

15 Learning the hyperparameters If we assume that Σ w = I d, then we have λ = σ 2 Let θ denote the kernel hyperparameters (e.g. ρ) In practice you may not know the noise and the optimal kernel hypers Recall: multivariate normal density P (y θ) = exp ( 1 2 y (K θ + σ 2 I m ) 1 y ) (2π)D K θ + σ 2 I m Maximize the marginal likelihood L = log P (y θ): L = 1 2 y (K θ + σ 2 I m ) 1 y D 2 log(2π) 1 2 log K θ + σ 2 I m COMP-652 and ECSE-608, Lecture 5 - September 19,

16 Marginal likelihood: Anatomy of marginal likelihood L = log P (y θ) 1 2 y (K θ + σ 2 I m ) 1 y 1 2 log K θ + σ 2 I m C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Lea ISBN X. c 2006 Massachusetts Institute of Technology. 1st term: quality of predictions; 2nd term: model complexity 5.4 Model Selection for GP Regression Trade-off (from Rasmussen & Williams, 2006): log probability minus complexity penalty data fit marginal likelihood 10 0 characteristic lengthscale (a) log marginal likelihood % conf int 10 0 Characteristic lengthscale Figure 5.3: Panel (a) shows a decomposition of the log marginal likelihood its constituents: data-fit and complexity penalty, as a function of the characteri length-scale. The training data is drawn from a Gaussian process with SE covaria function and parameters (`, f, n) = (1, 1, 0.1), the same as in Figure 2.5, and we fitting only the length-scale parameter ` (the two other parameters have been se COMP-652 and ECSE-608, Lecture 5 - September 19, (b) 2 5

17 Compute gradients: Gradient-based optimization L = 1 θ i 2 y (K θ + σ 2 I m ) 1 (K θ + σ 2 I m ) (K θ + σ 2 I m ) 1 y θ i 1 ( 2 Tr (K θ + σ 2 I m ) 1 (K θ + σ 2 ) I m ) θ i Minimize the negative Non-convex optimization task COMP-652 and ECSE-608, Lecture 5 - September 19,

18 Summary Normal priors on the weights distribution Gaussian Process Regularization prior on the weights covariance GP provides a posterior distribution on functions Expectation: kernel regression model Covariance confidence intervals Sample discretized functions from a GP COMP-652 and ECSE-608, Lecture 5 - September 19,

19 Typical supervised setting Have dataset (X, y) of previously acquired data Learn model w on (X, y) Apply model w to provide predictions at new data points Samples (X, y) are often assumed to be i.i.d. COMP-652 and ECSE-608, Lecture 5 - September 19,

20 Streaming setting Start with (possibly empty) dataset (X 0, y 0 ) For each time step t = 1, 2,... : Fit model w t on X t 1 and y t 1 Acquire a new sample (x t, y t ) (possibly using w t ) Define X t = [x i ] i=1...t, y t = [y i ] i=1...t Samples may be dependent of model (not i.i.d.) Samples influence model / Samples depend on model COMP-652 and ECSE-608, Lecture 5 - September 19,

21 Application: Online function optimization 1.0 Y ? X Unknown function f : X R Sequentially select locations (x t ) t 1 at which to observe the function y t Try to maximize/minimize the observations COMP-652 and ECSE-608, Lecture 5 - September 19,

22 Example What is the optimal treatment dosage for a given disease? For each time step t = 1, 2,... : New patient t comes in We decide on treatment dosage x t We observe the patient s response to treatement, y t Our goal is to cure patients as effectively as possible What you don t want: give bad dosages that were known to be bad What you want: give good dosages try informative dosages COMP-652 and ECSE-608, Lecture 5 - September 19,

23 Information An informative sample improves the model by reducing its uncertainty 1 1 Y 0 Y X X How do we quantity the reduction of uncertainty after observing y 1,..., y t at locations x 1,..., x t? COMP-652 and ECSE-608, Lecture 5 - September 19,

24 A little bit of information theory Mutual information between underlying function f and observations y 1,..., y t at locations x 1,..., x t : I(y 1,..., y t ; f) = H(y 1,..., y t ) }{{} Marginal entropy H(y 1,..., y t f) }{{} Conditional entropy It quantifies the amount of information obtained about one random variable, through the other random variable Entropy is the amount of information held by a random variable H(Y X) = 0 if and only if the value of Y is completely determined by the value of X H(Y X) = H(Y ) if and only if Y and X are independent random variables COMP-652 and ECSE-608, Lecture 5 - September 19,

25 Amount of information H(X) = E[ ln P X ] = n p(x i ) log b (p(x i )) i=1 Example: Tossing a coin A fair coin has maximal entropy: it is the less predictable! Every new sample that you get changes your model the most COMP-652 and ECSE-608, Lecture 5 - September 19,

26 Mutual information Entropy of X N D (µ, Σ): [ ln exp ( 1 2 (x µ) Σ 1 (x µ) ) ] H(X) = E (2π)D Σ = D 2 +D 2 ln 2π+1 2 ln Σ Using E [ a M 1 a ] = E [ Tr ( a M 1 a )] = E [ Tr ( aa M 1)] = D Recall the joint distribution between m observations y and m query points X : [ ( [ ]) y KX,X + σ N f ] m+m 0, 2 I m K X,X K X,X K X,X Then I(y 1,..., y t ; f) = 1 2 ln K t + σ 2 I t COMP-652 and ECSE-608, Lecture 5 - September 19,

27 Another decomposition of mutual information ( ) Recall that y 1,..., y t f(x 1 ),..., f(x t ) N t (f(x1 ),..., f(x t )), σ 2 I t Using that y i = f(x i ) + ɛ with ɛ N (0, σ 2 ) Pluging-in the conditional entropy of a multivariate normal distribution: I(y 1,..., y t ; f) = H(y 1,..., y t ) 1 2 ln 2πeσ2 I t = H(y 1,..., y t ) t 2 ln 2πe t 2 σ2 What is the marginal entropy H(y 1,..., y t )? COMP-652 and ECSE-608, Lecture 5 - September 19,

28 Recursively decompose Entropy of observations H(y 1,..., y t ) = H(y 1,..., y t 1 ) + H(y t y 1,..., y t 1 ) = H(y 1,..., y t 1 ) + 1 )] [2πe (σ 2 ln 2 + σ2 λ k λ,t 1(x t, x t ). = H(y 1 ) + H(y 2 y 1 ) )] [2πe (σ 2 ln 2 + σ2 λ k λ,t 1(x s, x s ) t ( 1 = [2πeσ 2 ln )] λ k λ,s 1(x s, x s ) s=1 Still using that y i = f(x i ) + ɛ with ɛ N (0, σ 2 ) Also using the uncertainty about the location of the true f COMP-652 and ECSE-608, Lecture 5 - September 19,

29 Information gain I(y 1,..., y t ; f) = = t s=1 t s=1 ( 1 [2πeσ 2 ln )] λ k λ,s 1(x s, x s ) [ 1 2 ln ] λ k λ,s 1(x s, x s ) t 2 ln(2πe) t 2 σ2 Information gain γ t (λ) = I(y 1,..., y t ; f): reduction of uncertainty on f after observing y 1,..., y t Information gain is inversely proportionnal to λ Limiting changes in function limits the contribution of samples COMP-652 and ECSE-608, Lecture 5 - September 19,

30 Information gain vs regularization Y 1 0 λ = 0.1 λ = 1 λ = 10 Y 1 0 λ = 0.1 λ = 1 λ = X X γt(λ) 3 2 λ = 0.1 λ = 1 λ = 10 γt(λ) λ = 0.1 λ = 1 λ = t t COMP-652 and ECSE-608, Lecture 5 - September 19,

31 Summary Streaming regression: sequentially gather (potentially non-i.i.d) samples Information gain measures the total information that could result from adding a new sample (observation) to the model Information gain is controlled by the information sharing capability of the kernel Information gain is controlled by the changes in model admitted by regularization Information gain will play a part in confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19,

32 Confidence intervals Given that we have gathered t samples under the streaming setting, what kind of guarantees can we have on the resulting model? More specifically, could we guarantee that f(x) ˆf t (x) something simultaneously for all x X and for all t 0? Motivations: Control bad behaviours in critical applications Help selecting the next sample location Maximize/minimize function? Maximize model improvement? Derive sound algorithms COMP-652 and ECSE-608, Lecture 5 - September 19,

33 Result (Maillard, 2016) Under the assumption of subgaussian noise... f(x) ˆf t (x) [ kλ,t (x, x) λ f K + σ 2 ln(1/δ) + 2γ t (λ)] λ With probability higher than 1 δ Simultaneously for all t 0, for all x X Observe that the error bound scales with the information gain! COMP-652 and ECSE-608, Lecture 5 - September 19,

34 Subgaussian noise A real-valued random variable X is σ 2 -subgaussian if E [ e γx] e γ2 σ 2 /2 The Laplace transform of X is dominated by the Laplace transform of a random variable sampled from N (0, σ 2 ) Require that the tails of the noise distribution are dominated by the tails of a Gaussian distribution For example, true for Gaussian noise Bounded noise COMP-652 and ECSE-608, Lecture 5 - September 19,

35 Confidence envelope 5 observations 10 observations 1 1 Y 0 Y X 50 observations X 100 observations Y Y f λ = σ 2 λ = σ 2 / f 2 K X X COMP-652 and ECSE-608, Lecture 5 - September 19,

36 Summary It is possible to have guarantees even with non-i.i.d. data The predition error depends on How well your model shares information across observations How well your model is adapted to the function How noisy the observations are Regularization prior on Σ w Confidence intervals/envelopes will be really useful for deriving algorithms (we will see that later) COMP-652 and ECSE-608, Lecture 5 - September 19,

Nonparameteric Regression:

Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,