Nonparametric Regression:

1 Nonparametric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression. Seungjin Choi, Department of Computer Science and Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea. seungjin@postech.ac.kr

2 Outline
Two nonparametric regression methods are introduced:
- Kernel regression: the Nadaraya-Watson estimator.
- Gaussian process regression (Bayesian nonparametric models applied to the regression problem): GP regression, and its link with Bayesian linear regression.

3 Nadaraya-Watson Kernel Regression

4 Kernel Regression
Nadaraya-Watson estimator:
$$f(x) = \frac{\sum_{n=1}^{N} y_n\, k_h(x - x_n)}{\sum_{l=1}^{N} k_h(x - x_l)}.$$
In the case of a Gaussian kernel,
$$k_h(x - x_n) = \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{-\frac{1}{2h^2}\,\|x - x_n\|^2\right\} = \frac{1}{(2\pi h^2)^{D/2}} \prod_{i=1}^{D} \exp\left\{-\frac{1}{2h^2}\,(x_i - x_{n,i})^2\right\}.$$
The estimator computes a locally weighted average of the $y_n$'s near $x$, using the kernel as a weighting function.
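As a rough illustration (not part of the original slides), here is a NumPy sketch of the Nadaraya-Watson estimator with a Gaussian kernel; the function name, array shapes, and toy data are assumptions made for the example.

```python
import numpy as np

def nadaraya_watson(x_query, X, y, h=0.5):
    """Locally weighted average of y_n around x_query, weighted by a Gaussian kernel of bandwidth h."""
    # X: (N, D) training inputs, y: (N,) targets, x_query: (D,) query point.
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dists / h**2)   # the (2*pi*h^2)^(-D/2) factor cancels in the ratio
    return np.sum(weights * y) / np.sum(weights)

# Toy example: noisy sine data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
print(nadaraya_watson(np.array([0.5]), X, y))   # close to sin(0.5) for a reasonable bandwidth
```

The bandwidth h plays the same smoothing role as the kernel width in the formula above: a smaller h averages over fewer nearby points.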

5 Nadaraya-Watson Estimator (Detailed Derivation)
Recall
$$f(x) = \mathbb{E}[y \mid x] = \int y\, p(y \mid x)\, dy = \int y\, \frac{p(x, y)}{p(x)}\, dy.$$
Use kernel density estimation to determine both $p(x, y)$ and $p(x)$:
$$p(x, y) = \frac{1}{N} \sum_{n=1}^{N} k_{h_x}(x - x_n)\, k_{h_y}(y - y_n), \qquad p(x) = \frac{1}{N} \sum_{n=1}^{N} k_{h_x}(x - x_n).$$

6 Compute
$$\int y\, p(x, y)\, dy = \frac{1}{N} \sum_{n=1}^{N} k_{h_x}(x - x_n) \int y\, k_{h_y}(y - y_n)\, dy = \frac{1}{N} \sum_{n=1}^{N} k_{h_x}(x - x_n) \underbrace{\int y\, \frac{1}{Z_y} \exp\left\{-\lambda_y (y - y_n)^2\right\} dy}_{y_n} = \frac{1}{N} \sum_{n=1}^{N} y_n\, k_{h_x}(x - x_n).$$
Therefore,
$$f(x) = \frac{\int y\, p(x, y)\, dy}{p(x)} = \frac{\sum_{n=1}^{N} y_n\, k_h(x - x_n)}{\sum_{l=1}^{N} k_h(x - x_l)}.$$

7 Gaussian Process Regression

8 Pictorial Illustration of GP Regression
Green curve: the true sinusoidal function from which the data points are obtained by sampling and adding Gaussian noise. Red line: the mean of the GP predictive distribution; the shaded region corresponds to plus and minus two standard deviations.
Treat the latent vector as parameters: $\mathbf{f} = [f(x_1), \ldots, f(x_N)]^\top \in \mathbb{R}^N$. Compute the Gaussian process posterior $f(x) \mid X, y$ by combining the GP prior $f(x) \sim \mathcal{GP}(0, k(x, x'))$ with the Gaussian likelihood $p(y \mid X, \mathbf{f}) = \mathcal{N}(y \mid \mathbf{f}, \sigma^2 I)$. [Figure source: Bishop's PRML]

9 Bayesian Regression: Parametric vs. Nonparametric
Given a set of training examples $\mathcal{D} = \{(x_n, y_n) \mid n = 1, \ldots, N\}$, the goal of Bayesian regression is to make a prediction at a new input $x$, computing $p(y \mid x, \mathcal{D})$.
Parametric approach: model $x_n, y_n \mid \theta \sim p(x, y \mid \theta)$, assuming a parametric representation $f(\cdot) = f_\theta(\cdot)$. Prior over parameters: $p(\theta)$. Posterior over parameters: $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$. Prediction: $p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$.
Nonparametric approach: model $x_n, y_n \mid f$, without a parametric representation of $f(\cdot)$. Prior over the function: $f \sim p(f)$. Posterior over the function: $p(f \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid f)\, p(f)}{p(\mathcal{D})}$. Prediction: $p(y \mid x, \mathcal{D}) = \int p(y \mid x, f)\, p(f \mid \mathcal{D})\, df$.
GP regression infers $p(f \mid \mathcal{D})$ instead of $p(\theta \mid \mathcal{D})$.

10 Gaussian Processes
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many variables. A GP defines a distribution over functions of the form $f : \mathcal{X} \to \mathbb{R}$, which is completely specified by a mean function $\mu(x)$ and a covariance function $k(x, x')$:
$$f(x) \sim \mathcal{GP}\left(\mu(x), k(x, x')\right),$$
where
$$\mu(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\left[(f(x) - \mu(x))\,(f(x') - \mu(x'))\right] = \sigma_f^2 \exp\left\{-\frac{1}{2l^2}\,\|x - x'\|^2\right\},$$
which is referred to as the squared exponential kernel; $l$ is a length-scale parameter that controls the rate of decay of the covariance, and $\sigma_f^2$ controls the prior variance (signal variance).
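A brief sketch (not from the slides) of the squared exponential covariance function and of drawing sample functions from the zero-mean GP prior; the helper name and the jitter term are assumptions.

```python
import numpy as np

def se_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Squared exponential covariance: sigma_f^2 * exp(-||x - x'||^2 / (2 l^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sq / length_scale**2)

# Draw a few sample functions from the GP prior f(x) ~ GP(0, k) on a grid.
xs = np.linspace(-5, 5, 100)[:, None]
K = se_kernel(xs, xs) + 1e-8 * np.eye(100)   # small jitter for numerical stability
samples = np.random.default_rng(0).multivariate_normal(np.zeros(100), K, size=3)
```

Each row of samples is one function drawn from the prior; a larger length_scale yields smoother draws, and sigma_f scales their amplitude.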

11 Gaussian Process Regression
Model: $y_i = f(x_i) + \epsilon_i$, where $f(\cdot)$ is referred to as the latent function.
Latent vector: $\mathbf{f} = [f(x_1), \ldots, f(x_N)]^\top \in \mathbb{R}^N$. Note that in the GPR model the parameters are the function itself.
Gaussian likelihood: $p(y \mid X, \mathbf{f}) = \mathcal{N}(y \mid \mathbf{f}, \sigma^2 I)$.
Gaussian process prior (zero mean): $f(x) \sim \mathcal{GP}(0, k(x, x'))$, so that $p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid 0, K)$, where $K = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$.

12 We are interested in $p(\mathbf{f}_* \mid X_*, X, y)$ for given test data $X_*$, leading to the predictive distribution $p(y_* \mid X_*, X, y)$. This is nothing but computing the posterior over $\mathbf{f}$, combining the Gaussian likelihood with the GP prior. It can be computed analytically.

13 Gaussian process posterior:
$$f(x) \mid X, y \sim \mathcal{GP}\left(\bar{\mu}(x), \bar{k}(x, x')\right),$$
where
$$\bar{\mu}(x) = k(x, X)\left[k(X, X) + \sigma^2 I\right]^{-1} y, \qquad \bar{k}(x, x') = k(x, x') - k(x, X)\left[k(X, X) + \sigma^2 I\right]^{-1} k(X, x').$$
Predictive distribution at a test input $x_*$:
$$p(f_* \mid X, y, x_*) = \mathcal{N}\left(k(x_*, X)\left[k(X, X) + \sigma^2 I\right]^{-1} y,\; k(x_*, x_*) - k(x_*, X)\left[k(X, X) + \sigma^2 I\right]^{-1} k(X, x_*)\right),$$
where $k(x_*, X) = [k(x_*, x_1), \ldots, k(x_*, x_N)] \in \mathbb{R}^{1 \times N}$ and $k(X, X) = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$.

14 [Figure source: Rasmussen and Williams]

15 GP regression is a linear predictor in the sense that the prediction at $x_*$ is obtained via
$$\mathbb{E}[f_* \mid \mathcal{D}] = \sum_{n=1}^{N} \alpha_n\, k(x_*, x_n), \qquad \text{where } \alpha = \left[\sigma^2 I + k(X, X)\right]^{-1} y.$$

16 Algorithm Outline
Algorithm 1 GP Regression
Input: training dataset $\mathcal{D} = \{(x_n, y_n) \mid n = 1, \ldots, N\}$, test input $x_*$, covariance function $k(\cdot, \cdot)$, and noise level $\sigma^2$
1: Compute $K = [k(x_i, x_j)]$ and $k_* = [k(x_1, x_*), \ldots, k(x_N, x_*)]^\top$
2: $L = \mathrm{Cholesky}(K + \sigma^2 I)$
3: $\alpha = L^\top \backslash (L \backslash y)$
4: Compute the predictive mean: $\mathbb{E}[f_*] = k_*^\top \alpha$
5: $v = L \backslash k_*$
6: Compute the predictive variance: $\mathrm{var}(f_*) = k(x_*, x_*) - v^\top v$
7: Compute the marginal log-likelihood: $\log p(y \mid X) = -\frac{1}{2} y^\top \alpha - \sum_n \log L_{nn} - \frac{N}{2} \log 2\pi$
8: return $\mathbb{E}[f_*]$, $\mathrm{var}(f_*)$, $\log p(y \mid X)$
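A minimal NumPy/SciPy sketch of Algorithm 1 (not from the slides), assuming a squared exponential kernel; all names and the toy data are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def se_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq / length_scale**2)

def gp_regression(X, y, x_star, sigma=0.1, length_scale=1.0, sigma_f=1.0):
    """Algorithm 1: predictive mean/variance at x_star and the log marginal likelihood."""
    N = X.shape[0]
    K = se_kernel(X, X, length_scale, sigma_f)                            # line 1
    k_star = se_kernel(X, x_star[None, :], length_scale, sigma_f).ravel()
    L = cholesky(K + sigma**2 * np.eye(N), lower=True)                    # line 2
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))     # line 3
    mean = k_star @ alpha                                                 # line 4
    v = solve_triangular(L, k_star, lower=True)                           # line 5
    var = sigma_f**2 - v @ v                                              # line 6: k(x*, x*) - v^T v
    log_ml = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)  # line 7
    return mean, var, log_ml                                              # line 8

# Example: noisy sine data, prediction at x* = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
print(gp_regression(X, y, np.array([0.5])))
```

The two triangular solves on line 3 realize the $x = L^\top \backslash (L \backslash y)$ idiom of the next slide and avoid forming an explicit matrix inverse.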

17 Cholesky Decomposition
The Cholesky decomposition of a symmetric, positive-definite matrix $A$ factorizes $A$ into the product of a lower triangular matrix $L$ and its transpose: $A = L L^\top$, where $L$ is called the Cholesky factor. To solve $Ax = b$ for $x$, first solve the triangular system $Ly = b$ by forward substitution and then the triangular system $L^\top x = y$ by back substitution. We write the solution as $x = L^\top \backslash (L \backslash b)$.

18 GP Regression: Detailed Derivation
Let $\mathbf{f}_* \in \mathbb{R}^T$ be the latent function values evaluated at the test data points $X_* \in \mathbb{R}^{D \times T}$. We first write the joint distribution of the observed target values and the function values at the test locations under the prior:
$$\begin{bmatrix} y \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k(X, X) + \sigma^2 I_N & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right).$$
It follows from the Gaussian identity that
$$\mathbf{f}_* \mid y, X, X_* \sim \mathcal{N}\left(k(X_*, X)\left[k(X, X) + \sigma^2 I_N\right]^{-1} y,\; k(X_*, X_*) - k(X_*, X)\left[k(X, X) + \sigma^2 I_N\right]^{-1} k(X, X_*)\right),$$
leading to
$$p(y_* \mid y, X, X_*) = \mathcal{N}\left(k(X_*, X)\left[k(X, X) + \sigma^2 I_N\right]^{-1} y,\; k(X_*, X_*) - k(X_*, X)\left[k(X, X) + \sigma^2 I_N\right]^{-1} k(X, X_*) + \sigma^2 I_T\right).$$

19 Gaussian Identity
A $D$-dimensional Gaussian density for $x$ is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right\}.$$
Define the augmented vector $y = [x^\top, z^\top]^\top$, which is jointly normal, i.e.,
$$y = \begin{bmatrix} x \\ z \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}\right).$$
The marginal densities are $x \sim \mathcal{N}(a, A)$ and $z \sim \mathcal{N}(b, B)$. The conditional distributions are
$$p(x \mid z) = \mathcal{N}\left(a + C B^{-1}(z - b),\; A - C B^{-1} C^\top\right), \qquad p(z \mid x) = \mathcal{N}\left(b + C^\top A^{-1}(x - a),\; B - C^\top A^{-1} C\right).$$
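A small NumPy check (not from the slides) of the conditioning formula, using the fact that the conditional covariance of x given z equals the inverse of the x-block of the joint precision matrix; all variable names and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random joint covariance [[A, C], [C^T, B]] for (x, z) with dim(x) = 2, dim(z) = 3.
M = rng.standard_normal((5, 5))
Sigma = M @ M.T + 5 * np.eye(5)
a, b = rng.standard_normal(2), rng.standard_normal(3)
A, C, B = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

z = rng.standard_normal(3)                       # an arbitrary observed value of z
cond_mean = a + C @ np.linalg.solve(B, z - b)    # a + C B^{-1} (z - b)
cond_cov = A - C @ np.linalg.solve(B, C.T)       # A - C B^{-1} C^T
print(cond_mean)

# Check: cov(x | z) equals the inverse of the x-block of the joint precision matrix.
Lambda = np.linalg.inv(Sigma)
assert np.allclose(cond_cov, np.linalg.inv(Lambda[:2, :2]))
```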

20 Pros and Cons
Pros: GPs provide fully probabilistic predictive distributions, including estimates of the uncertainty of the predictions. The evidence framework applied to GPs makes it possible to learn the hyperparameters of the kernel (marginal likelihood maximization).
Cons: Computational complexity grows as $O(N^3)$ (for a naïve implementation).

21 Marginal Likelihood (Evidence)
The marginal likelihood is the integral of the likelihood times the prior (marginalization over the function values $\mathbf{f}$):
$$p(y \mid X) = \int \underbrace{p(y \mid \mathbf{f}, X)}_{\text{likelihood}}\, \underbrace{p(\mathbf{f} \mid X)}_{\text{prior}}\, d\mathbf{f},$$
where $p(y \mid \mathbf{f}, X) = \mathcal{N}(y \mid \mathbf{f}, \sigma^2 I)$ and $p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid 0, K)$. Performing the Gaussian integration yields
$$p(y \mid X) = \mathcal{N}(y \mid 0, K + \sigma^2 I).$$
Thus, the marginal log-likelihood is
$$\log p(y \mid X) = -\frac{N}{2} \log 2\pi - \underbrace{\frac{1}{2} \log \left|K + \sigma^2 I\right|}_{\text{model complexity}} - \underbrace{\frac{1}{2}\, y^\top \left(K + \sigma^2 I\right)^{-1} y}_{\text{data fit}}.$$

22 Figure: GP fits for hyperparameters $(l, \sigma_f, \sigma)$ = (a) (1, 1, 0.1); (b) (0.3, 1.08, ...); (c) (3.0, 1.16, 0.89). [Figure source: Murphy's Fig. 15.3]

23 The marginal likelihood tells us the probability of the observations given the assumptions of the model. Hyperparameters are determined by maximizing the marginal log-likelihood. Alternatively, cross-validation can be used for hyperparameter estimation (leave-one-out predictive probability, a.k.a. pseudo-likelihood). Sparse approximations for GPs will be covered in other lectures (possibly in the fall semester).
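A sketch (not from the slides) of hyperparameter selection by maximizing the marginal log-likelihood for the squared exponential kernel: scipy.optimize.minimize is applied to the negative log marginal likelihood over log-parameters so that l, sigma_f, and sigma stay positive. It omits jitter and failure handling, so it is only illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """-log p(y | X) for the SE kernel; log_params = (log l, log sigma_f, log sigma)."""
    l, sigma_f, sigma = np.exp(log_params)
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = sigma_f**2 * np.exp(-0.5 * sq / l**2)
    N = len(y)
    L = cholesky(K + sigma**2 * np.eye(N), lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
print("learned (l, sigma_f, sigma):", np.exp(res.x))
```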

24 Link with Bayesian Linear Regression

25 Bayesian Linear Regression
Given a set of $N$ training examples $\mathcal{D} = \{(x_n, y_n) \mid n = 1, \ldots, N\}$ and assuming Gaussian noise $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, the linear regression model is described as
$$y_n = f(x_n) + \epsilon_n = \theta^\top x_n + \epsilon_n,$$
or, in compact form, $y = X^\top \theta + \epsilon$, where $X \in \mathbb{R}^{D \times N}$ is the design matrix.
Gaussian prior over $\theta$: $\theta \sim \mathcal{N}(0, \Sigma_0)$.
Gaussian likelihood:
$$p(y \mid X, \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}\left(y_n - \theta^\top x_n\right)^2\right\} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N} \exp\left\{-\frac{1}{2\sigma^2}\left\|y - X^\top \theta\right\|_2^2\right\} = \mathcal{N}\left(X^\top \theta, \sigma^2 I\right).$$

26 Calculate the posterior over $\theta$:
$$p(\theta \mid X, y) = \frac{p(y \mid X, \theta)\, p(\theta)}{\int p(y \mid X, \theta)\, p(\theta)\, d\theta} = \frac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}$$
$$\propto \exp\left\{-\frac{1}{2\sigma^2}\left(y - X^\top \theta\right)^\top \left(y - X^\top \theta\right)\right\} \exp\left\{-\frac{1}{2}\, \theta^\top \Sigma_0^{-1} \theta\right\} \propto \exp\left\{-\frac{1}{2}\left(\theta - \theta_N\right)^\top \Sigma_N^{-1} \left(\theta - \theta_N\right)\right\},$$
where
$$\theta_N = \frac{1}{\sigma^2} \Sigma_N X y, \qquad \Sigma_N = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} X X^\top\right)^{-1}.$$
Hence, the posterior $p(\theta \mid X, y)$ is also Gaussian: $p(\theta \mid X, y) = \mathcal{N}(\theta_N, \Sigma_N)$.

27 Given a new input $x_*$, the predictive distribution of $f_* = f(x_*)$ is calculated as
$$p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, \theta)\, p(\theta \mid X, y)\, d\theta,$$
whose mean is $\mathbb{E}_{\theta \mid X, y}\left[x_*^\top \theta\right]$. Hence,
$$p(f_* \mid x_*, X, y) = \mathcal{N}\left(x_*^\top \theta_N,\; x_*^\top \Sigma_N x_*\right),$$
where
$$\theta_N = \frac{1}{\sigma^2} \Sigma_N X y, \qquad \Sigma_N = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} X X^\top\right)^{-1}.$$
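A small NumPy sketch (illustrative, not from the slides) of the posterior and predictive formulas above, keeping the slides' convention that the design matrix X ∈ R^{D×N} holds one training input per column.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, sigma = 2, 50, 0.1
theta_true = np.array([0.5, -1.0])

X = rng.standard_normal((D, N))                     # design matrix, one column per example
y = X.T @ theta_true + sigma * rng.standard_normal(N)

Sigma0 = np.eye(D)                                  # prior: theta ~ N(0, Sigma0)
Sigma_N = np.linalg.inv(np.linalg.inv(Sigma0) + X @ X.T / sigma**2)
theta_N = Sigma_N @ X @ y / sigma**2                # posterior mean

x_star = np.array([1.0, 2.0])                       # new input
pred_mean = x_star @ theta_N                        # x_*^T theta_N
pred_var = x_star @ Sigma_N @ x_star                # x_*^T Sigma_N x_* (noise-free f_*)
print(pred_mean, pred_var)
```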

28 Bayesian linear regression with $f(x) = w_1 + w_2 x$: (a) Gaussian prior over $w_1$ and $w_2$; (b) three training points (superimposed on the data are the predictive mean plus/minus two standard deviations of the noise-free predictive distribution $p(f_* \mid x_*, X, y)$); (c) likelihood; (d) posterior over $w_1$ and $w_2$. [Figure source: Rasmussen and Williams]

29 Increasing Expressiveness
Use a set of basis functions $\phi(x) = [\phi_1(x), \ldots, \phi_M(x)]^\top$ to project a $D$-dimensional input $x \in \mathbb{R}^D$ into an $M$-dimensional feature space, $\phi : \mathbb{R}^D \to \mathbb{R}^M$. The regression function is written as $f(x) = \phi(x)^\top \theta$. The design matrix is $\Phi \in \mathbb{R}^{M \times N}$. The predictive distribution $p(f_* \mid \phi_*, \Phi, y)$ is computed in feature space:
$$p(f_* \mid \phi_*, \Phi, y) = \mathcal{N}(\mu_*, \sigma_*^2),$$
where
$$\mu_* = \frac{1}{\sigma^2}\, \phi_*^\top \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} \Phi \Phi^\top\right)^{-1} \Phi y, \qquad \sigma_*^2 = \phi_*^\top \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} \Phi \Phi^\top\right)^{-1} \phi_*.$$

30 Now we show that the predictive distribution $p(f_* \mid \phi_*, \Phi, y)$ can be expressed in terms of inner products in feature space (with $K = \Phi^\top \Sigma_0 \Phi$):
$$p(f_* \mid \phi_*, \Phi, y) = \mathcal{N}(\mu_*, \sigma_*^2),$$
where
$$\mu_* = \phi_*^\top \Sigma_0 \Phi \left(K + \sigma^2 I\right)^{-1} y, \qquad \sigma_*^2 = \phi_*^\top \Sigma_0 \phi_* - \phi_*^\top \Sigma_0 \Phi \left(K + \sigma^2 I\right)^{-1} \Phi^\top \Sigma_0 \phi_*.$$
Recall our earlier result for GP regression:
$$f_* \mid y, X, x_* \sim \mathcal{N}\left(k(x_*, X)\left[k(X, X) + \sigma^2 I\right]^{-1} y,\; k(x_*, x_*) - k(x_*, X)\left[k(X, X) + \sigma^2 I\right]^{-1} k(X, x_*)\right).$$
GP regression is Bayesian linear regression leveraged with the kernel trick.
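A quick numerical check (not from the slides) that the weight-space and kernel (function-space) forms above agree, using a random feature matrix Φ and the induced kernel K = Φ^T Σ0 Φ; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, sigma = 4, 20, 0.3
Phi = rng.standard_normal((M, N))      # training features, one column per example
phi_star = rng.standard_normal(M)      # test feature vector
y = rng.standard_normal(N)
Sigma0 = np.eye(M)

# Weight-space form.
A = np.linalg.inv(Sigma0) + Phi @ Phi.T / sigma**2
mu_w = phi_star @ np.linalg.solve(A, Phi @ y) / sigma**2
var_w = phi_star @ np.linalg.solve(A, phi_star)

# Kernel form with K = Phi^T Sigma0 Phi.
K = Phi.T @ Sigma0 @ Phi
k_star = Phi.T @ Sigma0 @ phi_star
G = K + sigma**2 * np.eye(N)
mu_f = k_star @ np.linalg.solve(G, y)
var_f = phi_star @ Sigma0 @ phi_star - k_star @ np.linalg.solve(G, k_star)

assert np.allclose(mu_w, mu_f) and np.allclose(var_w, var_f)
```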

31 Detailed Calculation: Weight-Space View
Define
$$K = \Phi^\top \Sigma_0 \Phi, \qquad \Sigma_N = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} \Phi \Phi^\top\right)^{-1}.$$
Then
$$\Sigma_N^{-1} \Sigma_0 \Phi = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} \Phi \Phi^\top\right) \Sigma_0 \Phi = \Phi + \frac{1}{\sigma^2} \Phi \Phi^\top \Sigma_0 \Phi = \Phi\left(I + \frac{1}{\sigma^2} K\right) = \frac{1}{\sigma^2} \Phi\left(\sigma^2 I + K\right).$$
Premultiply by $\Sigma_N$ and postmultiply by $(K + \sigma^2 I)^{-1}$ to obtain
$$\Sigma_0 \Phi\left(\sigma^2 I + K\right)^{-1} = \Sigma_N \left(\Sigma_N^{-1} \Sigma_0 \Phi\right)\left(\sigma^2 I + K\right)^{-1} = \Sigma_N \left[\frac{1}{\sigma^2} \Phi\left(\sigma^2 I + K\right)\right]\left(\sigma^2 I + K\right)^{-1} = \frac{1}{\sigma^2} \Sigma_N \Phi.$$

32 Thus, we have
$$\frac{1}{\sigma^2} \Sigma_N \Phi = \Sigma_0 \Phi\left(\sigma^2 I + K\right)^{-1}.$$
With this result, we have
$$\mu_* = \frac{1}{\sigma^2}\, \phi_*^\top \Sigma_N \Phi y = \phi_*^\top \Sigma_0 \Phi\left(\sigma^2 I + K\right)^{-1} y.$$
Apply the matrix inversion lemma,
$$(A + BCD)^{-1} = A^{-1} - A^{-1} B\left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1},$$
to obtain
$$\sigma_*^2 = \phi_*^\top \Sigma_N \phi_* = \phi_*^\top \left(\Sigma_0^{-1} + \frac{1}{\sigma^2} \Phi \Phi^\top\right)^{-1} \phi_* = \phi_*^\top \left[\Sigma_0 - \Sigma_0 \Phi\left(\sigma^2 I + \Phi^\top \Sigma_0 \Phi\right)^{-1} \Phi^\top \Sigma_0\right] \phi_* = \phi_*^\top \Sigma_0 \phi_* - \phi_*^\top \Sigma_0 \Phi\left(K + \sigma^2 I\right)^{-1} \Phi^\top \Sigma_0 \phi_*. \quad \text{QED}$$

33 How many basis functions? Recall $\phi : \mathbb{R}^D \to \mathbb{R}^M$ ($M$ could be infinite). Kernels are inner products in a feature space. For instance,
$$k(x, y) = e^{-(x - y)^2} = e^{-x^2}\, e^{-y^2} \underbrace{\sum_{k=0}^{\infty} \frac{2^k x^k y^k}{k!}}_{e^{2xy}},$$
an inner product between infinitely many features of $x$ and of $y$.
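A tiny check (not from the slides) that a truncated version of this expansion approaches the kernel value, illustrating that e^{-(x-y)^2} is an inner product over infinitely many features; names are illustrative.

```python
import numpy as np
from math import factorial

def truncated_kernel(x, y, terms=30):
    """e^{-x^2} e^{-y^2} * sum_{k < terms} (2 x y)^k / k!, which tends to e^{-(x-y)^2}."""
    series = sum((2 * x * y) ** k / factorial(k) for k in range(terms))
    return np.exp(-x**2) * np.exp(-y**2) * series

x, y = 0.7, -1.2
print(truncated_kernel(x, y), np.exp(-(x - y) ** 2))   # the two values agree to many digits
```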

34 References
C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
C. E. Rasmussen, Advances in Gaussian Processes, NIPS-2006 Tutorial.
