Gaussian Process Regression with Student-t Likelihood. Jarno Vanhatalo, Pasi Jylänki and Aki Vehtari, NIPS 2009

1 Gaussian Process Regression with Student-t Likelihood, July 30, 2010

2 Outline. Scale mixture representation: scale mixture representation for the Student-t distribution; inference for the augmented model. Approximate inference with the Laplace approximation: introduction to the Laplace approximation; Laplace approximation for the Student-t model; MAP estimation for the hyperparameters. Experimental results. Conclusion.

3 Student-t Model

4 The paper covers the following inference methods. Approximate inference: Variational Bayes (VB), Expectation Maximization (EM), and the Laplace approximation. Sampling-based inference (exact only in the limit of infinitely many samples): Markov Chain Monte Carlo (MCMC). The authors derive the Laplace approximation for the GP regression model with Student-t likelihood and compare it with the other inference methods.

5 Given inputs $X = \{x_i\}_{i=1}^{n}$ and latent function values $f = \{f(x_i)\}_{i=1}^{n}$, the GP regression model is $y = f + \epsilon$ with GP prior
$$p(f \mid \theta) = N(f; 0, K_{ff}), \qquad (K_{ff})_{ij} = K(x_i, x_j) = \sigma_{se}^2 \exp\Big(-\sum_{d=1}^{D} \frac{(x_{id} - x_{jd})^2}{l_d^2}\Big),$$
where $K$ is the kernel function. The likelihood (i.e., the noise model) is
$$p(y \mid f, \theta) = \prod_{i=1}^{n} t(y_i; f_i, \sigma, \nu) = \prod_{i=1}^{n} \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi}\,\sigma} \Big(1 + \frac{(y_i - f_i)^2}{\nu\sigma^2}\Big)^{-\frac{\nu+1}{2}},$$
where $\theta = \{\sigma_{se}^2, l_{1:D}, \sigma, \nu\}$ is the set of hyperparameters.
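The two ingredients of the model can be written down directly. The following is a minimal sketch, assuming NumPy; the function names `se_kernel` and `student_t_loglik` are illustrative and do not come from the paper or the slides.

```python
import math
import numpy as np

def se_kernel(X1, X2, sigma_se2, lengthscales):
    """(K)_ij = sigma_se^2 * exp(-sum_d (x_id - x_jd)^2 / l_d^2), as on the slide."""
    ls = np.asarray(lengthscales, dtype=float)
    A = np.asarray(X1, dtype=float) / ls          # scale each input dimension by l_d
    B = np.asarray(X2, dtype=float) / ls
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_se2 * np.exp(-sq)

def student_t_loglik(y, f, sigma, nu):
    """log p(y | f, theta) = sum_i log t(y_i; f_i, sigma, nu)."""
    r2 = (np.asarray(y, dtype=float) - np.asarray(f, dtype=float)) ** 2
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return float(np.sum(const - 0.5 * (nu + 1) * np.log1p(r2 / (nu * sigma**2))))
```

Note that this kernel follows the slide's convention of dividing by $l_d^2$ rather than $2 l_d^2$, so the length-scales differ by a factor of $\sqrt{2}$ from the more common parameterization.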

6 The joint distribution of the model is
$$p(f, y \mid \theta) = p(f \mid \theta)\, p(y \mid f, \theta) = N(f; 0, K_{ff}) \prod_{i=1}^{n} t(y_i; f_i, \sigma, \nu).$$
The posterior of $f$ is then
$$p(f \mid y, \theta) = \frac{p(f \mid \theta)\, p(y \mid f, \theta)}{p(y \mid \theta)}.$$
This inference is not straightforward because the prior and the likelihood are not conjugate. There are two options: 1) use the scale mixture representation of the Student-t distribution to make the model conditionally conjugate, then apply VB/EM/MCMC; 2) use the Laplace approximation directly. Once the posterior $p(f \mid y, \theta)$ is inferred, predictions for a new input $x_*$ follow from
$$p(y_* \mid x_*, y, \theta) = \iint p(y_* \mid f_*, \theta)\, p(f_* \mid f, x_*, \theta)\, p(f \mid y, \theta)\, df\, df_*.$$
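When the posterior $p(f \mid y, \theta)$ is replaced by a Gaussian $N(f; m, A)$, which is what the approximate methods in these slides produce, the inner integral over $f$ in the predictive distribution has a closed form. The sketch below computes only the latent predictive $p(f_* \mid x_*, y, \theta)$; integrating against the Student-t noise model to get $p(y_*)$ is left out. The function name and argument names are assumptions made for illustration.

```python
import numpy as np

def predict_latent(K_ff, K_sf, K_ss, m, A):
    """Mean and covariance of f_* when p(f | y) is replaced by N(m, A).

    K_sf: cross-covariance between test and training inputs,
    K_ss: prior covariance among the test inputs.
    """
    alpha = np.linalg.solve(K_ff, m)        # K_ff^{-1} m
    V = np.linalg.solve(K_ff, K_sf.T)       # K_ff^{-1} K_fs
    mean = K_sf @ alpha
    # prior conditional covariance plus the uncertainty carried over from q(f)
    cov = K_ss - K_sf @ V + V.T @ A @ V
    return mean, cov
```

As a sanity check, plugging in the exact posterior of a Gaussian-noise GP for `m` and `A` recovers the usual GP regression predictive equations.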

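7 Scale Mixture Representation for the Student-t Distribution
The Student-t distribution can be written as a scale mixture of Gaussians:
$$p(y \mid f, \theta) = \prod_{i=1}^{n} t(y_i; f_i, \sigma, \nu) = \prod_{i=1}^{n} \int N(y_i; f_i, \tau_i^{-1})\, \mathrm{Ga}(\tau_i; \nu/2, \nu\sigma^2/2)\, d\tau_i,$$
where the Gamma distribution is parameterized by shape and rate. The augmented model is then
$$p(f, y, \tau \mid \theta) = p(f \mid \theta)\, p(y \mid f, \tau, \theta)\, p(\tau \mid \theta) = N(f; 0, K_{ff})\, N(y; f, \mathrm{diag}(\tau)^{-1}) \prod_{i=1}^{n} \mathrm{Ga}(\tau_i; \nu/2, \nu\sigma^2/2).$$
Integrating out $\tau$ reduces the augmented model to the original model. The advantage is that, conditional on $\tau$, the likelihood is Gaussian and therefore conjugate to the prior of $f$.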
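The scale mixture identity can be checked numerically for a single observation. The following sketch integrates the Gaussian-times-Gamma integrand over $\tau$ with a hand-written trapezoid rule and compares it with the closed-form Student-t density. The grid limits and function names are assumptions chosen for this example, and the code assumes $\nu \ge 2$ so the integrand is finite at $\tau = 0$.

```python
import math
import numpy as np

def student_t_pdf(y, f, sigma, nu):
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma)
    return c * (1 + (y - f) ** 2 / (nu * sigma**2)) ** (-(nu + 1) / 2)

def scale_mixture_pdf(y, f, sigma, nu, tau_max=80.0, n_grid=400001):
    """Integral over tau of N(y; f, 1/tau) * Ga(tau; nu/2, rate = nu sigma^2 / 2)."""
    a, b = nu / 2, nu * sigma**2 / 2                 # Gamma shape and rate
    tau = np.linspace(0.0, tau_max, n_grid)
    normal = np.sqrt(tau / (2 * math.pi)) * np.exp(-0.5 * tau * (y - f) ** 2)
    gamma = b**a / math.gamma(a) * tau ** (a - 1) * np.exp(-b * tau)
    g = normal * gamma
    return float(np.sum((g[1:] + g[:-1]) * np.diff(tau)) / 2)   # trapezoid rule

direct = student_t_pdf(1.3, 0.2, 1.0, 4.0)
mixed = scale_mixture_pdf(1.3, 0.2, 1.0, 4.0)
```

The two values agree up to the quadrature error of the grid.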

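8 Variational Bayes
VB approximates the joint posterior by a product of factors:
$$p(f, \tau \mid y, \theta) = \frac{p(f \mid \theta)\, p(y \mid f, \tau, \theta)\, p(\tau \mid \theta)}{p(y \mid \theta)} \approx q(f)\, q(\tau).$$
1) $q(f) = N(f; m, A)$ with $A = (K_{ff}^{-1} + \mathrm{diag}(\bar\tau))^{-1}$ and $m = (K_{ff}^{-1} + \mathrm{diag}(\bar\tau))^{-1}\, \mathrm{diag}(\bar\tau)\, y$.
2) $q(\tau) = \prod_{i=1}^{n} \mathrm{Ga}\big(\tau_i; (\nu+1)/2, (\nu\sigma^2 + \langle (y_i - f_i)^2 \rangle)/2\big)$ with $\bar\tau_i = \dfrac{\nu+1}{\nu\sigma^2 + \langle (y_i - f_i)^2 \rangle}$.
Notice that $\langle (y_i - f_i)^2 \rangle = (y_i - m_i)^2 + A_{ii}$, where $\langle \cdot \rangle$ is the expectation under $q(f)$.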
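The two VB updates can be iterated as a simple fixed-point loop. This is a minimal sketch, assuming a fixed number of iterations instead of a convergence test and starting $q(f)$ at the GP prior; both choices, and the function name `vb_student_t`, are assumptions rather than details taken from the paper.

```python
import numpy as np

def vb_student_t(K, y, sigma, nu, n_iter=100):
    """Alternate the q(tau) and q(f) updates; returns m, A and the mean of q(tau)."""
    m, A = np.zeros(len(y)), K.copy()          # start q(f) at the GP prior
    for _ in range(n_iter):
        # q(tau) update: <tau_i> = (nu + 1) / (nu sigma^2 + <(y_i - f_i)^2>)
        resid2 = (y - m) ** 2 + np.diag(A)
        tau = (nu + 1) / (nu * sigma**2 + resid2)
        # q(f) update: A = (K^{-1} + diag(<tau>))^{-1}, written with the
        # Woodbury identity so that K itself is never inverted
        A = K - K @ np.linalg.solve(K + np.diag(1.0 / tau), K)
        m = A @ (tau * y)
    return m, A, tau
```

An observation with a large residual receives a small expected precision $\bar\tau_i$, which is how the model downweights outliers.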

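9 Expectation Maximization
EM is a special case of VB that gives a point estimate of $f$:
$$p(f, \tau \mid y, \theta) = \frac{p(f \mid \theta)\, p(y \mid f, \tau, \theta)\, p(\tau \mid \theta)}{p(y \mid \theta)} \approx \delta(f - \hat f)\, q(\tau).$$
1) $\hat f = \arg\max_f \int q(\tau) \log p(f, \tau, y \mid \theta)\, d\tau = (K_{ff}^{-1} + \mathrm{diag}(\bar\tau))^{-1}\, \mathrm{diag}(\bar\tau)\, y$.
2) $q(\tau) = \prod_{i=1}^{n} \mathrm{Ga}\big(\tau_i; (\nu+1)/2, (\nu\sigma^2 + (y_i - \hat f_i)^2)/2\big)$ with $\bar\tau_i = \dfrac{\nu+1}{\nu\sigma^2 + (y_i - \hat f_i)^2}$.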
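The EM updates differ from the VB ones only in dropping the posterior variance of $f$ from the residual term. The sketch below also records $\log p(y \mid f, \theta) + \log p(f \mid \theta)$ (up to a constant) after every M-step, since EM should never decrease it. The starting point $f = 0$, the iteration count and the function names are assumptions for illustration.

```python
import numpy as np

def log_joint(K, y, f, sigma, nu):
    """log p(y | f) + log p(f), up to a constant that does not depend on f."""
    loglik = -0.5 * (nu + 1) * np.sum(np.log1p((y - f) ** 2 / (nu * sigma**2)))
    return float(loglik - 0.5 * f @ np.linalg.solve(K, f))

def em_student_t(K, y, sigma, nu, n_iter=100):
    f = np.zeros(len(y))
    trace = []
    for _ in range(n_iter):
        # E-step: mean of q(tau_i) given the current point estimate of f
        tau = (nu + 1) / (nu * sigma**2 + (y - f) ** 2)
        # M-step: (K^{-1} + diag(tau))^{-1} diag(tau) y = K (K + diag(1/tau))^{-1} y
        f = K @ np.linalg.solve(K + np.diag(1.0 / tau), y)
        trace.append(log_joint(K, y, f, sigma, nu))
    return f, tau, trace      # tau is from the E-step preceding the last M-step
```

Because the Student-t posterior can be multimodal, the estimate returned depends on the starting point.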

10 Markov Chain Monte Carlo
Gibbs sampling alternates between two steps: 1) sample $f$ from $p(f \mid \tau, y, \theta)$; 2) sample $\tau$ from $p(\tau \mid f, y, \theta)$. The two steps are iterated until enough samples have been collected.
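Both conditionals are available in closed form in the augmented model: $p(f \mid \tau, y, \theta)$ is Gaussian with the same $m$ and $A$ expressions as $q(f)$ in VB (with the sampled $\tau$ in place of $\bar\tau$), and $p(\tau_i \mid f, y, \theta)$ is a Gamma distribution with shape $(\nu+1)/2$ and rate $(\nu\sigma^2 + (y_i - f_i)^2)/2$. These forms follow from conjugacy and are not quoted from the slides. A minimal sketch, with burn-in, thinning and convergence diagnostics omitted, and with a small jitter term added as an assumption for numerical stability:

```python
import numpy as np

def gibbs_student_t(K, y, sigma, nu, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    f = np.zeros(n)
    f_draws, tau_draws = np.empty((n_samples, n)), np.empty((n_samples, n))
    for s in range(n_samples):
        # tau_i | f, y ~ Ga((nu + 1)/2, rate = (nu sigma^2 + (y_i - f_i)^2)/2)
        rate = 0.5 * (nu * sigma**2 + (y - f) ** 2)
        tau = rng.gamma(shape=0.5 * (nu + 1), scale=1.0 / rate)
        # f | tau, y ~ N(m, A) with A = (K^{-1} + diag(tau))^{-1}
        A = K - K @ np.linalg.solve(K + np.diag(1.0 / tau), K)
        A = 0.5 * (A + A.T) + 1e-10 * np.eye(n)    # symmetrize, add jitter
        m = A @ (tau * y)
        f = m + np.linalg.cholesky(A) @ rng.standard_normal(n)
        f_draws[s], tau_draws[s] = f, tau
    return f_draws, tau_draws
```

NumPy's `Generator.gamma` takes a scale parameter, so the rate is inverted before sampling.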

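11 Introduction to the Laplace Approximation
Given a twice-differentiable density $p(x)$, the Taylor expansion of $\log p(x)$ around $x_0$ is
$$\log p(x) = \log p(x_0) + (x - x_0)^\top \nabla \log p(x)\big|_{x = x_0} + \tfrac{1}{2} (x - x_0)^\top \big(\nabla\nabla \log p(x)\big|_{x = x_0}\big) (x - x_0) + o(\|x - x_0\|^2).$$
Specifically, choose $x_0 = \hat x$ to be a mode of $p$, so that $\nabla \log p(x)\big|_{x = \hat x} = 0$. Then
$$p(x) \approx c \exp\Big(-\tfrac{1}{2} (x - \hat x)^\top \Sigma^{-1} (x - \hat x)\Big) = N(x; \hat x, \Sigma),$$
where $c$ is a normalizing constant and $\Sigma^{-1} = -\nabla\nabla \log p(x)\big|_{x = \hat x}$.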
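A one-dimensional example makes the recipe concrete. The sketch below finds the mode by Newton's method and reads the variance off the curvature, using an unnormalized Gamma density as the target. The target, the starting point and the function name `laplace_1d` are assumptions chosen for illustration.

```python
def laplace_1d(grad, hess, x0, n_iter=50):
    """Newton search for a mode of log p, then Sigma = -1 / (log p)''(mode)."""
    x = x0
    for _ in range(n_iter):
        x = x - grad(x) / hess(x)
    return x, -1.0 / hess(x)

# Target: log p(x) = (a - 1) log x - b x + const, an unnormalized Gamma(a, b) density
a, b = 5.0, 2.0
mode, var = laplace_1d(lambda x: (a - 1) / x - b,       # first derivative of log p
                       lambda x: -(a - 1) / x**2,       # second derivative of log p
                       x0=1.0)
```

Here the mode is $(a-1)/b = 2$ and the curvature gives variance $(a-1)/b^2 = 1$, whereas the true Gamma(5, 2) distribution has mean 2.5 and variance 1.25. The approximation matches the mode and local curvature, not the moments.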

12 Laplace Approximation for the Student-t Model
For this GP regression model, the posterior $p(f \mid y, \theta)$ is approximated by a normal distribution, $p(f \mid y, \theta) \approx N(f; \hat f, \Sigma)$.
1) $\hat f = \arg\max_f \log p(f \mid y, \theta) = \arg\max_f \big(\log p(y \mid f, \theta) + \log p(f \mid \theta)\big)$.
2) $\Sigma^{-1} = -\nabla\nabla \log p(f \mid y, \theta)\big|_{f = \hat f} = K_{ff}^{-1} + W$ with
$$W_{ii} = -(\nu+1)\, \frac{(y_i - \hat f_i)^2 - \nu\sigma^2}{\big((y_i - \hat f_i)^2 + \nu\sigma^2\big)^2}, \qquad W_{ij} = 0 \text{ if } i \neq j.$$
The mode $\hat f$ is usually found by numerical optimization, but in this paper it is found with the iterative EM updates.
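The diagonal of $W$ follows from differentiating the Student-t log-density twice with respect to $f_i$. A minimal sketch of the gradient, $W$ and the resulting covariance is given below; the function names are illustrative assumptions.

```python
import numpy as np

def student_t_grad_and_W(y, f, sigma, nu):
    r = y - f
    s = r**2 + nu * sigma**2
    grad = (nu + 1) * r / s                          # d/df_i log t(y_i; f_i, sigma, nu)
    W = -(nu + 1) * (r**2 - nu * sigma**2) / s**2    # W_ii = -d^2/df_i^2 log t(...)
    return grad, W

def laplace_covariance(K, W):
    """Sigma = (K^{-1} + W)^{-1}, computed as (I + K W)^{-1} K without inverting K."""
    n = K.shape[0]
    return np.linalg.solve(np.eye(n) + K * W[None, :], K)
```

The formula shows that $W_{ii}$ becomes negative whenever $(y_i - \hat f_i)^2 > \nu\sigma^2$, so the Student-t likelihood is not log-concave and $K_{ff}^{-1} + W$ has to be handled with more care than in the Gaussian-noise case.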

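13 MAP Estimation for the Hyperparameters in the Laplace Approximation
The MAP estimate of $\theta$ is
$$\hat\theta = \arg\max_\theta \log p(\theta, y) = \arg\max_\theta \big(\log p(y \mid \theta) + \log p(\theta)\big).$$
The marginal likelihood term can be approximated as
$$p(y \mid \theta) = \frac{p(f \mid \theta)\, p(y \mid f, \theta)}{p(f \mid y, \theta)} \approx \frac{N(f; 0, K_{ff})\, p(y \mid f, \theta)}{N(f; \hat f, \Sigma)} \quad (\forall f).$$
Specifically, setting $f = \hat f$ gives
$$\log p(y \mid \theta) \approx \log p(y \mid \hat f, \theta) - \tfrac{1}{2} \hat f^\top K_{ff}^{-1} \hat f - \tfrac{1}{2} \log |K_{ff}| - \tfrac{1}{2} \log |K_{ff}^{-1} + W|.$$
Cholesky decomposition and rank-one updates are used to speed up this computation.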
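The approximate log marginal likelihood can be evaluated once the mode is available. The sketch below finds $\hat f$ with the EM updates and uses the identity $\log|K_{ff}| + \log|K_{ff}^{-1} + W| = \log|I + K_{ff} W|$. It relies on a plain `slogdet` call instead of the Cholesky and rank-one update scheme mentioned on the slide, so it is a simplification and not the paper's implementation. The function name and iteration count are assumptions.

```python
import math
import numpy as np

def laplace_log_marginal(K, y, sigma, nu, n_iter=200):
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):                          # EM iterations for the mode f_hat
        tau = (nu + 1) / (nu * sigma**2 + (y - f) ** 2)
        f = K @ np.linalg.solve(K + np.diag(1.0 / tau), y)
    r2 = (y - f) ** 2
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    loglik = n * const - 0.5 * (nu + 1) * np.sum(np.log1p(r2 / (nu * sigma**2)))
    W = -(nu + 1) * (r2 - nu * sigma**2) / (r2 + nu * sigma**2) ** 2
    # -1/2 log|K| - 1/2 log|K^{-1} + W| = -1/2 log|I + K W|
    # (the sign returned by slogdet is ignored; it is +1 when K^{-1} + W is positive definite)
    _, logdet = np.linalg.slogdet(np.eye(n) + K * W[None, :])
    quad = f @ np.linalg.solve(K, f)
    return float(loglik - 0.5 * quad - 0.5 * logdet)
```

For very large $\nu$ the Student-t likelihood approaches a Gaussian with variance $\sigma^2$, and the value returned approaches the exact Gaussian-noise log marginal likelihood $\log N(y; 0, K_{ff} + \sigma^2 I)$, which gives a convenient correctness check.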

14 Experimental Results

15 Conclusion
A Laplace approximation is derived for the GP regression model with Student-t likelihood. Its performance is competitive with other inference methods (e.g., VB). The Student-t likelihood model performs significantly better than the Gaussian likelihood model on four real data sets.
