Markov Chain Monte Carlo Algorithms for Gaussian Processes

1 Markov Chain Monte Carlo Algorithms for Gaussian Processes. Michalis K. Titsias, Neil Lawrence and Magnus Rattray, School of Computer Science, University of Manchester, June 2008

2 Outline
- Gaussian processes
- Sampling algorithms for Gaussian process models: sampling from the prior, Gibbs sampling schemes, sampling using control variables
- Applications: demonstration on regression/classification, transcriptional regulation
- Summary/future work

3 Gaussian Processes
A Gaussian process (GP) is a distribution over a real-valued function f(x). It is defined by a mean function µ(x) = E[f(x)] and a covariance or kernel function k(x_n, x_m) = E[f(x_n) f(x_m)].
E.g. this can be the RBF (or squared exponential) kernel k(x_n, x_m) = α exp(−||x_n − x_m||² / (2ℓ²))
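As a concrete illustration, here is a minimal sketch (not the authors' code) of the squared-exponential kernel matrix, assuming the reconstructed form k(x_n, x_m) = α exp(−||x_n − x_m||²/(2ℓ²)); the function name and parameters are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between two sets of inputs (rows are points)."""
    X1 = np.atleast_2d(X1).reshape(len(X1), -1)
    X2 = np.atleast_2d(X2).reshape(len(X2), -1)
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return alpha * np.exp(-0.5 * sq_dists / lengthscale ** 2)
```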

4 Gaussian Processes
We evaluate the function at a set of inputs (x_i)_{i=1}^N: f_i = f(x_i). The Gaussian process then reduces to a multivariate Gaussian distribution over f = (f_1, ..., f_N):
p(f) = N(f | 0, K) = (2π)^{-N/2} |K|^{-1/2} exp(−(1/2) fᵀ K⁻¹ f)
where the covariance matrix K is defined by the kernel function. Strictly, p(f) is a conditional distribution; a more precise notation is p(f | X).
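A short sketch of drawing a function from the zero-mean prior p(f) = N(0, K) above, using a jittered Cholesky factor for numerical stability; rbf_kernel refers to the illustrative helper sketched earlier.

```python
import numpy as np

X = np.linspace(0.0, 1.0, 200)                      # input grid
K = rbf_kernel(X, X, alpha=1.0, lengthscale=0.1)    # prior covariance from the kernel
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))   # jitter keeps K positive definite
f_prior_sample = L @ np.random.randn(len(X))        # f ~ N(0, K)
```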

5 Gaussian Processes for Bayesian Learning
Many problems involve inference over unobserved/latent functions. A Gaussian process can place a prior on a latent function. Bayesian inference:
- Data y = (y_i)_{i=1}^N, associated with inputs (x_i)_{i=1}^N
- Likelihood model p(y | f)
- GP prior p(f) for the latent function f
- Bayes' rule: p(f | y) ∝ p(y | f) p(f), i.e. Posterior ∝ Likelihood × Prior
For regression, where the likelihood is Gaussian, this computation is analytically tractable.
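Since the Gaussian-likelihood case is analytic, the following sketch computes the posterior over f at the training inputs, under the assumption y = f + ε with ε ~ N(0, σ²I); names and the noise model are illustrative, not taken from the slides.

```python
import numpy as np

def gp_regression_posterior(K, y, sigma2):
    """Posterior mean and covariance of f given y, for a zero-mean GP prior with covariance K."""
    A = K + sigma2 * np.eye(len(y))       # Cov[y] = K + sigma2 * I
    mean = K @ np.linalg.solve(A, y)      # E[f | y]
    cov = K - K @ np.linalg.solve(A, K)   # Cov[f | y]
    return mean, cov
```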

6 Gaussian Processes for Bayesian Regression
[Figure: data and the GP prior (RBF kernel function), and the posterior GP.]

7 Gaussian Processes for Non-Gaussian Likelihoods
When the likelihood p(y | f) is non-Gaussian, computations are analytically intractable. Non-Gaussian likelihoods arise in classification problems, spatio-temporal models and geostatistics, and non-linear differential equations with latent functions, so approximations need to be considered. MCMC is a powerful framework that offers:
- arbitrarily precise approximation in the limit of long runs
- general applicability (independent of the functional form of the likelihood)

8 MCMC for Gaussian Processes
The Metropolis-Hastings (MH) algorithm:
- Initialize f^(0)
- Form a Markov chain: use a proposal distribution Q(f^(t+1) | f^(t)) and accept with probability
  min(1, [p(y | f^(t+1)) p(f^(t+1)) Q(f^(t) | f^(t+1))] / [p(y | f^(t)) p(f^(t)) Q(f^(t+1) | f^(t))])
The posterior is highly correlated and f is high dimensional. How do we choose the proposal Q(f^(t+1) | f^(t))?

9 MCMC for Gaussian Processes
Use the GP prior as the proposal distribution:
- Proposal: Q(f^(t+1) | f^(t)) = p(f^(t+1))
- MH probability: min(1, p(y | f^(t+1)) / p(y | f^(t)))
Nice property: the prior samples functions with the appropriate smoothness requirement.
Bad property: we get an almost zero acceptance rate; the chain will get stuck in the same state for thousands of iterations.
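A minimal sketch of the whole-function MH step above with the GP prior as proposal. Because Q(f' | f) = p(f'), the prior terms cancel and only the likelihood ratio remains; log_lik stands for any log p(y | f) and, like the Cholesky factor L of the prior covariance, is an assumption of this sketch.

```python
import numpy as np

def mh_step_prior_proposal(f, log_lik, L, rng=np.random.default_rng()):
    """One MH step using the GP prior as the proposal distribution."""
    f_prop = L @ rng.standard_normal(len(f))     # f' ~ p(f), independent of the current state
    log_ratio = log_lik(f_prop) - log_lik(f)     # log of p(y | f') / p(y | f)
    if np.log(rng.random()) < log_ratio:
        return f_prop, True                      # accept
    return f, False                              # reject: the chain stays put
```

In practice this makes the bad property easy to observe: as the number of function values grows, a fresh prior draw almost never fits the data and log_ratio becomes very negative.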

10 MCMC for Gaussian Processes
Use Gibbs sampling:
- Proposal: iteratively sample from the conditional posterior p(f_i | f_{-i}, y), where f_{-i} = f \ f_i
Nice property: all samples are accepted and the prior smoothness requirement is satisfied.
Bad property: the Markov chain will move extremely slowly for densely sampled functions:
- the variance of p(f_i | f_{-i}, y) is smaller than or equal to the variance of the conditional prior p(f_i | f_{-i})
- but p(f_i | f_{-i}) may already have a tiny variance

11 Gibbs-like Schemes
Gibbs-like algorithm: instead of p(f_i | f_{-i}, y), use the conditional prior p(f_i | f_{-i}) as the proposal and accept with the MH step (this has been used in geostatistics; Diggle and Tawn, 1998). The Gibbs-like algorithm is still inefficient for sampling highly correlated functions.
Block or region sampling:
- cluster the function values f into regions/blocks {f_k}_{k=1}^M
- sample each block f_k from the conditional GP prior p(f_k^(t+1) | f_{-k}^(t)), where f_{-k} = f \ f_k, and accept with the MH step
This scheme can work better, but it does not solve the problem of sampling highly correlated functions, since the variance of the proposal can be very small at the boundaries between regions.
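The Gibbs-like and block schemes both use the conditional GP prior as a proposal. Below is a sketch (illustrative names, not the authors' code) of computing p(f_k | f_{-k}) by standard Gaussian conditioning on the joint prior N(0, K).

```python
import numpy as np

def conditional_prior(K, idx_k, f, jitter=1e-8):
    """Mean and covariance of the block f[idx_k] under p(f_k | f_{-k}) for a N(0, K) prior."""
    idx_rest = np.setdiff1d(np.arange(len(f)), idx_k)
    K_kk = K[np.ix_(idx_k, idx_k)]
    K_kr = K[np.ix_(idx_k, idx_rest)]
    K_rr = K[np.ix_(idx_rest, idx_rest)] + jitter * np.eye(len(idx_rest))
    mean = K_kr @ np.linalg.solve(K_rr, f[idx_rest])   # K_kr K_rr^{-1} f_{-k}
    cov = K_kk - K_kr @ np.linalg.solve(K_rr, K_kr.T)  # K_kk - K_kr K_rr^{-1} K_rk
    return mean, cov
```

Near a region boundary the term K_kr K_rr^{-1} K_rk explains almost all of K_kk, which is exactly why the proposal variance collapses there.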

12 Gibbs-like Schemes
[Figure: region sampling with 4 regions; a subset of the proposals is shown.] Note that the variance of the conditional priors is small close to the boundaries between regions.

13 Sampling Using Control Variables
Let f_c be a set of auxiliary function values; we call them control variables. The control variables provide a low-dimensional representation of f (analogously to the inducing/active variables in sparse GP models).
Using f_c, we can write the posterior as
p(f | y) = ∫ p(f | f_c, y) p(f_c | y) df_c
When f_c is highly informative about f, i.e. p(f | f_c, y) ≈ p(f | f_c), we can approximately sample from p(f | y):
- sample the control variables from p(f_c | y)
- generate f from the conditional prior p(f | f_c)

14 Sampling Using Control Variables
Idea: sample the control variables from p(f_c | y) and generate f from the conditional prior p(f | f_c).
Make this an MH algorithm: we only need to specify the proposal q(f_c^(t+1) | f_c^(t)) that will mimic sampling from p(f_c | y). The whole proposal is
Q(f^(t+1), f_c^(t+1) | f^(t), f_c^(t)) = p(f^(t+1) | f_c^(t+1)) q(f_c^(t+1) | f_c^(t))
Each pair (f^(t+1), f_c^(t+1)) is accepted using the MH step with acceptance ratio
A = [p(y | f^(t+1)) p(f_c^(t+1)) q(f_c^(t) | f_c^(t+1))] / [p(y | f^(t)) p(f_c^(t)) q(f_c^(t+1) | f_c^(t))]

15 Sampling Using Control Variables: Specification of q(f_c^(t+1) | f_c^(t))
q(f_c^(t+1) | f_c^(t)) must mimic sampling from p(f_c | y). The control points are meant to be almost independent, so Gibbs can be efficient: sample each f_{c_i} from the conditional posterior p(f_{c_i} | f_{c_{-i}}, y). Unfortunately, computing p(f_{c_i} | f_{c_{-i}}, y) is intractable.
But we can use the Gibbs-like algorithm. Iterate between the different control variables i:
- sample f_{c_i}^(t+1) from p(f_{c_i}^(t+1) | f_{c_{-i}}^(t)) and f^(t+1) from p(f^(t+1) | f_{c_i}^(t+1), f_{c_{-i}}^(t)); accept with the MH step
The proposal for f is the leave-one-out conditional prior
p(f^(t+1) | f_{c_{-i}}^(t)) = ∫ p(f^(t+1) | f_{c_i}^(t+1), f_{c_{-i}}^(t)) p(f_{c_i}^(t+1) | f_{c_{-i}}^(t)) df_{c_i}^(t+1)
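A sketch of one sweep of the scheme above, assuming each control variable is proposed from its conditional prior p(f_{c_i} | f_{c_{-i}}); the prior terms then cancel in the MH ratio and only the likelihood ratio p(y | f') / p(y | f) remains. conditional_prior is the helper sketched earlier, and sample_f_given_fc and log_lik are assumed, illustrative callables.

```python
import numpy as np

def control_variable_sweep(f, f_c, K_cc, sample_f_given_fc, log_lik, rng=np.random.default_rng()):
    """One pass over the M control variables (illustrative names, not the authors' code).
    K_cc is the prior covariance of f_c; sample_f_given_fc(fc) must return a draw
    from the conditional prior p(f | f_c = fc)."""
    for i in range(len(f_c)):
        mean_i, cov_i = conditional_prior(K_cc, np.array([i]), f_c)   # p(f_ci | f_c_{-i})
        f_c_prop = f_c.copy()
        f_c_prop[i] = mean_i.item() + np.sqrt(cov_i.item()) * rng.standard_normal()
        f_prop = sample_f_given_fc(f_c_prop)                          # f' ~ p(f | f_c')
        # with the conditional prior as proposal, the MH ratio reduces to p(y|f') / p(y|f)
        if np.log(rng.random()) < log_lik(f_prop) - log_lik(f):
            f, f_c = f_prop, f_c_prop
    return f, f_c
```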

16 Sampling Using Control Variables: Demonstration
[Figure: data, the current f^(t) (red line) and the current control variables f_c^(t) (red circles).]

17 Sampling Using Control Variables: Demonstration
[Figure: first control variable; the proposal p(f_{c_1}^(t+1) | f_{c_{-1}}^(t)) (green bar).]

18 Sampling Using Control Variables: Demonstration
[Figure: first control variable; the proposed f_{c_1}^(t+1) (magenta diamond).]

19 Sampling Using Control Variables: Demonstration
[Figure: first control variable; the proposed function f^(t+1) (blue line).]

20 Sampling Using Control Variables: Demonstration
[Figure: first control variable; the shaded area is the overall effective proposal p(f^(t+1) | f_{c_{-1}}^(t)).]

21-29 Sampling Using Control Variables: Demonstration
[Figures: the sampler iterates between the control variables, one per slide.] Iterating between the control variables allows f to be drawn with considerable variance everywhere in the input space.

30 Sampling Using Control Variables: Input Control Locations
To apply the algorithm, we need to select the number M of control variables and their input locations X_c. Choose X_c using a PCA-like approach: knowledge of f_c must determine f with small error. Given f_c, the prediction of f is K_{f,c} K_{c,c}^{-1} f_c, so we minimize the averaged error
G(X_c) = ∫∫ ||f − K_{f,c} K_{c,c}^{-1} f_c||² p(f | f_c) p(f_c) df df_c = Tr(K_{f,f} − K_{f,c} K_{c,c}^{-1} K_{f,c}ᵀ)
Minimize G(X_c) w.r.t. X_c using gradient-based optimization.
Note: G(X_c) is the total variance of the conditional prior p(f | f_c).
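A sketch of evaluating the objective above, G(X_c) = Tr(K_ff − K_fc K_cc^{-1} K_fcᵀ), using the illustrative rbf_kernel helper from earlier; the slides minimise G with gradient-based optimization, which in this sketch could be done with finite differences or an autodiff library.

```python
import numpy as np

def control_objective(X, X_c, alpha=1.0, lengthscale=1.0, jitter=1e-8):
    """Total variance of the conditional prior p(f | f_c) as a function of the control inputs X_c."""
    K_ff = rbf_kernel(X, X, alpha, lengthscale)
    K_fc = rbf_kernel(X, X_c, alpha, lengthscale)
    K_cc = rbf_kernel(X_c, X_c, alpha, lengthscale) + jitter * np.eye(len(X_c))
    return np.trace(K_ff - K_fc @ np.linalg.solve(K_cc, K_fc.T))
```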

31 Sampling Using Control Points: Choice of M
To find the number M of control variables:
- Minimize G(X_c), incrementally adding control variables until G(X_c) becomes smaller than a certain percentage of the total variance of p(f) (5% was used in all our experiments)
- Start the simulation and observe the acceptance rate of the chain; keep adding control variables until the acceptance rate becomes larger than 25% (following standard heuristics; Gelman, Carlin, Stern and Rubin, 2004)

32 Sampling Using Control Variables: The G(X_c) Function
The minimization of G places the control inputs close to the clusters of the input data, in a way that takes the kernel function into account.

33 Applications: Demonstration on Regression
Regression: compare Gibbs sampling, local region sampling and control variables (randomly chosen GP functions of varying input dimension d, with a fixed number N of training points).
[Figure: KL divergence between the true and empirical distributions against input dimension for Gibbs, region and control sampling; number of control variables and correlation coefficient against dimension for the control scheme.]
Note: the number of control variables increases as the function values become more independent, which is very intuitive.

34 Applications: Classification
Classification: Wisconsin Breast Cancer (WBC) and Pima Indians Diabetes (PID) datasets. Hyperparameters were fixed to those obtained by Expectation Propagation.
Figure: log-likelihood against MCMC iterations for Gibbs (left) and control (middle) on the WBC dataset; (right) test errors (grey bars) and average negative log-likelihoods (black bars) on WBC (left) and PID (right) for Gibbs, control and EP.

35 Applications: Transcriptional Regulation
Data: gene expression levels y = (y_jt) of N genes at T time points.
Goal: we suspect/know that a certain protein regulates these genes (i.e. it is a transcription factor, TF) and we wish to model this relationship.
Model: use a differential equation (Barenco et al., 2006; Rogers et al., 2007; Lawrence et al., 2007):
dy_j(t)/dt = B_j + S_j g(f(t)) − D_j y_j(t)
where t is time, y_j(t) is the expression of the jth gene, f(t) is the concentration of the transcription factor protein, D_j is the decay rate, B_j is the basal rate and S_j is the sensitivity.

36 Transcriptional Regulation Using Gaussian Processes
Solve the equation:
y_j(t) = B_j/D_j + A_j exp(−D_j t) + S_j exp(−D_j t) ∫_0^t g(f(u)) exp(D_j u) du
Apply numerical integration using a very dense grid (u_i)_{i=1}^P with f = (f(u_i))_{i=1}^P:
y_j(t) ≈ B_j/D_j + A_j exp(−D_j t) + S_j exp(−D_j t) Σ_{p=1}^{P_t} w_p g(f_p) exp(D_j u_p)
Assuming Gaussian noise for the observed gene expressions {y_jt}, the ODE defines the likelihood p(y | f).
Bayesian inference: assume a GP prior for the transcription factor f and apply MCMC to infer (f, {A_j, B_j, D_j, S_j}_{j=1}^N). f is inferred in a continuous manner (P ≫ T).
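A sketch of the discretised solution above for a single gene, using trapezoidal quadrature weights w_p on a dense grid u; the particular quadrature rule and the default response function g are assumptions of this sketch, not details given in the slides.

```python
import numpy as np

def gene_expression(t, u, f, B, S, D, A, g=np.exp):
    """Approximate y_j(t) = B/D + A e^{-Dt} + S e^{-Dt} * int_0^t g(f(s)) e^{Ds} ds
    on the dense grid u (P points) carrying the latent TF values f."""
    mask = u <= t
    u_t, f_t = u[mask], f[mask]
    if u_t.size < 2:
        return B / D + A * np.exp(-D * t)
    du = np.diff(u_t)
    w = np.concatenate(([du[0] / 2], (du[:-1] + du[1:]) / 2, [du[-1] / 2]))  # trapezoid weights w_p
    integral = np.sum(w * g(f_t) * np.exp(D * u_t))
    return B / D + A * np.exp(-D * t) + S * np.exp(-D * t) * integral
```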

37 Results on E. coli Data: Rogers, Khanin and Girolami (2007)
One transcription factor (lexA) that acts as a repressor. We consider the Michaelis-Menten kinetic equation
dy_j(t)/dt = B_j + S_j / (exp(f(t)) + γ_j) − D_j y_j(t)
We have 14 genes (5 kinetic parameters each). Gene expressions are available at T = 6 time points. The TF concentration f is discretized using a dense grid of points.
MCMC details: 6 control points are used; the running time was 5 hours for 5 × 10^5 sampling iterations plus burn-in.
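For concreteness, the two Michaelis-Menten response functions used in this dataset and the one of Barenco et al. below could be coded as follows (a sketch; as in the slides, f enters through exp(f), which keeps the inferred TF concentration positive). Either function can be passed as the g argument of the gene_expression sketch above.

```python
import numpy as np

def g_repressor(f, gamma):
    """Michaelis-Menten response for a repressing TF (the lexA model)."""
    return 1.0 / (np.exp(f) + gamma)

def g_activator(f, gamma):
    """Michaelis-Menten response for an activating TF (the p53 model)."""
    return np.exp(f) / (np.exp(f) + gamma)
```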

38 Results on E. coli Data: Predicted Gene Expressions
[Figure: predicted expression profiles for the dinF, dinI, lexA, recA, recN and ruvA genes.]

39 Results on E. coli Data: Predicted Gene Expressions
[Figure: predicted expression profiles for the ruvB, sbmC, sulA, umuC, umuD and uvrB genes.]

40 Results on E. coli Data: Predicted Gene Expressions
[Figure: predicted expression profiles for the yebG and yjiW genes.]

41 Results on E. coli Data: Protein Concentration
[Figure: inferred protein concentration.]

42 Results on E. coli Data: Kinetic Parameters
[Figure: basal rates, decay rates, sensitivities and gamma parameters for the 14 genes (dinF, dinI, lexA, recA, recN, ruvA, ruvB, sbmC, sulA, umuC, umuD, uvrB, yebG, yjiW).]

43 Results on E. coli Data: Confidence Intervals for the Kinetic Parameters
[Figure: confidence intervals for the basal rates, decay rates, sensitivities and gamma parameters of the 14 genes.]

44 Data Used by Barenco et al. (2006)
One transcription factor (p53) that acts as an activator. We consider the Michaelis-Menten kinetic equation
dy_j(t)/dt = B_j + S_j exp(f(t)) / (exp(f(t)) + γ_j) − D_j y_j(t)
We have 5 genes. Gene expressions are available at T = 7 time points and there are 3 replicas of the time series data. The TF concentration f is discretized using a dense grid of points.
MCMC details: 7 control points are used; the running time was 4 hours for 5 × 10^5 iterations plus burn-in.

45 Data Used by Barenco et al. (2006): Predicted Gene Expressions for the First Replica
[Figure: predicted expression profiles for the DDB2, BIK, TNFRSF10b, CIp1/p21 and p26 sesn1 genes in the first replica.]

46 Data Used by Barenco et al. (2006): Protein Concentrations
[Figure: inferred p53 protein concentration for each of the three replicas, together with the inferred protein under the linear model; Barenco et al.'s predictions are shown as crosses.]

47 Data Used by Barenco et al. (2006): Kinetic Parameters
[Figure: basal rates, decay rates, sensitivities and gamma parameters for the DDB2, p26 sesn1, TNFRSF10b, CIp1/p21 and BIK genes.]
Our results (grey) compared with Barenco et al. (2006) (black). Note that Barenco et al. use a linear model.

48 Summary/Future Work
Summary:
- A new MCMC algorithm for Gaussian processes using control variables
- It is generally applicable
Future work:
- Deal with large systems of ODEs for the transcriptional regulation application
- Consider applications in geostatistics
- Use the G(X_c) function to learn sparse GP models in an unsupervised fashion, without the outputs y being involved
