Gaussian Process Regression Networks Supplementary Material

Andrew Gordon Wilson, David A. Knowles, Zoubin Ghahramani
University of Cambridge
agw38@cam.ac.uk, dak33@cam.ac.uk, zoubin@eng.cam.ac.uk

In this supplementary material, we discuss some further details of our ESS and VB inference (Sections 1 and 2), the computational complexity of our inference procedures (Section 3), and the correlation structure induced by the GPRN model (Section 4). We also discuss multimodality in the GPRN posterior (Section 5), SVLMC (Section 6), and some background information and notation for Gaussian process regression (Section 7). Figure 1 shows the network structure of GPRN learned on the Jura dataset.

Figure 1: Network structure for the Jura dataset learnt by GPRN. The spatially varying node and weight functions are shown, along with the predictive means for the observations. The three output dimensions are cadmium, nickel and zinc concentrations respectively.

1 Elliptical Slice Sampling

To sample from p(u | D, γ), we could use a Gibbs sampling scheme, which would have conjugate posterior updates, alternately conditioning on the weight and node functions. However, this Gibbs cycle would mix poorly because of the correlations between the weight and node functions in the posterior p(u | D, γ). In general, MCMC samples from p(u | D, γ) mix poorly because of the strong correlations in the prior p(u | σ_f, θ_f, θ_w) imposed by C_B. The sampling process is also often slowed by costly matrix inversions in the likelihood. We use elliptical slice sampling (Murray et al., 2010), a recent MCMC technique specifically designed to sample from posteriors with tightly correlated Gaussian priors. It performs joint updates and has no free parameters. Since there are no costly or numerically unstable matrix inversions in our likelihood, we also find sampling to be efficient.
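The transition operator itself is only a few lines. The sketch below is a generic elliptical slice sampling step following Murray et al. (2010), not our GPRN implementation; the function and argument names (ess_step, log_lik, chol_prior) are ours, and the state u is assumed to have a zero-mean Gaussian prior with the given Cholesky factor.

```python
import numpy as np

def ess_step(u, chol_prior, log_lik, rng):
    """One elliptical slice sampling transition (Murray et al., 2010).

    u          : current state, with prior N(0, Sigma)
    chol_prior : lower Cholesky factor of the prior covariance Sigma
    log_lik    : function returning the log-likelihood of a state
    rng        : numpy random Generator
    """
    # Auxiliary draw from the prior defines the ellipse through u.
    nu = chol_prior @ rng.standard_normal(u.shape[0])
    # Log-likelihood threshold for the slice.
    log_y = log_lik(u) + np.log(rng.uniform())
    # Initial proposal angle and bracket.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        u_prop = u * np.cos(theta) + nu * np.sin(theta)
        if log_lik(u_prop) > log_y:
            return u_prop
        # Shrink the bracket towards theta = 0 and propose again.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```

In the GPRN setting, u collects the node and weight function values, the prior covariance is the block diagonal C_B, and the log-likelihood is the Gaussian observation term, which involves no matrix inversions.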

With a sample from p(u | D, γ), we can sample from the predictive p(W(x_*), f(x_*) | u, σ_f, D). Let (W^i, f^i) be the i-th such joint sample. Using our generative GPRN model we can then construct samples of p(y(x_*) | W^i, f^i, σ_f, σ_y), from which we can construct the predictive distribution

p(y(x_*) | D) = lim_{J → ∞} (1/J) Σ_{i=1}^{J} p(y(x_*) | W^i, f^i, σ_f, σ_y).    (1)

We see that even with a Gaussian observation model, the predictive distribution in (1) is an infinite mixture of Gaussians, and will generally be heavy tailed. Since mixtures of Gaussians are dense in the set of probability distributions, the predictive distribution of GPRN is highly flexible. Mixing was assessed by looking at trace plots of samples and the likelihoods of these samples. Specific information about how long it takes to sample a solution for a given problem is given in the experiments section.
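To make (1) concrete, the sketch below forms Monte Carlo estimates of the predictive mean and covariance at a test input from joint samples of the weights and nodes, using the law of total covariance for the mixture. The array shapes and the scalar noise standard deviations sigma_f and sigma_y are simplifying assumptions for illustration, not the exact variables of our implementation.

```python
import numpy as np

def predictive_moments(W_samples, f_samples, sigma_f, sigma_y):
    """Monte Carlo moments of the mixture-of-Gaussians predictive in (1).

    W_samples : (J, p, q) samples of W(x_*)
    f_samples : (J, q)    samples of f(x_*)
    sigma_f, sigma_y : noise standard deviations (taken as scalars here)
    """
    J, p, _ = W_samples.shape
    means = np.einsum('jpq,jq->jp', W_samples, f_samples)       # per-sample W(x_*) f(x_*)
    covs = (sigma_f**2 * np.einsum('jpq,jrq->jpr', W_samples, W_samples)
            + sigma_y**2 * np.eye(p))                           # per-sample Gaussian covariance
    mean = means.mean(axis=0)
    # Law of total covariance: average within-component covariance
    # plus the covariance of the component means.
    centred = means - mean
    cov = covs.mean(axis=0) + np.einsum('jp,jr->pr', centred, centred) / J
    return mean, cov
```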

2 Variational Bayes

We perform variational EM (Jordan et al., 1999) to fit an approximate posterior q to the true posterior p, by minimising the Kullback–Leibler divergence

KL(q || p) = −H[q(v)] − ∫ q(v) log p(v) dv,

where H[q(v)] = −∫ q(v) log q(v) dv is the entropy and v = {f, W, σ_f², σ_y², a_j}.

E-step. We use variational message passing (Winn and Bishop, 2006) under the Infer.NET framework (Minka et al., 2010) to estimate the posterior over v = {f, W, σ_f², σ_y², a_j}, where the {a_j} are signal variance hyperparameters for each node function j, so that k_{f̂_j} → a_j k_{f̂_j}. We specify inverse Gamma priors on {σ_f², σ_y², a_j}:

σ_{fj}² ~ IG(α_{σ_f²}, β_{σ_f²}),   σ_y² ~ IG(α_{σ_y²}, β_{σ_y²}),   a_j ~ IG(α_a, β_a).

We use a variational posterior of the following form:

q(v) = q_{σ_y²}(σ_y²) ∏_{j=1}^{q} [ q_{f_j}(f_j) q_{σ_{fj}²}(σ_{fj}²) q_{a_j}(a_j) ∏_{i=1}^{p} q_{W_{ij}}(W_{ij}) ∏_{n=1}^{N} q_{f̂_{nj}}(f̂_{nj}) ],

where q_{σ_y²}, q_{σ_{fj}²} and q_{a_j} are inverse Gamma distributions, the q_{f̂_{nj}} are univariate normal distributions, and q_{f_j} and q_{W_{ij}} are multivariate normal distributions. The lengthscales {θ_f, θ_w} do not appear here because they are optimised in the M-step below (we could equivalently consider a point mass distribution on the lengthscales). For mathematical and computational convenience we introduce the following variables, which are deterministic functions of the existing variables in the model:

w_{nij} := W_{ij}(x_n),   f_{nj} := f_j(x_n),    (2)
t_{nij} := w_{nij} f̂_{nj},   s_{in} := Σ_j t_{nij}.    (3)

We refer to these as derived variables. Note that the observations y_i(x_n) ~ N(s_{in}, σ_y²) and that f̂_{nj} ~ N(f_{nj}, σ_{fj}²). These derived variables are all given univariate normal pseudo-marginals, which variational message passing uses as conduits to pass appropriate moments, resulting in the same updates as standard VB (see Winn and Bishop (2006) for details). Using the derived variables, the full model can be written as

p(v) ∝ IG(σ_y²; α_{σ_y²}, β_{σ_y²}) ∏_{j=1}^{q} ( N(f_j; 0, a_j K_f) IG(σ_{fj}²; α_{σ_f²}, β_{σ_f²}) IG(a_j; α_a, β_a)
        ∏_{i=1}^{p} [ N(W_{ij}; 0, K_w) ∏_{n=1}^{N} δ(w_{nij} − W_{ij}(x_n)) δ(f_{nj} − f_j(x_n)) N(f̂_{nj}; f_{nj}, σ_{fj}²) δ(t_{nij} − w_{nij} f̂_{nj}) δ(s_{in} − Σ_j t_{nij}) N(y_i(x_n); s_{in}, σ_y²) ] ).

The updates for f, W, σ_f², σ_y² are standard VB updates and are available in Infer.NET. The update for the ARD parameters a_j, however, required specific implementation. The factor itself is

log N(f_j; 0, a_j K_f) ≐ −(1/2) log|a_j K_f| − (1/2) f_j^T (a_j K_f)^{−1} f_j = −(N/2) log a_j − (1/2) log|K_f| − (1/2) a_j^{−1} f_j^T K_f^{−1} f_j,    (4)

where ≐ denotes equality up to an additive constant. Taking expectations with respect to f_j under q, we obtain the VMP message to a_j as IG(a_j; N/2 − 1, (1/2) ⟨f_j^T K_f^{−1} f_j⟩). Since the variational posterior on f_j is multivariate normal, the expectation ⟨f_j^T K_f^{−1} f_j⟩ is straightforward to calculate.

M-step. In the M-step we optimise the variational lower bound with respect to the log length-scale parameters {θ_f, θ_w}, using gradient descent with line search. When optimising θ_f we only need to consider the contribution to the lower bound of the factor N(f_j; 0, a_j K_f) (see (4)), which is straightforward to evaluate and differentiate. From (4) we have

⟨log N(f_j; 0, a_j K_f)⟩_q ≐ −(N/2) ⟨log a_j⟩ − (1/2) log|K_f| − (1/2) ⟨a_j^{−1}⟩ ⟨f_j^T K_f^{−1} f_j⟩.

We will need the gradient with respect to θ_f:

∂/∂θ_f ⟨log N(f_j; 0, a_j K_f)⟩_q = −(1/2) tr(K_f^{−1} ∂K_f/∂θ_f) + (1/2) ⟨a_j^{−1}⟩ ⟨f_j^T K_f^{−1} (∂K_f/∂θ_f) K_f^{−1} f_j⟩.

The expectations here are straightforward to compute analytically, since f_j has a multivariate normal variational posterior. Analogously, for θ_w we consider the contribution of N(W_{ij}; 0, K_w).
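As an illustration of this gradient, the sketch below evaluates it for a squared exponential node kernel parameterised by its log length-scale. The required expectations E[1/a_j] and E[f_j f_j^T] are assumed to have been computed from the inverse Gamma and multivariate normal factors of q; the names and the added jitter are ours, not part of the Infer.NET implementation.

```python
import numpy as np

def se_kernel_and_grad(X, log_ls):
    """Squared exponential kernel matrix and its derivative w.r.t. the log length-scale."""
    ls = np.exp(log_ls)
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * d2 / ls**2)
    dK = K * (d2 / ls**2)          # dK / d(log ls)
    return K, dK

def lower_bound_grad(X, log_ls, E_inv_a, E_ffT, jitter=1e-6):
    """Gradient of <log N(f_j; 0, a_j K_f)>_q with respect to the log length-scale.

    E_inv_a : E[1/a_j] under its inverse Gamma posterior
    E_ffT   : E[f_j f_j^T] = m_j m_j^T + S_j under the Gaussian posterior
    """
    K, dK = se_kernel_and_grad(X, log_ls)
    K = K + jitter * np.eye(X.shape[0])
    Kinv = np.linalg.inv(K)
    Kinv_dK = Kinv @ dK
    # -(1/2) tr(K^{-1} dK) + (1/2) E[1/a_j] E[ f^T K^{-1} dK K^{-1} f ]
    return -0.5 * np.trace(Kinv_dK) + 0.5 * E_inv_a * np.trace(Kinv_dK @ Kinv @ E_ffT)
```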

VB predictive distribution. The predictive distribution for the output y_*(x) at a new input location x is calculated as

p(y_*(x) | D) = ∫ p(y_*(x) | W(x), f(x)) p(W(x), f(x) | D) dW df.    (5)

VB fits the factorised approximation p(W(x), f(x) | D) ≈ q(W) q(f), so the approximate predictive is

p(y_*(x) | D) ≈ ∫ p(y_*(x) | W(x), f̂(x)) q(W) q(f̂) dW df̂.    (6)

We can calculate the mean and covariance of this distribution analytically:

ȳ_*(x)_i = Σ_k E(W*_{ik}) E[f̂*_k],    (7)
cov(y_*(x))_{ij} = Σ_k [ E(W*_{ik}) E(W*_{jk}) var(f̂*_k) + δ_{ij} var(W*_{ik}) E((f̂*_k)²) ] + δ_{ij} E[σ_y²],    (8)

where δ_{ij} = I[i = j] is the Kronecker delta function, W*_{ik} = W_{ik}(x) and f̂*_k = f̂_k(x). The moments of W*_{ik} and f̂*_k under q are straightforward to obtain from q(W) and q(f) respectively, using the standard GP prediction equations (see Rasmussen and Williams (2006)).

It is also of interest to calculate the noise covariance. Recall that our model can be written as

y(x) = W(x) f(x) + σ_f W(x) ε + σ_y z,    (9)

where W(x) f(x) is the signal component and the remaining terms are the noise. Let n(x) = σ_f W(x) ε + σ_y z be the noise component. The covariance of n(x) under q is then

cov(n(x))_{ij} = Σ_k [ E[σ_{fk}²] E(W*_{ik}) E(W*_{jk}) + δ_{ij} var(W*_{jk}) ] + δ_{ij} E[σ_y²].    (10)
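These moments reduce to sums over the q latent dimensions. Below is a sketch of (7), (8) and (10), assuming the elementwise means and variances of W(x) and f̂(x) at the test input have already been extracted from q; the argument names are illustrative.

```python
import numpy as np

def vb_predictive_moments(EW, VW, Ef, Vf, E_sigf2, E_sigy2):
    """Predictive mean (7), covariance (8) and noise covariance (10) under the VB posterior.

    EW, VW  : (p, q) elementwise means and variances of W(x) under q
    Ef, Vf  : (q,)   means and variances of f_hat(x) under q
    E_sigf2 : (q,)   E[sigma_fj^2]
    E_sigy2 : scalar E[sigma_y^2]
    """
    p, _ = EW.shape
    Ef2 = Vf + Ef**2                            # E[f_hat_k^2]
    mean = EW @ Ef                              # equation (7)
    cov = (EW * Vf) @ EW.T                      # sum_k E(W_ik) E(W_jk) var(f_hat_k)
    cov += np.diag((VW * Ef2).sum(axis=1))      # + delta_ij sum_k var(W_ik) E[f_hat_k^2]
    cov += E_sigy2 * np.eye(p)                  # + delta_ij E[sigma_y^2], equation (8)
    noise_cov = (EW * E_sigf2) @ EW.T + np.diag(VW.sum(axis=1)) + E_sigy2 * np.eye(p)  # (10)
    return mean, cov, noise_cov
```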
3 Computational Considerations

The computational complexity of a Markov chain Monte Carlo GPRN approach is mainly limited by taking the Cholesky decomposition of the block diagonal C_B, an Nq(p + 1) × Nq(p + 1) matrix in the prior on GP function values. But pq of these blocks are the same N × N covariance matrix K_w for the weight functions, q of these blocks are the covariance matrices K_{f̂_i} associated with the node functions, and chol(blkdiag(A, B, ...)) = blkdiag(chol(A), chol(B), ...). Therefore, assuming the node functions share the same covariance function (which they do in our experiments), the complexity of this operation is only O(N³), the same as for regular Gaussian process regression. At worst it is O(qN³), assuming different covariance functions for each node.
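The key identity is that the Cholesky factor of a block diagonal matrix is the block diagonal of the individual Cholesky factors, so only one factorisation per distinct block is needed. A small numerical check, with example kernels and arbitrary length-scales:

```python
import numpy as np
from scipy.linalg import block_diag, cholesky

rng = np.random.default_rng(0)
N = 50
X = rng.uniform(size=(N, 1))

def se(X, ls):
    """Squared exponential Gram matrix with a small jitter for numerical stability."""
    d2 = (X - X.T)**2
    return np.exp(-0.5 * d2 / ls**2) + 1e-8 * np.eye(len(X))

K_f = se(X, 0.3)     # shared node-function covariance (example length-scale)
K_w = se(X, 1.0)     # shared weight-function covariance

# C_B is block diagonal, so its Cholesky factor is the block diagonal of the
# per-block Cholesky factors: one O(N^3) factorisation per distinct block.
C_B = block_diag(K_f, K_w, K_w)
L_blocks = block_diag(*(cholesky(B, lower=True) for B in (K_f, K_w, K_w)))
assert np.allclose(cholesky(C_B, lower=True), L_blocks)
```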

Sampling also requires likelihood evaluations. Since there are input dependent noise correlations between the elements of the p dimensional observations y(x_i), multivariate volatility models such as MGARCH (Bollerslev et al., 1988) or multivariate stochastic volatility models (Harvey et al., 1994) would normally require inverting a p × p covariance matrix N times. (In this context, "inverting" means decomposing, e.g. taking a Cholesky decomposition of, the matrix Σ(x) in question, for instance to take the determinant of Σ^{-1}(x) or the matrix-vector product y(x)^T Σ^{-1}(x) y(x).) This would lead to a complexity of O(Nqp + Np³) per likelihood evaluation. However, by working directly with the noisy f̂ instead of the noise free f, evaluating the likelihood requires no costly or numerically unstable inversions, and thus has a complexity of only O(Nqp).

The computational complexity of variational Bayes is dominated by the O(N³) inversions required to calculate the covariance of the node and weight functions in the E-step. Naively, q and qp such inversions are required per iteration for the node and weight functions respectively, giving a total complexity of O(qpN³). However, under VB the covariances of the weight functions for the same p are all equal, reducing the complexity to O(qN³). If p is large, the O(pqN²) cost of calculating the weight function means may become significant. Although the per iteration cost of VB is actually higher than for MCMC, far fewer iterations are typically required to reach convergence.

We see that when fixing q and p, the computational complexity of GPRN scales cubically with the number of data points, like standard Gaussian process regression. On modern computers, this restricts GPRN to moderately sized datasets. However, one could adopt a sparse representation of GPRN, for example following the DTC (Csató and Opper, 2001; Seeger et al., 2003; Quiñonero-Candela and Rasmussen, 2005; Rasmussen and Williams, 2006), PITC (Quiñonero-Candela and Rasmussen, 2005), or FITC (Snelson and Ghahramani, 2006) approximations, which would lead to O(M²N) scaling, where M ≪ N. If the input space X is on a full (but not necessarily equispaced) grid, in that it can be expressed as a Cartesian product of the input locations for each dimension, then the complexity in N reduces to O(N log N).

Fixing q and N, the per iteration (for MCMC or VB) computational complexity of GPRN scales linearly with p. Overall, the computational demands of GPRN compare favourably to most multi-task GP models, which commonly have a complexity of O(p³N³) (e.g. SLFM, LMC, and CMOGP in the experiments) and do not account for either input dependent signal or noise correlations. Moreover, multivariate volatility models, which do account for input dependent noise correlations, are commonly intractable for p > 5 (Engle, 2002; Gouriéroux et al., 2009). The 1000 dimensional gene expression experiment is tractable for GPRN, but intractable for the alternative multi-task models used on the 50 dimensional gene expression set, and for the multivariate volatility models.

Using either MCMC or VB with GPRN, the memory requirement is O(N²) if all covariance kernels share the same hyperparameters, O(qN²) if the node functions have different kernel hyperparameters, and O(pq²N²) if all GPs have different kernel hyperparameters. This memory requirement comes from storing the information in the block diagonal C_B in the prior p(u | σ_f, θ_f, θ_w).

4 Covariance structure

In this section we derive the covariance structure induced by the GPRN of Section 2 in the main text, explicitly in terms of the kernel functions k_w and k_f for the weight and node functions. We assume that the weight functions share a kernel k_w and the node functions share a kernel k_f.

In the GPRN specification of Section 2 in the main text, a priori, at a particular location x,

E[y(x)] = 0,    (11)
cov(y(x)) = (1 + σ_f² + σ_y²) I,    (12)

because W(x) and f(x) are comprised of independent mean zero Gaussian processes. One can always preprocess data so that (11) and (12) accord with prior beliefs. The modelling power of standard Gaussian process regression comes primarily through the covariance kernel and its hyperparameters, which allow for a flexible covariance structure between data points at different locations. Likewise, the modelling power of GPRN comes from the induced process over Σ(x_*) = cov[y(x_*) | x_*] and how its covariance structure directly relates to the Gaussian process covariance kernels k_w and k_f used in the weight and node functions. Given the weight and node functions, the covariance matrix Σ(x_*) = cov[y(x_*) | x_*] is

Σ(x_*) = W(x_*) f(x_*) f(x_*)^T W(x_*)^T + σ_f² W(x_*) W(x_*)^T + σ_y² I,    (13)

where the first term is the signal contribution and the remaining terms are the noise contribution. Theorem 4.1 gives the covariance structure for the induced process in (13). The covariances induced by the signal and noise models are separately identified.

Theorem 4.1. Abbreviating the kernels k_f(x, x') and k_w(x, x') at locations x and x' as k_f and k_w, for i ≠ j ≠ l ≠ s,

cov[Σ_ii(x), Σ_ii(x')] = 2q² k_w² k_f² + 2q(k_w² + k_f²) + 2q σ_f⁴ k_w² + σ_y⁴,    (14)
cov[Σ_ij(x), Σ_ji(x')] = cov[Σ_ij(x), Σ_ij(x')] = 0.5(3q + q²) k_w² k_f² + q k_w² + σ_f⁴ k_w² + σ_y⁴,    (15)
cov[Σ_ij(x), Σ_ls(x')] = 0.    (16)

In (14) and (15), the first two terms are induced by the signal model and the last two by the noise model.
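The induced process over Σ(x) can also be explored by simulation: draw the weight and node function values at two locations whose kernels evaluate to given values k_w and k_f, form Σ at each location as in (13), and compute empirical covariances of its entries. A sketch, where all sizes and kernel values are arbitrary and the function names are ours:

```python
import numpy as np

def sample_sigma(q, p, k_w, k_f, sigma_f, sigma_y, n_samples, rng):
    """Draw Sigma(x) and Sigma(x') from the GPRN prior at two locations whose
    weight and node kernels evaluate to k_w and k_f respectively."""
    def pair(corr, size):
        # Each pair (g(x), g(x')) is bivariate normal with unit variances
        # and correlation equal to the kernel value.
        z1 = rng.standard_normal(size)
        z2 = corr * z1 + np.sqrt(1.0 - corr**2) * rng.standard_normal(size)
        return z1, z2
    W, Wp = pair(k_w, (n_samples, p, q))
    f, fp = pair(k_f, (n_samples, q))
    def sigma(Wm, fv):
        Wf = np.einsum('spq,sq->sp', Wm, fv)
        return (np.einsum('sp,sr->spr', Wf, Wf)
                + sigma_f**2 * np.einsum('spq,srq->spr', Wm, Wm)
                + sigma_y**2 * np.eye(p))
    return sigma(W, f), sigma(Wp, fp)

rng = np.random.default_rng(3)
S, Sp = sample_sigma(q=2, p=3, k_w=0.6, k_f=0.4, sigma_f=0.5, sigma_y=0.1,
                     n_samples=200000, rng=rng)
# Empirical covariance between Sigma_11(x) and Sigma_11(x').
print(np.cov(S[:, 0, 0], Sp[:, 0, 0])[0, 1])
```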

Proof. W_ij(x) ~ GP(0, k_w) and f_i(x) ~ GP(0, k_f). We first consider the contribution from the signal, A(x) = W(x) f(x) f(x)^T W(x)^T. The entry A_11(x) = Σ_{i=1}^{q} W_{1i}(x) f_i(x) Σ_{j=1}^{q} W_{1j}(x) f_j(x). We will focus on this entry, since all the diagonal entries are i.i.d., and cov(A_11(x), A_11(x')) = cov(A_ii(x), A_ii(x')). Abbreviating W_11(x'), W_12(x'), f_1(x'), f_2(x') respectively as W_11', W_12', f_1', f_2', and W_11(x), W_12(x), f_1(x), f_2(x) as W_11, W_12, f_1, f_2, and given that the node and weight functions are independent,

cov(A_11(x), A_11(x')) = q cov(W_11² f_1², W_11'² f_1'²) + 2q(q − 1) cov(W_11 f_1 W_12 f_2, W_11' f_1' W_12' f_2'),    (17)

cov(W_11 f_1 W_12 f_2, W_11' f_1' W_12' f_2') = E[W_11 f_1 W_12 f_2 W_11' f_1' W_12' f_2'] − E[W_11 f_1 W_12 f_2] E[W_11' f_1' W_12' f_2']    (18)
 = E[W_11 f_1 W_12 f_2 W_11' f_1' W_12' f_2'].    (19)

Since W_11 and W_11' are jointly Gaussian and marginally N(0, 1) random variables, we can write

W_11' = k_w W_11 + sqrt(1 − k_w²) γ_1,    (20)

where γ_1 is N(0, 1) and independent from W_11 and W_11'. Substitutions similar to (20) can be made for W_12, f_1, f_2, so that

E[W_11 f_1 W_12 f_2 W_11' f_1' W_12' f_2'] = k_w² k_f².    (21)

Likewise,

cov(W_11(x)² f_1(x)², W_11(x')² f_1(x')²) = 2(k_w² k_f² + k_w² + k_f²),    (22)

using E[a⁴] = 3 if a ~ N(0, 1). Therefore

cov(A_ii(x), A_ii(x')) = 2q(k_w² k_f² + k_w² + k_f²) + 2q(q − 1) k_w² k_f² = 2q² k_w² k_f² + 2q(k_w² + k_f²).    (23)

Equations (14)–(16) follow similarly.

Corollary 4.1. Since E[Σ(x)] is constant (it does not depend on x), and the covariances are proportional to the kernel function(s), the induced process on Σ(x) is weakly stationary if the kernel function(s) are stationary, that is, if k(x, x') = k(x + a, x' + a) for all feasible constants a and all kernels k. This holds, for example, if k(x, x') is a function of x − x'.

Corollary 4.2. If k(x, x') → 0 as |x − x'| → ∞, then

lim_{|x − x'| → ∞} cov[Σ_ij(x), Σ_ls(x')] = 0,   for all i, j, l, s.    (24)

Theorem 4.1 and its corollaries clarify the importance of the kernels k_w and k_f and their hyperparameters (particularly length-scales): through k_w and k_f we can explicitly specify the desired prior covariance structure of the GPRN model.

5 Multimodality in the GPRN posterior

It is possible to reduce the number of modes in the posterior by constraining the weights W or nodes f to be positive. For MCMC it is straightforward to do this by exponentiating the weights, as in Adams and Stegle (2008) and Adams et al. (2010). For VB it is more straightforward to explicitly constrain the weights to be positive using a truncated Gaussian representation. We found that these extensions did not significantly improve empirical performance, although exponentiating the weights sometimes improved numerical stability for MCMC on the multivariate volatility experiments. For Adams and Stegle (2008), exponentiating the weights will have been more valuable because they use expectation propagation, which in their case would centre probability mass between symmetric modes. MCMC and VB approaches are more robust to this problem: MCMC can explore these symmetric modes, and VB will concentrate on one of these modes without losing the expressivity of the GPRN prior.

6 SVLMC

In the main text, we compared to 8 multiple output GP methods: CMOGP, SLFM, ICM, co-kriging, LMC, CMOFITC, CMODTC, and CMOPITC. We also tested SVLMC, but found that it was not tractable on the gene expression or Jura datasets, and since it does not account for input dependent noise correlations, it is not applicable to the multivariate volatility datasets. In this section, we briefly present some of the results that we did find.

The SVLMC does not incorporate hyperparameters, and we found performance on these datasets to be sensitive to covariance kernel hyperparameters, particularly length-scales. So we introduced covariance kernel hyperparameters (like length-scales) into the SVLMC. We found the covariogram (based on the ACFs), which is typically used in geostatistics, too crude to set these hyperparameters, so we used the GPRN to learn hyperparameters. We ran an instance of SVLMC on the p = 50 gene expression dataset. After sampling for 20 hours on a 2.3 GHz dual core Intel i5 processor, the SMSE was 4.94, compared to an SMSE of 0.32 and 0.34 with GPRN (ESS) and GPRN (VB) respectively. For GPRN (ESS) we took 6000 samples, which took 40 seconds. GPRN (VB) took 12 seconds to converge.

7 Gaussian processes

We briefly review Gaussian process regression, introduce some notation, and expand on some of the points in the introduction. For more detail see Rasmussen and Williams (2006).

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Using a Gaussian process, we can define a distribution over functions w(x):

w(x) ~ GP(m(x), k(x, x')),    (25)

where w is the output variable, x is an arbitrary (potentially vector valued) input variable, and the mean m(x) and covariance function (or kernel) k(x, x') are respectively defined as

m(x) = E[w(x)],    (26)
k(x, x') = cov[w(x), w(x')].    (27)

This means that any collection of function values has a joint Gaussian distribution:

(w(x_1), w(x_2), ..., w(x_N)) ~ N(μ, K),    (28)

where the N × N covariance matrix K has entries K_ij = k(x_i, x_j), and the mean μ has entries μ_i = m(x_i). The properties of these functions (smoothness, periodicity, etc.) are determined by the kernel function. The squared exponential kernel is popular:

k_SE(x, x') = A exp(−0.5 |x − x'|² / l²).    (29)

Functions drawn from a Gaussian process with this kernel function are smooth, and can display long range trends. The length-scale hyperparameter l is easy to interpret: it determines how much the function values w(x) and w(x + a) depend on one another, for some constant a ∈ X. When the length-scale is learned from data, it is useful for determining how far into the past one should look in order to make good forecasts. The amplitude coefficient A ∈ R determines the marginal variance of w(x) in the prior, Var[w(x)] = A, and the magnitude of covariances between w(x) at different inputs x.
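To see the effect of the length-scale, the short sketch below draws functions from a zero-mean GP prior with the squared exponential kernel (29) at two different length-scales; the specific values, seed, and jitter are arbitrary choices for illustration.

```python
import numpy as np

def k_se(x1, x2, A=1.0, ls=1.0):
    """Squared exponential kernel, equation (29)."""
    return A * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ls**2)

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
for ls in (0.5, 2.0):
    K = k_se(x, x, ls=ls) + 1e-6 * np.eye(len(x))    # jitter for numerical stability
    L = np.linalg.cholesky(K)
    w = L @ rng.standard_normal((len(x), 3))          # three prior draws, w(x) ~ N(0, K)
    # Short length-scales give rapidly varying draws; long ones give slowly varying draws.
    print(f"length-scale {ls}: mean |increment| = {np.abs(np.diff(w, axis=0)).mean():.3f}")
```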

The Ornstein–Uhlenbeck kernel is also widely applied:

k_OU(x, x') = exp(−|x − x'| / l).    (30)

In one dimension it is the covariance function of an Ornstein–Uhlenbeck process (Uhlenbeck and Ornstein, 1930), which was introduced to model the velocity of a particle undergoing Brownian motion. With this kernel, the corresponding GP is a continuous time AR(1) process with Markovian dynamics: w(x + a) is independent of w(x − a) given w(x), for any constant a. Indeed, the OU kernel belongs to a more general class of Matérn kernels,

k_Matérn(x, x') = (2^{1−α} / Γ(α)) (sqrt(2α) |x − x'| / l)^α K_α(sqrt(2α) |x − x'| / l),    (31)

where K_α is a modified Bessel function (Abramowitz and Stegun, 1964). In one dimension the corresponding GP is a continuous time AR(p) process, where p = α + 1/2. (Discrete time autoregressive processes such as w(t + 1) = w(t) + ε(t), where ε(t) ~ N(0, 1), are widely used in time series modelling and are a particularly simple special case of Gaussian processes.) The OU kernel is recovered by setting α = 1/2.

There are many other useful kernels, like the periodic kernel (with a period that can be learned from data), or the Gibbs kernel (Gibbs, 1997), which allows for input dependent length-scales. Kernels can be combined together, e.g. k = a_1 k_1 + a_2 k_2 + a_3 k_3, and the relative importance of each kernel can be determined from data (e.g. from estimating a_1, a_2, a_3). Rasmussen and Williams (2006) and Bishop (2006) discuss how to create and combine kernels.

Suppose we are doing a regression using points {y(x_1), ..., y(x_N)} from a noisy function y = w(x) + ε, where ε is additive i.i.d. Gaussian noise, such that ε ~ N(0, σ_n²). Letting y = (y(x_1), ..., y(x_N))^T and w = (w(x_1), ..., w(x_N))^T, we have p(y | w) = N(w, σ_n² I) and p(w) = N(μ, K) as above. For notational simplicity, we assume μ = 0. For a test point w(x_*), the joint distribution p(w(x_*), y) is Gaussian:

(w(x_*), y) ~ N(0, [[k(x_*, x_*), k_*^T], [k_*, K + σ_n² I]]),    (32)

where K is defined as above, and (k_*)_i = k(x_*, x_i) with i = 1, ..., N. We can therefore condition on y to find p(w(x_*) | y) = N(μ_*, v_*), where

μ_* = k_*^T (K + σ_n² I)^{−1} y,    (33)
v_* = k(x_*, x_*) − k_*^T (K + σ_n² I)^{−1} k_*.    (34)

We can find this more laboriously by noting that p(w | y) and p(w(x_*) | w) are Gaussian and integrating, since p(w(x_*) | y) = ∫ p(w(x_*) | w) p(w | y) dw. We see that (34) does not depend on the data y, just on how far away the test point x_* is from the training inputs {x_1, ..., x_N}.
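Equations (33) and (34) translate directly into a few lines of code. The sketch below is a minimal version for one-dimensional inputs with the squared exponential kernel; the data and settings are illustrative only.

```python
import numpy as np

def k_se(x1, x2, A=1.0, ls=1.0):
    """Squared exponential kernel, equation (29)."""
    return A * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ls**2)

def gp_predict(X, y, X_star, sigma_n, A=1.0, ls=1.0):
    """Posterior mean (33) and variance (34) of w(x_*) given noisy observations y."""
    K = k_se(X, X, A, ls) + sigma_n**2 * np.eye(len(X))
    K_star = k_se(X_star, X, A, ls)               # rows of k_*^T, (k_*)_i = k(x_*, x_i)
    mean = K_star @ np.linalg.solve(K, y)         # equation (33)
    cov = k_se(X_star, X_star, A, ls) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)                     # pointwise version of equation (34)

# Toy usage: noisy samples of a smooth function.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, size=20))
y = np.sin(X) + 0.1 * rng.standard_normal(20)
mu, var = gp_predict(X, y, np.linspace(0, 10, 5), sigma_n=0.1)
```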

In regards to the introduction, we also see that for this standard Gaussian process regression, the observation model p(y | w) is Gaussian, the predictive distribution in (33) and (34) is Gaussian, the marginals in the prior (from marginalising equation (28)) are Gaussian, the noise is constant, and in the popular covariance functions given, the amplitude and length-scale are constant.

A brief discussion of multiple outputs, noise models with dependencies, and non-Gaussian observation models can be found in sections 9.1, 9.2 and 9.3 of Rasmussen and Williams (2006), which is available free online at the book website. An example of an input dependent length-scale is in section 4.2 on page 43.

References

Abramowitz, M. and Stegun, I. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications.

Adams, R. and Stegle, O. (2008). Gaussian process product models for nonparametric nonstationarity. In ICML.

Adams, R. P., Dahl, G. E., and Murray, I. (2010). Incorporating side information into probabilistic matrix factorization using Gaussian processes. In Grünwald, P. and Spirtes, P., editors, Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Bollerslev, T., Engle, R. F., and Wooldridge, J. M. (1988). A capital asset pricing model with time-varying covariances. The Journal of Political Economy, 96(1).

Csató, L. and Opper, M. (2001). Sparse representation for Gaussian process models. In Advances in Neural Information Processing Systems 13, page 444. The MIT Press.

Engle, R. (2002). New frontiers for ARCH models. Journal of Applied Econometrics, 17(5).

Gibbs, M. (1997). Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Dept. of Physics, University of Cambridge.

Gouriéroux, C., Jasiak, J., and Sufana, R. (2009). The Wishart autoregressive process of multivariate stochastic volatility. Journal of Econometrics, 150(2).

Harvey, A., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. The Review of Economic Studies, 61(2).

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2).

Minka, T. P., Winn, J. M., Guiver, J. P., and Knowles, D. A. (2010). Infer.NET 2.4. Microsoft Research Cambridge.

Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. JMLR: W&CP, 9.

Quiñonero-Candela, J. and Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Seeger, M., Williams, C., and Lawrence, N. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics, volume 9.

Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18:1257.

Uhlenbeck, G. and Ornstein, L. (1930). On the theory of Brownian motion. Phys. Rev., 36.

Winn, J. and Bishop, C. M. (2006). Variational message passing. Journal of Machine Learning Research, 6(1):661.
