Gaussian Process Regression Networks Supplementary Material

Andrew Gordon Wilson, David A. Knowles, Zoubin Ghahramani
University of Cambridge
agw38@cam.ac.uk, dak33@cam.ac.uk, zoubin@eng.cam.ac.uk

In this supplementary material, we discuss some further details of our ESS and VB inference (Sections 1 and 2), the computational complexity of our inference procedures (Section 3), and the correlation structure induced by the GPRN model (Section 4). We also discuss multimodality in the GPRN posterior (Section 5), SVLMC (Section 6), and some background information and notation for Gaussian process regression (Section 7). Figure 1 shows the network structure of GPRN learned on the Jura dataset.

Figure 1: Network structure for the Jura dataset learnt by GPRN. The spatially varying node and weight functions are shown, along with the predictive means for the observations. The three output dimensions are cadmium, nickel and zinc concentrations respectively.

1 Elliptical Slice Sampling

To sample from p(u | D, γ), we could use a Gibbs sampling scheme, which would have conjugate posterior updates, alternately conditioning on the weight and node functions. However, this Gibbs cycle would mix poorly because of the correlations between the weight and node functions in the posterior p(u | D, γ). In general, MCMC samples from p(u | D, γ) mix poorly because of the strong correlations in the prior p(u | σ_f, θ_f, θ_w) imposed by C_B. The sampling process is also often slowed by costly matrix inversions in the likelihood. We use elliptical slice sampling (Murray et al., 2010), a recent MCMC technique specifically designed to sample from posteriors with tightly correlated Gaussian priors. It performs joint updates and has no free parameters. Since there are no costly or numerically unstable matrix inversions in our likelihood, we also find sampling to be efficient.
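The transition operator itself is only a few lines. The sketch below is a generic elliptical slice sampling step following Murray et al. (2010), not our GPRN implementation; the function and argument names (ess_step, log_lik, chol_prior) are ours, and the state u is assumed to have a zero-mean Gaussian prior with the given Cholesky factor.

```python
import numpy as np

def ess_step(u, chol_prior, log_lik, rng):
    """One elliptical slice sampling transition (Murray et al., 2010).

    u          : current state, with prior N(0, Sigma)
    chol_prior : lower Cholesky factor of the prior covariance Sigma
    log_lik    : function returning the log-likelihood of a state
    rng        : numpy random Generator
    """
    # Auxiliary draw from the prior defines the ellipse through u.
    nu = chol_prior @ rng.standard_normal(u.shape[0])
    # Log-likelihood threshold for the slice.
    log_y = log_lik(u) + np.log(rng.uniform())
    # Initial proposal angle and bracket.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        u_prop = u * np.cos(theta) + nu * np.sin(theta)
        if log_lik(u_prop) > log_y:
            return u_prop
        # Shrink the bracket towards theta = 0 and propose again.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```

In the GPRN setting, u collects the node and weight function values, the prior covariance is the block diagonal C_B, and the log-likelihood is the Gaussian observation term, which involves no matrix inversions.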

With a sample from p(u | D, γ), we can sample from the predictive p(W(x_*), f(x_*) | u, σ_f, D). Let (W^i, f^i) be the i-th such joint sample. Using our generative GPRN model we can then construct samples of p(y(x_*) | W^i, f^i, σ_f, σ_y), from which we can construct the predictive distribution

p(y(x_*) | D) = lim_{J → ∞} (1/J) Σ_{i=1}^{J} p(y(x_*) | W^i, f^i, σ_f, σ_y).    (1)

We see that even with a Gaussian observation model, the predictive distribution in (1) is an infinite mixture of Gaussians, and will generally be heavy tailed. Since mixtures of Gaussians are dense in the set of probability distributions, the predictive distribution of GPRN is highly flexible. Mixing was assessed by looking at trace plots of samples and the likelihoods of these samples. Specific information about how long it takes to sample a solution for a given problem is given in the experiments section.
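To make (1) concrete, the sketch below forms Monte Carlo estimates of the predictive mean and covariance at a test input from joint samples of the weights and nodes, using the law of total covariance for the mixture. The array shapes and the scalar noise standard deviations sigma_f and sigma_y are simplifying assumptions for illustration, not the exact variables of our implementation.

```python
import numpy as np

def predictive_moments(W_samples, f_samples, sigma_f, sigma_y):
    """Monte Carlo moments of the mixture-of-Gaussians predictive in (1).

    W_samples : (J, p, q) samples of W(x_*)
    f_samples : (J, q)    samples of f(x_*)
    sigma_f, sigma_y : noise standard deviations (taken as scalars here)
    """
    J, p, _ = W_samples.shape
    means = np.einsum('jpq,jq->jp', W_samples, f_samples)       # per-sample W(x_*) f(x_*)
    covs = (sigma_f**2 * np.einsum('jpq,jrq->jpr', W_samples, W_samples)
            + sigma_y**2 * np.eye(p))                           # per-sample Gaussian covariance
    mean = means.mean(axis=0)
    # Law of total covariance: average within-component covariance
    # plus the covariance of the component means.
    centred = means - mean
    cov = covs.mean(axis=0) + np.einsum('jp,jr->pr', centred, centred) / J
    return mean, cov
```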

2 Variational Bayes

We perform variational EM (Jordan et al., 1999) to fit an approximate posterior q to the true posterior p, by minimising the Kullback–Leibler divergence

KL(q || p) = −H[q(v)] − ∫ q(v) log p(v) dv,

where H[q(v)] = −∫ q(v) log q(v) dv is the entropy and v = {f, W, σ_f², σ_y², a_j}.

E-step. We use variational message passing (Winn and Bishop, 2006) under the Infer.NET framework (Minka et al., 2010) to estimate the posterior over v = {f, W, σ_f², σ_y², a_j}, where the {a_j} are signal variance hyperparameters for each node function j, so that k_{f̂_j} → a_j k_{f̂_j}. We specify inverse Gamma priors on {σ_f², σ_y², a_j}:

σ_{fj}² ~ IG(α_{σ_f²}, β_{σ_f²}),   σ_y² ~ IG(α_{σ_y²}, β_{σ_y²}),   a_j ~ IG(α_a, β_a).

We use a variational posterior of the following form:

q(v) = q_{σ_y²}(σ_y²) ∏_{j=1}^{q} [ q_{f_j}(f_j) q_{σ_{fj}²}(σ_{fj}²) q_{a_j}(a_j) ∏_{i=1}^{p} q_{W_{ij}}(W_{ij}) ∏_{n=1}^{N} q_{f̂_{nj}}(f̂_{nj}) ],

where q_{σ_y²}, q_{σ_{fj}²} and q_{a_j} are inverse Gamma distributions, the q_{f̂_{nj}} are univariate normal distributions, and q_{f_j} and q_{W_{ij}} are multivariate normal distributions. The lengthscales {θ_f, θ_w} do not appear here because they are optimised in the M-step below (we could equivalently consider a point mass distribution on the lengthscales). For mathematical and computational convenience we introduce the following variables, which are deterministic functions of the existing variables in the model:

w_{nij} := W_{ij}(x_n),   f_{nj} := f_j(x_n),    (2)
t_{nij} := w_{nij} f̂_{nj},   s_{in} := Σ_j t_{nij}.    (3)

We refer to these as derived variables. Note that the observations y_i(x_n) ~ N(s_{in}, σ_y²) and that f̂_{nj} ~ N(f_{nj}, σ_{fj}²). These derived variables are all given univariate normal pseudo-marginals, which variational message passing uses as conduits to pass appropriate moments, resulting in the same updates as standard VB (see Winn and Bishop (2006) for details). Using the derived variables, the full model can be written as

p(v) ∝ IG(σ_y²; α_{σ_y²}, β_{σ_y²}) ∏_{j=1}^{q} ( N(f_j; 0, a_j K_f) IG(σ_{fj}²; α_{σ_f²}, β_{σ_f²}) IG(a_j; α_a, β_a)
        ∏_{i=1}^{p} [ N(W_{ij}; 0, K_w) ∏_{n=1}^{N} δ(w_{nij} − W_{ij}(x_n)) δ(f_{nj} − f_j(x_n)) N(f̂_{nj}; f_{nj}, σ_{fj}²) δ(t_{nij} − w_{nij} f̂_{nj}) δ(s_{in} − Σ_j t_{nij}) N(y_i(x_n); s_{in}, σ_y²) ] ).

The updates for f, W, σ_f², σ_y² are standard VB updates and are available in Infer.NET. The update for the ARD parameters a_j, however, required specific implementation. The factor itself is

log N(f_j; 0, a_j K_f) ≐ −(1/2) log|a_j K_f| − (1/2) f_j^T (a_j K_f)^{−1} f_j = −(N/2) log a_j − (1/2) log|K_f| − (1/2) a_j^{−1} f_j^T K_f^{−1} f_j,    (4)

where ≐ denotes equality up to an additive constant. Taking expectations with respect to f_j under q, we obtain the VMP message to a_j as IG(a_j; N/2 − 1, (1/2) ⟨f_j^T K_f^{−1} f_j⟩). Since the variational posterior on f_j is multivariate normal, the expectation ⟨f_j^T K_f^{−1} f_j⟩ is straightforward to calculate.

M-step. In the M-step we optimise the variational lower bound with respect to the log length-scale parameters {θ_f, θ_w}, using gradient descent with line search. When optimising θ_f we only need to consider the contribution to the lower bound of the factor N(f_j; 0, a_j K_f) (see (4)), which is straightforward to evaluate and differentiate. From (4) we have

⟨log N(f_j; 0, a_j K_f)⟩_q ≐ −(N/2) ⟨log a_j⟩ − (1/2) log|K_f| − (1/2) ⟨a_j^{−1}⟩ ⟨f_j^T K_f^{−1} f_j⟩.

We will need the gradient with respect to θ_f:

∂/∂θ_f ⟨log N(f_j; 0, a_j K_f)⟩_q = −(1/2) tr(K_f^{−1} ∂K_f/∂θ_f) + (1/2) ⟨a_j^{−1}⟩ ⟨f_j^T K_f^{−1} (∂K_f/∂θ_f) K_f^{−1} f_j⟩.

The expectations here are straightforward to compute analytically, since f_j has a multivariate normal variational posterior. Analogously, for θ_w we consider the contribution of N(W_{ij}; 0, K_w).
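As an illustration of this gradient, the sketch below evaluates it for a squared exponential node kernel parameterised by its log length-scale. The required expectations E[1/a_j] and E[f_j f_j^T] are assumed to have been computed from the inverse Gamma and multivariate normal factors of q; the names and the added jitter are ours, not part of the Infer.NET implementation.

```python
import numpy as np

def se_kernel_and_grad(X, log_ls):
    """Squared exponential kernel matrix and its derivative w.r.t. the log length-scale."""
    ls = np.exp(log_ls)
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * d2 / ls**2)
    dK = K * (d2 / ls**2)          # dK / d(log ls)
    return K, dK

def lower_bound_grad(X, log_ls, E_inv_a, E_ffT, jitter=1e-6):
    """Gradient of <log N(f_j; 0, a_j K_f)>_q with respect to the log length-scale.

    E_inv_a : E[1/a_j] under its inverse Gamma posterior
    E_ffT   : E[f_j f_j^T] = m_j m_j^T + S_j under the Gaussian posterior
    """
    K, dK = se_kernel_and_grad(X, log_ls)
    K = K + jitter * np.eye(X.shape[0])
    Kinv = np.linalg.inv(K)
    Kinv_dK = Kinv @ dK
    # -(1/2) tr(K^{-1} dK) + (1/2) E[1/a_j] E[ f^T K^{-1} dK K^{-1} f ]
    return -0.5 * np.trace(Kinv_dK) + 0.5 * E_inv_a * np.trace(Kinv_dK @ Kinv @ E_ffT)
```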

VB predictive distribution. The predictive distribution for the output y_*(x) at a new input location x is calculated as

p(y_*(x) | D) = ∫ p(y_*(x) | W(x), f(x)) p(W(x), f(x) | D) dW df.    (5)

VB fits the factorised approximation p(W(x), f(x) | D) ≈ q(W) q(f), so the approximate predictive is

p(y_*(x) | D) ≈ ∫ p(y_*(x) | W(x), f̂(x)) q(W) q(f̂) dW df̂.    (6)

We can calculate the mean and covariance of this distribution analytically:

ȳ_*(x)_i = Σ_k E(W*_{ik}) E[f̂*_k],    (7)
cov(y_*(x))_{ij} = Σ_k [ E(W*_{ik}) E(W*_{jk}) var(f̂*_k) + δ_{ij} var(W*_{ik}) E((f̂*_k)²) ] + δ_{ij} E[σ_y²],    (8)

where δ_{ij} = I[i = j] is the Kronecker delta function, W*_{ik} = W_{ik}(x) and f̂*_k = f̂_k(x). The moments of W*_{ik} and f̂*_k under q are straightforward to obtain from q(W) and q(f) respectively, using the standard GP prediction equations (see Rasmussen and Williams (2006)).

It is also of interest to calculate the noise covariance. Recall that our model can be written as

y(x) = W(x) f(x) + σ_f W(x) ε + σ_y z,    (9)

where W(x) f(x) is the signal component and the remaining terms are the noise. Let n(x) = σ_f W(x) ε + σ_y z be the noise component. The covariance of n(x) under q is then

cov(n(x))_{ij} = Σ_k [ E[σ_{fk}²] E(W*_{ik}) E(W*_{jk}) + δ_{ij} var(W*_{jk}) ] + δ_{ij} E[σ_y²].    (10)
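These moments reduce to sums over the q latent dimensions. Below is a sketch of (7), (8) and (10), assuming the elementwise means and variances of W(x) and f̂(x) at the test input have already been extracted from q; the argument names are illustrative.

```python
import numpy as np

def vb_predictive_moments(EW, VW, Ef, Vf, E_sigf2, E_sigy2):
    """Predictive mean (7), covariance (8) and noise covariance (10) under the VB posterior.

    EW, VW  : (p, q) elementwise means and variances of W(x) under q
    Ef, Vf  : (q,)   means and variances of f_hat(x) under q
    E_sigf2 : (q,)   E[sigma_fj^2]
    E_sigy2 : scalar E[sigma_y^2]
    """
    p, _ = EW.shape
    Ef2 = Vf + Ef**2                            # E[f_hat_k^2]
    mean = EW @ Ef                              # equation (7)
    cov = (EW * Vf) @ EW.T                      # sum_k E(W_ik) E(W_jk) var(f_hat_k)
    cov += np.diag((VW * Ef2).sum(axis=1))      # + delta_ij sum_k var(W_ik) E[f_hat_k^2]
    cov += E_sigy2 * np.eye(p)                  # + delta_ij E[sigma_y^2], equation (8)
    noise_cov = (EW * E_sigf2) @ EW.T + np.diag(VW.sum(axis=1)) + E_sigy2 * np.eye(p)  # (10)
    return mean, cov, noise_cov
```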
3 Computational Considerations

The computational complexity of a Markov chain Monte Carlo GPRN approach is mainly limited by taking the Cholesky decomposition of the block diagonal C_B, an Nq(p + 1) × Nq(p + 1) matrix in the prior on GP function values. But pq of these blocks are the same N × N covariance matrix K_w for the weight functions, q of these blocks are the covariance matrices K_{f̂_i} associated with the node functions, and chol(blkdiag(A, B, ...)) = blkdiag(chol(A), chol(B), ...). Therefore, assuming the node functions share the same covariance function (which they do in our experiments), the complexity of this operation is only O(N³), the same as for regular Gaussian process regression. At worst it is O(qN³), assuming different covariance functions for each node.
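The key identity is that the Cholesky factor of a block diagonal matrix is the block diagonal of the individual Cholesky factors, so only one factorisation per distinct block is needed. A small numerical check, with example kernels and arbitrary length-scales:

```python
import numpy as np
from scipy.linalg import block_diag, cholesky

rng = np.random.default_rng(0)
N = 50
X = rng.uniform(size=(N, 1))

def se(X, ls):
    """Squared exponential Gram matrix with a small jitter for numerical stability."""
    d2 = (X - X.T)**2
    return np.exp(-0.5 * d2 / ls**2) + 1e-8 * np.eye(len(X))

K_f = se(X, 0.3)     # shared node-function covariance (example length-scale)
K_w = se(X, 1.0)     # shared weight-function covariance

# C_B is block diagonal, so its Cholesky factor is the block diagonal of the
# per-block Cholesky factors: one O(N^3) factorisation per distinct block.
C_B = block_diag(K_f, K_w, K_w)
L_blocks = block_diag(*(cholesky(B, lower=True) for B in (K_f, K_w, K_w)))
assert np.allclose(cholesky(C_B, lower=True), L_blocks)
```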

Sampling also requires likelihood evaluations. Since there are input dependent noise correlations between the elements of the p dimensional observations y(x_i), multivariate volatility models such as MGARCH (Bollerslev et al., 1988) or multivariate stochastic volatility models (Harvey et al., 1994) would normally require inverting a p × p covariance matrix N times. (In this context, "inverting" means decomposing, e.g. taking a Cholesky decomposition of, the matrix Σ(x) in question, for instance to take the determinant of Σ^{-1}(x) or the matrix-vector product y(x)^T Σ^{-1}(x) y(x).) This would lead to a complexity of O(Nqp + Np³) per likelihood evaluation. However, by working directly with the noisy f̂ instead of the noise free f, evaluating the likelihood requires no costly or numerically unstable inversions, and thus has a complexity of only O(Nqp).

The computational complexity of variational Bayes is dominated by the O(N³) inversions required to calculate the covariance of the node and weight functions in the E-step. Naively, q and qp such inversions are required per iteration for the node and weight functions respectively, giving a total complexity of O(qpN³). However, under VB the covariances of the weight functions for the same p are all equal, reducing the complexity to O(qN³). If p is large, the O(pqN²) cost of calculating the weight function means may become significant. Although the per iteration cost of VB is actually higher than for MCMC, far fewer iterations are typically required to reach convergence.

We see that when fixing q and p, the computational complexity of GPRN scales cubically with the number of data points, like standard Gaussian process regression. On modern computers, this restricts GPRN to moderately sized datasets. However, one could adopt a sparse representation of GPRN, for example following the DTC (Csató and Opper, 2001; Seeger et al., 2003; Quiñonero-Candela and Rasmussen, 2005; Rasmussen and Williams, 2006), PITC (Quiñonero-Candela and Rasmussen, 2005), or FITC (Snelson and Ghahramani, 2006) approximations, which would lead to O(M²N) scaling, where M ≪ N. If the input space X is on a full (but not necessarily equispaced) grid, in that it can be expressed as a Cartesian product of the input locations for each dimension, then the complexity in N reduces to O(N log N).

Fixing q and N, the per iteration (for MCMC or VB) computational complexity of GPRN scales linearly with p. Overall, the computational demands of GPRN compare favourably to most multi-task GP models, which commonly have a complexity of O(p³N³) (e.g. SLFM, LMC, and CMOGP in the experiments) and do not account for either input dependent signal or noise correlations. Moreover, multivariate volatility models, which do account for input dependent noise correlations, are commonly intractable for p > 5 (Engle, 2002; Gouriéroux et al., 2009). The 1000 dimensional gene expression experiment is tractable for GPRN, but intractable for the alternative multi-task models used on the 50 dimensional gene expression set, and for the multivariate volatility models.

Using either MCMC or VB with GPRN, the memory requirement is O(N²) if all covariance kernels share the same hyperparameters, O(qN²) if the node functions have different kernel hyperparameters, and O(pq²N²) if all GPs have different kernel hyperparameters. This memory requirement comes from storing the information in the block diagonal C_B in the prior p(u | σ_f, θ_f, θ_w).

4 Covariance structure

In this section we derive the covariance structure induced by the GPRN of Section 2 in the main text, explicitly in terms of the kernel functions k_w and k_f for the weight and node functions. We assume that the weight functions share a kernel k_w and the node functions share a kernel k_f.

In the GPRN specification of Section 2 in the main text, a priori, at a particular location x,

E[y(x)] = 0,    (11)
cov(y(x)) = (1 + σ_f² + σ_y²) I,    (12)

because W(x) and f(x) are comprised of independent mean zero Gaussian processes. One can always preprocess data so that (11) and (12) accord with prior beliefs. The modelling power of standard Gaussian process regression comes primarily through the covariance kernel and its hyperparameters, which allow for a flexible covariance structure between data points at different locations. Likewise, the modelling power of GPRN comes from the induced process over Σ(x_*) = cov[y(x_*) | x_*] and how its covariance structure directly relates to the Gaussian process covariance kernels k_w and k_f used in the weight and node functions. Given the weight and node functions, the covariance matrix Σ(x_*) = cov[y(x_*) | x_*] is

Σ(x_*) = W(x_*) f(x_*) f(x_*)^T W(x_*)^T + σ_f² W(x_*) W(x_*)^T + σ_y² I,    (13)

where the first term is the signal contribution and the remaining terms are the noise contribution. Theorem 4.1 gives the covariance structure for the induced process in (13). The covariances induced by the signal and noise models are separately identified.

Theorem 4.1. Abbreviating the kernels k_f(x, x') and k_w(x, x') at locations x and x' as k_f and k_w, for i ≠ j ≠ l ≠ s,

cov[Σ_ii(x), Σ_ii(x')] = 2q² k_w² k_f² + 2q(k_w² + k_f²) + 2q σ_f⁴ k_w² + σ_y⁴,    (14)
cov[Σ_ij(x), Σ_ji(x')] = cov[Σ_ij(x), Σ_ij(x')] = 0.5(3q + q²) k_w² k_f² + q k_w² + σ_f⁴ k_w² + σ_y⁴,    (15)
cov[Σ_ij(x), Σ_ls(x')] = 0.    (16)

In (14) and (15), the first two terms are induced by the signal model and the last two by the noise model.
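The induced process over Σ(x) can also be explored by simulation: draw the weight and node function values at two locations whose kernels evaluate to given values k_w and k_f, form Σ at each location as in (13), and compute empirical covariances of its entries. A sketch, where all sizes and kernel values are arbitrary and the function names are ours:

```python
import numpy as np

def sample_sigma(q, p, k_w, k_f, sigma_f, sigma_y, n_samples, rng):
    """Draw Sigma(x) and Sigma(x') from the GPRN prior at two locations whose
    weight and node kernels evaluate to k_w and k_f respectively."""
    def pair(corr, size):
        # Each pair (g(x), g(x')) is bivariate normal with unit variances
        # and correlation equal to the kernel value.
        z1 = rng.standard_normal(size)
        z2 = corr * z1 + np.sqrt(1.0 - corr**2) * rng.standard_normal(size)
        return z1, z2
    W, Wp = pair(k_w, (n_samples, p, q))
    f, fp = pair(k_f, (n_samples, q))
    def sigma(Wm, fv):
        Wf = np.einsum('spq,sq->sp', Wm, fv)
        return (np.einsum('sp,sr->spr', Wf, Wf)
                + sigma_f**2 * np.einsum('spq,srq->spr', Wm, Wm)
                + sigma_y**2 * np.eye(p))
    return sigma(W, f), sigma(Wp, fp)

rng = np.random.default_rng(3)
S, Sp = sample_sigma(q=2, p=3, k_w=0.6, k_f=0.4, sigma_f=0.5, sigma_y=0.1,
                     n_samples=200000, rng=rng)
# Empirical covariance between Sigma_11(x) and Sigma_11(x').
print(np.cov(S[:, 0, 0], Sp[:, 0, 0])[0, 1])
```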

Proof. W_ij(x) ~ GP(0, k_w) and f_i(x) ~ GP(0, k_f). We first consider the contribution from the signal, A(x) = W(x) f(x) f(x)^T W(x)^T. The entry A_11(x) = Σ_{i=1}^{q} W_{1i}(x) f_i(x) Σ_{j=1}^{q} W_{1j}(x) f_j(x). We will focus on this entry, since all the diagonal entries are i.i.d., and cov(A_11(x), A_11(x')) = cov(A_ii(x), A_ii(x')). Abbreviating W_11(x'), W_12(x'), f_1(x'), f_2(x') respectively as W_11', W_12', f_1', f_2', and W_11(x), W_12(x), f_1(x), f_2(x) as W_11, W_12, f_1, f_2, and given that the node and weight functions are independent,

cov(A_11(x), A_11(x')) = q cov(W_11² f_1², W_11'² f_1'²) + 2q(q − 1) cov(W_11 f_1 W_12 f_2, W_11' f_1' W_12' f_2'),    (17)

cov(W_11 f_1 W_12 f_2, W_11' f_1' W_12' f_2') = E[W_11 f_1 W_12 f_2 W_11' f_1' W_12' f_2'] − E[W_11 f_1 W_12 f_2] E[W_11' f_1' W_12' f_2']    (18)
 = E[W_11 f_1 W_12 f_2 W_11' f_1' W_12' f_2'].    (19)

Since W_11 and W_11' are jointly Gaussian and marginally N(0, 1) random variables, we can write

W_11' = k_w W_11 + sqrt(1 − k_w²) γ_1,    (20)

where γ_1 is N(0, 1) and independent from W_11 and W_11'. Substitutions similar to (20) can be made for W_12, f_1, f_2, so that

E[W_11 f_1 W_12 f_2 W_11' f_1' W_12' f_2'] = k_w² k_f².    (21)

Likewise,

cov(W_11(x)² f_1(x)², W_11(x')² f_1(x')²) = 2(k_w² k_f² + k_w² + k_f²),    (22)

using E[a⁴] = 3 if a ~ N(0, 1). Therefore

cov(A_ii(x), A_ii(x')) = 2q(k_w² k_f² + k_w² + k_f²) + 2q(q − 1) k_w² k_f² = 2q² k_w² k_f² + 2q(k_w² + k_f²).    (23)

Equations (14)–(16) follow similarly.

Corollary 4.1. Since E[Σ(x)] is constant (it does not depend on x), and the covariances are proportional to the kernel function(s), the induced process on Σ(x) is weakly stationary if the kernel function(s) are stationary, that is, if k(x, x') = k(x + a, x' + a) for all feasible constants a and all kernels k. This holds, for example, if k(x, x') is a function of x − x'.

Corollary 4.2. If k(x, x') → 0 as |x − x'| → ∞, then

lim_{|x − x'| → ∞} cov[Σ_ij(x), Σ_ls(x')] = 0,   for all i, j, l, s.    (24)

Theorem 4.1 and its corollaries clarify the importance of the kernels k_w and k_f and their hyperparameters (particularly length-scales): through k_w and k_f we can explicitly specify the desired prior covariance structure of the GPRN model.

5 Multimodality in the GPRN posterior

It is possible to reduce the number of modes in the posterior by constraining the weights W or nodes f to be positive. For MCMC it is straightforward to do this by exponentiating the weights, as in Adams and Stegle (2008) and Adams et al. (2010). For VB it is more straightforward to explicitly constrain the weights to be positive using a truncated Gaussian representation. We found that these extensions did not significantly improve empirical performance, although exponentiating the weights sometimes improved numerical stability for MCMC on the multivariate volatility experiments. For Adams and Stegle (2008), exponentiating the weights will have been more valuable because they use expectation propagation, which in their case would centre probability mass between symmetric modes. MCMC and VB approaches are more robust to this problem: MCMC can explore these symmetric modes, and VB will concentrate on one of these modes without losing the expressivity of the GPRN prior.

6 SVLMC

In the main text, we compared to 8 multiple output GP methods: CMOGP, SLFM, ICM, co-kriging, LMC, CMOFITC, CMODTC, and CMOPITC. We also tested SVLMC, but found that it was not tractable on the gene expression or Jura datasets, and since it does not account for input dependent noise correlations, it is not applicable to the multivariate volatility datasets. In this section, we briefly present some of the results that we did find.

The SVLMC does not incorporate hyperparameters, and we found performance on these datasets to be sensitive to covariance kernel hyperparameters, particularly length-scales. So we introduced covariance kernel hyperparameters (like length-scales) into the SVLMC. We found the covariogram (based on the ACFs), which is typically used in geostatistics, too crude to set these hyperparameters, so we used the GPRN to learn hyperparameters. We ran an instance of SVLMC on the p = 50 gene expression dataset. After sampling for 20 hours on a 2.3 GHz dual core Intel i5 processor, the SMSE was 4.94, compared to an SMSE of 0.32 and 0.34 with GPRN (ESS) and GPRN (VB) respectively. For GPRN (ESS) we took 6000 samples, which took 40 seconds. GPRN (VB) took 12 seconds to converge.

7 Gaussian processes

We briefly review Gaussian process regression, introduce some notation, and expand on some of the points in the introduction. For more detail see Rasmussen and Williams (2006).

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Using a Gaussian process, we can define a distribution over functions w(x):

w(x) ~ GP(m(x), k(x, x')),    (25)

where w is the output variable, x is an arbitrary (potentially vector valued) input variable, and the mean m(x) and covariance function (or kernel) k(x, x') are respectively defined as

m(x) = E[w(x)],    (26)
k(x, x') = cov[w(x), w(x')].    (27)

This means that any collection of function values has a joint Gaussian distribution:

(w(x_1), w(x_2), ..., w(x_N)) ~ N(μ, K),    (28)

where the N × N covariance matrix K has entries K_ij = k(x_i, x_j), and the mean μ has entries μ_i = m(x_i). The properties of these functions (smoothness, periodicity, etc.) are determined by the kernel function. The squared exponential kernel is popular:

k_SE(x, x') = A exp(−0.5 |x − x'|² / l²).    (29)

Functions drawn from a Gaussian process with this kernel function are smooth, and can display long range trends. The length-scale hyperparameter l is easy to interpret: it determines how much the function values w(x) and w(x + a) depend on one another, for some constant a ∈ X. When the length-scale is learned from data, it is useful for determining how far into the past one should look in order to make good forecasts. The amplitude coefficient A ∈ R determines the marginal variance of w(x) in the prior, Var[w(x)] = A, and the magnitude of covariances between w(x) at different inputs x.
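To see the effect of the length-scale, the short sketch below draws functions from a zero-mean GP prior with the squared exponential kernel (29) at two different length-scales; the specific values, seed, and jitter are arbitrary choices for illustration.

```python
import numpy as np

def k_se(x1, x2, A=1.0, ls=1.0):
    """Squared exponential kernel, equation (29)."""
    return A * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ls**2)

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
for ls in (0.5, 2.0):
    K = k_se(x, x, ls=ls) + 1e-6 * np.eye(len(x))    # jitter for numerical stability
    L = np.linalg.cholesky(K)
    w = L @ rng.standard_normal((len(x), 3))          # three prior draws, w(x) ~ N(0, K)
    # Short length-scales give rapidly varying draws; long ones give slowly varying draws.
    print(f"length-scale {ls}: mean |increment| = {np.abs(np.diff(w, axis=0)).mean():.3f}")
```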

The Ornstein–Uhlenbeck kernel is also widely applied:

k_OU(x, x') = exp(−|x − x'| / l).    (30)

In one dimension it is the covariance function of an Ornstein–Uhlenbeck process (Uhlenbeck and Ornstein, 1930), which was introduced to model the velocity of a particle undergoing Brownian motion. With this kernel, the corresponding GP is a continuous time AR(1) process with Markovian dynamics: w(x + a) is independent of w(x − a) given w(x), for any constant a. Indeed, the OU kernel belongs to a more general class of Matérn kernels,

k_Matérn(x, x') = (2^{1−α} / Γ(α)) (sqrt(2α) |x − x'| / l)^α K_α(sqrt(2α) |x − x'| / l),    (31)

where K_α is a modified Bessel function (Abramowitz and Stegun, 1964). In one dimension the corresponding GP is a continuous time AR(p) process, where p = α + 1/2. (Discrete time autoregressive processes such as w(t + 1) = w(t) + ε(t), where ε(t) ~ N(0, 1), are widely used in time series modelling and are a particularly simple special case of Gaussian processes.) The OU kernel is recovered by setting α = 1/2.

There are many other useful kernels, like the periodic kernel (with a period that can be learned from data), or the Gibbs kernel (Gibbs, 1997), which allows for input dependent length-scales. Kernels can be combined together, e.g. k = a_1 k_1 + a_2 k_2 + a_3 k_3, and the relative importance of each kernel can be determined from data (e.g. from estimating a_1, a_2, a_3). Rasmussen and Williams (2006) and Bishop (2006) discuss how to create and combine kernels.

Suppose we are doing a regression using points {y(x_1), ..., y(x_N)} from a noisy function y = w(x) + ε, where ε is additive i.i.d. Gaussian noise, such that ε ~ N(0, σ_n²). Letting y = (y(x_1), ..., y(x_N))^T and w = (w(x_1), ..., w(x_N))^T, we have p(y | w) = N(w, σ_n² I) and p(w) = N(μ, K) as above. For notational simplicity, we assume μ = 0. For a test point w(x_*), the joint distribution p(w(x_*), y) is Gaussian:

(w(x_*), y) ~ N(0, [[k(x_*, x_*), k_*^T], [k_*, K + σ_n² I]]),    (32)

where K is defined as above, and (k_*)_i = k(x_*, x_i) with i = 1, ..., N. We can therefore condition on y to find p(w(x_*) | y) = N(μ_*, v_*), where

μ_* = k_*^T (K + σ_n² I)^{−1} y,    (33)
v_* = k(x_*, x_*) − k_*^T (K + σ_n² I)^{−1} k_*.    (34)

We can find this more laboriously by noting that p(w | y) and p(w(x_*) | w) are Gaussian and integrating, since p(w(x_*) | y) = ∫ p(w(x_*) | w) p(w | y) dw. We see that (34) does not depend on the data y, just on how far away the test point x_* is from the training inputs {x_1, ..., x_N}.
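Equations (33) and (34) translate directly into a few lines of code. The sketch below is a minimal version for one-dimensional inputs with the squared exponential kernel; the data and settings are illustrative only.

```python
import numpy as np

def k_se(x1, x2, A=1.0, ls=1.0):
    """Squared exponential kernel, equation (29)."""
    return A * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ls**2)

def gp_predict(X, y, X_star, sigma_n, A=1.0, ls=1.0):
    """Posterior mean (33) and variance (34) of w(x_*) given noisy observations y."""
    K = k_se(X, X, A, ls) + sigma_n**2 * np.eye(len(X))
    K_star = k_se(X_star, X, A, ls)               # rows of k_*^T, (k_*)_i = k(x_*, x_i)
    mean = K_star @ np.linalg.solve(K, y)         # equation (33)
    cov = k_se(X_star, X_star, A, ls) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)                     # pointwise version of equation (34)

# Toy usage: noisy samples of a smooth function.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, size=20))
y = np.sin(X) + 0.1 * rng.standard_normal(20)
mu, var = gp_predict(X, y, np.linspace(0, 10, 5), sigma_n=0.1)
```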

In regards to the introduction, we also see that for this standard Gaussian process regression, the observation model p(y | w) is Gaussian, the predictive distribution in (33) and (34) is Gaussian, the marginals in the prior (from marginalising equation (28)) are Gaussian, the noise is constant, and in the popular covariance functions given, the amplitude and length-scale are constant.

A brief discussion of multiple outputs, noise models with dependencies, and non-Gaussian observation models can be found in sections 9.1, 9.2 and 9.3 of Rasmussen and Williams (2006), which is available free online at the book website. An example of an input dependent length-scale is in section 4.2 on page 43.

References

Abramowitz, M. and Stegun, I. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications.

Adams, R. and Stegle, O. (2008). Gaussian process product models for nonparametric nonstationarity. In ICML.

Adams, R. P., Dahl, G. E., and Murray, I. (2010). Incorporating side information into probabilistic matrix factorization using Gaussian processes. In Grünwald, P. and Spirtes, P., editors, Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Bollerslev, T., Engle, R. F., and Wooldridge, J. M. (1988). A capital asset pricing model with time-varying covariances. The Journal of Political Economy, 96(1).

Csató, L. and Opper, M. (2001). Sparse representation for Gaussian process models. In Advances in Neural Information Processing Systems 13, page 444. The MIT Press.

Engle, R. (2002). New frontiers for ARCH models. Journal of Applied Econometrics, 17(5).

Gibbs, M. (1997). Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Dept. of Physics, University of Cambridge.

Gouriéroux, C., Jasiak, J., and Sufana, R. (2009). The Wishart autoregressive process of multivariate stochastic volatility. Journal of Econometrics, 150(2).

Harvey, A., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. The Review of Economic Studies, 61(2).

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2).

Minka, T. P., Winn, J. M., Guiver, J. P., and Knowles, D. A. (2010). Infer.NET 2.4. Microsoft Research Cambridge.

Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. JMLR: W&CP, 9.

Quiñonero-Candela, J. and Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Seeger, M., Williams, C., and Lawrence, N. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics, volume 9.

Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18:1257.

Uhlenbeck, G. and Ornstein, L. (1930). On the theory of Brownian motion. Phys. Rev., 36.

Winn, J. and Bishop, C. M. (2006). Variational message passing. Journal of Machine Learning Research, 6(1):661.
