INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES


SHILIANG SUN
Department of Computer Science and Technology, East China Normal University
500 Dongchuan Road, Shanghai 200241, China
E-MAIL: slsun@cs.ecnu.edu.cn, shiliangsun@gmail.com

Abstract: This paper presents a new model called infinite mixtures of multivariate Gaussian processes, which can be used to learn vector-valued functions and applied to multitask learning. As an extension of the single multivariate Gaussian process, the mixture model has the advantages of modeling multimodal data and alleviating the cubic computational complexity of the multivariate Gaussian process. A Dirichlet process prior is adopted to allow the (possibly infinite) number of mixture components to be automatically inferred from training data, and Markov chain Monte Carlo sampling techniques are used for parameter and latent variable inference. Preliminary experimental results on multivariate regression show the feasibility of the proposed model.

Keywords: Gaussian process; Dirichlet process; Markov chain Monte Carlo; Multitask learning; Vector-valued function; Regression.

1. Introduction

Gaussian processes provide a principled probabilistic approach to pattern recognition and machine learning. Formally, a Gaussian process is a collection of random variables such that any finite number of them obey a joint Gaussian prior distribution. As a Bayesian nonparametric model, the Gaussian process proves to be very powerful for general function learning problems such as regression and classification [1, 2].

Recently, motivated by the need to learn vector-valued functions and by multitask learning, research on multivariate or multi-output Gaussian processes has attracted a lot of attention. By learning multiple related tasks jointly, the common knowledge underlying different tasks can be shared, and thus a performance gain is likely to be obtained [3]. Representative works on multivariate Gaussian processes include the methods given in [4, 5, 6].

However, it is well known that Gaussian processes suffer from two important limitations [2, 7]. First, limited by the inherent unimodality of Gaussian distributions, Gaussian processes cannot characterize multimodal data, which are prevalent in practice. Second, they are computationally infeasible for big data, since inference requires the inversion of an N × N or an NM × NM covariance matrix for a single-variate or a multivariate Gaussian process, respectively, where N is the number of training examples and M is the output dimensionality.

These two limitations can be greatly alleviated by mixtures of Gaussian processes [8], where multiple Gaussian processes jointly explain the data and each example belongs to exactly one Gaussian process component. Among mixtures of Gaussian processes, infinite mixtures based on Dirichlet processes [9] are prevailing because they permit the number of components to be inferred directly from data and thus bypass the difficult model selection problem over the component number. For single-variate or single-output Gaussian processes, several variants and implementations of infinite mixtures already exist and have brought great success in data modeling and prediction applications [2, 7, 10]. However, no extension of multivariate Gaussian processes to mixture models has been presented yet. Here, we fill this gap by proposing an infinite mixture model of multivariate Gaussian processes.
It should be noted that the implementation of this infinite model is very challenging because multivariate Gaussian processes are much more complicated than single-variate Gaussian processes.

The rest of this paper is organized as follows. After introducing the new infinite mixture model in Section 2, we show how hidden variable inference and prediction are performed in Section 3 and Section 4, respectively. Then, we report experimental results on multivariate regression in Section 5. Finally, concluding remarks and future work directions are given in Section 6.

2. The proposed model

The graphical model for the proposed infinite mixture of multivariate Gaussian processes (IMMGP) on the observed training data $D = \{x_i, y_i\}_{i=1}^N$ is depicted in Figure 1.

Figure 1. The graphical model for IMMGP.

In the graphical model, $r$ indexes the $r$th Gaussian process component in the mixture, and the number of components can be infinitely large if enough data are provided. $N_r$ is the number of examples belonging to the $r$th component. $D$ and $M$ are the dimensions of the input and output spaces, respectively. The set $\{\alpha, \{\mu_r\}, \{R_r\}, \{\sigma_{r0}\}, \{K_r\}, \{w_{rd}\}, \{\sigma_{rl}\}\}$ includes all random parameters and is denoted here by $\Theta$. The latent variables are $z_i$ ($i = 1, \ldots, N$) and $F_r$ ($r = 1 : \infty$), where $F_r$ can be removed from the graphical model by integration if we directly consider a distribution over $\{Y_r\}$. Denote the set of latent indicators by $Z$, that is, $Z = \{z_i\}_{i=1}^N$. Since $F_r$ is for illustrative purposes only, the latent indicators $Z$ and the random parameters $\Theta$ constitute the total hidden variables. The circles in the left and right columns of the graphical model indicate the hyperparameters, whose values are usually found by maximum likelihood estimation or set manually if one has a strong prior belief about them.

The observation likelihood for our IMMGP is

$$p(\{x_i, y_i\}, Z \mid \Theta) = p(Z \mid \Theta) \prod_r p(\{y_i : z_i = r\} \mid \{x_i : z_i = r\}, \Theta)\, p(\{x_i : z_i = r\} \mid \Theta)$$
$$= p(Z \mid \Theta) \prod_r p(\{y_i : z_i = r\} \mid \{x_i : z_i = r\}, \Theta) \prod_{j=1}^{N_r} p(x_{rj} \mid \mu_r, R_r). \tag{1}$$

2.1. Distributions for hidden variables

$\alpha$ is the concentration parameter of the Dirichlet process, which controls the prior probability of assigning an example to a new mixture component and thus influences the total number of components in the mixture model. A gamma prior $\mathcal{G}(\alpha \mid a_0, b_0)$ is used, with the parameterization of the gamma distribution given in [11]. Given $\alpha$ and $\{z_i\}_{i=1}^n$, the distribution of $z_{n+1}$ is easy to obtain with the Chinese restaurant process metaphor [9].

The distribution over the input space for a mixture component is a Gaussian distribution with full covariance,

$$p(x \mid z = r, \mu_r, R_r) = \mathcal{N}(x \mid \mu_r, R_r^{-1}), \tag{2}$$

where $R_r$ is the precision (inverse covariance) matrix. This input model is often flexible enough to provide good performance, though one can consider adopting mixtures of Gaussian distributions to model the input space. Parameters $\mu_r$ and $R_r$ are further given a Gaussian prior and a Wishart prior, respectively,

$$\mu_r \sim \mathcal{N}(\mu_0, R_0^{-1}), \qquad R_r \sim \mathcal{W}(W_0, \nu_0). \tag{3}$$

The parameterization of the Wishart distribution is the same as that in [11].

A Gaussian process prior is placed over the latent functions $\{f_{rl}\}_{l=1}^M$ of component $r$. Assuming the Gaussian processes have zero mean, we set

$$\mathbb{E}\big(f_{rl}(x)\, f_{rk}(x')\big) = \sigma_{r0}\, K_r(l, k)\, k_r(x, x'), \qquad y_{rl}(x) \sim \mathcal{N}(f_{rl}(x), \sigma_{rl}), \tag{4}$$

where the scaling parameter $\sigma_{r0} > 0$, $K_r$ is a positive semi-definite matrix that specifies the inter-task similarities, $k_r(\cdot, \cdot)$ is a covariance function over inputs, and $\sigma_{rl}$ is the noise variance for the $l$th output of the $r$th component.
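To make the generative role of these priors concrete, the following minimal sketch (an illustration added here, not code from the paper; the hyperparameter values are placeholders) draws the input-space parameters of a single component and a few inputs from it, using the Bishop-style gamma and Wishart parameterizations assumed above.

```python
# Minimal illustrative sketch (assumed, not from the paper): ancestral sampling
# of one component's input-space parameters under the priors of Section 2.1.
# Gamma(a, b) uses rate b (mean a/b); Wishart W(W0, nu0) has mean nu0 * W0.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
D = 2                                       # input dimensionality (example value)

a0, b0 = 1.0, 1.0                           # gamma prior on the DP concentration alpha
mu0, R0 = np.zeros(D), np.eye(D) / 10.0     # Gaussian prior on mu_r (R0 is a precision)
W0, nu0 = np.eye(D) / (10.0 * D), D         # Wishart prior on the precision R_r

alpha = rng.gamma(shape=a0, scale=1.0 / b0)              # alpha ~ G(a0, b0)
R_r = wishart.rvs(df=nu0, scale=W0, random_state=rng)    # R_r ~ W(W0, nu0)
mu_r = rng.multivariate_normal(mu0, np.linalg.inv(R0))   # mu_r ~ N(mu0, R0^{-1})

# Inputs assigned to component r follow N(mu_r, R_r^{-1}), as in Eq. (2).
x_samples = rng.multivariate_normal(mu_r, np.linalg.inv(R_r), size=5)
print(alpha, mu_r, x_samples.shape)
```

A similar ancestral sampling path, extended with the output-space priors introduced next, is what generates the synthetic data used in Section 5.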

The prior of the $M \times M$ positive semi-definite matrix $K_r$ is given by a Wishart distribution $\mathcal{W}(W_1, \nu_1)$. $\sigma_{r0}$ and $\sigma_{rl}$ are given gamma priors $\mathcal{G}(\sigma_{r0} \mid a_1, b_1)$ and $\mathcal{G}(\sigma_{rl} \mid a_2, b_2)$, respectively. We set

$$k_r(x, x') = \exp\Big( -\tfrac{1}{2} \sum_{d=1}^{D} w_{rd}^2 (x_d - x'_d)^2 \Big), \tag{5}$$

where $w_{rd}$ obeys a log-normal distribution, $\ln w_{rd} \sim \mathcal{N}(\mu_1, r_1)$, with mean $\mu_1$ and variance $r_1$. The whole setup for a single Gaussian process component differs substantially from that in [4].

3. Inference

Since exact inference for the distribution $p(Z, \Theta \mid D)$ is infeasible, in this paper we use Markov chain Monte Carlo sampling techniques to obtain $L$ samples $\{Z_j, \Theta_j\}_{j=1}^L$ that approximate $p(Z, \Theta \mid D)$. In particular, Gibbs sampling is adopted to represent the posterior of the hidden variables. First, we initialize all the variables in $\{Z, \Theta\}$ by sampling them from their priors. Then the variables are updated using the following steps.

(1) Update the indicator variables $\{z_i\}_{i=1}^N$ one by one, by cycling through the training data.
(2) Update the input and output space Gaussian process parameters $\{\{\mu_r\}, \{R_r\}, \{\sigma_{r0}\}, \{K_r\}, \{w_{rd}\}, \{\sigma_{rl}\}\}$ for each Gaussian process component in turn.
(3) Update the Dirichlet process concentration parameter $\alpha$.

These three steps constitute one Gibbs sampling sweep over all hidden variables, and they are repeated until the Markov chain has yielded adequate samples. Note that samples from the burn-in stage are removed from the Markov chain and are not used for approximating the posterior distribution. In the following subsections, we provide the specific sampling method and the formulations involved in each update.

3.1. Updating indicator variables

Let $Z_{-i} = Z \setminus z_i = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_N\}$ and $D_{-i} = D \setminus \{x_i, y_i\}$. To sample $z_i$, we need the following posterior conditional distribution

$$p(z_i \mid Z_{-i}, \Theta, D) \propto p(z_i \mid Z_{-i}, \Theta)\, p(D \mid z_i, Z_{-i}, \Theta)$$
$$\propto p(z_i \mid Z_{-i}, \Theta)\, p\big( y_i \mid \{y_j : j \neq i, z_j = z_i\}, \{x_j : z_j = z_i\}, \Theta \big)\, p(x_i \mid \mu_{z_i}, R_{z_i}), \tag{6}$$

where we have used the decomposition between the joint distribution of $\{x_i, y_i\}$ and that of $D_{-i}$. It is not difficult to calculate the three terms in the last line of (6). However, the computation of $p(y_i \mid \{y_j : j \neq i, z_j = z_i\}, \{x_j : z_j = z_i\}, \Theta)$ could be made more efficient if an approximation scheme or acceleration method is adopted. In addition, for exploring new experts, we simply sample the parameters once from the prior, use them for the new expert, and then calculate (6), following [10, 12]. The indicator variable update is thus algorithm 8 of [13] with the number of auxiliary components $m = 1$.

3.2. Updating input space component parameters

The input space parameters $\mu_r$ and $R_r$ can be sampled directly because their posterior conditional distributions have a simple form as a result of using conjugate priors:

$$p(\mu_r \mid Z, \Theta \setminus \mu_r, D) = p(\mu_r \mid \{x_{rj}\}_{j=1}^{N_r}, R_r) \propto p(\mu_r)\, p(\{x_{rj}\}_{j=1}^{N_r} \mid \mu_r, R_r)$$
$$\propto |R_0|^{1/2} \exp\big\{ -\tfrac{1}{2} (\mu_r - \mu_0)^\top R_0 (\mu_r - \mu_0) \big\} \prod_j |R_r|^{1/2} \exp\big\{ -\tfrac{1}{2} (x_{rj} - \mu_r)^\top R_r (x_{rj} - \mu_r) \big\}$$
$$\propto \exp\Big\{ -\tfrac{1}{2} \Big[ \mu_r^\top R_0 \mu_r - 2 \mu_r^\top R_0 \mu_0 + \sum_j \big( \mu_r^\top R_r \mu_r - 2 \mu_r^\top R_r x_{rj} \big) \Big] \Big\}, \tag{7}$$

and therefore

$$p(\mu_r \mid Z, \Theta \setminus \mu_r, D) = \mathcal{N}\Big( (R_0 + N_r R_r)^{-1} \big(R_0 \mu_0 + R_r \textstyle\sum_j x_{rj}\big),\; (R_0 + N_r R_r)^{-1} \Big).$$

Similarly,

$$p(R_r \mid Z, \Theta \setminus R_r, D) = p(R_r \mid \{x_{rj}\}_{j=1}^{N_r}, \mu_r) \propto p(R_r)\, p(\{x_{rj}\}_{j=1}^{N_r} \mid \mu_r, R_r)$$
$$\propto |R_r|^{(\nu_0 - D - 1)/2} \exp\big\{ -\tfrac{1}{2} \mathrm{Tr}(W_0^{-1} R_r) \big\} \prod_j |R_r|^{1/2} \exp\big\{ -\tfrac{1}{2} (x_{rj} - \mu_r)^\top R_r (x_{rj} - \mu_r) \big\}$$
$$\propto |R_r|^{(\nu_0 + N_r - D - 1)/2} \exp\Big\{ -\tfrac{1}{2} \mathrm{Tr}\Big( \big(W_0^{-1} + \textstyle\sum_j (x_{rj} - \mu_r)(x_{rj} - \mu_r)^\top\big) R_r \Big) \Big\},$$

and thus

$$p(R_r \mid Z, \Theta \setminus R_r, D) = \mathcal{W}\Big( \big( W_0^{-1} + \textstyle\sum_j (x_{rj} - \mu_r)(x_{rj} - \mu_r)^\top \big)^{-1},\; \nu_0 + N_r \Big).$$
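For concreteness, the sketch below (an illustrative rendering added here, with function and variable names chosen by us rather than taken from the paper) performs one such Gibbs step for $(\mu_r, R_r)$ given the inputs currently assigned to component $r$.

```python
# Minimal sketch (assumed, not from the paper): the conjugate Gibbs updates of
# Section 3.2 for one component r, given the inputs X_r currently assigned to it.
# The Wishart W(W, nu) follows the Bishop-style convention with mean nu * W.
import numpy as np
from scipy.stats import wishart

def update_input_params(X_r, R_r, mu0, R0, W0, nu0, rng):
    """One Gibbs step for (mu_r, R_r); X_r has shape (N_r, D), R_r is the current precision."""
    N_r = X_r.shape[0]

    # mu_r | R_r, X_r ~ N((R0 + N_r R_r)^{-1}(R0 mu0 + R_r sum_j x_rj), (R0 + N_r R_r)^{-1})
    cov = np.linalg.inv(R0 + N_r * R_r)
    mean = cov @ (R0 @ mu0 + R_r @ X_r.sum(axis=0))
    mu_r = rng.multivariate_normal(mean, cov)

    # R_r | mu_r, X_r ~ W((W0^{-1} + sum_j (x_rj - mu_r)(x_rj - mu_r)^T)^{-1}, nu0 + N_r)
    S = (X_r - mu_r).T @ (X_r - mu_r)
    R_r = wishart.rvs(df=nu0 + N_r, scale=np.linalg.inv(np.linalg.inv(W0) + S),
                      random_state=rng)
    return mu_r, R_r
```

Because both conditionals are available in closed form, this step involves no accept/reject decision, in contrast to the Metropolis-Hastings updates used below for the output-space parameters.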

and thus p(r r, Θ\R r, D) ( (W = W 0 + (x rj µ r )(x rj µ r ) ) ), ν0 + N r. j 3.3. Updating output space component parameters Note that Y r = {y i : i N, z i = r} and Y r = N r. In this subsection, we denote its N r elements by {y j r }Nr j= which correspond to {x j r} Nr j=. Define the complete M outputs in the rth GP as y r = (y r,..., y Nr r, y r2,..., y Nr r2,..., y rm,..., y Nr rm ), (8) where y j rl is the observation for the lth output on the jth input. According to the Gaussian process assumption given in (4), the observation y r follows a Gaussian distribution y r N (0, Σ), Σ = σ r0 K r K x r + D r I, (9) where denotes the Kronecker product, Kr x is the N r N r covariance matrix between inputs with Kr x (i, j) = k r (x i r, x j r), D r is an M M diagonal matrix with D r (i, i) = σ ri, I is an N r N r identity matrix, and therefore the size of Σ is MN r MN r. The predictive distribution for the interested variable f on a new input x which belongs to the rth component is N (K Σ y r, K K Σ K ), (0) where K(M MN = σ r) r0k r k x r, K = σ r0 K r, and k x r is a N r row vector with the ith element being k r (x, x i r). Hence, the expected output on x is K Σ y r. Note that the calculation of Σ is a source for approximation to speed up training. However, this problem is easier than the original single GP model since we already reduced the inversion from an MN MN matrix to several MN r MN r matrices. We use hybrid Monte Carlo [4] to update σ r0, and the basic Metropolis-Hastings algorithm to update K r, {w rd }, and {σ rl } with the corresponding proposal distributions being their priors. Below we give the posteriors of the output space parameters and when necessary provide some useful technical details. We have p(σ r0, Θ\σ r0, D) p(σ r0 )p(y r {x j r} Nr σ a r0 exp( b σ r0 ) = exp { Σ /2 exp( 2 y r Σ y r ) [ ( a ) ln σ r0 + b σ r0 + 2 ln Σ + ]} 2 y r Σ y r, () and thus the potential energy E(σ r0 ) = ( a ) ln σ r0 + b σ r0 + 2 ln Σ + 2 y r Σ y r. The gradient de(σ r0 )/dσ r0 is needed in order to use hybrid Monte Carlo, which is given by de(σ r0 ) dσ r0 = a σ r0 + b + 2 Tr[(Σ Σ y r y r Σ )(K r K x r )]. p(k r, Θ\K r, D) p(k r )p(y r {x j r} Nr K r (ν M )/2 exp{ 2 Tr(W K r)} Σ exp( /2 2 y r Σ y r ) { = exp [ (M + ν ) ln K r + Tr(W 2 K r) + ln Σ + y r Σ y r ]}. (2) p(w rd, Θ\w rd, D) p(w rd )p(y r {x j r} Nr w rd exp{ (ln w rd µ ) 2 } 2r Σ exp( /2 2 y r Σ y r ) { [ = exp ln w rd + (ln w rd µ ) 2 + 2r 2 ln Σ + ]} 2 y r Σ y r. (3) p(σ rl, Θ\σ rl, D) p(σ rl )p(y r {x j r} Nr σ a2 rl exp( b 2 σ rl ) = exp { Σ /2 exp( 2 y r Σ y r ) [ ( a 2 ) ln σ rl + b 2 σ rl + 2 ln Σ + ]} 2 y r Σ y r. (4) 3.4. Updating the concentration parameter α The basic Metropolis-Hastings algorithm is used to update α. Let c N be the number of distinct values in

3.4. Updating the concentration parameter α

The basic Metropolis-Hastings algorithm is used to update $\alpha$. Let $c \le N$ be the number of distinct values in $\{z_1, \ldots, z_N\}$. It is clear from [15] that

$$p(c \mid \alpha, N) = \beta_c^N \alpha^c \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \tag{15}$$

where the coefficient $\beta_c^N$ is the absolute value of a Stirling number of the first kind, and $\Gamma(\cdot)$ is the gamma function. With (15) as the likelihood, the posterior of $\alpha$ is

$$p(\alpha \mid c, N) \propto p(\alpha)\, p(c \mid \alpha, N) \propto p(\alpha)\, \alpha^c \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}. \tag{16}$$

Since a gamma prior is used, it follows that

$$p(\alpha \mid c, N) \propto \frac{\alpha^{c + a_0 - 1} \exp(-b_0 \alpha)\, \Gamma(\alpha)}{\Gamma(N + \alpha)}. \tag{17}$$

4. Prediction

The graphical model for prediction is shown in Figure 2.

Figure 2. The graphical model for prediction on a new input $x_*$.

The predictive distribution for the output of a new test input $x_*$ is

$$p(f \mid x_*, D) = \sum_{z, Z} \int p(f, z, Z, \Theta \mid x_*, D)\, d\Theta$$
$$= \sum_{z, Z} \int p(z, Z, \Theta \mid x_*, D)\, p(f \mid z, Z, \Theta, x_*, D)\, d\Theta$$
$$= \sum_{z, Z} \int p(z \mid x_*, Z, \Theta)\, p(Z, \Theta \mid x_*, D)\, p(f \mid z, Z, \Theta, x_*, D)\, d\Theta$$
$$\approx \sum_{z, Z} \int p(z \mid x_*, Z, \Theta)\, p(Z, \Theta \mid D)\, p(f \mid z, Z, \Theta, x_*, D)\, d\Theta$$
$$= \sum_{Z} \int \Big[ \sum_{z} p(z \mid x_*, Z, \Theta)\, p(f \mid z, Z, \Theta, x_*, D) \Big] p(Z, \Theta \mid D)\, d\Theta, \tag{18}$$

where we have made use of the conditional independence $p(z \mid Z, \Theta, x_*, D) = p(z \mid x_*, Z, \Theta)$ and the reasonable approximation $p(Z, \Theta \mid x_*, D) \approx p(Z, \Theta \mid D)$. With the Markov chain Monte Carlo samples $\{Z_i, \Theta_i\}_{i=1}^L$ approximating the above summation and integration over $Z$ and $\Theta$, it follows that

$$p(f \mid x_*, D) \approx \frac{1}{L} \sum_{i=1}^{L} \sum_{z} p(z \mid x_*, Z_i, \Theta_i)\, p(f \mid z, Z_i, \Theta_i, x_*, D).$$

Therefore, the prediction for $f$ is

$$\hat{f} = \frac{1}{L} \sum_{i=1}^{L} \Big[ \sum_{z} p(z \mid x_*, Z_i, \Theta_i)\, \mathbb{E}(f \mid z, Z_i, \Theta_i, x_*, D) \Big],$$

where the expectation involved is simple to calculate since $p(f \mid z, Z_i, \Theta_i, x_*, D)$ is a Gaussian distribution, and $z$ either takes a value from $Z_i$ or indicates a new component whose parameters are sampled from the priors. The computation of $p(z \mid x_*, Z_i, \Theta_i)$ is as follows:

$$p(z \mid x_*, Z_i, \Theta_i) = \frac{p(z \mid Z_i, \Theta_i)\, p(x_* \mid z, Z_i, \Theta_i)}{p(x_* \mid Z_i, \Theta_i)} = \frac{p(z \mid Z_i, \Theta_i)\, p(x_* \mid z, Z_i, \Theta_i)}{\sum_z p(z \mid Z_i, \Theta_i)\, p(x_* \mid z, Z_i, \Theta_i)} = \frac{p(z \mid Z_i, \Theta_i)\, p(x_* \mid z, \Theta_i)}{\sum_z p(z \mid Z_i, \Theta_i)\, p(x_* \mid z, \Theta_i)}, \tag{19}$$

where the last equality follows from conditional independence. If $z = r \in Z_i$, then $p(z = r \mid Z_i, \Theta_i) = \frac{N_{ir}}{\alpha + N}$ with $N_{ir} = \#\{z' \in Z_i : z' = r\}$, and $p(x_* \mid z, \Theta_i) = p(x_* \mid \mu_r, R_r)$. If $z \notin Z_i$, then $p(z \mid Z_i, \Theta_i) = \frac{\alpha}{\alpha + N}$ and $p(x_* \mid z, \Theta_i) = \int p(x_* \mid \mu, R)\, p(\mu \mid \mu_0, R_0)\, p(R \mid W_0, \nu_0)\, d\mu\, dR$. Unfortunately, this integral is not analytically tractable; a Monte Carlo estimate obtained by sampling $\mu$ and $R$ from the priors can be used as an approximation. Note that, if $z \notin Z_i$, then $\mathbb{E}(f \mid z, Z_i, \Theta_i, x_*, D) = 0$ as a result of the zero-mean Gaussian process priors. Otherwise, $\mathbb{E}(f \mid z, Z_i, \Theta_i, x_*, D)$ can be calculated using standard Gaussian process regression formulations.
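As an illustration of how the gating probabilities of Eq. (19) weight the per-component predictions for a single retained sample $(Z_i, \Theta_i)$, consider the sketch below. The helper names (comp_params, comp_predict) are placeholders of ours, and for brevity the normalizer omits the mass assigned to a brand-new component, which the exact formula would include via the Monte Carlo estimate of $p(x_* \mid z \notin Z_i)$ described above; averaging the result over all $L$ samples then gives the final prediction.

```python
# Minimal sketch (assumed, not from the paper): combining per-component predictions
# for one MCMC sample (Z_i, Theta_i), cf. Eqs. (18)-(19). The weight of a brand-new
# component is omitted from the normalizer for brevity, so the gating weights are
# only approximate; its predictive mean would be zero under the zero-mean GP prior.
import numpy as np
from scipy.stats import multivariate_normal

def predict_sample(x_star, assignments, alpha, comp_params, comp_predict):
    """assignments: the indicators z_1..z_N of this sample; comp_params[r] holds
    (mu_r, R_r); comp_predict(r, x_star) returns the M-dimensional predictive
    mean of component r, e.g. via the predictive_mean sketch of Section 3.3."""
    N = len(assignments)
    labels, counts = np.unique(assignments, return_counts=True)

    weights, means = [], []
    for r, N_r in zip(labels, counts):
        mu_r, R_r = comp_params[r]
        # p(z = r | Z_i, Theta_i) * p(x_star | mu_r, R_r), cf. Eq. (19)
        w = (N_r / (alpha + N)) * multivariate_normal.pdf(
            x_star, mean=mu_r, cov=np.linalg.inv(R_r))
        weights.append(w)
        means.append(comp_predict(r, x_star))

    weights = np.array(weights)
    weights /= weights.sum()           # normalize over the existing components only
    return weights @ np.vstack(means)  # approximate weighted predictive mean, shape (M,)
```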

5. Experiment

To evaluate the proposed infinite mixture model and the adopted inference and prediction methods, we perform multivariate regression on a synthetic data set. The data set includes 500 examples that are generated by ancestral sampling from the infinite mixture model. The dimensions of the input and output spaces are both set to two. From the whole data set, 400 examples are randomly selected as training data and the other 100 examples serve as test data.

5.1. Hyperparameter setting

The hyperparameters for generating data are set as follows: a_0 = 1, b_0 = 1, µ_0 = 0, R_0 = I/10, W_0 = I/(10D), ν_0 = D, a_1 = 1, b_1 = 1, W_1 = I/M, ν_1 = M, µ_1 = 0, r_1 = 0.01, a_2 = 0.1, and b_2 = 1. The same hyperparameters are used for inference except µ_0, R_0 and W_0: µ_0 and R_0 are set to the mean µ_x and inverse covariance R_x of the training data, respectively, and W_0 is set to R_x/D.

5.2. Prediction performance

By Markov chain Monte Carlo sampling, we obtain 4000 samples, of which only the last 2000 are retained for prediction. For comparison purposes, the MTLNN approach (multitask learning neural networks without ensemble learning) [16] is adopted. Table 1 reports the root mean squared error (RMSE) on the test data for our IMMGP model and the MTLNN approach. IMMGP1 only considers the existing Gaussian process components reflected by the samples, while IMMGP2 considers choosing a new component as well. The results indicate that IMMGP outperforms MTLNN and that the difference between IMMGP1 and IMMGP2 is very small.

Table 1. Prediction errors (RMSE) of different methods

  MTLNN     IMMGP1    IMMGP2
  2.0659    0.7963    0.7964

6. Conclusion

In this paper, we have presented a new model called infinite mixtures of multivariate Gaussian processes and applied it to multivariate regression with good performance. Interesting future directions include applying this model to large-scale data, adapting it to classification problems, and devising fast deterministic approximate inference techniques.

References

[1] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[2] S. Sun and X. Xu, Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction, IEEE Transactions on Intelligent Transportation Systems, Vol. 12, No. 2, pp. 466-475, 2011.
[3] Y. Ji and S. Sun, Multitask multiclass support vector machines: Model and experiments, Pattern Recognition, Vol. 46, No. 3, pp. 914-924, 2013.
[4] E. Bonilla, K. Chai, and C. Williams, Multi-task Gaussian process prediction, Advances in Neural Information Processing Systems, Vol. 20, pp. 153-160, 2008.
[5] C. Yuan, Conditional multi-output regression, Proceedings of the International Joint Conference on Neural Networks, pp. 189-196, 2011.
[6] M. Alvarez and N. Lawrence, Computationally efficient convolved multiple output Gaussian processes, Journal of Machine Learning Research, Vol. 12, pp. 1459-1500, 2011.
[7] C. Rasmussen and Z. Ghahramani, Infinite mixtures of Gaussian process experts, Advances in Neural Information Processing Systems, Vol. 14, pp. 881-888, 2002.
[8] V. Tresp, Mixtures of Gaussian processes, Advances in Neural Information Processing Systems, Vol. 13, pp. 654-660, 2001.
[9] Y. Teh, Dirichlet processes, in Encyclopedia of Machine Learning, Springer-Verlag, Berlin, Germany, 2010.
[10] E. Meeds and S. Osindero, An alternative infinite mixture of Gaussian process experts, Advances in Neural Information Processing Systems, Vol. 18, pp. 883-890, 2006.
[11] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[12] C. Rasmussen, The infinite Gaussian mixture model, Advances in Neural Information Processing Systems, Vol. 12, pp. 554-560, 2000.
[13] R. Neal, Markov chain sampling methods for Dirichlet process mixture models, Technical Report 9815, Department of Statistics, University of Toronto, 1998.
[14] R. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, University of Toronto, 1993.
[15] C. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, Annals of Statistics, Vol. 2, No. 6, pp. 1152-1174, 1974.
[16] S. Sun, Traffic flow forecasting based on multitask ensemble learning, Proceedings of the ACM SIGEVO World Summit on Genetic and Evolutionary Computation, pp. 961-964, 2009.