INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES
SHILIANG SUN
Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China

Abstract: This paper presents a new model called infinite mixtures of multivariate Gaussian processes, which can be used to learn vector-valued functions and applied to multitask learning. As an extension of the single multivariate Gaussian process, the mixture model has the advantages of modeling multimodal data and alleviating the cubic computational complexity of the multivariate Gaussian process. A Dirichlet process prior is adopted to allow the (possibly infinite) number of mixture components to be inferred automatically from training data, and Markov chain Monte Carlo sampling techniques are used for parameter and latent variable inference. Preliminary experimental results on multivariate regression show the feasibility of the proposed model.

Keywords: Gaussian process; Dirichlet process; Markov chain Monte Carlo; Multitask learning; Vector-valued function; Regression.

1. Introduction

Gaussian processes provide a principled probabilistic approach to pattern recognition and machine learning. Formally, a Gaussian process is a collection of random variables such that any finite number of them obey a joint Gaussian prior distribution. As a Bayesian nonparametric model, the Gaussian process proves to be very powerful for general function learning problems such as regression and classification [1, 2].

Recently, motivated by the need to learn vector-valued functions and to perform multitask learning, research on multivariate or multi-output Gaussian processes has attracted a lot of attention. By learning multiple related tasks jointly, the common knowledge underlying different tasks can be shared, and thus a performance gain is likely to be obtained [3]. Representative works on multivariate Gaussian processes include the methods given in [4, 5, 6].

However, it is well known that Gaussian processes suffer from two important limitations [2, 7]. First, limited by the inherent unimodality of Gaussian distributions, Gaussian processes cannot characterize multimodal data, which are prevalent in practice. Second, they are computationally infeasible for big data, since inference requires the inversion of an N x N or an NM x NM covariance matrix for a single-variate or multivariate Gaussian process, respectively, where N is the number of training examples and M is the output dimensionality.

These two limitations can be greatly alleviated by making use of mixtures of Gaussian processes [8], where multiple Gaussian processes jointly explain the data and each example belongs to exactly one Gaussian process component. For mixtures of Gaussian processes, infinite mixtures based on Dirichlet processes [9] are prevailing because they permit the number of components to be inferred directly from data and thus bypass the difficult model selection problem on the component number. For single-variate or single-output Gaussian processes, several variants and implementations of infinite mixtures already exist and have brought great success in data modeling and prediction applications [2, 7, 10]. However, no extension of multivariate Gaussian processes to mixture models has been presented yet. Here, we fill this gap by proposing an infinite mixture model of multivariate Gaussian processes.
It should be noted that implementing this infinite model is challenging, because multivariate Gaussian processes are considerably more complicated than single-variate Gaussian processes.

The rest of this paper is organized as follows. After presenting the new infinite mixture model in Section 2, we show how hidden variable inference and prediction are performed in Section 3 and Section 4, respectively. Then, we report experimental results on multivariate regression in Section 5. Finally, concluding remarks and future work directions are given in Section 6.
2. The proposed model

The graphical model for the proposed infinite mixture of multivariate Gaussian processes (IMMGP) on the observed training data D = {x_i, y_i}_{i=1}^N is depicted in Figure 1. The likelihood of the observations and indicator variables for our IMMGP is

p({x_i, y_i}, Z | Θ) = p(Z | Θ) ∏_r p({y_i : z_i = r} | {x_i : z_i = r}, Θ) p({x_i : z_i = r} | Θ)
                     = p(Z | Θ) ∏_r p({y_i : z_i = r} | {x_i : z_i = r}, Θ) ∏_{j=1}^{N_r} p(x_rj | µ_r, R_r).   (1)

In the graphical model, r indexes the rth Gaussian process component in the mixture, and the number of components can grow without bound if enough data are provided. N_r is the number of examples belonging to the rth component. D and M are the dimensions of the input and output spaces, respectively. The set {α, {µ_r}, {R_r}, {σ_r0}, {K_r}, {w_rd}, {σ_rl}} includes all random parameters and is denoted by Θ. The latent variables are z_i (i = 1, ..., N) and F_r (r = 1, ..., ∞), where F_r can be removed from the graphical model by integration if we directly consider a distribution over {Y_r}. Denote the set of latent indicators by Z = {z_i}_{i=1}^N. Since F_r is for illustrative purposes only, the latent indicators Z and the random parameters Θ constitute the total hidden variables. The circles in the left and right columns of the graphical model indicate the hyperparameters, whose values are usually found by maximum likelihood estimation or designated manually if one has a strong belief about them.

(Figure 1. The graphical model for IMMGP.)

2.1. Distributions for hidden variables

α is the concentration parameter of the Dirichlet process, which controls the prior probability of assigning an example to a new mixture component and thus influences the total number of components in the mixture model. A gamma prior G(α | a_0, b_0) is used, with the parameterization of the gamma distribution given in [11]. Given α and {z_i}_{i=1}^n, the distribution of z_{n+1} is easy to obtain with the Chinese restaurant process metaphor [9].

The distribution over the input space for a mixture component is a Gaussian with full covariance,

p(x | z = r, µ_r, R_r) = N(x | µ_r, R_r^{-1}),   (2)

where R_r is the precision (inverse covariance) matrix. This input model is often flexible enough to provide good performance, though one can consider adopting mixtures of Gaussian distributions to model the input space. The parameters µ_r and R_r are further given a Gaussian prior and a Wishart prior, respectively,

µ_r ~ N(µ_0, R_0^{-1}),   R_r ~ W(W_0, ν_0).   (3)

The parameterization of the Wishart distribution is the same as that in [11].
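As a concrete illustration of the input-space model and priors in (2)-(3), the following Python sketch draws one component's input parameters from their priors and evaluates the input density. It is not code from the paper: the variable names, the use of scipy, and the specific hyperparameter values (taken from Section 5.1) are assumptions of this sketch; scipy's Wishart is assumed to match the scale/degrees-of-freedom parameterization used here.

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart, gamma

rng = np.random.default_rng(0)
D = 2                                   # input dimension
a0, b0 = 1.0, 1.0                       # gamma hyperparameters for alpha
mu0, R0 = np.zeros(D), np.eye(D) / 10   # Gaussian prior on mu_r (R0 acts as a precision)
W0, nu0 = np.eye(D) / (10 * D), D       # Wishart prior on the precision R_r

# Draw the Dirichlet process concentration and one component's input-space parameters.
alpha = gamma.rvs(a0, scale=1.0 / b0, random_state=rng)      # alpha ~ G(a0, b0)
R_r = wishart.rvs(df=nu0, scale=W0, random_state=rng)        # R_r ~ W(W0, nu0)
mu_r = rng.multivariate_normal(mu0, np.linalg.inv(R0))       # mu_r ~ N(mu0, R0^{-1})

# Input density of component r, Eq. (2): x | z = r ~ N(mu_r, R_r^{-1}).
x = np.array([0.5, -0.3])
log_px = multivariate_normal.logpdf(x, mean=mu_r, cov=np.linalg.inv(R_r))
print(alpha, log_px)
```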
A Gaussian process prior is placed over the latent functions {f_rl}_{l=1}^M for component r in our model. Assuming the Gaussian processes have zero mean, we set

E(f_rl(x) f_rk(x')) = σ_r0 K_r(l, k) k_r(x, x'),   y_rl(x) ~ N(f_rl(x), σ_rl),   (4)

where the scaling parameter σ_r0 > 0, K_r is a positive semi-definite matrix that specifies the inter-task similarities, k_r(·, ·) is a covariance function over inputs, and σ_rl is the noise variance for the lth output of the rth component. The prior for the M x M positive semi-definite matrix K_r is a Wishart distribution W(W_1, ν_1). σ_r0 and σ_rl are given gamma priors G(σ_r0 | a_1, b_1) and G(σ_rl | a_2, b_2), respectively. We set

k_r(x, x') = exp(-(1/2) Σ_{d=1}^D w_rd^2 (x_d - x'_d)^2),   (5)

where w_rd obeys a log-normal distribution, ln w_rd ~ N(µ_1, r_1), with mean µ_1 and variance r_1. The whole setup for a single Gaussian process component differs substantially from that in [4].

3. Inference

Since exact inference on the distribution p(Z, Θ | D) is infeasible, in this paper we use Markov chain Monte Carlo sampling techniques to obtain L samples {Z^(j), Θ^(j)}_{j=1}^L that approximate the distribution p(Z, Θ | D). In particular, Gibbs sampling is adopted to represent the posterior of the hidden variables. First, all the variables in {Z, Θ} are initialized by sampling them from their priors. Then the variables are updated using the following steps.

(1) Update the indicator variables {z_i}_{i=1}^N one by one, by cycling through the training data.
(2) Update the input and output space Gaussian process parameters {{µ_r}, {R_r}, {σ_r0}, {K_r}, {w_rd}, {σ_rl}} for each Gaussian process component in turn.
(3) Update the Dirichlet process concentration parameter α.

These three steps constitute one Gibbs sampling sweep over all hidden variables, and they are repeated until the Markov chain has produced enough samples. Note that samples in the burn-in stage are removed from the Markov chain and are not used for approximating the posterior distribution. In the following subsections, we provide the specific sampling method and the formulations involved for each update.

3.1. Updating indicator variables

Let Z_{-i} = Z\{z_i} = {z_1, ..., z_{i-1}, z_{i+1}, ..., z_N} and D_{-i} = D\{x_i, y_i}. To sample z_i, we need the following posterior conditional distribution

p(z_i | Z_{-i}, Θ, D) ∝ p(z_i | Z_{-i}, Θ) p(D | z_i, Z_{-i}, Θ)
                      ∝ p(z_i | Z_{-i}, Θ) p(y_i | {y_j : j ≠ i, z_j = z_i}, {x_j : z_j = z_i}, Θ) p(x_i | µ_{z_i}, R_{z_i}),   (6)

where we have used a clear decomposition between the joint distributions of {x_i, y_i} and D_{-i}. It is not difficult to calculate the three terms in the last line of (6). However, the computation of p(y_i | {y_j : j ≠ i, z_j = z_i}, {x_j : z_j = z_i}, Θ) may be made more efficient if some approximation scheme or acceleration method is adopted. In addition, to explore new experts we simply sample the parameters once from the prior, use them for the new expert, and then calculate (6), following [10, 12]. This indicator variable update is algorithm 8 of [13] with the number of auxiliary components set to m = 1.

3.2. Updating input space component parameters

The input space parameters µ_r and R_r can be sampled directly because their posterior conditional distributions have a simple form as a result of using conjugate priors:

p(µ_r | Z, Θ\µ_r, D) = p(µ_r | {x_rj}_{j=1}^{N_r}, R_r)
  ∝ p(µ_r) p({x_rj}_{j=1}^{N_r} | µ_r, R_r)
  ∝ |R_0|^{1/2} exp{-(1/2)(µ_r - µ_0)^T R_0 (µ_r - µ_0)} ∏_j |R_r|^{1/2} exp{-(1/2)(x_rj - µ_r)^T R_r (x_rj - µ_r)}
  ∝ exp{-(1/2)[µ_r^T R_0 µ_r - 2 µ_r^T R_0 µ_0 + Σ_j (µ_r^T R_r µ_r - 2 µ_r^T R_r x_rj)]},   (7)

and therefore p(µ_r | Z, Θ\µ_r, D) = N((R_0 + N_r R_r)^{-1}(R_0 µ_0 + R_r Σ_j x_rj), (R_0 + N_r R_r)^{-1}). Similarly,

p(R_r | Z, Θ\R_r, D) = p(R_r | {x_rj}_{j=1}^{N_r}, µ_r)
  ∝ p(R_r) p({x_rj}_{j=1}^{N_r} | µ_r, R_r)
  ∝ |R_r|^{(ν_0 - D - 1)/2} exp{-(1/2) Tr(W_0^{-1} R_r)} ∏_j |R_r|^{1/2} exp{-(1/2)(x_rj - µ_r)^T R_r (x_rj - µ_r)}
  ∝ |R_r|^{(ν_0 + N_r - D - 1)/2} exp{-(1/2) Tr((W_0^{-1} + Σ_j (x_rj - µ_r)(x_rj - µ_r)^T) R_r)},

and thus

p(R_r | Z, Θ\R_r, D) = W((W_0^{-1} + Σ_j (x_rj - µ_r)(x_rj - µ_r)^T)^{-1}, ν_0 + N_r).
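To make the conjugate updates of Section 3.2 concrete, here is a minimal sketch of one Gibbs draw of (µ_r, R_r) given the inputs currently assigned to component r. It is an illustration under assumptions, not the paper's implementation: the function and variable names are invented, scipy is assumed available, and R_0 and R_r are treated as precision matrices, consistent with the derivation in (7).

```python
import numpy as np
from scipy.stats import wishart

def sample_input_params(X_r, mu0, R0, W0, nu0, R_r, rng):
    """One Gibbs draw of (mu_r, R_r) for the inputs X_r (shape N_r x D) of component r.

    R0 and R_r are precision matrices; W0 and nu0 parameterize the Wishart prior on R_r.
    """
    N_r = X_r.shape[0]

    # Gaussian posterior of mu_r from Eq. (7): precision R0 + N_r R_r.
    post_prec = R0 + N_r * R_r
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (R0 @ mu0 + R_r @ X_r.sum(axis=0))
    mu_r = rng.multivariate_normal(post_mean, post_cov)

    # Wishart posterior of R_r: scale (W0^{-1} + sum_j (x_rj - mu_r)(x_rj - mu_r)^T)^{-1},
    # degrees of freedom nu0 + N_r.
    diffs = X_r - mu_r
    scatter = diffs.T @ diffs
    post_scale = np.linalg.inv(np.linalg.inv(W0) + scatter)
    R_r_new = wishart.rvs(df=nu0 + N_r, scale=post_scale, random_state=rng)
    return mu_r, R_r_new
```

Sampling µ_r with the current R_r and then R_r with the new µ_r is one valid ordering of the two conditional draws within a Gibbs sweep.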
4 and thus p(r r, Θ\R r, D) ( (W = W 0 + (x rj µ r )(x rj µ r ) ) ), ν0 + N r. j 3.3. Updating output space component parameters Note that Y r = {y i : i N, z i = r} and Y r = N r. In this subsection, we denote its N r elements by {y j r }Nr j= which correspond to {x j r} Nr j=. Define the complete M outputs in the rth GP as y r = (y r,..., y Nr r, y r2,..., y Nr r2,..., y rm,..., y Nr rm ), (8) where y j rl is the observation for the lth output on the jth input. According to the Gaussian process assumption given in (4), the observation y r follows a Gaussian distribution y r N (0, Σ), Σ = σ r0 K r K x r + D r I, (9) where denotes the Kronecker product, Kr x is the N r N r covariance matrix between inputs with Kr x (i, j) = k r (x i r, x j r), D r is an M M diagonal matrix with D r (i, i) = σ ri, I is an N r N r identity matrix, and therefore the size of Σ is MN r MN r. The predictive distribution for the interested variable f on a new input x which belongs to the rth component is N (K Σ y r, K K Σ K ), (0) where K(M MN = σ r) r0k r k x r, K = σ r0 K r, and k x r is a N r row vector with the ith element being k r (x, x i r). Hence, the expected output on x is K Σ y r. Note that the calculation of Σ is a source for approximation to speed up training. However, this problem is easier than the original single GP model since we already reduced the inversion from an MN MN matrix to several MN r MN r matrices. We use hybrid Monte Carlo [4] to update σ r0, and the basic Metropolis-Hastings algorithm to update K r, {w rd }, and {σ rl } with the corresponding proposal distributions being their priors. Below we give the posteriors of the output space parameters and when necessary provide some useful technical details. We have p(σ r0, Θ\σ r0, D) p(σ r0 )p(y r {x j r} Nr σ a r0 exp( b σ r0 ) = exp { Σ /2 exp( 2 y r Σ y r ) [ ( a ) ln σ r0 + b σ r0 + 2 ln Σ + ]} 2 y r Σ y r, () and thus the potential energy E(σ r0 ) = ( a ) ln σ r0 + b σ r0 + 2 ln Σ + 2 y r Σ y r. The gradient de(σ r0 )/dσ r0 is needed in order to use hybrid Monte Carlo, which is given by de(σ r0 ) dσ r0 = a σ r0 + b + 2 Tr[(Σ Σ y r y r Σ )(K r K x r )]. p(k r, Θ\K r, D) p(k r )p(y r {x j r} Nr K r (ν M )/2 exp{ 2 Tr(W K r)} Σ exp( /2 2 y r Σ y r ) { = exp [ (M + ν ) ln K r + Tr(W 2 K r) + ln Σ + y r Σ y r ]}. (2) p(w rd, Θ\w rd, D) p(w rd )p(y r {x j r} Nr w rd exp{ (ln w rd µ ) 2 } 2r Σ exp( /2 2 y r Σ y r ) { [ = exp ln w rd + (ln w rd µ ) 2 + 2r 2 ln Σ + ]} 2 y r Σ y r. (3) p(σ rl, Θ\σ rl, D) p(σ rl )p(y r {x j r} Nr σ a2 rl exp( b 2 σ rl ) = exp { Σ /2 exp( 2 y r Σ y r ) [ ( a 2 ) ln σ rl + b 2 σ rl + 2 ln Σ + ]} 2 y r Σ y r. (4) 3.4. Updating the concentration parameter α The basic Metropolis-Hastings algorithm is used to update α. Let c N be the number of distinct values in
3.4. Updating the concentration parameter α

The basic Metropolis-Hastings algorithm is used to update α. Let c (c ≤ N) be the number of distinct values in {z_1, ..., z_N}. It is clear from [15] that

p(c | α, N) = β_c^N α^c Γ(α) / Γ(N + α),   (15)

where the coefficient β_c^N is the absolute value of a Stirling number of the first kind, and Γ(·) is the gamma function. With (15) as the likelihood, the posterior of α is

p(α | c, N) ∝ p(α) p(c | α, N) ∝ p(α) α^c Γ(α) / Γ(N + α).   (16)

Since a gamma prior is used, it follows that

p(α | c, N) ∝ α^{c + a_0 - 1} exp(-b_0 α) Γ(α) / Γ(N + α).   (17)

4. Prediction

The graphical model for prediction is shown in Figure 2.

(Figure 2. The graphical model for prediction on a new input x*.)

The predictive distribution for the output of a new test input x* is

p(f* | x*, D) = Σ_z Σ_Z ∫ p(f*, z, Z, Θ | x*, D) dΘ
             = Σ_z Σ_Z ∫ p(z, Z, Θ | x*, D) p(f* | z, Z, Θ, x*, D) dΘ
             = Σ_z Σ_Z ∫ p(z | Z, Θ, x*) p(Z, Θ | x*, D) p(f* | z, Z, Θ, x*, D) dΘ
             ≈ Σ_z Σ_Z ∫ p(z | x*, Z, Θ) p(Z, Θ | D) p(f* | z, Z, Θ, x*, D) dΘ
             = Σ_Z ∫ [ Σ_z p(z | x*, Z, Θ) p(f* | z, Z, Θ, x*, D) ] p(Z, Θ | D) dΘ,   (18)

where z denotes the component indicator of the new input, and we have made use of the conditional independence p(z | Z, Θ, x*, D) = p(z | x*, Z, Θ) and the reasonable approximation p(Z, Θ | x*, D) ≈ p(Z, Θ | D). With the Markov chain Monte Carlo samples {Z^(i), Θ^(i)}_{i=1}^L approximating the above summation and integration over Z and Θ, it follows that

p(f* | x*, D) ≈ (1/L) Σ_{i=1}^L Σ_z p(z | x*, Z^(i), Θ^(i)) p(f* | z, Z^(i), Θ^(i), x*, D).

Therefore, the prediction for f* is

f̂* = (1/L) Σ_{i=1}^L [ Σ_z p(z | x*, Z^(i), Θ^(i)) E(f* | z, Z^(i), Θ^(i), x*, D) ],

where the expectation involved is simple to calculate since p(f* | z, Z^(i), Θ^(i), x*, D) is a Gaussian distribution, and z either takes a value appearing in Z^(i) or differs from all of them, with the corresponding parameters sampled from the priors. The computation of p(z | x*, Z^(i), Θ^(i)) is given as follows:

p(z | x*, Z^(i), Θ^(i)) = p(z | Z^(i), Θ^(i)) p(x* | z, Z^(i), Θ^(i)) / p(x* | Z^(i), Θ^(i))
  = p(z | Z^(i), Θ^(i)) p(x* | z, Z^(i), Θ^(i)) / Σ_z p(z | Z^(i), Θ^(i)) p(x* | z, Z^(i), Θ^(i))
  = p(z | Z^(i), Θ^(i)) p(x* | z, Θ^(i)) / Σ_z p(z | Z^(i), Θ^(i)) p(x* | z, Θ^(i)),   (19)

where the last equality follows from conditional independence. If z = r with r appearing in Z^(i), then p(z = r | Z^(i), Θ^(i)) = N_ir/(α + N) with N_ir = #{z' ∈ Z^(i) : z' = r}, and p(x* | z, Θ^(i)) = p(x* | µ_r, R_r). If z does not appear in Z^(i), then p(z | Z^(i), Θ^(i)) = α/(α + N) and p(x* | z, Θ^(i)) = ∫ p(x* | µ, R) p(µ | µ_0, R_0) p(R | W_0, ν_0) dµ dR. Unfortunately, this integral is not analytically tractable; a Monte Carlo estimate obtained by sampling µ and R from the priors can be used as an approximation. Note that, if z does not appear in Z^(i), then E(f* | z, Z^(i), Θ^(i), x*, D) = 0 as a result of the zero-mean Gaussian process priors. Otherwise, E(f* | z, Z^(i), Θ^(i), x*, D) can be calculated using standard Gaussian process regression formulas.
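The sketch below illustrates the prediction rule above for a single MCMC sample: gating weights from Eq. (19) over existing components plus one new-component term, combined with per-component GP means from Eq. (10). The data structure, the `gp_mean` helper, and all names are assumptions of this sketch rather than the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart

def predict_one_sample(x_star, alpha, components, mu0, R0, W0, nu0, rng, n_mc=100):
    """Mixture prediction for one retained MCMC sample {Z, Theta}.

    `components` is a list of dicts with keys 'N_r', 'mu', 'R' (input Gaussian parameters)
    and 'gp_mean' (a callable x -> M-vector implementing the mean of Eq. (10)).
    """
    N = sum(c['N_r'] for c in components)
    weights, means = [], []

    # Existing components: weight N_ir/(alpha+N) times the input density N(x* | mu_r, R_r^{-1}).
    for c in components:
        w = c['N_r'] / (alpha + N) * multivariate_normal.pdf(
            x_star, mean=c['mu'], cov=np.linalg.inv(c['R']))
        weights.append(w)
        means.append(c['gp_mean'](x_star))

    # New component: weight alpha/(alpha+N) times a Monte Carlo estimate of the
    # prior-predictive input density; its GP mean is zero under the zero-mean prior.
    px_new = 0.0
    for _ in range(n_mc):
        R = wishart.rvs(df=nu0, scale=W0, random_state=rng)
        mu = rng.multivariate_normal(mu0, np.linalg.inv(R0))
        px_new += multivariate_normal.pdf(x_star, mean=mu, cov=np.linalg.inv(R))
    weights.append(alpha / (alpha + N) * px_new / n_mc)
    means.append(np.zeros_like(means[0]))

    weights = np.array(weights) / np.sum(weights)   # normalization as in Eq. (19)
    return sum(w * m for w, m in zip(weights, means))
```

The final prediction in the paper averages this quantity over all retained MCMC samples.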
5. Experiment

To evaluate the proposed infinite mixture model and the inference and prediction methods used, we perform multivariate regression on a synthetic data set. The data set consists of 500 examples generated by ancestral sampling from the infinite mixture model. The dimensions of the input and output spaces are both set to two. From the whole data set, 400 examples are randomly selected as training data and the remaining 100 examples serve as test data.

5.1. Hyperparameter setting

The hyperparameters for generating data are set as follows: a_0 = 1, b_0 = 1, µ_0 = 0, R_0 = I/10, W_0 = I/(10D), ν_0 = D, a_1 = 1, b_1 = 1, W_1 = I/M, ν_1 = M, µ_1 = 0, r_1 = 0.01, a_2 = 0.1, and b_2 = 1. The same hyperparameters are used for inference except µ_0, R_0 and W_0: µ_0 and R_0 are set to the mean µ_x and inverse covariance R_x of the training data, respectively, and W_0 is set to R_x/D.

5.2. Prediction performance

By Markov chain Monte Carlo sampling, we obtain 4000 samples, of which only the last 2000 are retained for prediction. For comparison, the MTLNN approach (multitask learning neural networks without ensemble learning) [16] is adopted. Table 1 reports the root mean squared error (RMSE) on the test data for our IMMGP model and the MTLNN approach. IMMGP1 only considers the existing Gaussian process components reflected by the samples, while IMMGP2 also considers choosing a new component. The results indicate that IMMGP outperforms MTLNN and that the difference between IMMGP1 and IMMGP2 is very small.

Table 1. Prediction errors (RMSE) of different methods. Columns: MTLNN, IMMGP1, IMMGP2.
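For reference, a small sketch of how the reported RMSE could be computed for multivariate outputs is given below; pooling squared errors over both output dimensions is an assumption of this sketch, since the paper does not spell out the convention.

```python
import numpy as np

def rmse(Y_true, Y_pred):
    """Root mean squared error pooled over all test examples and output dimensions."""
    return float(np.sqrt(np.mean((np.asarray(Y_true) - np.asarray(Y_pred)) ** 2)))

# Example with the paper's test-set size (100 examples, 2 outputs); the values are dummies.
Y_true = np.zeros((100, 2))
Y_pred = np.full((100, 2), 0.1)
print(rmse(Y_true, Y_pred))   # 0.1
```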
6. Conclusion

In this paper, we have presented a new model called infinite mixtures of multivariate Gaussian processes and applied it to multivariate regression with good performance. Interesting future directions include applying this model to large-scale data, adapting it to classification problems, and devising fast deterministic approximate inference techniques.

References

[1] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[2] S. Sun and X. Xu, Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction, IEEE Transactions on Intelligent Transportation Systems, Vol. 12, No. 2, 2011.
[3] Y. Ji and S. Sun, Multitask multiclass support vector machines: Model and experiments, Pattern Recognition, Vol. 46, No. 3, 2013.
[4] E. Bonilla, K. Chai, and C. Williams, Multi-task Gaussian process prediction, Advances in Neural Information Processing Systems, Vol. 20, 2008.
[5] C. Yuan, Conditional multi-output regression, Proceedings of the International Joint Conference on Neural Networks, 2011.
[6] M. Alvarez and N. Lawrence, Computationally efficient convolved multiple output Gaussian processes, Journal of Machine Learning Research, Vol. 12, 2011.
[7] C. Rasmussen and Z. Ghahramani, Infinite mixtures of Gaussian process experts, Advances in Neural Information Processing Systems, Vol. 14, 2002.
[8] V. Tresp, Mixtures of Gaussian processes, Advances in Neural Information Processing Systems, Vol. 13, 2001.
[9] Y. Teh, Dirichlet processes, in Encyclopedia of Machine Learning, Springer-Verlag, Berlin, Germany, 2010.
[10] E. Meeds and S. Osindero, An alternative infinite mixture of Gaussian process experts, Advances in Neural Information Processing Systems, Vol. 18, 2006.
[11] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[12] C. Rasmussen, The infinite Gaussian mixture model, Advances in Neural Information Processing Systems, Vol. 12, 2000.
[13] R. Neal, Markov chain sampling methods for Dirichlet process mixture models, Technical Report 9815, Department of Statistics, University of Toronto, 1998.
[14] R. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, University of Toronto, 1993.
[15] C. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, Annals of Statistics, Vol. 2, No. 6, 1974.
[16] S. Sun, Traffic flow forecasting based on multitask ensemble learning, Proceedings of the ACM SIGEVO World Summit on Genetic and Evolutionary Computation, 2009.