CMPS 242: Project Report

RadhaKrishna Vuppala
Univ. of California, Santa Cruz

Abstract

Classification procedures impose certain models on the data, and they perform well when the model assumptions match the data. The assumption may be, for example, that the data is linearly separable or that it comes from some parametric family of distributions. Parametric models can be made flexible enough to handle various types of data by using mixture models instead of assuming a single distribution. However, even parametric mixture models assume that a finite number of components exists in the data. Bayesian nonparametric mixtures based on Dirichlet process mixtures (DPM) are a way of modeling infinite mixtures, and are more flexible in modeling both the form and the number of distributions that the data could be generated from. This paper explores the strengths and weaknesses of these models and examines the variational approximation for the DPM model. Variational approximations are deterministic approximation techniques that converge quickly and can replace costly MCMC computations in situations where an exact answer is not required.

1. Introduction

Bayesian nonparametrics received a strong push when the Dirichlet process was presented by Ferguson [4]. The Dirichlet process is a measure on measures, i.e. a sample drawn from a Dirichlet process is itself a distribution. Describing a process with large support and an analytically tractable posterior was a great step forward for Bayesian analysis.

$$G \mid \{G_0, \alpha\} \sim \mathrm{DP}(\alpha, G_0), \qquad \eta_n \mid G \sim G, \quad n = 1, \dots, N$$

If $G$ is a random probability measure drawn from a Dirichlet process, and $\eta_1, \dots, \eta_n$ are $n$ samples drawn from $G$, then it can be shown, by integrating out $G$, that the joint distribution of $\{\eta_1, \dots, \eta_n\}$ follows a Polya urn scheme, as demonstrated by Blackwell and MacQueen. The probability that a new sample $\eta_n$ is equal to one of the existing samples is given as follows.
$$P(\eta_n = \eta_j) = \begin{cases} \dfrac{N_j}{N + \alpha - 1}, & j = 1, \dots, n-1 \\[2mm] \dfrac{\alpha}{N + \alpha - 1}, & j = n \end{cases}$$

where $N_j = |\{k : \eta_k = \eta_j\}|$ is the number of samples with value $\eta_j$. This indicates that the samples show a clustering behavior. A new sample takes a value equal to one of the existing samples (say $\eta_j$) with a probability proportional to the number of observations already in that cluster (i.e. $N_j$). There is also a nonzero probability that the sample starts a new cluster. The parameter $\alpha$ thus plays a critical role in determining how probable a new cluster is.
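The urn scheme above can be simulated directly. The following is a minimal sketch, not the code used in this report: it assumes a standard normal base distribution $G_0$ and uses the hypothetical helper name `polya_urn_draws`. When the n-th sample is drawn there are n − 1 existing samples, so the denominator in the code is n − 1 + α.

```python
import random
from collections import Counter

def polya_urn_draws(n_samples, alpha, base_draw, rng):
    """Draw n_samples values from the Polya urn scheme (G integrated out)."""
    values = []
    for n in range(1, n_samples + 1):
        # With probability alpha / (n - 1 + alpha) draw a fresh value from G_0.
        if rng.random() < alpha / (n - 1 + alpha):
            values.append(base_draw(rng))
        else:
            # Copying a uniformly chosen earlier draw selects cluster j
            # with probability N_j / (n - 1 + alpha).
            values.append(rng.choice(values))
    return values

rng = random.Random(0)
# Assumption for illustration: G_0 is a standard normal distribution.
draws = polya_urn_draws(200, alpha=1.0, base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng)
cluster_sizes = Counter(draws)
print(len(cluster_sizes), "distinct values among", len(draws), "draws")
```

Increasing `alpha` in this sketch makes new values more likely and so tends to produce more, smaller clusters.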

A powerful constructive characterization of the Dirichlet process was given by Sethuraman [12] and is known as the stick-breaking construction. Considering two infinite collections of independent random variables $V_i \sim \mathrm{Beta}(1, \alpha)$ and $\eta_i \sim G_0$, for $i = 1, 2, \dots$, the stick-breaking representation of $G$, a sample from the DP, is given by

$$\pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j) \qquad (1)$$

$$G = \sum_{i=1}^{\infty} \pi_i(v)\, \delta_{\eta_i} \qquad (2)$$

Thus a draw from the Dirichlet process is shown to be a discrete distribution with infinite support, with each atom drawn from $G_0$. Since the DP has infinite support, it can be used as a prior for nonparametric data models. The Dirichlet process mixture model (DPM) [1] is a nonparametric data model which uses the DP in a hierarchical specification as follows.

$$y_i \mid \theta_i \sim F(\theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0)$$

The data $y_1, y_2, \dots, y_n$ are assumed to be i.i.d. draws from some distribution $F(\cdot)$ whose parameters are samples from the DP. Since a draw from the DP is a discrete mixing distribution with infinitely many atoms, the data $y$ also form an infinite mixture model. The data distribution also follows a Polya urn and hence shows a clustering behavior. The data from any single cluster is modeled by a simple distribution, the normal distribution usually being preferred, and the mixture proportions $\pi_i(v)$ indicate how likely it is that the data came from a particular cluster. This model is more flexible than a single parametric distribution in that it can model data from multiple distributions. Another interesting observation is that the number of clusters is not a parameter of this model: the inference algorithm is able to deduce the number of components. Traditional clustering algorithms (such as K-means) need to be provided with the number of components in the mixture.

As is common in Bayesian methods, the posteriors resulting from these models are intractable and no closed-form solutions can be found. While this troubled the Bayesian community for a while, the introduction of Markov chain Monte Carlo sampling methods [9, 6, 13, 5] provided a necessary breakthrough for Bayesian learning.
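The stick-breaking construction of equations (1) and (2) can be sketched as follows. This is an illustrative sketch only: the sum is truncated at a finite number of atoms, $G_0$ is assumed to be a standard normal distribution, and the name `stick_breaking_weights` is hypothetical.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Return the first `truncation` weights pi_i = v_i * prod_{j<i} (1 - v_j)
    together with the stick length that is still unassigned."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)  # V_i ~ Beta(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(1)
weights, leftover = stick_breaking_weights(alpha=2.0, truncation=50, rng=rng)
# Assumption for illustration: atoms eta_i are drawn from G_0 = N(0, 1).
atoms = [rng.gauss(0.0, 1.0) for _ in weights]
print("mass covered by 50 atoms:", sum(weights))
```

The pairs `(weights[i], atoms[i])` describe a truncated version of $G$; the leftover mass shrinks geometrically on average as the truncation level grows.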
Markov chain Monte Carlo (MCMC) methods, which are based on probabilistic sampling and are computationally expensive, are typically used to estimate the intractable posterior distributions. Gibbs sampling, a special case of the Metropolis-Hastings algorithm that is applicable when conjugate priors are used, is a very popular estimation method. Gibbs sampling algorithms for the DPM model were developed by Neal [10].

An interesting extension of this model is the Nested Dirichlet Process model (NDP) [11], which models hierarchical relationships in the data. Here the data in a class is assumed to come from a mixture of distributions, so the model can capture more complex relationships in the data.

$$y_i \sim \sum_{l=1}^{\infty} w_{il}\, h(\cdot \mid \phi_{il}), \qquad \left(\{w_{il}\}_{l=1}^{\infty}, \{\phi_{il}\}_{l=1}^{\infty}\right) \sim \bar{G}, \qquad \bar{G} \sim \mathrm{DP}(\alpha \bar{G}_0)$$

The rest of the report is organized as follows. Section two discusses an algorithm for estimating the Nested Dirichlet Process model. As part of the project, the variational approximation method was studied in some detail; the technique is discussed in section three and applied to the specific case of the DPM model. Section four experimentally compares an MCMC implementation of the NDP model with other well-known classifiers, and compares the variational approximation for the DPM model with the Gibbs implementation. Section five describes

some of the key learnings from the work and also describes some future work.

2. Gibbs sampling algorithms

This section describes a Pólya urn Gibbs sampler for the NDP model. We introduce two sets of labels: the class labels $\zeta_1, \dots, \zeta_n$, indicating to which class each observation belongs, and the component labels $\xi_1, \dots, \xi_n$, indicating to which component the observation is assigned, i.e. $\zeta_i = k$ and $\xi_i = l$ if and only if $\theta_i = \phi_{lk}$ (and therefore $G_i = G_k$). Without loss of generality, we assume that classes are labeled contiguously between 1 and $K$, and that the mixture components corresponding to the k-th distribution are also labeled contiguously between 1 and $L_k$. We use the superscript $-i$ throughout to denote quantities computed after removing observation $i$.

Note that the process generating the sequence of $G_i$'s is just a Dirichlet process (in a general Polish space), and therefore we can express the full conditional distribution $\zeta_i \mid \zeta^{-i}$ as a Pólya urn. Similarly, all observations assigned to the same class arise from another Dirichlet process. Combining these two observations, we get the joint full conditional distribution

$$\zeta_i, \xi_i \mid \zeta^{-i}, \xi^{-i} \sim \sum_{k=1}^{K^{-i}} \sum_{l=1}^{L_k^{-i}} \frac{n_k^{-i}}{\alpha + n - 1}\, \frac{n_{lk}^{-i}}{\beta_k + n_k^{-i}}\, \delta_{(k,l)} + \sum_{k=1}^{K^{-i}} \frac{n_k^{-i}}{\alpha + n - 1}\, \frac{\beta_k}{\beta_k + n_k^{-i}}\, \delta_{(k,\, L_k^{-i}+1)} + \frac{\alpha}{\alpha + n - 1}\, \delta_{(K^{-i}+1,\, 1)} \qquad (3)$$

where $n_{lk}^{-i}$ is the number of observations in class $k$ assigned to component $l$, and $n_k^{-i} = \sum_{l=1}^{L_k} n_{lk}^{-i}$ is the total number of observations in class $k$. The first term in (3) gives the prior full conditional probability that observation $i$ is assigned to one of the existing components in class $k$, the second term is the prior probability that a new component is created in class $k$ to accommodate observation $i$, and the third term is the prior probability that it is assigned to a new class. This result can be used to construct a Gibbs sampler to estimate the model.
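As an illustration of the three terms in (3), the sketch below computes the prior label probabilities from a table of counts with observation i already removed. It is a hedged sketch rather than the report's implementation: the function name `prior_label_probs` and the nested-list layout of the counts are assumptions made here, and every existing class is assumed to be non-empty.

```python
def prior_label_probs(counts, alpha, betas):
    """Prior full conditional of (class, component) labels for one observation.

    counts[k][l] holds the number of remaining observations in class k+1,
    component l+1 (observation i already removed). Returned keys are 1-based
    (class, component) pairs.
    """
    n_minus = sum(sum(c) for c in counts)  # n - 1 remaining observations
    denom = alpha + n_minus
    n_classes = len(counts)
    probs = {}
    for k, comp_counts in enumerate(counts, start=1):
        n_k = sum(comp_counts)
        beta_k = betas[k - 1]
        # First term: existing component l of existing class k.
        for l, n_lk in enumerate(comp_counts, start=1):
            probs[(k, l)] = (n_k / denom) * (n_lk / (beta_k + n_k))
        # Second term: new component inside existing class k.
        probs[(k, len(comp_counts) + 1)] = (n_k / denom) * (beta_k / (beta_k + n_k))
    # Third term: a brand new class with its first component.
    probs[(n_classes + 1, 1)] = alpha / denom
    return probs

# Two classes: class 1 has components of size 3 and 2, class 2 has one of size 4.
probs = prior_label_probs([[3, 2], [4]], alpha=1.0, betas=[0.5, 0.5])
for label, p in sorted(probs.items()):
    print(label, round(p, 4))
```

In a Gibbs sweep these prior weights would still have to be multiplied by the corresponding predictive likelihoods before normalizing and sampling.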
The data is assumed to be normal and a Normal-Inverse-Wishart prior is imposed on the parameters.

$$y_i \mid \zeta_i, \xi_i \sim N(\mu_i, \Sigma_i), \qquad \mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / \kappa_0), \qquad \Sigma_i \sim \text{Inv-Wishart}_{\nu_0}(\Lambda_0^{-1}) \qquad (4)$$

The probability that an observation is assigned to a new class, or to a new component in an existing class, can be obtained from the prior predictive probability with $G_0$ as the prior and $y_i$ as the data.

$$P_{\text{prior}} = \int N(y_i \mid \phi)\, dG_0(\phi) = \pi^{-p/2} \left(\frac{\kappa_0}{1 + \kappa_0}\right)^{p/2} \prod_{i=1}^{p} \frac{\Gamma\left((\nu_n + 1 - i)/2\right)}{\Gamma\left((\nu_0 + 1 - i)/2\right)}\; \frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}} \qquad (5)$$

In the above equations, $p$ is the dimensionality of the data, $\phi = (\mu, \Sigma)$ denotes the parameters drawn from the prior, and $\nu_n$ and $\Lambda_n$ are the parameters of the posterior of $\phi$ based on the prior $G_0$ and the current observation. The probability of the observation being assigned a new class label is given by

$$P(\zeta_i = K + 1) = \frac{\alpha}{n - 1 + \alpha}\, P_{\text{prior}} \qquad (6)$$

The probability of the observation being assigned a new component label within an existing class $k$ is given by

$$P(\zeta_i = k, \xi_i = L_k + 1) = \frac{n_k^{-i}}{n + \alpha - 1}\, \frac{\beta_k}{n_k^{-i} + \beta_k}\, P_{\text{prior}} \qquad (7)$$

The probability that the observation belongs to an existing non-empty component $l$ in class $k$ is given by the posterior predictive of the observations in the component. The posterior predictive probability

($P_{\text{post}}$) is obtained as above, but with the prior parameters replaced by those of the posterior (which results from the prior $G_0$ and the normal likelihood of the observations in the component).

$$P(\zeta_i = k, \xi_i = l) = \frac{n_k^{-i}}{n + \alpha - 1}\, \frac{n_{lk}^{-i}}{n_k^{-i} + \beta_k}\, P_{\text{post}} \qquad (8)$$

The class and component labels for an observation are randomly sampled from the discrete distribution defined by the probabilities computed as above. After the class and component labels are sampled, $\alpha$ and the $\beta_k$ are sampled as described in [3]. The model can be made more flexible by sampling the hyperparameters $\mu_0$, $\kappa_0$ and $\Lambda_0$. (This will be undertaken as future work.)

Algorithm 1 Gibbs sampling for Nested DP
Initialization: assign random labels to the test data
for iteration = 1 to niter do
    {draw ζ, ξ labels for each observation}
    for each observation in test data do
        draw ζ_i, ξ_i as per (6), (7), (8)
    end for
    {draw component labels for all observations}
    for each observation in all data do
        draw ξ_i labels
    end for
    {sample the model parameters}
    sample α
    sample β
end for
derive a point estimate for the ζ labels

3. Variational Approximation for DPM

MCMC methods, including Gibbs sampling, are computationally expensive, and more so for large datasets. Probabilistic sampling methods are thus not well suited to many classification tasks that require faster turnaround times. The variational approach provides a deterministic alternative for dealing with intractable posteriors. Recently there has been significant interest in applying the variational approach to Bayesian problems [8, 7, 2], and variational approximation has been applied to various models including the DPM [2].

The variational approach has its origins in the calculus of variations. A problem is expressed as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions. When the whole solution space is explored, the solution obtained is the exact solution.
But in many situations exploring the full solution space may not be practical. Instead of searching the full space, the space is restricted in a manner that makes computation tractable, and this leads to an approximate solution. In the Bayesian inference problem the restriction typically takes the form of an assumption that the solution is factorizable, i.e. we assume that the posterior distribution (which is the quantity of interest) factorizes into several independent distributions.

Consider a full Bayesian model in which all parameters have priors assigned. Let the model parameters as well as the hidden variables of the model be denoted by $Z$ and the observed variables be denoted by $X$. As in any hidden variable problem (for example EM), the joint distribution of $(X, Z)$ is assumed and the goal is to find an approximation for $P(Z \mid X)$. The log marginal probability can be decomposed as follows.

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$

where the quantities introduced are defined as

$$\mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ, \qquad \mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ$$

The log marginal distribution for $X$ is thus expressed in terms of a variable quantity, $q(Z)$, and the equation holds for any $q(Z)$. When the KL divergence between $q(Z)$ and $p(Z \mid X)$ is zero, the quantity $\mathcal{L}(q)$ equals the log marginal. This occurs when $q(Z)$ equals the posterior distribution $p(Z \mid X)$. For any other $q(Z)$, $\mathcal{L}(q)$ acts as a lower bound on the log marginal, because the KL divergence is non-negative. If it is possible to explore all possible functions $q(Z)$, the exact solution can be found. When this is not possible, an approximation can be sought by finding the optimal value among a restricted set of choices for $q(Z)$. Assuming that $Z$ can be partitioned into $Z_i$, $i = 1, \dots, M$, the variable function $q(Z)$ is assumed to be of the form

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$

Substituting this approximation for $q(Z)$, the form of the solution can be found as follows.

$$\begin{aligned} \mathcal{L}(q) &= \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ \\ &= \int q_j \left\{ \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i \right\} dZ_j - \int q_j \ln q_j \, dZ_j + \text{const} \\ &= \int q_j \ln \tilde{p}(X, Z_j) \, dZ_j - \int q_j \ln q_j \, dZ_j + \text{const} \end{aligned}$$

where $q_j$ is used as a short form for $q_j(Z_j)$ and $\ln \tilde{p}(X, Z_j) = E_{i \neq j}[\ln p(X, Z)] + \text{const}$, the expectation being taken with respect to all factors $q_i$ with $i \neq j$. The last expression above is a negative KL divergence between $q_j(Z_j)$ and $\tilde{p}(X, Z_j)$. $\mathcal{L}(q)$ is therefore maximized when this KL divergence is minimized, which occurs when $q_j(Z_j) = \tilde{p}(X, Z_j)$. The general form of the optimal solution for $q_j(Z_j)$ is given by

$$\ln q_j^{*}(Z_j) = E_{i \neq j}\left[\ln p(X, Z)\right] + \text{const}$$

Since the optimal solution for $q_j(Z_j)$ depends on the other factors, a consistent solution must be found by iterating through the factors and replacing each in turn with its revised estimate. This leads to a coordinate ascent algorithm. It has been shown that convergence is guaranteed in such a setting, as the bound is convex with respect to each factor.

Based on the above, an algorithm for the DPM model can be obtained as below. The stick-breaking constructive model is used, and the hidden parameters of the model are given by $Z = \{V, \eta, W\}$. The $W$ are the hidden variables that indicate which component each data sample belongs to. $W$ is a one-of-K variable, i.e. a multinomially distributed variable.
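The decomposition $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$ used in this section can be checked numerically on a toy model. The sketch below assumes a single discrete hidden variable with three states and made-up joint probabilities; it is only meant to illustrate that the identity holds for any $q$ and that $\mathcal{L}(q)$ is a lower bound.

```python
import math

# Made-up joint probabilities p(X = x, Z = z) for z = 0, 1, 2 at the observed x.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # p(X = x)
posterior = [p / evidence for p in joint]   # p(Z = z | X = x)

def lower_bound(q):
    """L(q) = sum_z q(z) * ln(p(x, z) / q(z))."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl_to_posterior(q):
    """KL(q || p) = -sum_z q(z) * ln(p(z | x) / q(z))."""
    return -sum(qz * math.log(pz / qz) for qz, pz in zip(q, posterior) if qz > 0)

q = [0.5, 0.3, 0.2]  # an arbitrary variational distribution
print("L(q) + KL :", lower_bound(q) + kl_to_posterior(q))
print("ln p(x)   :", math.log(evidence))
print("L(posterior):", lower_bound(posterior))
```

Setting `q` equal to `posterior` drives the KL term to zero, so the bound becomes tight, which is the situation the coordinate ascent updates try to approach within the factorized family.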
Assuming that the component distribution is a multivariate normal, we have $\eta = (\mu, \Lambda)$, where $\mu$ is the mean and $\Lambda$ is the precision matrix of the normal data distribution. The conjugate priors are given by

$$V \sim \mathrm{Beta}(1, \beta_0), \qquad (\mu, \Lambda) \sim \text{Normal-Wishart}(\mu_0, \kappa_0, \nu_0, \Lambda_0), \qquad W \sim \mathrm{Multinomial}(\pi(v))$$

Let the solution $q(Z)$ be factorized as

$$q(Z) = \prod_{t=1}^{T} q_{(\alpha_t, \beta_t)}(v_t)\; \prod_{t=1}^{T} q_{(\mu_t, \kappa_t, \nu_t, \Lambda_t)}(\eta_t)\; \prod_{n=1}^{N} q_{\phi_n}(w_n)$$

with the subscripts on $q(\cdot)$ indicating the parameters of the distributions. We assume there is one indicator variable $W_n$ per data sample, one $\eta_t$ per component, and one beta variable $V_t$ per component. The variational distribution $q(v_t)$ is a beta distribution with parameters $(\alpha_t, \beta_t)$, $q(\eta_t)$ is a Normal-Wishart distribution with parameters $(\mu_t, \kappa_t, \nu_t, \Lambda_t)$, and $q(w_n)$ is a multinomial distribution with parameters $\phi_n$. Treating each $X_n$ as a row vector, define the following quantities:

$$N_t = \sum_n \phi_{nt}, \quad \text{with } \phi_{nt} = E[W_n = t]$$

$$\bar{x}_t = \frac{\sum_n \phi_{nt} X_n}{N_t}$$

$$S_t = \frac{\sum_n \phi_{nt} (X_n - \bar{x}_t)^T (X_n - \bar{x}_t)}{N_t}$$

The updates are as follows.

$$\alpha_t^{\text{new}} = 1 + N_t$$

$$\beta_t^{\text{new}} = \beta_0 + \sum_n \sum_{j=t+1}^{T} \phi_{nj}$$

$$\kappa_t^{\text{new}} = \kappa_0 + N_t$$

$$\mu_t^{\text{new}} = \frac{\sum_n \phi_{nt} X_n + \mu_0 \kappa_0}{N_t + \kappa_0}$$

$$\nu_t^{\text{new}} = \nu_0 + N_t$$

$$(\Lambda_t^{\text{new}})^{-1} = \Lambda_0^{-1} + N_t S_t + \frac{\kappa_0 N_t}{\kappa_0 + N_t} (\bar{x}_t - \mu_0)^T (\bar{x}_t - \mu_0)$$

$$\phi_{nt}^{\text{new}} \propto \exp\Big( \tfrac{1}{2} E[\ln |\Lambda_t|] - \frac{D}{2\kappa_t} - \frac{\nu_t}{2} (X_n - \mu_t) \Lambda_t (X_n - \mu_t)^T + \Psi(\alpha_t) - \Psi(\alpha_t + \beta_t) + \sum_{j=1}^{t-1} \big[ \Psi(\beta_j) - \Psi(\alpha_j + \beta_j) \big] \Big)$$

Here $D$ is the dimensionality of the data, $\Psi$ is the digamma function, and the $\phi_{nt}$ are normalized so that they sum to one over $t$.

Algorithm 2 Variational approximation for DPM
Initialization: initialize the parameters of the variational distributions, i.e. initialize α_t, β_t, µ_t, κ_t, ν_t, Λ_t for t = 1,...,T and φ_n for n = 1,...,N
while convergence is not reached do
    update (µ_t, κ_t, ν_t, Λ_t)
    update (α_t, β_t)
    update φ_n
end while

4. Experiments

The experiments were conducted using a collection of real data sets obtained from the UCI machine learning data repository. Only data sets without missing values and with numerical attributes were used.

Datasets

The Glass dataset consists of 214 instances of multi-attribute data describing various samples of glass. The number of attributes in the data is 10. Each attribute is a continuous numerical value that indicates the amount of a chemical element in the glass instance. The first attribute is an id and can be ignored in classification tasks. The target attribute is categorical and can take 7 different values indicating various types of glass. Thus it is a 7-class classification problem.

The Iris dataset is a well-known dataset containing three classes of iris plant, with each class containing 50 instances. One of the classes is linearly separable from the others and is hence easily classifiable. The other two classes are not linearly separable from each other and hence present some challenge for the classifier. The attributes are real values that describe various characteristics of the iris plant. This is considered a very simple dataset and is used here as a basic test case.
The Balance data set contains a set of measurements of weights placed on a balance, and the target class indicates whether the scale is balanced or not. The attributes are integer values that specify the left and right weights and the distances at which they are placed from the center. This is an example of the output being the result of a simple mathematical equation: the correct result can easily be determined from the relation between (left weight * left distance) and (right weight * right distance). However, this is a nonlinear relationship between the attributes and may make the data difficult to learn.

The datasets were used to compare the performance of SVM, J48 (decision trees), MDA (mixture discriminant analysis package)

and NDP. The results are shown below as percentage error averaged over a couple of runs.

dataset/classifier   MDA   J48   SVM   NDP
Glass
Iris
Balance

As expected, Iris was the easiest dataset to classify. Glass is hard for all the classifiers and they all show similar performance. The interesting result with the Balance dataset is the very bad performance shown by NDP. It is not clear whether this is due to an incorrect implementation or whether the model assumptions do not match the data. But since MDA makes similar assumptions about the data (that the data is composed of a mixture of Gaussians), I am inclined to think that the implementation of NDP was not correct. The next section shows that the NDP/DPM classifier has a problem with datasets whose components have small separation. Future work will investigate this problem.

Figure 1. Variational approximation of Gaussian mixtures with various component separations.

Comparing variational approximation with DPM

The variational approximation algorithm was implemented in the R software package. However, only the univariate case works correctly and the multivariate code is still a work in progress. (The code has an unknown defect, and the final approximation degenerates to a single Gaussian irrespective of the data distribution.) The goodness of the approximation was verified using visual plots. The data was generated using an equiprobable mixture of normals with means separated by a variable gap and with a small variance for each component: the n-th component has mean n*gap and variance n. The visual comparison between the original distribution and the predictive distributions of the DPM and variational approximations is shown in the figures. The original distribution is shown in green, the variational distribution in red, and the DPM-based estimate in blue.

Figure 2. Variational approximation of Gaussian mixtures with various component separations.

The actual data points are
shown at the bottom of the plot twice, with colors indicating the clusters detected by the DPM and variational algorithms. The upper row is for the variational clusters and the lower row for the DPM clusters. In Figure 1, with gap=5, it can be seen that the DPM predictive distribution does not match the

actual distribution that well. The blue curve misses two components in the data (it has only two modes). The data points for the higher components (points to the right) are a bit closer to each other and DPM considers them to be in a single component. The clustering shown by the lower colored markings gives the DPM prediction for each data point. There are only two colors, implying that only two components are detected by DPM. The variational distribution detects all the components, though its estimates of the means and variances of the components are a bit off. A similar thing can be observed for gap [...]. From Figure 3 onwards, DPM does a much better job of estimating the data distribution. The variational approximation seems to do well, but it shows a bad approximation for both gap=20 and gap=40.

This brings up one of the issues with the variational algorithm: it has a tendency to get stuck in local maxima. Blei [2] warns of this issue and suggests that the initialization of the variational distribution parameters requires some care to avoid local maxima. Their recommendation is to initialize the variational distribution by incrementally updating the parameters according to a random permutation of the data points. Even then, multiple runs must be done and the parameter settings that give the best bound selected. This randomization step was not performed in these experiments, and that explains these odd fits.

Figure 3. Variational approximation of Gaussian mixtures with various component separations.

Figure 4. Variational approximation of Gaussian mixtures with various component separations.

Figure 5. Variational approximation of Gaussian mixtures with various component separations.
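The synthetic data behind these comparisons (an equiprobable mixture of normals in which the n-th component has mean n*gap and variance n) could be generated along the following lines. This is a sketch in Python rather than the R code used for the experiments; the number of components (4), the sample size and the function name `sample_mixture` are assumptions made for illustration.

```python
import random

def sample_mixture(n_points, n_components, gap, rng):
    """Sample from an equiprobable normal mixture whose n-th component
    has mean n * gap and variance n (n = 1, ..., n_components)."""
    data, labels = [], []
    for _ in range(n_points):
        n = rng.randint(1, n_components)            # equiprobable component choice
        data.append(rng.gauss(n * gap, n ** 0.5))   # variance n, so std is sqrt(n)
        labels.append(n)
    return data, labels

rng = random.Random(42)
# Assumed settings: 4 components, 200 points; gap values 5, 20 and 40 appear in the text.
for gap in (5, 20, 40):
    data, labels = sample_mixture(200, n_components=4, gap=gap, rng=rng)
    print("gap", gap, "range", round(min(data), 2), "to", round(max(data), 2))
```

With a small gap the component densities overlap, which is the regime where the fits discussed above were poorest.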

The Dirichlet models also show strange behavior when the component separations are small. As the gap increases, the fit becomes much better and closely follows the original distribution. As indicated in the previous section, this could be an implementation issue rather than a model restriction and must be investigated further. While this observation concerns the DPM model, it also applies to the NDP model, as they share the implementation.

5. Conclusions and Future work

This report described a comparison of the NDP model with other classifiers and a preliminary analysis of the effectiveness of the variational approximation algorithm. The learning with respect to the DPM and NDP models was that there are either hidden implementation defects or model assumptions that cause them to classify certain datasets poorly. It was observed that whenever the components of the data are close (as was the case with the balance-scale dataset), NDP and DPM are unable to distinguish the components, causing misclassifications. This is unlikely to be a restriction of the model, as the MDA algorithm has a very similar model and yet yields better results. DPM models should be able to perform as well as or better than MDA models. The anomaly observed is likely to be an implementation issue and will be investigated further. (I have already spent some time on this issue, with no result yet.)

The task of implementing the variational approximation involved the study of various methods and then deriving the mathematical model for a specific setting that assumes a normal data distribution. Implementing the variational approximation for a DPM model turned out to be a lot harder than it looked. While it is a simple coordinate ascent algorithm, the mathematics involved was a bit tedious and the algorithm is sensitive to the choice of the priors. There is also always a chance that the variational distribution gets stuck in a local maximum. All these factors made the work very hard.
The multivariate approximation was implemented, but does not yield appropriate results due to possible implementation errors or modeling errors. Future work will be to correct this issue. Overall it was observed that the algorithm converges faster and can also yield good results when it converges to the maximum. The variational approximations seem to work well when they converge, but the algorithm still needs to be improved so that it always converges to a global maximum.

References

[1] C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2.
[2] D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1).
[3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90.
[4] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1.
[5] A.E. Gelfand. Gibbs sampling. Journal of the American Statistical Association, 95(452).
[6] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109.
[7] T.S. Jaakkola and M.I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25-37.
[8] M.I. Jordan. Learning in graphical models. Kluwer Academic Publishers.

[9] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, et al. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087.
[10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2).
[11] A. Rodriguez, D.B. Dunson, and A.E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, submitted.
[12] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4.
[13] A.F.M. Smith and G.O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B (Methodological), 55(1):3-23.


More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu Plan for today Why Bayesian Nonparametrics? Dirichlet Distribution and Dirichlet Processes. 2 Parameter and Patterns Reference:

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Ian Porteous, Alex Ihler, Padhraic Smyth, Max Welling Department of Computer Science UC Irvine, Irvine CA 92697-3425

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Part IV: Monte Carlo and nonparametric Bayes

Part IV: Monte Carlo and nonparametric Bayes Part IV: Monte Carlo and nonparametric Bayes Outline Monte Carlo methods Nonparametric Bayesian models Outline Monte Carlo methods Nonparametric Bayesian models The Monte Carlo principle The expectation

More information

Hyperparameter estimation in Dirichlet process mixture models

Hyperparameter estimation in Dirichlet process mixture models Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of Discussion Functionals Calculus of Variations Maximizing a Functional Finding Approximation to a Posterior Minimizing K-L divergence Factorized

More information

A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process

A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We give a simple

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Bayesian Nonparametric Models

Bayesian Nonparametric Models Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior

More information

Bayesian Mixtures of Bernoulli Distributions

Bayesian Mixtures of Bernoulli Distributions Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions

More information

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017 Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Bayesian Nonparametrics: Models Based on the Dirichlet Process Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro

More information

Infinite Mixtures of Trees

Infinite Mixtures of Trees Sergey Kirshner sergey@cs.ualberta.ca AICML, Department of Computing Science, University of Alberta, Edmonton, Canada 6G 2E8 Padhraic Smyth Department of Computer Science, University of California, Irvine

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

An Alternative Infinite Mixture Of Gaussian Process Experts

An Alternative Infinite Mixture Of Gaussian Process Experts An Alternative Infinite Mixture Of Gaussian Process Experts Edward Meeds and Simon Osindero Department of Computer Science University of Toronto Toronto, M5S 3G4 {ewm,osindero}@cs.toronto.edu Abstract

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Latent Variable Models

Latent Variable Models Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:

More information

Variational Autoencoders

Variational Autoencoders Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Supplementary Notes: Segment Parameter Labelling in MCMC Change Detection

Supplementary Notes: Segment Parameter Labelling in MCMC Change Detection Supplementary Notes: Segment Parameter Labelling in MCMC Change Detection Alireza Ahrabian 1 arxiv:1901.0545v1 [eess.sp] 16 Jan 019 Abstract This work addresses the problem of segmentation in time series

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Nonparametric Bayesian modeling for dynamic ordinal regression relationships

Nonparametric Bayesian modeling for dynamic ordinal regression relationships Nonparametric Bayesian modeling for dynamic ordinal regression relationships Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz Joint work with Maria

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo (and Bayesian Mixture Models) David M. Blei Columbia University October 14, 2014 We have discussed probabilistic modeling, and have seen how the posterior distribution is the critical

More information

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016 Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Variational inference for Dirichlet process mixtures

Variational inference for Dirichlet process mixtures Variational inference for Dirichlet process mixtures David M. Blei School of Computer Science Carnegie-Mellon University blei@cs.cmu.edu Michael I. Jordan Computer Science Division and Department of Statistics

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Lecturer: David Blei Lecture #3 Scribes: Jordan Boyd-Graber and Francisco Pereira October 1, 2007

Lecturer: David Blei Lecture #3 Scribes: Jordan Boyd-Graber and Francisco Pereira October 1, 2007 COS 597C: Bayesian Nonparametrics Lecturer: David Blei Lecture # Scribes: Jordan Boyd-Graber and Francisco Pereira October, 7 Gibbs Sampling with a DP First, let s recapitulate the model that we re using.

More information

Collapsed Variational Dirichlet Process Mixture Models

Collapsed Variational Dirichlet Process Mixture Models Collapsed Variational Dirichlet Process Mixture Models Kenichi Kurihara Dept. of Computer Science Tokyo Institute of Technology, Japan kurihara@mi.cs.titech.ac.jp Max Welling Dept. of Computer Science

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Slice Sampling Mixture Models

Slice Sampling Mixture Models Slice Sampling Mixture Models Maria Kalli, Jim E. Griffin & Stephen G. Walker Centre for Health Services Studies, University of Kent Institute of Mathematics, Statistics & Actuarial Science, University

More information

Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Approximate Inference using MCMC

Approximate Inference using MCMC Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

More information

Variational inference

Variational inference Simon Leglaive Télécom ParisTech, CNRS LTCI, Université Paris Saclay November 18, 2016, Télécom ParisTech, Paris, France. Outline Introduction Probabilistic model Problem Log-likelihood decomposition EM

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Quantitative Biology II Lecture 4: Variational Methods

Quantitative Biology II Lecture 4: Variational Methods 10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

More information

Hierarchical Dirichlet Processes with Random Effects

Hierarchical Dirichlet Processes with Random Effects Hierarchical Dirichlet Processes with Random Effects Seyoung Kim Department of Computer Science University of California, Irvine Irvine, CA 92697-34 sykim@ics.uci.edu Padhraic Smyth Department of Computer

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017

39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017 Permuted and IROM Department, McCombs School of Business The University of Texas at Austin 39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017 1 / 36 Joint work

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

Multi-Task Learning for Classification with Dirichlet Process Priors

Multi-Task Learning for Classification with Dirichlet Process Priors Journal of Machine Learning Research 8 (2007) 35-63 Submitted 4/06; Revised 9/06; Published 1/07 Multi-Task Learning for Classification with Dirichlet Process Priors Ya Xue Xuejun Liao Lawrence Carin Department

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

Topic Modelling and Latent Dirichlet Allocation

Topic Modelling and Latent Dirichlet Allocation Topic Modelling and Latent Dirichlet Allocation Stephen Clark (with thanks to Mark Gales for some of the slides) Lent 2013 Machine Learning for Language Processing: Lecture 7 MPhil in Advanced Computer

More information

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

More information

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection

More information

Machine Learning. Probabilistic KNN.

Machine Learning. Probabilistic KNN. Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007

More information

Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes

More information

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,

More information