CMPS 242: Project Report
RadhaKrishna Vuppala
Univ. of California, Santa Cruz

Abstract

Classification procedures impose certain models on the data, and they perform well when those assumptions match the data. The assumption may be, for example, that the data is linearly separable or that it comes from some parametric family of distributions. Parametric models can be made flexible enough to handle various types of data by using mixture models instead of assuming a single distribution. However, even parametric mixture models assume that the data contains a finite number of components. Bayesian nonparametric mixtures based on Dirichlet process mixtures (DPM) are a way of modeling infinite mixtures, and are more flexible in modeling both the form and the number of distributions that could have generated the data. This paper explores the strengths and weaknesses of these models, and also explores the variational approximation for the DPM model. Variational approximations are deterministic approximation techniques that converge quickly and can replace costly MCMC computations in situations where an exact answer is not required.

1. Introduction

Bayesian nonparametrics received a strong push when the Dirichlet process was presented by Ferguson [4]. The Dirichlet process is a measure on measures, i.e. when a sample is drawn from the Dirichlet process, a distribution is obtained. Describing a process with a large support and an analytically tractable posterior was a great step forward for Bayesian analysis.

$$G \mid \{G_0, \alpha\} \sim \mathrm{DP}(\alpha, G_0)$$

$$\eta_n \mid G \sim G, \quad n \in \{1, \ldots, N\}$$

If $G$ is a random probability measure drawn from a Dirichlet process, and $\eta_n$ are samples drawn from $G$, then it can be shown, by integrating out $G$, that the joint distribution of $\{\eta_1, \ldots, \eta_n\}$ follows a Pólya urn scheme, as demonstrated by Blackwell and MacQueen. The probability that a new sample $\eta_n$ is equal to one of the existing samples is given as follows.
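$$P(\eta_n = \eta_j) = \begin{cases} \dfrac{N_j}{N + \alpha - 1} & j = 1, \ldots, n-1 \\[6pt] \dfrac{\alpha}{N + \alpha - 1} & j = n \end{cases}$$

where $N_j = |\{k : \eta_k = \eta_j\}|$ is the number of existing samples with value $\eta_j$, and the case $j = n$ corresponds to a new value drawn from $G_0$. Here $N$ counts all samples including the new one, so the counts $N_j$ over the distinct existing values sum to $N - 1$ and the probabilities sum to one.

This indicates that the samples show a clustering behavior. A new sample takes the value of one of the existing samples (say $\eta_j$) with probability proportional to the number of observations in that cluster (i.e. $N_j$). There is also a nonzero probability that the sample starts a new cluster. The parameter $\alpha$ thus plays a critical role in determining how probable a new cluster is.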
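The urn probabilities above can be simulated directly. The following is a minimal Python sketch, not the code used in this report: the function name `polya_urn_sample`, the standard normal base distribution standing in for $G_0$, and the seeds are all assumptions made for illustration. With `i` samples already drawn, a fresh value is drawn with probability $\alpha/(i+\alpha)$; otherwise one of the earlier samples is copied uniformly at random, which reproduces a cluster's value with probability proportional to its size $N_j$.

```python
import random


def polya_urn_sample(n, alpha, base_draw, rng):
    """Draw eta_1..eta_n sequentially from the Polya urn scheme."""
    samples = []
    for i in range(n):
        # i samples already exist; a new value has probability alpha / (i + alpha).
        if rng.random() < alpha / (i + alpha):
            samples.append(base_draw(rng))
        else:
            # Copying a uniformly chosen earlier sample picks value eta_j
            # with probability N_j / i, i.e. N_j / (i + alpha) overall.
            samples.append(rng.choice(samples))
    return samples


def standard_normal(rng):
    """Hypothetical base distribution G_0 (an assumption for this sketch)."""
    return rng.gauss(0.0, 1.0)


samples = polya_urn_sample(200, 1.0, standard_normal, random.Random(0))
n_clusters = len(set(samples))
print("distinct values among 200 samples:", n_clusters)
```

Rerunning with a larger `alpha` tends to produce more distinct values, which is the role of $\alpha$ described above.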
A powerful constructive characterization of the Dirichlet process was given by Sethuraman [12] and is known as the stick-breaking construction. Considering two infinite collections of independent random variables $V_i \sim \mathrm{Beta}(1, \alpha)$ and $\eta_i \sim G_0$, for $i = 1, 2, \ldots$, the stick-breaking representation for $G$, a sample from the DP, is given by

$$\pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j) \qquad (1)$$

$$G = \sum_{i=1}^{\infty} \pi_i(v)\, \delta_{\eta_i} \qquad (2)$$

Thus a draw from the Dirichlet process is shown to be a discrete distribution with infinite support, with each atom being drawn from $G_0$. Since the DP has infinite support, it can be used as a prior for nonparametric data models. The Dirichlet Process Mixture model (DPM) [1] is a nonparametric data model which uses the DP in a hierarchical specification as follows.

$$y_i \mid \theta_i \sim F(\theta_i)$$

$$\theta_i \mid G \sim G$$

$$G \sim \mathrm{DP}(\alpha, G_0)$$

The data $y_1, y_2, \ldots, y_n$ is assumed to be i.i.d. data obtained from some distribution $F(\cdot)$. The parameters of the distribution are samples from $G$, which is itself drawn from the DP. Since $G$ is a discrete mixing distribution with infinitely many atoms, the data $y$ also forms an infinite mixture model. The parameters follow a Pólya urn scheme, and hence the data shows a clustering behavior. The data from any single cluster is modeled with a simple distribution, the normal distribution usually being preferred, and the mixture proportions $\pi_i(v)$ indicate how likely it is that the data came from a particular cluster. This model is more flexible than a single parametric distribution in that it can model data from multiple distributions. Another interesting observation is that the number of clusters is not a parameter in this model: the inference algorithm is able to deduce the number of components. Traditional clustering algorithms (such as K-means) need to be provided with the number of components in the mixture.

As is common in Bayesian methods, the posteriors resulting from these models are intractable and no closed-form solutions can be found. While this troubled the Bayesian community for a while, the introduction of Markov chain Monte Carlo sampling methods [9, 6, 13, 5] provided a necessary breakthrough for Bayesian learning.
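Returning to the stick-breaking construction in equations (1) and (2), the weights $\pi_i(v)$ can be generated with a few lines of Python. This is a hedged sketch under stated assumptions: the infinite sum is truncated at a hypothetical level `truncation`, and the function name `stick_breaking_weights` is introduced here for illustration only. The function also returns the leftover stick mass so that the effect of truncation is visible.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Return the first `truncation` weights pi_i and the unassigned remainder."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)  # V_i ~ Beta(1, alpha)
        weights.append(v * remaining)    # pi_i = v_i * prod_{j<i} (1 - v_j)
        remaining *= 1.0 - v
    return weights, remaining


weights, remaining = stick_breaking_weights(alpha=2.0, truncation=50,
                                            rng=random.Random(0))
print("mass covered by 50 atoms:", sum(weights))
```

Pairing each weight with an atom drawn from $G_0$ gives a truncated approximation to $G$; smaller values of `alpha` concentrate the mass on the first few atoms.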
Markov chain Monte Carlo (MCMC) methods, which are based on probabilistic sampling and are computationally expensive, are typically used to estimate the intractable posterior distributions. Gibbs sampling, which is a special case of the Metropolis-Hastings algorithm applicable when conjugate priors are used, is a very popular method for estimation. Gibbs sampling algorithms for the DPM model were developed by Neal [10].

An interesting extension of this model is the Nested Dirichlet Process model (NDP) [11], which models hierarchical relationships in the data. Here the data in a class is assumed to come from a mixture of distributions, so the model can capture more complex relationships in the data.

$$y_i \sim \sum_{l=1}^{\infty} w_{il}\, h(\cdot \mid \phi_{il})$$

$$\left(\{w_{il}\}_{l=1}^{\infty}, \{\phi_{il}\}_{l=1}^{\infty}\right) \sim \bar{G}$$

$$\bar{G} \sim \mathrm{DP}(\alpha \bar{G}_0)$$

The rest of the report is organized as follows. Section two discusses an algorithm for estimating the Nested Dirichlet Process model. As part of the project, the variational approximation method was studied in some detail; the technique is discussed in section three and applied to the specific case of the DPM model. Section four experimentally compares an MCMC implementation of the NDP model with other well-known classifiers, and compares the variational approximation for the DPM model with the Gibbs implementation. Section five describes
some of the key learnings from the work and also describes some future work.

2. Gibbs sampling algorithms

This section describes a Pólya urn Gibbs sampler for the NDP model. In this case we introduce two sets of labels: the class labels $\zeta_1, \ldots, \zeta_n$, indicating to which class each observation belongs, and the component labels $\xi_1, \ldots, \xi_n$, indicating to which component the observation is assigned, i.e., $\zeta_i = k$ and $\xi_i = l$ if and only if $\theta_i = \phi_{lk}$ (and therefore $G_i = G_k$). Without loss of generality, we assume that classes are labeled contiguously between 1 and $K$, and that the mixture components corresponding to the $k$-th distribution are also labeled contiguously between 1 and $L_k$. We use the superscript $-i$ throughout to denote quantities computed after removing observation $i$. Note that the process generating the sequence of $G_i$ is just a Dirichlet process (in a general Polish space), and therefore we can express the full conditional distribution $\zeta_i \mid \zeta^{-i}$ as a Pólya urn. Similarly, all observations assigned to the same class arise from another Dirichlet process. Combining these two observations, we get the joint full conditional distribution

$$(\zeta_i, \xi_i) \mid \zeta^{-i}, \xi^{-i} \sim \sum_{k=1}^{K^{-i}} \sum_{l=1}^{L_k^{-i}} \frac{n_k^{-i}}{\alpha + n - 1} \, \frac{n_{lk}^{-i}}{\beta_k + n_k^{-i}} \, \delta_{(k,l)} + \sum_{k=1}^{K^{-i}} \frac{n_k^{-i}}{\alpha + n - 1} \, \frac{\beta_k}{\beta_k + n_k^{-i}} \, \delta_{(k, L_k^{-i}+1)} + \frac{\alpha}{\alpha + n - 1} \, \delta_{(K^{-i}+1, 1)} \qquad (3)$$

where $n_{lk}^{-i}$ is the number of observations in class $k$ assigned to component $l$, and $n_k^{-i} = \sum_{l=1}^{L_k^{-i}} n_{lk}^{-i}$ is the total number of observations in class $k$. Note that the first term in (3) provides the prior full conditional probability that observation $i$ is assigned to one of the existing components in class $k$, the second term corresponds to the prior probability that a new component is created in class $k$ to accommodate observation $i$, and the third term is the prior probability that it is assigned to a new class. This result can be used to construct a Gibbs sampler to estimate the model.
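The three groups of prior weights in (3) can be tabulated from the label counts alone. The sketch below is illustrative only and not the sampler used in this report: the function name `nested_urn_prior`, the list-of-lists layout for the counts, and the example values are assumptions, and labels are 0-indexed in the code, so the key `(k, L_k)` stands for a new component in class `k` and `(K, 0)` for a new class.

```python
def nested_urn_prior(counts, alpha, beta):
    """Prior weights of equation (3).

    counts[k][l] is n_lk with observation i removed; beta[k] is the
    concentration parameter of class k.
    """
    n_minus_one = sum(sum(class_counts) for class_counts in counts)
    denom = alpha + n_minus_one
    probs = {}
    for k, class_counts in enumerate(counts):
        n_k = sum(class_counts)
        for l, n_lk in enumerate(class_counts):
            # First term: existing component l of existing class k.
            probs[(k, l)] = (n_k / denom) * (n_lk / (beta[k] + n_k))
        # Second term: new component inside existing class k.
        probs[(k, len(class_counts))] = (n_k / denom) * (beta[k] / (beta[k] + n_k))
    # Third term: first component of a brand-new class.
    probs[(len(counts), 0)] = alpha / denom
    return probs


probs = nested_urn_prior(counts=[[3, 2], [4]], alpha=1.0, beta=[0.5, 0.5])
print("total prior mass:", sum(probs.values()))
```

In a full Gibbs step each of these prior weights would still be multiplied by the corresponding predictive density of the observation before normalizing.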
The data is assumed to be normal and a Normal-Inverse-Wishart prior is imposed on the parameters.

$$y_i \mid \zeta_i, \xi_i \sim N(\mu_i, \Sigma_i)$$

$$\mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / \kappa_0) \qquad (4)$$

$$\Sigma_i \sim \text{Inv-Wishart}_{\nu_0}(\Lambda_0^{-1})$$

The probability that an observation is assigned to a new class, or to a new component in an existing class, can be obtained from the prior predictive probability with $G_0$ as the prior and $y_i$ as the data.

$$P_{prior} = \int N(y_i \mid \phi)\, dG_0(\phi) = \pi^{-p/2} \left(\frac{\kappa_0}{1 + \kappa_0}\right)^{p/2} \prod_{j=1}^{p} \frac{\Gamma\left((\nu_n + 1 - j)/2\right)}{\Gamma\left((\nu_0 + 1 - j)/2\right)} \, \frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}} \qquad (5)$$

In the above equations, $p$ is the dimensionality of the data, $\phi$ denotes the component parameters (i.e. $\phi = (\mu, \Sigma)$), and $\nu_n$ and $\Lambda_n$ are the parameters of the posterior of $\phi$ based on the prior $G_0$ and the current observation. The probability of an observation being assigned a new class label is given by

$$P(\zeta_i = K + 1) = \frac{\alpha}{n - 1 + \alpha} \, P_{prior} \qquad (6)$$

The probability of an observation being assigned a new component label within an existing class $k$ is given by

$$P(\zeta_i = k, \xi_i = L_k + 1) = \frac{n_k^{-i}}{n + \alpha - 1} \, \frac{\beta_k}{n_k^{-i} + \beta_k} \, P_{prior} \qquad (7)$$

The probability that an observation belongs to an existing non-empty component $l$ in class $k$ is given by the posterior predictive of the observations in the component. The posterior predictive probability
($P_{post}$) is obtained as above, but by replacing the prior parameters with those of the posterior (which results from the prior $G_0$ and the normal likelihood of the observations in the component).

$$P(\zeta_i = k, \xi_i = l) = \frac{n_k^{-i}}{n + \alpha - 1} \, \frac{n_{lk}^{-i}}{n_k^{-i} + \beta_k} \, P_{post} \qquad (8)$$

The class and component labels for an observation are randomly sampled from the discrete distribution defined by the probabilities computed above. After the class and component labels are sampled, $\alpha$ and the $\beta_k$ are sampled as described in [3]. The model can be made more flexible by sampling the hyperparameters $\mu_0$, $\kappa_0$ and $\Lambda_0$. (This will be undertaken as future work.)

Algorithm 1: Gibbs sampling for Nested DP
    Initialization: assign random labels to the test data
    for iteration = 1 to niter do
        {draw $\zeta$, $\xi$ labels for each observation}
        for each observation in test data do
            draw $\zeta_i$, $\xi_i$ as per (6), (7), (8)
        end for
        {draw component labels for all observations}
        for each observation in all data do
            draw $\xi_i$ labels
        end for
        {sample the model parameters}
        sample $\alpha$
        sample $\beta$
    end for
    derive a point estimate for the $\zeta$ labels

3. Variational Approximation for DPM

MCMC methods, including Gibbs sampling, are computationally expensive, and more so for large datasets. Probabilistic sampling methods are thus not well suited for many classification tasks that require fast turnaround times. The variational approach provides a deterministic alternative for dealing with intractable posteriors. Recently there has been significant interest in applying the variational approach to Bayesian problems [8, 7, 2]. Variational approximation has been applied to various models, including the DPM [2].

The variational approach has its origins in the calculus of variations. A problem is expressed as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions. When the whole solution space is explored, the solution obtained is the exact solution.
But in many situations exploring the full solution space may not be practical. Instead of searching the full space, the space is restricted in a manner that makes computation tractable, and this leads to an approximate solution. In the Bayesian inference problem the typical restriction takes the form of an assumption that the solution is factorizable, i.e. we assume that the posterior distribution (which is the quantity of interest) factorizes into many independent distributions.

Consider a full Bayesian model in which all parameters have priors assigned. Let the model parameters as well as the hidden variables of the model be denoted by $Z$, and the observed variables be denoted by $X$. As in any hidden variable problem (for example EM), the joint distribution of $(X, Z)$ is assumed and the goal is to find an approximation for $p(Z \mid X)$. The log marginal probability can be decomposed as follows.

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$

where the quantities introduced are defined as

$$\mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$$

$$\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ$$

The log marginal distribution for $X$ is thus expressed in terms of a variable quantity, $q(Z)$. The equation holds for any $q(Z)$. When the KL divergence between $q(Z)$ and $p(Z \mid X)$ is zero, the quantity $\mathcal{L}(q)$ equals the log marginal. This occurs when $q(Z)$ equals the posterior distribution $p(Z \mid X)$. For any other $q(Z)$, $\mathcal{L}(q)$ acts as a lower bound for the log marginal, because the KL divergence is non-negative. If it is possible to explore all possible functions $q(Z)$, the exact solution can be found. When this is not possible, an approximation can be sought by finding the optimal value among a restricted set of choices for $q(Z)$. Assuming that $Z$ can be partitioned into $Z_i$, $i = 1, \ldots, M$, the variable function $q(Z)$ is assumed to be of the form

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$

Substituting this approximation for $q(Z)$, the form of the solution can be found as follows.

$$\mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$$

$$= \int q_j \left\{ \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i \right\} dZ_j - \int q_j \ln q_j \, dZ_j + \text{const}$$

$$= \int q_j \ln \tilde{p}(X, Z_j) \, dZ_j - \int q_j \ln q_j \, dZ_j + \text{const}$$

where $q_j$ is used as a short form for $q_j(Z_j)$ and $\ln \tilde{p}(X, Z_j) = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i$. It can be seen that the last equation above is a negative KL divergence between $q_j(Z_j)$ and $\tilde{p}(X, Z_j)$. $\mathcal{L}(q)$ is maximized when this KL divergence is minimized, and this occurs when $q_j(Z_j) = \tilde{p}(X, Z_j)$. The general form for an optimum solution for $q_j(Z_j)$ is given by

$$\ln q_j^{*}(Z_j) = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i + \text{const} = E_{i \neq j}\left[\ln p(X, Z)\right] + \text{const}$$

where the constant normalizes $q_j^{*}$. Since the optimal solution for $q_j(Z_j)$ depends on the other factors, a consistent solution must be found by iterating through the factors and replacing each factor with its revised estimate. This leads to a coordinate ascent algorithm. It has been shown that convergence is guaranteed in such a setting, as the bound is convex with respect to each factor.

Based on the above, an algorithm for the DPM model can be obtained as below. The model is defined through the stick-breaking construction. The hidden parameters for the model are given by $Z = \{V, \eta, W\}$. The $W$ are the hidden variables that indicate to which component each data sample belongs. $W$ is a one-of-$K$ variable, i.e. a multinomially distributed variable.
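The decomposition $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$ used in this section can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not part of the report's implementation: the hidden variable takes only three values so the integrals become sums, and the joint probabilities, the choice of `q`, and the function name `elbo_and_kl` are made up for the example.

```python
import math


def elbo_and_kl(joint, q):
    """Return (L(q), KL(q || p(Z|X)), ln p(X)) for a discrete hidden variable.

    joint[z] is p(X, Z=z) for the fixed observed X; q[z] is the
    variational distribution over Z.
    """
    evidence = sum(joint)                        # p(X) = sum_z p(X, z)
    posterior = [p / evidence for p in joint]    # p(z | X)
    elbo = sum(qz * math.log(pz / qz)
               for qz, pz in zip(q, joint) if qz > 0)
    kl = sum(qz * math.log(qz / post)
             for qz, post in zip(q, posterior) if qz > 0)
    return elbo, kl, math.log(evidence)


joint = [0.10, 0.25, 0.05]   # hypothetical values of p(X, Z=z)
q = [0.2, 0.5, 0.3]          # an arbitrary variational distribution
elbo, kl, log_evidence = elbo_and_kl(joint, q)
print(elbo + kl, log_evidence)   # the two numbers agree up to rounding

# Setting q to the exact posterior drives the KL term to zero.
exact_q = [p / sum(joint) for p in joint]
elbo_exact, kl_exact, _ = elbo_and_kl(joint, exact_q)
```

Because the KL term is never negative, any other `q` gives a value of `elbo` below `log_evidence`, which is the lower-bound property that the coordinate ascent updates exploit.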
Assuming that the component distribution is a multivariate normal, we have $\eta = (\mu, \Lambda)$, where $\mu$ is the mean and $\Lambda$ is the precision matrix of the normal data distribution. The conjugate priors are given by

$$V_t \sim \mathrm{Beta}(1, \beta_0)$$

$$(\mu_t, \Lambda_t) \sim \text{Normal-Wishart}(\mu_0, \kappa_0, \nu_0, \Lambda_0)$$

$$W_n \sim \mathrm{Multinomial}(\pi(V))$$

Let the solution $q(Z)$ be factorized as

$$q(Z) = \prod_{t=1}^{T} q_{(\alpha_t, \beta_t)}(V_t) \; \prod_{t=1}^{T} q_{(\mu_t, \kappa_t, \nu_t, \Lambda_t)}(\eta_t) \; \prod_{n=1}^{N} q_{\phi_n}(W_n)$$

with the subscripts on $q(\cdot)$ indicating the parameters of the distributions. We assume there is one indicator variable $W_n$ per data sample, one $\eta_t$ per component, and one beta variable $V_t$ per component. The variational distribution $q(V_t)$ is a beta distribution with parameters $(\alpha_t, \beta_t)$. The variational distribution $q(\eta_t)$ is a normal-Wishart distribution with parameters $(\mu_t, \kappa_t, \nu_t, \Lambda_t)$, and $q(W_n)$ is a multinomial distribution with $\phi_n$ as parameters. We define the following quantities, treating each $X_n$ as a row vector:

$$N_t = \sum_n \phi_{nt}, \quad \text{with } \phi_{nt} = E[W_n = t]$$

$$\bar{x}_t = \frac{\sum_n \phi_{nt} X_n}{N_t}$$

$$S_t = \frac{\sum_n \phi_{nt} (X_n - \bar{x}_t)^T (X_n - \bar{x}_t)}{N_t}$$
The updates are as follows.

$$\alpha_t^{new} = 1 + N_t$$

$$\beta_t^{new} = \beta_0 + \sum_n \sum_{j=t+1}^{T} \phi_{nj}$$

$$\kappa_t^{new} = \kappa_0 + N_t$$

$$\mu_t^{new} = \frac{\sum_n \phi_{nt} X_n + \mu_0 \kappa_0}{N_t + \kappa_0}$$

$$\nu_t^{new} = \nu_0 + N_t$$

$$(\Lambda_t^{new})^{-1} = \Lambda_0^{-1} + N_t S_t + \frac{\kappa_0 N_t}{\kappa_0 + N_t} (\bar{x}_t - \mu_0)^T (\bar{x}_t - \mu_0)$$

$$\phi_{nt}^{new} \propto \exp\Big( \tfrac{1}{2} E[\ln |\Lambda_t|] - \tfrac{1}{2}\Big( \frac{D}{\kappa_t} + \nu_t (X_n - \mu_t) \Lambda_t (X_n - \mu_t)^T \Big) + \Psi(\alpha_t) - \Psi(\alpha_t + \beta_t) + \sum_{j=1}^{t-1} \big( \Psi(\beta_j) - \Psi(\alpha_j + \beta_j) \big) \Big)$$

where $D$ is the dimensionality of the data, $\Psi$ is the digamma function, and the $\phi_{nt}$ are normalized to sum to one over $t$.

Algorithm 2: Variational approximation for DPM
    Initialization: initialize the parameters of the variational distributions,
    i.e. initialize $\alpha_t, \beta_t, \mu_t, \kappa_t, \nu_t, \Lambda_t$ for $t = 1, \ldots, T$ and $\phi_n$ for $n = 1, \ldots, N$
    while convergence is not reached do
        update $(\mu_t, \kappa_t, \nu_t, \Lambda_t)$
        update $(\alpha_t, \beta_t)$
        update $\phi_n$
    end while

4. Experiments

The experiments were conducted using a collection of real data sets obtained from the UCI machine learning data repository. Only data sets without any missing values and with numerical attributes were used for the experiments.

Datasets

The Glass dataset consists of 214 instances of multi-attribute data describing various samples of glass. The number of attributes in the data is 10. Each attribute is a continuous numerical value which indicates the amount of a chemical element in the glass instance. The first attribute is an id and can be ignored in classification tasks. The target attribute is categorical and can take 7 different values indicating various types of glass. Thus it is a 7-class classification problem.

The Iris dataset is a well-known dataset containing three classes of iris plant, with each class containing 50 instances. One of the classes is linearly separable from the others and is hence easily classifiable. The other two classes are not linearly separable and hence present some challenge for the classifier. The attributes are real values which describe various characteristics of the iris plant. This is considered a very simple dataset and is used here as a basic test case.
The Balance dataset contains a set of measurements of weights placed on a balance, and the target class indicates whether the scale is balanced or not. The attributes are integer values which specify the left and right weights and the distances at which they are placed from the center. This is an example of the output being the result of a simple mathematical equation: the correct result can be easily determined from the relation between (left weight * left distance) and (right weight * right distance). However, this is a nonlinear relationship between the attributes and may make the data difficult to learn.

The datasets were used to compare the performance of SVM, J48 (decision trees), MDA (mixture discriminant analysis package)
and NDP. The results are shown below as percentage error averaged over a couple of runs.

dataset/classifier    MDA    J48    SVM    NDP
Glass
Iris
Balance

As expected, Iris was the easiest dataset to classify. Glass is hard for all the classifiers, and they all show similar performance. The interesting result with the Balance dataset is the very bad performance shown by NDP. It is not clear whether this is due to an incorrect implementation or whether the model assumptions do not match the data. But since MDA also makes similar assumptions about the data (that the data is composed of a mixture of Gaussians), I am inclined to think that the implementation of NDP was not correct. The next section shows that the NDP/DPM classifier has a problem with datasets whose components have small separation. Future work will investigate this problem.

Comparing variational approximation with DPM

The variational approximation algorithm was implemented in the R software package. However, only the univariate case works correctly, and the multivariate code is still a work in progress. (The code has an unknown defect, and the final approximation degenerates to a single Gaussian irrespective of the data distribution.) The goodness of approximation was verified using visual plots. The data was generated using an equiprobable mixture of normals with means separated by a variable gap and with small variance for each component: the $n$-th component has mean $n \cdot gap$ and variance $n$. The visual comparison between the original distribution and the predictive distributions of the DPM and the variational approximation is shown in the figures. The original distribution is shown in green, the variational distribution in red, and the DPM-based estimate in blue. The actual data points are shown at the bottom of each plot twice, with colors indicating the clusters detected by the DPM and variational algorithms. The upper row is for the variational clusters and the lower row is for the DPM clusters.

Figures 1 to 5. Variational approximation of Gaussian mixtures with various component separations.

In Figure 1, with gap=5, it can be seen that the DPM predictive distribution does not match the actual distribution that well. The blue curve misses two components in the data (it has only two modes). The data points for the higher components (points to the right) are a bit closer to each other, and the DPM considers them to be in a single component. The clustering shown by the lower colored markings shows the DPM prediction for each data point. There are only two colors, implying that only two components are detected by the DPM. The variational distribution detects all the components, though its estimates of the means and variances of the components are a bit off. A similar thing can be observed for the next gap value. From Figure 3 onwards, the DPM does a much better job of estimating the data distribution. The variational algorithm seems to do well in general, but it shows a bad approximation for both gap=20 and gap=40.

This brings up one of the issues with the variational algorithm: it has a tendency to get stuck in local maxima. Blei and Jordan [2] warn of this issue and suggest that the initialization of the variational distribution parameters requires some care to avoid local maxima. Their recommendation is to initialize the variational distribution by incrementally updating the parameters according to a random permutation of the data points. Even then, multiple runs must be done and the parameter settings that give the best bound selected. This randomization step was not performed in these experiments, which would explain these odd fits.
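The synthetic data used in this comparison (an equiprobable mixture of normals in which the $n$-th component has mean $n \cdot gap$ and variance $n$) can be regenerated with a short script. The experiments themselves were run in R; the Python sketch below is only an illustration, and the function name `sample_mixture`, the number of points, the number of components, and the seed are assumptions, since the report does not state them.

```python
import math
import random


def sample_mixture(n_points, n_components, gap, rng):
    """Draw points from an equiprobable 1-D mixture of normals.

    Component c (numbered from 1) has mean c * gap and variance c.
    """
    data, labels = [], []
    for _ in range(n_points):
        c = rng.randint(1, n_components)               # equiprobable component
        data.append(rng.gauss(c * gap, math.sqrt(c)))  # std = sqrt(variance)
        labels.append(c)
    return data, labels


data, labels = sample_mixture(n_points=500, n_components=4, gap=20,
                              rng=random.Random(0))
for c in sorted(set(labels)):
    points = [x for x, lab in zip(data, labels) if lab == c]
    print(c, len(points), sum(points) / len(points))
```

Lowering `gap` makes neighbouring components overlap, which is the regime in which the fits discussed above degrade.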
The Dirichlet models also show strange behavior when the component separations are small. As the gap increases, the fit becomes much better and closely follows the original distribution. As indicated in the previous section, this could be an implementation issue rather than a model restriction, and must be investigated further. While this observation is on the DPM model, it also applies to the NDP model, as they both share the implementation.

5. Conclusions and Future work

This report described the comparison of the NDP model with other classifiers, and also a preliminary analysis of the effectiveness of the variational approximation algorithm. The learning with respect to the DPM and NDP models was that there were either hidden implementation defects or model assumptions that cause them to classify certain datasets poorly. It is observed that whenever the components of the data are close (as was the case with the balance-scale dataset), NDP and DPM are unable to distinguish the components, causing misclassifications. This is unlikely to be a restriction of the model, as the MDA algorithm has a very similar model and yet yields better results. DPM models should be able to perform equally well or better than MDA models. The anomaly observed is likely to be an implementation issue and will be investigated further. (I have already spent some time on this issue, with no result yet.)

The task of implementing the variational approximation involved the study of various methods and then deriving the mathematical model for a specific setting that assumes a normal data distribution. Implementing the variational approximation for a DPM model turned out to be a lot harder than it looked. While it is a simple coordinate ascent algorithm, the mathematics involved was a bit tedious, and the algorithm is sensitive to the choice of the priors. There is also always a chance that the variational distribution gets stuck in a local maximum. All these factors made the work very hard.
The multivariate approximation was implemented, but does not yield appropriate results due to possible implementation errors or modeling errors. Future work will be to correct this issue. Overall it was observed that the algorithm converges faster than the sampling approach and also yields good results when it converges to the maximum. The variational approximations seem to work well when they converge properly, but the algorithm still needs to be improved so that it always converges to a global maximum.

References

[1] C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2.
[2] D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1).
[3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90.
[4] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1.
[5] A.E. Gelfand. Gibbs sampling. Journal of the American Statistical Association, 95(452).
[6] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109.
[7] T.S. Jaakkola and M.I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25-37.
[8] M.I. Jordan. Learning in graphical models. Kluwer Academic Publishers.
[9] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, et al. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087.
[10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2).
[11] A. Rodriguez, D.B. Dunson, and A.E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, submitted.
[12] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4.
[13] A.F.M. Smith and G.O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B (Methodological), 55(1):3-23.
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationPart IV: Monte Carlo and nonparametric Bayes
Part IV: Monte Carlo and nonparametric Bayes Outline Monte Carlo methods Nonparametric Bayesian models Outline Monte Carlo methods Nonparametric Bayesian models The Monte Carlo principle The expectation
More informationHyperparameter estimation in Dirichlet process mixture models
Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and
More informationProbabilistic Time Series Classification
Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of Discussion Functionals Calculus of Variations Maximizing a Functional Finding Approximation to a Posterior Minimizing K-L divergence Factorized
More informationA Simple Proof of the Stick-Breaking Construction of the Dirichlet Process
A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We give a simple
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown
More informationBayesian Nonparametric Models
Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior
More informationBayesian Mixtures of Bernoulli Distributions
Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions
More informationBayesian Statistics. Debdeep Pati Florida State University. April 3, 2017
Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationBayesian Nonparametrics: Models Based on the Dirichlet Process
Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro
More informationInfinite Mixtures of Trees
Sergey Kirshner sergey@cs.ualberta.ca AICML, Department of Computing Science, University of Alberta, Edmonton, Canada 6G 2E8 Padhraic Smyth Department of Computer Science, University of California, Irvine
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationDeep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationAn Alternative Infinite Mixture Of Gaussian Process Experts
An Alternative Infinite Mixture Of Gaussian Process Experts Edward Meeds and Simon Osindero Department of Computer Science University of Toronto Toronto, M5S 3G4 {ewm,osindero}@cs.toronto.edu Abstract
More informationMachine Learning Overview
Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationVariational Autoencoders
Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationSupplementary Notes: Segment Parameter Labelling in MCMC Change Detection
Supplementary Notes: Segment Parameter Labelling in MCMC Change Detection Alireza Ahrabian 1 arxiv:1901.0545v1 [eess.sp] 16 Jan 019 Abstract This work addresses the problem of segmentation in time series
More informationBayesian Nonparametric Regression for Diabetes Deaths
Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationNonparametric Bayesian modeling for dynamic ordinal regression relationships
Nonparametric Bayesian modeling for dynamic ordinal regression relationships Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz Joint work with Maria
More informationCurve Fitting Re-visited, Bishop1.2.5
Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo (and Bayesian Mixture Models) David M. Blei Columbia University October 14, 2014 We have discussed probabilistic modeling, and have seen how the posterior distribution is the critical
More informationDeep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016
Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationVariational inference for Dirichlet process mixtures
Variational inference for Dirichlet process mixtures David M. Blei School of Computer Science Carnegie-Mellon University blei@cs.cmu.edu Michael I. Jordan Computer Science Division and Department of Statistics
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationLecturer: David Blei Lecture #3 Scribes: Jordan Boyd-Graber and Francisco Pereira October 1, 2007
COS 597C: Bayesian Nonparametrics Lecturer: David Blei Lecture # Scribes: Jordan Boyd-Graber and Francisco Pereira October, 7 Gibbs Sampling with a DP First, let s recapitulate the model that we re using.
More informationCollapsed Variational Dirichlet Process Mixture Models
Collapsed Variational Dirichlet Process Mixture Models Kenichi Kurihara Dept. of Computer Science Tokyo Institute of Technology, Japan kurihara@mi.cs.titech.ac.jp Max Welling Dept. of Computer Science
More informationBagging During Markov Chain Monte Carlo for Smoother Predictions
Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods
More informationSlice Sampling Mixture Models
Slice Sampling Mixture Models Maria Kalli, Jim E. Griffin & Stephen G. Walker Centre for Health Services Studies, University of Kent Institute of Mathematics, Statistics & Actuarial Science, University
More informationTwo Useful Bounds for Variational Inference
Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the
More information(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis
Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals
More informationApproximate Inference using MCMC
Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov
More informationVariational inference
Simon Leglaive Télécom ParisTech, CNRS LTCI, Université Paris Saclay November 18, 2016, Télécom ParisTech, Paris, France. Outline Introduction Probabilistic model Problem Log-likelihood decomposition EM
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationPattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods
Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs
More informationDirichlet Process. Yee Whye Teh, University College London
Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant
More informationQuantitative Biology II Lecture 4: Variational Methods
10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate
More informationHierarchical Dirichlet Processes with Random Effects
Hierarchical Dirichlet Processes with Random Effects Seyoung Kim Department of Computer Science University of California, Irvine Irvine, CA 92697-34 sykim@ics.uci.edu Padhraic Smyth Department of Computer
More informationBayesian Nonparametrics
Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet
More informationAdvanced Machine Learning
Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing
More information39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017
Permuted and IROM Department, McCombs School of Business The University of Texas at Austin 39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017 1 / 36 Joint work
More informationOverview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated
Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian
More informationMulti-Task Learning for Classification with Dirichlet Process Priors
Journal of Machine Learning Research 8 (2007) 35-63 Submitted 4/06; Revised 9/06; Published 1/07 Multi-Task Learning for Classification with Dirichlet Process Priors Ya Xue Xuejun Liao Lawrence Carin Department
More informationContents. Part I: Fundamentals of Bayesian Inference 1
Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian
More information16 : Approximate Inference: Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution
More informationFoundations of Nonparametric Bayesian Methods
1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models
More information18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)
10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)
More informationTopic Modelling and Latent Dirichlet Allocation
Topic Modelling and Latent Dirichlet Allocation Stephen Clark (with thanks to Mark Gales for some of the slides) Lent 2013 Machine Learning for Language Processing: Lecture 7 MPhil in Advanced Computer
More informationStochastic Backpropagation, Variational Inference, and Semi-Supervised Learning
Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationMachine Learning. Probabilistic KNN.
Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007
More informationLatent Dirichlet Allocation Introduction/Overview
Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models
More informationBayes: All uncertainty is described using probability.
Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationLECTURE 15 Markov chain Monte Carlo
LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte
More informationMachine Learning using Bayesian Approaches
Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes
More informationINFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES
INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,
More information