
Bayesian Mixtures of Bernoulli Distributions

Laurens van der Maaten
Department of Computer Science and Engineering
University of California, San Diego

1 Introduction

The mixture of Bernoulli distributions [6] is a technique that is frequently used for the modeling of binary random vectors. Bernoulli mixtures differ from (restricted) Boltzmann Machines in that they do not model the marginal distribution over the binary data space X as a product of (conditional) Bernoulli distributions, but as a weighted sum of Bernoulli distributions. Despite the non-identifiability of the mixture of Bernoulli distributions [3], it has been successfully applied to, e.g., dichotomous perceptual decision making [1], text classification [7], and word categorization [4]. Mixtures of Bernoulli distributions are typically trained using an expectation-maximization (EM) algorithm, i.e., by performing maximum likelihood estimation. In this report, we develop a Gibbs sampler for a fully Bayesian variant of the Bernoulli mixture, in which (conjugate) priors are introduced over both the mixing proportions and the parameters of the Bernoulli distributions. We develop both a finite Bayesian Bernoulli mixture (using a Dirichlet prior over the latent class assignment variables) and an infinite Bernoulli mixture (using a Dirichlet Process prior). We perform experiments in which we compare the performance of the Bayesian Bernoulli mixtures with that of a standard Bernoulli mixture and a Restricted Boltzmann Machine on a task in which the (unobserved) bottom half of a handwritten digit needs to be predicted from the (observed) top half of that digit.

The outline of this report is as follows. Section 2 describes the generative model of the Bayesian Bernoulli mixture. Section 3 describes how inference is performed in this model using a collapsed Gibbs sampler. Section 4 extends the Bayesian Bernoulli mixture to an infinite mixture model, and describes the required collapsed Gibbs sampler. Section 5 describes the resulting predictive distribution, and Section 6 presents the setup and results of our experiments on a handwritten digit prediction task.

2 Generative model

A Bayesian mixture of Bernoulli distributions models a distribution over a D-dimensional binary space X. The generative model of the Bayesian Bernoulli mixture is shown graphically in Figure 1. The generative model for generating a point x is given below:

    p(π | α) = Dirichlet(π | α/K, ..., α/K),   (1)

where α is a hyperparameter of the Dirichlet prior, the so-called concentration parameter. The vector π is a K-vector describing the mixing weights, where K is the number of mixture components (in the case of the infinite mixture model, K → ∞). For simplicity, we assume a symmetric Dirichlet prior, i.e., we assume ∀k: α_k = α/K.

    p(z | π) = Discrete(z | π),   (2)

where z is a K-vector describing the class assignment (note that z_k ∈ {0, 1} and ∑_k z_k = 1).

    p(a_k | β, γ) = Beta(a_k | β, γ),   (3)

where a_k is a D-vector, and β and γ are hyperparameters of the Beta prior.

Figure 1: Generative model of the infinite mixture of Bernoulli distributions.

    p(x | {a_1, ..., a_K}, z) = ∏_{k=1}^{K} (Bern(x | a_k))^{z_k}.   (4)

Note that our model does not include weakly informative hyperpriors over the hyperparameters, as has been proposed for, e.g., the infinite mixture of Gaussians [9] or the infinite Hidden Markov Model [2]. We leave such an extension to future work.

The marginal distribution of the model over the data space is given by

    p(x) = ∑_z ∫∫ p(x | z, a) p(z | π) p(π | α) p(a | β, γ) dπ da,   (5)

where a = {a_1, ..., a_K}.
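As an illustration of the generative process in Equations 1-4 (this sketch is ours and not part of the original report; the function name and default hyperparameter values are our own choices), the following minimal NumPy code draws samples from the finite model.

    import numpy as np

    def sample_bernoulli_mixture(N, D, K, alpha=5.0, beta=2.0, gamma=2.0, seed=0):
        """Draw N binary D-vectors from the finite Bayesian Bernoulli mixture."""
        rng = np.random.default_rng(seed)
        pi = rng.dirichlet(np.full(K, alpha / K))        # Equation 1: mixing weights
        a = rng.beta(beta, gamma, size=(K, D))           # Equation 3: Bernoulli parameters
        z = rng.choice(K, size=N, p=pi)                  # Equation 2: class assignments (as indices)
        x = (rng.random((N, D)) < a[z]).astype(np.int8)  # Equation 4: observed binary vectors
        return x, z, pi, a

    # Example: 100 binary vectors of dimension D = 256 from a K = 50 component mixture.
    X, z, pi, a = sample_bernoulli_mixture(N=100, D=256, K=50)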

3 Inference

As analytically computing the posterior over cluster assignments p(z | x) is intractable, we resort to Markov Chain Monte Carlo (MCMC) to perform inference in the model. In particular, we derive the conditionals that are required to run a collapsed Gibbs sampler in the model. We first work out the integral over the mixing proportions π, and subsequently work out the integral over the Bernoulli parameters a; the Gibbs sampler then only samples the cluster assignments z. Throughout the report, we omit the hyperparameters from the notation where possible, in order to prevent the notation from becoming too cluttered.

The joint distribution over the assignment variables z can be obtained by integrating out the mixing proportions π as follows:

    p(z_1, ..., z_N) = ∫ p(z_1, ..., z_N | π) p(π) dπ   (6)
                     = Γ(α) / Γ(α/K)^K ∫ ∏_{k=1}^{K} π_k^{α/K − 1} ∏_{n=1}^{N} π_k^{z_nk} dπ   (7)
                     = Γ(α) / Γ(N + α) ∏_{k=1}^{K} Γ(N_k + α/K) / Γ(α/K),   (8)

where N_k is the number of points assigned to class k, and K is the number of classes in the model. (Here we used the standard Dirichlet integral ∫ ∏_{k=1}^{K} x_k^{α_k − 1} dx = ∏_{k=1}^{K} Γ(α_k) / Γ(∑_{k=1}^{K} α_k).)

The conditional distribution of the assignment variable z_n for data point n, given the assignment variables of all other data points, can be obtained from the above expression by fixing all but one cluster assignment, to give

    p(z_nk = 1 | z_−n) ∝ N_−n,k + α/K,   (9)

where N_−n,k represents the number of data points assigned to class k, not counting data point n (i.e., N_−n,k = ∑_{i=1}^{n−1} z_ik + ∑_{i=n+1}^{N} z_ik).

The conditional distribution over the Bernoulli parameters a_k (in which a_kd represents the d-th element of a_k), given all other variables, is given by

    p(a_k | z, D) ∝ p(a_k | β, γ) p(D | a_k, z)   (10)
                  = Beta(a_k | β, γ) ∏_{n=1}^{N} (Bern(x_n | a_k))^{z_nk}   (11)
                  = (1/Z) ∏_{d=1}^{D} a_kd^{β−1} (1 − a_kd)^{γ−1} ∏_{n∈C_k} a_kd^{x_nd} (1 − a_kd)^{1−x_nd}   (12)
                  = (1/Z) ∏_{d=1}^{D} a_kd^{β−1+∑_{n∈C_k} x_nd} (1 − a_kd)^{γ−1+N_k−∑_{n∈C_k} x_nd}   (13)
                  = ∏_{d=1}^{D} Beta(a_kd | β + ∑_{n∈C_k} x_nd, γ + N_k − ∑_{n∈C_k} x_nd),   (14)

where C_k denotes the set of points that were assigned to class k (i.e., all points i for which z_ik = 1), N_k denotes the cardinality of this set, and Z is a normalization constant.

The conditional distribution required by the collapsed Gibbs sampler is obtained by combining our results from Equations 9 and 14, and integrating out the Bernoulli parameters a_k as follows:

    p(z_nk = 1 | z_−n, D) ∝ ∫ p(x_n | a_k) p(z_nk = 1 | z_−n) p(a_k | z_−n, D_−n) da_k   (15)
     = (N_−n,k + α/K) ∫ ∏_{d=1}^{D} Bern(x_nd | a_kd) Beta(a_kd | β + ∑_{i∈C_k} x_id, γ + N_k − ∑_{i∈C_k} x_id) da_k   (16)
     = (N_−n,k + α/K) ∏_{d=1}^{D} ∫ Bern(x_nd | a_kd) Beta(a_kd | β + ∑_{i∈C_k} x_id, γ + N_k − ∑_{i∈C_k} x_id) da_kd   (17)
     = (N_−n,k + α/K) ∏_{d=1}^{D} (1/Z_nkd) ∫ a_kd^{x_nd + β + ∑_{i∈C_k} x_id − 1} (1 − a_kd)^{1 − x_nd + γ + N_k − ∑_{i∈C_k} x_id − 1} da_kd   (18)
     = (N_−n,k + α/K) ∏_{d=1}^{D} (1/Z_nkd) B(x_nd + β + ∑_{i∈C_k} x_id, 1 − x_nd + γ + N_k − ∑_{i∈C_k} x_id),   (19)

where C_k and N_k do not include the current data point n, where Z_nkd represents a normalization constant, and where B(x, y) represents the beta function, i.e., B(x, y) = Γ(x)Γ(y) / Γ(x + y). (In the last step, we used the Beta integral ∫ x^{m−1} (1 − x)^{n−1} dx = Γ(m)Γ(n) / Γ(m + n).)
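To make the Beta posterior of Equation 14 concrete, the sketch below (our own illustration, with hypothetical function and variable names) computes the per-class, per-dimension posterior Beta parameters from binary data X and a hard assignment vector z.

    import numpy as np

    def beta_posterior_params(X, z, K, beta=2.0, gamma=2.0):
        """Equation 14: the posterior over a_kd is Beta(beta + S_kd, gamma + N_k - S_kd),
        where S_kd is the number of points in class k with x_nd = 1 and N_k = |C_k|."""
        X, z = np.asarray(X), np.asarray(z)
        counts = np.zeros(K)             # N_k
        on_counts = np.zeros((K, X.shape[1]))  # S_kd
        for k in range(K):
            members = X[z == k]
            counts[k] = members.shape[0]
            on_counts[k] = members.sum(axis=0)
        post_beta = beta + on_counts                        # first Beta parameter
        post_gamma = gamma + counts[:, None] - on_counts    # second Beta parameter
        return post_beta, post_gamma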

The normalization term Z_nkd requires some additional attention, as its form allows us to sample from the above conditional distribution without performing the (expensive) computation of beta functions. The normalization term Z_nkd has the form

    Z_nkd = B(β + ∑_{i∈C_k} x_id, γ + N_k − ∑_{i∈C_k} x_id).   (20)

As a result, the value of the normalized beta evaluation is (β + ∑_{i∈C_k} x_id) / (β + γ + N_k) when x_nd = 1, and (γ + N_k − ∑_{i∈C_k} x_id) / (β + γ + N_k) when x_nd = 0. We can thus set up a collapsed Gibbs sampler that only samples the cluster assignments, by sampling from the following conditional distribution:

    p(z_nk = 1 | z_−n, D) ∝ (N_−n,k + α/K) ∏_{d=1}^{D} (β + ∑_{i∈C_k} x_id)^{x_nd} (γ + N_k − ∑_{i∈C_k} x_id)^{1−x_nd} / (β + γ + N_k).   (21)

To prevent numerical problems, it is better to compute the product over the D dimensions in the log-domain (when D is large).

4 Infinite Bernoulli mixture

In the infinite Bernoulli mixture, the Dirichlet prior in Equation 1 is replaced by a Dirichlet Process:

    p(π | α) = DP(π | α),   (22)

where α is the concentration parameter of the Dirichlet Process. Most of the derivation remains similar, but the conditional distribution over the class assignment variables (given all other variables) changes so as to reflect the probability of assignment to unrepresented classes. Specifically, the probability of an assignment variable z_nk being on [8, 5] is given by

    p(z_nk = 1 | z_−n, D) ∝ N_−n,k ∏_{d=1}^{D} (β + ∑_{i∈C_k} x_id)^{x_nd} (γ + N_k − ∑_{i∈C_k} x_id)^{1−x_nd} / (β + γ + N_k),   if k ≤ K^+,
    p(z_nk = 1 | z_−n, D) ∝ α ∏_{d=1}^{D} ∫ Bern(x_nd | a_kd) Beta(a_kd | β, γ) da_kd,   if k = K^+ + 1,   (23)

where K^+ indicates the number of represented classes. The integral in the equation for the unrepresented class can be worked out as follows:

    ∫ Bern(x_nd | a_kd) Beta(a_kd | β, γ) da_kd = (1 / B(β, γ)) ∫ a_kd^{x_nd + β − 1} (1 − a_kd)^{1 − x_nd + γ − 1} da_kd   (24)
                                                = B(x_nd + β, 1 − x_nd + γ) / B(β, γ).   (25)

The integral thus has value β / (β + γ) for x_nd = 1, which means that the prior belief that an observed variable is on under an unrepresented class is β / (β + γ). The value for x_nd = 0 is γ / (β + γ).
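The following sketch (ours, not code from the report) implements one sweep of the collapsed Gibbs sampler of Equation 21 for the finite mixture, keeping the counts N_k and ∑_{i∈C_k} x_id up to date incrementally and working in the log-domain as suggested above. Extending it to the infinite mixture of Equation 23 would amount to adding one extra candidate class whose unnormalized probability is α ∏_d (β/(β+γ))^{x_nd} (γ/(β+γ))^{1−x_nd}.

    import numpy as np

    def collapsed_gibbs_sweep(X, z, K, alpha=5.0, beta=2.0, gamma=2.0, rng=None):
        """One sweep over all data points, resampling each assignment from Equation 21.
        X is an (N, D) binary array; z is an (N,) integer array of class indices."""
        rng = np.random.default_rng() if rng is None else rng
        N, D = X.shape
        counts = np.bincount(z, minlength=K).astype(float)   # N_k
        on_counts = np.zeros((K, D))                         # sum_{i in C_k} x_id
        for k in range(K):
            on_counts[k] = X[z == k].sum(axis=0)

        for n in range(N):
            # Remove point n from its current class.
            k_old = z[n]
            counts[k_old] -= 1
            on_counts[k_old] -= X[n]

            # Unnormalized log p(z_nk = 1 | z_-n, D), Equation 21.
            log_prior = np.log(counts + alpha / K)
            p_on = (beta + on_counts) / (beta + gamma + counts[:, None])
            log_lik = (X[n] * np.log(p_on) + (1 - X[n]) * np.log(1 - p_on)).sum(axis=1)
            log_p = log_prior + log_lik
            p = np.exp(log_p - log_p.max())
            p /= p.sum()

            # Sample a new assignment and add the point back into the counts.
            k_new = rng.choice(K, p=p)
            z[n] = k_new
            counts[k_new] += 1
            on_counts[k_new] += X[n]
        return z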

5 Predictive distribution

At each state of the Gibbs sampler, the predictive distribution comprises two parts: a part corresponding to the represented classes and a part corresponding to the unrepresented classes (in the finite mixture, the second part is empty). Denoting the query part of x by x_q and the non-query part by x_¬q, the predictive distribution we are interested in is given by

    p(x_q | x_¬q) ∝ ∑_z ∫∫ p(x_q | z, a) p(x_¬q | z, a) p(z | π) p(π | α) p(a | β, γ) dπ da.   (26)

After every iteration of Gibbs sampling, we obtain a sample z^(s) (and corresponding counts N_k), which can be used to infer the Bernoulli parameters a_k^(s) as follows:

    a_k^(s) = (1 / N_k) ∑_{n=1}^{N} x_n z_nk^(s).   (27)

Using the samples z^(s) and the corresponding Bernoulli parameters a_k^(s), the predictive distribution can be estimated by the factorized approximation

    p(x_q | x_¬q) ∝ (1/S) ∑_{s=1}^{S} ∑_{k=1}^{K^+} (N_k / N) ∏_{i∈¬q} p(x_i | z_k^(s), a_k^(s)) p(x_q | z_k^(s), a_k^(s)),   (28)

where S denotes the number of Gibbs samples.

6 Experiments

To evaluate the performance of Bayesian Bernoulli mixtures, we performed experiments on the USPS handwritten digit dataset. The USPS dataset consists of 11,000 binary digit images (1,100 digits per class) of size 16 × 16 = 256 pixels. We defined a prediction task in which the model has to predict the bottom half of a digit, given the top half of that digit. In our experiments, we compare the predictions of Bayesian Bernoulli mixtures with those made by (1) Bernoulli mixtures trained using the EM algorithm and (2) Restricted Boltzmann Machines (RBMs) trained using contrastive divergence.

We performed experiments on each digit class separately. In each experiment, we randomly selected 1,000 images (of the same class) as training data, and used the remaining images as test data. The bottom 8 rows of each test image are not presented to the models; these 8 × 16 = 128 pixels have to be predicted, given the top 128 pixels. As the models output a probability that a pixel is on, a natural way to measure the quality of the predictions is using receiver operating characteristic (ROC) curves. We use the area under the ROC curve (AUC) as the evaluation criterion for the predictions. We repeated each experiment multiple times (for each digit class) to reduce the variance in our AUC estimates.

The standard (non-Bayesian) Bernoulli mixtures were trained by running the EM algorithm. The predictions from a standard Bernoulli mixture were obtained by (1) computing the responsibilities of the test image under each of the mixture components, based on the top half of the image, and (2) computing a weighted sum of the bottom halves of the mixture components, using the responsibilities as weights. The Restricted Boltzmann Machines were trained using contrastive divergence. At the start of the training, we use a single Gibbs sweep for contrastive divergence (CD-1); during the training, the number of Gibbs sweeps is slowly increased to 9 (CD-9). We also experimented with sparsity priors on the hidden unit states (which makes RBMs behave more like mixture models), but we did not find this to improve the performance of the RBMs. To predict the bottom half of each digit while observing only the top half of that digit, we employed the exact factorized predictive distribution that is described in [10]. The predictions from the Bayesian Bernoulli mixtures were obtained by performing Gibbs sampling and computing a prediction from the model state at the end of the Gibbs chain, using the same approach as for the non-Bayesian Bernoulli mixtures (for the infinite mixture model, we ignore the contribution of the unrepresented classes when making a prediction). This process is repeated 3 times, and the 3 resulting predictions are averaged to obtain the final prediction. We set the hyperparameters to α = 5, β = 2, and γ = 2. For the infinite Bernoulli mixture, we used the same hyperparameters.

The mean AUCs that were recorded during our experiments are presented in Table 1. The table presents results for all digit classes, and for different values of the number of mixture components K.
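As an illustration of the fill-in procedure described above (this sketch is ours and not code from the report; scikit-learn's roc_auc_score is assumed for the AUC computation, and the exact way the AUCs are aggregated in the report may differ), the following computes responsibilities from the observed top half of a digit and returns the responsibility-weighted prediction for the bottom half.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def fill_in_bottom_half(x_top, a, pi, n_top=128):
        """Predict the 128 unobserved bottom-half pixels of a 16 x 16 binary digit.
        x_top: (n_top,) observed pixels; a: (K, 256) component means; pi: (K,) weights."""
        a = np.clip(a, 1e-6, 1.0 - 1e-6)    # guard against log(0) for extreme means
        a_top, a_bottom = a[:, :n_top], a[:, n_top:]
        # Log-responsibilities of the components given the observed top half.
        log_r = np.log(pi) + (x_top * np.log(a_top) +
                              (1 - x_top) * np.log(1 - a_top)).sum(axis=1)
        r = np.exp(log_r - log_r.max())
        r /= r.sum()
        # Responsibility-weighted average of the component means for the bottom half.
        return r @ a_bottom

    # Usage (X_test is an (M, 256) binary array; a and pi come from a trained mixture):
    # preds = np.stack([fill_in_bottom_half(x[:128], a, pi) for x in X_test])
    # auc = roc_auc_score(X_test[:, 128:].ravel(), preds.ravel())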
The results presented in the table reveal the merits of using a fully Bayesian approach to Bernoulli mixture modeling: the Bayesian mixtures outperform their non-Bayesian counterparts in all experiments. Also, somewhat surprisingly, the Bernoulli mixtures appear to outperform the RBMs. Some examples of the predictions produced by a standard Bernoulli mixture, an RBM, and a Bayesian Bernoulli mixture (all with K = 50) are shown in Figure 2. In the figure, a brighter pixel corresponds to a higher probability of that pixel being on. The plots reveal that the Bayesian Bernoulli mixture produces smoother predictions, resulting in higher AUCs on the prediction task.

Model         Digit 1  Digit 2  Digit 3  Digit 4  Digit 5  Digit 6  Digit 7  Digit 8  Digit 9  Digit 0
BM, K = 10    .9682    .7725    .8242    .893     .843     .7988    .8965    .859     .853     .969
BM, K = 20    .9436    .7728    .7989    .836     .844     .87      .894     .8       .8682    .979
BM, K = 30    .9559    .7748    .796     .8246    .8344    .7929    .95      .868     .8732    .93
BM, K = 40    .953     .7473    .7857    .844     .832     .7936    .889     .853     .8572    .973
BM, K = 50    .962     .7636    .864     .8232    .8252    .798     .895     .847     .8752    .993
BBM, K = 10   .9727    .7847    .8585    .8423    .8622    .7797    .8999    .896     .8739    .93
BBM, K = 20   .974     .7893    .865     .8632    .8624    .796     .942     .8293    .8896    .935
BBM, K = 30   .9743    .7875    .8653    .8658    .8643    .7995    .967     .8332    .8965    .9347
BBM, K = 40   .9742    .7938    .8655    .8699    .8656    .89      .988     .8356    .8955    .9385
BBM, K = 50   .9747    .795     .8695    .8647    .868     .7999    .98      .8379    .8973    .9387
BBM, K → ∞    .9737    .83      .833     .842     .8425    .7937    .8762    .862     .849     .987
RBM, K = 10   .9623    .739     .7764    .7775    .78      .7767    .8469    .7923    .8263    .8872
RBM, K = 20   .954     .753     .789     .7829    .777     .7663    .8693    .7885    .84      .9
RBM, K = 30   .969     .737     .78      .7823    .7653    .7757    .855     .7949    .8334    .95
RBM, K = 40   .9583    .74      .7787    .778     .773     .7524    .864     .7899    .849     .8763
RBM, K = 50   .9559    .7283    .7828    .787     .7785    .769     .8686    .7842    .8242    .8966

Table 1: Area under the ROC curve (AUC) for the fill-in task on the USPS digits (BM: Bernoulli mixture; BBM: Bayesian Bernoulli mixture; RBM: Restricted Boltzmann Machine).

References

[1] B.T. Backus. The mixture of Bernoulli experts: A theory to quantify reliance on cues in dichotomous perceptual decisions. Journal of Vision, 9(1):1–19, 2009.
[2] M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14, 2002.
[3] M.Á. Carreira-Perpiñán and S. Renals. Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Computation, 12(1):141–152, 2000.
[4] J. González, A. Juan, P. Dupont, E. Vidal, and F. Casacuberta. A Bernoulli mixture model for word categorisation. In Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, 2001.
[5] T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Technical Report GCNU TR 2005-001, Gatsby Computational Neuroscience Unit, 2005.
[6] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[7] A. Juan and E. Vidal. On the use of Bernoulli mixture models for text classification. Pattern Recognition, 35(12):2705–2710, 2002.
[8] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
[9] C.E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12, pages 554–560, 2000.
[10] R.R. Salakhutdinov, A. Mnih, and G.E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791–798, 2007.

(a) Standard Bernoulli mixture. (b) Restricted Boltzmann Machine. (c) Bayesian Bernoulli mixture.

Figure 2: Examples of predictions constructed by a standard Bernoulli mixture, a Restricted Boltzmann Machine, and a Bayesian Bernoulli mixture (all with K = 50).