Clustering problems, mixture models and Bayesian nonparametrics

Clustering problems, mixture models and Bayesian nonparametrics Nguyễn Xuân Long Department of Statistics Department of Electrical Engineering and Computer Science University of Michigan Vietnam Institute of Advanced Studies of Mathematics (VIASM), 30 Jul - 3 Aug 2012 Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 1 / 86

Outline 1 Clustering problem K-means algorithm 2 Finite mixture models Expectation Maximization algorithm 3 Bayesian estimation Gibbs sampling 4 Hierarchical Mixture Latent Dirichlet Allocation Variational inference 5 Dirichlet processes and nonparametric Bayes Hierarchical Dirichlet processes 6 Asymptotic theory Geometry of space of probability measures Posterior concentration theorems 7 References Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 2 / 86

What about... Something old, something new, something borrowed, something blue... All these techniques center around clustering problems, but they illustrate a fairly large body of work in modern statistics and machine learning. Parts 1, 2 and 3 focus on aspects of algorithms, optimization and stochastic simulation. Part 4 is an in-depth excursion into the world of statistical modeling. Part 5 has a good dose of probability theory and stochastic processes. Part 6 delves deeper into the statistical theory. Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 3 / 86

A basic clustering problem Suppose we have a data set D = {X_1, ..., X_N} in some space. How do we subdivide these data points into clusters? Data points may represent scientific measurements, business transactions, text documents, images. Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 4 / 86

Example: Clustering of images Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 5 / 86

Example: A data point is an article in Psychology Today Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 6 / 86

Obtain clusters organized by certain topics: (Blei et al, 2010) Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 7 / 86

K-means algorithm
Maintain two kinds of variables:
cluster means: µ_k, k = 1, ..., K;
cluster assignments: Z_n^k ∈ {0, 1}, n = 1, ..., N.
The number of clusters K is assumed known.
Algorithm
1. Initialize {µ_k}_{k=1}^K arbitrarily.
2. Repeat (a) and (b) until convergence:
(a) update for all n = 1, ..., N:
Z_n^k := 1 if k = argmin_{i ≤ K} ||X_n − µ_i||, 0 otherwise;
(b) update for all k = 1, ..., K:
µ_k = (∑_{n=1}^N Z_n^k X_n) / (∑_{n=1}^N Z_n^k).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 8 / 86
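The two updates above translate almost line-for-line into code. A minimal NumPy sketch (not part of the lecture; the random initialization and the empty-cluster guard are my own choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate assignment step (a) and mean update step (b)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)]   # initialize means at K random data points
    for _ in range(n_iters):
        # (a) assign each point to its nearest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (N, K)
        Z = dists.argmin(axis=1)
        # (b) recompute each mean as the average of its assigned points
        new_mu = np.array([X[Z == k].mean(axis=0) if np.any(Z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, Z

# toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
means, labels = kmeans(X, K=2)
```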

Illustration K = 2 Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 9 / 86

What does all this mean?
Operational/mechanical/algebraic meaning: it is easy to show that the K-means algorithm obtains a (locally) optimal solution to the optimization problem:
min_{Z, µ} ∑_{n=1}^N ∑_{k=1}^K Z_n^k ||X_n − µ_k||².
(Much) harder questions: Why this optimization? Does this give us the true clusters? What if our assumptions are wrong? What is the best possible algorithm for learning clusters? How can we be certain of the truth from empirical data?
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 10 / 86

Statistical inference a.k.a. learning: learning "true" patterns from data
Statistical inference is concerned with procedures for extracting the pattern/law/value of a certain phenomenon from empirical data; the pattern is parameterized by θ ∈ Θ, while the data are samples X_1, ..., X_N.
An inference procedure is called an estimator in mathematical statistics. It may be formalized as an algorithm, thus a learning algorithm in machine learning. Mathematically, it is a mapping from data to an estimate for θ:
X_1, ..., X_N ↦ T(X_1, ..., X_N) ∈ Θ.
The output of the learning algorithm, T(X), is an estimate of the unknown truth θ.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 11 / 86

What, How, Why
In the clustering problem, θ represents the variables used to describe cluster means and cluster assignments, θ = {θ_k; Z_n^k}, as well as the number of clusters K.
What is the right inference procedure? (traditionally studied by statisticians)
How to achieve this learning procedure in a computationally efficient manner? (traditionally studied by computer scientists)
Why is the procedure both right and efficient? How much data and how much computation do we need? These questions drive asymptotic statistics and learning theory.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 12 / 86

We use clustering as a case study to illustrate these rather fundamental questions in statistics and machine learning, also because of: the vast range of modern applications; the fascinating recent research in algorithms and statistical theory motivated by this type of problem; the interesting links connecting optimization and numerical analysis to complex statistical modeling, probability theory and stochastic processes. Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 13 / 86

Outline 1 Clustering problem 2 Finite mixture models 3 Bayesian estimation 4 Hierarchical Mixture 5 Dirichlet processes and nonparametric Bayes 6 Asymptotic theory 7 References Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 14 / 86

Roles of probabilistic models
In order to infer about θ from data X, a probabilistic model is needed to provide the glue linking θ to X.
A model is specified in the form of a probability distribution P(X | θ).
Given the same probability model, statisticians may still disagree on how to proceed; there are two broadly categorized approaches to inference: frequentist and Bayes.
These two viewpoints are consistent mathematically, but can be wildly incompatible in terms of interpretation.
Both are interesting and useful in different inferential situations.
Roughly speaking, a frequentist method assumes that θ is a non-random unknown parameter, while a Bayesian method always treats θ as a random variable.
Frequentists view data X as infinitely available as independent replicates, while a Bayesian does not worry about the data he hasn't seen (he cares more about θ).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 15 / 86

Model-based clustering
Assume that data are generated according to a random process:
pick one of K clusters from a distribution π = (π_1, ..., π_K);
generate a data point from a cluster-specific probability distribution.
This yields a mixture model:
P(X | φ) = ∑_{k=1}^K π_k P(X | φ_k),
where the collection of parameters is θ = (π, φ); π = (π_1, ..., π_K) are the mixing probabilities, and φ = (φ_1, ..., φ_K) are the parameters associated with the K clusters.
We still need to specify the cluster-specific distributions P(X | φ_k) for each k.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 16 / 86

Example: for Gaussian mixtures, φ_k = (µ_k, Σ_k) and P(X | φ_k) is a Gaussian distribution with mean µ_k and covariance matrix Σ_k.
Why Gaussians? What is K, the number of Gaussian subpopulations?
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 17 / 86

Representation via latent variables
For each data point X, introduce a latent variable Z ∈ {1, ..., K} that indicates which subpopulation X is associated with.
Generative model:
Z ~ Multinomial(π),
X | Z = k ~ N(µ_k, Σ_k).
Marginalizing out the latent Z, we obtain:
P(X | θ) = ∑_{k=1}^K P(X | Z = k, θ) P(Z = k | θ) = ∑_{k=1}^K π_k N(X | µ_k, Σ_k).
The data set D = (X_1, ..., X_N) consists of i.i.d. samples from this generating process.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 18 / 86
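For concreteness, a short NumPy sketch of this generative process for a 2-component Gaussian mixture; the particular values of π, µ_k and Σ_k are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# mixture parameters theta = (pi, mu, Sigma); values chosen for illustration
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(N):
    """Draw (Z_n, X_n) pairs: Z_n ~ Multinomial(pi), X_n | Z_n = k ~ N(mu_k, Sigma_k)."""
    Z = rng.choice(len(pi), size=N, p=pi)
    X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in Z])
    return Z, X

Z, X = sample_gmm(500)
```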

Equivalent representation via mixing measure
Define the discrete probability measure
G = ∑_{k=1}^K π_k δ_{φ_k},
where δ_{φ_k} is an atom at φ_k.
The mixture model is defined as follows:
θ_n := (µ_n, Σ_n) ~ G,
X_n | θ_n ~ P(· | θ_n).
Each θ_n is equal to the mean/variance of the cluster associated with data point X_n. G is called a mixing measure.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 19 / 86

Inference
Setup: Given data D = {X_1, ..., X_N} assumed i.i.d. from a mixture model. {Z_1, ..., Z_N} are the associated (latent) cluster assignment variables.
Goal: Estimate the parameters θ = (π, µ, Σ). Estimate the cluster assignments via calculation of the conditional probability of the cluster labels, P(Z_n | X_n).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 20 / 86

Maximum likelihood estimation
A frequentist method for estimation going back to Fisher (see, e.g., van der Vaart, 2000).
The likelihood function is a function of the parameter:
L(θ | Data) = P(Data | θ) = ∏_{n=1}^N P(X_n | θ).
The MLE gives the estimate:
θ̂_N := argmax_θ L(θ | Data) = argmax_θ ∑_{n=1}^N log P(X_n | θ).
A fundamental theorem in asymptotic statistics
Under regularity conditions, the maximum likelihood estimator is consistent and asymptotically efficient. I.e., assuming that X_1, ..., X_N ~ i.i.d. P(X | θ*), then θ̂_N → θ* in probability (or almost surely) as N → ∞.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 21 / 86

MLE for Gaussian mixtures (cont.)
Recall the mixture density:
P(X | θ) = ∑_{k=1}^K π_k N(X | µ_k, Σ_k).
The MLE is
(π̂, µ̂, Σ̂) := argmax_{π, µ, Σ} ∑_{n=1}^N log{ ∑_{k=1}^K π_k N(X_n | µ_k, Σ_k) }.
It is possible but cumbersome to solve this optimization directly; a more practically convenient approach is via the EM (Expectation-Maximization) algorithm.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 22 / 86

EM algorithm for Gaussian mixtures
Intuition
For each data point X_n, if Z_n were known for all n = 1, ..., N, it would be easy to estimate the cluster means and covariances µ_k, Σ_k. But the Z_n's are hidden.
Perhaps we can "fill in" the latent variable Z_n by an estimate, such as the conditional expectation E(Z_n | X_n). This can be done if all parameters are known.
Classic chicken-and-egg situation!
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 23 / 86

EM algorithm for Gaussian mixtures
1. Initialize {µ_k, Σ_k, π_k}_{k=1}^K arbitrarily.
2. Repeat (a) and (b) until convergence:
(a) For k = 1, ..., K, n = 1, ..., N, calculate the conditional expectation of the labels:
τ_n^k := P(Z_n = k | X_n) = P(X_n | Z_n = k) P(Z_n = k) / ∑_{k'=1}^K P(X_n | Z_n = k') P(Z_n = k') = π_k N(X_n | µ_k, Σ_k) / ∑_{k'=1}^K π_{k'} N(X_n | µ_{k'}, Σ_{k'}).
(b) Update for k = 1, ..., K:
µ_k ← ∑_{n=1}^N τ_n^k X_n / ∑_{n=1}^N τ_n^k,
Σ_k ← ∑_{n=1}^N τ_n^k (X_n − µ_k)(X_n − µ_k)^T / ∑_{n=1}^N τ_n^k,
π_k ← (1/N) ∑_{n=1}^N τ_n^k.
This algorithm is a "soft" version that generalizes the K-means algorithm!
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 24 / 86
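A compact NumPy/SciPy sketch of these E and M updates, assuming full covariance matrices, a fixed iteration budget and a small ridge term for numerical stability (all of which are my simplifications, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture: E-step computes tau, M-step re-estimates (pi, mu, Sigma)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: tau[n, k] = P(Z_n = k | X_n) under the current parameters
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, covariances and mixing proportions
        Nk = tau.sum(axis=0)
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
    return pi, mu, Sigma, tau
```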

Illustration of EM algorithm Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 25 / 86

What does this algorithm really do?
We will show that this algorithm ultimately obtains a local optimum of the likelihood function. I.e., it is indeed an MLE method.
Suppose that we had the full data (complete data) D_c = {(Z_n, X_n)}_{n=1}^N. Then we could calculate the complete log-likelihood function:
l_c(θ | D_c) = log P(D_c | θ)
= ∑_{n=1}^N log P(X_n, Z_n | θ)
= ∑_{n=1}^N log ∏_{k=1}^K (π_k N(X_n | µ_k, Σ_k))^{Z_n^k}
= ∑_{n=1}^N ∑_{k=1}^K Z_n^k log{π_k N(X_n | µ_k, Σ_k)}
= ∑_{n=1}^N ∑_{k=1}^K Z_n^k (log π_k + log N(X_n | µ_k, Σ_k)).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 26 / 86

To estimate the parameters, we would wish to optimize the complete log-likelihood if we actually had the full data (Z_n, X_n)_{n=1}^N. Since the Z_n's are actually latent, we settle for its conditional expectation. In fact:
Easy exercise
The updating step (b) of the EM algorithm described earlier optimizes the conditional expectation of the complete log-likelihood:
θ := argmax_θ E[l_c(θ | D_c) | X_1, ..., X_N],
where
E[l_c(θ | D_c) | X_1, ..., X_N] = ∑_{n=1}^N ∑_{k=1}^K τ_n^k (log π_k + log N(X_n | µ_k, Σ_k)).
(Proof: take the gradient with respect to the parameters and set it to 0.)
Compare this to optimizing the original likelihood function:
l(θ | D) = ∑_{n=1}^N log{ ∑_{k=1}^K π_k N(X_n | µ_k, Σ_k) }.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 27 / 86

Summary
Rewrite the EM algorithm as maximization of the expected complete log-likelihood:
EM algorithm for Gaussian mixtures
1. Initialize θ = {µ_k, Σ_k, π_k}_{k=1}^K randomly.
2. Repeat (a) and (b) until convergence:
(a) E-step: given the current estimate of θ, compute E[l_c(θ | D_c) | D].
(b) M-step: update θ by maximizing E[l_c(θ | D_c) | D].
It remains to show that maximizing the expected complete log-likelihood is equivalent to maximizing the log-likelihood function...
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 28 / 86

EM algorithm for latent variable models
A model with latent variables is abstractly defined as follows:
Z ~ P(· | θ),
X | Z ~ P(· | Z, θ).
This type of model includes mixture models, hierarchical models (we will see these later in this lecture), hidden Markov models, Kalman filters, etc.
Recall the log-likelihood function for the observed data:
l(θ | D) = log p(D | θ) = ∑_{n=1}^N log p(X_n | θ),
and the log-likelihood function for the complete data:
l_c(θ | D_c) = log p(D_c | θ) = ∑_{n=1}^N log p(X_n, Z_n | θ).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 29 / 86

EM algorithm maximizes the likelihood
EM algorithm for latent variable models
1. Initialize θ randomly.
2. Repeat (a) and (b) until convergence:
(a) E-step: given the current estimate of θ, compute E[l_c(θ | D_c) | D].
(b) M-step: update θ by maximizing E[l_c(θ | D_c) | D].
Theorem
The EM algorithm is a coordinatewise hill-climbing algorithm with respect to the likelihood function.
For a proof, see the hand-written notes.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 30 / 86

Outline 1 Clustering problem 2 Finite mixture models 3 Bayesian estimation 4 Hierarchical Mixture 5 Dirichlet processes and nonparametric Bayes 6 Asymptotic theory 7 References Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 31 / 86

Bayesian estimation
In a Bayesian approach, the parameters θ = (π, µ, Σ) are assumed to be random.
There needs to be a prior distribution for θ.
Consequently we obtain a Bayesian mixture model.
Inference boils down to the calculation of the posterior probability:
Bayes Rule
P(θ | Data) = P(θ | X) = P(θ) P(X | θ) / ∫ P(θ) P(X | θ) dθ,
i.e., posterior ∝ prior × likelihood.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 32 / 86

Prior distributions
π = (π_1, ..., π_K) ~ Dir(α),
µ_1, ..., µ_K ~ N(0, I),
Σ_1, ..., Σ_K ~ IW(Ψ, m).
A million dollar question: how to choose prior distributions? (such as Dirichlet, Normal, Inverse-Wishart, ...)
... the pure Bayesian viewpoint
... the pragmatic viewpoint: computational convenience via conjugacy
... the theoretical viewpoint: posterior asymptotics
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 33 / 86

Dirichlet distribution
Let π = (π_1, ..., π_K) be a point in the (K−1)-simplex, i.e., 0 ≤ π_k ≤ 1 and ∑_{k=1}^K π_k = 1.
Let α = (α_1, ..., α_K) be a set of parameters, where α_k > 0.
The Dirichlet density is defined as
P(π | α) = [Γ(∑_{k=1}^K α_k) / ∏_{k=1}^K Γ(α_k)] π_1^{α_1 − 1} ... π_K^{α_K − 1}.
E π_k = α_k / ∑_{k=1}^K α_k.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 34 / 86
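A quick Monte Carlo check of the stated mean formula, using NumPy's Dirichlet sampler (the α values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 5.0, 3.0])

samples = rng.dirichlet(alpha, size=100_000)   # each row is a point in the 2-simplex
print(samples.mean(axis=0))                    # close to alpha / alpha.sum()
print(alpha / alpha.sum())                     # [0.2, 0.5, 0.3]
```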

Multinomial-Dirichlet conjugacy
Let π ~ Dir(α). Let Z ~ Multinomial(π), i.e., P(Z = k | π) = π_k for k = 1, ..., K. Write Z as an indicator vector Z = (Z^1, ..., Z^K).
Then the posterior probability of π is:
P(π | Z) ∝ P(π) P(Z | π)
∝ (π_1^{α_1 − 1} ... π_K^{α_K − 1}) (π_1^{Z^1} ... π_K^{Z^K})
= π_1^{α_1 + Z^1 − 1} ... π_K^{α_K + Z^K − 1},
which is again a Dirichlet density with modified parameter: Dir(α + Z).
We say the Multinomial-Dirichlet is a conjugate pair.
Other conjugate pairs: Normal-Normal (for the mean variable µ), Normal-Inverse Wishart (for the covariance matrix Σ), etc.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 35 / 86
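Conjugacy makes the posterior update a one-line count addition. A small sketch for N i.i.d. labels Z_1, ..., Z_N, so the posterior is Dir(α + counts); the specific numbers are illustrative:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dir(alpha) prior on pi
Z = np.array([0, 2, 2, 1, 2, 0, 2])      # observed labels in {0, 1, 2}

counts = np.bincount(Z, minlength=len(alpha))
alpha_post = alpha + counts              # posterior is Dir(alpha + counts)

post_mean = alpha_post / alpha_post.sum()
print(alpha_post, post_mean)
```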

Bayesian mixture model
X_n | Z_n = k ~ N(µ_k, Σ_k),
Z_n | π ~ Multinomial(π),
π ~ Dir(α),
µ_k | µ_0, Σ_0 ~ N(µ_0, Σ_0),
Σ_k | Ψ ~ IW(Ψ, m).
(α, µ_0, Σ_0, Ψ) are non-random parameters (or they may be random and assigned prior distributions as well).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 36 / 86

Posterior inference
Posterior inference is about calculating the conditional probability of the latent variables and model parameters, i.e.,
P((Z_n)_{n=1}^N, (π_k, µ_k, Σ_k)_{k=1}^K | X_1, ..., X_N).
This is usually difficult computationally.
An approach is via sampling, exploiting conditional independence.
At this point we take a detour, discussing a general modeling and inference formalism known as graphical models.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 37 / 86

Directed graphical models
Given a graph G = (V, E), where each node v ∈ V is associated with a random variable X_v:
The joint distribution on the collection of variables X_V = {X_v : v ∈ V} factorizes according to the parent-of relation defined by the directed edges E:
P(X_V) = ∏_{v ∈ V} P(X_v | X_{parents(v)}).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 38 / 86

Conditional independence
Observed variables are shaded.
It can be shown that X_1 ⊥ {X_4, X_5, X_6} | {X_2, X_3}. Moreover, we can read off all such conditional independencies from the graph structure.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 39 / 86

Basic conditional independence structures
chain structure: X → Z → Y
causal (common-cause) structure: X ← Z → Y
explanation-away: X ⊥ Z (marginally), but X and Z are dependent given Y
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 40 / 86

Explanation-away
aliens = Alice was abducted by aliens
watch = Alice forgot to set her watch alarm before bed
late = Alice is late for class
aliens ⊥ watch (marginally), but aliens and watch are dependent given late.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 41 / 86

Conditionally i.i.d.
Conditionally i.i.d. (independently and identically distributed): this is represented by a plate notation that allows subgraphs to be replicated.
Note that this graph represents a mixture distribution for the observed variables (X_1, ..., X_N):
P(X_1, ..., X_N) = ∫ P(X_1, ..., X_N | θ) dP(θ) = ∫ ∏_{i=1}^N P(X_i | θ) dP(θ).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 42 / 86

Gibbs sampling
A Markov chain Monte Carlo (MCMC) sampling method.
Consider a collection of variables, say X_1, ..., X_N, with a joint distribution P(X_1, ..., X_N) (which may be a conditional joint distribution in our specific problem).
A stationary Markov chain is a sequence X^t = (X_1^t, ..., X_N^t), t = 1, 2, ..., such that given X^t, the random variable X^{t+1} is conditionally independent of all variables before t, and P(X^{t+1} | X^t) is invariant with respect to t.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 43 / 86

Gibbs sampling (cont.)
The Gibbs sampling method sets up the Markov chain as follows:
at step t = 1, initialize X^1 to arbitrary values;
at step t, choose n randomly among 1, ..., N and draw a sample for X_n^t from P(X_n | X_1, ..., X_{n−1}, X_{n+1}, ..., X_N);
iterate.
A fundamental theorem of Markov chain theory
Under mild conditions (ensuring ergodicity), the distribution of X^t converges in the limit to the joint distribution of X, namely P(X_1, ..., X_N).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 44 / 86
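As a self-contained toy instance of this scheme (not from the lecture), a Gibbs sampler for a bivariate normal with correlation ρ, where both full conditionals are univariate normals; the coordinates are updated in a fixed order here rather than at a randomly chosen index, which is an equally standard variant:

```python
import numpy as np

def gibbs_bivariate_normal(rho, T=5000, seed=0):
    """Sample (X1, X2) ~ N(0, [[1, rho], [rho, 1]]) by alternating full conditionals:
    X1 | X2 = x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for X2."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((T, 2))
    sd = np.sqrt(1 - rho ** 2)
    for t in range(T):
        x1 = rng.normal(rho * x2, sd)
        x2 = rng.normal(rho * x1, sd)
        samples[t] = (x1, x2)
    return samples

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[1000:].T))   # empirical correlation close to 0.8 after burn-in
```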

Back to posterior inference
The goal is to sample from the (conditional) joint distribution
P((Z_n)_{n=1}^N, (π_k, µ_k, Σ_k)_{k=1}^K | X_1, ..., X_N).
By Gibbs sampling, it is sufficient to be able to sample from the conditional distribution of each of the latent variables and parameters given everything else (and conditionally on the data).
We will see that conditional independence helps in a big way.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 45 / 86

Sample π:
P(π | µ, Σ, Z_1, ..., Z_N, Data) = P(π | Z_1, ..., Z_N)   (conditional independence)
∝ P(Z_1, ..., Z_N | π) P(π | α)
= Dir(α_1 + n_1, α_2 + n_2, ..., α_K + n_K),
where n_j = ∑_{n=1}^N I(Z_n = j).
Sample Z_n:
P(Z_n = k | everything else, including data) = P(Z_n = k | X_n, π, µ, Σ)   (conditional independence)
= π_k N(X_n | µ_k, Σ_k) / ∑_{k'=1}^K π_{k'} N(X_n | µ_{k'}, Σ_{k'}).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 46 / 86

Sample µ_k:
P(µ_k | µ_0, Σ_0, Z, X, Σ) = P(µ_k | µ_0, Σ_0, Σ_k, {Z_n, X_n such that Z_n = k})
∝ P({X_n : Z_n = k} | µ_k, Σ_k) P(µ_k | µ_0, Σ_0)   (Bayes rule)
∝ ∏_{n: Z_n = k} exp{−(1/2)(X_n − µ_k)^T Σ_k^{-1} (X_n − µ_k)} × exp{−(1/2)(µ_k − µ_0)^T Σ_0^{-1} (µ_k − µ_0)}
∝ exp{−(1/2)(µ_k − µ̃_k)^T Σ̃_k^{-1} (µ_k − µ̃_k)}
∝ N(µ̃_k, Σ̃_k).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 47 / 86

Here,
Σ̃_k^{-1} = Σ_0^{-1} + n_k Σ_k^{-1}, where n_k = ∑_{n=1}^N 1(Z_n = k),
Σ̃_k^{-1} µ̃_k = Σ_0^{-1} µ_0 + Σ_k^{-1} ∑_{n: Z_n = k} X_n.
Hence,
µ̃_k = Σ̃_k (Σ_0^{-1} µ_0 + Σ_k^{-1} ∑_{n: Z_n = k} X_n).
Notice that if n_k → ∞, then Σ̃_k → 0, so µ̃_k − (1/n_k) ∑_{n: Z_n = k} X_n → 0. (That is, the prior is taken over by the data!)
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 48 / 86

To sample Σ_k, we use the inverse Wishart distribution (a multivariate generalization of the inverse chi-square distribution) as a prior:
B ~ W^{-1}(Ψ, m)  ⟺  B^{-1} ~ W(Ψ, m),
B, Ψ: p × p PSD matrices; m: degrees of freedom.
Inverse-Wishart density:
P(B | Ψ, m) ∝ |Ψ|^{m/2} |B|^{−(m+p+1)/2} exp{−tr(Ψ B^{-1})/2}.
Assume that, as a prior for Σ_k, k = 1, ..., K,
Σ_k | Ψ, m ~ IW(Ψ, m).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 49 / 86

So the posterior distribution for Σ_k takes the form:
P(Σ_k | Ψ, m, µ, π, Z, Data) = P(Σ_k | Ψ, m, µ_k, {X_n : Z_n = k})
∝ ∏_{n: Z_n = k} P(X_n | Σ_k, µ_k) × P(Σ_k | Ψ, m)
∝ |Σ_k|^{−n_k/2} exp{−(1/2) ∑_{n: Z_n = k} tr[Σ_k^{-1} (X_n − µ_k)(X_n − µ_k)^T]} × |Ψ|^{m/2} |Σ_k|^{−(m+p+1)/2} exp{−(1/2) tr[Ψ Σ_k^{-1}]}
= IW(A + Ψ, n_k + m),
where A = ∑_{n: Z_n = k} (X_n − µ_k)(X_n − µ_k)^T and n_k = ∑_{n=1}^N I(Z_n = k).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 50 / 86

Summary of Gibbs sampling
Use Gibbs sampling to obtain samples from the posterior distribution P(π, Z, µ, Σ | X_1, ..., X_N).
Sampling algorithm
Randomly generate (π^(1), Z^(1), µ^(1), Σ^(1)).
For t = 1, ..., T, do the following:
(1) draw π^(t+1) ~ P(π | Z^(t), µ^(t), Σ^(t), Data)
(2) draw Z_n^(t+1) ~ P(Z_n | Z_{−n}^(t), µ^(t), Σ^(t), π^(t+1), Data)
(3) draw µ_k^(t+1) ~ P(µ_k | µ_{−k}^(t), Z^(t+1), Σ^(t), π^(t+1), Data)
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 51 / 86
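A sketch of one way steps (1)-(3) could be implemented for the one-dimensional case, holding the component variance fixed and known to keep the code short (a full sampler would add the inverse-Wishart step for Σ_k derived above); the hyperparameter values are illustrative:

```python
import numpy as np

def gibbs_gmm_1d(X, K, T=2000, alpha=1.0, mu0=0.0, s0sq=100.0, ssq=1.0, seed=0):
    """Gibbs sampler for a 1-D Bayesian GMM with known component variance ssq.
    Model: pi ~ Dir(alpha), mu_k ~ N(mu0, s0sq), X_n | Z_n = k ~ N(mu_k, ssq)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    Z = rng.integers(K, size=N)
    mu = rng.normal(mu0, 1.0, size=K)
    samples = []
    for t in range(T):
        # (1) pi | Z ~ Dir(alpha + counts)
        counts = np.bincount(Z, minlength=K)
        pi = rng.dirichlet(alpha + counts)
        # (2) Z_n | rest: categorical with probabilities ∝ pi_k * N(X_n | mu_k, ssq)
        logp = np.log(pi)[None, :] - 0.5 * (X[:, None] - mu[None, :]) ** 2 / ssq
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        Z = np.array([rng.choice(K, p=p[n]) for n in range(N)])
        # (3) mu_k | rest: conjugate normal update with prior N(mu0, s0sq)
        for k in range(K):
            nk = np.sum(Z == k)
            var_k = 1.0 / (1.0 / s0sq + nk / ssq)
            mean_k = var_k * (mu0 / s0sq + X[Z == k].sum() / ssq)
            mu[k] = rng.normal(mean_k, np.sqrt(var_k))
        samples.append((pi.copy(), mu.copy(), Z.copy()))
    return samples
```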

By the fundamental theorem of Markov chains, for t sufficiently large, (π^(t), Z^(t), µ^(t), Σ^(t)) can be viewed as a random sample from the posterior P(· | Data).
Once we have drawn samples from the posterior distribution, posterior probabilities can be obtained via Monte Carlo approximation.
E.g., the posterior probability of the cluster label for data point X_n:
P(Z_n | Data) ≈ 1/(T − s + 1) ∑_{t=s}^T δ_{Z_n^(t)}.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 52 / 86

Comparing Gibbs sampling and EM algorithm Gibbs can be viewed as a stochastic version of the EM algorithm EM algorithm 1. E step: given current value of parameter θ, calculate conditional expectation of latent variables Z 2. M step: given the conditional expectations of Z, update θ by maximizing the expected complete log-likelihood function Gibbs sampling 1. given current values of θ, sample Z 2. given current values of Z, sample (random) θ Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 53 / 86

Outline 1 Clustering problem 2 Finite mixture models 3 Bayesian estimation 4 Hierarchical Mixture 5 Dirichlet processes and nonparametric Bayes 6 Asymptotic theory 7 References Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 54 / 86

Recall our Bayesian mixture model: for each n = 1, ..., N, k = 1, ..., K,
X_n | Z_n = k ~ N(µ_k, Σ_k),
Z_n | π ~ Multinomial(π),
π ~ Dir(α),
µ_k | µ_0, Σ_0 ~ N(µ_0, Σ_0),
Σ_k | Ψ ~ IW(Ψ, m).
We may assume further that the parameters (α, µ_0, Σ_0, Ψ) are themselves random and assigned prior distributions. Thus we obtain a hierarchical model in which parameters appear as latent random variables.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 55 / 86

Exchangeability
The existence of latent variables can be motivated by De Finetti's theorem.
Classical statistics often relies on the assumption of i.i.d. data X_1, ..., X_N with respect to some probability model parameterized by θ (non-random).
However, if X_1, ..., X_N are exchangeable, then De Finetti's theorem establishes the existence of a latent random variable θ such that X_1, ..., X_N are conditionally i.i.d. given θ.
This theorem is regarded by many as one of the results that provide the mathematical foundation for Bayesian statistics.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 56 / 86

Definition
Let I be a countable index set. A sequence (X_i : i ∈ I) (finite or infinite) is exchangeable if for any permutation ρ of I,
(X_{ρ(i)})_{i ∈ I} =_law (X_i)_{i ∈ I}.
De Finetti's theorem
If (X_1, ..., X_n, ...) is an infinite exchangeable sequence of random variables on some probability space, then there is a random variable θ ~ π such that X_1, ..., X_n, ... are i.i.d. conditionally on θ. That is, for all n,
P(X_1, ..., X_n) = ∫ ∏_{i=1}^n P(X_i | θ) dπ(θ).
Remarks
Exchangeability is a weaker assumption than i.i.d.
θ may be (generally) an infinite-dimensional variable.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 57 / 86

Latent Dirichlet allocation / Finite admixture
Developed by Pritchard et al (2000), Blei et al (2001).
Widely applicable to data such as texts, images and biological data (Google Scholar has 12000 citations).
A canonical and simple example of a hierarchical mixture model for discrete data.
Some jargon:
Basic building blocks are words, represented by random variables, say W, where W ∈ {1, ..., V}. V is the length of the vocabulary.
A document is a sequence of words denoted by W = (W_1, ..., W_N).
A corpus is a collection of documents (W_1, ..., W_M).
General problem: given a collection of documents, can we infer the topics that the documents may be clustered around?
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 58 / 86

Early modeling attempts
Unigram model: all documents in the corpus share the same topic. I.e., for any document W = (W_1, ..., W_N),
W_1, ..., W_N ~ iid Mult(θ).
[Plate diagram: θ → W_n, with word plate N and document plate M.]
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 59 / 86

Mixture of unigrams model: each document W is associated with a latent topic variable Z. I.e., for each d = 1, ..., M, generate document W = (W_1, ..., W_N) as follows:
Z ~ Mult(θ),
W_1, ..., W_N | Z = k ~ iid Mult(β_k).
[Plate diagram: θ → Z → W_n, with word plate N and document plate M; topic-word parameters β.]
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 60 / 86

Latent Dirichlet allocation model
Assume the following:
The words within each document are exchangeable (this is the "bag of words" assumption).
The documents within each corpus are exchangeable.
A document may be associated with K topics.
Each word within a document may be associated with any of the topics.
For d = 1, ..., M, generate document W = (W_1, ..., W_N) as follows:
draw θ ~ Dir(α_1, ..., α_K);
for each n = 1, ..., N,
Z_n ~ Mult(θ), i.e., P(Z_n = k | θ) = θ_k,
W_n | Z_n ~ Mult(β_{Z_n}), i.e., P(W_n = j | Z_n = k, β) = β_kj, with β ∈ R^{K×V}.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 61 / 86
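A short NumPy sketch of this generative process for a toy corpus; the sizes K, V, M, N and the Dirichlet hyperparameters are arbitrary, and β is drawn at random here rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 20, 5, 50          # topics, vocabulary size, documents, words per document
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # beta[k] = word distribution of topic k

corpus = []
for d in range(M):
    theta = rng.dirichlet(alpha)                # per-document topic proportions
    Z = rng.choice(K, size=N, p=theta)          # topic of each word
    W = np.array([rng.choice(V, p=beta[k]) for k in Z])   # word drawn from its topic
    corpus.append(W)
```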

Geometric illustration
[Plate diagram of LDA on the LHS: α → θ → Z_n → W_n, with topic-word parameters β; word plate N and document plate M.]
Each dot in the topic polytope (topic simplex in the illustration) corresponds to the word-frequency vector of a random document. The extreme points of the topic polytope (e.g., topic 1, topic 2, ...) on the RHS are represented by the vectors β_k, k = 1, 2, ..., in the hierarchical model on the LHS; β_k = (β_k1, ..., β_kV).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 62 / 86

Posterior inference
The goal of inference includes:
1. Computing the posterior distribution P(θ, Z | W, α, β) for each document W = (W_1, ..., W_N).
2. Estimating α, β from the data, e.g., a corpus of M documents W_1, ..., W_M.
Both of the above are relatively easy to do using Gibbs sampling or Metropolis-Hastings, which will be left as an exercise. Unfortunately a sampling algorithm may be extremely slow (to achieve convergence); this motivates a fast deterministic algorithm for posterior inference.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 63 / 86

The posterior can be rewritten as
P(θ, Z | W, α, β) = P(θ, Z, W | α, β) / P(W | α, β).
The numerator can be computed easily:
P(θ, Z, W | α, β) = P(θ | α) ∏_{n=1}^N P(Z_n | θ) P(W_n | Z_n, β).
Unfortunately, the denominator is difficult:
P(W | α, β) = ∑_Z ∫_θ P(θ, Z, W | α, β) dθ
= [Γ(∑_k α_k) / ∏_{k=1}^K Γ(α_k)] ∫ ∏_{k=1}^K θ_k^{α_k − 1} ∏_{n=1}^N ∑_{k=1}^K ∏_{j=1}^V (θ_k β_kj)^{I{W_n = j}} dθ.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 64 / 86

Variational inference
This is an alternative to sampling-based inference. The main spirit is to turn a difficult computation problem into an optimization problem, one which can be modified/simplified. We consider the simplest form of variational inference, known as the mean-field approximation:
Consider a family of tractable distributions Q = {q(θ, Z | W, α, β)}.
Choose the one in Q that is closest to the true posterior:
q* = argmin_{q ∈ Q} KL(q || p(θ, Z | W, α, β)).
Use q* in place of the true posterior.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 65 / 86

In the mean-field approximation, Q is taken to be the family of "factorized" distributions, i.e.:
q(θ, Z | W, γ, φ) = q(θ | W, γ) ∏_{n=1}^N q(Z_n | W, φ_n).
Optimization is performed with respect to the variational parameters (γ, φ). Under q ∈ Q, for n = 1, ..., N; k = 1, ..., K,
P(Z_n = k | W, φ_n) = φ_nk,
θ | γ ~ Dir(γ), γ ∈ R_+^K.
The optimization of the variational parameters can be achieved by implementing a simple system of updating equations.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 66 / 86

Mean-field algorithm
1. Initialize φ, γ arbitrarily.
2. Keep updating until convergence:
φ_nk ∝ β_{k,W_n} exp{E_q[log θ_k | γ]},
γ_k = α_k + ∑_{n=1}^N φ_nk.
In the first updating equation, we use a fact about the Dirichlet distribution: if θ = (θ_1, ..., θ_K) ~ Dir(γ), then
E[log θ_k | γ] = Ψ(γ_k) − Ψ(∑_{k=1}^K γ_k),
where Ψ is the digamma function: Ψ(x) = d log Γ(x)/dx = Γ'(x)/Γ(x).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 67 / 86
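A sketch of these two coupled updates for a single document, assuming α and β are given; SciPy's digamma plays the role of Ψ, and the initialization follows the common choice φ_nk = 1/K, γ_k = α_k + N/K:

```python
import numpy as np
from scipy.special import digamma

def mean_field_lda_doc(W, alpha, beta, n_iters=100, tol=1e-6):
    """Mean-field updates for one document W (array of word ids in {0, ..., V-1}).
    alpha: shape (K,); beta: shape (K, V). Returns variational parameters (gamma, phi)."""
    K, V = beta.shape
    N = len(W)
    phi = np.full((N, K), 1.0 / K)
    gamma = alpha + N / K
    for _ in range(n_iters):
        # phi_{nk} ∝ beta_{k, W_n} * exp(E_q[log theta_k | gamma])
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, W].T * np.exp(e_log_theta)[None, :]
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + sum_n phi_{nk}
        new_gamma = alpha + phi.sum(axis=0)
        if np.max(np.abs(new_gamma - gamma)) < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, phi
```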

Derivation of the mean-field approximation
Jensen's inequality:
log P(W | α, β) = log ∫_θ ∑_Z P(θ, Z, W | α, β) dθ
= log ∫_θ ∑_Z q(θ, Z) [P(θ, Z, W | α, β) / q(θ, Z)] dθ
≥ ∫_θ ∑_Z q(θ, Z) log [P(θ, Z, W | α, β) / q(θ, Z)] dθ
= E_q log P(θ, Z, W | α, β) − E_q log q(θ, Z) =: L(γ, φ; α, β).
The gap of the bound:
log P(W | α, β) − L(γ, φ; α, β) = KL(q(θ, Z) || P(θ, Z | W, α, β)).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 68 / 86

So q* solves the following maximization:
max_{γ, φ} L(γ, φ; α, β).
Note that
log P(θ, Z, W | α, β) = log P(θ | α) + ∑_{n=1}^N ( log P(Z_n | θ) + log P(W_n | Z_n, β) ).
So,
L(γ, φ; α, β) = E_q log P(θ | α) + ∑_{n=1}^N { E_q log P(Z_n | θ) + E_q log P(W_n | Z_n, β) } − E_q log q(θ | γ) − ∑_{n=1}^N E_q log q(Z_n | φ_n).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 69 / 86

Let's go over each term in the previous expression:
P(θ | α) = [Γ(∑_k α_k) / ∏_{k=1}^K Γ(α_k)] ∏_{k=1}^K θ_k^{α_k − 1},
log P(θ | α) = ∑_{k=1}^K (α_k − 1) log θ_k + log Γ(∑_k α_k) − ∑_{k=1}^K log Γ(α_k),
E_q log P(θ | α) = ∑_{k=1}^K (α_k − 1)(Ψ(γ_k) − Ψ(∑_{k=1}^K γ_k)) + log Γ(∑_k α_k) − ∑_{k=1}^K log Γ(α_k).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 70 / 86

Next term:
P(Z_n | θ) = ∏_{k=1}^K θ_k^{I(Z_n = k)},
log P(Z_n | θ) = ∑_{k=1}^K I(Z_n = k) log θ_k,
E_q log P(Z_n | θ) = ∑_{k=1}^K φ_nk (Ψ(γ_k) − Ψ(∑_{k=1}^K γ_k)).
And next:
log P(W_n | Z_n, β) = ∑_{k=1}^K ∑_{j=1}^V I(W_n = j, Z_n = k) log β_kj,
E_q log P(W_n | Z_n, β) = ∑_{k=1}^K ∑_{j=1}^V φ_nk I(W_n = j) log β_kj.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 71 / 86

And next:
q(θ | γ) = [Γ(∑_k γ_k) / ∏_{k=1}^K Γ(γ_k)] ∏_{k=1}^K θ_k^{γ_k − 1},
so
E_q log q(θ | γ) = ∑_{k=1}^K (γ_k − 1)(Ψ(γ_k) − Ψ(∑_{k=1}^K γ_k)) + log Γ(∑_{k=1}^K γ_k) − ∑_{k=1}^K log Γ(γ_k).
And next:
q(Z_n | φ_n) = ∏_{k=1}^K φ_nk^{I(Z_n = k)},
so
E_q log q(Z_n | φ_n) = ∑_{k=1}^K φ_nk log φ_nk.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 72 / 86

To summarize, we maximize the lower bound of the log-likelihood function:
L(γ, φ; α, β) := E_q[log P(θ, Z, W | α, β)] − E_q[log q(θ, Z)]
with respect to the variational parameters, subject to the constraints, for all n = 1, ..., N; k = 1, ..., K:
∑_{k=1}^K φ_nk = 1, φ_nk ≥ 0, γ_k ≥ 0.
L decomposes nicely into a sum, which admits simple gradient-based update equations:
Fix γ, maximize w.r.t. φ: φ_nk ∝ β_{k,W_n} exp(Ψ(γ_k) − Ψ(∑_{k=1}^K γ_k)).
Fix φ, maximize w.r.t. γ: γ_k = α_k + ∑_{n=1}^N φ_nk.
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 73 / 86

Variational EM algorithm
It remains to estimate the parameters α and β.
Data: D = set of documents {W_1, W_2, ..., W_M}.
Log-likelihood: L(D) = ∑_{d=1}^M log p(W_d | α, β) = ∑_{d=1}^M [ L(γ_d, φ_d | α, β) + KL(q || P(θ, Z | W_d, α, β)) ].
The EM algorithm involves alternating between the E step and the M step until convergence:
Variational E step: for each d = 1, ..., M, let (γ_d, φ_d) := argmax L(γ_d, φ_d; α, β).
M step: solve (α, β) = argmax_{α, β} ∑_{d=1}^M L(γ_d, φ_d | α, β).
Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 74 / 86
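A skeleton of this alternation, reusing the per-document mean_field_lda_doc routine sketched earlier; the M-step shown updates only β via the standard expected-count formula β_kj ∝ ∑_d ∑_n φ_dnk I(W_dn = j) and holds α fixed, which is a simplification of the full algorithm:

```python
import numpy as np

def variational_em(corpus, K, V, alpha, n_outer=20, seed=0):
    """corpus: list of arrays of word ids. Assumes mean_field_lda_doc (sketched above) is in scope."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, 1.0), size=K)      # random init of topic-word distributions
    for _ in range(n_outer):
        # Variational E step: fit (gamma_d, phi_d) for each document
        stats = np.zeros((K, V))
        for W in corpus:
            gamma_d, phi_d = mean_field_lda_doc(W, alpha, beta)
            for n, w in enumerate(W):
                stats[:, w] += phi_d[n]                # accumulate expected topic-word counts
        # M step: beta_kj proportional to expected counts (alpha held fixed here)
        beta = (stats + 1e-12) / (stats + 1e-12).sum(axis=1, keepdims=True)
    return beta
```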

Example An example article from the AP corpus (Blei et al, 2003) Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 75 / 86

Example An example article from the Science corpus (1880-2002) (Blei & Lafferty, 2009) Long Nguyen (Univ of Michigan) Clustering, mixture models & BNP VIASM, Hanoi 2012 76 / 86