Gaussian Mixture Model


Case Study: Document Retrieval
MAP EM, Latent Dirichlet Allocation, Gibbs Sampling
Machine Learning/Statistics for Big Data, CSE599C/STAT59, University of Washington
Emily Fox, February 5th

Gaussian Mixture Model
- Most commonly used mixture model
- Observations: $x_1, \dots, x_N$
- Parameters: $\theta = \{\pi, \phi\}$, with $\pi = [\pi_1, \dots, \pi_K]$ and $\phi = \{\phi_k\} = \{\mu_k, \Sigma_k\}$
- Cluster indicator of observation $i$: $z_i$
- Likelihood: $p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, p(x_i \mid \phi_k)$
- Example: $z_i$ = country of origin, $x_i$ = height of the $i$-th person; the $k$-th mixture component is the distribution of heights in country $k$
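The mixture likelihood above can be evaluated directly. The following is a minimal sketch for the one-dimensional case, assuming scalar variances in place of covariance matrices; the function names and the height values are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """p(x | theta) = sum_k pi_k N(x; mu_k, sigma_k^2)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def log_likelihood(xs, weights, means, variances):
    """Sum of log p(x_i | theta) over independent observations."""
    return sum(math.log(mixture_density(x, weights, means, variances)) for x in xs)

# Hypothetical heights (cm) drawn from two "countries" (assumed values).
heights = [158.0, 162.5, 171.0, 180.5, 183.0]
print(log_likelihood(heights, [0.5, 0.5], [160.0, 180.0], [25.0, 25.0]))
```

Summing over components inside the logarithm is what prevents a closed-form maximum likelihood solution and motivates EM.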

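Motivates EM Algorithm
- Initial guess: $\hat\theta^{(0)}$
- Estimate at iteration $t$: $\hat\theta^{(t)}$
- E-Step: compute $U(\theta, \hat\theta^{(t)}) = E[\log p(y \mid \theta) \mid x, \hat\theta^{(t)}]$, where $y$ denotes the complete data
- M-Step: compute $\hat\theta^{(t+1)} = \arg\max_\theta U(\theta, \hat\theta^{(t)})$

MAP Estimation
- Bayesian approach: place a prior $p(\theta)$ on the parameters and infer the posterior $p(\theta \mid x)$
- Many, many, many motivations and implications
- For the sake of this class, the simplest motivation is to think of this as akin to regularization
- $\hat\theta^{MAP} = \arg\max_\theta \log p(\theta \mid x) = \arg\max_\theta \left[ \log p(x \mid \theta) + \log p(\theta) \right]$
- Saw the importance of regularization in logistic regression (the ML estimate can overfit the data and lead to poor generalization)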
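The E-step/M-step alternation can be sketched for a one-dimensional mixture of Gaussians. This is an illustrative maximum likelihood version (no prior), assuming scalar variances; `e_step`, `m_step`, and the toy data are hypothetical names and values, not taken from the lecture.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, weights, means, variances):
    """Responsibilities r[i][k] = p(z_i = k | x_i, current parameters)."""
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(xs, resp):
    """Maximize the expected complete-data log likelihood in closed form."""
    n, num_clusters = len(xs), len(resp[0])
    weights, means, variances = [], [], []
    for k in range(num_clusters):
        r_k = sum(r[k] for r in resp)                      # expected cluster size
        mean_k = sum(r[k] * x for r, x in zip(resp, xs)) / r_k
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, xs)) / r_k
        weights.append(r_k / n)
        means.append(mean_k)
        variances.append(var_k)
    return weights, means, variances

# Toy data with two well-separated groups (assumed values).
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
params = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
for _ in range(5):
    params = m_step(data, e_step(data, *params))
print(params)
```

Without a prior, a component variance can shrink toward zero when a cluster captures very few points, which is one practical reason to move to the MAP variant.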


EM Algorithm: MAP Case
- Re-derive the EM algorithm for $p(\theta \mid x)$
- Add $\log p(\theta)$ to $U(\theta, \hat\theta^{(t)})$
- What must be computed in the E-Step remains unchanged, because this term does not depend on $y$
- M-Step becomes: $\hat\theta^{(t+1)} = \arg\max_\theta \left[ U(\theta, \hat\theta^{(t)}) + \log p(\theta) \right]$

MAP EM Example: MoG
- For a mixture of Gaussians, the conjugate priors are: $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$
- [Figure: Dirichlet densities over the probability simplex for several parameter settings, including Dir(4, 4, 4) and Dir(4, 9, 7)]

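MAP EM Example: MoG (Dirichlet prior)
- For a mixture of Gaussians, the conjugate priors are: $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$
- Density: $p(\pi \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \pi_k^{\alpha_k - 1}$
- Dirichlet posterior: assume we condition on observations $z_1, \dots, z_N$ and count the occurrences $n_k$ of $z_i = k$. Then $p(\pi \mid \alpha, z_1, \dots, z_N) \propto \prod_k \pi_k^{\alpha_k + n_k - 1}$
- Conjugacy: this posterior has the same form as the prior

MAP EM Example: MoG (M-Step)
- Conjugate priors: $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$ and $\{\mu_k, \Sigma_k\} \sim \mathrm{NIW}(m_0, \kappa_0, \nu_0, S_0)$
- Results in the following M-Step, where $r_k$ is the total responsibility of cluster $k$, $\bar{x}_k$ and $S_k$ are the responsibility-weighted mean and covariance of cluster $k$, and $d$ is the data dimension:
  $\hat\pi_k = \frac{r_k + \alpha_k - 1}{N + \sum_j \alpha_j - K}$
  $\hat\mu_k = \frac{r_k \bar{x}_k + \kappa_0 m_0}{r_k + \kappa_0}$
  $\hat\Sigma_k = \frac{S_0 + r_k S_k + \frac{\kappa_0 r_k}{\kappa_0 + r_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T}{\nu_0 + r_k + d + 2}$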
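The MAP updates for the mixture weights and means are simple enough to check numerically. The sketch below assumes scalar data and covers only the weight and mean updates; the helper names are hypothetical. Note that the weight formula is a valid mode only when every $\alpha_k \geq 1$.

```python
def map_weights(r, alpha):
    """pi_k = (r_k + alpha_k - 1) / (N + sum_j alpha_j - K), with N = sum_k r_k."""
    num_clusters = len(r)
    denom = sum(r) + sum(alpha) - num_clusters
    return [(r_k + a_k - 1.0) / denom for r_k, a_k in zip(r, alpha)]

def map_mean(r_k, xbar_k, m0, kappa0):
    """mu_k = (r_k * xbar_k + kappa0 * m0) / (r_k + kappa0)."""
    return (r_k * xbar_k + kappa0 * m0) / (r_k + kappa0)

# With alpha_k = 1 the prior is flat and the ML weights r_k / N are recovered.
print(map_weights([3.0, 1.0], [1.0, 1.0]))
# With alpha_k = 2 the estimate is pulled toward uniform weights.
print(map_weights([3.0, 1.0], [2.0, 2.0]))
# kappa0 acts like a count of pseudo-observations located at m0.
print(map_mean(4.0, 2.0, 0.0, 4.0))
```

Reading $\alpha_k - 1$ and $\kappa_0$ as pseudo-counts makes the regularization interpretation of the prior concrete.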

Posterior Computations
- MAP EM focuses on point estimation: $\hat\theta^{MAP} = \arg\max_\theta p(\theta \mid x)$
- What if we want a full characterization of the posterior?
  - Maintain a measure of uncertainty
  - Estimators other than the posterior mode (different loss functions)
  - Predictive distributions for future observations
- Often there is no closed-form characterization (e.g., mixture models)
- Alternatives:
  - Monte Carlo based estimates using samples from the posterior
  - Variational approximations to the posterior (more next time)

Gibbs Sampling
- Want draws from a target distribution
- Construct a Markov chain whose steady-state distribution is the target
- Simplest case:
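As a concrete illustration of Gibbs sampling, consider a textbook two-variable case (an assumption chosen for illustration; the slide's own example is not recoverable from the extracted text). For a standard bivariate Gaussian with correlation $\rho$, each full conditional is $N(\rho \cdot \text{other}, 1 - \rho^2)$, so the sampler alternates two one-dimensional draws.

```python
import math
import random

def gibbs_bivariate_normal(rho, num_samples, seed=0):
    """Alternate draws from the two full conditionals of a standard
    bivariate Gaussian with correlation rho."""
    rng = random.Random(seed)
    cond_std = math.sqrt(1.0 - rho ** 2)
    a, b = 0.0, 0.0
    samples = []
    for _ in range(num_samples):
        a = rng.gauss(rho * b, cond_std)   # a | b
        b = rng.gauss(rho * a, cond_std)   # b | a
        samples.append((a, b))
    return samples

draws = gibbs_bivariate_normal(0.8, 20000, seed=1)[1000:]   # drop burn-in
mean_a = sum(a for a, _ in draws) / len(draws)
corr_estimate = sum(a * b for a, b in draws) / len(draws)
print(mean_a, corr_estimate)
```

Successive draws are correlated, so the effective sample size is smaller than the number of iterations; higher $\rho$ makes the chain mix more slowly.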

Example: Mixture of Gaussians
- Recall the model:
  - Observations: $x_1, \dots, x_N$
  - Cluster indicators: $z_1, \dots, z_N$
  - Parameters: $\theta = \{\pi, \phi\}$, with $\pi = [\pi_1, \dots, \pi_K]$ and $\phi = \{\phi_k\} = \{\mu_k, \Sigma_k\}$
- Generative model:
  - $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$
  - $\{\mu_k, \Sigma_k\} \sim F(\lambda)$
  - $z_i \sim \pi$
  - $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Want to draw posterior samples of the model parameters: $p(\pi, \phi \mid x_1, \dots, x_N)$

Auxiliary Variable Samplers
- Augment the variables of interest $\theta$ with auxiliary variables $z$ to allow closed-form sampling steps, just like in EM
- In both cases, simply looking at the subchain $\{\theta^{(t)}\}$ converges to draws from the marginal distribution of $\theta$

Example: Mixture of Gaussians
- Model: $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Try an auxiliary variable sampler: introduce the cluster indicators $z_i$ into the sampler

Example: Clustering Results I
- [Figure, courtesy of Erik Sudderth: learning a mixture of K = 4 Gaussians using the Gibbs sampler. Columns show the current parameters after increasing numbers of iterations (top to bottom, ending at T = 50) from two random initializations. Each plot is labeled by the current data log likelihood.]
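One sweep of an auxiliary variable Gibbs sampler for this model can be sketched as follows. To keep the example short it assumes one-dimensional data, a known shared variance `var`, a symmetric Dirichlet prior with parameter `alpha`, and a Normal prior $N(m_0, \text{var}/\kappa_0)$ on each mean; these simplifications and all function names are assumptions, not the lecture's exact setup.

```python
import math
import random

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs (unnormalized)."""
    u = rng.random() * sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1

def sample_dirichlet(alphas, rng):
    """Dirichlet draw obtained by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def gibbs_sweep(xs, weights, means, var, alpha, m0, kappa0, rng):
    num_clusters = len(weights)
    # 1. Indicators: z_i | pi, mu, x_i is proportional to pi_k N(x_i; mu_k, var).
    z = []
    for x in xs:
        probs = [weights[k] * math.exp(-0.5 * (x - means[k]) ** 2 / var)
                 for k in range(num_clusters)]
        z.append(sample_categorical(probs, rng))
    # 2. Weights: pi | z ~ Dir(alpha + n_1, ..., alpha + n_K).
    counts = [z.count(k) for k in range(num_clusters)]
    weights = sample_dirichlet([alpha + n for n in counts], rng)
    # 3. Means: conjugate Normal update given the points assigned to each cluster.
    new_means = []
    for k in range(num_clusters):
        total = sum(x for x, zi in zip(xs, z) if zi == k)
        post_prec = kappa0 + counts[k]
        post_mean = (kappa0 * m0 + total) / post_prec
        new_means.append(rng.gauss(post_mean, math.sqrt(var / post_prec)))
    return z, weights, new_means

rng = random.Random(0)
data = [-1.0 + 0.1 * j for j in range(20)] + [9.0 + 0.1 * j for j in range(20)]
weights, means = [0.5, 0.5], [2.0, 8.0]
for _ in range(50):
    z, weights, means = gibbs_sweep(data, weights, means, 1.0, 1.0, 5.0, 0.01, rng)
print(weights, means)
```

Given the indicators, every parameter update is a standard conjugate draw, which is the point of introducing $z$ as auxiliary variables.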

Collapsed Gibbs Samplers
- Marginalize a set of latent variables or parameters
  - Sometimes the marginalized variables are nuisance parameters
  - Other times what gets marginalized are the variables of interest; make post-facto inferences on the variables of interest based on the sampled variables
- Can improve efficiency if the marginalized variables are high-dimensional
  - Reduced dimension of the search space
  - But, often introduces dependences!

Example: Collapsed MoG Sampling
- Model: $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Collapsed sampler: marginalize $\pi$ and $\{\mu_k, \Sigma_k\}$, and sample only the cluster indicators $z_i$

Example: Collapsed MoG Sampling (Derivation)
- Model: $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Important facts:
  - $p(z_{1:N} \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \cdot \frac{\prod_k \Gamma(n_k + \alpha_k)}{\Gamma(\sum_k n_k + \alpha_k)}$
  - $\frac{\Gamma(m + 1)}{\Gamma(m)} = m$

Example: Clustering Results II
- [Figure, courtesy of Erik Sudderth: learning a mixture of K = 4 Gaussians using the Rao-Blackwellized Gibbs sampler. Columns show the current parameters after increasing numbers of iterations (top to bottom, ending at T = 50) from two random initializations. Each plot is labeled by the current data log likelihood.]
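A collapsed sampler for this model resamples each $z_i$ given all other assignments, with $\pi$ and the cluster parameters integrated out. The sketch below assumes one-dimensional data, a known shared variance, a symmetric Dirichlet prior with total mass `alpha` (so each cluster gets `alpha / K`), and a Normal prior $N(m_0, \text{var}/\kappa_0)$ on each mean, under which the predictive likelihood is Gaussian. All names and simplifications are assumptions for illustration.

```python
import math
import random

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs (unnormalized)."""
    u = rng.random() * sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1

def predictive_density(x, n, s, var, m0, kappa0):
    """p(x | n points with sum s already in the cluster), mean integrated out."""
    post_prec = kappa0 + n
    mean = (kappa0 * m0 + s) / post_prec
    pred_var = var * (1.0 + 1.0 / post_prec)
    return math.exp(-0.5 * (x - mean) ** 2 / pred_var) / math.sqrt(2.0 * math.pi * pred_var)

def collapsed_sweep(xs, z, num_clusters, alpha, var, m0, kappa0, rng):
    # Cached sufficient statistics: per-cluster counts and sums.
    counts = [0] * num_clusters
    sums = [0.0] * num_clusters
    for x, zi in zip(xs, z):
        counts[zi] += 1
        sums[zi] += x
    order = list(range(len(xs)))
    rng.shuffle(order)                      # random visiting order
    for i in order:
        x = xs[i]
        counts[z[i]] -= 1                   # remove x_i from its cluster
        sums[z[i]] -= x
        probs = [(counts[k] + alpha / num_clusters)
                 * predictive_density(x, counts[k], sums[k], var, m0, kappa0)
                 for k in range(num_clusters)]
        z[i] = sample_categorical(probs, rng)
        counts[z[i]] += 1                   # add x_i to its new cluster
        sums[z[i]] += x
    return z

rng = random.Random(0)
data = [-1.0 + 0.1 * j for j in range(20)] + [9.0 + 0.1 * j for j in range(20)]
z = [rng.randrange(2) for _ in data]
for _ in range(20):
    z = collapsed_sweep(data, z, 2, 1.0, 1.0, 5.0, 0.1, rng)
print(z)
```

Because the parameters are integrated out, each reassignment immediately changes the statistics seen by the next point, which is the dependence that collapsing introduces.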


Comparing Collapsed vs. Regular

Rao-Blackwellized Gibbs sampler (excerpt courtesy of Erik Sudderth):
Given previous cluster assignments $z^{(t-1)}$, sequentially sample new assignments as follows:
1. Sample a random permutation $\tau(\cdot)$ of the integers $\{1, \dots, N\}$.
2. Set $z = z^{(t-1)}$. For each $i \in \{\tau(1), \dots, \tau(N)\}$, sequentially resample $z_i$ as follows:
   (a) For each of the $K$ clusters, determine the predictive likelihood $f_k(x_i) = p(x_i \mid \{x_j \mid z_j = k, j \neq i\}, \lambda)$. This likelihood can be computed from cached sufficient statistics.
   (b) Sample a new cluster assignment $z_i$ from the following multinomial distribution:
       $z_i \sim \frac{1}{Z_i} \sum_{k=1}^{K} (N_k^{-i} + \alpha/K) \, f_k(x_i) \, \delta(z_i, k)$, with $Z_i = \sum_{k=1}^{K} (N_k^{-i} + \alpha/K) \, f_k(x_i)$,
       where $N_k^{-i}$ is the number of other observations assigned to cluster $k$.
   (c) Update cached sufficient statistics to reflect the assignment of $x_i$ to cluster $z_i$.
3. Set $z^{(t)} = z$. Optionally, mixture parameters may be sampled via the parameter-sampling steps of the standard Gibbs sampler.

Algorithm caption: Rao-Blackwellized Gibbs sampler for a $K$ component exponential family mixture model. Each iteration sequentially resamples the cluster assignments for all $N$ observations $x = \{x_i\}_{i=1}^{N}$ in a different random order. Mixture parameters are integrated out of the sampling recursion using cached sufficient statistics of the observations assigned to each cluster.

[Figure, courtesy of Erik Sudderth: log likelihood vs. Gibbs iteration (multiple chains) for the standard Gibbs sampler (dark blue) and the Rao-Blackwellized sampler (light red), for a mixture of K = 4 two-dimensional Gaussians. Data log likelihoods are compared at each iteration on the single dataset used in Example Clustering Results I and II. Left: log likelihood sequences for several random initializations of each algorithm. Right: over many random initializations, the median (solid), 0.25 and 0.75 quantiles (thick dashed), and 0.05 and 0.95 quantiles (thin dashed) of the resulting log likelihood sequences.]

The Rao-Blackwellized sampler has superior typical performance, but occasionally remains trapped in local optima for many iterations (see the right columns of the two clustering result figures). These results suggest that while Rao-Blackwellization can usefully accelerate mixing, convergence diagnostics are still important.

Task: Cluster Documents
- Previously: cluster documents based on topic

A Generative Model
- Documents: $x^1, \dots, x^D$
- Associated topics: $z^1, \dots, z^D$ (one topic indicator per document)
- Parameters: $\theta = \{\pi, \beta\}$
- Generative model:

Task: Cluster Documents
- Now: a document may belong to multiple clusters (e.g., EDUCATION, FINANCE, TECHNOLOGY)

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA)
- [Figure: topics, per-document topic proportions and assignments, and documents]
- But we only observe the documents; the other structure is hidden
- We compute the posterior $p(\text{topics}, \text{proportions}, \text{assignments} \mid \text{documents})$

Example Inference: Topic Weights
- Data: the OCR'ed collection of Science (stop words and rare words removed)
- Model: LDA topic model
- [Figure: inferred topic probabilities for an example document; axes: topics vs. probability]

Example Inference: Topic Words
Top words of four inferred topics (one topic per column):

  human        evolution      disease        computer
  genome       evolutionary   host           models
  dna          species        bacteria       information
  genetic      organisms      diseases       data
  genes        life           resistance     computers
  sequence     origin         bacterial      system
  gene         biology        new            network
  molecular    groups         strains        systems
  sequencing   phylogenetic   control        model
  map          living         infectious     parallel
  information  diversity      malaria        methods
  genetics     group          parasite       networks
  mapping      new            parasites      software
  project      two            united         new
  sequences    common         tuberculosis   simulations

LDA Generative Model
- Observations: $w_1^d, \dots, w_{N_d}^d$
- Associated topics: $z_1^d, \dots, z_{N_d}^d$
- Parameters: $\theta = \{\{\pi^d\}, \{\beta_k\}\}$
- Generative model:
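The LDA generative process can be simulated in a few lines. This sketch assumes symmetric Dirichlet priors (`alpha` on the per-document topic weights, `lam` on the per-topic word distributions), a fixed document length, and integer word ids; these choices and all function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Dirichlet draw obtained by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    u = rng.random() * sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1

def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, lam, rng):
    # Corpus-wide topics: one distribution over the vocabulary per topic.
    topics = [sample_dirichlet([lam] * vocab_size, rng) for _ in range(num_topics)]
    docs, assignments = [], []
    for _ in range(num_docs):
        pi_d = sample_dirichlet([alpha] * num_topics, rng)    # document topic weights
        z_d = [sample_categorical(pi_d, rng) for _ in range(doc_len)]
        w_d = [sample_categorical(topics[k], rng) for k in z_d]
        docs.append(w_d)
        assignments.append(z_d)
    return topics, docs, assignments

rng = random.Random(0)
topics, docs, assignments = generate_corpus(4, 6, 3, 5, 0.5, 0.5, rng)
print(docs)
```

Unlike the single-topic document model, each word gets its own topic indicator, so one document can mix several topics.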

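LDA Generative Model
- [Graphical model: per-word topic indicators $z_i^d$ and words $w_i^d$ inside plates over the $N_d$ words of each of the $D$ documents; $K$ topics]
- Joint distribution:
  $p(\pi, \beta, z, w \mid \alpha, \lambda) = \prod_{k=1}^{K} p(\beta_k \mid \lambda) \prod_{d=1}^{D} p(\pi^d \mid \alpha) \left( \prod_{i=1}^{N_d} p(z_i^d \mid \pi^d) \, p(w_i^d \mid z_i^d, \beta) \right)$

Collapsed LDA Sampling
- Marginalize parameters:
  - Document-specific topic weights
  - Corpus-wide topic-specific word distributions
- Sample topic indicators for each word
- Derivation, with $n_k^d$ the number of words in document $d$ assigned to topic $k$ and $v_j^k$ the number of times word $j$ is assigned to topic $k$:
  $p(z_{1:N_d}^d \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \cdot \frac{\prod_k \Gamma(n_k^d + \alpha_k)}{\Gamma(\sum_k n_k^d + \alpha_k)}$
  $p(z \mid \alpha) = \prod_{d=1}^{D} p(z_{1:N_d}^d \mid \alpha)$
  $p(\{w_i^d \mid z_i^d = k\} \mid \lambda) = \frac{\Gamma(\sum_j \lambda_j)}{\prod_j \Gamma(\lambda_j)} \cdot \frac{\prod_j \Gamma(v_j^k + \lambda_j)}{\Gamma(\sum_j v_j^k + \lambda_j)}$
  $p(w \mid z, \lambda) = \prod_{k=1}^{K} p(\{w_i^d \mid z_i^d = k\} \mid \lambda)$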
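The Dirichlet-multinomial marginal used in this derivation, together with the identity $\Gamma(m+1)/\Gamma(m) = m$, implies a simple predictive rule: adding one more assignment to topic $k$ multiplies the marginal by $(n_k + \alpha_k) / \sum_j (n_j + \alpha_j)$. The sketch below checks this numerically; the function names and the example counts are hypothetical.

```python
import math

def log_marginal_assignments(counts, alphas):
    """log p(z | alpha) for assignments summarized by per-topic counts."""
    a0, n = sum(alphas), sum(counts)
    out = math.lgamma(a0) - math.lgamma(n + a0)
    for n_k, a_k in zip(counts, alphas):
        out += math.lgamma(n_k + a_k) - math.lgamma(a_k)
    return out

def predictive_prob(k, counts, alphas):
    """p(next assignment = k | previous assignments), weights integrated out."""
    return (counts[k] + alphas[k]) / (sum(counts) + sum(alphas))

counts, alphas = [3, 1, 0], [0.5, 0.5, 0.5]
for k in range(3):
    bumped = list(counts)
    bumped[k] += 1
    ratio = math.exp(log_marginal_assignments(bumped, alphas)
                     - log_marginal_assignments(counts, alphas))
    print(k, ratio, predictive_prob(k, counts, alphas))
```

The two printed columns agree for every topic, which is why the collapsed sampler only needs count tables rather than explicit Gamma function evaluations.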

Collapsed LDA Sampling
- Marginalize parameters:
  - Document-specific topic weights
  - Corpus-wide topic-specific word distributions
- Sample topic indicators for each word
- Algorithm: walked through step by step on a sample document

Sample Document
- Example words: Etruscan, trade, market

Randomly Assign Topics
- Assign a random topic indicator $z_i^d$ to each word of the sample document (Etruscan, trade, market)
- [Figure: the same random initialization applied to every document in the corpus]

Maintain Global Statistics
- Total counts from all docs
- [Table: number of times each word (Etruscan, market, trade) is assigned to each topic, accumulated over all documents]

Resample Assignments
- Revisit each topic indicator $z_i^d$ in turn, using the document's current assignments and the global count table

What is the conditional distribution for this topic?
- Setting: resample the topic indicator $z_i^d$ of one word (here "trade") in the sample document, holding all other assignments fixed
- Part I: How much does this document like each topic?
- Part II: How much does each topic like this word?
- Combining both parts, for each topic $k$:
  $p(z_i^d = k \mid \text{all other assignments}, w) \propto \frac{n_k^d + \alpha_k}{\sum_{j=1}^{K} (n_j^d + \alpha_j)} \cdot \frac{v_{\text{trade}}^k + \lambda_{\text{trade}}}{\sum_{j=1}^{V} (v_j^k + \lambda_j)}$
  where the counts $n_k^d$ and $v_j^k$ exclude the word currently being resampled

Sample a New Topic Indicator
- Draw the new value of $z_i^d$ from this conditional distribution over the topics
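The two-part conditional can be computed from three small count tables. The sketch assumes symmetric priors (`alpha`, `lam`) and uses made-up counts; the function name and all numbers are hypothetical. The document-side denominator is the same for every topic, so it is dropped before normalizing.

```python
def topic_conditional(doc_topic_counts, word_topic_counts, topic_totals, alpha, lam, vocab_size):
    """Normalized p(z = k | rest) for one word, with that word's own
    assignment already removed from all counts.

    doc_topic_counts[k]  : words in this document assigned to topic k
    word_topic_counts[k] : times this word type is assigned to topic k (all docs)
    topic_totals[k]      : total words assigned to topic k (all docs)
    """
    scores = []
    for n_dk, v_kw, v_k in zip(doc_topic_counts, word_topic_counts, topic_totals):
        doc_part = n_dk + alpha                                  # Part I
        word_part = (v_kw + lam) / (v_k + vocab_size * lam)      # Part II
        scores.append(doc_part * word_part)
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical counts for the word "trade" with three topics.
probs = topic_conditional([2, 0, 1], [10, 7, 1], [20, 10, 5],
                          alpha=1.0, lam=1.0, vocab_size=3)
print(probs)
```

A topic scores highly only if the document already uses it and it already explains this word elsewhere in the corpus.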

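Update Counts
- After sampling the new topic indicator $z_i^d$, update the document's topic counts and the global word-topic count table to reflect the new assignment

Geometrically
- [Figure: the sample document (Etruscan, trade, market) positioned relative to the topics]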
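Putting the steps together (random assignment, global statistics, resampling, count updates), one sweep of a collapsed Gibbs sampler for LDA might look like the sketch below. It assumes symmetric priors and integer word ids; the toy corpus, the word-id mapping, and all function names are hypothetical.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs (unnormalized)."""
    u = rng.random() * sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1

def init_counts(docs, z, num_topics, vocab_size):
    """Build the count tables from the current topic assignments."""
    n_dk = [[0] * num_topics for _ in docs]           # document-topic counts
    v_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    v_k = [0] * num_topics                            # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1
            v_kw[k][w] += 1
            v_k[k] += 1
    return n_dk, v_kw, v_k

def lda_gibbs_sweep(docs, z, n_dk, v_kw, v_k, alpha, lam, rng):
    num_topics, vocab_size = len(v_k), len(v_kw[0])
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            n_dk[d][k_old] -= 1                       # remove the current assignment
            v_kw[k_old][w] -= 1
            v_k[k_old] -= 1
            probs = [(n_dk[d][k] + alpha) * (v_kw[k][w] + lam) / (v_k[k] + vocab_size * lam)
                     for k in range(num_topics)]
            k_new = sample_categorical(probs, rng)
            z[d][i] = k_new
            n_dk[d][k_new] += 1                       # update counts
            v_kw[k_new][w] += 1
            v_k[k_new] += 1

# Toy corpus with word ids 0 = Etruscan, 1 = trade, 2 = market (assumed mapping).
rng = random.Random(0)
docs = [[0, 1, 2, 1], [2, 2, 0], [1, 0]]
z = [[rng.randrange(3) for _ in doc] for doc in docs]
n_dk, v_kw, v_k = init_counts(docs, z, 3, 3)
for _ in range(5):
    lda_gibbs_sweep(docs, z, n_dk, v_kw, v_k, 0.5, 0.5, rng)
print(z)
```

Every sweep touches every word of every document, which is the per-iteration cost that large-scale and distributed variants try to reduce.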

Issues with Generic LDA Sampling
- Slow mixing rates → need many iterations
- Each iteration cycles through sampling topic assignments for all words in all documents
- Modern approaches:
  - Large-scale LDA. For example: Mimno, David, Matthew D. Hoffman, and David M. Blei. "Sparse stochastic inference for latent Dirichlet allocation." International Conference on Machine Learning.
  - Distributed LDA. For example: Ahmed, Amr, et al. "Scalable inference in latent variable models." Proceedings of the fifth ACM international conference on Web search and data mining.
- Next time: variational methods instead of sampling

Acknowledgements
- Thanks to Dave Blei, David Mimno, and Jordan Boyd-Graber for some material in this lecture relating to LDA
