
Case Study: Document Retrieval
MAP EM, Latent Dirichlet Allocation, Gibbs Sampling
Machine Learning/Statistics for Big Data, CSE599C1/STAT592, University of Washington
Emily Fox, February 5th, 2013

Gaussian Mixture Model
- Most commonly used mixture model.
- Observations: $x_1, \ldots, x_N$
- Parameters: $\theta = \{\pi, \{\theta_k\}\}$, with $\pi = [\pi_1, \ldots, \pi_K]$ and $\theta_k = \{\mu_k, \Sigma_k\}$
- Likelihood: $p(x_i \mid \theta) = \sum_{k=1}^K \pi_k \, p(x_i \mid \theta_k)$
- Example: $z_i$ = country of origin, $x_i$ = height of the $i$th person; the $k$th mixture component is the distribution of heights in country $k$.
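
To make the mixture likelihood concrete, here is a minimal sketch in Python (not from the lecture; the toy parameter values are assumptions) that evaluates $\log p(x_i \mid \theta)$ under a two-component Gaussian mixture:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, covs):
    """log p(x_i) = log sum_k pi_k N(x_i; mu_k, Sigma_k), evaluated for each row of X."""
    # Per-component weighted log densities, shape (N, K)
    log_comp = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return logsumexp(log_comp, axis=1)   # sum over components in log space

# Toy 2-D example with illustrative (assumed) parameters
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
X = np.random.randn(5, 2)
print(gmm_log_likelihood(X, weights, means, covs))
```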

Motivates EM Algorithm
- Initial guess: $\hat\theta^{(0)}$; estimate at iteration $t$: $\hat\theta^{(t)}$
- E-Step: compute $U(\theta, \hat\theta^{(t)}) = E[\log p(y \mid \theta) \mid x, \hat\theta^{(t)}]$
- M-Step: compute $\hat\theta^{(t+1)} = \arg\max_\theta U(\theta, \hat\theta^{(t)})$

MAP Estimation
- Bayesian approach: place a prior $p(\theta)$ on the parameters and infer the posterior $p(\theta \mid x)$.
- There are many motivations and implications; for the sake of this class, the simplest motivation is to think of this as akin to regularization:
  $\hat\theta_{MAP} = \arg\max_\theta \log p(\theta \mid x)$
- We saw the importance of regularization in logistic regression (the ML estimate can overfit the data and lead to poor generalization).

EM Algorithm, MAP Case
- Re-derive the EM algorithm for $p(\theta \mid x)$: add $\log p(\theta)$ to $U(\theta, \hat\theta^{(t)})$.
- What must be computed in the E-Step remains unchanged, because this term does not depend on $y$.
- The M-Step becomes: $\hat\theta^{(t+1)} = \arg\max_\theta \left[ U(\theta, \hat\theta^{(t)}) + \log p(\theta) \right]$

MAP EM Example: MoG
- For a mixture of Gaussians, the conjugate prior on the mixture weights is $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$.
[Figure: Dirichlet densities on the probability simplex for several settings of the concentration parameters, including symmetric settings and a sparse setting with values below 1, showing how the concentration parameters control where prior mass is placed on $\pi$.]
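
As a quick numerical illustration of how the concentration parameters shape the prior on $\pi$ (a sketch, not from the slides; the specific $\alpha$ values are assumptions), one can draw a few samples from different Dirichlet priors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative concentration parameters (assumed, for intuition only):
# uniform, symmetric and concentrated, and sparse (alpha_k < 1).
for alpha in ([1.0, 1.0, 1.0], [10.0, 10.0, 10.0], [0.2, 0.2, 0.2]):
    samples = rng.dirichlet(alpha, size=5)   # each row sums to 1
    print(alpha, "->")
    print(np.round(samples, 3))
```

Larger symmetric values pull samples toward the center of the simplex; values below 1 push samples toward the corners (sparse mixture weights).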

MAP EM Example: MoG
- For a mixture of Gaussians, the conjugate prior on the weights is $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$:
  $p(\pi \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \pi_k^{\alpha_k - 1}$
- Dirichlet posterior: assume we condition on the observations $z_1, \ldots, z_N$ and count the occurrences of $z_i = k$, denoted $N_k$. Then
  $p(\pi \mid z_1, \ldots, z_N) \propto \prod_k \pi_k^{N_k + \alpha_k - 1}$
- Conjugacy: this posterior has the same form as the prior (a Dirichlet with updated parameters).

MAP EM Example: MoG
- For a mixture of Gaussians, the conjugate priors are $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$ and $\{\mu_k, \Sigma_k\} \sim \text{NIW}(m_0, \kappa_0, \nu_0, S_0)$.
- This results in the following M-Step, where $r_k$ is the expected count for cluster $k$, $\bar{x}_k$ the responsibility-weighted mean, and $S_k$ the responsibility-weighted scatter (sketched in code below):
  $\hat\mu_k = \frac{r_k \bar{x}_k + \kappa_0 m_0}{r_k + \kappa_0}$
  $\hat\pi_k = \frac{r_k + \alpha_k - 1}{N + \sum_j \alpha_j - K}$
  $\hat\Sigma_k = \frac{S_0 + r_k S_k + \frac{\kappa_0 r_k}{\kappa_0 + r_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T}{\nu_0 + r_k + d + 2}$
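
A minimal sketch of this MAP M-step in Python, assuming the E-step responsibilities `resp` have already been computed; the variable names are illustrative, and the scatter term is accumulated in unnormalized form (equal to $r_k S_k$ above):

```python
import numpy as np

def map_m_step(X, resp, alpha, m0, kappa0, nu0, S0):
    """MAP M-step for a Gaussian mixture with Dir(alpha) and NIW(m0, kappa0, nu0, S0) priors.
    X: (N, d) data; resp: (N, K) E-step responsibilities; alpha: (K,) Dirichlet parameters."""
    N, d = X.shape
    K = resp.shape[1]
    r = resp.sum(axis=0)                              # expected counts r_k
    xbar = (resp.T @ X) / r[:, None]                  # responsibility-weighted means
    pi_hat = (r + alpha - 1.0) / (N + alpha.sum() - K)
    mu_hat = (r[:, None] * xbar + kappa0 * m0) / (r + kappa0)[:, None]
    Sigma_hat = np.empty((K, d, d))
    for k in range(K):
        diff = X - xbar[k]
        Sk = (resp[:, k, None] * diff).T @ diff       # unnormalized weighted scatter (= r_k S_k)
        dm = (xbar[k] - m0)[:, None]
        Sigma_hat[k] = (S0 + Sk + (kappa0 * r[k]) / (kappa0 + r[k]) * (dm @ dm.T)) \
                       / (nu0 + r[k] + d + 2)
    return pi_hat, mu_hat, Sigma_hat
```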

Posterior Computations
- MAP EM focuses on point estimation: $\hat\theta_{MAP} = \arg\max_\theta p(\theta \mid x)$
- What if we want a full characterization of the posterior?
  - Maintain a measure of uncertainty
  - Estimators other than the posterior mode (different loss functions)
  - Predictive distributions for future observations
- Often there is no closed-form characterization (e.g., mixture models). Alternatives:
  - Monte Carlo estimates using samples from the posterior
  - Variational approximations to the posterior (more next time)

Gibbs Sampling
- Want draws from the posterior $p(\theta \mid x)$.
- Construct a Markov chain whose steady-state distribution is that posterior.
- Simplest case: cycle through the variables, sampling each one from its full conditional given the current values of all the others.
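
As a minimal illustration of the idea (not from the lecture), here is a two-variable Gibbs sampler for a bivariate Gaussian with correlation rho, where each full conditional is a univariate Gaussian:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, seed=0):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
        samples[t] = (x1, x2)
    return samples

draws = gibbs_bivariate_normal(rho=0.9)
print(np.corrcoef(draws[1000:].T))   # empirical correlation approaches 0.9 after burn-in
```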

Example: Mixture of Gaussians
- Recall the model:
  - Observations: $x_1, \ldots, x_N$
  - Cluster indicators: $z_1, \ldots, z_N$
  - Parameters: $\theta = \{\pi, \{\mu_k, \Sigma_k\}\}$, with $\pi = [\pi_1, \ldots, \pi_K]$
- Generative model: $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Want to draw posterior samples of the model parameters, $p(\pi, \theta \mid x_1, \ldots, x_N)$ (or jointly with the indicators, $p(z, \pi, \theta \mid x_1, \ldots, x_N)$).

Auxiliary Variable Samplers
- Augment the variables of interest with auxiliary variables (here, the cluster indicators $z$) to allow closed-form sampling steps, just like in EM.
- In both cases, simply looking at the subchain $\{\theta^{(t)}\}$ converges to draws from the marginal distribution of the variables of interest.

Example: Mixture of Gaussians
- Model: $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Try an auxiliary variable sampler: introduce the cluster indicators $z_i$ into the sampler, alternating between sampling the indicators given the parameters and the parameters given the indicators (a code sketch follows the clustering results below).

Example Clustering Results I
[Figure, courtesy of Erik Sudderth: learning a mixture of K = 4 Gaussians using the standard Gibbs sampler. Columns show the current parameters after increasing numbers of iterations (up to T = 50) from two random initializations; each plot is labeled by the current data log likelihood $\log p(x \mid \pi, \theta)$.]
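
A compact sketch of this standard (auxiliary-variable) Gibbs sampler, assuming for simplicity a 1-D Gaussian mixture with known unit variance and a N(0, tau2) prior on each mean; the simplified model and conjugate updates are standard choices, not copied from the slides:

```python
import numpy as np

def gibbs_mog_1d(x, K, alpha=1.0, tau2=10.0, n_iter=200, seed=0):
    """Standard Gibbs sampler for a 1-D mixture of Gaussians with unit variance.
    Alternates: z | pi, mu, x  and  (pi, mu) | z, x.  x: 1-D numpy array."""
    rng = np.random.default_rng(seed)
    N = len(x)
    z = rng.integers(K, size=N)                 # random initial assignments
    pi = np.full(K, 1.0 / K)
    mu = rng.normal(0, 1, size=K)
    for _ in range(n_iter):
        # Sample each indicator z_i from its full conditional (categorical)
        logp = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2   # (N, K), constants cancel
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p[i]) for i in range(N)])
        # Sample pi | z from its Dirichlet posterior
        counts = np.bincount(z, minlength=K)
        pi = rng.dirichlet(alpha + counts)
        # Sample mu_k | z, x from its Gaussian posterior (unit obs. variance, N(0, tau2) prior)
        for k in range(K):
            xk = x[z == k]
            prec = 1.0 / tau2 + len(xk)
            mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))
    return z, pi, mu
```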

Collapsed Gibbs Samplers
- Marginalize out a set of latent variables or parameters.
  - Sometimes the marginalized variables are nuisance parameters.
  - Other times, what gets marginalized are the variables of interest; make post-facto inferences on them based on the sampled variables.
- Can improve efficiency if the marginalized variables are high-dimensional (reduced dimension of the search space), but marginalization often introduces dependencies among the remaining variables!

Example: Collapsed MoG Sampling
- Model: $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Collapsed sampler: integrate out $\pi$ and $\{\mu_k, \Sigma_k\}$ and sample only the cluster indicators $z_i$.

Example: Collapsed MoG Sampling
- Model: $\pi \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$, $\{\mu_k, \Sigma_k\} \sim F(\lambda)$, $z_i \sim \pi$, $x_i \mid z_i \sim N(x_i; \mu_{z_i}, \Sigma_{z_i})$
- Derivation relies on two important facts (a simplified code sketch of the resulting sampler follows the figure below):
  $p(z_{1:N} \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \cdot \frac{\prod_k \Gamma(n_k + \alpha_k)}{\Gamma(\sum_k n_k + \alpha_k)}$, where $n_k$ counts the observations assigned to cluster $k$, and
  $\frac{\Gamma(m + 1)}{\Gamma(m)} = m$

Example Clustering Results II
[Figure, courtesy of Erik Sudderth: learning a mixture of K = 4 Gaussians using the Rao-Blackwellized (collapsed) Gibbs sampler. Columns show the current parameters after increasing numbers of iterations (up to T = 50) from two random initializations; each plot is labeled by the current data log likelihood $\log p(x \mid \pi, \theta)$.]
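
A minimal sketch of one sweep of this collapsed sampler (the Rao-Blackwellized algorithm spelled out below), assuming for simplicity 1-D unit-variance Gaussian components with a N(0, tau2) prior on each mean, so the predictive likelihood of $x_i$ under each cluster has a closed form; the simplified model and variable names are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import norm

def collapsed_gibbs_sweep(x, z, K, alpha=1.0, tau2=10.0, rng=None):
    """One Rao-Blackwellized sweep: resample every z_i with pi and the means integrated out.
    Assumes unit-variance 1-D components with a N(0, tau2) prior on each mean,
    and a symmetric Dirichlet prior with total concentration alpha."""
    rng = rng or np.random.default_rng()
    counts = np.bincount(z, minlength=K).astype(float)   # cached sufficient statistics
    sums = np.array([x[z == k].sum() for k in range(K)])
    for i in rng.permutation(len(x)):                    # random order, as in the algorithm
        k_old = z[i]
        counts[k_old] -= 1; sums[k_old] -= x[i]          # remove x_i from its cluster
        # Predictive likelihood f_k(x_i) from the cached statistics (excluding x_i)
        post_prec = 1.0 / tau2 + counts
        pred_mean = sums / post_prec
        pred_sd = np.sqrt(1.0 + 1.0 / post_prec)
        weights = (counts + alpha / K) * norm.pdf(x[i], pred_mean, pred_sd)
        weights /= weights.sum()
        k_new = rng.choice(K, p=weights)
        z[i] = k_new
        counts[k_new] += 1; sums[k_new] += x[i]          # update cached statistics
    return z
```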

Comparing Collapsed vs. Regular
Rao-Blackwellized Gibbs sampler for a K-component exponential family mixture model (algorithm reproduced on the slide from Erik Sudderth's thesis). Given previous cluster assignments $z^{(t-1)}$, sequentially sample new assignments as follows:
1. Sample a random permutation $\tau(\cdot)$ of the integers $\{1, \ldots, N\}$.
2. Set $z = z^{(t-1)}$. For each $i \in \{\tau(1), \ldots, \tau(N)\}$, sequentially resample $z_i$ as follows:
   (a) For each of the $K$ clusters, determine the predictive likelihood
       $f_k(x_i) = p(x_i \mid \{x_j : z_j = k, j \neq i\}, \lambda)$.
       This likelihood can be computed from cached sufficient statistics.
   (b) Sample a new cluster assignment $z_i$ from the multinomial distribution
       $z_i \sim \frac{1}{Z_i} \sum_{k=1}^K (N_k^{-i} + \alpha/K) \, f_k(x_i) \, \delta(z_i, k)$, where $Z_i = \sum_{k=1}^K (N_k^{-i} + \alpha/K) \, f_k(x_i)$
       and $N_k^{-i}$ is the number of other observations assigned to cluster $k$.
   (c) Update the cached sufficient statistics to reflect the assignment of $x_i$ to cluster $z_i$.
3. Set $z^{(t)} = z$. Optionally, the mixture parameters may also be sampled.
Each iteration sequentially resamples the cluster assignments of all $N$ observations in a different random order; the mixture parameters are integrated out of the sampling recursion using cached sufficient statistics of the data assigned to each cluster.

Log Likelihood vs. Gibbs Iteration (multiple chains)
[Figure, courtesy of Erik Sudderth: comparison of the standard and Rao-Blackwellized Gibbs samplers for a mixture of K = 4 two-dimensional Gaussians, plotting the data log likelihood over Gibbs iterations. Left: log-likelihood sequences from several random initializations of each algorithm. Right: median and quantile bands of the log-likelihood sequences over many random initializations. The Rao-Blackwellized sampler has superior typical performance, but occasionally remains trapped in local optima for many iterations. These results suggest that while Rao-Blackwellization can usefully accelerate mixing, convergence diagnostics are still important.]

Task: Cluster Documents
- Previously: cluster documents based on topic, with each document assigned to a single topic.

A Generative Model
- Documents: $x_1, \ldots, x_D$
- Associated topics: $z_1, \ldots, z_D$
- Parameters: $\theta = \{\pi, \{\theta_k\}\}$
- Generative model: for each document $d$, draw a topic indicator $z_d \sim \pi$ and then draw the words of the document from the corresponding topic distribution $\theta_{z_d}$.

Task: Cluster Documents
- Now: a document may belong to multiple clusters, e.g. an article that mixes EDUCATION, FINANCE, and TECHNOLOGY.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA)
[Figure, from David Blei: topics, topic proportions and assignments, and documents.]
- But we only observe the documents; the other structure is hidden.
- We compute the posterior $p(\text{topics, proportions, assignments} \mid \text{documents})$.

Example Inference: Topic Weights
- Data: the OCR'ed collection of Science from 1990-2000; 17K documents, 11M words, 20K unique terms (stop words and rare words removed).
- Model: 100-topic LDA model.
[Figure: inferred topic weights (probabilities) for one example document, with most of the mass concentrated on a handful of the 100 topics.]

Example Inference: Topic Words
Most probable words from four of the inferred topics (one column per topic):
  Topic 1: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
  Topic 2: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
  Topic 3: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
  Topic 4: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

LDA Generative Model
- Observations: the words $w_1^d, \ldots, w_{N_d}^d$ of each document $d$
- Associated topics: $z_1^d, \ldots, z_{N_d}^d$
- Parameters: $\Theta = \{\{\pi_d\}, \{\theta_k\}\}$ (per-document topic proportions and corpus-wide topic-specific word distributions)
- Generative model: draw each topic $\theta_k$ and each document's proportions $\pi_d$ from Dirichlet priors, then for each word draw a topic $z_i^d \sim \pi_d$ and a word $w_i^d \sim \theta_{z_i^d}$ (the joint distribution is written out below; see also the sketch that follows).
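
To make the generative process concrete, here is a minimal forward-sampling sketch (the toy dimensions and hyperparameter values are assumptions for illustration, not from the lecture):

```python
import numpy as np

def sample_lda_corpus(D=5, K=3, V=20, N_d=10, alpha=0.5, lam=0.1, seed=0):
    """Forward-sample documents from the LDA generative model:
    theta_k ~ Dir(lam), pi_d ~ Dir(alpha), z ~ Cat(pi_d), w ~ Cat(theta_z)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(V, lam), size=K)    # topic-specific word distributions
    docs, topics = [], []
    for d in range(D):
        pi_d = rng.dirichlet(np.full(K, alpha))       # document-specific topic proportions
        z = rng.choice(K, size=N_d, p=pi_d)           # topic assignment per word
        w = np.array([rng.choice(V, p=theta[k]) for k in z])
        docs.append(w); topics.append(z)
    return docs, topics, theta

docs, topics, theta = sample_lda_corpus()
print(docs[0], topics[0])
```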

LDA Generative Model
Joint distribution, with Dirichlet priors on the topic-specific word distributions $\theta_k$ and the document-specific topic proportions $\pi_d$:
$p(w, z, \pi, \theta) = \prod_{k=1}^K p(\theta_k \mid \lambda) \; \prod_{d=1}^D \left[ p(\pi_d \mid \alpha) \prod_{i=1}^{N_d} p(z_i^d \mid \pi_d) \, p(w_i^d \mid z_i^d, \theta) \right]$

Collapsed LDA Sampling
- Marginalize the parameters:
  - Document-specific topic weights $\pi_d$
  - Corpus-wide topic-specific word distributions $\theta_k$
- Sample the topic indicators for each word.
- Derivation (evaluated numerically in the sketch below):
  $p(z_{1:N_d}^d \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \cdot \frac{\prod_k \Gamma(n_k^d + \alpha_k)}{\Gamma(\sum_k n_k^d + \alpha_k)}$, with $n_k^d$ the number of words in document $d$ assigned to topic $k$;
  $p(z \mid \alpha) = \prod_{d=1}^D p(z_{1:N_d}^d \mid \alpha)$;
  $p(\{w_i^d : z_i^d = k\} \mid z, \lambda) = \frac{\Gamma(\sum_w \lambda_w)}{\prod_w \Gamma(\lambda_w)} \cdot \frac{\prod_w \Gamma(v_{kw} + \lambda_w)}{\Gamma(\sum_w v_{kw} + \lambda_w)}$, with $v_{kw}$ the number of times vocabulary word $w$ is assigned to topic $k$;
  $p(w \mid z, \lambda) = \prod_{k=1}^K p(\{w_i^d : z_i^d = k\} \mid z, \lambda)$
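
These Dirichlet-multinomial marginals can be evaluated directly from count matrices; the following sketch (symmetric hyperparameters and variable names are assumptions for illustration) computes $\log p(z \mid \alpha) + \log p(w \mid z, \lambda)$ using the formulas above:

```python
import numpy as np
from scipy.special import gammaln

def collapsed_log_joint(ndk, vkw, alpha, lam):
    """log p(z | alpha) + log p(w | z, lambda) with pi_d and theta_k integrated out.
    ndk: (D, K) counts of topic k in document d; vkw: (K, V) counts of word w in topic k.
    alpha, lam: symmetric Dirichlet hyperparameters."""
    D, K = ndk.shape
    _, V = vkw.shape
    # log p(z | alpha): one Dirichlet-multinomial term per document
    log_pz = (D * (gammaln(K * alpha) - K * gammaln(alpha))
              + gammaln(ndk + alpha).sum()
              - gammaln(ndk.sum(axis=1) + K * alpha).sum())
    # log p(w | z, lambda): one Dirichlet-multinomial term per topic
    log_pw = (K * (gammaln(V * lam) - V * gammaln(lam))
              + gammaln(vkw + lam).sum()
              - gammaln(vkw.sum(axis=1) + V * lam).sum())
    return log_pz + log_pw
```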

Collapsed LDA Sampling
- Marginalize the parameters: the document-specific topic weights $\pi_d$ and the corpus-wide topic-specific word distributions $\theta_k$.
- Sample the topic indicators $z_i^d$ for each word.
- Algorithm: cycle through every word token in the corpus, resampling its topic assignment from its conditional distribution given all other assignments (illustrated step by step in the following slides).

Sample Document
[Figure: an example document whose words include "Etruscan", "trade", and "market".]

Randomly Assign Topics
[Figure: each word token (e.g., "Etruscan", "trade", "market") in the sample document is randomly assigned one of the topics.]

Randomly Assign Topics
[Figure: the same random initialization applied across the whole corpus; tokens such as "Etruscan", "trade", "market", "ship", and "Italy" appear in many documents, each occurrence with its own topic assignment.]

Maintain Global Statistics
[Figure: a word-by-topic table of total counts from all docs, with rows for words such as "Etruscan", "market", and "trade" and one column per topic, recording how often each word is assigned to each topic.]

Resample Assignments
[Figure: the same word-by-topic count table, consulted when resampling the topic assignment of a word in the sample document.]

What is the conditional distribution for this topic?
[Figure: the sample document with the topic assignment of one word ("trade") removed and marked with a "?".]

What is the conditional distribution for this topic?
- Part I: How much does this document like each topic?
[Figure: a bar per topic, with heights given by the counts of the remaining topic assignments in this document.]

What is the conditional distribution for this topic?
- Part I: How much does this document like each topic?
- Part II: How much does each topic like this word?
[Figure: the row of the global count table for the word "trade", giving the count of "trade" under each topic.]

What is the conditional distribution for this topic?
- Part I: How much does this document like each topic?
- Part II: How much does each topic like this word?
[Figure: the two factors displayed side by side for each topic.]

What is the conditional distribution for this topic?
- Part I: How much does this document like each topic?
- Part II: How much does each topic like this word?
- Multiplying the two factors gives, for each topic $k$ (a code sketch of this update follows below),
  $p(z_i^d = k \mid \text{rest}) \propto \frac{n_k^d + \alpha_k}{\sum_{j=1}^K n_j^d + \alpha_j} \cdot \frac{v_{k,\text{trade}} + \lambda_{\text{trade}}}{\sum_{j=1}^V v_{k,j} + \lambda_j}$
  where $n_k^d$ counts the other words in document $d$ assigned to topic $k$ and $v_{k,w}$ counts how often word $w$ is assigned to topic $k$ across the corpus.

Sample a New Topic Indicator
[Figure: a new topic for the word "trade" is drawn from this conditional distribution over the topics.]
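
A minimal sketch of that resampling step over a whole corpus, assuming symmetric hyperparameters and count matrices `ndk` (document-topic) and `vkw` (topic-word); the variable names and toy setup are illustrative assumptions:

```python
import numpy as np

def collapsed_lda_sweep(docs, z, ndk, vkw, nk, alpha, lam, rng):
    """One sweep of collapsed Gibbs sampling for LDA.
    docs[d][i]: vocabulary index of the i-th word of document d; z[d][i]: its current topic.
    ndk: (D, K) doc-topic counts, vkw: (K, V) topic-word counts, nk: (K,) topic totals."""
    K, V = vkw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[d][i]
            # Remove this token from all counts
            ndk[d, k_old] -= 1; vkw[k_old, w] -= 1; nk[k_old] -= 1
            # Part I (document-topic factor) times Part II (topic-word factor)
            p = (ndk[d] + alpha) * (vkw[:, w] + lam) / (nk + V * lam)
            p /= p.sum()
            k_new = rng.choice(K, p=p)
            z[d][i] = k_new
            # Add the token back under its new topic
            ndk[d, k_new] += 1; vkw[k_new, w] += 1; nk[k_new] += 1
    return z

# Toy usage with a random corpus and random initial assignments (illustrative sizes)
rng = np.random.default_rng(0)
D, K, V = 4, 3, 12
docs = [rng.integers(V, size=rng.integers(5, 15)).tolist() for _ in range(D)]
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
ndk = np.zeros((D, K), dtype=int); vkw = np.zeros((K, V), dtype=int); nk = np.zeros(K, dtype=int)
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        ndk[d, k] += 1; vkw[k, w] += 1; nk[k] += 1
for _ in range(20):
    z = collapsed_lda_sweep(docs, z, ndk, vkw, nk, alpha=0.5, lam=0.1, rng=rng)
```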

Update Counts
[Figure: after sampling the new topic indicator, the word-by-topic count table is updated to reflect the new assignment of "trade".]

Geometrically
[Figure: a geometric view of the sample document in terms of the three topics.]

Issues with Generic LDA Sampling
- Slow mixing rates, so many iterations are needed; each iteration cycles through sampling topic assignments for all words in all documents.
- Modern approaches:
  - Large-scale LDA. For example: Mimno, David, Matthew D. Hoffman, and David M. Blei. "Sparse stochastic inference for latent Dirichlet allocation." International Conference on Machine Learning, 2012.
  - Distributed LDA. For example: Ahmed, Amr, et al. "Scalable inference in latent variable models." Proceedings of the fifth ACM International Conference on Web Search and Data Mining, 2012.
- Next time: variational methods instead of sampling.

Acknowledgements
Thanks to Dave Blei, David Mimno, and Jordan Boyd-Graber for some material in this lecture relating to LDA.