Mixture Models and Expectation-Maximization


David M. Blei
March 9, 2012

EM for mixtures of multinomials

The graphical model for a mixture of multinomials: \pi \to z_d \to x_{dn}, with plates over n = 1, \dots, N and d = 1, \dots, D, and component parameters \theta_k for k = 1, \dots, K.

How should we fit the parameters? Maximum likelihood. There is an issue:

    \log p(x_{1:D} \mid \pi, \theta_{1:K}) = \sum_{d=1}^{D} \log p(x_d \mid \pi, \theta_{1:K})    (1)
                                           = \sum_{d=1}^{D} \log \sum_{z_d=1}^{K} p(z_d \mid \pi)\, p(x_d \mid z_d, \theta_{1:K})    (2)

In the second line, the log cannot help simplify the summation.

Why is that summation there? Because z_d is a hidden variable. If z_d were observed, what happens? Recall classification. The likelihood would decompose.

In general, fitting hidden variable models with maximum likelihood is hard. The EM algorithm can help. (But there are exceptions in both ways: some hidden variable models don't require EM, and for others EM is not enough.)
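To make the issue in (2) concrete, here is a minimal numerical sketch, not from the notes, of the log marginal likelihood of a mixture of multinomials. It assumes documents are stored as a D x V count matrix; the function and variable names are my own.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, log_pi, log_theta):
    """Log marginal likelihood of a mixture of multinomials.

    X         : (D, V) array of word counts n_v(x_d)
    log_pi    : (K,) log mixture proportions
    log_theta : (K, V) log per-cluster term distributions
    """
    # log p(x_d | z_d = k) = sum_v n_v(x_d) log theta_{k,v}
    # (dropping the multinomial coefficient, which does not depend on the parameters)
    log_px_given_z = X @ log_theta.T                        # (D, K)
    # log p(x_d) = log sum_k pi_k p(x_d | z_d = k): the sum is stuck inside the log
    log_px = logsumexp(log_pi + log_px_given_z, axis=1)     # (D,)
    return log_px.sum()
```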

The expectation-maximization algorithm is a general-purpose technique for fitting parameters in models with hidden variables. Here is the algorithm for mixtures (in English):

REPEAT:
1. For each document d, compute the conditional distribution of its cluster assignment z_d given the current setting of the parameters \pi^{(t)} and \theta^{(t)}_{1:K}.
2. Obtain \pi^{(t+1)} and \theta^{(t+1)}_{1:K} by computing MLEs for each cluster, but weighting each data point by its posterior probability of being in that cluster.

Notice that this resembles k-means. What are the differences?

This strategy simply works. You can apply it in all the settings that we saw at the end of the slides. But we need to be precise about what we mean by "weighting."

REPEAT:
1. E-step: For each document d, compute

       \lambda_d = p(z_d \mid x_d, \pi^{(t)}, \theta^{(t)}_{1:K}).

2. M-step: For each cluster label's distribution over terms,

       \theta^{(t+1)}_{k,v} = \frac{\sum_{d=1}^{D} \lambda_{d,k}\, n_v(x_d)}{\sum_{d=1}^{D} \lambda_{d,k}\, n(x_d)}    (3)

   and for the cluster proportions,

       \pi^{(t+1)}_{k} = \frac{\sum_{d=1}^{D} \lambda_{d,k}}{D}.    (4)

Do we know how to do the E-step?

    p(z_d = k \mid x_d, \pi^{(t)}, \theta^{(t)}_{1:K}) \propto \pi_k \prod_{v=1}^{V} \theta_{k,v}^{n_v(x_d)}.    (5)

This is precisely the prediction computation from classification (e.g., from "is this email Spam or Ham?" to "is this email class 1, 2, ..., 10?"). We emphasize that the parameters (in the E-step) are fixed.

How about the M-step for the cluster-conditional distributions \theta_k? The numerator is the expected number of times I saw word v in class k. What is the expectation taken with respect to? The posterior distribution of the hidden variable, the variable that determines which cluster parameter each word was generated from.
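A minimal sketch of one EM iteration implementing (3)-(5); this code is not part of the notes. The names are hypothetical, and the small smoothing constant is my addition to keep the next E-step's log of \theta finite.

```python
import numpy as np
from scipy.special import logsumexp

def em_step(X, pi, theta, smooth=1e-10):
    """One EM iteration for a mixture of multinomials.

    X     : (D, V) word-count matrix, one row per document
    pi    : (K,) current cluster proportions
    theta : (K, V) current per-cluster term distributions
    """
    # E-step (eq. 5): lambda_{d,k} proportional to pi_k * prod_v theta_{k,v}^{n_v(x_d)}
    log_post = np.log(pi) + X @ np.log(theta).T                  # (D, K), unnormalized
    lam = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))

    # M-step (eq. 3): expected word counts per cluster, then normalize over terms
    expected_counts = lam.T @ X + smooth                         # (K, V)
    theta_new = expected_counts / expected_counts.sum(axis=1, keepdims=True)

    # M-step (eq. 4): expected number of documents per cluster, divided by D
    pi_new = lam.sum(axis=0) / X.shape[0]
    return pi_new, theta_new, lam
```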

The denominator is the expected number of words (total) that came from class k.

How about the M-step for the cluster proportions \pi? The numerator is the expected number of times I saw a document from cluster k. The denominator is the number of documents.

These are like traditional MLEs, but with expected counts. This is how EM works. It is actually maximizing the likelihood. Let's see why.

Mixtures of Gaussians

The mixture of Gaussians model is equivalent, except x_d is now a Gaussian observation. The mixture components are means \mu_{1:K} and variances \sigma^2_{1:K}. The proportions are \pi.

Mixtures of Gaussians can express different kinds of distributions:
- Multimodal
- Heavy-tailed

As a density estimator, arbitrarily complex densities can be approximated with a mixture of Gaussians (given enough components).

We fit these mixtures with EM.

The E-step is conceptually the same. With the current setting of the parameters, compute the conditional distribution of the hidden variable z_d for each data point x_d,

    p(z_d \mid x_d, \pi, \mu_{1:K}, \sigma^2_{1:K}) \propto p(z_d \mid \pi)\, p(x_d \mid z_d, \mu_{1:K}, \sigma^2_{1:K}).    (6)

This is just like for the mixtures of multinomials, only the data are now Gaussian.

The M-step computes the empirical mean and variance for each component:
- In \mu_k, each point is weighted by its posterior probability of being in the cluster.
- In \sigma^2_k, weight (x_d - \hat{\mu}_k)^2 by the same posterior probability.
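As an illustration, here is a one-dimensional sketch of this EM loop (my own code, not the notes'); the initialization at random data points and the fixed iteration count are arbitrary choices.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """EM for a one-dimensional mixture of Gaussians (means mu_k, variances sigma2_k)."""
    rng = np.random.default_rng(seed)
    D = len(x)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)      # initialize at random data points
    sigma2 = np.full(K, np.var(x))

    for _ in range(n_iter):
        # E-step (eq. 6): posterior over the hidden assignment z_d for each x_d
        log_post = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))
        lam = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))

        # M-step: weighted proportions, weighted empirical means and variances
        Nk = lam.sum(axis=0)                       # expected points per component
        pi = Nk / D
        mu = lam.T @ x / Nk
        sigma2 = (lam * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, sigma2
```

Running this on data drawn from two well-separated Gaussians should recover two bumps; different seeds can land in different local optima, which is the sensitivity to starting points discussed below.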

The complex density estimate comes out of the predictive distribution

    p(x \mid \pi, \mu_{1:K}, \sigma^2_{1:K}) = \sum_{z} p(z \mid \pi)\, p(x \mid z, \mu_{1:K}, \sigma^2_{1:K})    (7)
                                             = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k, \sigma^2_k).    (8)

This is K Gaussian bumps, each weighted by \pi_k.

Why does this work? EM maximizes the likelihood.

Generic model

- The hidden variables are z.
- The observations are x.
- The parameters are \theta.
- The joint distribution is p(z, x \mid \theta).

The complete log likelihood is

    \ell_c(\theta; x, z) = \log p(z, x \mid \theta).    (9)

For a distribution q(z), the expected complete log likelihood is E_q[\log p(z, x \mid \theta)].

Jensen's inequality: for a concave function f and \lambda \in (0, 1),

    f(\lambda x + (1 - \lambda) y) \geq \lambda f(x) + (1 - \lambda) f(y).    (10)

Draw a picture with
- x
- y
- \lambda x + (1 - \lambda) y
- f(\lambda x + (1 - \lambda) y)
- \lambda f(x) + (1 - \lambda) f(y)

This generalizes to probability distributions. For concave f,

    f(E[X]) \geq E[f(X)].    (11)
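A quick numeric check of (11) with f = log, the concave case that EM uses below; the particular distribution q and the values of X are arbitrary choices of mine.

```python
import numpy as np

# A discrete distribution q over a few positive values of X (arbitrary choices)
x = np.array([0.5, 1.0, 2.0, 4.0])
q = np.array([0.1, 0.4, 0.3, 0.2])

lhs = np.log(np.sum(q * x))     # f(E[X]) with f = log, which is concave
rhs = np.sum(q * np.log(x))     # E[f(X)]
print(lhs, rhs, lhs >= rhs)     # prints ... True, as eq. (11) requires
```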

Jensen's bound and \mathcal{L}(q, \theta):

    \ell(\theta; x) = \log p(x \mid \theta)    (12)
                    = \log \sum_{z} p(x, z \mid \theta)    (13)
                    = \log \sum_{z} q(z)\, \frac{p(x, z \mid \theta)}{q(z)}    (14)
                    \geq \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}    (15)
                    \triangleq \mathcal{L}(q, \theta).    (16)

This bound on the log likelihood depends on a distribution over the hidden variables q(z) and on the parameters \theta.

EM is a coordinate ascent algorithm. We optimize with respect to each argument, holding the other fixed:

    q^{(t+1)}(z) = \arg\max_{q} \mathcal{L}(q, \theta^{(t)})    (17)
    \theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t+1)}, \theta).    (18)

In the E-step, we find q^*(z). This equals the posterior p(z \mid x, \theta):

    \mathcal{L}(p(z \mid x, \theta), \theta) = \sum_{z} p(z \mid x, \theta) \log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)}    (19)
                                             = \sum_{z} p(z \mid x, \theta) \log p(x \mid \theta)    (20)
                                             = \log p(x \mid \theta)    (21)
                                             = \ell(\theta; x).    (22)

In the M-step, we only need to worry about the expected complete log likelihood because

    \mathcal{L}(q, \theta) = \sum_{z} q(z) \log p(x, z \mid \theta) - \sum_{z} q(z) \log q(z).    (23)

The second term does not depend on the parameter \theta. The first term is the expected complete log likelihood.

Perspective: optimize a bound, then tighten the bound.

This finds a local optimum and is sensitive to starting points.

Test for convergence: after the E-step, we can compute the likelihood at the point \theta^{(t)}.
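A sketch of the bound in (23) for the multinomial mixture, reusing the responsibilities \lambda returned by the hypothetical em_step sketch earlier; the eps term guarding against log 0 is my addition.

```python
import numpy as np
from scipy.special import logsumexp

def bound(X, pi, theta, lam, eps=1e-12):
    """The bound L(q, theta) of eq. (23) for the multinomial mixture,
    where q(z_d) is given by the responsibilities lam, shape (D, K)."""
    log_joint = np.log(pi) + X @ np.log(theta).T      # log p(x_d, z_d = k | theta)
    expected_complete_ll = np.sum(lam * log_joint)    # sum_z q(z) log p(x, z | theta)
    entropy = -np.sum(lam * np.log(lam + eps))        # - sum_z q(z) log q(z)
    return expected_complete_ll + entropy

# If lam is the exact posterior (the E-step), the bound is tight: it equals the
# log marginal likelihood logsumexp(np.log(pi) + X @ np.log(theta).T, axis=1).sum(),
# which gives a simple convergence check after each E-step.
```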

Big picture

With EM, we can fit all kinds of hidden variable models with maximum likelihood.

EM has had a huge impact on statistics, machine learning, signal processing, computer vision, natural language processing, speech recognition, genomics, and other fields. Missing data was no longer a block. It opened up the following idea: imagine I can observe everything; then I can fit the MLE. But I cannot observe everything. With EM, I can still try to fit the MLE. For example:

- Filling in missing pixels in tomography.
- Modeling nonresponses in sample surveys.
- Many applications in genetics, treating ancestral variables and various unknown rates as unobserved random variables.
- Time series models (we'll see these).
- Financial models.
- An EM application at Amazon: users and items are clustered; a user's rating on an item has to do with their cluster combination.

Closely related to:

- Variational methods for approximate posterior inference
- Gibbs sampling

Both let us have the same flexibility in modeling, but in the setting where we cannot compute the real posterior. (This often occurs in Bayesian models.)

History of the EM algorithm

- Dempster et al. (1977), "Maximum likelihood from incomplete data via the EM algorithm."
- Hartley: "I feel like the old minstrel who has been singing his song for 18 years and now finds, with considerable satisfaction, that his folklore is the theme of an overpowering symphony."
- Hartley (1958), "Maximum likelihood procedures for incomplete data."
- McKendrick (1926), "Applications of mathematics to medical problems."