Lecture 8: Graphical models for Text

4F13: Machine Learning
Joaquin Quiñonero-Candela and Carl Edward Rasmussen
Department of Engineering, University of Cambridge
http://mlg.eng.cam.ac.uk/teaching/4f13/

A really simple document model

Consider a collection of D documents with a dictionary of M unique words.
- N_d: number of (non-unique) words in document d.
- w_{id}: i-th word in document d (w_{id} \in \{1, ..., M\}).
- \beta = [\beta_1, ..., \beta_M]: parameters of a Multinomial distribution over the dictionary of M unique words.

We can fit \beta by maximising the likelihood:

    \hat{\beta} = \arg\max_\beta \prod_{d=1}^{D} \mathrm{Mult}(c_{1d}, ..., c_{Md} | \beta, N_d)
                = \arg\max_\beta \mathrm{Mult}(c_1, ..., c_M | \beta, N),
    \qquad \hat{\beta}_j = \frac{c_j}{\sum_{l=1}^{M} c_l}

where
- N = \sum_{d=1}^{D} N_d: total number of (non-unique) words in the collection.
- c_{jd}: count of occurrences of unique word j in document d.
- c_j = \sum_{d=1}^{D} c_{jd}: count of total occurrences of unique word j in the collection.
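
As an illustration (not part of the original slides), here is a minimal NumPy sketch of this maximum-likelihood fit, assuming a hypothetical document-term count matrix C with C[d, j] = c_{jd}:

```python
import numpy as np

# Hypothetical count matrix: C[d, j] = c_{jd}, the number of times
# unique word j occurs in document d (D = 3 documents, M = 4 words).
C = np.array([[2, 0, 1, 1],
              [0, 3, 1, 0],
              [1, 1, 0, 2]])

c = C.sum(axis=0)          # c_j: total occurrences of word j in the collection
beta_hat = c / c.sum()     # MLE: beta_hat_j = c_j / sum_l c_l

print(beta_hat)            # the single, global word-frequency distribution
```

The estimate is just the relative frequency of each word over the whole collection, which is exactly the limitation discussed on the next slide.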

Limitations of the really simple document model

- Document d is the result of sampling N_d words from the Multinomial \beta.
- \beta estimated by maximum likelihood reflects the aggregation of all documents.
- All documents are therefore modelled by the same global word frequency distribution.
- The generative model wastes probability mass, because it cannot specialise: all unique words do not necessarily co-occur in any given document.
- It is possible that documents are about different topics.

A mixture of Multinomials model

We want to allow for a mixture of K Multinomials, parametrised by \beta_1, ..., \beta_K.
- Each of those Multinomials corresponds to a document category.
- z_d \in \{1, ..., K\} assigns all words in document d to one of the K categories.
- \theta_j = p(z_d = j) is the probability that any document d is assigned to category j.
- \theta = [\theta_1, ..., \theta_K] is itself the parameter of a Multinomial over the K categories.

We have introduced a new set of hidden variables z_d. How do we fit those variables? What do we do with them? We are actually not interested in them: we are only interested in \theta and \beta.
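
A minimal sketch (not from the slides) of the generative process this mixture describes, with illustrative values for K, M and \theta:

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 2, 5                                  # illustrative: 2 categories, 5 unique words
theta = np.array([0.6, 0.4])                 # theta_k = p(z_d = k)
beta = rng.dirichlet(np.ones(M), size=K)     # beta_k: word distribution of category k (arbitrary here)

def generate_document(N_d):
    """Sample one document: pick its category z_d, then draw N_d words from beta[z_d]."""
    z_d = rng.choice(K, p=theta)
    counts = rng.multinomial(N_d, beta[z_d])     # word counts c_{1d}, ..., c_{Md}
    return z_d, counts

z_d, c_d = generate_document(N_d=20)
```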

Jensen's Inequality

For any concave function, such as log(x):

    \log(\alpha x_1 + (1 - \alpha) x_2) \geq \alpha \log(x_1) + (1 - \alpha) \log(x_2)

[Figure: log(x) with the chord from (x_1, log x_1) to (x_2, log x_2) lying below the curve.]

More generally, for \alpha_i \geq 0, \sum_i \alpha_i = 1 and any \{x_i > 0\}:

    \log\Big(\sum_i \alpha_i x_i\Big) \geq \sum_i \alpha_i \log(x_i)

Equality holds if and only if \alpha_i = 1 for some i (and therefore all others are 0).
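
A quick numerical sanity check of the general form of the inequality (my own example, with illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0.1, 10.0, size=5)      # any {x_i > 0}
alpha = rng.dirichlet(np.ones(5))       # alpha_i >= 0 with sum_i alpha_i = 1

lhs = np.log(alpha @ x)                 # log(sum_i alpha_i x_i)
rhs = alpha @ np.log(x)                 # sum_i alpha_i log(x_i)
assert lhs >= rhs                       # Jensen's inequality for the concave log
```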

The Expectation Maximization (EM) algorithm

Given a set of observed (visible) variables V, a set of unobserved (hidden / latent / missing) variables H, and model parameters \theta, optimize the log likelihood:

    L(\theta) = \log p(V | \theta) = \log \int p(H, V | \theta) \, dH,        (1)

where we have written the marginal for the visibles in terms of an integral over the joint distribution for hidden and visible variables.

Using Jensen's inequality, for any distribution q(H) over the hidden states we have:

    L(\theta) = \log \int q(H) \frac{p(H, V | \theta)}{q(H)} \, dH
              \geq \int q(H) \log \frac{p(H, V | \theta)}{q(H)} \, dH = F(q, \theta),        (2)

defining the functional F(q, \theta), which is a lower bound on the log likelihood.

In the EM algorithm, we alternately optimize F(q, \theta) with respect to q and \theta, and we can prove that this never decreases L.
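
To make the bound concrete, here is a tiny numerical check (my own example, not from the slides) with a single binary hidden variable, so the integral over H becomes a sum of two terms:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: H in {0, 1}; p_joint[h] = p(H = h, V | theta) for the observed V.
p_joint = np.array([0.12, 0.48])
L = np.log(p_joint.sum())                   # L(theta) = log p(V | theta)

q = rng.dirichlet(np.ones(2))               # an arbitrary distribution q(H)
F = np.sum(q * np.log(p_joint / q))         # F(q, theta) = E_q[log p(H, V | theta) / q(H)]

assert F <= L + 1e-12                       # the lower bound from Jensen's inequality
```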

The E and M steps of EM

The lower bound on the log likelihood:

    F(q, \theta) = \int q(H) \log \frac{p(H, V | \theta)}{q(H)} \, dH
                 = \int q(H) \log p(H, V | \theta) \, dH + H(q),        (3)

where H(q) = -\int q(H) \log q(H) \, dH is the entropy of q. We iteratively alternate:

E step: optimize F(q, \theta) with respect to the distribution over hidden variables, given the parameters:

    q^{(k)}(H) := \arg\max_{q(H)} F\big(q(H), \theta^{(k-1)}\big).        (4)

M step: maximize F(q, \theta) with respect to the parameters, given the hidden distribution:

    \theta^{(k)} := \arg\max_\theta F\big(q^{(k)}(H), \theta\big)
                  = \arg\max_\theta \int q^{(k)}(H) \log p(H, V | \theta) \, dH,        (5)

which is equivalent to optimizing the expected complete-data log likelihood \log p(H, V | \theta), since the entropy of q(H) does not depend on \theta.
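
Schematically, the algorithm is just an alternation of these two maximizations. The sketch below is a hypothetical skeleton: e_step and m_step are placeholders to be supplied by a concrete model (see the mixture-of-Multinomials case later in the lecture):

```python
def em(V, theta, e_step, m_step, num_iters=50):
    """Generic EM loop: coordinate ascent on F(q, theta).

    e_step(V, theta) should return q, the distribution over hidden variables
    (equation 4); m_step(V, q) should return the theta maximizing the expected
    complete-data log likelihood (equation 5).
    """
    for _ in range(num_iters):
        q = e_step(V, theta)    # E step: q^(k) = argmax_q F(q, theta^(k-1))
        theta = m_step(V, q)    # M step: theta^(k) = argmax_theta F(q^(k), theta)
    return theta
```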

EM as Coordinate Ascent in F

[Figure: EM viewed as alternating coordinate ascent on the functional F(q, \theta).]

The EM algorithm never decreases the log likelihood

The difference between the two cost functions:

    L(\theta) - F(q, \theta) = \log p(V | \theta) - \int q(H) \log \frac{p(H, V | \theta)}{q(H)} \, dH
                             = \log p(V | \theta) - \int q(H) \log \frac{p(H | V, \theta) \, p(V | \theta)}{q(H)} \, dH
                             = - \int q(H) \log \frac{p(H | V, \theta)}{q(H)} \, dH
                             = \mathrm{KL}\big(q(H) \,\|\, p(H | V, \theta)\big),

is the Kullback-Leibler divergence; it is non-negative, and zero if and only if q(H) = p(H | V, \theta) (this is exactly the distribution chosen in the E step).

Although we are working with the wrong cost function, the likelihood still increases in every iteration:

    L\big(\theta^{(k-1)}\big) \overset{\text{E step}}{=} F\big(q^{(k)}, \theta^{(k-1)}\big)
        \overset{\text{M step}}{\leq} F\big(q^{(k)}, \theta^{(k)}\big)
        \overset{\text{Jensen}}{\leq} L\big(\theta^{(k)}\big),

where the equality holds because of the E step, the first inequality comes from the M step, and the final inequality from Jensen.

Usually EM converges to a local optimum of L (although there are exceptions).
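
Continuing the tiny binary-H example from above, one can check numerically that the gap L - F is exactly this KL divergence, and that setting q to the posterior (the E step) closes it:

```python
import numpy as np

p_joint = np.array([0.12, 0.48])            # p(H, V | theta) for H in {0, 1}
L = np.log(p_joint.sum())                   # log p(V | theta)
post = p_joint / p_joint.sum()              # p(H | V, theta)

q = np.array([0.7, 0.3])                    # an arbitrary q(H)
F = np.sum(q * np.log(p_joint / q))
kl = np.sum(q * np.log(q / post))           # KL(q || p(H | V, theta))

assert np.isclose(L - F, kl)                # the gap is the KL divergence

F_opt = np.sum(post * np.log(p_joint / post))
assert np.isclose(F_opt, L)                 # E step (q = posterior) makes the bound tight
```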

EM and Mixtures of Multinomials

In the mixture model for text, the latent variables are z_d \in \{1, ..., K\}, d = 1, ..., D, which for each document encode which mixture component generated it.

E step: for each document d, set q_d to the posterior over z_d:

    q_d(z_d = k) \propto p(z_d = k | \theta) \prod_{i=1}^{N_d} p(w_{id} | \beta_k)
                 \propto \theta_k \, \mathrm{Mult}(c_{1d}, ..., c_{Md} | \beta_k, N_d) = r_{kd}

M step: maximize the expected complete-data log likelihood

    \sum_{d=1}^{D} \sum_{k=1}^{K} q_d(z_d = k) \log p(\{w_{id}\}, z_d = k)
        = \sum_{k} \sum_{d=1}^{D} r_{kd} \log \Big( \prod_{i=1}^{N_d} p(w_{id} | \beta_k) \, p(z_d = k) \Big)
        = \sum_{k} \sum_{d=1}^{D} r_{kd} \log \Big( \prod_{j=1}^{M} \beta_{jk}^{c_{jd}} \, \theta_k \Big)
        = \sum_{k,d} r_{kd} \Big( \sum_{j=1}^{M} c_{jd} \log \beta_{jk} + \log \theta_k \Big) = F(\theta, \beta)
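
A possible vectorised implementation of this E step (my own sketch; it works on the count matrix C and drops the multinomial coefficient, which cancels in the normalisation):

```python
import numpy as np

def e_step(C, theta, beta):
    """Responsibilities r[k, d] = q_d(z_d = k).

    C:     (D, M) count matrix, C[d, j] = c_{jd}
    theta: (K,)   mixing proportions
    beta:  (K, M) per-category word distributions
    """
    log_r = np.log(theta)[:, None] + np.log(beta) @ C.T   # (K, D): log theta_k + sum_j c_{jd} log beta_{jk}
    log_r -= log_r.max(axis=0, keepdims=True)             # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=0, keepdims=True)                # normalise over k for each document
```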

EM: M step for mixture model

Need Lagrange multipliers to constrain the maximization and ensure proper distributions:

    \theta_k = \arg\max_\theta \Big[ F(\theta, \beta) + \lambda \Big( 1 - \sum_{k=1}^{K} \theta_k \Big) \Big]
             = \frac{\sum_{d=1}^{D} r_{kd}}{\sum_{k'=1}^{K} \sum_{d=1}^{D} r_{k'd}}

    \beta_{jk} = \arg\max_\beta \Big[ F(\theta, \beta) + \sum_{k=1}^{K} \lambda_k \Big( 1 - \sum_{j=1}^{M} \beta_{jk} \Big) \Big]
               = \frac{\sum_{d=1}^{D} r_{kd} \, c_{jd}}{\sum_{j'=1}^{M} \sum_{d=1}^{D} r_{kd} \, c_{j'd}}
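
And a matching sketch of the closed-form M step updates (same assumed shapes as in the E-step code above):

```python
import numpy as np

def m_step(C, r):
    """Closed-form M step.

    C: (D, M) count matrix; r: (K, D) responsibilities from the E step.
    """
    theta = r.sum(axis=1) / r.sum()               # theta_k = sum_d r_{kd} / sum_{k',d} r_{k'd}
    B = r @ C                                     # B[k, j] = sum_d r_{kd} c_{jd}
    beta = B / B.sum(axis=1, keepdims=True)       # beta_{jk}, normalised over words j
    return theta, beta
```

Alternating these two functions gives the full EM fit for the mixture of Multinomials, in the spirit of the generic EM loop sketched earlier.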

A Bayesian mixture of Multinomials model

With the EM algorithm we have essentially estimated \theta and \beta by maximum likelihood. An alternative, Bayesian treatment is to introduce priors:

- \theta ~ Dir(\alpha): a symmetric Dirichlet over category probabilities.
- \beta_k ~ Dir(\gamma): a symmetric Dirichlet over unique word probabilities.

What is different? We no longer want to compute a point estimate of \theta or \beta. We are now interested in computing their posterior distributions.
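
For concreteness, sampling from these priors looks like the sketch below (the concentration values \alpha and \gamma are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

K, M = 3, 10
alpha, gamma = 1.0, 0.5                               # illustrative symmetric concentrations

theta = rng.dirichlet(alpha * np.ones(K))             # theta ~ Dir(alpha): category probabilities
beta = rng.dirichlet(gamma * np.ones(M), size=K)      # beta_k ~ Dir(gamma): one word distribution per category
```

The code only draws from the prior; the object of interest is now the posterior over \theta and the \beta_k given the observed documents, rather than a single point estimate.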