Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA. CMSC 644: 2019-03-27

Latent semantic analysis Documents as mixtures of topics (Hofmann 1999) 1 / 60

Latent semantic analysis We have a set of documents $D$. Each document modeled as a bag of words (bow) over dictionary $W$. $x_{w,d}$: the number of times word $w \in W$ appears in document $d \in D$ 2 / 60

Latent semantic analysis Let's start with a simple model based on the frequency of word occurrences. Each document is modeled as $n_d$ draws from a Multinomial distribution with parameters $\theta_d = \{\theta_{1,d}, \ldots, \theta_{W,d}\}$. Note $\theta_{w,d} \geq 0$ and $\sum_w \theta_{w,d} = 1$. 3 / 60

Latent semantic analysis Probability of observed corpus $D$: $$\Pr(D \mid \{\theta_d\}) \propto \prod_{d=1}^{D} \prod_{w=1}^{W} \theta_{w,d}^{x_{w,d}}$$ 4 / 60

Latent semantic analysis Problem 1: Prove the MLE is $$\hat{\theta}_{w,d} = \frac{x_{w,d}}{n_d}$$ 5 / 60

Constrained optimization We have a problem of type $$\min_x f_0(x) \quad \text{s.t.} \quad f_i(x) \leq 0, \; i = 1, \ldots, m, \qquad h_i(x) = 0, \; i = 1, \ldots, p$$ Note: This discussion follows Boyd and Vandenberghe, Convex Optimization 6 / 60

Constrained optimization To solve this type of problem we will look at the Lagrangian function: $$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)$$ 7 / 60

Constrained optimization We'll see these in more detail later, but there is a beautiful result giving optimality conditions based on the Lagrangian: Suppose $\tilde{x}$, $\tilde{\lambda}$ and $\tilde{\nu}$ are optimal, then $$f_i(\tilde{x}) \leq 0, \quad h_i(\tilde{x}) = 0, \quad \tilde{\lambda}_i \geq 0, \quad \tilde{\lambda}_i f_i(\tilde{x}) = 0, \quad \nabla L(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) = 0$$ 8 / 60

Constrained optimization We can use the gradient and feasibility conditions to prove the MLE result. 9 / 60
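To fill in that argument (a sketch of Problem 1, not spelled out on the slides): maximize the per-document log likelihood subject to the simplex constraint, $$\max_{\theta_d} \sum_{w=1}^{W} x_{w,d} \log \theta_{w,d} \quad \text{s.t.} \quad \sum_{w=1}^{W} \theta_{w,d} = 1,$$ with Lagrangian $L(\theta_d, \nu) = \sum_w x_{w,d} \log \theta_{w,d} + \nu \big(1 - \sum_w \theta_{w,d}\big)$. Setting the gradient to zero gives $\theta_{w,d} = x_{w,d} / \nu$, and feasibility forces $\nu = \sum_w x_{w,d} = n_d$, so $\hat{\theta}_{w,d} = x_{w,d} / n_d$.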

Probabilistic Latent Semantic Analysis Let's change our document model to introduce topics. The key idea is that the probability of observing a word in a document is given by two pieces: the probability of observing a topic in a document, and the probability of observing a word given a topic: $$\Pr(w, d) = \sum_{t=1}^{T} \Pr(w \mid t) \Pr(t \mid d)$$ 10 / 60

Probabilistic Latent Semantic Analysis So, we rewrite the corpus probability as $$\Pr(D \mid \{p_d\}, \{\theta_t\}) \propto \prod_{d=1}^{D} \prod_{w=1}^{W} \left( \sum_{t=1}^{T} p_{t,d}\, \theta_{w,t} \right)^{x_{w,d}}$$ 11 / 60

Probabilistic Latent Semantic Analysis So, we rewrite the corpus probability as $$\Pr(D \mid \{p_d\}, \{\theta_t\}) \propto \prod_{d=1}^{D} \prod_{w=1}^{W} \left( \sum_{t=1}^{T} p_{t,d}\, \theta_{w,t} \right)^{x_{w,d}}$$ Mixture of topics!! 12 / 60

Probabilistic Latent Semantic Analysis A fully observed model. Assume you know the latent number of occurrences of word $w$ in document $d$ generated from topic $t$: $\Delta_{w,d,t}$, such that $\sum_t \Delta_{w,d,t} = x_{w,d}$. In that case we can rewrite the corpus probability: $$\Pr(D \mid \{p_d\}, \{\theta_t\}) \propto \prod_{d=1}^{D} \prod_{w=1}^{W} \prod_{t=1}^{T} (p_{t,d}\, \theta_{w,t})^{\Delta_{w,d,t}}$$ 13 / 60

Probabilistic Latent Semantic Analysis Problem 2: Show the MLEs are given by $$\hat{p}_{t,d} = \frac{\sum_{w=1}^{W} \Delta_{w,d,t}}{\sum_{t=1}^{T} \sum_{w=1}^{W} \Delta_{w,d,t}} \qquad \hat{\theta}_{w,t} = \frac{\sum_{d=1}^{D} \Delta_{w,d,t}}{\sum_{w=1}^{W} \sum_{d=1}^{D} \Delta_{w,d,t}}$$ 14 / 60

Probabilistic Latent Semantic Analysis Since we don't observe $\Delta_{w,d,t}$ we use the EM algorithm. At each iteration (given current parameters $\{p_d\}$ and $\{\theta_t\}$) find the responsibility $$\gamma_{w,d,t} = E[\Delta_{w,d,t} \mid \{p_d\}, \{\theta_t\}]$$ and maximize the fully observed likelihood, plugging in $\gamma_{w,d,t}$ for $\Delta_{w,d,t}$ 15 / 60

Probabilistic Latent Semantic Analysis Problem 4: Show $$\gamma_{w,d,t} = \frac{x_{w,d}\, p_{t,d}\, \theta_{w,t}}{\sum_{t'=1}^{T} p_{t',d}\, \theta_{w,t'}}$$ 16 / 60
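Putting Problems 2 and 4 together, here is a minimal numpy sketch of the pLSA EM loop (an illustration under assumed conventions, not the course's reference implementation: doc_mat holds the counts $x_{w,d}$ as a docs-by-words matrix, p is $T \times D$ with entries $p_{t,d}$, and theta is $W \times T$ with entries $\theta_{w,t}$):

import numpy as np

def plsa_em(doc_mat, num_topics, num_iter=100, seed=0):
    """EM for pLSA; doc_mat[d, w] holds the count x_{w,d}."""
    rng = np.random.default_rng(seed)
    num_docs, num_words = doc_mat.shape
    # p[t, d] = Pr(t | d), theta[w, t] = Pr(w | t)
    p = rng.dirichlet(np.ones(num_topics), size=num_docs).T
    theta = rng.dirichlet(np.ones(num_words), size=num_topics).T
    for _ in range(num_iter):
        # E step (Problem 4): gamma[d, w, t] = E[Delta_{w,d,t}]
        joint = p.T[:, None, :] * theta[None, :, :]           # (D, W, T)
        gamma = doc_mat[:, :, None] * joint / joint.sum(axis=2, keepdims=True)
        # M step (Problem 2): plug gamma in for Delta in the fully observed MLEs
        p = gamma.sum(axis=1).T                               # sum over words -> (T, D)
        p /= p.sum(axis=0, keepdims=True)
        theta = gamma.sum(axis=0)                             # sum over documents -> (W, T)
        theta /= theta.sum(axis=0, keepdims=True)
    return p, theta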

The EM Algorithm in General So, why does that work? Why does plugging in $\gamma_{w,d,t}$ for the latent variables $\Delta_{w,d,t}$ work? Why does that maximize the log likelihood $\ell(\{p_d\}, \{\theta_t\}; D)$? 17 / 60

The EM Algorithm in General Think of it as follows: $Z$: observed data, $Z_m$: missing latent data, $T = (Z, Z_m)$: complete data (observed and missing) 18 / 60

The EM Algorithm in General Think of it as follows: $Z$: observed data, $Z_m$: missing latent data, $T = (Z, Z_m)$: complete data (observed and missing). $\ell(\theta; Z)$: log likelihood w.r.t. observed data, $\ell_0(\theta; T)$: log likelihood w.r.t. complete data 19 / 60

The EM Algorithm in General Next, notice that $$\Pr(Z \mid \theta) = \frac{\Pr(T \mid \theta)}{\Pr(Z_m \mid Z, \theta)}$$ 20 / 60

The EM Algorithm in General Next, notice that $$\Pr(Z \mid \theta) = \frac{\Pr(T \mid \theta)}{\Pr(Z_m \mid Z, \theta)}$$ As likelihood: $\ell(\theta; Z) = \ell_0(\theta; T) - \ell_1(\theta; Z_m \mid Z)$ 21 / 60

The EM Algorithm in General Iterative approach: given parameters $\theta$, take the expectation of the log likelihoods: $$\ell(\theta'; Z) = E[\ell_0(\theta'; T) \mid Z, \theta] - E[\ell_1(\theta'; Z_m \mid Z) \mid Z, \theta] \equiv Q(\theta', \theta) - R(\theta', \theta)$$ 22 / 60

The EM Algorithm in General Iterative approach: given parameters $\theta$, take the expectation of the log likelihoods: $$\ell(\theta'; Z) = E[\ell_0(\theta'; T) \mid Z, \theta] - E[\ell_1(\theta'; Z_m \mid Z) \mid Z, \theta] \equiv Q(\theta', \theta) - R(\theta', \theta)$$ In pLSA, $Q(\theta', \theta)$ is the log likelihood of the complete data with $\Delta_{w,d,t}$ replaced by $\gamma_{w,d,t}$ 23 / 60

The EM Algorithm in General The general EM algorithm: 1. Initialize parameters $\theta^{(0)}$ 2. Construct the function $Q(\theta', \theta^{(j)})$ 3. Find the next set of parameters $\theta^{(j+1)} = \arg\max_{\theta'} Q(\theta', \theta^{(j)})$ 4. Iterate steps 2 and 3 until convergence 24 / 60
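As a generic sketch of this loop (e_step, m_step, and log_lik are hypothetical callables supplied by the user, not part of the slides):

import numpy as np

def em(theta0, e_step, m_step, log_lik, max_iter=100, tol=1e-8):
    """Generic EM: e_step builds the expectations behind Q(., theta^(j)),
    m_step maximizes Q, and log_lik monitors convergence."""
    theta, prev = theta0, -np.inf
    for _ in range(max_iter):
        stats = e_step(theta)     # step 2: construct Q via expected sufficient statistics
        theta = m_step(stats)     # step 3: theta^(j+1) = argmax of Q
        cur = log_lik(theta)
        if cur - prev < tol:      # step 4: stop when the log likelihood stops improving
            break
        prev = cur
    return theta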

The EM Algorithm in General So, why does that work? $$\ell(\theta^{(j+1)}; Z) - \ell(\theta^{(j)}; Z) = [Q(\theta^{(j+1)}, \theta^{(j)}) - Q(\theta^{(j)}, \theta^{(j)})] - [R(\theta^{(j+1)}, \theta^{(j)}) - R(\theta^{(j)}, \theta^{(j)})] \geq 0$$ 25 / 60

The EM Algorithm in General So, why does that work? $$\ell(\theta^{(j+1)}; Z) - \ell(\theta^{(j)}; Z) = [Q(\theta^{(j+1)}, \theta^{(j)}) - Q(\theta^{(j)}, \theta^{(j)})] - [R(\theta^{(j+1)}, \theta^{(j)}) - R(\theta^{(j)}, \theta^{(j)})] \geq 0$$ I.e., every step makes the log likelihood larger 26 / 60
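The step that needs justification is why the bracketed $R$ difference is never positive. A sketch of the standard argument, via Jensen's inequality: since $R(\theta', \theta^{(j)}) = E[\log \Pr(Z_m \mid Z, \theta') \mid Z, \theta^{(j)}]$, $$R(\theta', \theta^{(j)}) - R(\theta^{(j)}, \theta^{(j)}) = E\left[\log \frac{\Pr(Z_m \mid Z, \theta')}{\Pr(Z_m \mid Z, \theta^{(j)})} \,\middle|\, Z, \theta^{(j)}\right] \leq \log E\left[\frac{\Pr(Z_m \mid Z, \theta')}{\Pr(Z_m \mid Z, \theta^{(j)})} \,\middle|\, Z, \theta^{(j)}\right] = \log 1 = 0,$$ while $Q(\theta^{(j+1)}, \theta^{(j)}) \geq Q(\theta^{(j)}, \theta^{(j)})$ holds by the choice of $\theta^{(j+1)}$ as the maximizer.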

The EM Algorithm in General Why else does it work? $Q(\theta', \theta)$ minorizes $\ell(\theta'; Z)$ 27 / 60

The EM Algorithm in General General algorithmic concept: Iterative approach: Initialize parameters Construct bound based on current parameters Optimize bound 28 / 60

The EM Algorithm in General General algorithmic concept: Iterative approach: Initialize parameters Construct bound based on current parameters Optimize bound We will see this again when we look at variational methods 29 / 60

Approximate Inference by Sampling Ultimately, what we are interested in is learning topics. Perhaps instead of finding parameters $\theta$ that maximize likelihood, sample from a distribution $\Pr(\theta \mid D)$ that gives us topic estimates 30 / 60

Approximate Inference by Sampling Ultimately, what we are interested in is learning topics. Perhaps instead of finding parameters $\theta$ that maximize likelihood, sample from a distribution $\Pr(\theta \mid D)$ that gives us topic estimates. But we have only talked about $\Pr(D \mid \theta)$; how can we sample parameters? 31 / 60

Approximate Inference by Sampling Like EM, the trick here is to expand the model with latent data $Z_m$ and sample from the distribution $\Pr(\theta, Z_m \mid Z)$ 32 / 60

Approximate Inference by Sampling Like EM, the trick here is to expand the model with latent data $Z_m$ and sample from the distribution $\Pr(\theta, Z_m \mid Z)$. This is challenging, but sampling from $\Pr(\theta \mid Z_m, Z)$ and $\Pr(Z_m \mid \theta, Z)$ is easier 33 / 60

Approximate Inference by Sampling The Gibbs Sampler does exactly that. Property: after some rounds, samples from the conditional distributions $\Pr(\theta \mid Z_m, Z)$ correspond to samples from the marginal $$\Pr(\theta \mid Z) = \sum_{Z_m} \Pr(\theta, Z_m \mid Z)$$ 34 / 60
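To make that property concrete, here is a toy two-variable Gibbs sampler (a bivariate normal with correlation rho, chosen purely for illustration and unrelated to the topic model): each update draws one coordinate from its conditional given the other, and after a burn-in period the draws behave like samples from the joint, hence from each marginal.

import numpy as np

def gibbs_bivariate_normal(rho, num_samples=5000, burn_in=500, seed=0):
    """Toy Gibbs sampler whose target is N(0, [[1, rho], [rho, 1]]);
    the conditionals are x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)
    samples = []
    for i in range(num_samples + burn_in):
        x = rng.normal(rho * y, sd)   # sample x from Pr(x | y)
        y = rng.normal(rho * x, sd)   # sample y from Pr(y | x)
        if i >= burn_in:
            samples.append((x, y))
    return np.array(samples)

For example, np.corrcoef(gibbs_bivariate_normal(0.8).T)[0, 1] should come out close to 0.8.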

Approximate Inference by Sampling Quick aside: how do we simulate data for pLSA? Generate parameters $\{p_d\}$ and $\{\theta_t\}$, and generate $\Delta_{w,d,t}$ 35 / 60

Approximate Inference by Sampling Let's go backwards, let's deal with $\Delta_{w,d,t}$ 36 / 60

Approximate Inference by Sampling Let's go backwards, let's deal with $\Delta_{w,d,t}$: $$\Delta_{w,d,\cdot} \sim \text{Mult}_{x_{w,d}}(\gamma_{w,d,1}, \ldots, \gamma_{w,d,T})$$ where $\gamma_{w,d,t}$ was as given by the E step 37 / 60

Approximate Inference by Sampling Let's go backwards, let's deal with $\Delta_{w,d,t}$: $$\Delta_{w,d,\cdot} \sim \text{Mult}_{x_{w,d}}(\gamma_{w,d,1}, \ldots, \gamma_{w,d,T})$$ where $\gamma_{w,d,t}$ was as given by the E step

for d in range(num_docs):
    for w in range(num_words):
        delta[d, w, :] = np.random.multinomial(doc_mat[d, w], gamma[d, w, :])

38 / 60

Approximate Inference by Sampling Hmm, that's a problem since we need $x_{w,d}$... But, we know $\Pr(w, d) = \sum_t p_{t,d}\, \theta_{w,t}$, so let's use that to generate each $x_{w,d}$ as $$x_{w,d} \sim \text{Mult}_{n_d}(\Pr(1, d), \ldots, \Pr(W, d))$$ 39 / 60

Approximate Inference by Sampling Hmm, that's a problem since we need $x_{w,d}$... But, we know $\Pr(w, d) = \sum_t p_{t,d}\, \theta_{w,t}$, so let's use that to generate each $x_{w,d}$ as $$x_{w,d} \sim \text{Mult}_{n_d}(\Pr(1, d), \ldots, \Pr(W, d))$$

for d in range(num_docs):
    # word probabilities for document d: sum over topics of p[t, d] * theta[w, t]
    # (assuming theta is W x T and p is T x D)
    doc_mat[d, :] = np.random.multinomial(nw[d], theta @ p[:, d])

40 / 60

Approximate Inference by Sampling Now, how about $p_d$? How do we generate the parameters of a Multinomial distribution? 41 / 60

Approximate Inference by Sampling Now, how about $p_d$? How do we generate the parameters of a Multinomial distribution? This is where the Dirichlet distribution comes in... If $p_d \sim \text{Dir}(\alpha)$, then $$\Pr(p_d) \propto \prod_{t=1}^{T} p_{t,d}^{\alpha_t - 1}$$ 42 / 60

Approximate Inference by Sampling Some interesting properties: $$E[p_{t,d}] = \frac{\alpha_t}{\sum_t \alpha_t}$$ So, if we set all $\alpha_t = 1$ we will tend to have uniform probability over topics ($1/T$ each on average). If we increase to $\alpha_t = 100$ it will also have uniform probability on average, but it will have very little variance (it will almost always be $1/T$) 43 / 60
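A quick way to see the effect of the concentration parameter (an illustrative snippet, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
num_topics = 5
# alpha_t = 1: the mean is uniform (1/T), but individual draws vary a lot
print(rng.dirichlet(np.ones(num_topics), size=3))
# alpha_t = 100: the mean is still uniform, but every draw is close to 1/T
print(rng.dirichlet(100 * np.ones(num_topics), size=3))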

Approximate Inference by Sampling So, we can say $p_d \sim \text{Dir}(\alpha)$ and $\theta_t \sim \text{Dir}(\beta)$ 44 / 60

Approximate Inference by Sampling So, we can say $p_d \sim \text{Dir}(\alpha)$ and $\theta_t \sim \text{Dir}(\beta)$ And generate data as (with $\alpha_t = 1$)

for d in range(num_docs):
    p[:, d] = np.random.dirichlet(1. * np.ones(num_topics))

45 / 60

Approximate Inference by Sampling So what we have is a prior over the parameters $\{p_d\}$ and $\{\theta_t\}$: $\Pr(p_d \mid \alpha)$ and $\Pr(\theta_t \mid \beta)$. And we can formulate a distribution for the missing data $\Delta_{w,d,t}$: $$\Pr(\Delta_{w,d,t}, p_d, \theta_t \mid \alpha, \beta) = \Pr(\Delta_{w,d,t} \mid p_d, \theta_t)\, \Pr(p_d \mid \alpha)\, \Pr(\theta_t \mid \beta)$$ 46 / 60

Approximate Inference by Sampling However, what we care about is the posterior distribution $$\Pr(p_d \mid \Delta_{w,d,t}, \theta_t, \alpha, \beta)$$ What do we do??? 47 / 60

Approximate Inference by Sampling Another neat property of the Dirichlet distribution is that it is conjugate to the Multinomial: if $\theta \mid \alpha \sim \text{Dir}(\alpha)$ and $X \mid \theta \sim \text{Multinomial}(\theta)$, then $$\theta \mid X, \alpha \sim \text{Dir}(X + \alpha)$$ 48 / 60
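The conjugacy claim follows directly from Bayes' rule (a one-line sketch): $$\Pr(\theta \mid X, \alpha) \propto \Pr(X \mid \theta)\,\Pr(\theta \mid \alpha) \propto \prod_t \theta_t^{x_t} \prod_t \theta_t^{\alpha_t - 1} = \prod_t \theta_t^{x_t + \alpha_t - 1},$$ which is the kernel of $\text{Dir}(X + \alpha)$.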

Approximate Inference by Sampling That means we can sample $p_d$ from $$p_{t,d} \sim \text{Dir}\Big(\sum_w \Delta_{w,d,t} + \alpha\Big) \quad \text{and} \quad \theta_{w,t} \sim \text{Dir}\Big(\sum_d \Delta_{w,d,t} + \beta\Big)$$ 49 / 60

Approximate Inference by Sampling Coincidentally, we have just specified the Latent Dirichlet Allocation method for topic modeling. This is the most commonly used method for topic modeling (Blei, Ng, and Jordan, JMLR 2003) 50 / 60

Approximate Inference by Sampling We can now specify a full Gibbs Sampler for an LDA mixture model. Given: word-document counts $x_{w,d}$, number of topics $K$, prior parameters $\alpha$ and $\beta$. Do: learn parameters $\{p_d\}$ and $\{\theta_t\}$ for $K$ topics 51 / 60

Approximate Inference by Sampling Step 0: Initialize parameters $\{p_d\}$ and $\{\theta_t\}$: $p_d \sim \text{Dir}(\alpha)$ and $\theta_t \sim \text{Dir}(\beta)$ 52 / 60

Approximate Inference by Sampling Step 1: Sample $\Delta_{w,d,t}$ based on current parameters $\{p_d\}$ and $\{\theta_t\}$: $$\Delta_{w,d,\cdot} \sim \text{Mult}_{x_{w,d}}(\gamma_{w,d,1}, \ldots, \gamma_{w,d,T})$$ 53 / 60

Approximate Inference by Sampling Step 2: Sample parameters from $$p_{t,d} \sim \text{Dir}\Big(\sum_w \Delta_{w,d,t} + \alpha\Big) \quad \text{and} \quad \theta_{w,t} \sim \text{Dir}\Big(\sum_d \Delta_{w,d,t} + \beta\Big)$$ 54 / 60

Approximate Inference by Sampling Step 3: Get samples for a few iterations (e.g., 200); we want to reach a stationary distribution... 55 / 60

Approximate Inference by Sampling Step 4: Estimate $\hat{\Delta}_{w,d,t}$ as the average of the estimates from the last $m$ iterations (e.g., $m = 500$) 56 / 60

Approximate Inference by Sampling Step 5: Estimate parameters $p_d$ and $\theta_t$ based on the estimated $\hat{\Delta}_{w,d,t}$: $$\hat{p}_{t,d} = \frac{\sum_w \hat{\Delta}_{w,d,t} + \alpha}{\sum_t \sum_w \hat{\Delta}_{w,d,t} + \alpha} \qquad \hat{\theta}_{w,t} = \frac{\sum_d \hat{\Delta}_{w,d,t} + \beta}{\sum_w \sum_d \hat{\Delta}_{w,d,t} + \beta}$$ 57 / 60
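Putting Steps 0 through 5 together, here is a compact numpy sketch of the whole sampler (variable conventions mirror the earlier snippets: doc_mat[d, w] is $x_{w,d}$, p is $T \times D$, theta is $W \times T$; the burn-in of 200 and averaging window m use the example values from the slides; treat this as an illustration rather than a reference implementation):

import numpy as np

def lda_gibbs(doc_mat, num_topics, alpha=1.0, beta=1.0, burn_in=200, m=500, seed=0):
    """Gibbs sampler for the LDA mixture model described in Steps 0-5."""
    rng = np.random.default_rng(seed)
    num_docs, num_words = doc_mat.shape
    # Step 0: initialize p (T x D) and theta (W x T) from their Dirichlet priors
    p = rng.dirichlet(alpha * np.ones(num_topics), size=num_docs).T
    theta = rng.dirichlet(beta * np.ones(num_words), size=num_topics).T
    delta_sum = np.zeros((num_docs, num_words, num_topics))
    for it in range(burn_in + m):
        # Step 1: sample Delta_{w,d,.} ~ Mult_{x_{w,d}}(gamma_{w,d,1}, ..., gamma_{w,d,T})
        joint = p.T[:, None, :] * theta[None, :, :]            # (D, W, T)
        gamma = joint / joint.sum(axis=2, keepdims=True)
        delta = np.zeros_like(delta_sum)
        for d in range(num_docs):
            for w in range(num_words):
                delta[d, w, :] = rng.multinomial(doc_mat[d, w], gamma[d, w, :])
        # Step 2: sample parameters from their Dirichlet conditionals
        for d in range(num_docs):
            p[:, d] = rng.dirichlet(delta[d].sum(axis=0) + alpha)
        for t in range(num_topics):
            theta[:, t] = rng.dirichlet(delta[:, :, t].sum(axis=0) + beta)
        # Steps 3-4: discard the burn-in, then average Delta over the last m iterations
        if it >= burn_in:
            delta_sum += delta
    delta_hat = delta_sum / m
    # Step 5: estimate parameters from the averaged Delta
    p_hat = delta_hat.sum(axis=1).T + alpha                    # (T, D)
    p_hat /= p_hat.sum(axis=0, keepdims=True)
    theta_hat = delta_hat.sum(axis=0) + beta                   # (W, T)
    theta_hat /= theta_hat.sum(axis=0, keepdims=True)
    return p_hat, theta_hat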

Mixture models We have now seen two different mixture models: soft k-means and topic models 58 / 60

Mixture models We have now seen two different mixture models: soft k-means and topic models Two inference procedures: Exact Inference with Maximum Likelihood using the EM algorithm Approximate Inference using Gibbs Sampling 59 / 60

Mixture models We have now seen two different mixture models: soft k-means and topic models Two inference procedures: Exact Inference with Maximum Likelihood using the EM algorithm Approximate Inference using Gibbs Sampling Next, we will go back to Maximum Likelihood but learn about Approximate Inference using Variational Methods 60 / 60