Latent Dirichlet Allocation



Directed Graphical Models. William W. Cohen, Machine Learning 10-601

DGMs: The Burglar Alarm example. Node ~ random variable. Arcs define the form of the probability distribution: Pr(X_i | X_1, …, X_{i-1}, X_{i+1}, …, X_n) = Pr(X_i | parents(X_i)). Nodes: Burglar, Earthquake, Alarm, Phone Call. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

DGMs: The Burglar Alarm example. Node ~ random variable; arcs define the form of the probability distribution: Pr(X_i | X_1, …, X_{i-1}, X_{i+1}, …, X_n) = Pr(X_i | parents(X_i)).
Generative story (and joint distribution):
- Pick b ~ Pr(Burglar), a binomial
- Pick e ~ Pr(Earthquake), a binomial
- Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
- Pick c ~ Pr(PhoneCall | Alarm=a)

DGMs: The Burglar Alarm example. You can also compute other quantities: e.g., Pr(Burglar=true | PhoneCall=true), or Pr(Burglar=true | PhoneCall=true, Earthquake=true).
Generative story:
- Pick b ~ Pr(Burglar), a binomial
- Pick e ~ Pr(Earthquake), a binomial
- Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
- Pick c ~ Pr(PhoneCall | Alarm=a)
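To make that concrete, here is a minimal sketch of inference by enumeration in this network. The CPT numbers are hypothetical (the slides give none), and the function names are mine, not part of the lecture.

```python
# A minimal sketch (not from the slides): inference by enumeration in the
# burglar-alarm network. All CPT numbers below are made-up illustrative values.
import itertools

P_B = {True: 0.001, False: 0.999}                      # Pr(Burglar)
P_E = {True: 0.002, False: 0.998}                      # Pr(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,        # Pr(Alarm=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_C = {True: 0.90, False: 0.05}                        # Pr(Call=true | Alarm)

def joint(b, e, a, c):
    """Pr(b, e, a, c) = product of each node given its parents."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pc = P_C[a] if c else 1 - P_C[a]
    return P_B[b] * P_E[e] * pa * pc

def posterior_burglar(call=True):
    """Pr(Burglar=true | PhoneCall=call), summing out Earthquake and Alarm."""
    num = sum(joint(True, e, a, call)
              for e, a in itertools.product([True, False], repeat=2))
    den = sum(joint(b, e, a, call)
              for b, e, a in itertools.product([True, False], repeat=3))
    return num / den

print(posterior_burglar(True))   # e.g. ~0.016 with these made-up CPTs
```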

Inference in Bayes nets. General problem: given evidence E_1, …, E_k, compute P(X | E_1, …, E_k) for any node X. Big assumption: the graph is a polytree (at most one undirected path between any two nodes X, Y). Notation: E+ = "causal support" for X (the evidence connected to X through its parents); E- = "evidential support" for X (the evidence connected to X through its children). [Diagram: a polytree with parents U_1, U_2 of X, children Y_1, Y_2 of X, and the children's other parents Z_1, Z_2; E+ lies above X, E- below.]

Inference in Bayes nets: P(X | E).
P(X | E) = P(X | E-, E+)
= P(E- | X, E+) · P(X | E+) / P(E- | E+)
∝ P(E- | X) · P(X | E+)
Let's start with P(X | E+) …
(E+: causal support; E-: evidential support.)

Inference in Bayes nets: P(X | E+).
P(X | E+) = Σ_{u1,u2} P(X | u1, u2, E+) · P(u1, u2 | E+)
= Σ_{u1,u2} P(X | u1, u2) · Π_j P(u_j | E+_{Uj\X})
The first factor is a CPT table lookup; the second is a recursive call to P(· | E+), where E+_{Uj\X} is the evidence for U_j that doesn't go through X.
So far: a simple way of propagating requests for belief due to causal evidence up the tree; i.e., info on Pr(X | E+) flows down.

Inference in Bayes nets: P(E- | X), simplified.
P(E- | X) = Π_j Σ_{y_jk} P(E-_{Yj}, y_jk | X)
= Π_j Σ_{y_jk} P(y_jk | X) · P(E-_{Yj} | X, y_jk)
= Π_j Σ_{y_jk} P(y_jk | X) · P(E-_{Yj} | y_jk)    [d-sep + polytree]
Notation: E-_{Yj} = evidential support for Y_j; E_{Yj\X} = evidence for Y_j excluding support through X. The last factor is a recursive call to P(E- | ·).
So far: a simple way of propagating requests for belief due to evidential support down the tree; i.e., info on Pr(E- | X) flows up.

Recap: Message Passing for BP. We reduced P(X | E) to the product of two recursively calculated parts:
P(X=x | E+) = Σ_{u1,u2} P(x | u1, u2) · Π_j P(u_j | E+_{Uj\X}),
i.e., the CPT for X combined with the product of forward messages from its parents.
P(E- | X=x) = β Π_j Σ_{y_jk} Σ_{z_jk} P(E-_{Yj} | y_jk) · P(y_jk | x, z_jk) · P(z_jk | E_{Zj\Yj}),
i.e., a combination of backward messages from the children, CPTs, and P(Z | E_{Z\Yk}), a simpler instance of P(X | E).
This can also be implemented by message passing (belief propagation). Messages are distributions, i.e., vectors.

Recap: Message Passing for BP. Top-level algorithm:
- Pick one vertex as the root. Any node with only one edge is a leaf.
- Pass messages from the leaves to the root.
- Pass messages from the root to the leaves.
Now every X has received P(X | E+) and P(E- | X) and can compute P(X | E).

NAÏVE BAYES AS A DGM

Naïve Bayes as a DGM. For each document d in the corpus (of size D):
- Pick a label y_d from Pr(Y)
- For each word in d (of length N_d): pick a word x_id from Pr(X | Y=y_d)
Example CPTs: Pr(Y=y): onion 0.3, economist 0.7. Pr(X=x | Y=y) for every x: (onion, aardvark) 0.034, (onion, ai) 0.0067, … (economist, aardvark) 0.0000003, … (economist, zymurgy) 0.01.
[Plate diagram: Y per document, X per word, inside plates N_d and D.]

Naïve Bayes as a DGM. For each document d in the corpus (of size D):
- Pick a label y_d from Pr(Y)
- For each word in d (of length N_d): pick a word x_id from Pr(X | Y=y_d)
Not described: how do we smooth for classes? For multinomials? How many classes are there?

Recap: smoothing for a binomial. MLE: maximize Pr(D | θ). MAP: maximize Pr(D | θ) · Pr(θ). Estimate θ = P(heads) for a binomial with MLE as
θ = #heads / (#heads + #tails)
and with MAP as
θ = (#heads + #imaginary heads) / (#heads + #imaginary heads + #tails + #imaginary tails)
Smoothing = a prior over the parameter θ.
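A minimal numeric sketch of the two estimates, with hypothetical counts; the MAP formula is the pseudo-count version written above.

```python
# A minimal sketch of the MLE and pseudo-count MAP estimates; all counts are hypothetical.
heads, tails = 7, 3                 # observed data D
a_heads, a_tails = 2, 2             # "imaginary" heads/tails from the prior

theta_mle = heads / (heads + tails)
theta_map = (heads + a_heads) / (heads + a_heads + tails + a_tails)

print(theta_mle)   # 0.70
print(theta_map)   # 9/14 ≈ 0.64, pulled toward 0.5 by the prior
```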

Smoothing for a binomial as a DGM. MAP for a dataset D with α_1 heads and α_2 tails. [Diagram: γ (the #imaginary heads and #imaginary tails) → θ → D.] MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ according to the posterior distribution. Also: inference in a simple graph can be intractable if the conditional distributions are complicated.

Smoothing for a multinomial as a DGM. Replacing the binomial data D=d with F=f_1, f_2, …, the Dirichlet prior has imaginary data for each of the K classes. [Diagram: γ (the α_1 + α_2 + … imaginary counts) → θ → F.] Comment: the conjugate prior for a binomial is a Beta; the conjugate prior for a multinomial is called a Dirichlet.
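A minimal sketch of the multinomial analogue: Dirichlet pseudo-counts added to observed class counts. The counts and pseudo-counts are hypothetical.

```python
# Dirichlet smoothing for a multinomial with K = 3 classes; all numbers are hypothetical.
counts = [12, 3, 0]          # observed counts n_k for each class
alpha  = [1.0, 1.0, 1.0]     # imaginary data (Dirichlet pseudo-counts) per class

total = sum(c + a for c, a in zip(counts, alpha))
theta_smoothed = [(c + a) / total for c, a in zip(counts, alpha)]

print(theta_smoothed)        # no class gets probability exactly 0
```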

Recap: Naïve Bayes as a DGM. Now: let's turn Bayes up to 11 for naïve Bayes. [Plate diagram: Y per document, X per word, inside plates N_d and D.]

A more Bayesian Naïve Bayes.
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary.
For each document d=1…D:
- Pick a label y_d from π
- For each word in d (of length N_d): pick a word x_id from γ[y_d]
[Plate diagram: α → π → Y → W, with β → γ inside a plate of size K; W sits inside plates N_d and D.]
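A minimal sketch of this generative story in Python. The sizes, hyperparameter values, and the Poisson choice of document length are my assumptions, not part of the slides.

```python
# Sampling from the Bayesian Naive Bayes generative story; all sizes/values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
K, V, D = 2, 5, 3                     # classes, vocabulary size, documents
alpha = np.ones(K) * 0.5              # Dirichlet prior over class proportions
beta  = np.ones(V) * 0.1              # Dirichlet prior over each class's word distribution

pi = rng.dirichlet(alpha)                                    # multinomial pi over K classes
gamma = np.array([rng.dirichlet(beta) for _ in range(K)])    # gamma[y]: distribution over vocab

docs, labels = [], []
for d in range(D):
    y_d = rng.choice(K, p=pi)                      # pick a label y_d from pi
    N_d = rng.poisson(8) + 1                       # document length (hypothetical choice)
    words = rng.choice(V, size=N_d, p=gamma[y_d])  # pick each word x_id from gamma[y_d]
    labels.append(int(y_d))
    docs.append(words.tolist())

print(labels, docs)
```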

Unsupervised Naïve Bayes.
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary.
For each document d=1…D:
- Pick a label y_d from π
- For each word in d (of length N_d): pick a word x_id from γ[y_d]
(Same plate diagram as before, but now the labels Y are unobserved.)

Unsupervised Naïve Bayes. The generative story is the same; what we do is different. We want to both find values of γ and π and find values of Y for each example. EM is a natural algorithm.

LDA AS A DGM

The LDA Topic Model


Unsupervised NB vs LDA.
Unsupervised NB: one class prior π; one Y per document.
LDA: a different class distribution for each document; one Y per word.

Unsupervised NB vs LDA.
Unsupervised NB: one class prior π; one Y per document (Y → W).
LDA: a different class distribution θ_d for each document; one Z per word (Z_di → W_di), with per-topic word distributions γ_k, k=1…K.

LDA. Blei's motivation: start with the BOW assumption.
Assumptions: (1) documents are i.i.d.; (2) within a document, words are i.i.d. (bag of words).
For each document d = 1, …, M: generate θ_d ~ D_1(·); for each word n = 1, …, N_d, generate w_n ~ D_2(· | θ_d).
Now pick your favorite distributions for D_1, D_2.
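One concrete choice, sketched below, is the LDA one: D_1 = Dirichlet, and D_2 = "pick a topic z from θ_d, then a word from that topic's word distribution". All sizes and hyperparameter values here are hypothetical.

```python
# Sampling from the LDA choice of D1 and D2; sizes and hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
K, V, M = 3, 10, 4                          # topics, vocabulary size, documents
alpha, eta = 0.1, 0.01
beta = np.array([rng.dirichlet(np.full(V, eta)) for _ in range(K)])  # topic-word distributions

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))            # theta_d ~ D1(alpha)
    N_d = 15
    z = rng.choice(K, size=N_d, p=theta_d)                # a topic for each word position
    w = [int(rng.choice(V, p=beta[k])) for k in z]        # w_n ~ D2(. | theta_d)
    corpus.append(w)

print(corpus[0])
```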

Unsupervised NB vs LDA. Unsupervised NB clusters documents into latent classes. LDA clusters word occurrences into latent classes (topics). The smoothness of β_k → the same word suggests the same topic. The smoothness of θ_d → the same document suggests the same topic.

LDA's view of a document: a mixed membership model.

[Figure: LDA topics shown as the top words w by Pr(w | Z=k), for example topics Z=13, Z=22, Z=27, Z=19.]

[Figure: classification with an SVM using 50 topic features Pr(Z=k | θ_d) vs. using all words.]

Gibbs Sampling for LDA

LDA (Latent Dirichlet Allocation). Parameter learning: variational EM, or collapsed Gibbs sampling. Wait, why is sampling called learning here? Here's the idea.

LDA. Gibbs sampling works for any directed model! It is applicable when the joint distribution is hard to evaluate but the conditional distributions are known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

[Diagram: α → θ → Z, in plate form and unrolled as α → θ → Z_1, Z_2, with observed X_1, Y_1 and X_2, Y_2 downstream of the Z's.] I'll assume we know the parameters for Pr(X | Z) and Pr(Y | X).

Initialize all the hidden variables randomly, then:
Pick Z_1 ~ Pr(Z | x_1, y_1, θ)
Pick θ ~ Pr(θ | z_1, z_2, α)   (pick from the posterior)
Pick Z_2 ~ Pr(Z | x_2, y_2, θ)
Pick Z_1 ~ Pr(Z | x_1, y_1, θ)
Pick θ ~ Pr(θ | z_1, z_2, α)   (pick from the posterior)
Pick Z_2 ~ Pr(Z | x_2, y_2, θ)
…
Eventually these will converge to samples from the true joint distribution, so in a broad range of cases we will (eventually) have a sample of the true θ.
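A minimal runnable sketch of this alternating sampler. To keep it self-contained I use a simplified version of the model: θ ~ Dirichlet(α), Z_i ~ θ, and one observed X_i per Z_i drawn from a known table Pr(X | Z); the Y layer is omitted and all numbers are hypothetical.

```python
# Uncollapsed Gibbs sampling on a simplified mixture model (hypothetical data and parameters).
import numpy as np

rng = np.random.default_rng(0)
K, alpha = 2, np.array([1.0, 1.0])
px_given_z = np.array([[0.8, 0.2],        # Pr(X | Z=0)
                       [0.3, 0.7]])       # Pr(X | Z=1), assumed known
x = np.array([0, 0, 1, 1, 1, 0])          # observed data

z = rng.integers(K, size=len(x))          # initialize hidden variables randomly
theta = rng.dirichlet(alpha)

for sweep in range(1000):
    # resample each Z_i ~ Pr(Z | x_i, theta), proportional to theta[k] * Pr(x_i | Z=k)
    for i in range(len(x)):
        p = theta * px_given_z[:, x[i]]
        z[i] = rng.choice(K, p=p / p.sum())
    # resample theta from its posterior Pr(theta | z, alpha) = Dirichlet(alpha + counts)
    counts = np.bincount(z, minlength=K)
    theta = rng.dirichlet(alpha + counts)

print(theta)    # one (correlated) sample from the posterior over theta
```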

Why does Gibbs sampling work? Basic claim: when you sample x ~ p(x | y_1, …, y_k), then if y_1, …, y_k were sampled from the true joint, x will also be sampled from the true joint. So the true joint is a fixed point: you tend to stay there if you ever get there. How long does it take to get there? It depends on the structure of the space of samples: how well connected are the samples by the sampling steps?


Mixing time depends on the eigenvalues of the Markov chain's transition matrix over the state space!

LDA (Latent Dirichlet Allocation). Parameter learning: variational EM (not covered in 601-B), or collapsed Gibbs sampling. What is collapsed Gibbs sampling?

Initialize all the Z's randomly, then:
Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α)
Pick Z_2 ~ Pr(Z_2 | z_1, z_3, α)
Pick Z_3 ~ Pr(Z_3 | z_1, z_2, α)
Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α)
Pick Z_2 ~ Pr(Z_2 | z_1, z_3, α)
Pick Z_3 ~ Pr(Z_3 | z_1, z_2, α)
…
This converges to samples from the true joint, and then we can estimate Pr(θ | α, sample of Z's).

Initialize all the Z's randomly, then: Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α). What's this distribution?

Initialize all the Z's randomly, then: Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α).
Simpler case: what's this distribution? It's called a Dirichlet-multinomial, and it looks like this:
Pr(Z_1=k_1, Z_2=k_2, Z_3=k_3 | α) = ∫_θ Pr(Z_1=k_1, Z_2=k_2, Z_3=k_3 | θ) · Pr(θ | α) dθ
If there are K values for the Z's and n_k = # Z's with value k, then it turns out:
Pr(Z | α) = ∫_θ Pr(Z | θ) · Pr(θ | α) dθ = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k n_k)] · Π_k [Γ(n_k + α_k) / Γ(α_k)]
(Recall that Γ(n) = (n-1)! if n is a positive integer.)
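A minimal sketch of that closed-form marginal, computed in log space with the standard log-gamma function; the example counts are hypothetical.

```python
# Closed-form Dirichlet-multinomial marginal, evaluated in log space for stability.
from math import lgamma, exp

def log_dirichlet_multinomial(n, alpha):
    """log Pr(Z | alpha) for a specific assignment with count vector n under Dirichlet(alpha)."""
    return (lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(n))
            + sum(lgamma(nk + ak) - lgamma(ak) for nk, ak in zip(n, alpha)))

# e.g. three Z's taking values (2, 2, 3) over K=3 outcomes => counts n = [0, 2, 1]
print(exp(log_dirichlet_multinomial([0, 2, 1], [1.0, 1.0, 1.0])))   # 1/30
```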

Initialize all the Z's randomly, then: Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α).
It turns out that sampling from a Dirichlet-multinomial is very easy!
Notation: K values for the Z's; n_k = # Z's with value k; Z = (Z_1, …, Z_m); Z^(-i) = (Z_1, …, Z_{i-1}, Z_{i+1}, …, Z_m); n_k^(-i) = # Z's with value k, excluding Z_i.
Pr(Z_i = k | Z^(-i), α) ∝ n_k^(-i) + α_k
Pr(Z | α) = ∫_θ Pr(Z | θ) · Pr(θ | α) dθ = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k n_k)] · Π_k [Γ(n_k + α_k) / Γ(α_k)]
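And a minimal sketch of the resampling step itself: only the leave-one-out counts and α are needed. The current assignments below are hypothetical.

```python
# One collapsed resampling step for a Dirichlet-multinomial; assignments are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])     # Dirichlet parameters, K = 3
z = np.array([0, 2, 2, 1, 2, 0])      # current assignments Z_1..Z_m

i = 3                                                              # resample Z_i
n_minus_i = np.bincount(np.delete(z, i), minlength=len(alpha))     # n_k^(-i)
p = n_minus_i + alpha                 # Pr(Z_i = k | Z^(-i), alpha) ∝ n_k^(-i) + alpha_k
z[i] = rng.choice(len(alpha), p=p / p.sum())
print(z)
```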

What about with downstream evidence? Pr(Z_i = k | Z^(-i), α) ∝ n_k^(-i) + α_k captures the constraints on Z_i via θ (from above, the causal direction); what about via β? As before, Pr(Z | E) = (1/c) · Pr(Z | E+) · Pr(E- | Z). [Diagram: α → θ → Z_1, Z_2, Z_3 → X_1, X_2, X_3, with η → β[k] feeding the X's from below.]

Sampling for LDA. Notation:
K values for the Z_{d,i}'s
Z^(-d,i) = all the Z's but Z_{d,i}
n_{w,k} = # Z_{d,i}'s with value k paired with W_{d,i}=w
n_{*,k} = # Z_{d,i}'s with value k
n_{w,k}^(-d,i) = n_{w,k} excluding Z_{d,i}
n_{*,k}^(-d,i) = n_{*,k} excluding Z_{d,i}
n_{*,k}^{d,(-i)} = n_{*,k} from doc d, excluding Z_{d,i}

Pr(Z_{d,i} = k | Z^(-d,i), W_{d,i} = w, α, η) ∝ (n_{*,k}^{d,(-i)} + α_k) · (n_{w,k}^(-d,i) + η_w) / Σ_{w'} (n_{w',k}^(-d,i) + η_{w'})

The first factor plays the role of Pr(Z | E+): the fraction of the time Z=k in doc d. The second plays the role of Pr(E- | Z): the fraction of the time W=w in topic k. [Plate diagram: α → θ_d → Z_di → W_di ← β_k ← η.]

IMPLEMENTING LDA

Way way more detail

More detail



[Figure: picking a random value of z, with segments for z=1, z=2, z=3 stacked to unit height.]

What gets learned…

In a math-ier notation: the counts N[*,*] = V, N[*,k], N[d,k], M[w,k].

for each document d and word position j in d:
    z[d,j] = k, a random topic
    N[d,k]++
    W[w,k]++  where w = id of the j-th word in d

for each pass t = 1, 2, …:
    for each document d and word position j in d:
        z[d,j] = k, a new random topic
        update N, W to reflect the new assignment of z:
            N[d,k]++;  N[d,k']--  where k' is the old z[d,j]
            W[w,k]++;  W[w,k']--  where w is w[d,j]

p(z_{d,j} = k | …) ∝ Pr(Z_{d,j} = k | "d") · Pr(W_{d,j} = w | Z_{d,j} = k, …)
= (N[d,k] - C_{d,j,k} + α) · (W[w,k] - C_{d,j,k} + β) / ((W[*,k] - C_{d,j,k}) + β·N[*,*])
where C_{d,j,k} = 1 if z_{d,j} = k and 0 otherwise, i.e., the word's current assignment is subtracted out of the counts before sampling.
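Putting the last three slides together, here is a sketch of the collapsed Gibbs sampler in Python, using the slides' count arrays N[d,k] and W[w,k]. The toy corpus, V, K, α, β, and the number of passes are hypothetical choices of mine.

```python
# Collapsed Gibbs sampling for LDA on a toy corpus; all sizes and hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
corpus = [[0, 1, 1, 3], [2, 3, 3, 0], [4, 4, 2, 1]]   # documents as lists of word ids
V, K, alpha, beta = 5, 2, 0.1, 0.01

N = np.zeros((len(corpus), K))          # N[d,k]: words in doc d assigned topic k
W = np.zeros((V, K))                    # W[w,k]: times word w assigned topic k
z = []                                  # z[d][j]: topic of word position j in doc d

# initialization pass: random topics, counts updated to match
for d, doc in enumerate(corpus):
    z.append([])
    for w_id in doc:
        k = int(rng.integers(K))
        z[d].append(k)
        N[d, k] += 1
        W[w_id, k] += 1

for t in range(200):                    # sampling passes
    for d, doc in enumerate(corpus):
        for j, w_id in enumerate(doc):
            k_old = z[d][j]
            N[d, k_old] -= 1            # remove the current assignment (the C_{d,j,k} term)
            W[w_id, k_old] -= 1
            # p(z_{d,j}=k | ...) ∝ (N[d,k]+alpha) * (W[w,k]+beta) / (W[*,k]+beta*V)
            p = (N[d] + alpha) * (W[w_id] + beta) / (W.sum(axis=0) + beta * V)
            k_new = int(rng.choice(K, p=p / p.sum()))
            z[d][j] = k_new
            N[d, k_new] += 1
            W[w_id, k_new] += 1

print(N / N.sum(axis=1, keepdims=True))   # estimated document-topic proportions
```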

[Figure: the same unit-height sampling picture, with segments for z=1, z=2, z=3.]
1. You spend a lot of time sampling.
2. There's a loop over all topics here in the sampler.