Latent Dirichlet Allocation

1 Latent Dirichlet Allocation. Blei, Ng and Jordan (2002). Presented by Deepak Santhanam

2 What is Latent Dirichlet Allocation? A generative model for collections of discrete data. The data are generated by parameters which can be learned and used to do inference. LDA is a hierarchical Bayesian model.

3 LDA and Document Modeling. A document of a collection is modeled as a finite mixture over underlying topics. Topics in turn are modeled as an infinite mixture over an underlying set of topic probabilities. Topic probabilities are explicit representations of a document. The goal is to find short descriptions of the members of a collection while preserving statistical relations. Document classification is easier with LDA.

4 Previous Schemes for Document Modeling. tf-idf: a scheme where counts are taken for each word and the document is modeled by those weighted counts. Latent Semantic Indexing (LSI): uses SVD to capture the tf-idf features that account for most of the variance. pLSI: each word in a document is a sample from a mixture model and is generated from a single topic. (Each document is represented as mixing proportions of topics, and there is no probabilistic model for these proportions.)
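
The first two schemes can be sketched in a few lines. This is a minimal illustration only: it assumes raw counts as term frequency, idf = log(M / df), and a tiny made-up count matrix; tf-idf has several common variants and the slides do not say which one is meant.

```python
import numpy as np

def tf_idf(counts):
    """counts: (M documents, V terms) raw term counts. Assumes idf = log(M / df)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
    idf = np.log(n_docs / np.maximum(doc_freq, 1))   # guard against unused terms
    return counts * idf

def lsi(matrix, rank):
    """Represent each document by its coordinates on the top `rank` singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]

# Hypothetical toy corpus: 3 documents, 4 terms.
counts = np.array([[2, 1, 0, 1],
                   [0, 1, 3, 1],
                   [1, 0, 1, 1]])
weights = tf_idf(counts)
doc_embeddings = lsi(weights, rank=2)
```

A term that occurs in every document (the last column here) gets weight zero, which is the sense in which tf-idf discounts uninformative words.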

5-10 An Early Example. [Figure, built up over six slides: graphical model with nodes α, θ, β, z and w and a plate over N, annotated with the labels "Mixture of Topics", "Topics", "Words" and "Document".]

11 Exchangeability and Bag of Words. The bag-of-words assumption is that the order of words in the document can be neglected. A finite set of random variables $\{x_1, \ldots, x_N\}$ is exchangeable if the joint distribution is invariant to any permutation of these RVs, i.e. if $\sigma$ is a permutation of 1 to N: $P(x_1, \ldots, x_N) = P(x_{\sigma(1)}, \ldots, x_{\sigma(N)})$. E.g.: any weighted average (mixture) of i.i.d. sequences of random variables is exchangeable.
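
The mixture example can be checked numerically. The sketch below assumes binary variables and a two-component mixture of coin biases; the weights and biases are illustrative values, not taken from the slides. It also shows that exchangeable variables need not be independent.

```python
from itertools import permutations

# Hypothetical mixture: pick a coin bias with the given weight, then flip i.i.d.
WEIGHTS = [0.3, 0.7]
BIASES = [0.2, 0.9]

def joint(xs):
    """P(x_1..x_N) under a weighted average (mixture) of i.i.d. Bernoulli sequences."""
    total = 0.0
    for weight, bias in zip(WEIGHTS, BIASES):
        prob = 1.0
        for x in xs:
            prob *= bias if x == 1 else (1.0 - bias)
        total += weight * prob
    return total

def is_exchangeable(xs, tol=1e-12):
    """Check that every permutation of xs has the same joint probability."""
    base = joint(xs)
    return all(abs(joint(perm) - base) < tol for perm in permutations(xs))

def is_independent_pair(tol=1e-12):
    """Compare P(x1=1, x2=1) with P(x1=1) * P(x2=1)."""
    return abs(joint((1, 1)) - joint((1,)) ** 2) < tol
```

Here `is_exchangeable((1, 0, 0, 1))` holds while `is_independent_pair()` does not: the variables are only i.i.d. once the hidden mixture component is fixed.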

12-14 De Finetti's Theorem. The joint of an infinitely exchangeable sequence of RVs can be rewritten by drawing a random parameter from some distribution and treating the RVs as i.i.d. conditioned on that random parameter. Here $\theta$ is the random parameter of a multinomial over topics, so the topics $z_n$, $n = 1, \ldots, N$, are i.i.d. conditioned on $\theta$. [Figure: $\theta$ pointing to $z_n$ inside a plate of size N.]

15 LDA and Exchangeability. Words are generated by topics with a fixed conditional distribution. Topics are infinitely exchangeable within a document. For a document $\mathbf{w} = (w_1, w_2, \ldots, w_N)$ of N words and a corpus of M documents $C = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$, with k topics denoted by z,

$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta$$

What type of distribution can be used for $p(\theta)$ to make inference easy?

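16 The Dirichlet Distribution. A k-dimensional Dirichlet RV $\theta$ takes values in the (k-1)-simplex and has the following density on that simplex:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$

where $\alpha$ is a k-vector with components greater than 0. The Dirichlet makes inference easy because it has finite-dimensional sufficient statistics and is conjugate to the multinomial distribution.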
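
Conjugacy means that a Dirichlet prior combined with multinomial counts gives another Dirichlet whose parameter is the prior parameter plus the counts. A minimal sketch with NumPy, assuming an illustrative prior with k = 3 and 20 observed draws (neither value comes from the slides):

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Dir(alpha) prior + multinomial counts -> Dir(alpha + counts) posterior."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])      # illustrative prior parameter
theta = rng.dirichlet(alpha)           # one point on the 2-simplex
counts = rng.multinomial(20, theta)    # 20 draws from Multinomial(theta)
posterior = dirichlet_posterior(alpha, counts)
posterior_mean = posterior / posterior.sum()
```

The counts are the finite-dimensional sufficient statistics mentioned on the slide: the posterior depends on the data only through them.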

17 Generative Process of LDA. Choose $N \sim \text{Poisson}(\xi)$. Choose $\theta \sim \text{Dir}(\alpha)$. For each word $w_n$: choose a topic $z_n \sim \text{Multinomial}(\theta)$, then choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$. $\beta$ is a $k \times V$ matrix with $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$.
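
The generative process above can be sketched directly with NumPy. The function name, the toy sizes (3 topics, 8 vocabulary words) and the parameter values are assumptions made for illustration.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process.

    alpha: (k,) Dirichlet parameter.
    beta:  (k, V) matrix whose rows are per-topic word distributions.
    xi:    Poisson mean for the document length.
    """
    n_words = rng.poisson(xi)                       # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                    # theta ~ Dir(alpha)
    k, vocab_size = beta.shape
    topics = rng.choice(k, size=n_words, p=theta)   # z_n ~ Multinomial(theta)
    words = np.array(
        [rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int
    )                                               # w_n ~ p(w_n | z_n, beta)
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.1, 0.1, 0.1])
beta = rng.dirichlet(np.ones(8), size=3)   # 3 topics over a toy 8-word vocabulary
theta, topics, words = generate_document(alpha, beta, xi=50, rng=rng)
```

Words are returned as vocabulary indices; their order carries no information, which is the bag-of-words assumption in practice.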

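18 Graphical Model of LDA. The joint over the topic mixture, topics and words is given by

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

$\alpha$ and $\beta$ are sampled once per corpus, $\theta$ is sampled once for every document, and $z_n$ and $w_n$ are sampled once for every word.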

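19 The Marginal of a Document and the Probability of the Corpus. Integrating over the topic mixture $\theta$ and summing over the topics $z_n$ gives the marginal of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

The product of the marginals of all documents gives the probability of the corpus:

$$p(C \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$

$\alpha$ and $\beta$ are corpus-level parameters, $\theta_d$ are document-level variables, and $z_{dn}$ and $w_{dn}$ are word-level variables.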
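
For intuition only, the document marginal can be approximated by naive Monte Carlo: draw many $\theta$ from the Dirichlet prior and average the resulting word likelihoods. This is not the inference method used in the slides, and the function name and sample count below are assumptions.

```python
import numpy as np

def log_marginal_mc(words, alpha, beta, n_samples=5000, seed=0):
    """Monte Carlo estimate of log p(w | alpha, beta) for one document.

    words: sequence of vocabulary indices; beta: (k, V) topic-word matrix.
    """
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_samples)   # (S, k) draws of theta
    # For each draw and each word: sum_z p(z | theta) p(w_n | z, beta)
    word_probs = thetas @ beta[:, words]            # (S, N)
    log_lik = np.log(word_probs).sum(axis=1)        # log of the product over n
    peak = log_lik.max()
    return peak + np.log(np.mean(np.exp(log_lik - peak)))   # stable log-mean-exp
```

The estimator becomes very noisy for long documents, which is one practical reason to prefer the approximate inference methods discussed next.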

20 Geometric Representation

21 Inference Problem. We have to find the posterior of the latent variables of a document given its words. This is intractable because normalizing the posterior requires marginalizing over the hidden variables, and the summation over topics creates a tight coupling between two parameters, $\theta$ and $\beta$. Use approximate inference such as MCMC or variational methods.

22-23 Variational Inference. Drop the edges which cause the coupling in the graphical model (the problematic edges). This gives a simplified graphical model with free variational parameters. The problematic coupling is not present in the simpler graphical model.

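24-25 Variational Inference. Dropping the edges results in the following distribution:

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$$

where $\gamma$ is a Dirichlet parameter and $\phi_n$ are multinomial parameters. Minimize the Kullback-Leibler divergence between $q(\theta, \mathbf{z} \mid \gamma, \phi)$ and the true posterior $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$. Equating the derivatives of the KL divergence to zero, we get the update equations

$$\phi_{ni} \propto \beta_{i w_n} \exp\left( E_q[\log \theta_i \mid \gamma] \right), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$

with $E_q[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)$, where $\Psi$ is the digamma function.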
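
The two updates can be iterated as a fixed-point loop for a single document. The sketch below is a minimal version under stated assumptions: a fixed number of iterations instead of a convergence check, the initialization $\phi_{ni} = 1/k$ and $\gamma_i = \alpha_i + N/k$, and a small hand-written `digamma` helper so the block needs only NumPy (`scipy.special.digamma` is the usual choice).

```python
import math
import numpy as np

def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:                 # shift x upward: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def e_step(words, alpha, beta, n_iter=50):
    """Variational updates for one document.

    words: vocabulary indices; alpha: (k,); beta: (k, V).
    Returns gamma (k,) and phi (N, k).
    """
    k = alpha.shape[0]
    n = len(words)
    phi = np.full((n, k), 1.0 / k)
    gamma = alpha + n / k
    for _ in range(n_iter):
        # E_q[log theta_i | gamma] = psi(gamma_i) - psi(sum_j gamma_j)
        e_log_theta = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())
        phi = beta[:, words].T * np.exp(e_log_theta)   # phi_ni ∝ beta_{i,w_n} exp(...)
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)                # gamma_i = alpha_i + sum_n phi_ni
    return gamma, phi
```

Each row of `phi` is a distribution over topics for one word, and `gamma` always sums to `alpha.sum() + N`.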

26 Parameter Estimation. Using empirical Bayes, find the parameters which maximize the log likelihood of the data. This is intractable for the same reasons as inference, but variational inference provides a tight lower bound on the log likelihood. Alternating variational expectation maximization: E step: for each document, find the optimizing values of the variational parameters $(\gamma, \phi)$. M step: maximize the lower bound on the likelihood with respect to the model parameters $(\alpha, \beta)$.
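
For $\beta$, the M step has a closed form: $\beta_{ij} \propto \sum_d \sum_n \phi_{dni}\, w_{dn}^j$, i.e. each topic's word distribution is the normalized sum of the $\phi$ mass assigned to that word. A minimal sketch, assuming the per-document $\phi$ matrices are already available from the E step; the update for $\alpha$ needs an iterative method such as Newton-Raphson and is left out here.

```python
import numpy as np

def m_step_beta(docs, phis, vocab_size):
    """Closed-form update of the topic-word matrix.

    docs: list of arrays of vocabulary indices, one per document.
    phis: list of (N_d, k) variational multinomial parameters.
    Assumes every topic receives some mass; otherwise a row would be 0/0.
    """
    k = phis[0].shape[1]
    counts = np.zeros((vocab_size, k))
    for words, phi in zip(docs, phis):
        # np.add.at accumulates correctly when a word index repeats.
        np.add.at(counts, np.asarray(words), phi)
    beta = counts.T
    return beta / beta.sum(axis=1, keepdims=True)
```

Alternating this update with the per-document E step until the lower bound stops improving gives the variational EM loop described on the slide.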

27-28 Smoothing. Under maximum likelihood estimates, the likelihood of a previously unseen document is zero whenever it contains a word that did not appear in the training corpus. Smooth the matrix $\beta$ by treating its elements as RVs endowed with a posterior conditioned on the data; this places another Dirichlet prior on $\beta$. Do the whole inference procedure again for the new model to get new update equations. [Figure: smoothed graphical model with the second Dirichlet prior and the final hyperparameters.]
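
The effect of smoothing can be seen on a toy example. The sketch assumes a symmetric Dirichlet prior with a scalar parameter, called `eta` here, on each row of $\beta$, and the variational update $\lambda_{ij} = \eta + \sum_d \sum_n \phi_{dni}\, w_{dn}^j$; the names `eta` and `lam` and all numbers are illustrative.

```python
import numpy as np

def smoothed_lambda(docs, phis, vocab_size, eta):
    """Variational Dirichlet parameters for the rows of beta under smoothing.

    docs: list of arrays of vocabulary indices; phis: list of (N_d, k) arrays.
    """
    k = phis[0].shape[1]
    counts = np.zeros((vocab_size, k))
    for words, phi in zip(docs, phis):
        np.add.at(counts, np.asarray(words), phi)
    return eta + counts.T                       # shape (k, V)

# Toy corpus: word index 3 never appears in training.
docs = [np.array([0, 0, 1]), np.array([2, 1])]
phis = [np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
        np.array([[0.0, 1.0], [0.5, 0.5]])]
lam = smoothed_lambda(docs, phis, vocab_size=4, eta=0.1)
expected_beta = lam / lam.sum(axis=1, keepdims=True)   # posterior mean of beta
```

With `eta = 0` the unseen word would have probability zero in every topic; with `eta > 0` every word keeps some probability, so new documents no longer get zero likelihood.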

29 Extending LDA. Make a continuous variant using Gaussians instead of multinomials. Obtain a particular form of clustering by having a mixture of Dirichlet distributions instead of one. What must be done to extend LDA to a more useful model? Can we use this LDA model in computer vision?

30-32 Application in Computer Vision. LDA is one of the methods extended in "Describing Visual Scenes" (Sudderth et al. 2005). Topics in a scene. Spatial relationships? More hierarchies, cooler models. [Figure: scene image with regions labeled Llama, Sky, Tree and Grass.]

33 Even cooler models..!

34 Take-home message. LDA illustrates how probabilistic models can be scaled up. With good inference techniques, we can solve hard problems in multiple domains that have multiple levels of hierarchy. Generative models are modular and easily extensible.

35 Thank You!
