Topic Models
Charles Elkan
elkan@cs.ucsd.edu
November 20, 2008

Suppose that we have a collection of documents, and we want to find an organization for these, i.e. we want to do unsupervised learning. One approach is to train a mixture distribution to fit the data. This model is based on the assumption that the documents can be grouped into clusters. Formally, a mixture distribution is a probability density function of the form

$$f(x) = \sum_{k=1}^{K} \alpha_k f(x; \theta_k).$$

Here, $K$ is the number of components in the mixture model. For each $k$, $f(x; \theta_k)$ is the pdf of component number $k$. The scalar $\alpha_k$ is the proportion of component number $k$.

The specific topic model we consider is called latent Dirichlet allocation (LDA). (The same abbreviation is also used for linear discriminant analysis, which is unrelated.) LDA is based on the intuition that each document contains words from multiple topics; the proportion of each topic in each document is different, but the topics themselves are the same for all documents.

1 Mathematical preliminaries

Before we can see the LDA model in mathematical detail, we need to review the Dirichlet distribution. Let $\bar\gamma$ be a vector of length $S$ such that $\gamma_s \geq 0$ for all $s$ and $\sum_{s=1}^{S} \gamma_s = 1$. The Dirichlet is a probability density function over all such vectors. The Dirichlet is useful because it is a well-defined prior distribution for a multinomial: the set of all vectors $\bar\gamma$ satisfying the conditions above is precisely the set of all possible multinomial parameter vectors.

A Dirichlet distribution has parameter vector $\bar\alpha$ of length $S$. Note that $\alpha_s > 0$ is required for all $s$, but there is no constraint on $\sum_{s=1}^{S} \alpha_s$. Concretely, the equation for the Dirichlet distribution is

$$p(\bar\gamma \mid \bar\alpha) = \frac{1}{D(\bar\alpha)} \prod_{s=1}^{S} \gamma_s^{\alpha_s - 1}$$

where the function $D$ is defined as

$$D(\bar\alpha) = \frac{\prod_{s=1}^{S} \Gamma(\alpha_s)}{\Gamma\!\left(\sum_{s=1}^{S} \alpha_s\right)}.$$

Here the $\Gamma$ function is the standard continuous generalization of the factorial, such that $\Gamma(k) = (k-1)!$ if $k$ is an integer. The $D$ function has a very important property that will be used below:

$$D(\bar\alpha) = \int_{\bar\gamma} \prod_i \gamma_i^{\alpha_i - 1}$$

where the integral is over the set of all unit vectors $\bar\gamma$.

2 The LDA model and Gibbs sampling

The generative process assumed by the LDA model is as follows:

Given: Dirichlet distribution with parameter vector $\bar\alpha$ of length $K$
Given: Dirichlet distribution with parameter vector $\bar\beta$ of length $V$
for topic number 1 to topic number $K$
    draw a multinomial with parameter vector $\bar\phi_k$ according to $\bar\beta$
for document number 1 to document number $M$
    draw a topic distribution, i.e. a multinomial $\bar\theta$, according to $\bar\alpha$
    for each word in the document
        draw a topic $z$ according to $\bar\theta$
        draw a word $w$ according to $\bar\phi_z$

Note that $z$ is an integer between 1 and $K$ for each word.
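To make the generative process concrete, the following is a minimal sketch of it in Python. The function `generate_corpus`, its argument names, and the use of numpy's Dirichlet and categorical samplers are illustrative choices of mine, not part of the original notes.

```python
import numpy as np

def generate_corpus(alpha, beta, doc_lengths, rng=None):
    """Sample a corpus from the LDA generative process (illustrative sketch).

    alpha:       length-K Dirichlet parameter vector for per-document topic proportions.
    beta:        length-V Dirichlet parameter vector for per-topic word distributions.
    doc_lengths: list of document lengths N_m, one entry per document.
    """
    rng = np.random.default_rng() if rng is None else rng
    K, V = len(alpha), len(beta)

    # Draw one multinomial parameter vector phi_k per topic, according to beta.
    phi = rng.dirichlet(beta, size=K)              # shape (K, V)

    docs, assignments, thetas = [], [], []
    for n_m in doc_lengths:
        theta = rng.dirichlet(alpha)               # topic distribution for this document
        z = rng.choice(K, size=n_m, p=theta)       # draw a topic for each word position
        w = [rng.choice(V, p=phi[k]) for k in z]   # draw each word from its topic
        docs.append(w)
        assignments.append(list(z))
        thetas.append(theta)
    return docs, assignments, thetas, phi
```

For example, `generate_corpus(np.full(10, 0.1), np.full(500, 0.01), doc_lengths=[50]*20)` would sample twenty 50-word documents over a 500-word vocabulary with 10 topics; Dirichlet parameters below one tend to yield peaked topic and document distributions.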

For learning, the training data are the words in all documents. The prior distributions $\bar\alpha$ and $\bar\beta$ are assumed to be fixed and known, as are the number $K$ of topics, the number $M$ of documents, the length $N_m$ of each document, and the cardinality $V$ of the vocabulary (i.e. the dictionary of all words). Learning has two goals: (i) to infer the document-specific multinomial $\bar\theta$ for each document, and (ii) to infer the topic distribution $\bar\phi_k$ for each topic.

The algorithm that we use for learning is called collapsed Gibbs sampling. It does not infer the $\bar\theta$ and $\bar\phi$ distributions directly. Instead, it infers the hidden values $z$ for all words $w$. Suppose that we know the value of $z$ for every word in the corpus except word number $i$. The idea of Gibbs sampling is to draw a $z$ value for word $i$ randomly according to its distribution, then assume that we know this value, then draw a $z$ value for another word, and so on. It can be proved that eventually this process converges to a correct distribution of $z$ values for all words in the corpus. Note that Gibbs sampling never converges to a fixed $z$ for each $w$; instead it converges to a distribution of $z$ values for each $w$.

Let $\bar w$ be the sequence of words making up the entire corpus, and let $\bar z$ be a corresponding sequence of $z$ values. Note that $\bar w$ is not a vector of word counts. Use the notation $\bar w'$ to mean $\bar w$ with word number $i$ removed, so $\bar w = \{w_i, \bar w'\}$. Similarly, write $\bar z = \{z_i, \bar z'\}$. In order to do Gibbs sampling, we need to compute

$$p(z_i \mid \bar z', \bar w) = \frac{p(\bar z, \bar w)}{p(\bar z', \bar w)} = \frac{p(\bar w \mid \bar z)\, p(\bar z)}{p(w_i \mid \bar z')\, p(\bar w' \mid \bar z')\, p(\bar z')}$$

for $z_i = 1$ to $z_i = K$. In principle the entire denominator can be ignored, because it is a constant that does not depend on $z_i$. However, we will pay attention to the second and third factors in the denominator, because they lead to cancellations with the numerator. So we will evaluate

$$p(z_i \mid \bar z', \bar w) \propto \frac{p(\bar w \mid \bar z)\, p(\bar z)}{p(\bar w' \mid \bar z')\, p(\bar z')}. \qquad (1)$$

3 Mathematical details

We will work out the four factors of (1) one by one. Consider $p(\bar z)$ first, and let $\bar z$ refer temporarily to just document number $m$. For this document,

$$p(\bar z \mid \bar\theta) = \prod_{i=1}^{N_m} p(z_i) = \prod_{k=1}^{K} \theta_k^{n_{mk}}$$

where $n_{mk}$ is the number of times $z_i = k$ within document $m$, and $\bar\theta$ is the multinomial parameter vector for document $m$. What we really want is $p(\bar z \mid \bar\alpha)$, which requires averaging over all possible $\bar\theta$. This is

$$p(\bar z \mid \bar\alpha) = \int_{\bar\theta} p(\bar\theta \mid \bar\alpha)\, p(\bar z \mid \bar\theta).$$

By the definition of the Dirichlet distribution,

$$p(\bar\theta \mid \bar\alpha) = \frac{1}{D(\bar\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}.$$

Therefore,

$$p(\bar z \mid \bar\alpha) = \int_{\bar\theta} \frac{1}{D(\bar\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{n_{mk}} = \frac{1}{D(\bar\alpha)} \int_{\bar\theta} \prod_{k=1}^{K} \theta_k^{n_{mk} + \alpha_k - 1} = \frac{D(\bar n_m + \bar\alpha)}{D(\bar\alpha)}$$

where $\bar n_m$ is the vector of topic counts $\langle n_{m1}, \ldots, n_{mK} \rangle$ for document number $m$. Similarly,

$$p(\bar z' \mid \bar\alpha) = \frac{D(\bar n_m' + \bar\alpha)}{D(\bar\alpha)}$$

where $\bar n_m'$ is $\bar n_m$ with one subtracted from the count for topic number $z_i$. For the corpus of all $M$ documents,

$$p(\bar z \mid \bar\alpha) = \prod_{m=1}^{M} \frac{D(\bar n_m + \bar\alpha)}{D(\bar\alpha)} \qquad (2)$$

and

$$p(\bar z' \mid \bar\alpha) = \prod_{m=1}^{M} \frac{D(\bar n_m' + \bar\alpha)}{D(\bar\alpha)}. \qquad (3)$$

Note that all factors except one are identical in (2) and (3).
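Since $D$ is a ratio of Gamma functions, it is best evaluated on the log scale in practice. The sketch below is mine, not from the notes: a hypothetical helper `log_D` built on scipy's `gammaln`, used to evaluate the logarithm of equation (2) given the matrix of per-document topic counts $n_{mk}$.

```python
import numpy as np
from scipy.special import gammaln

def log_D(vec):
    """log D(vec) = sum_s log Gamma(vec_s) - log Gamma(sum_s vec_s)."""
    vec = np.asarray(vec, dtype=float)
    return gammaln(vec).sum() - gammaln(vec.sum())

def log_p_z_given_alpha(n_mk, alpha):
    """Logarithm of equation (2): sum over documents of log D(n_m + alpha) - log D(alpha).

    n_mk:  array of shape (M, K), topic counts per document.
    alpha: length-K Dirichlet parameter vector.
    """
    return sum(log_D(n_m + alpha) - log_D(alpha) for n_m in np.asarray(n_mk))
```

The analogous quantity for $p(\bar w \mid \bar z, \bar\beta)$, derived below from the topic-word counts, can be computed in the same way; the sum of the two log probabilities is a convenient quantity to monitor while Gibbs sampling runs.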

Referring back to (1), consider $p(\bar w \mid \bar z)$, which means $p(\bar w \mid \bar z, \bar\beta)$. This is

$$\int_{\Phi} p(\Phi \mid \bar\beta)\, p(\bar w \mid \bar z, \Phi)$$

where $\Phi$ is a collection of $K$ different topic distributions $\bar\phi_k$. Again by the definition of the Dirichlet distribution,

$$p(\Phi \mid \bar\beta) = \prod_{k=1}^{K} \frac{1}{D(\bar\beta)} \prod_{t=1}^{V} \phi_{kt}^{\beta_t - 1}.$$

Now consider $p(\bar w \mid \bar z, \Phi)$. To evaluate this, group the words $w_i$ together according to which topic $z_i$ they come from:

$$p(\bar w \mid \bar z, \Phi) = \prod_{k=1}^{K} \prod_{i : z_i = k} p(w_i \mid z_i, \Phi) = \prod_{k=1}^{K} \prod_{t=1}^{V} \phi_{kt}^{q_{kt}}$$

where $q_{kt}$ is the number of times that word $t$ occurs with topic $k$ in the whole corpus. We get that

$$p(\bar w \mid \bar z, \bar\beta) = \int_{\Phi} \left( \prod_{k=1}^{K} \frac{1}{D(\bar\beta)} \prod_{t=1}^{V} \phi_{kt}^{\beta_t - 1} \right) \left( \prod_{k=1}^{K} \prod_{t=1}^{V} \phi_{kt}^{q_{kt}} \right) = \prod_{k=1}^{K} \frac{1}{D(\bar\beta)} \int_{\bar\phi_k} \prod_{t=1}^{V} \phi_{kt}^{\beta_t + q_{kt} - 1} = \prod_{k=1}^{K} \frac{D(\bar q_k + \bar\beta)}{D(\bar\beta)}$$

where $\bar q_k$ is the vector of counts over the whole corpus of words belonging to topic $k$. This equation is similar to (2), with the corpus divided into $K$ topics instead of into $M$ documents.

Referring back again to (1), we get that $p(z_i \mid \bar z', \bar w)$ is proportional to

$$\frac{\prod_{k=1}^{K} D(\bar q_k + \bar\beta) / D(\bar\beta)^K}{\prod_{k=1}^{K} D(\bar q_k' + \bar\beta) / D(\bar\beta)^K} \cdot \frac{\prod_{m=1}^{M} D(\bar n_m + \bar\alpha) / D(\bar\alpha)^M}{\prod_{m=1}^{M} D(\bar n_m' + \bar\alpha) / D(\bar\alpha)^M}.$$

The $D(\bar\beta)^K$ and $D(\bar\alpha)^M$ factors obviously cancel above. The products can be eliminated also because $\bar q_k + \bar\beta = \bar q_k' + \bar\beta$ except when the topic $k = z_i$, and $\bar n_m + \bar\alpha = \bar n_m' + \bar\alpha$ except for the document $m$ that word $i$ belongs to. So,

$$p(z_i \mid \bar z', \bar w) \propto \frac{D(\bar q_{z_i} + \bar\beta)}{D(\bar q_{z_i}' + \bar\beta)} \cdot \frac{D(\bar n_m + \bar\alpha)}{D(\bar n_m' + \bar\alpha)}.$$

For any vector $\bar\gamma$, the definition of the $D$ function is

$$D(\bar\gamma) = \frac{\prod_s \Gamma(\gamma_s)}{\Gamma\!\left(\sum_s \gamma_s\right)}$$

where $s$ indexes the components of $\bar\gamma$. Using this definition, $p(z_i = j \mid \bar z', \bar w)$ is proportional to

$$\frac{\prod_t \Gamma(q_{jt} + \beta_t)}{\Gamma\!\left(\sum_t q_{jt} + \beta_t\right)} \cdot \frac{\Gamma\!\left(\sum_t q_{jt}' + \beta_t\right)}{\prod_t \Gamma(q_{jt}' + \beta_t)} \cdot \frac{\prod_k \Gamma(n_{mk} + \alpha_k)}{\Gamma\!\left(\sum_k n_{mk} + \alpha_k\right)} \cdot \frac{\Gamma\!\left(\sum_k n_{mk}' + \alpha_k\right)}{\prod_k \Gamma(n_{mk}' + \alpha_k)}.$$

Now $q_{jt} + \beta_t = q_{jt}' + \beta_t$ except when $t = w_i$, in which case $q_{jt} + \beta_t = q_{jt}' + \beta_t + 1$, so

$$\frac{\prod_t \Gamma(q_{jt} + \beta_t)}{\prod_t \Gamma(q_{jt}' + \beta_t)} = \frac{\Gamma(q_{jw_i}' + \beta_{w_i} + 1)}{\Gamma(q_{jw_i}' + \beta_{w_i})}.$$

Applying the fact that $\Gamma(x+1)/\Gamma(x) = x$ yields

$$\frac{\Gamma(q_{jw_i}' + \beta_{w_i} + 1)}{\Gamma(q_{jw_i}' + \beta_{w_i})} = q_{jw_i}' + \beta_{w_i}.$$

Similarly,

$$\frac{\prod_k \Gamma(n_{mk} + \alpha_k)}{\prod_k \Gamma(n_{mk}' + \alpha_k)} = n_{mj}' + \alpha_j$$

where $j = z_i$ is the candidate topic assignment of word $i$. Summing over all words $t$ in the vocabulary gives

$$\sum_{t=1}^{V} q_{jt} + \beta_t = 1 + \sum_{t=1}^{V} q_{jt}' + \beta_t$$

so

$$\frac{\Gamma\!\left(\sum_t q_{jt}' + \beta_t\right)}{\Gamma\!\left(\sum_t q_{jt} + \beta_t\right)} = \frac{1}{\sum_t q_{jt}' + \beta_t}.$$

Similarly,

$$\frac{\Gamma\!\left(\sum_k n_{mk}' + \alpha_k\right)}{\Gamma\!\left(\sum_k n_{mk} + \alpha_k\right)} = \frac{1}{\sum_k n_{mk}' + \alpha_k}.$$

Putting the simplifications above together, $p(z_i = j \mid \bar z', \bar w)$ is proportional to

$$\frac{q_{jw_i}' + \beta_{w_i}}{\sum_t q_{jt}' + \beta_t} \cdot \frac{n_{mj}' + \alpha_j}{\sum_k n_{mk}' + \alpha_k}. \qquad (4)$$

This result says that occurrence number $i$ is more likely to belong to topic $j$ if $q_{jw_i}'$ and/or $n_{mj}'$ are large, i.e. if the same word occurs often with topic $j$ elsewhere in the corpus, and/or if topic $j$ occurs often elsewhere inside document $m$.

4 Implementation notes

In expression (4) the vectors $\bar\alpha$ and $\bar\beta$ are fixed and known. Each value $q_{jw_i}'$ depends on the current assignment of topics to each appearance of the word $w_i$ throughout the corpus, not including the appearance as word number $i$. These assignments can be initialized in any desired way, and are then updated one at a time by the Gibbs sampling process. Each value $n_{mj}'$ is the count of how many words within document $m$ are assigned to topic $j$, not including word number $i$. These counts can be computed easily after the initial topic assignments are chosen, and then they can be updated in constant time whenever the topic assigned to a word changes.

The LDA generative model treats each word in each document individually. However, the specific order in which words are generated does not influence the probability of a document according to the generative model. Similarly, the Gibbs sampling algorithm works with a specific ordering of the words in the corpus, but any ordering will do. Hence, the sequence of words inside each training document does not need to be known. The only training information needed for each document is how many times each word appears in it. Therefore, the LDA model can be learned from the standard bag-of-words representation of documents.

The standard approach to implementing Gibbs sampling iterates over every word of every document, taking the words in some arbitrary order. For each word, expression (4) is evaluated for each alternative topic $j$. For each $j$, evaluating (4) requires constant time, so the time to perform one epoch of Gibbs sampling is $O(NK)$ where $N$ is the total number of words in all documents and $K$ is the number of topics.
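As a concrete illustration of expression (4) and the bookkeeping just described, here is a minimal sketch of one epoch of collapsed Gibbs sampling in Python. It is my own illustrative code, not part of the original notes: the names `gibbs_epoch`, `q_kt`, and `n_mk` are assumptions, the corpus is taken to be a list of documents each given as a list of word ids, and the factor $\sum_k n_{mk}' + \alpha_k$ is dropped because it is the same for every candidate topic $j$.

```python
import numpy as np

def gibbs_epoch(docs, z, q_kt, n_mk, alpha, beta, rng):
    """One epoch of collapsed Gibbs sampling for LDA, using expression (4).

    docs:  list of documents, each a list of word ids in [0, V)
    z:     current topic assignments, same shape as docs (lists of ints in [0, K))
    q_kt:  array of shape (K, V); q_kt[j, t] = count of word t assigned to topic j
    n_mk:  array of shape (M, K); n_mk[m, j] = count of words in document m assigned to topic j
    alpha: length-K Dirichlet parameter vector
    beta:  length-V Dirichlet parameter vector
    rng:   numpy random Generator
    """
    K = len(alpha)
    q_k = q_kt.sum(axis=1)                 # total count of words assigned to each topic
    beta_sum = beta.sum()
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[m][i]
            # Remove word i from the counts: these are the primed counts q' and n'.
            q_kt[old, w] -= 1
            q_k[old] -= 1
            n_mk[m, old] -= 1
            # Expression (4) for every candidate topic j at once; the factor
            # sum_k n'_mk + alpha_k is constant in j, so it is dropped here.
            p = (q_kt[:, w] + beta[w]) / (q_k + beta_sum) * (n_mk[m] + alpha)
            new = rng.choice(K, p=p / p.sum())
            # Put word i back with its newly sampled topic.
            z[m][i] = new
            q_kt[new, w] += 1
            q_k[new] += 1
            n_mk[m, new] += 1
```

Each update touches only the counts for one word's old and new topic, so the per-word cost is $O(K)$ and one full epoch costs $O(NK)$, matching the analysis above. Initializing the assignments (for example uniformly at random), building the corresponding count arrays, and choosing how many epochs to run are left to the caller.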