Topic Models Material adapted from David Mimno University of Maryland INTRODUCTION Material adapted from David Mimno UMD Topic Models 1 / 51
Why topic models? Suppose you have a huge number of documents Want to know what s going on Can t read them all (e.g. every New York Times article from the 90 s) Topic models offer a way to get a corpus-level view of major themes Material adapted from David Mimno UMD Topic Models 2 / 51
Why topic models? Suppose you have a huge number of documents Want to know what s going on Can t read them all (e.g. every New York Times article from the 90 s) Topic models offer a way to get a corpus-level view of major themes Unsupervised Material adapted from David Mimno UMD Topic Models 2 / 51
Roadmap What are topic models How to know if you have good topic model How to go from raw data to topics Material adapted from David Mimno UMD Topic Models 3 / 51
What are Topic Models? Conceptual Approach From an input corpus and number of topics K words to topics Corpus Forget the Bootleg, Just Download Multiplex the Heralded Movie Legally As Linchpin The Shape To of Growth Cinema, Transformed A Peaceful At Crew the Click Puts of Muppets a Where Mouse Stock Trades: A Its Better Mouth Deal Is For The Investors three Isn't big Internet Simple portals Red Light, begin Green to distinguish Light: A among 2-Tone themselves L.E.D. to as shopping Simplify Screens malls Material adapted from David Mimno UMD Topic Models 4 / 51
What are Topic Models? Conceptual Approach From an input corpus and number of topics K words to topics TOPIC 1 TOPIC 2 TOPIC 3 computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Material adapted from David Mimno UMD Topic Models 4 / 51
What are Topic Models? Conceptual Approach For each document, what topics are expressed by that document? Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens The three big Internet portals begin to distinguish among themselves as shopping malls Stock Trades: A Better Deal For Investors Isn't Simple TOPIC 1 Forget the Bootleg, Just Download the Movie Legally TOPIC 2 The Shape of Cinema, Transformed At the Click of a Mouse Multiplex Heralded As Linchpin To Growth TOPIC 3 A Peaceful Crew Puts Muppets Where Its Mouth Is Material adapted from David Mimno UMD Topic Models 5 / 51
What are Topic Models? Topics from Science Material adapted from David Mimno UMD Topic Models 6 / 51
What are Topic Models? Why should you care? Neat way to explore / understand corpus collections E-discovery Social media Scientific data NLP Applications Word Sense Disambiguation Discourse Segmentation Machine Translation Psychology: word meaning, polysemy Inference is (relatively) simple Material adapted from David Mimno UMD Topic Models 7 / 51
What are Topic Models? Matrix Factorization Approach M K K V M V Topics Topic Assignment Dataset K Number of topics M Number of documents V Size of vocabulary Material adapted from David Mimno UMD Topic Models 8 / 51
What are Topic Models? Matrix Factorization Approach M K K V M V Topics Topic Assignment Dataset K Number of topics M Number of documents V Size of vocabulary If you use singular value decomposition (SVD), this technique is called latent semantic analysis. Popular in information retrieval. Material adapted from David Mimno UMD Topic Models 8 / 51
What are Topic Models? Alternative: Generative Model How your data came to be Sequence of Probabilistic Steps Posterior Inference Material adapted from David Mimno UMD Topic Models 9 / 51
What are Topic Models? Alternative: Generative Model How your data came to be Sequence of Probabilistic Steps Posterior Inference Blei, Ng, Jordan. Latent Dirichlet Allocation. JMLR, 2003. Material adapted from David Mimno UMD Topic Models 9 / 51
What are Topic Models? Multinomial Distribution Distribution over discrete outcomes Represented by non-negative vector that sums to one Picture representation (1,0,0) (0,0,1) (0,1,0) (1/3,1/3,1/3) (1/4,1/4,1/2) (1/2,1/2,0) Material adapted from David Mimno UMD Topic Models 10 / 51
What are Topic Models? Multinomial Distribution Distribution over discrete outcomes Represented by non-negative vector that sums to one Picture representation (1,0,0) (0,0,1) (0,1,0) (1/3,1/3,1/3) (1/4,1/4,1/2) (1/2,1/2,0) Come from a Dirichlet distribution Material adapted from David Mimno UMD Topic Models 10 / 51
What are Topic Models? Dirichlet Distribution Material adapted from David Mimno UMD Topic Models 11 / 51
What are Topic Models? Dirichlet Distribution Material adapted from David Mimno UMD Topic Models 11 / 51
What are Topic Models? Dirichlet Distribution Material adapted from David Mimno UMD Topic Models 11 / 51
What are Topic Models? Dirichlet Distribution Material adapted from David Mimno UMD Topic Models 12 / 51
What are Topic Models? Dirichlet Distribution If φ Dir(()α), w Mult(()φ), and n k = {w i : w i = k} then p(φ α, w) p( w φ)p(φ α) (1) φ n k φ α k 1 (2) k k k φ α k+n k 1 Conjugacy: this posterior has the same form as the prior (3)
What are Topic Models? Dirichlet Distribution If φ Dir(()α), w Mult(()φ), and n k = {w i : w i = k} then p(φ α, w) p( w φ)p(φ α) (1) φ n k φ α k 1 (2) k k k φ α k+n k 1 (3) Conjugacy: this posterior has the same form as the prior Material adapted from David Mimno UMD Topic Models 13 / 51
What are Topic Models? Generative Model TOPIC 1 computer, technology, system, service, site, phone, internet, machine TOPIC 2 sell, sale, store, product, business, advertising, market, consumer TOPIC 3 play, film, movie, theater, production, star, director, stage Material adapted from David Mimno UMD Topic Models 14 / 51
What are Topic Models? Generative Model Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens The three big Internet portals begin to distinguish among themselves as shopping malls Stock Trades: A Better Deal For Investors Isn't Simple TOPIC 1 Forget the Bootleg, Just Download the Movie Legally TOPIC 2 The Shape of Cinema, Transformed At the Click of a Mouse Multiplex Heralded As Linchpin To Growth TOPIC 3 A Peaceful Crew Puts Muppets Where Its Mouth Is Material adapted from David Mimno UMD Topic Models 14 / 51
What are Topic Models? Generative Model computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 14 / 51
What are Topic Models? Generative Model computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 14 / 51
What are Topic Models? Generative Model computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 14 / 51
What are Topic Models? Generative Model computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 14 / 51
What are Topic Models? Generative Model Approach λ βk K α θd zn wn N M For each topic k {1,...,K }, draw a multinomial distribution βk from a Dirichlet distribution with parameter λ Material adapted from David Mimno UMD Topic Models 15 / 51
What are Topic Models? Generative Model Approach λ βk K α θd zn wn N M For each topic k {1,...,K }, draw a multinomial distribution βk from a Dirichlet distribution with parameter λ For each document d {1,...,M}, draw a multinomial distribution θd from a Dirichlet distribution with parameter α Material adapted from David Mimno UMD Topic Models 15 / 51
What are Topic Models? Generative Model Approach λ βk K α θd zn wn N M For each topic k {1,...,K }, draw a multinomial distribution βk from a Dirichlet distribution with parameter λ For each document d {1,...,M}, draw a multinomial distribution θd from a Dirichlet distribution with parameter α For each word position n {1,...,N}, select a hidden topic zn from the multinomial distribution parameterized by θ. Material adapted from David Mimno UMD Topic Models 15 / 51
What are Topic Models? Generative Model Approach λ βk K α θd zn wn N M For each topic k {1,...,K }, draw a multinomial distribution βk from a Dirichlet distribution with parameter λ For each document d {1,...,M}, draw a multinomial distribution θd from a Dirichlet distribution with parameter α For each word position n {1,...,N}, select a hidden topic zn from the multinomial distribution parameterized by θ. Choose the observed word wn from the distribution β zn. Material adapted from David Mimno UMD Topic Models 15 / 51
What are Topic Models? Generative Model Approach λ βk K α θd zn wn N M For each topic k {1,...,K }, draw a multinomial distribution βk from a Dirichlet distribution with parameter λ For each document d {1,...,M}, draw a multinomial distribution θd from a Dirichlet distribution with parameter α For each word position n {1,...,N}, select a hidden topic zn from the multinomial distribution parameterized by θ. Choose the observed word wn from the distribution β zn. We use statistical inference to uncover the most likely unobserved Material adapted from David Mimno UMD Topic Models 15 / 51
What are Topic Models? Topic Models: What s Important Topic models Topics to word types Documents to topics Topics to word types multinomial distribution Documents to topics multinomial distribution Focus in this talk: statistical methods Model: story of how your data came to be Latent variables: missing pieces of your story Statistical inference: filling in those missing pieces We use latent Dirichlet allocation (LDA), a fully Bayesian version of plsi, probabilistic version of LSA Material adapted from David Mimno UMD Topic Models 16 / 51
What are Topic Models? Topic Models: What s Important Topic models (latent variables) Topics to word types Documents to topics Topics to word types multinomial distribution Documents to topics multinomial distribution Focus in this talk: statistical methods Model: story of how your data came to be Latent variables: missing pieces of your story Statistical inference: filling in those missing pieces We use latent Dirichlet allocation (LDA), a fully Bayesian version of plsi, probabilistic version of LSA Material adapted from David Mimno UMD Topic Models 16 / 51
Evaluation Evaluation Corpus Forget the Bootleg, Just Download Multiplex the Heralded Movie Legally As Linchpin The Shape To of Growth Cinema, Transformed A Peaceful At Crew the Click Puts of Muppets a Where Mouse Stock Trades: A Its Better Mouth Deal Is For The Investors three Isn't big Internet Simple portals Red Light, begin Green to distinguish Light: A among 2-Tone themselves L.E.D. to as shopping Simplify Screens malls Model A Model B Model C -4.8-15.16-23.42 Held-out Data Sony Ericsson's Infinite Hope for a Turnaround For Search, Murdoch Looks to a Deal With Microsoft Price War Brews Between Amazon and Wal-Mart How you compute it is important too (Wallach et al. 2009) Material adapted from David Mimno UMD Topic Models 17 / 51
Evaluation Evaluation Corpus Forget the Bootleg, Just Download Multiplex the Heralded Movie Legally As Linchpin The Shape To of Growth Cinema, Transformed A Peaceful At Crew the Click Puts of Muppets a Where Mouse Stock Trades: A Its Better Mouth Deal Is For The Investors three Isn't big Internet Simple portals Red Light, begin Green to distinguish Light: A among 2-Tone themselves L.E.D. to as shopping Simplify Screens malls Model A Model B Model C Held-out Log Likelihood -4.8-15.16-23.42 Held-out Data Sony Ericsson's Infinite Hope for a Turnaround For Search, Murdoch Looks to a Deal With Microsoft Price War Brews Between Amazon and Wal-Mart Measures predictive power, not what the topics are How you compute it is important too (Wallach et al. 2009) Material adapted from David Mimno UMD Topic Models 17 / 51
Evaluation Word Intrusion TOPIC 1 TOPIC 2 TOPIC 3 computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Material adapted from David Mimno UMD Topic Models 18 / 51
Evaluation Word Intrusion 1. Take the highest probability words from a topic Original Topic dog, cat, horse, pig, cow Material adapted from David Mimno UMD Topic Models 19 / 51
Evaluation Word Intrusion 1. Take the highest probability words from a topic Original Topic dog, cat, horse, pig, cow 2. Take a high-probability word from another topic and add it Topic with Intruder dog, cat, apple, horse, pig, cow Material adapted from David Mimno UMD Topic Models 19 / 51
Evaluation Word Intrusion 1. Take the highest probability words from a topic Original Topic dog, cat, horse, pig, cow 2. Take a high-probability word from another topic and add it Topic with Intruder dog, cat, apple, horse, pig, cow 3. We ask users to find the word that doesn t belong Hypothesis If the topics are interpretable, users will consistently choose true intruder Material adapted from David Mimno UMD Topic Models 19 / 51
Evaluation Word Intrusion Material adapted from David Mimno UMD Topic Models 20 / 51
Evaluation Word Intrusion Order of words was shuffled Which intruder was selected varied Model precision: percentage of users who clicked on intruder Material adapted from David Mimno UMD Topic Models 20 / 51
Evaluation Word Intrusion: Which Topics are Interpretable? New York Times, 50 LDA Topics Number of Topics 0 5 10 15 committee legislation proposal republican taxis fireplace garage house kitchen list americans japanese jewish states terrorist artist exhibition gallery museum painting 0.000 0.125 0.250 0.375 0.500 0.625 0.750 0.875 1.000 Model Precision Model Precision: percentage of correct intruders found Material adapted from David Mimno UMD Topic Models 21 / 51
Evaluation York Times Interpretability and Likelihood Better Model Precision 0.80 0.75 0.70 0.65 1.0 0.80 0.75 1.5 0.70 2.0 0.65 2.5 1.0 1.5 Model Precision on New York Times New York Times 7.32 7.30 7.28 7.26 7.24 7.52 7.50 7.4 Held-out Likelihood Predictive Log Likelihood Better 7.28 7.26 7.24 7.52 7.50 7.48 7.46 7.44 7.42 7.40 Predictive 2.0 Log Likelihood within a model, higher likelihood higher interpretability Wikipedia Model Precision Topic Log Odds Model CTM LDA plsi Number of topics 50 100 Material adapted from David Mimno UMD Topic Models 22 / 51 150
Evaluation 0.75 Wikipedia Interpretability and Likelihood 0.70 Topic Log Odds on Wikipedia 0.65 1.0 York Times Better Topic Log Odds 1.5 2.0 2.5 0.80 Number Model of topics CTM 50 100 LDA 150 plsi Number of topics 50 100 150 7.28 7.26 7.24 7.52 7.32 7.50 7.30 7.48 7.46 7.28 7.44 7.26 7.42 7.40 7.24 7.52 7.50 7.4 Predictive Log Likelihood Held-out Likelihood Predictive Log Likelihood Better Model Precision Topic Log Odds Model Precision Topic Log Odds Model CTM LDA plsi 7.28 7.26 7.24 7.52 7.50 7.48 7.46 7.44 7.42 7.40 Predictive Log Likelihood across models, higher likelihood higher interpretability Material adapted from David Mimno UMD Topic Models 22 / 51
Evaluation Evaluation Takeaway Measure what you care about If you care about prediction, likelihood is good If you care about a particular task, measure that Material adapted from David Mimno UMD Topic Models 23 / 51
Inference We are interested in posterior distribution p(z X,Θ) (4) Material adapted from David Mimno UMD Topic Models 24 / 51
Inference We are interested in posterior distribution p(z X,Θ) (4) Here, latent variables are topic assignments z and topics θ. X is the words (divided into documents), and Θ are hyperparameters to Dirichlet distributions: α for topic proportion, λ for topics. p( z, β, θ w,α,λ) (5) Material adapted from David Mimno UMD Topic Models 24 / 51
Inference We are interested in posterior distribution p(z X,Θ) (4) Here, latent variables are topic assignments z and topics θ. X is the words (divided into documents), and Θ are hyperparameters to Dirichlet distributions: α for topic proportion, λ for topics. p( z, β, θ w,α,λ) (5) p( w, z, θ, β α,λ) = p(β k λ) p(θ d α) p(z d,n θ d )p(w d,n β zd,n ) k d n Material adapted from David Mimno UMD Topic Models 24 / 51
Gibbs Sampling A form of Markov Chain Monte Carlo Chain is a sequence of random variable states Given a state {z1,...z N } given certain technical conditions, drawing z k p(z 1,...z k 1,z k+1,...z N X,Θ) for all k (repeatedly) results in a Markov Chain whose stationary distribution is the posterior. For notational convenience, call z with zd,n removed z d,n Material adapted from David Mimno UMD Topic Models 25 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Inference computer, technology, system, service, site, phone, internet, machine sell, sale, store, product, business, advertising, market, consumer play, film, movie, theater, production, star, director, stage Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's itunes music store and other online services... Material adapted from David Mimno UMD Topic Models 26 / 51
Gibbs Sampling For LDA, we will sample the topic assignments Thus, we want: p(z d,n = k z d,n, w,α,λ) = p(z d,n = k, z d,n w,α,λ) p( z d,n w,α,λ) Material adapted from David Mimno UMD Topic Models 27 / 51
Gibbs Sampling For LDA, we will sample the topic assignments Thus, we want: p(z d,n = k z d,n, w,α,λ) = p(z d,n = k, z d,n w,α,λ) p( z d,n w,α,λ) The topics and per-document topic proportions are integrated out / marginalized Let nd,i be the number of words taking topic i in document d. Let v k,w be the number of times word w is used in topic k. θd i k θ α i+n d,i 1 d θ α k+n d,i d dθ d β = k θd i θ α i+n d,i 1 dθ d βk d i w d,n β λ i+v k,i 1 k,i i β λ i+v k,i 1 k,i dβ k β λ i+v k,i k,w d,n dβ k Material adapted from David Mimno UMD Topic Models 27 / 51
Gibbs Sampling For LDA, we will sample the topic assignments The topics and per-document topic proportions are integrated out / marginalized / Rao-Blackwellized Thus, we want: p(z d,n = k z d,n, w,α,λ) = n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 28 / 51
Gibbs Sampling Integral is normalizer of Dirichlet distribution V β λ i+v k,i 1 Γ (β i i + v k,i ) k,i dβ k = β k i Γ V β i i + v k,i Material adapted from David Mimno UMD Topic Models 29 / 51
Gibbs Sampling Integral is normalizer of Dirichlet distribution V β λ i+v k,i 1 Γ (β i i + v k,i ) k,i dβ k = β k i Γ V β i i + v k,i So we can simplify θd i k θ α i +n d,i 1 d θ α k +n d,i d dθ d β k θd i θ α i +n d,i 1 dθ d βk Γ d dβ k = dβ k i w d,n β λ i +v k,i 1 k,i β λ i +v k,i k,w d,n β λ i +v k,i 1 i k,i Γ(α k + n d,k + 1) K α i + n d,i + 1 Γ (α Γ(λ wd,n + v k,wd,n + 1) V i k k + n d,k ) Γ λ i + v k,i + 1 Γ λ i wd,n k + v k,wd,n K i K i Γ(α i + n d,i) K α i i + n d,i Γ V i V i Γ(λ i + v k,i) V λ i i + v k,i Γ Material adapted from David Mimno UMD Topic Models 29 / 51
Gamma Function Identity z = Γ (z + 1) Γ (z) (6) Γ Γ(α k + n d,k + 1) K α i + n d,i + 1 Γ (α Γ(λ wd,n + v k,wd,n + 1) V i k k + n d,k ) Γ λ i + v k,i + 1 Γ λ i wd,n k + v k,wd,n K i K i Γ(α i + n d,i) K α i i + n d,i Γ = n d,k + α k v k,wd,n + λ wd,n K n i d,i + α v i i k,i + λ i V i V i Γ(λ i + v k,i) V λ i i + v k,i Γ Material adapted from David Mimno UMD Topic Models 30 / 51
Gibbs Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v (7) k,i + λ i K i Number of times document d uses topic k Number of times topic k uses word type wd,n Dirichlet parameter for document to topic distribution Dirichlet parameter for topic to word distribution How much this document likes topic k How much this topic likes word wd,n Material adapted from David Mimno UMD Topic Models 31 / 51
Gibbs Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v (7) k,i + λ i K i Number of times document d uses topic k Number of times topic k uses word type wd,n Dirichlet parameter for document to topic distribution Dirichlet parameter for topic to word distribution How much this document likes topic k How much this topic likes word wd,n Material adapted from David Mimno UMD Topic Models 31 / 51
Gibbs Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v (7) k,i + λ i K i Number of times document d uses topic k Number of times topic k uses word type wd,n Dirichlet parameter for document to topic distribution Dirichlet parameter for topic to word distribution How much this document likes topic k How much this topic likes word wd,n Material adapted from David Mimno UMD Topic Models 31 / 51
Gibbs Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v (7) k,i + λ i K i Number of times document d uses topic k Number of times topic k uses word type wd,n Dirichlet parameter for document to topic distribution Dirichlet parameter for topic to word distribution How much this document likes topic k How much this topic likes word wd,n Material adapted from David Mimno UMD Topic Models 31 / 51
Gibbs Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v (7) k,i + λ i K i Number of times document d uses topic k Number of times topic k uses word type wd,n Dirichlet parameter for document to topic distribution Dirichlet parameter for topic to word distribution How much this document likes topic k How much this topic likes word wd,n Material adapted from David Mimno UMD Topic Models 31 / 51
Gibbs Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v (7) k,i + λ i K i Number of times document d uses topic k Number of times topic k uses word type wd,n Dirichlet parameter for document to topic distribution Dirichlet parameter for topic to word distribution How much this document likes topic k How much this topic likes word wd,n Material adapted from David Mimno UMD Topic Models 31 / 51
Sample Document Material adapted from David Mimno UMD Topic Models 32 / 51
Sample Document Material adapted from David Mimno UMD Topic Models 33 / 51
Randomly Assign Topics Material adapted from David Mimno UMD Topic Models 34 / 51
Randomly Assign Topics Material adapted from David Mimno UMD Topic Models 35 / 51
Total Topic Counts Material adapted from David Mimno UMD Topic Models 36 / 51
Total Topic Counts Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 36 / 51
Total Topic Counts Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 36 / 51
We want to sample this word... Material adapted from David Mimno UMD Topic Models 37 / 51
We want to sample this word... Material adapted from David Mimno UMD Topic Models 37 / 51
Decrement its count Material adapted from David Mimno UMD Topic Models 38 / 51
What is the conditional distribution for this topic? Material adapted from David Mimno UMD Topic Models 39 / 51
Part 1: How much does this document like each topic? Material adapted from David Mimno UMD Topic Models 40 / 51
Part 1: How much does this document like each topic? Material adapted from David Mimno UMD Topic Models 41 / 51
Part 1: How much does this document like each topic? Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 41 / 51
Part 1: How much does this document like each topic? Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 41 / 51
Part 2: How much does each topic like the word? Material adapted from David Mimno UMD Topic Models 42 / 51
Part 2: How much does each topic like the word? Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 42 / 51
Part 2: How much does each topic like the word? Sampling Equation n d,k + α k v k,wd,n + λ wd,n n d,i + α i i v k,i + λ i K i Material adapted from David Mimno UMD Topic Models 42 / 51
Geometric interpretation Material adapted from David Mimno UMD Topic Models 43 / 51
Geometric interpretation Material adapted from David Mimno UMD Topic Models 43 / 51
Geometric interpretation Material adapted from David Mimno UMD Topic Models 43 / 51
Update counts Material adapted from David Mimno UMD Topic Models 44 / 51
Update counts Material adapted from David Mimno UMD Topic Models 44 / 51
Update counts Material adapted from David Mimno UMD Topic Models 44 / 51
Details: how to sample from a distribution P Q P Topic 1 0.0 n d,k + k P K i n d,i + i v k,wd,n + wd,n P i v k,i + i Topic 2 Topic 3 0.112 Topic 4 Normalize Topic 5 1.0 Material adapted from David Mimno UMD Topic Models 45 / 51
Algorithm 1. For each iteration i: 1.1 For each document d and word n currently assigned to z old : 1.1.1 Decrement n d,zold and v zold,w d,n 1.1.2 Sample z new = k with probability proportional to n d,k +α k v k,wd,n +λ wd,n K i n d,i +α i i v k,i +λ i 1.1.3 Increment n d,znew and v znew,w d,n Material adapted from David Mimno UMD Topic Models 46 / 51
Implementation Algorithm 1. For each iteration i: 1.1 For each document d and word n currently assigned to z old : 1.1.1 Decrement n d,zold and v zold,w d,n 1.1.2 Sample z new = k with probability proportional to n d,k +α k v k,wd,n +λ wd,n K i n d,i +α i i v k,i +λ i 1.1.3 Increment n d,znew and v znew,w d,n Material adapted from David Mimno UMD Topic Models 47 / 51
Desiderata Hyperparameters: Sample them too (slice sampling) Initialization: Random Sampling: Until likelihood converges Lag / burn-in: Difference of opinion on this Number of chains: Should do more than one Material adapted from David Mimno UMD Topic Models 48 / 51
Available implementations Mallet (http://mallet.cs.umass.edu) LDAC (http://www.cs.princeton.edu/ blei/lda-c) Topicmod (http://code.google.com/p/topicmod) Material adapted from David Mimno UMD Topic Models 49 / 51
Wrapup Topic Models: Tools to uncover themes in large document collections Another example of Gibbs Sampling In class: Gibbs sampling example Material adapted from David Mimno UMD Topic Models 50 / 51
Material adapted from David Mimno UMD Topic Models 51 / 51