Topic Models
Advanced Machine Learning for NLP
Jordan Boyd-Graber

OVERVIEW
Low-Dimensional Space for Documents

- Last time: embedding space for words
- This time: embedding space for documents
  - Generative story
  - New inference techniques
Why topic models?

- Suppose you have a huge number of documents
- You want to know what's going on, but can't read them all (e.g. every New York Times article from the '90s)
- Topic models offer a way to get a corpus-level view of major themes
- Unsupervised
Roadmap

- What are topic models
- How to go from raw data to topics
Embedding Space

From an input corpus and a number of topics K, go from words to topics.

Corpus (document titles):
- Forget the Bootleg, Just Download the Movie Legally
- Multiplex Heralded As Linchpin To Growth
- The Shape of Cinema, Transformed At the Click of a Mouse
- A Peaceful Crew Puts Muppets Where Its Mouth Is
- Stock Trades: A Better Deal For Investors Isn't Simple
- The three big Internet portals begin to distinguish among themselves as shopping malls
- Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

Resulting topics:
- TOPIC 1: computer, technology, system, service, site, phone, internet, machine
- TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
- TOPIC 3: play, film, movie, theater, production, star, director, stage
Conceptual Approach

For each document, what topics are expressed by that document?

Documents:
- Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens
- The three big Internet portals begin to distinguish among themselves as shopping malls
- Stock Trades: A Better Deal For Investors Isn't Simple
- Forget the Bootleg, Just Download the Movie Legally
- The Shape of Cinema, Transformed At the Click of a Mouse
- Multiplex Heralded As Linchpin To Growth
- A Peaceful Crew Puts Muppets Where Its Mouth Is

Each document is associated with a mixture of TOPIC 1, TOPIC 2, and TOPIC 3.
Topics from Science
Why should you care?

- Neat way to explore / understand corpus collections
  - E-discovery
  - Social media
  - Scientific data
- NLP applications
  - Word sense disambiguation
  - Discourse segmentation
  - Machine translation
- Psychology: word meaning, polysemy
- Inference is (relatively) simple
Matrix Factorization Approach

An M × V dataset matrix factors into an M × K topic-assignment matrix times a K × V topics matrix:

- K: number of topics
- M: number of documents
- V: size of vocabulary

If you use singular value decomposition (SVD), this technique is called latent semantic analysis. Popular in information retrieval.
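A minimal sketch of the factorization above with plain numpy, using a made-up toy term-document count matrix (the data and the variable names are illustrative, not from the lecture). Truncated SVD splits the M × V matrix into the M × K and K × V factors described above:

```python
import numpy as np

# Hypothetical toy counts: M = 4 documents over a V = 4 word vocabulary.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
    [0, 1, 2, 2],
], dtype=float)

K = 2  # number of topics to keep

# Truncated SVD: keep only the K largest singular values/vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_topic = U[:, :K] * S[:K]   # M x K "topic assignment" factor
topic_word = Vt[:K, :]         # K x V "topics" factor

# The rank-K product approximates the original dataset matrix.
X_hat = doc_topic @ topic_word
```

By the Eckart-Young theorem this rank-K product is the best rank-K approximation to X in Frobenius norm, which is why LSA uses SVD rather than some other factorization.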
Alternative: Generative Model

- How your data came to be
- Sequence of probabilistic steps
- Posterior inference

Blei, Ng, Jordan. Latent Dirichlet Allocation. JMLR, 2003.
Multinomial Distribution

- Distribution over discrete outcomes
- Represented by a non-negative vector that sums to one
- Picture representation: points on the simplex, e.g. (1,0,0), (0,0,1), (0,1,0), (1/3,1/3,1/3), (1/4,1/4,1/2), (1/2,1/2,0)
- These parameter vectors come from a Dirichlet distribution
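The point above can be checked numerically: a sketch (not from the lecture) drawing multinomial parameter vectors from a Dirichlet with numpy. Small concentration parameters push draws toward the simplex corners like (1,0,0); large ones push draws toward the center (1/3, 1/3, 1/3):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 draws each from a sparse and a dense Dirichlet over 3 outcomes.
sparse_draws = rng.dirichlet([0.1, 0.1, 0.1], size=1000)
dense_draws = rng.dirichlet([10.0, 10.0, 10.0], size=1000)

def mean_entropy(p):
    # Average entropy of the drawn distributions; near-corner draws
    # are spiky and therefore have low entropy.
    return -(p * np.log(p + 1e-12)).sum(axis=1).mean()

# Every draw is a valid multinomial parameter: non-negative, sums to one.
print(mean_entropy(sparse_draws), mean_entropy(dense_draws))
```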
Dirichlet Distribution

If φ ~ Dir(α), w ~ Mult(φ), and n_k = |{w_i : w_i = k}|, then

  p(φ | α, w) ∝ p(w | φ) p(φ | α)                        (1)
              ∝ [∏_k φ_k^(n_k)] [∏_k φ_k^(α_k - 1)]      (2)
              ∝ ∏_k φ_k^(α_k + n_k - 1)                  (3)

Conjugacy: this posterior has the same form as the prior.
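A small numerical sketch of this conjugate update (illustrative names and data, not from the lecture): draw φ from the prior, sample words from Mult(φ), count n_k, and form the Dir(α + n) posterior whose exponents match equation (3):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 2.0, 3.0])      # prior: phi ~ Dir(alpha)
phi = rng.dirichlet(alpha)             # true multinomial parameter
w = rng.choice(3, size=1000, p=phi)    # observations w_i ~ Mult(phi)

# n_k = |{w_i : w_i = k}|, the count of outcome k among the draws
n = np.bincount(w, minlength=3)

# Conjugacy: the posterior is again Dirichlet, with parameter alpha + n
posterior = alpha + n

# Posterior mean (alpha_k + n_k) / sum_j (alpha_j + n_j); with 1000
# observations it should sit close to the true phi.
posterior_mean = posterior / posterior.sum()
```

The counts dominate the prior here (n is large relative to α), so the posterior mean nearly recovers the φ that generated the data.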
Generative Model

- TOPIC 1: computer, technology, system, service, site, phone, internet, machine
- TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
- TOPIC 3: play, film, movie, theater, production, star, director, stage
Generative Model

Example document, whose words are drawn from the three topics:

"Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services..."
Generative Model Approach

Plate diagram: λ → β_k (plate K); α → θ_d → z_n → w_n (plates N, M)

- For each topic k ∈ {1, ..., K}, draw a multinomial distribution β_k from a Dirichlet distribution with parameter λ
- For each document d ∈ {1, ..., M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
- For each word position n ∈ {1, ..., N}, select a hidden topic z_n from the multinomial distribution parameterized by θ_d
- Choose the observed word w_n from the distribution β_{z_n}

We use statistical inference to uncover the most likely unobserved variables.
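The four steps above can be run forward as a simulation. A minimal sketch in numpy (sizes K, M, N, V and the hyperparameter values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, N, V = 3, 5, 20, 8       # topics, documents, words per doc, vocab size
lam = np.full(V, 0.1)          # topic-word Dirichlet parameter lambda
alpha = np.full(K, 0.5)        # document-topic Dirichlet parameter alpha

# Step 1: for each topic k, draw beta_k ~ Dir(lambda).
beta = rng.dirichlet(lam, size=K)            # K x V matrix

corpus = []
for d in range(M):
    # Step 2: for each document d, draw theta_d ~ Dir(alpha).
    theta = rng.dirichlet(alpha)
    doc = []
    for n in range(N):
        # Step 3: select a hidden topic z_n ~ Mult(theta_d).
        z = rng.choice(K, p=theta)
        # Step 4: choose the observed word w_n ~ Mult(beta_{z_n}).
        doc.append(rng.choice(V, p=beta[z]))
    corpus.append(doc)
```

Inference then runs this story in reverse: given only `corpus`, recover the β, θ, and z that most plausibly produced it.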
Topic Models: What's Important

- Topic models (latent variables)
  - Topics to word types: multinomial distribution
  - Documents to topics: multinomial distribution
- Focus in this talk: statistical methods
  - Model: story of how your data came to be
  - Latent variables: missing pieces of your story
  - Statistical inference: filling in those missing pieces
- We use latent Dirichlet allocation (LDA), a fully Bayesian version of pLSI, the probabilistic version of LSA
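One standard way to fill in those missing pieces is collapsed Gibbs sampling, which resamples each word's topic assignment from its conditional posterior. A minimal sketch, not the lecture's exact algorithm; the function name, the toy corpus, and the iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_gibbs(docs, K, V, alpha=0.5, lam=0.1, iters=50):
    """Collapsed Gibbs sampler sketch for LDA; docs is a list of word-id lists."""
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = np.zeros((len(docs), K))
    n_kv = np.zeros((K, V))
    n_k = np.zeros(K)
    z = []  # current topic assignment for every word position
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this word's current assignment from the counts.
                n_dk[d, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # Resample z_n proportional to (n_dk + alpha)(n_kv + lam)/(n_k + V*lam).
                p = (n_dk[d] + alpha) * (n_kv[:, w] + lam) / (n_k + V * lam)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    return n_dk, n_kv

# Tiny illustrative corpus: 3 documents over a 4-word vocabulary.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3, 1]]
n_dk, n_kv = lda_gibbs(docs, K=2, V=4)
```

The returned count tables give point estimates of the two multinomials above: normalizing the rows of `n_kv` (plus λ) estimates topics-to-words, and normalizing the rows of `n_dk` (plus α) estimates documents-to-topics.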