Topic Modeling Using Latent Dirichlet Allocation (LDA) Porter Jenkins and Mimi Brinberg Penn State University prj3@psu.edu mjb6504@psu.edu October 23, 2017 Porter Jenkins and Mimi Brinberg (PSU) LDA October 23, 2017 1 / 30
Overview
1. Introduction to Topic Modeling
2. Examples of Topic Modeling
3. Data
4. Common Methods in Topic Modeling
5. LDA
6. Extensions of LDA
7. Conclusion
What is the problem? Why is the problem important?
- In contrast to sentiment analysis, where the goal is to determine the general feeling of a corpus, topic modeling aims to determine the underlying content, or semantic meaning, of a corpus.
- Given the growing number and size of corpora, topic modeling can help people organize and understand large bodies of text.
Understanding Online Reviews
- We can use text data from Amazon reviews, together with topic models, to understand potential product issues as well as popular product features.
Categorizing Novels
Let's Talk About Data
- Text data are most often used for topic modeling
- Before any topic modeling can be conducted, the text data have to be pre-processed (think back to our second homework assignment)
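As a rough illustration of the pre-processing step, the sketch below lowercases the text, keeps alphanumeric tokens, drops stop words, and counts what remains. The stop-word list and the function names `preprocess` and `bag_of_words` are assumptions made for this example; a real pipeline would typically use a fuller stop-word list and a stemmer.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "this"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(text):
    """Count the tokens that survive pre-processing."""
    return Counter(preprocess(text))


doc = "The team is in the playoffs, and the team plays hockey."
print(preprocess(doc))
print(bag_of_words(doc))
```

The resulting counts, one `Counter` per document, are the kind of input a topic model works from.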
Typical Approaches to Topic Modeling
- Keyphrase Extraction (Hasan and Ng, 2014)
  Output: observed keyphrases
- Latent Dirichlet Allocation (Blei, Ng, and Jordan, 2003; Blei, 2012)
  Output: latent topics, each given as a list of the words it contains
Precursor to LDA
- Latent Semantic Analysis (LSA)
- Similar to PCA: reduces the dimensionality of the semantic space
- Uses singular value decomposition on the tf-idf matrix to identify underlying topics
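To make the tf-idf input to LSA concrete, here is a minimal sketch that builds a document-term tf-idf matrix in plain Python. It assumes raw term counts for tf and log(N / df) for idf, which is only one of several common variants; the function name `tf_idf` and the toy documents are made up for illustration. LSA would then apply a truncated singular value decomposition to this matrix, a step not shown here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocab, matrix); matrix[d][j] is the tf-idf weight of vocab[j] in document d."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, matrix


# Toy, already pre-processed documents (hypothetical data).
docs = [
    ["god", "church", "believe"],
    ["game", "team", "hockey", "team"],
    ["team", "church"],
]
vocab, X = tf_idf(docs)
for row in X:
    print([round(x, 2) for x in row])
```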
Overview of LDA
Generative probabilistic model for discrete data
- We expect each document to have multiple topics in it
  - Previous generative mixture models require ALL words in a document to be generated from the same topic
- Likewise, each topic is a distribution over words
  - Because the model is generative, the same word can be generated by different topics in different documents
Graphical Formulation of LDA
Mathematical Formulation of LDA
Data Generating Process
For each document in a corpus:
1. Choose $N \sim \mathrm{Poisson}(\zeta)$
2. Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$
3. For each of the $N$ words $w_n$:
   (a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
   (b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
Where $N$ is the number of words in a document, $\zeta$ is the Poisson rate in step 1, $\theta$ is the topic mixture, $z_n$ is the topic of the $n$-th word, $w_n$ is the $n$-th word, $\beta$ is a matrix of word probabilities for $w_n$ in topic $z_n$, and $\alpha$ is a hyperparameter of the Dirichlet prior on $\theta$.
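The data generating process above can be sketched with Python's standard library alone. The vocabulary, the values of `alpha`, `beta`, and `zeta`, and all function names are made up for illustration. The Dirichlet draw normalizes independent Gamma draws, and the Poisson draw uses Knuth's simple algorithm, which is adequate for small rates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's algorithm for a Poisson draw (fine for small rates)."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, zeta, rng):
    """Follow steps 1-3 of the generative process for a single document."""
    n_words = sample_poisson(zeta, rng)        # 1. N ~ Poisson(zeta)
    theta = sample_dirichlet(alpha, rng)       # 2. theta ~ Dirichlet(alpha)
    topics = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]     # 3(a) z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]    # 3(b) w_n ~ p(w | z_n, beta)
        words.append((z, w))
    return theta, words


# Hypothetical two-topic setup: row k of beta holds the word probabilities of topic k.
vocab = ["god", "church", "game", "team"]
beta = [[0.5, 0.4, 0.05, 0.05],
        [0.05, 0.05, 0.4, 0.5]]
rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], beta, vocab, zeta=8, rng=rng)
print(theta)
print(words)
```

Each generated word carries its topic label here only to make the process visible; in real data the topic assignments are latent and must be inferred.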
LDA in Practice
Example (20 Newsgroups data set)
>hmmmmmm. Sounds like your theology and Christ's are at odds. Which one am I to believe?
>In this parable, Jesus tells the parable of the wedding feast. "The kingdom of heaven is like unto a certain king which made a marriage for his son". So the wedding clothes were customary, and "given" to those who "chose" to attend. This man "refused" to wear the clothes. The wedding clothes are equalivant to the "clothes of righteousness". When Jesus died for our sins, those "clothes" were then provided. Like that man, it is our decision to put the clothes on.
LDA in Practice
Example (20 Newsgroups data set)
>TSN Sportsdesk just reported that the OTTAWA SUN has reported that Montreal will send 4 players + $15 million including Vin Damphousse
>and Brian Bellows to Phillidelphia, Phillie will send Eric Lindros to Ottawa, and Ottawa will give it's first round pick to Montreal.
>If this is true, it will most likely depend on whether or not Ottawa gets to choose 1st overall. Can Ottawa afford Lindros' salary?
LDA in Practice
LDA using 2 latent topics:
Topic 1: god one would christian subject line say peopl think know organ write like believ dont church jesu us time see
Topic 2: 0 1 2 game 3 team 4 play line hockey 5 organ subject player 6 go 7 25 year win
LDA in Practice
LDA using 5 latent topics:
Topic 1: christian subject line god one church organ know believ would write peopl univers question exist think truth dont bibl say
Topic 2: god one would jesu christian sin say peopl christ homosexu subject us line believ love think church paul organ know
Topic 3: game line subject organ team go write play get hockey would univers player year think articl one like fan win
LDA in Practice
LDA using 5 latent topics (cont.):
Topic 4: 25 team game play goal season hockey 2 penalti score blue nhl line point flyer 3 first leagu shot period
Topic 5: 0 1 2 3 4 5 6 7 8 pt 10 9 la 11 period vs 12 14 13 20
Methods for Learning the Posterior
- Variational Bayes algorithms
- Markov Chain Monte Carlo (MCMC)
- Laplace approximation
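As one concrete instance of the MCMC family, the sketch below implements a collapsed Gibbs sampler for LDA in plain Python. It is an illustration under stated assumptions, not necessarily the method used for the results in these slides. It assumes symmetric Dirichlet priors, with `alpha` on the document-topic mixtures and a hypothetical `eta` on the topic-word distributions (the smoothed variant of LDA), and documents given as lists of integer word ids.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns document-topic and topic-word counts."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                               # total words assigned to topic k
    assignments = []
    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    topics = list(range(n_topics))
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Full conditional p(z = k | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in topics
                ]
                z = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus (hypothetical): word ids index into vocab.
vocab = ["god", "church", "jesus", "game", "team", "hockey"]
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 4, 3], [0, 2, 1, 2], [4, 5, 3, 5]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: -counts[w])[:3]
    print("Topic", k + 1, [vocab[w] for w in top])
```

Sorting each row of `topic_word` by count is how top-word lists for each topic can be read off once sampling has finished.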
Assumptions of LDA
- Exchangeability
  - Documents and words are exchangeable
  - Order doesn't matter (bag of words)
  - Not as strong as i.i.d., but close
  - Conditional on the latent variable ("topic"), the documents and words are i.i.d.
- Number of topics is fixed, determined by the researcher
- Topics are independent
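A small check of what the bag-of-words assumption means in practice: shuffling the words of a document leaves its word counts, and therefore the model's input, unchanged. The example sentence is adapted from the newsgroup post quoted earlier; the variable names are arbitrary.

```python
import random
from collections import Counter

words = "the wedding clothes were given to those who chose to attend".split()

# Reorder the words at random; a fixed seed keeps the example reproducible.
shuffled = words[:]
random.Random(0).shuffle(shuffled)

# A bag-of-words model sees only these counts, so both orderings look identical to it.
print(Counter(words) == Counter(shuffled))
```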
Dynamic Topic Models
- Dynamic Topic Models (Blei and Lafferty, 2006)
- Mei and Zhai (2005)
Correlated Topic Models
- Correlated Topic Models (Blei and Lafferty, 2007)
Spatio-Temporal Topic Models
- Yin and colleagues (2011)
Conclusion
- A variety of methods exist to summarize and understand corpora
- LDA is a popular method for examining the underlying topics of text
- Advantages:
  - Relatively straightforward to implement
  - Continued development of the method
- Disadvantages:
  - Not always easy to interpret
  - Still needs input from the researcher (number and naming of topics)
  - Not useful for short documents that contain less variance (in our own experience)
References/Resources 1
- Blei, Ng, and Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
- Blei and Lafferty (2006). Dynamic Topic Models. Proceedings of the 23rd International Conference on Machine Learning, 113-120.
- Blei and Lafferty (2007). A Correlated Topic Model of Science. The Annals of Applied Statistics 1(1), 17-35.
- Blei (2012). Probabilistic Topic Models. Communications of the ACM 55(4), 77-84.
References/Resources 2
- Hasan and Ng (2014). Automatic Keyphrase Extraction: A Survey of the State of the Art. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1262-1273.
- Hu (2009). Latent Dirichlet Allocation for Text, Images, and Music. University of California, San Diego.
- Mei and Zhai (2005). Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining. Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198-207.
- Yin, Cao, Han, Zhai, and Huang (2011). Geographical Topic Discovery and Comparison. Proceedings of the World Wide Web Conference, 247-256.
The Dirichlet Process