Topic Modeling Using Latent Dirichlet Allocation (LDA)


Porter Jenkins and Mimi Brinberg
Penn State University
prj3@psu.edu, mjb6504@psu.edu
October 23, 2017

Overview

1. Introduction to Topic Modeling
2. Examples of Topic Modeling
3. Data
4. Common Methods in Topic Modeling
5. LDA
6. Extensions of LDA
7. Conclusion

What is the problem? Why is it important?

In contrast to sentiment analysis, where the goal is to determine the general feeling expressed in a corpus, topic modeling aims to uncover the underlying content, or semantic themes, of a corpus. Given the increasingly large collections of text available, topic modeling can help people organize and understand large bodies of text.

Understanding Online Reviews

Understanding Online Reviews

We can use text data from Amazon reviews, together with topic models, to understand potential product issues as well as popular product features.

Categorizing Novels

Let's Talk About Data

- Text data are most often used for topic modeling
- Before any topic modeling can be conducted, the text data have to be pre-processed (think back to our second homework assignment); a minimal sketch follows below
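As a rough idea of what that pre-processing looks like, here is a minimal sketch assuming NLTK's stop-word list and Porter stemmer; the exact pipeline behind the later examples may differ.

```python
# Minimal pre-processing sketch (illustrative, not the exact pipeline from the slides):
# lowercase, tokenize, drop stop words, stem.
import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    """Return a list of stemmed tokens with stop words removed."""
    tokens = re.findall(r"[a-z]+", doc.lower())   # keep alphabetic tokens only
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("The Flyers traded their first-round pick."))
# e.g. ['flyer', 'trade', 'first', 'round', 'pick']
```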

Typical Approaches to Topic Modeling

- Keyphrase Extraction (Hasan and Ng, 2014)
  - Output: observed keyphrases
- Latent Dirichlet Allocation (Blei, Ng, and Jordan, 2003; Blei, 2012)
  - Output: latent topics, i.e. lists of words associated with each latent topic

Precursor to LDA: Latent Semantic Analysis (LSA)

- Similar to PCA: reduces the dimensionality of the semantic space
- Uses singular value decomposition of the tf-idf matrix to identify underlying topics (see the sketch below)
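A minimal LSA sketch along these lines, using scikit-learn's TfidfVectorizer and TruncatedSVD; the toy documents and number of components are illustrative.

```python
# LSA sketch: tf-idf followed by a truncated SVD of the document-term matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the team won the hockey game",
    "the church service discussed faith and belief",
    "the goalie made a great save in the game",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                 # documents x terms, tf-idf weighted

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topic = lsa.fit_transform(X)              # documents projected onto latent dimensions

# Top terms per latent dimension
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    top = comp.argsort()[::-1][:5]
    print(f"dimension {i}:", [terms[j] for j in top])
```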

Overview of LDA

[Four figure slides; images not included in the transcription]

Generative Probabilistic Model for Discrete Data

- We expect each document to contain multiple topics
- Previous generative mixture models required ALL words in a document to be generated from the same topic
- Each topic is, in turn, a distribution over words
- Because the model is generative, the same word can be generated by different topics in different documents

Graphical Formulation of LDA

Mathematical Formulation of LDA

Data-generating process. For each document in a corpus:
1. Choose N ~ Poisson(ζ)
2. Choose θ ~ Dirichlet(α)
3. For each of the N words w_n:
   (a) Choose a topic z_n ~ Multinomial(θ)
   (b) Choose a word w_n from p(w_n | z_n, β), the multinomial probability conditioned on the topic z_n

where N is the number of words in the document, θ is the document's topic mixture, z_n is a topic, w_n is a word, β is a matrix of word probabilities under each topic, and α is the hyperparameter of the Dirichlet prior on θ. A minimal simulation of this process is sketched below.
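The process above can be simulated directly; here is a minimal NumPy sketch with toy values for K, V, α, β, and the Poisson rate ζ (all illustrative, not taken from the slides).

```python
# Sketch of the LDA generative process for a single document (toy parameter values).
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 10                      # number of topics, vocabulary size
alpha = np.full(K, 0.5)           # Dirichlet prior on the topic mixture theta
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # K x V matrix of word probabilities
zeta = 8                          # Poisson rate for document length

N = rng.poisson(zeta)             # 1. choose the number of words N ~ Poisson(zeta)
theta = rng.dirichlet(alpha)      # 2. choose the topic mixture theta ~ Dirichlet(alpha)

words, topics = [], []
for _ in range(N):                # 3. for each word position:
    z = rng.choice(K, p=theta)    #    (a) draw a topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])  #    (b) draw a word w_n ~ Multinomial(beta[z_n])
    topics.append(z)
    words.append(w)

print("topic assignments:", topics)
print("word ids:", words)
```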

LDA in Practice: Example (20 Newsgroups Data Set)

> hmmmmmm. Sounds like your theology and Christ's are at odds. Which one am I to believe?
> In this parable, Jesus tells the parable of the wedding feast. "The kingdom of heaven is like unto a certain king which made a marriage for his son". So the wedding clothes were customary, and "given" to those who "chose" to attend. This man "refused" to wear the clothes. The wedding clothes are equalivant to the "clothes of righteousness". When Jesus died for our sins, those "clothes" were then provided. Like that man, it is our decision to put the clothes on.

LDA in Practice: Example (20 Newsgroups Data Set)

> TSN Sportsdesk just reported that the OTTAWA SUN has reported that Montreal will send 4 players + $15 million including Vin Damphousse and Brian Bellows to Phillidelphia, Phillie will send Eric Lindros to Ottawa, and Ottawa will give it's first round pick to Montreal.
> If this is true, it will most likely depend on whether or not Ottawa gets to choose 1st overall. Can Ottawa afford Lindros' salary?

LDA in Practice

LDA using 2 latent topics:

Topic 1: god, one, would, christian, subject, line, say, peopl, think, know, organ, write, like, believ, dont, church, jesu, us, time, see

Topic 2: 0, 1, 2, game, 3, team, 4, play, line, hockey, 5, organ, subject, player, 6, go, 7, 25, year, win

LDA in Practice

LDA using 5 latent topics:

Topic 1: christian, subject, line, god, one, church, organ, know, believ, would, write, peopl, univers, question, exist, think, truth, dont, bibl, say

Topic 2: god, one, would, jesu, christian, sin, say, peopl, christ, homosexu, subject, us, line, believ, love, think, church, paul, organ, know

Topic 3: game, line, subject, organ, team, go, write, play, get, hockey, would, univers, player, year, think, articl, one, like, fan, win

LDA in Practice

LDA using 5 latent topics (cont.):

Topic 4: 25, team, game, play, goal, season, hockey, 2, penalti, score, blue, nhl, line, point, flyer, 3, first, leagu, shot, period

Topic 5: 0, 1, 2, 3, 4, 5, 6, 7, 8, pt, 10, 9, la, 11, period, vs, 12, 14, 13, 20
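Topic/word lists like these can be produced, for example, with scikit-learn's LDA implementation on the 20 Newsgroups data. This is a hedged sketch: the category subset, vectorizer settings, and number of topics are illustrative, not the exact setup behind the tables above.

```python
# Sketch: fit LDA on two 20 Newsgroups categories and print top words per topic.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

data = fetch_20newsgroups(
    subset="train",
    categories=["soc.religion.christian", "rec.sport.hockey"],
    remove=("headers", "footers", "quotes"),
)

vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
X = vectorizer.fit_transform(data.data)          # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:10]
    print(f"Topic {k + 1}:", " ".join(terms[j] for j in top))
```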

Methods for Learning the Posterior

- Variational Bayes algorithms
- Markov Chain Monte Carlo (MCMC), e.g. collapsed Gibbs sampling (sketch below)
- Laplace approximation
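As a concrete instance of the MCMC route, here is a minimal collapsed Gibbs sampler for LDA on a toy corpus with symmetric priors; this is a sketch of the idea, not a production implementation, and the slides do not specify which inference method was used for the earlier results.

```python
# Minimal collapsed Gibbs sampler for LDA on a toy corpus of word-id lists.
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))        # document-topic counts
    nkw = np.zeros((K, V))                # topic-word counts
    nk = np.zeros(K)                      # topic totals
    z = []                                # topic assignment for every token
    for d, doc in enumerate(docs):        # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]               # remove the current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional p(z = k | all other assignments, words)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t               # add the new assignment back
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return nkw + eta                      # smoothed topic-word counts (unnormalized)

# toy corpus: each document is a list of word ids in a vocabulary of size 4
docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]
print(gibbs_lda(docs, K=2, V=4).round(2))
```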

Assumptions of LDA

- Exchangeability: documents and words are exchangeable
  - Order doesn't matter (bag of words)
  - Not as strong as i.i.d., but close: conditional on the latent variable ("topic"), the documents and words are i.i.d.
- Number of topics is fixed and determined by the researcher (see the model-selection sketch below)
- Topics are independent
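Because the number of topics K is fixed in advance, it is commonly chosen by comparing models fit with several candidate values. One hedged sketch uses held-out perplexity with scikit-learn; the candidate values and split are illustrative, and X is assumed to be a bag-of-words matrix like the one built in the earlier sketch.

```python
# Sketch: compare candidate numbers of topics K by held-out perplexity.
# Assumes a bag-of-words matrix X (e.g. from CountVectorizer, as above).
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

for k in (2, 5, 10, 20):                       # illustrative candidate values
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(f"K={k:2d}  held-out perplexity: {lda.perplexity(X_test):.1f}")
```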

Dynamic Topic Models

- Dynamic Topic Models (Blei and Lafferty, 2006)
- Temporal text mining (Mei and Zhai, 2005)

Correlated Topic Models

- Correlated Topic Models (Blei and Lafferty, 2007)

Spatio-Temporal Topic Models

- Yin and colleagues (2011)

Conclusion

- There is a variety of methods to summarize and understand corpora
- LDA is a popular method for examining the underlying topics of a text collection
- Advantages:
  - Relatively straightforward to implement
  - Continued development of the method
- Disadvantages:
  - Not always easy to interpret
  - Still needs input from the researcher (number and naming of topics)
  - Not useful for short documents that contain less variance (in our own experience)

References/Resources 1

Blei, Ng, and Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.

Blei and Lafferty (2006). Dynamic Topic Models. Proceedings of the 23rd International Conference on Machine Learning, 113-120.

Blei and Lafferty (2007). A Correlated Topic Model of Science. The Annals of Applied Statistics, 1(1), 17-35.

Blei (2012). Probabilistic Topic Models. Communications of the ACM, 55(4), 77-84.

References/Resources 2

Hasan and Ng (2014). Automatic Keyphrase Extraction: A Survey of the State of the Art. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1262-1273.

Hu (2009). Latent Dirichlet Allocation for Text, Images, and Music. University of California, San Diego.

Mei and Zhai (2005). Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198-207.

Yin, Cao, Han, Zhai, and Huang (2011). Geographical Topic Discovery and Comparison. Proceedings of the World Wide Web Conference, 247-256.

The Dirichlet Process