Measuring Topic Quality in Latent Dirichlet Allocation


Sergei Koltsov, Olessia Koltsova
Steklov Institute of Mathematics at St. Petersburg
Laboratory for Internet Studies, National Research University Higher School of Economics, St. Petersburg
Philosophy, Mathematics, Linguistics: Aspects of Interaction 2014, April 25, 2014

Outline
1. Topic modeling
2. Measuring topic quality

Probabilistic modeling. Our work lies in the field of probabilistic modeling and Bayesian inference. Probabilistic modeling: given a dataset and some probabilistic assumptions, learn the model parameters (and do some other exciting stuff). Bayes' theorem: $p(\theta \mid D) = \frac{p(\theta)\,p(D \mid \theta)}{p(D)}$. General problems in machine learning / probabilistic modeling:
find $p(\theta \mid D) \propto p(\theta)\,p(D \mid \theta)$;
maximize it w.r.t. $\theta$ (the maximum a posteriori hypothesis);
find the predictive distribution $p(x \mid D) = \int p(x \mid \theta)\,p(\theta \mid D)\,d\theta$.
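As a toy illustration of these three problems, here is a minimal Python sketch with a conjugate Beta prior on a Bernoulli parameter (the data and prior values are assumptions made for the example, not taken from the talk):

```python
# Bayesian updating for a Bernoulli parameter theta with a Beta prior:
# posterior p(theta | D) ∝ p(theta) p(D | theta), the MAP hypothesis,
# and the predictive distribution p(x | D).
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1])   # hypothetical coin-flip dataset D
a, b = 2.0, 2.0                          # Beta(a, b) prior on theta

heads = int(data.sum())
tails = len(data) - heads
a_post, b_post = a + heads, b + tails    # conjugacy: the posterior is Beta again

theta_map = (a_post - 1) / (a_post + b_post - 2)   # maximum a posteriori estimate
p_next_head = a_post / (a_post + b_post)           # predictive p(x = 1 | D)
print(f"posterior Beta({a_post}, {b_post}), MAP = {theta_map:.3f}, "
      f"p(next = 1 | D) = {p_next_head:.3f}")
```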

Probabilistic modeling. Two main kinds of machine learning problems: supervised, where we have correct answers in the dataset and want to extrapolate them (regression, classification); and unsupervised, where we just have the data and want to find some structure in it (example: clustering). Natural language processing models with an eye to topical content: usually the text is treated as a bag of words; usually there is no semantics, words are treated as tokens; the emphasis is on the statistical properties of how words cooccur in documents; a sample supervised problem is text categorization (e.g., naive Bayes); still, there are some very impressive results.

Topic modeling Suppose that you want to study a large text corpus. You want to identify specific topics that are discussed in this dataset and then either study the topics that are interesting for you or just look at their general distribution, do topical information retrieval etc. However, you do not know the topics in advance. Thus, you need to somehow extract what topics are discussed and find which topics are relevant for a specific document, in a completely unsupervised way because you do not know anything except the text corpus itself. This is precisely the problem that topic modeling solves.

LDA Topic modeling, LDA: the modern model of choice for topic modeling. In naive approaches to text categorization, one document belongs to one topic (category). In LDA, we (quite reasonably) assume that a document contains several topics: a topic is a (multinomial) distribution on words (in the bag-of-words model); a document is a (multinomial) distribution on topics.

Pictures from [Blei, 2012]


LDA. LDA is a hierarchical probabilistic model: on the first level, a mixture of topics $\phi$ with weights $z$; on the second level, a multinomial variable $\theta$ whose realization $z$ shows the distribution of topics in a document. It's called Dirichlet allocation because we assign Dirichlet priors $\alpha$ and $\beta$ to the model parameters $\theta$ and $\phi$ (conjugate priors to multinomial distributions).

LDA. Generative model for LDA:
choose document size $N \sim p(N \mid \xi)$;
choose the distribution of topics $\theta \sim \mathrm{Dir}(\alpha)$;
for each of the $N$ words $w_n$: choose the topic for this word $z_n \sim \mathrm{Mult}(\theta)$; choose the word $w_n \sim p(w_n \mid \phi_{z_n})$ by the corresponding multinomial distribution.
So the underlying joint distribution of the model is
$p(\theta, \phi, z, w, N \mid \alpha, \beta) = p(N \mid \xi)\, p(\theta \mid \alpha)\, p(\phi \mid \beta) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid \phi, z_n)$.
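A minimal Python sketch of this generative process (the sizes T, V, the number of documents, and the Poisson choice for $p(N \mid \xi)$ are assumptions made for illustration):

```python
# Sample a toy corpus from the LDA generative model.
import numpy as np

rng = np.random.default_rng(0)
T, V, n_docs, xi = 5, 1000, 3, 100           # assumed toy sizes
alpha, beta = 0.1, 0.01                      # Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)      # topic-word distributions phi
docs = []
for _ in range(n_docs):
    N = rng.poisson(xi)                            # document size N ~ p(N | xi)
    theta = rng.dirichlet(np.full(T, alpha))       # topic distribution theta ~ Dir(alpha)
    z = rng.choice(T, size=N, p=theta)             # topic z_n ~ Mult(theta) for each word
    words = [int(rng.choice(V, p=phi[t])) for t in z]   # word w_n ~ Mult(phi_{z_n})
    docs.append(words)
```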

LDA: inference. The inference problem: given $\{w\}_{w \in D}$, find
$p(\theta, \phi \mid w, \alpha, \beta) \propto \int p(w \mid \theta, \phi, z, \alpha, \beta)\, p(\theta, \phi, z \mid \alpha, \beta)\, dz$.
There are two major approaches to inference in complex probabilistic models like LDA:
variational approximations simplify the graph by approximating the underlying distribution with a simpler one, but with new parameters that are subject to optimization;
Gibbs sampling approaches the underlying distribution by sampling a subset of variables conditional on fixed values of all other variables.

LDA: inference. Both variational approximations and Gibbs sampling are known for the LDA; we will need the collapsed Gibbs sampling:
$p(z_w = t \mid z_{\neg w}, w, \alpha, \beta) \propto q(z_w, t, z_{\neg w}, w, \alpha, \beta) = \frac{n^{(d)}_{\neg w, t} + \alpha}{\sum_{t' \in T} \left( n^{(d)}_{\neg w, t'} + \alpha \right)} \cdot \frac{n^{(w)}_{\neg w, t} + \beta}{\sum_{w' \in W} \left( n^{(w')}_{\neg w, t} + \beta \right)}$,
where $n^{(d)}_{\neg w, t}$ is the number of times topic $t$ occurs in document $d$ and $n^{(w)}_{\neg w, t}$ is the number of times word $w$ is generated by topic $t$, not counting the current value $z_w$. Gibbs sampling is usually easier to extend to new modifications, and this is what we will be doing.
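For concreteness, here is a minimal collapsed Gibbs sampler that follows the sampling formula above (a toy sketch, not the authors' implementation; `docs` is assumed to be a list of lists of word ids as in the previous snippet):

```python
# Collapsed Gibbs sampling for LDA: resample each word's topic in turn
# from p(z_w = t | z_-w, w, alpha, beta).
import numpy as np

def gibbs_lda(docs, T, V, alpha, beta, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), T))                        # topic counts per document
    n_wt = np.zeros((V, T))                                # topic counts per word
    n_t = np.zeros(T)                                      # words assigned to each topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial assignments

    for d, doc in enumerate(docs):                         # initialize the counters
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                                # remove the current value z_w
                n_dt[d, t] -= 1; n_wt[w, t] -= 1; n_t[t] -= 1
                # collapsed Gibbs step; the document-side denominator is constant in t
                p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + V * beta)
                t = int(rng.choice(T, p=p / p.sum()))
                z[d][i] = t                                # put the new assignment back
                n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1
    return n_dt, n_wt                                      # counts behind theta and phi
```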

LDA extensions. Numerous extensions for the LDA model have been introduced:
correlated topic models (CTM): topics are codependent;
Markov topic models: MRFs model interactions between topics in different parts of the dataset (multiple corpora);
relational topic models: a hierarchical model of a document network structure as a graph;
Topics over Time, dynamic topic models: documents have timestamps (news, blog posts), and we model how topics develop in time (e.g., by evolving hyperparameters $\alpha$ and $\beta$);
DiscLDA: each document has a categorical label, and we utilize LDA to mine topic classes related to the classification problem;
Author-Topic model: information about the author; texts from the same author will share common words;
a lot of work on nonparametric LDA variants based on Dirichlet processes (no predefined number of topics).

Outline
1. Topic modeling
2. Measuring topic quality

Quality of the topic model We want to know how well we did in this modeling. Problem: there is no ground truth, the model runs unsupervised, so no cross-validation. Solution: hold out a subset of documents, then check their likelihood in the resulting model. Alternative: in the test subset, hold out half the words and try to predict them given the other half.

Quality of the topic model. Formally speaking, for a set of held-out documents $D_{\text{test}}$, compute the likelihood
$p(w \mid D) = \int p(w \mid \Phi, \alpha m)\, p(\Phi, \alpha m \mid D)\, d\alpha\, d\Phi$
for each held-out document $w$ and then compute the normalized result
$\mathrm{perplexity}(D_{\text{test}}) = \exp\left( -\frac{\sum_{w \in D_{\text{test}}} \log p(w)}{\sum_{d \in D_{\text{test}}} N_d} \right)$.
It is a nontrivial problem computationally, but efficient algorithms have already been devised.
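Given per-document estimates of $\log p(w)$, the perplexity itself is a one-liner; a minimal sketch (the hard part, estimating the held-out likelihoods, is assumed to be done elsewhere):

```python
# Held-out perplexity from per-document log-likelihoods and document lengths.
import numpy as np

def perplexity(log_p, N):
    # log_p[d]: estimate of log p(w_d) for held-out document d; N[d]: its length
    return float(np.exp(-np.sum(log_p) / np.sum(N)))
```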

Quality of individual topics However, this is only a general quality measure for the entire model. Another important problem: quality of individual topics. Qualitative studies: is a topic interesting? We want to help researchers (social studies, media studies) identify good topics suitable for human interpretation.

Quality of individual topics. Recent studies agree that topic coherence is a good candidate. For a topic $t$ characterized by its set of top words $W_t$, coherence is defined as
$c(t, W_t) = \sum_{w_1, w_2 \in W_t} \log \frac{d(w_1, w_2) + \epsilon}{d(w_1)}$,
where $d(w_i)$ is the number of documents that contain $w_i$, $d(w_i, w_j)$ is the number of documents where $w_i$ and $w_j$ cooccur, and $\epsilon$ is a smoothing count usually set to either 1 or 0.01. I.e., a topic is good if its words cooccur together often.
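A direct Python sketch of this coherence score (assuming `docs` is a list of sets of words and `top_words` holds $W_t$ in order of decreasing weight):

```python
# Coherence of a topic from document co-occurrence counts.
import itertools
import math

def coherence(top_words, docs, eps=1.0):
    score = 0.0
    for w1, w2 in itertools.combinations(top_words, 2):
        d1 = sum(1 for d in docs if w1 in d)                 # d(w1)
        d12 = sum(1 for d in docs if w1 in d and w2 in d)    # d(w1, w2)
        score += math.log((d12 + eps) / d1)
    return score
```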

Quality of individual topics. However, in our studies we have found that coherence does not work so well. We see two reasons:
(1) many topics that have good coherence are composed of common words that do not represent any topic of discourse per se; these common words do indeed cooccur often, but coherence does not distinguish between high-frequency words and informative words;
(2) in modern user-generated datasets (e.g., in the blogosphere), many topics stem from copies, reposts, and discussions of a single text that either directly copy or extensively cite the original text; thus, words that appear in this text have very good cooccurrence statistics even if they are rather meaningless words.

Quality of individual topics. To alleviate these drawbacks, we propose a modification of the basic coherence metric that takes informative content into account by substituting tf-idf scores for the raw numbers of [co]occurrences. Namely, we define tf-idf coherence as
$c_{\text{tf-idf}}(t, W_t) = \sum_{w_1, w_2 \in W_t} \log \frac{\sum_{d : w_1, w_2 \in d} \text{tf-idf}(w_1, d)\, \text{tf-idf}(w_2, d) + \epsilon}{\sum_{d : w_1 \in d} \text{tf-idf}(w_1, d)}$,
where tf-idf is computed with augmented frequency,
$\text{tf-idf}(w, d) = \text{tf}(w, d) \cdot \text{idf}(w) = \left( \frac{1}{2} + \frac{f(w, d)}{2 \max_{w' \in d} f(w', d)} \right) \log \frac{|D|}{|\{d \in D : w \in d\}|}$,
and $f(w, d)$ is how many times term $w$ occurs in document $d$. Intuitively, we skew the metric towards topics whose top words have high tf-idf scores.
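A minimal sketch of tf-idf coherence matching these formulas (assumed inputs: `counts` maps (word, document index) to the raw count $f(w, d)$, `docs` is a list of sets of words, `top_words` holds $W_t$):

```python
# tf-idf coherence: coherence weighted by augmented-frequency tf-idf scores.
import itertools
import math

def tfidf(w, d, counts, docs):
    max_f = max(counts[(w2, d)] for w2 in docs[d])           # augmented term frequency
    idf = math.log(len(docs) / sum(1 for doc in docs if w in doc))
    return (0.5 + 0.5 * counts[(w, d)] / max_f) * idf

def tfidf_coherence(top_words, docs, counts, eps=1.0):
    score = 0.0
    for w1, w2 in itertools.combinations(top_words, 2):
        num = sum(tfidf(w1, d, counts, docs) * tfidf(w2, d, counts, docs)
                  for d in range(len(docs)) if w1 in docs[d] and w2 in docs[d])
        den = sum(tfidf(w1, d, counts, docs)
                  for d in range(len(docs)) if w1 in docs[d])
        score += math.log((num + eps) / den)
    return score
```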

Topics with top coherence scores. Topics with common words have excellent coherence (top words per topic):
(1) say, tell, know, just, need, nothing
(2) just, stay, solve, problem, moment, know
(3) issue, solution, problem, situation, side, relation
(4) just, instance, example, often, say, have
(5) author, fact, article, say, issue, write
(6) life, know, just, live, see, say
(7) have, instance, image, example, system, follow
(8) century, appear, beginning, history, well-known, end
(9) just, know, say, nothing, general, see
(10) right, law, Russian, state, citizen, federation

Topics with top tf-idf coherence scores. W.r.t. tf-idf coherence we see much more interesting topics (topic ids in brackets):
(1) [topic 64]: terr. attack, explosion, Boston, terrorist, brother, police, Boston (adj.)
(2) [topic 58]: pope, Vatican, Roman, church, cardinal, catholic, Benedict
(3) [topic 42]: church, orthodox, temple, cleric, faith, church (adj.), patriarch
(4) [topic 63]: butter, sugar, add, flour, dough, recipe, egg
(5) [topic 75]: Korea, North, DPRK, South, Kim, Korean, nuclear
(6) [topic 48]: add, butter, onion, meat, pepper, minute, dish
(7) [topic 34]: Cyprus, bank, Russian, Cypriot, Euro, financial, money
(8) [topic 61]: Syria, Syrian, Al, country, Muslim, fighter, Arab
(9) [topic 28]: military, army, service, general, officer, mil. force, defense
(10) [topic 25]: war, German (adj.), Germany, German, Hitler, Soviet, world

Experimental evaluation We also conducted an experimental evaluation of topic quality based on lists of top words. For each topic, we asked the subjects (among them media studies experts) two binary questions: (1) Do you understand why the words in this topic have been united together, do you see obvious semantic criteria that unite the words in this topic? (2) If you have answered yes to the first question: can you identify specific issues/events that documents in this topic might address? (plus an open question: please sum up a topic in a few words)

Experimental evaluation. We compared quality metrics with the area-under-curve (AUC) measure: the share of pairs consisting of a positive and a negative example that the classifier ranks correctly (the positive one higher). Tf-idf coherence is significantly better.

Dataset        | # of topics | Q1 AUC (coh.) | Q1 AUC (tf-idf) | Q1 Ham. | Q2 AUC (coh.) | Q2 AUC (tf-idf) | Q2 Ham.
March 2012     | 100 | 0.66 | 0.74 | 0.15 | 0.59 | 0.65 | 0.24
March 2012     | 200 | 0.72 | 0.76 | 0.19 | 0.67 | 0.73 | 0.24
April 2012     | 100 | 0.66 | 0.74 | 0.10 | 0.59 | 0.65 | 0.22
September 2012 | 200 | 0.67 | 0.73 | 0.14 | 0.65 | 0.70 | 0.25
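The pairwise AUC used here has a simple direct form; a minimal sketch (the function name and the convention of counting ties as 0.5 are assumptions for illustration):

```python
# Pairwise AUC: fraction of (positive, negative) topic pairs ranked correctly.
import itertools

def pairwise_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]   # human-approved topics
    neg = [s for s, y in zip(scores, labels) if y == 0]   # rejected topics
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p, n in itertools.product(pos, neg))
    return correct / (len(pos) * len(neg))
```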

Summary. Latent Dirichlet allocation is a probabilistic model that extracts topics from a corpus of documents. By training the model, we decompose the word-document matrix into word-topic and topic-document matrices. There are many topics, and it is desirable to distinguish interesting ones. For this purpose, we propose a new metric (tf-idf coherence) and show that it is better.

Thank you for your attention!