Evaluation Methods for Topic Models

Hanna M. Wallach, University of Massachusetts Amherst. wallach@cs.umass.edu. April 13, 2009. Joint work with Iain Murray, Ruslan Salakhutdinov and David Mimno.

Statistical Topic Models. Useful for analyzing large, unstructured text collections. Example topics (top words per topic): bounds bound loss functions error; units hidden network layer unit; policy action reinforcement learning actions; data space clustering points distance; neurons neuron spike synaptic firing. Applications include topic-based search interfaces (http://rexa.info), analysis of scientific trends (Blei & Lafferty, 07; Hall et al., 08), and information retrieval (Wei & Croft, 06).

Latent Dirichlet Allocation (Blei et al., 03). LDA generates a new document w by drawing: a document-specific topic distribution $\theta \sim \mathrm{Dir}(\theta; \alpha m)$; a topic assignment for each token, $z \sim P(z \mid \theta) = \prod_n \theta_{z_n}$; and finally the observed tokens, $w \sim P(w \mid z, \Phi) = \prod_n \phi_{w_n \mid z_n}$. The topic parameters $\Phi$ and $\alpha m$ are shared by all documents. For real-world data, only the tokens w are observed.
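
As a concrete illustration of this generative process, here is a minimal Python sketch (not from the talk; the topics, dimensions, and variable names are made-up examples) that samples one document given $\Phi$ and $\alpha m$:

```python
# A minimal sketch of LDA's generative process for one document,
# assuming Phi (topics x vocabulary) and alpha*m are already given.
import numpy as np

def generate_document(phi, alpha_m, doc_length, rng):
    theta = rng.dirichlet(alpha_m)                                  # document-specific topic distribution
    z = rng.choice(len(alpha_m), size=doc_length, p=theta)          # a topic assignment for each token
    w = np.array([rng.choice(phi.shape[1], p=phi[t]) for t in z])   # the observed tokens
    return w, z, theta

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(50), size=10)    # 10 topics over a 50-word vocabulary (illustrative)
alpha_m = np.full(10, 0.1)                   # concentration alpha times base measure m
w, z, theta = generate_document(phi, alpha_m, doc_length=20, rng=rng)
print(w)
```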

Evaluating Topic Model Performance. The unsupervised nature of topic models makes evaluation hard. There may be extrinsic tasks for some applications, but we also want to estimate cross-task generalization. Computing the probability of held-out documents under the model is the classic way of evaluating generative models and is often used to evaluate topic models. This talk: demonstrate that standard methods for evaluating topic models are inaccurate, and propose two alternative methods.

Evaluating LDA. Given training documents $\mathbf{W}$ and held-out documents $\mathbf{W}'$: $P(\mathbf{W}' \mid \mathbf{W}) = \int d\Phi \, d\alpha \, dm \; P(\mathbf{W}' \mid \Phi, \alpha m) \, P(\Phi, \alpha m \mid \mathbf{W})$. Approximate this integral by evaluating at a point estimate. Variational methods or MCMC can be used to marginalize out the topic assignments for the training documents and infer $\Phi$ and $\alpha m$. The probability of interest is therefore $P(\mathbf{W}' \mid \Phi, \alpha m) = \prod_d P(w^{(d)} \mid \Phi, \alpha m)$.

Computing $P(w \mid \Phi, \alpha m)$. $P(w \mid \Phi, \alpha m)$ is the normalizing constant that relates the posterior distribution over z to the joint distribution over w and z: $P(z \mid w, \Phi, \alpha m) = P(w, z \mid \Phi, \alpha m) \,/\, P(w \mid \Phi, \alpha m)$. Computing it involves marginalizing over the latent variables: $P(w \mid \Phi, \alpha m) = \sum_z \int d\theta \, P(w, z, \theta \mid \Phi, \alpha m)$.
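
Because the sum over z has $T^N$ terms, this marginalization is only tractable for very short documents. The following Python sketch (not from the talk; the toy model and all names are illustrative assumptions) computes $\log P(w \mid \Phi, \alpha m)$ exactly for a tiny document by enumerating every z, using the Pólya-urn form of $P(z \mid \alpha m)$ so that $\theta$ is integrated out analytically:

```python
# Exact log P(w | Phi, alpha m) for a tiny document by brute-force enumeration
# of topic assignments z, with theta integrated out (Polya-urn / collapsed form).
import itertools
import numpy as np

def log_p_w_exact(w, phi, alpha_m):
    """w: token ids; phi[t, v] = P(word v | topic t); alpha_m[t] = alpha * m_t."""
    T = phi.shape[0]
    alpha = alpha_m.sum()
    total = -np.inf
    for z in itertools.product(range(T), repeat=len(w)):
        counts = np.zeros(T)
        log_joint = 0.0
        for n, (t, v) in enumerate(zip(z, w)):
            log_joint += np.log(phi[t, v])                              # P(w_n | z_n, Phi)
            log_joint += np.log((counts[t] + alpha_m[t]) / (n + alpha)) # P(z_n | z_<n, alpha m)
            counts[t] += 1
        total = np.logaddexp(total, log_joint)                          # sum over z in log space
    return total

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(20), size=3)   # 3 topics, 20 word types (illustrative)
alpha_m = np.array([0.3, 0.3, 0.4])
w = [4, 7, 7, 12, 0]                       # a 5-token "document"
print(log_p_w_exact(w, phi, alpha_m))      # exact log P(w | Phi, alpha m)
```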

Methods for Computing Normalizing Constants. Simple importance sampling methods: e.g., MALLET's empirical likelihood, iterated pseudo-counts. The harmonic mean method (Newton & Raftery, 94): known to overestimate, yet used in topic modeling papers. Annealed importance sampling (Neal, 01): accurate, but prohibitively slow for large data sets. A Chib-style method (Murray & Salakhutdinov, 09). A left-to-right method (Wallach, 08).

Chib-Style Estimates. For any special set of latent topic assignments $z^*$: $P(w \mid \Phi, \alpha m) = \dfrac{P(w \mid z^*, \Phi)\, P(z^* \mid \alpha m)}{P(z^* \mid w, \Phi, \alpha m)}$. Chib-style estimation: 1. Pick some special set of latent topic assignments $z^*$. 2. Compute $P(w \mid z^*, \Phi)\, P(z^* \mid \alpha m)$. 3. Estimate $P(z^* \mid w, \Phi, \alpha m)$. A Markov chain can be used to estimate $P(z^* \mid w, \Phi, \alpha m)$.

Markov Chain Estimation. Stationary condition for a Markov chain: $P(z^* \mid w, \Phi, \alpha m) = \sum_{z} T(z^* \leftarrow z)\, P(z \mid w, \Phi, \alpha m)$. Estimate the sum using a sequence of states $Z = \{z^{(1)}, \ldots, z^{(S)}\}$ generated by a Markov chain that explores $P(z \mid w, \Phi, \alpha m)$, as in the sketch below.
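
The sketch below (not from the talk; toy model and names are illustrative assumptions) implements exactly this naive estimate for a sequential-scan Gibbs kernel T: run a regular Gibbs chain over z and average the probability that one sweep, started from each sampled state, lands exactly on $z^*$. It is not the full Chib-style method of the later slides.

```python
# Naive Markov-chain estimate of P(z* | w, Phi, alpha m): average T(z* <- z^(s))
# over states z^(s) drawn by sequential-scan Gibbs sampling.
import numpy as np

def conditional(n, z, w, phi, alpha_m):
    """P(z_n = t | z_{\\n}, w_n, Phi, alpha m) as a vector over topics t."""
    counts = np.bincount(np.delete(z, n), minlength=phi.shape[0])
    p = phi[:, w[n]] * (counts + alpha_m)
    return p / p.sum()

def gibbs_sweep(z, w, phi, alpha_m, rng):
    z = z.copy()
    for n in range(len(w)):
        z[n] = rng.choice(phi.shape[0], p=conditional(n, z, w, phi, alpha_m))
    return z

def sweep_prob(z_star, z_from, w, phi, alpha_m):
    """T(z* <- z_from): probability that one sequential scan produces exactly z*."""
    z = z_from.copy()
    log_t = 0.0
    for n in range(len(w)):
        p = conditional(n, z, w, phi, alpha_m)
        log_t += np.log(p[z_star[n]])
        z[n] = z_star[n]            # later positions condition on the z* values already placed
    return np.exp(log_t)

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(20), size=3)
alpha_m = np.array([0.3, 0.3, 0.4])
w = np.array([4, 7, 7, 12, 0])
z = rng.integers(0, 3, size=len(w))
for _ in range(50):                 # burn-in; the final state also serves as a plausible z*
    z = gibbs_sweep(z, w, phi, alpha_m, rng)
z_star = z.copy()
estimates = []
for _ in range(2000):
    z = gibbs_sweep(z, w, phi, alpha_m, rng)
    estimates.append(sweep_prob(z_star, z, w, phi, alpha_m))
print(np.mean(estimates))           # naive estimate of P(z* | w, Phi, alpha m)
```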

Overestimate of $P(w \mid \Phi, \alpha m)$. The estimate of $P(z^* \mid w, \Phi, \alpha m)$ is unbiased in expectation: $P(z^* \mid w, \Phi, \alpha m) = \mathrm{E}\!\left[\tfrac{1}{S}\sum_{s=1}^{S} T(z^* \leftarrow z^{(s)})\right]$. But, in expectation, $P(w \mid \Phi, \alpha m)$ will be overestimated (Jensen's inequality): $P(w \mid \Phi, \alpha m) = \dfrac{P(z^*, w \mid \Phi, \alpha m)}{\mathrm{E}\!\left[\tfrac{1}{S}\sum_{s=1}^{S} T(z^* \leftarrow z^{(s)})\right]} \le \mathrm{E}\!\left[\dfrac{P(z^*, w \mid \Phi, \alpha m)}{\tfrac{1}{S}\sum_{s=1}^{S} T(z^* \leftarrow z^{(s)})}\right]$.
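
The bias comes from placing a noisy, unbiased estimate in the denominator, since $1/x$ is convex. A tiny simulation (illustrative only, not from the talk; the lognormal noise is an arbitrary stand-in for the averaged transition probabilities) shows the effect:

```python
# Plugging an unbiased but noisy estimate X of a denominator into c / X
# overestimates c / E[X] on average (Jensen's inequality).
import numpy as np

rng = np.random.default_rng(0)
c = 1.0
X = rng.lognormal(mean=-2.0, sigma=1.0, size=(100_000, 10)).mean(axis=1)  # unbiased estimates
print(c / X.mean())     # the target: c divided by the true expectation
print((c / X).mean())   # average of the plug-in estimator: systematically larger
```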

Chib-Style Method (Murray & Salakhutdinov, 09). Draw $Z = \{z^{(1)}, \ldots, z^{(S)}\}$ from a carefully designed distribution. The resulting estimator is unbiased: $\hat{P}(w \mid \Phi, \alpha m) = \dfrac{P(w, z^* \mid \Phi, \alpha m)}{\tfrac{1}{S}\sum_{s'=1}^{S} T(z^* \leftarrow z^{(s')})}$.

Left-to-Right Method (Wallach, 08). We can decompose $P(w \mid \Phi, \alpha m)$ as $P(w \mid \Phi, \alpha m) = \prod_n P(w_n \mid w_{<n}, \Phi, \alpha m) = \prod_n \sum_{z_{\le n}} P(w_n, z_{\le n} \mid w_{<n}, \Phi, \alpha m)$. Approximate each sum over $z_{\le n}$ using an MCMC algorithm. "Left-to-right": appropriate for language modeling applications.

Left-to-Right Method (Wallach, 08). Here $l$ is initialized to 0 and accumulates the estimate of $\log P(w \mid \Phi, \alpha m)$:
1: for each position $n$ in $w$ do
2:   for each particle $r = 1$ to $R$ do
3:     for each position $n' < n$ do
4:       resample $z^{(r)}_{n'} \sim P(z^{(r)}_{n'} \mid w_{n'}, \{z^{(r)}_{<n}\}_{\setminus n'}, \Phi, \alpha m)$
5:     end for
6:     $p_n^{(r)} := \sum_t P(w_n, z_n^{(r)} = t \mid z^{(r)}_{<n}, \Phi, \alpha m)$
7:     sample a topic assignment: $z_n^{(r)} \sim P(z_n^{(r)} \mid w_n, z^{(r)}_{<n}, \Phi, \alpha m)$
8:   end for
9:   $p_n := \sum_r p_n^{(r)} / R$
10:  $l := l + \log p_n$
11: end for
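
A minimal Python sketch of this pseudocode follows (not the talk's MALLET implementation; the toy model and all names are illustrative assumptions):

```python
# Left-to-right estimator for log P(w | Phi, alpha m), following the
# pseudocode above: R particles, resampling earlier positions at each step.
import numpy as np

def topic_posterior(wn, counts, phi, alpha_m):
    """P(w_n, z_n = t | counts of earlier assignments), as a vector over t."""
    return phi[:, wn] * (counts + alpha_m) / (counts.sum() + alpha_m.sum())

def left_to_right(w, phi, alpha_m, R, rng):
    T = phi.shape[0]
    z = np.zeros((R, len(w)), dtype=int)     # z[r, :n] holds particle r's assignments so far
    log_l = 0.0
    for n in range(len(w)):
        p_n = np.zeros(R)
        for r in range(R):
            for m in range(n):               # lines 3-5: resample earlier positions
                counts = np.bincount(np.delete(z[r, :n], m), minlength=T)
                p = phi[:, w[m]] * (counts + alpha_m)
                z[r, m] = rng.choice(T, p=p / p.sum())
            counts = np.bincount(z[r, :n], minlength=T)
            probs = topic_posterior(w[n], counts, phi, alpha_m)
            p_n[r] = probs.sum()                             # line 6: sum over topics
            z[r, n] = rng.choice(T, p=probs / probs.sum())   # line 7: sample z_n
        log_l += np.log(p_n.mean())                          # lines 9-10
    return log_l

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(20), size=3)
alpha_m = np.array([0.3, 0.3, 0.4])
w = np.array([4, 7, 7, 12, 0])
print(left_to_right(w, phi, alpha_m, R=20, rng=rng))   # estimate of log P(w | Phi, alpha m)
```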

Relative Computational Costs. Gibbs sampling dominates the cost for most methods. Costs are in terms of the number of Gibbs site updates required (or equivalent):

Method                  | Parameters                   | Cost
Iterated pseudo-counts  | # iterations I, # samples S  | (I + S) N
Empirical likelihood    | # samples S                  | S N
Harmonic mean           | burn-in B, # samples S       | N (B + S)
AIS                     | # temperatures S             | S N
Chib-style              | chain length S               | 2 S N
Left-to-right           | # particles R                | R N (N + 1) / 2

Data Sets. Two synthetic data sets and three real data sets. V is the vocabulary size, N is the mean document length, and St. Dev. is the estimated standard deviation in document length:

Data set                 |     V |     N | St. Dev.
Synthetic, 3 topics      |  9242 |   500 |      0
Synthetic, 50 topics     |  9242 |   200 |      0
20 Newsgroups            | 22695 | 120.4 |  296.2
PubMed Central abstracts | 30262 | 101.8 |   49.2
New York Times articles  | 50412 | 230.6 |  250.5

Average Log Prob. Per Held-Out Document (20 Newsgroups). [Box plots comparing all six methods, plus a zoomed panel showing only AIS, LR, CS, and IS-IP; y-axis: average log probability per held-out document, with ticks near −1100 and −900 and annotated values of −1032.7 and −1029.4.] AIS: annealed importance sampling. HM: harmonic mean. LR: left-to-right. CS: Chib-style. IS-EL: importance sampling (empirical likelihood). IS-IP: importance sampling (iterated pseudo-counts).

Conclusions. Empirically determined that the evaluation methods currently used in the topic modeling community are inaccurate: the harmonic mean method often significantly overestimates, while simple importance sampling methods tend to underestimate (though not by as much). Proposed two more accurate alternatives: a Chib-style method (Murray & Salakhutdinov, 09) and a left-to-right method (Wallach, 08).

Questions? wallach@cs.umass.edu http://www.cs.umass.edu/~wallach/

Average Log Prob. Per Held-Out Document (Synth., 3 Topics). [Box plots of all six methods, plus a zoomed panel showing AIS, LR, CS, and IS-IP; y-axis ticks near −3670 and −3490; annotated value −3671.68. Method abbreviations as in the 20 Newsgroups slide.]

Average Log Prob. Per Held-Out Document (Synth., 50 Topics). [Box plots of all six methods, plus a zoomed panel showing AIS, LR, CS, and IS-IP; y-axis ticks near −1460 and −1250; annotated value −1419.4. Method abbreviations as in the 20 Newsgroups slide.]

Average Log Prob. Per Held-Out Document (PubMed Central). [Box plots of all six methods, plus a zoomed panel showing AIS, LR, CS, and IS-IP; y-axis ticks near −870 and −740; annotated values −820.1 and −816.1. Method abbreviations as in the 20 Newsgroups slide.]

Average Log Prob. Per Held-Out Document (New York Times). [Box plots of all six methods, plus a zoomed panel showing AIS, LR, CS, and IS-IP; y-axis ticks near −1800 and −1600; annotated values −1747.6 and −1739.3. Method abbreviations as in the 20 Newsgroups slide.]

Choosing a Special State $z^*$. Run regular Gibbs sampling for a few iterations, then iteratively maximize the following quantity: $P(z_n = t \mid w, z_{\setminus n}, \Phi, \alpha m) \propto P(w_n \mid z_n = t, \Phi)\, P(z_n = t \mid z_{\setminus n}, \alpha m) = \phi_{w_n \mid t}\, \dfrac{\{N_t\}_{\setminus n} + \alpha m_t}{N - 1 + \alpha}$, where $\{N_t\}_{\setminus n}$ is the number of times topic $t$ occurs in $z$ excluding position $n$.
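
A minimal Python sketch of this procedure (not from the talk; toy model and names are illustrative assumptions) runs a few Gibbs sweeps and then applies iterated conditional maximization until no assignment changes:

```python
# Choosing the special state z*: a few Gibbs sweeps, then iterated
# maximization of P(z_n = t | w, z_\n, Phi, alpha m) at each position.
import numpy as np

def conditional(n, z, w, phi, alpha_m):
    counts = np.bincount(np.delete(z, n), minlength=phi.shape[0])
    p = phi[:, w[n]] * (counts + alpha_m)   # phi_{w_n|t} * ({N_t}_\n + alpha m_t)
    return p / p.sum()

def choose_special_state(w, phi, alpha_m, rng, gibbs_iters=5):
    z = rng.integers(0, phi.shape[0], size=len(w))
    for _ in range(gibbs_iters):            # regular Gibbs sampling for a few iterations
        for n in range(len(w)):
            z[n] = rng.choice(phi.shape[0], p=conditional(n, z, w, phi, alpha_m))
    changed = True
    while changed:                          # iterated maximization until convergence
        changed = False
        for n in range(len(w)):
            t = int(np.argmax(conditional(n, z, w, phi, alpha_m)))
            if t != z[n]:
                z[n], changed = t, True
    return z

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(20), size=3)
alpha_m = np.array([0.3, 0.3, 0.4])
w = np.array([4, 7, 7, 12, 0])
print(choose_special_state(w, phi, alpha_m, rng))   # the special state z*
```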