Topic Modeling. Hanna M. Wallach

Transcription:

University of Massachusetts Amherst. wallach@cs.umass.edu

The Next 30 Minutes. Motivations and a brief history: latent semantic analysis; probabilistic latent semantic analysis. Latent Dirichlet allocation: model structure and priors. Approximate inference algorithms. Evaluation (log probabilities, human interpretation). Post-LDA...

The Problem with Information. Needle in a haystack: as more information becomes available, it becomes harder and harder to find what we are looking for. We need new tools to help us organize, search, and understand information.

A Solution? Use topic models to discover hidden topic-based patterns. Use the discovered topics to annotate the collection. Use the annotations to organize, understand, summarize, search...

Topic (Concept) Models. Topic models (LSA, PLSA, LDA) share three fundamental assumptions: documents have latent semantic structure ("topics"); topics can be inferred from word-document co-occurrences; and words are related to topics, topics to documents. They differ in their mathematical frameworks: linear algebra vs. probabilistic modeling.

Topics and Words

Documents and Topics

Latent Semantic Analysis (Deerwester et al., 1990). Based on ideas from linear algebra. Form a sparse term-document co-occurrence matrix X, using raw counts or (more likely) TF-IDF weights. Use SVD to decompose X into three matrices: U relates terms to concepts, V relates concepts to documents, and Σ is a diagonal matrix of singular values.

Singular Value Decomposition. [Worked example: a term-document count matrix built from three sentences ("Latent semantic analysis (LSA) is a theory and method for...", "Probabilistic latent semantic analysis is a probabilistic...", "Latent Dirichlet allocation, a generative probabilistic model..."), with terms such as allocation, analysis, Dirichlet, generative, latent, LSA, probabilistic, and semantic as rows and the three sentences as columns.]
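The decomposition on this slide can be sketched in a few lines of NumPy. The count matrix below is illustrative (made-up terms and documents in the spirit of the slide's example), not the slide's exact matrix:

```python
import numpy as np

# Hypothetical 8-term x 3-document count matrix
# (rows: terms such as "latent", "semantic", ...; columns: three sentences).
X = np.array([
    [1, 1, 1],   # latent
    [1, 1, 0],   # semantic
    [1, 1, 0],   # analysis
    [0, 1, 1],   # probabilistic
    [0, 0, 1],   # Dirichlet
    [0, 0, 1],   # allocation
    [0, 0, 1],   # generative
    [1, 0, 0],   # LSA
])

# SVD: X = U @ diag(s) @ Vt, with U relating terms to concepts,
# Vt relating concepts to documents, and s the singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-2 "concept space" approximation: keep only the top 2 singular values.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Truncating to the top k singular values is what gives LSA its dimensionality reduction: documents and terms are compared in the k-dimensional concept space rather than in the raw co-occurrence space.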

Probabilistic Modeling. Treat data as observations that arise from a generative probabilistic process that includes hidden variables. Infer the hidden structure using posterior inference. For documents, the hidden variables represent the thematic structure of the collection: what are the topics that describe this collection? Situate new data into the estimated model.

Probabilistic LSA (Hofmann, 1999). [Graphical model: a document-specific topic distribution generates a topic assignment for each token; the topics, together with the assignment, generate each observed word.]

Advantages and Disadvantages. A probabilistic model that can be easily extended and embedded in other, more complicated models. But it is not a well-defined generative model: there is no way of generalizing to new, unseen documents. It has many free parameters (linear in the number of training documents) and is prone to overfitting (have to be careful when training).

Latent Dirichlet Allocation (Blei et al., 2003)

Graphical Model. [Plate diagram: one set of Dirichlet parameters generates the document-specific topic distribution, which generates each token's topic assignment; a second set of Dirichlet parameters generates the topics, which, together with the assignment, generate each observed word.]
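The plate diagram corresponds to a simple generative procedure, sketched below with NumPy. All sizes and hyperparameter values are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 10, 3, 5, 20    # vocabulary size, topics, documents, tokens per doc
alpha, beta = 0.5, 0.1       # symmetric Dirichlet concentration parameters

# Topics: K distributions over the vocabulary, drawn from a Dirichlet prior.
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))   # document-specific topic distribution
    z = rng.choice(K, size=N, p=theta)         # topic assignment for each token
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # observed words
    docs.append(w)
```

Reading the diagram as code makes the contrast with pLSA concrete: because theta is itself drawn from a Dirichlet prior, the model assigns probability to any new document, not just those seen at training time.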

Dirichlet Distribution. A distribution over K-dimensional positive vectors that sum to one (i.e., points on the probability simplex). Two parameters: a base measure, e.g., m (a vector), and a concentration parameter, e.g., α (a scalar).
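The base measure / concentration parameterization can be explored directly by sampling; the particular m and α values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m = np.array([0.5, 0.3, 0.2])        # base measure: the mean of the distribution

for alpha in (0.1, 1.0, 100.0):      # concentration parameter
    samples = rng.dirichlet(alpha * m, size=2000)
    # Every sample lies on the probability simplex. Small alpha pushes mass
    # toward the simplex corners (sparse vectors); large alpha concentrates
    # samples tightly around the base measure m.
    print(alpha, samples.mean(axis=0).round(2), samples.std(axis=0).round(2))
```

This is the behavior the "Varying Parameters" slide visualizes: the standard NumPy parameterization Dirichlet(α·m) has mean m and variance shrinking as α grows.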

Varying Parameters

Asymmetric? Symmetric? (Wallach et al., 2009) People (almost always) use symmetric Dirichlet priors with heuristically set concentration parameters. Simple, but is it the best modeling choice? Empirical comparison of symmetric and asymmetric priors.

Priors and Stop Words

Intuition. Topics are specialized distributions over words, and we want topics to be as distinct as possible. An asymmetric prior over the topic-word distributions makes topics more similar to each other (and to the corpus word frequencies), so we want a symmetric prior there to preserve topic distinctness. We still have to account for power-law word usage: an asymmetric prior over the document-specific topic distributions means some topics can be used much more often than others.

Posterior Inference

Posterior Inference. Infer (or integrate out) all latent variables, given the observed tokens.

Inference Algorithms (Mukherjee & Blei, 2009; Asuncion et al., 2009). Exact inference in LDA is not tractable. Approximate inference algorithms: mean field variational inference (Blei et al., 2001; 2003), expectation propagation (Minka & Lafferty, 2002), collapsed Gibbs sampling (Griffiths & Steyvers, 2002), and collapsed variational inference (Teh et al., 2006). Each method has advantages and disadvantages.
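Of the methods listed, collapsed Gibbs sampling is the shortest to sketch. The toy corpus, vocabulary, and hyperparameters below are made up for illustration; the update resamples each token's topic assignment from its conditional distribution with theta and phi integrated out:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 0]]   # hypothetical word ids
V, K = 5, 2
alpha, beta = 0.5, 0.1

# Random initial topic assignments, plus the count tables the sampler needs:
# ndk[d,k] = tokens in doc d assigned topic k; nkw[k,w] = word w assigned
# topic k across the corpus; nk[k] = total tokens assigned topic k.
z = [[int(rng.integers(K)) for _ in d] for d in docs]
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(50):                         # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                     # remove this token's assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # p(z_i = k | rest) is proportional to (ndk + alpha)(nkw + beta)/(nk + V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))   # resample the topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```

After burn-in, point estimates of the topics and document-topic distributions fall out of the counts, e.g. phi_hat = (nkw + beta) / (nk + V * beta)[:, None].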

Evaluating LDA: Log Probability. The unsupervised nature of LDA makes evaluation hard. Compute the probability of held-out documents: the classic way of evaluating generative models, often used to evaluate topic models. Problem: we have to approximate an intractable sum.

Computing Log Probability (Wallach et al., 2009). Simple importance sampling methods. The harmonic mean method (Newton & Raftery, 1994): known to overestimate, used anyway. Annealed importance sampling (Neal, 2001): prohibitively slow for large collections of documents. Chib-style method (Murray & Salakhutdinov, 2009). Left-to-Right method (Wallach, 2008).
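The simplest of these ideas, importance sampling from the prior, fits in a few lines. The "learned" topics and the held-out document below are made up for illustration; the estimator averages the document likelihood over draws of theta from its Dirichlet prior:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha = 3, 8, 0.5
phi = rng.dirichlet(np.full(V, 0.1), size=K)   # pretend these are learned topics
doc = np.array([0, 3, 3, 5, 1])                # held-out document (word ids)

S = 5000
thetas = rng.dirichlet(np.full(K, alpha), size=S)   # theta^(s) ~ Dirichlet(alpha)
per_token = thetas @ phi[:, doc]                    # (S, N): p(w_n | theta^(s), phi)
log_p_doc = np.log(per_token).sum(axis=1)           # log p(doc | theta^(s), phi)
log_p = np.log(np.exp(log_p_doc).mean())            # Monte Carlo estimate of log p(doc)
```

The variance of this estimator grows quickly with document length, which is why the more sophisticated alternatives above exist: the harmonic mean method instead averages over posterior samples (and overestimates), while annealed importance sampling and the Left-to-Right method trade computation for much lower-variance estimates.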

Reading Tea Leaves

Word and Topic Intrusion (Chang et al., 2009) Can humans find the intruder word/topic?

Post-LDA Topic Modeling LDA can be embedded in more complicated models Data-generating distribution can be changed

Today's Workshop. Text and language (S. Gerrish & D. Blei; M. Johnson; T. Landauer). Time-evolving networks (E. Xing). Visual recognition (L. Fei-Fei). Finance (G. Doyle & C. Elkan). Archeology (D. Mimno). Music analysis (D. Hu & L. Saul)... even some theoretical work (D. Sontag & D. Roy).

Questions? (Thanks to Dave Blei for letting me steal pictures/content, etc.)