Topic Modeling: Beyond Bag-of-Words

Hanna M. Wallach, University of Cambridge (hmw26@cam.ac.uk), June 26, 2006

Generative Probabilistic Models of Text. These models are used in text compression, predictive text entry, and information retrieval, and estimate the probability of a word in a given context. Here, both types of context (preceding words and document topics) are combined to improve performance, within a single Bayesian framework.

Statistical Language Models. These estimate the probability of a word occurring in a given context, where the context is normally some number of preceding words. They are used in text compression, predictive text entry, and speech recognition. There are many different models of this sort.

A Simple Bigram Language Model. Given a corpus w of N tokens, count N_w = # of times word w appears in w, and N_{v|w} = # of times word v follows word w in w. Form the predictive distribution by interpolating unigram and bigram frequencies, P(w_i = v | w_{i-1} = w) = λ (N_v / N) + (1 − λ) (N_{v|w} / N_w). Use, e.g., cross-validation to estimate the weight λ.
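As a concrete illustration, here is a minimal Python sketch of this interpolated bigram model. The toy corpus and the fixed value of λ are illustrative only; in practice λ would be chosen by cross-validation.

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

N = len(corpus)
N_word = Counter(corpus)                          # N_w: occurrences of each word
N_bigram = Counter(zip(corpus[:-1], corpus[1:]))  # N_{v|w}: times v follows w

def predictive(v, w, lam=0.3):
    """P(w_i = v | w_{i-1} = w) = lam * N_v/N + (1 - lam) * N_{v|w}/N_w."""
    p_unigram = N_word[v] / N
    p_bigram = N_bigram[(w, v)] / N_word[w] if N_word[w] else 0.0
    return lam * p_unigram + (1 - lam) * p_bigram

print(predictive("cat", "the"))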

Hierarchical Dirichlet Language Model (MacKay & Peto, 95). A bigram model based on principles of Bayesian inference: For each word w in the vocabulary, draw a distribution over words φ_w from Dir(φ_w; βm). For each position i in document d, draw a word w_i from φ_{w_{i−1}}.

HDLM: Predictive Distribution. Integrate out each φ_w. The predictive probability of word v following word w is P(w_i = v | w_{i−1} = w, w, βm) = (N_{v|w} + β m_v) / (N_w + β), with weight per context λ_w = β / (N_w + β). This is a Bayesian version of the simple bigram language model.
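A minimal sketch of this predictive rule in Python; the counts and hyperparameter values below are illustrative placeholders, not numbers from the talk.

def hdlm_predictive(N_vw, N_w, m_v, beta):
    """P(v | w) = (N_{v|w} + beta * m_v) / (N_w + beta),
    equivalently lambda_w * m_v + (1 - lambda_w) * N_{v|w}/N_w."""
    return (N_vw + beta * m_v) / (N_w + beta)

def context_weight(N_w, beta):
    """lambda_w = beta / (N_w + beta): weight placed on the prior mean m."""
    return beta / (N_w + beta)

# Example: v followed w 3 times, w occurred 10 times, prior mean m_v = 0.01, beta = 5.0
print(hdlm_predictive(3, 10, 0.01, 5.0), context_weight(10, 5.0))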

Statistical Topic Models. Documents are modeled as a finite mixture of topics; the topic mixture provides an explicit representation of a document, and each word is generated by a single topic. Topic models are used in information retrieval, classification, and collaborative filtering.

Latent Dirichlet Allocation (Blei et al., 03). Models documents as mixtures of latent topics; topics are inferred from word correlations, independent of word order (bag-of-words). For each document d, draw a topic mixture θ_d from Dir(θ_d; αn). For each topic t, draw a distribution over words φ_t from Dir(φ_t; βm). For each position i in document d: draw a topic z_i from θ_d, then draw a word w_i from φ_{z_i}.
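The following is a minimal Python/NumPy sketch of this generative process for a single document. The vocabulary size, number of topics, document length, and hyperparameter values are illustrative choices, not values from the talk.

import numpy as np

rng = np.random.default_rng(0)
V, T, doc_len = 1000, 20, 100          # vocabulary size, number of topics, words per document

alpha_n = np.full(T, 0.1)              # Dirichlet parameters over topics (alpha * n)
beta_m = np.full(V, 0.01)              # Dirichlet parameters over words (beta * m)

phi = rng.dirichlet(beta_m, size=T)    # one word distribution phi_t per topic
theta_d = rng.dirichlet(alpha_n)       # topic mixture theta_d for one document d

words = []
for i in range(doc_len):
    z_i = rng.choice(T, p=theta_d)     # draw topic z_i from theta_d
    w_i = rng.choice(V, p=phi[z_i])    # draw word w_i from phi_{z_i}
    words.append(w_i)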

Combining Word Order and Topics. Each type of model has something to offer the other: context-based language models can be improved by incorporating topics, and topic models can be improved by a notion of word order.

Bigram Topic Model. Combining ideas from the HDLM and LDA gives a new topic model that moves beyond the bag-of-words assumption. For each topic t and word w, draw a distribution over words φ_{w,t} from a Dirichlet prior. For each document d, draw a topic mixture θ_d from Dir(θ_d; αn). For each position i in document d: draw a topic z_i from θ_d, then draw a word w_i from φ_{w_{i−1},z_i}.
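Analogously, here is a minimal Python/NumPy sketch of the bigram topic model's generative process for one document. Each (previous word, topic) pair has its own distribution over next words; the sizes, hyperparameters, and the choice of start word are again illustrative.

import numpy as np

rng = np.random.default_rng(0)
V, T, doc_len = 200, 10, 100           # kept small: phi has one V-vector per (word, topic) pair

alpha_n = np.full(T, 0.1)
beta_m = np.full(V, 0.01)

# phi[w, t] is a distribution over the next word, given previous word w and topic t
phi = rng.dirichlet(beta_m, size=(V, T))
theta_d = rng.dirichlet(alpha_n)

words = [rng.choice(V)]                       # arbitrary start word for this sketch
for i in range(1, doc_len):
    z_i = rng.choice(T, p=theta_d)            # draw topic z_i from theta_d
    w_i = rng.choice(V, p=phi[words[-1], z_i])  # draw w_i from phi_{w_{i-1}, z_i}
    words.append(w_i)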

Prior over {φ_{w,t}}. The prior over {φ_{w,t}} must be coupled so that learning about one φ_{w,t} gives information about the others. Coupling comes from hyperparameter sharing, and there are several ways of doing this: single (only one βm), per topic (β_t m_t for each topic t), or per word (β_w m_w for each possible previous word w).

Inference of Hyperparameters. Integrate over the φ_{w,t} and θ_d. Let U = {αn, βm} or U = {αn, {β_t m_t}}, and assume uniform hyperpriors over all hyperparameters. Find the maximum of the evidence, U^{MP} = arg max_U Σ_z P(w, z | U), using a Gibbs EM algorithm.

Comparing Predictive Accuracy. The information rate of unseen test data w' in bits per word is R = − log₂ P(w' | w) / N', where N' is the number of test tokens. A lower information rate means better predictive accuracy, and it is a direct measure of text compressibility. Gibbs sampling is used to approximate P(w' | w).
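As a small illustration, the conversion from an (approximate) log probability of the test data to an information rate in bits per word; the numeric values are placeholders, not results from the talk.

import math

def information_rate(log_prob_test, num_test_tokens):
    """R = -log2 P(test | train) / N_test, in bits per word.
    log_prob_test is assumed to be a natural-log probability."""
    return -log_prob_test / (math.log(2) * num_test_tokens)

# Example: natural-log probability of -45000 over 6,521 test tokens
print(information_rate(-45000.0, 6521))   # roughly 10 bits per word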

Data Sets. 150 abstracts from Psychological Review: vocabulary size 1,374 words; 13,414 tokens in training data, 6,521 in test data. 150 postings from the 20 Newsgroups data set: vocabulary size 2,281 words; 27,478 tokens in training data, 13,579 in test data.

Information Rate: Psychological Review [plot: bits per word (y-axis, roughly 6 to 13) against number of topics (x-axis, 0 to 120) for the hierarchical Dirichlet language model, latent Dirichlet allocation, the bigram topic model with a single βm, and the bigram topic model with β_t m_t per topic]

Information Rate: 20 Newsgroups [plot: bits per word (y-axis, roughly 6 to 13) against number of topics (x-axis, 0 to 120) for the same four models]

Inferred Topics: Latent Dirichlet Allocation [most probable words from several inferred topics: the, i, that, easter, [number], is, proteins, ishtar, in, satan, the, a, to, the, of, the, espn, which, to, have, hockey, and, i, with, a, of, if, but, this, metaphorical, [number], english, as, evil, you, and, run, there, fact, is]

Inferred Topics: Bigram Topic Model (single βm) [most probable words from several inferred topics: to, the, the, the, party, god, and, a, arab, is, between, to, not, belief, warrior, i, power, believe, enemy, of, any, use, battlefield, [number], i, there, a, is, is, strong, of, in, this, make, there, and, things, i, way, it]

Inferred Topics: Bigram Topic Model (β_t m_t per topic) [most probable words from several inferred topics: party, god, [number], the, arab, believe, the, to, power, about, tower, a, as, atheism, clock, and, arabs, gods, a, of, political, before, power, i, are, see, motherboard, is, rolling, atheist, mhz, [number], london, most, socket, it, security, shafts, plastic, that]

Findings and Future Work. Findings: combining latent topics and word order improves predictive accuracy, and the quality of the inferred topics is improved. Future work: a per-word prior (β_w m_w for each possible previous word w), other model structures, and evaluation on larger corpora.

Questions? hmw26@cam.ac.uk http://www.inference.phy.cam.ac.uk/hmw26/ Work conducted in part at the University of Pennsylvania, supported by NSF ITR grant EIA-0205456 and DARPA contract NBCHD030010.