Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA). D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003. Following slides borrowed and then heavily modified from: Jonathan Huang (jch1@cs.cmu.edu)

Bag of Words Models. Assume that all the words within a document are exchangeable: the order of the words doesn't matter, just the counts.

Mixture of Unigrams = Naïve Bayes (NB). Model: for each document, choose a topic z_d with p(topic_i) = θ_i; then choose each of the N words w_n independently from a multinomial conditioned on z_d, with p(w_n = word_j | topic = z_d) = β_{z,j}. Multinomial: take a (non-uniform) die with a word on each face; roll the die N times and count how often each word comes up. In NB, we have exactly one topic per document. [Slide figure: the graphical model and its equivalent plate representation, with topic node z above word nodes w_1...w_4, and a plate over the N words.]
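Written out (this equation is not on the slide, but follows directly from the description above, as in Blei et al. 2003), the mixture-of-unigrams model generates every word of a document from a single topic z:

p(w_1, ..., w_N) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)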

LDA: Each doc is a mixture of topics. LDA lets each document be a (different) mixture of topics, whereas Naïve Bayes assumes each document is on a single topic; LDA also lets each word be on a different topic. For each document: choose a multinomial distribution θ_d over topics for that document; then, for each of the N words w_n in the document, choose a topic z_n with p(topic) = θ_d, and choose a word w_n from a multinomial conditioned on z_n, with p(w = w_j | topic = z_n). Note: each topic has a different probability of generating each word.

Dirichlet Distributions. In the LDA model, we want the topic mixture proportions for each document to be drawn from some distribution (a probability distribution, so it sums to one). That is, we want to put a prior distribution on multinomials, i.e., on k-tuples of non-negative numbers that sum to one: we want probabilities of probabilities. These multinomials lie in a (k-1)-simplex (a simplex is the generalization of a triangle to k-1 dimensions). Our prior therefore needs to be defined over a (k-1)-simplex and should be conjugate to the multinomial.

Three Dirichlet examples (over 3 topics). [Slide figure: three density plots on the 2-simplex.] Corners: only one topic. Center: uniform mixture of topics. Colors indicate the probability of seeing each topic distribution.

Dirichlet Distribution. The Dirichlet distribution is defined over a (k-1)-simplex, i.e., it takes k non-negative arguments that sum to one. It is the conjugate prior to the multinomial distribution: if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet. The Dirichlet parameter α_i can be thought of as the prior count of the i-th class. For LDA, we often use a symmetric Dirichlet where all the α_i are equal; α is then a concentration parameter.
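Concretely (a standard identity, not shown on the slide): if θ ~ Dirichlet(α_1, ..., α_k) and we observe counts n_1, ..., n_k drawn from a multinomial with parameter θ, then

\theta \mid n \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_k + n_k),

which is exactly why the α_i behave like prior counts.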

Effect of α. When α < 1.0, the majority of the probability mass is in the "corners" of the simplex, generating mostly documents that contain a small number of topics. When α > 1.0, most documents contain most of the topics.
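A quick way to see this effect is to draw samples from a symmetric Dirichlet at different concentrations. This is a minimal NumPy sketch; the particular α values (0.1 and 10) are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # number of topics

# alpha < 1: samples concentrate near the corners of the simplex,
# so each draw puts most of its mass on one or two topics.
sparse = rng.dirichlet([0.1] * k, size=5)

# alpha > 1: samples concentrate near the center,
# so each draw spreads its mass across all topics.
dense = rng.dirichlet([10.0] * k, size=5)

print("alpha = 0.1:\n", np.round(sparse, 2))
print("alpha = 10:\n", np.round(dense, 2))
```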

The LDA Model. [Slide figure: the unrolled graphical model, with α at the top, one θ per document, per-word topic nodes z_1...z_4 above word nodes w_1...w_4 for each of three documents, and a shared β.] For each document: choose the topic distribution θ ~ Dirichlet(α); then, for each of the N words w_n, choose a topic z ~ Multinomial(θ) and choose a word w_n ~ Multinomial(β_z), where each topic z has a different parameter vector β_z over the words.

The LDA Model: Plate representation. For each of the M documents: choose the topic distribution θ ~ Dirichlet(α); then, for each of the N words w_n, choose a topic z ~ Multinomial(θ) and choose a word w_n ~ Multinomial(β_z).
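The generative story above can be written as a short sketch. This is a minimal illustration under the assumption of a fixed topic-word matrix beta; the function name and arguments are mine, not from the slides:

```python
import numpy as np

def generate_corpus(M, N, alpha, beta, seed=0):
    """Generate M documents of N words each from the LDA generative story.

    alpha: length-k Dirichlet parameter over topics.
    beta:  k x V array; row z is topic z's distribution over the V words.
    Returns a list of documents, each a list of word indices.
    """
    rng = np.random.default_rng(seed)
    k, V = beta.shape
    corpus = []
    for _ in range(M):
        theta = rng.dirichlet(alpha)              # per-document topic mixture
        doc = []
        for _ in range(N):
            z = rng.choice(k, p=theta)            # per-word topic
            doc.append(rng.choice(V, p=beta[z]))  # word drawn from that topic
        corpus.append(doc)
    return corpus
```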

Parameter Estimation. Given a corpus of documents, find the parameters α and β which maximize the likelihood of the observed data (the words in the documents), marginalizing over the hidden variables θ (the topic distribution for each document) and z (the topic for each word). E-step: compute p(θ, z | w, α, β), the posterior of the hidden variables (θ, z) given a document w and the hyper-parameters α and β. M-step: estimate the parameters α and β given the current hidden-variable distribution estimates. Unfortunately, the E-step cannot be solved in closed form, so people use a variational approximation.
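The quantity that makes the E-step intractable is the marginal likelihood of a document (following Blei et al. 2003):

p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta

The coupling between θ and β inside the sum-product makes this integral, and hence the exact posterior p(θ, z | w, α, β), intractable.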

Variational Inference. In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions: q approximates p.
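For LDA, the simplified (fully factorized) family used by Blei et al. 2003 is

q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n),

where γ is a Dirichlet parameter and each φ_n is a multinomial over topics; γ and φ are chosen to minimize KL(q || p(θ, z | w, α, β)).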

Parameter Estimation: Variational EM. Given a corpus of documents, find the parameters α and β which maximize the likelihood of the observed data. E-step: estimate the variational parameters γ and φ in q(θ, z | γ, φ) by minimizing the KL divergence to p (with α and β fixed). M-step: maximize (over α and β) the lower bound on the log likelihood obtained by using q in place of p (with γ and φ fixed).
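In practice you rarely implement variational EM by hand. Here is a minimal sketch using scikit-learn, whose LatentDirichletAllocation estimator fits LDA by batch or online variational Bayes; the toy corpus and the prior values are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football game",
    "the president gave a speech on tv",
    "baseball season starts with a big win",
]

# Bag-of-words counts (word order is ignored, as the slides emphasize).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# doc_topic_prior plays the role of alpha; topic_word_prior is the prior on the beta rows.
lda = LatentDirichletAllocation(
    n_components=2, doc_topic_prior=0.5, topic_word_prior=0.01,
    learning_method="batch", random_state=0,
)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures (normalized gamma)
print(doc_topics)
print(lda.components_.shape)        # (n_topics, vocabulary size), i.e. unnormalized beta
```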

LDA topics can be used for semi-supervised learning.

LDA requires fewer topics than NB. Perplexity = 2^{H(p)} per word, i.e., log2(perplexity) = entropy.
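On held-out documents, per-word perplexity is usually computed as (standard definition, as in Blei et al. 2003):

\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)

which equals 2^{H} when the entropy H is measured in bits; lower perplexity indicates a better model.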

There are many LDA extensions, for example the author-topic model. [Slide figure: the author-topic graphical model, where z = topic and x = author.]

Ailment Topic Aspect Model (Paul & Dredze). [Slide figure: the ATAM graphical model. Observed: word w and aspect y = symptom, treatment, or other. Hidden: topic-type indicators (background l, non-ailment x) and the topic distribution.]

What you should know about LDA. Each document is a mixture over topics. Each topic looks like a Naïve Bayes model: it produces words with some probability. Estimation of LDA is messy and requires variational EM or Gibbs sampling. In a plate model, each plate represents repeated nodes in a network; the model shows conditional independence, but not the form of the statistical distributions (e.g., Gaussian, Poisson, Dirichlet, ...).

LDA generation - example. Topics = {sports, politics}; Words = {football, baseball, TV, win, president}; α = (0.8, 0.2).

β =            sports   politics
  football     0.30     0.01
  baseball     0.25     0.01
  TV           0.10     0.15
  win          0.30     0.25
  president    0.01     0.20
  OOV          0.04     0.38

LDA generation - example (continued). For each document d: pick a topic distribution θ_d using α; then, for each word in the document, pick a topic z from θ_d and, given that topic, pick a word using β (with α and β as in the table above).
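Plugging these numbers into the generative process gives a small worked example. This is a minimal illustration: the topic and word labels match the table above, while the document length, seed, and variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(42)
topics = ["sports", "politics"]
words = ["football", "baseball", "TV", "win", "president", "OOV"]

alpha = np.array([0.8, 0.2])
beta = np.array([
    [0.30, 0.25, 0.10, 0.30, 0.01, 0.04],   # sports
    [0.01, 0.01, 0.15, 0.25, 0.20, 0.38],   # politics
])

theta = rng.dirichlet(alpha)        # this document's topic mixture
doc = []
for _ in range(8):                  # generate an 8-word document
    z = rng.choice(2, p=theta)      # pick a topic for this word
    w = rng.choice(6, p=beta[z])    # pick a word from that topic
    doc.append(words[w])
print(np.round(theta, 2), doc)
```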