CS Lecture 18. Topic Models and LDA


CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei)

Generative vs. Discriminative Models
Recall that, in Bayesian networks, there can be many different but equivalent models of the same joint distribution.
[Figure: two graphical models over X and Y; in the discriminative model the edge is X → Y, in the generative model the edge is Y → X.]
Although these two models are equivalent (in the sense that they imply the same independence relations), they can differ significantly when it comes to inference/prediction.

Generative vs. Discriminative Models
Generative models: we can think of the observations as being generated by the latent variables. Start sampling at the top of the network and work downwards. Examples?

Generative vs. Discriminative Models
Generative models: we can think of the observations as being generated by the latent variables. Start sampling at the top of the network and work downwards. Examples: HMMs, naïve Bayes, LDA.

Generative vs. Discriminative Models
Discriminative models: most useful for discriminating the values of the latent variables. Almost always used for supervised learning. Examples?

Generative vs. Discriminative Models
Discriminative models: most useful for discriminating the values of the latent variables. Almost always used for supervised learning. Examples: CRFs.

Generative vs. Discriminative Models
Suppose we are only interested in the prediction task, i.e., estimating p(Y | X).
Discriminative model: p(X, Y) = p(X) p(Y | X)
Generative model: p(X, Y) = p(Y) p(X | Y)
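To make this concrete, here is a minimal numerical sketch (the 2×2 joint table is made up for illustration) showing that the two factorizations describe the same joint distribution:

```python
# The same 2x2 joint distribution over X and Y can be factored
# "discriminatively" as p(X) p(Y|X) or "generatively" as p(Y) p(X|Y);
# both factorizations recover the identical joint table.
import numpy as np

# Joint p(X, Y): rows index X in {0, 1}, columns index Y in {0, 1}
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)              # marginal p(X)
p_y = joint.sum(axis=0)              # marginal p(Y)
p_y_given_x = joint / p_x[:, None]   # conditional p(Y | X)
p_x_given_y = joint / p_y[None, :]   # conditional p(X | Y)

discriminative = p_x[:, None] * p_y_given_x   # p(X) p(Y | X)
generative     = p_y[None, :] * p_x_given_y   # p(Y) p(X | Y)

assert np.allclose(discriminative, joint)
assert np.allclose(generative, joint)
```

The difference is therefore not in which joint distributions can be represented, but in which conditional the model parameterizes and fits directly.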

Generative Models
The primary advantage of generative models is that they provide a model of the data-generating process: we could generate new data samples using the model.
Topic models are generative models of documents: methods for discovering themes (topics) in a collection (e.g., books, newspapers, etc.). They annotate the collection according to the discovered themes, and the annotations can be used to organize, search, summarize, etc.

Topic Models

Models of Text Documents
Bag-of-words model: assume that the ordering of the words in a document does not matter. This is typically false, since certain phrases can only appear together.
Unigram model: all words in a document are drawn independently from a single categorical distribution over the vocabulary.
Mixture of unigrams model: for each document, first choose a topic z and then generate the words of the document from the conditional distribution p(w | z). Topics are just probability distributions over words.
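To illustrate the mixture of unigrams model, here is a small generative sketch (the toy vocabulary, topics, and probabilities are invented for illustration): one topic is drawn per document, and every word in the document is then drawn i.i.d. from that topic's word distribution.

```python
# Mixture-of-unigrams sketch: pick one topic per document, then draw every
# word in the document i.i.d. from that topic's word distribution.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["goal", "team", "election", "vote", "the", "a"]          # toy vocabulary
topic_probs = np.array([0.5, 0.5])                                # p(z): sports, politics
word_probs = np.array([[0.35, 0.35, 0.0, 0.0, 0.2, 0.1],          # p(w | z = sports)
                       [0.0, 0.0, 0.35, 0.35, 0.2, 0.1]])         # p(w | z = politics)

def sample_document(num_words):
    z = rng.choice(len(topic_probs), p=topic_probs)               # one topic per document
    words = rng.choice(vocab, size=num_words, p=word_probs[z])    # i.i.d. words given topic
    return z, list(words)

print(sample_document(8))
```

Note that in this model a whole document is forced to be about a single topic; LDA relaxes exactly this restriction.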

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA)
α and η are parameters of the prior distributions over θ and β.
θ_d is the distribution over topics for document d (a real vector of length K).
β_k is the distribution over words for topic k (a real vector of length V).
z_d,n is the topic of the nth word in the dth document.
w_d,n is the nth word of the dth document.

Latent Dirichlet Allocation (LDA)
Plate notation: there are N · D variables representing the observed words in the documents (N words in each of the D documents). There are K total topics (assumed to be known in advance) and D total documents.

Latent Dirichlet Allocation (LDA)
The only observed variables are the words in the documents. The topic for each word, the distribution over topics for each document, and the distribution of words per topic are all latent variables in this model.

Latent Dirichlet Allocation (LDA)
The model contains both continuous and discrete random variables.
θ_d and β_k are vectors of probabilities.
z_d,n is an integer in {1, ..., K} that indicates the topic of the nth word in the dth document.
w_d,n is an integer in {1, ..., V} that indexes over all possible words in the vocabulary.

Latent Dirichlet Allocation (LDA)
θ_d ~ Dir(α), where Dir(α) is the Dirichlet distribution with parameter vector α > 0.
β_k ~ Dir(η), with parameter vector η > 0.
The Dirichlet distribution over (x_1, ..., x_K), where each x_i ≥ 0 and the x_i sum to 1, has density

f(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) \propto \prod_i x_i^{\alpha_i - 1}

The Dirichlet distribution is a distribution over probability distributions on K elements.
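As a quick illustration of the Dirichlet distribution (parameter values here are arbitrary), each sample is itself a probability vector, and the magnitude of the parameters controls how concentrated the samples are:

```python
# Each Dirichlet sample is a length-K probability vector (non-negative, sums to 1).
import numpy as np

rng = np.random.default_rng(0)
K = 3

sparse = rng.dirichlet(alpha=[0.1] * K, size=5)    # alpha_i < 1: samples near simplex corners
uniform = rng.dirichlet(alpha=[10.0] * K, size=5)  # alpha_i >> 1: samples near (1/K, ..., 1/K)

print(sparse.round(3))
print(uniform.round(3))
print(sparse.sum(axis=1))   # each row sums to 1
```

With α_i < 1 the sampled distributions put most of their mass on a few elements, which is why small α and η encourage sparse document-topic and topic-word distributions.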

Latent Dirichlet Allocation (LDA)
The discrete random variables are distributed according to

p(z_{d,n} = k \mid \theta_d) = \theta_{d,k}

p(w_{d,n} = v \mid z_{d,n}, \beta_1, \ldots, \beta_K) = \beta_{z_{d,n}, v}

Here, θ_d,k is the kth element of the vector θ_d, which corresponds to the proportion of document d devoted to topic k. The joint distribution is then

p(w, z, \theta, \beta \mid \alpha, \eta) = \prod_k p(\beta_k \mid \eta) \prod_d \left[ p(\theta_d \mid \alpha) \prod_n p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta) \right]
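Putting the pieces together, the following sketch samples a toy corpus from the LDA generative process exactly in the order the model defines it (all sizes and hyperparameter values are arbitrary choices for illustration):

```python
# LDA generative process: draw topics beta_k ~ Dir(eta); then for each document
# draw theta_d ~ Dir(alpha) and, for each word, a topic z ~ Cat(theta_d) and a
# word w ~ Cat(beta_z).
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 20, 5, 10        # topics, vocabulary size, documents, words per document
alpha = np.full(K, 0.5)          # prior over per-document topic proportions
eta = np.full(V, 0.1)            # prior over per-topic word distributions

beta = rng.dirichlet(eta, size=K)            # beta[k] is the word distribution of topic k

documents = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)                           # topic proportions for document d
    z = rng.choice(K, size=N, p=theta_d)                     # topic assignment of each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])      # the observed words
    documents.append(w)

print(documents[0])   # word indices of the first generated document
```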

Latent Dirichlet Allocation (LDA)
LDA is a generative model: we can think of the words as being generated by a probabilistic process defined by the model. How reasonable is the generative model?

Latent Dirichlet Allocation (LDA)
Inference in this model is NP-hard. Given the D documents, we want to find the parameters that maximize the joint probability. We can use an EM-based approach called variational EM.

Variational EM
Recall that the EM algorithm constructed a lower bound using Jensen's inequality (the outer sum runs over the K data cases, each with observed part x_obs^(k) and missing part x_mis^(k)):

l(\theta) = \sum_{k=1}^{K} \log \sum_{x_{\text{mis}}^{(k)}} p\big(x_{\text{obs}}^{(k)}, x_{\text{mis}}^{(k)} \mid \theta\big)
          = \sum_{k=1}^{K} \log \sum_{x_{\text{mis}}^{(k)}} q_k\big(x_{\text{mis}}^{(k)}\big) \frac{p\big(x_{\text{obs}}^{(k)}, x_{\text{mis}}^{(k)} \mid \theta\big)}{q_k\big(x_{\text{mis}}^{(k)}\big)}
          \ge \sum_{k=1}^{K} \sum_{x_{\text{mis}}^{(k)}} q_k\big(x_{\text{mis}}^{(k)}\big) \log \frac{p\big(x_{\text{obs}}^{(k)}, x_{\text{mis}}^{(k)} \mid \theta\big)}{q_k\big(x_{\text{mis}}^{(k)}\big)}
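A small numerical check of this bound for a single data case with a binary missing variable (the joint probabilities are made up): the lower bound equals the log-likelihood exactly when q_k is the posterior p(x_mis | x_obs, θ), and is smaller for any other choice of q.

```python
# Numerical check of the Jensen lower bound for one data case with a binary
# latent variable: the bound is tight when q equals the posterior.
import numpy as np

p_joint = np.array([0.3, 0.1])         # p(x_obs, x_mis = 0), p(x_obs, x_mis = 1)
log_lik = np.log(p_joint.sum())        # log p(x_obs | theta)

def bound(q):
    return np.sum(q * (np.log(p_joint) - np.log(q)))

posterior = p_joint / p_joint.sum()
print(log_lik, bound(posterior))               # equal: the bound is tight at the posterior
print(bound(np.array([0.5, 0.5])) <= log_lik)  # True for any other q
```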

Variational EM
Performing the optimization over q is equivalent to computing p(x_{\text{mis}}^{(k)} \mid x_{\text{obs}}^{(k)}, \theta). This can be intractable in practice.
Instead, restrict q to lie in some restricted set of distributions Q. For example, we could make a mean-field assumption:

q_k\big(x_{\text{mis}}^{(k)}\big) = \prod_{i \in \text{mis}_k} q_i(x_i)

The resulting algorithm only yields an approximation to the log-likelihood.

EM for Topic Models

p(w \mid \alpha, \eta) = \int \int \prod_k p(\beta_k \mid \eta) \prod_d \left[ p(\theta_d \mid \alpha) \sum_{z_d} \prod_n p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta) \right] d\theta\, d\beta

To apply variational EM, we write

\log p(w \mid \alpha, \eta) = \log \int \int \sum_z p(w, z, \theta, \beta \mid \alpha, \eta)\, d\theta\, d\beta
\ge \int \int \sum_z q(z, \theta, \beta) \log \frac{p(w, z, \theta, \beta \mid \alpha, \eta)}{q(z, \theta, \beta)}\, d\theta\, d\beta

where we restrict the distribution q to be of the fully factorized form

q(z, \theta, \beta) = \prod_k q(\beta_k) \prod_d \left[ q(\theta_d) \prod_n q(z_{d,n}) \right]
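In practice this variational EM procedure is what off-the-shelf LDA implementations run. As a hedged sketch of how that looks with scikit-learn's LatentDirichletAllocation, which fits LDA by variational inference (the tiny corpus and hyperparameter values below are invented for illustration):

```python
# Fitting LDA with scikit-learn's variational inference implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team scored a late goal in the match",
    "voters cast their ballots in the national election",
    "the coach praised the team after the game",
    "the candidate gave a speech before the vote",
]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)               # document-word count matrix (bag of words)

lda = LatentDirichletAllocation(
    n_components=2,          # K, the number of topics
    doc_topic_prior=0.5,     # alpha
    topic_word_prior=0.1,    # eta
    random_state=0,
)
doc_topics = lda.fit_transform(X)        # approximate posterior topic mixtures for each document

vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):   # (unnormalized) word weights per topic
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", [vocab[i] for i in top])
```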

Example of LDA

Extensions of LDA: Author-Topic Model
a_d is the group of authors of the dth document.
x_d,n is the author of the nth word of the dth document.
θ_a is the topic distribution for author a.
z_d,n is the topic of the nth word of the dth document.
(See "The Author-Topic Model for Authors and Documents", Rosen-Zvi et al.)

Extensions of LDA
A label Y_d for each document represents a value to be predicted from the document, e.g., the number of stars for each document in a corpus of movie reviews.