Latent Dirichlet Allocation Introduction/Overview


Latent Dirichlet Allocation Introduction/Overview. David Meyer, 03.10.2016. http://www.1-4-5.net/~dmm/ml/lda_intro.pdf

Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models Latent Dirichlet Allocation Probabilistic Graphical Models The Effect of the Dirichlet parameter α Dynamic LDA Q&A

Topic Modeling: methods for automatically organizing, understanding, searching, and summarizing categorical data. Goals: uncover hidden topical patterns; annotate documents according to topics; organize, summarize, search.

A Bit of Information Retrieval (IR) Terminology
Corpus: a large and structured set of texts
Stop words: words which are filtered out before or after processing of natural language data (text)
Unstructured text: information that either does not have a predefined data model or is not organized in a pre-defined manner
Tokenizing: the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens (see also lexical analysis)
Natural language processing: field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages
Term document (or document term) matrix: a mathematical matrix that describes the frequency of terms that occur in a collection of documents
Supervised learning: the machine learning task of inferring a function from labeled training data
Unsupervised learning: finding hidden structure in unlabeled data
Stemming: the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form
Graphic courtesy Wikipedia
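As a hedged illustration of a few of these steps (tokenizing, stop-word removal, and building a term-document matrix), here is a minimal sketch; the use of scikit-learn and the toy documents are assumptions of the sketch, not anything from the slides:

```python
# Minimal sketch: tokenize, drop English stop words, and build a
# document-term matrix for a toy corpus (library choice is an assumption).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Topic models uncover hidden thematic structure in documents",
    "LDA is a generative probabilistic topic model",
]

vectorizer = CountVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(docs)          # sparse documents-by-terms count matrix
print(vectorizer.get_feature_names_out())   # the vocabulary (terms)
print(X.toarray())                          # term frequencies per document
```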

IR Workflow Graphic courtesy http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining

Aside: We Know Machines Are Getting Smarter, But Where Does Knowledge Come From? Evolution, Experience, Culture, Machines (many orders of magnitude faster and larger)

Ok, But How Can Machines Discover New Knowledge? These correspond to the 5 major schools of thought in machine learning:
Symbolists: fill the gaps in existing knowledge. Technology: induction/inverse deduction
Connectionists: emulate the brain. Technology: deep neural nets
Evolutionaries: emulate evolution. Technology: genetic algorithms
Bayesians: systematically reduce uncertainty. Technology: Bayesian inference
Analogizers: notice similarities between old and new. Technology: kernel machines/Support Vector Machines

Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models Latent Dirichlet Allocation Probabilistic Graphical Models The Effect of the Dirichlet parameter α Dynamic LDA Q&A

Parametric vs. Non-parametric Models
Parametric models assume some finite set of parameters θ. This means that given the parameters θ, future predictions x are independent of the data D. That is: p(x | θ, D) = p(x | θ)
This implies that θ captures everything there is to know about the data
It also means that the complexity of the model is bounded even if the amount of data isn't
Taken together, these factors make parametric models inflexible
Information theoretic view: information is constrained to flow from the prior through the finite θ to the posterior
Remembering that the posterior is p(θ | x) = p(x | θ) p(θ) / p(x) = p(x, θ) / p(x) = p(x, θ) / ∫ p(x, θ) dθ
http://www.1-4-5.net/~dmm/ml/ps.pdf
Latent Dirichlet Allocation is a parametric model (why?)

Non-Parametric Models
Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters
The parameters can often be described using an infinite dimensional θ; think of θ as a function
The amount of information that θ can capture about the data D can grow as the amount of data grows
This makes non-parametric models more flexible: better predictive performance, more realistic
Most successful ML models are non-parametric. Examples include kernel methods (SVMs, GPs), DNNs, k-NNs, ...

Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models Latent Dirichlet Allocation Probabilistic Graphical Models The Effect of the Dirichlet parameter α Dynamic LDA Q&A

Latent Dirichlet Allocation (LDA)
Generative probabilistic model: parametric, Bayesian, a Probabilistic Graphical Model
Treats data as observations
Contains hidden variables; the hidden variables reflect the thematic structure of the collection
Wealth of material here: https://www.cs.princeton.edu/~blei/topicmodeling.html
Approach: infer the hidden structure using posterior inference (discovering the topics in the collection using Bayesian inference), then place new data into the estimated model (situating new documents into the estimated topic structure)
Other Approaches: Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (pLSI)

LDA Basic Intuition http://ai.stanford.edu/~ang/papers/nips01-lda.pdf
Obvious Questions: What exactly is a topic? Where do topics come from? How many topics per document?
Simple Intuition: documents exhibit multiple topics (contrast with mixture models, where each document exhibits exactly one topic)
Graphic courtesy David Blei https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

LDA Generative Model
Hallucinate that the observed data were generated this way:
Assume topics exist outside of the document collection; each topic is a distribution over a fixed vocabulary
Each document is a random mixture of corpus-wide topics, and each word is drawn from one of those topics
Graphic courtesy David Blei https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

BTW, What Do We Actually Observe? Only the documents (the words). So our goal here is to infer the hidden (latent) variables, i.e., compute their distribution conditioned on the documents: p(topics, proportions, assignments | documents)
Graphic courtesy David Blei https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

Let K be the number of topics (distributions over words)
Draw each topic β_k, k = 1..K # β_1:K ~ Dir(η)
For each document d in corpus C:
  Draw the topic proportion distribution θ_d for document d # θ_d ~ Dir(α)
  For each n = 1..N_d: # N_d = # words in d
    Draw topic index z_d,n for the n-th word of d from θ_d # 1 ≤ z_d,n ≤ K
    Draw word w_d,n from topic β[z_d,n] # z_d,n indexes β_1:K
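A minimal numpy sketch of this generative process follows; the number of topics, vocabulary size, corpus size, and hyperparameters are illustrative assumptions, not values from the slides:

```python
# Sketch of the LDA generative process described above.
# K, V, D, N_d, eta, and alpha are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N_d = 3, 10, 5, 20        # topics, vocab size, documents, words per doc
eta, alpha = 0.1, 0.5              # Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)     # beta_{1:K}: K topics over V words

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d ~ Dir(alpha)
    doc = []
    for n in range(N_d):
        z_dn = rng.choice(K, p=theta_d)           # topic index z_{d,n} drawn from theta_d
        w_dn = rng.choice(V, p=beta[z_dn])        # word w_{d,n} drawn from topic beta[z_{d,n}]
        doc.append(w_dn)
    corpus.append(doc)
```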

Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models Latent Dirichlet Allocation Probabilistic Graphical Models The Effect of the Dirichlet parameter α Dynamic LDA Q&A

Probabilistic Graphical Models Graphic courtesy https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

Probabilistic Graphical Models, Cont Graphic courtesy https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

LDA Graphical Model (based on the generative model) Graphic courtesy https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

LDA Graphical Model Details

What is the LDA Joint Distribution?
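The slide presents this graphically; written out from the generative process above, the standard LDA joint distribution is:

```latex
p(\beta_{1:K}, \theta_{1:D}, z, w) =
\prod_{k=1}^{K} p(\beta_k \mid \eta)
\prod_{d=1}^{D} \left( p(\theta_d \mid \alpha)
\prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
```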

Why does w_d,n depend on z_d,n and β?
K: the number of topics
V: the number of words in the vocabulary
TM: the V x K topic matrix; its columns are the β_k's, and each column sums to 1
z_d,n: index of the topic which w_d,n comes from
w_d,n: the n-th word in the d-th document
TM[w_d,n, z_d,n] = β_k[w_d,n] (with k = z_d,n): the probability of w_d,n

Now, What Exactly is the Dirichlet Distribution (and why are we using it?)
The Dirichlet is a dice factory: the multivariate equivalent of the Beta distribution (the coin factory)
The parameters α determine the form of the prior
The Dirichlet is defined over the (k-1)-simplex: its k non-negative arguments sum to one
The Dirichlet is the conjugate prior to the multinomial distribution
If the likelihood has conjugate prior P then the posterior has the same form as P; if we have a conjugate prior we know the (closed) form of the posterior, so in this case the posterior is also a Dirichlet
The parameter α controls the mean shape and sparsity of θ
In LDA the topics are a V-dimensional Dirichlet and the topic proportions are a K-dimensional Dirichlet
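For reference, the density behind this "dice factory" intuition is the standard Dirichlet density (not shown on the slide):

```latex
p(\theta \mid \alpha) =
\frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k)}
\prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad \theta_k \ge 0,\;\; \sum_{k=1}^{K}\theta_k = 1
```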

Simplex? (space of non-negative vectors which sum to one)

Aside: Conjugate Priors
Conjugate Prior: if the likelihood has conjugate prior P then the posterior has the same form as P
Notes/Issues:
The conjugate prior might not correctly reflect our uncertainty about θ
With non-conjugate priors the posterior is typically not available in analytic (closed) form
It is difficult to quantify our uncertainty about θ in the form of a particular distribution
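A concrete instance of conjugacy, the one LDA exploits, is the Dirichlet-multinomial update (a standard result, stated here for reference):

```latex
\theta \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K),\quad
n_k = \text{number of times outcome } k \text{ is observed}
\;\;\Longrightarrow\;\;
\theta \mid \text{data} \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K)
```

This is why, once the topic assignments are known, the posterior over topic proportions remains a Dirichlet.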

Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models Latent Dirichlet Allocation Probabilistic Graphical Models The Effect of the Dirichlet parameter α Dynamic LDA Q&A

The Effect of the Dirichlet parameter α Generalizing the Idea of Co-Occurrence

Graphic courtesy https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf α = 1

α = 0.1

α = 0.001 Essentially a mixture model (one topic/document)

α = 10

α = 100 Larger α spreads out the mass
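A quick way to see this effect for yourself is the sketch below; the α values mirror the slides, while the number of topics and sample count are assumptions:

```python
# Sketch: draw topic proportions theta ~ Dir(alpha * 1_K) for several alpha
# values and compare how concentrated the draws are.
import numpy as np

rng = np.random.default_rng(0)
K = 10                                    # number of topics (assumed)
for alpha in (0.001, 0.1, 1.0, 10.0, 100.0):
    theta = rng.dirichlet(np.full(K, alpha), size=1000)
    # With small alpha most mass sits on one topic (close to a mixture model);
    # with large alpha the mass spreads out toward uniform proportions.
    avg_max = theta.max(axis=1).mean()
    print(f"alpha={alpha:>7}: average largest topic proportion = {avg_max:.3f}")
```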

Ok, But We Need the Posterior
The posterior p(topics, proportions, assignments | documents) cannot be computed exactly (the normalizing constant is intractable), so we need to use approximate inference
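Spelling this out (the formula is standard; the slide only states the conclusion): the posterior is the joint divided by the evidence, and it is the evidence that is intractable:

```latex
p(\beta, \theta, z \mid w) = \frac{p(\beta, \theta, z, w)}{p(w)},
\qquad
p(w) = \int\!\!\int \sum_{z} p(\beta, \theta, z, w) \, d\theta \, d\beta
```

The sum runs over every possible assignment of topics to words, which grows exponentially with the size of the corpus, so p(w) cannot be computed exactly; hence variational inference or Gibbs sampling.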

Approximate Inference

Variational Inference

Jensen's Inequality
Jensen's inequality generalizes the observation that the secant line of a convex function lies above the graph of the function
In probability theory, Jensen's inequality is generally stated in the following form: if X is a random variable and φ is a convex function, then φ(E[X]) ≤ E[φ(X)]
(Equivalently, for a concave function such as log, the inequality reverses: φ(E[X]) ≥ E[φ(X)])

Jensen's Inequality, Bounds, and KL Divergence
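The slide's graphic is not reproduced here; the standard way Jensen's inequality produces a lower bound (the ELBO) in variational inference is sketched below, where q(z) denotes a variational distribution over the hidden variables (q is an assumption of this sketch, not something defined on the slide):

```latex
\log p(w) = \log \int q(z)\, \frac{p(w, z)}{q(z)} \, dz
\;\ge\; \int q(z) \log \frac{p(w, z)}{q(z)} \, dz
= \mathbb{E}_q[\log p(w, z)] - \mathbb{E}_q[\log q(z)]
\;=\; \mathrm{ELBO}(q)
```

The inequality is Jensen's, applied to the concave log function.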

KL-Divergence
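For reference (again, the slide's graphic is not reproduced): the KL divergence between the variational distribution q and the true posterior, and how it relates to the bound above:

```latex
\mathrm{KL}\big(q(z) \,\|\, p(z \mid w)\big) =
\mathbb{E}_q\!\left[\log \frac{q(z)}{p(z \mid w)}\right] \ge 0,
\qquad
\log p(w) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid w)\big)
```

Since log p(w) is fixed, maximizing the ELBO over q is the same as minimizing the KL divergence to the true posterior.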

Why Does LDA Work?

Why Does LDA Work?

LDA Summary

Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models Latent Dirichlet Allocation Probabilistic Graphical Models The Effect of the Dirichlet parameter α Dynamic LDA Q&A

Beyond LDA: Dynamic Topic Models Graphic courtesy https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf

Q&A Thanks!