Note 1: Variational Methods for Latent Dirichlet Allocation


Technical Note Series, Spring 2013
Note 1: Variational Methods for Latent Dirichlet Allocation (Version 1.0)
Wayne Xin Zhao, batmanfly@gmail.com

Disclaimer: The focus of this note is to reorganize the content of Blei's original paper and add more detailed derivations. For convenience, in some parts I have copied Blei's content verbatim. I hope it can help beginners understand the variational EM algorithm for LDA.

First of all, let us fix the notation for the parameters and variables in the model. Let $K$ be the number of topics, $D$ the number of documents and $V$ the number of terms in the vocabulary. We use $i$ to index a topic [1], $d$ to index a document [2], $n$ to index a word position [3], and $w$ or $v$ to denote a word. In LDA, $\alpha$ ($K \times 1$) and $\beta$ ($K \times V$) are model parameters, while $\theta$ ($D \times K$) and $z$ [4] are hidden variables. As the variational distribution $q$ we use a fully factorized model, in which every variable is independently governed by its own distribution,

\[
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n} q(z_n \mid \phi_n), \tag{1.1}
\]

where the Dirichlet parameter $\gamma$ ($D \times K$) and the multinomial parameters $\phi$ [5] are variational parameters. The topic assignments of words and the documents are exchangeable, i.e., conditionally independent given the parameters (either the model parameters or the variational parameters). Note that all the variational distributions $q$ here are conditional distributions and should be written as $q(\cdot \mid w)$; for simplicity we write them as $q(\cdot)$.

The main idea is to use variational expectation-maximization (EM). In the E-step of variational EM, we use the variational approximation to the posterior and find the optimal values of the variational parameters. In the M-step, we maximize the bound with respect to the model parameters. More compactly, we perform variational inference to learn the variational parameters in the E-step, and parameter estimation in the M-step. The two steps alternate in each iteration. We optimize the lower bound with respect to the variational parameters and the model parameters one group at a time, i.e., we perform the optimization with a coordinate ascent algorithm.

1.1 Variational objective function

1.1.1 Finding a lower bound for $\log p(w \mid \alpha, \beta)$

Jensen's inequality. Let $X$ be a random variable and $f$ a convex function. Then $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. If $f$ is a concave function, then $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$.

Footnotes:
[1] $i$ ranges over $1, \ldots, K$, i.e., $\sum_{i=1}^{K}$.
[2] $d$ ranges over $1, \ldots, D$, i.e., $\sum_{d=1}^{D}$.
[3] $n$ ranges over $1, \ldots, N_d$, i.e., $\sum_{n=1}^{N_d}$, where $N_d$ is the length of the current document.
[4] $z$ is represented as a three-dimensional array; each entry is indexed by a triplet $\langle d, n, i \rangle$, indicating whether the topic assignment of the $n$th word in the $d$th document is the $i$th topic.
[5] Corresponding to $z$, $\phi$ is represented as a three-dimensional array; each entry is indexed by a triplet $\langle d, n, i \rangle$, giving the probability that the $n$th word in the $d$th document is assigned to the $i$th topic.
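To make the setup concrete, here is a minimal sketch (my own illustration, not code from the note or from Blei's paper) of the factorized variational family in Eq. (1.1) for a single document; the toy sizes and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: K topics, V vocabulary terms, N_d words in one document.
K, V, N_d = 3, 10, 5

# Model parameters: alpha (Dirichlet prior, K) and beta (K x V, rows sum to 1).
alpha = np.full(K, 0.1)
beta = rng.dirichlet(np.ones(V), size=K)

# Variational parameters for this document: a Dirichlet parameter gamma (K,)
# and multinomial parameters phi (N_d x K), one distribution over topics per word.
gamma = alpha + N_d / K            # a common initialization (cf. Algorithm 1 below)
phi = np.full((N_d, K), 1.0 / K)

# Sampling from q(theta, z | gamma, phi): theta and the z_n are drawn independently.
theta = rng.dirichlet(gamma)
z = np.array([rng.choice(K, p=phi[n]) for n in range(N_d)])
print("theta ~ q(theta | gamma):", theta)
print("z ~ q(z | phi):", z)
```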

We use Jensen's inequality to bound the log probability of a document [6]:

\[
\begin{aligned}
\log p(w \mid \alpha, \beta)
&= \log \int \sum_{z} p(\theta, z, w \mid \alpha, \beta)\, d\theta \\
&= \log \int \sum_{z} \frac{p(\theta, z, w \mid \alpha, \beta)\, q(\theta, z)}{q(\theta, z)}\, d\theta \\
&\ge \int \sum_{z} q(\theta, z) \log p(\theta, z, w \mid \alpha, \beta)\, d\theta - \int \sum_{z} q(\theta, z) \log q(\theta, z)\, d\theta \\
&= \int \sum_{z} q(\theta, z) \log p(\theta \mid \alpha)\, d\theta + \int \sum_{z} q(\theta, z) \log p(z \mid \theta)\, d\theta + \int \sum_{z} q(\theta, z) \log p(w \mid z, \beta)\, d\theta - \int \sum_{z} q(\theta, z) \log q(\theta, z)\, d\theta \\
&= \mathbb{E}_q[\log p(\theta \mid \alpha)] + \mathbb{E}_q[\log p(z \mid \theta)] + \mathbb{E}_q[\log p(w \mid z, \beta)] + H(q).
\end{aligned}
\]

We introduce a function to denote the right-hand side of the last line,

\[
L(\gamma, \phi; \alpha, \beta) = \mathbb{E}_q[\log p(\theta \mid \alpha)] + \mathbb{E}_q[\log p(z \mid \theta)] + \mathbb{E}_q[\log p(w \mid z, \beta)] + H(q). \tag{1.2}
\]

One can easily verify that

\[
\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + D\big(q(\theta, z \mid \gamma, \phi)\, \|\, p(\theta, z \mid \alpha, \beta, w)\big). \tag{1.3}
\]

We have indeed found a lower bound for $\log p(w \mid \alpha, \beta)$, namely $L(\gamma, \phi; \alpha, \beta)$. This shows that maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to $\gamma$ and $\phi$ is equivalent to minimizing the KL divergence between the variational posterior and the true posterior, i.e., the optimization problem presented in Eq. (1.5) [7].

1.1.2 Expanding the lower bound

The lower bound defined above contains four terms to specify; we present detailed derivations for them next. For the first term, writing the Dirichlet in exponential-family form, we have

\[
\begin{aligned}
\mathbb{E}_q[\log p(\theta \mid \alpha)]
&= \mathbb{E}_q\Big[ \log \exp\Big\{ \sum_{i=1}^{K} (\alpha_i - 1) \log \theta_i + \log \Gamma\Big(\sum_i \alpha_i\Big) - \sum_i \log \Gamma(\alpha_i) \Big\} \Big] \\
&= \sum_{i=1}^{K} (\alpha_i - 1)\, \mathbb{E}_q[\log \theta_i] + \log \Gamma\Big(\sum_i \alpha_i\Big) - \sum_i \log \Gamma(\alpha_i) \\
&= \sum_{i=1}^{K} (\alpha_i - 1)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big) + \log \Gamma\Big(\sum_i \alpha_i\Big) - \sum_i \log \Gamma(\alpha_i),
\end{aligned}
\]

where the last step uses $\mathbb{E}_q[\log \theta_i] = \Psi(\gamma_i) - \Psi(\sum_j \gamma_j)$ (see the Appendix).

Footnotes:
[6] For now we drop the document index $d$, since all the hidden variables and variational parameters are document-specific. We will use the document index $d$ explicitly in the parameter estimation part, since the model parameters are shared by all the documents.
[7] When we learn the variational parameters, all the model parameters are fixed, so $\log p(w \mid \alpha, \beta)$ can be treated as a fixed value; it is the sum of the lower bound and the KL divergence.
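As a quick numerical companion (my own sketch, not part of the note; the function and variable names are made up), the closed form of this first term can be evaluated directly from $\alpha$ and $\gamma$ with the digamma and log-gamma functions:

```python
import numpy as np
from scipy.special import digamma, gammaln

def e_log_p_theta(alpha, gamma):
    """E_q[log p(theta | alpha)] when q(theta) = Dirichlet(gamma)."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum())   # E_q[log theta_i]
    return ((alpha - 1.0) * e_log_theta).sum() + gammaln(alpha.sum()) - gammaln(alpha).sum()

alpha = np.array([0.5, 0.5, 0.5])
gamma = np.array([2.0, 1.0, 4.0])   # arbitrary example values
print(e_log_p_theta(alpha, gamma))
```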

For the second term, let $z_n$ denote the topic assignment of the $n$th word in the current document; it is an indicator vector, with $z_{n,i} = 1$ when the topic assignment is the $i$th topic and $z_{n,i} = 0$ otherwise. Then

\[
\begin{aligned}
\mathbb{E}_q[\log p(z \mid \theta)]
&= \sum_{n} \mathbb{E}_q[\log p(z_n \mid \theta)] \\
&= \sum_{n} \sum_{i} \mathbb{E}_q[z_{n,i} \log \theta_i] \\
&= \sum_{n} \sum_{i} \mathbb{E}_q[z_{n,i}]\, \mathbb{E}_q[\log \theta_i] \\
&= \sum_{n} \sum_{i} \phi_{n,i} \Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big),
\end{aligned}
\]

where the expectation factorizes because $z_n$ and $\theta$ are independent under $q$. For the third term, we have

\[
\mathbb{E}_q[\log p(w \mid z, \beta)]
= \sum_{n} \mathbb{E}_q[\log p(w_n \mid z_n, \beta)]
= \sum_{n} \sum_{i} \mathbb{E}_q[z_{n,i}] \log \beta_{i, w_n}
= \sum_{n} \sum_{i} \phi_{n,i} \log \beta_{i, w_n}.
\]

For the entropy term, we have

\[
\begin{aligned}
H(q) &= -\int \sum_{z} q(\theta, z) \log q(\theta, z)\, d\theta
      = -\int q(\theta) \log q(\theta)\, d\theta - \sum_{n} \sum_{z_n} q(z_n) \log q(z_n) \\
     &= -\sum_{i=1}^{K} (\gamma_i - 1)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big) - \log \Gamma\Big(\sum_i \gamma_i\Big) + \sum_i \log \Gamma(\gamma_i)
        - \sum_{n} \sum_{i} \phi_{n,i} \log \phi_{n,i}.
\end{aligned}
\]
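The entropy term is likewise easy to evaluate from $\gamma$ and $\phi$. The following is a small sketch of mine (names assumed), combining the Dirichlet and multinomial entropies derived above:

```python
import numpy as np
from scipy.special import digamma, gammaln

def entropy_q(gamma, phi):
    """H(q) for q(theta, z) = Dirichlet(gamma) * prod_n Multinomial(phi_n)."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum())
    h_dirichlet = (-((gamma - 1.0) * e_log_theta).sum()
                   - gammaln(gamma.sum()) + gammaln(gamma).sum())
    h_words = -(phi * np.log(phi)).sum()   # assumes all phi entries are > 0
    return h_dirichlet + h_words

gamma = np.array([2.0, 1.0, 4.0])
phi = np.array([[0.2, 0.3, 0.5], [0.6, 0.2, 0.2]])   # 2 words, 3 topics
print(entropy_q(gamma, phi))
```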

Having these detailed derivations of the four terms, we obtain an expanded formulation of the lower bound:

\[
\begin{aligned}
L(\gamma, \phi; \alpha, \beta)
&= \sum_{i=1}^{K} (\alpha_i - 1)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big) + \log \Gamma\Big(\sum_i \alpha_i\Big) - \sum_i \log \Gamma(\alpha_i) \\
&\quad + \sum_{n} \sum_{i} \phi_{n,i} \Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big)
       + \sum_{n} \sum_{i} \phi_{n,i} \log \beta_{i, w_n} \\
&\quad - \sum_{i=1}^{K} (\gamma_i - 1)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big) - \log \Gamma\Big(\sum_i \gamma_i\Big) + \sum_i \log \Gamma(\gamma_i)
       - \sum_{n} \sum_{i} \phi_{n,i} \log \phi_{n,i}. \tag{1.4}
\end{aligned}
\]

1.2 Variational inference

The aim of variational inference is to learn the values of the variational parameters $\gamma$ and $\phi$. With the learnt variational parameters, we can evaluate the posterior probabilities of the hidden variables. Having specified a simplified family of probability distributions, the next step is to set up an optimization problem that determines the values of the variational parameters. We obtain a solution for these variational parameters by solving the following optimization problem:

\[
(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D\big(q(\theta, z \mid \gamma, \phi)\, \|\, p(\theta, z \mid \alpha, \beta, w)\big). \tag{1.5}
\]

By Eq. (1.3), we can minimize $D(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid \alpha, \beta, w))$ by maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$. Note that in this section the variational parameters are learnt for a single document, so we drop the document index $d$.
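For reference, here is a sketch of mine (under the same assumed names as the snippets above) that evaluates the expanded bound (1.4) for one document, given its word indices, the model parameters and the variational parameters:

```python
import numpy as np
from scipy.special import digamma, gammaln

def lower_bound(words, alpha, beta, gamma, phi):
    """Eq. (1.4) for one document.
    words: length-N_d array of vocabulary indices, beta: K x V, phi: N_d x K."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum())            # K-vector
    term_theta = ((alpha - 1.0) * e_log_theta).sum() + gammaln(alpha.sum()) - gammaln(alpha).sum()
    term_z = (phi * e_log_theta).sum()                             # E_q[log p(z | theta)]
    term_w = (phi * np.log(beta[:, words].T)).sum()                # E_q[log p(w | z, beta)]
    entropy = (-((gamma - 1.0) * e_log_theta).sum()
               - gammaln(gamma.sum()) + gammaln(gamma).sum()
               - (phi * np.log(phi)).sum())
    return term_theta + term_z + term_w + entropy
```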

1.2.1 Learning the variational parameters

We have expanded each term of the lower bound $L(\gamma, \phi; \alpha, \beta)$ defined in Eq. (1.2); the result is Eq. (1.4). We now maximize the bound with respect to the variational parameters $\gamma$ and $\phi$.

We first maximize Eq. (1.4) with respect to $\phi_{n,i}$, the probability that the $n$th word is generated by latent topic $i$. Observe that this is a constrained maximization, since $\sum_i \phi_{n,i} = 1$, so we incorporate Lagrange multipliers $\lambda_n$ and keep only the terms containing $\phi$:

\[
L_{[\phi]} = \sum_{n} \sum_{i} \phi_{n,i} \Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big)
           + \sum_{n} \sum_{i} \phi_{n,i} \log \beta_{i, w_n}
           - \sum_{n} \sum_{i} \phi_{n,i} \log \phi_{n,i}
           + \sum_{n} \lambda_n \Big( \sum_i \phi_{n,i} - 1 \Big). \tag{1.6}
\]

Taking the derivative of $L_{[\phi]}$ with respect to $\phi_{n,i}$,

\[
\frac{\partial L_{[\phi]}}{\partial \phi_{n,i}} = \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) + \log \beta_{i, w_n} - \log \phi_{n,i} - 1 + \lambda_n. \tag{1.7}
\]

Setting $\partial L_{[\phi]} / \partial \phi_{n,i}$ to zero, we get

\[
\phi_{n,i} = \beta_{i, w_n} \exp\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) + \lambda_n - 1 \Big),
\]

where we do not need to compute $\lambda_n$ explicitly, since it is the same for all $i$ and only normalizes $\phi_n$; the term $\Psi(\sum_j \gamma_j)$ is likewise constant in $i$. Thus we have

\[
\phi_{n,i} \propto \beta_{i, w_n} \exp\big( \Psi(\gamma_i) \big).
\]
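In code, this update is essentially one line per document (again a sketch of mine, using the same `words`, `beta`, `gamma` conventions assumed above):

```python
import numpy as np
from scipy.special import digamma

def update_phi(words, beta, gamma):
    """phi_{n,i} proportional to beta_{i, w_n} * exp(Psi(gamma_i)), normalized over i."""
    phi = beta[:, words].T * np.exp(digamma(gamma))     # N_d x K, unnormalized
    return phi / phi.sum(axis=1, keepdims=True)
```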

Next, we maximize Eq. (1.4) with respect to $\gamma_i$, the $i$th component of the posterior Dirichlet parameter. The terms containing $\gamma$ are

\[
L_{[\gamma]} = \sum_{i=1}^{K} (\alpha_i - 1)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big)
             + \sum_{n} \sum_{i} \phi_{n,i} \Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big)
             - \sum_{i=1}^{K} (\gamma_i - 1)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big)
             - \log \Gamma\Big(\sum_i \gamma_i\Big) + \sum_i \log \Gamma(\gamma_i).
\]

Collecting the coefficients of $\Psi(\gamma_i) - \Psi(\sum_j \gamma_j)$, this is

\[
L_{[\gamma]} = \sum_{i=1}^{K} \Big( \alpha_i + \sum_n \phi_{n,i} - \gamma_i \Big)\Big( \Psi(\gamma_i) - \Psi\Big(\sum_j \gamma_j\Big) \Big) - \log \Gamma\Big(\sum_i \gamma_i\Big) + \sum_i \log \Gamma(\gamma_i).
\]

Taking the derivative with respect to $\gamma_i$ (the term $-(\Psi(\gamma_i) - \Psi(\sum_j \gamma_j))$ from the product rule cancels against $\Psi(\gamma_i) - \Psi(\sum_j \gamma_j)$ from the log-Gamma terms),

\[
\frac{\partial L_{[\gamma]}}{\partial \gamma_i}
= \Psi'(\gamma_i)\Big( \alpha_i + \sum_n \phi_{n,i} - \gamma_i \Big)
- \Psi'\Big(\sum_j \gamma_j\Big) \sum_{j=1}^{K} \Big( \alpha_j + \sum_n \phi_{n,j} - \gamma_j \Big).
\]

Setting $\partial L_{[\gamma]} / \partial \gamma_i$ to zero for all $i$, we get

\[
\gamma_i = \alpha_i + \sum_n \phi_{n,i}. \tag{1.8}
\]
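Alternating the two updates gives the per-document coordinate ascent that Section 1.4 summarizes as Algorithm 1. Here is a compact sketch of mine (the convergence test, iteration limit, and all names are assumptions), combining the $\phi$ update derived from Eq. (1.7) with the $\gamma$ update of Eq. (1.8):

```python
import numpy as np
from scipy.special import digamma

def e_step_document(words, alpha, beta, max_iter=100, tol=1e-6):
    """Coordinate ascent on (phi, gamma) for a single document."""
    K = alpha.shape[0]
    N_d = len(words)
    phi = np.full((N_d, K), 1.0 / K)
    gamma = alpha + N_d / K
    for _ in range(max_iter):
        phi = beta[:, words].T * np.exp(digamma(gamma))   # phi update
        phi /= phi.sum(axis=1, keepdims=True)             # normalize each phi_n over topics
        gamma_new = alpha + phi.sum(axis=0)               # gamma update, Eq. (1.8)
        if np.abs(gamma_new - gamma).max() < tol:         # assumed convergence test
            gamma = gamma_new
            break
        gamma = gamma_new
    return gamma, phi
```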

1.3 Parameter estimation

In the previous section, we discussed how to estimate the variational parameters. Next, we estimate the model parameters, i.e., $\beta$ and $\alpha$. We solve this problem by using the variational lower bound as a surrogate for the intractable marginal log likelihood, with the variational parameters held fixed. Note that we first need to aggregate the document-specific lower bounds defined in Eq. (1.2), so in this part we use the document index $d$ explicitly.

We first rewrite the lower bound, keeping only the terms that contain $\beta$ and adding Lagrange multipliers $\rho_i$ for the constraints $\sum_v \beta_{i,v} = 1$:

\[
L_{[\beta]} = \sum_{d} \sum_{n} \sum_{i} \phi_{d,n,i} \log \beta_{i, w_{d,n}} + \sum_{i} \rho_i \Big( \sum_{v=1}^{V} \beta_{i,v} - 1 \Big). \tag{1.9}
\]

Taking the derivative of $L_{[\beta]}$, we have

\[
\frac{\partial L_{[\beta]}}{\partial \beta_{i,v}} = \sum_{d,n} \frac{\phi_{d,n,i}\, \mathbf{1}[v = w_{d,n}]}{\beta_{i,v}} + \rho_i, \tag{1.10}
\]

where $\mathbf{1}[v = w_{d,n}]$ is an indicator function that equals 1 when the condition is true and 0 otherwise. Setting $\partial L_{[\beta]} / \partial \beta_{i,v}$ to zero and using $\sum_v \beta_{i,v} = 1$ gives $\rho_i = -\sum_{d,n,v} \phi_{d,n,i}\, \mathbf{1}[v = w_{d,n}]$. Since $\rho_i$ only normalizes the distribution, we can ignore it and estimate an unnormalized value of $\beta_{i,v}$:

\[
\hat{\beta}_{i,v} \propto \sum_{d,n} \phi_{d,n,i}\, \mathbf{1}[v = w_{d,n}]. \tag{1.11}
\]

Similarly, we rewrite the lower bound keeping only the terms that contain $\alpha$:

\[
L_{[\alpha]} = \sum_{d=1}^{D} \Big[ \sum_{i=1}^{K} (\alpha_i - 1)\Big( \Psi(\gamma_{d,i}) - \Psi\Big(\sum_j \gamma_{d,j}\Big) \Big) + \log \Gamma\Big(\sum_i \alpha_i\Big) - \sum_i \log \Gamma(\alpha_i) \Big].
\]

Taking the derivative of $L_{[\alpha]}$, we have

\[
\frac{\partial L_{[\alpha]}}{\partial \alpha_i} = \sum_{d} \Big( \Psi(\gamma_{d,i}) - \Psi\Big(\sum_j \gamma_{d,j}\Big) \Big) + D\Big( \Psi\Big(\sum_j \alpha_j\Big) - \Psi(\alpha_i) \Big).
\]

This derivative with respect to $\alpha_i$ depends on the $\alpha_j$, $j \neq i$, and we therefore must use an iterative method to find the maximizing $\alpha$. In particular, the Hessian has the form

\[
\frac{\partial^2 L_{[\alpha]}}{\partial \alpha_i\, \partial \alpha_j} = D\Big( \Psi'\Big(\sum_k \alpha_k\Big) - \Psi'(\alpha_i)\, \mathbf{1}[i = j] \Big).
\]

Having these derivatives, we can use the Newton-Raphson optimization algorithm described in Blei's paper; I will not discuss it in the current version.
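A minimal sketch of mine (the names and the smoothing constant are assumptions, not part of the note) of the resulting M-step for $\beta$, which accumulates the expected word-topic counts of Eq. (1.11) over all documents and normalizes each row:

```python
import numpy as np

def m_step_beta(docs, phis, K, V, eps=1e-100):
    """docs: list of word-index arrays; phis: matching list of N_d x K arrays."""
    counts = np.zeros((K, V))
    for words, phi in zip(docs, phis):
        for n, v in enumerate(words):
            counts[:, v] += phi[n]          # add phi_{d,n,i} to column w_{d,n}
    counts += eps                           # avoid log(0) for words never assigned to a topic
    return counts / counts.sum(axis=1, keepdims=True)   # rows satisfy sum_v beta_{i,v} = 1
```

The $\alpha$ update would use the gradient and Hessian above inside a Newton-Raphson loop, as in Blei's paper; it is omitted here, as in the note.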

1.4 Summary of the variational EM algorithm for LDA

Having derived the variational inference part and the parameter estimation part, we now summarize the variational EM algorithm for LDA. The derivation yields the following iterative algorithm.

E-step: For each document, run the following iterative algorithm to find the optimizing values of the variational parameters.

Algorithm 1: Iterative variational inference algorithm for a document.
  initialize $\phi^{0}_{n,i} = 1/K$ for all $i$ and $n$;
  initialize $\gamma^{0}_{i} = \alpha_i + N_d / K$ for all $i$;
  while not converged do
    for $n = 1$ to $N_d$ do
      for $i = 1$ to $K$ do
        $\phi^{t+1}_{n,i} = \beta_{i, w_n} \exp\big(\Psi(\gamma^{t}_{i})\big)$;
      end
      normalize $\phi^{t+1}_{n}$ to sum to 1;
    end
    $\gamma^{t+1}_{i} = \alpha_i + \sum_n \phi^{t+1}_{n,i}$;
  end

M-step: Maximize the resulting lower bound on the log likelihood with respect to the model parameters $\alpha$ and $\beta$. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document under the approximate posterior computed in the E-step.

Appendix

In this part, we go over some preliminary facts used in the previous derivations. This part is copied from Blei's paper.

The need to compute the expected value of the log of a single probability component under the Dirichlet arises repeatedly in deriving the inference and parameter estimation procedures for LDA. This value can be easily computed from the natural parameterization of the exponential family representation of the Dirichlet distribution. Recall that a distribution is in the exponential family if it can be written in the form

\[
p(x \mid \eta) = h(x) \exp\big( \eta^{T} T(x) - A(\eta) \big),
\]

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, and $A(\eta)$ is the log of the normalization factor. We can write the Dirichlet in this form by exponentiating the log of its density:

\[
p(\theta \mid \alpha) = \exp\Big\{ \sum_{i=1}^{K} (\alpha_i - 1) \log \theta_i + \log \Gamma\Big(\sum_i \alpha_i\Big) - \sum_i \log \Gamma(\alpha_i) \Big\}.
\]

From this form, we immediately see that the natural parameter of the Dirichlet is $\eta_i = \alpha_i - 1$ and the sufficient statistic is $T(\theta)_i = \log \theta_i$. Furthermore, using the general fact that the derivative of the log normalization factor with respect to a natural parameter equals the expectation of the corresponding sufficient statistic, we obtain

\[
\mathbb{E}[\log \theta_i \mid \alpha] = \Psi(\alpha_i) - \Psi\Big(\sum_j \alpha_j\Big),
\]

where $\Psi$ is the digamma function, the first derivative of the log Gamma function. This is a very important fact for the derivations in this note.
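To see this fact numerically, here is a short Monte Carlo check of mine (the sample size and variable names are arbitrary): draws from a Dirichlet should give an empirical mean of $\log \theta_i$ close to $\Psi(\alpha_i) - \Psi(\sum_j \alpha_j)$.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([0.8, 1.5, 3.0])

samples = rng.dirichlet(alpha, size=200_000)         # theta ~ Dirichlet(alpha)
empirical = np.log(samples).mean(axis=0)             # Monte Carlo estimate of E[log theta_i]
closed_form = digamma(alpha) - digamma(alpha.sum())  # Psi(alpha_i) - Psi(sum_j alpha_j)

print(empirical)
print(closed_form)   # the two should agree to within Monte Carlo error
```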