Sparse Stochastic Inference for Latent Dirichlet Allocation

Similar documents
Sparse stochastic inference for latent Dirichlet allocation

Streaming Variational Bayes

Topic Modelling and Latent Dirichlet Allocation

Generative Clustering, Topic Modeling, & Bayesian Inference

Online Bayesian Passive-Aggressive Learning

CS Lecture 18. Topic Models and LDA

Lecture 13 : Variational Inference: Mean Field Approximation

Gaussian Mixture Model

Improving Topic Models with Latent Feature Word Representations

Evaluation Methods for Topic Models

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling

Collapsed Variational Bayesian Inference for Hidden Markov Models

Distributed Stochastic Gradient MCMC

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods

Latent Dirichlet Bayesian Co-Clustering

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Inference Methods for Latent Dirichlet Allocation

Document and Topic Models: plsa and LDA

Dirichlet Enhanced Latent Semantic Analysis

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Statistical Debugging with Latent Topic Models

AN INTRODUCTION TO TOPIC MODELS

Gibbs Sampling. CMSC 644. Héctor Corrada Bravo, University of Maryland, College Park, USA

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Topic Models. Material adapted from David Mimno, University of Maryland

16 : Approximate Inference: Markov Chain Monte Carlo

Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation

Smoothed Gradients for Stochastic Variational Inference

Applying LDA topic model to a corpus of Italian Supreme Court decisions

Sequential Monte Carlo Methods for Bayesian Computation

Lecture 22 Exploratory Text Analysis & Topic Models

Introduction to Bayesian inference

Latent Dirichlet Conditional Naive-Bayes Models

Content-based Recommendation

Truncation-free Stochastic Variational Inference for Bayesian Nonparametric Models

Text Mining for Economics and Finance Latent Dirichlet Allocation

Introduction to Machine Learning

Auto-Encoding Variational Bayes

Latent Dirichlet Allocation

Applying hlda to Practical Topic Modeling

Latent variable models for discrete data

Clustering bi-partite networks using collapsed latent block models

Topic Models. Charles Elkan November 20, 2008

Study Notes on the Latent Dirichlet Allocation

Variational inference

Generative Models and Stochastic Algorithms for Population Average Estimation and Image Analysis

Factor Modeling for Advertisement Targeting

EM & Variational Bayes

Bayesian Methods for Machine Learning

Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Gaussian Models

Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation

Supplementary Material of High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

Collapsed Variational Dirichlet Process Mixture Models

Latent Dirichlet Allocation (LDA)

Stochastic Variational Inference

Efficient Methods for Topic Model Inference on Streaming Document Collections

On Markov chain Monte Carlo methods for tall data

Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Scaling Neighbourhood Methods

Distance dependent Chinese restaurant processes

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

LDA with Amortized Inference

how to *do* computationally assisted research

Topic Learning and Inference Using Dirichlet Allocation Product Partition Models and Hybrid Metropolis Search

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Approximate Bayesian inference

Stochastic Proximal Gradient Algorithm

Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Bayesian learning of sparse factor loadings

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

Note for plsa and LDA-Version 1.1

Stochastic Annealing for Variational Inference

topic modeling hanna m. wallach

Topic Modeling: Beyond Bag-of-Words

MCMC and Gibbs Sampling. Kayhan Batmanghelich

Additive Regularization of Topic Models for Topic Selection and Sparse Factorization

Graphical Models for Query-driven Analysis of Multimodal Data

Gaussian Process Approximations of Stochastic Differential Equations

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.

Learning the hyper-parameters. Luca Martino

LDA Collapsed Gibbs Sampler, Variational Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling

Variational Inference for Scalable Probabilistic Topic Modelling

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Recent Advances in Bayesian Inference Techniques

Computational statistics

Deep Boltzmann Machines

Transcription:

Sparse Stochastic Inference for Latent Dirichlet Allocation. David Mimno (1), Matthew D. Hoffman (2), David M. Blei (1). (1) Dept. of Computer Science, Princeton U.; (2) Dept. of Statistics, Columbia U. Presentation led by Miao Liu, March 7, 2013.

Outline: 1. Introduction; 2. Hybrid Stochastic-MCMC Inference; 3. Related Work; 4. Empirical Results.

A hybrid algorithm for Bayesian topic models: sparse Gibbs sampling to estimate variational expectations of the local variables (efficiency), and online VB to update the variational distribution of the global variables (scalability). Application: finding large numbers of topics in massive collections of documents (1.2 million books, 33 billion words).

Latent Dirichlet Allocation (Blei et al., 2003). Notation: α is the parameter of the Dirichlet prior on the per-document topic distributions; β is the parameter of the Dirichlet prior on the per-topic word distributions; θ_i is the topic distribution for document i; φ_k is the word distribution for topic k; z_ij is the topic for the j-th word in document i; w_ij is the specific word. The generative process:
$\theta_i \sim \mathrm{Dir}(\alpha), \quad i \in \{1, \dots, M\}$  (1)
$\phi_k \sim \mathrm{Dir}(\beta), \quad k \in \{1, \dots, K\}$  (2)
For each of the words $w_{ij}$, where $j \in \{1, \dots, N_i\}$:  (3)
$z_{ij} \sim \mathrm{Multinomial}(\theta_i), \quad w_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij}})$  (4)
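
To make the generative process concrete, here is a minimal simulation sketch in NumPy; the corpus sizes and hyperparameter values are arbitrary toy choices, not taken from the paper.

```python
# Illustrative sketch of the LDA generative process above (not the authors' code).
# All sizes and hyperparameters are made-up toy values.
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 5, 3, 20                  # documents, topics, vocabulary size
alpha, beta = 0.1, 0.01             # symmetric Dirichlet hyperparameters
N = rng.integers(10, 30, size=M)    # document lengths N_i

theta = rng.dirichlet(np.full(K, alpha), size=M)  # theta_i ~ Dir(alpha), eq. (1)
phi = rng.dirichlet(np.full(V, beta), size=K)     # phi_k ~ Dir(beta), eq. (2)

docs = []
for i in range(M):
    z = rng.choice(K, size=N[i], p=theta[i])            # z_ij ~ Mult(theta_i), eq. (4)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_ij ~ Mult(phi_{z_ij}), eq. (4)
    docs.append(w)
```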

Online LDA (Hoffman et al., 2010). Pros: requires less memory and converges faster. Con: does not scale to a large number of topics.

Hybrid Stochastic-MCMC Inference. Sampling introduces a second source of stochasticity for the gradient, and the topic proportions θ_d are marginalized out. Corpus-level global variables: K topic-word distributions β_1, ..., β_K over the V-dimensional vocabulary. Document-level local variables: for a document d of length N_d, the topic proportions θ_d and the topic assignments z_d = (z_d1, ..., z_dN_d). Variational distribution:
$q(z_1, \dots, z_D, \beta_1, \dots, \beta_K) = \prod_d q(z_d) \prod_k q(\beta_k)$,
where q(z_d) is a single distribution over the K^{N_d} possible topic configurations. The VB lower bound:
$\log p(w \mid \alpha, \eta) \ge \sum_d E_q \log \Big[ p(z_d \mid \alpha) \prod_i \beta_{z_{di} w_{di}} \Big] + \sum_k E_q \log p(\beta_k \mid \eta) + H(q)$  (5)
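
The bound in eq. (5) requires expectations of log β_kw under the Dirichlet variational factors q(β_k | λ_k). A minimal sketch of that standard closed-form computation follows; the array shapes and values are illustrative assumptions.

```python
# Under q(beta_k) = Dirichlet(lambda_k), the expectation E_q[log beta_kw] used in
# eq. (5) is psi(lambda_kw) - psi(sum_w lambda_kw). Variable names are illustrative.
import numpy as np
from scipy.special import digamma

def expected_log_beta(lam):
    """lam: (K, V) array of Dirichlet parameters lambda_kw."""
    return digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))

lam = 0.01 + np.random.default_rng(1).gamma(1.0, size=(3, 20))  # toy (K, V) parameters
Elogbeta = expected_log_beta(lam)   # (K, V), used inside E_q[log p(w | z, beta)]
```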

The optimal variational distribution over topic configurations for a document, holding all other variational distributions fixed, is
$q^*(z_d) \propto \exp\big\{ E_{q(\neg z_d)}\big[\log p(z_d \mid \alpha)\, p(w_d \mid z_d, \beta)\big] \big\}$,  (6)
where, with θ_d marginalized out,
$p(z_d \mid \alpha) = \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + N_d)} \prod_k \frac{\Gamma\big(\alpha + \sum_i I_{z_{di}=k}\big)}{\Gamma(\alpha)}$.  (7)
The hyperparameter of the optimal variational distribution over topic-word distributions, holding the other distributions fixed, is
$\lambda_{kw} = \eta + \sum_d \sum_i E_q\big[ I_{z_{di}=k}\, I_{w_{di}=w} \big]$.  (8)

Online Stochastic Inference for λ_kw. Recast the variational lower bound as a summation over per-document terms ℓ_d:
$\ell_d = \sum_w \Big( E_q[N_{dkw}] + \tfrac{1}{D}(\eta - \lambda_{kw}) \Big) E_q[\log \beta_{kw}] + \tfrac{1}{D}\Big( \sum_w \log \Gamma(\lambda_{kw}) - \log \Gamma\big(\textstyle\sum_w \lambda_{kw}\big) \Big)$,  (9)
where $E_q[N_{dkw}] = \sum_i E_q[ I_{z_{di}=k}\, I_{w_{di}=w} ]$ and the full gradient is recovered in expectation over mini-batches B: $\nabla_{\lambda_k} \ell = E_B\big[ \tfrac{D}{|B|} \sum_{d \in B} \nabla_{\lambda_k} \ell_d \big]$.
The natural gradient (Hoffman et al., 2010) is
$E_q[N_{dkw}] + \tfrac{1}{D}(\eta - \lambda_{kw})$,  (10)
giving faster convergence and cheaper computation. Issue: the natural gradient cannot be directly evaluated, since E_q[N_dkw] involves a combinatorial number of topic configurations z_d.
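
As a sketch of how the noisy natural gradient in eq. (10) drives the online update, the hypothetical helper below applies the standard stochastic step λ ← (1 − ρ_t)λ + ρ_t(η + D·E_q[N_dkw]); function name, shapes, and arguments are assumptions, not the authors' code.

```python
# Stochastic natural-gradient step for the topic-word parameters lambda, assuming
# an estimate of E_q[N_dkw] averaged over a mini-batch is already available.
import numpy as np

def svi_update(lam, EN_batch_avg, D, eta, rho_t):
    """One online update of lambda.

    lam          : (K, V) current variational parameters
    EN_batch_avg : (K, V) per-document average of E_q[N_dkw] over the mini-batch
    D            : corpus size; eta: Dirichlet prior; rho_t: learning rate
    """
    lam_hat = eta + D * EN_batch_avg          # noisy estimate of the optimum, cf. eq. (8)
    return (1.0 - rho_t) * lam + rho_t * lam_hat
```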

MCMC within Stochastic Inference. Solution: use MCMC to sample z_d from q*(z_d) and use the empirical average to estimate E_q[N_dkw]:
$q^*(z_{di} = k \mid z_{\setminus i}) \propto \Big(\alpha + \sum_{j \ne i} I_{z_{dj}=k}\Big) \exp\{E_q[\log \beta_{k w_{di}}]\}$
$E_q[N_{dkw}] \approx \hat{N}_{kw} = \frac{1}{S} \sum_s \sum_{d \in B} \sum_i I_{z_{di}^s = k}\, I_{w_{di} = w}$
Impacts of MCMC: (1) it adds stochasticity to the gradient; (2) it allows a collapsed objective function that does not represent θ_d; (3) it provides a sparse estimate of the gradient (for many words and topics, the estimate of E_q[N_dkw] will be zero).
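
To make the sampling step concrete, here is a minimal NumPy sketch of Gibbs sweeps over one document under the conditional above, accumulating a sparse count estimate; the number of sweeps, burn-in handling, and variable names are illustrative assumptions, not the authors' implementation.

```python
# Gibbs sweeps to approximate E_q[N_dkw] for a single document: resample each z_di
# from q*(z_di = k | z_{\i}) and average indicator counts over the kept sweeps.
import numpy as np

def sample_doc_counts(w_d, Elogbeta, alpha, n_sweeps=10, n_burn=5, rng=None):
    """w_d: word ids of document d; Elogbeta: (K, V) array of E_q[log beta_kw]."""
    rng = rng or np.random.default_rng()
    K, _ = Elogbeta.shape
    z = rng.integers(K, size=len(w_d))            # random initial topic assignments
    topic_counts = np.bincount(z, minlength=K)    # N_dk for the current state
    N_hat = {}                                    # sparse estimate of E_q[N_dkw]
    n_kept = 0
    for s in range(n_sweeps):
        for i, w in enumerate(w_d):
            topic_counts[z[i]] -= 1               # remove token i from the counts
            logp = np.log(alpha + topic_counts) + Elogbeta[:, w]
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(K, p=p / p.sum())   # draw from q*(z_di = k | z_{\i})
            topic_counts[z[i]] += 1
        if s >= n_burn:                           # average the post-burn-in sweeps
            n_kept += 1
            for i, w in enumerate(w_d):
                N_hat[(z[i], w)] = N_hat.get((z[i], w), 0) + 1
    return {kw: c / n_kept for kw, c in N_hat.items()}   # sparse dict: (k, w) -> count
```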

Previous variational methods lead to dense updates of all KV topic parameters, which is expensive when K and V are large. Algorithm 1 exploits the sparsity exhibited by samples from q*.
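
One way to exploit that sparsity is to store λ as a prior offset plus a globally scaled sparse residual, so the dense (1 − ρ) decay touches a single scalar and only the nonzero sampled counts are written. The sketch below illustrates this lazy-scaling idea under those assumptions; it is not a transcription of the paper's Algorithm 1, and the class and argument names are hypothetical.

```python
# Sparse update of lambda_kw = eta + scale * u_kw: decay via the global scale in
# O(1), then add only the entries of the sampled counts N_hat that are nonzero.
from collections import defaultdict

class SparseLambda:
    def __init__(self, eta):
        self.eta = eta
        self.scale = 1.0
        self.u = defaultdict(float)      # sparse map (k, w) -> u_kw

    def value(self, k, w):
        return self.eta + self.scale * self.u.get((k, w), 0.0)

    def update(self, N_hat, rho, D):
        # Equivalent to lambda <- (1 - rho) * lambda + rho * (eta + D * N_hat),
        # where N_hat is the mini-batch average estimate of E_q[N_dkw].
        self.scale *= (1.0 - rho)                 # dense decay, one scalar multiply
        for (k, w), n in N_hat.items():           # sparse additions only
            self.u[(k, w)] += rho * D * n / self.scale
```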

Two sources of zero-mean noise can be used in constructing an approximate gradient for VB: subsampling the data (SS) and Monte Carlo inference (MC). SAEM (Delyon et al., 1999): EM + MC. Kuhn & Lavielle (2004): extends SAEM to MCMC estimates. Online EM (Cappé & Moulines, 2009): EM + SS. Collapsed VB (Teh et al., 2006) also analytically marginalizes over θ_d, but still maintains a fully factorized distribution over z_d; the proposed method does not restrict itself to such factored distributions and hence reduces bias. Parallelization.

Evaluation. Held-out probability: the marginal likelihood of held-out documents, computed with left-to-right sequential sampling (Wallach et al., 2009; Buntine, 2009). Topic coherence: measures the semantic quality of a topic by approximating the experience of a user viewing the W most probable words for the topic (Mimno et al., 2011). Let D(w) be the number of documents containing one or more tokens of type w, and D(w_i, w_j) the number of documents containing at least one token of each of w_i and w_j. Then
$C(W) = \sum_{i=2}^{W} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}$.
C(W) is related to pointwise mutual information (Newman et al., 2010). Wallclock time is also reported.
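
A minimal sketch of computing C(W) from raw document-frequency counts, assuming a small in-memory corpus; the ε value, function names, and brute-force counting are illustrative assumptions.

```python
# Topic coherence C(W) as defined above, from co-document frequencies.
import numpy as np

def topic_coherence(top_words, docs, eps=1e-12):
    """top_words: the W most probable word ids for a topic, most probable first.
    docs: iterable of documents, each a collection of word ids.
    Assumes every top word occurs in at least one document (so D(w_j) > 0)."""
    doc_sets = [set(d) for d in docs]
    def D1(w):            # number of documents containing word w
        return sum(w in s for s in doc_sets)
    def D2(wi, wj):       # number of documents containing both wi and wj
        return sum(wi in s and wj in s for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += np.log((D2(top_words[i], top_words[j]) + eps) / D1(top_words[j]))
    return score
```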

Datasets. Science/Nature/PNAS articles: 350,000 research articles, vocabulary size 19,000, with 90% of the articles used for training and 10% for testing. Pre-1922 books: 1.2 million books, 33 billion words.

Comparison to online VB (Hoffman et al., 2010). Each iteration consists of a mini-batch of 100 documents; the number of coordinate ascent steps in VB is set equal to the number of Gibbs sweeps; both methods use the same learning schedule. Standard online VB takes time linear in K.

Comparison to online VB (Hoffman et al., 2010). The entropy of the topic distributions is 6.8 ± 0.46 for the proposed method and 6.0 ± 0.58 for online VB. This result could indicate that coordinate ascent over the local variables in online LDA is not converging.

Comparison to sequential Monte Carlo (Ahmed et al., 2012). The SMC sampler sweeps through each document the same number of times as the sampled online algorithm. The learning rate schedule allows sampled online inference to forget its initial topics, whereas SMC weights all documents equally.

Effect of parameter settings: the number of samples (B + S), the topic-word smoothing η, the forgetting factor of the learning rate, and the corpus size D.

Scalability.