Streaming Variational Bayes

Similar documents
Streaming Variational Bayes

Sparse Stochastic Inference for Latent Dirichlet Allocation

Understanding Covariance Estimates in Expectation Propagation

Introduction to Bayesian inference

Lecture 13 : Variational Inference: Mean Field Approximation

Part 1: Expectation Propagation

MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Expectation propagation as a way of life

Algorithms for Variational Learning of Mixture of Gaussians

Online Algorithms for Sum-Product

LDA with Amortized Inference

Bayesian Machine Learning

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Collapsed Variational Bayesian Inference for Hidden Markov Models

Introduction to Probabilistic Machine Learning

MCMC for big data. Geir Storvik. BigInsight lunch - May

Coresets for Bayesian Logistic Regression

Probabilistic Reasoning in Deep Learning

Online Bayesian Passive-Aggressive Learning"

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Patterns of Scalable Bayesian Inference Background (Session 1)

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Basic math for biology

Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods

CS Lecture 18. Topic Models and LDA

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Automated Scalable Bayesian Inference via Data Summarization

Generative Clustering, Topic Modeling, & Bayesian Inference

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Fast yet Simple Natural-Gradient Variational Inference in Complex Models

Bayesian Machine Learning

arXiv preprint, v2 [stat.ML], 21 Jul 2015

Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA CMSC 644:

Automated Scalable Bayesian Inference via Data Summarization

Unsupervised Learning

Notes on pseudo-marginal methods, variational Bayes and ABC

Online Bayesian Passive-Aggressive Learning

Black-box α-divergence Minimization

AN INTRODUCTION TO TOPIC MODELS

Part IV: Variational Bayes and beyond

Non-Parametric Bayes

Dirichlet Enhanced Latent Semantic Analysis

Practical Bayesian Optimization of Machine Learning. Learning Algorithms

Expectation Propagation for Approximate Bayesian Inference

Sequential Monte Carlo Methods for Bayesian Computation

Online Bayesian Passive-Aggressive Learning

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Gaussian Mixture Model

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Streaming, Distributed Variational Inference for Bayesian Nonparametrics

Latent Dirichlet Allocation

an introduction to bayesian inference

WE live in an era of Big Data, where science, engineering

CPSC 540: Machine Learning

Approximate Decentralized Bayesian Inference

Variational Scoring of Graphical Model Structures

Learning an astronomical catalog of the visible universe through scalable Bayesian inference

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Introduction into Bayesian statistics

Document and Topic Models: plsa and LDA

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling

Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

MCMC and Gibbs Sampling. Kayhan Batmanghelich

An Information Theoretic Interpretation of Variational Inference based on the MDL Principle and the Bits-Back Coding Scheme

Patterns of Scalable Bayesian Inference

G8325: Variational Bayes

A Review of Pseudo-Marginal Markov Chain Monte Carlo

Sequence labeling. Taking collectively a set of interrelated instances x_1, ..., x_T and jointly labeling them

Stochastic Annealing for Variational Inference

Truncation-free Stochastic Variational Inference for Bayesian Nonparametric Models

Bayesian Analysis for Natural Language Processing Lecture 2

STA 4273H: Statistical Machine Learning

Bayesian Machine Learning

Click Prediction and Preference Ranking of RSS Feeds

Stochastic Variational Inference for Gaussian Process Latent Variable Models using Back Constraints

Probabilistic Graphical Models & Applications

A new Hierarchical Bayes approach to ensemble-variational data assimilation

Stochastic Variational Inference for Bayesian Time Series Models

Online Bayesian Transfer Learning for Sequential Data Modeling

Tree-Based Inference for Dirichlet Process Mixtures

Distributed Stochastic Gradient MCMC

The Infinite PCFG using Hierarchical Dirichlet Processes

REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS

15780: GRADUATE AI (SPRING 2018) Homework 3: Deep Learning and Probabilistic Modeling

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

arXiv preprint, v2 [stat.ML], 5 Nov 2012

You Lu, Jeff Lund, and Jordan Boyd-Graber. Why ADAGRAD Fails for Online Topic Modeling. Empirical Methods in Natural Language Processing, 2017.

Lecture 2: From Linear Regression to Kalman Filter and Beyond

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

PMR Learning as Inference

Provable Nonparametric Bayesian Inference

Graphical Models for Collaborative Filtering

Efficient Likelihood-Free Inference

Bayesian Inference and MCMC

Stochastic Variational Inference for the HDP-HMM

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Probabilistic Graphical Models

Transcription:

Streaming Variational Bayes
Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael I. Jordan (UC Berkeley)
Discussion led by Miao Liu, September 13, 2013

Outline
- Introduction
- The SDA-Bayes framework
- Streaming Bayesian updating
- Distributed Bayesian updating
- Asynchronous Bayesian updating
- Experiments

Motivation
- The advantages of the Bayesian paradigm: hierarchical modeling, a coherent treatment of uncertainty.
- These advantages seem out of reach in the setting of Big Data.
- An exception: variational Bayes (VB).
  - Stochastic variational inference (SVI) = VB + stochastic gradient descent [Hoffman et al., 2012].
  - The objective function is the variational lower bound on the marginal likelihood.
  - The stochastic gradient is computed for a single data point (document) at a time (see the sketch below).

Motivation: the problem with SVI
- The objective is based on the conceptual existence of a full data set of D data points (documents).
- D is a fixed number that must be specified in advance, so SVI does not apply in the streaming setting.
- Meanwhile, developments in computer architectures permit distributed and asynchronous computation.
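For concreteness, the snippet below is a minimal sketch of the kind of SVI update referred to above, for a generic vector of global natural parameters. It is not the paper's algorithm or Hoffman et al.'s code: `local_step`, the step-size constants, and the toy data are placeholders of mine. The point is only that the corpus size D enters the update explicitly, which is why SVI needs D fixed in advance.

```python
import numpy as np

def svi_update(lam, lam_prior, doc, D, t, local_step, kappa=0.7, tau=1.0):
    """One SVI-style step on a global variational natural parameter `lam`.
    `local_step(doc, lam)` is a hypothetical routine returning the expected
    sufficient statistics contributed by a single document."""
    rho = (t + tau) ** (-kappa)                      # decaying step size
    lam_hat = lam_prior + D * local_step(doc, lam)   # as if the corpus were D copies of this doc
    return (1.0 - rho) * lam + rho * lam_hat         # noisy natural-gradient step

# Toy usage with fake per-document sufficient statistics (word counts).
lam = np.ones(4)
docs = [np.array([1.0, 0.0, 2.0, 0.0]), np.array([0.0, 3.0, 0.0, 1.0])]
for t, doc in enumerate(docs):
    lam = svi_update(lam, np.ones(4), doc, D=1000, t=t, local_step=lambda d, l: d)
print(lam)
```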

Introduction
- The aim of this paper: develop a scalable approximate Bayesian inference algorithm for the true streaming setting.
- Methodology: a recursive application of Bayes' theorem provides a sequence of posteriors, not a sequence of approximations to a fixed posterior; each posterior in the sequence is approximated by VB.
- Similar methods: density filtering [Opper, 1999], expectation propagation [Minka, 2001].
- Related work: MCMC approximations [Canini et al., AISTATS 2009]; VB and VB-like approximations [Honkela and Valpola, 2003; Luts et al., arXiv preprint].

Streaming Bayesian updating
- Consider data x_1, x_2, ... generated i.i.d. from p(x | Θ), and let C_b denote the b-th minibatch of S data points, so C_1 := (x_1, ..., x_S).
- The posterior after the b-th minibatch is

    p(\Theta \mid C_1, \ldots, C_b) \propto p(C_b \mid \Theta) \, p(\Theta \mid C_1, \ldots, C_{b-1})    (1)

- Repeated application of Eq. (1) is streaming: it automatically yields the new posterior without needing to revisit old data points.
- Let A be an approximation algorithm that computes an approximate posterior q, i.e. q(Θ) = A(C, p(Θ)). Setting q_0(Θ) = p(Θ), recursively calculate an approximation to the posterior p(Θ | C_1, ..., C_b):

    q_b(\Theta) = A(C_b, q_{b-1}(\Theta))    (2)

  (A toy instantiation of this recursion is sketched below.)
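A minimal sketch of the recursion in Eq. (2), with a Beta-Bernoulli model chosen so that A is simply the exact conjugate update; in the paper A is instead a VB approximation for LDA. The data stream and sizes here are made up.

```python
import numpy as np

def A(minibatch, prior):
    """Exact conjugate update: Beta(a, b) prior + Bernoulli minibatch -> Beta posterior."""
    a, b = prior
    x = np.asarray(minibatch)
    return a + x.sum(), b + (1 - x).sum()

rng = np.random.default_rng(0)
q = (1.0, 1.0)                          # q_0 = prior
for b in range(5):                      # stream of minibatches C_1, C_2, ...
    C_b = rng.binomial(1, 0.3, size=100)
    q = A(C_b, q)                       # q_b = A(C_b, q_{b-1}); old data never revisited
print("streaming posterior Beta parameters:", q)
```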

Distributed Bayesian updating
- Parallelizing computations increases algorithm throughput.
- Calculate the individual minibatch posteriors p(Θ | C_b) (perhaps in parallel), and then combine them to find the full posterior:

    p(\Theta \mid C_1, \ldots, C_B) \propto \left[ \prod_{b=1}^{B} p(C_b \mid \Theta) \right] p(\Theta)

- The corresponding approximate update, p(Θ | C_1, ..., C_B) ≈ q(Θ), is

    q(\Theta) \propto \left[ \prod_{b=1}^{B} p(\Theta \mid C_b) \, p(\Theta)^{-1} \right] p(\Theta)    (3)
    q(\Theta) \approx \left[ \prod_{b=1}^{B} A(C_b, p(\Theta)) \, p(\Theta)^{-1} \right] p(\Theta)    (4)

- Assumptions:
  1. p(Θ) ∝ exp{ξ_0 · T(Θ)} is an exponential family distribution for Θ with sufficient statistic T(Θ) and natural parameter ξ_0.
  2. A returns a distribution in the same exponential family, q_b(Θ) ∝ exp{ξ_b · T(Θ)}.
- Under these assumptions, the update in Eq. (4) becomes

    p(\Theta \mid C_1, \ldots, C_B) \approx q(\Theta) \propto \exp\left\{ \left[ \xi_0 + \sum_{b=1}^{B} (\xi_b - \xi_0) \right] \cdot T(\Theta) \right\}    (5)

- It is not necessary to assume any conjugacy. (The combination step in Eq. (5) is sketched below.)
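A minimal sketch of the combination step in Eq. (5). The ξ_b values below are placeholder arrays; in the framework each would come from running A on minibatch C_b, possibly on a different processor.

```python
import numpy as np

def combine(xi_0, xi_batches):
    """Combined natural parameter xi_0 + sum_b (xi_b - xi_0), as in Eq. (5)."""
    xi_0 = np.asarray(xi_0, dtype=float)
    return xi_0 + sum(np.asarray(xi_b, dtype=float) - xi_0 for xi_b in xi_batches)

xi_0 = np.array([1.0, 1.0])                                  # prior natural parameter
xi_batches = [np.array([4.0, 2.0]), np.array([3.0, 5.0])]    # per-minibatch approximations
print(combine(xi_0, xi_batches))                             # -> [6. 6.]
```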

Asynchronous Bayesian updating
- Asynchronous algorithms can decrease downtime in the system.
- The proposed asynchronous algorithm is in the spirit of Hogwild! [Niu et al., NIPS 2011].
- A first asynchronous scheme (a conceptual stepping stone, not used in practice). Each worker (a processor that solves a subproblem):
  1. collects a new minibatch C,
  2. computes the local approximate posterior ξ ← A(C, ξ_0),
  3. returns Δξ := ξ − ξ_0 to the master.
- The master starts by setting the posterior equal to the prior, ξ_post ← ξ_0. Each time the master receives a quantity Δξ from any worker, it updates the posterior synchronously: ξ_post ← ξ_post + Δξ.

Streaming, Distributed, Asynchronous (SDA) Bayes
- The master initializes its posterior estimate to the prior: ξ_post ← ξ_0.
- Each worker continuously iterates over four steps:
  1. collect a new minibatch C,
  2. copy the master posterior value locally: ξ_local ← ξ_post,
  3. compute the local approximate posterior ξ ← A(C, ξ_local),
  4. return Δξ := ξ − ξ_local to the master.
- Each time the master receives a quantity Δξ from any worker, it updates the posterior synchronously: ξ_post ← ξ_post + Δξ.
- The key difference from the scheme above: here the latest posterior, rather than the original prior, is used as the prior for each new minibatch. This second framework introduces a new layer of approximation, but works better in practice. (A thread-based sketch follows below.)
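A minimal, thread-based sketch of the worker/master loop above, again with the exact Beta-Bernoulli conjugate update standing in for A; a lock stands in for the master's synchronous update, and the simulated data stream, sizes, and seeds are all made up.

```python
import threading
import numpy as np

def A(minibatch, xi):
    """Placeholder for the approximation algorithm: exact conjugate update for Bernoulli data."""
    x = np.asarray(minibatch)
    return xi + np.array([x.sum(), (1 - x).sum()])

xi_0 = np.array([1.0, 1.0])
xi_post = xi_0.copy()                         # master's posterior estimate
lock = threading.Lock()

def worker(seed, n_batches=10):
    global xi_post
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        C = rng.binomial(1, 0.3, size=50)     # 1. collect a new minibatch
        with lock:
            xi_local = xi_post.copy()         # 2. copy the master posterior locally
        xi = A(C, xi_local)                   # 3. compute the local approximate posterior
        with lock:
            xi_post = xi_post + (xi - xi_local)   # 4. master adds delta_xi synchronously

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("combined natural parameter:", xi_post)
```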

Case study: latent Dirichlet allocation (Blei et al., 2003)
Notation
- α: the parameter of the Dirichlet prior on the per-document topic distributions
- β: the parameter of the Dirichlet prior on the per-topic word distributions
- θ_i: the topic distribution for document i
- φ_k: the word distribution for topic k
- z_ij: the topic for the j-th word in document i
- w_ij: the specific word
The generative process (a toy simulation is sketched below):

    \theta_i \sim \mathrm{Dir}(\alpha), \quad i \in \{1, \ldots, M\}    (6)
    \phi_k \sim \mathrm{Dir}(\beta), \quad k \in \{1, \ldots, K\}    (7)
    \text{for each word } w_{ij}, \; j \in \{1, \ldots, N_i\}:    (8)
    \quad z_{ij} \sim \mathrm{Categorical}(\theta_i), \quad w_{ij} \sim \mathrm{Categorical}(\phi_{z_{ij}})    (9)
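A toy simulation of the generative process in Eqs. (6)-(9); the corpus sizes and hyperparameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V, N = 5, 3, 20, 50            # documents, topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1               # symmetric Dirichlet hyperparameters (made up)

phi = rng.dirichlet(np.full(V, beta), size=K)          # phi_k ~ Dir(beta), per-topic word dist.
docs = []
for i in range(M):
    theta_i = rng.dirichlet(np.full(K, alpha))         # theta_i ~ Dir(alpha), per-doc topic dist.
    z = rng.choice(K, size=N, p=theta_i)               # z_ij ~ Categorical(theta_i)
    w = np.array([rng.choice(V, p=phi[k]) for k in z]) # w_ij ~ Categorical(phi_{z_ij})
    docs.append(w)
print("first 10 word indices of document 0:", docs[0][:10])
```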

The posterior of LDA

    p(\beta, \theta, z \mid C, \eta, \alpha) \propto \left[ \prod_{k=1}^{K} \mathrm{Dirichlet}(\beta_k \mid \eta_k) \right] \left[ \prod_{d=1}^{D} \mathrm{Dirichlet}(\theta_d \mid \alpha) \right] \left[ \prod_{d=1}^{D} \prod_{n=1}^{N_d} \theta_{d, z_{dn}} \, \beta_{z_{dn}, w_{dn}} \right]    (10)

(Evaluating this unnormalized density is sketched below.)
Posterior-approximation algorithms
- Mean-field variational Bayes: minimizes KL(q ‖ p)
- Expectation propagation [Minka, 2001]: minimizes KL(p ‖ q)
Other single-pass algorithms for approximate LDA posteriors
- Stochastic variational inference
- Sufficient statistics
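A minimal sketch of evaluating the unnormalized log posterior in Eq. (10) at one configuration (β, θ, z). The sizes and random draws are made up; the hyperparameter values mirror the experiment settings on the next slide.

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(0)
K, D, V, N = 3, 4, 20, 30
eta = np.full(V, 1.0)                    # eta_{k,v} = 1
alpha = np.full(K, 1.0 / K)              # alpha_k = 1/K

beta = rng.dirichlet(eta, size=K)        # topic-word distributions
theta = rng.dirichlet(alpha, size=D)     # document-topic distributions
z = rng.integers(0, K, size=(D, N))      # topic assignments
w = rng.integers(0, V, size=(D, N))      # observed words

log_post = sum(dirichlet.logpdf(beta[k], eta) for k in range(K))       # Dirichlet priors on beta_k
log_post += sum(dirichlet.logpdf(theta[d], alpha) for d in range(D))   # Dirichlet priors on theta_d
for d in range(D):
    log_post += np.sum(np.log(theta[d, z[d]]) + np.log(beta[z[d], w[d]]))  # per-word terms
print("unnormalized log posterior:", log_post)
```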

Experiments
Data sets
- the full Wikipedia corpus (3,611,558 documents for training, 10,000 for testing)
- the Nature corpus (351,525 documents for training, 1,024 for testing)
Experiment setup
- Hold out 10,000 / 1,024 Wikipedia / Nature documents for testing.
- In all cases, fit an LDA model with K = 100 topics.
- Hyperparameters: α_k = 1/K for all k, and η_{k,v} = 1 for all (k, v) (written out as a small config below).
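The hyperparameter choices above, written out as a small configuration sketch; the dictionary keys are mine, not from the paper's code.

```python
K = 100                             # number of topics
lda_config = {
    "num_topics": K,
    "alpha": [1.0 / K] * K,         # alpha_k = 1/K for every topic k
    "eta": 1.0,                     # eta_{k,v} = 1 for every (k, v)
}
```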