VCMC: Variational Consensus Monte Carlo

Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
Berkeley Vision and Learning Center
September 22, 2015

Probabilistic models! [Illustrations: scene labeling with sky, fog, bridge, water, and grass regions.] Applications include object tracking & recognition, personalized recommendations, genomics & phylogenetics, and small molecule discovery.

Outline: Bayesian inference and Markov chain Monte Carlo; MCMC is hard; new data-parallel algorithms; VCMC: our approach and theoretical results; empirical evaluation.

Bayesian models encode uncertainty using probabilities: a probability distribution over model parameters, \pi(\alpha, \beta, \sigma \mid x, y). A model is a probabilistic description of data: y_i \sim \mathcal{N}(\alpha x_i + \beta, \sigma^2).

Bayesian inference uses Bayes' rule: \pi(\theta \mid x) \propto \pi(\theta)\,\pi(x \mid \theta), i.e. posterior \propto prior \times likelihood. Model parameters \theta = (\alpha, \beta, \sigma); data x = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{10}, y_{10})\}; probabilistic model of the data y_i \sim \mathcal{N}(\alpha x_i + \beta, \sigma^2).
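To ground the running regression example, here is a hedged sketch of its unnormalized log posterior; the specific priors (standard normals on \alpha, \beta, and \log\sigma) are illustrative assumptions, not choices from the talk.

```python
import numpy as np
from scipy.stats import norm

def log_posterior(theta, x, y):
    """Unnormalized log posterior log pi(theta) + log pi(x | theta) for the
    toy model y_i ~ N(alpha * x_i + beta, sigma^2); priors are placeholders."""
    alpha, beta, log_sigma = theta            # parameterize by log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    log_prior = norm.logpdf(alpha) + norm.logpdf(beta) + norm.logpdf(log_sigma)
    log_lik = np.sum(norm.logpdf(y, loc=alpha * x + beta, scale=sigma))
    return log_prior + log_lik
```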

In general, posterior distributions are difficult to work with. Normalizing involves an integral that is often intractable:
\pi(\theta \mid x) = \frac{\pi(\theta)\,\pi(x \mid \theta)}{\int_\Theta \pi(\theta)\,\pi(x \mid \theta)\, d\theta}
Expectations w.r.t. the posterior mean more intractable integrals:
\mathbb{E}_\pi[f] = \int_\Theta f(\theta)\,\pi(\theta \mid x)\, d\theta
(These are statistics that distill information about the posterior.)

Solution: Monte Carlo integration. Given a finite set of samples \theta_1, \theta_2, \ldots, \theta_T \sim \pi(\theta \mid x), estimate an intractable expectation as a sum:
\mathbb{E}_\pi[f] = \int_\Theta f(\theta)\,\pi(\theta \mid x)\, d\theta \approx \frac{1}{T} \sum_{t=1}^{T} f(\theta_t)
That is, replace a distribution with samples from it.
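As a minimal, self-contained illustration of this idea (not code from the talk), the sketch below estimates a posterior expectation by averaging a statistic over samples; the function name and the synthetic samples array are assumptions for illustration.

```python
import numpy as np

def monte_carlo_expectation(samples, f):
    """Approximate E_pi[f] by averaging f over posterior samples.

    samples: array of shape (T, d) holding draws theta_1..theta_T ~ pi(theta | x)
    f: function mapping a parameter vector theta to a scalar statistic
    """
    return np.mean([f(theta) for theta in samples])

# Example: estimate the posterior mean of the first coordinate
# from hypothetical samples standing in for MCMC output.
samples = np.random.randn(1000, 3)
posterior_mean_0 = monte_carlo_expectation(samples, lambda theta: theta[0])
```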

Markov chain Monte Carlo (MCMC): a widely used class of sampling algorithms. Sample by simulating a Markov chain (a biased random walk) whose stationary distribution (after convergence) is the posterior: \theta_1, \theta_2, \ldots, \theta_T \sim \pi(\theta \mid x). Use the samples for Monte Carlo integration:
\mathbb{E}_\pi[f] = \int_\Theta f(\theta)\,\pi(\theta \mid x)\, d\theta \approx \frac{1}{T} \sum_{t=1}^{T} f(\theta_t)
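For concreteness, here is a hedged sketch of the kind of serial sampler being described, a minimal random-walk Metropolis chain; the target log-density, step size, and chain length are placeholders rather than details from the talk.

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, n_steps=5000, step=0.1, seed=0):
    """Minimal random-walk Metropolis sampler targeting exp(log_post)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.normal(size=theta.shape)
        # Accept with probability min(1, pi(proposal) / pi(theta)).
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Example: sample a standard 2-D Gaussian "posterior" (placeholder target).
chain = random_walk_metropolis(lambda th: -0.5 * np.sum(th ** 2), theta0=np.zeros(2))
```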

Outline: Bayesian inference and Markov chain Monte Carlo; MCMC is hard; new data-parallel algorithms; VCMC: our approach and theoretical results; empirical evaluation.

Traditional MCMC is a serial, iterative algorithm for generating samples. It is slow for two reasons: (1) a large number of iterations is required to converge, and (2) each iteration depends on the entire dataset. Most innovation in MCMC has targeted (1); recent threads of work target (2).

Serial MCMC: data → single core → samples.

Data-parallel MCMC: data → parallel cores → samples.

Aggregate samples from across partitions, but how? (Data → parallel cores → samples → aggregate.)

The posterior factorizes in a way that motivates a data-parallel approach:
\pi(\theta \mid x) \propto \pi(\theta)\,\pi(x \mid \theta) = \prod_{j=1}^{J} \underbrace{\pi(\theta)^{1/J}\,\pi(x^{(j)} \mid \theta)}_{\text{sub-posterior}}
Partition the data as x^{(1)}, \ldots, x^{(J)} across J cores. The j-th core samples from a distribution proportional to the j-th sub-posterior (a piece of the full posterior). Aggregate the sub-posterior samples to form approximate full posterior samples.
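A hedged sketch of the data-parallel setup just described: each shard gets a sub-posterior built from the 1/J-powered prior and its own likelihood, and each would be handed to an independent MCMC sampler on its own core. The shard construction, prior, and likelihood below are illustrative placeholders.

```python
import numpy as np

def make_sub_posterior(log_prior, log_lik, shard, J):
    """Log sub-posterior for one shard: (1/J) * log prior + shard log likelihood."""
    def log_sub_post(theta):
        return log_prior(theta) / J + np.sum(log_lik(theta, shard))
    return log_sub_post

# Partition hypothetical data across J cores.
x = np.random.randn(100_000)
J = 10
shards = np.array_split(x, J)

# Placeholder prior/likelihood for a 1-D location parameter theta.
log_prior = lambda theta: -0.5 * theta ** 2                  # N(0, 1) prior
log_lik = lambda theta, data: -0.5 * (data - theta) ** 2     # N(theta, 1) likelihood

# One sub-posterior per shard; each is sampled independently, and the
# resulting sub-posterior samples are aggregated afterwards.
sub_posteriors = [make_sub_posterior(log_prior, log_lik, shard, J) for shard in shards]
```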

Aggregation strategies for sub-posterior samples (with the factorization \pi(\theta \mid x) \propto \prod_{j=1}^{J} \pi(\theta)^{1/J}\,\pi(x^{(j)} \mid \theta)): sub-posterior density estimation (Neiswanger et al., UAI 2014); Weierstrass samplers (Wang & Dunson, 2013); and weighted averaging of sub-posterior samples: Consensus Monte Carlo (Scott et al., Bayes 250, 2013) and Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015).

Aggregate horizontally across partitions: data → parallel cores → samples → aggregate.

Recall that samples are parameter vectors; aggregation combines one sample from each partition into a single parameter vector.

Naive aggregation = average: Aggregate(\theta^{(1)}, \theta^{(2)}) = 0.5\,\theta^{(1)} + 0.5\,\theta^{(2)}.

Less naive aggregation = weighted average: Aggregate(\theta^{(1)}, \theta^{(2)}) = 0.58\,\theta^{(1)} + 0.42\,\theta^{(2)}.

Consensus Monte Carlo (Scott et al., 2013): Aggregate(\theta^{(1)}, \ldots, \theta^{(J)}) = \sum_{j=1}^{J} W_j\,\theta^{(j)}, with weights W_j proportional to inverse sub-posterior covariance matrices. Motivated by Gaussian assumptions; designed at Google for the MapReduce framework.
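Below is a hedged sketch of this Gaussian-motivated consensus averaging rule, with weights proportional to inverse sub-posterior sample covariances and normalized to sum to the identity; the array shapes and covariance estimates are assumptions for illustration, not the authors' code.

```python
import numpy as np

def consensus_average(sub_samples):
    """Combine sub-posterior samples by precision-weighted (CMC-style) averaging.

    sub_samples: list of J arrays, each of shape (T, d), one per partition;
    aggregate sample t combines sample t from every partition.
    """
    # Precision (inverse covariance) of each sub-posterior, from its samples.
    precisions = [np.linalg.inv(np.cov(s, rowvar=False)) for s in sub_samples]
    normalizer = np.linalg.inv(sum(precisions))   # makes the weights sum to the identity
    weights = [normalizer @ p for p in precisions]
    # Weighted average, applied sample-by-sample across partitions.
    return sum(s @ w.T for s, w in zip(sub_samples, weights))

# Example with hypothetical sub-posterior draws from 4 partitions.
rng = np.random.default_rng(0)
subs = [rng.normal(loc=0.1 * j, size=(500, 3)) for j in range(4)]
combined = consensus_average(subs)
```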

Outline: Bayesian inference and Markov chain Monte Carlo; MCMC is hard; new data-parallel algorithms; VCMC: our approach and theoretical results; empirical evaluation.

Variational Consensus Monte Carlo. Goal: choose the aggregation function to best approximate the target distribution. Method: convex optimization via variational Bayes. Let F denote the aggregation function and q_F the resulting approximate distribution. The variational objective is
\mathcal{L}(F) = \underbrace{\mathbb{E}_{q_F}[\log \pi(X, \theta)]}_{\text{likelihood term}} + \underbrace{H[q_F]}_{\text{entropy}},
where in practice the entropy is replaced by a relaxed entropy \tilde{H}[q_F]. There is no mean field assumption.

Variational Consensus Monte Carlo: Aggregate(\theta^{(1)}, \ldots, \theta^{(J)}) = \sum_{j=1}^{J} W_j\,\theta^{(j)}. Optimize over the weight matrices W_j; restrict to valid solutions when the parameter vectors are constrained.
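To make the aggregation step concrete, here is a hedged sketch of the structured linear aggregation function, one weight matrix per partition applied sample-by-sample, together with a generic finite-difference ascent that stands in for the paper's convex optimization of the relaxed objective; none of this is the authors' implementation.

```python
import numpy as np

def vcmc_aggregate(sub_samples, weights):
    """Structured linear aggregation F(theta^(1..J)) = sum_j W_j theta^(j),
    applied sample-by-sample across the J partitions."""
    return sum(s @ w.T for s, w in zip(sub_samples, weights))

def fit_weights(objective, init_weights, lr=1e-3, n_iters=100, eps=1e-5):
    """Generic finite-difference gradient ascent on a scalar objective of the
    weight matrices; a stand-in for optimizing the relaxed objective L(F)."""
    weights = [w.copy() for w in init_weights]
    for _ in range(n_iters):
        for w in weights:
            grad = np.zeros_like(w)
            for idx in np.ndindex(*w.shape):
                w[idx] += eps
                up = objective(weights)
                w[idx] -= 2 * eps
                down = objective(weights)
                w[idx] += eps          # restore the entry
                grad[idx] = (up - down) / (2 * eps)
            w += lr * grad
    return weights
```

In practice the weights would be initialized at the CMC (inverse-covariance) weights, as noted later in the talk, and then improved by optimizing the variational objective.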

Variational Consensus Monte Carlo. Theorem (Entropy relaxation). Under mild structural assumptions, we can choose
\tilde{H}[q_F] = c_0 + \frac{1}{K} \sum_{k=1}^{K} h_k(F),
with each h_k a concave function of F, such that \tilde{H}[q_F] \le H[q_F]. We therefore have \tilde{\mathcal{L}}(F) \le \mathcal{L}(F).

Variational Consensus Monte Carlo. Theorem (Concavity of the variational Bayes objective). Under mild structural assumptions, the relaxed variational Bayes objective
\tilde{\mathcal{L}}(F) = \mathbb{E}_{q_F}[\log \pi(X, \theta)] + \tilde{H}[q_F]
is concave in F.

Outline: Bayesian inference and Markov chain Monte Carlo; MCMC is hard; new data-parallel algorithms; VCMC: our approach and theoretical results; empirical evaluation.

Empirical evaluation. We compare three aggregation strategies: uniform average; Gaussian-motivated weighted average (CMC); and optimized weighted average (VCMC). For each algorithm A, we report the approximation error of expectations \mathbb{E}_\pi[f], relative to serial MCMC, along with preliminary speedup results:
\epsilon_A(f) = \frac{\lvert \mathbb{E}_A[f] - \mathbb{E}_{\mathrm{MCMC}}[f] \rvert}{\lvert \mathbb{E}_{\mathrm{MCMC}}[f] \rvert}
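A small hedged helper for this error metric, computed from samples produced by an aggregation method A and by serial MCMC; the names are illustrative.

```python
import numpy as np

def relative_error(samples_A, samples_mcmc, f):
    """Relative error of the method-A estimate of E[f] against serial MCMC."""
    est_A = np.mean([f(theta) for theta in samples_A])
    est_mcmc = np.mean([f(theta) for theta in samples_mcmc])
    return abs(est_A - est_mcmc) / abs(est_mcmc)
```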

Example 1: High-dimensional Bayesian probit regression. #data = 100,000, d = 300. First-moment estimation error, relative to serial MCMC (error truncated at 2.0).

Example 2: High-dimensional covariance estimation. Normal-inverse-Wishart model; #data = 100,000, #dim = 100, i.e. 5,050 parameters. (Left) First-moment estimation error. (Right) Eigenvalue estimation error.

Example 3: Mixture of 8 Gaussians in 8 dimensions. Error relative to serial MCMC for cluster co-membership probabilities of pairs of test data points.

VCMC error decreases as the optimization runs longer. Initialize VCMC with CMC weights (inverse covariance matrices).

VCMC reduces CMC error at the cost of some speedup (roughly 2x). VCMC speedup is approximately linear.

Concluding thoughts. Contributions: a convex optimization framework for Consensus Monte Carlo; structured aggregation accounting for constrained parameters; an entropy relaxation; and an empirical evaluation. Future work: more structured and complex (latent variable) models; alternate posterior factorizations and aggregation schemes. We'd love to hear about your Bayesian inference problems!

Example 1: High-dimensional Bayesian probit regression. [Plots: approximation error vs. number of cores (5, 10, 25, 50, 100) for the uniform, Gaussian, and VCMC aggregation rules, for subposteriors and partial posteriors.]
Figure: High-dimensional probit regression (d = 300). Moment approximation error for the uniform and Gaussian averaging baselines and VCMC, relative to serial MCMC, for (left) subposteriors and (right) partial posteriors. We assessed three groups of functions: first moments, with f(\beta) = \beta_j for 1 \le j \le d; pure second moments, with f(\beta) = \beta_j^2 for 1 \le j \le d; and mixed second moments, with f(\beta) = \beta_i \beta_j for 1 \le i < j \le d. For brevity, results for pure second moments are relegated to the supplement. Relative errors are truncated to 2 in all cases.

Example 2: High-dimensional covariance estimation. [Plots: approximation error vs. number of cores (25, 50, 100) for the uniform, Gaussian, and VCMC aggregation rules; panels for first, pure second, and mixed second moments.]
Figure: High-dimensional normal-inverse-Wishart model (d = 100). (Far left, left, right) Moment approximation error for the uniform and Gaussian averaging baselines and VCMC, relative to serial MCMC. Letting \rho_j denote the j-th largest eigenvalue of \Lambda^{-1}, we assessed three groups of functions: first moments, with f(\Lambda) = \rho_j for 1 \le j \le d; pure second moments, with f(\Lambda) = \rho_j^2 for 1 \le j \le d; and mixed second moments, with f(\Lambda) = \rho_i \rho_j for 1 \le i < j \le d. (Far right) Error in estimating \mathbb{E}[\rho_j] as a function of j (where \rho_1 \ge \rho_2 \ge \cdots \ge \rho_d).

Example 3: Mixture of 8 Gaussians in 8 dimensions.
Figure: Mixture of Gaussians (d = 8, L = 8). Expectation approximation error for the uniform and Gaussian baselines and VCMC. We report the median error, relative to serial MCMC, for cluster co-membership probabilities of pairs of test data points, for (left) \sigma = 1 and (right) \sigma = 2, where we run the VCMC optimization procedure for 50 and 200 iterations, respectively. When \sigma = 2, some co-membership probabilities are estimated poorly by all methods; we therefore use only the 70% of co-membership probabilities with the smallest errors across all methods.

Computational efficiency.
Figure: Error versus timing and speedup measurements. (Left) VCMC error as a function of the number of seconds of optimization. The cost of optimization is non-negligible, but still moderate compared to serial MCMC, particularly since our optimization scheme only needs small batches of samples and can therefore operate concurrently with the sampler. (Right) Error versus speedup relative to serial MCMC, for both CMC with Gaussian averaging (small markers) and VCMC (large markers). In this case, the cost of optimization is small enough that a near-linear speedup is achieved.