On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data
Remi Bardenet, Arnaud Doucet, Chris Holmes
Paper review by: David Carlson, October 29, 2016

Introduction

- Many data sets in machine learning and computational statistics are considered tall: a tall data set contains a large number $n$ of independent data points.
- Tall data often contain comparatively few predictors, but that assumption is not explicitly used in this paper.
- Bayesian inference is often based on posterior sampling via MCMC.
- Tall data renders traditional MCMC methods ineffective.

Contributions of this paper

The main contributions are:
1. Describe the current limitations of MCMC for tall data
2. Review recently proposed subsampling MCMC approaches, covering both exact (unbiased) and approximate samplers
3. Propose a novel subsampling MCMC approach with minimal bias

An IPython notebook is included to recreate all experiments in the paper.

Basic setup

Suppose we have a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$. The data are assumed conditionally independent given parameters $\theta \in \Theta$, with likelihood $\prod_{i=1}^n p(x_i \mid \theta)$ and mean log-likelihood

$$\ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log p(x_i \mid \theta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\theta).$$

The parameters are assumed drawn from a given prior $p(\theta)$, so the posterior distribution of interest is

$$\pi(\theta) = p(\theta \mid X) \propto \gamma(\theta) \triangleq p(\theta)\, e^{n \ell(\theta)}.$$

Classical Technique: Metropolis-Hastings

Algorithm 1: Metropolis-Hastings
 1: procedure MH($\gamma(\cdot)$, $q(\cdot \mid \cdot)$, $\theta_0$, $N_{\text{iter}}$)
 2:   for $k = 1$ to $N_{\text{iter}}$ do
 3:     $\theta \leftarrow \theta_{k-1}$
 4:     $\theta' \sim q(\cdot \mid \theta)$  ▷ Propose new point
 5:     $u \sim \text{Uniform}(0, 1)$
 6:     $\alpha(\theta, \theta') \leftarrow \frac{\gamma(\theta')\, q(\theta \mid \theta')}{\gamma(\theta)\, q(\theta' \mid \theta)}$  ▷ Acceptance probability
 7:     if $u \leq \alpha(\theta, \theta')$ then
 8:       $\theta_k \leftarrow \theta'$  ▷ Accept
 9:     else
10:       $\theta_k \leftarrow \theta$  ▷ Reject
11:     end if
12:   end for
13: end procedure
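
As a concrete reference, here is a minimal Python sketch of Algorithm 1 for a symmetric random-walk proposal (so the $q$ terms cancel in the ratio); `log_gamma` is a user-supplied callback returning $\log \gamma(\theta)$, and the step size and seeds are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def metropolis_hastings(log_gamma, theta0, n_iter, step=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_g = log_gamma(theta)                   # cached from the previous iteration
    chain = np.empty((n_iter, theta.size))
    for k in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)  # symmetric q
        log_g_prop = log_gamma(prop)           # the O(n) step for tall data
        if np.log(rng.uniform()) <= log_g_prop - log_g:
            theta, log_g = prop, log_g_prop    # accept
        chain[k] = theta                       # on rejection, theta_k = theta_{k-1}
    return chain

# E.g., the 1-D normal model of the next slide with a flat prior,
# parameterized as theta = (mu, log sigma):
x = np.random.default_rng(1).standard_normal(10**5)
def log_gamma(th):
    mu, log_sigma = th
    return -x.size * log_sigma - 0.5 * np.sum((x - mu) ** 2) * np.exp(-2 * log_sigma)
chain = metropolis_hastings(log_gamma, np.zeros(2), 10**4)
```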

Simple Examples to Demonstrate Concepts

Fit a one-dimensional normal distribution $p(\cdot \mid \mu, \sigma) = \mathcal{N}(\cdot \mid \mu, \sigma^2)$ to $10^5$ points drawn according to:
1. $X_i \sim \mathcal{N}(0, 1)$
2. $X_i \sim \log \mathcal{N}(0, 1)$

Figure 1: $10^4$ iterations of vanilla MH. This strategy performs well per iteration, but each iteration is slow; shown for reference against the later results.

Computation is expensive!

The log acceptance ratio,

$$\log \alpha(\theta, \theta') = \underbrace{\log \frac{p(\theta')\, q(\theta \mid \theta')}{p(\theta)\, q(\theta' \mid \theta)}}_{O(1)} + \underbrace{n \left[ \ell(\theta') - \ell(\theta) \right]}_{O(n)},$$

requires $O(n)$ calculations to compute exactly. Two straightforward ideas:
1. Divide the dataset into tractable batches
2. Approximate the acceptance ratio with a subset of the dataset

Divide-and-conquer approaches

Suppose the data $X$ are divided into $B$ batches $X_1, \ldots, X_B$. The posterior can be equivalently written as

$$p(\theta \mid X) \propto \prod_{b=1}^{B} p(\theta)^{1/B}\, p(X_b \mid \theta).$$

One strategy (from Neiswanger et al.) is to run a chain on each batch of data separately to obtain a smooth approximation to each subposterior $\pi_b(\theta) \propto p(\theta)^{1/B} p(X_b \mid \theta)$, then approximate the full posterior by multiplying the smooth approximations.
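
A minimal sketch of this strategy under the simplest (Gaussian/parametric) recombination rule considered by Neiswanger et al., reusing `metropolis_hastings` from the sketch after Algorithm 1; `log_prior` and `log_lik` are hypothetical model callbacks, and a kernel density recombination would replace `combine_gaussian`.

```python
import numpy as np

def subposterior_chains(x, B, log_prior, log_lik, theta0, n_iter):
    # each chain targets pi_b(theta) proportional to p(theta)^(1/B) p(X_b | theta)
    chains = []
    for xb in np.array_split(x, B):            # run these in parallel in practice
        lg = lambda th, xb=xb: log_prior(th) / B + log_lik(th, xb)
        chains.append(metropolis_hastings(lg, theta0, n_iter))
    return chains

def combine_gaussian(chains):
    # the product of B Gaussians N(m_b, S_b) is Gaussian with precision
    # sum_b inv(S_b) and mean S @ sum_b inv(S_b) m_b
    precs = [np.linalg.inv(np.atleast_2d(np.cov(c.T))) for c in chains]
    means = [c.mean(axis=0) for c in chains]
    S = np.linalg.inv(sum(precs))
    m = S @ sum(P @ mu for P, mu in zip(precs, means))
    return m, S
```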

Why is such an approach appealing?

- As the size of each batch goes to infinity, this is theoretically justified.
- Each chain is $O(n/B)$ per iteration, so one can get $B$ samples for the same price as one sample in the vanilla MH case.
- The chains can be run in parallel for faster computation. (A caveat: the full likelihood is also trivially parallelizable.)
- It should also be noted that the multiplication can be unstable (e.g., when the supports of the smooth approximations are disjoint); Wang and Dunson proposed using the Weierstrass transform to stabilize the approximation.

Further extensions have been pursued in the literature, but the authors are discouraged because of:
- Poor stability/scaling as the number of batches grows
- Theoretical guarantees that rest on unobtainable conditions
- Combination rules that are difficult to interpret

Pseudo-Marginal MH

An alternative to splitting the data into batches is the family of methods based on pseudo-marginal MH (PMMH). Suppose we have an unbiased, almost surely non-negative estimator $\hat\gamma(\theta)$ of the true target $\gamma(\theta)$. Then simply using the acceptance probability

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\hat\gamma(\theta')\, q(\theta \mid \theta')}{\hat\gamma(\theta)\, q(\theta' \mid \theta)}\right)$$

(where only $\hat\gamma(\theta')$ is computed each iteration and $\hat\gamma(\theta)$ is stored from the previous iteration) maintains the correct stationary distribution.
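
A minimal sketch of PMMH with a symmetric random-walk proposal, assuming a hypothetical helper `gamma_hat(theta, rng)` that returns a non-negative unbiased estimate of $\gamma(\theta)$; the stored estimate is recycled, never recomputed, which is exactly what keeps the chain exact (and, as the next slide notes, what makes it stick when an estimate happens to be large).

```python
import numpy as np

def pseudo_marginal_mh(gamma_hat, theta0, n_iter, step=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    g = gamma_hat(theta, rng)                  # stored and reused across iterations
    chain = np.empty((n_iter, theta.size))
    for k in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        g_prop = gamma_hat(prop, rng)          # only the proposal is re-estimated
        if rng.uniform() * g <= g_prop:        # accept w.p. min(1, g_prop / g)
            theta, g = prop, g_prop
        chain[k] = theta
    return chain
```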

Drawbacks of Pseudo-Marginal MH

- The asymptotic variance of an estimator based on PMMH is always larger than that of vanilla MH (but samples can be generated much faster!).
- High variance in $\hat\gamma(\theta)$ will cause the chain to get stuck, and mixing rates will quickly collapse (in some situations one can prove that the convergence rate to the invariant distribution drops from geometric to subgeometric).
- PMMH requires $\hat\gamma(\theta)$ to be unbiased; in tall data it is more natural for $\log \hat\gamma(\theta)$ to be unbiased.

Unbiased estimators for tall data

In tall data, it is common to use an unbiased estimator of the log-likelihood (e.g., as in stochastic gradient descent), typically written

$$n \hat\ell(\theta) = \frac{n}{t} \sum_{i=1}^{t} \log p(x_i^* \mid \theta),$$

where the $x_i^*$ are randomly sampled from $X$. Naively using $\hat\gamma(\theta) = \exp(n \hat\ell(\theta))$ would yield a biased estimator.
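
For concreteness, a sketch of this estimator, where `log_p(x_batch, theta)` is a hypothetical vectorized per-datum log-likelihood; by Jensen's inequality, $E[\exp(n\hat\ell(\theta))] \geq \exp(n\ell(\theta))$, which is the bias referred to above.

```python
import numpy as np

def log_lik_hat(theta, x, t, log_p, rng):
    # unbiased estimate of n * l(theta) from a minibatch of size t
    idx = rng.choice(x.shape[0], size=t, replace=True)
    return x.shape[0] / t * np.sum(log_p(x[idx], theta))
```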

Unbiased estimators for tall data

Rhee and Glynn proposed a method to create an unbiased, non-negative estimator of $\gamma(\theta)$. Suppose the per-datum log-likelihoods are uniformly lower bounded, $\ell_i(\theta) > a(\theta)$ for all $i$, and let

$$D_j = \frac{n}{t} \sum_{i=1}^{t} \log p(x_{i,j}^* \mid \theta) - n\, a(\theta),$$

where each $D_j$ is computed from its own minibatch. Draw a truncation level $J$ from a distribution with $P(J \geq k) = (1+e)^{-k}$; then

$$\hat\gamma(\theta) = e^{n a(\theta)} \left( 1 + \sum_{k=1}^{J} \frac{\prod_{j=1}^{k} D_j}{P(J \geq k)\, k!} \right)$$

is an unbiased estimator of $e^{n \ell(\theta)}$.
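
A minimal sketch of this construction, reusing `log_lik_hat` from the previous sketch; `a` is the assumed lower bound, as a function of $\theta$. With $P(J \geq k) = (1+e)^{-k}$, $J$ is geometric, and since every $\ell_i(\theta) > a(\theta)$, each $D_j$ is non-negative, hence so is the estimator.

```python
import numpy as np
from math import e, factorial

def gamma_hat_rr(theta, x, t, log_p, a, rng):
    # randomized truncation of exp(n(l - a)) = sum_k D^k / k!, with each D^k
    # replaced by a product of independent minibatch estimates and each term
    # reweighted by 1 / P(J >= k) (Russian roulette) to stay unbiased
    n = x.shape[0]
    J = rng.geometric(e / (1.0 + e)) - 1       # gives P(J >= k) = (1+e)^(-k)
    total, prod = 1.0, 1.0
    for k in range(1, J + 1):
        D = log_lik_hat(theta, x, t, log_p, rng) - n * a(theta)  # fresh minibatch
        prod *= D
        total += prod * (1.0 + e) ** k / factorial(k)
    # unbiased for exp(n * l(theta)); numerically this over/underflows for
    # large n, one symptom of the impracticality discussed on the next slide
    return np.exp(n * a(theta)) * total
```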

Is this feasible to improve MH? In the naive case, no: it would require $O(n)$ minibatches per sample to control the variance. Recent methods have attempted to use control variates to better control the variance and make the scheme feasible, but they have not yet been extended to a fully general algorithm.

Firefly Monte Carlo

Maclaurin and Adams presented an MCMC method using an extended target distribution. Letting $b_i(\theta)$ lower bound $\ell_i(\theta)$, the extended target is

$$\pi(\theta, z) \propto p(\theta) \prod_{i=1}^{n} \exp(b_i(\theta)) \left[ \exp(\ell_i(\theta) - b_i(\theta)) - 1 \right]^{z_i}.$$

If $b_i(\theta)$ is cheap to compute, the likelihood only has to be evaluated where $z_i = 1$, which can be sparse. Note that this can be interpreted as a pseudo-marginal MH sampler with $p(x_i \mid \theta) = \sum_{z_i \in \{0,1\}} p(x_i, z_i \mid \theta)$, so that $\hat\gamma(\theta) = p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta, z_i)$ is an unbiased estimator (details leading up to Eq. 13 of the paper).
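
A minimal sketch of the two conditional updates this implies, assuming hypothetical vectorized callbacks `log_lik_i(theta, x)` (the exact $\ell_i$) and `b_i(theta, x)` (the cheap lower bound); the real algorithm resamples only a subset of $z$ per iteration so that that step stays cheap too.

```python
import numpy as np

def resample_z(theta, x, log_lik_i, b_i, rng):
    # p(z_i = 1 | theta, x) = 1 - exp(b_i(theta) - l_i(theta))
    p_bright = 1.0 - np.exp(b_i(theta, x) - log_lik_i(theta, x))
    return rng.uniform(size=x.shape[0]) < p_bright

def log_gamma_given_z(theta, x, z, log_lik_i, b_i, log_prior):
    # log of the extended target at fixed z: the exact likelihood is touched
    # only at the bright points (z_i = 1)
    bright = x[z]
    bound_sum = np.sum(b_i(theta, x))  # cheap if the bounds have a collapsed form
    corr = np.sum(np.log(np.expm1(log_lik_i(theta, bright) - b_i(theta, bright))))
    return log_prior(theta) + bound_sum + corr
```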

Does Firefly MC solve the variance problem? If $b_i(\theta)$ is a tight bound on $\ell_i(\theta)$, yes. In general, however, it does not.

Figure 2: $10^4$ iterations of Firefly MC; 10% of data points are used per iteration on average.

Stochastic Gradient Langevin Dynamics

The now well-known SGLD approach has updates

$$\theta_{k+1} = \theta_k + \epsilon_{k+1} \left[ \nabla \log p(\theta_k) + \frac{n}{t} \sum_{i=1}^{t} \nabla \log p(x_{i,k}^* \mid \theta_k) \right] + \sqrt{2\, \epsilon_{k+1}}\; \eta_{k+1},$$

where the $x_{i,k}^*$ are chosen randomly from $X$ and $\eta_k$ is drawn from a standard normal. By choosing a decreasing sequence of step sizes $\epsilon_k$, this algorithm can provide a consistent estimate of the target. The authors also note the similarity to the Metropolis-adjusted Langevin algorithm (MALA), where an MH step is run after each update to leave the chain invariant regardless of the step size.
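
A minimal sketch of one SGLD update matching the equation above; `grad_log_prior(theta)` and `grad_log_p(theta, x_batch)` (per-datum gradients, stacked row-wise) are hypothetical model callbacks.

```python
import numpy as np

def sgld_step(theta, x, t, eps, grad_log_prior, grad_log_p, rng):
    idx = rng.choice(x.shape[0], size=t, replace=True)  # minibatch
    drift = (grad_log_prior(theta)
             + x.shape[0] / t * grad_log_p(theta, x[idx]).sum(axis=0))
    noise = np.sqrt(2.0 * eps) * rng.standard_normal(theta.shape)
    return theta + eps * drift + noise
```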

Figure 3: $10^4$ iterations of SGLD.

Approximate Subsampling Approaches

Instead of calculating the acceptance ratio exactly (or using pseudo-marginal methods), the acceptance ratio could be approximated. Note that an unbiased estimator of the log-likelihood difference $\ell(\theta') - \ell(\theta)$ is easy to obtain:

$$\Lambda_t(\theta, \theta') \triangleq \frac{1}{t} \sum_{i=1}^{t} \log \frac{p(x_i^* \mid \theta')}{p(x_i^* \mid \theta)}.$$

These strategies attempt to use such approximations to build a well-performing MCMC method.

Using the Central Limit Theorem

If the log-likelihood ratio estimate were normal with known variance and unbiased, it would be straightforward to construct a chain that still targets $\pi$. Instead, one can build an adaptive method from t-tests. Note that the proposed $\theta'$ is accepted when

$$\frac{1}{n} \sum_{i=1}^{n} \log \frac{p(x_i \mid \theta')}{p(x_i \mid \theta)} > \frac{1}{n} \log u - \frac{1}{n} \log \frac{p(\theta')\, q(\theta \mid \theta')}{p(\theta)\, q(\theta' \mid \theta)},$$

where $u \sim \text{Uniform}(0, 1)$. The minibatch used to estimate the left-hand side is expanded until a t-test makes the decision with a given confidence; this scheme is denoted Austerity MH.
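
A minimal sketch of the Austerity MH accept decision; the batch schedule and significance level are illustrative assumptions, `log_p` is the hypothetical vectorized per-datum log-likelihood used earlier, and `log_prior_q_ratio` stands for $\log\left[p(\theta')\, q(\theta \mid \theta') / (p(\theta)\, q(\theta' \mid \theta))\right]$.

```python
import numpy as np
from scipy import stats

def austerity_accept(theta, prop, x, log_p, log_prior_q_ratio, u,
                     eps=0.05, batch=100, rng=None):
    rng = rng or np.random.default_rng(0)
    n = x.shape[0]
    mu0 = (np.log(u) - log_prior_q_ratio) / n  # threshold from the slide above
    perm = rng.permutation(n)
    diffs = np.empty(0)
    start = 0
    while start < n:
        idx = perm[start:start + batch]
        diffs = np.concatenate([diffs, log_p(x[idx], prop) - log_p(x[idx], theta)])
        start += batch
        if diffs.size >= n:                    # exhausted the data: exact decision
            break
        tstat = (diffs.mean() - mu0) / (diffs.std(ddof=1) / np.sqrt(diffs.size))
        pval = 2 * stats.t.sf(abs(tstat), df=diffs.size - 1)
        if pval < eps:                         # t-test is confident: stop early
            break
    return diffs.mean() > mu0
```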

Figure 4: $10^4$ iterations of Austerity MH.

Figure 5: Histogram of 1,000 realizations of the Student statistic required by Austerity MH.

Confidence samplers

Suppose one knows a bound on the range

$$C_{\theta,\theta'} \triangleq \max_{1 \leq i \leq n} \left| \log \frac{p(x_i \mid \theta')}{p(x_i \mid \theta)} \right|.$$

Then, using concentration inequalities of the form

$$P\left( \left| \Lambda_t(\theta, \theta') - \left[\ell(\theta') - \ell(\theta)\right] \right| \leq c_t(\delta) \right) \geq 1 - \delta,$$

one can design sampling schemes in which the minibatch size $t$ is chosen automatically such that the right accept/reject decision is taken with probability at least $1 - \delta$. The bound $c_t(\delta)$ can be defined in a number of ways. Unfortunately, this typically still requires $O(n)$ likelihood evaluations per iteration (i.e., the required minibatch size grows with the data size).
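
A minimal sketch of one confidence-sampler decision, using a Hoeffding-style bound $c_t(\delta) = C_{\theta,\theta'} \sqrt{2 \log(2/\delta)/t}$ as one standard choice (the paper derives sharper, without-replacement versions); `psi` collects the $O(1)$ threshold terms, $\psi = \frac{1}{n}\log\left[u\, p(\theta)\, q(\theta' \mid \theta) / (p(\theta')\, q(\theta \mid \theta'))\right]$.

```python
import numpy as np

def confidence_accept(theta, prop, x, log_p, psi, C, delta=0.05,
                      batch=100, rng=None):
    rng = rng or np.random.default_rng(0)
    n = x.shape[0]
    perm = rng.permutation(n)
    diffs = np.empty(0)
    start = 0
    while True:
        idx = perm[start:start + batch]
        diffs = np.concatenate([diffs, log_p(x[idx], prop) - log_p(x[idx], theta)])
        start += batch
        if diffs.size >= n:
            return diffs.mean() > psi          # full data: the exact MH decision
        c = C * np.sqrt(2.0 * np.log(2.0 / delta) / diffs.size)
        if abs(diffs.mean() - psi) > c:        # decision correct w.p. >= 1 - delta
            return diffs.mean() > psi
```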

Figure 6: $10^4$ iterations of the vanilla confidence sampler.

Improved Confidence Samplers

Essentially: control variates reduce the variance and help tighten the bounds, and so can greatly improve performance, while the correct target is still maintained. A simple example proxy is a second-order Taylor expansion around a reference point $\theta_*$:

$$\hat\ell_i(\theta) = \ell_i(\theta_*) + g_{i,*}^T (\theta - \theta_*) + \frac{1}{2} (\theta - \theta_*)^T H_{i,*} (\theta - \theta_*),$$

and these quantities can be precomputed to create a cheap global approximation $\hat\ell(\theta) = \frac{1}{n} \sum_{i=1}^n \hat\ell_i(\theta)$. The estimator becomes

$$\Lambda_t(\theta, \theta') = \hat\ell(\theta') - \hat\ell(\theta) + \frac{1}{t} \sum_{i=1}^{t} \left[ \log \frac{p(x_i^* \mid \theta')}{p(x_i^* \mid \theta)} - \hat\ell_i(\theta') + \hat\ell_i(\theta) \right],$$

with the proxies inside the sum evaluated at the subsampled points $x_i^*$. Under ideal conditions, the minibatch size can drop below $O(n)$, even down to $O(1)$.
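
A minimal sketch of the proxy-corrected estimator above, assuming per-datum values `l_star` ($\ell_i(\theta_*)$), gradients `g`, and Hessians `H` precomputed at the reference point $\theta_*$ (hypothetical array names); in practice only their sums are needed for the global term, so the cheap part costs $O(1)$ in $n$.

```python
import numpy as np

def proxy_all(theta, theta_star, l_star, g, H):
    # second-order Taylor expansions l_hat_i(theta), for all i at once
    d = theta - theta_star
    return l_star + g @ d + 0.5 * np.einsum('i,kij,j->k', d, H, d)

def lambda_hat(theta, prop, idx, x, log_p, theta_star, l_star, g, H):
    # control-variate estimator: cheap global difference of the proxies plus a
    # minibatch correction on the subsampled indices idx
    lh_prop = proxy_all(prop, theta_star, l_star, g, H)
    lh_cur = proxy_all(theta, theta_star, l_star, g, H)
    corr = (log_p(x[idx], prop) - log_p(x[idx], theta)
            - (lh_prop[idx] - lh_cur[idx]))
    return (lh_prop.mean() - lh_cur.mean()) + corr.mean()
```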

Figure 7: $10^4$ iterations of the improved confidence sampler.

Experiments

Experiments demonstrate the improved confidence sampler on logistic regression and gamma regression. The datasets used are:
- Toy datasets, with sizes $\log_{10} n \in \{3, 4, 5, 6, 7\}$
- The covtype.binary dataset, with 581,013 points, 10 quantitative attributes, and binary classes
- covtype, where the horizontal distance to the nearest wildfire ignition point is regressed onto the other quantitative features

Figure 8: $10^4$ iterations of confidence MH with a single Taylor proxy, applied to a synthetic logistic regression dataset, versus $n$.

Figure 9: $10^4$ iterations of confidence MH with a single Taylor proxy, applied to a synthetic logistic regression dataset, versus $n$.

Figure 10: Results of 5 runs of a confidence sampler with Taylor proxies dropped every 10 iterations, applied to logistic regression on covtype.

Figure 11: Results of 5 runs of a confidence sampler with Taylor proxies dropped every 10 iterations, applied to gamma linear regression on covtype.

Discussion

According to the authors, the current state of the field is:
- Divide-and-conquer approaches have not yet solved the recombination problem.
- Subsampling approaches do not necessarily approach the right target at the right speed, and are not necessarily cheaper than $O(n)$ per effective sample.
- The effective methods described here (with per-iteration cost as low as $O(1)$) have only been observed to work in contexts where the Bernstein-von Mises approximation is already excellent.