MCMC for big data. Geir Storvik. BigInsight lunch - May


2 Outline
- Why ordinary MCMC is not scalable
- Different approaches for making MCMC scalable
- Summary/status

3 Big data and statistics
Jordan et al. (2013), On statistics, computation and scalability: "gatherers of large-scale data are often forced to turn to ad hoc procedures that perhaps do provide algorithmic guarantees but which may provide no statistical guarantees and which in fact may have poor or even disastrous statistical properties"
Statistical "solutions":
- Better algorithms for "standard" optimal solutions
- Embarrassingly parallel methods
- Bootstrapping
- Bagging/random forest
- Divide-and-conquer methods (biglm package in R)
- Dynamic updating (Kalman filtering, particle filters)
- Sub-sampling (stochastic approximation)
- Alternative procedures that are both computationally and statistically efficient: Bag of Little Bootstraps (Kleiner et al., 2014)
Bayesian (MCMC-based) methods: left behind!

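4 Computing expectations
Complicated expectations are needed in many statistical inference settings.
- ML estimation with latent variables:
$$L(\theta) = p(y \mid \theta) = \int_z p(y \mid z; \theta)\, p(z \mid \theta)\, dz = E_{p(z \mid \theta)}[p(y \mid Z; \theta)]$$
- Bayesian statistics:
$$\hat\theta = E_{p(\theta \mid y)}[\theta \mid y]$$
Markov chain Monte Carlo (Bayesian setting): simulate a Markov chain $\theta_1, \theta_2, \ldots$
- Exact MCMC:
$$\theta_m \xrightarrow{D} p(\theta \mid y) \text{ as } m \to \infty, \qquad \frac{1}{M}\sum_{m=1}^{M} \theta_m \to E_{p(\theta \mid y)}[\theta \mid y] \text{ as } M \to \infty$$
- Approximate MCMC:
$$\theta_m \xrightarrow{D} \tilde p(\theta \mid y) \text{ as } m \to \infty, \qquad \|\tilde p(\theta \mid y) - p(\theta \mid y)\| \le \varepsilon$$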

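5 Metropolis-Hastings
Algorithm:
- Generate $\theta^* \sim q(\cdot \mid \theta_{i-1})$
- Calculate
$$\alpha = \min\left\{1, \frac{\pi(\theta^* \mid y)\, q(\theta_{i-1} \mid \theta^*)}{\pi(\theta_{i-1} \mid y)\, q(\theta^* \mid \theta_{i-1})}\right\}$$
- Put $\theta_i = \theta^*$ with probability $\alpha$; $\theta_i = \theta_{i-1}$ otherwise.
For transition density $P(\theta^* \mid \theta)$, detailed balance holds:
$$p(\theta \mid y)\, P(\theta^* \mid \theta) = p(\theta^* \mid y)\, P(\theta \mid \theta^*)$$
Calculation of $\alpha$ (independent case):
$$\frac{\pi(\theta^* \mid y)}{\pi(\theta_{i-1} \mid y)} = \frac{\pi(\theta^*)\, p(y \mid \theta^*)}{\pi(\theta_{i-1})\, p(y \mid \theta_{i-1})} \overset{\text{ind}}{=} \frac{\pi(\theta^*) \prod_{i=1}^{n} p(y_i \mid \theta^*)}{\pi(\theta_{i-1}) \prod_{i=1}^{n} p(y_i \mid \theta_{i-1})}$$
For big data: the product is too time/memory-consuming.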
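To make the cost of the product concrete, here is a minimal sketch of random-walk Metropolis-Hastings, assuming a toy model that is not in the slides: $y_i \sim N(\theta, 1)$ with a $N(0, 10^2)$ prior. The function names, step size and seeds are illustrative choices. The proposal is symmetric, so $q$ cancels in $\alpha$, and the ratio is computed on the log scale.

```python
import math
import random

def log_posterior(theta, data, prior_sd=10.0):
    """Unnormalised log posterior for the assumed toy model."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    # This sum touches all n observations: the O(n) cost per iteration.
    log_lik = sum(-0.5 * (y - theta) ** 2 for y in data)
    return log_prior + log_lik

def metropolis_hastings(data, n_iter=5000, step=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)  # symmetric random-walk proposal
        candidate = log_posterior(proposal, data)
        # Accept with probability alpha = min(1, exp(candidate - current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        chain.append(theta)
    return chain

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
chain = metropolis_hastings(data)
kept = chain[1000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
# Closed-form posterior mean of the conjugate toy model, for comparison.
exact_mean = sum(data) / (len(data) + 1.0 / 10.0 ** 2)
```

Every iteration calls `log_posterior` once on the full data set, which is the step that stops scaling when $n$ is large.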

6 Alternatives
- Change estimator
  - Approximate Bayesian Computation (ABC)
  - Variational Bayes
- Alternative MCMC methods (Bardenet et al., 2017)
  - Divide-and-conquer methods
  - Exact sub-sampling methods
  - Approximate sub-sampling methods
  - (Methods dynamically including more data)

7 Divide-and-conquer methods
Procedure:
- Split the data into a large number of smaller (possibly overlapping) data sets
- Perform inference on each smaller data set
- Combine the results
Neiswanger et al. (2013); Scott et al. (2016); Wang and Dunson (2013); Li et al. (2017); Minsker et al. (2014)
Properties:
- Computation only on smaller data sets
- Easy to run in parallel
- Separation leads to inexact results
- Some asymptotic results are available, but in the limit simple Laplace approximations are better and easier (?)

8 Consensus Monte Carlo (Scott et al., 2016)
Assume independent blocks $y_1, \ldots, y_S$:
$$p(\theta \mid y) \propto \prod_{s=1}^{S} p_s(\theta \mid y_s) \propto \prod_{s=1}^{S} p(y_s \mid \theta)\, p(\theta)^{1/S}$$
- Simulate $\theta_{s1}, \ldots, \theta_{sG}$ from $p_s(\theta \mid y_s) \propto p(y_s \mid \theta)\, p(\theta)^{1/S}$
- Combine:
$$\theta_g = \Big(\sum_s W_s\Big)^{-1} \sum_s W_s\, \theta_{sg}$$
Properties:
- Exact if $p_s(\theta \mid y_s)$, $s = 1, \ldots, S$, are Gaussian (with $W_s = \mathrm{Var}_{p_s(\theta \mid y_s)}[\theta]^{-1}$)
- Approximate in general
- When a choice of model complexity is involved: how do the prior $p(\theta)^{1/S}$ and the subset of data influence complexity?
- Alternative: $p_s(\theta \mid y_s) \propto p(y_s \mid \theta)^{S}\, p(\theta)$
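The combination step can be sketched as follows, assuming a scalar Gaussian toy model ($y_i \sim N(\theta, 1)$, prior $N(0, 100)$) in which each subposterior is available in closed form. Exact draws stand in for the per-block MCMC runs here, and the weights $W_s$ are taken as inverse sample variances; the function names and settings are illustrative, not from Scott et al.

```python
import random
import statistics

def subposterior_draws(block, n_blocks, n_draws, rng, prior_var=100.0):
    """Exact draws from p_s(theta | y_s), proportional to
    p(y_s | theta) * p(theta)^(1/S), for the assumed Gaussian toy model.
    Stand-in for a per-block MCMC run."""
    precision = len(block) + 1.0 / (n_blocks * prior_var)
    mean = sum(block) / precision
    sd = precision ** -0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)]

def consensus_combine(draws_per_block):
    """theta_g = (sum_s W_s)^-1 * sum_s W_s * theta_sg, W_s = 1 / sample variance."""
    weights = [1.0 / statistics.variance(d) for d in draws_per_block]
    total = sum(weights)
    n_draws = len(draws_per_block[0])
    return [sum(w * d[g] for w, d in zip(weights, draws_per_block)) / total
            for g in range(n_draws)]

rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(1000)]
S = 10
blocks = [data[s::S] for s in range(S)]  # split into S disjoint blocks
draws = [subposterior_draws(b, S, 2000, rng) for b in blocks]
combined = consensus_combine(draws)

# Full-data posterior of the toy model, for comparison.
full_precision = len(data) + 1.0 / 100.0
full_mean = sum(data) / full_precision
```

In this Gaussian case the combined draws should match the full-data posterior in mean and variance up to Monte Carlo error, which is the "exact if Gaussian" property; for non-Gaussian subposteriors the same code only gives an approximation.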

9 Delayed acceptance (Banterle et al., 2015)
We have $\alpha(\theta, \theta^*) = \min\{1, \rho(\theta, \theta^*)\}$, where
$$\rho(\theta, \theta^*) = \frac{p(\theta^*)}{p(\theta)} \prod_{i=1}^{n} \frac{p(y_i \mid \theta^*)}{p(y_i \mid \theta)} = \frac{p(\theta^*)}{p(\theta)} \prod_{s=1}^{S} \frac{p(y_s \mid \theta^*)}{p(y_s \mid \theta)}$$
Delayed acceptance: accept with probability
$$\prod_{s=1}^{S} \min\left\{1, \left[\frac{p(\theta^*)}{p(\theta)}\right]^{1/S} \frac{p(y_s \mid \theta^*)}{p(y_s \mid \theta)}\right\}$$
- Sequential procedure, evaluating only $p(y_s \mid \theta^*) / p(y_s \mid \theta)$ at each step
- Possible gain when rejecting, not when accepting
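A minimal sketch of the sequential accept/reject procedure, assuming a symmetric random-walk proposal and the same kind of Gaussian toy model as a stand-in ($y_i \sim N(\theta, 1)$, prior $N(0, 100)$). All names and tuning values are illustrative. Each block factor is tested in turn and the loop stops at the first rejection, so rejected proposals touch only part of the data.

```python
import math
import random

def log_prior(theta, prior_var=100.0):
    return -0.5 * theta ** 2 / prior_var

def log_lik_block(theta, block):
    return sum(-0.5 * (y - theta) ** 2 for y in block)

def delayed_acceptance_step(theta, proposal, blocks, rng):
    """Return (new_theta, number_of_blocks_evaluated)."""
    S = len(blocks)
    prior_part = (log_prior(proposal) - log_prior(theta)) / S
    for used, block in enumerate(blocks, start=1):
        log_rho_s = (prior_part + log_lik_block(proposal, block)
                     - log_lik_block(theta, block))
        # Pass factor s with probability min(1, rho_s); otherwise reject now.
        if rng.random() >= math.exp(min(0.0, log_rho_s)):
            return theta, used
    return proposal, S  # all S factors passed: accept

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(200)]
S = 10
blocks = [data[s::S] for s in range(S)]

theta, chain, evaluated = 0.0, [], 0
for _ in range(6000):
    proposal = theta + rng.gauss(0.0, 0.05)
    theta, used = delayed_acceptance_step(theta, proposal, blocks, rng)
    chain.append(theta)
    evaluated += used

kept = chain[1000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
exact_mean = sum(data) / (len(data) + 1.0 / 100.0)
mean_blocks_per_step = evaluated / len(chain)
```

`mean_blocks_per_step` records how many of the $S$ blocks were evaluated on average. Accepted proposals always cost all $S$ blocks, which matches the remark that the gain comes from rejections only.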

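10 Subsampling-based
Independent data:
$$\log p(y \mid \theta) = \sum_{i=1}^{n} \log p(y_i \mid \theta)$$
ML/gradient methods:
$$\hat\theta_{s+1} = \hat\theta_s + \gamma\,[\nabla \log p(\hat\theta_s) + \nabla \log p(y \mid \hat\theta_s)]$$
Big data:
$$\hat\theta_{s+1} = \hat\theta_s + \gamma_s\,[\nabla \log p(\hat\theta_s) + \nabla \widehat{\log p}(y \mid \hat\theta_s)], \qquad \widehat{\log p}(y \mid \theta) = \frac{n}{m} \sum_{j=1}^{m} \log p(y_{i_j} \mid \theta)$$
- $i_1, \ldots, i_m$ is a random subsample of $\{1, \ldots, n\}$, giving an unbiased estimate of $\log p(y \mid \theta)$.
- Stochastic gradient descent: convergence if $\sum_{s=1}^{\infty} \gamma_s = \infty$ and $\sum_{s=1}^{\infty} \gamma_s^2 < \infty$.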
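The unbiasedness of the rescaled subsample sum can be checked exactly on a tiny example by averaging the estimate over every subsample of size $m$. This is a sketch under assumptions: the six data values, $\theta = 0.5$ and the unit-variance Gaussian density are made up for illustration, and subsampling is taken to be uniform without replacement.

```python
import itertools
import math

def log_density(y, theta):
    """log N(y; theta, 1), the assumed per-observation model."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - theta) ** 2

def full_log_lik(data, theta):
    return sum(log_density(y, theta) for y in data)

def subsample_log_lik(data, theta, indices):
    """(n / m) * sum_j log p(y_{i_j} | theta) for a subsample i_1, ..., i_m."""
    n, m = len(data), len(indices)
    return n / m * sum(log_density(data[i], theta) for i in indices)

data = [0.3, -1.2, 2.5, 0.9, 1.7, -0.4]  # hypothetical observations
theta = 0.5
m = 2

# Enumerate all equally likely subsamples of size m and average the estimates.
subsets = list(itertools.combinations(range(len(data)), m))
estimates = [subsample_log_lik(data, theta, idx) for idx in subsets]
mean_estimate = sum(estimates) / len(estimates)
exact = full_log_lik(data, theta)
```

`mean_estimate` agrees with `exact` up to floating-point error, while the individual entries of `estimates` vary around it; that spread is the noise the step sizes $\gamma_s$ have to average out.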

11 Pseudo-likelihood approach
Idea: replace $\alpha$ by
$$\hat\alpha = \min\left\{1, \frac{\hat\pi(\theta^* \mid y)\, q(\theta_{i-1} \mid \theta^*)}{\hat\pi(\theta_{i-1} \mid y)\, q(\theta^* \mid \theta_{i-1})}\right\}$$
- If $E[\hat\pi(\theta \mid y)] = \pi(\theta \mid y)$ and $\hat\pi$ is positive: convergence properties are preserved! (Beaumont, 2003; Andrieu et al., 2009)
- Question: how to construct $\hat\pi(\theta \mid y)$?
- Problem with subsampling: $\hat p(y \mid \theta) = \exp(\widehat{\log p}(y \mid \theta))$ is a biased estimate of $p(y \mid \theta)$.
- Jacob et al. (2015): without additional knowledge on $\log p(y \mid \theta)$ we cannot obtain positive, unbiased estimates $\hat p(y \mid \theta)$!
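The direction of the bias follows from Jensen's inequality. A short worked step, assuming only that $\widehat{\log p}(y \mid \theta)$ is an unbiased estimate of $\log p(y \mid \theta)$:

```latex
E\big[\hat p(y \mid \theta)\big]
  = E\big[\exp\big(\widehat{\log p}(y \mid \theta)\big)\big]
  \ge \exp\big(E\big[\widehat{\log p}(y \mid \theta)\big]\big)
  = \exp\big(\log p(y \mid \theta)\big)
  = p(y \mid \theta)
```

Because $\exp$ is strictly convex, the inequality is strict unless the subsample estimate is almost surely constant, so exponentiating the unbiased log-likelihood estimate overestimates the likelihood on average.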

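12 Firefly MCMC (Maclaurin and Adams, 2014)
$p(\theta \mid y) \propto p(\theta) \prod_i p(y_i \mid \theta)$ can be extended to
$$p(\theta, z \mid y) \propto p(\theta) \prod_i p(y_i \mid \theta) \prod_i \left[\frac{p(y_i \mid \theta) - B_i(\theta)}{p(y_i \mid \theta)}\right]^{z_i} \left[\frac{B_i(\theta)}{p(y_i \mid \theta)}\right]^{1 - z_i}$$
where $z_i \in \{0, 1\}$ and $0 < B_i(\theta) \le p(y_i \mid \theta)$.
- $p(\theta, z \mid y)$ has $p(\theta \mid y)$ as marginal.
Simulation:
$$p(\theta \mid y, z) \propto p(\theta) \prod_i [p(y_i \mid \theta) - B_i(\theta)]^{z_i}\, [B_i(\theta)]^{1 - z_i}$$
- Only requires evaluation of $p(y_i \mid \theta)$ for $z_i = 1$!
$$p(z \mid y, \theta) \propto \prod_i \left[\frac{p(y_i \mid \theta) - B_i(\theta)}{p(y_i \mid \theta)}\right]^{z_i} \left[\frac{B_i(\theta)}{p(y_i \mid \theta)}\right]^{1 - z_i}$$
- Simple binomial sampling
- Main benefit if $B_i(\theta) \approx p(y_i \mid \theta)$ and $B_i(\theta)$ is simple to calculate
- Enough to resample a (small) fraction of the $z_i$'s at each iteration.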
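That the extended target keeps $p(\theta \mid y)$ as its marginal can be checked term by term: summing the $i$-th factor over $z_i \in \{0, 1\}$ returns the original likelihood term. The same factor also gives the Bernoulli probability used when resampling $z_i$.

```latex
\sum_{z_i \in \{0, 1\}} p(y_i \mid \theta)
  \left[\frac{p(y_i \mid \theta) - B_i(\theta)}{p(y_i \mid \theta)}\right]^{z_i}
  \left[\frac{B_i(\theta)}{p(y_i \mid \theta)}\right]^{1 - z_i}
  = \big(p(y_i \mid \theta) - B_i(\theta)\big) + B_i(\theta)
  = p(y_i \mid \theta),
\qquad
P(z_i = 1 \mid y, \theta) = \frac{p(y_i \mid \theta) - B_i(\theta)}{p(y_i \mid \theta)}
```

When the bound is tight, $B_i(\theta) \approx p(y_i \mid \theta)$, the probability of $z_i = 1$ is close to zero, so only a few likelihood terms need to be evaluated per iteration.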

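13 Stochastic Gradient Langevin Dynamics (Welling and Teh, 2001)
Stochastic optimisation (convergence towards the mode):
$$\theta_{t+1} = \theta_t + \frac{\varepsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{n}{m} \sum_{i=1}^{m} \nabla \log p(y_{t_i} \mid \theta_t)\right)$$
Require $\sum_{t=1}^{\infty} \varepsilon_t = \infty$ and $\sum_{t=1}^{\infty} \varepsilon_t^2 < \infty$.
Langevin dynamics (convergence towards the posterior distribution):
$$\theta_{t+1} = \theta_t + \frac{\varepsilon}{2}\left(\nabla \log p(\theta_t) + \sum_{i=1}^{n} \nabla \log p(y_i \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim N(0, \varepsilon)$$
Stochastic gradient Langevin:
$$\theta_{t+1} = \theta_t + \frac{\varepsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{n}{m} \sum_{i=1}^{m} \nabla \log p(y_{t_i} \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim N(0, \varepsilon_t)$$
Require $\sum_{t=1}^{\infty} \varepsilon_t = \infty$ and $\sum_{t=1}^{\infty} \varepsilon_t^2 < \infty$.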
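The stochastic gradient Langevin update can be sketched as below for a scalar parameter, assuming the toy model $y_i \sim N(\theta, 1)$ with a $N(0, 100)$ prior. The polynomially decaying schedule $\varepsilon_t = a\,(b + t)^{-\gamma}$ with $\gamma = 0.55$ is one choice that satisfies the two step-size conditions; the values of `a`, `b`, `m` and the function names are illustrative assumptions.

```python
import math
import random

def grad_log_prior(theta, prior_var=100.0):
    return -theta / prior_var

def grad_log_lik(y, theta):
    return y - theta  # derivative in theta of log N(y; theta, 1)

def sgld(data, m=50, n_iter=5000, a=1e-3, b=10.0, gamma=0.55, seed=0):
    rng = random.Random(seed)
    n = len(data)
    theta, chain = 0.0, []
    for t in range(n_iter):
        eps = a * (b + t) ** (-gamma)   # decaying step size eps_t
        batch = rng.sample(data, m)     # random subsample of size m
        grad = grad_log_prior(theta) + n / m * sum(
            grad_log_lik(y, theta) for y in batch)
        noise = rng.gauss(0.0, math.sqrt(eps))  # eta_t ~ N(0, eps_t)
        theta = theta + 0.5 * eps * grad + noise
        chain.append(theta)
    return chain

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
chain = sgld(data)
kept = chain[500:]  # discard the initial transient
sgld_mean = sum(kept) / len(kept)
exact_mean = sum(data) / (len(data) + 1.0 / 100.0)
```

Dropping the `noise` term turns the loop into the stochastic optimisation recursion, which settles at the mode instead of exploring the posterior.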

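14 Noisy MCMC
Simplifying notation: $\pi(\theta \mid y) = \pi(\theta)$
"Standard" MCMC:
$$\sup_{\theta_0} \|\delta_{\theta_0} P^s - \pi\|_{TV} \le C \rho^s$$
What if we use $\hat P$ instead of $P$, where $\pi \hat P \ne \pi$?
Mitrophanov (2005); Alquier et al. (2016):
$$\|\delta_{\theta_0} P^s - \delta_{\theta_0} \hat P^s\|_{TV} \le \left(\lambda + \frac{C \rho^{\lambda}}{1 - \rho}\right) \|P - \hat P\|_{TV}, \qquad \text{where } \lambda = \left\lceil \frac{\log(1/C)}{\log(\rho)} \right\rceil$$
Note: independent of $s$!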
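As a small arithmetic illustration of the bound, the constant in front of $\|P - \hat P\|_{TV}$ can be evaluated for given $C$ and $\rho$. This sketch assumes $\lambda$ is the ratio $\log(1/C)/\log(\rho)$ rounded up to an integer; the values $C = 2$, $\rho = 0.9$ and a kernel perturbation of $0.01$ are hypothetical.

```python
import math

def perturbation_constant(C, rho):
    """Return (lambda, lambda + C * rho**lambda / (1 - rho))."""
    # Assumption: lambda is the ratio rounded up to the next integer.
    lam = math.ceil(math.log(1.0 / C) / math.log(rho))
    return lam, lam + C * rho ** lam / (1.0 - rho)

lam, constant = perturbation_constant(C=2.0, rho=0.9)

# Hypothetical kernel perturbation ||P - P_hat||_TV = 0.01.
bound = constant * 0.01
```

The constant does not involve $s$, so the same value of `bound` applies at every iteration, however long the noisy chain is run.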

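15 Noisy Metropolis-Hastings
- Draw $U \sim F(u \mid \theta^*)$
- Apply acceptance rate $\hat\alpha(\theta, \theta^*, u) \approx \alpha(\theta, \theta^*)$ within M-H.
Properties: assume
$$E_{F(u \mid \theta^*)} \big|\hat\alpha(\theta, \theta^*, u) - \alpha(\theta, \theta^*)\big| \le \delta(\theta, \theta^*)$$
Then
$$\|\delta_{\theta_0} P^s - \delta_{\theta_0} \hat P^s\| \le \left(\lambda + \frac{C \rho^{\lambda}}{1 - \rho}\right) \sup_{\theta} \int q(\theta^* \mid \theta)\, \delta(\theta, \theta^*)\, d\theta^*$$
Examples:
- Ignoring discretisation error in Langevin dynamics
- Using pseudo-likelihoods within Gibbs random fields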

16 Summary
- Many interesting approaches
- Some are exact but can have slow convergence
- Some are approximate, and their performance is difficult to evaluate
- (Most success is perhaps gained through use of sequential Monte Carlo as well!)
- "Standard" approaches for accelerating MCMC are also useful:
  - Simulated tempering
  - Adaptive MCMC
  - Multiple-try MCMC
  - Rao-Blackwellisation
- More research needed!

17 References
P. Alquier, N. Friel, R. Everitt, and A. Boland. Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Statistics and Computing, 26(1-2):29-47, 2016.
C. Andrieu, G. O. Roberts, et al. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2), 2009.
M. Banterle, C. Grazian, A. Lee, and C. P. Robert. Accelerating Metropolis-Hastings algorithms by delayed acceptance. arXiv preprint, 2015.
R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18(1), 2017.
M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3), 2003.
P. E. Jacob, A. H. Thiery, et al. On nonnegative unbiased estimators. The Annals of Statistics, 43(2), 2015.
M. I. Jordan et al. On statistics, computation and scalability. Bernoulli, 19(4), 2013.
A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4), 2014.
C. Li, S. Srivastava, and D. B. Dunson. Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3), 2017.
D. Maclaurin and R. P. Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In UAI, 2014.
S. Minsker, S. Srivastava, L. Lin, and D. Dunson. Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, 2014.
A. Y. Mitrophanov. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4), 2005.
W. Neiswanger, C. Wang, and E. Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint, 2013.
S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78-88, 2016.
X. Wang and D. B. Dunson. Parallelizing MCMC via Weierstrass sampler. arXiv preprint, 2013.
M. Welling and Y. W. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2001.
