Distributed Bayesian Learning with Stochastic Natural-gradient EP and the Posterior Server

Distributed Bayesian Learning with Stochastic Natural-gradient EP and the Posterior Server
In collaboration with: Minjie Xu, Balaji Lakshminarayanan, Leonard Hasenclever, Thibaut Lienart, Stefan Webb, Sebastian Vollmer, Charles Blundell

Bayesian Learning
- Parameter vector $X$.
- Data items $Y = y_1, y_2, \dots, y_N$.
- Model: $p(X, Y) = p(X) \prod_{i=1}^{N} p(y_i \mid X)$
- Aim: the posterior $p(X \mid Y) = \frac{p(X)\, p(Y \mid X)}{p(Y)}$
- Inference algorithms:
  - Variational inference: parametrise the posterior as $q_\theta$ and optimize $\theta$.
  - Markov chain Monte Carlo: construct samples $X_1, \dots, X_n \sim p(X \mid Y)$.
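To make the notation concrete, here is a minimal sketch (not from the talk; the logistic-regression likelihood is an assumption chosen to match the later experiments) of the unnormalised log posterior $\log p(X) + \sum_i \log p(y_i \mid X)$ that both variational inference and MCMC would target:

```python
# A minimal sketch (logistic-regression likelihood assumed, matching the later
# experiments): unnormalised log posterior log p(X) + sum_i log p(y_i | X).
import numpy as np

def log_joint(x, features, labels, prior_var=1.0):
    """x: (d,) parameter vector, features: (N, d), labels: (N,) in {0, 1}."""
    log_prior = -0.5 * np.sum(x ** 2) / prior_var        # N(0, prior_var * I) prior
    logits = features @ x
    # Bernoulli log-likelihood, written stably with logaddexp
    log_lik = np.sum(labels * logits - np.logaddexp(0.0, logits))
    return log_prior + log_lik
```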

Machine Learning on Distributed Systems
- Distributed storage
- Distributed computation
- Costly network communications
[Figure: data subsets $y_{1i}, y_{2i}, y_{3i}, y_{4i}$ stored across worker machines]

Parameter Server
- Parameter server [Ahmed et al 2012], DistBelief [Dean et al 2012].
- The parameter server holds the parameter $x$; each worker takes a copy $x_i = x$, applies updates to $x_i$ locally, and returns the change $\Delta x_i$ (its updated copy minus the copy it received) to the server.
[Figure: parameter server exchanging $x$ and $\Delta x_i$ with workers holding data subsets $y_{1i}, \dots, y_{4i}$]
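A minimal sketch of the parameter-server pattern described above, with hypothetical names (`ParameterServer`, `worker_step`, `grad_fn`) and plain SGD standing in for whatever update the workers actually run; the point is that only $x$ and $\Delta x_i$ cross the network:

```python
# Sketch of the parameter-server delta-update protocol (illustrative only).
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.x = np.zeros(dim)          # global parameter held by the server

    def pull(self):
        return self.x.copy()            # worker receives a copy x_i = x

    def push(self, delta):
        self.x += delta                 # server applies the worker's update Δx_i

def worker_step(server, grad_fn, n_local_steps=10, lr=0.01):
    x_local = server.pull()
    x_start = x_local.copy()
    for _ in range(n_local_steps):      # local computation, no communication
        x_local -= lr * grad_fn(x_local)
    server.push(x_local - x_start)      # communicate only Δx_i
```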

Embarrassingly Parallel MCMC Sampling
- Treat the worker subsets as independent inference problems.
- Collect samples $\{X_{ji}\}_{j=1,\dots,m;\ i=1,\dots,n}$ on the workers.
- Combine the samples together into $\{X_i\}_{i=1,\dots,n}$.
- Only communication is at the combination stage.
[Scott et al 2013, Neiswanger et al 2013, Wang & Dunson 2013, Stanislav et al 2014]
[Figure: workers holding data subsets $y_{1i}, \dots, y_{4i}$, each running its own sampler]
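For concreteness, a hedged sketch of one common combination rule for the worker samples $\{X_{ji}\}$: precision-weighted averaging in the spirit of consensus Monte Carlo [Scott et al 2013]. It implicitly assumes near-Gaussian subposteriors.

```python
# Precision-weighted combination of subposterior draws (illustrative sketch).
import numpy as np

def consensus_combine(worker_samples):
    """worker_samples: list of m arrays, each (n, d) of subposterior draws, d >= 2."""
    # Weight each worker by the inverse of its subposterior sample covariance.
    weights = [np.linalg.inv(np.cov(s, rowvar=False)) for s in worker_samples]
    total_w = sum(weights)
    combined = []
    for i in range(worker_samples[0].shape[0]):
        weighted_sum = sum(w @ s[i] for w, s in zip(weights, worker_samples))
        combined.append(np.linalg.solve(total_w, weighted_sum))
    return np.array(combined)           # approximate draws from the full posterior
```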

Embarrassingly Parallel MCMC Sampling
- Unclear how to combine worker samples well.
- Particularly if the local posteriors on worker machines do not overlap.
[Figure from Wang & Dunson]

Main Idea
- Identify regions of high (global) posterior probability mass.
- Shift each local posterior to agree with the high-probability region, and draw samples from these shifted posteriors.
- How to find the high-probability region?
  - Define it in terms of low-order moments.
  - Use information gained from local posterior samples (requiring only a small amount of communication).
[Figure from Wang & Dunson]

Tilting Local Posteriors
- Each worker machine $j$ has access only to its data subset:
  $$p_j(X \mid y_j) = p_j(X) \prod_{i=1}^{I} p(y_{ji} \mid X)$$
  where $p_j(X)$ is a local prior and $p_j(X \mid y_j)$ is the local posterior.
- Adapt the local priors $p_j(X)$ so that the local posteriors agree on certain moments:
  $$\mathbb{E}_{p_j(X \mid y_j)}[s(X)] = s_0 \quad \forall j$$
- Use expectation propagation (EP) [Minka 2001] to adapt the local priors.

Expectation Propagation
- If $N$ is large, worker $j$'s likelihood term $p(y_j \mid X)$ should be well approximated by a Gaussian:
  $$p(y_j \mid X) \approx q_j(X) = \mathcal{N}(X; \mu_j, \Sigma_j)$$
- Parameters are fit iteratively to minimize a KL divergence. The local posterior is
  $$p_j(X \mid y) \propto p(y_j \mid X)\, \underbrace{p(X) \prod_{k \neq j} q_k(X)}_{p_j(X)}$$
  and the update is
  $$q_j^{\text{new}}(\cdot) = \arg\min_{\mathcal{N}(\cdot;\, \mu, \Sigma)} \mathrm{KL}\big(p_j(\cdot \mid y) \,\big\|\, \mathcal{N}(\cdot;\, \mu, \Sigma)\, p_j(\cdot)\big)$$
- The optimal $q_j$ is such that the first two moments of $\mathcal{N}(\cdot;\, \mu, \Sigma)\, p_j(\cdot)$ agree with those of the local posterior $p_j(\cdot \mid y)$.
- Moments of the local posterior are estimated using MCMC sampling.
- At convergence, the first two moments of all local posteriors agree.
[Minka 2001]
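A hedged sketch of one EP site update in the style described above, restricted to diagonal Gaussians and with the MCMC step left abstract: `sample_tilted` is a hypothetical routine that draws from the local posterior given the cavity's natural parameters, and the damping factor is a common stabilisation not mentioned on the slide.

```python
# One EP site update with MCMC moment estimates (diagonal Gaussians, illustrative).
import numpy as np

def moments_to_natural(mean, var):
    prec = 1.0 / var
    return prec, prec * mean            # (precision, precision * mean)

def ep_site_update(prior_nat, site_nats, j, sample_tilted, damping=0.5):
    # Local prior p_j(X): prior times all other sites, combined in natural parameters.
    cav_prec = prior_nat[0] + sum(s[0] for k, s in enumerate(site_nats) if k != j)
    cav_pm   = prior_nat[1] + sum(s[1] for k, s in enumerate(site_nats) if k != j)

    # MCMC samples from the tilted / local posterior p_j(X | y_j).
    samples = sample_tilted(cav_prec, cav_pm)            # shape (n_samples, d)
    tilt_prec, tilt_pm = moments_to_natural(samples.mean(0), samples.var(0))

    # New site = moment-matched local posterior minus local prior, with damping.
    new_prec, new_pm = tilt_prec - cav_prec, tilt_pm - cav_pm
    old_prec, old_pm = site_nats[j]
    site_nats[j] = ((1 - damping) * old_prec + damping * new_prec,
                    (1 - damping) * old_pm + damping * new_pm)
```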

Posterior Server Architecture
[Figure: a posterior server connected over the network to several workers; each worker maintains a cavity distribution and an MCMC-based likelihood approximation over its local data]

Bayesian Logistic Regression
- Simulated dataset: dimension d = 20, # data items N = 1000.
- NUTS-based sampler.
- # workers m = 4, 10, 50.
- # MCMC iterations T = 1000, 1000, 10000.
- # EP iterations k indicated by vertical lines.
[Figure: convergence traces plotted against k·T·N/m (×10^3), one panel per worker count]

Bayesian Logistic Regression
- MSE of the posterior mean, as a function of the total # of iterations.
[Figure: MSE (log scale) vs k·T·m (×10^5), comparing SMS(s), SMS(a), SCOT, NEIS(p), NEIS(n), WANG]

Stochastic Natural-gradient EP
- EP has no guarantee of convergence.
- EP technically cannot handle stochasticity in the moment estimates.
- Long MCMC runs are needed for good moment estimates.
- EP fails for neural networks and other complex high-dimensional models.
- Stochastic Natural-gradient EP (SNEP):
  - An alternative variational algorithm to EP.
  - Convergent, even with Monte Carlo estimates of moments.
  - A double-loop algorithm [Welling & Teh 2001, Yuille 2002, Heskes & Zoeter 2002].

Demonstrative Example
[Figure: EP (500 samples) vs SNEP (500 samples)]

Comparison to Maximum Likelihood SGD
- Maximum likelihood via SGD:
  - DistBelief [Dean et al 2012]
  - Elastic-averaging SGD [Zhang et al 2015]
[Figure: Posterior Server architecture, as before]
- Separate likelihood approximations and states per worker.
- Worker parameters are not forced to be exactly the same.
- Each worker learns to approximate its own likelihood.
- This can be achieved without detailed knowledge from other workers.
- Diagonal Gaussian exponential family (see the sketch below).
- Variance estimates are important to learning.
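As an illustration of the diagonal-Gaussian exponential family mentioned above (a sketch, not the talk's Julia code): in natural parameters, the prior and the per-worker likelihood approximations combine by simple addition, which is what keeps the cavity and site computations cheap.

```python
# Diagonal-Gaussian exponential family: mean <-> natural parameter conversion.
import numpy as np

def to_natural(mean, var):
    return mean / var, -0.5 / var        # (eta1, eta2)

def to_mean(eta1, eta2):
    var = -0.5 / eta2
    return eta1 * var, var               # (mean, variance)

# Messages (prior, sites) combine by summing natural parameters:
prior = to_natural(np.zeros(3), np.ones(3))
site  = to_natural(np.array([1.0, -0.5, 2.0]), np.array([4.0, 2.0, 1.0]))
combined_eta1, combined_eta2 = prior[0] + site[0], prior[1] + site[1]
print(to_mean(combined_eta1, combined_eta2))   # mean and per-parameter variance
```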

Experiments on Distributed Bayesian Neural Networks
- Bayesian approach to learning a neural network: compute the parameter posterior given a complex neural-network likelihood.
- Diagonal-covariance Gaussian prior and exponential-family approximation.
- Two datasets and architectures: MNIST fully-connected, CIFAR10 convnet.
- Implementation in Julia.
- Workers are cores on a server.
- Sampler is stochastic gradient Langevin dynamics [Welling & Teh 2011], with Adagrad [Duchi et al 2011] / RMSprop [Tieleman & Hinton 2012] type adaptation (a sketch follows below).
- Evaluated on test accuracy.
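A hedged sketch of the worker-side sampler named above: an SGLD step [Welling & Teh 2011] with an RMSprop-style preconditioner. `grad_log_prior` and `grad_log_lik` are hypothetical model-specific functions, and the exact adaptation used in the talk may differ.

```python
# One preconditioned SGLD step (illustrative sketch).
import numpy as np

def sgld_step(theta, batch, N, grad_log_prior, grad_log_lik,
              v, step_size=1e-4, decay=0.99, eps=1e-5):
    # Unbiased stochastic gradient of the log posterior: minibatch term rescaled by N / |batch|.
    g = grad_log_prior(theta) + (N / len(batch)) * grad_log_lik(theta, batch)
    v[:] = decay * v + (1 - decay) * g ** 2            # RMSprop-style accumulator (updated in place)
    precond = 1.0 / (np.sqrt(v) + eps)
    # Langevin noise with per-coordinate variance step_size * precond.
    noise = np.random.randn(*theta.shape) * np.sqrt(step_size * precond)
    return theta + 0.5 * step_size * precond * g + noise
```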

MNIST 500x300: Varying the number of workers
[Figure: test error (%) vs epochs per worker, for 1, 2, 4, 8 workers and Adam]

MNIST 500x300: Power SNEP, varying the number of workers
[Figure: test error (%) vs epochs per worker, for 1, 2, 4, 8 workers]

MNIST 500x300: Comparison of distributed methods (8 workers)
[Figure: test error (%) vs epochs per worker, for SNEP, p-SNEP, A-SGD, EASGD, Adam]

MNIST Very Deep MLP (16 workers)
[Figure: test error (%) vs epochs per worker, for ASGD, EASGD, p-SNEP]

CIFAR10 ConvNet: Comparison of distributed methods (8 workers)
[Figure: test error (%) vs epochs per worker, for SNEP, p-SNEP, A-SGD, EASGD, Adam]

Concluding Remarks
- Novel distributed learning based on a combination of Monte Carlo and a convergent alternative to expectation propagation.
- A combination of variational and MCMC algorithms, advantageous over both pure variational and pure MCMC approaches.
- Being Bayesian can be computationally advantageous in the distributed setting.
- Thank you!