Approximate Inference using MCMC


9.520 Class 22. Ruslan Salakhutdinov, BCS and CSAIL, MIT

Plan

1. Introduction/Notation.
2. Examples of successful Bayesian models.
3. Basic Sampling Algorithms.
4. Markov chains.
5. Markov chain Monte Carlo algorithms.

References/Acknowledgements

- Chris Bishop's book: Pattern Recognition and Machine Learning, Chapter 11 (many figures are borrowed from this book).
- David MacKay's book: Information Theory, Inference, and Learning Algorithms, Chapters 29-32.
- Radford Neal's technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods.
- Zoubin Ghahramani's ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/~zoubin/icml04-tutorial.html
- Iain Murray's tutorial on Sampling Methods: http://www.cs.toronto.edu/~murray/teaching/

Basic Notation

- $P(x)$: probability of $x$
- $P(x\mid\theta)$: conditional probability of $x$ given $\theta$
- $P(x,\theta)$: joint probability of $x$ and $\theta$

Bayes Rule:
$$P(\theta\mid x) = \frac{P(x\mid\theta)P(\theta)}{P(x)}$$
where
$$P(x) = \int P(x,\theta)\,d\theta \qquad \text{(marginalization)}$$
I will use probability distribution and probability density interchangeably; which is meant should be obvious from the context.

Inference Problem

Given a dataset $D = \{x_1,\ldots,x_n\}$, Bayes Rule gives
$$P(\theta\mid D) = \frac{P(D\mid\theta)P(\theta)}{P(D)}$$
where $P(D\mid\theta)$ is the likelihood function of $\theta$, $P(\theta)$ is the prior probability of $\theta$, and $P(\theta\mid D)$ is the posterior distribution over $\theta$.

Computing the posterior distribution is known as the inference problem. But:
$$P(D) = \int P(D,\theta)\,d\theta$$
This integral can be very high-dimensional and difficult to compute.

Prediction

Given $D$, computing the conditional probability of a new observation $x$ requires the integral
$$P(x\mid D) = \int P(x\mid\theta, D)\,P(\theta\mid D)\,d\theta = \mathbb{E}_{P(\theta\mid D)}\big[P(x\mid\theta, D)\big],$$
which is sometimes called the predictive distribution. Computing the predictive distribution requires the posterior $P(\theta\mid D)$.

Model Selection

Compare model classes, e.g. $M_1$ and $M_2$. We need to compute the posterior probabilities given $D$:
$$P(M\mid D) = \frac{P(D\mid M)P(M)}{P(D)}$$
where
$$P(D\mid M) = \int P(D\mid\theta, M)\,P(\theta\mid M)\,d\theta$$
is known as the marginal likelihood or evidence.

Computational Challenges

- Computing marginal likelihoods often requires computing very high-dimensional integrals.
- Computing posterior distributions (and hence predictive distributions) is often analytically intractable.

In this class, we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference. First, let us look at some specific examples:
- Bayesian Probabilistic Matrix Factorization
- Bayesian Neural Networks
- Dirichlet Process Mixtures (last class)

Bayesian PMF

[Figure: a sparsely observed rating matrix $R$, with ratings 1-5 and many missing entries, factorized as $R \approx U^\top V$ into user features $U$ and movie features $V$.]

We have $N$ users, $M$ movies, and integer rating values from 1 to $K$. Let $r_{ij}$ be the rating of user $i$ for movie $j$, and let $U \in \mathbb{R}^{D\times N}$ and $V \in \mathbb{R}^{D\times M}$ be latent user and movie feature matrices. Goal: predict missing ratings.

Bayesian PMF

[Figure: graphical model with hyperpriors $\alpha_U, \alpha_V$, hyperparameters $\Theta_U, \Theta_V$, latent features $U_i$ ($i = 1,\ldots,N$) and $V_j$ ($j = 1,\ldots,M$), observed ratings $R_{ij}$, and noise $\sigma$.]

A probabilistic linear model with Gaussian observation noise. Likelihood:
$$p(r_{ij}\mid u_i, v_j, \sigma^2) = \mathcal{N}(r_{ij}\mid u_i^\top v_j, \sigma^2)$$
Gaussian priors over parameters:
$$p(U\mid \Theta_U) = \prod_{i=1}^{N} \mathcal{N}(u_i\mid \mu_U, \Sigma_U), \qquad p(V\mid \Theta_V) = \prod_{j=1}^{M} \mathcal{N}(v_j\mid \mu_V, \Sigma_V).$$
Conjugate Gaussian-inverse-Wishart priors are placed on the user and movie hyperparameters $\Theta_U = \{\mu_U, \Sigma_U\}$ and $\Theta_V = \{\mu_V, \Sigma_V\}$: a hierarchical prior.

Bayesian PMF

Predictive distribution: consider predicting a rating $r^*_{ij}$ for user $i$ and query movie $j$:
$$p(r^*_{ij}\mid R) = \int p(r^*_{ij}\mid u_i, v_j)\,\underbrace{p(U, V, \Theta_U, \Theta_V\mid R)}_{\text{posterior over parameters and hyperparameters}}\;d\{U, V\}\,d\{\Theta_U, \Theta_V\}$$
Exact evaluation of this predictive distribution is analytically intractable: the posterior distribution $p(U, V, \Theta_U, \Theta_V\mid R)$ is complicated and does not have a closed-form expression. We need to approximate.

Bayesian Neural Nets

Regression problem: given a set of i.i.d. observations $X = \{x_n\}_{n=1}^{N}$ with corresponding targets $D = \{t_n\}_{n=1}^{N}$.

[Figure: a two-layer network with inputs $x_0,\ldots,x_D$, hidden units $z_0,\ldots,z_M$, outputs $y_1,\ldots,y_K$, and weight matrices $w^{(1)}$ and $w^{(2)}$.]

Likelihood:
$$p(D\mid X, w) = \prod_{n=1}^{N} \mathcal{N}(t_n\mid y(x_n, w), \beta^2)$$
The mean is given by the output of the neural network:
$$y_k(x, w) = \sum_{j=0}^{M} w^{(2)}_{kj}\,\sigma\Big(\sum_{i=0}^{D} w^{(1)}_{ji} x_i\Big)$$
where $\sigma(x)$ is the sigmoid function. Gaussian prior over the network parameters: $p(w) = \mathcal{N}(0, \alpha^2 I)$.
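As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the network mean $y_k(x, w)$ above. The layer sizes and variable names are assumptions made for the example; the $i = 0$ and $j = 0$ bias terms are implemented by appending a constant unit to the input and hidden layers.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def network_mean(x, W1, W2):
    """y_k(x, w) = sum_j W2[k, j] * sigmoid(sum_i W1[j, i] * x_i)."""
    x = np.append(x, 1.0)   # bias input x_0 = 1
    z = sigmoid(W1 @ x)     # hidden units z_1..z_M
    z = np.append(z, 1.0)   # bias hidden unit z_0 = 1
    return W2 @ z           # outputs y_1..y_K

# Illustrative sizes: D=3 inputs, M=5 hidden units, K=2 outputs.
# Weights drawn from the Gaussian prior p(w) = N(0, alpha^2 I).
rng = np.random.default_rng(0)
alpha = 1.0
W1 = alpha * rng.standard_normal((5, 3 + 1))
W2 = alpha * rng.standard_normal((2, 5 + 1))
print(network_mean(np.array([0.5, -1.0, 2.0]), W1, W2))
```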

Bayesian Neural Nets

Likelihood:
$$p(D\mid X, w) = \prod_{n=1}^{N} \mathcal{N}(t_n\mid y(x_n, w), \beta^2)$$
Gaussian prior over parameters:
$$p(w) = \mathcal{N}(0, \alpha^2 I)$$
The posterior is analytically intractable:
$$p(w\mid D, X) = \frac{p(D\mid w, X)\,p(w)}{\int p(D\mid w, X)\,p(w)\,dw}$$
Remark: Radford Neal (1994) showed that, under certain conditions, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.

Undirected Models

$x$ is a binary random vector with $x_i \in \{+1, -1\}$:
$$p(x) = \frac{1}{Z}\exp\Big(\sum_{(i,j)\in E} \theta_{ij} x_i x_j + \sum_{i\in V} \theta_i x_i\Big),$$
where $Z$ is known as the partition function:
$$Z = \sum_{x} \exp\Big(\sum_{(i,j)\in E} \theta_{ij} x_i x_j + \sum_{i\in V} \theta_i x_i\Big).$$
If $x$ is 100-dimensional, we need to sum over $2^{100}$ terms. The sum might decompose (e.g. via a junction tree); otherwise we need to approximate. Remark: compare this to the marginal likelihood.
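To see why $Z$ is the bottleneck, here is a minimal sketch (my illustration, not from the slides) that computes the partition function of a tiny Ising-style chain by brute-force enumeration; the graph and parameter values are made up for the example, and the enumeration is feasible only because the model has four variables rather than a hundred.

```python
import itertools
import numpy as np

# Tiny chain model: 4 binary variables x_i in {+1, -1}, edges between
# neighbours, illustrative parameters theta.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
theta_edge = {e: 0.5 for e in edges}   # theta_ij
theta_node = np.full(n, 0.2)           # theta_i

def unnorm_logp(x):
    """log p~(x) = sum_{(i,j) in E} theta_ij x_i x_j + sum_i theta_i x_i."""
    s = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
    return s + float(theta_node @ x)

# Z sums over all 2^n configurations -- this explodes for n = 100.
Z = sum(np.exp(unnorm_logp(np.array(x)))
        for x in itertools.product([+1, -1], repeat=n))
print(f"Z = {Z:.4f}")
```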

Monte Carlo

[Figure: a density $p(z)$ and a function $f(z)$ whose expectation under $p$ we want.]

For most situations we will be interested in evaluating the expectation
$$\mathbb{E}[f] = \int f(z)\,p(z)\,dz.$$
We will use the notation $p(z) = \tilde{p}(z)/Z$: we can evaluate $\tilde{p}(z)$ pointwise, but cannot evaluate $Z$. For example:
- Posterior distribution: $P(\theta\mid D) = \frac{1}{P(D)}\,P(D\mid\theta)\,P(\theta)$
- Markov random fields: $P(z) = \frac{1}{Z}\exp(-E(z))$

Simple Monte Carlo

General idea: draw independent samples $\{z_1,\ldots,z_N\}$ from the distribution $p(z)$ to approximate the expectation:
$$\mathbb{E}[f] = \int f(z)\,p(z)\,dz \approx \frac{1}{N}\sum_{n=1}^{N} f(z_n) = \hat{f}$$
Note that $\mathbb{E}[f] = \mathbb{E}[\hat{f}]$, so the estimator $\hat{f}$ has the correct mean (it is unbiased). Its variance is
$$\mathrm{var}[\hat{f}] = \frac{1}{N}\,\mathbb{E}\big[(f - \mathbb{E}[f])^2\big].$$
Remark: the accuracy of the estimator does not depend on the dimensionality of $z$.
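A minimal sketch of the simple Monte Carlo estimator, with the target and test function chosen for illustration: $p(z)$ is a standard Gaussian and $f(z) = z^2$, so the true expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
z = rng.standard_normal(N)   # independent samples z_n ~ p(z) = N(0, 1)
f_hat = np.mean(z**2)        # (1/N) sum_n f(z_n), here f(z) = z^2
# E[z^2] = 1 exactly; the standard error shrinks as 1/sqrt(N),
# independent of the dimensionality of z.
print(f"estimate = {f_hat:.4f} (true value 1.0)")
```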

Simple Monte Carlo

In general:
$$\int f(z)\,p(z)\,dz \approx \frac{1}{N}\sum_{n=1}^{N} f(z_n), \qquad z_n \sim p(z)$$
Predictive distribution:
$$P(x\mid D) = \int P(x\mid\theta, D)\,P(\theta\mid D)\,d\theta \approx \frac{1}{N}\sum_{n=1}^{N} P(x\mid\theta_n, D), \qquad \theta_n \sim P(\theta\mid D)$$
Problem: it is hard to draw exact samples from $p(z)$.

Basic Sampling Algorithm

How do we generate samples from a simple non-uniform distribution, assuming we can generate samples from the uniform distribution? Define the cumulative distribution function
$$h(y) = \int_{-\infty}^{y} p(\hat{y})\,d\hat{y}.$$
Sample $z \sim U[0,1]$; then $y = h^{-1}(z)$ is a sample from $p(y)$.

Problem: computing the cumulative $h(y)$ is just as hard!
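A minimal sketch of this inverse-CDF method for a case where $h$ is tractable (an illustrative choice, not from the slides): the exponential distribution $p(y) = \lambda e^{-\lambda y}$ has CDF $h(y) = 1 - e^{-\lambda y}$, which inverts to $h^{-1}(z) = -\ln(1 - z)/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                           # rate of the exponential target
z = rng.uniform(0.0, 1.0, 100_000)  # z ~ U[0, 1]
y = -np.log(1.0 - z) / lam          # y = h^{-1}(z) is a sample from p(y)
# Sanity check: the exponential's mean is 1/lam = 0.5.
print(f"sample mean = {y.mean():.4f} (expected {1/lam})")
```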

Rejection Sampling

Sampling from the target distribution $p(z) = \tilde{p}(z)/Z_p$ is difficult. Suppose we have an easy-to-sample proposal distribution $q(z)$ such that $kq(z) \geq \tilde{p}(z)$ for all $z$.
- Sample $z_0$ from $q(z)$.
- Sample $u_0$ from $\mathrm{Uniform}[0, kq(z_0)]$.
The pair $(z_0, u_0)$ is uniformly distributed under the curve of $kq(z)$. If $u_0 > \tilde{p}(z_0)$, the sample is rejected; otherwise $z_0$ is kept as a sample from $p(z)$.

Rejection Sampling

The probability that a sample is accepted is:
$$p(\text{accept}) = \int \frac{\tilde{p}(z)}{kq(z)}\,q(z)\,dz = \frac{1}{k}\int \tilde{p}(z)\,dz$$
The fraction of accepted samples depends on the ratio of the areas under $\tilde{p}(z)$ and $kq(z)$. It is hard to find an appropriate $q(z)$ with a small (optimal) $k$. Rejection sampling is a useful technique in one or two dimensions, and is typically applied as a subroutine in more advanced algorithms.
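A minimal sketch of rejection sampling under assumptions of my own choosing: the unnormalized target $\tilde{p}(z)$ is a two-bump Gaussian mixture, the proposal is a wide Gaussian, and $k = 8$ is picked (by checking the ratio $\tilde{p}/q$ at the modes) so that $kq(z) \geq \tilde{p}(z)$ everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p_tilde(z):
    """Unnormalized target: mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return gauss_pdf(z, -2.0, 0.5) + gauss_pdf(z, 2.0, 0.5)

mu_q, sigma_q, k = 0.0, 3.0, 8.0   # proposal N(0, 3^2); k q(z) >= p_tilde(z)

accepted = []
for _ in range(20_000):
    z0 = rng.normal(mu_q, sigma_q)                           # z0 ~ q(z)
    u0 = rng.uniform(0.0, k * gauss_pdf(z0, mu_q, sigma_q))  # u0 ~ U[0, k q(z0)]
    if u0 <= p_tilde(z0):            # keep the pair if it lies under p_tilde
        accepted.append(z0)

# p(accept) = (1/k) * integral of p_tilde = 2/8 = 0.25 for this choice of k.
print(f"accepted {len(accepted)} of 20000")
```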

Importance Sampling

Suppose we have an easy-to-sample proposal distribution $q(z)$, such that $q(z) > 0$ whenever $p(z) > 0$. Then
$$\mathbb{E}[f] = \int f(z)\,p(z)\,dz = \int f(z)\,\frac{p(z)}{q(z)}\,q(z)\,dz \approx \frac{1}{N}\sum_{n} \frac{p(z_n)}{q(z_n)}\,f(z_n), \qquad z_n \sim q(z).$$
The quantities $w_n = p(z_n)/q(z_n)$ are known as importance weights. Unlike rejection sampling, all samples are retained. But wait: we cannot compute $p(z)$, only $\tilde{p}(z)$.

Importance Sampling

Let our proposal be of the form $q(z) = \tilde{q}(z)/Z_q$:
$$\mathbb{E}[f] = \int f(z)\,p(z)\,dz = \frac{Z_q}{Z_p}\int f(z)\,\frac{\tilde{p}(z)}{\tilde{q}(z)}\,q(z)\,dz \approx \frac{Z_q}{Z_p}\,\frac{1}{N}\sum_{n} \tilde{w}_n\,f(z_n), \qquad \tilde{w}_n = \frac{\tilde{p}(z_n)}{\tilde{q}(z_n)},\quad z_n \sim q(z)$$
But we can use the same importance weights to approximate the ratio $Z_p/Z_q$:
$$\frac{Z_p}{Z_q} = \frac{1}{Z_q}\int \tilde{p}(z)\,dz = \int \frac{\tilde{p}(z)}{\tilde{q}(z)}\,q(z)\,dz \approx \frac{1}{N}\sum_{n} \tilde{w}_n$$
Hence:
$$\mathbb{E}[f] \approx \frac{\sum_n \tilde{w}_n\,f(z_n)}{\sum_n \tilde{w}_n}$$
This estimator is consistent but biased.
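A minimal sketch of this self-normalized estimator under illustrative assumptions: the target is known only up to a constant ($\tilde{p}(z) = e^{-z^2/2}$, whose true $Z_p = \sqrt{2\pi}$ the code never uses), the proposal is a wider Gaussian, and $f(z) = z^2$ so the true answer is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def log_p_tilde(z):
    """Unnormalized log target: p~(z) = exp(-z^2 / 2), i.e. N(0,1) up to Z_p."""
    return -0.5 * z**2

# Proposal q(z) = N(0, 2^2): easy to sample and to evaluate.
sigma_q = 2.0
z = rng.normal(0.0, sigma_q, N)
log_q = -0.5 * (z / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2 * np.pi))

w = np.exp(log_p_tilde(z) - log_q)        # importance weights w~_n
estimate = np.sum(w * z**2) / np.sum(w)   # sum w~_n f(z_n) / sum w~_n
print(f"E[z^2] estimate = {estimate:.4f} (true value 1.0)")
```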

Problems

If our proposal distribution $q(z)$ poorly matches our target distribution $p(z)$, then:
- Rejection sampling almost always rejects.
- Importance sampling has large, possibly infinite, variance (an unreliable estimator).

For high-dimensional problems, finding good proposal distributions is very hard. What can we do? Markov chain Monte Carlo.

Markov Chains

A first-order Markov chain is a series of random variables $\{z^{(1)},\ldots,z^{(N)}\}$ such that the following conditional independence property holds for $n \in \{1,\ldots,N-1\}$:
$$p(z^{(n+1)}\mid z^{(1)},\ldots,z^{(n)}) = p(z^{(n+1)}\mid z^{(n)})$$
We can specify a Markov chain by:
- the probability distribution of the initial state, $p(z^{(1)})$;
- the conditional probabilities of subsequent states, in the form of transition probabilities $T(z^{(n+1)}\mid z^{(n)}) \equiv p(z^{(n+1)}\mid z^{(n)})$.
Remark: $T(z^{(n+1)}\mid z^{(n)})$ is sometimes called a transition kernel.

Markov Chains

The marginal probability of a particular state can be computed as:
$$p(z^{(n+1)}) = \sum_{z^{(n)}} T(z^{(n+1)}\mid z^{(n)})\,p(z^{(n)})$$
A distribution $\pi(z)$ is said to be invariant, or stationary, with respect to a Markov chain if each step in the chain leaves $\pi(z)$ invariant:
$$\pi(z) = \sum_{z'} T(z\mid z')\,\pi(z')$$
A given Markov chain may have many stationary distributions. For example, if $T(z\mid z') = \mathbb{I}\{z = z'\}$ is the identity transformation, then any distribution is invariant.
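A minimal sketch (my construction, not from the slides) of these two equations for a small finite-state chain: repeatedly applying an ergodic transition matrix to an arbitrary initial distribution converges to a stationary $\pi$, and the final check verifies the invariance condition $\pi = T\pi$.

```python
import numpy as np

# Transition matrix T[j, i] = T(z' = j | z = i) for a 3-state chain;
# each column sums to 1. The values are illustrative.
T = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])

p = np.array([1.0, 0.0, 0.0])   # arbitrary initial distribution p(z^(1))
for _ in range(100):
    p = T @ p                   # p(z^(n+1)) = sum_z T(z' | z) p(z)

print("converged distribution:", p)
print("invariance T @ pi == pi:", np.allclose(T @ p, p))
```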

Detailed Balance

A sufficient (but not necessary) condition for ensuring that $\pi(z)$ is invariant is to choose a transition kernel that satisfies the detailed balance property:
$$\pi(z')\,T(z\mid z') = \pi(z)\,T(z'\mid z)$$
A transition kernel that satisfies detailed balance will leave that distribution invariant:
$$\sum_{z'} \pi(z')\,T(z\mid z') = \sum_{z'} \pi(z)\,T(z'\mid z) = \pi(z)\sum_{z'} T(z'\mid z) = \pi(z)$$
A Markov chain that satisfies detailed balance is said to be reversible.

Recap

We want to sample from a target distribution $\pi(z) = \tilde{\pi}(z)/Z$ (e.g. a posterior distribution). Obtaining independent samples is difficult.
- Set up a Markov chain with a transition kernel $T(z'\mid z)$ that leaves our target distribution $\pi(z)$ invariant.
- If the chain is ergodic, i.e. it is possible to go from every state to any other state (not necessarily in one move), then the chain will converge to this unique invariant distribution $\pi(z)$.
- We obtain dependent samples, drawn approximately from $\pi(z)$, by simulating the Markov chain for some time.

Ergodicity: there exists a $K$ such that, for any starting state $z$, $T^K(z'\mid z) > 0$ for all $z'$ with $\pi(z') > 0$.

Metropolis-Hastings Algorithm

A Markov chain transition operator from the current state $z$ to a new state is defined as follows:
- A candidate state $z^*$ is proposed according to some proposal distribution $q(z^*\mid z)$, e.g. $\mathcal{N}(z, \sigma^2)$.
- The candidate state $z^*$ is accepted with probability
$$\min\Big(1,\ \frac{\tilde{\pi}(z^*)}{\tilde{\pi}(z)}\,\frac{q(z\mid z^*)}{q(z^*\mid z)}\Big).$$
- If accepted, the next state is $z^*$; otherwise the next state is a copy of the current state $z$.

Note: there is no need to know the normalizing constant $Z$.
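Here is a minimal sketch of this update, assuming an illustrative unnormalized two-mode target of my own choosing; since the Gaussian random-walk proposal is symmetric, the $q$-ratio cancels and the rule reduces to the plain Metropolis acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi_tilde(z):
    """Unnormalized log target: two Gaussian bumps at -2 and +2 (illustrative)."""
    return np.logaddexp(-0.5 * ((z + 2) / 0.5) ** 2,
                        -0.5 * ((z - 2) / 0.5) ** 2)

def metropolis_hastings(n_steps, rho=1.0, z0=0.0):
    z, samples = z0, []
    for _ in range(n_steps):
        z_star = rng.normal(z, rho)   # propose z* ~ q(z* | z) = N(z, rho^2)
        # Accept with prob min(1, pi~(z*) / pi~(z)); symmetric q cancels.
        if np.log(rng.uniform()) < log_pi_tilde(z_star) - log_pi_tilde(z):
            z = z_star                # accept: move to z*
        samples.append(z)             # on rejection, keep a copy of z
    return np.array(samples)

chain = metropolis_hastings(50_000, rho=2.0)
print(f"sample mean = {chain.mean():.3f} (target is symmetric about 0)")
```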

Metropolis-Hastings Algorithm

We can show that the M-H transition kernel leaves $\pi(z)$ invariant by showing that it satisfies detailed balance:
$$\pi(z)\,T(z'\mid z) = \pi(z)\,q(z'\mid z)\,\min\Big(1,\ \frac{\pi(z')\,q(z\mid z')}{\pi(z)\,q(z'\mid z)}\Big) = \min\big(\pi(z)\,q(z'\mid z),\ \pi(z')\,q(z\mid z')\big)$$
$$= \pi(z')\,q(z\mid z')\,\min\Big(\frac{\pi(z)\,q(z'\mid z)}{\pi(z')\,q(z\mid z')},\ 1\Big) = \pi(z')\,T(z\mid z')$$
Note that whether the chain is ergodic will depend on the particulars of $\pi$ and the proposal distribution $q$.

Metropolis-Hasting Algorithm 3 2.5 2 1.5 1 Using Metropolis algorithm to sample from Gaussian distribution with proposal q(z z) = N(z,0.04). accepted (green), rejected (red). 0.5 0 0 0.5 1 1.5 2 2.5 3 30

Choice of Proposal

Proposal distribution: $q(z'\mid z) = \mathcal{N}(z, \rho^2)$.
- $\rho$ large: many rejections.
- $\rho$ small: the chain moves too slowly.

[Figure: an elongated target distribution with length scales $\sigma_{\min}$ and $\sigma_{\max}$, explored by a random-walk proposal of width $\rho$.]

The specific choice of proposal can greatly affect the performance of the algorithm.
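A short illustrative experiment, reusing the metropolis_hastings sketch from the slide above, that measures the acceptance rate for several proposal widths $\rho$ (the specific values are arbitrary); it exhibits the trade-off described here, though a high acceptance rate with tiny $\rho$ still means slow exploration.

```python
# Requires numpy as np and metropolis_hastings from the previous sketch.
for rho in [0.1, 1.0, 10.0]:
    chain = metropolis_hastings(20_000, rho=rho)
    accept_rate = np.mean(chain[1:] != chain[:-1])  # fraction of accepted moves
    print(f"rho = {rho:5.1f}: acceptance rate ~ {accept_rate:.2f}")
```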

Gibbs Sampler

Consider sampling from $p(z_1,\ldots,z_N)$.

[Figure: a Gibbs sampling path over a correlated 2D Gaussian with length scales $l$ and $L$, moving parallel to the $z_1$ and $z_2$ axes.]

Initialize $z_i$, $i = 1,\ldots,N$. For $t = 1,\ldots,T$:
- Sample $z_1^{t+1} \sim p(z_1\mid z_2^{t},\ldots,z_N^{t})$
- Sample $z_2^{t+1} \sim p(z_2\mid z_1^{t+1}, z_3^{t},\ldots,z_N^{t})$
- ...
- Sample $z_N^{t+1} \sim p(z_N\mid z_1^{t+1},\ldots,z_{N-1}^{t+1})$

The Gibbs sampler is a particular instance of the M-H algorithm, with proposals $p(z_n\mid z_{i\neq n})$ that are accepted with probability 1. We apply a series of these (componentwise) operators.
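A minimal sketch (illustrative, not from the slides) of this sweep for a correlated bivariate Gaussian, where both one-dimensional conditionals have closed forms: with unit variances and correlation $\rho$, $z_1\mid z_2 \sim \mathcal{N}(\rho z_2,\ 1-\rho^2)$, and symmetrically for $z_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
corr = 0.9   # correlation of the 2D Gaussian target (unit variances)

def gibbs(n_steps):
    """Alternately sample z1 | z2 and z2 | z1 for the correlated Gaussian."""
    z = np.zeros(2)
    cond_std = np.sqrt(1.0 - corr**2)
    samples = np.empty((n_steps, 2))
    for t in range(n_steps):
        z[0] = rng.normal(corr * z[1], cond_std)  # z1^{t+1} ~ p(z1 | z2^t)
        z[1] = rng.normal(corr * z[0], cond_std)  # z2^{t+1} ~ p(z2 | z1^{t+1})
        samples[t] = z
    return samples

s = gibbs(50_000)
print("sample correlation:", np.corrcoef(s.T)[0, 1])  # should be near 0.9
```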

Gibbs Sampler

The applicability of the Gibbs sampler depends on how easy it is to sample from the conditional probabilities $p(z_n\mid z_{i\neq n})$.
- For discrete random variables with a few settings:
$$p(z_n\mid z_{i\neq n}) = \frac{p(z_n, z_{i\neq n})}{\sum_{z_n} p(z_n, z_{i\neq n})}$$
The sum can be computed analytically.
- For continuous random variables:
$$p(z_n\mid z_{i\neq n}) = \frac{p(z_n, z_{i\neq n})}{\int p(z_n, z_{i\neq n})\,dz_n}$$
The integral is univariate and is often analytically tractable or amenable to standard sampling methods.

Bayesian PMF

Remember the predictive distribution? Consider predicting a rating $r^*_{ij}$ for user $i$ and query movie $j$:
$$p(r^*_{ij}\mid R) = \int p(r^*_{ij}\mid u_i, v_j)\,\underbrace{p(U, V, \Theta_U, \Theta_V\mid R)}_{\text{posterior over parameters and hyperparameters}}\;d\{U, V\}\,d\{\Theta_U, \Theta_V\}$$
Use a Monte Carlo approximation:
$$p(r^*_{ij}\mid R) \approx \frac{1}{N}\sum_{n=1}^{N} p(r^*_{ij}\mid u_i^{(n)}, v_j^{(n)})$$
The samples $(u_i^{(n)}, v_j^{(n)})$ are generated by running a Gibbs sampler whose stationary distribution is the posterior distribution of interest.

Bayesian PMF

Monte Carlo approximation:
$$p(r^*_{ij}\mid R) \approx \frac{1}{N}\sum_{n=1}^{N} p(r^*_{ij}\mid u_i^{(n)}, v_j^{(n)})$$
The conditional distributions over the user and movie feature vectors are Gaussians, which are easy to sample from:
$$p(u_i\mid R, V, \Theta_U, \alpha) = \mathcal{N}\big(u_i\mid \mu_i^*, \Sigma_i^*\big), \qquad p(v_j\mid R, U, \Theta_V, \alpha) = \mathcal{N}\big(v_j\mid \mu_j^*, \Sigma_j^*\big)$$
The conditional distributions over the hyperparameters also have closed forms that are easy to sample from. On the Netflix dataset, Bayesian PMF can handle over 100 million ratings.

MCMC: Main Problems

Main problems of MCMC:
- Hard to diagnose convergence (burn-in).
- Sampling from isolated modes.

More advanced MCMC methods for sampling from distributions with isolated modes:
- Parallel tempering
- Simulated tempering
- Tempered transitions
- Hamiltonian Monte Carlo methods (make use of gradient information)
- Nested sampling, coupling from the past, and many others

Deterministic Methods Laplace Approximation Bayesian Information Criterion (BIC) Variational Methods: Mean-Field, Loopy Belief Propagation along with various adaptations. Expectation Propagation. 37