Markov Chain Monte Carlo (MCMC)

Dependent Sampling

Suppose we wish to sample from a density $\pi$ that we can evaluate as a function but have no means to sample from directly. Rejection sampling can be used if a trial density $f$ can be found such that $\pi/f$ has a reasonable bound; when the dimension is high this is difficult or impossible. Rejection sampling, the inversion method, and the density transform method all produce independent realizations from $\pi$. If these methods are inefficient or difficult to implement, we can drop independence as a goal, aiming instead to generate a dependent sequence $X_1, X_2, \ldots$ such that the marginal distribution of each $X_i$ is $\pi$. Going a step further, we may allow the marginal distribution $\pi_k$ of $X_k$ to differ from $\pi$, as long as it converges to $\pi$ in some sense: $\pi_k \to \pi$. By relaxing the goal of the problem, it becomes possible to overcome the key problem of rejection sampling: that it does not adapt as information about $\pi$ is built up through repeated evaluations of $\pi$ at the sampled points. A practical framework for constructing dependent sequences satisfying these goals is provided by discrete-time Markov chains.

Review of Markov Chains

A discrete-time Markov chain is a sequence of random variables $X_0, X_1, \ldots$ such that
$$P(X_{n+1} \in A \mid X_0, \ldots, X_n) = P(X_{n+1} \in A \mid X_n) \equiv P(X_n, A).$$
The function $P$ is called the transition kernel. A probability distribution $\pi$ is called the invariant distribution for $P$ if
$$\pi(A) = \int P(x, A)\,\pi(x)\,dx$$
for all measurable sets $A$, in which case we may write $\pi P = \pi$. More generally this relationship may hold for a non-finite measure $\pi$, in which case $\pi$ is called the invariant measure for $P$.
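In the finite-state case these definitions can be checked directly. Below is a minimal numerical sketch in Python (the three-state matrix is an arbitrary illustration, not taken from the notes): the invariant distribution is recovered as a left eigenvector of the transition matrix and verified to satisfy $\pi P = \pi$.

```python
import numpy as np

# A three-state transition kernel: row x gives P(x, {0}), P(x, {1}), P(x, {2}).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# The invariant distribution solves pi P = pi, i.e. pi is a left
# eigenvector of P with eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

print(pi)                       # the invariant distribution
print(np.allclose(pi @ P, pi))  # True: pi P = pi
```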

Suppose that $P$ has a density, denoted $q(x, y)$; this means that $P(x, A) = \int_A q(x, y)\,dy$. A sufficient condition for $\pi$ to be the invariant distribution for $P$ is detailed balance:
$$\pi(x)\,q(x, y) = \pi(y)\,q(y, x).$$
The proof is a simple application of Fubini's theorem.

The distribution $\pi$ is the equilibrium distribution for $X_0, X_1, \ldots$ if, writing $P^n(X_0, A) \equiv P(X_n \in A \mid X_0)$, it satisfies
$$\lim_{n \to \infty} P^n(x, A) = \pi(A)$$
for all measurable sets $A$ and for $\pi$-almost-every $x$.

Key definitions:

Irreducible: The Markov chain given by transition kernel $P$ with invariant distribution $\pi$ is irreducible if for each measurable set $A$ with $\pi(A) > 0$ and for each $x$ there exists $n$ such that $P^n(x, A) > 0$.

Recurrent: The Markov chain given by transition kernel $P$ with invariant distribution $\pi$ is recurrent if for every measurable set $B$ with $\pi(B) > 0$,
$$P(X_n \in B \text{ i.o.} \mid X_0) > 0 \quad \text{for all } X_0,$$
and
$$P(X_n \in B \text{ i.o.} \mid X_0) = 1$$
for $\pi$-almost-every $X_0$. The chain is Harris recurrent if
$$P(X_n \in B \text{ i.o.} \mid X_0) = 1 \quad \text{for all } X_0.$$

Fact: A Markov transition kernel $P$ that is $\pi$-irreducible and has an invariant measure is recurrent. A recurrent kernel $P$ with a finite invariant measure is positive recurrent; otherwise it is null recurrent.
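Detailed balance is likewise easy to check numerically in the finite case. A minimal sketch with an illustrative two-state chain (not from the notes): the matrix of probability flows $\pi(x)\,q(x,y)$ is symmetric, and invariance of $\pi$ follows.

```python
import numpy as np

# Two-state chain; here the kernel density q(x, y) is just a matrix.
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])           # candidate invariant distribution

flows = pi[:, None] * Q             # flows[x, y] = pi(x) q(x, y)
print(np.allclose(flows, flows.T))  # True: detailed balance holds
print(np.allclose(pi @ Q, pi))      # hence pi is invariant for Q
```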

Periodic: A Markov transition kernel $P$ is periodic if there exist $n \geq 2$ and a sequence of nonempty, disjoint measurable sets $A_1, A_2, \ldots, A_n$ such that
$$P(x, A_{j+1}) = 1 \quad \forall x \in A_j \ (j < n), \qquad P(x, A_1) = 1 \quad \forall x \in A_n.$$

Total variation norm: For any measure $\lambda$, the total variation (TV) norm is
$$\|\lambda\| = \sup_A \lambda(A) - \inf_A \lambda(A),$$
where the supremum and infimum are taken over all $\lambda$-measurable sets. The distance between two probability measures $\pi_1$ and $\pi_2$ can be assessed using $\|\pi_1 - \pi_2\|$.

Theorem: Let $P$ be a Markov transition kernel with invariant distribution $\pi$, and suppose that $P$ is $\pi$-irreducible. Then $P$ is positive recurrent, and $\pi$ is the unique invariant distribution of $P$. If $P$ is aperiodic, then for $\pi$-almost-every $x$,
$$P^n(x, A) - \pi(A) \to 0$$
for all measurable sets $A$ ($P^n$ converges to $\pi$ in the TV norm). If $P$ is Harris recurrent, the convergence occurs for all $x$.

In words, we can start at essentially any $X_0$, run the chain for a long time, and the final draw has a distribution that is approximately $\pi$. The initial stretch of the chain is called the burn-in, and it is generally impossible to know how long it needs to be in order to get a good approximation.

We do not need independence for the most common application of Monte Carlo studies: estimating $E_\pi f$, the mean of a function $f$ with respect to the invariant distribution $\pi$. If the chain is uniformly ergodic, meaning
$$\sup_x \|P^n(x, \cdot) - \pi\| \leq M r^n$$
for some $M \geq 0$ and $r < 1$, there is a central limit theorem: letting $\bar{f}_n = \sum_{j=1}^n f(X_j)/n$ denote the sample mean, $\sqrt{n}\,(\bar{f}_n - E_\pi f)$ converges weakly to a normal distribution with mean 0.
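For a finite chain the convergence in the theorem can be watched directly. A minimal sketch, reusing the illustrative matrix from the earlier snippet: under the definition above, the TV norm of the difference of two probability vectors equals their $L_1$ distance, and it shrinks geometrically in $n$.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

Pn = np.eye(3)
for n in range(1, 11):
    Pn = Pn @ P                    # Pn = P^n
    tv = np.abs(Pn[0] - pi).sum()  # ||P^n(0, .) - pi|| in the TV norm above
    print(n, tv)                   # geometric decay in n
```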

Methods for generating chains with a given equilibrium distribution

We have a density $\pi$ that we can evaluate but not sample from. How do we construct a Markov chain $P$ with invariant distribution $\pi$?

Gibbs sampling: Suppose the random variable $X = (X_1, \ldots, X_d)$ can be partitioned into blocks, and we can sample from each of the conditional distributions $P(X_i \mid \{X_j,\ j \neq i\})$. These transitions are invariant with respect to $P(X_1, \ldots, X_d)$. Consider the case $X = (X_1, X_2)$, and for a set $A$, let $A_{X_1} = \{X_2 \mid (X_1, X_2) \in A\}$. Under the proposal distribution $P(X_2 \mid X_1)$:
$$\int P((X_1, X_2), A)\,\pi(X_1, X_2)\,dX_1\,dX_2 = \int \left( \int_{A_{X_1}} \pi(y \mid X_1)\,dy \right) \pi(X_2 \mid X_1)\,\pi(X_1)\,dX_1\,dX_2$$
$$= \int \int_{A_{X_1}} \pi(y \mid X_1)\,\pi(X_1)\,dy\,dX_1 = \pi(A).$$
The Gibbs chain will in general only be irreducible and aperiodic if we combine all of the conditional distributions in random order.

The Metropolis (or Metropolis-Hastings) algorithm: Suppose we have a transition density $q(x, y)$, i.e. $P(X_{n+1} \in A \mid X_n = x) = \int_A q(x, y)\,dy$. Define the Metropolis-Hastings ratio by
$$\alpha(x, y) = \begin{cases} \min\left\{ \dfrac{\pi(y)\,q(y, x)}{\pi(x)\,q(x, y)},\ 1 \right\} & \pi(x)\,q(x, y) > 0, \\ 1 & \pi(x)\,q(x, y) = 0. \end{cases}$$
The kernel $q(x, y)$ roughly plays the role that the trial density plays in rejection sampling. Suppose we are at $X_n$. Sample $Y$ from $q(X_n, \cdot)$. With probability $\alpha(X_n, Y)$ set $X_{n+1} = Y$; otherwise set $X_{n+1} = X_n$. The transition kernel can be represented as
$$p(x, y) = q(x, y)\,\alpha(x, y) + r(x)\,\delta_x(y),$$
where $\delta_x$ is the point mass at $x$, and
$$r(x) = 1 - \int q(x, y)\,\alpha(x, y)\,dy$$
is the marginal probability of remaining at $x$. The resulting chain satisfies detailed balance, and hence has invariant distribution $\pi$. Only density evaluations are required, and unknown normalizing constants cancel.

General Metropolis-Hastings Schemes

Gibbs sampling: If $q(x, y) = \pi(y \mid x)$ then $\alpha(x, y) \equiv 1$, so we accept every move. This shows that Gibbs sampling is a special case of the Metropolis-Hastings algorithm.

Random walks: Suppose $q(x, y)$ is a random walk: $q(x, y) = q^*(y - x)$ for some distribution $q^*$. The Metropolis-Hastings ratio is $\min(\pi(y)\,q^*(x - y)/(\pi(x)\,q^*(y - x)),\ 1)$ (or 1 if $\pi(x)\,q^*(y - x) = 0$). If $q^*$ is symmetric, the ratio reduces to $\min(\pi(y)/\pi(x),\ 1)$ (or 1 if $\pi(x) = 0$).
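A minimal random-walk Metropolis sketch; the standard Cauchy target and the step size are illustrative choices, not part of the notes. Note that only unnormalized density evaluations appear, so the normalizing constant is never needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Unnormalized target density pi: a standard Cauchy, for illustration.
    return 1.0 / (1.0 + x**2)

def metropolis(n_iter, x0=0.0, step=1.0):
    x = x0
    draws = np.empty(n_iter)
    for n in range(n_iter):
        y = x + step * rng.standard_normal()  # symmetric random-walk proposal
        # q is symmetric, so alpha = min(pi(y) / pi(x), 1).
        if rng.uniform() < min(target(y) / target(x), 1.0):
            x = y                             # accept the move
        # otherwise X_{n+1} = X_n: the current point is repeated
        draws[n] = x
    return draws

draws = metropolis(50_000)
print(np.median(draws))   # should be near 0, the Cauchy median
```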

An important practical issue with random walk chains is the scaling of the random walk density $q^*$. If the variance of $q^*$ is too large, most steps will move into the tails of the distribution and will be rejected. On the other hand, if the variance is too small, the autocorrelation in the sequence of accepted draws will be high and convergence to the target distribution will be slow. Roberts and Rosenthal (2001) have shown that an acceptance rate of around 0.23 is optimal when the target density is the product of identical marginal density functions.

Independence chains: If $q(x, y) = q(y)$, so that the proposals are an iid sequence, the Metropolis-Hastings ratio is $\min(\pi(y)\,q(x)/(\pi(x)\,q(y)),\ 1)$ (or 1 if $\pi(x)\,q(y) = 0$); a code sketch follows at the end of this subsection. This is similar to rejection sampling, except that the acceptance probabilities are scaled by $q(x)/q(y)$ rather than by $1/c$, and points are repeated rather than discarded when the proposal fails. Two interesting special cases:

1. $q(x) \equiv 1$: if the proposal distribution is uniform, the scheme is even more similar to rejection sampling, since the Metropolis-Hastings ratio becomes $\min(\pi(y)/\pi(x),\ 1)$.

2. If $q(y) = \pi(y)$, then the Metropolis-Hastings ratio is identically 1, and we are sampling directly from the target density.

Rejection sampling chains: Suppose we want to use rejection sampling from a trial density $f$ to produce realizations from $\pi$, but we cannot find a good bound for $\pi/f$. The Metropolis-Hastings algorithm can rescue rejection sampling in this situation. We conjecture, but are not certain, that $\pi/f < c$. If we carry out rejection sampling anyway, we are actually drawing from the density proportional to $f^* = \min(\pi, cf)$. Suppose we use an independence chain generated by $f^*$ to drive the Metropolis-Hastings algorithm, so that $q(x, y) \propto \min(\pi(y), c f(y))$. Let $S$ denote the subset of the sample space where the bound $\pi/f < c$ holds. There are four possibilities:

1. $x, y \in S$: $\alpha = 1$.
2. $x \in S$, $y \notin S$: $\alpha = 1$.
3. $x \notin S$, $y \in S$: $\alpha < 1$.
4. $x \notin S$, $y \notin S$: nothing can be said about $\alpha$ in general.

Since $S^c$ is the set of points that would be undersampled by the rejection sampling procedure, it is reasonable that moves from $S$ to $S^c$ are always accepted, while moves from $S^c$ to $S$ always have a nonzero probability of being rejected. The chain dwells in $S^c$ to build up mass that is missing from $f^*$.
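Returning to independence chains, here is a minimal sketch; the $t_3$ target and $N(0, 4)$ proposal are illustrative assumptions. Unlike the symmetric random-walk case, the proposal density now appears in the ratio.

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    # Unnormalized target: a t-distribution with 3 df, for illustration.
    return (1.0 + x**2 / 3.0) ** (-2.0)

def proposal_pdf(x):
    # Unnormalized N(0, 4) proposal density; constants cancel in the ratio.
    return np.exp(-x**2 / 8.0)

n_iter, x = 20_000, 0.0
draws = np.empty(n_iter)
for t in range(n_iter):
    y = 2.0 * rng.standard_normal()   # iid proposal from q = N(0, 4)
    # Hastings ratio: pi(y) q(x) / (pi(x) q(y)).
    ratio = target(y) * proposal_pdf(x) / (target(x) * proposal_pdf(y))
    if rng.uniform() < min(ratio, 1.0):
        x = y                         # accept; otherwise repeat x
    draws[t] = x
print(draws.mean())                   # should be near 0
```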

Examples

Normal mean and variance: Suppose we observe $Y_1, \ldots, Y_n$ iid from the $N(\mu, \sigma^2)$ distribution. If we take flat (improper) priors on $\mu$ and $\sigma^2$, the posterior distribution is
$$P(\mu, \sigma^2 \mid \{Y\}) \propto (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_i (Y_i - \mu)^2 \right).$$
As a function of $\mu$ this is proportional to
$$\exp\left( -\frac{1}{2\sigma^2}\left(n\mu^2 - 2n\mu\bar{Y}\right) \right) \propto \exp\left( -\frac{n}{2\sigma^2} (\mu - \bar{Y})^2 \right).$$
Thus
$$P(\mu \mid \sigma^2, \{Y\}) = N(\bar{Y},\ \sigma^2/n).$$
To get the conditional distribution of $\sigma^2$ given $\{Y\}$ and $\mu$, make the change of variables $\eta = 1/\sigma^2$, so
$$P(\eta \mid \mu, \{Y\}) \propto \eta^{n/2 - 2} \exp\left( -\eta \sum_i (Y_i - \mu)^2 / 2 \right).$$
This is a $\Gamma(n/2 - 1,\ 2/\sum_i (Y_i - \mu)^2)$ distribution. So the Gibbs sampling algorithm alternates between sampling $\mu$ from a normal distribution and $\eta$ from a gamma distribution.
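A minimal Gibbs sketch for this example; the simulated data stand in for $\{Y\}$ and are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data standing in for {Y}; true mu = 1.5, sigma = 2 (assumed).
Y = rng.normal(loc=1.5, scale=2.0, size=100)
n, ybar = len(Y), Y.mean()

n_iter = 5_000
eta = 1.0                             # eta = 1 / sigma^2
mu_draws = np.empty(n_iter)
sig2_draws = np.empty(n_iter)
for t in range(n_iter):
    # mu | sigma^2, {Y} ~ N(Ybar, sigma^2 / n)
    mu = rng.normal(ybar, np.sqrt(1.0 / (eta * n)))
    # eta | mu, {Y} ~ Gamma(n/2 - 1, scale = 2 / sum_i (Y_i - mu)^2)
    eta = rng.gamma(n / 2 - 1, 2.0 / np.sum((Y - mu) ** 2))
    mu_draws[t], sig2_draws[t] = mu, 1.0 / eta

# Discard a burn-in, then report posterior means.
print(mu_draws[500:].mean(), sig2_draws[500:].mean())
```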

Grouped binary data: Suppose we observe $m$ groups of Bernoulli trials, where the $i$th group consists of $n_i$ iid trials with success probability $p_i$. Suppose the $p_i$ are viewed as random effects from a symmetric Beta distribution, which has density proportional to $p^{\alpha-1}(1-p)^{\alpha-1}$. Let $k_i$ be the number of successes observed in the $i$th group, so
$$P(\{k_i\} \mid \{p_i\})\,P(\{p_i\} \mid \alpha) \propto \prod_i p_i^{k_i + \alpha - 1} (1 - p_i)^{n_i - k_i + \alpha - 1}.$$
To be safe, we will take a nearly-flat proper exponential prior on $\alpha$ with mean $\theta$ (set to an arbitrary large value). Gibbs sampling operates using the conditional distributions $P(\alpha \mid \{p_i\}, \{k_i\})$ and $P(\{p_i\} \mid \alpha, \{k_i\})$. The specific updates are
$$\alpha \sim \text{exponential}\left( \text{mean} = 1 \Big/ \Big( -\sum_i \log p_i - \sum_i \log(1 - p_i) + 1/\theta \Big) \right)$$
and
$$p_i \sim \text{Beta}(k_i + \alpha,\ n_i - k_i + \alpha).$$
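A minimal Gibbs sketch for the grouped binary model, alternating the two updates above; the counts and the prior mean $\theta$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative counts for m = 5 groups (assumed, not from the notes).
n_i = np.array([20, 15, 30, 25, 40])
k_i = np.array([ 5, 12, 10, 20,  8])
theta = 1e4               # mean of the nearly-flat exponential prior on alpha

n_iter = 5_000
p = (k_i + 0.5) / (n_i + 1.0)          # crude starting values
alpha_draws = np.empty(n_iter)
for t in range(n_iter):
    # alpha | {p_i}: exponential, mean 1/(-sum log p - sum log(1-p) + 1/theta)
    rate = -np.log(p).sum() - np.log1p(-p).sum() + 1.0 / theta
    alpha = rng.exponential(1.0 / rate)
    # p_i | alpha, k_i ~ Beta(k_i + alpha, n_i - k_i + alpha)
    p = rng.beta(k_i + alpha, n_i - k_i + alpha)
    alpha_draws[t] = alpha
print(alpha_draws[500:].mean())        # posterior mean of alpha
```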

Grouped continuous data with right-skewed group means: Suppose we observe data $Y_{ij}$ for $m$ groups with $n_i$ replicates in group $i$, so $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. Let $\mu_i$ be the population mean for group $i$, and suppose we take the $\mu_i$ to be random effects from an exponential distribution with mean $\lambda$. We will also take $Y_{i1}, \ldots, Y_{in_i} \mid \mu_i$ to be independent and Gaussian with mean $\mu_i$ and variance $\sigma^2$. If we form the improper posterior based on flat priors for $\lambda$ and $\sigma^2$, then integrate out the random effects $\mu_i$, we get
$$P(\lambda, \sigma^2 \mid \{Y\}) \propto \prod_i (\sigma^2)^{-n_i/2} \lambda^{-1} \int \exp\left( -\frac{1}{2\sigma^2} (Y_i - \mu \mathbf{1}_{n_i})'(Y_i - \mu \mathbf{1}_{n_i}) \right) \exp(-\mu/\lambda)\,d\mu$$
$$= (\sigma^2)^{-n/2 + m/2}\,\lambda^{-m} \prod_i \exp\left( -\sigma^{-2}(S_i - M_i^2/n_i)/2 - M_i \lambda^{-1}/n_i + \sigma^2 \lambda^{-2}/2n_i \right),$$
where $Y_i = (Y_{i1}, \ldots, Y_{in_i})$, $M_i = \sum_j Y_{ij}$, and $S_i = \sum_j Y_{ij}^2$. This does not have a finite integral, so a sampling algorithm applied to this formulation of the problem will not give meaningful results. For simplicity, we first suppose that $\eta = 1/\lambda$ and $\sigma^2$ have a joint prior density proportional to
$$\exp\left( -\frac{1}{2} \sum_i \frac{1}{n_i}\,\sigma^2 \eta^2 \right)$$
on $(0, \infty) \times (0, \infty)$. In this case, the posterior on $\eta, \sigma^2$ is proportional to
$$(\sigma^2)^{-n/2 + m/2}\,\eta^{m-2} \exp\left( -\sigma^{-2} \sum_i (S_i - M_i^2/n_i)/2 - \eta \sum_i M_i/n_i \right).$$
Now if we set $\theta = 1/\sigma^2$, we get the posterior expressed in terms of $\eta$ and $\theta$:
$$\theta^{(n-m)/2 - 2}\,\eta^{m-2} \exp\left( -\theta \sum_i (S_i - M_i^2/n_i)/2 - \eta \sum_i M_i/n_i \right).$$
Letting $V_i = S_i/n_i - M_i^2/n_i^2$ (approximately the variance of the $Y_{ij}$ in group $i$) and $\bar{M}_i = M_i/n_i$ (the sample mean of $Y_i$), we get
$$\theta^{(n-m)/2 - 2}\,\eta^{m-2} \exp\left( -\theta \sum_i n_i V_i/2 - \eta \sum_i \bar{M}_i \right).$$
Thus $\eta$ and $\theta$ have independent Gamma posteriors:
$$\eta \sim \Gamma\left( m - 1,\ 1 \Big/ \sum_i \bar{M}_i \right), \qquad \theta \sim \Gamma\left( (n - m)/2 - 1,\ 2 \Big/ \sum_i n_i V_i \right).$$
If we wanted to use a different prior, say with $\lambda$ and $\sigma^2$ independent and exponential with means $\beta_\lambda$ and $\beta_{\sigma^2}$, we could use the Gibbs sampler.

Cauchy regression: Suppose we take the linear regression model $Y_i = \alpha + \beta' X_i + \epsilon_i$, where $Y_i \in \mathbb{R}$ and $X_i \in \mathbb{R}^d$ are observed, and the $\epsilon_i$ have central Cauchy distributions, with density $\pi^{-1}/(1 + \epsilon_i^2)$. The joint distribution of the observed data is
$$\pi(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n) = \prod_i \frac{\pi^{-1}}{1 + (Y_i - \alpha - \beta' X_i)^2}.$$
Suppose we consider the Bayesian model with an improper flat prior on $\theta = (\alpha, \beta)$. We use a Metropolis-Hastings algorithm with a random walk distribution for $q$: $q(\theta, \theta^*) = \phi(\theta^* - \theta)$, where $\phi$ is the bivariate normal density with mean 0. Due to the symmetry of the random walk, the Metropolis ratio simplifies to $\alpha(\theta, \theta^*) = \min(\pi(\theta^*)/\pi(\theta),\ 1)$.

Rejection sampling using the wrong bound: Suppose we want to sample from the density $\pi(x) = I(-1 \leq x \leq 1)\,|x|^{-1/2}/4$. The density is unbounded at $x = 0$, so we cannot use rejection sampling with a bounded trial density. Suppose we generate draws using the rejection sampling algorithm with a uniform trial density on $(-1, 1)$ and $c = 6$. These draws will have density
$$q(x) \propto I(-1 \leq x \leq 1)\,\min(|x|^{-1/2}/4,\ 3).$$
Now we use these draws to drive a Metropolis algorithm. On $S = \{x : |x| \geq 1/144\}$, the bound $\pi/f < 6$ is satisfied (where $f$ is the uniform trial density). The Metropolis-Hastings ratio is given by the following:

1. $x, y \in S$: $\alpha = 1$.
2. $x \in S$, $y \notin S$: $\alpha = 1$.
3. $x \notin S$, $y \in S$: $\alpha = 12|x|^{1/2} < 1$.
4. $x, y \notin S$: $\alpha = \min(\pi(y)/\pi(x),\ 1) = \min\left((|x|/|y|)^{1/2},\ 1\right)$.
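A minimal sketch of this chain: rejection draws with the wrong bound $c = 6$ drive an independence chain, and the Hastings correction uses $q(x) \propto \min(\pi(x), 3)$. The target, trial density, and constants are those of the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def pi_unnorm(x):
    # pi(x) = |x|^{-1/2} / 4 on (-1, 1); unbounded at x = 0.
    return 0.25 / np.sqrt(abs(x))

def q_unnorm(x):
    # Density (up to a constant) of the rejection-sampling output:
    # min(pi(x), c f(x)) with f uniform on (-1, 1) and c = 6, so c f = 3.
    return min(pi_unnorm(x), 3.0)

def rejection_draw():
    # Rejection sampling with the (wrong) bound c = 6.
    while True:
        y = rng.uniform(-1.0, 1.0)
        if rng.uniform() < q_unnorm(y) / 3.0:
            return y

n_iter, x = 50_000, 0.5
draws = np.empty(n_iter)
for t in range(n_iter):
    y = rejection_draw()
    # Independence-chain Hastings ratio with proposal q.
    ratio = pi_unnorm(y) * q_unnorm(x) / (pi_unnorm(x) * q_unnorm(y))
    if rng.uniform() < min(ratio, 1.0):
        x = y
    draws[t] = x

# The chain dwells near 0 to restore the mass undersampled by rejection:
# pi gives S^c = (-1/144, 1/144) probability 1/12, about 0.083.
print(np.mean(np.abs(draws) < 1.0 / 144.0))
```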

Figure 1: Sample paths for $\alpha$ and $\beta$ for the Cauchy regression model, using random walk input chains with Gaussian steps having mean zero and standard deviation $\sigma = 0.01$ (top), $\sigma = 3$ (middle), and $\sigma = 1$ (bottom).
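The following sketch generates sample paths of the kind shown in Figure 1; the simulated data, true parameter values, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data for Y_i = alpha + beta * X_i + eps_i with Cauchy errors
# (true alpha = 1, beta = 2 are assumed for illustration).
n = 100
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.standard_cauchy(size=n)

def log_post(theta):
    # Log unnormalized posterior under a flat prior on theta = (alpha, beta).
    resid = Y - theta[0] - theta[1] * X
    return -np.sum(np.log1p(resid ** 2))

n_iter, step = 10_000, 0.1
theta = np.zeros(2)
lp = log_post(theta)
path = np.empty((n_iter, 2))
for t in range(n_iter):
    prop = theta + step * rng.standard_normal(2)  # symmetric Gaussian walk
    lp_prop = log_post(prop)
    # Symmetric q: accept with probability min(pi(theta*) / pi(theta), 1).
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    path[t] = theta
print(path[1000:].mean(axis=0))   # posterior means of (alpha, beta)
```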

Gaussian mixture model: Suppose that $Y_1, \ldots, Y_n \in \mathbb{R}^d$ are independent and observed, $\delta_1, \ldots, \delta_n \in \{1, 2, \ldots, K\}$ are iid and unobserved, and $P(Y_j \mid \delta_j = k) = \phi(Y_j - \mu_k)$, where $\phi(\cdot)$ is the standard multivariate normal density. The parameters in the model are $\mu_1, \ldots, \mu_K$ and $\pi_1, \ldots, \pi_K$, where $\pi_k = P(\delta_j = k)$. We can take flat priors on these parameters, and estimate them through their posterior distribution, which can be simulated using Gibbs sampling. Here is a full set of conditional distributions, where $\mu = [\mu_1, \ldots, \mu_K]$, and similarly for $Y$ and $\delta$.

$$P(\mu \mid \pi, Y, \delta) = P(\mu \mid Y, \delta) \propto P(Y \mid \mu, \delta) = \prod_j \phi(Y_j - \mu_{\delta_j}).$$
This splits into a product of functions of $\mu_1, \ldots, \mu_K$, implying that the $\mu_k$ are independent given $\pi$, $Y$, and $\delta$. The factor that involves $\mu_k$ can be expressed
$$\exp\left( -\frac{1}{2} \Big( -2\mu_k' \sum_j Y_j I(\delta_j = k) + \|\mu_k\|^2 \sum_j I(\delta_j = k) \Big) \right),$$
which is a multivariate normal density with mean $\sum_j Y_j I(\delta_j = k) / \sum_j I(\delta_j = k)$ and variance $1/\sum_j I(\delta_j = k)$ in each coordinate (with zero correlation).

$$P(\delta \mid \pi, Y, \mu) \propto P(Y \mid \delta, \mu)\,P(\delta \mid \pi) = \prod_j \phi(Y_j - \mu_{\delta_j})\,\pi_{\delta_j}.$$
This splits into a product of functions of $\delta_1, \ldots, \delta_n$, implying that the $\delta_j$ are independent given $\pi$, $Y$, and $\mu$. To sample $\delta_j$, first compute $q_k = \phi(Y_j - \mu_k)\,\pi_k$ for $k = 1, \ldots, K$, then rescale the $q_k$ so they sum to 1, and generate a multinomial draw according to the probabilities that result.

$$P(\pi \mid Y, \mu, \delta) \propto P(\delta \mid \pi) \propto \prod_k \pi_k^{\sum_j I(\delta_j = k)}.$$
Let $n_k = \sum_j I(\delta_j = k)$, so the posterior density of $\pi$ is proportional to $\prod_k \pi_k^{n_k}$. This is a Dirichlet density, from which samples can be generated by simulating $Z_k \sim \Gamma(n_k + 1, 1)$ and setting $\pi_k = Z_k / \sum_l Z_l$.
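A minimal Gibbs sketch for the mixture model, cycling through the three conditionals above; the simulated data and the handling of empty components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data: n points in R^d from K unit-variance Gaussian components.
K, d, n = 3, 10, 200
true_mu = rng.normal(scale=3.0, size=(K, d))
Y = true_mu[rng.integers(K, size=n)] + rng.standard_normal((n, d))

n_iter = 1_000
mu = rng.standard_normal((K, d))
pi_w = np.full(K, 1.0 / K)
pi_path = np.empty((n_iter, K))
for t in range(n_iter):
    # delta_j | pi, Y, mu: probabilities proportional to phi(Y_j - mu_k) pi_k.
    logw = -0.5 * ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi_w)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    delta = np.array([rng.choice(K, p=wj) for wj in w])
    # mu_k | Y, delta: normal, mean = cluster average, variance 1/n_k per coord.
    for k in range(K):
        idx = delta == k
        if idx.any():
            nk = idx.sum()
            mu[k] = Y[idx].mean(axis=0) + rng.standard_normal(d) / np.sqrt(nk)
        else:
            mu[k] = rng.standard_normal(d)  # empty component: diffuse draw (assumed)
    # pi | delta: Dirichlet via Z_k ~ Gamma(n_k + 1, 1), pi_k = Z_k / sum_l Z_l.
    Z = rng.gamma(np.bincount(delta, minlength=K) + 1.0, 1.0)
    pi_w = Z / Z.sum()
    pi_path[t] = pi_w
print(pi_path[100:].mean(axis=0))       # posterior means of pi_1..pi_K
```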

Figure 2: Gibbs sample paths for the probability parameters $\pi_1, \pi_2, \pi_3$ in the Gaussian mixture model with $K = 3$ components, $d = 10$ $Y$-dimensions, and $n = 200$ observations. The horizontal lines are the true values.

Spatial processes: Suppose we have a random 0/1 array $Y_{ij} \in \{0, 1\}$, and we specify the conditional probabilities as
$$P(Y_{ij} = 1 \mid Y_{i+1,j}, Y_{i-1,j}, Y_{i,j+1}, Y_{i,j-1}) = \epsilon + \frac{\tau - \epsilon}{4}\,(Y_{i+1,j} + Y_{i-1,j} + Y_{i,j+1} + Y_{i,j-1}),$$
where $0 \leq \epsilon \leq \tau \leq 1$. It is a fact that this family of conditional distributions is associated with a unique joint distribution on the array. The conditional probability of observing a 1 given the neighbors ranges from $\epsilon$ to $\tau$, and increases as more neighbors have 1's. Larger values of $\tau - \epsilon$ produce greater spatial autocorrelation (the tendency of 1's to cluster spatially in the array). We can simulate from this distribution using the Gibbs sampler.
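A minimal Gibbs sketch for this spatial model; the grid size, the values of $\epsilon$ and $\tau$, and the boundary convention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Gibbs sampling for the binary array: each site is redrawn from its
# conditional given its four neighbors. Grid size and (eps, tau) are
# illustrative; off-grid neighbors are treated as 0 (an assumed convention).
grid, eps, tau = 64, 0.1, 0.9
Y = rng.integers(2, size=(grid, grid))

def neighbor_sum(Y, i, j):
    s = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < grid and 0 <= nj < grid:
            s += Y[ni, nj]
    return s

for sweep in range(100):          # full Gibbs sweeps over the array
    for i in range(grid):
        for j in range(grid):
            p1 = eps + (tau - eps) / 4.0 * neighbor_sum(Y, i, j)
            Y[i, j] = int(rng.uniform() < p1)

print(Y.mean())                   # fraction of 1's in the array
```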