The Markov Chain Monte Carlo Method

The Markov Chain Monte Carlo Method

Idea: define an ergodic Markov chain whose stationary distribution is the desired probability distribution. Let X_0, X_1, X_2, ..., X_n be the run of the chain. The Markov chain converges to its stationary distribution from any starting state X_0, so after some sufficiently large number r of steps, the distribution of the state X_r will be close to the stationary distribution π of the Markov chain. Now, repeating with X_r as the starting point, we can use X_{2r} as a sample, and so on. So X_r, X_{2r}, X_{3r}, ... can be used as almost independent samples from π.
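
As a small illustration of this sampling scheme, the following Python sketch keeps every r-th state of an arbitrary chain as a sample; the function and parameter names (step, x0, r, num_samples) are illustrative choices, and r is assumed to be at least the mixing time discussed later in these notes.

```python
def mcmc_samples(step, x0, r, num_samples):
    """Run a chain via `step` and keep every r-th state as a sample.
    With r at least the mixing time, the kept states X_r, X_2r, ... are
    almost independent samples from the stationary distribution."""
    samples = []
    x = x0
    for _ in range(num_samples):
        for _ in range(r):
            x = step(x)        # one transition of the chain
        samples.append(x)
    return samples
```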

The Markov Chain Monte Carlo Method

Consider a Markov chain whose states are independent sets in a graph G = (V, E):
1. X_0 is an arbitrary independent set in G.
2. To compute X_{i+1}:
   (a) Choose a vertex v uniformly at random from V.
   (b) If v ∈ X_i then X_{i+1} = X_i \ {v};
   (c) if v ∉ X_i, and adding v to X_i still gives an independent set, then X_{i+1} = X_i ∪ {v};
   (d) otherwise, X_{i+1} = X_i.

The chain is irreducible. The chain is aperiodic. For y ≠ x, P_{x,y} = 1/|V| or 0.
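
A minimal Python sketch of one transition of this chain, assuming the graph is given as an adjacency dictionary mapping each vertex to its set of neighbors (the representation and function name are choices made here, not part of the text):

```python
import random

def independent_set_step(G, X):
    """One transition of the chain whose states are independent sets of G.
    G: adjacency dict {vertex: set of neighbors}; X: current independent set."""
    v = random.choice(list(G))                 # vertex chosen uniformly from V
    Y = set(X)
    if v in Y:
        Y.remove(v)                            # v in X_i: delete it
    elif all(u not in Y for u in G[v]):
        Y.add(v)                               # v not in X_i and still independent: add it
    # otherwise the state is unchanged
    return Y
```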

Let N(x) denote the set of neighbors of x, and let M ≥ max_{x ∈ Ω} |N(x)|.

Lemma: Consider a Markov chain where for all x and y with y ≠ x, P_{x,y} = 1/M if y ∈ N(x), and P_{x,y} = 0 otherwise. Also, P_{x,x} = 1 − |N(x)|/M. If this chain is irreducible and aperiodic, then the stationary distribution is the uniform distribution.

Proof: We show that the chain is time-reversible, and apply Theorem 7.10. For any x ≠ y, if π_x = π_y, then π_x P_{x,y} = π_y P_{y,x}, since P_{x,y} = P_{y,x} = 1/M. It follows that the uniform distribution π_x = 1/|Ω| is the stationary distribution.

The Metropolis Algorithm

Assume that we want to sample from a non-uniform distribution. For example, we want the probability of an independent set of size i to be proportional to λ^i. Consider a Markov chain on independent sets in G = (V, E):
1. X_0 is an arbitrary independent set in G.
2. To compute X_{i+1}:
   (a) Choose a vertex v uniformly at random from V.
   (b) If v ∈ X_i then set X_{i+1} = X_i \ {v} with probability min(1, 1/λ);
   (c) if v ∉ X_i, and adding v to X_i still gives an independent set, then set X_{i+1} = X_i ∪ {v} with probability min(1, λ);
   (d) otherwise, set X_{i+1} = X_i.
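
The same sketch adapted to the Metropolis rule above, again assuming an adjacency-dictionary representation; the acceptance probabilities min(1, 1/λ) and min(1, λ) are applied to removals and additions respectively:

```python
import random

def metropolis_is_step(G, X, lam):
    """One Metropolis transition targeting Pr[X] proportional to lam**|X|
    over independent sets of G (adjacency dict as before)."""
    v = random.choice(list(G))                 # vertex chosen uniformly from V
    Y = set(X)
    if v in Y:
        if random.random() < min(1.0, 1.0 / lam):
            Y.remove(v)                        # remove with probability min(1, 1/lambda)
    elif all(u not in Y for u in G[v]):
        if random.random() < min(1.0, lam):
            Y.add(v)                           # add with probability min(1, lambda)
    return Y
```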

Lemma: For a finite state space Ω, let M ≥ max_{x ∈ Ω} |N(x)|. For all x ∈ Ω, let π_x > 0 be the desired probability of state x in the stationary distribution. Consider a Markov chain where for all x and y with y ≠ x,

P_{x,y} = (1/M) min(1, π_y/π_x) if y ∈ N(x), and P_{x,y} = 0 otherwise.

Further, P_{x,x} = 1 − Σ_{y ≠ x} P_{x,y}. Then, if this chain is irreducible and aperiodic, the stationary distribution is given by the probabilities π_x.

Proof: We show the chain is time-reversible. For any x ≠ y, if π_x ≤ π_y, then P_{x,y} = 1/M and P_{y,x} = (1/M)(π_x/π_y). It follows that π_x P_{x,y} = π_y P_{y,x}. Similarly, if π_x > π_y, then P_{x,y} = (1/M)(π_y/π_x) and P_{y,x} = 1/M, and it follows that π_x P_{x,y} = π_y P_{y,x}.

Note that the Metropolis Algorithm only needs the ratios π_x/π_y. In our construction, the probability of an independent set of size i is λ^i/B, where B = Σ_x λ^{size(x)}, although we may not know B.

Coupling and MC Convergence

An ergodic Markov chain converges to its stationary distribution. How long do we need to run the chain until the sampled state is distributed almost according to the stationary distribution? How do we measure the distance between distributions? How do we analyze the speed of convergence?

Variation Distance

Definition: The variation distance between two distributions D_1 and D_2 on a countable state space S is given by

||D_1 − D_2|| = (1/2) Σ_{x ∈ S} |D_1(x) − D_2(x)|.

See Figure 11.1 in the book: the total area shaded by upward diagonal lines must equal the total area shaded by downward diagonal lines, and the variation distance equals one of these two areas.
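
For concreteness, a short Python helper computing this quantity for two distributions given as dictionaries from states to probabilities (an illustrative representation, not from the text):

```python
def variation_distance(D1, D2):
    """||D1 - D2|| = (1/2) * sum over x of |D1(x) - D2(x)|, for distributions
    given as dicts mapping states to probabilities (missing keys count as 0)."""
    states = set(D1) | set(D2)
    return 0.5 * sum(abs(D1.get(x, 0.0) - D2.get(x, 0.0)) for x in states)

# variation_distance({'a': 0.5, 'b': 0.5}, {'a': 0.75, 'b': 0.25}) -> 0.25
```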

Lemma: For any A ⊆ S, let D_i(A) = Σ_{x ∈ A} D_i(x), for i = 1, 2. Then

||D_1 − D_2|| = max_{A ⊆ S} |D_1(A) − D_2(A)|.

Proof: Let S^+ ⊆ S be the set of states such that D_1(x) ≥ D_2(x), and S^− ⊆ S the set of states such that D_2(x) > D_1(x). Clearly

max_{A ⊆ S} (D_1(A) − D_2(A)) = D_1(S^+) − D_2(S^+), and
max_{A ⊆ S} (D_2(A) − D_1(A)) = D_2(S^−) − D_1(S^−).

But since D_1(S) = D_2(S) = 1, we have

D_1(S^+) + D_1(S^−) = D_2(S^+) + D_2(S^−) = 1,

which implies that

D_1(S^+) − D_2(S^+) = D_2(S^−) − D_1(S^−).

Hence

max_{A ⊆ S} |D_1(A) − D_2(A)| = D_1(S^+) − D_2(S^+) = D_2(S^−) − D_1(S^−),

and since

(D_1(S^+) − D_2(S^+)) + (D_2(S^−) − D_1(S^−)) = Σ_{x ∈ S} |D_1(x) − D_2(x)| = 2 ||D_1 − D_2||,

we have max_{A ⊆ S} |D_1(A) − D_2(A)| = ||D_1 − D_2||.

Rate of Convergence

Definition: Let π be the stationary distribution of a Markov chain with state space S. Let p_x^t represent the distribution of the state of the chain starting at state x after t steps. We define

Δ_x(t) = ||p_x^t − π||;   Δ(t) = max_{x ∈ S} Δ_x(t).

That is, Δ_x(t) is the variation distance between the stationary distribution and p_x^t, and Δ(t) is the maximum of these values over all states x. We also define

τ_x(ε) = min{t : Δ_x(t) ≤ ε};   τ(ε) = max_{x ∈ S} τ_x(ε).

That is, τ_x(ε) is the first step t at which the variation distance between p_x^t and the stationary distribution is at most ε, and τ(ε) is the maximum of these values over all states x.
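
These quantities can be computed exactly for small chains by taking powers of the transition matrix. A sketch using numpy, with P a transition matrix and pi its stationary distribution (the names and the brute-force search over t are choices made for illustration):

```python
import numpy as np

def delta_x(P, pi, x, t):
    """Variation distance between p_x^t (row x of P**t) and pi."""
    p_t = np.linalg.matrix_power(P, t)[x]
    return 0.5 * np.abs(p_t - pi).sum()

def tau(P, pi, eps, t_max=10_000):
    """Smallest t with max_x Delta_x(t) <= eps, searched up to t_max."""
    for t in range(1, t_max + 1):
        if max(delta_x(P, pi, x, t) for x in range(len(pi))) <= eps:
            return t
    return None
```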

Coupling

Definition: A coupling of a Markov chain M with state space S is a Markov chain Z_t = (X_t, Y_t) on the state space S × S such that

Pr(X_{t+1} = x' | Z_t = (x, y)) = Pr(X_{t+1} = x' | X_t = x);
Pr(Y_{t+1} = y' | Z_t = (x, y)) = Pr(Y_{t+1} = y' | Y_t = y).

The Coupling Lemma

Lemma (Coupling Lemma): Let Z_t = (X_t, Y_t) be a coupling for a Markov chain M on a state space S. Suppose that there exists a T so that for every x, y ∈ S,

Pr(X_T ≠ Y_T | X_0 = x, Y_0 = y) ≤ ε.

Then τ(ε) ≤ T. That is, for any initial state, the variation distance between the distribution of the state of the chain after T steps and the stationary distribution is at most ε.

Proof: Consider the coupling when Y_0 is chosen according to the stationary distribution and X_0 takes on any arbitrary value. For the given T and ε, and for any A ⊆ S,

Pr(X_T ∈ A) ≥ Pr((X_T = Y_T) ∩ (Y_T ∈ A))
= 1 − Pr((X_T ≠ Y_T) ∪ (Y_T ∉ A))
≥ (1 − Pr(Y_T ∉ A)) − Pr(X_T ≠ Y_T)
≥ Pr(Y_T ∈ A) − ε = π(A) − ε.

Here we used that when Y_0 is chosen according to the stationary distribution then, by the definition of the stationary distribution, Y_1, Y_2, ..., Y_T are also distributed according to the stationary distribution.

Similarly, Pr(X_T ∉ A) ≥ π(S \ A) − ε, or Pr(X_T ∈ A) ≤ π(A) + ε. It follows that

max_{x, A} |p_x^T(A) − π(A)| ≤ ε.

Example: Shuffling Cards

Markov chain:
States: orders of the deck of n cards. There are n! states.
Transitions: at each step choose one card, uniformly at random, and move it to the top.
The chain is irreducible: we can go from any permutation to any other using only moves to the top (at most n moves).
The chain is aperiodic: it has self-loops, since the top card is chosen with probability 1/n.
Hence, by Theorem 7.10, the chain has a stationary distribution π.

A given state x of the chain has |N(x)| = n: the new top card can be any one of the n cards. Let π_y be the probability of being in state y under π; then for any state x:

π_x = (1/n) Σ_{y ∈ N(x)} π_y.

π = (1/n!, 1/n!, ..., 1/n!) is a solution, and hence the stationary distribution is the uniform distribution.

Given two such chains X_t and Y_t, we define the coupling: the first chain chooses a card uniformly at random and moves it to the top; the second chain moves the same card (which may be in a different position) to the top. Once a card is on the top in both chains at the same time, it remains in the same position in both chains from then on. Hence the chains are sure to be equal once every card has been picked at least once. So we can use the coupon collector argument: after running the chain for n ln n + cn steps, the probability that a specific card (e.g. the ace of spades) has not yet been moved to the top is at most

(1 − 1/n)^{n ln n + cn} ≤ e^{−(ln n + c)} = e^{−c}/n.

Hence the probability that there is some card which was not chosen by the first chain in n ln n + cn steps is at most e^{−c}. After n ln n + n ln(1/ε) = n ln(n/ε) steps, the variation distance between our chain and the uniform distribution is bounded by ε, implying that

τ(ε) ≤ n ln(n ε^{−1}).
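
A quick way to see this coupon-collector behaviour is to simulate the coupled shuffles until the two decks agree; in the sketch below the deck representation and function name are illustrative, and the observed coalescence time should be on the order of n ln n:

```python
import random

def coupled_shuffle_coalescence(n, seed=None):
    """Run the move-to-top coupling on two decks of n cards until they agree;
    return the number of steps taken (expected to be about n ln n)."""
    rng = random.Random(seed)
    X = list(range(n))                    # one arbitrary starting order
    Y = rng.sample(range(n), n)           # another starting order
    t = 0
    while X != Y:
        c = rng.randrange(n)              # choose a *card* uniformly at random
        X.insert(0, X.pop(X.index(c)))    # both chains move that same card
        Y.insert(0, Y.pop(Y.index(c)))    # to the top of their decks
        t += 1
    return t
```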

Example: Random Walks on the Hypercube

Consider the n-cube, with N = 2^n nodes. Let x̄ = (x_1, ..., x_n) be the binary representation of x. Nodes x and y are connected by an edge iff x̄ and ȳ differ in exactly one bit.

Markov chain on the n-cube: at each step, choose a coordinate i uniformly at random from [1, n], and set x_i to 0 with probability 1/2 and to 1 with probability 1/2.

The chain is irreducible, finite, and aperiodic, so it has a unique stationary distribution π.

A given state x of the chain has |N(x)| = n + 1: the state itself and its n neighbors. Let π_y be the probability of being in state y under π; then for any state x:

π_x = Σ_{y ∈ N(x)} π_y P_{y,x} = (1/2) π_x + (1/(2n)) Σ_{y ∈ N(x) \ {x}} π_y.

π = (1/2^n, 1/2^n, ..., 1/2^n) solves this, and hence it is the stationary distribution.

Coupling: both chains choose the same bit and give it the same value. The chains couple once every bit has been chosen at least once. By the Coupling Lemma the mixing time satisfies τ(ε) ≤ n ln(n ε^{−1}).
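
The analogous simulation for the hypercube coupling, starting the two walks at opposite corners (an arbitrary choice made for illustration):

```python
import random

def coupled_hypercube_walk(n, seed=None):
    """Coupling for the lazy walk on the n-cube: both copies pick the same
    coordinate and set it to the same random bit. Returns steps to coalesce."""
    rng = random.Random(seed)
    x, y = [0] * n, [1] * n               # start the walks at opposite corners
    t = 0
    while x != y:
        i = rng.randrange(n)              # coordinate chosen uniformly from [1, n]
        b = rng.randrange(2)              # new bit value, 0 or 1 with prob. 1/2
        x[i], y[i] = b, b
        t += 1
    return t
```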

Example: Sampling Independent Sets of a Given Size

Consider a Markov chain whose states are independent sets of size k in a graph G = (V, E):
1. X_0 is an arbitrary independent set of size k in G.
2. To compute X_{t+1}:
   (a) Choose uniformly at random v ∈ X_t and w ∈ V.
   (b) If w ∉ X_t, and (X_t \ {v}) ∪ {w} is an independent set, then X_{t+1} = (X_t \ {v}) ∪ {w};
   (c) otherwise, X_{t+1} = X_t.

Assume k ≤ n/(3Δ + 3), where Δ is the maximum degree. The chain is irreducible, as we can convert an independent set X of size k into any other independent set Y of size k using the operation above (exercise 11.11). The chain is aperiodic, as there are self-loops. For y ≠ x, P_{x,y} = 1/(k|V|) if x and y differ in exactly one vertex, and P_{x,y} = 0 otherwise. By Lemma 10.7 the stationary distribution is the uniform distribution.

Convergence Time

Theorem: Let G be a graph on n vertices with maximum degree Δ. For k ≤ n/(3Δ + 3), τ(ε) = O(kn ln ε^{−1}).

Coupling:
1. X_0 and Y_0 are arbitrary independent sets of size k in G.
2. To compute X_{t+1} and Y_{t+1}:
   (a) Choose uniformly at random v ∈ X_t and w ∈ V.
   (b) If w ∉ X_t, and (X_t \ {v}) ∪ {w} is an independent set, then X_{t+1} = (X_t \ {v}) ∪ {w}; otherwise X_{t+1} = X_t.
   (c) If v ∈ Y_t then v' = v; otherwise choose v' uniformly at random from Y_t \ X_t.
   (d) If w ∉ Y_t, and (Y_t \ {v'}) ∪ {w} is an independent set, then Y_{t+1} = (Y_t \ {v'}) ∪ {w}; otherwise Y_{t+1} = Y_t.
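
One step of this coupling can be written out as follows; the adjacency-dictionary representation and the helper name are assumptions made for this sketch, and X and Y are the two current independent sets of size k:

```python
import random

def coupled_size_k_step(G, X, Y, rng=random):
    """One step of the coupling for the size-k independent-set chain.
    G: adjacency dict {vertex: set of neighbors}; X, Y: independent sets of size k."""
    v = rng.choice(list(X))                              # v uniform in X_t
    w = rng.choice(list(G))                              # w uniform in V
    X_new, Y_new = set(X), set(Y)

    # move in the first chain
    if w not in X and all(u not in (X - {v}) for u in G[w]):
        X_new = (X - {v}) | {w}

    # matching vertex v' in the second chain
    v2 = v if v in Y else rng.choice(list(Y - X))        # v' uniform in Y_t \ X_t

    # move in the second chain
    if w not in Y and all(u not in (Y - {v2}) for u in G[w]):
        Y_new = (Y - {v2}) | {w}

    return X_new, Y_new
```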

Let d_t = |X_t \ Y_t|. Note that |d_{t+1} − d_t| ≤ 1.

d_{t+1} = d_t + 1: it must be that v ∈ X_t ∩ Y_t and there is a move in only one chain. Either w or some neighbor of w must be in (X_t \ Y_t) ∪ (Y_t \ X_t), so

Pr(d_{t+1} = d_t + 1) ≤ ((k − d_t)/k) · (2 d_t (Δ + 1)/n).

d_{t+1} = d_t − 1: it is sufficient that v ∉ Y_t and that w and its neighbors are not in (X_t ∪ Y_t) \ {v, v'}. Since |X_t ∪ Y_t| = k + d_t,

Pr(d_{t+1} = d_t − 1) ≥ (d_t/k) · ((n − (k + d_t − 2)(Δ + 1))/n).

Conditional expectation: there are only three possible values for d_{t+1} given the value of d_t > 0, namely d_t − 1, d_t, and d_t + 1. Hence, using the formula for conditional expectation, we have

E[d_{t+1} | d_t] = (d_t + 1) Pr(d_{t+1} = d_t + 1) + d_t Pr(d_{t+1} = d_t) + (d_t − 1) Pr(d_{t+1} = d_t − 1)
= d_t (Pr(d_{t+1} = d_t − 1) + Pr(d_{t+1} = d_t) + Pr(d_{t+1} = d_t + 1)) + Pr(d_{t+1} = d_t + 1) − Pr(d_{t+1} = d_t − 1)
= d_t + Pr(d_{t+1} = d_t + 1) − Pr(d_{t+1} = d_t − 1).

Now we have, for d_t > 0,

E[d_{t+1} | d_t] = d_t + Pr(d_{t+1} = d_t + 1) − Pr(d_{t+1} = d_t − 1)
≤ d_t + ((k − d_t)/k)(2 d_t (Δ + 1)/n) − (d_t/k)((n − (k + d_t − 2)(Δ + 1))/n)
= d_t (1 − (n − (3k − d_t − 2)(Δ + 1))/(kn))
≤ d_t (1 − (n − (3k − 3)(Δ + 1))/(kn)).

Once d_t = 0, the two chains follow the same path, thus E[d_{t+1} | d_t = 0] = 0. Hence

E[d_{t+1}] = E[E[d_{t+1} | d_t]] ≤ E[d_t] (1 − (n − (3k − 3)(Δ + 1))/(kn)),

and therefore

E[d_t] ≤ d_0 (1 − (n − (3k − 3)(Δ + 1))/(kn))^t.

Since d_0 ≤ k, and d_t is a non-negative integer,

Pr(d_t ≥ 1) ≤ E[d_t] ≤ k (1 − (n − (3k − 3)(Δ + 1))/(kn))^t ≤ k e^{−t (n − (3k − 3)(Δ + 1))/(kn)}.

For k ≤ n/(3Δ + 3) the variation distance converges to zero, and

τ(ε) ≤ (kn ln(k ε^{−1}))/(n − (3k − 3)(Δ + 1)).

In particular, when k and Δ are constants, τ(ε) = O(ln ε^{−1}).

Theorem: Given two distributions σ_X and σ_Y on a state space S, let Z = (X, Y) be a random variable on S × S, where X is distributed according to σ_X and Y is distributed according to σ_Y. Then

Pr(X ≠ Y) ≥ ||σ_X − σ_Y||.

Moreover, there exists a joint distribution Z = (X, Y), where X is distributed according to σ_X and Y is distributed according to σ_Y, for which equality holds.

Variation distance is nonincreasing

Recall that Δ(t) = max_x Δ_x(t), where Δ_x(t) is the variation distance between the stationary distribution and the distribution of the state of the Markov chain after t steps when it starts at state x.

Theorem: For any ergodic Markov chain M, Δ(t + 1) ≤ Δ(t).