Markov and Gibbs Random Fields

Markov and Gibbs Random Fields
Bruno Galerne, bruno.galerne@parisdescartes.fr
MAP5, Université Paris Descartes
Master MVA, Cours Méthodes stochastiques pour l'analyse d'images
Monday, March 6, 2017

Outline
The Ising Model
Markov Random Fields and Gibbs Fields
Sampling Gibbs Fields

Reference
The main reference for this course is Chapter 4 of the book Pattern Theory: The Stochastic Analysis of Real-World Signals by D. Mumford and A. Desolneux [Mumford and Desolneux 2010].

Outline
The Ising Model
Markov Random Fields and Gibbs Fields
Sampling Gibbs Fields

The Ising Model
The Ising model is the simplest and most famous Gibbs model. It has its origin in statistical mechanics but can be presented in the context of image segmentation.
Framework: A real-valued image $I \in \mathbb{R}^{M \times N}$ of size $M \times N$ is given (with gray levels mostly in $[-1, 1]$ after some contrast change). One wants to associate with it a binary image $J \in \{-1, 1\}^{M \times N}$ ($-1$ for black, $1$ for white) supposed to model the predominantly black/white areas of $I$. There is only a finite number $2^{MN}$ of values for $J$.
[Figure: input image $I$ and a realization $J$ of the associated Ising model (with $T = 1.5$).]

The Ising Model
Notation: $\Omega_{M,N} = \{0, \dots, M-1\} \times \{0, \dots, N-1\}$ is the set of pixel indexes. To keep the notation short, we denote pixel coordinates $(k, l)$, $(m, n)$ by $\alpha$ or $\beta$, so that $I(\alpha) = I(k, l)$ for some $(k, l)$. We write $\alpha \sim \beta$ to mean that the pixels $\alpha$ and $\beta$ are neighbors for the 4-connectivity (each pixel has 4 neighbors, except at the border).
Energy: To each couple of images $(I, J)$, $I \in \mathbb{R}^{M \times N}$, $J \in \{-1, 1\}^{M \times N}$, one associates the energy $E(I, J)$ defined by
$$E(I, J) = c \sum_{\alpha \in \Omega_{M,N}} (I(\alpha) - J(\alpha))^2 + \sum_{\alpha \sim \beta} (J(\alpha) - J(\beta))^2,$$
where $c > 0$ is a positive constant and the sum $\sum_{\alpha \sim \beta}$ means that each pair of connected pixels is summed once (and not twice).
The first term of the energy measures the similarity between $I$ and $J$. The second term measures the similarity between pixels of $J$ that are neighbors.

The Ising Model
Energy:
$$E(I, J) = c \sum_{\alpha \in \Omega_{M,N}} (I(\alpha) - J(\alpha))^2 + \sum_{\alpha \sim \beta} (J(\alpha) - J(\beta))^2$$
The Ising model associated with $I$: For any fixed image $I$, this energy enables to define a discrete probability distribution on $\{-1, 1\}^{M \times N}$ by
$$p_T(J) = \frac{1}{Z_T} e^{-\frac{1}{T} E(I,J)},$$
where $T > 0$ is a constant called the temperature and $Z_T$ is the normalizing constant
$$Z_T = \sum_{J \in \{-1,1\}^{M \times N}} e^{-\frac{1}{T} E(I,J)}.$$
The probability distribution $p_T$ is the Ising model associated with $I$ (and constant $c$ and temperature $T$).
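For concreteness, here is a minimal numpy sketch of the energy $E(I,J)$ and the unnormalized probability $e^{-E(I,J)/T}$ (our own illustration, not from the course; the function names are ours):

```python
import numpy as np

def ising_energy(I, J, c=1.0):
    """Ising energy E(I, J): data term c * sum (I - J)^2 plus the sum of
    (J(alpha) - J(beta))^2 over 4-connected pairs, each pair counted once."""
    data_term = c * np.sum((I - J) ** 2)
    pair_term = np.sum((J[:, 1:] - J[:, :-1]) ** 2) + np.sum((J[1:, :] - J[:-1, :]) ** 2)
    return data_term + pair_term

def unnormalized_p(I, J, T=1.5, c=1.0):
    """Unnormalized probability exp(-E(I, J) / T); the constant Z_T is intractable."""
    return np.exp(-ising_energy(I, J, c) / T)
```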

The Ising Model
Energy: $E(I, J) = c \sum_{\alpha \in \Omega_{M,N}} (I(\alpha) - J(\alpha))^2 + \sum_{\alpha \sim \beta} (J(\alpha) - J(\beta))^2$
Probability distribution: $p_T(J) = \frac{1}{Z_T} e^{-\frac{1}{T} E(I,J)}$
Minimizing the energy $E(I, J)$ with respect to $J$ is equivalent to finding the most probable state of the discrete distribution $p_T$. Note that this most probable state (also called the mode of the distribution) is the same for all temperatures $T$.
As $T$ tends to $0$, the distribution $p_T$ tends to be concentrated at this mode. As $T$ tends to $+\infty$, $p_T$ tends to be uniform over all possible image configurations. Hence the temperature parameter $T$ controls the amount of allowed randomness around the most probable state.

The Ising Model
The main questions regarding the Ising model are the following:
1. How can we sample efficiently from the discrete distribution $p_T$? Although the distribution is a discrete probability on the finite set of binary images $\{-1,1\}^{M \times N}$, one cannot compute the table of the probability distribution, even for $10 \times 10$ images! Indeed, the memory size of the table would be
$$\underbrace{2^{10 \times 10}}_{\text{number of binary images}} \times \underbrace{4}_{\text{bytes for a float}} > 10^{30} \text{ bytes} = 10^{18} \text{ terabytes}$$
(see the quick check below).
2. How can we compute the common mode(s) of the distributions $p_T$, that is, (one of) the most probable state(s) of the distributions $p_T$?
3. What is the right order of magnitude for the temperature value $T$?
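A quick Python check of the table-size estimate in question 1 (our own illustration):

```python
table_bytes = 2 ** (10 * 10) * 4    # one 4-byte float per binary 10x10 image
print(table_bytes)                  # 5070602400912917605986812821504, about 5e30 bytes
print(table_bytes / 1e12, "TB")     # about 5.1e18 terabytes (1 TB = 1e12 bytes)
```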


The Ising Model

Outline
The Ising Model
Markov Random Fields and Gibbs Fields
Sampling Gibbs Fields

A General Framework
The Ising model is a very special case of a large class of stochastic models called Gibbs fields.
General framework: We consider an arbitrary finite graph $(V, E)$ with vertices $V$ and undirected edges $E \subset V \times V$. Each vertex has a random label in a phase space $\Lambda$ (also supposed to be finite, e.g. discrete gray levels). The set $\Omega = \Lambda^V$ of all label configurations is called the state space. The elements of $\Omega = \Lambda^V$ are denoted by $x, y, \dots$, that is, $x = (x_\alpha)_{\alpha \in V}$, $x_\alpha \in \Lambda$.
Example of the Ising model: The vertices $V$ are the pixels of $\Omega_{M,N}$. The edges are the ones of the 4-connectivity. The phase space is $\Lambda = \{-1, 1\}$. The state space is the set of binary images $\Omega = \{-1, 1\}^{M \times N}$.

Markov Random Fields
Definition (Markov Random Field)
A random variable $X = (X_\alpha)_{\alpha \in V}$ with values in the state space $\Omega = \Lambda^V$ is a Markov random field (MRF) if for any partition $V = V_1 \cup V_2 \cup V_3$ of the set of vertices $V$ such that there is no edge between vertices of $V_1$ and $V_3$, and denoting by $X_i = X_{V_i}$ the restriction of $X$ to $V_i$, $i = 1, 2, 3$, one has
$$P(X_1 = x_1 \mid X_2 = x_2, X_3 = x_3) = P(X_1 = x_1 \mid X_2 = x_2)$$
whenever both conditional probabilities are well defined, that is, whenever $P(X_2 = x_2, X_3 = x_3) > 0$.
The above property is called the Markov property. It is equivalent to saying that $X_1$ and $X_3$ are conditionally independent given $X_2$, that is,
$$P(X_1 = x_1, X_3 = x_3 \mid X_2 = x_2) = P(X_1 = x_1 \mid X_2 = x_2) \, P(X_3 = x_3 \mid X_2 = x_2)$$
whenever the conditional probabilities are well defined, that is, whenever $P(X_2 = x_2) > 0$.

Conditional Independence
Proof: Conditional independence $\Rightarrow$ Markov property
Assuming $P(X_2 = x_2, X_3 = x_3) > 0$ (since otherwise there is nothing to show):
$$P(X_1 = x_1 \mid X_2 = x_2, X_3 = x_3) = \frac{P(X_1 = x_1, X_2 = x_2, X_3 = x_3)}{P(X_2 = x_2, X_3 = x_3)} = \frac{P(X_1 = x_1, X_3 = x_3 \mid X_2 = x_2)}{P(X_3 = x_3 \mid X_2 = x_2)}$$
$$= \frac{P(X_1 = x_1 \mid X_2 = x_2) \, P(X_3 = x_3 \mid X_2 = x_2)}{P(X_3 = x_3 \mid X_2 = x_2)} = P(X_1 = x_1 \mid X_2 = x_2).$$

Conditional Independence
Proof: Markov property $\Rightarrow$ Conditional independence
Assuming $P(X_2 = x_2, X_3 = x_3) > 0$ (since otherwise there is nothing to show):
$$P(X_1 = x_1 \mid X_2 = x_2) \, P(X_3 = x_3 \mid X_2 = x_2) = P(X_1 = x_1 \mid X_2 = x_2, X_3 = x_3) \, P(X_3 = x_3 \mid X_2 = x_2)$$
$$= \frac{P(X_1 = x_1, X_2 = x_2, X_3 = x_3)}{P(X_2 = x_2, X_3 = x_3)} \cdot \frac{P(X_2 = x_2, X_3 = x_3)}{P(X_2 = x_2)} = P(X_1 = x_1, X_3 = x_3 \mid X_2 = x_2).$$

Gibbs Fields
Definition (Cliques)
Given a graph $(V, E)$, a subset $C \subset V$ is a clique if any two distinct vertices of $C$ are linked by an edge of $E$:
$$\forall \alpha, \beta \in C, \quad \alpha \neq \beta \Rightarrow (\alpha, \beta) \in E.$$
The set of cliques of a graph is denoted by $\mathcal{C}$.
Example
1. The singletons $\{\alpha\}$ of $V$ are always cliques.
2. For the graph of the 4-connectivity, the cliques are the singletons, the pairs of horizontally adjacent pixels, and the pairs of vertically adjacent pixels.

Gibbs Fields
Definition (Families of potentials)
A family of functions $U_C : \Omega \to \mathbb{R}$, $C \in \mathcal{C}$, is said to be a family of potentials if each function $U_C$ only depends on the restriction to the clique $C$:
$$\forall x, y \in \Omega, \quad x_C = y_C \Rightarrow U_C(x) = U_C(y).$$
Definition (Gibbs distribution and Gibbs field)
A Gibbs distribution on the state space $\Omega = \Lambda^V$ is a probability distribution $P$ that comes from an energy function deriving from a family of potentials on cliques $(U_C)_{C \in \mathcal{C}}$:
$$P(x) = \frac{1}{Z} e^{-E(x)}, \quad \text{where } E(x) = \sum_{C \in \mathcal{C}} U_C(x),$$
and $Z$ is the normalizing constant $Z = \sum_{x \in \Omega} e^{-E(x)}$.
A Gibbs field is a random variable $X = (X_\alpha)_{\alpha \in V} \in \Omega$ whose law is a Gibbs distribution.

Gibbs Fields
Gibbs distribution: $P(x) = \frac{1}{Z} e^{-E(x)}$, where $E(x) = \sum_{C \in \mathcal{C}} U_C(x)$.
Example
The distribution of the Ising model is given by
$$p_T(J) = \frac{1}{Z_T} e^{-\frac{1}{T} E(I,J)}, \quad \text{where } E(I, J) = c \sum_{\alpha \in \Omega_{M,N}} (I(\alpha) - J(\alpha))^2 + \sum_{\alpha \sim \beta} (J(\alpha) - J(\beta))^2.$$
Hence it is a Gibbs distribution that derives from the family of potentials
$$U_{\{\alpha\}}(J) = \frac{c}{T} (I(\alpha) - J(\alpha))^2, \ \alpha \in V, \quad \text{and} \quad U_{\{\alpha,\beta\}}(J) = \frac{1}{T} (J(\alpha) - J(\beta))^2, \ (\alpha, \beta) \in E.$$

Gibbs Fields
Gibbs distribution: $P(x) = \frac{1}{Z} e^{-E(x)}$, where $E(x) = \sum_{C \in \mathcal{C}} U_C(x)$.
Remark: One can always introduce a temperature parameter $T$: for all $T > 0$,
$$P_T(x) = \frac{1}{Z_T} e^{-\frac{1}{T} E(x)}$$
is also a Gibbs distribution.

The Hammersley-Clifford Theorem
Theorem (Hammersley-Clifford)
Gibbs fields and MRFs are equivalent in the following sense:
1. If $X$ is a Gibbs field, then it satisfies the Markov property.
2. If $P$ is the distribution of a MRF such that $P(x) > 0$ for all $x \in \Omega$, then there exists an energy function $E$ deriving from a family of potentials $(U_C)_{C \in \mathcal{C}}$ such that
$$P(x) = \frac{1}{Z} e^{-E(x)}.$$
Take-away message: Gibbs fields and Markov random fields are essentially the same thing.

Proof that Gibbs Fields are MRFs
Let $X$ be a Gibbs field with distribution $P(x) = \frac{1}{Z} e^{-E(x)}$, where $E(x) = \sum_{C \in \mathcal{C}} U_C(x)$. Let $V = V_1 \cup V_2 \cup V_3$ be any partition of the set of vertices $V$ such that there is no edge between vertices of $V_1$ and $V_3$. For each $i = 1, 2, 3$, one denotes by $x_i = x_{V_i}$ the restriction of a state $x$ to $V_i$.
Remark that a clique $C \in \mathcal{C}$ cannot intersect both $V_1$ and $V_3$, that is, either $C \subset V_1 \cup V_2$ or $C \subset V_2 \cup V_3$. Since $U_C(x)$ only depends on the restriction of $x$ to $C$, $U_C(x)$ depends either on $(x_1, x_2)$ or on $(x_2, x_3)$. One can thus write $E(x) = E_1(x_1, x_2) + E_2(x_2, x_3)$ and thus $P(x) = F_1(x_1, x_2) F_2(x_2, x_3)$.
Hence, using $P(x_2, x_3) = \sum_{y \in \Lambda^{V_1}} P(y, x_2, x_3)$,
$$P(x_1 \mid x_2, x_3) = \frac{P(x_1, x_2, x_3)}{P(x_2, x_3)} = \frac{F_1(x_1, x_2) F_2(x_2, x_3)}{\sum_{y \in \Lambda^{V_1}} F_1(y, x_2) F_2(x_2, x_3)} = \frac{F_1(x_1, x_2)}{\sum_{y \in \Lambda^{V_1}} F_1(y, x_2)} = \frac{P(x_1, x_2)}{P(x_2)} = P(x_1 \mid x_2),$$
where the last two equalities follow by summing $P(x_1, x_2, x_3) = F_1(x_1, x_2) F_2(x_2, x_3)$ over $x_3$, and then also over $x_1$.

The Hammersley-Clifford Theorem
Hence a Gibbs field satisfies the Markov property: it is a MRF. The difficult part of the Hammersley-Clifford theorem is the converse implication! It relies solely on the Möbius inversion formula.
Proposition (Möbius inversion formula)
Let $V$ be a finite set and $f$ and $g$ be two functions defined on the set $\mathcal{P}(V)$ of subsets of $V$. Then
$$\forall A \subset V, \ f(A) = \sum_{B \subset A} (-1)^{|A \setminus B|} g(B) \quad \Longleftrightarrow \quad \forall A \subset V, \ g(A) = \sum_{B \subset A} f(B).$$
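As a sanity check, the inversion formula can be verified by brute force on a small set (our own illustration, not part of the course):

```python
import random
from itertools import chain, combinations

def subsets(A):
    """All subsets of A, as frozensets."""
    A = list(A)
    return [frozenset(c) for c in chain.from_iterable(combinations(A, r) for r in range(len(A) + 1))]

V = frozenset(range(4))
f = {B: random.random() for B in subsets(V)}                 # arbitrary f
g = {A: sum(f[B] for B in subsets(A)) for A in subsets(V)}   # g(A) = sum_{B subset A} f(B)

# Moebius inversion should recover f(A) = sum_{B subset A} (-1)^{|A \ B|} g(B).
for A in subsets(V):
    f_rec = sum((-1) ** len(A - B) * g[B] for B in subsets(A))
    assert abs(f_rec - f[A]) < 1e-9
```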

Proof of the Möbius inversion formula
Simple fact: Let $n$ be an integer. Then
$$\sum_{k=0}^{n} (-1)^k \binom{n}{k} = \begin{cases} 1 & \text{if } n = 0, \\ 0 & \text{if } n > 0, \end{cases}$$
since it is equal to $(1 + (-1))^n$!
Proof of $\big[\forall A \subset V, \ f(A) = \sum_{B \subset A} (-1)^{|A \setminus B|} g(B)\big] \Rightarrow \big[\forall A \subset V, \ g(A) = \sum_{B \subset A} f(B)\big]$:
Let $A \subset V$. Then, writing $B = D \cup E$ with $E \subset A \setminus D$,
$$\sum_{B \subset A} f(B) = \sum_{B \subset A} \sum_{D \subset B} (-1)^{|B \setminus D|} g(D) = \sum_{D \subset A} g(D) \sum_{E \subset A \setminus D} (-1)^{|E|} = \sum_{D \subset A} g(D) \sum_{k=0}^{|A \setminus D|} \binom{|A \setminus D|}{k} (-1)^k = g(A),$$
since the inner sum is $0$ if $|A \setminus D| > 0$ and $1$ if $D = A$.

Proof of the Möbius inversion formula
Proof of $\big[\forall A \subset V, \ g(A) = \sum_{B \subset A} f(B)\big] \Rightarrow \big[\forall A \subset V, \ f(A) = \sum_{B \subset A} (-1)^{|A \setminus B|} g(B)\big]$:
Let $A \subset V$. Then
$$\sum_{B \subset A} (-1)^{|A \setminus B|} g(B) = \sum_{B \subset A} (-1)^{|A \setminus B|} \sum_{D \subset B} f(D) = \sum_{D \subset A} f(D) \sum_{E \subset A \setminus D} (-1)^{|A \setminus (E \cup D)|} = \sum_{D \subset A} f(D) (-1)^{|A \setminus D|} \sum_{k=0}^{|A \setminus D|} \binom{|A \setminus D|}{k} (-1)^k = f(A),$$
where we wrote $B = D \cup E$ with $E \subset A \setminus D$, so that $(-1)^{|A \setminus B|} = (-1)^{|A \setminus D|} (-1)^{|E|}$; the inner sum is $0$ if $|A \setminus D| > 0$ and $1$ if $D = A$.

Proof that MRFs with Positive Probability are Gibbs Fields
Notation: Let $\lambda_0 \in \Lambda$ denote a fixed value in the phase space $\Lambda$. For a subset $A \subset V$ and a configuration $x \in \Omega = \Lambda^V$, we denote by $xA$ the configuration that agrees with $x$ at each site of $A$ and is $\lambda_0$ elsewhere, that is,
$$(xA)_\alpha = \begin{cases} x_\alpha & \text{if } \alpha \in A, \\ \lambda_0 & \text{otherwise.} \end{cases}$$
In particular, we have $xV = x$, and we denote by $x^0 = x\emptyset$ the configuration with the value $\lambda_0$ at all sites.
Candidate energy: Remark that if $P$ is a Gibbs distribution with energy $E$, then
$$P(x) = \frac{1}{Z} e^{-E(x)} \quad \Longrightarrow \quad E(x) = \log \frac{P(x^0)}{P(x)} + E(x^0).$$
Besides, the constant $E(x^0)$ has no influence since it can be absorbed into $Z$. Hence, given our Markov distribution $P$, we now define a function $E : \Omega \to \mathbb{R}$ by
$$E(x) = \log \frac{P(x^0)}{P(x)}$$
($P(x) > 0$ by hypothesis).

Proof that MRFs with Positive Probability are Gibbs Fields
Candidate energy: $E(x) = \log \frac{P(x^0)}{P(x)}$, so that $P(x) = P(x^0) e^{-E(x)}$.
Goal: Show that this energy $E$ derives from a family of potentials on cliques $(U_C)_{C \in \mathcal{C}}$: for all $x \in \Omega$, $E(x) = \sum_{C \in \mathcal{C}} U_C(x)$.
Let $x$ be a given configuration. For every subset $A \subset V$, we define
$$U_A(x) = \sum_{B \subset A} (-1)^{|A \setminus B|} E(xB).$$
By the Möbius inversion formula we have, for all $A \subset V$, $E(xA) = \sum_{B \subset A} U_B(x)$. In particular, taking $A = V$, we get $E(x) = \sum_{A \subset V} U_A(x)$ and thus
$$P(x) = P(x^0) e^{-E(x)} = P(x^0) \exp\Big(- \sum_{A \subset V} U_A(x)\Big).$$

Proof that MRFs with Positive Probability are Gibbs Fields
$$P(x) = P(x^0) e^{-E(x)} = P(x^0) \exp\Big(- \sum_{A \subset V} U_A(x)\Big).$$
$P$ has nearly the targeted form! Indeed, $U_A(x) = \sum_{B \subset A} (-1)^{|A \setminus B|} E(xB)$ only depends on the values of $x$ at the sites of $A$. It only remains to show that $U_A(x) = 0$ if $A$ is not a clique: in that case the energy $E(x) = \sum_{A \subset V} U_A(x) = \sum_{C \in \mathcal{C}} U_C(x)$ derives from a family of potentials, that is, $P$ is a Gibbs distribution.
The proof that $U_A(x) = 0$ if $A$ is not a clique will use the Markov property satisfied by $P$ (not used yet).

Proof that MRFs with Positive Probability are Gibbs Fields
Proof that $U_A(x) = 0$ if $A$ is not a clique: Let $A \subset V$ be such that $A$ is not a clique, that is, $A \notin \mathcal{C}$. Then $A$ contains two vertices $\alpha$ and $\beta$ that are not neighbors in the graph $(V, E)$. Splitting the sum over $B \subset A$ according to whether $B$ contains $\alpha$ and/or $\beta$,
$$U_A(x) = \sum_{\substack{B \subset A \\ \alpha \in B,\, \beta \in B}} (-1)^{|A \setminus B|} E(xB) + \sum_{\substack{B \subset A \\ \alpha \in B,\, \beta \notin B}} (-1)^{|A \setminus B|} E(xB) + \sum_{\substack{B \subset A \\ \alpha \notin B,\, \beta \in B}} (-1)^{|A \setminus B|} E(xB) + \sum_{\substack{B \subset A \\ \alpha \notin B,\, \beta \notin B}} (-1)^{|A \setminus B|} E(xB)$$
$$= \sum_{B \subset A \setminus \{\alpha, \beta\}} (-1)^{|A \setminus B|} \underbrace{\Big( E(x(B \cup \{\alpha, \beta\})) + E(xB) - E(x(B \cup \{\alpha\})) - E(x(B \cup \{\beta\})) \Big)}_{\text{let us show that this is } 0}.$$

Proof that MRFs with Positive Probability are Gibbs Fields
Proof that $U_A(x) = 0$ if $A$ is not a clique:
Goal: $E(x(B \cup \{\alpha, \beta\})) + E(xB) - E(x(B \cup \{\alpha\})) - E(x(B \cup \{\beta\})) = 0$.
Since $E(x) = \log \frac{P(x^0)}{P(x)}$,
$$E(x(B \cup \{\alpha, \beta\})) + E(xB) - E(x(B \cup \{\alpha\})) - E(x(B \cup \{\beta\})) = \log \frac{P(x(B \cup \{\alpha\})) \, P(x(B \cup \{\beta\}))}{P(x(B \cup \{\alpha, \beta\})) \, P(xB)}.$$
Let us use the conditional independence (equivalent to the Markov property) with the partition $V_1 = \{\alpha\}$, $V_2 = V \setminus \{\alpha, \beta\}$, $V_3 = \{\beta\}$. Remark that the four states $x(B \cup \{\alpha\})$, $x(B \cup \{\beta\})$, $x(B \cup \{\alpha, \beta\})$, and $xB$ have the same restriction $x_2$ to $V_2 = V \setminus \{\alpha, \beta\}$. Hence
$$P(x(B \cup \{\alpha, \beta\})) = P(X_\alpha = x_\alpha, X_\beta = x_\beta, X_2 = x_2) = P(X_\alpha = x_\alpha \mid X_2 = x_2) \, P(X_\beta = x_\beta \mid X_2 = x_2) \, P(X_2 = x_2),$$
and similarly
$$P(xB) = P(X_\alpha = \lambda_0 \mid X_2 = x_2) \, P(X_\beta = \lambda_0 \mid X_2 = x_2) \, P(X_2 = x_2),$$
$$P(x(B \cup \{\alpha\})) = P(X_\alpha = x_\alpha \mid X_2 = x_2) \, P(X_\beta = \lambda_0 \mid X_2 = x_2) \, P(X_2 = x_2),$$
$$P(x(B \cup \{\beta\})) = P(X_\alpha = \lambda_0 \mid X_2 = x_2) \, P(X_\beta = x_\beta \mid X_2 = x_2) \, P(X_2 = x_2).$$

Proof that MRFs with Positive Probability are Gibbs Fields
Proof that $U_A(x) = 0$ if $A$ is not a clique:
Goal: $E(x(B \cup \{\alpha, \beta\})) + E(xB) - E(x(B \cup \{\alpha\})) - E(x(B \cup \{\beta\})) = 0$.
Hence
$$\frac{P(x(B \cup \{\alpha\})) \, P(x(B \cup \{\beta\}))}{P(x(B \cup \{\alpha, \beta\})) \, P(xB)} = 1,$$
which shows that
$$E(x(B \cup \{\alpha, \beta\})) + E(xB) - E(x(B \cup \{\alpha\})) - E(x(B \cup \{\beta\})) = \log \frac{P(x(B \cup \{\alpha\})) \, P(x(B \cup \{\beta\}))}{P(x(B \cup \{\alpha, \beta\})) \, P(xB)} = 0,$$
and thus $U_A(x) = 0$.

Outline
The Ising Model
Markov Random Fields and Gibbs Fields
Sampling Gibbs Fields

Considered Problems
Gibbs fields pose several difficult computational problems:
1. Compute samples from the distribution.
2. Compute the mode, the most probable state $x^* = \arg\max_x P(x)$ (if it is unique...).
3. Compute the marginal distribution of each component $X_\alpha$ for some fixed $\alpha \in V$.
The simplest method to sample from a Gibbs field is to use a type of Markov chain Monte Carlo (MCMC) known as the Metropolis algorithm. The main idea of Metropolis algorithms is to let a Markov chain evolve on the space of configurations $\Omega = \Lambda^V$. Running a sampler for a reasonable length of time enables one to estimate the marginals of each component $X_\alpha$, or the mean of other random functions of the state.
Simulated annealing: Samples at very low temperatures cluster around the mode of the field. But the Metropolis algorithm takes a very long time to give good samples if the temperature is too low. Simulated annealing consists in starting the Metropolis algorithm at a sufficiently high temperature and gradually lowering the temperature.

Review of Markov Chains
We refer to the textbook [Brémaud 1998] for a complete reference on Markov chains. The main reference for this section is [Mumford and Desolneux 2010].
Definition (Markov Chain)
Let $\Omega$ be a finite set of states. A sequence $(X_n)_{n \in \mathbb{N}}$ of random variables taking values in $\Omega$ is said to be a Markov chain if it satisfies the Markov condition: for all $n \geq 1$ and $x_0, x_1, \dots, x_n \in \Omega$,
$$P(X_n = x_n \mid X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1}) = P(X_n = x_n \mid X_{n-1} = x_{n-1}),$$
whenever both conditional probabilities are well defined, that is, whenever $P(X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1}) > 0$.
Moreover, the Markov chain $(X_n)_{n \in \mathbb{N}}$ is said to be homogeneous if this probability does not depend on $n$, that is, for all $n \geq 1$ and $x, y \in \Omega$,
$$P(X_n = y \mid X_{n-1} = x) = P(X_1 = y \mid X_0 = x).$$
In what follows, Markov chains will always be assumed homogeneous. In this case, the transition matrix $Q$ of the chain is defined as the $|\Omega| \times |\Omega|$ matrix of all transition probabilities
$$Q(x, y) = q_{x \to y} = P(X_1 = y \mid X_0 = x) = P(X_n = y \mid X_{n-1} = x), \quad \forall n.$$

Review of Markov Chains
Proposition
For all $n \geq 1$, the matrix $Q^n = Q \cdots Q$ ($n$ factors) gives the law of $X_n$ given the initial state $X_0$, that is,
$$P(X_n = y \mid X_0 = x) = Q^n(x, y).$$
Proof. By induction. This is true for $n = 1$. Suppose it is true for some $n \geq 1$. Then
$$P(X_{n+1} = y \mid X_0 = x) = \sum_{z \in \Omega} P(X_{n+1} = y, X_n = z \mid X_0 = x) = \sum_{z \in \Omega} P(X_{n+1} = y \mid X_n = z, X_0 = x) \, P(X_n = z \mid X_0 = x)$$
$$= \sum_{z \in \Omega} P(X_{n+1} = y \mid X_n = z) \, Q^n(x, z) = \sum_{z \in \Omega} Q^n(x, z) \, Q(z, y) = Q^{n+1}(x, y).$$

Review of Markov Chains
Definition
We say that the Markov chain is irreducible if for all $x, y \in \Omega$ there exists $n \geq 0$ such that $Q^n(x, y) > 0$.
If a Markov chain is irreducible, then every pair of states can be connected by the chain. Some Markov chains can have a periodic behavior, visiting all the states of $\Omega$ in a prescribed order. This is generally a behavior that is best avoided.
Definition
We say that a Markov chain is aperiodic if for all $x \in \Omega$ the greatest common divisor of all the $n \geq 1$ such that $Q^n(x, x) > 0$ is $1$.

Equilibrium probability distribution
Definition
A probability distribution $\Pi$ on $\Omega$ is an equilibrium probability distribution of $Q$ if
$$\forall y \in \Omega, \quad \sum_{x \in \Omega} \Pi(x) Q(x, y) = \Pi(y).$$
Interpretation: If the initial state $X_0$ follows the distribution $\Pi$, then all the random variables $X_n$ also follow the distribution $\Pi$:
$$P(X_1 = y) = \sum_{x \in \Omega} P(X_1 = y \mid X_0 = x) P(X_0 = x) = \sum_{x \in \Omega} Q(x, y) \Pi(x) = \Pi(y).$$
Theorem
If $Q$ is irreducible, then there exists a unique equilibrium probability distribution $\Pi$ for $Q$. If moreover $Q$ is aperiodic, then
$$\forall x, y \in \Omega, \quad \lim_{n \to +\infty} Q^n(x, y) = \Pi(y).$$
Consequence: If $X_0$ follows any distribution $P_0$, then, for all $y \in \Omega$,
$$P(X_n = y) = \sum_{x \in \Omega} P(X_n = y \mid X_0 = x) P(X_0 = x) = \sum_{x \in \Omega} Q^n(x, y) P_0(x) \xrightarrow[n \to +\infty]{} \sum_{x \in \Omega} \Pi(y) P_0(x) = \Pi(y).$$
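As a small numerical illustration of the convergence theorem (our own example, not from the course), the rows of $Q^n$ all converge to $\Pi$:

```python
import numpy as np

# An irreducible, aperiodic transition matrix on 3 states (rows sum to 1).
Q = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

print(np.linalg.matrix_power(Q, 50))
# Every row is approximately Pi = (0.25, 0.5, 0.25), the equilibrium distribution of Q.
```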

Equilibrium probability distribution
Consequence: $P(X_n = y) \xrightarrow[n \to +\infty]{} \Pi(y)$. Whatever the distribution of the initial state $X_0$, as $n$ tends to $+\infty$ the sequence $(X_n)_{n \in \mathbb{N}}$ converges in distribution to $\Pi$.
Approximate simulation of the distribution $\Pi$ (see the sketch below):
1. Draw a random initial state $x_0$ (with some distribution $P_0$).
2. For $n = 1$ to $N$ ($N$ large), compute $x_n$ by simulating the discrete distribution $P(X_n \mid X_{n-1} = x_{n-1}) = Q(x_{n-1}, \cdot)$.
3. Return the random variate $x_N$.
This procedure is called a Markov chain Monte Carlo (MCMC) method and is at the heart of the Metropolis algorithm to simulate Gibbs fields.
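A generic sketch of this three-step procedure for a finite chain (our own illustration; the helper name `simulate_chain` is ours):

```python
import numpy as np

def simulate_chain(Q, p0, N, rng=None):
    """Run N steps of a homogeneous Markov chain with transition matrix Q
    (rows sum to 1) from an initial distribution p0, and return the final state."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.choice(len(p0), p=p0)           # step 1: draw the initial state x_0 ~ P_0
    for _ in range(N):                      # step 2: simulate x_n ~ Q(x_{n-1}, .)
        x = rng.choice(Q.shape[1], p=Q[x])
    return x                                # step 3: return the random variate x_N
```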

Metropolis algorithm
Framework: $\Omega$ is a finite state space ($\Omega = \Lambda^V$ in our case). Given any energy function $E : \Omega \to \mathbb{R}$, we define a probability distribution on $\Omega$ at each temperature $T$ via the formula
$$P_T(x) = \frac{1}{Z_T} e^{-\frac{1}{T} E(x)}, \quad \text{where } Z_T = \sum_{x \in \Omega} e^{-\frac{1}{T} E(x)}.$$
Idea: Simulate a Markov chain $(X_n)_{n \geq 0}$, $X_n \in \Omega$, whose equilibrium probability distribution is $P_T$, so that, as $n$ tends to $+\infty$, the distribution of $X_n$ tends to $P_T$. We start with some simple Markov chain, given by a transition matrix $Q(x, y) = (q_{x \to y})_{(x,y) \in \Omega^2}$, $q_{x \to y}$ being the probability of the transition from a state $x \in \Omega$ to a state $y \in \Omega$.
Hypotheses on $Q$:
$Q$ is symmetric ($q_{x \to y} = q_{y \to x}$),
$Q$ is irreducible,
$Q$ is aperiodic.

Metropolis algorithm
Definition
The Metropolis transition matrix associated with $P_T$ and $Q$ is the matrix $P$ defined by
$$P(x, y) = p_{x \to y} = \begin{cases} q_{x \to y} & \text{if } x \neq y \text{ and } P_T(y) \geq P_T(x), \\ \dfrac{P_T(y)}{P_T(x)} \, q_{x \to y} & \text{if } x \neq y \text{ and } P_T(y) < P_T(x), \\ 1 - \sum_{z \neq x} p_{x \to z} & \text{if } y = x. \end{cases}$$
Theorem
The Metropolis transition matrix $P(x, y) = (p_{x \to y})$ is irreducible and aperiodic if $Q(x, y) = (q_{x \to y})$ is. In addition, the equilibrium probability distribution of $P(x, y) = (p_{x \to y})$ is $P_T$, that is,
$$\forall x, y \in \Omega, \quad \lim_{n \to +\infty} P^n(x, y) = P_T(y).$$

Metropolis algorithm
Proof: Since $P_T(y) > 0$ for all $y \in \Omega$, we have $Q(x, y) > 0 \Rightarrow P(x, y) > 0$ and, by induction, $Q^n(x, y) > 0 \Rightarrow P^n(x, y) > 0$. Hence, like $Q$, $P$ is irreducible and aperiodic. Consequently, $P$ has a unique equilibrium probability distribution.
We now prove that detailed balance holds, which will imply that $P_T$ is the equilibrium probability distribution of $P$:
$$\forall x, y \in \Omega, \quad P_T(x) \, p_{x \to y} = P_T(y) \, p_{y \to x}.$$
Let $x, y \in \Omega$. If $x = y$, the equality is obvious. If $x \neq y$, we can suppose that $P_T(x) \geq P_T(y)$. Then
$$P_T(x) \, p_{x \to y} = P_T(x) \frac{P_T(y)}{P_T(x)} q_{x \to y} = P_T(y) \, q_{x \to y} = P_T(y) \, q_{y \to x} = P_T(y) \, p_{y \to x}.$$
Summing this equality over all $x$, one gets, for all $y \in \Omega$,
$$\sum_{x \in \Omega} P_T(x) \, p_{x \to y} = P_T(y) \sum_{x \in \Omega} p_{y \to x} = P_T(y),$$
i.e. $P_T$ is the equilibrium probability distribution of the transition matrix $P$.

Metropolis algorithm
Practical aspect: choice of the transition matrix $Q$. The uniform proposal satisfies the hypotheses:
$$\forall x, y \in \Omega, \quad q_{x \to y} = \frac{1}{|\Omega|}.$$
However, this choice is not practical, since it amounts to attempting to draw a structured image at random from white noise proposals. A more practical transition matrix is the one corresponding to the following procedure: one selects a pixel/vertex $\alpha \in V$ at random, and one draws uniformly a new value $\lambda \in \Lambda \setminus \{x_\alpha\}$ for this vertex. Hence
$$q_{x \to y} = \begin{cases} \dfrac{1}{|V| \, (|\Lambda| - 1)} & \text{if } x \text{ and } y \text{ differ at exactly one vertex/pixel}, \\ 0 & \text{otherwise}. \end{cases}$$
One of the main problems of the Metropolis algorithm is that many proposed moves $x \to y$ will not be accepted (see the sketch below). The Gibbs sampler attempts to reduce this problem.
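Putting the pieces together for the Ising model, here is a minimal sketch of the single-site Metropolis sampler (our own implementation of the ingredients above; names and defaults are ours). It uses the fact that flipping one pixel only changes the data term at that pixel and the pair terms with its 4 neighbors, and accepts a move $x \to y$ with probability $\min(1, P_T(y)/P_T(x)) = \min(1, e^{-\Delta E / T})$:

```python
import numpy as np

def metropolis_ising(I, T=1.5, c=1.0, n_iter=10**6, rng=None):
    """Single-site Metropolis sampler for the Ising model p_T(J) ~ exp(-E(I,J)/T).
    Proposal: pick a uniform pixel and flip it (the only other value in {-1, 1})."""
    rng = np.random.default_rng() if rng is None else rng
    M, N = I.shape
    J = rng.choice([-1.0, 1.0], size=(M, N))       # random initial state
    for _ in range(n_iter):
        k, l = rng.integers(M), rng.integers(N)
        s_old, s_new = J[k, l], -J[k, l]
        # Energy change of the flip: local data term plus the 4-neighbor pair terms.
        dE = c * ((I[k, l] - s_new) ** 2 - (I[k, l] - s_old) ** 2)
        for dk, dl in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            a, b = k + dk, l + dl
            if 0 <= a < M and 0 <= b < N:
                dE += (s_new - J[a, b]) ** 2 - (s_old - J[a, b]) ** 2
        # Accept with probability min(1, exp(-dE / T)).
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            J[k, l] = s_new
    return J
```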

Gibbs Sampler Algorithm
Let $\Omega = \Lambda^V$ be the Gibbs state space. The Gibbs sampler selects one site $\alpha \in V$ at random and then picks a new value $\lambda$ for $x_\alpha$ with the probability distribution given by $P_T$ conditioned on the values of the state at all other vertices:
$$p_{x \to y} = \begin{cases} \dfrac{1}{|V|} \, P_T(X_\alpha = y_\alpha \mid X_{V \setminus \{\alpha\}} = x_{V \setminus \{\alpha\}}) & \text{if } y \text{ differs from } x \text{ only at } \alpha, \\ \dfrac{1}{|V|} \sum_{\alpha \in V} P_T(X_\alpha = x_\alpha \mid X_{V \setminus \{\alpha\}} = x_{V \setminus \{\alpha\}}) & \text{if } y = x, \\ 0 & \text{otherwise}. \end{cases}$$
The transition matrix $P$ satisfies the same properties as the one of the Metropolis algorithm (exercise).
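A matching sketch of the single-site Gibbs sampler for the Ising model (again our own illustration, under the same assumptions as above): the conditional distribution of $X_\alpha$ given all the other sites only involves the local terms of the energy, since all other terms cancel in the ratio defining the conditional probability.

```python
import numpy as np

def gibbs_ising(I, T=1.5, c=1.0, n_iter=10**6, rng=None):
    """Single-site Gibbs sampler for the Ising model: resample one pixel from
    its conditional distribution given all the others."""
    rng = np.random.default_rng() if rng is None else rng
    M, N = I.shape
    J = rng.choice([-1.0, 1.0], size=(M, N))       # random initial state
    for _ in range(n_iter):
        k, l = rng.integers(M), rng.integers(N)
        # Local energies of the two candidate values at pixel (k, l).
        energies = []
        for s in (-1.0, 1.0):
            e = c * (I[k, l] - s) ** 2
            for dk, dl in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = k + dk, l + dl
                if 0 <= a < M and 0 <= b < N:
                    e += (s - J[a, b]) ** 2
            energies.append(e)
        w = np.exp(-np.array(energies) / T)        # conditional law ~ exp(-local E / T)
        J[k, l] = rng.choice([-1.0, 1.0], p=w / w.sum())
    return J
```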

Bibliographic references
P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Springer, 1998.
D. Mumford and A. Desolneux, Pattern Theory: The Stochastic Analysis of Real-World Signals, A K Peters, 2010.