Markov Chain Monte Carlo Methods

Michel Bierlaire, Transport and Mobility Laboratory

Markov Chains
Andrey Markov, Russian mathematician.

Markov Chains
Glossary:
Stochastic process: $X_t$, $t = 0, 1, \ldots$, a collection of random variables with the same support, or state space, $\{1, \ldots, i, \ldots, J\}$.
Markov process (short memory): $\Pr(X_t = i \mid X_0, \ldots, X_{t-1}) = \Pr(X_t = i \mid X_{t-1})$.
Homogeneous Markov process: $\Pr(X_t = j \mid X_{t-1} = i) = \Pr(X_{t+k} = j \mid X_{t-1+k} = i) = P_{ij}$ for all $t$ and all $k \geq 0$.
Transition matrix: $P \in \mathbb{R}^{J \times J}$. Properties: $\sum_{j=1}^{J} P_{ij} = 1$ for $i = 1, \ldots, J$, and $P_{ij} \geq 0$ for all $i, j$.

Markov Chains
If state $j$ can be reached from state $i$ with nonzero probability, and state $i$ can be reached from state $j$, we say that $i$ communicates with $j$. Two states that communicate belong to the same class. A Markov chain is irreducible if it contains only one class.
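The definitions above can be checked mechanically. The following is a minimal sketch, assuming states are indexed from 0 and that every state communicates with itself; the function names and the example matrix are illustrative and do not come from the slides.

```python
def reachable(P, i):
    """Set of states reachable from i with nonzero probability (including i)."""
    seen = {i}
    stack = [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def communicating_classes(P):
    """Group states into classes of mutually reachable states."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        cls = sorted(j for j in reach[i] if i in reach[j])
        classes.append(cls)
        assigned.update(cls)
    return classes


def is_irreducible(P):
    return len(communicating_classes(P)) == 1


# Hypothetical example: state 2 can reach states 0 and 1, but not the reverse.
P_example = [[0.5, 0.5, 0.0],
             [0.5, 0.5, 0.0],
             [0.0, 0.5, 0.5]]
print(communicating_classes(P_example))
print(is_irreducible(P_example))
```

Reachability is computed by a depth-first search on the graph whose edges are the nonzero entries of the matrix, so only the sparsity pattern matters, not the values of the probabilities.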

Markov Chains
$P_{ij}^t$ is the probability that the process reaches state $j$ from state $i$ after $t$ steps. Consider all $t$ such that $P_{ii}^t > 0$. Their greatest common divisor $d$ is called the period of state $i$. A state with period 1 is aperiodic. If $P_{ii} > 0$, state $i$ is aperiodic. The period is the same for all states in the same class. Therefore, if the chain is irreducible and one state is aperiodic, they all are.

A periodic chain
Transition matrix $P$ (entries not shown), period $d = 3$.
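Since the matrix on this slide is not available, the sketch below uses a hypothetical deterministic 3-cycle as a stand-in to illustrate the definition of the period. The helper `period` and the cutoff `max_steps` are assumptions made for illustration: the gcd is taken over return times up to that cutoff only.

```python
from math import gcd


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def period(P, i, max_steps=50):
    """gcd of all t <= max_steps such that (P^t)[i][i] > 0 (0 if none found)."""
    d = 0
    Pt = [row[:] for row in P]  # holds P^t, starting with t = 1
    for t in range(1, max_steps + 1):
        if Pt[i][i] > 0:
            d = gcd(d, t)
        Pt = matmul(Pt, P)
    return d


# Hypothetical periodic chain (not the matrix from the slide): 0 -> 1 -> 2 -> 0.
P_cycle = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]
print([period(P_cycle, i) for i in range(3)])
```

Every state of the cycle returns to itself only at multiples of 3 steps, so all three states share the same period, as expected for states in the same class.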

Markov Chains
Stationary probabilities: unique solution of the system
$$\pi_j = \sum_{i=1}^{J} \pi_i P_{ij}, \quad j = 1, \ldots, J, \qquad (1)$$
$$\sum_{j=1}^{J} \pi_j = 1.$$
A solution exists for any irreducible chain.

Markov Chains
Consider the following system of equations:
$$x_i P_{ij} = x_j P_{ji}, \quad i \neq j, \qquad \sum_{i=1}^{J} x_i = 1. \qquad (2)$$
We sum over $i$:
$$\sum_{i=1}^{J} x_i P_{ij} = x_j \sum_{i=1}^{J} P_{ji} = x_j.$$
If (2) has a solution, it is also a solution of (1). As $\pi$ is the unique solution of (1), then $x = \pi$ and
$$\pi_i P_{ij} = \pi_j P_{ji}, \quad i \neq j.$$
The chain is then said to be time reversible.
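System (2) does not always have a solution. The sketch below, using exact rational arithmetic, contrasts two hypothetical chains that are not from the slides: a small birth-death chain, where the stationary distribution satisfies (2), and a deterministic cycle, where the uniform distribution solves (1) but not (2). The function names are illustrative.

```python
from fractions import Fraction as F


def is_stationary(x, P):
    """Check system (1): x P = x and the entries of x sum to one."""
    n = len(P)
    return sum(x) == 1 and all(
        sum(x[i] * P[i][j] for i in range(n)) == x[j] for j in range(n))


def satisfies_detailed_balance(x, P):
    """Check system (2): x_i P_ij = x_j P_ji for all i != j."""
    n = len(P)
    return all(x[i] * P[i][j] == x[j] * P[j][i]
               for i in range(n) for j in range(n) if i != j)


# Hypothetical birth-death chain: moves only between adjacent states.
P_bd = [[F(1, 2), F(1, 2), F(0)],
        [F(1, 4), F(1, 2), F(1, 4)],
        [F(0), F(1, 2), F(1, 2)]]
x_bd = [F(1, 4), F(1, 2), F(1, 4)]

# Hypothetical deterministic cycle 0 -> 1 -> 2 -> 0.
P_cycle = [[F(0), F(1), F(0)],
           [F(0), F(0), F(1)],
           [F(1), F(0), F(0)]]
x_cycle = [F(1, 3)] * 3

print(is_stationary(x_bd, P_bd), satisfies_detailed_balance(x_bd, P_bd))
print(is_stationary(x_cycle, P_cycle), satisfies_detailed_balance(x_cycle, P_cycle))
```

The cycle always moves in the same direction, so a trajectory played backwards is distinguishable from one played forwards, which is exactly what the failure of (2) expresses.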

Example
A machine can be in 4 states with respect to wear: perfect condition, partially damaged, seriously damaged, completely useless. The degradation process can be modeled by an irreducible aperiodic homogeneous Markov process with transition matrix $P$ (entries not shown).

Example
Stationary distribution: $\pi = \left(\tfrac{5}{8}, \tfrac{1}{4}, \tfrac{3}{32}, \tfrac{1}{32}\right)$, which satisfies
$$\left(\tfrac{5}{8}, \tfrac{1}{4}, \tfrac{3}{32}, \tfrac{1}{32}\right) P = \left(\tfrac{5}{8}, \tfrac{1}{4}, \tfrac{3}{32}, \tfrac{1}{32}\right).$$
The machine is in perfect condition 5 days out of 8, on average. Repair occurs on average every 32 days.
From now on: Markov process = irreducible aperiodic homogeneous Markov process.
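The stationarity condition can be verified exactly with rational arithmetic. The transition matrix below is an assumption: it is an illustrative matrix chosen so that its stationary distribution equals the one quoted above, and it should not be read as the matrix from the slide. The helper name `left_multiply` is also illustrative.

```python
from fractions import Fraction as F

# ASSUMPTION: illustrative transition matrix, chosen to be consistent with the
# stationary distribution (5/8, 1/4, 3/32, 1/32). State 3 ("completely useless")
# is always followed by state 0 ("perfect condition"), i.e. a repair.
P = [[F(95, 100), F(4, 100), F(1, 100), F(0)],
     [F(0), F(90, 100), F(5, 100), F(5, 100)],
     [F(0), F(0), F(80, 100), F(20, 100)],
     [F(1), F(0), F(0), F(0)]]
pi = [F(5, 8), F(1, 4), F(3, 32), F(1, 32)]


def left_multiply(v, P):
    """Row vector times matrix: (v P)_j = sum_i v_i P_ij."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]


print(all(sum(row) == 1 for row in P))   # each row sums to one
print(left_multiply(pi, P) == pi)        # pi P = pi
print(1 / pi[3])                         # mean number of days between visits to state 3
```

With this matrix, the chain visits the last state a fraction 1/32 of the time, which matches the statement that repair occurs every 32 days on average.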

Stationary distributions
Property:
$$\pi_j = \lim_{t \to \infty} \Pr(X_t = j), \quad j = 1, \ldots, J.$$
Ergodicity: let $f$ be any function on the state space. Then, with probability 1,
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f(X_t) = \sum_{j=1}^{J} \pi_j f(j).$$
Computing the expectation of a function of the stationary states is the same as taking the average of the values along a trajectory of the process.

Simulation
We want to simulate a r.v. $X$ with pmf $\Pr(X = j) = p_j$. We generate a Markov process with limiting probabilities $p_j$ (how?). We simulate the evolution of the process.
$$p_j = \pi_j = \lim_{t \to \infty} \Pr(X_t = j), \quad j = 1, \ldots, J.$$

Example
[Three figures: $\Pr(X_t = j)$ plotted against $t$, each for a different value of $T$.]
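The convergence of $\Pr(X_t = j)$ towards $\pi_j$ can be reproduced numerically by propagating the distribution of $X_t$ one step at a time. This is a sketch under assumptions: the matrix is an illustrative one whose stationary distribution is $(5/8, 1/4, 3/32, 1/32)$, not necessarily the one behind the figures, and the function names are hypothetical.

```python
# ASSUMPTION: illustrative transition matrix (states indexed 0..3).
P = [[0.95, 0.04, 0.01, 0.00],
     [0.00, 0.90, 0.05, 0.05],
     [0.00, 0.00, 0.80, 0.20],
     [1.00, 0.00, 0.00, 0.00]]
target = [5 / 8, 1 / 4, 3 / 32, 1 / 32]


def step(dist, P):
    """One step of the chain on distributions: dist_{t+1} = dist_t P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def distribution_after(t, start, P):
    """Distribution of X_t given that X_0 = start."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    for _ in range(t):
        dist = step(dist, P)
    return dist


for t in (1, 10, 50, 200):
    print(t, [round(p, 4) for p in distribution_after(t, 0, P)])
```

The early distributions depend strongly on the initial state, which is the reason for discarding the first states of a simulated trajectory.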

Simulation
Assume that we are interested in simulating
$$E[f(X)] = \sum_{j=1}^{J} f(j) p_j.$$
We use ergodicity to estimate it with
$$\frac{1}{T} \sum_{t=1}^{T} f(X_t).$$
We should drop early states (see the example above). Better estimate:
$$\frac{1}{T} \sum_{t=1+k}^{T+k} f(X_t).$$
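The estimator with $k$ discarded states could be implemented along the following lines. This is a sketch, not a reference implementation: the transition matrix is an illustrative assumption (its stationary distribution is $(5/8, 1/4, 3/32, 1/32)$), states are indexed from 0, and the names `simulate_chain` and `ergodic_estimate` are hypothetical.

```python
import random

# ASSUMPTION: illustrative transition matrix (states indexed 0..3).
P = [[0.95, 0.04, 0.01, 0.00],
     [0.00, 0.90, 0.05, 0.05],
     [0.00, 0.00, 0.80, 0.20],
     [1.00, 0.00, 0.00, 0.00]]


def simulate_chain(P, x0, n_steps, rng):
    """Return the trajectory [X_0, X_1, ..., X_{n_steps}]."""
    states = range(len(P))
    path = [x0]
    x = x0
    for _ in range(n_steps):
        x = rng.choices(states, weights=P[x])[0]
        path.append(x)
    return path


def ergodic_estimate(path, f, T, k=0):
    """(1/T) * sum of f(X_t) for t = 1+k, ..., T+k, where path[t] = X_t."""
    return sum(f(path[t]) for t in range(1 + k, T + k + 1)) / T


rng = random.Random(0)
T, k = 100_000, 1_000
path = simulate_chain(P, x0=3, n_steps=T + k, rng=rng)

# f is the indicator of state 0, so E[f(X)] is the stationary probability 5/8.
estimate = ergodic_estimate(path, lambda j: 1.0 if j == 0 else 0.0, T, k)
print(estimate)
```

Setting `k=0` gives the plain ergodic average. The choice of $k$ is problem dependent, and the value used here is arbitrary.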

Metropolis-Hastings
Nicholas Metropolis and W. Keith Hastings.

Metropolis-Hastings
Let $b_j$, $j = 1, \ldots, J$, be positive numbers. Let $B = \sum_j b_j$ and $\pi_j = b_j / B$. We want to simulate a r.v. with pmf $\pi_j$.
Consider a Markov process on $\{1, \ldots, J\}$ with transition probability $Q$. Define another Markov process with the same states in the following way:
Assume the process is in state $i$, that is $X_t = i$.
Simulate the (candidate) next state $j$ according to $Q$.
Define
$$X_{t+1} = \begin{cases} j & \text{with probability } \alpha_{ij}, \\ i & \text{with probability } 1 - \alpha_{ij}. \end{cases}$$

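Metropolis-Hastings
Transition probability $P$:
$$P_{ij} = Q_{ij} \alpha_{ij} \quad \text{if } i \neq j, \qquad P_{ii} = Q_{ii} \alpha_{ii} + \sum_{l \neq i} Q_{il} (1 - \alpha_{il}) \quad \text{otherwise.}$$
$P$ must verify the property:
$$\begin{aligned} 1 = \sum_j P_{ij} &= P_{ii} + \sum_{j \neq i} P_{ij} \\ &= Q_{ii} \alpha_{ii} + \sum_{l \neq i} Q_{il} (1 - \alpha_{il}) + \sum_{j \neq i} Q_{ij} \alpha_{ij} \\ &= Q_{ii} \alpha_{ii} + \sum_{l \neq i} Q_{il} - \sum_{l \neq i} Q_{il} \alpha_{il} + \sum_{j \neq i} Q_{ij} \alpha_{ij} \\ &= Q_{ii} \alpha_{ii} + \sum_{l \neq i} Q_{il}. \end{aligned}$$
As $\sum_j Q_{ij} = 1$, we have $\alpha_{ii} = 1$.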

Metropolis-Hastings
Stationary distribution and time reversibility:
$$\pi_i P_{ij} = \pi_j P_{ji}, \quad i \neq j,$$
that is
$$\pi_i Q_{ij} \alpha_{ij} = \pi_j Q_{ji} \alpha_{ji}, \quad i \neq j.$$
It is satisfied if
$$\alpha_{ij} = \frac{\pi_j Q_{ji}}{\pi_i Q_{ij}} \text{ and } \alpha_{ji} = 1, \qquad \text{or} \qquad \frac{\pi_i Q_{ij}}{\pi_j Q_{ji}} = \alpha_{ji} \text{ and } \alpha_{ij} = 1.$$

Metropolis-Hastings
Therefore
$$\alpha_{ij} = \min\left(\frac{\pi_j Q_{ji}}{\pi_i Q_{ij}}, 1\right).$$
Remember: $\pi_j = b_j / B$. Therefore
$$\alpha_{ij} = \min\left(\frac{(b_j / B)\, Q_{ji}}{(b_i / B)\, Q_{ij}}, 1\right) = \min\left(\frac{b_j Q_{ji}}{b_i Q_{ij}}, 1\right).$$
The normalization constant $B$ does not play a role in the computation of $\alpha_{ij}$.
In summary: given $Q$ and $b_j$, defining $\alpha$ as above creates a Markov process characterized by $P$ with stationary distribution $\pi$.
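The summary can be checked exactly on a small case by building $P$ from $Q$ and $b$ and testing both time reversibility and stationarity. The sketch below assumes a uniform proposal matrix $Q$ and the weights $b = (20, 8, 3, 1)$; both the proposal and the function name `mh_transition_matrix` are illustrative choices.

```python
from fractions import Fraction as F


def mh_transition_matrix(b, Q):
    """Build P with P_ij = Q_ij * alpha_ij for i != j; the rest of the mass stays in i."""
    J = len(b)
    P = [[F(0)] * J for _ in range(J)]
    for i in range(J):
        for j in range(J):
            if i != j and Q[i][j] > 0:
                alpha = min(F(b[j]) * Q[j][i] / (F(b[i]) * Q[i][j]), F(1))
                P[i][j] = Q[i][j] * alpha
        P[i][i] = 1 - sum(P[i])  # P[i][i] is still 0 here, so this sums j != i
    return P


# ASSUMPTION: uniform proposal, each state proposed with probability 1/4.
b = [20, 8, 3, 1]
Q = [[F(1, 4)] * 4 for _ in range(4)]
P_mh = mh_transition_matrix(b, Q)
pi = [F(bj, sum(b)) for bj in b]

reversible = all(pi[i] * P_mh[i][j] == pi[j] * P_mh[j][i]
                 for i in range(4) for j in range(4))
stationary = all(sum(pi[i] * P_mh[i][j] for i in range(4)) == pi[j]
                 for j in range(4))
print(reversible, stationary)
```

Only ratios of the $b_j$ enter the construction, which is the practical meaning of the remark that $B$ plays no role.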

Metropolis-Hastings
Algorithm:
1. Choose a Markov process characterized by $Q$.
2. Initialize the chain with a state $i$: $t = 0$, $X_0 = i$.
3. Simulate the (candidate) next state $j$ based on $Q$, from the current state $i = X_t$.
4. Let $r$ be a draw from $U[0, 1[$.
5. Compare $r$ with $\alpha_{ij} = \min\left(\frac{b_j Q_{ji}}{b_i Q_{ij}}, 1\right)$. If $r < \frac{b_j Q_{ji}}{b_i Q_{ij}}$, then $X_{t+1} = j$, else $X_{t+1} = i$.
6. Increase $t$ by one.
7. Go to step 3.
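The steps above can be sketched as follows. This is one possible implementation under stated assumptions: states are indexed from 0, the proposal $Q$ is uniform (an illustrative choice), and the function name, seed, and iteration counts are hypothetical.

```python
import random


def metropolis_hastings(b, Q, x0, n_iter, rng):
    """Return the visited states X_1, ..., X_{n_iter} of the MH chain."""
    states = range(len(b))
    x = x0                                            # step 2: initial state
    path = []
    for _ in range(n_iter):
        j = rng.choices(states, weights=Q[x])[0]      # step 3: candidate from Q
        r = rng.random()                              # step 4: draw from U[0, 1[
        ratio = (b[j] * Q[j][x]) / (b[x] * Q[x][j])   # Q[x][j] > 0 since j was drawn
        if r < ratio:                                 # step 5: same as r < min(ratio, 1)
            x = j
        path.append(x)                                # steps 6 and 7: next iteration
    return path


# ASSUMPTION: uniform proposal over four states; weights b as in the text.
b = [20, 8, 3, 1]
Q = [[0.25] * 4 for _ in range(4)]
n_iter, burn_in = 100_000, 1_000

path = metropolis_hastings(b, Q, x0=3, n_iter=n_iter, rng=random.Random(0))
kept = path[burn_in:]
simulated = [kept.count(j) / len(kept) for j in range(4)]
target = [bj / sum(b) for bj in b]
print([round(p, 3) for p in simulated])
print([round(p, 3) for p in target])
```

The sampler never uses $B$, only the unnormalized weights. A rejected candidate still produces a new element of the trajectory, namely a repeat of the current state.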

Example
$Q$ (entries not shown), $b = (20, 8, 3, 1)$, $\pi = \left(\tfrac{5}{8}, \tfrac{1}{4}, \tfrac{3}{32}, \tfrac{1}{32}\right)$.
Run MH for 10000 iterations. Collect statistics after 1000.
Accept: [288, 532, 80, 283]
Reject: [0, 952, 705, 2239]
Simulated: [0.627, 0.250, 0.095, 0.028]
Target: [0.625, 0.250, 0.094, 0.031]

Gibbs sampling
Let $X = (X_1, X_2, \ldots, X_n)$ be a random vector with pmf (or pdf) $p(x)$. Assume we can draw from the distribution of each coordinate conditional on all the others (called the marginals here):
$$\Pr(X_i \mid X_j = x_j, j \neq i), \quad i = 1, \ldots, n.$$
Markov process: assume the current state is $x$. Draw randomly (with equal probability) a coordinate $i$. Draw $r$ from the $i$th marginal. New state:
$$y = (x_1, \ldots, x_{i-1}, r, x_{i+1}, \ldots, x_n).$$

Gibbs sampling
Transition probability:
$$Q_{xy} = \frac{1}{n} \Pr(X_i = r \mid X_j = x_j, j \neq i) = \frac{p(y)}{n \Pr(X_j = x_j, j \neq i)}.$$
Metropolis-Hastings:
$$\alpha_{xy} = \min\left(\frac{p(y) Q_{yx}}{p(x) Q_{xy}}, 1\right) = \min\left(\frac{p(y) p(x)}{p(x) p(y)}, 1\right) = 1.$$
The candidate state is always accepted.

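Example: bivariate normal distribution
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \right)$$
Marginal distribution (of $Y$ conditional on $X = x$):
$$Y \mid (X = x) \sim N\left( \mu_Y + \frac{\sigma_Y}{\sigma_X} \rho (x - \mu_X), \; (1 - \rho^2) \sigma_Y^2 \right)$$
Apply Gibbs sampling to draw from a bivariate normal with mean $(0, 0)$ (covariance entries not shown).
Note: just for illustration. In practice, one should use the Cholesky factor.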
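A minimal sketch of the Gibbs sampler for this example follows. The parameter values (zero means, unit variances, $\rho = 0.5$) are assumptions made for illustration, since the covariance matrix used on the slide is not available; the conditional of $X$ given $Y = y$ is obtained from the formula above by exchanging the roles of the two variables. Function names are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(mu_x, mu_y, sigma_x, sigma_y, rho, n_draws, rng):
    """Random-scan Gibbs sampler: update one randomly chosen coordinate per step."""
    x, y = mu_x, mu_y  # arbitrary starting point
    draws = []
    for _ in range(n_draws):
        if rng.random() < 0.5:
            # X | (Y = y) ~ N(mu_x + (sigma_x / sigma_y) rho (y - mu_y), (1 - rho^2) sigma_x^2)
            mean = mu_x + (sigma_x / sigma_y) * rho * (y - mu_y)
            x = rng.gauss(mean, math.sqrt(1 - rho ** 2) * sigma_x)
        else:
            # Y | (X = x) ~ N(mu_y + (sigma_y / sigma_x) rho (x - mu_x), (1 - rho^2) sigma_y^2)
            mean = mu_y + (sigma_y / sigma_x) * rho * (x - mu_x)
            y = rng.gauss(mean, math.sqrt(1 - rho ** 2) * sigma_y)
        draws.append((x, y))
    return draws


def sample_correlation(draws):
    n = len(draws)
    mx = sum(x for x, _ in draws) / n
    my = sum(y for _, y in draws) / n
    sxx = sum((x - mx) ** 2 for x, _ in draws)
    syy = sum((y - my) ** 2 for _, y in draws)
    sxy = sum((x - mx) * (y - my) for x, y in draws)
    return sxy / math.sqrt(sxx * syy)


# ASSUMPTION: illustrative parameters, not those of the slide.
rho = 0.5
draws = gibbs_bivariate_normal(0.0, 0.0, 1.0, 1.0, rho, 100_000, random.Random(0))
mean_x = sum(x for x, _ in draws) / len(draws)
print(round(mean_x, 3), round(sample_correlation(draws), 3))
```

Consecutive draws are strongly dependent because only one coordinate changes per step, and the dependence grows as $|\rho|$ approaches 1.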

Example: pdf
[Figure: pdf of the bivariate normal distribution.]

Example: draws from Gibbs sampling
[Figure: draws from Gibbs sampling.]

Simulated annealing
Application of the Metropolis-Hastings algorithm to optimization. The name comes from an analogy with annealing in metallurgy, which involves heating and controlled cooling of a material to reduce its defects.
Optimization problem:
$$\min_{x \in F} f(x),$$
where the feasible set $F$ is a finite set of vectors. Let $X^*$ be the set of optimal solutions, that is
$$X^* = \{x \in F \mid f(x) \leq f(y), \; \forall y \in F\},$$
and $f(x^*) = f^*$ for all $x^* \in X^*$. Consider the pmf on $F$
$$p_\lambda(x) = \frac{e^{-\lambda f(x)}}{\sum_{y \in F} e^{-\lambda f(y)}}, \quad \lambda > 0.$$

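Simulated annealing
Equivalently,
$$p_\lambda(x) = \frac{e^{-\lambda f(x)}}{\sum_{y \in F} e^{-\lambda f(y)}} = \frac{e^{\lambda (f^* - f(x))}}{\sum_{y \in F} e^{\lambda (f^* - f(y))}}.$$
As $f^* - f(x) \leq 0$, when $\lambda \to \infty$ we have
$$\lim_{\lambda \to \infty} p_\lambda(x) = \frac{\delta(x \in X^*)}{|X^*|}, \quad \text{where} \quad \delta(x \in X^*) = \begin{cases} 1 & \text{if } x \in X^*, \\ 0 & \text{otherwise.} \end{cases}$$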

Example
$F = \{1, 2, 3\}$, $f(F) = \{0, 1, 0\}$.
$$p_\lambda(1) = \frac{1}{2 + e^{-\lambda}}, \quad p_\lambda(2) = \frac{e^{-\lambda}}{2 + e^{-\lambda}}, \quad p_\lambda(3) = \frac{1}{2 + e^{-\lambda}}.$$

Example
[Figure: $p_\lambda(1) = p_\lambda(3)$ and $p_\lambda(2)$ plotted against $\lambda$.]
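The behaviour of $p_\lambda$ in this three-point example can be tabulated directly. The sketch below evaluates the pmf in its shifted form $e^{\lambda (f^* - f(x))}$, which keeps every exponent non-positive; the function name `boltzmann_pmf` is an illustrative choice.

```python
import math


def boltzmann_pmf(f_values, lam):
    """p_lambda(x) proportional to exp(lam * (f_star - f(x))), normalized over F."""
    f_star = min(f_values)
    weights = [math.exp(lam * (f_star - fx)) for fx in f_values]
    total = sum(weights)
    return [w / total for w in weights]


f_values = [0, 1, 0]  # f(1), f(2), f(3) from the example
for lam in (0.0, 1.0, 5.0, 50.0):
    print(lam, [round(p, 4) for p in boltzmann_pmf(f_values, lam)])
```

At $\lambda = 0$ the pmf is uniform on $F$. As $\lambda$ grows, the mass of the non-optimal point 2 vanishes and the two optimal points share the mass equally, in line with the limit $\delta(x \in X^*) / |X^*|$.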

Simulated annealing
For a large $\lambda$, we generate a Markov chain with stationary distribution $p_\lambda(x)$, whose mass is concentrated on optimal solutions. As the normalizing constant is not needed, only $e^{\lambda (f^* - f(x))}$ is used.
The Markov process is constructed through the concept of neighborhood. A neighbor $y$ of $x$ is obtained by a simple modification of $x$. The Markov process proceeds from neighbor to neighbor.
The neighborhood structure must be designed such that the chain is irreducible, that is, the whole space $F$ can be reached. It must also be designed such that the size of the neighborhood is reasonably small.

Neighborhood
Examples of neighborhoods: $x$ and $y$ are neighbors if they differ in only one coordinate; $x$ and $y$ are neighbors if one is obtained from the other by interchanging two elements.
Denote by $N(x)$ the set of neighbors of $x$. Define a Markov process where the next state is a randomly drawn neighbor. Transition probability:
$$Q_{xy} = \frac{1}{|N(x)|}.$$
Metropolis-Hastings:
$$\alpha_{xy} = \min\left(\frac{p(y) Q_{yx}}{p(x) Q_{xy}}, 1\right) = \min\left(\frac{e^{-\lambda f(y)} |N(x)|}{e^{-\lambda f(x)} |N(y)|}, 1\right).$$

Notes
The neighborhood structure can always be arranged so that each vector has the same number of neighbors. In this case,
$$\alpha_{xy} = \min\left(\frac{e^{-\lambda f(y)}}{e^{-\lambda f(x)}}, 1\right).$$
If $y$ is better than $x$, the next state is automatically accepted. Otherwise, it is accepted with a probability that depends on $\lambda$. If $\lambda$ is high, the probability is small. When $\lambda$ is small, it is easy to escape from local optima.

Notes
In practice, it may be better to enumerate $F$ (MH is asymptotic while $F$ is finite). Simulated annealing is therefore usually used as a heuristic, where the value of $\lambda$ is changed over time. For instance,
$$\lambda_k = C \ln(1 + k), \quad C > 0.$$
The heuristic returns the best solution encountered during the process.
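The heuristic could be sketched as follows for binary vectors, where two vectors are neighbors if they differ in exactly one coordinate, so that every vector has the same number of neighbors. The objective function, the weights, the target, the value of $C$, and the iteration count are all hypothetical choices made for illustration.

```python
import math
import random


def simulated_annealing(f, x0, n_iter, C, rng):
    """Minimize f over binary vectors; return the best solution encountered."""
    x, fx = list(x0), f(x0)
    best_x, best_f = list(x), fx
    for k in range(1, n_iter + 1):
        lam = C * math.log(1 + k)       # schedule lambda_k = C ln(1 + k)
        i = rng.randrange(len(x))       # random neighbor: flip one coordinate
        y = x[:]
        y[i] = 1 - y[i]
        fy = f(y)
        # alpha_xy = min(exp(-lam * (f(y) - f(x))), 1)
        if fy <= fx or rng.random() < math.exp(-lam * (fy - fx)):
            x, fx = y, fy
            if fx < best_f:
                best_x, best_f = x[:], fx
    return best_x, best_f


# Hypothetical problem: choose a subset of weights whose sum is close to a target.
weights = [3, 5, 7, 11, 13, 17, 19, 23]
target = 40


def f(x):
    return abs(sum(w * xi for w, xi in zip(weights, x)) - target)


best_x, best_f = simulated_annealing(f, [0] * len(weights), 5_000, 1.0, random.Random(0))
print(best_x, best_f)
```

Because $\lambda_k$ grows slowly, uphill moves are accepted fairly often in the first iterations and rarely later on. Keeping track of the best solution separately from the current state is what allows the heuristic to return the best solution encountered.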
