ST4231, Semester I, 2003-2004

Ch5. Markov Chain Monte Carlo

In general, it is very difficult to simulate the value of a random vector X whose component random variables are dependent. In this chapter we present a powerful approach for generating a random variable whose distribution is approximately that of X. This approach, called the Markov Chain Monte Carlo (MCMC) method, has the added significance of only requiring that the mass (or density) function of X be specified up to a multiplicative constant, and this, we will see, is of great importance in applications.

Outline of the Chapter:
1. Discrete Markov Chains
2. Metropolis-Hastings Algorithm
3. Gibbs Sampler
4. Simulated Annealing
5. MCMC in Bayesian Statistics
1 Discrete Markov Chains

This section sets notation and reviews some fundamental results for discrete Markov chains. We discuss only discrete-time Markov chains in this chapter. We will not discuss the theory of chains with uncountable state spaces, although we will use such chains in Sections 2, 3, and 4 for applications in statistics.

Definition 1.1 A Markov chain X is a discrete-time stochastic process {X_0, X_1, ...} with the property that the distribution of X_t given all previous values of the process, X_0, X_1, ..., X_{t-1}, depends only upon X_{t-1}; that is, for any set A,

P[X_t ∈ A | X_0, ..., X_{t-1}] = P[X_t ∈ A | X_{t-1}].
In a Markov chain, the transition from one state to another is probabilistic, but the production of an output symbol (the observed state) is deterministic.

One-Step Transition

Let p_ij = P(X_{t+1} = j | X_t = i) denote the transition probability from state i to state j at time t + 1. Since the p_ij are conditional probabilities, all transition probabilities must satisfy two conditions:

p_ij ≥ 0 for all (i, j),
Σ_j p_ij = 1 for all i.

We will assume that the transition probabilities are fixed and do not change with time. In such a case the Markov chain is said to be homogeneous in time.
In the case of a system with a finite number of possible states K, for example, the transition probabilities constitute a K-by-K matrix:

P = [ p_11  p_12  ...  p_1K
      p_21  p_22  ...  p_2K
      ...
      p_K1  p_K2  ...  p_KK ]

This kind of matrix is called a stochastic matrix. Any stochastic matrix can serve as a matrix of transition probabilities.
Multi-Step Transition

The definition of the one-step transition probability may be generalized to cases where the transition from one state to another takes place in some fixed number of steps. Let p_ij(m) denote the m-step transition probability from state i to state j:

p_ij(m) = P(X_{t+m} = j | X_t = i), m = 1, 2, ...

We may compute p_ij(m + 1) as a sum over all intermediate states k through which the system passes in its transition from state i to state j. This gives the recursive relation

p_ij(m + 1) = Σ_k p_ik(m) p_kj, m = 1, 2, ...,

with p_ik(1) = p_ik. The recursion may be generalized as follows (the Chapman-Kolmogorov equations):

p_ij(m + n) = Σ_k p_ik(m) p_kj(n), m, n = 1, 2, ....
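Numerically, the Chapman-Kolmogorov equations just say that m-step transition matrices multiply. A minimal sketch with an arbitrary two-state matrix (the values are invented for illustration):

```python
import numpy as np

# A two-state chain with an arbitrary transition matrix (illustrative only).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# The m-step transition probabilities p_ij(m) are the entries of P^m.
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)

# Chapman-Kolmogorov: p_ij(m + n) = sum_k p_ik(m) p_kj(n),
# here with m = 2, n = 1, i.e. P^3 = P^2 P.
assert np.allclose(P3, P2 @ P)
print(P3)
```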
Suppose a Markov chain starts in state i. The state i is said to be a recurrent state if the Markov chain returns to state i with probability 1; that is,

f_i = P(ever returning to state i) = 1.

If the probability f_i < 1, state i is said to be a transient state. If the Markov chain starts in a recurrent state, the state recurs an infinite number of times. If it starts in a transient state, that state recurs only a finite number of times.
The state j of a Markov chain is said to be accessible from state i if there is a finite sequence of transitions from i to j with positive probability. If the states i and j are accessible to each other, they are said to communicate with each other, and the communication is denoted by i ↔ j. Clearly, if i ↔ j and j ↔ k, then we have i ↔ k. If two states of a Markov chain communicate with each other, they are said to belong to the same class. In general, the states of a Markov chain consist of one or more disjoint classes. If, however, all the states form a single class, the Markov chain is said to be indecomposable or irreducible. Reducible chains are of little practical interest in most areas of application.
Example 1 (Simple Random Walk on a Graph) A graph is described by a set of vertices and a set of edges. We say that the vertices i and j are neighbors if there is an edge joining i and j (we can also say that i and j are adjacent). For simplicity, assume that there is at most one edge joining any given pair of vertices and that there is no edge that joins a vertex to itself. The number of neighbors of the vertex i is called the degree of i and is written deg(i). In a simple random walk on a graph (Figure 1), the state space is the set of vertices of the graph. At each time t, the Markov chain looks at its current position, X_t, and chooses one of its neighbors uniformly at random. It then jumps to the chosen neighbor, which becomes X_{t+1}.

[Figure 1: A graph with five labeled vertices.]
More formally, the transition probabilities are given by

p_ij = 1/deg(i) if i and j are neighbors,
p_ij = 0 otherwise (including i = j).

For the graph of Figure 1, the associated transition probability matrix is

P = [ 0    1/3  1/3  0    1/3
      1/2  0    1/2  0    0
      1/4  1/4  0    1/4  1/4
      0    0    1/2  0    1/2
      1/3  0    1/3  1/3  0   ]

The Markov chain is irreducible if and only if the graph is connected.
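The walk is straightforward to simulate. A sketch in Python (the adjacency list below is read off the matrix above); the printed long-run visit frequencies turn out to be proportional to the vertex degrees, a fact established in Example 2:

```python
import random

# Adjacency list for the five-vertex graph of Figure 1.
neighbors = {1: [2, 3, 5], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3, 5], 5: [1, 3, 4]}

random.seed(0)
n_steps = 200_000
counts = {v: 0 for v in neighbors}
x = 1
for _ in range(n_steps):
    x = random.choice(neighbors[x])  # jump to a uniformly chosen neighbor
    counts[x] += 1

freq = {v: counts[v] / n_steps for v in neighbors}
print(freq)  # close to deg(i)/14: 3/14, 2/14, 4/14, 2/14, 3/14
```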
Consider an irreducible Markov chain that starts in a recurrent state i at time t = 0. Let T_i(k) denote the time that elapses between the (k-1)th and kth returns to state i. The mean recurrence time of state i is defined as the expectation of T_i(k) over the returns k. The steady-state probability of state i, denoted by π_i, is defined as

π_i = 1 / E(T_i(k)).

If E(T_i(k)) < ∞, that is, π_i > 0, the state i is said to be a positive recurrent state. If E(T_i(k)) = ∞, that is, π_i = 0, the state i is said to be a null recurrent state.
In principle, ergodicity means that we may substitute time averages for ensemble averages. In the context of a Markov chain, ergodicity means that the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i, which may be justified as follows. The proportion of time spent in state i after k returns, denoted by v_i(k), is defined by

v_i(k) = k / Σ_{l=1}^{k} T_i(l).

The return times T_i(l) form a sequence of independently and identically distributed random variables since, by definition, each return time is statistically independent of all previous return times. Moreover, in the case of a recurrent state i the chain returns to state i an infinite number of times. Hence, as the number of returns k approaches infinity, the law of large numbers states that the proportion of time spent in state i approaches the steady-state probability:

lim_{k→∞} v_i(k) = π_i, i = 1, 2, ..., K.

A sufficient, but not necessary, condition for a Markov chain to be ergodic is for it to be both irreducible and aperiodic.
Consider an ergodic Markov chain characterized by a stochastic matrix P. Let the row vector π^(n-1) denote the state distribution vector of the chain at time n - 1. The state distribution vector at time n is defined by

π^(n) = π^(n-1) P.

By iteration, we obtain

π^(n) = π^(n-1) P = π^(n-2) P^2 = ... = π^(0) P^n,

where π^(0) is the initial value of the state distribution vector.
Let p_ij^(n) denote the ij-th element of P^n. Suppose that as time n approaches infinity, p_ij^(n) tends to π_j independently of i, where π_j is the steady-state probability of state j. Correspondingly, for large n, the matrix P^n approaches the limiting form of a square matrix with identical rows:

lim_{n→∞} P^n = [ π_1  π_2  ...  π_K
                  π_1  π_2  ...  π_K
                  ...
                  π_1  π_2  ...  π_K ],

that is, every row equals the row vector π = (π_1, π_2, ..., π_K). Letting n → ∞ in π^(n) = π^(0) P^n, each entry of the limit is π_j Σ_{j=1}^{K} π_j^(0); we then find that

(Σ_{j=1}^{K} π_j^(0) − 1) π = 0.

Since, by definition, Σ_{j=1}^{K} π_j^(0) = 1, this condition is satisfied, and the limiting distribution is the vector π, independent of the initial distribution.
The probability distribution {π_j}_{j=1}^{K} is called an invariant or stationary distribution. It is so called because it persists forever once it is established. For an ergodic Markov chain, we may say the following.

1. Starting from an arbitrary initial distribution, the transition probabilities of a Markov chain will converge to a stationary distribution, provided that such a distribution exists.
2. The stationary distribution of the Markov chain is completely independent of the initial distribution if the chain is ergodic.

Example 2: Simple Random Walk on a Graph (continued) Consider a simple random walk on a graph, as described in Example 1. We claim that the equilibrium distribution is π_i = deg(i)/Z, where Z = Σ_{k∈S} deg(k). For the graph of Figure 1,

π = (3/14, 2/14, 4/14, 2/14, 3/14).
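Assuming numpy is available, stationarity of this π for the walk of Example 1 can be confirmed numerically:

```python
import numpy as np

# Transition matrix of the simple random walk of Example 1.
P = np.array([[0,   1/3, 1/3, 0,   1/3],
              [1/2, 0,   1/2, 0,   0  ],
              [1/4, 1/4, 0,   1/4, 1/4],
              [0,   0,   1/2, 0,   1/2],
              [1/3, 0,   1/3, 1/3, 0  ]])
pi = np.array([3, 2, 4, 2, 3]) / 14

# Stationarity: pi P = pi, and it persists under further transitions.
assert np.allclose(pi @ P, pi)
assert np.allclose(pi @ np.linalg.matrix_power(P, 50), pi)
print("pi is stationary")
```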
To summarize the descriptions above, we have the following definitions and theorems. Let τ_ii be the time of the first return to state i:

τ_ii = min{t > 0 : X_t = i | X_0 = i}.

Definition 1.2 X is called irreducible if for all i, j, there exists a t > 0 such that P_ij(t) > 0.

Definition 1.3 An irreducible chain X is recurrent if P[τ_ii < ∞] = 1 for some (and hence for all) i. Otherwise, X is transient. Another equivalent condition for recurrence is Σ_t P_ii(t) = ∞ for all i.
Definition 1.4 An irreducible recurrent chain X is called positive recurrent if E[τ_ii] < ∞ for some (and hence for all) i. Otherwise, it is called null recurrent. Another equivalent condition for positive recurrence is the existence of a stationary probability distribution for X, that is, there exists π(·) such that

Σ_i π(i) P_ij(t) = π(j)    (1)

for all j and t ≥ 0. If the probability distribution of X_t, say π_i = P{X_t = i}, i = 0, ..., N, is a stationary distribution, then

P{X_{t+1} = j} = Σ_{i=0}^{N} P{X_{t+1} = j | X_t = i} P{X_t = i} = Σ_{i=0}^{N} π_i P_ij = π_j.

Definition 1.5 An irreducible chain X is called aperiodic if for some (and hence for all) i, the greatest common divisor of {t > 0 : P_ii(t) > 0} is 1.
Theorem 1.1 If X is positive recurrent and aperiodic, then its stationary distribution π(·) is the unique probability distribution satisfying (1). We then say that X is ergodic, and the following consequences hold:

(i) P_ij(t) → π(j) as t → ∞ for all i, j.

(ii) (Ergodic theorem) If E_π[|h(X)|] < ∞, then

P{ lim_{N→∞} (1/N) Σ_{t=1}^{N} h(X_t) = E_π[h(X)] } = 1,

where E_π[h(X)] = Σ_i h(i)π(i), the expectation of h(X) with respect to π(·).
There is sometimes an easier way than solving the set of equations (1) to find the stationary probabilities. Suppose one can find positive numbers x_j, j = 1, 2, ..., N, such that

x_i P_ij = x_j P_ji for i ≠ j,    Σ_{j=1}^{N} x_j = 1.

Then summing the preceding equations over all states i yields

Σ_{i=1}^{N} x_i P_ij = x_j Σ_{i=1}^{N} P_ji = x_j,

which, since {π_j, j = 1, 2, ..., N} are the unique solution of (1), implies that

π_j = x_j.

When π_i P_ij = π_j P_ji for all i ≠ j, the Markov chain is said to be time reversible, because it can be shown, under this condition, that if the initial state is chosen according to the probabilities {π_j}, then starting at any time the sequence of states going backwards in time will also be a Markov chain with transition probabilities P_ij.
Example 3: Simple Random Walk on a Graph (continued) Consider a simple random walk on a graph, as described in Example 1. We claim that the equilibrium distribution is

π_i = deg(i)/Z,

where Z = Σ_{k∈S} deg(k). Verify that: if i and j are not neighbors, then π_i p_ij = 0 = π_j p_ji; if i and j are neighbors, then π_i p_ij = 1/Z = π_j p_ji.
Suppose now that we want to generate the value of a random variable X having probability mass function P{X = j} = p_j, j = 1, 2, ..., N. If we could generate an irreducible aperiodic Markov chain with limiting probabilities p_j, j = 1, 2, ..., N, then we would be able to approximately generate such a random variable by running the chain for n steps to obtain the value of X_n, where n is large. In addition, if our objective is to estimate E[h(X)] = Σ_{j=1}^{N} h(j) p_j, then we could also estimate this quantity by using the estimator (1/n) Σ_{i=1}^{n} h(X_i). However, since the early states of the Markov chain can be strongly influenced by the initial state chosen, it is common in practice to discard the first k states, for some suitably chosen value of k. That is, the estimator

(1/(n−k)) Σ_{i=k+1}^{n} h(X_i)

is utilized. It is difficult to know exactly how large a value of k should be used, and usually one just uses one's intuition (which usually works fine because the convergence is guaranteed no matter what value is used).
2 Metropolis-Hastings Algorithm

Let b(j), j = 1, ..., m, be positive numbers, and let B = Σ_{j=1}^{m} b(j). Suppose that m is large and B is difficult to calculate, and that we want to simulate a sequence of random variables with probability mass function

π(j) = b(j)/B, j = 1, 2, ..., m.

One way of simulating a sequence of random variables whose distributions converge to π(j), j = 1, 2, ..., m, is to find a Markov chain that is easy to simulate and whose limiting probabilities are the π(j). The Metropolis-Hastings algorithm is such a construction.
Let Q be an irreducible Markov transition probability matrix on the integers 1, ..., m, with q(i, j) representing the row i, column j element of Q. Now define a Markov chain {X_n, n ≥ 0} as follows. When X_n = i, a random variable X such that P{X = j} = q(i, j), j = 1, ..., m, is generated. If X = j, then X_{n+1} is set equal to j with probability α(i, j) and is set equal to i with probability 1 − α(i, j). Under these conditions, it is easy to see that the sequence of states constitutes a Markov chain with transition probabilities P_{i,j} given by

P_{i,j} = q(i, j) α(i, j), if j ≠ i,
P_{i,i} = q(i, i) + Σ_{k≠i} q(i, k)(1 − α(i, k)).
Now this Markov chain will be time reversible and have stationary probabilities π(j) if

π(i) P_{i,j} = π(j) P_{j,i} for j ≠ i,

which is equivalent to

π(i) q(i, j) α(i, j) = π(j) q(j, i) α(j, i).

It is now easy to check that this will be satisfied if we take

α(i, j) = min( π(j) q(j, i) / (π(i) q(i, j)), 1 ) = min( b(j) q(j, i) / (b(i) q(i, j)), 1 ).

Note that the normalizing constant B cancels in the ratio, so the acceptance probability depends on the target only through the b(j).
The following sums up the Metropolis-Hastings algorithm for generating a time-reversible Markov chain whose limiting probabilities are π(j) = b(j)/B, for j = 1, 2, ..., m.

1. Choose an irreducible Markov transition probability matrix Q with transition probabilities q(i, j), i, j = 1, 2, ..., m. Also, choose some integer value k between 1 and m.
2. Let n = 0 and X_0 = k.
3. Generate a random variable X such that P{X = j} = q(X_n, j), and generate a random number U.
4. If U < b(X) q(X, X_n) / (b(X_n) q(X_n, X)), then X_{n+1} = X; else X_{n+1} = X_n.
5. Set n = n + 1, and go to step 3.
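As a concrete illustration, the sketch below runs the algorithm with a symmetric uniform proposal q(i, j) = 1/m, a special case in which the acceptance ratio in step 4 reduces to b(X)/b(X_n); the weights b(j) are hypothetical values chosen for the example.

```python
import random

# Unnormalized target weights b(j) on states 1..m (hypothetical values),
# so pi(j) = b(j)/B with B = 12.
b = {1: 1.0, 2: 3.0, 3: 6.0, 4: 2.0}
m = len(b)

random.seed(1)
x = 1  # step 2: arbitrary starting state k
n_steps = 100_000
counts = {j: 0 for j in b}
for _ in range(n_steps):
    y = random.randint(1, m)           # step 3: symmetric uniform proposal
    if random.random() < b[y] / b[x]:  # step 4: q-terms cancel
        x = y
    counts[x] += 1

# Empirical frequencies approximate pi(j) = b(j)/B.
print({j: round(counts[j] / n_steps, 3) for j in b})
```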
Proposition 2.1 The Markov chain induced by the Metropolis-Hastings algorithm described above is irreducible and reversible with respect to π.

Proof: Irreducibility follows from the irreducibility of Q, since P(i, j) > 0 whenever q(i, j) > 0. For reversibility, we need to check π(i)P(i, j) = π(j)P(j, i) for every i and j. This is trivial if i = j, so assume i ≠ j. Then

π(i)P(i, j) = π(i) q(i, j) min( π(j) q(j, i) / (π(i) q(i, j)), 1 ) = min( π(j) q(j, i), π(i) q(i, j) ),

which is symmetric in i and j.
Example 4: The Knapsack Problem. Suppose you have m items. For each i = 1, ..., m, the ith item is worth $v_i and weighs w_i kilograms. You can put some of these items into a knapsack, but the total weight cannot be more than b kilograms. What is the most valuable subset of items that will fit into the knapsack? This is an NP-complete problem (no algorithm is known that solves it in time polynomial in m). Let w = (w_1, ..., w_m) ∈ R^m and v = (v_1, ..., v_m) ∈ R^m be the vectors of weights and values, respectively. Let z = (z_1, ..., z_m) ∈ {0, 1}^m be a vector of decision variables, where z_i = 1 if and only if item i is put into the knapsack. Then for a particular decision z, the total weight of the chosen items is Σ_{i=1}^{m} w_i z_i = w'z. The set of all permissible ways to pack the knapsack is

S = {z ∈ {0, 1}^m : w'z ≤ b}.
We call S the set of feasible solutions. The original problem can be expressed as the following integer programming problem:

maximize v'z subject to z ∈ S.

The problem can be approached using the Metropolis-Hastings algorithm by sampling from the following distribution:

π(z) = (1/Z) exp{β v'z}, z ∈ S,

where Z = Σ_{y∈S} exp{β v'y} is the (generally intractable) normalizing constant, and β is a user-specified parameter. Note that the algorithm never needs Z, since only ratios π(y)/π(z) are required.
The Metropolis-Hastings algorithm for the problem is as follows. Given X_t = z:

(i) Choose j ∈ {1, ..., m} uniformly at random.
(ii) Let y = (z_1, ..., z_{j−1}, 1 − z_j, z_{j+1}, ..., z_m).
(iii) If y ∉ S, then X_{t+1} = X_t, and stop.
(iv) If y ∈ S, then let

α = min{1, exp(β v'y) / exp(β v'z)} = min{1, exp(β v'(y − z))}.

Generate U ~ U[0, 1]. If U ≤ α, then set X_{t+1} = y; otherwise, set X_{t+1} = X_t.

Note that the value of α can be computed quickly, since

v'(y − z) = v'(0, ..., 1 − 2z_j, ..., 0) = −v_j if z_j = 1, and v_j if z_j = 0.

Therefore,

α = exp(−β v_j) if z_j = 1, and α = 1 if z_j = 0.

In particular, a feasible attempt to add an item is always accepted.
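A sketch of steps (i)-(iv) in Python, on a small hypothetical instance (the item values, weights, capacity, and β below are invented for illustration); it also tracks the best feasible solution seen along the way:

```python
import math
import random

# Hypothetical small instance: values v, weights w, capacity b_cap,
# inverse temperature beta.
v = [6.0, 5.0, 4.0, 3.0]
w = [5.0, 4.0, 3.0, 2.0]
b_cap = 8.0
beta = 2.0

random.seed(2)
z = [0, 0, 0, 0]  # start from the empty (always feasible) knapsack
best, best_val = z[:], 0.0
for _ in range(20_000):
    j = random.randrange(len(z))        # (i) pick a coordinate uniformly
    y = z[:]; y[j] = 1 - z[j]           # (ii) flip bit j
    if sum(wi * yi for wi, yi in zip(w, y)) > b_cap:
        continue                        # (iii) infeasible: stay put
    # (iv) alpha = exp(-beta*v_j) when removing item j, 1 when adding it
    alpha = math.exp(-beta * v[j]) if z[j] == 1 else 1.0
    if random.random() <= alpha:
        z = y
    val = sum(vi * zi for vi, zi in zip(v, z))
    if val > best_val:
        best, best_val = z[:], val

print(best, best_val)
```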
3 Gibbs Sampler

Like the Metropolis algorithm, the Gibbs sampler generates a Markov chain with the Gibbs distribution as the equilibrium distribution. However, the transition probability distribution is chosen in a special way (Geman and Geman, 1984). Consider a k-dimensional random vector X made up of the components X_1, X_2, ..., X_k. Suppose that we have knowledge of the conditional distribution of X_m, given the values of all other components of X, for m = 1, 2, ..., k. The Gibbs sampler proceeds by generating a value from the conditional distribution of each component of the random vector X, given the values of all other components of X.
Specifically, starting from an arbitrary configuration {x_1(0), x_2(0), ..., x_k(0)}, the Gibbs sampler iterates as follows. Set i = 1.

x_1(i) is drawn from the conditional distribution of X_1, given x_2(i−1), x_3(i−1), ..., x_k(i−1).
x_2(i) is drawn from the conditional distribution of X_2, given x_1(i), x_3(i−1), ..., x_k(i−1).
...
x_m(i) is drawn from the conditional distribution of X_m, given x_1(i), ..., x_{m−1}(i), x_{m+1}(i−1), ..., x_k(i−1).
...
x_k(i) is drawn from the conditional distribution of X_k, given x_1(i), x_2(i), ..., x_{k−1}(i).

The Gibbs sampler is an iterative adaptive scheme. After n iterations of its use, we arrive at the k variates X_1(n), ..., X_k(n).
For the Gibbs sampler, the following two points should be carefully noted:

1. Each component of the random vector X is visited in the natural order, with the result that a total of k new variates are generated on each iteration.
2. The new value of component X_{m−1} is used immediately when a new value of X_m is drawn, for m = 2, 3, ..., k.
Example Suppose we want to generate (U, V, W) having the Dirichlet density

f(u, v, w) = k u^4 v^3 w^2 (1 − u − v − w),

where u > 0, v > 0, w > 0, u + v + w < 1. Let Be(a, b) represent a beta distribution with parameters a and b. Then the appropriate conditional distributions are

U | (V, W) ~ (1 − V − W) Q, Q ~ Be(5, 2),
V | (U, W) ~ (1 − U − W) R, R ~ Be(4, 2),
W | (U, V) ~ (1 − U − V) S, S ~ Be(3, 2),

where Q is independent of (V, W), R is independent of (U, W), and S is independent of (U, V).
Therefore, to use the Gibbs sampler for this problem, the algorithm is as follows:

0. Initialize u_0, v_0 and w_0 such that u_0 > 0, v_0 > 0, w_0 > 0 and u_0 + v_0 + w_0 < 1.
1. Sample q_1 ~ Be(5, 2), and set u_1 = (1 − v_0 − w_0) q_1.
2. Sample r_1 ~ Be(4, 2), and set v_1 = (1 − u_1 − w_0) r_1.
3. Sample s_1 ~ Be(3, 2), and set w_1 = (1 − u_1 − v_1) s_1.

Then (u_1, v_1, w_1) is the first cycle of the Gibbs sampler. To find the second cycle, we would simulate Q_2, R_2 and S_2 independently, Q_2 ~ Be(5, 2), R_2 ~ Be(4, 2), S_2 ~ Be(3, 2), with observed values q_2, r_2 and s_2. Then the second cycle is given by

u_2 = (1 − v_1 − w_1) q_2, v_2 = (1 − u_2 − w_1) r_2, w_2 = (1 − u_2 − v_2) s_2.

We compute the third and higher cycles in a similar fashion. As n, the number of cycles, goes to ∞, the distribution of (U_n, V_n, W_n) converges to the desired Dirichlet distribution.
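The cycles above translate directly into code. A sketch, assuming Python's random.betavariate for the Be(a, b) draws; it repeats many independent runs and compares the sample mean of U with the exact Dirichlet(5, 4, 3, 2) mean E[U] = 5/14:

```python
import random

random.seed(3)

def gibbs_dirichlet(n_cycles):
    # Step 0: any interior starting point with u + v + w < 1.
    u, v, w = 0.2, 0.2, 0.2
    for _ in range(n_cycles):
        u = (1 - v - w) * random.betavariate(5, 2)  # step 1
        v = (1 - u - w) * random.betavariate(4, 2)  # step 2
        w = (1 - u - v) * random.betavariate(3, 2)  # step 3
    return u, v, w

# Many independent chains, each run for 50 cycles.
samples = [gibbs_dirichlet(50) for _ in range(2000)]
mean_u = sum(s[0] for s in samples) / len(samples)
print(round(mean_u, 3))  # should be near E[U] = 5/14 ≈ 0.357
```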
4 Simulated Annealing

Let A be a finite set of vectors, and let V(x) be a nonnegative function defined on x ∈ A. Suppose that we are interested in finding its maximal value and at least one argument at which the maximal value is attained. That is, letting

V* = max_{x∈A} V(x) and µ = {x ∈ A : V(x) = V*},

we are interested in finding V* as well as an element of µ.
To begin, let β > 0 and consider the following probability mass function on the set of values in A:

π_β(x) = e^{βV(x)} / Σ_{x∈A} e^{βV(x)}.

By multiplying the numerator and denominator of the preceding by e^{−βV*}, and letting |µ| be the cardinality of µ, we see that

π_β(x) = e^{β(V(x)−V*)} / ( |µ| + Σ_{x∉µ} e^{β(V(x)−V*)} ).

However, since V(x) − V* < 0 for x ∉ µ, we obtain that as β → ∞,

π_β(x) → δ(x, µ) / |µ|,

where δ(x, µ) = 1 if x ∈ µ and is 0 otherwise. Hence, if we let β be large and generate a Markov chain whose limiting distribution is π_β(x), then most of the mass of this limiting distribution will be concentrated on points in µ.
These observations prompt us to raise the question: why not simply apply the Metropolis algorithm to generate a population of configurations representative of the stochastic system at a very low temperature? We DO NOT advocate such a strategy, because the rate of convergence of the Markov chain to thermal equilibrium is extremely slow at very low temperatures. Rather, the preferred method for improved computational efficiency is to operate the stochastic system at a high temperature, where convergence to equilibrium is fast, and then maintain the system at equilibrium as the temperature is carefully lowered.
That is, we use a combination of two related ingredients:

1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, exemplified by the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature.

This twofold scheme is the essence of simulated annealing (Kirkpatrick et al., 1983). The technique derives its name from analogy with the annealing process in physics/chemistry, where we start the process at a high temperature and then lower the temperature slowly while maintaining thermal equilibrium.
The primary objective of simulated annealing is to find a global minimum of a cost function that characterizes large and complex systems. Simulated annealing differs from conventional iterative optimization algorithms in two important respects:

1. The algorithm need not get stuck, since a transition out of a local minimum is always possible when the system operates at a nonzero temperature.
2. Simulated annealing is adaptive, in that gross features of the final state of the system are seen at higher temperatures, while fine details of the state appear at lower temperatures.
In simulated annealing, the temperature T plays the role of a control parameter. The process will converge to a configuration of minimal energy provided that the temperature is decreased no faster than logarithmically, i.e., T_k > c/log(k), where T_k denotes the value of the kth temperature. Unfortunately, such an annealing schedule is extremely slow, too slow to be of practical use. In practice, we must resort to a finite-time approximation of the asymptotic convergence of the algorithm, at the price that the algorithm is no longer guaranteed to find a global minimum with probability 1.
One popular annealing schedule (Kirkpatrick et al., 1983) is as follows.

Initial value of the temperature. The initial value T_0 of the temperature is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated annealing algorithm.

Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by

T_k = α T_{k−1}, k = 1, 2, ...,

where α is a constant smaller than, but close to, unity. Typical values of α lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment on average.

Final value of the temperature. The system is frozen and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.

Example 5: The Knapsack Problem. Revisit the problem of Example 4 using simulated annealing.
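A sketch of this schedule applied to the knapsack problem of Example 5 (the instance data, initial temperature, cooling factor, and stage lengths below are hypothetical, and a fixed number of attempted transitions per stage stands in for the 10-acceptances rule). Maximizing v'z corresponds to minimizing the energy −v'z, with acceptance probability exp(Δv/T) for value-decreasing moves:

```python
import math
import random

# Hypothetical knapsack instance (values, weights, capacity).
v = [6.0, 5.0, 4.0, 3.0]
w = [5.0, 4.0, 3.0, 2.0]
b_cap = 8.0

random.seed(4)
z = [0] * len(v)  # empty knapsack: always feasible
T = 10.0          # initial temperature: high, nearly all moves accepted
alpha = 0.9       # decrement factor, between 0.8 and 0.99
best, best_val = z[:], 0.0
for _ in range(60):                 # 60 temperature stages
    for _ in range(200):            # transitions attempted per stage
        j = random.randrange(len(z))
        y = z[:]; y[j] = 1 - z[j]   # flip one item in or out
        if sum(wi * yi for wi, yi in zip(w, y)) > b_cap:
            continue                # infeasible proposal: stay put
        dv = v[j] if y[j] == 1 else -v[j]  # change in total value
        if dv >= 0 or random.random() < math.exp(dv / T):
            z = y
        val = sum(vi * zi for vi, zi in zip(v, z))
        if val > best_val:
            best, best_val = z[:], val
    T *= alpha                      # geometric cooling: T_k = alpha * T_{k-1}
print(best, best_val)
```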
5 MCMC in Bayesian Statistics

In this section, we will discuss one example in detail, using it to introduce the basic ideas of Bayesian statistics.

Example 6: Pump Failure. The nuclear power plant Farley 1 contains 10 pump systems. The ith pump has been running for time t_i (in thousands of hours). During that time, it has failed N_i times, as recorded below:

 i   N_i    t_i      N_i/t_i
 1    5     94.320   0.0530
 2    1     15.720   0.0636
 3    5     62.880   0.0795
 4   14    125.760   0.111
 5    3      5.240   0.573
 6   19     31.440   0.604
 7    1      1.048   0.954
 8    1      1.048   0.954
 9    4      2.096   1.91
10   22     10.480   2.10
We treat the t_i's as nonrandom quantities. We assume that failures of the ith pump occur according to a Poisson process with rate λ_i, independent of the other nine pumps. This assumption means that N_i ~ Poisson(λ_i t_i), and the likelihood function is

P(N | λ) = Π_{i=1}^{10} (λ_i t_i)^{N_i} exp(−λ_i t_i) / N_i!,

where N = (N_1, ..., N_10) and λ = (λ_1, ..., λ_10). We further assume that λ_i ~ Gamma(α, β), i = 1, ..., 10, with α = 1.8 as in Gelfand and Smith (1990), and β ~ Gamma(γ, δ), with γ = 0.01 and δ = 1. Then E(β) = γ/δ = 0.01, so that β tends to be small, which makes the Gamma(α, β) distribution very flat. A flat prior distribution for λ is desirable because it does not contain much a priori information, and so in principle it should affect the results less.
Thus, we have the following posterior distribution:

P(λ, β | N) = (1/P(N)) Π_{i=1}^{10} [ λ_i^{N_i+α−1} exp(−λ_i(t_i + β)) t_i^{N_i} β^α / (N_i! Γ(α)) ] β^{γ−1} e^{−β} / Γ(γ)
            ∝ Π_{i=1}^{10} [ λ_i^{N_i+α−1} exp(−λ_i(t_i + β)) t_i^{N_i} β^α ] β^{γ−1} e^{−β}.

It is easy to see that

f(λ_i | λ_{−i}, β, N) ∝ λ_i^{N_i+α−1} exp[−λ_i(t_i + β)],

which is Gamma(N_i + α, t_i + β). Next,

f(β | λ, N) ∝ Π_{i=1}^{10} [exp(−λ_i β) β^α] β^{γ−1} e^{−β} = β^{10α+γ−1} exp[−β(Σ_{i=1}^{10} λ_i + 1)],

which is Gamma(10α + γ, Σ_{i=1}^{10} λ_i + 1).

Suppose that Markov chain samples X_t = (λ_1(t), ..., λ_10(t), β(t)), t = 0, 1, ..., have been generated. Based on these, we can estimate E(λ_1 | N) by

(1/n) Σ_{t=1}^{n} (N_1 + α) / (t_1 + β(t)).

Recall that

E(λ_1 | N, β) = (N_1 + α) / (t_1 + β).
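The two conditionals above are exactly what a Gibbs sampler needs. A sketch in Python, using random.gammavariate (which takes shape and scale, so a rate r becomes scale 1/r) and the estimator of E(λ_1 | N) described above; the burn-in length and initial values are arbitrary choices:

```python
import random

# Pump-failure data from the table of Example 6.
N = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
t = [94.320, 15.720, 62.880, 125.760, 5.240,
     31.440, 1.048, 1.048, 2.096, 10.480]
alpha, gamma, delta = 1.8, 0.01, 1.0

random.seed(5)
beta = 1.0
burn_in, n_keep = 1000, 5000
est_lam1 = 0.0
for it in range(burn_in + n_keep):
    # lambda_i | beta, N ~ Gamma(N_i + alpha, rate t_i + beta)
    lam = [random.gammavariate(N[i] + alpha, 1.0 / (t[i] + beta))
           for i in range(10)]
    # beta | lambda, N ~ Gamma(10*alpha + gamma, rate sum(lambda_i) + delta)
    beta = random.gammavariate(10 * alpha + gamma, 1.0 / (sum(lam) + delta))
    if it >= burn_in:
        # estimator averaging E(lambda_1 | N, beta) over the beta(t) draws
        est_lam1 += (N[0] + alpha) / (t[0] + beta)

post_mean = est_lam1 / n_keep
print(round(post_mean, 4))
```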