ST4231, Semester I, 2003-2004

Ch5. Markov Chain Monte Carlo

In general, it is very difficult to simulate the value of a random vector X whose component random variables are dependent. In this chapter we present a powerful approach for generating a random vector whose distribution is approximately that of X. This approach, called the Markov Chain Monte Carlo (MCMC) method, has the added advantage of requiring only that the mass (or density) function of X be specified up to a multiplicative constant, and this, as we will see, is of great importance in applications.

Outline of the chapter:

1. Discrete Markov Chains
2. Metropolis-Hastings Algorithm
3. Gibbs Sampler
4. Simulated Annealing
5. MCMC in Bayesian Statistics

1 Discrete Markov Chains

This section sets notation and reviews some fundamental results for discrete Markov chains. We only discuss discrete-time Markov chains in this chapter. We will not discuss the theory of chains with uncountable state spaces, although we will use such chains in Sections 2, 3, and 4 for applications in statistics.

Definition 1.1 A Markov chain X is a discrete-time stochastic process \{X_0, X_1, \dots\} with the property that the distribution of X_t given all previous values of the process, X_0, X_1, \dots, X_{t-1}, depends only on X_{t-1}; that is, for any set A,

P[X_t \in A \mid X_0, \dots, X_{t-1}] = P[X_t \in A \mid X_{t-1}].

In a Markov chain, the transition from one state to another is probabilistic, but the production of an output symbol is deterministic.

One-Step Transition. Let

p_{ij} = P(X_{t+1} = j \mid X_t = i)

denote the transition probability from state i to state j at time t+1. Since the p_{ij} are conditional probabilities, all transition probabilities must satisfy two conditions:

p_{ij} \ge 0 for all (i, j),
\sum_j p_{ij} = 1 for all i.

We will assume that the transition probabilities are fixed and do not change with time. In such a case the Markov chain is said to be homogeneous in time.

In the case of a system with a finite number of possible states K, for example, the transition probabilities constitute a K-by-K matrix

P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & & \vdots \\ p_{K1} & p_{K2} & \cdots & p_{KK} \end{pmatrix}.

This kind of matrix is called a stochastic matrix. Any stochastic matrix can serve as a matrix of transition probabilities.

Multi-Step Transition. The definition of one-step transition probability may be generalized to the case where the transition from one state to another takes place in some fixed number of steps. Let p_{ij}(m) denote the m-step transition probability from state i to state j:

p_{ij}(m) = P(X_{t+m} = j \mid X_t = i),  m = 1, 2, \dots

We may view the m-step transition as a sum over all intermediate states k through which the system passes on its way from state i to state j. This gives the recursive relation

p_{ij}(m+1) = \sum_k p_{ik}(m) p_{kj},  m = 1, 2, \dots,

with p_{ik}(1) = p_{ik}. The recursion may be generalized as follows (the Chapman-Kolmogorov equations):

p_{ij}(m+n) = \sum_k p_{ik}(m) p_{kj}(n),  m, n = 1, 2, \dots
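The m-step transition probabilities are simply the entries of the matrix power P^m, so the recursion above is easy to check numerically. The following is a minimal sketch in Python; the 3-state matrix is an arbitrary illustrative choice, not taken from the notes.

```python
import numpy as np

# A small illustrative 3-state transition matrix (chosen only to
# demonstrate m-step transition probabilities; not from the notes).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

# m-step transition probabilities are the entries of the matrix power P^m.
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)

# Chapman-Kolmogorov check: p_ij(2 + 1) = sum_k p_ik(2) * p_kj(1).
assert np.allclose(P3, P2 @ P)
print(P3)
```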

Suppose a Markov chain starts in state i. The state i is said to be a recurrent state if the Markov chain returns to state i with probability 1; that is,

f_i = P(\text{ever returning to state } i) = 1.

If the probability f_i < 1, state i is said to be a transient state. If the Markov chain starts in a recurrent state, that state recurs an infinite number of times; if it starts in a transient state, that state recurs only a finite number of times.

The state j of a Markov chain is said to be accessible from state i if there is a finite sequence of transitions from i to j with positive probability. If the states i and j are accessible from each other, they are said to communicate with each other, and the communication is denoted by i \leftrightarrow j. Clearly, if i \leftrightarrow j and j \leftrightarrow k, then i \leftrightarrow k. If two states of a Markov chain communicate with each other, they are said to belong to the same class. In general, the states of a Markov chain fall into one or more disjoint classes. If, however, all the states form a single class, the Markov chain is said to be indecomposable or irreducible. Reducible chains are of little practical interest in most areas of application.

Example 1 (Simple Random Walk on a Graph). A graph is described by a set of vertices and a set of edges. We say that the vertices i and j are neighbors if there is an edge joining i and j (we can also say that i and j are adjacent). For simplicity, assume that there is at most one edge joining any given pair of vertices and that there is no edge joining a vertex to itself. The number of neighbors of the vertex i is called the degree of i and is written deg(i). In a simple random walk on a graph (Figure 1), the state space is the set of vertices of the graph. At each time t, the Markov chain looks at its current position, X_t, and chooses one of its neighbors uniformly at random. It then jumps to the chosen neighbor, which becomes X_{t+1}.

[Figure 1: A graph with five labeled vertices.]

More formally, the transition probabilities are given by

p_{ij} = \begin{cases} 1/\deg(i) & \text{if } i \text{ and } j \text{ are neighbors} \\ 0 & \text{otherwise (including } i = j) \end{cases}

For the graph in Figure 1, the associated transition probability matrix is

P = \begin{pmatrix} 0 & 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 & 0 \\ 1/4 & 1/4 & 0 & 1/4 & 1/4 \\ 0 & 0 & 1/2 & 0 & 1/2 \\ 1/3 & 0 & 1/3 & 1/3 & 0 \end{pmatrix}

The Markov chain is irreducible if and only if the graph is connected.
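The walk is easy to simulate directly from this matrix. Below is a minimal Python sketch (the step count and random seed are arbitrary choices) that runs the chain and records the long-run fraction of time spent in each vertex; the empirical frequencies can be compared with the stationary distribution deg(i)/Z derived in Example 2 below.

```python
import numpy as np

# Transition matrix of the simple random walk on the 5-vertex graph above.
P = np.array([
    [0,   1/3, 1/3, 0,   1/3],
    [1/2, 0,   1/2, 0,   0  ],
    [1/4, 1/4, 0,   1/4, 1/4],
    [0,   0,   1/2, 0,   1/2],
    [1/3, 0,   1/3, 1/3, 0  ],
])

rng = np.random.default_rng(0)
n_steps = 100_000
x = 0                        # start the walk at vertex 1 (index 0)
visits = np.zeros(5)

for _ in range(n_steps):
    x = rng.choice(5, p=P[x])    # jump to a uniformly chosen neighbor
    visits[x] += 1

print(visits / n_steps)                  # long-run visit frequencies
print(np.array([3, 2, 4, 2, 3]) / 14)    # deg(i)/Z, derived in Example 2
```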

Consider an irreducible Markov chain that starts in a recurrent state i at time n = 0. Let T_i(k) denote the time that elapses between the (k-1)th and kth returns to state i. The mean recurrence time of state i is defined as the expectation of T_i(k) over the returns k. The steady-state probability of state i, denoted by \pi_i, is defined as

\pi_i = \frac{1}{E[T_i(k)]}.

If E[T_i(k)] < \infty, that is \pi_i > 0, the state i is said to be a positive recurrent state. If E[T_i(k)] = \infty, that is \pi_i = 0, the state i is said to be a null recurrent state.

In principle, ergodicity means that we may substitute time averages for ensemble averages. In the context of a Markov chain, ergodicity means that the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability \pi_i, which may be justified as follows. The proportion of time spent in state i after k returns, denoted by v_i(k), is defined by

v_i(k) = \frac{k}{\sum_{l=1}^{k} T_i(l)}.

The return times T_i(l) form a sequence of independent and identically distributed random variables since, by definition, each return time is statistically independent of all previous return times. Moreover, in the case of a recurrent state i the chain returns to state i an infinite number of times. Hence, as the number of returns k approaches infinity, the law of large numbers implies that the proportion of time spent in state i approaches the steady-state probability:

\lim_{k \to \infty} v_i(k) = \pi_i,  i = 1, 2, \dots, K.

A sufficient, but not necessary, condition for a Markov chain to be ergodic is that it be both irreducible and aperiodic.

Consider an ergodic Markov chain characterized by a stochastic matrix P. Let the row vector \pi^{(n-1)} denote the state distribution vector of the chain at time n-1. The state distribution vector at time n is defined by

\pi^{(n)} = \pi^{(n-1)} P.

By iteration, we obtain

\pi^{(n)} = \pi^{(n-1)} P = \pi^{(n-2)} P^2 = \cdots = \pi^{(0)} P^n,

where \pi^{(0)} is the initial value of the state distribution vector.

Let p_{ij}^{(n)} denote the ij-th element of P^n. Suppose that as time n approaches infinity, p_{ij}^{(n)} tends to \pi_j independently of i, where \pi_j is the steady-state probability of state j. Correspondingly, for large n, the matrix P^n approaches the limiting form of a square matrix with identical rows:

\lim_{n \to \infty} P^n = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_K \\ \pi_1 & \pi_2 & \cdots & \pi_K \\ \vdots & \vdots & & \vdots \\ \pi_1 & \pi_2 & \cdots & \pi_K \end{pmatrix} = \begin{pmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{pmatrix},

where \pi = (\pi_1, \pi_2, \dots, \pi_K) is the row vector of steady-state probabilities. Writing \lim_{n \to \infty} \pi^{(n)} = \pi^{(0)} \lim_{n \to \infty} P^n, and noting that this limit is \pi itself, we then find that

\Big( \sum_{j=1}^{K} \pi_j^{(0)} - 1 \Big) \pi = 0.

Since, by definition, \sum_{j=1}^{K} \pi_j^{(0)} = 1, this condition is satisfied by the vector \pi independently of the initial distribution.

The probability distribution \{\pi_j\}_{j=1}^{K} is called an invariant or stationary distribution. It is so called because it persists forever once it is established. For an ergodic Markov chain, we may say the following.

1. Starting from an arbitrary initial distribution, the distribution of the chain will converge to a stationary distribution, provided that such a distribution exists.

2. The stationary distribution of the Markov chain is completely independent of the initial distribution if the chain is ergodic.

Example 2: Simple Random Walk on a Graph (continued). Consider a simple random walk on a graph, as described in Example 1. We claim that the equilibrium distribution is

\pi_i = \deg(i)/Z,  where Z = \sum_{k \in S} \deg(k).

For the graph of Figure 1, \pi = (3/14, 2/14, 4/14, 2/14, 3/14).
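Both claims are easy to check numerically for this five-vertex chain. The sketch below (the power n = 100 is an arbitrary choice) computes P^n and verifies that its rows approach the claimed \pi, and that \pi satisfies the stationarity equation \pi P = \pi.

```python
import numpy as np

# Same 5-vertex random-walk transition matrix as above.
P = np.array([
    [0,   1/3, 1/3, 0,   1/3],
    [1/2, 0,   1/2, 0,   0  ],
    [1/4, 1/4, 0,   1/4, 1/4],
    [0,   0,   1/2, 0,   1/2],
    [1/3, 0,   1/3, 1/3, 0  ],
])

pi = np.array([3, 2, 4, 2, 3]) / 14   # claimed equilibrium distribution

# The rows of P^n should all approach pi as n grows...
Pn = np.linalg.matrix_power(P, 100)
print(np.abs(Pn - pi).max())          # maximum deviation from pi

# ...and pi should satisfy the stationarity equation pi P = pi.
print(np.allclose(pi @ P, pi))
```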

To summarize the descriptions above, we have the following definitions and theorems. Let \tau_{ii} be the time of the first return to state i:

\tau_{ii} = \min\{t > 0 : X_t = i \mid X_0 = i\}.

Definition 1.2 X is called irreducible if for all i, j, there exists a t > 0 such that P_{ij}(t) > 0.

Definition 1.3 An irreducible chain X is recurrent if P[\tau_{ii} < \infty] = 1 for some (and hence for all) i. Otherwise, X is transient. Another equivalent condition for recurrence is

\sum_t P_{ii}(t) = \infty  for all i.

Definition 1.4 An irreducible recurrent chain X is called positive recurrent if E[\tau_{ii}] < \infty for some (and hence for all) i. Otherwise, it is called null-recurrent. Another equivalent condition for positive recurrence is the existence of a stationary probability distribution for X, that is, there exists \pi(\cdot) such that

\sum_i \pi(i) P_{ij}(t) = \pi(j)   (1)

for all j and t \ge 0. If the probability distribution of X_t, say \pi_i = P\{X_t = i\}, i = 0, \dots, N, is a stationary distribution, then

P\{X_{t+1} = j\} = \sum_{i=0}^{N} P\{X_{t+1} = j \mid X_t = i\} P\{X_t = i\} = \sum_{i=0}^{N} \pi_i P_{ij} = \pi_j.

Definition 1.5 An irreducible chain X is called aperiodic if for some (and hence for all) i,

\gcd\{t > 0 : P_{ii}(t) > 0\} = 1.

Theorem 1.1 If X is positive recurrent and aperiodic, then its stationary distribution \pi(\cdot) is the unique probability distribution satisfying (1). We then say that X is ergodic, and the following consequences hold:

(i) P_{ij}(t) \to \pi(j) as t \to \infty, for all i, j.

(ii) (Ergodic theorem) If E_\pi[|h(X)|] < \infty, then

P\{ \bar{h}_N \to E_\pi[h(X)] \} = 1,  where \bar{h}_N = \frac{1}{N} \sum_{t=1}^{N} h(X_t)

and E_\pi[h(X)] = \sum_i h(i)\pi(i), the expectation of h(X) with respect to \pi(\cdot).

There is sometimes an easier way of finding the stationary probabilities than solving the set of equations (1). Suppose one can find positive numbers x_j, j = 1, 2, \dots, N, such that

x_i P_{ij} = x_j P_{ji}  for i \ne j,   \sum_{j=1}^{N} x_j = 1.

Then summing the preceding equations over all states i yields

\sum_{i=1}^{N} x_i P_{ij} = x_j \sum_{i=1}^{N} P_{ji} = x_j,

which, since \{\pi_j, j = 1, 2, \dots, N\} is the unique solution of (1), implies that \pi_j = x_j. When \pi_i P_{ij} = \pi_j P_{ji} for all i \ne j, the Markov chain is said to be time reversible, because it can be shown, under this condition, that if the initial state is chosen according to the probabilities \{\pi_j\}, then starting at any time the sequence of states going backwards in time will also be a Markov chain with transition probabilities P_{ij}.

Example 3: Simple Random Walk on a Graph (continued). Consider a simple random walk on a graph, as described in Example 1. We claim that the equilibrium distribution is

\pi_i = \deg(i)/Z,  where Z = \sum_{k \in S} \deg(k).

To verify this, check the reversibility condition: if i and j are not neighbors, then \pi_i p_{ij} = 0 = \pi_j p_{ji}; if i and j are neighbors, then \pi_i p_{ij} = 1/Z = \pi_j p_{ji}.

Suppose now that we want to generate the value of a random variable X having probability mass function P\{X = j\} = p_j, j = 1, 2, \dots, N. If we could construct an irreducible aperiodic Markov chain with limiting probabilities p_j, j = 1, 2, \dots, N, then we would be able to approximately generate such a random variable by running the chain for n steps to obtain the value of X_n, where n is large. In addition, if our objective is to estimate E[h(X)] = \sum_{j=1}^{N} h(j) p_j, then we could also estimate this quantity by using the estimator \frac{1}{n} \sum_{i=1}^{n} h(X_i). However, since the early states of the Markov chain can be strongly influenced by the initial state chosen, it is common in practice to discard the first k states, for some suitably chosen value of k. That is, the estimator

\frac{1}{n-k} \sum_{i=k+1}^{n} h(X_i)

is utilized. It is difficult to know exactly how large a value of k should be used, and usually one just uses one's intuition (which usually works fine because the convergence is guaranteed no matter what value is used).

2 Metropolis-Hastings Algorithm

Let b(j), j = 1, \dots, m, be positive numbers, and let B = \sum_{j=1}^{m} b(j). Suppose that m is large and B is difficult to calculate, and that we want to simulate a sequence of random variables with probability mass function \pi(j) = b(j)/B, j = 1, 2, \dots, m. One way of simulating a sequence of random variables whose distributions converge to \pi(j), j = 1, 2, \dots, m, is to find a Markov chain that is easy to simulate and whose limiting probabilities are the \pi(j). The Metropolis-Hastings algorithm works as follows.

Let Q be an irreducible Markov transition probability matrix on the integers 1, \dots, m, with q(i, j) representing the row i, column j element of Q. Now define a Markov chain \{X_n, n \ge 0\} as follows. When X_n = i, a random variable X such that P\{X = j\} = q(i, j), j = 1, \dots, m, is generated. If X = j, then X_{n+1} is set equal to j with probability \alpha(i, j) and is set equal to i with probability 1 - \alpha(i, j). Under these conditions, it is easy to see that the sequence of states constitutes a Markov chain with transition probabilities P_{i,j} given by

P_{i,j} = q(i, j)\alpha(i, j),  if j \ne i,
P_{i,i} = q(i, i) + \sum_{k \ne i} q(i, k)(1 - \alpha(i, k)).

Now this Markov chain will be time reversible and have stationary probabilities \pi(j) if

\pi(i) P_{i,j} = \pi(j) P_{j,i}  for j \ne i,

which is equivalent to

\pi(i) q(i, j) \alpha(i, j) = \pi(j) q(j, i) \alpha(j, i).

It is now easy to check that this will be satisfied if we take

\alpha(i, j) = \min\left( \frac{\pi(j) q(j, i)}{\pi(i) q(i, j)}, 1 \right) = \min\left( \frac{b(j) q(j, i)}{b(i) q(i, j)}, 1 \right).

The following sums up the Metropolis-Hastings algorithm for generating a time-reversible Markov chain whose limiting probabilities are \pi(j) = b(j)/B, for j = 1, 2, \dots, m.

1. Choose an irreducible Markov transition probability matrix Q with transition probabilities q(i, j), i, j = 1, 2, \dots, m. Also, choose some integer value k between 1 and m.

2. Let n = 0 and X_0 = k.

3. Generate a random variable X such that P\{X = j\} = q(X_n, j), and generate a random number U.

4. If U < \frac{b(X) q(X, X_n)}{b(X_n) q(X_n, X)}, then X_{n+1} = X; else X_{n+1} = X_n.

5. Set n = n + 1 and go to step 3.
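As a concrete illustration, here is a minimal Python sketch of the algorithm above for a discrete state space. The target weights b(j), the uniform proposal matrix Q, the run length, and the burn-in are arbitrary choices made for the example, not part of the notes; the burn-in estimate at the end mirrors the discussion at the close of Section 1.

```python
import numpy as np

def metropolis_hastings(b, Q, k, n_steps, rng):
    """Run the Metropolis-Hastings chain for an unnormalized pmf b(j),
    proposal matrix Q (row i gives q(i, .)), and initial state k (0-based)."""
    m = len(b)
    x = k
    chain = np.empty(n_steps, dtype=int)
    for n in range(n_steps):
        # Step 3: propose X ~ q(x, .) and draw U ~ Uniform(0, 1).
        prop = rng.choice(m, p=Q[x])
        u = rng.random()
        # Step 4: accept if U < b(X) q(X, x) / (b(x) q(x, X)).
        if u < b[prop] * Q[prop, x] / (b[x] * Q[x, prop]):
            x = prop
        chain[n] = x
    return chain

# Illustrative target: b(j) proportional to (j + 1)^2 on states 0..4,
# with a uniform proposal (arbitrary choices, for demonstration only).
rng = np.random.default_rng(1)
b = np.arange(1, 6, dtype=float) ** 2
Q = np.full((5, 5), 1 / 5)
chain = metropolis_hastings(b, Q, k=0, n_steps=50_000, rng=rng)

burn_in = 1_000     # discard the first k states, as discussed in Section 1
freq = np.bincount(chain[burn_in:], minlength=5) / (len(chain) - burn_in)
print(freq)          # empirical frequencies
print(b / b.sum())   # target pi(j) = b(j)/B
```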

Proposition 2.1 The Markov chain induced by the Metropolis-Hastings algorithm described above is irreducible and reversible with respect to \pi.

Proof: Irreducibility follows from the irreducibility of Q, since P(i, j) > 0 whenever q(i, j) > 0. For reversibility, we need to check \pi(i) P(i, j) = \pi(j) P(j, i) for every i and j. This is trivial if i = j, so assume i \ne j. Then

\pi(i) P(i, j) = \pi(i) q(i, j) \min\left( \frac{\pi(j) q(j, i)}{\pi(i) q(i, j)}, 1 \right) = \min(\pi(j) q(j, i), \pi(i) q(i, j)),

which is symmetric in i and j.

Example 4: The Knapsack Problem. Suppose you have m items. For each i = 1, \dots, m, the i-th item is worth $v_i and weighs w_i kilograms. You can put some of these items into a knapsack, but the total weight cannot be more than b kilograms. What is the most valuable subset of items that will fit into the knapsack? This is an NP-complete problem (no algorithm is known that solves it in time polynomial in m). Let w = (w_1, \dots, w_m) \in R^m and v = (v_1, \dots, v_m) \in R^m be the vectors of weights and values, respectively. Let z = (z_1, \dots, z_m) \in \{0, 1\}^m be a vector of decision variables, where z_i = 1 if and only if item i is put into the knapsack. Then for a particular decision z, the total weight of the chosen items is \sum_{i=1}^{m} w_i z_i = w^T z. The set of all permissible ways to pack the knapsack is

S = \{ z \in \{0, 1\}^m : w^T z \le b \}.

We call S the set of feasible solutions. The original problem can be expressed as the following integer programming problem:

maximize v^T z subject to z \in S.

The problem can be attacked using the Metropolis-Hastings algorithm by sampling from the distribution

\pi(z) = \frac{1}{Z} \exp\{\beta v^T z\},  z \in S,

where Z = \sum_{y \in S} \exp\{\beta v^T y\} is the normalizing constant (which need not be computed to run the algorithm), and \beta is a user-specified parameter.

The Metropolis-Hastings algorithm for the problem is as follows. Given X_t = z,

(i) Choose j \in \{1, \dots, m\} uniformly at random.

(ii) Let y = (z_1, \dots, z_{j-1}, 1 - z_j, z_{j+1}, \dots, z_m).

(iii) If y \notin S, then set X_{t+1} = X_t and the step is complete.

(iv) If y \in S, then let

\alpha = \min\{1, \exp(\beta v^T y) / \exp(\beta v^T z)\} = \min\{1, \exp(\beta v^T (y - z))\}.

Generate U \sim \text{Uniform}[0, 1]. If U \le \alpha, then set X_{t+1} = y; otherwise, set X_{t+1} = X_t.

Note that the value of \alpha can be computed quickly, since

v^T (y - z) = v^T (0, \dots, 1 - 2z_j, \dots, 0) = \begin{cases} -v_j & \text{if } z_j = 1 \\ v_j & \text{if } z_j = 0 \end{cases}

Therefore,

\alpha = \begin{cases} \exp(-\beta v_j) & \text{if } z_j = 1 \\ 1 & \text{if } z_j = 0 \end{cases}

In particular, a feasible attempt to add an item is always accepted.
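A minimal Python sketch of this sampler follows, using steps (i)-(iv) and tracking the best packing seen along the way. The instance (values, weights, capacity), the value of \beta, and the number of steps are all made-up illustrative choices, not data from the notes.

```python
import numpy as np

def knapsack_mh(v, w, b, beta, n_steps, rng):
    """Metropolis-Hastings sampler over feasible knapsack packings,
    following steps (i)-(iv) above. Returns the best packing seen."""
    m = len(v)
    z = np.zeros(m, dtype=int)          # the empty knapsack is always feasible
    best_z, best_val = z.copy(), 0.0
    for _ in range(n_steps):
        j = rng.integers(m)             # (i) pick a coordinate to flip
        y = z.copy()
        y[j] = 1 - y[j]                 # (ii) proposed packing
        if w @ y > b:                   # (iii) infeasible: stay put
            continue
        # (iv) adding an item is always accepted; removing item j is
        # accepted with probability exp(-beta * v_j).
        alpha = 1.0 if z[j] == 0 else np.exp(-beta * v[j])
        if rng.random() <= alpha:
            z = y
            val = float(v @ z)
            if val > best_val:
                best_z, best_val = z.copy(), val
    return best_z, best_val

# A tiny made-up instance (values, weights and capacity are illustrative only).
rng = np.random.default_rng(2)
v = np.array([10.0, 7.0, 5.0, 3.0, 1.0])
w = np.array([ 6.0, 5.0, 4.0, 3.0, 1.0])
print(knapsack_mh(v, w, b=10.0, beta=2.0, n_steps=20_000, rng=rng))
```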

3 Gibbs Sampler

Like the Metropolis algorithm, the Gibbs sampler generates a Markov chain with the Gibbs (target) distribution as its equilibrium distribution. However, the transition probability distribution is chosen in a special way (Geman and Geman, 1984). Consider a k-dimensional random vector X made up of the components X_1, X_2, \dots, X_k. Suppose that we have knowledge of the conditional distribution of X_m, given the values of all other components of X, for m = 1, 2, \dots, k. The Gibbs sampler proceeds by generating a value from the conditional distribution of each component of the random vector X, given the values of all other components of X.

Specifically, starting from an arbitrary configuration \{x_1(0), x_2(0), \dots, x_k(0)\}, the Gibbs sampler iterates as follows. Set i = 1.

x_1(i) is drawn from the conditional distribution of X_1, given x_2(i-1), x_3(i-1), \dots, x_k(i-1).
x_2(i) is drawn from the conditional distribution of X_2, given x_1(i), x_3(i-1), \dots, x_k(i-1).
...
x_m(i) is drawn from the conditional distribution of X_m, given x_1(i), \dots, x_{m-1}(i), x_{m+1}(i-1), \dots, x_k(i-1).
...
x_k(i) is drawn from the conditional distribution of X_k, given x_1(i), x_2(i), \dots, x_{k-1}(i).

The Gibbs sampler is an iterative adaptive scheme. After n iterations of its use, we arrive at the k variates X_1(n), \dots, X_k(n).

For the Gibbs sampler, the following two points should be carefully noted:

1. Each component of the random vector X is visited in the natural order, with the result that a total of k new variates are generated on each iteration.

2. The new value of component X_{m-1} is used immediately when a new value of X_m is drawn, for m = 2, 3, \dots, k.

Example. Suppose we want to generate (U, V, W) having the Dirichlet density

f(u, v, w) = k u^4 v^3 w^2 (1 - u - v - w),  where u > 0, v > 0, w > 0, u + v + w < 1.

Let Be(a, b) represent a beta distribution with parameters a and b. Then the appropriate conditional distributions are

U \mid (V, W) \sim (1 - V - W) Q,  Q \sim Be(5, 2),
V \mid (U, W) \sim (1 - U - W) R,  R \sim Be(4, 2),
W \mid (U, V) \sim (1 - U - V) S,  S \sim Be(3, 2),

where Q is independent of (V, W), R is independent of (U, W), and S is independent of (U, V).

Therefore, to use the Gibbs sampler for this problem, the algorithm is as follows:

0. Initialize u_0, v_0 and w_0 such that u_0 > 0, v_0 > 0, w_0 > 0 and u_0 + v_0 + w_0 < 1.

1. Sample q_1 \sim Be(5, 2) and set u_1 = (1 - v_0 - w_0) q_1.

2. Sample r_1 \sim Be(4, 2) and set v_1 = (1 - u_1 - w_0) r_1.

3. Sample s_1 \sim Be(3, 2) and set w_1 = (1 - u_1 - v_1) s_1.

Then (u_1, v_1, w_1) is the first cycle of the Gibbs sampler. To find the second cycle, we would simulate Q_2, R_2 and S_2 independently, Q_2 \sim Be(5, 2), R_2 \sim Be(4, 2), S_2 \sim Be(3, 2), with observed values q_2, r_2 and s_2. Then the second cycle is given by

u_2 = (1 - v_1 - w_1) q_2,  v_2 = (1 - u_2 - w_1) r_2,  w_2 = (1 - u_2 - v_2) s_2.

We compute the third and higher cycles in a similar fashion. As n, the number of cycles, goes to \infty, the distribution of (U_n, V_n, W_n) converges to the desired Dirichlet distribution.
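The cycles above translate directly into a few lines of Python. In the sketch below, the starting point, the number of cycles, and the burn-in are arbitrary choices; the sanity check at the end uses the fact that the density above is the Dirichlet(5, 4, 3, 2) density, whose component means are 5/14, 4/14 and 3/14.

```python
import numpy as np

def dirichlet_gibbs(n_cycles, rng):
    """Gibbs sampler for the Dirichlet(5, 4, 3, 2) density above."""
    u, v, w = 0.25, 0.25, 0.25          # any feasible starting point
    samples = np.empty((n_cycles, 3))
    for n in range(n_cycles):
        u = (1 - v - w) * rng.beta(5, 2)
        v = (1 - u - w) * rng.beta(4, 2)
        w = (1 - u - v) * rng.beta(3, 2)
        samples[n] = (u, v, w)
    return samples

rng = np.random.default_rng(3)
samples = dirichlet_gibbs(50_000, rng)

# Compare the post-burn-in sample means with the exact Dirichlet means.
print(samples[1_000:].mean(axis=0))
print(np.array([5, 4, 3]) / 14)
```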

4 Simulated Annealing

Let A be a finite set of vectors and let V(x) be a nonnegative function defined for x \in A, and suppose that we are interested in finding its maximal value and at least one argument at which the maximal value is attained. That is, letting

V^* = \max_{x \in A} V(x)  and  \mu = \{x \in A : V(x) = V^*\},

we are interested in finding V^* as well as an element of \mu.

To begin, let \beta > 0 and consider the following probability mass function on the set of values in A:

\pi_\beta(x) = \frac{e^{\beta V(x)}}{\sum_{x \in A} e^{\beta V(x)}}.

By multiplying the numerator and denominator of the preceding by e^{-\beta V^*}, and letting |\mu| be the cardinality of \mu, we see that

\pi_\beta(x) = \frac{e^{\beta (V(x) - V^*)}}{|\mu| + \sum_{x \notin \mu} e^{\beta (V(x) - V^*)}}.

However, since V(x) - V^* < 0 for x \notin \mu, we obtain that as \beta \to \infty,

\pi_\beta(x) \to \frac{\delta(x, \mu)}{|\mu|},

where \delta(x, \mu) = 1 if x \in \mu and 0 otherwise. Hence, if we let \beta be large and generate a Markov chain whose limiting distribution is \pi_\beta(x), then most of the mass of this limiting distribution will be concentrated on points in \mu.
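This concentration effect is easy to see numerically. The toy objective V below is an arbitrary illustrative choice (not from the notes); subtracting V^* before exponentiating follows the derivation above and also avoids numerical overflow for large \beta.

```python
import numpy as np

# A toy objective on a small finite set; its maximum 3.0 is attained at two points.
V = np.array([1.0, 2.0, 3.0, 3.0, 2.5])

def pi_beta(V, beta):
    # Work with V(x) - V* exactly as in the derivation above.
    w = np.exp(beta * (V - V.max()))
    return w / w.sum()

for beta in (0.0, 1.0, 5.0, 50.0):
    print(beta, np.round(pi_beta(V, beta), 4))
# As beta grows, the mass concentrates on the two maximizers (0.5 each).
```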

These observations prompt us to raise the question: Why not simply apply the Metropolis algorithm for generating a population of configurations representative of the stochastic system at very low temperature? We DO NOT advocate the use of such a strategy because the rate of convergence of the Markov chain to thermal equilibrium is extremely slow at very low temperatures. Rather, the preferred method for improved computational efficiency is to operate the stochastic system at a high temperature where convergence to equilibrium is fast, and then maintain the system at equilibrium as the temperature is carefully lowered.

That is, we use a combination of two related ingredients:

1. A scheme that determines the rate at which the temperature is lowered.

2. An algorithm, exemplified by the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature.

This twofold scheme is the essence of simulated annealing (Kirkpatrick et al., 1983). The technique derives its name from an analogy with the annealing process in physics and chemistry, where we start the process at a high temperature and then lower the temperature slowly while maintaining thermal equilibrium.

The primary objective of simulated annealing is to find a global minimum of a cost function that characterizes large and complex systems. Simulated annealing differs from conventional iterative optimization algorithms in two important respects:

1. The algorithm need not get stuck, since a transition out of a local minimum is always possible when the system operates at a nonzero temperature.

2. Simulated annealing is adaptive in that gross features of the final state of the system are seen at higher temperatures, while fine details of the state appear at lower temperatures.

In simulated annealing, the temperature T plays the role of a control parameter. The process will converge to a configuration of minimal energy provided that the temperature is decreased no faster than logarithmically, i.e., T_k > c/\log(k), where T_k denotes the value of the k-th temperature. Unfortunately, such an annealing schedule is extremely slow, too slow to be of practical use. In practice, we must resort to a finite-time approximation of the asymptotic convergence of the algorithm, at the price that the algorithm is no longer guaranteed to find a global minimum with probability 1.

One popular annealing schedule (Kirkpatrick et al., 1983) is as follows.

Initial value of the temperature. The initial value T_0 of the temperature is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated annealing algorithm.

Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by

T_k = \alpha T_{k-1},  k = 1, 2, \dots,

where \alpha is a constant smaller than, but close to, unity. Typical values of \alpha lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment on average.

Final value of the temperature. The system is frozen and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.

Example 5: The Knapsack Problem. Revisit the problem with simulated annealing; a sketch is given below.
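The following Python sketch reuses the single-bit-flip move of Example 4 together with the exponential cooling schedule T_k = \alpha T_{k-1}. All schedule constants (T_0, \alpha, the number of attempts per temperature, the number of temperature levels) are arbitrary illustrative choices, and the stopping rule is simplified to a fixed number of temperature levels rather than the acceptance-based rule described above.

```python
import numpy as np

def knapsack_sa(v, w, b, rng, T0=10.0, alpha=0.9, moves_per_T=200, n_temps=60):
    """Simulated annealing for the knapsack problem with single-bit-flip moves
    and exponential cooling T_k = alpha * T_{k-1}."""
    m = len(v)
    z = np.zeros(m, dtype=int)               # start with the empty knapsack
    best_z, best_val = z.copy(), 0.0
    T = T0
    for _ in range(n_temps):
        for _ in range(moves_per_T):
            j = rng.integers(m)
            y = z.copy()
            y[j] = 1 - y[j]
            if w @ y > b:                     # infeasible proposal: reject
                continue
            delta = float(v @ y - v @ z)      # change in total value
            # Metropolis rule at temperature T: always accept improvements,
            # accept a decrease in value with probability exp(delta / T).
            if delta >= 0 or rng.random() < np.exp(delta / T):
                z = y
                val = float(v @ z)
                if val > best_val:
                    best_z, best_val = z.copy(), val
        T *= alpha                            # exponential cooling step
    return best_z, best_val

# Same illustrative instance as in the Metropolis-Hastings sketch above.
rng = np.random.default_rng(4)
v = np.array([10.0, 7.0, 5.0, 3.0, 1.0])
w = np.array([ 6.0, 5.0, 4.0, 3.0, 1.0])
print(knapsack_sa(v, w, b=10.0, rng=rng))
```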

5 MCMC in Bayesian Statistics

In this section, we will discuss one example in detail, using it to introduce the basic ideas of Bayesian statistics.

Example 6: Pump Failure. The nuclear power plant Farley 1 contains 10 pump systems. The i-th pump has been running for time t_i (in thousands of hours). During that time, it has failed N_i times, as recorded below:

 i   N_i    t_i       N_i/t_i
 1    5     94.320    0.0530
 2    1     15.720    0.0636
 3    5     62.880    0.0795
 4   14    125.760    0.111
 5    3      5.240    0.573
 6   19     31.440    0.604
 7    1      1.048    0.954
 8    1      1.048    0.954
 9    4      2.096    1.91
10   22     10.480    2.10

We treat the t_i's as nonrandom quantities. We assume that failures of the i-th pump occur according to a Poisson process with rate \lambda_i, independent of the other nine pumps. This assumption means that N_i \sim Poisson(\lambda_i t_i), and the likelihood function is

P(N \mid \lambda) = \prod_{i=1}^{10} \frac{(\lambda_i t_i)^{N_i} \exp(-\lambda_i t_i)}{N_i!},

where N = (N_1, \dots, N_{10}) and \lambda = (\lambda_1, \dots, \lambda_{10}). We further assume that \lambda_i \sim Gamma(\alpha, \beta), i = 1, \dots, 10, with \alpha = 1.8 as in Gelfand and Smith (1990), and \beta \sim Gamma(\gamma, \delta), with \gamma = 0.01 and \delta = 1. Then E(\beta) = 0.01, so that \beta tends to be small, which makes the Gamma(\alpha, \beta) distribution very flat. A flat prior distribution for \lambda is desirable because it does not contain much a priori information, and so in principle it should affect the results less.

Thus, we have the following posterior distribution:

P(\lambda, \beta \mid N) = \frac{1}{P(N)} \prod_{i=1}^{10} \left[ \frac{\lambda_i^{N_i+\alpha-1} \exp(-\lambda_i (t_i + \beta)) \, t_i^{N_i}}{N_i! \, \Gamma(\alpha)} \beta^\alpha \right] \frac{\beta^{\gamma-1} e^{-\beta}}{\Gamma(\gamma)}
\propto \prod_{i=1}^{10} \left[ \lambda_i^{N_i+\alpha-1} \exp(-\lambda_i (t_i + \beta)) \beta^\alpha \right] \beta^{\gamma-1} e^{-\beta}.

It is easy to see that

f(\lambda_i \mid \lambda_{-i}, \beta, N) \propto \lambda_i^{N_i+\alpha-1} \exp[-\lambda_i (t_i + \beta)],

which is Gamma(N_i + \alpha, t_i + \beta). Next,

f(\beta \mid \lambda, N) \propto \prod_{i=1}^{10} \left[ \exp(-\lambda_i \beta) \beta^\alpha \right] \beta^{\gamma-1} e^{-\beta} = \beta^{10\alpha+\gamma-1} \exp\left[-\beta \left( \sum_{i=1}^{10} \lambda_i + 1 \right)\right],

which is Gamma(10\alpha + \gamma, \sum_{i=1}^{10} \lambda_i + 1).

Suppose that Markov chain samples X_t = (\lambda_1(t), \dots, \lambda_{10}(t), \beta(t)), t = 0, 1, \dots, have been generated. Based on these, we can estimate E(\lambda_1 \mid N) by

\frac{1}{n} \sum_{t=1}^{n} \frac{N_1 + \alpha}{t_1 + \beta(t)}.

Recall that

E(\lambda_1 \mid N, \beta) = \frac{N_1 + \alpha}{t_1 + \beta}.
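A minimal Python sketch of the resulting Gibbs sampler follows, alternating the two full conditionals derived above on the pump data from the table in Example 6. The starting value of \beta, the number of cycles, and the burn-in are arbitrary choices; the last two lines compare the estimator of E(\lambda_1 \mid N) displayed above with the plain average of the \lambda_1 draws.

```python
import numpy as np

# Pump-failure data from the table in Example 6.
N = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22], dtype=float)
t = np.array([94.320, 15.720, 62.880, 125.760, 5.240,
              31.440, 1.048, 1.048, 2.096, 10.480])
alpha, gamma, delta = 1.8, 0.01, 1.0

def pump_gibbs(n_cycles, rng):
    """Gibbs sampler alternating the full conditionals derived above:
    lambda_i | beta ~ Gamma(N_i + alpha, t_i + beta) and
    beta | lambda ~ Gamma(10*alpha + gamma, sum(lambda_i) + delta)."""
    beta = 1.0                                   # arbitrary starting value
    lam_draws, beta_draws = [], []
    for _ in range(n_cycles):
        # numpy's gamma generator takes shape and scale = 1/rate.
        lam = rng.gamma(N + alpha, 1.0 / (t + beta))
        beta = rng.gamma(10 * alpha + gamma, 1.0 / (lam.sum() + delta))
        lam_draws.append(lam)
        beta_draws.append(beta)
    return np.array(lam_draws), np.array(beta_draws)

rng = np.random.default_rng(5)
lam_draws, beta_draws = pump_gibbs(20_000, rng)
burn = 2_000

# Estimate of E(lambda_1 | N) using the conditional mean, as displayed above.
print(np.mean((N[0] + alpha) / (t[0] + beta_draws[burn:])))
# Direct posterior-mean estimate from the lambda_1 draws, for comparison.
print(lam_draws[burn:, 0].mean())
```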