9 Markov chain Monte Carlo integration. MCMC



Markov chain Monte Carlo integration, or MCMC, is a term used to cover a broad range of methods for numerically computing probabilities, or for optimization. They are simulation methods, mostly used in complex stochastic systems where exact computation and even simple simulation are not computationally feasible. Methods that fall under this heading include Metropolis sampling, Hastings sampling and Gibbs sampling, which are for integration, and simulated annealing and sometimes genetic algorithms, which are optimization techniques. Although these methods are mainly used for complex systems, we have already seen a problem which can be very easily addressed using MCMC: finding the exact p-value for a test of association between the rows and columns of a contingency table.

9.1 The basic idea of MCMC

We have already addressed several problems of calculating p-values using simulation, for example in rank tests and randomization tests. These all work because the expected value of the sample average of a function, taken over the sample that we generate, is the mean value of the function that we would get if we could average over all possible states. That is,

$$E\left[\frac{1}{s}\sum_{i=1}^{s} f(X_i)\right] = \int_x f(x)\,p(x)\,dx$$

if the $X_i$ are sampled so that $P(X_i = x) = p(x)$.

In all the examples we have looked at so far we have generated an independent sequence of simulations $X_1, \ldots, X_s$, that is, we have done simple Monte Carlo integration, sometimes called perfect simulation. However, we note that the above equation holds true even when the observations are not independent. That is, they can be correlated. Markov chain Monte Carlo exploits this fact by creating a sequence of dependent observations and averaging functions of them to estimate the desired integral. More specifically, these observations are generated from a Markov chain.

9.2 Markov chains

A Markov chain is a process that produces observations $X_1, X_2, \ldots$ which are not independent, but for which the dependence is limited.

Suppose we have run the process for $i$ steps and are considering the probabilities for the possible states of the next observation, $X_{i+1}$. In the general case this is allowed to depend on all previous observations and we could write it as

$$P(X_{i+1} \mid X_i, X_{i-1}, X_{i-2}, \ldots)$$

If we have independent observations then the next observation is not allowed to depend on any of the previous observations. That is,

$$P(X_{i+1} \mid X_i, X_{i-1}, X_{i-2}, \ldots) = P(X_{i+1})$$

For a Markov chain the next observation can only depend on the most recent previous observation. That is,

$$P(X_{i+1} \mid X_i, X_{i-1}, X_{i-2}, \ldots) = P(X_{i+1} \mid X_i)$$

Informally this says: where I go to next depends only on where I am now, not how I got here.

It turns out that allowing this limited dependence on the past greatly increases the ability to model all sorts of correlated data, for instance scores in sports, stock prices, biological systems and so on. However, we are not going to model real data with a Markov chain. We are going to create artificial Markov chains that give us the simulations we want.

Under some regularity conditions, when a Markov chain runs for long enough it will settle down to a limiting or ergodic distribution. This is true regardless of the state in which it is started. The main condition that we need to ensure holds is called irreducibility. This says that for any two states that are possible for the chain, it must be possible to get from one to the other just by running the chain. That is, the set of possible states of the Markov chain must not reduce into separate non-communicating classes.

9.3 MCMC

If we want to generate $X_i$'s from $p(x)$ by MCMC we have to:

- Create a Markov chain which has ergodic distribution $p(x)$.
- Show that the chain is irreducible.
- Find an initial state $X_0$ such that $p(X_0) > 0$.
- Run the chain for a while to allow it to reach the ergodic distribution.
- Start harvesting simulations and computing the required functions.

Depending on the application, these steps can be straightforward or difficult. For example, in image processing applications finding an initial state is trivial, while in some genetic problems it is challenging. Also, it is not always clear whether the initial run of the chain before harvesting, sometimes called the burn-in, is necessary. It can often be ignored and all simulations can be used.

The real problems with MCMC methods, however, concern irreducibility. It can be difficult either to create a chain which is irreducible or to prove that the chain you have created is in fact irreducible. This can involve challenging issues that take some creativity to address. It is also often the case that although a chain is theoretically irreducible, in practice the probabilities of some transitions are so small as to make the chain effectively reducible. How well a chain can move through different parts of its state space is described by its mixing properties. In many examples a chain's mixing properties are so bad that the method is totally unreliable. The example we shall develop below is sufficiently simple that we don't need to worry about these issues.
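The claim that an irreducible chain settles down to the same limiting distribution whatever its starting state can be checked on a toy example. The following is a minimal sketch in R; the three-state transition matrix `P` and the helper `run_chain` are made up for illustration and do not come from these notes.

```r
# Hypothetical 3-state transition matrix: row k holds the probabilities
# of moving from state k to states 1, 2 and 3. Each row sums to 1.
# For this particular P the stationary distribution is (0.25, 0.5, 0.25).
P <- matrix(c(0.50, 0.50, 0.00,
              0.25, 0.50, 0.25,
              0.00, 0.50, 0.50), nrow = 3, byrow = TRUE)

# Run the chain for a given number of steps from a given starting state.
run_chain <- function(P, start, steps) {
  x <- integer(steps)
  state <- start
  for (i in seq_len(steps)) {
    # The next state depends only on the current state.
    state <- sample(nrow(P), 1, prob = P[state, ])
    x[i] <- state
  }
  x
}

set.seed(1)
from1 <- run_chain(P, start = 1, steps = 20000)
from3 <- run_chain(P, start = 3, steps = 20000)

# Proportion of time spent in each state for the two starting points.
freq1 <- tabulate(from1, nbins = 3) / length(from1)
freq3 <- tabulate(from3, nbins = 3) / length(from3)
print(rbind(freq1, freq3))
```

Both rows of the printed matrix should be close to each other and to the stationary distribution, even though the two runs start at opposite ends of the state space.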

9.4 Metropolis sampling

There are several general techniques that work in a variety of cases and help ensure that the regularity conditions are satisfied. One of these, the Metropolis method, can be applied to our contingency table problem. To simulate a Markov chain with ergodic distribution $p(x)$ we proceed as follows:

1. Suppose we are currently in state $y$.

2. Propose a new state $z$ with probability $q_{y,z}$. These proposal probabilities need to be symmetric, that is, $q_{y,z} = q_{z,y}$ for all possible pairs of states $y$ and $z$. In other words, the probability of proposing $y$ when you are currently in $z$ is the same as that of proposing $z$ when you are in $y$. It is also necessary that some sequence of proposals will enable you to move from any possible state to any other possible state. This will ensure irreducibility.

3. If $p(z) \ge p(y)$, move to state $z$. If $p(z) < p(y)$, move to state $z$ with probability $p(z)/p(y)$. We can combine these acceptance criteria into a single rule: simulate $U$ from a Uniform(0,1) distribution. If $U \le p(z)/p(y)$, accept $z$, that is, move to $z$. Otherwise stay at $y$ for an extra turn.

4. Output the current state.

5. Repeat.

Hastings's method is similar to this but allows for asymmetric proposal probabilities. Gibbs sampling updates a state, consisting of a large set of variables, by changing a small subset of the variables based on the states of their neighbouring variables. Simulated annealing is very similar to Metropolis sampling except that the distribution $p(x)$ changes gradually so that eventually all the probability is at the state of maximum probability, which enables it to be used as an optimization technique. Genetic algorithms use not a single incumbent state which is altered, but sets of incumbent states, which are altered and bred in ways that mimic genetic mutation and recombination. States are then selected to survive based on how good the solutions are, in a way that mimics natural selection. Genetic algorithms may or may not be true MCMC techniques depending on the particular mutation, recombination and selection processes. Even when they are not MCMC they may still be good optimization methods.
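The Metropolis steps can be sketched on a very small problem. The code below is an illustration under stated assumptions, not part of the original notes: the target weights `w` on the states 1 to 5 and the function name `metropolis` are hypothetical, and the proposal is a step of one state to the left or right with equal probability, which is symmetric. Because only the ratio $p(z)/p(y)$ is used, the weights do not need to sum to 1.

```r
# Hypothetical unnormalised target on the states 1, ..., 5.
w <- c(1, 2, 4, 2, 1)

metropolis <- function(w, s = 50000, start = 1) {
  out <- integer(s)
  y <- start                                # step 1: current state
  for (i in seq_len(s)) {
    z <- y + sample(c(-1, 1), 1)            # step 2: symmetric proposal
    # Proposals outside 1..length(w) have p(z) = 0 and are always rejected.
    if (z >= 1 && z <= length(w)) {
      # Step 3: accept with probability min(1, p(z) / p(y)).
      if (runif(1) <= w[z] / w[y]) y <- z
    }
    out[i] <- y                             # step 4: output the current state
  }
  out
}

set.seed(2)
draws <- metropolis(w)

# Compare the visit frequencies with the normalised target.
est <- tabulate(draws, nbins = length(w)) / length(draws)
target <- w / sum(w)
print(rbind(est, target))
```

Note that a rejected proposal still produces an output: the chain records the current state again, which is what "stay at y for an extra turn" means in step 3.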

9.5 Application to computing the p-value for a test of association between rows and columns of a contingency table

What makes this problem tricky is that we need to simulate values for the entries in a contingency table, not just so that the total is fixed, but so that all the row and column totals are fixed at the observed values. In fact R A Fisher (the world's first statistician) showed that we need to generate tables with probability

$$p(x) = \frac{\prod_{i} r_i! \; \prod_{j} c_j!}{n! \; \prod_{i}\prod_{j} x_{i,j}!}$$

where $x$ is a table with entries $x_{i,j}$, $n$ is the total number of observations, $r_i$ and $c_j$ are the row and column totals, and the products run over all rows $i$ and all columns $j$. Fisher's exact test uses a branch and bound search to find the total of all probabilities like this for tables which give more extreme values than the observed statistic.

To get an updating scheme that keeps the row and column totals fixed, consider choosing two random rows and two random columns. We can use these to define 4 table elements that we can think of as being at the corners of a square.

[Figure: the four cells where rows R1 and R2 meet columns C1 and C2.]

If we now add 1 to both elements on one diagonal and remove 1 from both elements on the other diagonal, the row and column totals have to stay the same. This is going to be our method of proposing a change. Notice that if we chose the same rows and columns but in the opposite order then this update would bring us back to the same state, so the proposals are symmetric.

Given a current table $y$ and a proposed table $z$, we have to reject $z$ if it contains any negative elements. Otherwise we generate a Uniform(0,1) random variable and accept $z$ if the value is less than

$$\frac{p(z)}{p(y)} = \frac{\prod_{i} r_i! \; \prod_{j} c_j!}{n! \; \prod_{i}\prod_{j} z_{i,j}!} \cdot \frac{n! \; \prod_{i}\prod_{j} y_{i,j}!}{\prod_{i} r_i! \; \prod_{j} c_j!} = \frac{\prod_{i}\prod_{j} y_{i,j}!}{\prod_{i}\prod_{j} z_{i,j}!}$$

Since factorials get very big, we have to be careful about computing the acceptance probability. It is better to take the logarithms of these numbers, which is made easy by the built-in R function lfactorial, which returns the log of the factorial of its arguments. Hence $U \le p(z)/p(y)$ if and only if

$$\log(U) \le \sum_{i}\sum_{j} \log(y_{i,j}!) - \sum_{i}\sum_{j} \log(z_{i,j}!)$$

which, using R functions, becomes

    log(runif(1)) <= sum(lfactorial(y)) - sum(lfactorial(z))

Putting all this together, and using the observed table as our starting value, we get the function below.

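    mcfish = function(x, s = 1000)
    {
        # Row totals, column totals and grand total of the observed table.
        r = apply(x,1,sum)
        c = apply(x,2,sum)
        n = sum(x)

        # Expected counts under no association.
        ex = x
        for (i in 1:length(r))
            for (j in 1:length(c))
                ex[i,j] = r[i]*c[j] / n

        # Observed chi-squared statistic.
        t = sum( (x-ex)^2 / ex )

        # From here on r and c hold row and column indices, not totals.
        r = 1:length(r)
        c = 1:length(c)

        k = 0    # number of simulated tables more extreme than the observed one
        y = x    # current state, started at the observed table

        for (i in 1:s)
        {
            # Shuffle the indices; the first two of each define the square.
            r = sample(r)
            c = sample(c)

            z = y
            # Both cells that lose 1 must be positive, otherwise the
            # proposal would contain a negative entry and is rejected.
            if ((z[r[1],c[1]] > 0) && (z[r[2],c[2]] > 0))
            {
                z[r[1],c[1]] = z[r[1],c[1]] - 1
                z[r[2],c[2]] = z[r[2],c[2]] - 1
                z[r[1],c[2]] = z[r[1],c[2]] + 1
                z[r[2],c[1]] = z[r[2],c[1]] + 1

                # Metropolis acceptance step on the log scale.
                if (log(runif(1)) <= sum(lfactorial(y))-sum(lfactorial(z)))
                    y = z
            }

            # Count the current table, whether or not the chain moved.
            if ( sum( (y-ex)^2 / ex ) > t)
                k = k+1
        }

        list(stat = t, pval = k/s, nsims = s, expect = ex)
    }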
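The proposal and acceptance steps used inside the function above can be checked in isolation. The sketch below assumes a made-up 3 x 3 table `y` and a hypothetical helper `propose`; it confirms that the square update leaves the row and column totals unchanged and that the log-scale ratio agrees with the ratio of factorial products computed directly.

```r
# Hypothetical 3 x 3 table of counts, used only for illustration.
y <- matrix(c(5, 2, 1,
              3, 4, 2,
              1, 3, 6), nrow = 3, byrow = TRUE)

# Pick two distinct rows and two distinct columns, then subtract 1 on one
# diagonal of the square they define and add 1 on the other diagonal.
propose <- function(y) {
  rows <- sample(nrow(y), 2)
  cols <- sample(ncol(y), 2)
  z <- y
  z[rows[1], cols[1]] <- z[rows[1], cols[1]] - 1
  z[rows[2], cols[2]] <- z[rows[2], cols[2]] - 1
  z[rows[1], cols[2]] <- z[rows[1], cols[2]] + 1
  z[rows[2], cols[1]] <- z[rows[2], cols[1]] + 1
  z
}

set.seed(3)
z <- propose(y)

# The margins are unchanged by the update.
margins_kept <- all(rowSums(z) == rowSums(y)) &&
                all(colSums(z) == colSums(y))

# Every cell of y is at least 1, so one update cannot go negative here.
valid <- all(z >= 0)

# Acceptance ratio p(z) / p(y), on the log scale and directly.
log_ratio <- sum(lfactorial(y)) - sum(lfactorial(z))
direct <- prod(factorial(y)) / prod(factorial(z))

print(margins_kept)
print(c(exp(log_ratio), direct))
```

For counts this small the direct ratio is perfectly usable; the log form matters once the cell counts are large enough for the factorials to overflow.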

9.6 Worksheet

This week's work has been something of an aside, looking at a topic that's really beyond the scope of a basic statistics course. However, it is a topic that you may well encounter in your reading or in your own research. So, not a ton of work to do this week.

1. Make sure you have a working version of the MCMC method to compute p-values for association in contingency tables. Run this on the contingency table examples from last week and check your results with the large sample $\chi^2$ approximation, as well as the fisher.test R function.

2. Find a paper in your own research area that uses MCMC and read it.
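For the first item, the two reference p-values can be obtained along the following lines. This is only a sketch: the 2 x 3 table `x` is hypothetical and stands in for last week's examples, which are not reproduced here. `chisq.test` and `fisher.test` are the standard functions from R's stats package.

```r
# Hypothetical 2 x 3 table of counts, not data from the course.
x <- matrix(c(18,  9,  6,
               7, 12, 15), nrow = 2, byrow = TRUE)

# Large sample chi-squared approximation.
large_sample <- chisq.test(x)

# Fisher's exact test for an r x c table.
exact <- fisher.test(x)

print(c(chisq = large_sample$p.value, fisher = exact$p.value))
```

The p-value from the MCMC method, run with a reasonably large number of simulations on the same table, should be in the same region as these two values.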
