Lecture 6: Monte-Carlo methods

Miranda Holmes-Cerfon, Applied Stochastic Analysis, Spring 2015

Readings

Recommended: handout on Classes site, from notes by Weinan E, Tiejun Li, and Eric Vanden-Eijnden.

Optional:
- Sokal [1989]. A classic manuscript on Monte Carlo methods. Posted to Classes site.
- Grimmett and Stirzaker [2001], Section 6.14. A section on Markov Chain Monte Carlo.
- Madras [2002]. A short, classic set of notes on Monte Carlo methods.
- Diaconis [2009]. An introduction to MCMC methods that particularly discusses some of the theoretical tools available to analyze them.

Before we start, here is the most important thing to remember about the lecture today:

    "Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse." (Sokal [1989])

Nevertheless, there are times when all else is worse. This lecture is about those times. It is a very basic introduction to some of the fundamental ideas associated with the word "Monte-Carlo". We will consider three questions:

(1) How can we generate random variables with a particular distribution?
(2) How can we integrate functions using random variables?
(3) How can we sample from very high-dimensional distributions?

For the latter we will introduce a widely used set of tools called Markov Chain Monte Carlo.

6.1 Generating Random Variables

Suppose we can generate uniform random variables, i.e. we can produce a random variable $X \sim U([0,1])$. Although doing this is an important topic in itself, we will not say much about it, because most numerical software has libraries to do this. You can learn about common algorithms, such as the Linear Congruential Generator, in references such as Press et al. [2007].

The next question is: how can we generate a random variable Y with distribution function F(x)? We will consider two methods to do this.

6.1.1 Inverse Transformation Method

Here is the algorithm:

- Choose $X \sim U([0,1])$.
- Set $Y = F^{-1}(X)$.

Then Y has distribution function F(x).

Proof. Calculate: $P(Y \le y) = P(F^{-1}(X) \le y) = P(X \le F(y)) = F(y)$.

Intuitively, what we are doing is throwing a random variable uniformly on the vertical axis and seeing where it came from (see figure to the right). There is more chance of it landing in regions where F(y) is changing rapidly, and this is where the density is highest.

This algorithm works even when F is not strictly increasing (for example, if Y is discrete). In this case, let $F^{-1}(u) = \inf\{x : F(x) > u\}$.

Example (Exponential($\lambda$)). We have $F(y) = 1 - e^{-\lambda y}$, so $F^{-1}(x) = -\frac{1}{\lambda}\ln(1-x)$. Therefore we can generate an exponential random variable by setting $Y = -\frac{1}{\lambda}\ln X$, where $X \sim U([0,1])$. We have used the fact that $X \stackrel{d}{=} 1 - X$.

Example (Cauchy). This has density $f(x) = \frac{1}{\pi(1+x^2)}$, so $F(x) = \frac{1}{\pi}\arctan(x) + \frac{1}{2}$, and $F^{-1}(y) = \tan\left(\pi\left(y - \frac{1}{2}\right)\right)$. Therefore we set $Y = \tan\left(\pi\left(X - \frac{1}{2}\right)\right)$, where $X \sim U([0,1])$.

This is a great way of generating random variables if $F^{-1}(y)$ can be easily calculated, because no samples are wasted. However, $F^{-1}(y)$ has an analytic expression only for a small number of distributions (e.g. uniform, exponential, Cauchy, Weibull, logistic, discrete). In other cases we need another method. If the random variables are Gaussian, the Box-Muller transform is very efficient (see handout). Acceptance-Rejection works for general continuous random variables.
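For concreteness, here is a minimal Python sketch of the inverse-transform recipe for the two examples above (the code and function names are illustrative, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, size):
    """Inverse transform: Y = -(1/lam) * ln(X), with X ~ U([0,1])."""
    x = rng.random(size)
    return -np.log(x) / lam

def sample_cauchy(size):
    """Inverse transform: Y = tan(pi * (X - 1/2)), with X ~ U([0,1])."""
    x = rng.random(size)
    return np.tan(np.pi * (x - 0.5))

if __name__ == "__main__":
    y = sample_exponential(lam=2.0, size=100_000)
    print("sample mean:", y.mean(), " (exact mean 1/lambda = 0.5)")
```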

6.1.2 Acceptance-Rejection Method

Suppose X has pdf p(x) satisfying $0 \le p(x) < d$, with support on [a,b]. Here is how to generate X:

- Choose $X' \sim U([a,b])$, $Y \sim U([0,d])$.
- If $0 \le Y \le p(X')$, accept: set $X = X'$. Otherwise, reject: go back to the beginning and try again.

After the first step, $(X',Y)$ is uniformly distributed in the box $[a,b] \times [0,d]$. The probability of accepting a point at position x on the horizontal axis is $p(x)/d \propto p(x)$.

Proof. The pdf of the accepted point $(X,Y)$ is $\chi_A(x,y)$, where A is the region under the graph of p(x) and $\chi_A$ is its indicator function. The pdf of X is the marginal distribution of X:
$$\int_0^{p(x)} \chi_A(x,y)\,dy = \int_0^{p(x)} 1\,dy = p(x).$$

Notes
- This method can handle general pdfs.
- However, it can be very inefficient if p(x) is large in a few small regions and small elsewhere.
- It doesn't work if p(x) is unbounded, e.g. $p(x) \sim 1/\sqrt{x}$ near x = 0. It also doesn't handle unbounded regions, such as when the density is defined on $(-\infty, \infty)$.

A more general method works by finding a function f(x) that bounds p(x), i.e. $0 \le p(x) \le f(x)$, and generating $(X',Y)$ uniformly under the graph of f(x). The steps are:

- Let $Z = \int f(x)\,dx$.
- Choose $X'$ with density $Z^{-1} f(x)$.
- Choose $Y \sim U([0, f(X')])$.

Acceptance-Rejection (General Method). Given f(x) such that $0 \le p(x) \le f(x)$, let $F(x) = \int_{-\infty}^{x} f(x')\,dx'$, and suppose we have an analytic expression for $F^{-1}(x)$. Let $Z = \int f(x)\,dx$. Note that F(x) does not necessarily have to be a cumulative distribution function: $Z \ne 1$ in general.

- Choose $X' = F^{-1}(ZW)$, where $W \sim U([0,1])$.
- Choose $Y \sim U([0, f(X')])$.
- If $0 \le Y \le p(X')$, accept: set $X = X'$. Otherwise, reject, and try again.

Notes
This works for low-dimensional random variables. However, in high dimensions the corners of the region (where the holes are) take up proportionally more and more room, leading to LOTS of rejections, so it becomes extremely inefficient. Use MCMC (section 6.3) instead.
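As a quick illustration of the basic box version above, here is a small Python sketch (an illustrative example, not from the notes): it samples from the Beta(2,2) density $p(x) = 6x(1-x)$ on [0,1], which is bounded by d = 1.5, and reports the empirical acceptance rate.

```python
import numpy as np

rng = np.random.default_rng(1)

def accept_reject(p, a, b, d, n_samples):
    """Basic acceptance-rejection: propose (X', Y) uniformly in the box [a,b] x [0,d]
    and keep X' whenever the point falls under the graph of p."""
    samples = []
    n_proposed = 0
    while len(samples) < n_samples:
        x = rng.uniform(a, b)
        y = rng.uniform(0.0, d)
        n_proposed += 1
        if y <= p(x):
            samples.append(x)
    return np.array(samples), n_samples / n_proposed

if __name__ == "__main__":
    p = lambda x: 6.0 * x * (1.0 - x)   # Beta(2,2) density, maximum 1.5 at x = 1/2
    samples, acc = accept_reject(p, a=0.0, b=1.0, d=1.5, n_samples=50_000)
    print("sample mean:", samples.mean(), " acceptance rate:", acc)
    # the acceptance rate should be close to (area under p) / (area of box) = 1/1.5
```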

6.2 Monte-Carlo Integration

Monte-Carlo integration is a technique used to evaluate integrals, particularly in high dimensions. The idea is to choose the points at which to approximate the integral randomly, instead of on a pre-determined grid.

Let's first compare a deterministic method with a random one for a one-dimensional problem. Suppose we want to calculate the integral
$$I(f) = \int_0^1 f(x)\,dx.$$

Deterministically: suppose we have N points, with grid spacing $\Delta x = 1/N$. If we use the trapezoidal rule, then we approximate
$$I(f) \approx \sum_{i=1}^{N-1} \frac{f(x_i) + f(x_{i+1})}{2}\,\Delta x, \qquad \text{Error} \approx \frac{(\Delta x)^2}{12}\,f''(\xi) \le \frac{C}{N^2}.$$

Randomly: suppose we choose N random points $X_1, X_2, \dots, X_N$ uniformly on [0,1]. Then we approximate
$$I(f) \approx I_N(f) = \frac{1}{N}\sum_{i=1}^{N} f(X_i).$$
We know by the Law of Large Numbers that this converges a.s. to I(f). How quickly does it converge? The error has variance
$$E\,(I_N(f) - I(f))^2 = \frac{1}{N^2}\sum_{i,j=1}^{N} E\,(f(X_i) - I)(f(X_j) - I) = \frac{1}{N}\,\mathrm{Var}(f), \qquad (1)$$
where $\mathrm{Var}(f) = \int f^2 - \left(\int f\right)^2$. Therefore the error is roughly
$$|I_N - I| \approx \frac{1}{\sqrt{N}}\sqrt{\mathrm{Var}(f)}.$$
This converges extremely slowly to the true answer; the deterministic method is much better in this case.

What happens if we increase the dimension d?

Deterministically: if we choose the points to be evenly spaced, then $\Delta x \approx N^{-1/d}$ (if there are k points per dimension with spacing $\Delta x$, then $N = k^d = (1/\Delta x)^d$). Using a second-order method we would obtain Error $\le C(\Delta x)^2 = C\,N^{-2/d}$. Even for moderate d, e.g. d = 8, we get Error $\le C\,N^{-1/4}$. This is terrible convergence!

Randomly: choosing the points to be uniformly distributed in the domain, and repeating the calculation (1), shows that
$$\text{Error} \approx \frac{\sqrt{\mathrm{Var}(f)}}{N^{1/2}} = C\,N^{-1/2}.$$
This holds no matter what the dimension.

Therefore, a Monte-Carlo integration method is expected to be better than the trapezoidal rule for d > 4. Of course, there are better methods than the trapezoidal rule, but for high-dimensional problems Monte-Carlo always wins. Note that the asymptotic order of convergence of a deterministic method may not matter if the number of grid points required to achieve it is too high. For example, we expect a fourth-order integration method (such as Simpson's rule) to be better than MC up to dimension d = 8. However, if we choose even a modest 10 points per dimension, we need 100 million points, so obtaining a small error rapidly becomes impractical.
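A quick numerical check of the $N^{-1/2}$ behaviour, as a small Python sketch (the integrand and dimension are arbitrary choices for illustration, not from the notes): it estimates $\int_{[0,1]^d} \|x\|^2\,dx = d/3$ by plain Monte-Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_integrate(f, d, n):
    """Plain Monte-Carlo estimate of the integral of f over the unit cube [0,1]^d."""
    x = rng.random((n, d))               # n uniform points in [0,1]^d
    return f(x).mean()

if __name__ == "__main__":
    d = 8
    f = lambda x: (x**2).sum(axis=1)     # its integral over [0,1]^d is exactly d/3
    exact = d / 3.0
    for n in [10**3, 10**4, 10**5, 10**6]:
        est = mc_integrate(f, d, n)
        print(f"N = {n:>7d}   estimate = {est:.5f}   |error| = {abs(est - exact):.2e}")
    # the error should shrink roughly like 1/sqrt(N), independently of d
```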

6.2.1 Importance sampling

Importance sampling is one of a number of variance-reduction techniques. The idea is to reduce the prefactor in the error of MC integration, without changing N: instead of sampling X uniformly, which wastes points in regions that contribute little to the integral, choose points where the contribution is likely to be large, and account for this by weighting the integral.

Algorithm to calculate $\int_D f(x)\,dx$, where $D \subset \mathbb{R}^d$:

- Choose points $X_i$ with density p(x) in D.
- Calculate
$$I_N(p) = \frac{1}{N}\sum_{i=1}^{N} \frac{f(X_i)}{p(X_i)}.$$

This works because
$$\int f(x)\,dx = \int \frac{f(x)}{p(x)}\,p(x)\,dx = E_p\!\left(\frac{f}{p}\right) = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p(X_i)},$$
so the mean value of the estimator is as expected. Let's estimate the error variance:
$$E\,(I_N(p) - I)^2 = \frac{1}{N}\,\mathrm{Var}_p\!\left(\frac{f}{p}\right) = \frac{1}{N}\left(\int \frac{f^2}{p}\,dx - \left(\int f\,dx\right)^2\right).$$
If we choose $p(x) = Z^{-1} f(x)$, with $Z = \int f\,dx$, then the variance above is 0 and $I_N(p) = I(f)$: there is no error!

But Z is what we are trying to calculate in the first place, so if we could do this, we would already know the answer. Therefore, we should choose p(x) to match f(x) as closely as possible, using an educated guess about how to do this, so as to reduce the overall variance of the answer. Importance sampling is particularly effective, indeed often necessary, for rare event sampling. The homework contains an example of this.
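Here is a small Python sketch of the idea on a rare-event example (the example and sampling density are illustrative choices, not from the notes): it estimates $P(Z > 4)$ for $Z \sim N(0,1)$, once by plain MC and once by sampling from the shifted density $p = N(4,1)$ and reweighting by $f/p$.

```python
import numpy as np

rng = np.random.default_rng(3)

def plain_mc(n):
    """Plain Monte-Carlo estimate of P(Z > 4), Z ~ N(0,1): almost every sample is wasted."""
    z = rng.standard_normal(n)
    return (z > 4.0).mean()

def importance_sampling(n):
    """Sample X ~ N(4,1) and reweight: for f(x) = phi(x) * 1_{x>4} and p(x) = phi(x - 4),
    the weight f/p simplifies to exp(8 - 4x) on {x > 4}."""
    x = rng.standard_normal(n) + 4.0
    weights = np.exp(8.0 - 4.0 * x) * (x > 4.0)
    return weights.mean()

if __name__ == "__main__":
    n = 100_000
    print("plain MC:            ", plain_mc(n))
    print("importance sampling: ", importance_sampling(n))
    print("exact value (approx):", 3.1671e-5)
```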

6.3 Markov Chain Monte Carlo (MCMC)

MCMC is a collection of techniques to sample a probability distribution π in a (usually) very high-dimensional space, by constructing a Markov chain that has π as its stationary distribution. Typically these techniques work even when we only know a function g(x) that is proportional to the stationary distribution, $g(x) \propto \pi(x)$, with no convenient way to get the normalization factor. A very common algorithm is Metropolis-Hastings.

Examples

(1) (Ising model; see Lecture 2 for a description.) The energy of a particular configuration of spins $\sigma \in \{\pm 1\}^N$ is
$$H(\sigma) = -\sum_{\langle i,j\rangle} \sigma_i \sigma_j,$$
where $\langle i,j\rangle$ indicates that nodes i, j are neighbours. The energy is lower when neighbouring spins are the same. The stationary distribution is the Gibbs measure / Boltzmann distribution:
$$\pi(\sigma) = Z^{-1} e^{-\beta H(\sigma)}.$$
Here Z is a normalization constant, which is almost never known, and β is a parameter representing the inverse temperature. In statistical mechanics we would set $\beta = (k_B T)^{-1}$, where $k_B$ is Boltzmann's constant and T is the temperature.

For large β (low temperature), π(σ) is bimodal: the system is either mostly (+1) or mostly (-1). This means the system is magnetized. For small β (high temperature), π(σ) gives most weight to configurations with nearly equal numbers of +1 and -1 spins, so the system is disordered and loses its magnetization. One question of interest is: at which temperature ($\beta^{-1}$) does this transition occur? We can answer this by calculating $\langle |M| \rangle_\pi$, the average with respect to π of the absolute value of the magnetization
$$M = \frac{1}{N}\sum_{i=1}^{N} \sigma_i.$$
Here $\langle f(x)\rangle_\pi = \sum_x \pi_x f(x)$. For this, we need to sample from π, calculate representative values of |M|, and then average these values. As β increases, $\langle |M|\rangle_\pi$ should undergo a sharp transition from 0 to 1.

(2) (Particles interacting with a pairwise potential.) A very common model in chemistry and other areas that consider systems of interacting components (e.g. protein folding, materials science, etc.) is to suppose there is a collection of point particles that interact with a pairwise potential V(r), where r is the distance between the pair. The total energy of a system of n particles is the sum over all the pairwise interactions:
$$U(x) = \sum_{i < j} V(|x_i - x_j|),$$
where $x = (x_1, x_2, \dots, x_n)$ is the 3n-dimensional vector of particle positions. The stationary distribution is again the Boltzmann distribution $\pi(x) = Z^{-1} e^{-\beta U(x)}$, where again β is the inverse temperature, and we almost never know the normalization constant Z. Depending on β, the system could prefer to be in a number of different states, such as a solid, crystal, liquid, gas, or other phase. To calculate the phase diagram we must sample π(x).
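To fix notation for the Ising example, here is a small Python helper (illustrative, not from the notes) that evaluates the energy H(σ) and the magnetization M of a configuration on an n-by-n lattice with periodic boundary conditions; a sampler that actually draws from π(σ) is sketched later in this section.

```python
import numpy as np

def ising_energy(spins):
    """H(sigma) = -sum over nearest-neighbour pairs of sigma_i * sigma_j,
    on an n x n lattice with periodic boundary conditions (each pair counted once)."""
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -np.sum(spins * right + spins * down)

def magnetization(spins):
    """M = (1/N) * sum of all spins, where N is the number of sites."""
    return spins.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    spins = rng.choice([-1, 1], size=(10, 10))   # a random configuration
    print("H =", ising_energy(spins), " M =", magnetization(spins))
```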

There are several difficulties in sampling π for these, and many related, examples:

- The size of the state space is often HUGE. The Ising model on an n-by-n lattice has $2^{n^2}$ elements in its state space; even for small n, say n = 10, we have over $10^{30}$ configurations, so there is no way we can list them all. Even a small system of 100 particles lives in 300-dimensional space; there is no way we are going to adequately sample every region in this space.
- Often we don't know the normalization constant for π, only a function it is proportional to. The Boltzmann distribution $Z^{-1}e^{-\beta U(x)}$ arises frequently: we usually know the potential energy U(x) but can almost never calculate Z.
- We may not know the true dynamics that give rise to π, or these may have widely separated time scales and so cannot be efficiently simulated. Ising model: what are the true dynamics of a magnet? For this we would need quantum mechanics. Particles in a box: these should also interact with a solvent (such as water), but we can't simulate all the necessary atoms.
- We can't use the previous methods to generate independent samples from π when the dimensionality becomes very high.

Instead, what we can do is invent an artificial dynamics that has π as its stationary distribution, and then simulate these dynamics instead.

Metropolis-Hastings Algorithm. Given a set of n nodes and a stationary distribution π, the algorithm samples it as follows:

- If you are in state i, choose a state j from a proposal probability distribution encoded in a transition matrix $H = (h_{ij})$. Here $h_{ij} = P(\text{consider } j \text{ next} \mid \text{currently in state } i)$.
- Move to state j with probability $a_{ij}$. Otherwise, remain in state i.

The induced Markov chain has transition probabilities
$$p_{ij} = \begin{cases} h_{ij}\,a_{ij} & (i \ne j) \\ 1 - \sum_{j \ne i} h_{ij}\,a_{ij} & (i = j). \end{cases}$$
The acceptance matrix $A = (a_{ij})$ is typically chosen to satisfy detailed balance:
$$\pi_i h_{ij} a_{ij} = \pi_j h_{ji} a_{ji}.$$
Note that we can ensure this condition holds even if we don't know the normalization constant Z! A common choice is
$$a_{ij} = \min\!\left(1, \frac{\pi_j h_{ji}}{\pi_i h_{ij}}\right).$$
This is the Hastings algorithm. (The Metropolis algorithm refers to the case when H is symmetric, i.e. $h_{ij} = h_{ji}$, so that $a_{ij} = \min(1, \pi_j/\pi_i)$.) We showed on HW2 that the induced Markov chain satisfies detailed balance, and has π as its stationary distribution.

Notes
- This algorithm works equally well in a continuous state space, provided we interpret h as a density: $h(y|x)$ is the density of jumping to y, given that we start at x. The acceptance ratio is
$$a(y|x) = \min\!\left(1, \frac{\pi(y)\,h(x|y)}{\pi(x)\,h(y|x)}\right).$$
- A general form of the acceptance matrix is $a_{ij} = F\!\left(\frac{\pi_j h_{ji}}{\pi_i h_{ij}}\right)$, where $F : [0,\infty] \to [0,1]$ is any function satisfying $F(z) = z\,F(1/z)$.
- The proposal matrix can be absolutely anything, provided $h_{ij} > 0 \iff h_{ji} > 0$ (so the chain can satisfy detailed balance), and provided the chain described by H is ergodic.
- Choosing a good proposal matrix, however, is an art. Proposing many far-away moves is good, because it lets you move around state space quickly. Proposing too many of these, however, leads to many rejected moves, which slows convergence. A rule of thumb used to be that you should choose your proposal matrix so that an average of roughly 50% of the proposals are rejected; a recent paper by Andrew Gelman at Columbia argued that 27% is a better number for some processes. What does this mean? No one really knows for sure, and you should choose parameters that seem to work well for the problem at hand.
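As a concrete instance of the continuous-state version, here is a small Python sketch of random-walk Metropolis (the target density is an arbitrary illustration, not from the notes): the Gaussian proposal is symmetric, so the acceptance probability reduces to $\min(1, g(y)/g(x))$ for an unnormalized target $g \propto \pi$.

```python
import numpy as np

rng = np.random.default_rng(5)

def metropolis(log_g, x0, n_steps, step_size=0.5):
    """Random-walk Metropolis with a symmetric Gaussian proposal.
    log_g is the log of an unnormalized target density; the normalization cancels."""
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + step_size * rng.standard_normal()
        # accept with probability min(1, g(y)/g(x)) = min(1, exp(log_g(y) - log_g(x)))
        if np.log(rng.random()) < log_g(y) - log_g(x):
            x = y
        chain[t] = x
    return chain

if __name__ == "__main__":
    # unnormalized bimodal target: g(x) = exp(-(x^2 - 4)^2 / 4), peaks near x = -2 and x = 2
    log_g = lambda x: -((x**2 - 4.0)**2) / 4.0
    chain = metropolis(log_g, x0=0.0, n_steps=200_000)
    burn_in = 10_000                     # discard the initial transient
    print("mean of |x| under pi:", np.abs(chain[burn_in:]).mean())
```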

Here are some example proposal matrices:

(1) (Ising model)
- pick a spin, flip it;
- pick a pair of spins, exchange their values;
- pick a spin or a cluster of spins, and change the value(s) depending on the surrounding environment, e.g. set the value to be the one that minimizes the energy in the spin's local neighbourhood.

(2) (Particles)
- pick a single particle, move it by some random amount;
- move all particles in the direction of the gradient of the potential, plus some random amount.

To calculate averages and distributions, e.g. $\langle f(x)\rangle_\pi$, one typically discards the initial transient steps. This is because it takes time for the chain to forget its initial state and reach equilibrium. Theoretically, we can bound the time it takes to reach equilibrium in terms of $\lambda_2$, the second-largest eigenvalue (in absolute value) of the transition matrix: the equilibration time scales like $1/(1 - |\lambda_2|)$. In practice we almost never know $\lambda_2$, so one must determine empirically whether the chain has converged. (ELFS: can you think of ways to do this?)

The total effective number of samples is less than the actual number of points generated, even when equilibrium is reached, because the points are correlated. There is a nice formula for the effective number of points in terms of the covariance function of the Markov process. We will consider this on the homework.
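Putting the pieces together for the Ising example, here is a small Python sketch of single-spin-flip Metropolis (the lattice size, β values, and run lengths are illustrative choices, not from the notes): the proposal picks a random spin and flips it, which is symmetric, so a flip is accepted with probability $\min(1, e^{-\beta\,\Delta H})$, and $\langle |M| \rangle_\pi$ is estimated after discarding a transient.

```python
import numpy as np

rng = np.random.default_rng(6)

def ising_metropolis(n, beta, n_sweeps, burn_in):
    """Single-spin-flip Metropolis for the Ising model on an n x n periodic lattice.
    Returns an estimate of <|M|> under the Boltzmann distribution at inverse temperature beta."""
    spins = rng.choice([-1, 1], size=(n, n))
    abs_m = []
    for sweep in range(n_sweeps):
        for _ in range(n * n):                        # one sweep = n^2 attempted flips
            i, j = rng.integers(n, size=2)
            # sum of the four neighbouring spins (periodic boundary conditions)
            nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
                  + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
            dH = 2.0 * spins[i, j] * nb               # energy change if spin (i, j) is flipped
            if np.log(rng.random()) < -beta * dH:     # accept with probability min(1, exp(-beta*dH))
                spins[i, j] *= -1
        if sweep >= burn_in:                          # discard the initial transient sweeps
            abs_m.append(abs(spins.mean()))
    return np.mean(abs_m)

if __name__ == "__main__":
    for beta in [0.2, 0.44, 0.7]:                     # below, near, and above the transition
        m = ising_metropolis(n=16, beta=beta, n_sweeps=2000, burn_in=500)
        print(f"beta = {beta:.2f}   <|M|> = {m:.3f}")
```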

6.4 Monte-Carlo methods in optimization

[Figure: a rugged, non-convex energy landscape U(x) with many local minima.]

Suppose you have a non-convex, possibly very rugged, function U(x), e.g. as shown above. How can you find the global minimum? Deterministic methods (e.g. steepest descent, Gauss-Newton, Levenberg-Marquardt, BFGS, etc.) are very good at finding a local minimum. To find a global minimum, or one that is close to optimal, one must typically search the landscape stochastically.

One way to do this is to create a stationary distribution π that puts high probability on the lowest-energy parts of the landscape. A common choice is the Boltzmann distribution $Z^{-1} e^{-\beta U(x)}$ for some inverse temperature β. Then one constructs a Markov chain to sample this stationary distribution, and keeps track of the smallest value the chain has seen.

The result will be sensitive to the value of β. How should we choose it? If β is large, the global minimum will be the most likely place to be in equilibrium, but it will take a very long time to reach equilibrium. If β is small, the chain moves about on the landscape much more quickly, but doesn't tend to spend as much time in the low-energy parts of the space. One possibility is to cycle β periodically: to alternate between high-temperature, fast dynamics, and low-temperature dynamics that tend to find a low minimum and stay there.

Another possibility is Simulated Annealing. This is a technique that slowly increases β, for example as $\beta = \log t$ or $\beta = (1.001)^t$, where t is the number of steps. As $t \to \infty$, it can be shown that the stationary distribution converges to a delta-function at the global minimum. In practice it takes exponentially long to do so, but this method still gives good results for many problems.

Example (Lennard-Jones clusters). A Lennard-Jones cluster is a set of n points interacting with a pairwise potential
$$U(r) = \varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]$$
for some parameters σ, ε. This is a model for many atoms, molecules, or other interacting components with reasonably short-range interactions. The energy landscape is very rugged, with a great many local minima. To explore the landscape, and/or to find the lowest minima, one method is to construct a Markov chain on the set of local minima. This works as follows: at each step in the chain, perturb the positions of the points $x_1, x_2, \dots, x_n$ by some random amount; then find the nearest local minimum using a deterministic method; then check the energy of this local minimum, and either accept it or reject it according to the Hastings criterion. This method can also be used to solve packing problems, where now the energy is a function of the density, or of another quantity to be optimized.
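Here is a small Python sketch of simulated annealing on a rugged one-dimensional function (the test function, schedule, and parameters are illustrative choices, not from the notes): Metropolis moves on $e^{-\beta U(x)}$ while β grows geometrically, keeping track of the best point seen.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulated_annealing(U, x0, n_steps, beta0=0.1, growth=1.001, step_size=0.5):
    """Random-walk Metropolis on exp(-beta * U(x)) while beta grows like beta0 * growth**t.
    Returns the best point and best value encountered along the way."""
    x = best_x = x0
    best_U = U(x0)
    beta = beta0
    for t in range(n_steps):
        y = x + step_size * rng.standard_normal()
        if np.log(rng.random()) < -beta * (U(y) - U(x)):   # Metropolis acceptance
            x = y
            if U(x) < best_U:
                best_x, best_U = x, U(x)
        beta *= growth                                      # slowly lower the temperature
    return best_x, best_U

if __name__ == "__main__":
    # a rugged test function: quadratic envelope plus oscillations, global minimum near x = 2
    U = lambda x: 0.5 * (x - 2.0)**2 + 2.0 * np.sin(5.0 * x)
    x_best, U_best = simulated_annealing(U, x0=-5.0, n_steps=50_000)
    print("best x:", x_best, "  U(best x):", U_best)
```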

Example (Cryptography). Another use of Monte-Carlo methods in optimization is in cryptography (Diaconis [2009]). A cipher is a function $\varphi : S \to A$, where S is a set of symbols (e.g. a permutation of the alphabet, the Greek letters, a collection of squiggles, etc.), and $A = \{a, b, c, \dots, x, y, z\}$ is the set of letters. A code is a string of symbols $x_1 x_2 x_3 \dots x_n$, where $x_i \in S$. Here is an example of a code, used by inmates in a prison in California (this and the next figure are from Diaconis [2009]):

[Figure: a coded message written in the prisoners' symbols.]

If one knows the language the code is written in, then one can obtain the distribution of letter frequencies $f_1(a_i)$, where $a_i \in A$. One can then construct a plausibility function
$$L_1(x_1, x_2, \dots, x_n; \varphi) = \prod_{i=1}^{n} f_1(\varphi(x_i)).$$
To decipher a particular code, one can use a Monte-Carlo method to maximize $L_1$ over all ciphers φ. This method was actually used to decode the prisoners' messages. It didn't work initially. However, when the plausibility function was updated to include information about the frequencies of pairs of letters $f_2(a_i, a_j)$, as
$$L_2(x_1, x_2, \dots, x_n; \varphi) = \prod_{i=1}^{n-1} f_2(\varphi(x_i), \varphi(x_{i+1})),$$
then it did work, and the researchers learned about daily life in prison (in a mixture of English, Spanish, and prison slang):

[Figure: the decoded message.]
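The search over ciphers can itself be run as Metropolis on the set of assignments of letters to symbols. Below is a small Python sketch of the idea (a simplification of my own; the message, symbol set, letter set, and bigram table log_f2 are hypothetical inputs the user supplies, and log_f2 must contain an entry for every letter pair): the proposal swaps the letters assigned to two symbols, which is symmetric, so moves are accepted according to the change in log-plausibility.

```python
import numpy as np

rng = np.random.default_rng(8)

def log_plausibility(message, cipher, log_f2):
    """log L_2 = sum of log f_2(phi(x_i), phi(x_{i+1})) over consecutive symbols."""
    decoded = [cipher[s] for s in message]
    return sum(log_f2[(a, b)] for a, b in zip(decoded[:-1], decoded[1:]))

def mcmc_decipher(message, symbols, letters, log_f2, n_steps=20_000):
    """Metropolis over ciphers: propose swapping the letters assigned to two symbols
    (a symmetric proposal) and accept with probability min(1, L2_new / L2_old)."""
    cipher = dict(zip(symbols, rng.permutation(list(letters))))
    best, logL = dict(cipher), log_plausibility(message, cipher, log_f2)
    best_logL = logL
    for _ in range(n_steps):
        s1, s2 = rng.choice(list(symbols), size=2, replace=False)
        cipher[s1], cipher[s2] = cipher[s2], cipher[s1]        # propose a swap
        new_logL = log_plausibility(message, cipher, log_f2)
        if np.log(rng.random()) < new_logL - logL:             # accept
            logL = new_logL
            if logL > best_logL:
                best, best_logL = dict(cipher), logL
        else:                                                  # reject: undo the swap
            cipher[s1], cipher[s2] = cipher[s2], cipher[s1]
    return best
```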

References

P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46:179-205, 2009.

G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.

Neal Madras. Lectures on Monte Carlo Methods. American Mathematical Society, 2002.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes. Cambridge University Press, 3rd edition, 2007.

Alan D. Sokal. Monte Carlo methods in statistical mechanics: foundations and new algorithms. In Cours de Troisième Cycle de la Physique en Suisse Romande, Lausanne, 1989.