Lecture 6: Monte-Carlo methods

Miranda Holmes-Cerfon, Applied Stochastic Analysis, Spring 2015

Readings

Recommended: handout on Classes site, from notes by Weinan E, Tiejun Li, and Eric Vanden-Eijnden.

Optional:
- Sokal [1989]. A classic manuscript on Monte Carlo methods. Posted to Classes site.
- Grimmett and Stirzaker [2001], Section 6.14. A section on Markov Chain Monte Carlo.
- Madras [2002]. A short, classic set of notes on Monte Carlo methods.
- Diaconis [2009]. An introduction to MCMC methods that particularly discusses some of the theoretical tools available to analyze them.

Before we start, here is the most important thing to remember about the lecture today:

    "Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse." (Sokal [1989])

Nevertheless, there are times when all else is worse. This lecture is about those times. It is a very basic introduction to some of the fundamental ideas associated with the word "Monte-Carlo". We will consider three questions:

(1) How can we generate random variables with a particular distribution?
(2) How can we integrate functions using random variables?
(3) How can we sample from very high-dimensional distributions?

For the latter we will introduce a widely used set of tools called Markov Chain Monte Carlo.

6.1 Generating Random Variables

Suppose we can generate uniform random variables, i.e. we can produce a random variable $X \sim U([0,1])$. Although doing this is an important topic in itself, we will not say much about it, because most numerical software has libraries to do this. You can learn about common algorithms, such as the Linear Congruential Generator, in references such as Press et al. [2007].

The next question is: how can we generate a random variable Y with distribution function F(x)? We will consider two methods to do this.

6.1.1 Inverse Transformation Method

Here is the algorithm:

- Choose $X \sim U([0,1])$.
- Set $Y = F^{-1}(X)$.

Then Y has distribution function F(x).

Proof. Calculate: $P(Y \le y) = P(F^{-1}(X) \le y) = P(X \le F(y)) = F(y)$.

Intuitively, what we are doing is throwing a random variable uniformly on the vertical axis and seeing where it came from (see figure to the right). There is more chance of it landing in regions where F(y) is changing rapidly, and this is where the density is highest.

This algorithm works even when F is not strictly increasing (for example, if Y is discrete). In this case, let $F^{-1}(u) = \inf\{x : F(x) > u\}$.

Example (Exponential($\lambda$)). We have $F(y) = 1 - e^{-\lambda y}$, so $F^{-1}(x) = -\frac{1}{\lambda}\ln(1-x)$. Therefore we can generate an exponential random variable by setting $Y = -\frac{1}{\lambda}\ln X$, where $X \sim U([0,1])$. We have used the fact that $X \stackrel{d}{=} 1 - X$.

Example (Cauchy). This has density $f(x) = \frac{1}{\pi(1+x^2)}$, so $F(x) = \frac{1}{\pi}\arctan(x) + \frac{1}{2}$, and $F^{-1}(y) = \tan\left(\pi\left(y - \frac{1}{2}\right)\right)$. Therefore we set $Y = \tan\left(\pi\left(X - \frac{1}{2}\right)\right)$, where $X \sim U([0,1])$.

This is a great way of generating random variables if $F^{-1}(y)$ can be easily calculated, because no samples are wasted. However, $F^{-1}(y)$ has an analytic expression only for a small number of distributions (e.g. uniform, exponential, Cauchy, Weibull, logistic, discrete). In other cases we need another method. If the random variables are Gaussian, the Box-Muller transform is very efficient (see handout). Acceptance-Rejection works for general continuous random variables.
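For concreteness, here is a minimal Python sketch of the inverse-transform recipe for the two examples above (the code and function names are illustrative, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, size):
    """Inverse transform: Y = -(1/lam) * ln(X), with X ~ U([0,1])."""
    x = rng.random(size)
    return -np.log(x) / lam

def sample_cauchy(size):
    """Inverse transform: Y = tan(pi * (X - 1/2)), with X ~ U([0,1])."""
    x = rng.random(size)
    return np.tan(np.pi * (x - 0.5))

if __name__ == "__main__":
    y = sample_exponential(lam=2.0, size=100_000)
    print("sample mean:", y.mean(), " (exact mean 1/lambda = 0.5)")
```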

6.1.2 Acceptance-Rejection Method

Suppose X has pdf p(x) satisfying $0 \le p(x) < d$, with support on [a,b]. Here is how to generate X:

- Choose $X' \sim U([a,b])$, $Y \sim U([0,d])$.
- If $0 \le Y \le p(X')$, accept: set $X = X'$. Otherwise, reject: go back to the beginning and try again.

After the first step, $(X',Y)$ is uniformly distributed in the box $[a,b] \times [0,d]$. The probability of accepting a point at position x on the horizontal axis is $p(x)/d \propto p(x)$.

Proof. The pdf of the accepted point $(X,Y)$ is $\chi_A(x,y)$, where A is the region under the graph of p(x) and $\chi_A$ is its indicator function. The pdf of X is the marginal distribution of X:
$$\int_0^{p(x)} \chi_A(x,y)\,dy = \int_0^{p(x)} 1\,dy = p(x).$$

Notes
- This method can handle general pdfs.
- However, it can be very inefficient if p(x) is large in a few small regions and small elsewhere.
- It doesn't work if p(x) is unbounded, e.g. $p(x) \sim 1/\sqrt{x}$ near x = 0. It also doesn't handle unbounded regions, such as when the density is defined on $(-\infty, \infty)$.

A more general method works by finding a function f(x) that bounds p(x), i.e. $0 \le p(x) \le f(x)$, and generating $(X',Y)$ uniformly under the graph of f(x). The steps are:

- Let $Z = \int f(x)\,dx$.
- Choose $X'$ with density $Z^{-1} f(x)$.
- Choose $Y \sim U([0, f(X')])$.

Acceptance-Rejection (General Method). Given f(x) such that $0 \le p(x) \le f(x)$, let $F(x) = \int_{-\infty}^{x} f(x')\,dx'$, and suppose we have an analytic expression for $F^{-1}(x)$. Let $Z = \int f(x)\,dx$. Note that F(x) does not necessarily have to be a cumulative distribution function: $Z \ne 1$ in general.

- Choose $X' = F^{-1}(ZW)$, where $W \sim U([0,1])$.
- Choose $Y \sim U([0, f(X')])$.
- If $0 \le Y \le p(X')$, accept: set $X = X'$. Otherwise, reject, and try again.

Notes
This works for low-dimensional random variables. However, in high dimensions the corners of the region (where the holes are) take up proportionally more and more room, leading to LOTS of rejections, so it becomes extremely inefficient. Use MCMC (section 6.3) instead.
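As a quick illustration of the basic box version above, here is a small Python sketch (an illustrative example, not from the notes): it samples from the Beta(2,2) density $p(x) = 6x(1-x)$ on [0,1], which is bounded by d = 1.5, and reports the empirical acceptance rate.

```python
import numpy as np

rng = np.random.default_rng(1)

def accept_reject(p, a, b, d, n_samples):
    """Basic acceptance-rejection: propose (X', Y) uniformly in the box [a,b] x [0,d]
    and keep X' whenever the point falls under the graph of p."""
    samples = []
    n_proposed = 0
    while len(samples) < n_samples:
        x = rng.uniform(a, b)
        y = rng.uniform(0.0, d)
        n_proposed += 1
        if y <= p(x):
            samples.append(x)
    return np.array(samples), n_samples / n_proposed

if __name__ == "__main__":
    p = lambda x: 6.0 * x * (1.0 - x)   # Beta(2,2) density, maximum 1.5 at x = 1/2
    samples, acc = accept_reject(p, a=0.0, b=1.0, d=1.5, n_samples=50_000)
    print("sample mean:", samples.mean(), " acceptance rate:", acc)
    # the acceptance rate should be close to (area under p) / (area of box) = 1/1.5
```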

6.2 Monte-Carlo Integration

Monte-Carlo integration is a technique used to evaluate integrals, particularly in high dimensions. The idea is to choose the points at which to approximate the integral randomly, instead of on a pre-determined grid.

Let's first compare a deterministic method with a random one for a one-dimensional problem. Suppose we want to calculate the integral
$$I(f) = \int_0^1 f(x)\,dx.$$

Deterministically: suppose we have N points, with grid spacing $\Delta x = 1/N$. If we use the trapezoidal rule, then we approximate
$$I(f) \approx \sum_{i=1}^{N-1} \frac{f(x_i) + f(x_{i+1})}{2}\,\Delta x, \qquad \text{Error} \approx \frac{(\Delta x)^2}{12}\,f''(\xi) \le \frac{C}{N^2}.$$

Randomly: suppose we choose N random points $X_1, X_2, \dots, X_N$ uniformly on [0,1]. Then we approximate
$$I(f) \approx I_N(f) = \frac{1}{N}\sum_{i=1}^{N} f(X_i).$$
We know by the Law of Large Numbers that this converges a.s. to I(f). How quickly does it converge? The error has variance
$$E\,(I_N(f) - I(f))^2 = \frac{1}{N^2}\sum_{i,j=1}^{N} E\,(f(X_i) - I)(f(X_j) - I) = \frac{1}{N}\,\mathrm{Var}(f), \qquad (1)$$
where $\mathrm{Var}(f) = \int f^2 - \left(\int f\right)^2$. Therefore the error is roughly
$$|I_N - I| \approx \frac{1}{\sqrt{N}}\sqrt{\mathrm{Var}(f)}.$$
This converges extremely slowly to the true answer; the deterministic method is much better in this case.

What happens if we increase the dimension d?

Deterministically: if we choose the points to be evenly spaced, then $\Delta x \approx N^{-1/d}$ (if there are k points per dimension with spacing $\Delta x$, then $N = k^d = (1/\Delta x)^d$). Using a second-order method we would obtain Error $\le C(\Delta x)^2 = C\,N^{-2/d}$. Even for moderate d, e.g. d = 8, we get Error $\le C\,N^{-1/4}$. This is terrible convergence!

Randomly: choosing the points to be uniformly distributed in the domain, and repeating the calculation (1), shows that
$$\text{Error} \approx \frac{\sqrt{\mathrm{Var}(f)}}{N^{1/2}} = C\,N^{-1/2}.$$
This holds no matter what the dimension.

Therefore, a Monte-Carlo integration method is expected to be better than the trapezoidal rule for d > 4. Of course, there are better methods than the trapezoidal rule, but for high-dimensional problems Monte-Carlo always wins. Note that the asymptotic order of convergence of a deterministic method may not matter if the number of grid points required to achieve it is too high. For example, we expect a fourth-order integration method (such as Simpson's rule) to be better than MC up to dimension d = 8. However, if we choose even a modest 10 points per dimension, we need 100 million points, so obtaining a small error rapidly becomes impractical.
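A quick numerical check of the $N^{-1/2}$ behaviour, as a small Python sketch (the integrand and dimension are arbitrary choices for illustration, not from the notes): it estimates $\int_{[0,1]^d} \|x\|^2\,dx = d/3$ by plain Monte-Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_integrate(f, d, n):
    """Plain Monte-Carlo estimate of the integral of f over the unit cube [0,1]^d."""
    x = rng.random((n, d))               # n uniform points in [0,1]^d
    return f(x).mean()

if __name__ == "__main__":
    d = 8
    f = lambda x: (x**2).sum(axis=1)     # its integral over [0,1]^d is exactly d/3
    exact = d / 3.0
    for n in [10**3, 10**4, 10**5, 10**6]:
        est = mc_integrate(f, d, n)
        print(f"N = {n:>7d}   estimate = {est:.5f}   |error| = {abs(est - exact):.2e}")
    # the error should shrink roughly like 1/sqrt(N), independently of d
```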

6.2.1 Importance sampling

Importance sampling is one of a number of variance-reduction techniques. The idea is to reduce the prefactor in the error of MC integration, without changing N: instead of sampling X uniformly, which wastes points in regions that contribute little to the integral, choose points where the contribution is likely to be large, and account for this by weighting the integral.

Algorithm to calculate $\int_D f(x)\,dx$, where $D \subset \mathbb{R}^d$:

- Choose points $X_i$ with density p(x) in D.
- Calculate
$$I_N(p) = \frac{1}{N}\sum_{i=1}^{N} \frac{f(X_i)}{p(X_i)}.$$

This works because
$$\int f(x)\,dx = \int \frac{f(x)}{p(x)}\,p(x)\,dx = E_p\!\left(\frac{f}{p}\right) = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p(X_i)},$$
so the mean value of the estimator is as expected. Let's estimate the error variance:
$$E\,(I_N(p) - I)^2 = \frac{1}{N}\,\mathrm{Var}_p\!\left(\frac{f}{p}\right) = \frac{1}{N}\left(\int \frac{f^2}{p}\,dx - \left(\int f\,dx\right)^2\right).$$
If we choose $p(x) = Z^{-1} f(x)$, with $Z = \int f\,dx$, then the variance above is 0 and $I_N(p) = I(f)$: there is no error!

But Z is what we are trying to calculate in the first place, so if we could do this, we would already know the answer. Therefore, we should choose p(x) to match f(x) as closely as possible, using an educated guess about how to do this, so as to reduce the overall variance of the answer. Importance sampling is particularly effective, indeed often necessary, for rare event sampling. The homework contains an example of this.
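Here is a small Python sketch of the idea on a rare-event example (the example and sampling density are illustrative choices, not from the notes): it estimates $P(Z > 4)$ for $Z \sim N(0,1)$, once by plain MC and once by sampling from the shifted density $p = N(4,1)$ and reweighting by $f/p$.

```python
import numpy as np

rng = np.random.default_rng(3)

def plain_mc(n):
    """Plain Monte-Carlo estimate of P(Z > 4), Z ~ N(0,1): almost every sample is wasted."""
    z = rng.standard_normal(n)
    return (z > 4.0).mean()

def importance_sampling(n):
    """Sample X ~ N(4,1) and reweight: for f(x) = phi(x) * 1_{x>4} and p(x) = phi(x - 4),
    the weight f/p simplifies to exp(8 - 4x) on {x > 4}."""
    x = rng.standard_normal(n) + 4.0
    weights = np.exp(8.0 - 4.0 * x) * (x > 4.0)
    return weights.mean()

if __name__ == "__main__":
    n = 100_000
    print("plain MC:            ", plain_mc(n))
    print("importance sampling: ", importance_sampling(n))
    print("exact value (approx):", 3.1671e-5)
```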

6.3 Markov Chain Monte Carlo (MCMC)

MCMC is a collection of techniques to sample a probability distribution π in a (usually) very high-dimensional space, by constructing a Markov chain that has π as its stationary distribution. Typically these techniques work even when we only know a function g(x) that is proportional to the stationary distribution, $g(x) \propto \pi(x)$, with no convenient way to get the normalization factor. A very common algorithm is Metropolis-Hastings.

Examples

(1) (Ising model; see Lecture 2 for a description.) The energy of a particular configuration of spins $\sigma \in \{\pm 1\}^N$ is
$$H(\sigma) = -\sum_{\langle i,j\rangle} \sigma_i \sigma_j,$$
where $\langle i,j\rangle$ indicates that nodes i, j are neighbours. The energy is lower when neighbouring spins are the same. The stationary distribution is the Gibbs measure / Boltzmann distribution:
$$\pi(\sigma) = Z^{-1} e^{-\beta H(\sigma)}.$$
Here Z is a normalization constant, which is almost never known, and β is a parameter representing the inverse temperature. In statistical mechanics we would set $\beta = (k_B T)^{-1}$, where $k_B$ is Boltzmann's constant and T is the temperature.

For large β (low temperature), π(σ) is bimodal: the system is either mostly (+1) or mostly (-1). This means the system is magnetized. For small β (high temperature), π(σ) gives most weight to configurations with nearly equal numbers of +1 and -1 spins, so the system is disordered and loses its magnetization. One question of interest is: at which temperature ($\beta^{-1}$) does this transition occur? We can answer this by calculating $\langle |M| \rangle_\pi$, the average with respect to π of the absolute value of the magnetization
$$M = \frac{1}{N}\sum_{i=1}^{N} \sigma_i.$$
Here $\langle f(x)\rangle_\pi = \sum_x \pi_x f(x)$. For this, we need to sample from π, calculate representative values of |M|, and then average these values. As β increases, $\langle |M|\rangle_\pi$ should undergo a sharp transition from 0 to 1.

(2) (Particles interacting with a pairwise potential.) A very common model in chemistry and other areas that consider systems of interacting components (e.g. protein folding, materials science, etc.) is to suppose there is a collection of point particles that interact with a pairwise potential V(r), where r is the distance between the pair. The total energy of a system of n particles is the sum over all the pairwise interactions:
$$U(x) = \sum_{i < j} V(|x_i - x_j|),$$
where $x = (x_1, x_2, \dots, x_n)$ is the 3n-dimensional vector of particle positions. The stationary distribution is again the Boltzmann distribution $\pi(x) = Z^{-1} e^{-\beta U(x)}$, where again β is the inverse temperature, and we almost never know the normalization constant Z. Depending on β, the system could prefer to be in a number of different states, such as a solid, crystal, liquid, gas, or other phase. To calculate the phase diagram we must sample π(x).
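To fix notation for the Ising example, here is a small Python helper (illustrative, not from the notes) that evaluates the energy H(σ) and the magnetization M of a configuration on an n-by-n lattice with periodic boundary conditions; a sampler that actually draws from π(σ) is sketched later in this section.

```python
import numpy as np

def ising_energy(spins):
    """H(sigma) = -sum over nearest-neighbour pairs of sigma_i * sigma_j,
    on an n x n lattice with periodic boundary conditions (each pair counted once)."""
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -np.sum(spins * right + spins * down)

def magnetization(spins):
    """M = (1/N) * sum of all spins, where N is the number of sites."""
    return spins.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    spins = rng.choice([-1, 1], size=(10, 10))   # a random configuration
    print("H =", ising_energy(spins), " M =", magnetization(spins))
```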

There are several difficulties in sampling π for these, and many related, examples:

- The size of the state space is often HUGE. The Ising model on an n-by-n lattice has $2^{n^2}$ elements in its state space; even for small n, say n = 10, we have over $10^{30}$ configurations, so there is no way we can list them all. Even a small system of 100 particles lives in 300-dimensional space; there is no way we are going to adequately sample every region in this space.
- Often we don't know the normalization constant for π, only a function it is proportional to. The Boltzmann distribution $Z^{-1}e^{-\beta U(x)}$ arises frequently: we usually know the potential energy U(x) but can almost never calculate Z.
- We may not know the true dynamics that give rise to π, or these may have widely separated time scales and so cannot be efficiently simulated. Ising model: what are the true dynamics of a magnet? For this we would need quantum mechanics. Particles in a box: these should also interact with a solvent (such as water), but we can't simulate all the necessary atoms.
- We can't use the previous methods to generate independent samples from π when the dimensionality becomes very high.

Instead, what we can do is invent an artificial dynamics that has π as its stationary distribution, and then simulate these dynamics instead.

Metropolis-Hastings Algorithm. Given a set of n nodes and a stationary distribution π, the algorithm samples it as follows:

- If you are in state i, choose a state j from a proposal probability distribution encoded in a transition matrix $H = (h_{ij})$. Here $h_{ij} = P(\text{consider } j \text{ next} \mid \text{currently in state } i)$.
- Move to state j with probability $a_{ij}$. Otherwise, remain in state i.

The induced Markov chain has transition probabilities
$$p_{ij} = \begin{cases} h_{ij}\,a_{ij} & (i \ne j) \\ 1 - \sum_{j \ne i} h_{ij}\,a_{ij} & (i = j). \end{cases}$$
The acceptance matrix $A = (a_{ij})$ is typically chosen to satisfy detailed balance:
$$\pi_i h_{ij} a_{ij} = \pi_j h_{ji} a_{ji}.$$
Note that we can ensure this condition holds even if we don't know the normalization constant Z! A common choice is
$$a_{ij} = \min\!\left(1, \frac{\pi_j h_{ji}}{\pi_i h_{ij}}\right).$$
This is the Hastings algorithm. (The Metropolis algorithm refers to the case when H is symmetric, i.e. $h_{ij} = h_{ji}$, so that $a_{ij} = \min(1, \pi_j/\pi_i)$.) We showed on HW2 that the induced Markov chain satisfies detailed balance, and has π as its stationary distribution.

Notes
- This algorithm works equally well in a continuous state space, provided we interpret h as a density: $h(y|x)$ is the density of jumping to y, given that we start at x. The acceptance ratio is
$$a(y|x) = \min\!\left(1, \frac{\pi(y)\,h(x|y)}{\pi(x)\,h(y|x)}\right).$$
- A general form of the acceptance matrix is $a_{ij} = F\!\left(\frac{\pi_j h_{ji}}{\pi_i h_{ij}}\right)$, where $F : [0,\infty] \to [0,1]$ is any function satisfying $F(z) = z\,F(1/z)$.
- The proposal matrix can be absolutely anything, provided $h_{ij} > 0 \iff h_{ji} > 0$ (so the chain can satisfy detailed balance), and provided the chain described by H is ergodic.
- Choosing a good proposal matrix, however, is an art. Proposing many far-away moves is good, because it lets you move around state space quickly. Proposing too many of these, however, leads to many rejected moves, which slows convergence. A rule of thumb used to be that you should choose your proposal matrix so that an average of roughly 50% of the proposals are rejected; a recent paper by Andrew Gelman at Columbia argued that 27% is a better number for some processes. What does this mean? No one really knows for sure, and you should choose parameters that seem to work well for the problem at hand.
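As a concrete instance of the continuous-state version, here is a small Python sketch of random-walk Metropolis (the target density is an arbitrary illustration, not from the notes): the Gaussian proposal is symmetric, so the acceptance probability reduces to $\min(1, g(y)/g(x))$ for an unnormalized target $g \propto \pi$.

```python
import numpy as np

rng = np.random.default_rng(5)

def metropolis(log_g, x0, n_steps, step_size=0.5):
    """Random-walk Metropolis with a symmetric Gaussian proposal.
    log_g is the log of an unnormalized target density; the normalization cancels."""
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + step_size * rng.standard_normal()
        # accept with probability min(1, g(y)/g(x)) = min(1, exp(log_g(y) - log_g(x)))
        if np.log(rng.random()) < log_g(y) - log_g(x):
            x = y
        chain[t] = x
    return chain

if __name__ == "__main__":
    # unnormalized bimodal target: g(x) = exp(-(x^2 - 4)^2 / 4), peaks near x = -2 and x = 2
    log_g = lambda x: -((x**2 - 4.0)**2) / 4.0
    chain = metropolis(log_g, x0=0.0, n_steps=200_000)
    burn_in = 10_000                     # discard the initial transient
    print("mean of |x| under pi:", np.abs(chain[burn_in:]).mean())
```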

Here are some example proposal matrices:

(1) (Ising model)
- pick a spin, flip it;
- pick a pair of spins, exchange their values;
- pick a spin or a cluster of spins, and change the value(s) depending on the surrounding environment, e.g. set the value to be the one that minimizes the energy in the spin's local neighbourhood.

(2) (Particles)
- pick a single particle, move it by some random amount;
- move all particles in the direction of the gradient of the potential, plus some random amount.

To calculate averages and distributions, e.g. $\langle f(x)\rangle_\pi$, one typically discards the initial transient steps. This is because it takes time for the chain to forget its initial state and reach equilibrium. Theoretically, we can bound the time it takes to reach equilibrium in terms of $\lambda_2$, the second-largest eigenvalue (in absolute value) of the transition matrix: the equilibration time scales like $1/(1 - |\lambda_2|)$. In practice we almost never know $\lambda_2$, so one must determine empirically whether the chain has converged. (ELFS: can you think of ways to do this?)

The total effective number of samples is less than the actual number of points generated, even when equilibrium is reached, because the points are correlated. There is a nice formula for the effective number of points in terms of the covariance function of the Markov process. We will consider this on the homework.
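Putting the pieces together for the Ising example, here is a small Python sketch of single-spin-flip Metropolis (the lattice size, β values, and run lengths are illustrative choices, not from the notes): the proposal picks a random spin and flips it, which is symmetric, so a flip is accepted with probability $\min(1, e^{-\beta\,\Delta H})$, and $\langle |M| \rangle_\pi$ is estimated after discarding a transient.

```python
import numpy as np

rng = np.random.default_rng(6)

def ising_metropolis(n, beta, n_sweeps, burn_in):
    """Single-spin-flip Metropolis for the Ising model on an n x n periodic lattice.
    Returns an estimate of <|M|> under the Boltzmann distribution at inverse temperature beta."""
    spins = rng.choice([-1, 1], size=(n, n))
    abs_m = []
    for sweep in range(n_sweeps):
        for _ in range(n * n):                        # one sweep = n^2 attempted flips
            i, j = rng.integers(n, size=2)
            # sum of the four neighbouring spins (periodic boundary conditions)
            nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
                  + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
            dH = 2.0 * spins[i, j] * nb               # energy change if spin (i, j) is flipped
            if np.log(rng.random()) < -beta * dH:     # accept with probability min(1, exp(-beta*dH))
                spins[i, j] *= -1
        if sweep >= burn_in:                          # discard the initial transient sweeps
            abs_m.append(abs(spins.mean()))
    return np.mean(abs_m)

if __name__ == "__main__":
    for beta in [0.2, 0.44, 0.7]:                     # below, near, and above the transition
        m = ising_metropolis(n=16, beta=beta, n_sweeps=2000, burn_in=500)
        print(f"beta = {beta:.2f}   <|M|> = {m:.3f}")
```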

6.4 Monte-Carlo methods in optimization

[Figure: a rugged, non-convex energy landscape U(x) with many local minima.]

Suppose you have a non-convex, possibly very rugged, function U(x), e.g. as shown above. How can you find the global minimum? Deterministic methods (e.g. steepest descent, Gauss-Newton, Levenberg-Marquardt, BFGS, etc.) are very good at finding a local minimum. To find a global minimum, or one that is close to optimal, one must typically search the landscape stochastically.

One way to do this is to create a stationary distribution π that puts high probability on the lowest-energy parts of the landscape. A common choice is the Boltzmann distribution $Z^{-1} e^{-\beta U(x)}$ for some inverse temperature β. Then one constructs a Markov chain to sample this stationary distribution, and keeps track of the smallest value the chain has seen.

The result will be sensitive to the value of β. How should we choose it? If β is large, the global minimum will be the most likely place to be in equilibrium, but it will take a very long time to reach equilibrium. If β is small, the chain moves about on the landscape much more quickly, but doesn't tend to spend as much time in the low-energy parts of the space. One possibility is to cycle β periodically: to alternate between high-temperature, fast dynamics, and low-temperature dynamics that tend to find a low minimum and stay there.

Another possibility is Simulated Annealing. This is a technique that slowly increases β, for example as $\beta = \log t$ or $\beta = (1.001)^t$, where t is the number of steps. As $t \to \infty$, it can be shown that the stationary distribution converges to a delta-function at the global minimum. In practice it takes exponentially long to do so, but this method still gives good results for many problems.

Example (Lennard-Jones clusters). A Lennard-Jones cluster is a set of n points interacting with a pairwise potential
$$U(r) = \varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]$$
for some parameters σ, ε. This is a model for many atoms, molecules, or other interacting components with reasonably short-range interactions. The energy landscape is very rugged, with a great many local minima. To explore the landscape, and/or to find the lowest minima, one method is to construct a Markov chain on the set of local minima. This works as follows: at each step in the chain, perturb the positions of the points $x_1, x_2, \dots, x_n$ by some random amount; then find the nearest local minimum using a deterministic method; then check the energy of this local minimum, and either accept it or reject it according to the Hastings criterion. This method can also be used to solve packing problems, where now the energy is a function of the density, or of another quantity to be optimized.
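Here is a small Python sketch of simulated annealing on a rugged one-dimensional function (the test function, schedule, and parameters are illustrative choices, not from the notes): Metropolis moves on $e^{-\beta U(x)}$ while β grows geometrically, keeping track of the best point seen.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulated_annealing(U, x0, n_steps, beta0=0.1, growth=1.001, step_size=0.5):
    """Random-walk Metropolis on exp(-beta * U(x)) while beta grows like beta0 * growth**t.
    Returns the best point and best value encountered along the way."""
    x = best_x = x0
    best_U = U(x0)
    beta = beta0
    for t in range(n_steps):
        y = x + step_size * rng.standard_normal()
        if np.log(rng.random()) < -beta * (U(y) - U(x)):   # Metropolis acceptance
            x = y
            if U(x) < best_U:
                best_x, best_U = x, U(x)
        beta *= growth                                      # slowly lower the temperature
    return best_x, best_U

if __name__ == "__main__":
    # a rugged test function: quadratic envelope plus oscillations, global minimum near x = 2
    U = lambda x: 0.5 * (x - 2.0)**2 + 2.0 * np.sin(5.0 * x)
    x_best, U_best = simulated_annealing(U, x0=-5.0, n_steps=50_000)
    print("best x:", x_best, "  U(best x):", U_best)
```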

Example (Cryptography). Another use of Monte-Carlo methods in optimization is in cryptography (Diaconis [2009]). A cipher is a function $\varphi : S \to A$, where S is a set of symbols (e.g. a permutation of the alphabet, the Greek letters, a collection of squiggles, etc.), and $A = \{a, b, c, \dots, x, y, z\}$ is the set of letters. A code is a string of symbols $x_1 x_2 x_3 \dots x_n$, where $x_i \in S$. Here is an example of a code, used by inmates in a prison in California (this and the next figure are from Diaconis [2009]):

[Figure: a coded message written in the prisoners' symbols.]

If one knows the language the code is written in, then one can obtain the distribution of letter frequencies $f_1(a_i)$, where $a_i \in A$. One can then construct a plausibility function
$$L_1(x_1, x_2, \dots, x_n; \varphi) = \prod_{i=1}^{n} f_1(\varphi(x_i)).$$
To decipher a particular code, one can use a Monte-Carlo method to maximize $L_1$ over all ciphers φ. This method was actually used to decode the prisoners' messages. It didn't work initially. However, when the plausibility function was updated to include information about the frequencies of pairs of letters $f_2(a_i, a_j)$, as
$$L_2(x_1, x_2, \dots, x_n; \varphi) = \prod_{i=1}^{n-1} f_2(\varphi(x_i), \varphi(x_{i+1})),$$
then it did work, and the researchers learned about daily life in prison (in a mixture of English, Spanish, and prison slang):

[Figure: the decoded message.]
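The search over ciphers can itself be run as Metropolis on the set of assignments of letters to symbols. Below is a small Python sketch of the idea (a simplification of my own; the message, symbol set, letter set, and bigram table log_f2 are hypothetical inputs the user supplies, and log_f2 must contain an entry for every letter pair): the proposal swaps the letters assigned to two symbols, which is symmetric, so moves are accepted according to the change in log-plausibility.

```python
import numpy as np

rng = np.random.default_rng(8)

def log_plausibility(message, cipher, log_f2):
    """log L_2 = sum of log f_2(phi(x_i), phi(x_{i+1})) over consecutive symbols."""
    decoded = [cipher[s] for s in message]
    return sum(log_f2[(a, b)] for a, b in zip(decoded[:-1], decoded[1:]))

def mcmc_decipher(message, symbols, letters, log_f2, n_steps=20_000):
    """Metropolis over ciphers: propose swapping the letters assigned to two symbols
    (a symmetric proposal) and accept with probability min(1, L2_new / L2_old)."""
    cipher = dict(zip(symbols, rng.permutation(list(letters))))
    best, logL = dict(cipher), log_plausibility(message, cipher, log_f2)
    best_logL = logL
    for _ in range(n_steps):
        s1, s2 = rng.choice(list(symbols), size=2, replace=False)
        cipher[s1], cipher[s2] = cipher[s2], cipher[s1]        # propose a swap
        new_logL = log_plausibility(message, cipher, log_f2)
        if np.log(rng.random()) < new_logL - logL:             # accept
            logL = new_logL
            if logL > best_logL:
                best, best_logL = dict(cipher), logL
        else:                                                  # reject: undo the swap
            cipher[s1], cipher[s2] = cipher[s2], cipher[s1]
    return best
```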

References

P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46:179-205, 2009.

G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.

Neal Madras. Lectures on Monte Carlo Methods. American Mathematical Society, 2002.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes. Cambridge University Press, 3rd edition, 2007.

Alan D. Sokal. Monte Carlo methods in statistical mechanics: foundations and new algorithms. In Cours de Troisième Cycle de la Physique en Suisse Romande, Lausanne, 1989.