Brief introduction to Markov Chain Monte Carlo


Brief introduction to Markov Chain Monte Carlo

Department of Probability and Mathematical Statistics seminar
Stochastic modeling in economics and finance
November 7, 2011

Contents
1. Introduction and motivation; classical iid sampling
2. Markov Chains: basic definitions, stationary distribution, limit theorems
3. Markov Chain Monte Carlo
4. ...

What is MCMC?

MCMC = Markov Chain + Monte Carlo.
- Monte Carlo is just a cool name for random simulation. Typically we generate iid samples from the given distribution.
- MCMC = an algorithm generating a Markov Chain with a prescribed stationary distribution, used as a tool to sample from this distribution. The samples are not independent. Used when iid sampling is not feasible.
- MCMC first appears in Metropolis et al. (1953): Equation of state calculations by fast computing machines, Journal of Chemical Physics.
- MCMC was originally used in statistical physics. Nowadays it is a general simulation tool (popular e.g. in Bayesian statistics).

Why do we need random sampling?

Because we need to do the following things:
- Create a random sample to see what the distribution looks like.
- Estimate the expectation or other parameters (e.g. VaR in operational risk) of a distribution which can't be computed analytically.
- Estimate the probability of a certain event. Special case: simulating p-values of statistical tests.
- Create simulated data to test statistical methods on.
- Use Monte Carlo integration as an alternative to classical numerical integration.
- Solve an optimization problem using a randomized algorithm.

Classical iid sampling

We want to estimate the expectation $\mu = E[g(X)]$. Create an iid random sample $X_1, X_2, \dots, X_n$ from $\mathcal{L}(X)$ and put
$$\hat\mu_n = \frac{1}{n} \sum_{i=1}^n g(X_i).$$
- The estimate is unbiased: $E(\hat\mu_n) = \mu$.
- The variance is reciprocal to $n$: $\operatorname{var}(\hat\mu_n) = \sigma^2/n$, where $\sigma^2 = \operatorname{var}[g(X)]$.
- Strong Law of Large Numbers: $\hat\mu_n \xrightarrow{a.s.} \mu$ as $n \to \infty$.
- Central Limit Theorem: $\sqrt{n}\,(\hat\mu_n - \mu) \xrightarrow{D} N(0, \sigma^2)$ as $n \to \infty$.
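A minimal Python/NumPy sketch of this estimator; the exponential target and $g(x) = x^2$ are illustrative choices only, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def iid_mc_estimate(g, sampler, n):
    """Plain Monte Carlo: mean of g over n iid draws, plus its standard error."""
    vals = g(sampler(n))
    mu_hat = vals.mean()
    se = vals.std(ddof=1) / np.sqrt(n)   # estimated sigma / sqrt(n)
    return mu_hat, se

# Example: estimate E[X^2] for X ~ Exp(1) (true value is 2).
mu_hat, se = iid_mc_estimate(lambda x: x**2,
                             lambda n: rng.exponential(1.0, n),
                             100_000)
print(f"estimate = {mu_hat:.3f} +/- {se:.3f}")
```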

Classical sampling methods

- Inverse sampling: when the cumulative distribution function can be inverted easily (for example, the exponential or logistic distribution). Uses U(0, 1) as input. Only for 1-dimensional cases.
- Rejection sampling: we need an auxiliary enveloping distribution from which it is easy to simulate. We then reject some realizations at random (using U(0, 1) as an additional input) according to the probability density ratio of the desired and the auxiliary distribution.
- Importance sampling: estimating the integral of $f$ as an expectation of $f/g$ with respect to $g$ (from which we can simulate). Choose $g$ as close to $f$ as possible.

A sketch of the first two methods follows this list.
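A minimal Python/NumPy sketch of inverse and rejection sampling; the Exp($\lambda$) and Beta(2, 2) targets are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Inverse sampling for Exp(lam): F^{-1}(u) = -log(1 - u) / lam.
def exp_inverse_sample(lam, n):
    u = rng.uniform(size=n)
    return -np.log1p(-u) / lam

# Rejection sampling for a density f enveloped by c * g:
# accept x ~ g with probability f(x) / (c * g(x)).
def rejection_sample(f, g_sampler, g_pdf, c, n):
    out = []
    while len(out) < n:
        x = g_sampler()
        if rng.uniform() <= f(x) / (c * g_pdf(x)):
            out.append(x)
    return np.array(out)

# Example: Beta(2, 2) density f(x) = 6x(1-x), enveloped by c * U(0, 1), c = max f = 1.5.
f = lambda x: 6.0 * x * (1.0 - x)
samples = rejection_sample(f, lambda: rng.uniform(), lambda x: 1.0, 1.5, 10_000)
print(samples.mean())   # should be close to 0.5
```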

Problems with classical sampling methods

Classical sampling methods are not always applicable, especially for high-dimensional complex problems.

Example: sampling uniformly from the n-dimensional unit ball via rejection sampling. As the enveloping distribution we use the uniform distribution on the n-dimensional cube $[-1, 1]^n$. What is the acceptance rate of such a generator? It is the ratio of the two volumes,
$$\frac{\pi^{n/2}}{\Gamma(n/2 + 1)\, 2^n} \longrightarrow 0 \quad \text{as } n \to \infty \;!!!$$
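The collapse of this acceptance rate is easy to see numerically; a small Python check (the chosen dimensions are arbitrary):

```python
import math

# Acceptance rate of rejection sampling from the unit ball using the cube [-1, 1]^n:
# ratio of volumes = pi^(n/2) / (Gamma(n/2 + 1) * 2^n).
for n in (1, 2, 5, 10, 20):
    rate = math.pi ** (n / 2) / (math.gamma(n / 2 + 1) * 2 ** n)
    print(f"n = {n:2d}: acceptance rate = {rate:.2e}")
# n = 20 already gives ~ 2.5e-08, i.e. about one accepted draw per 40 million proposals.
```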

Markov Chains - basic definitions

A Markov Chain is a stochastic process with the Markov property:
$$P\left(X_{t+1} = x_{t+1} \,\middle|\, \bigcap_{i=0}^{t} \{X_i = x_i\}\right) = P(X_{t+1} = x_{t+1} \mid X_t = x_t).$$
For our purposes we need discrete time and:
- discrete state space $\mathcal{X}$: transition probability matrix $P_{ij}$,
- continuous state space $\mathcal{X}$: transition kernel $P(x, x')$.

Homogeneity: the transition probabilities $P$ are independent of time. We will work with homogeneous chains only.

Irreducibility: every state is accessible in one or more steps from any other state with positive probability.

Stationary distribution, detailed balance equations

A collection $\{\pi_i,\ i \in \mathcal{X}\}$ is called a stationary distribution of a Markov Chain $X_t$ with transition probability matrix $P$ if $\sum_{i \in \mathcal{X}} \pi_i = 1$ and $\pi = \pi P$.

Detailed balance equations: let $X_t$ be a Markov Chain with transition probability matrix $P$. If $\{\pi_i,\ i \in \mathcal{X}\}$ satisfies $\sum_{i \in \mathcal{X}} \pi_i = 1$ and
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j \in \mathcal{X},$$
then $\pi$ is a stationary distribution of $X_t$.
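As a quick numerical illustration (not from the slides), one can verify on a small matrix that detailed balance implies stationarity; the particular $P$ and $\pi$ below are arbitrary examples:

```python
import numpy as np

# A small transition matrix and a candidate pi satisfying detailed balance.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.5, 0.25])

flows = pi[:, None] * P                 # flows[i, j] = pi_i * P_ij
assert np.allclose(flows, flows.T)      # detailed balance: pi_i P_ij == pi_j P_ji
assert np.allclose(pi @ P, pi)          # ...which implies stationarity: pi P == pi
```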

Law of Large Numbers and CLT for Markov Chains

Let $X_t$ be a stationary Markov Chain with state space $\mathcal{X}$ and stationary distribution $\pi$. Suppose that $\mu = \int_{\mathcal{X}} g(x)\, d\pi(x)$ exists. Let's estimate $\mu$ by the sample average of $g(X_t)$ over time:
$$\hat\mu_n = \frac{1}{n} \sum_{t=1}^n g(X_t).$$
- Strong Law of Large Numbers: $\hat\mu_n \xrightarrow{a.s.} \mu$ as $n \to \infty$.
- Central Limit Theorem: $\sqrt{n}\,(\hat\mu_n - \mu) \xrightarrow{D} N(0, \sigma^2)$ as $n \to \infty$, where
$$\sigma^2 = \operatorname{var}[g(X_i)] + 2 \sum_{k=1}^{+\infty} \operatorname{cov}[g(X_i), g(X_{i+k})].$$

MCMC - basic principle

We want to simulate from a given (target) distribution $p$ on $\mathcal{X}$, which can be discrete or continuous.
- Start from an arbitrary element $x \in \mathcal{X}$ and take it as $X_0$.
- Construct a Markov Chain $\{X_t\}$, $t = 1, 2, \dots$, which will explore the distribution $p$ in successive time iterations.
- Design the transition probabilities of $\{X_t\}$ so that $p$ is a stationary distribution of $\{X_t\}$.
- We lose independence, which decreases the efficiency of the estimates compared to iid sampling.

Metropolis-Hastings algorithm I (discrete case)

Let $\mathcal{X}$ be a finite or countably infinite set. Let $p_i > 0$, $i \in \mathcal{X}$, be a probability distribution on $\mathcal{X}$ from which we want to simulate. We will construct an irreducible transition probability matrix $P$ so that $p$ will be a stationary distribution of a Markov Chain $\{X_t\}$ with state space $\mathcal{X}$ and transition matrix $P$.

Consider any irreducible transition probability matrix $Q$ on the state space $\mathcal{X}$. Metropolis et al. (1953) considered a symmetric $Q$, so that all the terms $Q_{ij}/Q_{ji}$ (so-called Hastings ratios) vanish from the following formulas.

Metropolis-Hastings algorithm II (discrete case)

Start $\{X_t\}$ in an arbitrary state $X_0 = x_0 \in \mathcal{X}$. Then iterate over time $t$:
1. Having $X_t$, choose a proposal $X' \in \mathcal{X}$ according to $Q$: $P(X' = x' \mid X_t = x_t) = Q_{x_t x'}$.
2. Calculate the so-called acceptance probability
$$\alpha = \min\left\{1,\ \frac{p(X')\, Q_{X' X_t}}{p(X_t)\, Q_{X_t X'}}\right\} \in (0, 1].$$
3. With probability $\alpha$ accept $X'$ and put $X_{t+1} = X'$. Otherwise (with probability $1 - \alpha$) reject $X'$ and keep $X_{t+1} = X_t$.
4. Set $t = t + 1$ and go back to step 1.
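A minimal Python/NumPy sketch of these four steps on a small finite state space; the target weights p and the proposal Q below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_hastings_discrete(p, Q, x0, n_steps):
    """M-H on states 0..K-1: p = target weights (need not be normalized),
    Q = proposal transition matrix, x0 = starting state."""
    K = len(p)
    chain = np.empty(n_steps, dtype=int)
    x = x0
    for t in range(n_steps):
        x_prop = rng.choice(K, p=Q[x])                  # step 1: propose from Q
        alpha = min(1.0, (p[x_prop] * Q[x_prop, x]) /
                         (p[x] * Q[x, x_prop]))         # step 2: acceptance probability
        if rng.uniform() < alpha:                       # step 3: accept or reject
            x = x_prop
        chain[t] = x                                    # step 4: next iteration
    return chain

# Example: unnormalized target weights (the normalizing constant is never needed).
p = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
Q = np.full((5, 5), 0.2)                                # symmetric, irreducible proposal
chain = metropolis_hastings_discrete(p, Q, x0=0, n_steps=50_000)
print(np.bincount(chain, minlength=5) / len(chain))     # ~ p / p.sum() = [.1, .2, .4, .2, .1]
```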

Metropolis-Hastings algorithm III (discrete case)

The resulting transition probability matrix $P$ of $\{X_t\}$ is
$$P_{ij} = Q_{ij} \min\left\{1,\ \frac{p(j)\, Q_{ji}}{p(i)\, Q_{ij}}\right\}, \quad i \neq j, \qquad P_{ii} = 1 - \sum_{j \in \mathcal{X} \setminus \{i\}} P_{ij}, \quad i \in \mathcal{X}.$$
By simple algebraic manipulation we can show that the matrix $\{P_{ij}\}$ satisfies the detailed balance equations with $p$: indeed, for $i \neq j$, $p(i) P_{ij} = \min\{p(i) Q_{ij},\ p(j) Q_{ji}\}$, which is symmetric in $i$ and $j$. So $p$ is a stationary distribution of the constructed Markov Chain $\{X_t\}$, which is what we needed to prove.

We do not need to evaluate $p$ itself; we just need to compute the ratios $p(j)/p(i)$. So the normalizing constant of $p$ can be ignored.
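To make the detailed-balance claim concrete, one can build $P$ explicitly from $Q$ and $p$ and check both equations numerically; a sketch under an arbitrary example target (mh_transition_matrix is a hypothetical helper name):

```python
import numpy as np

def mh_transition_matrix(p, Q):
    """Assemble the M-H transition matrix P from target p and proposal Q."""
    K = len(p)
    P = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                P[i, j] = Q[i, j] * min(1.0, (p[j] * Q[j, i]) / (p[i] * Q[i, j]))
        P[i, i] = 1.0 - P[i].sum()      # rejection mass stays at state i
    return P

p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
Q = np.full((5, 5), 0.2)
P = mh_transition_matrix(p, Q)
flows = p[:, None] * P
assert np.allclose(flows, flows.T)      # detailed balance with p
assert np.allclose(p @ P, p)            # hence p is stationary
```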

Examples of applications

- Generate uniformly from the set of $m \times n$ integer contingency tables with given row and column marginals. An irreducible $Q$ can be defined as a uniform choice from the neighboring tables. Since $p$ is uniform, $p(j)/p(i) \equiv 1$ here.
- Generating from a Poisson distribution. Let $Q$ be a random walk over $\mathcal{X} = \{0, 1, 2, \dots\}$. The ratios $p(i \pm 1)/p(i)$ are easy to calculate.
- Exploring the feasible set of an integer programming maximization problem. We can choose $p$ increasing with the objective function and so push $\{X_t\}$ towards regions with higher values of the objective function.

A sketch of the Poisson example follows this list.
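A minimal Python sketch of the Poisson example; for $p(i) = e^{-\lambda}\lambda^i/i!$ the needed ratios are $p(i+1)/p(i) = \lambda/(i+1)$ and $p(i-1)/p(i) = i/\lambda$ (the function name and step rule are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def poisson_mh(lam, n_steps, x0=0):
    """Random-walk M-H on {0, 1, 2, ...}: propose x +/- 1 with prob 1/2 each.
    Only the ratios p(i+1)/p(i) and p(i-1)/p(i) are used."""
    x = x0
    chain = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        step = 1 if rng.uniform() < 0.5 else -1
        x_prop = x + step
        if x_prop >= 0:  # proposals below 0 have p = 0 and are always rejected
            ratio = lam / x_prop if step == 1 else x / lam
            if rng.uniform() < min(1.0, ratio):
                x = x_prop
        chain[t] = x
    return chain

chain = poisson_mh(lam=4.0, n_steps=100_000)
print(chain.mean(), chain.var())   # both should be close to lam = 4
```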

Metropolis-Hastings algorithm I (continuous case)

Let $\mathcal{X} \subseteq \mathbb{R}^m$ be the support of a probability density function $p(x)$, $p(x) > 0\ \forall x \in \mathcal{X}$, from which we want to simulate. We will construct a Markov Chain $\{X_t\}$ with state space $\mathcal{X}$ so that $p$ will be its stationary distribution.

Consider any proposal probability density function $q(x' \mid x)$ of $x' \in \mathcal{X}$, depending in general on $x \in \mathcal{X}$.

Metropolis-Hastings algorithm II (continuous case)

Start $\{X_t\}$ in an arbitrary state $X_0 = x_0 \in \mathcal{X}$. Then iterate over time $t$:
1. Having $X_t$, draw a proposal $X' \in \mathcal{X}$ from $q(x' \mid x_t)$.
2. Calculate the so-called acceptance probability
$$\alpha = \min\left\{1,\ \frac{p(X')\, q(X_t \mid X')}{p(X_t)\, q(X' \mid X_t)}\right\} \in (0, 1].$$
3. With probability $\alpha$ accept $X'$ and put $X_{t+1} = X'$. Otherwise (with probability $1 - \alpha$) reject $X'$ and keep $X_{t+1} = X_t$.
4. Set $t = t + 1$ and go back to step 1.

Metropolis-Hastings algorithm III (continuous case)

Similarly to the discrete case, we can show that the transition kernel $P(x' \mid x)$ of $\{X_t\}$ satisfies the detailed balance equations with $p$. So $p$ is a stationary distribution of the constructed Markov Chain $\{X_t\}$, which is what we needed to prove.

We do not need to evaluate $p(x)$ itself; we just need to compute the ratios $p(x')/p(x)$. So the normalizing constant of $p$ (which is often unknown) can be ignored.

- When $q(x' \mid x)$ does not depend on $x \in \mathcal{X}$, we talk about an independence chain (but it is still not iid!!!).
- When $q(x' \mid x) = q(x' - x)$, we talk about random walk MCMC. When in addition $q(\cdot)$ is symmetric, the formulas simplify again.

Metropolis-Hastings 2D random walk illustration (figure only)

Example - target N(0, 1)

- The target distribution is N(0, 1). Let's forget for a while that it is easy to simulate from N(0, 1) directly.
- Starting point $X_0 = 0$ (the mode of the target distribution).
- We use the normal random walk Metropolis-Hastings algorithm with a symmetric trial (jump) distribution $N(0, \sigma_J^2)$.
- Simple implementation on an Excel sheet.
- By changing the parameter $\sigma_J > 0$, we drive the acceptance rate of the proposals, the convergence speed and the efficiency.
- We use $\sigma_J \in \{0.1, 0.2, 0.5, 1, 2, 2.5, 3, 5, 8, 10\}$ and evaluate the results; a Python sketch of the same experiment follows.
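Since the slides implement the experiment in Excel, here is a rough Python/NumPy equivalent; the helper name rw_metropolis_normal and the printed diagnostics are our own choices. For the N(0, 1) target with a symmetric proposal, $\alpha = \min\{1, p(x')/p(x)\}$ and $\log p(x') - \log p(x) = (x^2 - x'^2)/2$:

```python
import numpy as np

rng = np.random.default_rng(4)

def rw_metropolis_normal(sigma_j, n_steps, x0=0.0):
    """Random walk M-H for a N(0, 1) target with N(0, sigma_j^2) jumps."""
    x = x0
    chain = np.empty(n_steps)
    n_accepted = 0
    for t in range(n_steps):
        x_prop = x + sigma_j * rng.standard_normal()
        log_alpha = 0.5 * (x * x - x_prop * x_prop)   # log p(x') - log p(x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
            n_accepted += 1
        chain[t] = x
    return chain, n_accepted / n_steps

for sigma_j in (0.1, 2.5, 10.0):
    chain, acc = rw_metropolis_normal(sigma_j, 400)
    acf1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
    print(f"sigma_J = {sigma_j:4.1f}: acceptance = {acc:5.1%}, ACF(1) = {acf1:.2f}")
```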

MCMC trajectory (400 observations): $\sigma_J = 0.1$
Very high acceptance rate. A random walk with mean reversion.

MCMC trajectory (400 observations): $\sigma_J = 2.5$
Moderate acceptance rate. Looks most like white noise.

MCMC trajectory (400 observations): $\sigma_J = 10$
Very low acceptance rate. Locally constant with rare large jumps.

Simulated histogram (400 observations): $\sigma_J = 0.1$
The chain didn't explore the whole target distribution (especially the tails).

Simulated histogram (400 observations): $\sigma_J = 2.5$
The resulting histogram looks nice.

Simulated histogram (400 observations): $\sigma_J = 10$
The chain is too rigid; the histogram consists of several peaks.

MCMC trajectory ACF: $\sigma_J = 0.1$
Very slow (linear) decay of the ACF.

MCMC trajectory ACF: $\sigma_J = 2.5$
$\mathrm{ACF}(k) \approx 0.6^k$.

MCMC trajectory ACF: $\sigma_J = 10$
$\mathrm{ACF}(k) \approx 0.83^k$.

Acceptance rate as a function of $\sigma_J$
The acceptance rate goes from 100 % down to 15 % as $\sigma_J$ goes from 0.1 to 10. At the moderate $\sigma_J = 2.5$ the acceptance rate is around 40 %.

ACF(1) as a function of $\sigma_J$
The ACF(1) plot has a U-shape, starting at 1 and ending at 0.83. The minimum value 0.6 is attained near $\sigma_J = 2.5$.

Efficiency as a function of $\sigma_J$
The efficiency forms a ∩-shape, starting at 0 %. The maximum value, around 30 %, is attained again near $\sigma_J = 2.5$.

Choosing the trial distribution

- The so-called trial (or proposal, candidate, jumping) distribution is crucial for achieving reasonable efficiency of MCMC estimates.
- We must be able to simulate from it easily.
- Its shape should ideally copy the shape of the target distribution.
- Its width (standard deviation) must be tuned to optimize the efficiency. A 20-50 % acceptance rate is recommended.
- For the 1-dimensional case, the optimal efficiency is usually around 30 %. Efficiency decreases with higher dimensions.

Convergence issues

- Since the constructed Markov Chain only approximates the target distribution in a limiting sense, we must check whether our Markov Chain has already converged to the target distribution. Some convergence diagnostics are available for this.
- The convergence speed can differ; be careful.
- Drop the beginning phase of the MC trajectory, before convergence is reached: the so-called burn-in stage.
- Run the simulation several times and compare the characteristics of the individual trajectories.
- Try running the simulation with different starting points and compare the results.

Error of the MCMC estimate

- Assessing the error of an MCMC estimate is not as straightforward as in iid sampling.
- We can use the formula with the process covariance function values to estimate the variance of its sample means.
- Typically the autocorrelation function $\mathrm{ACF}(k)$ of $g(X_t)$ decays to 0 like an exponential $\rho^k$ with $\rho > 0$, i.e. like the ACF of an AR(1) process. We try to make $\rho$ as close to 0 as possible (higher efficiency).
- We can estimate the sample mean error directly from the variation of shorter block (batch) sample means; see the sketch after this list.
- Tune the process to increase efficiency and so decrease the estimation error.
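A minimal Python/NumPy sketch of the batch-means idea; the function name and the batch count are illustrative, and the commented usage assumes the rw_metropolis_normal helper sketched earlier:

```python
import numpy as np

def batch_means_se(chain, n_batches=20):
    """Estimate the standard error of the chain's sample mean from the
    variation of batch (block) means."""
    n = len(chain) // n_batches
    means = chain[: n * n_batches].reshape(n_batches, n).mean(axis=1)
    # var(overall mean) ~ var(batch means) / n_batches, if batches are long
    # enough to be roughly independent.
    return means.std(ddof=1) / np.sqrt(n_batches)

# Usage with the sampler sketched in the N(0, 1) example:
# chain, _ = rw_metropolis_normal(sigma_j=2.5, n_steps=100_000)
# print(batch_means_se(chain))
```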

Further MCMC topics

There are many variants and modifications of MCMC one can meet when browsing the literature. Just to list some:
- Gibbs sampling
- Slice sampling
- Annealing (tempering)
- Langevin technique
- Hamiltonian hybrid algorithm
- Coupling from the past
- ...

References

- W. K. Hastings: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1970), 97-109.
- N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller: Equation of state calculations by fast computing machines. Journal of Chemical Physics 21 (1953), 1087-92.
- J. M. Hohendorff: An Introduction to Markov Chain Monte Carlo. Department of Statistics, University of Toronto, 2005. http://probability.ca/jeff/ftpdir/johannes.pdf
- T. Balún: MCMC metódy posteriórnej simulácie a ich aplikácia v ekonómii [MCMC methods of posterior simulation and their application in economics]. Diploma thesis, 2011. http://is.muni.cz/th/211335/prif_m/diplomka.pdf
- C. Geyer: Introduction to Markov Chain Monte Carlo. University of Minnesota, 2003. http://www.stat.umn.edu/geyer/mcmc/talk/mcmc.pdf
- L. Kroc: Introduction to MCMC. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf
- S. Lalley: Introduction to Markov Chain Monte Carlo. Department of Statistics, University of Chicago. http://galton.uchicago.edu/~lalley/courses/313/proppwilson.pdf
- I. Murray: Markov chain Monte Carlo. Machine Learning Summer School 2009. http://mlg.eng.cam.ac.uk/mlss09/mlss_slides/murray_1.pdf
- P. Lam: MCMC Methods: Gibbs Sampling and the Metropolis-Hastings Algorithm. Harvard University. http://www.people.fas.harvard.edu/~plam/teaching/methods/mcmc/mcmc_mprint.pdf
- K. M. Hanson: Tutorial on Markov Chain Monte Carlo. Los Alamos National Laboratory, 2000. http://kmh-lanl.hansonhub.com/talks/maxent00b.pdf
- M. Scullard: Reversible Markov Chains. 2008. http://www.math.ucsd.edu/~williams/courses/m28908/scullardmath289_mreversibility.pdf

Contacts

mobile: 604 799 879, e-mail: tomas.hanzak@post.cz, web: www.thanzak.sweb.cz

Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University in Prague, Sokolovská 83, 186 75 Praha 8.
e-mail: hanzak@karlin.mff.cuni.cz, web: www.karlin.mff.cuni.cz/kpms

MEDIARESEARCH, a.s., Českobratrská 1, 130 00 Praha 3.
mobile: 725 535 535, e-mail: tomas.hanzak@mediaresearch.cz, web: www.mediaresearch.cz