Today: Fundamentals of Monte Carlo


What is Monte Carlo?
- Named at Los Alamos in the 1940s after the casino.
- Any method that uses (pseudo)random numbers as an essential part of the algorithm. Stochastic, not deterministic!
- A method for doing high-dimensional integrals by sampling the integrand.
- Often a Markov chain, called Metropolis MC.

Simple example (akin to Buffon's needle): Monte Carlo determination of π

Consider a circle inscribed in a square, and take one quadrant. By geometry:

    area of 1/4 circle / area of square = (π r²/4) / r² = π/4.

Simple MC, like throwing darts at a unit-square target:
- Using an RNG uniform on (0,1), pick a pair of coordinates (x, y).
- Count the number of points (darts) in the shaded quarter circle versus the total.
- points in quarter circle / points in square ≈ π/4.

Code snippet (a runnable Python rendering of the slide's pseudocode):

    import random

    N = 1_000_000                # number of darts
    hits = 0
    for _ in range(N):
        x = random.random()      # uniform deviate in (0,1)
        y = random.random()
        if x*x + y*y <= 1.0:     # dart landed inside the quarter circle
            hits += 1
    pi_estimate = 4 * hits / N

MC is advantageous for high-dimensional integrals: the best general method

Consider an integral over the unit D-dimensional hypercube.

By conventional deterministic methods: lay out a grid with L points in each direction, spacing h = 1/L. The number of points is N = L^D = h^(-D) ~ CPU time.

How does the error scale with CPU time and dimensionality?
- The error in the trapezoidal rule goes as ε ~ f''(x) h², since f(x) = f(x₀) + f'(x₀) h + (1/2) f''(x₀) h² + ...
- Direct CPU time ~ ε^(-D/2) (ε ~ h² and CPU ~ h^(-D)).
- But by sampling we find CPU ~ ε^(-2) (ε ~ N^(-1/2) and CPU ~ N). To get another decimal place takes 100 times longer!
- Still, MC is advantageous for D > 4, since the direct exponent D/2 then exceeds MC's 2.

Improved Numerical Integration

Integration methods in D dimensions:
- Trapezoidal rule: ε ~ f^(2)(x) h²
- Simpson's rule: ε ~ f^(4)(x) h⁴
- Generally: ε ~ f^(α)(x) h^α

CPU time scales with the number of sample points (grid size): CPU time ~ h^(-D) (e.g., L points in 1-D, L² in 2-D). Time to do the integral: T_int ~ ε^(-D/α).

By conventional deterministic methods: lay out a grid with L points in each direction, h = 1/L; the number of points is N = L^D = h^(-D) ~ CPU time.

Stochastic integration (Monte Carlo): ε ~ M^(-1/2) (inverse square root of the number of sampled points M). CPU time: T ~ M ~ ε^(-2).

=> In the limit of small ε, MC wins if D > 2α!
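A minimal sketch of these two scalings, assuming the separable test integrand Π_i cos(x_i) on the unit hypercube (exact value sin(1)^D); the integrand, function names, and grid sizes are illustrative choices, not from the slides:

    import itertools, math, random

    def trapezoid_nd(f, D, L):
        """Tensor-product trapezoidal rule with L points per dimension on [0,1]^D."""
        h = 1.0 / (L - 1)
        total = 0.0
        for idx in itertools.product(range(L), repeat=D):
            w = 1.0
            for i in idx:                      # endpoint weights are halved
                w *= 0.5 if i in (0, L - 1) else 1.0
            total += w * f([i * h for i in idx])
        return total * h**D

    def mc_nd(f, D, M, rng=random.Random(42)):
        """Plain Monte Carlo with M uniform samples on [0,1]^D."""
        return sum(f([rng.random() for _ in range(D)]) for _ in range(M)) / M

    f = lambda x: math.prod(math.cos(xi) for xi in x)
    D = 4
    exact = math.sin(1.0)**D
    print("trap error:", abs(trapezoid_nd(f, D, L=11) - exact))   # 11^4 ≈ 1.5e4 points
    print("  MC error:", abs(mc_nd(f, D, M=11**4) - exact))       # same point budget

With this budget the two errors come out roughly comparable at D = 4; raising D tilts the balance toward MC, as the D > 2α criterion predicts.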

[Figure: log(CPU time) vs. log(error), from "Good" (small error) to "Bad": curves for 2D, 4D, 6D, and 8D grid integration steepen with dimension, while the MC curve's slope is dimension-independent, so MC wins at high D.]

Other reasons to do Monte Carlo:
- Conceptually and practically simple.
- Comes with built-in error bars.

"Many methods of integration have been tried, and will be tried in this world of sin and woe. No one pretends that Monte Carlo is perfect or all-wise. Indeed, it has been said that Monte Carlo is the worst method except all those other methods that have been tried from time to time." (adapted from Churchill, 1947)

Probability Distributions

P(x)dx = probability of observing a value in (x, x+dx); P(x) is a probability distribution function (p.d.f.). x can be either a continuous or a discrete variable.

Cumulative distribution: c(y) = Prob(x < y) = ∫ P(x) dx over x < y. Useful for sampling.

Moments I_n = <x^n> = ∫ x^n P(x) dx (the average or expectation of x^n):
- Zeroth moment: I_0 = 1 (normalization).
- Mean: <x> = I_1.
- Variance: <(x - <x>)²> = I_2 - (I_1)².
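As a quick numerical check of these relations, a sketch using the exponential p.d.f. λe^(-λy) (an assumed example, not from the slide; its moments are I_1 = 1/λ and I_2 = 2/λ²):

    import random, statistics

    lam = 2.0
    ys = [random.expovariate(lam) for _ in range(200_000)]
    I1 = statistics.fmean(ys)                    # ≈ 1/λ
    I2 = statistics.fmean(y*y for y in ys)       # ≈ 2/λ²
    print("mean:", I1, "variance:", I2 - I1**2)  # variance ≈ 1/λ²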

Mappings of random variables

Let p_x(x)dx be a probability distribution, and let y = g(x) be a new variable, e.g., y = g(x) = -ln(x) with 0 < x ≤ 1, so y ≥ 0.

What is the pdf of y? Conservation of probability gives p_y(y)dy = p_x(x)dx, so p_y(y) = p_x(x) |dx/dy|. (What happens when g is not monotonic? Then one must sum over all branches x that map to the same y.)

Example: y = g(x) = -ln(x) with x uniform on (0,1]. Then y is distributed exponentially, as for the waiting time of a Poisson event; e.g., -ln(x)/λ has pdf λe^(-λy).

What is Mapping Doing?

Let p(y) = f(y), with f normalized. Define the cumulative c(y) = ∫_0^y f(y') dy'; then x = c(y), dx/dy = f(y), and y(x) = c⁻¹(x) is the functional inverse of c(y).

[Figure: the transformation method. A uniform deviate x enters on the vertical axis (0 to 1), meets the curve x = c(y), and the transformed deviate y comes out on the horizontal axis, distributed as p(y) = f(y).]

This allows a random deviate y to be drawn from a known probability distribution p(y). The indefinite integral of p(y) must be known and invertible. A uniform deviate x is chosen from (0,1) such that its corresponding y on the definite-integral curve is the desired deviate, i.e., x = c(y).

Interpreting the Mapping

Let p(y) = f(y); then y(x) = c⁻¹(x), dx/dy = f(y), and x = c(y). (Same picture as the previous slide: uniform deviate x in, transformed deviate y out.)

Since c(y) is the area under p(y) = f(y) to the left of y, the rule y(x) = c⁻¹(x) prescribes:
- Choose x in (0,1], then find the value y that has that fraction x of the area to its left, i.e., x = c(y).
- Return that value of y.

Example: drawing from the exponential distribution (waiting times between Poisson events). y = -ln(x)/λ, p(y) = λe^(-λy), and x = c(y) = 1 - e^(-λy).

Note: x = 1 - e^(-λy), so x' = 1 - x = e^(-λy) is also uniform on (0,1]. So, indeed, y(x') = -ln(x')/λ.
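A minimal sketch of this inverse-transform recipe for the exponential example (the function name and the sanity check against the known mean 1/λ are mine):

    import math, random, statistics

    def exponential_deviate(lam, rng=random.random):
        """Inverse transform: x uniform in (0,1] -> y with pdf lam*exp(-lam*y)."""
        x = 1.0 - rng()          # shift [0,1) to (0,1] so log(0) never occurs
        return -math.log(x) / lam

    lam = 0.5
    ys = [exponential_deviate(lam) for _ in range(100_000)]
    print(statistics.fmean(ys), "≈", 1/lam)   # exponential mean is 1/λ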

Example: Drawing from a Normal (Gaussian) Distribution

Box-Muller method: get random deviates with a normal distribution. Consider uniform deviates x_1 and x_2 on (0,1), and the two quantities

    y_1 = sqrt(-2 ln x_1) cos(2π x_2),  y_2 = sqrt(-2 ln x_1) sin(2π x_2).

Or, equivalently, x_1 = exp(-(y_1² + y_2²)/2) and x_2 = (1/2π) arctan(y_2/y_1), whose Jacobian factorizes into a product of two unit Gaussians. So each y is independently normally distributed.

Better (polar form): pick v_1, v_2 uniform on (-1,1) until R² = v_1² + v_2² < 1, then

    y_1 = v_1 sqrt(-2 ln R² / R²),  y_2 = v_2 sqrt(-2 ln R² / R²).

Advantage: no sine and cosine, since cos θ = v_1/√R² and sin θ = v_2/√R². And we get two normal deviates per calculation (one for now, one for next time).
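A sketch of the polar form, with the "one for now, one for next time" caching of the spare deviate (names are illustrative):

    import math, random

    _spare = None

    def gaussian_deviate(rng=random.random):
        """Polar Box-Muller: returns standard normal deviates, two per pass."""
        global _spare
        if _spare is not None:
            y, _spare = _spare, None
            return y
        while True:
            v1 = 2.0 * rng() - 1.0           # uniform on (-1, 1)
            v2 = 2.0 * rng() - 1.0
            r2 = v1*v1 + v2*v2
            if 0.0 < r2 < 1.0:               # accept points inside the unit circle
                break
        factor = math.sqrt(-2.0 * math.log(r2) / r2)
        _spare = v2 * factor                 # save one for next time
        return v1 * factor

    samples = [gaussian_deviate() for _ in range(100_000)]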

Reminder: Gauss Central Limit Theorem

Sample N values from p(x)dx, i.e., (X_1, X_2, X_3, ..., X_N). Estimate the mean from ȳ = (1/N) Σ_i X_i. What is the pdf of the mean?

Solve by Fourier transforms. If you add together two random variables, you multiply together their characteristic functions:

    c_X(k) = <e^(ikX)> = ∫ dx e^(ikx) p(x),  so  c_{X1+X2}(k) = c_{X1}(k) c_{X2}(k).

Then Taylor expand the logarithm of the characteristic function, which defines the cumulants κ_n:

    ln c_X(k) = Σ_n (ik)^n κ_n / n!

Cumulants: κ_n
- Mean = κ_1 = <x> = x̄
- Variance = κ_2 = <(x - x̄)²> = σ²
- Skewness = κ_3/σ³ = <((x - x̄)/σ)³>
- Kurtosis = κ_4/σ⁴ = <((x - x̄)/σ)⁴> - 3

What happens to the reduced moments of the mean? The n = 1 moment remains invariant; the rest get reduced by higher powers of N (the cumulants of the mean scale as κ_n/N^(n-1)). Given enough averaging, almost anything becomes a Gaussian distribution.

Approach to Normality (Gauss Central Limit Theorem)

For any population distribution with finite variance, the distribution of the mean will approach a Gaussian as N grows.
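A quick numerical illustration, assuming a uniform(0,1) population (an arbitrary choice): the spread of the sample means shrinks as σ/√N while their histogram turns Gaussian.

    import random, statistics

    def sample_mean(N, rng=random.random):
        return sum(rng() for _ in range(N)) / N

    for N in (1, 2, 10, 100):
        means = [sample_mean(N) for _ in range(50_000)]
        # uniform(0,1): mean 1/2, variance 1/12 -> std of the mean = sqrt(1/(12N))
        print(N, statistics.fmean(means), statistics.stdev(means), (1/(12*N))**0.5)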

Conditions on the Central Limit Theorem

We need the first three moments to exist:
- If I_0 is not defined -> not a pdf.
- If I_1 does not exist -> not mathematically well-posed.
- If I_2 does not exist -> infinite variance.

It is important to know whether the variance is finite for Monte Carlo. Divergence could come from the tails of the distribution: we need P(x) to fall off faster than 1/|x|³ as x -> ±∞ so that ∫ x² P(x) dx converges. Divergence could also come from singular behavior of P at finite x₀: P must remain integrable there, i.e., diverge more slowly than 1/|x - x₀|. A classic counterexample is sketched below.
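A sketch of how this failure looks in practice: the Lorentzian (Cauchy) pdf 1/(π(1+x²)) has tails ~ 1/x², so I_1 and I_2 both diverge and the running mean of samples never settles down (the sample sizes are arbitrary):

    import math, random

    def cauchy_deviate(rng=random.random):
        """Inverse transform for the Cauchy pdf 1/(pi*(1+x^2))."""
        return math.tan(math.pi * (rng() - 0.5))

    for N in (10**3, 10**4, 10**5, 10**6):
        mean = sum(cauchy_deviate() for _ in range(N)) / N
        print(N, mean)   # does not converge: the CLT conditions fail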

Multidimensional Generalization

Suppose r is an m-dimensional vector drawn from a multidimensional pdf p(r) d^m r. The mean is defined as before. The variance becomes the covariance, a positive, symmetric m × m matrix:

    V_ij = <(x_i - <x_i>)(x_j - <x_j>)>

For sufficiently large N, the estimated mean ȳ will approach the multivariate Gaussian

    p(ȳ) ∝ exp( -(N/2) (ȳ - <r>)ᵀ V⁻¹ (ȳ - <r>) )

[Figure: 2D histogram of occurrences of means, axes x_1 and x_2: an elongated, tilted cloud with its two principal axes marked.]

- Off-diagonal components of V_ij are called the covariances. Data can be uncorrelated, or positively or negatively correlated, depending on the sign of V_ij.
- Like a moment-of-inertia tensor: there are 2 principal axes with corresponding variances. Find the axes by diagonalization or singular value decomposition; they are the eigenvectors of the covariance matrix.
- Individual error bars on x_1 and x_2 can be misleading if the data are correlated.
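A sketch of this diagonalization for correlated 2D data; the synthetic data generation (x_2 leaning on x_1) is my own illustrative choice:

    import math, random

    # synthetic correlated pairs
    pts = []
    for _ in range(50_000):
        a, b = random.gauss(0, 1), random.gauss(0, 1)
        pts.append((a, 0.8*a + 0.6*b))      # x_2 correlated with x_1

    n = len(pts)
    m1 = sum(p[0] for p in pts) / n
    m2 = sum(p[1] for p in pts) / n
    v11 = sum((p[0]-m1)**2 for p in pts) / n
    v22 = sum((p[1]-m2)**2 for p in pts) / n
    v12 = sum((p[0]-m1)*(p[1]-m2) for p in pts) / n

    # eigenvalues of the symmetric 2x2 covariance matrix = principal variances
    tr, det = v11 + v22, v11*v22 - v12*v12
    disc = math.sqrt(tr*tr/4 - det)
    print("principal variances:", tr/2 + disc, tr/2 - disc)
    # orientation of the major principal axis (eigenvector direction)
    theta = 0.5 * math.atan2(2*v12, v11 - v22)
    print("principal axis angle:", theta)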