Today: Fundamentals of Monte Carlo

What is Monte Carlo?
- Named at Los Alamos in the 1940s, after the casino.
- Any method that uses (pseudo)random numbers as an essential part of the algorithm.
- Stochastic, not deterministic!
- A method for doing high-dimensional integrals by sampling the integrand.
- Often a Markov chain, called Metropolis MC.

Reading: LeSar, Chapter 7
Simple example: Buffon's needle / Monte Carlo determination of π

Consider a circle inscribed in a square, and take one quadrant. By geometry:
$$\frac{\text{area of 1/4 circle}}{\text{area of square}} = \frac{\pi r^2 / 4}{r^2} = \frac{\pi}{4}$$

Simple MC is like throwing darts at a unit-square target:
- Using a RNG uniform on (0,1), pick a pair of coordinates (x, y).
- Count the number of points (darts) in the shaded quarter circle versus the total.
- (points in shaded circle) / (points in square) ≈ π/4.

A runnable version of the dart-throwing loop (Python):

    import random

    hits = 0
    n = 1_000_000                    # number of darts
    for _ in range(n):
        x = random.random()          # uniform deviate in [0,1)
        y = random.random()
        distance2 = x**2 + y**2      # squared distance from the origin
        if distance2 <= 1:           # dart landed inside the quarter circle
            hits += 1
    pi_estimate = 4 * hits / n
MC is advantageous for high-dimensional integrals: the best general method

Consider an integral over the unit D-dimensional hypercube:
$$I = \int_0^1 dx_1 \cdots \int_0^1 dx_D \, f(x_1, \ldots, x_D)$$

By conventional deterministic methods:
- Lay out a grid with L points in each direction, with spacing $h = 1/L$.
- The number of points is $N = L^D = h^{-D}$, proportional to the CPU time.

How does the error scale with CPU time or dimensionality?
- The error in the trapezoidal rule goes as $\varepsilon \sim f''(x)\, h^2$, since $f(x) = f(x_0) + f'(x_0)\,h + \tfrac{1}{2} f''(x_0)\, h^2 + \cdots$
- So the deterministic CPU time scales as $T \sim \varepsilon^{-D/2}$ (since $\varepsilon \sim h^2$ and $T \sim h^{-D}$).
- But by sampling, $T \sim \varepsilon^{-2}$ (since $\varepsilon \sim M^{-1/2}$ and $T \sim M$ for M samples). To get another decimal place takes 100 times longer!
- Hence MC is advantageous for D > 4!
Improved Numerical Integration

Integration methods in D dimensions:
- Trapezoidal rule: $\varepsilon \sim f^{(2)}(x)\, h^2$
- Simpson's rule: $\varepsilon \sim f^{(4)}(x)\, h^4$
- Generally: $\varepsilon \sim f^{(\alpha)}(x)\, h^\alpha$

CPU time scales with the number of sample points (grid size): CPU time $\sim h^{-D}$ (e.g., $N = L$ in 1-D and $N = L^2$ in 2-D). By conventional deterministic methods, lay out a grid with L points in each direction with $h = 1/L$; the number of points is $N = L^D = h^{-D} \sim$ CPU time.

Time to do the integral: $T_{\text{int}} \sim \varepsilon^{-D/\alpha}$.

Stochastic integration (Monte Carlo):
- Monte Carlo: $\varepsilon \sim M^{-1/2}$ (inverse square root of the number of sample points)
- CPU time: $T \sim M \sim \varepsilon^{-2}$

In the limit of small $\varepsilon$, MC wins if $D > 2\alpha$!
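To make the $\varepsilon \sim M^{-1/2}$ scaling concrete, here is a minimal Python sketch (the integrand, dimension, and sample sizes are illustrative choices, not from the slides). It Monte Carlo samples a separable 6-D integral whose exact value is $\sin(1)^6$ and prints the error as M grows:

    import numpy as np

    rng = np.random.default_rng(42)
    D = 6                                # dimensionality (illustrative)
    exact = np.sin(1.0) ** D             # exact value of the separable integral

    def f(x):
        # integrand over the unit hypercube: prod_i cos(x_i);
        # each factor integrates to sin(1) on (0,1)
        return np.prod(np.cos(x), axis=-1)

    for M in [10**3, 10**4, 10**5, 10**6]:
        x = rng.random((M, D))           # M uniform points in the hypercube
        estimate = f(x).mean()           # MC estimate of the integral
        print(M, abs(estimate - exact))  # error shrinks roughly as M^{-1/2}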
[Figure: log(CPU time) versus log(error) for grid-based integration in 2D, 4D, 6D, and 8D compared with MC; at high dimension MC reaches a given error far faster.]

Other reasons to do Monte Carlo:
- Conceptually and practically simple.
- Comes with built-in error bars.

"Many methods of integration have been tried, and will be tried in this world of sin and woe. No one pretends that Monte Carlo is perfect or all-wise. Indeed, it has been said that Monte Carlo is the worst method except all those other methods that have been tried from time to time." (Churchill, 1947)
Probability Distributions

$P(x)\,dx$ = probability of observing a value in $(x, x+dx)$; $P(x)$ is a probability distribution function (p.d.f.). x can be either a continuous or a discrete variable.

Normalization and positivity: $\int dx\, P(x) = 1$, with $P(x) \ge 0$.

Cumulative distribution: probability that $x < y$. Useful for sampling:
$$c(y) = \int_{-\infty}^{y} dx\, P(x), \qquad 0 \le c(y) \le 1$$

Average or expectation:
$$\langle g(x) \rangle = \bar{g} = \int dx\, P(x)\, g(x)$$

Moments:
$$I_n = \langle x^n \rangle = \int dx\, P(x)\, x^n$$
- Zeroth moment: $I_0 = 1$
- Mean: $\langle x \rangle = I_1$
- Variance: $\langle (x - \langle x \rangle)^2 \rangle = I_2 - I_1^2$
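As a quick numeric check (my own illustration, not from the slides), the moments of the exponential pdf $P(x) = e^{-x}$ for $x \ge 0$ can be estimated from samples and compared with the exact values $I_1 = 1$, $I_2 = 2$:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1_000_000)  # samples from P(x) = e^{-x}

    I1 = x.mean()              # estimate of I_1 = <x>; exact value 1
    I2 = (x**2).mean()         # estimate of I_2 = <x^2>; exact value 2
    print(I1, I2, I2 - I1**2)  # variance I_2 - I_1^2; exact value 1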
Mappings of random variables

Let $p_x(x)\,dx$ be a probability distribution, and let $y = g(x)$ be a new variable; e.g., $y = g(x) = -\ln(x)$ with $0 < x \le 1$, so $y \ge 0$.

What is the pdf of y? Conservation of probability, $p_y(y)\,dy = p_x(x)\,dx$, gives
$$p_y(y) = p_x(x) \left|\frac{dx}{dy}\right| = p_x(x) \left|\frac{dg(x)}{dx}\right|^{-1}$$

What happens when g is not monotonic? Then one must sum the contributions from every x that maps to the same y.

Example: $y = g(x) = -\ln(x)$ with x uniform on (0,1]:
$$p_y(y)\,dy = \left|\frac{dg(x)}{dx}\right|^{-1} dy = e^{-y}\, dy$$

So y is distributed exponentially, as for a Poisson event; e.g., $y/\lambda$ has the pdf $\lambda e^{-\lambda y}$.
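A minimal sketch of this mapping in code (my own illustration): draw uniform deviates and apply $y = -\ln(x)$, then check that the sample statistics match the exponential pdf $e^{-y}$:

    import numpy as np

    rng = np.random.default_rng(1)
    u = rng.random(1_000_000)   # uniform on [0,1)
    y = -np.log1p(-u)           # -ln(1-u): same distribution as -ln(x), avoids log(0)

    print(y.mean(), y.var())    # both should be close to 1 for p(y) = e^{-y}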
Example: Drawing from a Normal (Gaussian) Distribution

For a change of variables in several dimensions, the Jacobian enters:
$$p(y_1, y_2, \ldots)\, dy_1\, dy_2 \cdots = p(x_1, x_2, \ldots) \left| \frac{\partial(x_1, x_2, \ldots)}{\partial(y_1, y_2, \ldots)} \right| dy_1\, dy_2 \cdots$$

Box-Muller method: get random deviates with the normal distribution
$$p(y)\,dy = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$$

Consider uniform deviates on (0,1), $x_1$ and $x_2$, and two quantities $y_1$ and $y_2$, where
$$y_1 = \sqrt{-2\ln x_1}\, \cos 2\pi x_2, \qquad y_2 = \sqrt{-2\ln x_1}\, \sin 2\pi x_2$$

Or, equivalently,
$$x_1 = \exp\!\left(-[y_1^2 + y_2^2]/2\right), \qquad x_2 = \frac{1}{2\pi} \arctan(y_2/y_1)$$

The Jacobian factorizes,
$$\left| \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} \right| = \frac{1}{\sqrt{2\pi}}\, e^{-y_1^2/2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-y_2^2/2}$$

so each y is independently normally distributed.

Better: pick $(v_1, v_2)$ uniformly inside the unit circle and set $R^2 = v_1^2 + v_2^2$, so that $x_1 = R^2$ and the angle of $(v_1, v_2)$ plays the role of $2\pi x_2$. Advantage: no sine and cosine, since $\cos 2\pi x_2 = v_1/\sqrt{R^2}$ and $\sin 2\pi x_2 = v_2/\sqrt{R^2}$, and we still get two normal deviates per calculation.
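A sketch of this polar variant in Python (the function name is mine; the construction itself is the standard Marsaglia polar method):

    import numpy as np

    def polar_gaussian_pair(rng):
        """Two independent standard normal deviates, with no sin/cos calls."""
        while True:
            v1, v2 = 2.0 * rng.random(2) - 1.0  # uniform in the square (-1,1)^2
            r2 = v1 * v1 + v2 * v2              # R^2 = v1^2 + v2^2
            if 0.0 < r2 < 1.0:                  # accept only points inside the unit circle
                break
        factor = np.sqrt(-2.0 * np.log(r2) / r2)  # sqrt(-2 ln R^2) / R
        return v1 * factor, v2 * factor           # two normals per accepted pair

    rng = np.random.default_rng(2)
    pairs = np.array([polar_gaussian_pair(rng) for _ in range(50_000)])
    print(pairs.mean(), pairs.var())  # should be close to 0 and 1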
Reminder: Gauss Central Limit Theorem

Sample N values from $p(x)\,dx$, i.e. $(x_1, x_2, x_3, \ldots, x_N)$. Estimate the mean as $y = (1/N)\sum_i x_i$. What is the pdf of the mean?

Solve by Fourier transforms. If you add together two independent random variables, you multiply together their characteristic functions:
$$c_x(k) = \langle e^{ikx} \rangle = \int dx\, P(x)\, e^{ikx} \quad \Rightarrow \quad c_{x+y}(k) = c_x(k)\, c_y(k)$$

Then
$$c_{x_1 + \cdots + x_N}(k) = c_x(k)^N \qquad \text{and} \qquad c_y(k) = c_x(k/N)^N$$

Taylor expand:
$$\ln c_y(k) = \sum_{n=1}^{\infty} \frac{(ik)^n}{n!}\, \bar{\kappa}_n, \qquad \bar{\kappa}_n: \text{ cumulants}$$
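A quick numeric check of the product rule (my own illustration): estimate $c(k) = \langle e^{iks} \rangle$ from samples and compare $c_{x+y}(k)$ with $c_x(k)\, c_y(k)$ for two independent uniform variables:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.random(1_000_000)   # two independent uniform deviates
    y = rng.random(1_000_000)
    k = 2.5                     # an arbitrary wavenumber

    c = lambda s: np.exp(1j * k * s).mean()  # sample estimate of <e^{iks}>
    print(c(x + y))             # characteristic function of the sum
    print(c(x) * c(y))          # product of the individual ones; should agree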
Cumulants: $\kappa_n$
- Mean: $\kappa_1 = \langle x \rangle = \bar{x}$
- Variance: $\kappa_2 = \langle (x - \bar{x})^2 \rangle = s^2$
- Skewness: $\kappa_3 / s^3 = \langle ((x - \bar{x})/s)^3 \rangle$
- Kurtosis: $\kappa_4 / s^4 = \langle ((x - \bar{x})/s)^4 \rangle - 3$

What happens to the reduced moments? For the mean of N samples,
$$\bar{\kappa}_n = \kappa_n\, N^{1-n}$$
The n = 1 moment remains invariant; the rest get reduced by higher powers of N.

$$\lim_{N \to \infty} c_y(k) = e^{\,ik\kappa_1 - k^2 \kappa_2 / 2N - ik^3 \kappa_3 / 6N^2 - \cdots}$$

$$P(y) = \left( \frac{N}{2\pi \kappa_2} \right)^{1/2} \exp\!\left[ -\frac{N (y - \kappa_1)^2}{2 \kappa_2} \right]$$

Given enough averaging, almost anything becomes a Gaussian distribution.
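An illustration of the $\bar{\kappa}_n = \kappa_n N^{1-n}$ scaling (my own example): an exponential population has skewness 2, so the skewness of the sample mean should fall off as $2/\sqrt{N}$, since $\bar{\kappa}_3/\bar{\kappa}_2^{3/2} = (\kappa_3/\kappa_2^{3/2})\, N^{-1/2}$:

    import numpy as np

    rng = np.random.default_rng(4)
    for N in [10, 100, 1000]:
        # 20,000 independent sample means, each over N exponential deviates
        means = rng.exponential(size=(20_000, N)).mean(axis=1)
        z = (means - means.mean()) / means.std()
        print(N, (z**3).mean(), 2 / np.sqrt(N))  # sample skewness vs. 2/sqrt(N)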
Approach to normality (Gauss Central Limit Theorem): for any population distribution, the distribution of the mean will approach a Gaussian.
Conditions on the Central Limit Theorem

We need the first three moments to exist:
- If $I_0$ is not defined, it is not a pdf.
- If $I_1$ does not exist, the problem is not mathematically well-posed.
- If $I_2$ does not exist, the variance is infinite.

It is important to know whether the variance is finite for Monte Carlo. We need $I_n = \langle x^n \rangle = \int dx\, P(x)\, x^n$ to exist up to $I_2 = \langle x^2 \rangle = \int dx\, P(x)\, x^2$.

Divergence could happen because of the tails of the distribution (we need $\lim_{x \to \pm\infty} x^3 P(x) \to 0$) or because of singular behavior of P at finite x (we need $\lim_{x \to 0} x\, P(x) \to 0$). A cautionary example follows below.
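A cautionary sketch (my example, not from the slides): the Lorentzian/Cauchy distribution has no finite variance, nor even a well-defined mean, so the sample mean never settles down no matter how many points are averaged:

    import numpy as np

    rng = np.random.default_rng(5)
    for N in [10**2, 10**4, 10**6]:
        x = rng.standard_cauchy(N)  # heavy tails: <x> and <x^2> diverge
        print(N, x.mean())          # does NOT converge as N grows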
Multidimensional Generalization

Suppose $\mathbf{r}$ is an m-dimensional vector drawn from a multidimensional pdf $p(\mathbf{r})\, d^m r$. The mean is defined as before. The variance becomes the covariance, a positive, symmetric $m \times m$ matrix:
$$\nu_{ij} = \langle (x_i - \langle x_i \rangle)(x_j - \langle x_j \rangle) \rangle$$

For sufficiently large N, the estimated mean $\mathbf{y}$ will approach the distribution
$$P(\mathbf{y})\, d\mathbf{y} = \left( \frac{N}{2\pi} \right)^{m/2} \frac{1}{\sqrt{\det \nu}}\, \exp\!\left[ -\frac{N}{2}\, (y_i - \langle y_i \rangle)\, \nu^{-1}_{ij}\, (y_j - \langle y_j \rangle) \right] d\mathbf{y}$$
[Figure: 2-D histogram of occurrences of the means, plotted in the $(x_1, x_2)$ plane.]

- Off-diagonal components of $\nu_{ij}$ are called the covariance.
- Data can be uncorrelated, or positively or negatively correlated, depending on the sign of $\nu_{ij}$.
- Like a moment-of-inertia tensor, there are 2 principal axes with associated variances; find the axes by diagonalization or singular value decomposition (see the sketch below).
- Individual error bars on $x_1$ and $x_2$ can be misleading if the data are correlated.
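A minimal sketch (my own construction) that estimates the covariance matrix of correlated 2-D data and finds the principal axes and variances by diagonalization:

    import numpy as np

    rng = np.random.default_rng(6)
    # correlated 2-D data: x2 is built partly from x1 (illustrative construction)
    x1 = rng.normal(size=100_000)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=100_000)

    cov = np.cov(np.stack([x1, x2]))       # estimated covariance matrix nu_ij
    variances, axes = np.linalg.eigh(cov)  # principal variances and axes
    print(cov)
    print(variances)  # variances along the two principal directions
    print(axes)       # columns are the principal axes (eigenvectors)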