STA4502 (Monte Carlo Estimation) Lecture Notes, Jan/Feb 2013


by Jeffrey S. Rosenthal, University of Toronto
(Last updated: February 5, 2013.)

Note: I will update these notes regularly (on the course web page). However, they are just rough, point-form notes, with no guarantee of completeness or accuracy. They should in no way be regarded as a substitute for attending the lectures, doing the homework exercises, or reading the reference books.

INTRODUCTION:

Introduction to course, handout, references, prerequisites, etc.

Course web page: probability.ca/sta4502

Six weeks only; counts for QUARTER-credit only. Bahen Centre room 1200, Wednesdays and Fridays 2.

If not a Stat Dept grad student, must REQUEST enrolment; need advanced undergraduate probability/statistics background, plus basic computer programming experience. Conversely, if you already know lots about MCMC etc., then this course might not be right for you, since it's an INTRODUCTION to these topics.

How many of you are stat grad students? undergrads? math? computer science? physics? economics? management? engineering? other? Auditing??

Theme of the course: use (pseudo)randomness on a computer to simulate, and hence estimate, important/interesting quantities.

Example: Suppose we want to estimate m := E[Z^4 cos(Z)], where Z ~ Normal(0,1).

Monte Carlo solution: replicate a large number z_1, ..., z_n of Normal(0,1) random variables, and let x_i = z_i^4 cos(z_i). Their mean, x̄ = (1/n) sum_{i=1}^n x_i, is an unbiased estimate of E[X] ≡ E[Z^4 cos(Z)].

R: Z = rnorm(100); X = Z^4 * cos(Z); mean(X) [file "RMC"]

unstable ... but if we replace 100 by a much larger number, then x̄ is close to the true value, about −1.21.
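(Aside: a minimal self-contained R sketch of this estimate; the actual course file "RMC" is not reproduced here, so the details are illustrative.)

    n <- 10^6                 # number of Monte Carlo replications
    Z <- rnorm(n)             # Z_1, ..., Z_n ~ Normal(0,1)
    X <- Z^4 * cos(Z)         # X_i = Z_i^4 cos(Z_i)
    mean(X)                   # unbiased estimate of m = E[Z^4 cos(Z)]; about -1.21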

Variability?? Well, can estimate the standard deviation of x̄ by the standard error of x̄, which is:

se = n^{−1/2} sd(x_1, ..., x_n) ≈ n^{−1/2} sqrt(var(x)) = n^{−1/2} sqrt( (1/(n−1)) sum_{i=1}^n (x_i − x̄)^2 ). [file "RMC"]

Then what is, say, a 95% confidence interval for m? Well, by the central limit theorem (CLT), for large n, x̄ ≈ N(m, v) ≈ N(m, se^2). (Strictly speaking, should use the t distribution, not the normal distribution ... but if n is large that doesn't really matter; ignore it for now.) So (m − x̄)/se ≈ N(0,1). So, P(−1.96 < (m − x̄)/se < 1.96) ≈ 0.95. So, P(x̄ − 1.96 se < m < x̄ + 1.96 se) ≈ 0.95, i.e. an approximate 95% confidence interval is (x̄ − 1.96 se, x̄ + 1.96 se). [file "RMC"]

Alternatively, could compute the expectation as ∫ z^4 cos(z) e^{−z^2/2} (2π)^{−1/2} dz. Analytic? Numerical? Better? Worse? [file "RMC": −1.213]

What about higher-dimensional versions? Can't do numerical integration!

Note: In this course we will just use R to automatically sample from simple distributions like Normal, Uniform, Exponential, etc. How does it work? Discussed in e.g. a Statistical Computing course.

What if the distribution is too complicated to sample from? MCMC! ... including Metropolis, Gibbs, tempered, trans-dimensional, ...
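(Aside: continuing the R sketch above (same Z, X, n), the standard error and approximate 95% confidence interval just derived; again an illustration, not the course's actual "RMC" file.)

    se <- sd(X) / sqrt(n)            # se = n^{-1/2} sd(x_1, ..., x_n)
    mean(X) + c(-1.96, 1.96) * se    # interval (xbar - 1.96 se, xbar + 1.96 se)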

MONTE CARLO INTEGRATION:

How to compute an integral with Monte Carlo? Re-write it as an expectation!

EXAMPLE: Want to compute ∫_0^1 ∫_0^1 g(x,y) dx dy. Regard this as E[g(X,Y)], where X, Y are i.i.d. Uniform[0,1]. e.g. g(x,y) = cos(sqrt(xy)). [file "RMCint"] Easy! Get about 0.88, with small standard error; Mathematica gives the same.

e.g. estimate I = ∫_0^5 ∫_0^4 g(x,y) dy dx, where g(x,y) = cos(sqrt(xy)). Here ∫_0^5 ∫_0^4 g(x,y) dy dx = 5 · 4 · ∫_0^5 ∫_0^4 g(x,y) (1/4) dy (1/5) dx = E[5 · 4 · g(X,Y)], where X ~ Uniform[0,5] and Y ~ Uniform[0,4]. So, let X_i ~ Uniform[0,5], and Y_i ~ Uniform[0,4] (all independent). Estimate I by (1/M) sum_{i=1}^M 5 · 4 · g(X_i, Y_i). Standard error: se = M^{−1/2} sd( 5·4·g(X_1,Y_1), ..., 5·4·g(X_M,Y_M) ). With M = 10^6, get about −4.1. [file "RMCint2"]

e.g. estimate ∫_0^1 ∫_0^∞ h(x,y) dy dx, where h(x,y) = e^{−y^2} cos(sqrt(xy)). Can't use Uniform expectations. Instead, write this as ∫_0^1 ∫_0^∞ e^{y} h(x,y) e^{−y} dy dx. This is the same as E[e^Y h(X,Y)], where X ~ Uniform[0,1] and Y ~ Exponential(1) are independent. So, estimate it by (1/M) sum_{i=1}^M e^{Y_i} h(X_i, Y_i), where X_i ~ Uniform[0,1] and Y_i ~ Exponential(1) (i.i.d.). With M = 10^6 get about 0.77, with very small standard error: very accurate! [file "RMCint3"] Check: numerical integration [Mathematica] gives the same.

Alternatively, could write this as ∫_0^1 ∫_0^∞ (1/5) e^{5y} h(x,y) · 5 e^{−5y} dy dx = E[(1/5) e^{5Y} h(X,Y)], where X ~ Uniform[0,1] and Y ~ Exponential(5) (indep.).
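(Aside: a sketch of this exponential importance recipe, with the rate λ left as a parameter so both the Exponential(1) and Exponential(5) versions can be tried; the helper name est is an illustrative choice, and the course files "RMCint3"/"RMCint4" are not reproduced.)

    # Estimate int_0^1 int_0^inf e^{-y^2} cos(sqrt(x*y)) dy dx,
    # written as E[(1/lambda) e^{lambda Y} h(X,Y)], X ~ U[0,1], Y ~ Exponential(lambda).
    est <- function(lambda, M = 10^6) {
        X <- runif(M)
        Y <- rexp(M, rate = lambda)
        vals <- exp(lambda * Y) / lambda * exp(-Y^2) * cos(sqrt(X * Y))
        c(estimate = mean(vals), se = sd(vals) / sqrt(M))
    }
    est(1)   # the Exponential(1) version
    est(5)   # the Exponential(5) version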

Then, estimate it by (1/M) sum_{i=1}^M (1/5) e^{5 y_i} h(x_i, y_i), where x_i ~ Uniform[0,1] and y_i ~ Exponential(5) (i.i.d.). With M = 10^6, get a larger standard error ... [file "RMCint4"] If we replace 5 by 1/5, get about the same. So which choice is best? Whichever one minimises the standard error! (λ = 1.5? se = ?)

[END WEDNESDAY #1]

In general, to evaluate I ≡ E_π[h(Y)] = ∫ h(y) π(y) dy, where Y has density π, could instead re-write this as I = ∫ h(x) (π(x)/f(x)) f(x) dx, where f is easily sampled from, with f(x) > 0 whenever π(x) > 0. Then I = E_f[ h(X) π(X)/f(X) ], where X has density f. ("Importance Sampling")

Can then do classical iid Monte Carlo integration, get standard errors etc. Good if it is easier to sample from f than from π, and/or if the function h(x) π(x)/f(x) is less variable than h itself. In general, best to make h(x) π(x)/f(x) approximately constant.

e.g. extreme case: if I = ∫_0^∞ e^{−3x} dx, then I = ∫_0^∞ (1/3) · 3 e^{−3x} dx = E[1/3] where X ~ Exponential(3), so I = 1/3 (error = 0; no MC needed).

UNNORMALISED DENSITIES:

Suppose now that π(y) = c g(y), where we know g but don't know c or π. ("Unnormalised density", e.g. Bayesian posterior.) Obviously, c = 1 / ∫ g(y) dy, but this might be hard to compute.

Still, I = ∫ h(x) π(x) dx = c ∫ h(x) g(x) dx = ∫ h(x) g(x) dx / ∫ g(x) dx.

If we sample {x_i} ~ f i.i.d., then ∫ h(x) g(x) dx = ∫ [h(x) g(x)/f(x)] f(x) dx = E[ h(X) g(X)/f(X) ] where X ~ f. So, ∫ h(x) g(x) dx ≈ (1/M) sum_{i=1}^M h(x_i) g(x_i)/f(x_i).

Similarly, ∫ g(x) dx ≈ (1/M) sum_{i=1}^M g(x_i)/f(x_i). So,

I ≈ [ sum_{i=1}^M h(x_i) g(x_i)/f(x_i) ] / [ sum_{i=1}^M g(x_i)/f(x_i) ]. ("Importance Sampling": weighted average)

Not unbiased, and standard errors are less clear, but still consistent. Good to use the same sample {x_i} for both numerator and denominator: easier computationally, and smaller variance.

Example: compute I ≡ E(Y^2) where Y has density c y^3 sin(y^4) cos(y^5) 1_{0<y<1}, where c > 0 is unknown (and hard to compute!). Here g(y) = y^3 sin(y^4) cos(y^5) 1_{0<y<1}, and h(y) = y^2.

Let f(y) = 6 y^5 1_{0<y<1}. [Fact (check!): if U ~ Uniform[0,1], then X = U^{1/6} ~ f.] Then

I ≈ [ sum h(x_i) g(x_i)/f(x_i) ] / [ sum g(x_i)/f(x_i) ] = [ sum sin(x_i^4) cos(x_i^5) ] / [ sum sin(x_i^4) cos(x_i^5) / x_i^2 ]. [file "Rimp"] ... get about 0.76.

Or, let f(y) = 4 y^3 1_{0<y<1}. [Then if U ~ Uniform[0,1], X = U^{1/4} ~ f.] Then

I ≈ [ sum h(x_i) g(x_i)/f(x_i) ] / [ sum g(x_i)/f(x_i) ] = [ sum sin(x_i^4) cos(x_i^5) x_i^2 ] / [ sum sin(x_i^4) cos(x_i^5) ]. [file "Rimp2"]

[END FRIDAY #1]

With importance sampling, is it important to use the same samples {x_i} in both numerator and denominator? What if independent samples are used instead? Let's try it! [file "Rimpind"] Both ways work, but usually the same samples work better.

What other methods are available to iid sample from π?

REJECTION SAMPLER:

Assume π(x) = c g(x), with π and c unknown, and g known but hard to sample from. Want to sample X ~ π. (Then if X_1, X_2, ..., X_M ~ π iid, can estimate E_π(h) by (1/M) sum_{i=1}^M h(X_i), etc.)

Find some other, easily-sampled density f, and known K > 0, such that K f(x) ≥ g(x) for all x. Sample X ~ f, and U ~ Uniform[0,1] (indep.). If U ≤ g(X) / (K f(X)), then "accept" X as a draw from π. Otherwise, "reject" X and start over again.

Now, P(U ≤ g(X)/(K f(X)) | X = x) = g(x)/(K f(x)), so conditional on accepting, we have that

P( X ≤ y | U ≤ g(X)/(K f(X)) ) = P( X ≤ y, U ≤ g(X)/(K f(X)) ) / P( U ≤ g(X)/(K f(X)) )
= ∫_{−∞}^y [g(x)/(K f(x))] f(x) dx / ∫_{−∞}^∞ [g(x)/(K f(x))] f(x) dx
= ∫_{−∞}^y g(x) dx / ∫_{−∞}^∞ g(x) dx = ∫_{−∞}^y π(x) dx.

So, conditional on accepting, X ~ π. Good! iid! However, the prob. of accepting may be very small; then get very few samples.

Example: π = N(0,1), i.e. g(x) = π(x) = (2π)^{−1/2} exp(−x^2/2). Want: E_π(X^4), i.e. h(x) = x^4. Let f be the double-exponential distribution, i.e. f(x) = (1/2) e^{−|x|}. If K = 8, then:

For |x| ≤ 2: K f(x) = 8 (1/2) exp(−|x|) ≥ 8 (1/2) exp(−2) ≥ (2π)^{−1/2} ≥ π(x) = g(x).
For |x| ≥ 2: K f(x) = 8 (1/2) exp(−|x|) ≥ 8 (1/2) exp(−x^2/2) ≥ (2π)^{−1/2} exp(−x^2/2) = π(x) = g(x).

So, can apply the rejection sampler with this f and K, to get samples, estimate of E[X], estimate of E[h(X)], estimate of P[X < 1], etc. [file "Rrej"]

For the Rejection Sampler, P(accept) = E[P(accept | X)] = E[ g(X)/(K f(X)) ] = ∫ [g(x)/(K f(x))] f(x) dx = (1/K) ∫ g(x) dx = 1/(cK). Only depends on K (and c), not f. So, in M attempts, get about M/(cK) iid samples.

"Rrej" example: c = 1, K = 8, M = 10,000, so get about M/8 = 1250 samples.

Since c is fixed, try to minimise K.
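(Aside: a sketch of this rejection sampler, in the spirit of the file "Rrej", which is not reproduced here.)

    M <- 10^4
    g <- function(x) exp(-x^2/2) / sqrt(2*pi)         # target N(0,1); normalised, so c = 1
    f <- function(x) exp(-abs(x)) / 2                 # double-exponential density
    K <- 8
    X <- rexp(M) * sample(c(-1, 1), M, replace=TRUE)  # draws from f: Exp(1) magnitude, random sign
    U <- runif(M)
    keep <- X[U <= g(X) / (K * f(X))]                 # accepted samples
    length(keep)                                      # about M/(cK) = 10000/8 = 1250
    mean(keep^4)                                      # estimate of E[X^4] (true value 3)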

Extreme case: f(x) = π(x), so g(x) = π(x)/c = f(x)/c, and can take K = 1/c, whence P(accept) = 1: iid sampling; optimal.

Note: these algorithms all work in the discrete case too. Can let π, f, etc. be probability functions, i.e. probability densities with respect to counting measure. Then the algorithms proceed exactly as before.

AUXILIARY VARIABLE APPROACH: (related: "slice sampler")

Suppose π(x) = c g(x), and (X, Y) is chosen uniformly under the graph of g, i.e. (X, Y) ~ Uniform{ (x,y) ∈ R^2 : 0 ≤ y ≤ g(x) }. Then X ~ π, i.e. we have sampled from π.

Why? For a < b, P(a < X < b) = (area with a < x < b) / (total area) = ∫_a^b g(x) dx / ∫ g(x) dx = ∫_a^b π(x) dx.

So, if we repeat, we get i.i.d. samples from π, and can estimate E_π(h) etc.

Auxiliary Variable rejection sampler: If the support of g is contained in [L, R], and g(x) ≤ K, then can first sample (X, Y) ~ Uniform( [L,R] × [0,K] ), then reject if Y > g(X), otherwise accept as a sample with (X, Y) ~ Uniform{ (x,y) : 0 ≤ y ≤ g(x) }, hence X ~ π.

Example: g(y) = y^3 sin(y^4) cos(y^5) 1_{0<y<1}. Then L = 0, R = 1, K = 1. So, sample X, Y ~ Uniform[0,1], then keep X iff Y ≤ g(X). If h(y) = y^2, could compute e.g. E_π(h) as the mean of the squares of the accepted samples. [file "Raux"]
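(Aside: a sketch of this auxiliary-variable sampler, in the spirit of the file "Raux", which is not reproduced here.)

    M <- 10^6
    g <- function(y) y^3 * sin(y^4) * cos(y^5)  # unnormalised density on (0,1); 0 <= g <= 1 there
    X <- runif(M); Y <- runif(M)                # (X,Y) ~ Uniform([0,1] x [0,1]); L=0, R=1, K=1
    keep <- X[Y <= g(X)]                        # accept iff Y <= g(X); then keep is iid ~ pi
    mean(keep^2)                                # estimate of E_pi(h), h(y) = y^2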

Can iid / importance / rejection / auxiliary sampling solve all problems? No! Many challenging cases arise, e.g. from Bayesian statistics (later). Some are high-dimensional, and the above algorithms fail. Alternative algorithm: MCMC!

MARKOV CHAIN MONTE CARLO (MCMC):

Suppose we have a complicated, high-dimensional density π = c g. Want samples X_1, X_2, ... ~ π. (Then can do Monte Carlo.) Define a Markov chain (random process) X_0, X_1, X_2, ..., so that for large n, X_n ≈ π.

METROPOLIS ALGORITHM (1953): Choose some initial value X_0 (perhaps random). Then, given X_{n−1}, choose a proposal move Y_n ~ MVN(X_{n−1}, σ^2 I) (say). Let A_n = π(Y_n)/π(X_{n−1}) = g(Y_n)/g(X_{n−1}), and U_n ~ Uniform[0,1]. Then, if U_n < A_n, set X_n = Y_n ("accept"); otherwise set X_n = X_{n−1} ("reject"). Repeat, for n = 1, 2, 3, ..., M.

Note: only need to compute the ratios π(Y_n)/π(X_{n−1}), so multiplicative constants cancel.

Fact: Then, for large n, have X_n ≈ π. ["rwm.html" Java applet]

[END WEDNESDAY #2]

Handouts: class homework, project, participation. (Also on course web page.)

Then can estimate E_π(h) ≡ ∫ h(x) π(x) dx by: E_π(h) ≈ (1/(M−B)) sum_{i=B+1}^M h(X_i), where B ("burn-in") is chosen large enough so X_B ≈ π, and M is chosen large enough to get good Monte Carlo estimates.

Aside: if we accepted all proposals, then we would have a random walk Markov chain. So, this is called the random walk Metropolis (RWM) algorithm.

How large B? Difficult to say! Some theory ... active area of research [see e.g. Rosenthal, "Quantitative convergence rates of Markov chains: A simple account", Elec Comm Prob 2002, on instructor's web page] ... usually use trial-and-error ...
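(Aside: a minimal sketch of a random walk Metropolis run, for the one-dimensional unnormalised density g used in the example below; in the spirit of the file "Rmet", which is not reproduced. The run length, burn-in, and initial value are illustrative choices.)

    g <- function(y) ifelse(y > 0 & y < 1, y^3 * sin(y^4) * cos(y^5), 0)  # unnormalised target
    M <- 10^5; B <- 10^3                  # run length and burn-in (illustrative)
    x <- 0.5                              # initial value X_0
    xlist <- numeric(M)
    for (n in 1:M) {
        y <- x + rnorm(1)                 # proposal Y_n ~ N(X_{n-1}, 1)
        if (runif(1) < g(y) / g(x))       # A_n = g(Y_n)/g(X_{n-1}); constants cancel
            x <- y                        # accept; otherwise X_n = X_{n-1}
        xlist[n] <- x
    }
    mean(xlist[(B+1):M]^2)                # estimate of E_pi(h), h(y) = y^2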

What initial value X_0? Virtually any one will do, but central ones are best. Ideal: "overdispersed starting distribution", i.e. choose X_0 randomly from some distribution that covers the "important" part of the state space.

EXAMPLE: g(y) = y^3 sin(y^4) cos(y^5) 1_{0<y<1}. Want to compute (again!) E_π(h) where h(y) = y^2. Use the Metropolis algorithm with proposal Y ~ N(X, 1). [file "Rmet"] Works pretty well, but lots of variability! Plot: appears to have good "mixing" ...

EXAMPLE: π(x_1, x_2) = C cos(sqrt(x_1 x_2)) I(0 ≤ x_1 ≤ 5, 0 ≤ x_2 ≤ 4). Want to compute E_π(h), where h(x_1, x_2) = e^{x_1} + x_2^2. Metropolis algorithm ... works ... gets between about 34 and 43, but with large uncertainty ... [file "Rmet2"] Mathematica gets 38.7. Individual plots appear to have good mixing ... Joint plot shows fewer samples where sqrt(x_1 x_2) ≈ π/2 ≈ 1.57 (where the density is near zero).

[END FRIDAY #2]

(e.g. if π(x) = exp( −sum_{i<j} |x_j − x_i| ), then log π(x) = −sum_{i<j} |x_j − x_i|: often easier to compute; more on logarithms later.)

OPTIMAL SCALING:

Can change the proposal distribution to Y_n ~ MVN(X_{n−1}, σ^2 I) for any σ > 0. Which is best? If σ is too small, then we usually accept, but the chain won't move much. If σ is too large, then we will usually reject proposals, so the chain still won't move much. Optimal: need σ "just right" to avoid both extremes. ("Goldilocks Principle") Can experiment ... ["rwm.html" applet; files "Rmet", "Rmet2" ...]

Some theory ... limited ... active area of research ... General principle: the acceptance rate should be far from 0 and far from 1. In a certain idealised high-dimensional limit, the optimal acceptance rate is 0.234 (!).

[Roberts et al., Ann Appl Prob 1997; Roberts and Rosenthal, Stat Sci 2001]

MCMC STANDARD ERROR:

What about the standard error, i.e. uncertainty? It's usually larger than in the iid case, due to correlations, and harder to quantify.

Simplest: re-run the chain many times, with the same M and B, with different initial values drawn from some overdispersed starting distribution, and compute the standard error of the sequence of estimates. Then can analyse the estimates obtained as iid ...

But how to estimate the standard error from a single run? i.e., how to estimate v ≡ Var( (1/(M−B)) sum_{i=B+1}^M h(X_i) )?

Let h̄(x) = h(x) − E_π(h), so E_π(h̄) = 0. And assume B is large enough that X_i ≈ π for i > B. Then, for large M−B,

v ≈ E_π[ ( (1/(M−B)) sum_{i=B+1}^M h(X_i) − E_π(h) )^2 ] = E_π[ ( (1/(M−B)) sum_{i=B+1}^M h̄(X_i) )^2 ]
= (M−B)^{−2} [ (M−B) E_π( h̄(X_i)^2 ) + 2(M−B−1) E_π( h̄(X_i) h̄(X_{i+1}) ) + 2(M−B−2) E_π( h̄(X_i) h̄(X_{i+2}) ) + ... ]
≈ (1/(M−B)) [ Var_π(h) + 2 Cov_π( h(X_i), h(X_{i+1}) ) + 2 Cov_π( h(X_i), h(X_{i+2}) ) + ... ]
= (1/(M−B)) Var_π(h) [ 1 + 2 Corr_π( h(X_i), h(X_{i+1}) ) + 2 Corr_π( h(X_i), h(X_{i+2}) ) + ... ]
= (1/(M−B)) Var_π(h) · varfact = (iid variance) × varfact,

where varfact = 1 + 2 sum_{k=1}^∞ Corr_π( h(X_0), h(X_k) ) ≡ 1 + 2 sum_{k=1}^∞ ρ_k

(the "integrated auto-correlation time"). Also varfact = 2 sum_{k=0}^∞ ρ_k − 1.

Then can estimate both the iid variance, and varfact, from the sample run, as usual. Note: to compute varfact, don't sum over all k; just sum e.g. until, say, ρ_k < 0.05, or ρ_k < 0, or ... Can use R's built-in "acf" function, or can write your own (better!).

Then standard error = se = sqrt(v) = (iid-se) × sqrt(varfact). e.g. the files "Rmet" and "Rmet2" [modified]. Recall: the true answers are about 0.76 and 38.7, respectively.

Usually varfact > 1; try to get "better" chains so varfact is smaller. (Sometimes even try to design the chain to get varfact < 1: "antithetic".)

CONFIDENCE INTERVALS:

Suppose we estimate u ≡ E_π(h) by the quantity e = (1/(M−B)) sum_{i=B+1}^M h(X_i), and obtain an estimate e and an approximate variance (as above) v. Then what is, say, a 95% confidence interval for u?

Well, if we have a central limit theorem (CLT), then for large M−B, e ≈ N(u, v). So (e−u)/v^{1/2} ≈ N(0,1). So, P(−1.96 < (e−u)/v^{1/2} < 1.96) ≈ 0.95. So, P(−1.96 sqrt(v) < e−u < 1.96 sqrt(v)) ≈ 0.95, i.e. with prob ≈ 95%, the interval (e − 1.96 sqrt(v), e + 1.96 sqrt(v)) will contain u. (Again, strictly speaking, should use the t distribution, not the normal distribution ... but if M−B is large that doesn't really matter; ignore it for now.) e.g. the files "Rmet" and "Rmet2" [modified]. Recall: the true answers are about 0.76 and 38.7, respectively.

But does a CLT even hold??

[END WEDNESDAY #3]
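(Aside: a sketch of the single-run standard error computation above, continuing the Metropolis sketch earlier (it assumes xlist, B, M from there); summing the sample autocorrelations until ρ_k < 0.05 is just one of the cutoffs mentioned above.)

    varfact <- function(hvals, maxlag = 1000) {
        rho <- acf(hvals, lag.max = maxlag, plot = FALSE)$acf[-1]  # rho_1, rho_2, ...
        cut <- which(rho < 0.05)[1]                 # first k with rho_k < 0.05
        if (!is.na(cut)) rho <- rho[seq_len(cut)]   # sum only up to there
        1 + 2 * sum(rho)                            # varfact = 1 + 2 sum_k rho_k
    }
    hvals <- xlist[(B+1):M]^2                       # h(X_i) values from the earlier run
    se <- sd(hvals) / sqrt(length(hvals)) * sqrt(varfact(hvals))  # se = iid-se * sqrt(varfact)
    mean(hvals) + c(-1.96, 1.96) * se               # approximate 95% confidence interval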

But does a CLT even hold?? Does not follow from the classical i.i.d. CLT. Does not always hold. But often does. For example, a CLT holds if the chain is "geometrically ergodic" (later!) and E_π|h|^{2+δ} < ∞ for some δ > 0. (If the chain is also reversible, then don't need the δ: Roberts and Rosenthal, "Geometric ergodicity and hybrid Markov chains", ECP 1997.)

So MCMC is more complicated than standard Monte Carlo. Why should we bother? Some problems are too challenging for other methods. For example ...

BAYESIAN STATISTICS:

Have unknown parameters θ, and a statistical model ("likelihood function") for how the distribution of the data Y depends on θ: L(Y | θ). Have a "prior" distribution, representing our initial (subjective?) probabilities for θ: L(θ). Combining these gives a full joint distribution for θ and Y, i.e. L(θ, Y). The "posterior" distribution of θ, π(θ), is then the conditional distribution of θ, conditioned on the observed data y, i.e. π(θ) = L(θ | Y = y).

In terms of densities: if we have prior density f_θ(θ), and likelihood f_{Y|θ}(y, θ), then the joint density is f_{θ,Y}(θ, y) = f_θ(θ) f_{Y|θ}(y, θ), and the posterior density is π(θ) = f_{θ,Y}(θ, y) / f_Y(y) = c f_{θ,Y}(θ, y) = c f_θ(θ) f_{Y|θ}(y, θ).

Bayesian Statistics Example: VARIANCE COMPONENTS MODEL (a.k.a. "random effects model"):

Suppose some population has overall mean µ (unknown). The population consists of K groups. Observe Y_{i1}, ..., Y_{iJ_i} from group i, for 1 ≤ i ≤ K. Assume Y_{ij} ~ N(θ_i, W) (cond. ind.), where the θ_i and W are unknown.

Assume the different θ_i are linked by θ_i ~ N(µ, V) (cond. ind.), with µ and V also unknown. Want to estimate some or all of V, W, µ, θ_1, ..., θ_K.

Bayesian approach: use prior distributions, e.g. (conjugate): V ~ IG(a_1, b_1); W ~ IG(a_2, b_2); µ ~ N(a_3, b_3), where the a_i, b_i are known constants, and IG(a, b) is the inverse gamma distribution, with density (b^a / Γ(a)) e^{−b/x} x^{−a−1} for x > 0.

Combining the above dependencies, we see that the joint density is (for V, W > 0):

f(V, W, µ, θ_1, ..., θ_K, Y_{11}, Y_{12}, ..., Y_{K J_K})
= C_1 e^{−b_1/V} V^{−a_1−1} e^{−b_2/W} W^{−a_2−1} e^{−(µ−a_3)^2 / 2b_3} prod_{i=1}^K [ V^{−1/2} e^{−(θ_i−µ)^2 / 2V} ] prod_{i=1}^K prod_{j=1}^{J_i} [ W^{−1/2} e^{−(Y_{ij}−θ_i)^2 / 2W} ]
= C_2 e^{−b_1/V} V^{−a_1−1} e^{−b_2/W} W^{−a_2−1} e^{−(µ−a_3)^2 / 2b_3} V^{−K/2} W^{−(1/2) sum_{i=1}^K J_i} exp( −sum_{i=1}^K (θ_i−µ)^2 / 2V − sum_{i=1}^K sum_{j=1}^{J_i} (Y_{ij}−θ_i)^2 / 2W ).

Then π(V, W, µ, θ_1, ..., θ_K) = C_3 e^{−b_1/V} V^{−a_1−1} e^{−b_2/W} W^{−a_2−1} e^{−(µ−a_3)^2 / 2b_3} prod_{i=1}^K [ V^{−1/2} e^{−(θ_i−µ)^2 / 2V} ] prod_{i=1}^K prod_{j=1}^{J_i} [ W^{−1/2} e^{−(Y_{ij}−θ_i)^2 / 2W} ].

COMMENT: For big complicated π, often better to work with the LOGARITHMS, i.e. accept if log(U_n) < log(A_n) = log π(Y_n) − log π(X_{n−1}). Then only need to compute log π(x), which could be easier / finite.

[END FRIDAY #3]

Bayesian Statistics Example: VARIANCE COMPONENTS MODEL (cont'd):

After a bit of simplifying,

π(V, W, µ, θ_1, ..., θ_K) = C e^{−b_1/V} V^{−a_1−1} e^{−b_2/W} W^{−a_2−1} e^{−(µ−a_3)^2 / 2b_3} V^{−K/2} W^{−(1/2) sum_{i=1}^K J_i} exp( −sum_{i=1}^K (θ_i−µ)^2 / 2V − sum_{i=1}^K sum_{j=1}^{J_i} (Y_{ij}−θ_i)^2 / 2W ).

Better to program on the log scale: log π(V, W, µ, θ_1, ..., θ_K) = ....

Dimension: d = K + 3; e.g. K = 19, d = 22.

How to compute/estimate, say, E_π(W/V)? Or the sensitivity to the choice of e.g. b_1? Numerical integration? No: too high-dimensional! Importance sampling? Perhaps, but what f? Not very efficient! Rejection sampling? What f? What K? Virtually no samples!

Many applications, e.g.: Predicting success at law school (D. Rubin, JASA 1980), K = 82 schools. Melanoma recurrence (.../s24/Analysing Clinicals24 Bartolucci.pdf), K = 9 patient categories. Comparing baseball home-run hitters (J. Albert, The American Statistician 1992), K = 12 players. Analysing fabric dyes (Davies 1967; Box/Tiao 1973; Gelfand/Smith JASA 1990), K = 6 batches of dyestuff. [data in file "Rdye"]

INDEPENDENCE SAMPLER:

Recall: with random-walk Metropolis (RWM), propose Y_n ~ MVN(X_{n−1}, σ^2 I_d), then accept if U_n < A_n, where U_n ~ Uniform[0,1] and A_n = π(Y_n)/π(X_{n−1}).

One alternative (of many; later) is the independence sampler: propose {Y_n} ~ q, i.e. the {Y_n} are i.i.d. from some fixed density q, independent of X_{n−1}. (e.g. Y_n ~ MVN(0, I_d))

Then accept if U_n < A_n, where U_n ~ Uniform[0,1] and A_n = [π(Y_n) q(X_{n−1})] / [π(X_{n−1}) q(Y_n)].

(Special case of the "Metropolis-Hastings algorithm", where Y_n ~ q(X_{n−1}, ·), and A_n = [π(Y_n) q(Y_n, X_{n−1})] / [π(X_{n−1}) q(X_{n−1}, Y_n)]; later.)

Very special case: if q(y) = π(y), i.e. propose exactly from the target density π, then A_n ≡ 1, i.e. make great proposals, and always accept them (iid).

EXAMPLE: independence sampler with π(x) = e^{−x} and q(x) = k e^{−kx} (for x > 0). Then if X_{n−1} = x and Y_n = y, then A_n = [e^{−y} k e^{−kx}] / [e^{−x} k e^{−ky}] = e^{(k−1)(y−x)}. [file "Rind"]

k = 1: iid sampling; great. k = 0.01: proposals way too large; so-so. k = 5: proposals somewhat too small; terrible. And with k = 5, confidence intervals often miss. [file "Rind2"]

Why is large k so much worse than small k?
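(Aside: a sketch of this independence sampler for π(x) = e^{−x} with q = Exponential(k); in the spirit of the files "Rind"/"Rind2", which are not reproduced, and with illustrative run length and initial value.)

    indep <- function(k, M = 10^5) {
        x <- 1                                # initial value
        xlist <- numeric(M)
        for (n in 1:M) {
            y <- rexp(1, rate = k)            # proposal Y_n ~ q, independent of X_{n-1}
            if (runif(1) < exp((k-1)*(y-x)))  # A_n = e^{(k-1)(y-x)}
                x <- y
            xlist[n] <- x
        }
        mean(xlist)                           # estimate of E_pi(X) = 1
    }
    indep(1); indep(0.01); indep(5)           # compare k = 1, 0.01, 5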

MCMC CONVERGENCE RATES, PART I:

{X_n}: Markov chain on X, with stationary distribution Π. Let P^n(x, S) = P[X_n ∈ S | X_0 = x]. Hope that for large n, P^n(x, S) ≈ Π(S). Let D(x, n) = ||P^n(x, ·) − Π|| ≡ sup_{S ⊆ X} |P^n(x, S) − Π(S)|.

DEFN: the chain is "ergodic" if lim_{n→∞} D(x, n) = 0, for Π-a.e. x ∈ X.

DEFN: the chain is "geometrically ergodic" if there is ρ < 1, and M: X → [0, ∞] which is Π-a.e. finite, such that D(x, n) ≤ M(x) ρ^n for all x ∈ X and n ∈ N.

DEFN: a "quantitative bound on convergence" is an actual number n* such that D(x, n*) < 0.01 (say). [Then sometimes say the chain "converges in n* iterations".]

Quantitative bounds are often difficult (though I've worked on them a lot), but geometric ergodicity is often easier to verify.

Fact: a CLT holds for (1/n) sum_{i=1}^n h(X_i) if the chain is geometrically ergodic and E_π|h|^{2+δ} < ∞ for some δ > 0. (If the chain is also reversible, then don't need the δ: Roberts and Rosenthal, "Geometric ergodicity and hybrid Markov chains", ECP 1997.)

If a CLT holds, then have the 95% confidence interval (e − 1.96 sqrt(v), e + 1.96 sqrt(v)).

So what do we know about ergodicity? Theorem (later): if the chain is irreducible and aperiodic, and Π is stationary, then the chain is ergodic.

What about convergence rates of the independence sampler? By the Theorem, the independence sampler is ergodic provided q(x) > 0 whenever π(x) > 0. But is that sufficient? No: e.g. the previous "Rind" example with k = 5 is ergodic of course, but not geometrically ergodic; the CLT does not hold; confidence intervals often miss. [file "Rind2"]

FACT: the independence sampler is geometrically ergodic IF AND ONLY IF there is δ > 0 such that q(x) ≥ δ π(x) for π-a.e. x ∈ X, in which case D(x, n) ≤ (1−δ)^n for π-a.e. x ∈ X.

So, if π(x) = e^{−x} and q(x) = k e^{−kx} for x > 0, where 0 < k ≤ 1, then can take δ = k, so D(x, n) ≤ (1−k)^n. e.g. if k = 0.01, then D(x, 459) ≤ (0.99)^{459} ≈ 0.0099 < 0.01 for all x > 0, i.e. the chain converges after 459 iterations.

But if k > 1, then not geometrically ergodic. Fact: if k = 5, then D(0, n) > 0.01 for all n ≤ 4,000,000, while D(0, n) < 0.01 for all n ≥ 14,000,000, i.e. convergence takes between 4 million and 14 million iterations. Slow! [Roberts and Rosenthal, "Quantitative Non-Geometric Convergence Bounds for Independence Samplers", MCAP 2011.]

What about other chains besides the independence sampler? Coming soon!

VARIABLE-AT-A-TIME MCMC:

Propose to move just one coordinate at a time, leaving all the other coordinates fixed (since changing all coordinates at once may be difficult). e.g. the proposal Y_n has Y_{n,i} ~ N(X_{n−1,i}, σ^2), with Y_{n,j} = X_{n−1,j} for j ≠ i.

(Here Y_{n,i} is the i-th coordinate of Y_n.)

Then accept/reject with the usual Metropolis rule (symmetric case: "Metropolis-within-Gibbs") or Metropolis-Hastings rule (general case: "Metropolis-Hastings-within-Gibbs").

Need to choose which coordinate to update each time ... Could choose the coordinates in sequence 1, 2, ..., d, 1, 2, ... ("systematic-scan"). Or, choose a coordinate ~ Uniform{1, 2, ..., d} each time ("random-scan"). Note: one systematic-scan iteration corresponds to d random-scan ones ...

EXAMPLE: again π(x_1, x_2) = C cos(sqrt(x_1 x_2)) I(0 ≤ x_1 ≤ 5, 0 ≤ x_2 ≤ 4), and h(x_1, x_2) = e^{x_1} + x_2^2. Recall: Mathematica gives E_π(h) ≈ 38.7. Works with systematic-scan [file "Rmwg"] or random-scan [file "Rmwg2"].

GIBBS SAMPLER:

Special case of Metropolis-Hastings-within-Gibbs (later). The proposal distribution for the i-th coordinate is equal to the conditional distribution of that coordinate according to π, conditional on the current values of all the other coordinates. Then, always accept. (Reason: later.) Can use either systematic or random scan, just like above.

[END WEDNESDAY #4]

EXAMPLE: Variance Components Model: The update of µ (say) should be from the conditional density of µ, conditional on the current values of all the other coordinates: L(µ | V, W, θ_1, ..., θ_K, Y_{11}, ..., Y_{K J_K}). This conditional density is proportional to the full joint density, but with everything except µ treated as constant. Recall: the full joint density is:

C e^{−b_1/V} V^{−a_1−1} e^{−b_2/W} W^{−a_2−1} e^{−(µ−a_3)^2 / 2b_3} V^{−K/2} W^{−(1/2) sum_{i=1}^K J_i} exp( −sum_{i=1}^K (θ_i−µ)^2 / 2V − sum_{i=1}^K sum_{j=1}^{J_i} (Y_{ij}−θ_i)^2 / 2W ).

So, the conditional density of µ is

C_2 e^{−(µ−a_3)^2 / 2b_3} exp( −sum_{i=1}^K (θ_i−µ)^2 / 2V ).

This equals (verify this! HW!)

C_3 exp( −µ^2 [ 1/(2b_3) + K/(2V) ] + µ [ a_3/b_3 + (1/V) sum_{i=1}^K θ_i ] ).

Side calculation: if µ ~ N(m, v), then the density is ∝ e^{−(µ−m)^2 / 2v} ∝ e^{−µ^2/2v + µm/v}. Hence, here µ ~ N(m, v), where 1/(2v) = 1/(2b_3) + K/(2V) and m/v = a_3/b_3 + (1/V) sum_{i=1}^K θ_i. Solve: v = b_3 V / (V + K b_3), and m = ( a_3 V + b_3 sum_{i=1}^K θ_i ) / ( V + K b_3 ).

So, in the Gibbs Sampler, each time µ is updated, we sample it from N(m, v) for this m and v (and always accept).

Similarly (HW!), the conditional distribution for V is:

C_4 e^{−b_1/V} V^{−a_1−1} V^{−K/2} exp( −sum_{i=1}^K (θ_i−µ)^2 / 2V ), V > 0.

Recall that IG(r, s) has density (s^r / Γ(r)) e^{−s/x} x^{−r−1} for x > 0. So, the conditional distribution for V equals IG( a_1 + K/2, b_1 + (1/2) sum_{i=1}^K (θ_i−µ)^2 ).

Can similarly compute the conditional distributions for W and the θ_i (HW).

So, in this case, the systematic-scan Gibbs sampler proceeds (HW!) by: Update V from its conditional distribution IG(..., ...). Update W from its conditional distribution IG(..., ...). Update µ from its conditional distribution N(..., ...). Update θ_i from its conditional distribution N(..., ...), for i = 1, 2, ..., K. Repeat all of the above M times.

Or, the random-scan Gibbs sampler proceeds by choosing one of V, W, µ, θ_1, ..., θ_K uniformly at random, and then updating that coordinate from its corresponding conditional distribution. Then repeat this step M times [or M(K+3) times?].
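(Aside: a sketch of the µ and V updates just derived, as R code for a single Gibbs step. The constants and "current values" here are illustrative assumptions, and the IG draw uses the fact that the reciprocal of a Gamma(shape r, rate s) random variable is IG(r, s).)

    K <- 6; a1 <- 1; b1 <- 1; a3 <- 0; b3 <- 1       # illustrative constants (assumptions)
    theta <- rnorm(K); V <- 1                        # illustrative current values
    v  <- b3 * V / (V + K * b3)                      # conditional variance of mu
    m  <- (a3 * V + b3 * sum(theta)) / (V + K * b3)  # conditional mean of mu
    mu <- rnorm(1, m, sqrt(v))                       # Gibbs update: mu ~ N(m, v); always accept
    V  <- 1 / rgamma(1, shape = a1 + K/2,
                     rate = b1 + sum((theta - mu)^2) / 2)
        # Gibbs update: V ~ IG(a1 + K/2, b1 + (1/2) sum (theta_i - mu)^2)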

MCMC CONVERGENCE RATES, PART II:

FACT: if the state space is finite, and the chain is irreducible and aperiodic, then it is always geometrically ergodic.

What about the random-walk Metropolis algorithm (RWM), i.e. where {Y_n − X_{n−1}} ~ q for some fixed symmetric density q? (e.g. Y_n ~ N(X_{n−1}, σ^2 I), or Y_n ~ Uniform[X_{n−1} − δ, X_{n−1} + δ].)

FACT: RWM is geometrically ergodic essentially if and only if π has exponential tails, i.e. there are a, b, c > 0 such that π(x) ≤ a e^{−b|x|} whenever |x| > c. (Requires a few technical conditions: π and q continuous and positive; q has finite first moment; and π non-increasing in the tails, with (in higher dims) bounded Gaussian curvature.) [Mengersen and Tweedie, Ann Stat 1996; Roberts and Tweedie, Biometrika 1996]

EXAMPLES: RWM on R with the usual proposals: Y_n ~ N(X_{n−1}, σ^2).

CASE #1: Π = N(5, 4^2), and functional h(y) = y^2, so E_π(h) = 5^2 + 4^2 = 41. [file "Rnorm"] ... try σ = 1, σ = 4, σ = 16, and compare the varfacts ... Does a CLT hold? Yes! (geometrically ergodic, and E_π|h|^p < ∞ for all p). Indeed, confidence intervals usually contain 41. [file "Rnorm2"]

CASE #2: π(y) = c/(1+y^4), and functional h(y) = y^2, so E_π(h) = ∫ y^2 (1+y^4)^{−1} dy / ∫ (1+y^4)^{−1} dy = (π/√2)/(π/√2) = 1. Not exponential tails, so no CLT; the estimates are less stable, and confidence intervals often miss 1. [file "Rheavy"]

[END FRIDAY #4]

CASE #3: π(y) = 1/(π(1+y^2)) (Cauchy), and functional h(y) = 1_{−10<y<10}, so E_π(h) = Π(|X| < 10) = 2 arctan(10)/π ≈ 0.9365. [Π(0 < X < x) = arctan(x)/π] Not geometrically ergodic.

Confidence intervals often miss. [file "Rcauchy"]

CASE #4: π(y) = 1/(π(1+y^2)) (Cauchy), and functional h(y) = min(y^2, 100). [Numerical integration: E_π(h) ≈ 11.77] Again, not geometrically ergodic, and 95% CIs often miss 11.77, though iid MC does better. [file "Rcauchy2"]

NOTE: Even when a CLT holds, it's rather unstable, e.g. it requires that the chain has converged to Π, and might underestimate v. So, the estimate of v is very important! varfact not always reliable? Repeated runs! Another approach is "batch means", whereby the chain is broken into m large batches, which are assumed to be approximately i.i.d. ...

SO WHY DOES MCMC WORK?:

Need Markov chain theory ... (STA447 ... already know?)

Basic fact: if a Markov chain is irreducible and aperiodic, with stationary distribution π, then L(X_n) → π as n → ∞. More precisely ...

THEOREM: If a Markov chain is irreducible, with stationary probability density π, then for π-a.e. initial value X_0 = x:
(a) if E_π|h| < ∞, then lim_{n→∞} (1/n) sum_{i=1}^n h(X_i) = E_π(h) ≡ ∫ h(x) π(x) dx; and
(b) if the chain is aperiodic, then also lim_{n→∞} P(X_n ∈ S) = ∫_S π(x) dx for all S ⊆ X.

Let's figure out what this all means ...

Notation: P(i, j) = P(X_{n+1} = j | X_n = i) (discrete case), or P(x, A) = P(X_{n+1} ∈ A | X_n = x) (general case). Also Π(A) = ∫_A π(x) dx.

Well, "irreducible" means that you have positive probability of eventually getting from anywhere to anywhere else. Discrete case: for all i, j ∈ X, there is n ∈ N such that P(X_n = j | X_0 = i) > 0. General case: for all x ∈ X, and for all A ⊆ X with Π(A) > 0, there is n ∈ N such that P(X_n ∈ A | X_0 = x) > 0.

Usually satisfied for MCMC.

And, "aperiodic" means there are no forced cycles, i.e. there do not exist disjoint nonempty subsets X_1, X_2, ..., X_d, for d ≥ 2, such that P(x, X_{i+1}) = 1 for all x ∈ X_i (1 ≤ i ≤ d−1), and P(x, X_1) = 1 for all x ∈ X_d. (Diagram.) This is true for virtually any Metropolis algorithm, e.g. it's true if P(x, {x}) > 0 for any one state x ∈ X, e.g. if there is positive prob of rejection. Also true if P(x, ·) has positive density throughout S, for all x ∈ S, for some S ⊆ X with Π(S) > 0. Not quite guaranteed, e.g. X = {0, 1, 2, 3}, and π uniform on X, and Y_n = X_{n−1} ± 1 (mod 4). But almost always holds.

What about Π being a stationary distribution??

Begin with the DISCRETE CASE (e.g. "rwm.html"). Assume for simplicity that π(x) > 0 for all x ∈ X. Let q(x, y) = P(Y_n = y | X_{n−1} = x) be the proposal distribution, e.g. q(x, x+1) = q(x, x−1) = 1/2. Always chosen to be symmetric, i.e. q(x, y) = q(y, x). The acceptance probability is min(1, π(y)/π(x)). The state space is X, e.g. X = {1, 2, 3, 4, 5, 6}. Then, for i, j ∈ X with i ≠ j, P(i, j) = q(i, j) min(1, π(j)/π(i)).

It follows that the chain is "reversible": for all i, j ∈ X, by symmetry,

π(i) P(i, j) = q(i, j) min(π(i), π(j)) = q(j, i) min(π(i), π(j)) = π(j) P(j, i).

Intuition: if X_0 ~ π, i.e. P(X_0 = i) = π(i) for all i ∈ X, then P(X_0 = i, X_1 = j) = P(X_0 = j, X_1 = i) ... "time reversible" ...

We then compute that if X_0 ~ π, i.e. P(X_0 = i) = π(i) for all i ∈ X, then:

P(X_1 = j) = sum_{i∈X} P(X_0 = i) P(i, j) = sum_{i∈X} π(i) P(i, j) = sum_{i∈X} π(j) P(j, i)

= π(j) sum_{i∈X} P(j, i) = π(j),

i.e. X_1 ~ π too! So, the Markov chain "preserves" π, i.e. π is a stationary distribution. This is true for any Metropolis algorithm! It then follows from the Theorem (i.e., the Basic Fact) that as n → ∞, L(X_n) → π, i.e. lim_{n→∞} P(X_n = i) = π(i) for all i ∈ X. [file "rwm.html"] It also follows that if E_π|h| < ∞, then lim_{n→∞} (1/n) sum_{i=1}^n h(X_i) = E_π(h) ≡ sum_x h(x) π(x). (LLN)

SO WHAT ABOUT THE MORE GENERAL, CONTINUOUS CASE?

Some notation: Let X be the state space of all possible values. Usually X ⊆ R^d; e.g. for the Variance Components Model, X = (0,∞) × (0,∞) × R × R^K ⊆ R^{K+3}.

Let q(x, y) be the proposal density for y given x. So, in the above case, q(x, y) = (2πσ^2)^{−d/2} exp( −sum_{i=1}^d (y_i − x_i)^2 / 2σ^2 ). Symmetric: q(x, y) = q(y, x).

Let α(x, y) be the probability of accepting a proposed move from x to y, i.e. α(x, y) = P(U_n < A_n | X_{n−1} = x, Y_n = y) = P(U_n < π(y)/π(x)) = min(1, π(y)/π(x)).

Let P(x, S) = P(X_1 ∈ S | X_0 = x) be the transition probabilities. Then if x ∉ S, then P(x, S) = P(Y_1 ∈ S, U_1 < A_1 | X_0 = x) = ∫_S q(x, y) min(1, π(y)/π(x)) dy. Shorthand: for x ≠ y, P(x, dy) = q(x, y) min(1, π(y)/π(x)) dy. Then for x ≠ y,

P(x, dy) π(x) dx = q(x, y) min(1, π(y)/π(x)) π(x) dy dx = q(x, y) min(π(x), π(y)) dy dx = P(y, dx) π(y) dy (by symmetry).

It follows that P(x, dy) π(x) dx = P(y, dx) π(y) dy for all x, y ∈ X. ("reversible") Shorthand: P(x, dy) Π(dx) = P(y, dx) Π(dy).

How does reversibility help?

Well, suppose X_0 ~ Π, i.e. we start in stationarity. Then

P(X_1 ∈ S) = ∫_{x∈X} P(X_1 ∈ S | X_0 = x) π(x) dx = ∫_{x∈X} ∫_{y∈S} P(x, dy) π(x) dx = ∫_{y∈S} ∫_{x∈X} P(y, dx) π(y) dy = ∫_{y∈S} π(y) dy = Π(S),

so also X_1 ~ π. So, the chain preserves π, i.e. π is a stationary distribution. So, again, the Theorem applies.

Note: the key facts about q(x, y) are symmetry, and irreducibility. So, could replace Y_n ~ N(X_{n−1}, 1) by e.g. Y_n ~ Uniform[X_{n−1} − 1, X_{n−1} + 1], or (on a discrete space) Y_n = X_{n−1} ± 1 with prob. 1/2 each, etc.

METROPOLIS-HASTINGS ALGORITHMS: (Hastings [Canadian!], Biometrika 1970; see ...)

Now that we understand the theory, we can consider more general algorithms too ... The previous Metropolis algorithm works provided the proposal distribution is symmetric, i.e. q(x, y) = q(y, x). But what if it isn't?

For Metropolis, the key was that q(x, y) α(x, y) π(x) was symmetric (to make the Markov chain reversible). If instead A_n = [π(Y_n) q(Y_n, X_{n−1})] / [π(X_{n−1}) q(X_{n−1}, Y_n)], i.e. acceptance prob. α(x, y) = min(1, [π(y) q(y, x)] / [π(x) q(x, y)]), then:

q(x, y) α(x, y) π(x) = q(x, y) π(x) min(1, [π(y) q(y, x)] / [π(x) q(x, y)]) = min( π(x) q(x, y), π(y) q(y, x) ).

So, still symmetric, even if q wasn't.

So, for the Metropolis-Hastings algorithm, replace A_n = π(Y_n)/π(X_{n−1}) by A_n = [π(Y_n) q(Y_n, X_{n−1})] / [π(X_{n−1}) q(X_{n−1}, Y_n)]; then the chain is still reversible, and everything else remains the same. (i.e., still accept if U_n < A_n, otherwise reject.)

Intuition: if q(x, y) >> q(y, x), then the Metropolis chain would spend too much time at y and not enough at x, so we need to accept fewer moves x → y.

(Do require that q(x, y) > 0 iff q(y, x) > 0.)

INDEPENDENCE SAMPLER (mentioned earlier):

Proposals {Y_n} i.i.d. from some fixed distribution (say, Y_n ~ MVN(0, I)). Another special case of the Metropolis-Hastings algorithm. Then q(x, y) = q(y) depends only on y. So, now A_n = [π(Y_n) q(X_{n−1})] / [π(X_{n−1}) q(Y_n)]. [files "Rind", "Rind2" from before]

[END WEDNESDAY #5]

GIBBS SAMPLER (mentioned earlier):

Special case of Metropolis-Hastings-within-Gibbs. The proposal distribution for the i-th coordinate is equal to the conditional distribution of that coordinate according to π, conditional on the current values of all the other coordinates. That is, q_i(x, y) = C(x^{−i}) π(y) whenever x^{−i} = y^{−i}, where x^{−i} means all coordinates except the i-th one. Here C(x^{−i}) is the appropriate normalising constant (which depends on x^{−i}). So C(X_{n−1}^{−i}) = C(Y_n^{−i}). Then

A_n = [π(Y_n) q_i(Y_n, X_{n−1})] / [π(X_{n−1}) q_i(X_{n−1}, Y_n)] = [π(Y_n) C(Y_n^{−i}) π(X_{n−1})] / [π(X_{n−1}) C(X_{n−1}^{−i}) π(Y_n)] = 1.

So, always accept.

LANGEVIN ALGORITHM: Y_n ~ MVN( X_{n−1} + (σ^2/2) ∇log π(X_{n−1}), σ^2 I ). Special case of the Metropolis-Hastings algorithm. Intuition: tries to move in the direction where π is increasing. Based on a discrete approximation to a "Langevin diffusion". Usually more efficient, but requires knowledge and computation of ∇log π. (Hard.)

EXAMPLE: again π(x_1, x_2) = C cos(sqrt(x_1 x_2)) I(0 ≤ x_1 ≤ 5, 0 ≤ x_2 ≤ 4), and h(x_1, x_2) = e^{x_1} + x_2^2. Recall: Mathematica gives E_π(h) ≈ 38.7. Proposal distribution: Y_n ~ MVN( X_{n−1}, σ^2 (1 + |X_{n−1}|^2) I ).

Intuition: larger proposal variance if farther from the centre. So, q(x, y) = C (1 + |x|^2)^{−1} exp( −|y − x|^2 / (2 σ^2 (1 + |x|^2)) ).

So, can run the Metropolis-Hastings algorithm for this example. [file "RMH"] Usually get between 34 and 43, with claimed standard error ≈ 2. Recall: Mathematica gets 38.7.

EXAMPLES RE WHY DOES MCMC WORK:

EXAMPLE #1: Metropolis algorithm where X = Z, π(x) = 2^{−|x|}/3, and q(x, y) = 1/2 if |x − y| = 1, otherwise 0. Reversible? Yes, it's a Metropolis algorithm! π stationary? Yes, follows from reversibility! Aperiodic? Yes, since P(x, {x}) > 0! Irreducible? Yes: π(x) > 0 for all x ∈ X, so can get from x to y in |x − y| steps. So, by the theorem, probabilities and expectations converge to those of π; good.

EXAMPLE #2: Same as #1, except now π(x) ∝ 2^{−|x|} for x ≠ 0, with π(0) = 0. Still reversible, π stationary, aperiodic, same as before. Irreducible? No: can't go from positive to negative (proposals into 0 are always rejected)!

EXAMPLE #3: Same as #2, except now q(x, y) = 1/4 if 1 ≤ |x − y| ≤ 2, otherwise 0. Still reversible, π stationary, aperiodic, same as before. Irreducible? Yes: can jump over 0 to get from positive to negative, and back!

[END FRIDAY #5]

EXAMPLE #4: Metropolis algorithm with X = R, and π(x) = C e^{−x^6}, and proposals Y_n ~ Uniform[X_{n−1} − 1, X_{n−1} + 1]. Reversible? Yes, since q(x, y) is still symmetric. π stationary? Yes, since reversible! Irreducible? Yes, since P^n(x, dy) has positive density whenever |y − x| ≤ n.

Aperiodic? Yes: if it were periodic then, since e.g. X_1 ∩ [0,1] has positive measure, it would be possible to go from X_1 directly back to X_1 (i.e., if x ∈ X_1 ∩ [0,1], then P(x, X_1) > 0): contradiction. Or, even simpler: since P(x, {x}) > 0 for all x ∈ X. So, by the theorem, probabilities and expectations converge to those of π; good.

EXAMPLE #5: Same as #4, except now π(x) = C e^{−x^6} (1_{x<2} + 1_{x>4}). Still reversible and stationary and aperiodic, same as before. But no longer irreducible: cannot jump from [4, ∞) to (−∞, 2], or back. So, does not converge.

EXAMPLE #6: Same as #5, except now the proposals are Y_n ~ Uniform[X_{n−1} − 5, X_{n−1} + 5]. Still reversible and stationary and aperiodic, same as before. And now irreducible, too: now can jump from [4, ∞) to (−∞, 2], or back.

EXAMPLE #7: Same as #6, except now Y_n ~ Uniform[X_{n−1} − 5, X_{n−1} + 10]. Makes no sense: proposals not symmetric, so not a Metropolis algorithm! (Not even "symmetrically zero", for a Metropolis-Hastings algorithm.)

OPTIMAL RWM PROPOSALS:

Consider RWM on X = R^d, where Y_n ~ MVN(X_{n−1}, Σ) for some d × d proposal covariance matrix Σ. What is the best choice of Σ? Usually we take Σ = σ^2 I_d for some σ > 0, and then choose σ so the acceptance rate is not too small, not too large (e.g. ≈ 0.234). But can we do better?

Suppose for now that Π = MVN(µ_0, Σ_0) for some fixed µ_0 and Σ_0, in dim = 5. Try RWM with various proposal distributions [file "Ropt"]:

first version: Y_n ~ MVN(X_{n−1}, I_d): acc ≈ 0.06; varfact ≈ 220.
second version: Y_n ~ MVN(X_{n−1}, 0.1 I_d): acc ≈ 0.234; varfact ≈ 300.
third version: Y_n ~ MVN(X_{n−1}, Σ_0): acc ≈ 0.3; varfact ≈ 15.

fourth version: Y_n ~ MVN(X_{n−1}, 1.4 Σ_0): acc ≈ 0.234; varfact ≈ 7.

Or in dim = 20 [file "Ropt2", with file "targ20"]:
Y_n ~ MVN(X_{n−1}, I_d): acc ≈ 0.234; varfact ≈ 400 or more.
Y_n ~ MVN(X_{n−1}, Σ_0): acc ≈ 0.234; varfact ≈ 50.

Conclusion: acceptance rates near 0.234 are better. But also, proposals "shaped like the target" are better. This has been proved for targets which are orthogonal transformations of independent components (Roberts et al., Ann Appl Prob 1997; Roberts and Rosenthal, Stat Sci 2001; Bédard, Ann Appl Prob 2007). It is approximately true for most unimodal targets ...

Problem: Σ_0 would usually be unknown; then what? Can perhaps adapt!

ADAPTIVE MCMC:

What if the target covariance Σ_0 is unknown?? Can estimate the target covariance based on the run so far, to get the empirical covariance Σ_n. Then update the proposal covariance "on the fly", by using proposal Y_n ~ MVN(X_{n−1}, Σ_n) [or Y_n ~ MVN(X_{n−1}, 1.4 Σ_n), or Y_n ~ MVN(X_{n−1}, (2.38^2/d) Σ_n)]. Hope that for large n, Σ_n ≈ Σ_0, so the proposals are nearly optimal. (Usually also add ɛ I_d to the proposal covariance, to improve stability, for some small ɛ > 0.)

Try the R version, for the same MVN example as in "Ropt" [file "Radapt"]: Need much longer burn-in, e.g. B = 20,000, for the adaption to work. Get varfact of the last 4000 iterations of about 8 ... nearly optimal ... competitive with "Ropt". The longer the run, the more benefit from adaptation.

Can also compute the "slow-down factor", s_n = d ( sum_{i=1}^d λ_{in}^{−2} ) / ( sum_{i=1}^d λ_{in}^{−1} )^2, where {λ_{in}} are the eigenvalues of Σ_n^{1/2} Σ_0^{−1/2}. Starts large, should converge to 1. [Motivation: if Σ_n = Σ_0, then λ_{in} ≡ 1, so s_n = d · d / d^2 = 1.]
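(Aside: a sketch of this adaptation scheme: RWM whose proposal covariance is periodically reset to (2.38^2/d) times the empirical covariance of the run so far, plus ɛ I_d. The target here is a simple stand-in, not the actual "Radapt"/"Ropt" target, and the adaptation interval and ɛ are illustrative choices. Requires the MASS package.)

    library(MASS)                                  # for mvrnorm()
    d <- 5
    logpi <- function(x) -sum(x^2 / (2 * (1:d)))   # stand-in target: independent N(0, i) coords
    M <- 25000
    X <- matrix(0, M, d)
    x <- rep(0, d)
    Sigma <- 0.1 * diag(d)                         # initial (non-adapted) proposal covariance
    for (n in 1:M) {
        if (n %% 100 == 0)                         # adapt every 100 iterations:
            Sigma <- (2.38^2/d) * cov(X[1:(n-1), ]) + 0.05 * diag(d)  # empirical cov + eps*I_d
        y <- mvrnorm(1, x, Sigma)                  # proposal Y_n ~ MVN(X_{n-1}, Sigma_n)
        if (log(runif(1)) < logpi(y) - logpi(x)) x <- y  # Metropolis accept/reject, log scale
        X[n, ] <- x
    }
    colMeans(X[5001:M, ])                          # estimates of the target coordinate means (all 0)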

Higher dimensions: figure "plotamx200.png" (dim = 200). Works well, but it takes many iterations before the adaption is helpful.

BUT IS ADAPTIVE MCMC A VALID ALGORITHM?? Not in general: see e.g. "adapt.html". The algorithm is now non-Markovian, and doesn't preserve stationarity at each step. However, it still converges to Π provided that the adaption (i) is "diminishing", and (ii) satisfies a technical condition called "containment". For details see e.g. Roberts & Rosenthal, "Coupling and Convergence of Adaptive MCMC", J. Appl. Prob. 2007.

TEMPERED MCMC:

Suppose Π is "multi-modal", i.e. has distinct "parts", e.g. Π = (1/2) N(0,1) + (1/2) N(20,1). Usual RWM (with Y_n ~ N(X_{n−1}, 1), say) can explore well within each mode, but how to get from one mode to the other?

Idea: if Π were flatter, e.g. (1/2) N(0, 10^2) + (1/2) N(20, 10^2), then much easier to get between modes. So: define a sequence Π_1, Π_2, ..., Π_m, where Π_1 = Π ("cold"), and Π_τ is flatter for larger τ ("hot"). e.g. Π_τ = (1/2) N(0, τ^2) + (1/2) N(20, τ^2). [file "Rtempered"]

Then define a joint Markov chain in (x, τ) on X × {1, 2, ..., m}. How? (In the end, only count those samples where τ = 1.)

[END WEDNESDAY #6]

Then define the joint Markov chain (x, τ) on X × {1, 2, ..., m}, with stationary distribution Π̄ defined by Π̄(S × {τ}) = (1/m) Π_τ(S). (Can also use other weights besides 1/m.) Define a new Markov chain with both "spatial" moves (change x) and "temperature" moves (change τ).

e.g. perhaps the chain alternates between:

(a) propose x' ~ N(x, 1); accept with prob min(1, π̄(x',τ)/π̄(x,τ)) = min(1, π_τ(x')/π_τ(x)).
(b) propose τ' = τ ± 1 (prob 1/2 each); accept with prob min(1, π̄(x,τ')/π̄(x,τ)) = min(1, π_{τ'}(x)/π_τ(x)).

The chain should converge to Π̄. In the end, only count those samples where τ = 1.

EXAMPLE: Π = (1/2) N(0,1) + (1/2) N(20,1). Assume the proposals are Y_n ~ N(X_{n−1}, 1). Mixing for Π: terrible! [file "Rtempered" with dotempering=FALSE and temp=1; note the small claimed standard error!]

Define Π_τ = (1/2) N(0, τ^2) + (1/2) N(20, τ^2), for τ = 1, 2, ..., 10. Mixing is better for larger τ! [file "Rtempered" with dotempering=FALSE and temp=1,2,3,4,...,10] Compare the graphs of π_1 and π_10: [plot commands at bottom of "Rtempered"] ...

So, use the above (a)-(b) algorithm; converges fairly well to Π̄. [file "Rtempered", with dotempering=TRUE] So, conditional on τ = 1, converges to Π. ["Rtempered" points command at end of file] So, the average of those h(x) with τ = 1 gives a good estimate of E_π(h).

HOW TO FIND THE TEMPERED DENSITIES π_τ?

Usually won't know about e.g. Π_τ = (1/2) N(0, τ^2) + (1/2) N(20, τ^2). Instead, can e.g. let π_τ(x) = c_τ (π(x))^{1/τ}. (Sometimes write β = 1/τ.) Then Π_1 = Π, and π_τ is flatter for larger τ; good. e.g. if π(x) is the density of N(µ, σ^2), then c_τ (π(x))^{1/τ} is the density of N(µ, τσ^2).

Then the temperature acceptance probability is:

min(1, π_{τ'}(x)/π_τ(x)) = min(1, [c_{τ'} (π(x))^{1/τ'}] / [c_τ (π(x))^{1/τ}]).

This depends on the c_τ, which are usually unknown; bad.
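(Aside: a sketch of the tempering moves (a) and (b) above, for the two-mode example with the known flattened densities Π_τ = (1/2) N(0, τ^2) + (1/2) N(20, τ^2); in the spirit of the file "Rtempered", which is not reproduced, with illustrative run length.)

    m <- 10                                        # number of temperatures
    pitau <- function(x, tau) 0.5*dnorm(x, 0, tau) + 0.5*dnorm(x, 20, tau)
    M <- 10^5
    x <- 0; tau <- 1
    xlist <- numeric(M); taulist <- numeric(M)
    for (n in 1:M) {
        y <- x + rnorm(1)                          # (a) spatial move at current temperature
        if (runif(1) < pitau(y, tau)/pitau(x, tau)) x <- y
        taup <- tau + sample(c(-1, 1), 1)          # (b) temperature move tau' = tau +- 1
        if (taup >= 1 && taup <= m &&
            runif(1) < pitau(x, taup)/pitau(x, tau)) tau <- taup
        xlist[n] <- x; taulist[n] <- tau
    }
    mean(xlist[taulist == 1])                      # only count samples with tau = 1; about 10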

What to do?

PARALLEL TEMPERING: (a.k.a. Metropolis-Coupled MCMC, or MCMCMC)

Alternative to tempered MCMC. Instead, use state space X^m, with m chains, i.e. one chain for each temperature. So, the state at time n is X_n = (X_{n,1}, X_{n,2}, ..., X_{n,m}), where X_{n,τ} is "at temperature τ". The stationary distribution is now Π̄ = Π_1 × Π_2 × ... × Π_m, i.e. Π̄(X_1 ∈ S_1, X_2 ∈ S_2, ..., X_m ∈ S_m) = Π_1(S_1) Π_2(S_2) ... Π_m(S_m).

Then, can update the chain X_{n,τ} at temperature τ, for each 1 ≤ τ ≤ m, by proposing e.g. Y_{n,τ} ~ N(X_{n−1,τ}, 1), and accepting with probability min(1, π_τ(Y_{n,τ}) / π_τ(X_{n−1,τ})).

And, can also choose temperatures τ and τ' (e.g., at random), and propose to "swap" the values X_{n,τ} and X_{n,τ'}, and accept this with probability min(1, [π_τ(X_{n,τ'}) π_{τ'}(X_{n,τ})] / [π_τ(X_{n,τ}) π_{τ'}(X_{n,τ'})]).

Now, the normalising constants cancel: e.g. if π_τ(x) = c_τ (π(x))^{1/τ}, then the acceptance probability is:

min(1, [c_τ π(X_{n,τ'})^{1/τ} c_{τ'} π(X_{n,τ})^{1/τ'}] / [c_τ π(X_{n,τ})^{1/τ} c_{τ'} π(X_{n,τ'})^{1/τ'}]) = min(1, [π(X_{n,τ'})^{1/τ} π(X_{n,τ})^{1/τ'}] / [π(X_{n,τ})^{1/τ} π(X_{n,τ'})^{1/τ'}]),

so c_τ and c_{τ'} are not required.

EXAMPLE: suppose again that Π_τ = (1/2) N(0, τ^2) + (1/2) N(20, τ^2), for τ = 1, 2, ..., 10. Can run parallel tempering ... works pretty well. [file "Rpara"]

[END FRIDAY #6]

SUMMARY: Monte Carlo can be used for nearly everything! Good luck with your project, and with the rest of your studies!

STA3431 (Monte Carlo Methods) Lecture Notes, Winter 2011

STA3431 (Monte Carlo Methods) Lecture Notes, Winter 2011 STA3431 Monte Carlo Methods) Lecture Notes, Winter 2011 by Jeffrey S. Rosenthal, University of Toronto Last updated: April 4, 2011.) Note: I will update these notes regularly on the course web page). However,

More information

STA3431 (Monte Carlo Methods) Lecture Notes, Winter 2010

STA3431 (Monte Carlo Methods) Lecture Notes, Winter 2010 STA3431 Monte Carlo Methods Lecture Notes, Winter 2010 by Jeffrey S. Rosenthal, University of Toronto Last updated: March 29, 2010. Note: I will update these notes regularly on-line. However, they are

More information

STA3431 (Monte Carlo Methods) Lecture Notes, Fall by Jeffrey S. Rosenthal, University of Toronto (Last updated: November 16, 2018)

STA3431 (Monte Carlo Methods) Lecture Notes, Fall by Jeffrey S. Rosenthal, University of Toronto (Last updated: November 16, 2018) STA343 (Monte Carlo Methods) Lecture Notes, Fall 208 by Jeffrey S. Rosenthal, University of Toronto (Last updated: November 6, 208) Note: I will update these notes regularly (on the course web page). However,

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

University of Toronto Department of Statistics

University of Toronto Department of Statistics Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

An introduction to adaptive MCMC

An introduction to adaptive MCMC An introduction to adaptive MCMC Gareth Roberts MIRAW Day on Monte Carlo methods March 2011 Mainly joint work with Jeff Rosenthal. http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

CS281A/Stat241A Lecture 22

CS281A/Stat241A Lecture 22 CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution

More information

Sampling from complex probability distributions

Sampling from complex probability distributions Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation

More information

Bayesian Computation via Markov chain Monte Carlo

Bayesian Computation via Markov chain Monte Carlo Bayesian Computation via Markov chain Monte Carlo Radu V. Craiu Department of Statistics University of Toronto Jeffrey S. Rosenthal Department of Statistics University of Toronto (Version of June 6, 2013.)

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Examples of Adaptive MCMC

Examples of Adaptive MCMC Examples of Adaptive MCMC by Gareth O. Roberts * and Jeffrey S. Rosenthal ** (September, 2006.) Abstract. We investigate the use of adaptive MCMC algorithms to automatically tune the Markov chain parameters

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Quantitative Non-Geometric Convergence Bounds for Independence Samplers

Quantitative Non-Geometric Convergence Bounds for Independence Samplers Quantitative Non-Geometric Convergence Bounds for Independence Samplers by Gareth O. Roberts * and Jeffrey S. Rosenthal ** (September 28; revised July 29.) 1. Introduction. Markov chain Monte Carlo (MCMC)

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

Variance Bounding Markov Chains

Variance Bounding Markov Chains Variance Bounding Markov Chains by Gareth O. Roberts * and Jeffrey S. Rosenthal ** (September 2006; revised April 2007.) Abstract. We introduce a new property of Markov chains, called variance bounding.

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 13 MCMC, Hybrid chains October 13, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university MH algorithm, Chap:6.3 The metropolis hastings requires three objects, the distribution of

More information

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Eric Slud, Statistics Program Lecture 1: Metropolis-Hastings Algorithm, plus background in Simulation and Markov Chains. Lecture

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

GENERAL STATE SPACE MARKOV CHAINS AND MCMC ALGORITHMS

GENERAL STATE SPACE MARKOV CHAINS AND MCMC ALGORITHMS GENERAL STATE SPACE MARKOV CHAINS AND MCMC ALGORITHMS by Gareth O. Roberts* and Jeffrey S. Rosenthal** (March 2004; revised August 2004) Abstract. This paper surveys various results about Markov chains

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 11 Markov Chain Monte Carlo cont. October 6, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university The two stage Gibbs sampler If the conditional distributions are easy to sample

More information

SC7/SM6 Bayes Methods HT18 Lecturer: Geoff Nicholls Lecture 2: Monte Carlo Methods Notes and Problem sheets are available at http://www.stats.ox.ac.uk/~nicholls/bayesmethods/ and via the MSc weblearn pages.

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9 Metropolis Hastings Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601 Module 9 1 The Metropolis-Hastings algorithm is a general term for a family of Markov chain simulation methods

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

University of Toronto Department of Statistics

University of Toronto Department of Statistics Optimal Proposal Distributions and Adaptive MCMC by Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0804 June 30, 2008 TECHNICAL REPORT SERIES University of Toronto

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

INTRODUCTION TO BAYESIAN STATISTICS

INTRODUCTION TO BAYESIAN STATISTICS INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types

More information

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models 6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) School of Computer Science 10-708 Probabilistic Graphical Models Markov Chain Monte Carlo (MCMC) Readings: MacKay Ch. 29 Jordan Ch. 21 Matt Gormley Lecture 16 March 14, 2016 1 Homework 2 Housekeeping Due

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

MSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at

MSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at MSc MT15. Further Statistical Methods: MCMC Lecture 5-6: Markov chains; Metropolis Hastings MCMC Notes and Practicals available at www.stats.ox.ac.uk\ nicholls\mscmcmc15 Markov chain Monte Carlo Methods

More information

F denotes cumulative density. denotes probability density function; (.)

F denotes cumulative density. denotes probability density function; (.) BAYESIAN ANALYSIS: FOREWORDS Notation. System means the real thing and a model is an assumed mathematical form for the system.. he probability model class M contains the set of the all admissible models

More information

Some Results on the Ergodicity of Adaptive MCMC Algorithms

Some Results on the Ergodicity of Adaptive MCMC Algorithms Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1 Recent progress on computable bounds and the simple slice sampler by Gareth O. Roberts* and Jerey S. Rosenthal** (May, 1999.) This paper discusses general quantitative bounds on the convergence rates of

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling

More information

MCMC Review. MCMC Review. Gibbs Sampling. MCMC Review

MCMC Review. MCMC Review. Gibbs Sampling. MCMC Review MCMC Review http://jackman.stanford.edu/mcmc/icpsr99.pdf http://students.washington.edu/fkrogsta/bayes/stat538.pdf http://www.stat.berkeley.edu/users/terry/classes/s260.1998 /Week9a/week9a/week9a.html

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02 Spring 13 1 Recap: Monte Carlo Method If U is a universe of items, and G is a subset satisfying some property,

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Optimizing and Adapting the Metropolis Algorithm

Optimizing and Adapting the Metropolis Algorithm 6 Optimizing and Adapting the Metropolis Algorithm Jeffrey S. Rosenthal University of Toronto, Toronto, ON 6.1 Introduction Many modern scientific questions involve high-dimensional data and complicated

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Reversible Markov chains

Reversible Markov chains Reversible Markov chains Variational representations and ordering Chris Sherlock Abstract This pedagogical document explains three variational representations that are useful when comparing the efficiencies

More information

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Yan Bai Feb 2009; Revised Nov 2009 Abstract In the paper, we mainly study ergodicity of adaptive MCMC algorithms. Assume that

More information

An introduction to Bayesian statistics and model calibration and a host of related topics

An introduction to Bayesian statistics and model calibration and a host of related topics An introduction to Bayesian statistics and model calibration and a host of related topics Derek Bingham Statistics and Actuarial Science Simon Fraser University Cast of thousands have participated in the

More information

Quantifying Uncertainty

Quantifying Uncertainty Sai Ravela M. I. T Last Updated: Spring 2013 1 Markov Chain Monte Carlo Monte Carlo sampling made for large scale problems via Markov Chains Monte Carlo Sampling Rejection Sampling Importance Sampling

More information

ECE531 Lecture 8: Non-Random Parameter Estimation

ECE531 Lecture 8: Non-Random Parameter Estimation ECE531 Lecture 8: Non-Random Parameter Estimation D. Richard Brown III Worcester Polytechnic Institute 19-March-2009 Worcester Polytechnic Institute D. Richard Brown III 19-March-2009 1 / 25 Introduction

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Chapter 5 Markov Chain Monte Carlo MCMC is a kind of improvement of the Monte Carlo method By sampling from a Markov chain whose stationary distribution is the desired sampling distributuion, it is possible

More information

Monte Carlo methods for sampling-based Stochastic Optimization

Monte Carlo methods for sampling-based Stochastic Optimization Monte Carlo methods for sampling-based Stochastic Optimization Gersende FORT LTCI CNRS & Telecom ParisTech Paris, France Joint works with B. Jourdain, T. Lelièvre, G. Stoltz from ENPC and E. Kuhn from

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains

More information

A short diversion into the theory of Markov chains, with a view to Markov chain Monte Carlo methods

A short diversion into the theory of Markov chains, with a view to Markov chain Monte Carlo methods A short diversion into the theory of Markov chains, with a view to Markov chain Monte Carlo methods by Kasper K. Berthelsen and Jesper Møller June 2004 2004-01 DEPARTMENT OF MATHEMATICAL SCIENCES AALBORG

More information

References. Markov-Chain Monte Carlo. Recall: Sampling Motivation. Problem. Recall: Sampling Methods. CSE586 Computer Vision II

References. Markov-Chain Monte Carlo. Recall: Sampling Motivation. Problem. Recall: Sampling Methods. CSE586 Computer Vision II References Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can

More information

19 : Slice Sampling and HMC

19 : Slice Sampling and HMC 10-708: Probabilistic Graphical Models 10-708, Spring 2018 19 : Slice Sampling and HMC Lecturer: Kayhan Batmanghelich Scribes: Boxiang Lyu 1 MCMC (Auxiliary Variables Methods) In inference, we are often

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Part 4: Multi-parameter and normal models

Part 4: Multi-parameter and normal models Part 4: Multi-parameter and normal models 1 The normal model Perhaps the most useful (or utilized) probability model for data analysis is the normal distribution There are several reasons for this, e.g.,

More information

Statistics & Data Sciences: First Year Prelim Exam May 2018

Statistics & Data Sciences: First Year Prelim Exam May 2018 Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Examples of Adaptive MCMC

Examples of Adaptive MCMC Examples of Adaptive MCMC by Gareth O. Roberts * and Jeffrey S. Rosenthal ** (September 2006; revised January 2008.) Abstract. We investigate the use of adaptive MCMC algorithms to automatically tune the

More information

Math 456: Mathematical Modeling. Tuesday, April 9th, 2018

Math 456: Mathematical Modeling. Tuesday, April 9th, 2018 Math 456: Mathematical Modeling Tuesday, April 9th, 2018 The Ergodic theorem Tuesday, April 9th, 2018 Today 1. Asymptotic frequency (or: How to use the stationary distribution to estimate the average amount

More information

Lecture 21: Convergence of transformations and generating a random variable

Lecture 21: Convergence of transformations and generating a random variable Lecture 21: Convergence of transformations and generating a random variable If Z n converges to Z in some sense, we often need to check whether h(z n ) converges to h(z ) in the same sense. Continuous

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof

More information

Bayesian GLMs and Metropolis-Hastings Algorithm

Bayesian GLMs and Metropolis-Hastings Algorithm Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,

More information

Faithful couplings of Markov chains: now equals forever

Faithful couplings of Markov chains: now equals forever Faithful couplings of Markov chains: now equals forever by Jeffrey S. Rosenthal* Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 1A1 Phone: (416) 978-4594; Internet: jeff@utstat.toronto.edu

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo

Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo Assaf Weiner Tuesday, March 13, 2007 1 Introduction Today we will return to the motif finding problem, in lecture 10

More information

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS The Annals of Applied Probability 1998, Vol. 8, No. 4, 1291 1302 ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS By Gareth O. Roberts 1 and Jeffrey S. Rosenthal 2 University of Cambridge

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Recent Advances in Regional Adaptation for MCMC

Recent Advances in Regional Adaptation for MCMC Recent Advances in Regional Adaptation for MCMC Radu Craiu Department of Statistics University of Toronto Collaborators: Yan Bai (Statistics, Toronto) Antonio Fabio di Narzo (Statistics, Bologna) Jeffrey

More information