A Dirichlet Form approach to MCMC Optimal Scaling
Giacomo Zanella, Wilfrid S. Kendall, and Mylène Bédard
g.zanella@warwick.ac.uk, w.s.kendall@warwick.ac.uk, mylene.bedard@umontreal.ca
Supported by EPSRC Research Grants EP/D002060, EP/K013939
London Probability Seminar, King's College London, 7th October 2016

Introduction

Markov chain Monte Carlo (MCMC) quotes:
Metropolis et al. (1953), running code on the Los Alamos MANIAC: "a feasible approach to statistical mechanics problems which are as yet not analytically soluble."
Adrian Smith (circa 1990): "you set the MCMC sampler going, then you go off to the pub."
Charlie Geyer (circa 1996), on the uses of MCMC: "If you can write down a model, I can do likelihood inference for it, ..., whatever."
Elizabeth Thompson (circa 2004): "You should only use MCMC when all else has failed."
If you have to use MCMC, how to make it work as well as possible?

Example: MCMC for Anglo-Saxon statistics
Some historians conjecture that Anglo-Saxon placenames cluster by dissimilar names. Zanella (2016, 2015) uses MCMC: the data provides some support, resulting in useful clustering.

MCMC idea
Goal: estimate E = E_π[h(X)].
Method: simulate an ergodic Markov chain with stationary distribution π; use the empirical estimate
    Ê_n = (1/n) Σ_{m=n₀+1}^{n₀+n} h(X_m).
(Much easier to apply theory if the chain is reversible.)
Theory: Ê_n → E almost surely.
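The empirical-estimate recipe above fits in a few lines of code. Here is a minimal sketch (our own illustration, not from the slides): a 1-d Gaussian MHRW chain targeting π = N(0,1), estimating E_π[h(X)] = E[X²] = 1 after discarding a burn-in of length n₀.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # -log density of the N(0,1) target, up to an additive constant
    return 0.5 * x * x

def mhrw_estimate(h, n0=1000, n=100000, tau=2.4):
    """Ergodic-average estimate of E_pi[h(X)] from a Gaussian MHRW chain."""
    x = 0.0
    samples = []
    for _ in range(n0 + n):
        z = rng.normal(0.0, tau)
        # accept iff Exp(1) > phi(x+z) - phi(x),
        # i.e. with probability min(1, pi(x+z)/pi(x))
        if rng.exponential() > phi(x + z) - phi(x):
            x += z
        samples.append(h(x))
    return float(np.mean(samples[n0:]))   # discard burn-in, average the rest

est = mhrw_estimate(lambda x: x * x)
print(est)  # should be close to E[X^2] = 1
```

The proposal scale tau and the burn-in length n₀ are our own arbitrary choices for the demonstration.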
Varieties of MH MCMC
Here is the famous Metropolis–Hastings recipe for drawing from a distribution with density f:
Propose Y using conditional density q(y|x);
Accept/Reject the move from X to Y, based on the ratio f(Y) q(X|Y) / (f(X) q(Y|X)).
Options:
1. Independence sampler: proposal q(y|x) = q(y) doesn't depend on x;
2. Random walk (MHRW) MCMC: proposal q(y|x) = q(y − x) behaves as a random walk;
3. Langevin MH MCMC (or MALA): proposal q(y|x) = q(y − x − λ grad log f) drifts towards high target density f.
We shall focus on MHRW MCMC with Gaussian proposals.

Gaussian MHRW MCMC
Simple Python code for Gaussian MHRW MCMC, using normal and exponential from NumPy:
Propose a multivariate Gaussian step;
Test whether to accept the proposal by comparing an exponential random variable with the log MH ratio;
Implement the step if accepted (vector addition).

while not mcmc.stopped():
    z = normal(0, tau, size=mcmc.dim)
    if exponential() > mcmc.phi(mcmc.x + z) - mcmc.phi(mcmc.x):
        mcmc.x += z
    mcmc.record_result()

MHRW MCMC with Gaussian proposals (I)
[Trace plots: smooth target, marginal ∝ exp(−x⁴).]
Target is given by 10 i.i.d. coordinates. Scale parameter for proposal: τ = 0.01 is too small. Acceptance ratio 98.5%.

MHRW MCMC with Gaussian proposals (II)
[Trace plots: smooth target, marginal ∝ exp(−x⁴).]
Target is given by 20 i.i.d. coordinates. Scale parameter for proposal: τ = 0.01 is clearly too small. Acceptance ratio 96.7%.
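The slide's code sketch assumes a pre-built `mcmc` object. A self-contained version (function and parameter names are our own) running the experiment just described, with marginals ∝ exp(−x⁴), d = 10 coordinates, and τ = 0.01, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # -log target density, up to a constant: i.i.d. marginals prop. to exp(-x^4)
    return np.sum(x ** 4)

def mhrw_acceptance(dim=10, tau=0.01, steps=20000):
    """Run Gaussian MHRW MCMC; return the empirical acceptance rate."""
    x = np.zeros(dim)
    accepted = 0
    for _ in range(steps):
        z = rng.normal(0.0, tau, size=dim)
        # accept iff Exp(1) > phi(x+z) - phi(x)
        if rng.exponential() > phi(x + z) - phi(x):
            x = x + z
            accepted += 1
    return accepted / steps

rate = mhrw_acceptance()
print(rate)  # with such a tiny step size, almost every proposal is accepted
```

The high acceptance rate is the symptom the slides point at: τ = 0.01 proposals are nearly always accepted, so the chain diffuses far too slowly through the target.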
MCMC Optimal Scaling: classic result (I)
MHRW MCMC on (ℝ^d, π^d), with π(dx_i) = e^{−φ(x_i)} dx_i; MH acceptance rule A^(d) = 0 or 1.
    X^(d)_0 = (X_1, ..., X_d),   X_i i.i.d. π,
    X^(d)_1 = (X_1 + A^(d) W_1, ..., X_d + A^(d) W_d),   W_i i.i.d. N(0, σ_d²).
Questions: (1) complexity as d → ∞? (2) optimal σ_d?

Theorem (Roberts, Gelman and Gilks, 1997)
Given σ_d² = σ²/d, Lipschitz φ′, and finite E_π[(φ′)⁸], E_π[(φ″)⁴],
    {X^(d)_{⌊td⌋,1}}_t ⇒ Z   where   dZ_t = s(σ)^{1/2} dB_t − s(σ) φ′(Z_t)/2 dt.
Answers: (1) mix in O(d) steps; (2) σ_max = arg max_σ s(σ).

MCMC Optimal Scaling: classic result (II)
How to maximize s(σ)? Given I = E_π[φ′(X)²] and the normal CDF Φ,
    s(σ) = σ² · 2Φ(−σ√I/2) = σ² A(σ) = (4/I) (Φ^{−1}(A(σ)/2))² A(σ),
where A(σ) is the asymptotic acceptance rate. So σ_max is given by maximizing over acceptance rates:
    A(σ_max) = arg max_{A∈[0,1]} {(Φ^{−1}(A/2))² A} ≈ 0.234.
Strengths:
Establishes complexity as d → ∞;
Practical information on how to tune the proposal;
Does not depend on φ (CLT-type universality).
Some weaknesses that we address (there are others):
Convergence of the marginal rather than the joint distribution;
Strong regularity assumptions: Lipschitz φ′, finite E_π[(φ′)⁸], E_π[(φ″)⁴].

MCMC Optimal Scaling: classic result (III)
There is a wide range of extensions: for example,
Langevin / MALA, for which the magic acceptance probability is 0.574 (Roberts and Rosenthal 1998);
Non-identically distributed independent target coordinates (Bédard 2007);
Gibbs random fields (Breyer and Roberts 2000);
Infinite dimensional random fields (Mattingly, Pillai, and Stuart 2012);
Markov chains on a hypercube (Roberts 1998);
Adaptive MCMC: adjust online to optimize the acceptance probability (Andrieu and Thoms 2008; Rosenthal 2011).
All these build on the s.d.e. approach of Roberts, Gelman, and Gilks (1997); hence regularity conditions tend to be severe (though see Durmus, Le Corff, Moulines, and Roberts 2016).

Dirichlet forms and MCMC 1
Definition of Dirichlet form
A (symmetric) Dirichlet form E on a Hilbert space H is a bilinear function E(u, v), defined for u, v ∈ D ⊆ H, which satisfies:
1. D is a dense linear subspace of H;
2. E(u, v) = E(v, u) for u, v ∈ D, so E is symmetric;
3.
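The maximization over A on slide (II) is easy to check numerically. The sketch below (our own, stdlib-only; the standard library has no inverse normal CDF, so Φ^{−1} is computed by bisection via `math.erf`) recovers the magic number 0.234:

```python
import math

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Inverse normal CDF by bisection (Phi is strictly increasing)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Maximize g(A) = (Phi^{-1}(A/2))^2 * A over acceptance rates A in (0, 1)
best_A = max((i / 1000.0 for i in range(1, 1000)),
             key=lambda A: Phi_inv(A / 2.0) ** 2 * A)
print(round(best_A, 3))  # the magic acceptance rate, approximately 0.234
```

Note that the objective is quite flat near its maximum, which is why acceptance rates anywhere roughly between 0.2 and 0.3 work well in practice.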
E(u) = E(u, u) ≥ 0 for u ∈ D;
4. D is a Hilbert space under the (Sobolev) inner product ⟨u, v⟩ + E(u, v);
5. If u ∈ D then u♯ = (u ∧ 1) ∨ 0 ∈ D, and moreover E(u♯, u♯) ≤ E(u, u).
Related to Markov processes if (quasi-)regular.
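A standard concrete instance of this definition (a textbook example, not from the slides) is the form associated with Brownian motion:

```latex
% Dirichlet form of Brownian motion on H = L^2(\mathbb{R}),
% with dense domain D = H^1(\mathbb{R}):
\mathcal{E}(u,v) \;=\; \tfrac{1}{2} \int_{\mathbb{R}} u'(x)\, v'(x)\, \mathrm{d}x .
```

Symmetry and non-negativity of E(u, u) are immediate, and the Markovian property (5) holds because replacing u by (u ∧ 1) ∨ 0 can only decrease |u′| almost everywhere.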
Dirichlet forms and MCMC 2
Two examples
1. Dirichlet form obtained from the (re-scaled) MHRW MCMC:
    E_d(h) = (d/2) E[(h(X^(d)_1) − h(X^(d)_0))²].
(E_d can be viewed as the Dirichlet form arising from speeding up the MHRW MCMC by rate d.)
2. Heuristic infinite-dimensional diffusion limit of this form under scaling:
    E(h) = (s(σ)/2) E_π[(h′)²].
Can we deduce that the MHRW MCMC scales to look like the infinite-dimensional diffusion, by showing that E_d converges to E?

Useful modes of convergence for Dirichlet forms
1. Gamma-convergence: E_n Γ-converges to E if
(Γ1) E(h) ≤ lim inf_n E_n(h_n) whenever h_n → h ∈ H;
(Γ2) for every h ∈ H there are h_n → h ∈ H such that E(h) ≥ lim sup_n E_n(h_n).
2. Mosco (1994) introduces stronger conditions:
(M1) E(h) ≤ lim inf_n E_n(h_n) whenever h_n → h weakly in H;
(M2) for every h ∈ H there are h_n → h strongly in H such that E(h) ≥ lim sup_n E_n(h_n).
3. Mosco (1994, Theorem 2.4.1, Corollary 2.6.1): conditions (M1) and (M2) imply convergence of the associated resolvent operators, and indeed of the associated semigroups.
4. Sun (1998) gives further conditions which imply weak convergence of the associated processes: these conditions are implied by the existence of a finite constant C such that E_n(h) ≤ C(‖h‖² + E(h)) for all h ∈ H.

Results
Theorem (Zanella, Kendall and Bédard, 2016)
Consider the Gaussian MHRW MCMC based on proposal variance σ²/d with target π^d, where π(dx) = f(x) dx = e^{−φ(x)} dx. Suppose I = ∫ (φ′)² f dx < ∞ (finite Fisher information), and
    |φ′(x + v) − φ′(x)| ≤ κ max{|v|^γ, |v|^α}
for some κ > 0, 0 < γ < 1, and α > 1. Let E_d be the corresponding Dirichlet form, scaled as above. Then E_d Mosco-converges to E, with s(σ) = σ² E[1 ∧ exp(N(−σ²I/2, σ²I))], so the corresponding L² semigroups also converge.

Corollary
Suppose in the above that φ′ is globally Lipschitz. Then the correspondingly scaled processes exhibit weak convergence.
Methods of proof 1: a CLT result
Lemma
Under the conditions of the Theorem, almost surely (in x with invariant measure π), the log Metropolis–Hastings ratio converges weakly (in W) as follows, as d → ∞:
    log Π_{i=1}^d [f(x_i + σW_i/√d) / f(x_i)] = −Σ_{i=1}^d (φ(x_i + σW_i/√d) − φ(x_i)) ⇒ N(−σ²I/2, σ²I).
We may use this to deduce the asymptotic acceptance rate of the MHRW MCMC sampler.
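The Lemma lends itself to a quick numerical sanity check (our own illustration, not from the slides). Taking the standard Gaussian target φ(x) = x²/2, for which I = E_π[φ′(X)²] = 1, the simulated log MH ratios should have mean ≈ −σ²I/2 and variance ≈ σ²I:

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(x):
    return 0.5 * x ** 2   # standard Gaussian target, so I = E[phi'(X)^2] = 1

d, sigma, reps = 500, 1.0, 5000
x = rng.normal(size=(reps, d))          # coordinates X_i i.i.d. pi = N(0,1)
w = rng.normal(size=(reps, d))          # proposal noise, W_i ~ N(0,1)
step = sigma * w / np.sqrt(d)           # proposal variance sigma^2 / d
log_ratio = -np.sum(phi(x + step) - phi(x), axis=1)

# the empirical law should approach N(-sigma^2 I / 2, sigma^2 I) = N(-0.5, 1)
print(log_ratio.mean(), log_ratio.var())
```

The −σ²I/2 mean is exactly the "lose half the variance" offset that produces the nontrivial asymptotic acceptance rate E[1 ∧ e^N].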
Key idea for CLT
    −Σ_{i=1}^d (φ(x_i + σW_i/√d) − φ(x_i))
        = −Σ_{i=1}^d φ′(x_i) σW_i/√d − Σ_{i=1}^d (σW_i/√d) ∫₀¹ (φ′(x_i + u σW_i/√d) − φ′(x_i)) du.
Condition implicitly on x for the first 2.5 steps.
1. The first summand converges weakly to N(0, σ²I).
2. Decompose the variance of the second summand to deduce
    Var[ Σ_i (σW_i/√d) ∫₀¹ (φ′(x_i + u σW_i/√d) − φ′(x_i)) du ] → 0.
3. Use Hoeffding's inequality then absolute expectations:
    E[ Σ_i (σW_i/√d) ∫₀¹ (φ′(x_i + u σW_i/√d) − φ′(x_i)) du ] → −σ²I/2.

Methods of proof 2: establishing condition (M2)
For every h ∈ L²(π), find h_n → h (strongly in L²(π)) such that E(h) ≥ lim sup_n E_n(h_n).
1. It is sufficient to consider the case E(h) < ∞.
2. Find a sequence of smooth cylinder functions h_n with compact cylindrical support such that |E(h) − E(h_n)| ≤ 1/n.
3. Using smoothness etc., E_m(h_n) → E(h_n) as m → ∞.
4. Subsequences ....

Methods of proof 3: establishing condition (M1)
If h_n → h weakly in L²(π), show E(h) ≤ lim inf_n E_n(h_n).
1. Set Ψ_d(h) = √(d/2) (h(X^(d)_0) − h(X^(d)_1)), so that E_d(h) = E[Ψ_d(h)²].
2. Integrate against a test function ξ(x_{1:N}, W_{1:N}) I(U < a(x_{1:N}, W_{1:N})), for ξ smooth with compact support and U a Uniform(0,1) random variable; apply Cauchy–Schwarz.
3. Use integration by parts, careful analysis, and the conditions on φ.

Conclusion
The Dirichlet form approach allows significant relaxation of the conditions required for optimal scaling results.
Need to explore whether further relaxation can be obtained (almost surely possible).
Need to explore development beyond i.i.d. targets; e.g. can regularity be similarly relaxed in more general random field settings?
Can this be applied in discrete Markov chain cases (cf. Roberts 1998)?
Investigate applications to Adaptive MCMC.
References
Andrieu, C. and J. Thoms (2008). A tutorial on adaptive MCMC. Statistics and Computing 18(4), 343–373.
Bédard, M. (2007). Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. Annals of Applied Probability 17(4), 1222–1244.
Breyer, L. A. and G. O. Roberts (2000). From Metropolis to diffusions: Gibbs states and optimal scaling. Stochastic Processes and their Applications 90(2), 181–206.
Durmus, A., S. Le Corff, E. Moulines, and G. O. Roberts (2016). Optimal scaling of the Random Walk Metropolis algorithm under L^p mean differentiability. arXiv 1604.06664.
Geyer, C. (1999). Likelihood inference for spatial point processes. In O. E. Barndorff-Nielsen, WSK, and M. N. M. van Lieshout (Eds.), Stochastic Geometry: likelihood and computation, Chapter 4, pp. 79–140. Boca Raton: Chapman & Hall/CRC.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
Mattingly, J. C., N. S. Pillai, and A. M. Stuart (2012). Diffusion limits of the random walk Metropolis algorithm in high dimensions. Annals of Applied Probability 22(3), 881–890.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21(6), 1087–1092.
Mosco, U. (1994). Composite media and asymptotic Dirichlet forms. Journal of Functional Analysis 123(2), 368–421.
Roberts, G. O. (1998). Optimal Metropolis algorithms for product measures on the vertices of a hypercube. Stochastics and Stochastic Reports, 37–41.
Roberts, G. O., A. Gelman, and W. Gilks (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability 7(1), 110–120.
Roberts, G. O. and J. S. Rosenthal (1998). Optimal scaling of discrete approximations to Langevin diffusions. J. R. Statist. Soc. B 60(1), 255–268.
Rosenthal, J. S. (2011). Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo (1), 93–112.
Sun, W. (1998). Weak convergence of Dirichlet processes. Science in China Series A: Mathematics 41(1), 8–21.
Thompson, E. A. (2005). MCMC in the analysis of genetic data on pedigrees. In WSK, F. Liang, and J.-S. Wang (Eds.), Markov chain Monte Carlo: Innovations and Applications, Chapter 5, pp. 183–216. Singapore: World Scientific.
Zanella, G. (2015). Bayesian Complementary Clustering, MCMC, and Anglo-Saxon Placenames. PhD thesis, University of Warwick.
Zanella, G. (2016). Random partition models and complementary clustering of Anglo-Saxon placenames. Annals of Applied Statistics 9(4), 1792–1822.
Zanella, G., WSK, and M. Bédard (2016). A Dirichlet Form approach to MCMC Optimal Scaling. arXiv 1606.01528, 22pp.