A Dirichlet Form approach to MCMC Optimal Scaling


A Dirichlet Form approach to MCMC Optimal Scaling
Giacomo Zanella, Wilfrid S. Kendall, and Mylène Bédard.
Supported by EPSRC Research Grants EP/D002060, EP/K.
LMS Durham Symposium on Stochastic Analysis, 12th July 2017.

Introduction
Outline: Introduction; MCMC and optimal scaling; Dirichlet forms and optimal scaling; Results and methods of proofs; Conclusion

Introduction to Markov chain Monte Carlo (MCMC)

General reference: Brooks et al. (2011), MCMC Handbook.

Suppose x represents an unknown (and therefore random!) parameter and y represents data depending on that parameter, with joint probability density p(x, y). The conditional density is

    p(x | y) = p(x, y) / Z,

where the norming constant Z can be hard to compute! Build a Markov chain with p(x | y) as its equilibrium distribution (no need to know Z), then simulate the Markov chain till approximate equilibrium.

Example: MCMC for Anglo-Saxon statistics

Some historians conjecture that Anglo-Saxon placenames cluster by dissimilar names. Zanella (2015, 2016) uses MCMC: the data provide some support for the conjecture, resulting in a useful clustering.

MCMC and optimal scaling
Outline: Introduction; MCMC and optimal scaling; Dirichlet forms and optimal scaling; Results and methods of proofs; Conclusion

MCMC idea

Goal: estimate E = E_π[h(X)].

Method: simulate an ergodic Markov chain (X_n) with stationary distribution π, and use the empirical estimate

    Ê_n = (1/n) Σ_{k=n₀}^{n₀+n} h(X_k).

(Much easier to apply theory if the chain is reversible.)

Theory: Ê_n → E almost surely.

Varieties of MH-MCMC

Here is the famous Metropolis-Hastings recipe for drawing from a distribution with density f:
Propose Y using conditional density q(y | x);
Accept/Reject the move from X to Y, based on the ratio f(Y) q(X | Y) / (f(X) q(Y | X)).

Options:
1. Independence sampler: proposal q(y | x) = q(y) doesn't depend on x;
2. Random walk (RW MH-MCMC): proposal q(y | x) = q(y − x) behaves as a random walk;
3. MALA MH-MCMC: proposal q(y | x) = q(y − x − λ ∇ log f(x)) drifts towards high target density f.

We shall focus on RW MH-MCMC with Gaussian proposals.

Gaussian RW MH-MCMC

Simple Python code for Gaussian RW MH-MCMC, using normal and exponential from NumPy:
Propose a multivariate Gaussian step;
Test whether to accept the proposal by comparing an exponential random variable with the log MH ratio;
Implement the step if accepted (vector addition).

    while not mcmc.stopped():
        z = normal(0, tau, size=mcmc.dim)          # Gaussian proposal step
        # phi = -log target; accept iff Exp(1) > phi(x + z) - phi(x)
        if exponential() > mcmc.phi(mcmc.x + z) - mcmc.phi(mcmc.x):
            mcmc.x += z                            # accept: move the chain
        mcmc.record_result()

What is the best choice of scale / standard deviation tau?
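The mcmc object on the slide is schematic. A minimal self-contained sketch, assuming the convention φ = −log f (so comparing an Exp(1) variable with the increase in φ reproduces the usual MH accept step), might look like:

```python
import numpy as np

def rw_mh(phi, dim, tau, n_steps, rng):
    """Gaussian random-walk Metropolis for target density proportional to exp(-phi)."""
    x = np.zeros(dim)
    samples = np.empty((n_steps, dim))
    for n in range(n_steps):
        z = rng.normal(0.0, tau, size=dim)      # Gaussian proposal step
        # Accept iff Exp(1) > phi(x+z) - phi(x), i.e. with probability 1 ∧ f(y)/f(x)
        if rng.exponential() > phi(x + z) - phi(x):
            x = x + z
        samples[n] = x
    return samples

# Example: standard Gaussian target, phi(x) = |x|^2 / 2
rng = np.random.default_rng(1)
samples = rw_mh(lambda x: 0.5 * np.dot(x, x), dim=5, tau=0.5, n_steps=20000, rng=rng)
post_burn = samples[1000:]
```

For the standard Gaussian target the post-burn-in samples should have mean near 0 and variance near 1 in each coordinate.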

RW MH-MCMC with Gaussian proposals (smooth target, marginal ∝ exp(−x⁴)). Target is given by 10 i.i.d. coordinates. Scale parameter for proposal: τ = 1 is too large! Acceptance ratio 1.7%.

RW MH-MCMC with Gaussian proposals (smooth target, marginal ∝ exp(−x⁴)). Target is given by 10 i.i.d. coordinates. Scale parameter for proposal: τ = 0.1 is better. Acceptance ratio 76.5%.

RW MH-MCMC with Gaussian proposals (smooth target, marginal ∝ exp(−x⁴)). Target is given by 10 i.i.d. coordinates. Scale parameter for proposal: τ = 0.01 is too small. Acceptance ratio 98.5%.
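The effect of τ on the acceptance rate is easy to reproduce. A rough sketch (the exact percentages on the slides depend on the run, so only the qualitative ordering is checked here):

```python
import numpy as np

def acceptance_rate(tau, dim=10, n_steps=5000, seed=0):
    """Fraction of accepted RW MH-MCMC proposals for a target with marginal ∝ exp(-x^4)."""
    rng = np.random.default_rng(seed)
    phi = lambda x: np.sum(x ** 4)     # -log target, up to the norming constant
    x, accepted = np.zeros(dim), 0
    for _ in range(n_steps):
        z = rng.normal(0.0, tau, size=dim)
        if rng.exponential() > phi(x + z) - phi(x):
            x = x + z
            accepted += 1
    return accepted / n_steps

rates = {tau: acceptance_rate(tau) for tau in (1.0, 0.1, 0.01)}
```

On the slides the three runs gave roughly 1.7%, 76.5%, and 98.5% respectively.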

MCMC Optimal Scaling: classic result (I)

RW MH-MCMC on (R^d, π^d), with π(dx_i) = e^{−φ(x_i)} dx_i and MH acceptance indicator A^(d) = 0 or 1:

    X_0^(d) = (X_1, ..., X_d),  X_i iid π;
    X_1^(d) = (X_1 + A^(d) W_1, ..., X_d + A^(d) W_d),  W_i iid N(0, σ_d²).

Questions: (1) complexity as d → ∞? (2) optimal σ_d?

Theorem (Roberts, Gelman and Gilks, 1997)
Given σ_d² = σ²/d, Lipschitz φ′, and finite E_π[(φ′)⁸], E_π[(φ″)⁴],

    {X^(d)_{⌊td⌋, 1}}_t ⇒ Z,  where dZ_t = s(σ)^{1/2} dB_t − (1/2) s(σ) φ′(Z_t) dt.

Answers: (1) mix in O(d) steps; (2) σ_max = argmax_σ s(σ).

MCMC Optimal Scaling: classic result (II)

Optimization: maximize s(σ)! Given I = E_π[φ′(X)²] and normal CDF Φ,

    s(σ) = σ² · 2Φ(−σ√I/2) = σ² A(σ) = (4/I) (Φ⁻¹(A(σ)/2))² A(σ),

so s(σ) is maximized by choosing the asymptotic acceptance rate

    A(σ_max) = argmax_{A ∈ [0,1]} { (Φ⁻¹(A/2))² A } ≈ 0.234.

Strengths:
Establishes complexity as d → ∞;
Practical information on how to tune the proposal;
Does not depend on φ (CLT-type universality).

Some weaknesses that we will address (there are others):
Convergence of the marginal rather than the joint distribution;
Strong regularity assumptions: Lipschitz φ′, finite E[(φ′)⁸], E[(φ″)⁴].
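The constant 0.234 can be recovered numerically from the last display; a quick sketch using only the standard library (the constant factor 4/I does not affect the argmax, so it is dropped):

```python
from statistics import NormalDist

Phi_inv = NormalDist().inv_cdf

def relative_speed(A):
    """s(sigma) as a function of acceptance rate A, up to the constant factor 4/I."""
    return Phi_inv(A / 2) ** 2 * A

# Grid search over acceptance rates in (0, 1)
grid = [k / 10000 for k in range(1, 10000)]
A_opt = max(grid, key=relative_speed)
```

The maximizer is A_opt ≈ 0.234, the familiar optimal acceptance rate.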

MCMC Optimal Scaling: classic result (III)

There is a wide range of extensions: for example,
Langevin / MALA, for which the magic acceptance probability is 0.574 (Roberts and Rosenthal, 1998);
Non-identically distributed independent target coordinates (Bédard, 2007);
Gibbs random fields (Breyer and Roberts, 2000);
Infinite-dimensional random fields (Mattingly, Pillai and Stuart, 2012);
Markov chains on a hypercube (Roberts, 1998);
Adaptive MCMC: adjust online to optimize acceptance probability (Andrieu and Thoms, 2008; Rosenthal, 2011).

All these build on the s.d.e. approach of Roberts, Gelman and Gilks (1997); hence regularity conditions tend to be severe (but see Durmus et al., 2016).

Dirichlet forms and optimal scaling
Outline: Introduction; MCMC and optimal scaling; Dirichlet forms and optimal scaling; Results and methods of proofs; Conclusion

Dirichlet forms and MCMC 1: definition of Dirichlet form

A (symmetric) Dirichlet form E on a Hilbert space H is a closed bilinear function E(u, v), defined / finite for all u, v ∈ D ⊆ H, which satisfies:
1. D is a dense linear subspace of H;
2. E(u, v) = E(v, u) for u, v ∈ D, so E is symmetric;
3. E(u) = E(u, u) ≥ 0 for u ∈ D;
4. D is a Hilbert space under the ("Sobolev") inner product ⟨u, v⟩ + E(u, v);
5. If u ∈ D then u♯ = (u ∧ 1) ∨ 0 ∈ D, and moreover E(u♯, u♯) ≤ E(u, u).

Relate to a Markov process if (quasi-)regular. Regular Dirichlet form, for locally compact Polish state space E: D ∩ C_0(E) is E₁^{1/2}-dense in D and uniformly dense in C_0(E).

Dirichlet forms and MCMC 2: two examples

1. Dirichlet form obtained from (re-scaled) RW MH-MCMC:

    E_d(h) = (d/2) E[(h(X_1^(d)) − h(X_0^(d)))²].

(E_d can be viewed as the Dirichlet form arising from speeding up the RW MH-MCMC by rate d.)

2. Heuristic infinite-dimensional diffusion limit of this form under scaling:

    E_∞(h) = (s(σ)/2) E_{π^∞}[‖∇h‖²].

Under mild conditions this is: closable, Dirichlet, quasi-regular.

Can we deduce that the RW MH-MCMC scales to look like the infinite-dimensional diffusion, by showing that E_d converges to E_∞?
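The form E_d is directly computable by Monte Carlo, which gives a concrete check on the heuristic limit. A sketch for the standard Gaussian target (so φ(x) = x²/2, I = 1, and for h(x) = x₁ the limit is E_∞(h) = s(σ)/2 with s(σ) = σ² · 2Φ(−σ/2)); the helper name is ours:

```python
import numpy as np
from statistics import NormalDist

def dirichlet_form_estimate(sigma, d, n_samples, seed=0):
    """Monte Carlo estimate of E_d(h) for h(x) = x_1, standard Gaussian target."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, d))                     # X_0 ~ pi^d exactly
    w = rng.normal(scale=sigma / np.sqrt(d), size=(n_samples, d))
    log_ratio = -0.5 * np.sum((x + w) ** 2 - x ** 2, axis=1)
    accept = rng.exponential(size=n_samples) > -log_ratio   # accept w.p. 1 ∧ e^{log ratio}
    # E_d(h) = (d/2) E[(h(X_1) - h(X_0))^2], and h(X_1) - h(X_0) = A * W_1
    return (d / 2) * np.mean(accept * w[:, 0] ** 2)

sigma = 2.38
est = dirichlet_form_estimate(sigma, d=100, n_samples=20000)
limit = 0.5 * sigma ** 2 * 2 * NormalDist().cdf(-sigma / 2)
```

For moderate d the estimate should already be close to the heuristic limit s(σ)/2.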

Useful modes of convergence for Dirichlet forms

1. Gamma-convergence: E_n Γ-converges to E if
(Γ1) E(h) ≤ lim inf_n E_n(h_n) whenever h_n → h ∈ H;
(Γ2) for every h ∈ H there are h_n → h ∈ H such that E(h) ≥ lim sup_n E_n(h_n).

2. Mosco (1994) introduces stronger conditions:
(M1) E(h) ≤ lim inf_n E_n(h_n) whenever h_n → h weakly in H;
(M2) for every h ∈ H there are h_n → h strongly in H such that E(h) ≥ lim sup_n E_n(h_n).

3. Mosco (1994, Theorem 2.4.1, Corollary 2.6.1): conditions (M1) and (M2) imply convergence of the associated resolvent operators, and indeed of the associated semigroups.

4. Sun (1998) gives further conditions which imply weak convergence of the associated processes: these conditions are implied by the existence of a finite constant C such that E_n(h) ≤ C(‖h‖² + E(h)) for all h ∈ H.

Results and methods of proofs
Outline: Introduction; MCMC and optimal scaling; Dirichlet forms and optimal scaling; Results and methods of proofs; Conclusion

Results

Theorem (Zanella, Bédard and WSK, 2016)
Consider the Gaussian RW MH-MCMC based on proposal variance σ²/d with target π^d, where dπ = f dx = e^{−φ} dx. Suppose I = ∫ (φ′)² f dx < ∞ (finite Fisher information), and

    |φ′(x + v) − φ′(x)| < κ max{|v|^γ, |v|^α}

for some κ > 0, 0 < γ < 1, and α > 1. Let E_d be the corresponding Dirichlet form, scaled as above. Then E_d Mosco-converges to E_∞ = E[1 ∧ exp(N(−½σ²I, σ²I))] · E, so the corresponding L² semigroups also converge.

Corollary
Suppose in the above that φ′ is globally Lipschitz. Then the correspondingly scaled processes exhibit weak convergence.

Methods of proof 1: a CLT result

Lemma (A conditional CLT)
Under the conditions of the Corollary, almost surely (in x with invariant measure π^∞) the log Metropolis-Hastings ratio converges weakly (in W) as d → ∞:

    log ∏_{i=1}^d [f(x_i + σW_i/√d) / f(x_i)] = −∑_{i=1}^d (φ(x_i + σW_i/√d) − φ(x_i)) ⇒ N(−½σ²I, σ²I).

We may use this to deduce the asymptotic acceptance rate of the RW MH-MCMC sampler.
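The lemma is easy to check by simulation. A sketch for the standard Gaussian case φ(x) = x²/2 (so I = 1), where the log MH ratio should be approximately N(−½σ², σ²) for large d:

```python
import numpy as np

def log_mh_ratios(sigma, d, n_samples, seed=0):
    """Samples of the log Metropolis-Hastings ratio for a standard Gaussian target."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, d))       # x ~ pi^d
    w = rng.normal(size=(n_samples, d))       # proposal noise
    delta = sigma * w / np.sqrt(d)
    # -sum_i [phi(x_i + delta_i) - phi(x_i)] with phi(x) = x^2 / 2
    return -np.sum(x * delta + 0.5 * delta ** 2, axis=1)

sigma = 1.0
L = log_mh_ratios(sigma, d=200, n_samples=10000)
```

Here the sample mean should be close to −σ²/2 and the sample variance close to σ².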

Key idea for CLT

Use exact Taylor expansion techniques:

    ∑_{i=1}^d (φ(x_i + σW_i/√d) − φ(x_i))
      = ∑_{i=1}^d φ′(x_i) σW_i/√d + ∑_{i=1}^d (σW_i/√d) ∫₀¹ (φ′(x_i + σW_i u/√d) − φ′(x_i)) du.

Condition implicitly on x for the first 2.5 steps:
1. The first summand converges weakly to N(0, σ²I).
2. Decompose the variance of the second summand to deduce
   Var[∑_{i=1}^d (σW_i/√d) ∫₀¹ (φ′(x_i + σW_i u/√d) − φ′(x_i)) du] → 0.
3. Use Hoeffding's inequality then absolute expectations:
   E[∑_{i=1}^d (σW_i/√d) ∫₀¹ (φ′(x_i + σW_i u/√d) − φ′(x_i)) du] → ½σ²I.
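The expansion above is an exact identity, not an approximation, which a short numerical check confirms. A sketch for the test function φ(x) = x⁴ (so φ′(x) = 4x³), with the integral evaluated by a midpoint rule:

```python
def phi(x):
    return x ** 4

def dphi(x):
    return 4 * x ** 3

def taylor_rhs(x, delta, n=20000):
    """phi'(x)*delta + delta * integral_0^1 (phi'(x + u*delta) - phi'(x)) du, midpoint rule."""
    mids = ((k + 0.5) / n for k in range(n))
    integral = sum(dphi(x + u * delta) - dphi(x) for u in mids) / n
    return dphi(x) * delta + delta * integral

x, delta = 1.3, 0.7
lhs = phi(x + delta) - phi(x)    # exact increment
rhs = taylor_rhs(x, delta)       # Taylor identity, numerically integrated
```

Both sides agree to quadrature accuracy, for any smooth φ.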

Methods of proof 2: establishing condition (M2)

For every h ∈ L²(π^∞), find h_n → h (strongly) in L²(π^∞) such that E_∞(h) ≥ lim sup_n E_n(h_n).
1. It is sufficient to consider the case E_∞(h) < ∞.
2. Find a sequence of smooth cylinder functions h_n with compact cylindrical support, such that |E_∞(h) − E_∞(h_n)| ≤ 1/n.
3. Using smoothness etc., E_m(h_n) → E_∞(h_n) as m → ∞.
4. Subsequences...

Methods of proof 3: establishing condition (M1)

If h_n → h weakly in L²(π^∞), show E_∞(h) ≤ lim inf_n E_n(h_n). The detailed stochastic analysis involves:
1. Set Ψ_n(h) = √(n/2) (h(X_0^(n)) − h(X_1^(n))).
2. Integrate against a test function ξ(x_{1:N}, W_{1:N}) I(U < a(x_{1:N}, W_{1:N})), for ξ smooth with compact support and U a Uniform(0, 1) random variable; apply Cauchy-Schwarz.
3. Use integration by parts, careful analysis, and the conditions on φ′.

Doing even better

Durmus et al. (2016) introduce L^p mean differentiability: there is φ′ such that, for some p > 2 and some α > 0,

    φ(x + u) − φ(x) = (φ′(x) + R(x, u)) u,   E[|R(X, u)|^p]^{1/p} = o(|u|^α).

Also I = E[|φ′|²] < ∞. Durmus et al. (2016) obtain optimal scaling results when p > 4 and E[|φ′|⁶] < ∞.

L^p mean differentiability applies straightforwardly to the Zanella, Bédard and WSK (2016) argument mutatis mutandis: the regularity conditions can be weakened even more, at least for vague convergence.

Conclusion
Outline: Introduction; MCMC and optimal scaling; Dirichlet forms and optimal scaling; Results and methods of proofs; Conclusion

Conclusion

The Dirichlet form approach allows significant relaxation of the conditions required for optimal scaling results;
Combine with L^p mean differentiability to obtain further relaxation of the regularity conditions;
Soft argument for the relation (mean + ½ variance = 0) implied by N(−½σ²I, σ²I);
MALA generalization (exercise in progress);
Need to explore development beyond i.i.d. targets; e.g. can regularity be similarly relaxed in more general random field settings?
Apply to discrete Markov chain cases? (cf. Roberts, 1998);
Investigate applications to Adaptive MCMC.

89 6 References: Andrieu, Christophe and Johannes Thoms (2008). A tutorial on adaptive MCMC. In: Statistics and Computing 18.4, pp Bédard, Mylène (2007). Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. In: Annals of Applied Probability 17.4, pp Breyer, L A and Gareth O Roberts (2000). From Metropolis to diffusions : Gibbs states and optimal scaling. In: Stochastic Processes and their Applications 90.2, pp Brooks, Stephen P, Andrew Gelman, Galin L Jones and Xiao-Li Meng (2011). Handbook of Markov Chain Monte Carlo. Boca Raton: Chapman & Hall/CRC, pp. 592+xxv.

90 Durmus, Alain, Sylvain Le Corff, Eric Moulines and Gareth O Roberts (2016). Optimal scaling of the Random Walk Metropolis algorithm under $Lˆp$ mean differentiability. In: arxiv Geyer, Charlie (1999). Likelihood inference for spatial point processes. In: Stochastic Geometry: likelihood and computation. Ed. by Ole E Barndorff-Nielsen, WSK and M N M van Lieshout. Boca Raton: Chapman & Hall/CRC. Chap. 4, pp Hastings, W K (1970). Monte Carlo sampling methods using Markov chains and their applications. In: Biometrika 57, pp Mattingly, Jonathan C., Natesh S. Pillai and Andrew M. Stuart (2012). Diffusion limits of the random walk metropolis algorithm in high dimensions. In: Annals of Applied Probability 22.3, pp

91 8 Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller and Edward Teller (1953). Equation of state calculations by fast computing machines. en. In: Journal Chemical Physics 21.6, pp Mosco, Umberto (1994). Composite media and asymptotic Dirichlet forms. In: Journal of Functional Analysis 123.2, pp Roberts, Gareth O (1998). Optimal Metropolis algorithms for product measures on the vertices of a hypercube. In: Stochastics and Stochastic Reports June 2013, pp Roberts, Gareth O, A Gelman and W Gilks (1997). Weak Convergence and Optimal Scaling of Random Walk Algorithms. In: The Annals of Applied Probability 7.1, pp

92 9 Roberts, Gareth O and Jeffrey S Rosenthal (1998). Optimal scaling of discrete approximations to Langevin diffusions.. In: J. R. Statist. Soc. B 60.1, pp Rosenthal, Jeffrey S (2011). Optimal Proposal Distributions and Adaptive MCMC. In: Handbook of Markov Chain Monte Carlo 1, pp Sun, Wei (1998). Weak convergence of Dirichlet processes. In: Science in China Series A: Mathematics 41.1, pp Thompson, Elizabeth A (2005). MCMC in the analysis of genetic data on pedigree. In: Markov chain Monte Carlo: Innovations and Applications. Ed. by WSK, Faming Liang and Jian-Sheng Wang. Singapore: World Scientific. Chap. 5, pp Zanella, Giacomo (2015). Bayesian Complementary Clustering, MCMC, and Anglo-Saxon Placenames. PhD Thesis. University of Warwick.

93 Zanella, Giacomo (2016). Random Partition Models and Complementary Clustering of Anglo-Saxon Placenames. In: Annals of Applied Statistics 9.4.
Zanella, Giacomo, Mylène Bédard and WSK (2016). A Dirichlet Form approach to MCMC Optimal Scaling. arXiv preprint, 22pp.
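The optimal-scaling papers cited above (notably Roberts, Gelman and Gilks 1997) concern tuning the proposal scale of the random walk Metropolis algorithm. As orientation for readers of these references, here is a minimal sketch of that algorithm; the function `rwm`, the standard-normal target, and all tuning values are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def rwm(log_target, x0, sigma, n_steps, rng=None):
    """Random walk Metropolis with Gaussian proposals of scale sigma."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    lp = log_target(x)
    chain, accepts = [x.copy()], 0
    for _ in range(n_steps):
        prop = x + sigma * rng.standard_normal(x.shape)   # propose a jump
        lp_prop = log_target(prop)
        # Metropolis accept/reject: accept with prob min(1, pi(prop)/pi(x));
        # only the ratio is needed, so the norming constant Z never appears.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepts += 1
        chain.append(x.copy())
    return np.array(chain), accepts / n_steps

# Standard-normal target in d = 10 dimensions; the classical theory suggests
# proposal scale about 2.38 / sqrt(d), giving acceptance rates near 0.234
# as the dimension d grows.
d = 10
chain, acc_rate = rwm(lambda x: -0.5 * np.dot(x, x), np.zeros(d),
                      2.38 / np.sqrt(d), 5000, rng=1)
```

The returned `acc_rate` lands in the region the optimal-scaling literature predicts, which is how the `2.38 / sqrt(d)` rule is monitored in practice.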

94 Version 1.34 (Wed Jul 12 2017)
================================================
commit d81469c5f14c484aa363616e3d701c6e3fbb1141
Author: Wilfrid Kendall
Made it explicit that $L^p$ mean differentiability still doesn't cover weak convergence without extra regularity: need to beat this!


More information

Mean field simulation for Monte Carlo integration. Part II : Feynman-Kac models. P. Del Moral

Mean field simulation for Monte Carlo integration. Part II : Feynman-Kac models. P. Del Moral Mean field simulation for Monte Carlo integration Part II : Feynman-Kac models P. Del Moral INRIA Bordeaux & Inst. Maths. Bordeaux & CMAP Polytechnique Lectures, INLN CNRS & Nice Sophia Antipolis Univ.

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information