Inexact approximations for doubly and triply intractable problems

1 Inexact approximations for doubly and triply intractable problems March 27th, 2014

2 Markov random fields Interacting objects
Markov random fields (MRFs) are used for modelling (often large numbers of) interacting objects, usually with symmetrical interactions. They are used widely in statistics, physics and computer science, e.g. image analysis; ferromagnetism; geostatistics; point processes; social networks.

3 Markov random fields Image analysis The log expression of 72 genes on a particular chromosome over 46 hours (Friel et al. 2009).

4 Markov random fields Pairwise Markov random fields

5 Markov random fields Intractable normalising constants
Pairwise MRFs correspond to the factorisation
$$f(y \mid \theta) \propto \gamma(y \mid \theta) = \prod_{(i,j) \in \text{Nei}(y)} \phi(y_i, y_j \mid \theta).$$
We also need to specify the normalising constant
$$Z(\theta) = \int_y \prod_{(i,j) \in \text{Nei}(y)} \phi(y_i, y_j \mid \theta) \, dy,$$
so that $f(y \mid \theta) = \gamma(y \mid \theta) / Z(\theta)$. In general we are interested in Gibbs random fields, models of the form
$$f(y \mid \theta) = \frac{\exp(\theta^T S(y))}{Z(\theta)}.$$

6 A doubly intractable problem Doubly intractable
Suppose we want to estimate parameters θ after observing Y = y. Use Bayesian inference to find $\pi(\theta \mid y) \propto f(y \mid \theta) p(\theta)$. Could use MCMC, but the Metropolis-Hastings acceptance probability is
$$\min\left\{1, \frac{q(\theta \mid \theta') \, p(\theta') \, \gamma(y \mid \theta')}{q(\theta' \mid \theta) \, p(\theta) \, \gamma(y \mid \theta)} \cdot \frac{Z(\theta)}{Z(\theta')}\right\},$$
which involves the intractable ratio $Z(\theta)/Z(\theta')$.

8 A doubly intractable problem ABC-MCMC
Approximate an intractable likelihood at θ with
$$\frac{1}{R} \sum_{r=1}^{R} \pi_\epsilon(S(x_r) \mid S(y)),$$
where the $x_r \sim f(\cdot \mid \theta)$ are R simulations from f (originally in Ratmann et al. (2009)). Often R = 1 and $\pi_\epsilon(\cdot \mid S(y)) = \mathcal{U}(\cdot; (S(y) - \epsilon, S(y) + \epsilon))$. This is essentially a nonparametric kernel estimator of the conditional distribution of the statistics given θ, based on simulations from f. ABC-MCMC is an MCMC algorithm that targets the resulting approximate posterior.
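The estimator above can be sketched as follows. This is a minimal illustration, not the slides' implementation: the function name and the `simulate`/`summary` callables are ours, and a scalar summary with the uniform kernel is assumed.

```python
import numpy as np

def abc_likelihood_estimate(theta, s_obs, simulate, summary, R=100, eps=0.1, rng=None):
    """Monte Carlo ABC estimate of the likelihood at theta:
    (1/R) sum_r pi_eps(S(x_r) | S(y)), with a uniform kernel of half-width eps."""
    rng = np.random.default_rng() if rng is None else rng
    hits = 0
    for _ in range(R):
        x = simulate(theta, rng)             # x_r ~ f(. | theta)
        if abs(summary(x) - s_obs) <= eps:   # indicator part of U(S(y)-eps, S(y)+eps)
            hits += 1
    return hits / (R * 2 * eps)              # kernel height 1/(2 eps) inside the window
```

With a large ε every simulation is accepted and the estimate collapses to the kernel height $1/(2\epsilon)$, which makes the bias of a loose tolerance visible.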

10 A doubly intractable problem ABC on ERGMs
[Figure: posterior for an exponential random graph model, comparing the true posterior with the ABC approximation.]

11 A doubly intractable problem Synthetic likelihood
An alternative approximation proposed in Wood (2010). Again take R simulations from f, $x_r \sim f(\cdot \mid \theta)$, and compute the summary statistics of each, but instead use a multivariate normal approximation to the distribution of the summary statistics given θ:
$$\mathcal{L}(S(y) \mid \theta) = \mathcal{N}\left(S(y); \hat\mu_\theta, \hat\Sigma_\theta\right),$$
where
$$\hat\mu_\theta = \frac{1}{R} \sum_{r=1}^{R} S(x_r), \qquad \hat\Sigma_\theta = \frac{s s^T}{R - 1}, \qquad s = \left(S(x_1) - \hat\mu_\theta, \ldots, S(x_R) - \hat\mu_\theta\right).$$
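A sketch of the synthetic log-likelihood computation, assuming vector-valued summaries and a `simulate_summary` callable of our own naming (this is the plug-in Gaussian fit described above, not code from the slides):

```python
import numpy as np

def synthetic_loglik(theta, s_obs, simulate_summary, R=100, rng=None):
    """Gaussian synthetic log-likelihood of Wood (2010): fit N(mu_theta, Sigma_theta)
    to R simulated summary-statistic vectors and evaluate the density at S(y)."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.array([simulate_summary(theta, rng) for _ in range(R)])  # shape (R, d)
    mu = S.mean(axis=0)                       # mu_theta
    resid = S - mu
    Sigma = resid.T @ resid / (R - 1)         # Sigma_theta = s s^T / (R - 1)
    d = len(mu)
    diff = s_obs - mu
    sign, logdet = np.linalg.slogdet(Sigma)   # stable log-determinant
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
```

The log-scale evaluation with `slogdet` avoids underflow when the summaries are high-dimensional or the fitted covariance is nearly singular.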

12 A doubly intractable problem The single auxiliary variable (SAV) method
Møller et al. (2006) augment the target distribution with an extra variable u and use
$$\pi(\theta, u \mid y) \propto q_u(u \mid \theta, y) \, f(y \mid \theta) \, p(\theta),$$
where $q_u$ is some arbitrary (normalised) distribution and u is on the same space as y. As the MH proposal in $(\theta, u)$-space they use $(\theta', u') \sim f(u' \mid \theta') \, q(\theta' \mid \theta)$. This gives an acceptance probability of
$$\min\left\{1, \frac{q(\theta \mid \theta') \, p(\theta') \, \gamma(y \mid \theta') \, \gamma(u \mid \theta) \, q_u(u' \mid \theta', y)}{q(\theta' \mid \theta) \, p(\theta) \, \gamma(y \mid \theta) \, \gamma(u' \mid \theta') \, q_u(u \mid \theta, y)}\right\}.$$

13 A doubly intractable problem Exact approximations
Note that $q_u(u' \mid \theta', y) / \gamma(u' \mid \theta')$ is an unbiased importance sampling estimator of $1/Z(\theta')$, so the algorithm still targets the correct distribution! This was first seen in the pseudo-marginal methods of Beaumont (2003) and Andrieu and Roberts (2009). It relies on being able to simulate exactly from $f(\cdot \mid \theta')$, which is usually not possible or computationally expensive. Girolami et al. (2013) introduce an approach ("Russian Roulette") that does not require exact simulation.

16 A doubly intractable problem The exchange algorithm
Murray et al. (2006) propose instead to use $\gamma(u \mid \theta) / \gamma(u \mid \theta')$, with $u \sim f(\cdot \mid \theta')$, as an importance sampling estimator of $Z(\theta)/Z(\theta')$. This gives an acceptance probability of
$$\min\left\{1, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)} \frac{p(\theta')}{p(\theta)} \frac{\gamma(y \mid \theta')}{\gamma(y \mid \theta)} \frac{\gamma(u \mid \theta)}{\gamma(u \mid \theta')}\right\}.$$
This is an unbiased estimator of the acceptance probability rather than of the target, so it no longer fits into the exact-approximation framework; however, the method still has the correct target (something of a special case). It is simpler, and often more efficient, than SAV.
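A minimal sketch of one exchange-algorithm chain, under assumptions of ours: a scalar parameter, a symmetric random-walk proposal (so the q terms cancel), and a `simulate` callable providing the exact auxiliary draw the algorithm requires.

```python
import numpy as np

def exchange_mcmc(y, log_gamma, simulate, log_prior, theta0, n_iters=1000,
                  prop_sd=0.5, rng=None):
    """Exchange algorithm (Murray et al., 2006) with a random-walk proposal.
    log_gamma(x, theta) is the log unnormalised likelihood; simulate(theta, rng)
    must return an exact draw u ~ f(. | theta)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    chain = []
    for _ in range(n_iters):
        theta_p = theta + prop_sd * rng.standard_normal()  # symmetric q cancels
        u = simulate(theta_p, rng)                         # auxiliary draw at theta'
        log_alpha = (log_prior(theta_p) - log_prior(theta)
                     + log_gamma(y, theta_p) - log_gamma(y, theta)
                     + log_gamma(u, theta) - log_gamma(u, theta_p))
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_p
        chain.append(theta)
    return np.array(chain)
```

On a toy model whose normalising constant is actually constant (e.g. a unit-variance Gaussian likelihood), the auxiliary ratio has mean one and the chain recovers the usual posterior, which gives a cheap sanity check.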

17 A triply intractable problem Estimating the marginal likelihood
The marginal likelihood (also known as the evidence) is
$$p(y) = \int_\theta p(\theta) \, f(y \mid \theta) \, d\theta.$$
It is used in Bayesian model comparison, $p(M \mid y) \propto p(M) \, p(y \mid M)$, most commonly via the Bayes factor $p(y \mid M_1) / p(y \mid M_2)$ for comparing models. All commonly used methods require $f(y \mid \theta)$ to be tractable in θ, and the evidence usually cannot be estimated from MCMC output: a triply intractable problem (Friel, 2013).

18 A triply intractable problem Chib's method (via population exchange)
Friel (2013) details an approach that uses Chib's method. For any $\tilde\theta$:
$$p(y) = \frac{f(y \mid \tilde\theta) \, \pi(\tilde\theta)}{\pi(\tilde\theta \mid y)} = \frac{\gamma(y \mid \tilde\theta) \, \pi(\tilde\theta)}{Z(\tilde\theta) \, \pi(\tilde\theta \mid y)}.$$
A population variant of the exchange algorithm is used to simulate points from $\pi(\theta \mid y)$; this approach gives an estimate of $Z(\theta)$ for each θ drawn from the posterior. Chib's method is then applied, averaging the identity above over a number of high-probability draws from $\pi(\theta \mid y)$, using: the terms in the numerator directly; the estimate of $Z(\theta)$ from the population exchange algorithm; and a kernel density estimate of $\pi(\theta \mid y)$. This relies on θ being low dimensional.

21 A triply intractable problem Using importance sampling (IS)
Importance sampling returns a weighted sample $\{(\theta^{(p)}, w^{(p)}) : 1 \leq p \leq P\}$ approximating $\pi(\theta \mid y)$.
For p = 1 : P
  simulate $\theta^{(p)} \sim q(\cdot)$;
  weight $w^{(p)} = p(\theta^{(p)}) \, f(y \mid \theta^{(p)}) / q(\theta^{(p)})$.
Then $\hat p(y) = \frac{1}{P} \sum_{p=1}^{P} w^{(p)}$.
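The loop above translates directly into code. A sketch under our own naming conventions, with all densities passed on the log scale and the average computed via log-sum-exp for stability:

```python
import numpy as np

def is_evidence(log_prior, log_lik, sample_q, log_q, P=10000, rng=None):
    """Importance-sampling estimate of the marginal likelihood p(y):
    draw theta^(p) ~ q, weight by p(theta) f(y|theta) / q(theta), then average."""
    rng = np.random.default_rng() if rng is None else rng
    thetas = [sample_q(rng) for _ in range(P)]
    log_w = np.array([log_prior(t) + log_lik(t) - log_q(t) for t in thetas])
    m = log_w.max()
    return np.exp(m) * np.mean(np.exp(log_w - m))   # (1/P) sum of weights, stably
```

With a conjugate Gaussian model (where $p(y)$ is available in closed form) the estimate can be checked against the exact evidence.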

22 A triply intractable problem Using ABC-IS
Didelot, Everitt, Johansen and Lawson (2011) investigate the use of the ABC approximation when using IS for marginal likelihoods. The weights are
$$w^{(p)} = \frac{p(\theta^{(p)}) \, \frac{1}{R} \sum_{r=1}^{R} \pi_\epsilon(S(x_r^{(p)}) \mid S(y))}{q(\theta^{(p)})},$$
where $\{x_r^{(p)}\}_{r=1}^{R} \sim f(\cdot \mid \theta^{(p)})$. This method estimates $p(S(y))$ rather than $p(y)$. Didelot et al. (2011), Grelaud et al. (2009), Robert et al. (2011) and Marin et al. (2014) discuss the choice of summary statistics.

23 A triply intractable problem Exponential family models
Didelot et al. (2011): when comparing two exponential family models, if $S_1(y)$ is sufficient for the parameters in model 1 and $S_2(y)$ is sufficient for the parameters in model 2, then using the vector $S(y) = (S_1(y), S_2(y))$ for both models gives
$$\frac{p(y \mid M_1)}{p(y \mid M_2)} = \frac{p(S(y) \mid M_1)}{p(S(y) \mid M_2)}.$$
Marin et al. (2014) give much more general guidance.

24 A triply intractable problem Synthetic likelihood IS
We could also use the SL approximation within IS. The weight is then
$$w^{(p)} = \frac{p(\theta^{(p)}) \, \mathcal{N}\left(S(y); \hat\mu_\theta, \hat\Sigma_\theta\right)}{q(\theta^{(p)})},$$
where $\hat\mu_\theta, \hat\Sigma_\theta$ are based on $\{x_r^{(p)}\}_{r=1}^{R} \sim f(\cdot \mid \theta^{(p)})$. This does not require choosing ε, but relies on the normality assumption.

25 A triply intractable problem Exact methods?
Importance sampling:
$$p(y) = \int_\theta \frac{f(y \mid \theta) \, p(\theta)}{q(\theta)} q(\theta) \, d\theta \approx \frac{1}{P} \sum_{p=1}^{P} \frac{f(y \mid \theta^{(p)}) \, p(\theta^{(p)})}{q(\theta^{(p)})} = \frac{1}{P} \sum_{p=1}^{P} \frac{\gamma(y \mid \theta^{(p)}) \, p(\theta^{(p)})}{q(\theta^{(p)})} \frac{1}{Z(\theta^{(p)})}.$$
Intractable...

27 A triply intractable problem SAV importance sampling
Consider the SAV target $\pi(\theta, u \mid y) \propto q_u(u \mid \theta, y) \, f(y \mid \theta) \, p(\theta)$, noting that it has the same marginal likelihood as $\pi(\theta \mid y)$. Suppose we do importance sampling on this SAV target, and choose the proposal to be $q(\theta, u) = f(u \mid \theta) \, q(\theta)$. We obtain
$$\hat p(y) = \frac{1}{P} \sum_{p=1}^{P} \frac{q_u(u \mid \theta^{(p)}, y) \, \gamma(y \mid \theta^{(p)}) \, p(\theta^{(p)}) \, Z(\theta^{(p)})}{\gamma(u \mid \theta^{(p)}) \, q(\theta^{(p)}) \, Z(\theta^{(p)})} = \frac{1}{P} \sum_{p=1}^{P} \frac{\gamma(y \mid \theta^{(p)}) \, p(\theta^{(p)})}{q(\theta^{(p)})} \frac{q_u(u \mid \theta^{(p)}, y)}{\gamma(u \mid \theta^{(p)})}.$$

29 A triply intractable problem Exact approximations revisited
Unbiased weight estimates have been used within importance sampling before: (IS)² (Tran et al., 2013); random weight particle filters (Fearnhead et al., 2010); (SMC)² (Chopin et al., 2011). For each θ, we could use multiple u variables and use the estimate
$$\widehat{\frac{1}{Z(\theta)}} = \frac{1}{M} \sum_{m=1}^{M} \frac{q_u(u^{(m)} \mid \theta, y)}{\gamma(u^{(m)} \mid \theta)}.$$
For u the proposal is pre-determined, but we need to choose $q_u(u \mid \theta, y)$. Møller et al. (2006): one possible choice is $q_u(u \mid \theta, y) = \gamma(u \mid \tilde\theta) / Z(\tilde\theta)$, where $\tilde\theta$ is an ML estimate (or some other appropriate estimate) of θ.

32 A triply intractable problem SAVIS / MAVIS
Using the suggested $q_u$ gives the following importance sampling estimate of $1/Z(\theta)$:
$$\widehat{\frac{1}{Z(\theta)}} = \frac{1}{Z(\tilde\theta)} \frac{1}{M} \sum_{m=1}^{M} \frac{\gamma(u^{(m)} \mid \tilde\theta)}{\gamma(u^{(m)} \mid \theta)}.$$
Or, using annealed importance sampling (Neal, 2001) with the sequence of targets
$$f_k(\cdot \mid \theta, \tilde\theta, y) \propto \gamma_k(\cdot \mid \theta, \tilde\theta) = \gamma(\cdot \mid \tilde\theta)^{(K+1-k)/(K+1)} \, \gamma(\cdot \mid \theta)^{k/(K+1)},$$
we obtain
$$\widehat{\frac{1}{Z(\theta)}} = \frac{1}{Z(\tilde\theta)} \frac{1}{M} \sum_{m=1}^{M} \prod_{k=0}^{K} \frac{\gamma_{k+1}(u_k^{(m)} \mid \theta, \tilde\theta, y)}{\gamma_k(u_k^{(m)} \mid \theta, \tilde\theta, y)}.$$
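The first (SAVIS, single-bridge) estimator can be sketched as below; the MAVIS version inserts the annealing product in place of the single ratio. Function names and the `simulate` callable are ours, and exact simulation from $f(\cdot \mid \theta)$ is assumed, as on the slide.

```python
import numpy as np

def savis_inv_Z(theta, theta_hat, inv_Z_hat, log_gamma, simulate, M=1000, rng=None):
    """SAVIS importance-sampling estimate of 1/Z(theta):
    (1/Z(theta_hat)) * (1/M) * sum_m gamma(u_m | theta_hat) / gamma(u_m | theta),
    with u_m ~ f(. | theta). inv_Z_hat = 1/Z(theta_hat) is assumed known."""
    rng = np.random.default_rng() if rng is None else rng
    log_ratios = []
    for _ in range(M):
        u = simulate(theta, rng)                         # exact draw from f(.|theta)
        log_ratios.append(log_gamma(u, theta_hat) - log_gamma(u, theta))
    log_ratios = np.array(log_ratios)
    m = log_ratios.max()
    return inv_Z_hat * np.exp(m) * np.mean(np.exp(log_ratios - m))
```

Unbiasedness follows since $\mathbb{E}_{u \sim f(\cdot \mid \theta)}[\gamma(u \mid \tilde\theta)/\gamma(u \mid \theta)] = Z(\tilde\theta)/Z(\theta)$; on a Gaussian-shaped $\gamma$ with known $Z$, the estimate can be checked numerically.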

34 A triply intractable problem Toy example: Poisson vs geometric
Consider i.i.d. observations $\{y_i\}_{i=1}^{n}$ of a discrete random variable taking values in $\mathbb{N}$. We find the Bayes factor for the models:
1. $Y \mid \lambda \sim \text{Poisson}(\lambda)$, $\lambda \sim \text{Exp}(1)$, with
$$f_1(\{y_i\}_{i=1}^{n} \mid \lambda) = \prod_i \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!} = \frac{\exp(-n\lambda) \, \lambda^{\sum_i y_i}}{\prod_i y_i!};$$
2. $Y \mid p \sim \text{Geometric}(p)$, $p \sim \text{Unif}(0,1)$, with
$$f_2(\{y_i\}_{i=1}^{n} \mid p) = \prod_i p (1-p)^{y_i} = p^n (1-p)^{\sum_i y_i}.$$
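For this toy example both evidences are available in closed form (integrating the likelihoods above against their priors gives a Gamma integral and a Beta function respectively), which is what makes it a useful test case. A sketch, with function names of our own choosing:

```python
from math import lgamma, exp, log

def log_evidence_poisson(y):
    """p(y|M1) with Y_i ~ Poisson(lam), lam ~ Exp(1):
    the integral gives Gamma(sum(y)+1) / (prod(y_i!) * (n+1)^(sum(y)+1))."""
    n, s = len(y), sum(y)
    return lgamma(s + 1) - sum(lgamma(yi + 1) for yi in y) - (s + 1) * log(n + 1)

def log_evidence_geometric(y):
    """p(y|M2) with Y_i ~ Geometric(p) on {0,1,...}, p ~ Unif(0,1):
    the integral is the Beta function B(n+1, sum(y)+1)."""
    n, s = len(y), sum(y)
    return lgamma(n + 1) + lgamma(s + 1) - lgamma(n + s + 2)

def bayes_factor(y):
    """Exact Bayes factor p(y|M1) / p(y|M2)."""
    return exp(log_evidence_poisson(y) - log_evidence_geometric(y))
```

For instance, a single observation $y_1 = 0$ gives $p(y \mid M_1) = p(y \mid M_2) = 1/2$ and hence a Bayes factor of exactly 1.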

35 A triply intractable problem Results: box plots

36 A triply intractable problem Results: ABC-IS

37 A triply intractable problem Results: SL-IS

38 A triply intractable problem Results: MAVIS

39 A triply intractable problem Application to social networks
Compare the evidence for two alternative exponential random graph models, $p(y \mid \theta) \propto \exp(\theta^T S(y))$: in model 1, S(y) = number of edges; in model 2, S(y) = (number of edges, number of two-stars), so θ is now 2-dimensional. Use the prior $p(\theta) = \mathcal{N}(0, 25I)$, as in Friel (2013).

40 A triply intractable problem Results: social network
Friel (2013) finds that the evidence for model 1 is that for model 2. Using 1000 importance points (with 100 simulations from the likelihood for each point):
ABC: ε = 0.1 gives $\hat p(y \mid M_1)/\hat p(y \mid M_2) \approx 4$; ε = 0.05 gives $\hat p(y \mid M_1)/\hat p(y \mid M_2) \approx 20$, but has only 5 points with non-zero weight!
Synthetic likelihood obtains $\hat p(y \mid M_1)/\hat p(y \mid M_2) \approx 40$.
MAVIS finds $\log \hat p(y \mid M_1) =$ , $\log \hat p(y \mid M_2) =$ , giving $\hat p(y \mid M_1)/\hat p(y \mid M_2) \approx 41$.

45 A triply intractable problem Comparison of methods
ABC/SL vs MAVIS: both require the simulation of auxiliary variables, but in ABC/SL the use of summary statistics dramatically reduces the dimension of the space; MAVIS, on the other hand, only requires the auxiliary variable to look like a good simulation from $f(\cdot \mid \theta)$, not (the different requirement) that it is a good match to y. The standard drawbacks of ABC also remain: the choice of the tolerance ε and of S(·), and the fact that it can only estimate Bayes factors, not the evidence itself. SL vs ABC: SL fails when the Gaussian assumption is not appropriate, but it is surprisingly robust, and there is no need to choose an ε.

48 Inexact approximations An inexact approximation
MAVIS is exact only if: exact sampling from $f(\cdot \mid \theta)$ is possible (this also applies to ABC and synthetic likelihood); and $1/Z(\tilde\theta)$ is known. In practice we use an internal MCMC to simulate from $f(\cdot \mid \theta)$, and estimate $1/Z(\tilde\theta)$ offline, in advance of running the IS. Does the use of an inexact approximation matter? Everitt (2012) shows that the use of an internal MCMC within SAV-MCMC and ABC-MCMC does not result in large errors (adapted from the MCWM proof in Andrieu and Roberts (2009)).

52 Inexact approximations Returning to MCMC
In the previous section we started to examine the use of importance samplers with estimated weights: in practice, the small bias that is introduced matters less than the Monte Carlo variance; empirically, a similar observation applies to SMC samplers. Returning to MCMC, we might wonder about the performance of algorithms with estimated acceptance probabilities: "Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels", Alquier, Friel, Everitt and Boland (2014). Is the MH method with acceptance probability $\alpha(\theta, \theta')$ close to the method using $\hat\alpha(\theta, \theta', x')$, with $x' \sim F(\cdot \mid \theta, \theta')$?

54 Inexact approximations Motivation: noisy exchange algorithm
Use $\gamma(u \mid \theta) / \gamma(u \mid \theta')$, with $u \sim f(\cdot \mid \theta')$, as an importance sampling estimator of $Z(\theta)/Z(\theta')$, giving an acceptance probability of
$$\min\left\{1, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)} \frac{p(\theta')}{p(\theta)} \frac{\gamma(y \mid \theta')}{\gamma(y \mid \theta)} \frac{\gamma(u \mid \theta)}{\gamma(u \mid \theta')}\right\}.$$
Could this be improved by simulating R importance points $u_r^*$, to give
$$\min\left\{1, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)} \frac{p(\theta')}{p(\theta)} \frac{\gamma(y \mid \theta')}{\gamma(y \mid \theta)} \frac{1}{R} \sum_{r=1}^{R} \frac{\gamma(u_r^* \mid \theta)}{\gamma(u_r^* \mid \theta')}\right\}?$$
However, this no longer gives an exact algorithm: R = 1, exact; $1 < R < \infty$, inexact; $R = \infty$, exact.
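The averaged (noisy) acceptance probability above can be sketched as follows. This is an illustration under our own assumptions: a symmetric proposal q (so the q terms cancel) and callables of our own naming; the slides do not prescribe an implementation.

```python
import numpy as np

def noisy_exchange_log_alpha(y, theta, theta_p, log_gamma, simulate, log_prior,
                             R=10, rng=None):
    """Log acceptance probability of the noisy exchange algorithm: the single
    auxiliary draw is replaced by an average of R importance ratios
    gamma(u_r|theta) / gamma(u_r|theta'), with u_r ~ f(. | theta')."""
    rng = np.random.default_rng() if rng is None else rng
    us = [simulate(theta_p, rng) for _ in range(R)]
    log_ratios = np.array([log_gamma(u, theta) - log_gamma(u, theta_p) for u in us])
    m = log_ratios.max()
    log_avg = m + np.log(np.mean(np.exp(log_ratios - m)))   # log of (1/R) sum of ratios
    return min(0.0, log_prior(theta_p) - log_prior(theta)
               + log_gamma(y, theta_p) - log_gamma(y, theta) + log_avg)
```

As a sanity check, proposing $\theta' = \theta$ makes every importance ratio equal to one, so the log acceptance probability is exactly zero regardless of R.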

56 Inexact approximations Noisy MCMC
MCMC involves simulating a Markov chain $(\theta_n)_{n \in \mathbb{N}}$ with transition kernel P such that π is invariant under P: $\pi P = \pi$. In some situations there is a natural kernel P, s.t. $\pi P = \pi$, but from which we cannot draw $\theta_{n+1} \sim P(\theta_n, \cdot)$ for a fixed $\theta_n$. A natural idea is to replace P by an approximation $\hat P$. Ideally $\hat P$ is close to P, but generally $\pi \hat P \neq \pi$. This leads to the obvious question: can we say something about how close the Markov chain with transition kernel $\hat P$ is to the one resulting from P? E.g., is it possible to upper bound $\|\delta_{\theta_0} \hat P^n - \pi\|$? It turns out that a useful answer is given by the study of the stability of Markov chains.

58 Inexact approximations Noisy MCMC
Theorem (Mitrophanov (2005), Corollary 3.1). Assume (H1): the Markov chain with transition kernel P is uniformly ergodic,
$$\sup_{\theta_0} \|\delta_{\theta_0} P^n - \pi\| \leq C \rho^n$$
for some $C < \infty$ and $\rho < 1$. Then, for any $n \in \mathbb{N}$ and any starting point $\theta_0$,
$$\|\delta_{\theta_0} P^n - \delta_{\theta_0} \hat P^n\| \leq \left(\lambda + \frac{C \rho^\lambda}{1 - \rho}\right) \|P - \hat P\|,$$
where $\lambda = \lceil \log(1/C) / \log(\rho) \rceil$.

59 Inexact approximations Noisy MCMC
Corollary. Assume that (H1) holds (the Markov chain with transition kernel P is uniformly ergodic), and (H2):
$$\mathbb{E}_{x' \sim F_{\theta'}} \left|\hat\alpha(\theta, \theta', x') - \alpha(\theta, \theta')\right| \leq \delta(\theta, \theta'). \quad (1)$$
Then, for any $n \in \mathbb{N}$ and any starting point $\theta_0$,
$$\|\delta_{\theta_0} P^n - \delta_{\theta_0} \hat P^n\| \leq \left(\lambda + \frac{C \rho^\lambda}{1 - \rho}\right) 2 \sup_\theta \int d\theta' \, q(\theta' \mid \theta) \, \delta(\theta, \theta'),$$
where $\lambda = \lceil \log(1/C) / \log(\rho) \rceil$.

60 Inexact approximations Noisy MCMC
Note: when the upper bound in (1) is itself bounded,
$$\mathbb{E}_{x' \sim F_{\theta'}} \left|\hat\alpha(\theta, \theta', x') - \alpha(\theta, \theta')\right| \leq \delta(\theta, \theta') \leq \delta < \infty,$$
it follows that
$$\|\delta_{\theta_0} P^n - \delta_{\theta_0} \hat P^n\| \leq \delta \left(\lambda + \frac{C \rho^\lambda}{1 - \rho}\right).$$
Obviously, we expect that $\hat\alpha$ is chosen in such a way that $\delta \ll 1$, and so in this case $\|\delta_{\theta_0} P^n - \delta_{\theta_0} \hat P^n\| \ll 1$ as a consequence.

61 Inexact approximations Convergence of noisy exchange
Lemma. Here we show that the noisy exchange algorithm falls into our theoretical framework: $\hat\alpha$ satisfies (H2) (Lemma 4.2) with
$$\delta(\theta, \theta') = \frac{1}{\sqrt{N}} \frac{q(\theta \mid \theta') \, \pi(\theta') \, \gamma(y \mid \theta')}{q(\theta' \mid \theta) \, \pi(\theta) \, \gamma(y \mid \theta)} \sqrt{\mathrm{Var}_{y' \sim f(\cdot \mid \theta')}\left(\frac{\gamma(y' \mid \theta)}{\gamma(y' \mid \theta')}\right)}.$$

62 Inexact approximations Convergence of noisy exchange
Moreover, assuming that the space Θ is bounded, we can show that
$$\delta(\theta, \theta') \leq \frac{c_h^2 \, c_\pi^2 \, K^4}{\sqrt{N}},$$
and therefore
$$\sup_{\theta_0 \in \Theta} \|\delta_{\theta_0} P^n - \delta_{\theta_0} \hat P^n\| \leq \frac{C'}{\sqrt{N}},$$
where $C' = C'(c_\pi, c_h, K)$ is explicitly known.

63 Inexact approximations Noisy Langevin algorithms
Langevin algorithms (e.g. Welling and Teh, 2011) use the update
$$\theta_{n+1} = \theta_n + \frac{\Sigma}{2} \nabla \log \pi(\theta_n \mid y) + \eta, \qquad \eta \sim \mathcal{N}(0, \Sigma).$$
In practice, it is often the case that $\nabla \log \pi(\theta_n \mid y)$ cannot be computed. Here again, a natural idea is to replace it by an approximation or an estimate $\hat\nabla_{y'} \log \pi(\theta_n \mid y)$. Noisy Langevin algorithms use
$$\theta_{n+1} = \theta_n + \frac{\Sigma}{2} \hat\nabla_{y'} \log \pi(\theta_n \mid y) + \eta, \qquad \eta \sim \mathcal{N}(0, \Sigma), \quad y' \sim F(\cdot \mid \theta_n).$$

64 Inexact approximations A noisy Langevin algorithm for Gibbs random fields
$$\log \pi(\theta \mid y) = \log \pi(\theta) + \log f(y \mid \theta) - \log \int \pi(t) f(y \mid t) \, dt.$$
Therefore
$$\nabla \log \pi(\theta \mid y) = \nabla \log \pi(\theta) + \nabla [\theta^T s(y)] - \nabla \log Z(\theta) = \nabla \log \pi(\theta) + s(y) - \mathbb{E}_{y' \sim f(\cdot \mid \theta)}[s(y')].$$
In practice, $\mathbb{E}_{y' \sim f(\cdot \mid \theta)}[s(y')]$ is unavailable. However, it is possible to estimate it via Monte Carlo. Let $y' = (y'_1, \ldots, y'_N)$ be N i.i.d. variables drawn from $f(\cdot \mid \theta)$; we define
$$\hat\nabla_{y'} \log \pi(\theta \mid y) = \nabla \log \pi(\theta) + s(y) - \frac{1}{N} \sum_{i=1}^{N} s(y'_i).$$
Here, we can also give theoretical guarantees for this algorithm.
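The noisy gradient and one (unadjusted) Langevin step can be sketched as follows, assuming a scalar parameter and scalar step size Σ for simplicity; the function names and the `simulate_stat` callable are ours.

```python
import numpy as np

def noisy_grad_log_post(theta, s_y, grad_log_prior, simulate_stat, N=50, rng=None):
    """Noisy gradient for a Gibbs random field f(y|theta) = exp(theta s(y))/Z(theta):
    grad log pi(theta|y) = grad log pi(theta) + s(y) - E[s(Y')], with the
    expectation replaced by an average over N draws Y'_i ~ f(.|theta)."""
    rng = np.random.default_rng() if rng is None else rng
    s_bar = np.mean([simulate_stat(theta, rng) for _ in range(N)], axis=0)
    return grad_log_prior(theta) + s_y - s_bar

def noisy_langevin_step(theta, s_y, grad_log_prior, simulate_stat, Sigma, rng):
    """One unadjusted noisy Langevin update: theta + (Sigma/2) grad-hat + eta."""
    g = noisy_grad_log_post(theta, s_y, grad_log_prior, simulate_stat, rng=rng)
    eta = rng.normal(0.0, np.sqrt(Sigma))   # eta ~ N(0, Sigma)
    return theta + 0.5 * Sigma * g + eta
```

For an exponential-family toy model with $s(y) = y$, the Monte Carlo term converges to $\mathbb{E}[s(Y')]$ at the usual $1/\sqrt{N}$ rate, which is exactly the $\delta(\theta,\theta') \propto 1/\sqrt{N}$ behaviour the theory describes.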

66 Inexact approximations Metropolis-adjusted Langevin exchange algorithm
Of course, the noisy Langevin diffusion can be used as a proposal and corrected within an exchange algorithm. At each iteration:
draw $y' \sim f(\cdot \mid \theta_n)$ and calculate $\hat\nabla_{y'} \log \pi(\theta_n \mid y) = \nabla \log \pi(\theta_n) + s(y) - \frac{1}{N} \sum_{i=1}^{N} s(y'_i)$;
propose $\theta' = \theta_n + \frac{\Sigma}{2} \hat\nabla_{y'} \log \pi(\theta_n \mid y) + \eta_n$, where the $\eta_n$ are i.i.d. $\mathcal{N}(0, \Sigma)$;
draw an auxiliary variable $u \sim f(\cdot \mid \theta')$ and accept with probability
$$\alpha(\theta_n, \theta') = \min\left\{1, \frac{\gamma(y \mid \theta') \, \pi(\theta') \, q(\theta_n \mid \theta') \, \gamma(u \mid \theta_n)}{\gamma(y \mid \theta_n) \, \pi(\theta_n) \, q(\theta' \mid \theta_n) \, \gamma(u \mid \theta')}\right\}.$$
This can also be extended to a noisy version (as we did for the exchange algorithm).

67 Inexact approximations Simulation study
20 datasets were simulated from a first-order Ising model defined on a lattice, with a single interaction parameter θ = 0.4. The normalising constant z(θ) can be calculated exactly for a fine grid of values $\{\theta_i : i = 1, \ldots, N\}$ (Friel and Rue, 2007), which can be used to estimate
$$\hat\pi(y) = \sum_{i=2}^{N} \frac{\theta_i - \theta_{i-1}}{2} \left( \frac{q_{\theta_i}(y)}{z(\theta_i)} \pi(\theta_i) + \frac{q_{\theta_{i-1}}(y)}{z(\theta_{i-1})} \pi(\theta_{i-1}) \right),$$
which in turn can be used to estimate the posterior density at each grid point:
$$\hat\pi(\theta_i \mid y) = \frac{q_{\theta_i}(y) \, \pi(\theta_i)}{z(\theta_i) \, \hat\pi(y)}, \qquad i = 1, \ldots, N.$$
Here we used a fine grid of 8,000 points in the interval [0, 0.8].
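The trapezoidal evidence estimate above, given precomputed grid ordinates, is a one-liner. A sketch with our own function name, assuming $z(\theta_i)$ has already been evaluated (as is possible for small Ising models):

```python
import numpy as np

def grid_evidence(theta_grid, log_q_y, log_z, log_prior):
    """Trapezoidal estimate of p(y) from ordinates
    h(theta_i) = q_theta_i(y) / z(theta_i) * pi(theta_i) on a fine grid."""
    h = np.exp(np.asarray(log_q_y) - np.asarray(log_z) + np.asarray(log_prior))
    dx = np.diff(np.asarray(theta_grid))
    return float(np.sum(0.5 * dx * (h[1:] + h[:-1])))   # sum of trapezium areas
```

On a conjugate Gaussian model, where the exact evidence is known, a grid of a few thousand points reproduces it to several decimal places.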

68 Inexact approximations Results
[Figure: box plots of the bias for the Exchange, Noisy Exchange, Noisy Langevin, MALA Exchange and Noisy MALA algorithms.]

69 Inexact approximations Conclusions
A new approach (MAVIS) to evidence estimation for Gibbs random fields: it makes use of an inexact approximation, and has some characteristics to recommend it over ABC and SL in general. We also examined in more detail the use of noisy MCMC algorithms: we have seen several examples of where this idea might be useful; an improved Monte Carlo variance may be more important than the biases that are introduced; and it is quite a general framework (e.g. SL can also be seen as a special case). Similar ideas can be used in SMC.

70 Inexact approximations Acknowledgements Noisy MCMC: Pierre Alquier, Nial Friel and Aiden Boland (UCD). Evidence estimation and synthetic likelihood: Nial Friel (UCD), Adam Johansen (Warwick), Melina Evdemon-Hogan and Ellen Rowing (Reading).


More information

Bayes Factors, posterior predictives, short intro to RJMCMC. Thermodynamic Integration

Bayes Factors, posterior predictives, short intro to RJMCMC. Thermodynamic Integration Bayes Factors, posterior predictives, short intro to RJMCMC Thermodynamic Integration Dave Campbell 2016 Bayesian Statistical Inference P(θ Y ) P(Y θ)π(θ) Once you have posterior samples you can compute

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Controlled sequential Monte Carlo

Controlled sequential Monte Carlo Controlled sequential Monte Carlo Jeremy Heng, Department of Statistics, Harvard University Joint work with Adrian Bishop (UTS, CSIRO), George Deligiannidis & Arnaud Doucet (Oxford) Bayesian Computation

More information

Kernel Sequential Monte Carlo

Kernel Sequential Monte Carlo Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

Bayesian Indirect Inference using a Parametric Auxiliary Model

Bayesian Indirect Inference using a Parametric Auxiliary Model Bayesian Indirect Inference using a Parametric Auxiliary Model Dr Chris Drovandi Queensland University of Technology, Australia c.drovandi@qut.edu.au Collaborators: Tony Pettitt and Anthony Lee February

More information

Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling

Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling 1 / 27 Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 27 Monte Carlo Integration The big question : Evaluate E p(z) [f(z)]

More information

Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods

Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods Jonas Hallgren 1 1 Department of Mathematics KTH Royal Institute of Technology Stockholm, Sweden BFS 2012 June

More information

Riemann Manifold Methods in Bayesian Statistics

Riemann Manifold Methods in Bayesian Statistics Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes

More information

MONTE CARLO METHODS. Hedibert Freitas Lopes

MONTE CARLO METHODS. Hedibert Freitas Lopes MONTE CARLO METHODS Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

Bayesian model selection for exponential random graph models via adjusted pseudolikelihoods

Bayesian model selection for exponential random graph models via adjusted pseudolikelihoods Bayesian model selection for exponential random graph models via adjusted pseudolikelihoods Lampros Bouranis *, Nial Friel, Florian Maire School of Mathematics and Statistics & Insight Centre for Data

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL Xuebin Zheng Supervisor: Associate Professor Josef Dick Co-Supervisor: Dr. David Gunawan School of Mathematics

More information

Zig-Zag Monte Carlo. Delft University of Technology. Joris Bierkens February 7, 2017

Zig-Zag Monte Carlo. Delft University of Technology. Joris Bierkens February 7, 2017 Zig-Zag Monte Carlo Delft University of Technology Joris Bierkens February 7, 2017 Joris Bierkens (TU Delft) Zig-Zag Monte Carlo February 7, 2017 1 / 33 Acknowledgements Collaborators Andrew Duncan Paul

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods John Geweke University of Iowa, USA 2005 Institute on Computational Economics University of Chicago - Argonne National Laboaratories July 22, 2005 The problem p (θ, ω I)

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Monte Carlo Inference Methods

Monte Carlo Inference Methods Monte Carlo Inference Methods Iain Murray University of Edinburgh http://iainmurray.net Monte Carlo and Insomnia Enrico Fermi (1901 1954) took great delight in astonishing his colleagues with his remarkably

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Tutorial on ABC Algorithms

Tutorial on ABC Algorithms Tutorial on ABC Algorithms Dr Chris Drovandi Queensland University of Technology, Australia c.drovandi@qut.edu.au July 3, 2014 Notation Model parameter θ with prior π(θ) Likelihood is f(ý θ) with observed

More information

Approximate Inference using MCMC

Approximate Inference using MCMC Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Pseudo-marginal MCMC methods for inference in latent variable models

Pseudo-marginal MCMC methods for inference in latent variable models Pseudo-marginal MCMC methods for inference in latent variable models Arnaud Doucet Department of Statistics, Oxford University Joint work with George Deligiannidis (Oxford) & Mike Pitt (Kings) MCQMC, 19/08/2016

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Kernel Adaptive Metropolis-Hastings

Kernel Adaptive Metropolis-Hastings Kernel Adaptive Metropolis-Hastings Arthur Gretton,?? Gatsby Unit, CSML, University College London NIPS, December 2015 Arthur Gretton (Gatsby Unit, UCL) Kernel Adaptive Metropolis-Hastings 12/12/2015 1

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

An introduction to Sequential Monte Carlo

An introduction to Sequential Monte Carlo An introduction to Sequential Monte Carlo Thang Bui Jes Frellsen Department of Engineering University of Cambridge Research and Communication Club 6 February 2014 1 Sequential Monte Carlo (SMC) methods

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of

More information

Approximate Bayesian Computation: a simulation based approach to inference

Approximate Bayesian Computation: a simulation based approach to inference Approximate Bayesian Computation: a simulation based approach to inference Richard Wilkinson Simon Tavaré 2 Department of Probability and Statistics University of Sheffield 2 Department of Applied Mathematics

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Monte Carlo Dynamically Weighted Importance Sampling for Spatial Models with Intractable Normalizing Constants

Monte Carlo Dynamically Weighted Importance Sampling for Spatial Models with Intractable Normalizing Constants Monte Carlo Dynamically Weighted Importance Sampling for Spatial Models with Intractable Normalizing Constants Faming Liang Texas A& University Sooyoung Cheon Korea University Spatial Model Introduction

More information

arxiv: v1 [stat.me] 30 Sep 2009

arxiv: v1 [stat.me] 30 Sep 2009 Model choice versus model criticism arxiv:0909.5673v1 [stat.me] 30 Sep 2009 Christian P. Robert 1,2, Kerrie Mengersen 3, and Carla Chen 3 1 Université Paris Dauphine, 2 CREST-INSEE, Paris, France, and

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Pseudo-marginal Metropolis-Hastings: a simple explanation and (partial) review of theory

Pseudo-marginal Metropolis-Hastings: a simple explanation and (partial) review of theory Pseudo-arginal Metropolis-Hastings: a siple explanation and (partial) review of theory Chris Sherlock Motivation Iagine a stochastic process V which arises fro soe distribution with density p(v θ ). Iagine

More information

Bayesian estimation of complex networks and dynamic choice in the music industry

Bayesian estimation of complex networks and dynamic choice in the music industry Bayesian estimation of complex networks and dynamic choice in the music industry Stefano Nasini Víctor Martínez-de-Albéniz Dept. of Production, Technology and Operations Management, IESE Business School,

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models

More information

Bridge estimation of the probability density at a point. July 2000, revised September 2003

Bridge estimation of the probability density at a point. July 2000, revised September 2003 Bridge estimation of the probability density at a point Antonietta Mira Department of Economics University of Insubria Via Ravasi 2 21100 Varese, Italy antonietta.mira@uninsubria.it Geoff Nicholls Department

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo 1 Motivation 1.1 Bayesian Learning Markov Chain Monte Carlo Yale Chang In Bayesian learning, given data X, we make assumptions on the generative process of X by introducing hidden variables Z: p(z): prior

More information

Normalising constants and maximum likelihood inference

Normalising constants and maximum likelihood inference Normalising constants and maximum likelihood inference Jakob G. Rasmussen Department of Mathematics Aalborg University Denmark March 9, 2011 1/14 Today Normalising constants Approximation of normalising

More information

Delayed Rejection Algorithm to Estimate Bayesian Social Networks

Delayed Rejection Algorithm to Estimate Bayesian Social Networks Dublin Institute of Technology ARROW@DIT Articles School of Mathematics 2014 Delayed Rejection Algorithm to Estimate Bayesian Social Networks Alberto Caimo Dublin Institute of Technology, alberto.caimo@dit.ie

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

On some properties of Markov chain Monte Carlo simulation methods based on the particle filter

On some properties of Markov chain Monte Carlo simulation methods based on the particle filter On some properties of Markov chain Monte Carlo simulation methods based on the particle filter Michael K. Pitt Economics Department University of Warwick m.pitt@warwick.ac.uk Ralph S. Silva School of Economics

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

The zig-zag and super-efficient sampling for Bayesian analysis of big data

The zig-zag and super-efficient sampling for Bayesian analysis of big data The zig-zag and super-efficient sampling for Bayesian analysis of big data LMS-CRiSM Summer School on Computational Statistics 15th July 2018 Gareth Roberts, University of Warwick Joint work with Joris

More information

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007 Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA)

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) arxiv:1611.01450v1 [stat.co] 4 Nov 2016 Aliaksandr Hubin Department of Mathematics, University of Oslo and Geir Storvik

More information

Bayesian inference for multivariate skew-normal and skew-t distributions

Bayesian inference for multivariate skew-normal and skew-t distributions Bayesian inference for multivariate skew-normal and skew-t distributions Brunero Liseo Sapienza Università di Roma Banff, May 2013 Outline Joint research with Antonio Parisi (Roma Tor Vergata) 1. Inferential

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof

More information

Markov Chain Monte Carlo Lecture 4

Markov Chain Monte Carlo Lecture 4 The local-trap problem refers to that in simulations of a complex system whose energy landscape is rugged, the sampler gets trapped in a local energy minimum indefinitely, rendering the simulation ineffective.

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Manifold Monte Carlo Methods

Manifold Monte Carlo Methods Manifold Monte Carlo Methods Mark Girolami Department of Statistical Science University College London Joint work with Ben Calderhead Research Section Ordinary Meeting The Royal Statistical Society October

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:

More information

An introduction to Approximate Bayesian Computation methods

An introduction to Approximate Bayesian Computation methods An introduction to Approximate Bayesian Computation methods M.E. Castellanos maria.castellanos@urjc.es (from several works with S. Cabras, E. Ruli and O. Ratmann) Valencia, January 28, 2015 Valencia Bayesian

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Nested Sampling. Brendon J. Brewer. brewer/ Department of Statistics The University of Auckland

Nested Sampling. Brendon J. Brewer.   brewer/ Department of Statistics The University of Auckland Department of Statistics The University of Auckland https://www.stat.auckland.ac.nz/ brewer/ is a Monte Carlo method (not necessarily MCMC) that was introduced by John Skilling in 2004. It is very popular

More information

Part 1: Expectation Propagation

Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information

Generative Models and Stochastic Algorithms for Population Average Estimation and Image Analysis

Generative Models and Stochastic Algorithms for Population Average Estimation and Image Analysis Generative Models and Stochastic Algorithms for Population Average Estimation and Image Analysis Stéphanie Allassonnière CIS, JHU July, 15th 28 Context : Computational Anatomy Context and motivations :

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9 Metropolis Hastings Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601 Module 9 1 The Metropolis-Hastings algorithm is a general term for a family of Markov chain simulation methods

More information

Improving power posterior estimation of statistical evidence

Improving power posterior estimation of statistical evidence Improving power posterior estimation of statistical evidence Nial Friel, Merrilee Hurn and Jason Wyse Department of Mathematical Sciences, University of Bath, UK 10 June 2013 Bayesian Model Choice Possible

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers

ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors RicardoS.Ehlers Laboratório de Estatística e Geoinformação- UFPR http://leg.ufpr.br/ ehlers ehlers@leg.ufpr.br II Workshop on Statistical

More information

Fully Bayesian Analysis of Calibration Uncertainty In High Energy Spectral Analysis

Fully Bayesian Analysis of Calibration Uncertainty In High Energy Spectral Analysis In High Energy Spectral Analysis Department of Statistics, UCI February 26, 2013 Model Building Principle Component Analysis Three Inferencial Models Simulation Quasar Analysis Doubly-intractable Distribution

More information

arxiv: v5 [stat.co] 10 Apr 2018

arxiv: v5 [stat.co] 10 Apr 2018 THE BLOCK-POISSON ESTIMATOR FOR OPTIMALLY TUNED EXACT SUBSAMPLING MCMC MATIAS QUIROZ 1,2, MINH-NGOC TRAN 3, MATTIAS VILLANI 4, ROBERT KOHN 1 AND KHUE-DUNG DANG 1 arxiv:1603.08232v5 [stat.co] 10 Apr 2018

More information

arxiv: v1 [stat.co] 1 Jun 2015

arxiv: v1 [stat.co] 1 Jun 2015 arxiv:1506.00570v1 [stat.co] 1 Jun 2015 Towards automatic calibration of the number of state particles within the SMC 2 algorithm N. Chopin J. Ridgway M. Gerber O. Papaspiliopoulos CREST-ENSAE, Malakoff,

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

CS281A/Stat241A Lecture 22

CS281A/Stat241A Lecture 22 CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution

More information

GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

More information