Pseudo-marginal MCMC methods for inference in latent variable models


Arnaud Doucet, Department of Statistics, Oxford University
Joint work with George Deligiannidis (Oxford) & Mike Pitt (King's)
MCQMC, 19/08/2016

Organization of the talk
Latent variable models
The pseudo-marginal method
Optimal tuning
The correlated pseudo-marginal method
Illustrations

Latent Variable Models
Assume $(Y_t)_{t \geq 1}$ are i.i.d. random variables such that
$$X_t \overset{\text{i.i.d.}}{\sim} \mu_\theta(\cdot), \qquad Y_t \mid (X_t = x) \sim g_\theta(\cdot \mid x) \quad \text{for } t = 1, \ldots, T,$$
where $(X_t)_{t \geq 1}$ are latent variables.
The likelihood of $\theta \in \Theta \subseteq \mathbb{R}^d$ associated to observations $Y_{1:T} = y_{1:T}$ is
$$p_\theta(y_{1:T}) = \prod_{t=1}^{T} p_\theta(y_t), \quad \text{where } p_\theta(y_t) = \int \mu_\theta(x_t)\, g_\theta(y_t \mid x_t)\, dx_t.$$
In many scenarios, $p_\theta(y_{1:T})$ cannot be evaluated exactly; e.g. the multivariate probit model.

Latent Variable Models
Assume $\{X_t\}_{t \geq 1}$ is a latent Markov process, i.e. $X_1 \sim \mu_\theta(\cdot)$ and
$$X_{t+1} \mid (X_t = x) \sim f_\theta(\cdot \mid x), \qquad Y_t \mid (X_t = x) \sim g_\theta(\cdot \mid x).$$
The likelihood of $\theta \in \mathbb{R}^d$ associated to $Y_{1:T} = y_{1:T}$ is
$$p_\theta(y_{1:T}) = \int p_\theta(x_{1:T}, y_{1:T})\, dx_{1:T}, \quad \text{where } p_\theta(x_{1:T}, y_{1:T}) = \mu_\theta(x_1)\, g_\theta(y_1 \mid x_1) \prod_{t=2}^{T} f_\theta(x_t \mid x_{t-1})\, g_\theta(y_t \mid x_t).$$
State-space models are a very popular class of time series models, but inference is difficult as $p_\theta(y_{1:T})$ is intractable for non-linear/non-Gaussian models.

Bayesian Inference
Prior distribution of density $p(\theta)$.
Likelihood function $p_\theta(y_{1:T})$.
Bayesian inference relies on the posterior
$$\pi(\theta) = p(\theta \mid y_{1:T}) = \frac{p_\theta(y_{1:T})\, p(\theta)}{\int_\Theta p_{\theta'}(y_{1:T})\, p(\theta')\, d\theta'}.$$
For non-trivial models, inference typically relies on MCMC.

Standard MCMC Approaches
Standard MCMC schemes target $p(\theta, x_{1:T} \mid y_{1:T}) \propto p(\theta)\, p_\theta(x_{1:T}, y_{1:T})$ using a Gibbs-type strategy; i.e. sample alternately $X_{1:T} \sim p_\theta(\cdot \mid y_{1:T})$ and $\theta \sim p(\cdot \mid y_{1:T}, X_{1:T})$.
Problem 1: It can be difficult to sample $p_\theta(x_{1:T} \mid y_{1:T})$; e.g. state-space models.
Problem 2: Even when it is implementable, Gibbs sampling can converge very slowly.
Pseudo-marginal methods mimic an algorithm targeting $p(\theta \mid y_{1:T})$ directly instead of $p(\theta, x_{1:T} \mid y_{1:T})$.

Ideal Marginal Metropolis-Hastings
The Metropolis-Hastings (MH) algorithm simulates an ergodic Markov chain $\{\vartheta_i\}_{i \geq 1}$ of limiting distribution $\pi(\theta)$.
At iteration $i$:
Sample $\vartheta \sim q(\cdot \mid \vartheta_{i-1})$.
With probability
$$\min\left\{1, \frac{\pi(\vartheta)\, q(\vartheta_{i-1} \mid \vartheta)}{\pi(\vartheta_{i-1})\, q(\vartheta \mid \vartheta_{i-1})}\right\} = \min\left\{1, \frac{p_\vartheta(y_{1:T})\, p(\vartheta)\, q(\vartheta_{i-1} \mid \vartheta)}{p_{\vartheta_{i-1}}(y_{1:T})\, p(\vartheta_{i-1})\, q(\vartheta \mid \vartheta_{i-1})}\right\},$$
set $\vartheta_i = \vartheta$; otherwise set $\vartheta_i = \vartheta_{i-1}$.
Problem: MH cannot be implemented if $p_\vartheta(y_{1:T})$ cannot be evaluated.
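A minimal Python sketch of this ideal marginal MH step with a symmetric Gaussian random-walk proposal; log_prior and log_lik are hypothetical user-supplied functions, and the exact log-likelihood is assumed computable here, which is precisely what fails for the latent variable models above.

```python
import numpy as np

def marginal_mh(log_prior, log_lik, theta0, n_iter, rw_scale=0.1, rng=None):
    """Ideal marginal Metropolis-Hastings with a symmetric random-walk proposal.

    log_prior(theta), log_lik(theta): hypothetical functions returning log p(theta)
    and log p_theta(y_{1:T}); the exact likelihood is assumed available.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_post = log_prior(theta) + log_lik(theta)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        prop = theta + rw_scale * rng.standard_normal(theta.size)
        log_post_prop = log_prior(prop) + log_lik(prop)
        # accept with probability min{1, pi(prop)/pi(theta)}; q cancels by symmetry
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = prop, log_post_prop
        chain[i] = theta
    return chain
```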

Pseudo-Marginal Metropolis-Hastings
Idea: Replace $p_\vartheta(y_{1:T})$ by a non-negative unbiased estimate $\hat{p}_\vartheta(y_{1:T})$ in MH.
At iteration $i$:
Sample $\vartheta \sim q(\cdot \mid \vartheta_{i-1})$.
Compute an estimate $\hat{p}_\vartheta(y_{1:T})$ of $p_\vartheta(y_{1:T})$.
With probability
$$\min\left\{1, \frac{\hat{p}_\vartheta(y_{1:T})\, p(\vartheta)\, q(\vartheta_{i-1} \mid \vartheta)}{\hat{p}_{\vartheta_{i-1}}(y_{1:T})\, p(\vartheta_{i-1})\, q(\vartheta \mid \vartheta_{i-1})}\right\} = \min\left\{1, \frac{p_\vartheta(y_{1:T})\, p(\vartheta)\, q(\vartheta_{i-1} \mid \vartheta)}{p_{\vartheta_{i-1}}(y_{1:T})\, p(\vartheta_{i-1})\, q(\vartheta \mid \vartheta_{i-1})} \times \frac{\hat{p}_\vartheta(y_{1:T})/p_\vartheta(y_{1:T})}{\hat{p}_{\vartheta_{i-1}}(y_{1:T})/p_{\vartheta_{i-1}}(y_{1:T})}\right\},$$
set $\vartheta_i = \vartheta$ and $\hat{p}_{\vartheta_i}(y_{1:T}) = \hat{p}_\vartheta(y_{1:T})$; otherwise set $\vartheta_i = \vartheta_{i-1}$ and $\hat{p}_{\vartheta_i}(y_{1:T}) = \hat{p}_{\vartheta_{i-1}}(y_{1:T})$.
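Compared with the ideal sketch above, only two things change: the exact log-likelihood is replaced by the log of a non-negative unbiased estimate, and the estimate attached to the current state is stored and recycled in the acceptance ratio. A minimal sketch, where log_lik_hat(theta, rng) is a hypothetical function returning the log of an unbiased likelihood estimate.

```python
import numpy as np

def pseudo_marginal_mh(log_prior, log_lik_hat, theta0, n_iter, rw_scale=0.1, rng=None):
    """Pseudo-marginal MH: the exact log-likelihood is replaced by the log of a
    non-negative unbiased estimator; the estimate of the current state is recycled,
    never recomputed."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_lik_cur = log_lik_hat(theta, rng)            # hypothetical unbiased estimator
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        prop = theta + rw_scale * rng.standard_normal(theta.size)
        log_lik_prop = log_lik_hat(prop, rng)        # fresh estimate at the proposal
        log_ratio = (log_prior(prop) + log_lik_prop) - (log_prior(theta) + log_lik_cur)
        if np.log(rng.uniform()) < log_ratio:
            theta, log_lik_cur = prop, log_lik_prop  # keep the accepted estimate
        chain[i] = theta
    return chain
```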

Pseudo-Marginal Metropolis-Hastings
Proposition (Lin, Liu & Sloan, Phys. Rev. D, 2000): If $\hat{p}_\theta(y_{1:T})$ is a non-negative unbiased estimator of $p_\theta(y_{1:T})$, then the pseudo-marginal MH kernel admits $\pi(\theta)$ as invariant density.
Let $U$ be the $\mathsf{U}$-valued random variable such that $\hat{p}_\theta(y_{1:T}) = \hat{p}_\theta(y_{1:T}; U)$ and $\mathbb{E}[\hat{p}_\theta(y_{1:T}; U)] = p_\theta(y_{1:T})$ when $U \sim m_\theta(\cdot)$.
Pseudo-marginal MH is a standard MH algorithm on $\Theta \times \mathsf{U}$ with target $\bar{\pi}(\theta, u)$ and proposal $q(\vartheta \mid \theta)\, m_\vartheta(v)$, where
$$\bar{\pi}(\theta, u) = \pi(\theta)\, \frac{\hat{p}_\theta(y_{1:T}; u)}{p_\theta(y_{1:T})}\, m_\theta(u) \quad \text{satisfies} \quad \int \bar{\pi}(\theta, u)\, du = \pi(\theta).$$
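As a one-line check, the unbiasedness assumption is exactly what makes the extended target marginalize back to the posterior:
$$\int_{\mathsf{U}} \bar{\pi}(\theta, u)\, du = \frac{\pi(\theta)}{p_\theta(y_{1:T})} \int_{\mathsf{U}} \hat{p}_\theta(y_{1:T}; u)\, m_\theta(u)\, du = \frac{\pi(\theta)}{p_\theta(y_{1:T})}\, \mathbb{E}\big[\hat{p}_\theta(y_{1:T}; U)\big] = \pi(\theta).$$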

Likelihood Estimators
For latent variable models, one has $p_\theta(y_{1:T}) = \prod_{t=1}^{T} p_\theta(y_t)$, where $p_\theta(y_t) = \int \mu_\theta(x_t)\, g_\theta(y_t \mid x_t)\, dx_t$.
A non-negative unbiased estimator is given by
$$\hat{p}_\theta(y_{1:T}) = \prod_{t=1}^{T} \hat{p}_\theta(y_t) = \prod_{t=1}^{T} \left\{ \frac{1}{N} \sum_{k=1}^{N} g_\theta\big(y_t \mid X_t^k\big) \right\}, \qquad X_t^k \overset{\text{i.i.d.}}{\sim} \mu_\theta,$$
i.e. $U = (X_1^1, \ldots, X_1^N, \ldots, X_T^1, \ldots, X_T^N)$ with $m_\theta(u) = \prod_{t=1}^{T} \prod_{k=1}^{N} \mu_\theta(x_t^k)$.
Computational complexity is $O(NT)$.
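A minimal sketch of this importance-sampling estimator, computed on the log scale for numerical stability; sample_mu and log_g are hypothetical functions specifying $\mu_\theta$ and $g_\theta$.

```python
import numpy as np
from scipy.special import logsumexp

def log_lik_hat_iid(theta, y, N, sample_mu, log_g, rng):
    """Log of the unbiased estimate prod_t (1/N) sum_k g_theta(y_t | X_t^k)
    for i.i.d. latent variable models; cost O(N*T).

    sample_mu(theta, N, rng) -> array of N draws from mu_theta      (hypothetical)
    log_g(theta, y_t, x)     -> log g_theta(y_t | x), vectorised in x (hypothetical)
    """
    log_p = 0.0
    for y_t in y:
        x = sample_mu(theta, N, rng)                          # X_t^1,...,X_t^N ~ mu_theta
        log_p += logsumexp(log_g(theta, y_t, x)) - np.log(N)  # log of (1/N) sum_k g
    return log_p
```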

Likelihood Estimators
For state-space models, an alternative is to use particle filters, where
$$\hat{p}_\theta(y_{1:T}) = \hat{p}_\theta(y_1) \prod_{t=2}^{T} \hat{p}_\theta(y_t \mid y_{1:t-1}) = \prod_{t=1}^{T} \left\{ \frac{1}{N} \sum_{k=1}^{N} g_\theta\big(y_t \mid X_t^k\big) \right\}$$
and
$$m_\theta(u) = \prod_{k=1}^{N} \mu_\theta\big(x_1^k\big) \prod_{t=2}^{T} \prod_{k=1}^{N} \left\{ w_{t-1}^{a_{t-1}^k}\, f_\theta\big(x_t^k \mid x_{t-1}^{a_{t-1}^k}\big) \right\},$$
with ancestor indices $a_{t-1}^k \in \{1, \ldots, N\}$ and weights $w_t^j \propto g_\theta\big(y_t \mid X_t^j\big)$, $\sum_j w_t^j = 1$.
The estimator is unbiased (Del Moral, 1998); its relative variance is bounded uniformly over $T$ if $N \propto T$ (Cerou, Del Moral & Guyader, 2011); SQMC could also be used (Gerber & Chopin, JRSS B, 2015).
Computational complexity is $O(NT)$.
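A minimal bootstrap particle filter sketch of this estimator, using multinomial resampling; sample_mu, sample_f and log_g are hypothetical functions specifying $\mu_\theta$, $f_\theta$ and $g_\theta$.

```python
import numpy as np
from scipy.special import logsumexp

def log_lik_hat_pf(theta, y, N, sample_mu, sample_f, log_g, rng):
    """Bootstrap particle filter estimate of log p_theta(y_{1:T}); the corresponding
    likelihood estimate is unbiased. Cost O(N*T).

    sample_mu(theta, N, rng)     -> array of N draws from mu_theta           (hypothetical)
    sample_f(theta, x_prev, rng) -> one-step propagation from f_theta(.|x_prev)
    log_g(theta, y_t, x)         -> log g_theta(y_t | x), vectorised in x
    """
    x = sample_mu(theta, N, rng)                      # X_1^k ~ mu_theta
    log_p = 0.0
    for t, y_t in enumerate(y):
        log_w = log_g(theta, y_t, x)                  # log g_theta(y_t | X_t^k)
        log_p += logsumexp(log_w) - np.log(N)         # log of (1/N) sum_k g_theta(y_t|X_t^k)
        if t + 1 < len(y):
            w = np.exp(log_w - logsumexp(log_w))      # normalised weights w_t^k
            a = rng.choice(N, size=N, p=w)            # ancestor indices a_t^k
            x = sample_f(theta, x[a], rng)            # X_{t+1}^k ~ f_theta(. | X_t^{a_t^k})
    return log_p
```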

Examples of recent applications
Measuring the impact of Ebola control measures in Sierra Leone, PNAS.
Dynamic prediction pools: an investigation of financial frictions and forecasting performance, J. Econometrics.
Capturing the dynamics of pathogens with many strains, J. Mathematical Biology.
Monte Carlo estimation of stage structured development from cohort data, Ecology.
Bayesian analysis of entry games: a simulated likelihood approach without simulation errors, Global Economic Review.
Controlling procedural modeling programs, SIGGRAPH.

Empirical performance: Stochastic volatility model
SV model with leverage (Huang & Tauchen, 2005):
$$dv_1(s) = -k_1 \{v_1(s) - \mu_1\}\, ds + \sigma_1\, dW_1(s),$$
$$dv_2(s) = -k_2\, v_2(s)\, ds + \{1 + \beta_{12} v_2(s)\}\, dW_2(s),$$
$$d \log P(s) = \mu_y\, ds + \text{s-exp}\big[\{v_1(s) + \beta_2 v_2(s)\}/2\big]\, dB(s),$$
with $\phi_1 = \text{corr}\{B(s), W_1(s)\}$ and $\phi_2 = \text{corr}\{B(s), W_2(s)\}$.
An Euler discretization of the volatilities $v_1(s)$ and $v_2(s)$ provides a closed-form expression for $Y_t = \log P(t\Delta) - \log P((t-1)\Delta)$.
Daily returns $y = (y_1, \ldots, y_T)$ of the S&P 500 index.
Bayesian inference on $\theta = (k_1, \mu_1, \sigma_1, k_2, \beta_{12}, \beta_2, \mu_y, \phi_1, \phi_2)$.
Performance of the pseudo-marginal MH with a random-walk proposal is studied as a function of $\sigma$, the standard deviation of $\log \hat{p}_\theta(y)$ at the posterior mean $\bar{\theta}$.
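A rough Euler-Maruyama simulation sketch of this model, included only to illustrate the discretization. It assumes a unit daily interval split into n_sub Euler steps, replaces s-exp by exp, and builds the leverage as $dB = \phi_1 dW_1 + \phi_2 dW_2 + \sqrt{1-\phi_1^2-\phi_2^2}\, dW_\perp$; all of these choices are illustrative assumptions, not the exact construction used in the talk.

```python
import numpy as np

def simulate_sv(theta, T, n_sub=10, rng=None):
    """Rough Euler-Maruyama simulation of the two-factor SV model with leverage.
    Unit daily interval split into n_sub steps; s-exp replaced by exp; leverage
    built from independent Brownian drivers W1, W2, W_perp (illustrative assumptions).
    """
    k1, mu1, s1, k2, b12, b2, mu_y, phi1, phi2 = theta
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / n_sub
    v1, v2 = mu1, 0.0
    y = np.empty(T)
    for t in range(T):
        r = 0.0                                    # accumulate the daily log-return
        for _ in range(n_sub):
            dW1, dW2, dWp = np.sqrt(dt) * rng.standard_normal(3)
            dB = phi1 * dW1 + phi2 * dW2 + np.sqrt(max(1 - phi1**2 - phi2**2, 0.0)) * dWp
            r += mu_y * dt + np.exp((v1 + b2 * v2) / 2) * dB
            v1 += -k1 * (v1 - mu1) * dt + s1 * dW1
            v2 += -k2 * v2 * dt + (1 + b12 * v2) * dW2
        y[t] = r
    return y
```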

Integrated Autocorrelation Time of Pseudo-Marginal MH
Figure: average over the 9 parameter components of the log integrated autocorrelation time of the pseudo-marginal chain as a function of $\sigma$, for $T = 300$.

How precise should the log-likelihood estimator be?
Aim: Minimize the computational time $\text{CT}(Q, h) = \text{IACT}(Q, h)/\sigma^2$, as $\sigma^2 \propto 1/N$ and the computational effort is proportional to $N$, where $\text{IACT}(Q, h)$ is the integrated autocorrelation time of $\{h(\vartheta_i)\}_{i \geq 1}$:
$$\text{IACT}(Q, h) = 1 + 2 \sum_{\tau = 1}^{\infty} \text{corr}_{\pi, Q}\{h(\theta_0), h(\theta_\tau)\},$$
where, for $(\theta, z) \neq (\vartheta, w)$, the pseudo-marginal kernel $Q$ is
$$Q\{(\theta, z), (d\vartheta, dw)\} = q(\vartheta \mid \theta)\, g_\vartheta(w) \min\left\{1, \frac{\pi(\vartheta)}{\pi(\theta)} \exp(w - z)\right\} d\vartheta\, dw,$$
with $z = \log\{\hat{p}_\theta(y_{1:T})/p_\theta(y_{1:T})\}$ and $w = \log\{\hat{p}_\vartheta(y_{1:T})/p_\vartheta(y_{1:T})\}$.
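This trade-off can be explored empirically: run pilot chains at a few values of $\sigma$, estimate the IACT of $h(\vartheta_i)$, and divide by $\sigma^2$. A minimal sketch; the truncation rule for the autocorrelation sum is a simple heuristic chosen for brevity, not the estimator used in the talk.

```python
import numpy as np

def iact(x, max_lag=None):
    """Estimate 1 + 2*sum_tau corr{h(theta_0), h(theta_tau)} for a scalar chain x,
    truncating at the first non-positive autocorrelation (a simple heuristic)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    max_lag = n // 3 if max_lag is None else max_lag
    var = x.var()
    tau = 1.0
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / ((n - lag) * var)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return tau

def relative_computing_time(chain_h, sigma):
    """CT proxy: IACT of h(theta_i) divided by sigma^2, since the cost per
    iteration is proportional to N and sigma^2 is proportional to 1/N."""
    return iact(chain_h) / sigma**2
```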

Computational time for the SV model
Figure: computational time as a function of $\sigma$.

Large sample analysis of the pseudo-marginal kernel
Standard asymptotic studies of MCMC rely on $d \to \infty$ and independence assumptions on the target; here we are interested in fixed $d$ and large $T$.
Assumption 1 (Asymptotic Normality): We have
$$\int \left| p(\theta \mid Y_{1:T}) - \phi(\theta; \bar{\theta}_T, \Sigma/T) \right| d\theta \overset{P}{\to} 0,$$
where $\bar{\theta}_T \overset{P}{\to} \bar{\theta}$ and $\Sigma$ is a positive definite matrix.
Assumption 2 (Proposal): $\vartheta = \theta + \varepsilon/\sqrt{T}$ where $\varepsilon \sim \upsilon(\cdot)$ with $\upsilon(\varepsilon) = \upsilon(-\varepsilon)$.
Proposition (Berard, Del Moral & D., EJP, 2014): Under regularity conditions, if $N \propto T$ then
$$\log\{\hat{p}_\theta(Y_{1:T})/p_\theta(Y_{1:T})\} \mid Y_{1:T} \;\Rightarrow\; \mathcal{N}\big(-\sigma^2(\theta)/2,\ \sigma^2(\theta)\big).$$

Large sample analysis of the pseudo-marginal kernel
Let $\{\vartheta_i^T,\ Z_i^T := \log \hat{p}_{\vartheta_i^T}(Y_{1:T})/p_{\vartheta_i^T}(Y_{1:T})\}_{i \geq 0}$ be the stationary pseudo-marginal Markov chain targeting $\bar{\pi}_T(\theta, z) = p(\theta \mid Y_{1:T}) \exp(z)\, g_\theta^T(z)$.
Proposition: The finite-dimensional distributions of $\{\tilde{\vartheta}_i^T = \sqrt{T}(\vartheta_i^T - \bar{\theta}_T),\ Z_i^T\}_{i \geq 0}$ converge weakly as $T \to \infty$ to those of a stationary Markov chain of invariant density $\phi(\tilde{\theta}; 0, \Sigma)\, \phi\big(z; -\sigma^2(\bar{\theta})/2, \sigma^2(\bar{\theta})\big)$ and kernel
$$\tilde{Q}\{(\tilde{\theta}, z), (d\tilde{\vartheta}, dw)\} = \upsilon(\tilde{\vartheta} - \tilde{\theta})\, \phi\big(w; -\sigma^2(\bar{\theta})/2, \sigma^2(\bar{\theta})\big) \min\left\{1, \frac{\phi(\tilde{\vartheta}; 0, \Sigma)}{\phi(\tilde{\theta}; 0, \Sigma)} \exp(w - z)\right\} d\tilde{\vartheta}\, dw.$$
Letting $\sigma = \sigma(\bar{\theta})$, this suggests a simplified analysis based on
$$\bar{Q}_\sigma\{(\theta, z), (d\vartheta, dw)\} = q(\vartheta \mid \theta)\, \phi\big(w; -\sigma^2/2, \sigma^2\big) \min\left\{1, \frac{\pi(\vartheta)}{\pi(\theta)} \exp(w - z)\right\} d\vartheta\, dw.$$

Available Results
Aim: Minimize the computational cost
$$\text{CT}(\bar{Q}_\sigma, h) = \text{IACT}(\bar{Q}_\sigma, h)/\sigma^2.$$
Special cases:
1. When $q(\vartheta \mid \theta) = p(\vartheta \mid y)$, $\sigma_{\text{opt}} \approx 0.9$ (Pitt et al., 2012).
2. When $\pi(\theta) = \prod_{i=1}^{d} f(\theta_i)$ and $q(\vartheta \mid \theta)$ is an isotropic Gaussian random walk then, as $d \to \infty$, a diffusion limit suggests $\sigma_{\text{opt}} \approx 1.8$ (Sherlock et al., 2015).

Sketch of the Analysis
For general proposals and targets, direct minimization of $\text{CT}(\bar{Q}_\sigma, h)$ is impossible, so we minimize an upper bound on it.
The theoretical study relies on the $\bar{\pi}$-invariant kernel $Q^*_\sigma$ given, for $(\theta, z) \neq (\vartheta, w)$, by
$$q(\vartheta \mid \theta)\, \phi\big(w; -\sigma^2/2, \sigma^2\big) \min\left\{1, \frac{\pi(\vartheta)}{\pi(\theta)}\right\} \min\{1, \exp(w - z)\}\, d\vartheta\, dw,$$
instead of
$$q(\vartheta \mid \theta)\, \phi\big(w; -\sigma^2/2, \sigma^2\big) \min\left\{1, \frac{\pi(\vartheta)}{\pi(\theta)} \exp(w - z)\right\} d\vartheta\, dw.$$
We have $\text{IACT}(\bar{Q}_\sigma, h) \leq \text{IACT}(Q^*_\sigma, h)$ by Peskun (1973), so $\text{CT}(\bar{Q}_\sigma, h) \leq \text{CT}(Q^*_\sigma, h)$, and an exact expression for $\text{IACT}(Q^*_\sigma, h)$ is available.

Practical Guidelines
For good proposals, select $\sigma \approx 1.0$, whereas for poor proposals, select $\sigma \approx 1.7$.
When you have no clue about the proposal efficiency:
1. If $\sigma_{\text{opt}} = 1.0$ and you pick $\sigma = 1.7$, computing time increases by 150%.
2. If $\sigma_{\text{opt}} = 1.7$ and you pick $\sigma = 1.0$, computing time increases by 50%.
3. If $\sigma_{\text{opt}} = 1.0$ or $\sigma_{\text{opt}} = 1.7$ and you pick $\sigma \approx 1.2$, computing time increases by 15%.
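In practice these guidelines are applied by tuning the number of samples $N$ from a short pilot run: estimate the variance of the log-likelihood estimator at a pilot parameter value, then rescale $N$ using the approximate relationship $\text{Var}[\log \hat{p}_\theta(y)] \propto 1/N$. A minimal sketch, where log_lik_hat(theta, N, rng) is a hypothetical estimator and the default target standard deviation is set to 1.2.

```python
import numpy as np

def choose_num_particles(log_lik_hat, theta_pilot, n_pilot, target_sigma=1.2,
                         n_reps=50, rng=None):
    """Pick N so that the std of log p_hat_theta(y) at theta_pilot is roughly
    target_sigma, assuming Var[log p_hat] scales like 1/N.

    log_lik_hat(theta, N, rng): hypothetical function returning the log of a
    non-negative unbiased likelihood estimate computed with N particles/samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = [log_lik_hat(theta_pilot, n_pilot, rng) for _ in range(n_reps)]
    var_pilot = np.var(samples, ddof=1)
    # rescale the pilot N so that the variance hits target_sigma^2
    return max(1, int(np.ceil(n_pilot * var_pilot / target_sigma**2)))
```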

Discussion
Many sharp qualitative results for the pseudo-marginal MH algorithm have been obtained (Andrieu & Roberts, AoS 2009; Andrieu & Vihola, AAP 2015).
The simplified quantitative analysis of the pseudo-marginal MH algorithm is useful in the large data regime.
The optimal $\sigma$ depends on the efficiency of the ideal MH algorithm, but $\sigma \approx 1.2$ is a sweet spot.
Pseudo-marginal MH scales in $O(T^2)$ at each iteration as we require $N \propto T$.
Problem: the ratio $p_\vartheta(y_{1:T})/p_\theta(y_{1:T})$ is estimated by estimating the numerator and denominator independently.

The Correlated Pseudo-Marginal Algorithm
Reparameterize the likelihood estimator $\hat{p}_\theta(y_{1:T})$ as a function of normal variates $U \sim \mathcal{N}(0, I)$:
$$\hat{p}_\theta(y_{1:T}) = \hat{p}_\theta(y_{1:T}; U).$$
Correlate the estimators of $p_\theta(y_{1:T})$ and $p_\vartheta(y_{1:T})$ by setting $\hat{p}_\vartheta(y_{1:T}) = \hat{p}_\vartheta(y_{1:T}; V)$, where
$$V = \rho U + \sqrt{1 - \rho^2}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$$
for $\rho \in (-1, 1)$.
In practice, $\rho$ will be selected close to 1.

Correlated Pseudo-Marginal Metropolis-Hastings algorithm
At iteration $i$:
Sample $\vartheta \sim q(\cdot \mid \vartheta_{i-1})$ and $V = \rho U_{i-1} + \sqrt{1 - \rho^2}\, \varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$.
Compute the estimate $\hat{p}_\vartheta(y_{1:T}; V)$ of $p_\vartheta(y_{1:T})$.
With probability
$$\min\left\{1, \frac{\hat{p}_\vartheta(y_{1:T}; V)\, p(\vartheta)\, q(\vartheta_{i-1} \mid \vartheta)}{\hat{p}_{\vartheta_{i-1}}(y_{1:T}; U_{i-1})\, p(\vartheta_{i-1})\, q(\vartheta \mid \vartheta_{i-1})}\right\},$$
set $\vartheta_i = \vartheta$, $U_i = V$; otherwise set $\vartheta_i = \vartheta_{i-1}$, $U_i = U_{i-1}$.
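Relative to the pseudo-marginal sketch given earlier, only two things change: the estimator is written as a deterministic function of standard normal variates, and the proposed auxiliary variables are an autoregressive perturbation of the current ones. A minimal sketch, where log_lik_hat_u(theta, u) is a hypothetical deterministic function mapping $(\theta, u)$, $u \sim \mathcal{N}(0, I)$, to the log of an unbiased likelihood estimate.

```python
import numpy as np

def correlated_pm_mh(log_prior, log_lik_hat_u, theta0, dim_u, n_iter,
                     rho=0.999, rw_scale=0.1, rng=None):
    """Correlated pseudo-marginal MH: the auxiliary normals are refreshed via
    V = rho*U + sqrt(1-rho^2)*eps, so consecutive likelihood estimates are
    positively correlated and their ratio has much smaller variance.

    log_lik_hat_u(theta, u): hypothetical deterministic function of theta and the
    standard normal variates u, returning the log of an unbiased estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    u = rng.standard_normal(dim_u)
    log_lik_cur = log_lik_hat_u(theta, u)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        prop = theta + rw_scale * rng.standard_normal(theta.size)
        v = rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(dim_u)
        log_lik_prop = log_lik_hat_u(prop, v)
        log_ratio = (log_prior(prop) + log_lik_prop) - (log_prior(theta) + log_lik_cur)
        if np.log(rng.uniform()) < log_ratio:
            theta, u, log_lik_cur = prop, v, log_lik_prop
        chain[i] = theta
    return chain
```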

Large sample analysis of the correlated PM - i.i.d. case
Proposition. Let $N_T \to \infty$ as $T \to \infty$ with $N_T = o(T)$. Take $\rho_T = \exp(-\psi N_T/T)$, $U_T \sim \bar{\pi}_T(\cdot \mid \theta)$ and $V_T = \rho_T U_T + \sqrt{1 - \rho_T^2}\, \varepsilon_T$; then, as $T \to \infty$,
$$\log\left\{ \frac{\hat{p}_{\theta + \xi/\sqrt{T}}\big(y_{1:T}; V_T\big)}{\hat{p}_\theta(y_{1:T}; U_T)} \Big/ \frac{p_{\theta + \xi/\sqrt{T}}(y_{1:T})}{p_\theta(y_{1:T})} \right\} \,\Big|\, Y_{1:T}, U_T \;\Rightarrow\; \mathcal{N}\big(-\kappa^2(\theta)/2,\ \kappa^2(\theta)\big).$$
This CLT is conditional on the observation sequence and the current auxiliary variables.
Asymptotically the distribution of the log-ratio decouples from the current location of the Markov chain.
The asymptotic variance is $O(1)$ even for $N_T \propto \log(T)$.

Large sample analysis of the correlated PM - i.i.d. case
Let $\{\vartheta_i^T\}_{i \geq 0}$ be the stationary non-Markovian sequence of the correlated PM with invariant density $p(\theta \mid Y_{1:T})$.
Proposition. The finite-dimensional distributions of $\{\tilde{\vartheta}_i^T = \sqrt{T}(\vartheta_i^T - \bar{\theta}_T)\}_{i \geq 0}$ converge weakly as $T \to \infty$ to those of a stationary Markov chain of invariant density $\phi(\tilde{\theta}; 0, \Sigma)$ and kernel given, for $\tilde{\theta} \neq \tilde{\vartheta}$, by
$$\tilde{Q}(\tilde{\theta}, d\tilde{\vartheta}) = \upsilon(\tilde{\vartheta} - \tilde{\theta})\ \mathbb{E}_{R \sim \mathcal{N}(-\kappa^2(\bar{\theta})/2,\ \kappa^2(\bar{\theta}))}\left[ \min\left\{1, \frac{\phi(\tilde{\vartheta}; 0, \Sigma)}{\phi(\tilde{\theta}; 0, \Sigma)}\, e^R \right\} \right] d\tilde{\vartheta}.$$
These results suggest that a simplified analysis of the CPM chain can be performed by looking at
$$Q(\theta, d\vartheta) = q(\vartheta \mid \theta)\ \mathbb{E}_{R \sim \mathcal{N}(-\kappa^2/2,\ \kappa^2)}\left[ \min\left\{1, \frac{\pi(\vartheta)}{\pi(\theta)}\, e^R \right\} \right] d\vartheta,$$
and $\kappa \approx 2.1$ minimizes the computational time.

Limitations of Weak Convergence
The previous result does not imply convergence of the IACT.
In the CPM context, the auxiliary variables $U^T$ evolve according to an autoregressive scheme with persistence $1 - N_T/T$.
When $N_T$ grows too slowly with $T$, correlations decay too slowly, leading to an increasing IACT.
For the i.i.d. latent variable case, $\text{IACT}(Q, h)$ is driven by the IACT of
$$\left\{ \Psi_T\big(U_i^T\big) := \partial_\theta \log \hat{p}\big(Y_{1:T}; U_i^T\big) \right\}_{i \geq 1}.$$
Proposition. Let $N_T \propto T^\alpha$ for $0 < \alpha < 1$; then $\text{IACT}(Q, h) \gtrsim T^{1 - 2\alpha}$.
This result suggests we need at least $\sqrt{T}/N_T = O(1)$.

Example: Gaussian Latent Variable Model
Table: comparison at $T = 8192$ of MH ($\text{IACT}(\theta) = 15.6$), the pseudo-marginal method ($\rho = 0$) and the correlated pseudo-marginal method ($\rho$ close to 1) in terms of $N$, $\kappa$, $\text{RIACT}(\theta)$ and $\text{RCT}(\theta)$, where $\text{RIACT} = \text{IACT}/\text{IACT}_{\text{MH}}$ and $\text{RCT} = N \times \text{RIACT}$.
Improvement by 180-fold.

Example: Noisy Autoregressive Model
Table: comparison at $T = 16{,}000$ of MH ($\text{IACT}(\theta) = 5.8$), the pseudo-marginal method ($\rho = 0$) and the correlated pseudo-marginal method ($\rho$ close to 1) in terms of $N$, $\kappa$, $\text{RIACT}(\theta)$ and $\text{RCT}(\theta)$.
Improvement by 100-fold.

Discussion
The large sample analysis of the pseudo-marginal algorithm provides useful guidelines; the overall complexity is $O(T^2)$.
The correlated pseudo-marginal method can achieve a very substantial improvement.
The analysis shows $\sqrt{T}/N_T = O(1)$ is necessary and we conjecture it is sufficient, leading to complexity $O(T^{3/2})$ vs $O(T^2)$.

Implementation for state-space models in state dimension $n > 1$ relies on a non-standard particle scheme (Gerber & Chopin, 2015): our analysis does not capture these cases; experimental results suggest $O\big(T^{1 + n/(n+1)}\big)$.

Some References
C. Andrieu & M. Vihola, Convergence Properties of Pseudo-Marginal MCMC, Ann. Applied Probab., 2015.
J. Berard, P. Del Moral & A. Doucet, A Lognormal CLT for Particle Approximations of Normalizing Constants, Electron. J. Probab., 2014.
G. Deligiannidis, A. Doucet & M. K. Pitt, The Correlated Pseudo-Marginal Method, arXiv preprint.
A. Doucet, M. K. Pitt, G. Deligiannidis & R. Kohn, Efficient Implementation of Markov Chain Monte Carlo when Using an Unbiased Likelihood Estimator, Biometrika, 2015.
L. Lin, K. F. Liu & J. Sloan, A Noisy Monte Carlo Algorithm, Phys. Rev. D, 2000.
C. Sherlock, A. Thiery, G. O. Roberts & J. S. Rosenthal, On the Efficiency of the RW Pseudo-Marginal MH, Ann. Stat., 2015.


More information

Sequential Monte Carlo Methods (for DSGE Models)

Sequential Monte Carlo Methods (for DSGE Models) Sequential Monte Carlo Methods (for DSGE Models) Frank Schorfheide University of Pennsylvania, PIER, CEPR, and NBER October 23, 2017 Some References These lectures use material from our joint work: Tempered

More information

Practical unbiased Monte Carlo for Uncertainty Quantification

Practical unbiased Monte Carlo for Uncertainty Quantification Practical unbiased Monte Carlo for Uncertainty Quantification Sergios Agapiou Department of Statistics, University of Warwick MiR@W day: Uncertainty in Complex Computer Models, 2nd February 2015, University

More information

Inexact approximations for doubly and triply intractable problems

Inexact approximations for doubly and triply intractable problems Inexact approximations for doubly and triply intractable problems March 27th, 2014 Markov random fields Interacting objects Markov random fields (MRFs) are used for modelling (often large numbers of) interacting

More information

Markov Chain Monte Carlo, Numerical Integration

Markov Chain Monte Carlo, Numerical Integration Markov Chain Monte Carlo, Numerical Integration (See Statistics) Trevor Gallen Fall 2015 1 / 1 Agenda Numerical Integration: MCMC methods Estimating Markov Chains Estimating latent variables 2 / 1 Numerical

More information

Approximate Bayesian Computation: a simulation based approach to inference

Approximate Bayesian Computation: a simulation based approach to inference Approximate Bayesian Computation: a simulation based approach to inference Richard Wilkinson Simon Tavaré 2 Department of Probability and Statistics University of Sheffield 2 Department of Applied Mathematics

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

GMM, HAC estimators, & Standard Errors for Business Cycle Statistics

GMM, HAC estimators, & Standard Errors for Business Cycle Statistics GMM, HAC estimators, & Standard Errors for Business Cycle Statistics Wouter J. Den Haan London School of Economics c Wouter J. Den Haan Overview Generic GMM problem Estimation Heteroskedastic and Autocorrelation

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Modeling conditional distributions with mixture models: Applications in finance and financial decision-making

Modeling conditional distributions with mixture models: Applications in finance and financial decision-making Modeling conditional distributions with mixture models: Applications in finance and financial decision-making John Geweke University of Iowa, USA Journal of Applied Econometrics Invited Lecture Università

More information

Bayesian Monte Carlo Filtering for Stochastic Volatility Models

Bayesian Monte Carlo Filtering for Stochastic Volatility Models Bayesian Monte Carlo Filtering for Stochastic Volatility Models Roberto Casarin CEREMADE University Paris IX (Dauphine) and Dept. of Economics University Ca Foscari, Venice Abstract Modelling of the financial

More information

The Block Pseudo-Marginal Sampler

The Block Pseudo-Marginal Sampler The Block Pseudo-Marginal Sampler M.-N. Tran R. Kohn M. Quiroz M. Villani September 12, 2017 arxiv:1603.02485v5 [stat.me] 10 Sep 2017 Abstract The pseudo-marginal (PM) approach is increasingly used for

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

Non-homogeneous Markov Mixture of Periodic Autoregressions for the Analysis of Air Pollution in the Lagoon of Venice

Non-homogeneous Markov Mixture of Periodic Autoregressions for the Analysis of Air Pollution in the Lagoon of Venice Non-homogeneous Markov Mixture of Periodic Autoregressions for the Analysis of Air Pollution in the Lagoon of Venice Roberta Paroli 1, Silvia Pistollato, Maria Rosa, and Luigi Spezia 3 1 Istituto di Statistica

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 13 MCMC, Hybrid chains October 13, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university MH algorithm, Chap:6.3 The metropolis hastings requires three objects, the distribution of

More information

1 / 32 Summer school on Foundations and advances in stochastic filtering (FASF) Barcelona, Spain, June 22, 2015.

1 / 32 Summer school on Foundations and advances in stochastic filtering (FASF) Barcelona, Spain, June 22, 2015. Outline Part 4 Nonlinear system identification using sequential Monte Carlo methods Part 4 Identification strategy 2 (Data augmentation) Aim: Show how SMC can be used to implement identification strategy

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

POMP inference via iterated filtering

POMP inference via iterated filtering POMP inference via iterated filtering Edward Ionides University of Michigan, Department of Statistics Lecture 3 at Wharton Statistics Department Thursday 27th April, 2017 Slides are online at http://dept.stat.lsa.umich.edu/~ionides/talks/upenn

More information

Bayesian Estimation with Sparse Grids

Bayesian Estimation with Sparse Grids Bayesian Estimation with Sparse Grids Kenneth L. Judd and Thomas M. Mertens Institute on Computational Economics August 7, 27 / 48 Outline Introduction 2 Sparse grids Construction Integration with sparse

More information

arxiv: v1 [stat.co] 1 Jan 2014

arxiv: v1 [stat.co] 1 Jan 2014 Approximate Bayesian Computation for a Class of Time Series Models BY AJAY JASRA Department of Statistics & Applied Probability, National University of Singapore, Singapore, 117546, SG. E-Mail: staja@nus.edu.sg

More information

Generating the Sample

Generating the Sample STAT 80: Mathematical Statistics Monte Carlo Suppose you are given random variables X,..., X n whose joint density f (or distribution) is specified and a statistic T (X,..., X n ) whose distribution you

More information

Introduction. log p θ (y k y 1:k 1 ), k=1

Introduction. log p θ (y k y 1:k 1 ), k=1 ESAIM: PROCEEDINGS, September 2007, Vol.19, 115-120 Christophe Andrieu & Dan Crisan, Editors DOI: 10.1051/proc:071915 PARTICLE FILTER-BASED APPROXIMATE MAXIMUM LIKELIHOOD INFERENCE ASYMPTOTICS IN STATE-SPACE

More information

Session 5B: A worked example EGARCH model

Session 5B: A worked example EGARCH model Session 5B: A worked example EGARCH model John Geweke Bayesian Econometrics and its Applications August 7, worked example EGARCH model August 7, / 6 EGARCH Exponential generalized autoregressive conditional

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information