Emulating a gravity model to infer the spatiotemporal dynamics of an infectious disease


arXiv: v3 [stat.ME] 14 Feb 2013

Roman Jandarov, Department of Statistics, The Pennsylvania State University, raj153@psu.edu
Murali Haran, Department of Statistics, The Pennsylvania State University, mharan@stat.psu.edu
Ottar Bjørnstad, Departments of Entomology and Biology, The Pennsylvania State University, onb1@psu.edu
Bryan Grenfell, Departments of Ecology and Evolutionary Biology, Princeton University, grenfell@princeton.edu

Draft: February 18, 2013

Abstract

Probabilistic models for infectious disease dynamics are useful for understanding the mechanism underlying the spread of infection. When the likelihood function for these models is expensive to evaluate, traditional likelihood-based inference may be computationally intractable. Furthermore, traditional inference may lead to poor parameter estimates and the fitted model may not capture important biological characteristics of the observed data. We propose a novel approach for resolving these issues that is inspired by recent work in emulation and calibration for complex computer models. Our motivating example is the gravity time series susceptible-infected-recovered (TSIR) model. Our approach focuses on the characteristics of the process that are of scientific interest. We find a Gaussian process approximation to the gravity model using key summary statistics obtained from model simulations. We demonstrate via simulated examples that the new approach is computationally expedient, provides accurate parameter inference, and results in a good model fit. We apply our method to analyze measles outbreaks in England and Wales in two periods, the pre-vaccination period and the vaccination period. Based on our results, we are able to obtain important scientific insights about the transmission of measles. In general, our method is applicable to problems where traditional likelihood-based inference is computationally intractable or produces a poor model fit. It is also an alternative to approximate Bayesian computation (ABC) when simulations from the model are expensive.

1 Introduction

Infectious disease dynamics are of interest to modelers from a range of disciplines. The theory of disease dynamics provides a tractable system for investigating key questions in population and evolutionary biology. Understanding disease dynamics helps with management and with pressing issues such as disease emergence and epidemic control strategies. Probabilistic models for disease dynamics are important because they increase our understanding of the mechanism underlying the spread of infection while also accounting for its inherent stochasticity. Observations on reported cases of diseases, especially in the form of space-time data, are becoming increasingly available, allowing for statistical inference for the unknown parameters of these models. However, traditional likelihood-based inference for many disease dynamics models is challenging because the likelihood function may be expensive to evaluate, which can make such inference computationally intractable.
Furthermore, traditional inference may lead to poor parameter estimates and the fitted model may not capture important biological characteristics of the observed data. Hence, an approach that simultaneously addresses the computational challenges and the inferential issues would be very useful for a number of interesting and important probabilistic models for the dynamics of diseases. Inspired by work in the field of emulation and calibration for complex computer models (cf. Bayarri, Berger, Cafeo, Garcia-Donato, Liu, Palomo, Parthasarathy, Paulo, Sacks, and Walsh, 2007; Craig, Goldstein, Rougier, and Seheult, 2001; Kennedy and O'Hagan, 2001; Sacks, Welch, Mitchell, and Wynn, 1989), we develop a novel approach for inference for such models. Our approach uses a Gaussian process approximation to the disease dynamics model, built from key biologically relevant summary statistics obtained from simulations of the model at differing parameter values. As we will demonstrate, this approach results in reliable parameter estimates and a good model fit, and is also computationally efficient.

The motivating example for our approach is the gravity time series susceptible-infected-recovered (TSIR) model for measles dynamics. The spatiotemporal dynamics of measles have received a lot of attention, in part because of the importance of the disease and its highly nonlinear outbreak dynamics, and in part because of the availability of rich data sets. Important aspects of the local dynamics of measles are well studied. These include key issues like seasonality in transmission of the infection (Bjørnstad, Finkenstädt, and Grenfell, 2002; Dietz, 1976), effects of host demography on outbreak frequency (Finkenstädt, Keeling, and Grenfell, 1998; McLean and Anderson, 1988), and causes of local persistence and extinctions (Bartlett, 1956; Grenfell, Bjørnstad, and Kappey, 2001; Grenfell and Harwood, 1997). During the course of outbreaks in well-mixed local populations, the epidemic trajectory of measles is virtually unaffected by infection that may enter from neighboring locations. However, spatial coupling is fundamental to the dynamics and management of measles for smaller communities where the infection may become locally extinct (Bartlett, 1956; Grenfell and Harwood, 1997). Hence, ecologists have also studied the spatial spread of the disease using so-called metapopulation models (Earn, Rohani, and Grenfell, 1998; Grenfell and Harwood, 1997; Swinton and Grenfell, 1998).

In this paper, we investigate inference for a model first proposed by Xia, Bjørnstad, and Grenfell (2004). The model combines the TSIR model (Bjørnstad et al., 2002; Grenfell, Bjørnstad, and Finkenstädt, 2002) with a term that allows for spatial transmission between different host communities, modeled as a gravity process. Xia et al. (2004) demonstrate how this model captures scientifically important properties of measles dynamics.
Since each likelihood evaluation is computationally very expensive, however, Xia et al. (2004) obtain only point estimates of the parameters by minimizing ad hoc objective functions instead of using a likelihood-based approach. Here, we develop a more statistically rigorous approach to inferring model parameters, characterizing associated uncertainties and carefully studying parameter identifiability issues. First, in order to explain the issues that arise in inferring these parameters via a likelihood-based approach, we propose a partial discretization of the parameter space that allows us to perform Bayesian inference for the parameters using a fast MCMC algorithm. Using this approach we are able to study uncertainties about the parameter values. The method allows us to investigate parameter identifiability issues, showing which gravity model parameters can or cannot be inferred from a given data set. However, this approach to resolving the computational challenges of traditional likelihood-based inference is problematic, as is revealed by our simulated data examples. We find that the parameter estimates are poor and that forward simulations of the model at these parameter settings do not reproduce epidemiological features of the data deemed key in Xia et al. (2004).

In order to address the above issues, we propose a new approach that directly focuses on the aspects of the underlying process that are of scientific interest. We develop a Gaussian process approximation to the gravity model based on key summary statistics obtained from simulations of the model at different parameter values. These statistics are chosen by domain experts to capture the biologically important characteristics of the dynamics of the disease. The Gaussian process model emulator is then used to develop a probability model for the observations, thereby permitting an efficient MCMC approach to Bayesian inference for the parameters. We demonstrate that the new method recovers the true parameters and that the resultant fitted model captures biologically relevant features of the data.

When applied to the gravity TSIR model, our approach allows us to investigate several scientific questions of interest for the dynamics of measles. We study changes in dynamics between school holiday periods and non-holidays in the pre-vaccination era. This is particularly interesting because the local, age-structured transmission rate of the disease changes from holidays to non-holidays (Bjørnstad et al., 2002; Dietz, 1976). Since our approach allows us to construct confidence regions easily, we also infer the amounts of exported and imported infected individuals for different cities during different time periods, and reveal that movement patterns of the infection do not seem to change significantly between the pre-vaccination and vaccination eras. Based on the parameter estimates obtained using our method, we are able to display the inflow and outflow networks of the infection between cities. Along with histograms of the degree distributions of the networks, these graphs help to identify the cities that are important hubs in measles transmission.
More generally, the methodology we develop here may be useful for models where the likelihood is expensive to evaluate or in situations where the likelihood is unable to capture characteristics of the model that are of scientific interest. We note that the computational cost of forward simulations for our model makes approaches based on approximate Bayesian computation (ABC) (cf. Beaumont, Zhang, and Balding, 2002; Marjoram, Molitor, Plagnol, and Tavaré, 2003; Pritchard, Seielstad, Perez-Lezaun, and Feldman, 1999) infeasible. Our approach, by contrast, remains computationally efficient in this setting.

The rest of the paper is organized as follows. Section 2 describes in detail the gravity TSIR model, which acts as our motivating example. Section 3 describes the inferential and computational challenges posed by the model and the large space-time data set. Section 4 describes our new emulation-based approach, which is an alternative to traditional likelihood-based inference. Section 5 describes computational details and the application of our method to the gravity TSIR model in simulated data examples. Section 6 describes the application of our method to the England-Wales measles data sets. Finally, in Section 7, we summarize our results and discuss our statistical approach and scientific conclusions.

2 A gravity model for disease dynamics

A general goal of fitting metapopulation disease dynamics models is to describe spatiotemporal patterns of epidemics at the local scale and to understand how these patterns are affected by the network of spatial spread of the disease (Cliff, Haggett, and Smallman-Raynor, 1993; Keeling, Bjørnstad, and Grenfell, 2004). The gravity model we study is an extension of a discrete time-series susceptible-infected-recovered model (Bjørnstad et al., 2002; Grenfell et al., 2002) for local disease dynamics, which includes an explicit formulation for the spatial transmission between different host cities (Xia et al., 2004). The common theoretical framework used to describe the dynamics of infectious diseases is based on the division of the human host population into groups containing susceptible, infected (infectious) and recovered individuals.

Let $I_{kt}$ and $S_{kt}$ denote the number of infected and susceptible individuals, respectively, in disease generation $t$ in city $k$, and let $L_{kt}$ be the number of infected people commuting to city $k$ at time $t$. The commuting assumption reflects that movement of infection is mostly through transient movement of individuals. Denote the size and the number of births of city $k$ at time $t$ by $N_{kt}$ and $B_{kt}$, and let $d_{kj}$ represent the distance between cities $k$ and $j$. The model can then be described as follows. First, the model for the number of incidences of measles is

$$I_{k(t+1)} \sim \mathrm{Poisson}(\lambda_{k,t+1}), \quad \text{where } \lambda_{k,t+1} = \beta_t S_{kt} (I_{kt} + L_{kt})^{\alpha}, \tag{1}$$

with $t = 1, \dots, T$ and $k = 1, \dots, K$, where $K$ is the number of cities in our data and $T$ is the total number of time steps. The time step is taken to be 2 weeks, roughly corresponding to the generation length (serial interval) of measles.
The so-called transmission coefficient, $\beta := \{\beta_t\}$, is a parameter that represents the attack rate of measles at time $t$, and $\alpha$ is a positive real number correcting for the discrete-time approximation to the underlying continuous-time epidemic process (Glass, Xia, and Grenfell, 2003). Since these parameters only affect the local dynamics of measles, we henceforth refer to them as local dynamics parameters. The indexing by $t$ for $\beta_t$ reflects that this parameter is taken to be piecewise constant, taking 26 different values to accommodate seasonal variability of the transmission rate that is repeated every year (Bjørnstad et al., 2002; Fine and Clarkson, 1982; Finkenstädt and Grenfell, 2000; Grenfell et al., 2002). From Equation (1), it can be seen that $I_{k(t+1)}$ increases with the number of susceptibles and the number of moving infections coming to city $k$ at the previous time step. Note that we use the Poisson distribution whereas Xia et al. (2004) use the Negative Binomial distribution; this is due to the greater computational stability of the Poisson distribution for small values of $\lambda$. Our approach would proceed in the same way under either the Negative Binomial or the Poisson assumption. In addition, our exploratory analysis shows that a model fit using the Poisson distribution is similar to a model fit obtained with the Negative Binomial distribution, and the final inference about the parameters of interest is not affected by changing the distributional assumption.

The susceptibles are modeled as follows:

$$S_{k(t+1)} = S_{kt} + B_{kt} - I_{k(t+1)}, \tag{2}$$

reflecting how susceptibles are replenished by births and depleted by infection. Since case fatality from measles was very low for the period of time in this study and the mean age of infection was small, mortalities are not included in this balance equation. We note that here and in the following, after vaccinations are available, the births ($B_{kt}$) are deflated by the corresponding percentage of vaccinated newborns ($V_{kt}$), since those cannot be infected.

Finally, the gravity model describes the number of moving infected individuals by

$$L_{kt} \sim \mathrm{Gamma}(m_{kt}, 1), \quad \text{where } m_{kt} = \theta N_{kt}^{\tau_1} \sum_{j=1, j \neq k}^{K} \frac{I_{jt}^{\tau_2}}{d_{kj}^{\rho}}, \tag{3}$$

where $\mathrm{Gamma}(a, b)$ represents the Gamma distribution with shape and scale parameters $a$ and $b$, respectively. Here, $b$ is chosen to be equal to unity based on exploratory analysis of the fitted model (Xia et al., 2004). The reason for modeling immigrant infection as a continuous random variable lies in the assumption that the transient infectives do not remain for a full epidemic generation. The local dynamics parameters in Equation (1) have been estimated previously (Bjørnstad et al., 2002; Finkenstädt, Bjørnstad, and Grenfell, 2002; Grenfell et al., 2002).
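As an illustration of how Equations (1) to (3) fit together, the following is a minimal sketch of a single forward step of the gravity TSIR model. It is not the authors' implementation: the function name `gravity_tsir_step`, the three-city setup and all numerical values are hypothetical toy choices, and `theta` is used exactly as it enters the gravity term $m_{kt}$, without any reparametrization.

```python
import numpy as np

def gravity_tsir_step(I, S, N, B, dist, beta_t, alpha, theta, tau1, tau2, rho, rng):
    """Advance all K cities by one biweek following Equations (1)-(3)."""
    # Distance weights d_kj^(-rho), with the j == k term removed.
    with np.errstate(divide="ignore"):
        w = dist.astype(float) ** (-rho)
    np.fill_diagonal(w, 0.0)
    # Equation (3): m_kt = theta * N_kt^tau1 * sum_{j != k} I_jt^tau2 / d_kj^rho
    m = theta * N ** tau1 * (w @ (I.astype(float) ** tau2))
    # L_kt ~ Gamma(shape=m_kt, scale=1); m_kt == 0 is treated as no imported infection.
    L = np.where(m > 0, rng.gamma(np.maximum(m, 1e-12), 1.0), 0.0)
    # Equation (1): Poisson incidence with mean beta_t * S_kt * (I_kt + L_kt)^alpha
    lam = beta_t * S * (I + L) ** alpha
    I_next = rng.poisson(lam)
    # Equation (2): susceptibles gain births and lose new infections.
    S_next = S + B - I_next
    return I_next, S_next, L

# Toy inputs (hypothetical, chosen only so the step runs with moderate rates).
rng = np.random.default_rng(0)
dist = np.array([[0.0, 50.0, 120.0], [50.0, 0.0, 80.0], [120.0, 80.0, 0.0]])
N = np.array([5e5, 1e5, 2e4])
S = np.array([2e4, 4e3, 8e2])
I = np.array([30, 5, 0])
B = np.array([300.0, 60.0, 12.0])
I_next, S_next, L = gravity_tsir_step(
    I, S, N, B, dist, beta_t=3e-5, alpha=0.97,
    theta=1e-4, tau1=1.0, tau2=1.0, rho=1.0, rng=rng,
)
```

Iterating this step over $t = 1, \dots, T$ with the seasonal $\beta_t$ would give one forward simulation of the model. The sketch does not guard against $S_{k(t+1)}$ becoming negative, which a full simulator would need to handle.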
In this study, we are interested in learning about the parameters $\theta$, $\tau_1$, $\tau_2$ and $\rho$ in Equation (3), as these parameters control the spatial spread and regional behavior of the disease. Note, however, that for convenience and numerical stability, we use a reparametrization of $\theta$, $\theta' = -\log_{10}(\theta)/5$, throughout the paper.

3 Parameter inference for the gravity model

Reliable estimates of the local dynamics parameters $\alpha$ and $\beta$ are available for measles dynamics (Bjørnstad et al., 2002; Finkenstädt et al., 2002; Grenfell et al., 2002; Xia et al., 2004). Therefore, since we are only interested in the spatial dynamics of the disease, we assume that these parameters are known and use the estimates obtained from previous work (cf. Xia et al., 2004) as the true values. In particular, the local seasonal transmission parameters for biweeks 1 through 26 are taken to be $\beta_t$ = (1.24, 1.14, 1.16, 1.31, 1.24, 1.12, 1.06, 1.02, 0.94, 0.98, 1.06, 1.08, 0.96, 0.92, 0.92, 0.86, 0.76, 0.63, 0.62, 0.83, 1.13, 1.20, 1.11, 1.02, 1.04, 1.08), and $\alpha$ is likewise fixed at its previously estimated value. Here, the differences in the values of $\beta_t$ are primarily related to the fact that attack rates of measles differ depending on the season of the year, since it is known that schools are major hubs of transmission of the disease. It is also known that the true transmission process is continuous. Since we are considering a discretized model with a step equal to two weeks, it is therefore expected that the true attack rates of the disease could be higher. This explains the value of $\alpha$, which is slightly less than unity.

In principle, it may be possible to reduce the dimensionality of $\beta_t$ while still preserving the seasonality of attack rates of the infection. With a lower-dimensional $\beta_t$, one could assume strong priors for the local dynamics parameters and try to infer them jointly with the remaining unknown parameters. However, trying to infer all of these parameter values simultaneously would still significantly worsen the identifiability issues and further complicate computation. Crucially, we note that assuming the local dynamics parameters are known does not have an undue effect on the model fit, as has already been shown in the literature (cf. Xia et al., 2004).

Assuming the local dynamics parameters are known leaves us with four unknown parameters, $\theta$, $\tau_1$, $\tau_2$ and $\rho$, that we call the gravity model parameters (in our Gaussian process-based approach in Section 4 we will also introduce several other parameters).
In this paper our focus is on investigating the gravity model parameters and, when possible, obtaining the best estimates of them with relevant descriptions of their variability. As suggested by our domain experts, feasible values for the gravity parameters lie in the interval [0, 2] (see also Xia et al., 2004). Therefore, we use uniform priors on this interval for $(\theta, \tau_1, \tau_2, \rho)$ in all the inferential approaches that follow. The data are spatiotemporal and high-dimensional, both for the England-Wales measles data from the pre-vaccination era and for the data from the later time period. To study whether our fitted model captures epidemiologically relevant features of the data, we focus on two important biological characteristics of the process as suggested by domain experts. These are:

1. The maximum number of incidences, which we denote by $M = (M_1, \dots, M_K)$, where $M_i$ is the maximum number of incidences for the $i$-th city.
2. The proportions of biweeks without any cases of infection, denoted by $P = (P_1, \dots, P_K)$, where $P_i$ is the proportion of incidence-free biweeks for the $i$-th city.

An important goal of our work is to find parameter settings (along with associated uncertainties and dependencies among them) that yield a model that produces disease dynamics that are as close as possible to the data in terms of capturing these key properties.

3.1 A gridded MCMC approach and simulated examples

It is easy to see why each evaluation of the likelihood for the gravity model is expensive. As in many population dynamic models, the major difficulty is in integrating over high-dimensional unobserved variables. For our model, $\{L_{kt}\}$ and $\{S_{kt}\}$ are of $K \times T$ dimensions each, which translates to $2 \times 519{,}792$ unobserved values in the case of the measles data set for the pre-vaccination era considered in Section 6. Details of the likelihood function are given in Web Appendix A.

In this section, using an MCMC algorithm based on the discretization of a subspace of the parameter space, we describe some issues that arise from a traditional likelihood-based or Bayes approach to inference for the gravity model. Because likelihood-based inference for the gravity model is computationally intractable, our gridded MCMC algorithm requires certain simplifying assumptions and data imputation for the unobservable susceptibles $\{S_{kt}\}$. These assumptions and the details of constructing our gridded MCMC algorithm for parameter inference are explained in Web Appendix B. We note, however, that our inferential approach based on a Gaussian process, described in Section 4, requires neither the simplifying assumptions nor data imputation.

All simulated data sets we consider in this work are generated from the full gravity model described in Section 2, with initial points equal to the actual observations at $t = 1$. In these examples, the number of locations, their coordinates, demographic variables, and the number of time steps are the same as those in the measles data described in Section 6.1.
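The two summary characteristics used in this section to judge model fit, the per-city maxima $M$ and the per-city proportions of incidence-free biweeks $P$, are simple to compute from a table of case counts. The sketch below assumes the data are stored as an array with one row per biweek and one column per city; that layout, the function name `summary_statistics` and the toy numbers are illustrative assumptions, not something specified in the paper.

```python
import numpy as np

def summary_statistics(cases):
    """Return (M, P) for a (T, K) array of biweekly case counts.

    M[i] is the maximum number of cases observed in city i.
    P[i] is the proportion of biweeks with zero cases in city i.
    """
    cases = np.asarray(cases)
    M = cases.max(axis=0)
    P = (cases == 0).mean(axis=0)
    return M, P

# Toy example: 4 biweeks, 2 cities (hypothetical numbers).
toy_cases = np.array([[3, 0],
                      [7, 0],
                      [0, 1],
                      [2, 0]])
M, P = summary_statistics(toy_cases)
```

For the toy table, the first city has maximum 7 and one case-free biweek out of four, while the second city has maximum 1 and three case-free biweeks out of four.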
In our first example, we simulate a data set using the values $\theta = 0.71$, $\tau_1 = 0.3$, $\tau_2 = 0.7$ and $\rho = 1$ for the gravity parameters. This parameter setting results in realistic data that resemble the observations. Figure 1 shows conditional and unconditional posterior likelihood surface plots for $\theta$ and $\rho$ obtained by using the above gridded MCMC approach. From these plots, we can easily see that inference for $\theta$ and $\rho$ is not possible because of the apparent issue with identifiability (Figure 1 (a)). In Figure 1 (b) we see that the identifiability problem is reduced, but still exists, when we fix one of the parameters, say $\tau_1$, at its known true value. In Figure 1 (c), we fix both $\tau_1$ and $\tau_2$ at their true values and see that the obtained ridge contains the true values for $\theta$ and $\rho$. Figure 1 (d) demonstrates that the ridge moves when the values of $\tau_1$ and $\tau_2$ are changed away from their true values.

Figure 1: Inferred posterior 2D likelihood surface obtained for data with known parameters ($\theta = 0.71$, $\tau_1 = 0.3$, $\tau_2 = 0.7$ and $\rho = 1$): (a) Marginal 2D likelihood surface for $(\theta, \rho)$; (b) Marginal 2D likelihood surface for $(\theta, \rho)$ assuming $\tau_1 = 0.3$ (true); (c) 2D likelihood surface for $(\theta, \rho)$ assuming $\tau_1 = 0.3$ (true) and $\tau_2 = 0.7$ (true); (d) 2D likelihood surface for $(\theta, \rho)$ assuming $\tau_1 = 0.5$ (any value) and $\tau_2 = 1$ (any value).

In our second example, we simulate a data set using the values $\theta = 0.71$, $\tau_1 = 0.5$, $\tau_2 = 1$ and $\rho = 1$ for the gravity parameters. Figure 2 is a plot of the two-dimensional likelihood in $(\theta, \rho)$ space obtained by fixing $\tau_1$ and $\tau_2$ at their true values, 0.5 and 1 respectively. We can see here that the true values of the parameters of interest are not in the region where the likelihood is maximized. Unfortunately, repeating the above with other simulated data sets with different true values for the gravity parameters reveals that the ridge analogous to the one in Figure 1 (c) does not always contain the true values for $\theta$ and $\rho$. From our study of multiple simulated data sets, we also find that the likelihood ridge can have the same slope as the ridge that we would intuitively think of as the true ridge, but a different intercept. This difference in intercepts creates a shift, thereby resulting in poor parameter inference. Unfortunately, the magnitude and direction of the shift depend on the true parameter values, so no simple bias correction is available.

Figure 2: Inferred posterior 2D likelihood surface obtained for data with known parameters ($\theta = 0.71$, $\tau_1 = 0.5$, $\tau_2 = 1$ and $\rho = 1$): Posterior 2D likelihood surface for $(\theta, \rho)$ assuming $\tau_1 = 0.5$ (true) and $\tau_2 = 1$ (true) has a shift and does not contain the true $(\theta, \rho)$ at its highest probability area.

At first, one may think that the discretization of the parameters $\tau_2$ and $\rho$ may be causing some of these issues. We verify that this is not the case by computing the values of the true likelihood function at the top of the ridges obtained with the discretized likelihood. We see that the likelihood surface using the discretization is similar to the true likelihood surface. The poor inference from our traditional Bayes approach is therefore clearly not a result of the discretization.

By generating additional simulations using a simpler model where we fix all the latent variables at their means, we also find that the full gravity model does not substantially differ from the simpler one in terms of capturing interesting biological characteristics of the underlying dynamics of the disease. In order to study the effect of this fixing on the likelihood surface, we save the true latent variables while simulating data and use them in our gridded MCMC algorithm in place of the expectations it normally uses. The results show that using the true values of the latent variables does not change the traditional Bayes inference. This also confirms that the shifts that we observe in the traditional Bayes approach are not due to simplifying the model in the gridded MCMC algorithm (see Web Appendix B for details about these assumptions), but rather due to inherent problems with the likelihood function.

We note that our main interest is to examine whether the parameter estimates result in a model fit that is capable of reproducing important characteristics of the observations. In order to study the model fit from the gridded MCMC, we simulate a data set using the full gravity model with estimated values of the parameters, where here and throughout the paper we use the modes of the corresponding posterior density functions as estimates of the parameters. These estimates for the measles data described in Section 6.1 are $(\theta, \tau_1, \tau_2, \rho) = (0.71, 0.5, 1, 1.48)$. For the simulated data set, we calculate the two 952-dimensional vectors of summary characteristics (952 being the number of cities in the data) and plot them against the summary vectors for the observed measles data (Figure 3). We can see that the simulated data do not seem to match the actual data in terms of the maximums $M$ and the proportions of zeros $P$ (Figure 3 (a)-(b)). In Section 5.2, we compare the model fit obtained via the gridded MCMC to the model fit we obtain via our Gaussian process-based approach described in Section 4.

We summarize below our conclusions based on the gridded MCMC approach:

1. The confidence regions for the parameters are very wide, suggesting that there may be relatively little information even with a fairly rich data set. Hence we assume that $\tau_1 = 1$, $\tau_2 = 1$ as estimated in Xia et al. (2004) and study the joint distribution of $\theta$ and $\rho$, which becomes well informed by the data.
2. The fitted gravity model, using the above inference about its parameters, does not capture important biological features of the data.
3. We find that the parameter estimates from the traditional Bayes approach are shifted and that the direction of the shift varies, as shown in Figure 2. For example, for a data set simulated using the parameter values $(\theta, \tau_1, \tau_2, \rho) = (0.71, 0.5, 1, 1)$, our attempt to infer $\rho$ assuming the other parameters are known results in an estimate $\hat{\rho} = 1.5$ with a confidence region that does not contain the truth.
Figure 3: Characteristics of simulated data at the parameters obtained via the traditional Bayes approach: (a) Simulated $M$ vs $M$ from the data; (b) Simulated $P$ vs $P$ from the data.

4 Gaussian processes for emulation-based inference

Since a traditional Bayes approach suffers from the above shortcomings, we develop an alternative method that is directly linked to the characteristics of the infectious disease dynamics that are of most interest to biologists. This method is based on using a Gaussian process to emulate the gravity model. A short review of Gaussian process basics is provided in Web Appendix C.

We describe a new two-stage approach for inferring the gravity parameters. In the first stage, we simulate the gravity model at several parameter settings. For each forward simulation of the model we can calculate the vector of summary statistics based on the simulated data set. This vector is high-dimensional: 952 (354) dimensions in the case of the measles data for the pre-vaccination (vaccination) period. Since Gaussian process-based emulation in high dimensions poses serious computational challenges, we emulate the model by fitting a Gaussian process to the Euclidean distances between the summary statistics of the simulated data at the chosen parameter settings and the summary statistics for the real data. In the second stage, we perform Bayesian inference for the observations using the GP emulator from the first stage. We also allow for additional sources of uncertainty such as observational error and model-data discrepancy, as described below. We note that such two-stage approaches to parameter inference in complex models have been used to reduce computational challenges and alleviate identifiability issues (cf. Bhat, Haran, Olson, and Keller, 2012; Liu, Bayarri, and Berger, 2009).

We begin with some notation. Let $Z$ denote the vector of summary statistics of interest (e.g. proportions of zeros) calculated using the observed space-time data set. Let $\Theta$ be the gravity parameters and let $Y(\Theta)$ denote the vector of summary statistics obtained using a simulation from the gravity model with the parameter setting $\Theta$. Let $\Omega = (\Theta_1, \dots, \Theta_p)$ be a grid on the parameter space. Our first goal is then to model $D = (D_1, \dots, D_p)$, where $D_i$ is the Euclidean distance between $Y(\Theta_i)$ and $Z$ for $i = 1, \dots, p$. This is done in the first stage of our approach, where we assume

$$D \mid \Omega, \beta_G, \xi_G \sim N(X \beta_G, \Sigma(\xi_G)). \tag{4}$$

Here, $\xi_G = (\sigma_G^2, \tau_G^2, \phi_G)$ is a vector of parameters that specify the covariance matrix, and $\beta_G$ is a vector of regression coefficients. The matrix $X$ is a design matrix of dimension $p \times 5$ with $i$-th row equal to $(1, \Theta_i^T)$. In other words, the columns of $X$ are an intercept and the values of the gravity parameters $(\theta, \tau_1, \tau_2, \rho)$ on the selected grid. We use a Gaussian covariance matrix $\Sigma(\xi_G)$, the elements of which are given by

$$(\Sigma(\xi_G))_{ij} = \mathrm{cov}(D_i, D_j) = \begin{cases} \sigma_G^2 \exp\left(-\phi_G^2 \|\Theta_i - \Theta_j\|^2\right), & \text{if } i \neq j, \\ \sigma_G^2 + \tau_G^2, & \text{otherwise.} \end{cases}$$

Here, $\|a - b\| := d(a, b)$, where throughout the paper the function $d(\cdot, \cdot)$ returns the Euclidean distance between the argument vectors. Then, if we let the maximum likelihood estimate of $(\beta_G, \xi_G)$ be $(\hat{\beta}_G, \hat{\xi}_G)$, using standard multivariate normal theory (cf. Anderson, 1984), the normal predictive distribution for the simulated distance $D$ at a new $\Theta$ can be obtained by substituting $(\hat{\beta}_G, \hat{\xi}_G)$ in place of $(\beta_G, \xi_G)$ and conditioning on $D$. We denote this predictive distribution by $\eta(D; \Theta)$. A detailed construction of this predictive distribution (the emulator) is given in Web Appendix D.

Consider a new space-time data set, and let the vector of summary statistics for these data be $Y^*$. Let the distance between $Y^*$ and $Z$ be $D^*$. The predictive distribution from the first stage provides a model for $D^*$, $\eta(D^*; \Theta^*)$, connecting it to some unknown parameter vector $\Theta^*$. Following Bayarri, Berger, Paulo, Sacks, Cafeo, Cavendish, Lin, and Tu (2007), we model the discrepancy between the gravity model and the real data. Failing to account for data-model discrepancy can lead to poor inference, as pointed out in Bayarri et al. (2007) and Bhat, Haran, and Goes (2010). We account for this by setting $D^* = D^*_\delta := \delta$, where $\delta > 0$ is the discrepancy term.
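To make the first-stage emulator concrete, here is a hedged sketch of the plug-in predictive mean and variance implied by Equation (4) and the Gaussian covariance above, using the standard conditional formulas for a multivariate normal. The paper's own construction is in Web Appendix D, which is not reproduced here, so this should be read as one plausible version rather than the authors' code. The function names, the toy grid, and the parameter values passed as `beta`, `sigma2`, `nugget` (standing for $\tau_G^2$) and `phi` are all assumptions; in the paper these parameters are replaced by maximum likelihood estimates.

```python
import numpy as np

def gaussian_cov(A, B, sigma2, phi):
    """sigma2 * exp(-phi^2 * ||a - b||^2) for every row a of A and b of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma2 * np.exp(-(phi ** 2) * d2)

def emulator_predict(theta_new, grid, D, beta, sigma2, nugget, phi):
    """Plug-in normal predictive mean and variance of the distance at theta_new.

    grid: (p, 4) parameter settings; D: (p,) simulated distances;
    beta: (5,) regression coefficients (intercept first).
    """
    p = grid.shape[0]
    X = np.column_stack([np.ones(p), grid])            # rows (1, Theta_i^T)
    Sigma = gaussian_cov(grid, grid, sigma2, phi) + nugget * np.eye(p)
    theta_new = np.atleast_2d(theta_new)
    x_new = np.column_stack([np.ones(len(theta_new)), theta_new])
    c = gaussian_cov(theta_new, grid, sigma2, phi)      # cross-covariances
    mean = x_new @ beta + c @ np.linalg.solve(Sigma, D - X @ beta)
    var = sigma2 + nugget - np.einsum("ij,ji->i", c, np.linalg.solve(Sigma, c.T))
    return mean, var

# Toy check (hypothetical values): with a tiny nugget the emulator
# nearly interpolates the simulated distances at the grid points.
rng = np.random.default_rng(1)
grid = rng.uniform(0.0, 2.0, size=(15, 4))
D = rng.uniform(1.0, 3.0, size=15)
beta = np.array([D.mean(), 0.0, 0.0, 0.0, 0.0])
mean0, var0 = emulator_predict(grid[0], grid, D, beta, sigma2=1.0, nugget=1e-8, phi=2.0)
```

Adding the nugget to the diagonal reproduces the $\sigma_G^2 + \tau_G^2$ entries of $\Sigma(\xi_G)$. A larger nugget would let the emulator smooth over the simulation noise in $D$ instead of interpolating it.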
It is positive since it represents an Euclidean distance that is nonnegative (in the unrealistic case that there is an exact match between the model for the data and the model used to fit the data, δ would be identically equal to 0). We then infer the gravity parameters using η(dδ ; Θ ) considering δ to be another unknown parameter in the MCMC algorithm. In other words, the likelihood function we use for our MCMC algorithm is a function f(δ, Θ ) := η(dδ ; Θ ). We note that including a model discrepancy term results in more reliable parameter inference with narrower confidence regions since it adjusts for the fact that even the best model fit is not going to reduce the distance between the simulated and 13

observed summary statistics to zero. In our simulated examples, where data are generated from the gravity model, the discrepancy term can be thought of as an adjustment parameter for the fact that two data sets simulated at the same parameter settings will always have small differences due to stochasticity. In these examples, as expected, the estimate of the discrepancy is very small compared to the discrepancy term inferred from the original data. We also note that using negative values for $\delta$ would mean an extrapolation in our emulator beyond the grid of the parameter space, which may lead to unreliable inference. In many situations, having a well-defined discrepancy term with an informative prior helps to reduce problems with identifiability of the parameters as well (cf. Craig et al., 2001).

We can now summarize our inferential approach as follows:

1. Emulating the gravity model:
   (a) Select a grid $(\Theta_1, \ldots, \Theta_p)$ on the range of possible values for $\Theta$.
   (b) Calculate $Y(\Theta_i)$ using a simulation from the gravity model with $\Theta_i$ for all $i$.
   (c) Calculate $D = (D_1, \ldots, D_p)$, the distances from $Y(\Theta_i)$ to $Z$ for all $i$.
   (d) Find the maximum likelihood estimates of $(\beta_G, \xi_G)$, the parameters of the Gaussian process in Equation (4). Obtain the predictive distribution $\eta(D; \Theta)$.
2. Bayesian inference for $\delta$ and $\Theta^*$ given the observations $Z$:
   (a) Using the predictive distribution with a discrepancy term, $\eta(D^*_\delta; \Theta^*)$, perform Bayesian inference for the parameters $(\Theta^*, \delta)$ from the posterior distribution via MCMC.

5 Emulation-based inference for the gravity TSIR model

In this section we describe details of the application of the inferential approach described in Section 4 to the gravity TSIR model. By using simulated data examples, we show that the approach resolves the problems posed by traditional approaches.
In order to contrast our approach with a traditional likelihood-based approach (carried out by gridded MCMC as described in Section 3.1), we also provide computational details from the application of both methods.
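The two stages summarized in Section 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`gaussian_cov`, `fit_emulator`, `predict`, `log_posterior`) are hypothetical, the covariance parameters are taken as given rather than estimated by maximum likelihood, only the regression coefficients are estimated (by generalized least squares, which is their maximum likelihood estimate for fixed covariance parameters), and the prior on the gravity parameters is treated as flat for brevity.

```python
import numpy as np

def gaussian_cov(A, B, sigma2, phi):
    """sigma2 * exp(-phi^2 * ||a - b||^2) for every pair of rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma2 * np.exp(-(phi ** 2) * sq_dist)

def fit_emulator(theta_grid, dists, sigma2, tau2, phi):
    """Stage 1: Gaussian process for the distances D on the parameter grid.

    Assumption: (sigma2, tau2, phi) are supplied, not estimated here.
    """
    p = theta_grid.shape[0]
    X = np.column_stack([np.ones(p), theta_grid])      # rows (1, Theta_i^T)
    # Off-diagonal: Gaussian covariance; diagonal: sigma2 + tau2.
    Sigma = gaussian_cov(theta_grid, theta_grid, sigma2, phi) + tau2 * np.eye(p)
    Sinv_X = np.linalg.solve(Sigma, X)
    beta = np.linalg.solve(X.T @ Sinv_X, Sinv_X.T @ dists)  # GLS estimate
    resid_w = np.linalg.solve(Sigma, dists - X @ beta)
    return {"grid": theta_grid, "Sigma": Sigma, "beta": beta,
            "resid_w": resid_w, "sigma2": sigma2, "tau2": tau2, "phi": phi}

def predict(em, theta_new):
    """Plug-in normal predictive mean and variance of the distance at theta_new."""
    x = np.concatenate([[1.0], theta_new])
    c = gaussian_cov(theta_new[None, :], em["grid"], em["sigma2"], em["phi"])[0]
    mean = x @ em["beta"] + c @ em["resid_w"]
    var = em["sigma2"] + em["tau2"] - c @ np.linalg.solve(em["Sigma"], c)
    return mean, var

def log_posterior(em, theta_new, delta):
    """Stage 2: log of eta(delta; theta_new) plus the exponential(1) log prior.

    Assumption: flat prior on theta_new (the paper uses the priors of Section 3).
    """
    if delta <= 0:
        return -np.inf
    mean, var = predict(em, theta_new)
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (delta - mean) ** 2 / var)
    return log_lik - delta
```

A function such as `log_posterior` could then be passed to any MCMC routine that only needs an unnormalized log density, for example a univariate slice sampler applied to each coordinate in turn.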

5.1 Computational details of gridded MCMC and emulation-based approaches

Inference for both the traditional Bayes and emulator-based approaches relies on sampling from the corresponding posterior distributions via MCMC. In both methods, we use univariate sequential slice sampling updates for the continuous parameters (Agarwal and Gelfand, 2005; Neal, 2003). Parameters that are on the grid are updated via an analog of a simple random walk for discrete variables. In all the MCMC algorithms that are used for the discretized MCMC approach, the chain is run until we obtain 200,000 samples. This takes about 3 days on an Intel Xeon E5472 Quad-Core 3.0 GHz processor. In all the MCMC algorithms for the Gaussian process-based method, all the updates are carried out using slice sampling, since all the parameters here are continuous. Chain lengths are again 200,000, and it takes about 10 hours to generate them. The chain lengths in both methods are adequate for producing posterior estimates with small Monte Carlo standard errors (Flegal, Haran, and Jones, 2008; Jones, Haran, Caffo, and Neath, 2006).

We emulate the gravity model with a Gaussian process using proportions of zeros as the summary statistic of interest. Our selection of proportions of zeros as the primary summary statistic of the analysis is based on suggestions by domain experts and on the intuition that these summary statistics are the most informative regarding the parameters of interest. It could be argued that big cities do not have bi-weeks without incidences of measles, making the proportions of zeros for these cities equal to 0. However, during the course of outbreaks in these cities, the epidemic trajectory of measles is nearly unaffected by infection that may enter from neighboring locations. This means that data on the number of measles cases in big cities may carry little information about the gravity parameters, which govern the movement of the infection between cities.
In our data, more than 90% of the cities may be considered small cities. Spatial transmission is very important to the dynamics of measles for these smaller cities, where the infection may become locally extinct. For small cities, infection re-entering from other cities is the only possible way to start a new outbreak.

Using different summary statistics may, of course, lead to different inference. Inference based on the maximums, however, was identical to what is obtained here, and therefore we do not include details of that analysis and the corresponding results. It is also possible to develop an emulator using these two summary statistics at the same time; this is computationally more demanding and, based on our exploratory data analysis, will not impact our conclusions. In general, the most informative summary statistics are not trivial to judge, and depend on the disease and the available data. The choice of summary statistics is closely linked to the

particular inference questions addressed and can be limited by the availability of informative statistics for any particular model parameters. In cases when there are no well-established summary statistics and/or scientifically important aspects of the disease dynamics that need to be captured, our emulation-based approach can be used with summary statistics constructed or selected via algorithms borrowed from the approximate Bayesian computation literature (cf. Blum and François, 2010; Fearnhead and Prangle, 2012; Nunes and Balding, 2010; Sisson and Fan, 2010). A possible approach to the lack of informative summary statistics is to increase the number of summary statistics, thereby hoping to increase the amount of information regarding the unknown parameters (Sousa, Fritz, Beaumont, and Chikhi, 2009). This approach could, however, make our inferential methods more computationally expensive. Another method for selecting summary statistics is based on ordering summary statistics according to whether their inclusion in the analysis substantially improves the quality of inference defined by different criteria (Joyce and Marjoram, 2008; Nunes and Balding, 2010). Finally, one may construct informative summary statistics using different dimension reduction techniques (Blum and François, 2010; Fearnhead and Prangle, 2012; Wegmann, Leuenberger, and Excoffier, 2009) or by transforming the existing summary statistics (Blum, 2010).

We use the priors for the gravity model parameters that are described in Section 3. Since the discrepancy term, $\delta$, is always positive, we use an exponential(1) distribution as its prior. We use a uniform grid on the four-dimensional cube each side of which is the interval [0, 2]. For each parameter, we use 20 different values on its axis of the cube; this grid size permits computationally expedient inference. Our analysis of simulated data sets also shows that 20 values per axis are sufficient for accurate inference.
In addition, for each point on the grid, the average distance from multiple forward simulations can be used instead of the distance calculated from a single simulation. This may be important when model realizations are highly variable. For the parameters of the gravity model, however, our inference was insensitive to the number of repetitions, because multiple realizations from the probability model varied very little for a given parameter setting. Therefore, it was much more important to use our computational resources for emulation across more parameter settings than to obtain repeated realizations at the same setting. Hence, we used one simulated time series at each location for each set of parameters in the four-dimensional cube.
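As a rough illustration of the ingredients described in this subsection, the following Python sketch builds a uniform grid on a cube, computes the proportion of zeros per city, and measures the Euclidean distance between simulated and observed summary statistics. The function names are hypothetical, and the full factorial layout of the grid is an assumption: a product grid with $n$ values per axis has $n^4$ points, so the usage example below uses a small $n$.

```python
import itertools
import numpy as np

def make_grid(n_per_axis=20, low=0.0, high=2.0, dim=4):
    """Uniform grid on [low, high]^dim (assumed full factorial: n_per_axis**dim rows)."""
    axis = np.linspace(low, high, n_per_axis)
    return np.array(list(itertools.product(axis, repeat=dim)))

def proportion_of_zeros(cases):
    """cases: (T, K) array of counts for T time steps and K cities.

    Returns, for each city, the fraction of time steps with zero cases.
    """
    return (np.asarray(cases) == 0).mean(axis=0)

def distance_to_observed(sim_cases, obs_cases):
    """Euclidean distance between simulated and observed proportions of zeros."""
    diff = proportion_of_zeros(sim_cases) - proportion_of_zeros(obs_cases)
    return float(np.linalg.norm(diff))

# Usage with toy data: 4 time steps, 2 cities.
sim = np.array([[0, 5], [0, 3], [2, 0], [0, 1]])
obs = np.array([[0, 1], [1, 1], [1, 1], [1, 0]])
print(proportion_of_zeros(sim))        # per-city proportions of zeros
print(distance_to_observed(sim, obs))  # one distance D_i for this parameter setting
```

Averaging `distance_to_observed` over several simulations at the same grid point would correspond to the repeated-simulation variant mentioned above.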

5.2 Application to simulated data

In the simulated examples that follow, our goal is to compare inference based on the GP approach to inference from the traditional Bayes approach. In Figure 4, we show a simulated example where both the GP and traditional Bayes approaches yield the same inference, and another simulated example where the two approaches yield different answers. In both cases, the emulation-based approach provides inference that captures the true parameter values.

In the first simulated data set, the true parameters are $\theta = 1$, $\tau_1 = 0.6$, $\tau_2 = 1$ and $\rho = 1$. In Figure 4 (a), we overlay the 95% confidence regions obtained using the two methods. Both of these regions are found by assuming $\tau_1 = 0.6$ and $\tau_2 = 1$. We can see that for this example, both the solid (traditional Bayes) and dashed (GP emulator-based) regions contain the true values of $\theta$ and $\rho$. This shows that inference based on the GP emulator is as good as inference based on the traditional Bayes method.

To demonstrate that the new approach is better than the traditional Bayes approach, we choose a second set of values for the gravity parameters ($\theta = 0.71$, $\tau_1 = 0.62$, $\tau_2 = 1$ and $\rho = 1.5$) for which we know inference based on the traditional Bayes approach to be poor (as in Figure 2). Figure 4 (b) shows how the 95% confidence region from the traditional Bayes method (outlined with a solid line) is shifted and does not contain the truth. The region obtained using the GP emulator (outlined with a dashed line) has corrected the shift and contains the true values of the parameters.

We analyze the ability of the fitted gravity model to reproduce the key characteristics of the process at these new parameter estimates. Using estimates obtained via the GP emulator-based approach, $(\theta, \tau_1, \tau_2, \rho) = (0.71, 0.5, 0.5, 1.48)$, we generate a data set to obtain plots similar to the ones in Figure 3.
The plots in Figure 5 (a)-(b) show that the model can now fit the maximums $M$ and the proportions of zeros $P$ very well. Comparing the plots in Figures 3 and 5, we can say that the new emulation-based approach improves the model fit substantially, while the traditional Bayes parameter estimates from the gridded MCMC fail to provide a model that captures the key epidemiological features of the data.

In order to study the effect of the discrepancy term in our approach, we also tried to infer the gravity parameters using the emulation-based model with $\delta = 0$ (no discrepancy). The resulting 95% confidence regions were much wider and contained incorrect parameter settings, supporting the points made in Bayarri et al. (2007) about the importance of adding a discrepancy term to approximate models. We note, however, that these new confidence regions still contained the true parameter values in simulated examples and did not have the kinds of shifts seen in parameter inference using grid-based MCMC as in Section 3.1. This means that the problem when the true parameters of the model are

Figure 4: 95% C.I.s for $(\theta, \rho)$ obtained via different methods (assuming that $\tau_1$ and $\tau_2$ are known). The solid line shows the 95% region obtained using the traditional Bayes method. The dashed line outlines the 95% region obtained via the GP emulator. (a) Both regions contain the true parameter values; (b) the region obtained by the GP emulator contains the true values of the parameters, while the traditional Bayes region does not.

Figure 5: Characteristics of simulated data at the parameters chosen to minimize the discrepancy between the data and the simulation: (a) simulated $M$ vs $M$ from the data; (b) simulated $P$ vs $P$ from the data.

not recovered by a likelihood-based approach is not related to the issue of accounting for model-data discrepancy. Continuing to explore the effect of the discrepancy term, we also tried a few different priors for $\delta$; the exponential(1) prior for the discrepancy term worked very well, as was clear from the results. The posterior median for the discrepancy term was found to be around 2, which was close to the minimal distance between the simulated and the true vectors of summary statistics taken over all the points on the grid.

6 Results from application to measles data

We apply our emulation-based approach to inference for the gravity TSIR model to a well-known measles data set from the U.K. The purpose of this is twofold: to demonstrate the applicability of our approach to a real data set, and to provide some insights into measles dynamics in the pre-vaccination era.

6.1 Description of measles data set

The following description of the data closely follows Xia et al. (2004). We analyze weekly case reports of measles for cities in England and Wales. The data are available for K = 952 locations in the pre-vaccination era from 1944 to 1965, and for K = 354 locations from 1966 to 1994 with information on vaccine coverage. The data represent an interesting case study of spatiotemporal epidemic dynamics (Grenfell et al., 2002) with a well-understood under-reporting rate of 40%-55% (Bjørnstad et al., 2002). Besides the under-reporting, the data are complete and reveal inter-annual outbreaks of infection. A critical feature of this data set is that, except for a few large cities, infection frequently goes locally extinct, so that overall persistence hinges on episodic reintroduction and spatial coupling. Before further analysis, we correct the reported data by a factor of 1/0.52, with 52% being the average reporting rate taken from previous analyses (Bjørnstad et al., 2002; Clarkson and Fine, 1985; Finkenstädt and Grenfell, 2000).
In addition, as in previous work, we use a timescale that represents the exposed and infectious period, which is known to be about 2 weeks for measles (Black, 1989). In the analysis of the data for the pre-vaccination era, following a standard assumption in the literature (see, for instance, Bjørnstad et al., 2002; Grenfell et al., 2002; Xia et al., 2004, and the references therein), the population sizes and per capita birth rates for all locations in this work are assumed to be approximately constant throughout the time period. These variables are taken as those in 1960 for each of the areas. This is a rough approximation, since

most communities grew during the period we analyze. The force of infection is, therefore, on average slightly underestimated (overestimated) during the early (late) part of the study. In the analysis of the newer data for 1966 to 1994, the population sizes and per capita birth rates are allowed to be variable as specified in the gravity model. We note that these assumptions are made for consistency of our work with previous analyses and do not have an effect on our inference or conclusions.

6.2 Some implications for measles dynamics

Important biological questions we want to answer based on these data are: (i) do the gravity model parameters (and hence disease transmission) change for school holiday periods versus non-holiday periods, and do they change for different time periods (before and after vaccines against measles were available)? (ii) do movement rates of infected people change in different time periods?

In order to answer these questions, using our emulator-based approach, we first fit the model to the parts of the data corresponding to periods of holidays and non-holidays. As demonstrated in our simulated examples in Sections 3.1 and 5.2, it is not possible to infer all the gravity model parameters at once. Hence, we set the parameters $\tau_1$ and $\tau_2$ equal to 1 and study the remaining key gravity model parameters $\theta$ and $\rho$. The resulting 95% confidence regions for $\theta$ and $\rho$ are provided in Figure 6 (a). As can be seen from this figure, the two regions are almost identical, indicating that any change in the number of cases of measles between holidays and non-holidays is not due to a change in the way the infection spreads between cities of the metapopulation during these periods.

The matrix $M = \{m_{kj}\}$, where

$$m_{kj} = \theta N_{kt}^{\tau_1} \frac{\sum_{t=1}^{T} (I_{jt})^{\tau_2}}{d_{kj}^{\rho}},$$

is interpreted as a matrix of the amount of movement: the sum of the $k$-th row of $M$ represents the amount of infected individuals leaving city $k$, while the sum of the $k$-th column is the number of infected people coming to city $k$.
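The movement matrix and the row and column sums described above can be sketched in Python as follows. This is an illustrative reading of the formula, not the authors' code: the function names are hypothetical, population sizes are treated as constant in time (one value $N_k$ per city), the diagonal of $M$ is set to zero because $d_{kk} = 0$, and no averaging over time steps is applied. The second function shows how posterior samples of $\theta$ and $\rho$ translate into a posterior median and 95% interval for the flux of a chosen city.

```python
import numpy as np

def movement_matrix(theta, rho, tau1, tau2, N, I, d):
    """m_kj = theta * N_k^tau1 * sum_t (I_jt)^tau2 / d_kj^rho, with m_kk = 0.

    N: (K,) population sizes (assumed constant in time)
    I: (T, K) infected counts per time step and city
    d: (K, K) pairwise distances with zeros on the diagonal
    """
    infected = (I ** tau2).sum(axis=0)                 # shape (K,), indexed by j
    d_safe = np.where(d > 0, d, 1.0)                   # avoid dividing by zero
    decay = np.where(d > 0, d_safe ** (-rho), 0.0)     # zero out the diagonal
    return theta * (N ** tau1)[:, None] * infected[None, :] * decay

def flux_summary(theta_samples, rho_samples, city, tau1, tau2, N, I, d):
    """2.5%, 50% and 97.5% quantiles of the row sum and column sum for one city."""
    row_sums, col_sums = [], []
    for th, rh in zip(theta_samples, rho_samples):
        M = movement_matrix(th, rh, tau1, tau2, N, I, d)
        row_sums.append(M[city, :].sum())   # row sum ("leaving" in the text)
        col_sums.append(M[:, city].sum())   # column sum ("coming to" in the text)
    q = [2.5, 50.0, 97.5]
    return np.percentile(row_sums, q), np.percentile(col_sums, q)

# Toy usage with three cities and two time steps (made-up numbers).
N = np.array([100.0, 400.0, 900.0])
I = np.array([[1.0, 4.0, 9.0], [1.0, 4.0, 9.0]])
d = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 2.0], [2.0, 2.0, 0.0]])
M = movement_matrix(0.5, 1.0, 1.0, 1.0, N, I, d)
print(M)
```

The toy inputs are invented purely to exercise the functions; with real data, `theta_samples` and `rho_samples` would be the MCMC draws for $\theta$ and $\rho$.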
Using samples for $\theta$ and $\rho$, we easily obtain a sample for the spatial flux of infection for selected cities. In Table 1, we report our estimates with corresponding credible regions based on this analysis. We use the posterior median as the point estimate. For example, we estimate the average number of emigrating infections during the holiday periods each week to be equal to 31.1 for London. Below the estimate, we report a 95% credible interval for it, which is (4.4, 479.1). Based on these estimates, the mobility of the infection appears to be lower during the holiday periods.

Figure 6 (b) shows confidence regions obtained by the GP emulator-based approach by fitting the model to the data from 1944 to 1965 and from 1966 to 1994 separately. From this figure,

Figure 6: 95% C.I. for $(\theta, \rho)$ obtained via fitting the GP emulator to a part of the data: (a) the solid line outlines the confidence region for the parameters when data from only holiday periods are used; the dashed line outlines the confidence region for the parameters when data from only non-holiday periods are used; (b) the solid and dashed lines outline the confidence regions for the parameters when data from each of the two time periods are used.

Table 1: Estimated amount of average movement in two weeks (95% credible intervals)

City         From: Holiday   From: Non-Holiday   To: Holiday    To: Non-Holiday
London       (4.4, 479.1)    (6.6, 744.7)        (4.6, 564.9)   (6.9, 823.9)
Birmingham   (1.2, 72.9)     (1.8, 110.6)        (1.2, 74.7)    (1.9, 115.8)
Manchester   (1.0, 151.4)    (1.4, 180.9)        (1.2, 162.9)   (1.5, 189.1)
Blackpool    (0.1, 6.7)      (0.2, 8.8)          (0.1, 5.2)     (0.1, 6.1)

Table 2: Estimated amount of average movement in two weeks

City         From                          To
London       (7.9, 488.4)   (4.4, 623.2)   (7.0, 591.5)   (4.9, 739.8)
Birmingham   (1.9, 75.1)    (1.2, 112.8)   (2.8, 93.7)    (1.2, 121.3)
Manchester   (2.1, 128.4)   (0.9, 176.7)   (1.9, 163.6)   (1.4, 193.1)
Blackpool    (0.3, 8.1)     (0.2, 9.7)     (0.1, 7.4)     (0.1, 7.1)

we conclude that the change in parameter values is statistically insignificant for these two different time periods. The important scientific implication of this result is that the introduction of vaccination in England and Wales in 1966 did not change the movement patterns of the infection between cities. This also means that any observed change in incidence rates of measles is due only to the effects of vaccination, not to a change in movement patterns in the vaccination era.

Table 2 shows estimates of the average amount of transit infections each bi-week for the years 1944 to 1965 and 1966 to 1994. We see here that the infection appears to move less during the later years. We note that none of the differences are statistically significant. As a visual summary of this table for the time period with vaccination, in Figure 7 we plot histograms of the log-transformed estimated amount of average movement in two weeks for 1966 to 1994. From these plots, we can conclude that both the incoming (Figure 7 (a)) and outgoing (Figure 7 (b)) numbers of infections for most of the cities are very small.

Figure 8 displays graphs of networks of the movement of measles between cities in our data. These graphs are obtained using the movement matrix $M$ and estimates of the gravity parameters obtained via the GP emulator-based approach. In Figure 8 (a), we plot the network of outgoing infections. In Figure 8 (b), we plot the network of incoming infections for cities of the metapopulation. Figure 8 (a) illustrates the importance of big cities in the dynamics of measles for smaller communities where the infection may become locally extinct.
From this figure, we see that the edges radiating from the populated cities reach the small cities, causing a re-introduction of the infection in these communities. This link between big and small cities does not seem to depend on distances between the cities. On the other hand, in Figure 8 (b), we see that the amount of incoming infections is mostly dependent on distances between cities, since the edges connecting different cities in this graph are

Figure 7: Histogram of estimated amount of average movement in two weeks for 1966 to 1994: (a) outgoing infections; (b) incoming infections.

shorter relative to the edges of the graph in Figure 8 (a). This means that big cities are the only important factors in starting an outbreak in smaller cities, excluding the possibility of re-introduction of the disease from neighboring cities with small population sizes.

7 Discussion

Complex models are very useful for representing physical phenomena, whether the phenomenon is the spread of an infectious disease or the change in sea surface temperatures in the Atlantic. As is well known, it is not always possible for every aspect of such complicated phenomena to be modeled accurately; certain key characteristics of the process necessarily have to be focal points of the modeling effort. However, these key characteristics are not typically the focus of a statistical inferential procedure that uses a traditional likelihood-based approach. The approach we have developed in this paper addresses this point by providing a flexible inferential method that directly takes into account the characteristics of the process that are most important to scientists. Even though focusing on different summary statistics can lead to different estimates, parameter inference based on our approach produces an improved model fit to the biologically interesting features of the infectious disease dynamics. In addition to the flexibility this provides, we find that our approach is also computationally tractable in

Figure 8: Movement networks of the infection: (a) network of outgoing infections; (b) network of incoming infections.
