Bayesian inference for stochastic multitype epidemics in structured populations via random graphs

Nikolaos Demiris
Medical Research Council Biostatistics Unit, Cambridge, UK

and Philip D. O'Neill¹
University of Nottingham, UK

Summary. This paper is concerned with new methodology for statistical inference for final-outcome infectious disease data using certain structured-population stochastic epidemic models. A major obstacle to inference for such models is that the likelihood is both analytically and numerically intractable. The approach taken here is to impute missing information in the form of a random graph that describes the potential infectious contacts between individuals. This level of imputation overcomes various constraints of existing methodologies, and yields more detailed information about disease spread. The methods are illustrated with both real and test data.

Keywords: Bayesian inference, epidemics, Markov chain Monte Carlo methods, Metropolis-Hastings algorithm, random graphs, stochastic epidemic models

1 Introduction

This paper is concerned with the problem of inferring information about disease spread given data on the final outcome of an epidemic in a structured population. Before outlining our approach, we begin by briefly recalling relevant background material. Stochastic epidemic models that incorporate structured populations have become a subject of considerable research activity in recent years. Examples include independent-household models (e.g. Longini and Koopman, 1982; Becker and Dietz, 1995; Becker and Hall, 1996), models with two levels of mixing (e.g. Ball et al., 1997; Ball and Lyne, 2001; Demiris and O'Neill, 2005), random network models (e.g. Andersson, 1999; Britton and O'Neill, 2002), and social cluster models (e.g. Schinazi, 2002).
The basic motivation for such work is that, in contrast to epidemic models that assume a homogeneously mixing population of individuals, most human populations contain inherent structure because individuals usually spend their time in various groups such as dwelling places, work places, childcare facilities etc.

¹ Address for correspondence: School of Mathematical Sciences, University of Nottingham, Nottingham NG7 2RD, UK. pdo@maths.nott.ac.uk

Our focus here is on so-called two-level-mixing models, defined formally below. These models, introduced in Ball et al. (1997), describe a population partitioned into groups in which infectious contacts can occur both locally within a group, and globally between groups. The basic inference problem is then to estimate the local and global infection rates, given knowledge of the underlying structure, and data that indicate which individuals in the population ever became infected during an epidemic. This problem is complicated because of the model dependence structures. Specifically, global infections are explicitly described in the model, and so the number infected in a given group is not independent of the numbers infected in other groups. This in turn means that the likelihood cannot be expressed as a simple product over group outcomes, and in fact is intractable in any case of practical concern. This problem can be partly overcome by simply assuming independence between households, which is a reasonable assumption in a large population, and moreover is asymptotically the case as the number of groups tends to infinity. In particular, it is then possible to approximate the two-level mixing model with a simpler independent-groups model possessing a tractable likelihood. In such models, local mixing occurs as before, but global mixing is replaced by the assumption that each individual independently avoids infection from outside its group with some fixed probability. Inference for these models is possible in a variety of ways, see e.g. Addy et al. (1991), Becker and Dietz (1995) and Li et al. (2002). The use of an independent-households model as an approximation for a two-level-mixing model underlies the statistical analyses in Ball et al. (1997), Britton and Becker (2000), Ball and Lyne (2004), and Demiris and O'Neill (2005).
An attractive aspect of our methods is that they dispense with the need for such approximation. Another difficulty with performing inference for epidemics (and other models with threshold behaviour, e.g. branching and contact processes) arises due to the bimodal nature of realisations. Typically, either an epidemic dies out quickly, or else it infects a fraction of the population which, in a large population, is approximately Gaussian (see e.g. Andersson and Britton, 2000, Chapter 4). Many statistical analyses require the assumption that the epidemic has taken off, this being expressed by requiring that a threshold parameter $R$ exceeds unity (e.g. Becker, 1989, Chapter 8; Rida, 1991; Ball and Lyne, 2004; Demiris and O'Neill, 2005). Such an assumption (i) often leads to underestimation of the variability of model parameters or the threshold parameter, and (ii) is clearly not desirable when attempting to infer control strategies which require that $R < 1$. The methods that we present here are free from this restriction.

For intractable likelihood problems, such as that which concerns us, one solution is to augment the parameter space, adding missing data or other quantities that then yield a tractable likelihood. Although such methods are widely used, in the current context it is far from obvious what should be imputed. Demiris and O'Neill (2005) describe an approach based on imputation of the so-called final severity of the epidemic, which leads to an approximate analysis involving an independent-groups model, as mentioned above. The key idea in the present paper is instead to impute much more detailed information about the epidemic, namely the set of susceptible individuals that each infected individual would infect if no other infections were permitted. This information is conveniently described by a random digraph in which links correspond to potential infections.

Three points about our approach should be noted. First, it is apparently ambitious, since a great deal of extra information is imputed. However, as described later, it is typically the case that only relatively few digraphs are likely to be compatible with the data, and thus the imputation is practicable. Second, some problems involving temporal data, such as weekly case incidence counts, have been approached by imputing missing information in the form of infection pathways (Haydon et al., 2003; Wallinga and Teunis, 2004). Although superficially related to our approach, in fact there are fundamental differences, namely (i) our data are not temporal; (ii) we have to deal with an intractable likelihood; and (iii) we do not impute the infection pathway itself. Third, although we focus here on two-level-mixing models, in fact our approach has very wide applicability, as will be outlined later.

The paper is structured as follows. Sections 2 and 3 contain, respectively, the epidemic model of interest and the associated random digraph. Section 4 describes the data and augmented likelihood obtained when the imputed information is employed.
An MCMC algorithm is described in Section 5, and illustrated in Section 6, while Section 7 contains some concluding remarks.

2 Multitype epidemic models with two levels of mixing

In this section we describe the epidemic model of interest, and an associated threshold parameter that will be of importance in the sequel.

2.1 Multitype two-level mixing model

The following model of a continuous-time epidemic process is defined in Ball and Lyne (2001). Consider a closed population of $N$ individuals, labelled $1, \dots, N$, that is partitioned into groups (e.g. households, farms) of varying sizes. Suppose that the population contains $m_j$ groups of size $j$, and let $m = \sum_{j \ge 1} m_j$ be the total number of groups. Thus $N = \sum_{j \ge 1} j m_j$. In addition, each individual in the population is assumed to be one of a possible $k$ types, these typically representing categorical covariates (e.g. age, vaccination status, previous infection history). For $i = 1, \dots, k$, let $N_i$ denote the total number of individuals in the population of type $i$. For convenience, suppose that the possible types are labelled $1, \dots, k$, and for $j = 1, \dots, N$ denote by $\tau(j)$ the type of individual $j$.

Each individual in the population can, at any time $t \ge 0$, be in one of three states, namely susceptible, infective, or removed. A susceptible individual is healthy and may contract the disease in question. An infective individual has become infected, and moreover can transmit the disease to others. A removed individual is one who is no longer infectious, and plays no part in further disease spread. In practice this could occur either because of actual immunity (induced by antibodies), or by isolation following the appearance of symptoms. If a type $i$ individual, $j$ say, becomes infected, then they remain so for a random time $I(j)$ whose distribution is the same as some specified nonnegative random variable $I_i$. The random variables describing the infectious periods $I(j)$, $j = 1, 2, \dots$, of different individuals are assumed to be mutually independent. The epidemic is initiated at time $t = 0$ by a (typically small) number of individuals becoming infectious.
During its infectious period, an individual of type $i$ makes infectious contacts with each type $j$ susceptible, independently of other individuals, at times given by the points of a Poisson process of rate $\lambda^G_{ij}/N_j$. In addition, and independently, the infectious individual also makes infectious contacts with each type $j$ susceptible in its own group according to a Poisson process of rate $\lambda^L_{ij}$. Once contacted by an infective, a susceptible individual immediately becomes infectious. At the end of its infectious period, an infective individual becomes removed, and plays no further part in the epidemic. The epidemic ceases as soon as there are no infectives present in the population.

Two points concerning realism should be noted. First, the model does not include a latent period, i.e. a time period between infection and infectiousness of an individual. However, the distribution of final numbers infected in the model is invariant to the inclusion of any sensible latent period, as described below, so the omission is immaterial in the final-outcome-data scenario that is under consideration here. Second, the stipulation that the global and local infection rates have different scalings is common practice in two-level mixing models. It corresponds to the assumption that an individual would, on average, make more local contacts if their group size increased, but would not make more global contacts if the entire population increased in size.

2.2 Threshold parameter

Threshold parameters, known in some contexts as basic reproduction numbers, are of fundamental importance both in stochastic epidemic theory (Andersson and Britton, 2000, p.6), and epidemiology (Farrington et al., 2001). Typically, for a stochastic epidemic model, there is a parameter $R$ such that epidemics in an infinite population of susceptibles are almost surely finite if and only if $R \le 1$. Such results essentially originate in branching process theory, in that the early stages of an epidemic are approximately identical to a suitable branching process. In practical terms, the goal of most disease control measures is to ensure that the value of $R$ is reduced to below unity.

A threshold parameter for the multitype two-level mixing model can be obtained by allowing the population to become large globally, i.e. by allowing the number of groups, $m$, to tend to infinity. The details are given in Ball and Lyne (2001), and are essentially as follows. Define the $k \times k$ matrix $M := (m_{ij})$, where $m_{ij}$ is the average number of global contacts to type $j$ individuals made by a group in which the first infected individual is of type $i$. The threshold parameter, $R$, is then defined as the maximal eigenvalue of $M$. It follows that, in a large population, epidemics are extremely unlikely to occur if $R \le 1$. Calculation of $R$ in practice involves computing $M$; explicit details of how to do this are given in Ball et al. (2004).

2.3 Distribution of final outcome

We shall be interested in the final outcome of the above model, i.e. the numbers of initially susceptible individuals of each type in each group who ever become infected during the epidemic.
As described in Ball and Lyne (2001), it is possible in principle to write down a triangular set of linear equations, the solution of which furnishes us with the required joint probability mass function. However, in practice this system of equations can be numerically intractable, even for small population sizes. Such problems are well-known in epidemic modelling (e.g. Andersson and Britton, 2000, p.18), and arise because the final outcome probabilities that are typically of interest are derived recursively using probabilities that are often very close to zero, and in particular which may be outside the range of normal machine accuracy. An extra complication is that for non-homogeneous population models, such as that of interest here, the number of equations themselves can be enormous.

In the present context, these numerical problems mean that the most natural likelihood, namely that obtained by calculating the probability of the observed data given a parameter set, is intractable. To overcome this, we shall adopt a form of data imputation using a random graph, to which we now turn our attention.

3 Random digraphs

We now describe a representation of the final outcome of the epidemic model in terms of an associated random graph. Such representations have been considered by a number of authors (e.g. Ludwig, 1975; Barbour and Mollison, 1990; Islam et al., 1996; Andersson and Britton, 2000, Chapter 7), although to our knowledge such graphs have never been directly used for the purposes of statistical inference.

Consider the following random directed graph (digraph) on $N$ vertices labelled $1, \dots, N$. Vertex $j$ has type $\tau(j)$, where $1 \le \tau(j) \le k$, and there are $N_i$ vertices of type $i$ in total. Associate with each vertex $j$ of type $i$ a random lifetime $I(j)$ which is a realisation of a non-negative random variable $I_i$. The lifetimes of different vertices are assumed independent. The set of vertices is partitioned into groups, not necessarily of equal size. For $j = 1, \dots, N$ and $l = 1, \dots, k$ insert a global edge from vertex $j$ to each vertex of type $l$ with probability $1 - \exp(-I(j)\lambda^G_{\tau(j)l}/N_l)$, and insert a local edge to each vertex of type $l$ in the same group as $j$ with probability $1 - \exp(-I(j)\lambda^L_{\tau(j)l})$. The insertion of each possible edge is independent of the presence or absence of all other edges. Note that no edge can be drawn from a vertex to itself. Finally, say that a vertex $j$ is directionally connected to a vertex $l$ if and only if there exists a path from $j$ to $l$, and for each vertex $j$ define $C_j$ as the set of vertices to which $j$ is directionally connected. For later convenience we adopt the convention that $j \in C_j$.

The relationship between the random digraph and the epidemic model is as follows (see also Andersson and Britton, 2000, Chapter 7).
Each vertex corresponds in the obvious way to an individual, and the lifetime random variables correspond to infectious periods. The edges emanating from vertices do not correspond directly to infections, although they can be thought of as corresponding to potential infections. For example, the probability of a local edge from individual $j$ to individual $l$ in the same group is $1 - \exp(-I(j)\lambda^L_{\tau(j)\tau(l)})$, which is the probability that $j$ ever infects $l$ in the epidemic model, provided that $j$ ever becomes infected and that $l$ remains susceptible until contacted by $j$. An equivalent view is that the edges describe all contacts between individuals, and that these only result in infection when the vertices concerned form an infective-susceptible pair. Without loss of generality, suppose that individuals $1, \dots, a$ are initially infective in the epidemic model. Then the random set of individuals who are ultimately infected in the epidemic has the same distribution as the random set of vertices $\bigcup_{j=1}^{a} C_j$.

Note that the random digraph does not contain any temporal information about the epidemic, but instead is defined in terms of infection probabilities. In particular this means that inclusion of a latent period in the epidemic model, provided it is almost surely finite, makes no change to the digraph, and hence makes no change to the final outcome distribution. Similarly, the infectious period in the original model need not be continuous, but could be modelled as separate disjoint parts in real time (e.g. corresponding to daytime-only contact). Nor does the digraph explicitly represent the actual route of infections: a vertex corresponding to an individual who does become infected might have more than one potential infector in the graph. However, given a realisation of the digraph it is straightforward to calculate the probability of any possible infection pathway, if desired.

4 Data and augmented likelihood

4.1 Final outcome data

We consider data of the form $n = \{n(s_1, \dots, s_k; i_1, \dots, i_k)\}$, where $n(s_1, \dots, s_k; i_1, \dots, i_k)$ denotes the number of groups containing $s_j$ initially susceptible individuals of type $j$, of whom $i_j$ ever become infected, where $j = 1, \dots, k$. Our focus is on Bayesian statistical inference for the two infection rate matrices $\Lambda^L := (\lambda^L_{ij})$ and $\Lambda^G := (\lambda^G_{ij})$, given $n$. By Bayes' Theorem, the posterior density of interest, $\pi(\Lambda^L, \Lambda^G \mid n)$, satisfies

$$\pi(\Lambda^L, \Lambda^G \mid n) \propto \pi(n \mid \Lambda^L, \Lambda^G)\, \pi(\Lambda^L, \Lambda^G),$$

where $\pi(n \mid \Lambda^L, \Lambda^G)$ denotes the likelihood and $\pi(\Lambda^L, \Lambda^G)$ denotes the joint prior density of $(\Lambda^L, \Lambda^G)$. However, as described in Section 2.3, the likelihood is analytically and numerically intractable in any case of interest, and so something extra is needed. Before addressing this we make two observations. First, final outcome data contain no temporal information. Specifically, there is no information regarding the mean length of the infectious period.
Consequently, we fix the infectious period distribution in advance of any data analysis. This implicitly creates a time-scale, with respect to which the values of $\Lambda^G$ and $\Lambda^L$ (but not $R$) should be interpreted. Second, provided the data are not of single type, we would expect that the model parameters are not all identifiable. With sufficiently many different compositions of groups in the data, all of the $\Lambda^L$ parameters are identifiable. However, the only information available about global infection is, essentially, the numbers of each type infected ($k$ data points), which is insufficient for the $k^2$ parameters of $\Lambda^G$ (cf. Britton, 1998). Although in principle the MCMC algorithm we describe below can function without regard to this problem, in practice it is usually pragmatic to consider a reduced-parameter model that involves extra constraints on $\Lambda^G$. Examples of this are considered later.

4.2 Augmented likelihood

In order to surmount the difficulty of an intractable likelihood, we consider augmenting the parameter space by including a digraph describing potential infectious contacts. Note that it is only necessary to consider this digraph on vertices corresponding to individuals who ever become infected in the epidemic, according to the data. This is because (i) none of these individuals can have contacted any individual who escapes infection, so there is no need to impute such potential edges, and (ii) the data contain no information about potential (but never realised) edges from individuals escaping infection.

To be precise, suppose now that the total number of individuals in the population who ever become infected is $n$, labelled $1, \dots, n$, and define $G$ as the digraph on these $n$ vertices. For $j = 1, \dots, n$ let $I(j)$ denote the lifetime random variable corresponding to vertex $j$, distributed according to $I_{\tau(j)}$, and $I = (I(1), \dots, I(n))$. As before, let $N$ denote the total number of individuals who are initially susceptible. It is necessary to assume that at least one individual is initially infective: we assume henceforth that there is exactly one individual, whose label is $\kappa$, although an arbitrary number of initial infectives is easily catered for. The augmented posterior density is

$$\pi(\Lambda^L, \Lambda^G, G, I, \kappa \mid n) \propto \pi(n \mid \Lambda^L, \Lambda^G, G, I, \kappa)\, \pi(G \mid \Lambda^L, \Lambda^G, I, \kappa)\, \pi(I)\, \pi(\kappa)\, \pi(\Lambda^L, \Lambda^G), \qquad (4.1)$$

where the last three terms on the RHS are prior densities. Note that $\pi(I)$ is simply a product of the individual densities of the $I(j)$ terms, i.e. it is specified by the model assumptions. It remains to evaluate the other two RHS terms in (4.1), which effectively correspond to the augmented likelihood. The $\pi(n \mid \Lambda^L, \Lambda^G, G, I, \kappa)$ term is the probability of there being no edges from the $n$ vertices in $G$ to the remaining $N - n$ outside $G$, provided that $\kappa$ is directionally connected to every other vertex in $G$, i.e. $C_\kappa = \{1, \dots, n\}$. If the latter does not hold then $\pi(n \mid \Lambda^L, \Lambda^G, G, I, \kappa) = 0$, since $G$ is then incompatible with the observed data. The term $\pi(G \mid \Lambda^L, \Lambda^G, I, \kappa)$ is simply the probability of the edges in $G$ and is straightforward to write down.
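The compatibility condition $C_\kappa = \{1, \dots, n\}$ is a reachability check on $G$, and a breadth-first search is one simple way to carry it out. The following is a minimal sketch under assumptions that are not part of the paper: the digraph is held as a dictionary of adjacency lists, vertices are labelled $0, \dots, n-1$ rather than $1, \dots, n$, and the function names `reachable` and `is_compatible` are purely illustrative.

```python
from collections import deque


def reachable(adj, source):
    """Return the set C_source of vertices reachable from `source`.

    `adj` maps each vertex to an iterable of its out-neighbours
    (hypothetical storage format). The source is included, matching the
    convention that j belongs to C_j.
    """
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_compatible(kappa, adj, n):
    """Indicator that kappa is directionally connected to all n vertices."""
    return int(len(reachable(adj, kappa)) == n)


# Toy digraph on four infected individuals (labels 0..3 are an assumption).
example_adj = {0: [1], 1: [0, 2], 2: [3], 3: []}
```

With this toy graph, vertices 0 and 1 reach every other vertex while vertices 2 and 3 do not, so only the first two would be admissible as the initial infective.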

For $i = 1, \dots, n$ and $j = 1, \dots, k$, define the following quantities. Let $\nu^L_{ij}$ denote the number of local edges from vertex $i$ to type $j$ vertices, and define $\nu^G_{ij}$ similarly for global edges. Let $N^L_{ij}$ denote the number of individuals in $i$'s group of type $j$. Note that $N^L_{ij}$ includes individuals who are never infected during the epidemic, and will include individual $i$ if $\tau(i) = j$. Finally, define $\delta(\kappa, G; n) = 1_{\{C_\kappa = \{1, \dots, n\}\}}$ and $\chi_{ij} = 1_{\{\tau(i) = j\}}$, where $1_A$ denotes the indicator function of the set $A$. Thus $\delta(\kappa, G; n)$ is simply the indicator function of the event that, given the initial infective $\kappa$, $G$ is compatible with the observed data $n$. Then

$$\begin{aligned} L(\Lambda^L, \Lambda^G, G, I, \kappa) &:= \pi(n \mid \Lambda^L, \Lambda^G, G, I, \kappa)\, \pi(G \mid \Lambda^L, \Lambda^G, I, \kappa) \\ &= \delta(\kappa, G; n) \prod_{i=1}^{n} \prod_{j=1}^{k} \left\{1 - \exp\left(-I(i)\lambda^L_{\tau(i)j}\right)\right\}^{\nu^L_{ij}} \exp\left(-I(i)\lambda^L_{\tau(i)j}\left(N^L_{ij} - \chi_{ij} - \nu^L_{ij}\right)\right) \\ &\qquad \times \left\{1 - \exp\left(-I(i)\lambda^G_{\tau(i)j}/N_j\right)\right\}^{\nu^G_{ij}} \exp\left(-I(i)\lambda^G_{\tau(i)j}\left(N_j - \chi_{ij} - \nu^G_{ij}\right)/N_j\right). \end{aligned} \qquad (4.2)$$

Computation of (4.2) is straightforward in practice, the only minor challenge being the $\delta(\kappa, G; n)$ term. Note that in the above formulation, the lifetimes $I(i)$, $i = 1, \dots, n$, are included as extra model parameters. It is possible to integrate these out of (4.2) by multiplying out the $\{1 - \exp(\cdot)\}$ terms and taking expectations, exploiting the independence of the $I(i)$ terms. However, the resulting expressions consist of alternating sums and possess poor numerical stability, so in most scenarios it is expeditious to retain the lifetimes as parameters.

5 Markov chain Monte Carlo algorithm

In order to obtain samples from the posterior density defined at (4.1), we now define a Metropolis-Hastings algorithm (see e.g. Gilks et al., 1996) in which the parameters are updated in blocks in the following manner. Updates for parameters other than the digraph $G$ are largely routine so we only give brief details. It is assumed that the initial configuration of the parameters has positive probability, and in particular that $\delta(\kappa, G; n) = 1$.
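Since every update in the algorithm evaluates the augmented likelihood (4.2), it may help to see how its logarithm could be computed. The sketch below is an illustration under stated assumptions rather than the authors' implementation: types are coded $0, \dots, k-1$, the edge counts and group counts are supplied as $n \times k$ arrays, the type indicator $1_{\{\tau(i) = j\}}$ is built from an identity matrix, and the connectivity indicator is passed in as a precomputed boolean. All function and argument names are hypothetical.

```python
import numpy as np


def _edge_term(nu, rate):
    """Sum-ready array of nu * log(1 - exp(-rate)), taking 0 * log(0) as 0."""
    out = np.zeros_like(rate, dtype=float)
    mask = nu > 0
    with np.errstate(divide="ignore"):
        out[mask] = nu[mask] * np.log(-np.expm1(-rate[mask]))
    return out


def log_augmented_likelihood(I, tau, lam_L, lam_G, nu_L, nu_G,
                             N_L, N_type, connected):
    """Logarithm of (4.2); returns -inf when the digraph is incompatible.

    I       : (n,) lifetimes of the infected individuals
    tau     : (n,) integer types in 0..k-1 (an assumed coding)
    lam_L/G : (k, k) local and global infection rate matrices
    nu_L/G  : (n, k) local and global edge counts from each vertex
    N_L     : (n, k) number of type-j individuals in vertex i's group
    N_type  : (k,) number of type-j individuals in the population
    connected : whether the initial infective reaches every vertex
    """
    if not connected:
        return -np.inf
    I = np.asarray(I, dtype=float)[:, None]
    tau = np.asarray(tau)
    lam_L = np.asarray(lam_L, dtype=float)
    lam_G = np.asarray(lam_G, dtype=float)
    nu_L = np.asarray(nu_L)
    nu_G = np.asarray(nu_G)
    N_type = np.asarray(N_type, dtype=float)
    chi = np.eye(lam_L.shape[0])[tau]          # chi[i, j] = 1 if tau[i] == j
    rate_L = I * lam_L[tau]                    # I(i) * lambda^L_{tau(i) j}
    rate_G = I * lam_G[tau] / N_type           # I(i) * lambda^G_{tau(i) j} / N_j
    local = _edge_term(nu_L, rate_L) - rate_L * (np.asarray(N_L) - chi - nu_L)
    glob = _edge_term(nu_G, rate_G) - rate_G * (N_type - chi - nu_G)
    return float(local.sum() + glob.sum())


# Toy single-type example: two infected individuals in one group of size 3,
# with a single local edge from the first to the second.
toy_value = log_augmented_likelihood(
    I=[1.0, 1.0], tau=[0, 0],
    lam_L=[[0.5]], lam_G=[[0.3]],
    nu_L=[[1], [0]], nu_G=[[0], [0]],
    N_L=[[3], [3]], N_type=[3], connected=True,
)
```

Working on the log scale avoids underflow in the product over vertices, and `np.expm1` keeps $1 - \exp(-x)$ accurate when the rate $x$ is small, which is typical for the global terms.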
The infection rate parameters $\lambda^L_{ij}$, $\lambda^G_{ij}$, $1 \le i, j \le k$, are each updated individually using Gaussian proposal distributions centred on the current value, with either fixed variances, or by using an adaptive scheme in which the variances can change as the algorithm proceeds. The latter is especially useful in those cases where the data are relatively uninformative about a particular infection rate. The lifetime random variables $I(i)$, $i = 1, \dots, n$, can be updated naturally by using the prior distributions as proposals. The proposed new lifetimes, $I'$ say, are accepted with probability

$$\frac{L(\Lambda^L, \Lambda^G, G, I', \kappa)}{L(\Lambda^L, \Lambda^G, G, I, \kappa)} \wedge 1.$$

Note that directional connectivity is unaffected by this update, which simplifies the computation. It is usually best to update the lifetimes in blocks rather than all at once, to prevent low acceptance rates.

The label of the initial infective, $\kappa$, can be updated using a Gibbs step. Specifically, let $A(G) = \{1 \le j \le n : \delta(j, G; n) = 1\}$ denote the possible values of $\kappa$, for a given $G$. Under the assumptions of the model each of these values is equally likely, conditional upon the current $G$. Thus $\kappa$ has full conditional distribution given by

$$\pi(\kappa \mid G, n) = \frac{\pi(\kappa)\, \delta(\kappa, G; n)}{\sum_{\kappa' \in A(G)} \pi(\kappa')}.$$

Alternative updating methods for $\kappa$, that also ensure $G$ remains directionally connected, include proposing a new value $\kappa'$ from among those vertices to which $\kappa$ is connected, and then swapping the direction of the edge from $\kappa$ to $\kappa'$. By necessity, such moves involve some updating of $G$.

Updating $G$ can be achieved by simply adding and deleting edges at random, as described below. We use the terminology non-edge to refer to the absence of an edge, i.e. there is a non-edge from vertex $i$ to vertex $j$ if and only if there is not an edge from $i$ to $j$. For $i = 1, \dots, n$ and $j = 1, \dots, k$, denote by $n^L_{ij}$ the number of individuals in $i$'s group of type $j$ who ever become infected. Thus $n^L_{ij}$ is the number of vertices in $G$ in $i$'s group of type $j$, and $\sum_{j=1}^{k} (n^L_{ij} - \chi_{ij})$, where $\chi_{ij} = 1_{\{\tau(i) = j\}}$, is the maximum possible number of local edges emanating from $i$. First, choose to try to add an edge with probability $p_a$, otherwise try to delete an edge. In both cases, choose to act on the local edges with probability $p_L$, otherwise act on the global edges. For local addition, first select, uniformly at random, an edge to add from among the entire set of local non-edges. If this set is empty, then stop at this point. Otherwise, suppose the edge is from vertex $s$ to vertex $t$, these vertices being in the same group.
To calculate the acceptance probability, note that the likelihood ratio of proposed to existing graph is simply $\{1 - \exp(-I(s)\lambda^L_{\tau(s)\tau(t)})\} / \exp(-I(s)\lambda^L_{\tau(s)\tau(t)})$. Combining this with the proposal mechanism, and that for deletion described below, yields the acceptance probability

$$\left\{\exp\left(I(s)\lambda^L_{\tau(s)\tau(t)}\right) - 1\right\} \frac{(1 - p_a) \sum_{i=1}^{n} \sum_{j=1}^{k} \left(n^L_{ij} - \chi_{ij} - \nu^L_{ij}\right)}{p_a \left(1 + \sum_{i=1}^{n} \sum_{j=1}^{k} \nu^L_{ij}\right)} \wedge 1,$$

where $\chi_{ij} = 1_{\{\tau(i) = j\}}$. Note that there is no need to check directional connectivity, since adding an edge cannot disconnect the graph. The addition of global edges occurs in the same way, mutatis mutandis. For local deletion, an edge is picked at random from among the $\sum_{i=1}^{n} \sum_{j=1}^{k} \nu^L_{ij}$ available, and then deleted with probability

$$\left\{\exp\left(I(s)\lambda^L_{\tau(s)\tau(t)}\right) - 1\right\}^{-1} \delta(\kappa, G'; n)\, \frac{p_a \sum_{i=1}^{n} \sum_{j=1}^{k} \nu^L_{ij}}{(1 - p_a) \left(1 + \sum_{i=1}^{n} \sum_{j=1}^{k} \left(n^L_{ij} - \chi_{ij} - \nu^L_{ij}\right)\right)} \wedge 1,$$

where the proposed deletion is an edge from vertex $s$ to vertex $t$ and $G'$ denotes the graph with that edge removed. Note that evaluation of the acceptance probability requires checking directional connectivity. Global deletion is similar.

6 Application to data

We now consider the performance of our methods in a variety of examples. Our aim is not to perform thorough data analyses, but to use suitable data sets to illustrate the feasibility of our approach, and its scope for providing new kinds of information not available via existing methods. All results are based on samples of size 10,000 from MCMC chain output. Algorithm convergence was checked by inspection of the resulting chain output. Unless otherwise indicated, parameters on $(0, \infty)$ are assigned exponential prior densities with rate … . In all cases such priors allow the data to dominate the posterior distribution.

If a uniform prior mass function for $\kappa$ is assumed then, in the examples below, inference for $\Lambda^L$ and $\Lambda^G$ was found to be indistinguishable from the case where $\kappa$ is simply fixed. Indeed for the single-type case, it can be shown that $\kappa$ has no bearing on inference for the local and global infection rate parameters. This is essentially a consequence of the fact that the probability that a given individual $i$ infects a given individual $j$ is the same as the probability that $j$ infects $i$. For the multitype case this is no longer true, but $\kappa$ only becomes important in data on small populations.

6.1 Example 1: Single type, homogeneous mixing model

We start by exploring the performance of the algorithm in the special case of a homogeneously mixing single-type population. In the terminology of the general model this corresponds to a single type ($k = 1$), all groups being of size one, and $\Lambda^L$ being redundant since no local infection occurs. Thus there is just one infection rate parameter, $\lambda^G_{11} = \lambda$, say.
In some sense this setting provides the most challenging inverse problem, since the data only comprise two numbers (initial number susceptible $N$, final number ever infected $n$), from which we shall try to infer information about both the infection rate and, implicitly, the random digraph. Note that $R$, usually called $R_0$ in this setting, equals $\lambda E[I]$, where $I$ is the infectious period.

Suppose that $N = 100$, with one initial infective, and consider the three data points $n = 25$, $50$ and $75$. We also consider three possible infectious period distributions, each with mean one, namely constant, exponential, and Gamma with variance 10. It should be noted that the numerical problems outlined in Section 2.3 apply in these cases, so that direct likelihood calculation via the standard triangular equations for final size would typically exceed machine accuracy. Our focus in the following is on $R_0$; the next example illustrates how information about $G$ can be easily obtained.

Some posterior summary statistics are given in Table 1, and Figure 1 shows the posterior density estimates of $R_0$ for the case $n = 25$, under the three infectious period distributions. The algorithm ran successfully, with typical run times of a few hours, and with no apparent difficulties in terms of mixing. Various starting values for both $\lambda$ and $G$ were explored and in all cases the Markov chain quickly moved to a high posterior density region. In particular, this means that more exotic updates for $G$ do not appear necessary in this case.

[Table 1 near here]

[Figure 1 near here]

We highlight three aspects of our results. First, as Figure 1 illustrates, in all cases the posterior density of $R_0$ was found to be roughly symmetric, but with a discernible right tail. This tail became more pronounced as the variance of $I$ increased, although the modal value for a given $n$ was found to be very similar as $I$ varied. Second, the key effect of the different infectious period distributions was to alter the posterior variance of $R_0$, which increased with the variance of $I$. Such findings are intuitively reasonable. Note that the posterior mean also increases with the variance of $I$, although this is essentially a consequence of the increased skewness. Third, the posterior probability that $R_0 < 1$ was found to be approximately 0.25 for $n = 25$, regardless of the distribution, and between 0.01 and 0.06 for $n = 50$, increasing with the variance of $I$. We mention this to emphasize the fact that our methods do not require any assumption that $R_0 > 1$, and moreover they provide information on how reliable such an assumption would be.
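To make the digraph representation concrete for this single-type, homogeneously mixing setting, the following sketch draws one realisation of the random digraph of Section 3 and reads off the final size as the number of vertices reachable from the initial infective. It is a forward simulation only, not the inference algorithm, and it rests on assumptions of our own: vertex 0 is taken as the initial infective, the global rate is scaled by the total number of vertices `n_pop`, and the helper names are hypothetical.

```python
import numpy as np


def constant_lifetimes(size, rng):
    """Infectious periods fixed at one."""
    return np.ones(size)


def exponential_lifetimes(size, rng):
    """Exponential infectious periods with mean one."""
    return rng.exponential(1.0, size)


def gamma_lifetimes(size, rng):
    """Gamma infectious periods with mean one and variance ten."""
    return rng.gamma(shape=0.1, scale=10.0, size=size)


def simulate_final_size(n_pop, lam, draw_lifetimes, rng):
    """Sample the digraph and return how many initial susceptibles are reached.

    Vertex j sends an edge to each other vertex independently with
    probability 1 - exp(-I(j) * lam / n_pop); dividing by n_pop is an
    assumed choice of scaling for this illustration.
    """
    lifetimes = draw_lifetimes(n_pop, rng)
    p_edge = 1.0 - np.exp(-lifetimes * lam / n_pop)
    edges = rng.random((n_pop, n_pop)) < p_edge[:, None]
    np.fill_diagonal(edges, False)             # no edge from a vertex to itself

    infected = np.zeros(n_pop, dtype=bool)
    infected[0] = True                         # vertex 0 is the initial infective
    frontier = [0]
    while frontier:
        u = frontier.pop()
        for v in np.flatnonzero(edges[u] & ~infected):
            infected[v] = True
            frontier.append(int(v))
    return int(infected.sum()) - 1             # exclude the initial infective


rng = np.random.default_rng(1)
sizes = [simulate_final_size(101, 1.5, exponential_lifetimes, rng)
         for _ in range(200)]
```

Because each lifetime distribution has mean one, `lam` plays the role of $R_0 = \lambda E[I]$ here. A histogram of `sizes` would be expected to show the bimodal behaviour described in the Introduction, with some realisations dying out almost immediately and others infecting a sizeable fraction of the population.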
6.2 Example 2: Single type, two-level-mixing model

We now turn to analyses based on two-level mixing models. In the sequel we consider data sets taken from detailed studies on outbreaks of influenza A(H3N2) in Tecumseh, Michigan. The data are in the form that we require in that they consist of final numbers infected in a population that has been divided into households. Many aspects of these data have been previously explored, see for example Monto et al. (1985), Longini et al. (1988), Addy et al. (1991), and references therein. More recent analyses based on two-level mixing models, all of which use approximations of one kind or another, can be found in Ball et al. (1997), Britton and Becker (2000), Ball and Lyne (2004), and Demiris and O'Neill (2005).

[Table 2 near here]

We begin with a single-type analysis for an outbreak in … . The data are given in Table 2 and show the numbers infected in households containing up to seven initially susceptible individuals. Previous analyses of the Tecumseh data have often only used households up to size five, this being due to numerical problems of the kind described in Section 2.3 above (e.g. Longini et al., 1988; Addy et al., 1991). Our methods have no such restriction. We define $\lambda_L = \lambda^L_{11}$ and $\lambda_G = \lambda^G_{11}$. In keeping with previous studies, we assume that the infectious periods are distributed according to a Gamma random variable with shape parameter 2 and rate parameter $1/2.05$ (equivalently, scale parameter 2.05), i.e. with mean 4.1 days.

Table 3 gives posterior summary information for $\lambda_L$, $\lambda_G$, the threshold parameter $R$, and the total numbers of local and global edges in $G$, denoted $\eta_L = \sum_{i,j} \nu^L_{ij}$ and $\eta_G = \sum_{i,j} \nu^G_{ij}$, respectively. All of the marginal posterior density estimates of these parameters were unimodal and approximately symmetric. Estimation for $\lambda_L$ and $\lambda_G$ is reasonably precise in that the posterior credible intervals are relatively small. The 95% posterior credible interval for $R$ includes unity, and moreover $P(R \le 1 \mid n) \approx 0.085$, highlighting the fact that assuming $R > 1$ is not entirely satisfactory for these data.

[Table 3 near here]

The results for $\eta_L$ and $\eta_G$ can be interpreted in a variety of ways. First, they provide summary information about $G$, and in particular the standard deviation and credible intervals give some indication of how accurately we can infer $G$ from the data.
For example, since there are 82 infected households and 128 infected individuals, it follows that $81 \le \eta_G \le$ […]. We might expect $\eta_G$ to be concentrated towards the lower end of this range, since larger values would be incompatible with the large number of individuals avoiding infection, but even so the posterior information reveals that $\eta_G$ can be inferred with considerable accuracy. Moreover, it would appear that the graph is fairly tree-like in structure, since $\eta_G + \eta_L$ is typically not far in excess of the total number of infected individuals.

The $\eta_L$ and $\eta_G$ parameters are also informative about the actual typical number of potential infections, and thus they give an alternative to $R$ itself. For example, dividing both by the number of vertices in $G$, 128, we find that the mean numbers of local and global links emanating from a vertex are 0.39 and 0.77, respectively. The sum of these is close to the posterior mean of $R$, while the individual values give some idea of the relative importance of local and global infections during the outbreak. More sophisticated variants are possible, such as considering the ratio of actual to potential edges realised, or using the local structure to obtain more detailed descriptions of local spread (the point being that the distribution of the number of local contacts depends on an individual's group size).

Thus far we have assumed that the infectious periods have a Gamma distribution with mean 4.1 days. Although the algorithm computation times are reasonable, typically several hours, these can be reduced considerably (e.g. by a factor of 3-5) by using a simpler model in which the infectious periods have fixed length 4.1 days. This change does not make much difference to the results: for example, the posterior means of $\lambda^L$ and $\lambda^G$ are similar, while the posterior variances are slightly smaller than before. Such similarities for these models are not new, see e.g. Ball et al. (1997), O'Neill et al. (2000), but the point here is that the simpler model can be analysed rather more quickly.

6.3 Example 3: Two-type, two-level-mixing

We now consider a two-type data set described in Longini et al. (1988). These data, also from the Tecumseh study, divide the at-risk population into two strata according to antibody titre level, the strata being termed low or higher. The data are actually combined from two separate influenza outbreaks, but this is immaterial for the purposes of illustrating our methods. The data set is given in Table 4 of Longini et al. (1988), and comprises 567 households containing between one and five initially susceptible individuals. In thirteen of the households the exact outcome is not presented, and for simplicity we exclude these from our present analysis. Of those households included, $T_1 = 163$ out of $N_1 = 742$ low-titre (type 1) individuals became infected, compared with $T_2 = 53$ out of $N_2 = 562$ higher-titre (type 2) individuals. The analysis described in Longini et al.
(1988) employs an independent-households model with fixed-length infectious periods, in which individuals can differ in their susceptibility, but not infectivity. For the two-level mixing model, a natural way of making the latter assumption is to set $\lambda^L_{ij} = \lambda^L_{lj}$ and $\lambda^G_{ij} = \lambda^G_{lj}$ for $j, l = 1, 2$. We refer to the resulting model as the LGS (Local-Global-Susceptibility) model. Since our methods do not require any structural restrictions on $\Lambda^L$, we also consider for illustration the model with the sole constraint that $\lambda^G_{ij} = \lambda^G_{lj}$ for $j, l = 1, 2$, and refer to this as the GS (Global-Susceptibility) model. In keeping with Example 2, and Longini et al. (1988), the infectious periods were all set to be of fixed length 4.1 days.

Starting with the LGS model, we first indicate that results comparable to those presented in Longini et al. (1988) can be easily obtained via our methods. For example, the independent-households model used in that paper is defined in terms of parameters $Q_i$ and $B_i$, respectively representing the probability that a type $i$ individual avoids infection from a single same-household infective, and from the community at large. These parameters are of direct interest because they are used to define the so-called secondary attack rate (viz., $(1 - Q_i) \times 100\%$) and the community probability of infection, $1 - B_i$. In our model we have $Q_i = \exp(-4.1 \lambda^L_{1i})$, and a simple approximation to $B_i$ is $\exp[-4.1 \lambda^G_{1i} (T_1 N_1^{-1} + T_2 N_2^{-1})]$. The maximum likelihood estimates of $Q_i$ and $B_i$, $i = 1, 2$, in Longini et al. (1988) were found to be very similar to our corresponding posterior mean and median values.

Table 4 near here

Turning now to a comparison of the GS and LGS models, Table 4 gives some posterior summary statistics. As expected, the LGS model in some sense averages out differences in $\lambda^L_{1j}$ and $\lambda^L_{2j}$ found in the GS model. In the latter model, for $j = 1, 2$ the posterior mean of $\lambda^L_{1j}$ is somewhat larger than that of $\lambda^L_{2j}$, suggesting that higher-titre individuals are less infectious. However, the posterior standard deviation of $\lambda^L_{21}$ is relatively large, so any difference between $\lambda^L_{21}$ and $\lambda^L_{11}$ is not clear-cut. The posterior uncertainty arises because estimation of $\lambda^L_{21}$ requires infected higher-titre individuals who then infect low-titre individuals, but the data contain only a few households with two types of individual, both of whom became infectious. For $\Lambda^G$, the two models give roughly similar posterior distributions, the differences essentially arising as compensation for the corresponding $\Lambda^L$ differences.
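The relations stated earlier in this example, $Q_i = \exp(-4.1 \lambda^L_{1i})$ and $B_i \approx \exp[-4.1 \lambda^G_{1i} (T_1/N_1 + T_2/N_2)]$, are easy to evaluate numerically. A minimal Python sketch follows; the final-size counts are those quoted in the text, whereas the function name and the two rate values are placeholders chosen for illustration and are not posterior estimates from Table 4.

```python
import math

INFECTIOUS_PERIOD = 4.1  # fixed infectious period in days, as in the text

# Final-size data quoted in the text: T_k infected out of N_k, types 1 and 2.
T = (163, 53)
N = (742, 562)


def escape_probabilities(lam_L_1i, lam_G_1i):
    """Return (Q_i, B_i) for a type-i susceptible (hypothetical helper).

    Q_i: probability of escaping a single same-household infective.
    B_i: approximate probability of escaping infection from the community.
    """
    q = math.exp(-INFECTIOUS_PERIOD * lam_L_1i)
    # T_1/N_1 + T_2/N_2, the factor appearing in the approximation to B_i.
    attack_sum = sum(t / n for t, n in zip(T, N))
    b = math.exp(-INFECTIOUS_PERIOD * lam_G_1i * attack_sum)
    return q, b


# Placeholder rates (assumed for illustration, not fitted values).
q1, b1 = escape_probabilities(lam_L_1i=0.05, lam_G_1i=0.15)
secondary_attack_rate = (1.0 - q1) * 100.0  # per cent
community_infection_prob = 1.0 - b1
print(round(secondary_attack_rate, 1), round(community_infection_prob, 3))
```

Applying the same function to each retained MCMC draw of the rates, rather than to a single pair of values, would give posterior samples of $Q_i$ and $B_i$ whose means and medians could be set against the maximum likelihood estimates.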
The threshold parameter $R$ was found to have posterior mean 1.21 for the LGS model and 1.23 for the GS model, and in both cases the posterior standard deviation was […]. Finally, although the data sets are certainly not strictly comparable, it is notable that the single-type model of Example 2 attributes more of the epidemic spread to global infections than the present example, insofar as the posterior mean of $\lambda^G$ exceeds the posterior means of all of the $\Lambda^G$ entries. In particular, the extra detail of the multitype data appears to indicate that local spread between low-titre individuals is of key importance, suggesting that control measures should be targeted towards this.

7 Discussion

In this paper we have described new methodology for performing Bayesian inference for two-level mixing epidemic models, given final outcome data. Implementation, although not a trivial matter, is not especially complicated. The methods work well in practice, although for data sets with very large numbers of infectives (thousands as opposed to hundreds) the algorithm takes days rather than hours to run. However, data on such large outbreaks are not common and so this is not a serious restriction.

The methods have several appealing features. First, they generate information regarding the actual propagation of the epidemic via the random digraph. Although not explicitly temporal, this information could be loosely regarded as such, for example by supposing that the real-time delay between generations of infection is roughly constant. The random digraph can then be used to infer information about the duration of the epidemic. Second, the methods do not require approximations, such as those introducing independence between groups, or the assumption that the epidemic is above threshold. Third, the methods are clearly very flexible, and have scope for application to other structured population epidemic models. Examples of the latter include spatial models, models with three or more levels of mixing, and models with overlapping subgroups (for example, with households, schools, and workplaces all explicitly described). Although the form of the structure is required to be known, in practice different plausible scenarios could be explored if the exact contact structure were not available.

An extension of practical interest is to the case where the observed data form only a fraction of the total population. The main impact of this setting, compared to that we have studied, is on the posterior variances of the quantities of interest; estimates of posterior mean behaviour will be largely unaffected. Our methods can still be applied, but now require additional assumptions regarding the (unknown) number of individuals infected from outside the observed fraction. Each such infective individual would then give rise to a connected digraph of its own on some subset of susceptibles within the observed fraction.
One way to generate the unknown infectives is to use approximation methods involving the final severity along the lines discussed in Demiris and O'Neill (2005). Roughly speaking, each individual in the observed fraction would have a fixed probability of being infected from outside, this probability itself being calculated by an approximation to the final severity of the epidemic in the entire population. A drawback with this approach is that it requires the undesirable $R > 1$ assumption discussed previously. An exact alternative is to simply impute the entire digraph in the unobserved population as well, which would be feasible in small population settings, but time-consuming for larger populations.

Finally, an important extension of our methodology is towards Bayesian model choice. In principle it is possible to implement trans-dimensional MCMC methods in the multitype epidemic model setting, the key (non-trivial) challenge being to efficiently move between different models. This is a subject of current investigation.

Acknowledgments

We thank Owen Lyne for helpful discussions. The first author was partly supported by EPSRC grant GR/M86323/01, and computing facilities were partly funded by EPSRC JREI grant GR/R08292/01.

References

Addy, C. L., Longini, I. M. and Haber, M. (1991). A generalized stochastic model for the analysis of infectious disease final size data. Biometrics 47.
Andersson, H. (1999). Epidemic models and social networks. Math. Sci. 24.
Andersson, H. and Britton, T. (2000). Stochastic Epidemic Models and Their Statistical Analysis. Lecture Notes in Statistics 151, Springer, New York.
Ball, F. G., Britton, T. and Lyne, O. D. (2004). Stochastic multitype epidemics in a community of households: estimation of threshold parameter R and secure vaccination coverage. Biometrika 91.
Ball, F. G. and Lyne, O. D. (2001). Stochastic multitype SIR epidemics among a population partitioned into households. Adv. in Appl. Probab. 33.
Ball, F. G. and Lyne, O. D. (2004). Private communication.
Ball, F. G., Mollison, D. and Scalia-Tomba, G. (1997). Epidemics with two levels of mixing. Ann. Appl. Probab. 7.
Barbour, A. D. and Mollison, D. (1990). Epidemics and random graphs. In Stochastic Processes in Epidemic Theory, eds. Gabriel, J. P. and Lefèvre, C., Lecture Notes in Biomathematics 86.
Becker, N. G. (1989). Analysis of Infectious Disease Data. Chapman and Hall, London.
Becker, N. G. and Dietz, K. (1995). The effect of the household distribution on transmission and control of highly infectious diseases. Math. Biosci. 127.
Becker, N. G. and Hall, R. (1996). Immunization levels for preventing epidemics in a community of households made up of individuals of various types. Math. Biosci. 132.
Britton, T. (1998). Estimation in multitype epidemics. J. R. Statist. Soc. B 60.
Britton, T. and Becker, N. G. (2000). Estimating the immunity coverage required to prevent epidemics in a community of households. Biostatistics 1.
Britton, T. and O'Neill, P. D. (2002). Bayesian inference for stochastic epidemics in populations with random social structure. Scand. J. Statist. 29.
Demiris, N. and O'Neill, P. D. (2005). Bayesian inference for epidemic models with two levels of mixing. To appear, Scand. J. Statist.
Farrington, C. P., Kanaan, M. N. and Gay, N. J. (2001). Estimation of the basic reproduction number for infectious diseases from age-stratified serological survey data, with discussion. J. R. Statist. Soc. C 50.
Gilks, W., Richardson, S. and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London.
Haydon, D. T., Chase-Topping, M., Shaw, D. J., Matthews, L., Friar, J. K., Wilesmith, J. and Woolhouse, M. E. (2003). The construction and analysis of epidemic trees with reference to the 2001 UK foot-and-mouth outbreak. Proc. R. Soc. Lond. B 270.
Islam, M. N., O'Shaughnessy, C. D. and Smith, B. (1996). A random graph model for the final-size distribution of household infections. Stat. in Med. 15.
Li, N., Qian, G. and Huggins, R. (2002). Analysis of between-household heterogeneity in disease transmission from data on outbreak sizes. Aust. N. Z. J. Stat. 44.
Longini, I. M. and Koopman, J. S. (1982). Household and community transmission parameters from final distributions of infections in households. Biometrics 38.
Longini, I. M., Koopman, J. S., Haber, M. and Cotsonis, G. A. (1988). Statistical inference for infectious diseases: risk-specific household and community transmission parameters. Am. J. Epid. 128.
Ludwig, D. (1975). Final size distributions for epidemics. Math. Biosci. 23.
Monto, A. S., Koopman, J. S. and Longini, I. M. (1985). Tecumseh study of illness. XIII. Influenza infection and disease. American Journal of Epidemiology 121.
O'Neill, P. D., Balding, D. J., Becker, N. G., Eerola, M. and Mollison, D. (2000). Analyses of infectious disease data from household outbreaks by Markov Chain Monte Carlo methods. J. R. Statist. Soc. C 49.
Rida, W. (1991). Asymptotic properties of some estimators for the infection rate in the general stochastic epidemic. J. R. Statist. Soc. B 53.
Schinazi, R. (2002). On the role of social clusters in the transmission of infectious diseases. Theoretical Population Biology 61.
Wallinga, J. and Teunis, P. (2004). Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology 160.

Table 1: Posterior means and standard deviations for $R_0$ under the assumption of three different infectious period distributions, each with mean 1, and with variance 0 (Constant), 1 (Exponential) and 10 (Gamma). The population size is N = 100.
Columns: Constant, Exponential, Gamma. Rows: Mean and S. Dev for each of n = 25, n = 50 and n = 75.

Table 2: Final numbers infected in households during influenza outbreak in Tecumseh, Michigan.
Columns: Susceptibles per household, Total. Rows: No. infected, Total.

Table 3: Posterior parameter summaries for the infection rates, threshold parameter, and numbers of local and global edges in $G$, Tecumseh data set.

Parameter    $\lambda^L$       $\lambda^G$     $R$             $\eta_L$    $\eta_G$
Mean
Median
S. dev
95% C. I.    (0.032, 0.072)    (0.15, 0.25)    (0.91, 1.65)    (38, 61)    (90, 109)

Table 4: Posterior mean (standard deviation) for $\Lambda^L$ and $\Lambda^G$ for the Global-Susceptibility and Local-Global-Susceptibility models.

Model    $\Lambda^L$                    $\Lambda^G$
GS       row 1: (0.015)  (0.013)        row 1: 0.146 (0.015)  (0.0090)
         row 2: (0.032)  (0.011)        row 2: 0.146 (0.015)  (0.0090)
LGS      row 1: (0.013)  (0.0073)       row 1: 0.143 (0.014)  (0.0090)
         row 2: (0.013)  (0.0073)       row 2: 0.143 (0.014)  (0.0090)

Fig. 1. Posterior density plots for $R_0$ under the assumption of three different infectious period distributions, each with mean 1, and with variance 0 (Constant), 1 (Exponential) and 10 (Gamma). The data are n = 25 cases in a population size of N = […].
Legend: Constant, Exponential, Gamma. Axes: $R_0$ and $\pi(R_0)$.
