Bayesian Inference for Contact Networks Given Epidemic Data

Chris Groendyke, David Welch, Shweta Bansal, David Hunter
Departments of Statistics and Biology, Pennsylvania State University
SAMSI, April 17, 2010
Supported by NIH Grant R01-GM083603-01.

Outline
1 Inference for Contact Networks
2 Epidemic Data
3 Simulation studies
4 Hagelloch Measles Data
5 Future Extensions

General goal
Contact network: Nodes represent individuals; edges represent potentially disease-causing contacts between two individuals (context-dependent).
Given an epidemic in a population transmitted across a (generally unobserved) contact network, we'd like to be able to describe the properties of this network.
NB: Obtaining the contact network itself is not necessarily a goal.

Contact networks and transmission networks
[Figure: contact network on numbered nodes, with exposure times $E_i$ labeled as the infection spreads.]
Assume a contact network G on the individuals: A contact is necessary for disease transmission.
Beginning with the first infected, disease spreads along edges at exponential rate β, defining a subtree of the contact network called the transmission tree P.
Data: $E_7, E_6, \ldots$ are exposure times.

Existing literature
We will use data from an epidemic to perform simultaneous inference on the network and epidemic parameters.
A few papers (Britton and O'Neill (2002), Neal and Roberts (2005), Ray and Marzouk (2008)) have discussed this type of inference.
These papers make very significant simplifying assumptions, and none has attempted to use more general network models or analyze larger data sets.
Here, we will use Britton and O'Neill (2002) as a starting point...

Statistical vs. Probabilistic Modeling
Modeling paradigm:
Probability: Simulate networks from the model, then epidemic data on the network.
Statistics (probability in reverse): Start with epidemic data, and learn about parameters via understanding of the model!

ERGMs
The framework we use to model contact networks is the Exponential-family Random Graph Model (ERGM):
$$P_\eta(Y = y) \propto \exp\{\eta^t g(y)\}$$
or
$$P_\eta(Y = y) = \frac{\exp\{\eta^t g(y)\}}{\kappa(\eta)},$$
where
η is a vector of parameters,
g(y) is a known vector of graph statistics on y,
κ(η) is the normalizing constant.

Erdős-Rényi network model
Let $g(y) = |y|$ consist of the single statistic counting the number of edges in the graph y. This gives
$$P_\eta(Y = y) = \frac{\exp\{\eta |y|\}}{\kappa(\eta)}$$
as an ERGM, where the (scalar) parameter
$$\eta = \operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$
is the log-odds of the existence of an edge.
NB: We'll use p, not η, throughout.
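The equivalence between the edge-count ERGM and independent edges with probability p can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it relies on the standard fact that, for a graph with D dyads, the normalizing constant of the edge-count model is $\kappa(\eta) = (1 + e^\eta)^D$.

```python
import math
from itertools import product


def logit(p):
    """Log-odds of an edge."""
    return math.log(p / (1 - p))


def expit(eta):
    """Inverse of logit: maps the ERGM parameter back to p."""
    return 1 / (1 + math.exp(-eta))


def ergm_edge_pmf(num_edges, num_dyads, eta):
    """P_eta(Y = y) for the edge-count ERGM, with kappa(eta) = (1 + e^eta)^D."""
    kappa = (1 + math.exp(eta)) ** num_dyads
    return math.exp(eta * num_edges) / kappa


def bernoulli_pmf(num_edges, num_dyads, p):
    """Probability of one specific graph when edges are independent with probability p."""
    return p ** num_edges * (1 - p) ** (num_dyads - num_edges)


n_nodes = 4
num_dyads = n_nodes * (n_nodes - 1) // 2  # 6 possible edges
p = 0.2
eta = logit(p)

# Enumerate all 2^6 graphs as 0/1 edge indicators and sum their ERGM probabilities.
total = sum(
    ergm_edge_pmf(sum(graph), num_dyads, eta)
    for graph in product([0, 1], repeat=num_dyads)
)
```

Here `total` sums to one over all graphs, and `ergm_edge_pmf` agrees with `bernoulli_pmf` for any edge count, which is why the slides can work with p directly.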

A few words about software
R: An open-source statistical package (www.r-project.org)
statnet: An R package for network analysis (www.statnet.org); see volume 24 of the Journal of Statistical Software
Methods described later are in the R package epinet.

Compartmental Models
This type of model partitions the population into multiple classes, based on current disease status.
One type of compartmental model is the SIR model:
Susceptible → Infective → Removed
The SEIR model adds an Exposed class, corresponding to a latent period for the disease:
Susceptible → Exposed → Infective → Removed

Example Toy Dataset (ideal)

Node   Exposure Time   Infective Time   Removal Time
1      0.0             6.4              15.1
2      8.1             12.3             16.7
3      13.5            22.9             41.2
4      38.6            48.0             56.9

[Figure: two panels, "Contact Network" and "Transmission Tree", each drawn on nodes 1-5.]

The loglikelihood: EIR times observed
Parameters: β, k, θ, η
Data: E, I, R, (G, P)
$$L(\text{parameters}) = \sum_{G,P} f(E, I, R, G, P \mid \beta, k, \theta, \eta) = \sum_{G,P} f(E, I, R \mid \beta, k, \theta, G, P)\, f(P \mid G)\, f(G \mid \eta),$$
where:

$f(E, I, R \mid \beta, k, \theta, G, P)$ models how the times (Exposed, Infective, Removed) depend on the transmission parameters and the networks G and P: for each infected i, $E_i$ is determined by β, G, and P, while $I_i - E_i \sim \text{Gamma}(k_E, \theta_E)$ and $R_i - I_i \sim \text{Gamma}(k_I, \theta_I)$.
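The forward ("probability") direction of this time model can be sketched as a small simulator. This is a minimal illustration, not the epinet implementation: the function name, the event-queue approach, the parameter values, and the reading of θ as a scale parameter are all assumptions.

```python
import heapq
import random


def simulate_seir_on_er(n, p, beta, k_e, theta_e, k_i, theta_i, seed=0):
    """Simulate an SEIR epidemic on an Erdős-Rényi graph (theta taken as a scale)."""
    rng = random.Random(seed)

    # Contact network G: each dyad is an edge independently with probability p.
    neighbors = {v: set() for v in range(n)}
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:
                neighbors[a].add(b)
                neighbors[b].add(a)

    E, I, R, parent = {}, {}, {}, {}
    initial = rng.randrange(n)
    # Heap of candidate exposures (time, source, target); -1 marks "no source".
    events = [(0.0, -1, initial)]
    while events:
        t, a, b = heapq.heappop(events)
        if b in E:
            continue  # b was already exposed by an earlier contact
        E[b] = t
        parent[b] = None if a == -1 else a  # edge (a, b) of the transmission tree P
        I[b] = E[b] + rng.gammavariate(k_e, theta_e)  # latent period
        R[b] = I[b] + rng.gammavariate(k_i, theta_i)  # infectious period
        for c in neighbors[b]:
            if c not in E:
                contact = I[b] + rng.expovariate(beta)  # Exp(beta) wait along edge
                if contact < R[b]:
                    heapq.heappush(events, (contact, b, c))
    return neighbors, E, I, R, parent


neighbors, E, I, R, parent = simulate_seir_on_er(
    40, 0.2, 0.1, 4.0, 2.0, 3.0, 2.0, seed=1
)
```

Every non-initial infected node ends up with a parent that is a neighbor in G and was infective when the exposure occurred, so the returned `parent` map is a valid transmission tree for the returned graph.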

$f(P \mid G)$ models how the transmission tree P depends on the contact network G: $f(P \mid G) \propto I\{P \text{ is possible given } G\}$. In other words, we assume a uniform distribution on all possible transmission trees P given G.

$f(G \mid \eta)$ is the model for how the contact network G depends on the ERGM parameters η: $f(G \mid \eta) \propto \exp\{\eta^t g(G)\}$. Here, the constant of proportionality depends only on η and may be intractable, as usual for an ERGM.

Priors and updates to parameters
Our MCMC-based Bayesian estimation procedure uses the prior distributions
$\beta \sim$ gamma
$\theta_I, \theta_E \sim$ inverse gamma
$p \sim$ beta
$k_I, k_E \sim$ gamma
The first three of these are conjugate priors, so the corresponding parameters may be updated using Gibbs sampling.
The $k_I$ and $k_E$ parameters may be updated using a standard Metropolis-Hastings step, in which proposals are made from a uniform density centered at the current values.
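A Metropolis-Hastings step of the kind described for $k_E$ might look like the following sketch. The hyperparameters `prior_shape` and `prior_rate`, the proposal `half_width`, and the simulated durations are all hypothetical; θ is held fixed and treated as a scale parameter purely for illustration.

```python
import math
import random


def log_post_k(k, theta, durations, prior_shape, prior_rate):
    """Unnormalized log posterior of a gamma shape k given i.i.d. Gamma(k, theta) durations."""
    if k <= 0:
        return -math.inf
    n = len(durations)
    log_lik = (
        (k - 1) * sum(math.log(d) for d in durations)
        - sum(durations) / theta
        - n * (math.lgamma(k) + k * math.log(theta))
    )
    log_prior = (prior_shape - 1) * math.log(k) - prior_rate * k  # gamma prior on k
    return log_lik + log_prior


def mh_update_k(k, theta, durations, half_width, rng, prior_shape=1.0, prior_rate=0.1):
    """One MH step with a uniform proposal centered at the current value."""
    proposal = k + rng.uniform(-half_width, half_width)
    log_ratio = log_post_k(proposal, theta, durations, prior_shape, prior_rate) - log_post_k(
        k, theta, durations, prior_shape, prior_rate
    )
    # The proposal is symmetric, so the acceptance ratio is just the posterior ratio.
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return k


rng = random.Random(42)
true_k, theta = 4.0, 2.0
durations = [rng.gammavariate(true_k, theta) for _ in range(200)]  # stand-in for I_i - E_i

k = 1.0
samples = []
for _ in range(3000):
    k = mh_update_k(k, theta, durations, half_width=0.5, rng=rng)
    samples.append(k)
posterior_mean = sum(samples[1000:]) / len(samples[1000:])
```

Proposals at or below zero have log posterior of minus infinity and are always rejected, which keeps the chain inside the support of the gamma prior.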

Parameter Updates: Graph and Tree Parameters
Updating the graph (G): Since we are (currently) assuming a dyadic independence graph model, we can update each dyad individually. We calculate the full conditional probability of existence for each possible edge, given all of the other parameters (including P).
Updating the transmission tree (P): We must determine, for each infected node except the initial exposed, which node infected it.
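For the dyad update, one plausible form of the full conditional can be sketched as follows. It assumes an Erdős-Rényi prior with edge probability p and exponential transmission at rate β along each edge, so that a dyad outside P keeps its edge with probability $p e^{-\beta\tau} / (p e^{-\beta\tau} + 1 - p)$, where τ is the time one endpoint was infective while the other was still unexposed. The function names and the example times are hypothetical, and this formula is an assumption rather than something stated on the slides.

```python
import math


def exposure_time(infective_start, removal, exposure_of_other):
    """Time a node was infective while the other node was still unexposed."""
    return max(0.0, min(exposure_of_other, removal) - infective_start)


def edge_probability(a, b, E, I, R, parent, p, beta):
    """Full conditional probability that dyad {a, b} is an edge of G."""
    # An edge used by the transmission tree must exist.
    if parent.get(a) == b or parent.get(b) == a:
        return 1.0
    tau = 0.0
    for src, dst in ((a, b), (b, a)):
        if src in I:  # only infected nodes can exert infectious pressure
            tau += exposure_time(I[src], R[src], E.get(dst, math.inf))
    # Prior weight p, discounted by the chance that no transmission occurred over tau.
    weight = p * math.exp(-beta * tau)
    return weight / (weight + (1 - p))


# Hypothetical times for three infected nodes; nodes 4 and 5 were never infected.
E = {1: 0.0, 2: 5.0, 3: 9.0}
I = {1: 3.0, 2: 7.0, 3: 12.0}
R = {1: 8.0, 2: 11.0, 3: 15.0}
parent = {1: None, 2: 1, 3: 2}
p, beta = 0.2, 0.1

prob_tree_edge = edge_probability(1, 2, E, I, R, parent, p, beta)
prob_exposed_dyad = edge_probability(1, 3, E, I, R, parent, p, beta)
prob_untouched_dyad = edge_probability(4, 5, E, I, R, parent, p, beta)
```

A dyad with no infectious exposure keeps the prior probability p, while a dyad that survived a long exposure without transmission becomes less likely to be an edge.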

Notation for Data
The exposure, infective, and removal times for node j are denoted by $E_j$, $I_j$, and $R_j$, respectively.
Denote the identity of the initial exposed node by κ (which may or may not be known).
In order for node a to infect node b, it is necessary that b is exposed during the time that a is infective:
$$I_a < E_b < R_a \qquad (1)$$
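Condition (1) translates directly into a check for which nodes could have infected a given node. The sketch below lists the feasible parents and then draws one uniformly for each infected node. The uniform draw is an assumption motivated by the uniform distribution on possible trees, and the names and times are hypothetical.

```python
import random


def candidate_parents(b, E, I, R, neighbors):
    """Neighbors a of b that satisfy condition (1): I_a < E_b < R_a."""
    return sorted(a for a in neighbors[b] if a in I and I[a] < E[b] < R[a])


def update_tree(E, I, R, neighbors, initial, rng):
    """Draw a parent for every infected node except the initial exposed node."""
    parent = {initial: None}
    for b in E:
        if b == initial:
            continue
        options = candidate_parents(b, E, I, R, neighbors)
        if not options:
            raise ValueError(f"node {b} has no feasible parent given G and the times")
        parent[b] = rng.choice(options)  # assumed uniform over feasible parents
    return parent


# Hypothetical contact network and times; node 4 was never infected.
neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
E = {1: 0.0, 2: 5.0, 3: 9.0}
I = {1: 3.0, 2: 7.0, 3: 12.0}
R = {1: 10.0, 2: 11.0, 3: 15.0}

parents_of_2 = candidate_parents(2, E, I, R, neighbors)
parents_of_3 = candidate_parents(3, E, I, R, neighbors)
tree = update_tree(E, I, R, neighbors, initial=1, rng=random.Random(0))
```

Because every parent must become infective before its child is exposed, exposure times strictly decrease along parent pointers, so the result cannot contain a cycle.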

Example Toy Dataset (slightly less ideal)

Node   Exposure Time   Infective Time   Removal Time
1      ?               6.4              15.1
2      ?               12.3             16.7
3      ?               22.9             41.2
4      ?               48.0             56.9

If necessary... Update each $E_i$ individually via Metropolis-Hastings:
for $i \neq \kappa$: propose uniformly from the possible range;
for $i = \kappa$: there is no lower bound, so use an exponential proposal.
The $I_i$ are updated in a similar manner.
To update κ, propose uniformly from the children of the current κ in P; swap times and direction of transmission.

Exploring the Parameter Space through Simulations
The rapid spread of an epidemic throughout a population could be due to either a fast transmission rate (high value of β) or a more fully connected network (large value of p). This can lead to difficulties in estimating these parameters separately.
We want to find which areas of the (p, β) parameter space lend themselves to meaningful estimation.
We simulated Erdős-Rényi networks of 40 individuals with p = 0.1, 0.2, ..., 1.
Over each of these ten networks, we simulated epidemics with five different values of β: 0.01, 0.05, 0.1, 0.5, and 1.
We assumed that all $E_i$, $I_i$, and $R_i$ times were known.

Posterior Scatterplots (true p = 0.2)
[Figure: three panels (β = 0.01, β = 0.1, β = 1) plotting log(β) against log(p).]
Scatterplots of the posterior samples for β and p for three different simulations.

Sample Epidemics
[Figure: two panels showing sample epidemics over time, one for p = 0.1, β = 0.3 and one for p = 1, β = 0.03; horizontal axis: Time.]

Hagelloch Measles Data
We consider here an actual data set, namely data from a measles epidemic that spread through the small town of Hagelloch, Germany in 1861.
The data contain (among other things) proxies for the Infective and Recovery times (we have to infer the Exposure times).
All 188 individuals in the susceptible population were infected during the course of the epidemic. (And one outlier.)

Hagelloch posterior histograms for p and β
[Figure: two histograms, "Samples from Posterior Distribution of p" and "Samples from Posterior Distribution of β"; vertical axes: Frequency.]

Results for Inferred Exposure Periods
[Figure: two panels of estimated posterior densities, for $k_E\theta_E$ (left) and $k_E\theta_E^2$ (right); vertical axes: Density. Annotated values: 10.3 (0.7) and 11.1 (0.8) in the left panel; 6.7 (1.5) and 11.8 (2.7) in the right panel.]
Estimated mean (left panel) and variance (right panel) of exposure periods in Hagelloch measles data. Solid lines denote results for all data, and dashed lines indicate results with one outlier removed.

Using Information About the Transmission Tree
The Hagelloch data set also happens to contain additional information that we can put to use in our inference.
In particular, for each infected individual, a putative parent (i.e., the individual most likely to be responsible for the infection) is recorded.
We can use this information to form a more informative prior for the transmission tree P.
For each node, we can give added prior weight to its putative parent node.

Additional Transmission Tree Information Results
[Figure: two histograms of the sampled parent of a single node, under "Uniform Tree Prior" (left) and "8 x Prior Weight on Putative Parent Node" (right); horizontal axes list the possible parents for that node; vertical axes: Frequency.]

Posterior distribution for $R_0$
We use as the notion of $R_0$ the expected number of first-generation infection events under the model if a randomly chosen node is suddenly infected:
$$R_0 = (Np)\,P(X < Y) = Np\left(1 - \left[\frac{1}{1 + \beta\theta_I}\right]^{k_I}\right),$$
where $X \sim \text{Exponential}(\beta)$ and $Y \sim \text{Gamma}(k_I, \theta_I)$.
[Figure: histogram "Posterior Samples for R0" (conjugate priors, uniform tree prior); horizontal axis: R0; vertical axis: Frequency.]
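The closed form for $P(X < Y)$ can be checked against simulation. The sketch below assumes N is the population size and θ_I is a scale parameter; the parameter values are hypothetical and are not posterior estimates from the Hagelloch analysis.

```python
import random


def r0_closed_form(n, p, beta, k_i, theta_i):
    """R0 = N p (1 - (1 / (1 + beta * theta_I)) ** k_I)."""
    return n * p * (1 - (1 / (1 + beta * theta_i)) ** k_i)


def r0_monte_carlo(n, p, beta, k_i, theta_i, draws, rng):
    """Estimate N p P(X < Y) with X ~ Exponential(beta), Y ~ Gamma(k_I, theta_I)."""
    hits = sum(
        rng.expovariate(beta) < rng.gammavariate(k_i, theta_i)
        for _ in range(draws)
    )
    return n * p * hits / draws


# Hypothetical parameter values chosen only to exercise the formula.
n, p, beta, k_i, theta_i = 188, 0.05, 1.0, 2.0, 1.5
exact = r0_closed_form(n, p, beta, k_i, theta_i)
approx = r0_monte_carlo(n, p, beta, k_i, theta_i, draws=50_000, rng=random.Random(7))
```

The identity behind the closed form is $P(X < Y) = 1 - E[e^{-\beta Y}]$, and the gamma moment generating function gives $E[e^{-\beta Y}] = (1 + \beta\theta_I)^{-k_I}$. Applying `r0_closed_form` to each posterior draw of (p, β, k_I, θ_I) would give posterior samples of $R_0$.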

Posterior Predictive Modeling
[Figure: number of infectives plotted against day.]
Multiple simulations of epidemics based on draws from the posteriors (observed data in red).

More General ERGM Models
One possible extension consists of using a more general ERGM to model the interactions in the population.
For the general ERGM, the parameter η would replace p. We would have to modify the MCMC algorithm.
Unfortunately, for a general ERGM, κ(η) cannot be evaluated in closed form; hence, more complicated updating schemes for η may be necessary.
We may be forced to simulate the entire network in order to produce an update, likely using some type of MCMC method.

Incorporating Other Types of Data
We also want to consider how best to make use of any additional data (beyond the Exposure/Infective/Recovery times) that are available to us. We've actually already had some success doing this for the Hagelloch data.
Additional data (such as viral genetic data) may allow us to partially or fully inform the transmission tree P.
This type of genetic data is already used to inform phylogenetic trees; we might consider using similar approaches in order to inform transmission trees.

The loglikelihood: Including sequence information S
Parameters: β, k, θ, η, µ
Data: E, I, R, S, G, P
$$L = \sum_{G,P} f(E, I, R, S, G, P \mid \beta, k, \theta, \eta, \mu) = \sum_{G,P} f(E, I, R \mid \beta, k, \theta, G, P)\, f(S \mid P, E, R, \mu)\, f(P \mid G)\, f(G \mid \eta)$$
[Figure: the contact network on numbered nodes, and the corresponding tree drawn against a time axis with exposure times $E_i$ and removal times $R_i$ marked.]
(P, E, R) is the phylogenetic tree shown: E and R are the exposure and recovery times; P is the tree.
µ are the parameters governing the mutation process (Jukes-Cantor? HKY?)

Summary
A statistical approach to learning about contact networks from epidemic data requires explicit specification of a contact network model. The model is parametric. In this case, we use an ERGM.
Bayesian methods not only provide a means for fitting a complicated model, they also allow incorporation of disease- or network-specific prior information.
Estimating parameters allows both simulation of realistic contact networks (e.g., to check model fit) and also understanding of contact processes.
This work is still somewhat preliminary in that many extensions are possible; certainly, Erdős-Rényi is not an appropriate model generally.