Estimation
John M. Drake & Pejman Rohani


1 Estimating R0

So far in this class we have focused on the theory of infectious disease. Often, however, we will want to apply this theory to particular situations. One of the key applied problems in epidemic modeling is the estimation of R0 from outbreak data. In this session, we study two methods for estimating R0 from an epidemic curve. As a running example, we will use the data on influenza in a British boarding school.

> load('data.RData')  #load the data and plot flu cases
> plot(flu,type='b',log='y',main='Epidemic in a British boarding school',
+      cex.main=0.85,xlab='Day',ylab='Active influenza cases')

[Figure: Epidemic in a British boarding school; active influenza cases (log scale) plotted against day.]

Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.

2 Estimating R0 from the final outbreak size

Our first approach is to estimate R0 from the final outbreak size. Although unhelpful at the early stages of an epidemic (before the final epidemic size is observed), this method is nonetheless a useful tool for post hoc analysis. The method is general and can be motivated by the following argument (Keeling and Rohani 2007): First, we assume that the epidemic is started by a single infectious individual in a completely susceptible population. On average, this individual infects R0 others. The probability that a particular individual escaped infection is therefore e^(-R0/N). If Z individuals have been infected, the probability of an individual escaping infection from all potential sources is e^(-Z R0/N). It follows that at the end of the epidemic a proportion R(∞) = Z/N have been infected and the fraction remaining susceptible is S(∞) = e^(-R(∞) R0), which is equal to 1 - R(∞), giving

    1 - R(∞) - e^(-R(∞) R0) = 0    (1)

Rearranging, we have the estimator

    R̂0 = -log(1 - Z/N) / (Z/N),    (2)

which, in this case, evaluates to -log(1 - 512/764)/(512/764) ≈ 1.655.

Exercise 1. Show how this result could have been obtained graphically, without the estimating equation (eqn 2).

Exercise 2. This equation shows the important one-to-one relationship between R0 and the final epidemic size. Plot the relationship between the total epidemic size and R0 for the complete range of values between 0 and 1.

3 Linear approximation

The next method we introduce takes advantage of the fact that during the early stages of an outbreak, the number of infected individuals is given approximately by I(t) ≈ I0 e^((R0-1)(γ+µ)t). Taking logarithms of both sides, we have log I(t) ≈ log I0 + (R0 - 1)(γ + µ)t, showing that the log of the number of infected individuals is approximately linear in time, with a slope that reflects both R0 and the recovery rate.
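The estimator in eqn 2 is easy to check numerically. A minimal sketch (the helper name final.size.R0 is ours, not part of the exercise code):

```r
# Final-size estimator of R0 (eqn 2): R0_hat = -log(1 - Z/N) / (Z/N)
final.size.R0 <- function(Z, N) {
  frac <- Z / N               # final attack fraction R(infinity) = Z/N
  -log(1 - frac) / frac
}
final.size.R0(Z = 512, N = 764)  # about 1.655, matching the worked example
```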
This suggests that a simple linear regression fit to the first several data points on a log scale, corrected to account for γ and µ, provides a rough-and-ready estimate of R0. For flu, we can assume µ = 0 because the epidemic occurred over a time period during which natural mortality is negligible. Further, assuming an infectious period of about 2.5 days, we use γ = (2.5)^(-1) = 0.4 for the correction. Fitting to the first four data points, we obtain the slope as follows.

> model<-lm(log(flu[1:4])~day[1:4],data=flu)  #fit a linear model
> summary(model)  #summary statistics for fit model

Call:
lm(formula = log(flu[1:4]) ~ day[1:4], data = flu)

Residuals:
       1        2        3        4
 0.03073 -0.08335  0.07450 -0.02188

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02703    0.10218  -0.265  0.81611
day[1:4]     1.09491    0.03731  29.346  0.00116 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08343 on 2 degrees of freedom
Multiple R-squared: 0.9977,     Adjusted R-squared: 0.9965
F-statistic: 861.2 on 1 and 2 DF,  p-value: 0.001159

> slope<-coef(model)[2]  #extract slope parameter
> slope  #print to screen
day[1:4]
1.094913

Rearranging the linear equation above and denoting the slope coefficient by β̂1, we have the estimator R̂0 = β̂1/γ + 1, giving R̂0 = 1.094913/0.4 + 1 ≈ 3.7.

Exercise 3. Our estimate assumes that boys remained infectious during the natural course of infection. The original report on this epidemic indicates that boys found to have symptoms were immediately confined to bed in the infirmary. The report also indicates that only 1 out of 130 adults at the school exhibited any symptoms. It is reasonable, then, to suppose that transmission in each case ceased once the boy had been admitted to the infirmary. Suppose admission happened within 24 hours of the onset of symptoms. How does this affect our estimate of R0? What if admission happened within twelve hours?

Exercise 4. Biweekly data for outbreaks of measles in three communities in Niamey, Niger are provided in the dataframe niamey. Use this method to obtain estimates of R0 for measles from the first community, assuming that the infectious period is approximately two weeks, or 14/365 ≈ 0.0384 years.

Exercise 5. A defect of this method is that it uses only a small fraction of the information that might be available, i.e., the first few data points. Indeed, there is nothing in the method that tells one how many data points to use; this is a matter of judgment. Further, there is a tradeoff: as more and more data points are used, the precision of the estimate increases, but this comes at the cost of additional bias. Plot the estimate of R0 obtained from n = 3, 4, 5, ... data points against the standard error of the slope from the regression analysis to show this tradeoff.

4 Estimating dynamical parameters with least squares

The objective of the previous exercise was to estimate R0. Knowing R0 is critical to understanding the dynamics of any epidemic system. It is, however, a composite quantity and is not sufficient to completely describe the epidemic trajectory. For this, we require estimates for all parameters of the model. In this exercise, we introduce a simple approach to model estimation called least squares fitting, sometimes called trajectory matching. The basic idea is that we find the values of the model parameters that minimize the squared differences between model predictions and the observed data. To demonstrate least squares fitting, we consider an outbreak of measles in Niamey, Niger, reported on by Grais et al. 2006 (Grais, R.F., et al. 2006. Estimating transmission intensity for a measles outbreak in Niamey, Niger: lessons for intervention. Transactions of the Royal Society of Tropical Medicine and Hygiene 100:867-873.).
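Before we fit the SIR model to the measles data, the logic of trajectory matching can be sketched on a toy example. Everything here is made up for illustration: the "model" is simple exponential growth I(t) = I0 e^(rt) with I0 = 2 known, and we minimize the sum of squared errors over the single unknown parameter r:

```r
# Toy trajectory matching: recover the growth rate r of I(t) = I0*exp(r*t)
# from assumed noisy counts by minimizing the sum of squared errors (SSE).
t.obs <- 0:5
y.obs <- c(2, 4, 9, 19, 41, 88)            # assumed observations, I0 = 2
sse <- function(r) sum((2 * exp(r * t.obs) - y.obs)^2)
fit <- optimize(sse, interval = c(0, 2))   # one-dimensional minimization
fit$minimum                                # best-fit growth rate, near log(2)
```

With several parameters, the same idea carries over except that the one-dimensional optimize is replaced by the multivariate optim, as in the SIR fit below.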

> load('data.RData')
> niamey[5,3]<-0  #replace an NA
> niamey<-data.frame(biweek=rep(seq(1,16),3),site=c(rep(1,16),rep(2,16),rep(3,16)),
+                    cases=c(niamey[,1],niamey[,2],niamey[,3]))  #define "biweeks"
> plot(niamey$biweek,niamey$cases,type='p',col=niamey$site,xlab='Biweek',ylab='Cases')
> lines(niamey$biweek[niamey$site==1],niamey$cases[niamey$site==1])
> lines(niamey$biweek[niamey$site==2],niamey$cases[niamey$site==2],col=2)
> lines(niamey$biweek[niamey$site==3],niamey$cases[niamey$site==3],col=3)

[Figure: Measles cases by biweek for the three Niamey sites, points colored by site and connected by lines.]

5 Dynamical model

First, we write a specialized function for simulating the SIR model in a case where the removal rate is hard-wired and with no demography.

> closed.sir.model <- function (t, x, params) {  #SIR model equations
+   S <- x[1]
+   I <- x[2]
+   beta <- params
+   dS <- -beta*S*I
+   dI <- beta*S*I-(365/13)*I
+   list(c(dS,dI))
+ }

6 Objective function

Now we set up a function that will calculate the sum of the squared differences between the observations and the model at any parameterization (more commonly known as the sum of squared errors). In general, this is called the objective function because it is the quantity that optimization seeks to minimize.

> sse.sir <- function(params0,data,site){  #function to calculate squared errors
+   data<-data[data$site==site,]  #working dataset, based on site
+   t <- data[,1]*14/365          #time in years (one biweek = 14/365 years)
+   cases <- data[,3]             #number of cases
+   beta <- exp(params0[1])       #parameter beta
+   S0 <- exp(params0[2])         #initial susceptibles
+   I0 <- exp(params0[3])         #initial infected
+   out <- as.data.frame(ode(c(S=S0,I=I0),times=t,closed.sir.model,beta,hmax=1/120))
+   sse<-sum((out$I-cases)^2)     #sum of squared errors
+ }

Notice that the code for sse.sir makes use of the following modeling trick. We know that β, S0, and I0 must be positive, but our search to optimize these parameters will be over the entire number line. We could constrain the search using a more sophisticated algorithm, but this might introduce other problems (e.g., instability at the boundaries). Instead, we parameterize our objective function (sse.sir) in terms of the alternative variables log(β), log(S0), and log(I0). While these numbers range from -∞ to ∞ (the range of our search), they map to our model parameters on a range from 0 to ∞ (the range that is biologically meaningful).

7 Optimization

Our final step is to use the function optim to find the values of β, S0, and I0 that minimize the sum of squared errors as calculated using our function.

> library(deSolve)  #differential equation library
> params0<-c(-3.2,7.3,-2.6)  #initial guess
> fit1 <- optim(params0,sse.sir,data=niamey,site=1)  #fit
> exp(fit1$par)  #back-transform parameters
[1] 5.463181e-03 9.110385e+03 2.331841e+00
> fit2 <- optim(params0,sse.sir,data=niamey,site=2)  #fit
> exp(fit2$par)  #back-transform parameters

[1] 8.666138e-03 6.276503e+03 2.843753e-01
> fit3 <- optim(params0,sse.sir,data=niamey,site=3)  #fit
> exp(fit3$par)  #back-transform parameters
[1] 7.130417e-02 8.625791e+02 1.031319e-03

Finally, we plot these fits against the data.

> par(mfrow=c(3,1))  #set up plotting area for multiple panels
> plot(cases~biweek,data=subset(niamey,site==1),type='b',col='blue',pch=21)  #plot site 1
> t <- subset(niamey,site==1)[,1]*14/365
> mod.pred<-as.data.frame(ode(c(S=exp(fit1$par[2]),I=exp(fit1$par[3])),times=t,
+                             closed.sir.model,exp(fit1$par[1]),hmax=1/120))
> #obtain model predictions
> lines(mod.pred$I~subset(niamey,site==1)[,1])  #and plot as a line
> plot(cases~biweek,data=subset(niamey,site==2),type='b',col=site)  #site 2
> t <- subset(niamey,site==2)[,1]*14/365
> mod.pred<-as.data.frame(ode(c(S=exp(fit2$par[2]),I=exp(fit2$par[3])),times=t,
+                             closed.sir.model,exp(fit2$par[1]),hmax=1/120))
> lines(mod.pred$I~subset(niamey,site==2)[,1])
> plot(cases~biweek,data=subset(niamey,site==3),type='b',col=site)  #site 3
> t <- subset(niamey,site==3)[,1]*14/365
> mod.pred<-as.data.frame(ode(c(S=exp(fit3$par[2]),I=exp(fit3$par[3])),times=t,
+                             closed.sir.model,exp(fit3$par[1]),hmax=1/120))
> lines(mod.pred$I~subset(niamey,site==3)[,1])

[Figure: Observed cases (points) and fitted SIR trajectories (lines) for the three Niamey sites, one panel per site, plotted against biweek.]

Exercise 6. To make things easier, we have assumed the infectious period is known to be 14 days. In terms of years, 1/γ = 14/365 ≈ 0.0384. Now, modify the code above to estimate γ and β simultaneously.

Exercise 7. What happens if one or both of the other unknowns (I0 and S0) is fixed instead of γ?

8 From least squares to likelihood

Many researchers use the theory of likelihood as the basis for optimization. Likelihood is the probability that the data were generated by a candidate model. By comparing models, one can identify the one with the maximum likelihood. For a more developed introduction to the theory of fitting via maximum likelihood, see the optional exercise Estimating model parameters by maximum likelihood.

A special case in likelihood theory is where the data are drawn from a normal probability density with mean given by the model and constant variance. That is, there is observation noise, but not process noise; the observation errors are symmetrical and Gaussian, and the noise does not scale with the state of the system. If Y_t is the number of infectives at time t and I_t is the model's prediction, then the log of the probability of the data at time t given the model is:

    log P(Y_t | I_t) = log( (1/sqrt(2πσ²)) exp( -(Y_t - I_t)²/(2σ²) ) )
                     = -(1/2) log(2πσ²) - (1/2)(Y_t - I_t)²/σ²

From this, it follows that

    -log L = (1/2) Σ_t [ (Y_t - I_t)²/σ² + log(σ²) + log(2π) ],

where this last equation is the negative log-likelihood of the data given the model. But what is σ²? It is a new parameter: the theoretical variance of the normally distributed errors, which is unknown. We can approximate it, however, with the variance of the deviations Y_t - I_t.

Exercise 8. Modify your optimizer so that it returns the negative log-likelihood. What is it for the three districts in Niamey?
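As a quick consistency check on this formula, the negative log-likelihood written out above is exactly what summing R's Gaussian log-densities with dnorm(..., log = TRUE) produces. The numbers below are made up for illustration only:

```r
# Check: -log L = (1/2) * sum( (Y - I)^2 / s2 + log(s2) + log(2*pi) )
# agrees with a direct sum of Gaussian log-densities via dnorm().
Y  <- c(10, 22, 35, 28, 12)   # assumed observed counts
I  <- c(12, 20, 33, 30, 11)   # assumed model predictions
s2 <- var(Y - I)              # plug-in estimate of sigma^2, as in the text
nll.formula <- 0.5 * sum((Y - I)^2 / s2 + log(s2) + log(2 * pi))
nll.dnorm   <- -sum(dnorm(Y, mean = I, sd = sqrt(s2), log = TRUE))
all.equal(nll.formula, nll.dnorm)   # TRUE
```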