Computational Statistics and Data Analysis

Computational Statistics and Data Analysis 53 (2009) 2851–2858

journal homepage: www.elsevier.com/locate/csda

Spatial scan statistics in loglinear models

Tonglin Zhang a,*, Ge Lin b,1
a Department of Statistics, Purdue University, West Lafayette, IN, USA
b Department of Health Services Research and Administration, College of Public Health, University of Nebraska Medical Center, Omaha, NE, USA

Article history: Available online 24 September 2008

Abstract

The likelihood ratio spatial scan statistic has been widely used in spatial disease surveillance and spatial cluster detection applications. In order to better understand cluster mechanisms, an equivalent model-based approach to the spatial scan statistic is proposed that unifies currently loosely coupled methods for including ecological covariates in the spatial scan test. In addition, the utility of the model-based approach with a Wald-based scan statistic is demonstrated to account for overdispersion and heterogeneity in background rates. Simulation and case studies show that both the likelihood ratio-based and Wald-based scan statistics are comparable with the original spatial scan statistic.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Modeling spatial phenomena and detecting spatial clusters are two traditions in spatial statistics. Spatial econometricians, regional scientists, and ecologists are mainly interested in the former for obtaining less biased, substantively driven parameter estimates. Spatial epidemiologists and geographers are interested in the latter for detecting and quantifying localized clusters, particularly with regard to possible relationships between clustered phenomena and corresponding environments. The assessments of spatial patterns from both traditions often involve statistical tests for overall spatial clustering, as well as for elevated local clusters.
* Corresponding address: Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066, USA. Tel.: +1 765 496 2097; fax: +1 765 494 0558.
E-mail addresses: tlzhang@stat.purdue.edu (T. Zhang), glin@unmc.edu (G. Lin).
1 Mail address: 984350 Nebraska Medical Center, Omaha, NE 68198, USA.
0167-9473/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2008.09.016

While these two traditions have been linked in spatial regressions for continuous or Gaussian data since Cliff and Ord (1981), they remain rather separate when dealing with Poisson count data. Recently Lin and Zhang (2007) attempted to bridge the two traditions for Poisson data based on Moran's I statistic, which was originally designed for Gaussian data (Moran, 1948). In this paper, we attempt to bridge the two traditions based on Kulldorff's spatial scan statistic, which was originally designed for Poisson or binomial data (Kulldorff, 1997).

We chose Kulldorff's spatial scan statistic for two reasons. First, it is the most widely used cluster detection method for disease surveillance. It uses a moving circle of varying size to detect a set of clustered regions or points that are unlikely to happen by chance. Time and categorical covariates, such as age or sex, can be included as control variables. Since their publication, the two papers by Kulldorff and Nagarwalla (1995) and Kulldorff (1997) have garnered more than 150 and 200 citations, respectively, from the combined science and social science citation indices. Federal, state and city agencies, such as the National Cancer Institute, the Washington State Department of Public Health, and the New York City Department of Health and Mental Hygiene, use SaTScan, the software that performs the test, for various retrospective and prospective cluster detection and surveillance tasks. Whether cluster detection is retrospective or prospective, there is a genuine need to incorporate ecological covariates in the spatial scan statistic for better understanding cluster mechanisms. For instance, after identifying local breast cancer clusters by the spatial scan method, Roche et al. (2002) compared local socioeconomic conditions within and outside of clustered areas. Recuenco et al. (2007) followed a loosely coupled method proposed by Kulldorff (2005), and examined spatial and temporal patterns of enzootic raccoon rabies with multiple covariates. They first fitted a regular Poisson regression with known ecological covariates, and then performed the spatial scan test by using the expected value from each spatial unit. In both applications, it is desirable to have a model-based spatial scan test that can either explain detected clusters with ecological covariates, or control for covariates.

Second, the spatial scan statistic is shown to be the most powerful for local cluster detection (Kulldorff et al., 2003; Song and Kulldorff, 2006), but the probability of both rejecting the null hypothesis and detecting the true cluster correctly is a different matter (Tango and Takahashi, 2005). The Poisson distribution assumes that its variance equals its mean, but in reality they are often not equal (Agresti, 2002). Cases are overdispersed when their variability exceeds that predicted by the Poisson distribution. In the context of spatial phenomena, a common cause of overdispersion is spatially correlated and uncorrelated heterogeneity (Hossain and Lawson, 2006). Many spatial test statistics, including the spatial scan statistic, are unable to treat overdispersion. Consequently, the scan statistic may have an unacceptable type I error probability in the presence of overdispersion (Loh and Zhu, 2007). In such a situation, the user has to decide whether an alarm or signal of a cluster is false. For example, overdispersed cases were noticed in the syndromic surveillance of lower respiratory infection based on the spatial scan statistic in Eastern Massachusetts (Kleinman et al., 2005). After adjustment for known factors that caused overdispersion, the false alarm rate was reduced by as much as 30%.
In addition to adjusting for known factors, there are at least two ways to account for overdispersion statistically. One is to add a parameter to a generalized linear model that allows extra variation from the mean, and the other is to model spatial random effects. Both require a model-based approach that can incorporate an additional parameter into the spatial scan statistic. There are many other situations where a model-based spatial scan statistic is desirable, and it is almost impossible to list all of them.

For the purpose of the current study, we intend to develop a model-based approach to Kulldorff's spatial scan statistic. In particular, we propose the model-based likelihood scan statistic $G^2_s$. In addition, we propose a new spatial scan statistic, based on the Wald statistic, to demonstrate the utility of the model-based approach. We show that the Wald-based spatial scan statistic does not suffer the complication of computing the likelihood function, and that it can be used for cluster detection in the presence of overdispersion.

In the following section, we first briefly describe the spatial scan statistic, and then propose the model-based scan statistics: the likelihood ratio (deviance) scan and the Wald scan statistics. We show that in the absence of covariates, the likelihood ratio scan statistic is equivalent to Kulldorff's spatial scan statistic. In addition, a model-based Wald scan statistic can be effective when considering overdispersion. In Section 3, we present simulation and case studies by using infant mortality data from Guangxi Province in China. We first compare the powers among the two model-based tests via Monte Carlo simulations, and then present the study of infant mortality. Finally, in Section 4, we provide some concluding remarks.

2. From the spatial scan to model-based scan statistics

Kulldorff scan test.
Kulldorff developed the spatial scan statistic by combining the time series scan statistic (Naus, 1965) with the geographical analysis machine (GAM) method (Openshaw et al., 1987). The spatial scan test uses a moving circle of varying size to find a set of regions or points that maximizes the likelihood ratio test for the null hypothesis of a purely random Poisson or Bernoulli process. Time can be included for a space-time scan test. Categorical covariates, such as age group and sex, can be included as control variables. Recent developments have led to various modifications of scan statistics for cluster shape detection (Assuncao et al., 2006).

The spatial scan statistic requires that (1) a study area has a set of $m$ locations or subregions, and (2) each location $i$ has a case count $y_i$ and a corresponding at-risk population $n_i$. Assume that the counts are independent Poisson random variables with expected value $n_i\theta_i$, where $\theta_i$ is the unit-specific relative risk. The scan statistic seeks a cluster $C$ of spatial regions in which counts are significantly higher than expected. The test compares the total number of disease counts within $C$, $y_c = \sum_{i \in C} y_i$, with the total number of counts outside of $C$, $y_{\bar c} = \sum_{i \notin C} y_i$, given the corresponding at-risk populations inside and outside of $C$, denoted by $n_c = \sum_{i \in C} n_i$ and $n_{\bar c} = \sum_{i \notin C} n_i$ respectively. Let $y = y_c + y_{\bar c} = \sum_{i=1}^m y_i$ and $n = n_c + n_{\bar c} = \sum_{i=1}^m n_i$. Assuming that $\theta_i = \theta_c$ for $i \in C$ and $\theta_i = \theta_0$ for $i \notin C$, and contrasting the null hypothesis $H_0: \theta_c = \theta_0$ against the alternative hypothesis $H_1: \theta_c > \theta_0$, Kulldorff developed the likelihood ratio as

$$\Lambda_C = \frac{\max_{\theta_c > \theta_0} L(C, \theta_c, \theta_0)}{\max_{\theta_c = \theta_0} L(C, \theta_c, \theta_0)} = \left(\frac{y_c/n_c}{y/n}\right)^{y_c} \left(\frac{y_{\bar c}/n_{\bar c}}{y/n}\right)^{y_{\bar c}}, \tag{1}$$

when $y_c/n_c \ge y_{\bar c}/n_{\bar c}$, and $\Lambda_C = 1$ otherwise, where $L(C, \theta_c, \theta_0)$ is the likelihood function. Since $C$ is unknown, it can be treated as a parameter, and the maximum likelihood ratio test statistic for an unspecified spatial cluster $C$ is given by

$$\Lambda = \max_{C \in \mathcal{C}} \Lambda_C, \tag{2}$$

where $\mathcal{C}$ is the collection of candidate spatial clusters. Since the null distribution of $\Lambda$ is not analytically tractable, Kulldorff recommends computing the p-value by Monte Carlo simulations.
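As a rough illustration of Eqs. (1) and (2), the following Python sketch evaluates $\log \Lambda_C$ for each candidate cluster and keeps the largest. The function names, the toy counts and populations, and the hand-written candidate list are all hypothetical; a real scan would build the candidates from circles of varying size centred on each location.

```python
import math

def log_lambda_c(y_in, n_in, y_total, n_total):
    """log of Lambda_C in Eq. (1) for one candidate cluster C."""
    y_out = y_total - y_in
    n_out = n_total - n_in
    # A candidate must be a proper, non-empty subset of the study area.
    if n_in == 0 or n_out == 0:
        return 0.0
    # Lambda_C = 1 (log = 0) unless the rate inside is at least the rate outside.
    if y_in / n_in < y_out / n_out:
        return 0.0
    overall = y_total / n_total

    def term(y, n):
        # 0 * log(0) is taken as 0.
        return 0.0 if y == 0 else y * math.log((y / n) / overall)

    return term(y_in, n_in) + term(y_out, n_out)

def scan(counts, pops, candidates):
    """Maximise log Lambda_C over the candidates, as in Eq. (2)."""
    y_total, n_total = sum(counts), sum(pops)
    best_value, best_cluster = 0.0, None
    for cluster in candidates:
        y_in = sum(counts[i] for i in cluster)
        n_in = sum(pops[i] for i in cluster)
        value = log_lambda_c(y_in, n_in, y_total, n_total)
        if value > best_value:
            best_value, best_cluster = value, cluster
    return best_value, best_cluster

# Hypothetical toy data: four units, with unit 1 carrying an elevated rate.
counts = [5, 20, 4, 6]
pops = [1000, 1000, 1000, 1000]
candidates = [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3)]
best_value, best_cluster = scan(counts, pops, candidates)
```

Working on the log scale avoids overflow when counts are large, and maximising $\log \Lambda_C$ selects the same cluster as maximising $\Lambda_C$.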

Like most test statistics, the spatial scan statistic also has some limitations. For instance, it cannot directly include ecological covariates. Kulldorff (2005) recommends a loosely coupled method that uses the expected values from a loglinear model that includes ecological covariates. In addition, the spatial scan statistic does not take overdispersion into consideration. Here, we propose a model-based spatial scan test based on a loglinear model, so that spatial analysts can include ecological covariates directly in their spatial cluster detection analyses. We show that the model-based likelihood ratio scan statistic is essentially equivalent to the spatial scan statistic in the absence of ecological covariates.

It is, however, more complicated to find a way to deal with overdispersion in the model-based scan approach. A common way to account for overdispersion is to extend the Poisson regression to the negative binomial regression. However, this method usually implies a specific mechanism for which the relationship between the mean and variance functions is known. For spatially autocorrelated data, overdispersion is often caused by unknown factors; the negative binomial distribution cannot be extended to allow for a spatially correlated structure (Wakefield et al., 2000). Loh and Zhu (2007) offered an MCMC method that adjusts the calculation of the p-value in the spatial scan test for spatial autocorrelation. However, since this method was mainly concerned with spatial correlation, it only addressed the heterogeneity problem without getting into model correction. In a more general case, an overdispersion parameter that allows the variance to be greater than the mean can be introduced in a Poisson regression (Faraway, 2006). Some existing estimation methods for overdispersion, such as the quasi-likelihood method, can be extended to the spatial scan test (McCullagh, 1983).
Since the Wald statistic is computationally tractable, and proven to be effective in accounting for overdispersion, we propose two Wald-based spatial scan tests. One can be used as a general spatial scan statistic, and the other can be used when considering overdispersion.

Model-based spatial scan statistic. We use the concept of relative risk to introduce a spatial loglinear model:

$$\log(\theta_i) = x_i^t \beta + \alpha_i, \tag{3}$$

where $\beta$ is the vector of coefficients, and $\alpha_i$ is a cluster indicator defined as $\alpha_i = \alpha_c$ if $i \in C$ and $\alpha_i = \alpha_0$ if $i \notin C$. By the constraint, we can always set $\alpha_0 = 0$. Under the null hypothesis $H_0: \alpha_c = 0$, model (3) becomes

$$\log(\theta_i) = x_i^t \beta. \tag{4}$$

Under the alternative hypothesis $H_1: \alpha_c > 0$ for some $C$, model (3) becomes

$$\log(\theta_i) = x_i^t \beta + \alpha_c I_{i \in C}. \tag{5}$$

When $C$ is given, the MLE (maximum likelihood estimate) of $\beta$ under the null hypothesis and the MLEs of $\alpha_c$ and $\beta$ under the alternative hypothesis can easily be derived. Let $\hat\beta^0$ be the MLE under the null hypothesis, and let $\hat\alpha_c$ and $\hat\beta^1$ be the MLEs under the alternative hypothesis. The predicted counts under $H_0$ and $H_1$ are $\hat y_i^0 = n_i e^{x_i^t \hat\beta^0}$ and $\hat y_i^1 = n_i e^{x_i^t \hat\beta^1 + \hat\alpha_c I_{i \in C}}$ respectively. Twice the logarithm of the likelihood ratio (the deviance statistic) conditional on cluster $C$ is

$$G^2_C = 2 \sum_{i=1}^m y_i \log \frac{\hat y_i^1}{\hat y_i^0}. \tag{6}$$

If $x_i^t \beta$ only includes the intercept term $\beta_0$, then under $H_0$ we have $\log(\theta_i) = \beta_0$, and under $H_1$ we have $\log(\theta_i) = \beta_0 + \alpha_c I_{i \in C}$. We now show that $G^2_C$, given by Eq. (6), is equivalent to the Kulldorff likelihood ratio function $\Lambda_C$ given by Eq. (1). The MLEs are $\hat\beta_0^0 = \log(y/n)$ under $H_0$, and, under $H_1$, $\hat\beta_0^1 = \log(y_{\bar c}/n_{\bar c})$ and $\hat\alpha_c = \log(y_c/n_c) - \log(y_{\bar c}/n_{\bar c})$ if $y_c/n_c \ge y_{\bar c}/n_{\bar c}$, or $\hat\beta_0^1 = \log(y/n)$ and $\hat\alpha_c = 0$ if $y_c/n_c < y_{\bar c}/n_{\bar c}$. Then $\hat y_i^0 = y n_i / n$ and

$$\hat y_i^1 = \begin{cases} \dfrac{y_c n_i}{n_c} I_{i \in C} + \dfrac{y_{\bar c} n_i}{n_{\bar c}} I_{i \notin C}, & \text{when } y_c/n_c \ge y_{\bar c}/n_{\bar c}, \\[2ex] \dfrac{y n_i}{n}, & \text{when } y_c/n_c < y_{\bar c}/n_{\bar c}. \end{cases}$$

Therefore, the likelihood ratio statistic becomes

$$G^2_C = 2 \sum_{i \in C} y_i \log\left(\frac{y_c/n_c}{y/n}\right) + 2 \sum_{i \notin C} y_i \log\left(\frac{y_{\bar c}/n_{\bar c}}{y/n}\right) = 2 y_c \log\left(\frac{y_c/n_c}{y/n}\right) + 2 y_{\bar c} \log\left(\frac{y_{\bar c}/n_{\bar c}}{y/n}\right)$$

if $y_c/n_c \ge y_{\bar c}/n_{\bar c}$, and $G^2_C = 0$ otherwise, which implies $G^2_C = 2 \log \Lambda_C$. Likewise, we define the Wald statistic as

$$Z_C = \frac{\hat\alpha_c}{\hat\sigma_{\hat\alpha_c}}, \tag{7}$$

where $\hat\sigma_{\hat\alpha_c}$ is the estimate of the standard error of $\hat\alpha_c$ under the alternative hypothesis. Following Kulldorff, we define the model-based deviance scan statistic as

$$G^2_s = \max_{C \in \mathcal{C}} G^2_C, \tag{8}$$
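The equivalence $G^2_C = 2 \log \Lambda_C$ in the intercept-only case can be checked numerically. The sketch below is only an illustration under stated assumptions: the data and function names are hypothetical, the fitted counts use the closed-form MLEs given above rather than a fitted regression, and the standard error in the Wald statistic uses the textbook large-sample result $\sqrt{1/y_c + 1/y_{\bar c}}$ for a two-group Poisson log rate ratio, which is not stated in the text.

```python
import math

def cluster_totals(counts, pops, cluster):
    """Return (y_c, n_c, y_out, n_out) for a candidate cluster."""
    inside = set(cluster)
    y_c = sum(counts[i] for i in inside)
    n_c = sum(pops[i] for i in inside)
    return y_c, n_c, sum(counts) - y_c, sum(pops) - n_c

def deviance_gc(counts, pops, cluster):
    """G^2_C of Eq. (6), intercept-only model, via the closed-form fitted counts."""
    inside = set(cluster)
    y, n = sum(counts), sum(pops)
    y_c, n_c, y_out, n_out = cluster_totals(counts, pops, cluster)
    if y_c / n_c < y_out / n_out:
        return 0.0  # alpha_c-hat = 0, so both fits coincide
    g2 = 0.0
    for i, (y_i, n_i) in enumerate(zip(counts, pops)):
        fit0 = y * n_i / n
        fit1 = (y_c / n_c) * n_i if i in inside else (y_out / n_out) * n_i
        if y_i > 0:
            g2 += 2.0 * y_i * math.log(fit1 / fit0)
    return g2

def two_log_lambda(counts, pops, cluster):
    """2 log Lambda_C computed directly from Eq. (1); assumes positive counts on both sides."""
    y, n = sum(counts), sum(pops)
    y_c, n_c, y_out, n_out = cluster_totals(counts, pops, cluster)
    if y_c / n_c < y_out / n_out:
        return 0.0
    return 2.0 * (y_c * math.log((y_c / n_c) / (y / n))
                  + y_out * math.log((y_out / n_out) / (y / n)))

def wald_zc(counts, pops, cluster):
    """Z_C of Eq. (7), intercept-only model; the SE formula is an assumption (see lead-in)."""
    y_c, n_c, y_out, n_out = cluster_totals(counts, pops, cluster)
    alpha_hat = math.log(y_c / n_c) - math.log(y_out / n_out)
    se = math.sqrt(1.0 / y_c + 1.0 / y_out)
    return alpha_hat / se

# Hypothetical toy data: four units, candidate cluster made of unit 1 only.
counts = [5, 20, 4, 6]
pops = [1000, 1000, 1000, 1000]
g2 = deviance_gc(counts, pops, (1,))
lam2 = two_log_lambda(counts, pops, (1,))
z = wald_zc(counts, pops, (1,))
```

With covariates in the model, the fitted counts no longer have a closed form, and both $G^2_C$ and $Z_C$ would come from a fitted Poisson regression instead.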

and the model-based Wald scan statistic as

$$Z_s = \max_{C \in \mathcal{C}} Z_C. \tag{9}$$

It should be pointed out that the likelihood-based spatial scan statistic $G^2_s$ is analytically more appealing, but the Wald-based spatial scan statistic $Z_s$ is more flexible. Since $Z_s$ only involves the estimate of the cluster parameter $\alpha_c$ and its standard error, and most model fitting procedures for generalized linear mixed models (GLMMs) can provide parameter estimates and their standard errors, $Z_s$ can be easily modified for many spatial GLMMs. If, for instance, a Poisson regression is approximated by a negative binomial regression, then $Z_s$ can be modified by taking $\hat\alpha_c$ and $\hat\sigma_{\hat\alpha_c}$ as the estimated values from the negative binomial regression model. If a spatially correlated or uncorrelated random effect is implemented in the Poisson regression model, then $Z_s$ can be modified under spatial GLMMs. In both cases, $Z_C$ can be derived by calculating the corresponding parameter estimate and its standard error based on Eq. (7). Due to its flexibility, the proposed $Z_s$ is a useful addition to the spatial scan statistic family. In the following, we provide a simple modification to demonstrate its utility for overdispersion.

Modification for overdispersion. Overdispersion can occur in the spatial Poisson process, either because of spatially correlated data, or because of spatially varied cases and populations from a large number of spatial units. To gauge and to reduce the potential effect of overdispersion, a dispersion parameter denoted by $\phi$ can be introduced; this parameter is similar to the general overdispersion parameter in a Poisson regression. When $\phi = 1$, the model is a regular Poisson regression without overdispersion. When $\phi > 1$, it is a Poisson regression with overdispersion. If we know the extent of spatially correlated data, an adjustment of deviance via $\phi$ for a p-value can be effective (Loh and Zhu, 2007).
When we know the exact relationship between the variance and the expected value, a negative binomial regression can also be effective. In the absence of this information, an estimate of the dispersion parameter can be used (McCullagh, 1983):

$$\hat\phi = \max\left(\frac{X^2}{df_{res}}, 1\right), \tag{10}$$

where $X^2$ is the Pearson $\chi^2$ statistic and $df_{res}$ is the residual degrees of freedom. When an overdispersion parameter is considered, the estimated standard error of $\hat\alpha_c$ is adjusted to $\hat\phi^{1/2} \hat\sigma_{\hat\alpha_c}$, and the Wald statistic testing for the significance of cluster $C$, conditional on $C$, is modified as

$$Z_{C,o} = \frac{\hat\alpha_c}{\hat\phi^{1/2} \hat\sigma_{\hat\alpha_c}}. \tag{11}$$

The model-based Wald scan statistic with the adjustment is then defined as

$$Z_{s,o} = \max_{C \in \mathcal{C}} Z_{C,o}. \tag{12}$$

Note that under the loglinear model framework, the unadjusted $Z_s$ above does not consider the overdispersion parameter $\phi$, and thus it always sets the dispersion parameter equal to 1. The $Z_{s,o}$ statistic, which is treated as a modification of $Z_s$, considers the overdispersion parameter.

Similar to Kulldorff's spatial scan statistic $\Lambda$, the null distributions of $G^2_s$, $Z_s$ and $Z_{s,o}$ are not analytically tractable. We, therefore, follow Kulldorff and compute their p-values by the bootstrap method. The null hypothesis of no clustering is rejected for large $G^2_s$, $Z_s$ or $Z_{s,o}$ values. The corresponding bootstrap p-value is the rank of the observed value in the corresponding bootstrap distribution. Following Kulldorff (1997), we calculate the p-values conditionally on the total number of counts by the following algorithm.

Bootstrap for the p-values of $G^2_s$, $Z_s$ and $Z_{s,o}$:
(i) Calculate the observed values of $G^2_s$, $Z_s$ and $Z_{s,o}$ (denoted by $G^2_{s,0}$, $Z_{s,0}$ and $Z_{s,o,0}$ respectively) according to Eqs. (8), (9) and (12) respectively.
(ii) Derive the predicted counts $\hat\mu_i$ under the null model (4).
(iii) Generate $K$ independent multinomial random variables with total counts equal to $Y$ and proportional vector equal to $(\hat\mu_1/Y, \ldots, \hat\mu_m/Y)$. Calculate the simulated values of $G^2_s$, $Z_s$ and $Z_{s,o}$ (denoted by $G^2_{s,k}$, $Z_{s,k}$ and $Z_{s,o,k}$ respectively for $k = 1, \ldots, K$) for each generation.
(iv) Report the p-values of $G^2_s$, $Z_s$ and $Z_{s,o}$ as $\#\{G^2_{s,k} \ge G^2_{s,0} : k = 0, \ldots, K\}/(K+1)$, $\#\{Z_{s,k} \ge Z_{s,0} : k = 0, \ldots, K\}/(K+1)$ and $\#\{Z_{s,o,k} \ge Z_{s,o,0} : k = 0, \ldots, K\}/(K+1)$ respectively, where $\#A$ represents the number of elements in set $A$.

Both the model-based $G^2_s$ and $Z_s$ statistics have their roles in spatial scans for clusters. $G^2_s$ is conceptually equivalent to the loosely coupled spatial scan statistic that can include ecological covariates. The model-based Wald scan statistics ($Z_s$ and $Z_{s,o}$) can be effective in the presence of overdispersion.
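The dispersion estimate in Eq. (10) and steps (iii) and (iv) of the bootstrap can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy fitted values are hypothetical, and the scan statistic that would be recomputed on each simulated data set is left out.

```python
import random

def dispersion(observed, fitted, n_params):
    """phi-hat of Eq. (10): max(Pearson X^2 / residual df, 1)."""
    pearson_x2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    df_res = len(observed) - n_params
    return max(pearson_x2 / df_res, 1.0)

def multinomial_draw(total, fitted, rng):
    """Step (iii): allocate `total` cases with probabilities proportional to the fitted counts."""
    units = rng.choices(range(len(fitted)), weights=fitted, k=total)
    simulated = [0] * len(fitted)
    for i in units:
        simulated[i] += 1
    return simulated

def rank_p_value(observed_stat, simulated_stats):
    """Step (iv): the k = 0 term is the observed statistic, so the numerator starts at 1."""
    exceed = sum(s >= observed_stat for s in simulated_stats)
    return (1 + exceed) / (len(simulated_stats) + 1)

# Hypothetical example: four units with equal fitted counts under the null model.
rng = random.Random(0)
fitted_null = [8.75, 8.75, 8.75, 8.75]
one_null_data_set = multinomial_draw(35, fitted_null, rng)
phi_hat = dispersion([2, 18, 10, 10], [10, 10, 10, 10], n_params=1)
p_value = rank_p_value(5.0, [1.0, 2.0, 6.0, 3.0])
```

Because the observed statistic is counted among the $K + 1$ values, the smallest attainable p-value is $1/(K+1)$, which is why $K = 999$ is a convenient choice at the 0.05 level.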

Fig. 1. County-level infant mortality rate per 100,000 in Guangxi, China in 2000.

3. Simulation and case study

In both the simulation and the case study, we used real-world data based on 110 counties in Guangxi Zhuang ethnic Province, one of five autonomous ethnic minority regions in China. We obtained the county-level infant birth and death data from the 2000 Census in China. Infant mortality rates vary substantially among these counties, ranging from 248 to 7260 per 100,000 (Fig. 1). Guangxi for the most part is mountainous and is considered a less developed region in southwestern China. The county-level elevations range from 20 to 1140 m above sea level, lower in the southeast and higher toward the west. In the absence of other variables, elevation is a good measure of access to care, education and other socioeconomic resources. For this reason, we included elevation in both the simulation and the case study.

3.1. Simulation

We evaluate the type I error rates and the power functions of the proposed statistics. For a purely spatial situation, the model-based likelihood ratio scan statistic is equivalent to the original spatial scan statistic, and its statistical power has been thoroughly evaluated (Kulldorff et al., 2003). For this reason, we opted to evaluate the performance of these model-based scan statistics by including an ecological covariate. For simplicity, we define the size of a local cluster or spatial association by a county and its adjacent counties (Lin and Zhang, 2007), which can be indexed by a column vector ($w_i$) in the spatial weight matrix. We used the 110 counties to generate the spatial weight matrix based on spatial adjacency, and used infant births ($n_i$), deaths ($Y_i$) and elevations ($x$) to fit a baseline model.
Assuming that $E(Y_i) = \theta_i n_i$, we fitted a baseline model with elevation and its quadratic term as ecological covariates. The fitted model is

$$\log(\theta_i) = -4.160 + 2.07 \times 10^{-3} x - 1.27 \times 10^{-6} x^2. \tag{13}$$

The coefficient of the quadratic term in the fitted model was negative. The predicted mortality rate reached its maximum at an elevation of $(2.07 \times 10^{-3})/(2 \times 1.27 \times 10^{-6}) = 815$ m. Since the elevation of most counties (103 out of 110) is lower than 815 m, model (13) suggests that the infant mortality rate increased as elevation increased. Model (13) serves as a baseline model for generating a new response and a local cluster term. We inserted a seven-county cluster $C$ in the baseline model with its center at Pingnan, a non-border county, and the new response variable $Y_i$ was then generated from

$$\log(\theta_i) = -4.160 + 2.07 \times 10^{-3} x - 1.27 \times 10^{-6} x^2 + \delta I_{i \in C}, \tag{14}$$

where $\theta_i = E(Y_i)/n_i$ and $I_{i \in C}$ is the indicator function, which is 1 if $i \in C$ and 0 otherwise. We used $\delta$ to measure the strength of the cluster, varying it from 0 to 0.20, that is, up to 0.20 more than the logarithm of the relative risk outside of the cluster. Since the model was exactly Poisson, there was no overdispersion in the simulated data.
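The arithmetic behind model (13) can be reproduced with a short Python check. The sketch below only restates the fitted coefficients; the function names are hypothetical, and the conversion of $\delta$ to a multiplicative relative risk via $\exp(\delta)$ follows from the log link in model (14) rather than from a statement in the text.

```python
import math

# Fitted coefficients of model (13).
B0 = -4.160
B1 = 2.07e-3
B2 = -1.27e-6

def predicted_rate_per_100k(elevation_m, delta=0.0):
    """Predicted infant mortality rate per 100,000 under model (14); delta = 0 gives model (13)."""
    log_theta = B0 + B1 * elevation_m + B2 * elevation_m ** 2 + delta
    return 100_000 * math.exp(log_theta)

def relative_risk(delta):
    """Multiplicative relative risk inside the cluster implied by the log link."""
    return math.exp(delta)

# Elevation at which the quadratic in model (13) peaks: -B1 / (2 * B2), about 815 m.
peak_elevation = -B1 / (2 * B2)
```

For example, `predicted_rate_per_100k(20)` and `predicted_rate_per_100k(815)` give the fitted rates at the lowest county elevation and at the turning point, and `relative_risk(0.175)` converts the largest $\delta$ in Table 1 to the multiplicative scale.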

Table 1
Percentage of cluster detectability (D) and location specificity (LS), based on 999 simulated data sets.

δ        $G^2_s$          $Z_s$            $Z_{s,o}$
         D      LS        D      LS        D      LS
0.0      6.0    0.0       5.8    0.0       4.3    0.0
0.025    7.1    8.5       6.8    10.3      6.8    14.7
0.05     14.9   39.6      14.9   38.9      12.6   42.1
0.075    37.4   65.2      37.9   66.2      39.4   63.5
0.1      71.6   82.4      72.3   82.2      73.5   81.9
0.125    93.9   90.2      94.3   90.7      95.0   90.3
0.15     99.1   96.5      99.2   96.3      99.3   96.7
0.175    100.0  99.3      100.0  99.3      100.0  99.3

In the simulation, we compared the type I error rates of $G^2_s$, $Z_s$ and $Z_{s,o}$ when $\delta = 0$, and their power functions based on 1000 runs for each selected $\delta$ (Table 1). We generated random Poisson counts, and calculated p-values for $G^2_s$, $Z_s$ and $Z_{s,o}$ based on 999 random replications of the null distribution from Monte Carlo hypothesis testing. Since a spatial cluster can be treated as a spatial association term or an explanatory variable in the loglinear model, we scanned all counties and searched for the spatial association term that yielded the largest likelihood ratio or Wald z-value. We fixed the significance level at 0.05. We evaluated the power or detectability of the inserted cluster by the percentage of p-values that were less than or equal to the significance level, that is, the rejection rate of the null hypothesis of no spatial clustering. In addition, since the cluster center is known, we assessed the specificity of a detected local cluster by the percentage of times that Pingnan was correctly located as the cluster center among the total significant p-values. The results are shown in Table 1, where D represents detectability or power, and LS represents location specificity, in percent, for Pingnan when the test is significant. When there was no overdispersion, the type I error rates and power functions of the three model-based spatial scan statistics were very consistent.
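The two summary measures in Table 1 can be computed from per-run results as sketched below. The function name, argument layout and example values are hypothetical; the sketch assumes each simulation run records one p-value and the center of the most likely cluster.

```python
def detectability_and_specificity(p_values, detected_centers, true_center, alpha=0.05):
    """Return (D, LS) in percent.

    D  = share of runs with p-value <= alpha.
    LS = share of those significant runs whose detected center is the true center.
    """
    significant = [c for p, c in zip(p_values, detected_centers) if p <= alpha]
    d = 100.0 * len(significant) / len(p_values)
    if not significant:
        return d, 0.0
    ls = 100.0 * sum(c == true_center for c in significant) / len(significant)
    return d, ls

# Hypothetical results from four simulation runs.
d, ls = detectability_and_specificity(
    p_values=[0.01, 0.20, 0.05, 0.03],
    detected_centers=["Pingnan", "A", "B", "Pingnan"],
    true_center="Pingnan",
)
```

Note that LS is conditional on significance, so a statistic can have low D and high LS at the same time.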
All three model-based scan statistics had an acceptable type I error probability: 6.0%, 5.8% and 4.3% for $G^2_s$, $Z_s$ and $Z_{s,o}$, respectively. As $\delta$, the cluster strength, increased, detectability increased rapidly while ecological covariates were accounted for. All three statistics were able to detect the cluster at a moderate cluster strength. When $\delta = 0.175$, that is, when the log relative risk inside the cluster was 0.175 higher than outside the cluster, the power of all three was 100%. The Wald-based spatial scan statistics also performed well in comparison to the likelihood ratio-based scan statistic for location-specific (LS) detection. All three found zero clusters at Pingnan when the cluster strength was zero. As the strength of the cluster increased, the rate of correctly identified locations also increased. When $\delta$ was 0.15 or higher, the correct cluster center was identified more than 95% of the time, and this rose to 99% when $\delta = 0.175$. To summarize, type I error probabilities and power functions for $G^2_s$, $Z_s$ and $Z_{s,o}$ were all comparable for a given cluster strength. In the presence of a strong cluster, all three were able to correctly identify the cluster center 99% of the time. Even though we did not evaluate the pure spatial process without any ecological covariates, we expect the same results based on the formulations for $G^2_s$ and $Z_s$; see, for example, pp. 37-40 in Davison and Hinkley (1997).

3.2. Case study

As mentioned earlier, Guangxi is a provincial-level autonomous region for the Zhuang ethnic minority. Higher infant mortality in the region is a sensitive political issue for the central government, which tries to improve living standards for all ethnic groups and nationalities.
It is already known that access to adequate primary care is an issue, and primary care is more available in lowland counties than in highland and mountainous counties. An interesting question is what other factors, in addition to access to care, might contribute to high infant mortality in the province. Since local socioeconomic and health care data were not available, we used average county elevation as a proxy for the lack of access. One of the coauthors was from the region, and our task was first to identify local pockets of counties while controlling for the elevation effect, and then to turn our findings over to provincial health officials for further health disparity analyses. We first fitted infant mortality rates with a loglinear model that includes a quadratic elevation term:

$$\log(\theta_i) = \beta_0 + \beta_1 x + \beta_2 x^2,$$

where $x$ represents the county-level elevation in meters. The estimated values were $\hat\beta_0 = -4.16$, $\hat\beta_1 = 2.07 \times 10^{-3}$ and $\hat\beta_2 = -1.27 \times 10^{-6}$, with standard errors $2.22 \times 10^{-2}$, $1.21 \times 10^{-4}$ and $1.23 \times 10^{-7}$. Since both the linear and quadratic terms of elevation were highly significant, we considered the following model for cluster detection:

$$\log(\theta_i) = \beta_0 + \beta_1 x + \beta_2 x^2 + \alpha I_{i \in C},$$

where $C \in \mathcal{C}$ and $\mathcal{C}$ is the collection of all candidates for a potential circular cluster. In the analysis, we also considered potential overdispersion by using both $Z_{s,o}$ and the model-based Pearson residual Moran's I, or $I_{PR}$ (Lin and Zhang, 2007). The original Moran's I is for a continuous variable, and it does not suffer from overdispersion. Since $I_{PR}$ is based on the same formulation, it is not sensitive to overdispersion either. Based on this property and the fact that $I_{PR}$ can be derived from the same loglinear model used for $G^2_s$ and $Z_s$, we can use $I_{PR}$ to check for the consistency of $Z_{s,o}$.

We used a stepwise scan method, searching for local clusters by treating a local cluster as an explanatory variable indexed by $w_i$. This method is identical to the spatial scan method for detecting the first cluster. However, if multiple clusters exist, our method of scanning for the secondary cluster might be slightly different from the spatial scan statistic. In SaTScan, the secondary cluster is searched for by considering the first cluster. In our method, when the first cluster is identified, its effect is taken out by the first cluster's spatial association term, and an additional cluster term is used to scan and test for the existence of an additional cluster. The stepwise scan stops when the test statistic is no longer significant. In both SaTScan and our stepwise search methods, there are no multiple testing problems.

Table 2
Results from the stepwise scan for infant mortality in Guangxi, based on 999 simulated data sets for the significance of $G^2_s$, $Z_s$ and $Z_{s,o}$.

φ̂      C       G^2_s             Z_s               Z_{s,o}           I_PR
                Value   p-value   Value   p-value   Value   p-value   Value   p-value
29.1    w_45    638     0.001     27.9    0.001     6.03    0.001     0.316   10^-7
21.3    w_51    196     0.001     14.3    0.001     3.17    0.048     0.138   0.012
20.2    w_81    81.4    0.001     9.39    0.001     2.13    0.641     0.062   0.226

In the scan process, the first cluster was signaled with a p-value of 0.001 or less for $G^2_s$, $Z_s$, $Z_{s,o}$ and $I_{PR}$ (Table 2). This cluster, centered at Fangcheng Qu ($w_{45}$), is close to the border with Vietnam, and the area had dilapidated infrastructure for years due to the 1979 Sino-Vietnam Border War. The second cluster was signaled with a p-value of 0.001 for $G^2_s$ and $Z_s$. The evidence from $I_{PR}$ and $Z_{s,o}$ was much weaker, with p-values between 0.01 and 0.05. Even though the second cluster, centered at Pubei Xian ($w_{51}$), is not far from the Vietnam border, most counties were not near the war zone.
However, due to the war and prewar refugees who flooded into China from Vietnam, the counties around Pubei Xian were a government-designated settlement area, with a large influx of ethnic Chinese refugees from Vietnam. It was likely that after 20 years some refugees had resettled elsewhere, but many of them, especially the poor, still lived on government-sponsored farms.

The signals for the potential third cluster were mixed. Even though the third cluster was signaled by the $G^2_s$ and $Z_s$ scan statistics, the results from $Z_{s,o}$ and $I_{PR}$ were not significant. The overdispersion parameters were greater than 20 in the first three models, signaling strong overdispersion. Since only $Z_{s,o}$ adjusts for overdispersion, and its result is consistent with that from $I_{PR}$, the third possible cluster was much less trustworthy. Based on $Z_{s,o}$, the stepwise spatial scan procedure was stopped, and we concluded that there were two clusters, centered at Fangcheng Qu and at Pubei Xian (Fig. 1). The final model, therefore, is

$$\log(\theta_i) = -4.52 + 3.24 \times 10^{-3} x - 2.13 \times 10^{-6} x^2 + 0.927 w_{45} + 0.365 w_{51}.$$

Based on the final model with two spatial association terms, $w_{45}$ and $w_{51}$, the parameter estimates for elevation and its quadratic term were $3.24 \times 10^{-3}$ and $-2.13 \times 10^{-6}$, respectively (with corresponding z-values 24.31 and 16.31). With the overdispersion estimate $\hat\phi = 20.16$, the adjusted z-values were 5.42 and 3.65, which were highly significant (with p-values less than 0.001). Based on these parameter estimates, we further examined mortality variation inside and outside of the clusters. The estimated relative risks of the first and second clusters were $e^{0.927} = 2.53$ and $e^{0.365} = 1.44$, respectively. Taking elevation into consideration, the mortality rates of the counties within the first and second clusters were 153% and 44% higher, respectively, than the expected value.
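The adjusted z-values and the cluster effects quoted above can be checked with a few lines of arithmetic. The Python sketch below assumes that the overdispersion adjustment divides each Wald z-value by the square root of the dispersion estimate, as in a quasi-Poisson fit. That rule is an assumption here, and it reproduces the reported 5.42 and 3.65 only to within rounding (about 5.41 and 3.63). The variable names are hypothetical.

```python
import math

# Values reported for the final model in the text.
phi_hat = 20.16
z_unadjusted = {"elevation": 24.31, "elevation_squared": 16.31}
cluster_coef = {"w45": 0.927, "w51": 0.365}

# Assumed quasi-Poisson style adjustment: z / sqrt(phi_hat).
z_adjusted = {name: z / math.sqrt(phi_hat) for name, z in z_unadjusted.items()}

# In a loglinear model, exponentiating a coefficient gives a multiplicative
# effect on the rate; the excess is that ratio minus one, in percent.
rate_ratio = {name: math.exp(b) for name, b in cluster_coef.items()}
percent_excess = {name: 100.0 * (r - 1.0) for name, r in rate_ratio.items()}
```

Under these assumptions `rate_ratio` rounds to 2.53 and 1.44, and `percent_excess` to 153 and 44, matching the figures in the text.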
Since the elevation that corresponds to the maximum predicted mortality rate was about 760 m, and there were only 12 counties above this level, the two detected clusters represent excess mortality net of a generally positive relationship between elevation and infant mortality.

4. Discussion and concluding remarks

Spatial analysts are often interested in factors that contribute to the detected clusters. Previously, researchers have resorted to a loosely coupled method that first models ecological covariates and then uses the expected values to perform the spatial scan statistic for cluster detection. In this article, we have directly extended the spatial scan statistic of Kulldorff to include ecological covariates, and have integrated the loosely coupled steps into a single model-based statistic, $G^2_s$. In the absence of ecological covariates, $G^2_s$ is equivalent to the spatial scan statistic. Both the simulation and the case study show that the proposed $G^2_s$ statistic has desirable properties, with acceptable type I error probabilities and power functions. Similar to disease mapping, the method can be used to display residual information while controlling for other effects, such as overdispersion. In addition, a spatially explicit cluster term can be parameterized for its shape and size. However, since $G^2_s$ is unable to account for overdispersion, we proposed model-based scan statistics based on the Wald tests, $Z_s$ and $Z_{s,o}$. Our initial evaluations suggest that $Z_s$ can be used in the same way as $G^2_s$, while $Z_{s,o}$ is a standard way to account for overdispersion. Our case study of infant mortality showed that a third cluster could be signaled if overdispersion were not considered, and the result from $Z_{s,o}$ signaled that the third one could be a false alarm. In this regard, the proposed Wald statistic $Z_{s,o}$ can complement the spatial scan statistic.
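The effect of the overdispersion adjustment on the stepwise stopping rule can be illustrated by replaying the p-values from Table 2. The Python sketch below is a skeleton only: `stepwise_scan`, `replay` and the `scan_step` callback are hypothetical names, and in a real analysis each step would refit the loglinear model with the accepted cluster terms and rescan all candidates.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A scan step receives the cluster terms accepted so far and returns the best
# new candidate with its p-value, or None when nothing is left to scan.
ScanStep = Callable[[List[str]], Optional[Tuple[str, float]]]


def stepwise_scan(scan_step: ScanStep, alpha: float = 0.05) -> List[str]:
    """Add cluster terms one at a time until the scan is no longer significant."""
    accepted: List[str] = []
    while True:
        result = scan_step(accepted)
        if result is None:
            break
        center, p_value = result
        if p_value > alpha:
            break
        accepted.append(center)
    return accepted


def replay(rows: Sequence[Tuple[str, float]]) -> ScanStep:
    """Build a scan step that replays recorded (center, p-value) rows in order."""
    def step(accepted: List[str]) -> Optional[Tuple[str, float]]:
        k = len(accepted)
        return rows[k] if k < len(rows) else None
    return step


# p-values from Table 2 for the unadjusted and the adjusted Wald statistics.
TABLE2_ZS = [("w45", 0.001), ("w51", 0.001), ("w81", 0.001)]
TABLE2_ZSO = [("w45", 0.001), ("w51", 0.048), ("w81", 0.641)]

clusters_zs = stepwise_scan(replay(TABLE2_ZS))    # keeps all three recorded clusters
clusters_zso = stepwise_scan(replay(TABLE2_ZSO))  # stops after the second cluster
```

With the unadjusted p-values the loop accepts every recorded cluster, whereas the adjusted p-values stop it after `w45` and `w51`, which mirrors the conclusion of the case study.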
Since the Wald-based spatial scan statistic does not need to compute the marginal likelihood function in GLMMs, and many approximate computing methods are available, such as quasi-likelihood and the GEE method initially proposed by Liang and Zeger (1986), the Wald-based spatial scan statistic can be implemented in many situations. It should be noted that, even though the Wald statistic in $Z_{s,o}$ is widely recognized in practice, it does not involve any spatial operations. If we can identify spatial overdispersion, a spatially explicit or implicit measure can be developed. Future work should explore various spatially explicit overdispersion measures and adjustment methods. In this regard, our proposed $Z_{s,o}$ provides an approach for further extending Wald-based statistics to a wide range of models and model-based test statistics.

The model-based spatial scan statistics provide more flexibility than the spatial scan statistic. Recently, considerable effort has been spent on refining different shapes for the spatial scan tests (Tango and Takahashi, 2005). The proposed model-based method is likely to be helpful in these efforts, as it can treat the detection of clusters with different shapes as a modeling process. Even though our proposed methods and test statistics are based on Kulldorff's spatial scan test, the idea can be used in other testing methods for spatial clusters for count data. As the Wald z-value of an unknown parameter can be easily derived under a variety of statistical models, the modifications of the Wald-based spatial statistic for local clusters can be applied to modeling spatial clusters and clustering in many generalized linear mixed effect models. Since the spatial scan statistic has been extensively used in many fields (Kulldorff, 2005), the flexibility of the model-based likelihood ratio-based and Wald-based spatial statistics will enable spatial analysts to account for ecological covariates, overdispersion and spatial random effects. By linking cluster detection to cluster mechanism discovery, we hope that the model-based scan statistics will find wide applications for Poisson count data.
Acknowledgements

We would like to thank the co-editor of the special issue, James LeSage, and the anonymous reviewers for their insightful comments and suggestions, which have improved the quality of the paper substantially. This research was funded by US National Science Foundation Grants SES-07-52657 (Zhang) and SES-07-52019 (Lin).

References

Agresti, A., 2002. Categorical Data Analysis. Wiley, New York.
Assuncao, R., Costa, M., Tavares, A., Ferreira, S., 2006. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine 25, 723-745.
Cliff, A.D., Ord, J.K., 1981. Spatial Processes: Models and Applications. Pion, London.
Davison, A.C., Hinkley, D.V., 1997. Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, UK.
Faraway, J., 2006. Extending the Linear Model with R. Chapman and Hall/CRC, Boca Raton, Florida.
Hossain, M.M., Lawson, A.B., 2006. Cluster detection diagnostics for small area health data: With reference to evaluation of local likelihood models. Statistics in Medicine 25, 771-786.
Kleinman, K., Abrams, A., Kulldorff, M., Platt, R., 2005. A model-adjusted space-time scan statistic with an application to syndromic surveillance. Epidemiology and Infection 133, 409-419.
Kulldorff, M., Nagarwalla, N., 1995. Spatial disease clusters: Detection and inference. Statistics in Medicine 14, 799-810.
Kulldorff, M., 1997. A spatial scan statistic. Communications in Statistics, Theory and Methods 26, 1481-1496.
Kulldorff, M., Tango, T., Park, P., 2003. Power comparisons for disease clustering tests. Computational Statistics and Data Analysis 42, 665-684.
Kulldorff, M., 2005. Scan statistics for geographical disease surveillance: An overview. In: Lawson, Kleinman (Eds.), Spatial and Syndromic Surveillance for Public Health. Wiley, New York, pp. 113-131.
Liang, K.Y., Zeger, S.L., 1986. Longitudinal data analysis using generalized linear models. Biometrika 73, 353-358.
Lin, G., Zhang, T., 2007. Loglinear residual tests of Moran's I autocorrelation and their applications to Kentucky Breast Cancer Data. Geographical Analysis 39, 293-310.
Loh, J., Zhu, Z., 2007. Accounting for spatial correlation in the scan statistic. Annals of Applied Statistics 2, 560-584.
McCullagh, P., 1983. Quasi-likelihood functions. Annals of Statistics 11, 59-67.
Moran, P.A.P., 1948. The interpretation of statistical maps. Journal of the Royal Statistical Society Series B 10, 243-251.
Naus, J.I., 1965. The distribution of the size of the maximum cluster of points on the line. Journal of the American Statistical Association 60, 532-538.
Openshaw, S., Charlton, M., Wymer, C., Craft, A.W., 1987. A mark I geographical analysis machine for the automated analysis of point data sets. International Journal of GIS 1, 335-358.
Recuenco, S., Eidson, M., Kulldorff, M., Johnson, G., Cherry, B., 2007. Spatial and temporal patterns of enzootic raccoon rabies adjusted for multiple covariates. International Journal of Health Geographics 6, 14.
Roche, L.S., Skinner, R., Weinstein, R.B., 2002. Use of a geographic information system to identify and characterize areas with high proportions of distant stage breast cancer. Journal of Public Health Management Practice 8, 26-32.
Song, C., Kulldorff, M., 2006. Likelihood based tests for spatial randomness. Statistics in Medicine 25, 825-839.
Tango, T., Takahashi, K., 2005. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, available on: http://www.ij-healthgeographics.com/content/pdf/1476-072x-4-11.pdf.
Wakefield, J.C., Kelsall, J.E., Morris, S.E., 2000. Clustering, cluster detection, and spatial variation in risk. In: Elliott, P., Wakefield, J.C., Best, N.G., Briggs, D.J. (Eds.), Spatial Epidemiology: Methods and Applications. Oxford University Press, Oxford, UK.