Cluster Detection Based on Spatial Associations and Iterated Residuals in Generalized Linear Mixed Models


Biometrics 65, June 2009
DOI: /j x

Tonglin Zhang¹ and Ge Lin²
¹ Department of Statistics, Purdue University, 250 North University Street, West Lafayette, Indiana, U.S.A.
² Department of Geology and Geography, West Virginia University, Morgantown, West Virginia, U.S.A.
tlzhang@stat.purdue.edu

Summary. Spatial clustering is commonly modeled by a Bayesian method under the framework of generalized linear mixed effect models (GLMMs). Spatial clusters are commonly detected by a frequentist method through hypothesis testing. In this article, we provide a frequentist method for assessing spatial properties of GLMMs. We propose a strategy that detects spatial clusters through parameter estimates of spatial associations, and assesses spatial aspects of model improvement through iterated residuals. Simulations and a case study show that the proposed method is able to consistently and efficiently detect the locations and magnitudes of spatial clusters.

Key words: Generalized linear models; Local clusters; Mixed effects; Moran's I statistic; Pearson residuals; Spatial heterogeneity.

1. Introduction

Spatial analyses of disease clusters are typically based on Poisson random variables under two separate approaches: Bayesian disease mapping and frequentist cluster detection. The disease-mapping approach, typically formulated under a Bayesian mixed effect model (Lawson and Clark, 2002), provides stable estimates for unit-specific relative risks while accounting for potential explanatory variables and extra dispersion due to spatial heterogeneity. This approach has the advantage of modeling an overall or global variation in disease rate over the entire study area while also capturing local variation. The frequentist approach tests for the existence of spatial clustering or clusters, but normally not under a mixed effect model.
Our interest is to use a frequentist method for cluster detection based on a spatial mixed effect model. In disease mapping, spatial generalized linear mixed effect models (GLMMs) are often specified in terms of relative risks for their fixed and random effects, where the relative risks are proportional to the expected incidence rates of Poisson random variables. These models are typically fitted by a Bayesian method that provides analytical results on ecological covariates and the expected relative risks. Although there are several frequentist methods for estimating GLMMs for time series data (Zeger, 1988; McCulloch, 1997), their use with spatial data is rare. Two exceptions highlight recent attempts to use frequentist methods for spatial GLMMs. One is the score test for global spatial correlation, which does not address local cluster detection (Jacqmin-Gadda et al., 1997). The other is the spatial logit or log-linear model for small-area surveillance specified by nonspatial random effects (Kleinman, 2005). In this article, we propose a cluster-detection strategy that combines the estimation of a GLMM with the identification of local clusters in the model selection process.

Most frequentist methods for spatial cluster detection ignore potential explanatory variables at the hypothesis testing stage. When the null hypothesis of no spatial clustering is rejected, most testing methods are able to identify local clusters in a more focused test. Afterward, researchers may begin to explore potential geographical or ecological explanatory variables that might contribute to the identified clusters. For example, after identifying two distant-stage breast cancer hot spots, Roche et al. further compared geographic factors between clustered and nonclustered areas and found that the two clusters tended to be linguistically isolated (Roche, Skinner, and Weinstein, 2002).
This suggests that a known risk factor might contribute to the observed pattern, but because the testing method separates cluster detection from the control of the risk factor, this cannot be inferred statistically. Our goal is to add the capability of incorporating explanatory variables into the spatial cluster-detection process within a frequentist approach, in a way that resembles the existing methods for disease mapping.

Because our proposed method is based on spatial GLMMs, its model-fitting process should include spatial effects. In the presence of random effects, model selection based on goodness of fit and information criteria, such as the Akaike information criterion, may not be able to capture extra information about autocorrelation. For example, a spatial logit association model in a GLM can include potential explanatory variables and identify high- and low-value clusters (Lin, 2003). This model, however, cannot deduce cluster information from the goodness-of-fit statistic and therefore cannot indicate spatial clustering effects. To search for a cluster in a model estimation process, it is crucial to retain extra information about spatial properties in the model improvement process (Baddeley et al., 2005). Recently, Loh and Zhu (2007) extended the spatial scan test (Kulldorff, 1997) by retaining spatial autocorrelation information for overdispersion, but the method is based on the spatial scan test rather than on a spatial GLMM. In both time series and spatial point processes, residuals have been used to assess model fitting and clustering effects (Zhuang, 2006). We extend these methods to lattice data by providing a theoretical basis for residual-based clustering tests, which in turn can be used to assess spatial properties in spatial GLMMs.

Taken together, we set out a spatial cluster surveillance framework that is able to confirm the existence of clusters while accounting for ecological covariates and potential global trends. For a spatial GLMM, we rely on (i) the local association parameter estimates to capture clusters, and (ii) Pearson residuals to check the remaining clustering tendency. Because the locations of potential spatial clusters are unknown a priori, we also introduce a spatial search algorithm to detect significant local associations. In the following section, we first set out the proposed cluster-detection method, and then evaluate it through Monte Carlo simulations in Section 3. In Section 4, we introduce a case study of colorectal cancer mortality in Indiana. Finally, we provide concluding remarks.

© 2008, The International Biometric Society

2. Statistical Methods

For a study area with $m$ units, let $y_i$ be the observed number of cases in unit $i$, a realization of the random variable $Y_i$, and let $E_i$ be the corresponding expected number of cases.
Assume that the event is rare such that $Y_i$ for $i = 1, \ldots, m$ are conditionally independently Poisson distributed with conditional expected value $\theta_i E_i$ given $\theta_i$; the unknown relative risk $\theta_i$ for unit $i$ can be set out in a spatial GLMM as:

$$\log(\theta_i) = x_i^t \beta + \sum_{k=1}^{K} \alpha_k d_{ki} + \epsilon_i. \tag{1}$$

Equation (1) is based on Gangnon and Clayton's (2003) work: $x_i^t \beta$ is a nonspatial component representing the overall relative risk across the study area adjusted by ecological covariates $x_i$; $d_{ki}$ is the local association term for cluster membership centered at unit $j_k$, with $d_{ki} = 1$ if unit $i$ belongs to the $k$th cluster and $d_{ki} = 0$ otherwise; $\alpha_1, \ldots, \alpha_K$ are unknown log relative risks (relative to $x_i^t \beta$) associated with each cluster; $K$ is the unknown number of clusters; and $\epsilon_i$ is a spatially unstructured random effect assumed to be independently and identically normally distributed.

Both $\sum_{k=1}^{K} \alpha_k d_{ki}$ and $\epsilon_i$ in equation (1) present spatial search and statistical estimation problems that need to be solved jointly. First, the coefficient $\alpha_k$ represents the strength of the $k$th local cluster captured by $d_{ki}$. A significantly positive coefficient indicates a high-value cluster or a hot spot. A significantly negative coefficient indicates a low-value cluster or a cool spot. If all of $\alpha_1, \ldots, \alpha_K$ are 0, the model becomes a spatially independent model. Because the number of potential clusters $K$ is unknown, it is necessary to develop a procedure to search for spatial clusters. Second, the random effect $\epsilon_i$ represents heterogeneity, which can be attributed to omitted covariates, measurement errors, and overdispersion. Although the use of $\epsilon_i$ is natural for many applications, the potential inflation of the variance poses a problem for model estimation (Agresti, 2002, p. 8). In the following, we first introduce the model estimation problem and then the cluster search problem.

Model estimation.
To estimate the parameters in model (1), it is necessary to compute its variance inflation. Given that $Y_i \mid \epsilon_i \sim \mathrm{Poisson}(\theta_i E_i)$, we have $E(Y_i \mid \epsilon_i) = \theta_i E_i$ and $V(Y_i \mid \epsilon_i) = \theta_i E_i$. Under the assumption that the $\epsilon_i$ are iid $N(0, \sigma^2)$, the $e^{\epsilon_i}$ are independently log-normally distributed with $E(e^{\epsilon_i}) = e^{\sigma^2/2}$ and $V(e^{\epsilon_i}) = e^{\sigma^2}(e^{\sigma^2} - 1)$. Let $\mu_i = E(Y_i)$. Then,

$$\mu_i = E[E(Y_i \mid \epsilon_i)] = E_i E(\theta_i) = E_i \exp\Big( x_i^t \beta + \sum_{k=1}^{K} \alpha_k d_{ki} + \sigma^2/2 \Big). \tag{2}$$

Let $v_i = V(Y_i)$. Then,

$$v_i = E[V(Y_i \mid \epsilon_i)] + V[E(Y_i \mid \epsilon_i)] = E(\theta_i E_i) + V(\theta_i E_i) = \mu_i \big[ 1 + \mu_i (e^{\sigma^2} - 1) \big]. \tag{3}$$

As $v_i$ only depends on $\mu_i$ and $\sigma^2$, we write $v_i = v(\mu_i; \sigma^2)$ as the general form of the marginal variance of $Y_i$. By comparing equations (2) and (3), we find that the variance is inflated because the marginal variance is a quadratic function of the marginal mean when $\sigma^2 > 0$.

Given potential clusters $C_1, \ldots, C_K$, we fit model (1) by a generalized estimating equation (GEE), treating $\sigma^2$ as a nuisance parameter (Zeger, 1988). Let $\alpha = (\alpha_1, \ldots, \alpha_K)^t$ and let $\nabla\mu_i = (\partial\mu_i/\partial\beta^t, \partial\mu_i/\partial\alpha^t)^t$ be the column vector of the first-order partial derivatives with respect to all the unknown parameters. For a given $\sigma^2$, we obtain the quasilikelihood parameter estimates from the quasiscore equation:

$$\sum_{i=1}^{m} (\nabla\mu_i)\, v^{-1}(\mu_i; \sigma^2)\, (Y_i - \mu_i) = 0, \tag{4}$$

where $\mu_i$ is given by equation (2) and $v^{-1}(\mu_i; \sigma^2) = \{ \mu_i [ 1 + \mu_i (e^{\sigma^2} - 1) ] \}^{-1}$. We use an iterative algorithm to estimate both $\mu_i$ and $\sigma^2$, because $\sigma^2$ is also a parameter and equation (4) can only estimate $\mu_i$ for a given $\sigma^2$. We first set $\sigma^2 = 0$ and estimate $\hat\mu_i$ by equation (4), and then estimate $\sigma^2$ by moment estimation as:

$$\hat\sigma^2 = \frac{1}{m-1} \sum_{i=1}^{m} \log s_i, \tag{5}$$

where $s_i = \max\big( [(Y_i - \hat\mu_i)^2 - \hat\mu_i] / \hat\mu_i^2 + 1,\; 1 \big)$. This follows from equation (3), which gives $E[(Y_i - \mu_i)^2 - \mu_i]/\mu_i^2 + 1 = e^{\sigma^2}$. The estimates of $\mu_i$ and $\sigma^2$ are updated iteratively until convergence, yielding the estimates $\hat\mu_i$ and $\hat\sigma^2$ of $\mu_i$ and $\sigma^2$. The marginal variance, therefore, can be estimated by

$$\hat v_i = \hat\mu_i \big[ 1 + \hat\mu_i (e^{\hat\sigma^2} - 1) \big]. \tag{6}$$

Because it is difficult to obtain the maximum likelihood estimate $(\hat\beta, \hat\alpha)$ of $(\beta, \alpha)$ and the corresponding estimate of the variance-covariance matrix, we opt to calculate the Wald $z$-score values of clusters $C_1, \ldots, C_K$ as $z_{\hat\alpha_i, C_i} = \hat\alpha_i / \hat\sigma_{\hat\alpha_i}$, where $\hat\sigma_{\hat\alpha_i}$ is the estimate of the standard deviation of $\hat\alpha_i$. These $z$-values are used below to determine the significance of a cluster.

Search for local associations. As mentioned earlier, the number and location of clusters are unknown, and we need to develop a spatial search procedure to identify significant local association terms. For simplicity, we follow the classical way of defining a cluster at location $i$ as the unit together with all its neighbors; we then search over all possible $i$ in the study area. Although the analysis of residual clustering is applicable to several spatial test statistics, we demonstrate it with Moran's I (Moran, 1948) because of its popularity and wide availability. Let $z_i$ be the variable of interest in unit $i$. Moran's I statistic is given by

$$I = \frac{1}{S_0 b_2} \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} w_{ij} (z_i - \bar z)(z_j - \bar z), \tag{7}$$

where $S_0 = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} w_{ij}$ ($w_{ij} = 1$ if two units are adjacent, 0 otherwise), $\bar z = \sum_{i=1}^{m} z_i / m$, and $b_k = \sum_{i=1}^{m} (z_i - \bar z)^k / m$. For count data, $z_i$ is often taken as the observed relative risk, $z_i = y_i / E_i$. Moran's I usually ranges between $-1$ and 1: a coefficient close to 1 indicates neighborhood similarity or clustering, a coefficient close to $-1$ indicates neighborhood dissimilarity, and a coefficient close to 0 indicates spatial randomness or independence.

The p-value of Moran's I statistic is calculated under the random permutation test scheme. Let $E_R(\cdot)$ and $V_R(\cdot)$ be the expected value and variance under a random permutation. Formulae for the moments of $I$ (Cliff and Ord, 1981) are given by:

$$E_R(I) = -\frac{1}{m-1}, \tag{8}$$

and

$$E_R(I^2) = \frac{S_1 (m b_2^2 - b_4)}{S_0^2 b_2^2 (m-1)} + \frac{(S_2 - 2 S_1)(2 b_4 - m b_2^2)}{S_0^2 b_2^2 (m-1)(m-2)} + \frac{(S_0^2 - S_2 + S_1)(3 m b_2^2 - 6 b_4)}{S_0^2 b_2^2 (m-1)(m-2)(m-3)}, \tag{9}$$

where $S_1 = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} (w_{ij} + w_{ji})^2 / 2$ and $S_2 = \sum_{i=1}^{m} \big[ \sum_{j=1, j \neq i}^{m} (w_{ij} + w_{ji}) \big]^2$. The variance of $I$ is calculated by $V_R(I) = E_R(I^2) - E_R^2(I)$.
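As an illustration of equations (7) to (9), the following Python sketch computes Moran's I, its permutation mean and variance, and a two-sided normal-approximation p-value. The function name `moran_i`, the dense NumPy weight matrix, and the use of `math.erfc` for the normal tail are assumptions made for this example; they are not taken from the paper.

```python
import math
import numpy as np


def moran_i(z, W):
    """Moran's I with permutation moments (hypothetical helper).

    z : values of interest, one per unit (length m, m >= 4)
    W : m x m spatial weight matrix; the diagonal is ignored
    Returns (I, E_R(I), V_R(I), two-sided p-value).
    """
    z = np.asarray(z, dtype=float)
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)              # the sums in (7) exclude j == i
    m = z.size
    u = z - z.mean()
    b2 = np.mean(u ** 2)                  # b_k = sum (z_i - zbar)^k / m
    b4 = np.mean(u ** 4)
    S0 = W.sum()
    S1 = 0.5 * ((W + W.T) ** 2).sum()
    S2 = ((W.sum(axis=1) + W.sum(axis=0)) ** 2).sum()

    I = (u @ W @ u) / (S0 * b2)           # equation (7)
    EI = -1.0 / (m - 1)                   # equation (8)
    d = S0 ** 2 * b2 ** 2
    EI2 = (S1 * (m * b2 ** 2 - b4) / (d * (m - 1))
           + (S2 - 2 * S1) * (2 * b4 - m * b2 ** 2) / (d * (m - 1) * (m - 2))
           + (S0 ** 2 - S2 + S1) * (3 * m * b2 ** 2 - 6 * b4)
           / (d * (m - 1) * (m - 2) * (m - 3)))   # equation (9)
    VI = EI2 - EI ** 2
    zscore = (I - EI) / math.sqrt(VI)
    p_value = math.erfc(abs(zscore) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return I, EI, VI, p_value
```

Because the moments are exact under random permutation, they can be checked on a small lattice by enumerating every permutation of `z` and comparing the empirical mean and variance of `I` with the returned `EI` and `VI`.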
Assuming $I$ is approximately normally distributed, its p-value is calculated by a two-sided $z$-test given by $2[1 - \Phi(|I - E_R(I)| / \sqrt{V_R(I)})]$. If the p-value is less than the significance level, then the null hypothesis of spatial independence is rejected and spatial dependence is concluded.

Moran's I can be derived either from the observed values or from the model residuals, and it has been suggested that the Pearson residual Moran's I, denoted by $I_{PR}$, is able to account for population heterogeneity and variance inflation (Waller and Gotway, 2004; Lin and Zhang, 2007). It can be seen from equation (7) that Moran's I does not suffer from overdispersion. Thus, $\epsilon_i$ in model (1) is likely to capture unstructured random effects. We provide the theoretical basis for $I_{PR}$ in the Web-based Supplementary Materials at the Biometrics website, and state a theorem below for $I_{PR}$ in reference to equation (1).

Theorem 1: Suppose that $Y_i$ for $i = 1, \ldots, m$ are independent random variables with expected value $\mu_{i,\theta}$ and variance $\sigma^2_{i,\theta}$, where $\theta$ is a vector of unknown parameters. Assume that $\theta$ is $\sqrt{m}$-consistently estimated by $\hat\theta = \hat\theta(Y_1, Y_2, \ldots, Y_m)$. Define $I_{PR}$ as the Moran's I statistic obtained by taking $z_i = (y_i - \mu_{i,\hat\theta}) / \sigma_{i,\hat\theta}$ in equation (7). Then, $[I_{PR} - E_R(I_{PR})] / \sqrt{V_R(I_{PR})} \xrightarrow{L} N(0, 1)$ as $m \to \infty$, where $\xrightarrow{L}$ represents convergence in law or convergence in distribution.

To explicitly account for an independent and identically distributed random effect in model (1), we can take $z_i = (y_i - \hat\mu_i) / \sqrt{\hat v_i}$ as the $i$th residual, so that $I_{PR}$ becomes an adjusted Pearson residual Moran's I, or $I_{aPR}$, to which the random permutation test can also be applied.

Corollary 1: Suppose that $\beta$ and $\alpha$ are $\sqrt{m}$-consistently estimated by $\hat\beta$ and $\hat\alpha$ as $m \to \infty$. Then, $\hat\sigma^2$ given by equation (5) is also $\sqrt{m}$-consistent. Define $I_{aPR}$ as the Moran's I statistic obtained by taking $z_i = (y_i - \hat\mu_i) / \sqrt{\hat v_i}$ in equation (7), where $\hat v_i$ is given by equation (6). Then, $[I_{aPR} - E_R(I_{aPR})] / \sqrt{V_R(I_{aPR})} \xrightarrow{L} N(0, 1)$ as $m \to \infty$.

Search algorithm.
To search for potential clusters, we take advantage of the global indicator $I_{aPR}$ and the local indicator $d_{ki}$, and use them to evaluate the location and strength of a local cluster based on model (1). Suppose that $C$ is a cluster in the study area. A spatial association term $d_{ki}$ then assumes that $\theta_i = \theta_c$ if $i \in C$ and $\theta_i = \theta_0$ if $i \notin C$ in the estimation process. Our task is to search for significant local associations by placing $d_{ki}$ terms based on the Wald $z$-test, and then to check for residual clustering based on $I_{aPR}$. We use the Wald $z$-test rather than the likelihood ratio test because the likelihood function is computationally complicated in the presence of random effects (Agresti, 2002, p. 521).

For simplicity, we describe the procedure by dropping ecological covariates in model (1). The procedure starts with the test of significance of $I_{aPR}$ for model (1) without any local association term. If the p-value of $I_{aPR}$ is not significant for spatial clustering, then there is no need to search for potential clusters. If $I_{aPR}$ is significant, then a stepwise search algorithm is invoked. The algorithm first calculates the $z$-values of the $\alpha_1$ coefficients for all possible $C$, and retains the largest significant $z_{\alpha_1}$ value for the first spatial association, together with its parameter estimate $\beta_1$ and the estimated $\sigma^2$. It then reevaluates the p-value of $I_{aPR}$ for residual clustering. If the statistic is no longer significant, the algorithm stops, indicating that the cluster centered at $j_1$ and quantified by $\beta_1$ accounts for the significant clustering effect. Otherwise, it starts to search for a second cluster by repeating the above steps in a second round of iteration. Parameter estimates for the previously retained spatial association terms are updated together with $\sigma^2$ in the estimation process. The algorithm stops when $I_{aPR}$ is not significant based on the $k$th iterated residuals.
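Each round of the search refits model (1) and recomputes Wald $z$-values and adjusted Pearson residuals. The estimation step in equations (2) to (6) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `fit_marginal_model` is hypothetical, Fisher scoring is one possible way to solve the quasiscore equation (4), the standard errors use the model-based covariance $(D^t V^{-1} D)^{-1}$ (the text does not say how $\hat\sigma_{\hat\alpha_i}$ is obtained), and the term $\sigma^2/2$ in equation (2) is absorbed into the intercept.

```python
import numpy as np


def fit_marginal_model(y, E, X, max_outer=100, max_inner=100, tol=1e-8):
    """Alternate between the quasiscore equation (4) and the moment
    estimate (5). Hypothetical helper, not the authors' code.

    y : observed counts; E : expected counts
    X : design matrix whose first column is the intercept and whose
        remaining columns are covariates and cluster indicators d_ki
    Returns (coef, sigma2, wald_z, adjusted_pearson_residuals).
    """
    y = np.asarray(y, dtype=float)
    E = np.asarray(E, dtype=float)
    X = np.asarray(X, dtype=float)
    m, p = X.shape
    coef = np.zeros(p)
    coef[0] = np.log(y.sum() / E.sum())   # crude start for the intercept
    sigma2 = 0.0                          # the text starts from sigma^2 = 0

    for _ in range(max_outer):
        # Fisher scoring for equation (4) at the current sigma^2.
        for _ in range(max_inner):
            mu = E * np.exp(X @ coef)
            v = mu * (1.0 + mu * np.expm1(sigma2))     # equation (3)
            D = mu[:, None] * X                        # d mu_i / d coef
            info = D.T @ (D / v[:, None])
            step = np.linalg.solve(info, D.T @ ((y - mu) / v))
            coef = coef + step
            if np.max(np.abs(step)) < tol:
                break
        # Moment update of sigma^2, equation (5).
        mu = E * np.exp(X @ coef)
        s = np.maximum(((y - mu) ** 2 - mu) / mu ** 2 + 1.0, 1.0)
        new_sigma2 = np.log(s).sum() / (m - 1)
        converged = abs(new_sigma2 - sigma2) < tol
        sigma2 = new_sigma2
        if converged:
            break

    mu = E * np.exp(X @ coef)
    v = mu * (1.0 + mu * np.expm1(sigma2))             # equation (6)
    D = mu[:, None] * X
    cov = np.linalg.inv(D.T @ (D / v[:, None]))        # assumed model-based covariance
    wald_z = coef / np.sqrt(np.diag(cov))
    residuals = (y - mu) / np.sqrt(v)                  # inputs to I_aPR
    return coef, sigma2, wald_z, residuals
```

In a search round, one would call this once per candidate cluster (adding its indicator column to `X`), keep the candidate with the largest significant Wald $z$-value, and pass the returned residuals to a Moran's I routine to decide whether to continue.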
Because each association term accounts for the effect of the corresponding cluster, the procedure does not suffer from the multiple testing problem. The model as a whole takes account of spatial random effects when $\sigma^2 > 0$.

3. Simulation

In this section, we evaluate the proposed $I_{aPR}$ test and the search procedure via simulated data. We used the 92 counties in Indiana in the United States as the template and assigned the at-risk population of each county according to the 2000 Census of Population, which ranges from 5623 to 860,454. We generated local cluster effects from no cluster to two clusters, and fixed the centers of the two clusters at the outset (Figure 1). The first, denoted by $C_1$, was centered upon Tippecanoe county in the west and also included its seven adjacent counties (Montgomery, Fountain, Warren, Clinton, Carroll, White, and Benton). The second cluster, denoted by $C_2$, was centered upon Noble county in northeastern Indiana and included its seven adjacent counties (De Kalb, Kosciusko, Whitley, Allen, Elkhart, Lagrange, and Steuben).

Figure 1. County population sizes in 2000 in two-cluster simulations.

In all evaluations, we fixed the significance level at [...]. We first generated the random effects $\epsilon_i$ independently and identically from $N(0, \sigma^2)$; we then generated the Poisson random counts $Y_i$ conditionally independently with conditional mean $\theta_i n_i e^{\epsilon_i}$ for $i = 1, \ldots, 92$, where

$$\theta_i = \begin{cases} 0.001 (1 + \delta_1 d_{1i} + \delta_2 d_{2i}), & \text{when } i \in C_1 \text{ or } i \in C_2, \text{ respectively}, \\ 0.001, & \text{when } i \notin C_1, C_2, \end{cases} \tag{10}$$

where $n_i$ is the population size of the $i$th county, and $d_{1i}$ is 1 if county $i$ is in cluster $C_1$ and 0 otherwise. Likewise, $d_{2i}$ is 1 if county $i$ is in cluster $C_2$ and 0 otherwise. The quantities $\delta_1$ and $\delta_2$ represent the strength and sign of the clusters. When $\delta_1 = \delta_2 = 0$, there is no clustering. Otherwise, there is at least one cluster. A positive $\delta$ value indicates a hot spot, and a negative $\delta$ value indicates a cool spot. We let $\delta$ change from 0 to 1, so that the relative risk within a cluster ranged from 1 to twice the mean. We ran 10,000 Monte Carlo simulations for each combination of $\sigma$, $\delta_1$, and $\delta_2$ values.

In addition to assessing the proposed method against simulated patterns, we also compared it with Kulldorff's spatial scan statistic. Kulldorff's scan statistic is based on the likelihood ratio test for the null hypothesis of equal rates among all units versus the alternative of higher or lower rates inside the cluster (Kulldorff, 1997). Suppose that $C \in \mathcal{C}$ is a candidate cluster and assume $\theta_i = \theta_c$ if $i \in C$ and $\theta_i = \theta_0$ if $i \notin C$. Under the null hypothesis $\theta_c = \theta_0$ and the alternative hypothesis $\theta_c > \theta_0$, the likelihood function is

$$LR(C, \theta_c, \theta_0) = \Big[ \prod_{i \in C} \frac{(\theta_c E_i)^{y_i}}{y_i!} e^{-\theta_c E_i} \Big] \Big[ \prod_{i \notin C} \frac{(\theta_0 E_i)^{y_i}}{y_i!} e^{-\theta_0 E_i} \Big]. \tag{11}$$

The spatial scan statistic is

$$\lambda = \max_{C \in \mathcal{C}} L(C) = \max_{C \in \mathcal{C}} \frac{\sup_{\theta_c > \theta_0} LR(C, \theta_c, \theta_0)}{\sup_{\theta_c = \theta_0} LR(C, \theta_c, \theta_0)},$$

where $\mathcal{C}$ is the class of all possible candidate local clusters. Once the value of the test statistic has been calculated by searching over all possible candidates, a p-value for the cluster $C$ with the maximum value of the likelihood ratio test statistic is obtained by comparing the observed value of the scan statistic $\lambda$ with its distribution under the null hypothesis. Because the exact distribution of the test statistic $\lambda$ cannot be determined analytically, it is approximated by Monte Carlo simulations. The spatial scan statistic uses a circular window and its variants to detect excessive events within the window against the rest of the study area. It is a consistent and powerful test for detecting at least one cool or hot spot. Since its introduction into the field of disease cluster detection, the spatial scan statistic has quickly become a standard method for geographic disease surveillance.

In the simulation, we compared the type I error rates and the powers of the two methods. Ideally, we would want to include spatial random effects throughout.
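For a single candidate cluster, the likelihood ratio built from equation (11) has a closed form, because the suprema are attained at $\hat\theta_c = y_C / E_C$, $\hat\theta_0 = y_{out} / E_{out}$ and, under the null hypothesis, $\hat\theta = y_{all} / E_{all}$, so that the exponential terms cancel. The Python sketch below illustrates this for the one-sided hot-spot alternative $\theta_c > \theta_0$, with candidates defined as each unit plus its adjacent units. The function names are hypothetical, and the Monte Carlo step that turns $\lambda$ into a p-value is omitted.

```python
import math


def _ylog(y, e):
    """y * log(y / e), with the convention 0 * log(0) = 0."""
    return y * math.log(y / e) if y > 0 else 0.0


def scan_log_lr(y, E, cluster):
    """Log likelihood ratio for one candidate cluster under the
    alternative theta_c > theta_0 (hypothetical helper)."""
    inside = set(cluster)
    y_in = sum(y[i] for i in inside)
    e_in = sum(E[i] for i in inside)
    y_all, e_all = sum(y), sum(E)
    y_out, e_out = y_all - y_in, e_all - e_in
    # No evidence for a hot spot unless the inside rate exceeds the outside rate.
    if e_out <= 0 or y_in / e_in <= y_out / e_out:
        return 0.0
    return _ylog(y_in, e_in) + _ylog(y_out, e_out) - _ylog(y_all, e_all)


def neighbourhood_candidates(adjacency):
    """One candidate per unit: the unit together with its adjacent units."""
    return [frozenset([i] + [j for j, a in enumerate(row) if a and j != i])
            for i, row in enumerate(adjacency)]


def scan_statistic(y, E, candidates):
    """Return the candidate with the largest log likelihood ratio and its value."""
    best = max(candidates, key=lambda c: scan_log_lr(y, E, c))
    return best, scan_log_lr(y, E, best)
```

For example, with four units on a line, expected counts of 10 each and observed counts `[20, 20, 5, 5]`, the candidate `{0, 1}` maximizes the log likelihood ratio.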
However, we dropped the spatial scan statistic from the power comparison when $\sigma^2 > 0$, because it cannot account for spatial random effects. The power comparison is therefore limited to one cluster without spatial heterogeneity (i.e., when $\sigma = 0$), the setting in which the spatial scan statistic is most effective. In this situation, we used $I_{PR}$ for the comparison, because in the absence of a random effect, $I_{aPR}$ reduces to the regular $I_{PR}$. In all comparisons, we fixed the window size according to the spatial adjacency matrix, so that it remained consistent with our design for local clusters. We calculated the p-value of the spatial scan statistic based on 999 random replications of the null distribution for Monte Carlo hypothesis testing.

The results show that $I_{aPR}$ had a consistent type I error probability in the presence of spatial random effects (Table 1). As $\sigma$ increased from 0 to 1.0, the rejection rates hovered between [...] and [...]. The spatial scan statistic, on the other hand, only had an acceptable type I error probability when $\sigma = 0$. As $\sigma$ increased, the rejection rates increased rapidly, which would result in false alarms.

Table 1. Type I error probabilities (when $\delta_1 = \delta_2 = 0$) of $I_{aPR}$ and Kulldorff's scan statistic
$\sigma$    Scan    $I_{aPR}$
[...]

In the limited power comparison, the scan statistic was more powerful than our permutation method for $I_{PR}$. When the cluster strength was moderate ($\delta_1 = 0.2$, or 20% above the mean), the spatial scan statistic had a 52.5% rejection rate, whereas $I_{PR}$ had 17.7%. When the cluster strength became stronger, the rejection rates from the scan statistic were still higher than the rates from $I_{PR}$. When the $\delta_1$ values were 0.4 and 0.5, the corresponding rejection rates were 99.8% and 100% for the spatial scan test, as opposed to 75.8% and 94.3% for $I_{PR}$. These results were expected because the likelihood ratio test in the spatial scan statistic is more powerful than the $I_{PR}$ permutation test.

Because the spatial scan statistic can neither straightforwardly incorporate ecological covariates nor account for spatial random effects, in the remaining simulation we only assess our iterative residual method with regard to the effectiveness of the spatial association terms and the search procedure. We display the rejection rates of $I_{aPR}$ with the inclusion of no association term, one local association term for the Noble county cluster, and two local association terms for both clusters (Figure 2). To simplify this part of the simulations, we set $\delta_1 = \delta_2 = \delta$ in expression (10) as they increased from 0 to 1. Because the results in the presence of spatial heterogeneity were similar to those without spatial heterogeneity, we focus the discussion on the case when $\sigma = 0.0$. Figure 2 shows that rejection rates remained consistently around 0.05 in the absence of a clustering effect (lower-left corner). When $\delta > 0.3$, a moderate cluster effect, $I_{aPR}$ had a very high rejection rate. This suggests that $I_{aPR}$ was able to detect the existence of spatial clustering in the absence of local association terms. This is true even when the cluster centered upon Noble county was accounted for by one association term.
In other words, if one cluster is sufficient to cause a significant clustering tendency, its effect remains even though another one is accounted for. When both clusters were accounted for by two corresponding association terms, the rejection rates remained consistently around 5%, which resembled the case when $\delta = 0$ for type I error rates. These results suggest that $I_{aPR}$ was likely to accept the hypothesis of no clustering once the two spatial association terms absorbed the existing cluster tendency.

Figure 2. Significance of $I_{aPR}$ with zero, one, or two spatial associations.

The second part of the simulations was intended to evaluate the effectiveness of the search procedure by how often it correctly identified $C_1$ or $C_2$, centered at Tippecanoe and Noble counties. We also recorded the rates at which the search procedure identified these two clusters but not centered exactly at Tippecanoe or Noble county, which we denote by $C_{N1}$ or $C_{N2}$, respectively. As before, we used $\delta_1$ or $\delta_2$ values to set the strength of a cluster from 0 to 0.9. Table 2 displays the proportions of cluster centers that were correctly identified ($C_1$ or $C_2$) and somewhat correctly identified ($C_{N1}$ or $C_{N2}$). With a strong cluster ($\delta_1 = 0.9$ or $\delta_2 = 0.9$), the search procedure was effective. When both clusters were weak, the procedure was, as expected, unable to pick up any cluster locations. When at least one strong cluster was present ($\delta_1 = 0.9$ or $\delta_2 = 0.9$), the procedure was able to pick up the corresponding location precisely more than 99% of the time. When both clusters were moderately strong ($\delta_1 = \delta_2 = 0.6$), the procedure was likely to find the cluster at Noble county first and more often, because with its larger population (740,873, versus 301,676 for the cluster at Tippecanoe county) it contributed greater deviance. These results suggest that the iterative cluster-detection procedure can consistently identify the locations of spatial clusters. In the absence of local clusters, $I_{aPR}$ had an acceptable level of misidentified cluster locations that corresponded to its type I error probabilities.

Table 2. Proportion of correctly identified cluster centers in two-cluster simulations
($\delta_1$, $\delta_2$)    $C_1$    $C_{N1}$    $C_2$    $C_{N2}$
(0.0, 0.0)    0.0009    0.0051    0.0010    0.0039
(0.3, 0.3)    [...]
(0.3, 0.6)    [...]
(0.3, 0.9)    [...]
(0.6, 0.3)    [...]
(0.6, 0.6)    [...]
(0.6, 0.9)    [...]
(0.9, 0.3)    [...]
(0.9, 0.6)    [...]
(0.9, 0.9)    [...]

4. Case Study

In this section, we provide a case study of county-level colorectal cancer mortality in Indiana. Colorectal cancer mortality has declined steadily in Indiana in the past decade. The male rates declined sharply from an annual rate of 28.1 per 100,000 in the [...] period to 21.1 per 100,000 during the [...] period, while the female rates experienced a moderate decline from an annual rate of 19.9 per 100,000 in the [...] period to 17.5 per 100,000 during the [...] period. The sharp decline in male colorectal cancer mortality was attributable to an increase in colorectal cancer screening rates; screening can detect colorectal cancer at an early stage. In 1996 and 1997, the male screening rate for those age 50+ was 38%; in 2004 the rate increased to 50%. To examine regional disparity with potential implications for improvement in screening, we obtained the mortality data for males in the [...] period as well as at-risk populations from the Census Bureau's 2002 population estimates. The state average for the 5-year period for males was 0.11%, with the lowest rate of 0.056% in White county and the highest of 0.258% in Warren county (see Figure 3).
We used $I_{aPR}$ first to test for spatial clustering, and then to identify a local cluster if the global clustering effect was significant. Because no single existing test operates at both the global and local levels, we compared our results with two other tests: the empirical Bayesian index (EBI) at the global level, and the spatial scan statistic at the local level. EBI is a population-adjusted Moran's I test proposed by Assuncao and Reis (1999), and it can consistently detect spatial clustering in the presence of population heterogeneity. EBI, however, cannot include ecological covariates or spatial association terms. For this reason, we use the spatial scan statistic to check the consistency of the location detected by our search procedure. To closely mirror our proposed iterative framework, upon detection of the first cluster, we introduced a spatial association term corresponding to the clustered area into the spatial scan statistic, and further compared the results from the two methods.

Figure 3. Male colorectal cancer mortality per 100,000 in Indiana: 2000 to [...].

The results from $I_{aPR}$ show significant spatial clustering in the null model, which does not include any spatial association terms or ecological covariates. The $I_{aPR}$ value and its p-value were [...] and 0.010, with $\hat\sigma$ = [...] and $\hat\sigma = 0.2001$, which were close to the EBI value of [...] with a p-value of [...]. Based on this initial result, we used the iterative residual search algorithm and found a cool spot centered at Marion county. The coefficient for the Marion county cluster was [...], or an expected relative risk of [...]. Once this cluster was accounted for, the p-value for $I_{aPR}$ became [...], which is not significant. Marion county is home to the state capital, Indianapolis, with five large hospitals right at the city center. The cluster had an average mortality rate of 0.084% as opposed to 0.123% for the rest of the state.
To assess whether ecological covariates or age covariates could explain the clustering effect, we collected a number of county-level variables, such as primary care physicians, hospital beds, colorectal cancer screening rate, and broad age groups (<45, 45-64, 65+). We made several contrasts between clustered and nonclustered areas. We found that the screening rate, which is defined as the proportion of adults age 50 and older who have ever had a sigmoidoscopy or colonoscopy, or who have had a blood stool test within the past 2 years, had the sharpest contrast: 66.3% within the cluster and 54.2% outside. Although other variables may reduce overall deviance, the screening rate was the only one associated with the detected clustering effect. Once we included the screening rate without any spatial association terms, the p-value for $I_{aPR}$ became [...] with $\hat\sigma = 0.2164$, which suggested no clustering. In other words, we could explain the lower level of mortality near the capital counties by the high screening rate.

Finally, we ran the spatial scan statistic based on the same spatial weight matrix. The result for the first cluster was identical to the cool spot detected by the iterative residual method, with a p-value of 0.001 and a $\lambda$ value of [...]. The consistent results for the first cluster show that in the pure spatial cluster test, both $I_{aPR}$ and the spatial scan statistic were able to consistently detect the existence and the location of a cluster. Because it was necessary to include the same spatial association term in the spatial scan statistic as in model (1), we added it as a covariate to check whether it would similarly remove the detected cluster and clustering effects. The result was mixed. While the detected cool spot vanished from the first cluster, an excessive mortality cluster emerged around Newton county in the northwest of the state, with a p-value of 0.004 and a $\lambda$ value of [...]. This inconsistency could relate to the greater power of the spatial scan statistic, or to spatial random effects that were not accounted for by the spatial scan statistic.
Even when we used geographic coordinates based on the SaTScan default setting, we obtained the identical second cluster centered upon Newton county, while the first cluster, the cool spot, was larger than the geographic neighbor-based cluster. These results were consistent when we used different circle sizes by changing the at-risk population threshold from 30% to 50%.

5. Concluding Remarks

In this article, we have set out a frequentist framework for a spatial GLMM that combines local cluster indicators, or spatial association terms, with the residual-based global indicator of clustering, $I_{aPR}$. Previously, spatial GLMMs have been fitted predominantly by a Bayesian disease-mapping method. We proved the validity of $I_{aPR}$ based on the asymptotic properties of Pearson residuals. We found that it has a consistent type I error probability in the presence of spatial random effects and population heterogeneity. The power of $I_{aPR}$ is lower than that of the spatial scan statistic, but still satisfactory in the presence and absence of spatial random effects.

The use of $I_{aPR}$ in the model improvement process for a spatial GLMM is iterative. It is based on (i) the GEE method for estimating and assessing the detected spatial associations and associated goodness-of-fit statistics, and (ii) the evaluation of $I_{aPR}$ for spatial clustering in the iterative search for spatial association terms. The coefficient of a local indicator can gauge the effect of a hot spot or a cool spot, and multiple clusters derived from the model-based test do not suffer from the multiple testing problem. Because the model can include potential ecological covariates, known risk factors can also be incorporated into the hypothesis testing process for cluster detection.
This modeling strategy can be extended to cases where the detected spatial dependence may suggest the type of exposure that is partially responsible for spatial variation, because both geographic covariates and spatial association terms can be treated as explanatory variables in the modeling improvement process, and they provide different statistical inferences. A spatial association term can identify a location-specific cluster, which in turn can provide clues for identifying ecological covariates, such as the screening rate used in the case study. An ecological covariate, if used effectively, can help the design of intervention. In the case study, we know not only the inverse relationship between the screening rate and colorectal cancer mortality, but also the screening levels and mortality rates within and outside the cluster. These descriptive statistics and the main effects from the coefficients can assist intervention strategies. Although by no means inconsistent, the underlying goal of model improvement differs from that of cluster detection and estimation. When a global indicator such as I_apr becomes insignificant, the iterative search procedure is likely to stop searching for clusters, but even with the same goal, different search strategies may reveal slightly different clustering structures. This situation can occur in the presence of two weak clusters. When a spatial association coefficient accounts for one of them, the previously detected clustering tendency may or may not disappear. At this point, if we used an alternative search algorithm that could continue as long as an additional association term could substantially improve the model fit (Lin, 2003), we would likely identify a second association term that contributes not to clustering but to the overall model fit.
This complicated design warrants further statistical justification and evaluation, so that different spatial searching algorithms will have clearly different statistical inferences associated with different spatial association terms. These local cluster-detection designs could also be compared with global autocorrelation removal strategies, such as conditional autoregressive priors in Bayesian disease mapping. Finally, although we used the standard way of defining a cluster by its first-order spatial adjacency, the method can incorporate many spatial weights. Further empirical and theoretical refinement is necessary to evaluate the sensitivity of different spatial weights.

6. Supplementary Materials

Web Appendices and proofs of Theorem 1 referenced in Section 2 are available under the Paper Information link at the Biometrics website

Acknowledgements

We would like to thank the associate editor and two reviewers for their insightful comments and suggestions, which have improved the quality of the article substantially. This research was funded by US National Science Foundation Grants SES (Zhang) and SES (Lin).

References

Agresti, A. (2002). Categorical Data Analysis. New York: Wiley.
Assuncao, R. and Reis, E. (1999). A new proposal to adjust Moran's I for population density. Statistics in Medicine 18,
Baddeley, A., Turner, R., Moller, J., and Hazelton, M. (2005). Residual analysis for spatial point processes (with discussion). Journal of the Royal Statistical Society, Series B 67,
Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models and Applications. London: Pion.
Gangnon, R. E. and Clayton, M. K. (2003). A hierarchical model for spatially clustered disease rates. Statistics in Medicine 22,
Jacqmin-Gadda, H., Commenges, D., Nejjari, C., and Dartigues, J. (1997). Tests of geographical correlation with adjustment for explanatory variables: An application to dyspnoea in the elderly. Statistics in Medicine 16,
Kleinman, K. (2005). Generalized linear models and generalized linear mixed models for small-area surveillance. In Spatial and Syndromic Surveillance for Public Health, A. B. Lawson and K. Kleinman (eds), London: Wiley.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics, Theory and Methods 26,
Lawson, A. B. and Clark, A. (2002). Spatial mixture relative risk models applied to disease mapping. Statistics in Medicine 21,
Lin, G. (2003). A spatial logit association model for cluster detection. Geographical Analysis 35,
Lin, G. and Zhang, T. (2007). Loglinear residual tests of Moran's I autocorrelation: An application to Kentucky breast cancer incidence data. Geographical Analysis 39,
Loh, J. M. and Zhu, Z. (2007). Accounting for spatial correlation in the scan statistic. Annals of Applied Statistics 1,
McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92,
Moran, P. A. P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society, Series B 10,
Roche, L. S., Skinner, R., and Weinstein, R. B. (2002). Use of a geographic information system to identify and characterize areas with high proportions of distant stage breast cancer. Journal of Public Health Management Practice 8,
Waller, L. and Gotway, C. (2004). Applied Spatial Statistics for Public Health Data. Hoboken, New Jersey: Wiley.
Zeger, S. L. (1988). A regression model for time series of counts. Biometrika 75,
Zhuang, J. (2006). Second-order residual analysis of spatiotemporal point processes and applications in model evaluation. Journal of the Royal Statistical Society, Series B 68,

Received June Revised March Accepted March 2008.
