Identification of Local Clusters for Count Data: A Model-Based Moran's I Test

Tonglin Zhang and Ge Lin
Purdue University and West Virginia University
February 14, 2007

Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN, tlzhang@stat.purdue.edu
Department of Geology and Geography, West Virginia University, Morgantown, WV, glin@wvu.edu

Abstract

We set out $I_{DR}$ as a loglinear model-based Moran's I test for Poisson count data that resembles the Moran's I residual test for Gaussian data. We evaluate its type I and type II error probabilities via simulations, and demonstrate its utility via a case study. When population sizes are heterogeneous, $I_{DR}$ is effective in detecting local clusters through local association terms, with an acceptable type I error probability. When used in conjunction with local spatial association terms in loglinear models, $I_{DR}$ can also indicate the existence of a first-order global cluster that can hardly be removed by local spatial association terms. In this situation, $I_{DR}$ should not be directly applied for local cluster detection. In the case study of St. Louis homicides, we bridge loglinear model methods for parameter estimation to exploratory data analysis, so that a uniform association term can be defined with spatially varied contributions among spatial neighbors. The method makes use of exploratory tools such as Moran's I scatter plots and residual plots to evaluate the magnitude of deviance residuals, and it is effective in modeling the shape, the elevation and the magnitude of a local cluster in the model-based test.

Keywords: Cluster and clustering; deviance residual; Moran's I; permutation test; spatial autocorrelation; type I error probability.

1 Introduction

Count and cross-tabulated frequency data are common in geographical analyses. Many spatial phenomena, such as births, deaths, crimes and species richness, can be counted by spatial unit, either as a raw count or as a ratio over some exposure. Prior to the 1970s, count data were often converted to rates for statistical analysis because of the limited computational power available for categorical statistics. In the late 1970s, computationally expensive methods, such as loglinear models for cross-tabulated data, were introduced into the social sciences and geography [15, 38], and they were quickly included in many statistical packages. In spatial statistical analyses, however, counts are still frequently converted to rates so that a testing method for continuous variables, such as Moran's I [26, 27] or Getis-Ord's G [20], can be directly applied. When population sizes are heterogeneous across spatial units, converting counts to rates often leads to variance inflation and biased type I error probabilities. Some authors propose incorporating a population weight into the test statistics [29, 35], but the heterogeneity problem remains [5]. Since a loglinear model can take account of population sizes in its likelihood ratio test, it is natural to extend the spatial statistics under the loglinear model framework.

In this article, we set out a loglinear model-based test statistic for Poisson count data that corresponds to Moran's I for continuous data. We chose Moran's I because of its popularity and its ease of implementation. There have been hundreds of applications and extensions of the statistic since Moran's I was first published in 1948 [26]. Currently, most researchers focus on estimation methods [13, 22, 37], spatial distribution properties [7, 19], and adjustment for heterogeneous population sizes in count data [5, 29, 35]. A concurrent theme focuses on local spatial statistics or indicators [16]. It has been pointed out that the extent of spatial correlation may vary locally due to omitted, misspecified, and deficient measurements for a stationary spatial relationship [17]. A significant Moran's I test may be caused either by a global trend of spatial autocorrelation or by a few local spatial associations. Attempts have been made to partition space for spatially varied parameterization [10], and to decompose a global autocorrelation measure, such as Moran's I, into a local indicator of spatial association (LISA) [3].
With auxiliary information, LISA is able to locate spatial associations, such as hot spots and cool spots [33]. Our model-based test should complement LISA, because it is able not only to indicate high-value and low-value clusters explicitly, but also to account for heterogeneous population sizes.

As its name suggests, a model-based test depends on a particular statistical model. In a linear regression model, a dependent variable is often associated with a set of explanatory variables. After a final model is derived, a residual Moran's I test for spatial autocorrelation can also be performed to detect spatial clustering of unexplained variation ([14], p. 197). When a regression model does not include any explanatory variables, the residual Moran's I test is identical to the Moran's I test of the dependent variable. If we can bridge this test for spatial autocorrelation to a loglinear model, it would likely narrow the apparent knowledge gap between Moran's I for continuous data and other autocorrelation tests for count data ([38], p. 307).

There are some recent advances in incorporating count data in spatial statistics. Griffith [21] introduced a spatial filter specification of the auto-regressive logistic model that is able to remove the global clustering effect. The model is likely to provide unbiased parameter estimates for autoregressive logistic regression, but because it focuses on model correction, the method may not be able to detect a local association. Several test statistics, such as $I_{pop}$ and $I^{*}_{pop}$ [29], the modified I [35] by the Empirical Bayes Index (EBI), $C_G$ [34] and the spatial $\chi^2$ test [31], are able to account for heterogeneous population sizes and to detect a local cluster, but none of them can account for ecological or geographic covariates. Lin's [23] spatial logit association model is able to include ecological covariates and spatial associations, but the significance of a logit association term is not a direct measure of spatial clustering. Although Apanasovich and his coauthors [6] used the Pearson residuals to test for spatial autocorrelation in their autoregressive model, the test was not formally specified and evaluated for wider applications. In this paper, we demonstrate that Moran's I based on loglinear residuals can be used not only as a global indicator of spatial autocorrelation, but also as a tool for modeling the location, the shape, the elevation and the size of a local spatial cluster.
In the remaining sections of the paper, we briefly review the permutation test of Moran's I using regression residuals and reformulate it in the context of Poisson data using the deviance residuals of a loglinear model. We then use simulations to evaluate its statistical properties under the null and alternative hypotheses of spatial independence. In Section 4, we apply the deviance residual Moran's I test to the St. Louis crime data. Finally, we provide some concluding remarks.

2 A Model-Based Moran's I

Consider a study area that has $m$ regions indexed by $i$. Let $X_i$ be the variable of interest in region $i$. Moran's I [26, 27] is expressed as

$$I = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}(X_i - \bar{X})(X_j - \bar{X})}{\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]\left[\sum_{i=1}^{m}(X_i - \bar{X})^2/m\right]}, \tag{1}$$

where $\bar{X} = \sum_{i=1}^{m} X_i/m$, and $w_{ij}$, with $w_{ii} = 0$, is the $(i, j)$-th element of a spatial weight matrix $W$. Commonly $w_{ij}$ is defined by the adjacency of spatial units: $w_{ij} = 1$ if regions $i$ and $j$ are adjacent (neighbors) and $w_{ij} = 0$ otherwise. A significant and positive value of Moran's I indicates positive autocorrelation, that is, high-value or low-value clustering. A significant and negative value of Moran's I indicates negative autocorrelation, or a tendency toward the juxtaposition of high values next to low values.

The null hypothesis of Moran's I is usually based on the assumption that the distributions of the $X_i$ are homogeneous. The p-value of Moran's I is computed from a z-test based on the z-value $Z(I) = [I - E(I)]/\sqrt{V(I)}$, where $E(I)$ and $V(I)$ are the theoretical mean and variance, respectively, under the null hypothesis. Under the null hypothesis of no spatial autocorrelation, $Z(I)$ is assumed to be asymptotically distributed as $N(0, 1)$ as $m \to \infty$. The theoretical values of $E(I)$ and $V(I)$ are usually computed under the random permutation test scheme:

$$E(I) = -\frac{1}{m-1} \tag{2}$$

and

$$V(I) = \frac{m\left[(m^2 - 3m + 3)S_1 - mS_2 + 3S_0^2\right] - b_2\left[(m^2 - m)S_1 - 2mS_2 + 6S_0^2\right]}{(m-1)(m-2)(m-3)S_0^2} - E^2(I), \tag{3}$$

where $S_0 = \sum_{i=1}^{m}\sum_{j=1, j\neq i}^{m}(w_{ij} + w_{ji})/2$, $S_1 = \sum_{i=1}^{m}\sum_{j=1, j\neq i}^{m}(w_{ij} + w_{ji})^2/2$, and $S_2 = \sum_{i=1}^{m}(w_{i\cdot} + w_{\cdot i})^2$ with $w_{i\cdot} = \sum_{j=1, j\neq i}^{m} w_{ij}$ and $w_{\cdot i} = \sum_{j=1, j\neq i}^{m} w_{ji}$, and $b_2 = m\sum_{i=1}^{m}(z_i - \bar{z})^4/\left[\sum_{i=1}^{m}(z_i - \bar{z})^2\right]^2$, where $z_i$ is the variable entered in (1) ([14], p. 21).

When observations are counts, such as crimes, $X_i$ in (1) often takes the form of a case rate, $X_i = n_i/\xi_i$, where $n_i$ is the number of cases and $\xi_i$ is the at-risk population size in region $i$. However, the homogeneity assumption may not be valid under this specification [11]. Since loglinear models can relax this assumption, we can specify a loglinear model and use its deviance residuals to test for spatial autocorrelation. Suppose that the random count $N_i$ with observed count $n_i$, $i = 1, \ldots, m$, follows a Poisson distribution, and assume that the counts $N_i$ are independent. Suppose that a set of geographical covariates is observed together with the count $n_i$. Then a loglinear model can be set out by taking the observed geographical covariates as explanatory variables and the logarithm of the at-risk population size, $\log(\xi_i)$, as an offset term. When the parameters are estimated by maximum likelihood, the estimated value $\hat{n}_i$ of the expected count $E(N_i)$ can be derived, and the conventional deviance residual ([1], p. 588) for region $i$ is

$$r_{i,d} = \sqrt{2}\,\mathrm{sign}(n_i - \hat{n}_i)\left[n_i\log(n_i/\hat{n}_i) - n_i + \hat{n}_i\right]^{1/2}, \tag{4}$$

where $\mathrm{sign}(\cdot)$ is the sign function: $\mathrm{sign}(a)$ is $1$ if $a > 0$, $0$ if $a = 0$, and $-1$ if $a < 0$.

The concepts and statistical properties of deviance residuals in loglinear models are well established, and we can readily extend these concepts to spatial statistics. Note that the numerator of Moran's I is a martingale if the $X_i$ are independent with mean 0. When $X_i = r_{i,d}$, with $r_{i,d}$ given in (4) and $\hat{n}_i$ replaced by the expected count $E(N_i)$, the variables $X_1, \ldots, X_m$ are independent and $E(X_i)$ is almost 0 if $E(N_i)$ is large (e.g. $E(N_i) > 5$). When $E(N_i)$ is estimated by $\hat{n}_i$ from a loglinear model, then under the model assumption $\hat{n}_i$ is a consistent estimator of $E(N_i)$, and the joint distribution of $(r_{1,d}, \ldots, r_{m,d})$ is approximately normal with mean 0 and a variance-covariance matrix that is an orthogonal projection matrix [1, 30], denoted by $P_m$.
For a fixed number of covariates and large $m$, the orthogonal projection matrix $P_m$ is almost equivalent to the identity matrix, since the dimension of the kernel space of the projection matrix is equal to the number of covariates. As $m$ goes to infinity, $r_{1,d}, \ldots, r_{m,d}$ are approximately independent, and the asymptotic normality of $Z(I)$ can be proven by a martingale approximation of the numerator of Moran's I together with the Martingale Central Limit Theorem ([9], p. 475). In addition, one must also consider the convergence of the permutation mean and variance of Moran's I in this scenario [32]. This asymptotic formulation for deviance residuals is analogous to that for regression residuals ([14], p. 198).

Deviance residuals are very flexible in loglinear models, and they reflect categorical structure (in this case spatial structure) while controlling for potentially heterogeneous population sizes ([1], p. 495). We can therefore test deviance residuals for spatial autocorrelation by specifying a loglinear model. Since a loglinear model, such as a log-rate model, can incorporate geographic (or ecological) covariates, we can test its residuals for spatial autocorrelation in the presence or absence of ecological covariates. A natural approach is to apply the random permutation test, so that Moran's I based on the deviance residuals of a loglinear model is analogous to Moran's I based on the residuals of a regression model [6]. Given that deviance residuals are approximately multivariate normal, we can test the residuals for spatial autocorrelation by replacing $X_i$ in (1) with $r_{i,d}$ from equation (4), and we label the resulting statistic $I_{DR}$. The mean and variance of $I_{DR}$ can be derived from the random permutation scheme of the conventional Moran's I in the same way, as given by equations (2) and (3) respectively.

To implement $I_{DR}$, we can simply estimate the expected counts under the null model with the intercept only, which gives $\hat{n}_i = \xi_i(n/\xi)$ with $n = \sum_{i=1}^{m} n_i$ and $\xi = \sum_{i=1}^{m}\xi_i$. In this case, the $i$-th deviance residual $r_{i,d}$ is obtained by inserting $\hat{n}_i$ in (4). If $I_{DR}$ is positive and significant, it suggests spatial clustering, which can be due either to a first-order clustering trend or to a few local clusters. We can detect these clustering contributions by applying spatial association models [23, 24]. First, a number of spatial association terms are added to the null model.
Then, the parameter estimates and residuals are derived in the model fitting process. The existence of spatial autocorrelation is tested again via $I_{DR}$ for the model residuals. If $I_{DR}$ is significant in the null model but not significant in the model with local association terms, the significance found in the null model is likely to be accounted for by the local association terms. If a few spatial association terms cannot reduce the significance of $I_{DR}$ from the null model, it suggests the existence of a first-order global clustering tendency.
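The null-model computation described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses only the Python standard library, the function names are hypothetical, the weight matrix is assumed to be a list of lists with a zero diagonal, and zero counts are handled with the usual convention that $n\log(n/\hat{n}) = 0$ when $n = 0$.

```python
import math

def deviance_residuals(counts, fitted):
    """Poisson deviance residuals, Equation (4)."""
    out = []
    for n, nhat in zip(counts, fitted):
        term = (n * math.log(n / nhat) if n > 0 else 0.0) - n + nhat
        sign = (n > nhat) - (n < nhat)
        out.append(sign * math.sqrt(2.0 * max(term, 0.0)))
    return out

def morans_i(x, w):
    """Moran's I, Equation (1); w has a zero diagonal."""
    m = len(x)
    mean = sum(x) / m
    z = [v - mean for v in x]
    num = sum(w[i][j] * z[i] * z[j] for i in range(m) for j in range(m))
    s0 = sum(sum(row) for row in w)
    return num / (s0 * sum(v * v for v in z) / m)

def permutation_moments(x, w):
    """E(I) and V(I) under random permutation, Equations (2) and (3)."""
    m = len(x)
    mean = sum(x) / m
    z = [v - mean for v in x]
    b2 = m * sum(v ** 4 for v in z) / sum(v * v for v in z) ** 2
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    s0 = sum((w[i][j] + w[j][i]) / 2 for i, j in pairs)
    s1 = sum((w[i][j] + w[j][i]) ** 2 / 2 for i, j in pairs)
    s2 = sum((sum(w[i]) + sum(w[k][i] for k in range(m))) ** 2 for i in range(m))
    e_i = -1.0 / (m - 1)
    a = m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 * s0)
    b = b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 * s0)
    v_i = (a - b) / ((m - 1) * (m - 2) * (m - 3) * s0 * s0) - e_i ** 2
    return e_i, v_i

def i_dr_null(counts, pops, w):
    """I_DR and its z-value under the intercept-only (null) model."""
    rate = sum(counts) / sum(pops)            # n / xi
    fitted = [xi * rate for xi in pops]       # n_hat_i = xi_i * (n / xi)
    r = deviance_residuals(counts, fitted)
    i = morans_i(r, w)
    e_i, v_i = permutation_moments(r, w)
    return i, (i - e_i) / math.sqrt(v_i)
```

For a model with covariates or association terms, the same functions would apply once `fitted` is replaced by the expected counts from that model; fitting such a model is outside this sketch.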

Under the assumption that there is a local cluster in the study area, a loglinear model with a spatial association term is

$$\log(\lambda_i) = \log(\xi_i) + \beta_0 + \beta_1 d_{i(j)}, \tag{5}$$

where $\lambda_i = E(N_i)$, $\beta_0$ is the grand mean, and $\beta_1$ is the unknown parameter for the spatial association term defined by $d_{i(j)}$, in which $d_{i(j)}$ is 1 if location $i$ is believed to be in a cluster centered at unit $j$, and 0 otherwise. We test whether $\beta_1$ differs significantly from 0. The significance of the local association term is determined by its p-value via the likelihood ratio test against the null model without the spatial association term. If the estimate of $\beta_1$ is significantly positive, then the local cluster is a hot spot. If it is significantly negative, then the local cluster is a cool spot.

Besides the likelihood ratio test, we can gauge the contribution of $\beta_1$ to the local cluster by comparing the $I_{DR}$ results with and without the $\beta_1$ term in model (5). If $I_{DR}$ is not significant when the spatial association term is included, then the clustering effect in the null model is sufficiently removed by the association term. If the inclusion of the spatial association term in model (5) does not change the significance level of $I_{DR}$ from the null model, then the clustering effect remains. To further improve model fit and to identify the explained clustering effect, one can either refine the spatial association term already in the model or include another spatial association term. Finally, if the existence of a local cluster is accompanied by a first-order global clustering trend, the likelihood ratio test may still be significant when the local association term is included, but the term is unlikely to change $I_{DR}$ from significant to non-significant.
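For the special case of a single 0/1 indicator, model (5) can be fitted in closed form, because the maximum likelihood rates are simply the pooled rates inside and outside the candidate cluster. The sketch below relies on that shortcut rather than on a general GLM routine; the function name is hypothetical, both groups are assumed to contain at least one case, and a graded term (with values other than 0 and 1) would require iterative fitting instead.

```python
import math

def fit_cluster_model(counts, pops, d):
    """ML fit of model (5) for a 0/1 indicator d, with the LR test against the null."""
    n_in = sum(n for n, k in zip(counts, d) if k == 1)
    n_out = sum(n for n, k in zip(counts, d) if k == 0)
    xi_in = sum(x for x, k in zip(pops, d) if k == 1)
    xi_out = sum(x for x, k in zip(pops, d) if k == 0)

    beta0 = math.log(n_out / xi_out)            # log rate outside the cluster
    beta1 = math.log(n_in / xi_in) - beta0      # log rate ratio, inside vs outside
    rate0 = (n_in + n_out) / (xi_in + xi_out)   # pooled rate under the null model

    def deviance(fitted):
        return 2.0 * sum((n * math.log(n / f) if n > 0 else 0.0) - n + f
                         for n, f in zip(counts, fitted))

    dev_null = deviance([x * rate0 for x in pops])
    dev_alt = deviance([x * math.exp(beta0 + beta1 * k) for x, k in zip(pops, d)])
    g2 = dev_null - dev_alt                     # likelihood ratio statistic, 1 df
    p_value = math.erfc(math.sqrt(max(g2, 0.0) / 2.0))  # chi-square(1) upper tail
    return beta0, beta1, g2, p_value
```

A positive `beta1` with a small `p_value` corresponds to a hot spot in the sense used above, and a negative `beta1` to a cool spot.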

3 Simulation Assessment of $I_{DR}$

We designed Monte Carlo simulation experiments to assess the effectiveness of the model-based test $I_{DR}$ under population heterogeneity. Type I error rates were evaluated under the null hypothesis of homogeneous rates with heterogeneous population sizes, while spatial cluster modeling was evaluated in the presence and absence of a first-order global clustering trend. All the simulation experiments were based on a lattice with $w_{ij}$ defined by the rook rule of spatial adjacency. We set the significance level to $\alpha = 0.05$ to assess the rejection rates of $I_{DR}$ in each set of simulations. In the presence of a local cluster, a residual plot was also furnished to facilitate the evaluation process.

In addition to $I_{DR}$, we included the original Moran's I computed on rates, denoted by $I_r$, which is defined by letting $x_i = r_i = n_i/\xi_i$ in (1). Previous studies have demonstrated that $I_r$ is sensitive to heterogeneous populations, and $I_r$ was included to serve as a baseline for comparison. We also included the Empirical Bayes Index (EBI), denoted by $I_{EBI}$, a population-adjusted Moran's I proposed by Assuncao and Reis [5]. $I_{EBI}$ has been found to be effective for adjusting for population sizes in the presence of population heterogeneity, and it has been included in GeoDa, a popular freeware package for exploratory spatial data analysis [4]. However, $I_{EBI}$ is not a model-based test and it cannot include ecological covariates. This can be seen from its definition, in which $z_i = r_{i,EBI} = (p_i - b)/\sqrt{\nu_i}$, where $p_i = n_i/\xi_i$, $b = n/\xi$, $\nu_i = a + b/\xi_i$, $a = s^2 - b/(\xi/m)$ and $s^2 = \sum_{i=1}^{m}\xi_i(p_i - b)^2/\xi$. Hence,

$$I_{EBI} = \frac{\sum_{i=1}^{m}\sum_{j\neq i} w_{ij}\left(\frac{p_i - b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l - b}{\sqrt{\nu_l}}\right)\left(\frac{p_j - b}{\sqrt{\nu_j}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l - b}{\sqrt{\nu_l}}\right)}{\left[\sum_{k=1}^{m}\left(\frac{p_k - b}{\sqrt{\nu_k}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l - b}{\sqrt{\nu_l}}\right)^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1, j\neq i}^{m} w_{ij}\right]}. \tag{6}$$

Assess $I_{DR}$ for population heterogeneity.
Following the simulation studies of Walter [36] and Assuncao and Reis [5], we compared the type I error probabilities of $I_{DR}$, $I_{EBI}$ and $I_r$ based on Monte Carlo simulations. Walter [36] reported that densely populated areas containing a sparsely populated pocket could cause an excessive type I error probability for $I_r$. To represent this pattern, we generated a relatively low population of $10^6(1-\eta)^2$ for lattice points within a 2-unit circle centered at (3, 3), and $10^6$ for the others. The value of $\eta$ indexes population heterogeneity and ranges from 0 to 0.8 in increments of 0.04. When $\eta = 0$, all the populations were homogeneous. As $\eta$ gets closer to 1, the populations become increasingly heterogeneous. Based on the above population patterns, we generated independent Poisson random variables with mean equal to $10^{-4}$ times the population size for each lattice point. Since identical rates were expected across all lattice points, there should be no spatial clustering. The rejection rate, therefore, should reflect the type I error probability of the spatial autocorrelation test. For each value of $\eta$, we calculated type I error probabilities based on 10,000 simulations and the resulting z-values.

The results (Figure 1) show that both $I_{DR}$ and $I_{EBI}$ were able to account for population heterogeneity, with an almost identical type I error probability of around 0.05 for all values of $\eta$. The type I error probability of $I_r$, however, was acceptable only when $\eta$ was small and there was little variation in population sizes. As $\eta$ increased, the type I error rate also increased. When population sizes varied substantially ($\eta = 0.8$), the rejection rate was as high as 25%, a result consistent with Walter's simulations.

Assess $I_{DR}$ for local cluster detection. Based on the previous simulation result, we devised a fixed heterogeneous population pattern: the population was $10^5$ if a point on the lattice was within the circle and $10^6$ otherwise. We generated independent Poisson random variables with mean equal to [...] times the population size for each lattice point. We then inserted a 2-unit circle for a cluster effect centered at (7, 7), and set the mean equal to $(1+\delta)$ times the population. The value of $\delta$ represented the strength and direction of the cluster effect, and it increased from $-0.8$ to $0.8$ in steps of 0.04.
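The population pattern and null counts of the first experiment (population heterogeneity) can be sketched as follows. Several details are assumptions made for illustration only: a 10 x 10 lattice with coordinates running from 1 to 10, a circle that includes points at distance exactly 2 from the center, and a simple Knuth-type Poisson sampler in place of a library routine. All function names are hypothetical.

```python
import math
import random

def rook_weights(side):
    """Binary rook-adjacency matrix for a side x side lattice (row-major order)."""
    m = side * side
    w = [[0] * m for _ in range(m)]
    for r in range(side):
        for c in range(side):
            i = r * side + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    w[i][rr * side + cc] = 1
    return w

def populations(side, eta, centre=(3, 3), radius=2.0):
    """10^6 (1 - eta)^2 inside the circle around `centre`, 10^6 elsewhere."""
    pops = []
    for r in range(1, side + 1):
        for c in range(1, side + 1):
            inside = math.hypot(r - centre[0], c - centre[1]) <= radius
            pops.append(1e6 * (1 - eta) ** 2 if inside else 1e6)
    return pops

def poisson(mean, rng):
    """Knuth's multiplication sampler; adequate for means up to a few hundred."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_null_counts(pops, rate=1e-4, seed=0):
    """Independent Poisson counts with a common rate, so no clustering is present."""
    rng = random.Random(seed)
    return [poisson(rate * p, rng) for p in pops]
```

Repeating `simulate_null_counts` with different seeds and recording how often a test rejects at the 0.05 level gives an empirical type I error rate for a given $\eta$.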
If $\delta < 0$, the circle represented a low-value cluster; if $\delta > 0$, it represented a high-value cluster. Again, based on 10,000 simulations for each value of $\delta$ under population heterogeneity, the rejection rates of $I_{DR}$ with and without the spatial association term from model (5) are shown in Figure 2. The rejection rate without the spatial association term indicates the statistical power of $I_{DR}$, while the rate with the spatial association term indicates the effectiveness of the model-based test for a high-value or low-value cluster. If the model-based test is effective, the test statistic should no longer be significant once a spatial association term that covers the exact circle is included.

The results show that $I_{DR}$ under the null model had reasonable power (Figure 2). When $\delta$ was around 0, the rejection rate was around [...]. When the absolute value of $\delta$ was greater than 0.25, the rejection rates were about 15%. When $\delta$ reached $-0.8$, a cool spot, the rejection rate was almost 100%. When $\delta$ reached $0.8$, the rejection rate was about 85%. Both results suggest that $I_{DR}$ under the null model is likely to be significant when there is a strong local cluster. However, when the cluster tendency was accounted for by the spatial association term, the rejection rates were consistently around 0.05, suggesting that $I_{DR}$ was unlikely to be significant once a spatial association term absorbed the cluster effect. Since the relative risks within the cluster were all similarly higher or lower than in the rest of the area in our simulations, once the cluster effect was removed by the spatial association term, the study area became spatially independent, a result consistent with previous simulations of the spatial logit association model [23].

The effect of the spatial association term can be illustrated by the residual and normal QQ-plots from a single simulation. The upper panel of Figure 3 displays the results under the null model. The $I_{DR}$ test had a p-value of [...], primarily due to a number of extremely high deviance residuals in the clustered area. Likewise, the QQ-plot shows a number of high values concentrated in the upper tail, suggesting the existence of extreme values. The lower panel of Figure 3 shows that once the spatial association term was added to the model, the extremely large residual values seen in the null model disappeared, and $I_{DR}$ was no longer significant, with a p-value of 0.12 and evenly distributed residuals.
This result is also corroborated by the QQ-plot, in which all the values lie along a straight line.

Assess $I_{DR}$ for a local cluster in the presence of first-order global clustering. It is known that a local cluster and a first-order clustering trend can operate simultaneously. In the presence of global clustering, it is often necessary to de-trend first before fitting a spatial regression model [2, 12]. We evaluate the performance of $I_{DR}$ in this situation by generating the global spatial structure from a log-normal distribution, and by inserting the local cluster from the previous simulation with $\delta = 2.0$. If the local test is insensitive to the first-order clustering tendency, then it indicates the existence of global clustering.

In the simulation process, we first generated 100 independent and identically distributed (iid) $N(0, 1)$ random variables, denoted by $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_{100})$. Next, we calculated a vector $u$ by letting $u = (I - \rho W)^{-1}\epsilon$, with $\rho$ increasing from 0 to 0.2 in steps of 0.01, so that $u$ satisfied $u = \rho W u + \epsilon$, where $\rho$ is the coefficient of the global spatial association [2, 5]. Third, we let $\lambda = (1 + 2d)e^{u}$, where $\lambda = (\lambda_1, \ldots, \lambda_{100})$ was the vector of Poisson intensities for generating counts. We generated conditionally independent Poisson random variables $N_i$ with parameter $\lambda_i$ times the $i$-th population size. When $\rho = 0$, there was only a local cluster in the simulated pattern, and when $\rho \neq 0$, there were both local and global clustering tendencies in the simulated pattern.

We assessed the effectiveness of $I_{DR}$ by comparing its rejection rates with and without the spatial association term. Based on 10,000 simulations for each value of $\rho$, the results (Figure 4) showed that the spatial association term was unable to reduce the clustering effect except when the global clustering trend was very weak. For instance, when $\rho = 0$, the rejection rate for $I_{DR}$ in the null model was about 28%, which suggested spatial clustering. When the spatial association term was included in this case, the local cluster tendency was reduced, as in the previous simulation. As the global clustering trend $\rho$ increased, the rejection rates of $I_{DR}$ also increased, and the tests both with and without the association term were likely to be significant for even a modest increase in $\rho$.
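The generation of the globally correlated field $u$ can be sketched without a matrix inverse by iterating the defining relation $u = \rho W u + \epsilon$ to a fixed point. This is an illustrative substitute for computing $(I - \rho W)^{-1}\epsilon$ directly; it assumes a binary rook weight matrix on a 10 x 10 lattice (100 units, matching the 100 variables above) and converges only when $\rho$ times the largest eigenvalue of $W$ is below 1, which holds for $\rho \le 0.2$ on such a lattice. Function names are hypothetical.

```python
import math
import random

def rook_weights(side):
    """Binary rook-adjacency matrix for a side x side lattice (row-major order)."""
    m = side * side
    w = [[0] * m for _ in range(m)]
    for i in range(m):
        r, c = divmod(i, side)
        if r > 0:
            w[i][i - side] = 1
        if r < side - 1:
            w[i][i + side] = 1
        if c > 0:
            w[i][i - 1] = 1
        if c < side - 1:
            w[i][i + 1] = 1
    return w

def sar_field(w, rho, eps, tol=1e-12, max_iter=10_000):
    """Solve u = rho * W u + eps by fixed-point iteration."""
    m = len(eps)
    u = list(eps)
    for _ in range(max_iter):
        new = [eps[i] + rho * sum(w[i][j] * u[j] for j in range(m) if w[i][j])
               for i in range(m)]
        if max(abs(a - b) for a, b in zip(new, u)) < tol:
            return new
        u = new
    raise RuntimeError("iteration did not converge; rho may be too large for W")

def intensities(u, d, delta=2.0):
    """lambda_i = (1 + delta * d_i) * exp(u_i), with d_i = 1 inside the local cluster."""
    return [(1 + delta * di) * math.exp(ui) for ui, di in zip(u, d)]
```

Multiplying each intensity by the corresponding population size gives the Poisson mean for that unit; with `rho = 0` the field reduces to `eps`, so only the local cluster remains.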
The inclusion of the spatial association term had little effect on removing a local clustering effect in the presence of the global clustering tendency. This further suggests that even when an association term is significant in terms of the likelihood ratio test, the local effect might not be trustworthy, because the global effect overshadows the local effect. Figure 5 displays the residual and QQ-plots of the deviance residuals with and without the spatial association term from a single simulation run ($\rho = 0.15$). Evidently, there were only a few deviance residuals that were large in absolute value, and they were not clumped together. This pattern is in sharp contrast with the one in Figure 3. In addition, the p-values of $I_{DR}$ with and without the spatial association term were very close: [...] with the spatial association term and [...] without. These results suggest that the inclusion of a local association term is unlikely to reduce the significance of $I_{DR}$ because of the overall global clustering effect.

In summary, $I_{DR}$ is effective in reducing the type I error probabilities that the traditional Moran's I incurs under heterogeneous population sizes, and its performance is comparable to that of $I_{EBI}$. An advantage of $I_{DR}$ over $I_{EBI}$ is its ability to include ecological or other spatial covariates. When a significant $I_{DR}$ is due mainly to a local cluster, we can devise a spatial association term to remove the cluster effect, so that the spatial autocorrelation observed in the null model is no longer significant. The exact form of the association term can be determined either by a stepwise regression method [23] or by an exploratory method, such as deviance residual plots. Since $I_{DR}$ is sensitive to the existence of local clusters but not sensitive to the presence of the global trend, the inclusion of a spatial association term in the $I_{DR}$ test can indicate whether or not a first-order global clustering trend exists.

4 St. Louis Homicides Data Analysis

In this section, we apply $I_{DR}$ to the analysis of homicides in the St. Louis region. The data set was originally analyzed by Messner et al. [25], and it is also included as part of the exercises in GeoDa [4], a simple spatial analysis package developed by Anselin and his associates.
In the original paper, homicide rates for the [...] and [...] periods were analyzed at the county level, and a number of local clusters, including one centered at St. Louis City, were identified by LISA. Here, we use the model-based $I_{DR}$ to detect spatial clustering based on homicide incidents and the at-risk population. Analogous to LISA, we also included a local version of the deviance residual Moran's I, or deviance residual LISA, denoted by $I_{DR,i}$. The result of $I_{DR,i}$ was compared with the results of the local versions of $I_r$ and $I_{EBI}$, denoted by $I_{r,i}$ and $I_{EBI,i}$ respectively, where $I_{DR,i}$, $I_{r,i}$ and $I_{EBI,i}$ are defined according to the formula given by Anselin [3] as

$$I_i = \frac{\sum_{j=1, j\neq i}^{m} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{m}(x_i - \bar{x})^2/m} \tag{7}$$

by letting $x_i = r_{i,d}$, $x_i = r_i$ and $x_i = r_{i,EBI}$ respectively. All of $I_{DR,i}$, $I_{r,i}$ and $I_{EBI,i}$ provide additional tools for exploratory spatial analysis of count data, such as cluster maps. However, only $I_{DR,i}$ can provide an additional clustering analysis when a covariate is accounted for.

In the preliminary analysis, we found that $I_{DR}$ for the [...] period was [...] with an insignificant p-value of [...], and $I_{DR}$ for the [...] period was [...] with a significant p-value of [...]. We therefore focused on the latter period. Between 1988 and 1993, there were 2,650 homicides, and the average homicide rate was about 10 per 100,000. County populations in the study area vary substantially: St. Louis County was the largest, with more than one million residents, and five other counties (St. Louis City, St. Clair, Boone, Sangamon and Macon) had at least 50,000 residents.

To detect spatial clustering of homicides, we first fitted the null model. The results from $I_{DR}$ indicated a significant clustering tendency, with a z-value of 2.72 and a p-value of [...]. When we plotted the deviance residuals in five equal intervals (Figure 6), St. Louis City was in the first interval with a deviance residual of [...], St. Clair County was in the third interval (18.06), and there was no county in the second interval. This indicated that St. Louis City was the only county that indicated a high-value cluster, surrounded by St. Clair, St. Louis and Madison counties.
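Equation (7) can be computed directly from any of the three inputs ($r_{i,d}$, $r_i$ or $r_{i,EBI}$). The following minimal sketch uses the standard library only; the function name is hypothetical and the weight matrix is assumed to be a list of lists.

```python
def local_morans(x, w):
    """Local Moran's I_i of Equation (7) for every unit i."""
    m = len(x)
    mean = sum(x) / m
    z = [v - mean for v in x]
    s2 = sum(v * v for v in z) / m              # denominator of Equation (7)
    return [z[i] * sum(w[i][j] * z[j] for j in range(m) if j != i) / s2
            for i in range(m)]

# Five units on a line, each adjacent to its immediate neighbours.
w = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
local = local_morans([1, 2, 3, 4, 5], w)
```

A useful consistency check is that the local values sum to the global statistic in (1) multiplied by the total weight $\sum_i\sum_j w_{ij}$, so that dividing `sum(local)` by the total weight recovers the global Moran's I.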
In addition, we plotted the deviance residual LISA using GeoDa, and found that the standardized values of $I_{DR,i}$, $I_{r,i}$ and $I_{EBI,i}$ were 17.5, [...] and [...] respectively when $i$ indicated St. Louis County, and 12.27, [...] and [...] respectively when $i$ indicated St. Clair County. The values for the remaining counties were much lower than the values for those two counties. The LISA plot also indicated that St. Louis and Madison counties were next to high-valued counties, presumably the two very high-valued units, St. Louis City and St. Clair County.

Based on the above information, we decided not to adopt a spatial association term that assigns an equal contribution to the clustering effect from every unit. We refined the shape of the cluster by examining each individual residual within the adjacent counties, and devised a spatially varied association term to capture the magnitude of residual variation within the cluster. Based on the principle of the uniform association model [1], a large residual value should correspond to a large $d_{i(j)}$ value, and a relatively small residual value should correspond to a small $d_{i(j)}$ value. When a neighboring county has a negligible absolute residual value, it can be dropped from the spatial specification. In the five-equal-interval classification, St. Louis City was in the first interval, St. Clair in the third, and Madison and St. Louis in the fourth. Accordingly, we assigned 4 to St. Louis City, 2 to St. Clair, and 1 to Madison and St. Louis counties; this assignment could be achieved automatically in our search algorithm because standard intervals were used.

The results show that the spatial association term was highly significant, contributing a reduction in deviance of about 2,610, from 2,944 in the null model to 334 in the alternative model. In the meantime, the coefficient of the spatial association term indicates a high-value cluster, and its inclusion changed the p-value of $I_{DR}$ from [...] in the null model to [...] in the alternative model. This suggests that the spatial association term can remove the effect of the local cluster, and that there was no global clustering trend.
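The deviance comparison reported above can be written out as a likelihood ratio statistic. The lines below are a sketch that uses only the deviances quoted in the text; the symbols $G^2$, $D_0$ and $D_1$ are introduced here for illustration, and the single degree of freedom assumes that the graded association term adds one parameter ($\beta_1$) to the null model, as in model (5).

$$G^2 = D_0 - D_1 = 2944 - 334 = 2610, \qquad \chi^2_{1,\,0.95} \approx 3.84.$$

Because $G^2$ is far above the 5% critical value of a chi-square distribution with one degree of freedom, the term is highly significant. Under the same assumption, model (5) implies that a county with score $d_{i(j)}$ has a fitted rate $e^{\beta_1 d_{i(j)}}$ times that of a county with score 0, so the scores 4, 2 and 1 correspond to multipliers $e^{4\beta_1}$, $e^{2\beta_1}$ and $e^{\beta_1}$ respectively.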
In addition, inspecting the deviance residuals of the four counties individually, we could see 40, 20, and 5. Based on this information, we further experimented with assigning 4 to St. Louis City, 2.5 to St. Clair, and 1 to Madison and St. Louis counties (Model II); this assignment further reduced the deviance to 177, with a z-value of for I_DR. In both cases, the values of I_DR,i decreased significantly to a very low level for St. Louis and St. Clair counties, and were almost not
significant throughout the region at the 0.05 probability level when we adjusted for the multiple testing problem of 78 units by Bonferroni's method (see [28], p. 153).

It is worth noting that odds ratios can be used to describe the shape of a cluster. For instance, the odds ratio of = e in Model I between St. Louis county and other counties indicates that St. Louis county was = e times as likely as other counties to have a homicide. Similarly, St. Louis City would be = e times and St. Clair would be = e times as likely.

Alternatively, we can use geographical covariates to explain the detected clustering tendency. For instance, it is known that St. Louis City had a high concentration of Blacks. We obtained the percentage of Blacks from the 1990 census for all 78 counties and used it as an ecological covariate in place of the spatial association term. The results (Table 1, last row) show that the percentage of Blacks was positively associated with the likelihood of homicides in the study area. The ecological model performed slightly better than the spatial association model in terms of the likelihood ratio test, i.e., it had a smaller deviance with the same number of degrees of freedom. In addition, when the ecological variable was included, the p-value of for I_DR was not significant, suggesting that no spatial autocorrelation remained. This result implies that the St. Louis City cluster detected by the association term can be explained by the percentage of Blacks in the case study. Either an ecological variable or a spatial association term can yield useful information to describe and quantify a detected cluster.

5 Concluding Remarks

In this paper, we have specified and evaluated I_DR as a loglinear model-based Moran's I test for Poisson count data that resembles the Moran's I residual test for Gaussian data.
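To make the statistic summarized here concrete, the following is a minimal sketch of a Moran's I permutation test applied to Poisson deviance residuals. It assumes that I_DR is Moran's I computed on the deviance residuals of a fitted loglinear model; the exact standardization and weights used in the paper may differ. The data, the binary contiguity weights on a line of 12 units, and all function names are hypothetical.

```python
import numpy as np

def poisson_deviance_residuals(y, mu):
    """Signed square roots of the unit Poisson deviances."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Convention: y * log(y / mu) is 0 when y == 0.
    ratio = np.where(y > 0, y / mu, 1.0)
    unit_dev = 2.0 * (y * np.log(ratio) - (y - mu))
    return np.sign(y - mu) * np.sqrt(np.maximum(unit_dev, 0.0))

def morans_i(values, w):
    """Moran's I of `values` under the spatial weight matrix `w`."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    return (z.size / w.sum()) * (z @ w @ z) / (z @ z)

def permutation_test(values, w, n_perm=999, seed=0):
    """One-sided permutation p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, w)
    sims = np.array([morans_i(rng.permutation(values), w)
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(sims >= observed)) / (n_perm + 1)
    return observed, p_value

# Hypothetical data: 12 units on a line, equal populations,
# with raised counts in three contiguous units.
counts = np.array([10, 9, 11, 10, 30, 35, 28, 10, 9, 11, 10, 10])
population = np.full(12, 1000.0)

# Null loglinear model with a log-population offset: one common rate.
fitted = population * counts.sum() / population.sum()

# Binary first-order contiguity weights on the line.
w = np.zeros((12, 12))
idx = np.arange(11)
w[idx, idx + 1] = 1.0
w[idx + 1, idx] = 1.0

residuals = poisson_deviance_residuals(counts, fitted)
observed_i, p_value = permutation_test(residuals, w)
print(f"Moran's I on deviance residuals: {observed_i:.3f}, p = {p_value:.3f}")
```

Refitting the model with a spatial association term and repeating the test on the new residuals would mirror the "without" versus "with" comparison reported in the simulations.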
Based on previous studies, we pointed out that loglinear residuals are not only asymptotically normal, but also suitable for the permutation test of Moran's I when the model is correctly specified. We evaluated
type I and type II error rates via simulations, and found that I_DR was effective in accounting for heterogeneous population sizes and in detecting a local cluster in the absence of a global trend. In the presence of a global trend, the power to detect a local cluster was very weak, a problem that also exists for a continuous dependent variable in a linear regression model [12].

In the case study, we extended Lin's [23] spatial association model, which emphasizes equal contributions among spatial neighbors, to an ordered or uniform spatial association model that captures spatially varied contributions among spatial neighbors within a cluster. This model has several advantages. First, it makes use of exploratory tools such as Moran's I scatter plots and residual plots to evaluate the magnitude of deviance residuals. Second, cluster shape can be determined in terms of its geographic coverage and its slope via odds ratios. In other words, a 3-dimensional cluster that varies spatially in magnitude can be derived from the spatially varied association term. Third, this analysis can be extended to probit, logit [6] and other limited dependent variables under the loglinear framework. Finally, our model-based I_DR test is complementary to recently developed residual-based spatial statistical approaches [8].

Future research should extend I_DR to other test statistics, such as Getis-Ord's G [20] and Geary's c [18], and assess their effectiveness for various spatial problems. Likewise, there are many conventional methods for modeling categorical associations, and we should examine their effectiveness for constructing a spatially varied association term, and for specifying various forms of loglinear models in the context of spatial analysis. The current study does not offer any de-trending methods in the presence of a global trend, and how to de-trend while locating and explaining local clusters remains a challenging issue.
Finally, like other model-based tests, I_DR can be misleading when the model is mis-specified; criteria for a correctly specified model should therefore be established for spatial loglinear models.

Acknowledgements: The authors would like to thank a reviewer for the detailed comments and suggestions, which have substantially improved the quality of the paper.

References

[1] Agresti, A. (2002). Categorical Data Analysis. Wiley, New York.
[2] Anselin, L. (1990). Spatial dependence and spatial structural instability in applied regression analysis. Journal of Regional Science, 30,
[3] Anselin, L. (1995). Local indicators of spatial association-LISA. Geographical Analysis, 27,
[4] Anselin, L., Syabri, I. and Kho, Y. (2006). GeoDa: An introduction to spatial data analysis. Geographical Analysis, 38,
[5] Assuncao, R. and Reis, E. (1999). A new proposal to adjust Moran's I for population density. Statistics in Medicine, 18,
[6] Apanasovich, T. V., Sheather, S., Lupton, J. R., Popovic, N., Turner, N. D., Chapkin, R. S., Braby, L. A. and Carroll, R. J. (2003). Testing for spatial correlation in nonstationary binary data, with application to aberrant crypt foci in colon carcinogenesis. Biometrics, 50,
[7] Bennett, R. J. and Haining, R. P. (1985). Spatial structure and spatial interaction modeling approaches to the statistical analysis of geographic data. Journal of Royal Statistical Society A, 48,
[8] Baddeley, A., Turner, R. and Hazelton, M. (2005). Residual analysis for spatial point processes. Journal of Royal Statistical Society B, 67,
[9] Billingsley, P. (1995). Probability and Measure. Wiley, New York.
[10] Brunsdon, C., Aitkin, M., Fotheringham, S. and Charlton, M. (1999). A comparison of random coefficient modeling and geographically weighted regression for spatial non-stationary regression problems. Geographical and Environmental Modelling, 3,
[11] Besag, J. and Newell, J. (1991). The detection of clusters in rare diseases. Journal of Royal Statistical Society A, 154,
[12] Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York.
[13] Cliff, A. D. and Ord, J. K. (1972). Test for spatial autocorrelation among regression residuals. Geographical Analysis, 4,
[14] Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models and Applications. Pion, London.
[15] Fingleton, B. (1983b). Loglinear models with dependent spatial data. Environment and Planning A, 15,
[16] Fotheringham, S. (1997). Trends in quantitative geography: I: stressing the local. Progress in Human Geography, 21,
[17] Fotheringham, S. (1999). Guest editorial: local modeling. Geographical and Environmental Modeling,
[18] Geary, R. C. (1954). The contiguity ratio and statistical mapping. The Incorporated Statistician, 5,
[19] Getis, A. and Aldstadt, J. (2004). Constructing the spatial weights matrix using a local statistic. Geographical Analysis, 36,
[20] Getis, A. and Ord, J. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis, 24,
[21] Griffith, D. (2002). A spatial filtering specification for the auto-Poisson model. Statistics and Probability Letters, 58,
[22] Lee, S. I. (2004). A generalized significance testing method for global measures of spatial association: an extension of the Mantel test. Environment and Planning A, 36,
[23] Lin, G. (2003). A spatial logit association model for cluster detection. Geographical Analysis, 35,
[24] Lin, G. and Zhang, T. (2005). Loglinear residual tests of Moran I autocorrelation and their applications to Kentucky Breast Cancer Data. Geographical Analysis, to appear.
[25] Messner, S., Anselin, L., Baller, R., Hawkins, D., Deane, G. and Tolnay, S. (1999). The spatial patterning of county homicide rates: an application of exploratory spatial data analysis. Journal of Quantitative Criminology, 15,
[26] Moran, P. A. P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society Series B, 10,
[27] Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37,
[28] Neter, J., Kutner, M. H., Nachtsheim, C. and Wasserman, W. (1996). Applied Linear Statistical Models, 4th Edition. McGraw Hill, New York.
[29] Oden, N. (1995). Adjusting Moran's I for population density. Statistics in Medicine, 14,
[30] Pierce, D. and Schafer, D. (1986). Residuals in generalized linear models. Journal of American Statistical Association, 81,
[31] Rogerson, P. A. (1999). The detection of clusters using a spatial version of the chi-square goodness-of-fit statistics. Geographical Analysis, 31,
[32] Sen, A. (1976). Large sample-size distribution of statistics used in testing for spatial correlation. Geographical Analysis, 9,
[33] Sokal, P. R., Oden, N. L. and Thomson, B. A. (1998). Local spatial autocorrelation in a biological model. Geographical Analysis, 30,
[34] Tango, T. (1995). A class of tests for detecting general and focused clustering of rare diseases. Statistics in Medicine, 14,
[35] Waldhor, T. (1996). The spatial autocorrelation coefficient Moran's I under heteroscedasticity. Statistics in Medicine, 15,
[36] Walter, S. D. (1992). The analysis of regional patterns in health data. American Journal of Epidemiology, 136,
[37] Whittemore, A., Friend, N., Brown, B. and Holly, E. (1987). A test to detect clusters of disease. Biometrika, 74,
[38] Wrigley, N. (1985). Categorical Data Analysis for Geographers and Environmental Scientists. Longman, New York.

Figure 1: Type I error rates of I_r, I_DR and I_EBI under heterogeneity (α = 0.05). [Plot: rejection rate against η, with curves for I_r, I_EBI and I_DR.]

Figure 2: Rejection rate of I_DR with and without the spatial association term (α = 0.05). [Plot titled "Local Cluster": rejection rate against δ, with curves for I_DR without and with the term.]

Figure 3: Residual plots and QQ-plots in the presence of a local cluster (δ = 0.5). [Four panels: residual plot without the term, QQ plot without, residual plot with, QQ plot with. Residual plots show deviance residuals against index; QQ plots show sample quantiles against theoretical quantiles.]

Figure 4: Power functions of I_DR with and without the spatial association term. [Plot titled "Global and Local Trend": rejection rate against δ, with curves for I_DR without and with the term.]

Figure 5: Residual plots and QQ-plots in the presence of local and global clustering structures. [Four panels: residual plot without the term, QQ plot without, residual plot with, QQ plot with. Residual plots show deviance residuals against index; QQ plots show sample quantiles against theoretical quantiles.]

Figure 6: Deviance residuals of the null model for St. Louis homicides. [Map with labels for St. Louis, Madison and St. Clair; legend: deviance; scale bar in miles.]

Table 1: Loglinear model estimates and I_DR results for St. Louis homicides.

Columns: Models; β̂_1; p-value; G^2; d.f.; I_DR; p-value
Rows: Null; Spatial association I (St. Louis); Spatial association II (St. Louis)*; Ecological covariate (% of Blacks)

Note: variables captured by β̂_1 are in parentheses. Model I assigns 4 to St. Louis City, 2 to St. Clair county, and 1 to the other adjacent counties; Model II differs by assigning 2.5 to St. Clair county.
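As a rough illustration of how a coefficient β̂_1 and the weights in the note translate into the odds ratios discussed in the case study, the sketch below assumes that the odds ratio of a county with weight d relative to a county with weight 0 is exp(β̂_1 d), as in a uniform association model. The coefficient value is hypothetical and is not the estimate from Table 1; the function name is likewise invented for this example.

```python
import math

# Hypothetical coefficient, for illustration only; not the Table 1 estimate.
beta_hat = 0.5

# Weights as described in the note to Table 1.
model_i_weights = {
    "St. Louis City": 4.0,
    "St. Clair": 2.0,
    "Madison": 1.0,
    "St. Louis": 1.0,
}
# Model II differs only in the weight given to St. Clair.
model_ii_weights = {**model_i_weights, "St. Clair": 2.5}

def odds_ratios(beta, weights):
    """Odds ratio of each county relative to a county with weight 0."""
    return {name: math.exp(beta * d) for name, d in weights.items()}

or_model_i = odds_ratios(beta_hat, model_i_weights)
or_model_ii = odds_ratios(beta_hat, model_ii_weights)

for name in model_i_weights:
    print(f"{name}: Model I {or_model_i[name]:.2f}, "
          f"Model II {or_model_ii[name]:.2f}")
```

Under this assumption a county with weight 4 has the fourth power of the odds ratio of a county with weight 1, which is how the association term encodes the slope of the cluster.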
