Loglinear Residual Tests of Moran s I Autocorrelation and their Applications to Kentucky Breast Cancer Data

Size: px

Start display at page:

Download "Loglinear Residual Tests of Moran s I Autocorrelation and their Applications to Kentucky Breast Cancer Data"

Angela Kelly
6 years ago
Views:

1 Geographical Analysis ISSN Loglinear Residual Tests of Moran s I Autocorrelation and their Applications to Kentucky Breast Cancer Data Ge Lin, 1 Tonglin Zhang 1 Department of Geology and Geography, West Virginia University, Morgantown, WV, Department of Statistics, Purdue University, West Lafayette, IN This article bridges the permutation test of Moran s I to the residuals of a loglinear model under the asymptotic normality assumption. It provides the versions of Moran s I based on Pearson residuals (I PR ) and deviance residuals (I DR ) so that they can be used to test for spatial clustering while at the same time account for potential covariates and heterogeneous population sizes. Our simulations showed that both I PR and I DR are effective to account for heterogeneous population sizes. The tests based on I PR and I DR are applied to a set of log-rate models for early-stage and late-stage breast cancer with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables can sufficiently explain spatial clustering of early-stage breast carcinomas, but these factors cannot explain that for the late stage. For this reason, we used local spatial association terms and located four late-stage breast cancer clusters that could not be explained. The results also confirmed our expectation that a high screening level would be associated with a high incidence rate of early-stage disease, which in turn would reduce late-stage incidence rates. Introduction Linear or loglinear spatial regression models are common in spatial epidemiology (Best et al. 000). A set of ecological variables are often associated with disease rates or counts, and after a final model is derived, residuals can be visually inspected on a map for spatial clusters. For a linear model, a residual test of Moran s I for spatial autocorrelation can also be performed to detect spatial clustering for the unexplained regression errors. However, there is no corresponding spatial residual test of clustering for loglinear or Poisson regressions on count data, which was the challenge for the current study. The study was motivated by our inquiry into the spatial patterns of breast cancer incidents in Kentucky counties as they related to Correspondence: Ge Lin, Department of Geology and Geography, West Virginia University, Morgantown, WV glin@wvu.edu Submitted: August, 005. Revised version accepted: April 04, 006. Geographical Analysis 39 (007) r 007 The Ohio State University 93

2 Geographical Analysis the development stage of disease at diagnosis. Breast cancer staging at diagnosis is known to be associated with socioeconomic conditions, mammography screening services, and other variables (Yabroff and Gordis 003; Barry and Breen 005). As socioeconomic variables are often spatially autocorrelated (e.g., poor areas tend to be clustered), we expect clustering of breast cancer to occur, at least for the earlystate incidence rates. If there is no significant environmental cause of breast cancer, the clustering tendency should disappear once we introduce area socioeconomic variables. One way to test for the existence of spatial clustering is to set up a spatial autocorrelation test, such as Moran s I, for Guassian or continuous data by using the permutation test of residuals for Moran s I in a linear regression (Cliff and Ord 1981). Converting incidence to rate, however, is often less appealing than retaining the original count of each in spatial data analysis (Griffith and Haining 006). In addition, Moran s I test assumes that attribute values (e.g., disease prevalence) are either in equal probability among all the geographic units or from a single parent distribution. These assumptions are often violated in the permutation test of Moran s I in disease data due to heterogeneous regional populations and large variation in sparsely populated areas (Besag and Newell 1991). Although there have been several extensions of Moran s I to account for population heterogeneity (Oden 1995; Waldhor 1996; Assuncao and Reis 1999), none of them can include potential ecological covariates. For example, Oden proposes a test statistic Ipop that applies regional population sizes to adjust Moran s I. However, because of a minor modification in the null hypothesis, Ipop is no longer comparable to the original Moran s I (Assuncao and Reis 1999). Consequently, Ipop cannot be extended to evaluate covariates and spatial autocorrelation simultaneously. A spatial logit association model can include potential explanatory variables and identify high-value and low-value clusters (Lin 003; Zhang and Lin 006). It does not, however, have a global measure of spatial clustering that would complement the modeling process for local spatial logit associations. Jacqmin-Gadda et al. (1997) propose a homogeneity score test of a generalized linear model that can also include potential explanatory variables in a correlation test. The test is based on residuals in generalized linear models, a design that its authors claimed to correspond to the permutation test of linear regression errors. However, as we will later show, the test does not adjust variance for heterogeneous population sizes. In addition, because its weight matrix is not necessarily spatially constructed, its null hypothesis is not necessarily spatial independence as one would assume when applying Moran s I test. Consequently, it is not straightforward to use the score test in a generalized linear model that includes spatial correlation and heterogeneity. The purpose of this article is to extend the permutation test of residuals of the Moran s I autocorrelation to generalized linear models so that spatial analysts can directly test for spatial clustering while controlling for potential ecological covariates. While no one has proposed either deviance or Pearson residuals in the spatial statistic literature, Waller and Gotway (004) point out a form close to Pearson 94

3 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation residuals as a way to account for inflated variance in Moran s I under heteroskedasticity. In this article, we demonstrate that permutation tests are applicable to Pearson or deviance residuals of loglinear models in the same way as in the traditional permutation test of residuals for Moran s I. In the remaining sections of the article, we first review the permutation test of Moran s I by using regression residuals and then reformulate it in the context of Poisson data by using the Pearson and deviance residuals of a loglinear model. We then evaluate their statistical properties under the null hypothesis of spatial independence in a series of simulated patterns and apply the Pearson residual Moran s I test and deviance residual Moran s I test to breast cancer incidence in Kentucky counties that include potential ecological covariates. From linear to loglinear residual tests of Moran s I Let us consider a study area that has m regions indexed by i. Let z i be the variable of the interest in region i. Moran s I (Moran 1950) is expressed as: P m P m i¼1 j¼1 I ¼ w ijðz i zþðz j zþ hp ih m i¼1 ðz i zþ Pm P i ð1þ =m i¼1 j¼1 w ij where z ¼ P m i¼1 z i=m, w ij is an element of a spatial weight matrix W, with 1 being adjacent for regions i and j and is 0 otherwise. Under the assumption of homogeneity, the moments of Moran s I can be computed under the randomization assumption, which assumes that the observations are generated from a set of random permutations of the observed values. The pffiffiffiffiffiffiffiffiffiffiffiffiffi significance of Moran s I can be determined by the quantity I std ¼½I EðIÞŠ= VarðIÞ under the asymptotic normality assumption, where the respective mean and variance are obtained from the random permutation scheme suggested by Cliff and Ord (1981, p. 1). A significant and positive value of Moran s I (i.e., I std 4z a/ ) usually indicates a positive autocorrelation, such as the existence of either high-value or low-value clustering. A significant and negative value of Moran s I (i.e., I std o z a/ ) usually indicates a negative autocorrelation, such as a tendency toward the juxtaposition of high values with low values. If there is no spatial dependence, I is often close to the mean value 1/ (m 1), which can be approximated by 0 if m is large. In order to account for ecological covariates, it is often suggested that z i be taken as the ith regional residual of a linear regression model when testing for spatial autocorrelation (Cliff and Ord 1981, p. 198). For data resulting purely from a random process, the calculation of Moran s I in equation (1) based on the residuals is the same as the one based on the observed values. It can be concluded that I std is approximately N(0, 1) as m!1under some regularity conditions (Sen 1976). Schmoyer (1994) also demonstrated the validity of a general form of permutation test based on the residuals of a linear model with independent and identically distributed errors. In Poisson models with heterogeneous population sizes, these justifications cannot be applied (Waldhor 1996; Assuncao and Reis 1999). 95

4 Geographical Analysis The residuals of loglinear models are well established in the statistical literature in a nonspatial context. We can extend them to a spatial context to test for spatial autocorrelation. In his synthesis of previous studies, Agresti (1990, p. 431) showed that the Pearson and log-likelihood (deviance) residuals of loglinear models are asymptotically multivariate normal with mean 0 and the variance covariance matrix a projection matrix. This particular asymptotic form of the residuals is analogous to that of linear regression residuals. Following this reasoning, we can devise a log-rate model that closely resembles a linear regression model to account for potentially heterogeneous population sizes and ecological covariates. We apply the permutation test based on the asymptotic normality assumption, so that Moran s I based on the residuals of a log-rate model is analogous to Moran s I based on regression residuals. In the following, we specify the residuals of a log-rate model for the permutation test of Moran s I. Let n i be the observed counts for a Poisson random variable N i at region i (i 5 1,..., m), and let x i and y i be the ith regional population and relative risk, respectively. Then, N i are assumed to be independently Poisson distributed, or N i Poisson (E i y i ), where E i is the expected count for region i. The null hypothesis that all the relative risks (y 1,...,y m ) are 1 can be stated (Elliott et al. 000, p. 13). Suppose that a set of explanatory variables (x i,1,...,x i,k 1 ) are observed together with n i. A log-rate model can then be expressed as logð^n i =x i Þ¼^b 0 þ ^b 1 x i;1 þ...þ ^b k 1 x i;k 1 ðþ In equation (), nˆ i and log(nˆ i/x i ) are, respectively, the estimated count and estimated log rate for region i, ^b 0 is the estimate of grand mean, and the other ^bs are parameter estimates for explanatory variables. Equation () is often estimated by moving log(x i ) to the right-hand side, so that it becomes the offset. logð^n i Þ¼logðx i Þþ^b 0 þ ^b 1 x i;1 þ...þ ^b k 1 x i;k 1 ð3þ Based on these notations, the conventional Pearson residual for region i is defined as r i;p ¼ n i ^n i ^n 1= i ð4þ and the conventional deviance residual (Agresti 1990, p. 45) for region i is defined as r i;d ¼ signðn i ^n i Þ½n i logðn i =^n i Þ n i þ ^n i Š 1= ð5þ where sign(a) is1ifa40, is 0 if a 5 0, and is 1ifao0. We can test for spatial autocorrelation based on the residuals in equation () by replacing z i in (1) with either the Pearson residual in (4) or the deviance residual in (5). When z i is replaced with r i,p, for example, Moran s I becomes Pearson residual Moran s I, and we denote it as I PR, which can be expressed as 96

5 Ge Lin and Tonglin Zhang I PR ¼ P m P m i¼1 j¼1 w ij 4 P m i¼1 n i ^n i ^n 1= i n i ^n i ^n 1= i 1 m 1 m Loglinear Residual Tests of Moran s I Autocorrelation X m l¼1 X m k¼1 n l ^n l ^n 1= l n k ^n k ^n 1= k! 0 nj ^n j 1 m ^n 1= m l¼1 j! 3 h =m5 P P i m i¼1 j¼1 w ij n l ^n l ^n 1= l 1 A ð6þ When z i is replaced with r i,d, Moran s I becomes deviance or log-likelihood residual Moran s I and we denote it as I DR. To implement these tests, the parameters of explanatory variables together with the residuals are first estimated in the modelfitting process. Then, I PR and I DR are calculated. Finally, the means and variances of I PR and I DR are computed according to the random permutation scheme (Cliff and Ord 1981). The P values of I PR and I DR can be computed based on the asymptotic normality assumption when m is large and the number of covariates is small. If the independence model is rejected while all the ecological covariates can be found and modeled, the residuals of I PR or I DR from the model should be completely lacking in spatial autocorrelation. Otherwise, it may indicate spatial clustering that cannot be explained. Although no one has specified both I DR and I PR as above, it is worth noting the difference between I PR and the score test of the residuals in a generalized linear model (Jacqmin-Gadda et al. 1997). The score test, which also accounts for explanatory variables, was derived from a random-effect approach by neglecting an additional term of overdispersion. Based on its original formula, the test statistic is expressed as T ¼ Xm X m w ij ðy i ^m i ÞðY j ^m j Þ ð7þ i¼1 j6¼i where Y i is the variable of interest in region i as assumed from the exponential family, ^m i is an estimate of E(Y i ), and w ij is an element from the weight matrix. For a Poisson model, one can take Y i 5 N i so that y i is the number of the observed counts n i, and ^m i is the estimated count nˆ i in our notations. Then, T ¼ Xm X m w ij ðn i ^n i Þðn j ^n j Þ i¼1 j¼1;j6¼i This particular form of T is different from the formulation of I PR given by (6) because we use z i ¼ðn i ^n i Þ=^n 1= i instead of z i 5 n i nˆ i. This difference also leads to a variance adjustment problem for T, because the variance of Y i ^m i still depends on the i th regional population size x i even when ^m i equals the true parameter m i.in addition, w ij in T may not be based on spatial relationships, such as spatial adjacency or distance, and it adds complications in designing a proper weight matrix and in deriving the P value. Pearson residuals, on the other hand, are well established, and it is much easier to calculate or implement Pearson residuals in a stan- 97

6 Geographical Analysis dard statistical package than it is for the score test. Because of these differences, we will not compare I PR and I DR with the score test in our simulations. Simulation study To evaluate Pearson residuals I PR and deviance residuals I DR for variance adjustments, we carried out simulations under the null hypothesis of no spatial clustering. We included two alternative test statistics, Oden s Ipop and Assuncao and Reis s EBI, both of which are designed to account for heteroskedasticity without controlling for ecological covariates. As a reference point, we also included the original Moran s I, denoted by I r, by taking z i 5 n i /x i in equation (1). For ease of discussion, we list some expressions for Oden s Ipop and Assuncao and Reis s EBI below. Oden s Ipop is defined as P Ipop ¼ n m P m i¼1 j¼1 M ij ðe i d i Þðe j d j Þ nð1 bþ P m i¼1 M ii e i nb P m i¼1 M ii d i bð1 bþ x P m i¼1 P m j¼1 d id j M ij x P m i¼1 d im ii ð8þ where b ¼ n=x, e i 5 n i /n, d i 5 x i /x, Mij ¼ w p ffiffiffiffiffiffiffiffi P ij= d i d j, n ¼ m i¼1 n i and x ¼ P m i¼1 x i. Oden also derived the mean and variance of Ipop. If there is no spatial dependence, Ipop is close to the mean value 1/(x 1). Oden s Ipop weights a pair of regional populations together with the spatial weight matrix by using Mij. While w ij in Mij is still a spatial weight in the W matrix, its diagonal element w ii 6¼0. This feature makes Ipop incomparable with Moran s I when spatial clustering and population heterogeneity both exist. Noticing this difference, Assuncao and Reis (1999) proposed their version of population-adjusted p Moran s I or EBI. In their definition, z i ¼ðp i bþ= ffiffiffiffi n i, where b 5 n/x, ni 5 a1b/x i, a 5 s b/(x/m) ands ¼ P m i¼1 x iðp i dþ=x. Hence, P m P i¼1 j6¼i w p i b ij pffiffiffiffi 1 X! m p l b pj b pffiffiffiffi pffiffiffiffi 1 X m p l b pffiffiffiffi n i m l¼1 n l n j m l¼1 n l EBI ¼ " P m p i b l¼1 pffiffiffiffi 1 X # m p l b ð9þ h pffiffiffiffi =m Pm P i m m l¼1 i¼1 j¼1;6¼i w ij n i n l Although they did not notice the connection, Assuncao and Reis point out that n i can be set equal to b/x i if n i o0. If all n i o0, then I PR and EBI are identical in the absence of covariates. The justification and evaluation of EBI, in this case, would be directly applicable to I PR. However, we never encountered this case in our simulations, and so the general form of EBI is always assumed. Simulations were based on a 0 0 lattice. We defined a spatial weight matrix according to spatial adjacency, that is, w ij 5 1 if two lattice points (i, j) are adjacent, 0 otherwise. For I pop, we followed Oden and took w ii 5. These weight assignments are identical to those used in Assuncao and Reis s (1999) simulations. We denote l i and x i, respectively, as the disease rate and population size at lattice 98

7 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation (a1) Half Sparse Half Dense (a3) One Cluster of Dense (a5) One Cluster of Sparse ξ= Y 10 Y 10 ξ=1 Y 10 ξ= ξ=1 5 ξ= ξ= X X X (a) Quad Sparse Quad Dense (a4) Two Clusters of Dense (a6) Two Clusters of Sparse ξ=1000 ξ=1 15 ξ= ξ=1 Y 10 Y 10 ξ=1 Y 10 ξ= ξ=1 ξ= ξ= ξ= X X X Figure 1. Population patterns for (a1) (a6) in the unit of point i, and N i as the Poisson random variable with the mean value m i 5 l i x i.in each run, 400 Poisson random variables with the constant rate l i were generated independently on the lattice points, so that there would be no spatial clustering of disease rates. In order to compare a wide range of heteroskedasticity, we included a spatial homogeneous pattern, together with six heterogeneous patterns similar to those used in Waldhor s (1996) simulations (Fig. 1). We used u i and v i to represent the vertical axis and the horizontal axis, respectively, so that a lattice point i can be easily identified by i 5 0(u i 1)1v i. The seven spatial population patterns were defined in the units of 10 5 (e.g., x i 5 1 means and x i means ), and they are: (a0) Homogeneous population with x i 5 1 for all i 5 1,..., 400. (a1) Half sparse and half dense: x i 5 1ifu i 10, x i if u i 11. (a) One quad sparse and one quad dense: x i 5 1ifu i 10 and v i 10 or u i 11 and v i 10; x i if u i 10 and v i 11 or u i 11 and v i 10. (a3) All sparse except one cluster with a dense population: x i 5 1, except when qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi lattice point i is within ðu i 5Þ þðv i 5Þ 3, in which x i

8 Geographical Analysis (a4) (a5) (a6) All sparse except two clusters with dense populations: x i 5 1, except qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi when lattice point i is within ðu i 5Þ þðv i 5Þ 3 or qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ðu i 15Þ þðv i 15Þ 3, in which x i All dense except one cluster with sparse population: x i , except when qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi lattice point i is within ðu i 5Þ þðv i 5Þ 3, in which x i 5 1. All dense except two clusters with sparse populations: x i , except qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi when lattice point i is within ðu i 5Þ þðv i 5Þ 3 or qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ðu i 15Þ þðv i 15Þ 3, in which x i 5 1. For a given pattern of population and rate distributions, we ran the simulations 10,000 times and calculated I r, I PR, I DR, Ipop, and EBI for each run. We fixed the disease rate at for all locations in the seven heterogeneous patterns (a0 a6), so that E(N i ) 5 V(N i ) 5 10 if x i 5 1 unit and E(N i ) 5 V(N i ) 5 10,000 if x i units. Under the null hypothesis, we assessed the validity of the permutation test by comparing (a) the observed variance for each test statistic versus corresponding permutation variance and (b) the rejection rates at the 0.05 nominal value (a ). In the preliminary analysis, we also compared the observed and permutation means for each of I r, I D, I DR, and EBI. As the difference between the two could be ignored, we will not compare them. Table 1 lists the simulation results of the observed and permutation variances of I r, I PR, I DR, Ipop, and EBI based on the seven patterns (a0 a6), where s S denotes the observed sample variance from the simulation and s R denotes estimated variance from the random permutation scheme. As s S can be treated as the true values, we compared the ratio of the values of s S and s R. If the ratio is close to 1, it suggests that estimated variance is close to the permutation variance. The results show that, when the populations were homogeneous, the ratios between s S and s R were very close to 1. When the populations were heteroge- Table 1 Comparison of Variance for Selected Test Statistics I r ( 0.001) I PR ( 0.001) I DR ( 0.001) EBI ( 0.001) Ipop ( 10 1 ) s S s R s S s R s S s R (a0) (a1) (a) (a3) (a4) (a5) (a6) s S s R s S s R 300

9 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation Table Rejection Rates (%) for Selected Test Statistics I r I PR I DR EBI Ipop (a0) (a1) (a) (a3) (a4) (a5) (a6) neous, the rate-based test I r produced predominantly biased variance estimates, a result consistent with Waldhor s simulations. When a set of sparsely populated points or grids were clustered, the permutation tended to underestimate the true variance. In the cases of one-cluster (a5) and two-cluster (a6) patterns, the permutation variances were underestimated by about 9.% and 18.%, respectively. When the population distribution was characterized by (a1) or (a), the permutation underestimated true variances by about 50%. In contrast, the ratios between s S and s R for I PR, I DR, Ipop, and EBI were all very close to 1 in patterns (a1) to (a6). The results for I PR and I DR suggest that the estimated values of the variance are trustworthy and that they can effectively reduce potentially inflated variance. As the ratios were very close to 1 for Ipop and EBI, I PR and I DR, it might not be critical for using I PR or I DR if one simply wants to account for heteroskedasticity. Nevertheless, I PR or I DR have an advantage when one wishes to incorporate covariates. Table lists the rejection rates of the permutation test based on I r, I PR, I DR, Ipop and EBI according to the seven spatial population patterns. When populations are homogeneous, all the test statistics had a type I error rate between 4.60% to 5.03% with an I DR above 5% by a fraction. When populations were heterogeneous, all the test statistics except I r had a consistent rejection rates around 5%, suggesting that I PR, I DR, Ipop, and EBI were all reasonable under the null hypothesis and they can effectively correct potentially inflated variance with a satisfactory level of type I errors. The results for I r, in contrast, failed to reject the null hypothesis with the accepted nominal value 0.05 due to an inflated variance. Although some patterns (a3 and a4) registered lower rejection rates, all the patterns had a rejection rate more than 5%. In the cases of one or two sparsely populated regions clustered in a densely populated study area (patterns a5 and a6), the rejection rate for a spatially random pattern was more than 40%. These results essentially confirmed Waldhor s simulation results for I r. Even though the performance of the permutation test for I PR and I DR is comparable with the performance from Ipop and EBI, I PR and I DR have the flexibility of incorporating potential covariates in to their tests for spatial clustering, which we will demonstrate in the following section. 301

Geographical Analysis Application Data and variables County-level breast cancer and at-risk population data were obtained from the Kentucky cancer registry for the years 1996 000.

10 Geographical Analysis Application Data and variables County-level breast cancer and at-risk population data were obtained from the Kentucky cancer registry for the years The data set reports breast cancer cases according to their developmental stage as follows: 0, a benign tumor; 1 an in situ tumor;, localized tumor; 3, a regional tumor; 4, a distant tumor; and 5, usually used to code patients who died with later stage disease without an autopsy report on file. For the purpose of our analysis, we deleted the stage 0 cases. Following the U.S. Surveillance, Epidemiology, and End Results (SEER) Program definitions, we regrouped the in situ and localized tumors as early stage and the regional, distant and unknown tumors as late stage. Generally, early-stage breast carcinomas are confined to the breast and can often be treated successfully, whereas late-stage carcinomas tend to spread beyond the breast and are often fatal. There were a total of 16,055 breast cancer cases during this period, and 80% of the cases were between age 35 and 75. On average, there were 96.7 per 100,000 women diagnosed at an early stage and 36.9 per 100,000 at a later stage. If the overall breast cancer incidence rate is constant across counties, the early-stage breast cancer rate should, theoretically, be negatively related to the late-stage breast cancer rate. Based on 10-year age-group incidence data during the same period, we calculated the indirect standard incidence rate (SIR) or relative risk for each county (Waller and Gotway 004, p. 15), and the results are shown in Figs. and 3 in six quantiles. We observed that a strip of counties along the boundary between Appalachian and non-appalachian regions had an elevated early-stage breast cancer risk, as did the westernmost counties. With regard to late-stage breast cancer, there was a cluster of counties with a high incidence rate around the Figure. Early-stage standard relative risk in Kentucky. 30

11 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation Figure 3. Late-stage standard relative risk in Kentucky. northeastern Appalachian area; counties along the southern border of the state also had a higher relative risk. As screening for early-stage breast cancer tends to be associated with age, socioeconomic conditions, and access to health care within a geographic area (Freeman 1989), we included additional county variables while testing spatial clustering for both early-stage and late-stage breast cancer incidents. We obtained county population and socioeconomic data from the 000 U.S. Census. Age is expected to be positively associated with breast cancer incidents, and we obtained median age and population age groups as potential control variables. The socioeconomic conditions in a county can be related to breast cancer in two ways (Bradley, Given, and Robert 001). On the one hand, breast cancer is more prevalent among White women or those with higher socioeconomic status, and so counties that have a higher median family income (MEDFINC) are expected to have greater breast cancer incidence rates. On the other hand, women who have a higher socioeconomic status tend to be more aware of and more able to afford breast cancer screening than are those who have a lower socioeconomic status. Consequently, although counties that have better socioeconomic conditions may have a higher incidence rate of early-stage disease, they may not necessarily have a higher incidence rate of late-stage disease than do counties with worse socioeconomic conditions (Roche, Skinner, and Weinstein 00). We relied on several other data sources for access-to-care measures. We obtained breast cancer screening rates from the 1997 and 1998 Behavioral Risk Factor Surveillance Systems (BRFSS). Owing to the confidentiality concern, the original release of cancer screening level had some counties grouped together, and so we divided rates into tertiles of high (H-screening: 470%), middle (M-screening: 65 70%), and low (L-screening: o65%). A higher breast cancer screening level 303

12 Geographical Analysis (primarily by mammography) is expected to be associated with higher early-stage incidence rates, and negatively associated with late-stage rates. In the preliminary analysis, we found that the differences between low and middle tertiles were minimal, and we grouped them together in the final analysis. We also obtained the 1998 population-to-primary care physician ratio (POP/PMD) in 1998 from the Kentucky Department of Public Health; a lower ratio indicates that a physician can pay more attention to each patient. As breast cancer screening is most frequently recommended in a primary care setting, having a greater number of primary care physicians should help to reduce the incidence of all stages of breast cancer. For this reason, we expected the population-to-physician ratio to be negatively associated with both early-stage and late-stage breast cancer rates. Finally, we used data from a geographic information system to derive geographic access measures for Kentucky counties. We used the TIGER file from the 000 U.S. census to derive a measure of access to major highways. A county is coded 1 if a major national highway (HWY) passes through it, and 0 otherwise. It was expected that highway access would increase access to health care facilities and increase early-stage breast cancer diagnoses. We also divided counties into within and outside the Appalachian region (Appalachian) as encircled by the bold line in Figs. and 3. Counties within the Appalachian region generally are economically distressed and medically underserved, and the all-cause mortality rate tends to be much higher within the region (Haaga 004; Mather 004). Analysis To test Moran s I for spatial clustering, we first used Pearson residuals I PR and deviance residuals I DR for the null model without any covariates. Here, I PR and I DR in the null loglinear model correspond to the traditional Moran s I without any covariates. We then introduced explanatory variables in the so-called ecological model. In the preliminary analysis, we found that college education, poverty rate, and MEDFINC were highly correlated; we used MEDFINC in the final analysis because it was the most significant variable in terms of the likelihood ratio test. We also experimented with different age variables and found that proportions of age together with age 65 and over were not significant. We selected median age, because it was consistently significant in the ecological models for both stages; the greater the age, the greater the likelihood of breast cancer. We expected that the null model would indicate some spatial clustering through significant spatial autocorrelation and that the correlation should be weakened or disappeared once the explanatory variables were introduced. As most of explanatory variables in the literature are tapped for the early stage, their effects on the late stage are expected to be weaker. If the autocorrelation was found to persist or could not be explained by the ecological model, our task was to locate spatially clustered counties and provide our findings to epidemiologists and cancer specialists for further identification of the etiologies associated with breast cancer clusters. We used a spatial mixed 304

13 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation model to search for high-value and low-value spatial clusters by including additional local spatial association terms, as demonstrated by Lin (003). The method makes use of the vector of the spatial weight matrix, with 1 being adjacent to i inclusive (i.e., including the ith region itself), and 0 being otherwise. If a cluster of counties associated with the ith vector could significantly reduce the log-likelihood (deviance), it indicates a local association or cluster centered around the ith county. If the association is positive, it suggests a high-value cluster. If the association is negative, it suggests a low-value cluster (Lin and Zhang 004). After controlling for pockets of high-value and low-value clustering and ecological covariates, we would then expect an insignificant residual Moran s I. Results Table 3 lists Moran s I coefficients and the t values of ecological variables from the early-stage log-rate models. In the null model, both Pearson residual Moran s I PR and deviance residual Moran s I DR were positively significant, suggesting a clustering tendency. Once the explanatory variables were introduced into the ecological model, however, I PR and I DR both became insignificant based on their corresponding residuals. Hence, a spatial clustering tendency in the null model reflected spatial patterning of age structure, socioeconomic status, and access to care. In particular, counties with an older median age, a high level of breast cancer screening, or with easy highway access were associated with higher detection rates of early-stage breast cancer, whereas the POP/PMD and location in the Appalachian region were negatively associated with detection rates. These results were all consistent with our expectations and the existing literature. Finally, county Table 3 Early-Stage Breast Cancer Log-Rate Models Models Null Ecological Coefficient t value Coefficient t value (Intercept) Median age MEDFINC HWY Appalachian High screening POP/PMD Clustering tests Moran s I PR Moran s I DR Summary statistics G G NOTE: G stands for likelihood ratio w and t value is the ratio of estimated value and its standard error. POP/PMD, population-to-primary care physician ratio; MEDFINC, median family income. 305

14 Geographical Analysis MEDFINC had a negative and weak (P value o0.10) association with the early breast cancer detection rate. While it has been reported widely that women in lessdeveloped counties, such as those in the Appalachian area of Kentucky, are less likely to have their breast cancer diagnosed early (Gregorio et al. 00), it seemed counterintuitive to us that a higher county MEDFINC would be associated with a lower county rate of early-stage breast cancer. When we added only MEDFINC to the null model, the coefficient for MEDFINC was positive and significant. The weak association, therefore, may reflect an effect when age, the level of breast cancer screening, and other access-to-care variables were taken into account. Turning to the results from the late-stage log-rate models (Table 4), we found that Moran s I PR and I DR were significant for both the null and ecological models and that the explanatory variables in the ecological model were insufficient to account for the clustering tendency of the late-stage incidence rates. Three variables that remained significant were median age, population-to-physician ratio, and high screening level. The coefficients for the median age and population-to-physician ratio remained consistent with the early-stage model. However, the late-stage rate became negatively associated with a high screening level. This result was consistent with our expectation that a high screening level was associated with high rates of early-stage diagnoses and low rates of late-stage disease. To further pinpoint the core of clustered counties unexplained by the ecological model, we deleted insignificant variables except median age from the Table 4 Late-Stage Breast Cancer Log-Rate Models Models Null Ecology Mixed Coefficient t value Coefficient t value Coefficient t value (Intercept) Median age MEDFINC HWY Appalachian H-screening POP/PMD Barran-cluster Greenup-cluster Marshall-cluster Union-cluster Clustering tests Moran s I PR Moran s I DR Summary statistics G 5 39 df G 5 94 df G 5 51 df 5 11 NOTE: G stands for likelihood ratio w and t value is the ratio of estimated value and its standard error. POP/PMD, population-to-primary care physician ratio; MEDFINC, median family income. 306

15 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation Figure 4. High-value and low-value clusters of late-stage breast cancer in Kentucky. ecological model and applied a stepwise regression by including local spatial association terms. We identified four local association terms, none of which overlapped geographically. Each core county and its adjacent counties constituted a cluster, and including the core county only would not significantly reduce the clustered effect. Except for a cool spot around Barran County, the other three core counties represented the centers of three elevated late-stage clusters (Fig. 4). For instance, the rate in Union County and its adjacent counties was times the rates of other counties not included in the clusters. By including these clusters in the final model, both I PR and I DR were found to be insignificant, suggesting the disappearance of the clustering tendency once these clusters were accounted for in the model. It is worth mentioning that, if we dropped Marshall County, the median age effect would be significant, but the clustering effect would still remain. It suggests that a greater proportion of aging population around Marshall and its adjacent counties contributed to this cluster. It is also worth mentioning that Jefferson County, where Louisville is located, had a very low late-stage rate, but its adjacent counties each had a relatively high rate. Although including Jefferson County in the final model would reduce the log-likelihood ratio and the t values for I PR and I PD, counties around Jefferson County would not constitute a cluster. 1 Even though the main purpose of using I PR and I DR was to account for heteroskedasticity, new insights have been gained from analyzing I PR and I DR in the model-fitting processes. When the five explanatory variables were sufficient to reduce the spatial clustering effect of early-stage breast cancer, they highlight the underlying socioeconomic and health access processes that operate in geographic space. When known explanatory variables were not sufficient to reduce the spatial clustering effect as in the case of late-stage breast cancer, local spatial association terms can be used to further identify local clusters. The identification of spatial clusters may help to further reveal unique geographic characteristics in the clus- 307

16 Geographical Analysis tered areas associated with cancer disparities. Without measuring I PR or I DR,we would not be able to evaluate potential clustering effect in the traditional log-rate or Poisson regression models. Conclusions Although the asymptotic validity of permutation tests has been demonstrated in the literature and Pearson residuals and deviance residuals of loglinear models are asymptotically normal, no one has evaluated and applied them in the spatial context. In the current study, we have bridged Pearson residuals and deviance residuals of loglinear models with the permutation test of Moran s I under the asymptotic normality assumption. Our simulation study showed that both test statistics I PR and I DR were effective in reducing inflated variance caused by heterogeneous populations, and they both had an acceptable type I error rate. In addition, we tested both based on a set of log-rate models or Poisson regressions for early-stage and late-stage breast cancer incidence data, together with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables were sufficient to account for spatial clustering of early-stage breast carcinomas with access-to-care measures, such as breast cancer screening and number of primary care providers being more persistent than county MEDFINC. After controlling for age, access-to-care measures, and regional distress factors in the Appalachian counties, the purported positive association between higher socioeconomic and early-stage breast cancer could be substantially weakened. For late-stage carcinomas, two salient and persistent factors were level of breast cancer screening and POP/PMD. In contrast to the finding that a high screening level was associated with a high incidence rate of early-stage breast cancer, the late-stage incidence rate was negatively associated with breast cancer screening level. This result confirmed our expectation: a high screening level is associated with a high incidence rate of early-stage diagnoses, which in turn reduces late-stage incidence rates. When the two access variables failed to reduce the spatial clustering tendencies from late-stage breast cancer, we searched for a local spatial association based on the likelihood ratio test. We located four clusters: one low-value cluster around Barran County and three high-value clusters. Two of the high-value clusters were adjacent to each other near the western corner of Kentucky. These unexplained clusters provided the basis for further investigation of the etiological and ecological factors of late-stage breast cancer in Kentucky. It should be pointed out that, even though we include median age, the age effect may not be fully accounted for in our model: future work should explore ways to account for both age and ecological effects while testing for spatial clustering. Spatial regressions have been widely used, but their use with the permutation tests of residuals either in linear or loglinear models is rarely seen. An advantage of the loglinear residual permutation test over the linear residual permutation test is that the former can account for potential spatially heterogeneous populations, 308

17 Ge Lin and Tonglin Zhang Loglinear Residual Tests of Moran s I Autocorrelation which makes it a viable alternative in the log-rate modeling of disease rates, as demonstrated in our study. The method is expected to complement some spatial cluster tests, such as the spatial scan statistic (Kulldorff 1997) and G statistic (Getis and Ord 199). In addition, the ability to show spatial clustering in I PR and I DR is complementary to disease mapping, which intends to display true disease risks while controlling for heterogeneous populations and regional risk factors (Lawson and Clark 00). Finally, a local version of loglinear residuals can be developed as an exploratory tool to complement other local indicators of spatial association (Anselin 1995). Acknowlegements Data used in this publication were provided by the Kentucky Cancer Registry, Lexington, KY. We would like to acknowledge helpful comments from Linda Pickle and Robert Hanham. We would also like to thank three reviewers and the editor for their comments and suggestions. Note 1 We also analyzed the late-stage cases versus all cases at the patient level. The results were consistent. Population-to-physician ratio and breast cancer screening and close to highway were less likely to be associated with late-stage diagnoses, while residing in the Appalachian region was more likely to be associated with late-stage diagnoses. Counties around Jefferson County had an excessive late-stage breast cancer diagnoses among breast cancer patients. References Agresti, A. (1990). Categorical Data Analysis. New York: Wiley. Anselin, L. (1995). Local Indicators of Spatial Association-LISA. Geographic Analysis 7, Assuncao, R., and E. Reis. (1999). A New Proposal to Adjust Moran s I for Population Density. Statistics in Medicine 18, Barry, J., and N. Breen. (005). The Importance of Place of Residence in Predicting Late- Stage Diagnosis of Breast or Cervical Cancer. Health and Place 11, Besag, J., and J. Newell. (1991). The Detection of Clusters in Rare Diseases. Journal of Royal Statistic Society A 154, Best, N., K. Ickstadt, and R. Wolpert. (000). Spatial Poisson Regression for Health and Exposure Data Measured at Disparate Resolutions. Journal of the American Statistical Association 95, Bradley, C. J., C. W. Given, and C. Robert. (001). Disparities in Cancer Diagnosis and Survival. Cancer 91, Cliff, A. D., and J. K. Ord. (1981). Spatial Processes: Models and Applications. London: Pion. Elliott, P., J. Wakefield, N. Best, and D. Briggs. (000). Spatial Epidemiology. New York: Oxford University Press. 309

18 Geographical Analysis Freeman, H. P. (1989). Cancer and the Socioeconomically Disadvantaged. Ca-A Cancer Journal for Clinicians 39, Getis, A., and J. Ord. (199). The Analysis of Spatial Association by Use of Distance Statistics. Geographical Analysis 4, Gregorio, D. I., M. Kulldorff, L. Barry, and H. Samociuk. (00). Geographic Differences in Invasive and in situ Breast Cancer Incidence According to Precise Geographic Coordinates, Connecticut International Journal of Cancer 100, Griffith, D., and R. Haining. (006). Beyond Mule Kicks: The Poisson Distribution in Geographical Analysis. Geographical Analysis 38, Haaga, J. H. (004). Educational Attainment in Appalachia. Population Reference Bureau. ARC Series, Vol. 5. Washington, DC: Population Reference Bureau. Jacqmin-Gadda, H., D. Commenges, C. Nejjari, and J. Dartigues. (1997). Testing of Geographical Correlation with Adjustment for Explanatory Variables: An Application to Dyspnoea in the Elderly. Statistics in Medicine 16, Kulldorff, M. (1997). A Spatial Scan Statistic. Communications in Statistics, Theory and Methods 6, Lawson, A. B., and A. Clark. (00). Spatial Mixture Relative Risk Models Applied to Disease Mapping. Statistics in Medicine 1, Lin, G. (003). A Spatial Logit Association Model for Cluster Detection. Geographical Analysis 35, Lin, G., and T. Zhang. (004). A Method for Testing Low-Value Spatial Clustering for Rare Disease. Acta Tropica 91, Mather, M. (004). Households and Families in Appalachia. Population Reference Bureau ARC Series, Vol. 4. Washington, DC: Population Reference Bureau. Moran, P. A. P. (1950). A Test for the Serial Independence of Residuals. Biometrika 37, Oden, N. (1995). Adjusting Moran s I for Population Density. Statistics in Medicine 14, Roche, L. S., R. Skinner, and R. B. Weinstein. (00). Use of a Geographic Information System to Identify and Characterize Areas with High Proportions of Distant Stage Breast Cancer. Journal of Public Health Management Practice 8, 6 3. Schmoyer, R. L. (1994). Permutation Tests Form Correlation in Regression Errors. Journal of American Statistical Association 89, Sen, A. (1976). Large Sample-Size Distribution of Statistics Used in Testing for Spatial Correlation. Geographical analysis 9, Waldhor, T. (1996). The Spatial Autocorrelation Coefficient Moran s I Under Heteroscedasticity. Statistic in Medicine 15, Waller, L. A., and C. A. Gotway. (004). Applied Spatial Statistics for Public Health Data. New York: Wiley. Yabroff, Y. R., and L. Gordis. (003). Does Stage at Diagnosis Influence the Observed Relationship Between Socioeconomic Status and Breast Cancer Incidence, Case- Fatality, and Mortality? Social Sciences and Medicine 57, Zhang, T., and G. Lin. (006). A Supplemental Indicator of High-Value or Low-Value Spatial Clustering. Geographical Analysis 38,

Cluster Detection Based on Spatial Associations and Iterated Residuals in Generalized Linear Mixed Models

Biometrics 65, 353 360 June 2009 DOI: 10.1111/j.1541-0420.2008.01069.x Cluster Detection Based on Spatial Associations and Iterated Residuals in Generalized Linear Mixed Models Tonglin Zhang 1, and Ge