Cluster Detection Based on Spatial Associations and Iterated Residuals in Generalized Linear Mixed Models


Biometrics 65, June 2009
DOI: /j x

Tonglin Zhang (1) and Ge Lin (2)
(1) Department of Statistics, Purdue University, 250 North University Street, West Lafayette, Indiana, U.S.A.
(2) Department of Geology and Geography, West Virginia University, Morgantown, West Virginia, U.S.A.
tlzhang@stat.purdue.edu

Summary. Spatial clustering is commonly modeled by a Bayesian method under the framework of generalized linear mixed effect models (GLMMs). Spatial clusters are commonly detected by a frequentist method through hypothesis testing. In this article, we provide a frequentist method for assessing spatial properties of GLMMs. We propose a strategy that detects spatial clusters through parameter estimates of spatial associations, and assesses spatial aspects of model improvement through iterated residuals. Simulations and a case study show that the proposed method is able to consistently and efficiently detect the locations and magnitudes of spatial clusters.

Key words: Generalized linear models; Local clusters; Mixed effects; Moran's I statistic; Pearson residuals; Spatial heterogeneity.

1. Introduction

Spatial analyses of disease clusters are typically based on Poisson random variables under two separate approaches: Bayesian disease mapping and frequentist cluster detection. The disease-mapping approach, typically formulated under a Bayesian mixed effect model (Lawson and Clark, 2002), provides stable estimates for unit-specific relative risks while accounting for potential explanatory variables and extra dispersion due to spatial heterogeneity. This approach has the advantage of modeling an overall or global variation in disease rate over the entire study area, while at the same time capturing local variation. The frequentist approach tests for the existence of spatial clustering or clusters, but normally not under any mixed effect models.
Our interest is to use a frequentist method for cluster detection based on a spatial mixed effect model. In disease mapping, spatial generalized linear mixed effect models (GLMMs) are often specified in terms of relative risks for their fixed and random effects, where the relative risks are proportional to the expected incidence rates of Poisson random variables. These models are typically fitted by a Bayesian method that provides analytical results on ecological covariates and the expected relative risks. Although there are several frequentist methods for estimating GLMMs for time series data (Zeger, 1988; McCulloch, 1997), their use with spatial data is rare. Two exceptions highlight recent attempts to use frequentist methods for spatial GLMMs. One is the score test for global spatial correlation, which does not address local cluster detection (Jacqmin-Gadda et al., 1997). The other is the spatial logit or log-linear model for small-area surveillance specified by nonspatial random effects (Kleinman, 2005). In this article, we propose a cluster-detection strategy that combines the estimation of a GLMM with the identification of local clusters in the model selection process.

Most frequentist methods for spatial cluster detection ignore potential explanatory variables at the hypothesis testing stage. When the null hypothesis of no spatial clustering is rejected, most testing methods are able to identify local clusters in a more focused test. Afterward, researchers may begin to explore potential geographical or ecological explanatory variables that might contribute to the identified clusters. For example, after identifying two distant-stage breast cancer hot spots, Roche et al. further compared geographic factors between clustered and nonclustered areas and found that the two clusters tend to be linguistically isolated (Roche, Skinner, and Weinstein, 2002).
This suggests that a known risk factor might contribute to the observed pattern, but the separation of cluster detection from the control of the risk factor in the testing method makes it impossible to infer this fact statistically. Our goal is to add the capability of incorporating explanatory variables in the spatial cluster-detection process from a frequentist approach, which resembles the existing methods for disease mapping.

Because our proposed method is based on spatial GLMMs, its model-fitting process should include spatial effects. In the presence of random effects, model selection based on goodness of fit and information criteria, such as the Akaike information criterion, may not be able to capture extra information about autocorrelation. For example, a spatial logit association model in a GLM can include potential explanatory variables and identify high- and low-value clusters (Lin, 2003). This model, however, cannot deduce cluster information from the goodness-of-fit statistic and therefore cannot indicate spatial clustering effects. To search for a cluster in a model estimation process, it is crucial to retain extra information about spatial properties in the model improvement process (Baddeley et al., 2005). Recently, Loh and Zhu (2007) extended the spatial scan test (Kulldorff, 1997) by retaining spatial autocorrelation information for overdispersion, but the method is based on the spatial scan test rather than on a spatial GLMM. In both time series and spatial point processes, residuals have been used to assess model fitting and clustering effects (Zhuang, 2006). We extend these methods to lattice data by providing a theoretical basis for residual-based clustering tests, which in turn can be used to assess spatial properties in spatial GLMMs.

Taken together, we set out a spatial cluster surveillance framework that is able to confirm the existence of clusters while accounting for ecological covariates and potential global trends. For a spatial GLMM, we rely on (i) the local association parameter estimates to capture clusters, and (ii) Pearson residuals to check the remaining clustering tendency. Because the locations of potential spatial clusters are unknown a priori, we also introduce a spatial search algorithm to detect significant local associations. In the following section, we first set out the proposed cluster-detection method, and then evaluate it through Monte Carlo simulations in Section 3. In Section 4, we introduce a case study for colorectal cancer mortality in Indiana. Finally, we provide concluding remarks.

© 2008, The International Biometric Society

2. Statistical Methods

For a study area with $m$ units, let $y_i$ be the observed number of cases in unit $i$, a realization of the random variable $Y_i$, and let $E_i$ be the corresponding expected number of cases.
Assume that the event is rare such that $Y_i$ for $i = 1, \ldots, m$ are conditionally independently Poisson distributed with conditional expected value $\theta_i E_i$ given $\theta_i$; the unknown relative risk $\theta_i$ for unit $i$ can be set out in a spatial GLMM as

\[ \log(\theta_i) = x_i^t \beta + \sum_{k=1}^{K} \alpha_k d_{ki} + \epsilon_i. \tag{1} \]

Equation (1) is based on Gangnon and Clayton's (2003) work: $x_i^t \beta$ is a nonspatial component representing the overall relative risk across the study area adjusted by ecological covariates $x_i$; $d_{ki}$ is the local association term for cluster membership centered at unit $j_k$, with $d_{ki} = 1$ if unit $i$ belongs to the $k$th cluster and $d_{ki} = 0$ otherwise; $\alpha_1, \ldots, \alpha_K$ are unknown log relative risks (relative to $x_i^t \beta$) associated with each cluster; $K$ is the unknown number of clusters; and $\epsilon_i$ is a spatially unstructured random effect assumed to be independently and identically normally distributed.

Both $\sum_{k=1}^{K} \alpha_k d_{ki}$ and $\epsilon_i$ in equation (1) present spatial search and statistical estimation problems that need to be solved jointly. First, the coefficient $\alpha_k$ represents the strength of the $k$th local cluster captured by $d_{ki}$. A significantly positive coefficient indicates a high-value cluster or a hot spot. A significantly negative coefficient indicates a low-value cluster or a cool spot. If all $\alpha_1, \ldots, \alpha_K$ are 0, the model becomes a spatially independent model. Because the number of potential clusters $K$ is unknown, it is necessary to develop a procedure to search for spatial clusters. Second, the random effect $\epsilon_i$ represents heterogeneity, which can be attributed to omitted covariates, measurement errors, and overdispersion. Although the use of $\epsilon_i$ is natural for many applications, the potential inflation of the variance poses a problem for model estimation (Agresti, 2002, p. 8). In the following section, we first introduce the model estimation problem and then the cluster search problem.

Model estimation.
To estimate the parameters in model (1), it is necessary to compute its variance inflation. Given that $Y_i \mid \epsilon_i \sim \mathrm{Poisson}(\theta_i E_i)$, we have $E(Y_i \mid \epsilon_i) = \theta_i E_i$ and $V(Y_i \mid \epsilon_i) = \theta_i E_i$. Under the assumption that $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, the $e^{\epsilon_i}$ are independently log-normally distributed with $E(e^{\epsilon_i}) = e^{\sigma^2/2}$ and $V(e^{\epsilon_i}) = e^{\sigma^2}(e^{\sigma^2} - 1)$. Let $\mu_i = E(Y_i)$. Then,

\[ \mu_i = E[E(Y_i \mid \epsilon_i)] = E_i E(\theta_i) = E_i \exp\Big( x_i^t \beta + \sum_{k=1}^{K} \alpha_k d_{ki} + \sigma^2/2 \Big). \tag{2} \]

Let $v_i = V(Y_i)$. Then,

\[ v_i = E[V(Y_i \mid \epsilon_i)] + V[E(Y_i \mid \epsilon_i)] = E(\theta_i E_i) + V(\theta_i E_i) = \mu_i \big[ 1 + \mu_i (e^{\sigma^2} - 1) \big]. \tag{3} \]

As $v_i$ only depends on $\mu_i$ and $\sigma^2$, we write $v_i = v(\mu_i; \sigma^2)$ as the general form of the marginal variance of $Y_i$. By comparing equations (2) and (3), we find that the variance is inflated because the marginal variance is a quadratic function of the marginal mean when $\sigma^2 > 0$.

Given a potential number of clusters $C_1, \ldots, C_k$, we fit model (1) by a generalized estimating equation (GEE), treating $\sigma^2$ as a nuisance parameter (Zeger, 1988). Let $\alpha = (\alpha_1, \ldots, \alpha_K)^t$ and let $\nabla \mu_i = (\partial \mu_i / \partial \beta^t, \partial \mu_i / \partial \alpha^t)^t$ be the column vector of the first-order partial derivatives with respect to all the unknown parameters. For a given $\sigma^2$, we consider the quasilikelihood parameter estimates $\hat{\beta}$ for the $i$th unit by a quasiscore equation:

\[ \sum_{i=1}^{m} (\nabla \mu_i) \, v^{-1}(\mu_i; \sigma^2) \, (Y_i - \mu_i) = 0, \tag{4} \]

where $\mu_i$ is given by equation (2) and $v^{-1}(\mu_i; \sigma^2) = \{ \mu_i [1 + \mu_i (e^{\sigma^2} - 1)] \}^{-1}$. We use an iterative algorithm to estimate both $\mu_i$ and $\sigma^2$, because $\sigma^2$ is also a parameter and equation (4) can only estimate $\mu_i$ for a given $\sigma^2$. We first set $\sigma^2 = 0$ and estimate $\hat{\mu}_i$ by equation (4), and then estimate $\sigma^2$ by moment estimation as

\[ \hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} \log s_i, \tag{5} \]

where $s_i = \max\big( (Y_i - \hat{\mu}_i)^2 / \hat{\mu}_i^2 - 1/\hat{\mu}_i + 1, \; 1 \big)$. The estimates of $\mu_i$ and $\sigma^2$ are updated iteratively until convergence, which yields the estimates $\hat{\mu}_i$ and $\hat{\sigma}^2$ of $\mu_i$ and $\sigma^2$. The marginal variance, therefore, can be estimated by

\[ \hat{v}_i = \hat{\mu}_i \big[ 1 + \hat{\mu}_i (e^{\hat{\sigma}^2} - 1) \big]. \tag{6} \]

Because it is difficult to obtain the maximum likelihood estimate $(\hat{\beta}, \hat{\alpha})$ of $(\beta, \alpha)$ and the corresponding estimate of the variance-covariance matrix, we opt to calculate the Wald z-score values of clusters $C_1, \ldots, C_k$ as $z_{\hat{\alpha}_i, C_i} = \hat{\alpha}_i / \hat{\sigma}_{\hat{\alpha}_i}$, where $\hat{\sigma}_{\hat{\alpha}_i}$ is the estimate of the standard deviation of $\hat{\alpha}_i$. These z-values are used below to determine the significance of a cluster.

Search for local associations.
As mentioned earlier, the number and location of clusters are unknown, and we need to develop a spatial search procedure to identify significant local association terms. For simplicity, we follow the classical way of defining a cluster at location $i$ as the unit together with all its neighbors; we then search over all possible $i$ in the study area. Although the analysis of residual clustering is applicable to several spatial test statistics, we demonstrate it based on Moran's I (Moran, 1948) because of its popularity and wide availability. Let $z_i$ be the variable of interest in unit $i$. Moran's I statistic is given by

\[ I = \frac{1}{S_0 b_2} \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} w_{ij} (z_i - \bar{z})(z_j - \bar{z}), \tag{7} \]

where $S_0 = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} w_{ij}$ ($w_{ij} = 1$ if two units are adjacent, 0 otherwise), $\bar{z} = \sum_{i=1}^{m} z_i / m$, and $b_k = \sum_{i=1}^{m} (z_i - \bar{z})^k / m$. For count data, $z_i$ is often taken as the observed relative risk, $z_i = y_i / E_i$. Moran's I usually ranges between $-1$ and 1: a coefficient close to 1 indicates neighborhood similarity or clustering, a coefficient close to $-1$ indicates neighborhood dissimilarity, and a coefficient close to 0 indicates spatial randomness or independence.

The p-value of Moran's I statistic is calculated under the random permutation test scheme. Let $E_R(\cdot)$ and $V_R(\cdot)$ be the expected value and variance under a random permutation. Formulae for the moments of $I$ (Cliff and Ord, 1981) are given by

\[ E_R(I) = -\frac{1}{m - 1}, \tag{8} \]

and

\[ E_R(I^2) = \frac{S_1 (m b_2^2 - b_4)}{S_0^2 b_2^2 (m-1)} + \frac{(S_2 - 2 S_1)(2 b_4 - m b_2^2)}{S_0^2 b_2^2 (m-1)(m-2)} + \frac{(S_0^2 - S_2 + S_1)(3 m b_2^2 - 6 b_4)}{S_0^2 b_2^2 (m-1)(m-2)(m-3)}, \tag{9} \]

where $S_1 = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} (w_{ij} + w_{ji})^2 / 2$ and $S_2 = \sum_{i=1}^{m} \big[ \sum_{j=1, j \neq i}^{m} (w_{ij} + w_{ji}) \big]^2$. The variance of $I$ is calculated by $V_R(I) = E_R(I^2) - E_R^2(I)$.
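As an illustration of equations (7) to (9), the following minimal sketch computes Moran's I and its permutation mean and variance with NumPy. The function names `morans_i` and `permutation_moments` are hypothetical and do not come from the article; the sketch assumes a dense weight matrix `W` with a zero diagonal and at least four units.

```python
import numpy as np


def morans_i(z, W):
    """Moran's I of equation (7); W is assumed to have a zero diagonal."""
    z = np.asarray(z, dtype=float)
    W = np.asarray(W, dtype=float)
    d = z - z.mean()
    b2 = np.mean(d ** 2)
    return float(d @ W @ d) / (W.sum() * b2)


def permutation_moments(z, W):
    """Mean and variance of I under random permutation, equations (8) and (9)."""
    z = np.asarray(z, dtype=float)
    W = np.asarray(W, dtype=float)
    m = z.size
    d = z - z.mean()
    b2, b4 = np.mean(d ** 2), np.mean(d ** 4)
    s0 = W.sum()
    s1 = 0.5 * np.sum((W + W.T) ** 2)
    s2 = np.sum((W.sum(axis=0) + W.sum(axis=1)) ** 2)
    den = s0 ** 2 * b2 ** 2
    mean_i = -1.0 / (m - 1)                                   # equation (8)
    mean_i2 = (s1 * (m * b2 ** 2 - b4) / (den * (m - 1))      # equation (9)
               + (s2 - 2 * s1) * (2 * b4 - m * b2 ** 2)
               / (den * (m - 1) * (m - 2))
               + (s0 ** 2 - s2 + s1) * (3 * m * b2 ** 2 - 6 * b4)
               / (den * (m - 1) * (m - 2) * (m - 3)))
    return mean_i, mean_i2 - mean_i ** 2
```

Because the moments refer to the permutation distribution, they can be checked on a small example by enumerating every permutation of `z` and comparing the empirical mean and variance of `morans_i` with the output of `permutation_moments`.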
Assuming $I$ is approximately normally distributed, its p-value is calculated by a two-sided z-test given by $2[1 - \Phi(|I - E_R(I)| / \sqrt{V_R(I)})]$. If the p-value is less than the significance level, then the null hypothesis of spatial independence is rejected and spatial dependence is concluded.

Moran's I can be derived either from the observed values or from the model residuals, and it is suggested that the Pearson residual Moran's I, denoted by $I_{PR}$, is able to account for population heterogeneity and variance inflation (Waller and Gotway, 2004; Lin and Zhang, 2007). It can be seen from equation (7) that Moran's I does not suffer from overdispersion. Thus, $\epsilon_i$ in model (1) is likely to capture unstructured random effects. We provide the theoretical basis for $I_{PR}$ in the Web-based Supplementary Materials at the Biometrics website, and state a theorem below for $I_{PR}$ in reference to equation (1).

Theorem 1: Suppose that $Y_i$ for $i = 1, \ldots, m$ are independent random variables with expected value $\mu_{i,\theta}$ and variance $\sigma^2_{i,\theta}$, where $\theta$ is a vector of unknown parameters. Assume that $\theta$ is $\sqrt{m}$-consistently estimated by $\hat{\theta} = \hat{\theta}(Y_1, Y_2, \ldots, Y_m)$. Define $I_{PR}$ as the Moran's I statistic by taking $z_i = (y_i - \mu_{i,\hat{\theta}}) / \sigma_{i,\hat{\theta}}$ in equation (7). Then, $[I_{PR} - E_R(I_{PR})] / \sqrt{V_R(I_{PR})} \xrightarrow{L} N(0, 1)$ as $m \to \infty$, where $\xrightarrow{L}$ represents convergence in law or convergence in distribution.

To explicitly account for an independent and identically distributed random effect in model (1), we can take $z_i = (y_i - \hat{\mu}_i) / \sqrt{\hat{v}_i}$ as the $i$th residual, so that $I_{PR}$ becomes an adjusted Pearson residual Moran's I, or $I_{aPR}$, and the random permutation test can again be applied.

Corollary 1: Suppose that $\beta$ and $\alpha$ are $\sqrt{m}$-consistently estimated by $\hat{\beta}$ and $\hat{\alpha}$ as $m \to \infty$. Then, $\hat{\sigma}^2$ given by equation (5) is also $\sqrt{m}$-consistent. Define $I_{aPR}$ as the Moran's I statistic by taking $z_i = (y_i - \hat{\mu}_i) / \sqrt{\hat{v}_i}$ in equation (7), where $\hat{v}_i$ is given by equation (6). Then, $[I_{aPR} - E_R(I_{aPR})] / \sqrt{V_R(I_{aPR})} \xrightarrow{L} N(0, 1)$ as $m \to \infty$.

Search algorithm.
To search for potential clusters, we take advantage of the global indicator $I_{aPR}$ and the local indicator $d_{ki}$, and use them to evaluate the location and strength of a local cluster based on model (1). Suppose that $C$ is a cluster in the study area; a spatial association term $d_{ki}$ then assumes that $\theta_i = \theta_c$ if $i \in C$ and $\theta_i = \theta_0$ if $i \notin C$ in the estimation process. Our task is to search for significant local associations by placing $d_{ki}$ terms based on Wald's z-test, and then to check for residual clustering based on $I_{aPR}$. The reason for using Wald's z-test rather than the likelihood ratio test is the computational complication of the likelihood function in the presence of random effects (Agresti, 2002, p. 521). For simplicity, we describe the procedure by dropping ecological covariates in model (1).

The procedure starts with the test of significance of $I_{aPR}$ for model (1) without any local association term. If the p-value of $I_{aPR}$ is not significant for spatial clustering, then there is no need to search for potential clusters. If $I_{aPR}$ is significant, then a stepwise search algorithm is invoked. The algorithm first calculates the z-values of the $\alpha_1$ coefficients for all possible $C$, and retains the largest significant $z_{\alpha_1}$ value for the first spatial association, its parameter estimate $\beta_1$, and the estimated $\sigma^2$. Then it reevaluates the p-value of $I_{aPR}$ for residual clustering. If the statistic is no longer significant, the algorithm stops, indicating that the cluster centered at $j_1$ and quantified by $\beta_1$ contributes the significant clustering effect. Otherwise, it starts to search for a second cluster by repeating the above steps in a second round of iteration. Parameter estimates for the previously retained spatial association terms are updated together with $\sigma^2$ in the estimation process. The algorithm stops when $I_{aPR}$ is not significant based on the $k$th iterated residuals.
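The stepwise procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`moran_pvalue`, `fit`, `search_clusters`) and the `max_terms` cap are hypothetical, ecological covariates are dropped, each candidate cluster is a unit plus its adjacent neighbors, the Wald standard errors are model-based, and the fit simply alternates Fisher scoring for equation (4) with the moment estimate of equation (5).

```python
import numpy as np
from math import erfc, sqrt


def moran_pvalue(z, W):
    """Two-sided normal-approximation p-value of Moran's I, equations (7)-(9)."""
    m, d = z.size, z - z.mean()
    b2, b4 = np.mean(d ** 2), np.mean(d ** 4)
    s0 = W.sum()
    s1 = 0.5 * np.sum((W + W.T) ** 2)
    s2 = np.sum((W.sum(axis=0) + W.sum(axis=1)) ** 2)
    den = s0 ** 2 * b2 ** 2
    stat = (d @ W @ d) / (s0 * b2)
    e1 = -1.0 / (m - 1)
    e2 = (s1 * (m * b2 ** 2 - b4) / (den * (m - 1))
          + (s2 - 2 * s1) * (2 * b4 - m * b2 ** 2) / (den * (m - 1) * (m - 2))
          + (s0 ** 2 - s2 + s1) * (3 * m * b2 ** 2 - 6 * b4)
          / (den * (m - 1) * (m - 2) * (m - 3)))
    return erfc(abs(stat - e1) / sqrt(2.0 * (e2 - e1 ** 2)))


def fit(X, y, E, rounds=30):
    """Alternate the quasi-score equation (4) with the moment estimate (5)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.sum() / E.sum())      # column 0 of X is the intercept
    sigma2 = 0.0
    for _ in range(rounds):
        for _ in range(50):                  # Fisher scoring at the current sigma2
            mu = E * np.exp(X @ beta)
            v = mu * (1.0 + mu * np.expm1(sigma2))
            info = X.T @ (X * (mu ** 2 / v)[:, None])
            step = np.linalg.solve(info, X.T @ (mu / v * (y - mu)))
            beta += step
            if np.max(np.abs(step)) < 1e-9:
                break
        mu = E * np.exp(X @ beta)
        s = np.maximum((y - mu) ** 2 / mu ** 2 - 1.0 / mu + 1.0, 1.0)
        previous, sigma2 = sigma2, float(np.mean(np.log(s)))
        if abs(sigma2 - previous) < 1e-8:
            break
    v = mu * (1.0 + mu * np.expm1(sigma2))   # equation (6)
    se = np.sqrt(np.diag(np.linalg.inv(X.T @ (X * (mu ** 2 / v)[:, None]))))
    return beta, se, (y - mu) / np.sqrt(v)   # adjusted Pearson residuals


def search_clusters(y, E, W, level=0.05, max_terms=5):
    """Add association terms until the residual Moran test is not significant."""
    m = y.size
    cands = [np.isin(np.arange(m), np.append(np.flatnonzero(W[i]), i)).astype(float)
             for i in range(m)]
    X, centers = np.ones((m, 1)), []
    while len(centers) < max_terms:
        _, _, resid = fit(X, y, E)
        if moran_pvalue(resid, W) >= level:
            break                            # no remaining residual clustering
        best = None
        for i in range(m):
            if i in centers:
                continue
            beta, se, _ = fit(np.column_stack([X, cands[i]]), y, E)
            zval = beta[-1] / se[-1]         # Wald z of the newest term
            if best is None or abs(zval) > abs(best[1]):
                best = (i, zval)
        if best is None or erfc(abs(best[1]) / sqrt(2.0)) >= level:
            break                            # largest Wald z is not significant
        centers.append(best[0])
        X = np.column_stack([X, cands[best[0]]])
    return centers
```

Calling `search_clusters(y, E, W)` with observed counts `y`, expected counts `E`, and a binary adjacency matrix `W` returns the indices of the retained cluster centers in the order they were added. The sketch refits every candidate in each round, which is adequate for a few hundred units but would need a more economical update for large lattices.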
Because each association term accounts for the effect of the corresponding cluster, the procedure does not suffer from the multiple testing problem. The model as a whole takes account of spatial random effects when $\sigma^2 > 0$.

3. Simulation

In this section, we evaluate the proposed $I_{aPR}$ test and the search procedure via simulated data. We used the 92 counties in Indiana in the United States as the template and assigned the at-risk population of each county according to the 2000 Census of Population, which ranges from 5623 to 860,454. We generated local cluster effects from no cluster to two clusters, and fixed the centers of the two clusters at the outset (Figure 1). The first, denoted by $C_1$, was centered upon Tippecanoe county in the west and also included its seven adjacent counties (Montgomery, Fountain, Warren, Clinton, Carroll, White, and Benton). The second cluster, denoted by $C_2$, was centered upon Noble county in northeastern Indiana and included its seven adjacent counties (De Kalb, Kosciusko, Whitley, Allen, Elkhart, Lagrange, and Steuben).

Figure 1. County population sizes in 2000 in two-cluster simulations.

In all evaluations, we fixed the significance level at We first generated the random effect $\epsilon_i$ independently and identically from $N(0, \sigma^2)$; and then we generated the Poisson random counts $Y_i$ conditionally independently with conditional mean $\theta_i n_i e^{\epsilon_i}$ for $i = 1, \ldots, 92$, where

\[ \theta_i = \begin{cases} 0.001 (1 + \delta_1 d_{1i} + \delta_2 d_{2i}), & \text{when } i \in C_1 \text{ or } i \in C_2, \text{ respectively}, \\ 0.001, & \text{when } i \notin C_1, C_2, \end{cases} \tag{10} \]

where $n_i$ is the population size of the $i$th county, and $d_{1i}$ is 1 if county $i$ is in cluster $C_1$ and 0 otherwise. Likewise, $d_{2i}$ is 1 if county $i$ is in cluster $C_2$ and 0 otherwise. The quantities $\delta_1$ and $\delta_2$ represent the strength and sign of the clusters. When $\delta_1 = \delta_2 = 0$, there is no clustering. Otherwise, there is at least one cluster. A positive $\delta$ value indicates a hot spot, and a negative $\delta$ value indicates a cool spot. We let $\delta$ change from 0 to 1 to place the relative risk within a cluster from 1 to twice as much as the mean. We ran 10,000 Monte Carlo simulations for each combination of $\sigma$, $\delta_1$, and $\delta_2$ values.

In addition to assessing the proposed method against simulated patterns, we also compared it with Kulldorff's spatial scan statistic. Kulldorff's scan statistic is based on the likelihood ratio test for the null hypothesis of equal rates among all units versus the alternative of higher rates or lower rates inside the cluster (Kulldorff, 1997). Suppose that $C \in \mathcal{C}$ is a candidate cluster and assume $\theta_i = \theta_c$ if $i \in C$ and $\theta_i = \theta_0$ if $i \notin C$. Under the null hypothesis $\theta_c = \theta_0$ and the alternative hypothesis $\theta_c > \theta_0$, the likelihood function is

\[ LR(C, \theta_c, \theta_0) = \Big[ \prod_{i \in C} \frac{(\theta_c E_i)^{y_i}}{y_i!} e^{-\theta_c E_i} \Big] \Big[ \prod_{i \notin C} \frac{(\theta_0 E_i)^{y_i}}{y_i!} e^{-\theta_0 E_i} \Big]. \tag{11} \]

The spatial scan statistic is

\[ \lambda = \max_{C \in \mathcal{C}} L(C) = \max_{C \in \mathcal{C}} \frac{\sup_{\theta_c > \theta_0} LR(C, \theta_c, \theta_0)}{\sup_{\theta_c = \theta_0} LR(C, \theta_c, \theta_0)}, \]

where $\mathcal{C}$ is the class of all the possible candidates of local clusters. Once the value of the test statistic is calculated by searching over all the possible candidates, a p-value for the cluster $C$ with the maximum value of the likelihood ratio test statistic is obtained by comparing the value of the scan statistic $\lambda$ with its distribution under the null hypothesis. Because the exact distribution of the test statistic $\lambda$ cannot be determined analytically, it is approximated by Monte Carlo simulations. The spatial scan statistic uses a circular area and its variants to detect excessive events within the circle against the rest of the study area. It is a consistent and powerful test for detecting at least one cool or hot spot. Since its introduction into the field of disease cluster detection, the spatial scan statistic has quickly become a standard method for geographic disease surveillance.

In the simulation, we compared the type I error rates and the powers of the two methods. Ideally, we would want to include spatial random effects throughout.
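For reference, the likelihood ratio behind the scan statistic defined above has a closed form once $\theta_c$ and $\theta_0$ are replaced by their maximizers, namely observed over expected counts inside and outside the candidate. The sketch below evaluates its logarithm; the names `scan_log_lr` and `scan_statistic` are hypothetical, candidates are assumed to be a unit plus its adjacent neighbors (matching the fixed window used in this simulation), each candidate is assumed to be a proper subset of the study area, and the Monte Carlo p-value step is omitted.

```python
import numpy as np


def scan_log_lr(y, E, members):
    """Log likelihood ratio for one candidate cluster under theta_c > theta_0."""
    y = np.asarray(y, dtype=float)
    E = np.asarray(E, dtype=float)
    inside = np.zeros(y.size, dtype=bool)
    inside[list(members)] = True
    y_in, e_in = y[inside].sum(), E[inside].sum()
    y_out, e_out = y[~inside].sum(), E[~inside].sum()
    if y_in / e_in <= y_out / e_out:
        return 0.0          # constrained supremum sits on the null boundary

    def term(count, expected):
        # count * log(count / expected), with the convention 0 * log(0) = 0
        return count * np.log(count / expected) if count > 0 else 0.0

    return (term(y_in, e_in) + term(y_out, e_out)
            - term(y_in + y_out, e_in + e_out))


def scan_statistic(y, E, W):
    """Return (center index, log lambda) maximized over adjacency-based candidates."""
    W = np.asarray(W)
    scores = [scan_log_lr(y, E, np.append(np.flatnonzero(W[i]), i))
              for i in range(len(y))]
    best = int(np.argmax(scores))
    return best, float(scores[best])
```

In practice the returned log $\lambda$ would be compared with values obtained from counts simulated under the null hypothesis, as described in the text.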
However, we dropped the spatial scan statistic from the power comparison when $\sigma^2 > 0$ because it cannot account for that case. In addition, the power comparison is limited to one cluster without spatial heterogeneity (i.e., when $\sigma = 0$), because that is the setting in which the spatial scan statistic is most effective. In this situation, we used $I_{PR}$ for the comparison, because in the absence of a random effect, $I_{aPR}$ reduces to the regular $I_{PR}$. In all comparisons, we fixed the window size according to the spatial adjacency matrix, so that it remained consistent with our design for local clusters. We calculated the p-value of the spatial scan statistic based on 999 random replications of the null distribution for Monte Carlo hypothesis testing.

The results show that $I_{aPR}$ had a consistent type I error probability in the presence of spatial random effects (Table 1). As $\sigma$ increased from 0 to 1.0, the rejection rates hovered between and The spatial scan statistic, on the other hand, only had an acceptable type I error probability when $\sigma = 0$. As $\sigma$ increased, the rejection rates increased rapidly, which would result in false alarms.

Table 1. Type I error probabilities (when $\delta_1 = \delta_2 = 0$) of $I_{aPR}$ and Kulldorff's scan statistic
σ | Scan | $I_{aPR}$

In the limited power comparison, the scan statistic is more powerful than our permutation method for $I_{PR}$. When the cluster strength was moderate ($\delta_1 = 0.2$), or 20% above the mean, the spatial scan statistic had a 52.5% rejection rate, whereas $I_{PR}$ had 17.7%. When the cluster strength became stronger, the rejection rates from the scan statistic were still higher than the rates from $I_{PR}$. When the $\delta_1$ values were 0.4 and 0.5, the corresponding rejection rates were 99.8% and 100% for the spatial scan test, as opposed to 75.8% and 94.3% for $I_{PR}$. These results were expected because the likelihood ratio test in the spatial scan statistic is more powerful than the $I_{PR}$ permutation test.

Because the spatial scan statistic can neither straightforwardly incorporate ecological covariates nor account for spatial random effects, in the remaining simulation we only assess our iterative residual method with regard to the effectiveness of spatial association terms and the search procedure. We display the rejection rates of $I_{aPR}$ with the inclusion of no association term, one local association term for the Noble county cluster, and two local association terms for both clusters (Figure 2). To simplify this part of the simulations, we set $\delta_1 = \delta_2 = \delta$ in expression (10) as they increased from 0 to 1. Because the results in the presence of spatial heterogeneity were similar to those without spatial heterogeneity, we focus the discussion on the case when $\sigma = 0.0$.

Figure 2 shows that rejection rates remained consistently around 0.05 in the absence of a clustering effect (lower-left corner). When $\delta > 0.3$, a moderate cluster effect, $I_{aPR}$ had a very high rejection rate. This suggests that $I_{aPR}$ was able to detect the existence of spatial clustering in the absence of local association terms. This is true even when the cluster centered upon Noble county was accounted for by one association term.
In other words, if one cluster is sufficient to cause a significant clustering tendency, its effect remains even though another one is accounted for. When both clusters were accounted for by two corresponding association terms, the rejection rates remained consistently around 5%, which resembled the case when $\delta = 0$ for type I error rates. These results suggest that $I_{aPR}$ was likely to accept the hypothesis of no clustering once the two spatial association terms absorbed the existing cluster tendency.

Figure 2. Significance of $I_{aPR}$ with zero, one, or two spatial associations.

The second part of the simulations was intended to evaluate the effectiveness of the search procedure by how often it correctly identified $C_1$ or $C_2$, centered at Tippecanoe and Noble counties. We also recorded the rates at which the search procedure identified these two clusters but not centered exactly at Tippecanoe or Noble counties, which we denote by $C_{N1}$ or $C_{N2}$, respectively. As before, we used $\delta_1$ or $\delta_2$ values to set the strength of a cluster from 0 to 0.9. Table 2 displays the proportions of cluster centers that were correctly identified ($C_1$ or $C_2$) and somewhat correctly identified ($C_{N1}$ or $C_{N2}$). With a strong cluster ($\delta_1 = 0.9$ or $\delta_2 = 0.9$), the search procedure was effective. When both clusters were weak, the procedure was, as expected, unable to pick up any cluster locations. When at least one strong cluster was present ($\delta_1 = 0.9$ or $\delta_2 = 0.9$), the procedure was able to pick up the corresponding location precisely more than 99% of the time. When both clusters were moderately strong ($\delta_1 = \delta_2 = 0.6$), the procedure was likely to find the cluster at Noble county first and more often, because it contributed greater deviance with its population (740,873), which was higher than that of Tippecanoe county (301,676). These results suggest that the iterative cluster-detection procedure can consistently identify the locations of spatial clusters. In the absence of local clusters, $I_{aPR}$ had an acceptable level of misidentified cluster locations that corresponded to its type I error probabilities.

Table 2. Proportion of correctly identified cluster centers in two-cluster simulations
($\delta_1$, $\delta_2$) | $C_1$ | $C_{N1}$ | $C_2$ | $C_{N2}$
(0.0, 0.0)
(0.3, 0.3)
(0.3, 0.6)
(0.3, 0.9)
(0.6, 0.3)
(0.6, 0.6)
(0.6, 0.9)
(0.9, 0.3)
(0.9, 0.6)
(0.9, 0.9)

4. Case Study

In this section, we provide a case study of county-level colorectal cancer mortality in Indiana. Colorectal cancer mortality has declined steadily in Indiana in the past decade. The male rates declined sharply from an annual rate of 28.1 per 100,000 in the period to 21.1 per 100,000 during the period, while the female rates experienced a moderate decline from an annual rate of 19.9 per 100,000 in the period to 17.5 per 100,000 during the period. The sharp decline in male colorectal cancer mortality was attributable to an increase in colorectal cancer screening rates; screening can detect colorectal cancer at an early stage. In 1996 and 1997, the male screening rate for those age 50+ was 38%; in 2004 the rate increased to 50%. To examine regional disparity with potential implications for improvement in screening, we obtained the mortality data for males in the period as well as at-risk populations from the Bureau of the Census's 2002 population estimates. The state average for the 5-year period for males was 0.11%, with the lowest of 0.056% in White county and the highest of 0.258% in Warren county (see Figure 3).
We used $I_{aPR}$ first to test for spatial clustering, and then to identify a local cluster if the global clustering effect was significant. Because there is no equivalent test at both global and local levels, we compared our results with two other tests: the empirical Bayesian index (EBI) at the global level, and the spatial scan statistic at the local level. EBI is a population-adjusted Moran's I test proposed by Assuncao and Reis (1999), and it can consistently detect spatial clustering in the presence of population heterogeneity. EBI, however, cannot include ecological covariates or spatial association terms. For this reason, we use the spatial scan statistic to check the consistency of the location detected by our search procedure. To closely mirror our proposed iterative framework, upon detection of the first cluster, we introduced a spatial association term corresponding to the clustered area into the spatial scan statistic, and further compared the results from the two methods.

Figure 3. Male colorectal cancer mortality per 100,000 in Indiana: 2000 to

The results from $I_{aPR}$ show significant spatial clustering in the null model that does not include any spatial and ecological covariates. The $I_{aPR}$ value and its p-value were and 0.010, with $\hat{\sigma}$ = and $\hat{\sigma} = 0.2001$, which were close to the EBI value of with a p-value of Based on this initial result, we used the iterative residual search algorithm and found a cool spot centered at Marion county. The coefficient for the Marion county cluster was or an expected relative risk of Once this cluster was accounted for, the p-value for $I_{aPR}$ became or not significant. Marion county is home to the state capital, Indianapolis, with five large hospitals right at the city center. The cluster had an average mortality rate of 0.084% as opposed to 0.123% for the rest of the state.
To assess whether ecological covariates or age covariates could explain the clustering effect, we collected a number of county-level variables, such as primary care physicians, hospital beds, colorectal cancer screening rate, and broad age groups (<45, 45-64, 65+). We made several contrasts between clustered and nonclustered areas. We found that the screening rate, which is defined as the proportion of adults age 50 and older who have ever had a sigmoidoscopy or colonoscopy, or who have had a blood stool test within the past 2 years, had the sharpest contrast: 66.3% within the cluster and 54.2% outside. Although other variables may reduce overall deviance, the screening rate was the only one associated with the detected clustering effect. Once we included the screening rate without any spatial association terms, the p-value for $I_{aPR}$ became with $\hat{\sigma} = 0.2164$, which suggested no clustering. In other words, we could explain the lower level of mortality near the capital counties by the high screening rate.

Finally, we ran the spatial scan statistic based on the same spatial weight matrix. The result for the first cluster was identical to the cool spot detected by the iterative residual method with a p-value of 0.001, and a λ value of The consistent results for the first cluster show that in the pure spatial cluster test, both $I_{aPR}$ and the spatial scan statistic were able to consistently detect the existence and the location of a cluster. Because it was necessary to include the same spatial association term in the spatial scan statistic as in model (1), we added it as a covariate to check whether it would similarly remove the detected cluster and clustering effects. The result was mixed. While the detected cool spot vanished from the first cluster, an excessive mortality cluster emerged around Newton county in the northwest of the state with a p-value of 0.004, and a λ value of This inconsistency could relate to the greater power of the spatial scan statistic, or to spatial random effects that were not accounted for by the spatial scan statistic.
Even when we used geographic coordinates based on the SaTScan default setting, we obtained the identical second cluster centered upon Newton county, while the first cluster, the cool spot, was larger than the geographic neighbor-based cluster. These results were consistent when we used different circle sizes by changing the at-risk population threshold from 30% to 50%.

5. Concluding Remarks

In this article, we have set out a frequentist framework for a spatial GLMM that combines local cluster indicators, or spatial association terms, with the residual-based global indicator of clustering, $I_{aPR}$. Previously, spatial GLMMs have been fitted predominantly by a Bayesian disease-mapping method. We proved the validity of $I_{aPR}$ based on the asymptotic properties of Pearson residuals. We found that it has a consistent type I error probability in the presence of spatial random effects and population heterogeneity. The power of $I_{aPR}$ is lower than that of the spatial scan statistic, but still satisfactory in the presence and absence of spatial random effects.

The use of $I_{aPR}$ in the model improvement process for a spatial GLMM is iterative. It is based on (i) the GEE method for estimating and assessing the detected spatial associations and associated goodness-of-fit statistics, and (ii) the evaluation of $I_{aPR}$ for spatial clustering in the iterative search of spatial association terms. The coefficient of a local indicator can gauge the effect of a hot spot or a cool spot, and multiple clusters derived from the model-based test do not suffer from the multiple testing problem. Because the model can include potential ecological covariates, known risk factors can also be incorporated into the hypothesis testing process for cluster detection.
This modeling strategy can be extended to cases where the detected spatial dependence may suggest the type of exposure that is partially responsible for spatial variation, because both geographic covariates and spatial association terms can be treated as explanatory variables in the model improvement process, and they provide different statistical inferences. A spatial association term can identify a location-specific cluster, which in turn can provide clues for identifying ecological covariates, such as the screening rate used in the case study. An ecological covariate, if used effectively, can help in designing interventions. In the case study, we know not only the inverse relationship between the screening rate and colorectal cancer mortality, but also the screening levels and mortality rates within and outside the cluster. These descriptive statistics and the main effects from the coefficients can assist intervention strategies.

Although the two are by no means inconsistent, the underlying goal of model improvement differs from that of cluster detection and estimation. When a global indicator such as I_apr becomes nonsignificant, the iterative search procedure is likely to stop searching for clusters, but even with the same goal, different search strategies may reveal slightly different clustering structures. This situation can occur in the presence of two weak clusters. When a spatial association coefficient accounts for one of them, the previously detected clustering tendency may or may not disappear. At this point, if we used an alternative search algorithm that could continue as long as an additional association term substantially improved the model fit (Lin, 2003), we would likely identify a second association term that contributes not to clustering but to the overall model fit.
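The stopping rule discussed above can be sketched as a loop that adds one local indicator at a time until the global test is no longer significant. This is a simplified illustration, not the authors' procedure: it assumes a plain Poisson log-linear model fitted by IRLS (rather than GEE with spatial random effects), defines each candidate cluster as a unit plus its first-order neighbors, chooses the candidate with the smallest deviance, and uses a one-sided permutation Moran's I test on Pearson residuals. All names, including `expected`, `alpha`, and `max_terms`, are hypothetical.

```python
import numpy as np

def fit_poisson(X, y, offset, n_iter=25):
    """Fitted means of a Poisson log-linear model via a fixed number of IRLS steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        z = X @ beta + (y - mu) / mu            # working response, offset removed
        sw = np.sqrt(mu)                        # square root of the IRLS weights
        beta = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)[0]
    return np.exp(X @ beta + offset)

def deviance(y, mu):
    """Poisson deviance, treating y * log(y / mu) as 0 when y == 0."""
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

def moran_pvalue(r, W, n_perm=499, seed=0):
    """One-sided permutation p-value; the constant n / sum(W) cancels out."""
    rng = np.random.default_rng(seed)

    def stat(v):
        z = v - v.mean()
        return (z @ W @ z) / (z @ z)

    observed = stat(r)
    hits = sum(stat(rng.permutation(r)) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

def iterative_search(y, expected, W, alpha=0.05, max_terms=3):
    """Return the centers of the local indicators added before the test stops rejecting."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    n = len(y)
    offset = np.log(np.asarray(expected, dtype=float))
    X = np.ones((n, 1))                         # start from an intercept-only model
    chosen = []
    while len(chosen) < max_terms:
        mu = fit_poisson(X, y, offset)
        resid = (y - mu) / np.sqrt(mu)          # Pearson residuals
        if moran_pvalue(resid, W) > alpha:      # no residual clustering detected
            break
        best = None
        for i in range(n):
            if i in chosen:
                continue
            # Candidate indicator: unit i together with its first-order neighbors.
            ind = ((W[i] > 0) | (np.arange(n) == i)).astype(float)
            d = deviance(y, fit_poisson(np.column_stack([X, ind]), y, offset))
            if best is None or d < best[0]:
                best = (d, i, ind)
        chosen.append(best[1])
        X = np.column_stack([X, best[2]])
    return chosen
```

Replacing the `moran_pvalue` check with a test on the deviance reduction would give the alternative rule that keeps adding terms as long as the model fit improves, which is why the two strategies can end with different sets of association terms.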
This complicated design warrants further statistical justification and evaluation, so that different spatial searching algorithms will have clearly different statistical inferences associated with different spatial association terms. These local cluster-detection designs could also be compared with global autocorrelation removal strategies, such as conditional autoregressive priors in Bayesian disease mapping. Finally, although we used the standard way of defining a cluster by its first-order spatial adjacency, the method can incorporate many spatial weights. Further empirical and theoretical refinement is necessary to evaluate the sensitivity of the results to different spatial weights.

6. Supplementary Materials

Web Appendices and proofs of Theorem 1 referenced in Section 2 are available under the Paper Information link at the Biometrics website

Acknowledgements

We would like to thank the associate editor and two reviewers for their insightful comments and suggestions, which have improved the quality of the article substantially. This research was funded by US National Science Foundation Grants SES (Zhang) and SES (Lin).

References

Agresti, A. (2002). Categorical Data Analysis. New York: Wiley.
Assuncao, R. and Reis, E. (1999). A new proposal to adjust Moran's I for population density. Statistics in Medicine 18,

Baddeley, A., Turner, R., Moller, J., and Hazelton, M. (2005). Residual analysis for spatial point processes (with discussion). Journal of the Royal Statistical Society, Series B 67,
Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models and Applications. London: Pion.
Gangnon, R. E. and Clayton, M. K. (2003). A hierarchical model for spatially clustered disease rates. Statistics in Medicine 22,
Jacqmin-Gadda, H., Commenges, D., Nejjari, C., and Dartigues, J. (1997). Tests of geographical correlation with adjustment for explanatory variables: An application to dyspnoea in the elderly. Statistics in Medicine 16,
Kleinman, K. (2005). Generalized linear models and generalized linear mixed models for small-area surveillance. In Spatial and Syndromic Surveillance for Public Health, A. B. Lawson and K. Kleinman (eds), London: Wiley.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics, Theory and Methods 26,
Lawson, A. B. and Clark, A. (2002). Spatial mixture relative risk models applied to disease mapping. Statistics in Medicine 21,
Lin, G. (2003). A spatial logit association model for cluster detection. Geographical Analysis 35,
Lin, G. and Zhang, T. (2007). Loglinear residual tests of Moran's I autocorrelation: An application to Kentucky breast cancer incidence data. Geographical Analysis 39,
Loh, J. M. and Zhu, Z. (2007). Accounting for spatial correlation in the scan statistic. Annals of Applied Statistics 1,
McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92,
Moran, P. A. P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society, Series B 10,
Roche, L. S., Skinner, R., and Weinstein, R. B. (2002). Use of a geographic information system to identify and characterize areas with high proportions of distant stage breast cancer. Journal of Public Health Management Practice 8,
Waller, L. and Gotway, C. (2004). Applied Spatial Statistics for Public Health Data. Hoboken, New Jersey: Wiley.
Zeger, S. L. (1988). A regression model for time series of counts. Biometrika 75,
Zhuang, J. (2006). Second-order residual analysis of spatiotemporal point processes and applications in model evaluation. Journal of the Royal Statistical Society, Series B 68,

Received June Revised March Accepted March 2008.


More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Robust covariance estimator for small-sample adjustment in the generalized estimating equations: A simulation study

Robust covariance estimator for small-sample adjustment in the generalized estimating equations: A simulation study Science Journal of Applied Mathematics and Statistics 2014; 2(1): 20-25 Published online February 20, 2014 (http://www.sciencepublishinggroup.com/j/sjams) doi: 10.11648/j.sjams.20140201.13 Robust covariance

More information

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL Intesar N. El-Saeiti Department of Statistics, Faculty of Science, University of Bengahzi-Libya. entesar.el-saeiti@uob.edu.ly

More information

Generalized Linear Models I

Generalized Linear Models I Statistics 203: Introduction to Regression and Analysis of Variance Generalized Linear Models I Jonathan Taylor - p. 1/16 Today s class Poisson regression. Residuals for diagnostics. Exponential families.

More information

MODULE 12: Spatial Statistics in Epidemiology and Public Health Lecture 7: Slippery Slopes: Spatially Varying Associations

MODULE 12: Spatial Statistics in Epidemiology and Public Health Lecture 7: Slippery Slopes: Spatially Varying Associations MODULE 12: Spatial Statistics in Epidemiology and Public Health Lecture 7: Slippery Slopes: Spatially Varying Associations Jon Wakefield and Lance Waller 1 / 53 What are we doing? Alcohol Illegal drugs

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Obtaining Critical Values for Test of Markov Regime Switching

Obtaining Critical Values for Test of Markov Regime Switching University of California, Santa Barbara From the SelectedWorks of Douglas G. Steigerwald November 1, 01 Obtaining Critical Values for Test of Markov Regime Switching Douglas G Steigerwald, University of

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Pseudo-score confidence intervals for parameters in discrete statistical models

Pseudo-score confidence intervals for parameters in discrete statistical models Biometrika Advance Access published January 14, 2010 Biometrika (2009), pp. 1 8 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asp074 Pseudo-score confidence intervals for parameters

More information

Sample Size and Power Considerations for Longitudinal Studies

Sample Size and Power Considerations for Longitudinal Studies Sample Size and Power Considerations for Longitudinal Studies Outline Quantities required to determine the sample size in longitudinal studies Review of type I error, type II error, and power For continuous

More information

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA STATISTICS IN MEDICINE, VOL. 17, 59 68 (1998) CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA J. K. LINDSEY AND B. JONES* Department of Medical Statistics, School of Computing Sciences,

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data

Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data Journal of Modern Applied Statistical Methods Volume 4 Issue Article 8 --5 Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data Sudhir R. Paul University of

More information

Poisson Regression. Ryan Godwin. ECON University of Manitoba

Poisson Regression. Ryan Godwin. ECON University of Manitoba Poisson Regression Ryan Godwin ECON 7010 - University of Manitoba Abstract. These lecture notes introduce Maximum Likelihood Estimation (MLE) of a Poisson regression model. 1 Motivating the Poisson Regression

More information

Exploratory Spatial Data Analysis (ESDA)

Exploratory Spatial Data Analysis (ESDA) Exploratory Spatial Data Analysis (ESDA) VANGHR s method of ESDA follows a typical geospatial framework of selecting variables, exploring spatial patterns, and regression analysis. The primary software

More information

Surveillance of Infectious Disease Data using Cumulative Sum Methods

Surveillance of Infectious Disease Data using Cumulative Sum Methods Surveillance of Infectious Disease Data using Cumulative Sum Methods 1 Michael Höhle 2 Leonhard Held 1 1 Institute of Social and Preventive Medicine University of Zurich 2 Department of Statistics University

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

Answer Key for STAT 200B HW No. 8

Answer Key for STAT 200B HW No. 8 Answer Key for STAT 200B HW No. 8 May 8, 2007 Problem 3.42 p. 708 The values of Ȳ for x 00, 0, 20, 30 are 5/40, 0, 20/50, and, respectively. From Corollary 3.5 it follows that MLE exists i G is identiable

More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago Mohammed Research in Pharmacoepidemiology (RIPE) @ National School of Pharmacy, University of Otago What is zero inflation? Suppose you want to study hippos and the effect of habitat variables on their

More information

STAT 525 Fall Final exam. Tuesday December 14, 2010

STAT 525 Fall Final exam. Tuesday December 14, 2010 STAT 525 Fall 2010 Final exam Tuesday December 14, 2010 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

Appendix A. Numeric example of Dimick Staiger Estimator and comparison between Dimick-Staiger Estimator and Hierarchical Poisson Estimator

Appendix A. Numeric example of Dimick Staiger Estimator and comparison between Dimick-Staiger Estimator and Hierarchical Poisson Estimator Appendix A. Numeric example of Dimick Staiger Estimator and comparison between Dimick-Staiger Estimator and Hierarchical Poisson Estimator As described in the manuscript, the Dimick-Staiger (DS) estimator

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Approximate and Fiducial Confidence Intervals for the Difference Between Two Binomial Proportions

Approximate and Fiducial Confidence Intervals for the Difference Between Two Binomial Proportions Approximate and Fiducial Confidence Intervals for the Difference Between Two Binomial Proportions K. Krishnamoorthy 1 and Dan Zhang University of Louisiana at Lafayette, Lafayette, LA 70504, USA SUMMARY

More information