Shape and scale in detecting disease clusters


University of Iowa
Iowa Research Online, Theses and Dissertations, 2008

Shape and scale in detecting disease clusters
Soumya Mazumdar, University of Iowa
Copyright 2008 Soumya Mazumdar
This dissertation is available at Iowa Research Online.

Recommended Citation: Mazumdar, Soumya. "Shape and scale in detecting disease clusters." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008.

SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS

by
Soumya Mazumdar

An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Geography in the Graduate College of The University of Iowa

December 2008

Thesis Supervisor: Professor Gerard Rushton

ABSTRACT

This dissertation offers a new cluster detection method. This method looks at the cluster detection problem from a new perspective. I change the question "What do real clusters look like?" to the questions "What do spurious clusters look like?" and "How do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be identified from their geographical characteristics. These are related to the spatial distribution of people at risk, the shape and scale of the geographic units used to aggregate these people, the shape and scale of the spatial configurations that the disease mapping or cluster detection method may impose on the data, and the shape and scale of the area of increased risk. The statistical testing process may also create spurious clusters. I propose that the problem of spurious clusters can be resolved using a computational geographic approach. I argue that Monte Carlo simulations can be used to estimate the patterns of spurious clusters in any situation of interest given knowledge of the first three of these four determinants of spurious clusters. Then, given these determinants, where real measurements of disease or mortality are known, it is possible to show those areas of increased risk that are true clusters as opposed to those that are spurious clusters. This distinction is made in a three-dimensional signature space, with shape, size and rate as the three axes. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious clusters influences whether it can be recovered. These experiments show that this method is successful in detecting clusters. This method can also predict with reasonable certainty which clusters can be recovered, and which cannot. I compare this method with Rogerson's Score statistic method. These comparisons expose the weaknesses of Rogerson's method. Finally, these two methods and the Spatial Scan Statistic are applied to searching for possible clusters of prostate cancer incidence in Iowa. The implications of the findings are discussed.

Abstract Approved:
Thesis Supervisor
Title and Department
Date

SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS

by
Soumya Mazumdar

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Geography in the Graduate College of The University of Iowa

December 2008

Thesis Supervisor: Professor Gerard Rushton

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL
PH.D. THESIS

This is to certify that the Ph.D. thesis of Soumya Mazumdar has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Geography at the December 2008 graduation.

Thesis Committee:
Gerard Rushton, Thesis Supervisor
David Bennett
Naresh Kumar
Marc Linderman
Dale Zimmerman

ACKNOWLEDGMENTS

I would like to acknowledge the help I have received during the course of my stay in Iowa. I would like to thank Dr Rushton for supervising my research. I would also like to thank my committee members for their contributions. The last four years of my life have been emotionally challenging for me. I thank the great masters before us who have helped me through. I am thankful for the writings of M. Scott Peck, Viktor Frankl and Swami Vivekananda, and the yogic practices of the Sri Sri Art of Living Foundation. I would also like to thank my family members, especially my mom, mishtimashi and the late Dr Mazumdar for their support. Thanks are also due to all my friends and well-wishers.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1. DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS
    1.1 Statement of Purpose
    1.2 Introduction
    1.3 Organization of the dissertation
    1.4 Review of existing methods of cluster detection
        Map data without further geographic processing
            Methods that do not smooth the data
            Methods that smooth the data
        Methods that pre-process the data before calculating and/or testing for significant disease risk
            Non-combinatorial approaches
            Combinatorial approaches
            Hybrid approaches
    Significance testing and spurious clusters
    Identifying spurious clusters and distinguishing true clusters from spurious clusters
        The spatial distribution of the locations of people in the map
        The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas
        Identifying spurious clusters and distinguishing true clusters from spurious clusters
        Why use size, shape and rate

CHAPTER 2. THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS
    Theoretical foundations of the S.S.S method
    Hypothesis testing
    The simulated dataset
        Hypothetical study area and population
        Hypothetical case population
    Datasets under the null hypothesis of no clustering
    Extracting the cluster candidates
    Datasets under the alternative hypothesis of clustering
        Rationale behind the choice of these configurations of synthetic clusters
    Rogerson's Score Statistic
        Theory
        Diagnostics
    Computational Scheme
    Results
    Discussions and future directions

CHAPTER 3. INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA
    Background
    Methods
    Results
    Discussion
    Conclusion
    Contribution that this dissertation makes to the geography literature

REFERENCES

LIST OF TABLES

Table 2.1 Hold one validation for null hypothesis
Hold one validation for alternative hypothesis
Summary statistics of the simulated 3675 spurious clusters
Shape, size, risk (signature) and the ability to recover simulated clusters
The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such)
This table compares sensitivity and specificity with which clusters are recovered for SSS and Rogerson's method; the higher the sensitivity the better the cluster is recovered
Cluster recovery using only rates and only shapes
How do true clusters differ in shape and size from spurious clusters

LIST OF FIGURES

Figure 1.1 This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 600 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters
This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value
In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy
A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other. All people are at a risk of
Diseased people are randomly diseased over the map. These diseased people are color coded black to indicate a diseased state
A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster
In contrast to 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bound by the dotted lines to a greater risk than other areas of the map. These people are at a risk of
In one realization of the process a cluster of 10 people therefore are diseased in this area
The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people)
People are distributed non uniformly over space
The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown
The cluster generating process in figure 6 operates on the inhomogenously distributed population. The risk elevation is the same as in Figure
This causes 8 people to fall ill from an at-risk population of
1.11 The estimated cluster shape and size is very different from what the shape and size of the cluster is in reality (the dotted line in Figure 10). It is also very different from what was obtained for a homogenous distribution of people in Figure
Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease
Assuming an inhomogeneous distribution of people as in figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased
The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster
Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogenous distribution of people
The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality
In this example a slightly larger area of increased risk is considered than in the earlier example. 6 people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases/falling ill
The clusters that are generated have very different shapes. In fact the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster
In this example people are inhomogenously distributed. The same cluster generating process in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is
The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape
Using echelons to extract cluster candidates
A set of 50,000 cardiovascular disease mortality cases are randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS
An example set of spurious cluster signatures S(Z_N) in signature space
An example set of spurious cluster signatures S(Z_N) in signature space with a few candidate clusters (grey squares)
Bounding rectangle for elliptical footprint
2.6 Flowchart of the S.S.S method
Population distribution of ZCTAs in Iowa
This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=
The simulated datasets follow a multinomial distribution
Summary of shapes of simulated spurious clusters, frequency and cumulative frequency
Summary of sizes of simulated spurious clusters, frequency and cumulative frequency
Summary of rates of simulated spurious clusters, frequency and cumulative frequency
Characteristics of the four clusters simulated under the alternative hypothesis
Cluster detection diagnostics (the key to the numbers is in the text)
Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen inset. In this particular dataset S.S.S is able to identify 62% of the true cluster pattern, while the Score statistic is able to identify 20%
Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 91%
Spatial patterns of prostate cancer incidence in Iowa
Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method
Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal
Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular
Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical
ZCTAs in Iowa with a significant value of Rogerson's Score statistic
3.7 Expected number of cases in ZCTAs: Entire Iowa versus areas with a significant value of Rogerson's Score statistic
ZCTAs in the North West Iowa cluster of high prostate cancer incidence
County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence
Change in mortality and incidence rates in five counties, Dickinson, Clay, Buena-Vista, Emmet and Clay Counties, in the cluster. The expected counts for the particular year (1990, ) are calculated using 2000 census population for the local area, and incidence/mortality information for the state of Iowa (same procedure as indirect standardization)
Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Dickinson County
Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Clay County

CHAPTER 1: DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS

1.1 Statement of Purpose

This dissertation offers a new cluster detection method. This method looks at the cluster detection problem from a new perspective. I change the question "What do real clusters look like?" to the questions "What do spurious clusters look like?" and "How do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be identified from their geographical characteristics. These are related to the spatial distribution of people at risk, the shape and scale of the geographic units used to aggregate these people, the shape and scale of the spatial configurations that the disease mapping or cluster detection method may impose on the data, and the shape and scale of the area of increased risk. The statistical testing process may also create spurious clusters. I propose that the problem of spurious clusters can be resolved using a computational geographic [1] approach. I argue that Monte Carlo simulations can be used to estimate the patterns of spurious clusters in any situation of interest given knowledge of the first three of these four determinants of spurious clusters. Then, given these determinants, where real measurements of disease or mortality are known, it is possible to show those areas of increased risk that are true clusters as opposed to those that are spurious clusters. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious clusters influences whether it can be recovered. These experiments show that this method is successful in detecting clusters. This method can also predict with reasonable certainty which clusters can be recovered, and which cannot. I compare this method with Rogerson's Score statistic method [2]. These comparisons expose the weaknesses of Rogerson's method. Finally, these two methods and the Spatial Scan Statistic [3] are

applied to searching for possible clusters of prostate cancer incidence in Iowa. The implications of the findings are discussed.

1.2 Introduction

Disease mapping has a long history. From the example of John Snow's cholera map to the intelligent agents [4] of the present century, disease mapping has progressed with developments in science, especially Geographical Information Systems (G.I.S) and epidemiology. Some of the first disease maps were simple dot maps indicating the location of disease cases. These gave way to maps of statistical summaries known as "thematic maps". These maps convey more information than simple dot maps and are therefore powerful exploratory and decision making tools. For example, when mortality maps of lung cancer for the United States were made in the 1960s, high rates were found in areas of the Eastern Seaboard [5, 6]. Later, these high rates were attributed to exposure to asbestos among shipyard workers in these areas. A disease map can thus be used to map spatial variations in disease risk. A decision maker can ask "Is a person living in a given area at a greater risk of disease than a person living in another area?" or "In which areas of the map do people have the greatest risk of disease?" In the disease mapping literature the problem of finding areas of excess risk is often called "cluster detection", a cluster being defined as "A geographically bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance" [7] or, in plain English, a geographic area of high disease risk. A geographical cluster is therefore spatially analogous to statistical clustering [8], where the question of interest is finding things near in statistical space instead of geographical space. While investigating the causal factors (or etiology) of areas of increased risk is important, there are other important applications of these methods. Public health agencies are often interested in allocating resources to areas with an increased burden of disease [9, 10]. Cluster detection methods are used to identify areas with increased burden of

disease. Sometimes, environmental policy is formulated on the basis of such studies. In one instance, the Vatican was taken to task for operating radio transmitters at illegal frequencies after studies showed an increased risk of cancer among people living close to these transmitters [11, 12]. Note that policies are often formulated on the basis of evidence that an increased risk exists even though the etiological basis for the increased risk may not have been established. An interesting extension to etiological research is that the presence of spatial clusters of increased risk could also be used to prove the existence of disease risk factors that are spatially non-random. For example, it has been claimed that clusters of autism in California prove the existence of risk factors that are not related to genetics or the vaccine hypothesis 1 (barring selective migration) [13]. Many public health agencies maintain on-the-fly cluster investigation infrastructure to address cluster related enquiries [14]. A number of methods exist that can be used to delineate clusters. A persistent problem with many of these methods is that areas not at high risk are identified as being at high risk. Some convenient terms for such false positives are "noise" [15], noisy clusters or spurious clusters [16-19]. In this research I develop a method to detect and adjust for the occurrence of spurious clusters in cluster detection studies. The cluster detection literature identifies at least three types of spurious clusters. The first arises when the estimate of risk in an area is based on a small number of people [15]. These estimates of risk are unreliable and therefore the area may not have a significant excess risk. A number of solutions exist to solve this problem [20-26]. The second type of spurious cluster stems from statistical issues in the cluster detection method. For example, failing to adjust for multiple hypothesis testing problems may give rise to spurious clusters [18, 27]. This problem is an area of active research [28].

1 The vaccine hypothesis is that exposure to Thimerosal, a mercury-based additive in vaccines, is a risk factor for autism.

Kulldorff's SaTScan method resolves this problem by adopting a likelihood based hypothesis testing framework [3]. The third type of spurious cluster is created by a mismatch of the scale and spatial structure of the process that generates the cluster with the scale and spatial structure used to measure the process. The scale and spatial structure, or spatial form, of the cluster search process (which measures or samples the underlying data) can generate spurious clusters. Unlike the other sources of spurious clusters, very little research exists on this form of noise. There are a number of reasons for this. Until recently, the computational power available to researchers for cluster detection problems was limited. A cluster can have any geometry or spatial form in reality. However, limited computational power confined researchers to searching for clusters within a small range of spatial forms. For instance, it is a common strategy to search for circular clusters. This strategy was adopted by some of the first cluster search methods [27], and remains common today [29]. If the real cluster is not circular in shape, then the power to detect non-circular clusters is greatly reduced. But a limited search also implies that the likelihood of mismatch between the circles and the underlying true cluster is also limited (given that the spatial form of this true cluster is unknown). In contrast, if the cluster search incorporates a number of different spatial forms, then the likelihood of mismatch increases. Since computational power is not a limiting factor anymore, some researchers have developed "shape free" disease cluster detection methods. These methods, which draw from the work of geographers in the 1960s and 70s [30], measure spatial attributes (like disease counts or rates) at a large number of possible shapes, sizes and scales. The measured spatial attributes, or some functions of the attributes, are used to decide if an area of a given shape and size at a given scale is a cluster or not. For example, Duczmal's [31] scan assigns a likelihood value to each cluster it finds, where the likelihood is a function of attributes such as the observed number of cases in the cluster. The clusters with the highest likelihood are most likely to be clusters. These methods thus promise to

seek out the true clusters, no matter what their spatial form. However, this also means that, at some shape and scale, noise or spurious clusters will be detected. These spatial forms will represent a mismatch between the shape and scale of the process that generated the pattern and the shape and scale of the process being used to detect it. The closest analogy that can be drawn is what is known in the disease mapping literature as the Texas Sharpshooter Effect. If a shotgun is used on a wall, then the wall is splattered with seemingly random bullet holes. At the scale of the wall, the process is random. However, it is always possible to draw targets a posteriori around the bullet holes. The act of drawing a target is similar to searching for a cluster at a scale different from the scale at which the original process occurred (the entire wall). Duczmal's search procedure thus often finds clusters that are spurious. Such spurious clusters will be found by any method that offers even the least amount of geometric freedom to the cluster search. In fact, these spurious clusters have even been found when the search is limited to circular geometries (for example, see Kulldorff [32]). Tackling this problem therefore requires a) a thorough understanding of what gives rise to these spurious clusters, and b) a method to solve, or at the very least manage, this problem. This dissertation is an attempt at this. It is clear that an understanding of this problem requires an understanding of the scale and shape of the spurious cluster or noise generating process. The shape, size and risk elevation of a cluster, whether spurious or real, is unique to each and every disease mapping/cluster detection situation. The characteristics (shape, size and risk elevation) of a cluster depend on: a) the cluster generating process, especially the shape and size of the area of excess risk, b) the spatial distribution of people over space, and c) the scale at which the spatial data are aggregated [19]. These factors are unique to each disease mapping situation/example, and these factors are responsible for creating spurious clusters. Once we have established these facts, two take-home facts are: 1) every disease mapping situation has a unique noise or spurious cluster signature, and 2) it is not possible to

guess this signature a priori. However, this signature may be computed as explained below. Since each disease mapping situation has a unique noise or spurious cluster signature, it follows that in every disease mapping situation there will be some clusters which will be hard to detect. These clusters will be similar in some ways to the spurious or noisy clusters. This issue, the issue of recoverability, has only just started being discussed in the disease mapping literature [33, 34]. The method I describe incorporates the following features. First, it extracts cluster candidates using an exploratory approach. Second, shape, size and rate are used to distinguish true clusters from spurious clusters. Third, the method incorporates recoverability of clusters into the analyses. The researcher is able to know (computationally) a priori which spatial forms of clusters are recoverable. The method utilizes computational geography and two fundamental geographic aspects of clusters, shape and size, to analyze the recoverability of clusters and to separate true clusters from non-clusters or spurious clusters. This dissertation diverges from the traditional disease clustering literature in taking shape and size into consideration. Traditionally, only the rate at a given location, or some function of the rate, is used to separate a true cluster from a spurious one. Since the method incorporates the shape and size of the cluster in its analysis, I call it the Shape, Size Sensitive disease cluster detection method or the S.S.S method. The S.S.S method is tested and validated on simulated data. This method demonstrates the power of computational geography over traditional methods [35]. The ideas and methods developed and tested in this dissertation are either new, or have been discussed only in scant detail in the literature. Yet, they are fundamental to geography and disease mapping. This research thus makes an important contribution to the disease mapping literature.
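To make the shape-size-rate idea concrete, the sketch below is a minimal illustration, not the S.S.S implementation detailed in Chapter 2: it places a candidate cluster in a three-axis signature space and asks whether it sits inside the cloud of Monte Carlo spurious-cluster signatures. The z-scoring, the nearest-neighbour rule, the parameter k and the cut-off are all illustrative assumptions.

```python
import numpy as np

def signature(cluster):
    """A cluster's position in the three-axis signature space: shape, size and rate.
    How each axis is measured is a Chapter 2 question; here they are assumed given."""
    return np.array([cluster["shape"], cluster["size"], cluster["rate"]])

def looks_spurious(candidate, spurious_signatures, k=50, cutoff=1.0):
    """Return True when the candidate's signature sits inside the cloud of simulated
    spurious-cluster signatures (and is therefore hard to recover).
    spurious_signatures: (n, 3) array of Monte Carlo signatures under the null.
    The standardization, k and cutoff are illustrative choices, not the S.S.S rules."""
    sigs = np.asarray(spurious_signatures, dtype=float)
    mean, std = sigs.mean(axis=0), sigs.std(axis=0)
    z_cloud = (sigs - mean) / std                      # put the three axes on a common scale
    z_cand = (signature(candidate) - mean) / std
    dists = np.sort(np.linalg.norm(z_cloud - z_cand, axis=1))
    return dists[:k].mean() < cutoff                   # close to many spurious signatures
```

By this logic, a candidate that resembles many simulated spurious clusters would be hard to recover, while one whose signature sits far from the cloud is a plausible true cluster.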

1.3 Organization of the dissertation

In this chapter (Chapter 1) I discuss how various disease mapping and cluster detection techniques approach the problem of spurious clusters. I then argue that these methods do not address the issue of spurious clusters adequately. I suggest that a geographical approach can help us better understand the problem and explain how geography gives rise to spurious clusters. Then, having understood the geographical bases for spurious clusters, I propose a geographically sensitive disease cluster detection method. I explain this method, the Shape Size Sensitive (S.S.S) method, in Chapter 2. Then, using simulated data, I test the sensitivity of this method. I also compare the performance of the S.S.S method with Rogerson's Score statistic method for detecting disease clusters. The final, short chapter is Chapter 3. Here I use the S.S.S method, Rogerson's Score Statistic and Kulldorff's Spatial Scan Statistic to investigate the spatial patterns of prostate cancer risk in Iowa. The implications of the findings are discussed.

1.4 Review of existing methods of cluster detection

All disease mapping and cluster detection approaches share a common goal. This is to uncover the underlying pattern of risk. These methods calculate statistics, such as rates or likelihoods, which serve as measures of risk. The "patterns" on a map are obtained by mapping either these statistics, or those areas that cross some threshold of the calculated statistic. When the second procedure is followed, that is, when the rate or the likelihood of an area having an excess risk is statistically tested, the method is often called a cluster detection method. Most cluster detection methods test a large number of areas which could possibly be clusters. These are called candidate clusters [31, 36] or cluster candidates. If a cluster passes the statistical test, but demarcates an area where no cluster exists in reality, then it is a noisy cluster [31] or spurious cluster [16-19]. The term true cluster may be used to indicate geographic areas of excess risk. It is also

possible that a true cluster is suppressed by the cluster detection process. In the disease cluster detection literature this problem is usually not discussed separately, but forms an integral part of the spurious cluster detection problem. Spurious clusters may be created at various stages in the disease mapping/cluster detection process. The first step in applying a cluster detection method is to collect spatial data. These data may come pre-aggregated into administrative regions, or they may come in individual form [37, 38]. If the data are in individual form, they need to be processed and aggregated such that summary statistics may be gleaned from them and the summary statistics mapped. The process of aggregation may create spurious clusters. One solution is to use the individual level data to search for clusters [39]. While a number of methods will work with both aggregated and individual level data, there are very few methods that have been developed exclusively for individual level data [40, 41]. With better quality data being increasingly available, such analyses will become more common [37, 42]. The majority of disease mapping situations start with aggregated data, and summary statistics are calculated from these datasets. When the summary statistics are calculated based on a small base population (also called a small "support size"), then these statistical estimates are likely to be unreliable. This is the small number problem. Some methods carry out a process called "smoothing", where information from neighboring regions is used to obtain a better estimate of the mapped statistic for a given region. This, to some extent, alleviates the problem of spurious clusters created from small numbers. The statistical testing procedure could also create spurious clusters. If multiple hypothesis tests are carried out without adjustment, then this process may also give rise to spurious clusters. In a famous example, Openshaw [27] carried out multiple hypothesis tests when searching for leukemia clusters in Northern England. Whenever a test was significant, a circle was drawn. Some of these circles were spurious clusters, and would not have existed if adjustments for multiple testing had been carried out. Sometimes, using the wrong reference distribution may also create spurious clusters. Conversely, using overly conservative

multiple testing correction techniques may suppress true clusters [28]. Waller and Gotway [4] write of situations where, for a Poisson reference distribution, it is not possible to distinguish a lack of fit to the Poisson distribution (spurious cluster) from a rejection of the null hypothesis (true cluster). This is an area of active statistical research, and some new and innovative solutions have been proposed to these problems [43, 44]. Kulldorff's SaTScan method uses a likelihood based hypothesis testing framework to solve the problem of multiple testing [3]. Instead of testing multiple hypotheses, this method tests only one hypothesis. This hypothesis test is carried out on the cluster candidate that is most likely to be a cluster. The likelihood is a statistical function that is calculated under the assumption that the observed data conform to certain known distributions (e.g., Poisson or binomial). There still remains the third source of spurious clusters. Unlike the first two, there is little research on this source of spurious clusters. This is when spurious clusters are created from a mismatch between the process that generates the disease map patterns and the processes used to recover the patterns. This mismatch could arise when the data are aggregated to administrative regions, or to other shapes and scales by the method of analysis. In this section I discuss the various methods of cluster detection in the context of their ability to handle this problem. Among the various methods available, some methods offer the opportunity of multiscalar analysis. In these methods, the data may be geographically rescaled. While these methods geographically process the data before mapping patterns, other methods consider the sanctity of geographic boundaries unbreachable. The latter attempt to expose the underlying risk pattern by mapping summary statistics within existing geographic boundaries without any further geographic processing of the data.
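Before turning to the individual methods, a minimal simulation illustrates the multiple-testing point made above: when many small areas are tested independently under a true null, some are flagged as "clusters" purely by chance, and a blunt correction such as Bonferroni removes them at the cost of power. The 942 areas echo the number of Iowa ZCTAs used later in this dissertation; all other numbers are arbitrary, illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha, n_maps = 942, 0.05, 1000      # m small-area tests per simulated map

unadjusted, bonferroni = [], []
for _ in range(n_maps):
    p = rng.uniform(size=m)                      # p-values when the null is true everywhere
    unadjusted.append(int((p < alpha).sum()))    # areas flagged with no correction
    bonferroni.append(int((p < alpha / m).sum()))

# With no correction roughly alpha * m (about 47) areas per map are falsely flagged;
# with Bonferroni the expected count drops to about alpha (0.05), but genuinely
# elevated areas become correspondingly harder to detect.
print(np.mean(unadjusted), np.mean(bonferroni))
```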

Map data without further geographic processing

In these methods the geographic boundaries of regions are left as they are; however, various statistical manipulations are carried out on the data. Some researchers prefer to call this group of methods disease mapping methods [45]. As I discussed earlier, these methods can again be subdivided into two groups: methods that smooth the data and methods that do not smooth the data.

Methods that do not smooth the data

The vast majority of disease maps are maps of raw rates, where the number of cases per unit population within existing geographic regions such as counties or states is mapped [46]. Another approach is a "map of probabilities" [47, 48], where instead of mapping a rate, the probability of observing the rate within a geographic region is mapped. Mapping raw rates is often problematic when the rates are based on small base populations [15]. The maps thus produced are likely to display noisy (small number problem) patterns.

Methods that smooth the data

In these methods various statistical manipulations are used to smooth the rates in each region while at the same time keeping the geographic boundaries intact. Information from neighboring regions is used to stabilize the rates in a given region. Some examples of this approach can be found in the Bayesian disease mapping literature [23, 24]. Other examples are the method of moving averages and headbanging [20, 22]. These methods are not very successful in dealing with the problem of spurious clusters. A study by Kafadar [22] has shown that many of the popular smoothers such as headbanging and empirical Bayes are unable to detect true patterns in the data or have issues with detecting spurious patterns or clusters. Some of the methods smooth the data

by averaging rates over kernels or filters. For example, Sabel et al. [49] investigate rates of Amyotrophic Lateral Sclerosis (Lou Gehrig's disease) incidence in Finland by smoothing rates using Gaussian kernels. Another method is Rogerson's Local Score statistic [2, 4, 50]. In this method the deviations from the expected rate are smoothed using Gaussian kernels. Like other methods, if the rates are based on small numbers, then smoothing these unreliable rates may create spurious clusters. I use Rogerson's Score statistic in my research and therefore this method is discussed in detail in later sections. Spurious clusters are often created by these methods. First, because these methods map the rates based on small areas before smoothing them, they are prone to the small number problem. Second, these methods do not in any way attempt to deal with the problem of spurious clusters from spatial mismatch discussed earlier. Third, the statistical tests that these methods carry out may not be able to distinguish spurious clusters from true clusters. For example, there is no consensus on what the correct reference distribution is for Rogerson's Score statistic [2, 4, 50]. A separate group of methods that often smooth the data are local measures of spatial similarity. These methods, which are also known as LISA (Local Indicators of Spatial Autocorrelation) [51], address the question: How similar is the risk at a given small area to that of its neighbors? The greater the similarity, the higher the likelihood that the small area belongs to (or is) a cluster. Some of the LISA statistics are local Moran's I and local Geary's C [50-54]. Since the underlying philosophy of this approach is that things nearer are more similar than things farther away [55], the implicit definition of scale here is the distance at which this similarity is manifested. Thus a process that acts at a large scale may produce similarity among areas farther apart than a process that works at a smaller scale, which may produce similarity only among immediately neighboring local areas. Like other methods, if the statistics are calculated on small areas, they could be unreliable. The reference distributions of LISA statistics are often not known [4], and the scale at which a process operates is not investigated before

LISA statistics are calculated. Any of these factors could lead to the creation of spurious clusters.

Methods that pre-process the data before calculating and/or testing for significant disease risk

These methods allow the modification of geographic boundaries to extract the underlying risk surface and/or to find which area has the greatest excess in disease risk. One group of methods, often called density estimation methods [56], simply ignores existing geographic boundaries. Drawing from the "field" theory of geographic phenomena [20], they consider that disease risk patterns are continuous in nature and that they do not change or stop abruptly at geographic boundaries. When appropriately used, these methods provide the opportunity to control the spatial basis of support, and thus the scale of the analysis [57, 58]. The other group of methods draws from concepts of region building which were developed by geographers [30]. One approach to building regions is to coalesce groups of areas to build aggregate regions. These methods attempt to find that combination of areas which has the greatest likelihood of being a zone of high disease risk. A third group of methods combines concepts of region building with the first group of methods or with methods discussed in the last section. The ability of all these methods is limited by the scale of the data. Often the data come aggregated into small areas and the analysis must be carried out at scales equal to or greater than the scale of aggregation. Nevertheless, these methods are better equipped than other methods to control the shape and the scale of the data, and this gives them an edge over other methods when dealing with the problem of spurious clusters.
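To illustrate what controlling the spatial basis of support means in practice, the sketch below estimates rates at grid points with a fixed circular filter and with an adaptive filter that grows until a target population is enclosed (the grid-and-filter strategy is elaborated in the next subsection). The point-referenced data layout and the function names are illustrative assumptions, not the DMap implementation.

```python
import numpy as np

def filtered_rate(grid_point, people_xy, is_case, radius):
    """Fixed circular filter: cases within `radius` divided by people within it."""
    d = np.linalg.norm(people_xy - grid_point, axis=1)
    inside = d <= radius
    pop = inside.sum()
    return is_case[inside].sum() / pop if pop else np.nan

def adaptive_rate(grid_point, people_xy, is_case, support_size):
    """Adaptive filter: grow the circle until `support_size` people are enclosed,
    so every grid-point estimate rests on the same population base."""
    d = np.linalg.norm(people_xy - grid_point, axis=1)
    nearest = np.argsort(d)[:support_size]
    return is_case[nearest].sum() / len(nearest)
```

The radius (or the support size) is the analyst's handle on scale: widening the filter trades local detail for statistical stability, and the adaptive version makes every estimate equally reliable regardless of how densely people are settled.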

Non-combinatorial approaches

These methods ignore geographic boundaries and attempt to extract the underlying patterns of risk. They often lay a uniform grid over the map area and measure the statistic of interest at each grid point. Irrespective of whether the data are aggregated or not, a value can be obtained at each grid point. While there are a number of approaches to calculating the statistic at each grid point [21], a simple and common approach is to "filter" the data using circular spatial filters [3, 9, 21, 27]. Some methods map the statistic calculated at each grid point [9] while others do not [3]. These circles can be of fixed or varying sizes. However, since these filters are of a certain shape, they bias the cluster search. The bias is in favor of detecting clusters of, or similar to, the shape of the filter (circles in this case). Statistically, the clusters that are of the shape of the filter have a higher power of detection than clusters of other shapes. This approach therefore overcomes the limitation outlined in the methods discussed earlier, but is limited in its treatment of geographic shape. Ellipses and other geometric shapes have also been studied [29, 59]. One of the methods, based on Rushton's Adaptive DMap [9], maps rates at grid points using adaptive filters and interpolates these with an IDW (Inverse Distance Weighting) interpolation algorithm. The adaptive filter [58, 60] ensures that the rates are based on the same number of people, or the same support size. Thus, unlike the LISA methods, all statistics are equally reliable. Also, the use of an adaptive filter ensures that the scale of the analysis can be precisely controlled. The Inverse Distance Weighting algorithm used for creating the final pattern was also found by Kafadar [22] to be the least noisy of all smoothing/interpolation methods. Thus, by allowing multiscalar analysis, relative freedom of cluster shape (clusters don't have to conform to geographic boundaries) and using a robust interpolation technique, Rushton's Adaptive Filtering method is best suited for dealing with the problem of spurious clusters from mismatch between the process and analysis scales. I use this method in my analyses. Another important density estimation method is Kulldorff's SaTScan [3]. While the

DMap method maps the extracted pattern, and is therefore good for visualizing and exploring the underlying pattern, SaTScan can be used to map only those areas that are significant clusters. SaTScan has found wide acceptance in the public health community because of its ability to account for the multiple hypothesis testing problem and its robust, freely available software. Some of the recent developments in the disease clustering literature have followed the combinatorial approaches that I discuss next, and their method of choice has been based on the Spatial Scan Statistic method of cluster detection. Since multiple testing is an issue with these combinatorial approaches, the Spatial Scan Statistic is a reasonable choice. Since I use the Spatial Scan Statistic in Chapter 3 to investigate clusters of prostate cancer in North West Iowa, some of the details of the Spatial Scan Statistic are provided next. The scan statistic originated as a one dimensional test. Its objective was to test if a one dimensional point process is purely random. The one dimensional scan statistic was extended by Kulldorff into the spatial domain [3]. The spatial scan statistic moves a circle across the study area. The circle is centered on a centroid. The centroid could be the location of a single individual (for unaggregated data), the centroid of a census tract (for example, for aggregated data), or a point on a set of grid points. Kulldorff (1997) [3] states that the zone defined by a circle consists of all individuals in those cells whose centroids lie inside the circle, and each zone is uniquely identified by these individuals. Thus, although the number of circles is infinite, the number of zones will be finite. For unaggregated data the zones are perfectly circular, that is, the individuals in the zone are exactly those located within a defining circle. With data aggregated into census districts, a zone may have irregular boundaries that depend on the size and the shape of the several contiguous census districts it includes. The Spatial Scan Statistic is implemented through the freely available software SaTScan [32]. The methodology of the Spatial Scan Statistic is explained as follows. The method involves two steps: 1. confounder adjustment and 2. hypothesis testing.

In disease cluster detection studies known risk factors or confounders are adjusted for before the cluster detection algorithm is implemented. Thus, for example, it is known that age is associated with prostate cancer. It may be desirable to remove the effect of age from the analyses, such that the clusters that are detected reflect the presence of other, yet unknown, risk factors. The confounder adjustment procedure that SaTScan utilizes is known as the indirect standardization method. It is as follows. If

e_i = expected number of cases in local area/ZCTA i after confounder adjustment,
n_i = observed number of cases in local area/ZCTA i,
r = a specific confounder group, for example an age group,
ρ = total number of confounder groups,
n_r = total number of cases in G in age group r, and
N_ir = total number of people in G in local area i, in age group r,

then the confounder adjustment procedure is:

e_i = Σ_r [ (n_r / Σ_i N_ir) × N_ir ]

The adjusted numbers of cases are then used to test the hypothesis that a given local area/ZCTA i has an excess risk or belongs to a cluster. The hypothesis testing procedure is explained next. The Spatial Scan Statistic tests the hypothesis that a given area of the map (for example, a collection of ZCTAs) has a greater (or lesser) risk than the rest of the ZCTAs in the entire geographic region G. If Z_j is the j-th cluster:

For all possible Z_j's in Z (the collection of k possible clusters in G), if R(inside, j) is the risk inside Z_j and R(outside, j) is the risk outside Z_j, then the null and alternative hypotheses are:

H0: R(inside, j) = R(outside, j)
H1: R(inside, j) > R(outside, j)

The observed number of cases n_j inside (or outside) a cluster candidate is assumed to be Poisson distributed, a function of the expected number of cases in the cluster e_j and the risk R(inside, j). Let n be the total number of observed cases in G. Then

n_j ~ Poisson[ e_j × R(inside, j) ]

The likelihood function that is used, from these null and alternative hypotheses, is as follows:

λ = Likelihood(R(inside, j) > R(outside, j)) / Likelihood(R(inside, j) = R(outside, j))

This likelihood ratio can be solved and written in logarithmic form as follows:

Log Likelihood Ratio or LLR_j = n_j ln(n_j / e_j) + (n − n_j) ln[(n − n_j) / (n − e_j)]

The significance of the log likelihood ratio is tested using a Monte Carlo hypothesis test. The SaTScan program carries out a user-specified number of Monte Carlo randomizations of the data and tests the significance of the presence of a cluster at a user-specified level. A p value is reported. This is calculated as p value = rank of LLR / (1 + number of simulations). Note that the spatial scan statistic procedure does not adjust for multiple testing in the traditional sense, for example by carrying out a Bonferroni or other multiple testing adjustment procedure. Instead, it avoids the problem of testing multiple hypotheses by concentrating on the cluster candidates that are most likely to be true clusters (and thus have the highest log likelihood value). Also note that the Spatial Scan Statistic procedure explained above is the spatial Poisson model, which is the model used in disease mapping. There are numerous other modifications to the Spatial Scan Statistic procedure [29].
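The two steps just described can be sketched in a few lines of code. This is an illustrative restatement of the formulas above rather than the SaTScan software itself: the enumeration of zones from circles over centroids is replaced here by a user-supplied list of candidate zones, and the function names and array layout are assumptions for the sketch.

```python
import numpy as np

def expected_counts(N, n_r):
    """Indirect standardization: e_i = sum_r (n_r / sum_i N_ir) * N_ir.
    N: (areas, strata) population table; n_r: (strata,) case totals per stratum."""
    N = np.asarray(N, dtype=float)
    stratum_rate = np.asarray(n_r, dtype=float) / N.sum(axis=0)
    return N @ stratum_rate

def llr(n_j, e_j, n):
    """Poisson log likelihood ratio of the text, reported only when the zone's
    risk is elevated (n_j > e_j)."""
    if n_j <= e_j:
        return 0.0
    second = 0.0 if n_j == n else (n - n_j) * np.log((n - n_j) / (n - e_j))
    return n_j * np.log(n_j / e_j) + second

def scan(obs, e, zones, n_sims=999, seed=0):
    """Most likely cluster and its Monte Carlo p-value.
    obs, e: numpy arrays of observed and expected counts per area;
    zones: list of index arrays, the candidate Z_j's."""
    rng = np.random.default_rng(seed)
    n = obs.sum()
    def best(counts):
        scores = [llr(counts[z].sum(), e[z].sum(), n) for z in zones]
        return int(np.argmax(scores)), max(scores)
    j_star, observed_llr = best(obs)
    # Randomize the n cases over areas in proportion to expected counts (the null).
    sims = [best(rng.multinomial(n, e / e.sum()))[1] for _ in range(n_sims)]
    rank = 1 + sum(s >= observed_llr for s in sims)
    return zones[j_star], observed_llr, rank / (1 + n_sims)   # p = rank / (1 + #simulations)
```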

31 17 value). Also note that the Spatial Scan Statistic procedure explained above is the spatial Poisson model, which is the model used in disease mapping. There are numerous other modifications to the Spatial Scan Statistic procedure [29] Combinatorial Approaches Some geographers are interested in creating or building regions [30, 61-64]. Regions are built up by assigning small areas to groups such that they fulfill certain criteria. Regional geographers have called this the assignment problem. Small areas are so assigned to regions, that a certain attribute of the region is optimized [30, 62]. Sometimes, the problem could involve maximizing the variation in an attribute of the newly built region as a proportion of the variation within the entire map [30, 65]. The general question in this approach is What combination of areas will optimize a given objective? ". In the disease mapping context disease risk or the likelihood of risk can be maximized. An example in the disease mapping context was investigated by Alvanides [61]. A similar strategy was also suggested (but not implemented) by Rushton [66]. These ideas were implemented in computer programs first by Openshaw [64] and later by other researchers [63, 67, 68]. Independently Duczmal suggested a similar solution to finding disease clusters of any shape. He operationally achieved this by maximizing the Spatial Scan Statistic likelihood function over possible combinations of areas. While it is sometimes possible to look at all possible combinations/ collections of areas, for most realistic geographical areas this is not possible (For example, see Cliff and Haggett [62]). Neither are there theoretical solutions to the problem. In operations research, such problems are called np-complete. This means that for a collection of n areas, the problem cannot be solved in polynomial computer time. Heuristics are used to solve such problems. Duczmal uses the Simulated Annealing (SA) and Genetic Algorithm (GA) heuristics in his research [31, 69]. An important aspect of these methods is that they provide enormous freedom of analysis of shape and scale. The analysis scale and shape

vary across a multitude of combinations. Thus instead of asking the question "Is there a cluster at a given scale of the following shape?" these methods demand: "Find clusters of any shape at any scale." This makes these methods immensely powerful. But this strength also brings about a weakness. If spurious clusters are created from a mismatch between the process and analysis scale and shapes, and if a large number of scales and shapes are evaluated by this analysis method, then it follows that noisy clusters will almost always be detected by these methods alongside genuine or true clusters. At the end of this section we shall see an example of this. The next section discusses some of the modifications that researchers have proposed to these methods. These modifications offer better power of detecting clusters.

Hybrid Approaches

These approaches combine some of the strategies of the non-combinatorial approaches with a combinatorial search. Some examples are the approaches proposed by Patil and Taillie [70], Tango [71] and Yiannakoulias [36]. Tango proposed that the search begin with a circular cluster as a "seed", but then regions adjacent to the circular cluster be coalesced with it and the resulting hybrid be tested as a possible cluster. With every level of adjacency enumerated the problem becomes computationally complex, and therefore in their example Tango suggested that three levels of adjacency be tested. Patil and Taillie's [70] approach is limited to restricting the search space to areas with the highest rates, which Patil and Taillie call the "upper level sets". These methods provide interesting extensions to the combinatorial shape-free methods of cluster search. We are now in a position to summarize the various methods discussed. All the methods outlined above have one singular goal: to extract the underlying pattern of significant excess risk. Some methods are good at mapping the entire pattern [9], while others are good at testing for significant excess risk [3]. In the next section, I discuss how problems with significance testing can introduce spurious clusters.
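A minimal sketch of the seed-and-grow idea behind these hybrid searches is shown below. The adjacency structure, the subset enumeration and the three-level cap are illustrative assumptions rather than Tango's actual algorithm, and each enumerated candidate would still have to be scored, for example with the log likelihood ratio sketched earlier.

```python
from itertools import combinations

def grow_candidates(seed, adjacency, max_levels=3):
    """Enumerate cluster candidates by coalescing areas adjacent to a seed zone.
    seed: iterable of area ids forming the initial (e.g. circular) cluster.
    adjacency: dict mapping each area id to the set of its neighbours.
    Every non-empty subset of the frontier's neighbours is added at each level,
    so the number of candidates explodes quickly -- the reason only a few
    levels of adjacency are enumerated in practice."""
    candidates = {frozenset(seed)}
    frontier = {frozenset(seed)}
    for _ in range(max_levels):
        next_frontier = set()
        for zone in frontier:
            nbrs = set().union(*(adjacency[a] for a in zone)) - zone
            for r in range(1, len(nbrs) + 1):
                for extra in combinations(sorted(nbrs), r):
                    next_frontier.add(zone | frozenset(extra))
        candidates |= next_frontier
        frontier = next_frontier
    return candidates
```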

Significance Testing and Spurious Clusters

In general, all methods at some point address the following question: Of all the candidate clusters in the pattern of risk (whether mapped or not), which clusters are true clusters? Each candidate cluster has a specific risk elevation, a size, and a shape. Traditionally, most "cluster detection" techniques have used some function of the risk elevation or rate of a given area to decide if the area is a true cluster. The question that is asked is "How likely are we to observe this risk elevation or rate in this area if the underlying process is noise?" If this probability is small, then the area is unlikely to be noise and is flagged as a cluster. The distribution of risks/rates under the process of noise is also known as the reference distribution. Traditionally, the reference distribution is normatively chosen. Some choices are the normal distribution [2, 50], the chi-squared distribution [2, 50], the Poisson distribution [3] and the Gumbel distribution [43]. However, using such distributions is problematic. If the populations are small, the normal distribution cannot be used. It is often hard to distinguish a lack of fit to the chi-squared distribution from a genuine deviation from the chi-squared distribution (indicating clustering) [4]. A more robust method of achieving this is to use a Monte Carlo simulation approach to empirically determine the reference distribution. Methodologically this may be achieved by simulating a series of maps, in each of which noise is the underlying process. Multiple Monte Carlo simulations of the data are used to mimic the noise process. If the observed risk elevation (or some function of the risk value such as the rate) for the area is significantly different from the ones in the simulated maps, then the area is considered to be a cluster. However, Monte Carlo simulations do not guarantee that spurious clusters will not be detected. Steenberghen et al. [72] carried out an experiment that illustrates this problem. This is displayed in Fig 1.1. Fig 1.1 is a map in which simulated locations of traffic accidents (points) were randomly scattered [72], filtered using 600 meter filters,

the density of points estimated, the resulting clusters tested for significance and the level of significance displayed (also known as a p-map). If areas which reach significance are called clusters, the black shapes in Figure 1.1 are spurious clusters. Some methods attempt to tackle this problem with a combination of both Monte Carlo and normative statistical techniques. Examples are Duczmal's and Kulldorff's methods. Duczmal's method [3, 31, 43, 69, 73] (which derives from Kulldorff's method) generates a large number of irregular cluster candidates. For each candidate the rate is calculated. The rate is then fed into a function known as a likelihood function to yield a likelihood value of the cluster candidate being a true cluster. This value is divided by the likelihood of the cluster candidate not being a true cluster. This ratio is known as the likelihood ratio. The likelihood ratios for all cluster candidates are calculated. The cluster candidates with the highest ratios are the most likely clusters. Multiple Monte Carlo simulations are carried out, and the rates at all the candidate clusters calculated. Again, the rates are fed into the likelihood function, thus generating a reference distribution of likelihood ratios for each cluster candidate. The likelihood ratio value of the cluster candidate is compared with the reference distribution to decide if the cluster candidate is a true cluster. However, when Duczmal applied this approach to some of his data, problems with this approach were dramatically exposed. In one of his studies Duczmal [31] simulated breast cancer cases and randomly distributed them over 245 counties in New England (Fig 1.2). When he instructed his Simulated Annealing (SA) SaTScan based irregular cluster search algorithm to search for clusters, one of the clusters that it found was a large and extremely irregular cluster encompassing 122 counties, and enclosing a large percentage of the randomly scattered cases. This cluster is an example of a noisy cluster. The noise generating process (random distribution of cases) operated at the scale of the 245 (aggregated) counties. The shape of the area at which this process operated is the shape of the New England region that we see in Fig 1.2. At this scale and shape, the process generates noise. However, if this process is studied at the scale of an

aggregation of 122 counties and at the shape that follows the darker (orange if your copy of this document is in color) shaded counties in Figure 1.2, then a noisy or spurious cluster is generated. It is known that the process that generated this cluster is noise. This example thus illustrates a situation where spurious clusters are created from a mismatch between the scale and shape of the process that generates the cluster and the scale and the shape imposed by the method of analysis. Duczmal [31] noted that this noisy cluster was large in size and extremely irregular in shape. Duczmal [73] suggests that large and irregular clusters like the one found in his study (above) are likely to be spurious. He and some other researchers [36] therefore incorporate a penalty for irregularity of shape in the cluster search algorithm. The extent of this penalty is decided on a priori knowledge of the shape of the cluster. Therefore, if researchers believe that the clusters in an area are likely to be circular, they place a high penalty on clusters that are not circular in shape, and vice versa. The spurious cluster detected by Duczmal's method and the proposed solution raise some important questions. Is this spurious cluster, large and irregular with a high risk/rate elevation, an artifact of his particular method, or is it the case that whenever a cluster detection method is given freedom of shape and size such clusters are likely to be detected? We note that the shape and size of the spurious clusters in Fig 1.1 are different from the shape and size of Duczmal's spurious cluster. Thus not all spurious clusters are large and irregular. Duczmal's problem has reintroduced the otherwise rarely discussed issue of shape and size into the disease cluster detection literature [69, 74, 75]. Risk elevation is just one possible characteristic of a cluster. McCullagh [76] states: "In map analysis, features of prime importance may be size, shape, orientation and spacing". It is possible for clusters of different shapes and sizes to have the same risk elevation. It is also possible for clusters of the same shape and size to have different risk elevations. The first objective of any cluster search should therefore be to distinguish spurious or noisy clusters from everything else. The risk or rate value of a possible cluster alone is not sufficient to make

this distinction. The shape and size of the cluster must also be factored in when considering whether a cluster is a true cluster. Duczmal proposes a solution that makes certain a priori assumptions about the shape and size of a cluster. This solution is interesting. However, the problem of spurious clusters may be approached from a different angle. Instead of asking the question "What is the shape of a true cluster?", which is what these methods do, and which is hard if not impossible to answer, the question that should be asked is "What is the shape of a spurious cluster?". Unlike the first question, this is easier to answer. This is because the shape of a spurious cluster, unlike that of a true cluster, can be mined a posteriori from the data. To know how this can be done, we first need to understand how spurious clusters are generated in the first place. Thus, in what follows I discuss in depth the phenomenon of noise and the creation of spurious clusters.

Identifying spurious clusters and distinguishing true clusters from spurious clusters

Spurious clusters enclose noise. Across disciplines, noise is defined as "a random and unpredictable signal" [77]. By this definition, if the nature of the signal is known, then noise can be detected and filtered out. For example, in a satellite image it may be known that certain frequencies are the signal frequencies, and therefore a spectral analysis and subsequent filtering may help remove the undesirable noise. In a satellite image the signal has a physical existence. For example, infrared radiation emitted by vegetation can be measured with certain instruments. In contrast, in mapping disease the signal cannot be physically measured. The signal is conceptual and has to be estimated from the available data. Some geographers and statisticians attempt to tackle the problem by developing statistical models that attempt to separate signal from noise [21, 23, 78-80].

Perhaps a better approach to understanding signal and noise in a disease map is to understand the physical process that gives rise to the signal (as in a satellite signal). It is known that in a disease map, the observed patterns are the result of underlying processes. The observed patterns are patterns obtained from mapping statistical summaries of disease outcomes. For example, a map of patterns of cholera mortality in England could be displaying the number of cholera deaths per unit population in each county. The outcome in this case is cholera mortality, which is the outcome of a disease process. Since cholera is a communicable disease, it is possible that the spread of cholera can be modeled as a contact network process [81]. There exist many other spatially explicit disease processes.2 For example, patterns of disease could be the result of processes that reflect an underlying lack of access to healthcare [10, 56, 82-84]. Whatever the specific process may be, these processes have a common trait in having a spatial form [85], and this means that they predispose some areas of the map to have a greater risk than others. It is also possible that the underlying process does not cause any region of the map to have a greater risk than any other. Since a disease case may then appear at any point on the map by random chance, by the earlier definition of noise, this is a noise generating process. A cluster defined by enclosing some of these disease cases is a spurious cluster. On any given map, disease patterns can be the result of one or more processes. They could be the result of one process that generates clusters and another process that generates noise. The challenge, therefore, is to distinguish the areas of a pattern that are the result of a cluster generating process from those that are not. Also, given a disease process that generates patterns on a map, a number of other factors also influence the patterns we actually observe.

2 It is important to distinguish between a spatially explicit disease process and a spatial disease process. Some scientists attempt to model diseases as purely spatial processes. Examples of this can be seen in the cellular automata based disease modeling literature. No disease process is purely spatial and therefore such models are misleading.

Given a cluster generating process, the following factors influence the pattern that is then extracted:

1. The spatial distribution of the locations of people in the map.
2. The shape and size of the geographic units that are used to aggregate individuals into discrete small areas.
3. The shape and size of the spatial configuration that the disease mapping or cluster detection method may impose on the data (in addition to 2).

Understanding these factors is essential to understanding noise and spurious clusters. I discuss this next.

The spatial distribution of the locations of people in the map

A cluster generating process causes an area of the map to have a greater risk than other areas of the map. Cluster detection methods seek to estimate the shape, size and risk elevation of the area of increased risk using the locations of people as proxy sample sites. A representative spatial sample of the area of risk would be a uniform grid [86]. People are never distributed uniformly over space; instead, a likely spatial distribution consists of dense settlements interspaced with sparsely populated areas. This creates a challenge in estimating the true shape of the cluster. As I illustrate in figures 1.3 to 1.11, a cluster that in reality has a uniform shape may be estimated as having a highly irregular shape, because of the way people are distributed over space [75]. The shape of the actual area of increased risk or true cluster created by the cluster generating process also influences the shape of the cluster that is finally estimated. If the shape of the true cluster

39 25 is highly irregular, it is quite likely that the shape of the cluster that is estimated is also highly irregular, but the converse may also be true! This is illustrated from figures 1.12 to 1.14.Another phenomenon long observed by geographers is that the same risk process may give birth to different shaped clusters in different areas of the map or, in more general terms, the same cluster generating process may give rise to different patterns [87]. While the shape of the original area of the increased risk or true cluster may be the same in two areas and the spatial distribution of the people may be the same, it is not necessary that the pattern of people who are diseased (and who are not) will be the same in both areas. This means that the shape of the estimated area of increased risk will not be the same in both areas. This is further complicated by the fact that people are almost never distributed similarly over space in two different regions (Figures 1.15 to 1.20). First, for the purposes of understanding this issue, let us assume the highly improbable situation that people are uniformly distributed over space. Let the distribution be over a uniform grid. Figure 1.3 illustrates the situation. Next, let us consider that out of the 42 people in the region, 10 are afflicted by some disease. However, we assume that the process that causes disease is a noise generating process. Therefore, we expect diseased people (or cases) to be randomly distributed over the region among 42 people as shown in figure 1.4. A convex hull boundary of these cases is seen in Figure 1.5. In contrast, if there is a cluster generating process, we would expect the diseased people to be clustered together. Figure 1.6 illustrates such a situation. People enclosed within a dotted area of increased risk are diseased, the risk being (the risk in other areas being 0). We observe in Figure 1.6 one realization of the risk process, so 10 people are diseased. Figure 1.7 displays the convex hull boundary of this cluster of diseased people. The smooth and regular shape of this cluster is in sharp contrast to the irregular cluster shape that we observe in Figure 1.5. Since it is highly unlikely, that people will be uniformly distributed over space, Figure 1.8 illustrates the more realistic possibility of people being non uniformly distributed over space. If the entire geographic area in figure

is subject to a risk, we expect some people to become diseased (again, one realization of the process). Figure 1.9 illustrates this and the boundary that demarcates the cluster. The shape of the cluster is very different from what was obtained in Figure 1.5. An increased area of risk on such a heterogeneously distributed population gives rise to clusters of unpredictable shapes (figures 1.10 and 1.11). These examples show how the spatial distribution of the people affects the shape and size of the risk surface detected. From these examples it may seem that, for a given distribution of people over space, a cluster generating process gives rise to patterns on a map that are regular compared to the shapes generated by a noise generating process. Indeed, some scientists use measures of regularity of a cluster's shape to distinguish a true cluster from a spurious cluster [73]. Also, people are never distributed uniformly over geographic space. Next, we see how this affects the shape and size of the cluster detected. In the example I have discussed, I assumed that the cluster generating process gives rise to a very regularly shaped area of increased risk (the area within the dotted line). In reality this may not be true. The area of increased risk may have a very irregular shape. Some examples of geographic features that can be areas of increased risk are rivers, roads, underground groundwater streams, plumes of aerial pollution, or a combination of some of these. We therefore observe that the shape and size of a cluster cannot be predicted a priori and is unique to the risk elevation of the cluster generating process and the spatial distribution of the people. Another aspect of a cluster generating process is that the same process can give rise to different shaped clusters in different regions of the map. This can happen even if people are uniformly distributed. The examples below illustrate this. From the discussion and the examples, we can conclude that both the spatial distribution of people and the shape and size of the area of increased risk have an important bearing on the shape and size of the cluster that is finally detected. The area of increased risk or true cluster may have a very different spatial configuration from the cluster that is detected. Parts of the true cluster may be suppressed or spurious areas

of increased risk may arise. Spurious clusters are created from the method used to measure the outcome of the process of clustering. By definition, the method uses a scale and (or) shape of measurement that is dependent on the spatial distribution of people. Since this distribution is not representative of the underlying area of increased risk, there is a mismatch between the measurement shape/scale and the process shape/scale. While the above examples are with individual level data, the conclusions drawn can be generalized to aggregated data. The act of data aggregation itself could introduce noise over and above the problem of heterogeneously distributed people. This is discussed in the next section.

The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas

In the geography literature the term scale is used to refer to three different kinds of scales, two of which are of relevance here. The first is the phenomenon scale, or the scale at which a spatial process operates. The second is the analysis scale, the scale at which data are aggregated for measurement and analysis [88]. When a phenomenon such as a disease operates at a given scale, its outcome is often registered as heterogeneity in disease rates at that scale [89]. Geographers have often attempted to find the scale at which a process operates [90]. Two well known methods are the use of spectral analysis [65] and variogram [91] modeling. The latter approach is often used in the health geography literature. Studies in China have shown that esophageal and liver cancers operate at scales of less than 150 km while stomach cancers operate at scales of less than 90 km [91]. In Sweden, substance related disorders operate at scales of less than 3 km [92]. Unfortunately, the scale at which a given process operates is not known in most geographic studies. A geographer attempts to study a process by collecting and analyzing

spatial data. This process involves analysis through the calculation of statistical summaries of data aggregated at an appropriate scale. When the process scale is not known there is every possibility of a mismatch between the process scale and the analysis scale. This mismatch or misalignment arises from two sources. First, geographic data are often aggregated into discrete units, often for purposes different from the analyses for which they are being used. These units of aggregation could differ in shape and scale from the process scale and shape. As Haining states in Conceptual models of spatial variation [93]: "...This might be referred to as process-induced spatial heterogeneity. This source of heterogeneity may be compounded in the case of regional data by measuring attributes through spatial units of different size. This might be referred to as measurement-induced heterogeneity because it is a product of how attributes are observed and measured." A second source of mismatch is from the spatial structures that a disease mapping/cluster detection method imposes on the data. For example, spatial filtering [9, 10] and Spatial Scan Statistic based methods calculate summary statistics by aggregating data along circular filters. In the geography literature the problems that arise from spatial mismatch are grouped under MAUP, or the Modifiable Areal Unit Problem [91, 94]. MAUP phenomena are in turn grouped under two broad sub groups: the zone effect and the scale effect. The creation of spurious heterogeneity or destruction of true heterogeneity with changing scales is a manifestation of the scale effect. If the scale is kept fixed but the shape of the zones of aggregation is changed, then the zone effect is likely to be seen. Geographic data aggregated to administrative units often display both the zone and scale effects of MAUP. Aggregating data has a smoothing effect on disease rates [95], and therefore clusters at scales smaller than the scale of aggregation could be missed when analyses are done using these data. Conversely, if the scale of aggregation is smaller than the process scale, then noisy clusters could be detected. A recent study by Ozonoff et al. [19] demonstrated that when individual level data are aggregated and a Spatial Scan Statistic cluster search method is used on the data,

then noise increases with increasing levels of aggregation. Therefore, analysis and process scales interact in complex ways to create noisy clusters and suppress true clusters. We can conclude from our discussions above that a number of complex factors influence the shape, size and risk elevation of the clusters that are detected and the spurious clusters created. These factors are dependent on the spatial distribution of the people and the process and analysis scales. It is not possible to make a priori assumptions about these factors, and it is certainly not possible to predict the shape of a noisy cluster a priori. What approach is then appropriate if the spurious clusters have to be separated from the true clusters? The section that follows answers this question.

Identifying the "noisy" or spurious components of the pattern

A reasonable cluster detection technique should take into consideration not only the risk elevation but also the shape and size of the cluster. I propose a spatially enabled computational process that uses these attributes of a cluster to identify the signature of spurious clusters from patterns on a disease map. Earlier, I introduced the idea that a pattern is the outcome of a process. Analyzing a pattern, or the components of a pattern such as individual clusters, may yield clues about the underlying process. A map of disease patterns represents one realization of the underlying process. It may not be possible to draw conclusions on the process that generated the pattern or components of the pattern by analyzing just one map. However, if multiple maps were available, representing multiple realizations of the process, then analyzing the patterns may yield clues about the underlying process. A classic example of this approach can be found in Hagerstrand's classic paper [96], in which he simulates multiple maps assuming an underlying process. He then compares maps of empirical data with the maps that he has simulated to draw conclusions about the validity with which he represents the process in his model. Another example can be seen from Diggle [97]. Therefore, if maps were

created using a known process, then analysis of the simulated patterns on the maps would yield clues on the "signature" of that particular process. Once this "signature" is known, then the pattern could imply (or not imply) the existence of this process. More specifically, this scheme can help identify a "signature" for spurious clusters. These signatures can then be used to distinguish clusters that are spurious from clusters that are "true", in any given pattern of disease risk. Shape, size and risk elevation are part of this "signature". For example, the signature of spurious clusters in Duczmal's [73] method was that these clusters were large in size and had irregular shapes. The next chapter is devoted to the method I have developed based on these ideas. The method is first described, then tested and validated on simulated data.

Why use size, shape and rate

The reason I add the dimensions of size and shape, in addition to rate, is to characterize the reference space in which spurious clusters are located. I know from theory (as discussed in this chapter) that spurious clusters arise differently to the extent that the numbers of people at risk, in relation to the overall relative risk of the disease, differ across the space. When people are distributed uniformly in space, the average number and average size of spurious clusters in that space can be determined from theory. As Schinazi [98] shows, deterministic statistics can be used to determine the chance of finding a given number of clusters with a rate higher or lower than the expected rate. However, when people at risk are distributed non-uniformly in space, the equivalent number is more difficult to determine directly from theory. The same theory still applies; it is just more difficult to implement in the case of a non-uniform distribution of people at risk. For this reason, I use Monte Carlo simulation to discover the rate, size, shape space in which typical spurious clusters lie, given the particular distribution of people at risk and the particular overall relative risk of the disease in the study area in question. In his seminal paper, King [85] states: "The mathematics of stochastic spatial processes have

proven to be extremely complex and it is perhaps not surprising that alternative approaches to study these processes have been sought. In the analysis of any system, simulation represents a lower level of abstraction than the formal mathematical analysis, and this technique has been applied to geographic research." In this research I use shape, size and rate to distinguish real clusters from spurious ones. Since the probability of disease in a cluster is higher than in a non cluster, we expect the rate, which is an estimate of this probability, to be higher in a cluster. Conversely, if people with higher probabilities of disease are grouped together or are spatially clustered, rather than randomly scattered about the map, we expect a higher degree of spatial autocorrelation in the former situation. We would then expect the size of true clusters to be larger than any spurious clusters created by noise. The causative agent for this increased spatial autocorrelation could be environmental toxins or social and behavioral factors. There is a vast literature on the social and environmental causes of increased risks [99], a complete discussion of which is out of the scope of this dissertation. Nevertheless, I briefly discuss some of these agents of increased risk. As I discussed in the introduction chapter, disease mapping owes its beginnings to infectious diseases such as cholera and smallpox. Infectious agents such as bacteria or viruses are often transmitted through close physical contact. It is therefore not surprising that infectious diseases such as cholera [100] and yellow fever [101] have served as some of the best example cases of disease clusters. A collection of such cases is positively autocorrelated compared to a random distribution of cases. Conversely, a high spatial autocorrelation of disease X in space could indicate an infectious etiology for that disease. One would expect the clusters thus formed to be contiguous and large, as opposed to a random allocation of cases. Other causative factors of diseases are environmental toxins. Environmental toxins tend to follow certain physical features or attributes of the environment. People residing within or close to these features are at an increased risk of disease compared to others because of a differential exposure to these toxins. Some examples of physical attribute/toxin pairs are: rivers and fungicides [102], radio antennae

and electromagnetic wave plumes [12], farms and pesticides [103], cranberry bushes and pesticides [104], Concentrated Animal Feeding Operations [105] and dust plumes, and canals and assorted chemical wastes (the famous Love Canal) [106]. It is expected that these toxins act locally, across contiguous areas. The elevations of risk caused by such agents extend over a large area, as opposed to any risk caused by spatially random events. While physical toxins may cause an increased risk regime, the social environment may also cause the same effect. Public health researchers discuss the context and composition of the social environment [78, 80, 99, 107, 108]. If a number of individuals practicing high risk behaviors compose a neighborhood, they could end up reinforcing each other's behaviors. This could result in a cluster of disease cases created by the compositional effect of high risk individuals living together [107]. If a number of high risk individuals are living together, they form a cluster. This cluster would naturally be larger than isolated individuals or even families practicing high risk behaviors. In contrast to this compositional process, if a certain neighborhood has poor access to services, then the access context of this neighborhood causes the people living in it to have a higher risk of disease. Some of the examples of access/outcome pairs from the literature are access to prenatal care clinics and birth outcomes [83], general accessibility and health risk factors [109], access to radiation clinics and choice of therapy [110], and access to health resources and late stage colorectal cancer [10]. Network distance, Euclidean distance or some function thereof is used to quantify access. It is not possible for immediately neighboring individuals to have markedly different accessibilities. One therefore finds clusters of high or low accessibility, which translates to clusters larger than random. While we would expect the sizes of true clusters to be larger than the sizes of spurious clusters, there is a small but finite probability that by random chance some spurious clusters will be larger than the true clusters. Also, the shapes of true clusters will have a greater degree of freedom than the shapes of spurious clusters. For example, the

shapes of true clusters could follow a road or river network, in which case they will be extremely irregular. Conversely, they could be regular or circular. The shapes of spurious clusters, on the other hand, are constrained by the particular geographic aspects of the data, such as the level of aggregation and the spatial distribution of people, as discussed in this chapter. Therefore, we can expect the shape, size and rate of true clusters to be different from the shapes, sizes and rates of spurious clusters. The question of whether each of these dimensions contributes to the power to discriminate between spurious clusters and true clusters is an empirical question that can be answered. In my simulations, I hold rate constant across synthetic clusters (horizontal axis in figure 2.13: clusters 1 and 2, clusters 3 and 4) when changing shape and size, and conversely change rate when keeping shape and size constant (vertical axis in figure 2.13: clusters 1 and 3, clusters 2 and 4). I also address the empirical question of how much each of these dimensions contributes to the overall sensitivity if information on the other dimensions is withdrawn. The theoretical reason, however, for expecting the dimensions of size and shape to contribute to the ability to separate spurious clusters from true clusters remains: the shape and size of spurious clusters in any area depend on the spatial distribution of the people at risk in the area. If this spatial distribution changes, the patterns of spurious clusters change. Therefore, size and shape as well as rate are part of the signatures of spurious clusters in a particular region. The ability to measure size and shape using GIS methods thus becomes an important part of the methodology for distinguishing true from spurious clusters in any area. At this stage we are in a position to revisit the definition of clusters. Knox's definition of a cluster is "a geographically bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance". The vast majority of the disease clustering literature interprets "unlikely to have occurred by chance" as the unlikeliness of the estimated risk of disease in the cluster alone to have occurred by chance. As shown by Duczmal [31] and Ozonoff [19], this interpretation is susceptible to the problem of spurious clusters. Thus, two different clusters with different geographical

bounds but the same rate would be evaluated similarly. This dissertation does not redefine Knox's definition of clustering. It interprets it in a geographically meaningful manner by including shape and size along with rate in cluster interpretation, in a three dimensional computational space.
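The contrast between the noise generating and cluster generating processes illustrated in Figures 1.3 to 1.7 can also be reproduced computationally. The Python fragment below is a minimal sketch of that idea, not part of the dissertation software; the grid dimensions, the circular area of increased risk and the random seed are assumptions made purely for illustration. It draws 10 cases from 42 people either completely at random (the noise generating process) or from within a small area of increased risk (the cluster generating process), and compares the compactness of the convex hull drawn around the cases.

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(1)

    # 42 people on a uniform 7 x 6 grid, as in Figure 1.3
    xx, yy = np.meshgrid(np.arange(7), np.arange(6))
    people = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

    def compactness(points):
        # 4*pi*Area / Perimeter^2 of the convex hull; equals 1 for a circle
        hull = ConvexHull(points)   # in 2-D, hull.volume is the area, hull.area is the perimeter
        return 4 * np.pi * hull.volume / hull.area ** 2

    # Noise generating process (Figure 1.4): every person has the same risk,
    # so 10 cases are scattered at random over the whole region
    noise_cases = people[rng.choice(len(people), size=10, replace=False)]

    # Cluster generating process (Figure 1.6): only people inside a circular
    # area of increased risk centred on (2, 2) can become cases
    inside = np.linalg.norm(people - np.array([2.0, 2.0]), axis=1) <= 2.0
    at_risk = people[inside]
    cluster_cases = at_risk[rng.choice(len(at_risk), size=10, replace=False)]

    print("compactness, gerrymandered (spurious) cluster:", round(compactness(noise_cases), 2))
    print("compactness, true cluster:", round(compactness(cluster_cases), 2))

Repeating the random draw many times gives an impression of how irregular the boundaries produced by the noise generating process typically are for one particular arrangement of people, which is the reasoning that is developed formally in the next chapter.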

Figure 1.1: This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 6000 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters. Note: Reproduced from Steenbergehen, Thomas and Wetts (2005) [72].

Figure 1.2: This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value. Note: Reproduced from Duczmal, Kulldorff and Huang (2006) [73].

Figure 1.3: In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy.

Figure 1.4: A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other; all people are at the same risk. Diseased people are randomly scattered over the map. These diseased people are color coded black to indicate a diseased state.

Figure 1.5: A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster.

Figure 1.6: In contrast to 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bounded by the dotted lines to a greater risk than other areas of the map. These people are at an elevated risk. In one realization of the process, a cluster of 10 people therefore are diseased in this area.

Figure 1.7: The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people).

Figure 1.8: People are distributed non uniformly over space.

Figure 1.9: The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown.

Figure 1.10: The cluster generating process in Figure 1.6 operates on the inhomogeneously distributed population. The risk elevation is the same as in Figure 1.6. This causes 8 people to fall ill from the at-risk population.

Figure 1.11: The estimated cluster shape and size is very different from the shape and size of the cluster in reality (the dotted line in Figure 1.10). It is also very different from what was obtained for a homogeneous distribution of people in Figure 1.7.

Figure 1.12: Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease.

Figure 1.13: Assuming an inhomogeneous distribution of people as in figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased.

Figure 1.14: The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster.

Figure 1.15: Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogenous distribution of people.

Figure 1.16: The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality.

Figure 1.17: In this example a slightly larger area of increased risk is considered than in the earlier example. 6 people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases/falling ill.

Figure 1.18: The clusters that are generated have very different shapes. In fact the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster.

Figure 1.19: In this example people are inhomogenously distributed. The same cluster generating process as in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is 0.75.

Figure 1.20: The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape.

CHAPTER 2: THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS

In this chapter I describe and then test the S.S.S method for the detection of disease clusters. In the section that immediately follows, I discuss the theory, the hypothesis testing framework and the algorithm that implements the method. The method is also compared with a second method, Rogerson's Score test method. The results from the two methods are compared, and implications are discussed.

2.1 Theoretical foundations of the S.S.S method

A pattern on a map consists of a number of possible candidate clusters. A map may be that of a region such as the state of Iowa. Let us denote this region as G. The region may be comprised of X local areas, G = [A_1, A_2, ..., A_X].3 An example of a local area A_1 is a ZCTA (Zip Code Tabulation Area). If there are k candidate clusters on the map, this set can be denoted as Z = [Z_1, Z_2, ..., Z_k]. These candidate clusters have different properties for different methods of detecting disease clusters. For example, in methods that do not geographically process the data, clusters follow local area boundaries, and the set of all candidate clusters comprises the universe, or Z = G. Each candidate cluster Z_i, i = 1 to k, is a collection of some discrete local areas A_i. However, in methods that process geographic or spatial data, especially density estimation methods like spatial filtering, these properties do not necessarily apply. The set Z of candidate clusters can also be divided into two complementary subsets: a set of true clusters and a set of noisy clusters.

3 The terminology in this section is similar to that of Duczmal, Kulldorff and Huang (2006) [73].

Let these be Z_T and Z_N respectively, where Z_T ∪ Z_N = Z. Of course, the sets Z_T and Z_N are not known a priori to the researcher. Note that these cluster candidates can be extracted using any method that offers relative freedom of shape and size of cluster candidates. In this research, I use Rushton's spatial filtering and echelons. But the approach can be used, for example, with Duczmal's [31] method or Kulldorff's SaTScan [3] analyses. Each cluster candidate could either be a true cluster or it could be a spurious cluster. It is not known a priori if a candidate cluster is true or spurious. Each candidate cluster has a shape K(Z), a size S(Z) and a third attribute R(Z) that provides a measure of the risk at the cluster. The measure of shape, K(Z), is the compactness or regularity of the cluster's geometry. Compactness is measured as K(Z) = 4π Area(Z) / Perimeter(Z)^2 [73]. A circle is perfectly regular and has a compactness of 1. The compactness value tends to 0 as the geometry becomes less regular. Size S(Z) is the area of Z, Area(Z). R(Z) is the statistical measure of clustering. Thus R(Z) could be a likelihood statistic, as in Kulldorff's SaTScan, a rate statistic, as in spatial filtering, or a measure of spatial autocorrelation, as in the LISA (Local Indicators of Spatial Autocorrelation) methods. R(Z) in most methods is a continuous variable, and this is the attribute that is used to create patterns on a map. If cluster candidates Z_i are to be extracted from the pattern, then R(Z) has to be discretized. Since clusters are discrete geographic entities with strict boundaries, an appropriate method of extracting geographically bounded discrete entities from a continuous surface (such as a map pattern) is required. One approach is to use echelons [111], where the continuous surface is divided into a hierarchy of topographic features such as peaks, ridges and saddles. While there are other approaches to extracting cluster candidates [31] that are irregular in shape, the echelons approach, when used with a continuous surface, has two strengths. First, the surface is continuous, thus the cluster candidates that are extracted do not have

to conform to underlying geographic boundaries. Second, this approach provides an exploratory approach to selecting cluster candidates. While selecting cluster candidates on the basis of results of an earlier search is not recommended, since this leads to selection bias, it is possible to look at a smoothed surface of risk and decide what level of smoothing or what level of echelons is appropriate for the research. It is not absolutely necessary to make these decisions, and a brute force approach using multiple echelons and multiple filter sizes will work. Nevertheless, this approach does offer the opportunity to make these a priori selections if required. Discrete entities are created from the intersection of this three dimensional topography with a horizontal cutoff plane, called the level in the echelons literature. The level of this cutoff plane can be such that only the peaks or highs are revealed above it, or it could be lowered such that features at lesser altitudes are exposed. In my analysis I use one level or threshold. This has the effect of binarizing the continuous surface into highs and lows. Thus, in a map of smoothed rates, all areas that show rates greater than the threshold rate will be considered possible cluster candidates (see Figure 2.1). Raising the threshold has the effect of showing only the peaks and ridges as cluster candidates, while lowering the threshold has the opposite effect. A powerful method for the detection of disease clusters can weed out the false positives and thus offer high sensitivity when applied on such a map. In later versions of this method, multiple levels of echelons will be used. One echelon serves as a good starting point. We can therefore define a threshold T, such that:

If R(Z) > T, t = 1, R(Z) = R(Z) * t
If R(Z) < T, t = 0, R(Z) = R(Z) * t

where T is a threshold that decides beyond what value of the statistic R(Z) we consider clusters to exist. Once the cluster candidates are extracted, the average rate at the

cluster candidate can be calculated. The average rate is calculated as the observed number of cases divided by the expected number of cases. Alternatively, if a fine, uniform grid is applied to the study area and the rate at each grid point is known, then the rate in a candidate cluster is equal to the average of the rates at the grid points that lie within the candidate cluster. Let us call this R̄(Z). These attributes [K(Z), S(Z) and R̄(Z)] comprise the "signature" S(Z) of a cluster. My hypothesis is that the signatures of clusters that are spurious will be different from the signatures of true clusters. A disease map will yield many possible cluster candidates. If all the cluster candidates are spurious then Z_j ∈ Z_N for all j, and Z_T = φ. For any candidate that is a true cluster, Z_j ∈ Z_T. I assume that a cluster is a true cluster if its signature is not classified as the signature of a spurious cluster. Or, Z_j ∈ Z_T if S(Z_j) ∉ S(Z_N) and consequently S(Z_j) ∈ S(Z_T). This classification can be done on the basis of a decision rule D, D: S(Z_j) → S(Z), such that if there are no true clusters then S(Z_j) ∈ S(Z_N) for all j, while if there is a cluster that is a true cluster then S(Z_j) ∈ S(Z_T). To decide if any given cluster signature belongs to a set of noisy or spurious cluster signatures, we first need a reference set of spurious cluster signatures. This reference set must be computationally generated. Thus, for example, if we are analyzing cluster candidates in a map of cardiovascular disease mortality by ZCTA in Iowa, then the noisy clusters must be extracted from simulated datasets of cardiovascular disease mortality by ZCTA in Iowa. The simulated datasets must be created under a noise or spurious cluster generating process. Thus no person in a ZCTA should have a greater risk of dying of cardiovascular

disease than in any other ZCTA (details of data simulation are in a later section). The number of cases that are simulated should be equal to the number of cases in the real dataset. Thus one map, Sm_1, created under a noise generating process generates, say, Sm_1(n) spurious clusters Z_N,1, Z_N,2, ..., Z_N,Sm1(n) when the map is analyzed with a cluster detection algorithm such as spatial filtering. M simulations provide a valid reference set of noisy clusters [(Z_N,1, ..., Z_N,Sm1(n)), (Z_N,1, ..., Z_N,Sm2(n)), ..., (Z_N,1, ..., Z_N,SmM(n))] = Z_N. The reference set of signatures is thus [(S(Z_N,1), ..., S(Z_N,Sm1(n))), (S(Z_N,1), ..., S(Z_N,Sm2(n))), ..., (S(Z_N,1), ..., S(Z_N,SmM(n)))] = S(Z_N). Figure 2.2 displays the signatures of spurious clusters on a map of Iowa. The noisy data are an allocation of 50,000 cardiovascular mortality cases among the 942 ZCTAs in Iowa.4 The process used to extract the clusters was Rushton's adaptive filtering [9]. The output of this process is a surface of smoothed rate statistics. The statistic was binarized such that areas with rates greater than the echelon level 1.1 are coded as black. Each individual polygon that has been coded as black is a noisy cluster. This one map provides one set of reference noisy signatures, say (Z_N,1, ..., Z_N,Sm1(n)). M sets of similar maps make up Z_N. Once the reference distribution of simulated spurious signatures S(Z_N) is obtained, the important questions are: What do these spurious signatures look like? And, given candidate clusters Z_j, j = 1 to k, and their signatures, how do we differentiate these from S(Z_N)? The first question is addressed by exploring the signatures S(Z_N). The shapes, sizes and rates of the clusters in this set are explored. We can expect clusters that have a similar shape, size and rate as these spurious clusters to be the least likely to be recovered.

4 This number is consistent with real epidemiological data. This is discussed in detail in the data simulation section later.
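The assembly of the reference signature set S(Z_N) can be sketched in a few lines of Python. The fragment below illustrates only the signature extraction step; it assumes that the smoothed rate surface produced by spatial filtering is already available as a two dimensional array (the name rate_surface and the function name extract_signatures are hypothetical), and it uses the scikit-image library to label the contiguous regions that exceed the echelon level. The dissertation analyses themselves use Rushton's adaptive filtering software and echelons.

    import numpy as np
    from skimage import measure

    def extract_signatures(rate_surface, level=1.1, cell_size=2.5):
        # Return (shape, size, mean rate) for every contiguous region whose
        # smoothed rate exceeds the echelon level.
        #   rate_surface : 2-D array of smoothed rates on a uniform grid
        #   level        : echelon level used to binarize the surface
        #   cell_size    : width of one grid cell in miles (a 2.5 mile grid is assumed)
        binary = rate_surface > level
        labels = measure.label(binary)                      # contiguous cluster candidates
        signatures = []
        for region in measure.regionprops(labels, intensity_image=rate_surface):
            area = region.area * cell_size ** 2             # size S(Z), in square miles
            perimeter = region.perimeter * cell_size        # perimeter, in miles
            if perimeter == 0:                              # skip one-cell artefacts
                continue
            shape = 4 * np.pi * area / perimeter ** 2       # compactness K(Z)
            rate = region.mean_intensity                    # mean rate within the region
            signatures.append((shape, area, rate))
        return signatures

    # Pooling the signatures over M simulated noise maps builds the reference set S(Z_N):
    # reference = [s for surface in simulated_noise_surfaces for s in extract_signatures(surface)]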

To differentiate signatures of true clusters from signatures of spurious clusters, we can describe a three dimensional signature space [K(Z), S(Z) and R̄(Z)], where shape K(Z) is on the X axis, size S(Z) is on the Y axis and rate R̄(Z) is on the Z axis. In this signature space is a set of noise signatures S(Z_N). The signature set for M = 50 can be seen in Fig 2.3. This signature set of spurious clusters is used as a reference set in this research. They occupy a certain region of the signature space. Figure 2.3 illustrates the reference distribution of spurious cluster signatures, S(Z_N). Thus, if S(Z_N) were to be enclosed by a boundary in signature space, then all candidate clusters that are spurious should be enclosed within this boundary. In contrast, all candidate clusters that are true would lie outside this boundary. This boundary thus defines a rejection region in three dimensional space. The rejection region can be defined in a Monte Carlo hypothesis testing framework. This is discussed in the next section.

2.2 Hypothesis testing

I test the hypothesis that any of (as opposed to one of) the candidate clusters (Z_j, j = 1 to k) is a true cluster. The signature of a candidate cluster is compared to a reference distribution S(Z_N) of simulated noisy cluster signatures to decide if the candidate cluster is noisy. The theoretical distributional properties of the signatures of spurious clusters are not known (and could be a subject for future research). For example, while there is substantial research on how rates and likelihoods are distributed [69, 112, 113], there is little research on the trivariate distribution of rates, shapes and sizes of clusters in different geographies. This creates a challenge in testing any hypothesis. It is known, for example, that in the bivariate case a normally distributed variable will have a footprint that can be approximated by an ellipse [114]. The exact configuration of the ellipse depends on how the two variables are correlated. It is possible to draw a rectangle that encloses this ellipse. Irrespective of how the correlation varies, this rectangle will always enclose the ellipse. The ellipse assumption is true only for

normal distributions. Without any available research, it is not possible to assume whether the bivariate distributions of shapes and sizes or rates and sizes (for example) of cluster candidates are normally distributed. Also, even if they are normally distributed in this specific disease mapping situation, there is no reason to assume that they would be so in other situations. However, it would still be possible to enclose these bivariate distributions in a bounding rectangle. The extremums of shape and size can be used to define the bounds of such a rectangle. If hypotheses were being tested using ellipses, then data points residing in the outer band of the ellipse (0.001 or 0.05) would be considered significant. With a bounding rectangle a similar band can be created. Note that the rectangle is more conservative than the ellipse (refer to Figure 2.5). Rectangular confidence intervals have been discussed in the literature and recommended for use [115]. Also note that the rectangle is easily used in a Monte Carlo hypothesis testing framework. Instead of using the location of a data point in space, its rank among a set of simulated data points can be used as an indicator of its significance. A Monte Carlo approach to hypothesis testing is appropriate for these analyses. The rank of the shape, size or rate of a candidate cluster in a list of candidate clusters and simulated spurious clusters can be easily calculated. A natural extension to the rectangle in two dimensions is the hyper-rectangle or cuboid [116] in three dimensions, when the data are three dimensional (shape, size and rate), with the extremums of each of these defining the bounding rectangle. Making the bounding rectangle bigger will not change the results. The Monte Carlo ranks remain the same. The hyper-rectangle is therefore a visualization tool rather than a hypothesis testing device, since the hypothesis test in itself is non parametric. Under the null hypothesis the signatures of all the candidate clusters S(Z_j) are equal to the mean or median of the signatures of noisy clusters. Since the signature is a trivariate variable (shape, size, rate), the mean or median of the signature is the mean or median of the shape, size and rate.
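As a small illustration of the bounding cuboid, the fragment below (variable names are illustrative, not from the dissertation software) simply takes the extremums of the reference signatures along each of the three axes; the resulting axis-aligned box is used only to visualize where the spurious signatures lie, since the test itself is based on ranks.

    import numpy as np

    # reference_signatures: one row per simulated spurious cluster, columns = (shape, size, rate),
    # for example the list produced by extract_signatures() above
    reference = np.asarray(reference_signatures)

    lower = reference.min(axis=0)      # one corner of the bounding hyper-rectangle
    upper = reference.max(axis=0)      # the opposite corner

    for name, lo, hi in zip(["shape", "size", "rate"], lower, upper):
        print(f"{name:>5}: [{lo:.3f}, {hi:.3f}]")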

H0: S(Z_j) = [S_mean(Z_N), K_mean(Z_N), R̄_mean(Z_N)] for all j

where
K_mean(Z_N) is the mean cluster compactness (shape) among all signatures in S(Z_N),
S_mean(Z_N) is the mean cluster size among all signatures in S(Z_N), and
R̄_mean(Z_N) is the mean rate over all cluster mean rates R̄(Z_N).

Under the alternative hypothesis there is at least one candidate cluster that is a true cluster; thus, the signature of at least one candidate cluster is not equal to the mean:

H1: S(Z_j) ≠ [S_mean(Z_N), K_mean(Z_N), R̄_mean(Z_N)] for at least one j

The p-values of the variables shape, size and rate for any given candidate are calculated as the relative rank of the variable among the corresponding values for all the other cluster candidates and simulated spurious clusters. Thus, for example, the p-value of shape (p_shape) for cluster candidate j is the rank of the shape value of Z_j divided by (total number of cluster candidates + total number of simulated spurious clusters). If the hypothesis test is at the α level, then a cluster candidate is significant if p_shape < α/6. A cluster candidate is significant if any of shape, size or rate is significant. I simulate 50 datasets under the null hypothesis and 20 datasets under the alternative hypothesis. In this Monte Carlo hypothesis test, 50 datasets under the null hypothesis are being compared with datasets under the alternative hypothesis. 50 datasets do not imply that there are only 50 data points under the null hypothesis. This is because each of the 50 datasets generates a large number (on average, around 70) of spurious cluster candidates, or data points in the three dimensional shape, size, rate space. Thus the comparison dataset under the null hypothesis has 3675 and not 50 data points. Nevertheless, the question arises whether this number (3675 points, or 50 datasets) is sufficient to satisfy the requirements of creating the reference null distribution. I thus cross validated my

simulations through a hold-one-out cross validation process. In this method of cross validation, a statistic such as the mean is measured for all datasets, then one dataset is randomly removed and the mean is measured again. If the mean does not change, then the number of simulations is sufficient for hypothesis testing. Table 2.1 below summarizes the results. For the alternative hypothesis, the test for validation is slightly different: a stable number of datasets is one over which the mean sensitivity and specificity values converge. Table 2.2 illustrates the results. Next, I shall illustrate the hypothesis testing procedure with an example. Let us suppose that a dataset simulated under the alternative hypothesis (Section 2.3) for cluster 2 is being tested for true clusters. The comparison dataset is a dataset of spurious cluster candidates that has been simulated under the null hypothesis (Section 2.3). There are a total of 3675 data points in this comparison dataset. Each data point has a shape, size and rate value. The test dataset has 45 data points. Each data point has a shape, a size and a rate. The first step in the hypothesis test is to merge the two datasets (comparison and test datasets) together. After merging the datasets we have a total of 3720 data points. Next, we rank the data points by their shape value on the shape axis, by their size value on the size axis and by their rate value on the rate axis. If we are testing the hypothesis at the 0.01 level, then we expect 3720 * 0.01 = 36 (rounded to a multiple of six) data points to be rejected. If a two sided test is carried out, then each axis (shape, size and rate) has to contribute 12 data points to the rejection region. Since this is a two sided test, each side of an axis contributes six data points. For this specific dataset there are specific cutoffs for shape, size and rate; the cutoffs for shape are 0.71 and 0.045, for size 1.193, and for rate 1. Out of the 36 data points that are rejected, four are from the test dataset; the rest are from the reference dataset and are thus discarded. Of the four, three are rejected on their shape values (0.73, 0.73, 0.72). One data point is rejected both on its shape (0.55) and its rate (1.22).
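The merge-and-rank procedure in this example can be written compactly. The sketch below is a schematic re-implementation for illustration only (the function name sss_test and the array layout are assumptions, not the dissertation's actual code): candidate and reference signatures are pooled, each axis is ranked separately, and a candidate is flagged if its shape, size or rate falls into either tail of the pooled distribution, with alpha/6 of the pooled points allotted to each of the six tails.

    import numpy as np

    def sss_test(candidates, reference, alpha=0.01):
        # candidates : (k, 3) array of (shape, size, rate) for the cluster candidates
        # reference  : (m, 3) array of simulated spurious-cluster signatures
        # Returns a boolean array of length k; True means the candidate is rejected
        # as spurious, i.e. it is treated as a likely true cluster.
        pooled = np.vstack([candidates, reference])
        per_tail = int(round(len(pooled) * alpha / 6))        # two tails on each of three axes
        flagged = np.zeros(len(candidates), dtype=bool)
        if per_tail == 0:                                     # too few points for any rejection
            return flagged
        for axis in range(3):                                 # 0 = shape, 1 = size, 2 = rate
            order = np.argsort(pooled[:, axis])
            extreme = np.concatenate([order[:per_tail], order[-per_tail:]])
            extreme = extreme[extreme < len(candidates)]      # keep only test-dataset points
            flagged[extreme] = True
        return flagged

With 45 candidates and 3675 reference points at alpha = 0.01, this allots 3720 * 0.01 / 6 = 6.2, rounded to 6, points to each tail, or 36 points in total, matching the worked example above.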

In these analyses the test and the reference datasets are merged during the hypothesis test. An alternative to this would be to take each data point from the test dataset and test it separately against the reference dataset. There are a number of drawbacks to this approach. First is the problem of multiple testing. Carrying out multiple tests would introduce statistical noise into the analysis, and adjustments for multiple testing would make the test ultra conservative. Simultaneous testing is the best approach in this context. There is also another advantage of using the merge procedure. It is known that some of the data points in the test datasets are spurious clusters or are in some ways similar to the spurious clusters. When merged with the reference dataset and ranked as explained above, these spurious cluster data points join the reference dataset, or become part of the reference population against which those clusters that are most different from spurious clusters are compared. This increases the power to distinguish or discriminate the true clusters from spurious clusters. It also makes the test slightly more conservative, since potentially true clusters may not be rejected. The number of cluster candidates in a given geographic situation is unknown. It is important that the number of data points in the reference dataset be large. This is easily achieved by carrying out an adequate number of simulations and testing for validation as above. It is important to calibrate the rejection region to match the structure of the signature set created by the process that generated the spurious clusters. The structure of this signature set will vary from one disease mapping situation to another. Using a hyper-rectangle and a non parametric method of hypothesis testing offers this ability. Computational geography [1] thus allows us to use a rejection region appropriate to the situation at hand, instead of assuming a normative truth about the distribution of spurious cluster signatures. This ability to adapt the rejection region to the local geography is the strength of the S.S.S method. The flowchart in figure 2.6 summarizes the S.S.S method. The S.S.S method extracts the signature of spurious clusters for a given geography. By the theory on which this method is based, if the true clusters that one is

attempting to recover have a signature that is very similar to the signature of the spurious clusters, then it is unlikely that these clusters will be recovered. However, if the signature of the true clusters is different, that is, the shape, size and/or rate is different from the shape, size and/or rate of the spurious clusters, then these clusters are very likely to be recovered. The simulated data that I describe next are used to test these ideas. The S.S.S method is also compared with an existing method of disease cluster detection: Rogerson's Score statistic.

2.3 The simulated dataset

It is a standard practice in the literature [32, 75, 113] to test any new method for the detection of disease clusters against data simulated under conditions of clustering and no clustering. The ability of a cluster detection method to correctly classify regions into cluster and not cluster is quantified into measures of sensitivity, specificity, false positive percent and false negative percent. Thus, data are simulated in this research. The data consist of 50 datasets simulated under the null hypothesis of no clustering and 20 (x 4) datasets simulated under the alternative hypothesis of having clusters. Four different clusters (20 datasets each) are simulated under the alternative hypothesis to reflect different configurations of possible true clusters. The hypothetical study area and the datasets are described in detail next.

Hypothetical study area and population

My analysis is based on the geographical area of the state of Iowa. The size of this geographic area is approximately 240 miles * 360 miles and the population is around 2,892,853. I simulate groups/counts of people at the small area level. The small areas are ZCTAs (Zip Code Tabulation Areas). There are a total of 942 ZCTAs in Iowa. The advantages of using a geographical area such as Iowa are manifold. The state has a relatively

homogenously distributed population with low population densities in most areas, but small and densely populated urban areas. Also, there are existing datasets of cancer/birth-defect outcomes for the state [10]. In Chapter 3 of this dissertation, the S.S.S method, Rogerson's Score statistic and the Spatial Scan Statistic [3] are applied to an existing dataset of prostate cancer incidence in Iowa (at ZCTA level geography). Figure 2.7 is a choropleth map of ZCTAs in Iowa by population.

Hypothetical case population

The simulated cases are deaths from cardiovascular disease. Cardiovascular diseases (CVD), or diseases of the heart [117], include ICD-10 codes I00-I99 (diseases of the circulatory system) and a number of other related disorders (ICD-10 codes I00-I09, I11, I13, I20-I51). Together they contributed to 26,897 deaths in the year 2000 in Iowa [117]. In my study I simulate 50,000 cases of cardiovascular disease deaths (mortality) in Iowa. This accounts for approximately two years of observed CVD deaths.

Datasets under the null hypothesis of no clustering

The hypothetical case populations are simulated by distributing cases among the ZCTAs weighted by their populations. The methodology that was used to generate the simulated datasets is simple. A computational array was created with 942 bins, where each bin represents a ZCTA. The size of each bin is proportional to the population in the ZCTA. Our task is to allocate a certain number (50,000) of cases to these bins in proportion to their populations. If we visualize each case as a dart, then the cases are allocated by randomly throwing darts at the array. Once all darts are exhausted, the number of darts or cases in each bin is summed. This sum represents the total number of cases in the particular ZCTA. One dataset is thus created from this finite allocation of cases to bins (note: not a fixed number of cases to each bin; some bins may get none through this

process, and that is fine). If the process is repeated 50 times, 50 simulated datasets are obtained. Figure 2.8 illustrates the process. The philosophy behind this is that the process that gives rise to the case population is "noise", or that no ZCTA is at a greater risk of having a case than any other after accounting for the relative differences in populations in the ZCTAs. Therefore, from a common pot of a fixed number of cases, each time a case is drawn, a decision has to be made on which ZCTA shall receive the case. Each ZCTA has a probability of receiving a case proportional to the number of people in the ZCTA. Once all cases are allocated, a dataset is ready. This procedure, if replicated M times, gives M datasets. Each of the M datasets is expected to have a different spatial distribution of cases. It can be theoretically proved that the resulting case distribution follows a multinomial distribution. Figure 2.9 shows the proof. For this analysis 50 datasets were simulated. It is important to note that a number of risk factors affect the outcome simulated. Some of these risk factors are non spatial. For example, for heart disease, age is a possible risk factor. The observed patterns of risk on a map reflect the outcome of these risk factors over and above spatial risk factors. If the simulated patterns do not reflect the underlying age and sex distribution, then comparing these with the observed patterns is incorrect. Thus, if the S.S.S method is used in a real life epidemiological situation, it is important that these covariates be adjusted for, and that the at-risk population be used to create the simulated patterns. In Chapter 3 the S.S.S method is used to investigate prostate cancer clusters in Iowa. The population used to simulate the reference spurious clusters is men (sex) over the age of 45 (age). The observed patterns are adjusted to the underlying age distribution. Since Rushton's spatial filtering is used, covariates are adjusted at the stage when the rates are calculated at each grid point. This approach can be extended to include any number of risk factors, in a logistic regression or multilevel regression framework. For a detailed discussion on this topic, see Banerjee (2004) or Klassen et al. (2005) [56, 118, 119].
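The dart-and-bin allocation described above is equivalent to a single draw from a multinomial distribution with cell probabilities proportional to the ZCTA populations, which is how it can be implemented directly. The fragment below is a minimal sketch; the array zcta_population is a placeholder for the 942 ZCTA population counts, which would be read from census data in practice, and the random seed is arbitrary.

    import numpy as np

    rng = np.random.default_rng(2008)

    # zcta_population: placeholder array of the 942 ZCTA populations
    probabilities = np.asarray(zcta_population, dtype=float)
    probabilities = probabilities / probabilities.sum()

    n_cases = 50_000          # roughly two years of CVD deaths in Iowa
    n_datasets = 50           # replicates under the null hypothesis of no clustering

    # Each row is one simulated dataset: the number of cases allocated to every ZCTA,
    # with no ZCTA at greater risk than any other after accounting for its population.
    null_datasets = rng.multinomial(n_cases, probabilities, size=n_datasets)

    assert null_datasets.shape == (n_datasets, len(probabilities))
    assert (null_datasets.sum(axis=1) == n_cases).all()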

82 68 In the simulations that are carried out in this chapter the above considerations are not important. The purpose of the simulations in this chapter is to test the ability of the S.S.S method to recover certain simulated patterns, and to compare the S.S.S method with Rogerson s Score statistic method of cluster detection. While, a dataset simulated with using the population at risk can perhaps be of a more realistic nature, they do not affect the results of analyses carried out on the simulated data. The aspatial risk factors are considered to be the same and fixed in the datasets simulated under the null and alternative hypothesis Extracting the cluster candidates For each of the datasets simulated above, patterns of risk were extracted using adaptive spatial filtering. A uniform 2.5 mile grid is used. The denominator size or filter size was set at 6600 people which is around 114 expected cases. From the smoothed patterns echelons were used (as explained in chapter 1) to extract the cluster candidates. The echelon level was set at This level is approximately equal to the mean rate in the simulated noisy patterns. Choosing an echelon at a low altitude (for example the minima) could increase the possibility of detecting false positives. Choosing an echelon that is too high could increase the possibility of type -2 error. A median or a mean is thus a reasonable choice for one echelon level. All clusters candidates that cross this mean threshold are considered spurious clusters. The end result of the process explained above is a set of spurious cluster signatures, for the geography of my choice (ZCTAs in Iowa, 2000 population). Each spurious cluster has a signature. This signature is comprised of a shape (expressed as a value of compactness), a size (in square miles) and a rate. Recall that this is S(Z N ) the signature set of reference spurious clusters. We are now in a position to explore this. In the figures that follow, these shapes, sizes and rates of the spurious cluster signatures are summarized. There are a total of 3675 spurious clusters. The bar charts in Figures 2.10, 2.11 and 2.12 summarize the shapes, the size and the rates

83 69 of the spurious clusters. Table 2.3 provides further statistical summaries on the spurious clusters. The datasets that are described next are simulated under the alternative hypothesis that clusters exist. Recall, that S.S.S compares the shapes, sizes and rates of true cluster candidates (like the ones that are simulated in the next section), with the shapes, sizes and rates of clusters simulated under the null hypothesis (which were discussed in this section). If the theory underlying the S.S.S method is correct, then the greater the difference in the signature of the spurious clusters and any given cluster candidate, the more likely the cluster candidate is a true cluster. In the next section I describe datasets that are simulated under the alternative hypothesis. In these datasets, I carefully control the shape, size and the rate of the simulated clusters with varying degrees of difference from the signatures of spurious clusters Datasets under the alternative hypothesis of clustering 10 ( X 4) datasets are simulated under the alternative hypothesis that a cluster exists. The patterns extracted from these map datasets will yield the true clusters Z T. The procedure is similar to the one followed in the last section. The only difference here is that people living in some of the ZCTAs are placed at a higher risk of disease than people living in other ZCTAs. Four different clusters are simulated. Figure 2.15 summarizes the characteristics of the clusters simulated. We can also call these the four clustering situations Rationale Behind the choice of these configurations of synthetic clusters The shapes, sizes and risk elevations of the synthetic clusters were chosen to test the S.S.S method and its underlying theory. If the shape, size and the risk elevation of the

84 70 clusters are similar to the shape, size and rates of the spurious cluster as explained in the last section, then the S.S.S method will not be able to detect these clusters. Conversely, if the shape, size and the rates of the clusters simulated under the alternative hypothesis are different from the shape, size and rates of the spurious clusters, then, S.S.S will be able to detect these clusters. The former clusters are thus less likely to be recovered than the latter ones. Clusters 1 and 2 were chosen such that their shape and size mimics the shape and size of a spurious cluster (Refer to Figure 2.13 and Table 2.4 for a summary). They are composed of four small areas of increased risk. The shape and size were chosen to be similar to the shape, size and rate of the spurious clusters (refer to Figure 2.13 for a comparison). Also, the spatial form or geometries of the four areas of increased risk were chosen from cluster candidates of one of the fifty patterns simulated under the null hypothesis. While both clusters 1 and 2 have the similar shape and size as spurious clusters, Cluster 1 has a risk elevation that is many times higher than Cluster 2. Cluster 2 thus, is the least likely to be recovered of all clusters because its shape and size are similar to that of a spurious cluster in the given geometry, and the risk elevation (1.25) is similar to that of spurious clusters (1.1). Cluster 2 has better recoverability because it has a rate that is different from that of spurious clusters even though its shape and size are similar to them. In contrast, Cluster 3 has a shape, size and risk elevation that is very different from those of spurious clusters. It is a contiguous, aggregation of 111 ZCTAs in the state It is therefore the most recoverable of all the clusters. Cluster 4 has a shape and size different from the spurious clusters, but its risk elevation is similar to that seen in spurious cluster. Table 2.4 summarizes these ideas. It illustrates the shapes, sizes and rates of the four simulated clusters, and also the mean shape, size and rate of the spurious clusters. Every cluster is ranked according to the extent it can be recovered. The cluster which is most similar to the spurious clusters

85 71 (Cluster 2) is the least recoverable, while Cluster 3, which is different from the spurious clusters, is the most recoverable. Note that the geography of Clusters 3 and 4, could pose a significant challenge to the traditional disease cluster detection methods. Many methods of detecting clusters of disease coalesce neighboring local areas to gain statistical power [2, 120]. In Clusters 3 and 4, this opportunity is not available. Nevertheless, Cluster 3 has a high elevation in risk. This may increase power of detection. However, Cluster 4 has neither a high elevation of risk, nor an agglomeration of local areas with a slightly high elevation of risk. It is therefore likely that S.S.S will be able to detect Cluster 4 with success while other methods will fail to do so. Another question of interest is the relative recoverability of Clusters 1 and 4. Cluster 1 has a shape and size similar to spurious clusters but a risk which is very different (3 times higher than the mean rate at spurious clusters). Cluster 4 has a risk similar to the mean rate of spurious clusters but the shape and size that are different from them. It is not known, how in this specific example, the interplay of rate, size and shape will affect the relative recoverability of Cluster 1 and Cluster 4. In later sections, the S.S.S method is applied to the simulated data, and the postulates that I have discussed above are tested. The S.S.S method is also compared with another method of disease cluster detection, - Rogerson s Score statistic. If a limited number of synthetic clusters are evaluated, it can be argued that the synthetic clusters were cherry picked such that power evaluations of a given method (S.S.S in this case) will be successful. It is therefore important that the synthetic clusters that are used in any analysis reflect the possible local geographies within a given region. In the context of disease mapping this would imply that the synthetic clusters reproduce the population densities or the density of controls in the local areas in the region. This can either be done by simulating multiple synthetic clusters, each of which covers a different population or control density regime, or by simulating one or two clusters that cover a number of possible control densities regimes. If the second option is chosen then the

86 72 clusters may have to be large in size to encompass different local areas. This strategy was followed in the design of Clusters 3 and 4 in these analyses.. Alternatively; multiple clusters may be used in a single simulation to cover the different local areas. This approach was used in Clusters 1 and 2. To keep the power evaluations conservative, it is advisable to err on the side of caution, and choose local areas for synthetic clusters with population densities lower than what exists in the region. Areas with higher population densities are easier to detect since they offer greater power. Any cluster detection method can detect clusters in densely populated region but most cluster detection methods have difficulty detecting areas with small numbers of controls [32]. While cluster 1 and 2 cover predominantly rural areas, they also include relatively urban areas such as Walcott and Carroll towns. Similarly, clusters 3 and 4 include large towns such as Council Bluffs, Marshalltown, Fort Dodge and Davenport. The median population density in all ZCTAs in Iowa is 22 people per square mile while in the simulated clusters 1,2,3 and 4 it is 18 people per square mile. As is shown in Figure 2.14, the population densities in all the simulated clusters in this study are less than in Iowa as a whole. Thus, while the number of clusters that are simulated are limited in this study, the population regimes that the clusters cover are not. While, the simulated clusters cover areas which are slightly less densely populated than areas in Iowa, the ability to detect such areas is a more difficult test of sensitivity for any cluster detection method than the alternative of testing for clusters in densely populated urban areas. The cluster recovery process is free to recover any fraction of a given simulated cluster. While the number of a clusters that are simulated are limited in this validation test to four, there are no limitations to the number of parts of these clusters that are successfully recovered using a given cluster detection method. Thus, if there are N ZCTAs in a given synthetic cluster, the cluster detection method has the opportunity to recover 2 N possible combinations of ZCTAs from the synthetic cluster. If on top of this we account for the fact that the S.S.S method can detect clusters across ZCTA boundaries, there are unlimited number of possible clusters that can

87 73 be detected. In spite of these unlimited possibilities the empirical results in Section 2.7 show that the clusters that are detected have some common attributes. The number of simulations that are used in these analyses are thus sufficient to derive meaningful generalizations. This and the fact that the simulated clusters are a conservative and reasonable representation of the local geographies in Iowa justifies the choice of simulated clusters in these analyses. 2.4 Rogerson s Score Statistic Rogerson s Score statistic is a focused test [16] that has been used as a local cluster testing method used by Waller [4] and Rogerson [2, 50, 121] to study spatial patterns of leukemia in New York. A focused testing method is used to test areas of excess risk around a given point or area, while a local testing method is used to test for an excess of risk in a local area. It can be implemented with a freely available software GeoSurveillance [121]. The power of the score statistic as a local cluster detection tool has not been tested, and this study is a first attempt at this. Rogerson s Score statistic is set up to test one local area at a time. If the test were to be used repeatedly on a number of local areas, the statistical problem of multiple testing comes into play. These power tests will address the question, if in a realistic disease mapping situation, multiple testing is an issue with Rogerson s Score test. The theory of the Score test is explained next: Theory Rogerson s local score statistic maps a smoothed value based on the difference between the observed and expected counts of cases in a given region. To test for a raised incidence or prevalence around region i, the statistic that is mapped is as follows: W X Adjusted U i = * (x h -E h ) E

88 74 Where, U i is the value of the score statistic for local area/zcta A i W ih is a weight parameter that decides the extent of smoothing that will be applied to the above statistic. X is the total number of local areas/zctas x h is the total number of cases in a local area/zcta E h is the expected number of cases in a local area S(G) is the area of the geographic region in the map, for example, the state of Iowa. d ih is the distance from the i th local area to the h th one. = W W W W = ( 1 πσ ) * exponent ( - d / 2σ (X/S(G)) exp h = (N i /N)*n = (Total number of people in ZCTA/Total number of people)*(total number of cases) i,h=1,2,,n σ is the bandwidth parameter, and this decides the extent of smoothing that will be applied. Since this method smoothes the statistic calculated above with a Gaussian Kernel, the size of the kernel will decide the extent of smoothing that is applied to the data. If σ=1, then the size of one standard deviation of the kernel is equal to the average distance between the centroids of all ZCTAs or local areas in the study area. The mapped statistic for each local area is tested for significance by comparing with a normal reference distribution [2], with mean zero. In this research Rogerson s Score statistic is compared with the S.S.S method and the S.S.S method is applied to the simulated data

89 75 described in the earlier sections. Therefore, the performances of these methods need to be quantified. This is done, using diagnostic measures that are summarized in the next section. 2.5 Diagnostics Power, sensitivity and specificity are metrics used to evaluate the quality of any method for the detection of disease clusters [122, 123]. These metrics are also used to compare different methods. Figure 2.15 (and the key below) summarizes the diagnostic measures that are used. The area that is a true cluster but has not been identified as a true cluster by the cluster detection method. This area as a percentage of the area within the true cluster (black oval) is a measure of the false negative percent of the cluster detection test. The area on the map that is not a cluster, but has been wrongly classified by the cluster detection method as a true cluster. This area, as a percentage of the area of the map (rectangle minus black oval) that is not a cluster is the false positive percent. The area of the map that is a cluster and has been correctly classified as being a cluster by the cluster detection method. This area as a percentage of the area of the true cluster (black oval) is a measure of the sensitivity of the cluster detection method. The area of the map that is not a cluster and has been correctly classified as not being a cluster by the cluster detection method. This area as a percentage of the area that is not a cluster (which is the rectangle minus the black oval) is a measure of the specificity of the cluster detection method.

90 76 The above diagnostics are accepted measures of the quality of a disease clustering test [122]. A good test has high values of sensitivity and specificity and low true negative and false positive scores. Most researchers report sensitivity and specificity values, because the other two diagnostic measures can be calculated with ease from the sensitivity and specificity measures. These diagnostics were calculated for the two methods. Note that these measures have been developed for cluster detection techniques that detect clusters that follow administrative boundaries. A region (such as a ZCTA) either in its entirety belongs to a cluster or does not. The measures are based on counts and percentages of administrative regions that lie within or do not lie within clusters. Thus, these diagnostics are easily calculated for Rogerson s Score statistic method. However, grid based smoothing methods such as Spatial Filtering cut across regional boundaries. These metrics are thus slightly modified for the S.S.S method. Instead of using a binary count of inclusiveness or non inclusiveness of a cluster candidate, the percentage of the area of a ZCTA that is within (or not within) a cluster, is used. Sensitivity and specificity can be calculated for any one map. However, in simulation experiments like this one, there will be some variation from one map to another. Therefore, the diagnostics that are reported are averaged over all the simulated maps. In this study, for each cluster (1,2,3,4) there are 20 simulated datasets. Sensitivity and specificity are thus averaged over 20 maps. 2.6 Computational Scheme The DMap Filtering routine is realized using a VBA-Excel program written by the author. Rogerson s Score statistic is available in Rogerson s GeoSurveillance software [121]. All the GIS functions (including convex hull) are realized using ArcGIS 9.1 [124]. Monte Carlo hypothesis testing is achieved in VBA-Excel.

91 Results Figure 2.16, displays an example cluster detected by S.S.S and Rogerson s Score statistic for Cluster-4. Cluster-4 is the cluster in which the shape, size and geometry of the cluster pose a challenge for traditional cluster detection methods. It was predicted earlier that S.S.S would show greater sensitivity than Rogerson s Score statistic in detecting Cluster-4. This is observed in the results. S.S.S is three times more sensitive (the ability to detect a cluster given that it exists) than Rogerson s Score statistic. Cluster-3 was a relatively easier cluster detection scenario. The risk elevation at the cluster was 3.0. Both Rogerson s method and the S.S.S are equally successful in detecting these clusters. Figure 2.17 displays the clusters detected by the two methods for one of the 20 simulated datasets. Tables 2.5 and 2.6 summarize the summary diagnostic statistics for the two cluster detection methods. The average sensitivity (the ability to detect a cluster when it exists) for S.S.S is 95%.This implies that on the average S.S.S is able to detect 95% of the simulated cluster (Cluster-3). Clusters-1 and 2 were simulated to test the underlying theory (or assumption), that cluster candidates that resemble spurious clusters are less likely to be recovered by S.S.S. Cluster-2 has the greatest resemblance to the spurious clusters and is therefore the least recoverable. These results confirm this. The average sensitivity (or the ability to detect a cluster if it exists) is around 33%. In contrast S.S.S offers an average sensitivity of around 83% with Cluster-1. The sensitivity with which the clusters are recovered are thus almost exactly as predicted by the theory. Cluster-3 is the most recoverable, while Cluster-2 is the least recoverable. It was predicted that Cluster-1 (which has shape and size similar to spurious clusters) and Cluster-4 (which had a rate similar to spurious clusters) would have medium recoverability. In these simulations Cluster-1 is better recovered than Cluster-4. S.S.S shows a sensitivity of 83% with Cluster-1, while it shows a sensitivity of 58% with Cluster-4. However, lower specificity (the ability to classify areas that do not have an excess of risk as such) is obtained with Cluster-1 than with Cluster-4. Tables 2.5 and 2.6

92 78 summarize the predicted ability to recover a cluster with along with sensitivity and specificity with which it was recovered. The higher the sensitivity and specificity, the better the cluster is recovered. Score statistic was also used to detect Clusters-1 and 2 (Table. The average sensitivity obtained with Cluster-1 was 96% and 87% with Cluster-2. The comparable sensitivities for Cluster-1 and Cluster-2 with SSS are 83% and 33% respectively. It is important to interpret these results in the right context. The computational approach [1] that S.S.S uses makes it possible to predict which Clusters are most likely to be recovered. Thus from the simulations discussed earlier, I predicted that Cluster-1 would be the hardest to recover. Similarly, since both Cluster-1 and Cluster-2 had shapes and sizes similar to those of spurious clusters, they were considered to be less likely to be recovered. In contrast using Rogerson s method does not offer such predictive abilities. For example, Rogerson s method is not able to reasonably detect Cluster-4 (sensitivity 27%). If Cluster-4 had existed in a real epidemiological situation, and Rogerson s method had been applied to it, then the interpretation of the results would be challenging. Is the result of Rogerson s method showing no clusters (in areas where there are clusters) to be interpreted as a failure of Rogerson s method or is it interpreted as the non existence of a cluster? Without any additional evidence, the researcher may conclude that the latter is true, with significant negative consequences. In contrast if Cluster-2 had existed in reality and S.S.S had been applied to it, it would have been known from the simulations that clusters of certain shapes, sizes and rates would be hard to detect with S.S.S. It is also a prudent approach to systematically carry out the analyses at a multiple scales. This could be achieved by using geographic data at the individual level or the analysis scale can be changed by using a different filter size. For an example see Rushton et al.,[125]. S.S.S thus empowers the researcher with a-priori knowledge based on computational geography compared to blind approaches like Rogerson s Score statistic.

93 79 These experiments are designed to empirically test the contribution that each of the axes, - shape, size and rate make to the overall sensitivity of these analyses. I have argued earlier that using a multidimensional cluster signature space should offer better power than using simply a rate or likelihood statistic. Simulated clusters 1 and 3 have the same risk elevation and clusters 2 and 4 also have the same risk elevation. If risk elevation were the only deciding factor in the recoverability of a cluster, or if the shape and size axes did not matter, then it would be expected that the same, or similar sensitivity will be obtained for clusters 1 and 3, 2 and 4. However, this is not the case. Clusters 3 and 4, are recovered better than clusters 1 and 2. The reason why these clusters are better recovered is clusters 3 and 4, have shapes and sizes that are markedly different from the shape and size of Clusters 1 and 2 (And concurrently spurious clusters). It is not that a rate is not important. An increase in rate from 1.25 to 3.0, increases the recoverability of Cluster 2 by 50% and Cluster 4 by 37%. Also since the patterns are created and clusters extracted based on rates, the rate axis is indispensable. However, as, these simulations show, gains in sensitivity are to be had from incorporating shape and size in the analyses. Which of these axes- shape and size, is contributing to the overall sensitivity, and in what proportion? I calculated sensitivity values using information on just one axis, when withholding information on the other two axes. For example, a datapoint found significant on the basis of rate, shape and rate, size and rate or all of shape, size and rate in the earlier table, would be found significant in these analyses for rate axis only. The results of these analyses show that all three axes are important. Shape and rate contribute most to the significance of candidate clusters. Tables 2.8 and 2.9 illustrate these results. The clusters that are significant generally have more regular shapes (with the exception of cluster 3) and higher rates than spurious clusters. While size does not contribute to the significance (with the exception of cluster 3) of candidate clusters, the sizes of candidate clusters are different from the sizes of spurious clusters. As Table 2.8 illustrates, the median size of the significant candidate clusters are greater

94 80 than the median size of spurious clusters. Also observe from Table 2.9 that for clusters 2 and 4, the shape axis is contributing significantly to the sensitivity of the test. These are the clusters where the rate is very similar to that of spurious clusters. Clusters that have high rates are easily recovered. If the rate is high enough the pattern is extracted almost exactly as it was simulated (Cluster 3). For clusters that do, not have a high rate, shape acts as a back up sensitivity engine. Thus, for example, in Cluster 4, the rate is similar to that of the background noise. When this pattern is recovered, it breaks up into multiple cluster candidates, that may or may not have rates higher than the background. However, these candidates are more regular in shape than the background noise. I do not know the exact mechanism that causes these clusters to be more regular than the background noise or spurious clusters; nevertheless the empirical evidence is there. Note the shape axis does cause an increase in false positives, but this may be outweighed by the increase in sensitivity. The costs and benefits would have to be judged individually for every specific public health situation (see discussion). Nevertheless, I can at least claim that the two dimensional signature space with rate and shape, is better than the traditional uni-dimensional rate axis. Should the size axis be discarded? Perhaps not. While, in these experiments size was not significant (In all but cluster 3), the average size of clusters is still larger than non clusters. I therefore argue that a three dimensional signature space should be used in distinguishing clusters from non clusters. These simulations thus provide empirical evidence to support the theory that a three dimensional computation space is best for disease cluster detection. The empirical evidence conforms to the theory discussed earlier, - the sizes of clusters are larger than non clusters, and the shape of clusters are different from non clusters.

95 Discussions and future directions Recent discussions in the disease clustering literature have brought into the fore questions about the geographical aspects of disease clusters. Issues of cluster shape and size, and recoverability are being investigated [31, 33, 34, 43, 69, 73]. Computational systems are being developed that address the question of recoverability or How well could a cluster be identified, if it were there? While the answer to this question was taken for granted (that the clusters are fully recoverable under any situation), there is increasing acknowledgement that methods to address these issues need to be developed. As Jacquez states two of the major deficiencies of geographic studies of disease clusters are that they often assume clusters have a specific shape (e.g. circle or ellipse) and do not evaluate statistical power using the geography, at-risk population, demographics, covariates and numbers of observed cases of the cancer under investigation [33]. In this research I have argued that recoverability of a cluster depends on the disease mapping situation at hand. Depending on the geography of a region, certain clusters will be less recoverable than others. Clusters that are less recoverable will resemble noisy or spurious clusters. A computational approach is outlined where for any given disease mapping situation the nature of these spurious clusters can be mined from the data. This knowledge can be used to predict the ability to recover any given cluster. True clusters are compared with spurious clusters in a three dimensional computational space that incorporates shape, size and rate. Shape and size are fundamental to geographical analysis [39, 64, 74, 76, 87, 88, 90, 91, 95, ]. While the use of rate as a means of distinguishing real clusters from spurious ones is well established in the disease cluster literature, there is less documented evidence of the use and utility of shape and size in this literature. Nevertheless there has been some limited attempt to study shape and size by other researchers. Duczmal [31, 43, 69, 73] can be credited with having made the empirical observation that clusters have a different shape and size than non-clusters. Duczmal and

96 82 other researchers [31, 69, 71, 73] discovered that clusters in Duczmal s Simulated Annealing Cluster search method are of a large size and irregular shape when data are simulated under conditions of no clusters being present, at a census tract level geography. The researchers thus considered clusters with this signature as being spurious and put a penalty on any cluster candidates that resembled this signature [69, 73]. Other researchers have attempted to compare the size of clusters detected by different cluster detection methods [130]. However, the shape of their clusters was limited to circles, and the cluster detection methods used different estimates of risk (log likelihood ratios versus rates). Thus, while there is some evidence of research involving shape and size in the disease cluster detection literature, and there is some empirical evidence to show that the shape and size of clusters are different from non clusters, there has been no systematic effort, like the one outlined in this research, to utilize size and shape along with rate in cluster detection. The empirical observations in this research support the observations made by Duczmal [31, 69, 71, 73] that the shape and size of clusters are different from non clusters. I show that incorporating the shape axis in addition to rate can greatly increase the sensitivity of the cluster detection method. This increase in sensitivity comes with some cost in the form of false positives. The size axis, is found to be important, but not enough to add extra sensitivity to the test. One may choose to either abandon this axis, or at least do more empirical research to test the reliability of this axis in other geographical situations. Like any other disease mapping/cluster detection situation, it is important to take the public health implications of any cluster search into consideration. While it is desirable to have a cluster detection technique that is conservative, the public have a right to know if there is an increased risk of disease in their neighborhood. We have observed in these analyses that incorporating the shape axis along with rate can increase the sensitivity of the S.S.S method. However, it also somewhat increases the number of false positives. This is a limitation of this method. The ultimate decision as to whether

97 83 one should be conservative or not depends on the public health implications of the decisions. While it may be necessary to have the highest sensitivity possible (even at the cost of specificity/increased false positives) for contagious or fatal diseases, a conservative approach could serve better for diseases with smaller public health burdens. An important question in this context is the elevation of risk that is important enough to need a public health intervention. For instance, a small elevation in risk in a common disease like prostate cancer, where an intervention may not have a desirable risk-return tradeoff can be ignored without consequence. On the other hand, a small elevation in risk in a rare, highly contagious and/or non endemic disease like non-hodgkin s lymphoma, West-Nile, Ebola may need immediate emergency intervention. Increasing or decreasing the filtersize has an effect on the patterns that are recovered. A larger filtersize implies that a smoother pattern will be recovered while a smaller filter size creates a pattern with greater spatial variations. These changes in the patterns will be manifested in cluster candidates that are extracted. However this effect will be manifested both in the simulated noise datasets, and the real dataset in the S.S.S methodology. Thus, while increasing or decreasing the filter size does affect the results that are obtained from the SSS methodology, it does not affect their accuracy. The adaptive filters are larger in areas with lower population densities. However, this does not directly translate to a possibility of having larger clusters in rural areas. First, the risk elevation at the region may not be above the cutoff threshold to create candidate clusters. Second, the noise generating process may create reference spurious clusters that have a size larger or similar to the large candidate clusters extracted from rural areas. If a significant cluster is found, then it is likely that this cluster exists. In this context it is important to take into consideration some of the limitations that this method has. The conclusions that one draws from using this method are dependent on the pattern of spurious clusters that are simulated for comparison. The pattern is unique to a particular geography and it is therefore, a good practice to simulate the pattern every time

98 84 the method is ported to a new geographic area. Sometimes however, it may be possible to use the same geography to create the simulated patterns. For example, if prostate cancer incidence is studied at the ZCTA level in Iowa, it may be possible to use the patterns obtained in Iowa, in areas with similar geography (northern Missouri, Nebraska), when studying the same disease. However, for the sake of accuracy it is recommended that a new set of simulations be done for each and every new geographic situation. There are a number of interesting extensions to this research. Some of this can be inspired by the work of Patil [70], Duczmal [31, 43, 69, 70, 73] and Boscoe [131]. For example, instead of using the rates as an indicator of risk, it is possible to use the spatial scan likelihood ratios. This method can be extended to be used with multiple echelons instead of one. This may improve the sensitivity of this method. Interesting methods of visualizing the results of these analyses can also be developed. One approach could be to use nested clusters of any shape with colors or shades representative of their level of significance ( a p-map [9] of a different type) or their risk elevations [131]. A limited number of spatial forms of synthetic clusters are tested in this research. While it is possible to extend this testing to multiple spatial forms, it is important to understand the computationally intensive nature of the problem. We may be tempted to believe that the shapes, sizes and rates of clusters can be represented in three dimensional space by a three dimensional grid, with each grid point representing a possible synthetic cluster. The power of the S.S.S (and other) methods can then be tested against these synthetic clusters. Unfortunately this belief is misleading. This is because; the relationship of the parameter shape represented by a compactness value between the three dimensional attribute space and geographic space is not one to one. There are an infinite or certainly a large number of possible shapes than can have the same value of compactness. The problem thus, while not entirely intractable, is certainly computationally complex. The limited number of cluster shapes and sizes that are tested in this research are a) realistic and b) cover a reasonable spectrum of possible cluster

99 85 spatial forms. A survey of the forms of the spatial forms of clusters used in the existing cluster diagnostic literatures either shows a complete lack of realistic cluster forms [29, 31, 36] or a test of just one possible form [71]. The methods that are proposed in this research are only as good as the data that go into the analyses. The coarser the resolution of the data, the less likely that local variations will be detected. While ZCTAs are sufficient to demonstrate the methods proposed in this research, individual level data may serve the purpose better in real epidemiological situations. These analyses can easily be extended to individual level data. Rogerson s Score statistic performs reasonably in these simulations. If multiple testing were a problem, then a number of spurious clusters would have been detected. This would have decreased the specificity of this method. However, this was not the case. This method suffers from one weakness (apart from being blind as discussed), which was not exposed in this chapter, but becomes clear in Chapter-3. This weakness is the inability of this method to address the small number problem. While in these simulations relatively large base populations were used, in the next chapter, the base populations are segmented by age. Thus, some ZCTAs have small populations, and the score statistic calculates rates based on these populations. This creates noisy or spurious clusters. While a large number of methods for the detection of disease clusters exist, very few methods offer the ability to manipulate the shape and size of the candidate clusters. In the last few years, a variety of methods have attempted to address this issue, almost all of which are based on Kuldorff s Spatial Scan Statistic method of statistical testing. Duczmal s method, which has been shown to be powerful [31, 43] is also based on the Spatial Scan Statistic. This research has been the one of the few attempts to suggest an alternate approach. Unlike the SaTScan based approach that assumes normative truths about the distributional characteristics of the data, this approach offers a relatively positivistic approach. Deriving its strengths from computational geography [35], this approach adapts the statistical testing of candidate clusters, to the specific disease

100 clustering situation in question. 86

101 Figure 2.1: Using echelons to extract cluster candidates. 87

102 Figure 2.2: A set of 50,000 cardiovascular disease mortality cases are randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS. 88

103 Figure 2.3: An example set of spurious cluster signatures S(Z N ) in signature space. 89

104 Figure 2.4: An example set of spurious cluster signatures S(Z N ) in signature space with a few candidate clusters (grey squares). 90

105 Figure 2.5: Bounding rectangle for elliptical footprint. 91

106 Figure 2.6: Flowchart of the S.S.S method. 92

107 93 Figure 2.7: Population distribution of ZCTAs in Iowa, k 1 k 2 Uniform distribution of PRNG Range of k Cumulative weight of k regions k n 1 Figure 2.8: This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=942. Note: Reproduced from Kumar, N. and A. Bragdon, Pseudo Random Number Generators for Simulating Randomness in Geographic Space, manuscript, Dept of Geography, The University of Iowa, Iowa City, IA

108 Figure 2.9: Proof: The simulated datasets follow a multinomial distribution. 94

109 95 Figure 2.10: Summary of shapes of simulated spurious clusters, frequency and cumulative frequency. Figure 2.11: Summary of sizes of simulated spurious clusters, frequency and cumulative frequency.

110 Figure 2.12: Summary of rates of simulated spurious clusters. 96

111 Figure 2.13: Characteristics of the four clusters simulated under the alternative hypothesis. 97

112 98 Figure 2.14: Population densities in simulated clusters compared to population densities in Iowa.

113 Figure 2.15: Cluster detection diagnostics (The key to the numbers is in the text). 99

114 Figure 2.16: Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen inset. In this particular dataset S.S.S is able to identify 56% of the true cluster pattern, while the Score statistic is able to identify 18 %. 100

115 Figure 2.17: Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 92%. 101

116 102 Attribute Datasets Shape Size Rate Table 2.1: Hold one validation for null hypothesis. Total Sensitivity Specificity datasets Cluster 1 83% 82% 92% 92% Cluster 2 33% 30% 95% 97% Cluster 3 95% 95% 90% 90% Cluster 4 58% 60% 99% 99% Table 2.2: Hold one validation for alternative hypothesis.

117 103 Shape Size Rate Mean Median Minimum Maximum Table 2.3: Summary statistics of the simulated 3675 spurious clusters.

118 104 Simulated Cluster e Shap Size (In square miles) Risk How recoverable is this cluster? Cluster 1 Cluster 2 (com pactness) 0.20, 0.25, 0.39, , 0.25, 0.39, Hypothesize d ranks of the extent to which the cluster will be recovered by S.S.S Medium recoverability 1.25 Hardest to recover Cluster Easiest to recover Cluster Medium recoverability Mean shape, size and rate of spurious clusters Table 2.4: Shape, size, risk (signature) and the ability to recover simulated clusters.

119 105 Average Sensitivity Average specificity Average False Positive rate Average False Negative rate Predicted recoverability Cluster 1 83% 92% 8% 17% Medium Cluster 2 33% 95% 5% 67% Hardest Cluster 3 95% 90% 10% 5% Easiest Cluster 4 58% 99% 1% 42% Medium Table 2.5: The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such).

120 106 Cluster num Average sensitiviy S.S.S Average Specificity Averag e false positive rate Average false negative rate Average Sensitiviy Local Score Statistic Average Specificiy Average false positive rate Average false negative rate 3 95% 90% 10% 5% 95% 92% 8% 5% 4 58% 99% 1% 42% 27% 98% 2% 63% Table 2.6: This table compares sensitivity and specificity with which clusters are recovered for SSS and Rogerson s method and the higher the sensitivity the better the cluster is recovered.

121 107 Rate axis only Sensitivity Specificity False Positive Shap Rate Shape Rate e axis axis only axis only axis only only Shape axis only Cluster 1 82% 80% 89% 88% 11% 12% Cluster 2 3% 31% 95% 83% 5% 17% Cluster 3 95% 95% 90% 90% 10% 10% Cluster 4 23% 56% 96% 90% 4% 10% Table 2.7: Cluster recovery using only rates and only shapes.

122 108 Spurious clusters Cluster candidates Cluster 1 Cluster 2 Cluster 3 Cluster 4 Mean shape Mean shape of significant cluster candidates Median size Median size of significant cluster candidates , Table 2.8: How do true clusters differ in shape and size from spurious clusters.

123 109 CHAPTER 3: INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA In this chapter I investigate the spatial patterns of prostate cancer incidence in Iowa. I begin with a discussion on the spatial epidemiology of prostate cancer, and how cluster detection techniques and geographic analyses have advanced this research. I apply three cluster detection techniques The S.S.S (Shape Size Sensitive) method, developed in this dissertation, Kulldorff s Spatial Scan Statistic and Rogerson s Score test to investigate possible prostate cancer clusters in the state of Iowa. The implications of the results are investigated and discussed. 3.1 Background Prostate cancer is the most common cancer among men (excluding skin cancer) in the United States. It is also the second leading cause of cancer related deaths [132]. It is estimated that 186,320 men will be diagnosed with and 28,660 men will die of cancer of the prostate in 2008 [133]. Other developed countries fare similarly. Prostate cancer is the most common cancer among men in the UK and accounts for 24% of all new cancers detected [134] there. The risk of an individual having prostate cancer rises exponentially with age. Almost 80% of all prostate cancers that are diagnosed are in men above the age of 65. Estimates from the SEER (Survival Epidemiology and End Results) program indicate that US men under the age of 65 had an incidence rate of 56.8 per 10,000, while the same statistic for the 65 plus age group is [135]. Other risk factors for prostate cancer are race and family history. Some risk factors of prostate cancer for which the evidence is inconsistent are, diet, occupation (with farmers being at an increased risk), obesity, low vitamin-d intake, sexually transmitted diseases, diabetes, smoking and physical inactivity [135].

124 110 Prostate cancer also shows marked geographic variability. There are substantial geographical variations in the mortality, incidence, treatment and survival patterns of prostate cancer [131, 136].Because of the marked geographical variations shown by this cancer, a number of interesting GIS based techniques have been used to study these spatial patterns [46, 119, 131, 137, 138]. Since, the validity of the spatial patterns is dependent on the quality of the underlying spatial data [37, 42], researchers have investigated the role of scale and geocoding quality on the spatial patterns of this disease [125, 139, 140]. Historically, there is a higher risk of death from prostate cancer for people living in the Northern Plains than in the rest of the US [141, 142]. There is a higher risk associated with a rural residence than an urban one [141, 142]. Some associations have been hypothesized between various agricultural pesticides and prostate cancer. For instance, pesticide applicators have a relative risk 1.4 times the general population [103]. Some studies are attempting to address these concerns by studying the risk of prostate (and other) cancers in a large cohort of agricultural workers and their families [103, 143]. Iowa is a large agricultural state and prostate cancer accounts for 26.3% (the highest percentage of all cancers) of new cancers detected in Iowa [144]. While there are some maps of prostate cancer incidence from states with SEER registries (of which Iowa is one) and this data can be obtained from certain sources [145] no record exists, of any attempt to systematically search for areas with excess risk of prostate cancer in Iowa. Maps for the ICCCC (Iowa Consortium for Comprehensive Cancer Control) are one step in this direction [146]. Some of these maps show smoothed rates of prostate cancer incidence and mortality in Iowa. Smoothed maps created from adaptive filters are better representative of the underlying variations in risk, than simple choropleth maps [2, 58, 60]. The rates mapped on choropleth maps could be inaccurate if they are calculated using a small base support population [58, 60, 146]. While these maps are powerful

125 111 exploratory tools, some of the patterns observed on these maps could be spurious, and have arisen by random chance. These patterns need to be statistically tested to find the likelihood of them having arisen by chance. If this likelihood is small, then it is very likely that these patterns are real. Cluster detection techniques [45, 147] were developed with this objective in mind [146]. The objective of this study is to address the question Are there any areas of Iowa with significant excess prostate cancer incidence risk? or Are there any clusters of prostate cancer incidence in Iowa? If there are any clusters, then how is their existence interpreted? Note that the objective of this study is to find areas with excess risk of prostate cancer incidence as opposed to excess prostate cancer incidence per se. Risk is a dynamic and unobserved quantity that the researcher wishes to estimate [4]. The incidence of prostate cancer can be high in a given area, but this does not necessarily imply an excess of risk. For example, if the incidence rates are based on small numbers, then risk estimates based on these rates may not be representative of the true value of the risk. The methods that are used in this research such as, Kulldorff s Spatial Scan Statistic [3], the adaptive filtering method[58] and the S.S.S method report values that can be directly interpreted as risk. These methods are discussed in the next section. 3.2 Methods The data examined is for new diagnosed cases of prostate cancer for ages above 45 geocoded to the ZIP Codes for the years , for the state of Iowa. Subseqently they were geocoded to ZCTAs (Zip Coded Tabulation Areas).The data was obtained from the State Health Registry of Iowa (S.H.R.I) [148] which is one of the 18 SEER registries in the United States, through a data sharing agreement. The University of Iowa IRB (Institutional Review Board) approved the use of this data (The IRB application number is ).

126 112 The ZCTA boundaries were obtained from the US Census website [149]. For all the three methods that were applied to the data, the cancer rates were age-standardized using the indirect standardization method. Three age groups were used in the standardization procedure, 45-64, and 85+. These age groups are consistent with that used by previous researchers [142].The numbers of cases in the three groups were 3944, 7071 and 2193 respectively, accounting for a total of 13,208 prostate cancer cases in Iowa. There are many cluster detection methods [45, 150], some of which were discussed in Chapter-1 of this dissertation. One of the approaches in the public health community is to use a battery of methods to search for excesses of risk on a map [130, ]. If a number of methods agree on certain areas of a map having an excess of risk, it is likely that these excesses are for real. Two of the best examples are Breast cancer clusters around Cape Cod and the Long Island Breast Cancer cluster. In the former researchers found clusters using GAM (Generalized Additive Modeling), Bonnetti and Pagano s M-Statistic and Kulldorff s SaTScan [153, 154]. Breast Cancer clusters were found on Long Island, New York, using Kulldorff s SaTScan [155] and LISA methods [156]. The three methods that were used in this research are 1) The S.S.S method 2) Rogerson s Score statistic [50] and 3) Kulldorff s SaTScan [3]. The three methods were chosen because they are based on different principles. The S.S.S method and Rogerson s Score statistic were described in the last chapter, thus they are not explained in detail in this section. Rogerson s Score statistic is a localized version of a focused test. For every ZCTA on the map, it looks for an excess of risk around the ZCTA. This excess risk is calculated as a weighted sum of the difference between the observed and expected rates. This is summarized as a normalized statistic that is then compared with a standard normal distribution. In these analyses all tests were carried out at the 0.01 level. The test is repeated for all ZCTAs. Note that because the method calculates rates at individual

127 113 ZCTAs, it is possible that when there are a small number of expected cases, or population at risk, this test may detect spurious clusters. Rogerson s Score statistic is realized by the freely downloadable GeoSurveillance software [121]. The S.S.S method is a new method proposed by the author that compares the signature of the background noise or spurious clusters to decide if a given cluster candidate is a true cluster. Like most cluster detection methods, the S.S.S method addresses the question of the likelihood of a given cluster candidate having arisen by chance. But, unlike most methods S.S.S uses the shape, size and rate of a cluster (instead of just rate) to infer if a given cluster candidate is a true cluster. In this research, the cluster candidates are extracted by using echelons [70, 111] and surfaces from the adaptive [58] spatial filtering approach. The process is simple. A horizontal plane at a risk level of choice (In this case 1.2) is intersected with the three dimensional surface created from adaptive spatial filtering of the real data. The risk level represents the mean rate in all the simulated cluster candidates. This yields numerous two dimensional irregular cluster candidates. At the same time a number of datasets are simulated using the same geography, and same number of cases, but the risk at each ZCTA is proportional to the population at risk. When the age standardization procedure is used to allocate disease cases to the ZCTAs, the number of cases that a ZCTA receives depends on the population structure of that particular ZCTA. Thus a ZCTA with a large proportion of 85+ people will receive more proportionally more cases for that age group. Computationally we can imagine three strings instead of the one seen in Figure 2.8, one string for each age group. The final number of cases is a sum of cases received from each age group. This process causes the numerators (number of cases) to change differentially from the denominators (populations).the older the population structure of an area, the greater the relative risk. Cluster candidates are extracted from these datasets as they are for the real data. These reference or spurious cluster candidates are compared with the candidates from the real data using their shape, size and rate. If any of the cluster

128 114 candidates from the real data are found significant, they are marked out as true clusters. These true clusters can be mapped for further exploration. While the S.S.S method is used in this research with cluster candidates extracted using echelons and adaptive filtering, in principle it can be extended to cluster candidates extracted using any other method. The third method used in this method is the SaTScan method which uses density estimation and a likelihood based testing approach [3, 26]. Kulldorff s Spatial Scan is widely used in the public health community to detect areas of excess risk of disease on a map [92, 130, 131, 155, 157]. The method uses overlapping circles of increasing radii centered on the centroids of local areas (ZCTAs). For each circle, the relative risk and likelihood ratio is calculated based on the cumulative observed and expected deaths contained within it. Relative risk is a measure of the increased or decreased risk associated with being in a particular circle relative to the state and the ratio of observed deaths to expected deaths. In contrast, the likelihood ratio in a circle is a measure of how the incidence rate within a circle differs from the rate outside the circle and can be calculated as follows: LLR = (O ln (O/E)) + ((n-o) ln [(n-o)/(n-e)]) where, LLR represents the logarithm of the likelihood ratio, O is observed cases, E is expected cases, and n is the total number of cases in the entire region (Iowa). This formula assumes that disease events are distributed as a Poisson random variable. The likelihood ratios are compared to the results of a Monte Carlo simulation of the data, and each circle is thus assigned a p-value according to its likelihood rank. Circles with p- values less than 0.01 or 0.05 can be considered significant. Recent modifications in the SaTScan methodology allows for the search of both circular and elliptic clusters [29]. Both circular and elliptic cluster searches were carried out in this research. The largest

129 115 cluster size was set at 50% of the population. The Spatial Scan Statistic is implemented by the SaTScan software that is freely available. The three methods S.S.S, Spatial Scan Statistic and Rogerson s Score statistic were used to investigate if any excesses of risk of prostate cancer incidence exist in Iowa. The results of these investigations are explained next. There is considerable agreement among the methods on the location and size of the excesses of prostate cancer incidence risk in Iowa. 3.3 Results I start with exploratory spatial analyses of the data. Adaptive filters were used to smooth the data to observe the underlying variations in risk. Figure 3.1 displays this map. Each rate is based on 397 expected cases on a 2.5 mile grid. With this number of expected cases the smallest detectable difference in relative risk is around 10% [4, 158]. This difference is sufficient for these analyses (for example see Alvanja et.al., [103]). A larger filter size would allow smaller risk differences to be detected, but would also smooth out the spatial variations [25]. A smaller filter size decreases the ability to reliably detect small differences in risk. For example if 30 expected cases were used as the filter size, the smallest difference in risk that would be detectable would be 36%. The analyses using spatially adaptive filters show that some areas in North West Iowa have risks 30% greater than normal. There are also isolated patches of high risk (20% or 30% above normal) in East Central Iowa. The S.S.S method detected one cluster in North West Iowa (Fig 3.2). The risk elevation at the cluster is 1.3. The cluster encompassed a total of 47 ZCTAs. A majority rule (>50% area) was used to express the cluster in ZCTA geography. The observed number of cases in the cluster is 604 and the expected number of expected cases in the cluster is 451. No other significant clusters were detected by S.S.S. Since the cluster candidates derived for the S.S.S method were from the adaptively filtered surface shown

The S.S.S method detected one cluster, in North West Iowa (Figure 3.2). The risk elevation at the cluster is 1.3, and the cluster encompasses a total of 47 ZCTAs; a majority rule (more than 50% of a ZCTA's area) was used to express the cluster in ZCTA geography. The observed number of cases in the cluster is 604 and the expected number of cases is 451. No other significant clusters were detected by S.S.S. Since the cluster candidates for the S.S.S method were derived from the adaptively filtered surface shown in Figure 3.1, there is some concordance between the two maps on the location of areas of excess risk.

The results of Kulldorff's SaTScan analyses can be seen in Figures 3.3 to 3.5. Two significant clusters were detected by these analyses: a primary cluster and a secondary cluster. Figure 3.3 shows the primary cluster detected by this method assuming an ellipsoidal geometry; Figure 3.4 shows the same results with a circular geometry. With a circular geometry, the observed number of cases in this cluster is 814, while 606 were expected. The cluster detected when an ellipsoidal geometry is assumed has a more compact shape and a higher relative risk: the number of expected cases is 446, the number of observed cases is 631, and the relative risk is 1.44, about 44% higher than normal. Note that the numbers of observed and expected cases are similar to those obtained from S.S.S. The SaTScan analyses also detected a secondary cluster (in both the elliptic and circular analyses) in Eastern Iowa. Unlike the primary cluster, this cluster has a lower relative risk of 1.1 and includes 3471 cases; its expected count amounts to almost a quarter of all the expected cases in Iowa. It is likely that this cluster is an agglomeration of areas with relatively small elevations of risk, and its large size makes it powerful enough to be statistically significant. This secondary cluster, as found by the elliptic analyses, is displayed in Figure 3.5.

The results of the analyses using Rogerson's local score statistic can be seen in Figure 3.6. Rogerson's method does not report a single value of risk for a cluster because the statistics are calculated individually for each ZCTA; the ZCTAs that have a significant value of the score statistic are mapped. Rogerson's Score statistic detects a number of isolated clusters that are detected by neither S.S.S nor SaTScan. There is thus the possibility that these clusters are spurious. Such spurious clusters could arise from problems with the statistical testing procedure: since at least 942 (the number of ZCTAs) separate tests are carried out, roughly nine ZCTAs could be classified as clusters at the 0.01 level even when they are not clusters in reality.
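That expectation of roughly nine false positives, and the implausibility of chance alone producing the roughly ninety flagged ZCTAs discussed next, can be checked with a quick calculation. The sketch below assumes the 942 tests are independent, which spatially referenced ZCTA statistics are not exactly; it is intended only to show the order of magnitude.

```python
from scipy.stats import binom

n_tests = 942   # one local score test per ZCTA
alpha = 0.01

expected_false_positives = n_tests * alpha          # about 9.4 ZCTAs flagged by chance
prob_ninety_or_more = binom.sf(89, n_tests, alpha)  # P(at least 90 false positives), vanishingly small

print(expected_false_positives, prob_ninety_or_more)
```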

However, the number of ZCTAs that appear in the isolated clusters is far greater than nine (a conservative estimate is ninety), so these spurious clusters are not artifacts of multiple testing. Instead, they are created by the small numbers problem. Figure 3.7 illustrates this: the clusters detected by Rogerson's method have a smaller number of expected cases than average. For example, 77% of the ZCTAs found to be clusters by Rogerson's method have fewer than 10 expected cases, compared to 70% of all ZCTAs in Iowa. Note that the problem of unstable rates does not arise with Kulldorff's SaTScan or with S.S.S; these methods base their statistics on adaptively sized aggregations of the data, so the statistics they calculate rest on a reliable number of expected cases. Since each of these methods found only one cluster, a figure analogous to Figure 3.7 cannot be constructed for them. Nevertheless, Rogerson's method does demarcate the same area in Northwest Iowa as a cluster of prostate cancer incidence as do SaTScan and S.S.S.

We are now in a position to address the question: what areas of Iowa are found to be at an excess risk of prostate cancer incidence by all three methods? This question can be addressed with a simple GIS intersect operation, which shows the areas common to the clusters found by S.S.S, Rogerson's method, and the Spatial Scan Statistic. Figure 3.8 displays these areas, and Figure 3.9 displays the same map along with the county boundaries.
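In practice the intersect can be carried out in any GIS. The sketch below uses the geopandas library and hypothetical file names (the actual cluster layers from this research are not distributed with the text) to show the overlay of the three sets of flagged ZCTAs.

```python
import geopandas as gpd

# Hypothetical layer names for the ZCTAs flagged by each method.
sss = gpd.read_file("sss_cluster_zctas.shp")
rogerson = gpd.read_file("rogerson_significant_zctas.shp")
satscan = gpd.read_file("satscan_cluster_zctas.shp")

# Successive intersections keep only the areas flagged by all three methods.
common = gpd.overlay(sss, rogerson, how="intersection")
common = gpd.overlay(common, satscan, how="intersection")
common.to_file("all_three_methods_agree.shp")
```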

Interpretation of clusters of any disease poses a challenge [45, 140]. It is especially challenging to interpret clusters of diseases that have undergone radical changes in the means and methods of diagnosis and detection [89], and prostate cancer is one such disease. In 1986, PSA (Prostate Specific Antigen) testing was introduced in the United States. The test is affordable and easy to administer (though it has a high false positive rate) and was adopted throughout the United States, but the adoption was not uniform either temporally or spatially. Thus, variations in patterns of prostate cancer risk can reflect underlying variations in rates of screening rather than variations arising from a spatially varying etiologic factor [89, 119, 137, 139]. While there are no definitive methods of disentangling the effects of screening from other factors, such as access to healthcare and any number of intervening factors [119, 159], there are certain epidemiological indicators of high screening uptake: areas with relatively high screening uptake show high incidence rates together with a migration towards early stages [119, 160]. A number of studies have shown spikes in prostate cancer incidence after the adoption of PSA testing, with relatively small changes in mortality [137, ].

Figure 3.10 displays the change in incidence and mortality rates in the counties that have more than 60% of their areas within the cluster; the incidence and mortality rates for these counties were aggregated to calculate the rates. The data are from a GIS based database maintained by the University of Kentucky and the Kentucky Cancer Registry [145]. The same dataset was queried to extract directly standardized rates for two counties within the cluster and for the state of Iowa; these can be seen in Figure 3.11 and the figure that follows. All these graphs are consistent in showing sharp increases in incidence during the late nineties in counties within the cluster, and these areas may also show a lower than usual mortality rate. Maps from the ICCCC [146], which display rates of late stage disease and mortality for a similar period, can be used to interpret this cluster. These maps show that the areas marked as the cluster in this research have a much lower than expected occurrence of late stage cases, and that they have very low mortality rates. It is therefore likely that there are cohorts of high incidence rates in this area. The possible reasons for this are in-migration of aged individuals who receive their first diagnosis in the area, an etiologic agent that has suddenly become prevalent, or an unusually heavy uptake of prostate cancer screening in this area relative to other areas.
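For readers unfamiliar with the term, the directly standardized rates referred to above weight the study area's age-specific rates by a standard population's age structure, removing differences in crude rates that are due only to differing age distributions. The sketch below uses made-up numbers; it is not the Kentucky Cancer Registry's calculation, only the standard formula.

```python
def directly_standardized_rate(age_specific_rates, standard_population):
    """Directly age-standardized rate: each age-specific rate is weighted by the
    share of the standard population in that age group."""
    total = sum(standard_population)
    return sum(rate * count for rate, count in zip(age_specific_rates, standard_population)) / total

# Illustrative values only: cases per 100,000 and standard population counts for three age groups.
rates = [5.0, 180.0, 950.0]
standard = [60_000, 30_000, 10_000]
print(directly_standardized_rate(rates, standard))  # about 152 per 100,000
```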

3.4 Discussion

Three cluster detection methods were used to study the spatial patterns of prostate cancer incidence in Iowa. A cluster of excess risk was found in North Western Iowa, and this finding is consistent across all the methods. On further investigation it was found that rates of prostate cancer incidence increased rapidly in the late nineties within the cluster, compared to the state of Iowa, and there is also evidence of an excess of early stage prostate cancers in the area. It is therefore possible that the observed prostate cancer cluster in Iowa is an artifact of increased prostate cancer screening (perhaps PSA screening), which is picking up latent cases of prostate cancer. This phenomenon peaked in the late nineties in this area (compared to the early nineties in the rest of the US [161]). However, this peak of increased early stage detection has not subsided to levels comparable to the rest of the state, and therefore a cluster of prostate cancer incidence is found to exist in the area. Mortality rates for the cluster region show small decreases over the same period in which incidence rates increase. Changing diagnostic regimes often contribute to increased incidence and, consequently, prevalence rates; this phenomenon has been observed in diseases as diverse as breast cancer [164] and autism [165]. It is feasible for such a diagnostic regime change to play out in a geographically localized area, as in the Tyrol prostate cancer study [160] or in rural Iowa; these regions show similar temporal patterns of incidence and mortality change. Nevertheless, further research is required to pinpoint the exact causes of excess cancer risk within this cluster. One possible approach is to map the temporal trend of prostate cancer stages in this region, since a changed diagnostic regime causes an increase in early stage cancers [160]. Other alternatives are to investigate various etiologic agents, or to follow cohorts of people residing in this area [103, 143].

This cluster is highly unlikely to have occurred by chance, since three different methods have identified its presence; independent confirmation from multiple methods decreases the likelihood that the observed cluster is a false positive. While the likelihood of this cluster being a false positive is low, it is possible that other clusters, either with lower elevations in risk or extremely small with high elevations in risk, are not detected by the methods used in this research. The use of data aggregated at the ZCTA level could be a contributing factor to this problem.

3.5 Conclusion

Patterns of prostate cancer risk were investigated over the state of Iowa. An excess of prostate cancer risk was found in an area in the North West of the state. Further investigation indicates that this cluster could be caused by an increased uptake of screening, relative to the rest of the state, during the study period; however, further investigations would be required to verify this claim. The section that follows summarizes the work in terms of its contribution to the geographic literature.

3.6 Contribution that this dissertation makes to the geography literature

Geographers have traditionally contributed greatly to the disease mapping and disease cluster detection literature [9, 27], with contributions ranging from choropleth maps [46, 166] to density estimation techniques [27, 42, 46, 166], to name a few. The disease cluster detection literature has developed rapidly in the last decade with the availability of cheap and readily accessible computational power, which has made it possible to search for clusters of various spatial forms. While this is a positive development, it has also brought into focus the problem of noise, or spurious clusters. The problem of spurious clusters is inherently geographic in nature, and for any given cluster detection problem there are different degrees of recoverability for clusters of different shapes, sizes, and risk elevations.

The disease cluster literature has started addressing questions about the recoverability and power of disease clusters [33, 34, 69]. Recoverability, or the extent to which a given cluster can be recovered, is related to issues of shape and scale [39, 64, 74, 87, 88, 90, 91, 95, ]. In this dissertation I proposed a new method of cluster detection (S.S.S, or the Shape Size Sensitive method) that utilizes the shape, size, and rate of clusters to predict the ability of the method to detect them. Shape, size, and rate are used to distinguish spurious or noisy clusters from true clusters. This method differs from other methods in first addressing the question of identifying the characteristics of spurious clusters; the clusters that differ most from these spurious clusters are the most likely to be true and thus recovered. Shape, size, and rate form the axes of a three dimensional computational space in which spurious clusters are distinguished from true clusters, and it is shown empirically that the inclusion of these axes improves the ability to distinguish true clusters from spurious ones. The method is compared with an existing method of cluster detection (Rogerson's Score statistic). Results show that the S.S.S method is able to predict the extent to which it can detect any given cluster, and for those clusters that it is able to detect, it is more powerful than, or at least as powerful as, Rogerson's method. Unlike the existing geography literature, this research also makes use of a realistic set of synthetic disease clusters to test the robustness of the proposed method.
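The dissertation's decision rule is defined in the earlier methodological chapters; purely as an illustration of the three dimensional signature space idea, the sketch below represents each cluster candidate as a (shape, size, rate) point and flags candidates whose signatures lie far from a cloud of Monte Carlo spurious clusters. The threshold, the standardization, and all numbers are assumptions made for the example, not the calibrated values used in this research.

```python
import numpy as np

def signature(shape, size, rate):
    """A cluster candidate as a point in the (shape, size, rate) signature space."""
    return np.array([shape, size, rate], dtype=float)

def looks_true(candidate, spurious, threshold=2.0):
    """Flag a candidate whose signature is far from every simulated spurious cluster.

    spurious  -- array of shape (n, 3): one (shape, size, rate) row per Monte Carlo
                 spurious cluster generated under the null hypothesis
    threshold -- distance, in standardized units, beyond which the candidate is
                 treated as dissimilar to noise (an illustrative choice only)
    """
    mean, sd = spurious.mean(axis=0), spurious.std(axis=0)
    z_cand = (candidate - mean) / sd
    z_spur = (spurious - mean) / sd
    nearest = np.min(np.linalg.norm(z_spur - z_cand, axis=1))
    return nearest > threshold

# Toy usage with invented numbers: 500 simulated spurious clusters and one candidate
# whose shape, size (in ZCTAs), and rate echo the North West Iowa cluster.
rng = np.random.default_rng(0)
spurious = rng.normal(loc=[0.4, 15.0, 1.05], scale=[0.1, 5.0, 0.05], size=(500, 3))
candidate = signature(shape=0.8, size=47, rate=1.3)
print(looks_true(candidate, spurious))
```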

The S.S.S method is also used, along with Rogerson's method and Kulldorff's Spatial Scan Statistic, to study patterns of prostate cancer incidence in Iowa. The three methods show remarkable convergence on the location of a cluster (and the risk elevation at it) in North Western Iowa, whose causative factors need to be investigated further. This dissertation thus makes an important contribution to the small but significant health geography literature on the spatial patterns of prostate cancer [119, 125, 137]. Rogerson's method is affected by the problem of unstable rates, or small numbers; the Spatial Scan Statistic and the method used to extract cluster candidates in S.S.S are robust to small numbers. This research reinforces an assertion that some geographers have made [94]: instead of accepting geographic boundaries as they are, space should be manipulated according to the needs of the specific research problem at hand [64, 67]. This research also utilizes the approach of echelons in disease mapping. These analyses show that geography is important: shape and size are important aspects of disease cluster detection, yet they have long been ignored by the disease mapping community. This research demonstrates that taking the geographic aspects of disease clusters into account can greatly increase the effectiveness of such analyses.

Figure 3.1: Spatial patterns of prostate cancer incidence in Iowa.

Figure 3.2: Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method.

Figure 3.3: Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal.

Figure 3.4: Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular.

Figure 3.5: Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical.

Figure 3.6: ZCTAs in Iowa with a significant value of Rogerson's Score statistic.

Figure 3.7: Expected number of cases in ZCTAs: entire Iowa versus areas with a significant value of Rogerson's Score statistic.

Figure 3.8: ZCTAs in the North West Iowa cluster of high prostate cancer incidence.

Figure 3.9: County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
