A Comparison of Three Exploratory Methods for Cluster Detection in Spatial Point Patterns


A. Stewart Fotheringham and F. Benjamin Zhan

This paper compares the performances of three exploratory methods for cluster detection in spatial point patterns where the at-risk population is known. After reviewing two existing methods, Openshaw et al. (1987) and Besag and Newell (1991), an alternative method is introduced. These three methods are then compared empirically using two point patterns drawn from a disaggregate housing database consisting of 28,832 observations. Each observation in the data set contains attributes of single-family detached dwellings in the City of Amherst, New York. This paper provides some new insights into the performance of the three methods, as previous applications have used spatially aggregated (and hence rather inaccurate) data. The paper also demonstrates the utility of GIS for this type of spatial analysis.

The idea for this paper began as part of Research Initiative #14, Spatial Analysis and GIS, of the National Center for Geographic Information and Analysis in the United States, supported by a grant from the National Science Foundation (SBR ). Continued support for A. Stewart Fotheringham was provided by the North-East Regional Research Laboratory in the United Kingdom and for F. Benjamin Zhan by a faculty research enhancement grant from Southwest Texas State University. The authors thank Dr. Barry Lentnek for allowing them to use the Amherst housing data; David Phillips and Martin Camacho for their assistance with the data set; and Professor Stan Openshaw for his comments. Generous help from Fuxiang Xia and Ge Lin is also greatly appreciated.

A. Stewart Fotheringham is professor of geography at the North-East Regional Research Laboratory, University of Newcastle. F. Benjamin Zhan is assistant professor of geography and planning, Southwest Texas State University.

Geographical Analysis, Vol. 28, No. 3 (July 1996) Ohio State University Press. Final version accepted 12/20/94.

1. INTRODUCTION

The analysis of spatial point patterns has long been an important concern in geographical inquiry [see, for example, Boots and Getis (1988) and the references therein]. The availability of georeferenced point-type data in digital form and the advantages that geographical information systems (GIS) offer for analyzing spatial point data suggest that interest in spatial point pattern analysis will increase. Indeed, there has been much recent interest from researchers across several disciplines (Clayton and Kaldor 1987; Openshaw et al. 1987; Stone 1988; Doll 1989; Gardner 1989; Hills and Alexander 1989; Wheldon 1989; Cuzick and Edwards 1990; Besag and Newell 1991), and particularly in

the study of spatial patterns of disease (Marshall 1991). Given the rich availability of data in GIS, and the nature of spatial point pattern analysis, where the underlying statistical assumptions are often hard to specify and selection biases are usually present (Besag and Newell 1991), it seems particularly important to examine the detection of clusters using exploratory techniques (Openshaw et al. 1987; Besag and Newell 1991).

Two broad categories of point patterns can be identified: those for which the at-risk population is known, and those for which the at-risk population is unknown. While it is recognized that there may be many situations where the at-risk population is unknown, such as the occurrence of lightning strikes, this paper concentrates solely on the former. The reason for this is that knowledge of the spatial distribution of the at-risk population allows more interesting clusters to be distinguished from those that arise purely from spatial variations in the density of the at-risk population. For example, a map of the incidence of some disease is relatively uninformative if the underlying distribution of the population is unknown: clusters of the disease will inevitably appear in areas of high population density. The geographically interesting question is not "where is the sample clustered?" but "where is the sample clustered relative to the population?"

Regardless of the specific technique used for cluster detection, the general procedure for hypothesis testing is basically the same: a null hypothesis ($H_0$) and an alternative (research) hypothesis ($H_1$) are specified; a test statistic is computed from the observed point pattern; and a technique is chosen for assessing the significance of the statistic. Ideally the test statistic should be computed from a comparison of the observed points and the underlying at-risk population.
This is a problem if data are aggregated to a certain level, as by Openshaw et al. (1987) and Besag and Newell (1991), where the observed cases and the population at risk are aggregated into census enumeration districts (EDs) or census tracts and georeferenced to the centroids of these zones.

The purpose of this paper is to compare the performance of three exploratory methods used for detecting clusters in spatial point patterns using examples from a file containing georeferenced data on 28,832 houses in Amherst, New York. We first give a brief review of the existing exploratory methods for cluster detection in section 2. Section 3 presents an alternative method to those that currently exist. The design of the empirical research is presented in section 4 and results are discussed in section 5. Conclusions are drawn in section 6.

2. TWO EXISTING METHODS FOR DETECTING SPATIAL POINT CLUSTERS

Reviews of general point pattern analysis can be found in Ripley (1981), Diggle (1983), and Upton and Fingleton (1985), and those particularly related to geographical research can be found in Boots and Getis (1988). Reviews of the methods used for the analysis of clusters in spatial point patterns concerned with disease are provided by Hills and Alexander (1989) and Marshall (1991). Because our concern here is with the detection of clusters in spatial point patterns using exploratory methods, the literature review is focused on such methods.

The first attempt to detect spatial point clusters using exploratory methods is the Geographical Analysis Machine (GAM) developed by Openshaw et al. (1987). For convenience, the method will be called the Openshaw method hereafter. GAM consists of four components: (1) a spatial hypothesis generator, (2) a procedure for assessing significance, (3) a GIS to handle retrieval of spatial data, and (4) a geographical display and map processing system (Openshaw et al. 1987, p. 338).

The technique used by Openshaw et al. (1987) is illustrated in Figure 1. First, a universe of all possible circle-based hypotheses is generated using the following algorithm.

(1) Construct an initial grid over the area of interest, and define the minimum, maximum, and incremental values of the radii of the circles to be located at the intersections on the grid. The length of each side of the grid and the radius of a circle are chosen in such a way that the initial grid lattice is sufficiently fine-grained and that the circles can overlap to a large degree.
(2) For a constructed grid mesh and a determined circle size, move the circle so that it is located on each grid intersection systematically. Compute the test statistic for each circle at each grid intersection. If the test statistic passes the significance test (see below), the location and the circle are stored for later visualization.
(3) Increase the radius of the circle by the specified increment, and construct a new grid mesh accordingly.
(4) Repeat steps 2 and 3 until the radius reaches the maximum value.

Openshaw et al. (1987), in their GAM, used Monte Carlo simulation to assess significance. Circles are located systematically in the study area as discussed above. The count of observed cases within each circle is used as the test statistic. That is, the count of observed cases within a circle for the observed point data is compared with the count of simulated cases in the circle for each of the $a - 1$ sampled data sets. The circle and its location are recorded if and only if the count of observed cases in the circle for the observed point data is the largest one among the $a$ test statistics. For $a - 1$ simulated sample data sets, the significance level is $1/a$. Using a Monte Carlo significance test has a number of advantages as described by Hope (1968, p. 582).
Essentially, the technique is assumption free and can always be used when underlying distributions are unknown or when the necessary conditions for applying a test are not met. It may also be used when only vague alternative hypotheses exist and when only a vague definition of the test criteria can be given.

However, as identified by Besag and Newell (1991) and Marshall (1991), there are weaknesses in the method used by Openshaw et al. (1987). One such weakness is that there is no control for multiple testing, either locally or globally (Besag and Newell 1991, p. 148). The global aspect means that clusters may be produced by chance alone when the circle used is large. The local aspect is related to the problem that the change of radius and the shifts in location are not taken into account in the calculation of the significance levels. Secondly, it is very difficult to calculate the observed cases and to define the population at risk within the circular area given that the data are aggregated into irregular districts.

More recently, Besag and Newell (1991) proposed a method (hereafter referred to as the Besag method) that avoids some of these deficiencies. In the Besag method, under the null hypothesis $H_0$ it is assumed that the total observed cases in a circle (defined in the same way as in the Openshaw method) are located randomly among the population at risk ($p_M$) with the mean probability $P_{mean} = n/N$ ($n$ is the total number of observed cases and $N$ is the total population at risk). The probability of observing exactly $x$ cases among the population at risk can then be approximated by the Poisson term:

$$\frac{e^{-\lambda}\lambda^{x}}{x!} \quad \text{for } x = 1, 2, 3, \ldots \tag{1}$$

where $\lambda = P_{mean} \times p_M$. It follows that, for a prespecified value $k$, the probability of observing $k$ or more than $k$ cases among $p_M$ is

$$1 - \sum_{x=0}^{k-1} \frac{e^{-\lambda}\lambda^{x}}{x!} \tag{2}$$

FIG. 1. Openshaw et al.'s Procedure for Cluster Detection. Flowchart steps:
(1) Input the observed data (number of observations = n).
(2) Input the data of the population at risk (number of observations = N).
(3) For a significance test at level α, randomly sample (1/α) - 1 data sets from the population at risk, and make sure that each data set contains n observations.
(4) Set the minimum, maximum, and incremental values for the radii of the circles.
(5) Obtain the radius of a circle and construct a grid mesh so that the length of the side of a cell in the grid mesh is some fraction of the radius of the circle.
(6) Move the circle so that each time it is located on one of the intersections of the grid mesh consecutively.
(7) Compute a test statistic for the observed data within the circle.
(8) Compute a test statistic for each of the (1/α) - 1 sampled data sets within the circle.
(9) Use the Monte Carlo significance test to assess the significance.
(10) Store the circle and the location.
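The grid-and-circle search with a Monte Carlo test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' C implementation: the function names (`gam_scan`, `grid_coords`, `mc_significance_level`), the `grid_fraction` parameter, the use of the population's bounding box as the study area, and the strict-inequality treatment of ties are all hypothetical choices made for the example.

```python
import math
import random


def count_in_circle(points, cx, cy, r):
    """Count the points lying within distance r of (cx, cy)."""
    r2 = r * r
    return sum(1 for x, y in points if (x - cx) ** 2 + (y - cy) ** 2 <= r2)


def mc_significance_level(n_sims):
    """Significance level implied by n_sims simulated data sets: 1 / (n_sims + 1)."""
    return 1.0 / (n_sims + 1)


def grid_coords(lo, hi, step):
    """Evenly spaced coordinates lo, lo + step, ... that do not exceed hi."""
    count = int(math.floor((hi - lo) / step + 1e-9)) + 1
    return [lo + i * step for i in range(count)]


def gam_scan(cases, population, r_min, r_max, r_step, n_sims,
             grid_fraction=0.5, seed=0):
    """Return (cx, cy, r, observed) for every circle that passes the test.

    cases and population are lists of (x, y) tuples. A circle is kept only
    if its observed count is strictly larger than the count in every
    simulated data set (ties are treated as not significant: an assumption).
    """
    rng = random.Random(seed)
    # Each simulated data set has as many points as there are observed cases.
    sims = [rng.sample(population, len(cases)) for _ in range(n_sims)]
    xs = [p[0] for p in population]
    ys = [p[1] for p in population]
    kept = []
    r = r_min
    while r <= r_max + 1e-9:
        step = grid_fraction * r  # grid cell side is a fraction of the radius
        for cx in grid_coords(min(xs), max(xs), step):
            for cy in grid_coords(min(ys), max(ys), step):
                observed = count_in_circle(cases, cx, cy, r)
                if observed < 1:
                    continue  # minimum number of cases in a circle is one
                if all(observed > count_in_circle(s, cx, cy, r) for s in sims):
                    kept.append((cx, cy, r, observed))
        r += r_step
    return kept
```

With 19 simulated data sets, `mc_significance_level(19)` gives 0.05, which matches the correspondence between the number of simulations and the significance level described in the text.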

This formula is used to calculate the significance level for each potential cluster. In this method, a cluster is detected based on whether an observed case forms the center of a cluster of cases, by examining the number of nearest zones, $M$, given that a prespecified accumulated $k$ cases are observed in the $M$ zones. Suppose that at least one case is observed in zone $i = 0$, labeled $A_0$. In order to check if there is a cluster around $A_0$, all other zones are labeled $A_i$, $i = 1, 2, \ldots$, sequentially, the sequence depending on the distance between the centroid of a zone $i \neq 0$ and the centroid of zone $i = 0$. Let $x_i$ be the observed number of cases in zone $i$ and $y_i$ be the population at risk in zone $i$. The accumulated number of cases and the accumulated population at risk in the zones can be defined, respectively, as follows:

$$D_i = \sum_{j=0}^{i} x_j \tag{3}$$

$$p_i = \sum_{j=0}^{i} y_j \tag{4}$$

Let

$$M = \min\{i : D_i \geq k\} \tag{5}$$

where $k$ is a predetermined number of observed cases (for example, $k = 4$) and $M$ is defined in such a way that zones $A_0, \ldots, A_M$ contain at least $k$ cases. When the value of $M$ is small, it is indicative of a cluster around $A_0$. It should be noted that $D_i$ and $p_i$ as defined in (3) and (4) are slightly different from the definition in Besag and Newell (1991) in that no observed case is discounted. Formulas (3) and (4) are used in the experiment conducted in this paper because of the use of disaggregated data.

To understand the physical basis of this method, one has to appreciate that the observed cases and the population at risk are aggregated data and are georeferenced to the centroids of zones [census enumeration districts (EDs) or census tracts] distributed over the study area. If individual data were available, the method would be more subjective because of the lack of predefined zones. The method can also be criticized because the value of $k$ is chosen in an ad hoc manner, although the results for different values of $k$ can obviously be displayed to obviate this problem.
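A small Python sketch may help make the zone-accumulation procedure concrete. It is an illustration under assumptions rather than the code used in the paper: the tuple layout `(cx, cy, cases, population)` and the function names `poisson_tail` and `besag_newell` are hypothetical, and the accumulated counts follow the undiscounted definitions of $D_i$ and $p_i$ described above, with the upper-tail Poisson probability of $k$ or more cases as the significance level.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    term = math.exp(-lam)  # Poisson probability of x = 0
    cdf = 0.0
    for x in range(k):
        cdf += term
        term *= lam / (x + 1)  # step from the term for x to the term for x + 1
    return 1.0 - cdf


def besag_newell(zones, k):
    """For each zone with at least one case, find M and its significance.

    zones: list of (cx, cy, cases, population) tuples, one per zone centroid.
    Returns a list of (zone_index, M, significance) tuples.
    """
    n = sum(z[2] for z in zones)  # total observed cases
    N = sum(z[3] for z in zones)  # total population at risk
    p_mean = n / N
    results = []
    for idx, (cx, cy, cases, _) in enumerate(zones):
        if cases < 1:
            continue
        # Order all zones by centroid distance from the current zone (A_0 first).
        ordered = sorted(zones, key=lambda z: math.hypot(z[0] - cx, z[1] - cy))
        accumulated_cases = 0
        accumulated_pop = 0
        for m, zone in enumerate(ordered):
            accumulated_cases += zone[2]  # D_i
            accumulated_pop += zone[3]    # p_i
            if accumulated_cases >= k:    # M = min{i : D_i >= k}
                lam = p_mean * accumulated_pop
                results.append((idx, m, poisson_tail(k, lam)))
                break
    return results
```

For example, with three zones `[(0, 0, 4, 10), (1, 0, 0, 10), (5, 0, 1, 80)]` and `k = 4`, the first zone reaches four cases immediately (M = 0, a small expected count), whereas the third zone must absorb the whole study area before reaching four cases (M = 2), so its significance value is far larger.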
In each example presented by Openshaw et al. (1987) and Besag and Newell (1991), the data used are aggregated into census tracts or enumeration districts. The observed cases and population at risk are georeferenced to the centroids of these zones. There is an obvious fundamental problem for computing the test statistic and defining population at risk for any given circle when the Openshaw method is used (Besag and Newell 1991) because the computation is based on the census enumeration districts (EDs) whose centroids are within the circle. This apparently does not reflect the situation in reality. It would be more desirable to conduct the analysis using the true coordinates of the observed cases and the population at risk using disaggregated data.


3. AN ALTERNATIVE METHOD FOR DETECTING SPATIAL POINT CLUSTERS

A third method for detecting spatial point clusters is introduced in this paper. The procedure for the method is illustrated in Figure 2. It differs from the Openshaw method in two basic respects: the location and size of a circle are determined randomly within specified ranges, and the Poisson probability distribution is used directly for assessing significance. Let the total population at risk in the area of interest be $N$ and the total number of observed cases (with a particular attribute) be $n$; then the mean probability of observing a case in the entire area is

$$P_{mean} = \frac{n}{N} \tag{6}$$

For any circle whose location and radius are determined randomly, the number of cases ($x$) and the population at risk ($y$) within the circle can be obtained. The expected number of cases ($\lambda$) in the circle can then be determined as:

$$\lambda = P_{mean} \times y \tag{7}$$

The probability $P(x \mid \lambda)$ of observing exactly $x$ cases in the circle with expected cases $\lambda$ can then be determined using the Poisson distribution (Getis and Boots 1978, p. 19):

$$P(x \mid \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!} \tag{8}$$

Two methods can be used to assess the significance. In the first, $P(x \mid \lambda)$ in (8) is used directly as the measurement of significance. That is, if $P(x \mid \lambda) < \alpha$, where $\alpha$ is a prespecified level of significance, the radius and location of the circle generating $P(x \mid \lambda)$ are stored. The second method of significance testing that can be applied is that adopted in the Besag method described above. The only difference here is that the locations and radii of the circles are determined randomly, and $k$ is the number of observations with a particular attribute that lie within each circle. In what follows, the results of the two significance testing procedures are very similar, so only the results of the first one are reported. Hereafter, this third method is called the Fotheringham and Zhan method, or Fotheringham method for short.

4. RESEARCH DESIGN FOR COMPARING THE THREE METHODS

The methods described above for detecting point clusters were coded in C and linked with ARC/INFO 6.1, running on a SUN workstation. The purpose of this section is to discuss the experimental design for testing these methods in terms of the preparation of data and the choice of search parameters in the programs.

4.1 The Preparation of Data

In this study, the objects under investigation are houses in Amherst, New York. A master database consisting of 28,832 houses (observations) is constructed, stored, and managed using the ARC/INFO database. The data in the master database are derived from a data file containing information about the single-family detached dwellings in the City of Amherst, New York. These houses are geocoded using two-dimensional coordinates, and the locations of

FIG. 2. Fotheringham and Zhan's Procedure for Cluster Detection. Flowchart steps:
(1) Input the observed data (number of observations = n).
(2) Input the data of the population at risk (number of observations = N).
(3) Compute the mean probability: n / N.
(4) Set the minimum and maximum values of the radii of the circle.
(5) Randomly select the radius of a circle within the specified range of radius values.
(6) Randomly locate the circle in the area of interest.
(7) Compute the number of points from the observed data within the circle: x.
(8) Compute the number of population at risk within the circle.
(9) Compute the expected number of points in the circle using the mean probability and the population at risk in the circle.
(10) Compute the probability of observing x points in the circle using the Poisson distribution and the expected number of points.
(11) Store the circle and the location.
(12) Sufficient number of circles seeded? If not, return to step 5.
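The random-circle procedure in Figure 2 can be sketched as follows in Python. This is a hedged illustration, not the authors' C program: the function names (`poisson_pmf`, `random_circle_scan`), the uniform sampling of radius and centre within the population's bounding box, and the `min_cases` parameter are assumptions made for the example. The extra condition that the observed count exceeds the expected count reflects the statement elsewhere in the paper that circles are displayed only when more points are observed than would be expected.

```python
import math
import random


def poisson_pmf(x, lam):
    """Poisson probability of exactly x events given an expected count lam."""
    if lam <= 0.0:
        return 1.0 if x == 0 else 0.0
    # Work on the log scale so large counts do not overflow.
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def count_in_circle(points, cx, cy, r):
    """Count the points lying within distance r of (cx, cy)."""
    r2 = r * r
    return sum(1 for x, y in points if (x - cx) ** 2 + (y - cy) ** 2 <= r2)


def random_circle_scan(cases, population, r_min, r_max, n_circles, alpha,
                       min_cases=1, seed=0):
    """Seed n_circles random circles and keep those with P(x | lam) < alpha.

    cases and population are lists of (x, y) tuples; cases are assumed to be
    a subset of the population. Returns (cx, cy, r, x, lam, p) tuples.
    """
    rng = random.Random(seed)
    p_mean = len(cases) / len(population)  # mean probability n / N
    xs = [p[0] for p in population]
    ys = [p[1] for p in population]
    kept = []
    for _ in range(n_circles):
        r = rng.uniform(r_min, r_max)          # random radius within the range
        cx = rng.uniform(min(xs), max(xs))     # random location in the area
        cy = rng.uniform(min(ys), max(ys))
        x = count_in_circle(cases, cx, cy, r)
        if x < min_cases:
            continue
        y = count_in_circle(population, cx, cy, r)
        lam = p_mean * y                       # expected number of cases
        p = poisson_pmf(x, lam)
        if p < alpha and x > lam:              # only excesses are stored
            kept.append((cx, cy, r, x, lam, p))
    return kept
```

Because circles are placed at random rather than on a systematic grid, the number of circles to seed (`n_circles`) becomes a search parameter in its own right.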

all 28,832 houses are shown in Figure 3. In addition to the x, y coordinates, other attributes such as age, type of construction, quality, and price are available for every observation. The 28,832 houses can be regarded as the population at risk in the area, and houses with various attributes are drawn from this population. Polygons of census tracts covering Amherst are also created and added to the map, but are used solely for visualization.

Two data sets were drawn from the total population at risk. Data set 1 consists of houses whose overall construction quality is rated in the lowest category (1-5) and contains the 277 points mapped in Figure 4. Data set 2, illustrated in Figure 5, consists of two hundred randomly selected points from the master database. In Figures 4 and 5, various clusters seem to be present, although different clusters may be apparent to different people. It is also not clear whether a cluster is worthy of further investigation because it results from some clustering process or whether it is simply a reflection of the distribution of the underlying at-risk population. For example, Table 1 shows that standard tests such as the variance-to-mean ratio and nearest neighbor analysis indicate extremely strong clustering in both spatial distributions in Figures 4 and 5 (one of which is a random drawing). This is because the distribution of the at-risk population in each case is strongly clustered, but this is ignored in the calculation of the variance-to-mean and nearest neighbor statistics. These statistics cannot differentiate between distributions that are clustered because of the distribution of the underlying population and those that exhibit clusters that are geographically interesting.
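To see why these classical statistics ignore the at-risk population, it helps to look at how they are computed. The sketch below uses common textbook forms (a quadrat-count variance-to-mean ratio and the nearest neighbor ratio R with expected distance 1 / (2 * sqrt(density))); the paper does not state which exact variants were used, so the formulas, function names, and quadrat layout here are assumptions. Neither function takes the population at risk as an input.

```python
import math


def variance_mean_ratio(points, x0, y0, width, height, nx, ny):
    """Variance-to-mean ratio of point counts in an nx-by-ny quadrat grid.

    Values near 1 suggest a random pattern, values well above 1 clustering.
    """
    counts = [0] * (nx * ny)
    for x, y in points:
        col = min(int((x - x0) / (width / nx)), nx - 1)
        row = min(int((y - y0) / (height / ny)), ny - 1)
        counts[row * nx + col] += 1
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return variance / mean


def nearest_neighbor_ratio(points, area):
    """Ratio R of observed to expected mean nearest neighbor distance.

    R below 1 suggests clustering, R above 1 a more regular pattern.
    """
    n = len(points)
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(math.hypot(xi - xj, yi - yj)
                     for j, (xj, yj) in enumerate(points) if j != i)
    observed = total / n
    expected = 0.5 / math.sqrt(n / area)  # expectation under spatial randomness
    return observed / expected
```

Both statistics compare the pattern only with a uniform random benchmark over the study area, so a random sample from a clustered population will still register as clustered.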
Conversely, there might well be points that do not appear to merit investigation when examined without exogenous knowledge, and only appear as significant clusters when compared to the at-risk population. The three methods described above are designed to remove these problems by using information on the at-risk population to automate the identification of spatial clusters that warrant further geographic investigation.

4.2 The Choice of Search Parameters

Before each program is run, a number of search parameters must be decided. One parameter that is used in all three methods is the minimum number of observed cases to be considered in a circle. This parameter is set to one (1) for the Fotheringham and Openshaw methods, which means that significance assessment is conducted as long as there is at least one case within a given circle at a particular location. Because the Besag method directly uses a prespecified number ($k$), the minimum value of $k$ is set to two (2), following the experiment conducted by Besag and Newell (1991). Other search parameters to be set are the minimum and maximum radii of the circles. After a number of trials, the minimum radius is set to 600 feet and the maximum radius to 2,100 feet. The study area of Amherst extends 35,704 feet in the east-west direction and 49,866 feet in the north-south direction. It should be pointed out here that the range of the radii is data dependent and only becomes clear after several trials. Circles that are too small will not detect the extent of large clusters and may miss clusters altogether, while circles that are too large risk hiding variations at smaller scales. This is one of the reasons why exploratory data analysis is important, and the optimal choice of the parameters remains subject to further investigation.
For the Openshaw and the Besag methods, one other parameter, the increment of the radius, has to be determined and based on the results of a number of trials, it is here set to 76.2 meters (250 feet). Various tests of sensitivity of each of the three methods are employed. For

the Openshaw method, the number of simulated sample data sets is chosen as 19, 49, 99, 199, and 499, which are equivalent to significance levels of 0.05, 0.02, 0.01, 0.005, and 0.002, respectively. A significance level of 0.001 is not used for the Openshaw method because of the computer time required to investigate 999 simulated samples. The Besag method is sensitive to the value of $k$ and six values are reported ($k$ = 2, 3, 4, 5, 6, and 7), all at the 0.05 significance level. In the Fotheringham method, significance levels are set to 0.05, 0.02, 0.01, 0.005, 0.002, and 0.001, and clusters are displayed at each of the levels. Because data set 2 is a sample drawn randomly from the at-risk population and hence is subject to sampling variation, ten such samples are drawn and

FIG. 3. Houses in the Study Area (At-Risk Population): The Master Database

FIG. 4. Test Data Set 1: Lowest Construction Quality Category (1-5)

FIG. 5. Test Data Set 2: Randomly Selected Points from the At-Risk Population

results are reported for the average of all ten. For the Openshaw and the Besag tests, 29,218 circles were seeded for each random sample at each significance level and the proportion of circles displayed (that is, those containing significant clusters of points) is calculated for each significance level in the case of the


Openshaw method and for each value of $k$ in the case of the Besag method. For the Fotheringham test, five thousand circles were seeded and the proportion of circles displayed is calculated for each significance level. The Fotheringham method uses a random placement of circles and hence fewer circles need to be seeded.

TABLE 1
Classical Point-Pattern Analysis Results for Data Sets 1 and 2
Columns: Variance-mean ratio; t value; R (nearest neighbor)
Rows: Data Set 1 (Figure 4); Data Set 2 (Figure 5)
NOTE: * Significantly different from 1.0 at the 99 percent confidence level.

5. RESULTS

5.1 Visualization

In order to demonstrate the relative performance of the three techniques of automatic cluster detection, the results (all retained circles and locations) were displayed using the ARC/INFO GIS and are reported in Figures 6 and 7. The two figures refer to the data displayed in Figures 4 and 5, respectively, and each is composed of three sets of circles derived from the Openshaw, Besag, and Fotheringham methods. Every circle represents a significant cluster of points using the significance testing procedures described above. Figures 6a-c contain results using data set 1, defined as houses with the lowest construction quality ranking, and Figures 7a-c show the results from the two hundred points in data set 2, which are drawn randomly from the master

FIG. 6a. Detecting Spatial Point Clusters Using the Openshaw Method: Actual Point Pattern (panels by significance level, plus the observed point pattern)

FIG. 6b. Detecting Spatial Point Clusters Using the Besag and Newell Method: Actual Point Pattern (panels for k = 2, 3, 4, 5, 6, and 7, plus the observed point pattern)

FIG. 6c. Detecting Spatial Point Clusters Using the Fotheringham and Zhan Method: Actual Point Pattern (panels by significance level)

database. Both sets of figures contain a separate window displaying the point pattern on which the results are based.

Figure 6a shows the results of applying Openshaw's method to data set 1. It is clear that the technique identifies a large number of clusters, especially at traditional significance levels (0.05 and 0.01) where almost every point appears as a

FIG. 7a. Detecting Spatial Point Clusters Using the Openshaw Method: Random Point Pattern (panels by significance level, plus the observed point pattern)

FIG. 7b. Detecting Spatial Point Clusters Using the Besag and Newell Method: Random Point Pattern (panels for k = 2, 3, 4, 5, 6, and 7, plus the observed point pattern)

significant cluster. Even at significance levels as extreme as 0.002, the technique identifies large numbers of clusters. The Besag method, the results of which are shown in Figure 6b, is much more conservative, although the results are clearly dependent on $k$, a predefined number of points within a circle. Above $k = 4$, the technique picks out only the clusters of points in the southern part of the


FIG. 7c. Detecting Spatial Point Clusters Using the Fotheringham and Zhan Method: Random Point Pattern (panels by significance level, plus the observed point pattern)

map and only two areas of the map appear to have interesting clusters. The Fotheringham methodology appears marginally more selective than Openshaw's at more extreme levels of significance but essentially depicts similar results. One general finding is that the techniques would appear to be more useful when used with an extreme significance level, so that a limited number of significant clusters is identified. At less extreme values, the techniques possibly identify too many clusters to be useful as an exploratory tool.

To place the above results in perspective, each of the three methods is applied to a set of points randomly drawn from the at-risk population and these results are shown in Figures 7a-c. It is important to emphasize that the point pattern in this data set does not appear random because the distribution reflects the distribution of the at-risk population from which the sample is drawn. Given that the population is nonrandomly located in space, the sample is similarly spatially nonrandom. A logical test of each method is therefore to see whether it can separate a visual cluster from a geographically interesting one, the latter being a set of points that is significantly more clustered than the distribution of the underlying at-risk population would suggest.
The Openshaw technique performs slightly less satisfactorily in this regard: clusters appear even at extreme significance levels. It is more difficult to evaluate the Besag technique because, although clusters are identified at all values of $k$, the significance level is 0.05. The Fotheringham technique identifies relatively fewer clusters and identifies none at a significance level above These results are encouraging because if, in the random samples, geographically interesting clusters can be separated from clusters that result merely from the distribution of the underlying population, clusters that are identified in the nonrandom distributions can be treated as geographically interesting. For instance, the results of the Fotheringham method at significance levels and with the random data suggest that the clusters identified from this method in data set 1 are of geographic interest in that they probably arise


from a spatial process and not from variations in the underlying population density.

TABLE 2
Performance Indicators of the Three Techniques on Ten Random Samples
Columns: Significance level or k value; Number of circles seeded; Average number of circles displayed; Average proportion of circles displayed
Panels: a. Openshaw et al. method; b. Besag and Newell method (significance level: 0.05); c. Fotheringham and Zhan method

5.2 A Further Test Based on Random Distributions

A further test of the three techniques is undertaken by examining the performance of each technique on ten different random drawings from the at-risk population. These results are summarized in Table 2, where each of the ten distributions consists of two hundred randomly drawn points. Table 2a contains the results of the Openshaw technique applied to each of the ten distributions. At each significance level, 29,218 circles are seeded and the average number of these circles that are displayed (and hence contain a significant cluster of points) is given in column 3. These average frequencies are converted to average proportions in column 4. Given that the distributions are random drawings from the at-risk population, a comparison of these proportions across the different techniques yields some insights into the probability of each technique identifying false positives (although it says nothing about failure to identify real positives). Unfortunately, the Besag results in Table 2b depend on the value of $k$, the minimum number of points within a circle, and so are not directly comparable. The results for the Fotheringham technique, shown in Table 2c, result from only five thousand seeded circles at each significance level because in this technique the circles are seeded randomly, whereas in the other two techniques the circles are uniformly placed over the study area.
The results for all three methods are encouraging in that the average proportion of circles displayed is always less than half the significance level (circles are displayed only when a significantly larger number of points is observed than would be expected). The Besag procedure is particularly impressive when it is noted that the proportions are all calculated at a significance level of 0.05. The results again suggest that the circles identified at extreme significance levels in


Figure 6a-c would therefore seem to represent the outcomes of some interesting geographic processes. It is useful to note that the Fotheringham method appears to be less sensitive than the other two methods at low levels of significance but is more sensitive at higher levels of significance. This suggests that the simpler procedure of randomly assigning circles (the Fotheringham method) works just as well as comprehensively covering the study area, and may in fact be more selective when extreme significance levels are used.

5.3 Sensitivity to Circle Definition

All three methods of point pattern analysis depend upon a definition of circle size. The above results, for instance, are for circles that have a radius between and meters. In order to examine the potential sensitivity of the results to this definition, some other ranges were selected and the methodology described in section 5.1 was repeated. The results for one significance level, corresponding to data set one, are shown in Figures 8a-c. Each technique has a similar sensitivity to circle definition in that, as the circles increase in size, the circles in which significant clusters occur increasingly overlap and give an exaggerated appearance to a cluster. That is, regardless of the effect on statistical detection, varying the size of the circles used affects the perception of the results. Given that all three techniques are intended to be used in an exploratory mode, this perceptual sensitivity needs more attention. It could be argued that an advantage of exploratory techniques is that analyses can be undertaken under many different conditions, and in this case maps can be reported with different circle ranges.

6. CONCLUSIONS

The increasing prevalence of GIS technology and the concomitant access to disaggregate spatial data sets will lead to a greater demand for automated cluster detection techniques.
Such techniques have obvious applications in the investigation of the incidence of certain types of disease, but they can also be applied to a host of subjects in the social and environmental sciences. Openshaw et al. (1987) popularized the automation and visualization of cluster detection through a randomized version of quadrat analysis, although earlier work such as that by Hudson (1969) predates this by about two decades. Openshaw et al.'s work not only has the advantage of producing a visual output showing the locations of significant clusters of points, but it also overcomes the disadvantage of standard quadrat analysis by allowing the units in which occurrences are counted to be of random size. Besag and Newell (1991) present an alternative methodology for the automated detection of point clusters, and a third approach is provided in this paper. A comparison of the three techniques is presented based on a disaggregate housing data set in which all 28,832 points in the at-risk population have been geocoded. The Openshaw et al. and Besag and Newell techniques have previously been applied only to point patterns aggregated to larger spatial units and, to our knowledge, this is the first application of all three techniques to a disaggregate data set. Testing the techniques with point patterns randomly drawn from a known spatial distribution provides encouraging results in that they appear to perform well in separating geographically interesting clusters from those that result merely from the distribution of the underlying population. The Besag and Newell method appears to be particularly good at not identifying false positives, although the Fotheringham and Zhan technique is easier to apply and is not dependent on a definition of minimum cluster size. Finally, the results demonstrate that there are still perceptual issues concerning exploratory graphics that need to be resolved.

The techniques evaluated in this paper are potentially very useful in identifying clusters of points that warrant further investigation. Although each relies upon a statistical procedure to determine whether the number of points within a random circle is significant, the overall result in each case is a visual representation of the set of circles in which significant clusters are found. This raises perceptual questions, not addressed here, about the way such information should be presented.

FIG. 8a. Detecting Spatial Point Clusters Using the Openshaw Method: The Effect of Circle Size (panels: observed point pattern; 183 m ≤ R < 259 m; 411 m ≤ R < 564 m)

FIG. 8b. Detecting Spatial Point Clusters Using the Besag and Newell Method: The Effect of Circle Size (panels: observed point pattern; 183 m ≤ R < 259 m; 259 m ≤ R < 411 m; 411 m ≤ R < 564 m; 564 m ≤ R)

FIG. 8c. Detecting Spatial Point Clusters Using the Fotheringham and Zhan Method: The Effect of Circle Size (panels: observed point pattern; 183 m ≤ R < 259 m; 259 m ≤ R < 411 m; 411 m ≤ R < 564 m; 564 m ≤ R)

LITERATURE CITED

Besag, J., and P. J. Diggle (1977). Simple Monte Carlo Tests for Spatial Pattern.
Applied Statistics 26 (3).
Besag, J., and J. Newell (1991). The Detection of Clusters in Rare Diseases. Journal of the Royal Statistical Society, A 154, Part 1.
Boots, B. N., and A. Getis (1988). Point Pattern Analysis. The Publishers of Professional Social Science. Newbury Park: Sage Publications.
Clayton, D., and J. Kaldor (1987). Empirical Bayes Estimates of Age-standardized Relative Risks for Use in Disease Mapping. Biometrics 43.
Cuzick, J., and R. Edwards (1990). Spatial Clustering for Inhomogeneous Populations (with discussion). Journal of the Royal Statistical Society, B 52.
Diggle, P. J. (1983). Statistical Analysis of Spatial Point Patterns. New York: Academic Press.
Doll, R. (1989). The Epidemiology of Childhood Leukaemia. Journal of the Royal Statistical Society, A 152.
Gardner, M. J. (1989). Review of Reported Increases of Childhood Cancer Rates in the Vicinity of Nuclear Installations in the UK. Journal of the Royal Statistical Society, A 152.
Getis, A., and B. Boots (1978). Models of Spatial Processes: An Approach to the Study of Point, Line, and Area Patterns. Cambridge, London: Cambridge University Press.
Hope, A. C. A. (1968). A Simplified Monte Carlo Significance Test Procedure. Journal of the Royal Statistical Society, B 30 (3).
Hills, M., and F. Alexander (1989). Statistical Method Used in Assessing the Risk of Disease near a Source of Environmental Pollution: A Review. Journal of the Royal Statistical Society, A 152.
Hudson, J. C. (1969). Pattern Recognition in Empirical Map Analysis. Journal of Regional Science 9 (2).
Marshall, R. J. (1991). A Review of Methods for the Statistical Analysis of Spatial Patterns of Disease. Journal of the Royal Statistical Society, A 154, Part 3.
Openshaw, S., M. E. Charlton, C. Wymer, and A. W. Craft (1987). A Mark 1 Geographical Analysis Machine for the Automated Analysis of Point Data Sets. International Journal of Geographical Information Systems 1 (4), 359-77.
Ripley, B. D. (1981). Spatial Statistics. New York: John Wiley.
Stone, R. A. (1988). Investigations of Excess Environmental Risks around Putative Sources: Statistical Problems and a Proposed Test. Statistical Methods 7.
Upton, G. J. G., and B. Fingleton (1985). Spatial Data Analysis by Example, Vol. 1, Point Pattern and Quantitative Data. Chichester: John Wiley and Sons.
Wheldon, T. E. (1989). The Assessment of Risk of Radiation-Induced Childhood Leukaemia in the Vicinity of Nuclear Installations. Journal of the Royal Statistical Society, A 152.
