Measuring Association between Spatially Defined Variables: An Alternative Procedure


Lawrence J. Hubert, Reginald G. Golledge, C. Michael Costanzo, and Nathan Gale

INTRODUCTION

An earlier paper (Hubert and Golledge 1982) developed a measure of association between two spatially defined variables, generalizing some earlier work of Tjøstheim (1978). As a very brief review, there are two variables, F and G, with n realizations of each. The values of F are denoted by $f_1, \ldots, f_n$ and are observed in locations $o_1, \ldots, o_n$; those on G are denoted by $g_1, \ldots, g_n$ and are observed in locations $p_1, \ldots, p_n$. In the applications we discuss, the sets of unordered locations $\{o_1, \ldots, o_n\}$ and $\{p_1, \ldots, p_n\}$ are exactly the same, though the indexing of the locations may differ for the two variables. Typically, F and G are commensurable, with the observations being ranks or possibly z-scores generated for each separate variable.

Throughout the paper, our major concern will be with the measurement of association in a spatial sense, as distinct from what may be best termed point-to-point association. The latter is usually approached by some well-known index of correlation, such as those attributed to Pearson, Spearman, or Kendall, and refers to the correspondence between F and G over identical spatial locations. Spatial association, on the other hand, deals with the degree to which the values on the two variables are reflected by locations that are spatially close or distant. Note that this distinction between point-to-point and spatial association does not preclude a single spatial process being responsible for producing both. Rather, our overall interest is in a generic notion of association between two spatially distributed variables that might be reflected as a point-to-point correspondence and/or as one that is best characterized in terms of the similarity of the values on two variables observed in different locations, with weights proportional to their spatial separation.
Somewhat more formally, to obtain a measure of association between the variables F and G in relation to the available spatial information, two n × n matrices are defined: $\{d_{ij}\}$, where $d_{ij}$ is some spatial separation measure between $o_i$ and $p_j$, and $\{c_{ij}\}$, where $c_{ij} = h(f_i, g_j)$ for some particular bivariate function $h$.

This project was funded by the National Institute of Justice Grant 82-IJ-CX4019. Lawrence J. Hubert is professor of education, Reginald G. Golledge is professor of geography, and C. Michael Costanzo and Nathan Gale are doctoral students in geography, University of California, Santa Barbara.

Geographical Analysis, Vol. 17, No. 1 (Jan. 1985). Ohio State University Press. Submitted 5/83. Revised version accepted 1/84.

Both $d_{ij}$ and $c_{ij}$ characterize two different notions of similarity between the spatial locations $o_i$ and $p_j$. The former is based on spatial separation and the latter on the observed values of the variables through the chosen function $h(\cdot,\cdot)$, for example, $h(f_i, g_j) = |f_i - g_j|$. For us, the task of assessing association reduces to comparing the correspondence between the two matrices $\{d_{ij}\}$ and $\{c_{ij}\}$ through a raw cross-product statistic $\sum_{i,j} c_{ij} d_{ij}$. For example, suppose F and G denote the same variable but measured at times t and t + 1, respectively. If $h(f_i, g_j)$ is specified as, say, $|f_i - g_j|$, then the raw cross-product measure should be relatively extreme in size if the common variable reflects a process that is partially determined spatially (e.g., see the various models described by Cliff and Ord 1981).

If $\rho$ is some permutation on the first n integers, a general index of the form

$$A(\rho) = \sum_{i,j} c_{ij}\, d_{i\rho(j)}$$

can be defined; the observed index of association is then obtained through the identity permutation $\rho_I$:

$$A(\rho_I) = \sum_{i,j} c_{ij}\, d_{ij}.$$

Significance testing proceeds by assuming that all n! possible permutations for the index $A(\rho)$ are equally likely and by comparing $A(\rho_I)$ to the resulting distribution. This is equivalent to assuming that the observations on F are fixed in their respective locations and all permutations of $g_1, \ldots, g_n$ among their corresponding n locations are equally likely. Significance levels can be obtained by Monte Carlo sampling or possibly by a Pearson Type III approximation based on the mean and variance of $A(\rho)$ given in Hubert and Golledge (1982) plus the skewness parameter $\gamma$, which is expressed in terms of doubly centered matrix entries (each entry minus its row mean and its column mean, plus the grand mean, with the means taken as $(1/n)\sum$ over a row or column and $(1/n^2)\sum_{i,j}$ over the whole matrix). This process of fitting a Type III approximation is discussed by Mielke, Berry, and Brier (1981) and Costanzo, Hubert, and Golledge (1983). Normalized indexes that correspond to the raw index $A(\rho)$ are discussed by Hubert and Golledge (1982).
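As a minimal sketch of the index and its reference distribution, the following Python fragment enumerates all n! permutations for a very small hypothetical data set. The coordinates, the ranks, and the names `index_a`, `upper_tail`, and `formula_mean` are assumptions made only for illustration, as is the choice of the upper tail as the "extreme" direction. The mean check uses the elementary fact that each $\rho(j)$ is uniform on the n locations, so that $E[A(\rho)] = (1/n)\sum_i (\sum_j c_{ij})(\sum_k d_{ik})$; it is not meant to reproduce the moment formulas of Hubert and Golledge (1982).

```python
from itertools import permutations

def index_a(c, d, perm):
    """A(rho) = sum over i, j of c[i][j] * d[i][perm[j]]."""
    n = len(c)
    return sum(c[i][j] * d[i][perm[j]] for i in range(n) for j in range(n))

# Hypothetical toy data: n = 5 locations on a line, with ranks for F and G.
x = [0, 1, 2, 3, 4]   # assumed one-dimensional coordinates
f = [1, 2, 3, 4, 5]   # ranks on F
g = [2, 1, 3, 5, 4]   # ranks on G
n = len(f)

d = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]  # spatial separation
c = [[abs(f[i] - g[j]) for j in range(n)] for i in range(n)]  # h(f_i, g_j) = |f_i - g_j|

# Observed index: the identity permutation.
observed = index_a(c, d, tuple(range(n)))

# Exact reference distribution: all n! permutations taken as equally likely.
reference = [index_a(c, d, p) for p in permutations(range(n))]

# Proportion of permutations giving an index at least as large as the observed one.
upper_tail = sum(v >= observed for v in reference) / len(reference)

# Mean of the reference distribution, by enumeration and by the closed form.
exact_mean = sum(reference) / len(reference)
formula_mean = sum(sum(c[i]) * sum(d[i]) for i in range(n)) / n

print(observed, upper_tail, exact_mean, formula_mean)
```

Complete enumeration is feasible only for very small n; for data sets of realistic size, the Monte Carlo or Type III routes mentioned in the text would be needed.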


From a broader historical perspective, the reader may also wish to consult Bachi (1962), whose descriptive work on measuring the spatial association between two variables predates that of Tjøstheim (1978). Bachi was concerned, in part, with evaluating a one-to-one correspondence between two spatial sets using the concept of standard distance; that is, the function $h(\cdot,\cdot)$ is (0-1) dichotomous and defines a pairing of the locations in $\{o_1, \ldots, o_n\}$ and $\{p_1, \ldots, p_n\}$, and the measure of spatial separation between two locations is their squared Euclidean distance apart. Although no inference model was proposed by Bachi, he did offer several tentative suggestions at a descriptive level for generalizing the (0-1) mapping to one that uses more complete information from the two variables. This earlier work can now be incorporated within our framework through the use of the cross-product measure between the two matrices $\{c_{ij}\}$ and $\{d_{ij}\}$. Reference might also be made to Tobler (1965), who incorporates Bachi's ideas in discussing the correspondence between two geographical patterns, and to Glick (1982), who evaluates Tjøstheim's original proposal through a small Monte Carlo simulation. Finally, for a general discussion of the importance of assessing association between two spatially distributed variables, we refer the reader to the classic empirical work of McCarty and Salisbury (1962).

AN ALTERNATIVE INFERENCE STRATEGY FOR ASSESSING THE ASSOCIATION BETWEEN TWO VARIABLES

Although the generalized Tjøstheim approach to measuring association between variables is a natural strategy to follow and one that is capable of a number of variations, it has at least one anomaly because the index $A(\rho)$ does not take advantage of the fact that the row and column objects are the same. To be more explicit, suppose we consider Tjøstheim's original index $\sum_{i,j} c_{ij} d_{ij}$, where $c_{ij} = 1$ if rank $f_i$ = rank $g_j$, and 0 otherwise.
In constructing the reference distribution against which to compare an observed index, different permutations lead to a range of values for the index $A(\rho)$. The ranks for variable F are fixed in given spatial locations and those for G are redistributed according to the particular permutation being considered. Each permutation of the ranks on G results in some pairing of integers at each of the n spatial locations; therefore, in addition to $A(\rho)$, we could compute a Spearman rank order correlation measure for the given pairing. In general, there is a nonzero covariance between these two statistics, suggesting that the inference model for Tjøstheim's measure is contaminated by an association statistic that is aspatial in design. Following Fraser (1956), we could, in fact, go further and provide an explicit formula for this covariance for any arbitrary spatial separation matrix $\{d_{ij}\}$. The important point, however, is that the covariance is typically nonzero, and, thus, we do not need to go into the details.

An alternative way of examining the difficulty of assessing the significance of the Tjøstheim index relies on emphasizing the conceptual distinction between two types of correspondence that may be present for spatially distributed variables. As mentioned earlier, one of these is a point-to-point notion that does not depend on the locational arrangement of the observations, but only on the relationship within the pairs $(f_i, g_i)$. This latter relationship may or may not be due to some spatial process, and there is no way of making this determination by relying solely on the point-to-point framework. The other notion, spatial association, depends on the relationship between distinct pairs treated as indivisible units. In this case, the relationship can be attributed unambiguously to spatial process.
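The nonzero covariance can be seen numerically in a small case. The sketch below is only an illustration under assumed data (five locations on a line, with the ranks on F increasing along the line); the function names are hypothetical, and the explicit covariance formula that would follow from Fraser (1956) is not reproduced. The ranks on G are permuted over all 5! arrangements while F stays fixed, and the Tjøstheim-type raw index and Spearman's coefficient are computed for each arrangement.

```python
from itertools import permutations
from math import dist

def spearman(rf, rg):
    """Spearman rank correlation for untied ranks 1..n."""
    n = len(rf)
    d2 = sum((a - b) ** 2 for a, b in zip(rf, rg))
    return 1 - 6 * d2 / (n * (n * n - 1))

def tjostheim_raw(rf, rg, coords):
    """Raw index with c_ij = 1 if rank f_i == rank g_j: the summed distance
    between the location holding rank k on F and the one holding it on G."""
    where_f = {rank: i for i, rank in enumerate(rf)}
    where_g = {rank: i for i, rank in enumerate(rg)}
    return sum(dist(coords[where_f[k]], coords[where_g[k]]) for k in where_f)

def covariance_under_g_permutations(rf, coords):
    """Covariance of the two statistics when only the ranks on G are permuted."""
    n = len(rf)
    pairs = [(tjostheim_raw(rf, rg, coords), spearman(rf, rg))
             for rg in permutations(range(1, n + 1))]
    m = len(pairs)
    mean_a = sum(a for a, _ in pairs) / m
    mean_s = sum(s for _, s in pairs) / m
    return sum((a - mean_a) * (s - mean_s) for a, s in pairs) / m

# Hypothetical layout: five locations on a line, F ranks increasing along it.
coords = [(float(k), 0.0) for k in range(5)]
rf = [1, 2, 3, 4, 5]
cov = covariance_under_g_permutations(rf, coords)
print(cov)  # negative here: a small summed distance goes with a high Spearman value
```

In this particular layout the raw index reduces to the sum of absolute rank displacements, which is why it moves so closely (and inversely) with Spearman's coefficient.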
It may be true (and in most geographically relevant instances we would expect it to be true) that some spatial process is responsible for both point-to-point and spatial association; thus, reasonable measures of each should be positively related over most data sets. In fact, given two such measures, we might even wish to aggregate them into one final measure of association. Nevertheless, spatial association, as we have characterized it, is conceptually distinct from a point-to-point relationship because the latter cannot be attributed unambiguously to some spatially determined process. Consequently, it would be convenient to have a strategy for evaluating the relative size of spatial association per se, conditional on the fixed level of point-to-point association that is present in the data and without any commitment to the point-to-point relationship itself being spatially determined.

Staying within the framework we have set out, there does not seem to be any obvious solution to the problem. For example, given the symmetry in the use of variables F and G, fixing the second variable G and permuting the values of F does not help. Nor does a model discussed by Klauber (1971), in which both sets of values on F and G would be permuted at random among the n locations, offer a solution. Taking a somewhat different tack, however, there is a way of resolving this difficulty as long as the realizations on the two variables F and G are not all identical. What is required is to treat the association task as a special case of the spatial autocorrelation index discussed by Hubert, Golledge, and Costanzo (1981). In our notation, this would take the form

$$\Gamma(\rho) = \sum_{i,j} c_{ij}\, d_{\rho(i)\rho(j)}.$$

The observed statistics $A(\rho_I)$ and $\Gamma(\rho_I)$ are numerically the same (assuming zero diagonals for $\{d_{ij}\}$ and $\{c_{ij}\}$), but we would use $\Gamma(\rho)$ to develop the appropriate reference distribution. Because the inference model permutes rows and columns simultaneously, all permutations would lead to varying values of the association statistic but to exactly the same value of Spearman's index or, for that matter, of any of the usual correlation coefficients, for example, Kendall's tau.
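The invariance just described can be checked directly on a toy example. The following sketch assumes four locations on a line and hypothetical ranks; `gamma` implements the simultaneous row and column permutation with diagonal terms omitted, and each permutation is read as moving the intact pair $(f_i, g_i)$ to location `perm[i]`. It is meant only as an illustration of the argument, not as the authors' computational procedure.

```python
from itertools import permutations

def spearman(rf, rg):
    """Spearman rank correlation for untied ranks 1..n."""
    n = len(rf)
    d2 = sum((a - b) ** 2 for a, b in zip(rf, rg))
    return 1 - 6 * d2 / (n * (n * n - 1))

def gamma(c, d, perm):
    """Sum over i != j of c[i][j] * d[perm[i]][perm[j]] (rows and columns move together)."""
    n = len(c)
    return sum(c[i][j] * d[perm[i]][perm[j]]
               for i in range(n) for j in range(n) if i != j)

# Hypothetical toy data: four locations on a line.
x = [0, 1, 2, 3]
f = [1, 2, 3, 4]
g = [2, 1, 4, 3]
n = len(f)
d = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]
c = [[abs(f[i] - g[j]) for j in range(n)] for i in range(n)]

gammas, rhos = [], []
for perm in permutations(range(n)):
    # Relocate each pair (f_i, g_i) as an indivisible unit to location perm[i].
    f_at, g_at = [0] * n, [0] * n
    for i, loc in enumerate(perm):
        f_at[loc], g_at[loc] = f[i], g[i]
    gammas.append(gamma(c, d, perm))
    rhos.append(spearman(f_at, g_at))

print(sorted(set(gammas)))  # the spatial index varies over relocations
print(set(rhos))            # Spearman's coefficient takes a single value
```

Because the pairs never separate, any statistic computed within pairs is constant over the reference distribution, which is the sense in which the test is conditional on the observed point-to-point association.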
Pairs of observations on F and G, $(f_i, g_i)$, are permuted as indivisible units among the n locations; that is, conditional on the observed pairs of observations, the null assumption is that the n pairs represent n independent observations on a common underlying bivariate distribution; the alternative hypothesis is one of nonindependence. Viewed slightly differently, suppose the spatial separation matrix $\{d_{ij}\}$ is symmetric and $c_{ij} = |f_i - g_j|$, $i \neq j$. Then, without loss of generality, each off-diagonal value $c_{ij}$ could be redefined as

$$c_{ij} = \frac{|f_i - g_j| + |f_j - g_i|}{2}, \qquad (1)$$

and $\Gamma(\rho)$ is a generalized spatial autocorrelation statistic designed to assess the correspondence between two square symmetric matrices, and, in turn, the degree to which the association between F and G can be explained by their spatial realizations. In short, a seemingly minor change in the usual measure of spatial autocorrelation for a single variable leads to a strategy for assessing the association between two variables in a spatial sense. Compared to Tjøstheim's alternative, the inference strategy based on $\Gamma$ does not have the same contamination with an aspatial measure. Practical significance testing procedures for $\Gamma(\rho_I)$ would be the same as for $A(\rho_I)$, for example, Monte Carlo methods or Type III approximations based on the first three moments of $\Gamma(\rho)$ given, say, by Mielke (1979).

EXAMPLE

As an illustration of the difference between a measure of spatial association and a traditional correlation coefficient based on point-to-point correspondence, consider the data displayed in Figures 1a and 1b. Using Tobler's unclassed choropleth mapping procedure (see Tobler 1973; Gale and Halperin 1982), these figures portray the rank order of per capita residential burglary for 1980 and 1981, respectively, over the sixty-four police reporting districts used by the city of Santa Barbara, California. Figure 1c merely indicates the arbitrary numbering of the sixty-four districts for the actual data given in Table 1.

FIG. 1a (bottom). Rank Order of Per Capita Residential Burglary, 1980. FIG. 1b (top). Rank Order of Per Capita Residential Burglary, 1981. FIG. 1c. City of Santa Barbara Police Reporting Districts.

TABLE 1. Columns: Reporting District, 1980, 1981.

Spearman's rank order correlation between these two sets is quite high. At 0.66, it is significant at less than the 0.01 level based on the usual normal approximation discussed in Kendall (1970). In addition to a significant Spearman correlation across the two data sets, Figures 1a and 1b also display a significant amount of spatial association using the average of the rank differences as given in (1) and the Euclidean distances between district centroids. In particular, the raw index $\Gamma$ has a value of […] × 10^5, with E($\Gamma$) = […] × 10^5, V($\Gamma$) = […] × 10^6, and a skewness parameter of […]. Thus, based on either a Type III approximation or the Monte Carlo distribution of Table 2 for a random sample size of 99, this latter raw index would be assigned similar significance levels of less than 0.01 (see Costanzo et al. 1983). Finally, as a more convenient normalized index, we might note that the Pearson product-moment correlation between the off-diagonal entries of the distance matrix and the matrix of rank differences from (1) is 0.20.
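The computations reported for this example can be sketched in a few lines, here on hypothetical data since the Table 1 values are not available. The sketch assumes a 3 × 3 grid of centroids, made-up ranks, the averaged rank differences as the matrix of $c_{ij}$ values, and a Monte Carlo sample of 99 relocations of intact pairs, with the observed arrangement counted as the hundredth member of the reference set (a common convention, assumed here rather than taken from the paper). All function names are illustrative.

```python
import random
from math import dist, sqrt

def averaged_differences(f, g):
    """c[i][j] = (|f_i - g_j| + |f_j - g_i|) / 2 off the diagonal, 0 on it."""
    n = len(f)
    return [[0.0 if i == j else (abs(f[i] - g[j]) + abs(f[j] - g[i])) / 2
             for j in range(n)] for i in range(n)]

def gamma(c, d, perm):
    """Sum over i != j of c[i][j] * d[perm[i]][perm[j]]."""
    n = len(c)
    return sum(c[i][j] * d[perm[i]][perm[j]]
               for i in range(n) for j in range(n) if i != j)

def normalized_index(c, d):
    """Pearson correlation between the off-diagonal entries of c and d."""
    n = len(c)
    xs = [c[i][j] for i in range(n) for j in range(n) if i != j]
    ys = [d[i][j] for i in range(n) for j in range(n) if i != j]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

def monte_carlo_p(c, d, n_samples=99, seed=0):
    """Upper-tail Monte Carlo significance level for the observed index."""
    rng = random.Random(seed)
    identity = list(range(len(c)))
    observed = gamma(c, d, identity)
    count = 1  # the observed arrangement is counted in the reference set
    for _ in range(n_samples):
        perm = identity[:]
        rng.shuffle(perm)
        if gamma(c, d, perm) >= observed:
            count += 1
    return observed, count / (n_samples + 1)

# Hypothetical data: nine districts on a 3 x 3 grid with made-up ranks.
coords = [(float(r), float(s)) for r in range(3) for s in range(3)]
f = [1, 2, 3, 4, 5, 6, 7, 8, 9]
g = [2, 1, 3, 5, 4, 6, 8, 9, 7]
n = len(f)
d = [[dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
c = averaged_differences(f, g)

observed, p_value = monte_carlo_p(c, d)
r = normalized_index(c, d)
print(observed, p_value, r)
```

With a symmetric distance matrix, the observed index is unchanged by the averaging step, so using the averaged differences costs nothing and gives two symmetric matrices to compare.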


TABLE 2. Approximate Monte Carlo Distribution for the Index $\Gamma$, Using Rank Differences of Data in Table 1 and Euclidean Distances between Districts (sample size = 99). Column: Cumulative Frequency.

FIG. 2. Maximized Spatial Association, Residential Burglary, 1980 (bottom), 1981 (top)


To show several possible patternings of spatial association that might conceivably be present in data of this same form, a simple optimization algorithm was applied to the data as originally given in Table 1. Specifically, the pairs of ranks attached to each of the sixty-four districts were treated as intact units but were reallocated to maximize (or minimize) the raw spatial association index $\Gamma$. Starting with the data as listed, the pairs of ranks for the two districts whose interchange increased (or decreased) $\Gamma$ the most were interchanged. Starting at this point, another interchange was implemented in the same way, and so on until no pairwise interchange could be found that would increase (or decrease) $\Gamma$. These final allocations are not necessarily globally optimal, but the mappings they produce, given in Figure 2 (Fig. 3), at least show rather clearly how strong positive (or negative) spatial association might look graphically.

FIG. 3. Minimized Spatial Association, Residential Burglary, 1980 (bottom), 1981 (top)

The point-to-point Spearman correlation is exactly the same for Figures 1a and 1b, 2, and 3, that is, a value of 0.66. What has changed is the degree of spatial association. For Figures 1a and 1b we obtained a normalized index of 0.20 (the correlation between the off-diagonal entries of the distance matrix and those of a second matrix based on rank differences); the index is 0.84 for Figure 2 and […] for Figure 3.

Although the pairs of maps in Figures 1, 2, and 3 were concerned with spatial association, note the degree of spatial autocorrelation exhibited by the single variables mapped in each of the six representations. As given in Table 3, we use a raw index of the same form as $\Gamma$, that is, $\sum_{i,j} c_{ij} d_{ij}$ with $c_{ij}$ now the absolute rank difference within a single variable (e.g., $|f_i - f_j|$), or a normalized counterpart based on the correlation between the distances and the rank differences. As apparent from Table 3, there is a significant degree of spatial autocorrelation in the original variables (Figs. 1a and 1b). The reorganization represented in Figure 2 leads to a much stronger positive spatial autocorrelation for the two mapped variables; conversely, Figure 3 leads to a strong negative spatial autocorrelation. In short, maximization (or minimization) of spatial association brings an increased positive (or negative) spatial autocorrelation in the two variables.

TABLE 3. Spatial Autocorrelation Analysis, Using Data from Table 1 in Initial and Optimized Form. Columns: Data, $\Gamma$ (× 10^5). Rows: Initial, Initial, Maximized, Maximized, Minimized, Minimized. NOTES: The expectation, variance, and skewness of $\Gamma$ are constant across the six cases: E($\Gamma$) = […] × 10^5, V($\Gamma$) = […], skewness = […]. Based on Type III approximations, all values of $\Gamma$ are significant at less than the 0.01 level.

When individuals subjectively compare maps for correspondence, it appears that at least two rather distinct sources of information are being used. The first is the obvious point-to-point relationship, which can be measured by Spearman's index or one of its usual competitors. Second, there is an additional component that we have discussed as spatial association. To show this distinction in a very controlled way, Figure 4 displays hypothetical data constructed so that the rank order correlation between the sixty-four cells of the checkerboard is nearly zero (a value of […]). In addition, almost no spatial association exists (the correlation between the distance matrix and rank differences from (1) is 0.000). Reorganizing these data in the same manner as before, a local maximum shown in Figure 5 was identified for a normalized index of +0.58; a local minimum of […] is displayed in Figure 6. The positive correspondence of the pair of figures in Figure 5 and the negative association of the pair of figures in Figure 6 should be apparent even though the point-to-point correlation between each of these pairs of figures is essentially zero. What the observer is recognizing is spatial association and not a point-to-point correspondence.

FIG. 4. Hypothetical Data, Variable F (left), Variable G (right)

FIG. 5. Maximized Spatial Association, Variable F (left), Variable G (right)

FIG. 6. Minimized Spatial Association, Variable F (left), Variable G (right)

SOME FINAL COMMENTS

This paper points to the need for renewed interest in the concept of spatial association and the way in which it might be evaluated. The ideas outlined in the paper are being pursued in several directions, especially in relating the proposed formal measures introduced above to alternative strategies for assessing the correspondence between maps (e.g., spectral methods) and to other ways of defining the measure of proximity between the variables F and G. A particularly interesting problem relates to the interface between the normative discussion of correspondence, as presented here, and a descriptive notion based on human perception of correspondence (cf. McCarty and Salisbury 1962; Lloyd and Steinke 1977). This latter research area obviously overlaps behavioral and quantitative research, and, given the importance of these problems to geographical data analysis, we anticipate more widespread interest in this topic in the future.

LITERATURE CITED

Bachi, R. (1962). Standard Distance Measures and Related Methods for Spatial Analysis. Papers, Regional Science Association, 10.
Cliff, A. D., and J. K. Ord (1981). Spatial Processes: Models and Applications. London: Pion.
Costanzo, C. M., L. J. Hubert, and R. G. Golledge (1983). A Higher Moment for Spatial Statistics. Geographical Analysis, 15.
Fraser, D. A. S. (1956). A Vector Form of the Wald-Wolfowitz-Hoeffding Theorem. Annals of Mathematical Statistics, 27.
Gale, N., and W. C. Halperin (1982). A Case for Better Graphics: The Unclassed Choropleth Map. The American Statistician, 36.
Glick, B. J. (1982). A Spatial Rank-Order Correlation Measure. Geographical Analysis, 14.
Hubert, L. J., and R. G. Golledge (1982). Measuring Association between Spatially Defined Variables: Tjøstheim's Index and Some Generalizations. Geographical Analysis, 14.
Hubert, L. J., R. G. Golledge, and C. M. Costanzo (1981). Generalized Procedures for Evaluating Spatial Autocorrelation. Geographical Analysis, 13.
Kendall, M. G. (1970). Rank Correlation Methods. London: Griffin.
Klauber, M. R. (1971). Two Sample Randomization Tests for Space-Time Clustering. Biometrics, 27.
Lloyd, R. E., and T. R. Steinke (1977). Visual and Statistical Comparison of Choropleth Maps. Annals, Association of American Geographers, 67.
McCarty, H. H., and N. E. Salisbury (1962). Visual Comparison of Isopleth Maps as a Means of Determining Correlations between Spatially Distributed Phenomena. Iowa City: Department of Geography, University of Iowa.
Mielke, P. W. (1979). On Asymptotic Non-Normality of Null Distributions of MRPP Statistics. Communications in Statistics, Theory and Methods, A8.
Mielke, P. W., K. J. Berry, and G. W. Brier (1981). Application of Multi-Response Permutation Procedures for Examining Seasonal Changes in Monthly Mean Sea-Level Pressure Patterns. Monthly Weather Review, 109.
Tjøstheim, D. (1978). A Measure of Association for Spatial Variables. Biometrika, 65.
Tobler, W. R. (1965). Computing the Correspondence of Geographical Patterns. Papers, Regional Science Association.
Tobler, W. R. (1973). Choropleth Maps without Class Intervals? Geographical Analysis, 3.
