Using AMOEBA to Create a Spatial Weights Matrix and Identify Spatial Clusters, and a Comparison to Other Clustering Algorithms
Arthur Getis* and Jared Aldstadt**
*San Diego State University
**SDSU/UCSB Joint PhD Program
Paper presented at the Regional Research Institute, West Virginia University, Morgantown, West Virginia
December 8, 2005
AMOEBA
A design for the construction of a spatial weights matrix using empirical data.
  Multidirectional: Searches for spatial association in all specified directions.
  Optimal: Optimal in the sense that the scale is local (the finest scale) and the analysis reveals all spatial association.
  Ecotope-Based: The ecotope is a specialized region (a particular habitat) within a larger region.
  Algorithm: The algorithm for finding the ecotope is based on an analytical system that often finds highly irregular (amoeba-like) sub-regions of spatial association.
The Issues
  Question 1: How does one create an appropriate spatial weights matrix?
  Question 2: Can we have confidence in the identification of spatial clusters?
Question 1 How does one create an appropriate spatial weights matrix?
The Spatial Weights Matrix
In a regression context, W is the formal expression of spatial dependence between spatial units (the spatial effects).
Used in, for example: y = ρWy + Xβ + ε
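To make the role of W concrete, here is a minimal sketch that computes the spatial lag Wy and solves the lag model for y by fixed-point iteration. The function names, the 3-unit W, and the numbers are hypothetical. The iteration is an illustrative device that assumes a row-standardized W and |ρ| < 1; it is not an estimation procedure.

```python
def spatial_lag(W, y):
    """Return the vector Wy for a weights matrix W given as a list of rows."""
    return [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in W]


def solve_lag_model(W, rho, xb_plus_eps, n_iter=200):
    """Iterate y <- rho*W*y + (X*beta + eps).

    Assumption: W is row-standardized and |rho| < 1, so the iteration converges.
    """
    y = list(xb_plus_eps)
    for _ in range(n_iter):
        wy = spatial_lag(W, y)
        y = [rho * lag + b for lag, b in zip(wy, xb_plus_eps)]
    return y


# Hypothetical example: three units on a line, row-standardized contiguity.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
xb_plus_eps = [1.0, 2.0, 3.0]   # stands in for X*beta + eps
y = solve_lag_model(W, 0.5, xb_plus_eps)
print([round(v, 6) for v in y])
```

The returned y satisfies y = ρWy + Xβ + ε for the given W, so each unit's value depends on its neighbors' values through W.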
The Typical W Matrix
         j ------->
         1      2      3      ...    n
i = 1    w_11   w_12   w_13   ...    w_1n
i = 2    w_21   w_22   ...
i = 3    w_31   ...
...
i = n    w_n1   ...                  w_nn
Some Traditional W Schemes
  Contiguity
  Inverse Distances
  Lengths of Shared Borders, Perimeters
  nth Nearest Neighbor Distance
  All Centroids within d
  Ranked Distances
  Network Links
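A few of these schemes can be sketched in plain Python. The function names and the tiny examples are hypothetical. The sketch assumes a regular grid for contiguity and distinct point locations for the distance-based schemes.

```python
import math


def rook_contiguity(n_rows, n_cols):
    """Binary contiguity on a regular grid: cells sharing an edge are neighbors."""
    n = n_rows * n_cols
    W = [[0.0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[i][rr * n_cols + cc] = 1.0
    return W


def inverse_distance(points):
    """w_ij = 1 / d_ij for i != j (assumes no two points coincide)."""
    n = len(points)
    return [[0.0 if i == j else 1.0 / math.dist(points[i], points[j])
             for j in range(n)] for i in range(n)]


def k_nearest(points, k):
    """w_ij = 1 when j is among the k nearest neighbors of i."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            W[i][j] = 1.0
    return W


def row_standardize(W):
    """Scale each non-empty row to sum to 1; all-zero rows are left as zeros."""
    return [[w / sum(row) for w in row] if sum(row) > 0 else list(row) for row in W]


W_rook = row_standardize(rook_contiguity(2, 2))
W_inv = inverse_distance([(0, 0), (3, 4)])
W_knn = k_nearest([(0, 0), (1, 0), (3, 0)], 1)
```

All of these schemes fix W in advance from geometry alone, whereas AMOEBA derives it from the data.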
Commentators on W
  Anselin: outlined the problem
  Dacey: results vary with the weighting scheme
  Cliff and Ord: rook's and queen's cases
  Griffith: better to under-specify
  Florax & Rey: over-specification reduces power
  Kooijman: maximize Moran's I
  Openshaw: computer search for best model
  Bartels: binary weights are defensible
  Hammersley-Clifford: near neighbors in Markov random fields
  Tiefelsdorf, Griffith, Boots: standardization
  Florax and de Graaff: bias due to matrix sparseness
  GEODA listserv
Some Recent W Schemes
  Fotheringham, Brunsdon, and Charlton's bandwidth distance decay (1996)
  LeSage's Gaussian distance decline (1999)
  McMillen's tri-cube distance decline (1998)
  Getis and Aldstadt's local statistics model (2001, 2002)
  Fotheringham, Charlton, and Brunsdon's optimized bandwidth (2002)
  LeSage's Bayesian approach (2003)
  Aldstadt and Getis's AMOEBA (2003)
W: Theory or Reality?
  Exogenous versus endogenous
  Estimation versus prediction
  Model driven versus data driven
  The AMOEBA approach
AMOEBA: Critical Number of Links Identification
Local statistic values are computed around each observation as the number of links (d) increases. When the absolute values fail to rise, the cluster diameter is reached. The first peak equals G_i*(d_c).
[Figure: G_i* plotted against distance in links.]
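The search for the critical number of links can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the standardized Getis-Ord G_i* with binary weights (focus unit included), measures distance in links by breadth-first search on a contiguity graph, and stops at the first d where |G_i*| fails to rise. All function names and the 9-unit example are hypothetical.

```python
import math
from collections import deque


def link_distances(adjacency, i):
    """Number of links from unit i to every reachable unit (breadth-first search)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def gi_star(values, members):
    """Standardized G_i* with binary weights over `members` (assumes values vary)."""
    n, m = len(values), len(members)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    numerator = sum(values[j] for j in members) - m * mean
    return numerator / (s * math.sqrt((n * m - m * m) / (n - 1)))


def critical_links(values, adjacency, i, max_links):
    """Return (d_c, [G_i*(0), G_i*(1), ...]) for focus unit i."""
    dist = link_distances(adjacency, i)
    series = []
    for d in range(max_links + 1):
        members = [j for j, d_ij in dist.items() if d_ij <= d]
        if len(members) == len(values):
            break  # G_i* is undefined when every unit is included
        series.append(gi_star(values, members))
    d_c = 0
    for d in range(1, len(series)):
        if abs(series[d]) > abs(series[d - 1]):
            d_c = d
        else:
            break  # first failure to rise: the peak was at d - 1
    return d_c, series


# Hypothetical example: nine units on a line with a high-valued core.
values = [0, 0, 0, 5, 6, 5, 0, 0, 0]
adjacency = {k: [j for j in (k - 1, k + 1) if 0 <= j < 9] for k in range(9)}
d_c, series = critical_links(values, adjacency, 4, 3)
print(d_c, [round(g, 3) for g in series])
```

In this example the statistic rises from d = 0 to d = 1 and falls at d = 2, so d_c = 1.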
AMOEBA: Weight Calculation
When d_c > 0,
$$ w_{ij} = \frac{P(z \le Z_{d_c}) - P(z \le Z_{d_{ij}})}{P(z \le Z_{d_c}) - P(z \le Z_0)} \quad \text{for all } j \text{ where } d_{ij} \le d_c; \qquad w_{ij} = 0 \text{ otherwise.} $$
When d_c = 0, w_{ij} = 0 for all j.
P(z) is the cumulative probability associated with the standard variate of the normal distribution.
Weights vary between 0 and 1.
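A small sketch of the weight calculation follows. It reads the formula as w_ij = [P(z ≤ Z_dc) − P(z ≤ Z_dij)] / [P(z ≤ Z_dc) − P(z ≤ Z_0)] for d_ij ≤ d_c, and 0 otherwise, where Z_d is the local statistic at d links. That reading, the exclusion of the focus unit itself, and the names `amoeba_weights`, `z_by_links`, and `link_dist` are all assumptions of this example.

```python
from statistics import NormalDist


def amoeba_weights(z_by_links, d_c, link_dist):
    """Weights for one focus unit i.

    z_by_links[d] : local statistic at d links, for d = 0..d_c
    link_dist     : maps each other unit j to d_ij (focus unit excluded by assumption)
    """
    cdf = NormalDist().cdf  # cumulative probability of the standard normal
    if d_c == 0:
        return {j: 0.0 for j in link_dist}
    denominator = cdf(z_by_links[d_c]) - cdf(z_by_links[0])
    weights = {}
    for j, d_ij in link_dist.items():
        if 0 < d_ij <= d_c and denominator != 0:
            weights[j] = (cdf(z_by_links[d_c]) - cdf(z_by_links[d_ij])) / denominator
        else:
            weights[j] = 0.0
    return weights


# Hypothetical values: the statistic rises from 1.0 to 2.5 over two links.
w = amoeba_weights([1.0, 2.0, 2.5], 2, {"a": 1, "b": 2, "c": 3})
print({j: round(v, 4) for j, v in w.items()})
```

Under this reading, units at exactly d_c links receive a weight of 0 and the weight grows toward 1 for units nearer the focus.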
AMOEBA: Links Designations
  d_ij is the number of links from the focus spatial unit i to another spatial unit j.
  d_c is the critical number of links: the number of links from i beyond which no further autocorrelation exists.
AMOEBA as W and U in an Autoregressive Spatial Lag Model
It is conceivable for rows of the weights matrix to be completely filled with zeroes, indicating that there is no local spatial autocorrelation surrounding an observation. To compensate for the zero-row effect, we create a dummy variable, U, that assigns a 1 to all observations with no dependence structure and 0 otherwise.
y = θWy + αU + Xβ + ε
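Building U from W is mechanical. A minimal sketch, assuming W is stored as a list of rows (the function name and the 4-unit matrix are hypothetical):

```python
def nonspatial_dummy(W):
    """U_i = 1 when row i of W is all zeros (no dependence structure), else 0."""
    return [1 if all(w == 0 for w in row) else 0 for row in W]


# Hypothetical 4-unit weights matrix; unit 2 has an all-zero row.
W = [[0.0, 0.6, 0.0, 0.4],
     [0.7, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0]]
U = nonspatial_dummy(W)
print(U)
```

U then enters the model as an ordinary regressor with coefficient α.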
AMOEBA as W and U in an Autoregressive Spatial Error Model
y = αU + Xβ + (I − κW)^(-1) ε
AMOEBA: The non-spatial and spatial matrices
[Matrix display: U, a column vector in which four entries equal 1, and W, a weights matrix with elements w_i,j for units indexed 1 to 14; positions with no weight are left blank.]
Generalized AMOEBA
[Partitioned matrix equation for Y; recoverable terms: Yc, 1c, Wcc, Wc, εc, with coefficients α, ρ, β.]
An Example: Total Fertility Rates, Amman, Jordan, 1994 (data by census units)
[Map labels: Mediterranean Sea, Lebanon, Syria, Iraq, Gaza, Palestinian Authority, Israel, Saudi Arabia, Egypt]
Explanatory Variables
Regressor social variables:
  1. Percent of females with higher education (called "hi-ed")
  2. Percent of females married (called "married")
Ordinary Least Squares (No W or U)
AIC          165.35
t VALUES
constant     6.266
hi-ed        -14.344
married      1.261
AMOEBA in Spatial Error Models
                            A M O E B A
              Contiguity    G_i*       I_i        c_i
AIC           167.352       79.159     147.43     11.1
t VALUES
constant      6.499         6.499      7.21       6.342
hi-ed         -13.4         -11.55     -13.316    -4.68
married       1.164         1.978      1.227      1.154
lambda        1.634         98.792     1.187      14.5
non-spatial                 12.588     -4.48      7.89
Comparison of Spatial Contiguity and AMOEBA: Spatial Error Model
  G_i* AMOEBA has an AIC much lower than contiguity (79.159 to 16.625).
  All AMOEBA models are an improvement over contiguity.
  G_i* AMOEBA has an extremely high lambda and non-spatial vector: a good descriptor of spatial and non-spatial effects.
  G_i* AMOEBA shows social variables to be significant in explaining TFR.
AMOEBA in Spatial Lag Models
                            A M O E B A
              Contiguity    G_i*       I_i        c_i
AIC           16.625        18.27      148.481    123.881
t VALUES
constant      5.419         3.866      5.68       4.742
hi-ed         -9.927        -.87       -9.51      -8.642
married       1.164         2.16       1.341      1.21
Rho           -.5           7.435      1.819      5.443
Non-Spatial                 7.594      -1.657     8.58
Comparison of Spatial Contiguity and AMOEBA: Spatial Lag Model
  Again, all AMOEBA models have a lower AIC than contiguity; G_i* AMOEBA is best.
  All variables are significant.
Question 2 Can we have confidence in the identification of spatial clusters?
Problems with Spatial Clusters
  Not explicit (what is a cluster?)
  Are they statistically significant? (degree of confidence)
  What is the appropriate spatial scale?
  Often arbitrary, too general
  Over- and under-identification
  Appropriate shape (too circular, ellipsoidal)
  In general, the believability problem
AMOEBA Procedure I
For each observation i, local statistic values (e.g., G_i*, Z[I_i], Z[c_i]) are obtained for all combinations of near neighbors j of i within distance d of i. The set of j observations that maximizes the local statistic becomes the ecotope, together with the ith observation.
[Figure: grid with cells labeled 1.]
AMOEBA Subsequent Procedures
The procedure is repeated at increasing distances from i. At each distance d from i, only the j observations that are contiguous to the existing ecotope are evaluated. Again, using the local statistic, all combinations of these observations, together with the existing ecotope members, are evaluated. The new set of j observations that maximizes the local statistic joins the ecotope.
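The growth procedure can be sketched as below. This is a simplified illustration, not the authors' code: it uses G_i* with binary weights as the local statistic, searches only for high-value clusters (negate the values to search for low ones), and stops when no combination of contiguous units raises the statistic. The function names and the 4 x 4 grid are hypothetical.

```python
import math
from itertools import combinations


def rook_neighbors(n_rows, n_cols):
    """Adjacency lists for a regular grid; cells sharing an edge are contiguous."""
    adjacency = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacency[r * n_cols + c] = [
                rr * n_cols + cc
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < n_rows and 0 <= cc < n_cols]
    return adjacency


def gi_star(values, members):
    """Standardized G_i* with binary weights over `members` (assumes values vary)."""
    n, m = len(values), len(members)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    numerator = sum(values[j] for j in members) - m * mean
    return numerator / (s * math.sqrt((n * m - m * m) / (n - 1)))


def grow_ecotope(values, adjacency, i, max_steps=10):
    """Grow an ecotope from unit i, one ring of contiguous units per step."""
    ecotope = {i}
    best = gi_star(values, ecotope)
    for _ in range(max_steps):
        # Only units contiguous to the existing ecotope are candidates.
        frontier = sorted({v for u in ecotope for v in adjacency[u]} - ecotope)
        best_set = None
        for r in range(1, len(frontier) + 1):
            for combo in combinations(frontier, r):
                trial = ecotope | set(combo)
                if len(trial) == len(values):
                    continue  # statistic undefined when every unit is a member
                stat = gi_star(values, trial)
                if stat > best:
                    best, best_set = stat, trial
        if best_set is None:
            break  # no combination raises the statistic
        ecotope = best_set
    return ecotope, best


# Hypothetical 4 x 4 grid with an L-shaped block of high values.
values = [0.0] * 16
for cell in (0, 4, 8, 9, 10):
    values[cell] = 10.0
ecotope, stat = grow_ecotope(values, rook_neighbors(4, 4), 0)
print(sorted(ecotope), round(stat, 3))
```

Evaluating all combinations grows exponentially with the number of contiguous candidates, so this brute-force form only suits small neighborhoods. The example recovers the irregular L-shaped region rather than a circle or square.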
[Figure: grid with cells labeled 1 and 2.]
[Figure: grid with cells labeled 1 through 6.]
Hypothetical Clusters
[Figure: two panels, each labeled "mean = 0, variance = 1".]
AMOEBA Example 1 [Figure panels: LSM; AMOEBA G_i*; AMOEBA I_i; AMOEBA c_i]
AMOEBA Example 2 [Figure panels: LSM; AMOEBA G_i*; AMOEBA I_i; AMOEBA c_i]
AMOEBA Example 3 [Figure panels: LSM; AMOEBA G_i*; AMOEBA I_i; AMOEBA c_i]
AMOEBA Example 4 [Figure panels: LSM; AMOEBA G_i*; AMOEBA I_i; AMOEBA c_i]
AMOEBA Example 5 [Figure panels: LSM; AMOEBA G_i*; AMOEBA I_i; AMOEBA c_i]
Heterogeneous Clusters This is like the data used in the GA paper.
Homogeneous Clusters
These are the same 6 clusters, with radii 2, 4, and 6. The high clusters have a mean of 0.5 and the low clusters have a mean of -0.5. These means are added to random values drawn from the Normal(0, 1) distribution.
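A test surface of this kind can be generated along the following lines. The grid size, the cluster centers, and the function name are hypothetical; only the radii (2, 4, 6), the cluster means (0.5 and -0.5), and the Normal(0, 1) noise come from the slide.

```python
import math
import random


def simulate_clusters(size, clusters, seed=0):
    """Return (means, data) for a size x size grid.

    clusters: iterable of (center_x, center_y, radius, mean); cells within
    `radius` of a center get that mean, all other cells get 0.
    """
    rng = random.Random(seed)
    means = [[0.0] * size for _ in range(size)]
    for cx, cy, radius, mean in clusters:
        for x in range(size):
            for y in range(size):
                if math.hypot(x - cx, y - cy) <= radius:
                    means[x][y] = mean
    # Add standard normal noise to the cluster means.
    data = [[means[x][y] + rng.gauss(0.0, 1.0) for y in range(size)]
            for x in range(size)]
    return means, data


# Hypothetical layout: three high and three low circular clusters on a 40 x 40 grid.
clusters = [(6, 6, 2, 0.5), (18, 8, 4, 0.5), (32, 10, 6, 0.5),
            (6, 30, 2, -0.5), (18, 30, 4, -0.5), (32, 30, 6, -0.5)]
means, data = simulate_clusters(40, clusters)
```

With a signal of half a standard deviation, individual cells are hard to distinguish from noise, which is what makes this a demanding test for a cluster detector.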
Peaked Clusters
Real World Example
Clustering of dengue hemorrhagic fever in Thailand by province and by month.
14 years of data: 168 monthly observations
STARS: A GIS System Rey, Sergio. Space-Time Analysis of Regional Systems (STARS). Available as an open source program on the Internet.
Other Clustering Algorithms
  SaTScan by Kulldorff (1997; v4.0, 2004) (Communications in Statistics)
  FleXScan by Tango and Takahashi (2004, 2005) (International Journal of Health Geographics)
Bases of Clustering Methods
  AMOEBA: Based on values of the local statistic as d increases in many directions from an index location.
  SaTScan: Based on a moving circle of varying radii, searching for the circle that is the least likely to have occurred by chance.
  FleXScan: Based on the spatial scan statistic used on irregularly shaped windows formed by connecting adjacent neighbors.
Clustering Methods Tests
  AMOEBA
    H0: The sum of the observed values within ecotopes is greater (lesser) than expected by chance.
    The p value is calculated based on the location of the local statistic values of the observed ecotope within Monte Carlo permutations.
  SaTScan
    H0: The sum of observed cases within the circular search region is proportional to the population size.
    The p value is calculated based on Poisson realizations using the global rate.
  FleXScan
    H0 and p: Same as SaTScan, but within the irregular search region.
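The Monte Carlo idea behind the AMOEBA test can be sketched as a one-sided permutation test. This is a simplification: it holds the ecotope membership fixed and permutes the values across units, whereas a full test might repeat the ecotope search on each permutation. For a fixed membership, ranking by the sum of member values is equivalent to ranking by G_i*. The function name and data are hypothetical.

```python
import random


def permutation_p_value(values, members, n_perm=999, seed=0):
    """Share of random permutations whose member sum is at least the observed sum."""
    rng = random.Random(seed)
    observed = sum(values[j] for j in members)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign values to units at random
        if sum(shuffled[j] for j in members) >= observed:
            at_least += 1
    # The observed arrangement counts as one of the n_perm + 1 realizations.
    return (at_least + 1) / (n_perm + 1)


# Hypothetical data: the three high values sit together in units 3, 4, and 5.
values = [0, 0, 0, 5, 6, 5, 0, 0, 0]
p = permutation_p_value(values, [3, 4, 5])
print(p)
```

The smallest attainable p value is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.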
Clustering Comparison
                          High Risk Provinces      Low Risk Provinces
---------------------------------------------------
                          Cluster    No Cluster    Cluster    No Cluster
---------------------------------------------------
Relative Risk Expected    38         0             0          178
AMOEBA                    34         4             0          178
SaTScan                   35         3             21         157
FleXScan                  35         3             3          175
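One way to read the comparison is as detection rates: the share of high-risk provinces placed in a cluster and the share of low-risk provinces left out of one. The sketch below computes both from the counts in the table. It assumes that the cell missing from the AMOEBA row (low-risk provinces in a cluster) is 0; the function name and the choice of these two rates are also assumptions of this example.

```python
def detection_rates(high_cluster, high_none, low_cluster, low_none):
    """Return (sensitivity, specificity) from one row of the comparison table."""
    sensitivity = high_cluster / (high_cluster + high_none)
    specificity = low_none / (low_cluster + low_none)
    return sensitivity, specificity


# Counts from the table; the AMOEBA low-risk "Cluster" cell is assumed to be 0.
table = {
    "AMOEBA": (34, 4, 0, 178),
    "SaTScan": (35, 3, 21, 157),
    "FleXScan": (35, 3, 3, 175),
}
rates = {name: detection_rates(*counts) for name, counts in table.items()}
for name, (sens, spec) in rates.items():
    print(f"{name}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

On these counts the three methods find a similar share of the high-risk provinces and differ mainly in how many low-risk provinces they place in clusters.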