Clustering: global indexes (to measure the global degree of clustering for the whole set of events)
-> methods based on quadrats (joint count) vs. methods based on distances
AVERAGE NEAREST NEIGHBOUR: is the mean distance between each event and its nearest neighbour smaller (clustering) or larger (inhibitory pattern) than the distance expected in case of complete spatial randomness (CSR)? (Clark-Evans, 1950s)
Nearest neighbour ratio = observed mean distance / expected mean distance (CSR)
-> Toolbox / Spatial statistics / Analyzing patterns
Input:
- Points: unweighted (= 1)
- Projected coordinate system!
- (Polygons and lines: converted into points with x, y = centroids)
Output:
- Observed Mean Distance
- Expected Mean Distance
- Nearest Neighbor Index
- Graphic report
- Test variables:
  p-value: probability that a pattern at least as extreme as the observed one would arise from a random process (CSR)
  z-score: distance of the observed value from the expected value, in standard deviations
- Measure the ANN for firms within the GRA (selection of rm_immig.shp)
Bivariate point patterns: co-agglomeration, co-location, competition/cooperation, related variety
-> Bivariate/Cross K function, Pairwise interaction point process.. (Crimestat, R..)
Risk-Adjusted Nearest Neighbor Hierarchical Spatial Clustering (Rnnh) (Crimestat): clustering index in which the probability of identifying clusters for certain categories of events is assessed in relation to the spatial distribution of all events, by using an interpolation between the (kernel) density surfaces of the primary file (e.g. crimes) and the secondary file (e.g. population)
Multi-variate point patterns -> bivariate point pattern analysis for each pair of patterns
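Going back to the nearest neighbour ratio defined above, the following is a minimal sketch of how the index and its z-score can be computed by hand. It assumes the commonly used Clark-Evans expressions for the expected mean distance under CSR, 0.5 / sqrt(n / A), and for its standard error, 0.26136 / sqrt(n² / A); the function name, the sample points and the study area are made up for illustration, and no boundary correction is applied.

```python
import math

def average_nearest_neighbour(points, area):
    """Return (observed, expected, ratio, z_score) for a list of (x, y) points.

    `area` is the study area in squared coordinate units; in this sketch it
    must be supplied by the caller (projected coordinates are assumed).
    """
    n = len(points)
    # Observed: mean distance from each event to its nearest other event.
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))
    observed = sum(nearest) / n
    # Expected mean distance and standard error under CSR (Clark-Evans).
    expected = 0.5 / math.sqrt(n / area)
    std_error = 0.26136 / math.sqrt(n * n / area)
    ratio = observed / expected
    z_score = (observed - expected) / std_error
    return observed, expected, ratio, z_score

# Hypothetical data: a regular 4 x 4 grid (inhibitory pattern) and a tight
# blob (clustering), both inside the same 4 x 4 study area.
grid = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
blob = [(2 + 0.01 * i, 2 + 0.01 * j) for i in range(4) for j in range(4)]
grid_result = average_nearest_neighbour(grid, area=16.0)
blob_result = average_nearest_neighbour(blob, area=16.0)
print("grid ratio:", grid_result[2], "blob ratio:", blob_result[2])
```

A ratio above 1 with a positive z-score points to an inhibitory pattern, a ratio below 1 with a negative z-score to clustering. Note how strongly the result depends on the value passed as `area`.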
Clustering processes at different scales
In the figure: 10 clusters of first order, 8 clusters of second order, 3 of third order, and so on..
NEAREST NEIGHBOR HIERARCHICAL CLUSTER: constant-distance clustering routine for non-weighted events. Hierarchical: first-order clusters are treated as points which may in turn cluster at the second order, and so on, as long as the clustering criteria are satisfied (at each order).
Output (dbf, shp): number of clusters, mean center, deviational ellipse and convex hull (polyline) of the points belonging to each cluster, area and cluster density.
Results are heavily influenced by the identified first-order clusters.
RIPLEY'S K-FUNCTION: to identify clustered/inhibitory/random point patterns at different scales/distances between points (Ripley 1976; 1981, Spatial statistics)
Two uses:
- to confirm/reject the null (random) hypothesis at various scales/distances
- to identify the scale/distance at which clustering/inhibition is most intense/weakest
K(d) = average number of other events found within distance d of an event / overall density of events (n / A)
In case of complete spatial randomness: K(d) = πd²
Linearization of the K function: L function (Besag 1977), L(d) = sqrt(K(d) / π)
In case of complete spatial randomness: L(d) = d (ArcGIS), or L(d) = 0 (Crimestat, which reports sqrt(K(d) / π) - d)
[Figure: Ripley's K plot showing the observed K value (clustering), the expected value, and the lower and upper confidence envelopes]
Lower and upper confidence envelopes: results beyond them may be considered significant.
Confidence envelopes are estimated by repeating a Monte Carlo simulation (Crimestat: 100 simulations; ArcGIS: 0, 9, 99 or 999 permutations, i.e. no envelope or a 90%, 99% or 99.9% confidence level).
Corollary: simulations work better if the number of points is not small (> 100)
-> Spatial statistics / Analyzing patterns / Multi-Distance Spatial Cluster Analysis
Maximum distance: Crimestat: SQRT(A)/3; ArcGIS: ?
Distance ranges: Crimestat: 100; ArcGIS: from 1 to 100 (or: beginning distance + distance increment)
(K function) other parameters:
Weight field: default: 1; fixed: weight (number of events at each point). The weighted estimation gives different results (clustering is likely to be higher)! Points cannot have distance = 0*
Problems with the analysis of spatial data #1: study area extension
- If too small, the analysis may not include elements which are important to provide an exhaustive explanation.
- If too big, the spatial distribution pattern may be due to a diversity of processes which have nothing to do with what we want to explain (example: suburban, scattered and low-density urban areas).
- The K function is an area-sensitive tool: results are influenced by the extension of the study area.
Study area methods:
- Default: minimum enclosing rectangle
- User provided: via a polygon layer -> «Study Area» -> reduces the size of the area
Create a mask of the area within the GRA (ring road) by selecting (manually) the zone urbanistiche within the GRA and exporting the selection as mask_area.shp
Specific problems in the analysis of spatial data #2: boundary problems
Given the probability of non-observed events beyond the study area's boundaries (with a similar or dissimilar spatial distribution), clustering near the boundaries is under-estimated.
Boundary correction methods:
- NONE: because events are only to be found within the boundaries, or because the point layer is wider than the study area: points beyond the boundaries of the study area are then used for estimating the K function (!!!)
- SIMULATE_OUTER_BOUNDARY_VALUES: simulates a «mirrored» distribution of points beyond the boundaries
- REDUCE_ANALYSIS_AREA: reduces the study area.
- RIPLEY'S_EDGE_CORRECTION_FORMULA: points that are closer to the boundary than to their neighbouring points are given more weight (suitable only for study areas that are not irregular)
Output: table (+ «Display results graphically»):
- ExpK (expected value of K in case of CSR)
- Envelopes (confidence intervals)
- ObservedK (value of K)
- DiffK (ObservedK - ExpK)
Cautions:
- Works better for clustered than for inhibitory processes
- It is mainly a tool for identifying second-order clusters, i.e. localized clusters, intra-regional scales or medium distances.
- Not reliable for small numbers of events (at least 30, preferably more than 100, are needed)
- Not reliable for strongly irregular areas (if the boundary problem cannot be solved adequately)
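To make the K and L functions less abstract, here is a minimal sketch of a naive estimate with no boundary correction (the equivalent of choosing NONE). It assumes the common estimator K(d) = A / (n(n-1)) × number of ordered pairs of distinct events closer than d, and the ArcGIS-style transformation L(d) = sqrt(K(d) / π); the function names and the sample coordinates are hypothetical, and no confidence envelope is simulated.

```python
import math

def ripley_k(points, area, distances):
    """Naive K(d) for each d in `distances`, without edge correction."""
    n = len(points)
    k_values = []
    for d in distances:
        pairs = 0  # ordered pairs (i, j), i != j, closer than d
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                if i != j and math.hypot(xi - xj, yi - yj) <= d:
                    pairs += 1
        k_values.append(area * pairs / (n * (n - 1)))
    return k_values

def l_function(k_values):
    """Linearized K: equals d under complete spatial randomness."""
    return [math.sqrt(k / math.pi) for k in k_values]

# Hypothetical data in a 10 x 10 study area: two tight clusters vs. a grid.
clusters = [(1, 1), (1.1, 1), (1, 1.1), (1.1, 1.1),
            (9, 9), (9.1, 9), (9, 9.1), (9.1, 9.1)]
grid = [(x + 0.5, y + 0.5) for x in range(10) for y in range(10)]
distances = [0.5]

k_clusters = ripley_k(clusters, 100.0, distances)
k_grid = ripley_k(grid, 100.0, distances)
l_clusters = l_function(k_clusters)
l_grid = l_function(k_grid)
# DiffK = ObservedK - ExpK, with ExpK = pi * d^2 under CSR.
diff_k_clusters = [k - math.pi * d ** 2 for k, d in zip(k_clusters, distances)]
print(l_clusters, l_grid, diff_k_clusters)
```

At d = 0.5 the clustered sample gives L(d) well above d (positive DiffK), while the regular grid gives L(d) = 0, below d, as expected for an inhibitory pattern at that distance.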
Measure the Ripley K function for the distribution of firms owned by foreigners within the GRA (ring road)
Input: vv/rm_immig_wdata.shp
- Confidence envelope: 0 permutations*
- Click «Display results graphically»
- Distance bands: 20
- Weight field: CNT
- Beginning distance: 250
- Distance increment: 250
- Boundary correction method: NONE, because: ?
- Study area: User provided = dropbox/corsimemotef/lezgis16/4/mask.shp
-> Verify the graphic and table (DiffK) output
Space-time Ripley's K
Taxonomy of spatial analysis tools (in ArcGIS and Crimestat)
Of events (spatial distribution):
- Global indexes: Average nearest neighbour; (multi-scale) Ripley's K
- Local indexes: Kernel density maps; Nearest neighbour hierarchical clustering; Risk-Adjusted Nearest Neighbor Hierarchical Clustering
Of intensities (spatial association):
- Global indexes of spatial autocorrelation: Moran's I; Geary's C
- Local indicators of spatial association (LISA): Anselin Local Moran's I (Cluster and outlier analysis); Getis-Ord Gi (Hot spot analysis)
3. Global indexes of spatial AUTOCORRELATION
First law of geography (Tobler): "Everything is related to everything else, but near things are more related than distant things."
It is a form of spatial dependence (positive or negative): the degree to which nearby features are similar or dissimilar*, vs. a hypothesis of complete spatial randomness.
- Similar to time series analysis, but it depends on both proximity AND direction/position (2D)
Why estimate the degree of spatial autocorrelation:
- To understand the process (or the variety of processes..) which explains the geographical distribution of intensities
- To estimate the degree to which nearby features potentially influence each other (= interaction, interdependence, attraction, contagion, clustering, segregation, etc.)
- To verify the degree to which the observed variables are (not) statistically independent (e.g. autocorrelation reduces the dataset's information content or obscures what is specific about each area, because intensities in one area are partially influenced by what is happening nearby)
- (E.g. to test the spatial autocorrelation of model residuals)
- (E.g. to assist in the identification of the spatial sample size)
Exploratory Spatial Data Analysis (and mapping) vs. Modelling (formal verification and testing of hypotheses)
Spatial autocorrelation: global indexes
MORAN'S I: global covariance index adapted from the analysis of the memory effect in time series (Moran 1940s, Whittle 1954). Measures the global degree of similarity between the intensities (above or below the average, +/-) of nearby features.
I = (n / S0) · Σi Σj Wij (Xi-X)(Xj-X) / Σi (Xi-X)², with S0 = Σi Σj Wij
Xi = intensity at point i
X = average intensity
(Xi-X)(Xj-X) = cross-product, high (positive) if both values lie on the same side of the average
Wij = spatial weights (/influences) matrix*
Clustered/positive autocorrelation if I is high (I > 0), dispersed/negative autocorrelation if I is low (I < 0), vs. the value expected under the CSR hypothesis: Iexp = -1/(n-1)
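As a worked illustration of the index just described, the sketch below computes the global Moran's I in its standard form: the weighted sum of cross-products divided by the sum of squared deviations, scaled by n over the sum of all weights. The four-zone layout, the binary contiguity matrix and the function names are hypothetical, and no significance test (z-score, p-value) is included.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of intensities and an n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all spatial weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    sum_sq = sum(d * d for d in dev)
    return (n / s0) * cross / sum_sq

def expected_i(n):
    """Expected value of I under complete spatial randomness."""
    return -1.0 / (n - 1)

# Hypothetical example: four zones in a row, each bordering its neighbours
# (binary contiguity: 1 if bordering, 0 otherwise).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_similar = morans_i([1, 1, 5, 5], W)      # like values sit side by side
i_alternating = morans_i([1, 5, 1, 5], W)  # neighbours always differ
print(i_similar, i_alternating, expected_i(4))
```

The first arrangement yields a positive I (about 0.33), the second yields -1, and both are to be read against the expected value of about -0.33 for n = 4.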
-> Spatial statistics / Analyzing patterns / Spatial autocorrelation (Moran's I)
Conceptualization of spatial relationships:
- Inverse distance (squared): spatial relationships between features are inversely proportional to their (squared) distance. Computational problems with small distances (Crimestat: «adjust for small distances») and when no threshold is set (n to n).
- Fixed distance band: any feature within the threshold (band) weighs 1. Appropriate in the case of non-uniform polygons, and for large point datasets.
- Zone of indifference: neighbors (or features within the distance threshold) weigh 1. The weight of the other features is inversely proportional to their distance. Appropriate as above, when the influence of distant features is relevant. Computational problems.
- Polygon contiguity (adjacency!): considers only bordering features (1 if bordering, 0 for all the others). Appropriate only for regular polygons (original Moran's I; generalized by Cliff and Ord 1973; widely used in spatial econometrics).
Conceptualization of spatial relationships (2):
-> Spatial statistics / Modeling spatial relationships / Generate spatial weight matrix
- Distance Band or Threshold Distance (mostly for large datasets): threshold beyond which influence is null. With inverse distance: i) 0: all features are considered; ii) empty: applies a default threshold distance (the minimum distance at which every feature has a neighbour); iii) defined by the user.
- Weights Matrix from file: uses a spatial weight matrix file (.swm) created/adapted by the user
- Spatial weight matrix: table in .swm format in which each cell expresses the distance, time, cost, influence or spatial relationship between a pair of features (presence/absence or intensity)
Conceptualization of spatial relationships (3):
- INVERSE_DISTANCE: (see above) + Exponent (!), e.g. 2
- FIXED_DISTANCE: (see above)
- K_NEAREST_NEIGHBORS: considers only the K closest features
- CONTIGUITY_EDGES_ONLY: considers only features which share a boundary («rook»)
- CONTIGUITY_EDGES_CORNERS: considers only features which share a boundary and/or a vertex («queen»)
- DELAUNAY_TRIANGULATION: creates non-overlapping triangles connecting the polygons' centroids, and considers only features whose centroids are joined by a triangle edge
- CONVERT_TABLE: to specify spatial relationships in a table [Convert spatial weight matrix to table (utilities)]
ROW STANDARDIZATION: the values in each row of the spatial weight matrix are divided by the row sum, so that each row sums to 1. This prevents the indexes from being influenced by the different number of nearby features: appropriate in the case of sample data and compulsory in the case of polygon contiguity, because (irregular) polygons have a different and arbitrary number of bordering features.
Test variables: z-score (in standard deviations) / p-value
Normality: does the z-score follow a normal distribution?
Output:
- Moran's index
- Expected index
- Variance
- z-score and p-value
Cautions:
- Significant only above a certain number of features (> 30)
- Vs. Geary's autocorrelation index (Moran is more robust)
HIGH/LOW CLUSTERING (Getis & Ord): the probability for high or low values (+) to be clustered or dispersed (similar to Average Nearest Neighbour)
LAB: a spatial analysis of public schools quality in Rome = to what extent does school quality depend on the context? = what is the degree of spatial autocorrelation of school quality?
Input: spatial17/addxy/schools_roma_xy_dprv.dbf = a table with the XY coordinates of all primary and secondary schools in Rome, including a (normalized) «deprivation index» f(dropouts, repetitions, students-to-teachers ratio, students per classroom, foreign students ratio).
1. Georeference the Schools_Roma dbf using Add XY data + export the output, setting its coordinate system «as the data frame»
2. Estimate the global autocorrelation of the normalized deprivation index, using arctoolbox/spatial statistics/analyzing patterns/moran's I, and setting all the parameters..
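Before setting the parameters in the tool, it may help to see what they do to the spatial weight matrix. The sketch below builds a fixed distance band matrix and an inverse distance matrix from point coordinates, then row-standardizes them. The function names and the four sample points are hypothetical; this illustrates the concepts only and is not how ArcGIS or Crimestat store or compute their weights.

```python
import math

def fixed_distance_weights(points, threshold):
    """Weight 1 for every other feature within the threshold, 0 beyond it."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= threshold:
                w[i][j] = 1.0
    return w

def inverse_distance_weights(points, exponent=1.0):
    """Weight 1 / distance^exponent for every other feature (no threshold)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d == 0:
                # Coincident points have no defined inverse distance weight.
                raise ValueError("coincident points: distance = 0")
            w[i][j] = 1.0 / d ** exponent
    return w

def row_standardize(w):
    """Divide each row by its sum; rows without neighbours stay at 0."""
    standardized = []
    for row in w:
        total = sum(row)
        standardized.append([v / total if total > 0 else 0.0 for v in row])
    return standardized

# Hypothetical points: three close together, one isolated.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
band = fixed_distance_weights(points, threshold=1.5)
band_std = row_standardize(band)
inverse_sq = inverse_distance_weights(points, exponent=2.0)
print(band_std)
```

With a 1.5 threshold the isolated point has no neighbours at all, which is the situation the default threshold distance (minimum distance at which every feature has a neighbour) is meant to avoid.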
LAB: what is the degree of spatial autocorrelation of school quality?
2. Estimate the global autocorrelation of the normalized deprivation index, using Moran's I. Parameters:
- Input feature class: schools
- Input field = «DPRV_NORM»
- Conceptualization of spatial relationships: ?
- Row standardization: ?
- Threshold distance: 10,000 meters
- Generate report
-> Verify the graphic report and test variables: what is the result? Is it statistically significant?
Do high or low quality schools cluster in certain zones, and where? -> Local indicators of spatial autocorrelation
4. LOCAL INDEX OF SPATIAL AUTOCORRELATION
To measure the degree of autocorrelation for each geographical feature (where and which features?)
Local indexes of spatial association/autocorrelation: Anselin Local Moran's I (Anselin L. 1995, Local indicators of spatial association (LISA). Geographical Analysis 27, 93-115)
To attribute to each feature a degree of high/low autocorrelation, based on whether its (high/low) intensity is similar or dissimilar to that of nearby features
Z: intensity, S: variance, W: spatial weight matrix
Input: polygons (Crimestat) and points (ArcGIS)
Output:
[Figures: (1) degree of segregation between areas with a prevalence of Chinese entrepreneurs and areas with a prevalence of Italian entrepreneurs; (2) local contribution to the segregation between areas with a prevailing presence of businesses run by Chinese or Italian entrepreneurs; (3) Anselin Local Moran's I of the distance of entrepreneurs from their country of origin]
Cluster type (COType) identifies (and renders):
- Features which are part of clusters of high (HH) or low (LL) values, because nearby features have similar values, and which are statistically significant (high and positive z-score).
- Outlier features, with high or low values, surrounded by features with low (HL) or high (LH) values, and which are statistically significant (low and negative z-score)
-> Spatial statistics / Mapping clusters / Cluster and outlier analysis
LAB: a spatial analysis of public schools quality in Rome = to what extent does school quality depend on the context? Do high or low quality schools cluster in certain zones, and where?
1. Identify and render those schools which are part of clusters of nearby low or high quality schools, using arctoolbox/spatial statistics/mapping clusters/cluster and outlier analysis
Input: Schools with data shapefile; input field: DPRV_NORM; spatial relationships: Inverse distance
-> Modify the symbology of the output layer in order to visualize only the schools in clusters with high or low and significant spatial autocorrelation values
-> Open and verify the attribute table of the output layer
-> In a copied layer, represent the value of the index (LMiIndex) regardless of the degree of significance
-> Check (and try to make sense of) the outliers
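The logic behind the local index and the HH/LL/HL/LH labels can be sketched in a few lines. The version below assumes Anselin's basic form: the deviation of each feature from the mean, divided by the variance, times the weighted sum of its neighbours' deviations. The variance here is computed over all n features, which may differ slightly from what a given software package uses. The labels are assigned from the signs alone, without the significance test that Cluster and outlier analysis applies; function names and data are hypothetical.

```python
def local_morans_i(values, weights):
    """Return a list of (local index, quadrant label) per feature.

    Quadrant labels are purely descriptive here: no z-score or p-value
    is computed, so they do not indicate statistical significance.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    variance = sum(d * d for d in dev) / n
    results = []
    for i in range(n):
        # Spatial lag: weighted sum of the neighbours' deviations.
        lag = sum(weights[i][j] * dev[j] for j in range(n) if j != i)
        index = dev[i] / variance * lag
        # Ties (deviation or lag equal to 0) are counted as "high".
        label = ("H" if dev[i] >= 0 else "L") + ("H" if lag >= 0 else "L")
        results.append((index, label))
    return results

def line_contiguity(n):
    """Hypothetical layout: n zones in a row, each bordering the next."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

clustered = local_morans_i([1, 2, 8, 9], line_contiguity(4))
with_outlier = local_morans_i([1, 1, 9, 1, 1], line_contiguity(5))
print([label for _, label in clustered])
print([label for _, label in with_outlier])
```

In the second example the central zone comes out as HL (a high value surrounded by low values) with a negative index, and its two neighbours as LH.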
Local indexes of spatial autocorrelation (2): Getis-Ord Gi, high/low clustering (Hot Spot Analysis)
Identifies features which are part of «hot spots»: areas with unusual clustering of high or low values (Cliff & Ord, Spatial autocorrelation, 1973), based on the value of the GiZScore (categorized according to the standard deviation): the higher the GiZScore, the more nearby features have high values, and vice versa.
(You may build a density map using the z-score as weight)
Cautions:
- Reliable only with large datasets (> 30 features)
- Test problems (the significance test is based on global indexes of spatial autocorrelation)
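A minimal sketch of how a Gi z-score of this kind can be computed, assuming the Gi* variant in which each feature counts as its own neighbour: the weighted sum of values around feature i is compared with what the global mean would predict, and divided by a standard deviation term. The function name, the five-zone layout and the values are hypothetical.

```python
import math

def getis_ord_gi_star(values, weights):
    """Gi* z-score per feature; weights[i][i] is expected to be non-zero."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    z_scores = []
    for i in range(n):
        row = weights[i]
        sum_w = sum(row)
        sum_w2 = sum(w * w for w in row)
        numerator = sum(row[j] * values[j] for j in range(n)) - mean * sum_w
        # The denominator is 0 if a feature is an equal-weight neighbour of
        # every feature; that case is not handled in this sketch.
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores

# Hypothetical layout: five zones in a row; each zone's neighbourhood is
# itself plus the bordering zones.
W = [[1 if abs(i - j) <= 1 else 0 for j in range(5)] for i in range(5)]
z = getis_ord_gi_star([10, 10, 1, 1, 1], W)
print([round(score, 2) for score in z])
```

The first zone, where high values sit next to each other, gets the highest positive score (a candidate hot spot); the last zone gets a negative score (low values surrounded by low values).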
LAB:
1. Measure the (global) spatial autocorrelation of the distribution of all foreigners (and of Chinese) in Rome's zone urbanistiche
2. Identify (local) clusters of contiguous zones with a high or low density of foreigners (and of Chinese)
Input: spatial17/vv/zurb_wdata.shp
Input field: ?
Arctoolbox tools: ?
Conceptualization of spatial relationships: ?
Standardization: ?
Threshold distance: ?
Results?
Spatial interpolation: to obtain surface data from point sample observations
Spatial interpolation: INVERSE DISTANCE WEIGHTED
Spatial interpolation: KRIGING
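Of the two interpolation methods named above, inverse distance weighting is simple enough to sketch directly: the value at an unsampled location is a weighted average of the sample values, with weights equal to 1 / distance^power. The function name, the default power of 2 and the sample data are assumptions made for this illustration; kriging, which also models the spatial structure of the data, is not covered here.

```python
import math

def idw(samples, x, y, power=2.0):
    """Estimate the value at (x, y) from (sx, sy, value) sample points."""
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            # The location coincides with a sample: return its value.
            return value
        weight = 1.0 / d ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Hypothetical samples: two observations along the x axis.
samples = [(0, 0, 10.0), (2, 0, 20.0)]
midpoint = idw(samples, 1, 0)    # equally distant from both samples
near_first = idw(samples, 0.5, 0)  # closer to the first sample
print(midpoint, near_first)
```

Evaluating the function over a regular grid of (x, y) locations gives the interpolated surface; a higher power makes nearby samples dominate more strongly.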
(..more) problems with the analysis of spatial data
The modifiable areal unit problem (MAUP): any geographical discontinuity is artificial, (more or less) arbitrary, modifiable, and influences the results and their explanation
- Scale problem, f(spatial resolution). E.g. statistical relations are stronger the lower the spatial resolution, because variance is lower = the more we aggregate data, the stronger they correlate. The more we disaggregate data (and increase spatial resolution), the higher the variance and the risk that results are due to chance or errors
- Zoning problem, f(geodata geometry): for any given number of zones, results are influenced by their shape
Example of the MAUP: gerrymandering (distortions due to the shape of electoral partitions). The urban (and mostly liberal) concentration of Columbus, Ohio, located at the center of the map, is split into thirds, each segment then attached to - and outnumbered by - largely conservative suburbs.
- Non-uniformity: a uniform geographical partition will be non-uniform in terms of statistical attributes (e.g. population), and vice versa. Data in less dense areas are less reliable.
- Irregularity, vs. compactness (e.g. administrative divisions)
And..
- Ecological fallacy: the results of an aggregate analysis cannot be attributed to each individual, or to higher scales (the suicide rate is higher where more Catholics live = are Catholics more prone to suicide?)
- Outliers: very frequent in spatial data. The higher the spatial resolution of the data, the higher the probability of outliers.
- Geodata quality (accuracy, completeness, consistency, resolution..). Specific problems: measurement errors are not independent (e.g. population subtracted from one area is attributed to its neighbour). The denser the areas, the lower the data quality (but the lower the distortion due to measurement errors)
- Categorical data: spatial analysis tools for categorical data are still largely missing
- Coincident locations (distance = 0) -> «collect events» (to turn coincident points of unique events into weighted points)
ArcGIS desktop/online Help:
http://forums.arcgis.com
http://support.esri.com/en/
http://mappingcenter.esri.com
http://blogs.esri.com/esri/arcgis/