Exploratory Spatial Data Analysis (And Navigating GeoDa)

Exploratory Spatial Data Analysis (And Navigating GeoDa) June 9, 2006 Stephen A. Matthews Associate Professor of Sociology & Anthropology, Geography and Demography Director of the Geographic Information Analysis Core Population Research Institute Stephen A. Matthews GISPopSci Friday June 9 2006 Slide 01

Outline
1. Exploratory Spatial Data Analysis
2. GeoDa: Navigating GeoDa

What is Exploratory Spatial Data Analysis?
"ESDA is a collection of techniques to describe and visualize spatial distributions, identify atypical locations or spatial outliers, discover patterns of spatial association, clusters or hot-spots, and suggest spatial regimes or other forms of spatial heterogeneity. Central to this conceptualization is the notion of spatial autocorrelation or spatial association, i.e., the phenomenon where locational similarity (observations in spatial proximity) is matched by value similarity (attribute correlation)" (Anselin, 1998, p. 79-80).

Exploratory Data Analysis (EDA)
Graphical and visual methods/tools that are used to identify data properties for purposes of
- pattern detection in data
- hypothesis formulation from the data
- aspects of model assessment (e.g., goodness-of-fit)
EDA emphasizes the interaction between human cognition and computation in the form of dynamic statistical graphics that allow the user to manipulate "views" of the data (via box-plots, scatterplots, etc.).

Exploratory Spatial Data Analysis (ESDA)
ESDA is an extension of EDA to detect spatial properties of data. There is a need for additional techniques to
- detect spatial patterns in data
- formulate hypotheses based on the geography of the data
- assess spatial models
In many instances it is important to be able to link numerical and graphical procedures with a map to answer questions such as "Where are those cases?"

Exploratory Spatial Data Analysis (ESDA)
With modern graphical interfaces this is often done by "brushing", where cases are selected from the relevant areas of a box-plot or scatterplot, and the related regions are identified on a map.
First example of ESDA by Mark Monmonier (1989)
Interactive Spatial Data Analysis

Exploratory Spatial Data Analysis (ESDA)
These geographic brushing tools are now found in some GIS packages where graphical and tabular displays are dynamically linked, such that the selection of any subset of observations in a map or other data view is immediately reflected in all other views. In addition to choropleth maps, other displays/views include histograms, box-plots, and scatterplots (see GeoDa). True ESDA pays attention to both spatial and attribute association (Anselin, 1998, p. 79).

ESDA and GIS
The integration of ESDA and GIS can be traced back to Michael Goodchild's 1987 article "A spatial analytical perspective on geographical information systems" in the International Journal of Geographical Information Systems. Integration of ESDA (within statistical packages) and GIS took on various forms. Goodchild et al. (1992) categorize the integrations as being either loose coupling or close coupling.

Stephen Matthews Spatial Demography Course Fall 2001
Spatial Analysis options in Arcview and Coupling Arcview with Spatial Analysis software packages
Later in the course I will use/introduce:
- Arcview Geoprocessing Wizard (Week 6)
- Arcview and DynESDA (Week 7)
- Spatial Analyst Extension (Week 8)
- Arcview and CrimeStat (Week 9)
- Arcview and S-plus / SpatialStats (Week 10)
- Arcview and SpaceStat (Week 11)
Arcview interface options to SpaceStat and S+

Coupling GIS and Spatial Statistics
ESDA & GIS: Loose or Close Coupling
With loose coupling, data and commands are passed back and forth between the packages by means of auxiliary files. With close coupling, the commands in one software system are called from another system by means of seamless inter-process communication. Close coupling is a recent development and is exemplified by the S+GISlink between ArcInfo and S-plus. In both cases, however, these integrations consist of adding statistical capabilities to a GIS and are rarely focused on spatial data analysis methods.

ESDA & GIS: Encompassing Coupling
With ESDA and GIS another form of integration emerges, encompassing coupling (see Anselin and Getis, 1992). Encompassing consists of writing spatial data analysis routines into a system's macro or scripting language (e.g., Avenue or VBA for Arcview). Such an approach is fully integrated within the GIS interface and hides the linked nature of the spatial data routines from the user. For example, see Ding and Fotheringham 1992, Bao et al. 1995 and Zhang and Griffith 1997. Anselin (1998b, p. 84) comments that these implementations are computationally slow and limited in terms of the size of data set that can be used (for more on this see Anselin and Bao, 1997).
David Wong's SEG

ESDA & GIS: Modular Coupling
Anselin identifies a hybrid model or modular coupling approach which consists of including spatial data analysis software (typically specially developed routines) in a collection of linked systems where the communication between the different systems is established by means of a combination of loose and close couplings.
Examples:
- ArcInfo, SpaceStat, Xgobi and clustering software - Zhang et al. 1994
- Arcview, Xgobi and XploRe - Symanzik et al. 1998
- ArcInfo and spatial statistics routines - Haining et al. 1996
- Arcview and SpaceStat - Anselin and Bao, 1997
Luc Anselin's SpaceStat

S-plus/SpatialStats samples: Local Indicators of Spatial Association
Testing the hypothesis of no autocorrelation using the global Moran statistic assumes stationarity of variance through space. Using local indicators of spatial association (LISA) is not limited by this assumption. The Local Moran is applied to each individual point/area with the index revealing clustering or dispersion relative to the local neighborhood.
The local Moran's I measures and z-scores are now automatically attached to the .shp file and can now be visualized in the Arcview-View window.
[Figure: LISA - local Moran map. LISA results will be saved; results are automatically joined to the .shp file.]

S-plus/SpatialStats samples: Spatial Regression Models
To estimate a spatial regression model, choose Spatial Linear Regression from the Arcview-Spatial Statistics menu.
- Select the dependent variable and the independent variables.
- Spatial relations are specified.
- Automatic joins
- Various model types
- Display results
Spatial Regression Models - output #2
The model output contains Moran's index of spatial association for the model residuals. It appears that the spatially lagged model accounts for most of the spatial autocorrelation in the data, since the residual spatial autocorrelation is not significant.
[Figure: Spatial Regression Model 2 - spatial residual map]

SpaceStat samples
www.spacestat.com
Arcview interface options to SpaceStat

SpaceStat samples
Regress Module (spatial regression analysis)

GeoVista Studio (Penn State) ESTAT
http://www.geovista.psu.edu/estat/
ESTAT Lab on Monday June 12, 2006

GeoDa can be found at the CSISS site: http://www.csiss.org
Under Spatial Tools

GeoDa Webpage: http://geoda.uiuc.edu/default.php

GeoDa Datasets
\GeoDa\Sample Data
- Baltimore
- Columbus
- National
- SIDS (North Carolina)
- South
- St. Louis

GeoDa Datasets
Columbus Variable Codebook
HTML files

GeoDa Datasets
National Consortium on Violence Research Codebook
HTML files
See paper by Baller et al. (2001) Criminology 39, 561-590.

GeoDa Basics and Some Tips

GeoDa: Data Formats
To work with GeoDa, your data must have the following characteristics:
1. Be continuously (as opposed to categorically) distributed
2. Contain no missing values
3. Refer to discrete areal units
4. Contain a unique ID

GeoDa: No Missing Data
GeoDa cannot deal with missing values: it will fill blank fields with zeros or treat missing-value codes such as 99, -1, etc. as observed values. There is no easy solution to this problem. Some options include excluding missing observations, re-saving your shape file for only those areas without missing values, or interpolating missing values. (Note: care needs to be taken that this interpolation is not based on the values of immediate neighbors, otherwise spatial autocorrelation is introduced by design.)
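One way to screen for blanks and missing-value codes before loading data into GeoDa is sketched below in Python. This is only an illustration: the record layout, the field name `rate`, and the sentinel codes are assumptions, and a value such as 99 may be a legitimate observation in your own data.

```python
# Hypothetical missing-value codes; replace with the codes in your own codebook.
SENTINELS = {99, -1}

def split_complete(records, fields, sentinels=SENTINELS):
    """Separate records with usable values from those with blanks or sentinel codes."""
    complete, incomplete = [], []
    for rec in records:
        bad = any(rec.get(f) is None or rec.get(f) in sentinels for f in fields)
        (incomplete if bad else complete).append(rec)
    return complete, incomplete

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "rate": 4.2},
    {"id": 2, "rate": None},   # blank field
    {"id": 3, "rate": 99},     # sentinel code
    {"id": 4, "rate": 6.1},
]
complete, incomplete = split_complete(records, ["rate"])
print([r["id"] for r in complete])    # ids safe to keep
print([r["id"] for r in incomplete])  # ids to exclude or review
```

The incomplete records can then be dropped before re-saving the shape file, or reviewed one by one.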

Choice of a weights matrix
The choice of your weights matrix is mainly a substantive rather than a technical choice, i.e., it depends on your theory/definition of who your relevant neighbors are. For instance, if you are exploring a phenomenon where you expect the spatial structure to be concentrated around particular locations rather than dispersed, you would want to choose a matrix that only defines your immediately bordering locations as neighbors, and vice versa. In choosing a weights matrix you want to ask which neighbors' values should be averaged in comparison to a particular location.

Choice of Weights Matrix
If you do not have strong prior theoretical grounds to go by, you can also create different weights matrices and explore how sensitive your outputs are to the differences in matrices. By linking and brushing the weights characteristics with choropleth maps, you can get a better sense of which weights matrices capture which neighborhood structure and which might be most appropriate for your purposes.


Choice of Weights Matrix
Also, before embarking on the computation of spatial autocorrelation statistics, it is a good idea to check the spatial weights matrix for the presence of islands (unconnected observations) and other undesirable characteristics (see TOOLS>WEIGHTS>PROPERTIES).

Islands
The problem with running spatial regressions on data that contain islands is that, by definition, it does not make sense to assess a spatial relationship between units that are spatially isolated. If the isolated area is actually connected to other areas in a substantive sense (e.g., through transportation, trade, etc.), you can assign it as a neighbor to other areas by editing your contiguity weights matrix or by using distance weights.
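Checking for islands amounts to looking for observations with an empty neighbor list. The Python sketch below illustrates the idea on a hypothetical neighbor dictionary; it does not read GeoDa weights files, and the area names are made up.

```python
def find_islands(neighbors):
    """Return the ids of observations that have no neighbors."""
    return sorted(i for i, nbrs in neighbors.items() if not nbrs)

# Hypothetical contiguity structure: area D touches nothing.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
islands = find_islands(neighbors)
print(islands)

# If D is substantively connected to C (e.g., by a ferry route),
# the link can be added by hand, in both directions.
neighbors["D"].append("C")
neighbors["C"].append("D")
print(find_islands(neighbors))
```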

GeoDa uses a Row-Standardized Weights Matrix
The results in GeoDa's global and local measures of spatial autocorrelation, its spatial lag, and its spatial regressions are based on row-standardized weights. A weights matrix is row-standardized when the values in each of its rows sum to one. By convention, a location is not counted as its own neighbor, so its own weight (the diagonal element of the matrix) is set to 0.
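Row standardization can be illustrated with a few lines of Python. This is a minimal sketch on a made-up 3x3 binary contiguity matrix, not GeoDa's own code; rows with no neighbors are left as zeros here, which is an assumption of this sketch.

```python
def row_standardize(w):
    """Divide each row by its sum so that every non-empty row sums to one."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

# Hypothetical binary contiguity matrix: unit 0 borders units 1 and 2.
w = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
ws = row_standardize(w)
for row in ws:
    print(row, sum(row))
```

Each of unit 0's two neighbors receives a weight of 1/2, while the diagonal stays at 0.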

Consider this lattice (example courtesy of Paul Voss, Wisconsin)
 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16


Consider these y_i values, i = 1, ..., 16 (cell number: value)
 1: 7    2: 6    3: 4    4: 5
 5: 4    6: 5    7: 4    8: 4
 9: 5   10: 6   11: 3   12: 4
13: 3   14: 4   15: 1   16: 2

For i = 6, the spatial lag Wy_i is given by:
Wy_i = \sum_j w_{ij} y_j
     = \frac{1}{8}(7) + \frac{1}{8}(6) + \frac{1}{8}(4) + \frac{1}{8}(4) + \frac{1}{8}(4) + \frac{1}{8}(5) + \frac{1}{8}(6) + \frac{1}{8}(3)
     = \frac{39}{8} \approx 4.9
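The lag for cell 6 can be checked with a short Python sketch. It assumes queen contiguity (all eight surrounding cells count as neighbors, which matches the weight of 1/8 used above) and the y values as read from the lattice slide; the function names are illustrative only.

```python
# y values for cells 1..16 of the 4x4 lattice, row by row.
Y = [7, 6, 4, 5,
     4, 5, 4, 4,
     5, 6, 3, 4,
     3, 4, 1, 2]
N_ROWS = N_COLS = 4

def queen_neighbors(cell):
    """Return the 1-based ids of the cells touching `cell` by an edge or a corner."""
    r, c = divmod(cell - 1, N_COLS)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # a cell is not its own neighbor
            rr, cc = r + dr, c + dc
            if 0 <= rr < N_ROWS and 0 <= cc < N_COLS:
                out.append(rr * N_COLS + cc + 1)
    return out

def spatial_lag(cell):
    """Row-standardized lag: the plain average of the neighbors' y values."""
    nbrs = queen_neighbors(cell)
    return sum(Y[j - 1] for j in nbrs) / len(nbrs)

print(queen_neighbors(6))  # the eight cells surrounding cell 6
print(spatial_lag(6))      # 39 / 8
print(spatial_lag(1))      # a corner cell has only three neighbors
```

Because the weights are row-standardized, a corner cell such as cell 1 averages over three neighbors with weight 1/3 each rather than eight with weight 1/8.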

Distance-based definition of the contiguity matrix
K Nearest Neighbors (KNN) is a distance-based definition of neighbors where "k" refers to the number of neighbors of a location. The neighbors of a point are the k points closest to it, based on the distances between points (for polygons, the distances between their central points). It is often applied when areas (counties) have different sizes, to ensure that every location has the same number of neighbors regardless of how large the neighboring areas are.

Distance-based definition of the contiguity matrix
K Nearest Neighbors weights matrices can be created in GeoDa. Note: these KNN weights matrices are asymmetric (e.g., point A may be B's nearest neighbor while point B is not A's nearest neighbor). Because of this asymmetry, it is currently not possible to correctly estimate spatial lag or error models with KNN weights.
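The asymmetry is easy to see on three points along a line. The Python sketch below uses hypothetical coordinates and plain Euclidean distance; it illustrates the definition rather than GeoDa's implementation.

```python
import math

def knn(points, k):
    """Map each point id to the ids of its k nearest other points."""
    out = {}
    for i, p in points.items():
        ranked = sorted((math.dist(p, q), j) for j, q in points.items() if j != i)
        out[i] = [j for _, j in ranked[:k]]
    return out

# Hypothetical centroids: A and B are close together, C is further away.
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (3.0, 0.0)}
nn = knn(points, k=1)
print(nn)

# Pairs (i, j) where j is a neighbor of i but i is not a neighbor of j.
asymmetric = [(i, j) for i in nn for j in nn[i] if i not in nn[j]]
print(asymmetric)
```

Here B is C's nearest neighbor, but B's own nearest neighbor is A, so the corresponding weights matrix is not symmetric.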

Constructing Spatially Lagged Variables
(see An Introduction to Spatial Autocorrelation Analysis with GeoDa, Luc Anselin, 2003, pp. 5-7)
You can add spatial lags for any variable in your data set using the Table Calculation options. With a table active, right click and select Add Column (or use the Options menu). In the dialog, specify a meaningful name for a spatial lag variable (conventionally a W is placed in front of a variable name to indicate it is a spatially lagged variable). Note that before you can compute the lag, you must create an empty column in the table to contain it. Also, you must make sure a spatial weights file has been opened (click on the Open Weights toolbar button and specify the file name).

Constructing Spatially Lagged Variables
In the table, right click and select Field Calculation. You will need to click on the third tab for Lag Operations.

Constructing Spatially Lagged Variables
Next, select the new W variable as the Result in the drop-down list, make sure the correct weights file is specified, and choose the original variable as the Variable.

Constructing Spatially Lagged Variables
Click on OK to create the new variable. Its values will be added in the new column. To permanently add the new field to the table, right-click on the table and go to Save Shape File As...

Adding Centroids (central points) to Tables
Right-click on the map.

Explore > Conditional Plots
Conditional plots (Trellis graphs) are 3x3 micromap (or microplot) matrices. They visualize multivariate relationships (three or four variables in two dimensions). They consist of nine smaller plots or maps of one continuous variable (two for the scatter plot option), conditioned on two other variables. The interval breaks of the two other variables can be controlled in GeoDa.

Explore > Conditional Plots
This multivariate analysis might reveal interaction effects masked by univariate exploratory analysis (interaction effects exist when the distribution in a sub-view differs from the rest). By manipulating the handles in the conditional plots, you can analyze the sensitivity of the main variable to the conditioning variables.
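The logic behind the 3x3 layout can be sketched in Python: each observation is assigned to one of three intervals on each conditioning variable, and the main variable is collected per cell. The records, variable names, and break points below are hypothetical; moving the break points plays the role of dragging the handles in GeoDa.

```python
def bin_index(value, breaks):
    """Return 0, 1, or 2: how many of the two break points the value exceeds."""
    return sum(value > b for b in breaks)

def conditional_panels(records, main, cond_x, cond_y, breaks_x, breaks_y):
    """Group the main variable into a 3x3 grid keyed by (row, column) bins."""
    panels = {(r, c): [] for r in range(3) for c in range(3)}
    for rec in records:
        row = bin_index(rec[cond_y], breaks_y)
        col = bin_index(rec[cond_x], breaks_x)
        panels[(row, col)].append(rec[main])
    return panels

# Hypothetical records and break points, for illustration only.
records = [
    {"rate": 3.0, "income": 10, "density": 5},
    {"rate": 8.0, "income": 25, "density": 5},
    {"rate": 4.0, "income": 10, "density": 50},
    {"rate": 9.0, "income": 40, "density": 90},
]
panels = conditional_panels(records, "rate", "income", "density",
                            breaks_x=(20, 30), breaks_y=(20, 60))
for key in sorted(panels):
    print(key, panels[key])
```

Comparing the distribution of the main variable across the nine cells is what reveals the interaction effects described above.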

Explore > Conditional Plots > Map View

Explore > Conditional Plots > Scatter Plots

Box Plots
Unemployment rate 1990 (NAT)
Box plots are particularly useful to identify outliers and gain an overview of the spread of a distribution. The box plot (sometimes referred to as a box and whisker plot) is a non-parametric method. For normally distributed data, the median corresponds to the mean, and the inter-quartile range is proportional to the standard deviation (about 1.35 standard deviations). The box plot shows the median, first and third quartile of a distribution (the 50%, 25% and 75% points in the cumulative distribution) as well as outliers.

Box Plots
Unemployment rate 1990 (NAT)
An observation is classified as an outlier when it lies more than a given multiple of the inter-quartile range (the difference in value between the 75% and 25% observation) above the value for the 75th percentile or below the value for the 25th percentile. The standard multiples used are 1.5 and 3 times the inter-quartile range. This multiple is the hinge; the default criterion for the hinge is 1.5. Observations in the 1st and 4th quartiles are shown as blue dots.
[Figure labels: IQR, hinge, median]
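The outlier rule can be written out in a few lines of Python. This sketch uses the standard library's `statistics.quantiles` with the "inclusive" method to obtain the quartiles; GeoDa's exact quartile convention may differ, and the data are made up.

```python
import statistics

def box_plot_outliers(values, hinge=1.5):
    """Return the values lying more than hinge * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - hinge * iqr
    upper = q3 + hinge * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one unusually high value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 16]
print(box_plot_outliers(data, hinge=1.5))
print(box_plot_outliers(data, hinge=3.0))
```

With these data the value 16 is flagged under the default hinge of 1.5 but not under the stricter hinge of 3, which shows how the choice of multiple changes what counts as an outlier.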

Unemployment rate 1990 (NAT): outliers, highest values (hinge = 1.5)

Unemployment rate 1990 (NAT): observations in the 4th quartile (i.e., highest values)

Standard Deviational Map
A standard deviational map highlights differences in standardized values from the mean. GeoDa's standard deviational map displays the data in 7 categories: the mean, and three standard deviational units above and below the mean. The standard deviational map is the parametric counterpart to the box map.

Bivariate Moran Scatter Plot
In a bivariate Moran scatter plot, y and x are different variables. The neighboring values of one variable (y) are regressed on the values of another variable (x). In the bivariate scatter plot the standardized version of one variable (y) can be regressed on the lag of another variable (x).

GeoDa Exercise
Nepal Handout, page 32.

LISA
Local Indicators of Spatial Association (LISA) indicate the presence or absence of significant spatial clusters or outliers for each location. A randomization approach is used to generate a spatially random reference distribution to assess statistical significance. The Local Moran statistic implemented in GeoDa is a special case of a LISA. The average of the Local Moran statistics is proportional to the Global Moran's I value.
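The link between the local and global statistics, and the idea of a randomization-based reference distribution, can be sketched in Python. The example assumes row-standardized weights and z-scores based on the population standard deviation, in which case the average of the Local Moran values equals Moran's I. The permutation scheme (hold the value at location i fixed, reshuffle the others, one-sided pseudo p-value) is one common variant and is not claimed to match GeoDa's implementation exactly; the data and neighbor lists are hypothetical.

```python
import random

def standardize(y):
    """Convert values to z-scores using the population standard deviation."""
    n = len(y)
    mean = sum(y) / n
    sd = (sum((v - mean) ** 2 for v in y) / n) ** 0.5
    return [(v - mean) / sd for v in y]

def lag(z, nbrs):
    """Row-standardized spatial lag: the mean of z over the listed neighbors."""
    return sum(z[j] for j in nbrs) / len(nbrs)

def local_moran(y, neighbors):
    """Local Moran for each location: z_i times the lag of z at i."""
    z = standardize(y)
    return [z[i] * lag(z, nbrs) for i, nbrs in enumerate(neighbors)]

def global_moran(y, neighbors):
    """Moran's I with row-standardized weights (the weights then sum to n)."""
    mean = sum(y) / len(y)
    d = [v - mean for v in y]
    num = sum(d[i] * lag(d, nbrs) for i, nbrs in enumerate(neighbors))
    return num / sum(v * v for v in d)

def pseudo_p(y, neighbors, i, permutations=999, seed=12345):
    """One-sided pseudo p-value from conditional permutations at location i."""
    rng = random.Random(seed)
    z = standardize(y)
    k = len(neighbors[i])
    observed = z[i] * lag(z, neighbors[i])
    others = [z[j] for j in range(len(z)) if j != i]
    hits = 0
    for _ in range(permutations):
        sim = z[i] * sum(rng.sample(others, k)) / k
        if (observed >= 0 and sim >= observed) or (observed < 0 and sim <= observed):
            hits += 1
    return (hits + 1) / (permutations + 1)

# Hypothetical chain of six areas: high values at one end, low at the other.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
y = [10, 9, 8, 2, 1, 0]
local = local_moran(y, neighbors)
print(sum(local) / len(local), global_moran(y, neighbors))
print(pseudo_p(y, neighbors, 0))
```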

LISA
LISA maps are particularly useful to assess the hypothesis of spatial randomness and to identify local hot spots. However, since LISA maps are univariate, they may mask multivariate associations, variability related to scale mismatch, and other spatial heterogeneity.


Saving LISA Results
You can add the results of the LISA analysis to a table by invoking Option > Save Results or by right-clicking on the map (select Save Results). The three options are:
a) the Local Moran statistic for each location (LISA indices)
b) the indicator for the type of spatial autocorrelation (Clusters)
c) the significance or p-value for the Local Moran statistic

Saving LISA Results
The cluster indicators take on five values: 0 for not significant, 1 for high-high, 2 for low-low, 3 for high-low and 4 for low-high. As before, these additions do not become permanent until the table has been saved.
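The five cluster codes can be reproduced from a location's standardized value, its spatial lag, and its p-value. The Python sketch below follows the coding listed on this slide; the 0.05 significance cut-off and the function name are assumptions, and values exactly at zero are assigned to the high-low or low-high class arbitrarily.

```python
def cluster_code(z_i, lag_i, p_value, alpha=0.05):
    """Classify a location using the coding on this slide.

    0 = not significant, 1 = high-high, 2 = low-low,
    3 = high-low, 4 = low-high.
    """
    if p_value > alpha:
        return 0
    if z_i > 0 and lag_i > 0:
        return 1  # high value surrounded by high values
    if z_i < 0 and lag_i < 0:
        return 2  # low value surrounded by low values
    if z_i > 0:
        return 3  # high value surrounded by low values
    return 4      # low value surrounded by high values

# Hypothetical (z, lag, p) triples, for illustration only.
examples = [(1.2, 0.9, 0.01), (-1.1, -0.8, 0.02),
            (1.5, -0.7, 0.03), (-0.9, 1.1, 0.04), (1.2, 0.9, 0.40)]
codes = [cluster_code(z, l, p) for z, l, p in examples]
print(codes)
```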

This afternoon: Spatial Regression Modeling using GeoDa

E-mail: matthews@pop.psu.edu