This hands-on application introduces the foundational methods of spatial data analysis available in GeoDa. We will undertake an exploratory spatial data analysis of 1,387 southern counties.

Objectives
- Carry out visualization and exploratory spatial data analysis (ESDA)
- Assess global and local spatial autocorrelation
- Review spatial diagnostics
- Conduct a spatial regression analysis

Shapefiles. First, a word on shapefiles. Conceptually, a shapefile is a specialized file that links a digital data file (e.g., attribute data in an Excel spreadsheet) with a digital map file. You need a variable that is common to both the data table and the map objects and that provides a unique, one-to-one link between the attribute values in the data table and the corresponding object (e.g., polygon) in the map. The unique identifier in the example data is FIPS (county id FIPS 2000).

Attribute Data. ArcGIS reads DBF extensions. Data in any statistical software format can be converted into a DBF using StatTransfer, DBMS/COPY, or a comparable program, and linked to the map file in ArcGIS.

Map Data. Although shapefiles for contemporary periods are easier to locate than historical files, the number and availability of historical files is expanding. Shapefiles are available for geographical units in the U.S. and abroad. An inquiry through any online search engine will lead you to a number of potential sources, but always verify the accuracy of your source. Our example data are from the U.S. and are constructed from the U.S. Census Bureau TIGER files. Note that the unit of observation must be consistent between the two data files (i.e., county attribute data must be linked to county map data; linking county data to census tract map data, for example, will not work).

A technical, though somewhat dated, description of shapefiles can be found in the online library of ESRI (the makers of ArcGIS) at the following URL: http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf

Several files comprise the shapefile. The following are necessary:
- *.shp: main file that contains the feature geometry
- *.shx: index file that contains the index of the feature geometry
- *.dbf: database of attributes (standard, non-spatial data)

Things to keep in mind when working with your own shapefile:
- Your data should include a numeric ID variable
- Note whether the X and Y coordinate fields are latitude and longitude, whether the shapefile is projected, and the unit of distance

Keeping to these guidelines will help your analyses run more smoothly. For example, knowing the unit of distance makes distance-based weighting criteria more transparent; our southern county data use meters (about 1,609 meters per mile).
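The one-to-one link between the attribute table and the map objects, and the meters-to-miles conversion mentioned in the guidelines, can be checked outside GeoDa before loading any data. The following is a minimal sketch in plain Python; the function names (`link_report`, `is_one_to_one`, `meters_to_miles`) are hypothetical helpers invented for illustration, and the sketch assumes the key values (e.g., FIPS codes) have already been read into two lists.

```python
from collections import Counter

METERS_PER_MILE = 1609.344  # length of the international mile in meters


def link_report(table_ids, map_ids):
    """Summarize how well a key field links table rows to map objects."""
    table_counts = Counter(table_ids)
    map_counts = Counter(map_ids)
    return {
        # Keys that appear more than once break the "unique" requirement.
        "duplicated_in_table": sorted(k for k, n in table_counts.items() if n > 1),
        "duplicated_in_map": sorted(k for k, n in map_counts.items() if n > 1),
        # Keys present on one side only break the "one-to-one" requirement.
        "missing_from_map": sorted(set(table_counts) - set(map_counts)),
        "missing_from_table": sorted(set(map_counts) - set(table_counts)),
    }


def is_one_to_one(report):
    """True when no duplicates and no unmatched keys were found."""
    return not any(report.values())


def meters_to_miles(meters):
    """Convert a distance in meters to miles."""
    return meters / METERS_PER_MILE
```

A report with any non-empty list points to records that should be fixed before the table is joined to the map.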

* Part 1: ESDA. Open GeoDa and load south00.shp using FIPS as the key field. The general goal of ESDA is to seek a good understanding and description of the data that will inform hypotheses to explore. We are looking for clues about the spatial processes present in our data (i.e., spatial heterogeneity and spatial dependence). A text file containing a description of the variables is available to you, labeled vardescription_south00.doc. You may want to refer to this documentation when working through the examples. We will use non-spatial tools in combination with spatial tools to get to know our data.

Choropleth Maps
- Quantile map of child poverty, PPOV (Map>Quantile)
  o Choose 4 classes
- Duplicate map (Edit>Duplicate Map, or toolbar icon)
- Quartile map of the proportion of family households headed by a female with own children under age 18 and no husband present, PFHH (Map>Quantile)
  o Choose 4 classes
- Compare the number of observations in each quartile; they are the same (Window>Tile Vertical or Tile Horizontal)
  o Note: If there is a 0 for one of the classes (usually the first), this suggests that fewer classes, a percentile map, and/or a standard deviation map might be better tools to explore this variable
- One can zoom in or out and return to the full extent of the map by right clicking (Zoom>Zoom In/Zoom Out/Full Extent) and selecting the desired area

Histograms
- Duplicate the map (Edit>Duplicate Map, or toolbar icon)
- Graph child poverty, PPOV (Explore>Histogram), and proportion of family households headed by female, PFHH (Explore>Histogram)
- Click on the furthest right column of the PPOV histogram
  o Notice that it highlights the corresponding observations within the PFHH histogram in addition to the maps (it is clearest in the duplicated map given the contrasting colors; note that you can change the color of the map by right clicking Color>Map)
  o This is the dynamically linked objects feature of GeoDa
- Note: Consider transformations at this point. Would the data be better behaved if we were to take the square root of PPOV (SQRTPPOV)?

Box Plots
- Plot child poverty, PPOV (Explore>Box Plot)
  o Helps identify potential outliers
  o Hinges can be drawn at 1.5 or 3 times the interquartile range (between 25% and 75%)
- Plot child poverty, PPOV (Map>Box Map>Hinge 1.5 or 3)
- Select the potential outliers in the box plot and identify the cases in the box map
  o Again, this is the dynamically linked objects feature of the software
  o The cases are also highlighted in the attribute table (e.g., Starr County, TX has the highest value)

Scatter Plots
- Plot PPOV and PFHH (Explore>Scatter Plot)

  o Positive correlation: higher child poverty is associated with a higher proportion of female-headed households
- Standardize the scatter plot by right clicking (ScatterPlot>Standardized data) to yield a correlation coefficient (the slope in the standardized plot equals the correlation rather than the bivariate regression slope)
  o It is rescaled to standard deviational units, so any observation greater than 2 on the x axis or y axis might be considered an outlier
  o Any outliers? If so, where are they located on the map?
  o Highlight outliers to see the distribution on the map and within plots of other variables if you'd like (e.g., histogram, box plot)
- Exclude Feature
  o Right click and choose Exclude selected
  o PPOV and PFHH: the slope in the scatter plot changes when excluding the outliers highlighted in the correlation plot (see the blue and red values at the top and the lines within the plot)

Parallel Coordinate Plot (Multivariate Scatter Plot)
- Plot PPOV, PFHH and PUNEM (Explore>Parallel Coordinate Plot)
- Each variable is plotted on a separate axis for each case
  o Values observed for each variable run from the lowest (left) to the highest (right) on each axis
- Notice the potential outliers and how their patterns compare to the other cases
- The plot can be standardized (right click>Standardized Data Set)
- Better with fewer data points

Conditional Scatter Plot
- Example of geographic conditioning (Explore>Conditional Plot>Scatter Plot)
  o X Variable: XCOORD (left: West; right: East)
  o Y Variable: YCOORD (top: North; bottom: South)
  o Variable 1: PPOV
  o Variable 2: PFHH
- Bounds cannot change, but divisions within the bounds can, by clicking and dragging the nodes at the end of the bound lines on the x axis and y axis
- Scatter plots can be standardized individually (right click on plot>ScatterPlot>Standardized data)
- Imagine a map overlaid on top of the scatter plots
- Where, geographically, is the correlation highest? Where is it lowest?
  o Is there a spatial drift?
  o What are the sub-regional differences?

Conditional Map
- Another example of geographic conditioning (Explore>Conditional Plot>Map View)
  o Demonstrates the extent to which there is systematic variation in selected variables across geographic sub-regions
  o Univariate, unlike the conditional scatter plot
  o Like the parallel coordinate plot, it is difficult to read with many observations
- Where, geographically, is child poverty highest? And lowest?
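Two of the ideas in Part 1 are easy to verify by hand: the box plot hinges at 1.5 or 3 times the interquartile range, and the fact that the slope of a standardized scatter plot equals the correlation coefficient. The sketch below uses only the Python standard library. The helper names are hypothetical, and the quartile rule (`method="inclusive"`) is an assumption; GeoDa may compute quartiles slightly differently, so flagged cases near the fences can differ.

```python
import statistics


def box_fences(values, hinge=1.5):
    """Lower and upper fences at `hinge` times the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - hinge * iqr, q3 + hinge * iqr


def potential_outliers(values, hinge=1.5):
    """Values that fall outside the fences."""
    low, high = box_fences(values, hinge)
    return [v for v in values if v < low or v > high]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def ols_slope(x, y):
    """Slope of the bivariate least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx
```

Calling `ols_slope(standardize(x), standardize(y))` returns the correlation between `x` and `y`, which is why the standardized plot reports a correlation rather than a regression slope in the original units.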

Things to Consider:
- What have you learned from the exploratory spatial analysis that would not be suggested by a non-spatial exploratory analysis of the same data?
- What clues have you gained about the spatial process generating your data?
- What types of actions (e.g., transformations) might be necessary and why?

** Part 2: Global and Local Spatial Autocorrelation: We need a spatial weights matrix to assess spatial autocorrelation. Do you anticipate the phenomenon of interest to operate according to an adjacency model or a distance model? If adjacency, which order (i.e., 1st, 2nd)? If distance, how far? The answers to these questions inform the creation and selection of the spatial weights matrix used to diagnose and to treat spatial autocorrelation. In GeoDa, you can create two types of contiguity weights matrices: rook and queen. You can also create two types of distance weights matrices: geographic distance or k nearest neighbors.

Create Queen 1st Order
- Tools>Weights>Create
- Note: ID variable must be selected (FIPS)
  o Again, this ensures a complete match between the data in the table and the corresponding contiguity in the weights file
- Choose queen 1st order (name queen1)

Review Queen 1st Order
- Open with Notepad or Wordpad
- Review the first line:
  o 0, a flag (nothing important)
  o 1387, number of observations
  o south00, name of shapefile from which the weights are derived
  o FIPS, name of the key variable
- Open table (toolbar icon)
- Select FIPS #45075
  o Remember map: how many neighbors?
  o Look at weights file: how many neighbors?
- Select in the table the neighbors specified in the weights file and look at the distribution on the map
  o Any difference compared to the rook weights matrix?
- Open connectivity histogram (Tools>Weights>Properties>queen1.GAL)
  o What is the range in the number of neighbors?
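The header fields and neighbor counts reviewed in the queen weights file can also be read programmatically. The sketch below is a minimal Python parser under stated assumptions: the header holds four whitespace-separated fields (flag, number of observations, shapefile name, key variable), and each observation then occupies one line with its ID and neighbor count followed by one line listing the neighbor IDs. The function names are hypothetical, and real files should be checked against this assumed layout before relying on the output.

```python
from collections import Counter


def parse_gal(text):
    """Parse GAL-style contiguity text into (header, neighbors)."""
    # Blank lines are dropped, so an observation with 0 neighbors is
    # assumed to have either no neighbor line or an empty one.
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    flag, n_obs, shapefile, key = lines[0]
    header = {"flag": flag, "n_obs": int(n_obs),
              "shapefile": shapefile, "key": key}
    neighbors = {}
    i = 1
    while i < len(lines):
        obs_id, count = lines[i][0], int(lines[i][1])
        i += 1
        ids = []
        if count > 0:
            ids = lines[i]
            i += 1
        if len(ids) != count:
            raise ValueError(f"{obs_id}: expected {count} neighbors, found {len(ids)}")
        neighbors[obs_id] = ids
    if len(neighbors) != header["n_obs"]:
        raise ValueError("number of observations does not match the header")
    return header, neighbors


def connectivity_histogram(neighbors):
    """Count how many observations have 1, 2, 3, ... neighbors."""
    return Counter(len(ids) for ids in neighbors.values())
```

The minimum and maximum keys of the returned counter give the range in the number of neighbors, the same quantity the connectivity histogram displays.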
Create Minimum Threshold
- Tools>Weights>Create
- Note: ID variable must be selected (FIPS)
- Use Euclidean distance since the centroids are in a projected format (if they were in latitude and longitude, you would select Arc Distance)
- Select the x and y coordinates (XCOORD and YCOORD)

- Choose minimum threshold; the default (~93,814 meters or ~58 miles) ensures that each observation has at least one neighbor (name mindis)
  o Anything smaller may result in islands (not in the example data)

Review Minimum Threshold
- Open file in Notepad or Wordpad
- Header is the same as the contiguity matrices, but the rest of the file differs
  o For each pair, column 1 reports the origin, column 2 reports the destination, and column 3 reports the distance between the origin and destination
- Open the connectivity histogram (Tools>Weights>Properties>mindis.GWT)
  o The distribution is much wider than for contiguity matrices (ranges from 1 to 41 neighbors)
  o Range can be even greater when points have an irregular distribution (to give islands neighbors)
  o Minimum threshold may be too large for most points in the data set (Bourbon County, KY (21017), Fayette County, KY (21067), and Jessamine County, KY (21113) each have 41 neighbors)
  o k nearest neighbor may be a better option for a distance matrix

Create K Nearest Neighbor
- Tools>Weights>Create
- Note: ID variable must be selected (FIPS)
- Select the x and y coordinates (XCOORD and YCOORD)
- Try k4 and/or k6 (name k4 and/or k6)
  o Others? Why?

Review K Nearest Neighbor
- Check connectivity (Tools>Weights>Properties>k4.GWT or k6.GWT)
  o Not for information regarding the distribution, since all observations have the same number of neighbors, but to confirm that the weights are correct
- Open file in Notepad or Wordpad
- Header is the same as the contiguity matrices, but the rest of the file differs
  o For each pair, column 1 reports the origin, column 2 reports the destination, and column 3 reports the distance between the origin and destination
- Select FIPS #45075
  o How do the neighbors compare to the queen contiguity matrix?
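The two distance-based schemes can be illustrated on a handful of projected centroids. The sketch below assumes that the minimum threshold is the largest nearest-neighbor distance in the data, which is the smallest band that leaves no observation without a neighbor; this matches the behavior described for the default, but the function names and the toy coordinates used to exercise them are hypothetical. It also shows why k nearest neighbor relations need not be mutual.

```python
import math


def minimum_threshold(points):
    """Largest nearest-neighbor distance: the smallest band with no islands."""
    nearest = [
        min(math.dist(p, q) for j, q in points.items() if j != i)
        for i, p in points.items()
    ]
    return max(nearest)


def distance_band(points, threshold):
    """Neighbors are all other points within `threshold` (Euclidean)."""
    return {
        i: sorted(j for j, q in points.items()
                  if j != i and math.dist(p, q) <= threshold)
        for i, p in points.items()
    }


def k_nearest(points, k):
    """Neighbors are the k closest other points."""
    result = {}
    for i, p in points.items():
        others = sorted((j for j in points if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result[i] = sorted(others[:k])
    return result


def is_symmetric(neighbors):
    """True when every neighbor relation is mutual."""
    return all(i in neighbors[j] for i, js in neighbors.items() for j in js)
```

With an irregular point pattern, the distance band gives isolated points one neighbor and clustered points many, while k nearest neighbors fixes the count at k at the cost of symmetry.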
- Note: k nearest neighbor weights cannot be used in spatial regression analyses since these weights are asymmetric
  o GeoDa will perform the regression, but the results will be inaccurate

Spatial Autocorrelation: We will use and compare the different weights matrices that we have created to assess the extent of spatial autocorrelation in our data and determine which spatial weights matrix is the most useful for our data.

Univariate Moran Scatter Plot (Global Spatial Autocorrelation)
- Moran's I statistic is an indicator of global spatial autocorrelation
  o Examine the extent and nature of spatial autocorrelation in child poverty, SQRTPPOV (square root transformed) or PPOV, if you prefer (Space>Univariate Moran), using the various weights matrices
  o Contiguity: rook and queen, 1st and 2nd order
  o Distance: minimum threshold, k nearest neighbor
- Review the 4 quadrants:
  o High-high, low-low (positive autocorrelation)
  o High-low, low-high (negative autocorrelation)
- What is the value of the Moran statistic?
- Permutations (right click>Randomization># of Permutations)
  o Recalculates the statistic many times to generate a reference distribution, which is compared to the obtained statistic (the estimate of I) to compute a pseudo significance level
  o Reference distribution: burgundy
  o Statistic: yellow bar
  o I: statistic value
  o E(I): theoretical mean
  o Mean and SD: of empirical distribution
  o If the observed Moran's I (yellow bar) falls outside of the reference distribution (burgundy), this indicates that the data are not spatially random
- Envelopes (right click>Envelopes ON)
  o Another way of visualizing the significance of the Moran statistic
  o Slopes correspond with the 2.5 and 97.5 percentiles of the reference distribution and, therefore, contain 95% of the distribution of the Moran statistic in spatially random data sets
  o If the observed slope falls outside of the envelope, this indicates that the data are not spatially random
- Which weights matrix has the lowest Moran's I? Why? Which has the highest? Why? Which weights matrix do you prefer? Why?
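The Moran statistic and its permutation-based pseudo significance level can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GeoDa's implementation: weights are row-standardized, every observation has at least one neighbor, and the pseudo p-value is one-sided (upper tail), computed as (number of permuted values at least as large as the observed value + 1) / (number of permutations + 1). The function names are hypothetical.

```python
import random


def morans_i(values, neighbors):
    """Moran's I with row-standardized weights (no islands assumed)."""
    n = len(values)
    mean = sum(values.values()) / n
    z = {i: v - mean for i, v in values.items()}
    # Cross-product of each deviation with its spatial lag (neighbor mean).
    num = sum(z[i] * sum(z[j] for j in js) / len(js)
              for i, js in neighbors.items())
    den = sum(d * d for d in z.values())
    return num / den


def expected_i(n):
    """Theoretical mean of Moran's I under spatial randomness, E(I)."""
    return -1.0 / (n - 1)


def permutation_test(values, neighbors, permutations=999, seed=0):
    """Observed I and a one-sided pseudo p-value from random relabeling."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    ids = list(values)
    vals = [values[i] for i in ids]
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(vals)  # reassign values to locations at random
        if morans_i(dict(zip(ids, vals)), neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)
```

Because the smallest attainable pseudo p-value is 1 / (permutations + 1), running 999 permutations instead of 99 allows finer significance levels, which is one reason to rerun the randomization several times.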
LISA Maps (Local Spatial Autocorrelation)
- LISA maps identify potential spatial clusters
  o In the presence of global autocorrelation, they represent the cases that have more than the average amount of spatial autocorrelation
  o In the absence of global autocorrelation, the clusters represent smaller areas where spatial autocorrelation is evidenced
- Univariate LISA of child poverty, SQRTPPOV (square root transformed) or PPOV, if you prefer (Space>Univariate LISA)
  o Select the queen 1st order contiguity weights matrix
  o Check all 4 options for output: the significance map, the cluster map, the box plot, and the Moran scatter plot

Cluster Map
- Very useful map!
- The type of spatial autocorrelation is color coded and can be saved to the table (i.e., right click>Save Results)
- Notice any patterns in this map?
- Run multiple permutations, as with the scatter plot, to get a sense of the stability and sensitivity of the estimate (right click>Randomization>999 Permutations)

- Also, can filter out lower significance levels (right click>Significance Filter>0.01 etc.); Anselin suggests not putting too much emphasis on the 0.05 level (the default)

Significance Map
- Locations with significant local Moran statistics
- Darker colors are more significant
- Right click to see whether the results change with permutations
- What patterns do you see?

Box Plot
- A tool to suggest potential locations that show different local autocorrelation patterns
- Global Moran statistic is the mean value of the local Moran statistics
- Brush with the cluster map open; notice that positive values in the box plot can be high-high or low-low clusters on the cluster map

Moran Scatter Plot
- The final of the 4 options accompanying the LISA analysis
- Simply the global statistic we reviewed earlier

A Note on Clusters
- High-high and low-low clusters generally are referred to as spatial clusters; high-low and low-high are called spatial outliers
- The clusters displayed in the maps likely extend to additional neighbors of the cluster
  o A group of areal units is classified as a cluster when the value at a location is more similar to its neighbors than would be the case under spatial randomness

Saving Results (data, not images)
- Right click on any of the 4 LISA output windows and you see the option to save
  o Indices: local Moran statistics
  o Clusters: type of cluster (for significant locations only)
  o Significance: p values from the most recent permutation routine
- Data are not automatically appended to the shapefile; must save as a new shapefile (File>Save to Shape File As)

Multivariate Moran Scatter Plot (Global Spatial Autocorrelation)
- Really a bivariate scatter plot of the correlation between the spatially lagged Y variable distribution and the non-spatially-lagged X variable distribution (Space>Multivariate Moran)
  o Y (lagged): SQRTPPOV or PPOV
  o X (non lagged): PFHH
- Graph indicates the extent to which the value at a location for the X variable (PFHH) is correlated with the weighted average of the Y variable (SQRTPPOV or PPOV), with the average computed over the neighboring locations
  o Can highlight to show where, geographically, there is a larger positive correlation and a larger negative correlation
- Compare the results of the bivariate Moran scatter plot using several different weights matrices

Multivariate LISA Maps (Local Spatial Autocorrelation)
- Again, LISA maps identify potential spatial clusters
  o In the presence of global autocorrelation, they represent the cases that have more than the average amount of spatial autocorrelation
  o In the absence of global autocorrelation, the clusters represent smaller areas where spatial autocorrelation is evidenced

- A bivariate map of the correlation between the spatially lagged Y variable distribution and the non-spatially-lagged X variable distribution
  o Y (lagged): PPOV
  o X (non lagged): PFHH
  o Select the queen 1st order contiguity weights matrix
  o Check the following 3 options for output: the significance map, the cluster map, and the box plot (the Moran scatter plot corresponds with those that we created above)
- The maps refer to the local patterns of spatial correlation at a location between PFHH and the average SQRTPPOV or PPOV for its neighbors
  o Again, can highlight, adjust the significance levels and run permutation routines, exclude observations, and save results (right click>)
- Compare the results of the bivariate LISA maps using several different weights matrices

Things to Consider:
- What is a theoretical basis for any observed spatial autocorrelation in your variable of interest? Why would this variable be spatially autocorrelated?
- What does your selection of a specific weight matrix suggest about the underlying spatial relationship present in your data?
- How would you characterize the spatial arrangement of your outcome variable?
- How would you characterize the spatial arrangement of the relationship between Y (e.g., child poverty) and X (e.g., female-headed households)?

*** Part 3: Spatial Diagnostics in OLS. In this part, we will request and interpret the spatial diagnostics of a standard OLS regression model using GeoDa. The diagnostics provide information about the type of spatial process underlying your data and inform your selection of an appropriate spatial regression model (i.e., spatial error or spatial lag in GeoDa). What are important correlates of child poverty that should be included in the regression model? In GeoDa, you can run a series of standard OLS regressions; note that the assumptions of linearity and normality apply. Decisions about variable transformations and outliers should be made before running an OLS regression. The results of the regression, of course, can also assist this analytical process.

Regression
- Run an OLS regression analysis of child poverty and some reasonable correlates (Regress>)
  o Change the output title; this helps keep your records organized when you run multiple models (e.g., OLS1)
  o Change the output title with each run, or it will overwrite the original file; it does not append to a single file
  o The output file is saved to the directory where the data are located
  o The extension is *.OLS and can be read in Wordpad or MS Word
- Specify the output format
  o The Predicted Value and Residual option is not too useful with large data sets since it prints the values for each observation and, thus, creates a huge output (text) file
      - This information can be added to the data table at another point
  o The Coefficient Variance Matrix option provides the variance of the estimates (on the diagonal) and all covariances
      - Used to carry out customized tests of constraints on the model coefficients in statistical packages other than GeoDa (e.g., STATA)
  o The Moran's I z-value option reports an estimate of the spatial autocorrelation in the residuals of the model you are specifying
      - Select this option; the Moran's I value is reported automatically, but tests for statistical significance are reported only when you select this option

- Specify the regression model
  o Dependent Variable: child poverty, SQRTPPOV (square root transformed) or PPOV, if you prefer
  o Independent Variables: What shall we explore?
  o Choose weights matrix (necessary to get spatial diagnostics): Which should we use?
- Choose Classic model
  o Note: In GeoDa the include constant term option is checked by default; uncheck if you have reason to exclude a constant from your model (e.g., fixed effects model)
- Run the model by clicking on the Run button
- Choose Save if you want to add predicted values and residuals to the data table; this is an option only after running the model
  o If you select the OK button before you select the Save button, you will need to rerun the model to get the estimates
  o Name the variables (predicted values and/or residuals) something meaningful (e.g., OLS1_RES)
  o You will need to create a new shapefile to permanently append the new variables to your table (it is like a working file in SAS) (activate the table object>File>Save to Shape File As)

Output File
- An output window automatically appears when selecting OK
  o The file also can be viewed in Wordpad or MS Word; Notepad is not recommended (can open but the format is messy)
- File content:
  o Summary statistics of the model and measures of fit
  o Parameter estimates
  o Model diagnostics
- The F statistic reported in the top section is a test of the null hypothesis that all regression coefficients are jointly 0
  o Not that useful, unless your model is way off base
- 3 important statistics reported at the top for model comparisons:
  o Log likelihood: higher, better (less negative)
  o Akaike Information Criterion (AIC): lower, better (-2L + 2K)
  o Schwarz Criterion (SC): lower, better (-2L + K x ln(N))
  where L is the log likelihood, K is the number of parameters, and ln(N) is the natural log of the number of observations

Standard Diagnostics
- Multicollinearity: not a test statistic, per se, but a diagnostic to suggest problems with the stability of the regression results due to multicollinearity
  o > 30 is problematic, in general
  o Note: high values are common when interaction terms are used since the independent variables are powers and cross products of each other
  o Other Note: I have found this diagnostic to be unreliable in GeoDa, especially with small data sets; examine multicollinearity in other statistical packages (e.g., SAS)
- Normality: Jarque-Bera test
  o Chi-square distribution with 2 df

o Tests the assumption of normality in the errors Heteroskedasticity is tested on three null hypotheses o Breusch Pagan: assumes heteroskedasticity is a function of the squares of the explanatory variables o Koenker Bassett: same as BP, except residuals are studentized (made robust to nonnormality) o White: does not assume a specific functional form of heteroskedasticity A NA is sometimes reported for this test when interactions are included in the model because all square powers and cross products are considered in this test for heteroskedasticity Moran s I (Error) This is the global value, as reported in the scatter plot, less any explanatory value of the predictors and is derived from the errors of the regression model o Usually observe some reduction (compared to original MI on the outcome) o What was our original statistic? How do the values compare? Tests for statistical significance are not reported (i.e., NA is reported) if you did not select the Moran s I z value option when you specified the output Lagrange Multiplier In general, the LM is used in mathematical optimization problems and is a method for finding the local extreme values of a function of several variables subject to one or more constraints Here, the LM gives some indication of which type of spatial regression model is most appropriate o Compare as you add predictors; do not run with the first model output o We are trying to eliminate spatial autocorrelation from our model and can inappropriately estimate it if we haven t exhausted the alternatives to a spatial dependence regression model Error, lag, or SARMA (both lag and error)? 
  o Only consider the robust LM statistics when both standard LM values (lag and error) are statistically significant
  o A larger LM suggests the more likely model
  o SARMA is always significant, it seems, and is not that useful in practice
      It tends to be significant when either lag or error is indicated, not just when a higher order model is
      The value can be compared with the standard LM values; if similar, then it is not picking up a higher order model
Which model is indicated? Have we exhausted other explanations? What about a trend surface or other techniques to address spatial heterogeneity?
Residuals
The predicted and residual values are appended at the end of the table if you chose this option under the Save button when specifying the regression model (open the data table).
Maps
Predicted value maps (Map>Std Dev>predicted value variable saved to table)

  o In essence, smoothed maps, since the random variability due to factors other than those in the model has been smoothed out
Residual maps (Map>Std Dev>residual value variable saved to table)
  o Give a sense of spatial autocorrelation patterns, since they suggest any under- or overprediction in sub-regions
Quantile maps of predicted values and residual values (Map>Quantile>variable)
  o Predicted value quantile map shows where predicted poverty is higher (darker) and lower (lighter)
  o Residual value map is more intuitive, for me, and shows overprediction (lighter) and underprediction (darker)
Where is the model overpredicting? Underpredicting? Is there evidence of spatial clustering? What about the possibility of spatial regimes?
Moran Scatter Plot & LISA Map
Run a Moran scatter plot on the residuals (Space>Univariate Moran)
  o Use the same weights matrix that you used in the regression model
It is purely descriptive
  o Through this approach, we are not able to obtain reliable estimates for significance tests or LISA map construction, because the permutation function ignores the fact that OLS residuals are already correlated by construction
  o Still, it gives you some sense and it is usually not far off base
Construct a LISA map (Space>Univariate LISA)
  o Use the same weights matrix that you used in the regression model
Again, purely descriptive, but somewhat useful in identifying geographic areas where the model does not explain the spatial distribution of the dependent variable
Things to Consider:
What do you think is indicated by the tests for spatial autocorrelation based on the OLS residuals in terms of what model might be a good fit for your data (error or lag)?
Do the diagnostics for spatial autocorrelation change when you use different spatial weights matrices? If so, how?
How does the patterning of positive and negative residuals in the choropleth maps of your OLS residuals relate to your model diagnostics?
What clustering is evidenced in the residuals using LISA maps?
Do you think there might be any processes or omitted variables that could help explain the clustering in the residuals?
**** Part 4: Spatial Error and Spatial Lag Regression.
In this section, we will specify and interpret two spatial regression models: the spatial error model and the spatial lag model. The two approaches have different assumptions and theoretical implications about the form of the spatial process being analyzed. The spatial error model identifies spatial autocorrelation in the error structure of the regression model. The spatial lag model, in contrast, identifies spatial autocorrelation in the covariance structure of the dependent variable.
Spatial Regression
Specify the regression model
  o Dependent Variable: child poverty, SQRTPPOV (square root transformed) or PPOV, if you prefer
  o Independent Variables: What shall we explore? (should be consistent with OLS to be compared)

  o Choose weight matrix: Which should we use? (should be consistent with OLS to be compared)
Run both the Spatial Error and Spatial Lag options for comparison
  o Save the residuals, predicted values, and prediction errors (choose the Save button and give the variables meaningful names)
  o Remember that you will need to create a new shapefile to permanently append the new variables to your table (it is like a working file in SAS) (activate the table object>file>save to Shape File As)
Output File
An output window automatically appears when selecting OK
  o The file also can be viewed in WordPad or MS Word, but not Notepad (it opens, but the format is messy)
The file content is similar to that reported for the classic OLS regression
  o Summary statistics of the model and measures of fit
  o Parameter estimates
  o Model diagnostics
Do not focus on the R-squared; it is a pseudo R-squared that is not directly comparable to the OLS models
  o Instead, use the log likelihood, AIC and SC
  o To review
      Log likelihood: bigger, better (less negative)
      AIC and SC: lower, better
Review the autoregressive coefficient (ρ, spatial lag, or λ, spatial error)
  o Is it significant? What is the direction? Is it what you expected?
Review the explanatory variables
  o Check the signs, significance, and magnitude
Check model heteroskedasticity
  o Only the Breusch-Pagan test is reported (a test on random coefficients that assumes a functional form based on the squares of the explanatory variables)
  o Also, can plot the model residuals (Explore>Scatter Plot)
      Y: residual values
      X: predicted values
Check the likelihood ratio test for the specified spatial form (lag or error, depending on the model)
  o This test compares the spatial model to the non-spatial alternative
  o What is missing in the GeoDa diagnostics is a direct comparison with the alternative spatial model (lag vs. error); can get this through SpaceStat and, hopefully, in future versions of GeoDa
  o For now, we compare the two models on a number of different points (LL, AIC, etc.)
Predicted Values, Prediction Errors and Residuals
Predicted Values: the estimated value of child poverty, $\hat{y} = (I - \hat{\rho} W)^{-1} X \hat{\beta}$

Prediction Errors: the difference between the observed values of child poverty and the predicted values obtained by considering the exogenous variables alone, $y - \hat{y}$, an estimate of $(I - \rho W)^{-1} u$
Residuals: estimates for the model error term, $\hat{u} = (I - \hat{\rho} W) y - X \hat{\beta}$
Construct a univariate Moran scatter plot for the residuals and errors (Space>Univariate Moran)
  o Residuals: should be close to 0 since spatial autocorrelation has been purged from the model or, alternatively phrased, captured in the ρ or λ parameter
  o Prediction Errors: about the same as the original OLS MI statistic
      This is okay since, by definition, they are spatially correlated; the prediction errors are an estimate for the spatially transformed errors
Compare the scatter plots of the lag and error model residuals
  o What does this comparison indicate?
Things to Consider:
Which model, given all of the information we've explored, is a better fit for our data?
What does this model selection mean, conceptually, in terms of our outcome variable?
What, if any, substantive information is gained through spatial regression techniques?
What else would we include in the model, if it were available to us?
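
For readers who want to see how the predicted values, prediction errors, and residuals of a spatial lag model relate to one another, here is a minimal NumPy sketch. It is not GeoDa's code. The 5 x 5 lattice, the simulated data, and all function and variable names (rook_weights, row_standardize, morans_i, rho_hat, and so on) are assumptions made for illustration. In particular, rho_hat is a placeholder value rather than a maximum likelihood estimate; the coefficients are then obtained by least squares on the spatially filtered outcome.

```python
import numpy as np

def rook_weights(k):
    """Binary rook-contiguity weights for a k x k lattice (toy geography)."""
    n = k * k
    W = np.zeros((n, n))
    for r in range(k):
        for c in range(k):
            i = r * k + c
            if r + 1 < k:              # neighbor in the next row
                j = (r + 1) * k + c
                W[i, j] = W[j, i] = 1.0
            if c + 1 < k:              # neighbor in the next column
                j = r * k + c + 1
                W[i, j] = W[j, i] = 1.0
    return W

def row_standardize(W):
    """Scale each row of W so that it sums to 1."""
    return W / W.sum(axis=1, keepdims=True)

def morans_i(z, W):
    """Global Moran's I of vector z under weights W."""
    z = z - z.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(0)
k = 5
n = k * k
W = row_standardize(rook_weights(k))
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Simulate an outcome from a spatial lag process (illustrative parameters).
rho_true = 0.6
beta_true = np.array([1.0, 0.5])
u = rng.normal(scale=0.3, size=n)
y = np.linalg.solve(np.eye(n) - rho_true * W, X @ beta_true + u)

# ASSUMPTION: rho_hat is a placeholder, not an ML estimate from GeoDa.
rho_hat = 0.55
A_hat = np.eye(n) - rho_hat * W
# Given rho_hat, regress the spatially filtered outcome on X.
beta_hat = np.linalg.lstsq(X, A_hat @ y, rcond=None)[0]

predicted = np.linalg.solve(A_hat, X @ beta_hat)   # (I - rho W)^-1 X beta
prediction_error = y - predicted                   # observed minus predicted
residual = A_hat @ y - X @ beta_hat                # (I - rho W) y - X beta

print("Moran's I, residuals:        ", morans_i(residual, W))
print("Moran's I, prediction errors:", morans_i(prediction_error, W))
```

By construction, the prediction errors equal the residuals passed through the inverse spatial filter, which is why they remain spatially correlated even when the residuals are not. With real data, the estimates saved from the GeoDa output would take the place of the placeholder values used here.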