Spatial Regression Modeling
Paul Voss & Katherine Curtis
The Center for Spatially Integrated Social Science
Santa Barbara, CA
July 12-17, 2009
Day 4
Plan for today
- Focus on spatial heterogeneity
- A bit of wrap-up and EDA regarding our regression results from yesterday morning
- Putting it all together: A worked example
- Spatial heterogeneity in relationships
- Geographically Weighted Regression (GWR)
- Lab: Spatial regression diagnostics & model strategies; GWR
Questions?
As class ended yesterday, we quickly looked at the results of a simple OLS multivariate regression model using GeoDa
- Dependent variable: sqrt(ppov)
- Independent variables: sqrt(unem), sqrt(pfhh), log(hdplus)
Compare R results with the same model run in GeoDa. As a reminder, here were the GeoDa results
GeoDa output OLS regression Virtually identical results
Advantages of AIC and SC
- Both measures improve (decline) as R² increases, but they degrade as the model size increases
- Like adjusted R², they place a premium on achieving a given fit with a smaller number of parameters per observation, but the penalty for added variables is greater
- Both criteria have their virtues; neither has an advantage over the other. The SC, with its heavier penalty for degrees of freedom lost, will lean toward a simpler model
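As a rough check on how the two criteria behave, here is a minimal Python sketch. It assumes the usual definitions AIC = -2 lnL + 2k and SC = -2 lnL + k ln(n); the function names are illustrative only. The log likelihood (-578.94) and AIC (1181.88) are the OLS values reported for the child poverty model later in these slides, and n = 3,074 is the county count. The value k = 12 (11 regressors plus the intercept) is an assumption that happens to reproduce the reported AIC.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: -2 lnL + 2k."""
    return -2.0 * log_lik + 2.0 * k

def schwarz(log_lik, k, n):
    """Schwarz criterion (SC): -2 lnL + k ln(n)."""
    return -2.0 * log_lik + k * math.log(n)

# OLS child poverty model: lnL is from the slides; k = 12 is an assumption
ols_aic = aic(-578.94, 12)
ols_sc = schwarz(-578.94, 12, 3074)
print(round(ols_aic, 2))   # 1181.88
print(ols_sc > ols_aic)    # True: ln(3074) > 2, so SC charges more per parameter
```

Because ln(n) exceeds 2 for any n above about 7, the SC penalty per added parameter is heavier than the AIC penalty in any realistic sample, which is why it leans toward the simpler model.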
Some general comments regarding goodness of fit measures
- R² is fine for OLS; other estimators often generate a pseudo R² (which is usually the squared correlation of observed and predicted values)
- Log likelihood values can be used to compare models only if the models are nested
- AIC is a more general measure than log likelihood, but must play by the rules:
  - Like log likelihood, can use AIC to compare models of the same type (e.g., OLS) with the same DV but different specifications, or
  - Can compare models of different types (e.g., OLS, spatial regression, GWR) if they have the same model specification
End of digression regarding goodness of fit measures. What about all those other diagnostics at the end of the GeoDa regression output?
Recall, the lower half of the GeoDa output from the OLS regression run looked like this
We need seriously to consider this part of the output. There are some troubling numbers here. Let's take them one at a time
Part 2 of the GeoDa regression output
One advantage of R over GeoDa is that the diagnostic analysis capability is much richer in R. And we looked at some of this in the lab session on Tuesday
A very useful diagnostic plot. What are we looking for here and what's our conclusion?
There are other common residual diagnostic plots. Here's a residual vs. carrier plot. Heteroskedasticity? Probably
Another residual vs. carrier plot Heteroskedasticity? Probably, although the fan seems a bit softer to me
Another residual vs. carrier plot Seems to be much less heteroskedasticity coming from the education variable
What about the rejection of normality among residuals found in the GeoDa run? They don't look too bad here. Maybe it's the sample size that's just giving us a large JB statistic. The Shapiro-Wilk normality test gives us the same story, however:
  data: residuals(reg1)
  W = 0.9716, p-value = 6.702e-16
How can the Q-Q plot of residuals further inform us? The residual Q-Q plot confirms that we have a problem. The problem continues even if we remove the most obvious offenders (almost all counties in Texas)
What about the rejection of normality among residuals found in the GeoDa run? Here it is with 8 serious outliers removed from the plot. Better, but still some problems.
  Shapiro-Wilk normality test
  data: residuals(reg1)[-c(275, 774, 1048, 1066, 1114, 1130, 1135, 1143)]
  W = 0.9915, p-value = 3.867e-07
What's the take-home message?
- GeoDa and R give us the same message (which they should!)
- Unless you're very fortunate, the OLS model diagnostics almost always leave us with a bad taste. Why? Because we need to and want to move on. We especially want to do something about the spatial autocorrelation in the residuals
- But neither the residual Moran statistic nor the Lagrange multiplier statistics are trustworthy in the presence of non-normality & heteroskedasticity
- Furthermore, Monte Carlo simulations have shown that in the presence of residual spatial autocorrelation, heteroskedasticity is induced
- This is where we begin to look for ways of reducing the unresolved heterogeneity that appears to be plaguing our OLS model
Perhaps a worked example will help: an example using county levels of child poverty, all U.S. counties, 2000
Objective
- Considering spatial effects when analyzing the uneven geographic distribution of child poverty in the U.S.
- Still a work in progress
- A central issue is whether to model our data under an assumption of spatial dependence or spatial heterogeneity
Our Data
- Proportion of Children in Poverty (transformed)
- 2000 Census of Population (U.S.)
- Counties are unit of analysis (n = 3,074)
- Contiguous 48 states only
- Most independent cities merged with surrounding county
Log Odds of Proportion of Children in Poverty: 2000
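The dependent variable is the log odds of the child poverty proportion. A minimal Python sketch of that transform and its inverse follows. The function names are illustrative, and the slides do not say how proportions of exactly 0 or 1 were handled, so rejecting them here is an assumption.

```python
import math

def log_odds(p):
    """Logit transform: ln(p / (1 - p)) for a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_log_odds(value):
    """Map a log odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-value))

print(log_odds(0.5))                      # 0.0: even odds
print(inverse_log_odds(log_odds(0.2)))    # round trip back to about 0.2
```

The transform stretches proportions near 0 and 1 onto an unbounded scale, which is the usual motivation for using it with a bounded dependent variable.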
Our Model (after Friedman and Lichter, 1998)
Dependent variable: Log Odds % Children in Poverty
- Industrial Structure: % Extr. ind.; % Non dur. mfg. ind.; % Misc. svcs.; % Prof. svcs.
- Emp. Oppy. Structure: % Unemp.; % Males under emp.
- Family Structure: % of families w/ children headed by females
- Control Variables: % Hispanic; % Black; % HS or less; % Emp. in county
Begin with a Standard Regression Approach and examine the diagnostics
Standard Regression Results

Variable                                Parameter (p-value)
Industrial composition:
  Extractive (+)                         2.377 (0.0000)
  Non-durable manufacturing (+)         -0.101 (0.5359) ns
  Miscellaneous services (+)             0.246 (0.2215) ns
  Professional services (-)              0.614 (0.0000)
Local employment opportunities:
  Prop. LF unemployed (+)                4.942 (0.0000)
  Prop. male underemployment (+)         1.415 (0.0000)
Family structure:
  Prop. Female-headed families (+)       3.883 (0.0000)
Control variables:
  Proportion black (+)                  -0.045 (0.5254) ns
  Proportion Hispanic (+)                0.396 (0.0000)
  Prop. HS education or less (+)         2.405 (0.0000)
  Proportion work in county (+)          0.236 (0.0000)
Intercept                               -4.854 (0.0000)

Adjusted R²                              0.751
Jarque-Bera test (normality of errors)   7480.38 (0.0000)
B-P test (Heteroskedasticity)            401.27 (0.0000)
Moran's I (residuals)                    0.331 (0.0000)
Compare: Moran's I on dep. var. = 0.590
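To make the Moran's I rows concrete, here is a minimal Python sketch of the global statistic, I = (n / S0) · Σ w_ij z_i z_j / Σ z_i², where z are deviations from the mean and S0 is the sum of all weights. The function name, the dict-of-dicts weights layout, and the toy four-area example are all assumptions for illustration. The sketch computes only the statistic; inference for regression residuals needs adjusted moments, which are not shown.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a dict-of-dicts weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * z[i] * z[j]
                for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in z)

# Four hypothetical areas in a row; adjacent areas are neighbors (binary weights)
w = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
print(morans_i([1, 2, 3, 4], w))     # smooth gradient: positive autocorrelation
print(morans_i([1, -1, 1, -1], w))   # alternating values: negative autocorrelation
```

Similar values sitting next to each other push the statistic up, which is what the residual value of 0.331 signals for the OLS model.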
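Test for Normality of Residuals
$$JB = \frac{n-k}{6}\left[S^2 + \frac{(K-3)^2}{4}\right]$$
$$H_0: JB = 0$$
where S is the skewness and K the kurtosis of the residuals, n the number of observations, and k the number of regressors. The statistic has an asymptotic chi-squared distribution with two degrees of freedom (one for skewness, one for kurtosis). The 5% critical value is 5.99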
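A minimal Python sketch of the statistic, assuming the form JB = (n - k)/6 · (S² + (K - 3)²/4) with moment-based skewness S and kurtosis K. The function name and the five toy residuals are illustrative only, not course data.

```python
def jarque_bera(residuals, k=0):
    """JB = (n - k)/6 * (S^2 + (K - 3)^2 / 4); k is the number of regressors."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5          # S: third standardized moment
    kurt = m4 / m2 ** 2            # K: fourth standardized moment (3 if normal)
    return (n - k) / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

CHI2_2DF_5PCT = 5.99  # critical value quoted on the slide

jb = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb, jb > CHI2_2DF_5PCT)   # small JB: no evidence against normality
```

Since JB scales with n, even modest skewness or excess kurtosis yields a very large statistic with roughly 3,000 counties, which is one reason to look at the Q-Q plot as well.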
Tests for Heteroskedasticity
$$H_0: \mathrm{Var}[\varepsilon_i] = E[\varepsilon_i^2] = \sigma^2$$
$$H_A: \sigma_i^2 = \sigma^2 f\Big(\alpha_0 + \sum_p \alpha_p z_p\Big)$$
- Breusch and Pagan (1979)
- Koenker and Bassett (1982)
- White (1980)
Tests are not valid in the presence of spatial dependence
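As an illustration of the idea behind these tests, here is a minimal Python sketch of a studentized (Koenker-style) Breusch-Pagan statistic for a single variance-driving variable z. It is computed as n · R² from an auxiliary regression of the squared residuals on z. The function name, the restriction to one z, and the toy data are assumptions; under H0 the statistic is asymptotically chi-squared with one degree of freedom (5% critical value 3.84). As the slide warns, such tests are not valid in the presence of spatial dependence.

```python
def koenker_bp(residuals, z):
    """Studentized Breusch-Pagan: n * R^2 from regressing e^2 on one variable z."""
    n = len(residuals)
    e2 = [e * e for e in residuals]
    mz = sum(z) / n
    me = sum(e2) / n
    szz = sum((a - mz) ** 2 for a in z)
    see = sum((b - me) ** 2 for b in e2)
    sze = sum((a - mz) * (b - me) for a, b in zip(z, e2))
    r2 = sze * sze / (szz * see)   # R^2 of a one-regressor OLS fit
    return n * r2

z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fan = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6]    # residual spread grows with z
flat = [0.3, -0.1, 0.2, -0.3, 0.1, -0.2]   # no systematic change in spread
print(koenker_bp(fan, z), koenker_bp(flat, z))
```

The fanning residuals give a statistic above 3.84 and the flat ones do not, mirroring what the residual vs. carrier plots show visually.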
So, what to do now?
- Proceed directly to re-specify the process under a spatial dependence assumption?
  - A common approach, but not necessarily the best one. We really should try first to do something with the obvious spatial heterogeneity (non-stationarity)
  - Moreover, are we ready at this point to assume that poverty is a social phenomenon resulting from spatial interaction?
- Eliminate or control for the spatial heterogeneity?
  - Model it first and then work with residuals as the new dependent variable
  - Add a trend surface to our OLS model
- Seek other variables to improve specification?
- Identify spatial regimes and interaction effects
- Shift to a space-time framework
We'll proceed to perform each of these options and comment along the way. Hopefully in the end we'll have some sense of reasonable ways to introduce the different spatial effects as possible alternative data-generating processes w.r.t. child poverty
So let's proceed directly to re-specify the process under a spatial dependence assumption. The Lagrange Multiplier statistics suggested preference for a spatial error dependence model
Spatial Dependence Models

Variable                          (A) = OLS             (A) + Spatial Lag     (A) + Spatial Error
(Several control variables)       *** (***)             *** (***)             *** (***)
Industrial composition:
  Extractive                       2.377 (0.0000)        1.904 (0.0000)        2.080 (0.0000)
  Non-durable manufacturing       -0.101 (0.5359) ns     0.011 (0.9383) ns    -0.069 (0.6720) ns
  Miscellaneous services           0.246 (0.2215) ns    -0.035 (0.8432) ns    -0.030 (0.8830) ns
  Professional services            0.614 (0.0000)        0.292 (0.0122)        0.295 (0.0366)
Local employment opportunities:
  Prop. LF unemployed              4.942 (0.0000)        3.636 (0.0000)        3.798 (0.0000)
  Prop. male underemployment       1.415 (0.0000)        1.296 (0.0000)        1.519 (0.0000)
Family structure:
  Prop. Female-headed families     3.883 (0.0000)        3.844 (0.0000)        4.129 (0.0000)
Intercept                         -4.854 (0.0000)       -3.591 (0.0000)       -4.347 (0.0000)
Spatial parameter                                        0.351 (0.0000)        0.612 (0.0000)
Diagnostics:
  Robust LM (lag)                  155.82 (0.0000)
  Robust LM (error)                352.20 (0.0000)
  Heteroskedasticity (B-P)         401.27 (0.0000)       740.47 (0.0000)       800.09 (0.0000)
  Likelihood                      -578.94               -259.10               -205.94
  AIC                              1181.88               544.21                435.88
  Moran's I (residuals)            0.331                 0.079                -0.048
So where are we?
- Happy with these results? In particular, are we satisfied with the spatial error dependence model? Write it up and ship it off??
- But what's our theory? How do we introduce the spatial dependence model? Or interpret it?
- What about apparent unresolved spatial dependence and heterogeneity? Just let it go?
A brief digression to address the theory question: Why do high poverty counties (and low poverty counties) cluster in space?
- Poverty spillover
- Importance of family
- Welfare levels as product of spatial dependence
- Tendency of non-poor to live near one another
- Role of government in fostering economic segregation
- Legacy effects
So let's say we're satisfied with the spatial dependence model
- But what about the apparent heteroskedasticity in the spatial error model? This clearly is a sign that we've not resolved the nonstationarity (heterogeneity) in our model, and this has consequences
- Just ignore the heterogeneity?
- Try to model it first and then examine the remaining dependence process assuming stationarity?
- Bring it into our model? How? Trend surface?
Trend Surface Analysis
- One commonly suggested approach to handling spatial heterogeneity
- Alas, there are problems (Ripley, 1981):
  - It tends to distort ("wave") the surface at the edges of the area in order to fit points in the center
  - For higher orders, the polynomial terms tend to be highly correlated, causing fierce multicollinearity problems
  - Uneven distribution in the data can distort the fit
  - High positive autocorrelation in the process can lead to a tendency to fit a surface of too high an order
- Nevertheless
Second-Order Trend Surface?
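A second-order trend surface adds five polynomial terms in the coordinates: Long, Lat, Long², Lat², and Long × Lat. The minimal Python sketch below builds those terms and also illustrates the multicollinearity warning: over a span of longitudes like that of the contiguous U.S., Long and Long² are almost perfectly correlated. The function names and the integer longitude grid are hypothetical stand-ins, not the course data.

```python
def trend_terms(lon, lat):
    """Second-order trend surface terms: Long, Lat, Long^2, Lat^2, Long*Lat."""
    return [lon, lat, lon * lon, lat * lat, lon * lat]

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical centroid longitudes spanning roughly the contiguous U.S.
lons = [float(v) for v in range(-125, -66)]
r = pearson_r(lons, [v * v for v in lons])
print(trend_terms(-96.0, 38.0))
print(round(r, 3))   # very close to -1: Long and Long^2 carry nearly the same information
```

Centering the coordinates before squaring is a common way to reduce this correlation; whether that was done for the models in these slides is not stated.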
Trend Surface Models

Variable                          (A) OLS               (B) Trend Surface only   (C) OLS Vars + Trend
(Several control variables)       *** (***)                                      *** (***)
Industrial composition:
  Extractive                       2.377 (0.0000)                                 2.488 (0.0000)
  Non-durable manufacturing       -0.101 (0.5359) ns                              0.043 (0.7781) ns
  Miscellaneous services           0.246 (0.2215) ns                              0.005 (0.9785) ns
  Professional services            0.614 (0.0000)                                 0.584 (0.0000)
Local employment opportunities:
  Prop. LF unemployed              4.942 (0.0000)                                 5.402 (0.0000)
  Prop. male underemployment       1.415 (0.0000)                                 1.567 (0.0000)
Family structure:
  Prop. Female-headed families     3.883 (0.0000)                                 4.039 (0.0000)
Intercept                         -4.854 (0.0000)        7.081 (0.0000)          -1.568 (0.0026)
Large-scale spatial effect:
  Long                                                   0.037 (0.0021)           0.030 (0.0000)
  Lat                                                   -0.319 (0.0000)          -0.067 (0.0000)
  Long²                                                  0.000 (0.8756) ns        0.000 (0.8271) ns
  Lat²                                                   0.002 (0.0000)          -0.000 (0.0628) ns
  Long * Lat                                            -0.001 (0.0000)          -0.001 (0.0000)
AIC                                1181.88               4745.20                  789.12
Moran's I (residuals)              0.331                 0.474                    0.250
Might even try adding the spatial dependence processes to this latter model. Such models assume the data-generating process is approximated both by a spatial heterogeneity component and a spatial dependence component
OLS Vars. + Trend Surface + Spatial Effects

Variable                          (C) OLS vars + trend  (C) + Spatial Lag     (C) + Spatial Error
(Several control variables)       *** (***)             *** (***)             *** (***)
Industrial composition:
  Extractive                       2.488 (0.0000)        2.047 (0.0000)        2.161 (0.0000)
  Non-durable manufacturing        0.043 (0.7781) ns     0.041 (0.7753) ns    -0.042 (0.7941) ns
  Miscellaneous services           0.005 (0.9785) ns    -0.060 (0.7375) ns    -0.046 (0.8159) ns
  Professional services            0.584 (0.0000)        0.369 (0.0019)        0.350 (0.0111)
Local employment opportunities:
  Prop. LF unemployed              5.402 (0.0000)        4.152 (0.0000)        4.077 (0.0000)
  Prop. male underemployment       1.567 (0.0000)        1.411 (0.0000)        1.564 (0.0000)
Family structure:
  Prop. Female-headed families     4.039 (0.0000)        3.912 (0.0000)        4.188 (0.0000)
Intercept                         -1.568 (0.0026)       -2.578 (0.0000)       -1.446 (0.1065) ns
Large-scale spatial effect:
  Long                             0.030 (0.0000)        0.016 (0.0125)        0.024 (0.0539) ns
  Lat                             -0.067 (0.0000)       -0.009 (0.5547) ns    -0.078 (0.0060)
  Long²                            0.000 (0.8271) ns    -0.000 (0.2715) ns    -0.000 (0.4405) ns
  Lat²                            -0.000 (0.0628) ns    -0.001 (0.0001)       -0.000 (0.3379) ns
  Long * Lat                      -0.001 (0.0000)       -0.001 (0.0000)       -0.001 (0.0000)
AIC                                789.12                426.93                336.95
Moran's I (residuals)              0.250                 0.073                -0.032
We now continue by asking whether unresolved spatial heterogeneity (and the consequent heteroskedasticity in the model) might better be addressed by exploring the role of a key missing variable. What might be a good candidate for such a variable? Possibly a poverty "legacy" variable?
Models with Legacy Variable (1980 child poverty)

Variable                          OLS + Legacy          OLS + Legacy + Trend  OLS + Leg + Trend + Sp Error
(Several control variables)       *** (***)             *** (***)             *** (***)
Industrial composition:
  Extractive                       1.581 (0.0000)        1.710 (0.0000)        1.555 (0.0000)
  Non-durable manufacturing       -0.011 (0.9391) ns     0.118 (0.4043) ns     0.022 (0.8861) ns
  Miscellaneous services           0.066 (0.7216) ns    -0.153 (0.3819) ns    -0.196 (0.2923) ns
  Professional services            0.426 (0.0004)        0.449 (0.0001)        0.286 (0.0260)
Local employment opportunities:
  Prop. LF unemployed              3.917 (0.0000)        4.232 (0.0000)        3.297 (0.0000)
  Prop. male underemployment       0.886 (0.0000)        1.067 (0.0000)        1.218 (0.0000)
Family structure:
  Prop. Female-headed families     3.373 (0.0000)        3.530 (0.0000)        3.802 (0.0000)
Intercept:
Legacy:
  Poverty rate in 1980             2.070 (0.0000)        1.898 (0.0000)        1.704 (0.0000)
Large scale trend                                        *** (***)             *** (***)
Spatial error parameter                                                        0.426 (0.0000)
Diagnostics:
  Heteroskedasticity (B-P)         440.12                688.38                951.40
  Likelihood                      -301.05               -121.92                9.26
  AIC                              628.10                279.85                17.49
  Moran's I (residuals)            0.270                 0.185                -0.022
Concluding Comments
- Attention to spatial effects gives us better parameter estimates than OLS and a model with much improved diagnostics
- Combining corrections for both spatial heterogeneity and spatial dependence appears to give us a more realistic model
- Addition of the temporally lagged dependent variable ("legacy" effect) gives us a strong additional predictor variable without seriously attenuating the contributions of other IVs
- W.r.t. the determinants of child poverty, attributes of place retain a strong causal influence even after controlling for family-level attributes and the temporally lagged dependent variable
Questions about the example?
Geographically Weighted Regression Again, our child poverty data
Log Odds of Proportion of Children in Poverty: 2000
Same Model (after Friedman and Lichter, 1998)
Dependent variable: Log Odds % Children in Poverty
- Industrial Structure: % Extr. Ind.; % Non Dur Ind.; % Misc. Svcs.; % Prof. Svcs
- Emp. Oppy. Structure: % Unemp; % Males under emp.
- Family Structure: % of families w/ children headed by females
- Control Variables: % Hispanic; % Black; % HS or less; % Emp. in county
Now, we'll take a look at a particular form of spatial heterogeneity: heterogeneity in relationships. Enter GWR
Geographically Weighted Regression: the analysis of spatially varying relationships A. Stewart Fotheringham, Chris Brunsdon, and Martin Charlton (John Wiley & Sons, Ltd, 2002)
GWR Software: GWR 3.x
Software developed by Stewart Fotheringham, Martin Charlton, and Chris Brunsdon, University of Newcastle upon Tyne (at the time)
Standard Regression
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$
where:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
Contrast: Geographically Weighted Regression (GWR)
$$y_i = \beta_{0i} + \beta_{1i} x_{1i} + \beta_{2i} x_{2i} + \dots + \beta_{ki} x_{ki} + \varepsilon_i$$
where:
$$\hat{\beta}_i = (X^T W_i X)^{-1} X^T W_i y$$
Schematic Showing Kernel Estimation
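The GWR estimator can be sketched in a few lines of Python for the simplest case of an intercept and one predictor. This is a minimal illustration, not the GWR 3.x implementation: the Gaussian kernel, the fixed bandwidth, the function name gwr_fit, and the two-cluster toy data are all assumptions. At each regression point i it builds kernel weights W_i from distances and solves the weighted least squares problem in closed form.

```python
import math

def gwr_fit(coords, x, y, bandwidth):
    """Local (intercept, slope) at every observation point.

    Solves beta_i = (X' W_i X)^-1 X' W_i y for an intercept plus one
    predictor, using Gaussian kernel weights (an assumption).
    """
    betas = []
    for (ui, vi) in coords:
        # Kernel weight of each observation j relative to regression point i
        w = [math.exp(-0.5 * (math.hypot(ui - uj, vi - vj) / bandwidth) ** 2)
             for (uj, vj) in coords]
        sw = sum(w)
        swx = sum(a * b for a, b in zip(w, x))
        swy = sum(a * b for a, b in zip(w, y))
        swxx = sum(a * b * b for a, b in zip(w, x))
        swxy = sum(a * b * c for a, b, c in zip(w, x, y))
        det = sw * swxx - swx * swx            # determinant of X' W_i X
        slope = (sw * swxy - swx * swy) / det
        intercept = (swy - slope * swx) / sw
        betas.append((intercept, slope))
    return betas

# Two hypothetical clusters of areas, far apart, with different relationships:
# y = 1 * x in the west, y = 3 * x in the east
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]
local = gwr_fit(coords, x, y, bandwidth=2.0)
print(local[0][1], local[-1][1])   # local slope in the west vs. the east
```

A single global regression on these six points would report one compromise slope; the local fits recover a slope near 1 in the west and near 3 in the east, which is the kind of heterogeneity in relationships the GWR maps display.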
Basic Operation of GWR 3.x
Data + Ideas -> GWR Model Editor (creates a control file) -> Run program -> Listing file, Output file -> Map results (ArcGIS, other GIS files)
Here's how it works: Log Odds of Proportion of Children in Poverty: 2000
Regression Intercept Global = -4.984
T-value: Regression Intercept
Variable: Prop. H.S. Educ. or Less. Global: 2.6. Local: -4.1 to +6.5
Variable: Proportion Labor Force Unemployed. Global: 4.2. Local: -9.0 to +14.2
Variable: Prop. Extractive Industries. Global: 2.5. Local: -14.6 to +25.5
Variable: Prop. Female Headed Households. Global: 3.9. Local: -4.3 to +8.4 ns
Local R² Values. Global R²: 0.713
Strengths of GWR
- Potentially important tool when exploring spatial data. Nothing is the same everywhere. Helps you to understand spatial heterogeneity in your data
- Provides a better understanding of the global model. Serves as a device for possibly identifying specification errors in the global model (e.g., important interaction effects). Thus, GWR and local analysis become a potential model-building procedure
- Permits logistic regression and Poisson regression as well as normal regression
Okay. Pretty slick! Everyone likes this, right? Well, no
Some faults regarding GWR
- Very rote. Regression undertaken at each regression point without much care regarding regression assumptions
- Data with spatially autocorrelated residuals fit with OLS rather than spatial regression model (MLE, IV/GMM)
- The results are not easily amenable to tabular presentation; they often make great maps, however
But wait, there's more
- Usual rule in regression: n observations, k parameters; n >> k
- GWR fits n × k parameters with only n observations
- Some unusual results arising from GWR are not yet fully understood. For example, it has been observed and commented upon that GWR can sometimes generate high (negative) correlations among estimated parameters
[Figure: two scatterplots. Variables: Proportion Black (PROBLCK) vs. Proportion Nondurable Manufacturing (PRONMAN), Pearson's r = .368. Parameters: Proportion Black vs. Proportion Nondurable Manufacturing (PARM_2 vs. PARM_9), Pearson's r = -.729]
David Wheeler & Michael Tiefelsdorf, Journal of Geographical Systems 7 (2005): 161-187
- GWR appears to amplify regression parameter correlations present in the global model
- One local parameter pattern can be used to predict another local parameter pattern
- Misspecified kernel function increases coefficient correlation
- Perhaps local spatial autocorrelation among the residuals influences the parameter correlation
Readings for today
- Fotheringham, A. Stewart, and Chris Brunsdon. 1999. Local Forms of Spatial Analysis. Geographical Analysis 31(4):340-358.
- Wheeler, David, and Michael Tiefelsdorf. 2005. Multicollinearity and Correlation among Local Regression Coefficients in Geographically Weighted Regression. Journal of Geographical Systems 7:161-187.
- Messner, Steven F., and Luc Anselin. 2004. Spatial Analyses of Homicide with Areal Data. Pp. 127-144 in Michael F. Goodchild and Donald G. Janelle (eds.) Spatially Integrated Social Science. (Oxford: Oxford University Press).
- O'Loughlin, John, Colin Flint, and Luc Anselin. 1994. The Geography of the Nazi Vote: Context, Confession, and Class in the Reichstag Election of 1930. Annals of the Association of American Geographers 84(3):351-380.
Questions?
Afternoon Lab GWR hands on (in R)