Spatial econometric modeling of presidential voting outcomes


The University of Toledo
The University of Toledo Digital Repository
Theses and Dissertations, 2005

Spatial econometric modeling of presidential voting outcomes
Ryan Christopher Sutter, The University of Toledo

Recommended Citation: Sutter, Ryan Christopher, "Spatial econometric modeling of presidential voting outcomes" (2005). Theses and Dissertations.

This Thesis is brought to you for free and open access by The University of Toledo Digital Repository. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of The University of Toledo Digital Repository.

A Thesis
entitled
Spatial Econometric Modeling of Presidential Voting Outcomes
by
Ryan Sutter

As partial fulfillment of the requirements for the Master of Arts in Economics

Advisor: Dr. James LeSage
Graduate School
The University of Toledo
May 2005

An Abstract of
Spatial Econometric Modeling of Presidential Voting Outcomes
Ryan Sutter

Submitted in partial fulfillment of the requirements for the Master of Arts in Economics
The University of Toledo
May 2005

We examine the spatial autoregressive relationship between county-level voting outcomes in the 2000 Presidential election and a host of candidate explanatory variables taken from the year 2000 census. These include measures of past voting behavior, indicators of the socioeconomic demographic status of the population, and economic variables that reflect recent economic conditions. Using a recently developed spatial econometric extension of the least-squares regression-based Markov Chain Monte Carlo model composition methodology (often labelled MC^3) by LeSage and Parent (2004), we present evidence on which explanatory variables are important in explaining voting outcomes. The LeSage and Parent (2004) methodology deals with cases where the number of possible models based on different combinations of candidate explanatory variables is large enough that calculation of posterior probabilities for all models is difficult or infeasible. In addition, we produce estimates using a spatial autoregressive seemingly unrelated regression methodology developed in LeSage and Pace (2005) that takes into account cross-equation error covariance between the Bush and Gore equations in the model.

Acknowledgments

I would like to thank the Department of Economics for providing me with an excellent environment in which to grow as a student. I would also like to specifically thank Dr. James LeSage for spending so much of his time teaching me. I greatly appreciate all that he has taught me.

To Melissa and Jacob for supporting me and graciously accepting the burden I place on them by continuing my studies.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Regression-based models
  1.2 Relevant regression issues
  1.3 Least-squares results
2 Spatial Regression Models
  2.1 Spatial autoregressive model
  2.2 The impact of spatial dependence on least-squares
  2.3 Spatial autoregressive results
  2.4 Comparison of least-squares and maximum likelihood spatial autoregressive results
  2.5 Nonconstant Variance
3 Explanatory Variables
  3.1 The MC^3 technique
  3.2 Priors for the model parameters
  3.3 Model comparison results
  3.4 Model averaged estimates
  3.5 The influence of economic variables
4 Spatial SUR Models
  4.1 Spatial SUR example
  4.2 Spatial SUR results
5 Conclusions
References

List of Figures

2-1 Map of County-level Percentage Votes for Bush
2-2 Moran Scatterplot of Percentage Votes for Bush
2-3 Map associated with the Moran Scatterplot

List of Tables

1.1 Past voting variables
1.2 Socioeconomic demographic variables
1.3 Economic variables
1.4 Least Squares Results
2.1 Maximum Likelihood Spatial Autoregressive (MLSAR) Results
    Least Squares (OLS) vs Maximum Likelihood Spatial Autoregressive (MLSAR) Model for y = Bush/Total Votes
    Least Squares (OLS) vs Maximum Likelihood Spatial Autoregressive (MLSAR) Model for y = Gore/Total Votes
    Bush Results
    Gore Results
    Model Comparison Results for y = Bush/Total Votes
    Model Comparison Results for y = Gore/Total Votes
    Bush Model Averaged Estimates
    Gore Model Averaged Estimates
4.1 Spatial SUR Comparison Without Cross Eq Error Covariance
    variance-covariance Comparison
    Spatial SUR Comparison With Cross Eq Error Covariance
    variance-covariance Comparison
    Estimated variance-covariance Matrix
    Spatial SUR Comp Results for y = Bush/Total Votes
    Spatial SUR Comp Results for y = Gore/Total Votes

Chapter 1

Introduction

Presidential election outcomes have a large impact on society, so it is no surprise that there is interest in the determinants of voting outcomes in these elections. Our analysis focuses on the cross-sectional relationship between county-level voting outcomes in the 2000 Presidential election and three suites of explanatory variables in a regression setting. The variable suites include measures of past voting behavior, indicators of the socioeconomic demographic characteristics of the county-level populations, and measures of economic conditions. Specifically, we will examine two regression relationships, one predicting the proportion of votes going to George Bush in the 2000 Presidential election and the other doing the same for Al Gore in 2000.

A sample of 3,107 counties was used in our analysis. There are 3,109 counties in the 48 contiguous states that could be used in our spatial sample, but two counties were omitted: Loving County, Texas, and Miami-Dade County, Florida. Loving County, Texas contained 70 eligible voters in the year 2000 census, but 200 voters were registered in that county in that year, causing us to drop that county from our analysis. Another

excluded county was Miami-Dade County, Florida, which changed definitions between the 1990 and 2000 censuses.

Other data were obtained from the US Census, the Regional Economic Information System, or a data set of past voting outcomes purchased from David Leip, who assembled the data set by contacting state election officials. Specifically, the socioeconomic demographic characteristics were obtained from the 2000 US Census and aggregated to the county level by GeoLytics, Inc., a commercial provider of US Census information. The variables reflecting economic conditions were obtained from the 2000 Census as well as from data made available to the public by the Regional Economic Information System. Lastly, the data reflecting past voting behavior were purchased from David Leip.

Past voting behavior was expected to be an important factor in explaining voting behavior in the 2000 presidential election, as county-level voting outcomes tend to exhibit persistence over time. That is, counties that voted for a given political party in one election tend to vote for that same political party in subsequent elections. To incorporate this persistence into our model we included the following voting variables in our analysis: the proportion of votes received by the Republican candidate in 1992, the proportion of votes received by Ross Perot in 1992, the proportion of votes received by the Republican candidate in 1996, and the proportion of votes received by Ross Perot in 1996. Aside from past voting behavior, we also included the proportion of votes received by other party candidates and write-ins in 2000 to capture the effect of other party candidates on the proportion of votes received by George Bush and Al Gore in the 2000 Presidential

election. These variables reflect the percentage of the total population that cast votes for the given candidate in the given year. Exact descriptions of the variables along with the variable name associated with each variable can be found in Table 1.1.

Table 1.1: Past voting variables

Variable name   Description
oparty          % of 2000 votes for others and write-ins
repub92         % of 1992 votes for the Republican candidate
repub96         % of 1996 votes for the Republican candidate
perot92         % of 1992 votes for candidate Perot
perot96         % of 1996 votes for candidate Perot

It is widely acknowledged that socioeconomic demographic characteristics of the population influence voting behavior. Because of this, we obtained year 2000 census information that was aggregated to the county level. This aggregation results in variables that represent the average county-level characteristics of the population residing in the 3,107 counties used in our analysis. These variables capture important variation in the socioeconomic demographic characteristics across our spatial sample and are used to explain observed variation in voting behavior across these counties.

An important aspect of the presidential election in 2000 is that our census information is based on a US Census that was taken during the spring of 2000, and the election occurred in the fall of the same year. The small amount of time between the census and the election means that our county-level socioeconomic demographic information accurately measures the average characteristics of the counties when voting occurred in November of that year. Specific descriptions of the variables used in our analysis along with the variable names associated with that information can be found by

examination of Table 1.2.

The influence economic conditions have on presidential voting behavior is unresolved in the Political Science literature, with debate over what role economic conditions play in determining voting outcomes in presidential elections. Some evidence suggests that people vote based on ideology, aligning themselves with a given political party and voting along those party lines independent of the economic conditions occurring over a given administration's time in power (Elliott, Kim, and Wang (2003)). Other evidence suggests that economic conditions are indeed relevant in explaining voting behavior, as individuals will cross party lines and reward or punish a candidate based on the performance of the economy (Lewis-Beck and Stegmaier (2000)). To address these issues we obtained information on the unemployment rate in 2000, the level of per capita income in 2000, the unemployment rate in 1990, the change in the unemployment rate from 1990 to 2000, the change in per capita income from 1990 to 2000, the change in per capita income from 1997 to 1999, as well as the change in per capita income from 1998 to 1999. A description of each variable as well as the names associated with each variable are included in Table 1.3. Note that the level of per capita income in 1990 was obtained from the 1990 US Census.

1.1 Regression-based models

To examine voting behavior, two separate regressions are performed, one using the proportion of 2000 county-level votes going to George Bush as the dependent variable and the other using Al Gore's proportion of votes as the dependent variable. Our

Table 1.2: Socioeconomic demographic variables

Variable name     Description
female/male       female/male ratio for persons aged 16 or older
black             % of black population
asian             % of asian population
hispanic          % of hispanic population
famwchild         % of families with children
femhhwchild       % of female-headed households with children
owner-occupied    % of owner-occupied housing
highschool        % of population aged 25 plus, high school as highest degree
associate         % of population aged 25 plus, assoc. degree as highest degree
college           % of population aged 25 plus, college degree as highest degree
grad/prof         % of population aged 25 plus, grad/prof as highest degree
nevermarried      % of population aged 16 plus never married
divorced          % of population aged 16 plus divorced
widowed           % of population aged 16 plus widowed
samehouse         % of population living in the same house as 5 years ago
foreignborn       % of population that is foreign born
language          % of the population speaking a foreign language at home
military          % of population in the military
fem emp/females   % of females aged 16 plus that are employed
work home         % of population that work at home
traveltime        mean travel time to work
poverty           % of population in poverty
newhouse          % of occupied houses built since 1995
oldhouse          % of occupied houses built 1939 or before
log(hvalue)       log of median house value
log(rent)         log of median rent
log(mortgage)     log of median mortgage payment
govt workers      % of employed workers in government
manuf workers     % of employed workers in manufacturing
arts workers      % of workers in arts, recreation, food services

Table 1.3: Economic variables

Variable name    Description
urate00          unemployment rate in 2000
durate90to00     change in unemployment rate from 1990 to 2000
income00         per capita income in 2000
dincome90to00    change in per capita income from 1990 to 2000
dpi98to99        change in per capita income from 1998 to 1999
dpi97to99        change in per capita income from 1997 to 1999

initial analysis relies on the same matrix of explanatory variables to explain variation in voting outcomes across the sample of $n = 3{,}107$ U.S. counties located in the 48 contiguous states. The regression model can then be described using matrix notation as:

$$
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} =
\begin{pmatrix} X & 0 \\ 0 & X \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}
\qquad (1.1)
$$

where the $n \times 1$ vectors $y_i$, $i = 1, 2$ contain the percentage of votes received by the respective presidential candidates and the $n \times k$ matrix $X$ contains the $k$ explanatory variables for each of the $n$ counties in the sample. The $n \times 1$ vectors of disturbances $\varepsilon_i$, $i = 1, 2$ are assumed to be distributed normally with a mean of zero, constant variance and zero covariance. Use of the same matrix $X$ in the two regression relationships assures zero covariance, so we might write:

$$
\text{var-cov}\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} =
\begin{pmatrix} \sigma_1^2 I_n & 0 \\ 0 & \sigma_2^2 I_n \end{pmatrix}
\qquad (1.2)
$$

where $\sigma_i^2$ denotes the constant variance from regression $i = 1, 2$. This assumption

will be relaxed in Chapter 4, where the results of Chapter 3 are used to determine separate matrices $X_i$, $i = 1, 2$ for the two regressions. In this situation, it seems highly plausible that non-zero error covariance exists between the two regressions.

Since the dependent variables take on values bounded between 0 and 1, log transformations were used on these proportions ($p$) to produce distributions of the dependent variables that were more nearly normal. That is, $y = \log(p)$ was used as the dependent variable in the two regressions. Log transformations are also used on some of the county-level explanatory variables from the 2000 Census for scaling purposes, specifically the mean travel time to work, median household income, median house value, median rent, and median mortgage. This transformation was also used on per capita incomes in 1990, 1997, and [...]. The remaining variables in the $X$ matrix were not transformed because they were expressed as proportions of the total population and so were already adequately scaled. For example, the poverty variable reflects the proportion of county-level population in poverty, as do other variables such as education level, ethnicity, gender, etc. Lastly, it is important to note that all of the explanatory variables were studentized to accommodate the prior mean of zero employed in the Zellner g prior, which will be used later on in our analysis.

1.2 Relevant regression issues

There are a number of issues related to regression modelling of voting outcomes that we will explore. First, it is well-known that voting behavior exhibits spatial dependence (Elliott, Kim, and Wang (2003)), an issue explored in Chapter 2. This

results in outcomes that are spatially correlated, invalidating the use of ordinary least-squares regression methods. The focus is on differences between spatial regression model estimates and those from least-squares.

Secondly, the variances of the errors will be inspected to examine whether or not the assumption of constant variance is consistent with the sample data. It is important to note that two separate issues are at work here. One issue is heteroscedasticity and the second issue is outliers. These two issues are important because heteroscedasticity can cause inefficiency, whereas outliers are known to cause biased and inconsistent parameter estimates. Diagnosing these problems and correcting for them, if identified, is the topic of section 2.5, where the focus is on comparing spatial autoregressive models that assume homoscedasticity to models that are robust to the influence of outliers and heteroscedasticity.

Another issue involves which explanatory variables exert a significant influence on voting behavior. There are a host of potential explanatory variables that may or may not contribute significantly to explaining voting behavior. The Bayesian theory behind model comparison has been widely recognized through work done by Zellner (1971) and Fernandez, Ley, and Steel (2001a and 2001b). This work enables us to compare models systematically and draw conclusions about which independent variables are important in explaining variation in the dependent variable. While this work is important, it only provides procedures for model comparison in an ordinary least-squares setting. Other work by LeSage and Parent (2004) demonstrates that least-squares model comparison procedures will be adversely impacted by spatial dependence that is unaccounted for in the least-squares model. This means that

the spatial dependence will adversely influence the model selection inferences for least-squares model comparison. LeSage and Parent (2004) suggest implementing techniques to perform model comparison in a spatial autoregressive setting. They describe these techniques and demonstrate the superiority of using them when spatial dependence exists in the sample data. The focus of Chapter 3 is comparing models with different sets of explanatory variables to identify those that are most relevant. In section 3.5 we use these methods to explore the role of the six economic explanatory variables in determining voting outcomes for our two presidential candidates, as motivated by the debate in the political science literature described earlier in this chapter.

A final issue relates to the possible error covariance between voting behavior as it relates to different candidates. Initially, we used the same matrix of explanatory variables, ruling out error covariance between the two regressions reflecting Bush and Gore. However, as indicated, results of Chapter 3 will be used to identify separate matrices of explanatory variables for the two regressions. Again, in this situation it seems highly plausible that non-zero error covariance exists between the two regressions, resulting in Seemingly Unrelated Regression equations. Chapter 4 will test for cross-equation error covariance between the two regressions using a spatial autoregressive Seemingly Unrelated Regression (SUR) model.

1.3 Least-squares results

Least-squares parameter estimates along with their associated marginal probability levels resulting from the Bush and Gore regressions are presented in Table 1.4. In

this table a single asterisk (*) represents significance at the 95% level and a double asterisk (**) denotes significance at the 99% level.

The results in Table 1.4 indicate that the following variables had a negative influence on votes going to Bush, in decreasing absolute value terms: the constant term, femhhwchild, fem emp/females, urate00, income00, samehouse, oldhouse, nevermarried, foreignborn, log(rent), language, oparty, traveltime, log(hvalue), grad/prof, female/male, work home, military, dpi97to99, and associate. Note that all of these variables were significant at the 99-percent level except log(hvalue), dpi97to99, and associate, which were significant at the 95-percent level. The following variables positively influenced votes going to Bush, in increasing order: asian, college, govt workers, dincome90to00, hispanic, black, famwchild, perot92, perot96, repub92, and repub96. Note that all of these variables were significant at the 99-percent level except the asian and college variables, which were significant at the 90- and 95-percent levels, respectively. The following variables had insignificant influences on votes going to Bush: arts workers, newhouse, manuf workers, highschool, divorced, widowed, poverty, dpi98to99, durate90to00, owner-occupied, and log(mortgage).

Review of the least-squares results for the Bush regression yields some counterintuitive results. For example, counties that contained higher than average numbers of blacks, hispanics, and asians tended to vote for George Bush in the 2000 presidential election. These results seem counterintuitive because conventional wisdom suggests that these race categories vote Democratic.

Table 1.4 also contains least-squares parameter estimates and marginal probability levels associated with the Gore regression. The results indicate that the following

Table 1.4: Least Squares Results

Variable          Bush Coeff   Variable          Gore Coeff
femhhwchild       **           repub96           **
fem emp/females   **           perot96           **
urate00           **           black             **
income00          **           famwchild         **
samehouse         **           repub92           **
oldhouse          **           oparty            **
nevermarried      **           dpi98to99         **
foreignborn       **           college           **
log(rent)         **           work home         **
language          **           perot92           **
oparty            **           govt workers      **
traveltime        **           hispanic          *
log(hvalue)       *            poverty
grad/prof         **           asian             **
female/male       **           log(hvalue)
work home         **           dincome90to00
military          **           log(mortgage)
dpi97to99         *            samehouse
associate         *            oldhouse
arts workers                   newhouse
newhouse                       highschool
manuf workers                  divorced          *
highschool                     dpi97to99         **
divorced                       arts workers      **
widowed                        associate         **
poverty                        female/male       **
dpi98to99                      foreignborn       **
asian                          language
durate90to00                   owner-occupied    **
owner-occupied                 military          **
log(mortgage)                  traveltime        **
college           *            log(rent)         **
govt workers      **           grad/prof         **
dincome90to00     **           durate90to00      **
hispanic          **           manuf workers     **
black             **           widowed           **
famwchild         **           urate00           **
perot92           **           nevermarried      **
perot96           **           income00          **
repub92           **           femhhwchild       **
repub96           **           fem emp/females   **

variables had a negative influence on votes going to Gore, in decreasing absolute value terms: the constant term, repub96, perot96, black, famwchild, repub92, oparty, dpi98to99, college, work home, perot92, govt workers, hispanic, and asian. Note that all of these variables were significant at the 99-percent level except the hispanic variable, which was significant at the 95-percent level. The following variables positively influenced votes going to Gore, in increasing order: newhouse, divorced, dpi97to99, arts workers, associate, female/male, foreignborn, language, owner-occupied, military, traveltime, log(rent), grad/prof, durate90to00, manuf workers, widowed, urate00, nevermarried, income00, femhhwchild, and fem emp/females. Note that all of these variables were significant at the 99-percent level except the newhouse, language, and divorced variables, which were significant at the 90- and 95-percent levels. The following variables had insignificant influence on votes going to Gore: poverty, log(hvalue), dincome90to00, log(mortgage), samehouse, oldhouse, and highschool.

The least-squares results for the Gore regression reveal counterintuitive results similar to those of the Bush regression. The least-squares estimates indicate that the black, hispanic, and asian variables all negatively impacted votes going to Gore. The dpi98to99 variable also negatively impacted votes going to Gore, despite the fact that he was part of the administration in power during the growth occurring in that period.
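The setup of Section 1.1 can be illustrated with a small numerical sketch. The example below uses synthetic data rather than the thesis data, and the helper names `studentize` and `ols` are hypothetical. It assumes that "studentized" means centering each column and dividing by its sample standard deviation. The sketch applies the log transformation to vote proportions, studentizes the explanatory variables, and checks that least squares on the stacked system (1.1), with the same X in both blocks, reproduces the equation-by-equation estimates.

```python
import numpy as np

def studentize(X):
    """Center each column and scale it to unit sample standard deviation (assumed meaning)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

rng = np.random.default_rng(0)
n, k = 500, 3                                  # synthetic sizes, not the 3,107 counties
raw = rng.normal(size=(n, k)) * [1.0, 5.0, 0.2] + [10.0, 50.0, 1.0]
Z = studentize(raw)                            # studentized explanatory variables
X = np.column_stack([np.ones(n), Z])           # intercept plus regressors

# Synthetic vote proportions strictly inside (0, 1)
index = 0.2 + 0.5 * Z[:, 0] - 0.3 * Z[:, 1] + 0.1 * rng.normal(size=n)
p_bush = 1.0 / (1.0 + np.exp(-index))
p_gore = 0.95 * (1.0 - p_bush)                 # leaves 5% for other candidates
y1, y2 = np.log(p_bush), np.log(p_gore)        # y = log(p) as in Section 1.1

# Equation-by-equation least squares
b1, b2 = ols(X, y1), ols(X, y2)

# Stacked system (1.1): block-diagonal regressor matrix with X in both blocks
zeros = np.zeros_like(X)
X_stacked = np.block([[X, zeros], [zeros, X]])
y_stacked = np.concatenate([y1, y2])
b_stacked = ols(X_stacked, y_stacked)

print(np.allclose(b_stacked, np.concatenate([b1, b2])))
```

Because the stacked regressor matrix is block diagonal, the stacked normal equations separate into the two single-equation problems, which is why the two sets of estimates coincide.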

Chapter 2

Spatial Regression Models

It is widely accepted that spatial dependence exists in voting outcomes, that is, nearby counties tend to exhibit similar voting behavior. Estimation of a spatial autoregressive regression model is a simple way to test for spatial dependence across the observations in our sample, and is the subject of Chapter 2 of this paper. Specifically, section 2.1 sets up and describes the particular spatial model used in our analysis, and section 2.2 explains what impact spatial dependence has on least-squares parameter estimates. Section 2.3 contains the results and subsequent discussion from the estimation of the maximum likelihood spatial autoregressive model described in section 2.1. In section 2.4, maximum likelihood spatial autoregressive results are compared to the least-squares results presented in section 1.3, using a t test to test for significantly different parameter estimates.

2.1 Spatial autoregressive model

A spatial autoregressive model (SAR) is estimated to investigate the presence of spatial dependence across the observations in the sample. This model takes the form shown in (2.1):

$$
y_i = \rho_i W y_i + X \beta_i + \varepsilon_i, \quad i = 1, 2 \qquad (2.1)
$$

where $W$ denotes an $n \times n$ spatial weight matrix that is described below and $\rho_i$ is a scalar parameter measuring the strength of the relationship between the dependent variable $y_i$ and the spatially lagged variable vector $W y_i$. If the estimation of this model reveals an insignificant coefficient $\rho_i$, then this model collapses to ordinary least-squares, indicating that no significant spatial dependence exists in the sample data.

For simplicity of exposition, we drop the subscript $i$ and use $y$ to denote a vector containing the dependent variable values in the following discussion. This allows us to use subscripts $y_i$ to refer to observation $i$ in the vector $y$. Spatial dependence becomes an issue to address when an observation at one location, $y_i$, is dependent on neighboring observations, $y_j$, where we can use $j \in \Omega_i$ to denote the set of observations that neighbor observation $i$. A large number of approaches have been taken to define neighboring observations, such as first-order contiguity, which relies on neighboring counties, those with borders touching. We tried a sequence of alternative weight matrices based on varying numbers of nearest neighbors. The model comparison methods described in Chapter 3 were used to calculate posterior probabilities associated with each weight matrix, leading us to an optimal number of nearest neighbors equal to 11 for Bush and 8 for Gore.

The use of $W y$ is referred to as a spatial lag and represents a simple and parsimonious way to define the spatial dependence among observations. The spatial weight matrix $W$ is an $n \times n$ matrix with rows specified using $\Omega_i$. The individual elements contained in the weight matrix $W$ can be defined as $w(i, j)$. If $y_j$, $j = 1, \ldots, n$ is contained in the set of neighboring observations, $\Omega_i$, then $w(i, j) > 0$. If $y_j$ is not contained in $\Omega_i$, then $w(i, j) = 0$. The diagonal elements of $W$ contain values of zero to prevent dependence of observation $i$ on its own value. A convention is to row-standardize the matrix $W$, so that each row sums to one. This type of standardization leads to a row stochastic matrix since $W$ is non-negative. The purpose of obtaining a matrix that is row stochastic is that this type of matrix has nice numerical and interpretive properties discussed in LeSage and Pace (2004).

The spatial lag of $y$ is obtained by multiplying the $n \times 1$ vector $y$ by the $n \times n$ weight matrix $W$, which produces an arithmetic average of voting outcomes, where the average is over the $m$ nearest neighboring counties to each observation $i$. Intuitively, this suggests that voting outcomes for county $i$ should be similar to the outcomes of the $m$ nearest neighboring counties.

A simple map of the proportion of votes received by George Bush is shown in Figure 2-1, where we see clear patterns of spatial clustering that illustrate spatial dependence. Neighboring counties tend to exhibit similar vote proportions in favor of (or against) Bush, resulting in similar colors assigned to the counties. Another way to examine this relationship between voting behavior of county $i$ and

neighboring counties $j \in \Omega_i$ is to produce a scatter plot of the vector $y$ containing the percentage of votes for Bush versus the vector $W y$, produced by the matrix product of $W$ times $y$. Conventionally, the vector $y$ is transformed to deviation from means form, and this plot is known as a Moran scatter plot, which is shown in Figure 2-2. A first-order contiguity matrix $W$ based on counties that have borders that touch was used to produce the figure.

The horizontal axis shows the percentage of votes for Bush in deviation from the means form, so that positive values reflect counties where Bush received more than his average percentage of votes, and negative values reflect the converse case. On the vertical axis we place the vector $W y$, which reflects the average (over neighboring counties) of the Bush percentage, again in deviation from the means form. A positive value on the vertical axis indicates that Bush received more than his average percentage of votes in the neighboring counties (on average). Taken together, a positive value on the horizontal and vertical axis results in a red point in quadrant I in the Moran scatter plot. This indicates that Bush received higher than average vote percentages in these counties as well as a higher than average proportion in the average of neighboring counties. Roughly speaking, these counties are those where Bush did well.

The blue points in quadrant III of the Moran scatter plot indicate counties where Bush did worse than his average and also did worse than average in the neighboring counties. We might loosely refer to these counties as places where Bush did poorly. Quadrant II containing green points and quadrant IV with purple points represent counties where Bush's performance was inconsistent with his average performance in neighboring counties. Specifically, the green points in quadrant II represent counties where Bush did worse than average, but better than average in the

neighboring counties, and the purple points in quadrant IV are counties where Bush did better than average but worse than average in the neighboring counties. High positive spatial dependence is indicated by positive correlation of the points in the Moran scatter plot, a pattern that is clearly evident in Figure 2-2. Figure 2-3 shows a map that is color-coded to match the counties identified in the Moran scatter plot of Figure 2-2.

2.2 The impact of spatial dependence on least-squares

Estimation of the SAR model provides a way to test for the presence of spatial autocorrelation, since the estimate of the parameter $\rho$ will be statistically significantly different from zero, supposing that there is indeed spatial dependence. If $\rho$ is significantly different from zero, then $\hat{\beta}_{OLS}$ will be different from $\hat{\beta}_{SAR}$. This can easily be shown using the relations in (2.2).

$$
\hat{\beta}_{SAR} = (X'X)^{-1}X'y - \rho (X'X)^{-1}X'Wy = \hat{\beta}_{OLS} - \rho (X'X)^{-1}X'Wy \qquad (2.2)
$$

If $\rho$ is significantly different from zero, then $\rho (X'X)^{-1}X'Wy$ will be non-zero and will appropriately assign the omitted variation to the spatially lagged dependent variable $W y$, correcting for the bias that has been shown to exist in the least-squares estimates.
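Relation (2.2) and the row-standardized weight matrix of Section 2.1 can be checked numerically. The sketch below is illustrative only: it uses synthetic coordinates and data, a hypothetical helper named `knn_weights`, and a fixed value of ρ rather than an estimate. It builds an m-nearest-neighbor weight matrix, forms the spatial lag Wy, computes the slope of the Moran scatter plot, and compares the two sides of (2.2).

```python
import numpy as np

def knn_weights(coords, m):
    """Row-standardized weight matrix based on the m nearest neighbors (hypothetical helper)."""
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # an observation is not its own neighbor
    nearest = np.argsort(d, axis=1)[:, :m]         # indices of the m closest points
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nearest] = 1.0 / m    # equal weights, so each row sums to one
    return W

rng = np.random.default_rng(1)
n, m, rho = 200, 8, 0.5                            # synthetic sizes and an assumed rho
coords = rng.uniform(size=(n, 2))
W = knn_weights(coords, m)

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.0])
# Generate y from the SAR model (2.1): y = (I - rho W)^{-1} (X beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + rng.normal(size=n))

Wy = W @ y                                         # spatial lag: average over the m neighbors
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)         # (X'X)^{-1} X'
beta_ols = XtX_inv_Xt @ y
beta_sar = XtX_inv_Xt @ (y - rho * Wy)             # least squares on y - rho W y
correction = rho * (XtX_inv_Xt @ Wy)               # the term subtracted in (2.2)

# Moran scatter plot slope: regress W y on y, both in deviation-from-means form
yd = y - y.mean()
moran_slope = (yd @ (W @ yd)) / (yd @ yd)

print(np.allclose(beta_sar, beta_ols - correction))
```

The equality holds for any value of ρ, since it is purely algebraic. In practice ρ is unknown and has to be estimated jointly with β.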

For our sample of voting outcomes we find a parameter $\rho$ that is significantly different from zero. Since the parameter $\rho$ is significant and non-zero, least-squares estimates are biased and should not be relied on for inference (Anselin (1988)).

2.3 Spatial autoregressive results

Maximum likelihood spatial autoregressive parameter estimates along with the marginal probability levels associated with the Bush and Gore regressions are presented in Table 2.1. In this table and all subsequent tables, a single asterisk (*) represents significance at the 95% level and a double asterisk (**) denotes significance at the 99% level, unless indicated otherwise.

The results found in Table 2.1 indicate that the following variables had a negative effect on votes going to George Bush in 2000, in decreasing absolute value terms: the constant term, femhhwchild, fem emp/females, urate00, income00, samehouse, oldhouse, nevermarried, foreignborn, log(rent), language, oparty, traveltime, log(hvalue), grad/prof, female/male, work home, military, dpi97to99, associate, and arts workers. All of these variables were significant at the 99-percent level except log(hvalue), grad/prof, associate, and arts workers, which were significant at the 95-percent level. The following variables had a positive effect on votes going to Bush in 2000, in increasing order: dpi98to99, asian, log(mortgage), govt workers, dincome90to00, hispanic, black, famwchild, perot92, perot96, repub92, and repub96. All of these variables were also significant at the 99-percent level except the dpi98to99, asian, and log(mortgage) variables, which were significant at the 90-percent level. The following

variables were found to be not significantly different from zero: newhouse, manuf workers, highschool, divorced, widowed, poverty, durate90to00, owner-occupied, and college. The ρ parameter was significant and positive with an estimated value of [...]. This provides evidence for spatial dependence, an issue that will be discussed more fully in section 2.4.

Table 2.1 provides evidence for counterintuitive results that were similar to the least-squares results from Table 1.4. Counties that contained larger than average numbers of Asians, Hispanics, and Blacks tended to vote for George Bush in 2000. The results here also indicate that the dpi98to99 and dincome90to00 variables positively influenced votes going to Bush, even though Al Gore was part of the incumbent administration presiding over the growth in those periods.

Table 2.1 also contains maximum likelihood spatial autoregressive parameter estimates and their associated marginal probability levels for the Gore regression. The results in that table indicate that the following variables had a negative effect on votes going to Al Gore, in decreasing absolute value terms: the constant term, repub96, perot96, black, famwchild, repub92, oparty, dpi98to99, college, work home, perot92, govt workers, hispanic, and asian. All of these variables were significant at the 99-percent level except college and hispanic, which were significant at the 95-percent level. The following variables had a positive effect on votes going to Al Gore, in increasing order: divorced, dpi97to99, arts workers, associate, female/male, foreignborn, language, owner-occupied, military, traveltime, log(rent), grad/prof, durate90to00, manuf workers, widowed, nevermarried, income00, femhhwchild, and fem emp/females. All of these variables were also significant at the 99-percent level except divorced, female/male, and language, which were significant at the 90-, 95-, and 95-percent levels, respectively. Results from this table indicate that the following variables had an effect on votes going to Al Gore in 2000 that was not statistically different from zero: poverty, log(hvalue), dincome90to00, log(mortgage), samehouse, oldhouse, newhouse, highschool, and urate00. The ρ parameter was significant and positive with an estimated value of [...]. This provides evidence for spatial dependence, an issue that will also be discussed more fully in section 2.4.

Table 2.1: Maximum Likelihood Spatial Autoregressive (MLSAR) Results

Variable           Bush Coeff    Variable           Gore Coeff
femhhwchild        **            repub96            **
fem emp/females    **            perot96            **
urate00            **            black              **
income00           **            famwchild          **
samehouse          **            repub92            **
oldhouse           **            oparty             **
nevermarried       **            dpi98to99          **
foreignborn        **            college            *
log(rent)          **            work home          **
language           **            perot92            **
oparty             **            govt workers       **
traveltime         **            hispanic           *
log(hvalue)        *             poverty
grad/prof          *             asian              **
female/male        **            log(hvalue)
work home          **            dincome90to00
military           **            log(mortgage)
dpi97to99          **            samehouse
associate          *             oldhouse
arts workers       *             newhouse
newhouse                         highschool
manuf workers                    divorced
highschool                       dpi97to99          **
divorced                         arts workers       **
widowed                          associate          **
poverty                          female/male        *
dpi98to99                        foreignborn        **
asian                            language           *
durate90to00                     owner-occupied     **
owner-occupied                   military           **
log(mortgage)                    traveltime         **
college                          log(rent)          **
govt workers       **            grad/prof          **
dincome90to00      **            durate90to00       **
hispanic           **            manuf workers      **
black              **            widowed            **
famwchild          **            urate00
perot92            **            nevermarried       **
perot96            **            income00           **
repub92            **            femhhwchild        **
repub96            **            fem emp/females    **
rho                **            rho                **

2.4 Comparison of least-squares and maximum likelihood spatial autoregressive results

Parameter estimates along with the associated marginal probability levels are presented in Table 2.2 for Bush and in Table 2.3 for Gore, for a traditional ordinary least-squares model as well as for a maximum likelihood spatial autoregressive model. These tables also contain the difference between $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$ and the associated marginal probability levels from the t tests that were carried out to test whether the two procedures produced estimates that were significantly different from each other.

The spatial autoregressive model is robust to the influence of spatial autocorrelation, whereas the least-squares model produces estimates that are biased and inconsistent in the face of the same problem. To conclusively answer

the question of whether our data exhibit spatial dependence, a comparison of the least-squares and maximum likelihood SAR estimates is needed. If spatial dependence exists and is influential enough to bias the least-squares estimates, then one would expect to see an estimated ρ that is significantly different from zero as well as least-squares parameter estimates that are larger in absolute value than the maximum likelihood SAR parameter estimates.

Table 2.2 contains least-squares and maximum likelihood SAR parameter estimates for the y = Bush/Total Votes regression along with indications of their significance levels. The table also includes the difference between $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$ along with the marginal probability levels associated with a t test of whether the two regression routines produce significantly different parameter estimates.

Table 2.2 provides evidence for the existence of spatial dependence in the sample of Bush votes. The results yield a ρ estimate that is significantly different from zero, at the 99-percent level, with an estimated value of [...]. Column six indicates that two variables have parameter estimates that differ significantly between the two procedures: the constant term and the repub96 variable. These differences are significant at the 99- and 95-percent levels, respectively. Note that $\hat{\beta}_{OLS}$ is larger than $\hat{\beta}_{MLSAR}$, in absolute value terms, for both of these variables. This is consistent with theoretical expectations, as the least-squares parameter estimates are biased away from zero in the face of spatial dependence and evidence exists for influential spatial dependence in those variables.

Aside from the differences between $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$, other differences between

Table 2.2: Least Squares (OLS) vs Maximum Likelihood Spatial Autoregressive (ML SAR) Model for y = Bush/Total Votes

variables          OLS    SAR    bdiff
constant           **     **     **
femhhwchild        **     **
fem emp/females    **     **
urate00            **     **
income00           **     **
samehouse          **     **
oldhouse           **     **
nevermarried       **     **
foreignborn        **     **
log(rent)          **     **
language           **     **
oparty             **     **
traveltime         **     **
log(hvalue)        *      *
grad/prof          **     *
female/male        **     **
work home          **     **
military           **     **
dpi97to99          *      **
associate          *      *
arts workers              *
newhouse
manuf workers
highschool
divorced
widowed
poverty
dpi98to99
asian
durate90to00
owner-occupied
log(mortgage)
college            *
govt workers       **     **
dincome90to00      **     **
hispanic           **     **
black              **     **
famwchild          **     **
perot92            **     **
perot96            **     **
repub92            **     **
repub96            **     **     *
rho                NA     **     NA

the two modelling techniques exist. Results from the least-squares technique indicate that the dpi98to99, log(mortgage), and arts workers variables are not statistically different from zero, while the maximum likelihood spatial autoregressive technique indicates that those variables are significant at the 90-, 90-, and 95-percent levels, respectively. The college variable is significantly different from zero at the 95-percent level in the least-squares routine, but is marginally insignificant in the maximum likelihood spatial autoregressive results.

Table 2.3 contains least-squares and maximum likelihood SAR parameter estimates for the Gore regression along with symbols that reflect their associated levels of significance. This table also contains the difference between $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$ and the associated marginal probability levels from a t test that was carried out to see whether the two sets of estimates were significantly different.

Table 2.3 provides evidence for the existence of spatial dependence in our Gore data as well. The results yield a ρ estimate that is significantly different from zero, at the 99-percent level, with an estimated value of [...]. Column six indicates that ten variables have parameter estimates that differ significantly between the two procedures: the constant term, black, repub92, durate90to00, femhhwchild, perot92, dincome90to00, dpi97to99, female/male, and language. These differences are significant at the 99-, 95-, and 90-percent levels. Note that $\hat{\beta}_{OLS}$ is larger than $\hat{\beta}_{MLSAR}$, in absolute value terms, for all ten of these variables. This is consistent with theoretical expectations, as the least-squares parameter estimates are biased away from zero in the face of spatial dependence and evidence exists for influential spatial dependence in those variables.
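The direction of the least-squares bias described above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of the thesis results: the ring-shaped weight matrix, sample size, number of replications, and true slope of 1 are all hypothetical, and the spatial estimate here filters y with the true ρ instead of estimating ρ by maximum likelihood.

```python
import numpy as np

def ring_weights(n):
    """Hypothetical row-standardized weights: two neighbors per unit on a ring."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx - 1) % n] = 0.5
    W[idx, (idx + 1) % n] = 0.5
    return W

def mean_slopes(rho, n=300, reps=200, beta=1.0, seed=0):
    """Average OLS slope and spatially filtered slope over simulated SAR samples."""
    rng = np.random.default_rng(seed)
    W = ring_weights(n)
    A = np.linalg.inv(np.eye(n) - rho * W)  # (I - rho W)^{-1}
    ols, filtered = [], []
    for _ in range(reps):
        x = rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        y = A @ (beta * x + rng.normal(size=n))  # SAR data-generating process
        # Least squares ignores the spatial lag Wy
        ols.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
        # Regress y - rho*W*y on X, using the true rho (an assumption of this sketch)
        filtered.append(np.linalg.lstsq(X, y - rho * (W @ y), rcond=None)[0][1])
    return float(np.mean(ols)), float(np.mean(filtered))

results = {rho: mean_slopes(rho) for rho in (0.0, 0.3, 0.6)}
for rho, (b_ols, b_sar) in results.items():
    print(f"rho={rho:.1f}  mean OLS slope={b_ols:.3f}  mean filtered slope={b_sar:.3f}")
```

Under these assumptions the average least-squares slope moves away from the true value of 1 as ρ grows, while the spatially filtered slope stays close to it. This is the qualitative pattern that the comparison of $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$ relies on.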

Table 2.3: Least Squares (OLS) vs Maximum Likelihood Spatial Autoregressive (ML-SAR) Model for y = Gore/Total Votes

variables          OLS    SAR    bdiff    plevel
constant           **     **     **       0.00
repub96            **     **
perot96            **     **
black              **     **     **       0.00
famwchild          **     **
repub92            **     **     *        0.02
oparty             **     *
dpi98to99          **     **
college            **     **
work home          *      *
perot92            **     **
govt workers       **     **
hispanic           **     **
poverty
asian              **     **
log(hvalue)        **     *
dincome90to00      **     **
log(mortgage)      **     **
samehouse          *
oldhouse           **     **
newhouse
highschool         **     **
divorced                  *
dpi97to99          **     **
arts workers       **
associate          **     **
female/male        **     **
foreignborn        **     **
language           **     **
owner-occupied     **     **
military
traveltime         **     **
log(rent)          **     **
grad/prof
durate90to00                     *        0.04
manuf workers
widowed
urate00            **     **
nevermarried
income00           **     **
femhhwchild        **     **     *        0.03
fem emp/females    **     **
rho                NA     **     NA       NA

Aside from the differences between $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$, other differences between the two modelling techniques exist. Results from the least-squares technique indicate that the oparty and log(hvalue) variables were significantly different from zero at the 99-percent level, while the maximum likelihood SAR estimates found both of them to be significantly different from zero at the 95-percent level. The samehouse variable goes from being significantly different from zero at the 95-percent level to being significantly different from zero at the 90-percent level. The arts workers variable goes from being significantly different from zero at the 99-percent level to being not significantly different from zero. The durate90to00 variable goes from being significantly different from zero at the 90-percent level to not being significantly different from zero. The divorced variable was significantly different from zero at the 90-percent level and actually becomes more significant, being significantly different from zero at the 95-percent level when estimated using a maximum likelihood SAR model.

Overall, the results given in Table 2.2 and Table 2.3 provide evidence for the existence of spatial autocorrelation in the sample data. This spatial dependence stems from omitted unobservable variables that can be accounted for only by the inclusion of the spatially lagged dependent variable. The spatial autoregressive model properly assigns the unobserved variation in $y$ to $Wy$ and thus avoids the bias and inconsistency problems encountered by the ordinary least-squares regression procedure.

If spatial autocorrelation is such an important issue, why do we see so few variables that have significantly different parameter estimates between the two regression techniques? This is because the effect that spatial dependence has on $\hat{\beta}_{OLS}$ is a function of the size of ρ. The larger ρ is, the greater the divergence between $\hat{\beta}_{OLS}$ and

$\hat{\beta}_{MLSAR}$. Table 2.1 contains the maximum likelihood SAR estimates for ρ from both the Bush regression and the Gore regression. The estimates are [...] for Bush and [...] for Gore. These estimates are relatively small, and so we see relatively few parameters whose estimates differ significantly. It is reassuring to see that while neither regression produced a very large ρ estimate, the ρ associated with the Gore regression was nearly twice as large as the ρ estimate resulting from the Bush regression, and this corresponded to a larger divergence between $\hat{\beta}_{OLS}$ and $\hat{\beta}_{MLSAR}$ in the Gore results. In Table 2.2 we see only two parameters with significantly different estimates, whereas in Table 2.3 there are ten. This, again, is due to the fact that the Gore regression yielded a ρ estimate nearly twice as large, causing OLS to produce more biased estimates.

2.5 Nonconstant Variance

Two issues are involved in the analysis of nonconstant variance of the errors (LeSage, 1997). One issue is heteroscedasticity and the other is outliers. These two issues are important because heteroscedasticity can cause inefficiency and outliers are known to cause bias in the parameter estimates. To investigate this issue we use a method employed by LeSage (1997) to estimate the following model. For a single equation, the Heteroscedastic Linear SAR model takes the form:

$$y = \rho W y + X\beta + \varepsilon \qquad (2.3)$$
$$\varepsilon \sim N(0, \sigma^2 V), \qquad V = \mathrm{diag}(v_1, v_2, \ldots, v_n)$$

This spatial model is the same as that described in equation 2.1, with one alteration: the homoscedastic assumption about ε is relaxed. The Heteroscedastic Linear SAR model allows for the occurrence of outliers and nonconstant error variance by utilizing Markov Chain Monte Carlo (MCMC) methods, described in Geweke (1993), to take draws of $V^{-1}$ from a chi-squared distribution. The shape of the chi-squared distribution is controlled by the degrees of freedom parameter, and Geweke (1993) recommends setting this parameter to a small value in the range of 4 to 7 to produce optimal results for heteroscedastic estimation of this model. This degrees of freedom setting allows for $v_i$ draws that deviate greatly from the homoscedastic value of 1. Setting this parameter to a large value changes the shape of the chi-squared distribution, making it approximately equal to that of the normal distribution, and forces the $v_i$ draws to take on values much closer to unity. This setting creates a situation where the model produces estimates statistically equal to those from a homoscedastic model.

The βs here take a form similar to those from generalized least-squares, extended to the context of spatial autoregressive models. Aberrant or outlying observations are down-weighted by the inverse of large $v_i$ estimates that reside on the diagonal of the
