Lecture 7: Spatial Econometric Modeling of Origin-Destination flows

Lecture 7: Spatial Econometric Modeling of Origin-Destination flows James P. LeSage Department of Economics University of Toledo Toledo, Ohio 43606 e-mail: jlesage@spatial-econometrics.com June 2005

The gravity/spatial interaction model The terms gravity models and spatial interaction models has been used in the literature to label models that focus on flows between origins and destinations, Sen and Smith (1995). These typical rely on a function of the distance between an origin and destination and explanatory variables pertaining to characteristics of both origin and destination regions. With a few exceptions, use of spatial lags typically found in spatial econometric methods have not been used in these models. The notion that use of distance functions in conventional spatial interaction models effectively capture spatial dependence in the interregional flows being analyzed has been challenged in recent work by Porojon (2001) for the case of international trade flows, Ming and Pace (2004) for retail sales and unpublished work that utilizes both German and Canadian transportation network flows. The residuals from conventional models were found to exhibit spatial dependence, which could be exploited to improve the precision of inference as well as prediction accuracy. 1

Interregional flows in a spatial regression context Starting with an n by n square matrix of interregional flows from each of the n origin regions to each of the n destination regions, we can produce an n 2 by 1 vector of these flows by stacking the columns of the flow matrix into a variable vector that we designate as y. Table 1: A flow matrix (population migration 1995-2000) Destination/Origin Origin 1 Origin 2... Origin n Alabama Arizona Wyoming Destination 1 (Alabama) 1,392,507 2,797 467 Destination 2 (Arizona) 3,452 1,669,415 4,458. Destination n (Wyoming) 517 2,246 147,232 A typical least-squares gravity model y = αι + X d β d + X o β o + Dγ + ε (1) where: 2

X d = 0 B@ X(nxk) X(nxk). X(nxk) 1 CA X o = 0 B@ x 1 x 1. x 1 x 2 x 2. x 2... x n x n. x n 1 CA (2) D is a matrix of O-D distances vectorized (n 2 x1) ι is an n 2 x1 intercept vector ε N(0, σ 2 I n 2) α, β o, β d, γ are parameters to be estimated 3

Spatial dependence of flows In contrast to the traditional regression-based gravity model, a spatial econometric model of the variation in origin-destination flows would be characterized by: 1) reliance on spatial lags of the dependent variable vector, which we refer to as a spatial autoregressive model (SAR): y = ρw y + Xβ + ε, or 2) of the disturbance terms, which we label a spatial error model (SEM): y = Xβ + u, u = ρw u + ε, or 3) perhaps spatial lags of both kinds, which we denote as the general spatial model (SAC): y = ρw y + Xβ + u u = ρw u + ε. Spatial weight matrices represent a convenient and parsimonious way to define the spatial dependence or connectivity relations among observations. 4

Spatial weight matrices for flows A key issue is how to construct a meaningful spatial weight matrix in the case where the n 2 by 1 vector of observations reflect flows from all origins to all destinations, rather than the typical case where each observation represents a region. We can create a typical n by n first-order contiguity or m nearest neighbors weight matrix W that reflects relations between the n destinations/origin regions. This can be repeated using I n W to create an n 2 by n 2 row-standardized spatial weight matrix that we label W o, shown in (3). 0 B@ 1 CA (3) W 0... 0 0 W 0. W o =. 0... 0 0... 0 W Use of this matrix to form a spatial lag of the dependent variable, W o y, (where W o = I n W with W rowstandardized), we can capture origin-based spatial dependence relations using an average of flows from neighbors to each origin state to each of the destinations. Intuitively, it seems plausible that forces leading to flows from any origin to a particular destination state may create similar flows from neighbors to this origin to the same destination. captures. This is what the spatial lag W o y 5

An example of origin-based spatial dependence As an example, consider a single row i of the spatial lag vector W o y that represents flows from the origin state of Florida to the destination state of Washington. If the states are organized by FIPS code, or alphabetically (excluding Alaska and Hawaii) this would be the (45 x 49 + 9)th row, since Washington is the 45th state and Florida is the 9th state. First-order contiguous neighbors to the origin Florida are Alabama and Georgia, and neighbors to the destination Washington are Oregon and Idaho. The spatial lag W o y would represent an average of the flows from Alabama and Georgia (neighbors to the origin) to the destination state Washington. A similar interpretation applies to other rows of the spatial lag W o y, for example the (45 x 49 + 1)st row represents flows from the origin state of Alabama to the destination state of Washington, since Washington is the 45th state and Alabama is the 1st state. In this case, spatial lag W o y would represent an average of the flows from Florida, Georgia, Mississippi and Tennessee (neighbors to the origin) to the destination state Washington. 6

Destination-based spatial dependence A second type of spatial dependence that could arise in the gravity model would be destination-based dependence. Intuitively, it seems plausible that forces leading to flows from an origin state to a destination state may create similar flows to nearby or neighboring destinations. A spatial weight matrix that we label W d can be constructed to capture this type of dependence using W I n, producing an n 2 by n 2 spatial weight matrix that captures connectivity relations between the flows from an origin state to neighbors of the destination state. Using our example of flows from the origin state of Florida to the destination state of Washington, the (45 x 49 + 9)th row of the spatial lag vector W d y represent an average of flows from Florida to Idaho and Oregon, states that neighbor the destination state of Washington. In our other example, where the origin state was Alabama, the (45 x 49 + 1)st row of the spatial lag vector W d y would represent an average of flows from Alabama to Idaho and Oregon, states that neighbor the destination state of Washington. 7

An example of destination-based spatial dependence To provide an example of this we consider four regions located in a row as presented in Table 2. Table 2: Location of 4 Regions in Space Region #1 Region #2 Region #3 Region #4 The row-standardized first-order contiguity matrix associated with this regional configuration is shown in (4). W = 0 B@ 0 1 0 0 1/2 0 1/2 0 0 1/2 0 1/2 0 0 1 0 1 CA (4) For this example, W d = W I n taking the form shown in (5), where the n by n matrix W is defined in (5) and 0 represents an n by n matrix of zeros, where n = 4 in this example. W d = 0 B@ 0 I n 0 0 (1/2)I n 0 (1/2)I n 0 0 (1/2)I n 0 (1/2)I n 0 0 I n 0 (5) 1 CA 8

Origin-destination-based spatial dependence A third type of dependence to consider is reflected in the product W o W d = (I n W ) (W I n ) = W W. This spatial weight matrix reflects an average of flows from neighbors to the origin state to neighbors of the destination state. One motivation for this matrix product might be a spatial filtering perspective. We might envision a spatial autoregressive model of the type shown in (6) based on successive filtering. (I n 2 ρ 1 W o )(I n 2 ρ 2 W d )y = αι+x d β d +X o β o +Dγ+ε (6) This leads to a model that includes the interaction term in the sequence of spatial lags: y = ρ 1 W o y + ρ 2 W d y ρ 1 ρ 2 W o W d y + αι + X d β d + X o β o + Dγ + ε (7) 9

An example of origin-destination-based spatial dependence Using our example of flows from the origin state of Florida to the destination state of Washington, the (45 x 49 + 9)th row of the spatial lag vector W o W d y represent an average of: flows from Alabama and Georgia (neighbors to the origin state) to Idaho (a neighbor to the destination state), and flows from Alabama and Georgia (neighbors to the origin state) to Oregon (a neighbor to the destination state). In the case of our other example based on flows from the origin state of Alabama to the destination state of Washington, the (45 x 49 + 1)st row of the spatial lag vector W o W d y represent an average of: flows from Florida, Georgia, Mississippi and Tennessee (neighbors to the origin state) to Idaho (a neighbor to the destination state) and flows from Florida, Georgia, Mississippi and Tennessee (neighbors to the origin state) to Oregon (a neighbor to the destination state). 10

Spatial regression models for O-D flows Proposition #1 Given a row-normalized spatial weight matrix W, the spatial weight matrices: W o = I n W (8) W d = W I n W od = W W can be used directly to extend the gravity regression model in (1) to a SAR, SEM or SAC model specification as shown in (9). y = ρw J y + αι + X d β d + X o β o + Dγ + u u = λw J u + ε (9) where: W J denotes any of the spatial weight matrices W o, W d, W od. The spatial weight matrices W o, W d, W od would obey the usual properties of row-normalized weight matrices allowing use of existing algorithms for maximum likelihood (Pace and Barry, 1997), Bayesian (LeSage, 1997) or generalized method of moments estimation estimation (Kelejian and Prucha, 1999). 11

Proposition #2 Given a row-normalized spatial weight matrix W, the elements of the matrices W o and W d are mutually exclusive. This allows us to add the spatial weight matrices: W J = W o + W d (10) = (I n W ) + (W I n ) and use this summation directly to extend the gravity regression model in (1) to a SAR, SEM or SAC model specification as shown in (9). The sum of weight matrices reflecting destination and origin dependence spatial weight matrices W o, W d would obey the usual properties of rownormalized weight matrices allowing use of existing algorithms for maximum likelihood (Pace and Barry, 1997), Bayesian (LeSage, 1997) or generalized method of moments estimation estimation (Kelejian and Prucha, 1999). 12

Proposition #3 Given a row-normalized spatial weight matrix W, one can form a spatial weight matrix: W J = W o + W d + W od (11) = (I n W ) + (W I n ) + (W W ) This captures origin, destination, as well as origindestination interaction dependence. This weight matrix would obey the usual properties of row-normalized weight matrices allowing use of existing algorithms for maximum likelihood (Pace and Barry, 1997), Bayesian (LeSage, 1997) or generalized method of moments estimation estimation (Kelejian and Prucha, 1999). 13

Non-conventional model specifications In addition to model specifications that can be estimated using conventional algorithms, there are a series of specifications that require slight changes to the conventional algorithms. For example, a successive spatial filtering approach suggests the specification: (I n 2 ρ 1 W o )(I n 2 ρ 2 W d )y = αι+x d β d +X o β o +Dγ+ε (12) This leads to a model that includes the interaction term in the sequence of spatial lags: y = ρ 1 W o y + ρ 2 W d y ρ 1 ρ 2 W o W d y + αι + X d β d + X o β o + Dγ + ε (13) Note that W o W d = W W, and we can relax the restriction from (13) that ρ 3 = ρ 1 ρ 2 using: y = ρ 1 W o y + ρ 2 W d y ρ 3 W o W d y + αι + X d β d + X o β o + Dγ + ε (14) 14

Maximum likelihood estimation Despite the statements made here regarding use of conventional algorithms for maximum likelihood, Bayesian or generalized method of moments estimation of the spatial econometric origin-destination interregional flow models, there are some computational considerations that arise as the number of observations becomes large. For example, use of an origin-destination flow matrix for the over 3,100 US counties would result in sparse spatial weight matrices of dimension n 2 by n 2 where n 2 = 9, 610, 000. Maximum likelihood and Bayesian estimation both require calculation of the log of the determinant expression for the n 2 by n 2 matrix: (I n 2 ρw J ). While specialized approaches to calculating this logdeterminant have been proposed by Pace and LeSage (2004) and Anselin and Smirnov (2004), it turns out there are much more efficient approaches that can exploit the special structure of matrices like W d = I n W, W o = W I n and W w = W W. 15

The log-determinant term in the likelihood We note that the likelihood function for models based on a single spatial weight matrix W J, concentrated with respect to the parameters β and σ will take the form: LogL(ρ) = C +log I n 2 ρw J (n 2 /2)log(e e(ρ)) (15) Where e e(ρ) represents the sum of squared errors expressed as a function of the scalar parameter ρ alone after concentrating out the parameters β, σ (see LeSage and Pace, 2004). The log-determinant of the transformation is the trace of the matrix logarithm of the transformation, and the Taylor series expansion of this has a simple form for the positive definite matrix transformation I n 2 ρw J. ln I n 2 ρw J = tr (ln(i n 2 ρw J )) = X t=1 ρ t tr(w t J ) t (16) 16

For the case of destination or origin weight matrices, W d = I n W or W o = W I n, which we designate W J, J = o, d tr(w t J ) = tr(it n W t ) = tr(i t n ) tr(w t ) = n tr(w t ) (17) and thus the trace of a square matrix of order n 2 is simplified to a scalar (n) time a trace of a square matrix of order n (i.e., W ). This means we can handle the US county sample containing n 2 = 9, 610, 000 observations in the same time as the n = 3, 100 problem, or around 1-2 seconds. 17

Successive spatial filtering model specifications As noted earlier, a possible variant of using either an origin weight matrix W o or a destination weight matrix W d would be to successively transform the dependent variable by (I n 2 ρ 1 W o ) and (I n 2 ρ 2 W d ). The idea would be to remove the destination dependence and subsequently remove the origin dependence or viceversa. Remarkably, the order in which one transforms the dependent variable makes no difference as stated in (18). (I n 2 ρ 1 W o )) (I n 2 ρ 2 W d )) = (18) (I n 2 ρ 2 (I n W )) (I n 2 ρ 1 (W I n )) The explanation for this can be seen by expanding the products in (19). Since the cross-product of W I n and I n W is W W, and this is the same as the crossproduct of I n W and W I via the mixed-product rule for Kronecker products. (I n 2 ρ 1 W d )) (I n 2 ρ 2 W o ) = (19) (I n 2 ρ 1 (W I n ) ρ 2 (I n W ) + ρ 1 ρ 2 (W W )) 18

Because the log-determinant of a product is the sum of the log-determinants, the overall log-determinant of the successive filtering approach is quite simple as shown in (20). ln (I n 2 ρ 1 W d ) (I n 2 ρ 2 W o ) = n(ln I n ρ 1 W + ln I n ρ 2 W ) (20) This suggests we can specify a general form based on: W J = ρ 1 (W I n )+ρ 2 (I n W )+ρ 3 (W W ) (21) This general specification can yield a number of specific models of interest. Obviously, when ρ 1 = ρ 2 = ρ 3 = 0 no spatial autoregressive dependence exists. When ρ 2 = ρ 3 = 0 the model becomes one of origin autoregressive spatial dependence. When ρ 1 = ρ 3 = 0 the model becomes one of destination autoregressive spatial dependence. When ρ 3 = ρ 1 ρ 2 the successive filtering or product separable model emerges. Finally, the restriction ρ 3 = 0 produces the joint origin destination autoregressive dependence model 19

The log-determinant term for successive spatial filtering model specifications The challenge is to easily compute tr(w t J ) for t = 1 m where m is the largest moment computed. In brief, by combining an approach set forth by Barry and Pace (1999) for a statistical Monte Carlo approximation to the log-determinant and some results from Pace and LeSage (2002, Geographical Analysis), Kelley and I have implemented a MATLAB function to compute the logdeterminant term for these models. Our approach is a statistical approximation, but using results from Pace and LeSage (2002) where we show how the moments tr(w t J ) must monotonically decline for t > 1, we can provide an estimate of the confidence intervals for the accuracy of our approximation and show that the interval is narrow provided (ρ 1 + ρ 2 + ρ 3 ) m+1 /(m + 1) is reasonably small. 20

Some applied illustrations Using population migration flows over the periods 1985-90 and 1995-2000 for the 48 contiguous US states plus the District of Columbia, n = 49, n 2 = 2401, a small example. Dependent variable y is the growth rate in interstate migration flows from the earlier to later period, intrastate flows are set to zero. Explanatory variables are origin and destination characteristics for the states in 1990 taken from the 1990 US Census. All variables are log transformed. State area State per capita income, retirement income # of males, # of females (sum = population) (# persons aged 22-29) young, (# persons aged 60-64) near retirement, (# persons aged 65 plus) retired # persons born in the state, # persons foreign born, # recent foreign immigrants # of college graduates, # of persons with graduate and professional degrees median house value, median rent travel time to work # persons unemployed, # persons self-employed log(distance) and log(distance)*log(distance) 21

Model comparisons Table 3: Log-likelihood function values for various models Model log-like OLS 347.3761 SAR using W J = W d only 542.9683 SAR using W J = W o only 567.5280 SAR using W j = W o + W d + W od 604.8205 SAR using W J = W o + W d 617.9472 SAR using (I 2 n ρ 1W o ρ 2 W d + ρ 1 ρ 2 W o W d ) 623.4115 SAR using (I 2 n ρ 1W o ρ 2 W d ) 624.0966 SAR using (I 2 n ρ 1W o ρ 2 W d + ρ 3 W o W d ) 624.6686 Table 4: ρ 1, ρ 2, ρ 3 estimates for various models Model ρ 1, ρ 2, ρ 3 OLS 0 W J = W d only 0.2759 W J = W o only 0.3099 W j = W o + W d + W od 0.2019 W J = W o + W d 0.2719 (I 2 n ρ 1W o ρ 2 W d + ρ 1 ρ 2 W o W d ) 0.2381, 0.2854 (I 2 n ρ 1W o ρ 2 W d ) 0.2551, 0.2985 (I 2 n ρ 1W o ρ 2 W d + ρ 3 W o W d ) 0.2532, 0.2993, -0.0060 22

Variable β t-stat z-prob constant -1.207-0.32 0.7425 Darea 0.072 4.09 0.0000 Dpcincome 0.385 2.11 0.0340 Dmales -2.361-3.28 0.0010 Dfemales 2.935 3.41 0.0006 Dyoung -0.007-0.04 0.9629 Dnretire -0.380-1.50 0.1311 Dretired -0.120-0.62 0.5328 Dborninstate -0.206-3.71 0.0002 Dforeignborn -0.099-2.16 0.0303 Dcollege 0.169 1.96 0.0496 Dgradprof -0.009-0.13 0.8963 Dhousevalue -0.097-1.66 0.0960 Dtraveltime 0.037 0.29 0.7645 Dunemp 0.076 1.59 0.1116 Dmedianrent -0.287-1.86 0.0625 Dretireincome -0.389-4.58 0.0000 Dself-emp -0.217-4.76 0.0000 Drecentimmigr 0.088 2.28 0.0222 Oarea 0.090 4.95 0.0000 Opcincome 0.185 1.02 0.3071 Omales -0.686-0.93 0.3473 Ofemales 0.092 0.10 0.9158 Oyoung 0.785 4.64 0.0000 Onretire -1.319-5.21 0.0000 Oretired 1.079 5.39 0.0000 Oborninstate -0.081-1.48 0.1374 Oforeignborn 0.132 2.82 0.0047 Ocollege -0.238-2.70 0.0067 Ogradprof 0.146 2.11 0.0345 Ohousevalue 0.060 1.01 0.3078 Otraveltime 0.514 4.09 0.0000 Ounemp -0.327-6.39 0.0000 Omedianrent 0.019 0.12 0.8976 Oretireincome -0.198-2.36 0.0179 Oself-emp -0.100-2.17 0.0293 Orecentimmigr -0.160-4.05 0.0000 log(distance) -0.028-2.08 0.0366 log(distance2) 0.003 2.48 0.0128 rho1 0.253 9.27 0.0000 rho2 0.299 12.93 0.0000 rho3-0.006-0.13 0.8896 24

Other issues/work to be done Generalization to contrasts for non-flow spatial data samples Problem: dominant diagonal flows within regions Potential solutions: estimates transformations, Bayesian robust Problem: large # of zero flows for fine spatial scales, e.g., US counties Potential solutions: tobit estimation transformations, Bayesian spatial 25

Conclusions It is relatively easy to incorporate spatial lags into conventional least-squares gravity models The log-determinant takes a particularly convenient form, making it easy to handle models with n 2 in the millions 26