Spatial Models in Econometrics: Section 13 1


Single Equation Models

1.1 An overview of basic elements

Space is important: some illustrations

(a) Gas tax issues
(b) Police expenditures
(c) Infrastructure productivity
(d) Cities and budgets
(e) Regulation issues
(f) Exchange market contagion
(h) Volatility of GDP
(i) Spatial spill-overs between governments relating to quality measures

Consider a cross-sectional framework: $i = 1, \dots, N$.

Concept of neighbor: Neighboring units are units that interact in a meaningful way. This interaction could relate to spill-overs, externalities, copy-cat policies, geographic proximity issues, industrial structure, similarity of markets, sharing of infrastructure, welfare benefits, banking regulations, tax issues, re-election issues, etc.

Example of geographic neighbors: For the $i$-th unit, N denotes a close neighbor, NN a neighbor that is less close, etc.

Footnote 1 (to the title): These notes are mostly based on work I have done with Ingmar Prucha, and parts of them were taken from his notes.

    NN  NN  NN  NN  NN
    NN  N   N   N   NN
    NN  N   i   N   NN
    NN  N   N   N   NN
    NN  NN  NN  NN  NN

The pattern described by N is called a Queen; the pattern described by N and NN is called a double Queen.

A weighting matrix: A matrix that selects neighbors and indicates how important each neighbor is. For example, suppose we have $N$ observations on the dependent variable, $Y' = (Y_1, \dots, Y_N)$. Suppose the neighbors corresponding to the $i$-th observation (the $i$-th cross-sectional unit) are units 1, 2, and 3. Then the $i$-th row of the $N \times N$ weighting matrix $W$ will have the non-zero elements $w_{i1}$, $w_{i2}$, and $w_{i3}$. If $\sum_{j=1}^{N} w_{ij} = 1$ for all $i$, the weighting matrix is said to be row normalized. Because a unit is not viewed as its own neighbor, $w_{ii} = 0$, $i = 1, \dots, N$.

Example of use: Let $W_{i.}$ be the $i$-th row of $W$ and let $X_i$ be a scalar. Then a model such as

$$Y_i = b_0 + b_1 X_i + b_2 W_{i.} X + \varepsilon_i, \qquad X' = (X_1, \dots, X_N)$$

suggests that $Y_i$ depends upon $X_i$ (a within-unit effect), as well as upon $\sum_{j=1}^{N} w_{ij} X_j$, which is a weighted sum of the regressor in neighboring units. Typically, $W$ is specified to be row normalized so that $Y_i$ depends on $X_i$ and a weighted average of this regressor over the neighboring units. Clearly, the simplest weighted average is the uniform one: e.g., if unit $i$ has 5 neighbors, then the non-zero weights in the $i$-th row of $W$ are all 1/5. Other weighting schemes will be considered below.

A Further Elaboration and Some Specifications of $w_{ij}$

Consider again the above relation in scalar terms:

$$Y_i = b_0 + b_1 X_i + b_2 \sum_{j=1}^{N} w_{ij} X_j + \varepsilon_i$$

$$Y_i = b_0 + b_1 X_i + b_2 \bar{X}_i + \varepsilon_i, \qquad \bar{X}_i = \sum_{j=1}^{N} w_{ij} X_j$$
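As a small illustration of the Queen pattern and of the weighted average $\bar{X}_i$, the following is a minimal sketch in Python with NumPy. The function name `queen_weights`, the 5 x 5 grid, and the values of `X` are assumptions made purely for illustration; they are not part of the notes.

```python
import numpy as np

def queen_weights(n_rows, n_cols):
    """Row-normalized Queen weights for units laid out on a rectangular grid.

    Hypothetical helper: units are numbered row by row, and the neighbors of a
    unit are the (up to) eight cells that touch it, as in the N pattern above.
    """
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        W[i, rr * n_cols + cc] = 1.0
    # Uniform weights: each neighbor of unit i gets weight 1 / n_i.
    return W / W.sum(axis=1, keepdims=True)

W = queen_weights(5, 5)
X = np.arange(25, dtype=float)   # illustrative regressor values
X_bar = W @ X                    # X_bar[i] = sum_j w_ij * X_j
```

Here the center unit of the grid has eight neighbors, each with weight 1/8, while a corner unit has three neighbors, each with weight 1/3.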

In the scalar form above one can clearly see that $w_{ij}$ relates to the effect that $X_j$ has on $Y_i$. That is, the dependent variable depends on the within-unit value of $X$ and on a weighted sum (which could be a weighted average) of the values of $X$ corresponding to neighboring units. If $i$ and $j$ are not neighbors, $w_{ij} = 0$. If they are neighbors, there are various ways researchers have specified $w_{ij}$.

(A) Let $n_i$ be the number of neighbors that unit $i$ has. Then, if $j$ is a neighbor to $i$, researchers often take $w_{ij} = 1/n_i$. In this case $W$ would be a row normalized weighting matrix.

(B) Again suppose $j$ is a neighbor to $i$. Let $d_{ij}$ be a distance measure between $i$ and $j$. Then one wants $w_{ij} \to 0$ as $d_{ij} \to \infty$. Also, the closer $j$ is to $i$, the larger one might want $w_{ij}$ to be. Thus, researchers sometimes take $w_{ij} = 1/d_{ij}$. They may also specify a row normalized version as

$$w_{ij} = \frac{1/d_{ij}}{\sum_{r=1}^{n} (1/d_{ir})}$$

where the sum runs over the neighbors $r$ of unit $i$.

(C) Let $INC_r$ be income per capita in cross-sectional unit $r$; then one specification of $w_{ij}$ that has been considered is

$$w_{ij} = |INC_i - INC_j|^{-1}$$

This form has a disadvantage in that $w_{ij}$ is not bounded which, as we will see, is important for certain statistical results. One possible improvement is therefore

$$w_{ij} = \left[\, |INC_i - INC_j| + 1 \,\right]^{-1}$$

Other variables which could signify distance between neighboring units $i$ and $j$ are
(a) the average level of education
(b) the proportion of housing units that are rental units
(c) ethnic group composition differences
(d) geographic distances
(e) trade shares

(D) A generalization of the above would be, for two neighboring units,

$$w_{ij} = \frac{1}{d_{ij} + 1}, \qquad d_{ij} = \left[ (z_{i1} - z_{j1})^2 + \dots + (z_{ir} - z_{jr})^2 \right]^{1/2}$$

where $z_{iq}$ is the $q$-th relevant variable in cross-sectional unit $i$, $q = 1, \dots, r$. This measure depends upon the scale of the variables involved. Let

$$\bar{z}_{q,ij} = (z_{iq} + z_{jq})/2, \qquad q = 1, \dots, r$$

Then perhaps a better measure would be the scale-normalized weights

$$w_{ij} = \frac{1}{d_{ij} + 1} \quad \text{where} \quad d_{ij} = \left[ \frac{(z_{i1} - z_{j1})^2}{\bar{z}_{1,ij}^{\,2}} + \dots + \frac{(z_{ir} - z_{jr})^2}{\bar{z}_{r,ij}^{\,2}} \right]^{1/2}$$

Cliff and Ord type models

Basic specifications and parameter space issues

Let $Y' = [Y_1, \dots, Y_N]$, $X' = [X_{1.}', \dots, X_{N.}']$ (which is $K \times N$), and $u' = [u_1, \dots, u_N]$. Then one spatial model for $Y_i$ is

$$Y_i = a + X_{i.} B_1 + \rho_1 (W_{i.} Y) + (W_{i.} X) B_2 + u_i \qquad (1)$$

$$u_i = \rho_2 W_{i.} u + \varepsilon_i; \qquad |\rho_1| < 1, \quad |\rho_2| < 1 \qquad (2)$$

where $X_{i.}$ is $1 \times K$, $B_1$ and $B_2$ are $K \times 1$, $W_{i.}$ is $1 \times N$, $Y$ is $N \times 1$, $X$ is $N \times K$, $\varepsilon_i$ is i.i.d. $(0, \sigma_\varepsilon^2)$, and $X$ relates to exogenous variables which we take as nonstochastic. Let

$$W = \begin{bmatrix} W_{1.} \\ \vdots \\ W_{N.} \end{bmatrix} \quad (N \times N)$$

be the weighting matrix, which we assume at this point is observable and exogenous. Note that (1) and (2) can be written as

$$Y = a e_N + X B_1 + \rho_1 (WY) + (WX) B_2 + u \qquad (1A)$$

$$u = \rho_2 (Wu) + \varepsilon \qquad (2A)$$

where $e_N$ is an $N \times 1$ vector of unit elements. Assuming the inverses exist at the true values of $\rho_1$ and $\rho_2$, we have

$$Y = (I - \rho_1 W)^{-1} \left[ a e_N + X B_1 + W X B_2 + u \right] \qquad (3)$$

$$u = (I - \rho_2 W)^{-1} \varepsilon \qquad (4)$$
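The solution in (3) and (4) can be checked numerically. The sketch below, in Python with NumPy, generates data from the reduced form and confirms that they satisfy the structural equations (1A) and (2A). The circular neighbor structure, the sample size, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 2

# Illustrative row-normalized W: units sit on a circle and each unit's
# neighbors are the two adjacent units (an assumption, not from the notes).
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5

a, rho1, rho2 = 1.0, 0.4, 0.3            # hypothetical parameter values
B1 = np.array([1.0, -0.5])
B2 = np.array([0.2, 0.1])

X = rng.normal(size=(N, K))              # exogenous regressors
eps = rng.normal(size=N)                 # i.i.d. innovations
I, e = np.eye(N), np.ones(N)

# Equation (4): u = (I - rho2 W)^{-1} eps
u = np.linalg.solve(I - rho2 * W, eps)
# Equation (3): Y = (I - rho1 W)^{-1} [a e + X B1 + W X B2 + u]
Y = np.linalg.solve(I - rho1 * W, a * e + X @ B1 + W @ X @ B2 + u)

# The generated Y and u should satisfy the structural forms (1A) and (2A).
resid_1A = Y - (a * e + X @ B1 + rho1 * (W @ Y) + (W @ X) @ B2 + u)
resid_2A = u - (rho2 * (W @ u) + eps)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a design choice for numerical stability; both residual vectors should be zero up to rounding error.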

Since the elements of $(I - \rho_1 W)^{-1}$ and $(I - \rho_2 W)^{-1}$ will generally depend upon $N$, the vectors $Y$ and $u$ are really triangular arrays. For example, this means that the first element of $Y$ will be different if $N = 20$ than when $N = 25$. This implies that these elements and the vector $Y$ should be indexed with $N$:

$$Y_N' = (Y_{1N}, Y_{2N}, \dots, Y_{NN})$$

Thus, our sample on $Y$ would be

$$\begin{array}{llll} Y_{11} & Y_{12} & Y_{13} & \text{etc.} \\ & Y_{22} & Y_{23} & \\ & & Y_{33} & \end{array}$$

At this point we do not index all the variables, for simplicity of notation. Interestingly, the triangular nature of the variables involved, which leads to certain statistical problems, had not been recognized until recently in the (formal) literature.

Because many researchers estimate spatial models by ML procedures, it is typically assumed that $W$ is such that $(I - aW)^{-1}$ exists for all $|a| < 1$, which is taken to be the parameter space; see (2).

Note 1: If $W$ is row normalized, $(I - aW)$ is singular at $a = 1$.

Proof: Let $e_N' = (1, 1, \dots, 1)$, which is $1 \times N$. Then

$$(I - W) e_N = e_N - W e_N = e_N - e_N = 0$$

Note 2: If $W$ is row normalized, then $(I - aW)^{-1}$ exists for all $|a| < 1$.

Proof: We prove this by relying on a theorem by Gershgorin.

Gershgorin's Theorem

Let $A$ ($N \times N$) have elements $a_{ij}$. Let

$$R_i = \sum_{j=1,\, j \neq i}^{N} |a_{ij}|; \qquad C_j = \sum_{i=1,\, i \neq j}^{N} |a_{ij}|$$

Then each eigenvalue $\lambda$ of $A$ lies in at least one of the $N$ circles

$$|\lambda - a_{ii}| \leq R_i, \qquad i = 1, \dots, N$$

and hence in the union of these circles. Also, each root lies in at least one of the $N$ circles

$$|\lambda - a_{jj}| \leq C_j, \qquad j = 1, \dots, N$$

and hence in their union.

An important application to a weighting matrix: Consider again $W$, which has $w_{ii} = 0$. Let

$$r = \max_i \sum_j |w_{ij}| = \max_i R_i; \qquad c = \max_j \sum_i |w_{ij}| = \max_j C_j$$

Then the roots of $W$ satisfy

$$|\lambda_i| \leq r, \qquad |\lambda_i| \leq c, \qquad i = 1, \dots, N$$

If $W$ is row normalized, $r = 1$ and so $|\lambda_i| \leq 1$.

Given Gershgorin's theorem, note the following. Assume $w_{ii} = 0$ and $W$ is row normalized. Let $Q$ be the matrix that triangularizes $W$:

$$Q W Q^{-1} = D_\lambda, \qquad D_\lambda = \begin{bmatrix} \lambda_1 & \cdots & \lambda_{ij} \\ & \ddots & \vdots \\ 0 & & \lambda_N \end{bmatrix}$$

Then $|D_\lambda| = \lambda_1 \cdots \lambda_N$ and

$$|I - aW| = |Q Q^{-1} (I - aW)| = |Q (I - aW) Q^{-1}| = |I - a D_\lambda| = (1 - a\lambda_1) \cdots (1 - a\lambda_N) \neq 0$$

if $|a| < 1$, since $|a \lambda_i| \leq |a| < 1$.

Note 3: The above indicates that if $W$ is row normalized, the parameter space specified in (2) is such that the inverses in (3) and (4) exist. If $W$ is

not row normalized, $(I - aW)$ will generally be singular for certain values of $|a| < 1$. In this case the following should be noted. Let $\alpha = \min(r, c)$, where $r$ and $c$ are defined above. Then, assuming that the elements of $W$ are nonnegative, $(I - aW)$ will be nonsingular for all $|a| < 1/\alpha$. This could be taken as the parameter space.

Proof: If $a = 0$, $(I - aW)$ is nonsingular. Now consider $a \neq 0$. In this case $|I - aW| = 0$ implies

$$\left| \frac{1}{a} I - W \right| = 0 \quad \text{or} \quad \left| W - \frac{1}{a} I \right| = 0$$

So $I - aW$ is singular if $1/a$ is equal to a root of $W$. Now if the roots of $W$ are such that $|\lambda_i| \leq r$ and $|\lambda_i| \leq c$, then

$$|\lambda_i| \leq \min(r, c) = \alpha$$

So $|I - aW| \neq 0$ if

$$\left| \frac{1}{a} \right| > \min(r, c) = \alpha \quad \text{or} \quad |a| < \frac{1}{\alpha}$$

This is an important result because a model which has a weighting matrix which is not row normalized can always be normalized in such a way that the inverse needed to solve the model will exist in an easily established region. For example, suppose $W$ is not row normalized. Then the model (see footnote 2)

$$Y = XB + \rho_1 W Y + \varepsilon \qquad (5)$$

$$\;\; = XB + (\rho_1 \alpha) \left( \frac{W}{\alpha} \right) Y + \varepsilon$$

Footnote 2: Note that $\alpha$ below will depend on $n$, and hence so will $\rho_1^*$ and $W^*$. For ease of notation, we do not indicate this dependence.

or

$$Y = XB + \rho_1^* W^* Y + \varepsilon \qquad (6)$$

where $\rho_1^* = \rho_1 \alpha$, $W^* = W/\alpha$, and again

$$\alpha = \min(c, r): \qquad c = \max_j \sum_i |w_{ij}|; \qquad r = \max_i \sum_j |w_{ij}|$$

Note that $\alpha$ is easily determined. Also note that $|I - \rho_1^* W^*| \neq 0$ for

$$|\rho_1^*| < \frac{1}{\min\left( \frac{c}{\alpha}, \frac{r}{\alpha} \right)} = \frac{1}{\frac{1}{\alpha} \min(c, r)} = 1$$

So if the model is renormalized as

$$Y = XB + \rho_1^* W^* Y + \varepsilon \qquad (7)$$

and $\rho_1^*$ is taken to be the parameter, the inverse exists for all $|\rho_1^*| < 1$. One would then estimate $\rho_1^*$ as a parameter and, since $\rho_1^* = \rho_1 \alpha$, one would estimate $\rho_1$ as

$$\hat{\rho}_1 = \hat{\rho}_1^* / \alpha \qquad (8)$$

Note 4: We note one more point, which corresponds to a special case. Let $W$ again be a weighting matrix with $w_{ii} = 0$, $i = 1, \dots, N$, and assume that all of the roots of $W$ are real. This is a strong assumption. Assume that $W$ is not row normalized. Let $\lambda_{max}$ and $\lambda_{min}$ be the largest and smallest roots of $W$. Assume, as will typically be the case if all of the roots are real, that $\lambda_{max} > 0$ and $\lambda_{min} < 0$. Then $(I - aW)$ is nonsingular for all

$$\lambda_{min}^{-1} < a < \lambda_{max}^{-1} \qquad (9)$$

Proof: $I - aW$ is nonsingular for $a = 0$. If $a \neq 0$ we have, as before,

$$|I - aW| = |I - a D_\lambda| = (1 - a\lambda_1)(1 - a\lambda_2) \cdots (1 - a\lambda_N)$$

so $I - aW$ is nonsingular unless $a$ is equal to the inverse of a root, $\lambda_1^{-1}, \dots, \lambda_N^{-1}$, or equivalently unless $a^{-1}$ is equal to a root $\lambda_1, \dots, \lambda_N$. Thus if

$$a^{-1} < \lambda_{min} \quad \text{or if} \quad a^{-1} > \lambda_{max}$$

$(I - aW)$ is nonsingular. But, recalling that $\lambda_{max} > 0$ and $\lambda_{min} < 0$,

$$a^{-1} > \lambda_{max} \;\Rightarrow\; a < \lambda_{max}^{-1}, \qquad a^{-1} < \lambda_{min} \;\Rightarrow\; a > \lambda_{min}^{-1}$$

Estimation:

Again consider a variation on the model in (1A) and (2A):

$$Y = X B_1 + \rho_1 W Y + (WX) B_2 + u \qquad (1A)$$

$$u = \rho_2 (Wu) + \varepsilon, \qquad |\rho_1| < 1, \quad |\rho_2| < 1 \qquad (2A)$$

Assume $(I - aW)^{-1}$ exists for $|a| < 1$. Again, if $W$ is not row normalized, then the model can always be normalized as described above.

Case 1: $\rho_1 = 0 = \rho_2$

In this case the model reduces to

$$Y = X B_1 + (WX) B_2 + \varepsilon, \qquad X \text{ is } N \times K \qquad (10)$$

Assume rank$(X, WX) = 2K$. Note that if $X$ contains the constant term, then rank$(X, WX) < 2K$ if the weighting matrix is row normalized, since $W e_N = e_N$. Typically, we would not include $W e_N$ in our model even if the weighting matrix is not row normalized. To be a bit more precise, if $X$ contains the constant term, $X = (e_N, X_1)$, then our model might be

$$Y = b_0 e_N + X_1 b_1 + (W X_1) b_2 + \varepsilon \qquad (11)$$

and rank$(e_N, X_1, W X_1) = 2K - 1$.

Assume $\varepsilon \sim (0, \sigma^2 I)$, and that $X$ and $W$ are nonstochastic. Consider again the model (10) and let $Z = (X, WX)$ or, if the model is (11), let $Z = (e_N, X_1, W X_1)$. Assume that $Z$ has full column rank. Then estimate via OLS. If $\varepsilon' = (\varepsilon_1, \dots, \varepsilon_N)$, where $\varepsilon_i$ is i.i.d. $(0, \sigma^2)$, then the usual large sample theory holds if

$$N^{-1} Z'Z \to Q_{zz}$$

where $Q_{zz}$ is a finite matrix and $Q_{zz}^{-1}$ exists.

Case 2: $\rho_1 = 0$, $\rho_2 \neq 0$, $|\rho_2| < 1$

Now if the model is (11) we have

$$Y = ZB + u \qquad (12)$$

$$u = \rho_2 W u + \varepsilon \qquad (13)$$

or

$$Y = ZB + u, \quad u = \rho_2 W u + \varepsilon \qquad (14)$$

$$Z = (e_N, X_1, W X_1), \qquad B' = (b_0, b_1', b_2')$$

Properties of $u$

If $\varepsilon \sim (0, \sigma^2 I)$ then

$$u = (I - \rho_2 W)^{-1} \varepsilon$$

So

$$E(u) = 0, \qquad E(uu') = \sigma_\varepsilon^2 (I - \rho_2 W)^{-1} (I - \rho_2 W')^{-1} = \sigma_\varepsilon^2 \Omega_u$$

So the elements of $u$ are heteroskedastic, as well as spatially correlated.

GLS

$$\hat{B}_{GLS} = \left( Z' \Omega_u^{-1} Z \right)^{-1} Z' \Omega_u^{-1} Y \quad \text{(not feasible unless } \rho_2 \text{ is known)} \qquad (15)$$

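For the GLS estimator,

$$E\left( \hat{B}_{GLS} \right) = B, \qquad VC_{GLS} = \sigma_\varepsilon^2 \left( Z' \Omega_u^{-1} Z \right)^{-1}$$

Feasible Procedures

ML. Typically based on $\varepsilon \sim N(0, \sigma^2 I)$. So

$$u \sim N\left( 0, \sigma_\varepsilon^2 \Omega_u \right)$$

Thus

$$Y \sim N\left( ZB, \sigma_\varepsilon^2 \Omega_u \right)$$

so

$$L = \frac{ (\sigma_\varepsilon^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma_\varepsilon^2} [Y - ZB]' \Omega_u^{-1} [Y - ZB] \right\} }{ (2\pi)^{N/2} \, |I - \rho_2 W|_+^{-1} } \qquad (16)$$

Thus

$$\ln(L) = -\frac{1}{2\sigma_\varepsilon^2} [Y - ZB]' [I - \rho_2 W'] [I - \rho_2 W] [Y - ZB] - \frac{N}{2} \ln\left( \sigma_\varepsilon^2 \right) + \ln |I - \rho_2 W|_+ - \frac{N}{2} \ln 2\pi \qquad (17)$$

$$= -\frac{1}{2\sigma_\varepsilon^2} \left( [I - \rho_2 W][Y - ZB] \right)' [I - \rho_2 W][Y - ZB] - \frac{N}{2} \ln\left( \sigma_\varepsilon^2 \right) + \ln |I - \rho_2 W|_+ - \frac{N}{2} \ln 2\pi$$

$$= -\frac{1}{2\sigma_\varepsilon^2} \left( Y(\rho_2) - Z(\rho_2) B \right)' \left( Y(\rho_2) - Z(\rho_2) B \right) - \frac{N}{2} \ln\left( \sigma_\varepsilon^2 \right) + \ln |I - \rho_2 W|_+ - \frac{N}{2} \ln 2\pi$$

where

$$Y(\rho_2) = Y - \rho_2 W Y, \qquad Z(\rho_2) = Z - \rho_2 W Z$$

Note that $Y(\rho_2)$ and $Z(\rho_2)$ are spatial counterparts to the Cochrane-Orcutt transformation.

At this point we note that a major problem in maximizing $\ln L$ relates to the term $\ln |I - \rho_2 W|_+$. For example, this term must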

(a) be evaluated repeatedly for each trial value of $\rho_2$. If $N$ is large this will indeed be tedious. As one example, cross-sectional units could relate to counties, and there are over 3000 counties in the US. In other cross-sectional studies, the cross-sectional units could be families, and so it could be the case that $N > 50{,}000$.

(b) as in the proof of Note 2 above (page 6 of these notes), using Ord's (1975) suggestion,

$$\ln |I - \rho_2 W|_+ = \ln\left[ (1 - \rho_2 \lambda_1) \cdots (1 - \rho_2 \lambda_N) \right] = \sum_{i=1}^{N} \ln(1 - \rho_2 \lambda_i)$$

Now if the roots can be evaluated, $\ln |I - \rho_2 W|_+$ can be evaluated in terms of the sum for each trial value of $\rho_2$. This will be far simpler than the method proposed in (a). The problem is that if $N$ is roughly 450 or more, both (a) and (b) will involve computational accuracy problems. For example, Kelejian and Prucha (1999) found that the calculation of roots for even a nonsymmetric matrix involved accuracy problems.

The form of the MLE for $B$ and $\sigma_\varepsilon^2$

Based on (17) we have

$$\frac{\partial \ln L}{\partial B} = -\frac{1}{2\sigma_\varepsilon^2} \left[ -2 Z(\rho_2)' Y(\rho_2) + 2 Z(\rho_2)' Z(\rho_2) B \right] = 0 \qquad (18)$$

$$\frac{\partial \ln L}{\partial \sigma_\varepsilon^2} = \frac{1}{2\sigma_\varepsilon^4} \left[ Y(\rho_2) - Z(\rho_2) B \right]' \left[ Y(\rho_2) - Z(\rho_2) B \right] - \frac{N}{2\sigma_\varepsilon^2} = 0 \qquad (19)$$

It follows that

$$\hat{B}_{ML} = \left[ Z(\hat{\rho}_2)' Z(\hat{\rho}_2) \right]^{-1} Z(\hat{\rho}_2)' Y(\hat{\rho}_2)$$

$$\hat{\sigma}_\varepsilon^2 = \frac{1}{N} \left[ Y(\hat{\rho}_2) - Z(\hat{\rho}_2) \hat{B} \right]' \left[ Y(\hat{\rho}_2) - Z(\hat{\rho}_2) \hat{B} \right]$$

Interpretation

Premultiplying (14) by $(I - \rho_2 W)$ yields

$$Y(\rho_2) = Z(\rho_2) B + \varepsilon, \qquad \varepsilon \sim N\left( 0, \sigma^2 I \right) \qquad (20)$$
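The formulas above can be combined into a simple numerical search. The following Python sketch evaluates the log-determinant through the roots of $W$ as in (b), substitutes the expressions for $\hat{B}$ and $\hat{\sigma}_\varepsilon^2$ into (17) for each trial value of $\rho_2$, and picks the maximizing value on a grid. Concentrating the likelihood in this way and using a grid search are assumptions made for illustration, as are the function name `concentrated_loglik`, the circular weighting matrix, and all parameter values.

```python
import numpy as np

def concentrated_loglik(rho, Y, Z, W, lam):
    """ln L of (17) evaluated at B-hat(rho) and sigma2-hat(rho)."""
    N = len(Y)
    Y_r = Y - rho * (W @ Y)                   # Y(rho)
    Z_r = Z - rho * (W @ Z)                   # Z(rho)
    B = np.linalg.lstsq(Z_r, Y_r, rcond=None)[0]
    resid = Y_r - Z_r @ B
    sigma2 = resid @ resid / N
    # Ord's suggestion: ln|I - rho W| = sum_i ln(1 - rho * lambda_i).
    logdet = np.sum(np.log(1.0 - rho * lam.astype(complex))).real
    ll = (-0.5 * N * np.log(sigma2) + logdet
          - 0.5 * N * np.log(2.0 * np.pi) - 0.5 * N)
    return ll, B, sigma2

rng = np.random.default_rng(0)
N = 100
W = np.zeros((N, N))                          # illustrative circular W
for i in range(N):
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5
lam = np.linalg.eigvals(W)                    # roots computed once

Z = np.column_stack([np.ones(N), rng.normal(size=N)])
B_true, rho2_true = np.array([1.0, 2.0]), 0.5  # hypothetical values
u = np.linalg.solve(np.eye(N) - rho2_true * W, rng.normal(size=N))
Y = Z @ B_true + u

grid = np.linspace(-0.95, 0.95, 191)          # trial values of rho_2
ll_values = [concentrated_loglik(r, Y, Z, W, lam)[0] for r in grid]
rho_hat = grid[int(np.argmax(ll_values))]
_, B_hat, sigma2_hat = concentrated_loglik(rho_hat, Y, Z, W, lam)
```

The roots are computed once and reused for every trial value, which is the point of (b). For large $N$ the accuracy caveats noted above still apply.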

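The form of the ML estimators of $B$ and $\sigma_\varepsilon^2$ should be clear from the transformed model (20).

Large Sample Results for ML

In a recent article Lee (2004) gave a formal demonstration of conditions that ensure consistency and asymptotic normality of the ML estimators for the general spatial model considered. Although some of his assumptions are strong, to date this is the only paper in which a formal demonstration is given. In applied studies it is always assumed that the usual results hold, i.e.,

$$\sqrt{N} \left( \hat{P} - P \right) \xrightarrow{D} N(0, V), \qquad V^{-1} = -\lim E\left[ N^{-1} \frac{\partial^2 \ln L}{\partial P \, \partial P'} \right], \qquad P = \begin{bmatrix} B \\ \rho_2 \\ \sigma_\varepsilon^2 \end{bmatrix}$$

Small sample inference is typically based on the approximations

$$\hat{P} = \begin{bmatrix} \hat{B} \\ \hat{\rho}_2 \\ \hat{\sigma}_\varepsilon^2 \end{bmatrix}, \qquad \hat{P} \;\dot{\sim}\; N\left( P, \frac{\hat{V}}{N} \right), \qquad N \hat{V}^{-1} = -\left. \frac{\partial^2 \ln L}{\partial P \, \partial P'} \right|_{\hat{P}}$$

An important note relating to the likelihood function

Consider the model

$$y = X\beta + \rho_1 W y + u \qquad (21)$$

$$u = \rho_2 M u + \varepsilon$$

where $X$ is exogenous, $\varepsilon \sim N(0, \sigma^2 I)$, and $W$ and $M$ are two weighting matrices. Now consider a special case of this model in which $\beta = 0$ and $W = M$. Assuming both inverses exist, in this case

$$y \sim N(0, \sigma^2 \Omega)$$

$$\Omega = (I - \rho_1 W)^{-1} (I - \rho_2 W)^{-1} (I - \rho_2 W')^{-1} (I - \rho_1 W')^{-1}$$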

and so

$$\Omega^{-1} = (I - \rho_1 W')(I - \rho_2 W')(I - \rho_2 W)(I - \rho_1 W) \qquad (22)$$

$$= G G', \qquad G = \left[ I - (\rho_1 + \rho_2) W' + \rho_1 \rho_2 W' W' \right]$$

It should be clear from (22) that the likelihood is perfectly symmetrical in $\rho_1$ and $\rho_2$, and so these two parameters are not identified under the stated conditions. This is a known result in the literature; see, e.g., Anselin (1985).

Note carefully what we have stated. If in (21) $\beta = 0$ and $W = M$, there is an identification problem concerning $\rho_1$ and $\rho_2$. In practice it is typically assumed that in a model such as (21) $W = M$. However, in practice, models in which $\beta = 0$ are typically not considered. Results given below will imply that if, in a model such as (21), $\beta \neq 0$, there is no identification problem concerning the parameters of the model even if $W = M$. This is important to note, and unfortunately has not been noted by all researchers. For instance, in (21) the model in which $\rho_1 = 0$ is often referred to as the spatial error model; if in (21) $\rho_2 = 0$, the model is often referred to as the spatial lag model. There have been quite a number of studies in which researchers (still) try to determine whether the true model is a spatial error model or a spatial lag model, because it is assumed that the identification condition restricts the consideration of the general model in (21) in which neither $\rho_1$ nor $\rho_2$ is zero. This is unfortunate because the spatial patterns implied by this more general model are so much "richer" than those implied by either the spatial error model or the spatial lag model.

(B) Feasible GLS and the GMM method

Again the model is

$$Y = ZB + u \qquad (23)$$

$$u = \rho_2 W u + \varepsilon$$

Essentially, we first get a consistent estimator of $B$; use it to obtain $\hat{u}$; use $\hat{u}$ to obtain $\hat{\rho}_2$; use $\hat{\rho}_2$ to transform the model by the spatial Cochrane-Orcutt procedure; and then estimate $B$ by OLS.

Preliminaries

1) We will say that the row and column sums of an $N \times N$ matrix $A$ are uniformly bounded in absolute value if

$$\max_i \sum_{j=1}^{N} |a_{ij}| \leq c_a, \qquad \max_j \sum_{i=1}^{N} |a_{ij}| \leq c_a$$

for all $N \geq 1$, where $c_a$ is a finite constant which does not depend on $N$. We will also abbreviate reference to a matrix such as $A$ by just saying that it is "absolutely summable".

2) If $A$ and $B$ are $N \times N$ absolutely summable matrices, then so is $D = AB$.

Proof: Note that

$$d_{ij} = \sum_{r=1}^{N} a_{ir} b_{rj}$$

and consider

$$\sum_{j=1}^{N} |d_{ij}| = \sum_{j=1}^{N} \left| \sum_{r=1}^{N} a_{ir} b_{rj} \right| \leq \sum_{r=1}^{N} \sum_{j=1}^{N} |a_{ir}| \, |b_{rj}| = \sum_{r=1}^{N} |a_{ir}| \sum_{j=1}^{N} |b_{rj}| \leq c_a c_b$$

A similar demonstration will reveal that

$$\sum_{i=1}^{N} |d_{ij}| \leq c_a c_b$$

3) If $A$ is absolutely summable, its elements are bounded. Obvious!

4) If $A$ is absolutely summable, and $Z$ ($N \times K$) has bounded elements, then the elements of $Z'AZ$ are $O(N)$.

Proof: Let $Z = (Z_{ij})$ and $|Z_{ij}| \leq c_z$. Now consider the $(i, j)$ element of $Z'AZ$, say $\delta_{ij}$:

$$\delta_{ij} = \sum_{r=1}^{N} \sum_{s=1}^{N} Z_{si} \, a_{sr} \, Z_{rj}$$

$$|\delta_{ij}| \leq \sum_{r=1}^{N} \sum_{s=1}^{N} |Z_{si}| \, |a_{sr}| \, |Z_{rj}| \leq c_z \sum_{r} |Z_{rj}| \sum_{s} |a_{sr}| \leq c_z c_a \sum_{r} |Z_{rj}| \leq c_z c_a c_z N$$

Assumptions of model (23)

1. $\varepsilon_i$ is i.i.d. $(0, \sigma_\varepsilon^2)$, $E(\varepsilon_i^4) < \infty$
2. $|\rho_2| < 1$
3. $P = (I - \rho_2 W)$ is nonsingular at the true value of $\rho_2$
4. $w_{ii} = 0$, $i = 1, \dots, N$
5. $W$ and $P^{-1}$ are absolutely summable
6. $Z$ is nonstochastic, has bounded elements, and rank$(Z) = K$, where $Z$ is $N \times K$
7. $\lim N^{-1} Z'Z = Q_z$, where $Q_z$ is nonsingular
8. $\lim N^{-1} Z' \Omega_u Z = Q_1$, where $Q_1$ is nonsingular and where $\Omega_u$ is given below
9. $\lim N^{-1} Z' \Omega_u^{-1} Z = Q_2$, where $Q_2$ is nonsingular

Basic Results

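(R1) $VC_u = \sigma_\varepsilon^2 \Omega_u$ where $\Omega_u = (I - \rho_2 W)^{-1} (I - \rho_2 W')^{-1}$

(R2) $\hat{B} = (Z'Z)^{-1} Z'Y$ is consistent

Proof: The result in (R1) is obvious. Consider (R2). Since $Y = ZB + u$,

$$\hat{B} = B + (Z'Z)^{-1} Z'u$$

So

$$E\left( \hat{B} \right) = B$$

$$VC_{\hat{B}} = (Z'Z)^{-1} Z' \, VC_u \, Z (Z'Z)^{-1} = \sigma_\varepsilon^2 (Z'Z)^{-1} [Z' \Omega_u Z] (Z'Z)^{-1}$$

$$= N^{-1} \sigma_\varepsilon^2 \left[ N (Z'Z)^{-1} \right] \left[ N^{-1} Z' \Omega_u Z \right] \left[ N (Z'Z)^{-1} \right]$$

where the three bracketed terms converge to $Q_z^{-1}$, $Q_1$, and $Q_z^{-1}$, respectively. So

$$VC_{\hat{B}} \to 0$$

and hence, by Tchebyshev's inequality,

$$\hat{B} \xrightarrow{P} B$$

We will need the following. Let

$$\hat{u} = Y - Z\hat{B} \qquad (24)$$

$$= Y - ZB + \left( ZB - Z\hat{B} \right) = u + Z\left( B - \hat{B} \right) = u + Z \Delta_N$$

where $\Delta_N = B - \hat{B} \xrightarrow{P} 0$.

A consistent generalized moments estimator for $\rho_2$ [GME]

In this section we will suggest an estimator for $\rho_2$ and then give a high-level proof that it is consistent. This high-level proof is, although tedious, somewhat straightforward. A more complex low-level proof is given by Kelejian and Prucha (1999).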

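Note from (23) that

$$u - \rho_2 W u = \varepsilon \qquad (25)$$

so that

$$W u - \rho_2 W^2 u = W \varepsilon \qquad (26)$$

Let $\bar{u} = Wu$, $\bar{\bar{u}} = W^2 u$, $\bar{\varepsilon} = W\varepsilon$, and denote their $i$-th elements as $\bar{u}_i$, $\bar{\bar{u}}_i$, $\bar{\varepsilon}_i$. Then

$$u_i - \rho_2 \bar{u}_i = \varepsilon_i, \qquad i = 1, \dots, N \qquad (27)$$

$$\bar{u}_i - \rho_2 \bar{\bar{u}}_i = \bar{\varepsilon}_i \qquad (28)$$

Square (27), sum, and divide by $N$ to get

$$\frac{\sum u_i^2}{N} + \rho_2^2 \frac{\sum \bar{u}_i^2}{N} - 2\rho_2 \frac{\sum u_i \bar{u}_i}{N} = \frac{\sum \varepsilon_i^2}{N} \qquad (29)$$

Square (28), sum, and divide by $N$:

$$\frac{\sum \bar{u}_i^2}{N} + \rho_2^2 \frac{\sum \bar{\bar{u}}_i^2}{N} - 2\rho_2 \frac{\sum \bar{u}_i \bar{\bar{u}}_i}{N} = \frac{\sum \bar{\varepsilon}_i^2}{N} \qquad (30)$$

Multiply (27) by (28), sum, and divide by $N$ to get

$$\frac{\sum u_i \bar{u}_i}{N} + \rho_2^2 \frac{\sum \bar{u}_i \bar{\bar{u}}_i}{N} - \rho_2 \left[ \frac{\sum u_i \bar{\bar{u}}_i}{N} + \frac{\sum \bar{u}_i^2}{N} \right] = \frac{\sum \varepsilon_i \bar{\varepsilon}_i}{N} \qquad (31)$$

Note that since $\varepsilon_i$ is i.i.d. $(0, \sigma_\varepsilon^2)$ with $E(\varepsilon_i^4) < \infty$,

$$\frac{\sum \varepsilon_i^2}{N} \xrightarrow{P} \sigma_\varepsilon^2 \quad \text{[by Khintchine or by Tchebyshev]} \qquad (32)$$

Note that

$$\frac{\sum \bar{\varepsilon}_i^2}{N} = \frac{\varepsilon' W'W \varepsilon}{N}$$

and so

$$E\left[ \frac{\sum \bar{\varepsilon}_i^2}{N} \right] = \sigma_\varepsilon^2 \frac{Tr(W'W)}{N} \qquad (33)$$

We will assume

$$\lim \frac{Tr(W'W)}{N}$$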

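exists. Given this, we will demonstrate below that

$$p\lim \frac{\varepsilon' W'W \varepsilon}{N} = \sigma_\varepsilon^2 \lim \frac{Tr(W'W)}{N} \qquad (34)$$

Finally, the RHS of (31) is

$$\frac{\sum \varepsilon_i \bar{\varepsilon}_i}{N} = \frac{\varepsilon' W' \varepsilon}{N} = \frac{\varepsilon' W \varepsilon}{N}$$

We will demonstrate below that

$$\frac{\varepsilon' W \varepsilon}{N} \xrightarrow{P} 0 \qquad (35)$$

Given (32), (34) and (35), express

$$\frac{\sum \varepsilon_i^2}{N} = \sigma_\varepsilon^2 + \delta_1, \qquad \delta_1 \xrightarrow{P} 0 \qquad (36)$$

$$\frac{\sum \bar{\varepsilon}_i^2}{N} = \sigma_\varepsilon^2 \frac{Tr(W'W)}{N} + \delta_2, \qquad \delta_2 \xrightarrow{P} 0 \qquad (37)$$

$$\frac{\sum \varepsilon_i \bar{\varepsilon}_i}{N} = 0 + \delta_3, \qquad \delta_3 \xrightarrow{P} 0 \qquad (38)$$

Substitute (36)-(38) into (29)-(31) and move the terms that are not multiplied by a parameter to the right-hand side. Let $\lambda' = [r, \rho_2, \sigma_\varepsilon^2]$, $r = \rho_2^2$, $\delta' = [\delta_1, \delta_2, \delta_3]$,

$$A_1 = \begin{bmatrix} \sum \bar{u}_i^2 / N & -2 \sum u_i \bar{u}_i / N & -1 \\ \sum \bar{\bar{u}}_i^2 / N & -2 \sum \bar{u}_i \bar{\bar{u}}_i / N & -Tr(W'W)/N \\ \sum \bar{u}_i \bar{\bar{u}}_i / N & -\left[ \sum u_i \bar{\bar{u}}_i + \sum \bar{u}_i^2 \right] / N & 0 \end{bmatrix}$$

and

$$A_2' = -\left[ \sum u_i^2 / N, \;\; \sum \bar{u}_i^2 / N, \;\; \sum u_i \bar{u}_i / N \right]$$

Given this notation, (29)-(31), in light of (36), (37), and (38), can be expressed as

$$A_1 \lambda = A_2 + \delta \qquad (39)$$

We will demonstrate below that

$$\delta \xrightarrow{P} 0$$

We will also obtain expressions for $p\lim A_1$ and $p\lim A_2$. These expressions will involve limits which we assume exist. Finally, we assume that the weighting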

matrix $W$ is such that $(p\lim A_1)$ is nonsingular. Given all this, we note from (39) that

$$(p\lim A_1) \lambda = p\lim A_2 + p\lim \delta = p\lim A_2 \qquad (40)$$

so

$$\lambda = (p\lim A_1)^{-1} \, p\lim A_2 \qquad (41)$$

Thus if $u$ were observed, a consistent estimator of $\lambda$ would be the (over-parameterized) OLS estimator (which we may call the linear estimator)

$$\hat{\lambda} = A_1^{-1} A_2 \qquad (42)$$

If the information that $r = \rho_2^2$ is recognized, the NLLS estimator of $\rho_2$ and $\sigma_\varepsilon^2$ could be considered, namely

$$\min_{\rho_2, \sigma_\varepsilon^2} \delta'\delta = \min_{\rho_2, \sigma_\varepsilon^2} [A_1 \lambda - A_2]' [A_1 \lambda - A_2]$$

Feasible estimators

Clearly these estimators are not feasible because $u$ is not observed. Let $\hat{A}_1$ and $\hat{A}_2$ be identical to $A_1$ and $A_2$ except that $u$ is replaced by $\hat{u}$. Thus, e.g., we would replace $\sum \bar{u}_i^2 / N$ by $\sum \hat{\bar{u}}_i^2 / N$, where $\hat{\bar{u}}_i$ is the $i$-th element of $\hat{\bar{u}} = W\hat{u}$, etc. Then we will show below that

$$\tilde{\lambda} = \hat{A}_1^{-1} \hat{A}_2 \xrightarrow{P} \lambda \qquad (43)$$

Kelejian and Prucha (1999) show that the NLLS estimator based on

$$\min_{\rho_2, \sigma_\varepsilon^2} \left[ \hat{A}_1 \lambda - \hat{A}_2 \right]' \left[ \hat{A}_1 \lambda - \hat{A}_2 \right]$$

is also consistent. Furthermore, as expected, Monte Carlo results suggest that the NLLS estimators are more efficient than the OLS estimators described in (43).

Proof that $\tilde{\lambda} \xrightarrow{P} \lambda$

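Preliminary 1: Let $S$ be an $N \times N$ absolutely summable matrix. Let $\varepsilon' = (\varepsilon_1, \dots, \varepsilon_N)$ where $\varepsilon_i$ is i.i.d. $(0, \sigma_\varepsilon^2)$ and $E(\varepsilon_i^4) = \mu_4 < \infty$. Then

$$\frac{\varepsilon' S \varepsilon}{N} - \sigma_\varepsilon^2 \frac{Tr(S)}{N} \xrightarrow{P} 0 \qquad (44)$$

Proof: Note that

$$E\left[ \frac{\varepsilon' S \varepsilon}{N} \right] = \sigma_\varepsilon^2 \frac{Tr(S)}{N}$$

Also

$$\frac{\varepsilon' S \varepsilon}{N} = \frac{\sum_i \varepsilon_i^2 s_{ii}}{N} + \frac{\sum\sum_{i<j} \varepsilon_i \varepsilon_j [s_{ij} + s_{ji}]}{N} \qquad (45)$$

Note that $E(\varepsilon_i^2 \varepsilon_r \varepsilon_s) = 0$ unless $r = s$. Therefore every term in the double sum in (45) is uncorrelated with every squared term. Also, all the squared terms are uncorrelated with each other, as are all of the cross-product terms, since $E[\varepsilon_i \varepsilon_j \varepsilon_r \varepsilon_s] = 0$ unless $i = j$ and $r = s$, or $i = r$ and $j = s$, or $i = j = r = s$. All of these conditions are ruled out. Thus from (45) [noting that $E(\varepsilon_i \varepsilon_j)^2 = E(\varepsilon_i^2) E(\varepsilon_j^2) = \sigma_\varepsilon^4$ if $i \neq j$]

$$Var\left( \frac{\varepsilon' S \varepsilon}{N} \right) = \frac{1}{N^2} \left[ \sum_{i=1}^{N} s_{ii}^2 \, Var\left( \varepsilon_i^2 \right) + \sigma_\varepsilon^4 \sum\sum_{i<j} [s_{ij} + s_{ji}]^2 \right] \qquad (46)$$

$$\leq \frac{1}{N^2} \left[ \sum_{i=1}^{N} s_{ii}^2 \, h + \sigma_\varepsilon^4 \sum\sum_{i<j} \left[ |s_{ij}| + |s_{ji}| \right]^2 \right]$$

where $h = E(\varepsilon_i^4) - \sigma_\varepsilon^4$. Again, let $c_s$ be the bound on the elements of $S$, as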

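well as on its absolute row and column sums. It then follows from (46) that

$$Var\left( \frac{\varepsilon' S \varepsilon}{N} \right) \leq \frac{1}{N^2} \sum_{i=1}^{N} s_{ii}^2 \, h + \frac{\sigma_\varepsilon^4}{N^2} \left[ \sum\sum_{i<j} |s_{ij}|^2 + \sum\sum_{i<j} |s_{ji}|^2 + 2 \sum\sum_{i<j} |s_{ij}| \, |s_{ji}| \right] \qquad (47)$$

$$\leq \frac{1}{N} h c_s^2 + \frac{3 \sigma_\varepsilon^4 c_s}{N^2} \sum\sum_{i<j} |s_{ij}| + \frac{\sigma_\varepsilon^4 c_s}{N^2} \sum\sum_{i<j} |s_{ji}|$$

$$\leq \frac{1}{N} h c_s^2 + \frac{4 \sigma_\varepsilon^4 c_s}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} |s_{ij}|$$

$$\leq \frac{1}{N} h c_s^2 + \frac{4 \sigma_\varepsilon^4 c_s^2}{N} \to 0, \quad \text{as } N \to \infty$$

Therefore, via (47),

$$Var\left( \frac{\varepsilon' S \varepsilon}{N} \right) \to 0 \quad \text{as } N \to \infty$$

The result in (44) follows from Tchebyshev's inequality.

Given this preliminary, note that every element of $A_1$ and $A_2$ above is of the form

$$\frac{u' S_1 u}{N} = \frac{\varepsilon' (I - \rho_2 W')^{-1} S_1 (I - \rho_2 W)^{-1} \varepsilon}{N} = \frac{\varepsilon' S_2 \varepsilon}{N}$$

where $S_1$ and $S_2$ are again absolutely summable matrices. Therefore the probability limit of each element of $A_1$ and $A_2$ can be determined via the preliminary. As a trivial application it follows that $\delta_i \xrightarrow{P} 0$, $i = 1, 2, 3$, as indicated in (36)-(38).

We now show that

$$p\lim\left( \hat{A}_1 \right) = p\lim(A_1) \qquad (48)$$

$$p\lim\left( \hat{A}_2 \right) = p\lim(A_2)$$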

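First note that each element of $\hat{A}_1$ and $\hat{A}_2$ is of the form $\hat{u}' S \hat{u} / N$, where $S$ is absolutely summable. Since

$$\hat{u} = u + Z \Delta_N, \qquad \Delta_N \xrightarrow{P} 0$$

$$\frac{\hat{u}' S \hat{u}}{N} = \frac{(u + Z\Delta_N)' S (u + Z\Delta_N)}{N} = \frac{u' S u}{N} + \Delta_N' \frac{Z' S Z}{N} \Delta_N + \Delta_N' \frac{Z' S u}{N} + \frac{u' S Z}{N} \Delta_N$$

Our result follows if

$$\Delta_N' \frac{Z' S Z}{N} \Delta_N \xrightarrow{P} 0 \quad \text{and} \quad \Delta_N' \frac{Z' S u}{N} \xrightarrow{P} 0$$

(the remaining cross term, $u' S Z \Delta_N / N$, is handled in the same way as the second of these, since $S'$ is also absolutely summable).

Consider

$$\Delta_N' \frac{Z' S Z}{N} \Delta_N$$

By preliminary 4 above, the elements of $Z'SZ/N$ are $O(1)$. Thus $\Delta_N \xrightarrow{P} 0$ implies

$$\Delta_N' \frac{Z' S Z}{N} \Delta_N \xrightarrow{P} 0$$

Consider

$$\Delta_N' \frac{Z' S u}{N}$$

Let

$$\psi = \frac{Z' S u}{N}$$

Then

$$E\psi = 0, \qquad VC(\psi) = \sigma_\varepsilon^2 N^{-2} Z' S \Omega_u S' Z$$

We have assumed that $\Omega_u$ is absolutely summable, and hence so is $S \Omega_u S'$. It follows that $N^{-2} Z' S \Omega_u S' Z \to 0$ since $N^{-1} Z' S \Omega_u S' Z = O(1)$. It follows that $\psi \xrightarrow{P} 0$ and so

$$\Delta_N' \frac{Z' S u}{N} \xrightarrow{P} 0$$

It follows that $\tilde{\lambda} \xrightarrow{P} \lambda$ since (48) holds.

Feasible GLS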

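Let $\hat{\rho}_2 \xrightarrow{P} \rho_2$ be any consistent estimator of $\rho_2$, and let $\hat{\Omega}_u = (I - \hat{\rho}_2 W)^{-1} (I - \hat{\rho}_2 W')^{-1}$. Then

$$\hat{B}_{FGLS} = \left( Z' \hat{\Omega}_u^{-1} Z \right)^{-1} Z' \hat{\Omega}_u^{-1} Y \qquad (49)$$

$$\hat{B}_{GLS} = \left( Z' \Omega_u^{-1} Z \right)^{-1} Z' \Omega_u^{-1} Y \qquad (50)$$

Theorem: Given the assumptions of the model,

(a) $\sqrt{N} \left( \hat{B}_{FGLS} - B \right) \xrightarrow{D} N\left( 0, \; \sigma_\varepsilon^2 \, p\lim N \left( Z' \Omega_u^{-1} Z \right)^{-1} \right)$

(b) $\sqrt{N} \left( \hat{B}_{FGLS} - \hat{B}_{GLS} \right) \xrightarrow{P} 0$

Outline of Proof:

Consider first part (b). Using the usual manipulations,

$$\sqrt{N} \left( \hat{B}_{FGLS} - B \right) = N \left( Z' \hat{\Omega}_u^{-1} Z \right)^{-1} N^{-1/2} Z' \hat{\Omega}_u^{-1} u \qquad (51)$$

$$\sqrt{N} \left( \hat{B}_{GLS} - B \right) = N \left( Z' \Omega_u^{-1} Z \right)^{-1} N^{-1/2} Z' \Omega_u^{-1} u \qquad (52)$$

Thus (b) holds since

$$N^{-1} Z' \left( \hat{\Omega}_u^{-1} - \Omega_u^{-1} \right) Z = N^{-1} Z' \left[ (\rho_2 - \hat{\rho}_2)(W + W') - \left( \rho_2^2 - \hat{\rho}_2^2 \right) W'W \right] Z$$

$$= (\rho_2 - \hat{\rho}_2) \left[ N^{-1} Z'(W + W') Z \right] - \left( \rho_2^2 - \hat{\rho}_2^2 \right) \left[ N^{-1} Z' W'W Z \right] \xrightarrow{P} 0$$

because the scalar factors converge to zero in probability and the bracketed terms are $O(1)$, and

$$N^{-1/2} Z' \left( \hat{\Omega}_u^{-1} - \Omega_u^{-1} \right) u = N^{-1/2} Z' \left[ (\rho_2 - \hat{\rho}_2)(W + W') - \left( \rho_2^2 - \hat{\rho}_2^2 \right) W'W \right] u$$

$$= (\rho_2 - \hat{\rho}_2) N^{-1/2} Z'(W + W') u - \left( \rho_2^2 - \hat{\rho}_2^2 \right) N^{-1/2} Z' W'W u \xrightarrow{P} 0$$

since

$$\rho_2 - \hat{\rho}_2 \xrightarrow{P} 0$$

and

$$N^{-1/2} Z'(W + W') u = O_p(1), \qquad N^{-1/2} Z' W'W u = O_p(1)$$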

To see this, note that

$$E\left( N^{-1/2} Z'(W + W') u \right) = 0$$

$$VC = \sigma_\varepsilon^2 N^{-1} Z'(W + W') \Omega_u (W' + W) Z = O(1)$$

because $(W + W') \Omega_u (W' + W)$ is absolutely summable, so that $Z'(W + W') \Omega_u (W' + W) Z$ is $O(N)$. A similar result holds for $N^{-1/2} Z' W'W u$.

Now consider part (a). Given part (b), we need only consider

$$\sqrt{N} \left( \hat{B}_{GLS} - B \right) = N \left( Z' \Omega_u^{-1} Z \right)^{-1} N^{-1/2} Z' \Omega_u^{-1} u \qquad (53)$$

We will use the following CLT for triangular arrays, which is a variation on a problem given in Billingsley (1979, p. 319, problem 27.6).

A formal statement of the CLT

Let $\{v_{i,N}, 1 \leq i \leq N, N \geq 1\}$ be a triangular array of random variables that are identically distributed and (jointly) independent for each $N$, with $E v_{i,N} = 0$ and $E v_{i,N}^2 = \sigma^2$, $0 < \sigma^2 < \infty$. Let $\{x_{ij,N}, 1 \leq i \leq N, N \geq 1\}$, $j = 1, \dots, K$, be triangular arrays of real numbers that are bounded in absolute value, i.e., $c_x = \sup_N \sup_{i \leq N, j \leq K} |x_{ij,N}| < \infty$. Further, let $\{V_N : N \geq 1\}$ and $\{X_N : N \geq 1\}$, with $V_N = (v_{i,N})_{i=1,\dots,N}$ and $X_N = (x_{ij,N})_{i=1,\dots,N;\, j=1,\dots,K}$, denote corresponding sequences of $N \times 1$ random vectors and $N \times K$ real matrices, respectively, and let $\lim_{N \to \infty} N^{-1} X_N' X_N = Q$ be finite and positive definite. Then $N^{-1/2} X_N' V_N \xrightarrow{D} N(0, \sigma^2 Q)$.

On an intuitive level, the CLT can be described as follows. Suppose $v_i$ is i.i.d. $(0, \sigma^2)$, $0 < \sigma^2 < \infty$, and $X_N = \{x_{ijN}\}$ is a sequence of $N \times K$ real nonstochastic matrices with bounded elements,

$$\sup_{N \geq 1} \; \sup_{i \leq N, \, j \leq K} |x_{ijN}| < \infty$$

and $\lim_{N \to \infty} N^{-1} X_N' X_N = Q_x$, where $Q_x^{-1}$ exists. Let $V_N' = (v_1, \dots, v_N)$. Then

$$N^{-1/2} X_N' V_N \xrightarrow{D} N\left( 0, \sigma^2 Q_x \right) \qquad (54)$$

A point to note concerning the formal assumption concerning $v_{i,N}$: In a triangular array each element can change as the sample size increases. Therefore such an array does not rule out the possibility that $v_{3,10} = v_{7,25}$. If one were to only assume that $v_i$ is i.i.d., then one would be assuming away this characteristic of a triangular array. For instance, the way the formal assumption is stated, $v_{1,N}, v_{2,N}, \dots, v_{N,N}$ are i.i.d. for each $N$.

Now consider (53). First note, via assumption 9 of model (23),

$$N \left( Z' \Omega_u^{-1} Z \right)^{-1} \to Q_2^{-1}$$

Now note that

$$N^{-1/2} Z' \Omega_u^{-1} u = N^{-1/2} Z' (I - \rho_2 W')(I - \rho_2 W) u = N^{-1/2} Z' (I - \rho_2 W') \varepsilon$$

and so

$$N^{-1/2} Z' \Omega_u^{-1} u \xrightarrow{D} N\left( 0, \; \sigma_\varepsilon^2 \lim N^{-1} Z' \Omega_u^{-1} Z \right) \qquad (55)$$

since

(i) the elements of $\varepsilon$ are i.i.d. $(0, \sigma_\varepsilon^2)$
(ii) the elements of $Z(\rho_2) = (I - \rho_2 W) Z$ are uniformly bounded
(iii) $N^{-1} Z(\rho_2)' Z(\rho_2) \to Q_2 = \lim N^{-1} Z' \Omega_u^{-1} Z$

Given the results in (53) and (55), the suggested small sample guidance is

$$\hat{B}_{FGLS} \;\dot{\sim}\; N\left( B, \; \hat{\sigma}_\varepsilon^2 \left[ Z(\hat{\rho}_2)' Z(\hat{\rho}_2) \right]^{-1} \right) \qquad (56)$$

where

$$\hat{\sigma}_\varepsilon^2 = N^{-1} \left[ Y(\hat{\rho}_2) - Z(\hat{\rho}_2) \hat{B}_{FGLS} \right]' \left[ Y(\hat{\rho}_2) - Z(\hat{\rho}_2) \hat{B}_{FGLS} \right]$$

A point to note

The following procedure is sometimes suggested. Since

$$Y = ZB + u, \qquad u = \rho_2 W u + \varepsilon$$

we have

$$Y = \rho_2 (WY) + ZB - (WZ) \rho_2 B + \varepsilon \qquad (57)$$

Since $W$ is observed, $WY$ and $WZ$ are observed. Note that (57) cannot be consistently estimated by OLS since

$$E(WY \varepsilon') = W (I - \rho_2 W)^{-1} E(\varepsilon\varepsilon') = W (I - \rho_2 W)^{-1} \sigma_\varepsilon^2 \neq 0 \qquad (58)$$

Thus an instrumental variable estimator is sometimes suggested. In one case people express (57) as

$$Y = \rho_2 (WY) + ZB + (WZ) \gamma + \varepsilon \qquad (59)$$

where the restriction $\gamma = -\rho_2 B$ is not used. Thus, (59) is over-parameterized. Then (59) is estimated by 2SLS using the non-redundant variables from the set $Z, WZ, W^2 Z$, etc. (typically $Z, WZ, W^2 Z$).

This 2SLS procedure is not consistent. The reason for this is that $E(WY) = WZB$, and (59) already contains $Z$ and $WZ$ as regressor matrices. In brief, there are no instruments for $WY$ which are linearly independent of $Z$ and $WZ$. For example, the ideal instrument for $WY$ is $WZB$, but $WZ$ is already in the model.

You should be able to show the following. Let $D = (WY, Z, WZ)$ and $\lambda' = (\rho_2, B', \gamma')$ so that (59) is

$$Y = D\lambda + \varepsilon \qquad (60)$$

Let $H$ be any $N \times p$ matrix of nonstochastic instruments such that $\lim N^{-1} H'H = Q_H$, where $Q_H^{-1}$ exists. Then

$$p\lim N^{-1} H'D = G \qquad (61)$$

where $G$ does not have full column rank. One implication of this is that (60) cannot be consistently estimated by 2SLS.

One would think that (57) can be consistently estimated by NL2SLS, using the instruments $Z, WZ, W^2 Z$, etc. This procedure is also not consistent. The reason for this is similar to that given above. To see the issue involved, re-write (57) as

$$Y = F + \varepsilon \qquad (62)$$

$$F = \rho_2 (WY) + ZB - (WZ) \rho_2 B$$

where $Y$, $F$, and $\varepsilon$ are $N \times 1$.

Suppose we have an exogenous matrix of instruments, say $H$ ($N \times r$), $r \geq K + 1$. Then a condition given by Amemiya (1985, pages 110 and 246; see footnote 3) for consistency is that

$$p\lim N^{-1} H' \frac{\partial F}{\partial (\rho_2, B')}$$

has full column rank. Now

$$\frac{\partial F}{\partial (\rho_2, B')} = [WY - WZB, \;\; Z - \rho_2 WZ] = [W(Y - ZB), \;\; Z - \rho_2 WZ] = [Wu, \;\; Z - \rho_2 WZ]$$

You should be able to demonstrate that the first column of

$$p\lim N^{-1} H' \frac{\partial F}{\partial (\rho_2, B')}$$

is a column of zeros if, as is typically assumed, $N^{-1} H'H \to Q_{HH}$, where $Q_{HH}$ is a finite invertible matrix. It follows that Amemiya's condition will not hold.

On a simpler scale, to see that something is wrong, note that

$$E[F] = \rho_2 WZB + ZB - \rho_2 WZB = ZB, \qquad Z \text{ is } N \times K$$

only involves $K$ variables which can be used as instruments. However, the model in (57) has $K + 1$ parameters.

Case 3: $\rho_1 \neq 0$, $\rho_2 \neq 0$

In this case the model is

$$Y = ZB + \rho_1 WY + u \qquad (1A)$$

$$u = \rho_2 W u + \varepsilon, \qquad |\rho_1| < 1, \quad |\rho_2| < 1 \qquad (2A)$$

Footnote 3: T. Amemiya (1985), Advanced Econometrics, Harvard University Press.

Note that $(WY)$ is endogenous because

$$E[(WY) u'] = \sigma_\varepsilon^2 \, W (I - \rho_1 W)^{-1} \Omega_u \neq 0$$

Also note that

$$E(WY) = W (I - \rho_1 W)^{-1} ZB$$

which is not collinear with $Z$:

$$\text{rank}\left[ W (I - \rho_1 W)^{-1} Z, \; Z \right] > \text{rank}[Z]$$

for most reasonable weighting matrices (see footnote 4). All of this suggests that, under the usual further assumptions, (1A) can be consistently estimated by 2SLS using the instruments $Z, WZ, W^2 Z$, etc. These instruments are suggested because

$$E[WY] = W (I - \rho_1 W)^{-1} ZB$$

If the roots of $W$ are all less than or equal to 1 in absolute value, then the roots of $\rho_1 W$ are less than 1 in absolute value if $|\rho_1| < 1$. Thus

$$E[WY] = W \left[ I + \rho_1 W + \rho_1^2 W^2 + \dots \right] ZB = WZB + W^2 Z (\rho_1 B) + W^3 Z \left( \rho_1^2 B \right) + \dots$$

Therefore $E(WY)$ is linear in $WZ, W^2 Z$, etc.

Estimating (1A) by 2SLS does not account for the spatial correlation problem. We now describe a procedure that was put forth by Kelejian and Prucha (1998) which does account for it.

The procedure

Step 1: Estimate (1A) by 2SLS using the linearly independent columns of $Z, WZ, W^2 Z$. Obtain $\tilde{B}$, $\tilde{\rho}_1$.

Step 2: Obtain $\tilde{u} = Y - Z\tilde{B} - \tilde{\rho}_1 WY$, $\bar{\tilde{u}} = W\tilde{u}$, $\bar{\bar{\tilde{u}}} = W^2 \tilde{u}$. Use these residual vectors to obtain the GM estimator of $\rho_2$, say $\tilde{\rho}_2$.

Footnote 4: As an illustration suppose $Z' = (1, 2, 3)$ and $W = \dots$ Then $WZ = \dots$ Clearly rank$[Z, WZ] = \dots$

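Step 3: Obtain

$$Y(\tilde{\rho}_2) = Y - \tilde{\rho}_2 WY$$

$$Z(\tilde{\rho}_2) = Z - \tilde{\rho}_2 WZ$$

$$WY(\tilde{\rho}_2) = WY - \tilde{\rho}_2 W^2 Y$$

Note that (1A) and (2A) imply (based on the true value of $\rho_2$)

$$Y(\rho_2) = Z(\rho_2) B + \rho_1 WY(\rho_2) + \varepsilon$$

which can be consistently estimated by 2SLS using the instruments $Z(\rho_2), WZ(\rho_2), W^2 Z(\rho_2)$.

Step 4: Obtain the feasible counterpart to the 2SLS estimator outlined above. Specifically, let

$$D(\tilde{\rho}_2) = [Z(\tilde{\rho}_2), \; WY(\tilde{\rho}_2)], \qquad \lambda' = (B', \rho_1)$$

$$\hat{D}(\tilde{\rho}_2) = H (H'H)^{-1} H' D(\tilde{\rho}_2), \qquad H = [Z, WZ, W^2 Z]$$

Then obtain

$$\hat{\lambda} = \left[ \hat{D}(\tilde{\rho}_2)' \hat{D}(\tilde{\rho}_2) \right]^{-1} \hat{D}(\tilde{\rho}_2)' Y(\tilde{\rho}_2)$$

Kelejian and Prucha (1998) show that

$$\sqrt{N} \left( \hat{\lambda} - \lambda \right) \xrightarrow{D} N\left( 0, \; \sigma_\varepsilon^2 \, p\lim N \left[ \hat{D}(\tilde{\rho}_2)' \hat{D}(\tilde{\rho}_2) \right]^{-1} \right) = N\left( 0, \; \sigma_\varepsilon^2 \, p\lim N \left[ \hat{D}(\rho_2)' \hat{D}(\rho_2) \right]^{-1} \right)$$

Let

$$\hat{\sigma}_\varepsilon^2 = \frac{ \left[ Y(\tilde{\rho}_2) - D(\tilde{\rho}_2) \hat{\lambda} \right]' \left[ Y(\tilde{\rho}_2) - D(\tilde{\rho}_2) \hat{\lambda} \right] }{N}$$

Then the suggested small sample inference would be based on

$$\hat{\lambda} \;\dot{\sim}\; N\left( \lambda, \; \hat{\sigma}_\varepsilon^2 \left[ \hat{D}(\tilde{\rho}_2)' \hat{D}(\tilde{\rho}_2) \right]^{-1} \right)$$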
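Steps 1 to 4 above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `tsls`, `gm_rho`, and `gs2sls` are hypothetical, the GM step uses the linear (over-parameterized) estimator of (43) rather than the NLLS version, and the simulated data (circular weighting matrix, parameter values) are chosen only for illustration.

```python
import numpy as np

def tsls(y, D, H):
    """2SLS of y on D with instruments H; returns coefficients and D-hat."""
    D_hat = H @ np.linalg.lstsq(H, D, rcond=None)[0]   # H (H'H)^{-1} H' D
    coef = np.linalg.solve(D_hat.T @ D_hat, D_hat.T @ y)
    return coef, D_hat

def gm_rho(u, W):
    """Linear GM estimator of rho_2: solve A1 * lambda = A2, lambda = (r, rho_2, sigma2)."""
    N = len(u)
    ub, ubb = W @ u, W @ (W @ u)                       # Wu and W^2 u
    A1 = np.array([
        [ub @ ub,   -2.0 * (u @ ub),      -float(N)],
        [ubb @ ubb, -2.0 * (ub @ ubb),    -np.trace(W.T @ W)],
        [ub @ ubb,  -(u @ ubb + ub @ ub),  0.0],
    ]) / N
    A2 = -np.array([u @ u, ub @ ub, u @ ub]) / N
    return np.linalg.solve(A1, A2)[1]

def gs2sls(Y, Z, W, H):
    N, WY = len(Y), W @ Y
    D = np.column_stack([Z, WY])
    lam1, _ = tsls(Y, D, H)                            # Step 1
    rho2 = gm_rho(Y - D @ lam1, W)                     # Step 2
    Y_s = Y - rho2 * WY                                # Step 3
    D_s = np.column_stack([Z - rho2 * (W @ Z), WY - rho2 * (W @ WY)])
    lam, D_s_hat = tsls(Y_s, D_s, H)                   # Step 4
    resid = Y_s - D_s @ lam
    cov = (resid @ resid / N) * np.linalg.inv(D_s_hat.T @ D_s_hat)
    return lam, rho2, cov

# Simulated example with hypothetical parameter values.
rng = np.random.default_rng(1)
N = 400
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5
x = rng.normal(size=N)
Z = np.column_stack([np.ones(N), x])
# W is row normalized, so W times the constant is redundant and is dropped.
H = np.column_stack([Z, W @ x, W @ (W @ x)])
B, rho1, rho2 = np.array([1.0, 2.0]), 0.3, 0.4
u = np.linalg.solve(np.eye(N) - rho2 * W, rng.normal(size=N))
Y = np.linalg.solve(np.eye(N) - rho1 * W, Z @ B + u)
lam_hat, rho2_tilde, cov_hat = gs2sls(Y, Z, W, H)      # lam_hat = (B-hat', rho1-hat)
```

The square roots of the diagonal of `cov_hat` give the standard errors implied by the small sample approximation above.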

1.2.3 Implications of the spatial model: Emanating and Own spillover effects

Consider the model

$$y = X\beta + \rho_1 Wy + u \tag{63}$$

where $X$ is exogenous, and $E(u) = 0$. In this section it does not make any difference whether or not the elements of $u$ are spatially correlated. The model in (63) is a structural model. Its reduced form, i.e., the solution of the model for $y$, is

$$y = (I - \rho_1 W)^{-1}[X\beta + u]$$

and so

$$E(y) = (I - \rho_1 W)^{-1} X\beta. \tag{64}$$

Now consider interpretations that are based on (63) and (64). For ease of presentation, we suppose that $X$ is a vector, i.e., there is only one exogenous variable; we denote the $i$th element of $X$ as $x_i$. The extension to the case in which $X$ is an $n \times k$ matrix will be evident.

If there were no spatial effects in the sense that $\rho_1 = 0$, the effect of a one unit change in $x_1$ on $E(y_1)$ would be $\beta$; see (63). This effect can also be thought of as the direct effect of $x_1$ on $E(y_1)$ in the sense that it does not account for spatial spill-overs which would take place if $\rho_1 \neq 0$. Clearly, if $\rho_1 = 0$ a change in $x_1$ would have no effect on the expected values of the other dependent variables.

Let $G = (I - \rho_1 W)^{-1}$ and consider final effects implied by the model. From (64) it should be clear that the expected effect of a change in $x_1$ on all of the elements of the vector $E(y)$ is (using evident notation)

$$\frac{\partial E(y_j)}{\partial x_1} = G_{j1}\beta, \qquad j = 1, \dots, n. \tag{65}$$

The effects described in (65) have been referred to in the literature as emanating effects. These effects describe how a change in a regressor relating to a given unit, in our illustrative case unit 1, fans out to all the units. Of course these emanating effects can also be described in terms of elasticities; again, using evident notation,

$$\eta_{j1} = G_{j1}\beta\,\frac{x_1}{y_j}, \qquad j = 1, \dots, n. \tag{66}$$
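As a small numerical illustration of (65) and (66), the sketch below computes the emanating effects of a change in the regressor of one unit. The function names, the 4-unit weighting matrix, and the parameter values are assumptions made for the example only.

```python
import numpy as np

def emanating_effects(W, rho1, beta, unit=0):
    """Effects of a change in x_unit on E(y_j) for all j: G[:, unit] * beta, as in (65)."""
    n = W.shape[0]
    G = np.linalg.inv(np.eye(n) - rho1 * W)
    return G[:, unit] * beta

def emanating_elasticities(W, rho1, beta, x, y, unit=0):
    """Elasticity form (66): G[j, unit] * beta * x[unit] / y[j]."""
    return emanating_effects(W, rho1, beta, unit) * x[unit] / y

# Hypothetical example: 4 units on a circle, each with two equally weighted neighbours
W = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

effects = emanating_effects(W, rho1=0.5, beta=2.0)
print("effects of x_1 on E(y_1), ..., E(y_4):", effects)
print("with rho_1 = 0:", emanating_effects(W, rho1=0.0, beta=2.0))
```

With $\rho_1 = 0$ only the first element is nonzero and equals $\beta$, which is the direct effect described above. With $\rho_1 \neq 0$ the first element is $G_{11}\beta$ and the remaining elements show how the change fans out to the other units.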

A closely related concept is that of own spill-over effects. These effects directly follow from (66). Specifically, as indicated above, in the absence of spill-overs, the effect of a change in $x_1$ on $E(y_1)$ is just $\beta$. In the presence of spill-overs, that effect is

$$\frac{\partial E(y_1)}{\partial x_1} = G_{11}\beta.$$

Therefore, one measure of the own spill-over effect is clearly

$$(G_{11} - 1)\beta.$$

Of course these effects can also be expressed in terms of elasticities.

In passing we note that the above material relates to a change in the regressor corresponding to the first unit. Obviously, one can also calculate the effects of a change in the regressor corresponding to each and every unit, or to a change in the regressors corresponding to a set of units, e.g., say the first and second!

Emanating effects with respect to a uniform worsening of the exogenous (fundamental) variables in the originating country

Kelejian, Tavlas, and Hondroyiannis (2006) used a spatial model to study contagion problems in foreign exchange markets. That is, in a number of episodes when the currency of a given country experiences a run and it depreciates, the effects fan out to other related countries. In the Kelejian, Tavlas, and Hondroyiannis study they considered a variant of the emanating effects described above. In particular, instead of calculating the effect of a given variable in one country, say country 1, on the other countries, they considered the effect of a uniform worsening of the exogenous variables of that originating country, in this case country 1, on the other countries involved.

In somewhat more detail, write the conditional mean in (64) as, using evident notation,

$$E(y) = (I - \rho_1 W)^{-1} X\beta = G[X_1\beta_1 + \dots + X_k\beta_k]. \tag{67}$$

Suppose now that high values of the dependent variable are associated with more severe exchange problems than are low values. In this case, if the coefficient of a regressor is negative, a worsening of the corresponding variable in country 1 would relate to a decrease in that variable, i.e., the change in that variable multiplied by the corresponding coefficient is positive. Similarly, if a coefficient of a regressor is positive, a worsening of that variable would be an increase in the value of the corresponding regressor.

Let the first value of $X_j$ be $x_{j,1}$, $j = 1, \dots, k$, and in order to avoid unnecessary tediousness assume that the values of all of the regressors are positive. Then the implication of the above is that the response of $E(y_r)$ with respect to a worsening of all of the regressors of country 1 would be

$$\Delta E(y_r) = G_{r1}(\Delta x_{1,1}\beta_1) + \dots + G_{r1}(\Delta x_{k,1}\beta_k) = G_{r1}|\Delta x_{1,1}\beta_1| + \dots + G_{r1}|\Delta x_{k,1}\beta_k| \tag{68}$$

or, if a uniform percentage worsening is considered,

$$\Delta E(y_r) = G_{r1}\left|\frac{\Delta x_{1,1}}{x_{1,1}}\beta_1\right| x_{1,1} + \dots + G_{r1}\left|\frac{\Delta x_{k,1}}{x_{k,1}}\beta_k\right| x_{k,1} = G_{r1}|\beta_1|\,x_{1,1}\,\alpha + \dots + G_{r1}|\beta_k|\,x_{k,1}\,\alpha \tag{69}$$

where $\alpha > 0$ is the uniform percentage worsening. Or, one can calculate the emanating elasticity of $E(y_r)$ with respect to the uniform percentage worsening of all of the regressors in country 1 as

$$\frac{\Delta E(y_r)}{\alpha\, y_r} = G_{r1}\left[\frac{x_{1,1}|\beta_1|}{y_r} + \dots + \frac{x_{k,1}|\beta_k|}{y_r}\right]. \tag{70}$$

Of course, an alternative to (70) would be to replace $y_r$ in the denominator of (70) by $E(y_r)$, which can be estimated from (67).

1.2.4 Prediction Issues

This section is based on Kelejian and Prucha (2006). As a preliminary for this section we note the following result for the convenience of the reader. Using evident notation, let the vectors $Z_1$ and $Z_2$ be jointly normally distributed as $(Z_1, Z_2) \sim N(\mu, V)$ where $\mu' = (\mu_1', \mu_2')$ and $V = \{V_{ij}\}$, $i, j = 1, 2$. Then the minimum mean squared error predictor of $Z_1$ based on $Z_2$, and the corresponding predictor variance-covariance matrix, are

$$E(Z_1 \mid Z_2 = z_2) = \mu_1 + V_{12}V_{22}^{-1}(z_2 - \mu_2) \tag{71}$$
$$VC(Z_1 \mid Z_2 = z_2) = V_{11} - V_{12}V_{22}^{-1}V_{21},$$

see, e.g., Greene (2003, p. 872).

The discussion below concerning prediction is based on the assumption that the model parameters are known. Of course in practice they are not known and so predictors would be based on their estimated values. An analysis of predictor efficiency based on estimated parameter values would then have to consider a wide variety of sample sizes, regressor variations and co-variations. The reason is that these considerations have an effect on the precision of parameter estimators and these, in turn, have an effect on prediction efficiency. By basing our discussion of prediction efficiency on known parameter values we need not consider such issues, and our interpretation of the results is that they relate to limits of efficiency.

Now consider the model

$$y_n = \lambda W_n y_n + X_n\beta + u_n, \tag{72}$$
$$u_n = \rho W_n u_n + \varepsilon_n,$$

where $W_n$ is an $n \times n$ nonstochastic weighting matrix, $X_n$ is an $n \times k$ nonstochastic matrix of observations on $k$ exogenous variables, and the remaining notation is evident. We assume that $\varepsilon_n \sim N(0, \sigma_\varepsilon^2 I_n)$. The $i$-th unit in the model in (72) is determined as

$$y_{n,i} = x_{n,i.}\beta + \lambda w_{n,i.} y_n + u_{n,i} \tag{73}$$
$$u_{n,i} = \rho w_{n,i.} u_n + \varepsilon_{n,i}$$

where $y_{n,i}$ is the $i$-th element of $y_n$, $x_{n,i.}$ is the $i$-th row of $X_n$, etc.

Predictors that might be considered for $y_{n,i}$

(a) The reduced form predictor. This is suggested by MSE issues:

$$y^{(1)}_{n,i} = E(y_{n,i} \mid X_n, W_n) = (I - \lambda W_n)^{-1}_{i.} X_n\beta \tag{74}$$

where, again, the notation should be evident, e.g., $(I - \lambda W_n)^{-1}_{i.}$ is the $i$-th row of $(I - \lambda W_n)^{-1}$.

(b) A larger information set:

$$y^{(2)}_{n,i} = E(y_{n,i} \mid X_n, W_n, w_{n,i.}y_n) = \lambda w_{n,i.}y_n + x_{n,i.}\beta + \frac{\operatorname{cov}(u_{n,i}, w_{n,i.}y_n)}{\operatorname{var}(w_{n,i.}y_n)}\left[w_{n,i.}y_n - E(w_{n,i.}y_n)\right] \tag{75}$$

Note that the terms involved in (75) are straightforward to determine. For example, consider $\operatorname{cov}(u_{n,i}, w_{n,i.}y_n)$. Since $u_n = (I_n - \rho W_n)^{-1}\varepsilon_n$, it follows that

$$u_{n,i} = (I_n - \rho W_n)^{-1}_{i.}\,\varepsilon_n \tag{76}$$

and so $u_{n,i} \sim N\left(0,\; \sigma_\varepsilon^2 (I_n - \rho W_n)^{-1}_{i.}\left[(I_n - \rho W_n)^{-1}_{i.}\right]'\right)$. It is also clear that

$$w_{n,i.}y_n = w_{n,i.}(I_n - \lambda W_n)^{-1}X_n\beta + w_{n,i.}(I_n - \lambda W_n)^{-1}(I_n - \rho W_n)^{-1}\varepsilon_n \tag{77}$$

so that from (76) and (77)

$$\operatorname{cov}(u_{n,i}, w_{n,i.}y_n) = E[u_{n,i}(w_{n,i.}y_n)'] = \sigma_\varepsilon^2 (I_n - \rho W_n)^{-1}_{i.}(I_n - \rho W_n')^{-1}(I_n - \lambda W_n')^{-1}w_{n,i.}' \tag{78}$$

The remaining covariances and variances can also be calculated in a similar fashion.

(c) The efficient predictor:

$$y^{(3)}_{n,i} = E(y_{n,i} \mid X_n, W_n, y_{n,-i}) = \lambda w_{n,i.}y_n + x_{n,i.}\beta + \operatorname{cov}(u_{n,i}, y_{n,-i})\left[VC(y_{n,-i})\right]^{-1}\left[y_{n,-i} - E(y_{n,-i})\right] \tag{79}$$

where $y_{n,-i}$ is the same as $y_n$ except that $y_{n,i}$ is deleted. This efficient predictor would be especially applicable to the case in which one were to predict the value of a house, given the hedonic characteristics and prices of the houses in the area.

(d) The intuitive predictor:

$$y^{(4)}_{n,i} = x_{n,i.}\beta + \lambda w_{n,i.}y_n \tag{80}$$

This predictor is simply based on the right hand side of the generating model in (73). It is biased because it ignores the correlation between $w_{n,i.}y_n$ and $u_{n,i}$.

Mean Squared Errors of the Predictors

There is a technical problem when comparing the mean squared errors of the four predictors outlined above. Specifically, the predictors $y^{(2)}_{n,i}$, $y^{(3)}_{n,i}$, and $y^{(4)}_{n,i}$ depend upon a particular realization of the dependent vector, $y_n$. In order to compare these predictors to each other, and to the reduced form predictor $y^{(1)}_{n,i}$, Kelejian and Prucha (2007) essentially average the mean squared errors of the last three predictors over all realizations of the dependent vector, namely they calculate

$$E[MSE(y^{(j)}_{n,i} \mid X, W)], \qquad j = 1, 2, 3, 4. \tag{81}$$

Theoretically, they show

$$MSE(y^{(1)}_{n,i}) \geq MSE(y^{(2)}_{n,i}) \geq MSE(y^{(3)}_{n,i}), \tag{82}$$
$$MSE(y^{(4)}_{n,i}) \geq MSE(y^{(2)}_{n,i}).$$

The exact calculations are given in Kelejian and Prucha (2007); however, the inequalities in (82) are consistent with theoretical notions. The first line in (82) follows because of increasing information sets. The second line follows because $y^{(2)}_{n,i}$ is the unbiased version of $y^{(4)}_{n,i}$.

The expressions in (82) involve the model parameters. Kelejian and Prucha (2007) evaluated these expressions over a wide range of model parameter values. Their numerical results are more revealing. On average, the ratios of the MSEs to the efficient MSE, that of $y^{(3)}_{n,i}$, are

For $y^{(1)}_{n,i}$: 16.6
For $y^{(2)}_{n,i}$: 1.07
For $y^{(4)}_{n,i}$: 2.2

It is interesting to note that the predictor based on the reduced form, namely $y^{(1)}_{n,i}$, is by far the worst predictor. In addition to its high average value, it also had outliers for certain model parameter values.
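The four predictors in (74), (75), (79) and (80) can be computed directly once the parameters are treated as known. The sketch below is one possible way to do so, not the code used in Kelejian and Prucha (2007); the function name `predictors`, the circular weighting matrix, and all parameter values are assumptions made for the example.

```python
import numpy as np

def predictors(i, y, X, W, beta, lam, rho, sigma2):
    """Return (y1, y2, y3, y4) for unit i under model (72) with known parameters."""
    n = len(y)
    I = np.eye(n)
    A = np.linalg.inv(I - lam * W)           # (I - lambda W)^{-1}
    R = np.linalg.inv(I - rho * W)           # (I - rho W)^{-1}
    mu = A @ X @ beta                        # E(y)
    Sig_u = sigma2 * R @ R.T                 # VC(u)
    Sig_uy = Sig_u @ A.T                     # cov(u, y)
    Sig_y = A @ Sig_u @ A.T                  # VC(y)
    w_i = W[i]
    base = X[i] @ beta + lam * w_i @ y       # right hand side of (73) without u_i

    y1 = mu[i]                                               # (74)
    cov_u_wy = Sig_uy[i] @ w_i                               # (78)
    var_wy = w_i @ Sig_y @ w_i
    y2 = base + cov_u_wy / var_wy * (w_i @ y - w_i @ mu)     # (75)
    keep = np.arange(n) != i                                 # all units except i
    gain = np.linalg.solve(Sig_y[np.ix_(keep, keep)], y[keep] - mu[keep])
    y3 = base + Sig_uy[i, keep] @ gain                       # (79)
    y4 = base                                                # (80)
    return y1, y2, y3, y4

# Hypothetical example: circular neighbour structure and assumed parameter values
rng = np.random.default_rng(1)
n, beta, lam, rho, sigma2 = 30, np.array([1.0, 0.5]), 0.4, 0.3, 1.0
W = np.zeros((n, n))
for j in range(n):
    W[j, (j - 1) % n] = W[j, (j + 1) % n] = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.linalg.solve(np.eye(n) - rho * W, rng.normal(scale=np.sqrt(sigma2), size=n))
y = np.linalg.solve(np.eye(n) - lam * W, X @ beta + u)

print("y1, y2, y3, y4 for unit 0:", predictors(0, y, X, W, beta, lam, rho, sigma2))
print("realized y for unit 0:", y[0])
```

Because $w_{ii} = 0$, the expression in (79) coincides with the conditional mean of $y_{n,i}$ given $y_{n,-i}$ under joint normality, which gives a convenient way to check an implementation. A single draw says nothing about relative mean squared errors; those require the averaging described in (81).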

1.2.5 Spatial Models with Uniform Weights

Spatial models whose weighting matrices have equal elements might be considered if units can reasonably be viewed as equally distant within a certain neighborhood. As one example, that neighborhood might be a school in a study of student achievement where it is suspected that each student's achievement is, at least in part, related to the achievements of others in that school. As another example, an equal weights matrix might be considered in a study of the agricultural productivity of farmers in a given region, e.g., farmers who live in a given village. In such a study it might be assumed that farmers learn from each other and hence their productivity is interrelated.

Consider the model

$$y_N = e_N\alpha + X_N\beta + \lambda W_N y_N + \varepsilon_N = Z_N\gamma + \varepsilon_N, \tag{83}$$
$$Z_N = (e_N, X_N, W_N y_N), \qquad \gamma' = (\alpha, \beta', \lambda),$$

where $y_N$ is the $N \times 1$ vector of observations on the dependent variable, $e_N$ is an $N \times 1$ vector of unit elements, and the remaining notation is evident except that we are assuming, in a manner specified below, that all of the nondiagonal elements of $W_N$ are equal. Since we are explicitly introducing an intercept, the regressor matrix $X_N$ does not contain the constant term. We note for future reference that the model in (83) contains both an intercept and a spatial lag of the dependent variable.

Suppose the researcher assumes, as would often be the case, that $E(\varepsilon_N) = 0$ and $E(\varepsilon_N\varepsilon_N') = \sigma^2 I_N$. Then, given $I_N - \lambda W_N$ is non-singular, we have $y_N = (I_N - \lambda W_N)^{-1}[\alpha e_N + X_N\beta + \varepsilon_N]$ and so $W_N y_N = W_N(I_N - \lambda W_N)^{-1}[\alpha e_N + X_N\beta + \varepsilon_N]$. Therefore

$$E(W_N y_N\varepsilon_N') = \sigma^2 W_N(I_N - \lambda W_N)^{-1} \neq 0, \tag{84}$$

i.e., in general the spatial lag $W_N y_N$ will be correlated with the disturbance vector $\varepsilon_N$. Given this endogeneity of $W_N y_N$ the researcher might attempt to estimate model (83) by the 2SLS procedure.

Suppose the model in (83) is indeed estimated by two stage least squares in terms of the full column rank $N \times (1 + k + r)$ matrix of instruments $H_N = (e_N, X_N, G_N)$ where, of course, $G_N$ is an $N \times r$ matrix and $r \geq 1$. Given results in Kelejian and Prucha (1998), $G_N$ could be taken to be the linearly independent columns of $(W_N X_N, W_N^2 X_N, \dots, W_N^q X_N)$, where typically $q \leq 2$. Let $P_{H_N} = H_N(H_N'H_N)^{-1}H_N'$ and $\hat{Z}_N = P_{H_N}Z_N$. Then, assuming that $\hat{Z}_N$ has full column rank, the 2SLS estimator of $\gamma$ is

$$\hat{\gamma}_N = (\hat{\alpha}_N, \hat{\beta}_N', \hat{\lambda}_N)' = (\hat{Z}_N'\hat{Z}_N)^{-1}\hat{Z}_N' y_N. \tag{85}$$

Our main result is given in Theorem 1. Its implications are given in the remarks that follow.

Theorem 1. Assume the model in (83). Let $\bar{y}_N = e_N' y_N / N$ denote the sample mean of $y_N$. If

$$W_N = a_N[e_N e_N' - I_N] = \begin{pmatrix} 0 & a_N & \cdots & a_N & a_N \\ a_N & 0 & \cdots & a_N & a_N \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_N & a_N & \cdots & 0 & a_N \\ a_N & a_N & \cdots & a_N & 0 \end{pmatrix} \tag{86}$$

where $a_N$ is a constant whose value could depend upon the sample size, $N$, then

(a) $\hat{\gamma}_N = (\hat{\alpha}_N, \hat{\beta}_N', \hat{\lambda}_N)' = (N\bar{y}_N, 0', -1/a_N)'$,
(b) $\hat{\varepsilon}_N = y_N - Z_N\hat{\gamma}_N = 0$.

Proof of Theorem 1: First note that if $W_N$ is given by (86), then

$$W_N y_N = (N a_N \bar{y}_N) e_N - a_N y_N, \tag{87}$$

which is linear in the variable being explained, namely $y_N$. Given (87), the estimated residual vector $\hat{\varepsilon}_N = y_N - Z_N\hat{\gamma}_N$ can be written as

$$\hat{\varepsilon}_N = y_N - e_N\hat{\alpha}_N - X_N\hat{\beta}_N - \hat{\lambda}_N W_N y_N = y_N(1 + \hat{\lambda}_N a_N) - e_N(\hat{\alpha}_N + N\hat{\lambda}_N a_N\bar{y}_N) - X_N\hat{\beta}_N.$$

Substituting the expressions for $\hat{\alpha}_N$, $\hat{\beta}_N$, and $\hat{\lambda}_N$ given in part (a) of the theorem, it is then readily seen that $\hat{\varepsilon}_N = 0$. The 2SLS objective function is given by

$$\hat{\varepsilon}_N' H_N(H_N'H_N)^{-1}H_N'\hat{\varepsilon}_N. \tag{88}$$

Since $H_N(H_N'H_N)^{-1}H_N'$ is positive semi-definite, $\hat{\varepsilon}_N' H_N(H_N'H_N)^{-1}H_N'\hat{\varepsilon}_N \geq 0$. The 2SLS objective function is thus clearly minimized for $\hat{\varepsilon}_N = 0$. Since we have just shown that $\hat{\varepsilon}_N = 0$ for $\hat{\gamma}_N' = (N\bar{y}_N, 0', -1/a_N)$, it follows that $\hat{\gamma}_N$ is indeed the vector of 2SLS estimators.

Remark 1: Since the diagonal elements of $W_N$ are all zero, the non-diagonal elements are all equal, and the sample size is $N$, one would typically take $a_N = \frac{1}{N-1}$ in the above illustrative cases, e.g., villages.

Remark 2: Given the model in (83) and the weighting matrix in (86) it should be clear that any estimator that is defined as a minimizer of a positive semi-definite quadratic form of the disturbances, e.g., OLS, will be identical to the 2SLS estimators.

Remark 3: Part (a) of Theorem 1 implies that, in a single panel framework (e.g., data on just one village), the model in (83) with $W_N$ specified as in (86) is not a useful one, and indeed, should be avoided! This should also be clear from part (b), which implies that the usual estimator for $\sigma^2$ is given by $\hat{\sigma}_N^2 = N^{-1}\hat{\varepsilon}_N'\hat{\varepsilon}_N = 0$, and so typical test statistics are not defined because they require division by zero. The suggestion is that results relating to them obtained in practice will, most likely, be based on rounding errors.

Remark 4: Theorem 1 also has implications concerning 2SLS estimation of model (83) for situations where the weighting matrix is not observed, but instead is parameterized in terms of observable variables and then its parameters are estimated by a nonlinear 2SLS procedure along with the regression parameters. Unfortunately, for a wide variety of parameterizations the results of such an estimation procedure would not be consistent.
To see the issues involved, suppose for the moment that the $(i, j)$-th element of the weighting matrix is specified as

$$w_{ii,N}(c) = 0; \qquad w_{ij,N}(c) = \frac{1}{1 + d_{ij,N}^{\,c}}, \quad i \neq j, \tag{89}$$

where $d_{ij,N} \geq 0$ is an observable distance measure between the $i$-th and $j$-th units, and $c \geq 0$ is a parameter to be estimated. Let $W_N(c)$ be the $N \times N$ weighting matrix for this case, and let $Z_N(c) = (e_N, X_N, W_N(c)y_N)$ be the regressor matrix corresponding to this more general version of the model in (83). Let $\tilde{\varepsilon}_N(\tilde{c}_N) = y_N - Z_N(\tilde{c}_N)\tilde{\gamma}_N$ where $\tilde{\gamma}_N' = (\tilde{\alpha}_N, \tilde{\beta}_N', \tilde{\lambda}_N)$.

Then the non-linear two stage least squares estimator for this model would minimize

$$\tilde{\varepsilon}_N'(\tilde{c}_N) H_N(H_N'H_N)^{-1}H_N'\tilde{\varepsilon}_N(\tilde{c}_N) \tag{90}$$

w.r.t. $\tilde{\alpha}_N$, $\tilde{\beta}_N$, $\tilde{\lambda}_N$ and $\tilde{c}_N$. Unfortunately, as should be clear from Theorem 1, the results of the minimization will lead to

$$\tilde{c}_N = 0, \qquad (\tilde{\alpha}_N, \tilde{\beta}_N', \tilde{\lambda}_N) = (N\bar{y}_N, 0', -2)$$

since $c = 0$ implies uniform weights (in this case, $a_N = 1/2$) and this, in turn, implies via part (b) of Theorem 1 that the minimized value in (90) is zero. We note that this negative result would not be altered for other specifications of $w_{ij,N}$, as long as there are admissible parameter values such that all non-diagonal weights are equal.

In a sense there is a corollary to Remark 4. Specifically, suppose in a model such as (83) the weighting matrix is not known a priori and the researcher considers various observable specifications of it in terms of, say, various distance measures, e.g., trade shares, geographic distance, etc. Remark 4 suggests that if that researcher then selects the specification of the weighting matrix on the basis of the standard $R^2$ statistic, the results may be biased in the direction of the matrix with the most uniform weights. Clearly, the suggestion is that the $R^2$ measure of fit should not be used to determine the weighting matrix.

There is a subtle point concerning the remarks above relating to the case in which $w_{ij,N}(c)$ is given by (89). Specifically, taking $c = 0$ implies that $w_{ij,N}(c) = 1/2$. This in turn violates the assumption maintained in our large sample analysis that the weighting matrix is "summable", i.e., $c = 0$ should not be in the parameter space. On the other hand, if the weighting matrix were formed by first taking $w_{ij,N}(c)$ as in (89) and then row normalizing, we would still end up with uniform weights if $c = 0$ and a matrix that is summable, and so there would still be problems!

Suppose now that the researcher had panel data on a model such as (83), e.g., data on more than one village.
The reader should be able to convince himself that the identification problem will still hold if the number of observations in each village were the same, and if fixed effects were considered, i.e., each village had its own intercept. On the other hand, if fixed effects were not considered, or if the number of observations in each village were not the same, this identification problem would not arise and so, under usual assumptions, the 2SLS estimators would be consistent.
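The algebra of Theorem 1 can be checked numerically for a single cross section. The sketch below is an illustration under assumptions, not part of the original analysis: the function name, the data generating process, and the extra instrument column `G` are hypothetical. With uniform weights $W_N X_N = a_N(e_N e_N' X_N - X_N)$ is a linear combination of $e_N$ and $X_N$, so the sketch takes the additional instrument from outside that set. Since the result is purely algebraic, the way $y_N$ is generated does not matter.

```python
import numpy as np

def tsls_uniform_weights(y, X, G, a):
    """2SLS for model (83) with W = a (e e' - I) and instruments H = (e, X, G)."""
    N = len(y)
    e = np.ones(N)
    W = a * (np.outer(e, e) - np.eye(N))
    Z = np.column_stack([e, X, W @ y])
    H = np.column_stack([e, X, G])
    Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]       # P_H Z
    gamma_hat = np.linalg.lstsq(Z_hat, y, rcond=None)[0]   # (Z_hat'Z_hat)^{-1} Z_hat'y
    resid = y - Z @ gamma_hat
    return gamma_hat, resid

rng = np.random.default_rng(2)
N = 50
X = rng.normal(size=(N, 1))
G = rng.normal(size=(N, 1))                                # hypothetical extra instrument
y = 1.0 + 2.0 * X[:, 0] + G[:, 0] + rng.normal(size=N)     # arbitrary data

# a = 1/(N-1) as in Remark 1, and a = 1/2 as implied by (89) with c = 0
for a in (1.0 / (N - 1), 0.5):
    gamma_hat, resid = tsls_uniform_weights(y, X, G, a)
    print("a =", a)
    print("  2SLS estimates:      ", gamma_hat)
    print("  Theorem 1 (a) values:", [N * y.mean(), 0.0, -1.0 / a])
    print("  largest |residual|:  ", np.abs(resid).max())
```

The estimates reproduce $(N\bar{y}_N, 0, -1/a_N)$ up to rounding whatever the data are, and the residuals are zero up to rounding, which is the sense in which the model is not useful in this setting.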

1.2.6 Estimation issues relating to missing data for border units

Data shortcomings often arise in the analysis of spatial models containing spatial lags in either the dependent variable or in the exogenous variables. In many cases these shortcomings relate to the lack of data on relevant variables relating to certain units which are defined to be neighbors of at least some of the other units in the study which are observed. There are various ways researchers have confronted this problem. One is to ignore it. This approach leads to an omitted variable problem. Another approach is to construct a sample from the available data which is complete in the sense that observations are available for all units and their neighbors. This approach leads to an errors in variables problem.


JOURNAL OF REGIONAL SCIENCE, VOL. 46, NO. 3, 2006, pp. 507 515 ESTIMATION PROBLEMS IN MODELS WITH SPATIAL WEIGHTING MATRICES WHICH HAVE BLOCKS OF EQUAL ELEMENTS* Harry H. Kelejian Department of Economics,