Errors in variables and spatial effects in hedonic house price models of ambient air quality


Empirical Economics (2008) 34:5–34
DOI /s

ORIGINAL PAPER

Luc Anselin · Nancy Lozano-Gracia

Received: 15 January 2007 / Accepted: 16 April 2007 / Published online: 27 July 2007
© Springer-Verlag 2007

This paper is part of a joint research effort with James Murdoch (University of Texas, Dallas) and Mark Thayer (San Diego State University). Their valuable input is gratefully acknowledged. The research was supported in part by NSF Grant BCS to the Center for Spatially Integrated Social Science (CSISS), and by NSF/EPA Grant SES. Earlier versions were presented at the 5th International Workshop on Spatial Econometrics and Statistics, Rome, Italy, May 2006, the 53rd North American Meetings of the Regional Science Association International, Toronto, ON, Nov. 2006, the 2007 Meetings of the Allied Social Science Associations, Chicago, IL, Jan. 2007, and at departmental seminars at the University of Illinois. Comments by discussants and participants are greatly appreciated. A special thanks to Harry Kelejian for his detailed and patient clarification of the HAC estimator. The usual disclaimer holds.

L. Anselin (corresponding author)
School of Geographical Sciences, Arizona State University, Tempe, AZ, USA
luc.anselin@asu.edu

N. Lozano-Gracia
Spatial Analysis Laboratory (SAL) and Department of Agricultural and Consumer Economics, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA
lozano@uiuc.edu

Present Address: N. Lozano-Gracia
School of Geographical Sciences, Arizona State University, Tempe, AZ, USA

Abstract In the valuation of the effect of improved air quality through the estimation of hedonic models of house prices, the potential errors in variables aspect of the interpolated air pollution measures is often ignored. In this paper, we assess the extent to which this may affect the resulting empirical estimates for marginal willingness to pay (MWTP), using an extensive sample of over 100,000 individual house sales for 1999 in the South Coast Air Quality Management District of Southern California. We

take an explicit spatial econometric perspective and account for spatial dependence and endogeneity using recently developed Spatial 2SLS estimation methods. We also account for both spatial autocorrelation and heteroskedasticity in the error terms, using the Kelejian–Prucha HAC estimator. Our results are consistent across different spatial weights matrices and different kernel functions and suggest that the bias from ignoring the endogeneity in interpolated values may be substantial.

Keywords Spatial econometrics · Hedonic models · HAC estimation · Endogeneity · Air quality valuation · Real estate markets

JEL Classification C21 · Q51 · Q53 · R31

1 Introduction

An important aspect of assessing the effectiveness of environmental policies that address the improvement of air quality is obtaining a quantitative measure of the economic value of the accrued benefits (e.g., Freeman 2003). In the absence of an explicit market for clean air, several methods have been suggested to estimate this value empirically, such as contingent valuation, conjoint analysis, discrete choice models and hedonic specifications. In this paper, we focus on the latter and consider some methodological issues associated with the estimation of an implicit price for clean air by including one or more pollution variables in a hedonic model of house prices. The rationale behind this approach is that, ceteris paribus, houses in areas with cleaner air will have this benefit capitalized into their value, which should be reflected in a higher sales price.

The hedonic approach has become an established methodology in environmental economics (e.g., Palmquist 1991). Originating with the classic studies of Ridker and Henning (1967) and Harrison and Rubinfeld (1978), it has generated a voluminous literature dealing with theoretical, methodological and empirical aspects.
Extensive reviews are provided in Smith and Huang (1993, 1995), Boyle and Kiel (2001), and Chay and Greenstone (2005), among others. Recently, empirical econometric work has started to take into account the potential bias and loss of efficiency that can result when spatial effects such as spatial autocorrelation and spatial heterogeneity are ignored in the estimation process. Spatial econometric methods (Anselin 1988), which incorporate the spatial dependence in cross-sectional data into model specification, estimation and testing have become fairly commonplace in empirical studies of housing and real estate, leading to so-called spatial hedonic models. Reviews of the basic specifications and estimation methods are provided in Anselin (1998), Basu and Thibodeau (1998), Pace et al. (1998), Dubin et al. (1999), Gillen et al. (2001), and Pace and LeSage (2004), among others. In the context of the valuation of environmental amenities, a spatial hedonic approach has been less common, although some recent applications include Kim et al. (2003), Beron et al. (2004), Brasington and Hite (2005), and Anselin and Le Gallo (2006). A theoretical perspective is offered in Small and Steimetz (2006).

In Chay and Greenstone (2005) (CG), several methodological issues are addressed pertaining to the identification and consistent estimation of the implicit price of air quality, using total suspended particulates as an environmental indicator. Specifically, CG focus on the potential endogeneity of the pollution variable and suggest an instrumental variable approach to estimate it consistently. They also consider potential endogeneity due to sorting by house purchasers when there is heterogeneity in their preference functions with different pollution levels. While considerable care is taken in addressing these specification problems, the model itself is estimated at a fairly aggregate spatial scale of US counties. Bayer et al. (2006) follow Chay and Greenstone (2005) by suggesting the possibility of local air pollution being correlated with unobserved local characteristics. They address this form of endogeneity by using the contribution of distant sources to local air pollution as an instrument for air pollution at the county level.

In this paper, we focus on a separate source of endogeneity of the air quality variables in the hedonic specification. We elaborate on an idea outlined in Anselin (2001c), where it was argued that the use of spatially interpolated values for air quality (or, pollution) results in a prediction error which may be correlated with the overall model disturbance term. This would lead to simultaneity bias in an ordinary least squares regression. We thus consider the treatment of endogeneity in the pollution variable from the particular perspective of an errors in variables problem. We use polynomials in the coordinates of the house locations as instruments to correct for this endogeneity. In contrast to the aggregate approach of CG, our empirical work is based on observations for individual house transactions. [1] Consequently, we face the mismatch between the spatial support of the explanatory variable, a pollution measure collected at a finite set of monitoring stations, and the dependent variable, the price observed at the location of the house sales transaction. As outlined in Anselin and Le Gallo (2006), this requires a spatial interpolation operation. Several alternatives are possible, each with implications for the precision of the resulting variable.

We take an explicit spatial econometric approach and include a spatially lagged dependent variable (spatial lag) in the hedonic specification. The combination of the endogeneity of the spatial lag and the air quality variables requires the application of spatial two stage least squares estimation (Anselin 1988; Kelejian and Robinson 1993; Kelejian and Prucha 1998; Lee 2003, 2006) and specialized test statistics (Anselin and Kelejian 1997). In addition, we allow for remaining spatial autocorrelation and heteroskedasticity of an unspecified nature (HAC) and obtain robust standard error estimates using the method of Kelejian and Prucha (2006a). We believe ours is the first true empirical application of spatial hedonic models in which both types of endogeneity (spatial and non-spatial) are considered jointly and that uses the HAC standard errors.

[1] CG also employ a panel data set with observations at two points in time, whereas our sample is a pure cross-section. CG do not consider spatial effects. In our work, we do not explicitly consider endogeneity due to sorting. However, from an empirical point of view, the source of the endogeneity is irrelevant once it is properly accounted for.

We assess the extent to which the selection of a particular method affects the parameter estimates in the hedonic function and the derived economic valuation of marginal willingness to pay (MWTP) for improved air quality. Specifically, we compare nonspatial to spatial hedonic specifications and estimation with and without instruments for the endogeneity of the air quality variable. We further assess the robustness of our findings by carrying out estimation for different spatial weights and different kernel functions. We pursue this empirical assessment by means of an investigation of a sample of 115,732 house sales in the South Coast Air Quality Management District of Southern California, for which we have detailed characteristics, as well as neighborhood measures and observations on ozone and particulate matter. [2]

In the remainder of the paper, we first provide a brief discussion of data sources and variables included in the model. We next give some methodological background on the spatial econometric estimators and test statistics used. This is followed by a review of the estimation results, with a special focus on the estimates of the parameters of the air quality variables. In a brief discussion of policy implications, we compare the estimates for marginal willingness to pay. We close with some concluding remarks.

2 Data and variables

The basic data used in this paper come from three main sources: Experian Company (formerly TRW) for the individual house sales price and characteristics, the 2000 US Census of Population and Housing for the neighborhood characteristics (at the census tract and block group level), and the South Coast Air Quality Management District for the measures of ozone (OZ) and particulate matter (TSP) concentration.
The house price and characteristics are from 115,729 sales transactions of owner-occupied single family homes that occurred during 1999 in the region, which covers four counties: Los Angeles (LA), Riverside (RI), San Bernardino (SB) and Orange (OR). The data were geocoded, which allows for the assignment of each house to any spatially aggregate administrative district (such as a census tract, block group or a school district) and for the computation of accessibility measures and interpolated pollution values for the location of each individual house in the sample. House price and characteristics are matched with neighborhood and locational characteristics at the census tract, and, where possible, at the block group level from the 2000 U.S. Census of Population and Housing. [3]

The variables used in the hedonic specification are essentially the same as those in earlier work by Beron et al. (2004) and Anselin and Le Gallo (2006). This base set is extended with newly computed measures on crime rates, school quality, distance to parks, and access to the highway system. All the variables used in the analysis are listed in Tables 1 and 2.

[2] Other studies of the relation between house prices and air quality in this region can be found in Graves et al. (1988), Beron et al. (1999, 2001, 2004), and Anselin and Le Gallo (2006), although only the latter two take an explicit spatial econometric approach. Also of interest is a general equilibrium analysis of ozone abatement in the same region, using a hierarchical locational equilibrium model, outlined in Smith et al. (2004).

[3] We assume that the values obtained for the 2000 Census are representative of the spatial distribution in 1999.

Table 1 Variable names and description

Variable name   Description
Elevation       Relative elevation of the house
Livarea         Interior living space (10,000 sq.m.)
Landarea        Lot size (1,000 sq.m.)
Baths           Number of bathrooms
Fireplace       Number of fireplaces
Pool            Indicator variable for swimming pool
Age             Age of the house (10 years)
AC              Indicator variable for central air conditioning
Heat            Indicator variable for central heating
Beach           Indicator variable for location less than 5 miles from beach
Avdistp         Average distance to parks in meters
Highway1        Indicator variable for location within 0.25 km from a highway
Highway2        Indicator variable for location within km from a highway
Traveltime      Average time to work in census tract (CT)
Poverty         % of population with income below the poverty level in CT
White           % of the population that is white in the census block group (BG)
Over65          % of the population older than 65 years in the census BG
College         % of population with college in the CT
Income          Median household income in BG (10,000 US$)
Vcrime          Violent crime rate for the city (or non urban county rate)
API             Average academic performance index for the school district
Riverside       Indicator variable for Riverside county
San Bern.       Indicator variable for San Bernardino county
Orange          Indicator variable for Orange county
OZ              Ozone measured in ppb
TSP             Total Suspended Particles in µg/m³
We grouped the variables in the Table into five categories: house-specific characteristics from the Experian data set; location-specific characteristics, such as accessibility measures, computed from the house coordinates; neighborhood characteristics, obtained from the Census, supplemented with variables calculated from the FBI Uniform Crime Reports and the State of California Department of Education school performance scores; county dummies; and interpolated air pollution values. Five new variables are included in the current analysis that were not used in Anselin and Le Gallo (2006): Vcrime, API, Avdistp, Highway1 and Highway2. They were computed from different sources. Crime rates for violent crimes taking place during 1998 were obtained from the FBI Uniform Crime database. This measure is reported at the city as well as the county level. Where possible, we assigned the city level crime

rate to each house in the city. Where crime rates were not available at the city scale, we used the non-urban crime rate for the county in which the house is located.

Table 2 Basic descriptive statistics for all variables
Columns: Variable name, Mean, Std. Deviation, Min, Max.
Rows: House price (Min 20,000; Max 5,345,455), Ln(house price), Elevation, Livarea, Landarea, Baths, Fireplace, Pool, Age, AC, Heat, Beach, Pavdist, Highway1, Highway2, Traveltime, Poverty, White, Over65, College, Income, Vcrime, API, Riverside, San Bern., Orange, OZ, TSP.
[Remaining numeric entries not recoverable.]

A measure of the average school quality is computed from the Academic Performance Index (API), published by the California Department of Education. [4] This is the primary indicator used by the state to evaluate school performance. The API is an index calculated using both base and growth values of student rankings in the State Standardized tests. It is based on a scale from 200 to 1,000 with the target being

The average 1999 API value for all schools in a school district is calculated and then assigned to all the houses in the district. [5]

We supplement the beach access variable with three other indicators of accessibility to amenities. First, we obtained the locations for each park in the four counties from the Geographic Names Information System website. [6] For each house location, we then computed the average distance to parks as a summary measure. We also supplemented the Census travel time measure with two other indicators of access to the highway system. These are intended to capture both the negative externalities (such as noise) experienced from being very close to the highways, as well as positive externalities due to shorter travel distances. We used ArcGIS and detailed highway maps [7] to define buffers of 0.25 km around the highways and to create two indicator variables. The first takes the value of one if the house is within 0.25 km of a highway, the second takes the value of one if the house is between 0.25 and 1 km from a highway.

Air quality is measured as ambient air pollution. In the literature, hedonic specifications typically include either ozone (OZ) or total suspended particulate matter (TSP) as pollutants, since these are most visible in the form of smog. In addition, local news outlets report daily measures of these pollutants and broadcast alerts when dangerous levels are reached. Consequently, it is reasonable to assume that these pollutants enter into the utility function of potential buyers, although the question remains to what extent a continuous measure of air quality is the appropriate metric. [8] We include both pollutants in the specification, in order to minimize omitted variable problems.
[9] We use the average of the daily maxima during the worst quarter of 1998 from the hourly observations recorded at monitoring stations for ozone and suspended particles. It should be noted that the number and locations of stations in the South Coast Air Quality Management District (SCAQMD) are not the same for each pollutant. In 1998, there were measurements for OZ for 28 monitoring stations, while TSP only had 12. The location of the monitoring stations relative to the houses in the sample is illustrated in Fig. 1. This yields a reasonable coverage of the spatial distribution of house locations for OZ, but much less so for TSP, which has fewer than half the number of stations.

[5] It would have been preferable to use a measure of school quality from the year previous to the year in which the house sale takes place, as we do for the air quality measures. However, information for the API in California school districts is only available starting in

[7] ESRI Data & Maps CD-ROM (2002). Redlands, CA, USA: Environmental Systems Research Institute.

[8] In Anselin and Le Gallo (2006) discrete categories were also considered. In the current paper, our focus is on endogeneity and we leave the issue of the proper metric for a separate analysis.

[9] We also ran the analysis for specifications with only one pollutant in the equations and the results and conclusions were qualitatively similar to what we found here. Detailed results are not reported, but available from the authors.

We interpolate the values at the monitoring stations to the location of every house in the sample using ordinary kriging. Anselin and Le Gallo (2006) find ordinary kriging to be the most reliable among several interpolation methods, including Thiessen

polygons, inverse distance weighting and splines. Figures 2 and 3 show the resulting interpolated values of ozone and particles, with darker color representing higher levels of the pollutant. [10] The spatial pattern is very different for the two measures of air pollution. For ozone, lower levels are observed closer to the ocean and air quality seems to worsen as one moves North-East with a suggestion of separate air quality bands. For TSP, generally lower pollution is observed in the North-West corner of the Basin, with increasing levels as one moves towards the South-East.

Fig. 1 Spatial distribution of houses and location of monitoring stations
Fig. 2 Kriging interpolation: OZ
Fig. 3 Kriging interpolation: TSP

[10] Kriging interpolations were carried out using the ESRI ArcGIS Geostatistical Analyst extension. A spherical model allowing for directional effects was used for both pollutants. For OZ the model chosen included 8 lags with a lag size of 9 km, and the estimated parameters were and 9 for the direction (angle), 4.16 for the partial sill, 68,604 and 68,236 for the major ranges and 59,381 and 68,236 for the minor ranges. The model chosen for TSP included 9 lags with a lag size of 6 km, and the estimated parameters were and 9 for the direction, for the partial sill, 50,969 and 50,959 for the major ranges and 11,303 and 50,959 for the minor ranges.

The precision of the interpolated value varies across the sample, becoming worse for locations further removed from monitoring sites. To correct for a possible biasing effect of such high-error interpolated values, the house locations within the upper 5% of the prediction error distribution for either pollutant were dropped from the sample. This resulted in a final set of 103,867 house locations, of which 67,864 are in LA county, 17,914 in OR county, 12,266 in SB and 5,823 in Riverside county.

The observed sales price ranges from $20,000 to $5,345,455, with an overall mean of $243,346. There is considerable variability across counties. For example, the average house price for observations in LA county is $261,946, while it is $269,081 in OR, $148,948 in SB and $146,249 in RI. Figure 4 illustrates the spatial distribution of house prices, with higher prices represented through darker colors. Some concentration of high prices per square meter can be seen along the coast of LA and OR, although overall, there is considerable complexity in the spatial distribution of prices. Basic descriptive statistics for all the variables included in the analysis are given in Table 2.

3 Spatial econometric issues

We estimate a hedonic function in log-linear form and take an explicit spatial econometric approach. This includes testing for the presence of spatial autocorrelation and

estimating specifications that incorporate spatial dependence. [11] We follow Anselin (1988) and distinguish between spatial dependence in the form of a spatially lagged dependent variable, and a model with a spatially correlated error term. We refer to these as spatial lag and spatial error models, respectively.

Fig. 4 Spatial distribution of house prices (Price/sq.m.)

[11] For a general overview of methodological issues involved in the specification, estimation and diagnostic testing of spatial econometric models, we refer to Anselin (1988, 2001b, 2006) and Anselin and Bera (1998), among others.

Formally, a spatial lag model is expressed as:

$$y = \rho W y + X\beta + u, \qquad (1)$$

where $y$ is an $n \times 1$ vector of observations on the dependent variable, $X$ is an $n \times k$ matrix of observations on explanatory variables, $W$ is an $n \times n$ spatial weights matrix, $u$ an $n \times 1$ vector of i.i.d. error terms, $\rho$ the spatial autoregressive coefficient, and $\beta$ a $k \times 1$ vector of regression coefficients.

The theoretical motivation for a spatial lag specification is based on the literature on interacting agents and social interaction. For example, a spatial lag follows as the equilibrium solution of a spatial reaction function (Brueckner 2003) that includes the decision variable of other agents in the determination of the decision variable of an agent (see also Manski 2000). In the current setting, which is purely cross-sectional, it is difficult to maintain such a theoretical motivation, since it would imply that buyers and sellers simultaneously take into account prices obtained in other transactions. An alternative interpretation is provided by focusing on the reduced form of the spatial lag model:

$$y = (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1} u, \qquad (2)$$

where, under standard regularity conditions, the inverse $(I - \rho W)^{-1}$ can be expressed as a power expansion

$$(I - \rho W)^{-1} = I + \rho W + \rho^2 W^2 + \cdots. \qquad (3)$$

The reduced form thus expresses the house price as a function of the own characteristics ($X$), but also of the characteristics of neighboring properties ($WX$, $W^2X$), albeit subject to a distance decay operator (the combined effect of powering the spatial autoregressive parameter and the spatial weights matrix). In addition, omitted variables, both property-specific as well as related to neighboring properties are encompassed in the error term. In essence, this reflects a scale mismatch between the property location and the spatial scale of the attributes that enter into the determination of the equilibrium price.

From a purely empirical perspective, one can also argue that the spatial lag specification allows for a filtering of a strong spatial trend (similar to detrending in the time domain), i.e., to ensure the proper inference for the $\beta$ coefficients when there is insufficient variability across space. Formally, the spatial filter interpretation stresses the estimation of $\beta$ in:

$$y - \rho W y = X\beta + u. \qquad (4)$$

In contrast, spatial error autocorrelation results when omitted variables follow a spatial structure such that the error variance-covariance matrix is no longer diagonal:

$$\mathrm{Var}[u] = E[uu'] = \Sigma, \qquad (5)$$

where $\Sigma \neq \sigma^2 I$, with $I$ as the identity matrix. Arguably, such spatially structured omitted variables may be addressed by means of spatial fixed effects, e.g., by including a dummy variable for each census tract or block group. This rests on the assumption that the spatial range of the unobserved heterogeneity/dependence is specific to each spatially delineated unit. In practice, there may be spatial units (such as school districts) where such a spatial fixed effects approach is sufficient to correct for the problem.
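As a numerical aside on the power expansion in Eq. (3), the following minimal sketch compares the exact inverse $(I - \rho W)^{-1}$ with a truncated series for a small row-standardized weights matrix. The four-location layout, the value of `rho`, and the helper names `row_standardize` and `power_expansion` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def row_standardize(adjacency):
    """Scale each row of a binary neighbor matrix so that it sums to one."""
    adjacency = np.asarray(adjacency, dtype=float)
    return adjacency / adjacency.sum(axis=1, keepdims=True)


def power_expansion(W, rho, order):
    """Truncated series I + rho*W + rho^2*W^2 + ... + rho^order*W^order."""
    n = W.shape[0]
    total = np.eye(n)
    term = np.eye(n)
    for _ in range(order):
        term = rho * (term @ W)  # next term: rho^k W^k
        total = total + term
    return total


# Hypothetical example: four locations on a line, adjacent ones are neighbors.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
W = row_standardize(A)
rho = 0.4  # assumed spatial autoregressive coefficient, |rho| < 1

exact = np.linalg.inv(np.eye(4) - rho * W)
approx = power_expansion(W, rho, order=50)
max_gap = np.abs(exact - approx).max()
print("largest absolute difference:", max_gap)
```

Because every entry of the exact inverse is positive in this connected example, a change in a characteristic at one location reaches all other locations in the reduced form, with an influence that decays along the chain.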
However, the nature of omitted neighborhood variables tends to be complex, as is the definition of the correct neighborhood. Instead of including spatial fixed effects, we assume a process for the error terms that allows the externalities to spill over throughout the system. More specifically, in contrast to most earlier work, we do not impose a specific functional form, but take a non-parametric perspective, implementing the recent results of Kelejian and Prucha (2006a).

By means of the spatial weights matrix $W$, a neighbor set is specified for each location. The elements $w_{ij}$ of $W$ are positive when observations $i$ and $j$ are neighbors, and zero otherwise. By convention, self-neighbors are excluded, such that the diagonal elements of $W$ are zero. In addition, in practice, the weights matrix is typically row-standardized, such that $\sum_j w_{ij} = 1$. Many different definitions of the neighbor relation are possible, and there is little formal guidance in the choice of the correct spatial weights. [12] The term $Wy$ in Eq. (1) is referred to as a spatially lagged dependent variable, or spatial lag. For a row-standardized weights matrix, it consists of a weighted average of the values of $y$ in neighboring locations, with weights $w_{ij}$.

[12] For a more extensive discussion, see Anselin (2002, pp ), and Anselin (2006, pp ).

In our application, we consider three spatial weights to assess the sensitivity of the results to this important aspect of the model specification. One weight is derived from the contiguity relationship for Thiessen polygons constructed from the house locations. This effectively turns the spatial representation of the sample from points into polygons. The resulting weights matrix is symmetric and extremely sparse (0.006% non-zero weights). On average it contains 6 neighbors for each location (ranging from a minimum of 3 neighbors to a maximum of 35 neighbors, with 6 as the median). We supplement this with two weights based on a nearest neighbor relation among the locations, for respectively 6 and 12 neighbors. The corresponding weights matrix is asymmetric, but equally sparse (respectively and 0.012% non-zero weights). The three weights matrices are used in row-standardized form.

We first obtain ordinary least squares (OLS) estimates for the hedonic model and assess the presence of spatial autocorrelation using the Lagrange Multiplier test statistics for error and lag dependence (Anselin 1988), as well as their robust forms (Anselin et al. 1996). [13] The results consistently show very strong evidence of positive residual spatial autocorrelation, with an edge in favor of the spatial lag alternative (see Sect. 4). This matches earlier results obtained in Anselin and Le Gallo (2006). We therefore focus on the estimation of the spatial lag model but allow remaining spatial error autocorrelation of unspecified form, as well as heteroskedasticity of unspecified form.

[13] See Anselin (2001a), for an extensive review of statistical issues.

Our paper takes two distinctive approaches towards estimation and inference of the spatial hedonic model that warrant further elaboration. First, we use a spatial two-stage least squares estimator (S2SLS) that allows for a spatial lag as well as other endogenous variables. Consider the spatial lag model (1) with an additional term:

$$y = \rho W y + Y\nu + X\beta + u, \qquad (6)$$

where $Y$ is an $n \times p$ matrix of endogenous variables, with associated coefficient vector $\nu$. In our model, the endogenous variables are the air quality variables, say $y_2$ and $y_3$. Since the actual pollution is not observed at the locations $i$ of the house transactions, it is replaced by a spatially interpolated value, such as the result of a kriging prediction. This interpolated value measures the true pollution with error, for example, at location $i$:

$$y_{2i} = y^*_{2i} + \psi_i, \qquad (7)$$

where $y^*_{2i}$ is the true air quality that enters into the agent's utility function, $y_{2i}$ is the observed value (the interpolated value), and $\psi_i$ an error term. Note that this error is related to the interpolation error to the extent that the predicted item is also what enters into the utility function. An additional source of error would be a discrepancy between what is predicted as air quality and what is included into the agent's utility function as

air quality. [14] From a practical perspective, due to the nature of the kriging predictor, the prediction error will be highly spatially structured. We suggest that it therefore is likely to mimic the spatially structured equation disturbance $u$. In addition, the failure to predict air quality correctly at a location may be due to similar omitted variables as those that affect the error of the hedonic specification (e.g., the omitted presence of noxious facilities). As a result, it is likely that $E[\psi_i u_i] \neq 0$, causing simultaneous equation bias due to errors in variables.

Using traditional notation, Eq. (6) can be rewritten concisely as:

$$y = Z\gamma + u, \qquad (8)$$

with $Z = [Wy, Y, X]$ and $\gamma = [\rho, \nu', \beta']'$. The spatial two stage least squares estimator is an extension of the standard two stage least squares estimator that includes specific instruments for the spatially lagged dependent variable (see Anselin 1980, 1988; Kelejian and Robinson 1993; Kelejian and Prucha 1998; Kelejian et al. 2004; Lee 2003, 2006). Specifically, consider the $n \times q$ matrix of instruments $Q$, with $q \geq k + p + 1$:

$$Q = [X, WX, H], \qquad (9)$$

where $WX$ is a matrix consisting of the spatially lagged explanatory variables (exogenous variables only, and excluding the intercept), and $H$ is a matrix of instruments for the other endogenous variables (the air quality variables). The use of $WX$ as instruments for the spatial lag is based on the reduced form of the model. The selection of instruments for the errors in variables problem is less straightforward. Proper instruments should be correlated with the unobserved true pollution value $y^*$ and uncorrelated with the regression error $u$. The effects on the estimates of using weak instruments have been widely discussed in the literature (see e.g., Staiger and Stock 1997) and the question of how to specify the right instruments remains unresolved for many economic problems.
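The construction of the instrument matrix in Eq. (9) can be sketched as follows. This is only an illustration under stated assumptions: the coordinates and regressors are simulated, the helper names `knn_weights` and `build_instruments` are hypothetical, and the coordinate-based columns of `H` (latitude, longitude and their product) are one possible choice of instruments for two air quality variables ($p = 2$).

```python
import numpy as np


def knn_weights(coords, k):
    """Row-standardized k-nearest-neighbor weights with a zero diagonal."""
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)             # exclude self-neighbors
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest points
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nearest] = 1.0 / k
    return W


def build_instruments(X, W, H):
    """Assemble Q = [X, WX, H]; the first column of X is assumed to be the intercept."""
    # The intercept is not lagged: for row-standardized W its lag is again a
    # column of ones, which would be collinear with the intercept in X.
    WX = W @ X[:, 1:]
    return np.hstack([X, WX, H])


rng = np.random.default_rng(0)
n, n_neighbors = 200, 6
coords = rng.uniform(size=(n, 2))              # simulated (longitude, latitude)
lon, lat = coords[:, 0], coords[:, 1]

W = knn_weights(coords, n_neighbors)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 exogenous
H = np.column_stack([lat, lon, lat * lon])     # assumed coordinate instruments
Q = build_instruments(X, W, H)

k, p = X.shape[1], 2                           # p endogenous air quality variables
order_condition_ok = Q.shape[1] >= k + p + 1   # q >= k + p + 1
print(Q.shape, order_condition_ok)
```

The order condition only counts columns; whether the instruments are strong enough is a separate question that this sketch does not address.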
We chose instruments that are able to proxy the overall spatial pattern of the pollution as a global spatial trend. They are therefore unlikely to be correlated with the hedonic error terms, which reflect local spatial patterns of omitted variables. Specifically, we use the latitude, longitude and their product as the instruments. Note that these instruments may also aid in correcting endogeneity due to other factors, such as sorting. As long as they are uncorrelated with the error term, they will yield consistent estimates. However, if the instruments do not accurately capture the causal mechanism underlying the other sources of endogeneity, the resulting estimates will not be the most efficient. This needs to be considered together with other sources of inefficiency, such as unobserved heterogeneity and spatial autocorrelation in the error term. In order for the asymptotic properties of the HAC estimator to hold, we only need consistency of the estimates in the first stage, which will be satisfied by our instruments (as long as they are uncorrelated with the error term).

[14] An early application of instrumental variables in this context within the economic literature is Friedman (1957), where a measurement problem appears when using annual income as a proxy for permanent income in estimating a consumption function.

With the instrument matrix in hand, we obtain the S2SLS estimates as:

$$\hat\gamma_{S2SLS} = [Z'Q(Q'Q)^{-1}Q'Z]^{-1} Z'Q(Q'Q)^{-1}Q'y. \tag{10}$$

Inference is based on the asymptotic variance matrix:

$$\mathrm{AsyVar}[\hat\gamma_{S2SLS}] = \hat\sigma^2 [Z'Q(Q'Q)^{-1}Q'Z]^{-1}, \tag{11}$$

with $\hat\sigma^2 = (y - Z\hat\gamma_{S2SLS})'(y - Z\hat\gamma_{S2SLS})/n$. We relax the assumption of homoskedasticity used in (11) and allow for heteroskedasticity of unspecified form. A direct application of the approach outlined in White (1980) yields an alternative estimate for the asymptotic variance matrix as:

$$\mathrm{AsyVar}[\hat\gamma^{W}_{S2SLS}] = [Z'Q(Q'SQ)^{-1}Q'Z]^{-1}, \tag{12}$$

where $(Q'SQ)^{-1}$ takes the place of $(Q'Q)^{-1}$ and $S$ is a diagonal matrix containing the squared S2SLS residuals. [15] We also continue to test for remaining spatial error autocorrelation, using the generalized LM tests for 2SLS residuals (Anselin and Kelejian 1997).

The second distinctive methodological aspect of our approach is that we allow for remaining spatial error autocorrelation of unspecified form. Since the specification tests indicate the presence of such autocorrelation (see Sect. 4), we apply the recently developed heteroskedastic and autocorrelation robust (HAC) approach of Kelejian and Prucha (2006a). This builds upon the framework outlined in Conley (1999) as an extension to the spatial domain of the well-known Newey and West (1987) result from time series analysis (see also Andrews 1991). The core of the HAC technique is a non-parametric estimator for the spatial covariance, using weighted averages of cross-products of residuals, the range of which is determined by a kernel function. [16] Formally, we need to obtain an estimate of the matrix $\Psi = Q'\Sigma Q$, where $\Sigma$ is a non-diagonal spatial variance covariance matrix for the error terms.
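Before turning to the details of the HAC estimator, the estimator in (10) and the two variance expressions in (11) and (12) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `s2sls` is hypothetical, the matrices `Z` and `Q` are assumed to be dense and of full column rank, and explicit inverses are used for readability rather than numerical robustness.

```python
import numpy as np

def s2sls(y, Z, Q):
    """Spatial 2SLS following (10), with the variance estimates in (11) and (12).

    y : (n,) dependent variable
    Z : (n, m) regressors [Wy, Y, X]
    Q : (n, q) instruments [X, WX, H]
    """
    n = y.shape[0]
    QtQ_inv = np.linalg.inv(Q.T @ Q)
    ZtQ = Z.T @ Q
    A_inv = np.linalg.inv(ZtQ @ QtQ_inv @ ZtQ.T)   # [Z'Q(Q'Q)^{-1}Q'Z]^{-1}
    gamma = A_inv @ ZtQ @ QtQ_inv @ (Q.T @ y)      # Eq. (10)

    resid = y - Z @ gamma
    sigma2 = resid @ resid / n
    var_classical = sigma2 * A_inv                 # Eq. (11)

    QSQ = (Q * resid[:, None] ** 2).T @ Q          # Q'SQ, S = diag(squared residuals)
    var_white = np.linalg.inv(ZtQ @ np.linalg.inv(QSQ) @ ZtQ.T)  # Eq. (12) as written
    return gamma, resid, var_classical, var_white
```

In the special case $Q = Z$ (no endogenous regressors), the estimate reduces to OLS and (12) reduces to the familiar White sandwich $(Z'Z)^{-1}Z'SZ(Z'Z)^{-1}$, which offers a simple check on the code.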
As Kelejian and Prucha (2006a) show, the estimator for the individual $r, s$ elements of the matrix $\Psi$ is given by:

$$\hat\psi_{r,s} = (1/n) \sum_i \sum_j q_{ir}\, q_{js}\, \hat u_i\, \hat u_j\, K(d_{ij}/d), \tag{13}$$

where the subscripts refer to the individual elements of the matrix $Q$ and residual vector $\hat u$, and $K$ is a kernel function. [17] In the case of OLS, $Q$ is replaced by $X$, the matrix of observations on the explanatory variables.

[15] For a recent discussion of technical aspects associated with heteroskedastic robust estimation in spatial models, see Kelejian and Prucha (2006b) and Lin and Lee (2005).

[16] The origins of this approach can be found in Hall and Patil (1994).

The kernel function $K(\cdot)$ determines which pairs $i, j$ are included in the cross products in (13). The kernel function is a real, continuous and symmetric function that is bounded and integrates to one, similar to a probability density function. [18] In the current context, the kernel is formulated as $K(d_{ij}/d)$, where $d_{ij}$ is the distance between $i$ and $j$, and $d$ is the bandwidth, such that $K(d_{ij}/d) = 0$ for $d_{ij} \geq d$. In our application, we use three different kernel functions: the triangular or Bartlett kernel, with $K(z) = 1 - z$ (with $z = d_{ij}/d$), the Epanechnikov kernel, with $K(z) = 1 - z^2$, and the bisquare kernel, with $K(z) = (1 - z^2)^2$. Note that for each of these $K = 1$ for $d_{ij} = 0$. We implement this using a variable bandwidth, based on the distances to the 40 nearest neighbors. Using the estimates for $\Psi$ from (13), the HAC variance for the S2SLS estimates is obtained as:

$$\mathrm{AsyVar}[\hat\gamma^{HAC}_{S2SLS}] = (Z_q'Z_q)^{-1} Z'Q(Q'Q)^{-1}\, \hat\Psi\, (Q'Q)^{-1}Q'Z\, (Z_q'Z_q)^{-1}, \tag{14}$$

with $Z_q'Z_q = Z'Q(Q'Q)^{-1}Q'Z$.

One final methodological note pertains to the assessment of model fit. In spatial models, the use of the standard $R^2$ measure is not appropriate (see Anselin 1988, Chap. 14). In order to provide for an informal comparison of the fit of the various specifications, we report a pseudo-$R^2$ measure, computed as the ratio of the variance of the predicted value to the variance of the observed values. In the classical linear regression model, this is equivalent to the $R^2$, but in the spatial models this measure should be interpreted with caution. In the spatial lag model, the spatially lagged dependent variable $Wy$ is endogenous.
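The kernel estimator in (13) and the HAC variance in (14) can be illustrated with a short NumPy sketch. Several simplifications are assumed here and are not taken from the paper: a single fixed bandwidth replaces the variable 40-nearest-neighbor bandwidth, distances are Euclidean and held in a dense $n \times n$ array (feasible only for small samples), and the function names `kernel_weights` and `hac_variance` are hypothetical. The factor $(1/n)$ in (13) is omitted, in line with the remark that it cancels in the final variance expression.

```python
import numpy as np

# The three kernels named in the text, as functions of z = d_ij / d.
KERNELS = {
    "triangular": lambda z: 1.0 - z,
    "epanechnikov": lambda z: 1.0 - z ** 2,
    "bisquare": lambda z: (1.0 - z ** 2) ** 2,
}

def kernel_weights(coords, bandwidth, kernel="triangular"):
    """Dense matrix of K(d_ij / d), set to zero whenever d_ij >= d.

    Assumption: one fixed bandwidth for all observations.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    z = dist / bandwidth
    return np.where(z < 1.0, KERNELS[kernel](z), 0.0)

def hac_variance(Z, Q, resid, K):
    """HAC variance following (14), with Psi built from (13) without the 1/n factor."""
    QtQ_inv = np.linalg.inv(Q.T @ Q)
    ZtQ = Z.T @ Q
    A_inv = np.linalg.inv(ZtQ @ QtQ_inv @ ZtQ.T)  # (Z_q'Z_q)^{-1}
    Qu = Q * resid[:, None]                       # rows q_i * u_i
    Psi = Qu.T @ K @ Qu                           # sum_i sum_j q_ir q_js u_i u_j K_ij
    B = ZtQ @ QtQ_inv                             # Z'Q(Q'Q)^{-1}
    return A_inv @ B @ Psi @ B.T @ A_inv
```

When the bandwidth is so small that only the diagonal of the kernel matrix is non-zero, the cross products between different observations drop out and the expression collapses to a heteroskedasticity-robust sandwich, which is a convenient sanity check.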
For the pseudo-$R^2$, we therefore obtain the predicted value from the expression for the conditional expectation of the reduced form:

$$\hat y = E[y \mid X] = (I - \hat\rho W)^{-1} X \hat\beta. \tag{15}$$

This operation requires the inverse of a matrix of dimension $n \times n$, which we approximate by means of a power method, accurate to six decimal places.

[17] In practice, the term $(1/n)$ cancels out in the final expression for the variance matrix in (14). We include it here to be consistent with the notation in Kelejian and Prucha (2006a).

[18] See, among others, Härdle (1990, Chap. 3), Andrews (1991), Simonoff (1996, Chap. 3), and Cameron and Trivedi (2005).

4 Estimation results

We begin the review of our empirical results by focusing on the coefficients obtained using the four estimation methods under consideration: OLS, IV (standard nonspatial 2SLS with the pollutants treated as endogenous), LAG (spatial 2SLS with a spatially lagged dependent variable), and LAG-end (spatial 2SLS with a spatially lagged dependent variable and the pollutants treated as endogenous). We separate the results into those for the traditional hedonic variables, reported in Table 3, and those for the pollutant coefficients, reported in Table 4 together with some model diagnostics. The tables only contain results for the queen spatial weights (used to create the spatially lagged dependent variable). The complete set of estimates for all three spatial weights is given in the Appendix.

Table 3 Coefficient estimates: traditional hedonic variables, queen weights. Columns: OLS, IV, LAG, LAG-end. Rows: Constant, Landarea, Livarea, Elevation, Baths, Fireplace, Pool, Age, Age, AC, Heat, Beach, Distance, Parks, Highway, Highway, Travel time, Poverty, White, Over, College, Income, Vcrime, API, Riverside, San Bern, Orange, R² (var ratio). Table notes flag coefficients that are not significant and those significant at 5%. [numeric entries not recoverable]

Table 4 Pollutant coefficients by estimator, queen weights. Columns: OLS, IV, LAG, LAG-end. Rows: OZ, TSP, ρ, RLM-LAG, RLM-ERR, DWH, A-K. [numeric entries not recoverable, except the value 2,540 in the DWH row]

First, consider the OLS results. Overall, the coefficients of the house characteristics are significant and of the expected sign, in accordance with earlier findings in the literature. The only exception is relative elevation, which was not found to be significant. House prices increase as both land and living area increase. Similarly, houses with more bathrooms, fireplaces, as well as with AC and heating systems are valued more highly. As the literature suggests (see among others Bourassa et al. 1999; Beron et al. 2004), there appears to be a quadratic relationship between age and price: prices are higher for more recently built houses. There is also a vintage effect of age on prices that is reflected in the positive sign of the quadratic term.

In terms of access variables, there is a significant premium for houses that are located closer to the beach and closer to parks, but the effect of the immediate vicinity to the highway is that of a nuisance. Location in a zone [...] km from the highway is not significant (for OLS; it is positive and becomes significant at p < 0.05 in the spatial models). The results for the neighborhood variables are also in accordance with conventional wisdom: travel time and crime are negatively valued, whereas % white, the proportion of college graduates and median income have a positive effect. Poverty and the school quality score were not found to be significant. The percentage elderly is positive, but this finding is not stable across estimators (see below). Los Angeles county was used as the base case, which resulted in a negative value for the dummy variables for Riverside and San Bernardino, but no significant difference for Orange county.

The overall fit is very satisfactory, with an $R^2$ of [...]. However, as the model diagnostics indicate (Table 4), OLS suffers from a number of problems. First, the Durbin-Wu-Hausman test statistic for endogeneity strongly rejects the null hypothesis that the interpolated pollutants are exogenous.
In addition, there is evidence of very high residual spatial autocorrelation, with the robust LM test statistic suggesting the lag specification as the proper alternative.

We next consider the effect on the estimates for the traditional hedonic variables of treating the pollutants as endogenous (column IV in Table 3), including a spatially lagged dependent variable (column LAG), and combining both spatial lag and endogeneity of the pollutants (column LAG-end). Note that the A-K test for residual spatial autocorrelation also rejected the null for all three non-OLS cases, even after a spatially lagged dependent variable was included. The latter is highly significant, with estimates for the spatial autoregressive coefficient around 0.3. The A-K test points to the need to account for remaining spatial error autocorrelation through the HAC approach. The most appropriate specification is therefore the LAG-end with HAC variance estimates. The other results are provided to assess the effect of addressing endogeneity and spatial effects in isolation versus in combination.

For the individual house characteristics and accessibility variables, the estimated coefficients remain fairly stable across methods, with only marginal changes. The estimates obtained with LAG-end are slightly smaller in absolute value, but the significance levels remain the same. This is not the case for the estimates of the neighborhood characteristics. These vary considerably across methods, both in magnitude and in significance. For example, Poverty, which is not significant for OLS, IV and LAG, becomes significant and negative in the LAG-end model. In the reverse direction, the % elderly, which is significant in OLS, gradually loses significance (significant only at p < 0.05 for IV and LAG) to become insignificant in LAG-end. The absolute value of the coefficients for Income, College and Vcrime in LAG-end is less than half the magnitude for OLS. These variables are measured at an aggregate scale (census tract or block group, or city for the crime variable) and therefore the disturbances from the model may be correlated within the aggregation groups (Moulton 1990). It is likely that houses in the same census tract share unobservable characteristics leading to correlation in the error terms. We surmise that the inclusion of a spatially lagged dependent variable filters out some of this error and yields more accurate estimates.

The pollution variables are similarly affected by the estimation method. The coefficients of both Ozone and TSP are negative and highly significant throughout. However, their absolute value varies considerably across methods. Taken individually, the effect of controlling for endogeneity seems to be strongest, resulting in a change between OLS and IV of [...] to [...] for Ozone, and of [...] to [...] for TSP. Between OLS and LAG, the change is much smaller.
In LAG-end, accounting for both the spatial effects and the endogeneity yields a coefficient of [...] for Ozone and [...] for TSP. This suggests that a reduction of 1 ppb in OZ levels would raise house prices by 0.99% and a decrease of 1 µg/m³ in TSP values would increase house values by 0.73%.

Since the joint consideration of spatial effects and endogeneity is new in the current paper, there are no results available in the literature to compare our findings to directly. However, our OLS estimates are in line with previously published results. For example, in a meta-analysis of 37 studies, Smith and Huang (1995) suggest that a decrease of 1 µg/m³ in the TSP values will result in an increase of house values ranging between 0.05 and 0.10%. Using an IV estimator, Chay and Greenstone (2005) estimate that a change of 1 µg/m³ will produce a [...]% change in house prices in the opposite direction. These estimates are considerably lower than those obtained in the current study, but it is important to keep in mind that their results are obtained for county aggregates. The OLS results in Beron et al. (2001) suggest that a decrease of one ppb of OZ would produce an increase in house prices ranging from 2.3 to 7.1%, which is consistent with our OLS estimates. Relative to OLS, when accounting for both endogeneity and spatial autocorrelation in the LAG-end model, the effect of ozone on house prices appears to be significantly smaller in absolute terms, while the effect of TSP is larger in absolute value.

As shown in Table 4, the A-K test in the LAG-end model still shows significant remaining spatial error autocorrelation. We assess the effect of this on the precision of the estimates for both pollutants by computing three sets of standard errors: classical, White (heteroskedastic consistent), and HAC. The results are reported in Tables 5 and 6, for the three spatial weights matrices and three kernel functions.

Table 5 Standard errors: OZ. Columns: Coeff., followed by standard errors: Classical, White, HAC-Ep, HAC-Tr, HAC-Bi. Rows: OLS, IV, then LAG and LAG-end for the queen weights and for each of the two Knn weights. [numeric entries not recoverable]

Table 6 Standard errors: TSP. Columns: Coeff., followed by standard errors: Classical, White, HAC-Ep, HAC-Tr, HAC-Bi. Rows: OLS, IV, then LAG and LAG-end for the queen weights and for each of the two Knn weights. [numeric entries not recoverable]

The estimates for the pollution variables are essentially the same across the three spatial weights, with only a slight difference for ozone. However, accounting for remaining heteroskedasticity and spatial error correlation has a dramatic effect on the precision of the estimates. The HAC standard errors are up to twice as large as the classical and White results, with consistently the largest value for the Epanechnikov kernel. By and large, the numerical values are essentially the same across kernels and spatial weights, which provides some evidence of the robustness of our findings. The more realistic measure of the standard errors of the estimates will be important in assessing the precision of the derived welfare measures, such as the MWTP, to which we turn next.

5 Policy analysis

We conclude this empirical exercise by comparing the valuation of air quality computed from the parameter estimates obtained by the alternative methods. In a hedonic
