ISSUES IN SEMI-PARAMETRIC REGRESSION WITH APPLICATIONS IN TIME SERIES MODELS FOR AIR POLLUTION AND MORTALITY

Size: px

Start display at page:

Download "ISSUES IN SEMI-PARAMETRIC REGRESSION WITH APPLICATIONS IN TIME SERIES MODELS FOR AIR POLLUTION AND MORTALITY"

Roger Hart
5 years ago
Views:

1 ISSUES IN SEMI-PARAMETRIC REGRESSION WITH APPLICATIONS IN TIME SERIES MODELS FOR AIR POLLUTION AND MORTALITY Francesca Dominici, Aidan McDermott, and Trevor Hastie February 4, 2003 Abstract In 2002, methodological issues around time series analyses of air pollution and health attracted the attention of statisticians, epidemiologists, the press, industry, and policy makers. As the Environmental Protection Agency (EPA) was finalizing its review of the evidence on particulate air pollution, statisticians and epidemiologists found that the S-Plus implementation of Generalized Additive Models (GAM), the most common statistical approach for time series analysis of air pollution and health, can overestimate effects of air pollution and understate statistical uncertainty. This discovery delayed the process for the Particulate Matter (PM) Criteria Document prepared as part of the review of the National Ambient Air Quality Standard (NAAQS), as the time-series findings were a critical component of the evidence. In addition, it raised concerns about the adequacy of current model formulations and their software implementations. Time-series data on air pollution and mortality are analyzed by using semi-parametric regression models; where the linear component (corresponding to the air pollution effects) is of interest, and the smooth component (smooth functions of time and temperature to control for confounding), is a nuisance. In this paper we provide two methodological results in semi-parametric regression directly relevant to risk estimation in time series studies of air pollution. First, we introduce a closed form estimate of the asymptotically exact covariance matrix of the linear component of a GAM. To ease the implementation of these calculations, we develop the package gam.exact, an extended version of gam. Use of gam.exact allows a more robust assessment of uncertainty of the air pollution effects. Second, we introduce a semi-automated procedure for selecting the degrees of freedom in the smooth component which minimizes the mean squared error of the regression coefficients corresponding to the linear component. We justify our approach by theoretical arguments and by a simulationbased validation study. In time series analyses of air pollution and health, our semi-automated procedure provides a diagnostic tool for choosing the degree of adjustment for time-varying factors that might confound the pollution-mortality relationship. Finally, we apply our methods to the data for National Mortality Morbidity Air Pollution Study (NMMAPS), which includes time series data from the 90 largest US locations for the period We report national-level estimates of the health effects of air pollution, and review their sensitivity to model choice.

2 Key Words: Semiparametric regression, time-series, air pollution, GAM. Acknowledgments Research described in this article was partially supported by a contract and grant from Health Effects Institute (HEI), an organization jointly funded by the Environmental Protection Agency (EPA R824835) and automotive manufacturers. The contents of this article do not necessarily reflect the views and policies of HEI, nor do they necessarily reflect the views and policies of EPA, or motor vehicles or engine manufacturers. Funding for Francesca Dominici was provided by a grant from the Health Effects Institute (Walter A. Rosenblith New Investigator Award). Research described within was also supported by a grant from the National Institute of Environmental Health Sciences for the Johns Hopkins Center in Urban Environmental Health (P30ES0389-2). We would like to thank Drs Scott L. Zeger, Jonathan M. Samet, and Giovanni Parmigiani for comments. 2

3 Introduction Evidence from time series studies of air pollution and health has been central to major policy decisions in the United States and elsewhere in developing standards to protect the public against adverse effects of air pollution. Time series studies of air pollution and health estimate associations between day-to-day variations in air pollution concentrations and day-to-day variations in adverse health outcomes, contributing epidemiological evidence useful for evaluating the risks of current levels of air pollution (Schwartz and Marcus, 990; Lipfert and Wyzga, 993; Pope et al., 995; Schwartz, 995; American Thoracic Society, 996a,b; Katsouyanni et al., 997; Samet et al., 2000a; Clancy et al., 2002; Lee et al., 2002; Stieb et al., 2002; Goldberg et al., 2003). Interpretation of short-term associations between air pollution and health is complicated by the social, political, and regulatory context which looks to scientific evidence for guidance in selecting among policy options. Under the Clean Air Act (Environmental Protection Agency, 970), the Environmental Protection Agency sets the National Ambient Air Quality Standards (NAAQS) for six air pollutants to protect the public health (Environmental Protection Agency, 996, 200). Such standards are reviewed periodically based on the accumulated scientific evidence. Balancing a series of health effects, including hospitalization and death, against the costs of further controls creates a difficult and sensitive policy debate. In the last 0 years, many advances have been made in the statistical modelling of time series of air pollution. Standard regression methods used initially have been almost fully replaced by semi-parametric approaches (Speckman, 988; Hastie and Tibshirani, 990; Green and Silverman, 994). These methods allow more flexibility in the adjustment for confounding factors by including non-parametric functions of time and weather variables in the regression model. One widely used approach for a time series analysis of air pollution and health involves a semi-parametric Poisson regression with daily mortality or morbidity counts as the outcome, linear terms measuring the percentage increase in the mortality/morbidity associated with elevations in air pollution levels (the relative rates βs), and smooth functions of time and weather variables to adjust for the timevarying confounders. The nature and characteristics of time series data make risk estimation challenging, requiring complex statistical methods able to detect effects that can be small relative to the combined effect of other time-varying covariates. More specifically, the association between air pollution and mortality/morbidity can be confounded by weather and by seasonal fluctuations in health outcomes due to influenza epidemics, and to other unknown slow-varying factors (Schwartz et al., 996; Katsouyanni et al., 996; Samet et al., 997). Thus, smooth functions of temperature and time are included into the regression model to control for fluctuations in the mortality/morbidity 3

4 time series that are not associated with air pollution; residual variation is then used to estimate the air pollution effect. Generalized linear models (GLM) with regression splines (McCullagh and Nelder, 989) and Generalized additive models (GAM) with non-parametric splines (Hastie and Tibshirani, 990) have become the methods of choice. During the last few years, GAM with non-parametric splines was preferred to fully parametric formulations because of the increased flexibility in estimating the smooth component of the model. While GAM has been the preferred method, recent reports have questioned the adequacy of its use for studies on air pollution and health. The original default parameters of the gam functions in Splus (Versions 3.4 and 5.) were found inadequate to guarantee the convergence of the backfitting algorithm (Dominici et al., 2002b). The defaults have recently been revised in the Splus Version 6., but they may be still too lax for these applications. In addition, the S-plus function gam calculates the standard errors of the linear terms (the air pollution effects) by effectively assuming that the smooth component of the model is linear, resulting in an underestimation of uncertainty (Chambers and Hastie, 992; Ramsay et al., 2002; Klein et al., 2002). Convergence of the backfitting algorithm and underestimation of standard errors in GAM depend critically on the number of degrees of freedom in the smooth component of the model (df ). In timeseries studies of air pollution and health, more strict control for potential confounding factors leads to larger df, which increases the correlation between the predictors in the model and makes the linear approximation of the smooth component of the model inadequate. Thus, the use of several highly flexible smooth functions of time-varying covariates to adjust for confounding reduces bias in the relative rate estimates, but can also slow convergence and underestimate standard errors substantially. The degree of adjustment for factors, controlled by df is another critical aspect of time series analyses of air pollution and health. Ideally, we would like to choose df large enough to remove confounding entirely, but not large enough to eliminate all the residual variation in the data, thus inflating the statistical variance of ˆβ. In absence of strong biological hypotheses, the choice of df has been based on expert judgment on the time-scales of variations in the mortality time series where confounding is more likely to occur (Kelsall et al., 997; Dominici et al., 2000), or via optimality criteria, such as minimum prediction error (based on the Akaike Information Criteria), or by visual inspection of the partial autocorrelation function of the residuals (Toulomi et al., 997; Burnett et al., 200). This paper is motivated by the pressing need for methodological developments in semi-parametric regression to better handle the general complexity of air pollution risk estimation. In section 2, 4

5 we detail how to calculate the asymptotically exact standard error of ˆβ in GAM. We implement these calculations in the new S-plus software function gam.exact. In section 3, we provide the relationship between asymptotic bias of ˆβ and degrees of freedom in the smooth component of the model (df ), and we introduce a semi-automated procedure for selecting df that minimizes the mean squared error of ˆβ. In section 4 we apply our methods to the data from National Mortality Morbidity Air Pollution Study (NMMAPS) (Samet et al., 2000c,b) and calculate national average estimates of the air pollution effects as a function of the degree of adjustment for confounding factors. Strengths and limitations of our approaches are discussed in section 5. 2 Statistical Model Semi-parametric model specifications for time-series analyses of air pollution and health have been extensively discussed in the literature (Burnett and Krewski, 994; Kelsall et al., 997; Katsouyanni et al., 997; Dominici et al., 2000; Zanobetti et al., 2000; Schwartz, 2000) and are briefly reviewed here. Data consist of daily mortality or morbidity counts (y t ), daily levels of one or more air pollution variables (x t..., x Jt ), and additional time-varying covariates (u t..., u Lt ) to control for slowvarying confounding effects such as season and weather. Regression coefficients are estimated by fitting the following semi-parametric model: y t Poisson(µ t ), t =,..., T var(y t ) = φµ t log µ t = β 0 + J j= β jx jt + L l= f l(u lt, df l ) = η t. () In our application, β j s describe the percentage increase in mortality/morbidity per unit increases in ambient air pollution levels x jt. The functions f(, df l ) denote smooth functions of calendar time, temperature, and humidity, often constructed using smoothing splines, loess smoothers, or natural cubic splines with smoothing parameters df l s; φ is the over-dispersion parameter. Here we assume that the f s are modelled using smoothing splines (GAM) or using natural cubic splines (GLM). 5

6 3 Asymptotically Exact Standard Errors in GAM In this section we develop an explicit expression for the asymptotically exact (a.e.) statistical covariance matrix of the vector of the regression coefficients β = [β,..., β J ] corresponding to the linear component of model () when f are modelled using smoothing splines and a GAM is used. Note that when f s are modelled using regression splines (such as natural cubic splines), the fully parametric model () is fitted by using Iteratively Re-weighted Least Squares (IRLS), and asymptotically exact standard errors are returned by the S-plus function glm. Estimation in GAM is based on a combination of the local scoring algorithm and backfitting algorithm (Friedman and Stuetzle, 98; Buja et al., 989; Hastie and Tibshirani, 990). The local scoring algorithm is a generalization of the Fisher scoring procedure for finding maximum likelihood estimates in GLMs (Nelder and Wedderburn, 972; McCullagh and Nelder, 989). The backfitting algorithm is suitable for fitting any additive model, and in GAM is used within the local scoring iteration when several smooth functions are included in the model. An explicit expression for the a.e. covariance matrix of β can be obtained from the closed form solution for β (Hastie and Tibshirani, 990, page 54), and: β = Hz where : H = { X t W (I S)X } X t W (I S) (2). X is the T J design matrix with columns x j = [x j,..., x jt ] t ; 2. z is the working response from the final iteration of the IRLS algorithm (McCullagh and Nelder, 989) defined as z t = ˆη t + (y t ˆµ t )/ˆµ t ; 3. W is diagonal in the final IRLS weights; and 4. S is the T T operator matrix that fits the additive model involving the smooth terms in (). Notice that here we have put all the additive smooth terms L l= f l(u lt, df l ) together, and S represents the operator for computing this additive fit. As such, S represents a backfitting algorithm on just these terms. From (2) and the usual asymptotics we find that var( β) = H t W H where W = ĉov(z). Because calculation of the operator matrix S can be computationally expensive, the current version of the S- plus function gam approximates var( β) by effectively assuming that the smooth component of model 6

7 () is linear. That is var( β) is approximated by the appropriate submatrix of (X t augw X aug ), where X aug is the design matrix of model () augmented by the predictors used in the smooth component of the model, i.e. X aug = [x,..., x J, u,..., u L ] t (Hastie and Tibshirani, 990; Chambers and Hastie, 992). In time series studies of air pollution and mortality, the assumption of linearity of the smooth component of model () is inadequate, resulting in underestimation of the standard error of the air pollution effects (Ramsay et al., 2002; Klein et al., 2002). The degree of underestimation will tend to increase with the number of degrees of freedom used in the smoothing splines, because a larger number of non-linear terms is ignored in the calculations. However, if S is a symmetric operator matrix, then H can be re-defined as: H = { X t (W X W SX) } (W X W SX) t. (3) (Symmetry in this case is with respect to a W weighted inner product, and implies that W S = S t W ; weighted smoothing splines are symmetric, as are weighted additive model operators that use weighted smoothing splines as building blocks.) Hence the expensive part of the calculation of var( β) involves the calculation of the T J matrix SX, having as column j the fitted vector resulting from fitting the (weighted) additive model L l= f l(u lt, λ l ) to a response x j. We now provide details on the calculation of z, W and SX: step. fit the semi-parametric model () using gam and extract the weights w, the actual degrees of freedom used in the backfitting λ l. The weights w are the diagonal elements of the matrix W ; step 2. smooth each column of X with respect to L l= f l(u lt, λ l ), by using a gam with identity link and weights w. The columns of SX are the corresponding fitted values. Steps and 2 are implemented in our S-plus function gam.exact, which returns the a.e. covariance matrix of β for any GAM. The software is available at Although we have detailed the standard error calculations for a semi-parametric model with log link and Poisson error, these calculations can be generalized for the entire class of link functions for GLM by calculating z t = ˆη t + (y t ˆµ t ) ˆη t ˆµ t (Nelder and Wedderburn, 972) in step 2. In the simpler case of a Gaussian regression, the asymptotic covariance matrix var( β) can be obtained by setting w = and z t = y t. Details of these calculations in this case have been discussed by Durban et al. (999). the actual degrees of freedom may differ slightly from those requested in the call to gam, as a consequence of the changing weights in the IRLS algorithm 7

8 4 Understanding bias in semi-parametric regression One important issue in time series studies of air pollution and mortality is the selection of the number of degrees of freedom (df ) to be used in representing the smooth component of model (). Ideally, we would like to choose df large enough to eliminate bias in the relative rate estimate ˆβ. However, if df is too large we may remove the bias but inflate the variance of ˆβ. In this section we examine the form of this bias, as we vary the complexity of our representation for f, and we illustrate a semi-automated procedure to select df. To illustrate the relationship between the bias and variance of ˆβ and df, we consider a simple additive model of the following form: y t = βx t + f(t) + ɛ t, ɛ t N(0, σ 2 ). (4) Suppose that the dependence between x t and time is described by: x t = g(t) + ξ t, ξ t N(0, σξ 2 ). (5) Suppose also that f(t) here representing a true and unknown relationship can be represented by a basis expansion: r f(t) = h l (t)δ l, l= or in vector notation f(t) = h t (t)δ. To study the bias-variance tradeoff, we will assume that g(t) = h (t)γ, where h (t) is a subset of q < r of the basis functions in h(t) = (h (t), h 2 (t)). For a given set of T time points, we can represent the vector of function values by f = Hδ, where H is an T r basis matrix. Without loss of generality we assume that H t H = T I. We are therefore assuming that the h l (t) are mutually orthogonal, and are size-standardized. The factor T is needed in asymptotic arguments below, and is realistic in the following sense. Suppose f and g, and hence each of the h l, are periodic (with a period of a year). We standardize them so that Year h2 l (t)dt =, or 365 t= h l(t) 2 /365 =. Then the sum-of-squares over m years of data will be T = 365 m. We now examine the bias and variance of ˆβ as we vary the complexity of our representation for f when we fit x and t jointly in a regression model. We consider three scenarios: we model f perfectly by using sufficient df to fully represent the relationship between y t and t, e.g. ˆf(t) = h(t)ˆδ; 8

9 we model f imperfectly, by using sufficient df to fully represent the relationship between x t and t: ˆf(t) = h (t)ˆδ ; we do not model f(t). Model f perfectly, using h(t) In this case, the Least Square (LS) estimate ˆβ is unbiased for β, (E ˆβ = β), and Var ˆβ = σ 2 x (HH t /T )x 2 (6) x (HH t /T )x is the residual vector when we regress x on H. The richer our model for f, the smaller this residual vector, and hence the larger the variance of ˆβ. Under our model (5) for x t, the denominator is distributed as σ 2 ξ χ2 T r, with mean value (T r)σ2 ξ. Model f imperfectly, using h (t) Simple calculations show in this case that E[ ˆβ x t, t] β = = ξ t H 2 δ 2 x (H H t /T )x 2 ξ t H 2 δ 2 ξ t (I H H t /T )ξ This bias term, which depends only on the random errors ξ t in (5), is O p (/ T ) and hence is asymptotically negligible. Therefore, even though we do not model f perfectly, we include sufficient df to eliminate the bias. The variance of ˆβ in this case is Var ˆβ σ = 2 x (H H t /T )x 2 σ = 2 (7) ξ t (I H H t/t )ξ Comparing (7) with (6), we see that the variance here is smaller (the denominator is σ 2 ξ χ2 T q ). Notice that the variance is O p (/T ), as is the squared bias, so that the mean-squared-error might be a better basis for the bias variance trade off. 9

10 Ignoring f(t) Here we have E[ ˆβ x t, t] β = Tγt δ x 2 + ξt Hδ x 2 Since x 2 /T = O p (), the first term does not disappear asymptotically. The second term does as before, as does the variance Var( ˆβ) = σ 2 / x 2. In summary, if we represent f(t) with enough bases to model the relationship between x t and t, then the squared bias term is asymptotically zero, and fades away at the same rate as the variance. Selecting the df beyond this threshold trades off bias against variance. If we do not cover the dependence of x t on t, a systematic component to the bias persists. The results presented in this section assume that f(t) and g(t) are modelled by the use of orthogonal basis functions, as for example, regression splines. Similar results when f(t) and g(t) are modelled by use of kernel smoothers are discussed in Green et al. (985) and Speckman (988). For smoothing splines, the analysis is complicated by the fact that all components of functions f(t) and g(t) (a part from the linear components), are modelled with bias. These biases depend on the complexity (roughness) of the component, and the df used, and will disappear asymptotically if df grows appropriately (Green and Silverman, 994). 4. Selecting sufficient degrees-of-freedom via bootstrap In this section we present a bootstrap-based procedure to estimate df that minimizes the mean squared error of ˆβ when the the exact forms of f and g in (4) and (5) are unknown. The computational steps are illustrated below:. choose df max in (4) so that f(t) is well modelled in his spline representation, for example, by using generalized cross-validation (GCV) methods (Hastie and Tibshirani, 990; Hastie et al., 993). 2. estimate the full model (4) with df = df max and calculate the unbiased estimate ˆβ dfmax. 3. for b =,..., B: sample y b t from the fitted model (4) with df = df max ; for df =, 2,..., df max, estimate ˆβ b df by fitting the model yb t = β df x t + s(t, df) + ɛ t where s(t, df ) = df l= h l(t)δ l ; 0

11 4. calculate MSE( ˆβ df ) = B b b= ( ˆβ b df ˆβ dfmax ) 2 ; 5. choose the df that minimizes MSE( ˆβ df ). To illustrate our procedure we implement a simulation study. We first estimate the true MSE by generating data sets (x m t, yt m ), for m =,..., 500 from models (4) and (5) with known parameters and known f(t) and g(t) for t =, (we considered two years of data). Figure (top left and right) shows the known f(t) and g(t) (red curves) and bottom left shows the true MSE as functions of df (red points). The true MSE is minimized near df = 5. Black lines in Figure (top left and right) shows the estimated ˆf(t) and ĝ(t) obtained by modelling f(t) and g(t) as natural cubic splines with 5 degrees of freedom in (4) and (5), respectively. Thus df = 5 leads to an inaccurate estimate of f(t), but to a better estimate of g(t). This further illustrates the point made in the previous section: to optimize inferential properties of ˆβ, it is not necessary to model f(t) accurately. To validate our procedure, we apply steps to 4 to each data set (x m t, yt m ) sampled from the true models, and we estimate MSE m ( ˆβ df ) and other model selection measures for m =,..., 500. Figure (bottom left) shows the boxplots of MSE m ( ˆβ df ) for each df. Note that there is a very good correspondence between the true MSE and the distribution of the estimated MSE across data sets. The triangle is placed at the df where the true and the estimated MSE are minimized. Figure (bottom right) compare four measures of model choices, that is the average MSE, AIC, BIC, and PACF across the 500 data sets as function of df. AIC is the Akaike information criteria (Akaike, 973). BIC is the Bayesian Information Criteria (Schwartz, 978) and it has a Bayesian interpretation since it may be viewed as an approximation to the posterior odds ratio. PACF is defined as the sum of the absolute values of the autocorrelation functions of the residuals. MSE and BIC are minimized at the same df, indicating that df = 5 is also the model with the highest posterior odds. Because AIC has a smaller penalty in the dimension of the model, it dominates MSE for any df and it is minimized at df = 0. PACF is a compromise between the MSE and the AIC and it is minimized at df = 6. Note that, to achieve a good prediction, it is necessary to use larger df to adequately model f(t). 5 NMMAPS Data Analysis In this section, we apply our methods to the NMMAPS data base which is comprised of daily time series of air pollution levels, weather variables, and mortality counts for the largest 90 cities in the US from 987 to 994. A full description of the NMMAPS data base is detailed in Samet et al. (2000b).

12 First, we extend results of section 3. so that several smooth functions are included in the model. Here we define a class of semi-parametric models which allows the degree of adjustment for confounding factors to vary as the function of only one calibration parameter α. We then apply our procedure for estimating α to four NMMAPS cities with daily data available and compare it with alternative approaches in the literature. Second, we extend modelling approaches in a hierarchical fashion, and we estimate national average air pollution effects as a function of α. Details of the two data analyses are below. Bootstrap-based procedure for df-selection in four NMMAPS cities To apply our approach to the class of semi-parametric regression models defined in equation (), we assume df l = α λ l, where α is unknown and λ l is pre-specified. Thus the parameter α measures the degree of adjustment for confounding factors with respect to a baseline choice λ l. To illustrate our methods, we use the following simplified version of the NMMAPS core model (Dominici et al., 2000, 2002c): y t Poisson(µ t ), var(y t ) = φµ t log µ t = β 0 (α) + β(α)p M 0t + s (t, df ) + s 2 (temp t, df 2 ) = β 0 (α) + β(α)p M 0t + s (t, λ α) + s 2 (temp t, λ 2 α) (8) where y t is the daily number of deaths, φ is the over-dispersion parameter, P M 0t is the daily level of particulate matter less than 0 microns, temp is the temperature, and t =,..., days. We assume λ = 56 (i.e. 7 degrees of freedom per year), λ 2 = 6, α to be 25 equally-spaced points between α min and α max, and s to be natural cubic splines. Justification for the selection of λ and λ 2 as baseline degrees of freedom to control for longer-term trends, seasonality and weather can be found in Samet et al. (995,997,2000a), Kelsall et al. (997), and Dominici et al. (2000b). First, to identify α min and α max, we estimate (df, df 2 ) in the smooth functions of time and temperature that best predict P M 0. Here we use generalized cross-validation (GCV) methods (Hastie and Tibshirani, 990; Hastie et al., 993). Cities with larger df s have a stronger relationship between P M 0 and the other time-varying confounders, and therefore might require more flexible s and s 2 to remove confounding. The estimated ( df, df 2 ) are summarized in Table. Note that in Seattle larger (df, df 2 ) are needed than in the other cities. This is because P M 0 levels in Seattle are higher in the winter and lower in the summer, thus inducing a negative correlation between P M 0 and weather. We then use ( df, df 2 ) as guidance to identify a value of α large enough to fully model the relationship between P M 0 and time and temperature. In Pittsburgh, Minneapolis, and Chicago, we set α max = 3, in Seattle we set α max = 6, and in all the cities we assume α min = /α max. 2

13 Within each city, we apply our procedure illustrated in section 3. to the four cities as follows:. Fit model (8) with α = α max using a quasi-likelihood approach (McCullagh and Nelder, 989) and obtain an unbiased estimate of β, here denoted by β(α max ), and a robust variance v(α max ); 2. For b =,..., 500, generate mortality time series from the fitted model (8) with α = α max ; 3. For each α in [α min, α max ], re-fit the data by using a quasi-likelihood approach and obtain β b (α) and a robust variance v b (α); 4. Calculate MSE( β(α)) = B B b= ( β b (α) β(α max )) 2. For each value of α, and for each bootstrap iteration b, we also estimate the goodness of fit and the degree of serial correlation in model (8) by calculating:. the Akaike Information Criteria AIC(α) = deviance + 2 ˆφ p, where ˆφ is the estimated over-dispersion parameter, p is the number of parameters in the model; and 2. the sum of the absolute values of the partial autocorrelation function of the residuals (PACF). Choosing df that minimize AIC or PACF are strategies routinely used in air pollution literature research (Toulomi et al., 997; Burnett et al., 200). Figure 2 (left panels) shows boxplots of ˆβ b (α), b =,..., 500 as function of α. Red and black horizontal lines are placed at ˆβ(α max ) and at 0, respectively. As expected, as α increases, bias diminishes in all cities. More specifically, in Pittsburgh, Minneapolis, Chicago, and Seattle, bias is removed for values of α larger 2.7,, 2, and, respectively. Because the statistical variance of ˆβ(α) increases as α increases, an optimal value of α for the bias/variance tradeoff can be identified by estimating the MSE. The estimated MSE, AIC, and PACF as functions of α are shown in the right panels of Figure 2. To ease the comparison we have normalized these estimates to vary in [0,]. In Pittsburgh, Minneapolis, and Chicago, the MSE plot is almost dominated by bias and level off for values of α larger than 2.,, and.3, respectively. The PACF and the AIC always dominate the MSE, indicating that large values of α would be needed to eliminate correlation in the residuals or to adequately model the smooth functions of time and temperature to optimize prediction. In Minneapolis, the PACF is minimized at, and rises for α >, indicating over-fitting and negative autocorrelation of the residuals. Note that a choice of α that minimizes the MSE still leaves autocorrelation in 3

14 the residuals (PACF is quite large at these values), which is accounted for using robust variance estimation. In Seattle, the MSE, AIC and PACF are quite different and non-monotone. The MSE is minimized at α = where the bias is small and then starts to rise because of the increase in variance, and levels off for α > 3.8. The PACF has a local minimum at.9, rises again for α larger than.9, and it is minimized at α = 6 as the AIC. However, both choices (α = and α = 6) indicate no association between air pollution and mortality. Note that, although exploratory analyses in Seattle and Minneapolis suggested larger df, df 2 to best predict P M 0 as smooth functions of time and temperature, the inspection of the MSE indicated that a small α should be used in the model (8). This is probably because ˆβ(α max ) 0, and therefore very little bias arises from the relationship between P M 0 and time and temperature in the full model. Finally, point estimates and 95 % CI of ˆβ(α max ), and ˆβ(ˆα), where ˆα is the value of α that minimizes MSE, are summarized in Table. The two sets of estimates are similar because for α > ˆα bias is removed. However ˆβ(ˆα)s have narrower confidence intervals than ˆβ(α max )s because a more parsimonious model is used. Analysis of the NMMAPS data base The pollutant P M 0 is measured approximately every six days in the other NMMAPS locations, except for Pittsburgh, Minneapolis, Seattle, and Chicago, thus complicating the implementation of our bootstrap-based methodology. However, we can still extend the NMMAPS model described below in an hierarchical fashion and estimate national average air pollution effects as function of α. We consider the following semi-parametric model used in the NMMAPS analysis: log E[yt c ] = age-specific intercepts + β c (α)p M0t c + s(t, 7/year α)+ + s(temp t, 6 α) + s(dewpoint t, 3 α) + age s(t, 8 α) (9) where yt c is the daily number of deaths in city c, P M 0t is the daily level of particulate matter, temp and dew are the temperature and dewpoint temperature, and the age-specific intercepts correspond to the three age groups of younger than 65, between 65 and 75 and older than 75. Details of this modeling approach can be found elsewhere (Dominici et al., 2000; Samet et al., 2000c). Based upon the statistical analyses of the four cities with daily data and additional exploratory analyses, we set α to take on 25 equally spaced points varying from /3 to 3. As in the previous model formulation, 4

15 this choice allows the degree of adjustment for confounding factors to vary greatly. We then assume the following two-stage normal-normal hierarchical model: ˆβ c (α) N(β c (α), v c (α)) β (α) N(β (α), τ 2 (α)) (0) where β (α) and τ 2 (α) are the national average air pollution effects and the variance across cities of the true city-specific air pollution effects, both as a function of α. We fit model (0) by using a Bayesian approach, with a flat prior on β (α) and uniform prior on the shrinkage factor τ 2 (α)/ [ τ 2 (α) + v c (α) ] (Everson and Morris, 2000). Sensitivity of the national average estimates to the prior specification for τ 2 when α = has been explored elsewhere (Dominici et al., 2002a,d). To investigate sensitivity of the national average estimates to model choice, for each value of α, we estimate ˆβ c (α) and v c (α) using three methods: ) GAM with approximated standard errors (GAM-approx s.e.); 2) GAM with asymptotically exact standard errors (GAM-exact); 3) and GLM. The left top panel of Figure 3 shows the national average estimates (posterior means) as a function of α. Black, red, and blue dots denote estimates under GAM-approx s.e., GAM-exact, and GLM, respectively. The grey polygon represents 95% posterior intervals of the national average estimates under GAM-exact. The red segment is placed at α =, that is, the degree of adjustment used in the NMMAPS model (Dominici et al., 2000). The black curves at the top right panel denote the city-specific Bayesian estimates of the relative rates under GAM-exact. Figure 3 provides strong evidence for association between short-term exposure to P M 0 and mortality, which persists for different values of α. Consistent with the results for the four cities, national average estimates decrease as α increase, and level off for α larger than.2 with a very modest increase in posterior variance. However even when α = 3, the national average effect is estimated as to 0.2% increase in total mortality for 0 µg/m 3 increase in P M 0 (95% posterior interval 0.05 to 5). This picture also shows robustness of the results to model choice (GAM versus GLM). National average estimates under GAM-exact are slightly smaller than those obtained under GAM-approx, although this difference is very small. The comparability of these two sets of estimates occurs because in hierarchical models, underestimation of standard errors at the first stage ( v c (α)) is compensated by the overestimation of the heterogeneity parameter at the second stage (τ 2 (α)). Thus the posterior total variance of the national average estimates remains approximately constant 5

16 (Daniels et al., 2002). The bottom left and right panels of Figure 3 show posterior means of the average s.e. of ˆβ c ( 90 c vc (α)), and of the heterogeneity parameters τ(α). Because of the nature of the approximation, the average standard errors are smaller in GAM-approx than in GAM-exact or GLM, and do not vary with α. If GAM-exact or GLM are used, then the average standard errors increase with α, with GAM-exact providing slightly larger estimates. Under all three modelling approaches, the posterior mean of τ(α) (heterogeneity) decreases as α increases, indicating that less control for confounding factors inflates the variability across cities of the β c (α)s. 6 Discussion In this paper, we propose two methodological contributions in semi-parametric regression that overcome limitations of current approaches of time series analyses of air pollution and health. First, we develop an algorithm for estimating the covariance matrix of the vector of the regression coefficients in GAM that properly accounts for the smooth component of the model. Second, we introduce a conceptual framework for choosing the degrees of freedom in the smooth component of the model that minimizes the MSE of the estimated regression coefficients. Our contributions are relevant in time series analyses of air pollution and health because they improve the assessment of uncertainty of the estimated risks, and provide a strategy that removes bias due to confounding without inflating statistical variance. We developed a user-friendly S-plus function gam.exact, that implements our calculations and returns the asymptotically exact covariance matrix of the regression coefficients corresponding to the linear component of a GAM. gam.exact can be used for any number of linear predictors, smooth terms, link functions, and distribution errors. Our calculations are computationally efficient because they simply require to fit as many GAM as the number of regression coefficients in the linear component of the model, instead of calculating the T T smoother operator S (in our case-study, T is equal to 8 years of daily data, and therefore the computation of S would have been almost prohibitive). However this computational efficiency can be obtained for symmetric smoothers only, as for example smoothing splines. Simulation studies suggest that these standard error calculations are also adequate for non symmetric smoothers when a GAM with identity link is used (Durban et al., 999). A similar conclusion may hold for any link function, although additional investigations are warranted. In this application, the linear component of the model (the air pollution effects) is of interest, and the smooth component of the model (the smooth functions of time and temperatures) is nuisance. 6

17 Therefore, we proposed a model selection criterion that optimizes the inferential properties (MSE) of the parameter of interests. We compared our approach with two common alternatives in the literature based upon minimizing prediction error (AIC) or eliminating serial correlation in the residuals (PACF). We found that selecting df that minimizes AIC or PACF leads to a richer model, which removes bias, but may also inflate the variance of ˆβ. Choosing df that minimizes the MSE leads to a more parsimonious model. However under this model, the residuals may be autocorrelated, and therefore robust estimation of the variance is recommended, as for example, by use of quasi-likelihood methods. In summary, when the linear component of a semi-parametric model is the goal of the inference, then visual inspection of MSE and its comparison with AIC and PACF provides a comprehensive diagnostic tool for df-selection. The application of our semi-automated procedure for the df-selection to the four cities further illustrates the complex relationship between bias of β and the degree of adjustment for confounding factors. We found that, when an air pollution effect (as in Pittsburgh and Chicago) exists, bias mostly arises from the relationship between P M 0 and the other time-varying covariates included in the model, also called concurvity (Buja et al., 989). Thus choosing df that best predict P M 0 from the smooth component of the model can reduce bias substantially, but it may not be sufficient to eliminate the bias in full. Visual inspection of MSE plots provides further guidance, and identifies the optimal value of α for bias/variance tradeoff. We also found that, although for small values of α the PACF is large for most the cities, the estimated over-dispersion parameter was generally small. Therefore, autocorrelation of the residuals has little impact on the statistical variance of the relative rate estimates. The choice between fully parametric versus semi-parametric formulations of the smooth component of the model has been raised in recent discussions. Although GAM-exact and GLM rely upon different estimation methods, we did not find any noticeable difference between these two models in risk estimation of air pollution. We have implemented simulation studies where we have generated time-series data from a model with known smooth functions of time and temperature and we have calculated MSE and AIC as a function of α under GAM-exact and GLM. The MSE and AIC plots were very similar under the two modelling approaches (results not shown). Our sensitivity analyses of the national-average air pollution effects to model choice confirms this similarity: the nationalaverage relative rate estimates under GLM and under GAM-exact were almost identical for any value of α, thus suggesting that the choice between parametric versus semi-parametric formulations of the smooth component of the model is less critical than the degree of adjustment for confounding factors. 7

18 References Akaike, H. (973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds. B. Petrox and F. Caski, 267. American Thoracic Society, Bascom, R. (996a). Health effects of outdoor air pollution, Part. American Journal of Respiratory and Critical Care Medicine, 53, (996b). Health effects of outdoor air pollution, Part 2. American Journal of Respiratory and Critical Care Medicine, 53, Buja, A., Hastie, T., and Tibshirani, R. (989). Linear smoothers and additive models. The Annals of Statistics, 7, Burnett, R. and Krewski, D. (994). Air Pollution effects of hospital admission rates: A random effects modelling approach. The Canadian Journal of Statistics, 22, Burnett, R., Renjun, M., Jerrett, M., Goldberg, M., S., C., A., P., and Krewski, D. (200). The Spatial Association between Community Air Pollution and Mortality: A New method for Analyzing Correlated Geographic Cohort Data. Environmental Health Perspectives, 09, Chambers, J. and Hastie, T. (992). Statistical Models in S. Chapman and Hall, London. Clancy, L., Goodman, P., Sinclair, H., and Dockery, D. (2002). Effect of air-pollution control on death rates in Dublin, Ireland: an intervention study. Lancet, 360, Daniels, M., Dominici, F., and Zeger, S. (2002). Underestimation of Standard Errors in Time Series Studies of Air Pollution and Mortality. Technical report, Bloomberg School of Public Health, Johns Hopkins University. Dominici, F., Daniels, M., Zeger, S. L., and Samet, J. M. (2002a). Air Pollution and Mortality: Estimating Regional and National Dose-Response Relationships. Journal of the American Statistical Association, 97, 00. Dominici, F., McDermott, A., Zeger, S. L., and Samet, J. M. (2002b). 0n Generalized Additive Models in Time Series Studies of Air Pollution and Health. American Journal of Epidemiology, 56, 3,. (2002c). Airborne Particulate Matter and Mortality: Time-Scale Effects in Four US Cities. American Journal of Epidemiology (in press). (2002d). A Report to the Health Effects Institute on Reanalyses of the NMMAPS Database. The Health Effects Institute, Cambridge, MA. 8

19 Dominici, F., Samet, J. M., and Zeger, S. L. (2000). Combining Evidence on Air pollution and Daily Mortality from the Twenty Largest US cities: A Hierarchical Modeling Strategy (with discussion). Royal Statistical Society, Series A, with discussion, 63, Durban, M., Hackett, C., and Currie, I. (999). Approximate Standard Errors in Semiparametric Models. Biometrics, 55, Environmental Protection Agency (970). The Clean Air Act (CAA); 42 U.S.C. s/s 740 et seq. (970) Clean Air Act and Amendments of 970 (PL 9-604; 42 USC 857h-7 et seq.; amended 970. US Environmental Protection Agency. Environmental Protection Agency (996). Review of the National Ambient Air Quality Standards for Particulate Matter: Policy Assessment of Scientific and Technical Information. OAQPS Staff Paper. Research Triangle Park, North Carolina, U.S. Government Printing Office. Environmental Protection Agency. (200). Air Quality Criteria for Particulate Matter: Second External Review Draft March 200. US Environmental Protection Agency, Office of Research and Development. Everson, P. and Morris, C. (2000). Inference for multivariate Normal hierarchical models. Journal of the Royal Statistical Society, series B, 62, Friedman, J. and Stuetzle, W. (98). Projection Pursuit Regression. Journal of American Statistical Association, 76, Goldberg, M., Burnett, R., Valois, M., Flegel, K., and Bailar, J. (2003). Associations between ambient air pollution and daily mortality among persons with congestive heart failure. Environ Research, 9, Green, P., Jennison, C., and Seheult, A. (985). Analysis of Field Experiments by Least Square Smoothing. Journal of the Royal Statistical Society, 47, 2, Green, P. J. and Silverman, B. W. (994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London UK. Hastie, T., Tibshirani, R., and Buja, A. (993). Scoring. Tech. rep. Flexible Discriminant Analysis by Optimal Hastie, T. J. and Tibshirani, R. J. (990). Generalized additive models. Chapman and Hall, New York. Katsouyanni, K., Schwartz, J., Spix, C., Touloumi, G., Zmirou, D., Zanobetti, A., and Wojtyniak, B. (996). Short term effects of air pollution on health: a European approach using epidemiologic time series data: the APHEA protocol. J Epidemiol Community Health, 50, Supp : S2 8. 9

20 Katsouyanni, K., Toulomi, G., Spix, C., Balducci, F., Medina, S., Rossi, G., Wojtyniak, B., Sunyer, J., Bacharova, L., Schouten, J., Ponka, A., and Anderson, H. R. (997). Short term effects of ambient sulphur dioxide and particulate matter on mortality in 2 European cities: results from time series data from the APHEA project. British Medical Journal, 34, Kelsall, J., Samet, J. M., and Zeger, S. L. (997). Air Pollution, and Mortality in Philadelphia, American Journal of Epidemiology, 46, Klein, M., Flanders, W., and Tolbert, P. (2002). Variances may be underestimated using available software for generalized additive models (abstract). American Journal of Epidemiology (supplement), 55, S06. Lee, J., Kim, H., Song, H., Hong, Y., Cho, Y., Shin, S., Hyun, Y. J., and Kim, Y. (2002). Air pollution and asthma among children in Seoul, Korea. Epidemiology, 3, Lipfert, F. and Wyzga, R. (993). Air Pollution and Mortality: Issues and Uncertainty. Journal Air Waste Manage. Assoc, 45, McCullagh, P. and Nelder, J. A. (989). Generalized Linear Models (Second Edition). New York: Chapman & Hall. Nelder, J. A. and Wedderburn, R. W. M. (972). Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 35, Pope, C. A., Dockery, D., and Schwartz, J. (995). Review of Epidemiological evidence of health effects of particulate air pollution. Inhal.Toxicol., 47, -8. Ramsay, T., Burnett, R., and Krewski, D. (2002). The effect of concurvity in generalized additive models linking mortality and ambient air pollution. Epidemiology (to appear). Samet, J. M., Dominici, F., Curriero, F., Coursac, I., and Zeger, S. L. (2000a). Fine Particulate air pollution and Mortality in 20 U.S. Cities: New England Journal of Medicine (with discussion), 343, 24, Samet, J. M., Zeger, S. L., and Berhane, K. (995). The Association of Mortality and Particulate Air Pollution. Health Effects Institute, Cambridge, MA. Samet, J. M., Zeger, S. L., Dominici, F., Curriero, F., Coursac, I., Dockery, D., Schwartz, J., and Zanobetti, A. (2000b). The National Morbidity, Mortality, and Air Pollution Study Part II: Morbidity and Mortality from Air Pollution in the United States. Health Effects Institute, Cambridge, MA. Samet, J. M., Zeger, S. L., Dominici, F., Dockery, D., and Schwartz, J. (2000c). The National Morbidity, Mortality, and Air Pollution Study Part I: Methods and Methodological Issues. Health Effects Institute, Cambridge, MA. 20

21 Samet, J. M., Zeger, S. L., Kelsall, J., Xu, J., and Kalkstein, L. (997). Air pollution, weather and mortality in Philadelphia, In Particulate Air Pollution and Daily Mortality: Analyses of the Effects of Weather and Multiple Air Pollutants. The Phase IB report of the Particle Epidemiology Evaluation Project. Health Effects Institute, Cambridge, MA. Schwartz, J. (995). Air Pollution and Daily Mortality in Birmingham, Alabama. American Journal of Epidemiology, 37, (2000). Assessing Confounding, Effect Modification, and Thresholds in the Associations between Ambient Particles and Daily Deaths. Environmental Health Perspective, 08, Schwartz, J. and Marcus, A. (990). Mortality and Air Pollution in London: A time series analysis. American Journal of Epidemiology, 3, Schwartz, J., Spix, C., Touloumi, G., Bacharova, L., and Barumamdzadeh, T. e. a. (996). Methodological issues in studies of air pollution and daily counts of deaths or hospital admissions. J Epidemiol Community Health, 50, Supp : S. Schwartz, M. (978). A Mathematical Model Used to Analyze Breast Cancer Screening Strategies (STMA V2 863). Operations Research, Journal of Operations Research Society of America, 26, Speckman, P. (988). Kernel Smoothing in Partial Linear Models. Journal of the Royal Statistical Society Series B, 50, Stieb, D., Judek, S., and Burnett, R. (2002). Meta-analysis of time-series studies of air pollution and mortality: effects of gases and particles and the influence of cause of death, age, and season. J Air Waste Manag Assoc., 52, Toulomi, G., Katsouyanni, K., Zmirou, D., and Schwartz, J. (997). Short-Term Effects of Ambient Oxidant Exposure on Mortality: A combined Analysis within the APHEA Project. American Journal of Epidemiology, 46, Zanobetti, A., Schwartz, J., and Dockery, D. (2000). Airborne particles are a risk factor for hospital admissions for heart and lung disease. Environmental Health Perspective, 08,

22 Figure Legends. Simulation Results: In the top panels red curves indicate the known f(t) and g(t). In the bottom left panel the red points indicate true MSE as function of df. Notice that the true MSE is minimized at df = 5. In the top panels the black curves indicate the estimated ˆf(t) and ĝ(t) obtained by modelling f(t) and g(t) as natural cubic splines with 5 degrees of freedom in (4) and (5). In the bottom left panel the boxplots indicate the distribution of the estimated MSE for the 500 data sets sampled at each df. The triangle is placed at the estimated ˆdf from our procedure which coincides with the one that minimizes the true MSE. In the bottom right panel are compared four model selection s measures as function of df: MSE, AIC, BIC, and PACF. 2. Four cities results: left panels show boxplots of ˆβ b (α) for each city and for each value of α. Red and black horizontal lines are placed at ˆβ(α max ) and at 0, respectively. Right panels show MSE, AIC and PACF estimates as function of α. To ease the comparison, the values have been normalized to vary in [0,]. 3. NMMAPS sensitivity analysis: top left panels show national average estimates (posterior means) as function of α. Black dots denote estimates under GAM with approximated standard errors, the red dots denote estimates under GAM with asymptotically exact standard errors, and the blue dots denote estimates under GLM. The grey polygon represents the 95% posterior intervals of the national average estimates under the GAM model with exact standard errors. The red segment is placed at α =, e.g the degree of adjustment used in the NMMAPS model. The black curves in the top right panel denote the city-specific Bayesian estimates of the relative rates under GAM with asymptotically exact standard errors. Bottom panels show the posterior means of the average s.e. of ˆβ c ( 90 c vc (α)) (left) and posterior means of the heterogeneity parameters τ(α) (right). 22

American Journal of EPIDEMIOLOGY

Volume 156 Number 3 August 1, 2002 American Journal of EPIDEMIOLOGY Copyright 2002 by The Johns Hopkins Bloomberg School of Public Health Sponsored by the Society for Epidemiologic Research Published by