ISSUES IN SEMI-PARAMETRIC REGRESSION WITH APPLICATIONS IN TIME SERIES MODELS FOR AIR POLLUTION AND MORTALITY

Size: px
Start display at page:

Download "ISSUES IN SEMI-PARAMETRIC REGRESSION WITH APPLICATIONS IN TIME SERIES MODELS FOR AIR POLLUTION AND MORTALITY"

Transcription

1 ISSUES IN SEMI-PARAMETRIC REGRESSION WITH APPLICATIONS IN TIME SERIES MODELS FOR AIR POLLUTION AND MORTALITY Francesca Dominici, Aidan McDermott, and Trevor Hastie February 4, 2003 Abstract In 2002, methodological issues around time series analyses of air pollution and health attracted the attention of statisticians, epidemiologists, the press, industry, and policy makers. As the Environmental Protection Agency (EPA) was finalizing its review of the evidence on particulate air pollution, statisticians and epidemiologists found that the S-Plus implementation of Generalized Additive Models (GAM), the most common statistical approach for time series analysis of air pollution and health, can overestimate effects of air pollution and understate statistical uncertainty. This discovery delayed the process for the Particulate Matter (PM) Criteria Document prepared as part of the review of the National Ambient Air Quality Standard (NAAQS), as the time-series findings were a critical component of the evidence. In addition, it raised concerns about the adequacy of current model formulations and their software implementations. Time-series data on air pollution and mortality are analyzed by using semi-parametric regression models; where the linear component (corresponding to the air pollution effects) is of interest, and the smooth component (smooth functions of time and temperature to control for confounding), is a nuisance. In this paper we provide two methodological results in semi-parametric regression directly relevant to risk estimation in time series studies of air pollution. First, we introduce a closed form estimate of the asymptotically exact covariance matrix of the linear component of a GAM. To ease the implementation of these calculations, we develop the package gam.exact, an extended version of gam. Use of gam.exact allows a more robust assessment of uncertainty of the air pollution effects. Second, we introduce a semi-automated procedure for selecting the degrees of freedom in the smooth component which minimizes the mean squared error of the regression coefficients corresponding to the linear component. We justify our approach by theoretical arguments and by a simulationbased validation study. In time series analyses of air pollution and health, our semi-automated procedure provides a diagnostic tool for choosing the degree of adjustment for time-varying factors that might confound the pollution-mortality relationship. Finally, we apply our methods to the data for National Mortality Morbidity Air Pollution Study (NMMAPS), which includes time series data from the 90 largest US locations for the period We report national-level estimates of the health effects of air pollution, and review their sensitivity to model choice.

2 Key Words: Semiparametric regression, time-series, air pollution, GAM. Acknowledgments Research described in this article was partially supported by a contract and grant from Health Effects Institute (HEI), an organization jointly funded by the Environmental Protection Agency (EPA R824835) and automotive manufacturers. The contents of this article do not necessarily reflect the views and policies of HEI, nor do they necessarily reflect the views and policies of EPA, or motor vehicles or engine manufacturers. Funding for Francesca Dominici was provided by a grant from the Health Effects Institute (Walter A. Rosenblith New Investigator Award). Research described within was also supported by a grant from the National Institute of Environmental Health Sciences for the Johns Hopkins Center in Urban Environmental Health (P30ES0389-2). We would like to thank Drs Scott L. Zeger, Jonathan M. Samet, and Giovanni Parmigiani for comments. 2

3 Introduction Evidence from time series studies of air pollution and health has been central to major policy decisions in the United States and elsewhere in developing standards to protect the public against adverse effects of air pollution. Time series studies of air pollution and health estimate associations between day-to-day variations in air pollution concentrations and day-to-day variations in adverse health outcomes, contributing epidemiological evidence useful for evaluating the risks of current levels of air pollution (Schwartz and Marcus, 990; Lipfert and Wyzga, 993; Pope et al., 995; Schwartz, 995; American Thoracic Society, 996a,b; Katsouyanni et al., 997; Samet et al., 2000a; Clancy et al., 2002; Lee et al., 2002; Stieb et al., 2002; Goldberg et al., 2003). Interpretation of short-term associations between air pollution and health is complicated by the social, political, and regulatory context which looks to scientific evidence for guidance in selecting among policy options. Under the Clean Air Act (Environmental Protection Agency, 970), the Environmental Protection Agency sets the National Ambient Air Quality Standards (NAAQS) for six air pollutants to protect the public health (Environmental Protection Agency, 996, 200). Such standards are reviewed periodically based on the accumulated scientific evidence. Balancing a series of health effects, including hospitalization and death, against the costs of further controls creates a difficult and sensitive policy debate. In the last 0 years, many advances have been made in the statistical modelling of time series of air pollution. Standard regression methods used initially have been almost fully replaced by semi-parametric approaches (Speckman, 988; Hastie and Tibshirani, 990; Green and Silverman, 994). These methods allow more flexibility in the adjustment for confounding factors by including non-parametric functions of time and weather variables in the regression model. One widely used approach for a time series analysis of air pollution and health involves a semi-parametric Poisson regression with daily mortality or morbidity counts as the outcome, linear terms measuring the percentage increase in the mortality/morbidity associated with elevations in air pollution levels (the relative rates βs), and smooth functions of time and weather variables to adjust for the timevarying confounders. The nature and characteristics of time series data make risk estimation challenging, requiring complex statistical methods able to detect effects that can be small relative to the combined effect of other time-varying covariates. More specifically, the association between air pollution and mortality/morbidity can be confounded by weather and by seasonal fluctuations in health outcomes due to influenza epidemics, and to other unknown slow-varying factors (Schwartz et al., 996; Katsouyanni et al., 996; Samet et al., 997). Thus, smooth functions of temperature and time are included into the regression model to control for fluctuations in the mortality/morbidity 3

4 time series that are not associated with air pollution; residual variation is then used to estimate the air pollution effect. Generalized linear models (GLM) with regression splines (McCullagh and Nelder, 989) and Generalized additive models (GAM) with non-parametric splines (Hastie and Tibshirani, 990) have become the methods of choice. During the last few years, GAM with non-parametric splines was preferred to fully parametric formulations because of the increased flexibility in estimating the smooth component of the model. While GAM has been the preferred method, recent reports have questioned the adequacy of its use for studies on air pollution and health. The original default parameters of the gam functions in Splus (Versions 3.4 and 5.) were found inadequate to guarantee the convergence of the backfitting algorithm (Dominici et al., 2002b). The defaults have recently been revised in the Splus Version 6., but they may be still too lax for these applications. In addition, the S-plus function gam calculates the standard errors of the linear terms (the air pollution effects) by effectively assuming that the smooth component of the model is linear, resulting in an underestimation of uncertainty (Chambers and Hastie, 992; Ramsay et al., 2002; Klein et al., 2002). Convergence of the backfitting algorithm and underestimation of standard errors in GAM depend critically on the number of degrees of freedom in the smooth component of the model (df ). In timeseries studies of air pollution and health, more strict control for potential confounding factors leads to larger df, which increases the correlation between the predictors in the model and makes the linear approximation of the smooth component of the model inadequate. Thus, the use of several highly flexible smooth functions of time-varying covariates to adjust for confounding reduces bias in the relative rate estimates, but can also slow convergence and underestimate standard errors substantially. The degree of adjustment for factors, controlled by df is another critical aspect of time series analyses of air pollution and health. Ideally, we would like to choose df large enough to remove confounding entirely, but not large enough to eliminate all the residual variation in the data, thus inflating the statistical variance of ˆβ. In absence of strong biological hypotheses, the choice of df has been based on expert judgment on the time-scales of variations in the mortality time series where confounding is more likely to occur (Kelsall et al., 997; Dominici et al., 2000), or via optimality criteria, such as minimum prediction error (based on the Akaike Information Criteria), or by visual inspection of the partial autocorrelation function of the residuals (Toulomi et al., 997; Burnett et al., 200). This paper is motivated by the pressing need for methodological developments in semi-parametric regression to better handle the general complexity of air pollution risk estimation. In section 2, 4

5 we detail how to calculate the asymptotically exact standard error of ˆβ in GAM. We implement these calculations in the new S-plus software function gam.exact. In section 3, we provide the relationship between asymptotic bias of ˆβ and degrees of freedom in the smooth component of the model (df ), and we introduce a semi-automated procedure for selecting df that minimizes the mean squared error of ˆβ. In section 4 we apply our methods to the data from National Mortality Morbidity Air Pollution Study (NMMAPS) (Samet et al., 2000c,b) and calculate national average estimates of the air pollution effects as a function of the degree of adjustment for confounding factors. Strengths and limitations of our approaches are discussed in section 5. 2 Statistical Model Semi-parametric model specifications for time-series analyses of air pollution and health have been extensively discussed in the literature (Burnett and Krewski, 994; Kelsall et al., 997; Katsouyanni et al., 997; Dominici et al., 2000; Zanobetti et al., 2000; Schwartz, 2000) and are briefly reviewed here. Data consist of daily mortality or morbidity counts (y t ), daily levels of one or more air pollution variables (x t..., x Jt ), and additional time-varying covariates (u t..., u Lt ) to control for slowvarying confounding effects such as season and weather. Regression coefficients are estimated by fitting the following semi-parametric model: y t Poisson(µ t ), t =,..., T var(y t ) = φµ t log µ t = β 0 + J j= β jx jt + L l= f l(u lt, df l ) = η t. () In our application, β j s describe the percentage increase in mortality/morbidity per unit increases in ambient air pollution levels x jt. The functions f(, df l ) denote smooth functions of calendar time, temperature, and humidity, often constructed using smoothing splines, loess smoothers, or natural cubic splines with smoothing parameters df l s; φ is the over-dispersion parameter. Here we assume that the f s are modelled using smoothing splines (GAM) or using natural cubic splines (GLM). 5

6 3 Asymptotically Exact Standard Errors in GAM In this section we develop an explicit expression for the asymptotically exact (a.e.) statistical covariance matrix of the vector of the regression coefficients β = [β,..., β J ] corresponding to the linear component of model () when f are modelled using smoothing splines and a GAM is used. Note that when f s are modelled using regression splines (such as natural cubic splines), the fully parametric model () is fitted by using Iteratively Re-weighted Least Squares (IRLS), and asymptotically exact standard errors are returned by the S-plus function glm. Estimation in GAM is based on a combination of the local scoring algorithm and backfitting algorithm (Friedman and Stuetzle, 98; Buja et al., 989; Hastie and Tibshirani, 990). The local scoring algorithm is a generalization of the Fisher scoring procedure for finding maximum likelihood estimates in GLMs (Nelder and Wedderburn, 972; McCullagh and Nelder, 989). The backfitting algorithm is suitable for fitting any additive model, and in GAM is used within the local scoring iteration when several smooth functions are included in the model. An explicit expression for the a.e. covariance matrix of β can be obtained from the closed form solution for β (Hastie and Tibshirani, 990, page 54), and: β = Hz where : H = { X t W (I S)X } X t W (I S) (2). X is the T J design matrix with columns x j = [x j,..., x jt ] t ; 2. z is the working response from the final iteration of the IRLS algorithm (McCullagh and Nelder, 989) defined as z t = ˆη t + (y t ˆµ t )/ˆµ t ; 3. W is diagonal in the final IRLS weights; and 4. S is the T T operator matrix that fits the additive model involving the smooth terms in (). Notice that here we have put all the additive smooth terms L l= f l(u lt, df l ) together, and S represents the operator for computing this additive fit. As such, S represents a backfitting algorithm on just these terms. From (2) and the usual asymptotics we find that var( β) = H t W H where W = ĉov(z). Because calculation of the operator matrix S can be computationally expensive, the current version of the S- plus function gam approximates var( β) by effectively assuming that the smooth component of model 6

7 () is linear. That is var( β) is approximated by the appropriate submatrix of (X t augw X aug ), where X aug is the design matrix of model () augmented by the predictors used in the smooth component of the model, i.e. X aug = [x,..., x J, u,..., u L ] t (Hastie and Tibshirani, 990; Chambers and Hastie, 992). In time series studies of air pollution and mortality, the assumption of linearity of the smooth component of model () is inadequate, resulting in underestimation of the standard error of the air pollution effects (Ramsay et al., 2002; Klein et al., 2002). The degree of underestimation will tend to increase with the number of degrees of freedom used in the smoothing splines, because a larger number of non-linear terms is ignored in the calculations. However, if S is a symmetric operator matrix, then H can be re-defined as: H = { X t (W X W SX) } (W X W SX) t. (3) (Symmetry in this case is with respect to a W weighted inner product, and implies that W S = S t W ; weighted smoothing splines are symmetric, as are weighted additive model operators that use weighted smoothing splines as building blocks.) Hence the expensive part of the calculation of var( β) involves the calculation of the T J matrix SX, having as column j the fitted vector resulting from fitting the (weighted) additive model L l= f l(u lt, λ l ) to a response x j. We now provide details on the calculation of z, W and SX: step. fit the semi-parametric model () using gam and extract the weights w, the actual degrees of freedom used in the backfitting λ l. The weights w are the diagonal elements of the matrix W ; step 2. smooth each column of X with respect to L l= f l(u lt, λ l ), by using a gam with identity link and weights w. The columns of SX are the corresponding fitted values. Steps and 2 are implemented in our S-plus function gam.exact, which returns the a.e. covariance matrix of β for any GAM. The software is available at Although we have detailed the standard error calculations for a semi-parametric model with log link and Poisson error, these calculations can be generalized for the entire class of link functions for GLM by calculating z t = ˆη t + (y t ˆµ t ) ˆη t ˆµ t (Nelder and Wedderburn, 972) in step 2. In the simpler case of a Gaussian regression, the asymptotic covariance matrix var( β) can be obtained by setting w = and z t = y t. Details of these calculations in this case have been discussed by Durban et al. (999). the actual degrees of freedom may differ slightly from those requested in the call to gam, as a consequence of the changing weights in the IRLS algorithm 7

8 4 Understanding bias in semi-parametric regression One important issue in time series studies of air pollution and mortality is the selection of the number of degrees of freedom (df ) to be used in representing the smooth component of model (). Ideally, we would like to choose df large enough to eliminate bias in the relative rate estimate ˆβ. However, if df is too large we may remove the bias but inflate the variance of ˆβ. In this section we examine the form of this bias, as we vary the complexity of our representation for f, and we illustrate a semi-automated procedure to select df. To illustrate the relationship between the bias and variance of ˆβ and df, we consider a simple additive model of the following form: y t = βx t + f(t) + ɛ t, ɛ t N(0, σ 2 ). (4) Suppose that the dependence between x t and time is described by: x t = g(t) + ξ t, ξ t N(0, σξ 2 ). (5) Suppose also that f(t) here representing a true and unknown relationship can be represented by a basis expansion: r f(t) = h l (t)δ l, l= or in vector notation f(t) = h t (t)δ. To study the bias-variance tradeoff, we will assume that g(t) = h (t)γ, where h (t) is a subset of q < r of the basis functions in h(t) = (h (t), h 2 (t)). For a given set of T time points, we can represent the vector of function values by f = Hδ, where H is an T r basis matrix. Without loss of generality we assume that H t H = T I. We are therefore assuming that the h l (t) are mutually orthogonal, and are size-standardized. The factor T is needed in asymptotic arguments below, and is realistic in the following sense. Suppose f and g, and hence each of the h l, are periodic (with a period of a year). We standardize them so that Year h2 l (t)dt =, or 365 t= h l(t) 2 /365 =. Then the sum-of-squares over m years of data will be T = 365 m. We now examine the bias and variance of ˆβ as we vary the complexity of our representation for f when we fit x and t jointly in a regression model. We consider three scenarios: we model f perfectly by using sufficient df to fully represent the relationship between y t and t, e.g. ˆf(t) = h(t)ˆδ; 8

9 we model f imperfectly, by using sufficient df to fully represent the relationship between x t and t: ˆf(t) = h (t)ˆδ ; we do not model f(t). Model f perfectly, using h(t) In this case, the Least Square (LS) estimate ˆβ is unbiased for β, (E ˆβ = β), and Var ˆβ = σ 2 x (HH t /T )x 2 (6) x (HH t /T )x is the residual vector when we regress x on H. The richer our model for f, the smaller this residual vector, and hence the larger the variance of ˆβ. Under our model (5) for x t, the denominator is distributed as σ 2 ξ χ2 T r, with mean value (T r)σ2 ξ. Model f imperfectly, using h (t) Simple calculations show in this case that E[ ˆβ x t, t] β = = ξ t H 2 δ 2 x (H H t /T )x 2 ξ t H 2 δ 2 ξ t (I H H t /T )ξ This bias term, which depends only on the random errors ξ t in (5), is O p (/ T ) and hence is asymptotically negligible. Therefore, even though we do not model f perfectly, we include sufficient df to eliminate the bias. The variance of ˆβ in this case is Var ˆβ σ = 2 x (H H t /T )x 2 σ = 2 (7) ξ t (I H H t/t )ξ Comparing (7) with (6), we see that the variance here is smaller (the denominator is σ 2 ξ χ2 T q ). Notice that the variance is O p (/T ), as is the squared bias, so that the mean-squared-error might be a better basis for the bias variance trade off. 9

10 Ignoring f(t) Here we have E[ ˆβ x t, t] β = Tγt δ x 2 + ξt Hδ x 2 Since x 2 /T = O p (), the first term does not disappear asymptotically. The second term does as before, as does the variance Var( ˆβ) = σ 2 / x 2. In summary, if we represent f(t) with enough bases to model the relationship between x t and t, then the squared bias term is asymptotically zero, and fades away at the same rate as the variance. Selecting the df beyond this threshold trades off bias against variance. If we do not cover the dependence of x t on t, a systematic component to the bias persists. The results presented in this section assume that f(t) and g(t) are modelled by the use of orthogonal basis functions, as for example, regression splines. Similar results when f(t) and g(t) are modelled by use of kernel smoothers are discussed in Green et al. (985) and Speckman (988). For smoothing splines, the analysis is complicated by the fact that all components of functions f(t) and g(t) (a part from the linear components), are modelled with bias. These biases depend on the complexity (roughness) of the component, and the df used, and will disappear asymptotically if df grows appropriately (Green and Silverman, 994). 4. Selecting sufficient degrees-of-freedom via bootstrap In this section we present a bootstrap-based procedure to estimate df that minimizes the mean squared error of ˆβ when the the exact forms of f and g in (4) and (5) are unknown. The computational steps are illustrated below:. choose df max in (4) so that f(t) is well modelled in his spline representation, for example, by using generalized cross-validation (GCV) methods (Hastie and Tibshirani, 990; Hastie et al., 993). 2. estimate the full model (4) with df = df max and calculate the unbiased estimate ˆβ dfmax. 3. for b =,..., B: sample y b t from the fitted model (4) with df = df max ; for df =, 2,..., df max, estimate ˆβ b df by fitting the model yb t = β df x t + s(t, df) + ɛ t where s(t, df ) = df l= h l(t)δ l ; 0

11 4. calculate MSE( ˆβ df ) = B b b= ( ˆβ b df ˆβ dfmax ) 2 ; 5. choose the df that minimizes MSE( ˆβ df ). To illustrate our procedure we implement a simulation study. We first estimate the true MSE by generating data sets (x m t, yt m ), for m =,..., 500 from models (4) and (5) with known parameters and known f(t) and g(t) for t =, (we considered two years of data). Figure (top left and right) shows the known f(t) and g(t) (red curves) and bottom left shows the true MSE as functions of df (red points). The true MSE is minimized near df = 5. Black lines in Figure (top left and right) shows the estimated ˆf(t) and ĝ(t) obtained by modelling f(t) and g(t) as natural cubic splines with 5 degrees of freedom in (4) and (5), respectively. Thus df = 5 leads to an inaccurate estimate of f(t), but to a better estimate of g(t). This further illustrates the point made in the previous section: to optimize inferential properties of ˆβ, it is not necessary to model f(t) accurately. To validate our procedure, we apply steps to 4 to each data set (x m t, yt m ) sampled from the true models, and we estimate MSE m ( ˆβ df ) and other model selection measures for m =,..., 500. Figure (bottom left) shows the boxplots of MSE m ( ˆβ df ) for each df. Note that there is a very good correspondence between the true MSE and the distribution of the estimated MSE across data sets. The triangle is placed at the df where the true and the estimated MSE are minimized. Figure (bottom right) compare four measures of model choices, that is the average MSE, AIC, BIC, and PACF across the 500 data sets as function of df. AIC is the Akaike information criteria (Akaike, 973). BIC is the Bayesian Information Criteria (Schwartz, 978) and it has a Bayesian interpretation since it may be viewed as an approximation to the posterior odds ratio. PACF is defined as the sum of the absolute values of the autocorrelation functions of the residuals. MSE and BIC are minimized at the same df, indicating that df = 5 is also the model with the highest posterior odds. Because AIC has a smaller penalty in the dimension of the model, it dominates MSE for any df and it is minimized at df = 0. PACF is a compromise between the MSE and the AIC and it is minimized at df = 6. Note that, to achieve a good prediction, it is necessary to use larger df to adequately model f(t). 5 NMMAPS Data Analysis In this section, we apply our methods to the NMMAPS data base which is comprised of daily time series of air pollution levels, weather variables, and mortality counts for the largest 90 cities in the US from 987 to 994. A full description of the NMMAPS data base is detailed in Samet et al. (2000b).

12 First, we extend results of section 3. so that several smooth functions are included in the model. Here we define a class of semi-parametric models which allows the degree of adjustment for confounding factors to vary as the function of only one calibration parameter α. We then apply our procedure for estimating α to four NMMAPS cities with daily data available and compare it with alternative approaches in the literature. Second, we extend modelling approaches in a hierarchical fashion, and we estimate national average air pollution effects as a function of α. Details of the two data analyses are below. Bootstrap-based procedure for df-selection in four NMMAPS cities To apply our approach to the class of semi-parametric regression models defined in equation (), we assume df l = α λ l, where α is unknown and λ l is pre-specified. Thus the parameter α measures the degree of adjustment for confounding factors with respect to a baseline choice λ l. To illustrate our methods, we use the following simplified version of the NMMAPS core model (Dominici et al., 2000, 2002c): y t Poisson(µ t ), var(y t ) = φµ t log µ t = β 0 (α) + β(α)p M 0t + s (t, df ) + s 2 (temp t, df 2 ) = β 0 (α) + β(α)p M 0t + s (t, λ α) + s 2 (temp t, λ 2 α) (8) where y t is the daily number of deaths, φ is the over-dispersion parameter, P M 0t is the daily level of particulate matter less than 0 microns, temp is the temperature, and t =,..., days. We assume λ = 56 (i.e. 7 degrees of freedom per year), λ 2 = 6, α to be 25 equally-spaced points between α min and α max, and s to be natural cubic splines. Justification for the selection of λ and λ 2 as baseline degrees of freedom to control for longer-term trends, seasonality and weather can be found in Samet et al. (995,997,2000a), Kelsall et al. (997), and Dominici et al. (2000b). First, to identify α min and α max, we estimate (df, df 2 ) in the smooth functions of time and temperature that best predict P M 0. Here we use generalized cross-validation (GCV) methods (Hastie and Tibshirani, 990; Hastie et al., 993). Cities with larger df s have a stronger relationship between P M 0 and the other time-varying confounders, and therefore might require more flexible s and s 2 to remove confounding. The estimated ( df, df 2 ) are summarized in Table. Note that in Seattle larger (df, df 2 ) are needed than in the other cities. This is because P M 0 levels in Seattle are higher in the winter and lower in the summer, thus inducing a negative correlation between P M 0 and weather. We then use ( df, df 2 ) as guidance to identify a value of α large enough to fully model the relationship between P M 0 and time and temperature. In Pittsburgh, Minneapolis, and Chicago, we set α max = 3, in Seattle we set α max = 6, and in all the cities we assume α min = /α max. 2

13 Within each city, we apply our procedure illustrated in section 3. to the four cities as follows:. Fit model (8) with α = α max using a quasi-likelihood approach (McCullagh and Nelder, 989) and obtain an unbiased estimate of β, here denoted by β(α max ), and a robust variance v(α max ); 2. For b =,..., 500, generate mortality time series from the fitted model (8) with α = α max ; 3. For each α in [α min, α max ], re-fit the data by using a quasi-likelihood approach and obtain β b (α) and a robust variance v b (α); 4. Calculate MSE( β(α)) = B B b= ( β b (α) β(α max )) 2. For each value of α, and for each bootstrap iteration b, we also estimate the goodness of fit and the degree of serial correlation in model (8) by calculating:. the Akaike Information Criteria AIC(α) = deviance + 2 ˆφ p, where ˆφ is the estimated over-dispersion parameter, p is the number of parameters in the model; and 2. the sum of the absolute values of the partial autocorrelation function of the residuals (PACF). Choosing df that minimize AIC or PACF are strategies routinely used in air pollution literature research (Toulomi et al., 997; Burnett et al., 200). Figure 2 (left panels) shows boxplots of ˆβ b (α), b =,..., 500 as function of α. Red and black horizontal lines are placed at ˆβ(α max ) and at 0, respectively. As expected, as α increases, bias diminishes in all cities. More specifically, in Pittsburgh, Minneapolis, Chicago, and Seattle, bias is removed for values of α larger 2.7,, 2, and, respectively. Because the statistical variance of ˆβ(α) increases as α increases, an optimal value of α for the bias/variance tradeoff can be identified by estimating the MSE. The estimated MSE, AIC, and PACF as functions of α are shown in the right panels of Figure 2. To ease the comparison we have normalized these estimates to vary in [0,]. In Pittsburgh, Minneapolis, and Chicago, the MSE plot is almost dominated by bias and level off for values of α larger than 2.,, and.3, respectively. The PACF and the AIC always dominate the MSE, indicating that large values of α would be needed to eliminate correlation in the residuals or to adequately model the smooth functions of time and temperature to optimize prediction. In Minneapolis, the PACF is minimized at, and rises for α >, indicating over-fitting and negative autocorrelation of the residuals. Note that a choice of α that minimizes the MSE still leaves autocorrelation in 3

14 the residuals (PACF is quite large at these values), which is accounted for using robust variance estimation. In Seattle, the MSE, AIC and PACF are quite different and non-monotone. The MSE is minimized at α = where the bias is small and then starts to rise because of the increase in variance, and levels off for α > 3.8. The PACF has a local minimum at.9, rises again for α larger than.9, and it is minimized at α = 6 as the AIC. However, both choices (α = and α = 6) indicate no association between air pollution and mortality. Note that, although exploratory analyses in Seattle and Minneapolis suggested larger df, df 2 to best predict P M 0 as smooth functions of time and temperature, the inspection of the MSE indicated that a small α should be used in the model (8). This is probably because ˆβ(α max ) 0, and therefore very little bias arises from the relationship between P M 0 and time and temperature in the full model. Finally, point estimates and 95 % CI of ˆβ(α max ), and ˆβ(ˆα), where ˆα is the value of α that minimizes MSE, are summarized in Table. The two sets of estimates are similar because for α > ˆα bias is removed. However ˆβ(ˆα)s have narrower confidence intervals than ˆβ(α max )s because a more parsimonious model is used. Analysis of the NMMAPS data base The pollutant P M 0 is measured approximately every six days in the other NMMAPS locations, except for Pittsburgh, Minneapolis, Seattle, and Chicago, thus complicating the implementation of our bootstrap-based methodology. However, we can still extend the NMMAPS model described below in an hierarchical fashion and estimate national average air pollution effects as function of α. We consider the following semi-parametric model used in the NMMAPS analysis: log E[yt c ] = age-specific intercepts + β c (α)p M0t c + s(t, 7/year α)+ + s(temp t, 6 α) + s(dewpoint t, 3 α) + age s(t, 8 α) (9) where yt c is the daily number of deaths in city c, P M 0t is the daily level of particulate matter, temp and dew are the temperature and dewpoint temperature, and the age-specific intercepts correspond to the three age groups of younger than 65, between 65 and 75 and older than 75. Details of this modeling approach can be found elsewhere (Dominici et al., 2000; Samet et al., 2000c). Based upon the statistical analyses of the four cities with daily data and additional exploratory analyses, we set α to take on 25 equally spaced points varying from /3 to 3. As in the previous model formulation, 4

15 this choice allows the degree of adjustment for confounding factors to vary greatly. We then assume the following two-stage normal-normal hierarchical model: ˆβ c (α) N(β c (α), v c (α)) β (α) N(β (α), τ 2 (α)) (0) where β (α) and τ 2 (α) are the national average air pollution effects and the variance across cities of the true city-specific air pollution effects, both as a function of α. We fit model (0) by using a Bayesian approach, with a flat prior on β (α) and uniform prior on the shrinkage factor τ 2 (α)/ [ τ 2 (α) + v c (α) ] (Everson and Morris, 2000). Sensitivity of the national average estimates to the prior specification for τ 2 when α = has been explored elsewhere (Dominici et al., 2002a,d). To investigate sensitivity of the national average estimates to model choice, for each value of α, we estimate ˆβ c (α) and v c (α) using three methods: ) GAM with approximated standard errors (GAM-approx s.e.); 2) GAM with asymptotically exact standard errors (GAM-exact); 3) and GLM. The left top panel of Figure 3 shows the national average estimates (posterior means) as a function of α. Black, red, and blue dots denote estimates under GAM-approx s.e., GAM-exact, and GLM, respectively. The grey polygon represents 95% posterior intervals of the national average estimates under GAM-exact. The red segment is placed at α =, that is, the degree of adjustment used in the NMMAPS model (Dominici et al., 2000). The black curves at the top right panel denote the city-specific Bayesian estimates of the relative rates under GAM-exact. Figure 3 provides strong evidence for association between short-term exposure to P M 0 and mortality, which persists for different values of α. Consistent with the results for the four cities, national average estimates decrease as α increase, and level off for α larger than.2 with a very modest increase in posterior variance. However even when α = 3, the national average effect is estimated as to 0.2% increase in total mortality for 0 µg/m 3 increase in P M 0 (95% posterior interval 0.05 to 5). This picture also shows robustness of the results to model choice (GAM versus GLM). National average estimates under GAM-exact are slightly smaller than those obtained under GAM-approx, although this difference is very small. The comparability of these two sets of estimates occurs because in hierarchical models, underestimation of standard errors at the first stage ( v c (α)) is compensated by the overestimation of the heterogeneity parameter at the second stage (τ 2 (α)). Thus the posterior total variance of the national average estimates remains approximately constant 5

16 (Daniels et al., 2002). The bottom left and right panels of Figure 3 show posterior means of the average s.e. of ˆβ c ( 90 c vc (α)), and of the heterogeneity parameters τ(α). Because of the nature of the approximation, the average standard errors are smaller in GAM-approx than in GAM-exact or GLM, and do not vary with α. If GAM-exact or GLM are used, then the average standard errors increase with α, with GAM-exact providing slightly larger estimates. Under all three modelling approaches, the posterior mean of τ(α) (heterogeneity) decreases as α increases, indicating that less control for confounding factors inflates the variability across cities of the β c (α)s. 6 Discussion In this paper, we propose two methodological contributions in semi-parametric regression that overcome limitations of current approaches of time series analyses of air pollution and health. First, we develop an algorithm for estimating the covariance matrix of the vector of the regression coefficients in GAM that properly accounts for the smooth component of the model. Second, we introduce a conceptual framework for choosing the degrees of freedom in the smooth component of the model that minimizes the MSE of the estimated regression coefficients. Our contributions are relevant in time series analyses of air pollution and health because they improve the assessment of uncertainty of the estimated risks, and provide a strategy that removes bias due to confounding without inflating statistical variance. We developed a user-friendly S-plus function gam.exact, that implements our calculations and returns the asymptotically exact covariance matrix of the regression coefficients corresponding to the linear component of a GAM. gam.exact can be used for any number of linear predictors, smooth terms, link functions, and distribution errors. Our calculations are computationally efficient because they simply require to fit as many GAM as the number of regression coefficients in the linear component of the model, instead of calculating the T T smoother operator S (in our case-study, T is equal to 8 years of daily data, and therefore the computation of S would have been almost prohibitive). However this computational efficiency can be obtained for symmetric smoothers only, as for example smoothing splines. Simulation studies suggest that these standard error calculations are also adequate for non symmetric smoothers when a GAM with identity link is used (Durban et al., 999). A similar conclusion may hold for any link function, although additional investigations are warranted. In this application, the linear component of the model (the air pollution effects) is of interest, and the smooth component of the model (the smooth functions of time and temperatures) is nuisance. 6

17 Therefore, we proposed a model selection criterion that optimizes the inferential properties (MSE) of the parameter of interests. We compared our approach with two common alternatives in the literature based upon minimizing prediction error (AIC) or eliminating serial correlation in the residuals (PACF). We found that selecting df that minimizes AIC or PACF leads to a richer model, which removes bias, but may also inflate the variance of ˆβ. Choosing df that minimizes the MSE leads to a more parsimonious model. However under this model, the residuals may be autocorrelated, and therefore robust estimation of the variance is recommended, as for example, by use of quasi-likelihood methods. In summary, when the linear component of a semi-parametric model is the goal of the inference, then visual inspection of MSE and its comparison with AIC and PACF provides a comprehensive diagnostic tool for df-selection. The application of our semi-automated procedure for the df-selection to the four cities further illustrates the complex relationship between bias of β and the degree of adjustment for confounding factors. We found that, when an air pollution effect (as in Pittsburgh and Chicago) exists, bias mostly arises from the relationship between P M 0 and the other time-varying covariates included in the model, also called concurvity (Buja et al., 989). Thus choosing df that best predict P M 0 from the smooth component of the model can reduce bias substantially, but it may not be sufficient to eliminate the bias in full. Visual inspection of MSE plots provides further guidance, and identifies the optimal value of α for bias/variance tradeoff. We also found that, although for small values of α the PACF is large for most the cities, the estimated over-dispersion parameter was generally small. Therefore, autocorrelation of the residuals has little impact on the statistical variance of the relative rate estimates. The choice between fully parametric versus semi-parametric formulations of the smooth component of the model has been raised in recent discussions. Although GAM-exact and GLM rely upon different estimation methods, we did not find any noticeable difference between these two models in risk estimation of air pollution. We have implemented simulation studies where we have generated time-series data from a model with known smooth functions of time and temperature and we have calculated MSE and AIC as a function of α under GAM-exact and GLM. The MSE and AIC plots were very similar under the two modelling approaches (results not shown). Our sensitivity analyses of the national-average air pollution effects to model choice confirms this similarity: the nationalaverage relative rate estimates under GLM and under GAM-exact were almost identical for any value of α, thus suggesting that the choice between parametric versus semi-parametric formulations of the smooth component of the model is less critical than the degree of adjustment for confounding factors. 7

18 References Akaike, H. (973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds. B. Petrox and F. Caski, 267. American Thoracic Society, Bascom, R. (996a). Health effects of outdoor air pollution, Part. American Journal of Respiratory and Critical Care Medicine, 53, (996b). Health effects of outdoor air pollution, Part 2. American Journal of Respiratory and Critical Care Medicine, 53, Buja, A., Hastie, T., and Tibshirani, R. (989). Linear smoothers and additive models. The Annals of Statistics, 7, Burnett, R. and Krewski, D. (994). Air Pollution effects of hospital admission rates: A random effects modelling approach. The Canadian Journal of Statistics, 22, Burnett, R., Renjun, M., Jerrett, M., Goldberg, M., S., C., A., P., and Krewski, D. (200). The Spatial Association between Community Air Pollution and Mortality: A New method for Analyzing Correlated Geographic Cohort Data. Environmental Health Perspectives, 09, Chambers, J. and Hastie, T. (992). Statistical Models in S. Chapman and Hall, London. Clancy, L., Goodman, P., Sinclair, H., and Dockery, D. (2002). Effect of air-pollution control on death rates in Dublin, Ireland: an intervention study. Lancet, 360, Daniels, M., Dominici, F., and Zeger, S. (2002). Underestimation of Standard Errors in Time Series Studies of Air Pollution and Mortality. Technical report, Bloomberg School of Public Health, Johns Hopkins University. Dominici, F., Daniels, M., Zeger, S. L., and Samet, J. M. (2002a). Air Pollution and Mortality: Estimating Regional and National Dose-Response Relationships. Journal of the American Statistical Association, 97, 00. Dominici, F., McDermott, A., Zeger, S. L., and Samet, J. M. (2002b). 0n Generalized Additive Models in Time Series Studies of Air Pollution and Health. American Journal of Epidemiology, 56, 3,. (2002c). Airborne Particulate Matter and Mortality: Time-Scale Effects in Four US Cities. American Journal of Epidemiology (in press). (2002d). A Report to the Health Effects Institute on Reanalyses of the NMMAPS Database. The Health Effects Institute, Cambridge, MA. 8

19 Dominici, F., Samet, J. M., and Zeger, S. L. (2000). Combining Evidence on Air pollution and Daily Mortality from the Twenty Largest US cities: A Hierarchical Modeling Strategy (with discussion). Royal Statistical Society, Series A, with discussion, 63, Durban, M., Hackett, C., and Currie, I. (999). Approximate Standard Errors in Semiparametric Models. Biometrics, 55, Environmental Protection Agency (970). The Clean Air Act (CAA); 42 U.S.C. s/s 740 et seq. (970) Clean Air Act and Amendments of 970 (PL 9-604; 42 USC 857h-7 et seq.; amended 970. US Environmental Protection Agency. Environmental Protection Agency (996). Review of the National Ambient Air Quality Standards for Particulate Matter: Policy Assessment of Scientific and Technical Information. OAQPS Staff Paper. Research Triangle Park, North Carolina, U.S. Government Printing Office. Environmental Protection Agency. (200). Air Quality Criteria for Particulate Matter: Second External Review Draft March 200. US Environmental Protection Agency, Office of Research and Development. Everson, P. and Morris, C. (2000). Inference for multivariate Normal hierarchical models. Journal of the Royal Statistical Society, series B, 62, Friedman, J. and Stuetzle, W. (98). Projection Pursuit Regression. Journal of American Statistical Association, 76, Goldberg, M., Burnett, R., Valois, M., Flegel, K., and Bailar, J. (2003). Associations between ambient air pollution and daily mortality among persons with congestive heart failure. Environ Research, 9, Green, P., Jennison, C., and Seheult, A. (985). Analysis of Field Experiments by Least Square Smoothing. Journal of the Royal Statistical Society, 47, 2, Green, P. J. and Silverman, B. W. (994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London UK. Hastie, T., Tibshirani, R., and Buja, A. (993). Scoring. Tech. rep. Flexible Discriminant Analysis by Optimal Hastie, T. J. and Tibshirani, R. J. (990). Generalized additive models. Chapman and Hall, New York. Katsouyanni, K., Schwartz, J., Spix, C., Touloumi, G., Zmirou, D., Zanobetti, A., and Wojtyniak, B. (996). Short term effects of air pollution on health: a European approach using epidemiologic time series data: the APHEA protocol. J Epidemiol Community Health, 50, Supp : S2 8. 9

20 Katsouyanni, K., Toulomi, G., Spix, C., Balducci, F., Medina, S., Rossi, G., Wojtyniak, B., Sunyer, J., Bacharova, L., Schouten, J., Ponka, A., and Anderson, H. R. (997). Short term effects of ambient sulphur dioxide and particulate matter on mortality in 2 European cities: results from time series data from the APHEA project. British Medical Journal, 34, Kelsall, J., Samet, J. M., and Zeger, S. L. (997). Air Pollution, and Mortality in Philadelphia, American Journal of Epidemiology, 46, Klein, M., Flanders, W., and Tolbert, P. (2002). Variances may be underestimated using available software for generalized additive models (abstract). American Journal of Epidemiology (supplement), 55, S06. Lee, J., Kim, H., Song, H., Hong, Y., Cho, Y., Shin, S., Hyun, Y. J., and Kim, Y. (2002). Air pollution and asthma among children in Seoul, Korea. Epidemiology, 3, Lipfert, F. and Wyzga, R. (993). Air Pollution and Mortality: Issues and Uncertainty. Journal Air Waste Manage. Assoc, 45, McCullagh, P. and Nelder, J. A. (989). Generalized Linear Models (Second Edition). New York: Chapman & Hall. Nelder, J. A. and Wedderburn, R. W. M. (972). Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 35, Pope, C. A., Dockery, D., and Schwartz, J. (995). Review of Epidemiological evidence of health effects of particulate air pollution. Inhal.Toxicol., 47, -8. Ramsay, T., Burnett, R., and Krewski, D. (2002). The effect of concurvity in generalized additive models linking mortality and ambient air pollution. Epidemiology (to appear). Samet, J. M., Dominici, F., Curriero, F., Coursac, I., and Zeger, S. L. (2000a). Fine Particulate air pollution and Mortality in 20 U.S. Cities: New England Journal of Medicine (with discussion), 343, 24, Samet, J. M., Zeger, S. L., and Berhane, K. (995). The Association of Mortality and Particulate Air Pollution. Health Effects Institute, Cambridge, MA. Samet, J. M., Zeger, S. L., Dominici, F., Curriero, F., Coursac, I., Dockery, D., Schwartz, J., and Zanobetti, A. (2000b). The National Morbidity, Mortality, and Air Pollution Study Part II: Morbidity and Mortality from Air Pollution in the United States. Health Effects Institute, Cambridge, MA. Samet, J. M., Zeger, S. L., Dominici, F., Dockery, D., and Schwartz, J. (2000c). The National Morbidity, Mortality, and Air Pollution Study Part I: Methods and Methodological Issues. Health Effects Institute, Cambridge, MA. 20

21 Samet, J. M., Zeger, S. L., Kelsall, J., Xu, J., and Kalkstein, L. (997). Air pollution, weather and mortality in Philadelphia, In Particulate Air Pollution and Daily Mortality: Analyses of the Effects of Weather and Multiple Air Pollutants. The Phase IB report of the Particle Epidemiology Evaluation Project. Health Effects Institute, Cambridge, MA. Schwartz, J. (995). Air Pollution and Daily Mortality in Birmingham, Alabama. American Journal of Epidemiology, 37, (2000). Assessing Confounding, Effect Modification, and Thresholds in the Associations between Ambient Particles and Daily Deaths. Environmental Health Perspective, 08, Schwartz, J. and Marcus, A. (990). Mortality and Air Pollution in London: A time series analysis. American Journal of Epidemiology, 3, Schwartz, J., Spix, C., Touloumi, G., Bacharova, L., and Barumamdzadeh, T. e. a. (996). Methodological issues in studies of air pollution and daily counts of deaths or hospital admissions. J Epidemiol Community Health, 50, Supp : S. Schwartz, M. (978). A Mathematical Model Used to Analyze Breast Cancer Screening Strategies (STMA V2 863). Operations Research, Journal of Operations Research Society of America, 26, Speckman, P. (988). Kernel Smoothing in Partial Linear Models. Journal of the Royal Statistical Society Series B, 50, Stieb, D., Judek, S., and Burnett, R. (2002). Meta-analysis of time-series studies of air pollution and mortality: effects of gases and particles and the influence of cause of death, age, and season. J Air Waste Manag Assoc., 52, Toulomi, G., Katsouyanni, K., Zmirou, D., and Schwartz, J. (997). Short-Term Effects of Ambient Oxidant Exposure on Mortality: A combined Analysis within the APHEA Project. American Journal of Epidemiology, 46, Zanobetti, A., Schwartz, J., and Dockery, D. (2000). Airborne particles are a risk factor for hospital admissions for heart and lung disease. Environmental Health Perspective, 08,

22 Figure Legends. Simulation Results: In the top panels red curves indicate the known f(t) and g(t). In the bottom left panel the red points indicate true MSE as function of df. Notice that the true MSE is minimized at df = 5. In the top panels the black curves indicate the estimated ˆf(t) and ĝ(t) obtained by modelling f(t) and g(t) as natural cubic splines with 5 degrees of freedom in (4) and (5). In the bottom left panel the boxplots indicate the distribution of the estimated MSE for the 500 data sets sampled at each df. The triangle is placed at the estimated ˆdf from our procedure which coincides with the one that minimizes the true MSE. In the bottom right panel are compared four model selection s measures as function of df: MSE, AIC, BIC, and PACF. 2. Four cities results: left panels show boxplots of ˆβ b (α) for each city and for each value of α. Red and black horizontal lines are placed at ˆβ(α max ) and at 0, respectively. Right panels show MSE, AIC and PACF estimates as function of α. To ease the comparison, the values have been normalized to vary in [0,]. 3. NMMAPS sensitivity analysis: top left panels show national average estimates (posterior means) as function of α. Black dots denote estimates under GAM with approximated standard errors, the red dots denote estimates under GAM with asymptotically exact standard errors, and the blue dots denote estimates under GLM. The grey polygon represents the 95% posterior intervals of the national average estimates under the GAM model with exact standard errors. The red segment is placed at α =, e.g the degree of adjustment used in the NMMAPS model. The black curves in the top right panel denote the city-specific Bayesian estimates of the relative rates under GAM with asymptotically exact standard errors. Bottom panels show the posterior means of the average s.e. of ˆβ c ( 90 c vc (α)) (left) and posterior means of the heterogeneity parameters τ(α) (right). 22

American Journal of EPIDEMIOLOGY

American Journal of EPIDEMIOLOGY Volume 156 Number 3 August 1, 2002 American Journal of EPIDEMIOLOGY Copyright 2002 by The Johns Hopkins Bloomberg School of Public Health Sponsored by the Society for Epidemiologic Research Published by

More information

GENERALIZED ADDITIVE MODELS FOR DATA WITH CONCURVITY: STATISTICAL ISSUES AND A NOVEL MODEL FITTING APPROACH. by Shui He B.S., Fudan University, 1993

GENERALIZED ADDITIVE MODELS FOR DATA WITH CONCURVITY: STATISTICAL ISSUES AND A NOVEL MODEL FITTING APPROACH. by Shui He B.S., Fudan University, 1993 GENERALIZED ADDITIVE MODELS FOR DATA WITH CONCURVITY: STATISTICAL ISSUES AND A NOVEL MODEL FITTING APPROACH by Shui He B.S., Fudan University, 1993 Submitted to the Graduate Faculty of the Graduate School

More information

the university of british columbia department of statistics technical report # 217

the university of british columbia department of statistics technical report # 217 the university of british columbia department of statistics technical report # 217 seasonal confounding and residual correlation in analyses of health effects of air pollution by isabella r ghement nancy

More information

An application of the GAM-PCA-VAR model to respiratory disease and air pollution data

An application of the GAM-PCA-VAR model to respiratory disease and air pollution data An application of the GAM-PCA-VAR model to respiratory disease and air pollution data Márton Ispány 1 Faculty of Informatics, University of Debrecen Hungary Joint work with Juliana Bottoni de Souza, Valdério

More information

A measurement error model for time-series studies of air pollution and mortality

A measurement error model for time-series studies of air pollution and mortality Biostatistics (2000), 1, 2,pp 157 175 Printed in Great Britain A measurement error model for time-series studies of air pollution and mortality FRANCESCA DOMINICI AND SCOTT L ZEGER Department of Biostatistics,

More information

Use R! Series Editors: Robert Gentleman Kurt Hornik Giovanni Parmigiani

Use R! Series Editors: Robert Gentleman Kurt Hornik Giovanni Parmigiani Use R! Series Editors: Robert Gentleman Kurt Hornik Giovanni Parmigiani Use R! Albert: Bayesian Computation with R Bivand/Pebesma/Gomez-Rubio: Applied Spatial Data Analysis with R Claude:Morphometrics

More information

Modelling Survival Data using Generalized Additive Models with Flexible Link

Modelling Survival Data using Generalized Additive Models with Flexible Link Modelling Survival Data using Generalized Additive Models with Flexible Link Ana L. Papoila 1 and Cristina S. Rocha 2 1 Faculdade de Ciências Médicas, Dep. de Bioestatística e Informática, Universidade

More information

Estimating the long-term health impact of air pollution using spatial ecological studies. Duncan Lee

Estimating the long-term health impact of air pollution using spatial ecological studies. Duncan Lee Estimating the long-term health impact of air pollution using spatial ecological studies Duncan Lee EPSRC and RSS workshop 12th September 2014 Acknowledgements This is joint work with Alastair Rushworth

More information

Partial Generalized Additive Models

Partial Generalized Additive Models Partial Generalized Additive Models An Information-theoretic Approach for Selecting Variables and Avoiding Concurvity Hong Gu 1 Mu Zhu 2 1 Department of Mathematics and Statistics Dalhousie University

More information

Modern regression and Mortality

Modern regression and Mortality Modern regression and Mortality Doug Nychka National Center for Atmospheric Research www.cgd.ucar.edu/~nychka Statistical models GLM models Flexible regression Combining GLM and splines. Statistical tools

More information

Flexible Spatio-temporal smoothing with array methods

Flexible Spatio-temporal smoothing with array methods Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session IPS046) p.849 Flexible Spatio-temporal smoothing with array methods Dae-Jin Lee CSIRO, Mathematics, Informatics and

More information

Generalized Additive Models

Generalized Additive Models Generalized Additive Models The Model The GLM is: g( µ) = ß 0 + ß 1 x 1 + ß 2 x 2 +... + ß k x k The generalization to the GAM is: g(µ) = ß 0 + f 1 (x 1 ) + f 2 (x 2 ) +... + f k (x k ) where the functions

More information

Statistical methods for evaluating air pollution and temperature effects on human health

Statistical methods for evaluating air pollution and temperature effects on human health Statistical methods for evaluating air pollution and temperature effects on human health Ho Kim School of Public Health, Seoul National University ISES-ISEE 2010 Workshop August 28, 2010, COEX, Seoul,

More information

Deposited on: 07 September 2010

Deposited on: 07 September 2010 Lee, D. and Shaddick, G. (2008) Modelling the effects of air pollution on health using Bayesian dynamic generalised linear models. Environmetrics, 19 (8). pp. 785-804. ISSN 1180-4009 http://eprints.gla.ac.uk/36768

More information

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) Generalized Additive Models (GAMs) Israel Borokini Advanced Analysis Methods in Natural Resources and Environmental Science (NRES 746) October 3, 2016 Outline Quick refresher on linear regression Generalized

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

The effect of air pollution & climate on hospital admissions for chronic obstructive airways disease: a non-parametric alternative

The effect of air pollution & climate on hospital admissions for chronic obstructive airways disease: a non-parametric alternative The effect of air pollution & climate on hospital admissions for chronic obstructive airways disease: a non-parametric alternative Short title Authors Bircan Erbas Department of General Practice & Public

More information

Extended Follow-Up and Spatial Analysis of the American Cancer Society Study Linking Particulate Air Pollution and Mortality

Extended Follow-Up and Spatial Analysis of the American Cancer Society Study Linking Particulate Air Pollution and Mortality Extended Follow-Up and Spatial Analysis of the American Cancer Society Study Linking Particulate Air Pollution and Mortality Daniel Krewski, Michael Jerrett, Richard T Burnett, Renjun Ma, Edward Hughes,

More information

Lecture 1. Introduction Statistics Statistical Methods II. Presented January 8, 2018

Lecture 1. Introduction Statistics Statistical Methods II. Presented January 8, 2018 Introduction Statistics 211 - Statistical Methods II Presented January 8, 2018 linear models Dan Gillen Department of Statistics University of California, Irvine 1.1 Logistics and Contact Information Lectures:

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Selection of Variables and Functional Forms in Multivariable Analysis: Current Issues and Future Directions

Selection of Variables and Functional Forms in Multivariable Analysis: Current Issues and Future Directions in Multivariable Analysis: Current Issues and Future Directions Frank E Harrell Jr Department of Biostatistics Vanderbilt University School of Medicine STRATOS Banff Alberta 2016-07-04 Fractional polynomials,

More information

Inversion Base Height. Daggot Pressure Gradient Visibility (miles)

Inversion Base Height. Daggot Pressure Gradient Visibility (miles) Stanford University June 2, 1998 Bayesian Backtting: 1 Bayesian Backtting Trevor Hastie Stanford University Rob Tibshirani University of Toronto Email: trevor@stat.stanford.edu Ftp: stat.stanford.edu:

More information

MULTI-CITY TIME SERIES ANALYSES OF AIR POLLUTION AND MORTALITY DATA USING GENERALIZED GEOADDITIVE MIXED MODELS. by Lung-Chang Chien

MULTI-CITY TIME SERIES ANALYSES OF AIR POLLUTION AND MORTALITY DATA USING GENERALIZED GEOADDITIVE MIXED MODELS. by Lung-Chang Chien MULTI-CITY TIME SERIES ANALYSES OF AIR POLLUTION AND MORTALITY DATA USING GENERALIZED GEOADDITIVE MIXED MODELS by Lung-Chang Chien A dissertation submitted to the faculty of the University of North Carolina

More information

Flexible modelling of the cumulative effects of time-varying exposures

Flexible modelling of the cumulative effects of time-varying exposures Flexible modelling of the cumulative effects of time-varying exposures Applications in environmental, cancer and pharmaco-epidemiology Antonio Gasparrini Department of Medical Statistics London School

More information

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University

More information

Identifying effect modifiers in air pollution time-series studies using a two-stage analysis

Identifying effect modifiers in air pollution time-series studies using a two-stage analysis Identifying effect modifiers in air pollution time-series studies using a two-stage analysis Sandrah P. Eckel Thomas A. Louis Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to th

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to th DISCUSSION OF THE PAPER BY LIN AND YING Xihong Lin and Raymond J. Carroll Λ July 21, 2000 Λ Xihong Lin (xlin@sph.umich.edu) is Associate Professor, Department ofbiostatistics, University of Michigan, Ann

More information

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions JKAU: Sci., Vol. 21 No. 2, pp: 197-212 (2009 A.D. / 1430 A.H.); DOI: 10.4197 / Sci. 21-2.2 Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions Ali Hussein Al-Marshadi

More information

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA STATISTICS IN MEDICINE, VOL. 17, 59 68 (1998) CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA J. K. LINDSEY AND B. JONES* Department of Medical Statistics, School of Computing Sciences,

More information

5 Autoregressive-Moving-Average Modeling

5 Autoregressive-Moving-Average Modeling 5 Autoregressive-Moving-Average Modeling 5. Purpose. Autoregressive-moving-average (ARMA models are mathematical models of the persistence, or autocorrelation, in a time series. ARMA models are widely

More information

Luke B Smith and Brian J Reich North Carolina State University May 21, 2013

Luke B Smith and Brian J Reich North Carolina State University May 21, 2013 BSquare: An R package for Bayesian simultaneous quantile regression Luke B Smith and Brian J Reich North Carolina State University May 21, 2013 BSquare in an R package to conduct Bayesian quantile regression

More information

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS020) p.3863 Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Jinfang Wang and

More information

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Spatial Variation in Infant Mortality with Geographically Weighted Poisson Regression (GWPR) Approach

Spatial Variation in Infant Mortality with Geographically Weighted Poisson Regression (GWPR) Approach Spatial Variation in Infant Mortality with Geographically Weighted Poisson Regression (GWPR) Approach Kristina Pestaria Sinaga, Manuntun Hutahaean 2, Petrus Gea 3 1, 2, 3 University of Sumatera Utara,

More information

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking Model checking overview Checking & Selecting GAMs Simon Wood Mathematical Sciences, University of Bath, U.K. Since a GAM is just a penalized GLM, residual plots should be checked exactly as for a GLM.

More information

Challenges in modelling air pollution and understanding its impact on human health

Challenges in modelling air pollution and understanding its impact on human health Challenges in modelling air pollution and understanding its impact on human health Alastair Rushworth Joint Statistical Meeting, Seattle Wednesday August 12 th, 2015 Acknowledgements Work in this talk

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

Spatial bias modeling with application to assessing remotely-sensed aerosol as a proxy for particulate matter

Spatial bias modeling with application to assessing remotely-sensed aerosol as a proxy for particulate matter Spatial bias modeling with application to assessing remotely-sensed aerosol as a proxy for particulate matter Chris Paciorek Department of Biostatistics Harvard School of Public Health application joint

More information

Measurement Error in Spatial Modeling of Environmental Exposures

Measurement Error in Spatial Modeling of Environmental Exposures Measurement Error in Spatial Modeling of Environmental Exposures Chris Paciorek, Alexandros Gryparis, and Brent Coull August 9, 2005 Department of Biostatistics Harvard School of Public Health www.biostat.harvard.edu/~paciorek

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

The effect of weather on respiratory and cardiovascular deaths in 12 U.S. cities.

The effect of weather on respiratory and cardiovascular deaths in 12 U.S. cities. The effect of weather on respiratory and cardiovascular deaths in 12 U.S. cities. The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters.

More information

CRISP: Capture-Recapture Interactive Simulation Package

CRISP: Capture-Recapture Interactive Simulation Package CRISP: Capture-Recapture Interactive Simulation Package George Volichenko Carnegie Mellon University Pittsburgh, PA gvoliche@andrew.cmu.edu December 17, 2012 Contents 1 Executive Summary 1 2 Introduction

More information

Step 2: Select Analyze, Mixed Models, and Linear.

Step 2: Select Analyze, Mixed Models, and Linear. Example 1a. 20 employees were given a mood questionnaire on Monday, Wednesday and again on Friday. The data will be first be analyzed using a Covariance Pattern model. Step 1: Copy Example1.sav data file

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Fractional Polynomials and Model Averaging

Fractional Polynomials and Model Averaging Fractional Polynomials and Model Averaging Paul C Lambert Center for Biostatistics and Genetic Epidemiology University of Leicester UK paul.lambert@le.ac.uk Nordic and Baltic Stata Users Group Meeting,

More information

An Introduction to GAMs based on penalized regression splines. Simon Wood Mathematical Sciences, University of Bath, U.K.

An Introduction to GAMs based on penalized regression splines. Simon Wood Mathematical Sciences, University of Bath, U.K. An Introduction to GAMs based on penalied regression splines Simon Wood Mathematical Sciences, University of Bath, U.K. Generalied Additive Models (GAM) A GAM has a form something like: g{e(y i )} = η

More information

GLM models and OLS regression

GLM models and OLS regression GLM models and OLS regression Graeme Hutcheson, University of Manchester These lecture notes are based on material published in... Hutcheson, G. D. and Sofroniou, N. (1999). The Multivariate Social Scientist:

More information

Statistical and epidemiological considerations in using remote sensing data for exposure estimation

Statistical and epidemiological considerations in using remote sensing data for exposure estimation Statistical and epidemiological considerations in using remote sensing data for exposure estimation Chris Paciorek Department of Biostatistics Harvard School of Public Health Collaborators: Yang Liu, Doug

More information

MLR Model Selection. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project

MLR Model Selection. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project MLR Model Selection Author: Nicholas G Reich, Jeff Goldsmith This material is part of the statsteachr project Made available under the Creative Commons Attribution-ShareAlike 3.0 Unported License: http://creativecommons.org/licenses/by-sa/3.0/deed.en

More information

Design of Screening Experiments with Partial Replication

Design of Screening Experiments with Partial Replication Design of Screening Experiments with Partial Replication David J. Edwards Department of Statistical Sciences & Operations Research Virginia Commonwealth University Robert D. Leonard Department of Information

More information

Generalized Additive Models

Generalized Additive Models By Trevor Hastie and R. Tibshirani Regression models play an important role in many applied settings, by enabling predictive analysis, revealing classification rules, and providing data-analytic tools

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Simultaneous Confidence Bands for the Coefficient Function in Functional Regression

Simultaneous Confidence Bands for the Coefficient Function in Functional Regression University of Haifa From the SelectedWorks of Philip T. Reiss August 7, 2008 Simultaneous Confidence Bands for the Coefficient Function in Functional Regression Philip T. Reiss, New York University Available

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Using Estimating Equations for Spatially Correlated A

Using Estimating Equations for Spatially Correlated A Using Estimating Equations for Spatially Correlated Areal Data December 8, 2009 Introduction GEEs Spatial Estimating Equations Implementation Simulation Conclusion Typical Problem Assess the relationship

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

Comparing the Relationships Between Heat Stress Indices and Mortality in North Carolina

Comparing the Relationships Between Heat Stress Indices and Mortality in North Carolina Comparing the Relationships Between Heat Stress Indices and Mortality in North Carolina Jordan Clark PhD Student CISA Research Assistant Department of Geography UNC-Chapel Hill 10/30/2018 Overview Background

More information

Partial Generalized Additive Models: An Information-Theoretic Approach for Dealing With Concurvity and Selecting Variables

Partial Generalized Additive Models: An Information-Theoretic Approach for Dealing With Concurvity and Selecting Variables Supplementary materials for this article are available online. PleaseclicktheJCGSlinkathttp://pubs.amstat.org. Partial Generalized Additive Models: An Information-Theoretic Approach for Dealing With Concurvity

More information

Measurement error effects on bias and variance in two-stage regression, with application to air pollution epidemiology

Measurement error effects on bias and variance in two-stage regression, with application to air pollution epidemiology Measurement error effects on bias and variance in two-stage regression, with application to air pollution epidemiology Chris Paciorek Department of Statistics, University of California, Berkeley and Adam

More information

Gradient types. Gradient Analysis. Gradient Gradient. Community Community. Gradients and landscape. Species responses

Gradient types. Gradient Analysis. Gradient Gradient. Community Community. Gradients and landscape. Species responses Vegetation Analysis Gradient Analysis Slide 18 Vegetation Analysis Gradient Analysis Slide 19 Gradient Analysis Relation of species and environmental variables or gradients. Gradient Gradient Individualistic

More information

Regularization in Cox Frailty Models

Regularization in Cox Frailty Models Regularization in Cox Frailty Models Andreas Groll 1, Trevor Hastie 2, Gerhard Tutz 3 1 Ludwig-Maximilians-Universität Munich, Department of Mathematics, Theresienstraße 39, 80333 Munich, Germany 2 University

More information

Part I State space models

Part I State space models Part I State space models 1 Introduction to state space time series analysis James Durbin Department of Statistics, London School of Economics and Political Science Abstract The paper presents a broad

More information

Partial Generalized Additive Models

Partial Generalized Additive Models An information theoretical approach to avoid the concurvity Hong Gu 1 Mu Zhu 2 1 Department of Mathematics and Statistics Dalhousie University 2 Department of Statistics and Actuarial Science University

More information

Short-term effects of ultrafine, fine and coarse particles on cause-specific mortality in 7 European cities Massimo Stafoggia

Short-term effects of ultrafine, fine and coarse particles on cause-specific mortality in 7 European cities Massimo Stafoggia Short-term effects of ultrafine, fine and coarse particles on cause-specific mortality in 7 European cities Massimo Stafoggia Dresden, Nov 28 th 2014 BACKGROUND Conflicting evidence on UF effects, often

More information

One-stage dose-response meta-analysis

One-stage dose-response meta-analysis One-stage dose-response meta-analysis Nicola Orsini, Alessio Crippa Biostatistics Team Department of Public Health Sciences Karolinska Institutet http://ki.se/en/phs/biostatistics-team 2017 Nordic and

More information

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall 1 Structural Nested Mean Models for Assessing Time-Varying Effect Moderation Daniel Almirall Center for Health Services Research, Durham VAMC & Dept. of Biostatistics, Duke University Medical Joint work

More information

R-squared for Bayesian regression models

R-squared for Bayesian regression models R-squared for Bayesian regression models Andrew Gelman Ben Goodrich Jonah Gabry Imad Ali 8 Nov 2017 Abstract The usual definition of R 2 (variance of the predicted values divided by the variance of the

More information

Currie, Iain Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh EH14 4AS, UK

Currie, Iain Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh EH14 4AS, UK An Introduction to Generalized Linear Array Models Currie, Iain Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh EH14 4AS, UK E-mail: I.D.Currie@hw.ac.uk 1 Motivating

More information

Hierarchical modelling of performance indicators, with application to MRSA & teenage conception rates

Hierarchical modelling of performance indicators, with application to MRSA & teenage conception rates Hierarchical modelling of performance indicators, with application to MRSA & teenage conception rates Hayley E Jones School of Social and Community Medicine, University of Bristol, UK Thanks to David Spiegelhalter,

More information

Downloaded from:

Downloaded from: Camacho, A; Kucharski, AJ; Funk, S; Breman, J; Piot, P; Edmunds, WJ (2014) Potential for large outbreaks of Ebola virus disease. Epidemics, 9. pp. 70-8. ISSN 1755-4365 DOI: https://doi.org/10.1016/j.epidem.2014.09.003

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Lecture 5: Poisson and logistic regression

Lecture 5: Poisson and logistic regression Dankmar Böhning Southampton Statistical Sciences Research Institute University of Southampton, UK S 3 RI, 3-5 March 2014 introduction to Poisson regression application to the BELCAP study introduction

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting

Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting Andreas Groll 1 and Gerhard Tutz 2 1 Department of Statistics, University of Munich, Akademiestrasse 1, D-80799, Munich,

More information

Linkage Methods for Environment and Health Analysis General Guidelines

Linkage Methods for Environment and Health Analysis General Guidelines Health and Environment Analysis for Decision-making Linkage Analysis and Monitoring Project WORLD HEALTH ORGANIZATION PUBLICATIONS Linkage Methods for Environment and Health Analysis General Guidelines

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Sample size determination for logistic regression: A simulation study

Sample size determination for logistic regression: A simulation study Sample size determination for logistic regression: A simulation study Stephen Bush School of Mathematical Sciences, University of Technology Sydney, PO Box 123 Broadway NSW 2007, Australia Abstract This

More information

Pumps, Maps and Pea Soup: Spatio-temporal methods in environmental epidemiology

Pumps, Maps and Pea Soup: Spatio-temporal methods in environmental epidemiology Pumps, Maps and Pea Soup: Spatio-temporal methods in environmental epidemiology Gavin Shaddick Department of Mathematical Sciences University of Bath 2012-13 van Eeden lecture Thanks Constance van Eeden

More information

Geographically Weighted Regression as a Statistical Model

Geographically Weighted Regression as a Statistical Model Geographically Weighted Regression as a Statistical Model Chris Brunsdon Stewart Fotheringham Martin Charlton October 6, 2000 Spatial Analysis Research Group Department of Geography University of Newcastle-upon-Tyne

More information

P -spline ANOVA-type interaction models for spatio-temporal smoothing

P -spline ANOVA-type interaction models for spatio-temporal smoothing P -spline ANOVA-type interaction models for spatio-temporal smoothing Dae-Jin Lee 1 and María Durbán 1 1 Department of Statistics, Universidad Carlos III de Madrid, SPAIN. e-mail: dae-jin.lee@uc3m.es and

More information

Lognormal Measurement Error in Air Pollution Health Effect Studies

Lognormal Measurement Error in Air Pollution Health Effect Studies Lognormal Measurement Error in Air Pollution Health Effect Studies Richard L. Smith Department of Statistics and Operations Research University of North Carolina, Chapel Hill rls@email.unc.edu Presentation

More information

REVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350

REVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350 bsa347 Logistic Regression Logistic regression is a method for predicting the outcomes of either-or trials. Either-or trials occur frequently in research. A person responds appropriately to a drug or does

More information

SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION. University of Minnesota

SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION. University of Minnesota Submitted to the Annals of Statistics arxiv: math.pr/0000000 SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION By Wei Liu and Yuhong Yang University of Minnesota In

More information

Bayesian spatial quantile regression

Bayesian spatial quantile regression Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )

More information

particulate matter and mortality: timescale effects in four US cities. Am

particulate matter and mortality: timescale effects in four US cities. Am References [1] H. Austin, W. D. Flanders, and K. J. Rothman. Bias arising in case control studies from selection of controls from overlapping groups. Int J Epidemiol, 18:713 716, 1989. [2] A. G. Barnett,

More information

Lecture 2: Poisson and logistic regression

Lecture 2: Poisson and logistic regression Dankmar Böhning Southampton Statistical Sciences Research Institute University of Southampton, UK S 3 RI, 11-12 December 2014 introduction to Poisson regression application to the BELCAP study introduction

More information