Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani

Use R!
Albert: Bayesian Computation with R
Bivand/Pebesma/Gomez-Rubio: Applied Spatial Data Analysis with R
Claude: Morphometrics with R
Cook/Swayne: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi
Hahne/Huber/Gentleman/Falcon: Bioconductor Case Studies
Nason: Wavelet Methods in Statistics with R
Paradis: Analysis of Phylogenetics and Evolution with R
Peng/Dominici: Statistical Methods for Environmental Epidemiology with R: A Case Study in Air Pollution and Health
Pfaff: Analysis of Integrated and Cointegrated Time Series with R, 2nd edition
Sarkar: Lattice: Multivariate Data Visualization with R
Spector: Data Manipulation with R

Roger D. Peng, Francesca Dominici
Statistical Methods for Environmental Epidemiology with R
A Case Study in Air Pollution and Health

Roger D. Peng, Francesca Dominici
Johns Hopkins Bloomberg School of Public Health
Johns Hopkins University
615 N. Wolfe St.
Baltimore MD 21205-2179
USA
rpeng@jhsph.edu
fdominic@jhsph.edu

Series Editors:

Robert Gentleman
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue, N. M2-B876
Seattle, Washington 98109
USA

Kurt Hornik
Department of Statistik and Mathematik
Wirtschaftsuniversität Wien
Augasse 2-6
A-1090 Wien
Austria

Giovanni Parmigiani
The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University
550 North Broadway
Baltimore, MD 21205-2011
USA

Library of Congress Control Number: 2008928295
ISBN 978-0-387-78166-2
e-ISBN 978-0-387-78167-9
DOI: 10.1007/978-0-387-78167-9

© 2008 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper.

987654321

springer.com

Preface

As an area of statistical application, environmental epidemiology, and more specifically the estimation of health risk associated with exposure to environmental agents, has led to the development of several statistical methods and software that can then be applied to other scientific areas. The statistical analyses aimed at addressing questions in environmental epidemiology have the following characteristics. Often the signal-to-noise ratio in the data is low and the targets of inference are inherently small risks. These constraints typically lead to the development and use of more sophisticated (and potentially less transparent) statistical models and the integration of large high-dimensional databases. New technologies and the widespread availability of powerful computing are also adding to the complexities of scientific investigation by allowing researchers to fit large numbers of models and search over many sets of variables. As the number of variables measured increases, so do the degrees of freedom for influencing the association between a risk factor and an outcome of interest.

We have written this book, in part, to describe our experiences developing and applying statistical methods for the estimation of air pollution health effects. Our experience has convinced us that the application of modern statistical methodology in a reproducible manner can bring substantial benefits to policy-makers and scientists in this area. We believe that the methods described in this book are applicable to other areas of environmental epidemiology, particularly those areas involving spatial-temporal exposures.

In this book, we use the National Morbidity, Mortality, and Air Pollution Study (NMMAPS) and Medicare Air Pollution Study (MCAPS) datasets and describe the R packages for accessing the data. Chapters 4, 5, 6, and 7 describe the features of the data, the statistical concepts involved, and many of the methods used to analyze the data.
Chapter 8 then shows how to bring all of the methods together to conduct a multi-site analysis of seasonally varying effects of PM10 on mortality.

A principal goal of this book is to disseminate R software and promote reproducible research in epidemiological studies and statistical research. As a case study we use data and methods relevant to investigating the health effects of ambient air pollution. Researching the health effects of air pollution presents an excellent example of the critical need for reproducible research because it involves all of the features mentioned above: inherently small risks, significant policy implications, sophisticated statistical methodology, and very large databases linked from multiple sources. The complexity of the analyses involved and the policy relevance of the targets of inference demand transparency and reproducibility.

Throughout the book, we show how R can be used to make analyses reproducible and to structure the analytic process in a modular fashion. We find R to be a very natural tool for achieving this goal. In particular, for the production of this book, we have made use of the tools described in Chapter 3. All of the data described in the book are provided in the NMMAPSlite and MCAPS R packages that can be downloaded from CRAN. [1] We have developed R packages for implementing the statistical methodology as well as for handling the databases. Packages that are not available from CRAN can be downloaded from the book's website. [2]

We would like to express our deepest appreciation to the many collaborators and students who have worked with us on various projects, short courses, and workshops that we have developed over the years. In particular, Aidan McDermott, Scott Zeger, Luu Pham, Jon Samet, Tom Louis, Leah Welty, Michelle Bell, and Sandy Eckel were all central to the development of the software, databases, exercises, and analyses presented in this book. Several anonymous reviewers provided helpful comments that improved the presentation of the material in the book. In addition, we would like to thank Duncan Thomas for many useful suggestions regarding an early draft of the manuscript.
Finally, this work was supported in part by grant ES012054-03 from the National Institute of Environmental Health Sciences.

Baltimore, Maryland, April 2008
Roger Peng
Francesca Dominici

[1] http://cran.r-project.org/
[2] http://www.biostat.jhsph.edu/~rpeng/userbook/

Contents

Preface

1 Studies of Air Pollution and Health
  1.1 Introduction
  1.2 Time Series Studies
  1.3 Case-Crossover Studies
  1.4 Panel Studies
  1.5 Cohort Studies
  1.6 Design Comparisons

2 Introduction to R and Air Pollution and Health Data
  2.1 Starting Up R
  2.2 The National Morbidity, Mortality, and Air Pollution Study
  2.3 Organization of the NMMAPSlite Package
    2.3.1 Reading city-specific data
    2.3.2 Pollutant data detrending
    2.3.3 Mortality age categories
    2.3.4 Metadata
    2.3.5 Configuration options
  2.4 MCAPS Data

3 Reproducible Research Tools
  3.1 Introduction
  3.2 Distributing Reproducible Research
  3.3 Getting Started
  3.4 Exploring a Cached Analysis
  3.5 Verifying a Cached Analysis
  3.6 Caching a Statistical Analysis
  3.7 Distributing a Cached Analysis
  3.8 Summary

4 Statistical Issues in Estimating the Health Effects of Spatial-Temporal Environmental Exposures
  4.1 Introduction
  4.2 Time-Varying Environmental Exposures
  4.3 Estimation Versus Prediction
  4.4 Semiparametric Models
    4.4.1 Overdispersion
    4.4.2 Representations for f
    4.4.3 Estimation of β
    4.4.4 Choosing the degrees of freedom for f
  4.5 Combining Information and Hierarchical Models

5 Exploratory Data Analyses
  5.1 Introduction
  5.2 Exploring the Data: Basic Features and Properties
    5.2.1 Pollutant data
    5.2.2 Mortality data
  5.3 Exploratory Statistical Analysis
    5.3.1 Timescale decompositions
    5.3.2 Example: Timescale decompositions of PM10 and mortality
    5.3.3 Correlation at different timescales: A look at the Chicago data
    5.3.4 Looking at more detailed timescales
  5.4 Exploring the Potential for Confounding Bias
  5.5 Summary
  5.6 Reproducibility Package
  5.7 Problems

6 Statistical Models
  6.1 Introduction
  6.2 Models for Air Pollution and Health
  6.3 Semiparametric Models
    6.3.1 GAMs in R
  6.4 Pollutants: The Exposure of Interest
    6.4.1 Single versus distributed lag
    6.4.2 Mortality displacement
  6.5 Modeling Measured Confounders
  6.6 Accounting for Unmeasured Confounders
    6.6.1 Using GAMs for air pollution and health
    6.6.2 Computing standard errors for parametric terms in GAMs
    6.6.3 Choosing degrees of freedom from the data
    6.6.4 Example: Semiparametric model for Detroit
    6.6.5 Smoothers

  6.7 Multisite Studies: Putting It All Together
  6.8 Summary
  6.9 Reproducibility Package
  6.10 Problems

7 Pooling Risks Across Locations and Quantifying Spatial Heterogeneity
  7.1 Hierarchical Models for Multisite Time Series Studies of Air Pollution and Health
    7.1.1 Two-stage hierarchical model
    7.1.2 Three-stage hierarchical model
    7.1.3 Spatial correlation model
    7.1.4 Sensitivity analyses to the adjustment for confounders
  7.2 Example: Examining Sensitivity to Prior Distributions
  7.3 Reproducibility Package
  7.4 Problems

8 A Reproducible Seasonal Analysis of Particulate Matter and Mortality in the United States
  8.1 Introduction
  8.2 Methods
    8.2.1 Combining information across cities
  8.3 Results
    8.3.1 Sensitivity analyses
  8.4 Comments
  8.5 Reproducibility Package

References

Index