Use R! Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani

Use R!

Albert: Bayesian Computation with R
Bivand/Pebesma/Gómez-Rubio: Applied Spatial Data Analysis with R
Claude: Morphometrics with R
Cook/Swayne: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi
Hahne/Huber/Gentleman/Falcon: Bioconductor Case Studies
Nason: Wavelet Methods in Statistics with R
Paradis: Analysis of Phylogenetics and Evolution with R
Peng/Dominici: Statistical Methods for Environmental Epidemiology with R: A Case Study in Air Pollution and Health
Pfaff: Analysis of Integrated and Cointegrated Time Series with R, 2nd edition
Sarkar: Lattice: Multivariate Data Visualization with R
Spector: Data Manipulation with R

Roger D. Peng
Francesca Dominici

Statistical Methods for Environmental Epidemiology with R
A Case Study in Air Pollution and Health

Roger D. Peng
Francesca Dominici
Johns Hopkins Bloomberg School of Public Health
615 N. Wolfe St.
Johns Hopkins University
Baltimore MD 21205-2179
USA
rpeng@jhsph.edu
fdominic@jhsph.edu

Series Editors:

Robert Gentleman
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue, N. M2-B876
Seattle, Washington 98109
USA

Kurt Hornik
Department of Statistik and Mathematik
Wirtschaftsuniversität Wien
Augasse 2-6
A-1090 Wien
Austria

Giovanni Parmigiani
The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University
550 North Broadway
Baltimore, MD 21205-2011
USA

Library of Congress Control Number: 2008928295
ISBN 978-0-387-78166-2
e-ISBN 978-0-387-78167-9
DOI: 10.1007/978-0-387-78167-9

© 2008 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper.

9 8 7 6 5 4 3 2 1

springer.com

Preface

As an area of statistical application, environmental epidemiology, and more specifically the estimation of health risk associated with exposure to environmental agents, has led to the development of several statistical methods and software that can then be applied to other scientific areas. The statistical analyses aimed at addressing questions in environmental epidemiology have the following characteristics. Often the signal-to-noise ratio in the data is low and the targets of inference are inherently small risks. These constraints typically lead to the development and use of more sophisticated (and potentially less transparent) statistical models and the integration of large high-dimensional databases. New technologies and the widespread availability of powerful computing are also adding to the complexities of scientific investigation by allowing researchers to fit large numbers of models and search over many sets of variables. As the number of variables measured increases, so do the degrees of freedom for influencing the association between a risk factor and an outcome of interest.

We have written this book, in part, to describe our experiences developing and applying statistical methods for the estimation of air pollution health effects. Our experience has convinced us that the application of modern statistical methodology in a reproducible manner can bring substantial benefits to policy-makers and scientists in this area. We believe that the methods described in this book are applicable to other areas of environmental epidemiology, particularly those areas involving spatial-temporal exposures.

In this book, we use the National Morbidity, Mortality, and Air Pollution Study (NMMAPS) and Medicare Air Pollution Study (MCAPS) datasets and describe the R packages for accessing the data. Chapters 4, 5, 6, and 7 describe the features of the data, the statistical concepts involved, and many of the methods used to analyze the data. Chapter 8 then shows how to bring all of the methods together to conduct a multi-site analysis of seasonally varying effects of PM10 on mortality.

A principal goal of this book is to disseminate R software and promote reproducible research in epidemiological studies and statistical research. As a case study we use data and methods relevant to investigating the health effects of ambient air pollution. Researching the health effects of air pollution presents an excellent example of the critical need for reproducible research because it involves all of the features mentioned above: inherently small risks, significant policy implications, sophisticated statistical methodology, and very large databases linked from multiple sources. The complexity of the analyses involved and the policy relevance of the targets of inference demand transparency and reproducibility.

Throughout the book, we show how R can be used to make analyses reproducible and to structure the analytic process in a modular fashion. We find R to be a very natural tool for achieving this goal. In particular, for the production of this book, we have made use of the tools described in Chapter 3. All of the data described in the book are provided in the NMMAPSlite and MCAPS R packages that can be downloaded from CRAN [1]. We have developed R packages for implementing the statistical methodology as well as for handling the databases. Packages that are not available from CRAN can be downloaded from the book's website [2].

We would like to express our deepest appreciation to the many collaborators and students who have worked with us on various projects, short courses, and workshops that we have developed over the years. In particular, Aidan McDermott, Scott Zeger, Luu Pham, Jon Samet, Tom Louis, Leah Welty, Michelle Bell, and Sandy Eckel were all central to the development of the software, databases, exercises, and analyses presented in this book. Several anonymous reviewers provided helpful comments that improved the presentation of the material in the book. In addition, we would like to thank Duncan Thomas for many useful suggestions regarding an early draft of the manuscript.

Finally, this work was supported in part by grant ES012054-03 from the National Institute of Environmental Health Sciences.

Baltimore, Maryland, April 2008
Roger Peng
Francesca Dominici

[1] http://cran.r-project.org/
[2] http://www.biostat.jhsph.edu/~rpeng/userbook/

Contents

Preface

1 Studies of Air Pollution and Health
  1.1 Introduction
  1.2 Time Series Studies
  1.3 Case-Crossover Studies
  1.4 Panel Studies
  1.5 Cohort Studies
  1.6 Design Comparisons

2 Introduction to R and Air Pollution and Health Data
  2.1 Starting Up R
  2.2 The National Morbidity, Mortality, and Air Pollution Study
  2.3 Organization of the NMMAPSlite Package
    2.3.1 Reading city-specific data
    2.3.2 Pollutant data detrending
    2.3.3 Mortality age categories
    2.3.4 Metadata
    2.3.5 Configuration options
  2.4 MCAPS Data

3 Reproducible Research Tools
  3.1 Introduction
  3.2 Distributing Reproducible Research
  3.3 Getting Started
  3.4 Exploring a Cached Analysis
  3.5 Verifying a Cached Analysis
  3.6 Caching a Statistical Analysis
  3.7 Distributing a Cached Analysis
  3.8 Summary

4 Statistical Issues in Estimating the Health Effects of Spatial-Temporal Environmental Exposures
  4.1 Introduction
  4.2 Time-Varying Environmental Exposures
  4.3 Estimation Versus Prediction
  4.4 Semiparametric Models
    4.4.1 Overdispersion
    4.4.2 Representations for f
    4.4.3 Estimation of β
    4.4.4 Choosing the degrees of freedom for f
  4.5 Combining Information and Hierarchical Models

5 Exploratory Data Analyses
  5.1 Introduction
  5.2 Exploring the Data: Basic Features and Properties
    5.2.1 Pollutant data
    5.2.2 Mortality data
  5.3 Exploratory Statistical Analysis
    5.3.1 Timescale decompositions
    5.3.2 Example: Timescale decompositions of PM10 and mortality
    5.3.3 Correlation at different timescales: A look at the Chicago data
    5.3.4 Looking at more detailed timescales
  5.4 Exploring the Potential for Confounding Bias
  5.5 Summary
  5.6 Reproducibility Package
  5.7 Problems

6 Statistical Models
  6.1 Introduction
  6.2 Models for Air Pollution and Health
  6.3 Semiparametric Models
    6.3.1 GAMs in R
  6.4 Pollutants: The Exposure of Interest
    6.4.1 Single versus distributed lag
    6.4.2 Mortality displacement
  6.5 Modeling Measured Confounders
  6.6 Accounting for Unmeasured Confounders
    6.6.1 Using GAMs for air pollution and health
    6.6.2 Computing standard errors for parametric terms in GAMs
    6.6.3 Choosing degrees of freedom from the data
    6.6.4 Example: Semiparametric model for Detroit
    6.6.5 Smoothers
  6.7 Multisite Studies: Putting It All Together
  6.8 Summary
  6.9 Reproducibility Package
  6.10 Problems

7 Pooling Risks Across Locations and Quantifying Spatial Heterogeneity
  7.1 Hierarchical Models for Multisite Time Series Studies of Air Pollution and Health
    7.1.1 Two-stage hierarchical model
    7.1.2 Three-stage hierarchical model
    7.1.3 Spatial correlation model
    7.1.4 Sensitivity analyses to the adjustment for confounders
  7.2 Example: Examining Sensitivity to Prior Distributions
  7.3 Reproducibility Package
  7.4 Problems

8 A Reproducible Seasonal Analysis of Particulate Matter and Mortality in the United States
  8.1 Introduction
  8.2 Methods
    8.2.1 Combining information across cities
  8.3 Results
    8.3.1 Sensitivity analyses
  8.4 Comments
  8.5 Reproducibility Package

References

Index