ICPSR Training Program McMaster University Summer, The R Statistical Computing Environment: The Basics and Beyond

John Fox
ICPSR Training Program
McMaster University
Summer, 2014

The R Statistical Computing Environment: The Basics and Beyond

The R statistical programming language and computing environment has become the de facto standard for writing statistical software among statisticians and has made substantial inroads in the social sciences. R is a free, open-source implementation of the S language, and is available for Windows, Mac OS X, and Unix/Linux systems. There is also a commercial implementation of S called S-PLUS, but it has been eclipsed by R. The basic R system is developed and maintained by the R Core group, comprising 20 members, many of them eminent in the field of statistical computing. The R Project for Statistical Computing is a project of the R Foundation, whose membership includes the R Core group and several other individuals, and is also associated with the Free Software Foundation.

A statistical package, such as SPSS, is primarily oriented toward combining instructions with rectangular case-by-variable datasets to produce (often voluminous) printouts. Statistical packages make routine data analysis relatively easy, but they make it relatively difficult to do things that are innovative or non-standard, or to add to the built-in capabilities of the package. In contrast, a good statistical computing environment also makes routine data analysis easy, but it additionally supports convenient programming; this means that users can extend the already impressive facilities of R. Statisticians and others have taken advantage of the extensibility of R to contribute (at the time of writing this syllabus) more than 5000 freely available packages of documented R programs and data to CRAN (the Comprehensive R Archive Network) <http://cran.r-project.org/web/packages/index.html>, and many others to the Bioconductor package archive <http://www.bioconductor.org/>.

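As a small illustration of what "extending R" can mean in practice, the following sketch defines a new function and then shows, in comments, how a contributed package is typically installed from CRAN. The function name trimmed_summary and the example data are hypothetical and are not taken from the workshop materials; the installation lines are commented out because they require an internet connection.

```r
# A hypothetical helper (not part of base R or of any package named in this
# syllabus): summarize a numeric vector with its mean and a trimmed mean.
trimmed_summary <- function(x, trim = 0.2) {
  x <- x[!is.na(x)]                  # drop missing values first
  c(n       = length(x),             # number of non-missing observations
    mean    = mean(x),               # ordinary mean
    trimmed = mean(x, trim = trim))  # mean after trimming each tail
}

# Example data with one outlier and one missing value (made up for illustration)
result <- trimmed_summary(c(1, 2, 3, 4, 100, NA))
print(result)

# Contributed packages are normally installed once from CRAN and then loaded
# in each session; "car" is the package associated with the workshop text.
# install.packages("car")
# library(car)
```

Once defined, trimmed_summary() can be used like any built-in function, which is the sense in which users add to R's facilities; collecting such functions, with documentation, into a package is the next step.
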
As well, R is especially capable in the area of statistical graphics, reflecting the origin of S at Bell Labs, a centre of graphical innovation.

The first two days of this workshop are meant to provide a basic overview of and introduction to R, including statistical modeling in R (in effect, using R as a statistical package). I will also show you how to use RStudio, a sophisticated front-end to R, which includes support for literate programming to create documents that mix R code with explanatory text, encouraging reproducible research. The final three days pick up where the basic material leaves off, and are intended to provide the background required to use R seriously for data analysis and presentation, including an introduction to R programming and to the design of custom statistical graphs, unlocking the power in the R statistical programming environment.

Participants should bring their laptops to the workshop and should install R and RStudio in advance of the workshop (see the instructions on the workshop website).

An outline of the workshop follows (with chapter and on-line appendix references to Fox and Weisberg, An R Companion to Applied Regression, Second Edition):

Day 1. Getting started with R (Ch. 1); linear and generalized linear models in R (Ch. 4, 5); literate programming and reproducible research using R and RStudio.

Day 2. Mixed-effects models and repeated-measures ANOVA and MANOVA with the car, nlme, and lme4 packages (on-line appendices on Multivariate Linear Models and Mixed-Effects Models); survival ("event-history") analysis and structural-equation models with the survival and sem packages (on-line appendices on Cox Regression for Survival Data and Structural-Equation Models).

Day 3. Data in R (Ch. 2); the basics of R programming (Ch. 8, Sec. 8.1-8.4).

Day 4. R programming, beyond the basics (Ch. 8, Sec. 8.5-8.10).

Day 5. R graphics (Ch. 7); building R packages.

Workshop Web Site

Materials for the workshop will be deposited at <http://socserv.mcmaster.ca/jfox/courses/r-course-berkeley/index.html>, abbreviated as <http://tinyurl.com/r-course-berkeley>, which also has active links to many of the resources described in this syllabus.

Acquiring R and RStudio

I recommend that you use R with the RStudio interactive development environment (IDE), which provides a sophisticated programming editor, tools for literate programming for reproducible research, R package management, R package creation, and many other useful features. Both R and RStudio are free software, available on the Internet. Instructions for installing R and RStudio for Windows, Mac OS X, and Linux systems are on the workshop web site at <http://socserv.mcmaster.ca/jfox/courses/r-course-berkeley/r-installinstructions.html>, with a link on the workshop home page.

Selected Bibliography

Publishers of statistical texts have been producing a steady stream of books on R. Of particular note is Springer's Use R! series of brief paperbacks on various R-related topics <http://www.springer.com/series/6991>, several titles of which I've listed below. Recently, Chapman and Hall introduced The R Series <http://www.crcpress.com/browse/series/crctherser>.

Basic Texts

The principal source for this workshop is J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage (2011). Additional materials are available on the web site for the book <http://socserv.mcmaster.ca/jfox/books/companion/index.html>, including several appendices (on multivariate linear models, structural-equation models, mixed models, survival analysis, and more). The book is associated with the car package for R. I am a member of the R Foundation.

Alternatively (or additionally), more advanced students may wish to use W. N. Venables and B. D. Ripley, Modern Applied Statistics with S as a principal source. Bill Venables is a member of the R Foundation, and Brian Ripley is a member of the R Core group.

Manuals

R is distributed with a set of manuals, which are also available at the CRAN web site <http://cran.r-project.org/manuals.html>. A manual for S-PLUS Trellis Graphics (also useful for the lattice package in R) is also available on the web at <http://cm.bell-labs.com/cm/ms/departments/sia/doc/trellis.user.pdf>.

Programming in R

R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language: A Programming Environment for Data Analysis and Statistics. Pacific Grove, CA: Wadsworth, 1988. Defines S Version 2, which forms the basis of S Versions 3 and 4, as well as R. (Sometimes called the "Blue Book.")

J. M. Chambers, Programming with Data: A Guide to the S Language. New York: Springer, 1998. Describes the then-new features in S Version 4, including the newer formal object-oriented programming system (also incorporated in R), by the principal designer of the S language and a member of the R Core group of developers. Not an easy read. (The "Green Book.")

J. M. Chambers, Software for Data Analysis: Programming with R. New York: Springer, 2008. Chambers's newest book ranges quite widely, and emphasizes a deep understanding of the R language, along with object-oriented programming, and links between R and other software. Some topics are unusual, such as processing text data in R.

J. M. Chambers and T. J. Hastie, eds., Statistical Models in S. Pacific Grove, CA: Wadsworth, 1992.
An edited volume describing the statistical modeling capabilities in S, Versions 3 and 4, and R, and the object-oriented programming system used in S Version 3 and R (and available, for backwards compatibility, in S Version 4). In addition, the text covers S software for particular kinds of statistical models, including linear models, nonlinear models, generalized linear models, local-polynomial regression models, and generalized additive models. (The "White Book.")

D. Eddelbuettel, Seamless R and C++ Integration with Rcpp. New York: Springer, 2013. Judicious use of compiled code written in C, C++, or Fortran can substantially improve the efficiency of some R programs. The Rcpp package and its cousins simplify the process of integrating C++ code in R. I recommend this book to those who already know C++.

R. Gentleman, R Programming for Bioinformatics. Boca Raton: Chapman and Hall, 2009. A thorough, though at points relatively difficult, treatment of programming in R, by one of the original co-developers of R and a founder of the related Bioconductor Project (which develops computing tools for the analysis of genomic data). Don't let the title fool you: Most of the book is of general interest to R programmers.

R. Ihaka and R. Gentleman, R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299-314, 1996. The original published description of the R project, now quite out of date but still worth looking at. The founders of the R Project, both authors are members of the R Core group.

W. N. Venables and B. D. Ripley, S Programming. New York: Springer, 2000. A companion volume to Modern Applied Statistics with S, and at the time of its publication the definitive treatment of writing software in the various versions of S-PLUS and R; now somewhat dated, particularly with respect to R. Brian Ripley is a member of the R Core group of developers, and Bill Venables is a member of the R Foundation.

Statistical Computing in R

The following three books treat traditional topics in statistical computing, such as optimization, simulation, probability calculations, and computational linear algebra, using R (although the coverage of particular topics in the books differs). All offer introductions to R programming. Of these books, Braun and Murdoch is the briefest and most accessible.

W. J. Braun and D. J. Murdoch, A First Course in Statistical Programming with R. Cambridge: Cambridge University Press, 2007. Duncan Murdoch is a member of the R Core group of developers.

O. Jones, R. Maillardet, and A. Robinson, Introduction to Scientific Programming and Simulation Using R. Boca Raton: Chapman and Hall, 2009.

M. L. Rizzo, Statistical Computing with R. Boca Raton: Chapman and Hall, 2008.

Graphics in R

P. Murrell, R Graphics, Second Edition. New York: Chapman and Hall, 2011. A tour de force: the definitive reference on traditional R graphics and on the grid graphics system on which lattice graphics (the R implementation of William Cleveland's Trellis graphics) is built. R code to produce the figures in the book is on Murrell's web site <http://www.stat.auckland.ac.nz/~paul/rg2e/>. Paul Murrell is a member of the R Core group of developers.

P. Murrell and R. Ihaka, An approach to providing mathematical annotation in plots. Journal of Computational and Graphical Statistics, 9:582-599, 2000. One of the unusual and very useful features of R graphics is the ability to include mathematical notation. This article explains how. Paul Murrell and Ross Ihaka are both members of the R Core group.

D. Sarkar, Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. Deepayan Sarkar is the developer of the powerful lattice package in R, which implements Trellis graphics. This book provides a fine introduction to and overview of lattice graphics. Figures from the book and the R code to produce them are available on the web <http://lmdvr.r-forge.r-project.org/figures/figures.html>. Deepayan Sarkar is a member of the R Core group of developers.

H. Wickham, ggplot2: Elegant Graphics for Data Analysis. New York: Springer, 2009. A guide to Hadley Wickham's ggplot2 package, which provides an alternative graphics system for R based on an extension of Wilkinson's The Grammar of Graphics (Second Edition, Springer, 2005), which, in turn, provides a systematic basis for constructing statistical graphs.

Data Management

P. Spector, Data Manipulation with R. New York: Springer, 2008. Data management is a dry subject, but the ability to carry it out is vital to the effective day-to-day use of R (or of any statistical software). Spector provides a reasonably broad and clear introduction to the subject.

(Highly) Selected Statistical Methods Programmed in R

Also see the package listing on CRAN <http://cran.r-project.org/web/packages/index.html> and the various CRAN task views <http://cran.r-project.org/web/views/index.html>.

J. Adler, R in a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O'Reilly, 2010. Basic information about using R, including brief illustrations of many R commands. New users of R may find the information in this book useful.

R. S. Bivand, E. J. Pebesma, and V. Gómez-Rubio, Applied Spatial Data Analysis with R. New York: Springer, 2008. There is a strong community of researchers in spatial statistics developing R software, much of which is described in this book, including the basic sp package, which provides R classes for spatial data. Roger Bivand is a member of the R Foundation.

A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford: Oxford University Press, 1997. A good introduction to nonparametric density estimation and nonparametric regression, associated with the sm package (for both S-PLUS and R).

A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997. A comprehensive introduction to bootstrap resampling, associated with the boot package (written by A. J. Canty). Somewhat more difficult than Efron and Tibshirani (immediately below).

B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993. Another extensive treatment of bootstrapping by its originator (Efron), also accompanied by an R package, bootstrap (but somewhat less usable than boot).

B. S. Everitt and T. Hothorn, A Handbook of Statistical Analyses Using R, Second Edition. Boca Raton: Chapman and Hall, 2010. Many worked-out, brief examples, illustrating a variety of statistical methods. New users of R may find this book useful.

A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press, 2007. A wide-ranging yet deep treatment of hierarchical models and various related topics, predominantly but not exclusively from a Bayesian perspective, using both R and BUGS software.

F. E. Harrell, Jr., Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001. Describes an interesting approach to statistical modeling, with frequent references to Harrell's Hmisc and Design packages.

T. J. Hastie and R. J. Tibshirani, Generalized Additive Models. London: Chapman and Hall, 1990. An accessible treatment of generalized additive models, as implemented in the gam package, and of nonparametric regression analysis in general. [The gam function in the mgcv package in R takes a somewhat different approach; see Wood (2006), below.]

R. Koenker, Quantile Regression. Cambridge: Cambridge University Press, 2005. Describes a variety of methods for quantile regression by the leading figure in the area. The methods are implemented in Koenker's quantreg package for R.

C. Loader, Local Likelihood and Regression. New York: Springer, 1999. Another text on nonparametric regression and density estimation, using the locfit package. Although the text is less readable than Bowman and Azzalini, the locfit software is very capable.

T. Lumley, Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: Wiley, 2010. A lucid introduction to the analysis of data from complex survey samples and to Lumley's highly capable survey package. Thomas Lumley is a member of the R Core group of developers.

G. P. Nason, Wavelet Methods in Statistics with R. New York: Springer, 2008. Describes the wavethresh package for wavelet smoothing, by one of the key figures in the development of wavelet methods in statistics.

J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000. An extensive treatment of linear and nonlinear mixed-effects models in S, focused on the authors' nlme package. Mixed models are appropriate for various kinds of non-independent (clustered) data, including hierarchical and longitudinal data. Does not cover Bates's newer lme4 package. Doug Bates is a member of the R Core group of developers.

T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000. An overview of both basic and advanced methods of survival analysis (event-history analysis), with reference to S and SAS software, the former implemented in Therneau's state-of-the-art survival package.

S. van Buuren, Flexible Imputation of Missing Data. Boca Raton, FL: CRC Press, 2012. There are several packages in R for multiple imputation of missing data; this book largely describes the mice (multiple imputation by chained equations) package.

W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition. New York: Springer, 2002. An influential and wide-ranging treatment of data analysis using S. Many of the facilities described in the book are programmed in the associated (and indispensable) MASS, nnet, and spatial packages, which are included in the standard R distribution. This text is more advanced and has a broader focus than the R Companion. Brian Ripley is a member of the R Core group of developers.

S. N. Wood, Generalized Additive Models: An Introduction with R. New York: Chapman and Hall, 2006. Describes the mgcv package in R, which contains a gam function for fitting generalized additive models based on smoothing splines. The initials mgcv stand for "multiple generalized cross validation," the method by which Wood selects GAM smoothing parameters.

Other Sources (Some Free)

See the publications list on the R web site <http://www.r-project.org/doc/bib/rpublications.html>. The R Journal <http://journal.r-project.org/>, the journal of the R Project for Statistical Computing, and its predecessor R News <http://www.r-project.org/doc/rnews/index.html>, are also good sources of information, as is the Journal of Statistical Software <http://www.jstatsoft.org/>, an on-line American Statistical Association journal dominated by coverage of R packages. Information about R packages in a number of application areas is available in various CRAN task views <http://cran.r-project.org/web/views/>.