STATISTICAL COMPUTING USING R/S. John Fox McMaster University

Similar documents
ICPSR Training Program McMaster University Summer, The R Statistical Computing Environment: The Basics and Beyond

2005 ICPSR SUMMER PROGRAM REGRESSION ANALYSIS III: ADVANCED METHODS

Regression Analysis III: Advanced Methods

Introduction to the R Statistical Computing Environment

Interaction effects for continuous predictors in regression modeling

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition, International Publication,

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

Longitudinal Data Analysis. Michael L. Berbaum Institute for Health Research and Policy University of Illinois at Chicago

Engineering Physics 3W4 "Acquisition and Analysis of Experimental Information" Part II: Fourier Transforms

GIST 4302/5302: Spatial Analysis and Modeling

Graphical Presentation of a Nonparametric Regression with Bootstrapped Confidence Intervals

ME DYNAMICAL SYSTEMS SPRING SEMESTER 2009

GIST 4302/5302: Spatial Analysis and Modeling

A Wildlife Simulation Package (WiSP)

GIST 4302/5302: Spatial Analysis and Modeling

STATISTICS-STAT (STAT)

Geometric Algorithms in GIS

Modern Statistics For The Life Sciences Gbv

Cornell University CSS/NTRES 6200 Spatial Modelling and Analysis for agronomic, resources and environmental applications. Spring semester 2015

Applied Regression Modeling

THE STANDARD MODEL IN A NUTSHELL BY DAVE GOLDBERG DOWNLOAD EBOOK : THE STANDARD MODEL IN A NUTSHELL BY DAVE GOLDBERG PDF

LEHMAN COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF ENVIRONMENTAL, GEOGRAPHIC, AND GEOLOGICAL SCIENCES CURRICULAR CHANGE

Semi-parametric estimation of non-stationary Pickands functions

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Statistical modelling: Theory and practice

Elementary Algebra Basic Operations With Polynomials Worksheet

WEB application for the analysis of spatio-temporal data

SCIENCE PROGRAM CALCULUS III

SCIENCE PROGRAM CALCULUS III

Master of Science in Statistics A Proposal

Statistics Toolbox 6. Apply statistical algorithms and probability models

GEOG 508 GEOGRAPHIC INFORMATION SYSTEMS I KANSAS STATE UNIVERSITY DEPARTMENT OF GEOGRAPHY FALL SEMESTER, 2002

Directed acyclic graphs and the use of linear mixed models

Follow links Class Use and other Permissions. For more information, send to:

Tuesday 6:30 9:30 (First/Last classes) Home Phone: SYLLABUS. I. Focus of Course

cha1873x_p02.qxd 3/21/05 1:01 PM Page 104 PART TWO

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

QuantumMCA QuantumNaI QuantumGe QuantumGold

SPATIAL AND SPATIO-TEMPORAL BAYESIAN MODELS WITH R - INLA BY MARTA BLANGIARDO, MICHELA CAMELETTI

Longitudinal Analysis. Michael L. Berbaum Institute for Health Research and Policy University of Illinois at Chicago

GTECH 380/722 Analytical and Computer Cartography Hunter College, CUNY Department of Geography

Calculus from Graphical, Numerical, and Symbolic Points of View Overview of 2nd Edition

Applied Spatial Analysis in Epidemiology

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

Regularization in Cox Frailty Models

BOOtstrapping the Generalized Linear Model. Link to the last RSS article here:factor Analysis with Binary items: A quick review with examples. -- Ed.

Instructional Calendar Accelerated Integrated Precalculus. Chapter 1 Sections and 1.6. Section 1.4. Section 1.5

Computational Biology Course Descriptions 12-14

Mathematics with Maple

GTECH 380/722 Analytical and Computer Cartography Hunter College, CUNY Department of Geography

The Challenges of Heteroscedastic Measurement Error in Astronomical Data

SYLLABUS SEFS 540 / ESRM 490 B Optimization Techniques for Natural Resources Spring 2017

STRUCTURAL BIOINFORMATICS I. Fall 2015

NONPARAMETRIC STATISTICAL METHODS BY MYLES HOLLANDER, DOUGLAS A. WOLFE, ERIC CHICKEN

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods

Part I State space models

UNIVERSITY OF THE PHILIPPINES LOS BAÑOS INSTITUTE OF STATISTICS BS Statistics - Course Description

ADSP ADSP ADSP ADSP. Advanced Digital Signal Processing (18-792) Spring Fall Semester, Department of Electrical and Computer Engineering

Spatial Analysis and Modeling (GIST 4302/5302) Guofeng Cao Department of Geosciences Texas Tech University

The CrimeStat Program: Characteristics, Use, and Audience

Econometrics I G (Part I) Fall 2004

Applied Spatial Analysis in Epidemiology

Longitudinal and Panel Data: Analysis and Applications for the Social Sciences. Table of Contents

A Comparison Between Polynomial and Locally Weighted Regression for Fault Detection and Diagnosis of HVAC Equipment

HISTORY 1XX/ DH 1XX. Introduction to Geospatial Humanities. Instructor: Zephyr Frank, Associate Professor, History Department Office: Building

Extended Mosaic and Association Plots for Visualizing (Conditional) Independence. Achim Zeileis David Meyer Kurt Hornik

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

Estimating complex causal effects from incomplete observational data

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Calculus at Rutgers. Course descriptions

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

Introduction. How to use this book. Linear algebra. Mathematica. Mathematica cells

ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT

ON USING BOOTSTRAP APPROACH FOR UNCERTAINTY ESTIMATION

The Arctic Ocean. Grade Level: This lesson is appropriate for students in Grades K-5. Time Required: Two class periods for this lesson

Summary Of General Chemistry Laboratory Manual 3rd Edition Answers

Fracture Mechanics: Fundamentals And Applications, Third Edition Free Pdf Books

SuperMix2 features not available in HLM 7 Contents

Using Spreadsheets to Teach Engineering Problem Solving: Differential and Integral Equations

Inference via Kernel Smoothing of Bootstrap P Values

Regression III Lecture 1: Preliminary

A Non-parametric bootstrap for multilevel models

SOCI 20253/GEOG 20500, SOCI 30253, MACS Introduction to Spatial Data Science SYLLABUS

Core Courses for Students Who Enrolled Prior to Fall 2018

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 16 Introduction

Principal Moderator s Report

Residual-based Shadings for Visualizing (Conditional) Independence

Relations and Functions

CHANGING SOURCES FOR RESEARCH LITERATURE

Experimental Design and Data Analysis for Biologists

AC : A DIMENSIONAL ANALYSIS EXPERIMENT FOR THE FLUID MECHANICS CLASSROOM

Introduction to Matrix Algebra and the Multivariate Normal Distribution

11. Bootstrap Methods

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions

Didacticiel - Études de cas

Transcription:

STATISTICAL COMPUTING USING R/S John Fox McMaster University The S statistical programming language and computing environment has become the defacto standard among statisticians and has made substantial inroads in the social sciences. The S language has two implementations: the commercial product S-PLUS, and the free, open-source R. Both are available for Windows and Unix/Linux systems; R, in addition, runs on Macintoshes. This lecture series and associated workshops will use R. A statistical package, such as SPSS or SAS, is primarily oriented toward combining instructions with rectangular case-by-variable datasets to produce (often voluminous) printouts. Such packages make routine data analysis relatively easy, but they make it relatively difficult to do things that are innovative or non-standard, or to add to the builtin capabilities of the package. In contrast, a good statistical computing environment also makes routine data analysis easy, but it additionally supports convenient programming; this means that users can extend the already impressive facilities of S. Statisticians and others have taken advantage of the extensibility of S to contribute more than 1400 freely available packages of R programs. As well, S is especially capable in the area of statistical graphics, reflecting its origin at Bell Labs, a centre of graphical innovation. The first two (lecture) sessions are meant to provide a basic overview of and introduction to R, including to statistical modeling in R in effect, using R as a statistical package. The following four to five workshop sessions pick up where the basic lectures leave off, and combine lecture material with hands-on experience. The workshop sessions are intended to provide the background required to use R seriously for data analysis and presentation, including an introduction to S programming and to the design of custom statistical graphs, unlocking the power in the R statistical programming environment. The topics for the workshops in session 3-7 are somewhat flexible, depending upon participants interests: the topics given here are suggestions. If the size of the group is sufficiently small, the workshops will be conducted in a computer lab. An outline of the classes follows (with chapter references to Fox, 2002): 1. Getting started with S (Ch. 1) 2. Statistical models in S (Ch. 4, 5, & appendices) 3. Data in S; using R packages (Ch. 2 & 3) 4-5. Programming in S (Ch. 8) 6. S graphics (Ch. 7) 7. (Interest permitting). Building R packages

Course Web Site Materials for the course will be deposited at <http://socserv.socsci.mcmaster.ca/jfox/courses/r-course/index.html>. BIBLIOGRAPHY This has been an explosion of sources on S (particularly on R, with many titles of the form X with R ). The following list is by no means complete. Moreover, many statistics books not specifically focused on R or S contain S code and examples. Principal Text The principal text for this short course is J. Fox, An R and S-PLUS Companion to Applied Regression, Sage, 2002. The book is now slightly dated, but still suitable as an introduction. Additional materials are available on the web site for the book, <http://socserv.mcmaster.ca/jfox/books/companion/index.html>, including several appendices (on structural-equation models, mixed models, survival analysis, etc.); scripts for the examples in all of the chapters and appendices; information on acquiring and installing R; and more. The book is associated with the car package for R. Alternatively (or additionally), more advanced students may wish to use W. N. Venables and B. D. Ripley, Modern Applied Statistics with S as their principal source. Manuals S-PLUS is distributed with a set of manuals, as is R. S-PLUS manuals are also available on the Insightful Corporation web site, at <http://www.insightful.com/support/documentation.asp>. Likewise, R manuals are also available at the CRAN (Comprehensive R Archive Network) web site, <http://cran.r-project.org/manuals.html>. A manual for S-PLUS Trellis Graphics (also useful for the lattice package in R) is at <http://cm.bell-labs.com/cm/ms/departments/sia/doc/trellis.user.pdf>.

Programming in R/S R. A. Becker, J. M. Chambers, and A.R. Wilks, The New S Language: A Programming Environment for Data Analysis and Statistics. Pacific Grove, CA: Wadsworth, 1988. Defines S Version 2, which forms the basis of the currently used S Versions 3 and 4, as well as R. (Sometimes called the Blue Book. ) W. J. Braun and D. J. Murdoch, A First Course in Statistical Programming with R. Cambridge: Cambridge University Press, 2007. This book introduces both R and traditional topics in statistical computing, such as optimization and computational linear algebra, using R as a vehicle. The pace is relatively leisurely and the material relatively easy to digest. J. M. Chambers, Programming with Data: A Guide to the S Language. New York: Springer, 1998. Describes the new features in S Version 4, including the newer formal object-oriented programming system (also incorporated in R), by the principal designer of the S language. Not an easy read. (The Green Book. ) J. M. Chambers, Software for Data Analysis: Programming with R. New York: Springer, 2008. Chambers s newest book has not yet been published (as I write this syllabus in May) but will be out before the start of the 2008 ICPSR Summer Program. Although (obviously) I haven t yet read it, it will almost surely be worth reading. J. M. Chambers and T.J. Hastie, eds., Statistical Models in S. Pacific Grove, CA: Wadsworth, 1992. An edited volume describing the statistical modeling language in S, Versions 3 and 4, and R, and the object-oriented programming system used in S Version 3 and R (and available, for backwards compatibility, in S Version 4). In addition, the text covers S software for particular kinds of statistical models, including linear models, nonlinear models, generalized linear models, local-polynomial regression models, and generalized additive models. (The White Book. ) R. Ihaka and R. Gentleman, R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299-314, 1996. The original published description of the R project, now quite out of date but still worth looking at. W. N. Venables and B. D. Ripley, S Programming. New York: Springer, 2000. The definitive treatment of writing software in the various versions of S-PLUS and R, now slightly dated, particularly with respect to R. Selected Statistical Methods Programmed in S W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford: Oxford University Press, 1997. A good introduction to nonparametric density estimation and nonparametric regression, associated with the sm package (for both S-PLUS and R).

C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997. A comprehensive introduction to bootstrap resampling, associated with the boot package (for S-PLUS and R, written by A. J. Canty). Somewhat more difficult than Efron and Tibshirani. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993. Another extensive treatment of bootstrapping by its originator (Efron), also accompanied by an S package, bootstrap (for both S-PLUS and R, but somewhat less usable than boot). F. E. Harrell, Jr., Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001. Describes an interesting approach to statistical modeling, with frequent references to Harrell s Hmisc and Design packages for S-PLUS and R. T. J. Hastie and R. J. Tibshirani, Generalized Additive Models. London: Chapman and Hall, 1990. An accessible treatment of generalized additive models, as implemented in the gam function in S-PLUS (located in the gam package for R), and of nonparametric regression analysis in general. [The gam function in the mgcv package in R takes a somewhat different approach; see Wood (2006), below.] R. Koenker, Quantile Regression. Cambridge: Cambridge University Press, 2005. Describes a variety of methods for quantile regression by the leading figure in the area. The methods are implemented in Koenker's quantreg package for R. C. Loader, Local Likelihood and Regression. New York: Springer, 1999. Another text on nonparametric regression and density estimation, using the locfit package (in S-PLUS and R). Although the text is less readable than Bowman and Azzalini, the locfit software in very capable. P. Murrell. R Graphics. New York: Chapman and Hall, 2006. A tour-de-force the definitive reference on traditional R graphics and on the grid graphics system on which lattice graphics (the R implementation of Cleveland s Trellis graphics) is built. The figures from the book and R code to produce them are on Murrell s web site at <http://www.stat.auckland.ac.nz/~paul/rgraphics/rgraphics.html>. J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000. An extensive treatment of linear and nonlinear mixed-effects models in S, focused on the authors nlme package, which is available for both S-PLUS and R. Mixed models are appropriate for various kinds of non-independent (clustered) data, including hierarchical and longitudinal data. Unfortunately, Bates s newer and generally more powerful lme4 package isn t discussed. Sarkar, D., Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. Deepayan Sarkar is the developer of the powerful lattice package in R, which

implements Trellis graphics. This book provides a fine introduction to and overview of lattice graphics. J. L. Schafer, Analysis of Incomplete Multivariate Data. London: Chapman and Hall, 1997. This text presents a broadly applicable Bayesian treatment of missing-data problems, including methods for multiple imputation. The most extensive implementation of the methods in the book is in the missing library in S-PLUS version 6 (and in SAS). Schafer s norm, cat, and mix packages are available for earlier versions of S-PLUS and for R. P. Spector, Data Manipulation with R. New York: Springer, 2000. Data management is a dry subject, but the ability to carry it out is vital to the effective day-to-day use of R (or of any statistical software). Spector provides a reasonably broad and clear introduction to the subject. T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000. An overview of both basic and advanced methods of survival analysis (event-history analysis), with reference to S and SAS software. There are both S- PLUS and R versions of Therneau s state-of-the-art survival package. W. N. Venables and B. D. Ripley. Modern Applied Statistics with S, Fourth Edition. New York: Springer, 2002. An influential and wide-ranging treatment of data analysis using S. Many of the facilities described in the book are programmed in the associated (and indispensable) MASS, nnet, and spatial packages, available both for S-PLUS and R. This text is more advanced and has a broader focus than my R and S-PLUS Companion. S. N. Wood, Generalized Additive Models: An Introduction with R. New York: Chapman and Hall, 2006. This book describes methods implemented in Wood s mgcv package in R, which contains a gam function for fitting generalized additive models. The initials mgcv stand for multiple generalized cross validation, the method by which Wood selects GAM smoothing parameters. Other Sources (Some Free) See the R and S-PLUS web sites, at <http://www.r-project.org/doc/bib/rpublications.html> and <http://www.insightful.com/support/splusbooks.asp>, respectively. R News, the newsletter of the R Project for Statistical Computing, is available at <http://cran.r-project.org/doc/rnews/>.