Graphical Presentation of a Nonparametric Regression with Bootstrapped Confidence Intervals

Mark Nicolich & Gail Jorgensen, Exxon Biomedical Science, Inc., East Millstone, NJ

INTRODUCTION

Parametric regression (least-squares) techniques are used to estimate a statistical model that attempts to predict a variable from one or more other variables. The model is required to have a specified algebraic form such as a straight line, a parabola, or an exponential curve. An example would be predicting a person's annual income from age, years of schooling, and gender using a linear model of the form:

income = A + B*age + C*years_of_school + D*gender

These models are restrictive because it is not always possible to find a simple mathematical form to describe the relationship that exists between the variables. An alternative approach is to use nonparametric models, which do not require exact specification of the model form. The approach is known under several names, such as nonparametric regression and smoothing. Both the parametric and nonparametric methods have their advantages and disadvantages. The parametric approach gives inexact results if the model is not correctly specified, while the nonparametric approach does not make full use of the data when the model form is known exactly. The appropriate method to use depends on the data at hand and the questions to be answered by the data.

NONPARAMETRIC REGRESSION

There are several nonparametric methods (Härdle, 1989) that can be used to form the regression line. The methods include kernel smoothing, spline fitting or smoothing, L-smoothing, R-smoothing, M-smoothing, and LOWESS techniques. The techniques are mathematically related to each other, but have different properties that are advantageous in different situations.
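The linear model in the introduction can be fit by ordinary least squares; here is a minimal Python sketch with made-up data (all variable names and values are hypothetical illustrations, not from the paper):

```python
import numpy as np

# Hypothetical data for income = A + B*age + C*years_of_school + D*gender
age    = np.array([25, 32, 47, 51, 38, 29, 60, 44], dtype=float)
school = np.array([12, 16, 14, 18, 16, 12, 10, 16], dtype=float)
gender = np.array([ 0,  1,  0,  1,  1,  0,  0,  1], dtype=float)
income = np.array([30, 55, 48, 75, 62, 33, 40, 68], dtype=float)  # $1000s

# Design matrix with an intercept column; least-squares estimates of A, B, C, D
X = np.column_stack([np.ones_like(age), age, school, gender])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
predicted = X @ coef
```

In SAS the equivalent parametric fit would come from PROC REG; the point of the paper is that the nonparametric alternative needs no such fixed algebraic form.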
The technique discussed in this paper is a useful method proposed by Cleveland (1979) that has the advantages of not being sensitive to outliers (it is robust) and of allowing the user to easily adjust the degree of smoothing without the curve taking an excessively wiggly shape. The technique is usually called LOcally WEighted Scatterplot Smoothing (LOWESS or LOESS). It fits a simple straight line (or higher-order polynomial) in successive regions of the data, then iteratively refines the results to create a smooth, continuous curve. The result of a LOWESS regression is a line that best fits the data locally but is not constrained to a particular mathematical form. Visual inspection and judgment can then be used to determine the nature of the relationship between the variables. The resulting curve is not sensitive to missing observations and is a useful tool for finding spurious or outlying observations.

Michael Friendly of York University has developed a SAS MACRO that generates a LOWESS plot. It is easy to use and can produce hard-copy plots via SAS/GRAPH. There are several user-controlled parameters, such as the width of the region of interest, the degree of the fitted polynomial, and some characteristics of the resulting plot. The program is available on the SAS Institute Web site as LOWESS2 from www.sas.com/samples/a56143. The degree of smoothing in the program is controlled by the width of the window used for the initial simple regressions, expressed as a fraction of the data range. The choice of the window width is subjective and is based on the goals of the analysis. A wide window results in a smooth curve that shows long-term trends, while a narrow window shows more of the local variation. Experimentation with the width will often indicate the best choice for the problem being studied.
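As a concrete sketch of the local fitting step, here is a single-pass version in Python (tricube weights and local straight-line fits; Cleveland's robustness iterations are omitted for brevity, and the function name and defaults are ours, not part of the LOWESS2 macro):

```python
import numpy as np

def lowess_sketch(x, y, f=0.5):
    """One pass of locally weighted linear smoothing.

    f is the window width as a fraction of the data; larger f gives
    a smoother curve. Robustness iterations (reweighting by the size
    of the residuals) are omitted.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = max(2, int(np.ceil(f * n)))           # points in each local window
    fitted = np.empty(n)
    for i, x0 in enumerate(x):
        dist = np.abs(x - x0)
        h = np.sort(dist)[r - 1]              # window half-width at x0
        u = np.clip(dist / max(h, 1e-12), 0.0, 1.0)
        w = (1.0 - u**3)**3                   # tricube kernel weights
        # Weighted least-squares straight line through the window
        X = np.column_stack([np.ones(n), x - x0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        fitted[i] = beta[0]                   # value of the local line at x0
    return fitted

x = np.linspace(0, 10, 50)
y = np.sin(x) + np.random.default_rng(0).normal(0, 0.2, 50)
smooth = lowess_sketch(x, y, f=0.3)           # wider f -> smoother curve
```

Increasing f here plays the same role as widening the window in the macro: it trades local detail for long-term trend.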
The plot below is for some artificial data and shows three LOWESS curves, each based on local linear regressions, with increasing degrees of smoothness. When the window width is set at 3% of the range (F=0.03) the curve is wiggly; as the width is increased to 0.15 the curve is smoother and misses the high peaks and drops. When the width is set at 50% the curve shows only general trends. The extreme case of 100% would be a straight line; if the initial regressions were second-order polynomials, the 100% curve would be a simple second-order curve.

BOOTSTRAPPING and CONFIDENCE LIMITS

A weakness of the nonparametric regression techniques is that, because the regression curve is not based on a specific mathematical model and distribution, it is not possible to directly calculate confidence intervals (CIs). The CIs are useful in determining the precision of the predicted model and give the user an idea of how useful the model is in a particular region. For the kernel and spline smoothers there are techniques for constructing pointwise confidence intervals and variability bands (Härdle, 1989). These methods are useful, but can be computationally intensive. Confidence intervals can, however, be generated for nonparametric regression lines through the bootstrap technique. The bootstrap (Efron and Tibshirani, 1993) is a data-based method of simulating an empirical distribution of results, in this case nonparametric regression curves, through a resampling process. The bootstrap has been applied to almost all the nonparametric regression methods, each with several variants in technique (Efron and Tibshirani, 1991, 1993; Härdle, 1989; Shao and Tu, 1995). The fundamental idea of bootstrapping is to start with a sample (the original sample), then create a new sample (the bootstrap sample) by sampling with replacement from the original sample. The process is repeated a large number of times to generate a series of bootstrap samples, and the statistic of interest (such as the mean) is computed for each sample. Because the sampling is with replacement, each bootstrap sample will likely contain several duplicate observations from the original sample, and nearly every bootstrap sample will be unique. Each bootstrap sample represents a plausible alternate sample from the original population, and the set of calculated statistics yields information about the variation in the original sample.
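The resampling loop just described is short to write down; here is a Python sketch for a simple statistic such as the mean (the function name and toy data are ours):

```python
import numpy as np

def bootstrap_statistic(data, stat, n_boot=1000, seed=0):
    """Draw n_boot resamples (with replacement) from data and
    return stat evaluated on each resample."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    n = len(data)
    out = np.empty(n_boot)
    for b in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]  # duplicates are expected
        out[b] = stat(resample)
    return out

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
boot_means = bootstrap_statistic(sample, np.mean)
se = boot_means.std(ddof=1)   # bootstrap standard error of the mean
```

The spread of `boot_means` is exactly the "information about the variation in the original sample" referred to above.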
Since a distribution of different outcomes is generated from a single sample, it is as if you have lifted yourself by your own bootstraps. In this paper we make a simple application of the bootstrap technique to the LOWESS regression. We create a set of bootstrap samples (n=100, for example) and their associated LOWESS regression lines. The set of nonparametric bootstrapped regression lines constitutes an empirical distribution of regression lines. At each independent (X) value, the points on the y-axis at the 2.5th and 97.5th percentiles of the set of bootstrapped regression lines are selected to form the 95% CI of the nonparametric regression line (Efron and Tibshirani, 1993).
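The pointwise percentile band can be sketched as follows. To keep the example self-contained we substitute a simple Gaussian-kernel smoother for LOWESS; the bootstrap and percentile steps are the same whichever smoother is used (all names and data here are illustrative, not from the paper's MACRO):

```python
import numpy as np

def kernel_smooth(x, y, grid, h=0.15):
    """Gaussian-kernel smoother, used here as a stand-in for LOWESS."""
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def bootstrap_band(x, y, grid, n_boot=100, seed=0):
    """95% pointwise percentile band from n_boot resampled curves."""
    rng = np.random.default_rng(seed)
    n = len(x)
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample (x, y) pairs
        curves[b] = kernel_smooth(x[idx], y[idx], grid)
    lower = np.percentile(curves, 2.5, axis=0)
    upper = np.percentile(curves, 97.5, axis=0)
    return lower, upper

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 80)
grid = np.linspace(0.05, 0.95, 50)
lo, hi = bootstrap_band(x, y, grid)
```

Each column of `curves` is the set of bootstrapped curve heights at one X value, and the band is read off column by column, exactly as described above.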

The nonparametric bootstrap CI has different properties from the usual parametric CI. Parametric CIs are smooth curves that, because of the underlying mathematical assumptions, are narrowest at the means of the predictor and predicted variables and flare outward from there. Bootstrap CIs are based on the actual data and are therefore jagged and without a predetermined shape. The width of the CI at any given point on the x-axis varies depending on the placement of datapoints at or near that point. Conceptually, nonparametric CIs have the same interpretation as parametric CIs: for any exposure value on the x-axis, the 95% CI will enclose the true response in approximately 95% of the instances in which it is calculated. The nonparametric 95% confidence intervals are based on the actual data rather than on an assumption about the distribution of the residuals, as in parametric least-squares analyses. Therefore, in regions where there are only a few datapoints, one should be cautious about interpreting the intervals. If there are only a few datapoints relatively close together, the bootstrap estimate of variation in that interval will be small; conversely, a few widely spaced points in an interval will yield a large estimate of the variation. While there are other methods to estimate the CI for nonparametric analyses, none perform well in regions of sparse data.

SAS MACRO and EXAMPLE

We have developed a SAS MACRO to generate a series of bootstrap samples from a data sample, determine the bootstrapped CI, and plot the data, the LOWESS curve, and the CIs. The MACRO can be altered to set the number of bootstrap samples and the width of the CI. As an example of how to use the MACRO we have a set of simulated data. The data show the relationship between the change over the work day in forced expiratory volume during the first second of a pulmonary function test (FEV1) and the amount of sunlight the worker was exposed to that day.
FEV1 is a measure of lung function and is sensitive to upper-airway restriction. To generate 95% CIs for the LOWESS plots, one hundred bootstrap samples were used. The highest and lowest 2.5% points were used to form a 95% confidence interval (Efron and Tibshirani, 1993). Since there were 100 bootstrapped curves, the 2.5% point was taken as the midpoint of the 2nd and 3rd ordered values and the 97.5% point as the midpoint of the 97th and 98th. The resulting plot is at the end of the paper. All the LOWESS regressions were based on a window width of 0.4. As can be seen in the plot, the LOWESS regression procedure is highly influenced by the few datapoints at high exposures. Although there are no large reductions in lung function at higher exposures, there are so few datapoints that definitive conclusions about exposure-response are difficult where data are sparse.

OTHER REGRESSIONS

The fundamentals of the LOWESS technique are easy to understand and relatively straightforward to implement. Many analyses need more than one independent variable to explain variation in the dependent variable, such as the example in the introduction of this paper. There are techniques that can be used when there are multiple independent variables. Two methods that have received attention are the projection pursuit method and the grand tour method. Cook et al. (1995) present an overview and comparison of the two techniques. Schluter and Nychka (1994) discuss the projection pursuit method, present an extensive example, and provide a FORTRAN program to estimate models.

CONCLUSIONS

This MACRO and the LOWESS technique with CIs are useful tools for data analysis. They are an improvement over parametric regressions when the form of the regression model is not known or is too complicated to specify. They can be combined with parametric regression as another test of model adequacy.
On the negative side, it may be objected that the choice of window width can lead to different conclusions, but this is no different from choosing among different parametric models in ordinary regression. On the positive side, the choice of window width allows consideration of immediate vs. long-term changes in the data. There are applications in exposure-response studies where the effects of short-term vs. long-term exposures could be usefully modeled with these techniques. A few possible weaknesses of the LOWESS technique are that it can be used for interpolation but not extrapolation, it is unfamiliar to many researchers, and it can be computationally time-consuming. None of these possible problems should keep researchers from using this technique in their data exploration and analysis, or from promoting its acceptance in the published literature.

REFERENCES

Cleveland, W.S., "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 1979, pp. 829-836.

Cook, D., Buja, A., Cabrera, J., and Hurley, C., "Grand Tour and Projection Pursuit," Journal of Computational and Graphical Statistics, 4(3), 1995, pp. 155-172.

Efron, B. and Tibshirani, R., "Statistical Data Analysis in the Computer Age," Science, 253, July 26, 1991, pp. 390-395.

Efron, B. and Tibshirani, R., An Introduction to the Bootstrap, Chapman and Hall, New York, 1993.

Härdle, W., Applied Nonparametric Regression, Cambridge University Press, New York, 1989.

Schluter, D. and Nychka, D., "Exploring Fitness Surfaces," The American Naturalist, 143(4), 1994, pp. 597-616.

Shao, J. and Tu, D., The Jackknife and Bootstrap, Springer, New York, 1995.

SAS and SAS/GRAPH software are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Mark Nicolich & Gail Jorgensen
732.873.6241 & 732.873.6124
Exxon Biomedical Science, Inc.
Mettlers Road, CN 2350
East Millstone, NJ 08875-2350