BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

Similar documents
SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

EXST7015: Estimating tree weights from other morphometric variables Raw data print

Handout 1: Predicting GPA from SAT

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3

Answer Keys to Homework#10

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter.

Comparison of a Population Means

Assignment 9 Answer Keys

Lecture notes on Regression & SAS example demonstration

5.3 Three-Stage Nested Design Example

Chapter 8 (More on Assumptions for the Simple Linear Regression)

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION

STATISTICS 479 Exam II (100 points)

Single Factor Experiments

Lecture 11: Simple Linear Regression

STAT 3A03 Applied Regression Analysis With SAS Fall 2017

Outline. Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping

Topic 2. Chapter 3: Diagnostics and Remedial Measures

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

PubH 7405: REGRESSION ANALYSIS SLR: DIAGNOSTICS & REMEDIES

A Little Stats Won t Hurt You

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

Introduction to Regression

STOR 455 STATISTICAL METHODS I

Biological Applications of ANOVA - Examples and Readings

This is a Randomized Block Design (RBD) with a single factor treatment arrangement (2 levels) which are fixed.

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat).

STATISTICS 110/201 PRACTICE FINAL EXAM

Topic 18: Model Selection and Diagnostics

Lecture 11 Multiple Linear Regression

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

Getting Correct Results from PROC REG

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

Introduction to Design and Analysis of Experiments with the SAS System (Stat 7010 Lecture Notes)

Analysis of variance and regression. April 17, Contents Comparison of several groups One-way ANOVA. Two-way ANOVA Interaction Model checking

Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

STAT 350. Assignment 4

Stat 500 Midterm 2 12 November 2009 page 0 of 11

INFERENCE FOR REGRESSION

Chapter 1 Linear Regression with One Predictor

SAS Commands. General Plan. Output. Construct scatterplot / interaction plot. Run full model

Lecture 3: Inference in SLR

R 2 and F -Tests and ANOVA

General Linear Model (Chapter 4)

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

Lecture 10: 2 k Factorial Design Montgomery: Chapter 6

Lecture 7 Remedial Measures

Box-Cox Transformations

Multicollinearity Exercise

PLS205 Lab 6 February 13, Laboratory Topic 9

Lab # 11: Correlation and Model Fitting

Chapter 16. Simple Linear Regression and Correlation

Chapter 11 : State SAT scores for 1982 Data Listing

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

Booklet of Code and Output for STAC32 Final Exam

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF).

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

a. The least squares estimators of intercept and slope are (from JMP output): b 0 = 6.25 b 1 =

Topic 14: Inference in Multiple Regression

EXST Regression Techniques Page 1. We can also test the hypothesis H :" œ 0 versus H :"

Chapter 2 Inferences in Simple Linear Regression

Week 7.1--IES 612-STA STA doc

Analysis of variance. April 16, Contents Comparison of several groups

Notes 6. Basic Stats Procedures part II

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur

3 Variables: Cyberloafing Conscientiousness Age

Bivariate data analysis

Analysis of variance. April 16, 2009

Advanced Regression Topics: Violation of Assumptions

Overview Scatter Plot Example

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS

y response variable x 1, x 2,, x k -- a set of explanatory variables

Chapter 16. Simple Linear Regression and dcorrelation

Odor attraction CRD Page 1

STAT 350 Final (new Material) Review Problems Key Spring 2016

BIOSTATS 640 Spring 2018 Unit 2. Regression and Correlation (Part 1 of 2) STATA Users

Lecture 4. Checking Model Adequacy

Chapter 6 Multiple Regression

1 A Review of Correlation and Regression

STAT 3A03 Applied Regression With SAS Fall 2017

Topic 20: Single Factor Analysis of Variance

One-way ANOVA Model Assumptions

Correlation and Simple Linear Regression

BIOSTATS Intermediate Biostatistics Spring 2015 Class Activity Unit 1: Review of BIOSTATS 540

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model

Outline. Analysis of Variance. Acknowledgements. Comparison of 2 or more groups. Comparison of serveral groups

1 Introduction to Minitab

appstats27.notebook April 06, 2017

Chapter 12: Multiple Regression

ANALYSES OF NCGS DATA FOR ALCOHOL STATUS CATEGORIES 1 22:46 Sunday, March 2, 2003

Business Statistics. Lecture 10: Course Review

Chapter 27 Summary Inferences for Regression

Lecture 10: F -Tests, ANOVA and R 2

Analysis of Variance. Source DF Squares Square F Value Pr > F. Model <.0001 Error Corrected Total

Inferences for Regression

Transcription:

BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook for a First Course in Statistics and Data Analysis. New York, John Wiley, 1995, pp 145-152. Setting: Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a data set containing information on calendar day, weather, and numbers of calls. Data File: ERS.DAT - This is a data set in ASCII format. Variable Name Label Coding/ DAY Date using informat MMDDYY6. Example: 016193 is January 16, 1993 CALLS Calls answered FHIGH Forecasted high temperature FLOW Forecasted low temperature HIGH High temperature LOW Low temperature RAIN Rain Forecast 0 = NO 1 = RAIN SNOW Snow Forecast 0 = NO 1 = SNOW WEEKDAY Type of Day 0 = NO 1 = Weekday YEAR 0 = 1993 1 = 1994 SUNDAY 0 = NO 1 = SUNDAY SUBZERO 0 = NO 1 = SUBZERO 1. Read in the data, create a dictionary of discrete variable values. \sas_howto\simple linear regression ny auto club.doc Page 1 of 11

data temp; input day MMDDYY6. calls fhigh flow high low rain snow weekday year sunday subzero; cards; 011693 2298 38 31 39 31 0 0 0 0 0 0 011793 1709 41 27 41 30 0 0 0 0 1 0 011893 2395 33 26 38 24 0 0 0 0 0 0 011993 2486 29 19 36 21 0 0 1 0 0 0 some rows omitted 012694 3638 26 23 32 5 1 1 1 1 0 0 012794 8947 29 14 31 0 0 0 1 1 0 1 012894 6564 48 34 55 31 1 1 1 1 0 0 012994 5613 31 40 36 36 1 1 0 1 0 0 ; Note - MMDDYY6. tells SAS that the variable DAY Is a date variable of the form mmddyy * * * Create dictionary of variable values for readability * ; proc format; value rainf 0='0=no' 1='1=rain'; value snowf 0='0=no' 1='1=snow'; value weekdayf 0='0=no' 1='1=weekday'; value yearf 0='0=1993' 1='1=1994'; value sundayf 0='0=no' 1='1=Sunday'; value subzerof 0='0=no' 1='1=subzero'; 2. For small data sets, produce a listing for review proc print data=temp; format day MMDDYY6. rain RAINF. snow SNOWF. weekday WEEKDAYF. year YEARF. sunday SUNDAYF. subzero SUBZEROF.; title "Temporary Data set"; title2 " Emergency Calls to New York Auto Club"; Note - Now we make use of the dictionary we created above by using a FORMAT statement. Partial listing of output. \sas_howto\simple linear regression ny auto club.doc Page 2 of 11

Emergency Calls to New York Auto Club Obs day calls fhigh flow high low rain snow weekday year sunday subzero 1 011693 2298 38 31 39 31 0=no 0=no 0=no 0=1993 0=no 0=no 2 011793 1709 41 27 41 30 0=no 0=no 0=no 0=1993 1=Sunday 0=no 3 011893 2395 33 26 38 24 0=no 0=no 0=no 0=1993 0=no 0=no 4 011993 2486 29 19 36 21 0=no 0=no 1=weekday 0=1993 0=no 0=no 5 012093 1849 40 19 43 27 0=no 0=no 1=weekday 0=1993 0=no 0=no 6 012193 1842 44 30 43 29 0=no 0=no 1=weekday 0=1993 0=no 0=no 7 012293 2100 46 40 53 41 1=rain 0=no 1=weekday 0=1993 0=no 0=no 8 012393 1752 47 35 46 40 0=no 0=no 0=no 0=1993 0=no 0=no 9 012493 1776 53 34 55 38 1=rain 0=no 0=no 0=1993 1=Sunday 0=no 10 012593 1812 38 32 43 31 0=no 0=no 1=weekday 0=1993 0=no 0=no 11 012693 1842 35 21 35 25 0=no 0=no 1=weekday 0=1993 0=no 0=no 12 012793 1674 39 27 44 31 1=rain 1=snow 1=weekday 0=1993 0=no 0=no 13 012893 1692 34 28 40 27 0=no 0=no 1=weekday 0=1993 0=no 0=no 3. First look at the data. Plot of calls over time symbol1 value=dot; proc gplot data=temp; title "Calls to NY Auto Club 1993-1994"; plot calls*day; Note - How to save a graph as ".jpeg" 1. Activate the GRAPH output window at bottom right 2. FILE --> EXPORT AS IMAGE 3. At the "SAVE AS TYPE" dialog box, choose jpeg Nyauto1.jpg 1993 values are lower 1994 values are higher and more variable \sas_howto\simple linear regression ny auto club.doc Page 3 of 11

4. Descriptives of Y=Calls. Tests of Assumption of Normality proc capability data=temp NORMAL normaltest; var calls; title "Distribution of Y=CALLS"; The CAPABILITY Procedure Variable: calls Moments N 28 Sum Weights 28 Mean 4318.75 Sum Observations 120925 Std Deviation 2692.56394 Variance 7249900.56 Skewness 0.48107831 Kurtosis -1.4180211 Uncorrected SS 717992159 Corrected SS 195747315 Coeff Variation 62.3459089 Std Error Mean 508.846755 Tests for Normality Test --Statistic--- -----p Value----- Null Hypothesis of normality is rejected. Shapiro-Wilk W 0.829019 Pr < W 0.000 Kolmogorov-Smirnov D 0.251960 Pr > D <0.010 Cramer-von Mises W-Sq 0.311246 Pr > W-Sq <0.005 Anderson-Darling A-Sq 1.867261 Pr > A-Sq <0.005 The null hypothesis of normality of Y=CALLS is rejected by every test. Take care, sometimes the cure is worse than the problem. For now, we ll continue along anyway; this will give us a chance to see some interesting diagnostics! \sas_howto\simple linear regression ny auto club.doc Page 4 of 11

5. Graphical Assessments of Normality of Y=Calls. Histogram with overlay normal (note use of NOPRINT) proc capability data=temp NORMAL noprint; var calls; title "Histogram of Y=CALLS with overlay NORMAL"; histogram calls/normal(color=maroon w=4) cfill=blue cframe=ligr; inset mean std/cfill=blank format=5.2; Quantile Quantile Plot w reference = Normal goptions ftext=none htext=1 cell; symbol1 c=blue v=dot h=1.5; symbol2 c=red; proc capability data=temp noprint; var calls; title "Simple Normal QQPlot for Y=CALLS"; qqplot calls/ normal (mu=est sigma=est) rotate pctlaxis(label='percentiles' GRID LGRID=35) square; Nyauto2.jpg nyauto3.jpg The graphs show what we suspected nonnormality of Y=CALLS. \sas_howto\simple linear regression ny auto club.doc Page 5 of 11

5. Scatterplot of Y=CALLS v X=LOW symbol1 value=square color=blue; symbol2 value=dot color=red; proc gplot data=temp; title "Calls to NY Auto Club 1993-1994"; plot calls*low=year; format year YEARF.; Note - I used value of YEAR as my plotting symbol. I then indicated 1993 and 1994 using different symbols and colors using SYMBOL1 and SYMBOL2 statements. Nyauto4.jpg The scatterplot suggests, as we might expect, that lower temperatures are associated with more calls to the NY Auto club. The distinction between 1993 and 1994 points suggests that, perhaps, low temperature is not the only predictor of increased calls. 6. Least Squares Estimation and Analysis of Variance Table \sas_howto\simple linear regression ny auto club.doc Page 6 of 11

proc reg data=temp SIMPLE; title "Straight line fit of Y=CALLS to X=LOW"; model calls=low; Note - The option SIMPLE produces descriptive statistics. Partial listing -----. The REG Procedure Number of Observations Read 28 Number of Observations Used 28 Descriptive Statistics Uncorrected Standard Variable Sum Mean SS Variance Deviation Intercept 28.00000 1.00000 28.00000 0 0 low 609.00000 21.75000 18003 176.19444 13.27383 calls 120925 4318.75000 717992159 7249901 2692.56394 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 1 100233719 100233719 27.28 <.0001 Error 26 95513596 3673600 Corrected Total 27 195747315 Root MSE 1916.66373 R-Square 0.5121 Dependent Mean 4318.75000 Adj R-Sq 0.4933 Coeff Var 44.38006 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > t Intercept 1 7475.84897 704.63038 10.61 <.0001 low 1-145.15398 27.78868-5.22 <.0001 The fitted line is calls ˆ = 7,475.85-145.15*[low] R 2 =.51 indicates that 51% of the variability in calls is explained. The overall F test significance level <.0001 suggests that the straight line fit performs better in explaining variability in calls than does Y = average # calls 7. Overlay of Straight Line Fit onto Scatterplot of Y=CALLS v X=LOW \sas_howto\simple linear regression ny auto club.doc Page 7 of 11

symbol1 value=dot color=blue i=r; proc gplot data=temp; title "Overlay Straight line fit of Y=CALLS to X=LOW"; plot calls*low; Note - The overlay of the line is accomplished in the SYMBOL1 statement as i=r. Nyauto5.jpg The overlay of the straight line fit is reasonable but substantial variability is seen, too. There is a lot we still don t know, including but not limited to the following --- Case influence, omitted variables, variance heterogeneity, incorrect functional form, etc. \sas_howto\simple linear regression ny auto club.doc Page 8 of 11

8. Residuals Analysis Assessment of Normality of Residuals symbol1 value=dot; proc reg data=temp noprint; model calls=low; plot npp.*residual.; title "Normality of Residuals Y=CALLS v X=LOW"; Note - The NOPRINT suppresses numerical output. npp. (note the period!) refers to normal percentiles Residual. Refers to residuals Note all nice information that appears on the graph! Nyauto6.jpg Not bad, actually. 9. Residuals Analysis Detection of Outliers Using Cook s Distance \sas_howto\simple linear regression ny auto club.doc Page 9 of 11

symbol1 value=dot; proc reg data=temp noprint; model calls=low; plot cookd.*obs.; title "Cook's Distance Values for Straight Line Y=CALLS v X=LOW"; Note - The NOPRINT suppresses numerical output. Nyauto7.jpg For straight line regression, the suggestion is to regard Cook s Distance values > 1 as significant.. Here, there are no unusually large Cook Distance values. Not shown but useful, too, are examinations of leverage and jackknife residuals. 10. Assessing Assumptions of Linearity, Heteroscedascity, Independence Using Jacknife Residuals Note - \sas_howto\simple linear regression ny auto club.doc Page 10 of 11

symbol1 value=dot; proc reg data=temp noprint; model calls=low; plot rstudent.*predicted.; title "Jacknife Residuals versus Predicted"; The NOPRINT suppresses numerical output. Jacknife residuals are stored in rstudent. Predicteds are stored in predicted. Be sure to remember the periods! Nyauto8.jpg Recall A jackknife residual for an individual is a modification of the solution for a studentized residual in which the mean square error is replaced by the mean square error obtained after deleting that individual from the analysis. This plot in SAS is nice for its inclusion of some useful summaries the fitted line, the R 2 Departures of this plot from a parallel band about the horizontal line at zero are significant. The plot here is a bit noisy but not too bad considering the small sample size. \sas_howto\simple linear regression ny auto club.doc Page 11 of 11