STAT 3A03 Applied Regression Analysis With SAS Fall 2017


Assignment 5 Solution Set

Q. 1 (a) The code that I used and the output is as follows:

    PROC GLM Data=S3A3.Wool plots=none;
      Class Amp Len Load;
      Model Cycles = Amp Len Load;
      Output Out=woolout predicted=fitted student=student;
    Run;

The GLM Procedure
Dependent Variable: Cycles

    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              6      15559598.44    2593266.41     11.11   <.0001
    Error             20       4669019.85     233450.99
    Corrected Total   26      20228618.30

    R-Square   Coeff Var   Root MSE   Cycles Mean
    0.769187   56.09291    483.1677   861.3704

    Source   DF   Type I SS     Mean Square   F Value   Pr > F
    Amp       2   5624248.963   2812124.481     12.05   0.0004
    Len       2   8182252.519   4091126.259     17.52   <.0001
    Load      2   1753096.963    876548.481      3.75   0.0413

    Source   DF   Type III SS   Mean Square   F Value   Pr > F
    Amp       2   5624248.963   2812124.481     12.05   0.0004
    Len       2   8182252.519   4091126.259     17.52   <.0001
    Load      2   1753096.963    876548.481      3.75   0.0413
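As a quick sanity check that can be done without SAS, the F value and R-square in the ANOVA table follow directly from the sums of squares and degrees of freedom printed above; a small Python sketch:

```python
# Sums of squares and df copied from the PROC GLM ANOVA table above
ss_model, df_model = 15559598.44, 6
ss_error, df_error = 4669019.85, 20

f_value = (ss_model / df_model) / (ss_error / df_error)
r_square = ss_model / (ss_model + ss_error)

print(round(f_value, 2), round(r_square, 4))  # 11.11 0.7692
```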

A plot of the studentized residuals against the fitted values is as follows:

    PROC GPLOT Data=woolout;
      Plot student*fitted;
    Run;

There is a clear pattern in these residuals suggesting non-linearity in the model.

(b) Here is the code and output adding in all three pairwise interactions:

    PROC GLM Data=S3A3.Wool plots=none;
      Class Amp Len Load;
      Model Cycles = Amp Len Load Amp*Len Amp*Load Len*Load;
      Output Out=woolout1 predicted=fitted student=student;
    Run;

The GLM Procedure
Dependent Variable: Cycles

    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model             18      20131626.44    1118423.69     92.25   <.0001
    Error              8         96991.85      12123.98
    Corrected Total   26      20228618.30

    R-Square   Coeff Var   Root MSE   Cycles Mean
    0.995205   12.78300    110.1090   861.3704

    Source     DF   Type I SS     Mean Square   F Value   Pr > F
    Amp         2   5624248.963   2812124.481    231.95   <.0001
    Len         2   8182252.519   4091126.259    337.44   <.0001
    Load        2   1753096.963    876548.481     72.30   <.0001
    Amp*Len     4   3555537.481    888884.370     73.32   <.0001
    Amp*Load    4    283609.037     70902.259      5.85   0.0168
    Len*Load    4    732881.481    183220.370     15.11   0.0008

We notice that the R-square value has now increased from 76.9% to 99.5% in this model. [3 marks]

A test of the null hypothesis that we can omit the interaction terms has test statistic

    F = [(SSE_red - SSE_full) / (df_red - df_full)] / MSE_full
      = [(4669019.85 - 96991.85) / (20 - 8)] / 12123.98
      = 31.43

If the true model is the reduced (no interaction) model then this will be an observation from an F(12, 8) distribution. From Table A.5 in the textbook we see that the 1% critical value for this distribution is 5.67. Since the observed value of F is greater than 5.67, we can conclude that the p-value is less than 0.01, so we reject H0 and conclude that the model including interactions is a significantly better fit. [3 marks] Furthermore, a plot of the studentized residuals against the fitted values shows that the non-linearity problem has also been resolved:

    PROC GPLOT Data=woolout1;
      Plot student*fitted;
    Run;

(c) The following plots come from trying the square, square root and log transformations of the response variable with the no-interaction model.

[Figure: residual plot, Cycles squared]

[Figures: residual plots, square root of Cycles and log of Cycles]

From these plots, we see that the log transformation seems to be the best at removing the non-linearity in the model. [4 marks] The ANOVA table and other summary statistics for this transformation are:

    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              6      22.48517565    3.74752927    104.47   <.0001
    Error             20       0.71742240    0.03587112
    Corrected Total   26      23.20259805

    R-Square   Coeff Var   Root MSE   LogCycles Mean
    0.969080   2.989807    0.189397   6.334748

We note that on the log scale the R-square value for the model without interactions is now 96.9%. [3 marks]

(d) For the model on the log scale with all pairwise interactions we get the ANOVA table:

Dependent Variable: LogCycles

    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model             18      23.03668445    1.27981580     61.71   <.0001
    Error              8       0.16591360    0.02073920
    Corrected Total   26      23.20259805

    R-Square   Coeff Var   Root MSE   LogCycles Mean
    0.992849   2.273352    0.144011   6.334748

The R-square value for this model on the log scale is now over 99%. [3 marks] The F statistic to compare this model with the model without interactions is

    F = [(SSE_red - SSE_full) / (df_red - df_full)] / MSE_full
      = [(0.7174224 - 0.1659136) / (20 - 8)] / 0.0207392
      = 2.22

Once again we compare this to the critical values of the F(12, 8) distribution. From Table A.4 we see that the 5% critical value is 3.28, and since 2.22 < 3.28 the p-value of the test is greater than 0.05. Hence we do not reject H0 and conclude that there is no evidence that the interaction terms are needed in the model on the log scale. [3 marks]
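Both partial F statistics in this question can be reproduced from the printed sums of squares; a small Python sketch (the helper name partial_f is my own, and the SSE and df values are copied from the SAS ANOVA tables rather than recomputed from the data):

```python
def partial_f(sse_red, df_red, sse_full, df_full):
    """F statistic for testing a reduced model against a full (nested) model."""
    return ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)

# Raw scale (part b): main effects only vs. all pairwise interactions
f_raw = partial_f(4669019.85, 20, 96991.85, 8)

# Log scale (part d): the same comparison after log-transforming Cycles
f_log = partial_f(0.71742240, 20, 0.16591360, 8)

print(round(f_raw, 2), round(f_log, 2))  # 31.43 2.22
```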

Q. 2 (a) First we plot the millions of barrels produced against the year:

    PROC GPLOT Data=S3A3.Oil;
      Plot Barrels*Year;
    Run; quit;

Clearly the relationship is not linear. There appears to be exponential growth for most of the period, although that seems to have changed in the later years. Next we log transform the production variable:

    Data S3A3.Oil;
      Set S3A3.Oil;
      LogBarrels = log(Barrels);
    Run;

    PROC GPLOT Data=S3A3.Oil;
      Plot LogBarrels*Year;
    Run; quit;

For most of the range of years this seems to have improved the linearity, although there is still an issue in the later years.

(b) The regression of log oil production on year gives the following output and diagnostic plots:

    PROC REG Data=S3A3.Oil plots=none;
      Model LogBarrels = Year;
      Plot LogBarrels*Year;
      Plot Student.*Predicted.;
      Plot Student.*nqq.;
      Plot CookD.*Year;
    Run; quit;

The REG Procedure
Model: MODEL1
Dependent Variable: LogBarrels

    Number of Observations Read                 30
    Number of Observations Used                 29
    Number of Observations with Missing Values   1

    Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1        105.64736     105.64736   1604.18   <.0001
    Error             27          1.77815       0.06586
    Corrected Total   28        107.42551

    Root MSE         0.25663   R-Square   0.9834
    Dependent Mean   8.13163   Adj R-Sq   0.9828
    Coeff Var        3.15591

    Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1           -111.87532          2.99664    -37.33   <.0001
    Year         1              0.06159          0.00154     40.05   <.0001

[1 mark]

[Figures: diagnostic plots - 1 mark each]

The fitted model has a very significant F statistic and an R-square of 98.34%, so the model appears to fit quite well. The plots, however, tell another story. Although the linear relationship fits most of the years, production in the last few observations is consistently overestimated, resulting in negative residuals. We also see consistent overestimation of production in the early years of the dataset. There is evidence of correlation between successive observations, and the normality assumption is quite suspect due to marked curvature in the normal quantile-quantile plot. The very first observation, in 1880, seems quite influential based on the Cook's Distance plot, and the final few observations also have somewhat elevated Cook's Distance. Overall we would not be at all satisfied that this is an appropriate linear model. [3 marks]
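Since the model is linear on the natural-log scale, the fitted slope has a direct growth-rate interpretation; a quick illustrative computation, using the Year coefficient from the SAS output above:

```python
import math

# A slope of 0.06159 per year on the natural-log scale corresponds to
# multiplying production by exp(0.06159) each year, i.e. roughly 6.4%
# annual growth over the period where the line fits.
slope = 0.06159  # Year coefficient from the PROC REG output above
annual_growth = math.exp(slope) - 1
print(f"{annual_growth:.1%}")  # about 6.4% per year
```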

(c) Next we shall set up an indicator variable for whether the year was prior to or after the 1973 Oil Crisis. We will also need an interaction variable between this indicator and Year to enable us to fit separate slopes to the two time periods.

    Data S3A3.Oil;
      Set S3A3.Oil;
      If Year < 1973 then Ind1973 = 0;
      else Ind1973 = 1;
      Interaction = Ind1973*Year;
    Run;

Now we fit a model with both the indicator variable and the interaction variable:

    PROC REG Data=S3A3.Oil plots=none;
      Model LogBarrels = Year Ind1973 Interaction;
      Plot LogBarrels*Predicted.;
      Plot Student.*Predicted.;
      Plot Student.*nqq.;
      Plot CookD.*Year;
    Run; quit;

[1 mark]

The REG Procedure
Model: MODEL1
Dependent Variable: LogBarrels

    Number of Observations Read                 30
    Number of Observations Used                 29
    Number of Observations with Missing Values   1

    Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              3        107.09025      35.69675   2661.87   <.0001
    Error             25          0.33526       0.01341
    Corrected Total   28        107.42551

    Root MSE         0.11580   R-Square   0.9969
    Dependent Mean   8.13163   Adj R-Sq   0.9965
    Coeff Var        1.42411

    Parameter Estimates
    Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept      1           -121.31370          1.76711    -68.65   <.0001
    Year           1              0.06650       0.00091254     72.87   <.0001
    Ind1973        1            132.18496         17.78712      7.43   <.0001
    Interaction    1             -0.06697          0.00898     -7.46   <.0001

This results in the following fitted lines:

    logBarrels = -121.314 + 0.06650*Year   if Year < 1973
    logBarrels =   10.871 - 0.00047*Year   if Year >= 1973

[4 marks]

This analysis clearly shows that the pattern of production subsequent to 1973 was very different from that prior to that date. Prior to 1973, oil production was steadily increasing at an exponential rate, whereas after 1973 it became essentially static over time. [1 mark]

The diagnostic plots are:

[Figures: diagnostic plots - 1 mark each]

[Figure: diagnostic plot - 1 mark]

The observed value against fitted value plot now seems much more linear, and the influence of the last few years has been greatly reduced. Some major issues still remain, however: there appears to be correlation between the errors (not unexpected for data gathered over time like this), and the normality assumption still seems suspect. The first observation still has a rather elevated Cook's Distance and a very large negative residual, which is a sign that the model does not fit well that far back.
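The post-1973 fitted line quoted in part (c) is obtained by summing the corresponding pairs of coefficients from the Parameter Estimates table; a quick arithmetic check in Python (values copied from the SAS output):

```python
# Coefficients from the Parameter Estimates table in part (c)
b_intercept, b_year = -121.31370, 0.06650
b_ind1973, b_interaction = 132.18496, -0.06697

# For Year >= 1973 the indicator equals 1, so its terms fold into the line
post_intercept = b_intercept + b_ind1973
post_slope = b_year + b_interaction

print(round(post_intercept, 3), round(post_slope, 5))  # 10.871 -0.00047
```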

Q. 3 (a) Write \(\bar{x}_w = \sum_{i=1}^n w_i x_i \big/ \sum_{i=1}^n w_i\) and \(\bar{y}_w = \sum_{i=1}^n w_i y_i \big/ \sum_{i=1}^n w_i\). Then

\[
\sum_{i=1}^n w_i (x_i - \bar{x}_w)
= \sum_{i=1}^n w_i x_i - \bar{x}_w \sum_{i=1}^n w_i
= \sum_{i=1}^n w_i x_i - \sum_{i=1}^n w_i x_i
= 0 .
\]

Hence we can write

\[
\sum_{i=1}^n w_i (x_i - \bar{x}_w)^2
= \sum_{i=1}^n w_i (x_i - \bar{x}_w) x_i - \bar{x}_w \sum_{i=1}^n w_i (x_i - \bar{x}_w)
= \sum_{i=1}^n w_i (x_i - \bar{x}_w) x_i .
\]

By symmetry we must also have \(\sum_{i=1}^n w_i (y_i - \bar{y}_w) = 0\) and so we can write

\[
\sum_{i=1}^n w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)
= \sum_{i=1}^n w_i (x_i - \bar{x}_w) y_i
= \sum_{i=1}^n w_i x_i (y_i - \bar{y}_w) .
\]

[1 mark]

(b) From our notes the weighted least squares criterion is

\[
S_w(\beta_0, \beta_1) = \sum_{i=1}^n w_i \left( y_i - \beta_0 - \beta_1 x_i \right)^2 .
\]

The partial derivatives of this criterion are

\[
\frac{\partial S_w}{\partial \beta_0} = -2 \sum_{i=1}^n w_i (y_i - \beta_0 - \beta_1 x_i),
\qquad
\frac{\partial S_w}{\partial \beta_1} = -2 \sum_{i=1}^n w_i x_i (y_i - \beta_0 - \beta_1 x_i),
\]

and so we need to solve the two equations

\[
\sum_{i=1}^n w_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \tag{1}
\]
\[
\sum_{i=1}^n w_i x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \tag{2}
\]

From Equation (1) we see that

\[
\sum_{i=1}^n w_i y_i - \hat\beta_0 \sum_{i=1}^n w_i - \hat\beta_1 \sum_{i=1}^n w_i x_i = 0
\quad\Longrightarrow\quad
\hat\beta_0 = \frac{\sum_i w_i y_i}{\sum_i w_i} - \hat\beta_1 \frac{\sum_i w_i x_i}{\sum_i w_i}
= \bar{y}_w - \hat\beta_1 \bar{x}_w .
\]

Inserting this into Equation (2) we get

\[
\sum_{i=1}^n w_i x_i (y_i - \bar{y}_w + \hat\beta_1 \bar{x}_w - \hat\beta_1 x_i) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^n w_i x_i (y_i - \bar{y}_w) - \hat\beta_1 \sum_{i=1}^n w_i x_i (x_i - \bar{x}_w) = 0 ,
\]

and using the identities from part (a) this becomes

\[
\sum_{i=1}^n w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w) - \hat\beta_1 \sum_{i=1}^n w_i (x_i - \bar{x}_w)^2 = 0
\quad\Longrightarrow\quad
\hat\beta_1 = \frac{\sum_{i=1}^n w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_{i=1}^n w_i (x_i - \bar{x}_w)^2} .
\]

(c) I will show the unbiasedness of \(\hat\beta_1\) first. From above, and using part (a), we can write

\[
\hat\beta_1
= \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2}
= \frac{\sum_i w_i (x_i - \bar{x}_w) y_i}{\sum_j w_j (x_j - \bar{x}_w)^2}
= \sum_{i=1}^n \frac{w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2}\, y_i .
\]

Replacing the observed \(y_i\) with the corresponding random variables \(Y_i\) we get the random variable

\[
\hat\beta_1 = \sum_{i=1}^n \frac{w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2}\, Y_i .
\]

[3 marks]

Now taking expectations, with \(E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i\), we get

\[
E(\hat\beta_1 \mid x_1, \ldots, x_n)
= \sum_{i=1}^n \frac{w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2} (\beta_0 + \beta_1 x_i)
= \beta_0 \frac{\sum_i w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2}
+ \beta_1 \frac{\sum_i w_i (x_i - \bar{x}_w) x_i}{\sum_j w_j (x_j - \bar{x}_w)^2}
= \beta_0 \cdot 0 + \beta_1 \cdot 1
= \beta_1 ,
\]

since \(\sum_i w_i (x_i - \bar{x}_w) = 0\) and, by part (a), \(\sum_i w_i (x_i - \bar{x}_w) x_i = \sum_i w_i (x_i - \bar{x}_w)^2\). [3 marks]

Now for \(\hat\beta_0\) we have

\[
\hat\beta_0 = \bar{y}_w - \hat\beta_1 \bar{x}_w
= \frac{\sum_i w_i y_i}{\sum_j w_j} - \bar{x}_w \sum_{i=1}^n \frac{w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2}\, y_i
= \sum_{i=1}^n \left[ \frac{w_i}{\sum_j w_j} - \frac{\bar{x}_w\, w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2} \right] y_i ,
\]

and so the random variable \(\hat\beta_0\) is the same linear combination of the \(Y_i\). [3 marks]

Taking expectations again we get

\[
E(\hat\beta_0 \mid x_1, \ldots, x_n)
= \sum_{i=1}^n \left[ \frac{w_i}{\sum_j w_j} - \frac{\bar{x}_w\, w_i (x_i - \bar{x}_w)}{\sum_j w_j (x_j - \bar{x}_w)^2} \right] (\beta_0 + \beta_1 x_i)
= \beta_0 + \beta_1 \bar{x}_w - \bar{x}_w \left( \beta_0 \cdot 0 + \beta_1 \cdot 1 \right)
= \beta_0 ,
\]

using the same two identities as before. [3 marks]
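The closed-form estimators derived above can be checked numerically; here is a minimal sketch with made-up x, y and w values (not from any dataset in this assignment), confirming that the estimators satisfy the normal equations (1) and (2):

```python
# Arbitrary illustrative data and weights
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
w = [4.0, 1.0, 1.0, 0.5, 0.25]

# Weighted means
sw = sum(w)
xbar_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar_w = sum(wi * yi for wi, yi in zip(w, y)) / sw

# Closed-form WLS slope and intercept from the derivation above
b1 = (sum(wi * (xi - xbar_w) * (yi - ybar_w) for wi, xi, yi in zip(w, x, y))
      / sum(wi * (xi - xbar_w) ** 2 for wi, xi in zip(w, x)))
b0 = ybar_w - b1 * xbar_w

# Residual sums from the two normal equations; both should be ~0
eq1 = sum(wi * (yi - b0 - b1 * xi) for wi, xi, yi in zip(w, x, y))
eq2 = sum(wi * xi * (yi - b0 - b1 * xi) for wi, xi, yi in zip(w, x, y))
print(abs(eq1) < 1e-9, abs(eq2) < 1e-9)  # True True
```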

Q. 4 (a) Here is the code used to construct the transformed variables and to fit the transformed model, together with the output:

    Data S3A3.Chroma;
      Set S3A3.Chroma;
      LogAmt = log(Amount);
      LogOut = log(Output);
    Run;

    PROC REG Data=S3A3.Chroma plots=none;
      Model LogOut = LogAmt;
      Plot Student.*LogAmt;
      Plot Student.*nqq.;
      Output Out=ChromaOut Predicted=Fitted Student=Res_stud;
    Run;

Dependent Variable: LogOut

    Number of Observations Read   20
    Number of Observations Used   20

    Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1         68.82878      68.82878   10816.7   <.0001
    Error             18          0.11454       0.00636
    Corrected Total   19         68.94332

    Root MSE         0.07977   R-Square   0.9983
    Dependent Mean   4.37741   Adj R-Sq   0.9982
    Coeff Var        1.82231

    Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1              3.47291          0.01984    175.01   <.0001
    LogAmt       1              1.12399          0.01081    104.00   <.0001

The usual diagnostic plots on the residuals for the fitted model are shown below.

[Figures: studentized residuals against LogAmt and normal quantile-quantile plot]

We can't see much from the first of these in this case. The residual against fitted value plot shows that the variability for the smallest amount is much greater than for the other amounts. There is also evidence of non-linearity, since the residuals when the amount is 1 are all negative whereas when the amount is 5 they are all positive. The normality assumption also seems somewhat suspect here. [5 marks]

(b) We can adapt the code given in SAS Lab 11 for this problem to get the weights:

    PROC MEANS Data=ChromaOut NWay Noprint;
      Class Amount;
      Var Res_stud;
      Output Out=temp Mean=Mean Var=Var;
    Run;

    PROC PRINT Data=temp;
    Run;

    Obs   Amount   _TYPE_   _FREQ_    Mean       Var
    1      0.25       1        5      0.50423   2.34418
    2      1          1        5     -1.06795   0.03096
    3      5          1        5      0.78009   0.06755
    4     20          1        5     -0.20390   0.02024

    Data temp;
      Set temp;
      If _N_ = 1 then call symput('var1', Var);
      If _N_ = 2 then call symput('var2', Var);
      If _N_ = 3 then call symput('var3', Var);
      If _N_ = 4 then call symput('var4', Var);
    Run;

    PROC MEANS Data=ChromaOut Noprint;
      Var Res_stud;
      Output Out=temp1 Mean=Mean Var=Var;
    Run;

    Data temp1;
      Set temp1;
      Call symput('vare', Var);
    Run;

    Data ChromaOut;
      Set ChromaOut;
      If Amount = 0.25 then var = &var1;
      If Amount = 1    then var = &var2;
      If Amount = 5    then var = &var3;
      If Amount = 20   then var = &var4;
      w = &vare/var;
    Run;

Looking at the resulting dataset we see that the weights are

    w_i = 0.4507    if x_i = 0.25
    w_i = 34.1236   if x_i = 1
    w_i = 15.6430   if x_i = 5
    w_i = 52.2046   if x_i = 20

[6 marks]
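The weights can also be reconstructed by hand from the per-amount variances printed by PROC MEANS: each weight is the overall residual variance divided by the group variance. A hedged Python sketch; the overall variance is not shown in the output above, so 1.0566 below is an inferred value that reproduces the printed weights to about two decimal places:

```python
# Per-amount residual variances copied from the PROC PRINT table above
group_var = {0.25: 2.34418, 1: 0.03096, 5: 0.06755, 20: 0.02024}

# Overall residual variance: assumed/inferred value (not printed above)
overall_var = 1.0566

# Weight for each amount = overall variance / group variance
weights = {amt: overall_var / v for amt, v in group_var.items()}
for amt in sorted(weights):
    print(amt, round(weights[amt], 2))
```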

(c) We will now use these weights in a weighted regression:

    PROC REG Data=ChromaOut;
      Model LogOut = LogAmt;
      Weight w;
      Plot Student.*LogAmt;
      Plot Student.*nqq.;
    Run;

Dependent Variable: LogOut
Weight: w

    Number of Observations Read   20
    Number of Observations Used   20

    Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1       1244.78315    1244.78315   25359.4   <.0001
    Error             18          0.88354       0.04909
    Corrected Total   19       1245.66669

    Root MSE         0.22155   R-Square   0.9993
    Dependent Mean   5.43278   Adj R-Sq   0.9993
    Coeff Var        4.07807

    Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1              3.41177          0.01603    212.86   <.0001
    LogAmt       1              1.14399          0.00718    159.25   <.0001

There hasn't been much change in the values of the estimates in this model, but the root mean square error is much larger than before, although the R-square is also larger. The useful diagnostic plots are:

[Figures: studentized residuals and normal quantile-quantile plots from the weighted regression]

The spread of the studentized residuals at each predicted value is now much closer to constant. The normality of the residuals based on the weighted regression also seems more plausible than before. There is still a major issue with non-linearity, however. [5 marks]

Report on the Analysis

In this analysis I fitted a linear model to predict the output of the gas chromatograph from the known amounts of substance used, with both variables transformed to the log scale. A standard analysis revealed that the variability of the output was much larger for the lowest amount, so the model assumptions were violated. To fix this problem, I estimated the variance of the residuals for each of the amounts and used weighted least squares regression with weights equal to the overall variance of the residuals divided by the variance for the amount used in each observation. The fitted weighted regression line was

    log(Output) = 3.412 + 1.144 log(Amount)

The intercept and slope in this model are both significantly different from 0, with both p-values less than 0.0001. The model explained 99.93% of the variability in the output of the gas chromatograph. From the plot of the residuals of this model against the fitted values we see that the weighted regression did indeed alleviate the problem of non-constant variance. There is, however, a strong indication that a linear model is not appropriate here, since the residuals for any given amount are generally all positive (under-estimation) or all negative (over-estimation). This suggests that the relationship is not linear, so either a different transformation is required or there is a problem with the calibration of the gas chromatograph. The normal quantile-quantile plot does not suggest any major departures from normality or outliers in the data. [5 marks]
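On the original measurement scale, the fitted line above is a power law; a short illustrative back-transformation in Python (Amount = 2.0 is an arbitrary example value, not one from the calibration data):

```python
import math

# The fitted weighted regression line log(Output) = 3.412 + 1.144 log(Amount)
# is equivalent to the power law Output = exp(3.412) * Amount**1.144.
b0, b1 = 3.412, 1.144
amount = 2.0  # arbitrary illustrative amount

pred_log_scale = math.exp(b0 + b1 * math.log(amount))
pred_power_law = math.exp(b0) * amount ** b1

# The two forms agree up to floating-point rounding
print(abs(pred_log_scale - pred_power_law) < 1e-9)  # True
```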