
Topic 19: Remedies

Outline. Review regression diagnostics. Remedial measures. Weighted regression. Ridge regression. Robust regression. Bootstrapping.

Regression Diagnostics Summary. Check normality of the residuals with a normal quantile plot or histogram. Plot the residuals versus the predicted values, versus each of the X's, and (when appropriate) versus time/space. Examine the partial regression plots. Use the graphics smoother to see if there appears to be a curvilinear pattern.

Regression Diagnostics Summary. Examine the studentized deleted residuals (RSTUDENT in the output), the hat matrix diagonals, DFFITS, Cook's D, and the DFBETAS. Check observations that are extreme on these measures relative to the other observations.

Regression Diagnostics Summary. Examine the tolerance for each X. If there are variables with low tolerance, you need to do some model building: recode variables or use variable selection.

Remedial measures. Weighted least squares. Ridge regression. Robust regression. Nonparametric regression. Bootstrapping.

Maximum Likelihood
$Y_i \sim N(\beta_0 + \beta_1 X_i,\ \sigma^2)$
$f_i = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2}$
$L = f_1 f_2 \cdots f_n$ (likelihood function)
Find $\beta_0$ and $\beta_1$ which maximize $L$.

Maximum Likelihood
What if the $Y_i$ have different (but known) variances?
$Y_i \sim N(\beta_0 + \beta_1 X_i,\ \sigma_i^2)$
$f_i = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{1}{2\sigma_i^2}(Y_i - \beta_0 - \beta_1 X_i)^2}$
$L = f_1 f_2 \cdots f_n$ (likelihood function)
Find $\beta_0$ and $\beta_1$ which maximize $L$.

Weighted regression
Maximization of $L$ with respect to the $\beta$'s is equivalent to minimization of
$\sum_i \frac{1}{\sigma_i^2}\,(Y_i - \beta_0 - \beta_1 X_{i,1} - \cdots - \beta_{p-1} X_{i,p-1})^2$
The weight of each case is $w_i = 1/\sigma_i^2$.

Weighted least squares
The least squares problem is to minimize the sum of $w_i$ times the squared residual for case $i$.
Computations are easy: use the weight statement in proc reg.
$b_w = (X'WX)^{-1}X'WY$, where $W$ is a diagonal matrix of the weights.
The problem in practice now becomes determining the weights.
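
As an illustration of this formula, here is a small Python sketch (not the SAS the course uses) that solves the weighted normal equations $(X'WX)\,b_w = X'WY$ by hand for a single predictor; the data are made up, and since they fall exactly on a line, any positive weights recover the same coefficients:

```python
# Weighted least squares for one predictor, solving the 2x2 normal
# equations (X'WX) b = (X'WY) directly.  Illustrative sketch only;
# the data below are made up, not the KNNL blood pressure example.

def wls_slr(x, y, w):
    """Return (b0, b1) minimizing sum of w_i*(y_i - b0 - b1*x_i)^2."""
    sw   = sum(w)
    swx  = sum(wi * xi for wi, xi in zip(w, x))
    swy  = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    # Solve [ sw   swx  ] [b0]   [ swy  ]
    #       [ swx  swxx ] [b1] = [ swxy ]
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]                  # exactly y = 1 + 2x
b0, b1 = wls_slr(x, y, [1.0] * 4)         # equal weights: plain OLS
b0w, b1w = wls_slr(x, y, [1.0, 2.0, 2.0, 1.0])
# with an exact-fit line, any positive weights give the same answer
```

The weight statement in proc reg performs the same minimization, just via the general matrix formula.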

Determination of weights. Find a relationship between the absolute residual and another variable and use this as a model for the standard deviation. A similar approach uses the squared residual to model the variance. Or use grouped data (or approximately grouped data) to estimate the variance for all cases in the group.

Determination of weights With a model for the standard deviation or the variance, we can approximate the optimal weights Optimal weights are proportional to the inverse of the variance

KNNL Example. KNNL p 427. Y is diastolic blood pressure, X is age. n = 54 healthy adult women aged 20 to 60 years.

Get the data and check it: data a1; infile '../data/ch11ta01.txt'; input age diast; proc print data=a1; run;

Plot the relationship symbol1 v=circle i=sm70; proc gplot data=a1; plot diast*age / frame; run;

Diastolic bp vs age: strong linear relationship, no skewness, but nonconstant variance.

Run the regression proc reg data=a1; model diast=age; output out=a2 r=resid; run;

Regression output
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1      2374.96833   2374.96833    35.79  <.0001
Error            52      3450.36501     66.35317
Corrected Total  53      5825.33333

Root MSE        8.14575   R-Square  0.4077
Dependent Mean 79.11111   Adj R-Sq  0.3963
Coeff Var      10.29659

Regression output
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            56.15693         3.99367    14.06    <.0001
age         1             0.58003         0.09695     5.98    <.0001

With nonconstant variance, the estimators are still unbiased but no longer have minimum variance, and prediction interval coverage is often lower or higher than 95%.

Use the output data set a2 to get the absolute and squared residuals: data a2; set a2; absr=abs(resid); sqrr=resid*resid; run;

Generate plots with a smooth proc gplot data=a2; plot (resid absr sqrr)*age; run;

Plot: absolute value of the residuals (absr) vs age

Squared residuals vs age

Model the std dev vs age (absolute value of the residual): proc reg data=a2; model absr=age; output out=a3 p=shat; run; Note that a3 has the predicted standard deviations (shat).

Compute the weights data a3; set a3; wt=1/(shat*shat);

Regression with weights proc reg data=a3; model diast=age / clb; weight wt; run;

Output
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1        83.34082     83.34082    56.64  <.0001
Error            52        76.51351      1.47141
Corrected Total  53       159.85432

Root MSE        1.21302   R-Square  0.5214
Dependent Mean 73.55134   Adj R-Sq  0.5122
Coeff Var       1.64921

Output
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  95% Confidence Limits
Intercept   1            55.56577         2.52092    22.04    <.0001   50.5072  60.6244
age         1             0.59634         0.07924     7.53    <.0001   0.43734  0.75534

Note the reduction in the standard error of the age coefficient.

Ridge regression. If $(X'X)$ is difficult to invert (nearly singular), approximate its inverse by inverting $(X'X + kI)$. The estimators of the coefficients are now biased but more stable. For some value of $k$, the ridge regression estimator has a smaller mean squared error than the ordinary least squares estimator. It can also be used to reduce the number of predictors. ridge=k is an option for the model statement; cross-validation or ridge plots are used to determine $k$.

Ridge Regression
Can express the ridge constraint in terms of finding $b$ to minimize
$(Y - Z\beta)'(Y - Z\beta) + \lambda \sum_j \beta_j^2$, where $Z$ is the standardized $X$.
Note: the LASSO is a variation of this approach in which you minimize
$(Y - Z\beta)'(Y - Z\beta)$ subject to $\sum_j |\beta_j| \le t$.
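
As a numerical illustration (a Python sketch with made-up data, not the course's SAS example): with a single standardized predictor, the ridge estimator $(Z'Z + kI)^{-1}Z'Y$ reduces to $z'y/(z'z + k)$, so increasing $k$ shrinks the slope toward zero:

```python
# Ridge shrinkage sketch with one standardized predictor:
# b(k) = z'y / (z'z + k).  Toy data; with several predictors you
# would solve the system (Z'Z + kI) b = Z'Y instead.

def ridge_slope(z, y, k):
    return sum(zi * yi for zi, yi in zip(z, y)) / (sum(zi * zi for zi in z) + k)

z = [-1.5, -0.5, 0.5, 1.5]          # centered and scaled predictor
y = [-3.1, -0.9, 1.1, 2.9]
bs = [ridge_slope(z, y, k) for k in (0.0, 0.5, 2.0)]
# k = 0 is ordinary least squares; larger k means more shrinkage
# (more bias) but a more stable, lower-variance estimate
```

This is the bias/variance trade-off the slide describes: each positive $k$ pulls the estimate toward zero in exchange for stability.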

Ridge Regression in SAS: proc reg data=a1 outest=b; model fat=skinfold thigh midarm / ridge=0 to .1 by .001; run;

Robust regression. The basic idea is to have a procedure that is not sensitive to outliers. Alternatives to least squares minimize the sum of the absolute values of the residuals, or the median of the squared residuals. Or do weighted regression with weights based on the residuals, and iterate.
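
The "weight and iterate" idea can be sketched in Python (toy data, not course code): iteratively reweighted least squares with weights $w_i = 1/|r_i|$ approximates the least-absolute-deviations fit, pulling the line toward the bulk of the data and away from an outlier:

```python
# Iteratively reweighted least squares (IRLS) sketch approximating
# the least-absolute-deviations fit: weights w_i = 1/|residual_i|,
# refit, repeat.  Toy data with one gross outlier; a production
# M-estimator would also estimate scale and test convergence.

def wls(x, y, w):
    sw   = sum(w)
    swx  = sum(wi * xi for wi, xi in zip(w, x))
    swy  = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    return (swy - b1 * swx) / sw, b1

def lad_irls(x, y, iters=50, eps=1e-6):
    w = [1.0] * len(x)
    for _ in range(iters):
        b0, b1 = wls(x, y, w)
        w = [1.0 / max(abs(yi - b0 - b1 * xi), eps)
             for xi, yi in zip(x, y)]
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [1.1, 2.0, 2.9, 4.1, 20.0]          # last case is an outlier
b0_ols, b1_ols = wls(x, y, [1.0] * 5)   # slope dragged up by (5, 20)
b0_rob, b1_rob = lad_irls(x, y)         # slope pulled back toward
                                        # the bulk of the data
```

The iteration downweights cases with large residuals at each refit, which is exactly why the procedure is insensitive to the outlier.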

Nonparametric regression. Several versions; we have used i=sm70. Interesting theory. All versions have some smoothing or penalty parameter, similar to the 70 in i=sm70.

Regression trees. A standard approach in the area of data mining, replacing multiple regression. Basically, partition the X space into rectangles by repeatedly splitting the data into two nodes based on a single predictor. The predicted value is the mean of the responses in the rectangle.
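
A single split step can be sketched in Python (hypothetical toy data): scan every cut point on one predictor and keep the one minimizing the total within-node sum of squared errors; a real tree applies this recursively over all predictors:

```python
# One regression-tree split: scan the cut points on a single
# predictor and keep the one minimizing the combined SSE around the
# two node means.  Toy data; real trees recurse on each node.

def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    pairs = sorted(zip(x, y))
    best_cut, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        left  = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_sse = total
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.2, 4.8, 9.1, 8.9, 9.0]      # two flat clusters
cut = best_split(x, y)                  # splits between x=3 and x=10
```

Each node then predicts the mean response of its rectangle, as the slide describes.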

Bootstrap. A very important theoretical development that has had a major impact on applied statistics. It uses resampling to approximate the sampling distribution. Sample with replacement from the data or the residuals, and repeatedly refit the model to get the distribution of the quantity of interest.
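
A minimal Python sketch of the case-resampling version (made-up data, not course code): resample (x, y) pairs with replacement, refit the slope each time, and read a percentile interval off the refitted slopes:

```python
import random

# Case-resampling bootstrap sketch for a regression slope: draw
# cases with replacement, refit, and use the percentiles of the
# refitted slopes.  Toy data; resampling the residuals instead is
# the other standard option mentioned in the slides.

def ols_slope(x, y):
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

random.seed(1)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1]
boot = []
for _ in range(1000):
    idx = [random.randrange(len(x)) for _ in range(len(x))]
    if len(set(idx)) < 2:               # skip degenerate resamples
        continue
    boot.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
boot.sort()
lo = boot[int(0.025 * len(boot))]       # approximate 95% percentile
hi = boot[int(0.975 * len(boot)) - 1]   # interval for the slope
```

The sorted resampled slopes stand in for the sampling distribution of the estimator, which is the whole idea of the bootstrap.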

Background Reading. We used the program topic19.sas. This completes Chapter 11.