Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013

Similar documents
Topic 23: Diagnostics and Remedies

Lecture 4. Checking Model Adequacy

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model

Topic 20: Single Factor Analysis of Variance

SAS Commands. General Plan. Output. Construct scatterplot / interaction plot. Run full model

Topic 25 - One-Way Random Effects Models. Outline. Random Effects vs Fixed Effects. Data for One-way Random Effects Model. One-way Random effects

One-Way Analysis of Variance (ANOVA) There are two key differences regarding the explanatory variable X.

Outline. Topic 22 - Interaction in Two Factor ANOVA. Interaction Not Significant. General Plan

One-way ANOVA Model Assumptions

Single Factor Experiments

Topic 28: Unequal Replication in Two-Way ANOVA

Topic 32: Two-Way Mixed Effects Model

Comparison of a Population Means

Topic 2. Chapter 3: Diagnostics and Remedial Measures

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

Introduction to Design and Analysis of Experiments with the SAS System (Stat 7010 Lecture Notes)

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3

dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR=" = -/\<>*"; ODS LISTING;

Odor attraction CRD Page 1

Outline. Topic 19 - Inference. The Cell Means Model. Estimates. Inference for Means Differences in cell means Contrasts. STAT Fall 2013

Chapter 11. Analysis of Variance (One-Way)

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

Outline Topic 21 - Two Factor ANOVA

Analysis of variance and regression. April 17, Contents Comparison of several groups One-way ANOVA. Two-way ANOVA Interaction Model checking

Outline. Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping

Outline. Analysis of Variance. Acknowledgements. Comparison of 2 or more groups. Comparison of serveral groups

Week 7.1--IES 612-STA STA doc

Analysis of variance. April 16, Contents Comparison of several groups

Analysis of variance. April 16, 2009

Topic 14: Inference in Multiple Regression

Outline. Analysis of Variance. Comparison of 2 or more groups. Acknowledgements. Comparison of serveral groups

Analysis of Variance

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

Two-factor studies. STAT 525 Chapter 19 and 20. Professor Olga Vitek

Answer Keys to Homework#10

Assignment 9 Answer Keys

This is a Randomized Block Design (RBD) with a single factor treatment arrangement (2 levels) which are fixed.

ANOVA Longitudinal Models for the Practice Effects Data: via GLM

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA)

Statistics for exp. medical researchers Comparison of groups, T-tests and ANOVA

Lecture 11: Simple Linear Regression

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study

WITHIN-PARTICIPANT EXPERIMENTAL DESIGNS

a = 4 levels of treatment A = Poison b = 3 levels of treatment B = Pretreatment n = 4 replicates for each treatment combination

Topic 18: Model Selection and Diagnostics

In many situations, there is a non-parametric test that corresponds to the standard test, as described below:

Exam details. Final Review Session. Things to Review

STAT5044: Regression and Anova

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Statistics in Stata Introduction to Stata

Topic 29: Three-Way ANOVA

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

SAS/STAT 14.1 User s Guide. Introduction to Nonparametric Analysis

Chapter 16. Nonparametric Tests

Lecture 11 Multiple Linear Regression

Answer to exercise: Blood pressure lowering drugs

Introduction to Crossover Trials

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I

Handout 1: Predicting GPA from SAT

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

Lecture 4. Random Effects in Completely Randomized Design

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Chapter 1 Linear Regression with One Predictor

T-test: means of Spock's judge versus all other judges 1 12:10 Wednesday, January 5, judge1 N Mean Std Dev Std Err Minimum Maximum

Introduction to SAS proc mixed

Formal Statement of Simple Linear Regression Model

Analysis of Longitudinal Data: Comparison Between PROC GLM and PROC MIXED. Maribeth Johnson Medical College of Georgia Augusta, GA

Overview Scatter Plot Example

Introduction to SAS proc mixed

Lecture 3: Inference in SLR

Chapter 6 Multiple Regression

Assessing Model Adequacy

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Chapter 14 Logistic and Poisson Regressions

Analysis of Covariance

1 One-way Analysis of Variance

SAS Syntax and Output for Data Manipulation: CLDP 944 Example 3a page 1

The ε ij (i.e. the errors or residuals) are normally distributed. This assumption has the least influence on the F test.

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

Math 5305 Notes. Diagnostics and Remedial Measures. Jesse Crawford. Department of Mathematics Tarleton State University

Introduction to Nonparametric Analysis (Chapter)

Descriptions of post-hoc tests

Lecture 7 Remedial Measures

PLS205 Lab 2 January 15, Laboratory Topic 3

Chapter 2 Inferences in Simple Linear Regression

Introduction to Statistical Analysis

Covariance Structure Approach to Within-Cases

Lecture 10: 2 k Factorial Design Montgomery: Chapter 6

The Model Building Process Part I: Checking Model Assumptions Best Practice

Repeated Measures Part 2: Cartoon data

Introduction to Linear regression analysis. Part 2. Model comparisons

Inferences About the Difference Between Two Means

Prentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12)

Module 6: Model Diagnostics

E509A: Principle of Biostatistics. (Week 11(2): Introduction to non-parametric. methods ) GY Zou.

Outline. Topic 13 - Model Selection. Predicting Survival - Page 350. Survival Time as a Response. Variable Selection R 2 C p Adjusted R 2 PRESS

Transcription:

Topic 20 - Diagnostics and Remedies - Fall 2013 Diagnostics Plots Residual checks Formal Tests Remedial Measures Outline Topic 20 2 General assumptions Overview Normally distributed error terms Independent observations Constant variance Will adopt or adapt diagnostics and remedial measures from linear regression Many are the same but others require slight modifications Residuals Predicted values are the cell means ˆµ i = Y i. Residuals are the difference between the observed and predicted Properties: Same least squares properties j e ij = 0 i e ij = Y ij Y i. Topic 20 3 Topic 20 4

Basic Plots Plot the data vs the factor levels Plot the residuals vs the factor levels Plot the residuals vs the fitted values Histogram of the residuals QQplot of the residuals Example Page 777 Experiment designed to study the effectiveness of four rust inhibitors Forty units were used in the experiment Units randomly and equally assigned to rust inhibitors (n i = 10) Each unit exposed to severe weather conditions (accelerated life study) Y coded score (higher means less rust) X brand of rust inhibitor i = 1, 2, 3, 4 j = 1, 2,.., 10 Topic 20 5 Topic 20 6 SAS Commands data a1; infile u:\.www\datasets525\ch17ta02.txt ; input score brand; Scatterplot symbol1 v=circle i=none; proc gplot; plot score*brand; ***Scatterplot; proc glm; class brand; model score=brand; output out=a2 r=res p=pred; proc gplot; plot res*(brand pred); ***Residual plots; proc univariate noprint; ***Histogram and QQplot; histogram res / normal kernel(l=2); qqplot res / normal (L=1 mu=est sigma=est); run; Topic 20 7 Topic 20 8

Residual Plot I Residual Plot II Topic 20 9 Topic 20 10 Histogram QQplot Topic 20 11 Topic 20 12

Look for Outliers Non-constant variance Non-normal errors Summary Nothing too obvious in this example Can plot residuals vs time or other explanatory variable if they are available Formal Tests Normality Wilk-Shapiro Anderson-Darling Kolmogorov-Smirnov Homogeneity of Variance Hartley test Modified Levene test - Brown-Forsythe Bartlett s Topic 20 13 Topic 20 14 Homogeneity Tests Hartley statistic (Table B.10) H = max(s2 i ) min(s 2 i ) Modified Levene - Brown-Forsythe Same as regression approach Groups are the factor levels Bartlett s test Basically a likelihood ratio test Homogeneity Tests Often caught in problem with assumptions ANOVA is robust with respect to moderate deviations from normality ANOVA results can be sensitive to homogeneity of variance, especially when there is unequal cell size Homogeneity tests are often sensitive to normality assumption Modified Levene s often best choice Topic 20 15 Topic 20 16

Modified Levene s Test More robust against normality Considers the absolute deviation of each observation Y ij about its factor level median Ỹi d ij = Yij Ỹi Tests whether the expected value of these absolute deviations is equal across factor levels Simply performs ANOVA on these absolute deviations Example Page 783 Experiment designed to assess the strength of five types of flux used in soldering wire boards Forty units were used in the experiment Units randomly and equally assigned to flux type (n i = 8) Y strength X type of flux Topic 20 17 Topic 20 18 SAS Commands Scatterplot data a1; infile u:\.www\datasets525\ch18ta02.txt ; input strength type; proc gplot; plot strength*type; proc glm; class type; model strength=type; means type / hovtest=bf; means type / hovtest=levene(type=abs); lsmeans type / stderr cl; ***scatterplot; ***Modified Levene test; ***Uses mean rather than median; Topic 20 19 Topic 20 20

Output Source DF Sum of Squares Mean Square F Value Pr > F Model 4 353.6120850 88.4030213 41.93 <.0001 Error 35 73.7988250 2.1085379 Cor Total 39 427.4109100 Brown and Forsythe s Test for Homogeneity of strength Variance ANOVA of Absolute Deviations from Group Medians Sum of Mean Source DF Squares Square F Value Pr > F type 4 9.3477 2.3369 2.94 0.0341 Error 35 27.8606 0.7960 strength Standard type LSMEAN Error Pr > t 1 15.4200000 0.5133880 <.0001 2 18.5275000 0.5133880 <.0001 3 15.0037500 0.5133880 <.0001 4 9.7412500 0.5133880 <.0001 5 12.3400000 0.5133880 <.0001 Remedies Delete potential outliers Is their removal important? Use weighted regression Box-Cox Transformation Non-parametric procedures Topic 20 21 Topic 20 22 proc means data=a1; var strength; by type; output out=a2 var=s2; data a2; set a2; wt=1/s2; data a3; merge a1 a2; by type; SAS Commands ***Obtain sample variances for weights; GLM Output Sum of Source DF Squares Mean Square F Value Pr > F Model 4 324.2130988 81.0532747 81.05 <.0001 Error 35 35.0000000 1.0000000 Corrected Total 39 359.2130988 proc glm data=a3; class type; model strength=type; weight wt; lsmeans type / stderr cl; proc mixed data=a1; class type; model strength=type / ddfm=kr; repeated / group=type; ***Weighted ANOVA; ***Mixed model with diff variances for all types; Least Squares Means strength Standard type LSMEAN Error Pr > t 95% Confidence Limits 1 15.4200000 0.4373949 <.0001 14.532041 16.307959 2 18.5275000 0.4429921 <.0001 17.628178 19.426822 3 15.0037500 0.8791614 <.0001 13.218957 16.788543 4 9.7412500 0.2887129 <.0001 9.155132 10.327368 5 12.3400000 0.2720294 <.0001 11.787751 12.892249 Topic 20 23 Topic 20 24

Cov Parm Group Estimate Residual type 1 1.5305 Residual type 2 1.5699 Residual type 3 6.1834 Residual type 4 0.6668 Residual type 5 0.5920 MIXED Output Fit Statistics -2 Res Log Likelihood 122.1 AIC (smaller is better) 132.1 AICC (smaller is better) 134.2 BIC (smaller is better) 140.6 Null Model Likelihood Ratio Test DF Chi-Square Pr > ChiSq 4 13.73 0.0082 MIXED Output Type 3 Tests of Fixed Effects Num Den Effect DF DF F Value Pr > F type 4 14.8 71.78 <.0001 Least Squares Means Standard Effect type Estimate Error DF t Value Pr > t type 1 15.4200 0.4374 7 35.25 <.0001 type 2 18.5275 0.4430 7 41.82 <.0001 type 3 15.0038 0.8792 7 17.07 <.0001 type 4 9.7413 0.2887 7 33.74 <.0001 type 5 12.3400 0.2720 7 45.36 <.0001 Topic 20 25 Topic 20 26 Summary GLM (weighted ANOVA) and MIXED analysis provide the same factor level estimates and standard errors. Without ddfm=kr, F test and df are also the same. With ddfm=kr, F test similar to Welch F test and factor level df more reasonable. Can consider groups of factor levels with similar variances Group1=1 : Type 1 and 2 Group1=2 : Type 3 Group1=3 :Type 4 and 5 Cov Parm Group Estimate Residual Group 1 1.5502 Residual Group 2 6.1834 Residual Group 3 0.6294 MIXED Output Fit Statistics -2 Res Log Likelihood 122.1 AIC (smaller is better) 128.1 AICC (smaller is better) 128.9 ***Much better fit BIC (smaller is better) 133.2 Null Model Likelihood Ratio Test DF Chi-Square Pr > ChiSq 2 13.70 0.0011 Topic 20 27 Topic 20 28

MIXED Output Type 3 Tests of Fixed Effects Num Den Effect DF DF F Value Pr > F type 4 19.8 77.68 <.0001 Least Squares Means Standard Effect type Estimate Error DF t Value Pr > t type 1 15.4200 0.4402 14 35.03 <.0001 type 2 18.5275 0.4402 14 42.09 <.0001 type 3 15.0038 0.8792 7 17.07 <.0001 type 4 9.7413 0.2887 14 34.73 <.0001 type 5 12.3400 0.2720 14 43.99 <.0001 Delta Method Consider response X with E(X)=µ x and Var(X)=σx 2 Define Y = f(x); What is the mean and var of Y? Consider the following Taylor series expansion Consider f(x) where f (µ x ) 0 f(x) f(µ x ) + (X µ x )f (µ x ) E(Y )=E(f(X)) E(f(µ x )) + E((X µ x )f (µ x ))= f(µ x ) Var(Y ) [f (µ x )] 2 Var(X) = [f (µ x )] 2 σx 2 Topic 20 29 Topic 20 30 Transformations Suppose σ 2 x depends on µ x σ 2 x = g(µ x ) Want to find Y = f(x) such that Var(Y ) c Have shown Var(f(X)) [f (µ x )] 2 σ 2 x Want to choose f such that [f (µ x )] 2 g(µ x ) c Examples g(µ) = µ (Poisson) f(x) = 1 µ dµ f(x) = X 1 g(µ) = µ(1 µ) (Binomial) f(x) = dµ f(x) = asin( X) µ(1 µ) g(µ) = µ 2β (Box-Cox) f(x) = µ β dµ f(x) = X 1 β g(µ) = µ 2 1 (Box-Cox) f(x) = dµ f(x) = log X µ Transformation Guides Regress log(s i ) vs log(y i. ) λ = 1 b 1 Use Proc TRANSREG proc transreg data=a1; model boxcox(strength)=class(type); When σ 2 i µ i use When σ i µ i use log When σ i µ 2 i use 1/Y For proportions, use arcsin( ) arsin(sqrt(y)) is SAS data step Topic 20 31 Topic 20 32

GLM Output - Box-Cox Transformation Sum of Source DF Squares Mean Square F Value Pr > F Model 4 0.03284060 0.00821015 50.05 <.0001 Error 35 0.00574192 0.00016405 Corrected Total 39 0.03858252 bcstrength Standard type LSMEAN Error Pr > t 95% Confidence Limits 1 0.50507700 0.00452845 <.0001 0.495884 0.514270 2 0.48230819 0.00452845 <.0001 0.473115 0.491501 3 0.50998508 0.00452845 <.0001 0.500792 0.519178 4 0.56659809 0.00452845 <.0001 0.557405 0.575791 5 0.53382040 0.00452845 <.0001 0.524627 0.543014 Topic 20 33 Topic 20 34 Nonparametric Approach Based on ranking the observations and using the ranks for inference SAS procedure NPAR1WAY Could also perform permutation test Under H 0 all observations from population with same mean Any observation could be assigned to any group Permute observations numerous times For each permutation compute test statistic Compare observed statistic with this distribution Topic 20 35 Topic 20 36

proc npar1way data=a1; var strength; class type; *Approximate approach proc rank; var strength; ranks strength1; proc glm; class type; model strength1=type; run; SAS Commands Output Wilcoxon Scores (Rank Sums) Sum of Expected Std Dev Mean type N Scores Under H0 Under H0 Score 1 8 201.0 164.0 29.573377 25.1250 2 8 282.0 164.0 29.573377 35.2500 3 8 190.0 164.0 29.573377 23.7500 4 8 36.0 164.0 29.573377 4.5000 5 8 111.0 164.0 29.573377 13.8750 Kruskal-Wallis Test Chi-Square 32.1634 DF 4 Pr > Chi-Square <.0001 Topic 20 37 Topic 20 38 Output Median Scores (Number of Points Above Median) Sum of Expected Std Dev Mean type N Scores Under H0 Under H0 Score 1 8 7.0 4.0 1.281025 0.8750 2 8 8.0 4.0 1.281025 1.0000 3 8 5.0 4.0 1.281025 0.6250 4 8 0.0 4.0 1.281025 0.0000 5 8 0.0 4.0 1.281025 0.0000 Median One-Way Analysis Chi-Square 28.2750 DF 4 Pr > Chi-Square <.0001 Background Reading KNNL Chapter 18 knnl777.sas, knnl783.sas KNNL Chapter 19 Topic 20 39 Topic 20 40