Week 7.1--IES 612-STA STA doc

Similar documents
IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

Chapter 8 (More on Assumptions for the Simple Linear Regression)

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

One-way ANOVA Model Assumptions

Lecture 4. Checking Model Adequacy

In many situations, there is a non-parametric test that corresponds to the standard test, as described below:

Topic 23: Diagnostics and Remedies

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA)

This is a Randomized Block Design (RBD) with a single factor treatment arrangement (2 levels) which are fixed.

Assessing Model Adequacy

PLS205 Lab 2 January 15, Laboratory Topic 3

Handout 1: Predicting GPA from SAT

df=degrees of freedom = n - 1

Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

Topic 20: Single Factor Analysis of Variance

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

4.1. Introduction: Comparing Means

Lecture notes on Regression & SAS example demonstration

Lecture 3: Inference in SLR

IX. Complete Block Designs (CBD s)

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

Inference for the Regression Coefficient

ANOVA: Analysis of Variation

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3

General Linear Model (Chapter 4)

22s:152 Applied Linear Regression. Take random samples from each of m populations.

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model

Comparison of a Population Means

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA

Business Statistics. Lecture 10: Course Review

SAS Commands. General Plan. Output. Construct scatterplot / interaction plot. Run full model

Exam details. Final Review Session. Things to Review

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I

5.3 Three-Stage Nested Design Example

Introduction to Crossover Trials

Booklet of Code and Output for STAC32 Final Exam

Analysis of variance and regression. April 17, Contents Comparison of several groups One-way ANOVA. Two-way ANOVA Interaction Model checking

Confidence Interval for the mean response

y response variable x 1, x 2,, x k -- a set of explanatory variables

Chapter 11. Analysis of Variance (One-Way)

The Random Effects Model Introduction

Lecture 11: Simple Linear Regression

Lecture 1 Linear Regression with One Predictor Variable.p2

Final Exam. Name: Solution:

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

E509A: Principle of Biostatistics. (Week 11(2): Introduction to non-parametric. methods ) GY Zou.

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

Battery Life. Factory

EXST7015: Estimating tree weights from other morphometric variables Raw data print

N J SS W /df W N - 1

SEVERAL μs AND MEDIANS: MORE ISSUES. Business Statistics

Outline. Analysis of Variance. Comparison of 2 or more groups. Acknowledgements. Comparison of serveral groups

Booklet of Code and Output for STAC32 Final Exam

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

Introduction to Design and Analysis of Experiments with the SAS System (Stat 7010 Lecture Notes)

Single Factor Experiments

Analysis of Variance

Analysis of variance. April 16, Contents Comparison of several groups

STATISTICS 141 Final Review

Failure Time of System due to the Hot Electron Effect

Linear Combinations of Group Means

Analysis of variance. April 16, 2009

dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR=" = -/\<>*"; ODS LISTING;

Odor attraction CRD Page 1

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

Chapter 11 : State SAT scores for 1982 Data Listing

610 - R1A "Make friends" with your data Psychology 610, University of Wisconsin-Madison

4.8 Alternate Analysis as a Oneway ANOVA

STA 101 Final Review

Chapter 6 Multiple Regression

Repeated Measures Part 2: Cartoon data

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION

Analysis of variance (ANOVA) Comparing the means of more than two groups

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Analysis of Covariance

Answer Keys to Homework#10

Outline. Analysis of Variance. Acknowledgements. Comparison of 2 or more groups. Comparison of serveral groups

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

STATISTICS 479 Exam II (100 points)

Assignment 9 Answer Keys

Degrees of freedom df=1. Limitations OR in SPSS LIM: Knowing σ and µ is unlikely in large

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter.

Turning a research question into a statistical question.

Analysis of Variance

STAT 401A - Statistical Methods for Research Workers

R in Linguistic Analysis. Wassink 2012 University of Washington Week 6

Correlation and Simple Linear Regression

22s:152 Applied Linear Regression. 1-way ANOVA visual:

Stat 427/527: Advanced Data Analysis I

Statistics for exp. medical researchers Comparison of groups, T-tests and ANOVA

Two-factor studies. STAT 525 Chapter 19 and 20. Professor Olga Vitek

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Analysis of 2x2 Cross-Over Designs using T-Tests

Non-parametric (Distribution-free) approaches p188 CN

Ch Inference for Linear Regression

The Statistical Sleuth in R: Chapter 5

Transcription:

Week 7.1--IES 612-STA 4-573-STA 4-576.doc IES 612/STA 4-576 Winter 2009 ANOVA MODELS model adequacy aka RESIDUAL ANALYSIS Numeric data samples from t populations obtained Assume Y ij ~ independent N(μ i, σ 2 ) or Y ij = μ i + ε ij with ε ij ~ independent N(0, σ 2 ) i = 1,2,, t (populations or treatments) and j = 1, 2,, n i (observations) Residual definition: e ij = yij-y i. same as before residual = - Assumption Checking? Addressing? 1. Constant variance? Plot e ij vs. sample means and look for a pattern 2. Normal responses? - Normal probability plot (normal scores vs. residual quantiles) - Histogram? Boxplot? Stemplot? - 68% of standardized residuals with -1 and +1 (95% within -2 and +2) - Shapiro-Wilk Test of Normality 3. Independent? Plot residuals vs. order of observations? Often implicit part of the design 4. Outliers? Large standardized residuals? ±3 rule - Transformation (e.g. log, sqrt) - Weighted Least Squares - Transformation (sqrt count responses, arcsin-sqrt proportions, log right skewed responses) - GLiMs (generalized linear models binomial / Poisson responses as example) Analysis that reflects dependence? Time series/spatial data/mixed models - Check? - Run analysis with and without points? - Rank-based methods? In Regression, we were concerned about the correct form of the regression model. Why are we NOT concerned about this in ANOVA? What happens if assumptions NOT satisfied? Assumption Impact? 1. Constant variance? MSE (pooled variance estimate) may be incorrect 2. Normal responses? Stat tests may not be correct C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 1

3. Independent? MOST important se(y-bar s) too small 4. Outliers? MSE inflated CI s wider/harder to detect diffs C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 2

Example Bacteria growth in meat under different packaging conditions (again!) Model: Log (Bacterial Growth) ij = μ i + ε ij, where ε ij are independent N (0, σ 2 ). Using R 1. Constant Variance? > boxplot(logcount ~ Condition) 3 4 5 6 7 2. Normality? CO2 mixed plastic vacuum > plot(meatanova) > shapiro.test(meatanova$residuals) Shapiro-Wilk normality test data: MeatANOVA$residuals W = 0.8942, p-value = 0.1336 Residuals Standardized residuals -0.6-0.4-0.2 0.0 0.2 0.4 0.0 0.4 0.8 1.2 11 11 Residuals vs Fitted 4 5 6 7 Fitted values Scale-Location 3 2 2 3 Normal Q-Q 4 5 6 7 Standardized residuals -1.5-0.5 0.5 1.0 2 11 3 Fitted values -1.5-1.0-0.5 0.0 0.5 1.0 1.5 Theoretical Quantiles C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 3

3. Independence? Assume that correct sampling methods were used to insure independence. 4. Outliers? Using the above plots we note that none of the (standardized or studentized) residuals is beyond ± 3. Or using a boxplot of the residuals we note that none of the outliers are flagged as being outlying using the boxplot rule of being beyond the fences. -0.4-0.2 0.0 0.2 Using SAS Plot of resid*condition. Legend: A = 1 obs, B = 2 obs, etc. resid 0.4 ˆ A A A 0.2 ˆ A A A A 0.0 ˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ A -0.2 ˆ A A -0.4 ˆ A A -0.6 ˆ Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒ CO2 mixed plastic vacuum Condition C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 4

Plot of resid*yhat. Legend: A = 1 obs, B = 2 obs, etc. resid 0.4 ˆ A A A 0.2 ˆ A A A A 0.0 ˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ A -0.2 ˆ A A -0.4 ˆ A A -0.6 ˆ Šƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒ 3 4 5 6 7 8 yhat * Variances look constant [know pattern with increasing mean response] * None of the residuals stand out and look outlying C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 5

Plot of resid*nscore. Legend: A = 1 obs, B = 2 obs, etc. resid 0.4 ˆ B A 0.2 ˆ A A A A 0.0 ˆ A -0.2 ˆ A A -0.4 ˆ A A -0.6 ˆ Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒ -2-1 0 1 2 Rank for Variable resid * normal probability plot looks approximately linear C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 6

Example: How about the Meat study with using Bacterial Counts to show When assumptions go BAD! Model: Bacterial Growth ij = μ i + ε ij, where ε ij are independent N (0, σ 2 ). Using R 1. Constant Variance? Serious issues! > MeatData$Count = 10^MeatData$logCount > boxplot(count ~ Condition) 0e+00 1e+07 2e+07 3e+07 4e+07 5e+07 6e+07 CO2 mixed plastic vacuum 2. Normality? Serious issues! > plot(meatanovacount) > shapiro.test(meatanovacount$residuals) Standardized residuals Residuals -3e+07-1e+07 1e+07 3e+07 0.0 0.5 1.0 1.5 Residuals vs Fitted 3 9 2 0e+00 1e+07 2e+07 3e+07 4e+07 Fitted values Scale-Location 2 3 9 Shapiro-Wilk normality test 0e+00 1e+07 2e+07 3e+07 4e+07 data: MeatANOVACount$residuals W = 0.8102, p-value = 0.01227 Standardized residuals -2-1 0 1 2 9 Normal Q-Q Fitted values 3 2-1.5-1.0-0.5 0.0 0.5 1.0 1.5 Theoretical Quantiles C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 7

3. Independence? Assume that correct sampling methods were used to insure independence. 4. Outliers? Using the above plots we note that none of the residuals is beyond ± 3. Or using a boxplot of the residuals we note that three of the outliers are flagged as being outlying using the boxplot rule of being beyond the fences. -3e+07-2e+07-1e+07 0e+00 1e+07 2e+07 NOTE: By taking the log transformation of the Bacterial Counts, BOTH the Non Constant Variance AND the Normality problems were eliminated! C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 8

Using SAS Plot of resid*condition. Legend: A = 1 obs, B = 2 obs, etc. resid 1000 ˆ A 500 ˆ A A A A 0 ˆƒƒCƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒAƒƒ A A -500 ˆ A -1000 ˆ Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒ CO2 mixed plastic vacuum Condition C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 9

Plot of resid*yhat. Legend: A = 1 obs, B = 2 obs, etc. resid 1000 ˆ A 500 ˆ A A A A 0 ˆƒƒƒCƒƒƒƒAƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ A A -500 ˆ A -1000 ˆ Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒ 0 500 1000 1500 2000 yhat * Variance obviously increases with increasing mean response C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 10

Plot of resid*nscore. Legend: A = 1 obs, B = 2 obs, etc. resid 1000 ˆ A 500 ˆ A A A A 0 ˆ A A A A A A -500 ˆ A -1000 ˆ Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒˆƒƒ -2-1 0 1 2 Rank for Variable resid * Nonlinear pattern here and nonconstant variance goes hand-in-hand C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 11

LEVINE S TEST FOR HOMOGENEITY Levine s test is a commonly used test to determine if the variances are constant in the popln s. Levine s test is very simple. 1. Fit an ANOVA model and obtain the residuals about sample mean of group (or sample median of group as in Rcmdr). 2. Use the absolute values of these residuals in another ANOVA. 3. The resulting F test is Levine s Test of equality of variances. Rcmdr: can request this directly: Statistics > Variances > Levene s test Example: The Meat study with using logbacterial Counts Using R 1. Constant Variance? > summary(lm(abs(meatanova$res) ~ MeatBacteria$Condition)) Call: lm(formula = abs(meatanova$res) ~ MeatBacteria$Condition) Residuals: Min 1Q Median 3Q Max -0.153333-0.092500 0.001667 0.080000 0.166667 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 0.30000 0.07608 3.943 0.00428 ** MeatBacteria$Conditionmixed -0.15333 0.10760-1.425 0.19197 MeatBacteria$Conditionplastic 0.03333 0.10760 0.310 0.76464 MeatBacteria$Conditionvacuum -0.10000 0.10760-0.929 0.37989 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 0.1318 on 8 degrees of freedom Multiple R-squared: 0.3272, Adjusted R-squared: 0.0749 F-statistic: 1.297 on 3 and 8 DF, p-value: 0.3404 H o : variances equal in all groups H A : not all equal Conclusion: P-value = 0.34 implies fail to reject H 0 insufficient evidence to conclude variances differ between the groups. C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 12

Interpretation: Can not conclude that the variances are different (at α = 0.05 or α=0.10 or ) C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 13

Example: How about the Meat study with using Bacterial Counts to show When assumptions go BAD! Using R 1. Constant Variance? Serious issues! > summary(lm(abs(meatcountanova$res) ~ MeatBacteria$Condition)) Call: lm(formula = abs(meatcountanova$res) ~ MeatBacteria$Condition) Residuals: Min 1Q Median 3Q Max -13677052-23614 1359 1272265 9967189 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 1374 3658626 0.000375 0.99971 MeatBacteria$Conditionmixed 5588408 5174079 1.080 0.31159 MeatBacteria$Conditionplastic 19933005 5174079 3.852 0.00486 ** MeatBacteria$Conditionvacuum 177409 5174079 0.034 0.97349 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 6337000 on 8 degrees of freedom Multiple R-squared: 0.711, Adjusted R-squared: 0.6027 F-statistic: 6.561 on 3 and 8 DF, p-value: 0.01504 H o : H A : Conclusion: Interpretation: C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 14

NONPARAMETRIC KRUSKAL-WALLIS TEST Suppose assumptions are not met: data not Normal, data have outliers that are REAL, and/or the variances are not constant, WHAT DO YOU DO? CONSIDER A NONPARAMETRIC ALTERNATIVE! H 0 : the distributions all have the same location/center H a : at least two distributions differ in terms of location/center TS. Function of the RANKS of the observations RR: Special Tables or a normal approximation for large sample sizes Assumptions: distributions have same SHAPE and same SPREAD [so it isn t a free lunch] Opinion: Often a useful alternative if outliers present Example: The Meat study with using Bacterial Counts Using R With MeatBacteria the active data set, using RCommander: Statistics > Nonparametric tests > Kruskal-Wallis test... > tapply(meatbacteria$count, MeatBacteria$Condition, median, na.rm=true) CO2 mixed plastic vacuum 3235.937 21379620.895 45708818.961 275422.870 > kruskal.test(count ~ Condition, data=meatbacteria) Kruskal-Wallis rank sum test data: Count by Condition Kruskal-Wallis chi-squared = 9.4615, df = 3, p-value = 0.02374 C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 15

Using SAS title Nonparametric ANOVA/ Kruskal-Wallis ; title2 Bacteria in meat data ; data meat; input condition $ logcount @@; count = exp(logcount); datalines; plastic 7.66 plastic 6.98 plastic 7.80 vacuum 5.26 vacuum 5.44 vacuum 5.80 mixed 7.41 mixed 7.33 mixed 7.04 CO2 3.51 CO2 2.91 CO2 3.66 ; options ls=70 nocenter nodate formdlim= - ; proc npar1way; title3 response = log(count) ; class condition; var logcount; run; proc npar1way; title3 response = count ; class condition; var count; run; (output edited- NPAR1WAY gives lots of different nonparametric methods) ---------------------------------------------------------------------- Nonparametric ANOVA/ Kruskal-Wallis 1 Bacteria in meat data response = log(count) The NPAR1WAY Procedure Analysis of Variance for Variable logcount Classified by Variable condition condition N Mean ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ plastic 3 7.480 vacuum 3 5.500 mixed 3 7.260 CO2 3 3.360 Source DF Sum of Squares Mean Square F Value Pr > F ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Among 3 32.87280 10.957600 94.5844 <.0001 Within 8 0.92680 0.115850 ---------------------------------------------------------------------- The NPAR1WAY Procedure Wilcoxon Scores (Rank Sums) for Variable logcount Classified by Variable condition Sum of Expected Std Dev Mean condition N Scores Under H0 Under H0 Score ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ plastic 3 30.0 19.50 5.408327 10.0 vacuum 3 15.0 19.50 5.408327 5.0 mixed 3 27.0 19.50 5.408327 9.0 C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 16

CO2 3 6.0 19.50 5.408327 2.0 Kruskal-Wallis Test Chi-Square 9.4615 DF 3 Pr > Chi-Square 0.0237 ---------------------------------------------------------------------- Nonparametric ANOVA/ Kruskal-Wallis 7 Bacteria in meat data response = count The NPAR1WAY Procedure Analysis of Variance for Variable count Classified by Variable condition condition N Mean ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ plastic 3 1879.09259 vacuum 3 251.07441 mixed 3 1439.73191 CO2 3 30.22214 Source DF Sum of Squares Mean Square F Value Pr > F ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Among 3 7282652.347795 2427550.783 16.5587 0.0009 Within 8 1172820.616230 146602.577 ---------------------------------------------------------------------- Nonparametric ANOVA/ Kruskal-Wallis 8 Bacteria in meat data response = count The NPAR1WAY Procedure Wilcoxon Scores (Rank Sums) for Variable count Classified by Variable condition Sum of Expected Std Dev Mean condition N Scores Under H0 Under H0 Score ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ plastic 3 30.0 19.50 5.408327 10.0 vacuum 3 15.0 19.50 5.408327 5.0 mixed 3 27.0 19.50 5.408327 9.0 CO2 3 6.0 19.50 5.408327 2.0 Kruskal-Wallis Test Chi-Square 9.4615 DF 3 Pr > Chi-Square 0.0237 ---------------------------------------------------------------------- C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 17

A SMALL DIGRESSION INTO DESIGNS AND A CRD MODEL SOME QUICK TERMINOLOGY RESPONSE: Continuous measurement of interest FACTOR: Categorical variable that defines the populations LEVELS: The different values that the FACTOR can take on Also known as TREATMENTS. TREATMENTS: Different levels of the factor OR Combinations of several factors. COMPLETELY RANDOMIZED DESIGN (CRD): Treatments (or factor levels) are randomly assigned to experimental units or randomly sampling from existing populations (factor levels) that you want to compare. Only constraint on randomization is how many experimental units receive a particular treatment. BALANCED DESIGN: Any design in which there are an equal number of observations in each treatment (or factor level). For the moment in CRD s, balancing has no effect, but with more complex designs which we will encounter later, IT CAN AND WILL MAKE A HUGE DIFFERENCE! C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 18

DIFFERENT CRD MODELS Numeric data independent Random Samples from t populations obtained Assume Y ij ~ independent N(μ i, σ 2 ) i = 1,2,, t and j = 1, 2,, n i CELL MEANS or MEANS MODEL Defn: Y ij = μ i + ε ij with ε ij ~ independent N(0, σ 2 ) is the Cell Means or Means Model b/c: the observations are expressed as deviations from the Means of the populations with the population means, the μ i s, the parameters of interest. EFFECTS MODEL Defn: Y ij = μ + α i + ε ij with ε ij ~ independent N(0, σ 2 ) is the Effects Model b/c: the observations are still expressed as deviations from the Means of the populations, but now the means of the populations are represented in terms of an overall mean and an effect (or deviation) from this overall mean. NOTE 1. While models appear to be different, the pertinent questions of interest (namely ) can be investigated in EITHER MODEL! 2. While the Means Model is easier to present and understand, the Effects Model is the one that is used in software packages. PICTORIALLY μ 1 μ t μ 2 0 μ + α 1 μ + α t μ + α 2 C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 19

HYPOTHESIS OF INTEREST MEANS MODEL: H 0 : μ 1 = μ 2 = μ 3 = = μ t ie all t popln means equal EFFECTS MODEL: H 0 :? Note that Y ij = μ i + ε ij = Y ij = μ + α i + ε ij implies μ 1 = μ + α 1 μ 2 = μ + α 2 μ t = μ + α t and μ 1 = μ 2 = μ 3 = = μ t μ + α 1 = μ + α 2 = = μ + α t α 1 = α 2 = = α t so for EFFECTS MODEL: H 0 : α 1 = α 2 = = α t ESTIMATES OF MODEL PARAMETERS MEANS MODEL: Y ij = μ i + ε ij with ε ij ~ independent N(0, σ 2 ) Parameters are: the and. Recall we estimated with =. How would we estimate the? Note this model has t unknown μ s, but each is UNIQUELY estimated by the. EFFECTS MODEL: Y ij = μ + α i + ε ij with ε ij ~ independent N(0, σ 2 ) Parameters are:. We estimate also with =. How would we estimate and the? Note that this model has t + 1 unknowns in μ and the α s. To estimate the parameters requires a constraint. Typical constraints are: 1. set α 1 = 0 or 2. set α t = 0 or 3. set t i=1 n α =0 i i C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 20

Most programs (including R and SAS) use a constraint like the first or second. NOTES: 1. To test the hypothesis of interest in ANOVA it doesn t matter which constraint is used. 2. However, estimates of the parameters of the EFFECTS MODEL WILL be different depending on which constraint is used. Here s R and SAS output for the MeatBacteria data. > summary(meatanova) Call: lm(formula = logcount ~ Condition, data = MeatBacteria) Residuals: Min 1Q Median 3Q Max -0.500-0.225 0.110 0.210 0.320 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 3.3600 0.1965 17.10 1.39e-07 *** Condition[T.mixed] 3.9000 0.2779 14.03 6.45e-07 *** Condition[T.plastic] 4.1200 0.2779 14.82 4.22e-07 *** Condition[T.vacuum] 2.1400 0.2779 7.70 5.74e-05 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 0.3404 on 8 degrees of freedom Multiple R-squared: 0.9726, Adjusted R-squared: 0.9623 F-statistic: 94.58 on 3 and 8 DF, p-value: 1.376e-06 proc glm; class condition; model logcount= condition/solution; run; Dependent Variable: logcount Sum of Source DF Squares Mean Square F Value Pr > F Model 3 32.87280000 10.95760000 94.58 <.0001 Error 8 0.92680000 0.11585000 Corrected Total 11 33.79960000 Standard Parameter Estimate Error t Value Pr > t Intercept 5.500000000 B 0.19651124 27.99 <.0001 condition CO2-2.140000000 B 0.27790886-7.70 <.0001 condition mixed 1.760000000 B 0.27790886 6.33 0.0002 condition plastic 1.980000000 B 0.27790886 7.12 <.0001 condition vacuum 0.000000000 B... C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 21

FITTING THE MEANS AND EFFECTS MODELS IN R AND SAS The default models for most ANOVA program/packages is the Effects Model (Y ij = μ + α i + ε ij ) as we just saw in R and SAS! However, the Means Model can easily be obtained in such programs/packages by eliminating the μ term from the model. Using R With MeatBacteria the active data set, using RCommander: Statistics > Fit Models > Linear model... With the following output: > MeatMeansModel <- lm(logcount ~ Condition - 1, data=meatbacteria) > summary(meatmeansmodel) Call: lm(formula = logcount ~ Condition - 1, data = MeatBacteria) Residuals: Min 1Q Median 3Q Max -0.500-0.225 0.110 0.210 0.320 Coefficients: Estimate Std. Error t value Pr(> t ) ConditionCO2 3.3600 0.1965 17.10 1.39e-07 *** Conditionmixed 7.2600 0.1965 36.94 3.16e-10 *** Conditionplastic 7.4800 0.1965 38.06 2.49e-10 *** Conditionvacuum 5.5000 0.1965 27.99 2.87e-09 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 0.3404 on 8 degrees of freedom Multiple R-squared: 0.9979, Adjusted R-squared: 0.9969 F-statistic: 972.4 on 4 and 8 DF, p-value: 8.861e-11 C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 22

Using SAS options ls=110 nocenter nodate formdlim="-"; title "Fitting a Means Model in SAS"; title2 "Bacteria in meat data"; data meat; input condition $ logcount @@; datalines; plastic 7.66 plastic 6.98 plastic 7.80 vacuum 5.26 vacuum 5.44 vacuum 5.80 mixed 7.41 mixed 7.33 mixed 7.04 CO2 3.51 CO2 2.91 CO2 3.66 ; proc glm; class condition; model logcount = condition / noint solution; run; Fitting a Means Model in SAS 4 Bacteria in meat data The GLM Procedure Dependent Variable: logcount Sum of Source DF Squares Mean Square F Value Pr > F Model 4 450.5928000 112.6482000 972.36 <.0001 Error 8 0.9268000 0.1158500 Uncorrected Total 12 451.5196000 R-Square Coeff Var Root MSE logcount Mean 0.972580 5.768940 0.340367 5.900000 Source DF Type I SS Mean Square F Value Pr > F condition 4 450.5928000 112.6482000 972.36 <.0001 Source DF Type III SS Mean Square F Value Pr > F condition 4 450.5928000 112.6482000 972.36 <.0001 Standard Parameter Estimate Error t Value Pr > t condition CO2 3.360000000 0.19651124 17.10 <.0001 condition mixed 7.260000000 0.19651124 36.94 <.0001 condition plastic 7.480000000 0.19651124 38.06 <.0001 condition vacuum 5.500000000 0.19651124 27.99 <.0001 Comparing both the R and SAS outputs of this Means Models, our estimates of the parameters, the μ i s, are nothing more than the sample averages for each sample. Recall from R these sample averages were: > numsummary(logcount, groups=condition, statistics=c("mean", "sd")) mean sd n CO2 3.36 0.3968627 3 mixed 7.26 0.1946792 3 plastic 7.48 0.4386342 3 vacuum 5.50 0.2749545 3 C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 23

Below is the Means Model ANOVA output. > summary(meatmeansanova) Call: lm(formula = logcount ~ Condition - 1, data = MeatBacteria) Residuals: Min 1Q Median 3Q Max -0.500-0.225 0.110 0.210 0.320 Coefficients: Estimate Std. Error t value Pr(> t ) ConditionCO2 3.3600 0.1965 17.10 1.39e-07 *** Conditionmixed 7.2600 0.1965 36.94 3.16e-10 *** Conditionplastic 7.4800 0.1965 38.06 2.49e-10 *** Conditionvacuum 5.5000 0.1965 27.99 2.87e-09 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 0.3404 on 8 degrees of freedom Multiple R-squared: 0.9979, Adjusted R-squared: 0.9969 F-statistic: 972.4 on 4 and 8 DF, p-value: 8.861e-11 Below is the default Effects Model ANOVA output for comparison purposes. > summary(meatanova) Call: lm(formula = logcount ~ Condition, data = MeatBacteria) --- Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 3.3600 0.1965 17.10 1.39e-07 *** Condition[T.mixed] 3.9000 0.2779 14.03 6.45e-07 *** Condition[T.plastic] 4.1200 0.2779 14.82 4.22e-07 *** Condition[T.vacuum] 2.1400 0.2779 7.70 5.74e-05 *** --- Residual standard error: 0.3404 on 8 degrees of freedom Multiple R-squared: 0.9726, Adjusted R-squared: 0.9623 F-statistic: 94.58 on 3 and 8 DF, p-value: 1.376e-06 Note that 1. the estimate of σ, square root mse, is the same, 2. estimates of the β s are different (not surprising since the models are different!) 3. the F-test results are different 4. the R 2 value is different. C:\Users\baileraj\BAILERAJ\Classes\Web-CLASSES\ies-612\lectures\Week 7.1--IES 612-STA 4-576-22feb09.doc 2/22/2009 24