Biological Applications of ANOVA - Examples and Readings


BIO 575 Biological Applications of ANOVA - Winter Quarter 2010

ANOVA Pac
Biological Applications of ANOVA - Examples and Readings

One-factor Model I (Fixed Effects)

This is the same example for one-factor ANOVA used by Dr. M. in Biometrics class (it's in the BIO 211 Test Pac). The data are birth weights (g) for 36 babies. Each baby is categorized by the smoking habits of the mother during pregnancy. The value of the SMOKING variable below indicates group membership (i.e. the level). The three levels are: 1 = Nonsmoking, 2 = up to 1 pack/day, 3 = 1+ pack/day. The WEIGHT variable contains (you guessed it!) the birth weights.

The program does a "complete" analysis: it tests the assumptions of normality and homoscedasticity; does the ANOVA; does a planned contrast of nonsmoking babies vs. the combination of the two smoking groups; and also does 11 different multiple comparison tests.

SAS does four different tests for normality. The Shapiro-Wilk test is the most widely used. The null hypothesis is Ho: Distribution is Normal. So, when we fail to reject Ho (p > 0.05), we have no evidence against normality and treat the distribution as normal. Note that for each of the three smoking groups, all four normality tests lead to this conclusion.

The test for homoscedasticity is Brown and Forsythe's Test for Homogeneity of WEIGHT Variance. This is a form of Levene's Test, and is an ANOVA done on the absolute deviation of each weight from its group median.

The contrast tests the nonsmoking babies (n = 12, mean = 3612.7) against the smoking babies (n = 24, mean = 3131.8). Note that the "smoking babies" are all 24 babies in the 1 pack/day and 1+ pack/day groups combined. We can calculate the Contrast SS as a simple Groups SS with two groups (the values below use the unrounded means):

$$\text{Groups SS} = \sum_j n_j (\bar{X}_j - \bar{X})^2$$

$$12(3612.7 - 3292.1)^2 = 1233070.37$$

$$24(3131.8 - 3292.1)^2 = 616535.185$$

$$1233070.37 + 616535.185 = 1849605.556 = \text{Contrast SS}$$

Note that all 11 multiple comparison tests give the same result, i.e. that the 1+ pack/day group is different from the other two. Since all the tests agree, this is a robust conclusion.

    DATA BABYWT;
      INPUT SMOKING WEIGHT @@;   *The @@ allows multiple observations per line;
      SELECT (SMOKING);
        WHEN (1) SMOKE='Nonsmoke';
        WHEN (2) SMOKE='1 Pack';
        WHEN (3) SMOKE='1+ Pack';
      END;
    CARDS;
    1 3515 1 3420 1 3175 1 3586 1 3232 1 3884 1 3856 1 3941 1 3232 1 4054 1 3459 1 3998
    2 3444 2 3827 2 3884 2 3515 2 3416 2 3742 2 3062 2 3076 2 2835 2 2750 2 3460 2 3340
    3 2608 3 2509 3 3600 3 1730 3 3175 3 3459 3 3288 3 2920 3 3020 3 2778 3 2466 3 3260
    ;
    PROC UNIVARIATE NORMAL;
      CLASS SMOKE;
      VAR WEIGHT;
    PROC GLM;
      CLASS SMOKE;
      MODEL WEIGHT = SMOKE / SS3;
      *Coefficients follow the sorted level order: 1 Pack, 1+ Pack, Nonsmoke;
      CONTRAST 'Nonsmoking vs Smoking' SMOKE 1 1 -2 / E;
      MEANS SMOKE / HOVTEST=BF Tukey SNK Bon LSD REGWQ Scheffe Duncan Sidak Gabriel SMM Waller lines;
    RUN;
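The Contrast SS arithmetic above can be checked with a few lines of code. The following is a minimal Python sketch (Python is used here only for illustration; it is not part of the handout's SAS program). It computes the contrast two ways: as a two-group Groups SS, and with the general contrast formula SS = L^2 / sum(c_j^2 / n_j), which is a standard result and is not quoted from the handout. All variable names are made up for this example.

```python
# Group sums are taken from the SAS output (Sum Observations); n = 12 per group.
n = 12
means = {"Nonsmoke": 43352 / n, "1 Pack": 40351 / n, "1+ Pack": 34813 / n}

# With equal group sizes the grand mean is the mean of the group means.
grand_mean = sum(means.values()) / len(means)

# Way 1: nonsmoking (n = 12) vs. the two smoking groups pooled (n = 24).
smoking_mean = (means["1 Pack"] + means["1+ Pack"]) / 2
ss_two_group = (n * (means["Nonsmoke"] - grand_mean) ** 2
                + 2 * n * (smoking_mean - grand_mean) ** 2)

# Way 2: general contrast formula with the CONTRAST coefficients 1 1 -2.
coef = {"1 Pack": 1, "1+ Pack": 1, "Nonsmoke": -2}
L = sum(coef[g] * means[g] for g in coef)
ss_contrast = L ** 2 / sum(c ** 2 / n for c in coef.values())

print(round(ss_two_group, 3), round(ss_contrast, 3))
```

Both routes should agree with the Contrast SS that PROC GLM reports, because a single-degree-of-freedom contrast of one group against the pooled remainder is the same thing as a two-group ANOVA.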

Example 1 - One-factor Model I (Fixed Effects)

The UNIVARIATE Procedure
Variable: WEIGHT
SMOKE = 1 Pack

Moments
    N                  12            Sum Weights        12
    Mean               3362.58333    Sum Observations   40351
    Std Deviation      369.312742    Variance           136391.902
    Skewness           -0.2722961    Kurtosis           -0.8565015
    Uncorrected SS     137183911     Corrected SS       1500310.92
    Coeff Variation    10.9830064    Std Error Mean     106.611406

Basic Statistical Measures
    Location               Variability
    Mean      3362.583     Std Deviation          369.31274
    Median    3430.000     Variance               136392
    Mode      .            Range                  1134
                           Interquartile Range    559.50000

Tests for Location: Mu0=0
    Test           Statistic        p Value
    Student's t    t    31.54056    Pr > |t|     <.0001
    Sign           M    6           Pr >= |M|    0.0005
    Signed Rank    S    39          Pr >= |S|    0.0005

Tests for Normality
    Test                  Statistic           p Value
    Shapiro-Wilk          W       0.946847    Pr < W       0.5914
    Kolmogorov-Smirnov    D       0.142287    Pr > D       >0.1500
    Cramer-von Mises      W-Sq    0.044422    Pr > W-Sq    >0.2500
    Anderson-Darling      A-Sq    0.275845    Pr > A-Sq    >0.2500

Quantiles (Definition 5)
    Quantile      Estimate
    100% Max      3884.0
    99%           3884.0
    95%           3884.0
    90%           3827.0
    75% Q3        3628.5
    50% Median    3430.0
    25% Q1        3069.0
    10%           2835.0
    5%            2750.0
    1%            2750.0
    0% Min        2750.0

Extreme Observations
    Lowest           Highest
    Value    Obs     Value    Obs
    2750     22      3460     23
    2835     21      3515     16
    3062     19      3742     18
    3076     20      3827     14
    3340     24      3884     15

The UNIVARIATE Procedure
Variable: WEIGHT
SMOKE = 1+ Pack

Moments
    N                  12            Sum Weights        12
    Mean               2901.08333    Sum Observations   34813
    Std Deviation      520.779217    Variance           271210.992
    Skewness           -0.876079     Kurtosis           0.95442277
    Uncorrected SS     103978735     Corrected SS       2983320.92
    Coeff Variation    17.9511981    Std Error Mean     150.33601

Basic Statistical Measures
    Location               Variability
    Mean      2901.083     Std Deviation          520.77922
    Median    2970.000     Variance               271211
    Mode      .            Range                  1870
                           Interquartile Range    715.50000

Tests for Location: Mu0=0
    Test           Statistic        p Value
    Student's t    t    19.29733    Pr > |t|     <.0001
    Sign           M    6           Pr >= |M|    0.0005
    Signed Rank    S    39          Pr >= |S|    0.0005

Tests for Normality
    Test                  Statistic           p Value
    Shapiro-Wilk          W       0.946014    Pr < W       0.5796
    Kolmogorov-Smirnov    D       0.1184      Pr > D       >0.1500
    Cramer-von Mises      W-Sq    0.031416    Pr > W-Sq    >0.2500
    Anderson-Darling      A-Sq    0.256533    Pr > A-Sq    >0.2500

Quantiles (Definition 5)
    Quantile      Estimate
    100% Max      3600.0
    99%           3600.0
    95%           3600.0
    90%           3459.0
    75% Q3        3274.0
    50% Median    2970.0
    25% Q1        2558.5
    10%           2466.0
    5%            1730.0
    1%            1730.0
    0% Min        1730.0

Extreme Observations
    Lowest           Highest
    Value    Obs     Value    Obs
    1730     28      3175     29
    2466     35      3260     36
    2509     26      3288     31
    2608     25      3459     30
    2778     34      3600     27

The UNIVARIATE Procedure
Variable: WEIGHT
SMOKE = Nonsmoke

Moments
    N                  12            Sum Weights        12
    Mean               3612.66667    Sum Observations   43352
    Std Deviation      321.395065    Variance           103294.788
    Skewness           0.02321712    Kurtosis           -1.6519762
    Uncorrected SS     157752568     Corrected SS       1136242.67
    Coeff Variation    8.89633876    Std Error Mean     92.7787637

Basic Statistical Measures
    Location               Variability
    Mean      3612.667     Std Deviation          321.39507
    Median    3550.500     Variance               103295
    Mode      3232.000     Range                  879.00000
                           Interquartile Range    586.50000

Tests for Location: Mu0=0
    Test           Statistic        p Value
    Student's t    t    38.93851    Pr > |t|     <.0001
    Sign           M    6           Pr >= |M|    0.0005
    Signed Rank    S    39          Pr >= |S|    0.0005

Tests for Normality
    Test                  Statistic           p Value
    Shapiro-Wilk          W       0.906326    Pr < W       0.1914
    Kolmogorov-Smirnov    D       0.192176    Pr > D       >0.1500
    Cramer-von Mises      W-Sq    0.06868     Pr > W-Sq    >0.2500
    Anderson-Darling      A-Sq    0.442759    Pr > A-Sq    0.2419

Quantiles (Definition 5)
    Quantile      Estimate
    100% Max      4054.0
    99%           4054.0
    95%           4054.0
    90%           3998.0
    75% Q3        3912.5
    50% Median    3550.5
    25% Q1        3326.0
    10%           3232.0
    5%            3175.0
    1%            3175.0
    0% Min        3175.0

Extreme Observations
    Lowest           Highest
    Value    Obs     Value    Obs
    3175     3       3856     7
    3232     9       3884     6
    3232     5       3941     8
    3420     2       3998     12
    3459     11      4054     10

The GLM Procedure

Class Level Information
    Class    Levels    Values
    SMOKE    3         1 Pack  1+ Pack  Nonsmoke

    Number of Observations Read    36
    Number of Observations Used    36

Coefficients for Contrast Nonsmoking vs Smoking
                          Row 1
    Intercept              0
    SMOKE    1 Pack        1
    SMOKE    1+ Pack       1
    SMOKE    Nonsmoke     -2
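The entries in a Moments table are all simple functions of the raw data. As an illustration, the short Python sketch below recomputes several of them for the nonsmoking group using only the standard library. This is not how SAS is implemented; it is just the textbook definitions, and the variable names are invented for the example.

```python
import math
import statistics

# Birth weights (g) for the 12 babies in the nonsmoking group.
nonsmoke = [3515, 3420, 3175, 3586, 3232, 3884,
            3856, 3941, 3232, 4054, 3459, 3998]

n = len(nonsmoke)
mean = statistics.mean(nonsmoke)
variance = statistics.variance(nonsmoke)        # sample variance, divisor n - 1
std_dev = math.sqrt(variance)
corrected_ss = variance * (n - 1)               # sum of squared deviations from the mean
uncorrected_ss = sum(y * y for y in nonsmoke)   # sum of the squared raw values
std_err_mean = std_dev / math.sqrt(n)
coeff_variation = 100 * std_dev / mean          # SD as a percentage of the mean
t_stat = mean / std_err_mean                    # Student's t for Ho: mu = 0

for name, value in [("Mean", mean), ("Variance", variance),
                    ("Corrected SS", corrected_ss),
                    ("Std Error Mean", std_err_mean),
                    ("Coeff Variation", coeff_variation),
                    ("t", t_stat)]:
    print(f"{name:16s}{value:14.4f}")
```

Note that the "Tests for Location" block tests Ho: mean = 0, which is of no biological interest for birth weights; it is printed by default.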

The GLM Procedure
Dependent Variable: WEIGHT

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               2      3127499.056     1563749.528      9.18     0.0007
    Error              33      5619874.500      170299.227
    Corrected Total    35      8747373.556

    R-Square    Coeff Var    Root MSE    WEIGHT Mean
    0.357536    12.53522     412.6733    3292.111

    Source    DF    Type III SS    Mean Square    F Value    Pr > F
    SMOKE      2    3127499.056    1563749.528      9.18     0.0007

    Contrast                 DF    Contrast SS    Mean Square    F Value    Pr > F
    Nonsmoking vs Smoking     1    1849605.556    1849605.556     10.86     0.0024

Brown and Forsythe's Test for Homogeneity of WEIGHT Variance
ANOVA of Absolute Deviations from Group Medians

    Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
    SMOKE      2        117524          58762.2        0.97     0.3908
    Error     33       2005791          60781.6

Waller-Duncan K-ratio t Test for WEIGHT
NOTE: This test minimizes the Bayes risk under additive loss and certain other assumptions.

    Kratio                            100
    Error Degrees of Freedom          33
    Error Mean Square                 170299.2
    F Value                           9.18
    Critical Value of t               1.93269
    Minimum Significant Difference    325.61

Means with the same letter are not significantly different.

    Waller Grouping    Mean      N     SMOKE
    A                  3612.7    12    Nonsmoke
    A
    A                  3362.6    12    1 Pack
    B                  2901.1    12    1+ Pack

t Tests (LSD) for WEIGHT
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

    Alpha                           0.05
    Error Degrees of Freedom        33
    Error Mean Square               170299.2
    Critical Value of t             2.03452
    Least Significant Difference    342.76

Means with the same letter are not significantly different.

    t Grouping    Mean      N     SMOKE
    A             3612.7    12    Nonsmoke
    A
    A             3362.6    12    1 Pack
    B             2901.1    12    1+ Pack

Duncan's Multiple Range Test for WEIGHT
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

    Alpha                       0.05
    Error Degrees of Freedom    33
    Error Mean Square           170299.2

    Number of Means    2        3
    Critical Range     342.8    360.3

Means with the same letter are not significantly different.

    Duncan Grouping    Mean      N     SMOKE
    A                  3612.7    12    Nonsmoke
    A
    A                  3362.6    12    1 Pack
    B                  2901.1    12    1+ Pack

Student-Newman-Keuls Test for WEIGHT
NOTE: This test controls the Type I experimentwise error rate under the complete null hypothesis but not under partial null hypotheses.

    Alpha                       0.05
    Error Degrees of Freedom    33
    Error Mean Square           170299.2

    Number of Means    2            3
    Critical Range     342.76355    413.39853

Means with the same letter are not significantly different.

    SNK Grouping    Mean      N     SMOKE
    A               3612.7    12    Nonsmoke
    A
    A               3362.6    12    1 Pack
    B               2901.1    12    1+ Pack

Ryan-Einot-Gabriel-Welsch Multiple Range Test for WEIGHT
NOTE: This test controls the Type I experimentwise error rate.

    Alpha                       0.05
    Error Degrees of Freedom    33
    Error Mean Square           170299.2

    Number of Means    2            3
    Critical Range     342.76355    413.39853

Means with the same letter are not significantly different.

    REGWQ Grouping    Mean      N     SMOKE
    A                 3612.7    12    Nonsmoke
    A
    A                 3362.6    12    1 Pack
    B                 2901.1    12    1+ Pack

Tukey's Studentized Range (HSD) Test for WEIGHT
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

    Alpha                                  0.05
    Error Degrees of Freedom               33
    Error Mean Square                      170299.2
    Critical Value of Studentized Range    3.47019
    Minimum Significant Difference         413.4

Means with the same letter are not significantly different.

    Tukey Grouping    Mean      N     SMOKE
    A                 3612.7    12    Nonsmoke
    A
    A                 3362.6    12    1 Pack
    B                 2901.1    12    1+ Pack

Studentized Maximum Modulus (GT2) Test for WEIGHT
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

    Alpha                                            0.05
    Error Degrees of Freedom                         33
    Error Mean Square                                170299.2
    Critical Value of Studentized Maximum Modulus    2.50963
    Minimum Significant Difference                   422.81

Means with the same letter are not significantly different.

    SMM Grouping    Mean      N     SMOKE
    A               3612.7    12    Nonsmoke
    A
    A               3362.6    12    1 Pack
    B               2901.1    12    1+ Pack

Sidak t Tests for WEIGHT
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

    Alpha                             0.05
    Error Degrees of Freedom          33
    Error Mean Square                 170299.2
    Critical Value of t               2.51504
    Minimum Significant Difference    423.72

Means with the same letter are not significantly different.

    Sidak Grouping    Mean      N     SMOKE
    A                 3612.7    12    Nonsmoke
    A
    A                 3362.6    12    1 Pack
    B                 2901.1    12    1+ Pack

Bonferroni (Dunn) t Tests for WEIGHT
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

    Alpha                             0.05
    Error Degrees of Freedom          33
    Error Mean Square                 170299.2
    Critical Value of t               2.52221
    Minimum Significant Difference    424.93

Means with the same letter are not significantly different.

    Bon Grouping    Mean      N     SMOKE
    A               3612.7    12    Nonsmoke
    A
    A               3362.6    12    1 Pack
    B               2901.1    12    1+ Pack

Scheffe's Test for WEIGHT
NOTE: This test controls the Type I experimentwise error rate.

    Alpha                             0.05
    Error Degrees of Freedom          33
    Error Mean Square                 170299.2
    Critical Value of F               3.28492
    Minimum Significant Difference    431.83

Means with the same letter are not significantly different.

    Scheffe Grouping    Mean      N     SMOKE
    A                   3612.7    12    Nonsmoke
    A
    A                   3362.6    12    1 Pack
    B                   2901.1    12    1+ Pack
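Several numbers in the GLM output can be reproduced by hand, which helps show where they come from. The Python sketch below (an illustration only, not the SAS implementation) does three things: it builds the one-way ANOVA sums of squares and F, it repeats the same ANOVA on the absolute deviations from the group medians (the Brown and Forsythe test), and it computes a few minimum significant differences for equal sample sizes. The critical values are copied from the output above rather than computed, and the function name `one_way_anova` is made up for this example.

```python
import math
import statistics

groups = {
    "Nonsmoke": [3515, 3420, 3175, 3586, 3232, 3884, 3856, 3941, 3232, 4054, 3459, 3998],
    "1 Pack":   [3444, 3827, 3884, 3515, 3416, 3742, 3062, 3076, 2835, 2750, 3460, 3340],
    "1+ Pack":  [2608, 2509, 3600, 1730, 3175, 3459, 3288, 2920, 3020, 2778, 2466, 3260],
}

def one_way_anova(samples):
    """Return (between-groups SS, within-groups SS, F) for a list of samples."""
    all_values = [y for s in samples for y in s]
    grand_mean = statistics.mean(all_values)
    group_means = [statistics.mean(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, group_means))
    ss_within = sum((y - m) ** 2 for s, m in zip(samples, group_means) for y in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_value

samples = list(groups.values())
ss_model, ss_error, f_anova = one_way_anova(samples)

# Brown and Forsythe: the same ANOVA, but on |weight - group median|.
abs_dev = [[abs(y - statistics.median(s)) for y in s] for s in samples]
bf_between, bf_within, f_bf = one_way_anova(abs_dev)

# Minimum significant differences with equal n. Critical values are
# copied from the SAS output above (assumed, not computed here).
n, k, df_error = 12, 3, 33
mse = ss_error / df_error
se_diff = math.sqrt(2 * mse / n)                      # SE of a difference of two means
msd_lsd = 2.03452 * se_diff                           # t-based (LSD)
msd_bon = 2.52221 * se_diff                           # Bonferroni t
msd_tukey = 3.47019 * math.sqrt(mse / n)              # studentized range (HSD)
msd_scheffe = math.sqrt((k - 1) * 3.28492) * se_diff  # Scheffe, from critical F

print(f"ANOVA F = {f_anova:.2f}, Brown-Forsythe F = {f_bf:.2f}")
print(f"LSD {msd_lsd:.2f}  Bon {msd_bon:.2f}  Tukey {msd_tukey:.2f}  Scheffe {msd_scheffe:.2f}")
```

Seen this way, the single-critical-value tests differ only in the critical value that multiplies the same standard error, which is why their minimum significant differences line up in order of conservativeness.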

ANOVA by Regression

In this example, we'll do the same ANOVA we did in Example 1 (i.e. babies categorized by smoking status of mom during pregnancy). However, this time we will have Excel do the ANOVA by using regression procedures. This approach is not just for fun, or to see if we can fool Excel into doing "stupid ANOVA tricks". When we deal with unbalanced designs, it will be very important to understand that ANOVA problems can be solved by using regression. Also, SAS and other major statistical packages use this approach.

Below are the data and "summary output" from the Excel Regression Data Analysis tool. You can see that the birth weights are in the third variable (column). The first two variables are "dummy variables": they are codes that indicate smoking status. If the values for the dummy variables are 0 0, then that is a nonsmoking baby. Values of 1 0 indicate 1 pack/day. Values of 0 1 indicate 1+ pack/day.

First, check out the ANOVA table, and compare it to the ANOVA table prepared by SAS (or from the TestPac). Notice that Total SS is the same in both. Regression SS below is the same as Groups SS in the TestPac (called SMOKE SS in the SAS output). Residual SS below is the same as Error SS. The DF, MS, and F values also are the same.
Data (dummy variable 1, dummy variable 2, birth weight):

    X1    X2    Weight
    0     0     3515
    0     0     3420
    0     0     3175
    0     0     3586
    0     0     3232
    0     0     3884
    0     0     3856
    0     0     3941
    0     0     3232
    0     0     4054
    0     0     3459
    0     0     3998
    1     0     3444
    1     0     3827
    1     0     3884
    1     0     3515
    1     0     3416
    1     0     3742
    1     0     3062
    1     0     3076
    1     0     2835
    1     0     2750
    1     0     3460
    1     0     3340
    0     1     2608
    0     1     2509
    0     1     3600
    0     1     1730
    0     1     3175
    0     1     3459
    0     1     3288
    0     1     2920
    0     1     3020
    0     1     2778
    0     1     2466
    0     1     3260

SUMMARY OUTPUT

Regression Statistics
    Multiple R           0.59794296
    R Square             0.357535783
    Adjusted R Square    0.318598558
    Standard Error       412.6732694
    Observations         36

ANOVA
                  df    SS             MS            F           Significance F
    Regression     2    3127499.056    1563749.53    9.182364    0.000675317
    Residual      33    5619874.5      170299.227
    Total         35    8747373.556

                    Coefficients    Standard Error    t Stat        P-value     Lower 95%
    Intercept       3612.666667     119.1285116       30.3257937    1.12E-25    3370.297695
    X Variable 1    -250.083333     168.4731568       -1.4844106    0.14719     -592.8448197
    X Variable 2    -711.583333     168.4731568       -4.2237194    0.000178    -1054.34482
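To see that nothing mysterious happens inside the regression tool, the coefficients can be recovered by solving the least-squares normal equations directly. The sketch below is plain Python with a small hand-written Gauss-Jordan solver; it illustrates the idea and makes no claim about how Excel computes its answer. The names `data`, `aug`, and so on are made up for this example.

```python
# Dummy coding from the handout: (X1, X2) = (0, 0) nonsmoking,
# (1, 0) 1 pack/day, (0, 1) 1+ pack/day.
data = {
    (0, 0): [3515, 3420, 3175, 3586, 3232, 3884, 3856, 3941, 3232, 4054, 3459, 3998],
    (1, 0): [3444, 3827, 3884, 3515, 3416, 3742, 3062, 3076, 2835, 2750, 3460, 3340],
    (0, 1): [2608, 2509, 3600, 1730, 3175, 3459, 3288, 2920, 3020, 2778, 2466, 3260],
}

X, y = [], []
for (x1, x2), weights in data.items():
    for w in weights:
        X.append([1.0, float(x1), float(x2)])  # the leading 1 is the intercept column
        y.append(float(w))

p = 3
# Augmented matrix [X'X | X'y] for the normal equations (X'X) b = X'y.
aug = [[sum(r[i] * r[j] for r in X) for j in range(p)]
       + [sum(r[i] * yi for r, yi in zip(X, y))]
       for i in range(p)]

# Gauss-Jordan elimination with partial pivoting.
for col in range(p):
    pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
    aug[col], aug[pivot] = aug[pivot], aug[col]
    aug[col] = [v / aug[col][col] for v in aug[col]]
    for r in range(p):
        if r != col:
            factor = aug[r][col]
            aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]

intercept, b1, b2 = (row[p] for row in aug)
print(f"Intercept {intercept:.6f}, X1 {b1:.6f}, X2 {b2:.6f}")
```

The intercept comes out as the nonsmoking mean, and each slope is the difference between one smoking group's mean and the nonsmoking mean, because nonsmoking is the group coded 0 0.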

ANOVA by Regression: Calculations and Sources of Variation

The regression done above is a "multiple linear regression". In BIO 211, we did "simple linear regression", which means there was one dependent and one independent variable. Our regression "model" in BIO 211 was:

$$Y = a + bX$$

In multiple linear regression, there is one dependent variable and two or more independent variables. Our multiple regression model is:

$$Y = a + b_1 X_1 + b_2 X_2$$

In our data, $Y$ is the baby weights, $X_1$ is the first dummy variable, and $X_2$ the second dummy variable. The $b_1$ and $b_2$ values are called "partial regression coefficients". They are like the slope of the line in simple regression, and they are parameter estimates whose values are determined from the data. The interpretations are:

$b_1$ shows the effect of $X_1$ on $Y$ while holding $X_2$ constant
$b_2$ shows the effect of $X_2$ on $Y$ while holding $X_1$ constant

From the Excel output above, you should see that our equation is:

$$Y = 3612.666667 - 250.083333 X_1 - 711.583333 X_2$$

Let's see the predicted values of Y (calculated by putting in the values of the dummy variables):

    X1    X2    Baby    Predicted
    0     0     3515    3612.666667
    0     0     3420    3612.666667
    0     0     3175    3612.666667
    0     0     3586    3612.666667
    0     0     3232    3612.666667
    0     0     3884    3612.666667
    0     0     3856    3612.666667
    0     0     3941    3612.666667
    0     0     3232    3612.666667
    0     0     4054    3612.666667
    0     0     3459    3612.666667
    0     0     3998    3612.666667
    1     0     3444    3362.583334
    1     0     3827    3362.583334
    1     0     3884    3362.583334
    1     0     3515    3362.583334
    1     0     3416    3362.583334
    1     0     3742    3362.583334
    1     0     3062    3362.583334
    1     0     3076    3362.583334
    1     0     2835    3362.583334
    1     0     2750    3362.583334
    1     0     3460    3362.583334
    1     0     3340    3362.583334
    0     1     2608    2901.083334
    0     1     2509    2901.083334
    0     1     3600    2901.083334
    0     1     1730    2901.083334
    0     1     3175    2901.083334
    0     1     3459    2901.083334
    0     1     3288    2901.083334
    0     1     2920    2901.083334
    0     1     3020    2901.083334
    0     1     2778    2901.083334
    0     1     2466    2901.083334
    0     1     3260    2901.083334

Does anything here look familiar? Do you see how this works? Note that the predicted value for each baby is the mean for that group.

When $X_1 = 0$ and $X_2 = 0$, then $Y = 3612.666667$.

When $X_1 = 1$ and $X_2 = 0$, then $Y = 3612.666667 - 250.083333(1) = 3362.583334$. In other words, the mean of the 1 pack/day group is 250.083333 g less than the mean of the nonsmoking group.

When $X_1 = 0$ and $X_2 = 1$, then $Y = 3612.666667 - 711.583333(1) = 2901.083334$. In other words, the mean of the 1+ pack/day group is 711.583333 g less than the mean of the nonsmoking group.

This should make sense to you. All the dummy variables tell us is what smoking group a baby belongs to. If you're trying to estimate (predict) the birth weight of a baby, and all you know is what smoking group it is in, your "best guess" is the mean of the group. For example, let's say a baby has just been born, and it is classified into the 1 pack/day group. Now, pretend you have to guess the birth weight, and for every gram you are off, you have to pay Dr. M. $1.00! What do you do?? Your "best guess" is 3362.583334 grams, because that is right in the middle of the 1 pack/day group. That should minimize how much you have to pay to the evil Dr. M. (Strictly speaking, the group mean is the guess that minimizes the sum of the squared errors, which is the criterion that least-squares regression uses.)
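The "best guess" idea can be tried out numerically. The small Python sketch below (an illustration with made-up function names, not part of the handout) scores a guess for the 1 pack/day group in two ways: by the sum of squared errors, which is what least squares minimizes, and by the sum of absolute errors, which is what the "$1.00 per gram" game literally charges. The group mean wins under squared error; under absolute error the group median does at least as well as the mean.

```python
import statistics

# Birth weights (g) for the 1 pack/day group.
one_pack = [3444, 3827, 3884, 3515, 3416, 3742,
            3062, 3076, 2835, 2750, 3460, 3340]

def sse(guess):
    """Sum of squared errors if every baby is predicted to weigh `guess`."""
    return sum((y - guess) ** 2 for y in one_pack)

def sae(guess):
    """Sum of absolute errors (dollars owed at $1.00 per gram)."""
    return sum(abs(y - guess) for y in one_pack)

mean_guess = statistics.mean(one_pack)
median_guess = statistics.median(one_pack)

for label, guess in [("mean", mean_guess), ("median", median_guess)]:
    print(f"{label:6s} guess = {guess:9.3f}  SSE = {sse(guess):12.2f}  SAE = {sae(guess):8.2f}")
```

The SSE at the mean is the same quantity SAS reports as the Corrected SS for this group.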

Now, for the all-important sources of variation and sums of squares. First, in multiple regression, the sources of variation are exactly as they were in simple linear regression. Namely:

Total is the variation of the observed value of the dependent variable.
Regression is the variation of the predicted value of the dependent variable.
Residual (Error) is the variation of the difference between observed and predicted.

$$\text{Total SS} = \sum_i (Y_i - \bar{Y})^2$$

$$\text{Regression SS} = \sum_i (\hat{Y}_i - \bar{Y})^2$$

$$\text{Residual (Error) SS} = \sum_i (Y_i - \hat{Y}_i)^2$$

Now, compare and contrast SS in one-factor ANOVA with SS in regression.

Total SS

Notice that Total SS is exactly the same in one-factor ANOVA and regression. Look at the formula above, and then the ANOVA formula from the TestPac. In both cases, Total SS is the sum of the squared deviation of each baby weight from the grand mean of the baby weights.

Groups SS = Regression SS

Groups SS in ANOVA is

$$\text{Groups SS} = \sum_j n_j (\bar{X}_j - \bar{X})^2$$

At first, you're thinking this is totally different from Regression SS. But, it's the same! First of all, don't be thrown off by the use of X and Y. The X and Y both refer to the same variable in this case, i.e. baby birth weight. Groups SS says you take the (group mean - grand mean)^2 and multiply by the number of data points in the group. Look at the calculation of Groups SS in the TestPac. Now, think about how you would evaluate Regression SS. Remember, the predicted value of Y is the mean of that group. And what about the grand mean? It's the same: the mean of all 36 babies. So, in each group, you're taking the (group mean - grand mean)^2, and you do this once for each baby in the group. This is just like multiplying by the number of babies in the group.

Error SS = Residual (Error) SS

Check the TestPac pages for the calculation of Error SS in ANOVA. You calculate a SS for each group (each baby from their group mean), and then add them together. In Residual (Error) SS in regression, you're taking each baby minus the predicted baby and squaring, and then adding them all up. But remember, the predicted value for each baby is its group mean! So, you're doing the same thing as in ANOVA.

It is important that you understand the relationship between the sources of variation in ANOVA and regression. See Dr. M. if this is causing you problems. It's really not hard - you will get it if you think about it for a bit. If you don't remember the definition of "important" from BIO 211, ask Dr. M. in class!
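The equivalence described above is easy to confirm numerically. The Python sketch below (illustrative only; the variable names are invented) computes the three regression sums of squares using the group mean as each baby's predicted value, computes Groups SS from the ANOVA formula, and checks that they match.

```python
import statistics

groups = {
    "Nonsmoke": [3515, 3420, 3175, 3586, 3232, 3884, 3856, 3941, 3232, 4054, 3459, 3998],
    "1 Pack":   [3444, 3827, 3884, 3515, 3416, 3742, 3062, 3076, 2835, 2750, 3460, 3340],
    "1+ Pack":  [2608, 2509, 3600, 1730, 3175, 3459, 3288, 2920, 3020, 2778, 2466, 3260],
}

all_y = [y for g in groups.values() for y in g]
grand_mean = statistics.mean(all_y)

# Pair every observation with its predicted value (its group mean).
pairs = [(y, statistics.mean(g)) for g in groups.values() for y in g]

total_ss = sum((y - grand_mean) ** 2 for y, _ in pairs)
regression_ss = sum((y_hat - grand_mean) ** 2 for _, y_hat in pairs)  # once per baby
residual_ss = sum((y - y_hat) ** 2 for y, y_hat in pairs)

# ANOVA version: (group mean - grand mean)^2 times the group size.
groups_ss = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())

print(f"Total      {total_ss:14.3f}")
print(f"Regression {regression_ss:14.3f}  (Groups SS {groups_ss:.3f})")
print(f"Residual   {residual_ss:14.3f}")
```

Summing the squared (group mean - grand mean) term once per baby and multiplying it by the group size are the same operation, which is the whole point of the comparison.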

ANOVA as Done by Statistics Programs

Now that we've seen how ANOVA is done by regression, we can expand a bit and take a peek at how a statistics program (e.g. SAS) actually does these procedures. First, a couple of important points:

1. This is not comprehensive. We're not going to look at every detail of the calculations used by computer programs, but we will look at some of them.
2. Don't worry about these details for the exam. You're not expected to recreate all the matrices and other details. The general concepts of ANOVA by Regression in the previous section are important, but the details here are "value added" (which means "not on the test").

Our example will be the baby birth weight example (again). You tell the program what the response variable is (birth weight), and what the factor is (Smoking). The program then looks at your data and figures out that:

N (the total sample size) is 36.
The factor has three levels.

The program then knows it's working with the following model:

$$Y = b_0 + b_1 X1 + b_2 X2 + b_3 X3 + \varepsilon$$

where

Y is the dependent (response) variable (birth weight).

$b_0$ is the intercept. The intercept is included by default, but you can request it not be included in the model. Don't do this unless you really know what you are doing.

$b_0$, $b_1$, $b_2$, and $b_3$ are parameter estimates. These are the unknowns. The program has to estimate these parameters to do the analysis.

X1, X2, and X3 are dummy variables that indicate to what smoking level the baby belongs. The values 1 0 0 indicate a nonsmoking baby; 0 1 0 is a 1 pack/day baby; and 0 0 1 is a 1+ pack/day baby.

$\varepsilon$ is the error (residual).

The program then writes the model in matrix terms:

$$Y = X\beta + \varepsilon$$

where

Y is a vector containing the birth weights.

$\beta$ is a vector containing the $b_i$ symbols:

$$\beta = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}$$

X is a matrix called the "design matrix" that has the dummy variables. There are 4 columns in the design matrix. All the values in the first column will be 1. This first column refers to the intercept. The next 3 columns are the dummy variables X1, X2, and X3, in that order.

The OLS (ordinary least squares) solution is to solve for $\beta$:

$$X'Y = X'X\beta$$

$$(X'X)^{-}X'X\beta = (X'X)^{-}X'Y$$

$$I\beta = (X'X)^{-}X'Y$$

$$\beta = (X'X)^{-}X'Y$$

where

$X'$ is the transpose of X
$(X'X)^{-}$ is the inverse of $X'X$
$I$ is the identity matrix

Although we certainly will calculate the error term ($\varepsilon$) along the way, we don't include it here in our matrix approach. Let's begin by looking at the elements of Y and X.
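The first line of the OLS solution, $X'Y = X'X\beta$, is usually called the normal equations. The handout takes it as the starting point; for readers who want to see where it comes from, here is the standard derivation (a textbook result, not taken from the handout), where $S(\beta)$ is introduced here as a name for the sum of squared residuals:

```latex
% Sum of squared residuals as a function of the parameter vector
S(\beta) = (Y - X\beta)'(Y - X\beta)
         = Y'Y - 2\beta'X'Y + \beta'X'X\beta

% Set the derivative with respect to \beta to zero
\frac{\partial S}{\partial \beta} = -2X'Y + 2X'X\beta = 0
\quad\Longrightarrow\quad
X'X\beta = X'Y
```

Every quantity the program needs from here on ($X'X$, $X'Y$, and $Y'Y$) appears in the expansion of $S(\beta)$.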

The Y vector and the design matrix X look like this:

    Y       X
    3515    1 1 0 0
    3420    1 1 0 0
    3175    1 1 0 0
    3586    1 1 0 0
    3232    1 1 0 0
    3884    1 1 0 0
    3856    1 1 0 0
    3941    1 1 0 0
    3232    1 1 0 0
    4054    1 1 0 0
    3459    1 1 0 0
    3998    1 1 0 0
    3444    1 0 1 0
    3827    1 0 1 0
    3884    1 0 1 0
    3515    1 0 1 0
    3416    1 0 1 0
    3742    1 0 1 0
    3062    1 0 1 0
    3076    1 0 1 0
    2835    1 0 1 0
    2750    1 0 1 0
    3460    1 0 1 0
    3340    1 0 1 0
    2608    1 0 0 1
    2509    1 0 0 1
    3600    1 0 0 1
    1730    1 0 0 1
    3175    1 0 0 1
    3459    1 0 0 1
    3288    1 0 0 1
    2920    1 0 0 1
    3020    1 0 0 1
    2778    1 0 0 1
    2466    1 0 0 1
    3260    1 0 0 1

The program needs the transpose of Y (written Y') and the transpose of X (written X').

Y' has all the weights listed as a single row (a row vector). There's not enough room here to get all 36 birth weights on a single line, so you have to use your imagination.

X' looks like this:

    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1

Next, the program calculates X'X. Since X' is a 4x36 matrix and X is a 36x4 matrix, X'X must be a 4x4 matrix. That is: (4x36) x (36x4) = 4x4. Notice that the principal diagonal has the sample sizes: for all data in the 1,1 position, then for each level as you go down the diagonal.

X'X is:

    36  12  12  12
    12  12   0   0
    12   0  12   0
    12   0   0  12

Reality check: Programs may not actually construct the design matrix. They may read a line of data, construct the appropriate line for the design matrix, transpose that line, and then multiply the transpose by the design matrix line. Thus, the X'X matrix is being accumulated, one line at a time.
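The "one line at a time" idea in the reality check can be sketched in a few lines. The Python below is a hypothetical illustration of that accumulation strategy, not SAS's actual code: it never stores the 36x4 design matrix, and instead builds each design row on the fly and adds its contribution to X'X, X'Y, and Y'Y. All names (`level_weights`, `xtx`, `xty`, `yty`) are invented for the example.

```python
# Birth weights keyed by the SMOKING code: 1 = nonsmoking, 2 = 1 pack/day, 3 = 1+ pack/day.
level_weights = {
    1: [3515, 3420, 3175, 3586, 3232, 3884, 3856, 3941, 3232, 4054, 3459, 3998],
    2: [3444, 3827, 3884, 3515, 3416, 3742, 3062, 3076, 2835, 2750, 3460, 3340],
    3: [2608, 2509, 3600, 1730, 3175, 3459, 3288, 2920, 3020, 2778, 2466, 3260],
}

p = 4                                  # intercept + three dummy variables
xtx = [[0] * p for _ in range(p)]      # accumulates X'X (4x4)
xty = [0] * p                          # accumulates X'Y (4x1)
yty = 0                                # accumulates Y'Y (scalar)
n_obs = 0

for level, weights in level_weights.items():
    for w in weights:
        # One design-matrix row: intercept, then X1, X2, X3.
        row = [1] + [1 if level == j else 0 for j in (1, 2, 3)]
        for i in range(p):
            xty[i] += row[i] * w
            for j in range(p):
                xtx[i][j] += row[i] * row[j]   # outer product of the row with itself
        yty += w * w
        n_obs += 1

# Total SS by the "machine formula": sum(Y^2) - (sum(Y))^2 / N.
total_ss = yty - xty[0] ** 2 / n_obs

for r in xtx:
    print(r)
print(xty, yty, round(total_ss, 3))
```

Accumulating this way needs only one pass over the data and a handful of running totals, which is why it scales to data sets far too large to hold as a full design matrix.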

X'Y
  (4x36) x (36x1) = 4x1

The resulting vector contains ΣYi (sum of the birth weights) values. The first element is for all the data; subsequent elements are for the levels of the smoking factor. Since sample sizes are in the principal diagonal of the X'X matrix, the grand mean and level means may now be calculated.

X'Y is:

  118516     118516 / 36 = 3292.11111   the grand mean
   43352      43352 / 12 = 3612.66667   the nonsmoking mean
   40351      40351 / 12 = 3362.58333   the 1 pack/day mean
   34813      34813 / 12 = 2901.08333   the 1+ pack/day mean

Y'X
  (1x36) x (36x4) = 1x4

This vector has the same elements as X'Y, but as a row vector. It is useful for later calculations.

Y'X is:

  118516  43352  40351  34813

Y'Y
  (1x36) x (36x1) = 1x1

This is ΣYi², i.e. the sum of the squared birth weights. This quantity is sometimes called the Uncorrected SS.

Y'Y is:

  398915214

Since we have ΣYi for all the data as the first element of X'Y, and the total sample size (36) from the X'X matrix, the program may now calculate Total SS by the "machine formula":

$$SS_{Total} = \sum_{i=1}^{N} Y_i^2 - \frac{\left(\sum_{i=1}^{N} Y_i\right)^2}{N} = 398915214 - \frac{118516^2}{36} = 398915214 - 390167840.444 = 8747373.556$$

(X'X)⁻

In order to do several more calculations, the program now needs to calculate the inverse of the X'X matrix, which is symbolized by (X'X)⁻. The problem is that the X'X matrix is singular (determinant = 0), and therefore has no inverse. Mathematicians have developed a method called the generalized inverse to deal with this situation. A frequently used generalized inverse is the g2-inverse, also called a reflexive generalized inverse.

Let A represent a square matrix of order p, and let G also be a square matrix of order p. G is a g2-inverse of A, and A is a g2-inverse of G (that's why it's called reflexive), if both of the following conditions are met:

  1. AGA = A
  2. GAG = G

The generalized inverse of X'X is found by a matrix operation called sweeping, which involves working on the matrix one row at a time. The g2-inverse found by the sweeping algorithm is not unique; different solutions can be obtained depending on how the matrix is swept. Fortunately, we just need an inverse, and don't need to worry about the details of how the sweeping operator functions. We just let the computer tell us the g2-inverse it found.

(X'X)⁻ is:

   0.083333333333333  -0.08333333333333   -0.08333333333333   0
  -0.08333333333333    0.166666666666667   0.083333333333333  0
  -0.08333333333333    0.083333333333333   0.166666666666667  0
   0                   0                   0                  0
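The two conditions for a reflexive generalized inverse can be checked directly on the matrices printed above. The sketch below is a minimal check, assuming the printed decimals 0.0833... and 0.1666... stand for exactly 1/12 and 2/12. The helper `matmul` and the variable names are invented for this illustration.

```python
from fractions import Fraction

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# A = X'X from the text.
A = [[36, 12, 12, 12],
     [12, 12, 0, 0],
     [12, 0, 12, 0],
     [12, 0, 0, 12]]

# G = the g2-inverse reported by the computer (assumed exact twelfths).
t = Fraction(1, 12)
G = [[t, -t, -t, 0],
     [-t, 2 * t, t, 0],
     [-t, t, 2 * t, 0],
     [0, 0, 0, 0]]

# Why A is singular: its first row is the sum of the other three rows.
first_row_is_sum = A[0] == [A[1][j] + A[2][j] + A[3][j] for j in range(4)]

aga = matmul(matmul(A, G), A)
gag = matmul(matmul(G, A), G)

print(first_row_is_sum)  # True
print(aga == A)          # True: condition 1, AGA = A
print(gag == G)          # True: condition 2, GAG = G
```

Exact fractions are used instead of floating point so that the equality checks are not disturbed by rounding.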

Y'X(X'X)⁻X'Y

We're multiplying 3 matrices here. Y'X is 1x4; (X'X)⁻ is 4x4; X'Y is 4x1, so:

  (1x4) x (4x4) = 1x4, then (1x4) x (4x1) = 1x1

Y'X(X'X)⁻X'Y is:

  393295339.5

This is just an intermediate value. What we want to do is subtract this from the Y'Y value:

Y'Y - Y'X(X'X)⁻X'Y is:

  398915214 - 393295339.5 = 5619874.5

This is the Error SS. Notice we now have Total SS, Error SS, and all the sample sizes. We would now be able to complete the ANOVA table and do the F test.

(X'X)⁻X'Y
  (4x4) x (4x1) = 4x1

This is the calculation of the b_i values.

(X'X)⁻X'Y is:

  2901.083333
   711.5833333
   461.5
     0

So, our model is

  Y = 2901.083333 + 711.5833333 X1 + 461.5 X2 + 0 X3

We can now plug in the values for dummy variables X1, X2, and X3 and calculate the predicted birth weights:

  Intercept  X1  X2  X3   Predicted Y
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       1   0   0   3612.666667
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   1   0   3362.583333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333
      1       0   0   1   2901.083333

Again, just as we saw in ANOVA by Regression, the key thing to note here is that the predicted weight for each baby is the mean of its smoking group. The method we've looked at here is more general than using Excel, and it is how most real statistics programs approach ANOVA models.
Of course, as the ANOVA model gets more complicated, so do all of these matrices. But the general principles remain the same.
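The remaining steps of the matrix approach can be reproduced from the three quantities already printed above: X'Y, Y'Y, and the g2-inverse. The following is a sketch under stated assumptions, not the code of any actual statistics package. It assumes the g2-inverse entries are exact twelfths, and the variable names and the final F ratio (with 2 and 33 degrees of freedom for a three-group, 36-observation design) are additions for illustration.

```python
from fractions import Fraction

t = Fraction(1, 12)
G = [[t, -t, -t, 0],          # g2-inverse of X'X (assumed exact twelfths)
     [-t, 2 * t, t, 0],
     [-t, t, 2 * t, 0],
     [0, 0, 0, 0]]
xty = [118516, 43352, 40351, 34813]   # X'Y from the text
yty = 398915214                        # Y'Y from the text
n = 36

# b = (X'X)^- X'Y
b = [sum(G[i][k] * xty[k] for k in range(4)) for i in range(4)]

# Y'X (X'X)^- X'Y, the intermediate value, then Error SS and Total SS.
intermediate = sum(xty[i] * b[i] for i in range(4))
error_ss = yty - intermediate
total_ss = yty - Fraction(xty[0] ** 2, n)

# Completing the ANOVA table (assumed df: 3 - 1 = 2 and 36 - 3 = 33).
smoking_ss = total_ss - error_ss
f_stat = (smoking_ss / 2) / (error_ss / (n - 3))

print([float(x) for x in b])
print(float(intermediate), float(error_ss))
print(float(total_ss), float(smoking_ss), float(f_stat))
```

The printed coefficients are the 1+ pack/day mean (the intercept) and the differences of the other two group means from it, which is why the predicted value for every baby is its own group mean.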

Randomized Block (no replications)

TITLE 'Randomized Block (no replications)';
* This is the Randomized Block example found in Zar (4th ed). The response
  variable is weight gain (g) in guinea pigs. There is a Diet factor with
  4 levels, and a Block factor with 5 levels. The blocks represent different
  rooms that have slightly different conditions (noise level, light/dark
  cycle). Each of the five rooms houses four guinea pigs, one on each of the
  four diets.

  If you had BIO 211 with Dr. M., this should sound familiar, as this example
  was also used in class. HOWEVER, if you took BIO 211 before Fall 1999, the
  data were different from what you see below. The problem setup (response
  variable, diets, blocks (rooms)) was exactly the same, only the numbers
  have changed! Before Fall 1999, Dr. M. used the data from the 1st and 2nd
  editions of the Zar text. When the 3rd edition of Zar came out (late 1996),
  the data were changed - but Dr. M. didn't change the class example until
  Fall 1999. You may wonder why Zar kept the same problem setup, but changed
  the numbers - well, join the club! Dr. M. would like to hear the answer to
  that question! Also if you took BIO 211 before Fall 1999, why haven't you
  graduated yet?

  We will do three ANOVAs here: (1) a One-factor grouping by diets, (2) a
  One-factor grouping by Blocks (rooms), and (3) a Two-factor grouping by
  both diets and blocks. What you should do is examine the SS, DF and MS due
  to Diets and Blocks in the One-factor and the Two-factor ANOVAs. See if you
  can detect the pattern, and explain it! Also look at what happens to the
  Error (unexplained) source in the ANOVAs.
;
DATA G_PIGS;
  INPUT WT_GAIN DIET BLOCK;
  CARDS;
7.0 1 1
9.9 1 2
8.5 1 3
5.1 1 4
10.3 1 5
5.3 2 1
5.7 2 2
4.7 2 3
3.5 2 4
7.7 2 5
4.9 3 1
7.6 3 2
5.5 3 3
2.8 3 4
8.4 3 5
8.8 4 1
8.9 4 2
8.1 4 3
3.3 4 4
9.1 4 5
;
PROC GLM;
  CLASS DIET;
  MODEL WT_GAIN = DIET;
PROC GLM;
  CLASS BLOCK;
  MODEL WT_GAIN = BLOCK;
PROC GLM;
  CLASS DIET BLOCK;
  MODEL WT_GAIN = DIET BLOCK;
  MEANS DIET BLOCK;
RUN;

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure
Class Level Information

Class    Levels    Values
DIET          4    1 2 3 4

Number of observations in data set = 20

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure

Dependent Variable: WT_GAIN

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       27.42550000     9.14183333       2.03    0.1497
Error              16       71.92400000     4.49525000
Corrected Total    19       99.34950000

R-Square        C.V.    Root MSE    WT_GAIN Mean
0.276051    31.38713    2.120200      6.75500000

Source    DF      Type I SS    Mean Square    F Value    Pr > F
DIET       3    27.42550000     9.14183333       2.03    0.1497

Source    DF    Type III SS    Mean Square    F Value    Pr > F
DIET       3    27.42550000     9.14183333       2.03    0.1497

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure
Class Level Information

Class    Levels    Values
BLOCK         5    1 2 3 4 5

Number of observations in data set = 20

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure

Dependent Variable: WT_GAIN

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4       62.64700000    15.66175000       6.40    0.0033
Error              15       36.70250000     2.44683333
Corrected Total    19       99.34950000

R-Square        C.V.    Root MSE    WT_GAIN Mean
0.630572    23.15671    1.564235      6.75500000

Source    DF      Type I SS    Mean Square    F Value    Pr > F
BLOCK      4    62.64700000    15.66175000       6.40    0.0033

Source    DF    Type III SS    Mean Square    F Value    Pr > F
BLOCK      4    62.64700000    15.66175000       6.40    0.0033

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure
Class Level Information

Class    Levels    Values
DIET          4    1 2 3 4
BLOCK         5    1 2 3 4 5

Number of observations in data set = 20

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure

Dependent Variable: WT_GAIN

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               7       90.07250000    12.86750000      16.64    0.0001
Error              12        9.27700000     0.77308333
Corrected Total    19       99.34950000

R-Square        C.V.    Root MSE    WT_GAIN Mean
0.906623    13.01631    0.879251      6.75500000

Source    DF      Type I SS    Mean Square    F Value    Pr > F
DIET       3    27.42550000     9.14183333      11.83    0.0007
BLOCK      4    62.64700000    15.66175000      20.26    0.0001

Source    DF    Type III SS    Mean Square    F Value    Pr > F
DIET       3    27.42550000     9.14183333      11.83    0.0007
BLOCK      4    62.64700000    15.66175000      20.26    0.0001

Randomized Block (no replications)
14:27 Thursday, June 17, 1999

General Linear Models Procedure

Level of         -----------WT_GAIN-----------
DIET       N           Mean              SD
1          5     8.16000000      2.14662526
2          5     5.38000000      1.54012986
3          5     5.84000000      2.23002242
4          5     7.64000000      2.45519857

Level of         -----------WT_GAIN-----------
BLOCK      N           Mean              SD
1          4     6.50000000      1.78325545
2          4     8.02500000      1.81360598
3          4     6.70000000      1.88325959
4          4     3.67500000      0.99456858
5          4     8.87500000      1.10867789
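One way to explore the pattern across the three ANOVAs is to recompute the sums of squares by hand. The sketch below is an illustration only, assuming the usual formulas for a balanced design with one observation per diet-by-block cell. The function and variable names are invented here and do not correspond to anything in SAS.

```python
# (weight gain, diet, block) for the 20 guinea pigs, from the CARDS data.
data = [
    (7.0, 1, 1), (9.9, 1, 2), (8.5, 1, 3), (5.1, 1, 4), (10.3, 1, 5),
    (5.3, 2, 1), (5.7, 2, 2), (4.7, 2, 3), (3.5, 2, 4), (7.7, 2, 5),
    (4.9, 3, 1), (7.6, 3, 2), (5.5, 3, 3), (2.8, 3, 4), (8.4, 3, 5),
    (8.8, 4, 1), (8.9, 4, 2), (8.1, 4, 3), (3.3, 4, 4), (9.1, 4, 5),
]

def mean(values):
    return sum(values) / len(values)

grand = mean([rec[0] for rec in data])
total_ss = sum((rec[0] - grand) ** 2 for rec in data)

def factor_ss(index):
    """SS among the level means of one factor (1 = diet, 2 = block)."""
    by_level = {}
    for rec in data:
        by_level.setdefault(rec[index], []).append(rec[0])
    return sum(len(v) * (mean(v) - grand) ** 2 for v in by_level.values())

diet_ss = factor_ss(1)
block_ss = factor_ss(2)

# Error SS is whatever the model leaves unexplained.
error_diet_only = total_ss - diet_ss
error_block_only = total_ss - block_ss
error_two_factor = total_ss - diet_ss - block_ss

print(round(diet_ss, 4), round(block_ss, 4), round(total_ss, 4))
print(round(error_diet_only, 4), round(error_block_only, 4),
      round(error_two_factor, 4))
```

In this balanced layout the Diet SS and Block SS do not change between the one-factor and two-factor analyses. Only the Error SS shrinks, because variation that a one-factor model leaves in Error is claimed by the other factor in the two-factor model.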

Two-factor ANOVA with replications (balanced design)

TITLE 'Two-factor ANOVA with replications (balanced design)';
* This is the same example used in Dr. M's Biometrics class. A clinic that
  does health evaluations is studying the effect of smoking. The clinic
  evaluates people using one of two devices: a stationary bicycle and a
  treadmill. While the subject is on the bike or treadmill, their oxygen
  consumption is measured, and the time (in minutes) required for the subject
  to reach their maximum oxygen consumption is noted. The data below are for
  18 people: 6 nonsmokers, 6 moderate, and 6 heavy smokers. From each smoking
  group, 3 individuals were randomly chosen to ride the bike, and the other 3
  walked the treadmill.

  It is important to note here that every individual was measured on only one
  device, either the bike or the treadmill. If every individual had been
  measured on each device, that would be a repeated measures design - we'll
  deal with that later in the quarter.
;
DATA CLINIC;
  INPUT SMOKING $ DEVICE $ TIME;
  CARDS;
NON BIKE 12.8
NON BIKE 13.5
NON BIKE 11.2
NON TREAD 17.8
NON TREAD 18.1
NON TREAD 16.2
MOD BIKE 10.9
MOD BIKE 11.1
MOD BIKE 9.8
MOD TREAD 15.5
MOD TREAD 13.8
MOD TREAD 16.2
HEAVY BIKE 8.7
HEAVY BIKE 9.2
HEAVY BIKE 9.5
HEAVY TREAD 14.7
HEAVY TREAD 13.2
HEAVY TREAD 10.1
;
PROC GLM;
  CLASS SMOKING DEVICE;
  MODEL TIME = SMOKING DEVICE SMOKING*DEVICE;
  MEANS SMOKING / TUKEY;
  MEANS DEVICE SMOKING*DEVICE;
RUN;

Two-factor ANOVA with replications (balanced design)
16:00 Thursday, June 17, 1999

General Linear Models Procedure
Class Level Information

Class      Levels    Values
SMOKING         3    HEAVY MOD NON
DEVICE          2    BIKE TREAD

Number of observations in data set = 18

Two-factor ANOVA with replications (balanced design)
16:00 Thursday, June 17, 1999

General Linear Models Procedure

Dependent Variable: TIME

Source             DF    Sum of Squares     Mean Square    F Value    Pr > F
Model               5      134.34277778     26.86855556      15.94    0.0001
Error              12       20.22666667      1.68555556
Corrected Total    17      154.56944444

R-Square        C.V.    Root MSE      TIME Mean
0.869142    10.05993    1.298289    12.90555556

Source            DF      Type I SS    Mean Square    F Value    Pr > F
SMOKING            2    48.80777778    24.40388889      14.48    0.0006
DEVICE             1    84.06722222    84.06722222      49.88    0.0001
SMOKING*DEVICE     2     1.46777778     0.73388889       0.44    0.6568

Source            DF    Type III SS    Mean Square    F Value    Pr > F
SMOKING            2    48.80777778    24.40388889      14.48    0.0006
DEVICE             1    84.06722222    84.06722222      49.88    0.0001
SMOKING*DEVICE     2     1.46777778     0.73388889       0.44    0.6568

Two-factor ANOVA with replications (balanced design)
16:00 Thursday, June 17, 1999

General Linear Models Procedure

Tukey's Studentized Range (HSD) Test for variable: TIME

NOTE: This test controls the type I experimentwise error rate, but
      generally has a higher type II error rate than REGWQ.

Alpha= 0.05  df= 12  MSE= 1.685556
Critical Value of Studentized Range= 3.773
Minimum Significant Difference= 1.9997

Means with the same letter are not significantly different.

Tukey Grouping       Mean    N    SMOKING
             A    14.9333    6    NON

             B    12.8833    6    MOD
             B
             B    10.9000    6    HEAVY

Two-factor ANOVA with replications (balanced design)
16:00 Thursday, June 17, 1999

General Linear Models Procedure

Level of          -------------TIME------------
DEVICE      N           Mean              SD
BIKE        9     10.7444444      1.62719937
TREAD       9     15.0666667      2.48294180

Level of    Level of         -------------TIME------------
SMOKING     DEVICE     N           Mean              SD
HEAVY       BIKE       3      9.1333333      0.40414519
HEAVY       TREAD      3     12.6666667      2.34591844
MOD         BIKE       3     10.6000000      0.70000000
MOD         TREAD      3     15.1666667      1.23423391
NON         BIKE       3     12.5000000      1.17898261
NON         TREAD      3     17.3666667      1.02143690
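The sums of squares in the two-factor table can be rebuilt from the marginal and cell means listed above. The following is a minimal sketch, assuming the standard formulas for a balanced two-factor design with equal replication in every cell. The dictionary `cells` and the helper names are invented for this illustration.

```python
# Time (minutes) to maximum oxygen consumption, keyed by (smoking, device).
cells = {
    ("NON", "BIKE"): [12.8, 13.5, 11.2],
    ("NON", "TREAD"): [17.8, 18.1, 16.2],
    ("MOD", "BIKE"): [10.9, 11.1, 9.8],
    ("MOD", "TREAD"): [15.5, 13.8, 16.2],
    ("HEAVY", "BIKE"): [8.7, 9.2, 9.5],
    ("HEAVY", "TREAD"): [14.7, 13.2, 10.1],
}

def mean(values):
    return sum(values) / len(values)

grand = mean([t for v in cells.values() for t in v])

def margin_ss(position):
    """SS among the marginal means of one factor (0 = smoking, 1 = device)."""
    pooled = {}
    for key, v in cells.items():
        pooled.setdefault(key[position], []).extend(v)
    return sum(len(v) * (mean(v) - grand) ** 2 for v in pooled.values())

smoking_ss = margin_ss(0)
device_ss = margin_ss(1)

# SS among the six cell means, then interaction as what is left over.
cells_ss = sum(len(v) * (mean(v) - grand) ** 2 for v in cells.values())
interaction_ss = cells_ss - smoking_ss - device_ss

# Error SS: variation of the replicates around their own cell mean.
error_ss = sum((t - mean(v)) ** 2 for v in cells.values() for t in v)

print(round(smoking_ss, 4), round(device_ss, 4),
      round(interaction_ss, 4), round(error_ss, 4))
```

Because each person was measured on only one device, the replicates within a cell are independent, and their spread around the cell mean is what supplies the Error SS.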

Analysis of Covariance (ANCOVA)

TITLE 'Analysis of Covariance (ANCOVA)';
* This is the example that was done in Dr. M's Biometrics class. It uses the
  same data as in the One-factor ANOVA we did at the beginning of the class:
  i.e. birth weights of babies grouped by smoking status of the mother during
  pregnancy. The new variable here is the prepregnancy body weight of the mom
  (in kg).

  The first variable indicates smoking group: 1 = none, 2 = 1 pack/day,
  3 = 1+ pack/day. The second variable is the birthweight (g), the third
  variable is the mom weight (kg)
;
DATA MOM_BABY;
  INPUT SMOKING BABY_WT MOM_WT;
  CARDS;
1 3515 65.2
1 3420 58.2
1 3175 48.7
1 3586 65.8
1 3232 73.5
1 3884 68.2
1 3856 69.3
1 3941 69.3
1 3232 59.3
1 4054 73.9
1 3459 56.3
1 3998 70.3
2 3444 62.1
2 3827 72.1
2 3884 72.8
2 3515 49.4
2 3416 54.4
2 3742 63.5
2 3062 61.2
2 3076 51.0
2 2835 44.2
2 2750 63.1
2 3460 63.8
2 3340 65.8
3 2608 59.3
3 2509 51.2
3 3600 80.0
3 1730 60.0
3 3175 74.6
3 3459 68.7
3 3288 69.7
3 2920 62.3
3 3020 65.1
3 2778 49.9
3 2466 46.7
3 3260 61.2
;
PROC Reg;
  *This PROC Reg does a regression on all 36 data points. If you had
   Dr. M for BIO 211, this is the example that was used in class for
   regression;
  Model BABY_WT = MOM_WT;
PROC Reg;
  *Next, SAS does a regression on each of the smoking groups separately.
   This is accomplished by the BY SMOKING command;
  Model BABY_WT = MOM_WT;
  BY SMOKING;
PROC GLM;
  *The first PROC GLM is used to test if the slopes of the regression
   lines are the same for each of the smoking groups. This is the
   interaction term (SMOKING*MOM_WT). The slopes are not significantly
   different (p = 0.7737).;
  CLASS SMOKING;
  MODEL BABY_WT = SMOKING MOM_WT SMOKING*MOM_WT;

PROC GLM;
  CLASS SMOKING;
  MODEL BABY_WT = SMOKING MOM_WT / SOLUTION;
  MEANS SMOKING;
  LSMEANS SMOKING;
  * The second PROC GLM does the ANCOVA. The SOLUTION option prints out
    the pooled regression coefficient (slope). The value is about 29.5317.
    The LSMEANS prints the adjusted means, the MEANS prints the means
    prior to adjustment.
  ;
RUN;

Analysis of Covariance (ANCOVA)    1
16:22 Tuesday, December 11, 2001

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1           2623276        2623276      14.56    0.0005
Error              34           6124098         180121
Corrected Total    35           8747374

Root MSE           424.40609    R-Square    0.2999
Dependent Mean    3292.11111    Adj R-Sq    0.2793
Coeff Var           12.89161

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1            1356.38005         512.13797       2.65      0.0122
MOM_WT        1              30.97032           8.11531       3.82      0.0005

Analysis of Covariance (ANCOVA)    2
16:22 Tuesday, December 11, 2001

----------------------------------------- SMOKING=1 -----------------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1            498976         498976       7.83    0.0189
Error              10            637266          63727
Corrected Total    11           1136243

Root MSE           252.44133    R-Square    0.4391
Dependent Mean    3612.66667    Adj R-Sq    0.3831
Coeff Var            6.98767

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1            1812.51057         647.43953       2.80      0.0188
MOM_WT        1              27.76590           9.92275       2.80      0.0189

Example 5 - Analysis of Covariance (ANCOVA)    3
16:22 Tuesday, December 11, 2001

----------------------------------------- SMOKING=2 -----------------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1            502060         502060       5.03    0.0488
Error              10            998250          99825
Corrected Total    11           1500311

Root MSE           315.95102    R-Square    0.3346
Dependent Mean    3362.58333    Adj R-Sq    0.2681
Coeff Var            9.39608

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1            1905.55393         656.06664       2.90      0.0157
MOM_WT        1              24.16969          10.77737       2.24      0.0488

Analysis of Covariance (ANCOVA)    4
16:22 Tuesday, December 11, 2001

----------------------------------------- SMOKING=3 -----------------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1           1332362        1332362       8.07    0.0175
Error              10           1650958         165096
Corrected Total    11           2983321

Root MSE           406.31988    R-Square    0.4466
Dependent Mean    2901.08333    Adj R-Sq    0.3913
Coeff Var           14.00580

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             733.48416         771.98274       0.95      0.3644
MOM_WT        1              34.74181          12.22952       2.84      0.0175

Analysis of Covariance (ANCOVA)    5
16:22 Tuesday, December 11, 2001

The GLM Procedure
Class Level Information

Class      Levels    Values
SMOKING         3    1 2 3

Number of observations    36

Analysis of Covariance (ANCOVA)    6
16:22 Tuesday, December 11, 2001

The GLM Procedure

Dependent Variable: BABY_WT

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5       5460898.354    1092179.671       9.97    <.0001
Error              30       3286475.202     109549.173
Corrected Total    35       8747373.556

R-Square    Coeff Var    Root MSE    BABY_WT Mean
0.624290     10.05380    330.9821        3292.111

Source            DF      Type I SS    Mean Square    F Value    Pr > F
SMOKING            2    3127499.056    1563749.528      14.27    <.0001
MOM_WT             1    2276706.663    2276706.663      20.78    <.0001
MOM_WT*SMOKING     2      56692.635      28346.318       0.26    0.7737

Source            DF    Type III SS    Mean Square    F Value    Pr > F
SMOKING            2     208179.691     104089.845       0.95    0.3980
MOM_WT             1    2078558.118    2078558.118      18.97    0.0001
MOM_WT*SMOKING     2      56692.635      28346.318       0.26    0.7737

Analysis of Covariance (ANCOVA)    7
16:22 Tuesday, December 11, 2001

The GLM Procedure
Class Level Information

Class      Levels    Values
SMOKING         3    1 2 3

Number of observations    36

Analysis of Covariance (ANCOVA)    8
16:22 Tuesday, December 11, 2001

The GLM Procedure

Dependent Variable: BABY_WT

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       5404205.718    1801401.906      17.24    <.0001
Error              32       3343167.837     104473.995
Corrected Total    35       8747373.556

R-Square    Coeff Var    Root MSE    BABY_WT Mean
0.617809     9.818149    323.2244        3292.111

Source     DF      Type I SS    Mean Square    F Value    Pr > F
SMOKING     2    3127499.056    1563749.528      14.97    <.0001
MOM_WT      1    2276706.663    2276706.663      21.79    <.0001

Source     DF    Type III SS    Mean Square    F Value    Pr > F
SMOKING     2    2780930.101    1390465.051      13.31    <.0001
MOM_WT      1    2276706.663    2276706.663      21.79    <.0001

Parameter          Estimate       Standard Error    t Value    Pr > |t|
Intercept       1058.549064 B        405.5780388       2.61      0.0137
SMOKING 1        639.476676 B        132.8567735       4.81      <.0001
SMOKING 2        523.762745 B        132.6281455       3.95      0.0004
SMOKING 3          0.000000 B                  .          .           .
MOM_WT            29.531737            6.3261509       4.67      <.0001

NOTE: The X'X matrix has been found to be singular, and a generalized inverse
      was used to solve the normal equations. Terms whose estimates are
      followed by the letter 'B' are not uniquely estimable.
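The test for equal slopes can be recovered by comparing the Error lines of the two GLM tables above. This is a small sketch, assuming the usual extra-sum-of-squares F test for nested models. The variable names are invented here, and the numbers are copied from the SAS output rather than recomputed from the data.

```python
# Error SS and df from the model WITH the SMOKING*MOM_WT interaction
# (separate slopes) and from the ANCOVA model with one common slope.
sse_full, df_full = 3286475.202, 30
sse_reduced, df_reduced = 3343167.837, 32

# Extra SS explained by letting each group have its own slope.
ss_slopes = sse_reduced - sse_full
df_slopes = df_reduced - df_full

f_slopes = (ss_slopes / df_slopes) / (sse_full / df_full)

print(round(ss_slopes, 3), df_slopes, round(f_slopes, 2))
```

The extra SS matches the MOM_WT*SMOKING line of the first GLM, and the F ratio matches the 0.26 reported there, so dropping the interaction costs very little explained variation.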

Analysis of Covariance (ANCOVA)    9
16:22 Tuesday, December 11, 2001

The GLM Procedure

Level of          -----------BABY_WT-----------    ------------MOM_WT-----------
SMOKING     N           Mean         Std Dev             Mean         Std Dev
1          12     3612.66667      321.395065       64.8333333       7.6706446
2          12     3362.58333      369.312742       60.2833333       8.8391519
3          12     2901.08333      520.779217       62.3916667      10.0175717

Example 5 - Analysis of Covariance (ANCOVA)    10
16:22 Tuesday, December 11, 2001

The GLM Procedure
Least Squares Means

SMOKING    BABY_WT LSMEAN
1              3543.84131
2              3428.12738
3              2904.36464

Calculation of ANCOVA quantities

Let's define the following symbols:

  $\bar{Y}_i$ = mean birth weight for smoking group i
  $adj\,\bar{Y}_i$ = adjusted mean birthweight for smoking group i
  $\bar{X}_i$ = mean prepregnancy weight for moms in smoking group i
  $\bar{X}$ = grand mean of prepregnancy weight for moms = 62.5 kg
  $b_p$ = pooled regression coefficient = 29.53
  $\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y})$ = sum of crossproducts
  $\sum x^2 = \sum (X_i - \bar{X})^2$
  $\sum y^2 = \sum (Y_i - \bar{Y})^2$

Calculation of the pooled regression coefficient ($b_p$):

Calculate $\sum xy_i$ and $\sum x_i^2$ for each of the smoking groups (taking deviations from that group's own means), and pool (add) them:

$$\sum xy_p = \sum xy_1 + \sum xy_2 + \sum xy_3 = 17970.622 + 20772.553 + 38350.619 = 77093.794$$

This is the pooled sum of crossproducts.

$$\sum x_p^2 = \sum x_1^2 + \sum x_2^2 + \sum x_3^2 = 647.219 + 859.446 + 1103.875 = 2610.535$$

This is the pooled sum of squares for the independent variable (the moms' weights).

$$b_p = \frac{\sum xy_p}{\sum x_p^2} = \frac{77093.794}{2610.535} = 29.5317$$

Calculation of the adjusted means:

The adjustment to the birth weight means depends on: (1) how far the mean of the moms in that group is from the grand mean of all moms; and (2) the relationship between mom's weight and baby's weight ($b_p$).

$$adj\,\bar{Y}_i = \bar{Y}_i - b_p(\bar{X}_i - \bar{X})$$

Use this formula to calculate adjusted birth weight means for each of the smoking groups:

  Nonsmoking:   3612.667 - 29.5317(64.833 - 62.5028) = 3612.667 - 29.5317(2.3302)  = 3612.667 - 68.815 = 3543.852
  1 Pack/day:   3362.583 - 29.5317(60.283 - 62.5028) = 3362.583 - 29.5317(-2.2198) = 3362.583 + 65.554 = 3428.137
  1+ Pack/day:  2901.084 - 29.5317(62.392 - 62.5028) = 2901.084 - 29.5317(-0.111)  = 2901.084 + 3.278  = 2904.362
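The pooled slope and the adjusted means can be recomputed at full precision from the raw data, which helps show where the small rounding differences from the SAS output come from. The sketch below is an illustration under stated assumptions: the (mom weight, baby weight) pairs are copied from the CARDS data of the ANCOVA program, deviations are taken from each group's own means, and all names are invented for this example.

```python
# (mom weight kg, baby weight g) by smoking group.
data = {
    1: [(65.2, 3515), (58.2, 3420), (48.7, 3175), (65.8, 3586),
        (73.5, 3232), (68.2, 3884), (69.3, 3856), (69.3, 3941),
        (59.3, 3232), (73.9, 4054), (56.3, 3459), (70.3, 3998)],
    2: [(62.1, 3444), (72.1, 3827), (72.8, 3884), (49.4, 3515),
        (54.4, 3416), (63.5, 3742), (61.2, 3062), (51.0, 3076),
        (44.2, 2835), (63.1, 2750), (63.8, 3460), (65.8, 3340)],
    3: [(59.3, 2608), (51.2, 2509), (80.0, 3600), (60.0, 1730),
        (74.6, 3175), (68.7, 3459), (69.7, 3288), (62.3, 2920),
        (65.1, 3020), (49.9, 2778), (46.7, 2466), (61.2, 3260)],
}

def mean(values):
    return sum(values) / len(values)

sum_xy_p = 0.0   # pooled sum of crossproducts
sum_x2_p = 0.0   # pooled sum of squares for mom weight
group_means = {}
for g, pairs in data.items():
    xbar = mean([x for x, _ in pairs])
    ybar = mean([y for _, y in pairs])
    group_means[g] = (xbar, ybar)
    sum_xy_p += sum((x - xbar) * (y - ybar) for x, y in pairs)
    sum_x2_p += sum((x - xbar) ** 2 for x, _ in pairs)

b_p = sum_xy_p / sum_x2_p
grand_x = mean([x for pairs in data.values() for x, _ in pairs])

# adj mean = group mean - b_p * (group mom mean - grand mom mean)
adjusted = {g: ybar - b_p * (xbar - grand_x)
            for g, (xbar, ybar) in group_means.items()}

print(round(b_p, 6), round(grand_x, 4))
for g in sorted(adjusted):
    print(g, round(adjusted[g], 5))
```

At full precision these should agree with the SOLUTION slope and the LSMEANS in the SAS output more closely than the hand calculation does, since nothing is rounded along the way.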

ANCOVA - Calculation of Sums of Squares (SS)

This is a bit complicated, and it's not necessary to memorize all of this detail, but this may help you understand how the ANCOVA works. In comparing the SS values calculated below to the SAS output, you will note differences. These are due to rounding error. The SAS methods are substantially more accurate than what you see below, but even if we did figure out exactly how SAS did the calculations, it would not help us understand the method.

Another way (and perhaps a more accurate way) to look at the ANCOVA is that the analysis actually tests whether each group can be described by a common (pooled) regression line. If the pooled regression line for each group is the same (same slope and same intercept), then the adjusted means are not significantly different. And we can test to determine if the slope of that pooled line is significantly different from zero (0).

So, our first task is to do a regression for each smoking group - except we use the pooled regression coefficient (b_p) in each regression. This requires us to calculate an intercept for each group (using the means for moms and babies in that group). Then, we use the regression equation to calculate predicted values for each group. We can then calculate Regression SS and Error SS. We'll do this step by step for each group so you can see what's happening.

Nonsmoking Group

From the SAS output, we see the mean baby weight is 3612.66667, and the mean mom weight is 64.8333333. We use these means and b_p = 29.531737 to calculate an intercept term (a_1). This is done just as we did it in BIO 211:

  a_1 = 3612.66667 - 29.531737*64.8333333 = 1698.025722

So, the equation for this group is

  Ŷ = 1698.025722 + 29.531737 X

Next, we calculate a predicted Y by putting each nonsmoking mom weight in for X.
Then, calculate Regression SS (use mean of predicted, not observed - they are different in this case) and Error SS.

  Mom     Baby    Predicted Baby
  65.2    3515    3623.494975
  58.2    3420    3416.772816
  48.7    3175    3136.221314
  65.8    3586    3641.214017
  73.5    3232    3868.608392
  68.2    3884    3712.090186
  69.3    3856    3744.575096
  69.3    3941    3744.575096
  59.3    3232    3449.257726
  73.9    4054    3880.421086
  56.3    3459    3360.662515
  70.3    3998    3774.106833

  Regression SS = $\sum(\hat{Y} - \bar{Y})^2$ = 564461.5795
  Error SS = $\sum(Y - \hat{Y})^2$ = 639284.3988
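The nonsmoking-group calculation can be sketched as follows. This is an illustration only, assuming the pooled slope 29.531737 from the SAS output and an intercept computed from the group means as described above. The variable names are invented for this example.

```python
# Nonsmoking group: mom weight (kg) and baby birth weight (g).
moms = [65.2, 58.2, 48.7, 65.8, 73.5, 68.2, 69.3, 69.3, 59.3, 73.9, 56.3, 70.3]
babies = [3515, 3420, 3175, 3586, 3232, 3884, 3856, 3941, 3232, 4054, 3459, 3998]

b_p = 29.531737  # pooled regression coefficient from the ANCOVA output

def mean(values):
    return sum(values) / len(values)

# Intercept for this group: the line with slope b_p through the group means.
a_1 = mean(babies) - b_p * mean(moms)

predicted = [a_1 + b_p * x for x in moms]
pred_mean = mean(predicted)

# Regression SS: spread of the predicted values around their own mean.
regression_ss = sum((p - pred_mean) ** 2 for p in predicted)
# Error SS: squared distances of the observed weights from the line.
error_ss = sum((y - p) ** 2 for y, p in zip(babies, predicted))

print(round(a_1, 4))
print(round(regression_ss, 2), round(error_ss, 2))
```

The same steps would be repeated for the other two smoking groups, each with its own intercept but the same pooled slope.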