Biological Applications of ANOVA - Examples and Readings

BIO 575 Biological Applications of ANOVA - Winter Quarter 2010

ANOVA Pac: Biological Applications of ANOVA - Examples and Readings

One-factor Model I (Fixed Effects)

This is the same One-factor ANOVA example used by Dr. M. in Biometrics class (it's in the BIO 211 Test Pac). The data are birth weights (g) for 36 babies. Each baby is categorized by the smoking habits of the mother during pregnancy. The value of the SMOKING variable below indicates group membership (i.e. the level). The three levels are: 1 = Nonsmoking, 2 = up to 1 pack/day, 3 = 1+ pack/day. The WEIGHT variable contains (you guessed it!) the birth weights.

The program below does a "complete" analysis: it tests the assumptions of normality and homoscedasticity; does the ANOVA; does a planned contrast of nonsmoking babies vs. the combination of the two smoking groups; and also does 11 different multiple comparison tests.

SAS does four different tests for normality. The Shapiro-Wilk test is the most widely used. The null hypothesis is Ho: Distribution is Normal. So, when we fail to reject Ho (p > 0.05), we have no evidence against normality and proceed as if the distribution is normal. Note that for each of the three smoking groups, none of the four normality tests rejects normality.

The test for homoscedasticity is Brown and Forsythe's Test for Homogeneity of WEIGHT Variance. This is a form of Levene's Test: an ANOVA done on the absolute deviation of each weight from its group median.

The contrast tests the nonsmoking babies (n = 12, mean = ) against the smoking babies (n = 24, mean = ). Note that "the smoking babies" means all 24 babies in the 1 pack/day and 1+ pack/day groups combined. We can calculate the Contrast SS as a simple Groups SS with two groups:

$$\text{Groups SS} = \sum_j n_j (\bar{X}_j - \bar{X})^2 = 12(\bar{X}_{\text{nonsmoking}} - \bar{X})^2 + 24(\bar{X}_{\text{smoking}} - \bar{X})^2 = \text{Contrast SS}$$

Note that all 11 multiple comparison tests give the same result, i.e. that the 1+ pack/day group is different from the other two. Since all the tests agree, this is a robust conclusion.
  DATA BABYWT;
    INPUT SMOKING WEIGHT @@;   * @@ allows multiple observations per line;
    SELECT (SMOKING);
      WHEN (1) SMOKE='Nonsmoke';
      WHEN (2) SMOKE='1 Pack';
      WHEN (3) SMOKE='1+ Pack';
    END;
  CARDS;
  ;
  PROC UNIVARIATE NORMAL;
    CLASS SMOKE;
    VAR WEIGHT;
  PROC GLM;
    CLASS SMOKE;
    MODEL WEIGHT = SMOKE / SS3;
    CONTRAST 'Nonsmoking vs Smoking' SMOKE 1 1 -2 / E;
    MEANS SMOKE / HOVTEST=BF Tukey SNK Bon LSD REGWQ Scheffe Duncan Sidak Gabriel SMM Waller lines;
  RUN;
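The claim that the Contrast SS equals a two-group Groups SS can be checked numerically. The following is a minimal Python sketch, not part of the course materials; the weights are hypothetical (the course data are not reproduced here), and the helper names `groups_ss` and `contrast_ss` are made up for illustration. It uses the standard contrast formula (sum of c_j times mean_j, squared, divided by the sum of c_j squared over n_j) with the coefficients 1, 1, -2 from the CONTRAST statement.

```python
from statistics import mean

# Hypothetical birth weights (g); NOT the course data.
groups = {
    "Nonsmoke": [3500, 3600, 3400, 3700],
    "1 Pack": [3300, 3400, 3200, 3500],
    "1+ Pack": [2900, 3000, 2800, 3100],
}

def groups_ss(samples):
    """Sum over groups of n_j * (group mean - grand mean)^2."""
    grand = mean([y for s in samples for y in s])
    return sum(len(s) * (mean(s) - grand) ** 2 for s in samples)

def contrast_ss(samples, coeffs):
    """(sum of c_j * mean_j)^2 / sum of (c_j^2 / n_j)."""
    estimate = sum(c * mean(s) for c, s in zip(coeffs, samples))
    scale = sum(c * c / len(s) for c, s in zip(coeffs, samples))
    return estimate ** 2 / scale

# Treat the two smoking groups as one combined group.
smoking = groups["1 Pack"] + groups["1+ Pack"]
two_group_ss = groups_ss([groups["Nonsmoke"], smoking])

# Same comparison written as a contrast on the three group means.
ss_contrast = contrast_ss(
    [groups["1 Pack"], groups["1+ Pack"], groups["Nonsmoke"]], [1, 1, -2]
)

print(two_group_ss, ss_contrast)
```

With equal group sizes, as here and in the baby example, the two printed values agree up to rounding.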

Example 1 - One-factor Model I (Fixed Effects)
[SAS output; numeric values survive only where shown]

The UNIVARIATE Procedure
Variable: WEIGHT

SMOKE = 1 Pack
  Moments: N 12, Sum Weights 12; Mean, Sum Observations, Std Deviation, Variance, Skewness, Kurtosis, Uncorrected SS, Corrected SS, Coeff Variation, Std Error Mean
  Basic Statistical Measures: Location (Mean, Median, Mode .); Variability (Std Deviation, Variance, Range 1134, Interquartile Range)
  Tests for Location (Mu0=0): Student's t, Pr > |t| <.0001; Sign M 6; Signed Rank S 39
  Tests for Normality: Shapiro-Wilk W (Pr < W); Kolmogorov-Smirnov D (Pr > D); Cramer-von Mises W-Sq (Pr > W-Sq); Anderson-Darling A-Sq (Pr > A-Sq)
  Quantiles (Definition 5): 100% Max, 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, 0% Min
  Extreme Observations: Lowest and Highest (Value, Obs)

SMOKE = 1+ Pack
  Same tables as above, with N 12, Sum Weights 12, Mode ., Range 1870, Pr > |t| <.0001, Sign M 6, Signed Rank S 39.

SMOKE = Nonsmoke
  Same tables as above, with N 12, Sum Weights 12, Pr > |t| <.0001, Sign M 6, Signed Rank S 39.

The GLM Procedure

Class Level Information
  Class   Levels   Values
  SMOKE   3        1 Pack, 1+ Pack, Nonsmoke

  Number of Observations Read   36
  Number of Observations Used   36

Coefficients for Contrast Nonsmoking vs Smoking
                    Row 1
  Intercept           0
  SMOKE 1 Pack        1
  SMOKE 1+ Pack       1
  SMOKE Nonsmoke     -2

Example 1 - One-factor Model I (Fixed Effects)
[SAS output; numeric values survive only where shown]

The GLM Procedure
Dependent Variable: WEIGHT

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              2                                  9.18
  Error             33
  Corrected Total   35

  R-Square, Coeff Var, Root MSE, WEIGHT Mean

  Source   DF   Type III SS   Mean Square   F Value   Pr > F
  SMOKE     2                               9.18

  Contrast                DF   Contrast SS   Mean Square   F Value   Pr > F
  Nonsmoking vs Smoking    1

Brown and Forsythe's Test for Homogeneity of WEIGHT Variance
ANOVA of Absolute Deviations from Group Medians

  Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
  SMOKE     2
  Error    33

Multiple comparison tests for WEIGHT (each with Error Degrees of Freedom 33):

Waller-Duncan K-ratio t Test
  NOTE: This test minimizes the Bayes risk under additive loss and certain other assumptions.
  Kratio 100; F Value 9.18; Error Mean Square, Critical Value of t, Minimum Significant Difference

t Tests (LSD)
  NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
  Alpha 0.05; Error Mean Square, Critical Value of t, Least Significant Difference

Duncan's Multiple Range Test
  NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
  Alpha 0.05; Error Mean Square; Critical Range for Number of Means 2 and 3

Student-Newman-Keuls Test
  NOTE: This test controls the Type I experimentwise error rate under the complete null hypothesis but not under partial null hypotheses.
  Alpha 0.05; Error Mean Square; Critical Range for Number of Means 2 and 3

Ryan-Einot-Gabriel-Welsch Multiple Range Test
  NOTE: This test controls the Type I experimentwise error rate.
  Alpha 0.05; Error Mean Square; Critical Range for Number of Means 2 and 3

Tukey's Studentized Range (HSD) Test
  NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.
  Alpha 0.05; Error Mean Square, Critical Value of Studentized Range, Minimum Significant Difference

Studentized Maximum Modulus (GT2) Test
  NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.
  Alpha 0.05; Error Mean Square, Critical Value of Studentized Maximum Modulus, Minimum Significant Difference

Sidak t Tests
  NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.
  Alpha 0.05; Error Mean Square, Critical Value of t, Minimum Significant Difference

Bonferroni (Dunn) t Tests
  NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.
  Alpha 0.05; Error Mean Square, Critical Value of t, Minimum Significant Difference

Scheffe's Test
  NOTE: This test controls the Type I experimentwise error rate.
  Alpha 0.05; Error Mean Square, Critical Value of F, Minimum Significant Difference

Grouping reported by each test above (means with the same letter are not significantly different):

  Grouping   Mean   N    SMOKE
  A                 12   Nonsmoke
  A                 12   1 Pack
  B                 12   1+ Pack

ANOVA by Regression

In this example, we'll do the same ANOVA we did in Example 1 (i.e. babies categorized by smoking status of mom during pregnancy). However, this time we will have Excel do the ANOVA by using regression procedures. This approach is not just for fun - or to see if we can fool Excel into doing "stupid ANOVA tricks". When we deal with unbalanced designs, it will be very important to understand that ANOVA problems can be solved by using regression. Also, SAS and other major statistical packages use this approach.

Below are the data and "summary output" from the Excel Regression Data Analysis tool. You can see that the birth weights are in the third variable (column). The first two variables are "dummy variables" - they are codes that indicate smoking status:

  X1   X2   Smoking group
  0    0    Nonsmoking
  1    0    1 pack/day
  0    1    1+ pack/day

First, check out the ANOVA table, and compare it to the ANOVA table prepared by SAS (or from the TestPac). Notice that Total SS is the same in both. Regression SS below is the same as Groups SS in the TestPac (called SMOKE SS in the SAS output). Residual SS below is the same as Error SS. The DF, MS, and F values also are the same.

SUMMARY OUTPUT
[Excel output; the data listing and the numeric values were not preserved]
  Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations
  ANOVA (columns df, SS, MS, F, Significance F): rows Regression, Residual, Total
  Coefficients (columns Coefficients, Standard Error, t Stat, P-value, Lower 95%): rows Intercept, X Variable 1, X Variable 2

ANOVA by Regression: Calculations and Sources of Variation

The regression done above is a "multiple linear regression". In BIO 211, we did "simple linear regression", which means there was one dependent and one independent variable. Our regression "model" in BIO 211 was $Y = a + bX$. In multiple linear regression, there is one dependent variable and two or more independent variables. Our multiple regression model is

$$Y = a + b_1 X_1 + b_2 X_2$$

In our data, $Y$ is the baby weights, $X_1$ is the first dummy variable, and $X_2$ is the second dummy variable. The $b_1$ and $b_2$ values are called "partial regression coefficients". They are like the slope of the line in simple regression, and they are parameter estimates whose values are determined from the data. The interpretations are:

  $b_1$ shows the effect of $X_1$ on $Y$ while holding $X_2$ constant
  $b_2$ shows the effect of $X_2$ on $Y$ while holding $X_1$ constant

From the Excel output above, you should see that our equation is $\hat{Y} = a + b_1 X_1 + b_2 X_2$, with $a$, $b_1$, and $b_2$ read from the Coefficients column (Intercept, X Variable 1, X Variable 2). Let's see the predicted values of $Y$ (calculated by putting in the values of the dummy variables). Does anything here look familiar? Do you see how this works? Note that the predicted value for each baby is the mean for that group.

  When $X_1 = 0$ and $X_2 = 0$, then $\hat{Y} = a$, the mean of the nonsmoking group.
  When $X_1 = 1$ and $X_2 = 0$, then $\hat{Y} = a + b_1(1)$. In other words, $b_1$ is negative, and its size is how many grams the mean of the 1 Pack/day group falls below the mean of the nonsmoking group.
  When $X_1 = 0$ and $X_2 = 1$, then $\hat{Y} = a + b_2(1)$. In other words, $b_2$ is negative, and its size is how many grams the mean of the 1+ Pack/day group falls below the mean of the nonsmoking group.

This should make sense to you. All the dummy variables tell us is what smoking group a baby belongs to. If you're trying to estimate (predict) the birth weight of a baby, and all you know is what smoking group it is in, your "best guess" is the mean of the group. For example, let's say a baby has just been born, and it is classified into the 1 Pack/day group.

Now, pretend you have to guess the birth weight, and for every gram you are off, you have to pay Dr. M. $1.00! What do you do?? Your "best guess" is the mean of the 1 Pack/day group, because that is right in the middle of the group. (Strictly, the mean minimizes the squared error, which is what least squares uses; for a roughly symmetric group the mean and the median nearly coincide.) That should minimize how much you have to pay to the evil Dr. M.
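To see the "predicted value = group mean" result without Excel, here is a minimal Python sketch, offered as an illustration only. The birth weights are hypothetical (the course data are not reproduced here), and `solve` is a small helper written for this example. It fits Y = a + b1 X1 + b2 X2 by solving the least-squares normal equations exactly.

```python
from fractions import Fraction

# Hypothetical birth weights (g) keyed by dummy codes (X1, X2); NOT the course data.
data = {
    (0, 0): [3500, 3600, 3400, 3700],  # nonsmoking
    (1, 0): [3300, 3400, 3200, 3500],  # 1 pack/day
    (0, 1): [2900, 3000, 2800, 3100],  # 1+ pack/day
}

# One design row (1, X1, X2) and one response value per baby.
rows = [(1, x1, x2) for (x1, x2), ys in data.items() for _ in ys]
y = [w for ys in data.values() for w in ys]

def solve(a, b):
    """Solve a small square system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                m[r] = [v - m[r][col] * p for v, p in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

# Normal equations: (X'X) b = X'y
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * w for r, w in zip(rows, y)) for i in range(3)]
a, b1, b2 = solve(xtx, xty)

# Predicted value for each dummy-code combination.
predicted = {codes: a + b1 * codes[0] + b2 * codes[1] for codes in data}
for codes, ys in data.items():
    print(codes, "predicted:", predicted[codes], "group mean:", sum(ys) / len(ys))
```

In this made-up data set the intercept is the nonsmoking mean and each slope is the (negative) difference between a smoking group's mean and the nonsmoking mean, which is the pattern described above.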

Now, for the all-important sources of variation and sums of squares. First, in multiple regression, the sources of variation are exactly as they were in simple linear regression. Namely:

  Total is the variation of the observed value of the dependent variable.
  Regression is the variation of the predicted value of the dependent variable.
  Residual (Error) is the variation of the difference between observed and predicted.

$$\text{Total SS} = \sum_i (Y_i - \bar{Y})^2 \qquad \text{Regression SS} = \sum_i (\hat{Y}_i - \bar{Y})^2 \qquad \text{Residual (Error) SS} = \sum_i (Y_i - \hat{Y}_i)^2$$

Now, compare and contrast SS in one-factor ANOVA with SS in regression.

Total SS

Notice that Total SS is exactly the same in one-factor ANOVA and regression. Look at the formula above, and then the ANOVA formula from the TestPac. In both cases, Total SS is the sum of the squared deviation of each baby weight from the grand mean of the baby weights.

Groups SS = Regression SS

Groups SS in ANOVA is

$$\text{Groups SS} = \sum_j n_j (\bar{X}_j - \bar{X})^2$$

At first, you're thinking this is totally different from Regression SS. But, it's the same! First of all, don't be thrown off by the use of X and Y. The X and Y both refer to the same variable in this case, i.e. baby birth weight. Groups SS says you take (group mean - grand mean)^2 and multiply by the number of data points in the group. Look at the calculation of Groups SS in the TestPac. Now, think about how you would evaluate Regression SS. Remember, the predicted value of Y is the mean of that group. And what about the grand mean? It's the same - the mean of all 36 babies. So, in each group, you're taking (group mean - grand mean)^2, and you do this once for each baby in the group. This is just like multiplying by the number of babies in the group.

Error SS = Residual (Error) SS

Check the TestPac pages for the calculation of Error SS in ANOVA. You calculate a SS for each group (each baby from its group mean), and then add them together. For Residual (Error) SS in regression, you're taking each baby's weight minus its predicted weight and squaring, and then adding them all up. But remember, the predicted value for each baby is its group mean! So, you're doing the same thing as in ANOVA.

It is important that you understand the relationship between the sources of variation in ANOVA and regression. See Dr. M. if this is causing you problems. It's really not hard - you will get it if you think about it for a bit. If you don't remember the definition of "important" from BIO 211, ask Dr. M. in class!
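The correspondence between the ANOVA and regression sums of squares can be checked with a few lines of Python. This is only a sketch with hypothetical weights (the course data are not reproduced here); it takes the predicted value for each baby to be its group mean, as described above.

```python
from statistics import mean

# Hypothetical birth weights (g) for three groups; NOT the course data.
groups = [
    [3500, 3600, 3400, 3700],
    [3300, 3400, 3200, 3500],
    [2900, 3000, 2800, 3100],
]
all_y = [y for g in groups for y in g]
grand = mean(all_y)

# Regression view: each baby's predicted value is its group mean.
predicted = [mean(g) for g in groups for _ in g]
total_ss = sum((y - grand) ** 2 for y in all_y)
regression_ss = sum((p - grand) ** 2 for p in predicted)
residual_ss = sum((y - p) ** 2 for y, p in zip(all_y, predicted))

# ANOVA view: Groups SS and Error SS from the group means.
groups_ss = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
error_ss = sum(sum((y - mean(g)) ** 2 for y in g) for g in groups)

print("Groups SS vs Regression SS:", groups_ss, regression_ss)
print("Error SS vs Residual SS:   ", error_ss, residual_ss)
print("Total SS vs sum of parts:  ", total_ss, regression_ss + residual_ss)
```

Each printed pair agrees up to floating-point rounding.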

ANOVA as Done by Statistics Programs

Now that we've seen how ANOVA is done by regression, we can expand a bit and take a peek at how a statistics program (e.g. SAS) actually does these procedures. First, a couple of important points:

1. This is not comprehensive. We're not going to look at every detail of the calculations used by computer programs, but we will look at some of them.
2. Don't worry about these details for the exam. You're not expected to recreate all the matrices and other details. The general concepts of ANOVA by Regression in the previous section are important, but the details here are "value added" (which means "not on the test").

Our example will be the baby birth weight example (again). You tell the program what the response variable is (birth weight), and what the factor is (Smoking). The program then looks at your data and figures out that N (the total sample size) is 36 and that the factor has three levels. The program then knows it's working with the following model:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \varepsilon$$

where

  $Y$ is the dependent (response) variable (birth weight).
  $b_0$ is the intercept. The intercept is included by default, but you can request that it not be included in the model. Don't do this unless you really know what you are doing.
  $b_0$, $b_1$, $b_2$, and $b_3$ are parameter estimates. These are the unknowns. The program has to estimate these parameters to do the analysis.
  $X_1$, $X_2$, and $X_3$ are dummy variables that indicate to what smoking level the baby belongs. The values 1 0 0 indicate a nonsmoking baby; 0 1 0 is a 1 pack/day baby; and 0 0 1 is a 1+ pack/day baby.
  $\varepsilon$ is the error (residual).

The program then writes the model in matrix terms:

$$Y = X\beta + \varepsilon$$

where $Y$ is a vector containing the birth weights, and $\beta$ is a vector containing the $b_i$ symbols:

$$\beta = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}$$

$X$ is a matrix called the "design matrix" that has the dummy variables. There are 4 columns in the design matrix. All the values in the first column will be 1. This first column refers to the intercept. The next 3 columns are the dummy variables $X_1$, $X_2$, and $X_3$, in that order.

The OLS (ordinary least squares) solution is to solve for $\beta$:

$$X'Y = X'X\beta$$
$$(X'X)^{-}X'X\beta = (X'X)^{-}X'Y$$
$$I\beta = (X'X)^{-}X'Y$$
$$\beta = (X'X)^{-}X'Y$$

Here $X'$ is the transpose of $X$, $(X'X)^{-}$ is the inverse of $X'X$, and $I$ is the identity matrix. Although we certainly will calculate the error term ($\varepsilon$) along the way, we don't include it here in our matrix approach. Let's begin by looking at the elements of $Y$ and $X$.

The $Y$ vector is a single column (a 36x1 vector) holding the 36 birth weights. The design matrix $X$ is 36x4: each of the 12 nonsmoking babies has the row (1 1 0 0), each of the 12 babies in the 1 pack/day group has the row (1 0 1 0), and each of the 12 babies in the 1+ pack/day group has the row (1 0 0 1).

The program needs the transpose of $Y$, written $Y'$, and the transpose of $X$, written $X'$. $Y'$ has all the weights listed as a single row (a row vector). There's not enough room here to get all 36 birth weights on a single line, so you have to use your imagination. $X'$ is the 4x36 matrix whose rows are the columns of $X$.

Next, the program calculates $X'X$. Since $X'$ is a 4x36 matrix and $X$ is a 36x4 matrix, $X'X$ must be a 4x4 matrix. That is: (4x36) x (36x4) = 4x4. Notice that the principal diagonal has the sample sizes: for all data in the 1,1 position, then for each level as you go down the diagonal. $X'X$ is:

$$X'X = \begin{bmatrix} 36 & 12 & 12 & 12 \\ 12 & 12 & 0 & 0 \\ 12 & 0 & 12 & 0 \\ 12 & 0 & 0 & 12 \end{bmatrix}$$

Reality check: Programs may not actually construct the design matrix. They may read a line of data, construct the appropriate line for the design matrix, transpose that line, and then multiply the transpose by the design matrix line. Thus, the $X'X$ matrix is being accumulated, one line at a time.
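The "one line at a time" accumulation in the reality check can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of what SAS actually does internally: the six records are hypothetical, and `design_row` is a helper invented for the example. Each record contributes its design row's outer product to X'X, its row times the weight to X'Y, and the squared weight to Y'Y, so the full design matrix is never stored.

```python
# Hypothetical records: (smoking level 1..3, birth weight in g); NOT the course data.
records = [(1, 3500), (1, 3600), (2, 3300), (2, 3400), (3, 2900), (3, 3000)]

def design_row(level, n_levels=3):
    """Intercept followed by one 0/1 dummy variable per factor level."""
    return [1] + [1 if level == k else 0 for k in range(1, n_levels + 1)]

p = 4  # intercept + 3 dummy variables
xtx = [[0] * p for _ in range(p)]  # accumulates X'X
xty = [0] * p                      # accumulates X'Y
yty = 0                            # accumulates Y'Y (the Uncorrected SS)

for level, weight in records:
    row = design_row(level)
    for i in range(p):
        xty[i] += row[i] * weight
        for j in range(p):
            xtx[i][j] += row[i] * row[j]
    yty += weight * weight

# Sample sizes sit on the diagonal of X'X, and sums sit in X'Y,
# so the means and the "machine formula" Total SS follow directly.
grand_mean = xty[0] / xtx[0][0]
level_means = [xty[k] / xtx[k][k] for k in range(1, p)]
total_ss = yty - xty[0] ** 2 / xtx[0][0]

print(xtx)
print(xty, grand_mean, level_means, total_ss)
```

The diagonal of the accumulated matrix holds 6, 2, 2, 2 here, the same pattern as the 36, 12, 12, 12 diagonal described for the baby data.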

X'Y

(4x36) x (36x1) = 4x1. The resulting vector contains $\sum Y_i$ (sum of the birth weights) values. The first element is for all the data; subsequent elements are for the levels of the smoking factor. Since sample sizes are in the principal diagonal of the $X'X$ matrix, the grand mean and level means may now be calculated:

  first element / 36 = the grand mean
  second element / 12 = the nonsmoking mean
  third element / 12 = the 1 pack/day mean
  fourth element / 12 = the 1+ pack/day mean

Y'X

(1x36) x (36x4) = 1x4. This vector has the same elements as $X'Y$, but as a row vector. It is useful for later calculations.

Y'Y

(1x36) x (36x1) = 1x1. This is $\sum Y_i^2$, i.e. the sum of the squared birth weights. This quantity is sometimes called the Uncorrected SS.

Since we have $\sum Y_i$ for all the data as the first element of $X'Y$, and the total sample size (36) from the $X'X$ matrix, the program may now calculate Total SS by the "machine formula":

$$SS_{Total} = \sum_{i=1}^{N} Y_i^2 - \frac{\left(\sum_{i=1}^{N} Y_i\right)^2}{N}, \qquad N = 36$$

(X'X)^-

In order to do several more calculations, the program now needs to calculate the inverse of the $X'X$ matrix, which is symbolized by $(X'X)^{-}$. But a problem is that the $X'X$ matrix is singular (determinant = 0), and therefore has no inverse. Mathematicians have developed a method called the "generalized inverse" to deal with this situation. A frequently used generalized inverse is the $g_2$-inverse, also called a "reflexive generalized inverse".

Let $A$ represent a square matrix of order $p$, and let $G$ also be a square matrix of order $p$. $G$ is a $g_2$-inverse of $A$, and $A$ is a $g_2$-inverse of $G$ (that's why it's called "reflexive"), if both of the following conditions are met:

1. $AGA = A$
2. $GAG = G$

The generalized inverse of $X'X$ is found by a matrix operation called "sweeping", which involves working on the matrix one row at a time. The $g_2$-inverse found by the sweeping algorithm is not unique; different solutions can be obtained depending on how the matrix is swept. Fortunately, we just need an inverse, and don't need to worry about the details of how the sweeping operator functions. We just let the computer tell us the $g_2$-inverse it found.

Y'X(X'X)^-X'Y

We're multiplying 3 matrices here. $Y'X$ is 1x4; $(X'X)^{-}$ is 4x4; $X'Y$ is 4x1. So (1x4) x (4x4) = 1x4, and then (1x4) x (4x1) is 1x1. This is just an intermediate value. What we want to do is subtract this from the $Y'Y$ value:

$$\text{Error SS} = Y'Y - Y'X(X'X)^{-}X'Y$$

This is the Error SS. Notice we now have Total SS, Error SS, and all the sample sizes. We would now be able to complete the ANOVA table and do the F test.

(X'X)^-X'Y

(4x4) x (4x1) = 4x1. This is the calculation of the $b_i$ values. So, our model is

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + 0 X_3$$

We can now plug in the values for dummy variables $X_1$, $X_2$, and $X_3$ and calculate the predicted birth weights (a table with columns Intercept, X1, X2, X3, Predicted Y). Again, just as we saw in ANOVA by Regression, the key thing to note here is that the predicted weight for each baby is the mean of its smoking group.

The method we've looked at here is more general than using Excel, and this method is how most real statistics programs approach ANOVA models. Of course, as the ANOVA model gets more complicated, so do all of these matrices. But the general principles remain the same.
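The matrix steps above can be traced on a small example. The Python sketch below is an illustration only: the X'X, X'Y, and Y'Y values come from a hypothetical data set with 2 babies per group (not the course data), and the generalized inverse is built by one simple construction (invert the full-rank leading 3x3 block and pad with zeros) rather than by the sweep operator. That construction is an assumption made for the example; it yields one of the many possible reflexive inverses, and it produces a zero coefficient for X3.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Invert a nonsingular square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                m[r] = [v - m[r][col] * p for v, p in zip(m[r], m[col])]
    return [row[n:] for row in m]

# Hypothetical balanced data, 2 babies per group (columns: intercept, X1, X2, X3).
xtx = [[6, 2, 2, 2], [2, 2, 0, 0], [2, 0, 2, 0], [2, 0, 0, 2]]
xty = [[19700], [7100], [6700], [5900]]
yty = 65070000

# Assumed construction of a generalized inverse G of X'X.
block = inverse([row[:3] for row in xtx[:3]])
g = [row + [Fraction(0)] for row in block] + [[Fraction(0)] * 4]

# The two reflexive (g2) conditions: AGA = A and GAG = G.
aga_ok = matmul(matmul(xtx, g), xtx) == xtx
gag_ok = matmul(matmul(g, xtx), g) == g

# Parameter estimates b = G X'Y and Error SS = Y'Y - Y'X G X'Y.
b = [row[0] for row in matmul(g, xty)]
ytx = [[row[0] for row in xty]]
error_ss = yty - matmul(matmul(ytx, g), xty)[0][0]

print(aga_ok, gag_ok, b, error_ss)
```

A different generalized inverse would give different b values, but the predicted group means and the Error SS would be unchanged.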

Randomized Block (no replications)

  TITLE 'Randomized Block (no replications)';
  /* This is the Randomized Block example found in Zar (4th ed). The response
     variable is weight gain (g) in guinea pigs. There is a Diet factor with
     4 levels, and a Block factor with 5 levels. The blocks represent different
     rooms that have slightly different conditions (noise level, light/dark
     cycle). Each of the five rooms houses four guinea pigs, one on each of the
     four diets.

     If you had BIO 211 with Dr. M., this should sound familiar, as this example
     was also used in class. HOWEVER, if you took BIO 211 before Fall 1999, the
     data were different from what you see below. The problem setup (response
     variable, diets, blocks (rooms)) was exactly the same, only the numbers
     have changed! Before Fall 1999, Dr. M. used the data from the 1st and 2nd
     editions of the Zar text. When the 3rd edition of Zar came out (late 1996),
     the data were changed - but Dr. M. didn't change the class example until
     Fall 1999. You may wonder why Zar kept the same problem setup, but changed
     the numbers - well, join the club! Dr. M. would like to hear the answer to
     that question! Also, if you took BIO 211 before Fall 1999, why haven't you
     graduated yet?

     We will do three ANOVAs here: (1) a One-factor grouping by Diets, (2) a
     One-factor grouping by Blocks (rooms), and (3) a Two-factor grouping by
     both Diets and Blocks. What you should do is examine the SS, DF and MS due
     to Diets and Blocks in the One-factor and the Two-factor ANOVAs. See if you
     can detect the pattern, and explain it! Also look at what happens to the
     Error (unexplained) source in the ANOVAs. */
  DATA G_PIGS;
    INPUT WT_GAIN DIET BLOCK;
  CARDS;
  ;
  PROC GLM;
    CLASS DIET;
    MODEL WT_GAIN = DIET;
  PROC GLM;
    CLASS BLOCK;
    MODEL WT_GAIN = BLOCK;
  PROC GLM;
    CLASS DIET BLOCK;
    MODEL WT_GAIN = DIET BLOCK;
    MEANS DIET BLOCK;
  RUN;

Randomized Block (no replications)    14:27 Thursday, June 17, 1999
[SAS output; numeric values were not preserved; DF follow from 4 diets, 5 blocks, and 20 observations]

General Linear Models Procedure (first PROC GLM: MODEL WT_GAIN = DIET)

  Class Level Information: DIET, 4 levels
  Number of observations in data set = 20

  Dependent Variable: WT_GAIN
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              3
  Error             16
  Corrected Total   19

  R-Square, C.V., Root MSE, WT_GAIN Mean

  Source   DF   Type I SS and Type III SS   Mean Square   F Value   Pr > F
  DIET      3

General Linear Models Procedure (second PROC GLM: MODEL WT_GAIN = BLOCK)

  Class Level Information: BLOCK, 5 levels
  Number of observations in data set = 20

  Dependent Variable: WT_GAIN
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              4
  Error             15
  Corrected Total   19

  R-Square, C.V., Root MSE, WT_GAIN Mean

  Source   DF   Type I SS and Type III SS   Mean Square   F Value   Pr > F
  BLOCK     4

General Linear Models Procedure (third PROC GLM: MODEL WT_GAIN = DIET BLOCK)

  Class Level Information: DIET, 4 levels; BLOCK, 5 levels
  Number of observations in data set = 20

  Dependent Variable: WT_GAIN
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              7
  Error             12
  Corrected Total   19

  R-Square, C.V., Root MSE, WT_GAIN Mean

  Source   DF   Type I SS and Type III SS   Mean Square   F Value   Pr > F
  DIET      3
  BLOCK     4

  Means: WT_GAIN N, Mean, and SD for each level of DIET and for each level of BLOCK
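One way to explore the three analyses side by side is to compute the sums of squares directly. The Python sketch below is an illustration with hypothetical weight gains (the Zar data are not reproduced here). It relies on the design being balanced with one observation per diet-block cell, which is what lets each main-effect SS be computed from its own marginal means.

```python
from statistics import mean

# Hypothetical weight gains (g): rows = 4 diets, columns = 5 blocks (rooms).
# NOT the Zar data.
gain = [
    [7, 10, 8, 9, 6],
    [5, 9, 6, 8, 5],
    [8, 12, 9, 11, 7],
    [6, 9, 8, 8, 6],
]
values = [v for row in gain for v in row]
grand = mean(values)
n_diets, n_blocks = len(gain), len(gain[0])

total_ss = sum((v - grand) ** 2 for v in values)
diet_ss = n_blocks * sum((mean(row) - grand) ** 2 for row in gain)
block_ss = n_diets * sum((mean(col) - grand) ** 2 for col in zip(*gain))

# Error SS in each of the three analyses.
error_diet_only = total_ss - diet_ss              # one-factor, grouping by diet
error_block_only = total_ss - block_ss            # one-factor, grouping by block
error_two_factor = total_ss - diet_ss - block_ss  # two-factor, diet and block

print("Diet SS:", diet_ss, " Block SS:", block_ss, " Total SS:", total_ss)
print("Error SS (diet only):  ", error_diet_only)
print("Error SS (block only): ", error_block_only)
print("Error SS (diet + block):", error_two_factor)
```

Comparing the three Error SS values with the Diet SS and Block SS gives a concrete version of the pattern the SAS comments ask you to look for.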

Two-factor ANOVA with replications (balanced design)

  TITLE 'Two-factor ANOVA with replications (balanced design)';
  /* This is the same example used in Dr. M's Biometrics class. A clinic that
     does health evaluations is studying the effect of smoking. The clinic
     evaluates people using one of two devices: a stationary bicycle and a
     treadmill. While the subject is on the bike or treadmill, their oxygen
     consumption is measured, and the time (in minutes) required for the subject
     to reach their maximum oxygen consumption is noted. The data below are for
     18 people: 6 nonsmokers, 6 moderate smokers, and 6 heavy smokers. From each
     smoking group, 3 individuals were randomly chosen to ride the bike, and the
     other 3 walked the treadmill.

     It is important to note here that every individual was measured on only
     one device, either the bike or the treadmill. If every individual had been
     measured on each device, that would be a repeated measures design - we'll
     deal with that later in the quarter. */
  DATA CLINIC;
    INPUT SMOKING $ DEVICE $ TIME;
  CARDS;
  NON BIKE 12.8
  NON BIKE 13.5
  NON BIKE 11.2
  NON TREAD 17.8
  NON TREAD 18.1
  NON TREAD 16.2
  MOD BIKE 10.9
  MOD BIKE 11.1
  MOD BIKE 9.8
  MOD TREAD 15.5
  MOD TREAD 13.8
  MOD TREAD 16.2
  HEAVY BIKE 8.7
  HEAVY BIKE 9.2
  HEAVY BIKE 9.5
  HEAVY TREAD 14.7
  HEAVY TREAD 13.2
  HEAVY TREAD 10.1
  ;
  PROC GLM;
    CLASS SMOKING DEVICE;
    MODEL TIME = SMOKING DEVICE SMOKING*DEVICE;
    MEANS SMOKING / TUKEY;
    MEANS DEVICE SMOKING*DEVICE;
  RUN;

Two-factor ANOVA with replications (balanced design)    16:00 Thursday, June 17, 1999

General Linear Models Procedure

  Class Level Information
  Class     Levels   Values
  SMOKING   3        HEAVY MOD NON
  DEVICE    2        BIKE TREAD

  Number of observations in data set = 18

Two-factor ANOVA with replications (balanced design)    16:00 Thursday, June 17, 1999
[SAS output; numeric values survive only where shown; DF follow from 3 smoking levels, 2 devices, and 18 observations]

General Linear Models Procedure
Dependent Variable: TIME

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              5
  Error             12
  Corrected Total   17

  R-Square, C.V., Root MSE, TIME Mean

  Source           DF   Type I SS and Type III SS   Mean Square   F Value   Pr > F
  SMOKING           2
  DEVICE            1
  SMOKING*DEVICE    2

Tukey's Studentized Range (HSD) Test for variable: TIME
  NOTE: This test controls the type I experimentwise error rate, but generally has a higher type II error rate than REGWQ.
  Alpha= 0.05  df= 12  MSE, Critical Value of Studentized Range, Minimum Significant Difference
  Means with the same letter are not significantly different.

  Tukey Grouping   Mean   N   SMOKING
  A                       6   NON
  B                       6   MOD
  B                       6   HEAVY

Means: TIME N, Mean, and SD for each level of DEVICE (BIKE, TREAD) and for each SMOKING by DEVICE cell (HEAVY BIKE, HEAVY TREAD, MOD BIKE, MOD TREAD, NON BIKE, NON TREAD)
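Because the CLINIC data are listed in full in the SAS program, the sums of squares for this example can be recomputed directly. The Python sketch below is an illustration, not the SAS algorithm: it uses the textbook formulas for a balanced two-factor design (3 observations in every cell), where the interaction SS is the cells SS minus the two main-effect SS. The helper name `main_effect_ss` is made up for the example.

```python
from statistics import mean

# TIME (minutes) by SMOKING and DEVICE, copied from the CARDS section above.
cells = {
    ("NON", "BIKE"): [12.8, 13.5, 11.2],
    ("NON", "TREAD"): [17.8, 18.1, 16.2],
    ("MOD", "BIKE"): [10.9, 11.1, 9.8],
    ("MOD", "TREAD"): [15.5, 13.8, 16.2],
    ("HEAVY", "BIKE"): [8.7, 9.2, 9.5],
    ("HEAVY", "TREAD"): [14.7, 13.2, 10.1],
}
values = [t for ts in cells.values() for t in ts]
grand = mean(values)

def main_effect_ss(position):
    """SS for the factor at the given key position (0 = SMOKING, 1 = DEVICE)."""
    levels = {}
    for key, ts in cells.items():
        levels.setdefault(key[position], []).extend(ts)
    return sum(len(ts) * (mean(ts) - grand) ** 2 for ts in levels.values())

total_ss = sum((t - grand) ** 2 for t in values)
smoking_ss = main_effect_ss(0)
device_ss = main_effect_ss(1)
cells_ss = sum(len(ts) * (mean(ts) - grand) ** 2 for ts in cells.values())
interaction_ss = cells_ss - smoking_ss - device_ss  # valid for a balanced design
error_ss = sum((t - mean(ts)) ** 2 for ts in cells.values() for t in ts)

# F ratios with DF: SMOKING 2, DEVICE 1, SMOKING*DEVICE 2, Error 12.
ms_error = error_ss / 12
f_smoking = (smoking_ss / 2) / ms_error
f_device = (device_ss / 1) / ms_error
f_interaction = (interaction_ss / 2) / ms_error

print("SS:", smoking_ss, device_ss, interaction_ss, error_ss, total_ss)
print("F: ", f_smoking, f_device, f_interaction)
```

The printed values can be compared with the Type I and Type III SS lines of the PROC GLM output when the program is run.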

27 BIO 575 Biological Applications of ANOVA - Winter Quarter 2010 Page 27 Analysis of Covariance (ANCOVA) TITLE 'Analysis of Covariance (ANCOVA)'; * This is the example that was done in Dr. M's Biometrics class. It uses the same data as in the One-factor ANOVA we did at the beginning of the class: i.e. birth weights of babies grouped by smoking status of the mother during pregnancy. The new variable here is the prepregnancy body weight of the mom (in kg). The first variable indicates smoking group: 1 = none, 2 = 1 pack/day, 3 = 1+ pack/day. The second variable is the birthweight (g), the third variable is the mom weight (kg) ; DATA MOM_BABY; INPUT SMOKING BABY_WT MOM_WT; CARDS; ; PROC Reg; *This PROC Reg does a regression on all 36 data points. If you had Dr. M for BIO 211, this is the example that was used in class for regression; Model BABY_WT = MOM_WT; PROC Reg; *Next, SAS does a regression on each of the smoking groups separately. This is accomplished by the BY SMOKING command; Model BABY_WT = MOM_WT; BY SMOKING; PROC GLM; *The first PROC GLM is used to test if the slopes of the regression lines are the same for each of the smoking groups. This is the interaction term (SMOKING*MOM_WT). The slopes are equal (p = ).; CLASS SMOKING; MODEL BABY_WT = SMOKING MOM_WT SMOKING*MOM_WT;

PROC GLM;
  CLASS SMOKING;
  MODEL BABY_WT = SMOKING MOM_WT / SOLUTION;
  MEANS SMOKING;
  LSMEANS SMOKING;
  /* The second PROC GLM does the ANCOVA. The SOLUTION option prints
     out the pooled regression coefficient (slope). The value is
     about . LSMEANS prints the adjusted means; MEANS prints the
     means prior to adjustment. */
RUN;

Analysis of Covariance (ANCOVA)   1
16:22 Tuesday, December 11, 2001

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model
Error
Corrected Total

Root MSE           R-Square
Dependent Mean     Adj R-Sq
Coeff Var

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept
MOM_WT

Analysis of Covariance (ANCOVA)   2
16:22 Tuesday, December 11, 2001

SMOKING=

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model
Error
Corrected Total

Root MSE           R-Square
Dependent Mean     Adj R-Sq
Coeff Var

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept
MOM_WT

Example 5 - Analysis of Covariance (ANCOVA)   3
16:22 Tuesday, December 11, 2001

SMOKING=

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model
Error
Corrected Total

Root MSE           R-Square
Dependent Mean     Adj R-Sq
Coeff Var

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept
MOM_WT

Analysis of Covariance (ANCOVA)   4
16:22 Tuesday, December 11, 2001

SMOKING=

The REG Procedure
Model: MODEL1
Dependent Variable: BABY_WT

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model
Error
Corrected Total

Root MSE           R-Square
Dependent Mean     Adj R-Sq
Coeff Var

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept
MOM_WT

Analysis of Covariance (ANCOVA)   5
16:22 Tuesday, December 11, 2001

The GLM Procedure

Class Level Information

Class      Levels   Values
SMOKING

Number of observations 36

Dependent Variable: BABY_WT

Analysis of Covariance (ANCOVA)   6
16:22 Tuesday, December 11, 2001

The GLM Procedure

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                                                            <.0001
Error
Corrected Total

R-Square   Coeff Var   Root MSE   BABY_WT Mean

Source             DF   Type I SS   Mean Square   F Value   Pr > F
SMOKING                                                     <.0001
MOM_WT                                                      <.0001
MOM_WT*SMOKING

Source             DF   Type III SS   Mean Square   F Value   Pr > F
SMOKING
MOM_WT
MOM_WT*SMOKING
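The MOM_WT*SMOKING line in this listing is the test for equal slopes. As a sketch of what is being tested: the model gives each smoking group its own intercept and its own slope, and the interaction F asks whether the separate slopes fit better than one common slope. The symbols below are introduced here for illustration and are not SAS output. The degrees of freedom follow from the 36 observations and 3 groups.

```latex
\begin{aligned}
Y_{ij} &= \mu + \alpha_i + (\beta + \gamma_i)\,X_{ij} + \varepsilon_{ij},
  \qquad i = 1,2,3,\\
H_0 &: \gamma_1 = \gamma_2 = \gamma_3 = 0 \quad \text{(one common slope } \beta\text{)},\\
F &= \frac{\mathrm{SS}_{\text{MOM\_WT*SMOKING}} / (3-1)}
          {\mathrm{SS}_{\text{Error}} / (36-6)}
   \quad \text{with } 2 \text{ and } 30 \text{ df}.
\end{aligned}
```

A nonsignificant F here is what justifies dropping the interaction and fitting the common-slope ANCOVA model in the second PROC GLM.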

Analysis of Covariance (ANCOVA)   7
16:22 Tuesday, December 11, 2001

The GLM Procedure

Class Level Information

Class      Levels   Values
SMOKING

Number of observations 36

Dependent Variable: BABY_WT

Analysis of Covariance (ANCOVA)   8
16:22 Tuesday, December 11, 2001

The GLM Procedure

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                                                            <.0001
Error
Corrected Total

R-Square   Coeff Var   Root MSE   BABY_WT Mean

Source             DF   Type I SS   Mean Square   F Value   Pr > F
SMOKING                                                     <.0001
MOM_WT                                                      <.0001

Source             DF   Type III SS   Mean Square   F Value   Pr > F
SMOKING                                                       <.0001
MOM_WT                                                        <.0001

Parameter     Estimate   Standard Error   t Value   Pr > |t|
Intercept            B
SMOKING              B                              <.0001
SMOKING              B
SMOKING              B   .                .         .
MOM_WT                                              <.0001

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
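The 'B' note in the SOLUTION table is easier to read with the fitted model written out. The sketch below assumes the usual PROC GLM convention of setting the effect of the last CLASS level to zero, which would be why one SMOKING row shows dots instead of a standard error and t value. The symbols are introduced here for illustration.

```latex
\begin{aligned}
\hat{Y} &= \beta_0 + \tau_i + b_p\,X, \qquad \tau_3 \equiv 0,\\
\text{intercept for group } i &= \beta_0 + \tau_i
  \quad (\text{so group 3 has intercept } \beta_0),\\
\text{LSMEAN}_i &= \beta_0 + \tau_i + b_p\,\bar{X},
\end{aligned}
```

Here $b_p$ is the MOM_WT estimate (the pooled slope) and $\bar{X}$ is the mean mom weight over all 36 observations. Each SMOKING estimate is then the difference between that group's intercept and the intercept of the reference group.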

Analysis of Covariance (ANCOVA)   9
16:22 Tuesday, December 11, 2001

The GLM Procedure

Level of             BABY_WT               MOM_WT
SMOKING    N     Mean     Std Dev     Mean     Std Dev

Example 5 - Analysis of Covariance (ANCOVA)   10
16:22 Tuesday, December 11, 2001

The GLM Procedure
Least Squares Means

SMOKING    BABY_WT LSMEAN

Calculation of ANCOVA quantities

Let's define the following symbols:

$\bar{Y}_i$ = mean birth weight for smoking group i
$\text{adj}\,\bar{Y}_i$ = adjusted mean birth weight for smoking group i
$\bar{X}_i$ = mean prepregnancy weight for moms in smoking group i
$\bar{X}$ = grand mean of prepregnancy weight for moms (kg)
$b_p$ = pooled regression coefficient
$x = (X - \bar{X}_i)$ and $y = (Y - \bar{Y}_i)$ = deviations from the group means
$\sum xy_i = \sum (X - \bar{X}_i)(Y - \bar{Y}_i)$ = sum of crossproducts for group i
$\sum x_i^2 = \sum (X - \bar{X}_i)^2$ = sum of squares of the moms' weights for group i

Calculation of the pooled regression coefficient ($b_p$):

Calculate $\sum xy_i$ and $\sum x_i^2$ for each of the smoking groups, and pool (add) them:

$$\sum xy_p = \sum xy_1 + \sum xy_2 + \sum xy_3$$

This is the pooled sum of crossproducts.

$$\sum x_p^2 = \sum x_1^2 + \sum x_2^2 + \sum x_3^2$$

This is the pooled sum of squares for the independent variable (the moms' weights).

$$b_p = \frac{\sum xy_p}{\sum x_p^2}$$

Calculation of the adjusted means:

The adjustment to the birth weight means depends on: (1) how far the mean of the moms in that group is from the grand mean of all moms; and (2) the relationship between mom's weight and baby's weight ($b_p$).

$$\text{adj}\,\bar{Y}_i = \bar{Y}_i - b_p(\bar{X}_i - \bar{X})$$

Use this formula to calculate adjusted birth weight means for each of the smoking groups:

Nonsmoking: $\text{adj}\,\bar{Y}_1 = \bar{Y}_1 - b_p(\bar{X}_1 - \bar{X}) = \bar{Y}_1 - b_p(2.3302)$
1 Pack/day: $\text{adj}\,\bar{Y}_2 = \bar{Y}_2 - b_p(\bar{X}_2 - \bar{X})$
1+ Pack/day: $\text{adj}\,\bar{Y}_3 = \bar{Y}_3 - b_p(\bar{X}_3 - \bar{X}) = \bar{Y}_3 - b_p(-0.111)$
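One quick check on hand-calculated adjusted means, offered as a sketch: the adjustment only moves the group means around, so their weighted average should still equal the grand mean birth weight. The symbols $n_i$ (group size), $N$ (total sample size) and $\bar{Y}$ (grand mean birth weight) are introduced here and are not defined in the text above.

```latex
\begin{aligned}
\sum_i n_i\,\mathrm{adj}\,\bar{Y}_i
  &= \sum_i n_i\,\bar{Y}_i \;-\; b_p \sum_i n_i\,(\bar{X}_i - \bar{X}) \\
  &= N\bar{Y} \;-\; b_p\,(N\bar{X} - N\bar{X}) \\
  &= N\bar{Y}.
\end{aligned}
```

When the groups are the same size, this also means the three deviations $(\bar{X}_i - \bar{X})$ sum to zero, which is a handy way to catch an arithmetic slip before computing the adjusted means.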

ANCOVA - Calculation of Sums of Squares (SS)

This is a bit complicated, and it's not necessary to memorize all of this detail, but it may help you understand how the ANCOVA works. If you compare the SS values calculated below with the SAS output, you will note differences. These are due to rounding error. The SAS methods are substantially more accurate than what you see below, but even if we did figure out exactly how SAS did the calculations, it would not help us understand the method.

Another (and perhaps more accurate) way to look at the ANCOVA is that the analysis tests whether all of the groups can be described by a common (pooled) regression line. If the regression line for each group is the same (same slope and same intercept), then the adjusted means are not significantly different. We can also test whether the slope of that pooled line is significantly different from zero.

So, our first task is to do a regression for each smoking group, except that we use the pooled regression coefficient ($b_p$) in each regression. This requires us to calculate an intercept for each group (using the means for moms and babies in that group). Then, we use the regression equation to calculate predicted values for each group. We can then calculate Regression SS and Error SS. We'll do this step by step for each group so you can see what's happening.

Nonsmoking Group

From the SAS output, we take the mean baby weight and the mean mom weight for this group. We use these means and $b_p$ to calculate an intercept term ($a_1$). This is done just as we did it in BIO 211:

$$a_1 = \bar{Y}_1 - b_p \bar{X}_1$$

So, the equation for this group is

$$\hat{Y} = a_1 + b_p X$$

Next, we calculate a predicted Y by putting each nonsmoking mom weight in for X. Then, calculate Regression SS (use the mean of the predicted values, not the observed values - they are different in this case) and Error SS.

Mom   Baby   Predicted Baby

Regression SS = $\sum (\hat{Y} - \bar{Y})^2$, with $\bar{Y}$ taken as the mean of the predicted values

Error SS = $\sum (Y - \hat{Y})^2$
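The two sums of squares for a group can also be written in terms of the deviation sums, which gives a way to check the row-by-row calculation. This is a sketch, not the way SAS does it. $\sum y_1^2 = \sum (Y - \bar{Y}_1)^2$ is a symbol introduced here, and $x$ and $y$ are the deviations of mom and baby weights from the nonsmoking group means.

```latex
\begin{aligned}
\hat{Y} - \bar{Y}_1 &= a_1 + b_p X - \bar{Y}_1 = b_p\,(X - \bar{X}_1) = b_p\,x,\\
\text{Regression SS} &= \sum (b_p\,x)^2 = b_p^{2} \sum x_1^{2},\\
Y - \hat{Y} &= (Y - \bar{Y}_1) - b_p\,(X - \bar{X}_1) = y - b_p\,x,\\
\text{Error SS} &= \sum (y - b_p\,x)^2
  = \sum y_1^{2} - 2\,b_p \sum xy_1 + b_p^{2} \sum x_1^{2}.
\end{aligned}
```

The first line also shows that, with unrounded values, the predicted values average to $\bar{Y}_1$. Any small gap between the mean of the predicted values and the observed mean in a hand calculation would presumably come from rounding $a_1$ and $b_p$.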

