BIOL 933, Lab 10, Fall 2017
Topic 13: Covariance Analysis


In this lab:
- Covariables as a tool for increasing precision
- Carrying out a full ANCOVA
- Testing ANOVA assumptions
- Happiness

Covariables as Tools for Increasing Precision

The eternal struggle of the noble researcher is to reduce the amount of variation that is unexplained by the experimental model (i.e. the error). Blocking is one way of doing this, of reducing the variation within treatments (MSE) so that one can better detect differences among treatments (MST). Blocking is a powerful general technique in that it allows the researcher to reduce the MSE by allocating variation to block effects without necessarily knowing the exact reason for block-to-block differences. The price one pays for this general reduction in error is twofold: 1) a reduction in error df, and 2) a constraint on the experimental design, as the blocks, if they are complete, must be crossed orthogonally with the treatments under investigation.

Another way of reducing unaccounted-for variation is through the use of covariables (or concomitant variables). Covariables are specific, observable characteristics of individual experimental units whose continuous values, while independent of the treatments under investigation, are strongly correlated with the values of the response variable in the experiment.

Though it may sound a little confusing at first, the concept of covariables is quite intuitive. Consider the following example: Baking potatoes takes forever, so a potato breeder decides to develop a fast-baking potato. He makes a bunch of crosses and carries out a test to compare baking times. While he hopes there are differences in baking times among the new varieties he has made, he knows that the size of the potato, regardless of its variety, will affect cooking time immensely. In analyzing baking times, then, the breeder must take this size factor into account. Ideally, he would like to compare the baking times he would have measured had all the potatoes been the same size; by using size as a covariable, he can do just that.

Unlike blocks, covariables are specific continuous variables that must be measured on each and every experimental unit, and their correlation with the response variable must be established, in order to be used. One advantage over blocks is that covariables do not need to satisfy any orthogonality criteria relative to the treatments under consideration. But a valid ANCOVA (Analysis of Covariance) does require the covariable to meet several assumptions.

Impressing your friends by carrying out a full ANCOVA

Example 1 [ST&D p. 435; Lab10ex1.R]

This example illustrates the correct way to code an ANCOVA, to test all the ANCOVA assumptions, and to interpret the results of such an analysis.

In this experiment, eleven varieties of lima beans (Varieties 1-11) were evaluated for differences in ascorbic acid content (response Y, in mg ascorbic acid / 100 g dry weight). Recognizing that differences in maturity at harvest could affect ascorbic acid content independently of variety, the researchers measured the percent dry matter of each variety at harvest (X) and used it as a covariable (i.e. a proxy indicator of maturity). So, in summary:

Treatment:          Variety (1-11)
Response variable:  Y (ascorbic acid content)
Covariable:         X (% dry matter at harvest)

The experiment was organized as an RCBD with five blocks and one replication per block-treatment combination.

Block  Variety     X      Y
  1       1      34     93
  1       2      39.6   47.3
  1       3      31.7   81.4
 ...     ...     ...    ...
  5       9      36.8   68.2
  5      10      24.6  146.1
  5      11      43.8   40.9
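Before running any models, the data need to be read in with Block and Variety treated as factors (R would otherwise treat them as numbers). Here is a minimal sketch of that setup; the file name lima.csv and its column layout are assumptions, and the actual Lab10ex1.R script may load the data differently:

# Read in the lima bean data (file name assumed for illustration)
lima_dat <- read.csv("lima.csv", header = TRUE)

# Block and Variety are classification variables, not numeric
lima_dat$Block   <- as.factor(lima_dat$Block)
lima_dat$Variety <- as.factor(lima_dat$Variety)

str(lima_dat)  # expect 55 obs. of 4 variables: Block, Variety, X, Y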

Coding, Results, and Explanations

1. The General Regression

plot(Y ~ X, lima_dat)
reg_mod <- lm(Y ~ X, lima_dat)
anova(reg_mod)
summary(reg_mod)

The plot of Y vs. X indicates a clear negative correlation between the two variables, a correlation that is confirmed by the following regression analysis results:

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X          1  51258   51258  254.44 < 2.2e-16 ***
Residuals 53  10677     201

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 231.4084     9.1356   25.33   <2e-16 ***
X            -4.1925     0.2628  -15.95   <2e-16 ***

Multiple R-squared: 0.8276

There is a strong negative correlation (p < 2e-16, slope = -4.19) between ascorbic acid content and % dry matter at harvest. In fact, over 82% of the variation in ascorbic acid content can be attributed to differences in % dry matter. It is a result like this that suggests that an ANCOVA may be appropriate for this study.

2. The ANOVA with X (the covariable) as the response variable

anovax_mod <- lm(X ~ Block + Variety, lima_dat)
anova(anovax_mod)

Response: X
          Df  Sum Sq Mean Sq F value    Pr(>F)
Block      4  367.85  91.963  9.6384 1.486e-05 ***
Variety   10 2166.71 216.671 22.7087 1.851e-13 ***
Residuals 40  381.65   9.541

Depending on the situation, this ANOVA on X (the covariable) may be needed to test the assumption of independence of the covariable from the treatments. The two possible scenarios:

1. The covariable (X) is measured before the treatment is applied. In this case, the assumption of independence is satisfied a priori. If the treatment has yet to be applied, there is no way it can influence the observed values of X.

2. The covariable (X) is measured after the treatment is applied. In this case, it is possible that the treatment itself influenced the covariable (X), so one must test directly for independence. Here the null hypothesis is that there are no significant differences among treatments for the covariable (i.e. Variety in this model should be NS).

Here we are in Situation #2. The "treatment" is the lima bean variety itself, so it was certainly "applied" before the % dry matter was measured. H0 is soundly rejected (p < 0.0001), so we should be very cautious when drawing conclusions.

And how might one exercise caution, exactly? The danger here is that the detected effect of Variety on ascorbic acid content is really just the effect of differences in maturity. If the ANOVA is significant and the ANCOVA is not, you can conclude that differences in maturity are the driving force behind the ascorbic acid differences. If, on the other hand, your ANOVA is not significant and your ANCOVA is, you can conclude that the covariable successfully eliminated an unwanted source of variability that was obscuring your significant differences among Varieties.

3. The ANOVA with Y as the response variable

anovay_mod <- lm(Y ~ Block + Variety, lima_dat)
anova(anovay_mod)

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
Block      4   4969  1242.2  8.3549 5.363e-05 ***
Variety   10  51018  5101.8 34.3135 < 2.2e-16 ***
Residuals 40   5947   148.7

Multiple R-squared: 0.904

Looking at this ANOVA on Y (the response variable), there appears to be a significant effect of Variety on ascorbic acid content; but we will wait to draw our conclusions until we see the results of the ANCOVA (see the comments at the end of the previous section). Incidentally, the Variety and Residual SS from the ANOVA on X, together with the MSE from this ANOVA on Y (148.7), will be used later to calculate the exact relative precision of the ANCOVA compared to the ANOVA.

4. Testing for Homogeneity of Slopes

slopes_mod <- lm(Y ~ Block + Variety + X + Variety:X, lima_dat)
anova(slopes_mod)

Response: Y
           Df Sum Sq Mean Sq  F value    Pr(>F)
Block       4   4969  1242.2  27.3019 1.830e-09 ***
Variety    10  51018  5101.8 112.1279 < 2.2e-16 ***
X           1   3744  3743.8  82.2824 5.746e-10 ***
Variety:X  10    884    88.4   1.9428   0.07957 .
Residuals  29   1319    45.5

With this model, we are checking the homogeneity of slopes across treatments, assuming that the block effects are additive and do not affect the regression slopes. The interaction test is NS at the 5% level (p = 0.0796), so the assumption of slope homogeneity across treatments is not violated. Good. This interaction must be NS in order for us to carry out the ANCOVA.
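As a side note, the same interaction test can be obtained by directly comparing the common-slope model (fit as ancova_mod in the next step) against the separate-slopes model above. A minimal sketch, assuming both models have been fit:

# Reduced (common slope) vs. full (separate slopes) model
ancova_mod <- lm(Y ~ Block + Variety + X, lima_dat)

# The F test on the 10 extra slope parameters reproduces the
# Variety:X line above (F = 1.94, p = 0.0796)
anova(ancova_mod, slopes_mod)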

5. The ANCOVA

#library(car)
ancova_mod <- lm(Y ~ Block + Variety + X, lima_dat)
Anova(ancova_mod, type = 2)

Anova Table (Type II tests)

Response: Y
          Sum Sq Df F value    Pr(>F)
Block      756.4  4  3.3469   0.01895 *
Variety   7457.6 10 13.1996 1.110e-09 ***
X         3743.8  1 66.2642 6.157e-10 ***
Residuals 2203.5 39

Multiple R-squared: 0.9644
MSE = 2203.5 / 39 = 56.5

Here we find significant differences in ascorbic acid content among varieties besides those which correspond to differences in percent dry matter at harvest. A couple of things to notice:

1. The model df has increased from 14 to 15 because X is now included as a regression variable. Correspondingly, the error df has dropped by 1 (from 40 to 39).

2. Adding the covariable to the model increased R-squared from 90.4% to 96.4%, illustrating the fact that inclusion of the covariable helps to explain more of the observed variation, increasing our precision.

3. Because we are now adjusting the Block and Variety effects by the covariable regression, the SS of the individual model effects no longer add up to the Model SS:

756.4 (Block SS) + 7457.6 (Variety SS) + 3743.8 (X SS) = 11957.8 != 59731.0 (Model SS)

When we allow the regression on % dry matter at harvest to account for some of the observed variation in ascorbic acid content, we find that less variation is uniquely attributable to Blocks and Varieties than before. The covariable SS comes straight out of the ANOVA Error SS (5947.30 - 2203.45 = 3743.85).
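This bookkeeping can be verified directly in R; a minimal sketch, using the objects fit above:

# Type II SS of the individual effects (Block, Variety, X)
ss_effects <- Anova(ancova_mod, type = 2)[["Sum Sq"]][1:3]
sum(ss_effects)                            # 11957.8

# Model SS = Total SS - Error SS
total_ss <- sum((lima_dat$Y - mean(lima_dat$Y))^2)
error_ss <- sum(residuals(ancova_mod)^2)   # 2203.5
total_ss - error_ss                        # approx. 59731; the pieces do not add up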

As with the ANOVAs before, the MSE from the ANCOVA above (MSE_ANCOVA = 56.5) is used in the calculation of relative precision, which can now be carried out:

Effective MSE = MSE_Y(adjusted) * [1 + SS_Treatment(X) / ((t - 1) * SS_Error(X))]
              = 56.5 * [1 + 2166.71 / (10 * 381.65)]
              = 88.576278

Relative Precision = MSE_Y(unadjusted) / Effective MSE_Y(adjusted)
                   = 148.7 / 88.576278
                   = 1.68

In this particular case, we see that each replication with the covariable is as effective as 1.68 replications without; so the ANCOVA is 1.68 times more precise than the ANOVA. ['Precise' in this sense refers to our ability to detect the real effect of Variety on ascorbic acid content (i.e. our ability to isolate the effects of Variety exclusive of the effects of dry matter content at harvest).]
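The same arithmetic can be scripted, pulling the needed quantities from the models fit earlier; a minimal sketch (object names are from the previous steps):

# From the ANOVA on X: treatment SS and error SS
ssT_x <- anova(anovax_mod)["Variety",   "Sum Sq"]   # 2166.71
ssE_x <- anova(anovax_mod)["Residuals", "Sum Sq"]   # 381.65
t     <- nlevels(lima_dat$Variety)                  # 11 varieties

# Adjusted and unadjusted error mean squares
mse_adj   <- sum(residuals(ancova_mod)^2) / df.residual(ancova_mod)  # 56.5
mse_unadj <- anova(anovay_mod)["Residuals", "Mean Sq"]               # 148.7

effective_mse <- mse_adj * (1 + ssT_x / ((t - 1) * ssE_x))  # 88.58
mse_unadj / effective_mse                                   # 1.68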

Testing ANOVA Assumptions

To test the ANOVA assumptions for this model, we rely on an important concept: An ANCOVA of the original data is equivalent to an ANOVA of regression-adjusted data. In other words, if we define a regression-adjusted response variable Z:

Z_i = Y_i - b*(X_i - Xbar)

then we will find that an ANCOVA on Y is equivalent to an ANOVA on Z. Let's try it:

6. Find b and Xbar; then create Z

summary(ancova_mod)
mean(lima_dat$X)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 193.0462    13.3420  14.469  < 2e-16 ***
...
X            -3.1320     0.3848  -8.140 6.16e-10 ***

> mean(lima_dat$X)
[1] 33.98727

Since b = -3.1320, subtracting b*(X - Xbar) amounts to an addition:

lima_dat$Z <- lima_dat$Y + 3.1320*(lima_dat$X - 33.98727)

A Comparison of Means

The table below shows the adjusted Variety means generated by the lsmeans() function in the ANCOVA alongside the ordinary Variety means of Z:

lima_lsm  <- lsmeans(ancova_mod, "Variety")
adj_means <- aggregate(lima_dat$Z, list(lima_dat$Variety), mean)
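Rather than hard-coding the estimates, b and Xbar can be pulled from the fitted model directly; a minimal sketch:

# Extract the common slope and the covariable mean from the fit
b    <- coef(ancova_mod)["X"]   # -3.1320
xbar <- mean(lima_dat$X)        # 33.98727

# Regression-adjusted response: Z = Y - b*(X - Xbar)
lima_dat$Z <- lima_dat$Y - b * (lima_dat$X - xbar)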

Variety   ANCOVA (Y) LSMeans   ANOVA (Z) Means
   1           92.587327          92.587327
   2           79.116425          79.116425
   3           78.103108          78.103108
   4           84.530117          84.530117
   5           95.983054          95.983054
   6           97.506844          97.506844
   7           99.978678          99.978678
   8           72.044748          72.044748
   9           81.146722          81.146722
  10          122.783843         122.783843
  11           74.319134          74.319134

Pretty impressive. This is striking confirmation that our imposed adjustment (Z) accounts for the covariable just as the ANCOVA does.

A Comparison of Analyses

Compare the results of this ANCOVA (Y) table:

Anova Table (Type II tests)

Response: Y
          Sum Sq Df F value    Pr(>F)
Block      756.4  4  3.3469   0.01895 *
Variety   7457.6 10 13.1996 1.110e-09 ***
X         3743.8  1 66.2642 6.157e-10 ***
Residuals 2203.5 39

with those of this ANOVA (Z) table:

Response: Z
          Df  Sum Sq Mean Sq F value    Pr(>F)
Block      4   768.3  192.08  3.4869   0.01556 *
Variety   10 10984.6 1098.46 19.9406 1.528e-12 ***
Residuals 40  2203.5   55.09

A couple of things to notice:

a. The Error SS is exactly the same in both analyses (2203.5), indicating that their explanatory powers are the same.
b. The F values have shifted, but this is understandable, due in part to the shifts in df.
c. The p-values are very similar in both analyses, giving us the same levels of significance as before.

The reason it is important to show the general equivalence of these two approaches is that, now that we have reduced the design to a simple RCBD, testing the ANOVA assumptions is straightforward.
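A side-by-side table like the one above can be assembled in R; a minimal sketch (the lsmean column name follows the usual lsmeans summary output, and the objects come from the previous step):

# Adjusted means from the ANCOVA next to ordinary means of Z
lsm_tab <- summary(lima_lsm)
data.frame(Variety       = lsm_tab$Variety,
           ancova_lsmean = lsm_tab$lsmean,
           anova_Z_mean  = adj_means$x)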

7. Normality of Residuals

anovaz_mod <- lm(Z ~ Block + Variety, lima_dat)
lima_dat$anovaz_resids <- residuals(anovaz_mod)
shapiro.test(lima_dat$anovaz_resids)

Shapiro-Wilk normality test
data:  lima_dat$anovaz_resids
W = 0.9709, p-value = 0.2016   NS

Our old friend Shapiro-Wilk is looking good.

8. Homogeneity of Variety Variances

#library(car)
leveneTest(Z ~ Variety, data = lima_dat)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group 10   0.677 0.7395   NS

Lovely, that.

9. Tukey Test for Nonadditivity

lima_dat$anovaz_preds <- predict(anovaz_mod)
lima_dat$sq_anovaz_preds <- lima_dat$anovaz_preds^2
tukeyz_mod <- lm(Z ~ Block + Variety + sq_anovaz_preds, lima_dat)
anova(tukeyz_mod)

Response: Z
                 Df  Sum Sq Mean Sq F value    Pr(>F)
Block             4   768.3  192.08  3.4299   0.01701 *
Variety          10 10984.6 1098.46 19.6147 2.977e-12 ***
sq_anovaz_preds   1    19.4   19.39  0.3462   0.55967   NS
Residuals        39  2184.1   56.00

And there we go. Easy.

The Take-Home Message

To test the assumptions of a covariable model, adjust the original data by the regression and proceed with a regular ANOVA. In fact, adjusting the original data by the regression and proceeding with an ANOVA is a valid approach to the overall analysis. [Note, however, that one first needs to demonstrate homogeneity of slopes to be able to use a common slope to adjust all observations.]

10. Tukey Separation of Adjusted Means

As a final flourish, now that we have seen that the data meet all the assumptions of ANCOVA, we can dive into a means separation of the Varieties. To do this, remember that it is necessary to use the contrast() function from the lsmeans library!

#library(lsmeans)
lima_lsm <- lsmeans(ancova_mod, "Variety")
contrast(lima_lsm, method = "pairwise", adjust = "tukey")

This little bit of code produces the following beast of a table, which must then be combed through and organized into a standard ranked means separation table (one way to automate that combing is sketched below).

contrast   estimate       SE df t.ratio p.value
1 - 2     13.470902 6.718773 39   2.005  0.6467
1 - 3     14.484219 4.775527 39   3.033  0.1212
1 - 4      8.057210 4.992092 39   1.614  0.8668
...
9 - 10   -41.637122 5.872321 39  -7.090  <.0001
9 - 11     6.827588 4.761519 39   1.434  0.9318
10 - 11   48.464709 6.034373 39   8.031  <.0001

Results are averaged over the levels of: Block
P value adjustment: tukey method for a family of 11 means
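For the combing itself, the cld() function can collapse the pairwise contrasts into a compact letter display, which is essentially the ranked means separation table. A minimal sketch; this assumes the multcompView package is installed, since lsmeans relies on it for cld():

#library(multcompView)
# Varieties sharing a grouping letter are not significantly
# different (Tukey-adjusted, alpha = 0.05)
cld(lima_lsm, adjust = "tukey", Letters = letters)

Last lab -- congratulations!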