VARIANCE COMPONENT ANALYSIS

T. KRISHNAN
Cranes Software International Limited
Mahatma Gandhi Road, Bangalore
krishnan.t@systat.com

1. Introduction

In an experiment to compare the yields of two varieties of wheat, 10 farms participated, and in each farm both varieties were grown. All the 20 plots in the experiment were of equal area. The data on the yield in quintals are given below:

Farm No.    Variety A    Variety B
(yield values not preserved in this copy)

Note that the yields of Variety A and Variety B are correlated, because the conditions for both varieties in a given farm would be the same. A standard method to analyze this kind of data is the paired t-test. Let $x_i$ be the yield of Variety A on the $i$-th farm, and let $y_i$ be that of Variety B. The paired t-test computes the differences $z_i = x_i - y_i$ and checks whether their mean $\bar{z}$ is far from 0, using the statistic $t = \bar{z} / (s_z / \sqrt{10})$ and the t distribution with 9 degrees of freedom, where $\bar{z}$ is the mean and $s_z$ the standard deviation of the $z_i$'s. Let us perform this test. The results are:

Hypothesis Testing: Paired t-test

Paired Samples t-test on Variety A vs Variety B with 10 Cases
Alternative = 'not equal'

Sample No    Variety A    Variety B    Mean Difference    95% CI    SD of Difference    t    df    p-value
(values not preserved in this copy)
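The paired t-test computation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `paired_t` and the yields are hypothetical (the original data table is not available here), and the constant 2.262 is the two-sided 5% critical value of the t distribution with 9 degrees of freedom.

```python
import math
import statistics

# Two-sided 5% critical value of the t distribution with 9 degrees of freedom.
T_CRIT_9DF = 2.262


def paired_t(x, y):
    """Return (mean difference, SD of differences, t statistic, df)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    z = [a - b for a, b in zip(x, y)]      # z_i = x_i - y_i
    n = len(z)
    z_bar = statistics.mean(z)
    s_z = statistics.stdev(z)              # sample SD (divisor n - 1)
    t = z_bar / (s_z / math.sqrt(n))
    return z_bar, s_z, t, n - 1


# Hypothetical yields for 10 farms (illustrative only, NOT the data of the text).
variety_a = [31.0, 28.5, 35.2, 30.1, 27.8, 33.4, 29.9, 32.6, 30.7, 34.0]
variety_b = [29.8, 28.9, 33.1, 29.0, 27.1, 31.5, 30.2, 30.9, 29.4, 32.2]

z_bar, s_z, t, df = paired_t(variety_a, variety_b)
print(f"mean difference = {z_bar:.3f}, SD = {s_z:.3f}, t = {t:.3f}, df = {df}")
print("reject H0 at the 5% level:", abs(t) > T_CRIT_9DF)
```

Only the differences enter the computation, which is exactly the collapsing of pairs that the variance component model is meant to avoid.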

This test assumes that the $z_i$'s are independently and identically distributed normal random variables, which is the case if, for example, each pair $(x_i, y_i)$ is independently distributed as bivariate normal $N(\mu_1, \mu_2, \Sigma)$, where $\Sigma$ is the covariance matrix. However, a moment's consideration of the data set shows that $\Sigma$ cannot be just any covariance matrix: it is highly likely that the yields of the two varieties in the same farm are positively correlated. Popular as it is, the paired t-test fails to take this extra information about the data into account. It collapses the pairs $(x_i, y_i)$ into the differences $z_i$, and thus fails to utilize the correlation structure of the original data.

One way to remedy this loss of information is to assume that each measurement is made up of three components:

- The effect of the farm. It is customary to express the effect of the $i$-th farm as $\mu + \alpha_i$, where $\mu$ is called the mean effect, denoting the average level of yield over all farms, while $\alpha_i$ denotes the departure of the $i$-th farm from this average.
- The effect of the variety (this is where our interest lies). We shall denote the effect of the $j$-th variety by $\beta_j$ for $j = 1, 2$.
- Random error, which we call $\varepsilon_{ij}$.

So we have the model

Yield = Overall effect + Farm effect + Variety effect + Random error,

or in notation,

$y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$, where $i = 1, \ldots, 10$ and $j = 1, 2$.

Here $y_{ij}$ is the yield of the $j$-th variety in the $i$-th farm. Thus, we have renamed $x_i$ as $y_{i1}$ and $y_i$ as $y_{i2}$.

In this sort of situation, the focus of the analysis is to determine which variety is better (greater yield) and by how much, over the collection of farms for which this exercise is being done. We are not particularly interested in these 10 farms as such, but since experiments require experimental units (farms), farms inevitably come into the picture. The interest in these farms is only in so far as they represent the population of farms. In that spirit, we consider these farms as a random sample from a population of farms. Thus the farm effects $\alpha_i$ are considered random, and their variation, which affects the comparison of varieties, is of interest. The effects $\mu$ and $\beta_j$ are fixed as before. A linear model where some (or all) of the parameters are random is called a linear mixed model. Here the $\alpha_i$'s are called random effects, while $\mu$ and the $\beta_j$'s are called fixed effects. We assume that the $\alpha_i$'s and $\varepsilon_{ij}$'s are independent Gaussian (normal) random variables with zero mean. In this example we shall assume that the $\alpha_i$'s are distributed independently as $N(0, \sigma_a^2)$, while the $\varepsilon_{ij}$'s have independent $N(0, \sigma_e^2)$ distributions. It is easy to check that the correlation between the yields of the two varieties for the same farm is indeed positive under this model, since $\mathrm{Cov}(y_{i1}, y_{i2}) = \mathrm{Var}(\alpha_i) = \sigma_a^2 > 0$. The model that we have formulated here is called a Variance Component (VC) model, because the variance of each observation is the sum of two variances.

One way to carry out a VC analysis of this data set is to consider the data as a two-way classification of Farm × Variety and obtain the following ANOVA table.

Analysis of Variance table
Source     SS    Numerator df    Denominator df    Mean Square    F-ratio    p-value
Variety
Farm
Error
(values not preserved in this copy)

Source     Variance Components    SE    Z    p-value    Lower 95%    Upper 95%
Farm
Error
(values not preserved in this copy)

Notice that the F-test for Variety coincides with the paired t-test earlier ($F = t^2$) and the p-values are the same. In the VC model, $\mathrm{Var}(y_{ij}) = \sigma_a^2 + \sigma_e^2$ for all $i, j$; $\mathrm{Cov}(y_{i1}, y_{i2}) = \sigma_a^2$ as noted earlier, and so $\mathrm{Var}(y_{i1} - y_{i2}) = 2\sigma_a^2 + 2\sigma_e^2 - 2\sigma_a^2 = 2\sigma_e^2$. The paired t-test uses this as the unknown variance of the $z_i$'s; since it is unknown, the coefficient 2 does not matter.

So what have we gained by the VC model in relation to the paired t-test? In the paired t-test output we have an estimate of $2\sigma_e^2$ as $(0.95)^2$, the square of the SD of the differences, and here we have directly an estimate of the variance component $\sigma_e^2$ (the computed values are not preserved in this copy). However, the VC analysis also gives an estimate of $\sigma_a^2$, which was lost in the paired analysis because we analyzed only the differences. This estimate, the variation from farm to farm, is also a useful quantity. We discuss this further in the sequel.

2. Fixed Effects versus Random Effects

When there are random effects in the data, the randomness in the data is split into two parts: random effects and random error. We always assume that the random errors and random effects are independent and are Gaussian with zero mean. The random effects need not be independent among themselves. The random errors may also be interdependent. Owing to the presence of the random effects, the original observations are also correlated. Different covariance structures are used for the random effects as well as the random errors.

But first let us see why one would want to consider an effect in a linear model as random. Consider the following data set on yield of wheat pertaining to two varieties and three farms. Each farm uses each variety on four plots. The data set is given below.

Comparing varieties of wheat (Yield)
FARM    VARIETY 1          VARIETY 2
1       67, 73, 59, 84     75, 61, 67, 58
2       9, 84, 94, 83      54, 78, 61, --
3       --, 7, 76, 64      4, 44, 80, 83
(some entries in rows 2 and 3 are incomplete or missing in this copy; missing entries are shown as --)

Let $y_{ijk}$ denote the yield of the $k$-th plot in the $i$-th farm using the $j$-th variety. Then $y_{ijk}$ is the resultant of the $i$-th farm effect as well as the $j$-th variety effect. We shall assume that the plots are all more or less identical. So we have the linear model

$y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}$.

Here $\mu$ is the mean effect, $\alpha_i$ is the $i$-th farm effect, and $\beta_j$ is the effect of the $j$-th variety. The $\varepsilon$'s, as usual, denote the random errors.

Now let us pause for a moment and ask why one would really collect and analyze a data set of this kind. In other words, what type of inference do we want to make? There are two possible answers to this. First, we may be interested in knowing how these three farms perform using the two varieties. This question is of interest to, for instance, the owner of the farms, when he/she wants to decide which variety to grow. Here he/she has a specific set of farms in mind. Second, an agronomist may want to compare the two varieties irrespective of the farms. The agronomist does not have any specific set of farms in mind, and is comparing the performance of Variety 1 as applied by some randomly selected farm with the performance of Variety 2 as applied by another (possibly different) randomly selected farm.
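Before the two analyses are compared, the variance component computations of Section 1 can be made concrete. The sketch below computes the Farm × Variety ANOVA for one observation per cell and derives method-of-moments (ANOVA) estimates of $\sigma_e^2$ and $\sigma_a^2$ from the expected mean squares. This is an assumption about how such estimates could be obtained, not necessarily the method behind the software output quoted in the text (which also reports standard errors). The function name `vc_anova` and the yields are hypothetical.

```python
import math
import statistics


def vc_anova(y):
    """Two-way (Farm x Variety) ANOVA with one observation per cell.

    y is a list of rows, one row per farm and one column per variety.
    Returns mean squares, the Variety F-ratio and method-of-moments
    estimates of the variance components.
    """
    a, b = len(y), len(y[0])                    # number of farms, varieties
    grand = sum(map(sum, y)) / (a * b)
    farm_means = [sum(row) / b for row in y]
    var_means = [sum(row[j] for row in y) / a for j in range(b)]

    ss_farm = b * sum((m - grand) ** 2 for m in farm_means)
    ss_var = a * sum((m - grand) ** 2 for m in var_means)
    ss_tot = sum((v - grand) ** 2 for row in y for v in row)
    ss_err = ss_tot - ss_farm - ss_var

    ms_farm = ss_farm / (a - 1)
    ms_var = ss_var / (b - 1)
    ms_err = ss_err / ((a - 1) * (b - 1))
    return {
        "ms_farm": ms_farm,
        "ms_variety": ms_var,
        "ms_error": ms_err,
        "F_variety": ms_var / ms_err,
        "sigma_e2": ms_err,                     # E[MS_error] = sigma_e^2
        "sigma_a2": (ms_farm - ms_err) / b,     # E[MS_farm] = sigma_e^2 + b * sigma_a^2
    }


# Hypothetical yields (rows = farms, columns = Variety A, Variety B);
# illustrative only, NOT the data of the text.
yields = [
    [31.0, 29.8], [28.5, 28.9], [35.2, 33.1], [30.1, 29.0], [27.8, 27.1],
    [33.4, 31.5], [29.9, 30.2], [32.6, 30.9], [30.7, 29.4], [34.0, 32.2],
]
res = vc_anova(yields)

# Paired t statistic computed from the differences, for comparison with F.
z = [row[0] - row[1] for row in yields]
t = statistics.mean(z) / (statistics.stdev(z) / math.sqrt(len(z)))

print(f"F (Variety) = {res['F_variety']:.4f}, t^2 = {t ** 2:.4f}")
print(f"sigma_e^2 estimate = {res['sigma_e2']:.4f}")
print(f"sigma_a^2 estimate = {res['sigma_a2']:.4f}")
```

With two varieties the error mean square is exactly half the sample variance of the differences, which is the relation $\mathrm{Var}(y_{i1} - y_{i2}) = 2\sigma_e^2$ in sample form. The moment estimate of $\sigma_a^2$ can come out negative when the farm mean square is smaller than the error mean square; likelihood-based methods handle that case differently.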
In the first case all the effects are fixed. In the second case, the farm effects $\alpha_i$ are random. Let us analyze the data set under both models to see how the inference differs. First, the fixed effects model. The results are below. They give

1. an analysis of variance table, where it is seen that farm differences are not significant and the treatment (variety) difference is significant;
2. estimates of the differences between farms 1 and 3, and farms 2 and 3, with standard errors and tests of significance, and also confidence intervals of the differences;
3. the difference between varieties 1 and 2, with standard error and test of significance, and also a confidence interval of the difference.

Analysis of Variance
Source     SS    DF    MS    F    p-value
Farm
Variety
Error
(values not preserved in this copy)

Estimates of fixed effects
Effect       Level    Estimate    SE    df    t    p-value
Intercept
Farm
Farm
Farm
Variety
Variety
(levels and values not preserved in this copy)

CI's of fixed effects estimates
Effect       Level    Estimate    Lower 95%    Upper 95%
Intercept
Farm
Farm
Farm
Variety
Variety
(levels and values not preserved in this copy)

If we use a model where the farm effect is random, the analysis is the same, although the interpretations of the mean squares are different in the sense that the variance $\sigma_a^2$ of the farm effect will be involved. Otherwise the conclusions are the same.

Now let us introduce an interaction term in the model as follows:

Yield = Overall effect + Farm effect + Variety effect + Interaction effect + Random error,

$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$,

and consider both the farm and interaction effects to be random. Then the situation becomes quite different. The ANOVA table is

Analysis of Variance
Source          SS    DF    MS    F    p-value
Farm
Variety
Farm*Variety
Error
(values not preserved in this copy)

This ANOVA table is rather different from the earlier one without interaction, because of the additional interaction term with 2 DF, which was a part of the error term earlier. If you have more than one observation per cell (combination of farm and variety), it is possible to include the interaction term in the model and analyze it. Now when the interaction term is present and is a random effect, the variance due to Variety, as estimated by the Variety MS, has this interaction variance also as a part. The expected mean squares of Variety and of the interaction (Farm*Variety) differ only in an extra term in the Variety mean square, which reflects the Variety effect through differences among the $\beta$'s. Under the hypothesis of no Variety effect, this term becomes zero. Hence the correct denominator to test Variety is the Farm*Variety MS and not the Error MS, and so the Variety F(1, 2)-ratio and the p-value are different. The p-value has gone down. This means that the varieties appear more significantly different when used over a population of farms than when used for just a specific set of farms.

It could have been the other way around also. Then the interpretation would be as follows. The significant difference in the fixed effects model implies that if the same farm uses both the varieties then the results are different. The lack of significance in the mixed effects model means that a random farm using one variety has more or less the same performance as a (possibly different) random farm using the other variety. This is the case if, for instance, there is a lot of variability among the farms, and the difference between the varieties is swamped by it: a bad farm with a good variety may not perform much differently from a good farm with a bad variety. It is not the case here, though.

3. Why use Random Effects?

A linear model, just like any other statistical model, tries to capture the essence of the process generating the data, rather than that of the data themselves. We want our inference to hold not only for the given data set but also for future replications of the same experiment. So the choice of the model is dictated by what type of replications we have in mind. Depending upon this, there are different reasons behind treating an effect as random in a model. Here we outline three common situations.

If we plan to use the same levels of an effect in all fresh replications, then we may treat the effect as fixed. However, if we plan to use fresh levels of the effect in different replications, then we should make the effect random. Inference based on random effects models is valid for a population of all possible levels of the random effects. The example above furnished an illustration.
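The change of denominator in the Variety F-test, which is the heart of the example referred to here, can be sketched as follows. The function computes a balanced two-way ANOVA with replication and returns the Variety F-ratio against the Error MS (appropriate when all effects are fixed) and against the Farm*Variety MS (appropriate when the farm and interaction effects are random). The function name `two_way_anova` and the yields are hypothetical, and the sketch assumes a balanced layout with the same number of plots in every cell.

```python
def _mean(values):
    return sum(values) / len(values)


def two_way_anova(cells):
    """Balanced two-way ANOVA with replication.

    cells[i][j] is the list of plot yields for farm i and variety j.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell_m = [[_mean(c) for c in row] for row in cells]
    farm_m = [_mean(row) for row in cell_m]
    var_m = [_mean([cell_m[i][j] for i in range(a)]) for j in range(b)]
    grand = _mean(farm_m)

    ss_farm = b * n * sum((m - grand) ** 2 for m in farm_m)
    ss_var = a * n * sum((m - grand) ** 2 for m in var_m)
    ss_int = n * sum((cell_m[i][j] - farm_m[i] - var_m[j] + grand) ** 2
                     for i in range(a) for j in range(b))
    ss_err = sum((y - cell_m[i][j]) ** 2
                 for i in range(a) for j in range(b) for y in cells[i][j])

    ms_var = ss_var / (b - 1)
    ms_int = ss_int / ((a - 1) * (b - 1))
    ms_err = ss_err / (a * b * (n - 1))
    return {
        "ms_farm": ss_farm / (a - 1),
        "ms_variety": ms_var,
        "ms_interaction": ms_int,
        "ms_error": ms_err,
        "df_interaction": (a - 1) * (b - 1),
        "df_error": a * b * (n - 1),
        "F_variety_error": ms_var / ms_err,   # denominator: Error MS
        "F_variety_mixed": ms_var / ms_int,   # denominator: Farm*Variety MS
    }


# Hypothetical yields: 3 farms x 2 varieties x 4 plots (NOT the data of the text).
cells = [
    [[66, 71, 60, 80], [74, 62, 68, 59]],
    [[90, 85, 93, 84], [55, 77, 63, 70]],
    [[70, 75, 77, 65], [45, 47, 79, 82]],
]
res2 = two_way_anova(cells)
print(f"F for Variety against Error MS (1, {res2['df_error']} df): "
      f"{res2['F_variety_error']:.3f}")
print(f"F for Variety against Farm*Variety MS (1, {res2['df_interaction']} df): "
      f"{res2['F_variety_mixed']:.3f}")
```

The two ratios share the same numerator; only the denominator and its degrees of freedom change, which is why the conclusion about varieties can move in either direction when the farms are regarded as a random sample.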
In such situations, the random coefficients are all independently and identically distributed, as they represent randomly selected levels of the effect. So the resulting model is a variance components model.

In some cases, an effect may be considered random even if we plan to use the same levels for all future replications. Consider, for instance, a designed experiment where three operators in a farm are operating two tractors, the response being a score that combines the quality and quantity of the yield produced in a given season. A suitable model for this situation may be

$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$,

where $y_{ijk}$ is the score for the $k$-th run of the $i$-th tractor operated by the $j$-th operator. If the farm has only these three operators to operate the tractors, then the farm authorities would always have to choose the same three operators in all future replications of the experiment. However, the same operator may behave slightly differently from one replication of the experiment to the next, depending on unpredictable factors like his mood. In this case, we would be justified in considering the operator effect as random. However, since the mood variability of the different operators may be different, the random coefficients $\beta_j$ need not be identically distributed. In fact, they may also be correlated, because the moods of all the operators may be governed by some common random condition prevailing during a replication of the experiment (e.g., weather during the experiment) that is difficult to control. Indeed, McLean, Sanders, and Stroup (1991) also suggest a model where the operator effect is fixed but the interaction effect ($\gamma_{ij}$) is random. Such a model would be appropriate if we consider the main effect as a measure of the proficiency of the operator, which is not likely to change between replications. However, the mood fluctuations may affect how an operator performs on a given tractor. Such models, where the random effect coefficients may not be independently and identically distributed, are more general than simple variance components models.

A third situation that leads to random effects is where the model is developed in a multilevel fashion. Consider a situation where we want to linearly regress a response variable $y$ (say, yield) on a predictor variable $x$ (water). However, we believe that the regression slope is a random effect that depends on the values of a categorical variable $z$ (variety). Then we have a two-level model. In the first level we model $y$ in terms of $x$:

$y_{ijk} = \alpha + \beta_j x_{ijk} + \varepsilon_{ijk}$.

Here $j$ denotes the levels of the categorical variable $z$. In the second level we model the (random) regression slope as

$\beta_j = a + b_j$.

Here the $b_j$'s are random effect coefficients. Putting the second level equation in the first, we get the composite model

$y_{ijk} = \alpha + (a + b_j) x_{ijk} + \varepsilon_{ijk} = \alpha + a x_{ijk} + b_j x_{ijk} + \varepsilon_{ijk}$.

This means that here $x$ is present in the fixed part $a x_{ijk}$ as well as in the random part $b_j x_{ijk}$. If the deeper levels in a multilevel model have their own random errors, then they lead to random effects in the composite model.

References and Suggested Reading

Cox, D.R. and Solomon, P.J. (2002). Components of Variance. New York: Chapman & Hall/CRC.
McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991). A unified approach to mixed linear models. The American Statistician, 45.
Milliken, G.A. and Johnson, D.E. (1992). Analysis of Messy Data, Volume I: Designed Experiments. London: Chapman and Hall.
Searle, S.R., Casella, G., and McCulloch, C.E. (1992). Variance Components. New York: John Wiley & Sons.
