VARIANCE COMPONENT ANALYSIS

T. KRISHNAN
Cranes Software International Limited
Mahatma Gandhi Road, Bangalore - 560 001
krishnan.t@systat.com

1. Introduction

In an experiment to compare the yields of two varieties of wheat, 10 farms participated, and in each farm both varieties were grown. All the 20 plots in the experiment were of equal area. The data on the yield in quintals are given below:

Farm No.   Variety A   Variety B
 1         10.4        10.1
 2         10.6        10.8
 3         10.2        10.2
 4         10.1        09.9
 5         10.3        11.0
 6         10.7        10.5
 7         10.3        10.2
 8         10.9        10.9
 9         10.1        10.4
10         09.8        09.9

Note that the yields of Variety A and Variety B are correlated, because the conditions for both varieties in a given farm would be the same. A standard method of analyzing this kind of data is the paired t-test. Let x_i be the yield of Variety A for the i-th farm, and let y_i be that of Variety B. The paired t-test computes the differences z_i = x_i - y_i and checks whether their mean z-bar is far from 0, using the t distribution with 9 degrees of freedom. Let us perform this test. The results are:

Hypothesis Testing: Paired t-test
Paired Samples t-test on Variety A vs Variety B with 10 Cases
Alternative = 'not equal'

Sample No   Variety A   Variety B   Mean Difference   95% CI            SD of difference   t        df   p-value
1           10.340      10.390      -0.050            -0.261 to 0.161   0.295              -0.535   9    0.605
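The paired analysis is simple to reproduce. Here is a minimal sketch in Python using SciPy (an illustration, not the SYSTAT session shown above):

    import numpy as np
    from scipy import stats

    # Yields in quintals for the 10 farms, from the table above
    variety_a = np.array([10.4, 10.6, 10.2, 10.1, 10.3, 10.7, 10.3, 10.9, 10.1, 9.8])
    variety_b = np.array([10.1, 10.8, 10.2, 9.9, 11.0, 10.5, 10.2, 10.9, 10.4, 9.9])

    # Paired t-test: is the mean of the differences z_i = x_i - y_i far from 0?
    # The reference distribution is t with n - 1 = 9 degrees of freedom.
    t_stat, p_value = stats.ttest_rel(variety_a, variety_b)
    diffs = variety_a - variety_b
    print(f"mean difference = {diffs.mean():.3f}, SD = {diffs.std(ddof=1):.3f}")
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # t = -0.535, p = 0.605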

This test assumes that the z_i's are independently and identically distributed normal random variables, which is the case if, for example, each (x_i, y_i) pair is independently distributed as bivariate normal N(µ_1, µ_2, Σ), where Σ is the covariance matrix. However, if we consider the data set for a moment, we can see that Σ cannot be just any covariance matrix. It is highly likely that the yields of the two varieties in the same farm will be positively correlated. Popular as it is, the paired t-test nonetheless fails to take this extra information about the data into account. It collapses the pairs (x_i, y_i) into the differences z_i, and thus fails to utilize the correlation structure of the original data. One way to remedy this loss of information is to assume that each measurement is made up of three components:

The effect of the farm: It is customary to express the effect of the i-th farm as µ + α_i, where µ is called the mean effect, denoting the average level of yield over all farms, while α_i denotes the departure of the i-th farm from this average.

The effect of the variety (this is where our interest lies): We shall denote the effect of the j-th variety by β_j, for j = 1, 2.

Random error, which we call ε_ij.

So we have the model

Yield = Overall effect + Farm effect + Variety effect + Random error,

or in notation,

y_ij = µ + α_i + β_j + ε_ij, where i = 1, ..., 10 and j = 1, 2.

Here y_ij is the yield of the j-th variety in the i-th farm. Thus, we have renamed x_i as y_i1 and y_i as y_i2. In this sort of situation, the focus of the analysis is to determine which variety is better (greater yield) and by how much, over the collection of farms for which this exercise is being done. We are not particularly interested in these 10 farms, but since experiments require experimental units (farms), farms inevitably come into the picture. The interest in these farms is only in so far as they represent the population of farms. In that spirit, we consider these farms as a random sample from a population of farms. Thus the farm effects α_i are considered random, and their variation, which affects the comparison of varieties, is of interest. The effects µ and β_j are fixed as before. A linear model where some (or all) of the parameters are random is called a linear mixed model. Here the α_i's are called random effects, while µ and the β_j's are called fixed effects. We assume that the α_i's and ε_ij's are independent Gaussian (normal) random variables with zero mean.

In this example we shall assume that the α_i's are distributed independently as N(0, σ_a²), while the ε_ij's have independent N(0, σ_e²) distributions. It is easy to check that the correlation between the yields of the two varieties for the same farm is indeed positive under this model, since Cov(y_i1, y_i2) = Var(α_i) = σ_a² > 0. The model that we have formulated here is called a Variance Component (VC) Model, because the variance of each observation is the sum of two variances. One way to carry out a VC analysis of this data set is to consider the data as a two-way classification of Farm x Variety and obtain the ANOVA table below. Notice that the F-test for Variety coincides with the paired t-test earlier (F = t²) and the p-values are the same. In the VC model, Var(y_ij) = σ_a² + σ_e² for all i, j, and Cov(y_i1, y_i2) = σ_a² as noted earlier, so Var(y_i1 - y_i2) = σ_a² + σ_e² + σ_a² + σ_e² - 2σ_a² = 2σ_e². The paired t-test uses this as the unknown variance of the z's; since it is unknown, the coefficient 2 does not matter. So what have we gained by the VC model in relation to the paired t-test? In the paired t-test output we have an estimate of 2σ_e² as (0.295)² ≈ 0.087, while here we have directly an estimate of the variance component σ_e² as 0.044. Moreover, the VC analysis gives an estimate of σ_a², which was lost in the paired analysis because we analyzed only the differences. This estimate is also a useful quantity, the variation from farm to farm. We discuss this further in the sequel.

Analysis of Variance table

Source    SS      Numerator df   Denominator df   Mean Square   F-ratio   p-value
Variety   0.013   1              9                0.013         0.287     0.605
Farm      2.000   9              9                0.222         5.097     0.012
Error     0.393   9                               0.044

Source   Variance Component   SE      Z       p-value   Lower 95%   Upper 95%
Farm     0.089                0.053   1.673   0.094     -0.015      0.158
Error    0.044                0.021   2.121   0.034     0.003       0.061
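In this balanced design the variance components can be recovered from the ANOVA mean squares by the method of moments: E(MS Error) = σ_e² and E(MS Farm) = σ_e² + 2σ_a² (each farm mean averages 2 plots), so σ̂_e² = MS Error and σ̂_a² = (MS Farm - MS Error)/2. A minimal sketch of this computation in Python (the variable names are mine):

    import numpy as np

    # Rows are farms, columns are varieties A and B (table in Section 1)
    yields = np.array([[10.4, 10.1], [10.6, 10.8], [10.2, 10.2], [10.1, 9.9],
                       [10.3, 11.0], [10.7, 10.5], [10.3, 10.2], [10.9, 10.9],
                       [10.1, 10.4], [9.8, 9.9]])
    n_farms, n_varieties = yields.shape  # 10, 2
    grand = yields.mean()

    # Two-way ANOVA sums of squares (balanced, one observation per cell)
    ss_variety = n_farms * ((yields.mean(axis=0) - grand) ** 2).sum()
    ss_farm = n_varieties * ((yields.mean(axis=1) - grand) ** 2).sum()
    ss_error = ((yields - grand) ** 2).sum() - ss_farm - ss_variety

    ms_farm = ss_farm / (n_farms - 1)                          # df = 9
    ms_error = ss_error / ((n_farms - 1) * (n_varieties - 1))  # df = 9

    # Method-of-moments estimates of the variance components
    sigma_e2 = ms_error
    sigma_a2 = (ms_farm - ms_error) / n_varieties
    print(f"sigma_e^2 = {sigma_e2:.3f}, sigma_a^2 = {sigma_a2:.3f}")  # 0.044, 0.089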

2. Fixed Effects versus Random Effects

When there are random effects in the data, the randomness in the data is split up into two parts: random effects and random error. We always assume that the random errors and random effects are independent and are Gaussian with zero mean. The random effects need not be independent among themselves. The random errors may also be interdependent. Owing to the presence of the random effects, the original observations are also correlated. Different covariance structures are used for the random effects as well as the random errors. But first let us see why one would want to consider an effect in a linear model as random. Consider the following data set on the yield of wheat pertaining to two varieties and three farms. Each farm uses each variety on four plots. The data set is given below.

Comparing varieties of wheat (Yield)

FARM   VARIETY 1         VARIETY 2
1      67, 73, 59, 84    75, 61, 67, 58
2      92, 84, 94, 83    54, 78, 61, 70
3      74, 72, 76, 64    42, 44, 80, 83

Let y_ijk denote the yield of the k-th plot in the i-th farm using the j-th variety. Then y_ijk is the resultant of the i-th farm effect as well as the j-th variety effect. We shall assume that the plots are all more or less identical. So we have the linear model

y_ijk = µ + α_i + β_j + ε_ijk.

Here µ is the mean effect, α_i is the i-th farm effect, and β_j is the effect of the j-th variety. The ε's, as usual, denote the random errors. Now let us pause for a moment and ask why one would really collect and analyze a data set of this kind. In other words, what type of inference do we want to make? There are two possible answers to this. First, we may be interested in knowing how these three farms perform using the two varieties. This question is of interest to, for instance, the owner of the farms when he/she wants to decide which variety to grow. Here he/she has a specific set of farms in mind. Second, an agronomist may want to compare the two varieties irrespective of the farms. He does not have any specific set of farms in mind. He is comparing the performance of Variety 1 as applied by some randomly selected farm with the performance of Variety 2 applied by another (possibly different) randomly selected farm. In the first case all the effects are fixed. In the second case, the farm effects α_i are random. Let us analyze the data set under both models to see how the inference differs. First, the fixed effects model. The results are below. They give:

1. an analysis of variance table, where it is seen that farm differences are not significant and the variety difference is significant;

2. estimates of the differences between farms 1 and 3 and between farms 2 and 3, with standard errors and tests of significance, along with confidence intervals of the differences;

3. the estimate of the difference between varieties 1 and 2, with standard error and test of significance, along with a confidence interval of the difference.

Analysis of Variance

Source    SS         DF   MS        F       p-value
Farm      492.750    2    246.375   1.751   0.199
Variety   925.042    1    925.042   6.575   0.0185
Error     2813.833   20   140.692

Estimates of fixed effects

Effect      Level   Estimate   SE      df   t        p-value
Intercept           60.667     4.842   20   12.528   0.000
Farm        1       1.125      5.931   20   0.190    0.851
Farm        2       10.125     5.931   20   1.707    0.103
Farm        3       0.000      0.000   .    .        .
Variety     1       12.417     4.842   20   2.564    0.0185
Variety     2       0.000      0.000   .    .        .

CI's of fixed effects estimates

Effect      Level   Estimate   Lower 95%   Upper 95%
Intercept           60.667     50.566      70.768
Farm        1       1.125      -11.246     13.496
Farm        2       10.125     -2.246      22.496
Farm        3       0.000      .           .
Variety     1       12.417     2.316       22.518
Variety     2       0.000      .           .
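The fixed effects analysis is ordinary least squares, so it can be reproduced with any regression routine. A sketch in Python using statsmodels (an illustration under my naming, not the original SYSTAT session; note that statsmodels drops the first level of each factor by default, while the output above takes Farm 3 and Variety 2 as the reference, so the coefficients are parameterized differently even though the F-tests and fitted values agree):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # The 3 farms x 2 varieties data set, four plots per cell
    cells = {
        (1, 1): [67, 73, 59, 84], (1, 2): [75, 61, 67, 58],
        (2, 1): [92, 84, 94, 83], (2, 2): [54, 78, 61, 70],
        (3, 1): [74, 72, 76, 64], (3, 2): [42, 44, 80, 83],
    }
    df = pd.DataFrame([(farm, variety, y)
                       for (farm, variety), ys in cells.items() for y in ys],
                      columns=["farm", "variety", "yield_"])

    # Fixed effects model: y_ijk = mu + alpha_i + beta_j + e_ijk
    fit = smf.ols("yield_ ~ C(farm) + C(variety)", data=df).fit()
    print(sm.stats.anova_lm(fit, typ=2))  # Farm F = 1.751, Variety F = 6.575
    print(fit.summary())                  # estimates, SEs, t-tests, CIs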

If we use a model where the farm effect is random, the analysis is the same, although the interpretations of the mean squares are different in the sense that the variance σ_a² of the farm effect will be involved. Otherwise the conclusions are the same. Now let us introduce an interaction term in the model as follows:

Yield = Overall effect + Farm effect + Variety effect + Interaction effect + Random error,

y_ijk = µ + α_i + β_j + γ_ij + ε_ijk,

and consider both the farm and interaction effects to be random. Then the situation becomes quite different. The ANOVA table is

Analysis of Variance

Source         SS         DF   MS        F       p-value
Farm           492.750    2    246.375   1.778   0.1970
Variety        925.042    1    925.042   5.798   0.0138
Farm*Variety   319.083    2    159.542   1.151   0.3380
Error          2494.750   18   138.597

This ANOVA table is rather different from the earlier one without interaction, because of the additional interaction term with 2 DF, which was part of the error term earlier. If you have more than one observation per cell (combination of farm and variety), it is possible to include the interaction term in the model and analyze it. Now, when the interaction term is present and is a random effect, the variance due to Variety, estimated by the Variety MS, has this interaction variance as a part. The Variety and Interaction (Farm*Variety) mean squares differ only in an extra term in Variety due to the Variety effect, in terms of differences among the β's. Hence, under the hypothesis of no Variety effect, this term becomes zero, so the correct denominator to test Variety is the Farm*Variety MS and not the Error MS. Hence the Variety F(1,2)-ratio and the p-value are different. The p-value has gone down.
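With the interaction random, the Variety F-ratio is formed from the mean squares directly, and the full mixed model can be fitted by REML. A sketch in Python with statsmodels, reusing the data frame df built in the previous fragment (the vc_formula specification is my assumption about one way to encode a random farm-by-variety interaction, not the original SYSTAT call):

    import statsmodels.formula.api as smf

    # Correct F-ratio for Variety: Variety MS over Farm*Variety MS, on (1, 2) df
    ms_variety, ms_interaction = 925.042, 159.542  # from the ANOVA table above
    print(f"F(1, 2) = {ms_variety / ms_interaction:.3f}")  # 5.798

    # REML fit with random farm intercepts and a random farm-by-variety
    # interaction; may warn about convergence on a data set this small
    mixed = smf.mixedlm("yield_ ~ C(variety)", df, groups="farm",
                        vc_formula={"farm_variety": "0 + C(variety)"}).fit(reml=True)
    print(mixed.summary())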

This means that the varieties appear more significantly different when used over a population of farms than when used on just a specific set of farms. It could have been the other way around also. Then the interpretation would be as follows: the significant difference in the fixed effects model implies that if the same farm uses both varieties, then the results are different; the lack of significance in the mixed effects model means that a random farm using one variety has more or less the same performance as a (possibly different) random farm using the other variety. This is the case if, for instance, there is a lot of variability among the farms, and the difference between the varieties is swamped by it. (That is not the case here, though.) A bad farm with a good variety may not perform much differently from a good farm with a bad variety.

3. Why use Random Effects?

A linear model, just like any other statistical model, tries to capture the essence of the process generating the data, rather than that of the data themselves. We want our inference to hold not only for the given data set but also for future replications of the same experiment. So the choice of the model is dictated by what type of replications we have in mind. Depending upon this, there are different reasons for treating an effect as random in a model. Here we outline three common situations.

If we plan to use the same levels of an effect in all fresh replications, then we may treat the effect as fixed. However, if we plan to use fresh levels of the effect in different replications, then we should make the effect random. Inference based on random effects models is valid for a population of all possible levels of the random effects. The example above furnished an illustration. In such situations, the random coefficients are all independently and identically distributed, as they represent randomly selected levels of the effect. So the resulting model is a variance components model.

In some cases, an effect may be considered random even if we plan to use the same levels for all future replications. Consider, for instance, a designed experiment where three operators in a farm are operating two tractors, the response being a score that combines the quality and quantity of the yield produced in a given season. A suitable model for this situation may be

y_ijk = µ + α_i + β_j + γ_ij + ε_ijk,

where y_ijk is the score for the k-th run of the i-th tractor operated by the j-th operator. If the farm has only these three operators to operate the tractors, then the farm authorities would have to always choose the same three operators in all future replications of the experiment. However, the same operator may behave slightly differently from one replication of the experiment to the next, depending on unpredictable factors like his mood. In this case, we would be justified in considering the operator effect as random. However, since the mood variability of the different operators may be different, the random coefficients β_j need not be identically distributed. In fact, they may also be correlated, because the moods of all the operators may be governed by some common random condition prevailing during a replication of the experiment (e.g., the weather during the experiment) that is difficult to control. Indeed, McLean, Sanders, and Stroup (1991) also suggest a model where the operator effect is fixed but the interaction effect γ_ij is random. Such a model would be appropriate if we consider the main effect as a measure of the proficiency of the operator, which is not likely to change between replications, while the mood fluctuations may affect how an operator performs on a given tractor. Such models, where the random effect coefficients may not be independently and identically distributed, are more general than simple variance components models.

A third situation that leads to random effects is where the model is developed in a multilevel fashion. Consider a situation where we want to linearly regress a response variable y (say, yield) on a predictor variable x (water). However, we believe that the regression slope is a random effect that depends on the values of a categorical variable z (variety). Then we have a two-level model. In the first level we model y in terms of x:

y_ijk = α + β_j x_ijk + ε_ijk.

Here j denotes the levels of the categorical variable z. In the second level we model the (random) regression slope as

β_j = a + b_j,

where the b_j's are random effect coefficients. Putting the second-level equation into the first, we get the composite model

y_ijk = α + (a + b_j) x_ijk + ε_ijk = α + a x_ijk + b_j x_ijk + ε_ijk.

This means that x is present in the fixed part (a x_ijk) as well as in the random part (b_j x_ijk). If the deeper levels in a multi-level model have their own random errors, then they lead to random effects in the composite model.
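The composite model is what is usually called a random-slope (random coefficients) model. Since the article gives no data for this example, here is a minimal sketch in Python with statsmodels on simulated data (all names and parameter values are mine):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)

    # Simulate 6 levels of z (varieties), each with its own random slope b_j
    alpha, a, sd_b, sd_e, n_per = 5.0, 2.0, 0.5, 1.0, 30
    rows = []
    for j in range(6):
        b_j = rng.normal(0.0, sd_b)        # random deviation of the slope
        x = rng.uniform(0.0, 10.0, n_per)  # water
        y = alpha + (a + b_j) * x + rng.normal(0.0, sd_e, n_per)
        rows += [(j, xi, yi) for xi, yi in zip(x, y)]
    df = pd.DataFrame(rows, columns=["z", "x", "y"])

    # Composite model: fixed part alpha + a*x, random part b_j*x within z
    fit = smf.mixedlm("y ~ x", df, groups="z", re_formula="0 + x").fit(reml=True)
    print(fit.summary())  # coefficient on x estimates a; the x variance estimates sd_b**2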

References and Suggested Reading

Cox, D.R. and Solomon, P.J. (2002). Components of Variance. New York: Chapman & Hall/CRC.

McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991). A unified approach to mixed linear models. The American Statistician, 45, 54-64.

Milliken, G.A. and Johnson, D.E. (1992). Analysis of Messy Data, Volume I: Designed Experiments. London: Chapman and Hall.

Searle, S.R., Casella, G., and McCulloch, C.E. (1992). Variance Components. New York: John Wiley & Sons.