Module 6: Model Diagnostics

02429/Mixed Linear Models. Prepared by the statistics groups at IMM, DTU and KU-LIFE. Last modified August 23, 2011.

Contents
6.1 Introduction
6.2 Linear model diagnostics
    Which model to use?
    The model assumptions
    Residuals
    Normality investigation
    Checking for variance homogeneity
    Checking for variance homogeneity in the new model for transformed data
    Outliers
    Check for influential observations
6.3 Justification of the linear model approach
6.4 Check for random effects normality
6.5 Data transformation and back transformation
    Specific alternative distributions
    Box-Cox transformations
    Back transformation

6.1 Introduction

As should be clear by now, mixed linear models are indeed based on a number of assumptions about distributions etc. As for linear models without random effects, it is important to check those assumptions as far as possible. Checking the model assumptions in linear mixed models is in general still a matter of research, and common practical procedures are not well established. We suggest the following twofold approach:

1. Embed the model control into linear models (ANOVA/regression) without random effects by considering the model in which all random effects are treated as fixed.
2. Do a simple and rough check of the normality assumptions for the random effects.

The justification of 1. is discussed in more detail in the section on residuals below. The beech wood example from Module 3 will be used as the running example, and in the example section of the present module the analysis of the beech wood data is completed.

6.2 Linear model diagnostics

This section contains only information that could be part of any course on basic statistical analysis.

Which model to use?

As was pointed out in the analysis of the beech wood data in Module 3, one step of the overall approach is to try to simplify a (possibly complex) starting model to a (simpler) final model. This leaves us with the decision of which of these models we want to run through the model control machinery. In the beech wood example it corresponds to basing the model diagnostics on either the (fixed effect starting) model given by

Y_i = μ + α(width_i) + β(depth_i) + γ(width_i, depth_i) + δ(plank_i) + ε_i,

or the (fixed effect final) model given by

Y_i = μ + α(width_i) + β(depth_i) + δ(plank_i) + ε_i.

There is no clear answer to this question! Since the process of going from the starting model to the final model itself uses the model assumptions, one would be inclined to use the starting model: then no time is wasted on model reductions that would have to be discarded anyway if a model check in the reduced model shows that this model does not really hold. However, if large models (compared to the number of observations) are specified, the information in the data about the model assumptions can be rather weak. So as a general approach, we recommend carrying out the model control primarily in a (preliminary) reduced model, and then redoing the model reduction analysis if required.

The model assumptions

The classical assumptions for linear normal models (without random effects) are the following:

1. The model structure should capture the systematic effects in the data.
2. Normality of residuals.
3. Variance homogeneity of residuals.
4. Independence of residuals.

It is recommended always to check whether these assumptions appear to be fulfilled for the situation in question. The independence assumption may not always be easily checked, although for some data situations methods are available, e.g. for repeated measures data. We will return to this in later modules. The assumption in 1. is particularly an issue when regression terms (quantitative factors) enter the model. For the classical (x, y) linear regression model this corresponds to the assumption of linearity between x and y. Apart from the formal assumptions it is important to focus on the possibility of:

A. Outliers
B. Influential observations

Residuals

The assumptions may be investigated by constructing the predicted (expected) and residual values from the model. For the final main effects model for the beech wood data, this amounts to constructing

ŷ_i = μ̂ + α̂(width_i) + β̂(depth_i) + δ̂(plank_i)

and

ε̂_i = y_i − ŷ_i.

In fact, it turns out that the (theoretical) variance of these residuals is generally not homogeneous (even under the model assumption of homogeneous variance)! This is because the residuals are not the real error terms ε_i but only estimated versions of those. The variance becomes

Var(ε̂_i) = σ²(1 − h_i),

where σ² is the model error variance and h_i is the so-called leverage of observation i. We will not give the exact definition of the leverage here, but just point out that

the leverage is a measure (between 0 and 1) of the distance from the ith observation to the typical (mean) observation, using only the X-information of the model. In a simple regression setting the leverage grows with (x_i − x̄)². For pure ANOVA models the leverage has a less clear interpretation, and in some cases, like the example here with balanced data, the leverage is actually the same for all observations. So the effect of constructing a nice experimental design, combined with the luck of avoiding missing values, induces a situation in which no observations are more atypical/strange than others. High leverage (atypical) observations are potentially highly influential on the results of the analysis, and we do not want the conclusions we make to be based on only one or a very few observations.

To account for the differences in the variances of the residuals, we use instead the standardized residuals defined by

r_i = (y_i − ŷ_i) / (σ̂ √(1 − h_i)),

and these are given directly by SAS for us to study in various ways:

1. Normality investigation (histogram, probability/quantile plots, significance tests).
2. Plot of residuals versus predicted values.
3. Plot of residuals versus the values/levels of quantitative/qualitative factors in the model.

From now on, when we say residuals we mean the standardized residuals.

Normality investigation

In figure 6.1 (left) we see that the residuals for the example seem to be symmetrically distributed and that the normal distribution seems to fit quite well, apart maybe from a few extremely small and large values. SAS provides four different significance tests for normality: Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling (the test statistics and P-values are omitted here). We do not give the exact definitions of these tests here, nor discuss the features of each test. We note that they seem to reject the normality assumption, although not that clearly for some of the tests.
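The quantities above can be sketched numerically outside SAS. The following is a minimal numpy-based illustration on a small simulated regression data set (not the beech wood data): the leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and the standardized residuals divide each raw residual by its estimated standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])            # design matrix (intercept + slope)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                  # leverages, each between 0 and 1
y_hat = H @ y                                   # predicted values
resid = y - y_hat                               # raw residuals
sigma2_hat = resid @ resid / (n - X.shape[1])   # residual error variance estimate
std_resid = resid / np.sqrt(sigma2_hat * (1 - h))  # standardized residuals
```

The leverages always sum to the number of model parameters, and under the model the standardized residuals behave roughly like standard normal variables.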
These tests and P-values should not be given too much weight in practical data analysis. For small data sets they tend to have very little power; that is, it is generally difficult to reject the normality assumption in a small data set. But this is not the same as having proved that normality is true! For large data sets they become sensitive to even very small deviations from normality, including deviations that, due to the central limit theorem, have no effect on the tests and confidence intervals used in the analysis. Even so, the significance tests may still enter as a part of a complete model diagnostics: IF they become significant for a rather small data set, you definitely know that there is a problem, and IF they are non-significant for a large data set, you can feel fairly certain that everything is OK.

Figure 6.1: Histogram with the best normal curve and normal probability plot of residuals

The presence of some extreme observations is confirmed by the probability plot in figure 6.1 (right), where it is clear that 5 residuals are too small while 4 residuals are too large compared to what could be expected under the normality assumption. But apart from that, the distribution fits nicely to a normal distribution. So we seem to have a number of outliers; see below for a discussion of outliers.
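A normal probability plot like the one in figure 6.1 simply pairs the ordered residuals with theoretical normal quantiles. A stdlib-only sketch of the plotting coordinates, using simulated stand-in values rather than the beech wood residuals:

```python
import random
from statistics import NormalDist

random.seed(2)
resid = sorted(random.gauss(0, 1) for _ in range(50))  # stand-in for ordered residuals
n = len(resid)
nd = NormalDist()
# theoretical standard normal quantiles at plotting positions (i - 0.5)/n
q = [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
points = list(zip(q, resid))
```

Under normality, plotting the pairs gives roughly a straight line; residuals drifting away from the line at the two ends are exactly the kind of extreme observations seen in figure 6.1.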

Figure 6.2: Residuals versus predicted and factor level values

Checking for variance homogeneity

To check for variance homogeneity we plot the residuals versus the predicted values and versus the values/levels of the quantitative/qualitative factors in the model. In the first plot, of the residuals versus the predicted values, figure 6.2 (top left), it is investigated whether the variance depends on the mean ("on the size of the observations"). We actually see that there is a typical trumpet shape, indicating that the variance increases with increasing mean. One would then consider a transformation such as the log or something similar. There also seems to be some systematic deviation from zero: the left and right residuals are more typically positive and the middle ones more typically negative. This is highly disturbing, as such patterns indicate that important structures in the data were not accounted for, cf. assumption 1.

Note that since the residuals are standardized, we can view their size in the light of the standard normal distribution: approximately 5% of the residuals should fall outside the interval from −2 to 2, and only about 0.3% should fall outside the interval from −3 to 3. So for a data set with 300 observations you wouldn't expect any residuals beyond the latter interval.

The three plots of residuals versus the factor levels investigate whether there is any group dependent variance heterogeneity. For the five depth groups, figure 6.2 (bottom right), the variances look similar; for the three width groups, figure 6.2 (bottom left), there may be a tendency for the middle width to have lower variability; and for the 20 planks, figure 6.2 (top right), there do seem to be clear differences in variability. However, since the residuals versus predicted plot indicated some severe problems, we shouldn't worry too much about these other potential variance heterogeneities, since they may very well change in the process of fixing that problem. The same goes for the outlier and influence investigation. So apart from the exclusion of erroneous and explainable extreme observations, we should postpone the outlier/influence investigation until the investigation of normality and variance homogeneity is completed. Outlying observations in an inadequate model may turn out to be reasonable observations in a more proper model!

Checking for variance homogeneity in the new model for transformed data

In the example section the beech wood data are re-considered by including some possible effects that were overlooked in the first place, together with a log-transformation of the data. In the remainder of this module we consider the (standardized) residuals from the final model now obtained. Figure 6.3 is a reproduction of figure 6.1 for the new residuals. The four significance tests for normality (Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling) are now all non-significant. Histogram, probability plot and significance tests thus all support the assumption of normality. The residuals versus predicted plot, figure 6.4, now has a much nicer pattern: the trumpet shape and the systematic structure have disappeared, and none of the residuals are beyond −3 and 3.
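The ±2 and ±3 rules of thumb for standardized residuals come directly from standard normal tail probabilities, which can be checked with the standard library:

```python
from statistics import NormalDist

nd = NormalDist()                     # standard normal distribution
p_out2 = 2 * (1 - nd.cdf(2.0))        # probability outside [-2, 2], about 0.046
p_out3 = 2 * (1 - nd.cdf(3.0))        # probability outside [-3, 3], about 0.0027
n_obs = 300
expected_beyond_3 = n_obs * p_out3    # expected count beyond +/-3: less than one
```

So in a well-fitting model with 300 observations, roughly a dozen standardized residuals outside ±2 is unremarkable, while residuals beyond ±3 deserve a closer look.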
The plots of residuals versus the factor levels of depth and width indicate no problems with heterogeneity, but there may still be slight differences in residual variability among the different planks.

Figure 6.3: Histogram with the best normal curve and normal probability plot of residuals of the new model for the log-transformed data

Outliers

An outlier is intuitively defined as: an observation that deviates unusually much from its expected value. What is unusual is determined by the (estimated) probability distribution of the model. The general approach to handling outliers is the following:

1. Identify the outlying observations.
2. Check whether some of these may be due to errors, or may be explained and excluded for some external/atypical reason: maybe it turns out that a single plank was treated in some extreme way not representative of what is investigated.
3. Investigate the influence of non-explainable outliers.

In practice, we are often left with some extreme observations that we cannot exclude for any of the reasons given in 2. The only thing left is to investigate whether such extreme observations have an important influence on the results of the analysis. This is done by redoing the analysis leaving out the extreme observations and comparing with the original results. In fact, SAS can do this automatically for us. There are no indications of outliers in the new model for the log-transformed data.
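Step 3 — redoing the analysis without a suspect observation and comparing with the original results — can be sketched as a leave-one-out refit. The following numpy example uses simulated regression data (not the beech wood example) and records how much the slope estimate changes when each observation is dropped in turn:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def fit(Xm, ym):
    # least squares estimates for the given data
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

beta_full = fit(X, y)
slope_changes = []
for i in range(n):
    keep = np.arange(n) != i          # drop observation i
    beta_i = fit(X[keep], y[keep])
    slope_changes.append(beta_i[1] - beta_full[1])
slope_changes = np.array(slope_changes)
most_influential = int(np.abs(slope_changes).argmax())
```

Comparing the full-data and leave-one-out estimates shows directly how strongly each single observation drives the conclusions.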

Figure 6.4: Residuals versus predicted and factor level values for the new model for the log-transformed data

Check for influential observations

As a final step we investigate the influence of individual observations on the results of the analysis. A measure of influence is given by the change in the expected (predicted) value of the model when leaving out an observation:

f_i = (ŷ_i − ŷ_(i)) / √(σ̂²_(i) h_i),

where ŷ_(i) is the model predicted value computed without using the ith observation, and σ̂²_(i) is the residual error variance estimated without using the ith observation. These values are given directly by SAS, and are usually plotted versus observation number, see figure 6.5. As for the residuals, this is a standardized statistic, so normal distribution based cut-off values can be employed. We see that only a few observations have an absolute influence value between 2 and 3 and none above 3, so things look fine in this case. If one or more observations were extreme in this sense, we would have to

investigate in more detail what exact parts of the conclusion are influenced and in what way (by comparing model/test results with/without the observation). It could also be investigated whether such influential observations group in any particular way: it may be that the observations from an entire plank are influential, and it could then be relevant to study the effect of leaving out all the observations from this plank.

Figure 6.5: Influence statistics versus observation number

6.3 Justification of the linear model approach

The residual errors ε_i in the mixed models seen so far in this course are subject to the same assumptions as the residual errors in a purely systematic (fixed effects) linear model. In later modules on repeated measures data the focus will be on more general residual error covariance structure modelling, and specific tools for investigating this will be given. The estimated residuals could be defined within the mixed model using the predicted values (BLUPs) of the random effects, such that they are given by (using the vector-matrix notation of the theory module)

r = y − (Xβ̂ + Zû).

In general the BLUPs of the mixed model and the parameter estimates of the corresponding fixed effects model will differ: the BLUPs are shrinkage versions of the fixed effects parameters (the most extreme values among the levels of a random factor become less extreme). However, the difference is often not pronounced, and because of the more complicated model structure, the standardization of the residuals

becomes more complicated. The residuals can be obtained from PROC MIXED, but not standardized residuals of any kind. The hope is that a choice of transformation based on the structures in the residuals will also stabilize the random effects distributions, but this is not in any way guaranteed.

6.4 Check for random effects normality

Until now we have based the model diagnostics on the framework of the fixed effects model. This means that we only investigated the normality assumption for the residual error: ε_i ~ N(0, σ²). However, in the mixed model used in the example section, we assume that the effects due to planks and plank interactions are also normally distributed:

d(plank_i) ~ N(0, σ²_Plank), f(width_i, plank_i) ~ N(0, σ²_Plank×width),

g(depth_i, plank_i) ~ N(0, σ²_Plank×depth), ε_i ~ N(0, σ²).

It is not possible to investigate these normality assumptions as directly as the residual normality. However, severe lack of normality of the random effects would show up in probability plots of averages of the data corresponding to the random factors. In figure 6.6 there is no indication of any lack of normality. First of all, this investigation is only a rough, approximate approach, as the averages consist of contributions from several random effects, so IF some problems occur, we wouldn't necessarily know exactly which effect was non-normal. Secondly, in non-balanced designs or in models with quantitative factors (baselines etc.) raw averages like this will furthermore be contaminated by variation due to the fixed part of the model. Moreover, this approach only makes sense if the number of levels of a factor is not too small. And the really big flaw: IF we have problems with normality of the random effects, we wouldn't really know how to cope with it!
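The rough check suggested here — a probability plot of the level averages of a random factor — can be sketched as follows, using simulated plank-like data with 20 groups (hypothetical data, stdlib only):

```python
import random
from statistics import NormalDist, mean

random.seed(5)
n_planks, n_per = 20, 15
# simulate a constant random effect per plank plus residual error
plank_effect = {p: random.gauss(0, 1) for p in range(n_planks)}
data = {p: [plank_effect[p] + random.gauss(0, 0.5) for _ in range(n_per)]
        for p in range(n_planks)}

# ordered plank averages versus theoretical normal quantiles
averages = sorted(mean(v) for v in data.values())
nd = NormalDist()
q = [nd.inv_cdf((i - 0.5) / n_planks) for i in range(1, n_planks + 1)]
points = list(zip(q, averages))  # roughly linear if the plank effects are normal
```

As noted in the text, each average mixes several random contributions, so this is only an approximate diagnostic, not a test of any single variance component.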
6.5 Data transformation and back transformation

If the assumptions of normality and/or constant variance are not fulfilled, based on an inspection of the standardized residuals, then the problem can often be solved by transforming the response variable and then considering a mixed linear model for the transformed variable. How should one then go about choosing a transformation? Of course one could just try a given transformation and then see whether the assumptions behind the linear model appear better fulfilled after the transformation. A more constructive approach would, however, be more satisfying.

Figure 6.6: Random effects normal probability plot

With some experience one can often see from the plot of the standardized residuals against the predicted values which transformation is needed. If the picture is a fan opening to the right ("trumpet-shaped"), then typically a log-transformation or a power transformation (with a positive exponent less than one) is called for. If the fan opens to the left, an exponential transformation or a negative power transformation often helps. There are more systematic approaches to the choice of transformation, and in this section we consider two such approaches.

Specific alternative distributions

If the observations are more naturally described by a distribution other than the normal, then the variance will typically vary with the mean. For example, in the Poisson distribution the variance equals the mean. In such cases one can often successfully transform the data and then describe the transformed data by the normal distribution, which has a constant variance. This is often preferable, as the normal distribution is well understood

and many results concerning distributions of estimates and test statistics are exact. In the following table such variance-stabilizing transformations are given for some common distributions:

Distribution    Variance   Scale            Transformation
Binomial        μ(1 − μ)   interval (0, 1)  arcsin(√y)
Poisson         μ          positive         √y
Gamma           μ²         positive         log y
Inverse Gauss   μ³         positive         1/√y

Box-Cox transformations

When another distribution for the observations is not obvious, which is usually the case, one can look for a power transformation. This only works for positive data, but can be applied to all data if a suitable constant is added to all observations. The idea is, instead of using a linear model for the observations Y_1, ..., Y_N, to analyze Z_1, ..., Z_N, where

Z_i = Y_i^λ for λ ≠ 0, and Z_i = log Y_i for λ = 0,   (6.1)

using a linear model. Here λ = 1 corresponds to no transformation, λ = 1/2 to a square root transformation, and so on. So how should λ be determined? The most well known way was proposed by Box and Cox (1964) and is therefore known as the Box-Cox transformation. In order for the transformation to be continuous in λ, which is convenient for technical reasons, the Box-Cox transformation is written in the form

Z_i = (Y_i^λ − 1)/λ for λ ≠ 0, and Z_i = log Y_i for λ = 0.

In this context, however, it suffices to think of the transformation as given by (6.1). The appealing feature of this approach is that λ is considered a parameter along with the rest of the parameters in the linear model, and is therefore determined from the data. This is done using the method of maximum likelihood. The maximum likelihood estimate of λ is the value of λ that maximizes the likelihood function, or equivalently the log likelihood function, which in this case is given by

l(λ) = −(N/2) log SSe(λ) + (λ − 1) Σ_{i=1}^N log Y_i,

where SSe(λ) is the residual sum of squares of the linear model for the observations transformed by λ (using the continuous form above).
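The profile log likelihood l(λ) can be evaluated on a grid to locate λ̂. A numpy sketch for a simple regression model on simulated positive data; the data are constructed with multiplicative errors so that the log scale (λ = 0) should be roughly correct:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(1.0, 5.0, size=n)
y = np.exp(0.4 * x + rng.normal(scale=0.15, size=n))  # multiplicative errors
X = np.column_stack([np.ones(n), x])

def profile_loglik(lam):
    # Box-Cox transform in the continuous form (log at lambda = 0)
    z = np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    sse = float(((z - X @ beta) ** 2).sum())          # SSe(lambda)
    return -0.5 * n * np.log(sse) + (lam - 1.0) * np.log(y).sum()

grid = np.linspace(-1.0, 1.0, 81)
lam_hat = max(grid, key=profile_loglik)               # grid maximum likelihood estimate
```

An approximate 95% confidence interval then consists of the grid values λ satisfying 2(l(λ̂) − l(λ)) ≤ 3.84.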

Let λ̂ denote the maximum likelihood estimate of λ. The hypothesis H_0: λ = λ_0 can be tested using the test statistic 2(l(λ̂) − l(λ_0)), which is approximately χ²(1)-distributed. Large values of the test statistic are critical for the hypothesis. An approximate 100(1 − α)% confidence interval is given by the set of λ-values satisfying 2(l(λ̂) − l(λ)) ≤ χ²_α(1), where χ²_α(1) denotes the (1 − α)-quantile of the χ²(1)-distribution; in particular χ²_0.05(1) = 3.84.

Back transformation

Working with transformed data has the disadvantage that one would often prefer to present the results on the original scale rather than on the transformed scale, which means that some kind of back transformation is required. However, not all quantities are easily back transformed with meaningful interpretations for every kind of transformation. We suggest using simple back transformations of estimates/LSMEANS. If μ̂ is an estimate computed for the log transformed data, then use the inverse of the log, the exponential exp(μ̂), as an estimate on the original scale. It should be noted that this is a biased estimate of the expected value on the original scale. In fact it is an estimate of the median, but this also seems like a more natural quantity to estimate, taking into consideration that the distribution on the original scale is not symmetric. A 95% confidence interval is easily obtained by calculating it on the transformed scale and then transforming the endpoints of the interval back. Note that such an interval is not symmetric, reflecting the asymmetric distribution on the original scale.
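As a small sketch of this back transformation (hypothetical log-scale measurements; the 97.5% t-quantile for 7 degrees of freedom is taken as 2.365):

```python
import math
import statistics as st

# hypothetical measurements on the log scale
log_data = [2.1, 2.3, 1.9, 2.6, 2.2, 2.4, 2.0, 2.5]
n = len(log_data)
m = st.mean(log_data)
se = st.stdev(log_data) / math.sqrt(n)
t = 2.365                                # 97.5% t-quantile with n - 1 = 7 df
lo, hi = m - t * se, m + t * se          # 95% confidence interval on the log scale

estimate = math.exp(m)                   # back-transformed: estimates the median
ci = (math.exp(lo), math.exp(hi))        # back-transformed interval endpoints
```

Because the exponential is convex, the back-transformed interval stretches further upwards than downwards around the estimate, mirroring the right-skew on the original scale.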


More information

Regression diagnostics

Regression diagnostics Regression diagnostics Leiden University Leiden, 30 April 2018 Outline 1 Error assumptions Introduction Variance Normality 2 Residual vs error Outliers Influential observations Introduction Errors and

More information

Model Checking and Improvement

Model Checking and Improvement Model Checking and Improvement Statistics 220 Spring 2005 Copyright c 2005 by Mark E. Irwin Model Checking All models are wrong but some models are useful George E. P. Box So far we have looked at a number

More information

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter.

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. 1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. T F T F T F a) Variance estimates should always be positive, but covariance estimates can be either positive

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

3. Diagnostics and Remedial Measures

3. Diagnostics and Remedial Measures 3. Diagnostics and Remedial Measures So far, we took data (X i, Y i ) and we assumed where ɛ i iid N(0, σ 2 ), Y i = β 0 + β 1 X i + ɛ i i = 1, 2,..., n, β 0, β 1 and σ 2 are unknown parameters, X i s

More information

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models.

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. 1 Transformations 1.1 Introduction Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. (i) lack of Normality (ii) heterogeneity of variances It is important

More information

Lec 3: Model Adequacy Checking

Lec 3: Model Adequacy Checking November 16, 2011 Model validation Model validation is a very important step in the model building procedure. (one of the most overlooked) A high R 2 value does not guarantee that the model fits the data

More information

Topic 23: Diagnostics and Remedies

Topic 23: Diagnostics and Remedies Topic 23: Diagnostics and Remedies Outline Diagnostics residual checks ANOVA remedial measures Diagnostics Overview We will take the diagnostics and remedial measures that we learned for regression and

More information

The prediction of house price

The prediction of house price 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 1: August 22, 2012

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook

More information

Stat 705: Completely randomized and complete block designs

Stat 705: Completely randomized and complete block designs Stat 705: Completely randomized and complete block designs Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 16 Experimental design Our department offers

More information

1 One-way Analysis of Variance

1 One-way Analysis of Variance 1 One-way Analysis of Variance Suppose that a random sample of q individuals receives treatment T i, i = 1,,... p. Let Y ij be the response from the jth individual to be treated with the ith treatment

More information

CS 5014: Research Methods in Computer Science

CS 5014: Research Methods in Computer Science Computer Science Clifford A. Shaffer Department of Computer Science Virginia Tech Blacksburg, Virginia Fall 2010 Copyright c 2010 by Clifford A. Shaffer Computer Science Fall 2010 1 / 207 Correlation and

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Introduction to Linear regression analysis. Part 2. Model comparisons

Introduction to Linear regression analysis. Part 2. Model comparisons Introduction to Linear regression analysis Part Model comparisons 1 ANOVA for regression Total variation in Y SS Total = Variation explained by regression with X SS Regression + Residual variation SS Residual

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

STAT 4385 Topic 06: Model Diagnostics

STAT 4385 Topic 06: Model Diagnostics STAT 4385 Topic 06: Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 1/ 40 Outline Several Types of Residuals Raw, Standardized, Studentized

More information

Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim

Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim Tests for trend in more than one repairable system. Jan Terje Kvaly Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim ABSTRACT: If failure time data from several

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 29 Multivariate Linear Regression- Model

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

Chapter 7: Hypothesis testing

Chapter 7: Hypothesis testing Chapter 7: Hypothesis testing Hypothesis testing is typically done based on the cumulative hazard function. Here we ll use the Nelson-Aalen estimate of the cumulative hazard. The survival function is used

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

The ε ij (i.e. the errors or residuals) are normally distributed. This assumption has the least influence on the F test.

The ε ij (i.e. the errors or residuals) are normally distributed. This assumption has the least influence on the F test. Lecture 11 Topic 8: Data Transformations Assumptions of the Analysis of Variance 1. Independence of errors The ε ij (i.e. the errors or residuals) are statistically independent from one another. Failure

More information

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Multicollinearity Read Section 7.5 in textbook. Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Example of multicollinear

More information

Math 562 Homework 1 August 29, 2006 Dr. Ron Sahoo

Math 562 Homework 1 August 29, 2006 Dr. Ron Sahoo Math 56 Homework August 9, 006 Dr. Ron Sahoo He who labors diligently need never despair; for all things are accomplished by diligence and labor. Menander of Athens Direction: This homework worths 60 points

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

Sequence convergence, the weak T-axioms, and first countability

Sequence convergence, the weak T-axioms, and first countability Sequence convergence, the weak T-axioms, and first countability 1 Motivation Up to now we have been mentioning the notion of sequence convergence without actually defining it. So in this section we will

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Constant linear models

Constant linear models Constant linear models A constant linear model is a type of model that provides us with tools for drawing statistical inferences about means of random variables. Means of random variables are theoretical

More information

INTRODUCTION TO ANALYSIS OF VARIANCE

INTRODUCTION TO ANALYSIS OF VARIANCE CHAPTER 22 INTRODUCTION TO ANALYSIS OF VARIANCE Chapter 18 on inferences about population means illustrated two hypothesis testing situations: for one population mean and for the difference between two

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS 1a) The model is cw i = β 0 + β 1 el i + ɛ i, where cw i is the weight of the ith chick, el i the length of the egg from which it hatched, and ɛ i

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

Chapter 8 (More on Assumptions for the Simple Linear Regression)

Chapter 8 (More on Assumptions for the Simple Linear Regression) EXST3201 Chapter 8b Geaghan Fall 2005: Page 1 Chapter 8 (More on Assumptions for the Simple Linear Regression) Your textbook considers the following assumptions: Linearity This is not something I usually

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS Ravinder Malhotra and Vipul Sharma National Dairy Research Institute, Karnal-132001 The most common use of statistics in dairy science is testing

More information

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2014 1 / 13 Chapter 8: Polynomials & Interactions

More information

Statistical Methods for Astronomy

Statistical Methods for Astronomy Statistical Methods for Astronomy Probability (Lecture 1) Statistics (Lecture 2) Why do we need statistics? Useful Statistics Definitions Error Analysis Probability distributions Error Propagation Binomial

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring

More information

5.3 Three-Stage Nested Design Example

5.3 Three-Stage Nested Design Example 5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens

More information

Modeling the Covariance

Modeling the Covariance Modeling the Covariance Jamie Monogan University of Georgia February 3, 2016 Jamie Monogan (UGA) Modeling the Covariance February 3, 2016 1 / 16 Objectives By the end of this meeting, participants should

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

The Big Picture. Model Modifications. Example (cont.) Bacteria Count Example

The Big Picture. Model Modifications. Example (cont.) Bacteria Count Example The Big Picture Remedies after Model Diagnostics The Big Picture Model Modifications Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison February 6, 2007 Residual plots

More information

STATISTICS 479 Exam II (100 points)

STATISTICS 479 Exam II (100 points) Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the

More information

Loglikelihood and Confidence Intervals

Loglikelihood and Confidence Intervals Stat 504, Lecture 2 1 Loglikelihood and Confidence Intervals The loglikelihood function is defined to be the natural logarithm of the likelihood function, l(θ ; x) = log L(θ ; x). For a variety of reasons,

More information

Rank-Based Methods. Lukas Meier

Rank-Based Methods. Lukas Meier Rank-Based Methods Lukas Meier 20.01.2014 Introduction Up to now we basically always used a parametric family, like the normal distribution N (µ, σ 2 ) for modeling random data. Based on observed data

More information

Introduction to Applied Bayesian Modeling. ICPSR Day 4

Introduction to Applied Bayesian Modeling. ICPSR Day 4 Introduction to Applied Bayesian Modeling ICPSR Day 4 Simple Priors Remember Bayes Law: Where P(A) is the prior probability of A Simple prior Recall the test for disease example where we specified the

More information

Lecture 4. Checking Model Adequacy

Lecture 4. Checking Model Adequacy Lecture 4. Checking Model Adequacy Montgomery: 3-4, 15-1.1 Page 1 Model Checking and Diagnostics Model Assumptions 1 Model is correct 2 Independent observations 3 Errors normally distributed 4 Constant

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

Model Modifications. Bret Larget. Departments of Botany and of Statistics University of Wisconsin Madison. February 6, 2007

Model Modifications. Bret Larget. Departments of Botany and of Statistics University of Wisconsin Madison. February 6, 2007 Model Modifications Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison February 6, 2007 Statistics 572 (Spring 2007) Model Modifications February 6, 2007 1 / 20 The Big

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

The In-and-Out-of-Sample (IOS) Likelihood Ratio Test for Model Misspecification p.1/27

The In-and-Out-of-Sample (IOS) Likelihood Ratio Test for Model Misspecification p.1/27 The In-and-Out-of-Sample (IOS) Likelihood Ratio Test for Model Misspecification Brett Presnell Dennis Boos Department of Statistics University of Florida and Department of Statistics North Carolina State

More information

Exam details. Final Review Session. Things to Review

Exam details. Final Review Session. Things to Review Exam details Final Review Session Short answer, similar to book problems Formulae and tables will be given You CAN use a calculator Date and Time: Dec. 7, 006, 1-1:30 pm Location: Osborne Centre, Unit

More information

Everything is not normal

Everything is not normal Everything is not normal According to the dictionary, one thing is considered normal when it s in its natural state or conforms to standards set in advance. And this is its normal meaning. But, like many

More information

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE Course Title: Probability and Statistics (MATH 80) Recommended Textbook(s): Number & Type of Questions: Probability and Statistics for Engineers

More information