eNote 6: Model Diagnostics


Contents

6.1 Introduction
6.2 Linear model diagnostics
  6.2.1 Which model to use?
  6.2.2 The model assumptions
  6.2.3 Residuals
  6.2.4 Normality investigation
  6.2.5 Checking for variance homogeneity
  6.2.6 Checking for variance homogeneity in new model for transformed data
  6.2.7 Outliers
  6.2.8 Check for influential observations
6.3 Data transformation and back transformation
  6.3.1 Specific alternative distributions
  6.3.2 Box-Cox transformations
  6.3.3 Back transformation
6.4 Model diagnostics in the mixed linear model

  6.4.1 Check for random effects normality
6.5 Drying of beech wood data - a case study, part II
  6.5.1 Factor structure and basic model revisited
6.6 Explorative analysis on transformed data
6.7 Test of overall effects/model reduction for transformed data
6.8 Post hoc analysis and summarizing the results for the transformed data
  6.8.1 Estimates of the variance parameters
  6.8.2 Estimates of the fixed parameters
  6.8.3 Error bars in plots
  6.8.4 Comparisons of the fixed parameters
6.9 Exercises

6.1 Introduction

As should be clear by now, mixed linear models are based on a number of assumptions about distributions etc. As for linear models without random effects, it is important to check these assumptions as far as possible. For various reasons it is both convenient and useful to discuss model diagnostics in purely fixed effects models first. In mixed linear models with no further model structure on the residual part, the assumptions for the residual part are exactly the same as the assumptions for the residuals in a purely fixed effects model. Also, both theory and R software options are more easily accessible for the usual linear model than for mixed linear models.

So in many cases we can choose between two routes. We can embed part of the model control into linear models (ANOVA/regression) without random effects, by considering the model in which all random effects are treated as fixed. Or we can subsequently extract the mixed model residuals and influence measures and produce the same kind of diagnostic plots for these. So after a section on simple linear model diagnostics we turn to the specifics of the mixed linear models again.

6.2 Linear model diagnostics

This section contains only information that could be part of any course on basic statistical analysis.

6.2.1 Which model to use?

As was pointed out in the analysis of the beech wood data in Module 3, one step of the overall approach is to try to simplify a (possibly complex) starting model to a (simpler) final model. This leaves us with the decision of which of these models to run through the model control machine. In the beech wood example it corresponds to basing the model diagnostics on either the (fixed effects starting) model given by

$$Y_i = \mu + \alpha(\text{width}_i) + \beta(\text{depth}_i) + \gamma(\text{width}_i, \text{depth}_i) + \delta(\text{plank}_i) + \varepsilon_i,$$

or the (fixed effects final) model given by

$$Y_i = \mu + \alpha(\text{width}_i) + \beta(\text{depth}_i) + \delta(\text{plank}_i) + \varepsilon_i.$$

There is no clear answer to this question! Note that both of these models are purely fixed effects models, where the plank effect is for now modelled as a fixed effect. The purpose here is solely to do some model diagnostics.

Since the process of going from the starting model to the final model uses the model assumptions, one would be inclined to use the starting model: then no time is wasted on model reductions that would have to be discarded anyway if a model check in the reduced model shows that this model does not really hold. However, if large models (compared to the number of observations) are specified, the information in the data about the model assumptions can be rather weak. So as a general approach, we recommend carrying out the model control primarily in a (preliminary) reduced model, and then redoing the model reduction analysis if required.

6.2.2 The model assumptions

The classical assumptions for linear normal models (without random effects) are the following:

1. The model structure should capture the systematic effects in the data.
2. Normality of residuals
3. Variance homogeneity of residuals
4. Independence of residuals

It is recommended always to check whether these assumptions appear to be fulfilled for the situation in question. The independence assumption may not always be easily checked, although for some data situations methods are available, e.g. for repeated measures data. We will return to this in later modules. Assumption 1 is particularly an issue when regression terms (quantitative factors) enter the model. For the classical (x, y) linear regression model this corresponds to the assumption of linearity between x and y.

Apart from the formal assumptions it is important to focus on the possibility of:

A. Outliers
B. Influential observations

6.2.3 Residuals

The assumptions may be investigated by constructing the predicted (expected) and residual values from the model. For the final main effects model for the beech wood data, it would amount to constructing

$$\hat{y}_i = \hat{\mu} + \hat{\alpha}(\text{width}_i) + \hat{\beta}(\text{depth}_i) + \hat{\delta}(\text{plank}_i)$$

and

$$\hat{\varepsilon}_i = y_i - \hat{y}_i.$$

In fact, it turns out that the (theoretical) variances of these residuals are generally not homogeneous (even under the model assumption of homogeneous variance)! This is because the residuals are not the real error terms $\varepsilon_i$ but only estimated versions of those. The variance becomes

$$\operatorname{Var}(\hat{\varepsilon}_i) = \sigma^2 (1 - h_i),$$

where $\sigma^2$ is the model error variance and $h_i$ is the so-called leverage for observation i. We will not give the exact definition of the leverage here, but just point out that the leverage is a measure (between 0 and 1) of the distance from the ith observation to the typical (mean) observation, using only the X-information of the model. In a simple regression setting the leverage is, up to an additive and a multiplicative constant, equal to $(x_i - \bar{x})^2$. For pure ANOVA models the leverage has a less clear interpretation, and in some cases, like the example here with balanced data, the leverage is actually the same for all observations. So the effect of constructing a nice experimental design, combined with the luck of avoiding missing values, is a situation in which no observations are more atypical/strange than others.

High leverage (atypical) observations are potentially highly influential on the results of the analysis, and we do not want our conclusions to be based on only one or very few observations. To account for the difference in variances of the residuals, we instead use the standardized residuals defined by

$$\hat{\varepsilon}_i^{\,*} = \frac{y_i - \hat{y}_i}{\hat{\sigma}\sqrt{1 - h_i}},$$

and these are given directly by R for us to study in various ways:

1. Normality investigation (histogram, probability/quantile plots, significance tests)
2. Plot of residuals versus predicted values
3. Plot of residuals versus the values/levels of quantitative/qualitative factors in the model

From now on, when we say residuals we mean the standardized residuals. R provides easy access to some of the core diagnostic plots through the plot function, as illustrated in figure 6.1.

6.2.4 Normality investigation

In figure 6.1 (upper right) we see that the residuals for the example seem to be symmetrically distributed and that the normal distribution seems to fit quite well, apart maybe from a few extremely small and large values. It is possible, and often easy, to compute different significance tests for normality, e.g. the following:

    # Read the beech wood data and code the design variables as factors
    planks <- read.table("planks.txt", header = TRUE, sep = ",")
    planks$plank <- factor(planks$plank)
    planks$depth <- factor(planks$depth)
    planks$width <- factor(planks$width)
    # Main effects model with plank treated as a fixed effect
    lm1 <- lm(humidity ~ depth + plank + width, data = planks)
    par(mfrow = c(2, 2))
    plot(lm1, which = 1:4)
    par(mfrow = c(1, 1))

Figure 6.1: The four basic diagnostic plots by R (panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance versus observation number)

    Test                  Statistic   P value
    Shapiro-Wilk
    Kolmogorov-Smirnov
    Cramer-von Mises
    Anderson-Darling

We do not give the exact definitions of these tests here, nor discuss the features of each test. We note that they seem to reject the normality assumption, although not that clearly for some of the tests. These tests and P-values should not be given too much weight in practical data analysis. For small data sets they tend to be very weak, that is, it is generally difficult to reject the normality assumption in small data sets. But this is not the same as having proved that normality holds! For large data sets, they become sensitive to even very small deviations from normality, including deviations that, due to the central limit theorem, have no effect on the tests and confidence intervals used in the analysis. Anyway, the significance tests may still enter as part of a complete model diagnostics: IF they become significant for rather small data sets, you definitely know that there is a problem, and IF they are non-significant for a large data set, you can feel confident that everything is OK.

The presence of some extreme observations is seen in the Normal Q-Q plot in figure 6.1 (upper right), where it is clear that 5 residuals are too small while 4 residuals are too large compared to what could be expected under the normality assumption. But other than that the distribution fits nicely to a normal distribution. So we seem to have a number of outliers, see below for a discussion of outliers.

6.2.5 Checking for variance homogeneity

To check for variance homogeneity we plot the residuals versus the predicted values and versus the values/levels of quantitative/qualitative factors in the model. The former is what the default diagnostic plots give us: in the two left hand side plots of the residuals versus the predicted values, figure 6.1 (top left and bottom left), it is investigated whether the variance depends on the mean ("on the size of the observations"). We actually see a typical trumpet shape, indicating that the variance increases with increasing mean. In that case one would consider a transformation like the log or something similar. There also seems to be some systematic deviation from zero, where the left and right residuals are more typically positive and the middle ones more typically negative. This is highly disturbing, as such patterns indicate that important structures in the data were not accounted for, cf. assumption 1.
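The trumpet shape and its removal by a log-transformation can be reproduced on simulated data. The following is only a minimal sketch under stated assumptions: the data are simulated with a multiplicative error (they are not the beech wood data), and the helper `spread_ratio` is a hypothetical name introduced here for illustration. Because a straight line is fitted to an exponential trend, the raw-scale fit also shows the kind of systematic curvature mentioned above.

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
# Multiplicative error: the standard deviation of y grows with its mean
y <- exp(0.3 * x + rnorm(n, sd = 0.3))

fit_raw <- lm(y ~ x)        # untransformed response
fit_log <- lm(log(y) ~ x)   # log-transformed response

# Ratio of the residual spread in the upper half versus the lower half
# of the fitted values; a value well above 1 indicates a trumpet shape
spread_ratio <- function(fit) {
  r <- rstandard(fit)
  upper <- fitted(fit) > median(fitted(fit))
  sd(r[upper]) / sd(r[!upper])
}
ratio_raw <- spread_ratio(fit_raw)
ratio_log <- spread_ratio(fit_log)

# Residuals versus fitted values on both scales
par(mfrow = c(1, 2))
plot(fitted(fit_raw), rstandard(fit_raw), main = "Raw scale")
plot(fitted(fit_log), rstandard(fit_log), main = "Log scale")
par(mfrow = c(1, 1))
```

On the raw scale the spread ratio should be clearly above 1, while on the log scale it should be close to 1.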

Note that since the root-absolute residuals in the bottom left plot are standardized, we can view the size of these residuals in the light of the standard normal distribution: approximately 5% of the residuals should be larger than $\sqrt{2} \approx 1.4$ and only about 0.3% should be larger than $\sqrt{3} \approx 1.7$. So for a data set with 300 observations you would expect at most about one residual beyond this number.

Plots of residuals against other factors in the model we make ourselves, by extracting either the raw residuals:

    rawresiduals <- resid(lm1)

or the standardized ones:

    standresid <- rstandard(lm1)

or even the so-called studentized ones:

    studresid <- rstudent(lm1)

where the residual error variance $\hat{\sigma}^2_{(i)}$, estimated without using the ith observation, is used instead of the usual $\hat{\sigma}^2$ in the standardization of the residuals.

The three plots of residuals versus the factor levels investigate whether there is any group-dependent variance heterogeneity. For the five depth groups, figure 6.2 (bottom right), the variances look similar. For the three width groups, figure 6.2 (bottom left), there may be a tendency for the middle width to have lower variability. For the 20 planks, figure 6.2 (top right), there do seem to be clear differences in variability. However, since the residuals versus predicted plot indicated some severe problems, we should not worry too much about these other potential variance heterogeneities, since they may very well change in the process of fixing the problem.

The same goes for the outlier and influence investigation. So apart from the exclusion of erroneous and explainable extreme observations, we should postpone the outlier/influence investigation until the investigation of normality and variance homogeneity is completed. Outlying observations in an inadequate model may turn out to be reasonable observations in a more proper model!
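The relation between the three kinds of residuals can be checked numerically. The sketch below uses a small simulated data set (an assumption, not the beech wood data) and rebuilds the standardized and studentized residuals from the raw residuals, the leverages and the estimated error standard deviation, comparing them with what `rstandard` and `rstudent` return. All object names are chosen here for illustration.

```r
set.seed(2)
n <- 60
d <- data.frame(g = factor(rep(1:3, each = 20)), x = rnorm(n))
d$y <- 1 + as.numeric(d$g) + 0.5 * d$x + rnorm(n)
fit <- lm(y ~ g + x, data = d)

e <- resid(fit)            # raw residuals
h <- hatvalues(fit)        # leverages h_i
s <- summary(fit)$sigma    # estimated error standard deviation
p <- length(coef(fit))     # number of mean parameters

# Standardized residuals: e_i / (sigma_hat * sqrt(1 - h_i))
std_manual <- e / (s * sqrt(1 - h))

# Studentized residuals: sigma is re-estimated without observation i
s_loo <- sqrt((sum(e^2) - e^2 / (1 - h)) / (n - p - 1))
stud_manual <- e / (s_loo * sqrt(1 - h))

# Both differences should be zero up to rounding error
max(abs(std_manual - rstandard(fit)))
max(abs(stud_manual - rstudent(fit)))

# Shares of a standard normal sample expected beyond 2 and 3 in absolute value
share_beyond_2 <- 2 * pnorm(-2)   # about 0.046
share_beyond_3 <- 2 * pnorm(-3)   # about 0.0027
```

The last two lines give the normal-theory shares behind the reference values $\sqrt{2}$ and $\sqrt{3}$ used for the root-absolute residuals.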

    par(mfrow = c(2, 2))
    plot(studresid ~ predict(lm1))
    with(planks, plot(studresid ~ plank, col = heat.colors(20)))
    with(planks, plot(studresid ~ width, col = rainbow(3)))
    with(planks, plot(studresid ~ depth, col = rainbow(5)))
    par(mfrow = c(1, 1))

Figure 6.2: Residuals versus predicted and factor levels (panels: studresid versus predict(lm1), plank, width and depth)

6.2.6 Checking for variance homogeneity in new model for transformed data

In a section below the beech wood data is reconsidered by including some possible effects that were forgotten in the first place, together with a log-transformation of the data. In the remainder of this module we consider the (studentized) residuals from the model now obtained:

    planks$loghum <- log(planks$humidity)
    lm3 <- lm(loghum ~ depth * plank + depth * width + plank * width, data = planks)
    studresid <- rstudent(lm3)

Figure 6.3 is a reproduction of figure 6.1 based on the new residuals. The significance tests become:

    Test                  Statistic   P value
    Shapiro-Wilk
    Kolmogorov-Smirnov                >
    Cramer-von Mises                  >
    Anderson-Darling                  >

In R such tests could be produced by e.g. the nortest package:

    par(mfrow = c(2, 2))
    plot(lm3, which = 1:4)

Figure 6.3: The four basic diagnostic plots for the extended model and log-transformed data by R (panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance versus observation number)

    par(mfrow = c(2, 2))
    plot(studresid ~ predict(lm3))
    with(planks, plot(studresid ~ plank, col = heat.colors(20)))
    with(planks, plot(studresid ~ width, col = rainbow(3)))
    with(planks, plot(studresid ~ depth, col = rainbow(5)))
    par(mfrow = c(1, 1))

Figure 6.4: Residuals versus predicted and factor levels for the extended model and log-transformed data (panels: studresid versus predict(lm3), plank, width and depth)

    # Raw residuals from the extended model lm3
    # (rawresiduals was defined from lm1 earlier)
    rawresiduals <- resid(lm3)
    shapiro.test(rawresiduals)   # Shapiro-Wilk normality test
    library(nortest)
    lillie.test(rawresiduals)    # Lilliefors (Kolmogorov-Smirnov) normality test
    cvm.test(rawresiduals)       # Cramer-von Mises normality test
    ad.test(rawresiduals)        # Anderson-Darling normality test

Now histogram, probability plot and significance tests all support the assumption of normality. The residuals versus predicted plot, figure 6.4, now has a much nicer pattern.

The trumpet shape and the systematic structure have disappeared, and none of the root-absolute residuals are larger than $\sqrt{3} \approx 1.73$. The plots of residuals versus the factor levels of depth and width indicate no problems with heterogeneity, but there may still be slight differences in residual variability between the planks.

6.2.7 Outliers

An outlier is intuitively defined as an observation that deviates unusually much from its expected value. What is unusual is determined by the (estimated) probability distribution of the model. The general approach for handling outliers is the following:

1. Identify the outlying observations.
2. Check whether some of these may be due to errors, or may be explained and excluded for some external/atypical reasons: maybe it turns out that a single plank was treated in some extreme way not representative of what is investigated.
3. Investigate the influence of non-explainable outliers.

In practice, we are often left with some extreme observations that we cannot exclude for any of the reasons given in 2. The only thing left is to investigate whether such extreme observations have an important influence on the results of the analysis. This is done by redoing the analysis leaving out the extreme observations and comparing with the original results. In fact, there are statistics, easily extracted, that can do this automatically for us.

There are no indications of outliers in the new model for the log-transformed data.

6.2.8 Check for influential observations

As a final step we investigate the influence of observations on the results of the analysis. A measure of influence is given by the change in the expected (predicted) value of the model when an observation is left out:

$$f_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\hat{\sigma}_{(i)} \sqrt{h_i}},$$

where $\hat{y}_{(i)}$ is the model predicted value without using the ith observation, and $\hat{\sigma}^2_{(i)}$ is the residual error variance estimated without using the ith observation. This is called the dffit value. Another measure is the Cook's distance:

$$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \hat{\sigma}^2},$$

where p is the number of parameters in the model. Similarly, one could measure how much each individual parameter estimate in the model would change by leaving out an observation. All these values are directly extracted by the influence.measures function, illustrated below with a simple model to make the idea easier to grasp:

    lm0 <- lm(loghum ~ width, data = planks)
    infl.lm0 <- influence.measures(lm0)
    dim(infl.lm0$infmat)
    head(infl.lm0$infmat)
    ## columns: dfb.1_ dfb.wdt2 dfb.wdt3 dffit cov.r cook.d hat

The first columns give the so-called DFBETAS values - the measure of how much an observation has affected each parameter estimate. Next the DFFITS values are given. Then comes the so-called COVRATIO - a measure of the impact of each observation on the variances (and standard errors) of the parameter estimates (= regression coefficients) and their covariances. Then the Cook's distance is given, and at the end the leverage values ($h_i$). Such measures would usually be plotted versus the observation number, see the lower right corner of figure 6.3, where R plots the Cook's distance as one of its default plots.

There are no clear rules for the size of the Cook's distance - although you can find a number of very different rules of thumb out there - but look out for one or a few extreme ones. If one or more observations were extreme in this sense, we would have to investigate in more detail exactly which parts of the conclusion are influenced and in what way (by comparing model/test results with/without the observation). It could also be investigated whether such influential observations group in any particular way - it may be that the observations from an entire plank are influential, and it could then be relevant to study the effect of leaving out all the observations from this plank.

6.3 Data transformation and back transformation

If the assumptions of normality and/or constant variance are not fulfilled, based on an inspection of the standardized residuals, then the problem can often be solved by transforming the response variable and then considering a mixed linear model for the transformed variable. How should one then go about choosing a transformation? Of course one could just try a given transformation and then see whether the assumptions behind the linear model look better fulfilled after the transformation. It would however be more satisfying to have a more constructive approach.

With some experience one can often see from the plot of the standardized residuals against the predicted values which transformation is needed. If the picture is a fan opening to the right ("trumpet-shaped"), then typically a log-transformation or a power transformation with an exponent below 1 (such as the square root) is called for. If the fan opens to the left, an exponential transformation or a power transformation with an exponent larger than 1 often helps.
There are some more systematic approaches to the choice of transformation, and in this section we will consider two such approaches.

6.3.1 Specific alternative distributions

If the observations are more naturally described by a distribution other than the normal, then the variance may vary with the mean. For example, in the Poisson distribution the variance equals the mean. In such cases one can often successfully transform the data and then describe the transformed data by the normal distribution, which has a constant variance. This is often preferable, as the normal distribution is well understood and many results concerning distributions of estimates and test statistics are exact. In the following table such transformations are given for some common distributions:

    Distribution    Variance     Scale            Transformation
    Binomial        µ(1 − µ)     interval (0,1)   arcsin(√y)
    Poisson         µ            positive         √y
    Gamma           µ²           positive         log(y)
    Inverse Gauss   µ³           positive         1/√y

6.3.2 Box-Cox transformations

When another distribution for the observations is not obvious, which is usually the case, one could try to look for a power transformation. This only works for positive data, but it can be applied to all data if a suitable constant is added to all observations to make them positive. The idea is, instead of using a linear model for the observations $Y_1, \ldots, Y_N$, to analyze $Z_1, \ldots, Z_N$, where

$$Z_i = \begin{cases} Y_i^{\lambda}, & \lambda \neq 0 \\ \log Y_i, & \lambda = 0, \end{cases} \tag{6-1}$$

using a linear model. Here $\lambda = 1$ corresponds to no transformation, $\lambda = 1/2$ to a square root transformation, and so on. So how should $\lambda$ be determined? The most well known way was proposed by Box and Cox (1964) and is therefore known as the Box-Cox transformation. In order for the transformation to be continuous in $\lambda$, which is convenient for technical reasons, the Box-Cox transformation is written in the form

$$Z_i = \begin{cases} (Y_i^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log Y_i, & \lambda = 0. \end{cases}$$

In this context, however, it suffices to think of the transformation as given by (6-1). The appealing feature of this approach is that $\lambda$ is considered as a parameter along with the rest of the parameters in the linear model, and is therefore determined from the data. This is done using the method of maximum likelihood. The maximum likelihood estimate of $\lambda$ is defined as the value of $\lambda$ that maximizes the likelihood function, or the log likelihood function, which in this case is given by

$$l(\lambda) = -\frac{N}{2} \log SS_e(\lambda) + (\lambda - 1) \sum_{i=1}^{N} \log Y_i,$$

where $SS_e(\lambda)$ is the residual sum of squares corresponding to the linear model for the observations transformed by $\lambda$. Let $\hat{\lambda}$ denote the maximum likelihood estimate of $\lambda$. The hypothesis $H_0: \lambda = \lambda_0$ can be tested using the test statistic $2(l(\hat{\lambda}) - l(\lambda_0))$, which is approximately $\chi^2(1)$-distributed.

Large values of the test statistic are critical for the hypothesis. An approximate $100(1 - \alpha)\%$ confidence interval is given by the set of $\lambda$-values satisfying

$$2(l(\hat{\lambda}) - l(\lambda)) \leq \chi^2_{\alpha}(1),$$

where $\chi^2_{\alpha}(1)$ denotes the $100(1 - \alpha)\%$ quantile of the $\chi^2(1)$-distribution, in particular $\chi^2_{0.05}(1) = 3.84$.

In the package MASS there is a function boxcox which takes an lm object and by default computes the values of the log likelihood function over the range -2 to 2 of the parameter $\lambda$ in the transformation:

    model1.5 <- lm(humidity ~ depth * plank + depth * width + plank * width, data = planks)
    library(MASS)
    par(mfrow = c(1, 2))
    plot(boxcox(model1.5))

Figure: Box-Cox log-likelihood as a function of λ with the 95% limit indicated (left), and boxcox(model1.5)$y plotted against boxcox(model1.5)$x (right)
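To see what boxcox computes, the log likelihood function $l(\lambda)$ and the approximate 95% confidence interval can be evaluated by hand on a grid. The following is a minimal sketch on simulated data (an assumption, chosen so that the log scale is the appropriate one), using the continuous Box-Cox form of the transformation; the function name `loglik` and the grid are illustrative choices.

```r
library(MASS)  # provides boxcox()

set.seed(3)
n <- 100
x <- runif(n, 1, 10)
y <- exp(0.2 * x + rnorm(n, sd = 0.2))   # positive response, linear on the log scale
fit <- lm(y ~ x)

# l(lambda) = -(N/2) log SSe(lambda) + (lambda - 1) sum(log y)
loglik <- function(lambda) {
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  sse <- sum(resid(lm(z ~ x))^2)
  -n / 2 * log(sse) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.01)
ll <- sapply(lambdas, loglik)
lambda_hat <- lambdas[which.max(ll)]     # maximum likelihood estimate on the grid

# Approximate 95% confidence interval: 2 (l(lambda_hat) - l(lambda)) <= 3.84
keep <- 2 * (max(ll) - ll) <= qchisq(0.95, df = 1)
ci <- range(lambdas[keep])

# The same maximizer as found by MASS::boxcox on the same grid
bc <- boxcox(fit, lambda = lambdas, plotit = FALSE)
lambda_hat_mass <- bc$x[which.max(bc$y)]
```

Since the simulated data are linear with constant variance on the log scale, the estimate should land near $\lambda = 0$, and the hand-computed maximizer should agree with the one from boxcox up to the grid resolution.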

6.3.3 Back transformation

Working with transformed data has the disadvantage that one would often prefer to present the results on the original scale rather than on the transformed scale, which means that some kind of back transformation is required. However, not all quantities are easily back transformed with meaningful interpretations for any kind of transformation.

We suggest using simple back transformations of estimates/LSMEANS. If $\hat{\mu}$ is an estimate computed for the log-transformed data, then use the inverse of the log, the exponential, $\exp(\hat{\mu})$, as an estimate on the original scale. It should be noted that this is a biased estimate of the expected value on the original scale. In fact it is an estimate of the median, but this also seems like a more natural quantity to estimate when taking into consideration that the distribution on the original scale is not symmetric. A 95% confidence interval is easily obtained by calculating it on the transformed scale and then transforming the endpoints of the interval back. Note that such an interval is not symmetric, reflecting the asymmetric distribution on the original scale.

6.4 Model diagnostics in the mixed linear model

The residual errors $\varepsilon_i$ in the mixed models seen so far in this course are subject to the same assumptions as the residual errors in a purely fixed effects (systematic) linear model. In later modules on repeated measures data the focus will be on more general modeling of the residual error covariance structure, and specific tools for the investigation of this will be given.

The estimated residuals could be defined within the mixed model using the predicted values (BLUPs) for the random effects, such that they are given by (using the vector-matrix notation of the theory module):

$$r = y - (X\hat{\beta} + Z\hat{u}).$$

In general the BLUPs of the mixed model and the parameter estimates of the corresponding fixed effects model will be different: the BLUPs are shrinkage versions of the fixed effects parameters (the most extreme values among the levels of a random factor become less extreme). However, the difference is often not pronounced. Because of the more complicated model structure, the standardization of the residuals also becomes more complicated. The raw residuals can be easily extracted from lmer results as well:

    library(lmerTest)
    lmer3 <- lmer(loghum ~ depth * width + (1 | plank) + (1 | depth:plank) + (1 | plank:width), data = planks)
    lmerresid <- resid(lmer3)

And now all the plots from above could be constructed manually, see figure 6.5 and figure 6.6. Only the basic residuals versus fitted plot would be automatically produced by the plot function. Influence measures for lmer results can be extracted by the influence function of the influence.ME package, see figure 6.7 for how to extract and plot the Cook's distances.

6.4.1 Check for random effects normality

Until now we have based the model diagnostics on the residuals. This means that we have only investigated the normality assumption for the residual error, $\varepsilon_i \sim N(0, \sigma^2)$. The hope is that a choice of transformation based on the structures in the residuals will also stabilize the random effects distributions, but this is not in any way guaranteed. In the mixed model used here, we also assume that the effects due to planks and plank interactions are normally distributed:

$$d(\text{plank}_i) \sim N(0, \sigma^2_{\text{Plank}}), \quad f(\text{width}_i, \text{plank}_i) \sim N(0, \sigma^2_{\text{Plank} \times \text{width}}),$$

and

$$g(\text{depth}_i, \text{plank}_i) \sim N(0, \sigma^2_{\text{Plank} \times \text{depth}}), \quad \varepsilon_i \sim N(0, \sigma^2).$$

We investigate this by looking at the BLUPs for the random effects, cf. figure 6.8, where there is no indication of any severe lack of normality. This approach only makes sense if the number of levels of a factor is not too small. And the really big flaw: IF we have problems with the normality of the random effects, we would not really know how to cope with this!
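The shrinkage property of the BLUPs mentioned above can be illustrated without any mixed model software in the simplest possible setting. The sketch below is an illustration under stated assumptions, not the beech wood analysis: it simulates a balanced one-way layout with a single random factor, estimates the two variance components by the ANOVA (method of moments) approach, and computes the BLUPs as shrunken versions of the fixed effects deviations. All names are chosen for illustration.

```r
set.seed(4)
k <- 20    # number of groups (think of planks)
n <- 15    # observations per group
plank <- factor(rep(seq_len(k), each = n))
a <- rnorm(k, sd = 0.5)                         # true random group effects
y <- 2 + a[as.integer(plank)] + rnorm(k * n, sd = 1)

# ANOVA estimates of the variance components (balanced data)
ms <- anova(lm(y ~ plank))[["Mean Sq"]]         # c(MS between groups, MS residual)
sigma2 <- ms[2]
sigma2_plank <- max((ms[1] - ms[2]) / n, 0)

# Fixed effects view: deviations of the group means from the overall mean
dev_fixed <- tapply(y, plank, mean) - mean(y)

# BLUPs in the balanced one-way random effects model:
# the same deviations multiplied by a shrinkage factor between 0 and 1
shrink <- sigma2_plank / (sigma2_plank + sigma2 / n)
blup <- shrink * dev_fixed

# Normal probability plot of the predicted random effects
qqnorm(blup)
qqline(blup)
```

With only a few levels of the random factor, such a probability plot is based on very few points, which is why the check only makes sense when the number of levels is not too small.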

    par(mfrow = c(1, 2))
    plot(sqrt(abs(lmerresid)) ~ predict(lmer3))
    qqnorm(lmerresid)

Figure 6.5: Residuals versus predicted and Normal QQ plot for the mixed model

    par(mfrow = c(2, 2))
    plot(lmerresid ~ predict(lmer3))
    with(planks, plot(lmerresid ~ plank, col = heat.colors(20)))
    with(planks, plot(lmerresid ~ width, col = rainbow(3)))
    with(planks, plot(lmerresid ~ depth, col = rainbow(5)))

Figure 6.6: Residuals versus predicted and factor levels for the mixed model

    library(influence.ME)
    lmer3.infl <- influence(lmer3, obs = TRUE)
    par(mfrow = c(1, 1))
    plot(cooks.distance(lmer3.infl))

Figure 6.7: Cook's distance for the mixed model
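The leave-one-out idea behind the dffit values and the Cook's distances can be made concrete by brute force for an ordinary linear model. The sketch below uses simulated data and a simple regression (assumptions made for illustration): it refits the model once per observation and evaluates the two definitions from the section on influential observations directly, then compares with the values R returns from `dffits` and `cooks.distance`.

```r
set.seed(5)
n <- 40
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
fit <- lm(y ~ x, data = d)

p <- length(coef(fit))        # number of parameters
s2 <- summary(fit)$sigma^2    # residual error variance from the full fit
h <- hatvalues(fit)           # leverages
yhat <- fitted(fit)

cook_manual <- numeric(n)
dffit_manual <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(y ~ x, data = d[-i, ])        # refit without observation i
  yhat_i <- predict(fit_i, newdata = d)     # predictions for all n observations
  s_i <- summary(fit_i)$sigma               # sigma estimated without observation i
  cook_manual[i] <- sum((yhat - yhat_i)^2) / (p * s2)
  dffit_manual[i] <- (yhat[i] - yhat_i[i]) / (s_i * sqrt(h[i]))
}

# Both differences should be zero up to rounding error
max(abs(cook_manual - cooks.distance(fit)))
max(abs(dffit_manual - dffits(fit)))
```

For mixed models the same refitting idea applies in principle, but each refit is more costly and the standardization is less straightforward.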

    par(mfrow = c(1, 3))
    qqnorm(ranef(lmer3)$`depth:plank`[, 1])
    qqnorm(ranef(lmer3)$`plank:width`[, 1])
    qqnorm(ranef(lmer3)$plank[, 1])

Figure 6.8: Random effects normal probability plots

6.5 Drying of beech wood data - a case study, part II

In this example section we complete the analysis of this data set. In Module 3 we completed an analysis, but without checking the model assumptions. In the main part of this Module 6 we found, based on the plot of residuals versus predicted values, that something was clearly wrong.

6.5.1 Factor structure and basic model revisited

When there are possible problems with the model, as in this case, it is also important to consider whether we actually included all possible effects in the model. In fact, in the classical randomized block analysis carried out in Module 3, we ignored the possibility of an interaction effect between planks and widths or between planks and depths. The average profile plots of the log-transformed data are given here as figure 6.9. The patterns in the two top plots provide information about these two interactions. The depth patterns seem to be rather parallel, whereas some clear deviations from parallel patterns are seen for the width humidity structures. Only a statistical analysis can reveal whether these effects are significant. Since the plank effect is considered random, the interactions with plank should also be considered random. So including these two would correspond to the model given by the factor structure in figure 6.10, or expressed formally:

$$\log Y_i = \mu + \alpha(\text{width}_i) + \beta(\text{depth}_i) + \gamma(\text{width}_i, \text{depth}_i) + d(\text{plank}_i) + f(\text{width}_i, \text{plank}_i) + g(\text{depth}_i, \text{plank}_i) + \varepsilon_i, \tag{6-2}$$

where

$$d(\text{plank}_i) \sim N(0, \sigma^2_{\text{Plank}}), \quad f(\text{width}_i, \text{plank}_i) \sim N(0, \sigma^2_{\text{Plank} \times \text{width}}),$$

and

$$g(\text{depth}_i, \text{plank}_i) \sim N(0, \sigma^2_{\text{Plank} \times \text{depth}}), \quad \varepsilon_i \sim N(0, \sigma^2).$$

6.6 Explorative analysis on transformed data

Since a log-transformation also affects the structure of the data, it will generally be a good idea to redo some of the explorative plots of the raw data. In this case it does not change much, so we do not give any further plots.

Figure 6.9: Four average log-humidity profiles (mean of loghum plotted against width and against depth)

Figure 6.10: The factor structure diagram (factors [I], [width plank], [depth plank], [plank], width depth, width and depth)

6.7 Test of overall effects/model reduction for transformed data

Although we have moved on to use lmerTest or Anova for handling the mixed model, we have also worked with lm for model diagnostics. ANOVA on the lm results can also be used to provide certain F-tests for random effects (NOT given by lmerTest), for instance for those random effects belonging to the error stratum, i.e. those random effects with an arrow directly from [I] in the factor structure diagram. Furthermore, a comparison of the results of the fixed effects ANOVA with the results of lmerTest will hopefully support the subject matter understanding. The ANOVA table from the full fixed effects model (as given by both anova and Anova, as the data is balanced) is:

    Source of variation   DF   Sums of squares   Mean squares   F          P-value
    depth                                                       (217.14)   (<.0001)
    width                                                       (80.86)    (<.0001)
    depth*width
    plank                                                       (97.34)    (<.0001)
    depth*plank
    width*plank                                                            <.0001
    Error

The F-statistics (and P-values) NOT to be used are put in parentheses. For comparison, the table of fixed effects from the mixed model analysis corresponding to the model given by (6-2) is:

    Source of variation   Numerator degrees of freedom   Denominator degrees of freedom   Mean squares   P-value
    depth                                                                                                <.0001
    width
    depth*width

From the full ANOVA table we see that the interaction between width and plank is clearly significant, whereas the depth*plank interaction is on the limit. Note that the test for the depth*width interaction is the same in both tables. As opposed to the preliminary analysis above, this interaction seems to be significant. Also note that the denominator degrees of freedom of the tests of main effects coincide with the degrees of freedom (DF) of the corresponding plank interaction term in the fixed effects model, and in fact in this case:

$$F_{\text{depth}} = \frac{MS_{\text{depth}}}{MS_{\text{depth} \times \text{plank}}} \quad \text{and} \quad F_{\text{width}} = \frac{MS_{\text{width}}}{MS_{\text{width} \times \text{plank}}}.$$

So the tests of fixed effects in the mixed model could easily be derived from the fixed effects ANOVA. This will not always be the case! In summary, we cannot reduce the model, since all effects appear significant.

6.8 Post hoc analysis and summarizing the results for the transformed data

The final model is given by model (6-2).

Estimates of the variance parameters

Estimates of the four variance parameters (on log-scale) are:

$$\hat{\sigma}^2_{Planks} = \ , \quad \hat{\sigma}^2_{Plank*width} = \ , \quad \hat{\sigma}^2_{Plank*depth} = \ , \quad \hat{\sigma}^2 = \ $$

It is not clear at all how one could (or should!) back transform these values to the original scale. Confidence intervals for the corresponding standard deviations are obtained in the usual way:

confint(lmer3, 1:4)

Computing profile confidence intervals ...
         2.5 %   97.5 %
.sig01
.sig02
.sig03
.sigma

Estimates of the fixed parameters

The important estimates from the systematic part of the model are the 15 values, one for each combination of width and depth (the LSMEANS for the interaction). Since the data is balanced, these are the simple averages over planks of the log-humidity within each combination. Direct back transformation (using the exponential or antilog function) of the LSMEANS is used. In this case it has the effect that the values presented for each combination (to be read as medians rather than means on the original scale) are the so-called geometric averages (using the ijk-notation):

$$\exp\left(\frac{1}{20}\sum_{k=1}^{20}\log(y_{ijk})\right) = \left(\prod_{k=1}^{20} y_{ijk}\right)^{1/20}$$

These are depicted in figure 6.11.
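The identity between the back-transformed average of logs and the geometric average can be verified with a few numbers. This is a minimal sketch in base R; the values in y are made up for the illustration and are not taken from the beech wood data.

```r
# Hypothetical humidity values for one width*depth combination
y <- c(5.2, 6.1, 4.8, 7.3, 5.9)

# Back-transformed average of the log-values
gm_backtransformed <- exp(mean(log(y)))

# Geometric average computed directly as the n'th root of the product
gm_direct <- prod(y)^(1 / length(y))

# Ordinary (arithmetic) average for comparison
am <- mean(y)

# The two geometric versions agree up to rounding error
isTRUE(all.equal(gm_backtransformed, gm_direct))
```

The geometric average is never larger than the arithmetic average, which is one reason why back-transformed LSMEANS should not be read as estimates of the mean on the original scale.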

mylsmeans <- lsmeans(lmer3, "depth:width")$lsmeans.table
with(mylsmeans, interaction.plot(depth, width, exp(Estimate), col = 2:4))

Figure 6.11: Back transformed expected values (interaction plot; x-axis: depth, y-axis: mean of exp(Estimate), one trace per width).

Error bars in plots

For such a plot, one would usually like to add some kind of uncertainty information, but this is not always easily done in a meaningful way. A common approach in the scientific literature is to add error bars to each point showing plus/minus one or two standard errors (SE) of estimation. The use of plus/minus one SE should be avoided, since this has no real interpretation (and there is a danger that people actually think of these intervals as confidence intervals). The use of 2 standard errors corresponds approximately to the 95% confidence band for each value, which may be reasonable information to convey. In general, the proper confidence interval using the critical value from the t-distribution could then just as well be used.

But still, IF the aim of the plot is to spot the significant/important differences, then these confidence intervals are NOT of much use: The eye would tend to claim two points significantly different if the bars do not overlap, and non-significant if they do overlap. This corresponds to claiming a difference if (and only if) two points are more than 4 standard errors apart. However, this is NOT correct, in this case for two reasons.

First of all, if the standard errors of two estimated values are equal (and the estimates are independent), then

$$SE(\hat{\beta}_1 - \hat{\beta}_2) = \sqrt{Var(\hat{\beta}_1 - \hat{\beta}_2)} = \sqrt{Var(\hat{\beta}_1) + Var(\hat{\beta}_2)} = \sqrt{2\,Var(\hat{\beta}_1)} = \sqrt{2}\,SE(\hat{\beta}_1)$$

and the 95% confidence interval of the difference would be approximately plus/minus twice this value. So two estimates are significantly different if (and only if) they are more than $2\sqrt{2} = 2.83$ SEs apart, NOT 4!

Secondly, for a randomized block situation like this, the assumption of independence between estimates does NOT hold, since they are all based on the same 20 planks. For this reason the uncertainty (SE) of a difference is not easily derived from the SEs of the expected values themselves. This makes the use of the direct SE-bars approach even more questionable.
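The two points made above can be illustrated with a small calculation. This is a sketch in base R under stated assumptions: the common standard error se is set to 1 in arbitrary units, and the correlation rho between the two estimates is a made-up value used only to show the direction of the effect.

```r
# Common standard error of the two estimates (arbitrary unit, assumed)
se <- 1

# Independent estimates: Var(b1 - b2) = Var(b1) + Var(b2)
se_diff_indep <- sqrt(se^2 + se^2)   # equals sqrt(2) * se
threshold <- 2 * se_diff_indep       # roughly 2.83 * se, not 4 * se

# Positively correlated estimates (e.g. sharing the same planks):
# Var(b1 - b2) = Var(b1) + Var(b2) - 2 * Cov(b1, b2)
rho <- 0.8                           # hypothetical correlation
se_diff_corr <- sqrt(se^2 + se^2 - 2 * rho * se * se)

round(c(indep = se_diff_indep, threshold = threshold, corr = se_diff_corr), 3)
```

With positive correlation between the estimates, the standard error of the difference falls below the independent-case value, so error bars based on the individual SEs become even less informative about differences.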
In fact, in module 1 it was pointed out how the uncertainties of treatment differences are usually much smaller: The uncertainty of the humidity of a specific width*depth combination will include the extensive plank-to-plank variability, whereas the uncertainties of differences will not! Working with transformed data adds further to the complexity of constructing error information for the plot. In this situation we can read off from the R output the standard errors (SE) and/or the confidence limits for the 15 averages on log-scale. In general, one should not back transform standard errors. Instead it will be meaningful

to back transform the limits of the confidence bands. For other transformations than the log this may not always be possible. For the numbers in figure 6.11, error bars based on these values would overlap heavily all over; the plot would be messy and would give the wrong impression with respect to the significant differences between width and depth levels. And since it is not really possible to visualize all possible, or even all relevant, comparisons, we leave the plot without any further information and summarize the significances in tables in line with those presented in the preliminary results section given in module 3.

We could investigate the relevant errors for differences between levels of the interaction factor (or any other effect) by extracting these from the difflsmeans function of the lmerTest package:

mydifflsmeans <- difflsmeans(lmer3, "depth:width")$diffs.lsmeans.table
head(mydifflsmeans, 10)
hist(mydifflsmeans$"Standard Error")

Figure: Histogram of mydifflsmeans$"Standard Error" (x-axis: standard error, y-axis: frequency).

The histogram shows that there are two levels of errors: one for comparing widths within depths and another for comparing depths within widths. One could extract these numbers and use them for some kind of error bar on the plot. Since we back transform, it becomes a bit more complicated, but one could still investigate the back-transformed confidence intervals and convey some kind of average confidence interval on the plot. The meaningfulness of this will depend on how much the widths of these confidence bands vary across all comparisons - the investigation will

show that also. We do not pursue this in more detail here.

Comparisons of the fixed parameters

Since the width*depth interaction is significant, we should report the humidity for each depth within each level of width (and vice versa). We give the back transformed values, but use significance information from the analysis on the transformed data. If all comparisons across all levels of the interaction factor are explored and reported, one should use an overall correction method, like the Tukey-Kramer method. However, if we only compare, say, depth levels within each width level, we do not perform all possible (105) tests but only 30 tests in 3 groups of 10. (We need 10 tests to compare all pairs among 5 depth levels.) A less restrictive correction should then probably be employed, e.g. a Bonferroni correction based on the 10 tests within each set of tests. So carrying out standard t-tests, but only claiming significance if the P-value is less than 0.05/10 = 0.005, gives the following summary table:

Width 1      Width 2      Width 3
Depth  a     Depth  a     Depth  a
Depth  a     Depth  a     Depth  a
Depth  b     Depth  b     Depth  b
Depth  bc    Depth  b     Depth  b
Depth  bc    Depth  b     Depth  b

The conclusion is clearly the same as previously given, although there seems to be no clear statistical evidence of a difference between the middle depth (5) and the neighbor depths 3 and 7 - slightly so for width 1. The width effects within each depth are given by (using level 5%/3 = 1.67% in each test):

Depth 1      Depth 3      Depth 5      Depth 7      Depth 9
Width  a     Width  a     Width  a     Width  a     Width  a
Width  ab    Width  b     Width  b     Width  b     Width  ab
Width  b     Width  b     Width  b     Width  b     Width  b
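The numbers of tests and the per-test significance levels behind the two tables above can be checked with a few lines of base R. This is only a sketch: the P-values in p are hypothetical and are not taken from the analysis.

```r
# Number of pairwise comparisons
all_pairs   <- choose(5 * 3, 2)  # all pairs among the 15 width*depth cells
depth_pairs <- choose(5, 2)      # pairs of depths within one width
width_pairs <- choose(3, 2)      # pairs of widths within one depth

# Bonferroni-corrected per-test levels for an overall 5% level within each set
alpha <- 0.05
alpha_depth <- alpha / depth_pairs
alpha_width <- alpha / width_pairs

# Hypothetical raw P-values from four of the depth comparisons within one width
p <- c(0.0004, 0.003, 0.012, 0.2)

# Two equivalent ways of applying the correction
signif_by_level <- p < alpha_depth
signif_by_adjust <- p.adjust(p, method = "bonferroni", n = depth_pairs) < alpha
```

Comparing raw P-values with the reduced level and comparing Bonferroni-adjusted P-values with the 5% level lead to the same decisions; which of the two to report is a matter of presentation.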

The width*depth interaction effect is indicated by the difference in the results for the top and bottom depths compared to the rest. There seems to be a larger difference for the middle (high humidity) depths.

Confidence intervals for specific combinations or differences can be obtained by direct back transformation of the lower and upper limits from the analysis on the transformed data. For instance, consider the difference between width 2 and width 3 for depth 1. On log-scale, this difference is 0.1004 (read off directly from the difflsmeans output from the lmerTest package), and its 95% confidence interval (without correction) is likewise read off directly from the R output. The back transformed value is then exp(0.1004) = 1.106, and the back transformed confidence interval, obtained by applying the exponential function to the lower and upper limits (the upper limit being 0.1725), is [1.029, 1.189].

Note that such a back transformed difference has a relative interpretation on the original scale: The humidity level for width 3 is estimated to be about 10% smaller than for width 2, with a 95% confidence interval from 3% to 19%.

6.9 Exercises

Exercise 1 Cookies data
Carry out model diagnostics for the analysis of the cookies data in exercise 1 in Module 2.

Exercise 2 Milk data
Carry out model diagnostics for the analysis of the milk data in exercise 2 (part c)) in Module 2.

Exercise 3 Spinage data
Carry out model diagnostics for the analysis of the spinage color data in exercise 1 in Module 3.

Exercise 4 Spinage2 data
Carry out model diagnostics for the analysis of the spinage sensory data in exercise 2 in Module 3.

Exercise 5 Blueberry data
Carry out model diagnostics for the analysis of the blueberry data in exercise 1 in Module 5.

Module 6: Model Diagnostics

Module 6: Model Diagnostics St@tmaster 02429/MIXED LINEAR MODELS PREPARED BY THE STATISTICS GROUPS AT IMM, DTU AND KU-LIFE Module 6: Model Diagnostics 6.1 Introduction............................... 1 6.2 Linear model diagnostics........................

More information

Answer Keys to Homework#10

Answer Keys to Homework#10 Answer Keys to Homework#10 Problem 1 Use either restricted or unrestricted mixed models. Problem 2 (a) First, the respective means for the 8 level combinations are listed in the following table A B C Mean

More information

Assignment 9 Answer Keys

Assignment 9 Answer Keys Assignment 9 Answer Keys Problem 1 (a) First, the respective means for the 8 level combinations are listed in the following table A B C Mean 26.00 + 34.67 + 39.67 + + 49.33 + 42.33 + + 37.67 + + 54.67

More information

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat).

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat). Statistics 512: Solution to Homework#11 Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat). 1. Perform the two-way ANOVA without interaction for this model. Use the results

More information

Mixed Model Theory, Part I

Mixed Model Theory, Part I enote 4 1 enote 4 Mixed Model Theory, Part I enote 4 INDHOLD 2 Indhold 4 Mixed Model Theory, Part I 1 4.1 Design matrix for a systematic linear model.................. 2 4.2 The mixed model.................................

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 49 Outline 1 How to check assumptions 2 / 49 Assumption Linearity: scatter plot, residual plot Randomness: Run test, Durbin-Watson test when the data can

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a take-home exam. You are expected to work on it by yourself

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c Inference About the Slope ffl As with all estimates, ^fi1 subject to sampling var ffl Because Y jx _ Normal, the estimate ^fi1 _ Normal A linear combination of indep Normals is Normal Simple Linear Regression

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION Answer all parts. Closed book, calculators allowed. It is important to show all working,

More information

Stat 427/527: Advanced Data Analysis I

Stat 427/527: Advanced Data Analysis I Stat 427/527: Advanced Data Analysis I Review of Chapters 1-4 Sep, 2017 1 / 18 Concepts you need to know/interpret Numerical summaries: measures of center (mean, median, mode) measures of spread (sample

More information

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

unadjusted model for baseline cholesterol 22:31 Monday, April 19, unadjusted model for baseline cholesterol 22:31 Monday, April 19, 2004 1 Class Level Information Class Levels Values TRETGRP 3 3 4 5 SEX 2 0 1 Number of observations 916 unadjusted model for baseline cholesterol

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

ACOVA and Interactions

ACOVA and Interactions Chapter 15 ACOVA and Interactions Analysis of covariance (ACOVA) incorporates one or more regression variables into an analysis of variance. As such, we can think of it as analogous to the two-way ANOVA

More information

CHAPTER 2 SIMPLE LINEAR REGRESSION

CHAPTER 2 SIMPLE LINEAR REGRESSION CHAPTER 2 SIMPLE LINEAR REGRESSION 1 Examples: 1. Amherst, MA, annual mean temperatures, 1836 1997 2. Summer mean temperatures in Mount Airy (NC) and Charleston (SC), 1948 1996 Scatterplots outliers? influential

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

MLR Model Checking. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project

MLR Model Checking. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project MLR Model Checking Author: Nicholas G Reich, Jeff Goldsmith This material is part of the statsteachr project Made available under the Creative Commons Attribution-ShareAlike 3.0 Unported License: http://creativecommons.org/licenses/by-sa/3.0/deed.en

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

STAT 4385 Topic 06: Model Diagnostics

STAT 4385 Topic 06: Model Diagnostics STAT 4385 Topic 06: Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 1/ 40 Outline Several Types of Residuals Raw, Standardized, Studentized

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Multicollinearity Read Section 7.5 in textbook. Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Example of multicollinear

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

Topic 8. Data Transformations [ST&D section 9.16]

Topic 8. Data Transformations [ST&D section 9.16] Topic 8. Data Transformations [ST&D section 9.16] 8.1 The assumptions of ANOVA For ANOVA, the linear model for the RCBD is: Y ij = µ + τ i + β j + ε ij There are four key assumptions implicit in this model.

More information

1 One-way Analysis of Variance

1 One-way Analysis of Variance 1 One-way Analysis of Variance Suppose that a random sample of q individuals receives treatment T i, i = 1,,... p. Let Y ij be the response from the jth individual to be treated with the ith treatment

More information

Regression Diagnostics

Regression Diagnostics Diag 1 / 78 Regression Diagnostics Paul E. Johnson 1 2 1 Department of Political Science 2 Center for Research Methods and Data Analysis, University of Kansas 2015 Diag 2 / 78 Outline 1 Introduction 2

More information

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter.

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. 1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. T F T F T F a) Variance estimates should always be positive, but covariance estimates can be either positive

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Statistical Modelling in Stata 5: Linear Models

Statistical Modelling in Stata 5: Linear Models Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does

More information

Using SPSS for One Way Analysis of Variance

Using SPSS for One Way Analysis of Variance Using SPSS for One Way Analysis of Variance This tutorial will show you how to use SPSS version 12 to perform a one-way, between- subjects analysis of variance and related post-hoc tests. This tutorial

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Simple Linear Regression for the Advertising Data

Simple Linear Regression for the Advertising Data Revenue 0 10 20 30 40 50 5 10 15 20 25 Pages of Advertising Simple Linear Regression for the Advertising Data What do we do with the data? y i = Revenue of i th Issue x i = Pages of Advertisement in i

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Leiden University Leiden, 30 April 2018 Outline 1 Error assumptions Introduction Variance Normality 2 Residual vs error Outliers Influential observations Introduction Errors and

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring

More information

Simple linear regression: estimation, diagnostics, prediction

Simple linear regression: estimation, diagnostics, prediction UPPSALA UNIVERSITY Department of Mathematics Mathematical statistics Regression and Analysis of Variance Autumn 2015 COMPUTER SESSION 1: Regression In the first computer exercise we will study the following

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

M A N O V A. Multivariate ANOVA. Data

M A N O V A. Multivariate ANOVA. Data M A N O V A Multivariate ANOVA V. Čekanavičius, G. Murauskas 1 Data k groups; Each respondent has m measurements; Observations are from the multivariate normal distribution. No outliers. Covariance matrices

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS Ravinder Malhotra and Vipul Sharma National Dairy Research Institute, Karnal-132001 The most common use of statistics in dairy science is testing

More information

Course in Data Science

Course in Data Science Course in Data Science About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst. The course gives an

More information

Regression Model Building

Regression Model Building Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors Automated

More information

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y Predictor or Independent variable x Model with error: for i = 1,..., n, y i = α + βx i + ε i ε i : independent errors (sampling, measurement,

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

Lecture 18: Simple Linear Regression

Lecture 18: Simple Linear Regression Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength

More information

Regression Model Specification in R/Splus and Model Diagnostics. Daniel B. Carr

Regression Model Specification in R/Splus and Model Diagnostics. Daniel B. Carr Regression Model Specification in R/Splus and Model Diagnostics By Daniel B. Carr Note 1: See 10 for a summary of diagnostics 2: Books have been written on model diagnostics. These discuss diagnostics

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal Hypothesis testing, part 2 With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal 1 CATEGORICAL IV, NUMERIC DV 2 Independent samples, one IV # Conditions Normal/Parametric Non-parametric

More information

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Chapte The McGraw-Hill Companies, Inc. All rights reserved. er15 Chapte Chi-Square Tests d Chi-Square Tests for -Fit Uniform Goodness- Poisson Goodness- Goodness- ECDF Tests (Optional) Contingency Tables A contingency table is a cross-tabulation of n paired observations

More information

Transformations. Merlise Clyde. Readings: Gelman & Hill Ch 2-4, ALR 8-9

Transformations. Merlise Clyde. Readings: Gelman & Hill Ch 2-4, ALR 8-9 Transformations Merlise Clyde Readings: Gelman & Hill Ch 2-4, ALR 8-9 Assumptions of Linear Regression Y i = β 0 + β 1 X i1 + β 2 X i2 +... β p X ip + ɛ i Model Linear in X j but X j could be a transformation

More information

Beam Example: Identifying Influential Observations using the Hat Matrix

Beam Example: Identifying Influential Observations using the Hat Matrix Math 3080. Treibergs Beam Example: Identifying Influential Observations using the Hat Matrix Name: Example March 22, 204 This R c program explores influential observations and their detection using the

More information

Homework 2: Simple Linear Regression

Homework 2: Simple Linear Regression STAT 4385 Applied Regression Analysis Homework : Simple Linear Regression (Simple Linear Regression) Thirty (n = 30) College graduates who have recently entered the job market. For each student, the CGPA

More information

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using

More information
