Stat 5303 (Oehlert): Random effects 1

Cmd> print(sire,format:"f4.0")
These are the data from exercise 5- of Kuehl (1994, Duxbury). There are 5 bulls selected at random, and we observe the birth weights of male calves. Sire is considered random.
sire: () (6) ()

Cmd> print(wts,format:"f4.0")
wts: () (6) ()

Cmd> sire<-factor(sire)

Cmd> anova("wts=sire")
This is the ordinary ANOVA. It doesn't know anything about fixed or random effects. The DF, SS, and MS are correct.
Model used is wts=sire
DF SS MS
CONSTANT .758e e+05
sire
ERROR

Cmd> resvsrankits()
Normality is not too bad.
[Plot: Standardized Residuals vs Normal Scores]

Cmd> resvsyhat()
Constant variance is a little bit doubtful, but no power family transformation will help much since the ratio of largest to smallest response is only about .
[Plot: Standardized Residuals vs Fitted Values (Yhat)]

Cmd> # In order to do random effects analysis, we need some new commands.
The first is ems(). ems() computes expected mean squares for models with random and/or fixed effects. Data may be unbalanced. The basic usage is to give the model and then a keyword phrase random:names, where names is a character vector with the names of the random effects. Several more specialized alternatives are also available.
The command mixed() does mixed (random and fixed) effects anova, computing the correct denominator for tests. The basic arguments are the same as for ems(). You can also give it the output of ems() as an argument.
The third command is varcomp(). varcomp() computes estimates of variance components, their standard errors, and approximate degrees of freedom. Same arguments as ems() or mixed().
The fourth command is reml(), which does restricted maximum likelihood estimation of fixed and random effects. Same basic arguments as the others.
All of these commands are available from the Statistics :: ANOVA modeling submenu.

Cmd> ems("wts=sire",random:"sire")
OK, so let's get the expected mean squares. The arguments are the model, and then random:names, where names is a vector of character strings giving the names of the random terms. The last error term is automatically random. These data are balanced, so the EMS could be calculated with the Hasse diagram.
EMS(CONSTANT) = V(ERROR) + 8V(sire) + 40Q(CONSTANT)
EMS(sire) = V(ERROR) + 8V(sire)
EMS(ERROR) = V(ERROR)

Cmd> mixed("wts=sire",random:"sire")
The mixed macro produces the correct anova for problems with random and/or fixed effects. There is a row for every term in the anova model. There are columns for the DF and MS of each term, the DF and MS of the error or denominator for each term, the F, and the p-value. Here we see that sire is reasonably significant.
DF MS Error DF Error MS F P value
CONSTANT .76e
sire
ERROR MISSING MISSING

Cmd> varcomp("wts=sire",random:"sire")
Here are the estimated variance components. The estimate for ERROR is just the MSE itself. For sire, we have (MS(sire) - MSE)/8. Standard errors are computed from the variances of the anova mean squares and the coefficients used in the variance component estimates (0 and 1 for ERROR, and 1/8 and -1/8 for sire). Contrast the fact that sire is fairly significant with the fact that the estimate of the sire variance component was less than one SE from zero. This is not necessarily a contradiction, because the estimate-plus-or-minus-SEs form of confidence interval is not appropriate for variance components based on few df.
Estimate SE DF
sire
ERROR

Cmd> reml("wts=sire",random:"sire")
There are several other ways to estimate the fixed and random components of an ANOVA. The reml() command does restricted maximum likelihood. Its estimates of variance components will always be nonnegative. The output gives the estimated fixed effects (called theta), the estimated variance components (called phi), the variances for theta and phi, the degrees of freedom for the variance components, the estimated random effects (called gamma), the variances of the estimated random effects, the log likelihood for the model, and the residuals (data minus fixed effects; that is, the residuals contain all random effects). In some cases (such as this one), the REML estimates and the usual ANOVA-based estimates will agree. This will not always be true.
component: theta
CONSTANT 8.55
component: phi
sire 6.75
ERROR
component: thetavar
(,)
component: phivar
sire ERROR
sire
ERROR
component: phidf
sire .767
ERROR 5
component: gamma
(,) 0.78 (,) (,) -.98

(4,) (5,)
component: gammavar
()
component: loglike
(,) -79.
component: residuals
(,) -.55 (,) 7.45 (,) (4,) 0.45 (5,) 6.45 (6,) 0.45 (7,) (8,) (9,) (0,) 9.45 (,) .45 (,) 0.45 (,) 5.45 (4,) .45 (5,) 5.45 (6,) .45 (7,) (8,) -.55 (9,) -.55 (0,) (,) (,) -.55 (,) (4,) 7.45 (5,) (6,) (7,) (8,) -.55 (9,) (0,) 8.45 (,) 8.45 (,) 8.45 (,) -.55 (4,) (5,) 7.45 (6,) .45 (7,) .45 (8,) 0.45 (9,) .45 (40,) -7.55
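The varcomp() calculation above is just the method of moments applied to the ANOVA mean squares. Here is a sketch of the same arithmetic in Python rather than MacAnova, for a balanced one-way random model with 5 groups of 8; the data are simulated stand-ins, not the sire data.

```python
# Method-of-moments (ANOVA) variance component estimates for a balanced
# one-way random-effects model, mirroring what varcomp() reports.
import numpy as np

rng = np.random.default_rng(1)
a, n = 5, 8                      # groups, observations per group
# simulated response: error sd 8, group-effect sd 5 (illustrative values)
y = rng.normal(100, 8, size=(a, n)) + rng.normal(0, 5, size=(a, 1))

grand = y.mean()
ms_group = n * ((y.mean(axis=1) - grand) ** 2).sum() / (a - 1)       # MS(group)
ms_error = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (a * (n - 1))

# EMS(group) = V(ERROR) + 8 V(group), so solve for the components:
var_error = ms_error
var_group = (ms_group - ms_error) / n   # can come out negative when F < 1
print(var_error, var_group)
```

When MS(group) is smaller than MSE this estimate goes negative, which is exactly the situation varcomp() can produce and reml() avoids.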

Cmd> reml("wts=sire",random:"sire",usemle:t)
You can also get maximum likelihood estimates by using the keyword phrase usemle:t. The ordinary REML estimates are more popular, because the REML estimates of variance components are less biased.
component: theta
CONSTANT 8.55
component: phi
sire ERROR
component: thetavar
CONSTANT
CONSTANT
component: phivar
sire ERROR
sire
ERROR
component: phidf
sire .675
ERROR 5
component: gamma
(,) (,) (,) -.68 (4,) (5,)
component: gammavar
()
component: loglike
(,)

Cmd> print(lot,format:"f4.0",labels:f);lot<-factor(lot)
These are the data from problem 5- of Kuehl (1994, Duxbury). There are 8 randomly chosen lots of cotton seed, and 4 samples are taken from each lot. The response is the amount of aflatoxin on the seeds.
lot:

Cmd> print(at,format:"f4.0",labels:f)
at:
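The claim that REML variance estimates are less biased than ML shows up already in the simplest possible case: estimating a single variance from an i.i.d. normal sample, where ML divides the sum of squares by n and the REML-style estimate divides by n - 1. A tiny Python illustration with simulated data (not data from the handout):

```python
# ML vs REML-style bias in the simplest case: for an i.i.d. normal sample,
# ML divides the SS about the mean by n (biased down), REML by n - 1 (unbiased).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 2, size=10)
ss = ((x - x.mean()) ** 2).sum()
print(ss / len(x), ss / (len(x) - 1))   # ML estimate < REML estimate
```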

Cmd> anova("at=lot")
Here is the base anova. Again, it does not know about random or fixed terms. In this case, ERROR is the correct denominator for lot.
Model used is at=lot
DF SS MS
CONSTANT
lot
ERROR

Cmd> resvsyhat()
Constant variance looks pretty good.
[Plot: Standardized Residuals vs Fitted Values (Yhat)]

Cmd> resvsrankits()
A little hitch in the rankit plot, but not too bad.
[Plot: Standardized Residuals vs Normal Scores]

Cmd> ems("at=lot",random:"lot")
Here are the expected mean squares. lot is random. Again, these data are balanced (4 samples per lot), so we could get these by hand.
EMS(CONSTANT) = V(ERROR) + 4V(lot) + 32Q(CONSTANT)
EMS(lot) = V(ERROR) + 4V(lot)
EMS(ERROR) = V(ERROR)

Cmd> mixed("at=lot",random:"lot")
Here is the ANOVA with correct denominators. lot is highly significant.
DF MS Error DF Error MS F P value
CONSTANT 5.44e
lot e-05
ERROR MISSING MISSING

Cmd> varcomp("at=lot",random:"lot")
Here are the estimated variance components. Again, even though lot is highly significant, its estimated variance component is less than two SE from zero.
Estimate SE DF
lot
ERROR

Cmd> emsout<-ems("at=lot",random:"lot",keep:T)
The keep:T keyword phrase makes ems return its information as a structure instead of printing it out.

Cmd> mixed(emsout)
We can use this ems output structure as input to mixed or varcomp. Most of the computation in mixed and varcomp is just doing the EMS. So, if the ems is slow and/or complicated, it might make sense to do it once and save the output.
DF MS Error DF Error MS F P value
CONSTANT 5.44e
lot e-05
ERROR MISSING MISSING

Cmd> varcomp(emsout)
Estimate SE DF
lot
ERROR
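The test mixed() performs for lot can be sketched in Python: form F as the ratio of MS(lot) to its EMS-determined denominator, here MS(ERROR), and refer it to an F distribution. The mean squares below are made-up illustrative values, not the aflatoxin results.

```python
# F test for a one-way random effect: numerator and denominator chosen so
# their EMS differ only by the component being tested.
from scipy import stats

ms_lot, df_lot = 300.0, 7       # 8 lots -> 7 df (illustrative MS value)
ms_error, df_error = 50.0, 24   # 8 lots x 4 samples -> 24 error df

F = ms_lot / ms_error
p = stats.f.sf(F, df_lot, df_error)   # upper-tail p-value
print(F, p)
```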

Cmd> print(mohms,format:"f5.0",labels:f)
These are data from problem 6.8 of Hicks and Turner (1999, Oxford). Ten resistors are chosen at random, and three operators are chosen at random. Each operator measures the resistance of each resistor twice, with the 60 measurements made in random order. Response is in milliohms.
mohms:

Cmd> print(oper,format:"f5.0",labels:f)
oper:

Cmd> print(part,format:"f5.0",labels:f)
part:

Cmd> anova("mohms=oper*part")
Basic ANOVA of the resistor data.
Model used is mohms=oper*part
DF SS MS
CONSTANT 9.809e e+06
oper
part
oper.part
ERROR

Cmd> chplot(mohms-residuals,residuals,oper)
Here is a problem. One operator tends to measure a bit low, and he has more variance as well. This may cause some problems later, and no reasonable power transformation will fix this.
[Plot: RESIDUALS vs fitted values, coded by oper]

Cmd> ems("mohms=oper*part",random:vector("oper","part"))
Here are the EMS. We can see that the two-factor interaction is the appropriate denominator for main effects.
EMS(CONSTANT) = V(ERROR) + 2V(oper.part) + 6V(part) + 20V(oper) + 60Q(CONSTANT)
EMS(oper) = V(ERROR) + 2V(oper.part) + 20V(oper)
EMS(part) = V(ERROR) + 2V(oper.part) + 6V(part)
EMS(oper.part) = V(ERROR) + 2V(oper.part)
EMS(ERROR) = V(ERROR)

Cmd> mixed("mohms=oper*part",random:vector("oper","part"))
There is strong evidence for variation among operators, and there is no evidence of variation between parts (that's good) or an interaction.
DF MS Error DF Error MS F P value
CONSTANT 9.80e
oper e-09
part
oper.part
ERROR MISSING MISSING

Cmd> varcomp("mohms=oper*part",random:vector("oper","part"))
Note the negative estimated variance component. This occurs when the F is less than 1. Also note that operator is highly significant, but less than two SE from zero.
Estimate SE DF
oper
part
oper.part
ERROR
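For a balanced two-way random model, the EMS coefficients follow a simple counting rule: each variance component enters a term's EMS with coefficient equal to the number of observations at one level of that component. A sketch in Python, with term names chosen to match the resistor example (3 operators, 10 parts, 2 replicates):

```python
# EMS coefficient table for a balanced two-way random model.
# The coefficient of V(term) is the number of observations per level of
# that term: b*n per operator, a*n per part, n per oper.part cell, 1 per
# error observation.
a, b, n = 3, 10, 2   # operators, parts, replicates

ems = {
    "oper":      {"ERROR": 1, "oper.part": n, "oper": b * n},
    "part":      {"ERROR": 1, "oper.part": n, "part": a * n},
    "oper.part": {"ERROR": 1, "oper.part": n},
    "ERROR":     {"ERROR": 1},
}
# The correct denominator for a term is the MS whose EMS matches it with
# the term's own component removed -- oper.part for both main effects.
print(ems["oper"])   # {'ERROR': 1, 'oper.part': 2, 'oper': 20}
```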

Cmd> reml("mohms=oper*part",random:vector("oper","part"))
Here is the REML fit to the same data. Note that two of the variance components are estimated as zero; only the operator and error variance components are nonzero.
component: theta
CONSTANT
component: phi
oper
part 0
oper.part 0
ERROR 40.
component: thetavar
CONSTANT
CONSTANT 6.0
component: phivar
oper part oper.part ERROR
oper
part
oper.part
ERROR
component: phidf
oper .898
part 0
oper.part 0
ERROR 57
component: gamma
(,) (,) (,) .7 (4,) 0 (5,) 0 (6,) 0 (7,) 0 (8,) 0 (9,) 0 (0,) 0 (,) 0 (,) 0 (,) 0 (4,) 0 (5,) 0 (6,) 0 (7,) 0 (8,) 0 (9,) 0 (0,) 0 (,) 0 (,) 0 (,) 0 (4,) 0 (5,) 0 (6,) 0 (7,) 0 (8,) 0

(9,) 0 (0,) 0 (,) 0 (,) 0 (,) 0 (4,) 0 (5,) 0 (6,) 0 (7,) 0 (8,) 0 (9,) 0 (40,) 0 (4,) 0 (4,) 0 (4,) 0
component: gammavar
() (6) () (6) () (6) () (6) (4)
component: loglike
(,)

Cmd> invchi(vector(1-.05/2,.05/2),30)
To compute a confidence interval for an EMS, we need the upper and lower E/2 percent points of a chisquare. The EMS of MSE is σ², so we can use these percent points to form a confidence interval for σ².
()

Cmd> 30*5.68/invchi(vector(1-.05/2,.05/2),30)
Multiply the mean square by its degrees of freedom and divide by the upper and lower percent points from chisquare to get the confidence interval. Here, even with 30 df, the interval spans a factor of .
()

Cmd> 2*56/invchi(vector(1-.05/2,.05/2),2)
We can do an analogous computation for the EMS of any MS; it's just that most of these aren't of much interest. Here we form a 95% interval for σ² + 2σ²αβ + 20σ²α, the EMS for the MS of operator. This EMS is not of too much interest, and with only 2 df, the interval is a mile wide.
()

Cmd> invf(vector(1-.05/2,.05/2),2,18)
We can compute confidence intervals for the ratio of two EMS's using upper and lower F percent points. Let's get an interval for the ratio of the EMS for operator to the EMS for operator by part; this has 2 and 18 degrees of freedom.
()

Cmd> 6.7/invF(vector(1-.05/2,.05/2),2,18)
Divide the F-ratio (MS-oper over MS-oper.part) by the F percent points. This produces an interval for EMS-oper/EMS-oper.part, or (σ² + 2σ²αβ + 20σ²α)/(σ² + 2σ²αβ) = 1 + 20σ²α/(σ² + 2σ²αβ).
()

Cmd> (6.7/invF(vector(1-.05/2,.05/2),2,18)-1)/20
Subtract 1 and divide by 20 to get a confidence interval for σ²α/(σ² + 2σ²αβ). Note that the largest plausible ratio is almost 00 times the smallest!
()
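The chisquare interval for an EMS is easy to check in Python with scipy; the MSE and df below are stand-ins, not the handout's values.

```python
# Chisquare confidence interval for sigma^2: df*MSE divided by the upper
# and lower percent points of a chisquare on the MSE's df.
from scipy import stats

mse, df = 40.0, 30
lo = df * mse / stats.chi2.ppf(0.975, df)   # divide by the upper point
hi = df * mse / stats.chi2.ppf(0.025, df)   # divide by the lower point
print(lo, hi)
```

Even with 30 df the interval is wide; with very few df it becomes nearly uninformative, which is the point the handout makes for the operator MS.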

Cmd> invf(vector(1-.05/4,.05/4),2,18)
For variance components with exact F-tests (such as σ²α here), we can combine two E/2 confidence intervals to construct a E interval for the component of interest. We need F percent points with numerator and denominator df from the two MS, and chisquare percent points for the numerator df.
()

Cmd> invchi(vector(1-.05/4,.05/4),2)
()

Cmd> # We use the numerator df (2), the numerator MS (56), the observed F (6.7), the multiplier for the variance component of interest in its EMS (20 for σ²α), and the upper and lower F and chisquare percent points.

Cmd> 2*56*(1-5.645/6.7)/20/8.764
Here's the lower endpoint.
() 6.97

Cmd> 2*56*(1-.0588/6.7)/20/.0558
Here's the upper endpoint. Our estimate is in the interval, but the maximum is almost 400 times the minimum!
() 60.5

Cmd> .9*76.78/invchi(vector(1-.05/2,.05/2),.9)
As a simple, but crude, approximation, we can use the estimated variance component and its approximate degrees of freedom as if it were a simple mean square.
()

Cmd> 1-cumf(50/(50+20*10)*invf(.95,2,18),2,18)
Power is fairly simple for random effects. You need the probability that an F (MS/MS) is bigger than (EMS/EMS) times the rejection cutoff. Here, suppose that σ² = 50, σ²αβ = 0, and σ²α = 10. The test has 2 and 18 df, and we get power about .5.
()
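The power calculation can be replayed in Python. The variance component values here are assumptions chosen to match the worked numbers above (σ² = 50, σ²αβ = 0, σ²α = 10, test with 2 and 18 df), not estimates from data.

```python
# Power of the F test for a random effect: under the alternative, the
# observed F times EMS(denominator)/EMS(numerator) has an F distribution,
# so power = P(F > ratio * Fcrit).
from scipy import stats

df1, df2 = 2, 18
s2_err, s2_ab, s2_a = 50.0, 0.0, 10.0             # assumed components
ratio = (s2_err + 2 * s2_ab) / (s2_err + 2 * s2_ab + 20 * s2_a)
fcrit = stats.f.ppf(0.95, df1, df2)               # 5% rejection cutoff
power = stats.f.sf(ratio * fcrit, df1, df2)
print(power)
```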

Cmd> # Let's look at how often a confidence interval for error variance covers the true variance. We'll work with 5 degrees of freedom and consider 90% and 95% intervals for σ². The intervals are formed by dividing the error SS by chisquare percent points. When everything works right, the 95% intervals should miss 2.5% each high and low, and the 90% intervals should miss 5% each high and low.

Cmd> lo90<-1/invchi(.95,5);hi90<-1/invchi(.05,5)

Cmd> lo95<-1/invchi(.975,5);hi95<-1/invchi(.025,5)

Cmd> lo90;hi90
Here are the factors.
() ()

Cmd> lo95;hi95
() ()

Cmd> # We will take 10000 samples of size 6. For each sample of size 6 we'll compute the SS around the mean (with 5 df), and compute confidence intervals for the variance.

Cmd> sum(lo90*ss>1)/10000
This is for normally distributed data. We get about the fraction of misses high or low that we expect.
()

Cmd> sum(lo95*ss>1)/10000
() 0.07

Cmd> sum(hi90*ss<1)/10000
()

Cmd> sum(hi95*ss<1)/10000
() 0.045
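The coverage simulation is easy to reproduce in Python (numpy in place of MacAnova; the seed and replication count are my choices, and the 95% interval is shown):

```python
# Coverage of the chisquare interval for sigma^2 = 1: samples of size 6,
# SS about the mean with 5 df, 10000 replications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nrep, n = 10000, 6
lo95 = 1 / stats.chi2.ppf(0.975, n - 1)
hi95 = 1 / stats.chi2.ppf(0.025, n - 1)

x = rng.standard_normal((nrep, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

miss_above = np.mean(lo95 * ss > 1)   # whole interval above sigma^2 = 1
miss_below = np.mean(hi95 * ss < 1)   # whole interval below sigma^2 = 1
print(miss_above, miss_below)
```

For normal data both miss rates should be near .025; the nonnormal cases that follow in the handout swap in heavy-tailed, skewed, or uniform data for x and show how badly this breaks.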

Cmd> plot(rankits(z),z)
Now some nonnormal data. Here is an NPP of 500 points from a distribution with longer tails than normally distributed data.
[Plot: normal probability plot of z]

Cmd> sum(lo90*ss>1)/10000
() 0.5

Cmd> sum(lo95*ss>1)/10000
()

Cmd> sum(hi90*ss<1)/10000
()

Cmd> sum(hi95*ss<1)/10000
()
These error rates are much too high. The 90% ci only has coverage about .7, and the 95% ci has coverage about .80.

Cmd> plot(rankits(z),z)
Here is an NPP of data with longer tails.
[Plot: normal probability plot of z]

Cmd> sum(lo90*ss>1)/10000
()

Cmd> sum(lo95*ss>1)/10000
() 0.57

Cmd> sum(hi90*ss<1)/10000
() 0.45

Cmd> sum(hi95*ss<1)/10000
()
Check the error rates. The 90% ci has coverage about .4, and the 95% ci has coverage about .5.

Cmd> plot(rankits(z),z)
Here is an NPP of data that are mildly asymmetric, but not terribly outlier prone.
[Plot: normal probability plot of z]

Cmd> sum(lo90*ss>1)/10000
()

Cmd> sum(lo95*ss>1)/10000
()

Cmd> sum(hi90*ss<1)/10000
()

Cmd> sum(hi95*ss<1)/10000
()
The errors are about .5 to times what they should be.

Cmd> plot(rankits(z),z)
Now we finish up with some short-tailed data from a uniform distribution.
[Plot: normal probability plot of z]

Cmd> sum(lo90*ss>1)/10000
()

Cmd> sum(lo95*ss>1)/10000
()

Cmd> sum(hi90*ss<1)/10000
() 0.00

Cmd> sum(hi95*ss<1)/10000
()
These error rates are a factor of 5 to 0 too small. Our coverage is actually greater than the nominal 90 or 95% when the errors are short tailed.


Inference for the Regression Coefficient

Inference for the Regression Coefficient Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression line. We can shows that b 0 and b 1 are the unbiased estimates

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

STAT 705 Chapter 19: Two-way ANOVA

STAT 705 Chapter 19: Two-way ANOVA STAT 705 Chapter 19: Two-way ANOVA Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 38 Two-way ANOVA Material covered in Sections 19.2 19.4, but a bit

More information

9 One-Way Analysis of Variance

9 One-Way Analysis of Variance 9 One-Way Analysis of Variance SW Chapter 11 - all sections except 6. The one-way analysis of variance (ANOVA) is a generalization of the two sample t test to k 2 groups. Assume that the populations of

More information

Battery Life. Factory

Battery Life. Factory Statistics 354 (Fall 2018) Analysis of Variance: Comparing Several Means Remark. These notes are from an elementary statistics class and introduce the Analysis of Variance technique for comparing several

More information

SMAM 314 Exam 42 Name

SMAM 314 Exam 42 Name SMAM 314 Exam 42 Name Mark the following statements True (T) or False (F) (10 points) 1. F A. The line that best fits points whose X and Y values are negatively correlated should have a positive slope.

More information

Content by Week Week of October 14 27

Content by Week Week of October 14 27 Content by Week Week of October 14 27 Learning objectives By the end of this week, you should be able to: Understand the purpose and interpretation of confidence intervals for the mean, Calculate confidence

More information

STAT 350: Geometry of Least Squares

STAT 350: Geometry of Least Squares The Geometry of Least Squares Mathematical Basics Inner / dot product: a and b column vectors a b = a T b = a i b i a b a T b = 0 Matrix Product: A is r s B is s t (AB) rt = s A rs B st Partitioned Matrices

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

COSC 341 Human Computer Interaction. Dr. Bowen Hui University of British Columbia Okanagan

COSC 341 Human Computer Interaction. Dr. Bowen Hui University of British Columbia Okanagan COSC 341 Human Computer Interaction Dr. Bowen Hui University of British Columbia Okanagan 1 Last Topic Distribution of means When it is needed How to build one (from scratch) Determining the characteristics

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

Lecture 26: Chapter 10, Section 2 Inference for Quantitative Variable Confidence Interval with t

Lecture 26: Chapter 10, Section 2 Inference for Quantitative Variable Confidence Interval with t Lecture 26: Chapter 10, Section 2 Inference for Quantitative Variable Confidence Interval with t t Confidence Interval for Population Mean Comparing z and t Confidence Intervals When neither z nor t Applies

More information

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal Hypothesis testing, part 2 With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal 1 CATEGORICAL IV, NUMERIC DV 2 Independent samples, one IV # Conditions Normal/Parametric Non-parametric

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

Lecture 4. Random Effects in Completely Randomized Design

Lecture 4. Random Effects in Completely Randomized Design Lecture 4. Random Effects in Completely Randomized Design Montgomery: 3.9, 13.1 and 13.7 1 Lecture 4 Page 1 Random Effects vs Fixed Effects Consider factor with numerous possible levels Want to draw inference

More information

The Chi-Square Distributions

The Chi-Square Distributions MATH 03 The Chi-Square Distributions Dr. Neal, Spring 009 The chi-square distributions can be used in statistics to analyze the standard deviation of a normally distributed measurement and to test the

More information

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b). Confidence Intervals 1) What are confidence intervals? Simply, an interval for which we have a certain confidence. For example, we are 90% certain that an interval contains the true value of something

More information

Ch18 links / ch18 pdf links Ch18 image t-dist table

Ch18 links / ch18 pdf links Ch18 image t-dist table Ch18 links / ch18 pdf links Ch18 image t-dist table ch18 (inference about population mean) exercises: 18.3, 18.5, 18.7, 18.9, 18.15, 18.17, 18.19, 18.27 CHAPTER 18: Inference about a Population Mean The

More information

z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests

z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests Chapters 3.5.1 3.5.2, 3.3.2 Prof. Tesler Math 283 Fall 2018 Prof. Tesler z and t tests for mean Math

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

Essential of Simple regression

Essential of Simple regression Essential of Simple regression We use simple regression when we are interested in the relationship between two variables (e.g., x is class size, and y is student s GPA). For simplicity we assume the relationship

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Lecture notes 13: ANOVA (a.k.a. Analysis of Variance)

Lecture notes 13: ANOVA (a.k.a. Analysis of Variance) Lecture notes 13: ANOVA (a.k.a. Analysis of Variance) Outline: Testing for a difference in means Notation Sums of squares Mean squares The F distribution The ANOVA table Part II: multiple comparisons Worked

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

The Chi-Square Distributions

The Chi-Square Distributions MATH 183 The Chi-Square Distributions Dr. Neal, WKU The chi-square distributions can be used in statistics to analyze the standard deviation σ of a normally distributed measurement and to test the goodness

More information

Chapter 10: Chi-Square and F Distributions

Chapter 10: Chi-Square and F Distributions Chapter 10: Chi-Square and F Distributions Chapter Notes 1 Chi-Square: Tests of Independence 2 4 & of Homogeneity 2 Chi-Square: Goodness of Fit 5 6 3 Testing & Estimating a Single Variance 7 10 or Standard

More information

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2015 Notes

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2015 Notes Data Analysis Standard Error and Confidence Limits E80 Spring 05 otes We Believe in the Truth We frequently assume (believe) when making measurements of something (like the mass of a rocket motor) that

More information

Z-tables. January 12, This tutorial covers how to find areas under normal distributions using a z-table.

Z-tables. January 12, This tutorial covers how to find areas under normal distributions using a z-table. Z-tables January 12, 2019 Contents The standard normal distribution Areas above Areas below the mean Areas between two values of Finding -scores from areas Z tables in R: Questions This tutorial covers

More information

Confidence Intervals, Testing and ANOVA Summary

Confidence Intervals, Testing and ANOVA Summary Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0

More information

The One-Way Independent-Samples ANOVA. (For Between-Subjects Designs)

The One-Way Independent-Samples ANOVA. (For Between-Subjects Designs) The One-Way Independent-Samples ANOVA (For Between-Subjects Designs) Computations for the ANOVA In computing the terms required for the F-statistic, we won t explicitly compute any sample variances or

More information

assumes a linear relationship between mean of Y and the X s with additive normal errors the errors are assumed to be a sample from N(0, σ 2 )

assumes a linear relationship between mean of Y and the X s with additive normal errors the errors are assumed to be a sample from N(0, σ 2 ) Multiple Linear Regression is used to relate a continuous response (or dependent) variable Y to several explanatory (or independent) (or predictor) variables X 1, X 2,, X k assumes a linear relationship

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

- a value calculated or derived from the data.

- a value calculated or derived from the data. Descriptive statistics: Note: I'm assuming you know some basics. If you don't, please read chapter 1 on your own. It's pretty easy material, and it gives you a good background as to why we need statistics.

More information

Confidence Intervals 1

Confidence Intervals 1 Confidence Intervals 1 November 1, 2017 1 HMS, 2017, v1.1 Chapter References Diez: Chapter 4.2 Navidi, Chapter 5.0, 5.1, (Self read, 5.2), 5.3, 5.4, 5.6, not 5.7, 5.8 Chapter References 2 Terminology Point

More information

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 Part 1 of this document can be found at http://www.uvm.edu/~dhowell/methods/supplements/mixed Models for Repeated Measures1.pdf

More information

Confidence Interval for the mean response

Confidence Interval for the mean response Week 3: Prediction and Confidence Intervals at specified x. Testing lack of fit with replicates at some x's. Inference for the correlation. Introduction to regression with several explanatory variables.

More information

STAT 328 (Statistical Packages)

STAT 328 (Statistical Packages) Department of Statistics and Operations Research College of Science King Saud University Exercises STAT 328 (Statistical Packages) nashmiah r.alshammari ^-^ Excel and Minitab - 1 - Write the commands of

More information

Nesting and Mixed Effects: Part I. Lukas Meier, Seminar für Statistik

Nesting and Mixed Effects: Part I. Lukas Meier, Seminar für Statistik Nesting and Mixed Effects: Part I Lukas Meier, Seminar für Statistik Where do we stand? So far: Fixed effects Random effects Both in the factorial context Now: Nested factor structure Mixed models: a combination

More information

Confidence Intervals. - simply, an interval for which we have a certain confidence.

Confidence Intervals. - simply, an interval for which we have a certain confidence. Confidence Intervals I. What are confidence intervals? - simply, an interval for which we have a certain confidence. - for example, we are 90% certain that an interval contains the true value of something

More information

Notes for Week 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1

Notes for Week 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1 Notes for Wee 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1 Exam 3 is on Friday May 1. A part of one of the exam problems is on Predictiontervals : When randomly sampling from a normal population

More information

Statistics 512: Applied Linear Models. Topic 9

Statistics 512: Applied Linear Models. Topic 9 Topic Overview Statistics 51: Applied Linear Models Topic 9 This topic will cover Random vs. Fixed Effects Using E(MS) to obtain appropriate tests in a Random or Mixed Effects Model. Chapter 5: One-way

More information

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2012 Notes

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2012 Notes Data Analysis Standard Error and Confidence Limits E80 Spring 0 otes We Believe in the Truth We frequently assume (believe) when making measurements of something (like the mass of a rocket motor) that

More information

Inference for Regression Inference about the Regression Model and Using the Regression Line

Inference for Regression Inference about the Regression Model and Using the Regression Line Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about

More information

Inference for Regression Simple Linear Regression

Inference for Regression Simple Linear Regression Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating

More information

Lecture 14: ANOVA and the F-test

Lecture 14: ANOVA and the F-test Lecture 14: ANOVA and the F-test S. Massa, Department of Statistics, University of Oxford 3 February 2016 Example Consider a study of 983 individuals and examine the relationship between duration of breastfeeding

More information