Swabs, revisited


18 families, each with 3 children (in well-defined age intervals), were followed over a period of time, during which repeated swabs were taken. The variable swab indicates how many times the swab was positive for pneumococcus. The families were subdivided into 3 groups according to the factor crowding, which describes the space available for the household.

1. What can we say about the distribution of the number of positive swabs for a given individual?

(a) Which values can it take?

We are dealing with counts, which may take any non-negative integer value: zero or any positive integer.

(b) Did you look at a residual plot from the exercise last time? If not, make it now. Does it look nice?

Since we found no interaction between crowding and name in the previous exercise, we make a residual plot from the additive model (and also show the output):

    proc mixed data=swab;
      class crowding family name;
      /* outp= stores predictions and residuals for the residual plot */
      model swab=crowding name / outp=udp outpm=udpm s cl;
      random family(crowding);
    run;

Output (some of it)

Covariance Parameter Estimates

Cov Parm            Estimate
family(crowding)
Residual

Solution for Fixed Effects

Effect      crowding    name      Estimate    Standard Error    DF    t Value    Pr > |t|
Intercept
crowding    crow
crowding    overcrow
crowding    uncrow
name                    child1
name                    child2                                                   <.0001
name                    child3                                                   <.0001
name                    father
name                    mother

Solution for Fixed Effects

Effect      crowding    name      Alpha    Lower    Upper
Intercept
crowding    crow
crowding    overcrow
crowding    uncrow                .        .        .
name                    child1
name                    child2
name                    child3
name                    father
name                    mother    .        .        .

Type 3 Tests of Fixed Effects

Effect      Num DF    Den DF    F Value    Pr > F
crowding
name                                       <.0001

The predicted values are now in the variable Pred in the data set udp, and the corresponding residuals are in the variable Resid, so we can make a residual plot:

    proc gplot gout=plotud data=udp;
      /* one plotting symbol per crowding group */
      plot Resid*Pred=crowding / haxis=axis1 vaxis=axis2 frame;
      axis1 value=(h=5) minor=none label=(h=5);
      axis2 value=(h=5) minor=none label=(a=90 r=0 h=5);
      symbol1 v='O' i=none c=blue l=1 h=5 w=2 r=1;
      symbol2 v='C' i=none c=red l=2 h=5 w=2 r=1;
      symbol3 v='U' i=none c=green l=33 h=5 w=2 r=1;
    run;

The residual plot is clearly trumpet-shaped, with large residuals (both positive and negative) corresponding to the larger predicted values. This usually calls for a logarithmic transformation.

(c) Can we improve our assumptions by making a logarithmic transformation of the outcome?

No, we cannot, because of the zeroes. These observations would drop out of an analysis on the logarithmic scale, which would amount to serious missingness (informative missingness, since we would exclude all the smallest observations).

2. Analyse the data using an appropriate generalized mixed model.

Since we are dealing with a count variable with no well-defined upper limit, we suspect that a Poisson distribution is appropriate. Since such an analysis could potentially give results that differ from the previous (normality-based) analyses, we start by including the interaction term between crowding and name again:

    proc glimmix data=swab;
      class crowding family name;
      model swab=crowding name crowding*name / dist=poisson link=log;
      random family(crowding);
    run;

and thereby create the output

The GLIMMIX Procedure

Model Information

Data Set                     WORK.SWAB
Response Variable            swab
Response Distribution        Poisson
Link Function                Log
Variance Function            Default
Variance Matrix              Not blocked
Estimation Technique         Residual PL
Degrees of Freedom Method    Containment

Class Level Information

Class       Levels    Values
crowding    3         crow overcrow uncrow
family
name        5         child1 child2 child3 father mother

Number of Observations Read    90
Number of Observations Used    90

Fit Statistics

-2 Res Log Pseudo-Likelihood
Generalized Chi-Square
Gener. Chi-Square / DF          1.97

Covariance Parameter Estimates

Cov Parm            Estimate    Standard Error
family(crowding)

Type III Tests of Fixed Effects

Effect           Num DF    Den DF    F Value    Pr > F
crowding
name                                            <.0001
crowding*name

Since the interaction term is still not significant, we exclude it again, and at the same time we make a data set with predicted values and residuals (default residuals, on the log scale):

    proc glimmix data=swab;
      class crowding family name;
      model swab=crowding name / dist=poisson link=log s cl;
      random family(crowding);
      /* predicted values (p) and residuals (r) for the residual plot */
      output out=poisud pred=p resid=r;
    run;

so now we get the output

Fit Statistics

-2 Res Log Pseudo-Likelihood
Generalized Chi-Square
Gener. Chi-Square / DF          1.96

Covariance Parameter Estimates

Cov Parm            Estimate    Standard Error
family(crowding)

Solutions for Fixed Effects

Effect      crowding    name      Estimate    Standard Error    DF    t Value    Pr > |t|
Intercept                                                                        <.0001
crowding    crow
crowding    overcrow
crowding    uncrow
name                    child1
name                    child2                                                   <.0001
name                    child3                                                   <.0001
name                    father
name                    mother

Solutions for Fixed Effects

Effect      crowding    name      Alpha    Lower    Upper
Intercept
crowding    crow
crowding    overcrow
crowding    uncrow                .        .        .
name                    child1
name                    child2
name                    child3
name                    father
name                    mother    .        .        .

Type III Tests of Fixed Effects

Effect      Num DF    Den DF    F Value    Pr > F
crowding
name                                       <.0001
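With a log link, a difference between two crowding coefficients is a log ratio of expected counts, so exponentiating it gives a ratio of means. As a minimal sketch (not part of the original solution), the contrast could be requested directly from glimmix with an estimate statement. The label is arbitrary, and the coefficients 0 1 -1 assume that the class levels are ordered crow, overcrow, uncrow, as listed in the Class Level Information.

```sas
proc glimmix data=swab;
  class crowding family name;
  model swab=crowding name / dist=poisson link=log s cl;
  random family(crowding);
  /* Assumed level order: crow, overcrow, uncrow,                 */
  /* so 0 1 -1 is overcrow minus uncrow on the log scale.         */
  /* exp back-transforms the estimate and its confidence limits.  */
  estimate 'overcrow vs uncrow' crowding 0 1 -1 / cl exp;
run;
```

The exponentiated estimate reads as "times as many positive swabs", which is the multiplicative scale that the log link calls for.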

Which effects do you find? Compare to the results from the previous analyses. In particular, quantify the difference between overcrowded and uncrowded on an appropriate scale.

We find significant effects of both factors, just as we did in the normal distribution analyses from the previous exercise. But the interpretation differs, since we are now performing the analyses with a logarithmic link. The effect of overcrow vs. uncrow is estimated to be 0.5693, with confidence interval (0.1957, 0.9429). When we back-transform, this means that overcrowded families have exp(0.5693) = 1.77 times as many positive swabs as uncrowded families. The confidence interval is (exp(0.1957), exp(0.9429)) = (1.22, 2.57).

This is in contrast to the normality-based results, where overcrowded families were found to have 5.6 more positive swabs than uncrowded families, with confidence interval (1.91, 9.29).

The P-values in the Poisson analysis are both lower than in the corresponding normality-based analysis. This may be because the Poisson model misjudges the size of the variation: in the Poisson distribution, the variance is not a free parameter but is identical to the mean (so the standard deviation equals the square root of the mean). We return to this below.

The residual plot in the Poisson analysis looks like this:

This looks nicer than before, although now a bit overtransformed. Possibly, a square root transformation would provide better variance homogeneity (this is true in general for Poisson variables, whose variance is stabilized by the square root, and therefore variance homogeneity is not really an assumption for these analyses).

3. Make an illustration of the model, using predicted values from the model, and compare to the previous exercise.
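One way such an illustration could be produced is sketched below; it is not necessarily how the original figure was made. The sketch refits the additive Poisson model and asks glimmix for predictions based on the fixed effects only, back-transformed to the count scale. The data set name poispred and the variable name mu_fixed are hypothetical, and the symbol settings are arbitrary choices.

```sas
proc glimmix data=swab;
  class crowding family name;
  model swab=crowding name / dist=poisson link=log;
  random family(crowding);
  /* noblup: leave out the family random effect        */
  /* ilink:  back-transform from log scale to counts   */
  output out=poispred pred(noblup ilink)=mu_fixed;
run;

/* sort so that the joined lines follow the order of name within crowding */
proc sort data=poispred;
  by crowding name;
run;

symbol1 v=circle i=join c=blue;
symbol2 v=circle i=join c=red;
symbol3 v=circle i=join c=green;

proc gplot data=poispred;
  /* predicted mean count per family member, one line per crowding group */
  plot mu_fixed*name=crowding / frame;
run;
quit;
```

On the count scale the three crowding profiles are proportional rather than parallel, which is how a model that is additive on the log scale looks after back-transformation.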

The illustration is to be compared to the one based on the previous (normality-based) predictions.

Now, which of these does the best job of capturing the original observations?

4. How can we interpret the estimate of variation between families?

We mimic the interpretation from a binary analysis and picture a situation where we pick two families at random from the same crowding group, and calculate the median ratio of the highest vs. the lowest theoretical mean.
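The median ratio referred to here is presumably the count-data analogue of the median odds ratio from binary mixed models. As a sketch under that assumption, with the family effects written as $A_{cf}$ with variance $\omega^2$ (notation borrowed from the model formulation given later in this solution):

```latex
\[
A_{cf} \sim N(0,\omega^2), \qquad
A_{cf} - A_{cf'} \sim N(0,\, 2\omega^2),
\]
\[
\text{median ratio}
  = \exp\!\left( \sqrt{2\omega^2}\;\Phi^{-1}(0.75) \right)
  \approx \exp(0.954\,\omega),
\]
```

where $\Phi^{-1}(0.75) \approx 0.6745$ is the upper quartile of the standard normal distribution. The ratio is always at least 1, because the larger of the two theoretical means is placed in the numerator.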

The median ratio is exp( ) = 1.29.

On top of this random variation between the theoretical means for each family comes the Poisson variation, which depends on the mean value.

But: there is evidence of overdispersion (1.96 > 1), meaning that the unsystematic within-family variation is larger than the Poisson variation accounts for. Such overdispersion may be included in a glimmix analysis by adding random _residual_;:

    proc glimmix data=swab;
      class crowding family name;
      model swab=crowding name / dist=poisson link=log s cl;
      random intercept / subject=family(crowding);
      /* multiplicative overdispersion parameter */
      random _residual_;
    run;

which provides the output

The GLIMMIX Procedure

Fit Statistics

-2 Res Log Pseudo-Likelihood
Generalized Chi-Square
Gener. Chi-Square / DF          2.12

Covariance Parameter Estimates

Cov Parm         Subject             Estimate    Standard Error
Intercept        family(crowding)
Residual (VC)

Solutions for Fixed Effects

Effect      crowding    name      Estimate    Standard Error    DF    t Value    Pr > |t|
Intercept                                                                        <.0001
crowding    crow
crowding    overcrow
crowding    uncrow
name                    child1
name                    child2                                                   <.0001
name                    child3                                                   <.0001
name                    father
name                    mother

Solutions for Fixed Effects

Effect      crowding    name      Alpha    Lower    Upper
Intercept
crowding    crow
crowding    overcrow
crowding    uncrow                .        .        .
name                    child1
name                    child2
name                    child3
name                    father
name                    mother    .        .        .

Type III Tests of Fixed Effects

Effect      Num DF    Den DF    F Value    Pr > F
crowding
name                                       <.0001

Including such overdispersion does not change much in the interpretation of the fixed effects, but we note that the P-values are now more or less identical to the normality-based results.

The interpretation of the overdispersion estimate is that the residual variation is more than twice as large as it should be under the Poisson assumption. The analysis above is therefore not based on a real model; rather, the standard errors of the estimates are simply corrected to incorporate this extra variance, which of course increases the P-values.

If we were to build a real model with overdispersion, we would write:

\[ y_{cfn} \sim \mathrm{Poisson}(\lambda_{cfn}) \]
\[ \log(\lambda_{cfn}) = \mu + \beta_c + \gamma_n + A_{cf} + \varepsilon_{cfn} \]
\[ A_{cf} \sim N(0, \omega^2), \qquad \varepsilon_{cfn} \sim N(0, \sigma^2) \]

This will also require two random statements, only now like this:

    proc glimmix data=swab;
      class crowding family name;
      model swab=crowding name / dist=poisson link=log s cl;
      random intercept / subject=family(crowding);
      /* one random effect per individual (name within family) */
      random intercept / subject=name*family(crowding);
    run;

but unfortunately this model cannot be estimated by glimmix:

The GLIMMIX Procedure

Iteration History

Iteration    Restarts    Subiterations    Objective Function    Change    Max Gradient

Did not converge.

Covariance Parameter Estimates

Cov Parm     Subject                 Estimate    Standard Error
Intercept    family(crowding)
Intercept    family*name(crowdin)
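To see why the extra term $\varepsilon_{cfn}$ in the model formulation above produces overdispersion, the following sketch works out the mean and variance of a count for a given family, that is, conditional on $A_{cf}$. The shorthand $m_{cfn}$ is introduced here for illustration only and does not appear in the original solution.

```latex
\[
m_{cfn} = \exp(\mu + \beta_c + \gamma_n + A_{cf}), \qquad
\lambda_{cfn} = m_{cfn}\, e^{\varepsilon_{cfn}}, \qquad
\varepsilon_{cfn} \sim N(0,\sigma^2)
\]
\[
E(y_{cfn} \mid A_{cf}) = m_{cfn}\, e^{\sigma^2/2}
\]
\[
\mathrm{Var}(y_{cfn} \mid A_{cf})
  = E(\lambda_{cfn} \mid A_{cf}) + \mathrm{Var}(\lambda_{cfn} \mid A_{cf})
  = m_{cfn}\, e^{\sigma^2/2} + m_{cfn}^2\, e^{\sigma^2}\left(e^{\sigma^2} - 1\right)
\]
```

The variance exceeds the mean whenever $\sigma^2 > 0$, and the excess grows with the square of the mean. This differs from the random _residual_ correction, which inflates the Poisson variance by a constant factor.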
