Correlated data: Non-normal outcomes. Faculty of Health Sciences.


Faculty of Health Sciences
Correlated data: Non-normal outcomes
Lene Theil Skovgaard
December 5, 2014

Outline
- Generalized linear models
- Generalized linear mixed models
  - Population average models (PA)
  - Subject specific models (SS)
- Examples with counts
  - Leprosy
  - Seizures (briefly)
- Two examples with binary outcome
  - Amenorrhea (longitudinal)
  - Smoking among school children (cluster)

Non-normal data
Typical data from e.g. epidemiology are often not normally distributed (binary, ordinal, counts, survival, ...).
Generalized linear models (in exponential families) are multiple regression models on a scale that corresponds to the data:
- Normal (link=identity): traditional linear models
- Binomial (link=logit): logistic regression
- Poisson (link=log): log-linear models, Poisson regression

Reminder on binary data
Examples of binary outcomes:
- infection after surgery
- smoking among school children
- amenorrhea among contracepting women
A binary variable $U$ has a Bernoulli distribution, meaning that $P(U = 1) = p$ and $P(U = 0) = 1 - p$.
For such an outcome, the mean value is $p$ and the variance is $p(1 - p)$.

Binomial data
If we sum up binary observations,
$Y = \sum_{i=1}^{n} U_i = U_1 + \cdots + U_n$,
e.g.
- number of infections for each hospital
- number of smokers in each school class
- number of women with amenorrhea for each general practice
we get a Binomial distribution, $Y \sim \mathrm{Bin}(n, p)$.

Examples of Binomial distributions
[Figure: Binomial distributions for n = 4, 20 and 50 and p = 0.02, 0.2 and 0.5]

Binomial distribution, and approximations
The Binomial variable $Y$ has point probabilities
$P(Y = m) = \binom{n}{m} p^m (1 - p)^{n - m}$
Its mean is $np$ and its variance is $np(1 - p)$.
When $n$ is large, this distribution is cumbersome to work with, so we use approximations:
- p moderate (not too close to 0 or 1): Normal distribution
- p close to either 0 or 1: Poisson distribution

Poisson distribution
Counts with no well-defined upper limit:
- the number of cancer cases in a specific community during a specific year
- the number of positive swabs over a certain period of time
Law of rare events: As the count parameter $n$ in a Binomial distribution gets larger and the parameter $p$ gets close to 0 (or close to 1, in which case we count the complementary events $n - Y$ instead), the Binomial probabilities are approximately equal to those of the Poisson distribution
$P(Y = m) = \frac{\lambda^m}{m!} \exp(-\lambda)$
where $\lambda = np$ is the mean value, as well as the variance.
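The quality of the Poisson approximation described above can be checked numerically. The following Python sketch (not part of the original slides; the function names are illustrative) compares Binomial and Poisson point probabilities for n = 50 and the three values of p used in the figure.

```python
import math


def binom_pmf(m: int, n: int, p: float) -> float:
    """Binomial point probability P(Y = m) for Y ~ Bin(n, p)."""
    return math.comb(n, m) * p**m * (1 - p) ** (n - m)


def poisson_pmf(m: int, lam: float) -> float:
    """Poisson point probability P(Y = m) with mean (and variance) lam."""
    return lam**m / math.factorial(m) * math.exp(-lam)


def max_abs_diff(n: int, p: float) -> float:
    """Largest absolute difference between Bin(n, p) and Poisson(np) over m = 0..n."""
    lam = n * p
    return max(abs(binom_pmf(m, n, p) - poisson_pmf(m, lam)) for m in range(n + 1))


for p in (0.02, 0.2, 0.5):
    print(f"n=50, p={p}: max |Binomial - Poisson| = {max_abs_diff(50, p):.4f}")
```

The discrepancy grows as p moves away from 0, which is why the Poisson approximation is reserved for rare events and the Normal approximation for moderate p.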

Generalized linear models
Outcome variable $Y_i$, with a distribution from an exponential family (includes Normal, Binomial, Poisson, Gamma, ...), with
- Mean value: $\mu_i$
- Link function: $g$, assumed linear in covariates, i.e.
  $g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} = X_i^T \beta$
  where $X_i$ denotes the covariate vector for individual $i$.
Typical choices: Normal (link=identity), Binomial (link=logit), Poisson (link=log).

Generalized linear MIXED models
Outcome variable $Y_{ij}$, e.g. the j'th measurement time for individual $i$:
- Mean value: $\mu_{ij}$
- Link function: $g$, assumed linear in the covariate vector $X_{ij}$.
Two kinds of models:
- Population average models (PA):
  $g(\mu_{ij}) = \beta_0 + \beta_1 x_{ij1} + \cdots + \beta_k x_{ijk} = X_{ij}^T \beta$
  and $(Y_{ij_1}, Y_{ij_2})$ are associated (correlated)
- Subject-specific models (SS):
  $g(\mu_{ij}) = \beta_0 + \beta_1 x_{ij1} + \cdots + \beta_k x_{ijk} + b_i$, with $b_i \sim N(0, \sigma_b^2)$

The two model types
- Marginal models, or Population average (PA): Describe covariate effects on the population mean, e.g. the expected difference between the effects of two treatments (corresponds to the repeated statement)
- Mixed effects models, or Subject specific (SS): Describe covariate effects on specific individuals (or clusters), e.g. expected change over time, or differences between boys and girls in the same school class (corresponds to the random statement)

Marginal models = Population Average (PA)
We specify only
- Marginal mean, $E(Y_{ij} \mid X_{ij}) = \mu_{ij}$, where $g(\mu_{ij}) = X_{ij}^T \beta$, i.e. covariate effects as usual
- Distribution (Normal, Binomial, Poisson, ...)
- Marginal variance, $\phi v(\mu_{ij})$, depending on the distribution
- Some measure of association for Y's belonging to the same individual/unit
This creates problems:
- Multivariate Binomial and Poisson distributions do not exist
- It is more of an estimation procedure than a model

Marginal models, technicalities
Since we do not actually have a model, we cannot use a maximum likelihood approach. Instead, we use a GEE: Generalized estimating equation (written in vector notation)
$\sum_i D_i^T V_i^{-1} (y_i - \mu_i) = 0$
where $V_i$ is the (working) covariance matrix $\mathrm{Cov}(Y_i)$ and $D_i$ is the matrix of derivatives of the mean value $\mu_i$ with respect to $\beta$.

Marginal models, technicalities II
The GEE method
- requires an iterative procedure
- gives consistent estimates of $\beta$ (they approach the correct value when the sample size is large), even if $\mathrm{Cov}(Y_i)$ is incorrect
- gives estimates that are asymptotically Normal (i.e. for large sample size, we can construct confidence intervals as plus/minus 2 standard errors)
- the standard error of $\hat\beta$ should be based on the empirical sandwich estimator, to allow for possible overdispersion and general misspecification of $\mathrm{Cov}(Y_i)$

Residual variance for non-normal data
In general, there is no free variance parameter, since the variance is determined from the mean value:
- Normal (link=identity): free variance parameter $\sigma^2$
- Binomial (link=logit): variance $np(1 - p)$
- Poisson (link=log): variance $\lambda = E(Y)$

Overdispersion
The variance may be seen to be larger than determined by the distribution. This can be caused by
- omitted covariates (isn't that always the case?)
- unrecognized clusters
- heterogeneity, e.g. a zero-group (non-susceptibles)
Traditional solution: An over-dispersion parameter $\phi$ is estimated and multiplied onto the variance
or more generally: Use the empirical sandwich estimator of $\mathrm{Cov}(Y_i)$
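One common way to estimate the over-dispersion parameter mentioned above is the Pearson chi-square statistic divided by its degrees of freedom. The Python sketch below assumes a Poisson variance function (variance equal to the mean) and uses made-up counts; it is an illustration, not the computation used by the SAS procedures in these slides.

```python
def pearson_dispersion(counts, fitted_means, n_params):
    """Pearson estimate of phi for a Poisson-type variance function v(mu) = mu.

    phi_hat = sum((y - mu)^2 / mu) / (n - number of mean parameters)
    """
    pearson = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted_means))
    return pearson / (len(counts) - n_params)


# Hypothetical counts, fitted with an intercept-only model (common mean).
counts = [0, 2, 4, 10]
common_mean = sum(counts) / len(counts)
phi_hat = pearson_dispersion(counts, [common_mean] * len(counts), n_params=1)
print(f"estimated dispersion: {phi_hat:.3f}")  # values well above 1 suggest overdispersion
```

Multiplying the model-based variance by this estimate widens the standard errors by a factor of its square root.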

Mixed effects models = Subject Specific models (SS)
Observations: $Y_{ij}$, covariate vector $X_{ij}$.
Additional covariate vector $Z_{ij}$, specifying the random effects.
We specify
- Mean, $E(Y_{ij} \mid X_{ij}, b_i) = \mu_{ij}$, where $g(\mu_{ij}) = X_{ij}^T \beta + Z_{ij}^T b_i$
- Distribution (Normal, Binomial, Poisson, ...)
- Conditional variance, $\phi v(\mu_{ij})$
- Variance of random effects, $b_i \sim N_p(0, G)$, where $G$ is the matrix (and software) notation for $\sigma_b^2$
- Conditional independence, given the covariates and the random effects

Interpretation of SS
This is a real model, but
- Inference is conditional on random effects and therefore specific to the subject
- It is very difficult to interpret the effect of covariates that are constant within an individual (e.g. gender, treatment etc.)
It may be useful to think about it as
- The individual is a (class) covariate
- The effect of another covariate is interpreted as for fixed value of all other covariates, including for fixed value of the individual

For traditional linear models (Normality)
A subject-specific model with random intercept/level is equal to a marginal model with compound symmetry covariance structure (type=cs).
More generally: The interpretation of the parameters $\beta$ does not depend on the way that we model the correlation (although the estimate may change somewhat depending on the assumed structure).
This implies that effects may be interpreted either cross-sectionally (marginally, for comparison of different populations, say, of different age) or subject-specifically (effect of ageing for a single individual).

For non-normal outcomes with a non-identity link
The above is no longer true, due to the non-linearity of the link function.
This means: The interpretation of the parameters $\beta$ does depend on the way that we model the correlation, and the PA and SS parameters have different interpretations!

A very simple example
Two individuals.
[Table: Baseline, Follow up, Difference, log(OR) and OR for each individual and for their average]
The log odds ratio calculated from the average probabilities is 0.811, i.e. OR = 2.25.
The average of the individual ORs is larger than the OR calculated from average probabilities.

Hypothetical example for illustration
[Figure: Subject specific model with a covariate effect (x-axis) and 21 clusters ($b_i$, individual curves). The red curve denotes the population average curve.]

Population average on logit scale
[Figure: individual curves and population average on logit scale]
SS specifies parallel lines on logit scale, but the PA deviates somewhat from a straight line and has a smaller slope (smaller effect of covariate x).

Interpretations
Example: The need for glasses increases with age.
- Marginally: The odds ratio for being in need of glasses for a population with mean age 50 compared to a population with mean age 30
is smaller than
- Subject specific: The odds ratio for needing glasses when you (a specific individual) are of age 50 compared to when you were at the younger age
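The attenuation of the population average effect can be reproduced with two hypothetical individuals who share the same subject-specific odds ratio but differ in baseline level. The numbers in this Python sketch are invented for illustration and are not the values from the slide's table.

```python
import math


def inv_logit(x: float) -> float:
    """Inverse logit: maps a log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def odds(p: float) -> float:
    return p / (1.0 - p)


# Hypothetical subject-specific effect: OR = 4 for every individual
# (parallel lines on the logit scale), with different individual levels.
subject_log_or = math.log(4.0)
baseline_logits = [0.0, -3.0]

baseline = [inv_logit(a) for a in baseline_logits]
followup = [inv_logit(a + subject_log_or) for a in baseline_logits]

# Subject-specific (SS) odds ratios: one per individual.
ss_ors = [odds(f) / odds(b) for b, f in zip(baseline, followup)]

# Population average (PA) odds ratio: computed from averaged probabilities.
mean_baseline = sum(baseline) / len(baseline)
mean_followup = sum(followup) / len(followup)
pa_or = odds(mean_followup) / odds(mean_baseline)

print("SS odds ratios:", [round(o, 3) for o in ss_ors])
print("PA odds ratio: ", round(pa_or, 3))
```

Both individuals have an odds ratio of 4, yet the odds ratio between the averaged probabilities is noticeably smaller, which is the PA versus SS difference described above.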

Counts of leprosy bacilli
Reference: Snedecor, G.W. and Cochran, W.G. (1967). Statistical Methods (6th edn). Iowa State University Press.
Controlled clinical trial:
- 10 patients treated with placebo P
- 10 patients treated with antibiotic A
- 10 patients treated with antibiotic B
Recording of the number of bacilli at six sites of the body, i.e. a count variable,
- before treatment (baseline, time=0)
- several months after treatment (time=1)

Averages for the leprosy example
[Table: analysis variable bacilli; N Obs, N, Mean and Variance by drug (A, B, P) and time]
Note: The variance is obviously bigger than the average, i.e. overdispersion.

Spaghettiplot - the leprosy example
[Figure: individual profiles; legends A, B, P]

Average plot - the leprosy example
[Figure: average profiles; legends A, B, P]

Purpose of investigation
1. Evaluate the efficacy of antibiotics: red vs green lines
2. Compare the two drugs, A and B: solid vs dotted red lines
3. Quantify the effects of the two antibiotic drugs (SS)
Randomization: At baseline, all patients have the same expected count (mean value), but by chance, the placebo individuals have larger values than the remaining groups.

Why is this not simple?
This is just a before-after study...
But we are dealing with non-negative counts, so we do not have a normal distribution, although it may be a reasonable approximation...
- Can't we just take logarithms? No, because we have zeroes
- Some other transformation then? Yes, square roots, or arcsine, but the interpretation would suffer a lot
- Could we just condition on the baseline value? Yes, we could do that... but it becomes more tricky when we have multiple time points

Model reflections
- We are dealing with counts, so it is natural to consider a Poisson distribution, with log-link (natural log)
- Because it is a randomized study, the mean values at baseline should be identical for the three groups
- We are prepared to see 3 different changes over time, but some of these may be identical (this is actually the main scientific question)
- Baseline and follow-up measurements are correlated within individuals

Model reflections, II
Parametrization of mean values (on the log-scale):

Treatment   Period      Mean (on log scale)
P           Baseline    β1
P           Follow-up   β1 + β2
A           Baseline    β1
A           Follow-up   β1 + β2 + β3
B           Baseline    β1
B           Follow-up   β1 + β2 + β4

β3 and β4 denote additional effects of A and B, when compared to placebo.

Marginal model (PA) for leprosy

    /* data step statements defining the drug-by-time indicators */
    A_effect=(drug='A')*time;
    B_effect=(drug='B')*time;

    proc genmod data=leprosy;
      class id;
      model bacilli = time A_effect B_effect / d=poisson link=log;
      repeated subject=id / type=un corrw;
      contrast 'Antibiotic effect' A_effect 1, B_effect 1 / wald;
      contrast 'Effect of A equals B?' A_effect 1 B_effect -1 / wald;
      /* coefficients chosen so that the estimate is beta4 - beta3, matching the label */
      estimate 'Effect B minus A' A_effect -1 B_effect 1;
      estimate "changes for A" time 1 A_effect 1;
      estimate "changes for B" time 1 B_effect 1;
      output out=pa pred=pred_pa xbeta=xbeta_pa;
    run;

Comments to code
- time indicates the change over time for the placebo group (the parameter β2)
- A_effect indicates the additional change over time for drug A (the parameter β3)
- B_effect indicates the additional change over time for drug B (the parameter β4)
- d=poisson: specifies the default link function as log, and the variance as (proportional to) the mean
- link=log: may overrule the default link function from dist=poisson, if so needed
- repeated: specifies an association between measurements on the same id (corrw requests printing of the working correlation matrix)

Comments to code, II
- estimate statements: Estimate combinations of the β's, here β4 - β3, β2 + β3 and β2 + β4
- contrast statements: Useful for testing several parameters simultaneously, here the tests
  - β3 = β4 = 0: No (extra) effect of either A or B
  - β3 = β4: Effects of A and B are equal (identical to the test from the first estimate statement above)

Output
The GENMOD Procedure

Model Information
Data Set             WORK.LEPROSY
Distribution         Poisson
Link Function        Log
Dependent Variable   bacilli

Number of Observations Read   60
Number of Observations Used   60

[Table: Class Level Information for id]

Parameter Information
Parameter   Effect
Prm1        Intercept
Prm2        time
Prm3        A_effect
Prm4        B_effect

Output, II
GEE Model Information
Correlation Structure          Unstructured
Subject Effect                 id (30 levels)
Number of Clusters             30
Correlation Matrix Dimension   2
Maximum Cluster Size           2
Minimum Cluster Size           2

Algorithm converged.
[Table: Working Correlation Matrix, 2 x 2]

Output, III: Estimation
The GENMOD Procedure
[Table: Analysis Of GEE Parameter Estimates, Empirical Error Estimates; Estimate, Error, 95% Confidence Limits, Z and Pr > |Z| for Intercept (P < .0001), time, A_effect, B_effect]

Output, IV (additional statements)
[Table: Contrast Estimate Results; Mean Estimate, Mean Confidence Limits, L'Beta Estimate, Error, Alpha, L'Beta Confidence Limits, Chi-Square and Pr > ChiSq for "Effect B minus A", "changes for A", "changes for B"]
[Table: Contrast Results for GEE Analysis; DF, Chi-Square, Pr > ChiSq and Type (Wald) for "Antibiotic effect" and "Effect of A equals B?"]
But note: It may not be reasonable to estimate the effect of each single drug in a PA-model!

Interpretations
- There is a significant effect of antibiotics: Wald test statistic 6.99, compared to $\chi^2(2)$, P = 0.03
- The effect of placebo is estimated to $\exp(\hat\beta_2) = 0.986$, i.e. a decrease of 1.4%
- The additional effect of drug A is estimated to $\exp(\hat\beta_3) = 0.58$, and the total effect to $\exp(\hat\beta_2 + \hat\beta_3) = 0.574$, i.e. a decrease of 42.6%
- The two antibiotics are not significantly different: Wald test statistic 0.08, compared to $\chi^2(1)$, P = 0.78 (although the estimated effect is a tiny bit larger for drug A)
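The P-values and percentage changes quoted in the interpretations above can be checked by hand. This Python sketch uses only the standard library and the closed-form tail probabilities of the chi-square distribution with 1 and 2 degrees of freedom; the helper names are illustrative.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper tail probability P(X > x) for a chi-square variable, df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented for df = 1 and df = 2 only")


def percent_change(rate_ratio: float) -> float:
    """Percentage change in the mean count implied by a rate ratio exp(beta)."""
    return 100.0 * (rate_ratio - 1.0)


print(f"Antibiotic effect:     P = {chi2_sf(6.99, 2):.2f}")
print(f"Effect of A equals B?: P = {chi2_sf(0.08, 1):.2f}")
print(f"Placebo change:        {percent_change(0.986):.1f}%")
print(f"Drug A total change:   {percent_change(0.574):.1f}%")
```

Because the model uses a log link, exponentiated coefficients are ratios of mean counts, so a ratio of 0.574 corresponds to a 42.6% decrease.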

Predicted means from Population Average model (PA)
[Figure: predicted means; legends A, B, P]

Wrong analysis not taking the correlation into account

    proc genmod data=leprosy;
      class id;
      model bacilli = time A_effect B_effect / d=poisson link=log modelse type3;
      /* no repeated statement */
      contrast 'Antibiotic effect' A_effect 1, B_effect 1 / wald;
      contrast 'Effect of A equals B?' A_effect 1 B_effect -1 / wald;
      /* coefficients chosen so that the estimate is beta4 - beta3, matching the label */
      estimate 'Effect B minus A' A_effect -1 B_effect 1;
      estimate "changes for A" time 1 A_effect 1;
      estimate "changes for B" time 1 B_effect 1;
    run;

Output from wrong analysis
[Table: Analysis Of Maximum Likelihood Parameter Estimates; DF, Estimate, Error, Wald 95% Confidence Limits, Wald Chi-Square and Pr > ChiSq for Intercept (P < .0001), time, A_effect (P < .0001), B_effect (P < .0001), Scale]
NOTE: The scale parameter was held fixed.
Note:
- Larger effects
- Too small standard errors
- Much too small P-values

Mixed effects model (SS)
We now assume random intercepts, $b_i \sim N(0, \sigma_b^2)$, in order to answer the third question under Purpose of investigation (quantify the effects of the two antibiotic drugs).

    proc glimmix data=leprosy method=quad(qpoints=50);
      class id;
      model bacilli = time A_effect B_effect / dist=poisson link=log solution;
      random intercept / subject=id type=vc g;
      contrast 'Drug x Time Interaction' A_effect 1, B_effect 1;
      contrast 'Effect of A equals B?' A_effect 1 B_effect -1;
      estimate "changes for A" time 1 A_effect 1;
      estimate "changes for B" time 1 B_effect 1;
      output out=ss pred=xbetamean pred(noblup)=xbeta_ss
             pred(ilink)=predmean pred(ilink noblup)=pres_ss;
    run;

Comments to glimmix code
- method=quad: maximizes the likelihood function
- qpoints=50: the more quadrature points, the better the accuracy
- random: here we have only one random intercept, so type=... is unimportant
- g: print the estimate of $\sigma_b^2$ (in glimmix, the parameter $\sigma_b^2$ is generally denoted G)
The test of equality of A and B is hard to interpret and is only shown in order to make this comment on it.

Output from glimmix analysis
[Table: Estimated G Matrix; Effect Intercept, Row, Col1]
[Table: Covariance Parameter Estimates; Cov Parm Intercept, Subject id, Estimate, Error]
[Table: Solutions for Fixed Effects; Estimate, Error, DF, t Value and Pr > |t| for Intercept (P < .0001), time, A_effect, B_effect]

Output from glimmix analysis, II
[Table: Estimates; Estimate, Error, DF, t Value and Pr > |t| for "Effect B minus A", "changes for A", "changes for B"]
[Table: Contrasts; Num DF, Den DF, F Value and Pr > F for "Antibiotic effect" and "Effect of A equals B?"]
Note again: Only the drug-specific changes are readily interpreted.

Predicted means from Subject Specific model (SS)
[Figure: predicted means; legends A, B, P]
Note: Different scaling from the plot of predicted means from the PA model.

Predicted individual means from Subject Specific model (SS)
[Figure: predicted individual means; legends A, B, P]

Predicted means from PA and SS
[Figure: predicted means from both models; legends A, B, P]

Comments on difference between PA and SS
The analysis uses a log-link, and since the logarithmic function is concave, we have the following:
- The average of two logarithmic values (SS) is smaller than the logarithm of the average (PA)
- The difference between the two is largest for small values
- Therefore the effects on log-scale (SS) appear larger

Study on epilepsy
Reference: Thall, P.F. and Vail, S.C. (1990). Some covariance models for longitudinal count data with overdispersion. Biometrics.
Controlled clinical trial:
- 30 treated with progabide
- 28 treated with placebo
Recording of the number of epileptic seizures during
- an 8-week interval before treatment
- visits every second week after treatment, i.e. in 2-week intervals
We consider rates, per week.
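The concavity argument above (Jensen's inequality for the logarithm) can be illustrated with a few made-up counts. This Python sketch is only an illustration; the values are hypothetical and unrelated to the leprosy data.

```python
import math


def jensen_gap(values):
    """log(mean of values) minus mean(log of values); non-negative since log is concave."""
    mean_value = sum(values) / len(values)
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.log(mean_value) - mean_log


# Two pairs with the same absolute spread, one at a low level and one at a high level.
small_pair = [1, 3]
large_pair = [11, 13]

print(f"gap for {small_pair}: {jensen_gap(small_pair):.4f}")
print(f"gap for {large_pair}: {jensen_gap(large_pair):.4f}")
```

The gap is clearly larger for the small values, in line with the remark that the PA and SS curves differ most where the counts are low.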


Spaghettiplot - the epilepsy example
[Figure: number of seizures per week, individual profiles; legends Progabide, Placebo]

Mean value plot
[Figure: mean number of seizures per week; legends Progabide, Placebo]

Purpose of investigation
1. Investigate what happens over time: does the number of seizures decrease?
2. Compare the decrease for a patient treated with progabide to the decrease for a similar patient in the placebo group
3. Compare the decrease for a population treated with progabide to the decrease for a population treated with placebo

Model building
Notation: $T_{ij}$ denotes the time span corresponding to the number of seizures, $Y_{ij}$, so $T_{ij}$ is either 2 or 8 weeks.
Reasonable model (in principle) for the number of seizures:
- Poisson outcome
- Random regression, i.e. linear effect of week, with individual intercepts and slopes
- Mean value proportional to length of period (8 or 2 weeks): log(8) and log(2) used as offsets
This ensures that we model the ratio $Y_{ij} / T_{ij}$ (on log-scale).
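The role of the offset can be made concrete: adding log(T) with a fixed coefficient of 1 to the linear predictor makes the expected count proportional to the length of the period, so the remaining parameters describe the weekly rate. A minimal Python sketch, with a hypothetical weekly rate:

```python
import math


def expected_count(log_weekly_rate: float, weeks: float) -> float:
    """Expected count under a log link with offset log(weeks), coefficient fixed at 1."""
    linear_predictor = log_weekly_rate + math.log(weeks)
    return math.exp(linear_predictor)


# Hypothetical rate of 3 seizures per week, observed over 8 and 2 weeks.
log_rate = math.log(3.0)
print(expected_count(log_rate, 8))  # expected count in the 8-week baseline period
print(expected_count(log_rate, 2))  # expected count in a 2-week visit period
```

The two expected counts differ only by the factor 8/2, while the underlying weekly rate is the same.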

Random regression, SS model in glimmix

    proc glimmix data=seizures method=quad(qpoints=50);
      class id trt visit;
      model seizures = weeks trt trt*weeks / dist=poisson offset=lweeks link=log solution;
      random intercept weeks / subject=id type=un g;
      estimate 'weekly decline trt=0' weeks 1 weeks*trt 1 0;
      estimate 'weekly decline trt=1' weeks 1 weeks*trt 0 1;
    run;

Output not shown.

Ecological fallacy
Think about the research question:
- Do we want to say something about populations? (between subject covariates)
- or are we interested in specific individuals? (within subject covariates)

Example: suicide and religion
In a number of regions, we count:
- Number of suicides. Outcome: % suicides (among all citizens)
- Number of protestants and catholics. Covariate: % protestants
Purpose of study:
- Do people kill themselves when they live among protestants?
- Are protestants more likely to commit suicide?
Is this a precise question??

Analysis on population level: the regions
[Figure: % suicides against % protestants, by region]
Percent of suicides increases with percent of protestants.

Analysis on subject level
Subdivide each region by individual religion, protestants and catholics:
[Figure: % suicides against % protestants, separately for protestants and catholics]
More suicides among catholics in regions with many protestants, but they do not count as much, since they are a minor group.

Amenorrhea example
1151 contracepting women were randomized in two groups, receiving
- 100 mg of some drug (trt=0)
- 150 mg of the same drug (trt=1)
All women received injections at four time points (time=1,2,3,4) with intervals of 90 days; there was no measurement at baseline (time=0).
Each time, it was recorded whether the woman had experienced amenorrhea (a suspected side effect of the drug) in the 90 days following the last injection.
Many drop-outs.

Amenorrhea example
The MEANS Procedure
[Table: analysis variable y; N Obs, N, N Miss, Mean and Variance by trt and time]
Note: Baseline is unmeasured (time=0).

Mean value plot - amenorrhea
[Figure: proportion with amenorrhea over time, by treatment group]

Mean value plot - on logit scale
[Figure: proportion with amenorrhea over time on logit scale, by treatment group]
Do we have linearity? Not quite...

Purpose of the amenorrhea investigation
- Estimate the time trend in the probability of side effects for each dose of the drug
- Compare the two doses
Model could include
- A time effect (linear or quadratic)
- A group difference, but the groups should be equal at baseline (time=0)
- An interaction between group and time (different patterns in the two groups)
- A random level for each individual

Mixed effects model (SS)

    proc glimmix method=quad(qpoints=50) noclprint data=amen;
      class id;
      model amenorrhea = time time2 trt*time trt*time2 / dist=binomial link=logit solution;
      random intercept / subject=id g;
      contrast 'Interaction with time' trt*time 1, trt*time2 1 / chisq;
      output out=pred_ss pred(noblup ilink)=predicted_ss_mean;
    run;

Beware: The test for interaction is difficult to interpret.

Output from mixed effects model with quadratic time effect
[Table: Estimated G Matrix; Effect Intercept, Row, Col1]
[Table: Solutions for Fixed Effects; Estimate, Error, DF, t Value and Pr > |t| for Intercept (P < .0001), time (P < .0001), time2, time*trt, time2*trt]
[Table: Contrasts; Num DF, Den DF, Chi-Square, F Value, Pr > ChiSq and Pr > F for "Interaction with time"]

Interpretations
- Random effects variance G ($\hat\sigma_b^2$) can be cautiously interpreted as a correlation:
  $\frac{\hat\sigma_b^2}{\hat\sigma_b^2 + \pi^2/3} = 0.61$
- The interaction is hard to interpret as a within-subject covariate, since no individual has received both treatments.

Predicted profiles from SS-model
[Figure: predicted profiles over time, by treatment group]

Marginal model using GEE (PA)

    proc genmod descending data=amen;
      class id;
      model amenorrhea = time time2 trt*time trt*time2 / dist=binomial link=logit;
      repeated subject=id / logor=fullclust;
      contrast 'Interaction with time' trt*time 1, trt*time2 1;
      output out=pred_pa pred=predicted_pa;
    run;

Note: We have a missing value issue here, because we cannot use maximum likelihood.

Output from marginal model
[Table: Analysis Of GEE Parameter Estimates, Empirical Error Estimates; Estimate, Error, 95% Confidence Limits, Z and Pr > |Z| for Intercept (P < .0001), time (P < .0001), time2, time*trt, time2*trt and six log odds ratio parameters Alpha (all P < .0001)]
[Table: Contrast Results for GEE Analysis; DF, Chi-Square, Pr > ChiSq and Type (Score) for "Interaction with time"]
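The correlation interpretation of the random intercept variance rests on the latent-variable view of the logistic model, where the residual on the logit scale has the variance of a standard logistic distribution, π²/3. A small Python sketch of the formula follows; the variance values passed in are hypothetical, not the estimates from the output.

```python
import math

# Variance of the standard logistic distribution (latent residual on the logit scale).
LOGISTIC_RESIDUAL_VARIANCE = math.pi**2 / 3


def latent_icc(random_intercept_variance: float) -> float:
    """Intra-class correlation on the latent (logit) scale for a random intercept model."""
    return random_intercept_variance / (
        random_intercept_variance + LOGISTIC_RESIDUAL_VARIANCE
    )


for variance in (0.5, 1.0, 5.0):  # hypothetical variance estimates
    print(f"sigma_b^2 = {variance}: correlation = {latent_icc(variance):.2f}")
```

The correlation refers to the unobserved latent scale and not to the binary outcomes themselves, which is one reason for the word "cautiously" above.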

Predicted profiles from PA-model
[Figure: predicted profiles over time, by treatment group]
Note: New scaling.

Comparison of predicted profiles
[Figure: predicted profiles from the PA and SS models]
PA estimates are closer to 0.5 than SS estimates, and more so if they are further away from 0.5, so effects are smaller for PA.

An alternative SS program
PROC NLMIXED
- very flexible, allows any (non-linear) mean value structure
- can only handle two levels (i.e. not pupils in classes in schools...)

    PROC NLMIXED data=amen QPOINTS=50;
      PARMS beta0=-2.5 beta1=0.8 beta2=-0.03
            beta3=0.36 beta4=-0.07 g11=0 to 5 by 0.5;
      eta = beta0 + beta1*time + beta2*time2
            + beta3*trt*time + beta4*trt*time2 + b1;
      mu = exp(eta)/(1+exp(eta));
      MODEL y ~ BINARY(mu);
      RANDOM b1 ~ NORMAL(0, g11) SUBJECT=id;
      PREDICT mu OUT=predmean;
    run;

Output from NLMIXED
[Table: Parameter Estimates; Estimate, Error, DF, t Value, Pr > |t|, Alpha, Lower, Upper and Gradient for beta0, beta1, beta2, beta3, beta4 and g11]

Smoking among school children
Hierarchical (multilevel) design:
- 1498 children (i)
- 90 classes (c)
- 46 schools (s)
Outcome: Individual smoking behaviour, smoker (0/1)
Purpose of investigation
- Find out how to make an intervention to prevent smoking
- Evaluate various covariate effects
Mette Rasmussen

Model for smoking
$Y_{sci} \sim \mathrm{Bernoulli}(p_{sci})$
$p_{sci}$: the probability that child $i$ in class $c$ in school $s$ is a smoker.
Model:
$\mathrm{logit}(p_{sci})$ = school covariate effects + $A_s$ + school class covariate effects + $B_{sc}$ + individual covariate effects
- $A_s \sim N(0, \omega^2)$: between school variation
- $B_{sc} \sim N(0, \tau^2)$: between classes (within school) variation

Possible covariates, at various levels
- Individual (i): sex/gender, age, parental smoking behaviour, parental smoking attitude, parental labour market attachment, best friend smoking
- Class (c): sex ratio, number of pupils, grade, teachers
- School (s): Type of school (rural, urban)

Initial model
Too simple, but a starting point to gain understanding.
Two-level model:
- no covariates
- only random school

    proc glimmix data=smoke;
      class school sclass;
      /* no covariates: nothing between "=" and "/" */
      model smoker(descending) = / dist=binary link=logit ddfm=satterth s;
      /* only random school */
      random school;
    run;

Important note
A full maximum likelihood estimation (method=quad) with a sufficient number of qpoints is not feasible for this problem, because of insufficient space and time.
- The default approximate solution is method=rspl
- The simplest model may be fitted with ML, and this yields results quite close to the ones presented below
- Perhaps, some day...

Interesting part of output
The GLIMMIX Procedure
[Table: Covariance Parameter Estimates; Estimate and Error for SCHOOL]
[Table: Solutions for Fixed Effects; Estimate, Error, DF, t Value and Pr > |t| for Intercept (P < .0001)]

Interpretation of estimates
Fixed effects: Only intercept, i.e. overall level.
Inverse logit-transformation of the intercept estimate $\hat\beta_0$:
$\frac{\exp(\hat\beta_0)}{1 + \exp(\hat\beta_0)}$
Overall, approx. 18.6% of the pupils smoke.

Interpretation of random effect
Estimated between-school variance: $\hat\sigma_b^2$
- A cautious interpretation as a correlation:
  $\frac{\hat\sigma_b^2}{\hat\sigma_b^2 + \pi^2/3} = 0.13$
- Median Odds Ratio (MOR): For two randomly chosen individuals from different schools (with identical covariates), we calculate the median OR for the high risk individual compared to the low risk individual.

MOR in practice
Choose two random individuals from different schools: The distribution of the OR between their risks of smoking (always chosen as the ratio above 1) will have a median of
$\mathrm{MOR} = \exp(0.954\,\omega)$
where $\omega$ is the between-school standard deviation (the square root of the estimated between-school variance). Here we get MOR = 1.46.

Interpretation of correlation structure
- Pupils from the same school are correlated in their inclination to smoke
- Pupils from the same class are no more correlated than pupils from different classes in the same school. This does not seem appropriate
We must introduce an extra correlation for pupils in the same class.

Inclusion of variation between school classes

    proc glimmix data=smoke;
      class school sclass;
      model smoker(descending) = / dist=binary link=logit ddfm=satterth s;
      random school sclass;
    run;

[Table: Covariance Parameter Estimates; Estimate and Error for SCHOOL (estimate 0) and sclass]
[Table: Solutions for Fixed Effects; Estimate, Error, DF, t Value and Pr > |t| for Intercept (P < .0001)]

Interpretation of results
- The variation between schools can be totally explained by the variation between school classes
- The intercept (level) changes slightly because of a different weighting of the observations
- Median Odds Ratio (MOR) for two children from different classes in the same school: 1.77
- Median Odds Ratio (MOR) for two children from different classes in different schools: calculated in the same way, from the sum of the school and class variances
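The constant 0.954 in the MOR formula is $\sqrt{2}\,\Phi^{-1}(0.75)$, where $\Phi^{-1}$ is the standard Normal quantile function. The Python sketch below shows the calculation for one or several variance components; the variance values are hypothetical and are not the estimates from the GLIMMIX output.

```python
import math
from statistics import NormalDist

# sqrt(2) * (75th percentile of the standard Normal distribution), approximately 0.954.
MOR_FACTOR = math.sqrt(2) * NormalDist().inv_cdf(0.75)


def median_odds_ratio(*variance_components: float) -> float:
    """Median odds ratio for two individuals who differ in all the listed random effects.

    The relevant variance is the sum of the variance components that separate the
    two individuals (e.g. school and class variances for pupils in different schools).
    """
    total_sd = math.sqrt(sum(variance_components))
    return math.exp(MOR_FACTOR * total_sd)


# Hypothetical variance components: school (omega^2) and class within school (tau^2).
omega2, tau2 = 0.2, 0.3
print(f"factor: {MOR_FACTOR:.3f}")
print(f"different classes, same school:       MOR = {median_odds_ratio(tau2):.2f}")
print(f"different classes, different schools: MOR = {median_odds_ratio(omega2, tau2):.2f}")
```

When the school variance is estimated as zero, the two MORs coincide, since only the class variance contributes.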

An illustrative figure
[Figure: Three schools: blue, red, green]
Variation between classes in each school, but schools look alike.

A possible third level...
Imagine an extra level/grouping:
Gender group within class, i.e. a subgrouping in boys and girls, corresponding to an extra correlation between pupils of the same gender in the same class.
Note: This is not the same as a gender effect
- it need not be a systematic difference
- the group definition is a substitute for cliques of which we know nothing
Modify the random statement to

    random school sclass ggroup;

and remember ggroup in the class statement.

One school, gender group effect
[Figure: one school, gender groups within classes]

Output from 3-level model
The GLIMMIX Procedure
[Table: Covariance Parameter Estimates; Estimate and Error for SCHOOL (estimate 0), sclass and GGROUP]
[Table: Solutions for Fixed Effects; Estimate, Error, DF, t Value and Pr > |t| for Intercept (P < .0001)]
Gender group/clique seems to be an important concept.

Interpretation of results
- Median Odds Ratio (MOR) for two children of opposite sex (different gender groups) in the same class: 1.91
- Median Odds Ratio (MOR) for two children (of either gender) in different classes (at same or different schools): 2.04
How much does a systematic gender effect explain of the random components?

Gender correlation - systematic effect?
A large part of the variation seems to be due to gender cliques, or is it simply a systematic difference between boys and girls?

    proc glimmix data=smoke;
      class school sclass ggroup sex;
      model smoker(descending) = sex / dist=binary link=logit ddfm=satterth s;
      random school sclass ggroup;
    run;

One school, systematic gender effect
[Figure: one school, systematic difference between boys and girls]

Output from 3-level model, with systematic gender effect
The GLIMMIX Procedure
[Table: Covariance Parameter Estimates; Estimate and Error for SCHOOL (estimate 0), sclass and GGROUP]
[Table: Solutions for Fixed Effects; Estimate, Error, DF, t Value and Pr > |t| for Intercept (P < .0001), sex boy, sex girl]

Interpretation of results
- Systematic effect of sex: OR = exp(0.4188) = 1.52 for girls vs. boys
- Median Odds Ratio (MOR) for two children in different cliques of the same class: 1.83
- Median Odds Ratio (MOR) for two children in different classes (at same or different schools): 2.00
How much did the systematic gender effect explain of the random components?

Variance component estimates
[Table: estimated variance components for school, school class and gender group in four models: school alone; school and school class; school, class and gender group; as above, with sex]
Note the increase in the class variation.

MOR, and Odds ratios (OR) for gender
[Table: MOR in case of different school, school class and gender group, and OR for gender, in four models: school alone; school and school class; school, class and gender group; as above, with sex]
Systematic gender effect and gender cliques seem to be the most important determinants for smoking.
