

1 Package HGLMMM for Hierarchical Generalized Linear Models Marek Molas Emmanuel Lesaffre Erasmus MC Erasmus Universiteit - Rotterdam The Netherlands ERASMUSMC - Biostatistics / 52

2 Outline
General syntax guide
A bit of underlying theory
Example analyses
Comparison with existing methods
Further developments

3 Examples
Salamander data - crossed random effects
Dialyzer data - longitudinal data
Dialyzer data - correlated random effects
Rats data - overdispersion modeling
Cake data - AIC and model comparison

4 Hierarchical Generalized Linear Models
Distribution of the response: exponential family density
$$f(y;\theta,\phi) = \exp\left[\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right]$$
The mean of the distribution
$$E[y] = b'(\theta) = \mu$$
µ - the location of the distribution
φ - the scale of the distribution, or overdispersion
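As a numerical check of the density above, the following sketch writes the Poisson distribution in exponential-family form. It assumes θ = log µ, b(θ) = exp(θ), φ = 1 and c(y, φ) = -log(y!); the function name `expfam_poisson` is illustrative and not part of HGLMMM.

```r
# Poisson density in exponential-family form:
#   f(y; theta, phi) = exp{ (y*theta - b(theta)) / phi + c(y, phi) }
# with theta = log(mu), b(theta) = exp(theta), phi = 1, c(y, phi) = -log(y!)
expfam_poisson <- function(y, mu) {
  theta <- log(mu)
  b     <- exp(theta)
  phi   <- 1
  exp((y * theta - b) / phi - lgamma(y + 1))
}

y  <- 0:10
mu <- 3.5
max_abs_diff <- max(abs(expfam_poisson(y, mu) - dpois(y, mu)))

# The mean equals b'(theta); approximate the derivative by a central difference
theta   <- log(mu)
eps     <- 1e-6
b_prime <- (exp(theta + eps) - exp(theta - eps)) / (2 * eps)
```

Here `max_abs_diff` compares the hand-written density with `dpois()`, and `b_prime` should be close to µ.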

5 Hierarchical Generalized Linear Models
The link function: $g(\mu) = \eta$
The linear predictor: $\eta = X\beta + Zv$
Fixed effects in the mean structure: β
Random effects in the mean structure: v, assumed to originate from a distribution indexed by a dispersion parameter λ
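The linear predictor can be assembled by hand in base R. This is a minimal sketch on simulated data, assuming a single random intercept per subject, normal random effects and a log link; the values of `beta` and `lambda` are hypothetical.

```r
set.seed(1)
n_subj <- 5
n_rep  <- 4
dat <- data.frame(
  subject = factor(rep(seq_len(n_subj), each = n_rep)),
  x       = rnorm(n_subj * n_rep)
)

X <- model.matrix(~ x, data = dat)            # fixed-effects design (intercept + x)
Z <- model.matrix(~ 0 + subject, data = dat)  # one indicator column per subject

beta   <- c(0.5, 1.2)                         # hypothetical fixed effects
lambda <- 0.8                                 # hypothetical dispersion of v
v      <- rnorm(n_subj, sd = sqrt(lambda))    # normal random intercepts

eta <- as.vector(X %*% beta + Z %*% v)        # linear predictor eta = X beta + Z v
mu  <- exp(eta)                               # inverse of the log link g(mu) = log(mu)
```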

6 Functions currently in the package HGLMMM
HGLMfit - fitting function
HGLMLikeDeriv - displays derivatives of the fit
HGLMLRTest - likelihood ratio test between two nested models
BootstrapEnvelopeHGLM - creates bootstrap envelopes for deviance residuals
summary.hglm - prints a summary of the fit

7 HGLMfit syntax
    HGLMfit(DistResp = "Normal", DistRand = NULL, Link = NULL, LapFix = FALSE,
            ODEst = NULL, ODEstVal = 0, formulamain, formulaod, formularand,
            DataMain, DataRand, Offset = NULL, BinomialDen = NULL,
            StartBeta = NULL, StartVs = NULL, StartRGamma = NULL,
            INFO = TRUE, DEBUG = FALSE, na.action, contrasts = NULL, CONV = 1e-04)

8 HGLMfit syntax description
DistResp - distribution of the response, one of "Normal", "Binomial", "Poisson", "Gamma"
DistRand - distribution of the random effects: a vector with one entry per random component, each chosen from c("Beta","Gamma","IGamma","Normal")

9 HGLMfit syntax description
Link - link function for the response. Canonical links are available for the Normal, Poisson and Binomial distributions; the Gamma distribution has the Log or Inverse link available.
LapFix - whether the adjusted profile likelihood p_v(h) is used to estimate the fixed effects. If TRUE, an additional piece of code estimates the fixed effects as in Noh and Lee (2007). If FALSE, the hierarchical likelihood is used to estimate both the fixed and the random parameters.

10 HGLMfit syntax description
ODEst - whether the overdispersion parameter is fixed or estimated. If NULL, it is fixed for Poisson and Binomial responses and estimated for Normal and Gamma responses. If TRUE, the overdispersion structure is estimated. If FALSE, the overdispersion structure is held fixed.
formulamain - formula for the mean structure of the model, with fixed and random components written as in lme4.

11 HGLMfit syntax description
formulaod - dispersion structure (residual/overdispersion), given as a one-sided formula.
formularand - dispersion structure of the random effects: a list of one-sided formulas, with as many entries as there are dispersion components.
DataMain - the main dataset, used for formulamain and formulaod.
DataRand - a list of the data frames used for formularand.

12 HGLMfit syntax description
Offset - offset variable, as used in Poisson regression to model log(µ/t).
BinomialDen - denominator of the Binomial distribution: a vector with one entry per observation.
StartBeta, StartVs, StartRGamma - starting values for the fixed parameters, the random effects and the dispersion parameters of the random effects, respectively.
ODEstVal - values for the overdispersion/residual dispersion structure.
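The role of an offset is easiest to see in an ordinary Poisson GLM. The sketch below uses base R `glm()` rather than HGLMfit, with simulated data in which `exposure` plays the role of t; all variable names and parameter values are hypothetical.

```r
set.seed(42)
n        <- 200
exposure <- runif(n, 0.5, 5)              # observation time t (hypothetical)
x        <- rnorm(n)
rate     <- exp(0.3 + 0.7 * x)            # true rate per unit of exposure
y        <- rpois(n, lambda = rate * exposure)

# log(mu) = log(t) + b0 + b1 * x   is the same as   log(mu / t) = b0 + b1 * x
fit <- glm(y ~ x + offset(log(exposure)), family = poisson(link = "log"))
coef(fit)
```

Because the offset enters with a fixed coefficient of 1, the fitted coefficients describe the rate per unit of exposure rather than the raw count.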

13 Class HGLM objects - result of HGLMfit estimation
Results - contains the estimates
Details - contains the designs
NAMES - contains the labels used when printing the results
CALL - contains the original call to the fitting function HGLMfit

14 Class HGLM objects - component Results
Estimates of fixed and random effects in the mean structure
Estimates of dispersion and (over)/residual dispersion parameters
Gradient, Hessian and standard errors of the fixed, random and dispersion estimates
Values of the h-likelihood, marginal likelihood (REML) and conditional likelihood

15 Class HGLM objects - component Details
Deviance residuals and standardized deviance residuals, computed with the proper hat matrix, for:
the outcome (assumed distribution)
the random effects (assumed distribution)
the (over)/residual dispersion (gamma distribution)
the dispersion components (gamma distribution)
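For orientation, this is how deviance residuals and their standardized version are defined for a plain Poisson GLM. The sketch uses base R on simulated data and the ordinary GLM hat matrix; it is an analogy, not the package's own computation.

```r
set.seed(7)
x   <- rnorm(50)
y   <- rpois(50, exp(0.2 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Unit deviance d_i = 2 * { y*log(y/mu) - (y - mu) }, with y*log(y/mu) := 0 at y = 0
unit_dev <- 2 * (ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
unit_dev <- pmax(unit_dev, 0)                 # guard against tiny negative rounding
dev_res  <- sign(y - mu) * sqrt(unit_dev)

# Standardize with the leverages h_i from the hat matrix (phi = 1 for Poisson)
h           <- hatvalues(fit)
std_dev_res <- dev_res / sqrt(1 - h)
```

The hand-computed values can be compared with `residuals(fit, type = "deviance")` and `rstandard(fit)`.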

16 Other functions description
HGLMLRTest - likelihood ratio test comparing two models; takes two arguments, both objects of class HGLM.
HGLMLikeDeriv - gives the gradients of the fixed effects in the mean structure and of the variance components.
BootstrapEnvelopeHGLM - creates 95% confidence envelopes for correct residual diagnostics.
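The arithmetic behind a likelihood ratio test can be sketched in a few lines of base R. The helper `lr_test` is hypothetical and is not the implementation of HGLMLRTest; it is illustrated with two nested Poisson GLMs fitted to simulated data.

```r
# Likelihood ratio test for two nested models, given their maximised
# log-likelihoods and the difference in the number of parameters.
lr_test <- function(loglik_full, loglik_reduced, df) {
  stat <- 2 * (loglik_full - loglik_reduced)
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

set.seed(3)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- rpois(100, exp(0.1 + 0.4 * x1))

full    <- glm(y ~ x1 + x2, family = poisson)
reduced <- glm(y ~ x1,      family = poisson)

res <- lr_test(as.numeric(logLik(full)), as.numeric(logLik(reduced)), df = 1)
```

When the tested parameter is a variance component on the boundary of its parameter space, the chi-squared reference distribution used here is only approximate.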

17 Examples - Dialyzer data
Dialyzer dataset
Response: UFR
Covariate of interest: TMP
Three centers involved, coded in a center variable
Random effect: dialyzer number
Aim: determine the relationship between UFR and TMP, and whether this relationship differs across the three centers, which use different systems to manipulate the TMP.

18 Examples - Dialyzer data
[Figure: UFR plotted against TMP, by center (Center 1, Center 2, Center 3).]

19 Examples - Dialyzer data
Standard analysis via SAS PROC MIXED
Random intercept model
Random intercept and slope model - no correlation
Random intercept and slope model - fixed correlation
Search over a grid for the correlation value

20 Dialyzer data - random intercept model
    # keep complete cases and standardize the response
    dialyzer1 <- dialyzer[complete.cases(dialyzer), ]
    dialyzer1$UFRSTD <- (dialyzer1$UFR - mean(dialyzer1$UFR)) / sd(dialyzer1$UFR)
    # one row per level of the random effect (41 here)
    DatasetRAEF <- data.frame(intercept = rep(1, 41))
    mod_dial1 <- HGLMfit(DistResp = "Normal", DistRand = c("Normal"), Link = "Identity",
        LapFix = FALSE, ODEst = TRUE, ODEstVal = 0,
        formulamain = UFRSTD ~ TMP + as.factor(CENTER) + as.factor(CENTER):TMP + (1|DIALYZER),
        formulaod = ~ 1, formularand = list(one = ~1),
        DataMain = dialyzer1, DataRand = list(DatasetRAEF),
        Offset = NULL, BinomialDen = NULL, StartBeta = NULL, StartVs = NULL,
        StartRGamma = NULL, INFO = TRUE, DEBUG = FALSE, contrasts = NULL, CONV = 1e-04)
    summary(mod_dial1)

21 Dialyzer data - random intercept/slope model
    mod_dial2 <- HGLMfit(DistResp = "Normal", DistRand = c("Normal", "Normal"), Link = "Identity",
        LapFix = FALSE, ODEst = TRUE, ODEstVal = 0,
        formulamain = UFRSTD ~ TMP + as.factor(CENTER) + as.factor(CENTER):TMP +
            (1|DIALYZER) + (TMP|DIALYZER),
        formulaod = ~ 1, formularand = list(one = ~1, two = ~1),
        DataMain = dialyzer1, DataRand = list(DatasetRAEF, DatasetRAEF),
        Offset = NULL, BinomialDen = NULL, StartBeta = NULL, StartVs = NULL,
        StartRGamma = NULL, INFO = TRUE, DEBUG = FALSE, contrasts = NULL)
        # the call is cut off in the source after "CONV"; CONV is left at its default here
    summary(mod_dial2)

22 Dialyzer data - known correlation parameter
Assume the correlation ρ between the random intercept and slope is known (the value of ρ is not preserved in the source).
1. Fit the model under independence to obtain estimates of the intercept and slope variances, then construct the variance-covariance matrix from the known correlation and these variances.
2. Compute the Cholesky decomposition of this matrix.
3. Change the design matrix of the random effects accordingly.
4. Refit the model, update the variance estimates, and use them with the known correlation to construct a new covariance matrix.
5. Compute the Cholesky decomposition of the new matrix, change the design matrix again and refit the model.
6. Stop the procedure when the variance components of the fit are close to 1.
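The reason the Cholesky factor can be moved into the design matrix is shown in the following base R sketch. The correlation and the variances are hypothetical values chosen for illustration, not the ones used in the slides.

```r
rho <- 0.5                      # hypothetical known correlation
s2  <- c(1.5, 0.4)              # hypothetical variances of intercept and slope
off <- rho * sqrt(s2[1] * s2[2])
Sigma <- matrix(c(s2[1], off,
                  off,   s2[2]), 2, 2)
R <- chol(Sigma)                # upper triangular, t(R) %*% R equals Sigma

# If u has identity covariance, v = t(R) %*% u has covariance Sigma,
# so Z %*% v equals (Z %*% t(R)) %*% u: the correlation moves into the design.
set.seed(11)
tmp   <- rnorm(6)
Z     <- cbind(1, tmp)          # original random-effects design (intercept, slope)
Z_mod <- Z %*% t(R)             # modified design used for refitting
u     <- rnorm(2)

lhs <- Z %*% (t(R) %*% u)
rhs <- Z_mod %*% u
implied_cov <- t(R) %*% diag(2) %*% R
```

This also explains the stopping rule: once the modified design carries the right covariance, the refitted random effects should have variances close to 1.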

23 Dialyzer data - known correlation parameter
If the variances of the random intercept and slope are assumed to be equal, only one step is required.
If the correlation is unknown, a grid search can be done. This implies many iterations in nested loops, which is inefficient.
A possible improvement is to modify the current code so that the update is done at every iteration.

24 Dialyzer data - known correlation parameter
    # variances of the random intercept and slope from the independence fit
    temp1 <- as.vector(exp(mod_dial2$Results$Dispersion))
    # rho <- (known correlation; its numeric value is missing from the source)
    Rcov <- matrix(c(temp1[1], rho * sqrt(temp1[1] * temp1[2]),
                     rho * sqrt(temp1[1] * temp1[2]), temp1[2]), 2, 2)
    tempchol <- chol(Rcov)
    # move the Cholesky factor into the random-effects design
    originalz <- cbind(rep(1, nrow(dialyzer1)), dialyzer1$TMP)
    modifiedz <- originalz %*% t(tempchol)
    dialyzer1$NEWINT <- modifiedz[, 1]
    dialyzer1$NEWTMP <- modifiedz[, 2]
    mod_dial3 <- HGLMfit(DistResp = "Normal", DistRand = c("Normal", "Normal"), Link = "Identity",
        LapFix = FALSE, ODEst = TRUE, ODEstVal = 0,
        formulamain = UFRSTD ~ TMP + as.factor(CENTER) + as.factor(CENTER):TMP +
            (NEWINT|DIALYZER) + (NEWTMP|DIALYZER),
        formulaod = ~ 1, formularand = list(one = ~1, two = ~1),
        DataMain = dialyzer1, DataRand = list(DatasetRAEF, DatasetRAEF),
        Offset = NULL, BinomialDen = NULL, StartBeta = NULL, StartVs = NULL,
        StartRGamma = NULL, INFO = TRUE, DEBUG = FALSE)
        # the call is cut off in the source after "DEBUG = FALSE,contras"

25 Dialyzer data - known correlation parameter
    # updated variances from the refitted model
    temp2 <- as.vector(exp(mod_dial3$Results$Dispersion))
    # rho <- (known correlation; its numeric value is missing from the source)
    # covariance implied by the previous Cholesky factor and the new variances
    temp3 <- t(tempchol) %*% matrix(c(temp2[1], 0, 0, temp2[2]), 2, 2) %*% tempchol
    Rcov1 <- matrix(c(temp3[1, 1], rho * sqrt(temp3[1, 1] * temp3[2, 2]),
                      rho * sqrt(temp3[1, 1] * temp3[2, 2]), temp3[2, 2]), 2, 2)
    tempchol <- chol(Rcov1)
    originalz <- cbind(rep(1, nrow(dialyzer1)), dialyzer1$TMP)
    modifiedz <- originalz %*% t(tempchol)
    dialyzer1$NEWINT <- modifiedz[, 1]
    dialyzer1$NEWTMP <- modifiedz[, 2]

26 Dialyzer data - known correlation parameter
Results (numeric estimates are not preserved in the source)
===== Fixed Coefficients - Mean Structure =====
                         Estimate  Std. Error  Z value  Pr(>|Z|)
(Intercept)                                             < 2e-16 ***
TMP                                                     < 2e-16 ***
as.factor(CENTER)
as.factor(CENTER)
TMP:as.factor(CENTER)                                           ***
TMP:as.factor(CENTER)                                           ***
---
===== Overdispersion Parameters Estimated =====
                         Estimate  Std. Error  Z value  Pr(>|Z|)
(Intercept)                                             <2e-16 ***
---

27 Dialyzer data - known correlation parameter
Results, continued (numeric values are not preserved in the source)
===== Dispersion Parameters Estimated =====
Dispersion Component: DIALYZER
                         Estimate  Std. Error  Z value  Pr(>|Z|)
(Intercept)
Dispersion Component: DIALYZER
                         Estimate  Std. Error  Z value  Pr(>|Z|)
(Intercept)
===== Likelihood Functions Value =====
H-likelihood :
Marginal likelihood:
REML likelihood :
C-likelihood :

28 Examples - Dialyzer data
    BootstrapEnvelopeHGLM(mod_dial_final,19,67523)
[Figure: Q-Q plot with bootstrap envelope, sample quantiles against theoretical quantiles.]
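The idea of a simulated envelope around a Q-Q plot can be sketched in base R. This is an illustration under simplifying assumptions: residuals are compared with samples drawn from a standard normal, whereas a model-based bootstrap would simulate from the fitted model. The function `qq_envelope` is hypothetical and is not the package's BootstrapEnvelopeHGLM.

```r
# Simulated envelope for a normal Q-Q plot: the band is the pointwise
# minimum and maximum of the order statistics over n_sim simulated samples.
qq_envelope <- function(res, n_sim = 19, seed = 1) {
  set.seed(seed)
  n   <- length(res)
  sim <- replicate(n_sim, sort(rnorm(n)))   # n x n_sim matrix of sorted samples
  data.frame(
    theoretical = qnorm(ppoints(n)),
    observed    = sort(res),
    lower       = apply(sim, 1, min),
    upper       = apply(sim, 1, max)
  )
}

set.seed(2)
env <- qq_envelope(rnorm(40), n_sim = 19, seed = 67523)

# To draw it:
# plot(env$theoretical, env$observed)
# lines(env$theoretical, env$lower, lty = 2)
# lines(env$theoretical, env$upper, lty = 2)
```

Points falling outside the band suggest that the assumed distribution of the residuals is questionable.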

29 Package HGLMMM
Salamander data
Dependent variable: success of salamander mating, Mate_ij
60 male salamanders (i = 1, ..., 60) and 60 female salamanders (j = 1, ..., 60)
Two populations of salamanders: whiteside (W) and roughbutt (R)
360 observations
Question: does the type of salamander influence the probability of a successful mating?
The model
$$\log\left(\frac{\mu_{ij}}{1-\mu_{ij}}\right) = \text{Intercept} + \text{TypeF}_j + \text{TypeM}_i + \text{TypeF}_j \cdot \text{TypeM}_i + v_i + v_j$$

30 Package HGLMMM
[Figure: crossed random effects, Male by Female.]
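A crossed design means that neither grouping factor is nested in the other. The toy example below, in base R, builds the random-effects design for a small fully crossed layout. It is a simplified stand-in: the real salamander experiment pairs only some of the animals, and the factor sizes here are arbitrary.

```r
# Toy crossed design: every female is paired with every male,
# so neither grouping factor is nested in the other.
pairs <- expand.grid(female = factor(1:4), male = factor(1:3))

Zf <- model.matrix(~ 0 + female, data = pairs)   # 12 x 4 indicator matrix
Zm <- model.matrix(~ 0 + male,   data = pairs)   # 12 x 3 indicator matrix
Z  <- cbind(Zf, Zm)                              # design for v = (v_female, v_male)

# Cross-tabulation: if males were nested in females, each column
# would have a single non-zero cell
crossing <- table(pairs$female, pairs$male)
```

Each row of `Z` has two non-zero entries, one for the female and one for the male involved in that observation.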

31 Package HGLMMM
Gaussian quadrature is infeasible
We will perform the following analyses:
GLM ignoring correlation, in R with glm()
PQL analysis in SAS PROC GLIMMIX
Mixed model using the Laplace approximation, in R with lme4:::lmer()
HL(0,1), in R with the HGLMMM package
HL(1,1), in R with the HGLMMM package
HL(1,1) plus estimation of the overdispersion φ, in R with the HGLMMM package

32 Package HGLMMM
Generalized linear model in SAS
    proc genmod data=sal descending;
      model mate=typefw typemw typefw*typemw / dist=binomial link=logit;
    run;
Generalized linear model in R
    glm(cbind(Mate, 1 - Mate) ~ TypeF + TypeM + TypeF*TypeM,
        family = binomial(link = logit), data = salamander)

33 Package HGLMMM
PQL model in SAS
    proc glimmix data=sal method=rspl;
      class female male;
      model mate=typefw typemw typefw*typemw / dist=binomial link=logit s;
      random female male;
      random _residual_;
    run;
GLMM using the Laplace approximation in lme4:::lmer
    library(lme4)
    # current lme4 releases expect glmer() for non-Gaussian families
    lmer(Mate ~ TypeF + TypeM + TypeF*TypeM + (1|Male) + (1|Female),
         family = binomial(link = logit), data = salamander)

34 Package HGLMMM
Hierarchical Generalized Linear Model - HL(0,1)
    library(HGLMMM)
    # one row per level of each random effect (60 females, 60 males)
    RSal <- data.frame(int = rep(1, 60))
    HGLMfit(DistResp = "Binomial", DistRand = c("Normal", "Normal"),
        Link = "Logit", LapFix = FALSE, ODEst = FALSE, ODEstVal = c(0),
        formulamain = Mate ~ TypeF + TypeM + TypeF*TypeM + (1|Female) + (1|Male),
        formulaod = ~1, formularand = list(one = ~1, two = ~1),
        DataMain = salamander, DataRand = list(RSal, RSal),
        Offset = NULL, BinomialDen = rep(1, 360), INFO = TRUE, DEBUG = FALSE)
Hierarchical Generalized Linear Model - HL(1,1) + overdispersion: the same call with LapFix=TRUE and ODEst=TRUE

35 Package HGLMMM
Salamander data - point estimates
Table columns: Intercept, TypeF, TypeM, TypeF*TypeM, Female, Male, Phi
Table rows: glm, PQL, PQL OD, lme, HL(0,1), HL(1,1), HL(1,1)OD
(The numeric entries are not preserved in the source.)
A whiteside female paired with a roughbutt male has the lowest probability of success.
Pairs from the same population have similar probabilities of successful mating.

36 Package HGLMMM
Salamander data - test statistics
Table columns: Intercept, TypeF, TypeM, TypeF*TypeM
Table rows: glm, PQL, PQL OD, lme, HL(0,1), HL(1,1), HL(1,1)OD
(The numeric entries are not preserved in the source.)

37 Package HGLMMM
Rat data
30 rats, 3 drugs, 4 time points, 120 observations
White blood cell count and red blood cell count
Response: number of cancer cell colonies
Question: is there a difference between the drugs?

38 Package HGLMMM
Poisson model
Quasi-Poisson model
Dispersion component depends on WBC
Diagnostic plots
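The difference between fixing and estimating the overdispersion φ can be illustrated with ordinary GLMs in base R. The sketch uses simulated overdispersed counts and the `quasipoisson` family, whose Pearson-based estimate of φ is not the estimation method used inside HGLMfit; all parameter values are hypothetical.

```r
set.seed(5)
n  <- 300
x  <- rnorm(n)
mu <- exp(1 + 0.3 * x)
y  <- rnbinom(n, mu = mu, size = 2)   # overdispersed counts (variance > mean)

fit_pois  <- glm(y ~ x, family = poisson)       # phi fixed at 1
fit_quasi <- glm(y ~ x, family = quasipoisson)  # phi estimated from Pearson residuals

phi_hat  <- summary(fit_quasi)$dispersion
se_ratio <- coef(summary(fit_quasi))[, "Std. Error"] /
            coef(summary(fit_pois))[,  "Std. Error"]
```

The point estimates are identical in the two fits; only the standard errors change, by a factor of `sqrt(phi_hat)`.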

39 Package HGLMMM
Poisson model
    # subject-level covariates for the random-effect dispersion model
    Rrat <- data.frame(WBC = tapply(rat$WhiteBloodCells, rat$Subject, mean),
                       RBC = tapply(rat$RedBloodCells, rat$Subject, mean))
    modrat1 <- HGLMfit(DistResp = "Poisson", DistRand = c("Normal"), Link = "Log",
        LapFix = FALSE, ODEst = FALSE, ODEstVal = c(0),
        formulamain = Y ~ WhiteBloodCells + RedBloodCells + as.factor(Drug) + (1|Subject),
        formulaod = ~1, formularand = list(one = ~1),
        DataMain = rat, DataRand = list(Rrat), INFO = TRUE, DEBUG = FALSE)

40 Package HGLMMM
[Figure: Q-Q plot for the Poisson model, sample quantiles against theoretical quantiles.]

41 Package HGLMMM
Quasi-Poisson model
    HGLMfit(DistResp = "Poisson", DistRand = c("Normal"), Link = "Log",
        LapFix = FALSE, ODEst = TRUE, ODEstVal = c(0),
        formulamain = Y ~ WhiteBloodCells + RedBloodCells + as.factor(Drug) + (1|Subject),
        formulaod = ~1, formularand = list(one = ~WBC + I(WBC^2)),
        DataMain = rat, DataRand = list(Rrat), INFO = TRUE, DEBUG = FALSE)

42 Package HGLMMM
[Figure: diagnostics for the rat quasi-Poisson model (y|v). Panels: deviance residuals against scaled fitted values; absolute deviance residuals against scaled fitted values; normal Q-Q plot (sample against theoretical quantiles); histogram of deviance residuals.]

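43 Package HGLMMM
Table columns: Poisson, Quasi-Poisson, Quasi-Poisson PQL
Table rows, mean structure: Intercept, WBC, RBC, two Drug contrasts
Table rows, overdispersion: Phi
Table rows, dispersion model: Intercept, WBC, and a second WBC term
(The numeric entries are not preserved in the source; the surviving fragments show values reported as <0.001 in the Intercept, WBC and RBC rows.)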

44 Package HGLMMM
Cake data
Dependent variable: breaking angle of cakes
270 cakes
3 recipes and 6 temperatures
Cakes baked in batches of 18 (3 recipes × 6 temperatures)
Random effects for batch and for recipe within batch
Question: what is the effect of the baking temperature and recipe on the breaking angle?
The model
$$\eta_{ijk} = \text{intercept} + \text{recipe}_j + \text{temp}_k + \text{recipe}_j \cdot \text{temp}_k + v_i + v_{ij}$$

45 Package HGLMMM
Models considered
Breaking angle as a normal or gamma random variable?
Which distribution for the random effects?
One or two random effects?
Which mean structure: do we need an interaction?

46 Package HGLMMM
Modeling strategy
Start with a complex model
Use AIC (marginal likelihood) to select the distribution of the response
Use the h-likelihood to select the distribution of the random effects
Use the LR test (REML) to test whether the variance component of a random effect equals zero
Use the LR test (marginal likelihood) to test for the interaction
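The AIC comparison between response distributions can be sketched with ordinary GLMs in base R. This illustration uses simulated positive data and the GLM log-likelihood, whereas the strategy above relies on the marginal likelihood of the HGLM fit; the helper `aic_from_loglik` is hypothetical.

```r
# AIC = -2 * log-likelihood + 2 * (number of parameters); smaller is better.
aic_from_loglik <- function(loglik, n_par) -2 * loglik + 2 * n_par

set.seed(9)
x <- runif(80, 1, 3)
y <- rgamma(80, shape = 5, rate = 5 / exp(0.5 + 0.4 * x))  # positive, right-skewed

fit_norm  <- glm(y ~ x, family = gaussian)
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"))

ll_norm   <- logLik(fit_norm)
ll_gamma  <- logLik(fit_gamma)
aic_norm  <- aic_from_loglik(as.numeric(ll_norm),  attr(ll_norm,  "df"))
aic_gamma <- aic_from_loglik(as.numeric(ll_gamma), attr(ll_gamma, "df"))
```

Both models are fitted to the same untransformed response, so their AIC values are directly comparable.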

47 Package HGLMMM
Normal model
===== Likelihood Functions Value =====
H-likelihood :
Marginal likelihood:
REML likelihood :
C-likelihood :
Gamma model
===== Likelihood Functions Value =====
H-likelihood :
Marginal likelihood:
REML likelihood :
C-likelihood :
(The likelihood values are not preserved in the source.)
We proceed with the gamma model.

48 Package HGLMMM
Selection of random effects
Effect 1   Effect 2   H-likelihood
Normal     Normal
Normal     IGamma
IGamma     IGamma
Gamma      Gamma
Beta       Beta
(The h-likelihood values are not preserved in the source.)
Let's keep Gaussian random effects.

49 Package HGLMMM
Do we need both random effects?
    > HGLMLRTest(modCake2,modCake7)
    H-likelihood of model 1 is higher
    Marginal likelihood comparison:
    LR test p-value: NA
    LR test statistics:
    LR difference df: 0
    REML likelihood comparison:
    LR test p-value:
    LR test statistics:
    LR difference df: 1
We prefer the model with two random effects.

50 Package HGLMMM
Do we need an interaction in the mean structure?
    > HGLMLRTest(modCake2,modCake8)
    H-likelihood of model 1 is higher
    Marginal likelihood comparison:
    LR test p-value:
    LR test statistics:
    LR difference df: 10
    REML likelihood comparison:
    LR test p-value: NA
    LR test statistics:
    LR difference df: 0
We prefer the simpler model.

51 Package HGLMMM
Further developments
Make the package more compatible with R style
Add estimation of random effects with known correlation
Implement non-canonical links: probit, cloglog
Use the Matrix package for large datasets
Efficient computation of the matrix $T(T^{T}\Sigma_a^{-1}T)^{-1}T^{T}\Sigma_a^{-1}$
Second-order approximations
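One common way to avoid forming the full hat matrix is to take the leverages from a QR decomposition of the row-scaled design. The base R sketch below assumes that Σ_a is diagonal and uses a small random matrix as a stand-in for T; it is one possible approach, not the package's implementation.

```r
set.seed(4)
n <- 8
p <- 3
Tm    <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # stand-in for the design T
sigma <- runif(n, 0.5, 2)                          # diagonal of Sigma_a (assumed diagonal)

# Direct formula: T (T' S^-1 T)^-1 T' S^-1
S_inv    <- diag(1 / sigma)
H_direct <- Tm %*% solve(t(Tm) %*% S_inv %*% Tm) %*% t(Tm) %*% S_inv

# Leverages without forming H: QR of the row-scaled design S^(-1/2) T
Q   <- qr.Q(qr(Tm / sqrt(sigma)))
lev <- rowSums(Q^2)
```

The QR route needs only an n × p matrix instead of the n × n hat matrix, which matters when n is large.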

52 Package HGLMMM
Currently known bugs
ODEst=FALSE with a Gaussian response does not work properly
Full description of random effects in the summary function (intercept/subject/distribution)
Proper handling of missing values
Reports of other bugs are welcome
Thank you for your attention


Hierarchical Hurdle Models for Zero-In(De)flated Count Data of Complex Designs. Marek Molas (1), Emmanuel Lesaffre (1,2). (1) Erasmus MC, Erasmus Universiteit Rotterdam, The Netherlands; (2) L-Biostat, Katholieke Universiteit Leuven, Belgium.

