Package HGLMMM for Hierarchical Generalized Linear Models

Marek Molas, Emmanuel Lesaffre
Erasmus MC, Erasmus Universiteit Rotterdam, The Netherlands
ERASMUSMC - Biostatistics, 20-04-2010

Outline
  General syntax guide
  Some underlying theoretical concepts
  Examples of analyses
  Comparison with existing methods
  Further developments

Examples
  Salamander data - crossed random effects
  Dialyzer data - longitudinal data
  Dialyzer data - correlated random effects
  Rats data - overdispersion modeling
  Cake data - AIC and model comparison

Hierarchical Generalized Linear Models
  Distribution of a response: exponential family density
    $$f(y;\theta,\phi) = \exp\left[\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right]$$
  The mean of the distribution
    $$E[y] = b'(\theta) = \mu$$
  µ - the location of the distribution
  φ - the scale of the distribution, or overdispersion
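As a worked instance of the density above, the Poisson distribution with mean µ can be written in exponential family form. This is a standard textbook identity, added here for illustration rather than taken from the slides:

```latex
f(y;\mu) = \frac{\mu^{y} e^{-\mu}}{y!}
         = \exp\left[\frac{y\log\mu - \mu}{1} - \log y!\right],
\qquad
\theta = \log\mu,\quad b(\theta) = e^{\theta},\quad \phi = 1,\quad c(y,\phi) = -\log y!
```

Differentiating gives b'(θ) = e^θ = µ, which agrees with E[y] = b'(θ) = µ.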

Hierarchical Generalized Linear Models
  The link function
    g(µ) = η
  The linear predictor
    η = Xβ + Zv
  Fixed effects in the mean structure - β
  Random effects in the mean structure - v, assumed to originate from a distribution indexed by a dispersion parameter λ_v
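The linear predictor can be made concrete with a small base R sketch. The data, the coefficient values and the object names below are all hypothetical, and the code does not use HGLMMM itself:

```r
# Toy data: 3 clusters with 2 observations each (all values are illustrative)
toy <- data.frame(
  x       = c(0.1, 0.5, 0.9, 1.3, 1.7, 2.1),
  cluster = factor(rep(c("a", "b", "c"), each = 2))
)

# Fixed-effects design matrix X: intercept plus covariate x
X <- model.matrix(~ x, data = toy)

# Random-intercept design matrix Z: one indicator column per cluster
Z <- model.matrix(~ 0 + cluster, data = toy)

beta <- c(1.0, 0.5)         # hypothetical fixed effects
v    <- c(-0.2, 0.0, 0.3)   # hypothetical random intercepts

# Linear predictor eta = X beta + Z v
eta <- drop(X %*% beta + Z %*% v)

# With a log link, the conditional mean is mu = exp(eta)
mu_log <- exp(eta)
```

Each row of Z selects the random intercept of the cluster that the observation belongs to, which is what a term such as (1|cluster) encodes in an lme4-style formula.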

Functions currently in the package HGLMMM
  HGLMfit - fitting function
  HGLMLikeDeriv - displays derivatives of the fit
  HGLMLRTest - likelihood ratio test between two nested models
  BootstrapEnvelopeHGLM - creates bootstrap envelopes for deviance residuals
  summary.hglm - prints a summary of the fit

HGLMfit syntax
    HGLMfit(DistResp = "Normal", DistRand = NULL, Link = NULL,
            LapFix = FALSE, ODEst = NULL, ODEstVal = 0,
            formulaMain, formulaOD, formulaRand, DataMain, DataRand,
            Offset = NULL, BinomialDen = NULL,
            StartBeta = NULL, StartVs = NULL, StartRGamma = NULL,
            INFO = TRUE, DEBUG = FALSE, na.action, contrasts = NULL,
            CONV = 1e-04)

HGLMfit syntax description
  DistResp - distribution of the response: "Normal", "Binomial", "Poisson" or "Gamma"
  DistRand - distribution of the random effects: a vector of distributions with length equal to the number of random components, each entry one of "Beta", "Gamma", "IGamma", "Normal"

HGLMfit syntax description
  Link - link function for the response
    Canonical links are available for Normal, Poisson and Binomial responses
    The Gamma distribution has the Log or Inverse link available
  LapFix - whether p_v(h) is used for the estimation of the fixed effects
    If TRUE, an additional piece of code estimates the fixed effects as in Noh and Lee (2007)
    If FALSE, the hierarchical likelihood is used to estimate both fixed and random parameters

HGLMfit syntax description
  ODEst - whether the overdispersion parameter is fixed or estimated
    If NULL, it is fixed for Poisson and Binomial and estimated for Normal and Gamma
    If TRUE, the overdispersion structure is estimated
    If FALSE, the overdispersion structure is held fixed
  formulaMain - formula for the mean structure of the model
    Fixed and random components are written in one formula, as in lme4

HGLMfit syntax description
  formulaOD - dispersion structure (residual/overdispersion)
    One-sided formula
  formulaRand - dispersion structure of the random effects
    A list of one-sided formulas; the number of list entries must equal the number of dispersion components
  DataMain - the main dataset, used for formulaMain and formulaOD
  DataRand - a list of the data frames used for formulaRand
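The examples later in this deck build each DataRand entry with one row per level of the random effect (41 dialyzers, 60 salamanders, per-rat means). A minimal base R sketch of that pattern follows; the toy data and the names long, rand_frame and wbc_mean are hypothetical, and the one-row-per-level layout is inferred from those examples rather than stated on the slide:

```r
# Toy long-format data: 3 subjects, 2 visits each (values are illustrative)
long <- data.frame(
  subject = rep(c("s1", "s2", "s3"), each = 2),
  wbc     = c(4.0, 5.0, 6.0, 6.5, 3.0, 3.5)
)

# Cluster-level frame: one row per subject, holding an intercept column
# and the subject mean of wbc as a candidate dispersion covariate
rand_frame <- data.frame(
  int      = rep(1, length(unique(long$subject))),
  wbc_mean = as.vector(tapply(long$wbc, long$subject, mean))
)

nrow(rand_frame)   # equals the number of subjects
```

A frame like this could then serve one-sided formulas such as ~1 or ~wbc_mean for the dispersion of the corresponding random effect.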

HGLMfit syntax description
  Offset - offset variable t in Poisson regression, so that the model describes log(µ/t)
  BinomialDen - denominator of the Binomial distribution; a vector of length equal to the number of observations
  StartBeta, StartVs, StartRGamma - starting values for the fixed parameters, the random effects and the dispersion parameters of the random effects, respectively
  ODEstVal - values for the overdispersion/residual dispersion structure

Class HGLM objects - result of HGLMfit estimation
  Results - contains estimates
  Details - contains designs
  NAMES - contains labels for the printout of the results
  CALL - contains the original call of the estimating function HGLMfit

Class HGLM objects - component Results
  Estimates of fixed and random effects in the mean structure
  Estimates of dispersion and (over)/residual dispersion parameters
  Gradient / Hessian / StdErrors of fixed, random and dispersion estimates
  Values of h-likelihood, marginal likelihood (REML) and conditional likelihood

Class HGLM objects - component Details
  Deviance residuals and standardized deviance residuals
  Involving the proper hat matrix
    For the outcome (assumed distribution)
    For the random effects (assumed distribution)
    For the (over)/residual dispersion (gamma distribution)
    For the dispersion components (gamma distribution)

Other functions description
  HGLMLRTest
    Likelihood ratio test comparing two models; takes two arguments, both objects of class HGLM
  HGLMLikeDeriv
    Gives gradients of the fixed effects in the mean structure and of the variance components
  BootstrapEnvelopeHGLM
    Creates 95% bootstrap envelopes for the residual diagnostics

Examples - Dialyzer data
  Dialyzer dataset
    Response is UFR
    Covariate of interest is TMP
    3 centers involved - coded in a center variable
    Random effect - dialyzer number
  Aim: determine the relationship between UFR and TMP, and whether this relationship differs across the three centers, which use different systems to manipulate the TMP

Examples - Dialyzer data
  [Figure: UFR versus TMP, by center (Center 1, Center 2, Center 3)]

Examples - Dialyzer data
  Standard analysis via SAS PROC MIXED
  Random intercept model
  Random intercept and slope model - no correlation
  Random intercept and slope model - fixed correlation
    Search over a grid for the correlation value

Dialyzer data - random intercept model
    # keep complete cases and standardize the response
    dialyzer1 <- dialyzer[complete.cases(dialyzer), ]
    dialyzer1$UFRSTD <- (dialyzer1$UFR - mean(dialyzer1$UFR)) / sd(dialyzer1$UFR)
    # one row per dialyzer (41 dialyzers)
    DatasetRAEF <- data.frame(intercept = rep(1, 41))
    mod_dial1 <- HGLMfit(DistResp = "Normal", DistRand = c("Normal"),
        Link = "Identity", LapFix = FALSE, ODEst = TRUE, ODEstVal = 0,
        UFRSTD ~ TMP + as.factor(CENTER) + as.factor(CENTER):TMP + (1|DIALYZER),
        formulaOD = ~ 1, list(one = ~1),
        DataMain = dialyzer1, DataRand = list(DatasetRAEF),
        Offset = NULL, BinomialDen = NULL,
        StartBeta = NULL, StartVs = NULL, StartRGamma = NULL,
        INFO = TRUE, DEBUG = FALSE, contrasts = NULL, CONV = 1e-04)
    summary(mod_dial1)

Dialyzer data - random intercept/slope model
    mod_dial2 <- HGLMfit(DistResp = "Normal", DistRand = c("Normal", "Normal"),
        Link = "Identity", LapFix = FALSE, ODEst = TRUE, ODEstVal = 0,
        UFRSTD ~ TMP + as.factor(CENTER) + as.factor(CENTER):TMP +
            (1|DIALYZER) + (TMP|DIALYZER),
        formulaOD = ~ 1, list(one = ~1, two = ~1),
        DataMain = dialyzer1, DataRand = list(DatasetRAEF, DatasetRAEF),
        Offset = NULL, BinomialDen = NULL,
        StartBeta = NULL, StartVs = NULL, StartRGamma = NULL,
        INFO = TRUE, DEBUG = FALSE, contrasts = NULL)
        # the CONV argument is cut off on the slide; the default shown in the
        # HGLMfit syntax is 1e-04
    summary(mod_dial2)

Dialyzer data - known correlation parameter
  Assume the correlation between random intercept and slope is known: ρ = -0.648
  Fit the model under independence to obtain estimates of the variances of intercept and slope, then construct the variance-covariance matrix from the known correlation and the estimated variances
  Compute the Cholesky decomposition of this matrix
  Change the design matrix of the random effects accordingly
  Refit the model, update the variance estimates and use them to construct a new covariance matrix with the known correlation
  Compute the Cholesky decomposition of the new matrix and refit the model after changing the design matrix again
  Stop the procedure when the variance components of the fit are close to 1

Dialyzer data - known correlation parameter
  If the variances of random intercept and slope are assumed equal, only one step is required
  If the correlation is unknown, a grid search could be done
    This implies many iterations in nested loops, which is inefficient
    A possible modification of the current code would do this at every iteration

Dialyzer data - known correlation parameter
    # variances from the independence fit (mod_dial2 on the first pass;
    # the slide reads mod_dial3, which applies on later passes)
    temp1 <- as.vector(exp(mod_dial2$Results$Dispersion))
    rho <- -0.648
    Rcov <- matrix(c(temp1[1], rho*sqrt(temp1[1]*temp1[2]),
                     rho*sqrt(temp1[1]*temp1[2]), temp1[2]), 2, 2)
    tempchol <- chol(Rcov)
    # transform the random-effects design with the Cholesky factor
    originalz <- cbind(rep(1, nrow(dialyzer1)), dialyzer1$TMP)
    modifiedz <- originalz %*% t(tempchol)
    dialyzer1$NEWINT <- modifiedz[, 1]
    dialyzer1$NEWTMP <- modifiedz[, 2]
    mod_dial3 <- HGLMfit(DistResp = "Normal", DistRand = c("Normal", "Normal"),
        Link = "Identity", LapFix = FALSE, ODEst = TRUE, ODEstVal = 0,
        UFRSTD ~ TMP + as.factor(CENTER) + as.factor(CENTER):TMP +
            (NEWINT|DIALYZER) + (NEWTMP|DIALYZER),
        formulaOD = ~ 1, list(one = ~1, two = ~1), DataMain = dialyzer1,
        DataRand = list(DatasetRAEF, DatasetRAEF),
        Offset = NULL, BinomialDen = NULL,
        StartBeta = NULL, StartVs = NULL, StartRGamma = NULL,
        INFO = TRUE, DEBUG = FALSE)
        # the remaining arguments are cut off on the slide

Dialyzer data - known correlation parameter
    temp2 <- as.vector(exp(mod_dial3$Results$Dispersion))
    rho <- -0.648
    # covariance of the original random effects implied by the refit
    temp3 <- t(tempchol) %*% matrix(c(temp2[1], 0, 0, temp2[2]), 2, 2) %*% tempchol
    # keep the updated variances, reimpose the known correlation
    Rcov1 <- matrix(c(temp3[1,1], rho*sqrt(temp3[1,1]*temp3[2,2]),
                      rho*sqrt(temp3[1,1]*temp3[2,2]), temp3[2,2]), 2, 2)
    tempchol <- chol(Rcov1)
    originalz <- cbind(rep(1, nrow(dialyzer1)), dialyzer1$TMP)
    modifiedz <- originalz %*% t(tempchol)
    dialyzer1$NEWINT <- modifiedz[, 1]
    dialyzer1$NEWTMP <- modifiedz[, 2]
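The reason the design matrix is multiplied by the transposed Cholesky factor can be checked numerically. The sketch below uses base R only; the variances and the covariate values are hypothetical, and only rho is taken from the slides:

```r
# Hypothetical variances of random intercept and slope (not from the slides)
var_int   <- 0.8
var_slope <- 0.2
rho       <- -0.648   # correlation treated as known, as on the slides

# Covariance matrix implied by the variances and the known correlation
cov12 <- rho * sqrt(var_int * var_slope)
Sigma <- matrix(c(var_int, cov12, cov12, var_slope), 2, 2)

# chol() returns the upper-triangular factor R with t(R) %*% R equal to Sigma
R <- chol(Sigma)

# Original random-effects design: intercept and a covariate (toy values)
tmp <- c(1.5, 2.0, 2.5, 3.0)
Z   <- cbind(1, tmp)

# Transformed design: v = t(R) u, where u has identity covariance
Z_mod <- Z %*% t(R)

# Both parameterizations imply the same covariance for Z v
lhs <- Z_mod %*% t(Z_mod)
rhs <- Z %*% Sigma %*% t(Z)
max(abs(lhs - rhs))   # numerically zero
```

Because the transformed random effects u are uncorrelated, a fitting routine that only supports independent random effects can be applied to Z_mod, and fitted variances close to 1 for u indicate that Sigma already matches the data.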

Dialyzer data - known correlation parameter
  Results
    ===== Fixed Coefficients - Mean Structure =====
                              Estimate Std. Error Z value Pr(>|Z|)
    (Intercept)             -2.7702081  0.0314168 -88.176  < 2e-16 ***
    TMP                      0.0093937  0.0001110  84.623  < 2e-16 ***
    as.factor(CENTER)2       0.0113560  0.0474683   0.239 0.810925
    as.factor(CENTER)3       0.0350136  0.0513306   0.682 0.495164
    TMP:as.factor(CENTER)2  -0.0006180  0.0001692  -3.653 0.000259 ***
    TMP:as.factor(CENTER)3  -0.0006929  0.0001813  -3.821 0.000133 ***
    ---
    ===== Overdispersion Parameters Estimated =====
                Estimate Std. Error Z value Pr(>|Z|)
    (Intercept)   -5.700      0.149  -38.27   <2e-16 ***
    ---

Dialyzer data - known correlation parameter
    ===== Dispersion Parameters Estimated =====
    Dispersion Component: DIALYZER
                  Estimate Std. Error   Z value Pr(>|Z|)
    (Intercept) -0.0001058  0.4002638 -0.000264        1
    Dispersion Component: DIALYZER
                Estimate Std. Error Z value Pr(>|Z|)
    (Intercept) -0.00104    0.25326  -0.004    0.997
    ===== Likelihood Functions Value =====
    H-likelihood       : 162.3979
    Marginal likelihood: 170.8414
    REML likelihood    : 138.1487
    C-likelihood       : 265.6102
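The two dispersion intercepts above can be read against the stopping rule of the procedure. The sketch below assumes they are on the log scale, which is inferred from the slides' own code applying exp() to the fitted dispersion; the 0.01 tolerance is an arbitrary choice for illustration:

```r
# Dispersion intercepts reported for the two DIALYZER components (log scale)
log_disp <- c(-0.0001058, -0.00104)

# Back-transform to the variance scale
variances <- exp(log_disp)
round(variances, 4)

# Stopping rule from the procedure: both variances close to 1
all(abs(variances - 1) < 0.01)
```

Both back-transformed variances lie within 0.2% of 1, consistent with the iteration having converged at the reported fit.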

Examples - Dialyzer data
    BootstrapEnvelopeHGLM(mod_dial_final, 19, 67523)
  [Figure: Q-Q plot with bootstrap envelope; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Salamander data
  Dependent variable: success of salamanders mating, Mate_ij
  60 male salamanders (i = 1...60) + 60 female salamanders (j = 1...60)
  Two populations of salamanders: whiteside (W) and roughbutt (R)
  360 observations
  Question: does the type of salamander influence the probability of a successful mating?
  The model
    $$\log\left(\frac{\mu_{ij}}{1-\mu_{ij}}\right) = \text{Intercept} + \text{TypeF}_j + \text{TypeM}_i + \text{TypeF}_j\,\text{TypeM}_i + v_i + v_j$$

Crossed random effects
  [Figure: pairing scheme of male and female salamanders]

Gaussian quadrature infeasible
We will perform the following analyses:
  GLM ignoring correlation in R glm()
  PQL analysis in SAS PROC GLIMMIX
  Mixed model using Laplace approximation in R lme4:::lmer()
  HL(0,1) in R HGLMMM package
  HL(1,1) in R HGLMMM package
  HL(1,1) + estimation of overdispersion φ in R HGLMMM package

Generalized linear model in SAS
    proc genmod data=sal descending;
      model mate = typefw typemw typefw*typemw / dist=binomial link=logit;
    run;
Generalized linear model in R
    glm(cbind(mate, 1 - mate) ~ typef + typem + typef*typem,
        family = binomial(link = logit), data = salamander)

PQL model in SAS
    proc glimmix data=sal method=rspl;
      class female male;
      model mate = typefw typemw typefw*typemw / dist=binomial link=logit s;
      random female male;
      random _residual_;
    run;
GLMM using Laplace approximation in lme4:::lmer
    library(lme4)
    lmer(mate ~ typef + typem + typef*typem + (1|Male) + (1|Female),
         family = binomial(link = logit), data = salamander)

Hierarchical Generalized Linear Model - HL(0,1)
    library(HGLMMM)
    # one row per salamander (60 males, 60 females)
    RSal <- data.frame(int = rep(1, 60))
    HGLMfit(DistResp = "Binomial", DistRand = c("Normal", "Normal"),
        Link = "Logit", LapFix = FALSE, ODEst = FALSE, ODEstVal = c(0),
        formulaMain = mate ~ typef + typem + typef*typem + (1|Female) + (1|Male),
        formulaOD = ~1, formulaRand = list(one = ~1, two = ~1),
        DataMain = salamander, DataRand = list(RSal, RSal),
        Offset = NULL, BinomialDen = rep(1, 360), INFO = TRUE, DEBUG = FALSE)
Hierarchical Generalized Linear Model - HL(1,1) + overdispersion
    LapFix = TRUE
    ODEst = TRUE

Salamander data - point estimates
               Intercept  TypeF  TypeM  TypeF*TypeM  Female  Male   Phi
    glm             0.69  -2.01  -0.47         2.48                   1
    PQL             0.79  -2.29  -0.54         2.82    0.72  0.63     1
    PQL OD          0.93  -2.73  -0.65         3.35    1.44  1.31  0.66
    lme             1.01  -2.9   -0.7          3.59    1.17  1.04     1
    HL(0,1)         0.83  -2.43  -0.58         2.99    1.12  0.97     1
    HL(1,1)         1.04  -3.01  -0.73         3.71    1.38  1.21     1
    HL(1,1)OD       1.11  -3.2   -0.78         3.94    1.73  1.52  0.89
  Whiteside female and roughbutt male have the lowest probability of success
  Pairs from the same population have similar probabilities of successful mating
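The two conclusions above can be read off the logit-scale estimates by converting them to probabilities. The sketch below uses the HL(1,1) row and assumes that TypeF and TypeM are indicators for a whiteside female and a whiteside male (suggested by the SAS variable names typefw and typemw, but not stated on the slides); the probabilities are conditional on both random effects being zero:

```r
# HL(1,1) point estimates from the table above (logit scale)
b0 <- 1.04; bF <- -3.01; bM <- -0.73; bFM <- 3.71

# Assumed coding: TypeF = 1 for a whiteside female, TypeM = 1 for a whiteside male
eta <- c(
  RF_RM = b0,
  WF_RM = b0 + bF,
  RF_WM = b0 + bM,
  WF_WM = b0 + bF + bM + bFM
)

# Inverse logit gives the success probability at v_i = v_j = 0
prob <- plogis(eta)
round(prob, 2)

# Pairing with the lowest success probability
names(which.min(prob))
```

Under this coding the whiteside female with roughbutt male pairing has the smallest probability, and the two same-population pairings differ by less than one percentage point.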

Salamander data - test statistics
               Intercept  TypeF  TypeM  TypeF*TypeM
    glm              3.1   -5.9   -1.5          5.4
    PQL              2.5   -5.3   -1.4          5.7
    PQL OD           2.5   -5.8   -1.5          7.2
    lme              2.7   -5.8   -1.6          6.6
    HL(0,1)          2.3   -5.1   -1.4          5.8
    HL(1,1)          2.6   -5.7   -1.5          6.5
    HL(1,1)OD        2.6   -5.9   -1.6          6.9

Rat data
  30 rats
  3 drugs
  4 timepoints
  120 observations
  White blood cell count and red blood cell count
  Response: number of cancer cell colonies
  Question: is there a difference between the drugs?

  Poisson model
  Quasi-Poisson model
  Dispersion component depends on WBC
  Diagnostic plots

Poisson model
    # subject-level means, one row per rat
    Rrat <- data.frame(WBC = tapply(rat$WhiteBloodCells, rat$Subject, mean),
                       RBC = tapply(rat$RedBloodCells, rat$Subject, mean))
    modrat1 <- HGLMfit(DistResp = "Poisson", DistRand = c("Normal"), Link = "Log",
        LapFix = FALSE, ODEst = FALSE, ODEstVal = c(0),
        formulaMain = Y ~ WhiteBloodCells + RedBloodCells + as.factor(Drug) + (1|Subject),
        formulaOD = ~1, formulaRand = list(one = ~1),
        DataMain = rat, DataRand = list(Rrat), INFO = TRUE, DEBUG = FALSE)

  [Figure: Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Quasi-Poisson Model
    HGLMfit(DistResp = "Poisson", DistRand = c("Normal"), Link = "Log",
        LapFix = FALSE, ODEst = TRUE, ODEstVal = c(0),
        formulaMain = Y ~ WhiteBloodCells + RedBloodCells + as.factor(Drug) + (1|Subject),
        formulaOD = ~1, formulaRand = list(one = ~WBC + I(WBC^2)),
        DataMain = rat, DataRand = list(Rrat), INFO = TRUE, DEBUG = FALSE)

Diagnostics for Rat Model Quasi-Poisson (y|v)
  [Figure, four panels: Deviance Residuals vs Scaled Fitted Values; Absolute Deviance Residuals vs Scaled Fitted Values; Normal Q-Q Plot (Sample Quantiles vs Theoretical Quantiles); Histogram of Deviance Residuals (Frequency)]

                 Poisson          Quasi-Poisson    Quasi-Poisson    PQL
                 Est.    p        Est.    p        Est.    p        Est.    p
    Intercept    3.301  <0.001    2.855  <0.001    2.709  <0.001    2.855  <0.001
    WBC         -0.052  <0.001   -0.019  <0.001   -0.014   0.006   -0.019  <0.001
    RBC          0.013   0.41     0.029  <0.001    0.028  <0.001    0.029  <0.001
    Drug=2       0.197   0.034    0.146   0.345    0.166   0.098    0.146   0.347
    Drug=3       0.186   0.055   -0.045   0.773    0.109   0.247   -0.046   0.773
    Phi          1                0.111            0.104            0.117
    Dispersion component
    Intercept   -3.667           -2.139            1.091           -2.21
    WBC                                           -0.599
    WBC^2                                          0.018

Cake data
  Dependent variable: breaking angle of cakes
  270 cakes
  3 recipes and 6 temperatures
  Cakes baked in batches of 18 (3 recipes * 6 temperatures)
  Random effect for batch and random effect for recipe within batch
  Question: what is the effect of the baking temperature and recipe on the breaking angle?
  The model
    $$\eta_{ijk} = \text{intercept} + \text{recipe}_j + \text{temp}_k + \text{recipe}_j\,\text{temp}_k + v_i + v_{ij}$$

Models considered
  Breaking angle as a normal or gamma random variable
  Which distribution for the random effects
  One or two random effects
  Which mean structure - do we need an interaction?

Modeling strategy
  Start with a complex model
  Use AIC (marginal likelihood) to select the distribution of the response
  Use the h-likelihood to select the distribution of the random effects
  Use an LR test (REML) to test whether the variance component of a random effect equals zero
  Use an LR test (marginal likelihood) to test for the interaction

Normal Model
    ===== Likelihood Functions Value =====
    H-likelihood       : -893.6902
    Marginal likelihood: -819.5366
    REML likelihood    : -797.6732
    C-likelihood       : -767.5713
Gamma Model
    ===== Likelihood Functions Value =====
    H-likelihood       : -676.3907
    Marginal likelihood: -808.0586
    REML likelihood    : -848.9244
    C-likelihood       : -754.2644
We proceed with the gamma model
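The choice of the gamma model can be illustrated by an AIC comparison built on the marginal likelihoods above. This is a sketch only: the slides do not report the number of parameters, so n_par below is a placeholder, and it is assumed to be the same for both models so that the penalty cancels in the difference:

```r
# Marginal log-likelihoods reported above
loglik <- c(normal = -819.5366, gamma = -808.0586)

# AIC = -2 * logLik + 2 * (number of parameters)
aic <- function(loglik, n_par) -2 * loglik + 2 * n_par

# n_par is a placeholder: the slides do not report the parameter counts.
# It is assumed equal for both models, so it cancels in the difference.
n_par <- 20
aic_values <- aic(loglik, n_par)

# A positive difference favours the gamma model
aic_diff <- unname(aic_values["normal"] - aic_values["gamma"])
aic_diff
```

With equal parameter counts the AIC difference reduces to twice the difference in marginal log-likelihood, which is why the model with the higher marginal likelihood is retained.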

Selection of random effects
    Effect 1   Effect 2   H-likelihood
    Normal     Normal     -676.4
    Normal     IGamma     -874.8
    IGamma     IGamma     -911.4
    Gamma      Gamma      -676.9
    Beta       Beta       -676.4
Let's keep Gaussian random effects

Do we need both random effects?
    > HGLMLRTest(modCake2, modCake7)
    H-likelihood of model 1 is higher
    Marginal likelihood comparison:
    LR test p-value: NA
    LR test statistics: 12.70101
    LR difference df: 0
    REML likelihood comparison:
    LR test p-value: 0.0005694908
    LR test statistics: 11.87315
    LR difference df: 1
We prefer the model with two random effects

Do we need an interaction in the mean structure?
    > HGLMLRTest(modCake2, modCake8)
    H-likelihood of model 1 is higher
    Marginal likelihood comparison:
    LR test p-value: 0.5034955
    LR test statistics: 9.304224
    LR difference df: 10
    REML likelihood comparison:
    LR test p-value: NA
    LR test statistics: 29.88638
    LR difference df: 0
We prefer the simpler model
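The p-values printed for the two cake-data tests are consistent with a plain chi-square reference distribution. The base R sketch below reproduces them from the reported statistics and degrees of freedom; that HGLMLRTest computes them this way is an assumption, not something the slides state:

```r
# LR statistics and degrees of freedom reported on the slides
reml_stat <- 11.87315   # variance component test, df = 1
marg_stat <- 9.304224   # interaction test, df = 10

# Upper-tail chi-square probabilities
p_reml <- pchisq(reml_stat, df = 1,  lower.tail = FALSE)
p_marg <- pchisq(marg_stat, df = 10, lower.tail = FALSE)

c(REML = p_reml, marginal = p_marg)
```

The comparison with df = 0 in each output is reported as NA, presumably because that likelihood is not the appropriate one for the parameters being tested.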

Further developments
  Make the package more compatible with R style
  Add estimation of random effects with known correlation
  Implement non-canonical links - probit, cloglog
  Use package Matrix for large datasets
  Efficient computation of the matrix $T (T^{T} \Sigma_a^{-1} T)^{-1} T^{T} \Sigma_a^{-1}$
  Second order approximations

Currently known bugs
  ODEst=FALSE with a Gaussian response does not work properly
  Full description of random effects in the summary function is missing: intercept/subject/distribution
  Proper handling of missing values
  Reports of other bugs are welcome
Thank you for your attention
m.molas@erasmusmc.nl