Generalized linear mixed models for biologists

Similar documents
Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Generalized linear mixed models: a practical guide for ecology and evolution

Lecture 9 STK3100/4100

ML estimation: Random-intercepts logistic model. and z

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

Package HGLMMM for Hierarchical Generalized Linear Models

Introduction (Alex Dmitrienko, Lilly) Web-based training program

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Chapter 4 Multi-factor Treatment Designs with Multiple Error Terms 93

The performance of estimation methods for generalized linear mixed models

SAS/STAT 15.1 User s Guide Introduction to Mixed Modeling Procedures

Outline of GLMs. Definitions

Generalized Linear Models for Non-Normal Data

WU Weiterbildung. Linear Mixed Models

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM

Generalized, Linear, and Mixed Models

Generalized Models: Part 1

Outline for today. Computation of the likelihood function for GLMMs. Likelihood for generalized linear mixed model

T E C H N I C A L R E P O R T A SANDWICH-ESTIMATOR TEST FOR MISSPECIFICATION IN MIXED-EFFECTS MODELS. LITIERE S., ALONSO A., and G.

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

PQL Estimation Biases in Generalized Linear Mixed Models

Poisson regression: Further topics

Approximate Likelihoods

A strategy for modelling count data which may have extra zeros

SAS/STAT 13.1 User s Guide. Introduction to Mixed Modeling Procedures

Statistics 572 Semester Review

Open Problems in Mixed Models

Introduction to Generalized Models

1 Mixed effect models and longitudinal data analysis

STAT 740: Testing & Model Selection

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

A Joint Model with Marginal Interpretation for Longitudinal Continuous and Time-to-event Outcomes

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Model comparison and selection

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Generalized Linear Models

Notes for week 4 (part 2)

Multilevel Methodology

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Generalized Linear Models (GLZ)

The combined model: A tool for simulating correlated counts with overdispersion. George KALEMA Samuel IDDI Geert MOLENBERGHS

36-463/663: Multilevel & Hierarchical Models

Chapter 1. Modeling Basics

Sociology 6Z03 Review II

Applied Asymptotics Case studies in higher order inference

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

R Package glmm: Likelihood-Based Inference for Generalized Linear Mixed Models

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Logistic regression: Miscellaneous topics

GENERALIZED LINEAR MIXED MODELS AND MEASUREMENT ERROR. Raymond J. Carroll: Texas A&M University

Outline. Mixed models in R using the lme4 package Part 3: Longitudinal data. Sleep deprivation data. Simple longitudinal data

Linear Regression Models P8111

Type I and type II error under random-effects misspecification in generalized linear mixed models Link Peer-reviewed author version

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

Chapter 1 Statistical Inference

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM

Repeated ordinal measurements: a generalised estimating equation approach

Discussion of Maximization by Parts in Likelihood Inference

Conditional Inference Functions for Mixed-Effects Models with Unspecified Random-Effects Distribution

Longitudinal Modeling with Logistic Regression

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

A brief introduction to mixed models

Generalized Linear Mixed-Effects Models. Copyright c 2015 Dan Nettleton (Iowa State University) Statistics / 58

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Stat 5101 Lecture Notes

Statistical Data Analysis Stat 3: p-values, parameter estimation

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Peng Li * and David T Redden

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

STAT 705 Generalized linear mixed models

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Lecture 3.1 Basic Logistic LDA

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

Generalized Multilevel Models for Non-Normal Outcomes

Spring RMC Professional Development Series January 14, Generalized Linear Mixed Models (GLMMs): Concepts and some Demonstrations

Exam Applied Statistical Regression. Good Luck!

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs

Multivariate Survival Analysis

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Linear Regression. Data Model. β, σ 2. Process Model. ,V β. ,s 2. s 1. Parameter Model

Modeling Overdispersion

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

STAT 526 Advanced Statistical Methodology

Generalized linear models

Generalized Linear Models 1

Transcription:

Generalized linear mixed models for biologists McMaster University 7 May 2009

Outline 1 2

Outline 1 2

Coral protection by symbionts 10 Number of predation events Number of blocks 8 6 4 2 2 2 1 0 2 0 2 1 0 1 0 none shrimp crabs both Symbionts

Arabidopsis response to fertilization & clipping panel: nutrient, color: genotype Log(1+fruit set) 0 1 2 3 4 5 unclipped clipped : nutrient 1 unclipped clipped : nutrient 8

Environmental stress: Glycera cell survival 0 0.03 0.1 0.32 0 0.03 0.1 0.32 Anoxia Osm=12.8 Anoxia Osm=22.4 Anoxia Osm=32 Anoxia Osm=41.6 Anoxia Osm=51.2 1.0 133.3 66.6 0.8 33.3 0 0.6 Copper Normoxia Osm=12.8 Normoxia Osm=22.4 Normoxia Osm=32 Normoxia Osm=41.6 Normoxia Osm=51.2 0.4 133.3 66.6 0.2 33.3 0 0.0 0 0.03 0.1 0.32 0 0.03 0.1 0.32 0 0.03 0.1 0.32 H 2 S

Outline 1 2

(GLMs) non-normal data, (some) nonlinear relationships; modeling via linear predictor presence/absence, alive/dead (binomial) count data (Poisson, negative binomial) typical applications: logistic regression (binomial/logistic), Poisson regression (Poisson/exponential)

(GLMs) non-normal data, (some) nonlinear relationships; modeling via linear predictor presence/absence, alive/dead (binomial) count data (Poisson, negative binomial) typical applications: logistic regression (binomial/logistic), Poisson regression (Poisson/exponential)

(GLMs) non-normal data, (some) nonlinear relationships; modeling via linear predictor presence/absence, alive/dead (binomial) count data (Poisson, negative binomial) typical applications: logistic regression (binomial/logistic), Poisson regression (Poisson/exponential)

Outline 1 2

Random effects (RE) examples: experimental or observational blocks (temporal, spatial); species or genera; individuals; genotypes inference on population of units rather than individual units (units randomly selected from all possible units) (reasonably large number of units)

Random effects (RE) examples: experimental or observational blocks (temporal, spatial); species or genera; individuals; genotypes inference on population of units rather than individual units (units randomly selected from all possible units) (reasonably large number of units)

Random effects (RE) examples: experimental or observational blocks (temporal, spatial); species or genera; individuals; genotypes inference on population of units rather than individual units (units randomly selected from all possible units) (reasonably large number of units)

Random effects (RE) examples: experimental or observational blocks (temporal, spatial); species or genera; individuals; genotypes inference on population of units rather than individual units (units randomly selected from all possible units) (reasonably large number of units)

Mixed models: classical approach Partition sums of squares, calculate null expectations if fixed effect is 0 (all coefficients β i = 0) or RE variance=0 Figure out numerator (model) & denominator (residual) sums of squares and degrees of freedom Model SSQ, df: variability explained by the effect (difference between model with and without the effect) and number of parameters used Residual SSQ, df: variability caused by finite sample size (number of observations minus number used up by the model)

Mixed models: classical approach Partition sums of squares, calculate null expectations if fixed effect is 0 (all coefficients β i = 0) or RE variance=0 Figure out numerator (model) & denominator (residual) sums of squares and degrees of freedom Model SSQ, df: variability explained by the effect (difference between model with and without the effect) and number of parameters used Residual SSQ, df: variability caused by finite sample size (number of observations minus number used up by the model)

Classical LMM cont. Robust, practical OK if data are Normal design is (nearly) balanced design not too complicated (single RE, or nested REs) (crossed REs: e.g. year effects that apply across all spatial blocks)

Mixed models: modern approach Construct a likelihood for the data (Prob(observing data parameters)) in mixed models, requires integrating over possible values of REs (marginal likelihood) e.g.: likelihood of i th obs. in block j is L Normal (x ij θ i, σ 2 w ) likelihood of a particular block mean θ j is L Normal (θ j 0, σ 2 b ) overall likelihood is L(x ij θ j, σ 2 w )L(θ j 0, σ 2 b ) dθ j Figure out how to do the integral

Mixed models: modern approach Construct a likelihood for the data (Prob(observing data parameters)) in mixed models, requires integrating over possible values of REs (marginal likelihood) e.g.: likelihood of i th obs. in block j is L Normal (x ij θ i, σ 2 w ) likelihood of a particular block mean θ j is L Normal (θ j 0, σ 2 b ) overall likelihood is L(x ij θ j, σ 2 w )L(θ j 0, σ 2 b ) dθ j Figure out how to do the integral

Mixed models: modern approach Construct a likelihood for the data (Prob(observing data parameters)) in mixed models, requires integrating over possible values of REs (marginal likelihood) e.g.: likelihood of i th obs. in block j is L Normal (x ij θ i, σ 2 w ) likelihood of a particular block mean θ j is L Normal (θ j 0, σ 2 b ) overall likelihood is L(x ij θ j, σ 2 w )L(θ j 0, σ 2 b ) dθ j Figure out how to do the integral

Shrinkage Arabidopsis block estimates Mean(log) fruit set 3 0 3 4 10 8 2 3 10 9 9 4 6 3 7 9 4 9 11 2 5 5 2 6 10 5 4 15 0 5 10 15 20 25 Genotype

RE examples Coral symbionts: simple experimental blocks, RE affects intercept (overall probability of predation in block) Glycera: applied to cells from 10 individuals, RE again affects intercept (cell survival prob.) Arabidopsis: region (3 levels, treated as fixed) / population / genotype: affects intercept (overall fruit set) as well as treatment effects (nutrients, herbivory, interaction)

Outline 1 2

Penalized quasi-likelihood (PQL) alternate steps of estimating GLM given known block variances; estimate LMMs given GLM fit flexible (allows spatial/temporal correlations, crossed REs) biased for small unit samples (e.g. counts < 5, binary or low-survival data) (Breslow, 2004) nevertheless, widely used: SAS PROC GLIMMIX, R glmmpql: in 90% of small-unit-sample cases

Penalized quasi-likelihood (PQL) alternate steps of estimating GLM given known block variances; estimate LMMs given GLM fit flexible (allows spatial/temporal correlations, crossed REs) biased for small unit samples (e.g. counts < 5, binary or low-survival data) (Breslow, 2004) nevertheless, widely used: SAS PROC GLIMMIX, R glmmpql: in 90% of small-unit-sample cases

Penalized quasi-likelihood (PQL) alternate steps of estimating GLM given known block variances; estimate LMMs given GLM fit flexible (allows spatial/temporal correlations, crossed REs) biased for small unit samples (e.g. counts < 5, binary or low-survival data) (Breslow, 2004) nevertheless, widely used: SAS PROC GLIMMIX, R glmmpql: in 90% of small-unit-sample cases

Penalized quasi-likelihood (PQL) alternate steps of estimating GLM given known block variances; estimate LMMs given GLM fit flexible (allows spatial/temporal correlations, crossed REs) biased for small unit samples (e.g. counts < 5, binary or low-survival data) (Breslow, 2004) nevertheless, widely used: SAS PROC GLIMMIX, R glmmpql: in 90% of small-unit-sample cases

Better methods Laplace approximation approximate marginal likelihood considerably more accurate than PQL reasonably fast and flexible adaptive Gauss-Hermite quadrature (AGQ) compute additional terms in the integral most accurate slowest, hence not flexible (2 3 RE at most, maybe only 1) Becoming available: R lme4, SAS PROC NLMIXED, PROC GLIMMIX (v. 9.2), Genstat GLMM

Better methods Laplace approximation approximate marginal likelihood considerably more accurate than PQL reasonably fast and flexible adaptive Gauss-Hermite quadrature (AGQ) compute additional terms in the integral most accurate slowest, hence not flexible (2 3 RE at most, maybe only 1) Becoming available: R lme4, SAS PROC NLMIXED, PROC GLIMMIX (v. 9.2), Genstat GLMM

Better methods Laplace approximation approximate marginal likelihood considerably more accurate than PQL reasonably fast and flexible adaptive Gauss-Hermite quadrature (AGQ) compute additional terms in the integral most accurate slowest, hence not flexible (2 3 RE at most, maybe only 1) Becoming available: R lme4, SAS PROC NLMIXED, PROC GLIMMIX (v. 9.2), Genstat GLMM

Comparison of coral symbiont results 4 2 0 2 4 AGQ (2) AGQ (1) Laplace (3) Laplace (2) Laplace (1) PQL glm AGQ (2) AGQ (1) Laplace (3) Laplace (2) Laplace (1) PQL glm twosymb intercept 4 2 0 2 4 symbiont sdran crab.vs.shrimp 4 2 0 2 4

Outline 1 2

General issues: testing RE significance Counting model df for REs how many parameters does a RE require? Somewhere between 1 and n... Hard to compute, and depends on the level of focus (Vaida and Blanchard, 2005) Boundary effects for RE testing most tests depend on null hypothesis being within the parameter s feasible range (Molenberghs and Verbeke, 2007): violated by H 0 : σ 2 = 0 REs may count for < 1 df (typically 0.5) if ignored, tests are conservative

General issues: testing RE significance Counting model df for REs how many parameters does a RE require? Somewhere between 1 and n... Hard to compute, and depends on the level of focus (Vaida and Blanchard, 2005) Boundary effects for RE testing most tests depend on null hypothesis being within the parameter s feasible range (Molenberghs and Verbeke, 2007): violated by H 0 : σ 2 = 0 REs may count for < 1 df (typically 0.5) if ignored, tests are conservative

General issues: finite-sample issues (!) How far are we from asymptopia? Many standard procedures are asymptotic Sample size may refer the number of RE units often far more restricted than total number of data points Hard to count degrees of freedom for complex designs: Kenward-Roger correction

General issues: finite-sample issues (!) How far are we from asymptopia? Many standard procedures are asymptotic Sample size may refer the number of RE units often far more restricted than total number of data points Hard to count degrees of freedom for complex designs: Kenward-Roger correction

General issues: finite-sample issues (!) How far are we from asymptopia? Many standard procedures are asymptotic Sample size may refer the number of RE units often far more restricted than total number of data points Hard to count degrees of freedom for complex designs: Kenward-Roger correction

Specific procedures Likelihood Ratio Test: need large sample size (= large # of RE units!) Wald (Z, χ 2, t or F ) tests crude approximation asymptotic (for non-overdispersed data?) or...... how do we count residual df? don t know if null distributions are correct AIC asymptotic (properties unknown) could use AIC c, but? need residual df

Specific procedures Likelihood Ratio Test: need large sample size (= large # of RE units!) Wald (Z, χ 2, t or F ) tests crude approximation asymptotic (for non-overdispersed data?) or...... how do we count residual df? don t know if null distributions are correct AIC asymptotic (properties unknown) could use AIC c, but? need residual df

Specific procedures Likelihood Ratio Test: need large sample size (= large # of RE units!) Wald (Z, χ 2, t or F ) tests crude approximation asymptotic (for non-overdispersed data?) or...... how do we count residual df? don t know if null distributions are correct AIC asymptotic (properties unknown) could use AIC c, but? need residual df

Glycera results Cu:H2S Cu:H2S:Anoxia Osm:Cu:H2S Osm:Cu:H2S:Anoxia H2S Cu H2S:Anoxia Osm:H2S Osm:Cu:Anoxia Osm:Anoxia Osm:H2S:Anoxia Osm:Cu Cu:Anoxia Anoxia Osm trial 60 40 20 0 20 40 effect

Testing assumptions H2S 0.02 0.04 0.06 0.08 Anoxia 0.08 0.06 Inferred p value 0.08 0.06 Osm Cu 0.04 0.02 0.04 0.02 0.02 0.04 0.06 0.08 True p value

Arabidopsis genotype effects 1 1 0 1 0 1Clipping 1 2 3 2 1 3 1 0 1 0 Control 1 2 1 2 2 0 1 2 1 0 Nutrients 0 1 2 1 0 2 4 3 2 3 4 2 Clip+Nutr. 1 0 1 0 1 1

Where are we?

Now what? MCMC (finicky, slow, dangerous, we have to go Bayesian : specialized procedures for, or WinBUGS translators? (glmmbugs, MCMCglmm) quasi-bayes mcmcsamp in lme4 (unfinished!) parametric bootstrapping: fit null model to data simulate data from null model fit null and working model, compute likelihood diff. repeat to estimate null distribution? analogue for confidence intervals? challenges depend on data: total size, # REs, # RE units, overdispersion, design complexity... More info: glmm.wikidot.com

Now what? MCMC (finicky, slow, dangerous, we have to go Bayesian : specialized procedures for, or WinBUGS translators? (glmmbugs, MCMCglmm) quasi-bayes mcmcsamp in lme4 (unfinished!) parametric bootstrapping: fit null model to data simulate data from null model fit null and working model, compute likelihood diff. repeat to estimate null distribution? analogue for confidence intervals? challenges depend on data: total size, # REs, # RE units, overdispersion, design complexity... More info: glmm.wikidot.com

Acknowledgements Data: Josh Banta and Massimo Pigliucci (Arabidopsis); Adrian Stier and Sea McKeon (coral symbionts); Courtney Kagan, Jocelynn Ortega, David Julian (Glycera); Co-authors: Mollie Brooks, Connie Clark, Shane Geange, John Poulsen, Hank Stevens, Jada White

Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1 22. Springer. ISBN 0387208623. Molenberghs, G. and Verbeke, G., 2007. The American Statistician, 61(1):22 27. doi:10.1198/000313007x171322. Vaida, F. and Blanchard, S., 2005. Biometrika, 92(2):351 370. doi:10.1093/biomet/92.2.351.