Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Similar documents
Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Analysis of Count Data A Business Perspective. George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013

Generalized linear models

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Q30b Moyale Observed counts. The FREQ Procedure. Table 1 of type by response. Controlling for site=moyale. Improved (1+2) Same (3) Group only

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

ESTIMATE PROP. IMPAIRED PRE- AND POST-INTERVENTION FOR THIN LIQUID SWALLOW TASKS. The SURVEYFREQ Procedure

The GENMOD Procedure. Overview. Getting Started. Syntax. Details. Examples. References. SAS/STAT User's Guide. Book Contents Previous Next

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

The GENMOD Procedure (Book Excerpt)

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

SAS/STAT 14.2 User s Guide. The GENMOD Procedure

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Wrap-up. The General Linear Model is a special case of the Generalized Linear Model. Consequently, we can carry out any GLM as a GzLM.

Model and Working Correlation Structure Selection in GEE Analyses of Longitudinal Data

Generalized Linear Models (GLZ)

Non-Gaussian Response Variables

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Nonlinear Models. What do you do when you don t have a line? What do you do when you don t have a line? A Quadratic Adventure

Modeling Overdispersion

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago

Poisson Data. Handout #4

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

A CASE STUDY IN HANDLING OVER-DISPERSION IN NEMATODE COUNT DATA SCOTT EDWIN DOUGLAS KREIDER. B.S., The College of William & Mary, 2008 A THESIS

Faculty of Health Sciences. Correlated data. Count variables. Lene Theil Skovgaard & Julie Lyng Forman. December 6, 2016

Topic 20: Single Factor Analysis of Variance

Logistic Regressions. Stat 430

Chapter 4: Generalized Linear Models-II

MODELING COUNT DATA Joseph M. Hilbe

Generalized linear models

Generalized Linear Models

STAT 526 Advanced Statistical Methodology

Generalized Linear Models for Count, Skewed, and If and How Much Outcomes

Lecture 3.1 Basic Logistic LDA

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

SAS Software to Fit the Generalized Linear Model

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.

Longitudinal Modeling with Logistic Regression

Generalized Linear Models

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Outline. Linear OLS Models vs: Linear Marginal Models Linear Conditional Models. Random Intercepts Random Intercepts & Slopes

GLMs for Count Data. Outline. Generalized linear models. Generalized linear models. Michael Friendly. Example: phdpubs.

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

Section Poisson Regression

Generalized Linear Models

Exercise 5.4 Solution

Outline of GLMs. Definitions

Testing and Model Selection

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

A strategy for modelling count data which may have extra zeros

Lecture 9 STK3100/4100

Cohen s s Kappa and Log-linear Models

ZERO INFLATED POISSON REGRESSION

Improving the Precision of Estimation by fitting a Generalized Linear Model, and Quasi-likelihood.

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

9 Generalized Linear Models

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

Statistics: A review. Why statistics?

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

Generalized Models: Part 1

DISPLAYING THE POISSON REGRESSION ANALYSIS

Linear Regression Models P8111

Chapter 3: Generalized Linear Models

Generalized Linear Mixed-Effects Models. Copyright c 2015 Dan Nettleton (Iowa State University) Statistics / 58

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

lme4 Luke Chang Last Revised July 16, Fitting Linear Mixed Models with a Varying Intercept

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

A course in statistical modelling. session 09: Modelling count variables

Sample solutions. Stat 8051 Homework 8

Introduction to Generalized Models

R code and output of examples in text. Contents. De Jong and Heller GLMs for Insurance Data R code and output. 1 Poisson regression 2

Generalized Multilevel Models for Non-Normal Outcomes

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

Generalized Linear Models for Non-Normal Data

Generalized linear models

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Multinomial Logistic Regression Models

Deal with Excess Zeros in the Discrete Dependent Variable, the Number of Homicide in Chicago Census Tract

Figure 36: Respiratory infection versus time for the first 49 children.

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs

Where have all the puffins gone 1. An Analysis of Atlantic Puffin (Fratercula arctica) Banding Returns in Newfoundland

How to deal with non-linear count data? Macro-invertebrates in wetlands

Application of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in the Niger Delta

Section 9c. Propensity scores. Controlling for bias & confounding in observational studies

Introduction to logistic regression

Transcription:

Biostokastikum Overdispersion is not uncommon in practice. In fact, some would maintain that overdispersion is the norm in practice and nominal dispersion the exception McCullagh and Nelder (1989) Overdispersion Workshop in generalized linear models Uppsala, June 11-12, 2014 Johannes Forkman, Field Research Unit, SLU Outline What is overdispersion and how do we detect it? An overview of methods for overdispersed data Generalized linear mixed models Generalized estimating equations Adjustment using an overdispersion factor Negative binomial distribution Mixture distributions for zero-inflated data Overdispersion In Poisson and binomially distributed data, the variance is a known function of the mean: Proportions: = 1 / Counts: = In practice, the variance is often much larger. This is called overdispersion.

Measures of goodness of fit Deviance (): Twice the difference between the log likelihood of a model with a perfect fit and the log likelihood of the fitted model Pearson s chi-sqare: ( ) that is, the sum of all squared Pearson residuals How to detect overdispersion No overdispersion Deviance df Pearson df Overdispersion Deviance df Pearson df df = residual degrees of freedom Plant Treatment Infested N Plant Treatment Infested N Example Leaves 20 plants 2 treatments: Active (10 plants) Control (10 plants) Approx. 20 leaves per plant 1 Active 6 20 11 Control 5 20 2 Active 3 20 12 Control 9 19 3 Active 7 20 13 Control 14 20 4 Active 1 20 14 Control 3 20 5 Active 0 18 15 Control 20 20 6 Active 0 20 16 Control 8 20 7 Active 4 18 17 Control 7 15 8 Active 9 20 18 Control 7 20 9 Active 10 20 19 Control 8 20 10 Active 2 20 20 Control 5 20

The data was analyzed using a generalized linear model with a binomial distribution and a logit link. The probability that a leaf was infested was estimated to 0.21 and 0.44, for Active and Control, respectively. This difference was significant (P < 0.0001). This analysis The data was analyzed using a generalized linear model with a binomial distribution was and a logit link. incorrect! The probability that a leaf was infested was estimated to 0.21 and 0.44, for Active and Control, respectively. This difference was significant (P < 0.0001). Why incorrect? Well, because Residual degrees of freedom: 18 Deviance: 94.53 (P < 0.0001) Pearson chi-square: 79.57 (P < 0.0001) How could the data be overdispersed? Each binomial observation is a cluster of approx. 20 Bernoulli (Yes/No) observations. Bernoulli observations from the same plant might be correlated. The observations are clearly overdispersed! This correlation is the source of overdispersion

Note that we get exactly the same results if we analyze the following summary table: Treatment Infested N Active 42 196 Control 86 194 GeneralizedLinearMixed Models(GLMM) These analyses do not consider variation between plants (i.e. correlation within plants). GLMM We have fitted the model logit = + But since the data is overdispersed and clustered, a generalized linear mixed model (GLMM): logit = + + ~ " 0, % & ' = 1,2 ' = 1,2 ) = 1, 2,, 10 In SAS, the glimmix procedure gives Type III Tests of Fixed Effects Effect Num DF Den DF F Value Pr > F Treatment 1 14.35 5.60 0.0325 In R, the glmer function can be used, and the likelihood ratio test is Df AIC deviance Chisq Chi DF Pr(>Chisq) Model.Null 2 65.864 61.864 Model.Mixed 3 62.157 56.157 5.7071 1 0.0169* is more appropriate

Are there no problems? Complicated GLMM models are hard to fit For GLMM, statistical inference is an issue We might be interested in the effects on the population, rather than in the effects on the individual subjects GeneralizedEstimating Equations(GEE) GEE What s not specified? GEE What s specified? Link function : e.g. log or logit The variance function: +, where + is a dispersion factor Correlation pattern: e.g. exchangeable (i.e. compound symmetry), autoregressive, unstructured The exact distribution (i.e. the exact likelihood) Random effects This enables Simple estimating equations (makes it easier to fit the model) A robust estimator of the standard errors: the so called empirical sandwich estimator Population-averaged statistical inference

Linear mixed model (subject-specific) In SAS, using the genmod procedure: Wald: P = 0.017 Score: P = 0.032 In R, using the ggeglm function: Wald: P = 0.017 Yield 6000 6200 6400 6600 6800 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 Soil GEE model (population averaged) Simple regression and GEE model Yield 6000 6200 6400 6600 6800 Yield 6000 6200 6400 6600 6800 Simple regression GEE model 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 Soil Soil

GEE is a model for the population average Black: GEE (populationaveraged) Gray: GLMM (subjectspecific) Adjustment using an overdispersion factor Source: http://stats.stackexchange.com/questions/32419/difference-between-generalized-linear-models-generalized-linear-mixed-models-i Block Mix Count 1 1 24 1 2 12 1 3 8 1 4 13 2 1 9 2 2 9 2 3 9 2 4 18 3 1 12 3 2 8 3 3 44 3 4 0 4 1 8 4 2 12 4 3 25 4 4 0 Example Seeding mixes Residual degrees of freedom: 9 Four blocks Four seeding mixes Observed number of plants of a specific species Deviance: 90.23 (P < 0.0001) Pearson s chi-square: 79.55 (P < 0.0001) The data is clearly overdispersed. In this example, we have no clusters!

Adjustment using overdispersion factor Simply assume that the variance is +, where + is a dispersion factor Response Proportions Counts φµ Variance φµ(1 µ)/n Estimate + as Pearson s chi-square / df: +, = 79.55/9 = 8.84 Within the glm function, specify or family = quasipoisson family = quasibinomial Within the model statement of the genmod procedure, give the option or dist = poisson pscale dist = binomial pscale Results without dispersion factor Analysis of deviance Without overdispersion factor: LR Statistics For Type 3 Analysis Source DF Chi-Square Pr > ChiSq Block 3 4.98 0.1735 Mix 3 30.95 <.0001 = 3 4567856 3 9:;<=5>5 Means on the log scale Back-transformed means With overdispersion factor:? = 3 4567856 3(9:;<=5>5) 6@ 9:;<=5>5 6@(4567856) +, Mix Mean SEM 95% conf. Mean 95% conf. 1 2.57 0.138 2.3 2.8 13.1 10.0 17.2 2 2.32 0.157 2.0 2.6 10.1 7.5 13.8 3 3.06 0.108 2.8 3.3 21.2 17.2 26.3 4 2.04 0.180 1.7 2.4 7.7 5.4 10.9

Results with dispersion factor LR Statistics For Type 3 Analysis Source Num DF Den DF F Value Pr > F Block 3 9 0.19 0.9021 Mix 3 9 1.17 0.3749 Means on the log scale Back-transformed means Mix Mean SEM 95% conf. Mean 95% conf. 1 2.57 0.410 1.77 3.38 13.1 5.9 29.2 2 2.32 0.465 1.40 3.23 10.1 4.1 25.2 3 3.06 0.322 2.43 3.69 21.2 11.3 40.0 4 2.04 0.535 0.99 3.08 7.7 2.7 21.9 The negative binomial distribution SEM are +, = 2.97 times larger than before Example Sheep milk Pearson residuals Somatic cell count in sheep milk using mechanical or manual milking Method Count Method Count Mechanical 2966 Manual 186 Mechanical 569 Manual 107 Mechanical 59 Manual 65 Mechanical 1887 Manual 126 Mechanical 3452 Manual 123 Mechanical 189 Manual 164 Mechanical 93 Manual 408 Mechanical 618 Manual 324 Mechanical 130 Manual 548 Mechanical 2493 Manual 139 Pearson residuals -60-40 -20 0 20 40 60 5 10 15 20 Observation number

Poisson distribution The negative binomial distribution Residual degrees of freedom: 18 mean = 20, scale = 1 mean = 20, scale = 0.1 mean = 20, scale = 0.01 Deviance: 14203 (P < 0.0001) Pearson chi-square: 13643 (P < 0.0001) The observations are clearly overdispersed. Using the previous method, standard errors would be multiplied by +, = 13643/18 = 27.5 0.00 0.02 0.04 0.06 0.08 0.00 0.02 0.04 0.06 0.08 0.00 0.02 0.04 0.06 0.08 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 Negative binomial distribution Pearson residuals glm.nb in R, and genmod in SAS Residual degrees of freedom: 18 Deviance: 22.77 (P = 0.200) Pearson chi-square: 16.27 (P = 0.574) The data is not overdispersed Problem solved! Pearson residuals -2-1 0 1 2 5 10 15 20 Observation number

SAS Negative binomial distribution Differences of Method Least Squares Means Method _Method Estimate Standard Error z Value Pr > z Manual Mechanical -1.7384 0.4256-4.08 <.0001 R Zero-inflated data Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 5.389 0.301 17.89 < 2e-16 *** MethodMechanical 1.738 0.426 4.08 4.4e-05 *** 0 0 0 0 0 0 4 0 0 16 0 0 5 0 0 0 10 0 0 0 0 12 0 0 0 0 Zero-inflated data There are more zeros than expected according to the Poisson or negative binomial distribution A special case of overdispersion Common in ecology when the numbers of various species are counted A negative binomial distribution 0.0 0.1 0.2 0.3 0.4 0.5 0.6 mean = 20, scale = 10 0 10 20 30 40 50

This is not a Poisson or a negative binomial distribution Zero-inflated data Zero-inflated Poisson distribution A mixture of a Poisson distribution and a binomial distribution But it might be a mixture of a Poisson and a binomial distribution Pr Extra zero = J Or a mixture of a negative binomial and a binomial distribution 0 5 10 15 Pr Observation zero = J + 1 J Pr(Poisson distribution gives a zero) How to fit zero-inflated models Zero-inflated Poisson and zero-inflated negative binomial models can be fitted See Exercise 5 using the zeroinfl function of the pscl package using the genmod procedure, through dist = zip and dist = zinb, respectively Summary Overdispersion is the rule rather than the exception. When not accounted for, the statistical inference is not valid. The following methods were presented: Generalized linear mixed models Generalized estimating equations Adjustment using an overdispersion factor Negative binomial distribution Mixture distributions for zero-inflated data