Notes for week 4 (part 2)

Similar documents
Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017

Exam Applied Statistical Regression. Good Luck!

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

11. Generalized Linear Models: An Introduction

Generalized Linear Models

Non-Gaussian Response Variables

Generalized linear models

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Checking model assumptions with regression diagnostics

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized linear models

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

STAC51: Categorical data Analysis

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev

Stat 5101 Lecture Notes

Generalized Linear Models

Generalized Linear Models

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

Modeling Overdispersion

Experimental Design and Statistical Methods. Workshop LOGISTIC REGRESSION. Jesús Piedrafita Arilla.

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Generalized Linear Models: An Introduction

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Sample solutions. Stat 8051 Homework 8

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Statistical Modelling with Stata: Binary Outcomes

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Poisson regression: Further topics

Stat 579: Generalized Linear Models and Extensions

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Logistic regression. Ben Bolker October 3, modeling. data analysis road map. (McCullagh and Nelder, 1989)

Nonlinear Models. What do you do when you don t have a line? What do you do when you don t have a line? A Quadratic Adventure

9 Generalized Linear Models

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Prediction of Bike Rental using Model Reuse Strategy

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Statistical Modelling in Stata 5: Linear Models

A strategy for modelling count data which may have extra zeros

Regression Model Building

Subject CS1 Actuarial Statistics 1 Core Principles

Chapter 1 Statistical Inference

Package HGLMMM for Hierarchical Generalized Linear Models

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression.

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Generalized Linear Models (GLZ)

Generalized Additive Models

Review of Statistics 101

How to deal with non-linear count data? Macro-invertebrates in wetlands

Reaction Days

Generalized linear mixed models for biologists

The Flight of the Space Shuttle Challenger

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Simple Linear Regression

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Unit 11: Multiple Linear Regression

To appear in The American Statistician vol. 61 (2007) pp

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence

11. Generalized Linear Models: An Introduction

1 Least Squares Estimation - multiple regression.

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Introduction to General and Generalized Linear Models

Linear Regression Models P8111

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.

Simple Linear Regression

Generalised linear models. Response variable can take a number of different formats

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

Introduction to logistic regression

Multiple Linear Regression

Generalized Linear Models I

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regressions. Stat 430

Generalized Linear Models. Kurt Hornik

General Regression Model

MODELING COUNT DATA Joseph M. Hilbe

REVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350

Handbook of Regression Analysis

Discrete distribution. Fitting probability models to frequency data. Hypotheses for! 2 test. ! 2 Goodness-of-fit test

A Re-Introduction to General Linear Models

Multivariate Statistics in Ecology and Quantitative Genetics Summary

Experimental Design and Data Analysis for Biologists

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

mgcv: GAMs in R Simon Wood Mathematical Sciences, University of Bath, U.K.

Package generalhoslem

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

R code and output of examples in text. Contents. De Jong and Heller GLMs for Insurance Data R code and output. 1 Poisson regression 2

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking

STAT 526 Advanced Statistical Methodology

Outline of GLMs. Definitions

STAT 525 Fall Final exam. Tuesday December 14, 2010

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

Introduction and Single Predictor Regression. Correlation

Transcription:

Notes for week 4 (part 2) Ben Bolker October 3, 2013 Licensed under the Creative Commons attribution-noncommercial license (http: //creativecommons.org/licenses/by-nc/3.0/). Please share & remix noncommercially, mentioning its origin. Version: 2013-10-03 17:55:12 Philosophy avoiding data snooping/ researcher degrees of freedom [10, 5, 6] subjective vs objective, graphical vs formal testing we don t care whether the model is true (it s not) or whether it can be rejected, but whether the deviations matter objective/p-value approach cover your *ss? Model diagnostics Graphical plot computed diagnostic summaries and/or transformations of residuals to highlight particular classes of model deviations Formal compute an overall goodness-of-fit statistic with a known null distribution embed the model in a larger parametric family; compare via likelihood ratio test (consider exact or round alternative). May use score test or single-step update for computational efficiency. log-likelihood l0 l1 LRT Wald score H 0 H 1 x [2, 8]

notes for week 4 (part 2) 2 Residuals Different types of residuals (?residuals.glm,?rstandard,?rstudent) Deviance sign(y µ) 2w i dev i Pearson (y µ)/(w V(µ)) Standardized (y µ)/( V(µ)(1 H) Note whether residuals are scaled by (1) variance function, (2) weights, (3) full variance (i.e. including overdispersion factor φ), (4) diagonal of hat matrix (hatvalues()). (Hat matrix: weighted version of H = X(X T X) 1 X T : maps y to ŷ, so h ii is the influence of y i on ŷ i. All identical for linear models with categorical variables, but not for GLMs... ) Linearity (Deviance) residual vs. fitted plot (Deviance) residuals vs. individual predictors, or combinations of predictors link test 1 ; try adding a quadratic term in the linear predictor, see if it fits better changing link function: power()) 1 Pregibon, D. (1980, January). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society. Series C (Applied Statistics) 29(1), 15 14 adding polynomial or spline terms to individual predictors (poly(), splines::ns()) transforming individual predictors Variance function Scale-location plot: abs(residuals) vs. fitted value, or individual parameters, or combinations of parameters. If residuals are scaled and there is no overdispersion (see below) then the center is at 1 (Banta example?) fixing some other part of the model tweaking the variance function

notes for week 4 (part 2) 3 Distributional assumptions The variance function and link function might both be right, but the model distribution can still be wrong (e.g. log-normal vs Gamma, zero-inflation). assessing distributional assumption is hard because it s the conditional distribution Q-Q plot (examples): good, but only really valid asymptotically (i.e. conditional distribution of individual samples Normal: e.g. λ > 5 for Poisson, nmin(p, 1 p) > 5 for Binomial) alternatives to Q-Q plot, e.g. 2 (not really practical) Improved Q-Q plot: 3, mgcv::qq.gam() ordinal models robust models (robustbase::glmrob) 2 Hoaglin, D. C. (1980). A Poissonness plot. The American Statistician 34(3), 146 149 3 Augustin, N. H., E.-A. Sauleau, and S. N. Wood (2012, August). On quantile quantile plots for generalized linear models. Computational Statistics & Data Analysis 56(8), 2404 2409 Load data on contagious bovine pleuropneumonia [7], taken from lme4 package: load("cbpp.rdata") ggplot(cbpp,aes(period,incidence/size))+ geom_point(aes(size=size),alpha=0.5)+ geom_line(aes(group=herd)) 0.6 incidence/size 0.4 0.2 size 10 20 30 0.0 1 2 3 4 period Fit with a Poisson-offset model:

notes for week 4 (part 2) 4 m1 <- glm(incidence~herd+offset(log(size)),data=cbpp, family=poisson) library(mgcv) par(mfrow=c(1,2),las=1,bty="l") plot(m1,which=2) qq.gam(m1,pch=1) 3 Normal Q-Q 16 2 Std. deviance resid. 2 1 0-1 deviance residuals 1 0-1 -2 10 17-2 -2-1 0 1 2 Theoretical Quantiles -2-1 0 1 2 theoretical quantiles Influential points?influence.measures Cook s distance (overall influence) leverage leaving out influential points to see if it makes a difference robust modeling (robustbase::glmrob) Posterior predictive summaries Simulate 1000 times; count the number of zeros in each simulation; compute (1-sided) p-value. ss <- simulate(m1,1000,seed=101) zerovec <- colsums(ss==0) zero.obs <- sum(cbpp$incidence==0) (cbpp.zpval <- mean(zerovec>=zero.obs)) ## [1] 0.089

notes for week 4 (part 2) 5 0.12 0.10 Prob(z 22) = 0.089 Density 0.08 0.06 0.04 0.02 0.00 6 8 10 12 14 16 18 20 22 24 26 28 (this is a 1-sided test) Number of zeros Overall goodness-of-fit/overdispersion (aods3 package) Detection Variance > expected (e.g. assume variance = mean but variance > mean) Test: (Pearson residuals) 2 residual df More specifically, r 2 χ 2 n p pchisq(sum(residuals(.,type="pearson")^2),rdf,lower.tail=false), or aods3::gof(.) Meaning May be caused by general lack of fit... or may be intrinsic Solutions quasi-likelihood φ r 2 /(n p): scales all likelihoods by φ, all CI by phi compound model negative binomial (Gamma-Poisson) (via MASS::glm.nb) Beta-Binomial (via bbmle?)

notes for week 4 (part 2) 6 link-normal model: GLMM with observation-level random effect Binary data: if covariate patterns are non-unique, can bin observations otherwise, use binning or smoothing [4] (dominates Hosmer- Lemeshow test): use library(rms) model <- Glm(...) resid(model,"gof") graphical: compare geom_smooth with predicted (e.g. logistic) curves References [1] Augustin, N. H., E.-A. Sauleau, and S. N. Wood (2012, August). On quantile quantile plots for generalized linear models. Computational Statistics & Data Analysis 56(8), 2404 2409. [2] Fears, T. R., J. Benichou, and M. H. Gail (1996, August). A reminder of the fallibility of the Wald statistic. The American Statistician 50(3), 226 227. [3] Hoaglin, D. C. (1980). A Poissonness plot. The American Statistician 34(3), 146 149. [4] le Cessie, S. and J. C. van Houwelingen (1991, December). A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics 47(4), 1267 1282. [5] Leek, J., R. Peng, and R. Irizarry (2012, August). A deterministic statistical machine. [6] Leek, J., R. Peng, and R. Irizarry (2013, July). The researcher degrees of freedom recipe tradeoff in data analysis. [7] Lesnoff, M., G. Laval, P. Bonnet, S. Abdicho, A. Workalemahu, D. Kifle, A. Peyraud, R. Lancelot, and F. Thiaucourt (2004, June). Within-herd spread of contagious bovine pleuropneumonia in ethiopian highlands. Preventive Veterinary Medicine 64(1), 27 40. [8] Pawitan, Y. (2000, February). A reminder of the fallibility of the Wald statistic: Likelihood explanation. The American Statistician 54(1), 54 56.

notes for week 4 (part 2) 7 [9] Pregibon, D. (1980, January). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society. Series C (Applied Statistics) 29(1), 15 14. [10] Simmons, J. P., L. D. Nelson, and U. Simonsohn (2011, November). False-positive psychology undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11), 1359 1366.