Faculty of Health Sciences. Regression models. Counts, Poisson regression, Lene Theil Skovgaard. Dept. of Biostatistics

Faculty of Health Sciences Regression models Counts, Poisson regression, 27-5-2013 Lene Theil Skovgaard Dept. of Biostatistics 1 / 36

Count outcome
PKA & LTS, Sect. 7.2
- Poisson regression
- The Binomial and Poisson distributions
- Example: Fever episodes
- Confounding
- Poisson regression for survival data
Home pages: http://biostat.ku.dk/~pka/regrmodels13
E-mail: ltsk@sund.ku.dk
2 / 36

Count variables
Definition: A variable that may take any non-negative integer value, i.e. 0, 1, 2, ...
Examples:
- Number of fever episodes during pregnancy
- Number of metastases following an experimentally induced cancer in laboratory rats
- Number of deaths due to lung cancer in a year, in a specific region
3 / 36

Well... of course
These variables cannot be infinitely large, e.g.
- 429623 fever episodes
- 895482143 metastases
- 50 million deaths
but in practice they may be very large, perhaps with no well-defined upper limit.
4 / 36

The Binomial distribution
If we have a well-defined upper limit, c, we can represent the count as a sum of zeroes and ones, and if we can assume these to be independent, we know that $y \sim \mathrm{Bin}(c, p)$, p being the probability of a one for each
- week of pregnancy
- organ in a rat
- inhabitant in a region
$P(u) = \mathrm{pr}(y_i = u) = \binom{c}{u} p^u (1-p)^{c-u}$
5 / 36

[Figure: Binomial distributions for p = 0.005, 0.05 and 0.3]
6 / 36

Approximations to the Binomial distribution, I
When c is large and p is moderate (close to 0.5), the Binomial distribution looks like a Normal distribution $N(m, s^2)$, where the parameter m is the mean value (the expected count), $m = cp$, and the standard deviation is $s = \sqrt{cp(1-p)}$.
7 / 36

Approximations to the Binomial distribution, II
The law of rare events
When c is large and p is small, the Binomial distribution looks like a Poisson distribution,
$\mathrm{pr}(y_i = u) = \frac{m^u}{u!} \exp(-m)$,
where again the parameter m is the mean value (the expected count), $m = cp$, and the standard deviation is $SD = \sqrt{m} = \sqrt{cp}$.
8 / 36
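The law of rare events can be checked numerically. The following is a minimal sketch using only the Python standard library; c = 14 matches the number of pregnancy weeks used later, while the two values of p (0.015 and 0.3) are illustrative assumptions and not estimates from the data.

```python
from math import comb, exp, factorial


def binomial_pmf(u, c, p):
    """pr(y = u) for y ~ Bin(c, p)."""
    return comb(c, u) * p**u * (1 - p) ** (c - u)


def poisson_pmf(u, m):
    """pr(y = u) for a Poisson distribution with mean m."""
    return m**u / factorial(u) * exp(-m)


def max_abs_difference(c, p):
    """Largest pointwise gap between Bin(c, p) and Poisson(m = c * p)."""
    m = c * p
    return max(abs(binomial_pmf(u, c, p) - poisson_pmf(u, m)) for u in range(c + 1))


# Hypothetical probabilities: one small (rare events), one moderate.
gap_small_p = max_abs_difference(c=14, p=0.015)
gap_moderate_p = max_abs_difference(c=14, p=0.3)

print(f"max |Binomial - Poisson|, p = 0.015: {gap_small_p:.4f}")
print(f"max |Binomial - Poisson|, p = 0.3:   {gap_moderate_p:.4f}")
```

The gap is much smaller for the small p, which is the situation in which the Poisson approximation is intended to be used.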

Number of fever episodes
What is a fever episode?
- A day with fever?
- A week where fever occurs?
- A period with fever, until it ends?
We will take it to mean a pregnancy week with occurrence of fever.
9 / 36

Notation
- $c$: the number of pregnancy weeks (observed), here $c = 14$
- $p_i$: the probability of a fever episode for the ith woman in any of the c pregnancy weeks (assumed to be identical for all weeks, i.e., independent of gestational age)
- $v_{ij}$: an indicator of fever in week j for the ith woman
- $y_i$: the number of fever episodes for the ith woman
Note: $y_i = v_{i1} + \cdots + v_{ic}$, a sum of zeros and/or ones.
10 / 36

Distribution of fever episodes
If fever episodes occur independently of each other in separate weeks, we know that for a specific individual (the index i is omitted) $y \sim \mathrm{Bin}(c, p)$.
Since p is probably small, we may approximate with a Poisson distribution:
$\mathrm{pr}(y = u) = \binom{c}{u} p^u (1-p)^{c-u} \approx \frac{m^u}{u!} \exp(-m)$,
where m may depend on some covariates.
11 / 36

Fever episodes, according to parity
Parity 0: no previous children, expecting first child.

           Number of fever episodes
Parity      0     1    2   3  4  5  6  7  8  9  10  12
0        4474   731   69  10  2  1  0  0  0  0   0   0
1        5219  1141  114  10  1  2  1  1  0  0   2   0
Total    9693  1872  183  20  3  3  1  1  0  0   2   0

- Many 0's (no fever episodes)
- Largest count is 10 out of 14 weeks
12 / 36

Distribution characteristics

          Fever episodes     Average    SD^2       Average
Parity      0      >= 1      $\hat m$   $\hat s^2$ age
0         4474      813      0.172      0.189      27.88
1         5219     1272      0.223      0.264      31.06
Total     9693     2085      0.200      0.231      29.63

Do we see reasonably identical averages and variances (squared standard deviations)?
Do we see an effect of parity? The estimated ratio (of average number of fever episodes) is 0.172/0.223 = 0.7713 and highly significant.
13 / 36
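The averages and variances can be recomputed from the frequency table of fever episodes by parity. This is a minimal sketch; the lists hold the counts for 0 to 10 episodes as read from that table (the final all-zero column is left out), and the variable names are illustrative.

```python
# Number of women with 0, 1, ..., 10 fever episodes, by parity group.
counts = {
    0: [4474, 731, 69, 10, 2, 1, 0, 0, 0, 0, 0],
    1: [5219, 1141, 114, 10, 1, 2, 1, 1, 0, 0, 2],
}


def mean_and_variance(freq):
    """Sample mean and variance (divisor n - 1) from a frequency table."""
    n = sum(freq)
    mean = sum(u * k for u, k in enumerate(freq)) / n
    sum_sq = sum(k * (u - mean) ** 2 for u, k in enumerate(freq))
    return mean, sum_sq / (n - 1)


summary = {parity: mean_and_variance(freq) for parity, freq in counts.items()}
ratio = summary[0][0] / summary[1][0]

for parity, (mean, variance) in summary.items():
    print(f"parity {parity}: mean = {mean:.3f}, variance = {variance:.3f}")
print(f"ratio of means, parity 0 vs. 1: {ratio:.3f}")
```

For a Poisson distribution the mean and the variance coincide, so a variance somewhat above the mean in both groups is an early hint of overdispersion. The ratio computed from unrounded means differs slightly from the one obtained from the rounded values 0.172 and 0.223.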

Model for fever episodes
$y_i$: the number of fever episodes for the ith woman, assumed to be Poisson distributed with mean $m_i = c p_i$.
We relate $m_i = E(y_i)$ to a linear predictor, using a logarithmic link (in order to respect positive probabilities):
$\log(E(y_i)) = \log(m_i) = LP_i$,
and the linear predictor can then be modeled as a function of covariates.
14 / 36

Covariate effect: Parity
Do children attract infection to the pregnant mother?
$x_{i,1}$: the parity of the ith woman
$LP_i = a + b_1 I(x_{i,1} = 0)$
We get the estimate $\hat b_1 = -0.2558\ (0.0423)$, P < 0.0001, and therefore a clear marginal effect of parity, with back-transformed ratio $\exp(-0.2558) = 0.77$ (0.71, 0.84).
But: This apparent difference might be due to other reasons:
- age at conception (as a quantitative variable with a linear effect)
- alcohol habits
- ...
Very few women drink more than one or two units a week, so we disregard alcohol as a covariate.
15 / 36
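With a single binary covariate, the Poisson maximum likelihood fit reproduces the two group means, so the estimate can be obtained in closed form. The sketch below assumes the frequency counts from the parity table (0 to 10 episodes) and uses the standard large-sample standard error sqrt(1/D0 + 1/D1) for a log rate ratio, where D0 and D1 are the total numbers of episodes in the two groups; the slide's own figures come from a fitted model, so small rounding differences are to be expected.

```python
from math import exp, log, sqrt

# Number of women with 0, 1, ..., 10 fever episodes, by parity group.
counts = {
    0: [4474, 731, 69, 10, 2, 1, 0, 0, 0, 0, 0],
    1: [5219, 1141, 114, 10, 1, 2, 1, 1, 0, 0, 2],
}

women = {g: sum(freq) for g, freq in counts.items()}
episodes = {g: sum(u * k for u, k in enumerate(freq)) for g, freq in counts.items()}

# Intercept a: log mean for the reference group (parity 1).
a_hat = log(episodes[1] / women[1])
# b1: log ratio of means, parity 0 versus parity 1.
b1_hat = log(episodes[0] / women[0]) - a_hat
# Large-sample standard error of a log rate ratio.
se_b1 = sqrt(1 / episodes[0] + 1 / episodes[1])

ratio = exp(b1_hat)
ci_lower = exp(b1_hat - 1.96 * se_b1)
ci_upper = exp(b1_hat + 1.96 * se_b1)

print(f"b1 = {b1_hat:.4f} (SE {se_b1:.4f})")
print(f"ratio = {ratio:.2f} ({ci_lower:.2f}, {ci_upper:.2f})")
```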

Covariate effect: Age
$x_{i,2}$: the age at conception for the ith woman
$LP_i = a + b_2 (x_{i,2} - 30)$
We find $\hat b_2 = -0.00069\ (0.00491)$ per year, so the effect of a 10-year increase is a factor $\exp(10 \times (-0.00069)) = 0.9931$, P = 0.89, i.e. virtually no effect.
16 / 36

Confounding between parity and age? Possibly... 17 / 36

Multiple regression model
Linear predictor:
$\log(E(y_i)) = \log(m_i) = LP_i = a + b_1 I(x_{i,1} = 0) + b_2 (x_{i,2} - 30)$,
choosing a woman of age 30 with previous children as the reference.

                 Estimate (CI)               Ratio estimate (CI)     P
Intercept        -1.488 (-1.541, -1.436)
Parity 0         -0.300 (-0.390, -0.211)     0.741 (0.677, 0.810)    <0.0001
Parity 1          0                          1
Age, 10 years    -0.140 (-0.244, -0.035)     0.870 (0.783, 0.965)    0.0088
18 / 36

Interpretation, I
Intercept: A reference woman aged 30, with previous children, is expected to have $\exp(-1.4882) = 0.226$ fever episodes.
Parity: Women with no previous children have a mean number of fever episodes that is a factor $\exp(-0.300) = 0.741$ of that for women with previous children, i.e. approximately 26% less, provided that they have the same age. The confidence interval ranges from 19% to 32% lower.
19 / 36

Interpretation, II
Age: Older women have a somewhat lower level of fever episodes. A ten-year increase in age yields an estimated decrease in the mean number of fever episodes of approximately 13% (CI 4% to 22%), for women with identical parities.
20 / 36
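The interpretation rests on back-transforming estimates from the log scale. As a small sketch, the snippet below exponentiates the (rounded) estimates and confidence limits from the multiple regression table; the dictionary keys and helper name are illustrative, and because the inputs are rounded, the last digit can differ from the tabulated ratios.

```python
from math import exp

# (estimate, lower CI limit, upper CI limit) on the log scale.
log_scale = {
    "Parity 0 vs. 1": (-0.300, -0.390, -0.211),
    "Age, per 10 years": (-0.140, -0.244, -0.035),
}


def percent_change(log_ratio):
    """Relative change in the mean, in percent, implied by a log ratio."""
    return 100.0 * (exp(log_ratio) - 1.0)


# Expected number of fever episodes for the reference woman (age 30, parity 1).
reference_mean = exp(-1.4882)

ratios = {name: tuple(exp(v) for v in values) for name, values in log_scale.items()}

print(f"reference mean: {reference_mean:.3f}")
for name, (est, low, high) in ratios.items():
    change = percent_change(log_scale[name][0])
    print(f"{name}: ratio {est:.3f} ({low:.3f}, {high:.3f}), change {change:.0f}%")
```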

Comparison of unadjusted and adjusted effects

                          Ratio estimate (CI)
Covariate(s) in model     Parity, 1 vs. 0       Age, 10 years
Only parity               1.29 (1.19, 1.40)
Only age                                        0.99 (0.90, 1.09)
Both age and parity       1.35 (1.23, 1.48)     0.87 (0.78, 0.97)
21 / 36

Comparison of unadjusted and adjusted effects, II
Unadjusted (marginal) effects:
- More episodes for parity 1+ (Ratio 1.29 (1.19, 1.40), P < 0.0001)
- Slight negative effect of age (Ratio for 10 years: 0.99 (0.90, 1.09), P = 0.89)
Adjusted effects:
- More episodes for parity 1+ (Ratio 1.35 (1.23, 1.48), P < 0.0001)
- Significant negative effect of age (Ratio for 10 years: 0.87 (0.78, 0.97), P = 0.0088)
22 / 36

Illustration of Confounding
The association between parity and age (see the Boxplot on p. 17) results in a significant age effect when adjusting for parity.
We have an example of two closely related explanatory variables that have opposite effects on the outcome:
- Women with children have a higher risk,
- but older women have a lower risk.
23 / 36

Interaction?
Interaction between parity and age (as a linear effect): No.
The estimated difference in the age effect is 0.0047 (0.0109). The age effect is somewhat more pronounced for women of parity 0, but the difference is far from significant, P = 0.66.
24 / 36

Model check for linearity in age
[Figure: Residual plots for the model, and smoothed version (parity 1: dots, solid curve; parity 0: circles, dashed curve)]
25 / 36

Model with splines in age Predicted values for age effects in the two parity groups, linear spline, with breaks at age 20 and 30 (parity 1: solid curve, parity 0: dashed curve) The deviation from linearity is not significant, P = 0.57 26 / 36

Goodness-of-fit test for model
Observed and expected number of fever episodes in ten subgroups according to predicted values:

Predicted mean number    Number of    Number of fever episodes
of fever episodes        women        Observed (O)    Expected (E)    $(O-E)/\sqrt{E}$
0.138 - 0.166            1176         188             187.57           0.031
0.166 - 0.172            1179         197             199.18          -0.154
0.172 - 0.177            1179         212             205.56           0.449
0.177 - 0.183            1177         196             211.40          -1.059
0.183 - 0.207            1178         239             229.77           0.609
0.207 - 0.215            1179         273             249.63           1.479
0.215 - 0.221            1177         229             257.01          -1.747
0.221 - 0.227            1179         265             264.13           0.054
0.227 - 0.234            1176         272             270.54           0.088
0.234 - 0.267            1178         287             283.20           0.226

Overall chi-squared statistic of 7.02, P = 0.53.
27 / 36
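The chi-squared statistic is the sum of the squared standardized differences $(O-E)/\sqrt{E}$ over the ten subgroups. The sketch below recomputes it from the tabulated values. The degrees of freedom are not stated on the slide; 8 (ten groups minus two) is an assumption made here, chosen because it is the usual convention for this type of grouped test. The tail probability uses the closed-form expression that holds for even degrees of freedom, so no external library is needed.

```python
from math import exp, sqrt

observed = [188, 197, 212, 196, 239, 273, 229, 265, 272, 287]
expected = [187.57, 199.18, 205.56, 211.40, 229.77,
            249.63, 257.01, 264.13, 270.54, 283.20]

# Standardized differences (O - E) / sqrt(E), one per subgroup.
residuals = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]
chi2 = sum(r * r for r in residuals)


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-squared variable with an even number of df.

    Uses the identity P(X > x) = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


# Assumed degrees of freedom: 10 groups - 2 (not given in the source).
p_value = chi2_upper_tail_even_df(chi2, df=8)

print(f"chi-squared = {chi2:.2f}, P = {p_value:.2f}")
```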

Goodness-of-fit, continued
Comparison of observed and expected number of women, according to number of fever episodes:

Number of         Number of women
fever episodes    Observed (O)    Expected (E)    $(O-E)/\sqrt{E}$
0                 9693            9644.63          0.492
1                 1872            1923.71         -1.179
2                  183             194.44         -0.890
>= 3                30              14.21          4.189

Test statistic: $19.97 \sim \chi^2(2)$, P < 0.0001.
Too many 0's and too many counts in the ">= 3" category. Overdispersion?
28 / 36

Comparison to other approaches
The Poisson distribution is used here as an approximation to the Binomial distribution.
Compare to
- assuming the distribution to be $\mathrm{Bin}(c = 14, p)$ and choosing the link function to be log (close to logit since p is small), with the same linear predictor
- a model assuming Normality, with log-link (even though of course the number of fever episodes is restricted to non-negative integers)
29 / 36

Alternative approaches
Comparison of estimates in models assuming Poisson, Normal, and Binomial distributions:

                    Parity 0 vs. 1              Age, 10 years               Prediction for
Model               Estimate (SD)    P-value    Estimate (SD)    P-value    age 30, parity 1
Poisson             -0.300 (0.046)   <0.0001    -0.140 (0.053)   0.0088     0.226 (0.214, 0.238)
Binomial            -0.300 (0.045)   <0.0001    -0.139 (0.053)   0.0083     0.222 (0.211, 0.234)
Normal, log-link    -0.300 (0.050)   <0.0001    -0.141 (0.058)   0.015      0.226 (0.214, 0.238)

Somewhat larger SD for the normality analysis. Overdispersion?
30 / 36

Poisson regression for survival data
In the Cox regression model ($n_c$ covariates) the log(hazard) is:
$\log(h_0(t)) + b_1 x_{i,1} + \cdots + b_{n_c} x_{i,n_c}$.
Here, the baseline hazard $h_0(t)$ is completely unspecified: no assumptions are made about the shape of the function.
An alternative is to approximate $h_0(t)$ by a function which is piecewise constant: the Poisson regression model for survival data.
31 / 36

Poisson regression for melanoma data
For illustration, we use a model based on 3 intervals with cuts at 2.5 and 5 years.
Table: Results from fitting a Cox and a Poisson regression model to the malignant melanoma survival data.

                                     Cox                  Poisson
Covariate                            b         SD         b         SD
Gender                               0.413     0.240      0.396     0.240
Tumor thickness                      0.0994    0.0345     0.0964    0.0346
Ulceration                           0.952     0.268      0.960     0.269
Age                                  0.218     0.0775     0.222     0.0763
Intercept ($\log(\hat h_{01})$)                           -5.093    0.523
Intercept ($\log(\hat h_{02})$)                           -4.936    0.506
Intercept ($\log(\hat h_{03})$)                           -4.963    0.476
32 / 36

Poisson regression with categorical covariates The piecewise constant hazard model is particularly attractive when all covariates are categorical because, in this case, data may be reduced to tables of counts and person-years at risk. These tables are sufficient to fit the model. 33 / 36

Table: Failure counts/person-years at risk for the malignant melanoma survival data according to tumor thickness, ulceration, and three time intervals.

Time < 2.5 years
               Tumor thickness
Ulceration     0-2 mm       2-5 mm      5+ mm
Absent         1/53.47      11/96.12    12/47.12
Present        3/212.30     3/50.00     0/17.50

Time 2.5-5 years
               Tumor thickness
Ulceration     0-2 mm       2-5 mm      5+ mm
Absent         4/47.13      9/64.54     4/26.91
Present        4/193.60     2/42.88     1/15.35

Time >= 5 years
               Tumor thickness
Ulceration     0-2 mm       2-5 mm      5+ mm
Absent         1/44.88      6/38.87     0/28.88
Present        7/151.97     2/59.44     1/17.32
34 / 36
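To illustrate how such a table is used, the following sketch computes the crude failure rate (failures divided by person-years) within each time interval, pooling the six thickness-by-ulceration cells. The numbers are copied from the table as printed; the dictionary layout and names are illustrative, and the pooled rates are a descriptive summary only, not the covariate-adjusted baseline hazards of the fitted model.

```python
# (failures, person-years) for the six cells of each time interval,
# listed row by row as in the table.
cells = {
    "< 2.5 years": [(1, 53.47), (11, 96.12), (12, 47.12),
                    (3, 212.30), (3, 50.00), (0, 17.50)],
    "2.5-5 years": [(4, 47.13), (9, 64.54), (4, 26.91),
                    (4, 193.60), (2, 42.88), (1, 15.35)],
    ">= 5 years": [(1, 44.88), (6, 38.87), (0, 28.88),
                   (7, 151.97), (2, 59.44), (1, 17.32)],
}


def crude_rate(pairs):
    """Total failures, total person-years, and their ratio."""
    failures = sum(d for d, _ in pairs)
    person_years = sum(t for _, t in pairs)
    return failures, person_years, failures / person_years


summary = {interval: crude_rate(pairs) for interval, pairs in cells.items()}

for interval, (failures, person_years, rate) in summary.items():
    print(f"{interval}: {failures} failures / {person_years:.2f} person-years "
          f"= {rate:.4f} per year")
```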

Poisson regression for survival data: Comments
Nice features:
- The model works with the standard epidemiological rates
- A substantial data reduction is obtained in large (e.g., register-based) studies
- As exemplified, results tend to be very similar to those based on a Cox regression model
- The model may be fitted using standard software
- Time is treated as a factor in the model in the same way as other categorical covariates and, therefore, examination of proportional hazards amounts to a simple time-by-covariate interaction
A less nice feature is that the analysis depends on the choice of intervals.
35 / 36

Why: Poisson regression Even though there is no assumption in the model of anything having a Poisson distribution, the model may be fitted by, formally, treating the failure counts as Poisson with log(person-years at risk) being a so-called offset in the model. This is because the likelihood function for such a model is proportional to the likelihood function based on the piecewise constant hazard model. 36 / 36
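The proportionality of the two likelihoods can be checked numerically for a single cell. In this sketch the piecewise constant hazard log-likelihood for d failures in T person-years at rate $\lambda$ is taken to be $d \log\lambda - \lambda T$, and the Poisson log-likelihood uses the mean $\exp(\log\lambda + \log T)$, i.e. a log-link with offset log(person-years). The cell (11 failures in 96.12 person-years) is taken from the melanoma table; the trial rates are arbitrary illustrative values.

```python
from math import exp, lgamma, log


def piecewise_constant_loglik(rate, failures, person_years):
    """Log-likelihood for a constant hazard: d * log(rate) - rate * T."""
    return failures * log(rate) - rate * person_years


def poisson_loglik(rate, failures, person_years):
    """Poisson log-likelihood with log-link and offset log(person-years)."""
    mean = exp(log(rate) + log(person_years))
    return failures * log(mean) - mean - lgamma(failures + 1)


failures, person_years = 11, 96.12
trial_rates = [0.05, 0.10, 0.15, 0.20]

# The difference should not depend on the rate: it equals
# d * log(T) - log(d!), a constant that drops out of the maximization.
differences = [
    poisson_loglik(r, failures, person_years)
    - piecewise_constant_loglik(r, failures, person_years)
    for r in trial_rates
]

# Both likelihoods are maximized by the occurrence/exposure rate d / T.
mle_rate = failures / person_years

print("differences:", [round(d, 6) for d in differences])
print(f"maximum likelihood rate: {mle_rate:.4f} per year")
```

Because the two log-likelihoods differ only by a term that does not involve the rate, they lead to the same estimates and standard errors, which is why standard Poisson regression software can be used.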