Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017


Seedling Survival

Tropical rain forests have up to 300 species of trees per hectare, which leads to difficulties when studying processes that occur at the community level. To gain insight into species responses, a sample of seeds was selected from a suite of eight species chosen to represent the range of regeneration types that occur in this community.

    Name          Size   Cotyledon type
    Ardisia         3          H
    C. biflora      7          H
    Gouania         1          E
    Hirtella        8          H
    Inga            4          H
    Maclura         2          E
    C. racemosa     6          H
    Strychnos       5          E

Size: 1 = smallest to 8 = largest
E = Epigeal cotyledons; H = Hypogeal seed food reserves

Experimental Design

This representative community was then placed in experimental plots manipulated to mimic natural conditions:
- 8 PLOTs: 4 in forest gaps, 4 in understory conditions
- Each plot split in half: mammals were excluded from one half with a CAGE
- 4 subplots within each CAGE/NO CAGE half
- 6 seeds of each SPECIES planted in each SUBPLT
- 4 LITTER levels applied to each SUBPLT
- LIGHT levels at the forest floor recorded
- SURV, an indicator of whether the seedling germinated and survived, recorded

Which variables are important in determining whether a seedling will survive? Are there interactions that influence survival probabilities?

Modeling Survival

The survival of a single seedling is a Bernoulli random variable with

    E[SURV_i | covariates] = π_i

How should we relate covariates to the probability of survival? For example, the probability of survival may depend on whether there was a CAGE to prevent animals from eating the seedling, or on LIGHT levels.

Naive approach: regress SURV on CAGE and LIGHT:

    π̂_i = β̂_0 + β̂_1 CAGE_i + β̂_2 LIGHT_i

Problems:
- Fitted values of probabilities are not constrained to (0, 1)
- Variances are not constant: Var(SURV_i) = π_i (1 − π_i) under the Bernoulli model
- Unbiased?
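A quick simulated illustration of the first problem (hypothetical data, not the seedling study): fitting an ordinary linear model to a 0/1 response can produce fitted "probabilities" outside (0, 1).

```r
# Simulated illustration (not the seeds data): a linear probability model
# can yield fitted values outside the (0, 1) interval.
set.seed(42)
x <- seq(-3, 3, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))  # Bernoulli response

naive <- lm(y ~ x)           # naive linear regression of a 0/1 response on x
fitted_p <- fitted(naive)

range(fitted_p)              # extends below 0 and above 1 at the extremes of x
```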

plot of SURV & CAGE

seeds = read.table("seeds.txt", header=TRUE)
mosaicplot(SURV ~ CAGE, data=seeds)

[Mosaic plot of SURV (0/1) by CAGE (0/1)]

plot of SURV versus LIGHT and CAGE

[Scatterplot of SURV (0/1) against LIGHT, panelled by CAGE]

plot of SURV versus LIGHT and CAGE, jittered

[Scatterplot of jittered SURV against LIGHT, panelled by CAGE]

Logistic Regression

To build in the necessary constraint that the probabilities are between 0 and 1, convert to log-odds or logits.

Odds of survival: π_i / (1 − π_i)

    logit(π_i) := log( π_i / (1 − π_i) ) = β_0 + β_1 CAGE_i + β_2 LIGHT_i = η_i

- η_i is the linear predictor
- logit is the link function that relates the mean π_i to the linear predictor η_i
- Generalized Linear Models (GLMs)
- Find Maximum Likelihood Estimates (an optimization problem)

Logits

To convert from the linear predictor η to the mean π, use the inverse transformation:

    log odds(SURV = 1) = η
    odds(SURV = 1) = exp(η) = ω
    π = odds / (1 + odds) = ω / (1 + ω)
    ω = π / (1 − π)

Can go in either direction.
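These conversions are built into R: `plogis()` is the inverse logit and `qlogis()` is the logit, so the round trip above can be checked directly (a small sketch with arbitrary linear-predictor values):

```r
# Round trip between linear predictor (eta), odds (omega), and probability
eta   <- c(-2, 0, 1.5)          # arbitrary linear-predictor values
omega <- exp(eta)               # odds
prob  <- omega / (1 + omega)    # probabilities

# plogis() computes omega/(1+omega) directly; qlogis() recovers eta from prob
all.equal(prob, plogis(eta))
all.equal(eta, qlogis(prob))
```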

Interpretation of Coefficients

    ω_i = exp(β_0 + β_1 CAGE_i + β_2 LIGHT_i)

When all explanatory variables are 0 (CAGE = 0, LIGHT = 0), the odds of survival are exp(β_0).

The ratio of odds (or odds ratio) at X_j = A to the odds at X_j = B, for fixed values of the other explanatory variables, is

    odds ratio = ω_A / ω_B = exp(β_j (A − B))

If A − B = 1,

    odds ratio = ω_A / ω_B = exp(β_j), i.e. Odds(X_j = A) = exp(β_j) × Odds(X_j = B)

Coefficients are log odds ratios.
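As a small numeric check of the odds-ratio formula, using the CAGE coefficient reported later in these slides (0.7858):

```r
# Odds ratio for a one-unit change (A - B = 1) in a logistic regression
beta_cage <- 0.7858                      # CAGE coefficient from the fitted model
A <- 1; B <- 0
odds_ratio <- exp(beta_cage * (A - B))   # equals exp(beta_cage) when A - B = 1
round(odds_ratio, 1)                     # about 2.2
```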

R Code

- use glm() rather than lm()
- model formula as before
- need to specify family (and link, if not the default)

seeds = read.table("seeds.txt", header=TRUE)
seeds.glm0 = glm(SURV ~ 1, data=seeds, family=binomial)
seeds.glm1 = glm(SURV ~ CAGE + LIGHT, family=binomial, data=seeds)

Estimates

library(xtable)
xtable(summary(seeds.glm1)$coef, digits=c(0, 4, 4, 1, -2))

                Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)  -1.4373      0.0709    -20.3  3.02E-91
    CAGE          0.7858      0.0875      9.0  2.79E-19
    LIGHT        -0.0000      0.0000     -5.3  9.09E-08

Coefficient for the dummy variable CAGE = 0.79. If CAGE increases by 1 unit (no cage to cage), the odds of survival change by a factor of exp(0.79) = 2.2: the odds of survival in a CAGE are 2.2 times the odds of survival in the open.

Confidence Intervals

MLEs are approximately normally distributed in large samples, with mean β_j and estimated variance SE(β̂_j)².

Asymptotic posterior distribution for β_j: N(β̂_j, SE(β̂_j)²)

(1 − α)100% CI based on normal theory: β̂_j ± Z_{α/2} SE(β̂_j)

95% CI for the coefficient for CAGE: 0.7858 ± 1.96 × 0.0875 = (0.61, 0.96)

Exponentiate to obtain an interval for the odds ratio: (exp(0.61), exp(0.96)) = (1.85, 2.61)

The odds of survival in a CAGE are 1.85 to 2.61 times the odds of survival in the open (with confidence 0.95).
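The interval arithmetic on this slide can be reproduced directly in R from the estimate and standard error in the summary table:

```r
# 95% CI for the CAGE coefficient, then the implied odds-ratio interval
est <- 0.7858                                # estimate from summary table
se  <- 0.0875                                # its standard error
ci    <- est + c(-1, 1) * qnorm(0.975) * se  # log-odds scale
or_ci <- exp(ci)                             # odds-ratio scale, ~ (1.85, 2.60)
```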

Deviance

The concept of deviance replaces sums of squares in GLMs.

residual deviance = −2 × log likelihood at the MLEs

    = −2 Σ_i [ y_i log(π̂_i) + (1 − y_i) log(1 − π̂_i) ]

where log(π̂_i / (1 − π̂_i)) = β̂_0 + β̂_1 CAGE_i + β̂_2 LIGHT_i

null deviance = residual deviance under the model with constant mean (analogous to the Total Sum of Squares in the Gaussian case)

analysis of deviance: the change in (residual) deviance has an asymptotic χ² distribution, with degrees of freedom given by the change in the number of parameters
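For binary responses the saturated model has log-likelihood 0, so the residual deviance equals −2 times the Bernoulli log-likelihood at the MLEs. A quick check on simulated data (not the seeds data):

```r
# Verify: residual deviance = -2 * Bernoulli log-likelihood at the MLEs
set.seed(1)
n <- 100
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x))  # simulated 0/1 response

fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)
dev_manual <- -2 * sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
# dev_manual matches deviance(fit) up to numerical error
```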

Analysis of Deviance Table

anova(seeds.glm0, seeds.glm1, test="Chi")

## Analysis of Deviance Table
##
## Model 1: SURV ~ 1
## Model 2: SURV ~ CAGE + LIGHT
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
## 1      3071     3426.1
## 2      3069     3299.0  2   127.09 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overdispersion / lack of fit?

Lack of Fit

Lack of fit is indicated if the residual deviance is larger than expected; no separate variance estimate is needed for the comparison. If the model is correct, the residual deviance has an approximate χ² distribution with n − p df.

p-value = P(χ²_{n−p} > observed deviance)

pchisq(seeds.glm1$deviance, seeds.glm1$df.residual, lower=FALSE)

## [1] 0.002024634

A surprising result if the model were true.

Diagnostic Plots: plot(seeds.glm1)

[Standard glm diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage; observations 2407, 2593, 2743, and 2744 are flagged]

termplot

termplot(seeds.glm1, term='LIGHT', rug=TRUE)

[Partial-residual plot for LIGHT]

log(LIGHT)

seeds.glm2 = glm(SURV ~ CAGE + log(LIGHT), family=binomial, data=seeds)
termplot(seeds.glm2, term="log(LIGHT)", rug=TRUE)

[Partial-residual plot for log(LIGHT)]

Other Variables

seeds.glm3 = glm(SURV ~ SPECIES + CAGE + log(LIGHT) + factor(LITTER),
                 data=seeds, family=binomial)
xtable(summary(seeds.glm3)$coef)

                        Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)            -0.83        0.25    -3.29      0.00
    SPECIESC. biflora       0.19        0.17     1.10      0.27
    SPECIESC. racemosa      0.85        0.16     5.19      0.00
    SPECIESGouania         -2.64        0.36    -7.32      0.00
    SPECIESHirtella         1.14        0.16     7.03      0.00
    SPECIESInga             0.90        0.16     5.55      0.00
    SPECIESMaclura         -2.64        0.36    -7.32      0.00
    SPECIESStrychnos       -1.19        0.22    -5.46      0.00
    CAGE                    0.93        0.10     9.59      0.00
    log(LIGHT)             -0.09        0.02    -4.50      0.00
    factor(LITTER)1         0.09        0.13     0.67      0.50
    factor(LITTER)2         0.24        0.13     1.81      0.07
    factor(LITTER)4        -0.14        0.14    -1.01      0.31

Analysis of Deviance

anova(seeds.glm3, test="Chi")

## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: SURV
##
## Terms added sequentially (first to last)
##
##                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                            3071     3426.1
## SPECIES         7   572.20      3064     2853.9 < 2.2e-16
## CAGE            1   109.29      3063     2744.6 < 2.2e-16
## log(LIGHT)      1    15.30      3062     2729.3 9.191e-05
## factor(LITTER)  3     7.92      3059     2721.4    0.0476

Interactions?

The presence of a CAGE may be more important for survival for some species than for others, which implies an interaction: the odds of survival with a cage compared to the odds of survival with no cage depend on SPECIES.

Fit a model with up to 4-way interactions:

seeds.glm4 = glm(SURV ~ SPECIES*CAGE*log(LIGHT)*LITTER,
                 data=seeds, family=binomial)

The analysis of deviance test suggests that there are three-way interactions.
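A minimal sketch (simulated data with hypothetical effect sizes, not the seeds data) of how an analysis-of-deviance comparison detects an interaction between two binary predictors:

```r
# Simulated example: compare additive and interaction logistic models
set.seed(7)
n  <- 1000
f1 <- rbinom(n, 1, 0.5)
f2 <- rbinom(n, 1, 0.5)
y  <- rbinom(n, 1, plogis(-1 + f1 + f2 - 2 * f1 * f2))  # true interaction

additive <- glm(y ~ f1 + f2, family = binomial)
full     <- glm(y ~ f1 * f2, family = binomial)
aod <- anova(additive, full, test = "Chisq")  # change-in-deviance test
p_value <- aod$`Pr(>Chi)`[2]                  # small here, so keep the interaction
```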

Hierarchical Model

So far we have not taken into account all the sources of variation, or the information about the experimental design:
- SPECIES (size & cotyledon type) and LITTER are randomized to sub-plots.
- We expect that the survival of seedlings in the same sub-plot may be related, which suggests a sub-plot random effect.
- Sub-plots are nested within CAGE halves within plots, so we expect sub-plots in the same CAGE half to be correlated, and sub-plots within the same plot to have similar survival.
- Plot characteristics (e.g. light levels) may affect survival.

How to model?