
Stat 3302 (Spring 2017)
Peter F. Craigmile

Simple linear logistic regression (part 1)
[Dobson and Barnett, 2008, Sections 7.1-7.3]

Outline:
- Generalized linear models for binary data
- Beetles dose-response example
- Exploratory analysis
- A better graphical exploratory data analysis
- The simple linear logistic regression model
- Finding the maximum likelihood estimates of β_0 and β_1
- Fitting the model in R
- Summarizing the R output
- Interpreting the slope parameter β_1
- Where we go next

Generalized linear models for binary data

Previously we considered statistical inference on a binomial proportion, p, for Y ~ B(m, p). We begin by considering binary data:

- Each random variable (RV) Y_i has only two possible values: 0 or 1.
- We assume that Pr(Y_i = 0) = 1 - p_i and Pr(Y_i = 1) = p_i.
- For each observation, i, we have a vector of covariates or explanatory variables x_i = (x_{i,1}, ..., x_{i,p})^T.
- The RVs Y_1, ..., Y_n are independent.

Our aim is to model the relationship between the success probabilities p_i and the explanatory variables x_i.

Beetles dose-response example

(See Table 7.2 of Dobson and Barnett [2008]; the original reference is Bliss [1935].)

Historically, this type of experiment was an important step in developing statistical models for binomial outcomes. In the experiment, adult flour beetles were exposed to a number of different concentrations (doses) of gaseous carbon disulfide, and it was assessed whether or not each beetle had died after five hours of exposure.

The variables are:
- The log10 dose of gaseous carbon disulfide (in units of log10 CS2 mg/l).
- Whether the beetle died (1) or remained alive (0).

We wish to understand the relationship between (log) dose and the probability of dying.

R code for this example can be downloaded from the class website.

Exploratory analysis

A scatterplot of dead/alive versus the log10 dose is meaningless:

[Figure: scatterplot of dead (1) or alive (0) versus log10 dose]

Instead we borrow strength over the doses. With repeated values of the doses, we have:

  log10 dose   number alive   number dead   total number   prop. killed
  1.6907                 53             6             59          0.102
  1.7242                 47            13             60          0.217
  1.7552                 44            18             62          0.290
  1.7842                 28            28             56          0.500
  1.8113                 11            52             63          0.825
  1.8369                  6            53             59          0.898
  1.8610                  1            61             62          0.984
  1.8839                  0            60             60          1.000
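As a quick illustration of how the grouped counts are summarized (a sketch in Python rather than the course's R; the counts are those in the table above, and adding 0.5 to each cell is one common adjustment so that observed proportions of 0 or 1 still give a finite logit):

```python
import math

# Grouped beetle counts from the table above: log10 dose, number dead, group size.
doses = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
dead = [6, 13, 18, 28, 52, 53, 61, 60]
total = [59, 60, 62, 56, 63, 59, 62, 60]

for x, y, m in zip(doses, dead, total):
    prop = y / m  # observed proportion killed at this dose
    # Empirical logit: add 0.5 to each count so p = 0 or p = 1 stays finite.
    emp_logit = math.log((y + 0.5) / (m - y + 0.5))
    print(f"{x:.4f}  prop = {prop:.3f}  empirical logit = {emp_logit:+.2f}")
```

These empirical logits are what a plot on the logit scale would display, one point per dose group.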

A better graphical exploratory data analysis

[Figure: two panels plotted against log10 dose — left, the proportion killed; right, the logit of the proportion killed]

Summary?

The simple linear logistic regression model

Suppose that Y_i (i = 1, ..., n) are n independent Bern(p_i) RVs. Let

  η_i = log( p_i / (1 - p_i) ) = β_0 + β_1 x_i,

where x_i (i = 1, ..., n) is some explanatory variable.

For our example, Y_i = ______ and x_i = ______ .

The above model is equivalent to assuming that Y_i (i = 1, ..., n) are n independent Bern(p_i) RVs, where

  p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}).
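The logit and its inverse can be written directly as code (Python used here for illustration; the course examples are in R):

```python
import math

def inv_logit(eta):
    """Inverse logit: maps eta = beta0 + beta1 * x to a probability in (0, 1)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """Logit: maps a probability p back to the linear-predictor scale."""
    return math.log(p / (1.0 - p))

# The two transformations undo each other (up to floating point):
print(inv_logit(0.0))          # 0.5: a linear predictor of 0 is even odds
print(logit(inv_logit(2.5)))   # approximately 2.5
```

Note that inv_logit always lies strictly between 0 and 1, which is exactly why the logit link is a natural choice for a success probability.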

Finding the maximum likelihood estimates of the parameters in the logistic regression model

The log-likelihood function for our data y = (y_1, ..., y_n)^T is

  l_n = Σ_{i=1}^n [ log C(1, y_i) + y_i log p_i + (1 - y_i) log(1 - p_i) ]
      = Σ_{i=1}^n [ log C(1, y_i) + y_i log( p_i / (1 - p_i) ) + log(1 - p_i) ],

where

  p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}),  i = 1, ..., n.

(Here C(1, y_i) = 1 for y_i equal to 0 or 1, so this term is zero; we keep it for consistency with the general binomial case.)

We have

  log( p_i / (1 - p_i) ) = β_0 + β_1 x_i

and

  1 - p_i = 1 - e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i})
          = (1 + e^{β_0 + β_1 x_i} - e^{β_0 + β_1 x_i}) / (1 + e^{β_0 + β_1 x_i})
          = 1 / (1 + e^{β_0 + β_1 x_i}).

Thus the log-likelihood function simplifies to

  l_n = Σ_{i=1}^n [ log C(1, y_i) + y_i (β_0 + β_1 x_i) - log(1 + e^{β_0 + β_1 x_i}) ].

How do we find the maximum likelihood estimates?

Finding the maximum likelihood estimates, cont.

So what is the problem?

- There is no closed-form solution for the maximum likelihood estimates of β_0 and β_1, denoted β̂_0 and β̂_1.
- The computer (in our case R) has to solve the score equations numerically.
- In particular, R uses an algorithm similar to the Newton-Raphson method called Fisher scoring, or iteratively weighted least squares (IWLS).
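To make the iteration concrete, here is a small Fisher-scoring sketch in Python for this two-parameter model, fit to the grouped beetle counts (the grouped binomial fit gives the same MLEs as the binary fit). This is an illustration, not the course's R code, and starting from a least-squares fit to the empirical logits is a choice made for this sketch, not a description of exactly what R's glm() does:

```python
import math

# Grouped beetle counts (log10 dose, number killed, group size) from the
# table earlier in these notes.
xs = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
ys = [6.0, 13.0, 18.0, 28.0, 52.0, 53.0, 61.0, 60.0]
ms = [59.0, 60.0, 62.0, 56.0, 63.0, 59.0, 62.0, 60.0]

def fisher_scoring(xs, ys, ms, tol=1e-10, max_iter=50):
    # Starting values: a least-squares fit to the empirical logits.
    zs = [math.log((y + 0.5) / (m - y + 0.5)) for y, m in zip(ys, ms)]
    n = len(xs)
    xbar, zbar = sum(xs) / n, sum(zs) / n
    b1 = (sum((x - xbar) * (z - zbar) for x, z in zip(xs, zs))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = zbar - b1 * xbar
    for _ in range(max_iter):
        # Score vector and expected information for the grouped binomial model.
        s0 = s1 = i00 = i01 = i11 = 0.0
        for x, y, m in zip(xs, ys, ms):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = y - m * p              # residual on the count scale
            w = m * p * (1.0 - p)      # Fisher weight
            s0 += r
            s1 += r * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det   # solve the 2x2 scoring equations
        d1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

b0, b1 = fisher_scoring(xs, ys, ms)
print(round(b0, 3), round(b1, 3))   # close to the glm() estimates -60.717 and 34.270
```

Each pass updates (β_0, β_1) by solving the scoring equations with the expected information in place of the observed Hessian; for the logistic (canonical) link the two coincide.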

Fitting the model in R

The R code to fit the simple linear logistic model is:

beetle.logit.log10dose <- glm(dead ~ log10.dose,
                              data = beetles, family = binomial)
summary(beetle.logit.log10dose)

Try to understand the output from R below.

Call:
glm(formula = dead ~ log10.dose, family = binomial, data = beetles)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4922  -0.5986   0.2058   0.4512   2.3820

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -60.717      5.181  -11.72   <2e-16 ***
log10.dose    34.270      2.912   11.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.44  on 480  degrees of freedom
Residual deviance: 372.47  on 479  degrees of freedom
AIC: 376.47

Number of Fisher Scoring iterations: 5
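One small sanity check on this output: for binary data the saturated model has log-likelihood zero, so the residual deviance equals -2 times the maximized log-likelihood, and the reported AIC is simply the residual deviance plus twice the number of estimated coefficients (here β_0 and β_1):

```python
residual_deviance = 372.47  # from the R summary above
n_coef = 2                  # beta0 and beta1

# AIC = -2 * log-likelihood + 2 * (number of coefficients)
#     = residual deviance + 2 * n_coef, for binary response data.
aic = residual_deviance + 2 * n_coef
print(round(aic, 2))   # 376.47, matching the AIC line in the summary
```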

Summarizing the R output

Interpreting the slope parameter β_1

Let p(x_i) denote the probability of success for covariate value x_i, and let η(x_i) denote the logit of this probability. The odds at value x_i are

  odds(x_i) = p(x_i) / (1 - p(x_i)).

The odds ratio comparing the odds at values x_i + 1 and x_i is

  odds(x_i + 1) / odds(x_i) = exp(η(x_i + 1)) / exp(η(x_i))
                            = exp(η(x_i + 1) - η(x_i))
                            = exp([β_0 + β_1 (x_i + 1)] - [β_0 + β_1 x_i])
                            = exp(β_1).

Thus e^{β_1} is ______

Interpreting the slope parameter β_1, continued

In practice we estimate e^{β_1} by e^{β̂_1}. For the beetles dataset, e^{β̂_1} = e^{34.270} (a very large number!).

The key problem is that for this application it makes no sense to consider an increase of one unit in the log10 dose.

Ex: What is the multiplicative change in the odds for an increase of 0.01 units in the log10 dose? Produce a 95% confidence interval (CI) for this change.
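One way to carry out this calculation numerically (a sketch using the slope estimate and standard error from the R summary above; the 1.96 multiplier gives a Wald-type 95% interval, obtained by exponentiating the endpoints of the interval for 0.01 × β_1):

```python
import math

b1_hat, se = 34.270, 2.912   # slope estimate and standard error from glm()
delta = 0.01                 # increase of 0.01 in the log10 dose

# Multiplicative change in the odds for a delta increase in x: exp(delta * beta1).
or_hat = math.exp(delta * b1_hat)
# Wald 95% CI: exponentiate the endpoints of the CI for delta * beta1.
lower = math.exp(delta * (b1_hat - 1.96 * se))
upper = math.exp(delta * (b1_hat + 1.96 * se))
print(round(or_hat, 3), round(lower, 3), round(upper, 3))   # 1.409 1.331 1.491
```

So a 0.01-unit increase in the log10 dose is estimated to multiply the odds of death by about 1.41, with a 95% CI of roughly (1.33, 1.49).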

Interpreting the slope parameter β_1, continued

Where we go next

So far we are in a position to interpret the coefficients of the simple logistic regression model. Here is a wish list of things we want to do:

1. Produce an estimate of the probability at a given covariate value, along with a measure of uncertainty.
2. Compare different statistical models that we fit using deviances (still to be defined!).
3. Check the fit of models using residuals appropriate for this statistical model.
4. Extend to more complicated logistic regression models.

References

C. I. Bliss. The calculation of the dosage-mortality curve. Annals of Applied Biology, 22:134-167, 1935.

A. J. Dobson and A. Barnett. An Introduction to Generalized Linear Models, Third Edition. Chapman & Hall, New York, NY, 2008.