Logistic Regression. Patrick Breheny. March 29

Outline: Introduction; Fitting logistic regression models; Results


Introduction  Binary outcomes are quite common in medicine and public health: alive/dead, diseased/healthy, infected/not infected, case/control. Assuming that these outcomes follow a normal distribution is clearly wrong; the binomial (Bernoulli) distribution is much more natural. The binomial distribution is what we use to model things like coin flips, and that's essentially how we're modeling disease/survival/infection here.

Setting up the GLM  The difference, however, is that we are not assuming that each person's coin has the same probability of landing heads. The probability that each person will die/succumb to disease/get infected varies depending on the explanatory variables, in a way that is explicitly modeled by the GLM:
Random component: $y_i \sim \mathrm{Bin}(1, \pi_i)$
Systematic component: $\eta_i = x_i^T \beta$
What about the link?

The link  The most natural choice is the canonical link:
$$\eta_i = \log\left(\frac{\pi_i}{1-\pi_i}\right), \qquad \pi_i = \frac{e^{\eta_i}}{1+e^{\eta_i}}$$
The first function is called the logit of $\pi_i$, and the generalized linear model based on this link is called logistic regression.
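As a quick illustration, the logit and its inverse can be written as one-line R functions; R's built-in qlogis and plogis compute the same two quantities:

    logit <- function(p) log(p/(1-p))              # maps (0, 1) to the real line
    expit <- function(eta) exp(eta)/(1+exp(eta))   # inverse logit, maps back to (0, 1)
    logit(0.75)       # 1.0986
    expit(1.0986)     # 0.75, recovering the original probability
    plogis(1.0986)    # built-in equivalent of expit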

The Donner party  Our example data set for today involves the survival of the members of the Donner party. In the spring of 1846, a group of American pioneers set out for California. However, they suffered a series of setbacks and did not arrive at the Sierra Nevada mountains until October. While crossing the mountains, they became trapped by an early snowfall and had to spend the winter there.

The Donner party (cont'd)  Conditions were harsh, food supplies ran low, and 40 of the 87 members of the Donner party died before they were finally rescued. The data set donner.txt contains the following information regarding the adult (over 15 years old) members of the Donner party: Age; Sex; Status (either Died or Survived).

In SAS  Generalized linear models can be fit in SAS using PROC GENMOD. Logistic regression models can also be fit using PROC LOGISTIC with similar syntax, although some of the default settings and output options are different. Don't use PROC GLM, however, because it doesn't actually fit GLMs (go figure).

Syntax: SAS  Specifying models in PROC GENMOD is very similar to using PROC REG, although now we have to specify the distribution of the outcome:

    PROC GENMOD DATA=donner;
      CLASS Status Sex;
      MODEL Status = Sex|Age / DIST=Binomial;
    RUN;

In principle, we could specify the link as well, although by default SAS (and R) use the canonical link.

Syntax: R  In R, the glm function accomplishes the same thing:

    fit <- glm(Status ~ Age*Sex, data=donner, family="binomial")
    summary(fit)

The output, in estimate/standard error/test statistic/p-value format, is quite similar to what we're used to seeing from linear regression, although recall that these numbers are only approximate.

Fitting the model  To ensure that we all understand where these numbers are coming from, let's calculate them directly. First, we note that for the binomial distribution, $V(\mu_i) = \mu_i(1-\mu_i)$. Thus, $W$ is a diagonal matrix with $i$th entry $\pi_i(1-\pi_i)$.

Fitting the model (cont'd)  We can then implement the IRLS algorithm as discussed in last week's lecture:

    repeat {
      old <- b
      eta <- X %*% b                                  # linear predictor
      pi <- exp(eta)/(1 + exp(eta))                   # fitted probabilities
      W <- diag(as.numeric(pi*(1 - pi)))              # weight matrix
      z <- eta + solve(W) %*% (y - pi)                # working response
      b <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% z   # weighted least squares update
      if (converged(b, old)) break
    }
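For the loop above to run, the design matrix, response, and starting values have to be set up first. A minimal sketch, assuming the donner data frame has the columns Age, Sex, and Status described earlier and using a simple convergence check in place of the converged function from last week:

    donner <- read.table("donner.txt", header=TRUE)
    donner$Sex <- relevel(factor(donner$Sex), ref="Male")   # so the dummy variable is Female
    X <- model.matrix(~ Age*Sex, data=donner)               # intercept, Age, Female, Age:Female
    y <- as.numeric(donner$Status == "Survived")            # 1 = survived, 0 = died (assumed coding)
    b <- rep(0, ncol(X))                                    # starting values for the coefficients
    converged <- function(new, old) max(abs(new - old)) < 1e-8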

Inferential results  We can obtain the inferential results via:

    VarB <- solve(t(X) %*% W %*% X)
    SE <- sqrt(diag(VarB))
    z <- b/SE
    p <- 2*pnorm(-abs(z))

As one would hope, we get all of the same results whether we use SAS, R, or calculate the results ourselves.
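As a quick check, assuming fit is the glm object from earlier and b, SE come from the hand-rolled IRLS above, the two sets of results can be compared side by side:

    summary(fit)$coefficients                                          # estimates, SEs, z values, p-values from glm
    cbind(Estimate = b, SE = SE, z = b/SE, p = 2*pnorm(-abs(b/SE)))    # hand calculation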

Standard results

                 Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)    0.3183      1.1310   0.2815    0.7784
    Age           -0.0325      0.0353  -0.9209    0.3571
    Female         6.9280      3.3989   2.0383    0.0415
    Age:Female    -0.1616      0.0943  -1.7143    0.0865

Interpreting the coefficients  Now let's talk a little more about what these results mean. The model coefficients are somewhat difficult to interpret directly, as (a) we've fit a model with an interaction, and (b) they must go through the link function before they are on the correct scale. To help gain some familiarity with what the model coefficients mean, let's calculate survival probabilities for various kinds of people.

Males  Let's write out the model as:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Female} + \beta_3 \mathrm{Age}\cdot\mathrm{Female}$$
What is the survival probability for a 20-year-old male?
$$\eta = \beta_0 + 20\beta_1, \qquad \hat{\eta} = -0.3312$$
$$\pi = \frac{e^{\eta}}{1+e^{\eta}}, \qquad \hat{\pi} = 0.418$$

Males (cont'd)  So, a 20-year-old male in the Donner party had a 41.8% chance of surviving. By a similar calculation, a 40-year-old male had a 27.3% chance of surviving, and a 60-year-old male a 16.4% chance.

Females  What about for a 20-year-old female?
$$\eta = \beta_0 + 20\beta_1 + \beta_2 + 20\beta_3, \qquad \hat{\eta} = 3.3649, \qquad \hat{\pi} = 0.967$$
So, 96.7% for a 20-year-old female; further calculation shows 37.4% for a 40-year-old female, and just 1.2% for a 60-year-old female. To summarize the trend: young adult females had a much higher survival probability than young adult males, but their survival probability decreases much more rapidly with age, to the point where older females were much less likely to survive than older males.
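These by-hand numbers can be reproduced in a few lines of R; a sketch, taking the coefficients directly from the table above (in the order intercept, Age, Female, Age:Female) and using plogis as the inverse logit:

    beta <- c(0.3183, -0.0325, 6.9280, -0.1616)    # intercept, Age, Female, Age:Female
    eta.male   <- beta[1] + 20*beta[2]                            # 20-year-old male
    eta.female <- beta[1] + 20*beta[2] + beta[3] + 20*beta[4]     # 20-year-old female
    plogis(c(eta.male, eta.female))                # roughly 0.418 and 0.967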

Automating these calculations  We don't have to do all those calculations by hand; we can use an ESTIMATE statement in SAS or the predict function in R. In particular, R also allows you to return those predictions on the scale of the response. This makes it easy to do things like plot the estimated probabilities.
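For instance, a minimal sketch in R, assuming fit is the model from earlier and Sex takes the values "Female" and "Male":

    newdata <- expand.grid(Age = 20:60, Sex = c("Female", "Male"))
    newdata$eta  <- predict(fit, newdata)                     # linear predictor scale (the default)
    newdata$prob <- predict(fit, newdata, type="response")    # probability scale
    head(newdata)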

Logistic regression: Fit  [Figure: fitted values by sex for ages 20 to 60; left panel, linear predictor vs. Age; right panel, Pr(Surviving) vs. Age.]

Sigmoidal curves  As you can see (most clearly for the females), logistic regression produces fitted probabilities that take on an S shape (a sigmoidal curve). This comes from the logit link function, which, as we demonstrated earlier, constrains the fitted probabilities to lie within [0, 1]. Once again, we see the utility of the link function: it allows us to obtain a nonlinear fit from a linear model.

Linear vs. logistic regression  Linear regression is still BLUE, and in reasonable agreement with the logistic fit, but the linearity requirement can lead to unrealistic estimates (fitted probabilities below 0 or above 1):

[Figure: Linear vs. logistic fits of Pr(Surviving) vs. Age, by sex, for ages 20 to 60.]
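The linear fits in the figure correspond to an ordinary least squares regression on a 0/1 outcome; a sketch, assuming the donner data frame from earlier and introducing a hypothetical 0/1 helper column Surv:

    donner$Surv <- as.numeric(donner$Status == "Survived")    # 0/1 survival indicator
    fit.lin <- lm(Surv ~ Age*Sex, data=donner)                # linear probability model
    predict(fit.lin, data.frame(Age=60, Sex="Female"))        # not constrained to lie in [0, 1]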

Thinking about the linearity assumption  Finally, all of our earlier questions about the systematic part of the model are still relevant to GLMs. For example, we have assumed a linear effect of age; is this reasonable? Perhaps the probability of survival is low for young adults and the elderly, and at its maximum in middle age.

Quadratic effect: results  Including a quadratic effect for age in the systematic component of the model, we see that there may be some justification for this idea:

[Figure: Logistic fits with linear vs. quadratic age effects, Pr(Surviving) vs. Age, by sex, for ages 20 to 60.]
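One way to fit such a model in R (a sketch, not necessarily the exact specification behind the figure) is to add a squared age term to the formula and compare against the linear-age fit:

    fit2 <- glm(Status ~ (Age + I(Age^2))*Sex, data=donner, family="binomial")
    summary(fit2)
    anova(fit, fit2, test="Chisq")   # likelihood ratio comparison with the linear-age model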