Logistic Regression. Some slides from Craig Burkett. STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy


Titanic Survival Case Study
The RMS Titanic: a British passenger liner that collided with an iceberg during her maiden voyage. 2224 people were aboard; 710 survived.
People on board: 1st-, 2nd-, and 3rd-class passengers (the price of the ticket and also social class played a role), of different ages and genders.

Exploratory data analysis (in R)
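The live R demo is not reproduced in the transcript. As a minimal sketch of the kind of exploratory look one might take, assuming the data frame is called titan and has the survived, age, and sex columns used later in the slides:

summary(titan)                                   # ranges, missing values
table(titan$survived)                            # overall survival counts
prop.table(table(titan$sex, titan$survived), 1)  # survival rate by sex
boxplot(age ~ survived, data = titan)            # age distribution by outcome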

The Statistical Sleuth, 3rd ed.

What's Wrong with Linear Regression?
E[Y_i] = β_0 + β_1 X_1^(i) + ... + β_{k-1} X_{k-1}^(i), with Y_i ~ Bernoulli(π_i) (π_i = π(X_1, X_2, ...))
We can match the expectation. But we'll have Var(Y_i) = π_i (1 - π_i):
Y_i is very far from normal (it takes just two values)
Var(Y_i) is not constant
Predictions for Y_i can have the right expectations, but will sometimes be outside of (0, 1)
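Not from the slides, but a small simulated sketch of the last point: fitting an ordinary linear model to a 0/1 response produces fitted "probabilities" outside (0, 1).

set.seed(1)
x <- seq(-3, 3, length.out = 200)
p <- 1 / (1 + exp(-2 * x))          # true success probability
y <- rbinom(length(x), 1, p)        # binary response
lin <- lm(y ~ x)                    # naive linear model
range(fitted(lin))                  # fitted values fall below 0 and above 1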

Logistic Curve
s(y) = 1 / (1 + exp(-y))
Inputs can be in (-∞, ∞); outputs will always be in (0, 1)
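The curve is easy to draw in R (a one-line sketch, not part of the original slides):

curve(1 / (1 + exp(-x)), from = -6, to = 6,
      xlab = "linear predictor", ylab = "probability")  # S-shaped logistic curve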

Logistic Regression
We will be computing 1 / (1 + exp(-β_0 - β_1 X_1 - ... - β_k X_k)) to always get a value between 0 and 1
Model: Y_i ~ Bernoulli(π_i), with π_i = 1 / (1 + exp(-β_0 - β_1 X_1^(i) - ... - β_k X_k^(i)))
Var(Y_i) = ?

Log-Odds
1/π = 1 + e^-(β_0 + β_1 x_1 + ...)
1/π - 1 = e^-(β_0 + β_1 x_1 + ...)
log(1/π - 1) = -(β_0 + β_1 x_1 + ...)
log(π / (1 - π)) = β_0 + β_1 x_1 + ...

Odds If the probability of an event is π, the odds of the event are π / (1 - π)

Odds You pay $5 if France don't win, and get $9 if they do. What's the probability of France winning, assuming true odds are offered? (What about if the odds r = p/(1 - p) are given as 0.6?)
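(One way to work it out, under the reading that you lose $5 on a loss and gain $9 on a win: at true odds the expected gain is zero, so 9π - 5(1 - π) = 0, giving odds π/(1 - π) = 5/9 and π = 5/14 ≈ 0.36. If instead the odds are r = p/(1 - p) = 0.6, then p = r/(1 + r) = 0.6/1.6 = 0.375.)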

The Log-Odds are Linear in X
log(π / (1 - π)) = β_0 + β_1 x_1 + ...
Generalized linear model: g(μ) = Xβ, with a distribution imposed on Y
g(μ) = log(μ / (1 - μ)) with Y ~ Bernoulli(μ) is logistic regression
g is the link function; logit(μ) = log(μ / (1 - μ))

Maximum Likelihood
Y_i ~ Bernoulli(π_i)
P(Y_i = y_i | β_0, β_1, ..., β_{k-1}) = π_i^{y_i} (1 - π_i)^{1 - y_i}
P(Y_1 = y_1, ..., Y_n = y_n) = ∏_{i=1}^{n} π_i^{y_i} (1 - π_i)^{1 - y_i}
log P(Y_1 = y_1, ..., Y_n = y_n) = Σ_{i=1}^{n} [ y_i log π_i + (1 - y_i) log(1 - π_i) ]
where π_i = 1 / (1 + exp(-β_0 - β_1 X_1^(i) - ... - β_{k-1} X_{k-1}^(i)))
Now, take the derivative with respect to the betas, and find the betas that maximize the log-likelihood.
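In practice the maximization is done numerically. A minimal sketch of the idea in R, assuming the titan data frame with the survived and age columns from the later slides (glm() does the same job; the helper names here are just for the sketch):

d <- na.omit(titan[, c("survived", "age")])          # drop missing ages
negloglik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * d$age)))   # Bernoulli probabilities
  -sum(d$survived * log(p) + (1 - d$survived) * log(1 - p))
}
fit_optim <- optim(c(0, 0), negloglik)               # minimize the negative log-likelihood
fit_optim$par                                        # roughly matches coef(glm(survived ~ age, binomial, data = d))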

Titanic Analysis (in R)

Interpretation of β_0 - Linear Reg.
In Linear Regression, if there are no other predictors, E[Y] = β_0, and the least-squares (and Maximum Likelihood) estimate is the sample mean: b_0 = Ȳ
y <- rnorm(100, 10, 25)
> summary(lm(y ~ 1))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    8.727      2.524   3.458 0.000804
> mean(y)
[1] 8.726752

Interpretation of β_0 - Logistic Reg.
The MLE for π = P(Y = 1) is also the sample mean Ȳ (the proportion of the time Y is 1, i.e. the marginal probability that Y is 1)
In Logistic Regression, without predictors we have π = 1 / (1 + e^-β_0), so b_0 = logit(Ȳ)
(in R)
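A quick way to check this in R (a sketch; the titan data frame and survived column are the ones used on the later slides):

fit0 <- glm(survived ~ 1, family = binomial, data = titan)  # intercept-only model
coef(fit0)                        # estimated beta_0
qlogis(mean(titan$survived))      # logit of the sample proportion -- same value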

Interpretation of β_0 - Logistic Reg. Now, back to predicting with age. What's the interpretation of β_0 now? (in R)

Interpretation of β_1 (Categorical Predictor)
fit <- glm(survived ~ sex, family = binomial, data = titan)
> summary(fit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.44363    0.08762  -16.48   <2e-16
sexfemale    2.42544    0.13602   17.83   <2e-16
> exp(coef(fit))
(Intercept)   sexfemale
  0.2360704  11.3071844
Interpretation: the odds of survival for women are 11.3 times the odds for men

Interpretation of β (Mixed Predictors)
logit(π) = β_0 + β_1 age + β_2 I(female)
In Linear Regression, β_1 would be the increase in the response for a unit change in age, keeping other variables constant; β_2 is the difference between group means, keeping other variables constant
In logistic regression, β_1 is the increase in the log-odds of survival for a unit change in age, keeping other variables constant; β_2 is the increase in the log-odds of survival for women compared to men, keeping age constant
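Exponentiating turns these statements about log-odds into statements about odds ratios. A small sketch of how that reads in R, using the same age + sex model that appears on the Interpreting Coefficients slide below:

fit <- glm(survived ~ age + sex, family = binomial, data = titan)
exp(coef(fit))   # multiplicative changes in the odds of survival
# exp(beta_1): factor by which the odds change per extra year of age
# exp(beta_2): odds for females relative to males of the same age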

Quantifying Uncertainty (in R)

Interpreting Coefficients
fit <- glm(survived ~ age + sex, family = binomial, data = titan)
> exp(coef(fit))
(Intercept)         age   sexfemale
  0.2936769   0.9957548  11.7128810
> exp(confint(fit))
Waiting for profiling to be done...
                2.5 %     97.5 %
(Intercept) 0.2037794  0.4194266
age         0.9855965  1.0059398
sexfemale   8.7239838 15.8549675
Controlling for age, we are 95% confident that females have 772% to 1485% higher odds of survival, compared to males

Likelihood Ratio Test
Can be used to compare any two nested models
Test statistic: LRT = 2 log(LMAX_full) - 2 log(LMAX_reduced)
Has a χ² distribution when the extra parameters in the full model are 0
LMAX: the maximized likelihood value
A lot of the time what's computed is the deviance = const - 2 log(LMAX)
Then: LRT = deviance_reduced - deviance_full
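In R the test can be run directly on two nested glm fits; a sketch using the Titanic models from the earlier slides (data frame and column names assumed as before, fit on the same complete cases so the models are comparable):

d <- na.omit(titan[, c("survived", "age", "sex")])   # same rows for both fits
reduced <- glm(survived ~ sex, family = binomial, data = d)
full    <- glm(survived ~ age + sex, family = binomial, data = d)
anova(reduced, full, test = "Chisq")   # LRT = deviance_reduced - deviance_full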

Wald Test
The test based on approximating the sampling distribution of the coefficients as normal
The SEs that are shown when you call summary
Unreliable for small sample sizes
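For a single coefficient the Wald statistic is just the estimate divided by its standard error. A quick sketch of reproducing the z values from summary() by hand, using the fit object from the previous slides:

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
z   <- est / se                   # Wald z statistics
2 * pnorm(-abs(z))                # two-sided p-values, matching summary(fit)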

Model Assumptions
Independent observations
Correct form of model
Linearity between logits & predictor variables
All relevant predictors included
For CIs and hypothesis tests to be valid, need large sample sizes

Titanic Dataset: Model Checking
Independent observations? Not really: for one thing, they were all on one ship!
Large sample? Yes

Titanic Dataset: Other Worries
Do we have all relevant predictors? I.e., might there be confounding variables we haven't considered/don't have available?