Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression


Sec 9 - Logistic regression

In linear regression, we studied models where Y is a continuous variable. What about the case when Y is binary, Y = 1 (presence) or Y = 0 (absence)? (Ex: Y is alive or dead, sick or well, positive or negative.)

Linear regression does not work well for two reasons:
1. Y can't be linearly related to the Xs.
2. Given P, the probability that Y = 1, Y does not have a Gaussian distribution, so the errors are not Gaussian.

For (1) we need a linearizing transformation; for (2) we need a different error model.

Let P be the mean of the (binary) Ys for a fixed combination of Xs. P is a proportion:

0 <= P <= 1, so P can't be a linear function of the Xs.

What about P/(1-P) = odds? 0 <= odds < infinity, so the odds still has a floor of 0.

What about log[odds] = log[P/(1-P)]? -infinity < log[P/(1-P)] < infinity.

The log[P/(1-P)] transformation of P is called the logit (log odds) transformation. In logistic regression, we assume

logit(P) = β0 + β1 X1 + ... + βk Xk

(and that Y has a Bernoulli distribution with mean = P and var(Y) = P(1-P)).

This implies that

P/(1-P) = odds = e^logit = e^(β0 + β1 X1 + ... + βk Xk)

mean Y = P = 1/(1 + e^-logit) = 1/[1 + e^-(β0 + β1 X1 + ... + βk Xk)]

P is the predicted probability that Y=1, or the predicted proportion of Ys equal to 1, for a given combination of the Xs. P is also the risk, for example, the risk of disease.

Note: If odds = P/(1-P), then P = odds/(odds + 1).

[Figure: two plots of P (risk) versus logit = log odds for logit from -4 to 4; P follows the sigmoid curve from 0 to 1.]
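To make the transformation concrete, here is a minimal Python sketch (assuming numpy is available); the function names are illustrative, not from any particular package:

```python
# A minimal sketch of the logit and inverse-logit transformations above.
import numpy as np

def logit(p):
    """Log odds: maps a probability in (0, 1) to (-infinity, +infinity)."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit: maps any real logit back to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(logit(p))             # [-2.197 -1.099  0.     1.099  2.197]
print(inv_logit(logit(p)))  # recovers the original probabilities
```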

If logit = log(odds) = ln(P/(1-P)) = β0 + β1 X1 + ... + βk Xk, then

odds = (e^β0)(e^(β1 X1))(e^(β2 X2)) ... (e^(βk Xk))

or

odds = (base odds) x OR1 x OR2 x ... x ORk

(The base odds is the odds when all Xs equal zero; OR = odds ratio.) The above gives the odds for a given set of X1, X2, ..., Xk.

P = risk = odds/(odds + 1) = 1/(1 + odds^-1)

or P = 1/(1 + e^-log(odds)) = 1/(1 + e^-logit)

or P = 1/[1 + e^-(β0 + β1 X1 + ... + βk Xk)]

What do the coefficients mean? They are the linear rate of change in the logit per unit change in X. (Huh?)

Example: What if X is coded 0 for males and 1 for females, and logit(P) = β0 + β1 X?

If X = 0 (males), then logit(Pm) = β0
If X = 1 (females), then logit(Pf) = β0 + β1

therefore

β1 = logit(Pf) - logit(Pm)
   = log(Pf/(1-Pf)) - log(Pm/(1-Pm))
   = log(odds females) - log(odds males)
   = log(odds females / odds males)
   = log(odds ratio) = log(OR)

so e^β1 = OR

Computing odds

Let P = proportion with disease and assume logit(P) = β0 + β1 age + β2 sex (sex = 0 for male, 1 for female).

Odds ratio of disease in females compared to males of the same age: e^β2

Increase in odds of disease for a one-year increase in age, holding sex constant: e^β1

Increase in odds of disease for an n-year increase in age: (e^β1)^n = e^(n β1)

Computing odds ratios (ORs) for continuous variables

Outcome: MI = myocardial infarction (coded 0 or 1)
Predictors:
  age - in years
  htn - hypertension (coded 0 or 1)
  smoke - smoking (coded 0 or 1)

logit(MI) = β0 + β1 age + β2 htn + β3 smoke

What is the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension?

OR = e^(β0 + β1(40) + β2(1) + β3 smoke - [β0 + β1(30) + β2(0) + β3 smoke]) = e^(10 β1 + β2)
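A short sketch of this arithmetic; the coefficient values below are hypothetical, chosen only to show the calculations:

```python
# Hypothetical sketch: odds ratios from logistic coefficients.
import numpy as np

b_age, b_htn = 0.05, 0.80   # hypothetical: per year of age; htn yes vs no

or_1yr  = np.exp(b_age)               # OR per 1-year increase in age
or_10yr = np.exp(10 * b_age)          # OR per 10 years: (e^b1)^10 = e^(10*b1)
# 40-year-old with htn vs 30-year-old without: the intercept and smoking
# terms cancel, leaving e^(10*b1 + b2)
or_contrast = np.exp(10 * b_age + b_htn)
print(or_1yr, or_10yr, or_contrast)   # 1.051..., 1.648..., 3.669...
```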

Interactions

Let Y = CHD (0=no, 1=yes)
    S = smoking (0=no, 1=yes)
    D = drinking alcohol (0=no, 1=yes)

Model: logit(P) = β0 + β1 S + β2 D + β3 S D

Let the referent category be: no smoking (S=0), no drinking (D=0).

If S=0, D=0: odds = e^β0
If S=1, D=0: odds = e^(β0+β1)
If S=0, D=1: odds = e^(β0+β2)
If S=1, D=1: odds = e^(β0+β1+β2+β3)

OR00 = e^β0/e^β0 = 1
OR10 = e^β1
OR01 = e^β2
OR11 = e^(β1+β2+β3)

When will OR11 = OR10 x OR01? If and only if β3 = 0.
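A numerical check of this algebra, again with hypothetical betas:

```python
# Numerical check of the interaction algebra; the betas are made up.
import numpy as np

b0, b1, b2, b3 = -2.0, 0.7, 0.5, 0.3   # hypothetical coefficients

def odds(S, D):
    return np.exp(b0 + b1 * S + b2 * D + b3 * S * D)

base = odds(0, 0)
OR10, OR01, OR11 = odds(1, 0) / base, odds(0, 1) / base, odds(1, 1) / base
print(OR11, OR10 * OR01 * np.exp(b3))  # equal: OR11 = OR10 * OR01 * e^b3
# OR11 equals OR10 * OR01 exactly when b3 = 0
```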

Interpretation example

Potential predictors (13) of in-hospital infection mortality (yes or no)
(Crabtree et al., JAMA, 8 Dec 1999, No. 22, 2143-2148)

Gender (female or male)
Age - years
APACHE score (0-129)
Diabetes (y/n)
Renal insufficiency / hemodialysis (y/n)
Intubation / mechanical ventilation (y/n)
Malignancy (y/n)
Steroid therapy (y/n)
Transfusions (y/n)
Organ transplant (y/n)
WBC - count
Max temp - degrees F
Days from admission to treatment (or > 7 days, yes or no)

Interpretation example

Table 3. Stepwise Logistic Regression Factors Associated With Mortality for All Infections

Characteristic         Odds Ratio (95% CI)   p value
Incr APACHE score*     1.15 (1.11-1.18)      <.001
Transfusion (y/n)      4.15 (2.46-6.99)      <.001
Increasing age         1.03 (1.02-1.05)      <.001
Malignancy             2.60 (1.62-4.17)      <.001
Max temperature        0.70 (0.58-0.85)      <.001
Adm to treat > 7 d     1.66 (1.05-2.61)      0.03
Female (y/n)           1.32 (0.90-1.94)      0.16

*APACHE = Acute Physiology & Chronic Health Evaluation score

Diabetes complications - descriptive stats

Obese = actual weight above expected weight

Table of obese by diabetes complication (frequencies)

obese \ complication   no (0)   yes (1)   Total   % with complication
no (0)                   56       28        84    28/84 = 33%
yes (1)                  20       41        61    41/61 = 67%
Total                    76       69       145
% obese                 26%      59%              (RR = 2.0, OR = 4.1)
p value < 0.001

Fasting glucose ("fastglu"), mg/dl
                   n    min    median    mean     max
No complication   76   70.0     90.0     91.2   112.0
Complication      69   75.0    114.0    155.9   353.0
P value =

Steady state glucose ("steadyglu"), mg/dl
                   n    min    median    mean     max
No complication   76   29.0    105.0    114.0   273.0
Complication      69   60.0    257.0    261.5   480.0
P value =


Diabetes complications - logistic model (SAS output)

Parameter   DF    beta       SE(b)     Chi-Square    p
Intercept    1   -14.702     3.231      20.706      <.0001
obese        1     0.328     0.615       0.285      0.5938
fastglu      1     0.108     0.031      12.456      0.0004
steadyglu    1     0.0225    0.0053     18.322      <.0001

Odds Ratio Estimates
             Point       95% Wald
Effect       Estimate    Confidence Limits
obese        1.388       0.416    4.631
fastglu      1.114       1.049    1.182
steadyglu    1.023       1.012    1.033

Log odds of diabetes complication = -14.7 + 0.328 obese + 0.108 fastglu + 0.023 steadyglu

Q - What is the model predicted risk if you are obese, have fasting glucose = 100 mg/dl, and steady state glucose = 120 mg/dl?
A - logit = -14.7 + 0.328 + 0.108(100) + 0.023(120) = -0.872
    odds = exp(-0.872) = 0.418, P = 0.418/1.418 = 0.295
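The Q&A above can be checked in a few lines; a minimal sketch, with the coefficients copied from the SAS output:

```python
# Sketch reproducing the predicted-risk answer above; coefficients are
# copied from the SAS output (the slide rounds 0.0225 to 0.023).
import numpy as np

b0, b_ob, b_fg, b_sg = -14.702, 0.328, 0.108, 0.0225

logit = b0 + b_ob * 1 + b_fg * 100 + b_sg * 120
odds = np.exp(logit)
risk = odds / (1 + odds)
print(round(logit, 3), round(risk, 3))  # -0.874, 0.294 (slide: -0.872, 0.295)
```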

Statistical significance of the βs

Just as there are t statistics for the coefficients in linear regression, a Wald chi-square statistic is computed for each β in logistic regression. The Wald chi-square is defined as

χ²(Wald) = [β/SE(β)]²

A confidence interval for the true β is formed in the usual way:

CI: β ± Z SE(β)

where Z is the appropriate Gaussian percentile. For 95% intervals, Z = 1.96 as usual.

First form the (95%) CI for β on the log scale:

(b - 1.96 SE, b + 1.96 SE)

then take antilogs of each end to get the confidence interval for the OR = e^b:

(e^(b - 1.96 SE), e^(b + 1.96 SE))
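As a check against the output above, a sketch (assuming numpy and scipy) computing the Wald chi-square and the 95% CI for the OR from the fastglu row (b = 0.108, SE = 0.031):

```python
# Wald chi-square and 95% CI for the OR, fastglu row of the SAS output.
import numpy as np
from scipy import stats

b, se = 0.108, 0.031
wald = (b / se) ** 2                    # approx 12.1 (output shows 12.456;
p = stats.chi2.sf(wald, df=1)           #  b and SE above are rounded)

lo, hi = b - 1.96 * se, b + 1.96 * se   # CI for beta on the log scale
print(np.exp(b), np.exp(lo), np.exp(hi))  # OR approx 1.11 (1.05, 1.18)
```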

Model fit - linear vs logistic regression

k variables, n observations

Variation    df       sum of squares or deviance
Model        k        G
Error        n-k-1    D
Total        n-1      T   <- fixed

Yi = i-th observation, Ŷi = prediction for the i-th obs

statistic           Linear regr     Logistic regr
D/(n-k-1)           Residual SDe    Mean deviance
Σ(Yi - Ŷi)²/Ŷi      --              Hosmer-L χ²
Corr(Y, Ŷ)²         R²              Cox-Snell R²
G/T                 R²              Pseudo R²

Good regression models have large G and small D. For logistic regression, the mean deviance D/(n-k-1) should be near 1.0.

Model fit - analysis of deviance

Diabetes complication model

                        df    -2 log L    p value
Model (G)                3     117.21     < 0.001
Error (deviance = D)   141      83.46      0.99
Total (G + D)          144     200.67

Mean deviance = 83.46/141 = 0.59
Want mean deviance near 1.0 if the model fits the data.

Pseudo R² = 117.21/200.67 = 0.584
Cox-Snell R² = 0.554
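These summaries follow directly from G, D, and n; a minimal sketch, assuming Cox-Snell R² = 1 - exp(-G/n) (the standard formula, which reproduces the 0.554 above):

```python
# Sketch reproducing the analysis-of-deviance summaries from G, D, n, k.
import numpy as np

G, D, n, k = 117.21, 83.46, 145, 3   # model chi-square, deviance, n obs, k vars
mean_deviance = D / (n - k - 1)      # 83.46 / 141 = 0.59
pseudo_r2 = G / (G + D)              # 117.21 / 200.67 = 0.584
cox_snell_r2 = 1 - np.exp(-G / n)    # 0.554
print(mean_deviance, pseudo_r2, cox_snell_r2)
```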

Model fit - diabetes complication data

Hosmer-Lemeshow χ² goodness of fit test: χ² = Σ (obs - expected)²/expected

Observed and model-based expected frequencies (i.e., numbers of persons) by deciles:

                complication = 1       complication = 0
Group   Total   Observed   Expected   Observed   Expected
  1       16        0        0.23        16       15.77
  2       15        0        0.61        15       14.39
  3       15        0        1.31        15       13.69
  4       15        4        2.48        11       12.52
  5       15        8        4.53         7       10.47
  6       15        6        8.49         9        6.51
  7       15       13       12.74         2        2.26
  8       16       15       15.60         1        0.40
  9       23       23       23.00         0        0.00
total    145       69       69.00        76       76.00

Chi-Square   DF   p value
  9.8949      7   0.1946

If the model fits the data (the null hypothesis here), the H-L chi-square statistic should be small and the p value should be large.
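A sketch of this computation from the grouped counts in the table; how the zero-expected cell (group 9 on the complication = 0 side) is handled here is our assumption:

```python
# Hosmer-Lemeshow chi-square from the grouped counts above.
import numpy as np
from scipy import stats

obs1  = np.array([0, 0, 0, 4, 8, 6, 13, 15, 23], dtype=float)
exp1  = np.array([0.23, 0.61, 1.31, 2.48, 4.53, 8.49, 12.74, 15.60, 23.00])
total = np.array([16, 15, 15, 15, 15, 15, 15, 16, 23], dtype=float)
obs0, exp0 = total - obs1, total - exp1

ok = (exp1 > 0) & (exp0 > 0)      # skip cells with zero expected count
hl = np.sum((obs1 - exp1)[ok] ** 2 / exp1[ok]) + \
     np.sum((obs0 - exp0)[ok] ** 2 / exp0[ok])
p = stats.chi2.sf(hl, df=len(total) - 2)
print(hl, p)   # approx 9.9 and 0.20, close to the reported 9.8949 / 0.1946
```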

Goodness of fit versus variation accounted for

It is possible that the mean deviance and the Hosmer-Lemeshow goodness of fit χ² (H-L χ²) both indicate that the model fits the data (mean deviance near 1.0, H-L χ² small and its p value large), and yet the corresponding R² values for the same model may be low, implying that little of the variation is accounted for. Is this a contradiction? No.

The mean deviance and H-L χ² are based on comparing observed versus expected frequencies across the J covariate patterns ("cells"), where these J covariate patterns are made using the variables in the model. This is like comparing observed to expected cell MEANS in linear regression. In contrast, the R² statistics are based on how much of the total variation is accounted for by the model, not just how well the model fits the means. As in linear regression, the R² is not just assessing fit across the current J covariate patterns; it is assessing how much of the total variation of Y (including variation within the J covariate cells) is accounted for by the current model. This total variation may be greater than the variation of the (average) Y across the J covariate patterns.

So, if the mean deviance and H-L χ² are satisfactory but the R² is low, adding interactions or other terms made from the current variables will not improve the R² much. This result implies that other factors, not currently in the model, are needed to account for the variation of Y.

Goodness of fit - logistic & linear regression

Logistic (Hosmer-Lemeshow): compares P̂ vs P and (1-P̂) vs (1-P)
Linear: compares Ŷ vs Y

For a given covariate pattern, the model predicted values and the mean values can agree (good fit) even though the predicted values and the individual values do not agree as well. That is, the predicted values can agree with the observed values on average, but not for each individual.

Sensitivity & specificity for logistic

If one has a binary outcome like positive or negative (say, for a disease), one can imagine classifying persons into these two groups on the basis of their X values and then comparing the classification to their actual status. The sensitivity and specificity are measures of how good the classification is:

               True pos   True neg
Classify pos      a          b
Classify neg      c          d
total            a+c        b+d

Sensitivity = a/(a+c), specificity = d/(b+d)
False neg = c/(a+c) = 1 - sensitivity
False pos = b/(b+d) = 1 - specificity
Accuracy = W sensitivity + (1-W) specificity

A good classification rule (and thus a good logistic model) has high sensitivity and high specificity. If we choose a cutpoint, Pc, we can use the predicted values (Pi) from our logistic rule to classify persons:

Classify as positive if Pi = predicted P > Pc; otherwise classify as negative.

For a given Pc we can compute sensitivity and specificity, and also an overall penalty function of the form

penalty = W(false neg) + (1-W)(false pos)

where W gives the relative weight we assign to a false negative versus a false positive. We call the cutpoint that minimizes the penalty P0. Minimizing the penalty is the same as maximizing accuracy = W(sensitivity) + (1-W)(specificity). W = 0.5 implies equal weight.
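A minimal sketch of this cutpoint search; the outcome vector y and the fitted probabilities p_hat are assumed to come from a previously fitted model (toy values below):

```python
# Search for the cutpoint maximizing W*sensitivity + (1-W)*specificity.
import numpy as np

def best_cutpoint(y, p_hat, W=0.5):
    """Return (cutpoint, weighted accuracy) over all candidate cutpoints."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    best_pc, best_acc = None, -np.inf
    for pc in np.unique(p_hat):
        pred = (p_hat > pc).astype(int)     # classify positive if P > Pc
        sens = np.mean(pred[y == 1] == 1)
        spec = np.mean(pred[y == 0] == 0)
        acc = W * sens + (1 - W) * spec
        if acc > best_acc:
            best_pc, best_acc = pc, acc
    return best_pc, best_acc

y     = np.array([0, 0, 1, 1, 1])              # toy data
p_hat = np.array([0.2, 0.4, 0.35, 0.7, 0.9])
print(best_cutpoint(y, p_hat))                 # (0.4, 0.833...)
```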

Accuracy: diabetes complication model

logit(Pi) = -14.7 + 0.328 obese + 0.108 fastglu + 0.023 steadyglu
Pi = e^logit / (1 + e^logit)

Compute logit(Pi) (or Pi) for all observations and construct the ROC from them. Find the value of the logit or Pi (called P0) that maximizes accuracy = (sensitivity + specificity)/2.

Best accuracy

Logit = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61

                 True comp   True no comp
Predicted yes        55            4
Predicted no         14           72
Total                69           76

Sensitivity = 55/69 = 79.7%
Specificity = 72/76 = 94.7%
Overall accuracy (equal weight) = 0.5(0.797) + 0.5(0.947) = 87.2%

However, it is possible that a model fits the data by the H-L goodness of fit χ² but still has poor sensitivity and specificity. This implies that most subjects are at intermediate risk, rather than at clearly high or low risk.

C statistic - summary accuracy

The C (concordance) statistic is the area under the ROC curve and varies from 0.5 (worst) to 1.0 (best).

Imagine forming all n0 x n1 possible pairs of the n0 obs with Y=0 and the n1 obs with Y=1. If the logit for the pair member with Y=1 is higher than the logit for the pair member with Y=0, call this pair a concordant pair. Otherwise, call it discordant, unless their logits are identical, in which case call it a tie.

C = (num concordant + 0.5 num ties) / (n0 n1)

If Y=1 is disease and Y=0 is non-disease, C is the probability that, in any pair, the diseased subject has a higher predicted probability of disease than the non-diseased subject.

For the diabetes complication model, C = 0.949.
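A direct pairwise sketch of this definition (fine for small n; in practice a library AUC routine would be used):

```python
# C statistic by brute-force pairwise comparison, as defined above.
import numpy as np

def c_statistic(y, score):
    y, score = np.asarray(y), np.asarray(score)
    s1, s0 = score[y == 1], score[y == 0]   # logits (or probabilities)
    diff = s1[:, None] - s0[None, :]        # all n1 x n0 pairs
    concordant = np.sum(diff > 0)
    ties = np.sum(diff == 0)
    return (concordant + 0.5 * ties) / (len(s1) * len(s0))

print(c_statistic([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```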

The logistic model is a discrimination model

The logit score from logistic regression can also be considered a discrimination score. One can plot histograms of the logit score in those who have Y=1 versus those who have Y=0. The best cutpoint or threshold separating these two histograms corresponds to logit(P0), the logit of P0. Examining the histograms of the logits is a generalization of comparing the distribution of any continuous variable X between the Y=1 and Y=0 groups.

[Figure: overlapping histograms of logit(P) (x axis, -4 to 4) versus relative frequency (y axis) for the Y=0 and Y=1 groups.]

Poisson regression

When Y is a non-negative integer (i.e., Y = 0, 1, 2, 3, ...), we cannot model Y or its mean as a linear function of the Xs, since Y has a floor of zero: 0 <= Y < infinity.

Instead we model log(Ŷ) = log(μ) as a linear function of the X variables, so that

mean Y = Ŷ = exp(β0 + β1 X1 + ... + βk Xk)

In this way, mean Y = Ŷ can never be less than zero. This model is the Poisson regression model.

Poisson regression - interpretation of the βs

For Poisson regression, ln(Ŷ) = β0 + β1 X1 + ... + βk Xk, so βi is the rate of change of ln(Ŷ) per unit change in Xi, holding the other Xs constant. Since Ŷ = exp(β0 + β1 X1 + ... + βk Xk),

dŶ/dXi = βi Ŷ, or βi = [dŶ/dXi] / Ŷ

so 100 βi is the percent change in Ŷ per unit change in Xi, holding all else constant. For Xi, exp(βi) is the ratio of the mean Ŷ at Xi to the mean Ŷ at Xi - 1. This is analogous to the odds ratio in logistic regression.
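A minimal Poisson regression sketch using the statsmodels package; the data frame and variable names here are simulated, purely for illustration:

```python
# Minimal Poisson regression on simulated count data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200),
                   "sex": rng.integers(0, 2, 200)})
df["counts"] = rng.poisson(np.exp(-1 + 0.02 * df["age"] + 0.3 * df["sex"]))

fit = smf.poisson("counts ~ age + sex", data=df).fit()
print(fit.params)           # betas on the log(mean Y) scale
print(np.exp(fit.params))   # mean ratios, analogous to odds ratios
```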

Linear vs logistic regression

Method                 Linear regression                    Logistic regression
Outcome (Y)            Continuous / means                   Binary / proportions
Predictors (Xs)        Continuous and discrete*             Continuous and discrete
Overall significance   F test = MS model / MS error         Chi-square test = Reduced - Full test
Model fit stats        R², residual SD                      Mean deviance near 1, fit test, Hosmer-Lemeshow stat
Model variation        Model sum of squares                 Difference log likelihood
Error variation        Residual sum of squares              Full log likelihood = deviance
Total = Model + Error  Total SS = Model SS + Residual SS    Reduced = Difference + Full

* The analysis of variance (ANOVA) model is a special case of linear regression where all the predictors are discrete.

Simplified regression guide

Let XB = b0 + b1 X1 + b2 X2 + ... + bk Xk

outcome                range                     method                link                  error
continuous             -infinity to +infinity    linear regr / ANOVA   y = XB                Gaussian
binary                 0 to 1                    logistic regr         y = 1/(1+exp(-XB))    binomial
non-negative integer   0 to infinity             Poisson regr          y = exp(XB)           Poisson

(If all of the predictors are discrete, do ANOVA; if there is a random person effect, do repeated measures ANOVA.)

If the outcome is continuous but non-linear in the beta parameters, one must do non-linear regression with Gaussian errors.

Analysis of deviance (log likelihood)

Depression model

                        df    -2 log L    p value
Model (G)                3      16.14     < 0.001
Error (deviance = D)   290     252.00
Total (G + D)          293     268.14

Mean deviance = 252/290 = 0.869
Want mean deviance near (not much greater than) 1.0.

Pseudo R² = 16.14/268.14 = 0.060
Cox-Snell R² = 0.0893
Hosmer-Lemeshow chi-square =