
Logistic Regression and the new: Residual Logistic Regression

Outline

1. Logistic Regression
2. Confounding Variables
3. Controlling for Confounding Variables
4. Residual Linear Regression
5. Residual Logistic Regression
6. Examples
7. Discussion
8. Future Work

1. Logistic Regression Model

In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable:

\ln(\text{odds of } Y=1 \mid x) = \ln \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = \ln \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \ln \frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x

A popular model for a categorical response variable

The logistic regression model is the most popular model for binary data. It is generally used to study the relationship between a binary response variable and a group of predictors (which can be either continuous or categorical): Y = 1 (true, success, yes, etc.) or Y = 0 (false, failure, no, etc.). The logistic regression model can be extended to model a categorical response variable with more than two categories; the resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the binomial logistic regression for a binary response variable).

More on the rationale of the logistic regression model

Consider a binary response variable Y = 0 or 1 and a single predictor variable x. We want to model E(Y \mid x) = P(Y=1 \mid x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1 \mid x) as a linear function of the predictor:

\ln \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \beta_0 + \beta_1 x

This model can be rewritten as

P(Y=1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}

Note that E(Y \mid x) = P(Y=1 \mid x) \cdot 1 + P(Y=0 \mid x) \cdot 0 = P(Y=1 \mid x) is bounded between 0 and 1 for all values of x. The linear model P(Y=1 \mid x) = \beta_0 + \beta_1 x may sometimes violate this condition.
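As a quick numerical illustration of this boundedness, a short pure-Python check (the coefficients b0 and b1 are invented for illustration, not taken from any example in this talk):

```python
import math

def logistic(eta):
    """Inverse logit: maps any real number to a probability in (0, 1)."""
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -1.0, 2.0

# Unlike the straight-line model b0 + b1*x, the logistic transform
# stays strictly between 0 and 1 even for extreme x
probs = [logistic(b0 + b1 * x) for x in (-10.0, -1.0, 0.0, 1.0, 10.0)]
assert all(0.0 < p < 1.0 for p in probs)
assert probs == sorted(probs)  # and it is monotone in x when b1 > 0
```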

More on the properties of the logistic regression model

In simple logistic regression, the regression coefficient \beta_1 has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x:

\ln \frac{P(Y=1 \mid x+1)}{P(Y=0 \mid x+1)} - \ln \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = [\beta_0 + \beta_1 (x+1)] - [\beta_0 + \beta_1 x] = \beta_1

For multiple predictor variables, the logistic regression model is

\ln \frac{P(Y=1 \mid x_1, x_2, \ldots, x_k)}{P(Y=0 \mid x_1, x_2, \ldots, x_k)} = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k
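A numerical sanity check of this odds-ratio interpretation in pure Python (again with made-up coefficients): the ratio of the odds at x + 1 to the odds at x equals exp(\beta_1) at every x.

```python
import math

# Hypothetical coefficients for illustration only
b0, b1 = -2.0, 0.7

def p_success(x):
    """P(Y=1 | x) under the simple logistic regression model."""
    return math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))

def odds(x):
    """Odds of success at x: P(Y=1 | x) / P(Y=0 | x)."""
    return p_success(x) / (1 - p_success(x))

# The odds ratio for a one-unit increase in x is exp(b1), regardless of x
for x in (0.0, 1.5, 3.0):
    assert math.isclose(odds(x + 1) / odds(x), math.exp(b1))
```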

Logistic Regression, SAS Procedure
http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm

Proc Logistic. This page shows an example of logistic regression with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading, and social studies. The response variable is a high writing test score (honcomp), where a writing score greater than or equal to 60 is considered high, and less than 60 is considered low; we explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used on that page can be downloaded from http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.

data logit;
  set "c:\temp\hsb2";
  honcomp = (write >= 60);
run;

proc logistic data=logit descending;
  model honcomp = female read science;
run;

Logistic Regression, SAS Output

[The PROC LOGISTIC output shown on this slide is not reproduced in the transcription.]

2. Confounding Variables

- Correlated with both the dependent and independent variables
- Represent a major threat to the validity of inferences on cause and effect
- Add to multicollinearity
- Can lead to over- or underestimation of an effect, and can even change the direction of the conclusion
- Add error to the interpretation of what may be an accurate measurement

For a variable to be a confounder it needs to have:

- A relationship with the exposure
- A relationship with the outcome even in the absence of the exposure (not an intermediary)
- No place on the causal pathway
- An uneven distribution in the comparison groups

[Diagram: a third variable associated with both the exposure and the outcome]

Confounding: Birth order and Down Syndrome, with maternal age as the third variable. Maternal age is correlated with birth order and is a risk factor for Down Syndrome, even if birth order is low.

No confounding: Alcohol and lung cancer, with smoking as the third variable. Smoking is correlated with alcohol consumption and is a risk factor for lung cancer even for persons who don't drink alcohol.

3. Controlling for Confounding

In study designs:

- Restriction of variables
- Random allocation of subjects to study groups to attempt to even out unknown confounders
- Matching subjects using potential confounders

In data analysis:

- Stratified analysis using the Mantel-Haenszel method to adjust for confounders (case-control studies, cohort studies)
- Restriction (still possible, but it means throwing data away)
- Model fitting using regression techniques
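To make the Mantel-Haenszel adjustment above concrete, here is a minimal pure-Python sketch; the 2x2 stratum counts are invented for illustration.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio across 2x2 strata.

    Each stratum is (a, b, c, d): exposed cases, exposed controls,
    unexposed cases, unexposed controls.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Two invented strata (e.g. two levels of a confounder)
strata = [(10, 20, 5, 40), (30, 15, 20, 25)]
or_mh = mantel_haenszel_or(strata)  # pooled odds ratio adjusted for the strata
```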

Pros and Cons of Controlling Methods

- Matching methods call for subjects with exactly the same characteristics, with a risk of over- or under-matching
- Cohort studies can lead to too much loss of information when excluding subjects
- Some strata might become too thin and thus insignificant, also creating loss of information
- Regression methods, if well handled, can control for confounding factors

4. Residual Linear Regression

Consider a dependent variable Y and a set of n independent covariates, of which the first k (k < n) are potential confounding factors. The initial model treats only the confounding variables, as follows:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \varepsilon

Residuals are calculated from this model; let

\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_k X_k

The residuals e_j = Y_j - \hat{Y}_j have the following properties:

- Zero mean
- Homoscedasticity
- Normally distributed
- \mathrm{Corr}(e_i, e_j) = 0 for i \neq j

This residual will be considered the new dependent variable. That is, the new model to be fitted is

Y - \hat{Y} = \alpha_0 + \beta_{k+1} X_{k+1} + \beta_{k+2} X_{k+2} + \ldots + \beta_n X_n + \varepsilon

which is equivalent to:

Y = \alpha_0 + \hat{Y} + \beta_{k+1} X_{k+1} + \beta_{k+2} X_{k+2} + \ldots + \beta_n X_n + \varepsilon
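A minimal pure-Python sketch of this two-stage procedure on synthetic data, assuming one confounder X1 and one variable of interest X2 (all names and coefficient values here are invented for illustration):

```python
import random

def ols(x, y):
    """Simple least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

random.seed(0)
# Synthetic data: x1 plays the confounder role, x2 the variable of interest
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [random.gauss(0, 1) for _ in range(200)]
y = [2.0 + 1.5 * a + 0.8 * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

# Stage 1: regress Y on the confounder only, and keep the residuals
a0, b1 = ols(x1, y)
resid = [yi - (a0 + b1 * xi) for xi, yi in zip(x1, y)]

# Stage 2: the residual Y - Y_hat becomes the new dependent variable
_, b2 = ols(x2, resid)
# b2 estimates the effect of x2 (about 0.8 here, since x1 and x2 are independent)
```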

The Usual Logistic Regression Approach to Control for Confounders

Consider a binary outcome Y and n covariates, where the first k (k < n) are potential confounding factors. The usual way to control for these confounding variables is to simply put all n variables in the same model:

\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \ldots + \beta_n X_n

5. Residual Logistic Regression

Each subject has a binary outcome Y. Consider n covariates, where the first k (k < n) are potential confounding factors. The initial model, with \pi as the probability of success, analyzes only the confounding effect:

\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k

Method 1

The confounding variables' effect is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach. That is, let

T = \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_k X_k

The new model to be fitted is

\log \frac{\pi}{1-\pi} = \beta_0 + T + \beta_{k+1} X_{k+1} + \beta_{k+2} X_{k+2} + \ldots + \beta_n X_n
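A pure-Python sketch of Method 1's first stage on synthetic data: fit the confounder-only logistic model (here via a hand-rolled Newton-Raphson for a single confounder X1; the fitting routine and all numbers are illustrative, not taken from the examples in this talk), then form the fixed term T from the fitted coefficient:

```python
import math
import random

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of log(p/(1-p)) = b0 + b1*x (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Score vector and 2x2 Fisher information
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

random.seed(1)
# Synthetic subjects: confounder x1, true model logit(p) = -0.5 + 1.2*x1
x1 = [random.gauss(0, 1) for _ in range(2000)]
y = [1 if random.random() < 1 / (1 + math.exp(0.5 - 1.2 * xi)) else 0
     for xi in x1]

b0_hat, b1_hat = fit_logistic(x1, y)
# Method 1 retains the confounder effect as a fixed term T for each subject;
# T then enters the second-level model alongside the variables of interest
T = [b1_hat * xi for xi in x1]
```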

Method 2

Pearson residuals are calculated from the initial model (Hosmer and Lemeshow, 1989):

Z = \frac{Y - \hat{\pi}}{\sqrt{\hat{\pi}(1 - \hat{\pi})}}

where \hat{\pi} \in [0, 1] is the estimated probability of success based on the confounding variables alone:

\hat{\pi} = \frac{e^{\hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i}}{1 + e^{\hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i}}

The second-level regression will use this residual as the new dependent variable.

Therefore the new dependent variable is Z, and because it is no longer dichotomous we can apply a multiple linear regression model to analyze the effect of the rest of the covariates. The new model to be fitted is a linear regression model:

Z = \beta_0 + \beta_{k+1} X_{k+1} + \beta_{k+2} X_{k+2} + \ldots + \beta_n X_n + \varepsilon
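A small pure-Python sketch of computing Z from the confounder-only fit (the coefficient values are invented; with \hat{\beta}_0 = -1.0, \hat{\beta}_1 = 0.5 and X_1 = 2 the linear predictor is 0, so \hat{\pi} = 0.5):

```python
import math

def pi_hat(x, betas):
    """Estimated success probability from the confounder-only logistic fit.

    betas = (b0, b1, ..., bk) are the fitted coefficients; x = (x1, ..., xk).
    """
    eta = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return math.exp(eta) / (1 + math.exp(eta))

def pearson_residual(y, p):
    """Pearson residual Z = (Y - pi_hat) / sqrt(pi_hat * (1 - pi_hat))."""
    return (y - p) / math.sqrt(p * (1 - p))

# Invented fit: one confounder, b0 = -1.0, b1 = 0.5, subject with X1 = 2
p = pi_hat((2.0,), (-1.0, 0.5))     # eta = 0, so p = 0.5
z_success = pearson_residual(1, p)  # positive residual for an observed success
z_failure = pearson_residual(0, p)  # negative residual for an observed failure
```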

6. Example 1

Data: Low Birth Weight

- Low: Indicator of birth weight less than 2.5 kg
- Age: Mother's age in years
- Lwt: Mother's weight in pounds
- Smk: Smoking status during pregnancy
- Ht: History of hypertension

Correlation matrix (alpha = 0.05):

        Age      Lwt      Smk      Ht
Age     1.0000   0.1738  -0.0444  -0.0158
Lwt              1.0000  -0.0408   0.2369
Smk                       1.0000   0.0134
Ht                                 1.0000

Potential confounding factor: Age

Model for \pi (probability of low birth weight):

Logistic regression:
\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 \, age + \beta_2 \, lwt + \beta_3 \, smk + \beta_4 \, ht

Residual logistic regression initial model:
\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 \, age

Method 1, with T = \hat{\beta}_1 \, age:
\log \frac{\pi}{1-\pi} = \beta_0 + T + \beta_2 \, lwt + \beta_3 \, smk + \beta_4 \, ht

Method 2:
Z = \beta_0 + \beta_2 \, lwt + \beta_3 \, smk + \beta_4 \, ht + \varepsilon

Results

                Logistic Regression           RLR Method 1
Variable  Odds ratio  P-value  SE        Odds ratio  P-value  SE
lwt       0.988       0.060    0.0064    0.989       0.078    0.0065
smk       3.480       0.001    0.3576    3.455       0.001    0.3687
ht        3.395       0.053    0.6322    3.317       0.059    0.6342

RLR Method 2
Variable  P-value  SE
lwt       0.077    0.0024
smk       0.000    0.1534
ht        0.042    0.3094

Confounding factor P-values
Variable  Logistic regression  Initial model
Age       0.055                0.027

Example 2

Data: Alzheimer patients

- Decline: Whether the subject's cognitive capabilities deteriorate or not
- Age: Subject's age
- Gender: Subject's gender
- MMS: Mini Mental Score
- PDS: Psychometric deterioration scale
- HDT: Depression scale

Correlation matrix (alpha = 0.05):

         Age      Gender   MMS      PDS      HDT
Age      1.0000   0.0413  -0.2120   0.3327   0.9679
Gender            1.0000  -0.1074   0.2020  -0.1839
MMS                        1.0000   0.3784  -0.1839
PDS                                 1.0000   0.0110
HDT                                          1.0000

Potential confounding factors: Age, Gender

Model for \pi (probability of declining):

Logistic regression:
\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 \, age + \beta_2 \, gender + \beta_3 \, mms + \beta_4 \, pds + \beta_5 \, hdt

Residual logistic regression initial model:
\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 \, age + \beta_2 \, gender

Method 1, with T = \hat{\beta}_1 \, age + \hat{\beta}_2 \, gender:
\log \frac{\pi}{1-\pi} = \beta_0 + T + \beta_3 \, mms + \beta_4 \, pds + \beta_5 \, hdt

Method 2:
Z = \beta_0 + \beta_3 \, mms + \beta_4 \, pds + \beta_5 \, hdt + \varepsilon

Results

                Logistic Regression           RLR Method 1
Variable  Odds ratio  P-value  SE        Odds ratio  P-value  SE
mms       0.717       0.023    0.1451    0.720       0.023    0.1443
pds       1.691       0.001    0.1629    1.674       0.001    0.1565
hdt       1.018       0.643    0.0380    1.018       0.644    0.0377

RLR Method 2
Variable  P-value  SE
mms       <0.001   0.0915
pds       <0.001   0.0935
hdt       0.061    0.0273

Confounding factor P-values
Variable  Logistic regression  Initial model
Age       0.004                0.000
Gender    0.935                0.551

7. Discussion

- The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
- Method 1 is designed to control for confounding factors; however, from the given examples we can see that Method 1 yields results similar to the usual logistic regression approach.
- Method 2 appears to be more accurate, with some standard errors significantly reduced and thus significantly smaller p-values for some regressors. However, it does not yield odds ratios, as Method 1 can.

8. Future Work

We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results. We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variable(s) will be taken into account.

Selected References

- Menard, S. Applied Logistic Regression Analysis. Series: Quantitative Applications in the Social Sciences. Sage University Series.
- Lemeshow, S.; Teres, D.; Avrunin, J. S.; Pastides, H. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83, 348-356.
- Hosmer, D. W.; Jovanovic, B.; Lemeshow, S. Best Subsets Logistic Regression. Biometrics 45, 1265-1270. 1989.
- Pregibon, D. Logistic Regression Diagnostics. The Annals of Statistics 9(4), 705-724. 1981.