STA6938-Logistic Regression Model


Dr. Ying Zhang
STA6938 Logistic Regression Model
Topic 2: Multiple Logistic Regression Model

Outline:
1. Model Fitting
2. Statistical Inference for the Multiple Logistic Regression Model
3. Interpretation of the Fitted Model
   a. Interpretation of Regression Parameters
   b. Issues about Confounding and Interaction
   c. Interpretation of Fitted Values

1. Model Fitting

Observed data: i.i.d. copies of $(Y, X)$: $(Y_i, X_i)$, $i = 1, 2, \ldots, n$, where $Y$ is a dichotomous variable and $X = (X_1, X_2, \ldots, X_p)^T$ is a covariate vector.

Goal: study $\pi(x) = P(Y = 1 \mid x)$.

Logistic regression model:
$$\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}}$$

What is $g(x)$?

a. If all $x_i$, $i = 1, 2, \ldots, p$, are continuous variables:
$$g(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

b. If, for example, $x_j$ is a categorical variable such as race, gender, or treatment method, we need to create dummy variables.

Suppose $X_j$ has $k$ levels: level 1, level 2, ..., level $k$. We create a vector of $k-1$ dummy variables:
$$X_{j,1} = \begin{cases} 1 & \text{level 2} \\ 0 & \text{otherwise} \end{cases}, \quad
  X_{j,2} = \begin{cases} 1 & \text{level 3} \\ 0 & \text{otherwise} \end{cases}, \quad \ldots, \quad
  X_{j,k-1} = \begin{cases} 1 & \text{level } k \\ 0 & \text{otherwise} \end{cases}$$

Then the vector $X^{(j)} = (X_{j,1}, X_{j,2}, \ldots, X_{j,k-1})$ determines the value of the categorical variable $X_j$, and
$$g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \sum_{l=1}^{k-1} \beta_{j,l} x_{j,l} + \cdots + \beta_p x_p$$
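Reference-cell coding of this kind is mechanical; the following small sketch in Python (an illustration only, not part of the original SAS workflow) builds the $k-1$ indicators for a three-level variable such as RACE, with level 1 as the reference:

```python
# Reference-cell (k-1 dummy) coding for a categorical covariate.
# The first listed level is the reference; every other level gets an indicator.
def dummy_code(value, levels):
    """Return k-1 indicators: 1 for the matching non-reference level, else 0."""
    return [1 if value == lev else 0 for lev in levels[1:]]

levels = [1, 2, 3]            # RACE: 1 = White (reference), 2 = Black, 3 = Other
print(dummy_code(1, levels))  # White -> [0, 0]
print(dummy_code(2, levels))  # Black -> [1, 0]
print(dummy_code(3, levels))  # Other -> [0, 1]
```

PROC LOGISTIC builds equivalent design columns automatically from the CLASS statement, so this is only to make the coding explicit.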

Maximum Likelihood Estimation

Log-likelihood function:
$$l(\beta) = \sum_{i=1}^{n} \log f(y_i \mid x_i)
           = \sum_{i=1}^{n} \log\left\{ \pi(x_i)^{y_i} \left(1 - \pi(x_i)\right)^{1-y_i} \right\}
           = \sum_{i=1}^{n} \left[ y_i \log\frac{e^{g(x_i)}}{1 + e^{g(x_i)}} + (1 - y_i)\log\frac{1}{1 + e^{g(x_i)}} \right]$$

MLE of $\beta$, $\hat\beta$: $l(\hat\beta) = \max_{\beta} l(\beta)$.

Numerical algorithm (Newton-Raphson iterative method):
$$\hat\beta^{(k+1)} = \hat\beta^{(k)} - \left[\nabla^2 l(\hat\beta^{(k)})\right]^{-1} \nabla l(\hat\beta^{(k)}), \quad k = 0, 1, \ldots$$

Asymptotic result:
$$\sqrt{n}\,(\hat\beta - \beta) \rightarrow_d N\!\left(0, I^{-1}(\beta)\right)$$

Consistent estimate of the asymptotic covariance matrix: $\hat I^{-1}(\hat\beta)$, where $\hat I(\hat\beta) = X^T V X$.
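The Newton-Raphson update above (with $\nabla l(\beta) = X^T(y - \pi)$ and $-\nabla^2 l(\beta) = X^T V X$) can be sketched directly. The NumPy implementation below is an illustration on synthetic data with made-up coefficients, not the course software; it returns both $\hat\beta$ and the estimated covariance matrix $(X^T V X)^{-1}$:

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=25):
    """Newton-Raphson for the logistic MLE: beta <- beta + (X'VX)^{-1} X'(y - pi)."""
    beta = np.zeros(X.shape[1])
    info = np.eye(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))    # pi(x_i) at current beta
        W = pi * (1.0 - pi)                       # diagonal of V
        score = X.T @ (y - pi)                    # gradient of l(beta)
        info = X.T @ (X * W[:, None])             # information matrix X'VX
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(info)              # MLE and est. cov of beta_hat

# Synthetic data with known (illustrative) coefficients.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([-0.25, 0.5, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat, cov_hat = fit_logistic(X, y)
print(np.round(beta_hat, 2))   # close to (-0.25, 0.5, -1.0)
```

With n = 5000 the estimates land well within sampling error of the generating coefficients, and the diagonal of `cov_hat` gives the squared standard errors reported in software output.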

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \qquad
  V = \begin{pmatrix} \hat\pi_1(1-\hat\pi_1) & & \\ & \ddots & \\ & & \hat\pi_n(1-\hat\pi_n) \end{pmatrix}$$
where
$$\hat\pi_j = \hat P(Y_j = 1 \mid x_j) = \frac{e^{\hat\beta^T x_j}}{1 + e^{\hat\beta^T x_j}}, \quad j = 1, 2, \ldots, n$$

Asymptotic normality for the $j$th parameter: let $e_j = (0, 0, \ldots, 1, \ldots, 0)^T$ with the 1 in position $j+1$. Then
$$\sqrt{n}\,(\hat\beta_j - \beta_j) = \sqrt{n}\, e_j^T (\hat\beta - \beta) \rightarrow_d N\!\left(0,\; e_j^T I^{-1}(\beta)\, e_j\right)$$
where $e_j^T I^{-1}(\beta) e_j$ is the $(j+1)$th diagonal element of $I^{-1}(\beta)$.

2. Statistical Inference for the Multiple Logistic Regression Model

Example 2.1. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

Table: Code Sheet for the Variables in the Low Birth Weight Data Set

Columns   Variable                                                Abbreviation
------------------------------------------------------------------------------
2-4       Identification Code                                     ID
10        Low Birth Weight (0 = Birth Weight >= 2500g,            LOW
          1 = Birth Weight < 2500g)
17-18     Age of the Mother in Years                              AGE
23-25     Weight in Pounds at the Last Menstrual Period           LWT
32        Race (1 = White, 2 = Black, 3 = Other)                  RACE
40        Smoking Status During Pregnancy (1 = Yes, 0 = No)       SMOKE
48        History of Premature Labor (0 = None, 1 = One, etc.)    PTL
55        History of Hypertension (1 = Yes, 0 = No)               HT
61        Presence of Uterine Irritability (1 = Yes, 0 = No)      UI
67        Number of Physician Visits During the First Trimester   FTV
          (0 = None, 1 = One, 2 = Two, etc.)
73-76     Birth Weight in Grams                                   BWT

SAS code for fitting the multiple logistic regression model:

proc logistic data=logistic.lowbwt Descending;
class RACE /PARAM=REFERENCE Descending;
model LOW=AGE LWT RACE FTV;
run;

Some remarks:
- The option Descending in the proc statement tells SAS to model P(Y = 1 | x).
- If there is any categorical variable, one should use the class statement to point it out.
- The option PARAM in the class statement tells SAS how to code the categorical variable. By setting PARAM=REFERENCE, one sets one level as a baseline (reference) level; the regression parameters associated with this variable then represent the comparison of each level with the reference level. By default, the reference level is the largest level in numerical order. To reverse the order, one can simply add the option Descending.

The LOGISTIC Procedure

Model Information
Data Set                   LOGISTIC.LOWBWT
Response Variable          LOW (Low Birth Weight)
Number of Response Levels  2
Number of Observations     189
Model                      binary logit
Optimization Technique     Fisher's scoring

Response Profile
Ordered            Total
Value    LOW       Frequency
1        1         59
2        0         130

Probability modeled is LOW=1.

Class Level Information
                  Design Variables
Class    Value    1    2
RACE     3        1    0
         2        0    1
         1        0    0

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
             Intercept    Intercept and
Criterion    Only         Covariates
AIC          236.672      234.573
SC           239.914      254.023
-2 Log L     234.672      222.573

The LOGISTIC Procedure

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    12.0990       5     0.0335
Score               11.3876       5     0.0442
Wald                10.6964       5     0.0577

Type III Analysis of Effects
                    Wald
Effect    DF        Chi-Square    Pr > ChiSq
AGE       1         0.4988        0.4800
LWT       1         4.7428        0.0294
RACE      2         4.4108        0.1102
FTV       1         0.0869        0.7681

Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   1.2953    1.0714      1.4616        0.2267
AGE         1   -0.0238   0.0337      0.4988        0.4800
LWT         1   -0.0142   0.00654     4.7428        0.0294
RACE 3      1   0.4331    0.3622      1.4296        0.2318
RACE 2      1   1.0039    0.4979      4.0660        0.0438
FTV         1   -0.0493   0.1672      0.0869        0.7681

Odds Ratio Estimates
               Point       95% Wald
Effect         Estimate    Confidence Limits
AGE            0.976       0.914    1.043
LWT            0.986       0.973    0.999
RACE 3 vs 1    1.542       0.758    3.136
RACE 2 vs 1    2.729       1.029    7.240
FTV            0.952       0.686    1.321

The use of PROC GENMOD: the logistic regression model can also be fitted using PROC GENMOD (the generalized linear model procedure in SAS).

SAS code:
proc genmod DATA=logistic.lowbwt DESCENDING;
class RACE;
model LOW=AGE LWT RACE FTV;
run;

SAS report:
The GENMOD Procedure

Model Information
Data Set             LOGISTIC.LOWBWT
Distribution         Binomial
Link Function        Logit
Dependent Variable   LOW (Low Birth Weight)
Observations Used    189

Class Level Information
Class    Levels    Values
RACE     3         1 2 3

Response Profile
Ordered            Total
Value    LOW       Frequency
1        1         59
2        0         130

PROC GENMOD is modeling the probability that LOW='1'.

Criteria For Assessing Goodness Of Fit
Criterion             DF     Value       Value/DF
Deviance              183    222.5729    1.2162
Scaled Deviance       183    222.5729    1.2162
Pearson Chi-Square    183    187.1306    1.0226
Scaled Pearson X2     183    187.1306    1.0226
Log Likelihood               -111.2865

Algorithm converged.

Analysis Of Parameter Estimates
                          Standard    Wald 95% Confidence    Chi-
Parameter  DF  Estimate   Error       Limits                 Square    Pr > ChiSq
Intercept  1   1.7285     1.0057      -0.2426    3.6995      2.95      0.0857
AGE        1   -0.0238    0.0337      -0.0899    0.0423      0.50      0.4800
LWT        1   -0.0142    0.0065      -0.0270    -0.0014     4.74      0.0294
RACE 1     1   -0.4331    0.3622      -1.1431    0.2769      1.43      0.2318
RACE 2     1   0.5708     0.5146      -0.4379    1.5794      1.23      0.2674
RACE 3     0   0.0000     0.0000      0.0000     0.0000      .         .
FTV        1   -0.0493    0.1672      -0.3770    0.2785      0.09      0.7681
Scale      0   1.0000     0.0000      1.0000     1.0000

NOTE: The scale parameter was held fixed.

Note that PROC GENMOD automatically treats RACE=3 as the reference level. To make the results comparable with those reported by PROC LOGISTIC, we need to recode the data.

SAS code:
Data a;
set logistic.lowbwt;
if RACE=1 THEN RACE=5;
if RACE=2 THEN RACE=4;
run;
proc genmod DATA=a descending;
class RACE;
model LOW=AGE LWT RACE FTV /D=B;
run;

SAS report:
The GENMOD Procedure

Model Information
Data Set             WORK.A
Distribution         Binomial
Link Function        Logit
Dependent Variable   LOW (Low Birth Weight)
Observations Used    189

Class Level Information
Class    Levels    Values
RACE     3         3 4 5

Response Profile
Ordered            Total
Value    LOW       Frequency
1        1         59
2        0         130

PROC GENMOD is modeling the probability that LOW='1'.

Criteria For Assessing Goodness Of Fit
Criterion             DF     Value       Value/DF
Deviance              183    222.5729    1.2162
Scaled Deviance       183    222.5729    1.2162
Pearson Chi-Square    183    187.1306    1.0226
Scaled Pearson X2     183    187.1306    1.0226
Log Likelihood               -111.2865

Algorithm converged.

Analysis Of Parameter Estimates
                          Standard    Wald 95% Confidence    Chi-
Parameter  DF  Estimate   Error       Limits                 Square    Pr > ChiSq
Intercept  1   1.2954     1.0714      -0.8046    3.3954      1.46      0.2267
AGE        1   -0.0238    0.0337      -0.0899    0.0423      0.50      0.4800
LWT        1   -0.0142    0.0065      -0.0270    -0.0014     4.74      0.0294
RACE 3     1   0.4331     0.3622      -0.2769    1.1431      1.43      0.2318
RACE 4     1   1.0039     0.4979      0.0281     1.9797      4.07      0.0438
RACE 5     0   0.0000     0.0000      0.0000     0.0000      .         .
FTV        1   -0.0493    0.1672      -0.3770    0.2785      0.09      0.7681
Scale      0   1.0000     0.0000      1.0000     1.0000

NOTE: The scale parameter was held fixed.

Remarks:
- The results from PROC GENMOD are the same as those from PROC LOGISTIC.
- PROC LOGISTIC reports the likelihood ratio test statistic for testing whether the fitted model is better than nothing. Rejection of the null hypothesis means the fitted model is indeed better than nothing.
- PROC GENMOD reports the deviance, which can be used to test whether the fitted model could be improved, e.g. by inserting interactions; that is, it is a goodness-of-fit test of whether the fitted model is adequate given the selected variables. Rejection of the null hypothesis means the fitted model can be improved.

Hypothesized model:
$$g(x) = \beta_0 + \beta_1\,\text{AGE} + \beta_2\,\text{LWT} + \beta_3\,\text{RACE\_2} + \beta_4\,\text{RACE\_3} + \beta_5\,\text{FTV}$$

Estimated equation:
$$\hat g(x) = 1.295 - 0.024\,\text{AGE} - 0.014\,\text{LWT} + 1.004\,\text{RACE\_2} + 0.433\,\text{RACE\_3} - 0.049\,\text{FTV}$$

Hypothesis Testing:

a. Overall significance test. With $\beta = (\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)$:
$$H_0: \beta = 0 \qquad \text{vs.} \qquad H_a: \beta \neq 0$$

Likelihood ratio test:
$$G = -2 \log \frac{L(\text{without the variables})}{L(\text{with the variables})} = 234.672 - 222.573 = 12.099$$
$$P\text{-value} = \Pr(\chi^2_5 > G) = \Pr(\chi^2_5 > 12.099) \approx 0.034 < 0.05$$
Reject the null hypothesis at significance level 0.05.

Wald test: obtain the MLE $\hat\beta$ and the consistent estimate of the covariance matrix $\widehat{\operatorname{var}}(\hat\beta)$; then
$$W = \hat\beta^T \left[\widehat{\operatorname{var}}(\hat\beta)\right]^{-1} \hat\beta \sim \chi^2_5$$
For this example, $W = 10.6964$ and
$$P\text{-value} = \Pr(\chi^2_5 > W) = \Pr(\chi^2_5 > 10.6964) \approx 0.0577 > 0.05$$
No evidence to reject $H_0$ at level 0.05.
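The likelihood ratio computation is just a difference of the two "-2 Log L" values from the SAS output. A small sketch in Python (using the tabled chi-square critical value 11.07 for df = 5 in place of a p-value routine):

```python
# Likelihood-ratio statistic for the overall test, from the "-2 Log L" column:
# intercept-only model vs. model with AGE, LWT, RACE, FTV (5 parameters tested).
neg2logL_null = 234.672   # intercept only
neg2logL_full = 222.573   # intercept and covariates
G = neg2logL_null - neg2logL_full
print(round(G, 3))        # 12.099

# chi-square(5) critical value at alpha = 0.05 is 11.07, so H0 is rejected.
chi5_crit = 11.07
print(G > chi5_crit)      # True
```

The same arithmetic with the Wald statistic 10.6964 would fall below 11.07, matching the disagreement between the two tests discussed next.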

What can we do?
1. In most situations, the results from both tests agree.
2. The likelihood ratio test is usually the more powerful test and is commonly recommended in practice.

b. Univariate tests

After we conclude that the fitted model is significantly better than nothing, we want to know which variables significantly affect the response.
$$H_0: \beta_i = 0 \qquad \text{vs.} \qquad H_a: \beta_i \neq 0 \;(\text{or } \beta_i > 0, \text{ or } \beta_i < 0), \quad i = 1, 2, \ldots, p$$

Wald test:
$$W_i = \frac{\hat\beta_i}{\operatorname{se}(\hat\beta_i)} \rightarrow_d N(0, 1), \qquad \text{equivalently} \qquad W_i^2 = \frac{\hat\beta_i^2}{\widehat{\operatorname{var}}(\hat\beta_i)} \sim \chi^2_1$$

At significance level 0.05, it appears that only LWT and RACE are significant factors for the response.

Note: if our goal is to obtain the best fitting model while minimizing the number of parameters, we should fit a reduced model containing only those variables thought to

be significant, and compare it to the full model containing all the variables.

c. Reduced model vs. full model

The reduced model should be nested in the full model; that is, all the variables appearing in the reduced model must also be present in the full model. For Example 2.1, since only LWT and RACE are significant, we consider the reduced model containing only LWT and RACE.

proc logistic data=logistic.lowbwt Descending;
class RACE /PARAM=REFERENCE Descending;
model LOW=LWT RACE;
run;

Model Fit Statistics
             Intercept    Intercept and
Criterion    Only         Covariates
AIC          236.672      231.259
SC           239.914      244.226
-2 Log L     234.672      223.259

The LOGISTIC Procedure
Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    11.4129       3     0.0097
Score               10.7572       3     0.0131
Wald                10.1360       3     0.0175

Type III Analysis of Effects
                    Wald
Effect    DF        Chi-Square    Pr > ChiSq
LWT       1         5.5886        0.0181
RACE      2         5.4024        0.0671

Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   0.8057    0.8452      0.9088        0.3404
LWT         1   -0.0152   0.00644     5.5886        0.0181
RACE 3      1   0.4806    0.3567      1.8156        0.1778
RACE 2      1   1.0811    0.4881      4.9065        0.0268

$H_0$: the reduced model is good enough compared to the full model, i.e.
$$H_0: \beta_1 = \beta_5 = 0$$

Test statistic:
$$G = 223.259 - 222.573 = 0.686 \sim \chi^2_2$$
$$P\text{-value} = \Pr(\chi^2_2 > G) = \Pr(\chi^2_2 > 0.686) \approx 0.71$$

Not enough evidence to reject the null hypothesis.

3. Interpretation of the Fitted Model

a. Interpretation of Regression Parameters

Study goal: what do the estimated coefficients in the model tell us about the research questions that motivated the study?

Example model:
$$g(x) = \log\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x_1 + \beta_{21} x_{21} + \beta_{22} x_{22}$$
where $x = (x_1, x_2)$: $x_1$ is a continuous variable, and $x_2$ is a categorical variable with three levels, so two dummy variables are introduced. Assume that $x_{21}$ and $x_{22}$ indicate levels 2 and 3 of $x_2$, respectively.

I. Continuous independent variable

$$\log\frac{\pi(x_1 = x_{10} + 1;\, x_2)\,/\,[1 - \pi(x_1 = x_{10} + 1;\, x_2)]}{\pi(x_1 = x_{10};\, x_2)\,/\,[1 - \pi(x_1 = x_{10};\, x_2)]}
  = g(x_1 = x_{10} + 1;\, x_2) - g(x_1 = x_{10};\, x_2)$$
$$= \left[\beta_0 + \beta_1 (x_{10} + 1) + \beta_{21} x_{21} + \beta_{22} x_{22}\right]
  - \left[\beta_0 + \beta_1 x_{10} + \beta_{21} x_{21} + \beta_{22} x_{22}\right] = \beta_1$$

$\beta_1$: the log odds ratio for the variable $x_1$ increased by one unit, given the other variable fixed.

Notation:
$$\operatorname{OR}(x_1 = x_{10} + 1,\, x_1 = x_{10} \mid \text{other}) = e^{\beta_1}$$
The odds of developing the symptom when $x_1 = x_{10} + 1$ is $e^{\beta_1}$ times that when $x_1 = x_{10}$, adjusting for the other variable (keeping the other variable fixed, for similar subjects, etc.).

In general,
$$\operatorname{OR}(x_1 = a,\, x_1 = b \mid \text{other}) = e^{\beta_1 (a - b)}$$
The odds of developing the symptom when $x_1 = a$ is $e^{\beta_1(a-b)}$ times that when $x_1 = b$, adjusting for the other variable. Studying the odds ratio for a one-unit increase in a continuous variable may not be clinically meaningful.

II. Categorical (polychotomous) independent variable

$$\log\frac{\pi(x_2 = \text{2nd level};\, x_1)\,/\,[1 - \pi(x_2 = \text{2nd level};\, x_1)]}{\pi(x_2 = \text{1st level};\, x_1)\,/\,[1 - \pi(x_2 = \text{1st level};\, x_1)]}
  = g(x_2 = \text{2nd level};\, x_1) - g(x_2 = \text{1st level};\, x_1)$$
$$= \left[\beta_0 + \beta_1 x_1 + \beta_{21}\cdot 1 + \beta_{22}\cdot 0\right]
  - \left[\beta_0 + \beta_1 x_1 + \beta_{21}\cdot 0 + \beta_{22}\cdot 0\right] = \beta_{21}$$

$$\operatorname{OR}(\text{2nd level},\, \text{1st level} \mid \text{other}) = e^{\beta_{21}}$$

The odds of developing the symptom is $e^{\beta_{21}}$ times as large when the status of $x_2$ changes from the 1st level to the 2nd level, adjusting for the other variable (keeping the other variable fixed, for similar subjects, etc.).
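The relation $\operatorname{OR}(a, b) = e^{\beta_1(a-b)}$ is one line of arithmetic; a small Python sketch, using the LWT coefficient from the reduced model of Example 2.1 for illustration:

```python
import math

# Odds ratio implied by a logistic coefficient: OR(x1=a vs x1=b) = exp(beta*(a-b)).
def odds_ratio(beta, a, b):
    return math.exp(beta * (a - b))

beta_lwt = -0.0152   # LWT coefficient from the reduced model (per pound)
# The per-pound odds ratio is barely distinguishable from 1 ...
print(round(odds_ratio(beta_lwt, 151, 150), 3))
# ... so report a clinically meaningful contrast, e.g. a 10-pound difference:
print(round(odds_ratio(beta_lwt, 150, 140), 3))   # ~0.859
```

This is why the notes rescale the contrast to 10 pounds before interpreting the LWT effect.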

For the reduced model in Example 2.1:

Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   0.8057    0.8452      0.9088        0.3404
LWT         1   -0.0152   0.00644     5.5886        0.0181
RACE 3      1   0.4806    0.3567      1.8156        0.1778
RACE 2      1   1.0811    0.4881      4.9065        0.0268

- For pregnant women of the same race, the odds ratio of having a small baby is approximately $e^{-0.0152 \times 10} = 0.859$ when a pregnant woman is compared with a counterpart who is 10 pounds lighter at the last menstrual period.
- For pregnant women similar in weight at the last menstrual period, the odds of having a small baby for Black women is approximately three times ($e^{1.0811} = 2.948$) that for White women.
- For pregnant women similar in weight at the last menstrual period, the odds of having a small baby for women of races other than White or Black is approximately 1.6 times ($e^{0.4806} = 1.617$) that for White women.

Confidence Interval Estimation

The MLE gives only a point estimate of the regression parameters. When we interpret the point estimate we actually have 0% confidence: the chance that the true regression parameter exactly equals the estimated value is essentially zero. A confidence interval estimate provides a confidence level for the statement that the true regression parameter falls in a specific interval.

$100(1-\alpha)\%$ CI for $\beta_i$, $i = 1, 2, \ldots, p$:
$$\hat\beta_i \pm z_{\alpha/2}\, \operatorname{se}(\hat\beta_i)$$

For the reduced model in Example 2.1 (estimates and standard errors as in the table above), we are 95% confident that
$$\beta_1 \in \left[\hat\beta_1 - z_{0.025}\operatorname{se}(\hat\beta_1),\; \hat\beta_1 + z_{0.025}\operatorname{se}(\hat\beta_1)\right]
 = \left[-0.0152 - 1.96 \times 0.00644,\; -0.0152 + 1.96 \times 0.00644\right] = [-0.0278, -0.0026]$$

$$\beta_{21} \in \left[1.0811 - 1.96 \times 0.4881,\; 1.0811 + 1.96 \times 0.4881\right] = [0.1244, 2.0378]$$
$$\beta_{22} \in \left[0.4806 - 1.96 \times 0.3567,\; 0.4806 + 1.96 \times 0.3567\right] = [-0.2185, 1.1797]$$

$100(1-\alpha)\%$ CI for the odds ratio. We are 95% confident that
$$\operatorname{OR}(x_1 = x_{10} + 10,\, x_1 = x_{10};\; x_2) \in \left[e^{-0.278}, e^{-0.026}\right] = [0.7573, 0.9743]$$
$$\operatorname{OR}(\text{black}, \text{white};\; x_1) \in \left[e^{0.1244}, e^{2.0378}\right] = [1.1325, 7.6737]$$
$$\operatorname{OR}(\text{other}, \text{white};\; x_1) \in \left[e^{-0.2185}, e^{1.1797}\right] = [0.8037, 3.2534]$$

Interpretation:
- We are 95% confident that the odds ratio of having a small baby for a pregnant woman vs. her counterpart who is 10 pounds lighter at the last menstrual period could be as little as 0.7573 or as large as 0.9743, adjusting for race.
- We are 95% confident that the odds ratio of having a small baby for a Black woman vs. a White woman, similar in weight at the last menstrual period, could be as little as 1.1325 or as large as 7.6737.
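The interval arithmetic above is mechanical; a short Python sketch for the LWT coefficient and the corresponding 10-pound odds-ratio interval:

```python
import math

# 95% Wald interval for a coefficient, then the odds-ratio interval for a
# 10-pound contrast in LWT (reduced model of Example 2.1).
def wald_ci(beta_hat, se, z=1.96):
    return beta_hat - z * se, beta_hat + z * se

lo, hi = wald_ci(-0.0152, 0.00644)
print(round(lo, 4), round(hi, 4))        # interval for beta itself
or_lo, or_hi = math.exp(10 * lo), math.exp(10 * hi)
print(round(or_lo, 3), round(or_hi, 3))  # interval for the 10-pound odds ratio
```

Because $x \mapsto e^{10x}$ is monotone, exponentiating the endpoints of the coefficient interval gives a valid interval for the odds ratio.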

- We are 95% confident that the odds ratio of having a small baby for a woman of a race other than Black or White vs. a White woman, similar in weight at the last menstrual period, could be as little as 0.8037 or as large as 3.2534.

It appears that Black women tend to have a greater risk of having a small baby compared with White women, since the confidence interval excludes 1.

b. Issues about Confounding and Interaction

Epidemiologists use the term confounder to describe a covariate that is associated with both the outcome variable of interest and a primary independent variable or risk factor.

In most epidemiological research, we are primarily interested in the association between the outcome and a potential risk factor. For example, the correlation between the incidence of coronary heart disease (CHD) and smoking is one such research interest. Assume that we are able to follow two study cohorts, one smoking and the other non-smoking, and that we can also control other possible risk factors (the distributions of these factors are the same in the two cohorts). Then we can simply establish a univariate logistic regression model to describe the correlation between CHD and SMOKE, i.e.

$$\log\frac{P(\text{CHD})}{1 - P(\text{CHD})} = \beta_0 + \beta_1\,\text{SMOKE}$$

Here $e^{\beta_1}$ represents the odds ratio of developing CHD for smokers vs. non-smokers.

However, suppose that in practice we are not able to control another risk factor, say AGE. Then the true model is
$$\log\frac{P(\text{CHD})}{1 - P(\text{CHD})} = \beta_0 + \beta_1\,\text{SMOKE} + \beta_2\,\text{AGE}$$

Also consider the situation where the age distribution differs between smokers and non-smokers: smokers tend to be older than non-smokers. If we mistakenly fit the previous model ignoring the AGE factor, then
$$\operatorname{OR}(\text{smoker}, \text{non-smoker}) = e^{\beta_1 + \beta_2\left[E(\text{age} \mid \text{smoking}) - E(\text{age} \mid \text{non-smoking})\right]}$$

So this will incorrectly estimate the effect of smoking; in fact, it will exaggerate the risk of smoking for developing CHD.

For a variable to be considered a confounding variable, it should have two characteristics:
a. The variable is related to the outcome.
b. The variable is also related to the primary risk factor.

(Diagram: the confounder is associated with both the primary risk factor and the outcome.)

Example 2.2. We consider a simulated data set:

Data CHDSIMUL;
do ID=1 to 200 by 1;
  smoke=ranbin(int(time()),1,0.5);
  if smoke=1 then age=50+5*normal(int(time()));
  else age=42+5*normal(int(time()));
  p=exp(-8.0-9.0*smoke+0.15*age+0.20*smoke*age)/(1+exp(-8.0-9.0*smoke+0.15*age+0.20*smoke*age));
  CHD=ranbin(int(time()),1,p);
  output;
end;
run;

Model fitting:

a. Univariate model, containing only SMOKE:

proc logistic data=logistic.chdsimul Descending;
model CHD=smoke;
run;

The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   -1.4110   0.2494      32.0098       <.0001
smoke       1   1.9988    0.3266      37.4613       <.0001

$$\operatorname{OR}(\text{smoker}, \text{non-smoker}) = e^{1.9988} = 7.38$$

b. Obviously, in this study, age is a confounding variable. Multivariate model, containing SMOKE and AGE:

proc logistic data=logistic.chdsimul Descending;
model CHD=smoke age;
run;

The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   -10.9238  1.8534      34.7371       <.0001
smoke       1   0.6625    0.4096      2.6168        0.1057
age         1   0.2186    0.0410      28.4450       <.0001

The coefficient associated with the variable SMOKE is greatly reduced.
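The same confounding pattern can be reproduced outside SAS. The sketch below is an assumption-laden re-creation of Example 2.2: NumPy random draws stand in for RANBIN/NORMAL, a larger sample is used for stability, the simulation coefficients are those of the DATA step above, and a plain Newton-Raphson fitter replaces PROC LOGISTIC. The marginal SMOKE coefficient shrinks sharply once AGE is adjusted for:

```python
import numpy as np

def fit_logistic(X, y, iters=30):
    """Plain Newton-Raphson for the logistic MLE (no step-halving)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = pi * (1.0 - pi)
        beta = beta + np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - pi))
    return beta

rng = np.random.default_rng(1)
n = 2000                                   # larger than the 200 in the notes
smoke = (rng.random(n) < 0.5).astype(float)
age = np.where(smoke == 1, 50.0, 42.0) + 5.0 * rng.normal(size=n)  # smokers older
lin = -8.0 - 9.0 * smoke + 0.15 * age + 0.20 * smoke * age
chd = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)

ones = np.ones(n)
age_c = age - age.mean()                   # centered for stability; SMOKE coef unaffected
b_uni = fit_logistic(np.column_stack([ones, smoke]), chd)
b_adj = fit_logistic(np.column_stack([ones, smoke, age_c]), chd)
print(round(b_uni[1], 2), round(b_adj[1], 2))  # marginal vs. age-adjusted smoke effect
```

The marginal coefficient lands near 2 (odds ratio about 7), while the age-adjusted coefficient is far smaller, just as in the PROC LOGISTIC output above.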

This type of study is called analysis of covariance in epidemiology. We should report the odds ratio adjusting for the confounder AGE:
$$\operatorname{OR}(\text{smoker}, \text{non-smoker} \mid \text{age}) = e^{0.6625} = 1.94$$

Interaction (effect modifier)

Suppose we fit a logistic regression model with two risk factors. If the effect of one risk factor on the outcome depends on the level of the other factor, we say that the two factors interact in affecting the outcome. In any model, interaction is incorporated by including the cross-product term of the two risk factors. For Example 2.2, we fit the model
$$\log\frac{P(\text{CHD})}{1 - P(\text{CHD})} = \beta_0 + \beta_1\,\text{SMOKE} + \beta_2\,\text{AGE} + \beta_3\,\text{SMOKE} \times \text{AGE}$$

Now let us consider the age effect on developing CHD.

For smokers:
$$\log\frac{P(\text{CHD} \mid \text{smoker})}{1 - P(\text{CHD} \mid \text{smoker})} = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\,\text{AGE}$$

For non-smokers:
$$\log\frac{P(\text{CHD} \mid \text{non-smoker})}{1 - P(\text{CHD} \mid \text{non-smoker})} = \beta_0 + \beta_2\,\text{AGE}$$

Apparently, $\beta_3$ modifies the effect of age on CHD. We cannot interpret the age effect on CHD without knowing the smoking status.

c. Fit the interaction model:

proc logistic data=logistic.chdsimul Descending;
model CHD=smoke age smoke*age;
run;

The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   -7.1606   2.1515      11.0769       0.0009
smoke       1   -9.3612   4.2930      4.7548        0.0292
age         1   0.1338    0.0486      7.5808        0.0059
smoke*age   1   0.2120    0.0897      5.5864        0.0181

$$\operatorname{OR}(\text{age} = a + 10,\, \text{age} = a \mid \text{non-smoker}) = e^{0.1338 \times 10} = 3.81$$
$$\operatorname{OR}(\text{age} = a + 10,\, \text{age} = a \mid \text{smoker}) = e^{(0.1338 + 0.2120) \times 10} \approx 31.8$$

Do you get the idea of an effect modifier?
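The two odds ratios can be recomputed directly from the fitted coefficients; the slope in AGE is $\beta_2$ for non-smokers and $\beta_2 + \beta_3$ for smokers:

```python
import math

# With an interaction term, the age effect depends on smoking status.
beta2 = 0.1338    # AGE coefficient from the fitted interaction model
beta3 = 0.2120    # SMOKE*AGE coefficient

def age_or(delta, smoker):
    """Odds ratio for an age increase of `delta` years, by smoking status."""
    slope = beta2 + (beta3 if smoker else 0.0)
    return math.exp(slope * delta)

print(round(age_or(10, smoker=False), 2))  # ~3.8 for non-smokers
print(round(age_or(10, smoker=True), 1))   # ~31.8 for smokers
```

The same 10-year age increase multiplies the odds very differently in the two groups, which is exactly what "effect modification" means.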

Model Fitting Principles in Biomedical Studies:
- The inclusion of confounder(s) is very important; otherwise the relationship between the risk factor and the outcome may be misinterpreted.
- When the regression coefficient associated with the risk factor changes a lot after a covariate is dropped, that covariate should be viewed as a potential confounder, even if it is not statistically significant in a hypothesis test. In biomedical studies, SEX and AGE are normally treated as confounders.
- When an interaction effect is included, the interpretation of the risk factor effect becomes complicated. Unless the interaction is statistically significant, it is normally advised not to put the interaction in the model.
- When a covariate is an effect modifier, its status as a confounder is of secondary importance, since the estimate of the effect of the risk factor depends on the specific value of the covariate.

c. Interpretation of Fitted Values

$100(1-\alpha)\%$ PI (prediction interval) for the odds. Given a set of covariates, we can predict the log odds by plugging the covariates into the prediction equation, i.e. obtaining the point prediction
$$\hat g(x) = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$$

To obtain the confidence interval, we need the distribution of $\hat g(x)$. Note that $\hat g(x) = x^T \hat\beta$ and $\hat\beta \approx N(\beta, \operatorname{var}(\hat\beta))$, so
$$\hat g(x) \approx N\!\left(x^T \beta,\; \operatorname{var}(g(x))\right), \qquad \operatorname{var}(g(x)) = x^T \operatorname{var}(\hat\beta)\, x$$

$$\widehat{\operatorname{var}}(\hat g(x)) = x^T \widehat{\operatorname{var}}(\hat\beta)\, x
  = \sum_{j=0}^{p} x_j^2\, \widehat{\operatorname{var}}(\hat\beta_j)
  + 2 \sum_{j=0}^{p} \sum_{k=j+1}^{p} x_j x_k\, \widehat{\operatorname{cov}}(\hat\beta_j, \hat\beta_k)$$

SAS provides the estimate of the covariance matrix of the regression parameter estimates when the option COVB is inserted in the model statement.
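The quadratic form $x^T \widehat{\operatorname{var}}(\hat\beta)\, x$ is a one-liner in NumPy. The sketch below uses a made-up, illustrative covariance matrix and coefficient vector (loosely shaped like, but not equal to, the PROC LOGISTIC output) to show the full pipeline from $\hat g(x)$ to an interval for $\pi(x)$:

```python
import numpy as np

# Variance of the fitted logit via the quadratic form x' Cov(beta_hat) x,
# then a 95% interval for g(x) and for pi(x). The matrix and coefficients
# below are hypothetical illustrations, not the values printed by SAS.
cov = np.array([[ 0.70,   -0.005,   -0.10  ],
                [-0.005,   0.00004,  0.0004],
                [-0.10,    0.0004,   0.13  ]])
beta = np.array([0.81, -0.015, 0.48])   # intercept, LWT, RACE3 (illustrative)
x = np.array([1.0, 150.0, 0.0])         # LWT = 150 pounds, reference race

g = x @ beta                            # fitted log odds
var_g = x @ cov @ x                     # x' Cov x
lo, hi = g - 1.96 * np.sqrt(var_g), g + 1.96 * np.sqrt(var_g)
pi_lo, pi_hi = 1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi))
print(round(g, 3), round(var_g, 4))
print(round(pi_lo, 3), round(pi_hi, 3))
```

Because the logit is monotone, transforming the interval endpoints through $e^g/(1+e^g)$ gives a valid interval for the probability itself, exactly as done for the two cases worked out next.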

For the reduced model in Example 2.1:

proc logistic data=logistic.lowbwt Descending;
class RACE /PARAM=REFERENCE Descending;
model LOW=LWT RACE /covb;
run;

The estimate of the covariance matrix:

The LOGISTIC Procedure
Estimated Covariance Matrix
Variable     Intercept    LWT         RACE3       RACE2
Intercept    0.71436     -0.00524    -0.1035     0.022602
LWT         -0.00524      0.0000415   0.000356   -0.00065
RACE3       -0.1035       0.000356    0.12726    0.10532
RACE2        0.022602    -0.00065     0.10532    0.23824

Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   0.8057    0.8452      0.9088        0.3404
LWT         1   -0.0152   0.00644     5.5886        0.0181
RACE 3      1   0.4806    0.3567      1.8156        0.1778
RACE 2      1   1.0811    0.4881      4.9065        0.0268

Let us consider two particular cases.

i. LWT = 150, RACE = White:
$$\hat g(150 \text{ pounds, White}) = 0.806 - 0.015 \times 150 + 1.081 \times 0 + 0.481 \times 0 = -1.444$$

$$\widehat{\operatorname{var}}\!\left(\hat g(150 \text{ pounds, White})\right)
  = \widehat{\operatorname{var}}(\hat\beta_0) + 150^2\, \widehat{\operatorname{var}}(\hat\beta_1) + 2 \times 150 \times \widehat{\operatorname{cov}}(\hat\beta_0, \hat\beta_1) = 0.0768$$
(all terms involving the RACE dummies vanish because both dummies are 0).

We are 95% confident that the odds for this woman fall in
$$e^{-1.444 \pm 1.96\sqrt{0.0768}} = \left[e^{-1.988}, e^{-0.901}\right] = [0.137, 0.406]$$

We are 95% confident that
$$\pi(150 \text{ pounds, White}) \in \left[\frac{0.137}{1 + 0.137},\; \frac{0.406}{1 + 0.406}\right] = [0.121, 0.289]$$
i.e. we are 95% confident that the chance of having a small baby for this woman could be as little as 12.1% or as large as 28.9%.

ii. LWT = 120, RACE = Asian (other):
$$\hat g(120 \text{ pounds, Asian}) = 0.806 - 0.015 \times 120 + 1.081 \times 0 + 0.481 \times 1 = -0.513$$

$$\widehat{\operatorname{var}}\!\left(\hat g(120 \text{ pounds, Asian})\right)
  = \widehat{\operatorname{var}}(\hat\beta_0) + 120^2\, \widehat{\operatorname{var}}(\hat\beta_1) + \widehat{\operatorname{var}}(\hat\beta_{22})
  + 2 \times 120 \times \widehat{\operatorname{cov}}(\hat\beta_0, \hat\beta_1)
  + 2\, \widehat{\operatorname{cov}}(\hat\beta_0, \hat\beta_{22})
  + 2 \times 120 \times \widehat{\operatorname{cov}}(\hat\beta_1, \hat\beta_{22}) = -0.0446$$

What is wrong with that? Isn't $\widehat{\operatorname{var}}(\hat\beta)$ a positive definite matrix?

Note that the eigenvalues of $\widehat{\operatorname{var}}(\hat\beta)$ are
$$\lambda = (0.732,\ 0.259,\ 0.088,\ 0.000006)$$
The matrix is positive definite, but its smallest eigenvalue is of the same order as the rounding in the printed covariance entries, so computing the quadratic form with the rounded entries can yield a negative "variance." It is purely a numerical problem.

How to fix it? Change the scale of the variable LWT by dividing LWT by 100:

data a;
set logistic.lowbwt;
lwt=lwt/100;
run;
proc logistic data=a Descending;
class RACE /PARAM=REFERENCE Descending;
model LOW=LWT RACE /covb;
run;

The LOGISTIC Procedure
Estimated Covariance Matrix
Variable     Intercept    LWT         RACE3       RACE2
Intercept    0.71436     -0.5236     -0.1035     0.022602
LWT         -0.5236       0.41465     0.035585   -0.0647
RACE3       -0.1035       0.035585    0.12726    0.10532
RACE2        0.022602    -0.0647     0.10532     0.23824

Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square    Pr > ChiSq
Intercept   1   0.8057    0.8452      0.9088        0.3404
LWT         1   -1.5223   0.6439      5.5886        0.0181
RACE 3      1   0.4806    0.3567      1.8156        0.1778
RACE 2      1   1.0811    0.4881      4.9065        0.0268

Note that under the current scale, the eigenvalues of $\widehat{\operatorname{var}}(\hat\beta)$ are
$$\lambda = (1.12,\ 0.269,\ 0.095,\ 0.009)$$
which are far better conditioned.

$$\hat g(120 \text{ pounds, Asian}) = 0.8057 - 1.5223 \times 1.2 + 1.0811 \times 0 + 0.4806 \times 1 = -0.5405$$

$$\widehat{\operatorname{var}}\!\left(\hat g(120 \text{ pounds, Asian})\right)
  = \widehat{\operatorname{var}}(\hat\beta_0) + 1.2^2\, \widehat{\operatorname{var}}(\hat\beta_1) + \widehat{\operatorname{var}}(\hat\beta_{22})
  + 2 \times 1.2 \times \widehat{\operatorname{cov}}(\hat\beta_0, \hat\beta_1)
  + 2\, \widehat{\operatorname{cov}}(\hat\beta_0, \hat\beta_{22})
  + 2 \times 1.2 \times \widehat{\operatorname{cov}}(\hat\beta_1, \hat\beta_{22}) = 0.0657$$

We are 95% confident that the odds for this woman fall in
$$e^{-0.5405 \pm 1.96\sqrt{0.0657}} = \left[e^{-1.0429}, e^{-0.0381}\right] = [0.352, 0.963]$$

We are 95% confident that
$$\pi(120 \text{ pounds, Asian}) \in \left[\frac{0.352}{1 + 0.352},\; \frac{0.963}{1 + 0.963}\right] = [0.260, 0.491]$$
i.e. we are 95% confident that the chance of having a small baby for this woman could be as little as 26.0% or as large as 49.1%.