STA6938-Logistic Regression Model


Dr. Ying Zhang
Topic 2: Multiple Logistic Regression Model

Outline:
1. Model Fitting
2. Statistical Inference for Multiple Logistic Regression Model
3. Interpretation of Fitted Model
   a. Interpretation of Regression Parameters
   b. Issues about Confounding and Interaction
   c. Interpretation of Fitted Value

1. Model Fitting

Observed data: i.i.d. copies of $(Y, X)$: $(Y_i, X_i)$, $i = 1, 2, \ldots, n$
  $Y$: a dichotomous variable
  $X = (X_1, X_2, \ldots, X_p)^T$: a covariate vector

Goal: study $\pi(x) = P(Y = 1 \mid x)$

Logistic regression model:
$$\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}}$$

What is $g(x)$?
a. If all $x_i$, $i = 1, 2, \ldots, p$, are continuous variables,
$$g(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
b. If, for example, $x_j$ is a categorical variable such as race, gender, or treatment method, we need to create dummy variables.

Suppose $X_j$ takes $k$ levels: level 1, level 2, ..., level $k$. We create a vector of dummy variables with $k-1$ components:
  $X_{j,1} = 1$ if level 2, 0 otherwise
  $X_{j,2} = 1$ if level 3, 0 otherwise
  ...
  $X_{j,k-1} = 1$ if level $k$, 0 otherwise

Then the vector $X^{(j)} = (X_{j,1}, X_{j,2}, \ldots, X_{j,k-1})$ determines the value of the categorical variable $X_j$, and
$$g(x) = \beta_0 + \beta_1 x_1 + \cdots + \sum_{l=1}^{k-1} \beta_{j,l}\, x_{j,l} + \cdots + \beta_p x_p$$
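The dummy coding above can be sketched in a few lines of Python. This is only an illustration: the function name `reference_dummies` and the choice of level 1 as the reference are assumptions made here, not part of the course material.

```python
def reference_dummies(value, levels, reference):
    """Return the k-1 indicator variables for `value`.

    One indicator per non-reference level, in the order given by `levels`.
    The reference level is coded as all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels if level != reference]


# Hypothetical three-level variable coded 1, 2, 3 with level 1 as the reference.
race_levels = [1, 2, 3]
for race in race_levels:
    print(race, reference_dummies(race, race_levels, reference=1))
```

With three levels this yields two indicators, matching the $k-1$ components described above.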

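Maximum Likelihood Estimation

Log likelihood function:
$$l(\beta) = \sum_{i=1}^{n} \log f(y_i \mid x_i) = \sum_{i=1}^{n} \log\left\{ \pi(x_i)^{y_i} \left(1 - \pi(x_i)\right)^{1 - y_i} \right\} = \sum_{i=1}^{n} \left[ y_i \log\frac{e^{g(x_i)}}{1 + e^{g(x_i)}} + (1 - y_i) \log\frac{1}{1 + e^{g(x_i)}} \right]$$

MLE of $\beta$, $\hat\beta$: $l(\hat\beta) = \max_{\beta} l(\beta)$

Numerical algorithm (Newton-Raphson iterative method):
$$\hat\beta^{(k+1)} = \hat\beta^{(k)} - \left[\nabla^2 l(\hat\beta^{(k)})\right]^{-1} \nabla l(\hat\beta^{(k)}), \quad k = 0, 1, \ldots$$

Asymptotic result:
$$\sqrt{n}\,(\hat\beta - \beta) \to_d N\left(0, I^{-1}(\beta)\right)$$

Consistent estimate of the asymptotic covariance matrix: $\hat I^{-1}(\hat\beta)$, where
$$\hat I(\hat\beta) = X^T V X$$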

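$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \qquad V = \mathrm{diag}\left( \hat\pi_1(1 - \hat\pi_1),\ \hat\pi_2(1 - \hat\pi_2),\ \ldots,\ \hat\pi_n(1 - \hat\pi_n) \right)$$

$$\hat\pi_j = \hat P(Y_j = 1 \mid x_j) = \frac{e^{\hat\beta^T x_j}}{1 + e^{\hat\beta^T x_j}}, \quad j = 1, 2, \ldots, n$$

Asymptotic normality for the jth parameter: let $e_j = (0, 0, \ldots, 1, 0, \ldots, 0)^T$, with the 1 in position $j+1$. Then
$$\sqrt{n}\,(\hat\beta_j - \beta_j) = e_j^T \sqrt{n}\,(\hat\beta - \beta) \to_d N\left(0,\ e_j^T I^{-1}(\beta)\, e_j\right)$$
where $e_j^T I^{-1}(\beta)\, e_j$ is the $(j+1)$th diagonal element of $I^{-1}(\beta)$.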
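The Newton-Raphson iteration and the covariance estimate $(X^T V X)^{-1}$ can be sketched for the simplest case of an intercept plus one covariate. This is a minimal illustration in Python, not the algorithm SAS uses internally; the function name `fit_logistic_2param` and the toy data are assumptions made here.

```python
import math


def fit_logistic_2param(xs, ys, tol=1e-10, max_iter=50):
    """Fit logit P(Y=1|x) = b0 + b1*x by Newton-Raphson.

    Returns ((b0, b1), cov), where cov is the inverse of X^T V X
    evaluated at the last iterate before the final (negligible) step.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0             # score vector (gradient of the log likelihood)
        i00 = i01 = i11 = 0.0     # entries of X^T V X (minus the Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            v = p * (1.0 - p)
            g0 += y - p
            g1 += x * (y - p)
            i00 += v
            i01 += x * v
            i11 += x * x * v
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + (X^T V X)^{-1} * score
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (-i01 * g0 + i00 * g1) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    cov = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
    return (b0, b1), cov


# Toy data (made up for illustration): 2 of 10 events when x = 0, 6 of 10 when x = 1.
xs = [0] * 10 + [1] * 10
ys = [1] * 2 + [0] * 8 + [1] * 6 + [0] * 4
(b0, b1), cov = fit_logistic_2param(xs, ys)
print(b0, b1, math.sqrt(cov[1][1]))
```

For a single binary covariate the MLE has a closed form (the log odds in the baseline group and the log odds ratio from the 2x2 table), which gives a convenient check on the iteration.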

2. Statistical Inference for Multiple Logistic Regression Model

Example 2.1. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

Table: Code Sheet for the Variables in the Low Birth Weight Data Set

Columns   Variable                                                                    Abbreviation
          Identification Code                                                         ID
10        Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g)      LOW
17-18     Age of the Mother in Years                                                  AGE
          Weight in Pounds at the Last Menstrual Period                               LWT
32        Race (1 = White, 2 = Black, 3 = Other)                                      RACE
40        Smoking Status During Pregnancy (1 = Yes, 0 = No)                           SMOKE
48        History of Premature Labor (0 = None, 1 = One, etc.)                        PTL
55        History of Hypertension (1 = Yes, 0 = No)                                   HT
61        Presence of Uterine Irritability (1 = Yes, 0 = No)                          UI
67        Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.)   FTV
          Birth Weight in Grams                                                       BWT

SAS code for fitting the multiple logistic regression model:

    proc logistic data=logistic.lowbwt descending;
      class RACE / param=reference descending;
      model LOW = AGE LWT RACE FTV;
    run;

Some remarks:
- The option DESCENDING in the PROC statement tells SAS to model $P(Y = 1 \mid x)$.
- Any categorical variable should be declared in a CLASS statement.
- The option PARAM in the CLASS statement tells SAS how to code the categorical variable.
- Setting PARAM=REFERENCE makes one level the baseline (reference) level, so the regression parameters for this variable compare each other level with the reference level.
- By default, the reference level is the largest level in numerical order. To reverse the order, add the option DESCENDING to the CLASS statement.

The LOGISTIC Procedure

Model Information
  Data Set                     LOGISTIC.LOWBWT
  Response Variable            LOW (Low Birth Weight)
  Number of Response Levels    2
  Number of Observations       189
  Model                        binary logit
  Optimization Technique       Fisher's scoring

Response Profile (columns: Ordered Value, LOW, Total Frequency; numeric values not preserved)
Probability modeled is LOW=1.

Class Level Information (columns: Class, Value, Design Variables 1 and 2; rows for RACE; numeric values not preserved)

Model Convergence Status: Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics (columns: Criterion, Intercept Only, Intercept and Covariates; rows: AIC, SC, -2 Log L; numeric values not preserved)

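The LOGISTIC Procedure (continued)

Testing Global Null Hypothesis: BETA=0 (columns: Test, Chi-Square, DF, Pr > ChiSq; rows: Likelihood Ratio, Score, Wald; numeric values not preserved)

Type III Analysis of Effects (columns: Effect, DF, Wald Chi-Square, Pr > ChiSq; rows: AGE, LWT, RACE, FTV; numeric values not preserved)

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, AGE, LWT, RACE 3, RACE 2, FTV; numeric values not preserved)

Odds Ratio Estimates (columns: Effect, Point Estimate, 95% Wald Confidence Limits; rows: AGE, LWT, RACE 3 vs 1, RACE 2 vs 1, FTV; numeric values not preserved)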

The use of PROC GENMOD: the logistic regression model can also be fitted using PROC GENMOD (generalized linear models in SAS).

SAS code:

    proc genmod data=logistic.lowbwt descending;
      class RACE;
      model LOW = AGE LWT RACE FTV / d=b;   /* d=b requests the binomial distribution */
    run;

SAS report:

The GENMOD Procedure

Model Information
  Data Set              LOGISTIC.LOWBWT
  Distribution          Binomial
  Link Function         Logit
  Dependent Variable    LOW (Low Birth Weight)
  Observations Used     189

Class Level Information (columns: Class, Levels, Values; row: RACE; values not preserved)

Response Profile (columns: Ordered Value, LOW, Total Frequency; values not preserved)

PROC GENMOD is modeling the probability that LOW='1'.

Criteria For Assessing Goodness Of Fit (columns: Criterion, DF, Value, Value/DF; rows: Deviance, Scaled Deviance, Pearson Chi-Square, Scaled Pearson X2, Log Likelihood; numeric values not preserved)

Algorithm converged.

Analysis Of Parameter Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald 95% Confidence Limits, Chi-Square, Pr > ChiSq; rows: Intercept, AGE, LWT, RACE (three rows, one per level), FTV, Scale; numeric values not preserved)

NOTE: The scale parameter was held fixed.

Note that PROC GENMOD automatically treats RACE=3 as the reference level. To make the results comparable with those reported by PROC LOGISTIC, we need to recode the data so that White becomes the last level.

SAS code:

    data a;
      set logistic.lowbwt;
      if RACE=1 then RACE=5;
      if RACE=2 then RACE=4;
    run;

    proc genmod data=a descending;
      class RACE;
      model LOW = AGE LWT RACE FTV / d=b;
    run;

SAS report:

The GENMOD Procedure

Model Information
  Data Set              WORK.A
  Distribution          Binomial
  Link Function         Logit
  Dependent Variable    LOW (Low Birth Weight)
  Observations Used     189

Class Level Information (columns: Class, Levels, Values; row: RACE; values not preserved)

Response Profile (columns: Ordered Value, LOW, Total Frequency; values not preserved)

PROC GENMOD is modeling the probability that LOW='1'.

Criteria For Assessing Goodness Of Fit (columns: Criterion, DF, Value, Value/DF; rows: Deviance, Scaled Deviance, Pearson Chi-Square, Scaled Pearson X2, Log Likelihood; numeric values not preserved)

Algorithm converged.

Analysis Of Parameter Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald 95% Confidence Limits, Chi-Square, Pr > ChiSq; rows: Intercept, AGE, LWT, RACE (three rows, one per level), FTV, Scale; numeric values not preserved)

NOTE: The scale parameter was held fixed.

Remarks:
- The results from PROC GENMOD are the same as those from PROC LOGISTIC.
- PROC LOGISTIC reports the likelihood ratio test statistic for testing whether the fitted model is better than the intercept-only model. Rejecting the null hypothesis means that the fitted model is indeed better than nothing.
- PROC GENMOD reports the deviance, which can be used to test whether the fitted model can be improved by adding interactions, i.e. whether the fitted model is the best model using the selected variables (goodness of fit). Rejecting the null hypothesis means that the fitted model can be improved.

Hypothesized model:
$$g(x) = \beta_0 + \beta_1\,\mathrm{AGE} + \beta_2\,\mathrm{LWT} + \beta_3\,\mathrm{RACE\_2} + \beta_4\,\mathrm{RACE\_3} + \beta_5\,\mathrm{FTV}$$

Estimating equation (only the LWT and RACE_2 estimates are legible here; the other estimates are written symbolically):
$$\hat g(x) = \hat\beta_0 + \hat\beta_1\,\mathrm{AGE} - 0.014\,\mathrm{LWT} + 1.004\,\mathrm{RACE\_2} + \hat\beta_4\,\mathrm{RACE\_3} + \hat\beta_5\,\mathrm{FTV}$$

Hypotheses Testing:

a. Overall significance test

$\beta = (\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)^T$
$H_0: \beta = 0$ versus $H_a: \beta \neq 0$

Likelihood Ratio Test
$$G = 2 \log \frac{L(\text{with the variables})}{L(\text{without the variables})} \sim \chi^2_5 \text{ under } H_0$$
$$P\text{-value} = \Pr(\chi^2_5 > G) < 0.05$$
Reject the null hypothesis at significance level 0.05.

Wald Test
Obtain the MLE of $\beta$, $\hat\beta$, and the consistent estimate of its covariance matrix, $\widehat{\mathrm{var}}(\hat\beta)$. Then
$$W = \hat\beta^T \left[\widehat{\mathrm{var}}(\hat\beta)\right]^{-1} \hat\beta \sim \chi^2_5 \text{ under } H_0$$
For this example,
$$P\text{-value} = \Pr(\chi^2_5 > W) > 0.05$$
No evidence to reject $H_0$ at level 0.05.

What can we do?
1. In most situations, the results from both tests agree.
2. The likelihood ratio test is usually the most powerful test and is commonly recommended in practice.

2. Univariate Tests

After we conclude that the fitted model is significantly better than nothing, we want to know which variables significantly affect the response.

$H_0: \beta_i = 0$ versus $H_a: \beta_i \neq 0$ (or $\beta_i > 0$, or $\beta_i < 0$), $i = 1, 2, \ldots, p$

Wald test:
$$Z_i = \frac{\hat\beta_i}{\widehat{se}(\hat\beta_i)} \sim N(0, 1), \qquad W_i = Z_i^2 = \frac{\hat\beta_i^2}{\widehat{\mathrm{var}}(\hat\beta_i)} \sim \chi^2_1$$

At significance level 0.05, it appears that only LWT and RACE are significant factors for the response.

Note: if our goal is to obtain the best fitting model while minimizing the number of parameters, we should fit a reduced model containing only those variables thought to be significant, and compare it to the full model containing all the variables.

3. Reduced model vs. Full model

Compare the reduced model with the full model. The reduced model should be nested in the full model, which means that every variable in the reduced model must also be present in the full model.

For Example 2.1, since only LWT and RACE are significant, we take the reduced model to be the one containing only LWT and RACE.

    proc logistic data=logistic.lowbwt descending;
      class RACE / param=reference descending;
      model LOW = LWT RACE;
    run;

Model Fit Statistics (columns: Criterion, Intercept Only, Intercept and Covariates; rows: AIC, SC, -2 Log L; numeric values not preserved)

The LOGISTIC Procedure

Testing Global Null Hypothesis: BETA=0 (columns: Test, Chi-Square, DF, Pr > ChiSq; rows: Likelihood Ratio, Score, Wald; numeric values not preserved)

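Type III Analysis of Effects (columns: Effect, DF, Wald Chi-Square, Pr > ChiSq; rows: LWT, RACE; numeric values not preserved)

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, LWT, RACE 3, RACE 2; numeric values not preserved)

$H_0$: the reduced model is good enough compared to the full model, or equivalently
$H_0: \beta_1 = \beta_5 = 0$ (the coefficients of AGE and FTV)

Test statistic:
$$G = -2\left[\log L(\text{reduced model}) - \log L(\text{full model})\right] \sim \chi^2_2 \text{ under } H_0$$
$$P\text{-value} = \Pr(\chi^2_2 > G)$$
There is not enough evidence to reject the null hypothesis.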
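The nested-model comparison can be sketched numerically as follows. The log-likelihood values in the example are hypothetical placeholders, not the values from the SAS output, and `chi2_sf` is a small helper written here (closed-form upper-tail probability for integer degrees of freedom) so that the sketch needs only the Python standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability Pr(chi-square with `df` d.f. > x), integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2.0) / j
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df = 2m+1: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_{j=1}^{m} x^(j-1/2) / (1*3*...*(2j-1))
    term, total = math.sqrt(x), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total


def lr_test(loglik_reduced, loglik_full, df):
    """Likelihood ratio statistic G and its p-value for nested models."""
    g = 2.0 * (loglik_full - loglik_reduced)
    return g, chi2_sf(g, df)


# Hypothetical log likelihoods; df = 2 because two coefficients are set to zero.
g, p_value = lr_test(loglik_reduced=-111.6, loglik_full=-111.3, df=2)
print(g, p_value)
```

A large p-value, as in this made-up example, is read the same way as in the notes: no evidence that the dropped variables improve the fit.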

3. Interpretation of Fitted Model

a. Interpretation of Regression Parameters

Study goal: what do the estimated coefficients in the model tell us about the research questions that motivated the study?

Model example:
$$g(x) = \log\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x_1 + \beta_{21} x_{21} + \beta_{22} x_{22}$$
where $x = (x_1, x_2) = (x_1, x_{21}, x_{22})$,
  $x_1$: continuous variable
  $x_2$: categorical variable with three levels, so two dummy variables are introduced
Assume that $x_{21}$ and $x_{22}$ indicate levels 2 and 3 of $x_2$, respectively.

I. Continuous independent variable
$$\log\frac{\pi(x_1 = x_0 + 1; x_2) / \left[1 - \pi(x_1 = x_0 + 1; x_2)\right]}{\pi(x_1 = x_0; x_2) / \left[1 - \pi(x_1 = x_0; x_2)\right]} = g(x_1 = x_0 + 1; x_2) - g(x_1 = x_0; x_2)$$
$$= \left[\beta_0 + \beta_1 (x_0 + 1) + \beta_{21} x_{21} + \beta_{22} x_{22}\right] - \left[\beta_0 + \beta_1 x_0 + \beta_{21} x_{21} + \beta_{22} x_{22}\right] = \beta_1$$

$\beta_1$: the log odds ratio for a one-unit increase in $x_1$, with the other variable held fixed.

Notation: $OR(x_1 = x_0 + 1,\ x_1 = x_0 \mid \text{other}) = e^{\beta_1}$

The odds of developing the symptom when $x_1 = x_0 + 1$ are $e^{\beta_1}$ times the odds when $x_1 = x_0$, adjusting for the other variable (keeping the other variable fixed, for similar subjects, etc.).

In general, $OR(x_1 = a,\ x_1 = b \mid \text{other}) = e^{\beta_1 (a - b)}$

The odds ratio for a one-unit increase in a continuous variable may not be clinically meaningful.

The odds of developing the symptom when $x_1 = a$ are $e^{\beta_1 (a - b)}$ times the odds when $x_1 = b$, adjusting for the other variable.

II. Categorical (polychotomous) independent variable
$$\log\frac{\pi(x_2 = \text{2nd level}; x_1) / \left[1 - \pi(x_2 = \text{2nd level}; x_1)\right]}{\pi(x_2 = \text{1st level}; x_1) / \left[1 - \pi(x_2 = \text{1st level}; x_1)\right]} = g(x_2 = \text{2nd level}; x_1) - g(x_2 = \text{1st level}; x_1)$$
$$= \left[\beta_0 + \beta_1 x_1 + \beta_{21} \cdot 1 + \beta_{22} \cdot 0\right] - \left[\beta_0 + \beta_1 x_1 + \beta_{21} \cdot 0 + \beta_{22} \cdot 0\right] = \beta_{21}$$

$OR(\text{2nd level}, \text{1st level} \mid \text{other}) = e^{\beta_{21}}$

The odds of developing the symptom are multiplied by $e^{\beta_{21}}$ when $x_2$ changes from the 1st level to the 2nd level, adjusting for the other variable (keeping the other variable fixed, for similar subjects, etc.).
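A short numerical sketch of these odds-ratio formulas follows. The coefficient values are assumptions chosen for illustration only (they are not read from the SAS output), and `odds_ratio` is a helper name introduced here.

```python
import math


def odds_ratio(beta, a, b):
    """Odds ratio comparing covariate value a with value b: exp(beta * (a - b))."""
    return math.exp(beta * (a - b))


# Assumed coefficient of a continuous covariate (per unit).
beta_continuous = -0.0152
print(odds_ratio(beta_continuous, 1, 0))    # one-unit increase
print(odds_ratio(beta_continuous, 10, 0))   # ten-unit increase, often more meaningful clinically

# Assumed coefficient of a dummy variable: level 2 (coded 1) versus the reference level (coded 0).
beta_dummy = 1.081
print(odds_ratio(beta_dummy, 1, 0))
```

The ten-unit comparison shows why rescaling the increment matters: the one-unit odds ratio is very close to 1 even when the ten-unit odds ratio is clearly below it.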

For the reduced model in Example 2.1:

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, LWT, RACE 3, RACE 2; numeric values not preserved)

- For pregnant women of the same race, the odds ratio of having a small baby is approximately $e^{10 \hat\beta_1} = 0.859$ when a pregnant woman is compared with a counterpart who is 10 pounds lighter at the last menstrual period.
- For pregnant women of similar weight at the last menstrual period, the odds of having a small baby for Black women are almost three times ($e^{1.081} \approx 2.95$) those for White women.
- For pregnant women of similar weight at the last menstrual period, the odds of having a small baby for women of a race other than White or Black are approximately 1.6 times ($e^{\hat\beta_{22}} = 1.617$) those for White women.

Confidence Interval Estimation

The MLE gives only a point estimate of the regression parameters. When we interpret the point estimate we actually have 0% confidence: the chance that the true regression parameter equals the estimated value exactly is essentially zero.

A confidence interval attaches a confidence level to the statement that the true regression parameter falls in a specific interval.

$100(1 - \alpha)\%$ CI for $\beta_i$, $i = 1, 2, \ldots, p$:
$$\hat\beta_i \pm z_{\alpha/2}\, \widehat{se}(\hat\beta_i)$$

For the reduced model in Example 2.1:

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, LWT, RACE 3, RACE 2; numeric values not preserved)

We are 95% confident that
$$\beta_1 \in \left[\hat\beta_1 - z_{0.025}\, \widehat{se}(\hat\beta_1),\ \hat\beta_1 + z_{0.025}\, \widehat{se}(\hat\beta_1)\right]$$

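$$\beta_{21} \in \left[\hat\beta_{21} - z_{0.025}\, \widehat{se}(\hat\beta_{21}),\ \hat\beta_{21} + z_{0.025}\, \widehat{se}(\hat\beta_{21})\right], \text{ with lower limit } 0.1244$$
$$\beta_{22} \in \left[\hat\beta_{22} - z_{0.025}\, \widehat{se}(\hat\beta_{22}),\ \hat\beta_{22} + z_{0.025}\, \widehat{se}(\hat\beta_{22})\right] = [-0.2185,\ 1.1797]$$

$100(1 - \alpha)\%$ CI for the odds ratio: exponentiate the limits of the interval for the coefficient. We are 95% confident that
- $OR(x_1 = x_0 + 10,\ x_1 = x_0;\ x_2)$ lies between $e^{10 L_1}$ and $e^{10 U_1}$, where $[L_1, U_1]$ is the interval for $\beta_1$
- $OR(\text{black}, \text{white};\ x_1)$ has lower limit $e^{0.1244} = 1.1325$
- $OR(\text{other}, \text{white};\ x_1) \in \left[e^{-0.2185},\ e^{1.1797}\right]$

Interpretation
- We are 95% confident that the odds ratio of having a small baby for a pregnant woman vs. a counterpart who is 10 pounds lighter at the last menstrual period lies between the two limits of its interval, adjusting for race.
- We are 95% confident that the odds ratio of having a small baby for a black woman vs. a white woman of similar weight at the last menstrual period could be as little as 1.1325 or as large as the upper limit of its interval.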
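The Wald interval for a coefficient, and the corresponding interval for the odds ratio, can be computed as in the following sketch. The estimate and standard error used in the example are hypothetical, and the helper names `wald_ci` and `odds_ratio_ci` are introduced here for illustration.

```python
import math
from statistics import NormalDist


def wald_ci(estimate, se, level=0.95):
    """Wald interval: estimate +/- z_{alpha/2} * se."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return estimate - z * se, estimate + z * se


def odds_ratio_ci(estimate, se, level=0.95):
    """Interval for the odds ratio: exponentiate the limits for the coefficient."""
    lower, upper = wald_ci(estimate, se, level)
    return math.exp(lower), math.exp(upper)


# Hypothetical coefficient estimate and standard error.
beta_hat, se_hat = 1.0, 0.5
print(wald_ci(beta_hat, se_hat))
print(odds_ratio_ci(beta_hat, se_hat))
```

If the interval for the coefficient excludes 0, the interval for the odds ratio excludes 1, which is the criterion used in the interpretation of these intervals.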

- We are 95% confident that the odds ratio of having a small baby for a woman of a race other than black or white vs. a white woman of similar weight at the last menstrual period lies between the two limits of its interval.

It seems that black women tend to have a greater risk of having a small baby than white women, because the confidence interval excludes 1.

b. Issues about Confounding and Interaction

Epidemiologists use the term confounder to describe a covariate that is associated with both the outcome variable of interest and a primary independent variable or risk factor.

In most epidemiological research, we are primarily interested in the association between the outcome and a potential risk factor. For example, the association between the incidence of coronary heart disease (CHD) and smoking is one such research interest. Assume that we are able to follow two study cohorts, a smoking group and a non-smoking group, and that we can also control other possible risk factors (their distributions are the same in the two cohorts). Then we can simply fit a univariate logistic regression model to describe the association between CHD and SMOKE, i.e.,

$$\log\frac{P(\mathrm{CHD})}{1 - P(\mathrm{CHD})} = \beta_0 + \beta_1\,\mathrm{SMOKE}$$

$\beta_1$ represents the log odds ratio of developing CHD for smokers vs. non-smokers.

However, what if in practice we are not able to control another risk factor, say AGE? Then the true model is
$$\log\frac{P(\mathrm{CHD})}{1 - P(\mathrm{CHD})} = \beta_0 + \beta_1\,\mathrm{SMOKE} + \beta_2\,\mathrm{AGE}$$

Also consider the situation in which the age distribution differs between smokers and non-smokers: smokers tend to be older than non-smokers. If we mistakenly fit the previous model, ignoring the AGE factor, then approximately
$$\log OR(\text{smoker}, \text{non-smoker}) = \beta_1 + \beta_2 \left[E(\text{age} \mid \text{smoking}) - E(\text{age} \mid \text{non-smoking})\right]$$

So this will incorrectly estimate the effect of smoking. When older age raises the risk of CHD ($\beta_2 > 0$) and smokers are older, it will exaggerate the risk of smoking for developing CHD.

For a variable to be considered a confounding variable, it should have two characteristics:
a. It is related to the outcome.
b. It is also related to the primary risk factor.

(Diagram: the confounder is linked to both the outcome and the primary risk factor.)

Example 2.2. We consider a simulated data set.

    data CHDSIMUL;
      do ID=1 to 200 by 1;
        smoke=ranbin(int(time()),1,0.5);
        if smoke=1 then age=50+5*normal(int(time()));
        else age=42+5*normal(int(time()));
        /* the intercept and the coefficient of smoke are not legible in the source */
        p=exp( *smoke+0.5*age+0.20*smoke*age)/(1+exp( *smoke+0.5*age+0.20*smoke*age));
        CHD=ranbin(int(time()),1,p);
        output;
      end;
    run;

Model Fitting:

a. Univariate model: contains only SMOKE

    proc logistic data=logistic.chdsimul descending;
      model CHD = smoke;
    run;

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, smoke; both p-values are reported as <.0001; other numeric values not preserved)

$OR(\text{smoker}, \text{non-smoker}) = e^{\hat\beta_1} = 7.38$

In this study, age is clearly a confounding variable.

b. Multivariate model: contains SMOKE and AGE

    proc logistic data=logistic.chdsimul descending;
      model CHD = smoke age;
    run;

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, smoke, age; the p-values for Intercept and age are reported as <.0001; other numeric values not preserved)

The coefficient associated with the variable SMOKE is greatly reduced.

This type of study is called analysis of covariance in epidemiology. The odds ratio must be interpreted as adjusted for the confounder AGE:

$OR(\text{smoker}, \text{non-smoker} \mid \text{age}) = e^{\hat\beta_1} = 1.94$

Interaction (effect modifier)

Suppose we fit a logistic regression model with two risk factors. If the effect of one risk factor on the outcome depends on the level of the other factor, we say that the two factors interact in affecting the outcome.

In any model, interaction is incorporated by including the cross-product terms of the two risk factors. For Example 2.2, suppose we fit the model
$$\log\frac{P(\mathrm{CHD})}{1 - P(\mathrm{CHD})} = \beta_0 + \beta_1\,\mathrm{SMOKE} + \beta_2\,\mathrm{AGE} + \beta_3\,\mathrm{SMOKE} \times \mathrm{AGE}$$

Now let's consider the effect of age on developing CHD.

For smokers:
$$\log\frac{P(\mathrm{CHD} \mid \text{smoker})}{1 - P(\mathrm{CHD} \mid \text{smoker})} = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\,\mathrm{AGE}$$

For non-smokers:
$$\log\frac{P(\mathrm{CHD} \mid \text{non-smoker})}{1 - P(\mathrm{CHD} \mid \text{non-smoker})} = \beta_0 + \beta_2\,\mathrm{AGE}$$

The coefficient $\beta_3$ modifies the effect of age on CHD. We cannot interpret the age effect on CHD without knowing the smoking status.

c. Fit the interaction model

    proc logistic data=logistic.chdsimul descending;
      model CHD = smoke age smoke*age;
    run;

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, smoke, age, smoke*age; numeric values not preserved)

$OR(\text{age} = a + 10,\ \text{age} = a \mid \text{non-smoker}) = e^{10 \hat\beta_2} = 3.8$
$OR(\text{age} = a + 10,\ \text{age} = a \mid \text{smoker}) = e^{10 (\hat\beta_2 + \hat\beta_3)} = 3.79$

Do you get the idea of an effect modifier?
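To make the idea of an effect modifier concrete, the sketch below computes the odds ratio for a 10-year age increase separately for non-smokers and smokers. The coefficients are hypothetical, chosen only so that the two groups differ visibly, and `age_odds_ratio` is a name introduced here.

```python
import math


def age_odds_ratio(beta_age, beta_interaction, smoke, years=10):
    """Odds ratio for a `years` increase in age at a given smoking status.

    In the interaction model the age slope is beta_age for non-smokers
    (smoke = 0) and beta_age + beta_interaction for smokers (smoke = 1).
    """
    return math.exp(years * (beta_age + beta_interaction * smoke))


# Hypothetical coefficients of AGE and SMOKE*AGE.
beta_age, beta_interaction = 0.10, 0.05
print("non-smoker:", age_odds_ratio(beta_age, beta_interaction, smoke=0))
print("smoker:    ", age_odds_ratio(beta_age, beta_interaction, smoke=1))
```

When the interaction coefficient is zero the two odds ratios coincide, which is exactly the case where smoking does not modify the age effect.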

Model Fitting Principles in Biomedical Studies:
- Including confounders is very important; otherwise the relationship between the risk factor and the outcome may be misinterpreted.
- When the regression coefficients associated with the risk factors change a lot after a covariate is dropped, that covariate should be viewed as a potential confounder, even if it is not statistically significant in the hypothesis test. In biomedical studies, SEX and AGE are normally treated as confounders.
- When an interaction effect is included, the interpretation of the risk factor becomes complicated. Unless the interaction is statistically significant, it is normally advised not to put the interaction in the model.
- When a covariate is an effect modifier, its status as a confounder is of secondary importance, since the estimate of the effect of the risk factor depends on the specific value of the covariate.

c. Interpretation of Fitted Value

$100(1 - \alpha)\%$ PI (prediction interval) for the odds

Given a set of covariates, we can predict the log odds by plugging these covariates into the prediction equation, i.e. obtaining the point prediction
$$\hat g(x) = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$$

To obtain the confidence interval, we need the distribution of $\hat g(x)$. Note that $\hat g(x) = x^T \hat\beta$ and $\hat\beta \to_d N\left(\beta, \mathrm{var}(\hat\beta)\right)$, so
$$\hat g(x) \to_d N\left(x^T \beta,\ \mathrm{var}(\hat g(x))\right) = N\left(x^T \beta,\ x^T \mathrm{var}(\hat\beta)\, x\right)$$
$$\widehat{\mathrm{var}}\left(\hat g(x)\right) = x^T\, \widehat{\mathrm{var}}(\hat\beta)\, x = \sum_{j=0}^{p} x_j^2\, \widehat{\mathrm{var}}(\hat\beta_j) + 2 \sum_{j=0}^{p} \sum_{k=j+1}^{p} x_j x_k\, \widehat{\mathrm{cov}}(\hat\beta_j, \hat\beta_k), \quad x_0 = 1$$

SAS provides the estimated covariance matrix of the regression parameter estimates when the option COVB is added to the MODEL statement.
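The variance formula and the resulting intervals for the log odds, the odds, and the probability can be sketched as follows. The coefficient vector and covariance matrix in the example are hypothetical stand-ins for what the COVB option would print, and the function names are introduced here.

```python
import math
from statistics import NormalDist


def linear_predictor(beta, x):
    """g_hat(x) = x^T beta, with x[0] = 1 for the intercept."""
    return sum(b * xi for b, xi in zip(beta, x))


def quad_form(cov, x):
    """var_hat(g_hat(x)) = x^T cov x."""
    n = len(x)
    return sum(x[j] * cov[j][k] * x[k] for j in range(n) for k in range(n))


def fitted_value_ci(beta, cov, x, level=0.95):
    """Intervals for the log odds, the odds, and the probability at covariates x."""
    g = linear_predictor(beta, x)
    se = math.sqrt(quad_form(cov, x))
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    lo, hi = g - z * se, g + z * se
    to_prob = lambda t: 1.0 / (1.0 + math.exp(-t))
    return {
        "logit": (lo, hi),
        "odds": (math.exp(lo), math.exp(hi)),
        "prob": (to_prob(lo), to_prob(hi)),
    }


# Hypothetical estimates for (intercept, continuous covariate, dummy variable).
beta_hat = [0.5, -1.2, 0.4]
cov_hat = [
    [0.60, -0.40, -0.05],
    [-0.40, 0.35, 0.01],
    [-0.05, 0.01, 0.12],
]
print(fitted_value_ci(beta_hat, cov_hat, x=[1.0, 1.5, 1.0]))
```

Because the exponential and logistic transformations are increasing, the limits for the odds and the probability are obtained by transforming the limits for the log odds.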

For the reduced model in Example 2.1:

    proc logistic data=logistic.lowbwt descending;
      class RACE / param=reference descending;
      model LOW = LWT RACE / covb;
    run;

The estimate of the covariance matrix:

The LOGISTIC Procedure

Estimated Covariance Matrix (rows and columns: Intercept, LWT, RACE3, RACE2; numeric values not preserved)

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, LWT, RACE 3, RACE 2; numeric values not preserved)

Let's consider two particular cases.

i. LWT=150, RACE=White

$\hat g(\text{150 pounds and White}) = \hat\beta_0 + 150\, \hat\beta_1 = -1.444$

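Since $x_{21} = x_{22} = 0$ for a White woman, all terms involving $\hat\beta_{21}$ and $\hat\beta_{22}$ vanish:
$$\widehat{\mathrm{var}}\left(\hat g(\text{150 pounds and White})\right) = \widehat{\mathrm{var}}(\hat\beta_0) + 150^2\, \widehat{\mathrm{var}}(\hat\beta_1) + 2 \cdot 150\, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1)$$

We are 95% confident that the odds for this woman fall in
$$\left[e^{-1.444 - z_{0.025}\, \widehat{se}},\ e^{-1.444 + z_{0.025}\, \widehat{se}}\right] = [0.137,\ 0.406], \quad \widehat{se} = \sqrt{\widehat{\mathrm{var}}\left(\hat g\right)}$$

We are 95% confident that
$$\pi(\text{150 pounds and White}) \in \left[\frac{0.137}{1 + 0.137},\ \frac{0.406}{1 + 0.406}\right] = [0.120,\ 0.289]$$
i.e. we are 95% confident that the chance of having a small baby for this woman could be as little as 12.0% or as large as 28.9%.

ii. LWT=120, RACE=Asian (other)

$\hat g(\text{120 pounds and Asian}) = \hat\beta_0 + 120\, \hat\beta_1 + \hat\beta_{22}$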

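Here $x_{21} = 0$ and $x_{22} = 1$, so
$$\widehat{\mathrm{var}}\left(\hat g(\text{120 pounds and Asian})\right) = \widehat{\mathrm{var}}(\hat\beta_0) + 120^2\, \widehat{\mathrm{var}}(\hat\beta_1) + \widehat{\mathrm{var}}(\hat\beta_{22}) + 2 \cdot 120\, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) + 2\, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_{22}) + 2 \cdot 120\, \widehat{\mathrm{cov}}(\hat\beta_1, \hat\beta_{22})$$

What is wrong with the value obtained from the printed covariance matrix? Isn't $\widehat{\mathrm{var}}(\hat\beta)$ a positive definite matrix?

Note that the eigenvalues of $\widehat{\mathrm{var}}(\hat\beta)$ are $\lambda = [0.732, 0.259, 0.088, \ldots]$, with the fourth eigenvalue very close to zero.

It is purely a numerical problem! How to fix it? Change the scale of the variable LWT by dividing LWT by 100.

    data a;
      set logistic.lowbwt;
      lwt=lwt/100;
    run;

    proc logistic data=a descending;
      class RACE / param=reference descending;
      model LOW = LWT RACE / covb;
    run;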
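One plausible reading of the numerical problem is that large terms of opposite sign nearly cancel in $x^T \widehat{\mathrm{var}}(\hat\beta)\, x$, so entries printed to a few decimals are not accurate enough. The sketch below uses a hypothetical 2x2 covariance matrix (intercept and weight only) to show the cancellation and to confirm that dividing the covariate by 100 leaves the variance of the fitted value unchanged in exact arithmetic. The helper names are introduced here.

```python
def quad_form(cov, x):
    """x^T cov x."""
    n = len(x)
    return sum(x[j] * cov[j][k] * x[k] for j in range(n) for k in range(n))


def rescale_covariate(cov, index, factor):
    """Covariance matrix of the estimates after dividing covariate `index` by `factor`.

    Dividing a covariate by `factor` multiplies its coefficient by `factor`,
    so its variance scales by factor**2 and its covariances by factor.
    """
    n = len(cov)
    s = [factor if j == index else 1.0 for j in range(n)]
    return [[s[j] * s[k] * cov[j][k] for k in range(n)] for j in range(n)]


# Hypothetical covariance matrix for (intercept, weight in pounds).
cov_pounds = [[1.0, -0.007], [-0.007, 0.00005]]
x_pounds = [1.0, 120.0]

# Terms of size 1.0, -1.68 and 0.72 combine to a variance of only 0.04.
print(quad_form(cov_pounds, x_pounds))

# Same fitted value with weight expressed in units of 100 pounds.
cov_scaled = rescale_covariate(cov_pounds, index=1, factor=100.0)
x_scaled = [1.0, 1.2]
print(cov_scaled)
print(quad_form(cov_scaled, x_scaled))
```

After rescaling, the entries of the covariance matrix are of comparable magnitude, so rounding them to a fixed number of decimals loses far less relative precision.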

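The LOGISTIC Procedure

Estimated Covariance Matrix (rows and columns: Intercept, LWT, RACE3, RACE2; numeric values not preserved)

Analysis of Maximum Likelihood Estimates (columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows: Intercept, LWT, RACE 3, RACE 2; numeric values not preserved)

Note that under the current scale, the eigenvalues of $\widehat{\mathrm{var}}(\hat\beta)$ are $\lambda = [.2, 0.269, 0.095, 0.009]$.

With LWT now measured in units of 100 pounds, a weight of 120 pounds enters as 1.2:

$\hat g(\text{120 pounds and Asian}) = \hat\beta_0 + 1.2\, \hat\beta_1 + \hat\beta_{22}$
$$\widehat{\mathrm{var}}\left(\hat g(\text{120 pounds and Asian})\right) = \widehat{\mathrm{var}}(\hat\beta_0) + 1.2^2\, \widehat{\mathrm{var}}(\hat\beta_1) + \widehat{\mathrm{var}}(\hat\beta_{22}) + 2 \cdot 1.2\, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) + 2\, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_{22}) + 2 \cdot 1.2\, \widehat{\mathrm{cov}}(\hat\beta_1, \hat\beta_{22})$$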

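We are 95% confident that the odds for this woman fall in
$$\left[e^{\hat g - z_{0.025}\, \widehat{se}},\ e^{\hat g + z_{0.025}\, \widehat{se}}\right] = [0.352,\ 0.963]$$

We are 95% confident that
$$\pi(\text{120 pounds and Asian}) \in \left[\frac{0.352}{1 + 0.352},\ \frac{0.963}{1 + 0.963}\right] = [0.260,\ 0.491]$$
i.e. we are 95% confident that the chance of having a small baby for this woman could be as little as 26.0% or as large as 49.1%.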
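The last step in both cases, turning limits for the odds into limits for the probability via $\pi = \text{odds} / (1 + \text{odds})$, is easy to check numerically. The sketch below applies the conversion to odds limits of 0.137 and 0.406 (150 pounds, White) and 0.352 and 0.963 (120 pounds, other race); the function name is introduced here.

```python
def odds_to_prob(odds):
    """Convert odds to a probability: pi = odds / (1 + odds)."""
    return odds / (1.0 + odds)


for label, (low, high) in {
    "150 pounds, White": (0.137, 0.406),
    "120 pounds, other": (0.352, 0.963),
}.items():
    print(label, round(odds_to_prob(low), 3), round(odds_to_prob(high), 3))
```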
