Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression


Ariel Newton
6 years ago

Logistic Regression

Usual linear regression (repetition)

$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + e_i, \qquad e_i \sim N(0, \sigma^2)$$

or:

$$y_i \sim N(b_0 + b_1 x_{1i} + b_2 x_{2i},\ \sigma^2)$$

Example (DGA, p. 336): $E(\text{PEmax}) = b_0 + 1.024 \cdot \text{weight} + 0.147 \cdot \text{height}$ [intercept value missing in the source]

Interpretation of linear regression

For a given height, PEmax increases by 1.024 cmH2O per kg of body weight.
For a given weight, PEmax increases by 0.147 cmH2O per cm of body height.
The effect of a single explanatory variable is conditional on the other variables present in the model.
The effect of each explanatory variable is linear.

Other types of outcomes

- binary (0-1) variable
- number / frequency

These are integers, so the error term cannot be normal. Instead, look at the mean value:

$$E(y_i) = b_0 + b_1 x_{1i} + b_2 x_{2i}$$

A problem remains, because the right-hand side can take any real value, while:
- the mean of a 0-1 variable X is $E(X) = P(X = 1) = p$, with $p \in [0, 1]$;
- the mean of a count lies in $[0, +\infty]$.

0-1 response variable: Wound infection

(Dependence on age and on operation time?)

[Data listing with columns p, inf, optime, age; the values are missing in the source.]

[Data listing, continued, with columns p, inf, optime, age; the values are missing in the source.]

Analysis of a 0-1 response variable

The response variable is binary (0 / 1). How is its dependence on operation time (optime) and age (age) described?

Model for $p = P\{\text{wound infection}\} \in [0, 1]$?

It is not good to use

$$p = a + b_1 x_1 + b_2 x_2,$$

since this would usually not stay within [0, 1].

Logistic regression

Binary outcome (e.g. 1 for "success"): $Y \in \{0, 1\}$
Probability of "success": $p = P\{Y = 1\} \in [0, 1]$
Odds of "success": $\omega = \dfrac{p}{1 - p} \in [0, +\infty]$, so that $p = \dfrac{\omega}{1 + \omega}$
Odds ratio (2 groups): $OR = \dfrac{p_1}{1 - p_1} \Big/ \dfrac{p_2}{1 - p_2} \in [0, +\infty]$

Logistic regression (ctd.)

Log-odds: $\operatorname{logit}(p) = \ln\left(\dfrac{p}{1 - p}\right) \in [-\infty, +\infty]$; logit is the link function.
Linear predictor: $\operatorname{logit}(p) = b_0 + b_1 x_1 + b_2 x_2 = \eta$
Predicted odds: $\omega = \exp(\eta)$
Predicted probability: $p = \dfrac{\omega}{1 + \omega} = \dfrac{\exp(\eta)}{1 + \exp(\eta)}$
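The conversions between probability, odds and log-odds defined above can be sketched in a few lines of Python. This is only an illustration; the function names are chosen here and do not come from the slides.

```python
import math

def odds(p):
    """Odds omega = p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(omega):
    """Back-transform odds to a probability: p = omega / (1 + omega)."""
    return omega / (1.0 + omega)

def logit(p):
    """Log-odds, the link function of logistic regression."""
    return math.log(odds(p))

def inv_logit(eta):
    """Predicted probability from the linear predictor eta."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds_ratio(p1, p2):
    """Odds ratio between two groups with probabilities p1 and p2."""
    return odds(p1) / odds(p2)

if __name__ == "__main__":
    p = 0.2
    print("odds:", odds(p))
    print("logit:", logit(p))
    print("back to p:", inv_logit(logit(p)))
```

Note that `inv_logit(logit(p))` returns `p` again, which is why a linear model on the logit scale always yields probabilities inside [0, 1].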

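Logistic regression interpretation

Two groups, with probabilities $p_1$ and $p_2$:

$$\operatorname{logit}(p_1) - \operatorname{logit}(p_2) = \ln\left(\frac{p_1}{1 - p_1}\right) - \ln\left(\frac{p_2}{1 - p_2}\right) = \ln\left(\frac{p_1}{1 - p_1} \Big/ \frac{p_2}{1 - p_2}\right) = \ln(OR)$$

A linear model for logit(p) yields comparisons via odds ratios.

Logistic regression in wound infection data

Y = 1: post-operative wound infection; Y = 0: no post-operative wound infection
$p = P\{\text{post-operative wound infection}\}$
$x_1$ = operation time in minutes
$x_2$ = age in years

Estimated model [numerical estimates missing in the source; written here as $\hat b_0, \hat b_1, \hat b_2$]:

$$\operatorname{logit}(p) = \hat b_0 + \hat b_1 x_1 + \hat b_2 x_2, \qquad p = \frac{\exp(\hat b_0 + \hat b_1 x_1 + \hat b_2 x_2)}{1 + \exp(\hat b_0 + \hat b_1 x_1 + \hat b_2 x_2)}$$

Interpretation of logistic regression

Same operation time (T), age difference of 10 years (A + 10 vs. A):

$$\operatorname{logit}(p_1) = \hat b_0 + \hat b_1 T + \hat b_2 (A + 10)$$
$$\operatorname{logit}(p_2) = \hat b_0 + \hat b_1 T + \hat b_2 A$$
$$\ln(OR_{A+10,A}) = \operatorname{logit}(p_1) - \operatorname{logit}(p_2) = 10\,\hat b_2 = 0.353$$
$$OR_{A+10,A} = \exp(0.353) = 1.423$$

What does that mean?

If age increases by 10 years, the odds of getting a wound infection increase by a factor of 1.423, i.e. by 42.3%.

The odds ratio is the ratio of the odds of disease between two levels of an explanatory variable.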
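The arithmetic behind the 10-year odds ratio can be checked with a short Python sketch. The value 0.353 is the log odds ratio quoted on the slide; the per-year coefficient 0.0353 is derived here as 0.353 / 10 and is an inference, not a number printed in the source.

```python
import math

LOG_OR_10_YEARS = 0.353            # ln(OR) for an age difference of 10 years (from the slide)
B_AGE = LOG_OR_10_YEARS / 10       # implied log odds ratio per year of age (derived)

or_10_years = math.exp(LOG_OR_10_YEARS)       # odds ratio for +10 years
percent_increase = (or_10_years - 1) * 100    # relative increase in the odds
or_1_year = math.exp(B_AGE)                   # odds ratio for +1 year

print(f"OR for 10 years: {or_10_years:.3f}")
print(f"Increase in odds: {percent_increase:.1f}%")
print(f"OR for 1 year:   {or_1_year:.4f}")
```

Because the model is linear on the log-odds scale, the 10-year odds ratio equals the 1-year odds ratio raised to the 10th power, whatever the operation time.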

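Calculation of probabilities

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_{1i} + b_2 x_{2i}$$

$$p = \frac{\exp(b_0 + b_1 x_{1i} + b_2 x_{2i})}{1 + \exp(b_0 + b_1 x_{1i} + b_2 x_{2i})}, \qquad \frac{p}{1 - p} = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i})$$

The example yields, for optime = 200 min and age = 60 years:

$$\operatorname{logit}(p) = \hat b_0 + \hat b_1 \cdot 200 + \hat b_2 \cdot 60, \qquad p = \frac{e^{\operatorname{logit}(p)}}{1 + e^{\operatorname{logit}(p)}}$$

[numerical values missing in the source]

[Figure: Dependence of p on age for different operation times. Predicted probability against age, for operation times of 30, 120 and 240 min.]

[Figure: Dependence of p on operation time for different ages. Predicted probability against operation time, for ages 50, 60 and 75 years.]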
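A minimal sketch of how such predicted probabilities could be tabulated, assuming coefficients are available. The intercept and the operation-time coefficient below are hypothetical placeholders (the fitted values are not recoverable from this text); only the age coefficient 0.0353 is implied by the slides (0.353 per 10 years).

```python
import math

# Hypothetical values for illustration only; NOT the fitted estimates.
B0 = -6.0          # hypothetical intercept
B_OPTIME = 0.01    # hypothetical log-odds change per minute of operation time
B_AGE = 0.0353     # per year of age, derived from ln(OR) = 0.353 per 10 years

def predicted_probability(optime, age, b0=B0, b_optime=B_OPTIME, b_age=B_AGE):
    """p = exp(eta) / (1 + exp(eta)) with eta = b0 + b1*optime + b2*age."""
    eta = b0 + b_optime * optime + b_age * age
    return math.exp(eta) / (1.0 + math.exp(eta))

def probability_grid(optimes, ages):
    """Predicted probability for every combination of operation time and age."""
    return {(t, a): predicted_probability(t, a) for t in optimes for a in ages}

grid = probability_grid([30, 120, 240], [50, 60, 75])
for (t, a), p in sorted(grid.items()):
    print(f"optime={t:3d} min, age={a} years: p={p:.3f}")
```

The grid uses the operation times and ages named in the two figure captions, so the printed values correspond to points on those curves only if the real estimates are substituted.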

What does the intercept mean here?

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_{1i} + b_2 x_{2i}$$

$$x_{1i} = x_{2i} = 0 \;\Rightarrow\; \ln\left(\frac{p}{1 - p}\right) = b_0$$

The intercept is the log-odds of disease for a person with 0 in all covariates.
In the wound infection case this would be a person aged 0 years who is operated on for 0 minutes: not very meaningful!

Wound infection data analyzed in SAS

Direct input of data:

    data brem ;
      input inf optime age ;
      cards ;
    : :
    ;
    run ;

Analyst: Open

Direct programming:

    proc logistic data = brem descend ;
      model inf = optime age ;
    run ;

or:

    proc genmod data = brem descending ;
      model inf = optime age / dist = binomial link = logit ;
      estimate "Operation" optime 1 / exp ;
      estimate "age" age 1 / exp ;
    run ;

Analyst: Statistics/Regression/Logistic
- click Single trial under Dependent type
- choose inf as Dependent; specify Model Pr{..} as 1
- choose optime and age as Quantitative

SAS Output

The LOGISTIC Procedure

Model Information
Data Set                     WORK.BREM
Response Variable            inf
Number of Response Levels    2
Number of Observations       194
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered Value    inf    Total Frequency
[values missing in the source]

Probability modeled is inf=1.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC
SC
-2 Log L
[values missing in the source]

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept                                                           <.0001
optime
age
[remaining values missing in the source]

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
optime
age
[values missing in the source]

or, from PROC GENMOD:

The GENMOD Procedure

Model Information
Data Set              WORK.BREM
Distribution          Binomial
Link Function         Logit
Dependent Variable    inf
Observations Used     194

PROC GENMOD is modeling the probability that inf=1.

Criteria For Assessing Goodness Of Fit
Criterion          DF    Value    Value/DF
Deviance
Scaled Deviance
[values missing in the source]

Analysis Of Parameter Estimates
Parameter    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept                                                                              <.0001
optime
age
Scale
[remaining values missing in the source]

Contrast Estimate Results
Label             Estimate    Standard Error    Confidence Limits    Chi-Square    Pr > ChiSq
Operation
Exp(Operation)
age
Exp(age)
[values missing in the source]

Confidence intervals

$$(1 - \alpha)\ \text{c.i.} = \text{estimate} \pm z_{1 - \alpha/2} \cdot \text{std.error}$$

95% confidence interval for the OR associated with a difference of 1 year in age at operation:
For ln(OR): estimate ± 1.96 × std.error = (lower ; upper)
For OR: exp[(lower ; upper)] = ($e^{\text{lower}}$ ; $e^{\text{upper}}$)
[numerical values missing in the source]

95% confidence interval for the OR associated with a difference of 10 years in age at operation:
For ln(OR): 10 × estimate ± 1.96 × 10 × std.error = (lower ; upper)
For OR: exp[(lower ; upper)] = ($e^{\text{lower}}$ ; $e^{\text{upper}}$)
[numerical values missing in the source]
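The confidence interval recipe above (compute the interval on the log-odds scale, then exponentiate) could look like this in Python. The standard error used below is a hypothetical placeholder, since the SAS output values are missing here; the estimate 0.0353 is the per-year age coefficient implied by ln(OR) = 0.353 per 10 years.

```python
import math

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution

def wald_ci_odds_ratio(estimate, std_error, units=1.0, z=Z_975):
    """Odds ratio and Wald c.i. for a covariate difference of `units` (> 0).

    The interval is formed for ln(OR) = units * estimate and then
    exponentiated, as on the slide.
    """
    log_or = units * estimate
    half_width = z * units * std_error
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))

B_AGE = 0.0353     # implied by the slides (0.353 / 10)
SE_AGE = 0.0125    # hypothetical standard error, for illustration only

or_1, lo_1, hi_1 = wald_ci_odds_ratio(B_AGE, SE_AGE)
or_10, lo_10, hi_10 = wald_ci_odds_ratio(B_AGE, SE_AGE, units=10)
print(f" 1 year:  OR = {or_1:.3f} ({lo_1:.3f} ; {hi_1:.3f})")
print(f"10 years: OR = {or_10:.3f} ({lo_10:.3f} ; {hi_10:.3f})")
```

The 10-year limits are the 1-year limits raised to the 10th power, which is the same relationship the `estimate ... age 10 / exp` statement in PROC GENMOD exploits.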

Confidence intervals in SAS using Proc Genmod

Program:

    proc genmod data = brem descending ;
      model inf = optime age / dist = binomial ;
      estimate "Op60" optime 60 / exp ;
      estimate "A10" age 10 / exp ;
    run ;

Output:

Parameter    Estimate    Standard Error    Wald 95% Conf. Limits    Chi-Square    Pr > ChiSq
Op60
Exp(Op60)
A10
Exp(A10)
[values missing in the source]

Effect of scaling and centering of covariates

Program:

    data brem ;
      set brem ;
      a50 = ( age - 50 ) / 10 ;
      op1 = ( optime - 60 ) / 60 ;
    run ;

    proc logistic data = brem descend ;
      model inf = op1 a50 ;
    run ;

Scaling and centering

Output:

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept                                                           <.0001
op1
a50
[remaining values missing in the source]

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
op1
a50
[values missing in the source]

The intercept refers to the log(odds) for a person with value 0 in all covariates, but this is now a person aged 50 years, operated on for 1 hour.

If the covariates are divided by a factor:
- the estimates are multiplied by this factor;
- the standard errors are multiplied by this factor;
- Wald's test and the p values remain the same.

If the covariates are centered around a value:
- the estimates for the covariates are not changed (only the intercept changes);
- the standard errors for the covariates are not changed;
- Wald's test and the p values for the covariates remain the same.
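The scaling and centering rules follow from rewriting the linear predictor. A short derivation for one covariate, using a centering value c and a scaling factor k (symbols introduced here for illustration):

```latex
\begin{aligned}
b_0 + b_1 x
  &= (b_0 + b_1 c) + (k\, b_1)\,\frac{x - c}{k}
   = b_0^{*} + b_1^{*} x^{*},
  \qquad x^{*} = \frac{x - c}{k},\\[4pt]
b_1^{*} &= k\, b_1, \qquad
\operatorname{se}(b_1^{*}) = k \operatorname{se}(b_1), \qquad
\frac{b_1^{*}}{\operatorname{se}(b_1^{*})} = \frac{b_1}{\operatorname{se}(b_1)},\\[4pt]
b_0^{*} &= b_0 + b_1 c .
\end{aligned}
```

With the choices in the program above (age centered at 50 and divided by 10, optime centered at 60 and divided by 60), the new intercept is $b_0 + 60\,b_1 + 50\,b_2$, i.e. the log-odds for a 50-year-old operated on for 1 hour, while the Wald statistics for op1 and a50 equal those for optime and age.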

The intercept refers to the log-odds for a person whose covariate values equal those used for centering.

$$\widehat{\text{odds}}_{50,60} = \exp(\hat b_0), \qquad \hat p_{50,60} = \frac{1}{1 + 1/\widehat{\text{odds}}_{50,60}} = 0.052$$

$$\text{c.i.}(\text{odds}_{50,60}) = \exp(\hat b_0 \pm 1.96 \cdot \text{std.err.}) = (\text{odds}_L, \text{odds}_U)$$

$$\text{c.i.}(p_{50,60}) = \left(\frac{1}{1 + 1/\text{odds}_L},\ \frac{1}{1 + 1/\text{odds}_U}\right) = (0.024,\ 0.112)$$

[the intermediate numerical values are missing in the source]

The infection probability for a "0-person" (50 years old, operated on for 1 hour) is 0.052, with a 95% c.i. of (0.024, 0.112).

Model reduction: Wald's test

For testing the importance of a single covariate, e.g. $H_0: \beta_k = 0$. Under $H_0$, we have approximately:

$$\frac{\text{estimate}}{\text{std.err.}} \sim N(0, 1) \qquad \text{or:} \qquad \left(\frac{\text{estimate}}{\text{std.err.}}\right)^2 \sim \chi^2_1$$

SAS calculates this by default, for each parameter separately.

Model reduction: likelihood-ratio test

$$-2 \ln(\text{likelihood ratio}) \sim \chi^2_{df}$$

The likelihood ratio is the ratio between the maximized likelihood functions under two different models, where the smaller model lacks df (one or more) parameters.

The deviance is the likelihood-ratio test statistic for comparing the current model vs. a model with one parameter per observation. Thus, the corresponding df is the number of observations minus the number of parameters in the current model.

E.g., in our example we have 194 observations and 3 parameters in our model (intercept, optime, age), so the deviance has df = 191.

The deviance on its own is not meaningful! However, the difference in deviances between two (nested) models corresponds to the likelihood-ratio test between the two models. It is assessed using the $\chi^2$ distribution with df equal to the difference in the numbers of parameters in the two models.

E.g., test of the model with both optime and age vs. the model with only optime: deviance on 191 df vs. deviance on 192 df: $\chi^2$ = difference in deviances = 7.869, df = 1, p = [value missing in the source] (a bit different from the Wald test...)
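For a test with 1 degree of freedom, the p value can be computed without statistical libraries, because a $\chi^2_1$ variable is the square of a standard normal. The sketch below applies this to the deviance difference of 7.869 quoted above; the function names are chosen here for illustration.

```python
import math

def chi2_sf_df1(x):
    """Upper tail P(chi2_1 > x), using chi2_1 = Z^2: P(|Z| > sqrt(x)) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

def wald_chi2(estimate, std_error):
    """Wald chi-square statistic (estimate / std.err.)^2 for H0: beta = 0."""
    return (estimate / std_error) ** 2

# Likelihood-ratio test: difference in deviance between nested models, df = 1.
lr_statistic = 7.869
p_value = chi2_sf_df1(lr_statistic)
print(f"LR chi-square = {lr_statistic}, df = 1, p = {p_value:.4f}")
```

This gives a p value of roughly 0.005. The shortcut only works for df = 1; tests with more degrees of freedom need the general chi-square distribution function.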

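Data from DGA: 2 × k table with ordered categories

Caesarean section (CS: Yes / No) by shoe size, in 6 ordered categories, with row and column totals.
[category labels and cell counts missing in the source]

Recall (lecture on categorical data):
$\chi^2$ test for independence: 9.29, with 5 df; p = [value missing in the source]

Partition of the $\chi^2$ test into tests for linearity and for trend:

$$\chi^2_{\text{total}}(5) = \chi^2_{\text{lin}}(4) + \chi^2_{\text{trend}}(1)$$

9.29 = [components missing in the source]

Logistic regression:

Model                               deviance    df    p
$\operatorname{logit}(p_i) = \beta_i$
    Test for linearity
$\operatorname{logit}(p_i) = \alpha + \beta s_i$
    Test for trend
$\operatorname{logit}(p_i) = \mu$
[values missing in the source]

Analysis of shoe size data:

    data shoe ;
      input cs $ shoeno number ;
      cards ;
    [12 data lines: six with cs = Y and six with cs = N; the shoeno and number values are missing in the source]
    ;
    run ;

Direct programming:

Shoeno as class variable:

    proc logistic data = shoe descend ;
      weight number ;
      class shoeno / param = ref ref = last ;
      model cs = shoeno ;
    run ;

Shoeno as quantitative variable:

    proc logistic data = shoe descend ;
      weight number ;
      model cs = shoeno ;
    run ;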

Analyst:

Shoeno as class variable: Statistics/Regression/Logistic
- choose cs as Dependent, and shoeno as Class
- under Variables, choose number as Weight
- double-click the Code node and copy the program to the editor
- add two options to the class statement: class shoeno / param=ref ref=last

For a direct interpretation of the parameter estimates, the last 2 steps are essential! (If the 1st level should be the reference, use ref=first instead...)

Shoeno as quantitative variable: select shoeno as Quantitative instead of Class.

Full model (shoe size as a class variable)

Response Profile
Ordered Value    cs    Total Frequency    Total Weight
1                Y
2                N
[values missing in the source]

Probability modeled is cs=Y.

Class Level Information
Class     Value    Design Variables
shoeno
[values missing in the source]

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC
SC
-2 Log L
[values missing in the source]

Type III Analysis of Effects
Effect    DF    Wald Chi-Square    Pr > ChiSq
shoeno
[values missing in the source]

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept                                                           <.0001
shoeno (five rows, one per non-reference level)
[level labels and remaining values missing in the source]

Odds Ratio Estimates
Effect               Point Estimate    95% Wald Confidence Limits
shoeno 4 vs ...
shoeno 5 vs ...
shoeno 6 vs ...
shoeno 3.5 vs ...
shoeno 4.5 vs ...
[reference level and values missing in the source]

$p := P(\text{cs} = \text{Y} \mid \text{shoeno} = 3.5) = ?$
$\widehat{OR}$ = [value missing in the source]
$\hat p = \exp(\hat\eta) / (1 + \exp(\hat\eta))$ = [value missing in the source], where $\hat\eta$ is the estimated log-odds for shoeno = 3.5

Model with linear effect of shoe size

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC
SC
-2 Log L
[values missing in the source]

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio
Score
Wald
[values missing in the source]

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept
shoeno
[values missing in the source]

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
shoeno
[values missing in the source]

Model comparisons

model             deviance    df    Difference: deviance    df    p
full
linear
intercept only
[values missing in the source]

Exercise: Use the output to calculate the predicted probability of a caesarean section for women with shoe sizes 4, 5 and 6, respectively, from the model with a linear effect of shoe size.

The test in the second-to-last line is the trend test.

Case-control studies

In a case-control study, one chooses:
- cases (diseased), as verified from a register or similar;
- controls, who are persons representing the population to which the cases belong.

Thus, persons in case-control studies are chosen according to the outcome. Typically, the proportion of cases and controls will be specified beforehand.

If a variable is important for the development of the disease, its distribution will differ between cases and controls.

The probability of being a case (in the population), P{disease}, cannot be estimated from a case-control study. But the effects of covariates on the disease probability can be estimated!

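Case-control studies

Prevalence in the population: $p = P\{\text{case}\}$, with $\dfrac{p}{1 - p} = \text{odds}(\text{case})$

Selection fractions, i.e. inclusion probabilities $\pi_0$ and $\pi_1$:
$P\{\text{inclusion in study} \mid \text{case}\} = \pi_1$
$P\{\text{inclusion in study} \mid \text{control}\} = \pi_0$

In a case-control study one observes the number of cases and the number of controls, conditional on their actually being in the study. These depend on various covariates (which one is interested in) and on the inclusion probabilities (which one is not interested in).

[Tree diagram: the population splits into case (probability $p$) and control (probability $1 - p$); a case is included with probability $\pi_1$ and not included with probability $1 - \pi_1$; a control is included with probability $\pi_0$ and not included with probability $1 - \pi_0$.]

$$P\{\text{case \& included}\} = p\,\pi_1, \qquad P\{\text{control \& included}\} = (1 - p)\,\pi_0$$

$$\text{odds}(\text{case} \mid \text{included}) = \frac{p\,\pi_1}{(1 - p)\,\pi_0} = \frac{p}{1 - p} \cdot \frac{\pi_1}{\pi_0}$$

Logistic regression

Model for the population:

$$\ln\left[\frac{p}{1 - p}\right] = b_0 + b_1 x_1 + b_2 x_2$$

Model for the observed:

$$\ln[\text{odds}(\text{case} \mid \text{incl.})] = \ln\left[\frac{p}{1 - p}\right] + \ln\left[\frac{\pi_1}{\pi_0}\right] = \left(\ln\left[\frac{\pi_1}{\pi_0}\right] + b_0\right) + b_1 x_1 + b_2 x_2$$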

Analysis of $P(\text{case} \mid \text{inclusion})$, i.e. binary observations: Y = 1 for a case, Y = 0 for a control.

Effects of covariates are estimated correctly!

The intercept has no meaning: it depends on $\pi_0$ and $\pi_1$, which are usually unknown.
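The claim that case-control sampling shifts only the intercept can be verified numerically. The sketch below uses one covariate and made-up values for the population coefficients and the inclusion probabilities; all numbers are hypothetical and serve only to check the algebra.

```python
import math

def inv_logit(eta):
    """Population probability of being a case for linear predictor eta."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def observed_log_odds(b0, b1, x, pi_case, pi_control):
    """ln odds(case | included) when cases and controls are sampled with
    inclusion probabilities pi_case (pi_1) and pi_control (pi_0)."""
    p = inv_logit(b0 + b1 * x)
    p_case_included = p * pi_case
    p_control_included = (1.0 - p) * pi_control
    return math.log(p_case_included / p_control_included)

# Hypothetical population model and selection fractions.
B0, B1 = -4.0, 0.5
PI_CASE, PI_CONTROL = 0.9, 0.01

at_0 = observed_log_odds(B0, B1, 0.0, PI_CASE, PI_CONTROL)
at_1 = observed_log_odds(B0, B1, 1.0, PI_CASE, PI_CONTROL)

print("observed intercept:", at_0)
print("b0 + ln(pi1/pi0):  ", B0 + math.log(PI_CASE / PI_CONTROL))
print("observed slope:    ", at_1 - at_0)
```

The observed slope equals the population slope, while the observed intercept is offset by ln(π1/π0), which is why the intercept from a case-control analysis cannot be turned into a disease probability without knowing the selection fractions.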

The GENMOD Procedure: MODEL Statement

    MODEL response = < effects > < /options > ;
    MODEL events/trials = < effects > < /options > ;

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.