Logistic Regression


Usual linear regression (repetition)

    y_i = b_0 + b_1·x_1i + b_2·x_2i + e_i,   e_i ~ N(0, σ²)

or:

    y_i ~ N(b_0 + b_1·x_1i + b_2·x_2i, σ²)

Example (DGA, p. 336):

    E(PEmax) = 47.355 + 1.024·weight + 0.147·height

Interpretation of linear regression

For a given height, PEmax grows by 1.024 cmH2O per kg body weight. For a given weight, PEmax grows by 0.147 cmH2O per cm body height. The effect of a single explanatory variable is conditional on the other variables present in the model. The effect of each explanatory variable is linear.

Other types of outcomes

A 0-1 variable, or a number/frequency: these are integers, so the error term cannot be normal. Instead we look at the mean value, but still we have the problem

    E(y) = b_0 + b_1·x_1i + b_2·x_2i

where the mean of a 0-1 variable X is E(X) = P(X = 1) = p ∈ [0, 1], and a count has its mean value in [0, +∞).

0-1 response variable: Wound infection

(dependence on age and on operation time?)

    p   inf  optime  age     p   inf  optime  age     p    inf  optime  age     p    inf  optime  age
    1    1    140     76    51    0     25     30    101    0    120     63    151    0    120     66
    2    0    190     71    52    1    240     73    102    0     90     63    152    0    120     64
    3    0    150     80    53    0    180     79    103    0     40     25    153    0     90     77

[The listing continues in this four-block layout for all 194 patients (p = patient number, inf = wound infection 0/1, optime = operation time in minutes, age in years); the complete data set is entered into SAS later in the notes.]

    p   inf  optime  age     p   inf  optime  age     p    inf  optime  age     p    inf  optime  age
    23   0     50     20    73    0    150     61    123    0     80     79    173    1    205     61

[The listing continues: patients 23-50, 73-100, 123-150 and 173-194.]

Analysis of a 0-1 response variable

The response variable is binary (0/1): how is the dependence on operation time (optime) and age (age) described?

A model for p = P{wound infection} (∈ [0, 1])? It is not good to use

    p = a + b_1·x_1 + b_2·x_2

since this would usually not stick to [0, 1]...

Logistic regression

Binary outcome (e.g. 1 for "success"): Y ∈ {0, 1}
Probability of success: p = P{Y = 1} ∈ [0, 1]
Odds of success: ω = p/(1 − p) ∈ [0, +∞), so that p = ω/(1 + ω)
Odds ratio (2 groups): OR = [p_1/(1 − p_1)] / [p_2/(1 − p_2)] ∈ [0, +∞)
Log-odds: logit(p) = ln(p/(1 − p)) ∈ (−∞, +∞)

Logistic regression (ctd.)

logit is the link function.

Linear predictor: logit(p) = b_0 + b_1·x_1 + b_2·x_2 = η
Predicted odds: ω = exp(η)
Predicted probability: p = ω/(1 + ω) = exp(η)/(1 + exp(η))
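The link-function relations above can be checked numerically. A minimal sketch in Python (the notes themselves use SAS); the group probabilities p1 and p2 below are made-up illustration values:

```python
import math

def logit(p):
    """Log-odds (the logit link): ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Predicted probability from the linear predictor: exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1 - p)

# Odds ratio for two hypothetical group probabilities
p1, p2 = 0.2, 0.1
OR = odds(p1) / odds(p2)
```

Note that inv_logit undoes logit, exactly as p = ω/(1 + ω) undoes ω = p/(1 − p).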

Logistic regression: interpretation

Two groups, with probabilities p_1 and p_2:

    logit(p_1) − logit(p_2) = ln(p_1/(1 − p_1)) − ln(p_2/(1 − p_2))
                            = ln{ [p_1/(1 − p_1)] / [p_2/(1 − p_2)] }
                            = ln(OR)

A linear model for logit(p) yields comparisons via odds ratios.

Logistic regression in the wound infection data

    Y = 1 for post-operative wound infection, 0 for no post-operative wound infection
    p = P{post-operative wound infection}
    x_1 = operation time in minutes
    x_2 = age in years

Estimated model:

    logit(p) = −5.1144 + 0.00753·x_1 + 0.0353·x_2
    p = exp(−5.1144 + 0.00753·x_1 + 0.0353·x_2) / [1 + exp(−5.1144 + 0.00753·x_1 + 0.0353·x_2)]

Interpretation of logistic regression

Same operation time (T), age difference of 10 years (A + 10 vs. A):

    logit(p_1) = −5.1144 + 0.00753·T + 0.0353·(A + 10)
    logit(p_2) = −5.1144 + 0.00753·T + 0.0353·A
    ln(OR_{A+10,A}) = 0.0353·10
    OR_{A+10,A} = exp(0.353) = 1.423

What does that mean? If age increases by 10 years, the odds of getting a wound infection increase by a factor of 1.423, i.e. by 42.3%. The odds ratio refers to the difference in odds for disease between two levels of an explanatory variable.
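A quick numerical check of the 10-year odds ratio, sketched in Python rather than SAS, using the age coefficient from the fitted model above:

```python
import math

b_age = 0.0353          # estimated age coefficient from the fitted model
ln_or = b_age * 10      # ln(OR) for a 10-year age difference
OR_10y = math.exp(ln_or)  # about 1.423, matching the notes
```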

Calculation of probabilities:

    logit(p) = ln(p/(1 − p)) = b_0 + b_1·x_1i + b_2·x_2i
    p = exp(b_0 + b_1·x_1i + b_2·x_2i) / [1 + exp(b_0 + b_1·x_1i + b_2·x_2i)]
    1 − p = 1 / [1 + exp(b_0 + b_1·x_1i + b_2·x_2i)]

The example yields:

    logit(p{optime = 200 min, age = 60 years}) = −5.1144 + 0.00753·200 + 0.0353·60
                                               = −5.1144 + 1.506 + 2.118 = −1.490
    p = e^(−1.490) / (1 + e^(−1.490)) = 0.2254 / 1.2254 = 0.1839

[Figure: Dependence of p on age for different operation times. Predicted probability (0.0 to 1.0) plotted against age (0-100 years) for operation times of 30, 120 and 240 min.]

[Figure: Dependence of p on operation time for different ages. Predicted probability (0.0 to 1.0) plotted against operation time (0-600 min) for ages 50, 60 and 75 years.]
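The same worked example, sketched in Python with the coefficients from the fitted model:

```python
import math

def predict_p(optime, age):
    """Predicted infection probability from the fitted model in the notes."""
    eta = -5.1144 + 0.00753 * optime + 0.0353 * age
    return math.exp(eta) / (1 + math.exp(eta))

# 200-minute operation on a 60-year-old: about 0.184
p = predict_p(200, 60)
```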

What does the intercept mean here?

    logit(p) = ln(p/(1 − p)) = b_0 + b_1·x_1i + b_2·x_2i
    x_1i = x_2i = 0  ⇒  ln(p/(1 − p)) = b_0

The intercept is the log-odds for disease in a person with 0 in all covariates. In the wound infection case this would be a person of 0 years who is operated on for 0 minutes, which is not very meaningful!

Wound infection data analyzed in SAS

Direct input of data:

    data brem ;
      input inf optime age ;
      cards ;
    1 140 76
    0 190 71
    0 150 80
    : :
    0 50 13
    0 45 86
    ;
    run ;

Analyst: Open...

Direct programming:

    proc logistic data = brem descend;
      model inf = optime age;
    run;

(or:

    proc genmod data = brem descending ;
      model inf = optime age / dist = binomial link = logit ;
      estimate "Operation" optime 1 / exp ;
      estimate "age" age 1 / exp ;
    run ;)

Analyst: Statistics/Regression/Logistic; click Single trial under "Dependent type"; choose inf as Dependent; specify "Model Pr{..}" as 1; choose optime and age as Quantitative.

SAS output:

    The LOGISTIC Procedure

    Model Information
    Data Set                    WORK.BREM
    Response Variable           inf
    Number of Response Levels   2
    Number of Observations      194
    Model                       binary logit
    Optimization Technique      Fisher's scoring

    Response Profile
    Ordered            Total
    Value    inf   Frequency
        1      1          23
        2      0         171

    Probability modeled is inf=1.

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion            Only       Covariates
    AIC               143.247          130.007
    SC                146.515          139.811
    -2 Log L          141.247          124.007

    Analysis of Maximum Likelihood Estimates
                             Standard        Wald
    Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
    Intercept   1   -5.1142    1.1041     21.4568      <.0001
    optime      1   0.00753   0.00316      5.6815      0.0171
    age         1    0.0353    0.0145      5.9023      0.0151

    Odds Ratio Estimates
               Point       95% Wald
    Effect  Estimate  Confidence Limits
    optime     1.008    1.001    1.014
    age        1.036    1.007    1.066

(or, with PROC GENMOD:)

    The GENMOD Procedure

    Model Information
    Data Set             WORK.BREM
    Distribution         Binomial
    Link Function        Logit
    Dependent Variable   inf
    Observations Used    194

    PROC GENMOD is modeling the probability that inf=1.

    Criteria For Assessing Goodness Of Fit
    Criterion         DF      Value   Value/DF
    Deviance         191   124.0070     0.6493
    Scaled Deviance  191   124.0070     0.6493

    Analysis Of Parameter Estimates
                          Standard      Wald 95%           Chi-
    Parameter  Estimate      Error  Confidence Limits    Square  Pr > ChiSq
    Intercept   -5.1144     1.1041  -7.2785  -2.9504      21.46      <.0001
    optime       0.0075     0.0032   0.0013   0.0137       5.68      0.0171
    age          0.0353     0.0145   0.0068   0.0638       5.90      0.0151
    Scale        1.0000     0.0000   1.0000   1.0000

    Contrast Estimate Results
                             Standard                        Chi-
    Label           Estimate    Error  Confidence Limits   Square  Pr > ChiSq
    Operation         0.0075   0.0032   0.0013   0.0137      5.68      0.0171
    Exp(Operation)    1.0076   0.0032   1.0013   1.0138
    age               0.0353   0.0145   0.0068   0.0638      5.90      0.0151
    Exp(age)          1.0360   0.0151   1.0069   1.0659

Confidence intervals

    (1 − α) c.i. = estimate ± z_{1−α/2}·std.error

95% confidence interval for the OR associated with a difference of 1 year in age at operation:

For ln(OR): 0.035325 ± 1.96·0.014516 = (0.006874; 0.063776)
For OR: exp[(0.006874; 0.063776)] = (1.006897; 1.065854)
or: e^0.035325 · e^(±1.96·0.014516) = (1.006897; 1.065854)

95% confidence interval for the OR associated with a difference of 10 years in age at operation:

For ln(OR): 10·0.035325 ± 1.96·10·0.014516 = (0.068736; 0.637764)
For OR: exp[(0.068736; 0.637764)] = (1.071154; 1.892244)
or: e^(10·0.035325) · e^(±1.96·10·0.014516) = (1.071154; 1.892244) = (1.006897^10; 1.065854^10)
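The Wald confidence interval arithmetic above can be reproduced directly; a sketch in Python, using the age estimate and standard error from the SAS output:

```python
import math

b, se = 0.035325, 0.014516   # age estimate and standard error from the SAS output
z = 1.96                     # approximate 97.5% standard-normal quantile

lo, hi = b - z * se, b + z * se                      # 95% CI for ln(OR), 1-year difference
or_lo, or_hi = math.exp(10 * lo), math.exp(10 * hi)  # 95% CI for OR, 10-year difference
```

Note how the 10-year interval is just the 1-year interval on the log scale multiplied by 10, hence the identity (1.006897^10; 1.065854^10) in the notes.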

Confidence intervals in SAS using PROC GENMOD

Program:

    proc genmod data = brem descending ;
      model inf = optime age / dist = binomial ;
      estimate "Op60" optime 60 / exp ;
      estimate "A10" age 10 / exp ;
    run ;

Output:

                          Standard     Wald 95%          Chi-
    Parameter  Estimate      Error  Conf. Limits       Square  Pr > ChiSq
    Op60         0.4518     0.1896  0.0803  0.8234       5.68      0.0171
    Exp(Op60)    1.5712     0.2978  1.0836  2.2781
    A10          0.3533     0.1454  0.0683  0.6382       5.90      0.0151
    Exp(A10)     1.4237     0.2070  1.0707  1.8931

Effect of scaling and centering of covariates

Program:

    data brem ;
      set brem ;
      a50 = ( age - 50 ) / 10 ;
      op1 = ( optime - 60 ) / 60 ;
    run ;
    proc logistic data = brem descend;
      model inf = op1 a50;
    run;

Output:

    Analysis of Maximum Likelihood Estimates
                            Standard        Wald
    Parameter  DF  Estimate    Error  Chi-Square  Pr > ChiSq
    Intercept   1   -2.8962   0.4228     46.9216      <.0001
    op1         1    0.4518   0.1896      5.6815      0.0171
    a50         1    0.3532   0.1454      5.9023      0.0151

    Odds Ratio Estimates
              Point       95% Wald
    Effect  Estimate  Confidence Limits
    op1        1.571    1.084    2.278
    a50        1.424    1.071    1.893

Scaling and centering

The intercept refers to the log-odds for a person with value 0 in all covariates, but this is now a person of 50 years, operated on for 1 hour.

If the covariates are divided by a factor:
- the estimates are multiplied by this factor,
- the standard deviations are multiplied by this factor,
- Wald's test and the p-values remain the same.

If the covariates are centered around a value:
- the estimates are not changed,
- the standard deviations are not changed,
- Wald's test and the p-values remain the same.
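The scaling rule can be verified from the two SAS fits: dividing a covariate by a factor multiplies its coefficient by that factor. A sketch in Python:

```python
b_optime, b_age = 0.00753, 0.0353   # estimates on the original scale

# op1 = (optime - 60) / 60  =>  coefficient is multiplied by 60
b_op1 = b_optime * 60   # 0.4518, as in the op1 row of the rescaled fit
# a50 = (age - 50) / 10   =>  coefficient is multiplied by 10
b_a50 = b_age * 10      # 0.353, as in the a50 row (up to rounding)
```

Centering (the "- 60" and "- 50" parts) affects only the intercept, which is why the op1 and a50 estimates depend only on the divisors.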

The intercept refers to the log-odds for a person with covariate values equal to those used for centering.

    odds_{50,60} = exp(−2.8962) = 0.05523
    p̂_{50,60} = 0.05523/(1 + 0.05523) = 0.05234
    c.i.(odds_{50,60}) = exp(−2.8962 ± 1.96·0.4228) = (0.02412, 0.12650)
    c.i.(p_{50,60}) = (0.02412/(1 + 0.02412), 0.1265/(1 + 0.1265)) = (0.02355, 0.11229)

The infection probability for a "0-person" (50 years old, operated on for 1 hour) is 0.052, with a 95% c.i. of (0.024, 0.112).

Model reduction: Wald's test

For testing the importance of a single covariate, e.g. H_0: β_k = 0. Under H_0, we have approximately:

    estimate/std.err. ~ N(0, 1)

or:

    (estimate/std.err.)² ~ χ²_1

This is calculated in SAS by default, for each parameter separately.

Model reduction: likelihood-ratio test

    −2·ln(likelihood ratio) ~ χ²_df

The likelihood ratio is the ratio between the maximized likelihood functions under two different models, where the smaller model lacks df (one or more) parameters.

The deviance is the likelihood-ratio test statistic for comparing the current model vs. a model with one parameter per observation. Thus, the corresponding df is the number of observations minus the number of parameters in the current model. E.g., in our example we have 194 observations and 3 parameters in our model (intercept, optime, age), so the deviance has df = 191.

The deviance on its own is not meaningful! However, the difference in deviances between two (nested) models corresponds to the likelihood-ratio test between the two models. It is assessed with the help of the χ² distribution with df equal to the difference in the numbers of parameters in the two models.

E.g., test of the model with both optime and age vs. the model with only optime: 124.007 (df 191) vs. 131.876 (df 192):

    χ² = 131.876 − 124.007 = 7.869,  df = 1,  p = 0.005

(a bit different from the Wald test...)
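The likelihood-ratio test above can be reproduced with plain arithmetic; a sketch in Python, using the identity P(χ²_1 > x) = erfc(√(x/2)) so that no statistics library is needed:

```python
import math

def chi2_sf_df1(x):
    """Upper tail of the chi-square distribution with 1 df, via the normal tail."""
    return math.erfc(math.sqrt(x / 2))

dev_full, dev_reduced = 124.007, 131.876   # deviances from the SAS output
lr = dev_reduced - dev_full                # likelihood-ratio statistic, df = 1
p_value = chi2_sf_df1(lr)                  # about 0.005, matching the notes
```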

Data from DGA: a 2 × k table with ordered categories

CS = caesarean section; columns are shoe sizes.

    Shoe size    < 4     4   4.5     5   5.5     6   Total
    CS yes         5     7     6     7     8    10      43
    CS no         17    28    36    41    46   140     308
    Total         22    35    42    48    54   150     351

Recall (lecture on categorical data): χ² test for independence: 9.29, with 5 df; p = 0.098.

Partition of the χ² test into tests for linearity and for trend:

    χ²_total(5) = χ²_lin(4) + χ²_trend(1)
    9.29 = 1.27 + 8.02

Logistic regression:

    Model                     deviance   df      p
    logit(p_i) = β_i              0.00    0
      Test for linearity          1.78    4  0.775
    logit(p_i) = α + β·s_i        1.78    4  0.775
      Test for trend              7.56    1  0.005
    logit(p_i) = µ                9.34    5  0.096

Analysis of the shoe size data:

    data shoe ;
      input cs $ shoeno number ;
      cards ;
    Y 3.5   5
    Y 4.0   7
    Y 4.5   6
    Y 5.0   7
    Y 5.5   8
    Y 6.0  10
    N 3.5  17
    N 4.0  28
    N 4.5  36
    N 5.0  41
    N 5.5  46
    N 6.0 140
    ;
    run;

Direct programming.

Shoeno as class variable:

    proc logistic data=shoe descend;
      weight number;
      class shoeno / param=ref ref=last;
      model cs = shoeno;
    run;

Shoeno as quantitative variable:

    proc logistic data=shoe descend;
      weight number;
      model cs = shoeno;
    run;
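The χ² independence statistic for the 2 × 6 table can be recomputed from the counts; a sketch in Python:

```python
# Pearson chi-square test for independence in the 2 x 6 shoe-size table
yes = [5, 7, 6, 7, 8, 10]      # CS yes, by shoe-size category
no = [17, 28, 36, 41, 46, 140] # CS no, by shoe-size category

n = sum(yes) + sum(no)
row_totals = [sum(yes), sum(no)]
col_totals = [y + m for y, m in zip(yes, no)]

chi2 = 0.0
for obs_row, row_total in zip([yes, no], row_totals):
    for obs, col_total in zip(obs_row, col_totals):
        expected = row_total * col_total / n
        chi2 += (obs - expected) ** 2 / expected
# chi2 is about 9.29 with (2 - 1) * (6 - 1) = 5 df, as quoted in the notes
```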

Analyst.

Shoeno as class variable: Statistics/Regression/Logistic; choose cs as Dependent and shoeno as Class under Variables; choose number as Weight. Double-click the Code node, copy the program to the editor, and add two options to the class statement:

    class shoeno / param=ref ref=last

For a direct interpretation of the parameter estimates, the last 2 steps are essential! (If the 1st level should be the reference, use ref=first instead...)

Shoeno as quantitative variable: select shoeno as Quantitative instead of Class.

Full model (shoe size as a class variable)

    Response Profile
    Ordered            Total       Total
    Value    cs    Frequency      Weight
        1    Y             6    43.00000
        2    N             6   308.00000

    Probability modeled is cs=Y.

    Class Level Information
                      Design Variables
    Class   Value     1  2  3  4  5
    shoeno  4         1  0  0  0  0
            5         0  1  0  0  0
            6         0  0  1  0  0
            3.5       0  0  0  1  0
            4.5       0  0  0  0  1
            5.5       0  0  0  0  0

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion            Only       Covariates
    AIC               263.067          263.723
    SC                263.552          266.632
    -2 Log L          261.067          251.723

    Type III Analysis of Effects
                      Wald
    Effect  DF  Chi-Square  Pr > ChiSq
    shoeno   5      8.6369      0.1245

    Analysis of Maximum Likelihood Estimates
                                Standard        Wald
    Parameter    DF  Estimate      Error  Chi-Square  Pr > ChiSq
    Intercept     1   -1.7492     0.3831     20.8513      <.0001
    shoeno 4      1    0.3629     0.5704      0.4049      0.5246
    shoeno 5      1   -0.0185     0.5603      0.0011      0.9737
    shoeno 6      1   -0.8898     0.5039      3.1187      0.0774
    shoeno 3.5    1    0.5255     0.6368      0.6808      0.4093
    shoeno 4.5    1   -0.0426     0.5841      0.0053      0.9419

    Odds Ratio Estimates
                         Point       95% Wald
    Effect            Estimate  Confidence Limits
    shoeno 4 vs 5.5      1.438    0.470    4.396
    shoeno 5 vs 5.5      0.982    0.327    2.944
    shoeno 6 vs 5.5      0.411    0.153    1.103
    shoeno 3.5 vs 5.5    1.691    0.485    5.892
    shoeno 4.5 vs 5.5    0.958    0.305    3.011

p := P(cs = Y | shoeno = 3.5) = ?

    logit(p̂) = −1.7492 + 0.5255 = −1.2237
    p̂ = exp(−1.2237)/(1 + exp(−1.2237)) = 0.2273

Model with linear effect of shoe size

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion            Only       Covariates
    AIC               263.067          257.508
    SC                263.552          258.477
    -2 Log L          261.067          253.508

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square  DF  Pr > ChiSq
    Likelihood Ratio        7.5597   1      0.0060
    Score                   8.0237   1      0.0046
    Wald                    7.6971   1      0.0055

    Analysis of Maximum Likelihood Estimates
                             Standard        Wald
    Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
    Intercept   1    0.6877    0.9462      0.5283      0.4673
    shoeno      1   -0.5194    0.1872      7.6971      0.0055

    Odds Ratio Estimates
              Point       95% Wald
    Effect  Estimate  Confidence Limits
    shoeno     0.595    0.412    0.859
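The predicted probability for shoe size 3.5 from the class-variable model, sketched in Python: the intercept is the reference level (shoeno = 5.5), and 0.5255 is the estimated contrast for shoeno = 3.5:

```python
import math

# Fitted class-variable model, estimates from the SAS output above
eta = -1.7492 + 0.5255                    # logit for shoeno = 3.5
p_hat = math.exp(eta) / (1 + math.exp(eta))  # about 0.227
```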

Model comparisons

                                            Difference
    model            deviance  df    deviance  df       p
    full             251.7230   6      1.7845   4  0.7753
    linear           253.5075  10      7.5598   1  0.0060
    intercept only   261.0673  11

The test in the second-last line is the trend test.

Exercise: use the output to calculate the predicted values for the probability of a caesarean section for women with shoe sizes 4, 5 and 6, respectively, from the model with a linear effect of shoe size.

Case-control studies

In a case-control study, one chooses:
- cases (diseased), as verified from a register or similar;
- controls: persons representing the population to which the cases belong.

Thus, persons in case-control studies are chosen according to the outcome. Typically, the proportion of cases and controls will be specified beforehand.

If a variable is important for the development of the disease, the distribution of that variable will differ between cases and controls.

The probability of being a case (in the population), P{disease}, cannot be estimated from a case-control study. But the effects of covariates on the disease probability can be estimated!
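One way to carry out the exercise's computation, sketched in Python with the estimates from the linear-effect model output (logit(p) = 0.6877 − 0.5194·shoeno):

```python
import math

def p_cs(shoeno):
    """Predicted caesarean-section probability from the linear-effect model."""
    eta = 0.6877 - 0.5194 * shoeno
    return math.exp(eta) / (1 + math.exp(eta))

# Predicted probabilities for shoe sizes 4, 5 and 6; they decrease with shoe size,
# consistent with the negative slope estimate
probs = {s: p_cs(s) for s in (4, 5, 6)}
```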

Case-control studies (ctd.)

Prevalence in the population:

    p = P{case},    p/(1 − p) = odds(case)

Selection fractions, i.e. inclusion probabilities π_0 and π_1:

    P{inclusion in study | case} = π_1
    P{inclusion in study | control} = π_0

In a case-control study one observes the numbers of cases and controls, conditional on their actually being in the study. These depend on diverse covariates (which one is interested in) and on the inclusion probabilities (which one is not interested in).

[Diagram: with probability p a person is a case, with probability 1 − p a control; a case is included in the study with probability π_1 (not included: 1 − π_1), a control with probability π_0 (not included: 1 − π_0).]

    P{case & included} = p·π_1
    P{control & included} = (1 − p)·π_0
    odds(case | included) = p·π_1 / [(1 − p)·π_0] = [p/(1 − p)] · (π_1/π_0)

Logistic regression

Model for the population:

    ln[p/(1 − p)] = b_0 + b_1·x_1 + b_2·x_2

Model for the observed:

    ln[odds(case | incl.)] = ln[p/(1 − p)] + ln(π_1/π_0)
                           = ln(π_1/π_0) + b_0 + b_1·x_1 + b_2·x_2
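A numerical sketch of the key identity, in Python; the prevalence and inclusion probabilities below are hypothetical illustration values, not from any study:

```python
p = 0.01                 # population prevalence (assumed, for illustration)
pi1, pi0 = 0.9, 0.001    # inclusion probabilities for cases and controls (assumed)

odds_pop = p / (1 - p)                 # odds(case) in the population
odds_included = odds_pop * pi1 / pi0   # odds(case | included in study)
# Selection multiplies the odds by pi1 / pi0, i.e. it only shifts the
# intercept of the logit model by ln(pi1 / pi0)
```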

Analysis of P(case | inclusion), i.e. binary observations:

    Y = 1 for a case, 0 for a control

Effects of covariates are estimated correctly! The intercept has no meaning: it depends on π_0 and π_1, which are usually unknown.