Logistic regression with one predictor. STK4900/ Lecture 7. Program

STK4900/9900 - Lecture 7

Program
1. Logistic regression with one predictor
2. Maximum likelihood estimation
3. Logistic regression with several predictors
4. Deviance and likelihood ratio tests
5. A comment on model fit

Sections 5.1, 5.2 (except 5.2.6), and 5.6
Supplementary material on likelihood and deviance

Logistic regression with one predictor

We have data $(x_1, y_1), \ldots, (x_n, y_n)$. Here $y_i$ is a binary outcome (0 or 1) for subject $i$ and $x_i$ is a predictor for the subject.

We let $p(x) = E(y \mid x) = P(y = 1 \mid x)$.

The logistic regression model takes the form:

$$p(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$$

This gives an "S-shaped" relation between $p(x)$ and $x$ and ensures that $p(x)$ stays between 0 and 1.

The logistic model may alternatively be given in terms of the odds:

$$\frac{p(x)}{1 - p(x)} = \exp(\beta_0 + \beta_1 x) \qquad (*)$$

If we consider two subjects with covariate values $x + \Delta$ and $x$, respectively, their odds ratio becomes

$$\frac{p(x+\Delta) / [1 - p(x+\Delta)]}{p(x) / [1 - p(x)]} = \frac{\exp(\beta_0 + \beta_1 (x + \Delta))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1 \Delta)$$

In particular $e^{\beta_1}$ is the odds ratio corresponding to one unit's increase in the value of the covariate.

By (*) the logistic regression model may also be given as:

$$\log\left[\frac{p(x)}{1 - p(x)}\right] = \beta_0 + \beta_1 x \qquad (**)$$

Thus the logistic regression model is linear in the log-odds.

Consider the WCGS study with CHD as outcome and age as predictor (individual age, not grouped age as we considered in Lecture 6).

    # read the WCGS data (tab-separated, "." marks missing values)
    wcgs=read.table("http://www.uio.no/studier/emner/matnat/math/stk4900/v13/wcgs.txt", sep="\t",header=T,na.strings=".")
    # logistic regression of CHD on individual age
    fit=glm(chd69~age, data=wcgs,family=binomial)
    summary(fit)

R output (edited):

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -5.9395     0.5493 -10.813   <2e-16
    age           0.0744     0.0113   6.585 4.56e-11

The odds ratio for a one-year increase in age is $e^{0.0744} = 1.077$, while the odds ratio for a ten-year increase is $e^{0.0744 \cdot 10} = 2.10$. (The numbers deviate slightly from those on slide 25 from Lecture 6, since there we used mean age for each age group while here we use the individual ages.)

How is the estimation performed for the logistic regression model?
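As a small check of the odds-ratio interpretation, the sketch below plugs the estimates reported in the R output (intercept -5.9395, age coefficient 0.0744) into the logistic model. The helper names `p` and `odds` are introduced here for illustration only; the ages 45 and 55 are arbitrary.

```r
b0 <- -5.9395   # estimated intercept from the output above
b1 <-  0.0744   # estimated age coefficient from the output above

p    <- function(x) plogis(b0 + b1 * x)   # p(x) = exp(b0+b1*x) / (1 + exp(b0+b1*x))
odds <- function(x) p(x) / (1 - p(x))     # odds p(x) / (1 - p(x))

or_1yr  <- exp(b1)        # odds ratio for a one-year increase
or_10yr <- exp(10 * b1)   # odds ratio for a ten-year increase
round(c(or_1yr, or_10yr), 3)

# The odds ratio between two ages 10 years apart does not depend on the starting age
odds(55) / odds(45)
```

The last line should agree with `or_10yr`, which is the statement that the odds ratio for an increase of Δ units is exp(β1·Δ) whatever the value of x.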

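Maximum likelihood estimation

Estimation in the logistic model is performed using maximum likelihood estimation.

We first describe maximum likelihood estimation for the linear regression model:

$$y_i = \mu_i + \varepsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \text{ independent}$$

For ease of presentation, we assume that $\sigma^2$ is known.

The density of $y_i$ takes the form of the normal density:

$$f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right\}$$

The likelihood is the simultaneous density

$$L = \prod_{i=1}^{n} f(y_i)$$

considered as a function of the parameters $\beta_0$ and $\beta_1$ for the observed values of the $y_i$.

We estimate the parameters by maximizing the likelihood. This corresponds to finding the parameters that make the observed $y_i$ as likely as possible.

Maximizing the likelihood $L$ is the same as maximizing

$$l = \log L = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2$$

which is the same as minimizing

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

For the linear regression model, maximum likelihood estimation coincides with least squares estimation.

We then consider the situation for logistic regression. We have data $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i$ is a binary outcome (0 or 1) for subject $i$ and $x_i$ is a predictor.

Here we have

$$P(y_i = 1 \mid x_i) = p_i \quad \text{and} \quad P(y_i = 0 \mid x_i) = 1 - p_i$$

where

$$p_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$$

Thus the distribution of $y_i$ may be written as

$$P(y_i \mid x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$

The likelihood becomes

$$L = \prod_{i=1}^{n} P(y_i \mid x_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

Since $p_i$ depends on $\beta_0$ and $\beta_1$ through the expression above, the likelihood is, for given observations, a function of the unknown parameters $\beta_0$ and $\beta_1$.

We estimate $\beta_0$ and $\beta_1$ by the values of these parameters that maximize the likelihood. These estimates are called the maximum likelihood estimates (MLE) and are denoted $\hat\beta_0$ and $\hat\beta_1$.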
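To make the maximization concrete, here is a minimal sketch that maximizes the logistic log-likelihood numerically and compares the result with `glm`. The data are simulated (they are not the WCGS data), the predictor is centred only to keep the numerical optimisation well behaved, and the names `negloglik`, `negscore`, `opt` and `fit_glm` are illustrative. `glm` itself uses iteratively reweighted least squares rather than `optim`; both should reach the same maximum.

```r
set.seed(1)
n  <- 500
x  <- rnorm(n, mean = 46, sd = 5)            # simulated predictor (hypothetical data)
xc <- x - mean(x)                            # centred predictor
y  <- rbinom(n, 1, plogis(-1.4 + 0.1 * xc))  # simulated binary outcome

# Negative log-likelihood: -sum( y*log(p) + (1-y)*log(1-p) ),
# written in terms of the linear predictor eta = b0 + b1*x
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * xc
  -sum(y * eta - log1p(exp(eta)))
}

# Gradient of the negative log-likelihood
negscore <- function(beta) {
  p <- plogis(beta[1] + beta[2] * xc)
  -c(sum(y - p), sum((y - p) * xc))
}

opt <- optim(c(0, 0), negloglik, gr = negscore, method = "BFGS",
             control = list(reltol = 1e-12))
fit_glm <- glm(y ~ xc, family = binomial)

rbind(optim = opt$par, glm = coef(fit_glm))  # the two rows should agree closely
```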

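Confidence interval for $\beta_1$ and odds ratio

95% confidence interval for $\beta_1$ (based on the normal approximation):

$$\hat\beta_1 \pm 1.96 \cdot se(\hat\beta_1)$$

$OR = \exp(\beta_1)$ is the odds ratio for one unit's increase in $x$.

We obtain a 95% confidence interval for OR by transforming the lower and upper limits of the confidence interval for $\beta_1$.

R function for computing odds ratio with 95% confidence limits:

    expcoef=function(glmobj)
    {
    regtab=summary(glmobj)$coef                # coefficient table
    expcoef=exp(regtab[,1])                    # exp(estimate)
    lower=expcoef*exp(-1.96*regtab[,2])        # lower 95% limit
    upper=expcoef*exp(1.96*regtab[,2])         # upper 95% limit
    cbind(expcoef,lower,upper)
    }

In the CHD example we have $\hat\beta_1 = 0.0744$ and $se(\hat\beta_1) = 0.0113$.

95% confidence interval for $\beta_1$: $0.0744 \pm 1.96 \cdot 0.0113$, i.e. from 0.052 to 0.096.

Estimate of odds ratio: $OR = \exp(0.0744) = 1.077$

95% confidence interval for OR: from $\exp(0.052) = 1.053$ to $\exp(0.096) = 1.101$

    expcoef(fit)

R output (edited):

                expcoef  lower  upper
    (Intercept)  0.0026 0.0009 0.0077
    age          1.077  1.054  1.101

Wald test for $H_0: \beta_1 = 0$

To test the null hypothesis $H_0: \beta_1 = 0$ versus the two-sided alternative $H_A: \beta_1 \neq 0$ we use the Wald test statistic:

$$z = \frac{\hat\beta_1}{se(\hat\beta_1)}$$

We reject $H_0$ for large values of $|z|$.

Under $H_0$ the test statistic is approximately standard normal.

P-value (two-sided): $P = 2\,P(Z > |z|)$ where $Z$ is standard normal.

In the CHD example we have $\hat\beta_1 = 0.0744$ and $se(\hat\beta_1) = 0.0113$.

Wald test statistic: $z = 0.0744 / 0.0113 = 6.58$, which is highly significant (cf. slide 4).

Multiple logistic regression

Assume now that we for each subject have a binary outcome $y$ and predictors $x_1, x_2, \ldots, x_p$.

We let

$$p(x_1, x_2, \ldots, x_p) = E(y \mid x_1, x_2, \ldots, x_p) = P(y = 1 \mid x_1, x_2, \ldots, x_p)$$

Logistic regression model:

$$p(x_1, x_2, \ldots, x_p) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}$$

Alternatively the model may be written:

$$\log\left( \frac{p(x_1, x_2, \ldots, x_p)}{1 - p(x_1, x_2, \ldots, x_p)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$$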
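The Wald test and the confidence limits for the one-predictor CHD example can be reproduced from the two reported numbers alone. The sketch below uses only the estimate 0.0744 and the standard error 0.0113 from the R output; the variable names are illustrative.

```r
b1  <- 0.0744   # estimated age coefficient (from the output above)
se1 <- 0.0113   # its standard error (from the output above)

z     <- b1 / se1                       # Wald test statistic
pval  <- 2 * pnorm(-abs(z))             # two-sided P-value, 2*P(Z > |z|)
ci_b  <- b1 + c(-1.96, 1.96) * se1      # 95% confidence interval for beta1
ci_or <- exp(ci_b)                      # transformed limits: 95% interval for OR

round(z, 2)
signif(pval, 2)
round(ci_or, 3)
```

The interval for OR obtained this way matches the `age` row returned by `expcoef(fit)`; small differences in the last digit can arise because the inputs here are rounded.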

The logistic model may also be given in terms of the odds:

$$\frac{p(x_1, x_2, \ldots, x_p)}{1 - p(x_1, x_2, \ldots, x_p)} = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)$$

If we consider two subjects with values $x_1 + \Delta$ and $x_1$ for the first covariate and the same values for all the others, their odds ratio becomes

$$\frac{p(x_1+\Delta, x_2, \ldots, x_p) / [1 - p(x_1+\Delta, x_2, \ldots, x_p)]}{p(x_1, x_2, \ldots, x_p) / [1 - p(x_1, x_2, \ldots, x_p)]} = \frac{\exp(\beta_0 + \beta_1 (x_1 + \Delta) + \beta_2 x_2 + \ldots + \beta_p x_p)}{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)} = \exp(\beta_1 \Delta)$$

In particular $e^{\beta_1}$ is the odds ratio corresponding to one unit's increase in the value of the first covariate holding all other covariates constant.

A similar interpretation holds for the other regression coefficients.

Wald tests and confidence intervals

$\hat\beta_j$ = MLE for $\beta_j$
$se(\hat\beta_j)$ = standard error for $\hat\beta_j$

To test the null hypothesis $H_{0j}: \beta_j = 0$ we use the Wald test statistic:

$$z = \frac{\hat\beta_j}{se(\hat\beta_j)}$$

which is approximately N(0,1)-distributed under $H_{0j}$.

95% confidence interval for $\beta_j$:

$$\hat\beta_j \pm 1.96 \cdot se(\hat\beta_j)$$

$OR_j = \exp(\beta_j)$ is the odds ratio for one unit's increase in the value of the $j$-th covariate holding all other covariates constant.

We obtain a 95% confidence interval for $OR_j$ by transforming the lower and upper limits of the confidence interval for $\beta_j$.

Consider the WCGS study with CHD as outcome and age, cholesterol (mg/dL), systolic blood pressure (mmHg), body mass index (kg/m^2), and smoking (yes, no) as predictors (as on page 68 in the text book we omit an individual with an unusually high cholesterol value).

    # multiple logistic regression, omitting the individual with chol >= 600
    wcgs.mult=glm(chd69~age+chol+sbp+bmi+smoke, data=wcgs, family=binomial, subset=(chol<600))
    summary(wcgs.mult)

R output (edited):

                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -12.3110     0.9773 -12.598  < 2e-16
    age           0.0644     0.0119   5.412 6.22e-08
    chol          0.0107     0.0015   7.079 1.45e-12
    sbp           0.0193     0.0041   4.716 2.40e-06
    bmi           0.0574     0.0264   2.179   0.0293
    smoke         0.6345     0.1401   4.526 6.01e-06

Odds ratios with confidence intervals

R command (using the expcoef function defined earlier):

    expcoef(wcgs.mult)

R output (edited):

                 expcoef    lower    upper
    (Intercept) 4.50e-06 6.63e-07 3.06e-05
    age         1.067    1.042    1.092
    chol        1.011    1.008    1.014
    sbp         1.019    1.011    1.028
    bmi         1.059    1.006    1.115
    smoke       1.886    1.433    2.482

For a numerical covariate it may be more meaningful to present an odds ratio corresponding to a larger increase than one unit (cf. the odds ratio $\exp(\beta_1 \Delta)$ for an increase of $\Delta$ units).

This is easily achieved by refitting the model with a rescaled covariate.

If you (e.g.) want to study the effect of a ten-year increase in age, you fit the model with the covariate age_10=age/10.

    # same model with rescaled covariates (per 10 years, 50 mg/dL, 50 mmHg, 10 kg/m^2)
    wcgs.resc=glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke, data=wcgs, family=binomial, subset=(chol<600))
    summary(wcgs.resc)

R output (edited):

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -3.006      0.116 -12.598  < 2e-16
    age_10         0.644      0.119   5.412 6.22e-08
    chol_50        0.537      0.076   7.079 1.45e-12
    sbp_50         0.965      0.205   4.716 2.40e-06
    bmi_10         0.574      0.264   2.179   0.0293
    smoke          0.634      0.140   4.526 6.01e-06

Odds ratios with confidence intervals:

R command (using the expcoef function defined earlier):

    expcoef(wcgs.resc)

R output (edited):

                expcoef  lower  upper
    (Intercept)  0.0494 0.0394 0.0621
    age_10       1.9050 1.5085 2.4057
    chol_50      1.7110 1.4746 1.9853
    sbp_50       2.6240 1.7573 3.9181
    bmi_10       1.7760 1.0595 2.9770
    smoke        1.8860 1.4329 2.4824

Note that the values of the Wald test statistic for the covariates are not changed (cf. slide 15).

An aim of the WCGS study was to study the effect on CHD of certain behavioral patterns, denoted A1, A2, B3 and B4.

Behavioral pattern is a categorical covariate with four levels, and must be fitted as a factor in R.

    wcgs$behcat=factor(wcgs$behpat)    # treat behavioral pattern as categorical
    wcgs.beh=glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke+behcat, data=wcgs, family=binomial, subset=(chol<600))
    summary(wcgs.beh)

R output (edited):

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -2.7527     0.2259  -12.19  < 2e-16
    age_10        0.6064     0.1199   5.057 4.25e-07
    chol_50       0.5331     0.0764   6.980 2.96e-12
    sbp_50        0.9016     0.2065   4.367 1.26e-05
    bmi_10        0.5536     0.2656   2.084   0.0372
    smoke         0.6047     0.1411   4.285 1.82e-05
    behcat2       0.0660     0.2212   0.298   0.7654
    behcat3      -0.6652     0.2423  -2.746   0.0060
    behcat4      -0.5585     0.3192  -1.750   0.0802

Here we may be interested in:
- Testing if behavioral patterns have an effect on CHD risk
- Testing if it is sufficient to use two categories for behavioral pattern (A and B)

In general we consider a logistic regression model:

$$p(x_1, x_2, \ldots, x_p) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}$$

Here we want to test the null hypothesis that $q$ of the $\beta_j$'s are equal to zero, or equivalently that there are $q$ linear restrictions among the $\beta_j$'s.

Examples:

$$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \quad (q = 4)$$

$$H_0: \beta_1 = \beta_2 \text{ and } \beta_3 = \beta_4 \quad (q = 2)$$
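The remark that rescaling a covariate changes its coefficient but not its Wald statistic can be checked on simulated data. This is only a sketch: the data below are generated, not taken from WCGS, and the object names `fit1` and `fit10` are illustrative.

```r
set.seed(2)
n      <- 1000
age    <- round(runif(n, 39, 59))                 # simulated ages (hypothetical data)
y      <- rbinom(n, 1, plogis(-6 + 0.07 * age))   # simulated binary outcome
age_10 <- age / 10                                # age in units of ten years

fit1  <- glm(y ~ age,    family = binomial)
fit10 <- glm(y ~ age_10, family = binomial)

b1   <- unname(coef(fit1)["age"])
b10  <- unname(coef(fit10)["age_10"])
z1   <- coef(summary(fit1))["age", "z value"]
z10  <- coef(summary(fit10))["age_10", "z value"]

c(b1 = b1, b10 = b10)   # b10 should be 10 times b1
c(z1 = z1, z10 = z10)   # the Wald statistics should coincide
```

Both the estimate and its standard error are multiplied by 10, so their ratio is unchanged, while exp(b10) is the odds ratio for a ten-year increase.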

Deviance and sum of squares

For the linear regression model the sum of squares was a key quantity in connection with testing and for assessing the fit of a model.

We want to define a quantity for logistic regression that corresponds to the sum of squares.

To this end we start out by considering the relation between the log-likelihood and the sum of squares for the linear regression model.

For the linear regression model $l = \log L$ takes the form (cf. slide 6):

$$l = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2$$

The log-likelihood obtains its largest value for the saturated model, i.e. the model where there are no restrictions on the $\mu_i$.

For the saturated model the $\mu_i$ are estimated by $\tilde\mu_i = y_i$, and the log-likelihood becomes

$$\tilde l = -\frac{n}{2} \log(2\pi\sigma^2)$$

For a given specification of the linear regression model the $\mu_i$ are estimated by the fitted values, i.e. $\hat\mu_i = \hat y_i$, with corresponding log-likelihood

$$\hat l = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat\mu_i)^2$$

The deviance for the model is defined as $D = 2(\tilde l - \hat l)$ and it becomes

$$D = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \hat\mu_i)^2$$

For the linear regression model the deviance is just the sum of squares for the fitted model divided by $\sigma^2$.

Deviance for binary data

We then consider logistic regression with data

$$(y_i, x_{1i}, x_{2i}, \ldots, x_{pi}), \quad i = 1, 2, \ldots, n$$

where $y_i$ is a binary response and the $x_{ji}$ are predictors.

We introduce $p_i = P(y_i = 1 \mid x_{1i}, x_{2i}, \ldots, x_{pi})$ and note that the log-likelihood $l = l(p_1, \ldots, p_n)$ is a function of $p_1, \ldots, p_n$ (cf. slide 8).

For the saturated model, i.e. the model where there are no restrictions on the $p_i$, the $p_i$ are estimated by $\tilde p_i = y_i$ and the log-likelihood takes the value

$$\tilde l = l(\tilde p_1, \ldots, \tilde p_n) = 0$$

(each observed outcome is then assigned probability 1, so the likelihood equals 1).

For a fitted logistic regression model we obtain the estimated probabilities

$$\hat p_i = \hat p(x_{1i}, x_{2i}, \ldots, x_{pi}) = \frac{\exp(\hat\beta_0 + \hat\beta_1 x_{1i} + \ldots + \hat\beta_p x_{pi})}{1 + \exp(\hat\beta_0 + \hat\beta_1 x_{1i} + \ldots + \hat\beta_p x_{pi})}$$

and the corresponding value $\hat l = l(\hat p_1, \ldots, \hat p_n)$ of the log-likelihood.

The deviance for the model is defined as

$$D = 2(\tilde l - \hat l)$$

The deviance itself is not of much use for binary data. But by comparing the deviances of two models, we may check if one gives a better fit than the other.
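The definition of the deviance for binary data can be verified by hand. The sketch below uses simulated data (not WCGS) and illustrative names; it computes the log-likelihood of the fitted model from the estimated probabilities, takes the saturated log-likelihood to be 0, and compares the result with the residual deviance reported by `glm`.

```r
set.seed(4)
x <- rnorm(300)                                # simulated predictor (hypothetical data)
y <- rbinom(300, 1, plogis(-1 + 0.8 * x))      # simulated binary outcome
fit_sim <- glm(y ~ x, family = binomial)

phat  <- fitted(fit_sim)                                  # estimated probabilities
l_hat <- sum(y * log(phat) + (1 - y) * log(1 - phat))     # log-likelihood of fitted model
l_sat <- 0                                                # saturated model: p_i estimated by y_i
D     <- 2 * (l_sat - l_hat)                              # deviance by hand

c(by_hand = D, glm = deviance(fit_sim))
```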

Consider the WCGS study with age, cholesterol, systolic blood pressure, body mass index, smoking and behavioral pattern as predictors (cf. slide 19).

R output (edited):

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -2.7527     0.2259  -12.19  < 2e-16
    age_10        0.6064     0.1199   5.057 4.25e-07
    chol_50       0.5331     0.0764   6.980 2.96e-12
    sbp_50        0.9016     0.2065   4.367 1.26e-05
    bmi_10        0.5536     0.2656   2.084   0.0372
    smoke         0.6047     0.1411   4.285 1.82e-05
    behcat2       0.0660     0.2212   0.298   0.7654
    behcat3      -0.6652     0.2423  -2.746   0.0060
    behcat4      -0.5585     0.3192  -1.750   0.0802

    Null deviance:     1774.2 on 3140 degrees of freedom
    Residual deviance: 1589.6 on 3132 degrees of freedom

The deviance of the fitted model is denoted "residual deviance" in the output.

The "null deviance" is the deviance for the model with no covariates, i.e. for the model where all the $p_i$ are assumed to be equal.

Deviance and likelihood ratio tests

We want to test the null hypothesis $H_0$ that $q$ of the $\beta_j$'s are equal to zero, or equivalently that there are $q$ linear restrictions among the $\beta_j$'s.

To test the null hypothesis, we use the test statistic

$$G = D_0 - D$$

where $D_0$ is the deviance under the null hypothesis and $D$ is the deviance for the fitted model (not assuming $H_0$).

We reject $H_0$ for large values of $G$.

To compute P-values, we use that the test statistic $G$ is chi-square distributed with $q$ degrees of freedom under $H_0$.

We will show how we may rewrite $G$ in terms of the likelihood ratio.

We have $D_0 = 2(\tilde l - \hat l_0)$ and $D = 2(\tilde l - \hat l)$.

Here $\hat l_0 = \log \hat L_0$ and $\hat l = \log \hat L$, where

$$\hat L_0 = \max_{H_0} L \quad \text{and} \quad \hat L = \max_{\text{model}} L$$

Thus

$$G = D_0 - D = 2(\tilde l - \hat l_0) - 2(\tilde l - \hat l) = -2(\hat l_0 - \hat l) = -2 \log\left( \frac{\hat L_0}{\hat L} \right)$$

Thus large values of $G$ correspond to small values of the likelihood ratio $\hat L_0 / \hat L$, and the test based on $G$ is equivalent to the likelihood ratio test.

For the model with age, cholesterol, systolic blood pressure, body mass index, smoking, and behavioral pattern as predictors (cf. slide 25) the deviance becomes $D = 1589.6$.

For the model without behavioral pattern (cf. slide 17) the deviance takes the value $D_0 = 1614.4$.

The test statistic takes the value:

$$G = D_0 - D = 1614.4 - 1589.6 = 24.8$$

    anova(wcgs.resc,wcgs.beh,test="Chisq")

R output (edited):

    Analysis of Deviance Table
    Model 1: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke
    Model 2: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + behcat
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1      3135     1614.4
    2      3132     1589.6  3   24.765 1.729e-05
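The P-value of the deviance test for behavioral pattern can be recomputed directly from the two reported residual deviances. The sketch uses only numbers quoted in the text (1614.4, 1589.6 and q = 3); the variable names are illustrative.

```r
D0 <- 1614.4   # residual deviance without behavioral pattern (null hypothesis)
D  <- 1589.6   # residual deviance with behavioral pattern
q  <- 3        # number of coefficients set to zero under H0 (behcat2, behcat3, behcat4)

G    <- D0 - D                               # likelihood ratio (deviance) test statistic
pval <- pchisq(G, df = q, lower.tail = FALSE)  # upper tail of the chi-square distribution

G
signif(pval, 3)
```

The result is of the same order as the P-value in the `anova` table; it differs slightly because the deviances used here are rounded to one decimal.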

Model fit for linear regression (review)
1. Linearity
2. Constant variance
3. Independent responses
4. Normally distributed error terms and no outliers

Model fit for logistic regression
1. Linearity: Still relevant, see following slides.
2. Heteroscedastic model, $Var(y \mid x) = p(x)(1 - p(x))$, i.e. depends on $E(y \mid x) = p(x)$. However, this non-constant variance is taken care of by the maximum likelihood estimation.
3. Independent responses: See the lecture on Friday.
4. Not relevant, data are binary, no outliers in responses (but there could well be extreme covariates, influential observations).

Checking linearity for logistic regression

We want to check if the probabilities can be adequately described by the linear expression

$$\log\left( \frac{p(x_1, x_2, \ldots, x_p)}{1 - p(x_1, x_2, \ldots, x_p)} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$$

We will discuss 3 approaches:
1. Grouping the covariates
2. Adding square terms or logarithmic terms to the model
3. Extending the model to generalized additive models (GAM)

$$\log\left( \frac{p(x_1, x_2, \ldots, x_p)}{1 - p(x_1, x_2, \ldots, x_p)} \right) = \beta_0 + f_1(x_1) + f_2(x_2) + \ldots + f_p(x_p)$$

1. Grouping the variables

For a simple illustration we consider the situation where age is the only covariate in the model for CHD, and we want to check if the effect of age is linear (on the log-odds scale).

The procedure will be similar if there are other covariates in addition to age.

We may here fit a model considering the age group as a factor (age groups: 35-40, 41-45, 46-50, 51-55, 56-60).

Or we may fit a model where the mean age in each age group is used as numerical covariate (means: 39.5, 42.9, 47.9, 52.8, 57.3).

We may then use a deviance test to check if the flexible categorical model gives a better fit than the linear numerical model.

Here we find no improvement, p = 0.269.

    # age group as a factor
    fit.catage=glm(chd69~factor(agec), data=wcgs,family=binomial)
    # mean age in each age group as numerical covariate
    wcgs$agem=39.5*(wcgs$agec==0)+42.9*(wcgs$agec==1)+47.9*(wcgs$agec==2)+
       52.8*(wcgs$agec==3)+57.3*(wcgs$agec==4)
    fit.linage=glm(chd69~agem, data=wcgs,family=binomial)
    summary(fit.catage)
    summary(fit.linage)
    anova(fit.linage, fit.catage,test="Chisq")

R output (edited):

                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)    -2.8043     0.1850 -15.162  < 2e-16
    factor(agec)1  -0.1315     0.2310  -0.569    0.569
    factor(agec)2   0.5307     0.2235   2.374    0.018
    factor(agec)3   0.8410     0.2275   3.697   0.0002
    factor(agec)4   1.0600     0.2585   4.100 4.13e-05

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -5.9466     0.5616 -10.588  < 2e-16
    agem          0.0747     0.0116   6.445 1.15e-10

    Model 1: chd69 ~ agem
    Model 2: chd69 ~ factor(agec)
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1      3152     1740.2
    2      3149     1736.3  3    3.928     0.269

2. Adding square terms or log-terms

The simple model

$$\log\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0 + \beta_1 x$$

can be extended to more flexible models such as

$$\log\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0 + \beta_1 x + \beta_2 x^2$$

or

$$\log\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0 + \beta_1 x + \beta_2 \log(x)$$

We may then use a deviance test to check if the flexible models give a better fit than the original.

Here we find no improvement in either case, p = 0.79 and p = 0.88.

    fit=glm(chd69~age, data=wcgs,family=binomial)
    fita2=glm(chd69~age+I(age^2), data=wcgs,family=binomial)     # add a square term
    fitlog=glm(chd69~age+log(age), data=wcgs,family=binomial)    # add a log term
    anova(fit,fita2,test="Chisq")
    anova(fit,fitlog,test="Chisq")

R output (edited):

    > anova(fit,fita2,test="Chisq")
    Model 1: chd69 ~ age
    Model 2: chd69 ~ age + I(age^2)
      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1      3152     1738.4
    2      3151     1738.3  1 0.069473   0.7921

    > anova(fit,fitlog,test="Chisq")
    Model 1: chd69 ~ age
    Model 2: chd69 ~ age + log(age)
      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1      3152     1738.4
    2      3151     1738.3  1 0.019058   0.8902

However, in these data, age, age^2 and log(age) are strongly correlated.

2b. Adding a less correlated term

R output (edited):

    > fitlogb=glm(chd69~age+log(age-38.9), data=wcgs,family=binomial)
    > anova(fit,fitlogb,test="Chisq")
    Model 1: chd69 ~ age
    Model 2: chd69 ~ age + log(age - 38.9)
      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1      3152     1738.4
    2      3151     1735.8  1   2.5793   0.1083
    >
    > cor(wcgs$age,log(wcgs$age-38.9))
    [1] 0.852647
    > cor(wcgs$age,log(wcgs$age))
    [1] 0.998432

Still no significant improvement (p = 0.11), but the new covariate is still quite correlated with the original covariate.

3. Generalized additive model

In this example, with just one covariate:

$$\log\left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + f_1(x)$$

where $f_1(x)$ is a smooth function estimated by the program.

The approach can easily be extended to several covariates.

We can then
(a) Plot the estimated function with confidence intervals. Will a straight line fit within the confidence limits?
(b) Compare the simple and flexible model by a deviance test.

R output (edited):

    > library(gam)
    > fitgam=gam(chd69~s(age), data=wcgs, family=binomial)
    > plot(fitgam,se=T)
    > anova(fit,fitgam,test="Chisq")
    Analysis of Deviance Table
    Model 1: chd69 ~ age
    Model 2: chd69 ~ s(age)
      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1      3152     1738.4
    2      3149     1729.6  3   8.7622  0.03263 *

For these data
(a) The informal graphical check just allows a straight line within the confidence limits.
(b) However, the deviance test gives a weakly significant deviation from linearity (p = 0.032).

There may thus be some unimportant deviation from linearity.

Deviance and grouped data

On slides 26-28 in Lecture 6 we saw that we got the same estimates and standard errors when we fitted the model with mean age in each age group as numerical covariate using binary data and grouped data.

    summary(fit.linage)
    # grouped data: one row per age group
    chd.grouped=read.table("http://www.uio.no/studier/emner/matnat/math/stk4900/v13/chd_grouped.txt", header=T)
    fit.grouped=glm(cbind(chd,no-chd)~agem, data=chd.grouped, family=binomial)
    summary(fit.grouped)

R output (edited):

Binary data:

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -5.9466     0.5616 -10.588  < 2e-16
    agem          0.0747     0.0116   6.445 1.15e-10

    Null deviance:     1781.2 on 3153 degrees of freedom
    Residual deviance: 1740.2 on 3152 degrees of freedom

Grouped data:

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -5.9466     0.5616 -10.588  < 2e-16
    agem          0.0747     0.0116   6.445 1.15e-10

    Null deviance:     44.95 on 4 degrees of freedom
    Residual deviance: 3.928 on 3 degrees of freedom

We see that the "residual deviance" and the "null deviance" are not the same when we use binary data and when we use grouped data.

However, the difference between the two is the same in both cases.

As long as we look at differences between deviances, it does not matter whether we used binary or grouped data.

Further note that the residual deviance for the model with grouped data (3.928) is the same as the deviance difference we got when comparing the models with age as a numerical and categorical covariate (based on binary data).

When we use grouped data, the residual deviance can be used as a goodness-of-fit test.
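The statement that deviance differences do not depend on whether binary or grouped data are used can be illustrated with simulated data. This is a sketch under stated assumptions: the data frame `d` is generated here (it is not the WCGS data), the grouped data are built with `aggregate`, and the column name `no` (group size) simply mirrors the one used in the code above.

```r
set.seed(3)
# Simulated binary data with five possible values of the covariate (hypothetical data)
d <- data.frame(agem = sample(c(39.5, 42.9, 47.9, 52.8, 57.3), 2000, replace = TRUE))
d$chd <- rbinom(nrow(d), 1, plogis(-6 + 0.075 * d$agem))
d$no  <- 1                                   # each row counts as one subject

fit_bin <- glm(chd ~ agem, data = d, family = binomial)

# Grouped data: number of events (chd) and number of subjects (no) per covariate value
g <- aggregate(cbind(chd, no) ~ agem, data = d, FUN = sum)
fit_grp <- glm(cbind(chd, no - chd) ~ agem, data = g, family = binomial)

diff_bin <- fit_bin$null.deviance - deviance(fit_bin)
diff_grp <- fit_grp$null.deviance - deviance(fit_grp)

rbind(binary = coef(fit_bin), grouped = coef(fit_grp))   # same estimates
c(binary = diff_bin, grouped = diff_grp)                 # same deviance difference
c(binary = deviance(fit_bin), grouped = deviance(fit_grp))  # different deviances
```

The individual deviances differ because the saturated model is different in the two data layouts, but that term cancels when two deviances from the same layout are subtracted.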