Logistic Regression Maximum Likelihood Estimation

Harvard-MIT Division of Health Sciences and Technology, HST.951J: Medical Decision Support, Fall 2005. Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo. 6.873/HST.951 Medical Decision Support, Fall 2005. Logistic Regression: Maximum Likelihood Estimation. Lucila Ohno-Machado

Risk Score of Death from Angioplasty: bar chart of number of cases and mortality risk by risk score category (0 to 2, 3 to 4, 5 to 6, 7 to 8, 9 to 10, >10). Unadjusted overall mortality rate = 2.1%; mortality risk rises from 0.4% in the lowest category to 53.6% in the highest.

Linear Regression: Ordinary Least Squares (OLS). Minimize the Sum of Squared Errors (SSE) over the n data points (i is the subscript for each point):

$$\hat{y}_i = \beta_0 + \beta_1 x_i$$
$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_i)\right]^2$$
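As a quick illustration of the OLS objective above, here is a minimal sketch (the toy x and y values are invented) that computes β₀ and β₁ in closed form and evaluates the SSE being minimized:

```python
import numpy as np

# Toy data (invented for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form OLS estimates for y_hat = b0 + b1 * x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)   # the quantity OLS minimizes
print(b0, b1, sse)
```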

Logit:

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}} = \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i} + 1}$$
$$\log\frac{p_i}{1 - p_i} = \beta_0 \cdot 1 + \beta_1 x_i \qquad \text{(the logit)}$$
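A minimal sketch of the two equivalent forms above, assuming invented example coefficients: the logistic function giving p from β₀ + β₁x, and the logit (log-odds) recovering the linear predictor:

```python
import numpy as np

def sigmoid(z):
    """p = 1 / (1 + exp(-z)), the logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -0.5, 0.8              # example coefficients (invented)
x = np.linspace(-3, 3, 7)

p = sigmoid(beta0 + beta1 * x)        # probability that y = 1
log_odds = np.log(p / (1 - p))        # the logit; recovers beta0 + beta1 * x
print(np.allclose(log_odds, beta0 + beta1 * x))  # True
```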

Increasing β₁: three plots of the logistic curve p = 1/(1 + e^{-(β₀ + β₁x)}) over the same x range (0 to 30), showing a steeper transition from 0 to 1 as β₁ increases.

Finding β₀: baseline case (the green group, x = 0):

$$p = \frac{1}{1 + e^{-\beta_0}}$$

            Blue (1)   Green (0)   Total
  Death        28          22        50
  Life         45          52        97
  Total        73          74       147

$$0.297 = \frac{1}{1 + e^{-\beta_0}} \quad\Rightarrow\quad \beta_0 = -0.8616$$
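The intercept can be checked directly from the table: with the green (x = 0) group as the baseline, p is the death rate among greens and β₀ is its log-odds. A quick sketch:

```python
import math

deaths_green, total_green = 22, 74                # from the 2x2 table above
p_baseline = deaths_green / total_green           # ~0.297
beta0 = math.log(p_baseline / (1 - p_baseline))   # log-odds of the baseline rate
print(round(p_baseline, 3), round(beta0, 4))      # 0.297, about -0.86 (the slide's -0.8616 uses the rounded 0.297)
```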

Odds ratio. Odds: p/(1-p). Odds ratio for death, blue vs. green:

$$OR = \frac{p_{death|blue}\,/\,(1 - p_{death|blue})}{p_{death|green}\,/\,(1 - p_{death|green})} = \frac{28/45}{22/52} = 1.47$$

(using the same 2x2 table: Death 28 blue, 22 green, 50 total; Life 45, 52, 97; Total 73, 74, 147)

What do the coefficients mean?

$$e^{\beta_{color}} = OR_{color}, \qquad OR = \frac{28/45}{22/52} = 1.47, \qquad e^{\beta_{color}} = 1.47 \;\Rightarrow\; \beta_{color} = 0.385$$
$$p_{blue} = \frac{1}{1 + e^{0.8616 - 0.385}} = 0.383, \qquad p_{green} = \frac{1}{1 + e^{0.8616}} = 0.297$$

(same 2x2 table as above)
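The slide's arithmetic can be replayed from the table: the odds ratio for blue vs. green, its log as β_color, and the fitted probabilities for each group. A minimal sketch (function and variable names are mine):

```python
import math

# Counts from the 2x2 table on the slide.
deaths_blue, lives_blue = 28, 45
deaths_green, lives_green = 22, 52

OR = (deaths_blue / lives_blue) / (deaths_green / lives_green)  # (28/45)/(22/52) ~ 1.47
beta_color = math.log(OR)                                       # ~0.385
beta0 = -0.8616                        # baseline log-odds from the previous slide

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p_green = sigmoid(beta0)               # ~0.297
p_blue = sigmoid(beta0 + beta_color)   # ~0.383
print(round(OR, 2), round(beta_color, 3), round(p_green, 3), round(p_blue, 3))
```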

What do the coefficients mean for a continuous predictor such as age?

$$e^{\beta_{age}} = OR_{age}, \qquad OR = \frac{p_{death|age=50}\,/\,(1 - p_{death|age=50})}{p_{death|age=49}\,/\,(1 - p_{death|age=49})}$$

(illustrated with the same 2x2 counts, now labeled Age 49 vs. Age 50)

Why not search using OLS?

$$\hat{y}_i = \beta_0 + \beta_1 x_i, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

versus the logit:

$$\log\frac{p_i}{1 - p_i} = \beta_0 \cdot 1 + \beta_1 x_i$$

P(model | data)?

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

If only the intercept is allowed, which value would it have? (Plot of the observed y values.)

P(data | model)?

P(data | model) = [P(model | data) P(data)] / P(model)

When comparing models:
- P(model): assume all models are equally likely a priori (i.e., the chance of a model with high coefficients is the same as one with low coefficients, etc.)
- P(data): assume it is the same for all models
Then P(data | model) ∝ P(model | data).

Maximum Likelihood Estimation. Maximize P(data | model): maximize the probability that we would observe what we observed (given the assumption of a particular model), choosing the best parameters for that particular model (the logit).

Maximum Likelihood Estimation. Steps:
- Define an expression for the probability of the data as a function of the parameters.
- Find the values of the parameters that maximize this expression.

Likelihood Function

$$L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n}\Pr(y_i)$$

Likelihood Function

$$L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n}\Pr(y_i)$$

Binomial:

$$\Pr(y_i = 1) = p_i, \qquad \Pr(y_i = 0) = 1 - p_i, \qquad \Pr(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$$

Likelihood Function

$$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{n}\left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i)$$
$$\log L = \sum_i \left[ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i) \right] = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$

since the model is the logit.

Log Likelihood Function

$$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{n}\left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i)$$
$$\log L = \sum_i \left[ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i) \right]$$

Log Likelihood Function

$$\log L = \sum_i \left[ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i) \right] = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$

since the model is the logit.
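A minimal sketch of the log likelihood above on invented toy data, evaluating both the generic Bernoulli form and the logit-specific form, which agree:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (invented for illustration).
x = np.array([0.5, 1.5, 2.0, 3.0, 4.5])
y = np.array([0,   0,   1,   0,   1  ])
beta0, beta1 = -2.0, 0.8

eta = beta0 + beta1 * x                 # linear predictor (beta * x)
p = sigmoid(eta)

# Generic Bernoulli log likelihood: sum of y*log(p) + (1-y)*log(1-p).
loglik_bernoulli = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Logit-specific form from the slide: sum of y*eta - log(1 + exp(eta)).
loglik_logit = np.sum(y * eta - np.log(1 + np.exp(eta)))

print(np.allclose(loglik_bernoulli, loglik_logit))  # True
```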

Maximize

$$\log L = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$
$$\frac{\partial \log L}{\partial \beta} = \sum_i (y_i - \hat{y}_i)\, x_i = 0, \qquad \hat{y}_i = \frac{1}{1 + e^{-\beta x_i}}$$

Not easy to solve because ŷ is non-linear in β; we need iterative methods. The most popular is Newton-Raphson.

Newton-Raphson
- Start with random or zero βs.
- Walk in the direction that maximizes the log likelihood: the gradient (score) gives the direction, and the algorithm determines how big a step to take.

Maximizing the Log Likelihood: plot of log likelihood versus β; the first iteration moves from the initial LL at β_j up to a higher LL at β_(j+1).

Maximizing the Log Likelihood: the same plot after the second iteration; the new β becomes the starting point (the new "initial" LL) for the next step.

This is a similar iterative method to minimizing the error in gradient descent (neural nets): plot of an error surface over a weight w, with the initial error at w_initial, the negative derivative indicating the direction of change, and the final error at a local minimum for w_trained.

Newton-Raphson Algorithm

$$\log L = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$
$$U(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_i (y_i - \hat{y}_i)\, x_i \qquad \text{(Gradient / Score)}$$
$$I(\beta) = -\frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = \sum_i x_i x_i'\, \hat{y}_i (1 - \hat{y}_i) \qquad \text{(Hessian / Information)}$$
$$\beta_{j+1} = \beta_j + I(\beta_j)^{-1}\, U(\beta_j) \qquad \text{(a step)}$$
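A minimal sketch of the update above for a one-predictor logistic model with an intercept, on invented toy data; `newton_raphson_logistic` is my own naming, and the stopping rule adapts the relative-change criterion on the next slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_logistic(x, y, tol=1e-4, max_iter=25):
    """Fit beta = (beta0, beta1) for logit(p) = beta0 + beta1*x by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
    beta = np.zeros(2)                           # start with zero betas
    for _ in range(max_iter):
        p = sigmoid(X @ beta)                    # current fitted probabilities
        U = X.T @ (y - p)                        # gradient (score)
        W = p * (1 - p)                          # Bernoulli variances y_hat*(1 - y_hat)
        I = X.T @ (X * W[:, None])               # information matrix (negative Hessian)
        step = np.linalg.solve(I, U)             # I^{-1} U
        beta_new = beta + step                   # one Newton-Raphson step
        # Relative-change convergence check (small epsilon avoids dividing by zero).
        if np.max(np.abs(step) / (np.abs(beta) + 1e-12)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data (invented for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < sigmoid(-0.5 + 1.2 * x)).astype(float)
print(newton_raphson_logistic(x, y))  # should be roughly (-0.5, 1.2)
```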

Convergence Criterion

$$\left|\frac{\beta_{j+1} - \beta_j}{\beta_j}\right| < 0.0001$$

Convergence problems: complete and quasi-complete separation.

Complete separation: the MLE does not exist (i.e., it is infinite); each iteration pushes β further (β, β+1, ...) without converging. (Plot of an outcome perfectly separated by the predictor.)

Quasi-complete separation: the same values of the predictors with different outcomes. (Plot.)

No (quasi-)complete separation: it is fine to find the MLE. (Plot of overlapping outcomes.)

How good is the model?
- Is it better than predicting the same prior probability for everyone (i.e., a model with just β₀)?
- How well do the training data fit?
- How well does it generalize?

Generalized likelihood-ratio test. Are β₁, β₂, ..., βₚ different from 0?

$$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$$
$$\log L = \sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$
$$G = -2\log L_0 + 2\log L_1$$

G has a χ² distribution. Note the related cross-entropy error:

$$\text{cross entropy error} = -\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$
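A minimal sketch of the G statistic on invented toy data, fitting the full model with a few Newton-Raphson steps and comparing it to the intercept-only model (the same prior probability for everyone); the chi-square p-value uses one degree of freedom for the one extra parameter:

```python
import numpy as np
from scipy.stats import chi2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(p, y):
    # Bernoulli log likelihood: sum of y*log(p) + (1-y)*log(1-p)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (invented for illustration).
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = (rng.random(300) < sigmoid(-0.3 + 1.0 * x)).astype(float)

# Fit the full model (intercept + x) with a few Newton-Raphson steps.
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(20):
    p = sigmoid(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))

# Null model: the same prior probability (the observed mean of y) for everyone.
loglik_null = log_lik(np.full_like(y, y.mean()), y)
loglik_full = log_lik(sigmoid(X @ beta), y)

G = -2 * loglik_null + 2 * loglik_full   # G = -2 log L_0 + 2 log L_1
p_value = chi2.sf(G, df=1)               # chi-square with 1 extra parameter
print(round(G, 2), p_value)
```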

AIC, SC, BIC: used to compare models.
- Akaike's Information Criterion, with k parameters: AIC = -2 log L + 2k.
- Schwarz Criterion (Bayesian Information Criterion), with n cases: BIC = -2 log L + k log n.
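A minimal sketch of the two criteria as functions of a fitted log likelihood; `loglik`, `k`, and `n` are placeholder values, not results from a real fit:

```python
import math

def aic(loglik, k):
    """Akaike's Information Criterion: -2 log L + 2k, with k parameters."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Schwarz / Bayesian Information Criterion: -2 log L + k log n, with n cases."""
    return -2 * loglik + k * math.log(n)

# Example with placeholder values for a fitted model.
loglik, k, n = -180.4, 2, 300
print(aic(loglik, k), bic(loglik, k, n))
```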

Summary
- Maximum Likelihood Estimation is used to find the parameters of models.
- MLE maximizes the probability that the observed data would have been generated by the model.
- Coming up: goodness of fit (how good are the predictions?). How well do the training data fit? How well does the model generalize?