Logistic Regression Maximum Likelihood Estimation

1 Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo
6.873/HST.951 Medical Decision Support, Fall 2005
Logistic Regression: Maximum Likelihood Estimation
Lucila Ohno-Machado

2 Risk Score of Death from Angioplasty
Unadjusted overall mortality rate = 2.1%
[Figure: number of cases (%) and mortality risk (%) by risk score category: 0 to 2, 3 to 4, 5 to 6, 7 to 8, 9 to 10, >10. Individual bar values are not recoverable.]

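3 Linear Regression: Ordinary Least Squares (OLS)
Minimize the Sum of Squared Errors (SSE) over n data points, where i is the subscript for each point.
$\hat{y}_i = \beta_0 + \beta_1 x_i$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$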
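As a minimal sketch of the OLS criterion on this slide, the closed-form least-squares estimates for one predictor can be computed directly. The function names `fit_ols` and `sse` and the data are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_ols(x, y):
    """Return (beta0, beta1) minimizing the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

def sse(x, y, beta0, beta1):
    """Sum of squared errors for the line y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * np.asarray(x, dtype=float)
    return float(np.sum((np.asarray(y, dtype=float) - y_hat) ** 2))

# Hypothetical data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_ols(x, y)
print(beta0, beta1, sse(x, y, beta0, beta1))
```

Because the example data lie exactly on a line, the fitted SSE is zero up to rounding; with noisy data it would be the smallest value any straight line can achieve.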

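4 Logit
$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}$
$p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i} + 1}$
$\log \frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i$ (the logit)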
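The three expressions on this slide describe the same relationship. The sketch below checks that numerically; the coefficient values and the helper names `logistic` and `logit` are assumptions made for the example.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability to its log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor value
beta0, beta1 = -1.0, 0.5
x = 3.0
z = beta0 + beta1 * x

p = logistic(z)                            # first form on the slide
p_alt = math.exp(z) / (math.exp(z) + 1.0)  # second, equivalent form

print(p, p_alt, logit(p))  # logit(p) recovers z
```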

5 Increasing β
[Figure: plotted series for increasing values of β; the plot is not recoverable.]

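6 Finding $\beta_0$
Baseline case: $p = \frac{1}{1 + e^{-\beta_0}}$
[Table: Death, Life, and Total counts for Blue (1) and Green (0); the counts are not recoverable.]
Setting the baseline probability equal to $\frac{1}{1 + e^{-\beta_0}}$ and solving gives $\beta_0$ (the numeric values are not recoverable).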

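7 Odds ratio
Odds: $p / (1 - p)$
Odds ratio: $OR = \frac{p_{death|blue} / (1 - p_{death|blue})}{p_{death|green} / (1 - p_{death|green})}$
[Table: Death, Life, and Total counts for Blue and Green. Only the counts 28 (numerator) and 52 (denominator) survive in the worked ratio; the remaining values and the result are not recoverable.]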

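8 What do coefficients mean?
$e^{\beta_{color}} = OR_{color}$
[Table: Death, Life, and Total counts for Blue and Green; most counts are not recoverable.]
$OR = \frac{28 / 45}{\ldots / 52}$, so $e^{\beta_{color}}$ equals this ratio and $\beta_{color}$ is its natural logarithm (the numeric results are not recoverable).
$p_{blue} = \frac{1}{1 + e^{-(\beta_0 + \beta_{color})}}$, $p_{green} = \frac{1}{1 + e^{-\beta_0}}$ (the numeric values are not recoverable).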
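To make the link between a coefficient and an odds ratio concrete, here is a small sketch for a model with one binary predictor. The counts are hypothetical and are not the slide's table, whose values did not survive. For this simple case the maximum likelihood fit reproduces the observed proportions, so the coefficients can be read off the 2x2 table.

```python
import math

# Hypothetical 2x2 table (NOT the counts from the slide)
deaths = {"exposed": 30, "baseline": 20}
lives = {"exposed": 70, "baseline": 80}

# Odds of death in each group: deaths / survivors
odds_exposed = deaths["exposed"] / lives["exposed"]
odds_baseline = deaths["baseline"] / lives["baseline"]
odds_ratio = odds_exposed / odds_baseline

# Intercept is the log odds of the baseline group;
# the group coefficient is the log of the odds ratio.
beta0 = math.log(odds_baseline)
beta_group = math.log(odds_ratio)

# Plugging the coefficients back into the logistic function
# recovers the observed death proportions in each group.
p_baseline = 1.0 / (1.0 + math.exp(-beta0))
p_exposed = 1.0 / (1.0 + math.exp(-(beta0 + beta_group)))

print(odds_ratio, beta_group, p_baseline, p_exposed)
```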

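9 What do coefficients mean?
$e^{\beta_{age}} = OR_{age}$
$OR = \frac{p_{death|age=50} / (1 - p_{death|age=50})}{p_{death|age=49} / (1 - p_{death|age=49})}$
[Table: Death, Life, and Total counts for Age49 and Age50; the counts are not recoverable.]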

10 Why not search using OLS?
OLS: $\hat{y}_i = \beta_0 + \beta_1 x_i$, $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Logit: $\log \frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i$

11 P(model | data)?
$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}$
If only the intercept is allowed, which value would it have?
[Figure: plots of y against x; not recoverable.]

12 P(data | model)?
P(data | model) = [P(model | data) P(data)] / P(model)
When comparing models:
P(model): assume all the same (i.e., the chances of being a model with high coefficients are the same as low, etc.)
P(data): assume it is the same
Then, P(data | model) ∝ P(model | data)

13 Maximum Likelihood Estimation
Maximize P(data | model)
Maximize the probability that we would observe what we observed (given the assumption of a particular model)
Choose the best parameters from the particular model
[Figure label: logit]

14 Maximum Likelihood Estimation
Steps:
Define an expression for the probability of the data as a function of the parameters
Find the values of the parameters that maximize this expression

15 Likelihood Function
$L = \Pr(Y)$
$L = \Pr(y_1, y_2, \ldots, y_n)$
$L = \Pr(y_1) \Pr(y_2) \cdots \Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)$

16 Likelihood Function
$L = \Pr(Y)$
$L = \Pr(y_1, y_2, \ldots, y_n)$
$L = \Pr(y_1) \Pr(y_2) \cdots \Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)$
Binomial:
$\Pr(y_i = 1) = p_i$
$\Pr(y_i = 0) = 1 - p_i$
$\Pr(y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$
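The binomial likelihood above, together with the two estimation steps (write the probability of the data as a function of the parameters, then maximize it), can be sketched for the intercept-only model raised earlier in the lecture. The data and the names `likelihood` and `intercept_only_likelihood` are hypothetical, and the grid search stands in for a proper optimizer purely for illustration.

```python
import math

def likelihood(p, y):
    """L = product over i of p_i^y_i * (1 - p_i)^(1 - y_i)."""
    result = 1.0
    for p_i, y_i in zip(p, y):
        result *= p_i ** y_i * (1.0 - p_i) ** (1 - y_i)
    return result

# Hypothetical outcomes: 3 events among 10 cases
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

def intercept_only_likelihood(beta0):
    """Step 1: likelihood as a function of the single parameter beta0."""
    p = 1.0 / (1.0 + math.exp(-beta0))
    return likelihood([p] * len(y), y)

# Step 2: find the parameter value that maximizes the likelihood
# (coarse grid search over [-3, 3] in steps of 0.001)
grid = [i / 1000 for i in range(-3000, 3001)]
beta0_grid = max(grid, key=intercept_only_likelihood)

# For this model the maximizer is the log odds of the observed proportion
y_bar = sum(y) / len(y)
beta0_closed = math.log(y_bar / (1.0 - y_bar))

print(beta0_grid, beta0_closed)
```

The grid estimate agrees with the closed-form log odds to within the grid spacing, which is one way to see that an intercept-only model predicts the overall event rate for everyone.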

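17 Likelihood Function
$L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$
$L = \prod_{i=1}^{n} \left( \frac{p_i}{1 - p_i} \right)^{y_i} (1 - p_i)$
$\log L = \sum_i y_i \log \frac{p_i}{1 - p_i} + \sum_i \log(1 - p_i)$
$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$, since the model is the logit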

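18 Log Likelihood Function
$L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$
$L = \prod_{i=1}^{n} \left( \frac{p_i}{1 - p_i} \right)^{y_i} (1 - p_i)$
$\log L = \sum_i y_i \log \frac{p_i}{1 - p_i} + \sum_i \log(1 - p_i)$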

19 Log Likelihood Function
$\log L = \sum_i y_i \log \frac{p_i}{1 - p_i} + \sum_i \log(1 - p_i)$
$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$, since the model is the logit
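The step from the probability form of the log likelihood to the form written in terms of β relies on $\log \frac{p_i}{1 - p_i} = \beta x_i$ and $\log(1 - p_i) = -\log(1 + e^{\beta x_i})$. A quick numerical check, using hypothetical data and a single coefficient with no intercept as on the slide:

```python
import numpy as np

def log_lik_from_p(beta, x, y):
    """log L = sum y_i log(p_i / (1 - p_i)) + sum log(1 - p_i)."""
    p = 1.0 / (1.0 + np.exp(-beta * x))
    return float(np.sum(y * np.log(p / (1.0 - p)) + np.log(1.0 - p)))

def log_lik_from_beta(beta, x, y):
    """log L = sum y_i (beta x_i) - sum log(1 + e^(beta x_i))."""
    z = beta * x
    return float(np.sum(y * z - np.log1p(np.exp(z))))

# Hypothetical data and coefficient
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = 0.7

form_p = log_lik_from_p(beta, x, y)
form_beta = log_lik_from_beta(beta, x, y)
print(form_p, form_beta)  # the two forms agree up to rounding
```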

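20 Maximize
$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$
$\frac{\partial \log L}{\partial \beta} = \sum_i x_i (y_i - \hat{y}_i) = 0$
$\hat{y}_i = \frac{1}{1 + e^{-\beta x_i}}$
Not easy to solve because y-hat is non-linear; iterative methods are needed, the most popular being Newton-Raphson.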

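21 Maximize
$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$
$\frac{\partial \log L}{\partial \beta} = \sum_i x_i (y_i - \hat{y}_i) = 0$
$\hat{y}_i = \frac{1}{1 + e^{-\beta x_i}}$
Not easy to solve because y-hat is non-linear; iterative methods are needed, the most popular being Newton-Raphson.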
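One way to gain confidence in the derivative above is to compare it with a finite-difference approximation of the log likelihood. The sketch below does this for a single coefficient; the data, the evaluation point, and the function names are hypothetical.

```python
import numpy as np

def log_lik(beta, x, y):
    """log L = sum y_i (beta x_i) - sum log(1 + e^(beta x_i))."""
    z = beta * x
    return float(np.sum(y * z - np.log1p(np.exp(z))))

def score(beta, x, y):
    """Analytic derivative: sum x_i (y_i - y_hat_i)."""
    y_hat = 1.0 / (1.0 + np.exp(-beta * x))
    return float(np.sum(x * (y - y_hat)))

# Hypothetical data
x = np.array([-2.0, -1.0, 0.5, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])

beta = 0.3
h = 1e-6
# Central difference approximation of d(log L)/d(beta)
numeric = (log_lik(beta + h, x, y) - log_lik(beta - h, x, y)) / (2.0 * h)
analytic = score(beta, x, y)
print(numeric, analytic)
```

The derivative depends on β through y-hat, which is why setting it to zero has no closed-form solution in general.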

22 Newton-Raphson
Start with random or zero βs
"Walk" in the direction that maximizes the likelihood:
how big a step (gradient or score)
which direction

23 Maximizing the Log Likelihood
[Figure: log likelihood plotted against β, showing the initial LL at $\beta_i$ and the first-iteration LL at $\beta_{i+1}$.]

24 Maximizing the Log Likelihood
[Figure: log likelihood plotted against β, showing the new initial LL at $\beta_i$ and the second-iteration LL at $\beta_{i+1}$.]

25 Similar iterative method to minimizing the error in gradient descent (neural nets)
[Figure: error surface, marking the initial error at w initial, the negative derivative and resulting positive change in w, and the final error (a local minimum) at w trained.]

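26 Newton-Raphson Algorithm
$\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$
Gradient: $U(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_i x_i (y_i - \hat{y}_i)$
Hessian: $I(\beta) = \frac{\partial^2 \log L}{\partial \beta \, \partial \beta'} = -\sum_i x_i x_i' \hat{y}_i (1 - \hat{y}_i)$
A step: $\beta_{j+1} = \beta_j - I^{-1}(\beta_j) U(\beta_j)$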
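The algorithm can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the course's implementation: the function name `newton_raphson_logistic`, the data, the absolute step-size stopping rule, and the iteration cap are all choices made for the example, and the design matrix is assumed to include a column of ones for the intercept.

```python
import numpy as np

def newton_raphson_logistic(X, y, tol=1e-8, max_iter=25):
    """Maximize the logistic log likelihood by Newton-Raphson.

    X is an (n, k) design matrix that already includes a column of ones.
    Returns (beta, number_of_iterations).
    """
    beta = np.zeros(X.shape[1])  # start from zero coefficients
    for iteration in range(1, max_iter + 1):
        y_hat = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - y_hat)                       # U(beta)
        hessian = -(X.T * (y_hat * (1.0 - y_hat))) @ X     # I(beta)
        step = np.linalg.solve(hessian, gradient)          # I^{-1} U
        beta = beta - step                                 # one Newton step
        # Assumed stopping rule: largest absolute change below tol
        if np.max(np.abs(step)) < tol:
            return beta, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical data in which the outcomes overlap (no separation)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta, n_iter = newton_raphson_logistic(X, y)
print(beta, n_iter)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is usually the numerically safer choice.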

27 Convergence Criterion
$\frac{\beta_{j+1} - \beta_j}{\beta_j} <$ a small threshold (the value is not recoverable)
Convergence problems: complete and quasi-complete separation

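28 Complete separation
The MLE does not exist (i.e., it is infinite)
[Figure: plots of y, with successive estimates $\beta_i$ and $\beta_{i+1}$ marked; not recoverable.]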
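A small numerical illustration of why the estimate is infinite under complete separation: when a threshold on the predictor splits the outcomes perfectly, the log likelihood keeps increasing as the coefficient grows, so no finite maximizer exists. The data below are hypothetical and use a single coefficient with no intercept.

```python
import numpy as np

def log_lik(beta, x, y):
    """log L = sum y_i (beta x_i) - sum log(1 + e^(beta x_i))."""
    z = beta * x
    # logaddexp(0, z) computes log(1 + e^z) without overflow
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

# Hypothetical data: every x < 0 has y = 0 and every x > 0 has y = 1
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

betas = [1.0, 2.0, 5.0, 10.0, 20.0]
values = [log_lik(b, x, y) for b in betas]
for b, v in zip(betas, values):
    print(b, v)  # log likelihood rises toward 0 as beta grows
```

An iterative fitter run on such data keeps taking steps toward larger coefficients, which is why separation shows up as a convergence problem.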

29 Quasi-complete separation
Same values for predictors, different outcomes
[Figure: plot of y; not recoverable.]

30 No (quasi-)complete separation: fine for finding the MLE
[Figure: plot of y; not recoverable.]

31 How good is the model?
Is it better than predicting the same prior probability for everyone (i.e., a model with just $\beta_0$)?
How well do the training data fit?
How well does it generalize?

32 Generalized likelihood-ratio test
Are $\beta_1, \beta_2, \ldots, \beta_i$ different from 0?
$L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$
$\log L = \sum_i [y_i \log p_i + (1 - y_i) \log(1 - p_i)]$
$G = -2 \log L_0 + 2 \log L_1$
G has a $\chi^2$ distribution
cross-entropy error $= -\sum_i [y_i \log p_i + (1 - y_i) \log(1 - p_i)]$
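A minimal sketch of the test for a single added predictor, comparing an intercept-only model (log L_0) with a model that also includes x (log L_1). The data, the helper names, and the fixed number of Newton-Raphson iterations are assumptions for illustration. With one extra parameter G is referred to a chi-square distribution with 1 degree of freedom, whose upper tail can be written with the complementary error function.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Newton-Raphson fit with a fixed iteration count (assumes no separation)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = -(X.T * (p * (1.0 - p))) @ X
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

def log_lik(X, y, beta):
    """log L = sum [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X0 = np.ones((len(y), 1))                   # intercept only
X1 = np.column_stack([np.ones_like(x), x])  # intercept + x

log_l0 = log_lik(X0, y, fit_logistic(X0, y))
log_l1 = log_lik(X1, y, fit_logistic(X1, y))

G = -2.0 * log_l0 + 2.0 * log_l1
# Upper-tail probability of chi-square with 1 degree of freedom
p_value = math.erfc(math.sqrt(G / 2.0))
cross_entropy_error = -log_l1

print(G, p_value, cross_entropy_error)
```

For more than one added parameter the degrees of freedom equal the number of added parameters, and a general chi-square tail function would be needed in place of the `erfc` shortcut.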

33 AIC, SC, BIC
To compare models
Akaike's Information Criterion, k parameters: $AIC = -2 \log L + 2k$
Schwarz Criterion, also called the Bayesian Information Criterion, n cases: $BIC = -2 \log L + k \log n$
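Both criteria are simple functions of the maximized log likelihood. The sketch below evaluates them for two models; the log likelihood values, parameter counts, and sample size are hypothetical. Lower values indicate the preferred model under either criterion.

```python
import math

def aic(log_l, k):
    """Akaike's Information Criterion: -2 log L + 2k."""
    return -2.0 * log_l + 2.0 * k

def bic(log_l, k, n):
    """Schwarz / Bayesian Information Criterion: -2 log L + k log n."""
    return -2.0 * log_l + k * math.log(n)

# Hypothetical fits: a small model and a larger one on n = 100 cases
n = 100
log_l_small, k_small = -52.0, 2
log_l_large, k_large = -50.5, 4

aic_small, aic_large = aic(log_l_small, k_small), aic(log_l_large, k_large)
bic_small, bic_large = bic(log_l_small, k_small, n), bic(log_l_large, k_large, n)

print(aic_small, aic_large)
print(bic_small, bic_large)
```

In this made-up comparison the larger model has the higher log likelihood, yet both criteria favor the smaller one because the gain does not offset the penalty for two extra parameters; BIC penalizes them more heavily since log n exceeds 2 here.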

34 Summary
Maximum Likelihood Estimation is used in finding parameters for models
MLE maximizes the probability that the data obtained would have been generated by the model
Coming up: goodness-of-fit (how good are the predictions?)
How well do the training data fit?
How well does it generalize?
