STAT 405 BIOSTATISTICS (Fall 2016) Handout 15 Introduction to Logistic Regression

Elwin Gibbs
This handout covers material found in Section 3.7 of your text. You may also want to review the regression techniques in the regression chapter of your text. In this handout, we will discuss the regression problem when the response variable is binary, indicating either the presence or absence of some characteristic.

EXAMPLE: The data in the file CHD.csv were taken from the text Applied Logistic Regression by Hosmer and Lemeshow. Researchers are interested in the relationship between age and the presence (or absence) of evidence of coronary heart disease (CHD). A portion of this data set is shown below.

[Table: a portion of the CHD.csv data, with variables Age and CHD]

Suppose we wanted to predict the presence of CHD based on Age. After reading the data into R, you should attach the data set so that you can reference variables by name.

> attach(chd.data)

Then, we can examine the relationship between the presence of CHD and Age.

> scatter.smooth(Age,CHD)

Note that we could fit a linear regression model with Age as the predictor and CHD as the response. Before we do this in R, consider our linear regression model:

$$\mathrm{CHD}_i \mid \mathrm{Age}_i = \eta_0 + \eta_1 \mathrm{Age}_i + e_i,$$

where $\mathrm{CHD}_i \mid \mathrm{Age}_i$ equals 0 in the absence of CHD and 1 in the presence of CHD. Note that the mean function is given by

$$E(\mathrm{CHD}_i \mid \mathrm{Age}_i) = \eta_0 + \eta_1 \mathrm{Age}_i.$$

Bernoulli Random Variables

Now, let $\theta(\mathrm{Age}_i)$ denote the probability of having CHD for a given Age. Note that $\mathrm{CHD}_i \mid \mathrm{Age}_i$ is a Bernoulli random variable with the following probability distribution:

CHD | Age         0                1
P(CHD | Age)      1 - θ(Age)       θ(Age)

We can find the expected value of the Bernoulli random variable as follows:

$$0 \cdot [1 - \theta(\mathrm{Age}_i)] + 1 \cdot \theta(\mathrm{Age}_i) = \theta(\mathrm{Age}_i) = E(\mathrm{CHD}_i \mid \mathrm{Age}_i).$$

Why is this important? This shows that

$$E(\mathrm{CHD}_i \mid \mathrm{Age}_i) = \eta_0 + \eta_1 \mathrm{Age}_i = \theta(\mathrm{Age}_i) = P(\mathrm{CHD}_i = 1 \mid \mathrm{Age}_i).$$

That is, the regression line gives an estimate of the probability of having CHD for a given Age.

> chd.lm<-lm(CHD~Age)
> plot(Age,CHD)
> abline(chd.lm)

Some Problems that Arise Using this Model

1. Non-normality of the error terms. Only two different error terms are possible for each $\mathrm{Age}_i$: $1 - \theta(\mathrm{Age}_i)$ if the response is 1, and $-\theta(\mathrm{Age}_i)$ if the response is 0.

2. Non-constant variance of the error terms. Since $\mathrm{CHD}_i \mid \mathrm{Age}_i$ is a Bernoulli random variable, we know that

$$\mathrm{Var}(\mathrm{CHD}_i \mid \mathrm{Age}_i) = \theta(\mathrm{Age}_i)\,[1 - \theta(\mathrm{Age}_i)].$$

This then implies that

$$\mathrm{Var}(\mathrm{CHD}_i \mid \mathrm{Age}_i) = [\eta_0 + \eta_1 \mathrm{Age}_i]\,[1 - \eta_0 - \eta_1 \mathrm{Age}_i].$$

That is, the variance function varies with Age and is NOT constant.

3. Constraints on the response function. A linear representation permits estimates or predictions outside the range 0 to 1, which is not correct when modeling probabilities. For example, what is our estimate of θ(Age = 20) if we use a linear regression model?

Comment: The constraint that the mean function fall between 0 and 1 frequently rules out a linear response function. For our CHD example, the use of a linear response function might require us to assume a probability of 0 for the mean response for all individuals beneath a certain age and a probability of 1 for all individuals over a certain age (see below). Such a model is often considered unreasonable, however. Ideally, we'd like to find a model where the probabilities 0 and 1 are reached asymptotically. One such model is the logistic regression model.

The Simple Logistic Mean Function

We parameterize this model as follows:

$$E(y_i \mid x_i) = \theta(x_i) = \frac{\exp(\eta_0 + \eta_1 x_i)}{1 + \exp(\eta_0 + \eta_1 x_i)}.$$
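
The third problem can be checked numerically. The sketch below is only an illustration on made-up data: the vectors `age.toy` and `chd.toy` are hypothetical and are not the CHD.csv data, and the logistic coefficients are arbitrary. It fits a straight line to a 0/1 response and shows that the fitted "probabilities" leave the interval from 0 to 1, while a logistic mean function does not.

```r
# Toy data, invented for illustration only (NOT the CHD.csv data):
# ages 20 to 69, with the response coded 1 for ages 45 and over.
age.toy <- 20:69
chd.toy <- as.integer(age.toy >= 45)

# Ordinary least squares fit of the 0/1 response on age.
toy.lm <- lm(chd.toy ~ age.toy)

# "Probabilities" predicted by the line at the youngest and oldest ages.
pred.lm <- predict(toy.lm, newdata = data.frame(age.toy = c(20, 69)))
print(pred.lm)   # first value falls below 0, second falls above 1

# A logistic mean function with arbitrary coefficients stays inside (0, 1).
pred.logistic <- plogis(-5 + 0.11 * c(20, 69))
print(pred.logistic)
```

The toy response is deliberately extreme so that the effect is easy to see; with real data the straight line may leave the 0 to 1 range only near the ends of the age range.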

Some examples of simple logistic mean functions are shown below:

[Figure: examples of simple logistic mean functions, in two panels]

Comments:

1. The logistic mean function is always between 0 and 1.
2. As $\eta_1$ increases, the function becomes more S-shaped; therefore, the function changes more rapidly in the center.
3. When $\eta_1$ is positive, the function is monotone increasing; when $\eta_1$ is negative, the function is monotone decreasing.
4. Changing $\eta_0$ shifts the function horizontally.
5. The logistic function possesses the property of symmetry. If the response variable is recoded by changing 1s to 0s and 0s to 1s, the signs of all coefficients will be reversed.

To fit the logistic regression model in R, you can use the following programming statements:

> chd.glm <- glm(CHD~Age,family="binomial")
> summary(chd.glm)

Call:
glm(formula = CHD ~ Age, family = "binomial")

Deviance Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-06 ***
Age              ...        ...     ...  ...e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ...     on 99  degrees of freedom
Residual deviance: 107.35  on 98  degrees of freedom
AIC: 111.35

Number of Fisher Scoring iterations: 4
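
The comments about the shape of the logistic mean function can be explored directly. The following is a minimal sketch: the helper `theta()` and all coefficient values are hypothetical choices made for illustration, and `plogis()` is R's built-in logistic distribution function, which has the same form as the mean function above.

```r
# Simple logistic mean function:
# theta(x) = exp(eta0 + eta1*x) / (1 + exp(eta0 + eta1*x)).
theta <- function(x, eta0, eta1) plogis(eta0 + eta1 * x)

x <- seq(-10, 10, by = 0.5)

# Arbitrary coefficient values chosen only to illustrate the comments.
up    <- theta(x, eta0 = 0, eta1 =  1)   # positive slope: increasing
down  <- theta(x, eta0 = 0, eta1 = -1)   # negative slope: decreasing
steep <- theta(x, eta0 = 0, eta1 =  3)   # larger slope: sharper S-shape
shift <- theta(x, eta0 = 2, eta1 =  1)   # new intercept: curve moves sideways

matplot(x, cbind(up, down, steep, shift), type = "l", lty = 1,
        ylab = "theta(x)")

# Symmetry: recoding the response (1 <-> 0) reverses the sign of
# both coefficients.
recoded <- 1 - theta(x, eta0 = 2, eta1 = 1)
flipped <- theta(x, eta0 = -2, eta1 = -1)
print(all.equal(recoded, flipped))
```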

To see a plot of the predicted probabilities based on age, use the following commands.

> prob.chd <- fitted(chd.glm)
> plot(Age,prob.chd,ylab="P(CHD|Age)")

This curve is a plot of:

$$\hat\theta(\mathrm{Age}_i) = \hat P(\mathrm{CHD}_i = 1 \mid \mathrm{Age}_i) = \frac{\exp(\hat\eta_0 + \hat\eta_1 \mathrm{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1 \mathrm{Age}_i)}$$

To fit the logistic regression model in SAS, you can use the following programming statements:

ods html;
ods graphics on;
proc logistic descending plots=effect;
  model CHD = age / link=logit;
run;
ods graphics off;
ods html close;

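Questions:

1. Based on the plot of predicted probabilities, find $\hat\theta(40) = \hat P(\mathrm{CHD} = 1 \mid \mathrm{Age} = 40)$.
2. Based on the plot of predicted probabilities, find $\hat\theta(60) = \hat P(\mathrm{CHD} = 1 \mid \mathrm{Age} = 60)$.

Interpreting the Model Parameters

Mean function:

$$E(\mathrm{CHD}_i \mid \mathrm{Age}_i) = \theta(\mathrm{Age}_i) = \frac{\exp(\eta_0 + \eta_1 \mathrm{Age}_i)}{1 + \exp(\eta_0 + \eta_1 \mathrm{Age}_i)}.$$

Fitted Model Equation (or Fitted Probabilities):

$$\hat E(\mathrm{CHD}_i \mid \mathrm{Age}_i) = \hat\theta(\mathrm{Age}_i) = \frac{\exp(\hat\eta_0 + \hat\eta_1 \mathrm{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1 \mathrm{Age}_i)}$$

Note that in the mean function, the probabilities $\theta(\mathrm{Age}_i)$ are nonlinear functions of $\eta_0$ and $\eta_1$. However, a simple transformation results in a linear model. That is, we can show the following:

$$\ln\!\left(\frac{\theta(\mathrm{Age}_i)}{1 - \theta(\mathrm{Age}_i)}\right) = \eta_0 + \eta_1 \mathrm{Age}_i$$

Proof of the previous claim: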
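
One way to fill in the proof is sketched below. The shorthand $\theta = \theta(\mathrm{Age}_i)$ and $u = \eta_0 + \eta_1 \mathrm{Age}_i$ is introduced here only to keep the algebra short; it is not notation from the handout.

```latex
\begin{align*}
\theta &= \frac{e^{u}}{1 + e^{u}}
  \quad\Longrightarrow\quad
  1 - \theta = \frac{1 + e^{u} - e^{u}}{1 + e^{u}} = \frac{1}{1 + e^{u}} \\[4pt]
\frac{\theta}{1 - \theta}
  &= \frac{e^{u}}{1 + e^{u}} \cdot \frac{1 + e^{u}}{1} = e^{u} \\[4pt]
\ln\!\left(\frac{\theta}{1 - \theta}\right)
  &= u = \eta_0 + \eta_1\,\mathrm{Age}_i
\end{align*}
```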

Much of the interpretation of the logistic regression model centers on this ratio:

$$\frac{\theta(\mathrm{Age}_i)}{1 - \theta(\mathrm{Age}_i)}.$$

This ratio compares the probability of having CHD for a given age to the probability of NOT having CHD for a given age; in other words, this is the odds of CHD for $\mathrm{Age}_i$. The natural logarithm of the odds is referred to as the log odds, or the logit. Note that the simple logistic regression model assumes a linear model for the logit:

$$\ln\!\left(\frac{\theta(\mathrm{Age}_i)}{1 - \theta(\mathrm{Age}_i)}\right) = \eta_0 + \eta_1 \mathrm{Age}_i.$$

This representation of the model shows that the regression coefficients do in fact represent changes in the log odds (or the logit). To see how this works, let's go back to our example.

> chd.glm <- glm(CHD~Age,family="binomial")
> summary(chd.glm)

Call:
glm(formula = CHD ~ Age, family = "binomial")

Deviance Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-06 ***
Age              ...        ...     ...  ...e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ...     on 99  degrees of freedom
Residual deviance: 107.35  on 98  degrees of freedom
AIC: 111.35

Number of Fisher Scoring iterations: 4

From this output, find the following:

$\hat\eta_0$:

$\hat\eta_1$:

$\hat\theta(\mathrm{Age}_i) = \dfrac{\exp(\hat\eta_0 + \hat\eta_1 \mathrm{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1 \mathrm{Age}_i)}$, where Age = 40:

The estimated odds of CHD when Age = 40:

Note that $\hat\theta(\mathrm{Age} = 40)$ is the estimated probability that a 40-year-old will have CHD. Recall that the estimated probabilities in R can be obtained as follows:

> prob.chd <- fitted(chd.glm)
> cbind(Age,prob.chd)

[Output: one row per subject, with columns Age and prob.chd]

These estimated probabilities can also be obtained from SAS:

proc logistic descending;
  model CHD = age / link=logit;
  output out=get_values predicted=predicted_probabilities;
run;
proc print data=get_values;
run;
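
The arithmetic that links the coefficients, the logit, the probability, and the odds can be sketched as follows. The values of `b0` and `b1` are hypothetical placeholders chosen to make the numbers easy to follow; they are not the estimates from the CHD.csv fit.

```r
# Hypothetical coefficient values, used only to show the arithmetic.
# They are NOT the estimates from the CHD.csv fit.
b0  <- -5.0
b1  <-  0.1
age <- 40

eta  <- b0 + b1 * age               # estimated log odds (logit)
prob <- exp(eta) / (1 + exp(eta))   # estimated probability
odds <- prob / (1 - prob)           # estimated odds

print(c(logit = eta, probability = prob, odds = odds))

# The odds equal exp(logit), and plogis() returns the same probability.
print(all.equal(odds, exp(eta)))
print(all.equal(prob, plogis(eta)))
```

With a fitted model object, the same probability could be requested with `predict(chd.glm, newdata = data.frame(Age = 40), type = "response")`, assuming the model was fit with `Age` as the predictor name.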

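Find the estimated odds of CHD when Age = 45:

$$\frac{\hat\theta(\mathrm{Age}_i)}{1 - \hat\theta(\mathrm{Age}_i)} =$$

Find the estimated odds of CHD when Age = 50:

$$\frac{\hat\theta(\mathrm{Age}_i)}{1 - \hat\theta(\mathrm{Age}_i)} =$$

Based on the previous two answers, find an odds ratio for a 5 year increase in Age:

Finding the Odds Ratio for a t-year Increase in Age

Recall that the ratio $\dfrac{\hat\theta(\mathrm{Age}_i)}{1 - \hat\theta(\mathrm{Age}_i)}$ is the estimated odds of CHD when Age = $\mathrm{Age}_i$.

First, let Age = $\mathrm{Age}_j$. Then we know that

$$\ln\!\left(\frac{\hat\theta(\mathrm{Age}_j)}{1 - \hat\theta(\mathrm{Age}_j)}\right) = \hat\eta_0 + \hat\eta_1 \mathrm{Age}_j.$$

Next, let Age = $\mathrm{Age}_j + t$. Again, we have

$$\ln\!\left(\frac{\hat\theta(\mathrm{Age}_j + t)}{1 - \hat\theta(\mathrm{Age}_j + t)}\right) = \hat\eta_0 + \hat\eta_1 (\mathrm{Age}_j + t).$$

Now, consider their difference:

$$\ln\!\left(\frac{\hat\theta(\mathrm{Age}_j + t)}{1 - \hat\theta(\mathrm{Age}_j + t)}\right) - \ln\!\left(\frac{\hat\theta(\mathrm{Age}_j)}{1 - \hat\theta(\mathrm{Age}_j)}\right) = \hat\eta_0 + \hat\eta_1 (\mathrm{Age}_j + t) - \hat\eta_0 - \hat\eta_1 \mathrm{Age}_j = t\,\hat\eta_1$$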

Note that

$$\ln(\text{OR associated with a } t\text{-year increase in age}) = \ln\!\left(\frac{\hat\theta(\mathrm{Age}_j + t)\,/\,[1 - \hat\theta(\mathrm{Age}_j + t)]}{\hat\theta(\mathrm{Age}_j)\,/\,[1 - \hat\theta(\mathrm{Age}_j)]}\right) = t\,\hat\eta_1.$$

Therefore, the odds ratio associated with a t-year increase in age is given by $e^{t\hat\eta_1}$.

Back to our example:

1. Discuss the odds ratio associated with a one-year increase in age.
2. Discuss the odds ratio associated with a five-year increase in age.
3. Is it reasonable to assume that the effect of a t-year increase in a continuous predictor is constant, regardless of the starting point? For example, does the risk associated with a 5-year increase in age remain constant throughout one's life?
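
The result above can be checked numerically. In this sketch the coefficients `b0` and `b1`, the helper `odds.at()`, and the starting ages are all hypothetical choices made for illustration, not values from the CHD.csv fit. Under the model, the odds ratio for a t-year increase is the same from every starting age, while the change in probability is not.

```r
# Hypothetical coefficient values (NOT the CHD.csv estimates).
b0 <- -5.0
b1 <-  0.1

# Estimated odds of the event at a given age under the logistic model.
odds.at <- function(age) {
  p <- plogis(b0 + b1 * age)
  p / (1 - p)
}

t.years    <- 5
start.ages <- c(25, 45, 65)

# Odds ratio for a t-year increase, from three different starting ages.
or.by.start <- odds.at(start.ages + t.years) / odds.at(start.ages)
print(or.by.start)          # the same value from every starting age
print(exp(t.years * b1))    # closed form exp(t * b1)

# The change in probability, by contrast, depends on the starting age.
prob.change <- plogis(b0 + b1 * (start.ages + t.years)) -
               plogis(b0 + b1 * start.ages)
print(prob.change)
```

A constant odds ratio is a consequence of assuming the logit is linear in age; whether that assumption is reasonable for a given data set is a separate modeling question.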

Estimation of the Model Parameters

In a linear regression analysis, the regression coefficients are estimated based on the least squares method. That is, the estimates are obtained by minimizing the sum of the squared residuals. In a logistic regression analysis, the model parameters are estimated through a process called the maximum likelihood method. The basic principle of maximum likelihood is to choose as estimates those parameter values which, if true, would maximize the probability of observing what we have actually observed. This involves:

1. Finding an expression (i.e., the likelihood function) for the probability of the data as a function of the unknown parameters.

For the logistic model, the binary response variable is assumed to follow a binomial distribution with a single trial (n = 1) and probability of success equal to $\theta(x_i)$. Therefore, for the $i$th observed pair $(x_i, y_i)$, the contribution to the likelihood is

$$\theta(x_i)^{y_i}\,(1 - \theta(x_i))^{1 - y_i}, \quad \text{where } \theta(x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \text{ and } y_i \in \{0, 1\}.$$

(In this section $\beta_0$ and $\beta_1$ play the role of $\eta_0$ and $\eta_1$ above.) Then, since we assume independence across observations, the likelihood function is given by

$$L(\boldsymbol\beta) = L(\beta_0, \beta_1) = \prod_{i=1}^{n} \theta(x_i)^{y_i}\,(1 - \theta(x_i))^{1 - y_i}.$$

2. Finding the values of the unknown parameters which make the value of this expression as large as possible.

For computational purposes it is usually easier to maximize the logarithm of the likelihood function rather than the likelihood function itself. This works because the logarithm is a monotonic increasing function; therefore, the maximizing parameters are the same for the likelihood and log-likelihood functions. The log-likelihood function is given by

$$\ln L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[\, y_i \ln \theta(x_i) + (1 - y_i) \ln(1 - \theta(x_i)) \,\right].$$

To find the parameter estimates, we solve simultaneously the equations given by setting the partial derivatives with respect to each parameter equal to 0:

$$\frac{\partial}{\partial \beta_0} \ln L(\beta_0, \beta_1) = 0, \qquad \frac{\partial}{\partial \beta_1} \ln L(\beta_0, \beta_1) = 0.$$
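
The two steps can be sketched in R. This is a minimal illustration on made-up data (the vectors `x` and `y` are hypothetical, not the CHD.csv data), and `optim()` with the BFGS method is just one convenient general-purpose optimizer, not necessarily the routine used by any particular package. For this model the two partial derivatives work out to $\sum (y_i - \theta(x_i))$ and $\sum x_i (y_i - \theta(x_i))$, which the code supplies as the gradient and then checks at the `glm()` solution.

```r
# Toy data, invented for illustration (NOT the CHD.csv data):
# four subjects at each of five x values, with 0, 1, 2, 3, 4 events.
x <- rep(1:5, each = 4)
y <- c(0,0,0,0,  0,0,0,1,  0,0,1,1,  0,1,1,1,  1,1,1,1)

# Negative log-likelihood for the simple logistic model.
# y*eta - log(1 + exp(eta)) equals
# y*log(theta) + (1 - y)*log(1 - theta) with theta = plogis(eta).
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

# Gradient of the negative log-likelihood (minus the score equations).
negloglik.grad <- function(beta) {
  theta <- plogis(beta[1] + beta[2] * x)
  -c(sum(y - theta), sum(x * (y - theta)))
}

# Numerical maximization of the log-likelihood (minimize its negative).
fit.optim <- optim(c(0, 0), negloglik, gr = negloglik.grad, method = "BFGS")

# glm() maximizes the same likelihood by Fisher scoring.
fit.glm <- glm(y ~ x, family = "binomial")

# The two sets of estimates should agree closely.
print(rbind(optim = fit.optim$par, glm = unname(coef(fit.glm))))

# At the maximum, both partial derivatives are (numerically) zero.
theta.hat <- fitted(fit.glm)
score <- c(sum(y - theta.hat), sum(x * (y - theta.hat)))
print(score)
```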

Several different nonlinear optimization routines are used to find solutions to such systems. This process gets increasingly computationally intensive as the number of terms in the model increases.

Statistical Inference for the Logistic Regression Model

First, consider the following output from PROC LOGISTIC:

proc logistic descending;
  model CHD = age / link=logit;
  output out=get_values predicted=predicted_probabilities;
run;

[Output: table of the Likelihood Ratio, Score, and Wald test statistics]

All of these statistics (Likelihood Ratio, Score, and Wald) are testing the same null hypothesis:

Ho: all explanatory variables in the model have coefficients of zero.
Ha: at least one explanatory variable in the model has a coefficient different from zero.

The Likelihood Ratio test compares the log-likelihood for the fitted model with the log-likelihood for a model with NO explanatory variables. PROC LOGISTIC reports -2 log-likelihood for each of these models, and the chi-square test statistic is the difference of these two numbers. Note that the df = 1 corresponds to the one independent variable in the model.

The Score statistic is a function of the first and second derivatives of the log-likelihood function under the null hypothesis. There is some evidence that this test does not perform as well as the likelihood ratio test for small samples.

The Wald statistic is an approximation that is more accurate with larger sample sizes.

Carrying out the Likelihood Ratio Test in R:

> summary(chd.glm)

Call:
glm(formula = CHD ~ Age, family = "binomial")

Deviance Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-06 ***
Age              ...        ...     ...  ...e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ...     on 99  degrees of freedom
Residual deviance: 107.35  on 98  degrees of freedom
AIC: 111.35

Number of Fisher Scoring iterations: 4

The test statistic is the null deviance minus the residual deviance, compared to a chi-square distribution with 1 degree of freedom:

> 1-pchisq(chd.glm$null.deviance - chd.glm$deviance, 1)
[1] ...e-08

Hypothesis Testing For Individual Coefficients

$$H_o: \eta_1 = 0 \qquad H_a: \eta_1 \neq 0$$

When the sample size is large, the test for significance of the slope parameter ($\eta_1$) can be calculated as follows:

$$z = \frac{\hat\eta_1}{SE(\hat\eta_1)} =$$

$$\chi^2 = z^2 =$$
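
Both tests can be computed from a fitted model object. The sketch below uses made-up data (the vectors `x` and `y` are hypothetical, not the CHD.csv data), so the printed numbers illustrate the mechanics only.

```r
# Toy data, invented for illustration (NOT the CHD.csv data).
x <- rep(1:5, each = 4)
y <- c(0,0,0,0,  0,0,0,1,  0,0,1,1,  0,1,1,1,  1,1,1,1)
fit <- glm(y ~ x, family = "binomial")

# Likelihood ratio test: null deviance minus residual deviance,
# compared to a chi-square distribution with 1 df (one predictor).
lr.stat <- fit$null.deviance - fit$deviance
lr.p    <- 1 - pchisq(lr.stat, df = 1)

# Wald test for the slope: z = estimate / standard error, and z^2 is
# compared to the same chi-square distribution.
coef.table <- coef(summary(fit))
est    <- coef.table["x", "Estimate"]
se     <- coef.table["x", "Std. Error"]
z      <- est / se
wald.p <- 1 - pchisq(z^2, df = 1)

print(c(LR = lr.stat, LR.p = lr.p,
        Wald.z = z, Wald.chisq = z^2, Wald.p = wald.p))

# anova() reports the same likelihood ratio statistic.
lr.table <- anova(fit, test = "Chisq")
print(lr.table)
```

The Wald p-value computed from $z^2$ matches the two-sided `Pr(>|z|)` column that `summary()` prints, since a squared standard normal variable follows a chi-square distribution with 1 df.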

Confidence Intervals for Coefficients and Corresponding Odds Ratios

A $100(1 - \alpha)\%$ confidence interval for $\eta_1$ can be calculated as follows:

$$\hat\eta_1 \pm z_{\alpha/2}\, SE(\hat\eta_1)$$

A $100(1 - \alpha)\%$ confidence interval for the odds ratio associated with $\eta_1$ is calculated as follows:

$$\exp\!\left(\hat\eta_1 \pm z_{\alpha/2}\, SE(\hat\eta_1)\right)$$

These intervals can be calculated in SAS PROC LOGISTIC as follows:

proc logistic descending;
  model CHD = age / link=logit clparm=wald;
run;

To compute the confidence intervals for the parameter estimates in R, enter the following:

> confint.default(chd.glm)
            2.5 %  97.5 %
(Intercept)   ...     ...
Age           ...     ...

To compute the odds ratio and the associated confidence intervals in R, you can use the following commands.

> exp(coef(chd.glm))
(Intercept)         Age
        ...         ...

> exp(confint.default(chd.glm))
            2.5 %  97.5 %
(Intercept)   ...     ...
Age           ...     ...

If $\eta_1$ corresponds to a continuous predictor and we wish to examine the odds ratio associated with a t unit increase in that predictor, the confidence interval for the odds ratio becomes

$$\exp\!\left(t\,\hat\eta_1 \pm z_{\alpha/2}\; t\; SE(\hat\eta_1)\right)$$

Example: Find the odds ratio for CHD associated with a 10 year increase in age, and give a 95% confidence interval based on this estimate.

proc logistic descending;
  model CHD = age / link=logit clparm=pl clodds=pl;
  units age=10;
run;

In R:

> exp(10*coef(chd.glm))
(Intercept)         Age
        ...         ...

> exp(10*confint.default(chd.glm))
                  2.5 %  97.5 %
(Intercept)         ...     ...
Age         1.89225e+00     ...

Only the Age entries are of interest here; multiplying the intercept by 10 has no useful interpretation.

The intervals shown above from R are all known as Wald intervals (based on normal-theory methods). These may not be appropriate for small samples; therefore, you may want to consider another method called the Profile Likelihood method. This involves an iterative evaluation of the likelihood function and produces intervals that may not be symmetric around the estimate.

proc logistic descending;
  model CHD = age / link=logit clparm=pl;
run;
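
The Wald formulas can also be applied by hand, which makes it clear what `confint.default()` computes. The sketch below uses made-up data (the vectors `x` and `y` are hypothetical, not the CHD.csv data), and `t.units <- 2` is an arbitrary choice of t.

```r
# Toy data, invented for illustration (NOT the CHD.csv data).
x <- rep(1:5, each = 4)
y <- c(0,0,0,0,  0,0,0,1,  0,0,1,1,  0,1,1,1,  1,1,1,1)
fit <- glm(y ~ x, family = "binomial")

est    <- unname(coef(fit)["x"])               # slope estimate
se     <- unname(sqrt(diag(vcov(fit)))["x"])   # its standard error
z.crit <- qnorm(0.975)                         # critical value for 95%

# Wald interval for the slope, then for the one-unit odds ratio.
ci.slope <- est + c(-1, 1) * z.crit * se
ci.or    <- exp(ci.slope)

# Odds ratio and Wald interval for a t-unit increase in the predictor.
t.units <- 2                                   # hypothetical choice of t
or.t    <- exp(t.units * est)
ci.or.t <- exp(t.units * est + c(-1, 1) * z.crit * t.units * se)

print(ci.slope)
print(ci.or)
print(or.t)
print(ci.or.t)

# confint.default() returns the same Wald limits for the slope.
wald.limits <- unname(confint.default(fit)["x", ])
print(wald.limits)
```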

Using the profile likelihood method in R:

> confint(chd.glm)
Waiting for profiling to be done...
            2.5 %  97.5 %
(Intercept)   ...     ...
Age           ...     ...

Questions:

1. How do these compare to the Wald confidence intervals?
2. How would you find the profile likelihood confidence interval for the odds ratio for a one-year increase in age?
