STAT 405 BIOSTATISTICS (Fall 2016) Handout 15 Introduction to Logistic Regression

This handout covers material found in Section 3.7 of your text. You may also want to review the chapter of your text on regression techniques. In this handout, we will discuss the regression problem when the response variable is binary, indicating either the presence or absence of some characteristic.

EXAMPLE: The data in the file CHD.csv were taken from the text Applied Logistic Regression by Hosmer and Lemeshow. Researchers are interested in the relationship between age and the presence (or absence) of evidence of coronary heart disease (CHD). A portion of this data set is shown below.

Suppose we wanted to predict the presence of CHD based on Age. After reading the data into R, you should attach the data set so that you can reference variables by name.

    > attach(chd.data)

Then, we can examine the relationship between the presence of CHD and Age.

    > scatter.smooth(Age,CHD)

Note that we could fit a linear regression model with Age as the predictor and CHD as the response. Before we do this in R, consider our linear regression model:

$$\text{CHD}_i \mid \text{Age}_i = \eta_0 + \eta_1\,\text{Age}_i + e_i,$$

where $\text{CHD}_i \mid \text{Age}_i = 0$ in the absence of CHD and $\text{CHD}_i \mid \text{Age}_i = 1$ in the presence of CHD. Note that the mean function is given by

$$E(\text{CHD}_i \mid \text{Age}_i) = \eta_0 + \eta_1\,\text{Age}_i.$$

Bernoulli Random Variables

Now, let $\theta(\text{Age}_i)$ denote the probability of having CHD for a given Age. Note that $\text{CHD}_i \mid \text{Age}_i$ is a Bernoulli random variable with the following probability distribution:

    CHD_i | Age_i            0                 1
    P(CHD_i | Age_i)    1 - θ(Age_i)       θ(Age_i)

We can find the expected value of the Bernoulli random variable as follows:

$$E(\text{CHD}_i \mid \text{Age}_i) = 0 \cdot [1 - \theta(\text{Age}_i)] + 1 \cdot \theta(\text{Age}_i) = \theta(\text{Age}_i).$$

Why is this important? This shows that

$$E(\text{CHD}_i \mid \text{Age}_i) = \eta_0 + \eta_1\,\text{Age}_i = \theta(\text{Age}_i) = P(\text{CHD}_i = 1 \mid \text{Age}_i).$$

That is, the regression line gives an estimate of the probability of having CHD for a given Age.

    > chd.lm <- lm(CHD~Age)
    > plot(Age,CHD)
    > abline(chd.lm)

Some Problems that Arise Using this Model

1. Non-normality of the error terms. Only two different error terms are possible for each $\text{Age}_i$: $1 - \theta(\text{Age}_i)$ if the response is 1, and $0 - \theta(\text{Age}_i)$ if the response is 0.

2. Non-constant variance of the error terms. Since $\text{CHD}_i \mid \text{Age}_i$ is a Bernoulli random variable, we know that $\text{Var}(\text{CHD}_i \mid \text{Age}_i) = \theta(\text{Age}_i)\,[1 - \theta(\text{Age}_i)]$. This then implies that $\text{Var}(\text{CHD}_i \mid \text{Age}_i) = [\eta_0 + \eta_1\,\text{Age}_i]\,[1 - \eta_0 - \eta_1\,\text{Age}_i]$. That is, the variance function varies with Age and is NOT constant.

3. Constraints on the response function. A linear representation permits estimates or predictions outside the range 0 to 1, which is not correct when modeling probabilities. For example, what is our estimate of θ(Age=2) if we use a linear regression model?

Comment: The constraint that the mean function fall between 0 and 1 frequently rules out a linear response function. For our CHD example, the use of a linear response function might require us to assume a probability of 0 for the mean response for all individuals beneath a certain age and a probability of 1 for all individuals over a certain age (see below). Such a model is often considered unreasonable, however. Ideally, we'd like to find a model where the probabilities 0 and 1 are reached asymptotically. One such model is the logistic regression model.

The Simple Logistic Mean Function

We parameterize this model as follows:

$$E(y_i \mid x_i) = \theta(x_i) = \frac{\exp(\eta_0 + \eta_1 x_i)}{1 + \exp(\eta_0 + \eta_1 x_i)}.$$
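The range problem described in item 3 can be checked numerically. The sketch below is illustrative only: the coefficients `b0` and `b1` are hypothetical values chosen for the example (they are not the fitted CHD line), and `plogis()` is base R's logistic function, exp(u) / (1 + exp(u)).

```r
# Hypothetical coefficients, assumed purely for illustration
b0 <- -0.5
b1 <- 0.02
age <- c(20, 40, 60, 80)

# A linear mean function can leave the interval [0, 1]
linear_pred <- b0 + b1 * age

# The logistic mean function applied to the same linear predictor stays in (0, 1)
logistic_pred <- plogis(b0 + b1 * age)

print(rbind(age, linear_pred, logistic_pred))
```

With these assumed values the linear predictions at the youngest and oldest ages fall outside the unit interval, while the logistic values do not.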

Some examples of simple logistic mean functions are shown below:

[Figure: examples of simple logistic mean functions, two panels]

Comments:

1. The logistic mean function is always between 0 and 1.
2. As $\eta_1$ increases, the function becomes more S-shaped; therefore, the function changes more rapidly in the center.
3. When $\eta_1$ is positive, the function is monotone increasing; when $\eta_1$ is negative, the function is monotone decreasing.
4. Changing $\eta_0$ shifts the function horizontally.
5. The logistic function possesses the property of symmetry. If the response variable is recoded by changing 0s to 1s and 1s to 0s, the signs of all coefficients will be reversed.

To fit the logistic regression model in R, you can use the following programming statements:

    > chd.glm <- glm(CHD~Age,family="binomial")
    > summary(chd.glm)

    Call:
    glm(formula = CHD ~ Age, family = "binomial")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.9718  -0.8456  -0.4576   0.8253   2.2859

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
    Age          0.11092    0.02406   4.610 4.02e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 136.66  on 99  degrees of freedom
    Residual deviance: 107.35  on 98  degrees of freedom
    AIC: 111.35

    Number of Fisher Scoring iterations: 4
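The symmetry property in comment 5 can be checked with a small simulation. This is a sketch on simulated data, not the CHD data: the sample size, seed, and generating coefficients are all assumptions made for the example.

```r
set.seed(1)
x <- runif(200, 20, 70)                     # simulated predictor (not the CHD ages)
y <- rbinom(200, 1, plogis(-5 + 0.1 * x))   # simulated 0/1 response

fit <- glm(y ~ x, family = "binomial")

# Recode the response: 0s become 1s and 1s become 0s
fit_recoded <- glm(I(1 - y) ~ x, family = "binomial")

print(coef(fit))
print(coef(fit_recoded))   # same magnitudes, opposite signs
```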

To see a plot of the predicted probabilities based on age, use the following commands.

    > prob.chd <- fitted(chd.glm)
    > plot(Age,prob.chd,ylab="P(CHD|Age)")

This curve is a plot of:

$$\hat\theta(\text{Age}_i) = \hat P(\text{CHD}_i = 1 \mid \text{Age}_i) = \frac{\exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}$$

To fit the logistic regression model in SAS, you can use the following programming statements:

    ods html;
    ods graphics on;
    proc logistic descending plots=effect;
      model CHD = age / link=logit;
    run;
    ods graphics off;
    ods html close;

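Questions:

1. Based on the plot of predicted probabilities, find $\hat\theta(40) = \hat P(\text{CHD} = 1 \mid \text{Age} = 40)$.
2. Based on the plot of predicted probabilities, find $\hat\theta(60) = \hat P(\text{CHD} = 1 \mid \text{Age} = 60)$.

Interpreting the Model Parameters

Mean function:

$$E(\text{CHD}_i \mid \text{Age}_i) = \theta(\text{Age}_i) = \frac{\exp(\eta_0 + \eta_1\,\text{Age}_i)}{1 + \exp(\eta_0 + \eta_1\,\text{Age}_i)}$$

Fitted Model Equation (or Fitted Probabilities):

$$\hat E(\text{CHD}_i \mid \text{Age}_i) = \hat\theta(\text{Age}_i) = \frac{\exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}$$

Note that in the mean function, the probabilities $\theta(\text{Age}_i)$ are nonlinear functions of $\eta_0$ and $\eta_1$. However, a simple transformation results in a linear model. That is, we can show the following:

$$\ln\left[\frac{\theta(\text{Age}_i)}{1 - \theta(\text{Age}_i)}\right] = \eta_0 + \eta_1\,\text{Age}_i$$

Proof of the previous claim: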
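One way to carry out the proof of the logit identity is sketched here as a standard algebraic argument; it is not taken from the handout itself. The shorthand $u = \eta_0 + \eta_1\,\text{Age}_i$ is introduced only for this sketch.

```latex
\begin{align*}
\theta(\text{Age}_i) &= \frac{e^{u}}{1 + e^{u}}, \qquad
1 - \theta(\text{Age}_i) = \frac{1}{1 + e^{u}}, \\[4pt]
\frac{\theta(\text{Age}_i)}{1 - \theta(\text{Age}_i)}
  &= \frac{e^{u}}{1 + e^{u}} \cdot \frac{1 + e^{u}}{1} = e^{u}, \\[4pt]
\ln\!\left[\frac{\theta(\text{Age}_i)}{1 - \theta(\text{Age}_i)}\right]
  &= u = \eta_0 + \eta_1\,\text{Age}_i .
\end{align*}
```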

Much of the interpretation of the logistic regression model centers on this ratio:

$$\frac{\theta(\text{Age}_i)}{1 - \theta(\text{Age}_i)}.$$

This ratio compares the probability of having CHD for a given age to the probability of NOT having CHD for a given age; in other words, this is the odds of CHD at $\text{Age}_i$. The natural logarithm of the odds is referred to as the log odds, or the logit. Note that the simple logistic regression model assumes a linear model for the logit:

$$\ln\left[\frac{\theta(\text{Age}_i)}{1 - \theta(\text{Age}_i)}\right] = \eta_0 + \eta_1\,\text{Age}_i.$$

This representation of the model shows that the regression coefficients do in fact represent changes in the log odds (or the logit). To see how this works, let's go back to our example.

    > chd.glm <- glm(CHD~Age,family="binomial")
    > summary(chd.glm)

    Call:
    glm(formula = CHD ~ Age, family = "binomial")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.9718  -0.8456  -0.4576   0.8253   2.2859

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
    Age          0.11092    0.02406   4.610 4.02e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 136.66  on 99  degrees of freedom
    Residual deviance: 107.35  on 98  degrees of freedom
    AIC: 111.35

    Number of Fisher Scoring iterations: 4

From this output, find the following:

$\hat\eta_0$:

$\hat\eta_1$:

$\hat\theta(\text{Age}_i) = \dfrac{\exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}$, where $\text{Age}_i = 40$:

The estimated odds of CHD when Age = 40:
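The arithmetic behind the fitted probability and the odds can be sketched in R as follows. The coefficient values are assumed to be the estimates -5.30945 and 0.11092 read from the summary output; the variable names are chosen for this example only.

```r
eta0_hat <- -5.30945   # intercept estimate (assumed, read from the summary output)
eta1_hat <-  0.11092   # Age coefficient (assumed, read from the summary output)
age <- 40

logit_hat <- eta0_hat + eta1_hat * age                 # estimated log odds
theta_hat <- exp(logit_hat) / (1 + exp(logit_hat))     # estimated probability
odds_hat  <- theta_hat / (1 - theta_hat)               # estimated odds

print(c(logit = logit_hat, probability = theta_hat, odds = odds_hat))
```

The odds computed from the probability agree with `exp(logit_hat)`, which is the logit relationship read in the other direction.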

Note that $\hat\theta(\text{Age}_i = 40)$ is the estimated probability that a 40-year old will have CHD. Recall that the estimated probabilities in R can be obtained as follows (first 38 rows, probabilities shown to four decimal places):

    > prob.chd <- fitted(chd.glm)
    > cbind(Age,prob.chd)
       Age prob.chd
    1   20   0.0435
    2   23   0.0596
    3   24   0.0662
    4   25   0.0733
    5   25   0.0733
    6   26   0.0812
    7   26   0.0812
    8   28   0.0994
    9   28   0.0994
    10  29   0.1098
    11  30   0.1211
    12  30   0.1211
    13  30   0.1211
    14  30   0.1211
    15  30   0.1211
    16  30   0.1211
    17  32   0.1468
    18  32   0.1468
    19  33   0.1612
    20  33   0.1612
    21  34   0.1768
    22  34   0.1768
    23  34   0.1768
    24  34   0.1768
    25  34   0.1768
    26  35   0.1935
    27  35   0.1935
    28  36   0.2114
    29  36   0.2114
    30  36   0.2114
    31  37   0.2305
    32  37   0.2305
    33  37   0.2305
    34  38   0.2508
    35  38   0.2508
    36  39   0.2722
    37  39   0.2722
    38  40   0.2947

These estimated probabilities can also be obtained from SAS:

    proc logistic descending;
      model CHD = age / link=logit;
      output out=get_values predicted=predicted_probabilities;
    run;
    proc print data=get_values;
    run;

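Find the estimated odds of CHD when Age = 45:

$$\frac{\hat\theta(\text{Age}_i)}{1 - \hat\theta(\text{Age}_i)} = $$

Find the estimated odds of CHD when Age = 50:

$$\frac{\hat\theta(\text{Age}_i)}{1 - \hat\theta(\text{Age}_i)} = $$

Based on the previous two answers, find an odds ratio for a 5 year increase in Age:

Finding the Odds Ratio for a t-year Increase in Age

Recall that the ratio $\dfrac{\hat\theta(\text{Age}_i)}{1 - \hat\theta(\text{Age}_i)}$ is the estimated odds of CHD when Age = $\text{Age}_i$.

First, let Age = $\text{Age}_j$. Then we know that

$$\ln\left[\frac{\hat\theta(\text{Age}_j)}{1 - \hat\theta(\text{Age}_j)}\right] = \hat\eta_0 + \hat\eta_1\,\text{Age}_j.$$

Next, let Age = $\text{Age}_j + t$. Again, we have

$$\ln\left[\frac{\hat\theta(\text{Age}_j + t)}{1 - \hat\theta(\text{Age}_j + t)}\right] = \hat\eta_0 + \hat\eta_1\,(\text{Age}_j + t).$$

Now, consider their difference:

$$\ln\left[\frac{\hat\theta(\text{Age}_j + t)}{1 - \hat\theta(\text{Age}_j + t)}\right] - \ln\left[\frac{\hat\theta(\text{Age}_j)}{1 - \hat\theta(\text{Age}_j)}\right] = \hat\eta_0 + \hat\eta_1\,(\text{Age}_j + t) - \hat\eta_0 - \hat\eta_1\,\text{Age}_j = t\,\hat\eta_1.$$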

Note that

$$\ln(\text{OR associated with a } t\text{-year increase in age}) = \ln\left[\frac{\hat\theta(\text{Age}_j + t)\,/\,[1 - \hat\theta(\text{Age}_j + t)]}{\hat\theta(\text{Age}_j)\,/\,[1 - \hat\theta(\text{Age}_j)]}\right] = t\,\hat\eta_1.$$

Therefore, the odds ratio associated with a t-year increase in age is given by $e^{t\,\hat\eta_1}$.

Back to our example:

1. Discuss the odds ratio associated with a one-year increase in age.
2. Discuss the odds ratio associated with a five-year increase in age.
3. Is it reasonable to assume that the odds ratio for a t-year increase in a continuous predictor is constant, regardless of the starting point? For example, does the risk associated with a 5-year increase in age remain constant throughout one's life?
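The odds ratio formula can be checked numerically against a ratio of odds computed directly. This is a sketch assuming the estimates -5.30945 and 0.11092 from the summary output; `odds_at()` and `odds_ratio()` are helper names invented for the example.

```r
eta0_hat <- -5.30945   # assumed intercept estimate
eta1_hat <-  0.11092   # assumed Age coefficient

# Estimated odds of CHD at a given age: exp(eta0 + eta1 * age)
odds_at <- function(age) exp(eta0_hat + eta1_hat * age)

# Odds ratio for a t-year increase in age: exp(t * eta1)
odds_ratio <- function(t) exp(t * eta1_hat)

print(odds_ratio(1))               # one-year increase
print(odds_ratio(5))               # five-year increase
print(odds_at(50) / odds_at(45))   # same five-year ratio, computed from the odds
```

Because the model is linear in the log odds, the ratio does not depend on the starting age, which is exactly the assumption question 3 asks you to examine.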

Estimation of the Model Parameters

In a linear regression analysis, the regression coefficients are estimated based on the least squares method. That is, the estimates are obtained by minimizing the sum of the squared residuals. In a logistic regression analysis, the model parameters are estimated through a process called the maximum likelihood method. The basic principle of maximum likelihood is to choose as estimates those parameter values which, if true, would maximize the probability of observing what we have actually observed. This involves:

1. Finding an expression (i.e., the likelihood function) for the probability of the data as a function of the unknown parameters. For the logistic model, the binary response variable is assumed to follow a binomial distribution with a single trial (n = 1) and probability of success equal to $\theta(x_i)$. Therefore, for the ith observed pair $(x_i, y_i)$, the contribution to the likelihood is

$$\theta(x_i)^{y_i}\,[1 - \theta(x_i)]^{1 - y_i}, \quad \text{where } \theta(x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \text{ and } y_i \in \{0, 1\}.$$

Then, since we assume independence across observations, the likelihood function is given by

$$L(\boldsymbol{\beta}) = L(\beta_0, \beta_1) = \prod_{i=1}^{n} \theta(x_i)^{y_i}\,[1 - \theta(x_i)]^{1 - y_i}.$$

2. Finding the values of the unknown parameters which make the value of this expression as large as possible. For computational purposes it is usually easier to maximize the logarithm of the likelihood function rather than the likelihood function itself. This works because the logarithm is a monotonic increasing function; therefore, the maximizing parameters are the same for the likelihood and log-likelihood functions. The log-likelihood function is given by

$$\ln L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left\{ y_i \ln \theta(x_i) + (1 - y_i) \ln[1 - \theta(x_i)] \right\}.$$

To find the parameter estimates, we solve simultaneously the equations given by setting the partial derivatives with respect to each parameter equal to 0:

$$\frac{\partial}{\partial \beta_0} \ln L(\beta_0, \beta_1) = 0, \qquad \frac{\partial}{\partial \beta_1} \ln L(\beta_0, \beta_1) = 0.$$
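To make the maximization step concrete, here is a minimal sketch that solves the two score equations by Newton-Raphson iteration and compares the result with `glm()`. It uses simulated data (the seed, sample size, and generating coefficients are assumptions), and it is one possible routine, not necessarily the one any particular package uses internally.

```r
set.seed(405)
n <- 500
x <- runif(n, 20, 70)                      # simulated predictor
y <- rbinom(n, 1, plogis(-5 + 0.1 * x))    # simulated binary response
X <- cbind(1, x)                           # design matrix with an intercept column

beta <- c(0, 0)                            # starting values for (beta0, beta1)
for (iter in 1:25) {
  theta <- as.vector(plogis(X %*% beta))          # current theta(x_i)
  score <- t(X) %*% (y - theta)                   # partial derivatives of ln L
  info  <- t(X) %*% (X * (theta * (1 - theta)))   # negative second derivatives
  step  <- as.vector(solve(info, score))          # Newton-Raphson update
  beta  <- beta + step
  if (max(abs(step)) < 1e-10) break               # stop once the update is negligible
}

fit <- glm(y ~ x, family = "binomial")
print(rbind(newton = beta, glm = unname(coef(fit))))
```

At convergence the score vector is essentially zero, which is the condition stated by the two partial-derivative equations.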

Several different nonlinear optimization routines are used to find solutions to such systems. This process gets increasingly computationally intensive as the number of terms in the model increases.

Statistical Inference for the Logistic Regression Model

First, consider the following output from PROC LOGISTIC:

    proc logistic descending;
      model CHD = age / link=logit;
      output out=get_values predicted=predicted_probabilities;
    run;

All of these statistics (the Likelihood Ratio, Score, and Wald statistics) are testing the same null hypothesis:

Ho: all explanatory variables in the model have coefficients of zero.
Ha: at least one explanatory variable in the model has a coefficient different from zero.

The Likelihood Ratio test compares the log-likelihood for the fitted model with the log-likelihood for a model with NO explanatory variables. PROC LOGISTIC reports -2 log-likelihood for each of these models, and the chi-square test statistic is the difference of these two numbers. Note that the df = 1 corresponds to the one independent variable in the model.

The Score statistic is a function of the first and second derivatives of the log-likelihood function under the null hypothesis. There is some evidence that this test does not perform as well as the likelihood ratio test for small samples.

The Wald statistic is an approximation that is more accurate with larger sample sizes.

Carrying out the Likelihood Ratio Test in R:

    > summary(chd.glm)

    Call:
    glm(formula = CHD ~ Age, family = "binomial")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.9718  -0.8456  -0.4576   0.8253   2.2859

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
    Age          0.11092    0.02406   4.610 4.02e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 136.66  on 99  degrees of freedom
    Residual deviance: 107.35  on 98  degrees of freedom
    AIC: 111.35

    Number of Fisher Scoring iterations: 4

    > 1-pchisq(136.66-107.35,1)
    [1] 6.167658e-08

Hypothesis Testing For Individual Coefficients

$$H_o: \eta_1 = 0 \qquad H_a: \eta_1 \neq 0$$

When the sample size is large, the test for significance of the slope parameter ($\eta_1$) can be calculated as follows:

$$z = \frac{\hat\eta_1}{SE(\hat\eta_1)} = $$

$$\chi^2 = z^2 = $$
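Both tests can be reproduced from the summary numbers alone. The sketch below assumes the deviances 136.66 and 107.35 and the Age estimate 0.11092 with standard error 0.02406 from the output; the variable names are illustrative.

```r
null_dev  <- 136.66   # assumed null deviance from the output
resid_dev <- 107.35   # assumed residual deviance from the output

# Likelihood ratio test: difference in deviances, 1 degree of freedom
G    <- null_dev - resid_dev
p_lr <- pchisq(G, df = 1, lower.tail = FALSE)

# Wald test for the slope: estimate divided by its standard error
z      <- 0.11092 / 0.02406
p_wald <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(c(G = G, p_lr = p_lr, z = z, z_squared = z^2, p_wald = p_wald))
```

Using `lower.tail = FALSE` avoids the loss of precision that `1 - pchisq(...)` can suffer when the p-value is very small.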

Confidence Intervals for Coefficients and Corresponding Odds Ratios

A 100(1 − α)% confidence interval for $\eta_1$ can be calculated as follows:

$$\hat\eta_1 \pm z_{\alpha/2}\,SE(\hat\eta_1)$$

A 100(1 − α)% confidence interval for the odds ratio associated with $\eta_1$ is calculated as follows:

$$\exp\left(\hat\eta_1 \pm z_{\alpha/2}\,SE(\hat\eta_1)\right)$$

These intervals can be calculated in SAS PROC LOGISTIC as follows:

    proc logistic descending;
      model CHD = age / link=logit clparm=wald;
    run;

To compute the confidence intervals for the parameter estimates in R, enter the following (limits shown rounded to four decimal places):

    > confint.default(chd.glm)
                  2.5 %   97.5 %
    (Intercept) -7.5314  -3.0875
    Age          0.0638   0.1581

To compute the odds ratio and the associated confidence intervals in R, you can use the following commands (interval limits shown rounded to six decimal places).

    > exp(coef(chd.glm))
    (Intercept)         Age
    0.004944629 1.117306795

    > exp(confint.default(chd.glm))
                   2.5 %   97.5 %
    (Intercept) 0.000536 0.045614
    Age         1.065842 1.171257

If $\eta_1$ corresponds to a continuous predictor and we wish to examine the odds ratio associated with a t unit increase in that predictor, the confidence interval for the odds ratio becomes

$$\exp\left(t\,\hat\eta_1 \pm z_{\alpha/2}\,t\,SE(\hat\eta_1)\right)$$

Example: Find the odds ratio for CHD associated with a 10 year increase in age, and give a 95% confidence interval based on this estimate.

    proc logistic descending;
      model CHD = age / link=logit clparm=pl clodds=pl;
      units age=10;
    run;

In R (values shown to four significant digits):

    > exp(10*coef(chd.glm))
    (Intercept)         Age
      8.736e-24   3.032e+00

    > exp(10*confint.default(chd.glm))
                    2.5 %    97.5 %
    (Intercept) 1.957e-33 3.900e-14
    Age         1.892e+00 4.859e+00

The intervals shown above from R are all known as Wald intervals (based on normal-theory methods). These may not be appropriate for small samples; therefore, you may want to consider another method called the Profile Likelihood method. This involves an iterative evaluation of the likelihood function and produces intervals that may not be symmetric around the estimate.

    proc logistic descending;
      model CHD = age / link=logit clparm=pl;
    run;
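The Wald interval for a t-unit odds ratio can also be computed by hand from the estimate and its standard error. This sketch assumes the values 0.11092 and 0.02406 from the summary output; the variable names are illustrative.

```r
eta1_hat <- 0.11092   # assumed Age coefficient from the summary output
se_eta1  <- 0.02406   # assumed standard error from the summary output
t_units  <- 10        # size of the increase in Age
alpha    <- 0.05

z_crit <- qnorm(1 - alpha / 2)                 # normal critical value
or_hat <- exp(t_units * eta1_hat)              # estimated odds ratio
or_ci  <- exp(t_units * eta1_hat + c(-1, 1) * z_crit * t_units * se_eta1)

print(c(odds_ratio = or_hat, lower = or_ci[1], upper = or_ci[2]))
```

The interval is symmetric on the log odds scale but not on the odds ratio scale, since exponentiation stretches the upper side.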

Using the profile likelihood method in R (limits shown rounded to three decimal places):

    > confint(chd.glm)
    Waiting for profiling to be done...
                 2.5 % 97.5 %
    (Intercept) -7.726 -3.246
    Age          0.067  0.162

Questions:

1. How do these compare to the Wald confidence intervals?
2. How would you find the profile likelihood confidence interval for the odds ratio for a one-year increase in age?
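One possible approach to an odds ratio interval based on the profile likelihood is to exponentiate the limits returned by `confint()`. The sketch below uses simulated data rather than the CHD data (seed, sample size, and generating coefficients are assumptions), and in older R versions `confint()` on a glm object relies on the MASS package being installed.

```r
set.seed(15)
x <- runif(150, 20, 70)                     # simulated predictor
y <- rbinom(150, 1, plogis(-5 + 0.1 * x))   # simulated binary response
fit <- glm(y ~ x, family = "binomial")

# Profile likelihood limits on the log odds scale
pl_ci <- suppressMessages(confint(fit))

# Exponentiate the slope limits to get limits for the one-unit odds ratio
or_hat <- unname(exp(coef(fit)["x"]))
or_ci  <- unname(exp(pl_ci["x", ]))

print(c(odds_ratio = or_hat, lower = or_ci[1], upper = or_ci[2]))
```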