CHAPTER 1: BINARY LOGIT MODEL
Prof. Alan Wan
Table of contents

1. Introduction
1.1 Dichotomous dependent variables
1.2 Problems with OLS
3.3.1 SAS codes and basic outputs
3.3.2 Wald test for individual significance
3.3.3 Likelihood-ratio, LM and Wald tests for overall significance
3.3.4 Odds ratio estimates
3.3.5 AIC, SC and Generalised R^2
3.3.6 Association of predicted probabilities and observed responses
3.3.7 Hosmer-Lemeshow test statistic
Introduction

Motivation for the Logit model:
- dichotomous dependent variables;
- problems with Ordinary Least Squares (OLS) in the face of a dichotomous dependent variable;
- alternative estimation techniques.
Dichotomous dependent variables

Variables in the social sciences are often dichotomous:
- employed vs. unemployed
- married vs. unmarried
- guilty vs. innocent
- voted vs. didn't vote
Social scientists frequently wish to estimate regression models with a dichotomous dependent variable. Most researchers are aware that something is wrong with OLS in this setting, but few know exactly what makes dichotomous variables problematic in regression, or which alternative methods are superior.
The focus of this chapter is on binary Logit models (also called logistic regression models) for dichotomous dependent variables. Logit models have many similarities to OLS, but there are also fundamental differences.
Problems with OLS

Let us examine why OLS regression runs into problems when the dependent variable is 0/1.

Example dataset: penalty.txt
- Comprises 147 penalty trials in the state of New Jersey;
- In all cases the defendant was convicted of first-degree murder with a recommendation by the prosecutor that a death sentence be imposed;
- The penalty trial is conducted to determine whether the defendant should receive the death penalty or life imprisonment.
The dataset comprises the following variables:
- DEATH: 1 for a death sentence, 0 for a life sentence
- BLACKD: 1 if the defendant was black, 0 otherwise
- WHITVIC: 1 if the victim was white, 0 otherwise
- SERIOUS: an average rating of the seriousness of the crime evaluated by a panel of judges, ranging from (least serious) to 15 (most serious)

The goal is to regress DEATH on BLACKD, WHITVIC and SERIOUS.
Note that DEATH, which has only two outcomes, follows a Bernoulli(p) distribution, with p the probability of a death sentence. Let Y = DEATH; then

Pr(Y = y) = p^y (1 - p)^{1 - y},   y = 0, 1.

Recall that Bernoulli trials lead to the Binomial distribution: if we repeat the Bernoulli(p) trial n times and count the number of successes W, then W follows a Binomial B(n, p) distribution, i.e.,

Pr(W = w) = C(n, w) p^w (1 - p)^{n - w},   0 <= w <= n.

So the Bernoulli distribution is a special case of the Binomial distribution with n = 1.
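A quick numerical check of this point (an illustrative Python sketch, not part of the chapter's SAS material): with n = 1 the binomial pmf collapses to the Bernoulli pmf.

```python
from math import comb

def binom_pmf(w, n, p):
    # Pr(W = w) for W ~ B(n, p)
    return comb(n, w) * p ** w * (1 - p) ** (n - w)

def bernoulli_pmf(y, p):
    # Pr(Y = y) = p^y (1 - p)^(1 - y), y = 0 or 1
    return p ** y * (1 - p) ** (1 - y)

p = 0.34  # roughly the sample proportion of death sentences, 50/147
# With n = 1 the two pmfs agree at both support points
print(binom_pmf(1, 1, p), bernoulli_pmf(1, p))
print(binom_pmf(0, 1, p), bernoulli_pmf(0, p))
```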
data penalty;
  infile 'd:\teaching\ms4225\penalty.txt';
  input DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC REG;
  MODEL DEATH=BLACKD WHITVIC SERIOUS;
RUN;
The REG Procedure
Model: MODEL1
Dependent Variable: DEATH

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3          2.61611       0.87204      4.11   0.0079
Error            143         30.37709       0.21243
Corrected Total  146         32.99320

Root MSE          0.46090   R-Square   0.0793
Dependent Mean    0.34014   Adj R-Sq   0.0600
Coeff Var       135.50409

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             -0.05492          0.12499     -0.44     0.6610
BLACKD       1              0.12197          0.08224      1.48     0.1403
WHITVIC      1              0.05331          0.08411      0.63     0.5272
SERIOUS      1              0.03840          0.01200      3.20     0.0017
- The coefficient of SERIOUS is positive and highly significant;
- Neither of the two racial variables is significantly different from zero;
- R^2 is low;
- The F-test indicates overall significance of the model;
- But... can we trust these results?
Note that if y is a 0/1 variable, then

E(y_i) = 1 * Pr(y_i = 1) + 0 * Pr(y_i = 0) = 1 * p_i + 0 * (1 - p_i) = p_i.

But based on the linear regression y_i = β_1 + β_2 X_i + ε_i,

E(y_i) = E(β_1 + β_2 X_i + ε_i) = β_1 + β_2 X_i + E(ε_i) = β_1 + β_2 X_i.

Therefore p_i = β_1 + β_2 X_i. This is commonly referred to as the linear probability model (LPM).
Accordingly, from the SAS results:
- a one-point increase in the SERIOUS scale is associated with a 0.038 increase in the probability of a death sentence;
- the probability of a death sentence for blacks is 0.12 higher than for non-blacks, ceteris paribus.

But do these results make sense? The LPM p_i = β_1 + β_2 X_i is actually implausible, because p_i is postulated to be a linear function of X_i and thus has no upper or lower bound. Accordingly p_i, which is a probability, can be greater than 1 or smaller than 0!
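To see the problem concretely, here is a small Python check (ours, using the OLS estimates reported above; the function name is illustrative): the fitted LPM already produces a negative "probability" within the sample's own range of covariate values.

```python
# OLS (LPM) coefficient estimates from the PROC REG output above
b0, b_blackd, b_whitvic, b_serious = -0.05492, 0.12197, 0.05331, 0.03840

def lpm_prob(blackd, whitvic, serious):
    # fitted "probability" from the linear probability model
    return b0 + b_blackd * blackd + b_whitvic * whitvic + b_serious * serious

# Non-black defendant, non-white victim, a low seriousness rating of 1:
print(lpm_prob(0, 0, 1))  # negative -- an impossible probability
```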
Odds versus probability

- Odds of an event: the ratio of the expected number of times the event will occur to the expected number of times it will not occur;
- For example, an odds of 4 means we expect 4 times as many occurrences as non-occurrences; an odds of 5/2 (or 5 to 2) means we expect 5 occurrences for every 2 non-occurrences;
- Let p be the probability of the event occurring and o the corresponding odds; then o = p/(1 - p), or p = o/(1 + o).
Relationship between probability and odds:

Probability   Odds
    0.1       0.11
    0.2       0.25
    0.3       0.43
    0.4       0.67
    0.5       1.00
    0.6       1.50
    0.7       2.33
    0.8       4.00
    0.9       9.00

o < 1 ⇔ p < 0.5 and o > 1 ⇔ p > 0.5; 0 <= o < ∞ although 0 <= p <= 1.
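These conversions are easy to verify numerically. The sketch below (Python, ours rather than the chapter's SAS) reproduces a few rows of the table and round-trips probability through odds and back.

```python
def odds_from_prob(p):
    # o = p / (1 - p)
    return p / (1 - p)

def prob_from_odds(o):
    # p = o / (1 + o)
    return o / (1 + o)

for p in (0.1, 0.5, 0.8, 0.9):
    print(p, round(odds_from_prob(p), 2))
# reproduces the table rows: 0.1 -> 0.11, 0.5 -> 1.0, 0.8 -> 4.0, 0.9 -> 9.0
```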
Death sentence by race of defendant for 147 penalty trials:

          blacks   non-blacks   total
death         28           22      50
life          45           52      97
total         73           74     147

- o_D = 50/97 = 0.52; o_D|B = 28/45 = 0.62; and o_D|NB = 22/52 = 0.42;
- Hence the ratio of blacks' odds of death to non-blacks' odds of death is 0.62/0.42 = 1.476;
- This means the odds of a death sentence for blacks are 47.6% higher than for non-blacks; equivalently, the odds of a death sentence for non-blacks are about 0.68 times the corresponding odds for blacks.
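The same arithmetic in a short Python sketch (ours, for illustration). Note that computing from the raw counts gives an odds ratio of about 1.47; the 1.476 quoted above comes from dividing the already-rounded odds 0.62/0.42.

```python
# Counts from the 2x2 table above
death_black, life_black = 28, 45
death_nonblack, life_nonblack = 22, 52

o_black = death_black / life_black            # odds of death, black defendants
o_nonblack = death_nonblack / life_nonblack   # odds of death, non-black defendants
odds_ratio = o_black / o_nonblack

print(round(o_black, 2), round(o_nonblack, 2), round(odds_ratio, 2))
```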
Logit model: basic elements

The Logit model is based on the cumulative distribution function of the logistic distribution:

p_i = 1 / (1 + e^{-(β_1 + β_2 X_i)}).

Let Z_i = β_1 + β_2 X_i; then

p_i = 1 / (1 + e^{-Z_i}) = F(β_1 + β_2 X_i) = F(Z_i).

- As Z_i ranges from -∞ to ∞, p_i ranges between 0 and 1;
- p_i is non-linearly related to Z_i.
[Figure: graph of the Logit with β_1 = 0 and β_2 = 1 — an S-shaped curve of p_i against Z_i, rising from near 0 at Z_i = -4 to near 1 at Z_i = 4.]
Note that e^{Z_i} = p_i/(1 - p_i), the odds of the event. So

ln(p_i/(1 - p_i)) = Z_i = β_1 + β_2 X_i;

in other words, the log of the odds is linear in X_i, although p_i and X_i have a non-linear relationship. This is different from the LPM.
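A quick numerical check of this point (an illustrative Python sketch): the fitted probability is non-linear in X, yet the implied log-odds recover β_1 + β_2 X exactly.

```python
import math

def logistic_cdf(z):
    # F(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

b1, b2 = 0.0, 1.0  # the values used in the figure above
for x in (-2.0, 0.0, 2.0):
    p = logistic_cdf(b1 + b2 * x)
    log_odds = math.log(p / (1 - p))
    print(x, round(p, 3), round(log_odds, 3))
# p is bounded in (0, 1) and non-linear in x, but log_odds equals b1 + b2*x
```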
For a linear model y_i = β_1 + β_2 X_i + ε_i,

∂y_i/∂X_i = β_2, a constant.

But for a Logit model, p_i = F(β_1 + β_2 X_i), so

∂p_i/∂X_i = ∂F(β_1 + β_2 X_i)/∂X_i = f(β_1 + β_2 X_i) β_2,

where f(.) is the probability density function of the logistic distribution. As f(β_1 + β_2 X_i) is always positive, the sign of β_2 indicates the direction of the relationship between p_i and X_i.
Note that for the Logit model

f(β_1 + β_2 X_i) = e^{Z_i} / (1 + e^{Z_i})^2 = F(β_1 + β_2 X_i)(1 - F(β_1 + β_2 X_i)) = p_i(1 - p_i).

Therefore

∂p_i/∂X_i = β_2 p_i(1 - p_i).

In other words, a 1-unit change in X_i does not produce a constant effect on p_i.
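The identity ∂p/∂X = β_2 p(1 - p) can be verified against a numerical derivative of the CDF. A Python sketch (the coefficient values are arbitrary illustrative choices, not estimates from the chapter):

```python
import math

def F(z):
    # logistic CDF
    return 1.0 / (1.0 + math.exp(-z))

b1, b2 = 0.5, 1.2  # arbitrary illustrative coefficients
x = 0.7
p = F(b1 + b2 * x)

analytic = b2 * p * (1 - p)  # marginal effect beta2 * p * (1 - p)
h = 1e-6
numeric = (F(b1 + b2 * (x + h)) - F(b1 + b2 * (x - h))) / (2 * h)
print(round(analytic, 6), round(numeric, 6))  # the two agree
```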
Maximum Likelihood estimation

- Note that y_i takes only the values 0 and 1, so the observed log-odds ln(y_i/(1 - y_i)) is undefined, and OLS is not an appropriate method of estimation. Maximum likelihood (ML) estimation is usually the technique to adopt;
- ML principle: choose as estimates the parameter values that would maximise the probability of what we have already observed;
- Steps of ML estimation: first, construct the likelihood function by expressing the probability of observing the data as a function of the unknown parameters; second, find the values of the unknown parameters that make this expression as large as possible.
Assuming independent sampling, the likelihood function is

L = Pr(y_1, y_2, ..., y_n) = Pr(y_1) Pr(y_2) ... Pr(y_n) = Π_{i=1}^{n} Pr(y_i).

But by definition Pr(y_i = 1) = p_i and Pr(y_i = 0) = 1 - p_i. Therefore

Pr(y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}.
So

L = Π_{i=1}^{n} Pr(y_i) = Π_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} = Π_{i=1}^{n} (p_i/(1 - p_i))^{y_i} (1 - p_i).

It is usually easier to maximise the log of L than L itself. Taking logs of both sides yields

lnL = Σ_{i=1}^{n} [y_i ln(p_i/(1 - p_i)) + ln(1 - p_i)] = Σ_{i=1}^{n} y_i ln(p_i/(1 - p_i)) + Σ_{i=1}^{n} ln(1 - p_i).
Substituting p_i = 1 / (1 + e^{-(β_1 + β_2 X_i)}) into lnL leads to

lnL = β_1 Σ_{i=1}^{n} y_i + β_2 Σ_{i=1}^{n} X_i y_i - Σ_{i=1}^{n} ln(1 + e^{β_1 + β_2 X_i}).

- There are no closed-form solutions for β_1 and β_2 when maximising lnL;
- Numerical optimisation is required. SAS uses Fisher's scoring, which is similar in principle to the Newton-Raphson algorithm.
Suppose θ is a univariate unknown parameter to be estimated. The Newton-Raphson algorithm derives estimates based on the formula

θ̂_new = θ̂_old - H^{-1}(θ̂_old) U(θ̂_old),

where U(.) and H(.) are the first and second derivatives of the objective function with respect to θ. The algorithm stops when the estimates from successive iterations converge.

Consider a simple example, where g(θ) = -θ^3 + 3θ^2 - 5. So U(θ) = -3θ(θ - 2) and H(θ) = -6(θ - 1); the actual maximum and minimum of g(θ) are located at θ = 2 and θ = 0 respectively.
- Step 1: choose an arbitrary initial starting value, say θ̂_initial = 1.5. Then U(1.5) = 2.25 and H(1.5) = -3, so the new estimate is θ̂_new = 1.5 - 2.25/(-3) = 2.25;
- Step 2: θ̂_old = 2.25. Then U(2.25) = -1.6875 and H(2.25) = -7.5, so θ̂_new = 2.25 - (-1.6875)/(-7.5) = 2.025;
- Continue with Steps 3, 4 and so on until convergence;
- Caution: suppose instead we start with θ̂_initial = 0.5. If the process is left unchecked, the algorithm will converge to the minimum located at θ = 0!
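The iterations above are easy to reproduce. A Python sketch of Newton-Raphson for this g(θ) (function names are ours), including the cautionary run that converges to the minimum:

```python
def U(t):
    # first derivative of g(t) = -t**3 + 3*t**2 - 5
    return -3 * t * (t - 2)

def H(t):
    # second derivative of g
    return -6 * (t - 1)

def newton(t, steps=30):
    # plain Newton-Raphson: t <- t - U(t)/H(t)
    for _ in range(steps):
        t = t - U(t) / H(t)
    return t

t1 = 1.5 - U(1.5) / H(1.5)  # Step 1: 2.25
t2 = t1 - U(t1) / H(t1)     # Step 2: 2.025
print(t1, t2)
print(newton(1.5))  # converges to the maximum at 2
print(newton(0.5))  # left unchecked, converges to the minimum at 0
```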
- The only difference between Fisher's scoring and the Newton-Raphson algorithm is that Fisher's scoring uses E(H(.)) in place of H(.);
- Our situation is more complicated in that there are several unknown parameters, but the optimisation principle remains the same;
- In practice we need a set of initial values: PROC LOGISTIC in SAS starts with all coefficients equal to zero.
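To make the multivariate case concrete, here is a self-contained Python sketch of Newton-Raphson for a one-regressor logit, starting from zero coefficients as PROC LOGISTIC does. The data and function name are made up for illustration; for the logit model with its canonical link, the observed Hessian equals its expectation, so Fisher's scoring and Newton-Raphson coincide.

```python
import math

def fit_logit(xs, ys, iters=50):
    """Newton-Raphson ML fit of p = 1/(1 + exp(-(b1 + b2*x))), starting at zero."""
    b1 = b2 = 0.0
    for _ in range(iters):
        # score vector U and Hessian H of lnL
        u1 = u2 = h11 = h12 = h22 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b1 + b2 * x)))
            u1 += y - p
            u2 += (y - p) * x
            w = p * (1 - p)
            h11 -= w
            h12 -= w * x
            h22 -= w * x * x
        # solve the 2x2 system H d = U by hand, then update b <- b - d
        det = h11 * h22 - h12 * h12
        d1 = (h22 * u1 - h12 * u2) / det
        d2 = (h11 * u2 - h12 * u1) / det
        b1, b2 = b1 - d1, b2 - d2
    return b1, b2

# tiny illustrative data: the outcome becomes more likely as x grows
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
print(fit_logit(xs, ys))
```

At the converged estimates the score vector is zero, which is the defining first-order condition of ML.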
PROC LOGISTIC: basic elements

data PENALTY;
  infile 'd:\teaching\ms4225\penalty.txt';
  input DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC LOGISTIC DATA=PENALTY DESCENDING;
  MODEL DEATH=BLACKD WHITVIC SERIOUS;
RUN;
The LOGISTIC Procedure

Model Information
Data Set                    WORK.PENALTY
Response Variable           DEATH
Number of Response Levels   2
Number of Observations      147
Model                       binary logit
Optimization Technique      Fisher's scoring

Response Profile
Ordered Value   DEATH   Total Frequency
      1           1           50
      2           0           97

Probability modeled is DEATH=1.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
             Intercept   Intercept and
Criterion         Only      Covariates
AIC            190.491         184.285
SC             193.481         196.247
-2 Log L       188.491         176.285

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      12.2060    3       0.0067
Score                 11.6560    3       0.0087
Wald                  10.8211    3       0.0127
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
                         Standard        Wald
Parameter   DF  Estimate    Error   Chi-Square   Pr > ChiSq
Intercept    1   -2.6516   0.6748      15.4424       <.0001
BLACKD       1    0.5952   0.3939       2.2827       0.1308
WHITVIC      1    0.2565   0.4002       0.4107       0.5216
SERIOUS      1    0.1871   0.0612       9.3342       0.0022

Odds Ratio Estimates
            Point       95% Wald
Effect   Estimate   Confidence Limits
BLACKD      1.813      0.838    3.925
WHITVIC     1.292      0.590    2.832
SERIOUS     1.206      1.069    1.359

Association of Predicted Probabilities and Observed Responses
Percent Concordant   67.2    Somers' D   0.349
Percent Discordant   32.3    Gamma       0.351
Percent Tied          0.5    Tau-a       0.158
Pairs                4850    c           0.675
Wald test for individual significance

Test of significance of an individual coefficient: H_0: β_j = 0 vs. H_1: otherwise.

Instead of reporting t-stats, PROC LOGISTIC reports Wald χ²-stats for the significance of individual coefficients. The reason is that the usual t-stat is not t-distributed in a Logit model; instead it has an asymptotic N(0, 1) distribution under the null H_0: β_j = 0. Since the square of an N(0, 1) variable is a χ² variable with 1 df, the Wald χ²-stat is just the square of the usual t-stat.
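For example, squaring the ratio of the SERIOUS estimate to its standard error (a Python sketch using the rounded values printed above) lands close to the reported Wald statistic of 9.3342; the small gap is because SAS computes from unrounded internal values.

```python
# Coefficient estimate and standard error for SERIOUS from the output above
est, se = 0.1871, 0.0612

z = est / se   # the usual t-style statistic, asymptotically N(0, 1)
wald = z ** 2  # Wald chi-square with 1 df
print(round(z, 3), round(wald, 3))
```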
Likelihood-ratio, LM and Wald tests for overall significance

Test of overall model significance: H_0: β_1 = β_2 = ... = β_k = 0 vs. H_1: otherwise.

1. Likelihood-ratio test: LR = 2[lnL(β̂^{UR}) - lnL(β̂^{R})] ~ χ²_k
2. Score (Lagrange multiplier, LM) test: LM = [U(β̂^{R})]' [-H(β̂^{R})]^{-1} [U(β̂^{R})] ~ χ²_k
3. Wald test: W = β̂^{UR}' [-H(β̂^{UR})] β̂^{UR} ~ χ²_k
Odds ratio estimates

- Odds ratio estimates are obtained by exponentiating the corresponding β estimates, i.e., e^{β̂_j};
- The (predicted) odds ratio of 1.813 indicates that the odds of a death sentence for black defendants are 81% higher than the odds for other defendants;
- Similarly, the (predicted) odds of death are about 29% higher when the victim is white, notwithstanding that the coefficient is insignificant;
- A 1-unit increase in the SERIOUS scale is associated with a 21% increase in the predicted odds of a death sentence.
AIC, SC and Generalised R^2

Model selection criteria:
1. Akaike's Information Criterion (AIC): AIC = -2[lnL - (k + 1)]
2. Schwarz Bayesian Criterion (SBC or SC): SC = -2 lnL + (k + 1) ln(n)
3. Generalised R^2 = 1 - e^{-LR/n}, analogous to the conventional R^2 used in linear regression.
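These formulas can be checked against the Model Fit Statistics printed earlier (a Python sketch; variable names are ours): with k = 3 slope coefficients and n = 147, the -2 Log L values reproduce the reported AIC, SC and likelihood-ratio statistic, and give a generalised R^2 close to the OLS R^2 of 0.0793.

```python
import math

n, k = 147, 3            # observations, slope coefficients
neg2lnL_full = 176.285   # -2 Log L, intercept and covariates (SAS output)
neg2lnL_null = 188.491   # -2 Log L, intercept only

aic = neg2lnL_full + 2 * (k + 1)            # = -2[lnL - (k+1)]
sc = neg2lnL_full + (k + 1) * math.log(n)   # = -2 lnL + (k+1) ln(n)
lr = neg2lnL_null - neg2lnL_full            # likelihood-ratio statistic
gen_r2 = 1 - math.exp(-lr / n)              # generalised R^2

print(round(aic, 3), round(sc, 3), round(lr, 4), round(gen_r2, 4))
```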
Association of predicted probabilities and observed responses

- For the 147 observations in the sample, there are C(147, 2) = 10731 ways to pair them up (without pairing an observation with itself). Of these, 5881 pairs have either both 1's or both 0's on y. These we ignore, leaving 4850 pairs in which one case has a 1 and the other a 0;
- For each of these pairs, we ask: based on the estimated model, does the case with a 1 have a higher predicted probability of attaining 1 than the case with a 0?
- If yes, the pair is "concordant"; if no, the pair is "discordant"; if the two cases have the same predicted value, the pair is a "tie";
- Obviously, the more concordant pairs, the better the fit of the model.
- Let C = number of concordant pairs, D = number of discordant pairs, T = number of ties, and N = total number of pairs before eliminating any;
- Tau-a = (C - D)/N; Somers' D (SD) = (C - D)/(C + D + T); Gamma = (C - D)/(C + D); c-stat = 0.5(1 + SD);
- All four measures lie between 0 and 1 for models that predict better than chance, with larger values corresponding to stronger association between the predicted and observed values;
- Rules of thumb for minimally acceptable levels of Tau-a, SD, Gamma and c-stat are 0.1, 0.3, 0.3 and 0.65 respectively.
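The four measures reported by PROC LOGISTIC can be recovered from the pair counts. In the Python sketch below, the integer counts are approximate reconstructions from the printed percentages of the 4850 informative pairs (an assumption, since SAS does not print the raw counts), so the results match the output only to rounding.

```python
# Approximate pair counts, reconstructed from 67.2% / 32.3% / 0.5% of 4850
C, D, T = 3259, 1567, 24   # concordant, discordant, tied (assumed integers)
N = 10731                  # all pairs, 147*146/2, before discarding any

somers_d = (C - D) / (C + D + T)
gamma = (C - D) / (C + D)
tau_a = (C - D) / N
c_stat = 0.5 * (1 + somers_d)

print(round(somers_d, 3), round(gamma, 3), round(tau_a, 3), round(c_stat, 3))
```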
Hosmer-Lemeshow goodness of fit test

- The Hosmer-Lemeshow (HL) test is a goodness-of-fit test which may be invoked by adding the LACKFIT option to the MODEL statement in PROC LOGISTIC;
- The HL statistic is calculated as follows: based on the estimated model, predicted probabilities are generated for all observations; these are sorted by size and grouped into approximately 10 intervals; within each interval, the expected frequency is obtained by adding up the predicted probabilities; expected frequencies are then compared with observed frequencies via the conventional Pearson χ² statistic, with df equal to the number of intervals minus 2.
HL = Σ_{j=1}^{2G} (O_j - E_j)^2 / E_j ~ χ²_{G-2},

where G is the number of intervals (the sum runs over the 2G cells formed by the two outcomes in each interval), and O and E are the observed and expected frequencies respectively. The LACKFIT output is as follows:

Partition for the Hosmer and Lemeshow Test
                 DEATH = 1             DEATH = 0
Group   Total   Observed   Expected   Observed   Expected
  1       15        3         2.04       12        12.96
  2       15        2         2.78       13        12.22
  3       15        3         3.49       12        11.51
  4       15        4         4.10       11        10.90
  5       15        6         4.89        9        10.11
  6       15        6         5.42        9         9.58
  7       15        4         5.97       11         9.03
  8       15        6         6.77        9         8.23
  9       15        7         7.50        8         7.50
 10       12        9         7.05        3         4.95

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square   DF   Pr > ChiSq
    3.9713    8       0.8597
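The statistic can be recomputed directly from the partition table (a Python sketch; because the expected frequencies are printed to two decimals, the result only approximates the reported 3.9713).

```python
# Observed and expected frequencies from the partition table above
obs_death = [3, 2, 3, 4, 6, 6, 4, 6, 7, 9]
exp_death = [2.04, 2.78, 3.49, 4.10, 4.89, 5.42, 5.97, 6.77, 7.50, 7.05]
obs_life = [12, 13, 12, 11, 9, 9, 11, 9, 8, 3]
exp_life = [12.96, 12.22, 11.51, 10.90, 10.11, 9.58, 9.03, 8.23, 7.50, 4.95]

# Sum (O - E)^2 / E over all 2G = 20 cells
hl = sum((o - e) ** 2 / e
         for o, e in zip(obs_death + obs_life, exp_death + exp_life))
print(round(hl, 2))  # close to the reported 3.9713, on G - 2 = 8 df
```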
Class exercises

1. Tutorial 1
2. Table 12.4 of Ramanathan (1995), Introductory Econometrics, presents information on acceptance or rejection to medical school for a sample of 60 applicants, along with a number of their characteristics. The variables are as follows:
- ACCEPT = 1 if granted acceptance, 0 otherwise;
- GPA = cumulative undergraduate grade point average;
- BIO = score in the biology portion of the Medical College Admission Test (MCAT);
- CHEM = score in the chemistry portion of the MCAT;
- PHY = score in the physics portion of the MCAT;
- RED = score in the reading portion of the MCAT;
- PRB = score in the problem portion of the MCAT;
- QNT = score in the quantitative portion of the MCAT;
- AGE = age of the applicant;
- GENDER = 1 for male, 0 for female.

Answer the following questions with the aid of the program and output medicalsas.txt and medicalout.txt uploaded on the course website:
1. Write down the estimated Logit model that regresses ACCEPT on all of the above explanatory variables.
2. Test for the overall significance of the model using the LR, LM and Wald tests. Do the three tests provide consistent results?
3. Test for the significance of the individual coefficients using the Wald test.
4. Predict the probability of success of an individual with the following characteristics: GPA=2.96, BIO=7, CHEM=7, PHY=8, RED=5, PRB=7, QNT=5, AGE=25, GENDER=0.
5. Calculate the Generalised R^2 for the above regression. How well does the model appear to fit the data?
6. AGE and GENDER represent personal characteristics. Test the hypothesis that they jointly have no impact on the probability of success.