CHAPTER 1: BINARY LOGIT MODEL

Prof. Alan Wan

Table of contents

1. Introduction
1.1 Dichotomous dependent variables
1.2 Problems with OLS
3.3.1 SAS codes and basic outputs
3.3.2 Wald test for individual significance
3.3.3 Likelihood-ratio, LM and Wald tests for overall significance
3.3.4 Odds ratio estimates
3.3.5 AIC, SC and Generalised R²
3.3.6 Association of predicted probabilities and observed responses
3.3.7 Hosmer-Lemeshow test statistic

Introduction

Motivation for the Logit model:
- Dichotomous dependent variables;
- Problems with Ordinary Least Squares (OLS) in the face of dichotomous dependent variables;
- Alternative estimation techniques.

Dichotomous dependent variables

Often variables in the social sciences are dichotomous:
- employed vs. unemployed
- married vs. unmarried
- guilty vs. innocent
- voted vs. didn't vote

Social scientists frequently wish to estimate regression models with a dichotomous dependent variable. Most researchers are aware that something is wrong with OLS in the face of a dichotomous dependent variable, but they often do not know what makes dichotomous variables problematic in regression, or what other methods are superior.

The focus of this chapter is on binary Logit models (or logistic regression models) for dichotomous dependent variables. Logits have many similarities to OLS, but there are also fundamental differences.

Problems with OLS

Examine why OLS regression runs into problems when the dependent variable is 0/1.

Example dataset: penalty.txt
- Comprises 147 penalty cases in the state of New Jersey;
- In all cases the defendant was convicted of first-degree murder, with a recommendation by the prosecutor that a death sentence be imposed;
- A penalty trial is conducted to determine whether the defendant should receive the death penalty or life imprisonment.

The dataset comprises the following variables:
- DEATH: 1 for a death sentence, 0 for a life sentence
- BLACKD: 1 if the defendant was black, 0 otherwise
- WHITVIC: 1 if the victim was white, 0 otherwise
- SERIOUS: an average rating of the seriousness of the crime as evaluated by a panel of judges, ranging from 1 (least serious) to 15 (most serious)

The goal is to regress DEATH on BLACKD, WHITVIC and SERIOUS.

Note that DEATH, which has only two outcomes, follows a Bernoulli(p) distribution, with p being the probability of a death sentence. Let Y = DEATH; then

$\Pr(Y = y) = p^y (1-p)^{1-y}, \quad y = 0, 1.$

Recall that Bernoulli trials lead to the Binomial distribution: if we repeat the Bernoulli(p) trial n times and count the number of successes W, then W follows a Binomial B(n, p) distribution, i.e.,

$\Pr(W = w) = \binom{n}{w} p^w (1-p)^{n-w}, \quad 0 \le w \le n.$

So the Bernoulli distribution is a special case of the Binomial distribution with n = 1.

data penalty;
  infile 'd:\teaching\ms4225\penalty.txt';   * read the raw penalty data;
  input DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
run;

PROC REG;
  MODEL DEATH=BLACKD WHITVIC SERIOUS;   * OLS with a 0/1 outcome: the LPM;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: DEATH

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         2.61611        0.87204         4.11    0.0079
Error             143        30.37709        0.21243
Corrected Total   146        32.99320

Root MSE           0.46090    R-Square    0.0793
Dependent Mean     0.34014    Adj R-Sq    0.0600
Coeff Var        135.50409

Parameter Estimates
                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|
Intercept     1      -0.05492     0.12499      -0.44      0.6610
BLACKD        1       0.12197     0.08224       1.48      0.1403
WHITVIC       1       0.05331     0.08411       0.63      0.5272
SERIOUS       1       0.03840     0.01200       3.20      0.0017

The coefficient of SERIOUS is positive and highly significant; neither of the two racial variables is significantly different from zero; R² is low; and the F-test indicates overall significance of the model. But... can we trust these results?

Note that if y is a 0/1 variable, then

$E(y_i) = 1 \cdot \Pr(y_i = 1) + 0 \cdot \Pr(y_i = 0) = p_i.$

But based on linear regression, $y_i = \beta_1 + \beta_2 X_i + \epsilon_i$. Hence

$E(y_i) = E(\beta_1 + \beta_2 X_i + \epsilon_i) = \beta_1 + \beta_2 X_i + E(\epsilon_i) = \beta_1 + \beta_2 X_i.$

Therefore $p_i = \beta_1 + \beta_2 X_i$. This is commonly referred to as the linear probability model (LPM).

Accordingly, from the SAS results, a one-point increase in the SERIOUS scale is associated with a 0.038 increase in the probability of a death sentence; and the probability of a death sentence for blacks is 0.12 higher than for non-blacks, ceteris paribus. But do these results make sense? The LPM $p_i = \beta_1 + \beta_2 X_i$ is actually implausible because $p_i$ is postulated to be a linear function of $X_i$ and thus has no upper or lower bound. Accordingly, $p_i$ (which is a probability) can be greater than 1 or smaller than 0!
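To see the boundary problem concretely, here is a minimal SAS sketch (the coefficients are the OLS estimates reported above; the grid of covariate values is illustrative) that evaluates the fitted LPM and flags fitted "probabilities" outside [0, 1]:

data lpm_check;
  /* OLS estimates from the PROC REG output above */
  do blackd = 0 to 1;
    do whitvic = 0 to 1;
      do serious = 1 to 15;   /* illustrative grid over the SERIOUS scale */
        phat = -0.05492 + 0.12197*blackd + 0.05331*whitvic + 0.03840*serious;
        out_of_range = (phat < 0 or phat > 1);
        output;
      end;
    end;
  end;
run;

proc print data=lpm_check noobs;
  where out_of_range;   * e.g. BLACKD=0, WHITVIC=0, SERIOUS=1 gives phat < 0;
run;

With these estimates, a non-black defendant with a non-white victim and SERIOUS = 1 already receives a negative fitted probability (−0.05492 + 0.03840 = −0.017).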

Odds versus probability

Odds of an event: the ratio of the expected number of times that the event will occur to the expected number of times it will not occur. For example, odds of 4 mean we expect 4 times as many occurrences as non-occurrences; odds of 5/2 (or 5 to 2) mean we expect 5 occurrences to every 2 non-occurrences.

Let p be the probability of an event occurring and o the corresponding odds; then

$o = \frac{p}{1-p} \quad \text{or} \quad p = \frac{o}{1+o}.$

Relationship between probability and odds:

Probability    Odds
0.1            0.11
0.2            0.25
0.3            0.43
0.4            0.67
0.5            1.00
0.6            1.50
0.7            2.33
0.8            4.00
0.9            9.00

$o < 1 \iff p < 0.5$ and $o > 1 \iff p > 0.5$; $0 \le o < \infty$ although $0 \le p \le 1$.

Death sentence by race of defendant for 147 penalty trials:

           blacks    non-blacks    total
death        28          22          50
life         45          52          97
total        73          74         147

$o_D = 50/97 = 0.52$; $o_{D|B} = 28/45 = 0.62$; and $o_{D|NB} = 22/52 = 0.42$.

Hence the ratio of blacks' odds of death to non-blacks' odds of death is 0.62/0.42 = 1.476. This means the odds of a death sentence for blacks are 47.6% higher than for non-blacks; equivalently, the odds of a death sentence for non-blacks are about 0.68 times the corresponding odds for blacks.
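These odds can be reproduced in SAS from the 2×2 table itself. The following sketch (the dataset name is illustrative, and the counts are keyed in from the table above) uses PROC FREQ with the RELRISK option, which prints the case-control odds ratio:

data death_by_race;          /* counts from the 2x2 table above */
  input blackd death count;
  datalines;
1 1 28
1 0 45
0 1 22
0 0 52
;
run;

proc freq data=death_by_race order=data;
  weight count;                     * cells are counts, not raw cases;
  tables blackd*death / relrisk;    * RELRISK reports the odds ratio;
run;

Using the unrounded cell counts, the odds ratio is (28×52)/(45×22) ≈ 1.47; the 1.476 above comes from the rounded odds 0.62/0.42.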

Logit model: basic elements

The Logit model is based on the following cumulative distribution function of the logistic distribution:

$p_i = \frac{1}{1+e^{-(\beta_1+\beta_2 X_i)}}.$

Let $Z_i = \beta_1 + \beta_2 X_i$; then $p_i = \frac{1}{1+e^{-Z_i}} = F(\beta_1+\beta_2 X_i) = F(Z_i)$. As $Z_i$ ranges from $-\infty$ to $\infty$, $p_i$ ranges between 0 and 1, and $p_i$ is non-linearly related to $Z_i$.

[Figure: graph of the Logit CDF with β₁ = 0 and β₂ = 1; an S-shaped curve over Z ∈ [−4, 4], rising from near 0 to near 1 and passing through p = 0.5 at Z = 0.]

Note that $e^{Z_i} = p_i/(1-p_i)$, the odds of the event; so $\ln(p_i/(1-p_i)) = Z_i = \beta_1 + \beta_2 X_i$. In other words, the log of the odds is linear in $X_i$, although $p_i$ and $X_i$ have a non-linear relationship. This is different from the LPM.

For a linear model $y_i = \beta_1 + \beta_2 X_i + \epsilon_i$,

$\frac{\partial y_i}{\partial X_i} = \beta_2,$ a constant.

But for a Logit model, $p_i = F(\beta_1 + \beta_2 X_i)$, so

$\frac{\partial p_i}{\partial X_i} = \frac{\partial F(\beta_1+\beta_2 X_i)}{\partial X_i} = f(\beta_1+\beta_2 X_i)\,\beta_2,$

where $f(\cdot)$ is the probability density function of the logistic distribution. As $f(\beta_1+\beta_2 X_i)$ is always positive, the sign of $\beta_2$ indicates the direction of the relationship between $p_i$ and $X_i$.

Note that for the Logit model

$f(\beta_1+\beta_2 X_i) = \frac{e^{Z_i}}{(1+e^{Z_i})^2} = F(\beta_1+\beta_2 X_i)\,[1 - F(\beta_1+\beta_2 X_i)] = p_i(1-p_i).$

Therefore,

$\frac{\partial p_i}{\partial X_i} = \beta_2\, p_i(1-p_i).$

In other words, a 1-unit change in $X_i$ does not produce a constant effect on $p_i$: the effect is largest at $p_i = 0.5$ and diminishes as $p_i$ approaches 0 or 1.
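To illustrate the non-constant marginal effect, a short data step (a sketch; the values β₁ = 0 and β₂ = 1 are the illustrative ones used in the graph above) evaluates $p_i$ and $\beta_2 p_i(1-p_i)$ over a grid of X values:

data logit_slope;
  beta1 = 0; beta2 = 1;        /* illustrative coefficients, as in the graph */
  do x = -4 to 4 by 1;
    z = beta1 + beta2*x;
    p = 1/(1 + exp(-z));       /* logistic CDF F(z) */
    slope = beta2*p*(1 - p);   /* marginal effect dp/dx */
    output;
  end;
run;

proc print data=logit_slope noobs;
  var x p slope;
run;

The printed slope peaks at x = 0 (where p = 0.5 and the slope is 0.25) and shrinks toward zero in both tails.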

Maximum Likelihood estimation

Note that $y_i$ only takes on the values 0 and 1, so the observed odds $y_i/(1-y_i)$ is either 0 or undefined, and OLS is not an appropriate method of estimation. Maximum likelihood (ML) estimation is usually the technique to adopt.

ML principle: choose as estimates the parameter values which would maximise the probability of what we have already observed.

Steps of ML estimation:
1. Construct the likelihood function by expressing the probability of observing the data as a function of the unknown parameters.
2. Find the values of the unknown parameters that make the value of this expression as large as possible.

The likelihood function is given by

$L = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n}\Pr(y_i),$

assuming independent sampling. But by definition, $\Pr(y_i = 1) = p_i$ and $\Pr(y_i = 0) = 1-p_i$. Therefore,

$\Pr(y_i) = p_i^{y_i}(1-p_i)^{1-y_i}.$

So,

$L = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i} = \prod_{i=1}^{n}\left(\frac{p_i}{1-p_i}\right)^{y_i}(1-p_i).$

It is usually easier to maximise the log of L than L itself. Taking logs of both sides yields

$\ln L = \sum_{i=1}^{n} y_i \ln\!\left(\frac{p_i}{1-p_i}\right) + \sum_{i=1}^{n} \ln(1-p_i).$

Substituting $p_i = \frac{1}{1+e^{-(\beta_1+\beta_2 X_i)}}$ into $\ln L$ leads to

$\ln L = \beta_1\sum_{i=1}^{n} y_i + \beta_2\sum_{i=1}^{n} X_i y_i - \sum_{i=1}^{n}\ln\!\left(1+e^{\beta_1+\beta_2 X_i}\right).$

There are no closed-form solutions for $\beta_1$ and $\beta_2$ when maximising $\ln L$; numerical optimisation is required. SAS uses Fisher's scoring, which is similar in principle to the Newton-Raphson algorithm.
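One way to make this likelihood explicit in SAS is to hand PROC NLMIXED the Bernoulli likelihood directly. The sketch below (assuming the PENALTY data set read in earlier; the parameter names b0-b3 are illustrative) should reproduce the PROC LOGISTIC estimates reported later in this chapter:

proc nlmixed data=PENALTY;
  parms b0=0 b1=0 b2=0 b3=0;    /* start all coefficients at zero */
  eta = b0 + b1*BLACKD + b2*WHITVIC + b3*SERIOUS;   /* linear predictor Z */
  p = 1/(1 + exp(-eta));        /* logistic CDF */
  model DEATH ~ binary(p);      /* Bernoulli likelihood, maximised by ML */
run;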

Suppose θ is a univariate unknown parameter to be estimated. The Newton-Raphson algorithm derives estimates based on the formula

$\hat\theta_{new} = \hat\theta_{old} - H^{-1}(\hat\theta_{old})\,U(\hat\theta_{old}),$

where $H(\cdot)$ and $U(\cdot)$ are the second and first derivatives of the objective function with respect to θ. The algorithm stops when the estimates from successive iterations converge.

Consider a simple example, where $g(\theta) = -\theta^3 + 3\theta^2 - 5$. So $U(\theta) = -3\theta(\theta-2)$ and $H(\theta) = -6(\theta-1)$. The actual maximum and minimum of $g(\theta)$ are located at $\theta = 2$ and $\theta = 0$ respectively.

Step 1: Choose an arbitrary initial starting value, say $\hat\theta_{initial} = 1.5$. So $U(1.5) = 2.25$ and $H(1.5) = -3$. The new estimate of θ is therefore $\hat\theta_{new} = 1.5 - 2.25/(-3) = 2.25$.

Step 2: $\hat\theta_{old} = 2.25$. So $U(2.25) = -1.6875$ and $H(2.25) = -7.5$. The new estimate of θ is $\hat\theta_{new} = 2.25 - (-1.6875)/(-7.5) = 2.025$.

Continue with Steps 3, 4 and so on until convergence.

Caution: suppose we start with $\hat\theta_{initial} = 0.5$. If the process is left unchecked, the algorithm will converge to the minimum located at θ = 0!
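The iteration is easy to replicate in a data step. Here is a minimal sketch for the example above (the iteration count of 10 is arbitrary):

data newton;
  theta = 1.5;                      /* starting value from Step 1 */
  do iter = 1 to 10;
    u = -3*theta*(theta - 2);       /* U(theta): first derivative of g */
    h = -6*(theta - 1);             /* H(theta): second derivative of g */
    theta = theta - u/h;            /* Newton-Raphson update */
    output;                         /* theta = 2.25, 2.025, ... -> 2 */
  end;
run;

proc print data=newton noobs;
  var iter theta;
run;

Changing the starting value to 0.5 drives the same code to the minimum at θ = 0, illustrating the caution above.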

The only difference between Fisher's scoring and the Newton-Raphson algorithm is that Fisher's scoring uses $E(H(\cdot))$ instead of $H(\cdot)$. Our current situation is more complicated in that the unknowns are multivariate, but the optimisation principle remains the same. In practice we need a set of initial values; PROC LOGISTIC in SAS starts with all coefficients equal to zero.

PROC LOGISTIC: basic elements

data PENALTY;
  infile 'd:\teaching\ms4225\penalty.txt';
  input DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
run;

PROC LOGISTIC DATA=PENALTY DESCENDING;   * DESCENDING makes SAS model Pr(DEATH=1);
  MODEL DEATH=BLACKD WHITVIC SERIOUS;
RUN;

The LOGISTIC Procedure

Model Information
Data Set                     WORK.PENALTY
Response Variable            DEATH
Number of Response Levels    2
Number of Observations       147
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered                Total
Value    DEATH     Frequency
1        1                50
2        0                97

Probability modeled is DEATH=1.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
             Intercept    Intercept and
Criterion         Only       Covariates
AIC            190.491          184.285
SC             193.481          196.247
-2 Log L       188.491          176.285

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio       12.2060     3        0.0067
Score                  11.6560     3        0.0087
Wald                   10.8211     3        0.0127

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
                          Standard        Wald
Parameter  DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept   1   -2.6516     0.6748     15.4424      <.0001
BLACKD      1    0.5952     0.3939      2.2827      0.1308
WHITVIC     1    0.2565     0.4002      0.4107      0.5216
SERIOUS     1    0.1871     0.0612      9.3342      0.0022

Odds Ratio Estimates
            Point          95% Wald
Effect   Estimate   Confidence Limits
BLACKD      1.813      0.838     3.925
WHITVIC     1.292      0.590     2.832
SERIOUS     1.206      1.069     1.359

Association of Predicted Probabilities and Observed Responses
Percent Concordant    67.2    Somers' D    0.349
Percent Discordant    32.3    Gamma        0.351
Percent Tied           0.5    Tau-a        0.158
Pairs                 4850    c            0.675

Wald test for individual significance

Test of significance of individual coefficients: $H_0: \beta_j = 0$ vs. $H_1$: otherwise.

Instead of reporting t-statistics, PROC LOGISTIC reports Wald χ²-statistics for the significance of individual coefficients. The reason is that the t-statistic is not t-distributed in a Logit model; instead, it has an asymptotic N(0, 1) distribution under the null $H_0: \beta_j = 0$. The square of a N(0, 1) variable is a χ² variable with 1 df, and the Wald χ²-statistic is just the square of the usual t-statistic. For example, for SERIOUS, $(0.1871/0.0612)^2 \approx 9.35$, matching the reported 9.3342 up to rounding.

Likelihood-ratio, LM and Wald tests for overall significance

Test of overall model significance: $H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$ vs. $H_1$: otherwise.

1. Likelihood-ratio test: $LR = 2[\ln L(\hat\beta^{(UR)}) - \ln L(\hat\beta^{(R)})] \sim \chi^2_k$
2. Score (Lagrange multiplier, LM) test: $LM = [U(\hat\beta^{(R)})]'[-H^{-1}(\hat\beta^{(R)})][U(\hat\beta^{(R)})] \sim \chi^2_k$
3. Wald test: $W = \hat\beta^{(UR)\prime}[-H(\hat\beta^{(UR)})]\,\hat\beta^{(UR)} \sim \chi^2_k$

From the output above, $LR = 188.491 - 176.285 = 12.206$, which is the reported Likelihood Ratio chi-square with 3 df.

Odds ratio estimates

The odds ratio estimates are obtained by exponentiating the corresponding β estimates, i.e., $e^{\hat\beta_j}$; for example, $e^{0.5952} = 1.813$ for BLACKD. The (predicted) odds ratio of 1.813 indicates that the odds of a death sentence for black defendants are 81% higher than the odds for other defendants. Similarly, the (predicted) odds of death are about 29% higher when the victim is white, notwithstanding the coefficient being insignificant. A 1-unit increase in the SERIOUS scale is associated with a 21% increase in the predicted odds of a death sentence.

AIC, SC and Generalised R²

Model selection criteria:
1. Akaike's Information Criterion (AIC): $AIC = -2[\ln L - (k+1)]$
2. Schwarz Bayesian Criterion (SBC or SC): $SC = -2\ln L + (k+1)\ln(n)$
3. Generalised $R^2 = 1 - e^{-LR/n}$, analogous to the conventional R² used in linear regression.

For the fitted model, $AIC = 176.285 + 2(3+1) = 184.285$ and $SC = 176.285 + 4\ln(147) = 196.247$, matching the output above.

Association of predicted probabilities and observed responses

For the 147 observations in the sample, there are $\binom{147}{2} = 10731$ ways to pair them up (without pairing an observation with itself). Of these, 5881 pairs have either both 1's or both 0's on y. These we ignore, leaving 4850 pairs for which one case has a 1 and the other a 0.

For each of these pairs, we ask the following question: based on the estimated model, does the case with a 1 have a higher predicted probability of attaining 1 than the case with a 0? If yes, we call the pair "concordant"; if no, we call the pair "discordant"; if the two cases have the same predicted value, we call it a "tie". Obviously, the more concordant pairs, the better the fit of the model.

Let C = number of concordant pairs, D = number of discordant pairs, T = number of ties, and N = total number of pairs before eliminating any. Then

$\text{Tau-}a = \frac{C-D}{N}, \qquad \text{Somers' } D\ (SD) = \frac{C-D}{C+D+T}, \qquad \text{Gamma} = \frac{C-D}{C+D}, \qquad c = 0.5(1+SD).$

All 4 measures vary between 0 and 1, with large values corresponding to stronger associations between the predicted and observed values. Rules of thumb for minimally acceptable levels of Tau-a, SD, Gamma and c are 0.1, 0.3, 0.3 and 0.65 respectively.
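The four measures are easy to verify from the output. The following sketch recovers C, D and T from the reported percentages (which are rounded to one decimal, so the results match only to about three decimal places):

data assoc_measures;
  N = 10731;  pairs = 4850;       /* total pairs and usable pairs */
  C = 0.672*pairs;                /* concordant, from Percent Concordant */
  D = 0.323*pairs;                /* discordant */
  T = 0.005*pairs;                /* tied */
  tau_a = (C - D)/N;              /* 0.158 */
  sd    = (C - D)/(C + D + T);    /* Somers' D: 0.349 */
  gamma = (C - D)/(C + D);        /* 0.351 */
  c     = 0.5*(1 + sd);           /* 0.675 */
run;

proc print data=assoc_measures noobs;
  var tau_a sd gamma c;
run;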

Hosmer-Lemeshow goodness of fit test

The Hosmer-Lemeshow (HL) test is a goodness-of-fit test which may be invoked by adding the LACKFIT option to the MODEL statement of PROC LOGISTIC (see the snippet below). The HL statistic is calculated as follows. Based on the estimated model, predicted probabilities are generated for all observations. These are sorted by size, then grouped into approximately 10 intervals. Within each interval, the expected frequency is obtained by adding up the predicted probabilities. Expected frequencies are compared with the observed frequencies by the conventional Pearson χ² statistic; the df is the number of intervals minus 2.
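For instance, the earlier PROC LOGISTIC call with the LACKFIT option added becomes:

PROC LOGISTIC DATA=PENALTY DESCENDING;
  MODEL DEATH=BLACKD WHITVIC SERIOUS / LACKFIT;   * request the HL test;
RUN;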

$HL = \sum_{j=1}^{2G}\frac{(O_j - E_j)^2}{E_j} \sim \chi^2_{G-2},$

where G is the number of intervals, and O and E are the observed and expected frequencies respectively. The LACKFIT output is as follows:

Partition for the Hosmer and Lemeshow Test
                   DEATH = 1               DEATH = 0
Group    Total   Observed  Expected    Observed  Expected
 1        15         3       2.04         12       12.96
 2        15         2       2.78         13       12.22
 3        15         3       3.49         12       11.51
 4        15         4       4.10         11       10.90
 5        15         6       4.89          9       10.11
 6        15         6       5.42          9        9.58
 7        15         4       5.97         11        9.03
 8        15         6       6.77          9        8.23
 9        15         7       7.50          8        7.50
10        12         9       7.05          3        4.95

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square    DF    Pr > ChiSq
    3.9713     8        0.8597

The large p-value provides no evidence of lack of fit.

Class exercises

1. Tutorial 1.

2. Table 12.4 of Ramanathan (1995), Introductory Econometrics, presents information on acceptance or rejection to medical school for a sample of 60 applicants, along with a number of their characteristics. The variables are as follows:
- ACCEPT = 1 if granted acceptance, 0 otherwise;
- GPA = cumulative undergraduate grade point average;
- BIO = score on the biology portion of the Medical College Admission Test (MCAT);
- CHEM = score on the chemistry portion of the MCAT;
- PHY = score on the physics portion of the MCAT;
- RED = score on the reading portion of the MCAT;
- PRB = score on the problem portion of the MCAT;
- QNT = score on the quantitative portion of the MCAT;
- AGE = age of the applicant;
- GENDER = 1 for male, 0 for female.

Answer the following questions with the aid of the program and output medicalsas.txt and medicalout.txt uploaded on the course website:

1. Write down the estimated Logit model that regresses ACCEPT on all of the above explanatory variables.
2. Test for the overall significance of the model using the LR, LM and Wald tests. Do the three tests provide consistent results?
3. Test for the significance of the individual coefficients using the Wald test.
4. Predict the probability of success of an individual with the following characteristics: GPA=2.96, BIO=7, CHEM=7, PHY=8, RED=5, PRB=7, QNT=5, AGE=25, GENDER=0.
5. Calculate the Generalised R² for the above regression. How well does the model appear to fit the data?
6. AGE and GENDER represent personal characteristics. Test the hypothesis that they jointly have no impact on the probability of success.