Extensions to the Basic Framework II


Topic 7: Extensions to the Basic Framework II. ARE/ECN 240A Graduate Econometrics. Professor: Òscar Jordà

Outline of this topic: nonlinear regression; limited dependent variable regression; applications of MLE.

Nonlinear Regression. General set-up: y_i = m(x_i; \theta) + \varepsilon_i. Remarks: Nonlinearity refers specifically to nonlinearity in the parameters. For example, the following regression can be handled with standard regression methods: y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, whereas the following regression cannot: y_i = \beta_0 + \frac{1}{\beta_1 x_i} + \varepsilon_i.

Remarks (cont.) It will be important to ascertain identification. Remember, in linear regression we rule out multicollinearity, which is an example of lack of identification. I will come back to this issue. I will use the method of moments to think about what the estimator looks like. In general, we will have moment conditions such as E[R(Z)'(y - m(X; \theta))] = 0, where R(Z) denotes not just the variables Z but also nonlinear transformations of Z.

Example: R(Z) = I. Let g_i(y_i, X_i; \theta) = y_i - m(X_i; \theta) = \varepsilon_i, or, in vector form, g(y, X; \theta) = y - m(X; \theta) = \varepsilon. GMM objective: \hat{Q}_n(\theta) = g(y, X; \theta)' \hat{W} g(y, X; \theta). Let M_{n \times k} = \partial m / \partial \theta'; then the F.O.C. \hat{M}' \hat{W} \hat{g} = 0 delivers \hat{\theta}. Assume a CLT holds such that \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i \to_d N(0, \Omega), with \Omega = E(\varepsilon \varepsilon') \overset{?}{=} \sigma^2 I_n.

Example (cont.) Apply the M.V.T. to the F.O.C.: 0 = \hat{M}' \hat{W} \hat{g} = \hat{M}' \hat{W} g + \hat{M}' \hat{W} \bar{M} (\theta - \hat{\theta}). Usual assumptions: \hat{\theta} \to_p \theta \Rightarrow \hat{M} \to_p M; \bar{\theta} \in [\hat{\theta}, \theta] \Rightarrow \bar{M} \to_p M; \hat{W} \to_p W. Identification assumption: \hat{M}' \hat{W} \bar{M} / n has rank k and is invertible. Then \sqrt{n}(\hat{\theta} - \theta) = \left( \frac{\hat{M}' \hat{W} \bar{M}}{n} \right)^{-1} \left( \frac{\hat{M}' \hat{W} g}{\sqrt{n}} \right) \to_d N(0, V), with V = (M'WM)^{-1} M'W \Omega W M (M'WM)^{-1}. If \Omega = \sigma^2 I and W = \hat{W} = I, then V = \sigma^2 (M'M)^{-1}.

A Mickey-Mouse Example. Suppose I want to estimate the model y_i = \beta^2 x_i + \varepsilon_i, with \varepsilon_i i.i.d. \sim D(0, \sigma^2). Clearly, I could estimate the model y_i = \gamma x_i + \varepsilon_i, with \sqrt{n}(\hat{\gamma} - \gamma) \to_d N(0, \sigma^2 (X'X)^{-1}), from which, using the delta method: \sqrt{n}(\hat{\gamma}^{1/2} - \beta) \to_d N\left(0, \sigma^2 \left(\frac{\partial \beta}{\partial \gamma}\right)^2 (X'X)^{-1}\right) = N\left(0, \frac{\sigma^2}{4\beta^2} (X'X)^{-1}\right). Using NLS (GMM) and the asymptotic formula, verify you get the same answer.
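
As a quick numerical check, the following is a minimal Python sketch (not from the slides; numpy and scipy are assumed to be available and all names are illustrative) that simulates the model and compares the delta-method variance from OLS of y on x with the NLS asymptotic formula \sigma^2 (M'M)^{-1}.

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n, beta, sigma = 5000, 1.3, 2.0
x = rng.normal(size=n)
y = beta**2 * x + sigma * rng.normal(size=n)

# OLS of y on x estimates gamma = beta^2; the known sigma is used for simplicity
gamma_hat = (x @ y) / (x @ x)
var_gamma = sigma**2 / (x @ x)                 # sigma^2 (X'X)^{-1}

# Delta method: beta = gamma^(1/2), d(beta)/d(gamma) = 1/(2*beta)
beta_delta = np.sqrt(gamma_hat)
var_beta_delta = var_gamma / (4 * gamma_hat)   # sigma^2/(4*beta^2) (X'X)^{-1}

# NLS on y = b^2 * x directly; the Jacobian of m(x; b) = b^2 x is M = 2*b*x
res = least_squares(lambda b: y - b[0]**2 * x, x0=[1.0])
beta_nls = res.x[0]
M = 2 * beta_nls * x
var_beta_nls = sigma**2 / (M @ M)              # sigma^2 (M'M)^{-1}

print(beta_delta, var_beta_delta)
print(beta_nls, var_beta_nls)                  # essentially identical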

Statistical Properties: General Comments. Previously we focused on the matrix X and required that it be full rank k (i.e., no one variable can be expressed as a linear combination of the others). With nonlinear functions things are a little trickier: now we require the Jacobian M to be full rank. More technically, one could have asymptotic identification even if identification in small samples fails. With nonlinear functions, one also has to be careful of local optima: these will affect asymptotic results, which usually assume a global optimum. Often this is not a big deal as long as the properties around the global optimum are OK.

Statistical Properties (cont.) Nonlinearities come in many different forms: it is difficult to give general results. I did not talk about consistency because this gets technical really quickly. But generally speaking, the same arguments about normality, the efficient weighting matrix and feasible weighting schemes go through largely unchanged with a little work. One area to be careful about is instrumental variables. Notice the moment condition is E[R(Z)'(y - m(X; \theta))] = 0, so instruments for X itself are not appropriate.

Nonlinear IV and NLTSLS. GMM is always the safest way to go. But if you do TSLS, make sure the first-stage regression refers to the nonlinear transformation of the x, not the x themselves. Example: y = \beta x^2 + u. Two possible first-stage regressions: x^2 = \gamma z + v, or x = \pi z + v. Basic idea: \widehat{x^2} \neq (\hat{x})^2.
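
A small simulation sketch of the basic idea (in Python, not from the slides; names and data are illustrative): the fitted values from a first stage run on x^2 are not the square of the fitted values from a first stage run on x.

import numpy as np

rng = np.random.default_rng(1)
n = 10000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)               # first-stage relation between x and z
Z = np.column_stack([np.ones(n), z])

def ols_fit(w):
    # fitted values from an OLS regression of w on a constant and z
    return Z @ np.linalg.lstsq(Z, w, rcond=None)[0]

x2_hat_direct = ols_fit(x**2)                  # first stage run on the transformation x^2
x2_hat_squared = ols_fit(x)**2                 # square of the first stage run on x

print(np.mean((x2_hat_direct - x2_hat_squared)**2))   # clearly nonzero: the two differ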

Estimating Nonlinear Models. Suppose your objective function is twice continuously differentiable, so that a second-order Taylor series approximation around the point \theta_{(0)} is: Q^*(\theta) \approx Q(\theta_{(0)}) + h_{(0)}'(\theta - \theta_{(0)}) + \frac{1}{2}(\theta - \theta_{(0)})' H_{(0)} (\theta - \theta_{(0)}), where h_{(0)} is the k \times 1 gradient of Q evaluated at \theta_{(0)} and H_{(0)} is the k \times k Hessian. The first-order conditions for Q^* clearly are: h_{(0)} + H_{(0)}(\theta - \theta_{(0)}) = 0 \Rightarrow \theta_{(1)} = \theta_{(0)} - H_{(0)}^{-1} h_{(0)}.

Newton's Method. If Q = Q^* then clearly we are done, but Q^* is only an approximation. Newton's method is an algorithm which, applied sequentially, allows us to obtain the optimum of Q (there are mathematical foundations that tell you when the algorithm converges to the optimum and when it does so in finite time). Hence, given a set of initial conditions (arbitrarily chosen), iterate on \theta_{(j)} = \theta_{(j-1)} - H_{(j-1)}^{-1} h_{(j-1)}.
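
A minimal Python sketch of the iteration (not from the slides; the function name and the numerical-derivative step size are illustrative choices) uses central differences for the gradient and Hessian and the simple step-size stopping rule that appears further below.

import numpy as np

def newton(Q, theta0, tol=1e-8, max_iter=50, h=1e-5):
    theta = np.asarray(theta0, dtype=float)
    k = theta.size
    I = np.eye(k)
    for _ in range(max_iter):
        # central-difference gradient and Hessian of Q at theta
        grad = np.array([(Q(theta + h * I[i]) - Q(theta - h * I[i])) / (2 * h)
                         for i in range(k)])
        hess = np.array([[(Q(theta + h * I[i] + h * I[j])
                           - Q(theta + h * I[i] - h * I[j])
                           - Q(theta - h * I[i] + h * I[j])
                           + Q(theta - h * I[i] - h * I[j])) / (4 * h * h)
                          for j in range(k)] for i in range(k)])
        step = np.linalg.solve(hess, grad)
        theta = theta - step                   # theta_(j) = theta_(j-1) - H^{-1} h
        if np.max(np.abs(step)) < tol:         # simple step-size stopping rule
            break
    return theta

# Example: minimize Q(theta) = (theta_1 - 2)^2 + exp(theta_2) - theta_2
print(newton(lambda t: (t[0] - 2.0)**2 + np.exp(t[1]) - t[1], [0.0, 1.0]))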

Apply to OLS. The OLS objective function is already quadratic and easy to solve, but for illustration purposes: Q(\beta) = (y - X\beta)'(y - X\beta), h(\beta) = -2X'(y - X\beta), H(\beta) = 2(X'X). Suppose I guess \beta_{(0)} = b. Then Newton's method says \beta_{(1)} = b + (2X'X)^{-1} 2X'(y - Xb) = (X'X)^{-1}X'y, and clearly \hat{\beta} = \beta_{(1)} regardless of the initial conditions.

GMM Example. Suppose I am interested in the usual GMM problem: \min_\theta \hat{Q}_n(\theta) = g(y, X, Z; \theta)' \hat{W} g(y, X, Z; \theta), with possibly nonlinear g. Denote \hat{G}_j = \partial g(y, X, Z; \hat{\theta}_j) / \partial \theta' = \partial \hat{g}_j / \partial \theta'. Then Newton's method specialized to GMM is: \hat{\theta}_{j+1} = \hat{\theta}_j - \left[ \hat{G}_j' \hat{W} \hat{G}_j \right]^{-1} \left[ \hat{G}_j' \hat{W} \hat{g}_j \right].

Practicalities. Computers have algorithms that calculate derivatives numerically. However, if you can derive the expressions for the gradient and the Hessian analytically, then the optimization algorithm will go much faster. Poorly chosen initial values can result in (a) slow convergence and/or (b) getting stuck in local optima. Nonlinear optimization is part science, part art. If the function you are optimizing has several local optima, be sure to try different starting values to make sure you get the global optimum.

Stopping Rules. When do we stop the algorithm? Three methods:
1. When |\hat{\theta}_{i,j+1} - \hat{\theta}_{i,j}| < \delta for i = 1, \ldots, k, for \delta a user-defined tolerance level (e.g. 0.00001).
2. When |\hat{Q}_n(\hat{\theta}_{j+1}) - \hat{Q}_n(\hat{\theta}_j)| < \delta.
3. When \|h_{(j+1)}\| < \delta.
When the model is poorly identified (the area around the optimum is very flat), criterion 1 may fail even when 2 and 3 are met. Best to check them all.

Quasi-Newton Methods. Replace \theta_{(j)} = \theta_{(j-1)} - H_{(j-1)}^{-1} h_{(j-1)} with \theta_{(j)} = \theta_{(j-1)} - \lambda_{(j-1)} D_{(j-1)}^{-1} h_{(j-1)}. Interpretation: the gradient h gives the direction of most direct improvement in the objective function. The Hessian H is sometimes difficult to obtain, so it is useful to use stand-ins, D, that are easier to compute. \lambda_{(j)} is a step size that can improve the speed of the algorithm.

Common Algorithms. Most have to do with how the Hessian is approximated, since the step size can be adapted to any of them.
Gauss-Newton/BHHH: D_{(j)} = \sum_{i=1}^n \left. \frac{\partial Q_i}{\partial \theta} \frac{\partial Q_i}{\partial \theta'} \right|_{\hat{\theta}_{(j)}}.
Marquardt: if D_{(j)} is not negative definite in BHHH, then use D_{(j)} = \sum_{i=1}^n \left. \frac{\partial Q_i}{\partial \theta} \frac{\partial Q_i}{\partial \theta'} \right|_{\hat{\theta}_{(j)}} - \alpha_{(j)} I.

Remarks. These algorithms are popular in econometrics because they exploit Fisher's information equality in MLE. Also think about the optimal weighting matrix in GMM. For more complex models, there are non-derivative methods: more robust but much slower. Usually, it is a good idea to mix the methods: use robust methods (such as BHHH/Marquardt or even non-derivative methods) away from the optimum and then switch to Newton's method.

Binary Variables. Binary variable: a variable that takes on values 0 or 1, e.g., 0 for male, 1 for female; 0 for unemployed, 1 for employed; etc. Discrete variable: a variable that takes on a small set of values, e.g., 1 for no high school, 2 for high school, 3 for college, 4 for post-graduate. Binary and discrete variables are usually handled as any other variable when entering as regressors. The most important issue to keep in mind is that they generate collinearity with the constant term, and this needs to be addressed.

Binary regressor. Consider a dummy variable d_i \in \{0, 1\}, say 0 for male, 1 for female. Suppose you want to investigate the effect of class size on test scores broken down by boys and girls. Suppose first that sex only affects the average score. Two ways to run the regression: TestScore_i = \beta_F d_i + \beta_M (1 - d_i) + \beta_{STR} STR_i + \varepsilon_i, or TestScore_i = \beta_0 + \beta_F d_i + \beta_{STR} STR_i + \varepsilon_i.

Interpretation. In the regression TestScore_i = \beta_F d_i + \beta_M (1 - d_i) + \beta_{STR} STR_i + \varepsilon_i, \beta_F is the average score for girls and \beta_M the average score for boys, when STR = 0. In the regression TestScore_i = \beta_0 + \beta_F d_i + \beta_{STR} STR_i + \varepsilon_i, \beta_0 is the average score for boys and \beta_F is how much better or worse girls do on average with respect to the boys, when STR = 0.

Interactive binary variables. We could also consider that the effect of the treatment itself varies by sex. For example, you may conjecture that boys need more control in class while girls are more self-disciplined. Hence you may consider specifying: TestScore_i = \beta_F d_i + \beta_M (1 - d_i) + \beta_{STR,F} STR_i d_i + \beta_{STR,M} STR_i (1 - d_i) + \varepsilon_i. But statistically speaking, nothing really changes with respect to what we have discussed in previous topics.

Limited Dependent Variables. Things change when the dependent variable is binary, e.g., what determines the decision to rent or own; to be in or out of the labor force; to take public or private transportation; etc. Let y_i \in \{0, 1\}; then in the model y_i = X_i\beta + \varepsilon_i notice that E(y_i | X_i) = X_i\beta = P(y_i = 1 | X_i), but there is nothing that guarantees that X_i\hat{\beta} \in [0, 1].

OLS with a limited dependent variable. The usual linear OLS works (with the caveats just mentioned). Be sure to use heteroskedasticity-robust standard errors: this is because the variance of y_i | X_i is X_i\beta(1 - X_i\beta), which varies with X_i. OLS is a good exploratory tool, but it is conventional to estimate models based on MLE designed to deal with the idiosyncrasies of having a binary dependent variable.

Example: Boston HMDA Data. Source: Alicia H. Munnell, Geoffrey M. B. Tootell, Lynn E. Browne, and James McEneaney, "Mortgage Lending in Boston: Interpreting HMDA Data," American Economic Review, 1996, pp. 25-53. Here we look at mortgage applications for single-family residences for white and black applicants only. Variables: the dependent variable is deny, 1 for application denied, 0 otherwise; pi_rat is the payment-to-income ratio; black is 1 if the applicant is African-American, 0 otherwise.

OLS Output

. sum deny pi_rat black

    Variable |      Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        deny |     2380    .1197479    .3247347          0          1
      pi_rat |     2380    .3308136    .1072573          0          3
       black |     2380     .142437    .3495712          0          1

. reg deny pi_rat black, robust

Linear regression                            Number of obs =    2380
                                             F(  2,  2377) =   49.39
                                             Prob > F      =  0.0000
                                             R-squared     =  0.0760
                                             Root MSE      =  .31228

             |             Robust
        deny |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      pi_rat |   .5591946   .0886663     6.31   0.000     .3853233    .7330658
       black |   .1774282   .0249463     7.11   0.000     .1285096    .2263469
       _cons |  -.0905136   .0285996    -3.16   0.002    -.1465963   -.0344309

The picture [figure omitted]

Restricting the Response. It seems natural then that we would specify the relation between the dependent variable and the regressors as y_i = F(X_i\beta) + \varepsilon_i with F(-\infty) = 0, F(\infty) = 1, and dF(u)/du > 0, but this looks like a CDF!

A Latent Variable Model. Suppose the model for the latent variable y^*_i is: y^*_i = X_i\beta + \varepsilon_i, \varepsilon_i \sim D(0, 1). Remarks: for reasons that will become clear momentarily, we assume the residuals are standardized. We suppose that y^* itself is unobservable but is related to the binary variable y as follows: y_i = 1 if y^*_i > 0; y_i = 0 if y^*_i \le 0. Notice then that P(y_i = 1) = P(y^*_i > 0) = P(X_i\beta + \varepsilon_i > 0) = P(\varepsilon_i > -X_i\beta) = P(\varepsilon_i \le X_i\beta) = F(X_i\beta), where the next-to-last equality uses the symmetry of the distribution D.

Likelihood of the Latent Variable Model. So, if we made an assumption about the distribution of the \varepsilon_i, specifically F(\cdot), we could use MLE to estimate the model. Notice that the likelihood function is L(y; \beta) = \prod_{i=1}^n P(y_i = 1)^{y_i} P(y_i = 0)^{1 - y_i} = \prod_{i=1}^n F(X_i\beta)^{y_i} (1 - F(X_i\beta))^{1 - y_i}. The log-likelihood is: \mathcal{L}(y; \beta) = \sum_{i=1}^n y_i \log F(X_i\beta) + (1 - y_i) \log(1 - F(X_i\beta)).

MLE first-order conditions. Recall: \mathcal{L}(y; \beta) = \sum_{i=1}^n y_i \log F(X_i\beta) + (1 - y_i) \log(1 - F(X_i\beta)). Then \frac{\partial \mathcal{L}(y; \beta)}{\partial \beta} = \sum_{i=1}^n \frac{y_i}{F_i} F'_i X'_i - \frac{1 - y_i}{1 - F_i} F'_i X'_i = 0, where F_i = F(X_i\beta), F'_i = F'(X_i\beta), and F'(u) = \partial F(u)/\partial u; or, equivalently, \sum_{i=1}^n \frac{y_i - F_i}{F_i(1 - F_i)} F'_i X'_i = 0.

Probit and Logit Models. The two most common assumptions for F(\cdot) are the normal and the logistic distributions, i.e.: Probit model: F(X_i\beta) = \Phi(X_i\beta). Logit model: F(X_i\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)} = \frac{1}{1 + \exp(-X_i\beta)}. Remarks: for the logit, the log-odds ratio is \log\left(\frac{P_i}{1 - P_i}\right) = X_i\beta, with P_i = P(y_i = 1); for the logit, the first derivative satisfies F'(u) = F(u)F(-u) = F(u)(1 - F(u)).
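
For completeness, the logit derivative identity follows from a one-line calculation not spelled out on the slide: differentiating F(u) = \frac{1}{1 + e^{-u}} gives

F'(u) = \frac{e^{-u}}{(1 + e^{-u})^2} = \frac{1}{1 + e^{-u}} \cdot \frac{e^{-u}}{1 + e^{-u}} = F(u)(1 - F(u)) = F(u)F(-u).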

Logit MLE. Recall: \mathcal{L}(y; \beta) = \sum_{i=1}^n y_i \log F(X_i\beta) + (1 - y_i) \log(1 - F(X_i\beta)). F.O.C.: \frac{\partial \mathcal{L}(y; \beta)}{\partial \beta} = \sum_{i=1}^n \frac{y_i}{F_i} F'_i X'_i - \frac{1 - y_i}{1 - F_i} F'_i X'_i = 0, with F_i = F(X_i\beta), F'_i = F'(X_i\beta), F'(u) = \partial F(u)/\partial u. Recall that for the logit F'(u) = F(u)F(-u) = F(u)(1 - F(u)). Combining things:

Logit MLE (cont.) F.O.C.: \frac{\partial \mathcal{L}(y; \beta)}{\partial \beta} = \sum_{i=1}^n \frac{F(X_i\beta)(1 - F(X_i\beta))}{F(X_i\beta)} X'_i y_i - \frac{F(X_i\beta)(1 - F(X_i\beta))}{1 - F(X_i\beta)} X'_i (1 - y_i) = 0, that is, \frac{\partial \mathcal{L}(y; \beta)}{\partial \beta} = \sum_{i=1}^n X'_i y_i - X'_i F(X_i\beta) = 0. What does this condition remind you of?

Logit MLE (cont.) Remember, originally we thought that a model such as y_i = F(X_i\beta) + \varepsilon_i would be good when y is binary. Minimizing the sum of squared residuals leads to the GMM moment condition g_n(y, X; \beta) = \frac{1}{n} \sum_{i=1}^n X'_i (y_i - F(X_i\beta)), and the first-order conditions of the GMM problem when \hat{W} = I are just the MLE F.O.C.!

Logit MLE. Moreover, \frac{\partial^2 \mathcal{L}(y; \beta)}{\partial \beta \partial \beta'} = -\sum_{i=1}^n F(X_i\beta)(1 - F(X_i\beta)) X'_i X_i. It turns out the Newton method is particularly well suited for this model, since \hat{\beta}_{(j+1)} = \hat{\beta}_{(j)} + (X' \hat{P}_{(j)} \hat{Q}_{(j)} X)^{-1} (X'y - X'\hat{p}_{(j)}), where \hat{P}_{(j)} is an n \times n diagonal matrix with diagonal elements given by F(X_i\hat{\beta}_{(j)}), \hat{Q}_{(j)} is similar but with diagonal elements [1 - F(X_i\hat{\beta}_{(j)})], and \hat{p}_{(j)} is the n \times 1 vector with elements F(X_i\hat{\beta}_{(j)}).
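
The update above is easy to implement directly. A minimal Python sketch (not from the slides; simulated data and illustrative names) keeps the diagonals of P and Q as a vector for efficiency:

import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-1.0, 2.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)                             # starting values
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))            # F(X_i beta), the diagonal of P
    W = p * (1 - p)                            # diagonal of P*Q
    step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    beta = beta + step                         # beta_(j+1) = beta_(j) + (X'PQX)^{-1} X'(y - p)
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)                                    # close to beta_true in large samples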

Remarks. Because logit/probit models are inherently nonlinear, the \beta can no longer be interpreted as measuring marginal effects directly. The reason we assumed that the model for the latent variable has standardized residuals is that the coefficients are identified up to scale only. Marginal effects (and the interpretation of the \beta) are different than in linear regression: \frac{\partial P[y_i = 1 | X_i]}{\partial x_{ij}} = F'(X_i\beta) \beta_j, where F'(u) = \partial F(u)/\partial u.
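
As an illustration (not on the slide), the marginal effect of pi_rat implied by the logit estimates of the HMDA example reported a few slides below (coefficient 5.884498, constant -4.028432), evaluated near the sample mean pi_rat = 0.33, can be computed as follows in Python:

import numpy as np

b_cons, b_pi = -4.028432, 5.884498             # logit estimates from the HMDA output below
pi_rat = 0.33                                  # evaluation point, roughly the sample mean
xb = b_cons + b_pi * pi_rat
F = 1 / (1 + np.exp(-xb))                      # fitted probability at the evaluation point
marginal_effect = F * (1 - F) * b_pi           # F'(X beta) * beta_j with F' = F(1 - F)
print(F, marginal_effect)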

Marginal Effects. [Figure omitted: "Marginal Effects Depend on Evaluation Point"; probability (0 to 1) plotted against the regressor (-4 to 4).]

HMDA Example Continued

. reg deny pi_rat, robust

Linear regression                            Number of obs =    2380
                                             F(  1,  2378) =   37.56
                                             Prob > F      =  0.0000
                                             R-squared     =  0.0397
                                             Root MSE      =  .31828

             |             Robust
        deny |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      pi_rat |   .6035349   .0984826     6.13   0.000     .4104144    .7966555
       _cons |  -.0799096   .0319666    -2.50   0.012    -.1425949   -.0172243

. probit deny pi_rat

Iteration 0:   log likelihood = -872.0853
Iteration 1:   log likelihood = -832.02975
Iteration 2:   log likelihood = -831.7924
Iteration 3:   log likelihood = -831.79234

Probit regression                            Number of obs =    2380
                                             LR chi2(1)    =   80.59
                                             Prob > chi2   =  0.0000
Log likelihood = -831.79234                  Pseudo R2     =  0.0462

        deny |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      pi_rat |   2.967907   .3591054     8.26   0.000     2.264073     3.67174
       _cons |  -2.194159     .12899   -17.01   0.000    -2.446974   -1.941343

Note: 0 failures and 1 success completely determined.

HMDA Example (cont.)

. probit deny pi_rat

Iteration 0:   log likelihood = -872.0853
Iteration 1:   log likelihood = -832.02975
Iteration 2:   log likelihood = -831.7924
Iteration 3:   log likelihood = -831.79234

Probit regression                            Number of obs =    2380
                                             LR chi2(1)    =   80.59
                                             Prob > chi2   =  0.0000
Log likelihood = -831.79234                  Pseudo R2     =  0.0462

        deny |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      pi_rat |   2.967907   .3591054     8.26   0.000     2.264073     3.67174
       _cons |  -2.194159     .12899   -17.01   0.000    -2.446974   -1.941343

Note: 0 failures and 1 success completely determined.

. logit deny pi_rat

Iteration 0:   log likelihood = -872.0853
Iteration 1:   log likelihood = -830.96071
Iteration 2:   log likelihood = -830.09497
Iteration 3:   log likelihood = -830.09403
Iteration 4:   log likelihood = -830.09403

Logistic regression                          Number of obs =    2380
                                             LR chi2(1)    =   83.98
                                             Prob > chi2   =  0.0000
Log likelihood = -830.09403                  Pseudo R2     =  0.0482

        deny |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      pi_rat |   5.884498   .7336006     8.02   0.000     4.446667    7.322328
       _cons |  -4.028432   .2685763   -15.00   0.000    -4.554832   -3.502032

Comparing Logit and Probit [figure omitted]

How good is the fit? McFadden's (1974) Pseudo-R^2:

\text{Pseudo-}R^2 = 1 - \frac{\sum_{i=1}^n y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i)}{n[\bar{y} \ln \bar{y} + (1 - \bar{y}) \ln(1 - \bar{y})]}

Predicted outcomes: this is tricky to evaluate. Here is the Stata output:

. estat classification

Logistic model for deny

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |        11             6  |         17
     -     |       274          2089  |       2363
-----------+--------------------------+-----------
   Total   |       285          2095  |       2380

Classified + if predicted Pr(D) >= .5
True D defined as deny != 0

Sensitivity                     Pr( +| D)    3.86%
Specificity                     Pr( -|~D)   99.71%
Positive predictive value       Pr( D| +)   64.71%
Negative predictive value       Pr(~D| -)   88.40%
False + rate for true ~D        Pr( +|~D)    0.29%
False - rate for true D         Pr( -| D)   96.14%
False + rate for classified +   Pr(~D| +)   35.29%
False - rate for classified -   Pr( D| -)   11.60%
Correctly classified                        88.24%

Looks good!?
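
Note (not on the slide) that the denominator is just the log-likelihood of an intercept-only model, which Stata reports as the Iteration 0 log likelihood, so the reported value can be checked by hand from the logit output above:

\text{Pseudo-}R^2 = 1 - \frac{\ln \hat{L}}{\ln \hat{L}_0} = 1 - \frac{-830.094}{-872.085} \approx 0.048,

which matches the Pseudo R2 = 0.0482 that Stata reports.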

But... Remember that there are very few applications denied, so it is easy to just always predict deny = 0 and seemingly do well:

Classified + if predicted Pr(D) >= .5
True D defined as deny != 0

Sensitivity                     Pr( +| D)    3.86%
Specificity                     Pr( -|~D)   99.71%
Positive predictive value       Pr( D| +)   64.71%
Negative predictive value       Pr(~D| -)   88.40%
False + rate for true ~D        Pr( +|~D)    0.29%
False - rate for true D         Pr( -| D)   96.14%
False + rate for classified +   Pr(~D| +)   35.29%
False - rate for classified -   Pr( D| -)   11.60%
Correctly classified                        88.24%


Prediction. For example, what is the probability of being denied a loan when the payment-to-income ratio is 60%? No race distinction in the regression.

. prvalue, x(pi_rat = 0.6)

logit: Predictions for deny

Confidence intervals by delta method

                            95% Conf. Interval
  Pr(y=1|x):     0.3781   [ 0.2901,   0.4660]
  Pr(y=0|x):     0.6219   [ 0.5340,   0.7099]

    pi_rat
x=      .6
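
As a check (not on the slide), the point estimate can be reproduced by hand from the logit coefficients reported above:

\Pr(\text{deny} = 1 \mid \text{pi\_rat} = 0.6) = \frac{1}{1 + \exp(-(-4.028432 + 5.884498 \times 0.6))} = \frac{1}{1 + e^{0.498}} \approx 0.378.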

Now including black

. logit deny pi_rat black

Iteration 0:   log likelihood = -872.0853
Iteration 1:   log likelihood = -806.3571
Iteration 2:   log likelihood = -795.72934
Iteration 3:   log likelihood = -795.69521
Iteration 4:   log likelihood = -795.69521

Logistic regression                          Number of obs =    2380
                                             LR chi2(2)    =  152.78
                                             Prob > chi2   =  0.0000
Log likelihood = -795.69521                  Pseudo R2     =  0.0876

        deny |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      pi_rat |   5.370362   .7283192     7.37   0.000     3.942883    6.797842
       black |   1.272782   .1461983     8.71   0.000     .9862385    1.559325
       _cons |  -4.125558   .2684161   -15.37   0.000    -4.651644   -3.599472

. prvalue, x(pi_rat = 0.6 black = 0)

logit: Predictions for deny

Confidence intervals by delta method

                            95% Conf. Interval
  Pr(y=1|x):     0.2884   [ 0.2094,   0.3674]
  Pr(y=0|x):     0.7116   [ 0.6326,   0.7906]

    pi_rat     black
x=      .6         0

. prvalue, x(pi_rat = 0.6 black = 1)

logit: Predictions for deny

Confidence intervals by delta method

                            95% Conf. Interval
  Pr(y=1|x):     0.5913   [ 0.4909,   0.6917]
  Pr(y=0|x):     0.4087   [ 0.3083,   0.5091]

    pi_rat     black
x=      .6         1

But it depends... Same exercise, but now the payment-to-income ratio is 10%.

. prvalue, x(pi_rat = 0.1 black = 0)

logit: Predictions for deny

Confidence intervals by delta method

                            95% Conf. Interval
  Pr(y=1|x):     0.0269   [ 0.0166,   0.0371]
  Pr(y=0|x):     0.9731   [ 0.9629,   0.9834]

    pi_rat     black
x=      .1         0

. prvalue, x(pi_rat = 0.1 black = 1)

logit: Predictions for deny

Confidence intervals by delta method

                            95% Conf. Interval
  Pr(y=1|x):     0.0898   [ 0.0533,   0.1264]
  Pr(y=0|x):     0.9102   [ 0.8736,   0.9467]

    pi_rat     black
x=      .1         1

Probably blacks have lower incomes on average, hence the results when black is excluded. The predicted rejection rate for blacks is about twice as high at high payment-to-income ratios, but more than three times as high at low payment-to-income ratios.

MLE, GMM and the Information Matrix Equality. Latent variable models are often estimated with MLE. Let me use GMM results to explain the statistical properties of MLE in general settings. Specifically, understanding the information matrix equality is helpful in understanding Quasi-MLE and other related estimation principles.

GMM and MLE. In a random sample, the density for the i-th observation is f(y_i | x_i; \theta). Since this is a proper density, it must integrate to 1: \int_A f(y_i | x_i; \theta) \, dy = 1. Assuming integrals and derivatives are interchangeable and taking the derivative on both sides: \int_A \frac{\partial f(y_i | x_i; \theta)}{\partial \theta} \, dy = 0.

GMM and MLE (cont.) Multiplying and dividing by f: \int_A \frac{\partial f(y_i | x_i; \theta)}{\partial \theta} \frac{1}{f(y_i | x_i; \theta)} f(y_i | x_i; \theta) \, dy = 0. That is, \int_A \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta} f(y_i | x_i; \theta) \, dy = \int_A h(y_i; \theta) f(y_i | x_i; \theta) \, dy = 0. h is called the score, and the previous result implies the moment condition E[h(y_i; \theta) | x_i] = 0 \rightarrow \frac{1}{n} \sum_{i=1}^n h(y_i; \theta) = 0.

GMM and MLE. Recall that the log-likelihood function for the sample is \mathcal{L}(\theta) = \sum_{i=1}^n \log f(y_i | x_i; \theta), with F.O.C. \frac{1}{n} \sum_{i=1}^n h(y_i; \theta) = 0. So if we set up MLE as a GMM problem based on the condition that the expected value of the score is zero (and assuming all the right assumptions are satisfied), and using the optimal weighting matrix, then \sqrt{n}(\hat{\theta} - \theta) \to_d N(0, (G' \Omega^{-1} G)^{-1}).

GMM and MLE. Notice that G = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(y_i | x_i; \theta)}{\partial \theta \partial \theta'}, and \Omega = E[h(y_i; \theta) h(y_i; \theta)'] \rightarrow \hat{\Omega} = \frac{1}{n} \sum_{i=1}^n h(y_i; \theta) h(y_i; \theta)'. However, recall that \int_A \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta} f(y_i | x_i; \theta) \, dy = \int_A h(y_i; \theta) f(y_i | x_i; \theta) \, dy = 0, and assuming that integration and differentiation are exchangeable...

GMM and MLE Z A then Z A @h(y i ; ) @ @h(y i ; ) f(y i jx i ; )dy + @ f(y i jx i ; )dy + Z A h(y i ; ) Z A h(y i ; ) @f(y ijx i ) @ 0 dy =0 1 f(y i jx i ; ) @f(y i jx i ) f(y i jx i ; )dy =0 @ 0 Z A @h(y i ; ) f(y i jx i ; )dy = @ Z A h(y i ; ) h 0 (y i ; ) f(y i jx i ; )dy So that 54

The Information Matrix Equality. Recall \int_A \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta} f(y_i | x_i; \theta) \, dy = \int_A h(y_i; \theta) f(y_i | x_i; \theta) \, dy = 0, and from the previous slide \int_A \frac{\partial h(y_i; \theta)}{\partial \theta'} f(y_i | x_i; \theta) \, dy = -\int_A h(y_i; \theta) h(y_i; \theta)' f(y_i | x_i; \theta) \, dy, implying: E\left[ \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta} \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta'} \right] = -E\left[ \frac{\partial^2 \log f(y_i | x_i; \theta)}{\partial \theta \partial \theta'} \right]. This is called the information matrix equality.
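
A small simulation sketch (in Python, not from the slides; names are illustrative) checks the equality for a correctly specified logit: at the true parameter values, the average outer product of the scores approximates the negative of the average Hessian.

import numpy as np

rng = np.random.default_rng(3)
n = 50000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, -1.0])
p = 1 / (1 + np.exp(-X @ beta))
y = (rng.uniform(size=n) < p).astype(float)

scores = (y - p)[:, None] * X                            # h(y_i; beta) for the logit
outer_product = scores.T @ scores / n                    # (1/n) sum h_i h_i'
minus_hessian = X.T @ ((p * (1 - p))[:, None] * X) / n   # -(1/n) sum d^2 log f / db db'

print(outer_product)
print(minus_hessian)                                     # approximately equal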

Remarks. When the likelihood function is correctly specified, the information matrix equality holds and an estimate of the asymptotic covariance matrix of the estimated parameters can be obtained with either the outer product of the scores or the second-derivative (Hessian) estimate. When the likelihood is incorrectly specified, the information matrix equality does not hold. However, as long as the score has mean zero, extremum estimation results can be used to show that the estimator is consistent, with asymptotic covariance matrix given by the sandwich estimator.

Remarks (cont.) Moreover, the information matrix equality is the basis for the BHHH algorithm: at the true parameter values, the outer product of the gradients is equivalent to (minus) the Hessian.

Recap. F.O.C. of MLE as a moment condition: \mathcal{L}(\theta) = \sum_{i=1}^n \log f(y_i | x_i; \theta), with F.O.C. g_n(\theta) = \frac{1}{n} \sum_{i=1}^n h(y_i; \theta) = 0. GMM: \sqrt{n}(\hat{\theta} - \theta) \to_d N(0, (G' \Omega^{-1} G)^{-1}), with G = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(y_i | x_i; \theta)}{\partial \theta \partial \theta'} and \Omega = E[h(y_i; \theta) h(y_i; \theta)'] \rightarrow \hat{\Omega} = \frac{1}{n} \sum_{i=1}^n h(y_i; \theta) h(y_i; \theta)'. Information matrix equality: E\left[ \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta} \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta'} \right] = -E\left[ \frac{\partial^2 \log f(y_i | x_i; \theta)}{\partial \theta \partial \theta'} \right].

The Newton Method for MLE. F.O.C. for MLE: \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(y_i | x_i; \theta)}{\partial \theta} = \frac{1}{n} \sum_{i=1}^n h(y_i; \theta) = 0, and G = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(y_i | x_i; \theta)}{\partial \theta \partial \theta'}. The Newton step is (using the information matrix equality): \hat{\theta}_{(j+1)} = \hat{\theta}_{(j)} - \hat{G}_{(j)}^{-1} \hat{h}_{(j)} \approx \hat{\theta}_{(j)} + (\hat{h}_{(j)} \hat{h}_{(j)}')^{-1} \hat{h}_{(j)}, where \hat{h}_{(j)} \hat{h}_{(j)}' is shorthand for the outer product of the individual scores evaluated at \hat{\theta}_{(j)}, as in BHHH.