Extensions to the Basic Framework II


1 Topic 7 Extensions to the Basic Framework II ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà

2 Outline of this topic: nonlinear regression; limited dependent variable regression; applications of MLE.

3 Nonlinear Regression General set-up: y_i = m(x_i; θ) + ε_i. Remarks: nonlinearity refers specifically to nonlinearity in the parameters. For example, the following regression can be handled with standard regression methods: y_i = β₀ + β₁x_i + β₂x_i² + ε_i, whereas the following regression cannot: y_i = x_i^β + ε_i.

4 Remarks (cont.) It will be important to ascertain identification. Remember, in linear regression we rule out multicollinearity, which is an example of lack of identification. I will come back to this issue. I will use the method of moments to think about what the estimator looks like. In general, we will have moment conditions such as E[R(Z)'(y − m(X; θ))] = 0, where R(Z) denotes not just the variables Z but also nonlinear transformations of Z.

5 Example: R(Z) = 1 Let g_i(y_i, x_i; θ) = y_i − m(x_i; θ) = ε_i, or in vector form g(y, X; θ) = y − m(X; θ) = ε. GMM: min_θ Q̂_n(θ) = g(y, X; θ)'Ŵg(y, X; θ). Let M̂ = ∂g(y, X; θ̂)/∂θ'; then the F.O.C. is M̂'Ŵĝ = 0 → θ̂. Assume a CLT holds such that n^(−1/2) Σ_{i=1}^n ε_i →d N(0, Ω), with Ω = E(εε') =? σ²I_n.

6 Example (cont.) Apply the M.V.T. to the F.O.C.: 0 = M̂'Ŵĝ(θ̂) = M̂'Ŵg(θ₀) + M̂'ŴM̄(θ̂ − θ₀), where M̄ = M(θ̄). Usual assumptions: θ̂ →p θ₀ ⇒ M̂ →p M; θ̄ ∈ [θ̂, θ₀] ⇒ M̄ →p M; Ŵ →p W. Identification assumption: rank(M) = k, with M'WM invertible. Then sqrt(n)(θ̂ − θ₀) = −(M̂'ŴM̄)^(−1) M̂'Ŵ sqrt(n) g(θ₀) →d N(0, V), where V = (M'WM)^(−1) M'W Ω W M (M'WM)^(−1). If Ω = σ²I and Ŵ = I, then V = σ²(M'M)^(−1).

7 A Mickey-Mouse Example Suppose I want to estimate the model y_i = β²x_i + ε_i, with ε_i i.i.d. ~ D(0, σ²). Clearly, I could estimate the model y_i = γx_i + ε_i, where γ = β², so that sqrt(n)(γ̂ − γ) →d N(0, σ²(X'X)^(−1)), from which, using the delta method: sqrt(n)(γ̂^(1/2) − γ^(1/2)) →d N(0, (1/(2β))² σ²(X'X)^(−1)) = N(0, σ²/(4β²) (X'X)^(−1)). Using NLS (GMM) and the asymptotic formula, verify that you get the same answer.
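As a check on the delta-method variance, here is a small Monte Carlo sketch of my own (not from the slides): it estimates γ = β² by OLS, recovers β̂ = γ̂^(1/2), and compares the simulated variance of β̂ with the delta-method formula σ²/(4β²)(x'x)^(−1). The true values β = 1.5 and σ = 1 and the uniform design are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma = 1.5, 1.0            # hypothetical true parameter and error s.d.
n, reps = 500, 2000

x = rng.uniform(0.5, 2.0, n)      # regressors held fixed across replications
gamma_hat = np.empty(reps)
for r in range(reps):
    y = beta**2 * x + sigma * rng.standard_normal(n)
    gamma_hat[r] = (x @ y) / (x @ x)     # OLS of y on x estimates gamma = beta^2

beta_hat = np.sqrt(gamma_hat)            # invert the reparametrization
# Delta method: Var(beta_hat) ~ (1/(2*beta))^2 * Var(gamma_hat) = sigma^2/(4 beta^2 x'x)
delta_var = sigma**2 / (4 * beta**2 * (x @ x))
print(beta_hat.var(), delta_var)         # the two numbers should be close
```

The simulated variance of β̂ across replications matches the delta-method formula up to Monte Carlo noise.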

8 Statistical Properties: General Comments Previously we focused on the matrix X and required that it be full rank k (i.e., no one variable can be expressed as a linear combination of the others). With nonlinear functions things are a little trickier: now we require the Jacobian M to be full rank. More technically, one could have asymptotic identification even if identification in small samples fails. With nonlinear functions, one also has to be careful of local optima: these will affect asymptotic results, which usually assume a global optimum. This is often not a big deal as long as the properties around the global optimum are OK.

9 Statistical Properties (cont.) Nonlinearities come in many different forms: it is difficult to give general results. I did not talk about consistency because this gets technical really quickly. But generally speaking, the same arguments about normality, the efficient weighting matrix and feasible weighting schemes go largely through with a little work. One area to be careful about is instrumental variables. Notice that the moment condition is E[R(Z)'(y − m(X; θ))] = 0, so instruments for X itself are not appropriate.

10 Nonlinear IV and NLTSLS GMM is always the safest way to go. But if you do TSLS, make sure the first-stage regression refers to the nonlinear transformation of the x, not the x themselves. Example: y = βx² + u. Two possible first-stage regressions: x² = πz + v (correct), or x = πz + u (incorrect). Basic idea: the fitted value of x² is not the square of the fitted value of x.

11 Estimating Nonlinear Models Suppose your objective function is twice continuously differentiable, so that a second-order Taylor series approximation around the point θ_(0) is: Q*(θ) ≈ Q(θ_(0)) + h_(0)'(θ − θ_(0)) + (1/2)(θ − θ_(0))'H_(0)(θ − θ_(0)), where h_(0) is the gradient of Q evaluated at θ_(0) and H_(0) is the Hessian. The first-order conditions for Q* clearly are: h_(0) + H_(0)(θ_(1) − θ_(0)) = 0 ⇒ θ_(1) = θ_(0) − H_(0)^(−1) h_(0).

12 Newton's Method If Q = Q* then clearly we are done, but Q* is only an approximation. Newton's method is an algorithm which, applied sequentially, allows us to obtain the optimum of Q (there are mathematical foundations that tell you when the algorithm converges to the optimum and when it does so in finite time). Hence, given a set of initial conditions (arbitrarily chosen), iterate on: θ_(j) = θ_(j−1) − H_(j−1)^(−1) h_(j−1).
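The iteration can be sketched on a toy scalar objective of my own choosing (not from the slides): Q(t) = t⁴ − 3t², whose gradient and Hessian are available in closed form and which has two local minima at ±sqrt(3/2).

```python
# Newton's method on Q(t) = t^4 - 3 t^2 (illustrative objective with two local minima)
def gradient(t):
    return 4 * t**3 - 6 * t

def hessian(t):
    return 12 * t**2 - 6

theta = 1.0                        # arbitrarily chosen initial condition
for _ in range(50):
    step = gradient(theta) / hessian(theta)
    theta = theta - step           # Newton update
    if abs(step) < 1e-12:          # stop when the update is negligible
        break

print(theta)  # converges to the local minimum at sqrt(3/2)
```

Starting instead at theta = -1.0 would converge to the other local minimum, which is why multiple starting values matter.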

13 Apply to OLS The OLS objective function is already quadratic and easy to solve. But for illustration purposes: Q(β) = (y − Xβ)'(y − Xβ); h(β) = −2X'(y − Xβ); H(β) = 2X'X. Suppose I guess β_(0) = b. Then Newton's method says: β_(1) = b + (2X'X)^(−1) 2X'(y − Xb) = (X'X)^(−1)X'y. And clearly β̂ = β_(1), regardless of the initial conditions.
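A quick numerical check of the one-step claim, under assumptions of my own (simulated data, arbitrary starting value): a single Newton step from any b lands exactly on the OLS estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = rng.standard_normal((n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)  # hypothetical DGP

b0 = rng.standard_normal(k)                 # arbitrary starting value
h = -2 * X.T @ (y - X @ b0)                 # gradient of (y - Xb)'(y - Xb)
H = 2 * X.T @ X                             # Hessian
b1 = b0 - np.linalg.solve(H, h)             # one Newton step

b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form OLS
print(np.allclose(b1, b_ols))               # True: one step suffices
```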

14 GMM Example Suppose I am interested in the usual GMM problem: min_θ Q̂_n(θ) = g(y, X, Z; θ)'Ŵg(y, X, Z; θ), with possibly nonlinear g. Denote Ĝ_j = ∂g(θ̂_j)/∂θ'. Then Newton's method specialized to GMM is: θ̂_(j+1) = θ̂_(j) − [Ĝ_j'ŴĜ_j]^(−1) [Ĝ_j'Ŵĝ_j].

15 Practicalities Computers have algorithms that calculate derivatives numerically. However, if you can derive the expressions for the gradient and the Hessian analytically, then the optimization algorithm will go much faster. Poorly chosen initial values can result in: (a) slow convergence; and/or (b) getting stuck in local optima. Nonlinear optimization is part science, part art. If the function you are optimizing has several local optima, be sure to try different starting values to make sure you get the global optimum.

16 Stopping Rules When do we stop the algorithm? Three methods: 1. When |θ̂_i,(j+1) − θ̂_i,(j)| < δ for i = 1, …, k, for δ a user-defined tolerance level. 2. When |Q̂_n(θ̂_(j+1)) − Q̂_n(θ̂_(j))| < δ. 3. When ‖h_(j+1)‖ < δ. When the model is poorly identified (the area around the optimum is very flat), criterion 1 may fail even when 2 and 3 are met. Best to check them all.
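The three rules can be collected in a small helper (a sketch with hypothetical names; the tolerance δ and the decision to require all three rules are user choices):

```python
import math

def stopping_checks(theta_new, theta_old, q_new, q_old, grad_new, tol=1e-8):
    """Evaluate the three textbook stopping rules; return a dict of booleans."""
    # Rule 1: every parameter has stopped moving
    param_rule = all(abs(a - b) < tol for a, b in zip(theta_new, theta_old))
    # Rule 2: the objective has stopped changing
    objective_rule = abs(q_new - q_old) < tol
    # Rule 3: the gradient norm is (numerically) zero
    gradient_rule = math.sqrt(sum(g * g for g in grad_new)) < tol
    return {"params": param_rule, "objective": objective_rule,
            "gradient": gradient_rule,
            "all": param_rule and objective_rule and gradient_rule}
```

In a flat, poorly identified region the parameter rule can keep failing while the objective and gradient rules pass, which is exactly the situation the slide warns about.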

17 Quasi-Newton Methods Replace θ_(j) = θ_(j−1) − H_(j−1)^(−1) h_(j−1) with θ_(j) = θ_(j−1) − λ_(j−1) D_(j−1)^(−1) h_(j−1). Interpretation: the gradient h gives the direction of most direct improvement in the objective function. The Hessian H is sometimes difficult to obtain, so it is useful to use stand-ins, D, that are easier to compute. λ_(j−1) is a step size that can improve the speed of the algorithm.

18 Common Algorithms Most have to do with how the Hessian is approximated, since the step size can be adapted to any of them. Gauss-Newton/BHHH: D_(j) = −Σ_{i=1}^n h_i(θ̂_(j)) h_i(θ̂_(j))'. Marquardt: if D_(j) is not negative definite in BHHH, then use D_(j) = −Σ_{i=1}^n h_i(θ̂_(j)) h_i(θ̂_(j))' − α_(j) I.

19 Remarks These algorithms are popular in econometrics because they exploit Fisher's information equality in MLE. Also think about the optimal weighting matrix in GMM. For more complex models, there are non-derivative methods: more robust, but much slower. Usually, it is a good idea to mix the methods: use robust methods (such as BHHH/Marquardt or even non-derivative methods) away from the optimum, and then switch to Newton's method.

20 Binary Variables Binary variable: a variable that takes on values 0 or 1, e.g., 0 for male, 1 for female; 0 for unemployed, 1 for employed; etc. Discrete variable: a variable that takes on a small set of values, e.g., 1 for no high school, 2 for high school, 3 for college, 4 for post-graduate. Binary and discrete variables are usually handled as any other variable when entering as regressors. The most important issue to keep in mind is that they generate collinearity with the constant term, and this needs to be addressed.

21 Binary regressor Consider a dummy variable d_i ∈ {0, 1}, say 0 for male, 1 for female. Suppose you want to investigate the effect of class size on test scores broken down by boys and girls. Suppose first that sex only affects the average score. Two ways to run the regression: TestScore_i = β_F d_i + β_M(1 − d_i) + β_STR STR_i + ε_i, or TestScore_i = β₀ + β_F d_i + β_STR STR_i + ε_i.

22 Interpretation In the regression TestScore_i = β_F d_i + β_M(1 − d_i) + β_STR STR_i + ε_i, β_F is the average score for girls and β_M the average score for boys, when STR = 0. In the regression TestScore_i = β₀ + β_F d_i + β_STR STR_i + ε_i, β₀ is the average score for boys and β_F is how much better/worse girls do on average with respect to the boys, when STR = 0.

23 Interactive binary variables We could also consider that the effect of the treatment itself varies by sex. For example, you may conjecture that boys may need more control in class but girls are more self-disciplined. Hence you may consider specifying: TestScore_i = β_F d_i + β_M(1 − d_i) + β_STR,F STR_i d_i + β_STR,M STR_i(1 − d_i) + ε_i. But statistically speaking, nothing really changes with respect to what we have discussed in previous topics.

24 Limited Dependent Variables Things change when the dependent variable is binary, e.g., what determines the decision to rent or own; to be in or out of the labor force; to take public or private transportation; etc. Let y_i ∈ {0, 1}; then in the model y_i = X_iβ + ε_i notice that E(y_i|X_i) = X_iβ = P(y_i = 1|X_i), but there is nothing that guarantees that X_iβ̂ ∈ [0, 1].

25 OLS with a limited dependent variable The usual linear OLS works (with the caveats just mentioned). Be sure to use heteroskedasticity-robust standard errors: this is because the variance of y_i|X_i is X_iβ(1 − X_iβ), which varies with X_i. OLS is a good exploratory tool, but it is conventional to estimate models based on MLE designed to deal with the idiosyncrasies of having a binary dependent variable.
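A minimal sketch of the linear probability model with White heteroskedasticity-robust standard errors, on simulated data of my own (the DGP p = 0.1 + 0.5x is an arbitrary illustration, not the HMDA data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(0.0, 1.0, n)
p_true = 0.1 + 0.5 * x                        # hypothetical linear probability model
y = (rng.uniform(size=n) < p_true).astype(float)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                      # OLS coefficients
e = y - X @ beta                              # residuals

# White robust covariance: (X'X)^-1 [X' diag(e_i^2) X] (X'X)^-1
meat = (X * e[:, None]**2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))
print(beta, se_robust)
```

Here the heteroskedasticity is built in by construction, since Var(y|x) = p(x)(1 − p(x)) varies with x.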

26 Example: Boston HMDA Data Source: Alicia H. Munnell, Geoffrey M. B. Tootell, Lynne E. Browne, and James McEneaney, "Mortgage Lending in Boston: Interpreting HMDA Data," American Economic Review, 1996. Here we look at mortgage applications for single-family residences for white and black applicants only. Variables: the dependent variable is deny, 1 for application denied, 0 otherwise; pi_rat is the payment-to-income ratio; black is 1 if the applicant is African-American, 0 otherwise.

27 OLS Output
. sum deny pi_rat black
Variable  Obs  Mean  Std. Dev.  Min  Max
deny  pi_rat  black
. reg deny pi_rat black, robust
Linear regression  Number of obs = 2380  F(2, 2377) =  Prob > F =  R-squared =  Root MSE =
deny  Coef.  Robust Std. Err.  t  P>|t|  [95% Conf. Interval]
pi_rat  black  _cons

28 The picture [figure omitted]

29 Restricting the Response It seems natural, then, that we would specify the relation between the dependent variable and the regressors as y_i = F(X_iβ) + ε_i with F(−∞) = 0, F(+∞) = 1, and dF(u)/du > 0. But this looks like a CDF.

30 A Latent Variable Model Suppose the model for the latent variable y* is: y*_i = X_iβ + ε_i, with ε_i ~ D(0, 1). Remarks: for reasons that will become clear momentarily, we assume the residuals are standardized. We suppose that y* itself is unobservable, but it is related to the binary variable y as follows: y_i = 1 if y*_i > 0; y_i = 0 if y*_i ≤ 0. Notice then that: P(y_i = 1) = P(y*_i > 0) = P(X_iβ + ε_i > 0) = P(ε_i > −X_iβ) = P(ε_i ≤ X_iβ) = F(X_iβ), where the next-to-last equality uses the symmetry of the distribution of ε_i.
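The latent-variable mechanism can be simulated directly; this is a sketch under assumptions of my own (the coefficients β₀ = −0.5, β₁ = 1 and the uniform design are arbitrary). With ε ~ N(0, 1) (the probit case), the empirical frequency of y = 1 on a thin slice of x values should match Φ(X_iβ):

```python
import numpy as np
from math import erf

def Phi(u):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(u / np.sqrt(2)))

rng = np.random.default_rng(3)
n = 50000
beta0, beta1 = -0.5, 1.0                 # hypothetical coefficients
x = rng.uniform(-1, 1, n)
y_star = beta0 + beta1 * x + rng.standard_normal(n)   # latent index y* = X.beta + eps
y = (y_star > 0).astype(int)                          # observed binary outcome

# Check P(y = 1 | x) = Phi(X.beta) on a slice of the data near x = 0:
near_zero = np.abs(x) < 0.05
print(y[near_zero].mean(), Phi(beta0))                # should be close
```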

31 Likelihood of the Latent Variable Model So, if we made an assumption about the distribution of the ε_i, specifically F(.), we could use MLE to estimate the model. Notice that the likelihood function is: L(y; β) = Π_{i=1}^n P(y_i = 1)^(y_i) P(y_i = 0)^(1−y_i) = Π_{i=1}^n F(X_iβ)^(y_i) (1 − F(X_iβ))^(1−y_i). The log-likelihood is: log L(y; β) = Σ_{i=1}^n [y_i log F(X_iβ) + (1 − y_i) log(1 − F(X_iβ))].

32 MLE first-order conditions Recall: log L(y; β) = Σ_{i=1}^n [y_i log F(X_iβ) + (1 − y_i) log(1 − F(X_iβ))]. Then ∂log L/∂β = Σ_{i=1}^n [(y_i/F_i) F_i' X_i' − ((1 − y_i)/(1 − F_i)) F_i' X_i'] = 0, where F_i = F(X_iβ) and F_i' = F'(X_iβ); or, combining terms, Σ_{i=1}^n [(y_i − F_i)/(F_i(1 − F_i))] F_i' X_i' = 0.

33 Probit and Logit Models The two most common assumptions for F(.) are the normal and the logistic distributions, i.e.: Probit model: F(X_iβ) = Φ(X_iβ). Logit model: F(X_iβ) = exp(X_iβ)/(1 + exp(X_iβ)) = 1/(1 + exp(−X_iβ)). Remarks: for the logit, the log-odds ratio is log(P_i/(1 − P_i)) = X_iβ, with P_i = P(y_i = 1); for the logit, the first derivative is such that F'(u) = F(u)F(−u) = F(u)(1 − F(u)).
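Both logit facts are easy to verify numerically; this is a self-contained check, not tied to any dataset:

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

# Check F'(u) = F(u) F(-u) = F(u)(1 - F(u)) against a central difference,
# and that the log-odds log(P/(1-P)) recovers the linear index u.
for u in (-2.0, -0.3, 0.0, 1.5):
    numeric_deriv = (logistic(u + 1e-6) - logistic(u - 1e-6)) / 2e-6
    analytic = logistic(u) * (1 - logistic(u))
    assert abs(numeric_deriv - analytic) < 1e-6
    p = logistic(u)
    assert abs(math.log(p / (1 - p)) - u) < 1e-12
print("identities verified")
```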

34 Logit MLE Recall: log L(y; β) = Σ_{i=1}^n [y_i log F(X_iβ) + (1 − y_i) log(1 − F(X_iβ))]. F.O.C.: Σ_{i=1}^n [(y_i/F_i) F_i' X_i' − ((1 − y_i)/(1 − F_i)) F_i' X_i'] = 0, with F_i = F(X_iβ), F_i' = F'(X_iβ). Recall that for the logit F'(u) = F(u)F(−u) = F(u)(1 − F(u)). Combining these results:

35 Logit MLE (cont.) ∂log L/∂β = Σ_{i=1}^n [(F(X_iβ)(1 − F(X_iβ))/F(X_iβ)) X_i' y_i − (F(X_iβ)(1 − F(X_iβ))/(1 − F(X_iβ))) X_i'(1 − y_i)] = Σ_{i=1}^n X_i'(y_i − F(X_iβ)) = 0. What does this condition remind you of?

36 Logit MLE (cont.) Remember, originally we thought that a model such as y_i = F(X_iβ) + ε_i would be good when y is binary. Minimizing the sum of squared residuals results in the GMM moment condition g_n(y_i, X_i; β) = (1/n) Σ_{i=1}^n X_i'(y_i − F(X_iβ)). And the first-order conditions of the GMM problem when Ŵ = I are just the MLE F.O.C.!

37 Logit MLE (cont.) The Hessian is: ∂²log L(y; β)/∂β∂β' = −Σ_{i=1}^n F(X_iβ)(1 − F(X_iβ)) X_i X_i'. It turns out the Newton method is particularly well suited for this model, since β̂_(j+1) = β̂_(j) + (X'P̂_(j)Q̂_(j)X)^(−1)(X'y − X'p̂_(j)), where P̂_(j) is an n × n diagonal matrix with diagonal elements F(X_iβ̂_(j)), Q̂_(j) is similar but with diagonal elements [1 − F(X_iβ̂_(j))], and p̂_(j) is the n-vector with elements F(X_iβ̂_(j)).
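This Newton update is easy to implement directly. A sketch on simulated data (the true coefficients are my own choice), using the weights F_i(1 − F_i) as the diagonal of P̂Q̂:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([-1.0, 0.8])              # hypothetical true coefficients
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(2)                             # initial condition
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p)                      # MLE first-order condition
    W = p * (1 - p)                            # diagonal of P_hat Q_hat
    H = (X * W[:, None]).T @ X                 # X' P Q X (minus the Hessian)
    step = np.linalg.solve(H, score)
    beta = beta + step                         # the Newton update from the slide
    if np.max(np.abs(step)) < 1e-10:
        break

p = 1 / (1 + np.exp(-X @ beta))
print(beta)                                    # close to beta_true in large samples
print(np.max(np.abs(X.T @ (y - p))))           # FOC is ~0 at the optimum
```

At convergence the moment condition Σ X_i'(y_i − F(X_iβ̂)) = 0 holds to numerical precision.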

38 Remarks Because logit/probit models are inherently nonlinear, the β can no longer be interpreted as measuring marginal effects directly. The reason we assumed that the model for the latent variable has standardized residuals is that the coefficients are identified up to scale only. Marginal effects (and the interpretation of the β) are different than in the linear model: ∂P(y_i = 1|X_i)/∂X_ij = F'(X_iβ)β_j.

39 Marginal Effects [Figure: marginal effects depend on the evaluation point; predicted probability plotted against the regressor.]

40 HMDA Example Continued
. reg deny pi_rat, robust
Linear regression  Number of obs = 2380  F(1, 2378) =  Prob > F =  R-squared =  Root MSE =
deny  Coef.  Robust Std. Err.  t  P>|t|  [95% Conf. Interval]
pi_rat  _cons
. probit deny pi_rat
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Probit regression  Number of obs = 2380  LR chi2(1) =  Prob > chi2 =  Log likelihood =  Pseudo R2 =
deny  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
pi_rat  _cons
Note: 0 failures and 1 success completely determined.

41 HMDA Example (cont.)
. probit deny pi_rat
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Probit regression  Number of obs = 2380  LR chi2(1) =  Prob > chi2 =  Log likelihood =  Pseudo R2 =
deny  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
pi_rat  _cons
Note: 0 failures and 1 success completely determined.
. logit deny pi_rat
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Iteration 4: log likelihood =
Logistic regression  Number of obs = 2380  LR chi2(1) =  Prob > chi2 =  Log likelihood =  Pseudo R2 =
deny  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
pi_rat  _cons

42 Comparing Logit and Probit [figure omitted]

43 How good is the fit? McFadden's (1974) Pseudo-R²: Pseudo-R² = 1 − [Σ_{i=1}^n y_i ln p̂_i + (1 − y_i) ln(1 − p̂_i)] / [n(ȳ ln ȳ + (1 − ȳ) ln(1 − ȳ))]. Predicted outcomes: this is tricky to evaluate. Here is the Stata output:
. estat classification
Logistic model for deny
Classified + if predicted Pr(D) >= .5  True D defined as deny != 0
Sensitivity  Pr( + | D)  3.86%
Specificity  Pr( - |~D)  99.71%
Positive predictive value  Pr( D | +)  64.71%
Negative predictive value  Pr(~D | -)  88.40%
False + rate for true ~D  Pr( + |~D)  0.29%
False - rate for true D  Pr( - | D)  96.14%
False + rate for classified +  Pr(~D | +)  35.29%
False - rate for classified -  Pr( D | -)  11.60%
Correctly classified  88.24%
Looks good!?
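The pseudo-R² formula can be sketched as a small function (a helper of my own, not Stata's): a model that predicts the sample mean ȳ for every observation gets a pseudo-R² of exactly zero, and better-calibrated probabilities push it toward one.

```python
import numpy as np

def mcfadden_pseudo_r2(y, p_hat):
    """McFadden (1974) pseudo-R^2: 1 - logL(model) / logL(constant-only)."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    ll_model = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    ybar = y.mean()
    ll_null = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    return 1 - ll_model / ll_null

# Predicting the sample mean for everyone reproduces the null model exactly:
y = np.array([1, 0, 0, 0, 1, 0])
assert abs(mcfadden_pseudo_r2(y, np.full(6, y.mean()))) < 1e-12
```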

44 But... Remember that there are very few applications denied, so it is easy to just always predict deny = 0 and seemingly do well:
Classified + if predicted Pr(D) >= .5  True D defined as deny != 0
Sensitivity  Pr( + | D)  3.86%
Specificity  Pr( - |~D)  99.71%
Positive predictive value  Pr( D | +)  64.71%
Negative predictive value  Pr(~D | -)  88.40%
False + rate for true ~D  Pr( + |~D)  0.29%
False - rate for true D  Pr( - | D)  96.14%
False + rate for classified +  Pr(~D | +)  35.29%
False - rate for classified -  Pr( D | -)  11.60%
Correctly classified  88.24%

45 HMDA Example (cont.)
. probit deny pi_rat
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Probit regression  Number of obs = 2380  LR chi2(1) =  Prob > chi2 =  Log likelihood =  Pseudo R2 =
deny  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
pi_rat  _cons
Note: 0 failures and 1 success completely determined.
. logit deny pi_rat
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Iteration 4: log likelihood =
Logistic regression  Number of obs = 2380  LR chi2(1) =  Prob > chi2 =  Log likelihood =  Pseudo R2 =
deny  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
pi_rat  _cons

46 Prediction For example, what is the probability of being denied a loan when the payment-to-income ratio is 60%? No race distinction in the regression.
. prvalue, x(pi_rat = 0.6)
logit: Predictions for deny  Confidence intervals by delta method
Pr(y=1|x): [ , ]
Pr(y=0|x): [ , ]
pi_rat  x = .6

47 Now including black
. logit deny pi_rat black
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Iteration 4: log likelihood =
Logistic regression  Number of obs = 2380  LR chi2(2) =  Prob > chi2 =  Log likelihood =  Pseudo R2 =
deny  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
pi_rat  black  _cons
. prvalue, x(pi_rat = 0.6 black = 0)
logit: Predictions for deny  Confidence intervals by delta method
Pr(y=1|x): [ , ]
Pr(y=0|x): [ , ]
pi_rat  black  x = .6  0
. prvalue, x(pi_rat = 0.6 black = 1)
logit: Predictions for deny  Confidence intervals by delta method
Pr(y=1|x): [ , ]
Pr(y=0|x): [ , ]
pi_rat  black  x = .6  1

48 But it depends... Same exercise, but now the payment-to-income ratio is 10%.
. prvalue, x(pi_rat = 0.1 black = 0)
logit: Predictions for deny  Confidence intervals by delta method
Pr(y=1|x): [ , ]
Pr(y=0|x): [ , ]
pi_rat  black  x = .1  0
. prvalue, x(pi_rat = 0.1 black = 1)
logit: Predictions for deny  Confidence intervals by delta method
Pr(y=1|x): [ , ]
Pr(y=0|x): [ , ]
pi_rat  black  x = .1  1
Probably blacks have less income, hence the results when black is excluded. The rejection rate is twice as high at high payment/income ratios, but almost 4 times as high at low payment/income ratios.

49 MLE, GMM and the Information Matrix Equality Latent variable models are often estimated with MLE. Let me use GMM results to explain the statistical properties of MLE in general settings. Specifically, understanding the information matrix equality is helpful in understanding Quasi-MLE and other related estimation principles.

50 GMM and MLE In a random sample, the density for the i-th observation is f(y_i|x_i; θ). Since this is a proper density, it must integrate to 1: ∫_A f(y_i|x_i; θ) dy = 1. Assuming integrals and derivatives are interchangeable and taking the derivative on both sides: ∫_A ∂f(y_i|x_i; θ)/∂θ dy = 0.

51 GMM and MLE (cont.) Multiplying and dividing by f: ∫_A [∂f(y_i|x_i; θ)/∂θ][1/f(y_i|x_i; θ)] f(y_i|x_i; θ) dy = 0. That is, ∫_A [∂log f(y_i|x_i; θ)/∂θ] f(y_i|x_i; θ) dy = ∫_A h(y_i; θ) f(y_i|x_i; θ) dy = 0. h is called the score, and the previous result implies that E[h(y_i; θ)|x_i] = 0 → (1/n) Σ_{i=1}^n h(y_i; θ) = 0.

52 GMM and MLE (cont.) Recall that the log-likelihood function for the sample is: L(θ) = Σ_{i=1}^n log f(y_i|x_i; θ), with F.O.C. → (1/n) Σ_{i=1}^n h(y_i; θ̂) = 0. So if we set up MLE as a GMM problem based on the condition that the expected value of the score is zero (and assuming all the right assumptions are satisfied), and using the optimal weighting matrix, then sqrt(n)(θ̂ − θ₀) →d N(0, (G'Ω^(−1)G)^(−1)).

53 GMM and MLE (cont.) Notice that: G = (1/n) Σ_{i=1}^n ∂²log f(y_i|x_i; θ)/∂θ∂θ', and Ω = E[h(y_i; θ)h(y_i; θ)'] → Ω̂ = (1/n) Σ_{i=1}^n h(y_i; θ)h(y_i; θ)'. However, recall that ∫_A [∂log f(y_i|x_i; θ)/∂θ] f(y_i|x_i; θ) dy = ∫_A h(y_i; θ) f(y_i|x_i; θ) dy = 0. Assuming that integration and differentiation are interchangeable, we can differentiate this identity once more.

54 GMM and MLE (cont.) Differentiating ∫_A h(y_i; θ) f(y_i|x_i; θ) dy = 0 with respect to θ': ∫_A [∂h(y_i; θ)/∂θ'] f(y_i|x_i; θ) dy + ∫_A h(y_i; θ) [∂f(y_i|x_i; θ)/∂θ'] dy = 0. Since ∂f(y_i|x_i; θ)/∂θ' = [∂log f(y_i|x_i; θ)/∂θ'] f(y_i|x_i; θ) = h'(y_i; θ) f(y_i|x_i; θ), this becomes ∫_A [∂²log f(y_i|x_i; θ)/∂θ∂θ'] f(y_i|x_i; θ) dy = −∫_A h(y_i; θ) h'(y_i; θ) f(y_i|x_i; θ) dy. So that:

55 The Information Matrix Equality Recall ∫_A [∂log f(y_i|x_i; θ)/∂θ] f(y_i|x_i; θ) dy = ∫_A h(y_i; θ) f(y_i|x_i; θ) dy = 0, and, from the previous slide, ∫_A [∂²log f(y_i|x_i; θ)/∂θ∂θ'] f(y_i|x_i; θ) dy = −∫_A h(y_i; θ) h'(y_i; θ) f(y_i|x_i; θ) dy. Implying: E[∂²log f(y_i|x_i; θ)/∂θ∂θ'] = −E[(∂log f(y_i|x_i; θ)/∂θ)(∂log f(y_i|x_i; θ)/∂θ')]. This is called the information matrix equality.

56 Remarks When the likelihood function is correctly specified, the information matrix equality holds, and an estimate of the asymptotic covariance matrix of the estimated parameters can be obtained with either the outer product of the scores or the second-derivative estimate. When the likelihood is incorrectly specified, the information matrix equality does not hold. However, as long as the score has mean zero, extremum estimation results can be used to show that the estimator is consistent, with asymptotic covariance matrix given by the sandwich estimator.

57 Remarks (cont.) Moreover, the information matrix equality is the basis for the BHHH algorithm: at the true parameter values, the expected outer product of the gradients equals minus the expected Hessian.

58 Recap F.O.C. of MLE as a moment condition: L(θ) = Σ_{i=1}^n log f(y_i|x_i; θ), F.O.C. → g_n(θ) = (1/n) Σ_{i=1}^n h(y_i; θ) = 0. GMM: sqrt(n)(θ̂ − θ₀) →d N(0, (G'Ω^(−1)G)^(−1)), with G = (1/n) Σ_{i=1}^n ∂²log f(y_i|x_i; θ)/∂θ∂θ' and Ω = E[h(y_i; θ)h(y_i; θ)'] → Ω̂ = (1/n) Σ_{i=1}^n h(y_i; θ)h(y_i; θ)'. Information matrix equality: E[∂²log f(y_i|x_i; θ)/∂θ∂θ'] = −E[(∂log f/∂θ)(∂log f/∂θ')].

59 The Newton Method for MLE F.O.C. for MLE: (1/n) Σ_{i=1}^n ∂log f(y_i|x_i; θ)/∂θ = (1/n) Σ_{i=1}^n h(y_i; θ) = 0, and G(θ) = (1/n) Σ_{i=1}^n ∂²log f(y_i|x_i; θ)/∂θ∂θ'. The Newton step is (using the information matrix equality): θ̂_(j+1) = θ̂_(j) − Ĝ_(j)^(−1) ĥ_(j) ≈ θ̂_(j) + [(1/n) Σ_{i=1}^n ĥ_i,(j) ĥ_i,(j)']^(−1) ĥ_(j).
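A sketch of the BHHH variant for a logit log-likelihood, on simulated data with hypothetical true values of my own: the Hessian is replaced by minus the outer product of the per-observation scores, so the update carries a plus sign.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.3, -0.6])              # hypothetical true values
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(2)
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ beta))
    scores = X * (y - p)[:, None]              # h_i: per-observation logit scores
    h = scores.sum(axis=0)                     # sample score
    D = scores.T @ scores                      # BHHH outer-product stand-in for -Hessian
    step = np.linalg.solve(D, h)
    beta = beta + step                         # BHHH update (plus sign: D ~ -H)
    if np.max(np.abs(step)) < 1e-10:
        break

p = 1 / (1 + np.exp(-X @ beta))
print(beta)
```

Near the optimum the outer product is close to minus the Hessian, so the BHHH steps behave almost like Newton steps while never requiring second derivatives.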


More information

Linear Regression with Multiple Regressors

Linear Regression with Multiple Regressors Linear Regression with Multiple Regressors (SW Chapter 6) Outline 1. Omitted variable bias 2. Causality and regression analysis 3. Multiple regression and OLS 4. Measures of fit 5. Sampling distribution

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

Nonlinear Econometric Analysis (ECO 722) : Homework 2 Answers. (1 θ) if y i = 0. which can be written in an analytically more convenient way as

Nonlinear Econometric Analysis (ECO 722) : Homework 2 Answers. (1 θ) if y i = 0. which can be written in an analytically more convenient way as Nonlinear Econometric Analysis (ECO 722) : Homework 2 Answers 1. Consider a binary random variable y i that describes a Bernoulli trial in which the probability of observing y i = 1 in any draw is given

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

Econometrics II Tutorial Problems No. 1

Econometrics II Tutorial Problems No. 1 Econometrics II Tutorial Problems No. 1 Lennart Hoogerheide & Agnieszka Borowska 15.02.2017 1 Summary Binary Response Model: A model for a binary (or dummy, i.e. with two possible outcomes 0 and 1) dependent

More information

Practice exam questions

Practice exam questions Practice exam questions Nathaniel Higgins nhiggins@jhu.edu, nhiggins@ers.usda.gov 1. The following question is based on the model y = β 0 + β 1 x 1 + β 2 x 2 + β 3 x 3 + u. Discuss the following two hypotheses.

More information

Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis"

Ninth ARTNeT Capacity Building Workshop for Trade Research Trade Flows and Trade Policy Analysis Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis" June 2013 Bangkok, Thailand Cosimo Beverelli and Rainer Lanz (World Trade Organization) 1 Selected econometric

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Gibbs Sampling in Latent Variable Models #1

Gibbs Sampling in Latent Variable Models #1 Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor

More information

A simple alternative to the linear probability model for binary choice models with endogenous regressors

A simple alternative to the linear probability model for binary choice models with endogenous regressors A simple alternative to the linear probability model for binary choice models with endogenous regressors Christopher F Baum, Yingying Dong, Arthur Lewbel, Tao Yang Boston College/DIW Berlin, U.Cal Irvine,

More information

Binary Logistic Regression

Binary Logistic Regression The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b

More information

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment BINARY CHOICE MODELS Y ( Y ) ( Y ) 1 with Pr = 1 = P = 0 with Pr = 0 = 1 P Examples: decision-making purchase of durable consumer products unemployment Estimation with OLS? Yi = Xiβ + εi Problems: nonsense

More information

i (x i x) 2 1 N i x i(y i y) Var(x) = P (x 1 x) Var(x)

i (x i x) 2 1 N i x i(y i y) Var(x) = P (x 1 x) Var(x) ECO 6375 Prof Millimet Problem Set #2: Answer Key Stata problem 2 Q 3 Q (a) The sample average of the individual-specific marginal effects is 0039 for educw and -0054 for white Thus, on average, an extra

More information

At this point, if you ve done everything correctly, you should have data that looks something like:

At this point, if you ve done everything correctly, you should have data that looks something like: This homework is due on July 19 th. Economics 375: Introduction to Econometrics Homework #4 1. One tool to aid in understanding econometrics is the Monte Carlo experiment. A Monte Carlo experiment allows

More information

Recent Developments in Multilevel Modeling

Recent Developments in Multilevel Modeling Recent Developments in Multilevel Modeling Roberto G. Gutierrez Director of Statistics StataCorp LP 2007 North American Stata Users Group Meeting, Boston R. Gutierrez (StataCorp) Multilevel Modeling August

More information

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. The Sharp RD Design 3.

More information

Consider Table 1 (Note connection to start-stop process).

Consider Table 1 (Note connection to start-stop process). Discrete-Time Data and Models Discretized duration data are still duration data! Consider Table 1 (Note connection to start-stop process). Table 1: Example of Discrete-Time Event History Data Case Event

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

ECON Introductory Econometrics. Lecture 17: Experiments

ECON Introductory Econometrics. Lecture 17: Experiments ECON4150 - Introductory Econometrics Lecture 17: Experiments Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 13 Lecture outline 2 Why study experiments? The potential outcome framework.

More information

Lecture 5. In the last lecture, we covered. This lecture introduces you to

Lecture 5. In the last lecture, we covered. This lecture introduces you to Lecture 5 In the last lecture, we covered. homework 2. The linear regression model (4.) 3. Estimating the coefficients (4.2) This lecture introduces you to. Measures of Fit (4.3) 2. The Least Square Assumptions

More information

Week 7: Binary Outcomes (Scott Long Chapter 3 Part 2)

Week 7: Binary Outcomes (Scott Long Chapter 3 Part 2) Week 7: (Scott Long Chapter 3 Part 2) Tsun-Feng Chiang* *School of Economics, Henan University, Kaifeng, China April 29, 2014 1 / 38 ML Estimation for Probit and Logit ML Estimation for Probit and Logit

More information

Chapter 9 Regression with a Binary Dependent Variable. Multiple Choice. 1) The binary dependent variable model is an example of a

Chapter 9 Regression with a Binary Dependent Variable. Multiple Choice. 1) The binary dependent variable model is an example of a Chapter 9 Regression with a Binary Dependent Variable Multiple Choice ) The binary dependent variable model is an example of a a. regression model, which has as a regressor, among others, a binary variable.

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Limited Dependent Variable Models II

Limited Dependent Variable Models II Limited Dependent Variable Models II Fall 2008 Environmental Econometrics (GR03) LDV Fall 2008 1 / 15 Models with Multiple Choices The binary response model was dealing with a decision problem with two

More information

4 Instrumental Variables Single endogenous variable One continuous instrument. 2

4 Instrumental Variables Single endogenous variable One continuous instrument. 2 Econ 495 - Econometric Review 1 Contents 4 Instrumental Variables 2 4.1 Single endogenous variable One continuous instrument. 2 4.2 Single endogenous variable more than one continuous instrument..........................

More information

Logistic & Tobit Regression

Logistic & Tobit Regression Logistic & Tobit Regression Different Types of Regression Binary Regression (D) Logistic transformation + e P( y x) = 1 + e! " x! + " x " P( y x) % ln$ ' = ( + ) x # 1! P( y x) & logit of P(y x){ P(y

More information

Problem Set 10: Panel Data

Problem Set 10: Panel Data Problem Set 10: Panel Data 1. Read in the data set, e11panel1.dta from the course website. This contains data on a sample or 1252 men and women who were asked about their hourly wage in two years, 2005

More information

Econ 836 Final Exam. 2 w N 2 u N 2. 2 v N

Econ 836 Final Exam. 2 w N 2 u N 2. 2 v N 1) [4 points] Let Econ 836 Final Exam Y Xβ+ ε, X w+ u, w N w~ N(, σi ), u N u~ N(, σi ), ε N ε~ Nu ( γσ, I ), where X is a just one column. Let denote the OLS estimator, and define residuals e as e Y X.

More information

4 Instrumental Variables Single endogenous variable One continuous instrument. 2

4 Instrumental Variables Single endogenous variable One continuous instrument. 2 Econ 495 - Econometric Review 1 Contents 4 Instrumental Variables 2 4.1 Single endogenous variable One continuous instrument. 2 4.2 Single endogenous variable more than one continuous instrument..........................

More information

Modeling Binary Outcomes: Logit and Probit Models

Modeling Binary Outcomes: Logit and Probit Models Modeling Binary Outcomes: Logit and Probit Models Eric Zivot December 5, 2009 Motivating Example: Women s labor force participation y i = 1 if married woman is in labor force = 0 otherwise x i k 1 = observed

More information

Interpreting coefficients for transformed variables

Interpreting coefficients for transformed variables Interpreting coefficients for transformed variables! Recall that when both independent and dependent variables are untransformed, an estimated coefficient represents the change in the dependent variable

More information

Section Least Squares Regression

Section Least Squares Regression Section 2.3 - Least Squares Regression Statistics 104 Autumn 2004 Copyright c 2004 by Mark E. Irwin Regression Correlation gives us a strength of a linear relationship is, but it doesn t tell us what it

More information

Sociology 362 Data Exercise 6 Logistic Regression 2

Sociology 362 Data Exercise 6 Logistic Regression 2 Sociology 362 Data Exercise 6 Logistic Regression 2 The questions below refer to the data and output beginning on the next page. Although the raw data are given there, you do not have to do any Stata runs

More information

Single-level Models for Binary Responses

Single-level Models for Binary Responses Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

Section I. Define or explain the following terms (3 points each) 1. centered vs. uncentered 2 R - 2. Frisch theorem -

Section I. Define or explain the following terms (3 points each) 1. centered vs. uncentered 2 R - 2. Frisch theorem - First Exam: Economics 388, Econometrics Spring 006 in R. Butler s class YOUR NAME: Section I (30 points) Questions 1-10 (3 points each) Section II (40 points) Questions 11-15 (10 points each) Section III

More information

ECON3150/4150 Spring 2015

ECON3150/4150 Spring 2015 ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = 0 + 1 x 1 + x +... k x k + u 6. Heteroskedasticity What is Heteroskedasticity?! Recall the assumption of homoskedasticity implied that conditional on the explanatory variables,

More information

Lab 07 Introduction to Econometrics

Lab 07 Introduction to Econometrics Lab 07 Introduction to Econometrics Learning outcomes for this lab: Introduce the different typologies of data and the econometric models that can be used Understand the rationale behind econometrics Understand

More information

Instrumental Variables, Simultaneous and Systems of Equations

Instrumental Variables, Simultaneous and Systems of Equations Chapter 6 Instrumental Variables, Simultaneous and Systems of Equations 61 Instrumental variables In the linear regression model y i = x iβ + ε i (61) we have been assuming that bf x i and ε i are uncorrelated

More information

GMM Estimation in Stata

GMM Estimation in Stata GMM Estimation in Stata Econometrics I Department of Economics Universidad Carlos III de Madrid Master in Industrial Economics and Markets 1 Outline Motivation 1 Motivation 2 3 4 2 Motivation 3 Stata and

More information

raise Coef. Std. Err. z P> z [95% Conf. Interval]

raise Coef. Std. Err. z P> z [95% Conf. Interval] 1 We will use real-world data, but a very simple and naive model to keep the example easy to understand. What is interesting about the example is that the outcome of interest, perhaps the probability or

More information

Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt

Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt Giovanni Nattino The Ohio Colleges of Medicine Government Resource Center The Ohio State University Stata Conference -

More information

Lab 11 - Heteroskedasticity

Lab 11 - Heteroskedasticity Lab 11 - Heteroskedasticity Spring 2017 Contents 1 Introduction 2 2 Heteroskedasticity 2 3 Addressing heteroskedasticity in Stata 3 4 Testing for heteroskedasticity 4 5 A simple example 5 1 1 Introduction

More information

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han Econometrics Honor s Exam Review Session Spring 2012 Eunice Han Topics 1. OLS The Assumptions Omitted Variable Bias Conditional Mean Independence Hypothesis Testing and Confidence Intervals Homoskedasticity

More information

Lecture 2: Poisson and logistic regression

Lecture 2: Poisson and logistic regression Dankmar Böhning Southampton Statistical Sciences Research Institute University of Southampton, UK S 3 RI, 11-12 December 2014 introduction to Poisson regression application to the BELCAP study introduction

More information

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson Lecture 10: Alternatives to OLS with limited dependent variables PEA vs APE Logit/Probit Poisson PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

Econometrics Midterm Examination Answers

Econometrics Midterm Examination Answers Econometrics Midterm Examination Answers March 4, 204. Question (35 points) Answer the following short questions. (i) De ne what is an unbiased estimator. Show that X is an unbiased estimator for E(X i

More information

Comparing groups using predicted probabilities

Comparing groups using predicted probabilities Comparing groups using predicted probabilities J. Scott Long Indiana University May 9, 2006 MAPSS - May 9, 2006 - Page 1 The problem Allison (1999): Di erences in the estimated coe cients tell us nothing

More information

Lecture 5: Poisson and logistic regression

Lecture 5: Poisson and logistic regression Dankmar Böhning Southampton Statistical Sciences Research Institute University of Southampton, UK S 3 RI, 3-5 March 2014 introduction to Poisson regression application to the BELCAP study introduction

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

Instrumental Variable Regression

Instrumental Variable Regression Topic 6 Instrumental Variable Regression ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà Outline of this topic Randomized Experiments, natural experiments and causation Instrumental variables:

More information

Essential of Simple regression

Essential of Simple regression Essential of Simple regression We use simple regression when we are interested in the relationship between two variables (e.g., x is class size, and y is student s GPA). For simplicity we assume the relationship

More information

Logistic regression: Why we often can do what we think we can do. Maarten Buis 19 th UK Stata Users Group meeting, 10 Sept. 2015

Logistic regression: Why we often can do what we think we can do. Maarten Buis 19 th UK Stata Users Group meeting, 10 Sept. 2015 Logistic regression: Why we often can do what we think we can do Maarten Buis 19 th UK Stata Users Group meeting, 10 Sept. 2015 1 Introduction Introduction - In 2010 Carina Mood published an overview article

More information

Non-linear panel data modeling

Non-linear panel data modeling Non-linear panel data modeling Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini May 2010 Laura Magazzini (@univr.it) Non-linear panel data modeling May 2010 1

More information

ECONOMETRICS HONOR S EXAM REVIEW SESSION

ECONOMETRICS HONOR S EXAM REVIEW SESSION ECONOMETRICS HONOR S EXAM REVIEW SESSION Eunice Han ehan@fas.harvard.edu March 26 th, 2013 Harvard University Information 2 Exam: April 3 rd 3-6pm @ Emerson 105 Bring a calculator and extra pens. Notes

More information

Practical Econometrics. for. Finance and Economics. (Econometrics 2)

Practical Econometrics. for. Finance and Economics. (Econometrics 2) Practical Econometrics for Finance and Economics (Econometrics 2) Seppo Pynnönen and Bernd Pape Department of Mathematics and Statistics, University of Vaasa 1. Introduction 1.1 Econometrics Econometrics

More information

ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics August 2013

ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics August 2013 ECONOMICS AND ECONOMIC METHODS PRELIM EXAM Statistics and Econometrics August 2013 Instructions: Answer all six (6) questions. Point totals for each question are given in parentheses. The parts within

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

fhetprob: A fast QMLE Stata routine for fractional probit models with multiplicative heteroskedasticity

fhetprob: A fast QMLE Stata routine for fractional probit models with multiplicative heteroskedasticity fhetprob: A fast QMLE Stata routine for fractional probit models with multiplicative heteroskedasticity Richard Bluhm May 26, 2013 Introduction Stata can easily estimate a binary response probit models

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

Measurement Error. Often a data set will contain imperfect measures of the data we would ideally like.

Measurement Error. Often a data set will contain imperfect measures of the data we would ideally like. Measurement Error Often a data set will contain imperfect measures of the data we would ideally like. Aggregate Data: (GDP, Consumption, Investment are only best guesses of theoretical counterparts and

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

Generalized Linear Models Introduction

Generalized Linear Models Introduction Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,

More information