Week 7: (Scott Long Chapter 3 Part 2) Tsun-Feng Chiang* *School of Economics, Henan University, Kaifeng, China April 29, 2014 1 / 38
ML Estimation for Probit and Logit
Suppose N observations are drawn independently from the population and the choice y is binary. We want to estimate β given the data. The likelihood function is the product, over the sample, of each observation's probability p_i of its observed choice,
L(β | y, X) = ∏_{i=1}^{N} p_i,   (eq. 7)
where
p_i = Pr(y_i = 1 | x_i) if y_i = 1 is observed,
p_i = 1 − Pr(y_i = 1 | x_i) if y_i = 0 is observed.
Let N_1 be the number of observations that choose y_i = 1 and N_2 the number that choose y_i = 0. Then (eq. 7) can be split into two groups according to the choices, 2 / 38
ML Estimation for Probit and Logit
L(β | y, X) = ∏_{y=1} Pr(y_i = 1 | x_i) · ∏_{y=0} [1 − Pr(y_i = 1 | x_i)],
where the first product runs over the N_1 observations with y_i = 1 and the second over the N_2 observations with y_i = 0. From (eq. 4), the equation above can be rewritten as
L(β | y, X) = ∏_{y=1} F(x_i β) · ∏_{y=0} [1 − F(x_i β)].
Taking logs gives the log likelihood function,
ln L(β | y, X) = Σ_{y=1} ln F(x_i β) + Σ_{y=0} ln[1 − F(x_i β)].
Unlike the linear regression model of Chapter 2.6, where the ML estimators have closed-form solutions, algebraic maximization of ln L(β | y, X) is rarely possible. 3 / 38
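As a concrete illustration, here is a minimal R sketch of this log likelihood on simulated data (the function name lnL and the simulated data are ours, not from the text):

set.seed(1)
N <- 200
x <- cbind(1, rnorm(N))                               # design matrix with an intercept column
beta_true <- c(0.5, 1)
y <- as.numeric(runif(N) < pnorm(x %*% beta_true))    # probit data-generating process

# ln L(beta | y, X) = sum over y=1 of ln F(x_i beta) + sum over y=0 of ln[1 - F(x_i beta)]
lnL <- function(b, y, x, F = pnorm) {
  p <- F(x %*% b)
  sum(log(p[y == 1])) + sum(log(1 - p[y == 0]))
}

lnL(beta_true, y, x)                 # probit log likelihood at the true parameters
lnL(beta_true, y, x, F = plogis)     # the same data evaluated with the logit cdf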
ML Estimation for Probit and Logit
Numerical Methods for ML Estimation
The log likelihood function is complicated, so numerical methods are usually used to find the maximum likelihood estimates. They start with a guess of the parameter values and iterate to improve on that guess. Assume we are trying to estimate the vector of parameters β. We begin with an initial guess β_0, called the start values, and attempt to improve on this guess by adding a vector ζ_0 of adjustments,
β_1 = β_0 + ζ_0.
We continue to update the previous iteration according to the equation
β_{t+1} = β_t + ζ_t. 4 / 38
ML Estimation for Probit and Logit
Figure (Train, p. 186): Maximum Likelihood Estimate, one-parameter example
Iterations stop when the gradient of the log likelihood function is close to 0, or when the estimates do not change from one step to the next,
β_{T+1} = β_T.
Then β_T is the maximum likelihood estimate. 5 / 38
ML Estimation for Probit and Logit
The problem is how to find a ζ_t that reduces the number of iterations, i.e., reaches β_T quickly. It is useful to think of ζ_t as consisting of two parts,
ζ_t = D_t γ_t,
where
γ_t is the gradient vector, defined as ∂ln L/∂β_t, which indicates the direction of the change in the log likelihood for a change in the parameters. In the one-parameter example, when the direction (slope) is positive (negative), β_t is increased (decreased) in the next step.
D_t is a direction matrix that reflects the curvature of the log likelihood function; that is, it indicates how rapidly the gradient is changing. In the one-parameter example, when the slope is changing rapidly (slowly), the next step for β_t is smaller (larger). 6 / 38
ML Estimation for Probit and Logit Figure (Train, p188): Gradient Vector γ t, one-parameter example 7 / 38
ML Estimation for Probit and Logit Figure (Train, p188): Direction Matrix D t, one-parameter example 8 / 38
ML Estimation for Probit and Logit
Different numerical methods use different direction matrices; the following are the most commonly used:
The Newton-Raphson Method: D_t = −(∂²ln L / ∂β_t ∂β_t′)^{−1}
The Method of Scoring: D_t = −(E[∂²ln L / ∂β_t ∂β_t′])^{−1}
The BHHH Method: D_t = (Σ_{i=1}^{N} (∂ln L_i/∂β_t)(∂ln L_i/∂β_t)′)^{−1} 9 / 38
ML Estimation for Probit and Logit
Since β̂ is obtained numerically, the covariance matrix must also be estimated using numerical methods. Let β̂ be the maximum likelihood estimates; each numerical method has its own way of estimating the covariance matrix:
The Newton-Raphson Method: Var(β̂) = (−Σ_{i=1}^{N} ∂²ln L_i / ∂β̂ ∂β̂′)^{−1}
The Method of Scoring: Var(β̂) = (−E[∂²ln L / ∂β̂ ∂β̂′])^{−1}
The BHHH Method: Var(β̂) = (Σ_{i=1}^{N} (∂ln L_i/∂β̂)(∂ln L_i/∂β̂)′)^{−1} 10 / 38
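To make the update β_{t+1} = β_t + ζ_t concrete, here is a minimal base-R sketch of Newton-Raphson for a logit model, where the gradient and Hessian have simple closed forms (the simulated data and all object names are ours, not from the text):

set.seed(2)
N <- 500
x <- cbind(1, rnorm(N), rnorm(N))                       # design matrix
beta_true <- c(-0.5, 1, -1)
y <- as.numeric(runif(N) < plogis(x %*% beta_true))     # logit data-generating process

b <- rep(0, ncol(x))                                    # start values beta_0
for (t in 1:25) {
  p    <- as.vector(plogis(x %*% b))
  g    <- crossprod(x, y - p)                           # gradient: sum over i of (y_i - p_i) x_i
  H    <- -crossprod(x * (p * (1 - p)), x)              # Hessian of the logit log likelihood
  zeta <- drop(-solve(H, g))                            # zeta_t = D_t gamma_t with D_t = -H^{-1}
  b    <- b + zeta                                      # beta_{t+1} = beta_t + zeta_t
  if (max(abs(g)) < 1e-8) break                         # stop when the gradient is close to 0
}
b                                                       # ML estimates beta-hat
sqrt(diag(solve(-H)))                                   # standard errors from Var(beta-hat) = (-H)^{-1}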
ML Estimation for Probit and Logit
Problems with Numerical Methods
ML estimation by numerical methods can run into several problems:
β̂ cannot be found after many iterations: flat log likelihood function.
Wrong estimates are obtained: local maximum.
ML estimates do not exist: no variation in an independent variable for one of the outcomes.
These problems can arise for the following reasons:
Number of observations. A small sample might explain why the model does not converge.
Scaling of variables. When the standard deviation of a variable is very large or very small relative to the other variables, the ML estimates may not be found.
Distribution of outcomes. If there are few observations in one outcome, convergence may be difficult. 11 / 38
ML Estimation for Probit and Logit Figure: Logit Analysis of Labor Force Participation 12 / 38
ML Estimation for Probit and Logit Figure: Probit Analysis of Labor Force Participation 13 / 38
ML Estimation for Probit and Logit
My R Code: To run the probit and logit, use the command glm.
> labor = read.csv(file.choose(), header = TRUE)
Logit
> labor_logit <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "logit"), data = labor)
> summary(labor_logit)
Probit
> labor_probit <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "probit"), data = labor)
> summary(labor_probit) 14 / 38
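Once the models are fit, the quantities discussed above can be pulled from the fitted objects with standard R commands (a sketch):

> coef(labor_probit)                  # ML estimates beta-hat
> logLik(labor_probit)                # maximized log likelihood
> vcov(labor_probit)                  # estimated covariance matrix Var(beta-hat)
> sqrt(diag(vcov(labor_probit)))      # standard errors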
The Probability Curve and Parameters
Consider a model with a single x,
Pr(y = 1 | x) = F(α + βx).
A change in α shifts the probability curve (or cdf) in parallel. When β is positive, a smaller intercept shifts the curve to the right and a larger intercept shifts it to the left (visual intuition: to reach a given probability, say 0.5, a smaller (larger) x is needed when α is larger (smaller)).
A change in β changes the slope of the probability curve (or cdf). The larger the β, the steeper the slope (visual intuition: to achieve a given change in probability, a smaller (larger) change in x is needed when the slope is larger (smaller)). 15 / 38
The Probability Curve and Parameters Figure 3.8 A: Effects of Changing α 16 / 38
The Probability Curve and Parameters Figure 3.8 B: Effects of Changing β 17 / 38
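A small R sketch (ours, with made-up values of α and β) that reproduces the idea behind Figures 3.8 A and B using the probit cdf:

x <- seq(-10, 10, length.out = 200)
# Panel A: changing alpha with beta = 1 shifts the curve in parallel
plot(x, pnorm(-2 + x), type = "l", ylab = "Pr(y = 1 | x)", main = "Changing alpha (beta = 1)")
lines(x, pnorm(0 + x), lty = 2)      # larger alpha shifts the curve to the left
lines(x, pnorm(2 + x), lty = 3)
# Panel B: changing beta with alpha = 0 changes the steepness
plot(x, pnorm(0.5 * x), type = "l", ylab = "Pr(y = 1 | x)", main = "Changing beta (alpha = 0)")
lines(x, pnorm(1.0 * x), lty = 2)    # larger beta gives a steeper curve
lines(x, pnorm(2.0 * x), lty = 3)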
The Probability Curve and Parameters
Now consider a model with one more independent variable z,
Pr(y = 1 | x, z) = F(α + β_1 x + β_2 z).
If we fix z at a specific value z*, the model above can be written as
Pr(y = 1 | x, z = z*) = F(α + β_1 x + β_2 z*) = F((α + β_2 z*) + β_1 x).
The term β_2 z* becomes part of the intercept. Therefore, when the value of z changes, the probability curve shifts in parallel with respect to x. This means the effect of a variable on the probability depends on the values of the other variables. 18 / 38
The Probability Curve and Parameters Figure 3.9: How z Affects the Effect of x 19 / 38
The Probability Curve and Parameters Figure 3.9: Values of z Create Parallel Curves with Respect to x 20 / 38
Interpretation
Interpretation - Predicted Probabilities
To interpret the estimated results from the logit and probit models, probabilities are the fundamental statistic:
Probit: Pr(y = 1 | x) = Φ(xβ̂) = ∫_{−∞}^{xβ̂} (1/√(2π)) exp(−t²/2) dt
Logit: Pr(y = 1 | x) = Λ(xβ̂) = exp(xβ̂) / [1 + exp(xβ̂)]
Since the model is nonlinear, no single method of interpretation can fully describe the relationship between a variable and the outcome. What to interpret, and how, depends on your research purpose or questions. 21 / 38
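In R, Φ and Λ are pnorm() and plogis(), so the predicted probabilities can be computed by hand and checked against predict(); a sketch reusing the fits from the earlier slides:

> X <- model.matrix(labor_probit)                                  # design matrix including the intercept
> head(pnorm(X %*% coef(labor_probit)))                            # Phi(x beta-hat), computed by hand
> head(predict(labor_probit, type = "response"))                   # the same probabilities via predict()
> head(plogis(model.matrix(labor_logit) %*% coef(labor_logit)))    # logit: Lambda(x beta-hat)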
Interpretation
The range of probabilities
The minimum and maximum probabilities in the sample are defined as
min Pr(y = 1 | x) = min_i F(x_i β̂),  max Pr(y = 1 | x) = max_i F(x_i β̂).
Listing the largest and smallest predicted probabilities can suggest which variables are important. However, the range is easily affected by extreme values of x.
The effect of each variable on the predicted probabilities
We can also see how the probability changes when a variable changes, but this requires controlling for the values of the other variables. Usually the other variables are fixed at their means. For example, we can compute the predicted change in the probability as x_k moves from its minimum to its maximum, i.e.
Pr(y = 1 | x̄, max x_k) − Pr(y = 1 | x̄, min x_k). 22 / 38
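For example, the in-sample range of predicted probabilities from the probit fit above is a one-line computation:

> p_hat <- predict(labor_probit, type = "response")    # F(x_i beta-hat) for every observation
> range(p_hat)                                         # minimum and maximum predicted probability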
Interpretation
Figure: Probability for Maximum k5 (Table 3.4)
Take an example using R, letting k5 be x_k. Create a new data frame where k5 is at its maximum in the original data, using the command with.
> labor_newdata_k5max <- with(labor, data.frame(k5=max(k5), k618=mean(k618), age=mean(age), wc=mean(wc), hc=mean(hc), lwg=mean(lwg), inc=mean(inc)))
> labor_newdata_k5max
Predict the probability using the command predict.
> labor_newdata_k5max$k5maxprob <- predict(labor_probit, newdata=labor_newdata_k5max, type="response")
> labor_newdata_k5max 23 / 38
Interpretation
Figure: Probability for Minimum k5 (Table 3.4)
Similarly, we can create a data frame where k5 is at its minimum and the other variables are at their means, and calculate the predicted probability:
Pr(y = 1 | x̄, max k5) − Pr(y = 1 | x̄, min k5) = 0.0132 − 0.6573 = −0.6441. 24 / 38
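The R code for the minimum of k5 mirrors the maximum-k5 slide (a sketch; the object name labor_newdata_k5min is ours):

> labor_newdata_k5min <- with(labor, data.frame(k5=min(k5), k618=mean(k618), age=mean(age), wc=mean(wc), hc=mean(hc), lwg=mean(lwg), inc=mean(inc)))
> labor_newdata_k5min$k5minprob <- predict(labor_probit, newdata=labor_newdata_k5min, type="response")
> labor_newdata_k5min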
Interpretation
Probabilities Over the Range of a Variable
When there is more than one variable, we can compare the effects of two variables while the remaining variables are held constant; only one variable is allowed to change at a time. For example, to examine the effects of x_j and x_l while the others are at their means, first fix x_l at a value, say x_l*, and let x_j move over a range. The predicted probability for the probit is
Pr(y = 1 | x, x_l = x_l*, x_j) = Φ(α + β_1 x_1 + β_2 x_2 + ⋯ + β_l x_l* + ⋯ + β_j x_j + ⋯ + β_k x_k).
We obtain probabilities over some range of x_j when x_l = x_l*. Second, fix x_l at another value ẋ_l and let x_j move over the same range. The predicted probability is
Pr(y = 1 | x, x_l = ẋ_l, x_j) = Φ(α + β_1 x_1 + β_2 x_2 + ⋯ + β_l ẋ_l + ⋯ + β_j x_j + ⋯ + β_k x_k).
We can then compare how the probability changes with x_j under the two different values of x_l. 25 / 38
Interpretation
Figure: Probabilities Over the Range of Age (Figure 3.10)
Create a data frame where age ranges from 30 to 60 in steps of 5, and wife's college (wc) is 0.
> labor_newdata2 <- with(labor, data.frame(k5=mean(k5), k618=mean(k618), wc=0, hc=mean(hc), lwg=mean(lwg), inc=mean(inc), age=seq(from = 30, to = 60, length.out = 7)))
> labor_newdata2 26 / 38
Interpretation
Figure: Probabilities Over the Range of Age (Figure 3.10 Continued)
Predict the probabilities over the range of age.
> labor_newdata2$ageprob <- predict(labor_probit, newdata=labor_newdata2, type="response")
> labor_newdata2 27 / 38
Interpretation
Figure 3.10: Probabilities Over the Range of Age for Two Wife's Education Levels 28 / 38
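A minimal sketch of how the two curves in Figure 3.10 might be drawn in R, repeating the prediction above for wc = 0 and wc = 1 (the helper newdat and the plotting choices are ours):

age_grid <- seq(from = 30, to = 60, by = 5)
newdat <- function(wc_val) with(labor, data.frame(k5 = mean(k5), k618 = mean(k618),
    wc = wc_val, hc = mean(hc), lwg = mean(lwg), inc = mean(inc), age = age_grid))
p_nocollege <- predict(labor_probit, newdata = newdat(0), type = "response")
p_college   <- predict(labor_probit, newdata = newdat(1), type = "response")
plot(age_grid, p_college, type = "b", ylim = c(0, 1), xlab = "Age", ylab = "Pr(in labor force)")
lines(age_grid, p_nocollege, type = "b", lty = 2)
legend("topright", legend = c("wc = 1", "wc = 0"), lty = c(1, 2))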
Interpretation
Changes in Probabilities: Marginal Effects
How should the effect of an independent variable be summarized? Because the scale of the latent y* is arbitrary, β cannot be interpreted directly. The marginal effects of x on the probabilities are a better summary. The marginal effect is the change in Pr(y = 1 | x) for a change of δ in x_l, holding all other x at specific values. There are two kinds of marginal effects:
Marginal change: an infinitely small change in x_l, i.e. δ → 0.
Discrete change: a finite change in x_l.
The two measures agree when the probability curve is linear. 29 / 38
Interpretation
Figure 3.13: Marginal Change (δ infinitely small) vs. Discrete Change (δ = 1) 30 / 38
Interpretation
Marginal Change
Let F be the cdf and f the pdf of the distribution. The marginal change of a variable x_l on the probability is derived as
∂Pr(y = 1 | x)/∂x_l = ∂F(xβ)/∂x_l = [dF(xβ)/d(xβ)] [∂(xβ)/∂x_l] = f(xβ)β_l.   (eq. 8)
For the probit model,
∂Pr(y = 1 | x)/∂x_l = φ(xβ)β_l,
and for the logit model,
∂Pr(y = 1 | x)/∂x_l = λ(xβ)β_l = {exp(xβ)/[1 + exp(xβ)]²} β_l = Pr(y = 1 | x)[1 − Pr(y = 1 | x)]β_l.   (eq. 9) 31 / 38
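Equation (9) can be checked numerically: the analytic slope λ(xβ)β_l should agree with a finite-difference derivative of the logit probability. A sketch with made-up parameter values:

beta <- c(-1, 0.8, 0.5)            # alpha, beta_1, beta_2 (made-up values)
x    <- c(1, 2, -1)                # evaluation point, with a leading 1 for the intercept
p    <- plogis(sum(x * beta))      # Pr(y = 1 | x) under the logit model

p * (1 - p) * beta[2]              # analytic marginal change of x_1, eq. (9)

h  <- 1e-6                         # small step for a finite-difference comparison
xh <- x
xh[2] <- xh[2] + h
(plogis(sum(xh * beta)) - p) / h   # numerical derivative, approximately equal to the line above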
Interpretation
The marginal effect is the slope of the probability curve relating x_l to Pr(y = 1 | x), holding all other variables constant. The sign of the marginal effect is determined by β_l, since f(xβ) is always positive. The magnitude of the change depends on the magnitude of β_l and on the value of xβ.
Assume β_l is positive; from (eq. 9), the effect of x_l on the probability is positive. But when we consider another variable x_j, the situation is more complicated (though tractable). Taking the derivative of (eq. 8) with respect to x_l and x_j and using the logit cdf,
∂²Pr(y = 1 | x)/∂x_l ∂x_j = β_l β_j Pr(y = 1 | x)[1 − Pr(y = 1 | x)][1 − 2Pr(y = 1 | x)]. (How to derive it?)
Assume β_j is also positive. When Pr(y = 1 | x) < 0.5, an increase in x_j makes the slope of the probability with respect to x_l increase; when Pr(y = 1 | x) > 0.5, an increase in x_j makes that slope decrease (see Figure 3.9). 32 / 38
Interpretation Figure 3.12: Marginal Effect in the Binary Response Model (β is positive) 33 / 38
Interpretation
Overall, several things affect the size of the marginal effect (this applies to both the marginal change and the discrete change):
The parameter associated with the variable of interest, i.e. β_l in the example of (eq. 8)
The starting value of the variable of interest, i.e. x_l
The amount of change in x_l
The values and parameters of the other variables
Since the value of the marginal effect depends on the levels of all variables, we must decide which values of the variables to use when computing the effect. One method is to compute the average over all observations:
mean ∂Pr(y = 1 | x)/∂x_l = (1/N) Σ_{i=1}^{N} f(x_i β)β_l. 34 / 38
Interpretation
Another method is to compute the marginal effect at the means of the independent variables,
∂Pr(y = 1 | x̄)/∂x_l = f(x̄β)β_l.
However, these two methods are limited, primarily because of dummy variables. It is inappropriate to take a derivative of, or an average over, a dummy variable, and these methods cannot show how the predicted probability changes when a dummy variable changes. For this reason the discrete change is introduced.
Discrete Change
(i) A unit change in x_l. If a variable x_l increases from a start value x_l^0 to x_l^0 + 1, the change in probability is defined as
ΔPr(y = 1 | x)/Δx_l = Pr(y = 1 | x, x_l^0 + 1) − Pr(y = 1 | x, x_l^0).
The start value affects the change in the probability. Usually x_l^0 = x̄_l. 35 / 38
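Both summaries can be computed from the probit fit on the earlier slides; a sketch taking age as the variable of interest (this assumes age enters the model as a numeric variable):

X  <- model.matrix(labor_probit)           # design matrix, including the intercept column
b  <- coef(labor_probit)
xb <- as.vector(X %*% b)

mean(dnorm(xb)) * b["age"]                 # average over observations: (1/N) sum f(x_i beta-hat) beta_age
dnorm(sum(colMeans(X) * b)) * b["age"]     # at the means: f(x-bar beta-hat) beta_age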
Interpretation
Alternatively, we can use a unit change that is centered around x̄_l,
ΔPr(y = 1 | x)/Δx_l = Pr(y = 1 | x, x̄_l + 1/2) − Pr(y = 1 | x, x̄_l − 1/2).
(ii) A standard deviation change in x_l. Similar to the change centered around x̄_l, but with 1/2 replaced by s_l/2, where s_l is the standard deviation of x_l:
ΔPr(y = 1 | x)/Δx_l = Pr(y = 1 | x, x̄_l + s_l/2) − Pr(y = 1 | x, x̄_l − s_l/2).
(iii) A change from 0 to 1 (or 1 to 0) for dummy variables. When x_l is a dummy variable, its mean x̄_l is not meaningful, and both x̄_l + 1/2 and x̄_l − 1/2 could fall outside the 0-1 range. Consequently, the preferred measure of discrete change for a dummy variable sets the start value to 0 (1) and the end value to 1 (0):
ΔPr(y = 1 | x)/Δx_l = Pr(y = 1 | x, x_l = 1(0)) − Pr(y = 1 | x, x_l = 0(1)). 36 / 38
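For a dummy variable such as wc, the 0-to-1 discrete change with the other variables held at their means can be computed as follows (a sketch; the object name at_means is ours):

> at_means <- with(labor, data.frame(k5=mean(k5), k618=mean(k618), age=mean(age), hc=mean(hc), lwg=mean(lwg), inc=mean(inc)))
> p_wc1 <- predict(labor_probit, newdata = transform(at_means, wc = 1), type = "response")
> p_wc0 <- predict(labor_probit, newdata = transform(at_means, wc = 0), type = "response")
> p_wc1 - p_wc0    # discrete change in the probability as wc goes from 0 to 1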
Interpretation
The change in probability depends on the values of the other variables. Previously we fixed them at their means, but for dummy variables the mean is not a reasonable value. Another way to handle this is to use a baseline observation whose dummy variables are set to their modal (most common) values and whose continuous variables are set to their means. For example, suppose x_1 and x_2 are continuous and x_3 and x_4 are dummy variables. If for most observations x_3 = 1 and x_4 = 0, then the discrete change in the probability with respect to x_1 for the baseline observation is
ΔPr(y = 1 | x_1, x̄_2, x_3 = 1, x_4 = 0)/Δx_1. 37 / 38
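A sketch of constructing such a baseline in R, with dummies at their modal values and continuous variables at their means (the helper mode_val is ours and assumes the dummies are coded 0/1):

> mode_val <- function(v) as.numeric(names(which.max(table(v))))    # most common value of a 0/1 dummy
> baseline <- with(labor, data.frame(k5=mean(k5), k618=mean(k618), age=mean(age), wc=mode_val(wc), hc=mode_val(hc), lwg=mean(lwg), inc=mean(inc)))
> predict(labor_probit, newdata = baseline, type = "response")      # baseline probability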
Interpretation
2nd Midterm
Date: Tuesday, May 27th, 2014
Time: 9:00 am to 11:30 am
Location: The Conference Room for Lecture
Coverage: Scott Long Chapters 2.6, 3, and 4
Others: Closed book, closed notes. A simple calculator for taking exponents and logarithms. 38 / 38