Week 7: Binary Outcomes (Scott Long Chapter 3 Part 2)

Week 7: Binary Outcomes (Scott Long Chapter 3 Part 2)

Tsun-Feng Chiang, School of Economics, Henan University, Kaifeng, China

April 29, 2014

ML Estimation for Probit and Logit

Suppose there are N observations drawn independently from the population and the choice y is binary. We want to estimate $\beta$ given the data. The likelihood function is the product over observations of the probability $p_i$ of each observed choice:

$$L(\beta \mid y, X) = \prod_{i=1}^{N} p_i \qquad (\text{eq. } 7)$$

where

$$p_i = \begin{cases} \Pr(y_i = 1 \mid x_i) & \text{if } y_i = 1 \text{ is observed} \\ 1 - \Pr(y_i = 1 \mid x_i) & \text{if } y_i = 0 \text{ is observed} \end{cases}$$

Let $N_1$ be the number of observations choosing $y_i = 1$ and $N_2$ the number choosing $y_i = 0$. Then (eq. 7) can be split into two groups according to the choices,

ML Estimation for Probit and Logit

$$L(\beta \mid y, X) = \prod_{y_i = 1} \Pr(y_i = 1 \mid x_i) \prod_{y_i = 0} \left[1 - \Pr(y_i = 1 \mid x_i)\right]$$

From (eq. 4), the equation above can be rewritten as

$$L(\beta \mid y, X) = \prod_{y_i = 1} F(x_i \beta) \prod_{y_i = 0} \left[1 - F(x_i \beta)\right]$$

Take logs to obtain the log likelihood function,

$$\ln L(\beta \mid y, X) = \sum_{y_i = 1} \ln F(x_i \beta) + \sum_{y_i = 0} \ln\left[1 - F(x_i \beta)\right]$$

Unlike the linear regression model of Chapter 2.6, where the ML estimators have closed-form solutions, algebraic maximization of $\ln L(\beta \mid y, X)$ is rarely possible.
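To make this concrete, here is a minimal R sketch of the logit log likelihood, assuming a design matrix X with an intercept column and a 0/1 response vector y (logit_loglik is a hypothetical helper name, not from the slides):

logit_loglik <- function(beta, X, y) {
  p <- plogis(X %*% beta)                  # Pr(y_i = 1 | x_i) = Lambda(x_i beta)
  sum(y * log(p) + (1 - y) * log(1 - p))   # sum of ln F and ln(1 - F) terms
}

Passing this function to optim() with control = list(fnscale = -1) would maximize it numerically, which is exactly the task the numerical methods below address.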

ML Estimation for Probit and Logit

Numerical Methods for ML Estimation

The log likelihood functions are complicated, so numerical methods are usually used to find the maximum likelihood estimators. They start with a guess of the parameter values and iterate to improve on that guess. Assume we are trying to estimate the vector of parameters $\beta$. We begin with an initial guess $\beta_0$, called the start values, and attempt to improve on it by adding a vector $\zeta_0$ of adjustments,

$$\beta_1 = \beta_0 + \zeta_0$$

and continue to update the previous iteration according to

$$\beta_{t+1} = \beta_t + \zeta_t$$

ML Estimation for Probit and Logit

Figure (Train, p. 186): Maximum likelihood estimate, one-parameter example

Iterations stop when the gradient of the log likelihood function is close to 0, or when the estimates no longer change from one step to the next, $\beta_{T+1} = \beta_T$. Then $\beta_T$ is the maximum likelihood estimate.

ML Estimation for Probit and Logit

The problem is how to find a $\zeta_t$ that reduces the number of iterations, that is, reaches $\beta_T$ quickly. It is useful to think of $\zeta_t$ as consisting of two parts:

$$\zeta_t = D_t \gamma_t$$

where

$\gamma_t$ is a gradient vector, defined as $\partial \ln L / \partial \beta_t$, which indicates the direction of the change in the log likelihood for a change in the parameters. In the one-parameter example, when the direction (slope) is positive (negative), $\beta_t$ is increased (decreased) in the next step.

$D_t$ is a direction matrix that reflects the curvature of the log likelihood function; that is, it indicates how rapidly the gradient is changing. In the one-parameter example, when the slope is changing rapidly (slowly), the next step for $\beta_t$ is smaller (larger).

ML Estimation for Probit and Logit

Figure (Train, p. 188): Gradient vector $\gamma_t$, one-parameter example

ML Estimation for Probit and Logit

Figure (Train, p. 188): Direction matrix $D_t$, one-parameter example

ML Estimation for Probit and Logit

Different numerical methods use different direction matrices; the following are the most commonly used.

The Newton-Raphson method:
$$D_t = -\left( \frac{\partial^2 \ln L}{\partial \beta_t \, \partial \beta_t'} \right)^{-1}$$

The method of scoring:
$$D_t = -\left( E\left[ \frac{\partial^2 \ln L}{\partial \beta_t \, \partial \beta_t'} \right] \right)^{-1}$$

The BHHH method:
$$D_t = \left( \sum_{i=1}^{N} \frac{\partial \ln L_i}{\partial \beta_t} \frac{\partial \ln L_i}{\partial \beta_t'} \right)^{-1}$$
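As an illustration of the Newton-Raphson step, here is a sketch in R for the logit model, where the gradient is $X'(y - p)$ and the second-derivative matrix is $-X'WX$ with $W = \mathrm{diag}(p_i(1 - p_i))$. The names newton_logit, X, and y are assumptions, and this is only a bare-bones version of what glm does internally via iteratively reweighted least squares:

newton_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                      # start values beta_0
  for (t in seq_len(max_iter)) {
    p    <- as.vector(plogis(X %*% beta))      # Pr(y_i = 1 | x_i)
    grad <- crossprod(X, y - p)                # gamma_t = d lnL / d beta
    H    <- -crossprod(X, X * (p * (1 - p)))   # second-derivative matrix
    zeta <- solve(-H, grad)                    # zeta_t = D_t gamma_t
    beta <- as.vector(beta + zeta)             # beta_{t+1} = beta_t + zeta_t
    if (max(abs(zeta)) < tol) break            # stop when steps are ~ 0
  }
  beta
}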

ML Estimation for Probit and Logit

Since $\hat\beta$ is obtained numerically, the covariance matrix must also be estimated using numerical methods. Let $\hat\beta$ be the maximum likelihood estimates; each numerical method has a corresponding estimator of the covariance matrix.

The Newton-Raphson method:
$$\widehat{\mathrm{Var}}(\hat\beta) = -\left( \sum_{i=1}^{N} \frac{\partial^2 \ln L_i}{\partial \hat\beta \, \partial \hat\beta'} \right)^{-1}$$

The method of scoring:
$$\widehat{\mathrm{Var}}(\hat\beta) = -\left( E\left[ \frac{\partial^2 \ln L}{\partial \hat\beta \, \partial \hat\beta'} \right] \right)^{-1}$$

The BHHH method:
$$\widehat{\mathrm{Var}}(\hat\beta) = \left( \sum_{i=1}^{N} \frac{\partial \ln L_i}{\partial \hat\beta} \frac{\partial \ln L_i}{\partial \hat\beta'} \right)^{-1}$$
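In R, once the glm objects on the next slides are fitted, vcov() returns an estimate of this type (for the logit, the inverse of the observed information), and the standard errors printed by summary() are the square roots of its diagonal. A small sketch, assuming the labor_logit object defined below:

vcov(labor_logit)                # estimated covariance matrix of beta-hat
sqrt(diag(vcov(labor_logit)))    # standard errors reported by summary()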

ML Estimation for Probit and Logit

Problems with Numerical Methods

ML estimation by numerical methods can run into several problems:

- $\hat\beta$ cannot be found after many iterations: a flat log likelihood function.
- Wrong estimates are obtained: a local maximum.
- ML estimates do not exist: no variation in an independent variable for one of the outcomes.

These problems can arise for the following reasons:

- Number of observations. A small sample might explain why the model does not converge.
- Scaling of variables. When the standard deviation of a variable is very large or small relative to the other variables, the ML estimates may fail to be found.
- Distribution of outcomes. If there are few observations in one outcome, convergence may be difficult.

ML Estimation for Probit and Logit

Figure: Logit Analysis of Labor Force Participation

ML Estimation for Probit and Logit

Figure: Probit Analysis of Labor Force Participation

ML Estimation for Probit and Logit

My R code: to run the probit and logit, use the command glm.

> labor = read.csv(file.choose(), header = TRUE)

Logit:

> labor_logit <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "logit"), data = labor)
> summary(labor_logit)

Probit:

> labor_probit <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "probit"), data = labor)
> summary(labor_probit)
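A quick follow-up comparison, assuming both objects above were fitted successfully: logit coefficients are typically about 1.6 to 1.8 times the probit coefficients, because the standard logistic distribution has a larger variance ($\pi^2/3$) than the standard normal (1):

coef(labor_logit) / coef(labor_probit)   # ratios, roughly 1.6-1.8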

The Probability Curve and Parameters

Consider a model with a single x,

$$\Pr(y = 1 \mid x) = F(\alpha + \beta x)$$

A change in $\alpha$ shifts the probability curve (or cdf) in parallel. When $\beta$ is positive, a smaller intercept shifts the curve to the right and a larger intercept shifts it to the left. (Visual intuition: to reach a given probability, say 0.5, a smaller (larger) x is needed when $\alpha$ is larger (smaller).)

A change in $\beta$ changes the slope of the probability curve (or cdf). The larger the $\beta$, the steeper the slope. (Visual intuition: to achieve a given change in probability, less (more) change in x is needed when the slope is larger (smaller).)
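A small R sketch of these two effects, using a probit curve and hypothetical parameter values (Figure 3.8 below shows the same idea):

x <- seq(-4, 4, length.out = 200)
plot(x, pnorm(0 + 1 * x), type = "l", ylab = "Pr(y = 1 | x)")  # alpha = 0, beta = 1
lines(x, pnorm(1 + 1 * x), lty = 2)   # larger alpha: parallel shift to the left
lines(x, pnorm(0 + 2 * x), lty = 3)   # larger beta: steeper slope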

The Probability Curve and Parameters

Figure 3.8 A: Effects of Changing $\alpha$

The Probability Curve and Parameters

Figure 3.8 B: Effects of Changing $\beta$

The Probability Curve and Parameters

Now consider a model with one more independent variable z,

$$\Pr(y = 1 \mid x, z) = F(\alpha + \beta_1 x + \beta_2 z)$$

If we assign a value $z^*$ to z, the model above can be written as

$$\Pr(y = 1 \mid x, z = z^*) = F(\alpha + \beta_1 x + \beta_2 z^*) = F\left((\alpha + \beta_2 z^*) + \beta_1 x\right)$$

The term $\beta_2 z^*$ becomes part of the intercept. Therefore, when the value of z changes, the probability curve shifts in parallel with respect to x. This means the effect of a variable on the probability depends on the values of the other variables.
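A sketch of the parallel shift in R, with hypothetical values $\alpha = -1$, $\beta_1 = 1$, $\beta_2 = 0.5$ and a probit F:

alpha <- -1; beta1 <- 1; beta2 <- 0.5
x <- seq(-4, 4, length.out = 200)
plot(x, pnorm((alpha + beta2 * 0) + beta1 * x), type = "l",
     ylab = "Pr(y = 1 | x, z)")                            # z = 0
lines(x, pnorm((alpha + beta2 * 2) + beta1 * x), lty = 2)  # z = 2: parallel shift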

The Probability Curve and Parameters

Figure 3.9: How z Affects the Effect of x

The Probability Curve and Parameters

Figure 3.9: Values of z Create Parallel Curves with Respect to x

Interpretation

Predicted Probabilities

To interpret the estimated results of the logit and probit models, probabilities are the fundamental statistic:

Probit:
$$\Pr(y = 1 \mid x) = \Phi(x\hat\beta) = \int_{-\infty}^{x\hat\beta} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt$$

Logit:
$$\Pr(y = 1 \mid x) = \Lambda(x\hat\beta) = \frac{\exp(x\hat\beta)}{1 + \exp(x\hat\beta)}$$

Since the model is nonlinear, no single method of interpretation can fully describe the relationship between a variable and the outcome. What to interpret, and how, depends on your research purpose or questions.
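These formulas can be checked by hand in R; a sketch, assuming the fitted labor_probit and labor_logit objects from earlier:

xb_probit <- model.matrix(labor_probit) %*% coef(labor_probit)   # x beta-hat
xb_logit  <- model.matrix(labor_logit)  %*% coef(labor_logit)
p_probit  <- pnorm(xb_probit)     # Phi(x beta-hat)
p_logit   <- plogis(xb_logit)     # exp(x beta-hat) / (1 + exp(x beta-hat))
all.equal(as.vector(p_probit), unname(fitted(labor_probit)))     # should be TRUE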

Interpretation

The range of probabilities

The minimum and maximum probabilities in the sample are defined as

$$\min \Pr(y = 1 \mid x) = \min_i F(x_i \hat\beta), \qquad \max \Pr(y = 1 \mid x) = \max_i F(x_i \hat\beta)$$

Listing the largest and smallest predicted probabilities can suggest which variables are important. However, the range is easily affected by extreme values of x.

The effect of each variable on the predicted probabilities

We can also see how the probability changes when a variable changes, but this requires controlling for the values of the other variables, usually by fixing them at their means. For example, we can compute the predicted change in the probability as $x_k$ changes from its minimum to its maximum, i.e.,

$$\Pr(y = 1 \mid \bar{x}, \max x_k) - \Pr(y = 1 \mid \bar{x}, \min x_k)$$
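In R, the sample range of predicted probabilities is one line; a sketch using the fitted probit:

range(fitted(labor_probit))   # min_i and max_i of F(x_i beta-hat)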

Interpretation

Figure: Probability for Maximum k5 (Table 3.4)

As an example in R, let k5 be $x_k$. Create a new data frame where k5 is at its maximum from the original data, using the command with:

> labor_newdata_k5max <- with(labor, data.frame(k5=max(k5), k618=mean(k618), age=mean(age), wc=mean(wc), hc=mean(hc), lwg=mean(lwg), inc=mean(inc)))
> labor_newdata_k5max

Predict the probability using the command predict:

> labor_newdata_k5max$k5maxprob <- predict(labor_probit, newdata=labor_newdata_k5max, type="response")
> labor_newdata_k5max

Interpretation

Figure: Probability for Minimum k5 (Table 3.4)

Similarly, we can create a data frame where k5 is at its minimum and the other variables are at their means, and calculate the predicted probability. Then

$$\Pr(y = 1 \mid \bar{x}, \max k5) - \Pr(y = 1 \mid \bar{x}, \min k5) = 0.0132 - 0.6573 = -0.6441$$
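A sketch of that calculation in R, assuming a labor_newdata_k5min data frame built the same way as labor_newdata_k5max but with k5 = min(k5):

p_max <- predict(labor_probit, newdata = labor_newdata_k5max, type = "response")
p_min <- predict(labor_probit, newdata = labor_newdata_k5min, type = "response")
p_max - p_min   # about 0.0132 - 0.6573 = -0.6441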

Interpretation

Probabilities Over the Range of a Variable

When there is more than one variable, we can compare the effects of two variables while the remaining variables are held constant; however, only one variable can change at a time. For example, to examine the effects of $x_j$ and $x_l$ while the others are at their means, first fix $x_l$ at a value $\bar{x}_l$ and allow $x_j$ to move over a range. The predicted probability for the probit is

$$\Pr(y = 1 \mid \bar{x}, x_l = \bar{x}_l, x_j) = \Phi(\alpha + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \cdots + \beta_l \bar{x}_l + \cdots + \beta_j x_j + \cdots + \beta_k \bar{x}_k)$$

This gives probabilities over some range of $x_j$ when $x_l = \bar{x}_l$. Second, set $x_l$ to $\dot{x}_l$ and again allow $x_j$ to move over the same range. The predicted probability is

$$\Pr(y = 1 \mid \bar{x}, x_l = \dot{x}_l, x_j) = \Phi(\alpha + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \cdots + \beta_l \dot{x}_l + \cdots + \beta_j x_j + \cdots + \beta_k \bar{x}_k)$$

We can then compare how the probability changes with $x_j$ under the two different values of $x_l$.

Interpretation

Figure: Probabilities Over the Range of Age (Figure 3.10)

Create a data frame where age ranges from 30 to 60 in steps of 5, and the wife's college attendance (wc) is 0:

> labor_newdata2 <- with(labor, data.frame(k5=mean(k5), k618=mean(k618), wc=0, hc=mean(hc), lwg=mean(lwg), inc=mean(inc), age=seq(from=30, to=60, length.out=7)))
> labor_newdata2

Interpretation

Figure: Probabilities Over the Range of Age (Figure 3.10, continued)

Predict the probabilities over the range of age:

> labor_newdata2$ageprob <- predict(labor_probit, newdata=labor_newdata2, type="response")
> labor_newdata2

Interpretation

Figure 3.10: Probabilities Over the Range of Age for Two Levels of Wife's Education
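A sketch reproducing Figure 3.10 in R: repeat the prediction with wc = 1 and plot both age profiles (labor_newdata3 is a hypothetical name):

labor_newdata3 <- transform(labor_newdata2, wc = 1)               # college = 1
labor_newdata3$ageprob <- predict(labor_probit, newdata = labor_newdata3,
                                  type = "response")
plot(labor_newdata2$age, labor_newdata2$ageprob, type = "b", ylim = c(0, 1),
     xlab = "Age", ylab = "Pr(lfp = 1)")                          # wc = 0
lines(labor_newdata3$age, labor_newdata3$ageprob, type = "b", lty = 2)  # wc = 1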

Interpretation

Changes in Probabilities: Marginal Effects

How should we summarize the effect of an independent variable? Because the scale of y is arbitrary, $\beta$ cannot be interpreted directly. The marginal effects of x on the probabilities are a better summary. The marginal effect is the change in $\Pr(y = 1 \mid x)$ for a change of $\delta$ in $x_l$, holding all other x at specific values. There are two kinds of marginal effects:

- Marginal change: an infinitely small change in $x_l$, i.e., $\delta \to 0$.
- Discrete change: a finite change in $x_l$.

The two measures agree when the probability curve is linear.

Interpretation

Figure 3.13: Marginal Change ($\delta$ infinitely small) vs. Discrete Change ($\delta = 1$)

Interpretation

Marginal Change

Let F be the cdf and f the pdf of a distribution. The marginal change of a variable $x_l$ on the probability is derived as

$$\frac{\partial \Pr(y = 1 \mid x)}{\partial x_l} = \frac{\partial F(x\beta)}{\partial x_l} = \frac{dF(x\beta)}{d(x\beta)} \frac{\partial (x\beta)}{\partial x_l} = f(x\beta)\,\beta_l \qquad (\text{eq. } 8)$$

For the probit model,

$$\frac{\partial \Pr(y = 1 \mid x)}{\partial x_l} = \phi(x\beta)\,\beta_l$$

and for the logit model,

$$\frac{\partial \Pr(y = 1 \mid x)}{\partial x_l} = \lambda(x\beta)\,\beta_l = \frac{\exp(x\beta)}{\left[1 + \exp(x\beta)\right]^2}\,\beta_l = \Pr(y = 1 \mid x)\left[1 - \Pr(y = 1 \mid x)\right]\beta_l \qquad (\text{eq. } 9)$$
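A sketch of (eq. 9) in R: each observation's logit marginal effect for a variable, here inc, is $p_i(1 - p_i)\hat\beta_{inc}$:

p <- fitted(labor_logit)                           # Pr(y_i = 1 | x_i)
me_inc <- p * (1 - p) * coef(labor_logit)["inc"]   # eq. 9, per observation
summary(me_inc)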

Interpretation

The marginal effect is the slope of the probability curve relating $x_l$ to $\Pr(y = 1 \mid x)$, holding all other variables constant. The sign of the marginal effect is determined by $\beta_l$, since $f(x\beta)$ is always positive. The magnitude of the change depends on the magnitude of $\beta_l$ and the value of $x\beta$.

Assume $\beta_l$ is positive; from (eq. 9), the effect of $x_l$ on the probability is positive. But when we consider another variable $x_j$, the situation is more complicated (though tractable). Taking the derivative of (eq. 8) with respect to both $x_l$ and $x_j$ and using the logit cdf,

$$\frac{\partial^2 \Pr(y = 1 \mid x)}{\partial x_l \, \partial x_j} = \beta_l \beta_j \Pr(y = 1 \mid x)\left[1 - \Pr(y = 1 \mid x)\right]\left[1 - 2\Pr(y = 1 \mid x)\right]$$

(How to derive it? See the sketch below.) Assume $\beta_j$ is also positive. When $\Pr(y = 1 \mid x) < 0.5$, an increase in $x_j$ makes the slope of the probability with respect to $x_l$ increase; when $\Pr(y = 1 \mid x) > 0.5$, an increase in $x_j$ makes that slope decrease (see Figure 3.9).
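A sketch of the derivation asked for above: write $P = \Pr(y = 1 \mid x)$, start from (eq. 9), and apply the chain rule with $\partial P / \partial x_j = P(1 - P)\beta_j$:

$$\frac{\partial^2 P}{\partial x_l \, \partial x_j} = \frac{\partial}{\partial x_j}\left[P(1 - P)\beta_l\right] = (1 - 2P)\,\frac{\partial P}{\partial x_j}\,\beta_l = \beta_l \beta_j \, P(1 - P)(1 - 2P)$$

The factor $(1 - 2P)$ changes sign at $P = 0.5$, which is exactly the pattern described above.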

Interpretation

Figure 3.12: Marginal Effect in the Binary Response Model ($\beta$ is positive)

Interpretation

Overall, several things affect the size of the marginal effect (this applies to both marginal change and discrete change):

- The parameter associated with the variable of interest, i.e., $\beta_l$ in our previous example of (eq. 8)
- The starting value of the variable of interest, i.e., $x_l$
- The amount of change in $x_l$
- The values and parameters of the other variables

Since the value of the marginal effect depends on the levels of all variables, we must decide which values of the variables to use when computing the effect. One method is to compute the average over all observations:

$$\mathrm{mean}\left(\frac{\partial \Pr(y = 1 \mid x)}{\partial x_l}\right) = \frac{1}{N} \sum_{i=1}^{N} f(x_i \beta)\,\beta_l$$
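A sketch of this average marginal effect in R for the probit, again using inc as the variable of interest:

xb <- model.matrix(labor_probit) %*% coef(labor_probit)
mean(dnorm(xb)) * coef(labor_probit)["inc"]   # (1/N) sum_i phi(x_i beta) * beta_inc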

Interpretation

Another method is to compute the marginal effect at the means of the independent variables,

$$\frac{\partial \Pr(y = 1 \mid \bar{x})}{\partial x_l} = f(\bar{x}\beta)\,\beta_l$$

However, these two methods are limited, primarily because of dummy variables. It is inappropriate to take a derivative or a mean for a dummy variable, so we cannot see how the predicted probability changes when a dummy variable changes. This is why discrete change is introduced.

Discrete Change

(i) A unit change in $x_l$. If a variable $x_l$ increases from a starting value $x_l^0$ to $x_l^0 + 1$, the change in probability is defined as

$$\frac{\Delta \Pr(y = 1 \mid x)}{\Delta x_l} = \Pr(y = 1 \mid x, x_l^0 + 1) - \Pr(y = 1 \mid x, x_l^0)$$

The starting value affects the change in the probability. Usually, $x_l^0 = \bar{x}_l$.

Interpretation

Alternatively, we can use a unit change centered on $\bar{x}_l$,

$$\frac{\Delta \Pr(y = 1 \mid x)}{\Delta x_l} = \Pr\left(y = 1 \mid x, \bar{x}_l + \tfrac{1}{2}\right) - \Pr\left(y = 1 \mid x, \bar{x}_l - \tfrac{1}{2}\right)$$

(ii) A standard deviation change in $x_l$. Similar to the centered unit change, but with $\tfrac{1}{2}$ replaced by $s_l/2$, where $s_l$ is the standard deviation of $x_l$:

$$\frac{\Delta \Pr(y = 1 \mid x)}{\Delta x_l} = \Pr\left(y = 1 \mid x, \bar{x}_l + \tfrac{s_l}{2}\right) - \Pr\left(y = 1 \mid x, \bar{x}_l - \tfrac{s_l}{2}\right)$$

(iii) A change from 0 to 1 (or 1 to 0) for dummy variables. When $x_l$ is a dummy variable, its mean $\bar{x}_l$ is not meaningful, and both $\bar{x}_l + \tfrac{1}{2}$ and $\bar{x}_l - \tfrac{1}{2}$ could fall outside its range. Consequently, the preferred measure of discrete change for a dummy variable sets the starting value to 0 (1) and the ending value to 1 (0):

$$\frac{\Delta \Pr(y = 1 \mid x)}{\Delta x_l} = \Pr(y = 1 \mid x, x_l = 1) - \Pr(y = 1 \mid x, x_l = 0)$$
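A sketch of the dummy-variable discrete change in R for wc (wife attended college), with the other variables at their means (base is a hypothetical name):

base <- with(labor, data.frame(k5 = mean(k5), k618 = mean(k618), age = mean(age),
                               hc = mean(hc), lwg = mean(lwg), inc = mean(inc)))
p1 <- predict(labor_probit, newdata = transform(base, wc = 1), type = "response")
p0 <- predict(labor_probit, newdata = transform(base, wc = 0), type = "response")
p1 - p0   # discrete change for wc: 0 -> 1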

Interpretation

The change in probability depends on the values of the other variables. Previously we fixed them at their means, but for dummy variables the mean is not reasonable. Another way to handle this is to define a baseline observation whose dummy variables are set to their modes and whose continuous variables are set to their means. For example, suppose $x_1$ and $x_2$ are continuous and $x_3$ and $x_4$ are dummy variables. If for most observations $x_3 = 1$ and $x_4 = 0$, then the discrete change in the probability with respect to $x_1$ for the baseline observation is

$$\frac{\Delta \Pr(y = 1 \mid \bar{x}_1, \bar{x}_2, x_3 = 1, x_4 = 0)}{\Delta x_1}$$

Interpretation

2nd Midterm

Date: Tuesday, May 27th, 2014
Time: 9:00 am - 11:30 am
Location: The conference room for lecture
Coverage: Scott Long Chapters 2.6, 3, and 4
Other: Closed book, closed notes. A simple calculator for taking exponents and logarithms.