Lecture 6: Discrete Choice: Qualitative Response


1 Lecture 6: Discrete Choice: Qualitative Response
Instructor:
Department of Economics, Stanford University, 2011

2 Types of Discrete Choice Models

Univariate models. Binary: linear, probit, logit, arctan, etc. Multinomial: logit, nested logit, GEV, probit. Multivariate models.

Binary choice: only the choice is observed. The choice maximizes the latent utility $y_i^*$:
$$y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim F(\cdot),$$
where $F(\cdot)$ is known and symmetric around 0. We observe $y_i = 1(y_i^* \geq 0)$, so
$$P(y_i = 1) = P(\varepsilon_i > -x_i'\beta) = 1 - F(-x_i'\beta) = F(x_i'\beta).$$

3 Choice of $F$

Logit: $F(t) = \Lambda(t) = \frac{\exp(t)}{1+\exp(t)}$. Probit: $F(t) = \Phi(t)$, the standard normal CDF. Arctan: $F(t) = \frac{1}{2} + \frac{1}{\pi}\arctan(t)$, the Cauchy distribution.

Is the choice of $F(\cdot)$ important?

NO. If the truth is $P(y_i = 1) = G(x_i'\beta)$ but you specify $F$, then there exists $H(\cdot)$ such that $P(y_i = 1) = F(H(x_i, \beta))$. Since $H(x_i, \beta)$ can be approximated by $z_i'\beta$ for some transformation $z_i$ of $x_i$, it is fine to run $P(y_i = 1) = F(z_i'\beta)$; the choice of $z_i$ is then the important factor.

YES. In practice the choice of $x_i$ is usually determined by how the coefficients are to be interpreted, and one will not transform it into $z_i$ simply to get a better fit. So the choice of $F(\cdot)$ is in fact important.
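The three link functions above can be sketched in pure Python using only the standard library ($\Phi$ via `math.erf`). All three are symmetric around 0 and equal $\frac{1}{2}$ at $t = 0$; they differ in the tails, with Cauchy heaviest and normal lightest.

```python
import math

def logit_cdf(t):
    # Lambda(t) = exp(t) / (1 + exp(t)), written in a numerically stable form
    return 1.0 / (1.0 + math.exp(-t))

def probit_cdf(t):
    # Phi(t), the standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def arctan_cdf(t):
    # Cauchy CDF: 1/2 + arctan(t)/pi
    return 0.5 + math.atan(t) / math.pi

# symmetry around 0 and tail behavior at a glance
for F in (logit_cdf, probit_cdf, arctan_cdf):
    print(F.__name__, round(F(0.0), 3), round(F(2.0), 4))
```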

4 Maximum Likelihood Estimation

Is the estimate $\hat\beta$ itself important? NO. Compare $\partial F(x'\hat\beta)/\partial x$ across models, usually evaluated at the average of $x$.

MLE:
$$L = \prod_{i=1}^n F(x_i'\beta)^{y_i}\,(1 - F(x_i'\beta))^{1-y_i}$$
$$\log L = \sum_{i=1}^n \left[ y_i \log F(x_i'\beta) + (1-y_i)\log(1 - F(x_i'\beta)) \right]$$
$$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^n \left[ \frac{y_i}{F(x_i'\beta)} - \frac{1-y_i}{1-F(x_i'\beta)} \right] f(x_i'\beta)\, x_i = \sum_{i=1}^n \frac{y_i - F(x_i'\beta)}{F(x_i'\beta)(1-F(x_i'\beta))}\, f(x_i'\beta)\, x_i$$
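A minimal sketch of the logit MLE: for the logit the score simplifies to $\sum_i (y_i - \Lambda(x_i'\beta))\,x_i = 0$, and because $\log L$ is globally concave, even plain gradient ascent finds the unique maximum. The toy data (scalar $x$ plus intercept) are invented for illustration.

```python
import math

def lam(t):
    # logistic CDF
    return 1.0 / (1.0 + math.exp(-t))

# toy sample of (x_i, y_i); overlap between groups keeps the MLE finite
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0), (0.0, 1),
        (0.5, 0), (1.0, 1), (1.5, 1), (2.0, 1), (2.5, 1)]

def score(b0, b1):
    # gradient of log L: sum_i (y_i - Lambda(b0 + b1 x_i)) * (1, x_i)
    g0 = sum(y - lam(b0 + b1 * x) for x, y in data)
    g1 = sum((y - lam(b0 + b1 * x)) * x for x, y in data)
    return g0, g1

# fixed-step gradient ascent; concavity guarantees convergence to the MLE
b0 = b1 = 0.0
for _ in range(20000):
    g0, g1 = score(b0, b1)
    b0 += 0.05 * g0
    b1 += 0.05 * g1
```

At convergence the score is (numerically) zero, which is exactly the first-order condition displayed above.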

5 Then
$$-\frac{1}{n}\frac{\partial^2 \log L(\beta_0)}{\partial \beta\, \partial \beta'} \xrightarrow{p} \lim_n \frac{1}{n}\sum_{i=1}^n \frac{f_i^2}{F_i(1-F_i)}\, x_i x_i'$$
since $E(y_i - F(x_i'\beta) \mid x_i) = 0$, and
$$\frac{1}{\sqrt{n}}\frac{\partial \log L(\beta_0)}{\partial \beta} \xrightarrow{d} N\!\left(0,\ \lim_n \frac{1}{n}\sum_{i=1}^n \frac{f_i^2}{F_i(1-F_i)}\, x_i x_i'\right)$$
since $\mathrm{Var}(y_i \mid x_i) = F_i(1-F_i)$. Therefore
$$\sqrt{n}\,(\hat\beta - \beta_0) = \left[-\frac{1}{n}\frac{\partial^2 \log L(\beta_0)}{\partial \beta\, \partial \beta'}\right]^{-1} \frac{1}{\sqrt{n}}\frac{\partial \log L(\beta_0)}{\partial \beta} \xrightarrow{d} N\!\left(0,\ \left[\lim_n \frac{1}{n}\sum_{i=1}^n \frac{f_i^2}{F_i(1-F_i)}\, x_i x_i'\right]^{-1}\right)$$

6 Probit and Logit Case

The two distributions are similar in both location and scale; only the tails of the logit are heavier. The variance of the logistic distribution is $\pi^2/3$. For probit:
$$\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N\!\left(0,\ \left[\lim_n \frac{1}{n}\sum_{i=1}^n \frac{\phi_i^2}{\Phi_i(1-\Phi_i)}\, x_i x_i'\right]^{-1}\right)$$
For logit, $f(t) = \Lambda'(t) = \Lambda(t)(1-\Lambda(t))$, so
$$\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N\!\left(0,\ \left[\lim_n \frac{1}{n}\sum_{i=1}^n \Lambda(x_i'\beta)(1-\Lambda(x_i'\beta))\, x_i x_i'\right]^{-1}\right)$$
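The scale comparison can be checked numerically: rescaling the logistic index by its standard deviation $\pi/\sqrt{3}$ brings $\Lambda$ close to $\Phi$, and the remaining gap (a couple of percentage points at most) is exactly the heavier-tail effect noted above.

```python
import math

def lam(t):
    return 1.0 / (1.0 + math.exp(-t))

def phi_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# The logistic distribution has variance pi^2/3; rescale it to unit
# variance and compare with the standard normal CDF over a grid.
scale = math.sqrt(3.0) / math.pi          # 1 / (logistic standard deviation)
grid = [i / 100.0 for i in range(-500, 501)]
max_diff = max(abs(lam(t) - phi_cdf(scale * t)) for t in grid)
print(round(max_diff, 4))  # small but not zero: the logit tails are heavier
```

In practice the rough conversion factor 1.6 (rather than $\pi/\sqrt{3} \approx 1.81$) is often used to compare logit and probit coefficients, because it matches the two CDFs better near the center.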

7 Nonlinear Least Squares

Alternatively,
$$y_i = F(x_i'\beta) + \epsilon_i, \qquad E(\epsilon_i \mid x_i) = 0, \qquad V(\epsilon_i \mid x_i) = F(x_i'\beta)(1 - F(x_i'\beta)).$$
Minimize $\sum_{i=1}^n (y_i - F(x_i'\beta))^2$ to get $\tilde\beta$. $\tilde\beta$ is not efficient because of heteroskedasticity, so do weighted least squares:
$$\hat\beta = \arg\min_\beta \sum_{i=1}^n \left[ F(x_i'\tilde\beta)\,(1 - F(x_i'\tilde\beta)) \right]^{-1} \left( y_i - F(x_i'\beta) \right)^2$$
But the usual choice for probit and logit is MLE: the least squares problem is also nonlinear, and $\log L$ for probit and logit is concave, so any local max is a global max.

8 Discriminant Analysis (DA)

The DA method specifies the density of $x$ given $y = 1$ or $y = 0$, and predicts whether $x$ belongs to group $y = 1$ or $y = 0$ by checking whether $P(y \mid x) > \frac{1}{2}$ or not.

Normal discriminant analysis: assume $x \mid (y=1) \sim N(\mu_1, \Sigma)$, $x \mid (y=0) \sim N(\mu_0, \Sigma)$, $P(y=1) = q_1$, $P(y=0) = q_0$, then use Bayes' rule:
$$P(y = 1 \mid x) = \frac{q_1\, \phi(x; \mu_1, \Sigma)}{q_1\, \phi(x; \mu_1, \Sigma) + q_0\, \phi(x; \mu_0, \Sigma)} = \frac{1}{1 + \frac{q_0\, \phi(x; \mu_0, \Sigma)}{q_1\, \phi(x; \mu_1, \Sigma)}} = \frac{1}{1 + \exp\!\left( -(x'\beta + C) \right)}$$
for $\beta = \Sigma^{-1}(\mu_1 - \mu_0)$ and $C = \ln\frac{q_1}{q_0} - \frac{1}{2}\mu_1'\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_0'\Sigma^{-1}\mu_0$.

9 You can estimate $\hat q_0, \hat q_1, \hat\mu_1, \hat\mu_0, \hat\Sigma$ from sample moments, then $\hat\beta = \hat\Sigma^{-1}(\hat\mu_1 - \hat\mu_0)$. If you only care about $P(y = 1 \mid x)$, you can instead estimate $\beta$ and $C$ using a logit model. Both are consistent; the sample-moment estimator is more efficient.

If the true model is a logit model, and NOT a DA model, then you have to estimate it using the logit likelihood; otherwise you will get inconsistent estimates.

If the true model is DA, then it can be shown that running the linear regression $y = \hat a + \hat b' x$ gives $\hat b = \lambda \hat\beta = \lambda \hat\Sigma^{-1}(\hat\mu_1 - \hat\mu_0)$ for some proportionality constant $\lambda$.
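A sketch of the moment-based DA slope in two dimensions, with the $2 \times 2$ inverse written out by hand. The group means and common covariance are made-up numbers standing in for the sample moments $\hat\mu_1, \hat\mu_0, \hat\Sigma$.

```python
# Normal discriminant analysis: beta = Sigma^{-1} (mu1 - mu0).
# Illustrative (invented) moments for a two-dimensional x.
mu1 = (1.0, 0.5)
mu0 = (-0.5, 0.0)
sigma = ((2.0, 0.5),
         (0.5, 1.0))

# 2x2 inverse by the adjugate formula
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
inv = ((sigma[1][1] / det, -sigma[0][1] / det),
       (-sigma[1][0] / det, sigma[0][0] / det))

d = (mu1[0] - mu0[0], mu1[1] - mu0[1])          # mu1 - mu0
beta = (inv[0][0] * d[0] + inv[0][1] * d[1],
        inv[1][0] * d[0] + inv[1][1] * d[1])     # Sigma^{-1} (mu1 - mu0)
```

A quick sanity check is that $\Sigma \beta$ recovers $\mu_1 - \mu_0$.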

10 Grouped Data, min $\chi^2$

There is one exception where you cannot use MLE: when only grouped data, not individual data, are available. Example: you observe the proportion $\hat P_t$ of people voting yes on some bill, but no individual voting records.

Let $P_t = F(x_t'\beta)$; then $\hat P_t = P_t + \epsilon_t$ with $E(\epsilon_t) = 0$ and $\mathrm{Var}(\epsilon_t) = P_t(1 - P_t)/n_t$. Note that $F^{-1}(P_t) = x_t'\beta$, so do a Taylor expansion:
$$F^{-1}(\hat P_t) \approx F^{-1}(P_t) + \frac{\partial F^{-1}(P_t)}{\partial P_t}\,(\hat P_t - P_t) = x_t'\beta + v_t$$
with $E(v_t) = 0$ and $\mathrm{Var}(v_t) = \dfrac{P_t(1 - P_t)}{n_t\, f^2(F^{-1}(P_t))}$.

11 Then estimate $\mathrm{Var}(v_t)$ by plugging $\hat P_t$ in for $P_t$, and run WLS on this Taylor expansion:
$$\hat\beta = \arg\min_\beta \sum_{t=1}^T \frac{n_t\, f^2(F^{-1}(\hat P_t))}{\hat P_t (1 - \hat P_t)} \left( F^{-1}(\hat P_t) - x_t'\beta \right)^2$$
Alternatively, run nonlinear weighted least squares without inverting $F$:
$$\tilde\beta = \arg\min_\beta \sum_{t=1}^T \frac{n_t}{\hat P_t (1 - \hat P_t)} \left( \hat P_t - F(x_t'\beta) \right)^2$$
But $\hat\beta$ is preferable because it solves a linear rather than a nonlinear problem.
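For the logit, $f(F^{-1}(p)) = p(1-p)$, so $\mathrm{Var}(v_t) = 1/(n_t P_t(1-P_t))$ and the WLS above reduces to Berkson's minimum chi-square: regress the empirical log-odds on $x_t$ with weights $n_t \hat P_t (1 - \hat P_t)$. A sketch with invented grouped data (scalar $x_t$ plus intercept, closed-form WLS):

```python
import math

# Toy grouped data: (x_t, n_t, p_hat_t) for T = 4 groups; numbers are illustrative.
groups = [(-1.0, 50, 0.20), (0.0, 60, 0.45), (1.0, 50, 0.70), (2.0, 40, 0.90)]

rows = []
for x, n, p in groups:
    w = n * p * (1.0 - p)            # inverse of Var(v_t) under the logit
    z = math.log(p / (1.0 - p))      # F^{-1}(p_hat): the empirical log-odds
    rows.append((x, w, z))

# weighted least squares of z on (1, x), in closed form
sw = sum(w for _, w, _ in rows)
xbar = sum(w * x for x, w, _ in rows) / sw
zbar = sum(w * z for _, w, z in rows) / sw
b1 = sum(w * (x - xbar) * (z - zbar) for x, w, z in rows) / \
     sum(w * (x - xbar) ** 2 for x, w, _ in rows)
b0 = zbar - b1 * xbar
```

Because the transformed problem is linear in $\beta$, no iteration is needed, which is exactly why the slide prefers $\hat\beta$ to the nonlinear alternative.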

12 Multinomial Models

Utility from taking bus, train, or car:
Bus ($y = 1$): $U_{1i} = \mu_{1i} + \epsilon_{1i}$
Train ($y = 2$): $U_{2i} = \mu_{2i} + \epsilon_{2i}$
Car ($y = 3$): $U_{3i} = \mu_{3i} + \epsilon_{3i}$
Then
$$P(y_i = 1) = P(U_{1i} > U_{2i},\ U_{1i} > U_{3i})$$
$$P(y_i = 2) = P(U_{2i} > U_{1i},\ U_{2i} > U_{3i})$$
$$P(y_i = 3) = P(U_{3i} > U_{1i},\ U_{3i} > U_{2i}).$$

13 Multinomial Logit

$\epsilon_{1i}, \epsilon_{2i}, \epsilon_{3i}$ are i.i.d. Type-I extreme value: $F(\epsilon) = e^{-e^{-\epsilon}}$, $f(\epsilon) = e^{-\epsilon} e^{-e^{-\epsilon}}$. The choice probability can be calculated:
$$P(y_i = 2) = P(\epsilon_2 + \mu_2 - \mu_1 > \epsilon_1,\ \epsilon_2 + \mu_2 - \mu_3 > \epsilon_3)$$
$$= \int_{-\infty}^{\infty} f(\epsilon_2)\, P(\epsilon_1 < \epsilon_2 + \mu_2 - \mu_1)\, P(\epsilon_3 < \epsilon_2 + \mu_2 - \mu_3)\, d\epsilon_2$$
$$= \int_{-\infty}^{\infty} e^{-\epsilon_2}\, e^{-e^{-\epsilon_2}\left[ 1 + e^{\mu_1 - \mu_2} + e^{\mu_3 - \mu_2} \right]}\, d\epsilon_2$$
$$= \int_0^{\infty} e^{-\left[ 1 + e^{\mu_1 - \mu_2} + e^{\mu_3 - \mu_2} \right] u}\, du \qquad (u = e^{-\epsilon_2})$$
$$= \frac{1}{1 + e^{\mu_1 - \mu_2} + e^{\mu_3 - \mu_2}} = \frac{e^{\mu_2}}{e^{\mu_1} + e^{\mu_2} + e^{\mu_3}}.$$
Similarly, $P(y_i = j) = \dfrac{e^{\mu_j}}{e^{\mu_1} + e^{\mu_2} + e^{\mu_3}}$. It is easy to show that $P(y = 1 \mid y = 1 \text{ or } y = 2) = \dfrac{e^{\mu_1}}{e^{\mu_1} + e^{\mu_2}}$, etc.
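The closed-form probabilities are just a softmax over the mean utilities. A small sketch with illustrative utilities, which also exhibits the IIA property: the odds of bus versus car are unchanged when train is dropped from the choice set.

```python
import math

# Multinomial logit: P(y = j) = e^{mu_j} / sum_k e^{mu_k}.
# Mean utilities for bus, train, car are invented numbers.
mu = {"bus": 0.5, "train": 1.0, "car": 2.0}

denom = sum(math.exp(v) for v in mu.values())
prob = {j: math.exp(v) / denom for j, v in mu.items()}

# IIA check: odds of bus vs car with and without train in the choice set
odds_with_train = prob["bus"] / prob["car"]
denom2 = math.exp(mu["bus"]) + math.exp(mu["car"])
odds_without_train = (math.exp(mu["bus"]) / denom2) / (math.exp(mu["car"]) / denom2)
```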

14 IIA Property and Nested Logit

Suppose $y = 2$ is not available; then similarly
$$P(y = 3 \mid \text{choice 2 unavailable}) = P(U_3 > U_1) = \frac{e^{\mu_3}}{e^{\mu_1} + e^{\mu_3}}.$$
The IIA (independence of irrelevant alternatives) property: the relative frequency of choosing between 1 (bus) and 3 (car) does not change whether or not choice 2 (train) is present. This appears unreasonable, since bus (1) and train (2) are both public transportation and are plausibly correlated with each other. To allow for correlation between choices 1 and 2, introduce
$$F(\epsilon_1, \epsilon_2, \epsilon_3) = G(\epsilon_1, \epsilon_2)\, F(\epsilon_3) = G(\epsilon_1, \epsilon_2)\, e^{-e^{-\epsilon_3}}$$
with
$$G(\epsilon_1, \epsilon_2) = \exp\!\left( -\left[ e^{-\epsilon_1/\rho} + e^{-\epsilon_2/\rho} \right]^\rho \right), \qquad 0 < \rho \leq 1.$$

15 Since $\epsilon_3$ is independent of $(\epsilon_1, \epsilon_2)$,
$$P(U_1 > U_2) = P(U_1 > U_2 \mid U_1 > U_3 \text{ or } U_2 > U_3) = \frac{P(y = 1)}{P(y = 1) + P(y = 2)}.$$
In fact, both sides can be verified to equal $\dfrac{e^{\mu_1/\rho}}{e^{\mu_1/\rho} + e^{\mu_2/\rho}}$. Also,
$$P(y = 3) = P(U_3 > U_1,\ U_3 > U_2) = \frac{e^{\mu_3}}{e^{\mu_3} + \left( e^{\mu_1/\rho} + e^{\mu_2/\rho} \right)^\rho}.$$
Now the IIA property does not hold, since
$$P(y = 3 \mid \text{choice 2 unavailable}) = P(U_3 > U_1) \neq P(U_3 > U_1 \mid U_3 > U_2 \text{ or } U_1 > U_2) = P(y = 3 \mid y = 1 \text{ or } y = 3).$$
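These formulas can be checked numerically. With $\rho = 1$ the nested logit collapses to the multinomial logit and IIA holds exactly; with $\rho = 0.5$ the conditional share of car among $\{$bus, car$\}$ no longer equals its share when train is removed. The mean utilities are illustrative numbers.

```python
import math

# Nested logit: bus (1) and train (2) share a nest; car (3) stands alone.
# P(y=3) = e^{mu3} / (e^{mu3} + (e^{mu1/rho} + e^{mu2/rho})^rho), 0 < rho <= 1.
mu1, mu2, mu3 = 0.5, 1.0, 2.0

def probs(rho):
    inc = math.exp(mu1 / rho) + math.exp(mu2 / rho)      # inclusive value (exp form)
    p3 = math.exp(mu3) / (math.exp(mu3) + inc ** rho)
    p1 = (math.exp(mu1 / rho) / inc) * (1.0 - p3)        # within-nest share x nest prob
    return p1, p3

def p_car_when_train_removed():
    return math.exp(mu3) / (math.exp(mu1) + math.exp(mu3))

for rho in (1.0, 0.5):
    p1, p3 = probs(rho)
    print(rho, round(p3 / (p1 + p3), 4), round(p_car_when_train_removed(), 4))
```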

16 Estimation Method

A convenient way is to use a two-step procedure to estimate $\beta$ and $\rho$ first, and then use the resulting estimates as the initial value for the MLE iteration. The likelihood function:
$$L = \prod_{i=1}^n P(y = 1)^{y_{1i}}\, P(y = 2)^{y_{2i}}\, P(y = 3)^{y_{3i}} = L_1 L_2$$
for
$$L_1 = \prod_{i=1}^n P(y = 1 \mid y = 1 \text{ or } 2)^{y_{1i}}\, P(y = 2 \mid y = 1 \text{ or } 2)^{y_{2i}}$$
$$L_2 = \prod_{i=1}^n P(y = 1 \text{ or } 2)^{y_{1i} + y_{2i}}\, P(y = 3)^{y_{3i}}$$

17 First step: $L_1$ is the logit likelihood for the subsample that chooses between 1 and 2, and is a function of $\beta/\rho$ only:
$$\log L_1 = \sum_i \left[ y_{1i} \log \frac{e^{x_{1i}'\beta/\rho}}{e^{x_{1i}'\beta/\rho} + e^{x_{2i}'\beta/\rho}} + y_{2i} \log \frac{e^{x_{2i}'\beta/\rho}}{e^{x_{1i}'\beta/\rho} + e^{x_{2i}'\beta/\rho}} \right]$$
Second step: plug the resulting estimate $\widehat{\beta/\rho}$ into $L_2$ and maximize with respect to $\beta$ and $\rho$:
$$\log L_2 = \sum_i \left[ (y_{1i} + y_{2i}) \log \frac{\left( e^{x_{1i}'\widehat{\beta/\rho}} + e^{x_{2i}'\widehat{\beta/\rho}} \right)^\rho}{e^{x_{3i}'\beta} + \left( e^{x_{1i}'\widehat{\beta/\rho}} + e^{x_{2i}'\widehat{\beta/\rho}} \right)^\rho} + y_{3i} \log \frac{e^{x_{3i}'\beta}}{e^{x_{3i}'\beta} + \left( e^{x_{1i}'\widehat{\beta/\rho}} + e^{x_{2i}'\widehat{\beta/\rho}} \right)^\rho} \right]$$
Use these as the initial starting value for maximizing the full likelihood function $L$.

18 In fact, for this simple example there is not much point: it is easier to simply maximize $L$ directly (FIML: full information maximum likelihood) instead of using the sequential procedure. However, if the choice set is large and there are many branches, the two-step procedure is important for finding a good starting value.

Limitations of logit and probit: nested logit suffers from the problem of how to decide on the independence and correlation structure of the choice set. Probit is a way to allow for flexible correlation among different choices, but multinomial probit suffers from the computational burden of integrating a high-dimensional normal density.

19 Distribution-Free Methods

All parametric models rely on specifying a completely known $F(\cdot)$. The aim here is to relax this parametric assumption. There are two types of methods. One allows for heteroskedasticity of $\epsilon$ while maintaining minimal identifying moment conditions: this is Manski's maximum score estimator. The other relaxes all restrictions on $F(\cdot)$ while maintaining conditional homoskedasticity: this is Cosslett (1983) and Klein and Spady (1993), among others.

20 Maximum Score Estimator of Manski

The identifying assumption is that $\mathrm{med}(\epsilon \mid x) = 0$, i.e. $P(\epsilon < 0 \mid x) = \frac{1}{2}$ for all $x$. Maximize the number of correct predictions (called the score):
$$Q_n(\beta) = \frac{1}{n} \sum_{i=1}^n \left[ y_i\, 1(x_i'\beta \geq 0) + (1 - y_i)\, 1(x_i'\beta < 0) \right] \equiv \frac{1}{n} \sum_{i=1}^n q(y_i, x_i, \beta)$$
Alternatively, minimize the number of wrong predictions; since the prediction is wrong iff $|y_i - 1(x_i'\beta \geq 0)| = 1$, this can be written as a LAD problem:
$$\frac{1}{n} \sum_{i=1}^n \left| y_i - 1(x_i'\beta \geq 0) \right|$$
In fact, when $x_i'\beta_0 \geq 0$, $P(y_i = 1) \geq \frac{1}{2}$, so $\mathrm{med}(y_i \mid x_i) = 1$; when $x_i'\beta_0 < 0$, $P(y_i = 0) \geq \frac{1}{2}$, so $\mathrm{med}(y_i \mid x_i) = 0$.
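A sketch of the score objective for a scalar regressor with the slope normalized to one (so only the intercept is searched over a grid); the data are invented. Note that the maximizer is typically a whole interval of intercepts, one symptom of the estimator's nonstandard asymptotics discussed next.

```python
# Maximum score: choose beta = (b0, 1) to maximize the fraction of
# observations with y_i = 1(b0 + x_i >= 0). Toy data for illustration.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 1), (0.0, 0),
        (0.5, 1), (1.0, 1), (1.5, 0), (2.0, 1), (2.5, 1)]

def score(b0):
    # fraction of correct sign predictions at intercept b0
    correct = sum(y == (1 if b0 + x >= 0 else 0) for x, y in data)
    return correct / len(data)

grid = [i / 100.0 for i in range(-300, 301)]
best_b0 = max(grid, key=score)   # first grid point attaining the max score
```

Because the objective is a step function of $\beta$, gradient methods do not apply; grid or combinatorial search is the standard approach.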

21 The asymptotic distribution is very difficult, but there is an intuitive reason for the slow $n^{-1/3}$ rate of convergence. $\hat\beta$ maximizes:
$$Q_n(\beta) - Q_n(\beta_0) = \underbrace{Q(\beta) - Q(\beta_0)}_{A} + \underbrace{Q_n(\beta) - Q_n(\beta_0) - Q(\beta) + Q(\beta_0)}_{B_n}$$
As before, $A$ is approximated by a quadratic function: $A \sim -O\!\left( \|\beta - \beta_0\|^2 \right)$.

22 The convergence rate of $B_n$, which has mean 0, is governed by $\sqrt{\mathrm{Var}(B_n)}$:
$$\mathrm{Var}(B_n) = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}\!\left( q(y_i, x_i, \beta) - q(y_i, x_i, \beta_0) \right) \leq \frac{1}{n}\, E\!\left( q(y_i, x_i, \beta) - q(y_i, x_i, \beta_0) \right)^2$$
$$= \frac{1}{n}\, E\!\left[ y_i\, 1(x_i'\beta \geq 0,\ x_i'\beta_0 < 0) - (1 - y_i)\, 1(x_i'\beta_0 \geq 0,\ x_i'\beta < 0) \right]^2$$
The indicator is nonzero only when $x_i'\beta$ and $x_i'\beta_0$ disagree in sign, an event of probability $O(\|\beta - \beta_0\|)$, so
$$\mathrm{Var}(B_n) \leq O\!\left( \frac{\|\beta - \beta_0\|}{n} \right) \quad \Longrightarrow \quad B_n = O_p\!\left( \sqrt{\frac{\|\beta - \beta_0\|}{n}} \right).$$
In the regular linear case, by contrast, $B_n = O_p\!\left( \dfrac{\|\beta - \beta_0\|}{\sqrt{n}} \right)$.

23 Expect $\hat\beta$ to maximize an objective function of the form
$$A + B_n = -O\!\left( \|\beta - \beta_0\|^2 \right) + O_p\!\left( \sqrt{\frac{\|\beta - \beta_0\|}{n}} \right)$$
Since $A(\beta_0) + B_n(\beta_0) = 0$ and $\hat\beta$ maximizes, $A(\hat\beta) + B_n(\hat\beta) \geq 0$, so that
$$-O\!\left( \|\hat\beta - \beta_0\|^2 \right) + O_p\!\left( \sqrt{\frac{\|\hat\beta - \beta_0\|}{n}} \right) \geq 0$$
$$\Longrightarrow \quad O\!\left( \|\hat\beta - \beta_0\|^{3/2} \right) \leq O_p\!\left( n^{-1/2} \right) \quad \Longrightarrow \quad \|\hat\beta - \beta_0\| = O_p\!\left( n^{-1/3} \right).$$

24 Panel Data Probit

$$y_{it}^* = \beta x_{it} + \alpha_i + \epsilon_{it}, \qquad \epsilon_i \sim N(0, \Sigma) \text{ independent of everything else},$$
$$\alpha_i = \phi + \lambda_1 x_{i1} + \dots + \lambda_T x_{iT} + v_i, \qquad v_i \sim N(0, \sigma_v^2) \text{ independent of everything else}.$$
We observe only the probit part:
$$y_{it} = 1(y_{it}^* \geq 0) = 1(\beta x_{it} + \alpha_i + \epsilon_{it} \geq 0) = 1(\beta x_{it} + \phi + \lambda_1 x_{i1} + \dots + \lambda_T x_{iT} + v_i + \epsilon_{it} \geq 0).$$
To do MLE, write down the likelihood function. Let $\sigma_{tt}^2$ be the $t$th diagonal element of $\Sigma$ and simply take $\phi = 0$:
$$P(y_t = 1 \mid x_1, \dots, x_T) = \Phi\!\left( \frac{\beta x_t + \lambda_1 x_1 + \dots + \lambda_T x_T}{\sqrt{\sigma_{tt}^2 + \sigma_v^2}} \right).$$
But to do MLE you need the JOINT probability $P(y_1 = 1, y_2 = 0, \dots, y_T = 1)$, etc. Use simulation estimation.

25 In contrast to MLE, Chamberlain suggests doing minimum distance by first running a cross-sectional probit for each period $t = 1, \dots, T$:
$$P(y_t = 1 \mid x_1, \dots, x_T) = \Phi(\pi_{t1} x_1 + \dots + \pi_{tT} x_T)$$
and then estimating $\psi_t \equiv (\sigma_{tt}^2 + \sigma_v^2)^{-1/2}$, $\lambda_t$, $t = 1, \dots, T$, and $\beta$ by minimizing the distance (using any weighting matrix you want) between the estimated reduced-form matrix $\hat\Pi = (\hat\pi_{ts})$ and its structural form
$$\Pi = \begin{pmatrix} \psi_1 & & \\ & \ddots & \\ & & \psi_T \end{pmatrix} \left( \beta I_T + \boldsymbol{1}\, (\lambda_1, \dots, \lambda_T) \right),$$
where $\boldsymbol{1}$ is a $T \times 1$ vector of ones.
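The restriction being imposed is elementwise $\pi_{ts} = \psi_t \left( \beta\, 1\{t = s\} + \lambda_s \right)$; building the implied $\Pi$ for illustrative parameter values makes the structure explicit.

```python
# Structural form of the reduced-form probit slope matrix:
# Pi = diag(psi) (beta I_T + 1 lambda'), i.e. pi_ts = psi_t (beta 1{t==s} + lambda_s).
# Parameter values below are invented, for T = 3.
T = 3
beta = 0.8
psi = [1.0, 0.9, 1.1]     # psi_t = (sigma_tt^2 + sigma_v^2)^(-1/2)
lam = [0.2, -0.1, 0.3]    # lambda_s

Pi = [[psi[t] * (beta * (t == s) + lam[s]) for s in range(T)] for t in range(T)]
```

Minimum distance then chooses $(\beta, \psi, \lambda)$ to bring this $\Pi$ as close as possible to the unrestricted estimates $\hat\pi_{ts}$.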

26 In cross-sectional models without the endogenous fixed effect $\alpha_i$, the proper measure for comparing different models is
$$\frac{\partial}{\partial x} P(y_t = 1 \mid x_t = x) \approx P(y_t = 1 \mid x_t = x) - P(y_t = 1 \mid x_t = x')$$
for $x'$ close to $x$. In panel data with a latent fixed effect, this measure is harder to define. Chamberlain suggests that a proper measure is
$$\int \left[ P(y_t = 1 \mid x_t = x, c) - P(y_t = 1 \mid x_t = x', c) \right] f_c(c)\, dc.$$
Under the panel probit assumptions, this can be conveniently estimated.

27 Consider $z \equiv \phi + \lambda_1 x_1 + \dots + \lambda_T x_T$, so that $c = z + v$ and $f(z, v) = f(z) f(v)$. Then
$$\int P(y_t = 1 \mid x_t = x, c)\, f_c(c)\, dc = \iint P(y_t = 1 \mid x_t = x, z, v)\, f_v(v)\, f_z(z)\, dv\, dz = \int P(y_t = 1 \mid x_t = x, z)\, f_z(z)\, dz,$$
where the last step integrates $v$ out against $f_v$. This can be estimated by
$$\int P(y_t = 1 \mid x_t = x, z)\, d\hat F_z(z) = \frac{1}{n} \sum_{i=1}^n \Phi\!\left( \psi_t \left( \beta x + \lambda_1 x_{i1} + \dots + \lambda_T x_{iT} \right) \right)$$
with $\psi_t = (\sigma_{tt}^2 + \sigma_v^2)^{-1/2}$ as before.

28 Conditional Logit

This approach allows you not to make ANY assumption about the latent fixed effect $c$. Assume
$$P(y_t = 1 \mid x_1, \dots, x_T, c) = G(\beta x_t + c) = \frac{e^{\beta x_t + c}}{1 + e^{\beta x_t + c}}$$
and that the $y_t$ are independent across $t$. Then instead of maximizing the likelihood function, maximize the conditional likelihood, conditional on $\sum_t y_t = s$ for $s = 0, \dots, T$:
$$P\!\left( y_t,\ t = 1, \dots, T \,\Big|\, \sum_t y_t = s,\ x_1, \dots, x_T,\ c \right) = \frac{P\!\left( y_t,\ t = 1, \dots, T \text{ s.t. } \sum_t y_t = s \right)}{P\!\left( \sum_t y_t = s \right)} = \frac{P(y_1, \dots, y_T \mid x_1, \dots, x_T, c)}{\sum_{d \in B} P(d \mid x_1, \dots, x_T, c)}$$
where $d = (d_1, \dots, d_T)$ and $B = \{ d : \sum_{t=1}^T d_t = s \}$.

29 By considering $y_t = 1, 0$ separately, you can see that
$$P(y_t \mid x_1, \dots, x_T, c) = \frac{\exp\!\left( (\beta x_t + c)\, y_t \right)}{1 + \exp(\beta x_t + c)},$$
and so
$$P\!\left( y_t,\ t = 1, \dots, T \,\Big|\, \sum_t y_t = s,\ x_1, \dots, x_T,\ c \right) = \frac{\prod_{t=1}^T \frac{\exp((\beta x_t + c) y_t)}{1 + \exp(\beta x_t + c)}}{\sum_{d \in B} \prod_{t=1}^T \frac{\exp((\beta x_t + c) d_t)}{1 + \exp(\beta x_t + c)}} = \frac{\exp\!\left( \beta \sum_{t=1}^T x_t y_t \right)}{\sum_{d \in B} \exp\!\left( \beta \sum_{t=1}^T x_t d_t \right)},$$
which no longer involves $c$. The conditional log likelihood function to be maximized is
$$\log L = \sum_{i=1}^n \left[ \log \exp\!\left( \beta \sum_{t=1}^T x_{it} y_{it} \right) - \log \sum_{d_i \in B_i} \exp\!\left( \beta \sum_{t=1}^T x_{it} d_{it} \right) \right]$$
for $B_i = \{ d_i : \sum_{t=1}^T d_{it} = \sum_{t=1}^T y_{it} \}$.

30 For example, when $T = 3$ and $\sum_{t=1}^T y_{it} = 1$, with $x_i = (x_{i1}, \dots, x_{iT})$:
$$P\!\left( y_{i1} = 1, y_{i2} = 0, y_{i3} = 0 \,\Big|\, x_i, c_i, \textstyle\sum_t y_{it} = 1 \right) = \frac{e^{\beta x_{i1}}}{e^{\beta x_{i1}} + e^{\beta x_{i2}} + e^{\beta x_{i3}}}$$
$$P\!\left( y_{i1} = 0, y_{i2} = 1, y_{i3} = 0 \,\Big|\, x_i, c_i, \textstyle\sum_t y_{it} = 1 \right) = \frac{e^{\beta x_{i2}}}{e^{\beta x_{i1}} + e^{\beta x_{i2}} + e^{\beta x_{i3}}}$$
$$P\!\left( y_{i1} = 0, y_{i2} = 0, y_{i3} = 1 \,\Big|\, x_i, c_i, \textstyle\sum_t y_{it} = 1 \right) = \frac{e^{\beta x_{i3}}}{e^{\beta x_{i1}} + e^{\beta x_{i2}} + e^{\beta x_{i3}}}$$
The undesirable feature is that the independence assumption across $t$ is needed; in contrast, the fixed-effect probit allows for a flexible $\Sigma$. In addition, calculating $\int [P(y_t = 1 \mid x_t = x, c) - P(y_t = 1 \mid x_t = x', c)] f(c)\, dc$ requires a parametric assumption on $f(c)$ anyway.
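A quick numerical check that the conditioning really removes the fixed effect: the three conditional probabilities above are unchanged when a common $c$ is added to every period's index. $\beta$ and the $x$'s are made-up values.

```python
import math

# Conditional logit with T = 3 and sum_t y_t = 1:
# P(y_t = 1 | sum y = 1) = e^{beta x_t} / (e^{beta x_1} + e^{beta x_2} + e^{beta x_3}).
beta = 0.7
x = [1.0, 2.0, 0.5]

def cond_prob(t, c=0.0):
    # the fixed effect c cancels between numerator and denominator
    num = math.exp(beta * x[t] + c)
    den = sum(math.exp(beta * x_s + c) for x_s in x)
    return num / den

probs_c0 = [cond_prob(t, c=0.0) for t in range(3)]
probs_c5 = [cond_prob(t, c=5.0) for t in range(3)]
```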


More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

Introduction to Discrete Choice Models

Introduction to Discrete Choice Models Chapter 7 Introduction to Dcrete Choice Models 7.1 Introduction It has been mentioned that the conventional selection bias model requires estimation of two structural models, namely the selection model

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Discrete Choice Modeling

Discrete Choice Modeling [Part 4] 1/43 Discrete Choice Modeling 0 Introduction 1 Summary 2 Binary Choice 3 Panel Data 4 Bivariate Probit 5 Ordered Choice 6 Count Data 7 Multinomial Choice 8 Nested Logit 9 Heterogeneity 10 Latent

More information

Instructions: Closed book, notes, and no electronic devices. Points (out of 200) in parentheses

Instructions: Closed book, notes, and no electronic devices. Points (out of 200) in parentheses ISQS 5349 Final Spring 2011 Instructions: Closed book, notes, and no electronic devices. Points (out of 200) in parentheses 1. (10) What is the definition of a regression model that we have used throughout

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Ordered Response and Multinomial Logit Estimation

Ordered Response and Multinomial Logit Estimation Ordered Response and Multinomial Logit Estimation Quantitative Microeconomics R. Mora Department of Economics Universidad Carlos III de Madrid Outline Introduction 1 Introduction 2 3 Introduction The Ordered

More information

Goals. PSCI6000 Maximum Likelihood Estimation Multiple Response Model 2. Recap: MNL. Recap: MNL

Goals. PSCI6000 Maximum Likelihood Estimation Multiple Response Model 2. Recap: MNL. Recap: MNL Goals PSCI6000 Maximum Likelihood Estimation Multiple Response Model 2 Tetsuya Matsubayashi University of North Texas November 9, 2010 Learn multiple responses models that do not require the assumption

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Lecture 2: Consistency of M-estimators

Lecture 2: Consistency of M-estimators Lecture 2: Instructor: Deartment of Economics Stanford University Preared by Wenbo Zhou, Renmin University References Takeshi Amemiya, 1985, Advanced Econometrics, Harvard University Press Newey and McFadden,

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00 / 5 Administration Homework on web page, due Feb NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00... administration / 5 STA 44/04 Jan 6,

More information

LECTURE 18: NONLINEAR MODELS

LECTURE 18: NONLINEAR MODELS LECTURE 18: NONLINEAR MODELS The basic point is that smooth nonlinear models look like linear models locally. Models linear in parameters are no problem even if they are nonlinear in variables. For example:

More information

Machine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber

Machine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to

More information

Generalized Method of Moments (GMM) Estimation

Generalized Method of Moments (GMM) Estimation Econometrics 2 Fall 2004 Generalized Method of Moments (GMM) Estimation Heino Bohn Nielsen of29 Outline of the Lecture () Introduction. (2) Moment conditions and methods of moments (MM) estimation. Ordinary

More information

Lecture notes to Chapter 11, Regression with binary dependent variables - probit and logit regression

Lecture notes to Chapter 11, Regression with binary dependent variables - probit and logit regression Lecture notes to Chapter 11, Regression with binary dependent variables - probit and logit regression Tore Schweder October 28, 2011 Outline Examples of binary respons variables Probit and logit - examples

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

Lecture 14 More on structural estimation

Lecture 14 More on structural estimation Lecture 14 More on structural estimation Economics 8379 George Washington University Instructor: Prof. Ben Williams traditional MLE and GMM MLE requires a full specification of a model for the distribution

More information

h=1 exp (X : J h=1 Even the direction of the e ect is not determined by jk. A simpler interpretation of j is given by the odds-ratio

h=1 exp (X : J h=1 Even the direction of the e ect is not determined by jk. A simpler interpretation of j is given by the odds-ratio Multivariate Response Models The response variable is unordered and takes more than two values. The term unordered refers to the fact that response 3 is not more favored than response 2. One choice from

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

New Developments in Econometrics Lecture 16: Quantile Estimation

New Developments in Econometrics Lecture 16: Quantile Estimation New Developments in Econometrics Lecture 16: Quantile Estimation Jeff Wooldridge Cemmap Lectures, UCL, June 2009 1. Review of Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile

More information

1 General problem. 2 Terminalogy. Estimation. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ).

1 General problem. 2 Terminalogy. Estimation. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ). Estimation February 3, 206 Debdeep Pati General problem Model: {P θ : θ Θ}. Observe X P θ, θ Θ unknown. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ). Examples: θ = (µ,

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Cross-fitting and fast remainder rates for semiparametric estimation

Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor guirregabiria SOLUTION TO FINL EXM Monday, pril 14, 2014. From 9:00am-12:00pm (3 hours) INSTRUCTIONS:

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

GLS and FGLS. Econ 671. Purdue University. Justin L. Tobias (Purdue) GLS and FGLS 1 / 22

GLS and FGLS. Econ 671. Purdue University. Justin L. Tobias (Purdue) GLS and FGLS 1 / 22 GLS and FGLS Econ 671 Purdue University Justin L. Tobias (Purdue) GLS and FGLS 1 / 22 In this lecture we continue to discuss properties associated with the GLS estimator. In addition we discuss the practical

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Linear Mixed Models for Longitudinal Data Yan Lu April, 2018, week 15 1 / 38 Data structure t1 t2 tn i 1st subject y 11 y 12 y 1n1 Experimental 2nd subject

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

4.2 Estimation on the boundary of the parameter space

4.2 Estimation on the boundary of the parameter space Chapter 4 Non-standard inference As we mentioned in Chapter the the log-likelihood ratio statistic is useful in the context of statistical testing because typically it is pivotal (does not depend on any

More information