Probabilistic Choice Models

Size: px

Start display at page:

Download "Probabilistic Choice Models"

Mark Watts
6 years ago
Views:

1 Econ 3: James J. Heckman Probabilistic Choice Models This chapter examines different models commonly used to model probabilistic choice, such as eg the choice of one type of transportation from among many choices available to the consumer. Section discusses derivation and limitations of conditional logit models. Section 2 discusses probit models and Section 3 discusses the nested logit (generalized extreme value models), which address some of the limitations of the conditional logit models. The Conditional Logit Model In this section we investigate conditional logit models. We discuss its derivation from a random utility model (Section.2) with Extreme Value Type I distributed shocks. The relevant properties of the Extreme Value Type I distribution are discussed in Section.. We also derive the conditional logit model from the Luce axioms (Section.3). In Section.4, we discuss some of the limitations of the conditional logit models.. TheExtremeValueTypeIDistribution Suppose ε is independent (not necessarily identical) Extreme Value Type I random variable. Then the CDF of ε is: Pr(ε <c)f (c) ( ( (c + α i ))) where α i is a parameter of the Extreme Value Type I CDF. Also, by the assumption of independence, we can write: ny ny F (ε, ε 2,, ε n ) F (ε i ) ( ( (ε i + α i ))) i McFadden (974) mistakenly referred to this as the Weibull distribution, and this misclassification shows up in the literature. i

2 The Extreme Value Type I distribution has two useful features. First, the difference between two Extreme Value Type I random variables is a logit. Second, Extreme Value Type Is are closed under maximization, since (assuming independence): ³ Pr max{ε i } ε i ny Pr(ε i ε) i ny ( ( (ε + α i ))) i Ã Ã Ã nx!! ( (ε + α i )) i ( ( ε))! nx ( α i ) i () P Consider n ( α i ). We can solve for α in the following equation: i which implies: nx ( α i ) ( α) i Ã nx! α log ( α i ). We can then substitute this value of α into equation () to get: ³ Pr max{ε i } ε i i ( ( ( ε)) ( α)) ( ( (ε + α))) which is indeed a Extreme Value Type I random variable. 2

3 .2 Random Utility Model An individual with characteristics s has a choice set x,where x B, B is a feasible set. We write: Pr (x s, B) as the probability that a person of characteristics s chooses x from the feasible set. We also suppose that: U (s, x) v(s, x)+ε(s, x) where ε is independent Extreme Value Type I. From our information on Extreme Value Type Is in section, we know that ε i + v i,(andthusu i ), has an Extreme Value Type I distribution with parameter α i v i,asshown below: F Ui (ε) Pr(ε i + v i < ε) Pr(ε i < ε v i ) ( ( (ε + α i v i ))) Let us now suppose that there are two goods and two corresponding utilities. Consumers govern their choices by the obvious decision rule: choose good one if U >U 2. More generally, if there are n goods, then good j will be selected if U j argmax {U i } n i. Specifically, in our two good case: Pr ( is chosen) Pr(U >U 2 )Pr(ε + v > ε 2 + v 2 ) Imposing that ε is independent Extreme Value Type I, we can be much more precise about this probability: Pr(ε + v > ε 2 + v 2 ) Pr(ε + v v 2 > ε 2 ) Z Z f(ε ) Z ε +v v 2 f(ε 2 )dε 2 dε f(ε )( (ε + v v 2 + α 2 )) dε (2) 3

4 Observe that F (ε )( (ε + α )), which implies: f(ε ) F (ε ) ε ( (ε + α )) ( (ε + α )) (ε + α )(( (ε + α ))) Substituting this into equation(2), gives us: Pr ( ischosen) Z (ε + α )(( (ε + α ))) ( (ε + v v 2 + α 2 )) dε e α Z e ε e [ ( ε )][( α ) (v v 2 +α 2 )] dε ( α ) ( α )+ (v v 2 + α 2 ) he i [ ( ε )][( α ) (v v 2 +α 2 )] ( α ) ( α )+ (v v 2 + α 2 ) (v α ) (v α )+(v 2 α 2 ) This result generalizes, because the max over (n ) choices is still an Extreme Value Type I, so we can make a two stage maximization argument, as follows : 4

5 Pr(ε + v > ε i + v i, where ṽ j v j α j. i, 2,,n) Pr ε + v > max (ε i + v i ) i2,,n (v α ) (v α ) + (v 2 α 2 )+ +(v n α n ) (ṽ ) np (ṽ i ) i This type of model of probabilistic choice is called a conditional or multinomial logit model. The difference between conditional" and multinomial" is simply that in the conditional" logit case, the values of the variables (usually choice characteristics) vary across the choices, while the parameters are common across the choices. In the multinomial" logit case, the values of the variables are common across choices for the same person (usually individual characteristics) but the parameters vary across choices. For eg we have in the linear v i case, the probability of individual j making choice i from among m choices is: Conditional Logit case: P ij (β0 c ij ) mp,wherec ij is the vector of values (β 0 c kj ) k of characteristics of choice i as perceived by individual j. Multinomial Logit case: P ij (α0 i s j) mp,wheres j is a vector of individual (α 0 k s j) k characteristics for individual j. Note that we can easily combine the two cases under one model, as described below: Generalized case: We can combine the conditional and multinomial logit models by generalizing either one of the two types of models. For eg, we could permit the coefficients in the multinomial logit case to depend on choice characteristics, ie have: α i φ i + c 0 ijθ 5

6 Then we get the generalized case, where the probability of choice i by individual j depends on both individual as well as choice characteristics (as well as interaction terms) 2 : P ij (α0 i s j) mp (α 0 k s j) k (φ0 is j + θ 0 c ij s j ) mp (φ 0 ks j + θ 0 c kj s j ) k We could similarly modify the coefficients in the conditional logit case to obtain the generalized version..3 Derivation of Logit from the Luce Axioms We will now show how the conditional logit can be derived from the random utility model discussed in Section.2 and the Luce Axioms presented below..3. Luce Axioms Axiom : Independence of Irrelevant Alternatives(IIA) Suppose that x, y B, s S. Then, Pr (x s, {x, y})pr(y s, B) Pr(y s, {x, y})pr(x s, B) or, we have: Pr (x s, {x, y}) Pr (y s, {x, y}) Pr (x s, B) Pr (y s, B). The term on the left is the odds ratio; the ratio of probabilities of choosing x to y given characteristics s and {x, y}. This axiom has been named Independence of Irrelevant Alternatives for an obvious reason the odds of our choice are not effected by adding additional alternatives. Note that this assumes that the additional choices entering in B affect probability of choosing x in the same manner as they affect the probability of choosing y; implicitly we are assuming that the additional choices have equivalent relationship with choice x and choice y. We will see how this assumption is a limitation, in Section.4 below. 2 Note that by allowing sj (the vector of individual characteristics) to have an intercept column of s, the model would have pure choice characteristic effects, in addition to the interaction terms. 6

7 Axiom 2: Positivity This axiom states that the probability of choosing any one of the choices is strictly greater than zero: Pr (y s, B) > 0 y B.3.2 Derivation of logit With the Luce assumptions set out in the preceeding section, we can now proceed to our derivation of the logit. Define P yx Pr(y s, {x, y}). Then by Axiom above, we know: Pyx P xy Pr (x s, B) Pr(y s, B) (3) Summing over y, we get: Pr (x s, B) X y B Pyx P xy Pr (x s, B) Again using Axiom, for z B: P y B Pyx P xy (4) Pyz P zy Pxz P zx Pr (z s, B) Pr(y s, B) Pr (z s, B) Pr(x s, B) (5a) (5b) Substituting these in equation (3), we get: Pyx P xy Pr (y s, B) Pr (x s, B) Pyz P zy Pxz P zy Pr (z s, B) Pr (z s, B) P yz P zy P xz P zx (6) 7

8 Now, in terms of the random utility model discussed in Section.2, define the mean utility of a person with characteristics s choosing x from set {x, z} as: v(s, x, z) ln P xz P zx P xz P zx (v(s, x, z)) Define a comparable ression for P yz P zy. Replacing this into equation (6) produces: P yx P xy Then from equation (4), we get: Pr (x s, B) P y B ³ (v(s, y, z)) (v(s, x, z)) (v(s, y, z)) (v(s, x, z)) (v(s,x,z)) Py B P y B ( (v(s, y, z))) (v(s, x, z)) ( (v(s, y, z))). Assume additionally, additive separability of v(s, x, z) as follows: v(s, x, z) v(s, x) v(s, z) Note that this is equivalent to assuming irrelevance of the benchmark. From 8

9 this assumption, we get: Pr (x s, B) P y B (v(s, x) v(s, z)) ( (v(s, y) v(s, z))) v(s, x)( v(s, z)) ³ P ( v(s, z)) y B (v(s, y)) v(s, x) Py B (v(s, y)) (7) which gives the multinomial logit. McFadden (974) shows that Luce Axioms and a condition on ε ( Translation Completeness ) produce the Extreme Value Type I (which he mistakenly referred to as the Weibull)..4 Consequences of Independence: Limitations of Logit Models We just showed that: so that: P i P (v i) i (v i) P i P j (v i ) P i (v i) (v j ) P i (v i) (v i) (v j ) (v i v j ) ln Pi P j v i v j A common specification for v i is v i z i β. Thus: Pi ln Pi P j ln (z i z j ) β β P j z j or, changes in characteristics z j have a common effect on the ratio of log probabilities. This allows for estimation of the probabilities of purchasing a new good. (One could obtain an estimate of β from the existing goods. This estimate can then be combined with the characteristics, z new,ofthe new good to estimate the probability of selection, as in equation 7). 9

10 Further, from equation (7): and: Pr (2 {, 2, 3}) Pr (2 {, 2}) ev 2 e v + e v 2 e v 2 e v + e v 2 + e v 3 < Pr (2 {, 2}) This leads us to a restrictive property of the conditional logit model - we have assumed independence of the ε i, when in fact, they may be correlated. This is illustrated by McFadden s famous red bus, blue bus problem: Suppose we are modelling transportation choice and our alternatives consist of {car, bus, train}. If the alternatives are replaced by {car, red bus, blue bus}, then we have violated our assumption of dissimilar alternatives; if U 2 >U, then the event U 3 >U is more likely. One can see by the preceding equation that adding more bus colors continually decreases the probability that car travel is chosen. We can deal with the problem of similar alternatives by using the nested logit model (Nested Logit) or the random coefficient probit model. 2 Probit: Random Coefficients In this section (as in Section.4 above), we make v i a simple linear function of the choice characteristics alone (as discussed in Section.2, we can easily generalize this to include individual characteristics as well as interactions). Then we have, utility from choice i is: U i Z i β + η i where: η i N(0, σ 2 i ), η i Z i, β, η j, i, j. Moreover, β is a random variable, with β ( β, P β), sothat: It follows that: U i Z i β + Zi β β + ηi U U 2 0 (Z Z 2 ) β +(Z Z 2 ) β β +(η η 2 ) 0 U U 3 0 (Z Z 3 ) β +(Z Z 3 ) β β +(η η 3 ) 0. 0

11 Further: Var (U U 2 ) E{[(U U 2 ) E (U U 2 )] 0 [(U U 2 ) E (U U 2 )]} E [(Z Z 2 )(β β)+(η η 2 )] 0 [(Z Z 2 )(β β)+(η η 2 )] ª E (Z Z 2 )(β β)(β β) 0 (Z Z 2 ) 0 +(η η 2 )(η η 2 ) 0ª (Z Z 2 ) X β (Z Z 2 ) 0 + σ 2 + σ 2 2 (since σ 2 0). Similarly: Thus: so: Var (U U 3 )(Z Z 3 ) X β (Z Z 3 ) 0 + σ 2 + σ 2 3 Cov (U U 2,U U 3 )(Z Z 2 ) X β (Z Z 3 ) 0 + σ 2 ρ Corr (U U 2,U U 3 ) (Z Z 2 ) P β (Z Z 3 ) 0 + σ 2 p Var (U U 2 ) Var (U U 3 ) We now seek to derive the probability of choosing good in a three good case: Pr ( {, 2, 3}) Pr(U U 2 0 and U U 3 0). From before, we know that: U U 2 N (Z Z 2 ) β, Var (U U 2 ) U U 3 N (Z Z 3 ) β, Var (U U 3 ). Thus: Pr (U U 2 0 and U U 3 0) Prh pvar (U U 2 )t +(Z Z 2 ) β 0

12 and p Var (U U 3 )t 2 +(Z Z 3 ) β i 0 where t and t 2 are standard normal. Thus, the above equation reduces to: Ã Pr t (Z Z 2 ) p β Var (U U 2 ) and t 2 (Z Z 3 ) p β! Var (U U 3 ) Pr Ã t (Z Z 2 ) β p Var (U U 2 ) and t 2 (Z Z 3 ) p β! Var (U U 3 ) As t and t 2 may be correlated, we integrate over the joint density to get the probability: Z a Z b t2-2ρt t 2 +t 2 2 Pr (choosing ) 2π p -ρ e 2 -ρ 2 2 dt 2 dt where: a (Z Z 2 ) β p Var (U U 2 ), and b (Z Z 3 ) β p Var (U U 3 ) Now consider adding a third good to the two good case, under two alternative scenarios. If the third good has identical characteristics as the first, then Z 2 Z 3. If there is no stochastic component (no utility innovation), then σ 2 σ 2 2 σ Therefore, in this case: Pr ( chosen) Pr(U U 2 0 and U U 3 0) Pr(U U 2 0) Thus, there is no change in the probability of choosing good despite the addition of a third good. Again focusing on the two good case, we 2

13 observe: Pr ( {, 2}) Pr(U U 2 0) Ã Pr t (Z Z 2 ) p β! Var (U U 2 ) 2π Z (Z Z2) β [(Z Z 2 )Σ β (Z Z 2 ) 0 +σ 2 +σ2 2 ]/2 t2 2 dt which can be evaluated to derive the desired probability. Here P we consider a McFadden-Luce type of set up, where one imposes β 0. Defining σ p σ 2 + σ2 2, we observe that the probability of choosing good in the two-good case is: Z (Z Z2) β σ t2 dt 2π 2 Adding a third good to the scene with identical characteristics, (Z 2 Z 3 ), yields the probability for good being purchased as: Z (Z Z 2 ) β σ Ã Z (Z Z 2 ) β σ Ã t 2 2π p ρ 2ρt t 2 + t 2!! 2 dt 2 2 ρ 2 dt 2 One can show that, upon evaluation of these integrals, the probability derived from addition of the third good is less than the probability in the two good case. This leads us to a similar problem as the multinomial/conditional logit - adding alternatives decreases the probability of choice, despite the fact that the alternatives are quite similar. Thus, in the probit case we are able to avoid the limitation of the logit models with regard to addition of an identical good, through the covariance structure of the random coefficients. As illustrated in case 2, probit models without random coefficients suffer from the same limitation. Note that while the richer covariance structure is able to capture the relationship between choices in the probit model, applications involving many choices are practically limited as evaluation of higher-order multivariate normal integrals is difficult (refer discussion in Greene, Section a). 3

14 3 Nested Logit: Generalized Extreme Value (GEV) Model Consider a function G(y,y 2,,y J ), where G satisfies: i. Non-negativity: G (y,y 2,,y J ) 0 (y,y 2,,y J ) 0. ii. Homogeneous of degree : G (αy, αy 2,, αy J )αg(y,y 2,,y J ). iii. Derivative property: k G y y 2 y J 0 if k even 0 if k odd. If G satisfies these conditions, then we get the following probability: P (y i {y,y 2,..., y J }) P i y ig i (y,y 2,,y J ) G (y,y 2,,y J ), where P i is a probability that can be derived from utility maximization. We can use the theorem above to derive a special case of the nested logit model. Define: G ((v ), (v 2 ),, (v J )) σ v3 vj (v ) (v )+ ((v 2 )) σ + +((v J )) σ σ Observe that σ 0is the ordinary logit model. (With G definedinthis way, we are assuming that ε is uncorrelated with all of the other ε j, while the remaining ε i may be correlated. The parameter σ is a kind of measure of correlation between the remaining ε i. It is this correlation structure that would allow the GEV model to tackle the limitation of the ordinary conditional/multinomial logit models. ). This function obviously meets the conditions for the GEV model. For, i. Non-negativity: obvious as 0 < σ < ii. Homogeneity: G (α (v ), α (v 2 ),, α (v J )) 4

15 α (v )+ (α (v 2 )) σ + +(α (v J )) σ σ α (v )+ α σ ((v 2 )) ³ σ + + α σ ( (v J )) σ σ α (v )+α Ã α (v )+ + + αg ((v ), (v 2 ),, (v J )) vj vj + + σ! σ iii. By inspection, one can see that this derivative property will hold. (It is obvious when differentiating with respect to (v. For other derivatives, the fact that 0 < σ < gives the needed alternation in sign. Note that y i in the definition of the property is analogous to (v i here.) Thus, we can now proceed to derive our probabilities. First, consider: Pr ( {, 2}) e v ³ σ e v + e v 2 σ e v e v + e which is simply our binomial logit model. Also note that in the three good case: Ã! σ v3 G 2 ( σ) + σ Ã! σ σ v3 + 5

16 Now suppose that we eliminate choice (by letting v ). Then: σ v3 (v 2 ) + Pr (2 {2, 3}) σ v3 + Observe that: + v3 σ Pr ( {, 2, 3}) e v e v + ³ e v 2 σ + e v3 σ e v σ v 2 e v + e ³ +e v 3 v 2 σ e v e v + e v 2 e v 3 Ã+ e v 2 σ! σ (8) σ Letting σ, and supposing e v 2 >ev 3, we get: e v 3 e v 3 < e v 2 and thus from equation (8), we have: e v 2 σ 0 as σ Pr ( {, 2, 3}) ev e v + e v 2 (9) 6

17 Conversely, if e v 3 >ev 2, just reverse the roles of v 2 and v 3 so: Pr ( {, 2, 3}) e v e v + e v 2 e v 3 e v 2 ev e v + e v 3 (0) Combining equations (9) and (0), we get, as σ : Pr ( {, 2, 3}) e v e v +max{e v 2,e v 3} () Equations (9), (0) & () imply that in this GEV model, the probability of choice on addition of a choice 3 identical to choice 2, does not necessarily fall, as was the case in the ordinary conditional/multinomial logit case. Equations (9) shows that if the added choice 3 is highly correlated to choice 2(σ ) but yields less utility, then the probability in the three choice case reduces to the binomial logit (the probability in the two choice case), with choice 3 dropping out, as one would intuitively ect. What about the probability of choice 2 how does this change when we add an identical choice 3 in this GEV model? To answer this, consider: ( Ã! v Ã!) 2 v σ 3 Ã! σv 2 e v 2 ( σ) + Pr (2 {, 2, 3}) (! Ã v 2 e v + Ã v 2!( Ã v 2 ( Ã v 2 e v + ( Ã v 2 e v + Ã v 3 +! Ã v 3 +! Ã v 3 + Ã σ v 2! Ã v 3 +!) σ!) σ!!) σ!) σ Ã Ã v 2! Ã v 3 + When σ 0, ie when there is no correlation between choice 2 and choice 3, we have ordinary conditional/multinomial logit. Suppose v 2 >v 3 and σ. By appealing to the result derived in equation (), we get:!! σ 7

18 P (2 {, 2, 3}) v + v3. + v3 + + σ v3 σ (2) We know for v 2 >v 3 : e v 3 e v 3 < e v 2 e 0, as σ and thus, from equation (2), we get: (v 2 ) Pr (2 {, 2, 3}) (v )+(v 2 ), as σ (One could derive a similar result be assuming that v 3 >v 2 ). This equation tells us that in the GEV model, if choices 2 and 3 are very similar, if utility from 2 is greater than that from 3, then choice 3 gets disregarded (same as in Equation 9 earlier), which agrees with our intuition. Finally, supposing that v 2 v 3,weget: G e v + + σ e v + 2 (v )+2 σ (v 2 ). Thus: σ 8

19 (v 2 )2 σ Pr (2 {, 2, 3}) (v )+2 σ (v 2 ) v 2 2 σ v +2v 2 σ lim Pr (2 {, 2, 3}) (v 2 ) 2 (v ) + (v 2 ) This final equation tells us if the characteristics are identical in the nested logit model, then the probability, in the three choice case, of choosing one of the two identical choices is equal half the probability of the two choice case, which is again what is intuitively ected. Thus the nested logit (GEV) model is able to avoid the key limitation of the conditional/multinomial logit imposed by the IIA assumption. References [] Maddala, Limited-Dependent and Qualitative Variables in Econometrics,983, chapters 2 & 3. [2] Greene, Econometric Analysis, 2000, chapter 9. 9

Probabilistic Choice Models

Probabilistic Choice Models James J. Heckman University of Chicago Econ 312 This draft, March 29, 2006 This chapter examines dierent models commonly used to model probabilistic choice, such as eg the choice