Probabilistic Choice Models James J. Heckman University of Chicago Econ 312 This draft, March 29, 2006
This chapter examines dierent models commonly used to model probabilistic choice, such as eg the choice of one type of transportation from among many choices available to the consumer. Section 1 discusses derivation and limitations of conditional logit models. Section 2 discusses probit models and Section 3 discusses the nested logit (generalized extreme value models), which address some of the limitations of the conditional logit models. 1 The Conditional Logit Model In this section we investigate conditional logit models. We discuss its derivation from a random utility model (Section 1.2) with Extreme Value Type I distributed shocks. The relevant 1
properties of the Extreme Value Type I distribution are discussed in Section 1.1. We also derive the conditional logit model from the Luce axioms (Section 1.3). In Section 1.4, we discuss some of the limitations of the conditional logit models. 1.1 The Extreme Value Type I Distribution Suppose is independent (not necessarily identical) Extreme ValueTypeIrandomvariable.ThentheCDFof is: 1 Pr( )= () =exp( exp ( ( + ))) 1 McFadden (1974) mistakenly referred to this as the Weibull distribution, and this misclassification shows up in the literature. 2
where is a parameter of the Extreme Value Type I CDF. Also, by the assumption of independence, we can write: ( 1 2 )= Y ( )= Y exp ( exp ( ( + ))) =1 =1 The Extreme Value Type I distribution has two useful features. First, the dierence between two Extreme Value Type I random variables is a logit. Second, Extreme Value Type Is are closed under maximization, since (assuming independence): 3
³ Pr max{ } Y = Pr( ) = =1 Y exp ( exp ( ( + ))) =1 Ã Ã X!! = exp exp ( ( + )) =1 Ã! X = exp exp () exp( ) (1) =1 4
P Consider exp( ) We can solve for in the following equation: =1 which implies: X exp( )=exp() =1 =log à X! exp( ) =1 We can then substitute this value of into equation (1) to get: ³ Pr max{ } = exp( (exp ()) exp()) = exp( exp ( ( + ))) which is indeed a Extreme Value Type I random variable. 5
1.2 Random Utility Model An individual with characteristics has a choice set ;with element, is a feasible set. We write: Pr ( ) as the probability that a person of characteristics chooses from the feasible set. We also suppose that: ( ) =( )+( ) where is independent Extreme Value Type I. From our information on Extreme Value Type Is in section 1, we know that +,(andthus ),hasanextremevaluetypeidistribution 6
with parameter, as shown below: () =Pr( + ) = Pr( ) = exp( exp ( ( + ))) Letusnowsupposethattherearetwogoodsandtwocorresponding utilities. Consumers govern their choices by the obvious decision rule: choose good one if 1 2 More generally, if there are goods, then good will be selected if argmax { } =1. Specifically, in our two good case: Pr (1 is chosen) =Pr( 1 2 )=Pr( 1 + 1 2 + 2 ) 7
Imposing that is independent Extreme Value Type I, we can be much more precise about this probability: Pr ( 1 + 1 2 + 2 ) (2) = Pr( 1 + 1 2 2 ) = = Z Z ( 1 ) μz 1 + 1 2 ( 2 ) 2 1 ( 1 )exp( exp ( 1 + 1 2 + 2 )) 1 Observe that ( 1 )=exp(exp ( 1 + 1 )) which implies: ( 1 ) = ( 1) 1 = exp (exp ( 1 + 1 )) (exp ( 1 + 1 )) = exp ( 1 + 1 )(exp(exp ( 1 + 1 ))) 8
Substituting this into equation(2), gives us: Pr (1 ischosen) = Z exp ( 1 + 1 )(exp( exp ( 1 + 1 ))) = 1 exp ( exp ( 1 + 1 2 + 2 )) 1 Z 1 [ exp( 1 )][exp( 1 )exp ( 1 2 + 2 )] 1 9
= exp( 1 ) = = 1 exp ( 1 )+exp( 1 2 + 2 ) [ exp( 1 )][exp( 1 )exp ( 1 2 + 2 )] exp ( 1 ) exp ( 1 )+exp( 1 2 + 2 ) exp( 1 1 ) exp( 1 1 )+exp( 2 2 ) This result generalizes, because the max over ( 1) choices is still an Extreme Value Type I, so we can make a two stage maximization argument, as follows: 10
Pr ( 1 + 1 + =12 ) μ = Pr 1 + 1 max ( + ) =2 = = exp( 1 1 ) exp( 1 1 )+exp( 2 2 )+ +exp( ) exp( 1 ) P exp ( ) =1 where = This type of model of probabilistic choice is called a conditional or multinomial logit model. The dierence between conditional and multinomial is simply that in the conditional 11
logit case, the values of the variables (usually choice characteristics) vary across the choices, while the parameters are common across the choices. In the multinomial logit case, the values of the variables are common across choices for the same person (usually individual characteristics) but the parameters vary across choices. For e.g. we have in the linear case, the probability of individual making choice from among m choices is: Conditional Logit case: P = exp(0 ) P exp( 0 ) =1,where is the vector of values of characteristics of choice as perceived by individual. 12
Multinomial Logit case: P = exp(0 ) P exp( 0 ) =1,where is a vector of individual characteristics for individual. Note that we can easily combine the two cases under one model, as described below: Generalized case: We can combine the conditional and multinomial logit models by generalizing either one of the two types of models. For eg, we could permit the coecients in the multinomial logit case to depend on choice characteristics, ie have: = + 0 13
Then we get the generalized case, where the probability of choice by individual depends on both individual as well as choice characteristics (as well as interaction terms) 2 : = exp(0 ) P exp( 0 ) =1 = exp(0 + 0 ) P exp( 0 + 0 ) =1 We could similarly modify the coecients in the conditional logit case to obtain the generalized version. 2 Note that by allowing (the vector of individual characteristics) to have an intercept column of 1s, the model would have pure choice characteristic eects, in addition to the interaction terms. 14
1.3 Derivation of Logit from the Luce Axioms We will now show how the conditional logit can be derived from the random utility model discussed in Section 1.2 and theluceaxiomspresentedbelow. 1.3.1 Luce Axioms Axiom 1: Independence of Irrelevant Alternatives(IIA) Suppose that Then, Pr ( { })Pr( ) =Pr( { })Pr( ) 15
or, we have: Pr ( { }) Pr ( { }) = Pr ( ) Pr ( ) The term on the left is the odds ratio; the ratio of probabilities of choosing to given characteristics and { } This axiom has been named Independence of Irrelevant Alternatives for an obvious reason the odds of our choice are not eected by adding additional alternatives. Note that this assumes that the additional choices entering in B aect probability of choosing in the same manner as they aect the probability of choosing ; implicitly we are assuming that the additional choices have equivalent relationship with choice and choice. We will see how this assumption is a limita- 16
tion, in Section 1.4 below. Axiom 2: Positivity This axiom states that the probability of choosing any one of the choices is strictly greater than zero: Pr ( ) 0 1.3.2 Derivation of logit With the Luce assumptions set out in the preceeding section, we can now proceed to our derivation of the logit. Define =Pr( { }). Then by Axiom 1 above, we know: μ Pr ( ) =Pr( ) (3) 17
Summing over, weget: Pr ( ) X μ = 1 = Pr ( ) = P 1 μ (4) Again using Axiom 1, for : μ Pr ( ) =Pr( ) μ Pr ( ) =Pr( ) (5a) (5b) Substituting these in equation (3), we get: 18
μ μ = Pr ( ) Pr ( ) = μ Pr ( ) Pr ( ) = (6) Now, in terms of the random utility model discussed in Section 1.2, define the mean utility of a person with characteristics choosing from set { } as: ( ) ln = =exp(( )) 19
Define a comparable expression for. Replacing this into equation (6) produces: = Then from equation (4), we get: Pr ( ) = = = exp (( )) exp (( )) 1 μ P exp (( )) exp (( )) 1 ³ 1 P exp(()) (exp (( ))) P exp (( )) (exp (( ))) 20
Assume additionally, additive separability of ( ) as follows: ( ) =( ) ( ) Note that this is equivalent to assuming irrelevance of the benchmark. From this assumption, we get: Pr ( ) = = = P exp (( ) ( )) (exp (( ) ( ))) exp ( )exp(( )) ³ P exp (( )) exp (( )) exp ( ) P exp (( )) (7) 21
which gives the multinomial logit. McFadden (1974) shows that Luce Axioms and a condition on ( Translation Completeness ) produce the Extreme Value Type I (which he mistakenly referred to as the Weibull). 1.4 Consequences of Independence: Limitations of Logit Models We just showed that: = exp( ) P exp( ) 22
so that: = exp( ) P exp( ) exp( ) P exp( ) = exp( μ ) exp( ) =exp( ) ln A common specification for is = Thus: μ μ ln ln =( ) = = or, changes in characteristics have a common eect on the ratio of log probabilities. This allows for estimation of the probabilities of purchasing a new good. (One could obtain an 23
estimate of from the existing goods. This estimate can then be combined with the characteristics,, of the new good to estimate the probability of selection, as in equation 7). Further, from equation (7): and: Pr (2 {1 2}) = 2 1 + 2 Pr (2 {1 2 3}) = 2 1 + 2 + 3 Pr (2 {1 2}) This leads us to a restrictive property of the conditional logit model - we have assumed independence of the when in fact, they may be correlated. This is illustrated by McFadden s famous red bus, blue bus problem: Suppose we are modelling 24
transportation choice and our alternatives consist of {car, bus, train}. If the alternatives are replaced by {car, red bus, blue bus}, then we have violated our assumption of dissimilar alternatives; if 2 1 then the event 3 1 is more likely. One can see by the preceding equation that adding more bus colors continually decreases the probability that car travel is chosen. We can deal with the problem of similar alternatives by using the nested logit model (Nested Logit) or the random coecient probit model. 2 Probit: Random Coecients In this section (as in Section 1.4 above), we make asimple linear function of the choice characteristics alone (as discussed 25
in Section 1.2, we can easily generalize this to include individual characteristics as well as interactions). Then we have, utility from choice is: = + where: (0 2 ), Moreover, is a random variable, with ( P ), sothat: It follows that: = + + 1 2 0 ( 1 2 ) +( 1 2 ) +( 1 2 ) 0 1 3 0 ( 1 3 ) +( 1 3 ) +( 1 3 ) 0 26
Further: Var ( 1 2 ) = ( (1 2 ) ( 1 2 ) 0 (1 2 ) ( 1 2 ) ) = [( 1 2 )( )+( 1 2 )] 0 [( 1 2 )( )+( 1 2 )] ª ½ (1 = 2 )( )( ) 0 ( 1 2 ) 0 +( 1 2 )( 1 2 ) 0 ¾ (since 12 =0). Similarly: = ( 1 2 ) X ( 1 2 ) 0 + 2 1 + 2 2 Var ( 1 3 )=( 1 3 ) X ( 1 3 ) 0 + 2 1 + 2 3 27
Thus: so: Cov ( 1 2 1 3 )=( 1 2 ) X ( 1 3 ) 0 + 2 1 = Corr ( 1 2 1 3 )= ( 1 2 ) P ( 1 3 ) 0 + 2 1 p Var (1 2 ) Var ( 1 3 ) We now seek to derive the probability of choosing good 1 in a three good case: Pr (1 {1 2 3}) =Pr( 1 2 0 and 1 3 0) From before, we know that: 1 2 ( 1 2 ) Var ( 1 2 ) 1 3 ( 1 3 ) Var ( 1 3 ) 28
Thus: Pr ( 1 2 0 and 1 3 0) p Var (1 = Pr 2 ) 1 +( 1 2 ) 0 and p Var ( 1 3 ) 2 +( 1 3 ) 0 where 1 and 2 are standard normal. Thus, the above equation reduces to: à Pr 1 ( 1 2 ) p Var (1 2 ) and 2 ( 1 3 ) p! Var (1 3 ) =Pr à 1 ( 1 2 ) p Var (1 2 ) and 2 ( 1 3 ) p! Var (1 3 ) 29
As 1 and 2 may be correlated, we integrate over the joint density to get the probability: = Pr (choosing 1) Z Z 1 2 p 1-2 1 2 1-2 1 2 + 2 2 2 1-2 2 1 where: = ( 1 2 ) p and = ( Var (1 2 ) 1 3 ) p Var (1 3 ) Now consider adding a third good to the two good case, under two alternative scenarios. 30
Case 1: Non-random utility, random coecients. If the third good has identical characteristics as the first, then 2 = 3 If there is no stochastic component (no utility innovation), then 2 1 = 2 2 = 2 3 =0 Therefore, in this case: Pr (1 chosen) = Pr( 1 2 0 and 1 3 0) = Pr( 1 2 0) Thus, there is no change in the probability of choosing good 1 despite the addition of a third good. Again fo- 31
cusing on the two good case, we observe: Pr (1 {12}) = Pr( 1 2 0) Ã = Pr 1 ( 1 2 ) p! Var (1 2 ) = Z 1 2 (12) [( 1 2 ) ( 1 2 ) 0 + 2 1 +2 2 ]12 exp μ 2 1 2 which can be evaluated to derive the desired probability. Case 2: Random utility, non-random coecients. Here we consider a McFadden-Luce type of set up, where one imposes P =0 Defining = p 2 1 + 2 2 we observe 32
that the probability of choosing good 1 in the two-good case is: Z (12) μ μ 1 exp 2 2 2 Adding a third good to the scene with identical characteristics, ( 2 = 3 ), yields the probability for good 1 being purchased as: Z ( 1 2 ) Ã Z ( 1 2 ) Ã μ!! 1 2 p 2 1 exp 1 1 2 1 2 + 2 2 2 2 1 2 1 2 One can show that, upon evaluation of these integrals, the probability derived from addition of the third good is less than the probability in the two good case. This leads us to a similar problem as the multinomial/conditional 33
logit adding alternatives decreases the probability of choice, despite the fact that the alternatives are quite similar. Thus, in the probit case we are able to avoid the limitation of the logit models with regard to addition of an identical good, through the covariance structure of the random coecients. As illustrated in case 2, probit models without random coefficients suer from the same limitation. Note that while the richer covariance structure is able to capture the relationship between choices in the probit model, applications involving many choices are practically limited as evaluation of higherorder multivariate normal integrals is dicult (refer discussion in Greene, Section 19.6.2.a). 34
3 Nested Logit: Generalized Extreme Value (GEV) Model Consider a function ( 1 2 ) where satisfies: i. Non-negativity: ( 1 2 ) 0 ( 1 2 ) 0 ii. Homogeneous of degree 1: ( 1 2 )= ( 1 2 ) iii. Derivative property: 1 2 0 if even 0 if odd. 35
If satisfies these conditions, then we get the following probability: ( { 1 2 }) = ( 1 2 ) ( 1 2 ) where is a probability that can be derived from utility maximization. We can use the theorem above to derive 36
a special case of the nested logit model. Define: (exp( 1 ) exp( 2 ) exp( )) μ μ 2 exp +exp exp( 1 )+ 1 μ + +exp 1 = exp( 1 )+ 3 1 (exp( 2 )) 1 1 + +(exp( )) 1 1 1 1 Observe that =0is the ordinary logit model. (With defined in this way, we are assuming that 1 is uncorrelated with all of the other while the remaining may be correlated. The parameter is a kind of measure of correlation between the remaining. It is this correla- 37
tion structure that would allow the GEV model to tackle the limitation of the ordinary conditional/multinomial logit models. ). This function obviously meets the conditions for the GEV model. For, i. Non-negativity: obvious as 0 1 ii. Homogeneity: ( exp( 1 )exp( 2 ) exp( )) = exp( 1 )+ ( exp( 2 )) 1 1 + +(exp ( )) 1 1 1 μ = exp( 1 )+ 1 1 (exp( 2 )) 1 1 + + ³ 1 1 (exp ( )) 1 1 1 38
μ μ = exp( 1 )+ exp = Ã exp( 1 )+ μ exp 2 1 2 1 = (exp( 1 ) exp( 2 ) exp( )) + + μ + +exp μ μ exp 1 iii. By inspection, one can see that this derivative property will hold. (It is obvious when dierentiating with respect to exp( 1. For other derivatives, the fact that 0 1 gives the needed alternation in sign. Note that in the definition of the property is analogous to exp( here.) 1 1 1! 39
Thus, we can now proceed to derive our probabilities. First, consider: Pr (1 {1 2}) = 1 1 + ³ 2 1 1 = 1 1 + 2 which is simply our binomial logit model. Also note that in the three good case: Ã μ μ! 2 3 1 2 = (1 ) exp +exp 1 1 1 exp μ Ã μ μ! 2 2 3 = exp exp +exp 1 1 1 μ 2 1 40
Now suppose that we eliminate choice 1 (by letting 1 ). Then: μ μ μ 2 2 3 exp( 2 )exp exp + exp 1 1 1 Pr (2 {23}) = μ μ 1 2 3 exp + exp 1 1 μ 2 exp 1 = μ μ 2 3 exp + exp 1 1 41
Observe that: Pr (1 {1 2 3}) = = 1 + ³ 2 1 1 + 1 + 3 1 1 1 2 1 ³ 1+ 3 2 1 1 = 1 + 2 Ã1+ 1 μ 3 2 1! 1 (8) 1 Letting 1 and supposing 2 3 we get: 42
μ 3 2 1= μ 3 2 andthusfromequation(8),wehave: 1 1 0 as 1 Pr (1 {123}) 1 1 + 2 (9) Conversely, if 3 2 just reverse the roles of 2 and 3 so: Pr (1 {1 2 3}) 1 1 + 2 μ 3 2 = 1 1 + 3 (10) Combining equations (9) and (10), we get, as 1: Pr (1 {1 2 3}) 1 1 +max{ 2 3} (11) 43
Equations (9), (10) & (11) imply that in this GEV model, the probability of choice 1 on addition of a choice 3 identical to choice 2, does not necessarily fall, as was the case in the ordinary conditional/multinomial logit case. Equations (9) shows that if the added choice 3 is highly correlated to choice 2( 1) but yields less utility, then the probability in the three choice case reduces to the binomial logit (the probability in the two choice case), with choice 3 dropping out, as one would intuitively expect. What about the probability of choice 2 how does this change when we add an identical choice 3 in this GEV model? To 44
answer this, consider: = = = Pr (2 {123}) " ½ μ μ ¾ # μ 2 3 1 2 (1 ) exp +exp 1 1 1 exp 2 1 μ μ ¾ 1 2 3 1 + ½exp +exp 1 1 μ ½ μ μ ¾ 2 2 3 exp exp +exp 1 1 1 μ μ ¾ 1 2 3 1 + ½exp +exp 1 1 μ 2 exp 1 ½ μ μ ¾! 1 μ μ μ 2 3 2 3 Ã 1 + exp +exp exp +exp 1 45 1 1 1
When =0, ie when there is no correlation between choice 2 and choice 3, we have ordinary conditional/multinomial logit. Suppose 2 3 and 1. By appealing to the result derived in equation (11), we get: (2 {123}) μ 2 exp = 1 μ μ 2 3 (12) exp +exp 1 1 μ μ 1 2 3 exp +exp 1 1 μ μ 1 2 3 exp 1 + exp +exp 1 1 46
We know for 2 3 : μ 3 2 1 μ 3 2 1 1 0 as 1 and thus, from equation (12), we get: Pr (2 {1 2 3}) exp( 2 ) as 1 exp( 1 )+exp( 2 ) (One could derive a similar result be assuming that 3 2 ) This equation tells us that in the GEV model, if choices 2 and 3 are very similar, if utility from 2 is greater than that from 3, then choice 3 gets disregarded (same as in Equation 9 earlier), which agrees with our intuition. 47
Finally, supposing that 2 = 3,weget: μ = 2 1 + exp +exp 1 μ 1 = 2 1 + 2exp 1 = exp( 1 )+2 1 exp ( 2 ) μ 1 2 1 Thus: = exp ( 2 )2 Pr (2 {123}) = exp( 1 )+2 1 exp ( 2 ) exp 2 = 2 exp 1 +2exp 2 lim Pr (2 {123}) 1 exp( 2 ) 1 2 exp( 1 )+exp( 2 ) 48
This final equation tells us if the characteristics are identical in the nested logit model, then the probability, in the three choice case, of choosing one of the two identical choices is equal half the probability of the two choice case, which is again what is intuitively expected. Thus the nested logit (GEV) model is able to avoid the key limitation of the conditional/multinomial logit imposed by the IIA assumption. References [1] Maddala, Limited-Dependent and Qualitative Variables in Econometrics,1983, chapters 2 & 3. [2] Greene, Econometric Analysis, 2000, chapter 19. 49