PSCI6000 Maximum Likelihood Estimation: Multiple Response Model 1

Tetsuya Matsubayashi
University of North Texas
November 2, 2010

Goals

- Random utility model
- Multinomial logit model
- Conditional logit model
- Independence of Irrelevant Alternatives
- Nested logit model (next week)
- Mixed logit model (next week)
- Multinomial probit model (next week)

Multinomial Dependent Variable

- Vote: Bush, Clinton, or Perot in the 1992 presidential election
- Travel: car, bus, or train
- Occupation: blue-collar job, white-collar job, professional job, etc.

Random Utility Model

A decision maker chooses one alternative from a choice set. The choice set is characterized as follows:

- Alternatives must be mutually exclusive.
- The choice set must be exhaustive.
- The number of alternatives must be finite.

Random Utility Model

The random utility model assumes that a decision maker $i$ attaches a utility $U_{im}$ to each alternative $m = 1, \ldots, M$. The utility consists of two components:

- Systematic component, which we can observe.
- Random component, which we cannot observe.

Thus, the model is expressed as:

$$U_{im} = V_{im} + \varepsilon_{im}$$

where $U_{im}$ is decision maker $i$'s utility for alternative $m$, $V_{im}$ is the systematic component of that utility, and $\varepsilon_{im}$ is the random component. For example, the utility of voting for Obama increases as ideological proximity increases.

We assume that the systematic component of the utility is a linear function of some exogenous variables:

$$V_{im} = x_{im}\beta$$

if the variables are choice-specific, and

$$V_{im} = x_i\beta_m$$

if the variables are individual-specific. In the case of presidential vote choice, candidates' traits are choice-specific variables and individuals' demographic characteristics are individual-specific variables.

The utility of individual $i$ for choice $m$ is rewritten as:

$$U_{im} = x_{im}\beta + \varepsilon_{im}$$

$$U_{im} = x_i\beta_m + \varepsilon_{im}$$

The model assumes that the decision maker chooses alternative $m$ if and only if:

$$U_{im} > U_{ij} \quad \forall j \neq m$$
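The two forms of the systematic component and the choice rule can be sketched in a few lines of Python. This is only an illustration: the function names and all numbers are made up, and the random components are supplied by hand rather than drawn from any distribution.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(a_k * b_k for a_k, b_k in zip(a, b))

def v_individual_specific(x_i, betas):
    """V_im = x_i . beta_m: one coefficient vector per alternative."""
    return [dot(x_i, beta_m) for beta_m in betas]

def v_choice_specific(x_im_rows, beta):
    """V_im = x_im . beta: one attribute row per alternative, shared coefficients."""
    return [dot(x_im, beta) for x_im in x_im_rows]

def choose(v, eps):
    """Return the index m with the largest U_im = V_im + eps_im."""
    u = [v_m + e_m for v_m, e_m in zip(v, eps)]
    return max(range(len(u)), key=u.__getitem__)

# Hypothetical systematic utilities and one hand-picked draw of the random part.
v = [1.0, 0.8, 0.2]
eps = [-0.5, 0.3, 0.1]
chosen = choose(v, eps)          # alternative 1 wins although alternative 0 has the largest V
print(chosen)
```

The example is chosen so that the random component overturns the ranking implied by the systematic component alone, which is why the model yields choice probabilities rather than deterministic choices.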

Multinomial Logit Model: Set Up

We begin with the random utility model using individual-specific variables:

$$U_{im} = x_i\beta_m + \varepsilon_{im}$$

where $x_i$ denotes a vector of individual-specific characteristics and $\beta_m$ is a vector of choice-specific parameters. Thus, the effect of $x_i$ varies across the choices.

Consider a vote choice model for the 1992 presidential election in which income is the key exogenous variable. The utilities are written as

$$\text{Bush: } U_{i1} = \beta_{10} + \beta_{11}\,\text{Income}_i + \varepsilon_{i1}$$

$$\text{Clinton: } U_{i2} = \beta_{20} + \beta_{21}\,\text{Income}_i + \varepsilon_{i2}$$

$$\text{Perot: } U_{i3} = \beta_{30} + \beta_{31}\,\text{Income}_i + \varepsilon_{i3}$$

Suppose that individual $i$ chooses one of two alternatives. The probability of choosing alternative 1 is the probability that the utility from alternative 1 exceeds the utility from alternative 2:

$$\Pr(y_i = 1) = \Pr(U_{i1} > U_{i2}) = \Pr(V_{i1} + \varepsilon_{i1} > V_{i2} + \varepsilon_{i2}) = \Pr(\varepsilon_{i2} - \varepsilon_{i1} < V_{i1} - V_{i2})$$

This is a binary choice model.

Suppose that individual $i$ chooses one of three alternatives. The probability of choosing alternative 1 is the probability that the utility from alternative 1 exceeds both the utility from alternative 2 and the utility from alternative 3:

$$\Pr(y_i = 1) = \Pr[(U_{i1} > U_{i2}) \text{ and } (U_{i1} > U_{i3})]$$

$$= \Pr[(V_{i1} + \varepsilon_{i1} > V_{i2} + \varepsilon_{i2}) \text{ and } (V_{i1} + \varepsilon_{i1} > V_{i3} + \varepsilon_{i3})]$$

$$= \Pr[(\varepsilon_{i2} - \varepsilon_{i1} < V_{i1} - V_{i2}) \text{ and } (\varepsilon_{i3} - \varepsilon_{i1} < V_{i1} - V_{i3})]$$

When there are $J$ choices, the probability of choice $m$ is

$$\Pr(y_i = m) = \Pr(U_{im} > U_{ij} \quad \forall j \neq m)$$

For example, the probability of voting for Bush equals the probability that the utility gained from voting for Bush exceeds the utilities from voting for Clinton and for Perot.

Distributional Assumption

First, the random components are independently and identically distributed (IID). In other words, the random component of each alternative's utility is uncorrelated with the random components of all other alternatives, and all of these unobserved components have the same distribution.

Second, the random components follow the type I extreme value distribution.

The CDF of the type I extreme value distribution is

$$F(\varepsilon_{im}) = e^{-e^{-\varepsilon_{im}}}$$

The PDF of the type I extreme value distribution is

$$f(\varepsilon_{im}) = e^{-\varepsilon_{im}}\, e^{-e^{-\varepsilon_{im}}}$$

The choice of the distribution is motivated by the simplicity, tractability, and usefulness of the resulting model. This distribution has mode 0, mean 0.58, and standard deviation 1.28.

The Probability Density Function

[Figure: PDF of the type I extreme value distribution, with x spanning roughly -4 to 4]
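As a rough numerical check of the stated mode, mean, and standard deviation, the sketch below codes the CDF and PDF above and draws from the distribution by inverting the CDF. The function names, the seed, and the number of draws are illustrative choices, not part of the slides.

```python
import math
import random

def gumbel_cdf(e):
    """CDF of the standard type I extreme value distribution: exp(-exp(-e))."""
    return math.exp(-math.exp(-e))

def gumbel_pdf(e):
    """PDF of the same distribution: exp(-e) * exp(-exp(-e))."""
    return math.exp(-e) * math.exp(-math.exp(-e))

def gumbel_draw(rng):
    """Inverse-CDF draw: solve u = exp(-exp(-e)) for e."""
    u = rng.random()
    while u == 0.0:              # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))

rng = random.Random(0)
draws = [gumbel_draw(rng) for _ in range(200_000)]
mean = sum(draws) / len(draws)
sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / len(draws))

# The sample moments should land near the stated 0.58 and 1.28.
print(round(mean, 2), round(sd, 2))
```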

The Cumulative Distribution Function

[Figure: CDF of the type I extreme value distribution, rising from 0 to 1, with x spanning roughly -4 to 4]

Distributional Assumption

The difference between two extreme value variables follows the logistic distribution. That is, if $\varepsilon_{im}$ and $\varepsilon_{in}$ are IID extreme value, then $\varepsilon^*_{imn} = \varepsilon_{im} - \varepsilon_{in}$ follows the logistic distribution:

$$F(\varepsilon^*_{imn}) = \frac{e^{\varepsilon^*_{imn}}}{1 + e^{\varepsilon^*_{imn}}}$$

The choice probability is:

$$P_{im} = \int \Pr(\varepsilon_{in} - \varepsilon_{im} < V_{im} - V_{in} \quad \forall n \neq m \mid \varepsilon_{im})\, f(\varepsilon_{im})\, d\varepsilon_{im}$$

Some algebraic manipulation of this integral results in a succinct, closed-form expression:

$$P_{im} = \Pr(y_i = m) = \frac{e^{V_{im}}}{\sum_{j=1}^{J} e^{V_{ij}}} = \frac{e^{x_i\beta_m}}{\sum_{j=1}^{J} e^{x_i\beta_j}}$$

which is the logit choice probability. (See Train, 2003, 78-9 for this derivation.)

Identification

It is convenient to code the outcomes as $j = 0, 1, \ldots, J$, so there are $J + 1$ alternatives in this notation.

In the current set up, the $\beta_m$ are unidentified. For any vector of constants $q$, we find the same probabilities whether we use $\beta_m$ or $\beta^*_m$, where $\beta^*_m = \beta_m + q$. We could add an arbitrary constant to all the coefficients in the model, yet get the same probabilities.
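The closed-form logit probability can be checked against a direct simulation of utility maximization with IID extreme value errors. The sketch below assumes three alternatives with made-up systematic utilities; the function names, seed, and number of draws are illustrative.

```python
import math
import random

def logit_probs(v):
    """Closed-form logit probabilities exp(V_m) / sum_j exp(V_j)."""
    top = max(v)                             # subtract the max for numerical stability
    expv = [math.exp(v_m - top) for v_m in v]
    total = sum(expv)
    return [e / total for e in expv]

def gumbel_draw(rng):
    """Type I extreme value draw by inverting the CDF exp(-exp(-e))."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))

def simulated_shares(v, n_draws, seed=0):
    """Share of draws in which each alternative has the largest V_m + eps_m."""
    rng = random.Random(seed)
    counts = [0] * len(v)
    for _ in range(n_draws):
        u = [v_m + gumbel_draw(rng) for v_m in v]
        counts[max(range(len(u)), key=u.__getitem__)] += 1
    return [c / n_draws for c in counts]

V = [0.5, 0.0, -1.0]                         # hypothetical systematic utilities
probs = logit_probs(V)
shares = simulated_shares(V, 100_000)
print([round(p, 3) for p in probs])
print([round(s, 3) for s in shares])         # should be close to the line above
```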

Consider the following example with 3 choices:

$$\Pr(y_i = 1) = \frac{e^{x(\beta_1 + q)}}{\sum_k e^{x(\beta_k + q)}} = \frac{e^{x(\beta_1 + q)}}{e^{x(\beta_1 + q)} + e^{x(\beta_2 + q)} + e^{x(\beta_3 + q)}}$$

$$= \frac{e^{x\beta_1} e^{xq}}{e^{x\beta_1} e^{xq} + e^{x\beta_2} e^{xq} + e^{x\beta_3} e^{xq}} = \frac{e^{x\beta_1} e^{xq}}{\left(\sum_k e^{x\beta_k}\right) e^{xq}} = \frac{e^{x\beta_1}}{\sum_k e^{x\beta_k}}$$

Therefore, the model cannot distinguish the true parameters from the parameters plus an arbitrary constant.

As with ordered response models, we need an assumption or normalization that identifies the parameters. A convenient normalization that solves the identification problem is to assume that one of the sets of coefficients (the coefficients for one of the choices) is all zero.

Specifically, assume that $\beta_0 = 0$ for category zero. Then

$$\Pr(y_i = 0) = \frac{e^{x \cdot 0}}{e^{x \cdot 0} + \sum_{k=1}^{J} e^{x\beta_k}} = \frac{1}{1 + \sum_{k=1}^{J} e^{x\beta_k}}$$

More generally, for $j = 0, 1, \ldots, J$,

$$\Pr(y_i = j) = \frac{e^{x\beta_j}}{1 + \sum_{k=1}^{J} e^{x\beta_k}}$$

Category zero becomes the reference category to which all of the results are compared.

In this form, it is clear that when $J = 1$, we have the binary logit as a special case of the multinomial logit:

$$\Pr(y_i = 1) = \frac{e^{x\beta_1}}{1 + e^{x\beta_1}}$$
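The identification argument is easy to verify numerically. The following sketch uses made-up coefficient vectors and a made-up shift $q$; it shows that shifting every $\beta_j$ by $q$, or subtracting $\beta_0$ from every $\beta_j$ (the normalization $\beta_0 = 0$), leaves the probabilities unchanged. All names and values are illustrative.

```python
import math

def mnl_probs(x, betas):
    """P(y = j) = exp(x . beta_j) / sum_k exp(x . beta_k)."""
    v = [sum(x_k * b_k for x_k, b_k in zip(x, beta_j)) for beta_j in betas]
    top = max(v)
    expv = [math.exp(v_j - top) for v_j in v]
    total = sum(expv)
    return [e / total for e in expv]

x = [1.0, 2.5]                                   # constant plus one covariate (made up)
betas = [[0.2, -0.1], [1.0, 0.3], [-0.5, 0.4]]   # hypothetical beta_0, beta_1, beta_2
q = [3.0, -2.0]                                  # arbitrary vector added to every beta_j

shifted = [[b + q_k for b, q_k in zip(beta_j, q)] for beta_j in betas]
normalized = [[b - b0 for b, b0 in zip(beta_j, betas[0])] for beta_j in betas]

p = mnl_probs(x, betas)
p_shifted = mnl_probs(x, shifted)
p_normalized = mnl_probs(x, normalized)          # normalized[0] is the zero vector
print([round(p_j, 4) for p_j in p])
print([round(p_j, 4) for p_j in p_normalized])
```

Only the differences $\beta_j - \beta_0$ matter for the probabilities, which is why fixing one coefficient vector at zero loses nothing.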

Estimation

Estimation of this model is relatively easy since the log-likelihood is globally concave.

To specify the likelihood, first define $d_{ij} = 1$ if individual $i$ chooses alternative $j$, and $d_{ij} = 0$ otherwise. This means there are $J + 1$ indicators $d_{ij}$ for each individual, each indicating a choice. Use these to select the appropriate terms in the likelihood function.

As with ordered response models, there is a different probability expression for each selected outcome. The likelihood function for individual $i$ is

$$L_i = P_0^{d_{i0}} P_1^{d_{i1}} P_2^{d_{i2}} \cdots P_J^{d_{iJ}}$$

where $P_j$ is the probability that individual $i$ chooses alternative $j$. Since we assume the observations are independent, the joint likelihood is the product of the individual likelihoods:

$$L = \prod_{i=1}^{N} P_0^{d_{i0}} P_1^{d_{i1}} P_2^{d_{i2}} \cdots P_J^{d_{iJ}}$$

The log-likelihood is

$$\ln L = \sum_{i=1}^{N} \sum_{m=0}^{J} d_{im} \ln P_m = \sum_{i=1}^{N} \sum_{m=0}^{J} d_{im} \ln\left(\frac{e^{x_i\beta_m}}{1 + \sum_{j=1}^{J} e^{x_i\beta_j}}\right)$$

where $\beta_0 = 0$.

Estimate the vote choice model in the 1992 presidential election:

$$\text{vote}_i = \beta_{m0} + \beta_{m1}\,\text{Economy}_i + \beta_{m2}\,\text{Democrat}_i + \beta_{m3}\,\text{Republican}_i + \beta_{m4}\,\text{Income}_i + \varepsilon_{im}$$

where $\text{vote}_i$ has three categories (Bush, Clinton, Perot). Use multinom in the nnet library in R.
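To make the log-likelihood concrete, here is a minimal Python sketch that evaluates $\ln L$ for given coefficients with $\beta_0$ fixed at zero. It is not a substitute for multinom: it only evaluates the function and does no optimization. The data and coefficient values are invented for illustration.

```python
import math

def mnl_loglik(betas, X, y):
    """ln L = sum_i sum_m d_im ln P_m, with beta_0 = 0 for the reference category.

    betas: list of J coefficient vectors for outcomes 1..J
    X:     list of covariate rows x_i (include a leading 1.0 for the intercept)
    y:     list of chosen outcomes coded 0..J
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Systematic utilities; the reference category contributes x_i . 0 = 0.
        v = [0.0] + [sum(a * b for a, b in zip(x_i, beta_j)) for beta_j in betas]
        top = max(v)
        log_denom = top + math.log(sum(math.exp(v_j - top) for v_j in v))
        total += v[y_i] - log_denom          # d_im picks out the chosen outcome
    return total

# Toy data: intercept plus one covariate, three outcomes (all values hypothetical).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3]]
y = [0, 1, 2, 1]

ll_zero = mnl_loglik([[0.0, 0.0], [0.0, 0.0]], X, y)   # every P_m equals 1/3
ll_other = mnl_loglik([[0.3, -0.5], [-0.4, 0.8]], X, y)
print(round(ll_zero, 4), round(ll_other, 4))
```

With all coefficients at zero every outcome has probability $1/(J+1)$, so the log-likelihood equals $-N \ln(J+1)$; this is a convenient sanity check for any implementation.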

Interpreting Coefficients

With three outcomes, there are two sets of coefficients for each independent variable, one for each non-baseline outcome.

The signs of the coefficients can be interpreted directly. For example, a negative coefficient indicates that the independent variable lowers the odds of voting for a candidate relative to the baseline candidate.

Statistical inference is done as usual.

Marginal Effects

We can calculate the marginal effect of one continuous independent variable on the probabilities of the outcome categories:

$$\frac{\partial P_m}{\partial x_k} = P_m\left(\beta_{km} - \sum_{j=0}^{J} P_j \beta_{kj}\right) = P_m(\beta_{km} - \bar{\beta}_k)$$

Here $\bar{\beta}_k = \sum_{j=0}^{J} P_j \beta_{kj}$ is the weighted sum of the $\beta_k$, where the weights are the outcome probabilities. The marginal effect tells us how the probability of choosing $m$ changes if the variable increases by a small amount.

Predicted Probabilities

Predicted probabilities can be computed with the following equation:

$$\hat{P}_m = \frac{e^{x\hat{\beta}_m}}{1 + \sum_{j=1}^{J} e^{x\hat{\beta}_j}}$$

The values of the key independent variable are varied, while the other variables are held constant.

Odds Ratio

Odds ratios are useful when you want to know the odds of choosing one alternative relative to another. We first write:

$$\Omega_{mn}(x_i) = \frac{P_{im}}{P_{in}}$$

where $\Omega_{mn}(x_i)$ is the odds of outcome $m$ versus outcome $n$ given $x_i$, and $x_i$ includes all independent variables.
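The marginal effect formula can be checked against a finite-difference approximation of the predicted probabilities. The sketch below uses invented coefficients (with the baseline vector set to zero) and an invented covariate profile; function names are illustrative.

```python
import math

def mnl_probs(x, betas):
    """Predicted probabilities; betas[0] is the zero vector of the baseline."""
    v = [sum(x_k * b_k for x_k, b_k in zip(x, beta_j)) for beta_j in betas]
    top = max(v)
    expv = [math.exp(v_j - top) for v_j in v]
    total = sum(expv)
    return [e / total for e in expv]

def marginal_effects(x, betas, k):
    """dP_m/dx_k = P_m * (beta_km - sum_j P_j * beta_kj) for every outcome m."""
    p = mnl_probs(x, betas)
    beta_bar = sum(p_j * beta_j[k] for p_j, beta_j in zip(p, betas))
    return [p_m * (beta_m[k] - beta_bar) for p_m, beta_m in zip(p, betas)]

betas = [[0.0, 0.0], [0.5, -0.3], [-0.2, 0.6]]   # hypothetical; baseline first
x = [1.0, 1.5]                                   # intercept and one covariate
k = 1                                            # differentiate w.r.t. the covariate

analytic = marginal_effects(x, betas, k)

# Finite-difference check: nudge x_k by a small h and difference the probabilities.
h = 1e-6
x_up = list(x)
x_up[k] += h
numeric = [(p_up - p) / h for p_up, p in zip(mnl_probs(x_up, betas), mnl_probs(x, betas))]
print([round(a, 4) for a in analytic])
```

Because the probabilities always sum to one, the marginal effects across outcomes sum to zero, and the sign of a marginal effect need not match the sign of the corresponding coefficient.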

Odds Ratio

We continue:

$$\Omega_{mn}(x_i) = \frac{P_{im}}{P_{in}} = \frac{e^{x_i\beta_m} \big/ \sum_{j=1}^{J} e^{x_i\beta_j}}{e^{x_i\beta_n} \big/ \sum_{j=1}^{J} e^{x_i\beta_j}} = \frac{e^{x_i\beta_m}}{e^{x_i\beta_n}} = e^{x_i[\beta_m - \beta_n]}$$

For an individual with the characteristics specified in $x_i$, the odds of choosing $m$ rather than $n$ are $e^{x_i[\beta_m - \beta_n]}$.

If you want the odds relative to the baseline category (here category 1, with $\beta_1 = 0$), the equation simplifies to:

$$\Omega_{m1}(x_i) = e^{x_i\beta_m}$$

For an individual with the characteristics specified in $x_i$, the odds of choosing $m$ rather than the baseline category are $e^{x_i\beta_m}$.

You can assess how a change in a particular independent variable affects the odds of $m$ versus the baseline category. The effect is computed by:

$$\frac{\Omega_{m1}(x_i, x_{ik} + \delta)}{\Omega_{m1}(x_i, x_{ik})} = e^{\beta_{km}\delta}$$

where $x_{ik}$ is the $k$th independent variable for individual $i$ and $\beta_{km}$ is the coefficient associated with the $k$th independent variable for alternative $m$.

For a change of $\delta$ in $x_{ik}$, the odds of outcome $m$ versus the baseline category are expected to change by a factor of $e^{\beta_{km}\delta}$, holding all other variables constant. The factor change in the odds for a change in $x_{ik}$ does not depend on the level of $x_{ik}$ or on the level of any other variable.

Conditional Logit Model

In the MNL model, each explanatory variable denotes an individual-specific characteristic and has a different effect on each outcome. The utility for the MNL model is expressed as

$$U_{im} = x_i\beta_m + \varepsilon_{im}$$

The conditional logit (CL) model differs slightly from the MNL model in that it considers the impact of choice-specific attributes instead of individual-specific attributes. The utility for the CL model is written as

$$U_{im} = z_{im}\gamma + \varepsilon_{im}$$

where $z_{im}$ denotes a vector of choice-specific attributes. In the case of the vote choice model in 1992, $z_{im}$ would be a perceived candidate trait, for example. Importantly, the parameters are not choice-specific; there is only one parameter for each attribute.

In the three-candidate race, the utilities are expressed as

$$\text{Bush: } U_{i1} = \beta_1\,\text{honesty}_{i1} + \varepsilon_{i1}$$

$$\text{Clinton: } U_{i2} = \beta_1\,\text{honesty}_{i2} + \varepsilon_{i2}$$

$$\text{Perot: } U_{i3} = \beta_1\,\text{honesty}_{i3} + \varepsilon_{i3}$$

where the single coefficient $\beta_1$ plays the role of $\gamma$ above. With $\beta_1 > 0$, the utility gets larger when perceived honesty increases.
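Returning to the factor-change result for the MNL odds stated earlier in this section, the short sketch below confirms that a change of $\delta$ in $x_{ik}$ multiplies the odds against the baseline by $e^{\beta_{km}\delta}$ regardless of the starting values of the covariates. The coefficient values and covariate profiles are invented for illustration.

```python
import math

def odds_vs_baseline(x, beta_m):
    """Omega_m1(x) = exp(x . beta_m), valid when the baseline coefficients are zero."""
    return math.exp(sum(x_k * b_k for x_k, b_k in zip(x, beta_m)))

beta_m = [0.4, -0.25, 0.1]      # hypothetical coefficients for outcome m
k = 1                           # index of the variable being changed
delta = 2.0                     # size of the change in x_k

profiles = [[1.0, 0.0, 3.0], [1.0, 5.0, -1.0], [1.0, -2.0, 10.0]]
factors = []
for x in profiles:
    x_new = list(x)
    x_new[k] += delta
    factors.append(odds_vs_baseline(x_new, beta_m) / odds_vs_baseline(x, beta_m))

expected_factor = math.exp(beta_m[k] * delta)
print([round(f, 4) for f in factors], round(expected_factor, 4))
```

The same factor appears for every profile, which is the sense in which the factor change does not depend on the level of any variable.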

Data for Conditional Logit Model

i   outcome   chosen   honesty   age
1   1         0        1         50
1   2         1        7         50
1   3         0        3         50
2   1         1        5         30
2   2         0        1         30
2   3         0        2         30
3   1         0        2         70
3   2         0        3         70
3   3         1        4         70

Conditional Logit Model

The probability that individual $i$ chooses alternative $m$ in the CL model is

$$\Pr(y_i = m) = \frac{e^{z_{im}\gamma}}{\sum_{j=1}^{J} e^{z_{ij}\gamma}}$$

which should be compared to the MNL model:

$$\Pr(y_i = m) = \frac{e^{x_i\beta_m}}{\sum_{j=1}^{J} e^{x_i\beta_j}}$$

where $\beta_1 = 0$.

(Mixed) Conditional Logit Model

It is possible to include both individual-specific and choice-specific attributes in the model. The utility is given by

$$U_{im} = x_i\beta_m + z_{im}\gamma + \varepsilon_{im}$$

where $x_i$ contains individual-specific attributes for individual $i$ and $z_{im}$ contains choice-specific attributes for outcome $m$.

The probability that individual $i$ chooses alternative $m$ is:

$$\Pr(y_i = m) = \frac{e^{x_i\beta_m + z_{im}\gamma}}{\sum_{j=1}^{J} e^{x_i\beta_j + z_{ij}\gamma}}$$

where $\beta_1 = 0$.
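The CL probability can be evaluated directly on the long-format data shown in the table above. The sketch below uses the honesty column only and a made-up value of $\gamma$ (it is not an estimate). The age column is left out because, in a pure CL model with a single shared coefficient, a variable that is constant across alternatives cancels from the probability.

```python
import math

# Rows from the data table above: (i, outcome, chosen, honesty).
ROWS = [
    (1, 1, 0, 1), (1, 2, 1, 7), (1, 3, 0, 3),
    (2, 1, 1, 5), (2, 2, 0, 1), (2, 3, 0, 2),
    (3, 1, 0, 2), (3, 2, 0, 3), (3, 3, 1, 4),
]

def cl_probs(z, gamma):
    """Pr(y_i = m) = exp(z_im * gamma) / sum_j exp(z_ij * gamma) for one individual."""
    v = [gamma * z_m for z_m in z]
    top = max(v)
    expv = [math.exp(v_m - top) for v_m in v]
    total = sum(expv)
    return [e / total for e in expv]

def cl_loglik(rows, gamma):
    """Sum over individuals of the log probability of the chosen alternative."""
    by_person = {}
    for i, _outcome, chosen, honesty in rows:
        by_person.setdefault(i, []).append((chosen, honesty))
    total = 0.0
    for alts in by_person.values():
        probs = cl_probs([h for _, h in alts], gamma)
        total += sum(c * math.log(p) for (c, _), p in zip(alts, probs))
    return total

gamma = 0.5                                  # hypothetical value, not an estimate
p1 = cl_probs([1, 7, 3], gamma)              # individual 1's three alternatives
p1_with_age = cl_probs([1 + 50, 7 + 50, 3 + 50], gamma)   # constant shift cancels
print([round(p, 3) for p in p1])
print(round(cl_loglik(ROWS, gamma), 3))
```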

Interpretation

You can interpret the coefficients on the individual-specific variables in exactly the same way as you do in the MNL model. For choice-specific variables, the signs of the coefficients indicate how an increase in $z$ affects the likelihood that the individual chooses an alternative.

You can also use the same techniques (e.g., predicted probabilities) for interpretation.

Independence of Irrelevant Alternatives

In the multinomial logit model, the equation for the odds of $m$ versus $n$ is

$$\frac{\Pr(y_i = m)}{\Pr(y_i = n)} = \frac{e^{x_i\beta_m}}{e^{x_i\beta_n}} = \frac{e^{V_{im}}}{e^{V_{in}}}$$

This equation indicates that the odds are determined without reference to the other outcomes that might be available. This property is called the independence of irrelevant alternatives, or IIA. It is a consequence of assuming independence of the $\varepsilon_{ij}$ in the random utility model.

Think about McFadden's famous example. A person has two choices for commuting to work: a private car, chosen with $P(\text{car}) = 1/2$, and a red bus, chosen with $P(\text{red bus}) = 1/2$. The implied odds of taking the car versus the red bus are 1.

Suppose a new bus company is started that is identical to the current service except that the buses are blue. IIA requires that the new probabilities are $P(\text{car}) = 1/3$, $P(\text{red bus}) = 1/3$, and $P(\text{blue bus}) = 1/3$. This is necessary so that the odds of a car versus a red bus remain 1.

However, if the only thing that distinguishes the new bus service from the old is the color of the bus, we would not expect car travelers to start taking the bus (i.e., the utility does not change). Instead, the share of red bus riders would be split, resulting in $P(\text{car}) = 1/2$, $P(\text{red bus}) = 1/4$, and $P(\text{blue bus}) = 1/4$. The new implied odds for car versus red bus are $(1/2)/(1/4) = 2$, which violates the IIA assumption.

The IIA assumption requires that if a new alternative becomes available, then all probabilities for the prior choices must adjust by precisely the amount necessary to retain the original odds among all pairs of outcomes.
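The red bus/blue bus example can be reproduced with the logit formula itself. In the sketch below every alternative is given the same systematic utility (an assumption made only to match the 1/2 and 1/2 starting shares); the names are illustrative.

```python
import math

def logit_probs(v):
    """Logit probabilities exp(V_m) / sum_j exp(V_j)."""
    expv = [math.exp(v_m) for v_m in v]
    total = sum(expv)
    return [e / total for e in expv]

# Equal systematic utilities reproduce P(car) = P(red bus) = 1/2.
before = logit_probs([0.0, 0.0])             # car, red bus
# The blue bus is identical to the red bus, so it gets the same utility.
after = logit_probs([0.0, 0.0, 0.0])         # car, red bus, blue bus

odds_before = before[0] / before[1]          # car versus red bus
odds_after = after[0] / after[1]             # unchanged under the logit model

# The shares we would expect intuitively if blue buses only split the bus riders.
intuitive = [0.5, 0.25, 0.25]
intuitive_odds = intuitive[0] / intuitive[1]
print(after, odds_before, odds_after, intuitive_odds)
```

The logit model keeps the car versus red bus odds fixed and therefore pulls the car share down to 1/3, whereas the intuitive shares imply odds of 2.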

We assumed that the disturbances are independently and identically distributed according to the type I extreme value distribution. A violation of IIA indicates that the errors $\varepsilon_{ij}$ are not independent across alternatives $j$.

The non-independence causes us to overestimate the probability of choosing alternatives that are similar to each other.

Testing IIA

A Hausman-type test is available to assess the IIA property. If the IIA property holds, then the parameter estimates obtained on a subset of alternatives will not be significantly different from those obtained on the full set of alternatives.

The Hausman test proceeds as follows:

1. Estimate coefficients $\hat{\beta}_F$ and covariance matrix $\hat{V}_F$ with all $J$ alternatives.
2. Estimate coefficients $\hat{\beta}_R$ and covariance matrix $\hat{V}_R$ with a reduced set of alternatives.
3. Compare both estimates based on the Hausman statistic
   $$(\hat{\beta}_R - \hat{\beta}_F)'\,[\hat{V}_R - \hat{V}_F]^{-1}\,(\hat{\beta}_R - \hat{\beta}_F)$$
   which follows a $\chi^2$ distribution with $k$ degrees of freedom, where $k$ is the number of elements in the $\beta$ vector.
4. If the test statistic is larger than a critical value, we reject the null hypothesis that the IIA property holds.

See Fry and Harris (1998) for alternative tests.
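Step 3 of the Hausman procedure above can be sketched as follows, assuming the two models have already been estimated elsewhere. All estimates and covariance matrices below are invented placeholders, and $\hat{\beta}_F$ and $\hat{V}_F$ are assumed to have been cut down to the coefficients that also appear in the reduced model. A small Gaussian-elimination solver is used so the example needs only the standard library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def hausman_statistic(beta_r, v_r, beta_f, v_f):
    """(b_R - b_F)' [V_R - V_F]^{-1} (b_R - b_F)."""
    d = [r - f for r, f in zip(beta_r, beta_f)]
    v_diff = [[a - b for a, b in zip(row_r, row_f)] for row_r, row_f in zip(v_r, v_f)]
    return sum(d_i * s_i for d_i, s_i in zip(d, solve(v_diff, d)))

# Hypothetical estimates with k = 2 coefficients (placeholders, not real results).
beta_R, V_R = [0.52, -0.31], [[0.010, 0.001], [0.001, 0.008]]
beta_F, V_F = [0.50, -0.30], [[0.006, 0.0005], [0.0005, 0.005]]

stat = hausman_statistic(beta_R, V_R, beta_F, V_F)
critical_5pct_df2 = 5.991                                 # chi-square 5% critical value, 2 df
reject_iia = stat > critical_5pct_df2
print(round(stat, 4), reject_iia)
```

In finite samples the matrix $\hat{V}_R - \hat{V}_F$ need not be positive definite, in which case the statistic can come out negative; this sketch does not handle that case.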