Approximate Inference for the Multinomial Logit Model


Approximate Inference for the Multinomial Logit Model

M. Rekkas

Abstract: Higher order asymptotic theory is used to derive p-values that achieve superior accuracy compared with the p-values obtained from traditional tests for inference about parameters of the multinomial logit model. Simulations are provided to assess the finite sample behavior of the test statistics considered and to demonstrate the superiority of the higher order method. Stata code that outputs these p-values is available to facilitate the implementation of these methods for the end-user.

Department of Economics, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6; email: mrekkas@sfu.ca; phone: (778) 782-6793; fax: (778) 782-5944. I would like to thank Nancy Reid and two anonymous referees for helpful comments and suggestions. The support of the Natural Sciences and Engineering Research Council of Canada is gratefully acknowledged.

1 Introduction

The multinomial logit specification is the most popular discrete choice model in applied statistical disciplines such as economics. Recent developments in higher order likelihood asymptotic methods are applied to obtain highly accurate tail probabilities for testing parameters of interest in these models. This involves using an adjusted version of the standard log likelihood ratio statistic. Simulations are provided to demonstrate the significant improvements in accuracy that can be achieved over conventional first-order methods, that is, methods that achieve distributional accuracy of order $O(n^{-1/2})$, where $n$ is the sample size. The resulting p-value expressions for assessing scalar parameters of interest are remarkably simple and can easily be programmed into conventional statistical packages. The results have particular appeal for applied statisticians dealing with discrete choice models where the number of observations may be limited. More generally, these higher-order methods can be applied regardless of the sample size in order to determine the extent to which first-order methods can be relied upon.

The two main contributions are as follows. First, higher order likelihood theory is used to obtain highly accurate p-values for testing parameters of the multinomial logit model. Second, Stata code is made available to the end-user for this model.[1] While the past two decades have seen significant advances in likelihood asymptotic methods, empirical work employing these techniques has lagged far behind, a gap undoubtedly due to the lack of user-friendly computer code. The Stata programs are provided as a means to bridge this gap.

2 Model

For a given parametric model and observed data $y = (y_1, y_2, \ldots, y_n)$, denote the log likelihood function as $l(\theta)$, where $\theta$ is the full parameter vector of the model, expressed as $\theta = (\psi, \lambda^T)^T$, with scalar interest parameter $\psi$ and nuisance parameter vector $\lambda$.
Denote the overall maximum likelihood estimator as $\hat\theta = (\hat\psi, \hat\lambda^T)^T = \arg\max_\theta\, l(\theta)$ and the constrained maximum likelihood estimator as $\hat\theta_\psi = (\psi, \hat\lambda_\psi^T)^T = \arg\max_\lambda\, l(\theta)$ for fixed values of $\psi$. Let $j_{\theta\theta^T}(\hat\theta) = -\partial^2 l(\theta)/\partial\theta\,\partial\theta^T|_{\hat\theta}$ denote the observed information matrix and $j_{\lambda\lambda^T}(\hat\theta_\psi) = -\partial^2 l(\theta)/\partial\lambda\,\partial\lambda^T|_{\hat\theta_\psi}$ denote the observed nuisance information matrix.

[1] Brazzale (1999) provides R code for approximate conditional inference for logistic and loglinear models but does not consider the multinomial logit model.

Inference about $\psi$ is typically based on two departure methods,

known as the Wald departure ($q$) and the signed log likelihood ratio departure ($r$):

$$q = (\hat\psi - \psi)\left\{\frac{|j_{\theta\theta^T}(\hat\theta)|}{|j_{\lambda\lambda^T}(\hat\theta_\psi)|}\right\}^{1/2} \quad (1)$$

$$r = \mathrm{sgn}(\hat\psi - \psi)\,[2\{l(\hat\theta) - l(\hat\theta_\psi)\}]^{1/2}. \quad (2)$$

Note that the expression in (1) is not the usual Wald statistic, for which the estimated standard error of $\hat\psi$ is used for standardization.[2] Approximate p-values are given by $\Phi(q)$ and $\Phi(r)$, where $\Phi(\cdot)$ represents the standard normal cumulative distribution function. These methods are referred to as first-order methods, as $q$ and $r$ are asymptotically distributed as standard normal with first-order accuracy (i.e. the relative error of the approximation is $O(n^{-1/2})$). In small and even moderate samples, these methods can be highly inaccurate. Barndorff-Nielsen (1986) derived the modified signed log likelihood ratio statistic for higher order inference:

$$r^* = r - \frac{1}{r}\log\left(\frac{r}{Q}\right), \quad (3)$$

where $r$ is the signed log likelihood ratio departure in (2) and $Q$ is a standardized maximum likelihood departure term. The statistic $r^*$ is also asymptotically distributed as standard normal, but when the distribution of $y$ is continuous it achieves third-order accuracy. Tail area approximations can be obtained by using $\Phi(r^*)$. For exponential family models, several definitions of $Q$ exist; see, for example, Barndorff-Nielsen (1991), Pierce and Peters (1992), Fraser and Reid (1995), and Jensen (1995). The derivation of $Q$ given by Fraser and Reid (1995) will be used in this paper. While the Fraser and Reid version applies only to continuous data, saddlepoint arguments can be invoked to argue that the method remains valid for exponential family models in the discrete setting. Given that the maximum likelihood estimates take values on a lattice, however, technical issues surrounding the exact order of the error produce p-values with distributional accuracy of order $O(n^{-1})$. For more general models, Davison et al. (2006) provide a framework for handling discrete data that also achieves second-order accuracy.
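To make (1)-(3) concrete, consider the simplest case of a scalar canonical parameter with no nuisance component, so that $Q$ reduces to the standardized maximum likelihood departure $q$: testing the rate of an exponential sample. The sketch below uses Python with NumPy/SciPy (not the paper's Stata code), and the sample summary values `n` and `s` are hypothetical:

```python
import numpy as np
from scipy.stats import norm, gamma

# Hypothetical exponential sample summary: n observations with sum s.
n, s = 10, 8.0
psi0 = 1.0                     # hypothesized rate
psi_hat = n / s                # MLE of the rate

# Log likelihood l(psi) = n log(psi) - psi * s.
def loglik(psi):
    return n * np.log(psi) - psi * s

# Signed log likelihood ratio departure r, eq. (2).
r = np.sign(psi_hat - psi0) * np.sqrt(2 * (loglik(psi_hat) - loglik(psi0)))
# Standardized MLE departure q, eq. (1); observed information j(psi_hat) = n / psi_hat^2.
q = (psi_hat - psi0) * np.sqrt(n) / psi_hat
# Modified signed log likelihood ratio r*, eq. (3), with Q = q (scalar canonical parameter).
r_star = r - (1 / r) * np.log(r / q)

# Probability left of the data point; an exact value is available here
# because sum(y) ~ Gamma(n, rate psi0).
exact = gamma.sf(s, a=n, scale=1 / psi0)
print(norm.cdf(r), norm.cdf(q), norm.cdf(r_star), exact)
```

With these values, $\Phi(r) \approx 0.752$ and $\Phi(q) \approx 0.736$, while $\Phi(r^*) \approx 0.7167$ against the exact value $0.7166$, illustrating the near-exactness of $r^*$ in continuous models.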
Given that the present context involves an exponential family model, the Fraser and Reid (1995) methodology is directly applicable. Fraser and Reid (1995) used tangent exponential models to derive a highly accurate approximation to the p-value for testing a scalar interest parameter. The theory for obtaining $Q$ involves two main components.

[2] The standard Wald statistic will be considered in the examples and simulations, as this is the statistic that is typically reported in conventional statistical packages.

The first component requires a reduction of dimension by approximate

ancillarity.[3] This step reduces the dimension of the variable to the dimension of the full parameter. The second component requires a further reduction of dimension from the dimension of the parameter to the dimension of the scalar interest parameter. These two components are achieved through two key reparameterizations: from the parameter $\theta$ to a new parameter $\varphi$, and from the parameter $\varphi$ to a new parameter $\chi$. The parameter $\varphi$ represents the local canonical parameter of an approximating exponential model, and the parameter $\chi$ is a scaled version of $\varphi$. The canonical reparameterization of $\theta$ is given by

$$\varphi^T(\theta) = \left.\frac{\partial l(\theta; y)}{\partial y^T}\right|_{y^o} V, \quad (4)$$

where $V = (v_1, \ldots, v_p)$ is an ancillary direction array that can be obtained as

$$V = \left.\frac{\partial y}{\partial \theta^T}\right|_{\hat\theta} = -\left\{\frac{\partial k(y,\theta)}{\partial y^T}\right\}^{-1}\left\{\frac{\partial k(y,\theta)}{\partial \theta^T}\right\}\Bigg|_{\hat\theta}, \quad (5)$$

where $k = k(y, \theta) = (k_1, \ldots, k_n)^T$ is a full dimensional pivotal quantity. Fraser and Reid (1995) obtain this conditionality reduction without the computation of an explicit ancillary statistic. The second reparameterization is to $\chi(\theta)$, where $\chi(\theta)$ is constructed to act as a scalar canonical parameter in the new parameterization:

$$\chi(\theta) = \frac{\psi_{\varphi^T}(\hat\theta_\psi)}{|\psi_{\varphi^T}(\hat\theta_\psi)|}\,\varphi(\theta), \quad (6)$$

where $\psi_{\varphi^T}(\theta) = \partial\psi(\theta)/\partial\varphi^T = \{\partial\psi(\theta)/\partial\theta^T\}\{\partial\varphi(\theta)/\partial\theta^T\}^{-1}$. The determinants of the observed information matrix and the observed nuisance information matrix in the $\varphi$ parameterization are given by

$$|j_{\varphi\varphi^T}(\hat\theta)| = |j_{\theta\theta^T}(\hat\theta)|\,|\varphi_{\theta^T}(\hat\theta)|^{-2} \quad (7)$$

$$|j_{(\lambda\lambda^T)}(\hat\theta_\psi)| = |j_{\lambda\lambda^T}(\hat\theta_\psi)|\,|\varphi_{\lambda^T}^T(\hat\theta_\psi)\,\varphi_{\lambda^T}(\hat\theta_\psi)|^{-1}. \quad (8)$$

The standardized maximum likelihood departure is then given by

$$Q = \mathrm{sgn}(\hat\psi - \psi)\,|\chi(\hat\theta) - \chi(\hat\theta_\psi)|\left\{\frac{|j_{\varphi\varphi^T}(\hat\theta)|}{|j_{(\lambda\lambda^T)}(\hat\theta_\psi)|}\right\}^{1/2}. \quad (9)$$

Notice that for canonical parameter $\varphi(\theta) = (\psi, \lambda^T)^T$, the expression in (9) simplifies to the Wald departure given in (1). In this case, more accurate inference about $\psi$ is simply based on (3) with the conventional first-order quantities given in (1) and (2) as inputs.

[3] Fraser and Reid show that an exact ancillary statistic is not required for this reduction.
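The recalibration in (7) is a chain-rule identity for the observed information determinant under the reparameterization $\theta \to \varphi$. A minimal numerical check in a scalar model (exponential rate $\theta$ reparameterized to $\varphi = \log\theta$; hypothetical data summary) can be sketched as:

```python
import numpy as np

# Hypothetical exponential sample summary: n observations with sum s.
n, s = 10, 8.0
theta_hat = n / s                        # MLE of the rate theta
j_theta = n / theta_hat**2               # observed information in theta

# Reparameterize to phi = log(theta); Jacobian d(phi)/d(theta) = 1/theta.
phi_theta = 1.0 / theta_hat

# Eq. (7) in the scalar case: j_phi = j_theta * |phi_theta|^{-2}.
j_phi_recalibrated = j_theta * phi_theta**(-2)

# Direct computation: l(phi) = n*phi - exp(phi)*s, so -d^2 l/d phi^2 = exp(phi)*s.
phi_hat = np.log(theta_hat)
j_phi_direct = np.exp(phi_hat) * s
print(j_phi_recalibrated, j_phi_direct)  # both equal n = 10 here
```

The two computations agree, confirming that (7) simply transports the information determinant between parameterizations.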

Now, consider the multinomial logit model. Suppose there are $J+1$ response categories, $y_i = (y_{i0}, \ldots, y_{iJ})$, with corresponding probabilities $(\pi_{i0}, \ldots, \pi_{iJ})$, and $K$ explanatory variables with associated parameters $\beta_j$, where $\beta_j$ is a $K \times 1$ vector. The probabilities are given by

$$\pi_{ij} = \Lambda(\beta_j^T x_i) = \frac{\exp(\beta_j^T x_i)}{1 + \sum_{m=1}^{J}\exp(\beta_m^T x_i)}, \quad j = 0, 1, \ldots, J,$$

with the normalization $\beta_0 = 0$. For data $y = (y_1, \ldots, y_n)$ the likelihood function is given by

$$L(\beta) = \prod_{i=1}^{n} \pi_{i1}^{y_{i1}}\pi_{i2}^{y_{i2}}\cdots\pi_{iJ}^{y_{iJ}}\,(1 - \pi_{i1} - \cdots - \pi_{iJ})^{1 - y_{i1} - \cdots - y_{iJ}} = \exp\left[\sum_{i=1}^{n}\left\{\beta_1^T x_i y_{i1} + \cdots + \beta_J^T x_i y_{iJ} - \log\left(1 + \sum_{m=1}^{J}\exp(\beta_m^T x_i)\right)\right\}\right].$$

The corresponding log likelihood is given by

$$l(\beta) = \sum_{i=1}^{n}\left\{\beta_1^T x_i y_{i1} + \cdots + \beta_J^T x_i y_{iJ} - \log\left(1 + \sum_{m=1}^{J}\exp(\beta_m^T x_i)\right)\right\}. \quad (10)$$

The exponential family form of (10) gives the canonical parameter $\varphi(\theta) = (\beta_1^T, \ldots, \beta_J^T)^T$. Thus, if interest is in a scalar component of $\beta_j$, the maximum likelihood departure $Q$ is given by expression (1). To calculate this expression, the first and second derivatives of the log likelihood function and related quantities are required. The derivatives for this model are easily calculated:

$$\frac{\partial l}{\partial\beta_{jk}} = \sum_{i=1}^{n}(y_{ij} - \pi_{ij})x_{ik}, \quad j = 1, \ldots, J, \quad k = 1, \ldots, K$$

$$\frac{\partial^2 l}{\partial\beta_{jk}\,\partial\beta_{jl}} = -\sum_{i=1}^{n}\pi_{ij}(1 - \pi_{ij})x_{ik}x_{il}, \quad j = 1, \ldots, J, \quad k, l = 1, \ldots, K$$

$$\frac{\partial^2 l}{\partial\beta_{jk}\,\partial\beta_{j'l}} = \sum_{i=1}^{n}\pi_{ij}\pi_{ij'}x_{ik}x_{il}, \quad j, j' = 1, \ldots, J, \quad j \neq j', \quad k, l = 1, \ldots, K.$$

To examine the higher-order adjustment, two simple examples are considered.[4] For the first example, data from a real economic field experiment are used to estimate the parameters of the model. In this example, there are five independent variables and a dependent variable that can take one of two values, 0 or 1; that is, the model is the standard logit model.[5] The dataset for this example is provided in Table 1. The estimation results (with the constant suppressed), with 0 as the comparison group, are provided in Table 2. Odds ratios can easily be calculated by exponentiating the coefficients.
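The log likelihood (10) and its first derivatives translate directly into code. The following sketch uses Python with NumPy rather than the paper's Stata code, with made-up data; it evaluates $l(\beta)$ and checks the analytic score against finite differences:

```python
import numpy as np

def loglik(beta, X, Y):
    """Multinomial logit log likelihood (10); beta is J x K, X is n x K,
    Y is n x J with indicator columns for categories 1..J (category 0 is the base)."""
    eta = X @ beta.T                               # n x J linear indices beta_j' x_i
    return float((Y * eta).sum() - np.log1p(np.exp(eta).sum(axis=1)).sum())

def score(beta, X, Y):
    """First derivatives: dl/dbeta_jk = sum_i (y_ij - pi_ij) x_ik."""
    eta = X @ beta.T
    pi = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))
    return (Y - pi).T @ X                          # J x K

# Made-up data: n = 6 observations, K = 2 covariates, J = 2 non-base categories.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])
Y = np.zeros((6, 2)); Y[[0, 2, 4], 0] = 1; Y[[1, 3], 1] = 1   # last obs in base category
beta = rng.normal(size=(2, 2))

# Check the analytic score against central finite differences.
eps, g_num = 1e-6, np.zeros_like(beta)
for j in range(2):
    for k in range(2):
        b1, b2 = beta.copy(), beta.copy()
        b1[j, k] += eps; b2[j, k] -= eps
        g_num[j, k] = (loglik(b1, X, Y) - loglik(b2, X, Y)) / (2 * eps)
print(np.max(np.abs(score(beta, X, Y) - g_num)))   # small, ~1e-8 or less
```

The Hessian blocks follow the same pattern from the second-derivative formulas and supply the observed information needed in (1).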
[4] All computations were done in Stata 8. Code for the two examples is accessible from www.sfu.ca/mrekkas; code for the first example is also provided in R.

[5] The special case where the dependent variable can take only one of two values has previously been considered; for more on the logit model, see Brazzale (1999).

The conventional p-values associated with the maximum likelihood

estimates are reported along with those produced from the signed log likelihood ratio departure given in (2) and from the modified signed log likelihood ratio statistic given in (3). The resulting p-values are denoted MLE, LR, and RSTAR, respectively. The p-values associated with the maximum likelihood estimates are provided for comparison, as these are the p-values output by most conventional statistical packages. It should be noted that using the $r^*$ formula in (3) along with $\Phi(r^*)$ produces a p-value with the interpretation of probability left of the data point. For consistency with output reported by statistical packages, however, the p-values associated with $r^*$ in the tables are always reported as tail probabilities. As can be discerned from Table 2, even with 40 observations the p-values produced by the three methods are quite different and, depending on the method chosen, would lead to different inferences about the parameters.

The second example considers a dependent variable that can take one of three values: 1, 2, or 3. The dataset is provided in Table 3.[6] Results from this estimation (with the constants suppressed), with group 1 as the comparison group, are provided in Table 4. Relative risk ratios can be obtained by exponentiating the coefficients. Once again the table reveals a wide range of p-values. For instance, the coefficient for variable X2 in the Y = 3 equation would be deemed insignificant at the 5% level using the conventional MLE test, while it would be deemed significant at this level using the LR or RSTAR methods.

To investigate the properties of the higher-order method in small and large samples, two simulations are conducted. Accuracy is assessed by computing the observed p-values for each method (MLE, LR, RSTAR) and recording several criteria: coverage probability, coverage error, upper and lower error probabilities, and coverage bias.
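These criteria can be computed mechanically from the simulated interval endpoints. A sketch with hypothetical function and variable names (`lower` and `upper` hold the simulated confidence bounds) is:

```python
import numpy as np

def interval_criteria(lower, upper, true_value, nominal):
    """Coverage criteria for simulated confidence intervals at a nominal level."""
    lower_err = np.mean(true_value < lower)   # true value falls below the interval
    upper_err = np.mean(true_value > upper)   # true value falls above the interval
    coverage = 1.0 - lower_err - upper_err    # coverage probability
    half = (1.0 - nominal) / 2                # nominal error probability in each tail
    return {
        "coverage": coverage,
        "coverage_error": abs(coverage - nominal),
        "lower_error": lower_err,
        "upper_error": upper_err,
        "coverage_bias": abs(lower_err - half) + abs(upper_err - half),
    }

# Toy illustration: 4 simulated intervals for a true value of 0.5 at the 90% level.
crit = interval_criteria(np.array([0.1, 0.6, 0.2, 0.3]),
                         np.array([0.9, 0.8, 0.4, 0.7]), 0.5, 0.90)
```

In the simulations below, the same calculation would be applied to the 10,000 intervals produced by each method at each nominal level.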
The coverage probability records the percentage of replications in which the true parameter value falls within the interval. The coverage error records the absolute difference between the nominal level and the coverage probability. The upper (lower) error probability records the percentage of replications in which the true parameter value falls above (below) the interval. The coverage bias is the sum of the absolute differences between the upper and lower error probabilities and their nominal levels.

The first simulation generates 10,000 random samples, each of size 50, from a dataset of brand choice with two independent variables representing gender and age.[7] The simulated dependent variable can take one of three values representing one of three different brands.

[6] This dataset consists of a sample of size 30 from the data available at www.stata-press.com/data/r8/ for car choice.

[7] The dataset consists of a sample of size 50 from the data available at www.ats.ucla.edu/stat/stata/dae/.

The data are provided in Table 5, where X0 represents the constant, X1 represents the gender of the

consumer (coded 1 if the consumer is female), and X2 represents the age of the consumer. The dependent variable is simulated under the following conditions: the first brand is the base category, and the true parameter values are set to -11.7747 and -22.7214 for the constants of brands 2 and 3, respectively; 0.5238 and 0.4659 for the gender parameters of brands 2 and 3, respectively; and 0.3682 and 0.6859 for the age parameters of brands 2 and 3, respectively. The results from this simulation are recorded in Table 6 for nominal 90%, 95%, and 99% confidence intervals covering the true gender parameter of 0.5238 associated with brand 2. The superiority of the higher-order method in terms of coverage error and coverage bias is evident. Notice the skewed tail probabilities produced by both first-order methods.

The second simulation generates 10,000 random samples using the full dataset of 735 observations under conditions similar to those in the first simulation. The dataset is not listed but is available at the website provided earlier. The results from this simulation are provided in Table 7, again for nominal 90%, 95%, and 99% confidence intervals covering the true gender parameter of 0.5238 associated with brand 2. With this larger sample size the first-order methods perform predictably better; however, the asymmetry in the tails, while diminished, still persists.

3 Conclusion

In this paper, higher order likelihood asymptotic theory was applied to testing parameters of the multinomial logit model; improvements over first-order methods were shown using two simulations. Stata code has been made available to facilitate the implementation of these higher order adjustments.

References

[1] Barndorff-Nielsen, O., 1991, Modified Signed Log-Likelihood Ratio, Biometrika 78, 557-563.

[2] Brazzale, A., 1999, Approximate Conditional Inference in Logistic and Loglinear Models, Journal of Computational and Graphical Statistics 8(3), 653-661.

[3] Davison, A., Fraser, D., Reid, N., 2006, Improved Likelihood Inference for Discrete Data, Journal of the Royal Statistical Society Series B 68, 495-508.

[4] Fraser, D., Reid, N., 1995, Ancillaries and Third-Order Significance, Utilitas Mathematica 47, 33-53.

[5] Jensen, J., 1995, Saddlepoint Approximations, Oxford University Press, New York.

[6] Lugannani, R., Rice, S., 1980, Saddle Point Approximation for the Distribution of the Sum of Independent Random Variables, Advances in Applied Probability 12, 475-490.

[7] Pierce, D., Peters, D., 1992, Practical Use of Higher Order Asymptotics for Multiparameter Exponential Families (with discussion), Journal of the Royal Statistical Society Series B 54, 701-738.

Table 1: Data from Field Experiment

Y X0 X1 X2 X3    X4 X5   Y X0 X1 X2 X3    X4 X5
0 1  0  0  10.9  3  1    1 1  1  0  12.25 3  1
1 1  0  0  8.2   2  2    0 1  1  0  11    2  0
0 1  0  0  14.6  2  1    1 1  1  0  10.2  3  2
1 1  0  0  9.33  2  1    0 1  0  1  11.5  2  0
0 1  0  0  11.5  3  6    0 1  0  1  11.67 2  6
0 1  0  0  13.33 1  4    0 1  0  1  12.75 4  2
1 1  0  0  12.33 5  3    0 1  0  1  12.2  3  0
1 1  0  0  13    4  1    1 1  0  1  9.33  4  2
0 1  0  0  11.5  3  0    0 1  0  1  10.2  2  1
1 1  0  0  10.5  2  3    1 1  0  1  11    2  0
1 1  0  0  9.8   4  2    1 1  0  1  10.8  2  4
0 1  0  0  9.33  4  1    0 1  1  1  13.33 1  1
0 1  1  0  13.5  3  1    0 1  1  1  9.33  4  0
0 1  1  0  12    2  1    0 1  1  1  8.67  1  0
1 1  1  0  8.2   1  0    0 1  1  1  9.67  3  4
1 1  1  0  11.8  2  1    0 1  1  1  10    3  14
1 1  1  0  12.5  1  0    1 1  1  1  10.25 6  0
1 1  1  0  9.67  2  2    0 1  1  1  9.33  2  5
1 1  1  0  9.5   3  0    0 1  1  1  10.5  2  0
0 1  1  0  10.75 3  0    0 1  1  1  9.33  1  1

Table 2: Estimation Results

                             p-values
    Coefficient  SE      MLE     LR      RSTAR
X1  -0.3665      0.7689  0.3168  0.3160  0.3658
X2  -1.7582      0.8004  0.0140  0.0094  0.0153
X3  -0.4703      0.2636  0.0372  0.0274  0.0372
X4   0.3468      0.3190  0.1385  0.1361  0.1490
X5  -0.1097      0.1815  0.2728  0.2553  0.3047

Table 3: Data

Y X0 X1 X2     Y X0 X1 X2
3 1  1  46.7   1 1  1  21.6
1 1  1  26.1   2 1  0  44.4
1 1  1  32.7   3 1  1  44.7
2 1  0  49.2   2 1  1  49
1 1  1  24.3   3 1  1  43.8
1 1  0  39     3 1  1  46.6
1 1  1  33     2 1  0  45.6
1 1  1  20.3   3 1  0  40.7
2 1  1  38     3 1  1  46.7
1 1  0  60.4   2 1  0  49.2
1 1  1  69     2 1  1  38
2 1  1  27.7   3 1  1  44.7
2 1  1  41     3 1  1  43.8
2 1  1  65.6   3 1  1  46.6
1 1  1  24.8   3 1  1  46.7

Table 4: Estimation Results

                               p-values
Y     Coefficient  SE      MLE     LR      RSTAR
2 X1  -0.2525      1.1361  0.8241  0.8238  0.8337
  X2   0.0858      0.0519  0.0980  0.0753  0.0950
3 X1   1.5998      1.4213  0.2603  0.2419  0.2920
  X2   0.1000      0.0511  0.0502  0.0273  0.0397

Table 5: Simulation Data

X0 X1 X2   X0 X1 X2   X0 X1 X2   X0 X1 X2   X0 X1 X2
1  0  34   1  0  33   1  1  32   1  0  32   1  1  38
1  1  36   1  1  32   1  0  31   1  1  32   1  0  32
1  1  32   1  1  35   1  1  36   1  1  36   1  1  31
1  0  32   1  1  30   1  0  34   1  0  31   1  1  32
1  1  36   1  0  32   1  1  36   1  1  33   1  1  32
1  0  32   1  1  33   1  1  32   1  0  31   1  1  32
1  1  34   1  1  32   1  1  32   1  1  31   1  0  32
1  1  28   1  1  36   1  1  29   1  0  36   1  1  32
1  1  37   1  1  32   1  1  33   1  1  32   1  1  28
1  1  32   1  1  31   1  1  32   1  1  34   1  0  35

Table 6: Simulation Results for n = 50

              Coverage     Coverage  Lower        Upper        Coverage
CI   Method   Probability  Error     Probability  Probability  Bias
     MLE      0.9085       0.0085    0.0411       0.0504       0.0093
90%  LR       0.8896       0.0104    0.0498       0.0606       0.0108
     RSTAR    0.9065       0.0065    0.0460       0.0475       0.0065
     MLE      0.9602       0.0102    0.0181       0.0217       0.0102
95%  LR       0.9423       0.0077    0.0261       0.0316       0.0077
     RSTAR    0.9533       0.0033    0.0233       0.0234       0.0033
     MLE      0.9966       0.0066    0.0012       0.0022       0.0066
99%  LR       0.9867       0.0033    0.0052       0.0081       0.0033
     RSTAR    0.9910       0.0010    0.0041       0.0049       0.0010

Table 7: Simulation Results for n = 735

              Coverage     Coverage  Lower        Upper        Coverage
CI   Method   Probability  Error     Probability  Probability  Bias
     MLE      0.8943       0.0057    0.0498       0.0559       0.0061
90%  LR       0.8938       0.0062    0.0498       0.0564       0.0066
     RSTAR    0.8945       0.0055    0.0506       0.0549       0.0055
     MLE      0.9460       0.0040    0.0250       0.0290       0.0040
95%  LR       0.9453       0.0047    0.0250       0.0297       0.0047
     RSTAR    0.9460       0.0040    0.0256       0.0284       0.0040
     MLE      0.9880       0.0020    0.0048       0.0072       0.0024
99%  LR       0.9878       0.0022    0.0048       0.0074       0.0026
     RSTAR    0.9879       0.0021    0.0049       0.0072       0.0023