ECONOMETRICS II (ECO 2401) Victor Aguirregabiria. Winter 2017 TOPIC 3: MULTINOMIAL CHOICE MODELS

1. Introduction
2. Nonparametric model
3. Random Utility Models
   - Definition;
   - Common Specification and Normalizations;
   - Choice Probabilities;
   - Some Theorems
4. Logit Model
5. Nested Logit Model
6. Random Coefficients Logit Model
7. Monte Carlo Simulation
8. Simulation-Based Estimation

1. INTRODUCTION

Economics deals with agents' choices. Many important economic decisions can be described as discrete choices within a finite number of choice alternatives.
- Consumer choice of store, brand, or product variety;
- Occupational choice; migration decisions; school / university choice;
- Firms' decisions of where to locate plants / stores, and which products to sell;
- Commuters' choice of transportation mode: car, bus, subway, bicycle, walk, mixed, ...

INTRODUCTION [2]

Let $\mathcal{J} = \{0, 1, \dots, J-1\}$ be the set of choice alternatives that the agent faces. We index choice alternatives by $j$. Let $Y \in \mathcal{J}$ be the variable that represents the actual choice of an individual. Let $X$ be a vector of exogenous variables such as individual characteristics and attributes of each choice alternative.

Using a sample of $\{y, X\}$ we are interested in learning how $X$ affects $Y$.
- How prices or other product attributes affect consumer demand;
- How neighborhood amenities and housing prices affect people's decisions of where to live; ...

Stylized description of model and estimation

The model can be described as:
$$Y = h(X, \varepsilon; \beta)$$
where:
- $\varepsilon$ is a vector of unobservables;
- $\beta$ is a vector of parameters;
- $h(\cdot)$ is a function that maps $(X, \varepsilon; \beta)$ into the choice set $\mathcal{J}$.

Define the Conditional Choice Probability (CCP) function as the probability distribution of $Y$ conditional on $X$. For any pair $(j, x)$:
$$P(j \mid x) \equiv \Pr(Y = j \mid X = x)$$

Stylized description of model and estimation [2]

Suppose that $\varepsilon$ is independent of $X$ with CDF $F(\varepsilon; \sigma)$, where $\sigma$ represents the unknown parameters in this distribution function.

The model $Y = h(X, \varepsilon; \beta)$ and the distribution $F(\varepsilon; \sigma)$ imply a CCP function:
$$P(j \mid x; \beta, \sigma) = \int 1\{h(x, \varepsilon; \beta) = j\} \, dF(\varepsilon; \sigma)$$
where $1\{\cdot\}$ is the indicator function.

Stylized description of model and estimation [3]

The researcher observes a random sample of $N$ agents, indexed by $n$, with information on $\{y_n, x_n : n = 1, 2, \dots, N\}$. She is interested in the estimation of the parameters $\theta \equiv (\beta, \sigma)$.

The (conditional) log-likelihood function for this model and data is
$$\ell_N(\theta) = \sum_{n=1}^{N} \ln \Pr(Y = y_n \mid X = x_n; \theta) = \sum_{n=1}^{N} \ln P(y_n \mid x_n; \theta)$$
The MLE is:
$$\hat{\theta} = \arg\max_{\theta} \ \ell_N(\theta)$$

Predictions / Counterfactual analysis

Given the estimated model, we can make predictions and carry out counterfactual analysis. Let $(x^*, \theta^*)$ be a value of $(X, \theta)$ that differs in some of its components from the observed/estimated value $(x_n, \hat{\theta})$; e.g., a change in an attribute of a choice alternative; a change in agents' characteristics; shutting down the effect of a variable ($\beta_k = 0$); removing some choice alternatives; etc.

We can compare estimated and counterfactual CCPs, $P(j \mid x_n; \hat{\theta})$ and $P(j \mid x^*; \theta^*)$. This is a helpful exercise for policy analysis or managerial decisions.

2. NONPARAMETRIC MODEL

Suppose that $X$ also has a discrete and finite support, $X \in \mathcal{X} \equiv \{1, 2, \dots, M\}$. Consider a fully nonparametric specification of the CCPs.

Nonparametric model: The vector of parameters is the vector of $M \times J$ CCPs, $\mathbf{P} = \{P(j \mid x) : (j, x) \in \mathcal{J} \times \mathcal{X}\}$, with the only restriction that $\sum_{j \in \mathcal{J}} P(j \mid x) = 1$ for any value of $x$.

Let $N_{jx} \equiv \sum_{n=1}^{N} 1\{y_n = j, \ x_n = x\}$ be the number of observations in the sample where we observe $(x_n, y_n) = (x, j)$.

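NONPARAMETRIC MODEL [2]

The log-likelihood function is:
$$\ell_N(\mathbf{P}) = \sum_{n=1}^{N} \ln P(y_n \mid x_n) = \sum_{n=1}^{N} \left[ \sum_{(j,x) \in \mathcal{J} \times \mathcal{X}} 1\{y_n = j, \ x_n = x\} \ln P(j \mid x) \right] = \sum_{(j,x) \in \mathcal{J} \times \mathcal{X}} N_{jx} \ln P(j \mid x)$$
Taking into account the restrictions $\sum_{j=0}^{J-1} P(j \mid x) = 1$ for any value of $x$, the likelihood equations are, for any $j \neq 0$ and any $x$:
$$N_{jx} \frac{1}{P(j \mid x)} - N_{0x} \frac{1}{P(0 \mid x)} = 0$$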

NONPARAMETRIC MODEL [3]

Solving this system of equations we obtain the following expression for the MLE of the CCPs:
$$\hat{p}(j \mid x) = \frac{N_{jx}}{\sum_{i \in \mathcal{J}} N_{ix}}$$
The MLE of this nonparametric model is just the frequency estimator of the CCPs. As usual, this MLE is consistent, asymptotically normal, and efficient given the minimal restrictions in this nonparametric model.
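The frequency estimator can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_ccp` and the toy sample are hypothetical, not part of the original slides.

```python
from collections import Counter

def frequency_ccp(y, x):
    """Frequency estimator of the CCPs: p_hat(j | x) = N_jx / sum_i N_ix."""
    n_jx = Counter(zip(y, x))   # N_jx: observations with (y_n, x_n) = (j, x)
    n_x = Counter(x)            # sum_i N_ix: observations with x_n = x
    return {(j, xv): count / n_x[xv] for (j, xv), count in n_jx.items()}

# Hypothetical toy sample with J = 3 alternatives and M = 2 values of x.
y = [0, 1, 1, 2, 0, 0, 1, 2]
x = [1, 1, 1, 1, 2, 2, 2, 2]
p_hat = frequency_ccp(y, x)
print(p_hat[(1, 1)])  # share of observations with x = 1 that choose j = 1
```

Cells (j, x) that never appear in the sample are simply absent from the dictionary, which corresponds to an estimated probability of zero and illustrates the imprecision for values of x with few observations.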

Some Limitations of this Nonparametric model

[1] It may be very imprecise for values of $x$ with few observations [use kernel or nearest-neighbor methods under the additional assumption that the CCP function is smooth in $x$].

[2] When $x$ is continuous: curse of dimensionality in the speed of asymptotic convergence.

[3] The function $h(\cdot)$ and the distribution of the unobservables $F(\cdot)$ are not separately identified. This is relevant for some economic questions.

3. RANDOM UTILITY MODELS

An agent should choose one alternative from a choice set with $J$ mutually exclusive alternatives, $\mathcal{J} \equiv \{0, 1, \dots, J-1\}$. We use $n$ to index agents, $i$ or $j$ to index alternatives, and $k$ to index explanatory variables.

Let $Y_n \in \mathcal{J}$ be the random variable that represents the choice of agent $n$.

Assumption of Utility Maximization: The agent makes this choice to maximize her payoff or utility:
$$Y_n = \arg\max_{j \in \mathcal{J}} \ U_n(j)$$
where $U_n(j)$ is the utility or payoff for agent $n$ of choosing alternative $j$.

Principle of Revealed Preference

Suppose that we observe an agent $n$ making choices under different choice sets $\mathcal{J}$: i.e., $y_n(\mathcal{J}_1), y_n(\mathcal{J}_2), \dots, y_n(\mathcal{J}_T)$. Under the Assumption of Utility Maximization, the agent's choices reveal information on her preferences.

This is a powerful principle in Econometrics and it is behind the estimation of demand or supply functions.

RANDOM UTILITY MODELS (2)

In a Random Utility Model (RUM) the specification of $U_n(j)$ is:
$$U_n(j) = u_n(j, X_{jn}) + \varepsilon_{jn}$$
where:
- $X_{jn}$ is a $K \times 1$ vector of characteristics of agent $n$ and/or choice alternative $j$ that are observable to the researcher;
- $u_n(\cdot)$ is a real-valued function;
- $\varepsilon_n = \{\varepsilon_{0n}, \varepsilon_{1n}, \dots, \varepsilon_{J-1,n}\}$ represents variables that are unobservable to the researcher but observable to the agent, and therefore affect her choice.

RANDOM UTILITY MODELS (3)

A common specification of a RUM is:
$$U_n(j) = X_j \beta_n + W_{jn} \gamma_n + Z_n \alpha_j + \varepsilon_{jn}$$
- $X_j$ is a $1 \times K_x$ vector of characteristics of alternative $j$ (e.g., price);
- $Z_n$ is a $1 \times K_z$ vector of observable attributes of the agent (e.g., income);
- $W_{jn}$ is a $1 \times K_w$ vector of characteristics of alternative $j$ that vary across individuals (e.g., commuting time to work using transportation mode $j$);
- $\alpha_j$ is a $K_z \times 1$ vector of parameters;
- $\beta_n$ and $\gamma_n$ are $K_x \times 1$ and $K_w \times 1$ vectors, respectively, that represent the marginal utility of each product attribute.

RANDOM COEFFICIENTS RUMs

We can distinguish two types of models according to the specification of the coefficients $\beta_n$ and $\gamma_n$.

Models without random coefficients. Either $(\beta_n, \gamma_n)$ are constant parameters (i.e., $\beta_n = \beta$ and $\gamma_n = \gamma$ for any $n$) or they are deterministic functions of the agent's observable attributes $Z_n$. In the latter case, the terms $X_j \beta_n + W_{jn} \gamma_n$ are equivalent to $\widetilde{W}_{jn} \widetilde{\gamma}$, where $\widetilde{W}_{jn}$ includes products of characteristics $X_j$ and attributes $Z_n$.

Models with random coefficients. $\beta_n$ and/or $\gamma_n$ depend on random variables that are unobservable to the researcher.

EXAMPLE 1: Choice of Transportation Mode to Work

$Y \in \{$ Walking, Bike, Bus, Metro, Car $\}$

$X_j$ = ( Price per mile )

$W_{jn}$ = ( Commuting time using mode $j$ )

$Z_n$ = ( Income, Age, Gender, etc. )

EXAMPLE 2: Choice (Demand) of Differentiated Product (Laptops)

$Y \in \{$ every laptop product available in the market $\}$

$X_j$ = ( Price, Brand, CPU speed, Screen size, Weight, Color, RAM, HD size, etc. )

$\beta_n$ contains the marginal utilities of each product attribute (for individual $n$);

$W_{jn}$ = Interactions of $X_j$ with $Z_n$ = ( Income, Age, Gender, etc. ).

SOME NORMALIZATIONS

For constants $a_n$ and $b_n > 0$, the function $a_n + b_n U_n(j)$ is a positive affine transformation of utility $U_n(j)$. Any positive affine transformation of the utility function generates the same (utility maximizing) behavior for agent $n$.

Therefore, we need to make some normalization assumptions on the parameters in the utility function $U_n(j) = X_j \beta_n + W_{jn} \gamma_n + Z_n \alpha_j + \varepsilon_{jn}$, such that we can identify these parameters.

SOME NORMALIZATIONS [2]

A necessary condition to identify a parameter is that a marginal change in the parameter implies a change in the optimal choice of some agents in the population (such that some CCPs change). Another necessary condition is that it is not possible to completely offset the effect on all CCPs of a marginal change in the parameter by making a marginal change in another parameter.

Consider a model with 3 choice alternatives. $\{Y_n = 2\}$ if and only if:
$$\varepsilon_{0n} - \varepsilon_{2n} \leq (X_2 - X_0) \beta_n + (W_{2n} - W_{0n}) \gamma_n + Z_n (\alpha_2 - \alpha_0)$$
$$\varepsilon_{1n} - \varepsilon_{2n} \leq (X_2 - X_1) \beta_n + (W_{2n} - W_{1n}) \gamma_n + Z_n (\alpha_2 - \alpha_1)$$

SOME STANDARD NORMALIZATIONS [3]

Some standard normalization assumptions are:

(1) No constant terms (i.e., no $a_n$ that does not depend on $j$): no constant term in $X_j \beta_n$ or $W_{jn} \gamma_n$;

(2) If the model includes $X_j$, then $Z_n$ does not include a constant term.

(3) $\alpha_0 = 0$. [If all the $\alpha_j$'s are additively transformed by the same constant, the optimal choice does not change].

(4) $Var(\varepsilon_{1n} - \varepsilon_{0n}) = 1$. [If we multiply all the differences $\varepsilon_{jn} - \varepsilon_{0n}$ by the same positive constant, the optimal choice does not change].

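Choice Probabilities in RUM

The CCP for alternative $j$ is (omitting the agent subindex $n$):
$$P(j \mid x) = \Pr\left( u(j, x) + \varepsilon_j \geq u(i, x) + \varepsilon_i \ \text{ for any } i \neq j \right)$$
$$= \Pr\left( \varepsilon_i \leq \varepsilon_j + u(j, x) - u(i, x) \ \text{ for any } i \neq j \right)$$
$$= \int_{-\infty}^{+\infty} F_{\varepsilon_{-j} \mid \varepsilon_j}\left( \varepsilon_j + u(j, x) - u(i, x) \ \text{ for any } i \neq j \right) f_{\varepsilon_j}(\varepsilon_j) \, d\varepsilon_j$$
where $f_{\varepsilon_j}$ is the marginal density of $\varepsilon_j$, and $F_{\varepsilon_{-j} \mid \varepsilon_j}$ is the CDF of $\varepsilon_{-j} \equiv \{\varepsilon_i : i \neq j\}$ conditional on $\varepsilon_j$.

This is an integral of dimension $J$. It has a closed-form expression only for some specifications of the CDF $F(\varepsilon)$.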

4. MULTINOMIAL LOGIT MODEL [without random coefficients]

We have $U_n(j) = X_{jn} \beta + \varepsilon_{jn}$, where $\varepsilon_{jn}$ are i.i.d. over $(n, j)$ with a Type 1 Extreme Value distribution.

Type 1 Extreme Value is also called the Gumbel distribution. For any $j$, the CDF is $F(\varepsilon_j) = \exp\{-\exp\{-\varepsilon_j\}\}$ and the PDF is $f(\varepsilon_j) = \exp\{-\varepsilon_j - \exp\{-\varepsilon_j\}\}$. The PDF is asymmetric.

The difference of two independent Type 1 Extreme Value variables has a Logistic distribution with CDF
$$F(\varepsilon_j - \varepsilon_i) = \frac{\exp\{\varepsilon_j - \varepsilon_i\}}{1 + \exp\{\varepsilon_j - \varepsilon_i\}}$$

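Under this assumption on the distribution of $\varepsilon$, we have the following form for the CCPs:
$$P(j) = \int_{-\infty}^{+\infty} F_{\varepsilon_{-j} \mid \varepsilon_j}\left( \varepsilon_j + u_j - u_i \ \text{ for any } i \neq j \right) f_{\varepsilon_j}(\varepsilon_j) \, d\varepsilon_j$$
$$= \int_{-\infty}^{+\infty} f(\varepsilon_j) \left[ \prod_{i \neq j} F(\varepsilon_j + u_j - u_i) \right] d\varepsilon_j$$
$$= \int_{-\infty}^{+\infty} \exp\{-\varepsilon_j\} \prod_{i \in \mathcal{J}} \exp\left\{ -\exp\{-\varepsilon_j - u_j + u_i\} \right\} d\varepsilon_j$$
$$= \int_{-\infty}^{+\infty} \exp\{-\varepsilon_j\} \exp\left\{ -\exp\{-\varepsilon_j - u_j\} \sum_{i \in \mathcal{J}} \exp\{u_i\} \right\} d\varepsilon_j$$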

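Define $S \equiv \sum_{i \in \mathcal{J}} \exp\{u_i\}$, and make the change in variable $v = \varepsilon_j + u_j - \ln S$:
$$P(j) = \int_{-\infty}^{+\infty} \exp\{-v + u_j - \ln S\} \exp\{-\exp\{-v\}\} \, dv$$
$$= \exp\{u_j - \ln S\} \int_{-\infty}^{+\infty} \exp\{-v\} \exp\{-\exp\{-v\}\} \, dv$$
$$= \frac{\exp\{u_j\}}{\exp\{\ln S\}} = \frac{\exp\{u_j\}}{\sum_{k=0}^{J-1} \exp\{u_k\}}$$
where the last integral is equal to 1 because its integrand is the Type 1 Extreme Value PDF.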
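As a numerical check of this closed form, the following sketch compares the logit formula with the share of simulated agents who choose each alternative when the unobservables are drawn from the Type 1 Extreme Value distribution. The utilities, the number of draws, and the function names are hypothetical choices made for illustration.

```python
import math
import random

def logit_ccp(u):
    """Closed-form logit CCPs: P(j) = exp(u_j) / sum_k exp(u_k)."""
    u_max = max(u)                       # subtract the max for numerical stability
    e = [math.exp(v - u_max) for v in u]
    s = sum(e)
    return [v / s for v in e]

def gumbel_draw(rng):
    """Type 1 Extreme Value draw obtained by inverting F(e) = exp(-exp(-e))."""
    u = max(rng.random(), 1e-300)        # guard against log(0)
    return -math.log(-math.log(u))

def simulated_ccp(u, R, seed=0):
    """Share of R simulated agents for whom u_j + e_j is the maximum utility."""
    rng = random.Random(seed)
    counts = [0] * len(u)
    for _ in range(R):
        total = [v + gumbel_draw(rng) for v in u]
        counts[total.index(max(total))] += 1
    return [c / R for c in counts]

u = [0.0, 1.0, -0.5]                     # hypothetical utilities u_0, u_1, u_2
p_closed = logit_ccp(u)
p_sim = simulated_ccp(u, R=20000)
print(p_closed)
print(p_sim)
```

With a large number of draws the two vectors should be close, up to simulation error.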

ML ESTIMATION OF LOGIT MODEL

The closed-form expression for the CCPs is very convenient for the estimation of the model. Consider the logit model $U_n(j) = X_{jn} \beta + \varepsilon_{jn}$, and the random sample $\{y_n, x_n : n = 1, 2, \dots, N\}$. The log-likelihood function is:
$$\ell_N(\beta) = \sum_{n=1}^{N} \ln \Pr(Y = y_n \mid X = x_n; \beta) = \sum_{n=1}^{N} \sum_{j \in \mathcal{J}} 1\{y_n = j\} \ln \left[ \frac{\exp\{x_{jn} \beta\}}{\sum_{i \in \mathcal{J}} \exp\{x_{in} \beta\}} \right]$$
This log-likelihood function is globally concave in $\beta$. Furthermore, the gradient and Hessian of this function have simple closed-form expressions. Therefore, the numerical computation of the MLE can be implemented in a simple way using Newton's method.

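ML ESTIMATION OF LOGIT MODEL (2)

You can verify that in a Logit
$$\frac{\partial P(j)}{\partial u_j} = P(j) \, [1 - P(j)]$$
Taking this into account, we can show that
$$\frac{\partial \ln P(j \mid x_n; \beta)}{\partial \beta} = x_{jn} - m(x_n, \beta)$$
where $m(x_n, \beta) \equiv \sum_{i \in \mathcal{J}} x_{in} P(i \mid x_n; \beta)$. The likelihood equations are:
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{j=0}^{J-1} \left[ x_{jn} - m(x_n, \beta) \right] \left[ 1\{y_n = j\} - P(j \mid x_n; \beta) \right] = 0$$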

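But it is clear that $\sum_{j \in \mathcal{J}} m(x_n, \beta) \left[ 1\{y_n = j\} - P(j \mid x_n; \beta) \right] = 0$ because $\sum_{j \in \mathcal{J}} 1\{y_n = j\} = \sum_{j \in \mathcal{J}} P(j \mid x_n; \beta) = 1$. Therefore,
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{j=0}^{J-1} x_{jn} \left[ 1\{y_n = j\} - P(j \mid x_n; \beta) \right] = 0$$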
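The score above and Newton's method can be sketched as follows for the special case of a single explanatory variable (K = 1), so that no matrix algebra is needed. The simulated data, the true value of 1.0, and all function names are assumptions made for this illustration only.

```python
import math
import random

def ccp(x_n, beta):
    """Logit CCPs for one agent with scalar attributes x_jn and scalar beta."""
    u = [beta * x for x in x_n]
    u_max = max(u)
    e = [math.exp(v - u_max) for v in u]
    s = sum(e)
    return [v / s for v in e]

def score_and_hessian(y, X, beta):
    """Score sum_n [x_{y_n,n} - m(x_n, beta)] and Hessian of the log-likelihood."""
    score, hess = 0.0, 0.0
    for y_n, x_n in zip(y, X):
        p = ccp(x_n, beta)
        m_n = sum(p_j * x_j for p_j, x_j in zip(p, x_n))   # m(x_n, beta)
        score += x_n[y_n] - m_n
        hess -= sum(p_j * (x_j - m_n) ** 2 for p_j, x_j in zip(p, x_n))
    return score, hess

def newton_mnl(y, X, beta=0.0, tol=1e-10, max_iter=50):
    """Newton iterations beta <- beta - score / hessian until the step is tiny."""
    for _ in range(max_iter):
        score, hess = score_and_hessian(y, X, beta)
        step = score / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta

# Hypothetical data simulated from the logit model with beta = 1 and J = 3.
rng = random.Random(0)
beta_true, N, J = 1.0, 2000, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(J)] for _ in range(N)]
y = []
for x_n in X:
    util = [beta_true * x - math.log(-math.log(max(rng.random(), 1e-300))) for x in x_n]
    y.append(util.index(max(util)))

beta_hat = newton_mnl(y, X)
print(beta_hat)
```

Because the log-likelihood is globally concave, the iterations do not depend much on the starting value; with K > 1 the same steps apply with a gradient vector and a Hessian matrix.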

MLE as Method of Moments Estimator of Regression-like model

These likelihood equations provide an interpretation of the MLE as a Method of Moments Estimator (MME) of a regression-like representation of the MNL. By definition of CCPs, we have that
$$P(j \mid x_n) = \Pr(y_n = j \mid x_n) = E\left( 1\{y_n = j\} \mid x_n \right)$$
Therefore,
$$1\{y_n = j\} = P(j \mid x_n) + v_{jn}$$
where, by construction, the error term $v_{jn}$ is such that $E(v_{jn} \mid x_n) = 0$. This system of equations can be seen as a regression-like representation of a multinomial choice model.

MLE as MME of Regression-like model [2]

At the true parameters, the following moment conditions should hold for any $j$:
$$E\left( x_{jn} \left[ 1\{y_n = j\} - P(j \mid x_n; \beta) \right] \right) = 0$$
The MME is based on the sample counterpart of the population moment conditions:
$$\frac{1}{N} \sum_{n=1}^{N} x_{jn} \left[ 1\{y_n = j\} - P(j \mid x_n; \beta) \right] = 0$$
The MLE combines these $J \times K$ moment conditions in a particular way that is the optimal combination. For the MNL, these optimal moment conditions have a very simple form:
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{j=0}^{J-1} x_{jn} \left[ 1\{y_n = j\} - P(j \mid x_n; \beta) \right] = 0$$

Independence of Irrelevant Alternatives

The logit model imposes the restriction that the ratio between the probabilities of two alternatives, say $j$ and $i$, depends ONLY on the utilities of these alternatives, and not on the utilities of other alternatives:
$$\frac{P(j)}{P(i)} = \frac{\exp\{u_j\}}{\exp\{u_i\}}$$
Therefore, if we change the choice set $\mathcal{J}$ by adding and/or removing alternatives, the ratios between probabilities should not change. For any two choice sets $\mathcal{J}$ and $\mathcal{J}'$ (that include $j$ and $i$ as alternatives), we have that:
$$\frac{P(j \mid \mathcal{J})}{P(i \mid \mathcal{J})} = \frac{P(j \mid \mathcal{J}')}{P(i \mid \mathcal{J}')}$$
This property can generate some unrealistic predictions.

Independence of Irrelevant Alternatives (2)

Consider consumers deciding which car model to purchase. The set of available models in year 2014 (i.e., choice set $\mathcal{J}_{2014}$) includes model "Lux", which is a luxury car, and model "Econ", which is a very modest and inexpensive car. In year 2014, their market shares are:
$$P(Lux \mid \mathcal{J}_{2014}) = 0.10; \quad P(Econ \mid \mathcal{J}_{2014}) = 0.40; \quad \frac{P(Lux \mid \mathcal{J}_{2014})}{P(Econ \mid \mathcal{J}_{2014})} = \frac{1}{4}$$
In 2015, the new luxury model "NewLux" (very similar to Lux) appears in the market. The logit model predicts that:
$$P(Lux \mid \mathcal{J}_{2015}) = 0.10 \, (1 - P_{NewLux})$$
$$P(Econ \mid \mathcal{J}_{2015}) = 0.40 \, (1 - P_{NewLux})$$

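Independence of Irrelevant Alternatives (3)

For instance, if $P_{NewLux} = 10\%$, then $P(Lux \mid \mathcal{J}_{2015}) = 9\%$ and $P(Econ \mid \mathcal{J}_{2015}) = 36\%$, which seems unrealistic: NewLux takes market share from Lux and Econ in the same proportion, although it is a much closer substitute for Lux.

The IIA is an implication of the Logit property that the unobservables $\varepsilon_j$ are i.i.d. across alternatives, such that the differences $\varepsilon_j - \varepsilon_i$ have the same distribution for any pair of choices.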
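The car example can be reproduced numerically. In this sketch the residual alternative "Other" (the remaining 50% of the 2014 market) and the utility levels are hypothetical: utilities are set to the log of the 2014 shares, and the utility of NewLux is chosen so that its 2015 share is 10%.

```python
import math

def logit_shares(u):
    """u: dict mapping alternative name -> utility. Returns logit market shares."""
    denom = sum(math.exp(v) for v in u.values())
    return {name: math.exp(v) / denom for name, v in u.items()}

# Utilities chosen to reproduce the 2014 shares: u_j = ln(share_j).
u_2014 = {"Lux": math.log(0.10), "Econ": math.log(0.40), "Other": math.log(0.50)}
s_2014 = logit_shares(u_2014)

# NewLux enters with a utility that gives it a 10% share in 2015.
u_2015 = dict(u_2014, NewLux=math.log(1.0 / 9.0))
s_2015 = logit_shares(u_2015)

for name in ("Lux", "Econ", "NewLux"):
    print(name, round(s_2015[name], 4))
print("ratio Lux/Econ:", s_2014["Lux"] / s_2014["Econ"], s_2015["Lux"] / s_2015["Econ"])
```

The ratio between the shares of Lux and Econ is the same in both years, which is the IIA property.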

IIA and Average Partial Effects

In most applications we are interested in the estimation of Average Partial Effects (APE). In a discrete choice model, define $APE_{k,j}(x)$ as the APE of variable $k$ on choice alternative $j$ when the explanatory variables are $x$:
$$APE_{k,j}(x) \equiv \frac{\partial P(j \mid x)}{\partial x_k}$$
Does the MNL impose a restrictive / unrealistic structure on these APEs?

The answer to this question depends on whether variable $X_k$ is an attribute of a choice alternative (i.e., $X_{kj}$) or an attribute of the agent (i.e., $Z_{kn}$).

IIA and Average Partial Effects [2]

Remember that in the MNL:
$$P_{jn} = \frac{\exp\{X_j \beta + Z_n \alpha_j\}}{\sum_{i=0}^{J-1} \exp\{X_i \beta + Z_n \alpha_i\}}$$
Consider the APE on $P_j$ of a change in $X_i$ for $i \neq j$ (i.e., the effect on the demand of product $j$ of a change in the price of product $i$):
$$\frac{\partial P_{jn}}{\partial X_i} = -\beta \, P_{jn} \, P_{in}$$
The effect is proportional to $P_{jn}$ and $P_{in}$. Two products $j$ with the same $P_j$ are affected in exactly the same way by an increase in the price of product $i$. This seems very restrictive.

IIA and Average Partial Effects [3]

Consider now the partial effect of a change in a characteristic of individual $n$:
$$\frac{\partial P_{jn}}{\partial Z_n} = P_{jn} \left( \alpha_j - \sum_{i=0}^{J-1} \alpha_i P_{in} \right)$$
This is not a particularly restrictive APE. As in a binary choice model, this APE goes to zero when $P_{jn} \to 0$ and when $P_{jn} \to 1$. It depends on the value of $\alpha_j$ relative to the other $\alpha$'s, and these parameters are unrestricted.

Solutions to Independence of Irrelevant Alternatives

Different models have been proposed to deal with this limitation of the Logit model:

(1) Multinomial probit;

(2) Nested Logit;

(3) Random coefficients logit.

5. NESTED LOGIT MODEL

Suppose that the set $\mathcal{J}$ of choice alternatives can be partitioned into $G$ (mutually exclusive) groups of alternatives, which we index by $g$. Let $\mathcal{J}_g$ be the set of alternatives in group $g$, such that:
$$\mathcal{J} = \bigcup_{g=1}^{G} \mathcal{J}_g$$
The idea is that alternatives within a group share some unobserved features that make them closer substitutes than alternatives in different groups.

Formally, the assumption is that the vector of unobservables $\varepsilon = (\varepsilon_0, \varepsilon_1, \dots, \varepsilon_{J-1})$ has a Generalized Extreme Value (GEV) distribution:
$$F(\varepsilon) = \exp\left\{ -\sum_{g=1}^{G} \left[ \sum_{j \in \mathcal{J}_g} \exp\left( -\frac{\varepsilon_j}{\lambda_g} \right) \right]^{\lambda_g} \right\}$$
where $\lambda_1, \lambda_2, \dots, \lambda_G$ are positive parameters, with $\lambda_g \leq 1$.

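NESTED LOGIT MODEL [2]

Consider the RUM $Y = \arg\max_{j \in \mathcal{J}} \{u_j + \varepsilon_j\}$ where $\varepsilon = (\varepsilon_0, \varepsilon_1, \dots, \varepsilon_{J-1})$ has a GEV distribution. The CCPs of this model have the following form:
$$P_j = P^{(1)}_g \ P^{(2)}_{j \mid g}$$
with
$$P^{(2)}_{j \mid g} = \frac{\exp\left\{ u_j / \lambda_g \right\}}{\sum_{i \in \mathcal{J}_g} \exp\left\{ u_i / \lambda_g \right\}}; \qquad P^{(1)}_g = \frac{\exp\{\lambda_g I_g\}}{\sum_{g'=1}^{G} \exp\{\lambda_{g'} I_{g'}\}}$$
and $I_g$ are the group inclusive values:
$$I_g = \ln\left( \sum_{j \in \mathcal{J}_g} \exp\left\{ \frac{u_j}{\lambda_g} \right\} \right)$$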
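A minimal sketch of these nested logit CCPs, using the inclusive values as defined above. The alternative names, utilities, and values of the group parameters are hypothetical.

```python
import math

def nested_logit_ccp(u, groups, lam):
    """u: dict alternative -> utility; groups: dict g -> list of alternatives;
    lam: dict g -> lambda_g. Returns P_j = P_g * P_{j|g} for every alternative."""
    # Inclusive values I_g = ln sum_{j in J_g} exp(u_j / lambda_g)
    incl = {g: math.log(sum(math.exp(u[j] / lam[g]) for j in alts))
            for g, alts in groups.items()}
    denom = sum(math.exp(lam[g] * incl[g]) for g in groups)
    probs = {}
    for g, alts in groups.items():
        p_g = math.exp(lam[g] * incl[g]) / denom              # group probability
        for j in alts:
            probs[j] = p_g * math.exp(u[j] / lam[g] - incl[g])  # times P_{j|g}
    return probs

u = {"Lux": 0.2, "NewLux": 0.1, "Econ": 1.0}        # hypothetical utilities
groups = {"luxury": ["Lux", "NewLux"], "economy": ["Econ"]}

p_nested = nested_logit_ccp(u, groups, {"luxury": 0.5, "economy": 1.0})
p_logit = nested_logit_ccp(u, groups, {"luxury": 1.0, "economy": 1.0})
print(p_nested)
print(p_logit)   # with all lambda_g = 1 the model collapses to the MNL
```

Setting every group parameter to 1 recovers the standard logit probabilities, which gives a simple way to check an implementation.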

NESTED LOGIT MODEL [3]

The NL has an interpretation as a sequential decision model. Let $Y^{(1)}_n \in \{1, 2, \dots, G\}$ represent agent $n$'s choice of group, and let $Y^{(2)}_n$ represent the choice of a specific alternative. The model implies that:
$$\Pr(Y^{(1)}_n = g \mid X_n) = P^{(1)}_g(X_n)$$
and
$$\Pr(Y^{(2)}_n = j \mid X_n, Y^{(1)}_n = g) = P^{(2)}_{j \mid g}(X_n)$$
Therefore, the log-likelihood function of the model,
$$\ell(\theta) = \sum_{n=1}^{N} \ln \Pr(Y_n \mid X_n; \theta)$$
can be written as the sum of two likelihoods, $\ell(\theta) = \ell^{(1)}(\theta) + \ell^{(2)}(\theta)$:
$$\ell(\theta) = \sum_{n=1}^{N} \sum_{g=1}^{G} 1\{y^{(1)}_n = g\} \ln P^{(1)}_g(X_n; \theta) \ + \ \sum_{n=1}^{N} \sum_{j \in \mathcal{J}_{y^{(1)}_n}} 1\{y^{(2)}_n = j\} \ln P^{(2)}_{j \mid y^{(1)}_n}(X_n; \theta)$$

NESTED LOGIT MODEL [4]

The Nested Logit maintains the property of IIA for alternatives within the same group but not for alternatives in different groups.

In the example of the demand for cars: the new car will have a stronger substitution effect within its own group, e.g., luxury cars.

TWO-STEP ESTIMATION OF NL

Note that $\ell(\theta) = \ell^{(1)}(\theta) + \ell^{(2)}(\theta)$ where:
- $\ell^{(1)}(\theta)$ is the between-group likelihood function for the choice variable $Y^{(1)}_n$ conditional on $X_n$;
- $\ell^{(2)}(\theta)$ is the within-group likelihood function for the choice variable $Y^{(2)}_n$ conditional on $X_n$ and $Y^{(1)}_n$.

We can estimate a combination of the parameters in $\theta$ by maximizing $\ell^{(2)}(\theta)$, and another combination of parameters by maximizing $\ell^{(1)}(\theta)$.

This two-step procedure is not statistically efficient but it is computationally very convenient because each step consists of a standard MNL estimation (i.e., a globally concave likelihood function).

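TWO-STEP ESTIMATION OF NL [2]

Step 1: Maximization of the within-group likelihood function $\ell^{(2)}(\theta)$ with probabilities:
$$P_{j \mid g, n} = \frac{\exp\{X_j \beta_g + Z_n \alpha_{j,g}\}}{\sum_{i \in \mathcal{J}_g} \exp\{X_i \beta_g + Z_n \alpha_{i,g}\}}$$
where the estimated parameters are $\beta_g \equiv \beta / \lambda_g$ and $\alpha_{j,g} \equiv \alpha_j / \lambda_g$ (where one $\alpha_{j,g}$ is normalized to zero within each group).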

TWO-STEP ESTIMATION OF NL [3]

Step 2: Construct the estimated inclusive values:
$$\hat{I}_g = \ln\left( \sum_{j \in \mathcal{J}_g} \exp\{X_j \hat{\beta}_g + Z_n \hat{\alpha}_{j,g}\} \right)$$
And maximize the between-group likelihood function $\ell^{(1)}(\theta)$ with probabilities:
$$P_{g, n} = \frac{\exp\{C_g \delta + \lambda_g \hat{I}_g\}}{\sum_{g'=1}^{G} \exp\{C_{g'} \delta + \lambda_{g'} \hat{I}_{g'}\}}$$
where:
- $C_g$ represents observable group characteristics (if any);
- The estimated parameters are $\lambda_g$, with one of these parameters normalized to zero within each group.

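EFFICIENT ESTIMATION OF NL

Given this consistent two-step estimator, we can construct an efficient estimator, and a valid variance-covariance matrix, by doing one Newton or BHHH iteration in the estimation of the full likelihood function:
$$\hat{\theta}_{eff} = \hat{\theta}_{2step} - \left[ \frac{\partial^2 \ell(\hat{\theta}_{2step})}{\partial \theta \, \partial \theta'} \right]^{-1} \frac{\partial \ell(\hat{\theta}_{2step})}{\partial \theta}$$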

6. RANDOM COEFFICIENTS LOGIT (MIXED LOGIT)

In the standard RC Logit we have that:
$$U_{jn} = X_{jn} \beta_n + \varepsilon_{jn}$$
where:
- $\varepsilon_{jn}$ are i.i.d. over $(n, j)$ Type 1 Extreme Value;
- $\beta_n$ is i.i.d. over $n$ with distribution $N(b, \Sigma)$;
- $\varepsilon_n$ and $\beta_n$ are independent.

We can also represent $\beta_n$ as $\beta_n = b + W v_n$, where $W$ is a $K \times K$ lower triangular matrix that is the Cholesky decomposition of $\Sigma$ (i.e., $W W' = \Sigma$) and $v_n = (v_{1n}, v_{2n}, \dots, v_{Kn})'$ is a vector of independent standard normals.

RC LOGIT

We can write
$$U_{jn} = X_{jn} b + [X_{jn} W] \, v_n + \varepsilon_{jn} = X_{jn} b + \left[ \sum_{k=1}^{K} \left( \sum_{k'=k}^{K} X_{k'jn} \, w_{k'k} \right) v_{kn} \right] + \varepsilon_{jn}$$
The parameters of the model are $b$ and $W$.

The RC Logit can be generalized to allow for a nonparametric specification of the distribution of $v_n$. Fox, Kim, Ryan, and Bajari (JoE, 2012) show that the nonparametric RC Logit is identified.

RC LOGIT - CCPs

To obtain CCPs, we should integrate over $\varepsilon_n$ and $v_n$ the optimal decision $\{y_n = j\}$, i.e., $\{X_{jn} b + [X_{jn} W] v_n + \varepsilon_{jn} \geq X_{in} b + [X_{in} W] v_n + \varepsilon_{in} \text{ for any } i \neq j\}$:
$$P(j \mid x_n) = \int \frac{\exp\{X_{jn} b + [X_{jn} W] v\}}{\sum_{i=0}^{J-1} \exp\{X_{in} b + [X_{in} W] v\}} \ \prod_{k=1}^{K} \phi(v_k) \, dv_k$$
where $\phi$ is the standard normal density. It requires numerical integration over the distribution of the $K$ random variables $\{v_k\}$.

RC LOGIT and IIA

Consider the effect on $P_j$ of a marginal change in the attributes of product $i \neq j$.

In the Logit model, this effect is the same for every choice alternative:
$$\frac{\partial P_j}{\partial X_i} = -b \, P_j \, P_i$$
In the RC Logit, this effect is
$$\frac{\partial P_j}{\partial X_i} = -\int [b + W v] \ \sigma_j(v) \, \sigma_i(v) \, f(v) \, dv$$
where $\sigma_j(v) = \exp\{X_j b + [X_j W] v\} \, / \, \sum_{i=0}^{J-1} \exp\{X_i b + [X_i W] v\}$.

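RC LOGIT and IIA [2]

The effect $\frac{\partial P_j}{\partial X_i} = -\int [b + W v] \ \sigma_j(v) \, \sigma_i(v) \, f(v) \, dv$ depends on $E_v\left( \sigma_j(v) \, \sigma_i(v) \right)$, which is equal to $Cov\left( \sigma_j(v), \sigma_i(v) \right) + P_j P_i$:
$$\frac{\partial P_j}{\partial X_i} = \left. \frac{\partial P_j}{\partial X_i} \right|_{Logit} - \ b \ Cov\left( \sigma_j(v), \sigma_i(v) \right)$$
This covariance depends on the distance between the vectors $X_j$ and $X_i$.
- When $\| X_j - X_i \|$ is small, low values of $\sigma_j(v)$ are associated with low values of $\sigma_i(v)$, and $Cov\left( \sigma_j(v), \sigma_i(v) \right) > 0$;
- When $\| X_j - X_i \|$ is large, $Cov\left( \sigma_j(v), \sigma_i(v) \right)$ can be zero or even negative.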

RC LOGIT - MLE

Given a random sample, the log-likelihood function $\ell_N(b, W)$ is:
$$\sum_{n=1}^{N} \sum_{j=0}^{J-1} 1\{y_n = j\} \ln \left[ \int \frac{\exp\{X_{jn} b + [X_{jn} W] v\}}{\sum_{i=0}^{J-1} \exp\{X_{in} b + [X_{in} W] v\}} \ \prod_{k=1}^{K} \phi(v_k) \, dv_k \right]$$
The MLE is the value of $(b, W)$ that maximizes $\ell_N(b, W)$. This MLE has the standard good properties:
- The MLE is consistent and asymptotically normal (CAN) and asymptotically efficient (AE).
- $\ell_N(b, W)$ is twice continuously differentiable in $(b, W)$: we can use gradient methods (e.g., Newton, BHHH) to search for the MLE.
- $\ell_N(b, W)$ is not globally concave but it is concave in $b$ given $W$.

RC LOGIT - MLE [2]

The main issue in the implementation of the MLE of the RC Logit is the computation of the CCPs by solving the multiple integration problem.

We can use Monte Carlo simulation methods to approximate the CCPs. However, we need to take into account how the approximation error affects the properties of our estimators.

7. MONTE CARLO SIMULATION

Monte Carlo simulation is a general method to approximate multiple-dimensional integrals. It is used not only in econometrics but in any scientific application, empirical or theoretical, that requires the computation of multiple-dimensional integrals.

Let $v = (v_1, v_2, \dots, v_K)$ be a vector of continuous random variables with joint CDF $\Phi(v)$ that is twice continuously differentiable. Let $P$ be a parameter that is defined as:
$$P = \int h(v) \, d\Phi(v)$$
There is NOT a closed-form expression for this integral.

Fundamental Theorem of Sampling

Let $v$ be a random variable with CDF $F(v)$ that is twice continuously differentiable. Then:

(1) The random variable $u = F(v)$ has a distribution $U[0, 1]$.

(2) There exists an inverse function $F^{-1}(\cdot)$ such that $v = F^{-1}(u)$.

(1)+(2) If $\{u_1, u_2, \dots, u_R\}$ are $R$ i.i.d. random draws from a $U[0, 1]$, then $\{F^{-1}(u_1), F^{-1}(u_2), \dots, F^{-1}(u_R)\}$ are $R$ i.i.d. random draws from the distribution $F$.

Proof:

(2) $F$ is strictly increasing. Therefore, by the inverse function theorem, there exists an inverse function.

(1) Given the random variable $u = F(v)$ and a constant $u_0 \in [0, 1]$, we have that:
$$CDF_u(u_0) = \Pr(u \leq u_0) = \Pr\left( F^{-1}(u) \leq F^{-1}(u_0) \right) = \Pr\left( v \leq F^{-1}(u_0) \right) = F(F^{-1}(u_0)) = u_0$$
So, $u$ has a $U[0, 1]$ distribution.

Example 1 [Logistic]: $v \sim$ Logistic. The CDF is
$$F(v) = \frac{\exp(v)}{1 + \exp(v)}$$
The inverse CDF (i.e., the quantile function) of the Logistic is:
$$F^{-1}(u) = \ln\left( \frac{u}{1 - u} \right)$$
Then, we can get a random draw from the Logistic by getting a draw $u$ from $U[0, 1]$ and then applying the transformation:
$$v = \ln\left( \frac{u}{1 - u} \right)$$
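The logistic example translates directly into code. The sketch below uses Python's standard `random` module for the uniform draws; the sample size and the seed are arbitrary.

```python
import math
import random

def logistic_draw(rng):
    """Inverse-CDF draw from the Logistic: v = ln(u / (1 - u)), u ~ U[0, 1]."""
    u = max(rng.random(), 1e-300)   # rng.random() is in [0, 1); avoid u = 0
    return math.log(u / (1.0 - u))

rng = random.Random(0)
draws = [logistic_draw(rng) for _ in range(20000)]

# Compare the empirical CDF at v = 1 with F(1) = exp(1) / (1 + exp(1)).
empirical = sum(1 for v in draws if v <= 1.0) / len(draws)
theoretical = math.exp(1.0) / (1.0 + math.exp(1.0))
print(empirical, theoretical)
```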

Example 2 [Multivariate Normal]: Let $v = (v_1, v_2, \dots, v_K)$ be a vector of Normal random variables $N(m, \Sigma)$. Then, we can write:
$$v = m + W v^*$$
where:
- $v^* = (v^*_1, v^*_2, \dots, v^*_K)$ is a vector of i.i.d. standard normals;
- $W$ is a lower triangular matrix obtained as the Cholesky decomposition of $\Sigma$, i.e., $W W' = \Sigma$.

We can get a random draw of $v \sim N(m, \Sigma)$ by taking $K$ independent random draws from $U[0, 1]$, $(u_1, u_2, \dots, u_K)$, and then applying the transformation:
$$v = m + W \begin{pmatrix} \Phi^{-1}(u_1) \\ \vdots \\ \Phi^{-1}(u_K) \end{pmatrix}$$
where $\Phi^{-1}$ is the inverse of the CDF of the standard normal.
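A sketch of this transformation using only the Python standard library: the Cholesky factor is computed with a short hand-written routine, and `statistics.NormalDist().inv_cdf` plays the role of the inverse standard normal CDF. The mean vector and covariance matrix are hypothetical values.

```python
import math
import random
from statistics import NormalDist

def cholesky(S):
    """Lower triangular W with W W' = S, for a symmetric positive definite S."""
    K = len(S)
    W = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1):
            s = sum(W[i][k] * W[j][k] for k in range(j))
            if i == j:
                W[i][j] = math.sqrt(S[i][i] - s)
            else:
                W[i][j] = (S[i][j] - s) / W[j][j]
    return W

def draw_mvn(m, W, rng):
    """One draw v = m + W z, with z_k = Phi^{-1}(u_k) and u_k ~ U[0, 1]."""
    std_normal = NormalDist()
    z = [std_normal.inv_cdf(max(rng.random(), 1e-16)) for _ in m]
    return [m[i] + sum(W[i][k] * z[k] for k in range(i + 1)) for i in range(len(m))]

m = [1.0, -1.0]                       # hypothetical mean vector
Sigma = [[4.0, 2.0], [2.0, 3.0]]      # hypothetical covariance matrix
W = cholesky(Sigma)
rng = random.Random(0)
draws = [draw_mvn(m, W, rng) for _ in range(20000)]
```

The sample means, variances, and covariance of `draws` should be close to `m` and `Sigma` when the number of draws is large.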

Frequency Simulator

Remember that $P$ is a parameter defined as:
$$P = E(h(v)) = \int h(v) \, \phi(v) \, dv$$
Let $\{v_1, v_2, \dots, v_R\}$ be $R$ independent random draws from the CDF $\Phi$. Then, the Frequency Simulator of $P$ is defined as:
$$\tilde{P}_R = \frac{1}{R} \sum_{r=1}^{R} h(v_r)$$
The simulation error is $e_R = \tilde{P}_R - P$.

Properties of the Frequency Simulator

(1) Unbiased: $E(\tilde{P}_R) = P$.

(2) Variance: $Var(\tilde{P}_R) = \dfrac{Var(h(v))}{R}$.

(3) Consistent: As $R$ goes to infinity, $\tilde{P}_R \to_p P$.

(4) The simulation error is asymptotically normal: $\dfrac{\sqrt{R} \ e_R}{\sqrt{Var(h(v))}} \to_d N(0, 1)$.
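Properties (1) and (2) can be checked by simulation. In this sketch, which is only an illustration, h(v) is the indicator that a standard normal v is below 0.5, so that P is known exactly; the values of R and of the number of replications S are arbitrary.

```python
import random
from statistics import NormalDist, mean, pvariance

def h(v):
    """Hypothetical integrand: indicator of the event v <= 0.5."""
    return 1.0 if v <= 0.5 else 0.0

def frequency_simulator(R, rng):
    """Frequency simulator (1/R) sum_r h(v_r) with v_r i.i.d. N(0, 1)."""
    return sum(h(rng.gauss(0.0, 1.0)) for _ in range(R)) / R

rng = random.Random(0)
R, S = 200, 2000                       # draws per simulator, number of replications
estimates = [frequency_simulator(R, rng) for _ in range(S)]

P_true = NormalDist().cdf(0.5)         # exact value of P = E[h(v)]
var_theory = P_true * (1.0 - P_true) / R   # Var(h(v)) / R for an indicator
print(mean(estimates), P_true)
print(pvariance(estimates), var_theory)
```

The average of the replicated simulators should be close to P, and their variance close to Var(h(v))/R.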

Properties of the Frequency Simulator (2)

In general, it is possible to obtain simulators that are more precise (with lower variance) than the frequency simulator.

Another limitation of the FS is that if $h(\cdot)$ is discontinuous or non-differentiable with respect to some parameters, then the simulator is also discontinuous and non-differentiable.

This may have important implications in the estimation of discrete choice models. Some simulation-based estimators that use the frequency simulator of CCPs are such that:
- The criterion function is a step function of the parameters: numerical optimization problems;
- The estimator may not be asymptotically normal.

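Simulation-Based Estimation using Frequency Simulator

Consider the RUM with utilities $U_{jn} = X_{jn} [b + W v_n] + \varepsilon_{jn}$ such that:
$$P_{jn}(\theta) = \int 1\left\{ \varepsilon_{in} \leq \varepsilon_{jn} + (X_{jn} - X_{in}) [b + W v_n] \ \text{ for any } i \neq j \right\} f(\varepsilon_n, v_n) \, d\varepsilon_n \, dv_n$$
The Simulated MLE (SMLE) of $\theta = (b, W)$ using the Frequency Simulator is the value of $\theta$ that maximizes the simulated log-likelihood:
$$\ell^{(R)}(\theta) = \sum_{n=1}^{N} \sum_{j=0}^{J-1} 1\{y_n = j\} \ln \tilde{P}^{(R)}_{jn}(\theta)$$
where $\tilde{P}^{(R)}_{jn}(\theta)$ is the frequency simulator of $P_{jn}(\theta)$.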

Simulation-Based Estimation using Frequency Simulator [2]

The frequency simulator $\tilde{P}^{(R)}_{jn}(\theta)$ is:
$$\tilde{P}^{(R)}_{jn}(\theta) = \frac{1}{R} \sum_{r=1}^{R} 1\left\{ \varepsilon^{(r)}_{in} \leq \varepsilon^{(r)}_{jn} + (X_{jn} - X_{in}) [b + W v^{(r)}_n] \ \text{ for any } i \neq j \right\}$$
where $\{\varepsilon^{(r)}_n, v^{(r)}_n : r = 1, 2, \dots, R\}$ are $R$ i.i.d. draws from the distribution $f(\varepsilon_n, v_n)$.

This was the SMLE proposed in a seminal paper by Lerman and Manski (1981) for the Multinomial Probit, i.e., $\varepsilon_n \sim N(0, \Omega)$ and $W = 0$.

Sim-Based Estimation using Frequency Simulator [3]

This estimator has very poor statistical and computational properties. There are different issues.

[1] For choice alternatives with low $P_{jn}(\theta)$ we have that $\tilde{P}^{(R)}_{jn}(\theta) = 0$, unless $R$ is very large. The log-likelihood becomes minus infinity, even at the true $\theta$.

[2] $R$ should be very large to have an estimator with decent properties.

[3] $\ell^{(R)}(\theta)$ is a step function. Standard gradient methods do not work.

[4] For fixed $R$, the estimator is not consistent and it is not asymptotically normal. Poor small sample properties.

Solutions to the Problems of SMLE with Frequency Simulator

We will study several methods and results that overcome the limitations of the SMLE with the Frequency simulator.

[1] Smoothing probabilities using the RC Logit model (SMLE is root-n consistent as $R$ goes to infinity; small bias even with $R$ not too large);

[2] Importance-sampling simulators: GHK for the Probit model (SMLE is root-n consistent as $R$ goes to infinity; small bias even with $R$ not too large);

[3] Simulated Method of Moments + smooth simulator (root-n consistent and asymptotically normal estimator even when $R$ is fixed and small).

Solution using RC Logit

In the RC Logit we take into account that $\varepsilon_n$ is independent of $v_n$ with an Extreme Value distribution, such that we have closed-form expressions for the probabilities conditional on $v_n$, and these probabilities are smooth functions:
$$P_{jn}(\theta) = \int \frac{\exp\{X_{jn} b + [X_{jn} W] v_n\}}{\sum_{i=0}^{J-1} \exp\{X_{in} b + [X_{in} W] v_n\}} \ f(v_n) \, dv_n$$
Therefore, we can use the simulator:
$$\tilde{P}^{(R)}_{jn}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \frac{\exp\{X_{jn} b + [X_{jn} W] v^{(r)}_n\}}{\sum_{i=0}^{J-1} \exp\{X_{in} b + [X_{in} W] v^{(r)}_n\}}$$
where $\{v^{(r)}_n : r = 1, 2, \dots, R\}$ are $R$ i.i.d. draws from the distribution $f(v_n)$.

Solution using RC Logit [2]

This smooth simulator $\tilde{P}^{(R)}_{jn}(\theta)$ has several important advantages over the frequency simulator.
- $\tilde{P}^{(R)}_{jn}(\theta)$ is continuously differentiable in $\theta$.
- It is always $> 0$ and $< 1$ for any value of $R$, even for $R = 1$.
- The variance of its simulation error is substantially smaller than that of the simulation error of the frequency simulator.
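The smooth simulator can be sketched as follows, assuming standard normal draws for the random-coefficient shocks. The attribute matrix `X`, the mean vector `b`, and the Cholesky factor `W` below are hypothetical values chosen for illustration.

```python
import math
import random

def logit_probs(u):
    """Logit probabilities exp(u_j) / sum_i exp(u_i)."""
    u_max = max(u)
    e = [math.exp(v - u_max) for v in u]
    s = sum(e)
    return [v / s for v in e]

def rc_logit_simulator(X, b, W, R, rng):
    """Average over R draws of v of the logit probabilities given beta = b + W v.
    X: J rows of K attributes; b: K means; W: K x K lower triangular matrix."""
    J, K = len(X), len(b)
    p = [0.0] * J
    for _ in range(R):
        v = [rng.gauss(0.0, 1.0) for _ in range(K)]
        beta = [b[k] + sum(W[k][l] * v[l] for l in range(K)) for k in range(K)]
        u = [sum(X[j][k] * beta[k] for k in range(K)) for j in range(J)]
        for j, p_j in enumerate(logit_probs(u)):
            p[j] += p_j / R
    return p

X = [[1.0, 0.5], [2.0, -1.0], [0.0, 0.0]]    # hypothetical attributes, J = 3, K = 2
b = [-1.0, 0.5]
W = [[0.8, 0.0], [0.3, 0.5]]

p_one_draw = rc_logit_simulator(X, b, W, R=1, rng=random.Random(0))
p_many = rc_logit_simulator(X, b, W, R=2000, rng=random.Random(0))
print(p_one_draw)   # strictly inside (0, 1) even with a single draw
print(p_many)
```

Each term in the average is a smooth function of (b, W) for fixed draws, so keeping the same draws across evaluations of the likelihood keeps the simulated criterion differentiable.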

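Importance sampling simulation (IS)

Let $\psi$ be a density function different from $\phi$. The density $\psi$ is called the Importance Sampling density. By definition of $P$, we have that:
$$P = E_{\phi}(h(v)) = \int h(v) \, \phi(v) \, dv = \int h(v) \, \frac{\phi(v)}{\psi(v)} \, \psi(v) \, dv = E_{\psi}\left( h(v) \, \frac{\phi(v)}{\psi(v)} \right)$$
Let $\{v_1, v_2, \dots, v_R\}$ be $R$ independent random draws from $\psi$. Then, the Importance Sampling Simulator (ISS, based on $\psi$) of $P$ is defined as:
$$\tilde{P}_R = \frac{1}{R} \sum_{r=1}^{R} h(v_r) \, \frac{\phi(v_r)}{\psi(v_r)}$$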

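Properties of IS

(1) Unbiased: $E_{\psi}(\tilde{P}_R) = P$.

(2) Variance: $Var_{\psi}(\tilde{P}_R) = \dfrac{1}{R} \, Var_{\psi}\left( h(v) \, \dfrac{\phi(v)}{\psi(v)} \right)$.

(3) Consistent: As $R$ goes to infinity, $\tilde{P}_R \to_p P$.

(4) Asymptotically normal: $\dfrac{\sqrt{R} \ e_R}{\sqrt{Var_{\psi}\left( h(v) \, \phi(v) / \psi(v) \right)}} \to_d N(0, 1)$.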

Relative variances of FS and IS

$$Var(\tilde{P}_{FS}) = \frac{Var_{\phi}(h(v))}{R} \quad \text{and} \quad Var(\tilde{P}_{IS}) = \frac{1}{R} \, Var_{\psi}\left( h(v) \, \frac{\phi(v)}{\psi(v)} \right)$$
Therefore, if the ratios $\phi(v) / \psi(v)$ are smaller than 1 for values of $v$ with large $(h(v) - P)^2$, then the IS will have lower variance than the FS.

For the ISS to have a lower variance than the FS, the IS density $\psi$ should oversample (relative to $\phi$) those regions in the support of $v$ where $(h(v) - P)^2$ is large.
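A small illustration of this comparison, under assumptions chosen for the example: P is the tail probability Pr(v > 3) of a standard normal, and the IS density is a normal with mean 3 and unit variance, which oversamples the region where h(v) = 1.

```python
import math
import random
from statistics import NormalDist

def frequency_tail(c, R, rng):
    """Frequency simulator of P = Pr(v > c) with v ~ N(0, 1)."""
    return sum(1 for _ in range(R) if rng.gauss(0.0, 1.0) > c) / R

def importance_sampling_tail(c, R, rng):
    """IS simulator of P = Pr(v > c) using the IS density psi = N(c, 1)."""
    total = 0.0
    for _ in range(R):
        v = rng.gauss(c, 1.0)                          # draw from psi
        if v > c:                                      # h(v) = 1{v > c}
            total += math.exp(-c * v + 0.5 * c * c)    # ratio phi(v) / psi(v)
    return total / R

c, R = 3.0, 10000
P_true = 1.0 - NormalDist().cdf(c)
p_fs = frequency_tail(c, R, random.Random(0))
p_is = importance_sampling_tail(c, R, random.Random(0))
print(P_true, p_fs, p_is)
```

With the same number of draws, the frequency simulator is based on only a handful of draws in the tail, while the IS simulator uses roughly half of its draws there and reweights them.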

Simulation of Multinomial Probit probabilities: GHK Simulator

Let $\varepsilon = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_J)$ be a vector of Normal random variables with vector of means 0 and variance-covariance matrix $\Omega$. Let $c = (c_1, c_2, \dots, c_J)$ be a vector of constants. Consider the following probability:
$$P = \Pr(\varepsilon_1 \geq c_1, \ \varepsilon_2 \geq c_2, \ \dots, \ \varepsilon_J \geq c_J) = \int 1\{\varepsilon_1 \geq c_1, \ \varepsilon_2 \geq c_2, \ \dots, \ \varepsilon_J \geq c_J\} \ \phi(\varepsilon; \Omega) \, d\varepsilon$$
These probabilities appear in a Multinomial Probit model.

The Geweke-Hajivassiliou-Keane (GHK) simulator is a very efficient simulator of these probabilities. It is also continuously differentiable in the argument (vector of parameters) $c$.

GHK Simulator (2)

Let $W = \{w_{ij}\}$ be a lower triangular matrix that comes from the Cholesky decomposition of $\Omega$. Then, $\varepsilon = W z$, where $z$ is a vector of $J$ independent standard normals, such that:
$$\varepsilon_1 = w_{11} z_1$$
$$\varepsilon_2 = w_{21} z_1 + w_{22} z_2$$
$$\vdots$$
$$\varepsilon_J = w_{J1} z_1 + w_{J2} z_2 + \dots + w_{JJ} z_J$$
and
$$\{\varepsilon_j \geq c_j\} = \{w_{j1} z_1 + w_{j2} z_2 + \dots + w_{jj} z_j \geq c_j\} = \{z_j \geq \tilde{c}_j - \tilde{w}_{j1} z_1 - \dots - \tilde{w}_{j,j-1} z_{j-1}\}$$
with $\tilde{c}_j = c_j / w_{jj}$ and $\tilde{w}_{ji} = w_{ji} / w_{jj}$ for $i < j$.

GHK Simulator (3)

Therefore,
$$P = \int 1\left\{ z_j \geq \tilde{c}_j - \tilde{w}_{j1} z_1 - \dots - \tilde{w}_{j,j-1} z_{j-1} \ \text{ for any } j \right\} \phi(z) \, dz$$

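GHK Simulator (4)

Consider the following IS density, $f^*(z^*)$. Here "truncated at $c$" means that the support of the draw is restricted to values greater than or equal to $c$.

[1] $z^*_1 \sim \{z_1 \mid z_1 \geq \tilde{c}_1\}$ is a random draw from the standard normal truncated at $\tilde{c}_1$.

[2] Given $z^*_1$, then $z^*_2 \sim \{z_2 \mid z_2 \geq \tilde{c}_2 - \tilde{w}_{21} z^*_1\}$ is a random draw from the standard normal truncated at $\tilde{c}_2 - \tilde{w}_{21} z^*_1$.

...

[j] Given $(z^*_1, \dots, z^*_{j-1})$, then $z^*_j \sim \{z_j \mid z_j \geq \tilde{c}_j - \tilde{w}_{j1} z^*_1 - \dots - \tilde{w}_{j,j-1} z^*_{j-1}\}$ is a random draw from the standard normal truncated at $\tilde{c}_j - \tilde{w}_{j1} z^*_1 - \dots - \tilde{w}_{j,j-1} z^*_{j-1}$.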

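GHK Simulator (5)

What is the form of the IS density $f^*(z^*)$?

Note that the density of a random variable $z^*$ that is a normal truncated at $c$ is
$$\frac{\phi(z^*)}{1 - \Phi(c)}$$
where here $\phi$ and $\Phi$ represent the pdf and cdf of the standard normal, respectively. Then, by definition:
$$f^*(z^*) = \frac{\phi(z^*_1)}{1 - \Phi(\tilde{c}_1)} \ \frac{\phi(z^*_2)}{1 - \Phi(\tilde{c}_2 - \tilde{w}_{21} z^*_1)} \ \cdots \ \frac{\phi(z^*_J)}{1 - \Phi(\tilde{c}_J - \tilde{w}_{J1} z^*_1 - \dots - \tilde{w}_{J,J-1} z^*_{J-1})}$$
$$= \frac{\phi(z^*_1) \, \phi(z^*_2) \cdots \phi(z^*_J)}{\prod_{j=1}^{J} \left[ 1 - \Phi\left( \tilde{c}_j - \tilde{w}_{j1} z^*_1 - \dots - \tilde{w}_{j,j-1} z^*_{j-1} \right) \right]}$$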

79 GHK Simulator (5) The GHK simulator of P is the ISS that uses IS density f (z ). Let fz r : r = 1; 2; :::; Rg be R independent random draws from the IS density f (z ). Then, ep (R) GHK = 1 R Note that: RX r=1 1 n z jr ec j ew j1 z 1 ::: ew j;j 1 z j 1 for any jo (z r) f (z r) - (z r) f (z r) = Q h J j=1 1 ( ec j ew j1 z1 ::: ew j;j 1 zj 1 )i - By construction of the z r simulations, the indicator of n zjr ec j ew j1 z1 ::: ew j;j is always 1.

80 Therefore,

$$\tilde P^{GHK}_R = \frac{1}{R} \sum_{r=1}^{R} \prod_{j=1}^{J} \left[1 - \Phi(\tilde c_j - \tilde w_{j1} z_1^r - \cdots - \tilde w_{j,j-1} z_{j-1}^r)\right]$$

81 Properties of GHK Simulator

[1] It is unbiased, consistent, and asymptotically normal.

[2] It has substantially lower variance than the FS. In some standard settings the ratio of variances can be of the order of 100 or even higher.

[3] $\tilde P^{GHK}_R(c, \Sigma)$ is continuously differentiable in the parameters $(c, \Sigma)$.

[4] $\tilde P^{GHK}_R$ is always strictly greater than 0 and lower than 1.

[5] It is simple to get random draws from a truncated standard normal. If $z^*$ is a right-truncated normal at $c$ (that is, $z^* \geq c$), then its CDF is

$$F(z^*) = \frac{\Phi(z^*) - \Phi(c)}{1 - \Phi(c)}$$

such that, given $u \sim U[0, 1]$,

$$z^* = F^{-1}(u) = \Phi^{-1}\left(\Phi(c) + [1 - \Phi(c)] \, u\right)$$
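The recursion on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `cholesky_lower` and `ghk_probability` are hypothetical, the event is taken to be $\varepsilon_j \geq c_j$ for all $j$ (the direction implied by the $1 - \Phi(\cdot)$ factors), and the clamp on the inverse-CDF argument is a numerical safeguard added here.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .cdf is Phi, .inv_cdf is Phi^{-1}


def cholesky_lower(sigma):
    """Return lower-triangular W with W W' = sigma (sigma symmetric, positive definite)."""
    size = len(sigma)
    W = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for i in range(j + 1):
            partial = sum(W[j][k] * W[i][k] for k in range(i))
            if i == j:
                W[j][j] = math.sqrt(sigma[j][j] - partial)
            else:
                W[j][i] = (sigma[j][i] - partial) / W[i][i]
    return W


def ghk_probability(c, sigma, uniforms):
    """GHK estimate of Pr(eps_j >= c_j for all j) when eps ~ N(0, sigma).

    uniforms: R rows, each with J draws from U(0, 1). Keeping them fixed
    across calls makes the estimate a smooth function of (c, sigma).
    """
    W = cholesky_lower(sigma)
    total = 0.0
    for u in uniforms:
        z = []          # sequential truncated draws z_1, ..., z_J
        weight = 1.0    # running product of [1 - Phi(truncation point)]
        for j in range(len(c)):
            # Truncation point for z_j given the earlier draws z_1, ..., z_{j-1}.
            bound = (c[j] - sum(W[j][i] * z[i] for i in range(j))) / W[j][j]
            phi_bound = STD_NORMAL.cdf(bound)
            weight *= 1.0 - phi_bound
            # Inverse-CDF draw from the standard normal restricted to z_j >= bound.
            p = phi_bound + (1.0 - phi_bound) * u[j]
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf's argument inside (0, 1)
            z.append(STD_NORMAL.inv_cdf(p))
        total += weight
    return total / len(uniforms)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [[rng.random() for _ in range(2)] for _ in range(5000)]
    sigma = [[1.0, 0.5], [0.5, 1.0]]
    print(ghk_probability([0.0, 0.0], sigma, draws))
```

For two standard normals with correlation 0.5 and $c = (0, 0)$, the exact orthant probability is $1/4 + \arcsin(0.5)/(2\pi) = 1/3$, which gives a convenient check on the simulated value.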

82 8. SIMULATION-BASED ESTIMATION (SBE) 8.1. Refreshing Estimation and Asymptotic Theory 8.2. SBE: Conditions on the Simulators 8.3. Simulated Based Estimators: SMM and SMLE 8.4. Asymptotic Properties

83 8.1 REFRESHING ESTIMATION & ASYMPTOTIC THEORY

Consider a discrete choice model with CCPs $P(j | x_n, \theta)$, where $\theta$ is a $q \times 1$ vector of parameters. Let $\theta_0$ be the true value of $\theta$ in the population under study. Let $\{y_n, x_n : n = 1, 2, \ldots, N\}$ be a random sample from the population. The model implies the following moment conditions:

$$E\left( \sum_{j=1}^{J} z_{jn} \left[ 1\{y_n = j\} - P(j | x_n, \theta_0) \right] \right) = 0$$

where $z_{jn}$ is a $q \times 1$ vector of functions of $x_n$, e.g., $z_{jn} = (x_{jn}', \; \sum_{i \neq j} x_{in}', \; [x_{jn} \otimes x_{jn}]')$.

The population likelihood equations are a particular example of these moment conditions. In this case, $z_{jn} = \partial \ln P(j | x_n, \theta_0) / \partial \theta$ and

$$E\left( \sum_{j=1}^{J} \frac{\partial \ln P(j | x_n, \theta_0)}{\partial \theta} \left[ 1\{y_n = j\} - P(j | x_n, \theta_0) \right] \right) = 0$$

85 We can represent these moment conditions in a compact form as:

$$E\left( z_n \left[ 1_n - P_n(\theta_0) \right] \right) = 0$$

where $z_n$ is the $q \times J$ matrix $(z_{1n}, z_{2n}, \ldots, z_{Jn})$; $1_n$ is the $J \times 1$ vector $(1\{y_n = 1\}, 1\{y_n = 2\}, \ldots, 1\{y_n = J\})'$; and $P_n(\theta)$ is the $J \times 1$ vector $(P(1 | x_n, \theta), P(2 | x_n, \theta), \ldots, P(J | x_n, \theta))'$. Or even in a more compact form:

$$E\left( g_n(\theta_0) \right) = 0$$

where $g_n(\theta) \equiv z_n \left[ 1_n - P_n(\theta) \right]$.

Identification Assumption. $\theta_0$ is the unique value in the parameter space $\Theta$ that solves the system of equations $E(g_n(\theta)) = 0$.

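86 ESTIMATION. The estimator $\hat\theta_N$ is the value that solves the system of sample moment conditions:

$$\frac{1}{N} \sum_{n=1}^{N} g_n(\theta) = \frac{1}{N} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta) \right] = 0$$

Example: MLE. When $z_{jn} = \partial \ln P(j | x_n, \theta) / \partial \theta$, and $z_n = \partial \ln P_n(\theta)' / \partial \theta$, we have that the sample moment conditions above define the MLE:

$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \ln P_n(\theta)'}{\partial \theta} \left[ 1_n - P_n(\theta) \right] = 0$$

Example: GMM. Consider a GMM estimation of $\theta$ based on the moment conditions $m_N(\theta) = \frac{1}{N} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta) \right]$, where the number of instruments (the row dimension of $z_n$) is $q^*$ and it is greater than the dimension of $\theta$, i.e., an over-identified model, $q^* > q$.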

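87 Example: GMM [Cont]. The GMM criterion function is $m_N(\theta)' A_N m_N(\theta)$, where $A_N$ is a $q^* \times q^*$ positive-definite matrix and $q^*$ is the number of instruments (the row dimension of $z_n$). The GMM estimator should satisfy the first order conditions:

$$\frac{\partial m_N(\theta)'}{\partial \theta} A_N \, m_N(\theta) = 0$$

These moment conditions can be written in the form $N^{-1} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta) \right] = 0$. In particular,

$$\frac{1}{N} \sum_{n=1}^{N} z_n^*(\theta) \left[ 1_n - P_n(\theta) \right] = 0$$

where $z_n^*(\theta) = W(\theta) \, z_n$, and $W(\theta)$ is the $q \times q^*$ weighting matrix

$$W(\theta) = \frac{\partial m_N(\theta)'}{\partial \theta} A_N = -\left[ \frac{1}{N} \sum_{n=1}^{N} \frac{\partial P_n(\theta)'}{\partial \theta} z_n' \right] A_N$$

Example: MM. Of course, the representation of the estimator as the solution to the system $\frac{1}{N} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta) \right] = 0$ includes as a particular case a Method of Moments estimator where $z_n$ is the matrix of instruments chosen by the researcher (i.e., functions of $x_n$) and the dimension of $z_n$ is $q \times J$.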

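88 CONSISTENCY. Suppose that: (a) $P_n(\theta)$ is continuously differentiable in $\theta$; (b) for any $\theta \in \Theta$, we have that $Var(g_n(\theta)) = Var(z_n [1_n - P_n(\theta)])$ is finite; (c) $\Theta$ is a compact set; and (d) $\theta_0$ is the unique value in the parameter space that solves the system of equations $E(g_n(\theta)) = 0$. Then, as $N \to \infty$, (i) $\frac{1}{N} \sum_{n=1}^{N} g_n(\theta)$ converges in probability uniformly in $\theta \in \Theta$ to $E(g_n(\theta))$; (ii) $\hat\theta_N \to_p \theta_0$.

ASYMPTOTIC NORMALITY. By definition of $\hat\theta_N$, we have that $\frac{1}{N} \sum_{n=1}^{N} g_n(\hat\theta_N) = 0$. Using a Taylor expansion around $\theta = \theta_0$, we have that:

$$\sqrt{N} \left( \hat\theta_N - \theta_0 \right) = -\left[ \frac{1}{N} \sum_{n=1}^{N} \frac{\partial g_n(\theta_0)}{\partial \theta'} \right]^{-1} \left[ \frac{1}{\sqrt{N}} \sum_{n=1}^{N} g_n(\theta_0) \right] + o_p(1)$$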

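89 ASYMPTOTIC NORMALITY [Cont.] Then, under standard regularity conditions, we have that $\frac{1}{N} \sum_{n=1}^{N} \partial g_n(\theta_0) / \partial \theta' \to_p E\left( \partial g_n(\theta_0) / \partial \theta' \right) \equiv G_0$, and $\frac{1}{\sqrt{N}} \sum_{n=1}^{N} g_n(\theta_0) \to_d N(0, \Omega_0)$ with $\Omega_0 = E\left( g_n(\theta_0) \, g_n(\theta_0)' \right)$. By Slutsky's Theorem:

$$\sqrt{N} \left( \hat\theta_N - \theta_0 \right) \to_d N\left( 0, \; G_0^{-1} \, \Omega_0 \, G_0^{-1\prime} \right)$$

In our Discrete Choice Model we have that:

$$G_0 = -E\left( z_n \frac{\partial P_n(\theta_0)}{\partial \theta'} \right), \qquad \Omega_0 = E\left( z_n \, \Sigma_{P_n} \, z_n' \right)$$

where $\Sigma_{P_n}$ is the $J \times J$ matrix where the element $(j, j)$ in the main diagonal is $P(j | x_n, \theta_0) \left[ 1 - P(j | x_n, \theta_0) \right]$, and the element $(j, i)$ out of the main diagonal is $-P(j | x_n, \theta_0) \, P(i | x_n, \theta_0)$.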

90 8.2. SBE: CONDITIONS ON SIMULATORS

Given a model defined by the moment conditions $E(g_n(\theta_0)) = 0$, a Simulation-Based Estimator is the value $\hat\theta_{N,R}$ that solves the moment conditions:

$$\frac{1}{N} \sum_{n=1}^{N} \tilde g^R_n(\theta) = \frac{1}{N} \sum_{n=1}^{N} \tilde z^R_n(\theta) \left[ 1_n - \tilde P^R_n(\theta) \right] = 0$$

where $\tilde P^R_n(\theta)$ is the $J \times 1$ vector of simulators $(\tilde P^R(1 | x_n, \theta), \tilde P^R(2 | x_n, \theta), \ldots, \tilde P^R(J | x_n, \theta))'$.

Conditions on Simulator

[1] The simulator is based on independent random draws $\{u_{nr} : n = 1, 2, \ldots, N; \; r = 1, 2, \ldots, R\}$ from a $U[0, 1]$. Each observation $n$ has its own $R$ independent random draws.

91 Conditions on Simulator [Cont.]

[2] These random draws are made at the beginning of the estimation procedure and they are kept fixed during the implementation of the algorithm that searches for $\hat\theta_{N,R}$. That is, the same set of random draws is used to construct simulators for different values of $\theta$. If new drawings were made at each iteration of the algorithm, they would introduce new randomness at each step, it would not be possible to obtain numerical convergence of the algorithm, and the asymptotic properties of the estimator would not hold: McFadden (Econometrica, 1989).

Note that some components in $\theta$ are parameters in the distribution of the unobservables. The values of these parameters change during our search for the estimator $\hat\theta_{N,R}$. Therefore, we cannot keep constant the random draws from the distribution of the unobservables. However, we can always keep constant the random draws from the distribution $U[0, 1]$. For instance, if $\varepsilon_n \sim N(0, \sigma^2)$, we have that $\varepsilon_{nr} = \sigma \, \Phi^{-1}(u_{nr})$ with $u_{nr} \sim U(0, 1)$, such that the random draws $\{u_{nr}\}$ remain constant over the estimation procedure, but the values $\varepsilon_{nr}$ are modified as $\sigma$ changes.
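The idea of holding the uniform draws fixed while the implied shocks move with the parameters can be sketched as follows. This is only an illustration of the $\varepsilon_{nr} = \sigma \, \Phi^{-1}(u_{nr})$ example; the helper names `draw_uniforms` and `shocks_from_uniforms` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .inv_cdf is Phi^{-1}


def draw_uniforms(n_obs, n_draws, seed=0):
    """Draw the u_nr once, before estimation starts (N rows of R draws)."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_obs):
        row = []
        while len(row) < n_draws:
            u = rng.random()
            if u > 0.0:  # inv_cdf needs a value strictly inside (0, 1)
                row.append(u)
        table.append(row)
    return table


def shocks_from_uniforms(uniforms, sigma):
    """Map the fixed uniforms into eps_nr = sigma * Phi^{-1}(u_nr)."""
    return [[sigma * STD_NORMAL.inv_cdf(u) for u in row] for row in uniforms]


if __name__ == "__main__":
    uniforms = draw_uniforms(n_obs=3, n_draws=4)
    # The optimizer tries two values of sigma; the uniforms are reused.
    eps_a = shocks_from_uniforms(uniforms, sigma=1.0)
    eps_b = shocks_from_uniforms(uniforms, sigma=2.0)
    print(eps_a[0])
    print(eps_b[0])
```

Because the same `uniforms` table is reused at every trial value of $\sigma$, the simulated shocks change smoothly with $\sigma$, and so does any simulator built from them.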

93 Conditions on Simulator [Cont.]

[3] The simulator $\tilde P^R(j | x_n, \theta)$ is continuously differentiable in $\theta$, and it is always within $(0, 1)$.

[4] For any value $(j, x_n, \theta)$, the simulator $\tilde P^R(j | x_n, \theta)$ is unbiased and, as $R$ goes to infinity, it is consistent and asymptotically normal:

$$E\left[ \tilde P^R(j | x_n, \theta) \right] = P(j | x_n, \theta)$$

$$\text{As } R \to \infty, \quad \tilde P^R(j | x_n, \theta) \to_p P(j | x_n, \theta)$$

$$\text{As } R \to \infty, \quad \sqrt{R} \left[ \tilde P^R_n(\theta) - P_n(\theta) \right] \to_d N\left( 0, \tilde V(x_n, \theta) \right)$$

where $\tilde V(x_n, \theta)$ is the variance matrix of the $J$ simulation errors.

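94 8.3. SIMULATED BASED ESTIMATORS: SMM and SMLE

Simulated Method of Moments (SMM). It is the value $\hat\theta_{N,R}$ that solves the moment conditions:

$$\frac{1}{N} \sum_{n=1}^{N} z_n \left[ 1_n - \tilde P^R_n(\theta) \right] = 0$$

Simulated Maximum Likelihood (SML). It is the value $\hat\theta_{N,R}$ that solves the moment conditions:

$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \ln \tilde P^R_n(\theta)'}{\partial \theta} \left[ 1_n - \tilde P^R_n(\theta) \right] = 0$$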

95 8.4. ASYMPTOTIC PROPERTIES

There are two types of asymptotics we can consider for SB Estimators.

- As $N \to \infty$ and $R$ is fixed.
- As $N \to \infty$ and $R \to \infty$.

Asymptotics as $N \to \infty$ and $R$ is fixed are particularly interesting because they fully take into account how simulation error affects the asymptotic bias and variance of SB Estimators. We start by presenting asymptotic results for estimators as $N \to \infty$ and $R$ is fixed.

96 A useful decomposition

For the derivation of the asymptotic results of SBEs, it is helpful to consider the following decomposition of the conditions that define the estimator:

$$\frac{1}{N} \sum_{n=1}^{N} \tilde g^R_n(\theta) = \underbrace{\frac{1}{N} \sum_{n=1}^{N} g_n(\theta)}_{\text{[A] Standard MCs}} + \underbrace{\frac{1}{N} \sum_{n=1}^{N} \left[ E_u \tilde g^R_n(\theta) - g_n(\theta) \right]}_{\text{[B] Simulation Bias}} + \underbrace{\frac{1}{N} \sum_{n=1}^{N} \left[ \tilde g^R_n(\theta) - E_u \tilde g^R_n(\theta) \right]}_{\text{[C] Simulation Noise}}$$

where $E_u \tilde g^R_n(\theta)$ represents the expectation over the simulated random draws $\{u_{nr}\}$ but conditional on the observed data $(y_n, x_n)$.

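97 Term [A]: Standard Moment Conditions

$$\underbrace{\frac{1}{N} \sum_{n=1}^{N} g_n(\theta)}_{\text{[A] Standard MCs}}$$

Under standard regularity conditions, we have that:

- $N^{-1} \sum_{n=1}^{N} g_n(\theta)$ converges in probability and uniformly in $\theta$ to the function $E(g_n(\theta)) \equiv g_0(\theta)$.
- $N^{-1} \sum_{n=1}^{N} \partial g_n(\theta_0) / \partial \theta'$ converges in probability to the matrix $E\left( \partial g_n(\theta_0) / \partial \theta' \right) \equiv G_0$.
- $N^{-1/2} \sum_{n=1}^{N} g_n(\theta_0)$ converges in distribution to $N(0, \Omega_0)$.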

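98 In our discrete choice model:

$$G_0 = -E\left( z_n \frac{\partial P_n(\theta_0)}{\partial \theta'} \right), \qquad \Omega_0 = E\left( z_n \, \Sigma_{P_n} \, z_n' \right)$$

where $\Sigma_{P_n} = \mathrm{diag}(P_n(\theta_0)) - P_n(\theta_0) \, P_n(\theta_0)'$ is the $J \times J$ variance matrix of $1_n$ conditional on $x_n$.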

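99 Term [B]: Simulation Bias

$$\underbrace{\frac{1}{N} \sum_{n=1}^{N} \left[ E_u \tilde g^R_n(\theta) - g_n(\theta) \right]}_{\text{[B] Simulation Bias}}$$

In our model, for the Simulated Method of Moments:

$$E_u \tilde g^R_n(\theta) = E_u\left( z_n \left[ 1_n - \tilde P^R_n(\theta) \right] \right) = z_n \left[ 1_n - E_u \tilde P^R_n(\theta) \right] = z_n \left[ 1_n - P_n(\theta) \right] = g_n(\theta)$$

Therefore, for the SMM with an unbiased simulator of CCPs, the Simulation Bias term is exactly zero for any value of $\theta$, $N$, and $R$.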

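100 Term [B]: Simulation Bias

For Simulated Maximum Likelihood:

$$\begin{aligned}
E_u \tilde g^R_n(\theta) &= E_u\left( \frac{\partial \ln \tilde P^R_n(\theta)'}{\partial \theta} \left[ 1_n - \tilde P^R_n(\theta) \right] \right) \\
&= E_u\left( \left[ \frac{\partial \ln P_n(\theta)'}{\partial \theta} + \tilde\eta^R_n(\theta) \right] \left[ 1_n - P_n(\theta) - \tilde\varepsilon^R_n(\theta) \right] \right) \\
&= g_n(\theta) + E_u\left( \tilde\eta^R_n(\theta) \right) \left[ 1_n - P_n(\theta) \right] - E_u\left( \tilde\eta^R_n(\theta) \, \tilde\varepsilon^R_n(\theta) \right) \\
&\neq g_n(\theta)
\end{aligned}$$

where: $\tilde\eta^R_n(\theta)$ is the $K \times J$ matrix of simulation errors in $\partial \ln \tilde P^R_n(\theta)' / \partial \theta$; $\tilde\varepsilon^R_n(\theta)$ is the $J \times 1$ vector of simulation errors in $\tilde P^R_n(\theta)$.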

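101 Term [B]: Simulation Bias [SML]

In general, $E_u\left( \tilde\eta^R_n(\theta) \right) \neq 0$, where $\tilde\eta^R_n(\theta)$ is the simulation error in the derivatives of the log-CCPs, because an unbiased simulator of the CCP typically implies a biased simulator of the derivative of the log-CCP:

$$\frac{\partial \ln P(j | x_n, \theta)}{\partial \theta} = \frac{\partial P(j | x_n, \theta)}{\partial \theta} \; \frac{1}{P(j | x_n, \theta)}$$

Note that simulation error enters additively in the simulator of $P(j | x_n, \theta)$; however, it enters through the denominator $1 / P(j | x_n, \theta)$ in the simulator of $\partial \ln P(j | x_n, \theta) / \partial \theta$.

In general, $E_u\left( \tilde\eta^R_n(\theta) \, \tilde\varepsilon^R_n(\theta) \right) \neq 0$, where $\tilde\varepsilon^R_n(\theta)$ is the simulation error in the CCPs, because simulation error in CCPs is correlated with simulation error in the derivatives of the log-CCPs.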

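102 Term [B]: Simulation Bias [SML]

Importantly, this Simulation Bias does not go to zero as the sample size goes to infinity:

$$\operatorname*{plim}_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \left[ E_u \tilde g^R_n(\theta) - g_n(\theta) \right] = E\left( E_u\left( \tilde\eta^R_n(\theta) \right) \left[ 1_n - P_n(\theta) \right] \right) - E\left( E_u\left( \tilde\eta^R_n(\theta) \, \tilde\varepsilon^R_n(\theta) \right) \right) \neq 0$$

where $\tilde\eta^R_n(\theta)$ and $\tilde\varepsilon^R_n(\theta)$ are the simulation errors in the derivatives of the log-CCPs and in the CCPs, respectively. The first term is zero at $\theta = \theta_0$, but the second term is not zero. Therefore, SML is inconsistent as $N \to \infty$ and $R$ is fixed. Consistency of the SML requires that as $N \to \infty$ the number of simulations $R$ also goes to infinity.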
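A small Monte Carlo experiment can make the source of this bias concrete. The sketch below is an illustration under assumptions of its own, not the slides' model: a binary choice $y = 1\{a + b\eta + e \geq 0\}$ with independent standard normal $\eta$ and $e$, so that $P = \Phi(a / \sqrt{1 + b^2})$, and a smooth simulator that averages $\Phi(a + b\eta_r)$ over $R$ draws. It looks at the log of the simulated probability rather than at its derivative, but the mechanism is the same: the simulator is unbiased in levels, yet a nonlinear transformation of it is biased when $R$ is fixed. All function names are hypothetical.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def simulated_prob(a, b, etas):
    """Smooth, unbiased simulator of P: the average of Phi(a + b * eta_r)."""
    return sum(std_normal_cdf(a + b * eta) for eta in etas) / len(etas)


def true_prob(a, b):
    """Exact P = Pr(a + b*eta + e >= 0) for independent standard normals."""
    return std_normal_cdf(a / math.sqrt(1.0 + b * b))


def average_level_and_log(a, b, n_draws, n_reps, seed=0):
    """Average the simulator and its log over many independent replications."""
    rng = random.Random(seed)
    sum_level = 0.0
    sum_log = 0.0
    for _ in range(n_reps):
        etas = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
        p_sim = simulated_prob(a, b, etas)
        sum_level += p_sim
        sum_log += math.log(p_sim)
    return sum_level / n_reps, sum_log / n_reps


if __name__ == "__main__":
    a, b = -1.5, 1.5
    print("true P, ln P:", true_prob(a, b), math.log(true_prob(a, b)))
    for n_draws in (2, 20, 200):
        print(n_draws, average_level_and_log(a, b, n_draws, n_reps=5000))
```

In runs of this kind the average of the simulated probability stays close to $P$ for every $R$, while the average of its log sits below $\ln P$ (Jensen's inequality) and approaches it only as $R$ grows.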

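103 Term [C]: Simulation Noise

$$\underbrace{\frac{1}{N} \sum_{n=1}^{N} \left[ \tilde g^R_n(\theta) - E_u \tilde g^R_n(\theta) \right]}_{\text{[C] Simulation Noise}}$$

In our model, for the Simulated Method of Moments we have shown that $E_u \tilde g^R_n(\theta) = g_n(\theta)$. Therefore,

$$\tilde g^R_n(\theta) - E_u \tilde g^R_n(\theta) = z_n \left[ 1_n - \tilde P^R_n(\theta) \right] - z_n \left[ 1_n - P_n(\theta) \right] = -z_n \, \tilde\varepsilon^R_n(\theta)$$

where $\tilde\varepsilon^R_n(\theta) = \tilde P^R_n(\theta) - P_n(\theta)$ is the vector of simulation errors. Given the properties of our simulator, we have that the $K \times 1$ vector of random variables $z_n \, \tilde\varepsilon^R_n(\theta)$ is such that, for any $n$ and $\theta$: $E\left( z_n \, \tilde\varepsilon^R_n(\theta) \right) = 0$ and, conditional on $x_n$, $E\left( \tilde\varepsilon^R_n(\theta) | x_n \right) = 0$.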

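104 Asymptotic Distribution of the SMM

Using a Taylor approximation around $\theta = \theta_0$ of the moment conditions of the SMM, and taking into account that $\hat\theta_{N,R} \to_p \theta_0$, we have that:

$$\sqrt{N} \left( \hat\theta_{N,R} - \theta_0 \right) = \left[ \frac{1}{N} \sum_{n=1}^{N} z_n \frac{\partial \tilde P^R_n(\theta_0)}{\partial \theta'} \right]^{-1} \left[ \frac{1}{\sqrt{N}} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta_0) - \tilde\varepsilon^R_n(\theta_0) \right] \right] + o_p(1)$$

where $\tilde\varepsilon^R_n(\theta_0) = \tilde P^R_n(\theta_0) - P_n(\theta_0)$ is the vector of simulation errors. As $N \to \infty$ with $R$ fixed, we have that:

$$-\frac{1}{N} \sum_{n=1}^{N} z_n \frac{\partial \tilde P^R_n(\theta_0)}{\partial \theta'} \to_p \tilde G_R \equiv -E\left( z_n \frac{\partial \tilde P^R_n(\theta_0)}{\partial \theta'} \right)$$

with the sign chosen to match the definition $G_0 \equiv E\left( \partial g_n(\theta_0) / \partial \theta' \right)$.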

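105 Also:

$$\frac{1}{\sqrt{N}} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta_0) \right] \to_d N(0, \Omega_0)$$

where: $\Omega_0 \equiv E\left( z_n \, \Sigma_{P_n} \, z_n' \right)$, with $\Sigma_{P_n} = \mathrm{diag}(P_n(\theta_0)) - P_n(\theta_0) \, P_n(\theta_0)'$.

And:

$$\frac{1}{\sqrt{N}} \sum_{n=1}^{N} z_n \, \tilde\varepsilon^R_n(\theta_0) \to_d N\left( 0, \tilde\Omega_R \right)$$

where: $\tilde\Omega_R \equiv E\left( z_n \, \dfrac{\tilde V(x_n, \theta_0)}{R} \, z_n' \right)$, where remember that $\tilde V(x_n, \theta_0) / R$ is the variance matrix of the simulation errors $\tilde\varepsilon^R_n(\theta_0)$ in $\tilde P^R_n(\theta_0)$.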

106 And the terms $\frac{1}{\sqrt{N}} \sum_{n=1}^{N} z_n \left[ 1_n - P_n(\theta_0) \right]$ and $\frac{1}{\sqrt{N}} \sum_{n=1}^{N} z_n \, \tilde\varepsilon^R_n(\theta_0)$ are independent due to the conditional mean independence of the simulation error $\tilde\varepsilon^R_n(\theta) = \tilde P^R_n(\theta) - P_n(\theta)$, i.e., $E\left( \tilde\varepsilon^R_n(\theta) | x_n \right) = 0$.

Therefore, applying Slutsky's Theorem, we have that the asymptotic distribution of the SMM estimator as $N \to \infty$ with $R$ fixed is:

$$\sqrt{N} \left( \hat\theta_{N,R} - \theta_0 \right) \to_d N\left( 0, \; \tilde G_R^{-1} \left[ \Omega_0 + \tilde\Omega_R \right] \tilde G_R^{-1\prime} \right)$$

As $R$ goes to infinity, $\tilde G_R \to G_0$ and $\tilde\Omega_R \to 0$, such that the SMM estimator becomes equivalent to the MM estimator without simulation. But in any empirical application, with a finite number of simulations $R$, the simulation error introduces additional noise that increases the variance of the estimator. This is fully taken into account by the expression of the asymptotic variance above.
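To get a feel for the size of the simulation penalty, here is a back-of-the-envelope special case. It rests on two assumptions that are not made in the slides: (i) the simulation error has the same conditional variance as the choice indicator itself, $\tilde V(x_n, \theta_0) = \mathrm{diag}(P_n) - P_n P_n'$ with $P_n = P_n(\theta_0)$, and (ii) $\tilde G_R = G_0$. It also uses the expression for $\Omega_0$ from the discrete choice model, $\Omega_0 = E\left( z_n \left[ \mathrm{diag}(P_n) - P_n P_n' \right] z_n' \right)$. Under these assumptions:

```latex
\begin{aligned}
\tilde\Omega_R
  &= E\!\left( z_n \, \frac{\mathrm{diag}(P_n) - P_n P_n'}{R} \, z_n' \right)
   = \frac{1}{R} \, \Omega_0 , \\[4pt]
G_0^{-1} \left[ \Omega_0 + \tilde\Omega_R \right] G_0^{-1\prime}
  &= \left( 1 + \frac{1}{R} \right) G_0^{-1} \, \Omega_0 \, G_0^{-1\prime} .
\end{aligned}
```

In this special case, with $R = 10$ draws per observation the asymptotic variance would be 10% larger than without simulation, and the penalty shrinks at rate $1/R$.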

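107 WILLIAMS-DALY-ZACHARY (WDZ) THEOREM

Consider the RUM $U(j) = u_j + \varepsilon_j$ where $\varepsilon = \{\varepsilon_j : j = 0, 1, \ldots, J\}$ has a CDF $F(\varepsilon)$ that is continuously differentiable over the whole Euclidean space $\mathbb{R}^{J+1}$. Define the Social Surplus function (McFadden, 1981)

$$S(u) = \int \max_{j \in \mathcal{A}} \{u_j + \varepsilon_j\} \; dF(\varepsilon)$$

where $\mathcal{A} = \{0, 1, \ldots, J\}$ is the choice set. Then,

$$\frac{\partial S(u)}{\partial u_j} = P(j)$$

Proof: Exercise. Hint: Note that $\partial \max_{j \in \mathcal{A}} \{u_j + \varepsilon_j\} / \partial u_j$ is equal to $1\{u_j + \varepsilon_j \geq u_i + \varepsilon_i \text{ for any } i \neq j\}$.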

108 Note that this result is like the discrete choice version of Roy's Identity in Consumer Demand: the derivative of the indirect utility function with respect to price gives the demand (up to sign and normalization by the marginal utility of income).
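The WDZ result can be checked numerically in the logit case. The sketch below assumes i.i.d. type I extreme value errors, for which the social surplus has the well-known closed form $S(u) = \ln \sum_j \exp(u_j) + \gamma$ (with $\gamma$ Euler's constant) and the CCPs are the logit probabilities. The function names are hypothetical, and the gradient is approximated by central finite differences.

```python
import math

EULER_GAMMA = 0.5772156649015329


def social_surplus_logit(u):
    """S(u) = ln(sum_j exp(u_j)) + gamma, computed with a stable log-sum-exp."""
    top = max(u)
    return top + math.log(sum(math.exp(x - top) for x in u)) + EULER_GAMMA


def logit_probs(u):
    """Logit choice probabilities P(j) = exp(u_j) / sum_i exp(u_i)."""
    top = max(u)
    weights = [math.exp(x - top) for x in u]
    total = sum(weights)
    return [w / total for w in weights]


def numerical_gradient(func, u, step=1e-6):
    """Central finite-difference approximation to the gradient of func at u."""
    grad = []
    for j in range(len(u)):
        up = list(u)
        down = list(u)
        up[j] += step
        down[j] -= step
        grad.append((func(up) - func(down)) / (2.0 * step))
    return grad


if __name__ == "__main__":
    u = [0.0, 1.0, -0.5]
    print(numerical_gradient(social_surplus_logit, u))
    print(logit_probs(u))
```

The two printed vectors should agree up to finite-difference error, which is the statement $\partial S(u) / \partial u_j = P(j)$ for this particular error distribution.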

109 HOTZ-MILLER PROPOSITION

Consider the RUM $U(j) = u_j + \varepsilon_j$ where $\varepsilon = \{\varepsilon_j : j = 0, 1, \ldots, J\}$ has a CDF $F(\varepsilon)$ that is continuously differentiable over the whole Euclidean space $\mathbb{R}^{J+1}$. Define the vector of utility differences $\mathbf{u} = \{u_j - u_0 : j > 0\}$ and the vector of CCPs $\mathbf{P} = \{P(j) : j > 0\}$.

Given the CDF $F(\varepsilon)$, the definition of CCPs provides a mapping from the vector of utility differences $\mathbf{u}$ into the vector of CCPs $\mathbf{P}$:

$$\mathbf{P} = G(\mathbf{u})$$

Hotz and Miller show that this mapping is invertible:

$$\mathbf{u} = G^{-1}(\mathbf{P})$$

There is a unique vector of utility differences $\mathbf{u}$ that can rationalize (generate as optimal choices) a vector of CCPs $\mathbf{P}$. Revealed Preference.
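In the logit case the inverse mapping has a closed form, $u_j - u_0 = \ln P(j) - \ln P(0)$, which makes the invertibility easy to see. The sketch below assumes i.i.d. type I extreme value errors (an assumption of this illustration; the proposition itself holds for general $F$) and uses hypothetical function names.

```python
import math


def ccp_from_utility_differences(du):
    """Logit mapping G: utility differences (u_j - u_0, j > 0) -> CCPs P(j), j > 0."""
    denom = 1.0 + sum(math.exp(d) for d in du)
    return [math.exp(d) / denom for d in du]


def utility_differences_from_ccp(probs):
    """Inverse mapping G^{-1} for logit: u_j - u_0 = ln P(j) - ln P(0)."""
    p0 = 1.0 - sum(probs)  # probability of the baseline alternative j = 0
    return [math.log(p) - math.log(p0) for p in probs]


if __name__ == "__main__":
    du = [0.3, -1.2, 2.0]
    probs = ccp_from_utility_differences(du)
    print(probs)
    print(utility_differences_from_ccp(probs))  # recovers du up to rounding
```

Going from utility differences to CCPs and back recovers the original vector, so in this example each vector of CCPs is rationalized by exactly one vector of utility differences.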
