A ROBUST SEMI-PARAMETRIC INFERENCE OF DYNAMIC DISCRETE GAMES

Zhengyuan Gao

This paper considers the dynamic discrete game model with fixed points in the worst case and with incomplete beliefs. In the worst case, an ε-approximation to a fixed point does not lead to a consistent second-step estimator of the parametric structural parameters. Given certain structures of the value function, we propose an alternative two-step inference algorithm. The first step controls the complexity of the nonparametric approximation and estimates the value and policy functions under the worst case. The second step selects a pseudo parameter that gives an ε-fitting and generates an empirical log-likelihood test statistic. The hypothesis test should accept the estimated values of the structural parameters. The objective functions in both steps reduce to linear-quadratic forms and are robust to outliers.

1. INTRODUCTION

The dynamic discrete choice model builds upon agents' rational behavior, namely their expectations of the future and their utility maximization. It exploits the intrinsic evolution structure of the model and captures the endogenous effect of agents' actions. In practice, single-agent models are frequently used. In many economic phenomena, however, an agent's decisions depend not only on expectations about its own actions but also on expectations of its opponents' behavior. Concerns about strategic interactions along the evolution process motivate a line of research on dynamic discrete games. Many techniques and estimation methods have been introduced to handle the issues arising in dynamic discrete games, e.g. multiple equilibria, the curse of dimensionality, and heterogeneity. This paper considers the robustness issue in dynamic discrete games and provides a tractable inference approach under a more flexible assumption on consumers' optimal behavior and a more flexible model specification.
In this section, we discuss the relevant literature and describe the contribution of this paper. The estimation methods for single-agent dynamic discrete choice models were initiated by Wolpin (1984), Pakes (1986) and Rust (1987). Estimating the parameters of these structural models requires solving an optimization problem with a nested dynamic programming procedure.

(Tinbergen Institute and Quantitative Economics Department, University of Amsterdam, z.gao@uva.nl. This is a very preliminary and incomplete version.)

The computation is non-trivial both for the likelihood evaluation in the outer loop and for the fixed point iteration in the inner loop. The computational complexity grows exponentially when the model allows for strategic interaction. Such complexity mainly comes from two sources: the exponential growth of the state space and the existence of multiple equilibria (Aguirregabiria and Mira, 2009). Despite these computational difficulties in large-scale problems, Ericson and Pakes (1995) set up an oligopolistic competition model under the framework of the dynamic discrete game with complete-information participants. Recent work on dynamic discrete games relaxes the complete-information assumption and suggests two-step methods to estimate an incomplete-information model. The incomplete-information model is able to explain the heterogeneity in players' actions by introducing random variables. These random variables, such as private-information state variables, are independently distributed across players, so the integrals over the random variables do not require the analytical evaluation needed in the deterministic, complete-information case. From a computational point of view, randomized algorithms can break the curse of dimensionality caused by the multivariate integration in the dynamic programming. Keane and Wolpin (1994) suggest solving the dynamic programming problem by Monte Carlo integration and interpolation. Later, Rust (1997) proves that a random Bellman operator with a certain structure breaks the curse of dimensionality. In dynamic discrete games, the incomplete-belief setting plays an important role both in reducing the computational burden and in capturing heterogeneous behavior across agents. The prevailing approaches for solving incomplete-information models differ slightly from those for deterministic models.
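To make the randomized-integration idea concrete, the following is a minimal sketch (not taken from any of the cited papers; the payoff and transition functions are hypothetical) of a Monte Carlo Bellman update in the spirit of Keane and Wolpin (1994): the integral over next-period states is replaced by an average over simulated draws, combined with interpolation on a state grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-dimensional problem: states in [0, 1], two actions.
def payoff(s, a):
    # flow payoff; action 1 carries a small cost (illustrative)
    return s - 0.1 * a

def draw_next_states(s, a, n):
    # stochastic transition drifting toward a/2, with Gaussian noise (illustrative)
    return np.clip(s + 0.2 * (a / 2 - s) + 0.05 * rng.standard_normal(n), 0.0, 1.0)

def mc_bellman(v, states, beta=0.9, n_draws=200):
    """One Monte Carlo Bellman update: the expectation over s' is replaced
    by an average over n_draws simulated successors, and the value at
    off-grid draws is filled in by linear interpolation."""
    new_v = np.empty_like(v)
    for i, s in enumerate(states):
        action_values = []
        for a in (0, 1):
            s_next = draw_next_states(s, a, n_draws)
            ev = np.interp(s_next, states, v).mean()   # interpolation step
            action_values.append(payoff(s, a) + beta * ev)
        new_v[i] = max(action_values)
    return new_v

states = np.linspace(0.0, 1.0, 21)
v = np.zeros_like(states)
for _ in range(100):       # iterate toward the approximate fixed point
    v = mc_bellman(v, states)
```

Iterating the update approximates the fixed point; the Monte Carlo error is controlled by the number of draws rather than by the dimension of the integral, which is the sense in which randomization sidesteps the curse of dimensionality.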
The algorithms for the complete-information model need to compute the full set of equilibria for each candidate parameter value and then select the optimal value. In the incomplete-information model, the estimation procedure often includes two steps. In the first step, one establishes objective functions with estimated or approximated components. The second step solves a constrained or unconstrained optimization problem with respect to the structural parameters, using the objective functions obtained in the first step. Pesendorfer and Schmidt-Dengler (2003) and Aguirregabiria and Mira (2007) exploit the mapping between conditional choice probabilities and choice-specific value functions and use a nonparametrically estimated probability distribution function to recover the value function of a specific agent in their first step. Bajari, Benkard, and Levin (2007), Pakes, Ostrovsky, and Berry (2008) and

Bajari, Chernozhukov, and Hong (2005) try to obtain consistent estimated policy functions nonparametrically in the first step. The objective function in the second step can be a pseudo likelihood or a minimum-distance criterion function, with or without constraints from equilibrium conditions. None of these two-step approaches requires equilibrium evaluation in the first step, which mitigates the multiple-equilibria problem. Apart from first-step nonparametric estimation, nonparametric techniques are also used to identify the structural parameters (Magnac and Thesmar, 2002) or to obtain an approximate solution for the value function (Rust, Traub, and Wozniakowski, 2002). Although nonparametric methods provide flexible approximations and estimates, their results are not always reliable. A key problem in nonparametric statistics is how to make an optimal decision that minimizes the mean squared error (MSE), namely the trade-off between bias and variance. There are two main reasons why MSE has not been taken into account in dynamic discrete games. The first reason is the mis-specification in the second-step objective function. In dynamic discrete games, the second-step objective functions are either pseudo likelihoods or approximated moment restrictions; they are not exact parametric functions or conditions. A researcher who conducts inference based on second-step structural parameter estimators worries about whether there is a bias from the first-step estimators. Fernández-Villaverde, Rubio-Ramírez, and Santos (2006) show that a higher-order bias in the first-step estimator has a first-order effect on the likelihood function in the second step. A sequence of approximated likelihoods need not converge to the exact likelihood even though the sequence of approximated policy functions converges to the exact one. The second reason is the finite-sample problem. In micro-econometrics, data resources, especially industrial data, are scarce.
Given the complicated nested dynamic programming structure, people prefer to fit a given model rather than to make an inference. "The statistical mentality that all structural models are rejected, therefore none of them are any good does a disservice and contributes to a radicalization of some members of the profession" (Rust, 2008). An exactly fitted first-step approximation makes it easier to obtain a satisfactory result. Given the importance of consistency in the first-step approximation or estimation, it may seem that there is no point in focusing on the other side of the coin. We do not want to leave the structural setting, but we believe a flexible inference procedure has greater explanatory power. We consider the worst case of Bellman's principle of optimality,

such that the numerically approximated solution is ε away from the fixed-point value function. In economics, ε is often regarded as approximation error, to be eliminated by finding a better-fitting series. It is true that a numerical solution introduces a bias term, but it is equally true that the stochastic behavior of agents causes error. A higher-order stochastic approximation can perform quite well once the parametric form of the underlying model is known, but outliers and inconsistent behaviors cannot be explained by the parametric model. An over-fitted nonparametric curve may try to explain these abnormal situations. Thus a consistent first-stage estimator is not necessarily reliable. To keep the problem tractable and control the growth rate of complexity, one should allow flexibility in the nonparametric approximation. However, except for a recent theoretical study by Rust, Traub, and Wozniakowski (2002), there is relatively little literature analyzing inference for the dynamic discrete choice model in the worst case. If a bias exists in the first step at all, it is unnecessary to evaluate the approximated exact likelihood. Thus we relax the parametric assumptions on the functional forms of heterogeneous beliefs in the second step. We propose a two-step approach for inference in structural dynamic models with discretized continuous state spaces. The approach does not rely heavily on parametric functional-form assumptions and provides flexibility against potential model mis-specification. The approach also maintains robustness in the presence of outliers from Bellman's principle of optimality. In the first step, we use kernel methods with soft margins to estimate agents' best-response functions. Given the estimated policy functions, we recover the generated structural parameters in the second step by a semi-parametric method and analyze robustness by comparing its outcomes with parametric estimation.
We apply the Local Empirical Likelihood (LEL) approach, which is similar in spirit to the constrained likelihood (MPEC; Su and Judd, 2008) and the Nested Pseudo Likelihood (NPL; Aguirregabiria and Mira, 2007). It treats the Markov Perfect Equilibrium (MPE) as a model constraint. Unlike other likelihood-based methods, LEL does not require a fully parametric assumption on the heterogeneous beliefs; it can, however, formulate a pseudo likelihood based on the available constraints. The flexible Lagrangian weights and the linear-quadratic functional form make LEL robust and tractable in the multiple-equilibria case.

2. A STOCHASTIC DYNAMIC GAME

We start with a well-known framework for dynamic competition among oligopolistic competitors. The framework dates back to Ericson and Pakes (1995) and has been improved and generalized by a series of authors; Doraszelski and Pakes (2007) give a comprehensive literature review of this framework. Its major feature is that actions taken in a given period may affect both current profits and future strategic interaction. The evolution of an industry with heterogeneous firms is modeled via a Markov Decision Process (MDP) with discrete time and infinite horizon t = 1, 2, . . . . Dynamic competition is expressed in terms of entry, exit, and investment decisions in each period. Unlike a reduced-form model, this dynamic structural model is able to exploit the intrinsic evolution process and construct a controllable scheme. The model includes N firms, denoted i = 1, . . . , N. In period t the state variable of firm i is denoted s_it. The state vector across all firms is commonly observed as s_t ∈ S, where S ⊆ R^L is the entire state space. Depending on the application, relevant state variables might include the firms' capacities, market shares or investments. Given the current common information s_t, firm i chooses its strategy a_it ∈ A_i. We assume the firms choose actions a_t = (a_1t, . . . , a_Nt) simultaneously in each period. The actions can be discrete choices, e.g. entry and exit decisions, or continuous choices, e.g. investment quantities, product prices, etc. Firms face private shocks or opportunities in practice. To econometricians, these shocks ε_t = (ε_1t, . . . , ε_Nt) are private, unobservable information, so ε_it is treated as a random variable; the distribution associated with ε_i is G_i(·) on R^L. Given an action a_i, the state s_i moves to the next state s′_i according to a transition probability p_i(s′|s; a). An action trajectory (a_i1, . . . , a_iT) reduces the evolution process from state s_i1 to s_iT to a Markov chain p_i = (p_i(s_2|s_1; a_i1), . . . , p_i(s_T+1|s_T; a_iT)). In this paper, the Markov transition probability is assumed to be time-homogeneous: given action a and states s and s′, the transition probability p_i(s′|s; a) does not depend on t. With this Markov structure, the firms play stationary Markov strategies in this game. For simplicity, we denote by a, s the current action and state and by a′, s′ the next-period variables. The dependence of p(·|s; a) on the action a is not always necessary: it is obvious that an entry/exit decision will affect the firm's next status, but it may be less obvious that a short-term adjustment of

capacity will affect a long-term strategic target. Let π_i(a_i, a_−i, s, ε_i) be the current profit of firm i, where a_−i denotes the opponents' actions. A firm makes its decisions to maximize the expected future profit,

E[ Σ_{τ=t}^∞ β^{τ−t} π_i(a_τ, s_τ, ε_iτ) | s_t ],

where β ∈ (0, 1) is the discount factor. The primitives of the model are the discount factor β, the transition probability p(·) and the profit functions {π_i(·)}_{i=1}^N. In this model, we consider Markov Perfect Equilibria (MPE). Under MPE, the agents' optimal actions depend only on the current state of the system; each firm's action depends on the current state and its current private shock. Let σ = (σ_1(s, ε_1), . . . , σ_N(s, ε_N)) be a profile of Markov strategy functions, or decision rules, for the N firms, with σ_i : S × R^L → A_i. The Bellman equation for the dynamic model is

(2.1)   V_i(s, ε_i) = max_{a_i ∈ A_i} { π_i(a_i, a_−i, s, ε_i) + β ∫ V_i(s′, ε′_i) dP(s′|s; a_i, a_−i) },

where V_i(s, ε_i) is the value function. By Bellman's principle of optimality, we substitute the decision profile into (2.1) and integrate out the private shocks, which yields

(2.2)   V_i(s; σ) = E_ε[ π_i(σ(s, ε), s, ε_i) + β ∫ V_i(s′; σ) dP(s′|s; σ(s, ε)) | s ].

This is the integrated Bellman equation. V_i(s; σ(s, ε)) is called the ex ante value function (Bajari, Benkard, and Levin, 2007); it reflects expected profits at the beginning of a period, before the private shocks are realized. Our inference procedure is to obtain a consistent estimated best-response function σ̂, then use σ̂ and the integrated Bellman equation to recover the distribution of each firm's private shock.

3. KERNEL-BASED VALUE FUNCTIONS AND POLICY FUNCTIONS

In the first step, we implement a kernel-based approximation to the value function of the Bellman equation. We then present, dually, the problem of finding a robust policy function as a constrained optimization problem that minimizes the estimated value function's fitting error.
We prove that the policy function obtained from the dual optimization is uniformly consistent and that the resulting problem is a tractable nonlinear program.
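As a point of reference for (2.2): if one is willing to assume i.i.d. type-1 extreme value private shocks (an assumption this paper deliberately avoids), the expectation over ε has the familiar log-sum-exp closed form and the integrated Bellman equation can be iterated directly. The sketch below uses toy payoffs and transitions:

```python
import numpy as np

EULER = 0.5772156649015329  # Euler-Mascheroni constant

def ex_ante_value(choice_values, scale=1.0):
    """E_eps max_a [v(s,a) + eps_a] for i.i.d. Gumbel shocks:
    the log-sum-exp formula, computed in a numerically stable way."""
    v = np.asarray(choice_values, dtype=float)
    m = v.max()
    return m + scale * np.log(np.exp((v - m) / scale).sum()) + EULER * scale

def integrated_bellman(V, u, P, beta=0.9):
    """One application of the integrated Bellman operator of (2.2).
    u: (S, A) flow payoffs; P: (A, S, S) transition matrices; V: (S,)."""
    cont = u + beta * np.einsum('ast,t->sa', P, V)   # choice-specific values
    return np.array([ex_ante_value(cont[s]) for s in range(u.shape[0])])

# toy symmetric example: two states, two actions, uniform transitions
u = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.full((2, 2, 2), 0.5)
V = np.zeros(2)
for _ in range(300):       # a beta-contraction: iterate to the fixed point
    V = integrated_bellman(V, u, P)
```

The operator is a β-contraction in the sup norm, so the iteration converges to the unique fixed point of (2.2) under this distributional assumption.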

3.1. The Kernel-based Approximation

The kernel-based approximation was introduced by Ormoneit and Sen (2002) in statistical learning. The algorithm assigns value-function estimates to the states in a sample trajectory s_1, . . . , s_T and updates these estimates by kernel-based averaging. In economics, Rust (1997) gives a local-averaging approach that approximates the true Bellman operator and shows that it circumvents the curse of dimensionality in MDPs. The trick is to impose a normalized structure on the random Bellman operator and then average the value function over the random sample points. Kernel-based averaging is a kind of local averaging with a certain functional structure on the dot-product space, k : S × S → R; Lemma 1 of Ormoneit and Sen (2002) proves that, for a fixed bandwidth, the kernel-based approximation also has a polynomial growth function. The recursive relation in (2.2) can be expressed compactly through a Bellman operator Γ mapping V to V. In the infinite-horizon problem, there is no terminal period from which the backward induction of the dynamic programming algorithm in (2.2) can start; the Bellman equation is instead written compactly as the fixed point condition V = ΓV. In practice, a long finite-horizon MDP can approximately solve the infinite-horizon problem. The fixed point condition does not hold exactly in (2.1) because of the private information: the random shock makes the profit function π_i discontinuous, and hence the contraction-mapping property of Γ is violated. Suppose S^a is a collection of m_a historical state transitions from s to s′ given action a, S_i^a = {(s_ij, s′_ij) : j = 1, . . . , m_a}. The kernel function k_{S^a,b}(s_i, s′) is centered at s_i,

(3.1)   k_{S^a,b}(s_i, s′) := φ((s_i − s′)/b) / Σ_{(s_u, s′_u) ∈ S^a} φ((s_u − s′)/b),

where φ is a mother kernel function and b is the bandwidth parameter. φ belongs to the radial basis function (RBF) class, which includes the normalized class used in Rust (1997).
The kernel approximation function is

(3.2)   Γ̂_a(V_i)(s) = Σ_{(s_u, s′_u) ∈ S^a} k_{S^a,b}(s_u, s) [π_i(a_i, a_−i, s) + βV_i(s′_u)].

The operator Γ̂_a : B(S) → B(S) is a random operator based on historical realizations of outcomes given action a. Given the optimal policy rule σ, the kernel-based function approximates

the ex ante value function in (2.2), and ‖Γ̂_σ − Γ‖ → 0 in L_∞. The RBF averages both the random sample points and the private shocks in the integrated Bellman equation. Without any prior knowledge, neither the transition probability function nor the distribution of private information G_i can be identified separately, because Γ̄ is a direct sum operator of G and the Bellman operator Γ(s) in (2.1): (G ⊕ Γ)(s) = ∫ ΓV_i(s, ε_i) dG_i. The smoothness of the approximation is controlled by the bandwidth choice. Ormoneit and Sen (2002) show that the optimal bandwidth has a shrinkage rate of O(m_a^{−2/N}).

Theorem 3.1  Given Assumption A.1, the sequence ‖Γ̂_a V − Γ_a V‖ converges uniformly to zero.

Equation (3.2) provides a model-free approximation, although the structural parameters are not identifiable from it. Since the kernel-based approximation satisfies the fixed point condition asymptotically, one can use the policy function to evaluate the degree of fitting,

(3.3)   V_i(s) − [ π_i(a_i, a_−i, s) + β Σ_{s′} Γ̂_a V_i(s′) p_i(s′|s, a) ].

Equation (3.3) is the fitting error of the kernel-based approximation. The optimal policy function minimizes this fitting error. To obtain the optimal policy function, one needs to apply the value-iteration update algorithm for a system of linear regressions:

(3.4)   V^a = T[K(π_i + βV^a)].

If |A_i| = M, K is an m_a × M × m_a tensor with entry k(s, s′) at location (s, a, s′), T is an operator on an m_a × M × M tensor that maximizes over its second dimension, and V^a is an m_a × M matrix. The sparsity and complexity of (3.4) is O(l · m_a · M) if we only consider the l-nearest neighbors. In models with a high-dimensional action space A, it is essential to choose a fixed neighborhood for local averaging.
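A minimal numerical sketch of the kernel-weighted update in (3.1)-(3.2) follows. The grid, the payoffs, and the use of the next grid point as the observed successor state are illustrative assumptions, not part of the model:

```python
import numpy as np

def kernel_weights(s_query, s_centers, b):
    """Normalized RBF weights as in (3.1), with a Gaussian mother kernel.
    The normalization makes the weights sum to one."""
    w = np.exp(-(s_query - s_centers) ** 2 / (2 * b ** 2))
    return w / w.sum()

def kernel_bellman(V, s_grid, payoffs, beta=0.9, b=0.1):
    """One application of the random operator (3.2): a kernel-weighted
    average of flow payoff plus discounted successor value. Here the
    successor of grid point j is taken to be the next grid point,
    purely for illustration."""
    next_V = np.roll(V, -1)                 # stand-in for observed s'
    new_V = np.empty_like(V)
    for i, s in enumerate(s_grid):
        w = kernel_weights(s, s_grid, b)
        new_V[i] = w @ (payoffs + beta * next_V)
    return new_V

s_grid = np.linspace(0.0, 1.0, 11)
payoffs = np.ones(11)       # constant payoff: the fixed point is 1/(1 - beta)
V = np.zeros(11)
for _ in range(300):
    V = kernel_bellman(V, s_grid, payoffs)
```

With a constant payoff of 1 and β = 0.9 the iteration converges to the constant value 1/(1 − β) = 10, which is a quick sanity check that the normalized weights preserve the fixed point.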
The computation of (3.4) is feasible in many applications, but the structural parameters are not identified by (3.4); furthermore, inference is hard to implement because of the unknown ε in π.

3.2. The Kernel-based Constrained Optimization

We propose to carry out the fitting-error minimization as a constrained optimization procedure. First, given action a, we re-write (3.4) in a simpler expression:

(3.5)   V_i(s) = Σ_l θ_l Φ_il(s),

where Φ is a basis function such that ⟨Φ(s), Φ(s′)⟩ = k(s, s′) and θ absorbs π and the inverse factor (Φ − βΦ)^{−1}. Since the kernel (Gram) matrix K_ij := k(s_i, s_j) is positive definite, Φ is invertible. Once the basis function and the local sample points are chosen, Φ(s) is determined. Bajari, Benkard, and Levin (2007) assume that the profit function and value function are linear in the unknown parameters in order to reduce computation. Equation (3.5) has a different meaning from the assumption of Bajari, Benkard, and Levin (2007): θ does not necessarily correspond to underlying parameters; it stands for a coefficient vector of the basis function. Since V lies in a Banach space, Φ always exists, and θ is used to adjust the fit of Φ's polynomial expansion. The norm ‖θ‖ serves as a regularizer; we use ‖θ‖²/2 to penalize model complexity. The fixed point condition implies that equation (3.3) equals zero in theory. But in the finite-sample case, (3.3) may yield a small number ε rather than exactly zero. Vapnik (1998) devised the so-called ε-insensitive loss function,

(3.6)   |y − f(x)|_ε = max{0, |y − f(x)| − ε}.

Hence the primal problem of minimizing the fitting error in (3.3) becomes

(3.7)   min_{θ,ξ,ξ*}  (1/2)‖θ‖² + (C/|S|) Σ_{j∈S} (ξ_j + ξ*_j)
        s.t.  θᵀΦ − [π(a, s) + β Σ (θᵀΦ(s′)) p(s′|s, a)] ≤ ε + ξ_j,
              −θᵀΦ + [π(a, s) + β Σ (θᵀΦ(s′)) p(s′|s, a)] ≤ ε + ξ*_j,
              ξ_j ≥ 0,  ξ*_j ≥ 0,  j ∈ S.

ξ_j and ξ*_j are slack variables. The fixed point condition and the ε-insensitive loss function set up a tube of width 2ε for the fitted curves, and ξ_j and ξ*_j act as a soft margin for that tube. The minimization in (3.7) captures the main feature of a robust inference procedure: to obtain a small risk, we must control both the empirical risk, via the ε-insensitive loss function, and the model complexity, via the penalty ‖θ‖. The parameter C trades off model complexity against curve fitting. The key idea for solving the optimization problem in (3.7) is to construct a Lagrangian from the objective function and the corresponding constraints by introducing a dual set of variables.
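The ε-insensitive loss (3.6) is simple to state in code; residuals inside the tube of width 2ε cost nothing, which is what makes the fit tolerant of small fixed-point violations:

```python
import numpy as np

def eps_insensitive(y, f, eps=0.1):
    """Vapnik's eps-insensitive loss (3.6): zero inside the tube
    |y - f| <= eps, and growing only linearly outside it."""
    return np.maximum(0.0, np.abs(np.asarray(y) - np.asarray(f)) - eps)
```

For example, a residual of 0.05 incurs zero loss while a residual of 0.3 incurs loss 0.2; the linear (rather than quadratic) growth outside the tube is what gives the estimator its robustness to outliers.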
The Lagrangian function has a saddle point with respect to the primal and dual

variables at the solution:

(3.8)   L := (1/2)‖θ‖² + (C/|S|) Σ_{j∈S} (ξ_j + ξ*_j) − Σ_{j∈S} (η_j ξ_j + η*_j ξ*_j)
            − Σ_{j∈S} α_j ( ε + ξ_j − θᵀΦ + [π(a, s) + β Σ (θᵀΦ(s′)) p(s′|s, a)] )
            − Σ_{j∈S} α*_j ( ε + ξ*_j + θᵀΦ − [π(a, s) + β Σ (θᵀΦ(s′)) p(s′|s, a)] ),

where the dual variables satisfy the positivity constraints α_j, α*_j, η_j, η*_j ≥ 0. We can simplify the fitting error as

(3.9)   θᵀΦ − [π_j + β Σ (θᵀΦ(s′)) p(s′|s_j, a)] = θᵀΨ(s_j) − π_j,

where Ψ(s_j) = Φ(s_j) − β Σ_{s′} Φ(s′) p(s′|s_j, a). The partial derivatives of L with respect to the primal variables (θ, ξ_j, ξ*_j) must vanish at the optimum:

(3.10)  ∂L/∂θ = θ − Σ_{j∈S} (α*_j − α_j) Ψ(s_j) = 0,
        ∂L/∂ξ_j = C/|S| − α_j − η_j = 0,
        ∂L/∂ξ*_j = C/|S| − α*_j − η*_j = 0.

Substituting (3.10) into (3.8) yields the dual optimization problem,

(3.11)  min_{α,α*}  (1/2) Σ_{j,k∈S} (α*_k − α_k)(α*_j − α_j) ⟨Ψ(s_k), Ψ(s_j)⟩
                    + ε Σ_{j∈S} (α*_j + α_j) − Σ_{j∈S} π_j (α*_j − α_j),
        s.t.  0 ≤ α_j, α*_j ≤ C/|S|.

The dimensionality of the dual problem is reduced to only 2|S|, since only α and α* remain as decision variables. The optimization problem (3.11) is a convex quadratic program, uses fewer parameters, takes the intrinsic mis-specification into account, and can be solved by many available packages. Given the numerical values of α and α*, the parameter θ and the value function can be expressed as

(3.12)  θ = Σ_{j∈S} (α*_j − α_j) Ψ(s_j),
        V(s_i) = θᵀΦ = Σ_{j∈S} (α*_j − α_j) [ K_ij − β Σ_{s′} p(s′|s_i, a) K_{i′j} ].
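As an illustration of how a problem of the form (3.11)-(3.12) can be solved in practice, here is a small self-contained sketch with a toy Gram matrix and toy targets (all ingredients are hypothetical). A generic bound-constrained solver suffices, because the problem is a convex box-constrained quadratic program:

```python
import numpy as np
from scipy.optimize import minimize

# Toy ingredients (hypothetical): sample states, targets pi_j standing in
# for profits, and a Gaussian Gram matrix standing in for <Psi(s_k), Psi(s_j)>.
s = np.linspace(0.0, 1.0, 8)
pi = np.sin(2 * np.pi * s)
Q = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.1)   # positive definite

eps, C = 0.05, 10.0
n = len(s)

def dual_objective(x):
    """The objective of (3.11), written with beta_j = alpha*_j - alpha_j."""
    a, a_star = x[:n], x[n:]
    beta = a_star - a
    return 0.5 * beta @ Q @ beta + eps * (a + a_star).sum() - pi @ beta

bounds = [(0.0, C / n)] * (2 * n)                   # box constraints of (3.11)
res = minimize(dual_objective, np.zeros(2 * n), bounds=bounds, method="L-BFGS-B")
a_hat, a_star_hat = res.x[:n], res.x[n:]

# (3.12): the fitted values are a kernel expansion in the dual coefficients
V_hat = Q @ (a_star_hat - a_hat)
```

Starting from zero (which is feasible and has objective value zero), the solver can only improve, so the optimal objective is non-positive; the fitted curve is recovered without ever forming the large inverse of (3.4).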

Note that computing V in this way avoids the O(m²M)-complexity inversion. θ is parameterized by the dual variables α, α* and the basis function Φ. From here on we informally call θ a parameter, although it is not a parameter of the underlying model. θ does, however, capture the substantive influence of the unobservable random effects and of the controlled basis function: any adjustment of the model is reflected in the variation of θ. Therefore, θ plays a role similar to that of a parameter in a parametric model. Besides the dual representation, it is necessary to extend the uniform convergence result of Theorem 3.1 to the constrained case. The constraint determines the admissible set of functions, so it can be viewed as a regularization operator on the available functions. We denote it by Υ, mapping from an inner product space of K := {k | k : X × X → R} into another inner product space, with regularization term Ω(k) := ⟨Υk, Υk⟩. The transformation Υ extracts those parts that should be affected by the regularization. Because the interesting candidate functions must satisfy the constraint, they are expected to be stable under the Υ transformation. Therefore, if Γ̂V = ⟨k, ΓV⟩ = ⟨Υk, ΥΓV⟩ holds, the uniform convergence is valid for Ω(k).

Theorem 3.2  A linear operator Υ mapping from an inner product space to another inner product space satisfies Ψ = ΥΦ for all Φ, Φ′ ∈ K, where Ψ(s_j) is defined in (3.12). We have ⟨Υk(s, ·), ΥΓ_a V(·)⟩ = Γ_a V(s) and ⟨Υk(s, ·), Υk(s′, ·)⟩ = Γ_a k(s, s′). Hence ‖Γ̂_a V − ⟨Υk, ΥΓ_a V⟩‖ converges uniformly to zero.

This result is a very useful condition for capacity control. It means that equation (3.12) can be written as V = Σ_{i,j} ⟨α_i, α_j⟩ Γ_a k(s_i, s_j). Note that the coefficient ⟨α_i, α_j⟩ is significantly simpler than the inverse operator in (3.4). From a statistical point of view, the simplification is due to capacity control via entropy numbers.¹ The set of possible solutions scales linearly when the regularized class is re-scaled by a constant.
Therefore, the analysis of the curse of dimensionality rests on the growth function, as in the usual kernel approximation case. In other words, we do not need new capacity bounds for this constrained optimization problem.

¹Uniform convergence often has an exponential VC-type bound with a factor given by the entropy number. By the theorem, Υ is a scaling operator, so the entropy number changes only by the scaling.

3.3. Iterative Policy Algorithm

The preceding results show that the kernel-based constrained optimization reduces the computational complexity while maintaining a uniformly consistent approximation to the ex ante value function. The construction is based on a fixed policy rule a. With the explicit kernel-based expressions (3.12), we can set up an iterative algorithm to obtain the estimated policy function σ̂(θ). Because of the ε-insensitive loss function, the estimated policy function is not necessarily equivalent to that of the recursive forward-iteration method. The ε-tube avoids over-fitting the model, so Γ̂V fluctuates within the tube by ±ε. Thus the action a = arg max_a [Γ̂V ± ε] may differ from a = arg max_a ΓV under an exact operator Γ. The algorithm is given below.

1. Set the initial policy a_0, and select a basis function and its corresponding kernel k.
2. Choose a subset S^a of the state space and ensure that the transition probability between any two of its elements is strictly positive.
3. Given the action a_it (a_i0 for the first evaluation) for firm i, calculate the kernel matrix K := {K_ikj = ⟨Φ(s_k), Φ(s_j)⟩} for all s_k, s_j ∈ S^a.
4. Given the profit function π_i evaluated at policy a_t and the kernel matrix K, solve the optimization problem (3.11) for α and α*.
5. Apply (3.12) to obtain the ex ante value function V_i(s), and then calculate the one-step policy improvement
   a_{i,t+1} = arg max_a Σ_{j∈S} (α*_j − α_j) [ K_ij − β Σ_{s′} p(s′|s_i, a) K_{i′j} ].
   Set the next-period action to a_{i,t+1}, update the policy rule, and return to Step 3.

The procedure is similar to the inner iteration of the nested fixed point algorithm (NFXP; Rust, 1987), except that we implement a kernel-based optimization rather than an approximation. One may therefore expect to compare the results of NFXP with (3.11). We emphasize, however, that it is unfair to use the policy function a = arg max_a ΓV to judge the correctness of a = arg max_a [Γ̂V ± ε]; they focus on different aspects.
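A compressed sketch of Steps 1-5 follows. To keep it short, the QP of Step 4 is replaced by a plain kernel-smoothed policy evaluation (the ε = 0 limit discussed in the text); the grid, payoffs, and transitions are hypothetical:

```python
import numpy as np

def policy_iteration(s_grid, payoff, P, beta=0.9, b=0.15, max_iter=50):
    """Steps 1-5 with Step 4 simplified to kernel-smoothed evaluation.
    payoff: (S, A) flow payoffs; P: (A, S, S) transition matrices."""
    S, A = payoff.shape
    # Steps 1-3: kernel (Gram) matrix, row-normalized so weights sum to one
    K = np.exp(-(s_grid[:, None] - s_grid[None, :]) ** 2 / (2 * b ** 2))
    K /= K.sum(axis=1, keepdims=True)
    a_pol = np.zeros(S, dtype=int)          # Step 1: initial policy
    for _ in range(max_iter):
        # Step 4 (simplified): evaluate V under the current policy
        Pa = P[a_pol, np.arange(S), :]      # (S, S) transitions under policy
        u = payoff[np.arange(S), a_pol]
        V = np.linalg.solve(np.eye(S) - beta * K @ Pa, K @ u)
        # Step 5: one-step policy improvement
        Qv = payoff + beta * np.einsum('ast,t->sa', P, V)
        new_pol = Qv.argmax(axis=1)
        if np.array_equal(new_pol, a_pol):
            break                            # the policy has converged
        a_pol = new_pol
    return V, a_pol

# toy example: action 1 pays the state itself, action 0 pays a flat 0.2
s_grid = np.linspace(0.0, 1.0, 5)
payoff = np.column_stack([np.full(5, 0.2), s_grid])
P = np.full((2, 5, 5), 0.2)                 # uniform transitions, both actions
V, a_pol = policy_iteration(s_grid, payoff, P)
```

Because the transitions are identical across actions in this toy example, the improvement step simply picks the action with the higher flow payoff in each state, which makes the converged policy easy to verify by hand.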
A user of an exact approximation has strong confidence in the fixed point condition and prefers to believe that equilibria occur at the steady condition V = ΓV. A user of the robust kernel-based optimization may prefer to be less optimistic and to consider the potential actions under imperfect situations. The results are comparable when ε is set to zero in (3.7). Once a user is optimistic about Bellman's principle of optimality, he can set ε to zero and assume the fixed point condition

holds exactly. The problem then reduces to the kernel-based MDP approximation. We give the following theorem for the equivalence of the iterative policy algorithm in the parametric and semi-parametric settings.

Theorem 3.3  When ε goes to zero and the basis function of the kernel matrix equals the value function almost surely, the approximated value function obtained by (3.12) is equivalent to the solution V of the following system of linear equations:

V = Σ_{a_i ∈ A} F*_i(a_i) [π*(a_i) + e*_i(a_i)] + β Σ_{s′} V(s′) p*(s′|s),

where e* is the expectation of ε conditional on a_i and F*_i(a_i) is the conditional choice probability. p*, π*, e* are vectors that stack the corresponding state-specific elements, and the star denotes elements associated with an equilibrium conditional choice probability. In addition, the iterative policy algorithms are also equivalent.

4. THE SECOND STEP ESTIMATION

In the preceding section, θ was calculated and treated as a structural parameter for the integrated Bellman equation. One may consider using θ directly to recover the remaining structural parameters of the unobservable random variables. There are two obstacles to a direct estimation based on θ. First, one has no prior information about the distributional form of the heterogeneous random variables or their underlying parameter spaces. Second, the pseudo parameter θ is not a singleton: it generates a set such that, with suitable kernel functions and bandwidth choices, any value in this set is a valid candidate for solving the kernel-based approximation. How many values should one choose? If only one, which is optimal? In this section we develop an approach that overcomes these difficulties. Nonparametric smoothing densities address the first obstacle. Empirical Likelihood (EL; Owen, 1988, 1990, 2001) generates a so-called implied density function based on model constraints.
The usual kernel density distributes weights according to the inner product between two input variables in the feature space; in contrast, EL distributes weights according to the imposed constraints. The pseudo parameters of EL's implied density are the Lagrange multipliers of the model constraints. Input values satisfying the model

constraints are assigned higher weights, while those violating the constraints are assigned lower weights. In this paper, we develop a local EL that preserves all major properties of EL but simplifies the computation on a local set. The second obstacle is closely related to the multiple-MPE issue. In the parametric case, σ̂ may depend on the structural parameters through a set of equilibrium conditions such as first-order conditions, market balance conditions, etc. The correspondence between σ̂ and the structural parameters is not necessarily one-to-one if there are multiple MPE. Suppose a firm's utility function is highly nonlinear, e.g. u(c) = c − ac² − bc³, where a, b are structural parameters. Economic theory says that the firm sets the price equal to the marginal utility, u_c(c) = 1 − 2ac − 3bc² = p. This first-order condition implies two solutions. In the nonparametric case, the correspondence is left unspecified, and in general multiple MPE exist. Changes in the bandwidth b or in the functional form of the basis function φ yield different values for the Lagrange multipliers α and α*. Consequently, the value of the policy function σ̂(θ) changes, since σ̂ depends on θ and θ is determined by α and α* through (3.12). Within a class of kernel functions K and an interval of b, the solutions of the constrained optimization problem (3.7) are all feasible. Therefore, we can construct a set of θ such that

Θ_i(b, K) := {θ_i : b, k satisfy (3.7) for firm i}.

The optimal θ_i ∈ Θ_i should maximize the empirical likelihood. The construction of Θ is motivated by Bajari, Benkard, and Levin (2007), who select a set of θ based on the MPE condition.
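For a scalar moment condition, the EL implied weights take the form p_i = 1/(n(1 + λg_i)), with λ solving the first-order condition Σ_i g_i/(1 + λg_i) = 0. A minimal Newton-based sketch (the moment values g below are hypothetical):

```python
import numpy as np

def el_weights(g, n_newton=100, tol=1e-12):
    """EL implied weights for a scalar moment: p_i = 1 / (n (1 + lam * g_i)),
    with lam solving sum_i g_i / (1 + lam * g_i) = 0 by Newton's method.
    Requires g to take both signs (zero must lie in the convex hull)."""
    g = np.asarray(g, dtype=float)
    n = len(g)
    lam = 0.0
    for _ in range(n_newton):
        d = 1.0 + lam * g
        f = np.sum(g / d)               # first-order condition in lam
        fp = -np.sum(g ** 2 / d ** 2)   # its derivative
        step = f / fp
        lam -= step
        if abs(step) < tol:
            break
    p = 1.0 / (n * (1.0 + lam * g))
    return p, lam

p, lam = el_weights([-1.0, 0.5, 1.0, 2.0])
```

At the solution the weights automatically sum to one, and observations far from satisfying the moment condition receive smaller weight, which is the down-weighting of constraint violators described in the text.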
Bajari, Benkard, and Levin (2007) choose alternative Markov policies σ′; any parameter θ that makes the value function under the MPE strategy profile σ larger than under all alternatives σ′ is selected into the set.

The Semi-parametric Constraints

In parametric models, the constraints for the second-step estimation usually come from a fixed-point condition on the conditional choice probabilities (Pesendorfer and Schmidt-Dengler, 2003; Aguirregabiria and Mira, 2007) or from an equilibrium condition relating the policy function to the structural parameters (Su and Judd, 2008). A nonparametric model does not specify the functional form of the mixture distribution G and has multiple MPE, so we have to seek an alternative constraint.

To illustrate the important role of these constraints, we use a classical example. Hotz and Miller (1993, Lemma 3.1) estimate the conditional choice probabilities F(a|s) nonparametrically and invert the map

(4.1)    F(a|s) = [∫ max_a V(s, a, ε) dG(ε|a)] / V

to recover the value function parameters. Equation (4.1) comes from Roy's identity and is a Fredholm integral equation of the first kind. To simplify (4.1), we denote an operator A such that AG = P, so G = A⁻¹P. The inverse operator is computationally tedious and may end up with unstable solutions. This second issue is especially critical in the nonparametric case because it causes the so-called ill-posed problem.

Su and Judd (2008) claim that the constrained-optimization algorithm has a quadratic convergence rate, which is faster than the linear convergence rate of the usual iterative algorithms. Their Mathematical Programming with Equilibrium Constraints (MPEC) finds the optimal solution without a redundant specification of the relation between σ̂ and the structural parameters. Su and Judd (2008) show that the constrained optimization problem

max_{(θ,σ)} L(θ, σ; s)   s.t.   T(θ, σ) = 0

can be solved by good solvers without the additional effort of specifying an algorithm for computing {Σ : σ = Σ(θ)}. A good solver in a professional optimization package implicitly defines Σ via T(θ, Σ(θ)) = 0 and implements the augmented likelihood through L(θ, Σ(θ); s). When P has an unknown functional form, the constraint plays the role of a regularization operator for the inverse problem G = A⁻¹P. Hence, the constraints in the second-step estimation can speed up the computation and regularize the solution set.

Assumption 1 (Conditional Independence): Conditional on S, the distribution G is independently distributed across firms. G_i for firm i is independently and identically distributed over time.
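The MPEC idea can be sketched with a generic nonlinear solver. The model below is a toy illustration, not the paper's game: a Gaussian pseudo-likelihood and a scalar "equilibrium" condition stand in for L(θ, σ; s) and T(θ, σ) = 0, and the solver handles the constraint with no inner fixed-point loop.

```python
import numpy as np
from scipy.optimize import minimize

# Toy MPEC sketch: maximize L(theta, sigma; s) subject to T(theta, sigma) = 0.
# The solver implicitly traces out sigma = Sigma(theta); we never code it.
data = np.array([0.9, 1.1, 1.0, 0.8, 1.2])      # toy observations

def neg_loglik(x):
    theta, sigma = x
    # Gaussian pseudo-likelihood: sigma is the policy-implied prediction
    return 0.5 * np.sum((data - sigma) ** 2) + 0.01 * theta ** 2

def equilibrium(x):
    theta, sigma = x
    # toy equilibrium condition T(theta, sigma) = 0
    return sigma - 1.0 / (1.0 + theta)

res = minimize(neg_loglik, x0=[1.0, 0.5], method="SLSQP",
               constraints=[{"type": "eq", "fun": equilibrium}])
theta_hat, sigma_hat = res.x
```

For this toy problem the constrained optimum is θ = 0, σ = 1; the point of the sketch is only that the equality constraint is passed to the solver rather than solved as a nested fixed point.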

Assumption 2 (Additive Separability): Private information appears additively in the profit function, π_i(a_t, s, ε_i) = π_i(a_t, s) + ε_i(a_i).

The CI and AS assumptions concern the structure of the primitives and are the key assumptions for identifying the private shocks. By equation (3.5),

(4.2)    V_i(s; σ) = E_ε[π_i(σ(s, ε), s, ε_i) + β ∫ V_i(s′; σ) dP(s′|s; σ(s, ε))] ≐ θᵀΦ(s).

If the opponents' policy functions σ̂_{−i} are given and a_i = a is selected,

(4.3)    V_i(s; σ̂_{−i}, a_i) = E_{ε_i} E_{ε_{−i}}[π_i(σ̂_{−i}(s, ε_{−i}), s, ε_i, a) + β ∫ V_i(s′; σ) dP(s′|s; σ(s, ε))]
                             = E_{ε_i} E_{ε_{−i}}[π_i(σ̂_{−i}(s, ε_{−i}), s, ε_i, a)] + β ∫_{ε_1…ε_N} ∫ V_i(s′; σ) dG_1 ⋯ dG_N dP(s′|s; σ̂_{−i}, a_i)
                             = E_{ε_{−i}}[π_i(ε_{−i}, a, σ̂_{−i}(s), s)] + β Σ_{s′} θᵀΦ(s′) p(s′|s; σ̂_{−i}, a_i).

(4.3) comes from the CI assumption. Let E_{ε_{−i}}[π_i(ε_{−i}, a, σ̂_{−i}(s), s)] = π̄_i(a, σ̂_{−i}(s), s). The approximation in (3.5) is implemented under the MPE condition, namely that the policy rule is the optimal choice for the integrated Bellman equation. The kernel mitigates the unobservable shocks under the optimal choice:

(4.4)    π̄_i(a, σ̂_{−i}(s), s) = ∫ [π_i(a, σ̂_{−i}(s), s) + ε(a)] dG(ε|σ̂) = π_i(a, σ̂_{−i}(s), s) + μ_i(a).

The first equality of (4.4) applies the AS assumption, and the last equality comes from ∫ ε(a) dG(ε(a)) = μ_i(a). Here μ is the average difference between the estimated profit function π̄_i(σ̂(s)) and π_i(σ̂(s)) under the best response σ̂_i = a. Equation (4.4) expresses the fact that the firm's estimated profit function comes from the complete-belief MPE condition; a deviation from the steady policy rule shifts the firm back to the situation of incomplete beliefs. CI states that ε is independently distributed across t for action a. If G(·) is ergodic, the nonparametric constraint is:

(4.5)    (1/T) Σ_t [π̄_i(a, σ̂_{−i}(s_t), s_t) − π_i(a, σ̂_{−i}(s_t), s_t)] = ∫ [Σ_t ε_t(a)/T] dG_i → ∫ ε(a) dG_i = μ(a_i).
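Under a fixed policy, the approximation V(s) ≈ θᵀΦ(s) in (4.2) turns the Bellman recursion into a linear system on a finite state grid. A minimal sketch with an illustrative polynomial basis, transition matrix, and profit vector (none of these objects is taken from the paper):

```python
import numpy as np

# Toy evaluation of the linearly parameterized ex ante value function:
# theta' Phi(s) = pibar(s) + beta * sum_{s'} theta' Phi(s') p(s'|s).
# On a finite state grid with a fixed policy this is a linear system in theta.
S = 5                                    # number of grid states (illustrative)
beta = 0.95
grid = np.linspace(0.0, 1.0, S)
Phi = np.vander(grid, 3, increasing=True)   # basis columns [1, s, s^2]
P = np.full((S, S), 1.0 / S)             # toy transition matrix under the policy
pibar = np.linspace(1.0, 0.2, S)         # toy expected per-period profits

# Solve (Phi - beta * P @ Phi) theta = pibar in the least-squares sense
theta, *_ = np.linalg.lstsq(Phi - beta * P @ Phi, pibar, rcond=None)
V = Phi @ theta                          # approximated ex ante values
```

Because the toy profit vector lies in the span of the basis, the least-squares solution here satisfies the recursion exactly.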

The profit function π_i(a, σ̂_{−i}(s), s) is a primitive; one can solve the Bertrand–Nash equilibrium to obtain π_i. In practice, however, it is difficult to approximate π̄_i(a, σ̂_{−i}(s), s) directly. Instead, we can use the approximation of the ex ante value function to extract μ.

Corollary 1: Suppose a set of profit functions {π_a}_{a=1}^A : X → R has the property that the n × A matrix (π_a(x_n))_{na} has rank A. Then the basis functions φ_a span {π_a}_{a=1}^A, and Φ constructs a representation of the ex ante value function V(s; σ) such that

V(s; σ) ≐ φ(s) + μ + β Σ_{s′} θᵀΦ(s′) p(s′|s; σ̂_{−i}, a_i) = Ṽ(s, θ) + μ.

The difference between this semi-parametric representation and the nonparametric representation of the ex ante value function is the mean of the heterogeneous beliefs. Thus, given action a, (4.5) can be written for agent i as

(1/T) Σ_t [θᵀΦ(s_t) − Ṽ(s_t, θ)] → ∫ [θᵀΦ(s_t) − Ṽ(s_t, θ)] dG_i = μ_i.

We use the shorthand notation m_it(θ) for θᵀΦ(s_t) − Ṽ(s_t, θ). The integration with respect to G generates a moment condition for the model. The distribution G can be estimated empirically and used for a test on μ.

EL and Local EL

The constraint (4.5) is used to identify the unknown distribution function G and to speed up the convergence of the computation. We apportion probabilities g = (g_1, …, g_T) for the distribution G. The log-likelihood of G is expressed in terms of these weights as Σ_{t=1}^T log(T g_t). In addition, g should satisfy the usual requirements for probabilities, g_t ≥ 0 and Σ_t g_t = 1. Given action a for firm i, we obtain the following EL criterion:

(4.6)    max Σ_t log(T g_it),   s.t.   g_it ≥ 0,  Σ_t g_it = 1,  Σ_t g_it m_it(θ) = μ(a),  θ ∈ Θ_i,

where Θ_i is the set generated by a class of kernel functions and bandwidth choices. The constraints of the objective function are made up of a convex hull of a family of multinomial distributions {g_i}^T and the nonparametric constraint. We may proceed by the method of Lagrange multipliers:

(4.7)    L := Σ_t log(T g_it) − Tλ′ Σ_t g_it [m_it(θ) − μ(a)] + γ(Σ_t g_it − 1).

By the KKT conditions, an explicit expression can be derived from the Lagrange multiplier argument:

(4.8)    g_it(θ) = (1/T) · 1 / (1 + λ′[m_it(θ) − μ(a)]),

where λ is found by numerical search. A feasible λ satisfies

(4.9)    (1/T) Σ_{t=1}^T (m_it(θ) − μ(a)) / (1 + λ′[m_it(θ) − μ(a)]) = 0.

Substituting (4.8) into (4.6), we obtain a minimax criterion function that is the dual representation of problem (4.6):

(4.10)    min_θ Σ_t log(1 + λ′[m_it(θ) − μ(a)]),   s.t.   (1/T) Σ_{t=1}^T (m_it(θ) − μ(a)) / (1 + λ′[m_it(θ) − μ(a)]) = 0.

Problem (4.10) is a standard nonlinear optimization problem. The outer loop minimizes the empirical log-likelihood with respect to θ, and the inner loop obtains the numerical value of λ. If the log-likelihood ratio in (4.6) is replaced by the entropy Σ_t T g_it log(T g_it), Kitamura and Stutzer (1997) show that the implied density becomes

(4.11)    g_it,ET(θ) = exp(λ′[m_it(θ) − μ(a)]) / Σ_t exp(λ′[m_it(θ) − μ(a)]),

which is similar to a multinomial choice probability. In the parametric case, one usually assumes the unknown private shocks follow a multinomial distribution. Unlike the parametric case, the implied density is flexible with respect to the observations and reassigns the weights based on the constraint. EL, like GMM and related approaches, constructs a divergence criterion that preserves this identification property and improves the robustness and efficiency of the optimization procedure.

However, GMM and EL estimations are global approaches, and one assumes the underlying parameters can be found globally. Global optimization procedures are used for problems with a small number of variables, where computing time is not critical and the possibility of finding the true global solution is high; the complexity of global optimization methods grows exponentially with problem size. Since the generated parameter set Θ is not a global object, it is unnecessary to implement a global optimization method. Moreover, the outer-loop optimization could be discontinuous due to the presence of multiple local optima.

In contrast to the optimization problem (3.5), the computational difficulty of EL lies in evaluating the outer loop. The search direction for θ in the outer loop is unstable. Let H and s denote the Hessian and gradient of Σ_t log(1 + λ′[m_it(θ) − μ(a)]) with respect to θ. The Newton iteration gives

(4.12)    θ^(k+1) = θ^(k) − H(θ^(k))⁻¹ s(θ^(k)).

The evaluation of the Hessian matrix H(θ^(k)) requires the second derivative of the log-likelihood function. There is no closed-form derivative of m_it(θ) − μ(a), and a numerical derivative is difficult to implement as well. Because we applied the ε-loss function in the first step, θ may perform quite similarly in a small neighborhood [θ − ɛ, θ + ɛ]; the Hessian matrix in this region is so flat that a singularity problem may occur. In addition, for the Newton method to be valid, the Taylor expansion of the log-likelihood function must be guaranteed, so one crucial step is to make sure the log-likelihood function is smooth enough. Global smoothness appears to be fragile in this incomplete-belief MDP model. Therefore, we suggest a localized EL algorithm that inherits the properties of EL but is more robust: it provides a linear-quadratic expansion that excludes the Hessian matrix. We call this alternative algorithm Local EL. We construct a local estimator that has an invariant limiting distribution within the local neighborhood and is asymptotically optimal. It does not require globally smooth functions, and the optimization depends only on local values of the EL.
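The inner loop of (4.10), solving (4.9) for λ given θ, can be sketched in the scalar-moment case. Here m holds the centred moments m_it(θ) − μ(a) for a toy draw; the root bracketing uses the fact that every 1 + λ m_t must stay strictly positive.

```python
import numpy as np
from scipy.optimize import brentq

# Inner loop of the EL dual (4.10): given centred moments m_t = m_it(theta) - mu(a),
# solve the first-order condition (4.9) for lambda, then recover the implied
# probabilities (4.8).  The moments here are a toy draw, not model output.
def el_weights(m):
    T = len(m)
    # lambda must keep every 1 + lambda * m_t strictly positive, which
    # brackets the root of the first-order condition:
    lo = -1.0 / m.max() + 1e-8
    hi = -1.0 / m.min() - 1e-8
    foc = lambda lam: np.mean(m / (1.0 + lam * m))   # (4.9), scalar case
    lam = brentq(foc, lo, hi)
    g = 1.0 / (T * (1.0 + lam * m))                  # (4.8)
    return lam, g

rng = np.random.default_rng(0)
m = rng.normal(0.3, 1.0, size=50)    # toy centred moments with nonzero mean
lam, g = el_weights(m)
```

By construction the returned weights are positive, sum to one, and satisfy the moment constraint Σ_t g_t m_t = 0 up to solver tolerance.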
It is computationally more efficient to evaluate the local EL than to calculate the second derivative of the objective function or the Lagrangian. The word "local" indicates that, given θ, the value θ′ is so close to θ that it is not possible to separate g(θ) and g(θ′) easily. Assumption A.3 implies that the log-likelihood ratios of the implied probabilities are approximated by a linear-quadratic formula. The problem reduces to a standard linear-quadratic programming problem:

(4.13)    min_θ (1/T) Σ_t [u_t′ S_t − (1/2) u_t′ M_t u_t],   s.t.   (1/T) Σ_{t=1}^T (m_it(θ) − μ(a)) / (1 + λ′[m_it(θ) − μ(a)]) = 0,

where u_t is a small step and S_t and M_t are the elements calculated by the LEL algorithm.

LEL algorithm

Partition Θ_i into several grids and select one value, denoted θ̄_it, from each grid. Run the algorithm grid by grid.

1. Find an auxiliary estimate θ̄_it with its value in Θ_i.
2. Construct a matrix M_t = {M_t,q,p}, q, p = 1, 2, …, l, where

M_t,q,p = −{Λ_t[θ̄_t + (u_q + u_p)/√T, θ̄_t] − Λ_t[θ̄_t + u_q/√T, θ̄_t] − Λ_t[θ̄_t + u_p/√T, θ̄_t]},

u_1, …, u_l is a basis of R^l, M_t is invertible, and Λ_t(θ_1, θ_2) = log(g̃_it(θ_1)/g̃_it(θ_2)).
3. Construct a linear term from the linear-quadratic approximation:

u_p′ S_t = Λ_t[θ̄_t + u_p/√T, θ̄_t] + (1/2) M_t,p,p.

Since all values on the right-hand side are known, S_t can be computed as a statistic.
4. Construct a central estimator θ̃_t = θ̄_t + M_t⁻¹ S_t/√T, namely S_t = √T M_t(θ̃_t − θ̄_t). θ̃_t appears as θ̄_t with an added correction.
5. Return the value of Σ_t log(T g̃(θ̃_t)). If Σ_t log(T g̃(θ̃_t)) > Σ_t log(T g̃(θ̄_t)), choose θ̃_t and return to Step 2; otherwise, choose θ̄_it.

Proposition 1: M_t in the algorithm is invertible.

The advantage of the LEL algorithm is that gradient vectors and Hessian matrices are excluded from the log-likelihood approximation. The construction of M is based on a property of the logarithm function. The matrix M is very sensitive to small changes: let g̃(θ_t + u/√T)/g̃(θ_t) = ɛ with ɛ ∈ (0, 1); the logarithm magnifies a small ɛ such that log ɛ ∈ (−∞, 0). Conversely, the constructed estimator is robust to peculiar outliers because large ɛ values are dampened by the logarithm. The calculation of M is independent of numerical second-order derivatives, and the log-likelihood is easy to evaluate at θ, so LEL improves the computation. Gao (2009) proves that the linear-quadratic expansion in (4.13), with M_t and S_t obtained by the LEL algorithm, converges to the EL log-likelihood function.

Theorem 4.1: If the underlying distribution G has finite mean and variance, −2 Σ_t log(T g̃(θ̃)) converges in distribution to χ²(1) as T → ∞.

Theorem 4.1 provides an asymptotic justification for tests that accept the value μ at the α level when −2 Σ_t log(T g̃(θ̃)) < χ²_{1−α}(1). If μ is not rejected, the function g̃ at the given θ is the implied distribution for this structural parameter. The EL confidence region gives an instruction on how to use the implied distribution: if the confidence region does not reject μ, then μ is a reliable estimate and its associated implied distribution is suitable for inference. Otherwise, one should be cautious about μ and hesitant to apply the empirical distribution of ε(a) in (4.5) to inference.

EL methods for the mean require some modifications to work for other parameters such as the variance, which is the mean of [m_t(θ) − μ]². If μ is known or lies in the confidence region, we can construct the EL ratio function for the mean of [m_t(θ) − μ]²; the computation is similar to the preceding approach. The implied probability g̃(θ̃) can be considered a parametric family with T − 1 parameters (g̃_1, …, 1 − Σ_t^{T−1} g̃_t). The implied probability of EL is very flexible with respect to increasing data size, since the number of its parameters grows with the sample.
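The derivative-free construction of M and S can be illustrated in one dimension. A toy Gaussian-mean log-likelihood stands in for the EL objective (so the quadratic expansion is exact and the one-step correction lands on the MLE); the basis is just u = 1, and all names are illustrative.

```python
import numpy as np

# One-dimensional sketch of the LEL construction: the curvature M and the
# linear term S are built from log-likelihood *ratios* at points perturbed
# by u / sqrt(T), so no gradient or Hessian code is ever evaluated.
rng = np.random.default_rng(1)
T = 400
x = rng.normal(2.0, 1.0, size=T)          # toy data; true mean is 2

def loglik(theta):
    return -0.5 * np.sum((x - theta) ** 2)

def lel_step(theta0):
    h = 1.0 / np.sqrt(T)                  # local perturbation scale
    lam = lambda t1, t0: loglik(t1) - loglik(t0)   # log-likelihood ratio
    # curvature from ratio differences only (u_q = u_p = 1 in the 1-d basis)
    M = -(lam(theta0 + 2 * h, theta0) - 2 * lam(theta0 + h, theta0))
    # linear term: u'S = Lambda + (1/2) M
    S = lam(theta0 + h, theta0) + 0.5 * M
    # central estimator: one correction step, theta0 + M^{-1} S / sqrt(T)
    return theta0 + h * S / M

theta_tilde = lel_step(1.0)               # one step from a crude start
```

For a quadratic log-likelihood the ratio-based M and S are exact, so a single step from any starting value recovers the sample mean; for the EL objective the same construction gives the local quadratic approximation without any derivative code.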
Thus, for a continuous underlying distribution, EL may approach true parameter values that a parametric MLE might not. The growing number of parameters makes EL appear very different from a parametric likelihood. However, if the underlying distribution is discrete, the EL ratio function is that of a multinomial supported on the observed values. As T increases, EL eventually reduces to a random, data-determined multinomial with an ever-increasing number of parameters.
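The calibration in Theorem 4.1 can be sketched for the scalar-mean case: compute the EL ratio statistic −2 Σ_t log(T g̃_t) at a hypothesised μ and compare it with the χ²(1) critical value. The data are a toy draw, and the inner λ-search mirrors (4.9).

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_ratio_stat(x, mu):
    """EL ratio statistic -2 * sum(log(T * g_t)) for the mean hypothesis mu."""
    m = x - mu
    T = len(m)
    lo = -1.0 / m.max() + 1e-10           # keep 1 + lam*m_t > 0 for all t
    hi = -1.0 / m.min() - 1e-10
    lam = brentq(lambda l: np.mean(m / (1.0 + l * m)), lo, hi)
    g = 1.0 / (T * (1.0 + lam * m))       # implied probabilities as in (4.8)
    return -2.0 * np.sum(np.log(T * g))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=200)        # toy sample with true mean 0
crit = chi2.ppf(0.95, df=1)               # chi2(1) critical value at 5%
reject_far = el_ratio_stat(x, mu=1.0) > crit   # a distant mu is rejected
```

The statistic is zero at μ equal to the sample mean, nonnegative everywhere, and grows rapidly as μ moves away from the data, which is what the χ²(1) cutoff exploits.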

Figure 1. Log-likelihood. The horizontal scale is 10 times as large as the vertical scale; the logarithm operator enlarges the differences and makes the optimal point much more pronounced.

5. APPLICATION AND NUMERICAL RESULTS

We start this section with a simple application to Rust's model of the optimal replacement of bus engines (Rust, 1987). This is a single-agent dynamic discrete choice model, but it is a useful starting example to illustrate how the semi-parametric algorithm works. In this model, the maintenance manager i of the bus company has to decide how long to operate a bus before replacing its engine with a new one. The state variable s_it is the accumulated mileage of the engine at time t. The manager chooses whether to replace the engine (a = 1) or maintain it (a = 0). When a bus engine is replaced, it is as good as new, so the state of the system regenerates to s_it = 0 when a_it = 1. The private shock ε_it(a_it) is assumed to be additively separable. The profit function is given by

π(a_it, s_it, θ_1i, θ_2i) = { −θ_1i − c(0, θ_2i) + ε_it(1)   if a_it = 1;
                              −c(s_it, θ_2i) + ε_it(0)       if a_it = 0,

where c(·) is the cost of operating and maintaining the engine. The transition probability for s_it is

p_i(s_it+1 | s_it; a_it) = { g(s_it+1 − 0)      if a_it = 1;
                             g(s_it+1 − s_it)   if a_it = 0,

where g(·) is a known probability density function. The discount factor β in this model is set to a fixed value, and g(·) = θ_2i exp(−θ_2i(·)) includes the parameter θ_2i. We use the estimation results in Rust (1987) as initial values. It is possible to use a more realistic distribution, such as a log-normal distribution with separate parameters for the mean and variance, because an explicit-solution likelihood is unnecessary. The specifications of the cost function are given below:

(5.1)    Quadratic: c(s, θ) = θ_1 s + θ_2 s²,
         Power:     c(s, θ) = θ_1 s^{θ_2},
         Mixed:     c(s, θ) = θ_1/(1.1 − s) + θ_2 s^{1/2}.

The last specification differs slightly from the original one, where the constant is set to 91; the reason is that we scale the state variable to the [0, 1] interval rather than discretize it into 90 states. One advantage of scaling is that it prevents states in greater numeric ranges from dominating those in smaller numeric ranges. Another advantage is that it avoids numerical difficulties in the kernel calculation: in kernel evaluations, the inner products of basis functions may generate large values, which cause problems in the numerical operations. The kernel matrix evaluated via ⟨Ψ(s_k), Ψ(s_j)⟩ is shown in Figure 2 (monthly data for 1975 GMC model 5308 buses). We use cross-validation to select suitable parameters for the RBF kernel; cross-validation separates the data into several folds and then carries out a parallel grid search over these folds. The kernel matrix has a main-diagonal representation with a significantly sparse pattern on the minor diagonals. This matrix captures the phenomenon that all odometers accumulate miles at the beginning stage, but the transitions among states become divergent for large odometer values. Let s* satisfy EV(s*, a = 1) = EV(0, a = 0), a standard mileage for engine replacement.
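The three cost specifications in (5.1), together with the rescaling of the odometer state to [0, 1], can be written directly (the mileage figures below are illustrative, not the paper's data):

```python
import numpy as np

# The three cost specifications in (5.1); s is the odometer state rescaled
# to [0, 1], which keeps the mixed form's denominator 1.1 - s away from zero.
def cost_quadratic(s, th1, th2):
    return th1 * s + th2 * s ** 2

def cost_power(s, th1, th2):
    return th1 * s ** th2

def cost_mixed(s, th1, th2):
    return th1 / (1.1 - s) + th2 * np.sqrt(s)

miles = np.array([0.0, 120e3, 240e3, 360e3])   # illustrative odometer readings
s = miles / miles.max()                        # rescale states to [0, 1]
```

Rescaling also keeps the three specifications on comparable numeric scales, which matters for the kernel evaluations discussed below.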
When the engine is close to its limit, the manager can either postpone the engine replacement and run it longer, or install a new engine at a certain cost. In other words, when s is close to s*, the effects of the Markov process matter more than they do at the beginning stage. The feature elements (s_i, s_i±j) for small j and large i in the kernel matrix are significant, so these K(s_i, s_i±j) are useful in mapping the sample into a reproducing kernel Hilbert space. Another advantage is the sparse pattern: with a large sparse pattern, inversion and multiplication are tractable even for very high-dimensional matrices.

Figure 2. Kernel matrix. The data come from a bus dataset with 4,329 observations.

The sensitivity loss ɛ controls the goodness of fit of the kernel approximation and thereby affects its inferential performance. To analyze the effects of ɛ, we use different values of ɛ in the power cost function case and compare the approximation and inference ability of the kernel function on estimation and testing samples. With fixed C and θ, Table I and Figure 6 give the goodness-of-fit and inferential-power results for the kernel functions. The first 50% of the data is used for estimation and the rest for testing. Prediction ability is measured in two terms, the Mean Squared Error (MSE) and the squared correlation coefficient (r²), which are Σ_i^N (θᵀΦ(s_i) − π_i)²/N and

(5.2)    r² = (N Σ_i^N θᵀΦ(s_i)π_i − Σ_i^N θᵀΦ(s_i) Σ_i^N π_i)² / [(N Σ_i^N (θᵀΦ(s_i))² − (Σ_i^N θᵀΦ(s_i))²)(N Σ_i^N π_i² − (Σ_i^N π_i)²)],

respectively. The MSE for the testing sample can be considered the information loss due to inaccurate prediction, and r² measures the linear relationship between the approximated value function and the parametric profit function. Obviously, a too-small ɛ makes the kernel over-fit the model and thus demands more information to construct the fit.
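The two fit measures in (5.2) can be computed directly; the r² expression is the squared Pearson correlation between the fitted values θᵀΦ(s_i) and the profits π_i. A minimal sketch with toy vectors:

```python
import numpy as np

# Fit measures from (5.2): MSE between fitted values and profits, and the
# squared sample correlation r^2 written in its sum form.
def fit_measures(fitted, pi):
    N = len(pi)
    mse = np.mean((fitted - pi) ** 2)
    num = N * np.sum(fitted * pi) - np.sum(fitted) * np.sum(pi)
    den = ((N * np.sum(fitted ** 2) - np.sum(fitted) ** 2)
           * (N * np.sum(pi ** 2) - np.sum(pi) ** 2))
    r2 = num ** 2 / den
    return mse, r2

pi = np.array([1.0, 2.0, 3.0, 4.0])       # toy profits
fitted = 2.0 * pi + 1.0                   # a perfectly linear fit
mse, r2 = fit_measures(fitted, pi)
```

Note that r² equals one for any exact linear relationship even when the MSE is large, which is why the two measures are reported together.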


Theory and Empirical Work on Imperfectly Competitive Markets. by Ariel Pakes. (Harvard University). The Fisher-Schultz Lecture Theory and Empirical Work on Imperfectly Competitive Markets. by Ariel Pakes (Harvard University). The Fisher-Schultz Lecture World Congress of the Econometric Society London, August 2005. 1 Structure

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Lecture Pakes, Ostrovsky, and Berry. Dynamic and Stochastic Model of Industry

Lecture Pakes, Ostrovsky, and Berry. Dynamic and Stochastic Model of Industry Lecture Pakes, Ostrovsky, and Berry Dynamic and Stochastic Model of Industry Let π n be flow profit of incumbant firms when n firms are in the industry. π 1 > 0, 0 π 2

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Dynamic Programming with Hermite Interpolation

Dynamic Programming with Hermite Interpolation Dynamic Programming with Hermite Interpolation Yongyang Cai Hoover Institution, 424 Galvez Mall, Stanford University, Stanford, CA, 94305 Kenneth L. Judd Hoover Institution, 424 Galvez Mall, Stanford University,

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable

More information

A Note on Demand Estimation with Supply Information. in Non-Linear Models

A Note on Demand Estimation with Supply Information. in Non-Linear Models A Note on Demand Estimation with Supply Information in Non-Linear Models Tongil TI Kim Emory University J. Miguel Villas-Boas University of California, Berkeley May, 2018 Keywords: demand estimation, limited

More information

Next, we discuss econometric methods that can be used to estimate panel data models.

Next, we discuss econometric methods that can be used to estimate panel data models. 1 Motivation Next, we discuss econometric methods that can be used to estimate panel data models. Panel data is a repeated observation of the same cross section Panel data is highly desirable when it is

More information

A Computational Method for Multidimensional Continuous-choice. Dynamic Problems

A Computational Method for Multidimensional Continuous-choice. Dynamic Problems A Computational Method for Multidimensional Continuous-choice Dynamic Problems (Preliminary) Xiaolu Zhou School of Economics & Wangyannan Institution for Studies in Economics Xiamen University April 9,

More information

1 Bewley Economies with Aggregate Uncertainty

1 Bewley Economies with Aggregate Uncertainty 1 Bewley Economies with Aggregate Uncertainty Sofarwehaveassumedawayaggregatefluctuations (i.e., business cycles) in our description of the incomplete-markets economies with uninsurable idiosyncratic risk

More information

Penalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask Machine Learning and Data Mining Support Vector Machines Kalev Kask Linear classifiers Which decision boundary is better? Both have zero training error (perfect training accuracy) But, one of them seems

More information

Sequential Monte Carlo Methods for Estimating Dynamic Microeconomic Models

Sequential Monte Carlo Methods for Estimating Dynamic Microeconomic Models Sequential Monte Carlo Methods for Estimating Dynamic Microeconomic Models JASON R. BLEVINS Department of Economics, Ohio State University Working Paper 11-01 May 2, 2011 Abstract. This paper develops

More information

Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables

Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables Nathan H. Miller Georgetown University Matthew Osborne University of Toronto November 25, 2013 Abstract

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,

More information

Linear-Quadratic Optimal Control: Full-State Feedback

Linear-Quadratic Optimal Control: Full-State Feedback Chapter 4 Linear-Quadratic Optimal Control: Full-State Feedback 1 Linear quadratic optimization is a basic method for designing controllers for linear (and often nonlinear) dynamical systems and is actually

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

Players as Serial or Parallel Random Access Machines. Timothy Van Zandt. INSEAD (France)

Players as Serial or Parallel Random Access Machines. Timothy Van Zandt. INSEAD (France) Timothy Van Zandt Players as Serial or Parallel Random Access Machines DIMACS 31 January 2005 1 Players as Serial or Parallel Random Access Machines (EXPLORATORY REMARKS) Timothy Van Zandt tvz@insead.edu

More information

Lecture 3: Computing Markov Perfect Equilibria

Lecture 3: Computing Markov Perfect Equilibria Lecture 3: Computing Markov Perfect Equilibria April 22, 2015 1 / 19 Numerical solution: Introduction The Ericson-Pakes framework can generate rich patterns of industry dynamics and firm heterogeneity.

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Deceptive Advertising with Rational Buyers

Deceptive Advertising with Rational Buyers Deceptive Advertising with Rational Buyers September 6, 016 ONLINE APPENDIX In this Appendix we present in full additional results and extensions which are only mentioned in the paper. In the exposition

More information

Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher. John Rust

Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher. John Rust Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher John Rust 1987 Top down approach to investment: estimate investment function that is based on variables that are aggregated

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Syllabus. By Joan Llull. Microeconometrics. IDEA PhD Program. Fall Chapter 1: Introduction and a Brief Review of Relevant Tools

Syllabus. By Joan Llull. Microeconometrics. IDEA PhD Program. Fall Chapter 1: Introduction and a Brief Review of Relevant Tools Syllabus By Joan Llull Microeconometrics. IDEA PhD Program. Fall 2017 Chapter 1: Introduction and a Brief Review of Relevant Tools I. Overview II. Maximum Likelihood A. The Likelihood Principle B. The

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Motivation Non-linear Rational Expectations The Permanent Income Hypothesis The Log of Gravity Non-linear IV Estimation Summary.

Motivation Non-linear Rational Expectations The Permanent Income Hypothesis The Log of Gravity Non-linear IV Estimation Summary. Econometrics I Department of Economics Universidad Carlos III de Madrid Master in Industrial Economics and Markets Outline Motivation 1 Motivation 2 3 4 5 Motivation Hansen's contributions GMM was developed

More information

Constrained Optimization Approaches to Estimation of Structural Models: Comment

Constrained Optimization Approaches to Estimation of Structural Models: Comment Constrained Optimization Approaches to Estimation of Structural Models: Comment Fedor Iskhakov, University of New South Wales Jinhyuk Lee, Ulsan National Institute of Science and Technology John Rust,

More information

Lecture 14 More on structural estimation

Lecture 14 More on structural estimation Lecture 14 More on structural estimation Economics 8379 George Washington University Instructor: Prof. Ben Williams traditional MLE and GMM MLE requires a full specification of a model for the distribution

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Ex Post Cheap Talk : Value of Information and Value of Signals

Ex Post Cheap Talk : Value of Information and Value of Signals Ex Post Cheap Talk : Value of Information and Value of Signals Liping Tang Carnegie Mellon University, Pittsburgh PA 15213, USA Abstract. Crawford and Sobel s Cheap Talk model [1] describes an information

More information

Robust Predictions in Games with Incomplete Information

Robust Predictions in Games with Incomplete Information Robust Predictions in Games with Incomplete Information joint with Stephen Morris (Princeton University) November 2010 Payoff Environment in games with incomplete information, the agents are uncertain

More information

A three-level MILP model for generation and transmission expansion planning

A three-level MILP model for generation and transmission expansion planning A three-level MILP model for generation and transmission expansion planning David Pozo Cámara (UCLM) Enzo E. Sauma Santís (PUC) Javier Contreras Sanz (UCLM) Contents 1. Introduction 2. Aims and contributions

More information

CCP Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity

CCP Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity CCP Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity Peter Arcidiacono Duke University Robert A. Miller Carnegie Mellon University February 20, 2008 Abstract We adapt the Expectation-Maximization

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Mathematical Optimization Models and Applications

Mathematical Optimization Models and Applications Mathematical Optimization Models and Applications Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 1, 2.1-2,

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Econometric Analysis of Games 1

Econometric Analysis of Games 1 Econometric Analysis of Games 1 HT 2017 Recap Aim: provide an introduction to incomplete models and partial identification in the context of discrete games 1. Coherence & Completeness 2. Basic Framework

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

CS181 Midterm 2 Practice Solutions

CS181 Midterm 2 Practice Solutions CS181 Midterm 2 Practice Solutions 1. Convergence of -Means Consider Lloyd s algorithm for finding a -Means clustering of N data, i.e., minimizing the distortion measure objective function J({r n } N n=1,

More information

UNIVERSITY OF NOTTINGHAM. Discussion Papers in Economics CONSISTENT FIRM CHOICE AND THE THEORY OF SUPPLY

UNIVERSITY OF NOTTINGHAM. Discussion Papers in Economics CONSISTENT FIRM CHOICE AND THE THEORY OF SUPPLY UNIVERSITY OF NOTTINGHAM Discussion Papers in Economics Discussion Paper No. 0/06 CONSISTENT FIRM CHOICE AND THE THEORY OF SUPPLY by Indraneel Dasgupta July 00 DP 0/06 ISSN 1360-438 UNIVERSITY OF NOTTINGHAM

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

ECON2285: Mathematical Economics

ECON2285: Mathematical Economics ECON2285: Mathematical Economics Yulei Luo FBE, HKU September 2, 2018 Luo, Y. (FBE, HKU) ME September 2, 2018 1 / 35 Course Outline Economics: The study of the choices people (consumers, firm managers,

More information

Dynamic decisions under subjective expectations: a structural analysis

Dynamic decisions under subjective expectations: a structural analysis Dynamic decisions under subjective expectations: a structural analysis Yonghong An Yingyao Hu Ruli Xiao The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP11/18 Dynamic

More information