Constrained Optimization Approaches to Estimation of Structural Models: Comment
1 Constrained Optimization Approaches to Estimation of Structural Models: Comment
Fedor Iskhakov, University of New South Wales; Jinhyuk Lee, Ulsan National Institute of Science and Technology; John Rust, Georgetown University; Kyoungwon Seo, Korea Advanced Institute of Science and Technology; Bertel Schjerning, University of Copenhagen. January 2015
2 Su and Judd (Econometrica, 2012)
3 Death to NFXP? Su and Judd (Econometrica, 2012), Table II (p. 2228): numerical performance of NFXP and MPEC in the Monte Carlo experiments. For each β, the table compares MPEC/AMPL, MPEC/MATLAB, and NFXP on runs converged (out of 1250), CPU time (in sec.), number of major iterations, number of function evaluations, and number of contraction mapping iterations. For each β, five starting points are used for each of the 250 replications; CPU time and the iteration and evaluation counts are averages per run. (Numeric entries not recovered in transcription.)
4 The Nested Fixed Point Algorithm
NFXP solves the unconstrained optimization problem
max_θ L(θ, EV_θ)
Outer loop (hill-climbing algorithm): the likelihood function L(θ, EV_θ) is maximized w.r.t. θ with a quasi-Newton algorithm, usually BHHH, BFGS, or a combination. Each evaluation of L(θ, EV_θ) requires solution of EV_θ.
Inner loop (fixed point algorithm): the implicit function EV_θ defined by EV_θ = Γ_θ(EV_θ) is solved by successive approximations (SA) and Newton-Kantorovich (NK) iterations.
5 Mathematical Programming with Equilibrium Constraints
MPEC solves the constrained optimization problem
max_{θ,EV} L(θ, EV) subject to EV = Γ_θ(EV)
using general-purpose constrained optimization solvers such as KNITRO and CONOPT.
Su and Judd (Ecta 2012) consider two such implementations:
MPEC/AMPL: AMPL formulates the problem and passes it to KNITRO, supplying automatic differentiation (Jacobian and Hessian) and sparsity patterns for the Jacobian and Hessian.
MPEC/MATLAB: the user needs to supply Jacobians, Hessians, and sparsity patterns; Su and Judd do not supply analytical derivatives. ktrlink provides the link between MATLAB and the KNITRO solvers.
6 Zurcher's Bus Engine Replacement Problem
Choice set: each bus comes in for repair once a month and Zurcher chooses between ordinary maintenance (d_t = 0) and overhaul/engine replacement (d_t = 1).
State variables: Harold Zurcher observes
x_t: mileage at time t since last engine overhaul
ε_t = [ε_t(d_t = 0), ε_t(d_t = 1)]: other state variables
Utility function:
u(x_t, d_t, θ_u) + ε_t(d_t) = { −RC − c(0, θ_u) + ε_t(1) if d_t = 1; −c(x_t, θ_u) + ε_t(0) if d_t = 0 }   (1)
State variable process for x_t (mileage since last replacement):
p(x_{t+1} | x_t, d_t, θ_x) = { g(x_{t+1} − 0, θ_x) if d_t = 1; g(x_{t+1} − x_t, θ_x) if d_t = 0 }   (2)
If the engine is replaced, the state of the bus regenerates to x_t = 0.
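On a discretized mileage grid, the transition process (2) becomes an ordinary Markov matrix, with the replacement transition restarting every row from cell 0. A minimal NumPy sketch (illustrative only: the grid size and the increment probabilities standing in for θ_x are invented, not taken from the paper):

```python
import numpy as np

def transition_matrix(n, theta_x):
    """Keep-decision (d=0) mileage transitions on an n-cell grid.

    From cell i the bus moves up j cells with probability theta_x[j];
    probability mass that would pass the last cell is absorbed there.
    """
    P = np.zeros((n, n))
    for i in range(n):
        for j, pj in enumerate(theta_x):
            P[i, min(i + j, n - 1)] += pj
    return P

P = transition_matrix(5, [0.3, 0.5, 0.2])     # hypothetical increment probs
# Replacement (d=1) regenerates to x=0, so every row restarts from cell 0:
P_replace = np.tile(P[0], (5, 1))
```

Every row of both matrices sums to one, and the last grid cell is absorbing under d = 0.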
7 Zurcher's Bus Engine Replacement Problem
Zurcher's problem: maximize the expected sum of current and future discounted utilities, summarized by the Bellman equation:
V_θ(x_t, ε_t) = max_{d ∈ D(x_t)} [u(x_t, d, θ_u) + ε_t(d) + β E V_θ(x_{t+1}, ε_{t+1} | x_t, ε_t, d)]
Under (CI) and (XV) we can integrate out the unobserved state variables:
EV_θ(x, d) = Γ_θ(EV_θ)(x, d) = ∫_y ln ( Σ_{d′ ∈ D(y)} exp[u(y, d′; θ_u) + β EV_θ(y, d′)] ) p(dy | x, d, θ_x)
8 Structural Estimation
Econometric problem: given observations of mileage and replacement decisions for i = 1, ..., n buses observed over T_i time periods, (d_{i,t}, x_{i,t}), t = 1, ..., T_i and i = 1, ..., n, we wish to estimate the parameters θ = {θ_u, θ_x} using MLE.
Likelihood: under assumption (CI) the log-likelihood contribution l_i^f has the particularly simple form
l_i^f(θ) = Σ_{t=2}^{T_i} log P(d_{i,t} | x_{i,t}, θ) + Σ_{t=2}^{T_i} log p(x_{i,t} | x_{i,t−1}, d_{i,t−1}, θ_x) = l_i^1(θ) + l_i^2(θ_x)
where P(d_{i,t} | x_{i,t}, θ) is the choice probability given the observable state variable x_{i,t}.
9 Choice Probabilities and Expected Value Functions
Under the extreme value (XV) assumption, choice probabilities are multinomial logistic:
P(d | x, θ) = exp{u(x, d, θ_u) + β EV_θ(x, d)} / Σ_{j ∈ D(x)} exp{u(x, j, θ_u) + β EV_θ(x, j)}   (3)
The expected value function is the unique fixed point of the contraction mapping Γ_θ, defined by
EV_θ(x, d) = Γ_θ(EV_θ)(x, d) = ∫_y ln ( Σ_{d′ ∈ D(y)} exp[u(y, d′; θ_u) + β EV_θ(y, d′)] ) p(dy | x, d, θ_x)
Γ_θ is a contraction mapping with unique fixed point EV_θ, i.e.
‖Γ_θ(EV) − Γ_θ(W)‖ ≤ β ‖EV − W‖
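On a finite grid the operator Γ_θ is a few lines of code, and the contraction property can be checked numerically. A sketch in Python/NumPy rather than the authors' MATLAB, with invented utilities and transitions standing in for the model primitives:

```python
import numpy as np

beta = 0.95
n = 6
# toy keep-transition matrix: move up 0 or 1 cell with prob 1/2 each
P = np.diag(np.full(n, 0.5)) + np.diag(np.full(n - 1, 0.5), k=1)
P[-1, -1] = 1.0
u_keep = -0.2 * np.arange(n)         # toy -c(x): cost rising with mileage
u_replace = np.full(n, -3.0)         # toy -RC - c(0)

def gamma(ev):
    """Bellman operator: log-sum over keep/replace, then E over next mileage."""
    v_keep = u_keep + beta * ev
    v_repl = u_replace + beta * ev[0]   # replacement resets mileage to x = 0
    m = np.maximum(v_keep, v_repl)      # recentre before exponentiating
    return P @ (m + np.log(np.exp(v_keep - m) + np.exp(v_repl - m)))

# contraction property in the sup norm: ||Γ(EV) - Γ(W)|| <= β ||EV - W||
rng = np.random.default_rng(0)
ev, w = rng.normal(size=n), rng.normal(size=n)
lhs = np.max(np.abs(gamma(ev) - gamma(w)))
rhs = beta * np.max(np.abs(ev - w))
```

The inequality lhs ≤ rhs holds for any pair of candidate value functions, which is what makes successive approximations converge at rate β.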
10 Rust (Econometrica, 1987)
11 Death to NFXP? Su and Judd (Econometrica, 2012), Table II (p. 2228) revisited: numerical performance of NFXP and MPEC in the Monte Carlo experiments. (Numeric entries not recovered in transcription.)
12 How to do CPR
13 NFXP survival kit
Step 1: Read the NFXP manual and print out the NFXP pocket guide
Step 2: Solve for the fixed point using Newton iterations
Step 3: Recenter the Bellman equation
Step 4: Provide analytical gradients of the Bellman operator
Step 5: Provide analytical gradients of the likelihood
Step 6: Use BHHH (outer product of gradients as Hessian approximation)
If the NFXP heartbeat is still weak: read the NFXP pocket guide until help arrives!
14 STEP 1: NFXP documentation
Main references:
Rust (1987): Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher, Econometrica
Rust (2000): Nested Fixed Point Algorithm Documentation Manual, Version 6, https://editorialexpress.com/jrust/nfxp.html
15 Nested Fixed Point Algorithm
NFXP Documentation Manual version 6 (Rust 2000, page 18): "Formally, one can view the nested fixed point algorithm as solving the following constrained optimization problem:
max_{θ,EV} L(θ, EV) subject to EV = Γ_θ(EV)   (4)
Since the contraction mapping Γ_θ always has a unique fixed point, the constraint EV = Γ_θ(EV) implies that the fixed point EV_θ is an implicit function of θ. Thus, the constrained optimization problem (4) reduces to the unconstrained optimization problem
max_θ L(θ, EV_θ)   (5)
where EV_θ is the implicit function defined by EV_θ = Γ_θ(EV_θ)."
16 NFXP pocket guide
17 STEP 2: Newton-Kantorovich Iterations
Problem: find the fixed point of the contraction mapping EV = Γ(EV).
Error bound on successive contraction iterations:
‖EV_{k+1} − EV‖ ≤ β ‖EV_k − EV‖
Linear convergence: slow when β is close to 1.
Newton-Kantorovich: solve (I − Γ)(EV_θ) = 0 using Newton's method:
‖EV_{k+1} − EV‖ ≤ A ‖EV_k − EV‖²
Quadratic convergence in a neighborhood of the fixed point EV.
18 STEP 2: Newton-Kantorovich Iterations
Newton-Kantorovich iteration:
EV_{k+1} = EV_k − (I − Γ′)^{−1} (I − Γ)(EV_k)
where I is the identity operator on B and 0 is the zero element of B (i.e. the zero function). The nonlinear operator I − Γ has a Fréchet derivative I − Γ′, which is a bounded linear operator on B with a bounded inverse.
The fixed point (poly-)algorithm:
1 Successive contraction iterations (until EV is in the domain of attraction)
2 Newton-Kantorovich iterations (until convergence)
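The two-stage poly-algorithm can be sketched compactly on a toy grid. This is an illustrative Python/NumPy version, not the authors' MATLAB implementation; the grid size, utilities, and transitions are invented:

```python
import numpy as np

beta = 0.9999                          # discount factor close to 1
n = 6
P = np.diag(np.full(n, 0.5)) + np.diag(np.full(n - 1, 0.5), k=1)
P[-1, -1] = 1.0                        # toy keep-transition matrix
u_keep = -0.2 * np.arange(n)           # toy -c(x)
u_replace = np.full(n, -3.0)           # toy -RC - c(0)

def gamma(ev):
    """Bellman operator: recentred log-sum over keep/replace, then E over x'."""
    v_k = u_keep + beta * ev
    v_r = u_replace + beta * ev[0]     # replacement regenerates to x = 0
    m = np.maximum(v_k, v_r)
    return P @ (m + np.log(np.exp(v_k - m) + np.exp(v_r - m)))

def gamma_prime(ev):
    """Fréchet derivative: beta times the controlled transition matrix."""
    v_k = u_keep + beta * ev
    v_r = u_replace + beta * ev[0]
    p_keep = 1.0 / (1.0 + np.exp(v_r - v_k))   # P(keep | x)
    M = np.diag(p_keep)
    M[:, 0] += 1.0 - p_keep            # replacing jumps to state 0
    return beta * (P @ M)

ev = np.zeros(n)
for _ in range(20):                    # stage 1: successive approximations
    ev = gamma(ev)
for _ in range(15):                    # stage 2: Newton-Kantorovich
    step = np.linalg.solve(np.eye(n) - gamma_prime(ev), ev - gamma(ev))
    ev = ev - step
    if np.max(np.abs(step)) < 1e-10:
        break
resid = np.max(np.abs(ev - gamma(ev)))
```

With β this close to 1, successive approximations alone would need on the order of 10^5 iterations; the NK stage finishes in a handful.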
19 STEP 2: Newton-Kantorovich Iterations
Successive approximations are VERY slow. [Figure: convergence of the fixed point algorithm for β ∈ {0.9, 0.95, 0.99, 0.999, ...}; tolerance plotted against iteration count.]
20 STEP 2: Newton-Kantorovich Iterations, β close to 1
Successive approximations alone are VERY slow. [Solver output: the contraction iterations print the iteration count j, the tolerance tol(j), and the ratio tol(j)/tol(j−1); the subsequent Newton-Kantorovich iterations drive the tolerance to order 10^{−13} and convergence is achieved.]
21 STEP 2: Newton-Kantorovich Iterations, β close to 1
Quadratic convergence! [Solver output: after a few contraction iterations, the Newton-Kantorovich tolerance falls to order 10^{−12} within four iterations and convergence is achieved.]
22 STEP 2: When to switch to Newton-Kantorovich
Observation: tol_k = ‖EV_{k+1} − EV_k‖ ≤ β ‖EV_k − EV_{k−1}‖, so tol_k quickly slows down and declines very slowly for β close to 1. The relative tolerance tol_{k+1}/tol_k approaches β.
When to switch to Newton-Kantorovich? Suppose that EV_0 = EV + k (the initial EV_0 equals the fixed point EV plus an arbitrary constant k). Another successive approximation does not solve this:
tol_0 = ‖EV_0 − Γ(EV_0)‖ = ‖EV + k − Γ(EV + k)‖ = ‖EV + k − (EV + βk)‖ = (1 − β)k
tol_1 = ‖EV_1 − Γ(EV_1)‖ = ‖EV + βk − Γ(EV + βk)‖ = ‖EV + βk − (EV + β²k)‖ = β(1 − β)k
so tol_1/tol_0 = β.
Newton will immediately strip away the irrelevant constant k.
Switch to Newton whenever tol_{k+1}/tol_k is sufficiently close to β.
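The switching rule is easy to demonstrate on a toy scalar contraction that has exactly the "fixed point plus constant" structure above (a sketch; the specific map and tolerances are invented for illustration):

```python
# Toy scalar contraction Γ(ev) = β·ev + c with fixed point ev* = c/(1-β).
# Starting anywhere is ev* plus a constant, so every tolerance ratio equals β.
beta = 0.999
c = 1.0

def gamma(ev):
    return beta * ev + c

ev = 0.0
tol_prev = None
it = 0
while True:
    it += 1
    ev_new = gamma(ev)
    tol = abs(ev_new - ev)
    ev = ev_new
    if tol_prev is not None and tol / tol_prev > beta - 1e-6:
        break                 # ratio has reached β: stop contracting
    tol_prev = tol

# one Newton step on F(ev) = ev - Γ(ev) strips the constant away exactly:
ev = ev - (ev - gamma(ev)) / (1.0 - beta)
```

The loop stops after two contraction steps (the ratio is already β), and a single Newton step lands on ev* = c/(1 − β) = 1000, where successive approximations would have needed thousands of iterations.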
23 STEP 3: Recenter to ensure numerical stability
Logit formulas must be recentered:
P_i = exp(v_i) / Σ_{j ∈ D(y)} exp(v_j) = exp(v_i − V_0) / Σ_{j ∈ D(y)} exp(v_j − V_0)
and the log-sum must be recentered too:
EV_θ = ∫_y ln ( Σ_{j ∈ D(y)} exp(v_j) ) p(dy | x, d, θ_x) = ∫_y [ V_0 + ln ( Σ_{j ∈ D(y)} exp(v_j − V_0) ) ] p(dy | x, d, θ_x)
If V_0 is chosen to be V_0 = max_j v_j, we avoid numerical instability due to overflow/underflow.
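The recentred formulas above are the standard log-sum-exp trick; a minimal sketch showing that they stay finite where the naive versions overflow:

```python
import numpy as np

def logsum(v):
    """Recentred log-sum: V0 + log Σ_j exp(v_j - V0), with V0 = max_j v_j."""
    v0 = np.max(v)
    return v0 + np.log(np.sum(np.exp(v - v0)))

def choice_probs(v):
    """Recentred logit: P_i = exp(v_i - V0) / Σ_j exp(v_j - V0)."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([1000.0, 1001.0])   # naive exp(1000) would overflow to inf
p = choice_probs(v)
ls = logsum(v)
```

Subtracting V_0 leaves every exponent at most zero, so nothing overflows, and the largest alternative always contributes exp(0) = 1, so the log never sees zero.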
24 STEP 4: Fréchet derivative of the Bellman operator
For the NK iteration EV_{k+1} = EV_k − (I − Γ′)^{−1}(I − Γ)(EV_k) we need the Fréchet derivative Γ′.
In terms of its finite-dimensional approximation, Γ′_θ takes the form of an N × N matrix equal to the partial derivatives of the N × 1 vector Γ_θ(EV_θ) with respect to the N × 1 vector EV_θ.
Γ′_θ is simply β times the transition probability matrix for the controlled process {d_t, x_t}: two lines of code in MATLAB.
25 STEP 5: Provide analytical gradients of the likelihood
The gradient is similar to the gradient of the conventional logit:
∂l_i^1(θ)/∂θ = [d_{it} − P(d_{it} | x_{it}, θ)] ∂(v_repl − v_keep)/∂θ
The only thing that differs is the inner derivative of the choice-specific value function, which besides derivatives of current utility also includes ∂EV_θ/∂θ.
By the implicit function theorem we obtain
∂EV_θ/∂θ = [I − Γ′_θ]^{−1} ∂Γ_θ/∂θ
[I − Γ′_θ]^{−1} is a by-product of the N-K algorithm.
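The implicit-function-theorem formula can be verified numerically on a toy grid. A Python/NumPy sketch (invented primitives; RC is the only parameter differentiated, and the analytical derivative is checked against finite differences):

```python
import numpy as np

beta = 0.95
n = 6
P = np.diag(np.full(n, 0.5)) + np.diag(np.full(n - 1, 0.5), k=1)
P[-1, -1] = 1.0                       # toy keep-transition matrix
u_keep = -0.2 * np.arange(n)          # toy -c(x)

def solve_ev(rc, iters=2000):
    """Fixed point EV_θ by successive approximations (β is far from 1 here)."""
    ev = np.zeros(n)
    for _ in range(iters):
        v_k = u_keep + beta * ev
        v_r = -rc + beta * ev[0]      # replacement utility is -RC
        m = np.maximum(v_k, v_r)
        ev = P @ (m + np.log(np.exp(v_k - m) + np.exp(v_r - m)))
    return ev

rc = 3.0
ev = solve_ev(rc)
v_k = u_keep + beta * ev
v_r = -rc + beta * ev[0]
p_r = 1.0 / (1.0 + np.exp(v_k - v_r))            # P(replace | x)
# Fréchet derivative Γ' = β·P·(choice-probability weighting)
gamma_prime = beta * P @ (np.diag(1.0 - p_r) + np.outer(p_r, np.eye(n)[0]))
dgamma_drc = P @ (-p_r)                          # ∂Γ/∂RC holding EV fixed
# implicit function theorem: ∂EV/∂RC = (I - Γ')^{-1} ∂Γ/∂RC
dev_drc = np.linalg.solve(np.eye(n) - gamma_prime, dgamma_drc)

# central finite-difference check
h = 1e-5
fd = (solve_ev(rc + h) - solve_ev(rc - h)) / (2 * h)
```

The analytical derivative and the finite difference agree to several digits, which is the practical test that the gradient code is right.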
26 STEP 6: BHHH
Recall Newton-Raphson:
θ_{g+1} = θ_g − λ (Σ_i H_i(θ_g))^{−1} Σ_i s_i(θ_g)
Berndt, Hall, Hall, and Hausman (1974): use the outer product of the scores as an approximation to the (negative) Hessian:
θ_{g+1} = θ_g + λ (Σ_i s_i s_i′)^{−1} Σ_i s_i
Why is this valid? The information identity:
E[H_i(θ)] = −E[s_i(θ) s_i(θ)′]
(only valid for MLE and CMLE).
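The BHHH update is easiest to see on a plain binary logit, where the per-observation score is available in closed form. A self-contained sketch on simulated data (illustrative, not the bus-engine likelihood; step length λ = 1 throughout):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 2))
theta_true = np.array([1.0, -0.5])
p = 1.0 / (1.0 + np.exp(-x @ theta_true))
y = (rng.random(n) < p).astype(float)

def scores(theta):
    """Per-observation scores s_i = (y_i - P_i)·x_i of the logit log-likelihood."""
    resid = y - 1.0 / (1.0 + np.exp(-x @ theta))
    return resid[:, None] * x            # n × k matrix of scores

theta = np.zeros(2)
for _ in range(50):
    s = scores(theta)
    g = s.sum(axis=0)                    # gradient Σ_i s_i
    B = s.T @ s                          # BHHH matrix Σ_i s_i s_i'
    step = np.linalg.solve(B, g)
    theta = theta + step                 # move along B^{-1} g (uphill)
    if np.max(np.abs(step)) < 1e-10:
        break
```

Because B is an outer product of scores it is positive definite whenever the scores span the parameter space, so the step is always an ascent direction; at convergence B/n also estimates the information matrix, giving standard errors for free.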
27 STEP 6: BHHH
Sometimes a line search may not help Newton's method on a non-concave likelihood.
Convex region: Newton-Raphson moves in the opposite direction of the gradient; NR moves DOWNHILL (wrong, wrong way). BHHH is still good.
Concave region: Newton-Raphson moves in the same direction as the gradient; NR moves UPHILL.
28 STEP 6: BHHH
Advantages:
Σ_i s_i s_i′ is always positive definite, i.e. BHHH always moves uphill for λ small enough.
It does not rely on second derivatives.
Disadvantages:
It is only a good approximation at the true parameters, for large N, for well-specified models (in principle only valid for MLE).
It is only superlinearly convergent, not quadratic.
We can always use BHHH for the first iterations and then switch to BFGS updates to get an even more accurate approximation to the Hessian matrix as the iterations start to converge.
29 STEP 6: BHHH The road ahead will be long. Our climb will be steep. We may not get there in one year or even in one term. But, America, I have never been more hopeful than I am tonight that we will get there. I promise you, we as a people will get there. (Barack Obama, Nov. 2008)
30 Convergence! *** Convergence Achieved ***
[Solver output: number of iterations: 9; gradient-times-direction, log-likelihood, and parameter estimates with standard errors and t-statistics reported for RC and c. Time to convergence is 0 min and 0.07 seconds.]
31 MPEC versus NFXP-NK: sample size 6,000
[Table: for each β, runs converged (out of 1250), CPU time (in sec.), number of major iterations, function evaluations, Bellman iterations, and N-K iterations for MPEC-Matlab, MPEC-AMPL, and NFXP-NK. Numeric entries not recovered in transcription.]
32 MPEC versus NFXP-NK: sample size 60,000
[Table: for each β, runs converged (out of 1250), CPU time (in sec.), number of major iterations, function evaluations, Bellman iterations, and N-K iterations for MPEC-AMPL and NFXP-NK. Numeric entries not recovered in transcription.]
33 CPU time is linear in sample size
[Figure: CPU time per major iteration (seconds) against sample size (x 10^5) for MPEC-AMPL and NFXP-NK. Fitted lines T_NFXP (R² = 0.991) and T_MPEC (R² = 0.988); slope coefficients not recovered in transcription.]
34 CPU time is linear in sample size
[Figure: total CPU time (seconds) against sample size (x 10^5) for MPEC-AMPL and NFXP-NK. Fitted lines T_NFXP (R² = 0.926) and T_MPEC (R² = 0.554); slope coefficients not recovered in transcription.]
35 Summary of findings
Su and Judd (Econometrica, 2012) used an inefficient version of NFXP that relies solely on the method of successive approximations to solve the fixed point problem. Using the efficient version of NFXP proposed by Rust (1987), we find:
MPEC and NFXP-NK are similar in performance when the sample size is relatively small.
In problems with large sample sizes, NFXP-NK outperforms MPEC by a significant margin.
NFXP does not slow down as β → 1.
It is non-trivial to compute standard errors using MPEC, whereas they are a natural by-product of NFXP.
36 Patient still alive
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationNonlinear Optimization for Optimal Control
Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]
More informationOptimization. Yuh-Jye Lee. March 21, Data Science and Machine Intelligence Lab National Chiao Tung University 1 / 29
Optimization Yuh-Jye Lee Data Science and Machine Intelligence Lab National Chiao Tung University March 21, 2017 1 / 29 You Have Learned (Unconstrained) Optimization in Your High School Let f (x) = ax
More informationLecture 17: Numerical Optimization October 2014
Lecture 17: Numerical Optimization 36-350 22 October 2014 Agenda Basics of optimization Gradient descent Newton s method Curve-fitting R: optim, nls Reading: Recipes 13.1 and 13.2 in The R Cookbook Optional
More informationEmpirical Evaluation and Estimation of Large-Scale, Nonlinear Economic Models
Empirical Evaluation and Estimation of Large-Scale, Nonlinear Economic Models Michal Andrle, RES The views expressed herein are those of the author and should not be attributed to the International Monetary
More informationLecture 6: Discrete Choice: Qualitative Response
Lecture 6: Instructor: Department of Economics Stanford University 2011 Types of Discrete Choice Models Univariate Models Binary: Linear; Probit; Logit; Arctan, etc. Multinomial: Logit; Nested Logit; GEV;
More informationChapter 4. Unconstrained optimization
Chapter 4. Unconstrained optimization Version: 28-10-2012 Material: (for details see) Chapter 11 in [FKS] (pp.251-276) A reference e.g. L.11.2 refers to the corresponding Lemma in the book [FKS] PDF-file
More informationConstrained Optimization
1 / 22 Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 30, 2015 2 / 22 1. Equality constraints only 1.1 Reduced gradient 1.2 Lagrange
More informationStatistical Estimation
Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from
More information1-D Optimization. Lab 16. Overview of Line Search Algorithms. Derivative versus Derivative-Free Methods
Lab 16 1-D Optimization Lab Objective: Many high-dimensional optimization algorithms rely on onedimensional optimization methods. In this lab, we implement four line search algorithms for optimizing scalar-valued
More informationA Computational Method for Multidimensional Continuous-choice. Dynamic Problems
A Computational Method for Multidimensional Continuous-choice Dynamic Problems (Preliminary) Xiaolu Zhou School of Economics & Wangyannan Institution for Studies in Economics Xiamen University April 9,
More informationNumerical solutions of nonlinear systems of equations
Numerical solutions of nonlinear systems of equations Tsung-Ming Huang Department of Mathematics National Taiwan Normal University, Taiwan E-mail: min@math.ntnu.edu.tw August 28, 2011 Outline 1 Fixed points
More informationSurvey of NLP Algorithms. L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA
Survey of NLP Algorithms L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA NLP Algorithms - Outline Problem and Goals KKT Conditions and Variable Classification Handling
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More informationLinear Programming Methods
Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution
More informationexp(v j=) k exp(v k =)
Economics 250c Dynamic Discrete Choice, continued This lecture will continue the presentation of dynamic discrete choice problems with extreme value errors. We will discuss: 1. Ebenstein s model of sex
More informationLinear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52
Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52 Components of a linear model The two
More informationMS&E 318 (CME 338) Large-Scale Numerical Optimization
Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods
More informationPenalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques
More informationSYSTEMS OF NONLINEAR EQUATIONS
SYSTEMS OF NONLINEAR EQUATIONS Widely used in the mathematical modeling of real world phenomena. We introduce some numerical methods for their solution. For better intuition, we examine systems of two
More informationDynamic and Stochastic Model of Industry. Class: Work through Pakes, Ostrovsky, and Berry
Dynamic and Stochastic Model of Industry Class: Work through Pakes, Ostrovsky, and Berry Reading: Recommend working through Doraszelski and Pakes handbook chapter Recall model from last class Deterministic
More informationConvex Optimization. Prof. Nati Srebro. Lecture 12: Infeasible-Start Newton s Method Interior Point Methods
Convex Optimization Prof. Nati Srebro Lecture 12: Infeasible-Start Newton s Method Interior Point Methods Equality Constrained Optimization f 0 (x) s. t. A R p n, b R p Using access to: 2 nd order oracle
More informationECONOMETRICS I. Assignment 5 Estimation
ECONOMETRICS I Professor William Greene Phone: 212.998.0876 Office: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Email: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/econometrics/econometrics.htm
More informationEM Algorithm II. September 11, 2018
EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data
More informationIterative Reweighted Least Squares
Iterative Reweighted Least Squares Sargur. University at Buffalo, State University of ew York USA Topics in Linear Classification using Probabilistic Discriminative Models Generative vs Discriminative
More informationNBER WORKING PAPER SERIES IMPROVING THE NUMERICAL PERFORMANCE OF BLP STATIC AND DYNAMIC DISCRETE CHOICE RANDOM COEFFICIENTS DEMAND ESTIMATION
NBER WORKING PAPER SERIES IMPROVING THE NUMERICAL PERFORMANCE OF BLP STATIC AND DYNAMIC DISCRETE CHOICE RANDOM COEFFICIENTS DEMAND ESTIMATION Jean-Pierre H. Dubé Jeremy T. Fox Che-Lin Su Working Paper
More information