Statistical Properties of Numerical Derivatives


1 Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November / 63

2 Motivation Introduction
Many models have objective functions that are expensive to compute.
Sample objective functions might not have well-behaved analytic derivatives.
Common-sense solution: use numerical derivatives (Judd (1998)).
Statistical problem: there is an extra nuisance parameter (the step size of numerical differentiation).
It is not clear whether results for estimators with nuisance parameters (e.g. a bandwidth) translate directly to the case of numerical derivatives.
Can we obtain good estimators by solving numerical first-order conditions?
Can we provide relatively weak conditions on the step size if we are only interested in the value of the derivative?

3 Motivation Introduction
Example (see Judd (1998)): we need to compute the gradient of $g$, which is approximated by $\hat g$.
We can use an approximation with step size $h$:
$$\hat g'_h(x) = L^{h}_{1,1}\hat g(x) = \frac{\hat g(x+h) - \hat g(x)}{h} = g'(x) + O(h)$$
or
$$\hat g'_h(x) = L^{h}_{1,2}\hat g(x) = \frac{\hat g(x+h) - \hat g(x-h)}{2h} = g'(x) + O(h^2).$$
More generally, higher-order differentiation.
Bias and variance tradeoffs.
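The two error orders above can be checked numerically. The following is a minimal sketch, assuming a known smooth test function (`np.sin`, chosen only for illustration, with no estimation noise); the function names `forward_diff` and `central_diff` are not from the slides.

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided difference, error O(h)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Two-sided difference, error O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

g = np.sin            # illustrative test function (assumption)
x0 = 1.0
true_deriv = np.cos(x0)

steps = [1e-1, 1e-2, 1e-3]
fwd_err = [abs(forward_diff(g, x0, h) - true_deriv) for h in steps]
ctr_err = [abs(central_diff(g, x0, h) - true_deriv) for h in steps]

for h, ef, ec in zip(steps, fwd_err, ctr_err):
    print(f"h={h:.0e}  forward error={ef:.2e}  central error={ec:.2e}")
```

Shrinking $h$ by a factor of 10 should reduce the forward error by roughly 10 and the central error by roughly 100, in line with $O(h)$ and $O(h^2)$.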

4 Motivation Introduction
Conclusion: careful analysis of the function's properties and application of higher-order numerical derivative techniques reduce the error by orders of magnitude.
The case where $g$ is estimated is more complicated: $\hat g(x+h)$ and $\hat g(x-h)$ are correlated.
Consistency of the derivatives depends on the choice of approximating formula and step size.
Results depend on tradeoffs between (a) the quality of the numerical approximation, (b) the choice of step size, (c) the smoothness of the population objective, and (d) the empirical process properties of $\hat g$.

5 Literature background Introduction
Anderssen and Bloomfield (1974): differentiation of functions measured with noise.
Newey and McFadden (1994): sufficient conditions for convergence of the numerical Hessian.
Newey (1994), Newey and McFadden (1994): numerical Hessian in semiparametric models.
L'Ecuyer and Perron (1994): convergence rate and asymptotics for basic finite-difference formulas for smooth functions.
Powell (1984): censored LAD. Buchinsky and Hahn (1998): alternative censored QR.
Simulation estimation of nonsmooth models (Pakes and Pollard (1989)).
Maximum score and smoothing (Manski (1975), Horowitz).

6 What we find Introduction
We find convergence rates and statistical properties of numerical derivatives of possibly non-smooth objective functions.
While in some cases the choice of the numerical differentiation step is based on a bias-variance tradeoff, in smooth classes of models there is no such tradeoff.
First-order conditions using numerical derivatives can still be used even if the sample objective function is not smooth (e.g. maximum score), provided that the step size is selected correctly.
Smoothness of the objective function influences the precise convergence rate of the derivative estimator and its limit distribution.
Requirements on the rate of convergence of the numerical derivative restrict the order of the numerical differentiation operator.

7 Outline of the talk Introduction
Consistency of numerical partial derivatives in parametric and semiparametric models: sufficient conditions for consistency; rates of convergence and distribution theory.
M-estimators based on numerical first-order conditions: consistency of estimators; convergence rate and optimal choice of the numerical differentiation step.
Monte Carlo evidence.

8 Introduction Numerical derivative: definition
Consider a function $f(\cdot)$ with $p$ continuous derivatives.
We consider a class of linear operators $L$ defined by a step size $\varepsilon > 0$, an order $m \in \mathbb{N}$, weights $a_k \in [-1,1]$ and $t_k \in [-1,1]$ for $k = 1,\dots,K$ such that
$$L f(x_0) = \frac{1}{\varepsilon^m}\sum_{k=1}^{K} a_k f(x_0 + \varepsilon t_k).$$
We call $L^{\varepsilon}_{m,q}$ the $m$-th numerical derivative operator of order $q \le p - m + 1$ if
$$L^{\varepsilon}_{m,q}\, g(x_0) = g^{(m)}(x_0) + o(\varepsilon^{q}).$$

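9 Introduction Numerical derivative: structure
To obtain the weights, use Taylor's expansion for small $\varepsilon$:
$$f(x_0 + \varepsilon t_k) = \sum_{i=0}^{p} f^{(i)}(x_0)\,\frac{\varepsilon^{i} t_k^{i}}{i!} + o(\varepsilon^{p}).$$
Solve the system of equations
$$\frac{1}{i!}\sum_{k=1}^{K} a_k t_k^{i} = \delta_{i,m}$$
with boundary conditions $t_1 = -1$ and $t_K = 1$, where $\delta_{i,m}$ is the Kronecker symbol.
Example:
$$D_3(\hat\theta) = L^{\varepsilon_n}_{1,3} = \frac{M_n(\hat\theta - 2\varepsilon_n) - 8 M_n(\hat\theta - \varepsilon_n) + 8 M_n(\hat\theta + \varepsilon_n) - M_n(\hat\theta + 2\varepsilon_n)}{12\,\varepsilon_n}.$$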
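The system of equations for the weights is a small linear solve. Below is a minimal sketch, assuming the nodes are given in units of the step size (so the example uses $t_k \in \{-2,-1,1,2\}$ rather than nodes rescaled to $[-1,1]$); the helper name `fd_weights` is hypothetical.

```python
import math
import numpy as np

def fd_weights(nodes, m):
    """Solve (1/i!) * sum_k a_k t_k^i = delta_{i,m} for i = 0, ..., K-1."""
    t = np.asarray(nodes, dtype=float)
    K = t.size
    # Row i of the system matrix holds t_k^i / i! for each node k.
    A = np.array([t**i / math.factorial(i) for i in range(K)])
    b = np.zeros(K)
    b[m] = 1.0  # Kronecker delta: pick out the m-th derivative
    return np.linalg.solve(A, b)

# First derivative (m = 1) on the four-point stencil -2, -1, 1, 2.
weights = fd_weights([-2.0, -1.0, 1.0, 2.0], m=1)
print(weights * 12.0)
```

Multiplied by 12, the weights reproduce the coefficients of the four-point example: 1, -8, 8, -1 on the nodes $-2\varepsilon_n, -\varepsilon_n, \varepsilon_n, 2\varepsilon_n$.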

10 Consistency of numerical derivatives Statement of the problem
Parameter vector $(\theta, h(\cdot))$. $\theta$: Euclidean components. $h(\cdot)$: infinite-dimensional components.
Moment condition
$$m(z;\theta_0,h_0(\cdot)) = E[\rho(x;\theta_0,h_0(\cdot)) \mid z] = 0,$$
where $\rho(\cdot)$ can possibly be non-smooth, e.g. Ai and Chen (2003), Chen and Pouzo (2009).
An analytic expression might not exist for the derivative with respect to $\theta$, for the directional derivative with respect to $h(\cdot)$, or for both.
Remark: even if $\rho(\cdot)$ is linear, there might be no analytic expression for the derivative, e.g. $\rho(x;\theta_0,h_0(\cdot)) = h_0(z) - \alpha\, h_0(x)\, f(x,z;\theta)$.

11 Consistency of numerical derivatives Statement of the problem
The conditional moment might not be pointwise differentiable.
The conditional moment is sufficiently smooth in mean square: the expected squared remainder of its order-$p$ expansion in $(\theta - \theta_0)$ and the direction $\delta$ is $o(\|\delta\|^{2\nu} + \|\theta-\theta_0\|^{2\nu})$.
We can still use a numerical derivative operator, for example
$$\hat D_{w_j}(z) = \frac{\hat m(z;\hat\theta + e_j\varepsilon_n,\hat h(\cdot)) - \hat m(z;\hat\theta - e_j\varepsilon_n,\hat h(\cdot))}{2\varepsilon_n} - \frac{\hat m(z;\hat\theta,\hat h(\cdot) + \tau_n w_j(\cdot)) - \hat m(z;\hat\theta,\hat h(\cdot) - \tau_n w_j(\cdot))}{2\tau_n}.$$

12 Consistency of numerical derivatives Statement of the problem
We need to compute numerically, for each $w_j$,
$$D_{w_j}(z) = \frac{\partial m(z;\theta,h(\cdot))}{\partial\theta_j} - \frac{\partial m(z;\theta,h(\cdot))}{\partial h}[w_j].$$
The asymptotic variance depends on $w_j^{*}$, which solves
$$\min_{w_j \in W} D_{w_j}(z)'\,\Sigma_0(z)^{-1} D_{w_j}(z),$$
with $\Sigma_0(z) = E[\rho(x;\theta_0,h_0(\cdot))\,\rho(x;\theta_0,h_0(\cdot))' \mid z]$.
Use $\hat h(\cdot)$, $\hat\theta$, and the numerically evaluated $\hat w$ to estimate the asymptotic variance.

13 Consistency of numerical derivatives Numerical derivative: problem
This structure extends to the case where $f$ is a functional and $X$ is a functional space, for taking directional derivatives.
In particular, consider the first derivative of order $q$:
$$\hat D_{\hat w_j}(z) = L^{\varepsilon_n,\theta_j}_{1,q}\,\hat m(z;\hat\theta,\hat h(\cdot)) - L^{\tau_n,\hat w_j}_{1,q}\,\hat m(z;\hat\theta,\hat h(\cdot)).$$
Question 1: how should $\tau_n$ and $\varepsilon_n$ be chosen as $n \to \infty$ for a derivative of given order $q$?
We seek a sufficient condition on the step sizes of the partial and directional derivatives such that $\hat D_{\hat w_j}(z) \to_p D_{w_j}(z)$.

14 Consistency of numerical derivatives Numerical derivative: solution
The conditional moment might not be pointwise differentiable but has continuous $L_2$ derivatives.
High-level assumption for convergence of the conditional expectation:
$$\sup_{(\theta,h(\cdot)) \in U_\gamma} \frac{n^{1/k}\,\|\hat m(z;\theta,h(\cdot)) - m(z;\theta,h(\cdot)) - \hat m(z;\theta_0,h_0(\cdot))\|}{1 + n^{1/k}\|\hat m(z;\theta,h(\cdot))\| + n^{1/k}\|m(z;\theta,h(\cdot))\|} = o_p(1)$$
for any sequence of shrinking neighborhoods $U_\gamma$ as $\gamma \to 0$.
More precise conditions can be given for a particular estimator $\hat m(\cdot)$.
Nonparametric rate for $\hat h(\cdot)$: for some $k_1$, $n^{1/k_1}\|\hat h(\cdot) - h_0(\cdot)\| = O_p(1)$, and $n^{1/2}\|\hat\theta - \theta_0\| = O_p(1)$.

15 Consistency of numerical derivatives Numerical derivative: solution
Theorem 1: Under the assumptions provided, if $\varepsilon_n n^{1/\max\{k,k_1\}} \to \infty$, $\varepsilon_n \to 0$ and $\tau_n n^{1/\max\{k,k_1\}} \to \infty$, $\tau_n \to 0$, then
$$\sup_{z \in Z}\,\|\hat D_w(z) - D_w(z)\| \to_p 0.$$

16 Consistency of numerical derivatives Numerical derivative: discussion
This result closely follows existing results in Newey and McFadden (1994) and Powell (1984).
The conditions provided imply a rate slower than $1/\sqrt{n}$ for $\varepsilon_n$ in the parametric case.
In the semiparametric case a stronger condition is needed to control for the slow nonparametric rate.
Such sufficient conditions are too strong; sharper results can be derived.
Even for nonsmooth models, we only need $n\varepsilon_n \to \infty$ in the parametric case.

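17 Consistency of numerical derivatives Numerical derivative: parametric case
We can derive sharper results if we know more detail about the moment function.
Moment vector $\hat g(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(Z_i,\theta)$; the estimator $\hat\theta$ equates it to zero.
We need to estimate $G = \frac{\partial g(\theta_0)}{\partial\theta}$ using $L^{\varepsilon_n}_{1,p}\hat g(\hat\theta)$.
Decompose $L^{\varepsilon_n}_{1,p}\hat g(\hat\theta) - G = \hat G_1 + \hat G_2 + \hat G_3$, where
$$\hat G_1 = L^{\varepsilon_n}_{1,p}\hat g(\hat\theta) - L^{\varepsilon_n}_{1,p} g(\hat\theta), \qquad \hat G_2 = L^{\varepsilon_n}_{1,p} g(\hat\theta) - G(\hat\theta), \qquad \hat G_3 = G(\hat\theta) - G.$$
$\hat G_3 = O_p(\|\hat\theta - \theta_0\|)$, but it has no relation to $\varepsilon_n$.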

18 Consistency of numerical derivatives Numerical derivative: parametric case
Assumption 1: A $(2p+1)$-th order mean value expansion applies to the limiting function $g(\theta)$ uniformly in a neighborhood of $\theta_0$. For all $\varepsilon$ sufficiently small and $r = 2p+1$,
$$\sup_{\theta \in N(\theta_0)} \Big\| g(\theta + \varepsilon) - \sum_{l=0}^{r} \frac{\varepsilon^{l}}{l!}\, g^{(l)}(\theta) \Big\| = O(|\varepsilon|^{r+1}).$$
The bias term $\hat G_2(\hat\theta)$ can be controlled if the bias reduction is uniformly small in a neighborhood of $\theta_0$.
An immediate consequence of this assumption is that $\hat G_2(\hat\theta) = O(\varepsilon^{2p})$.

19 Consistency of numerical derivatives Numerical derivative: parametric case
The weakest possible condition to control $\hat G_1(\hat\theta)$ that covers all the models we are aware of seems to come from a convergence rate result in Pollard (1984).
Assumption 2: Define $\mathcal F = \{g(\cdot,\theta),\ \theta \in \Theta\}$. The graphs of the functions in $\mathcal F$ have polynomial discrimination.
Most of the functions in econometric applications fall in this category.
By Lemmas 25 and 36 of Pollard (1984), Assumption 2 implies that there exist universal constants $A > 0$ and $V > 0$ such that for any $\mathcal F_n \subset \mathcal F$ with envelope function $F_n$,
$$\sup_Q N_1(\varepsilon\, Q F_n, Q, \mathcal F_n) \le A\varepsilon^{-V}, \qquad \sup_Q N_2\big(\varepsilon\,(Q F_n^2)^{1/2}, Q, \mathcal F_n\big) \le A\varepsilon^{-V}.$$

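20 Consistency of numerical derivatives Numerical derivative: parametric case
Lemma 2: Suppose that, for a neighborhood $N(\theta_0)$ around $\theta_0$, the envelope $F = \sup_{\theta \in N(\theta_0)} |g(Z_i,\theta)| < \infty$, and for all $\varepsilon$ small enough,
$$\sup_{\theta \in N(\theta_0)} E\big[(g(Z_i,\theta+\varepsilon) - g(Z_i,\theta-\varepsilon))^2\big] = O(\varepsilon). \quad (1)$$
Then under Assumption 2, if $n\varepsilon_n/\log n \to \infty$,
$$\sup_{d(\theta,\theta_0) = o(1)} \big\| L^{\varepsilon_n}_{1,p}\hat g(\theta) - L^{\varepsilon_n}_{1,p} g(\theta) \big\| = o_p(1).$$
Consequently, Assumption 2 implies that $\hat G_1(\hat\theta) = o_p(1)$ if $d(\hat\theta,\theta_0) = o_p(1)$.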

21 Consistency of numerical derivatives
Theorem 2: Under Assumptions 1 and 2 and the conditions of Lemma 2, $L^{\varepsilon_n}_{1,p}\hat g(\hat\theta) \to_p G(\theta_0)$ if $\varepsilon_n \to 0$ and $n\varepsilon_n/\log n \to \infty$, and if $d(\hat\theta,\theta_0) = o_p(1)$.
This is the weakest sufficient condition that we are able to come up with without making stronger assumptions.
The case of an indicator function involved in $g(Z_i,\theta)$ typically corresponds to $\gamma = 1/2$.
However, this condition can be improved if we are willing to impose the following stronger assumption, which holds for smoother functions such as those that are Hölder-continuous.

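22 Consistency of numerical derivatives Numerical derivative: parametric case
Assumption 3: The moment condition is mean square differentiable. Define $\mathbb G_n(\theta) = \frac{1}{\sqrt n}\sum_{i=1}^{n}\big(g(Z_i,\theta) - g(\theta)\big)$. There exists some $\varepsilon > 0$ such that for all $\delta$ sufficiently small,
$$E \sup_{d(\theta_1,\theta_2)<\delta,\ d(\theta_1,\theta_0)<\varepsilon,\ d(\theta_2,\theta_0)<\varepsilon} \|\mathbb G_n(\theta_1) - \mathbb G_n(\theta_2)\| \lesssim \phi_n(\delta),$$
for functions $\phi_n(\cdot)$ such that $\delta \mapsto \phi_n(\delta)/\delta^{\gamma}$ is decreasing for some $\gamma > 0$ and $\gamma$ …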

23 Consistency of numerical derivatives Numerical derivative: parametric case
This assumption allows us to put an envelope directly on $\hat G_1$.
This assumption is more stringent than the corresponding theorem in van der Vaart and Wellner, and may fail in cases where that theorem holds, for example with indicator functions.
Define a class of functions
$$\mathcal M^{\varepsilon}_{\delta} = \{g(Z_i,\theta_1) - g(Z_i,\theta_2):\ d(\theta_1,\theta_2) \le \delta,\ d(\theta_1,\theta_0) < \varepsilon,\ d(\theta_2,\theta_0) < \varepsilon\}.$$
Assumption 3, which requires bounding $E_P\|\mathbb G_n\|_{\mathcal M^{\varepsilon}_{\delta}}$, can be obtained by invoking the maximal inequalities in van der Vaart and Wellner.

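24 Consistency of numerical derivatives Numerical derivative: parametric case
These bounds state that, for $M^{\varepsilon}_{\delta}$ an envelope function of the class $\mathcal M^{\varepsilon}_{\delta}$,
$$E_P\|\mathbb G_n\|_{\mathcal M^{\varepsilon}_{\delta}} \lesssim J(1,\mathcal M^{\varepsilon}_{\delta})\,\big(P (M^{\varepsilon}_{\delta})^2\big)^{1/2}, \qquad E_P\|\mathbb G_n\|_{\mathcal M^{\varepsilon}_{\delta}} \lesssim J_{[\,]}(1,\mathcal M^{\varepsilon}_{\delta}, L_2(P))\,\big(P (M^{\varepsilon}_{\delta})^2\big)^{1/2},$$
where $J(1,\mathcal M^{\varepsilon}_{\delta})$ and $J_{[\,]}(1,\mathcal M^{\varepsilon}_{\delta}, L_2(P))$ are the uniform and bracketing entropy integrals.
The following result shows that for smooth functions $g(Z_i,\theta)$ that are Lipschitz in $\theta$, the only condition needed for consistency is $\varepsilon_n \to 0$.
Theorem: Under Assumptions 1 and 3, $L^{\varepsilon_n}_{1,p}\hat g(\hat\theta) \to_p G(\theta_0)$ if $\varepsilon_n \to 0$ and $n\varepsilon_n^{2-2\gamma} \to \infty$, and if $d(\hat\theta,\theta_0) = o_p(1)$.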

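25 Consistency of numerical derivatives Numerical derivative: rate of convergence
Theorem: Under the assumption that
$$E \sup_{d(\theta_1,\theta_2)<\delta,\ d(\theta_1,\theta_0)<\varepsilon,\ d(\theta_2,\theta_0)<\varepsilon} \|\mathbb G_n(\theta_1) - \mathbb G_n(\theta_2)\| \lesssim \phi_n(\delta),$$
where $\delta \mapsto \phi(\delta)/\delta^{\gamma}$ is decreasing for some $0 < \gamma < 1$, if $\|\hat\theta - \theta_0\| = O_p(n^{-\eta})$ for $\eta > 0$, provided that $\eta \ge \frac{1}{1-\gamma+2\nu}$, the best rate of convergence between $L^{\varepsilon}_{1,p}\hat g(\hat\theta)$ and $G$ is achieved when $\varepsilon_n = O\big(n^{-1/(1-\gamma+2\nu)}\big)$, in which case
$$\big\|L^{\varepsilon}_{1,p}\hat g(\hat\theta) - G\big\|^2 = O_p\big(n^{-\frac{2\nu}{1-\gamma+2\nu}}\big).$$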

26 Consistency of numerical derivatives Numerical derivative: parametric case
If $\gamma < 1$ and the numerical differentiation operator guarantees a residual of order $\nu \ge 1$, we can have the parametric rate of convergence $\eta = 1/2$.
In smooth models, $\gamma = 1$ and there is no bias-variance tradeoff: a smaller step size $\varepsilon$ leads to smaller bias.
The order of the root MSE in the smooth case is bounded below by the variance term of $O(1/\sqrt n)$ for sufficiently small $\varepsilon_n$.

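27 Example Consistency of numerical derivatives
Simple example: $m(z_i;\theta) = 1(z_i \le \theta) - \tau$.
Numerical derivative:
$$\frac{1}{n}\sum_{i=1}^{n} \frac{1(z_i \le \hat\theta + \varepsilon) - 1(z_i \le \hat\theta - \varepsilon)}{2\varepsilon} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\varepsilon}\, U\!\left(\frac{z_i - \theta_0 - (\hat\theta - \theta_0)}{\varepsilon}\right),$$
where $U(\cdot)$ is the uniform kernel.
The consistency conditions in Powell (1984) and Newey and McFadden (1994), both of which require $\sqrt n\,\varepsilon \to \infty$, are too strong.
This is not necessary: $\hat f(x)$ is uniformly consistent for $f(x)$, and we only need $n\varepsilon/\log n \to \infty$.
The second part of the noise, due to $\hat\theta - \theta_0$ through $(\hat\theta - \theta_0)/\varepsilon$, will vanish.
$\hat\theta - \theta_0 \to_p 0 \implies \hat f(\hat\theta - \theta_0) \to_p f(0) = f_z(\theta_0)$.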
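A small simulation makes the indicator example concrete. This is a sketch under stated assumptions: standard normal $z_i$, $\tau = 0.5$ (so $\theta_0 = 0$), and a step size $\varepsilon = n^{-1/3}$ picked only for illustration; it shrinks faster than $n^{-1/2}$ would allow under the $\sqrt n\,\varepsilon \to \infty$ requirement's spirit of large steps, yet still satisfies $n\varepsilon/\log n \to \infty$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 200_000, 0.5
z = rng.standard_normal(n)          # assumed design: z_i ~ N(0, 1)

theta_hat = np.quantile(z, tau)     # approximately solves the sample moment condition
eps = n ** (-1.0 / 3.0)             # assumed step-size rate

def m_hat(theta):
    """Sample moment (1/n) * sum_i 1(z_i <= theta) - tau."""
    return np.mean(z <= theta) - tau

# Two-sided numerical derivative of the non-smooth sample moment.
num_deriv = (m_hat(theta_hat + eps) - m_hat(theta_hat - eps)) / (2.0 * eps)
true_density = 1.0 / np.sqrt(2.0 * np.pi)   # f_z(theta_0) at theta_0 = 0

print(f"numerical derivative: {num_deriv:.4f}, density at theta_0: {true_density:.4f}")
```

The numerical derivative is a uniform-kernel density estimate evaluated at $\hat\theta$, so it should land close to the standard normal density at zero.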

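28 Consistency of numerical derivatives Numerical derivative: the sieve infinite dimensional case
Infinite dimensional θ. We compute a directional derivative $G_h = \frac{d\, m(\theta_0,\eta_0 + \tau h, x)}{d\tau}\Big|_{\tau=0}$ numerically:
$$L^{\varepsilon_n,h}_{1,p}\,\hat m(\hat\theta,\hat\eta,x) = \frac{1}{\varepsilon_n}\sum_{k} a_k\, \hat m(\hat\theta,\hat\eta + t_k h\,\varepsilon_n, x).$$
Two methods to estimate the conditional expectation, sieve (series) and kernel:
$$\hat m(\theta,\eta,z) = p^{N}(z)'\Big(\frac{1}{n}\sum_{i=1}^{n} p^{N}(z_i)\,p^{N}(z_i)'\Big)^{-1}\frac{1}{n}\sum_{i=1}^{n} p^{N}(z_i)\,\rho(\theta,\eta;y_i)$$
and
$$\hat m(\theta,\eta,z) = \Big(\frac{1}{n b_n^{d_z}}\sum_{i=1}^{n} K\Big(\frac{z_i - z}{b_n}\Big)\Big)^{-1}\frac{1}{n b_n^{d_z}}\sum_{i=1}^{n} K\Big(\frac{z_i - z}{b_n}\Big)\rho(\theta,\eta;y_i).$$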

29 Consistency of numerical derivatives
Assumption: For the basis functions $p^{N}(z)$ the following holds:
(i) The smallest eigenvalue of $E[p^{N}(Z_i)\,p^{N}(Z_i)']$ is bounded away from zero uniformly in $N$.
(ii) For some $C > 0$, $\sup_{z \in Z}\|p^{N}(z)\| \le C < \infty$.
(iii) The population conditional moment belongs to the completion of the sieve space and
$$\sup_{(\theta,\eta) \in \Theta\times H}\ \sup_{z \in Z}\ \big\| m(\theta,\eta,z) - \mathrm{proj}\big(m(\theta,\eta,z) \mid p^{N}(z)\big)\big\| = O(N^{-\alpha}).$$

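30 Consistency of numerical derivatives
Assumption:
(i) Uniformly bounded moment functions: $\sup_{\theta,\eta}\|\rho(\theta,\eta,\cdot)\| \le C$. The density of the covariates $Z$ is uniformly bounded away from zero on its support.
(ii) Suppose that $0 \in H_n$ and, for $\varepsilon_n \to 0$ and some $C > 0$,
$$\sup_{z \in Z,\ \eta,w \in H_n,\ \|\eta\|,\|w\| < C,\ \theta \in N(\theta_0)} \mathrm{Var}\big(\rho(\theta,\eta + \varepsilon_n w; Y_i) - \rho(\theta,\eta - \varepsilon_n w; Y_i) \mid z\big) = O(\varepsilon_n).$$
(iii) For each $n$, the class of functions $\mathcal F_n = \{\rho(\theta,\eta + \varepsilon_n w;\cdot) - \rho(\theta,\eta - \varepsilon_n w;\cdot),\ \theta \in \Theta,\ \eta,w \in H_n\}$ is Euclidean, with coefficients that depend on the number of sieve terms. In other words, there exist constants $A$ and $0 \le r_0 < \frac12$ such that the covering number satisfies
$$\log N(\delta,\mathcal F_n,L_1) \le A\, n^{2r_0}\log\frac{1}{\delta}.$$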

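31 Consistency of numerical derivatives
Lemma: Suppose that $\rho(\pi_n\eta,\eta) = O_p(n^{-\phi})$. Under Assumptions 4 and 5,
$$\sup_{d(\theta,\theta_0)=o(1),\ d(\eta,\eta_0)=o(1),\ \eta \in H_n} \big\| L^{\varepsilon_n,w}_{1,p}\hat m(\theta,\eta,z) - L^{\varepsilon_n,w}_{1,p} m(\theta,\eta,z)\big\| = o_p(1)$$
uniformly in $z$ and $w$, provided that $\varepsilon_n \to 0$, $\min\{n^{\alpha},n^{\phi}\}\,\varepsilon_n \to \infty$, and $\frac{n\varepsilon_n}{N^2 n^{2r_0}\log n} \to \infty$.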

32 Consistency of numerical derivatives
Assumption: $K(\cdot)$ is an $m$-th order kernel function which is an element of the class of functions $\mathcal F$ defined by Assumption 2. It integrates to 1, it is bounded, and its square has a finite integral.
Lemma: Under Assumptions 5 and 6,
$$\sup_{d(\theta,\theta_0)=o(1),\ d(\eta,\eta_0)=o(1),\ \eta \in H_n} \big\| L^{\varepsilon_n,w}_{1,p}\hat m(\theta,\eta,z) - L^{\varepsilon_n,w}_{1,p} m(\theta,\eta,z)\big\| = o_p(1)$$
uniformly in $w$ and in $z$ where $f(z)$ is strictly positive for the kernel estimator, provided that $\varepsilon_n \to 0$, $b_n \to 0$, $\varepsilon_n\min\{b_n^{-N},n^{\phi}\} \to \infty$, and $\frac{n\varepsilon_n b_n^{d_z}}{n^{2r_0}\log n} \to \infty$.

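33 Consistency of numerical derivatives
Theorem: Under Assumptions 1, 5, and either 4 or 6,
$$L^{\varepsilon_n,w}_{1,p}\hat m(\hat\theta,\hat\eta,z) \to_p \frac{\partial m(\theta,\eta,z)}{\partial\eta}[w]$$
uniformly in $z$ and $w$, if $N \to \infty$, $\varepsilon_n\min\{n^{\alpha},n^{\phi}\} \to \infty$, and $\frac{n\varepsilon_n}{N^2 n^{2r_0}\log n} \to \infty$ for the series estimator, and $b_n \to 0$, $\varepsilon_n\min\{b_n^{-N},n^{\phi}\} \to \infty$, and $\frac{n\varepsilon_n b_n^{d_z}}{n^{2r_0}\log n} \to \infty$ for the kernel-based estimator, provided that $d(\hat\theta,\theta_0) = o_p(1)$ and $d(\hat\eta,\eta_0) = o_p(1)$.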

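34 Consistency of numerical derivatives Hölder-continuous moment functions
Assumption 5 (i): For any sufficiently small $\varepsilon > 0$,
$$\sup_{(\theta,\eta) \in \Theta\times H,\ w \in H,\ \|w\| < C} \big|\rho(\theta,\eta + w\varepsilon, z) - \rho(\theta,\eta - w\varepsilon, z)\big| \le C(z)\,\varepsilon^{\gamma},$$
where $0 < \gamma \le 1$ and $E[C(Z)^2] < \infty$.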

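35 Consistency of numerical derivatives
Lemma: Suppose that $\rho(\pi_n\eta,\eta) = O(n^{-\phi})$. Under either pair of Assumptions, 4 and 5 (i)-(iv) or 6 and 5 (i)-(iv),
$$\sup_{d(\hat\theta,\theta_0)=o_p(1),\ d(\hat\eta,\eta_0)=o_p(1),\ \eta \in H_n} \big\| L^{\varepsilon_n,w}_{1,p}\hat m(\hat\theta,\hat\eta,z) - L^{\varepsilon_n,w}_{1,p} m(\theta_0,\eta_0,z)\big\| = o_p(1)$$
uniformly in $z$ and $w$, provided that $\varepsilon_n \to 0$, $\varepsilon_n\min\{n^{\alpha},n^{\phi}\} \to \infty$ and $\frac{\sqrt n\,\varepsilon_n^{1-\gamma}}{N n^{r_0}} \to \infty$ for the series estimator, and $b_n \to 0$, $\varepsilon_n\min\{b_n^{-q},n^{\phi}\} \to \infty$ and $\frac{n^{1-2r_0}\varepsilon_n^{2-2\gamma} b_n^{d_z}}{\log n} \to \infty$ for the kernel estimator.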

36 Gradient-based estimation with numerical derivatives Definitions
Consider a form of Z-estimator.
Estimate the parameter $\theta_0$ in a metric space $(\Theta,d)$.
The first-order condition is hard to compute analytically.
The sample objective function is $\hat Q_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(z_i,\theta)$.
The estimator $\hat\theta_n \in \Theta$ solves the empirical first-order condition
$$\hat D_n(\hat\theta_n) = L^{\varepsilon_n}_{1,p}\hat Q_n(\hat\theta_n) = o_p\Big(\frac{1}{\sqrt n}\Big).$$
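As an illustration of solving a numerical first-order condition, here is a minimal sketch for the LAD objective $g(z_i,\theta) = |z_i - \theta|$, whose sample objective is not differentiable at the data points. The data design, the step size $\varepsilon_n = n^{-1/3}$, the two-sided formula, and the bisection solver are all assumptions made for the example rather than choices taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.standard_normal(n)      # assumed data: x_i ~ N(0, 1), so theta_0 = 0
eps = n ** (-1.0 / 3.0)         # assumed step size

def Q_n(theta):
    """Sample LAD objective (1/n) * sum_i |x_i - theta|."""
    return np.mean(np.abs(x - theta))

def num_grad(theta):
    """Two-sided numerical derivative of the sample objective."""
    return (Q_n(theta + eps) - Q_n(theta - eps)) / (2.0 * eps)

# Q_n is convex, so num_grad is nondecreasing; it equals -1 far to the
# left of the data and +1 far to the right, which brackets a root.
lo, hi = x.min() - 1.0, x.max() + 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if num_grad(mid) < 0.0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)

print(f"theta_hat = {theta_hat:.4f}, sample median = {np.median(x):.4f}")
```

Bisection is used only because the numerical gradient is monotone in this scalar example; any root finder applied to `num_grad` would serve the same purpose.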

37 Gradient-based estimation with numerical derivatives Assumptions
An identification condition.
Assumption: The map $\Theta \to \mathbb R^{k}$ defined by $D(\theta) = \frac{\partial}{\partial\theta}E[g(z_i,\theta)]$ is identified at $\theta_0 \in \Theta$. In other words, from $\lim_{n\to\infty}\|D(\theta_n)\| = 0$ it follows that $\lim_{n\to\infty}\|\theta_n - \theta_0\| = 0$ for any sequence $\theta_n \in \Theta$.
A continuity in probability condition.
Assumption: The parameter space $\Theta$ has a compact cover. For each $n$, there exists a countable subset $T_n \subset \Theta$ such that
$$P\Big(\sup_{\theta \in \Theta}\ \inf_{\theta' \in T_n} \|g(Z_i,\theta) - g(Z_i,\theta')\|^2 > 0\Big) = 0.$$

38 Gradient-based estimation with numerical derivatives Consistency
Theorem (functions with absolutely bounded finite differences): Under Assumptions 7, 8, 1, and 2, as long as $\varepsilon_n \to 0$ and $\frac{n\varepsilon_n}{\log n} \to \infty$,
$$\sup_{\theta \in \Theta} \big\| L^{\varepsilon_n}_{1,p}\hat Q(\theta) - G(\theta)\big\| = o_p(1).$$
Consequently, $\hat\theta \to_p \theta_0$ if $L^{\varepsilon_n}_{1,p}\hat Q(\hat\theta) = o_p(1)$.
The Hölder-continuous case is also similar to variance estimation.

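39 Gradient-based estimation with numerical derivatives Rate of convergence
Lemma: Lemma 1 also provides the rate of convergence:
$$\sup_{\theta \in N(\theta_0)} \sqrt{\frac{n\varepsilon_n}{\log n}}\ \big\| L^{\varepsilon_n}_{1,p}\hat Q(\theta) - L^{\varepsilon_n}_{1,p} Q(\theta)\big\| = O_p(1).$$
Initial parameter rate of convergence: Suppose $\hat\theta \to_p \theta_0$ and $L^{\varepsilon}_{1,p}\hat Q(\hat\theta) = o_p\big(\frac{1}{\sqrt{n\varepsilon_n}}\big)$. Under the assumptions of Theorem 10, if $n\varepsilon_n/\log n \to \infty$ and $n\varepsilon_n^{1+4p}/\log n = O(1)$, then $\sqrt{\frac{n\varepsilon_n}{\log n}}\ d(\hat\theta,\theta_0) = O_P(1)$.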

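40 Gradient-based estimation with numerical derivatives Rate of convergence
Lemma: Under the conditions of Theorem 10 with $n\varepsilon_n^{1+4p} = o(1)$ and $\frac{n\varepsilon_n^{3}}{\log n} \to \infty$, we have
$$\sup_{d(\hat\theta,\theta_0) = O\big(\sqrt{\log n/(n\varepsilon_n)}\big)} \big\| L^{\varepsilon_n}_{1,p}\hat Q(\hat\theta) - L^{\varepsilon_n}_{1,p}\hat Q(\theta_0) - L^{\varepsilon_n}_{1,p} Q(\hat\theta) + L^{\varepsilon_n}_{1,p} Q(\theta_0)\big\| = o_p\Big(\frac{1}{\sqrt{n\varepsilon_n}}\Big).$$
Theorem: Suppose $\hat\theta \to_p \theta_0$ and $L^{\varepsilon}_{1,p}\hat Q(\hat\theta) = o_p\big(\frac{1}{\sqrt{n\varepsilon_n}}\big)$. Under the assumptions of Theorem 10, if $n\varepsilon_n^{3}/\log n \to \infty$ and $n\varepsilon_n^{1+4p} = O(1)$, then $\sqrt{n\varepsilon_n}\ d(\hat\theta,\theta_0) = O_P(1)$.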

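41 Gradient-based estimation with numerical derivatives Distribution
Assumption: A CLT holds: as $n \to \infty$ and $\varepsilon_n \to 0$,
$$\frac{\mathbb G_n(\theta_0 + \varepsilon_n) - \mathbb G_n(\theta_0 - \varepsilon_n)}{\sqrt{\varepsilon_n}} \to_d N(0,\Omega).$$
Theorem: Assume that the conditions of Theorem 10 hold but with $n\varepsilon_n^{1+4p} = o(1)$. In addition, suppose that the Hessian matrix $H(\theta)$ of $g(\theta)$ is continuous, nonsingular and finite at $\theta_0$. Then if Assumption 9 holds with $n\varepsilon_n^{3} \to \infty$,
$$\sqrt{n\varepsilon_n}\,(\hat\theta - \theta_0) \to_d N\big(0,\ H(\theta_0)^{-1}\,\Omega\, H(\theta_0)^{-1}\big).$$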

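42 Gradient-based estimation with numerical derivatives Functions with polynomial envelopes for finite differences
Theorem: Suppose $\hat\theta \to_p \theta_0$ and $L^{\varepsilon}_{1,p}\hat Q(\hat\theta) = o_p\big(\frac{1}{\sqrt n\,\varepsilon_n^{1-\gamma}}\big)$. Under Assumptions 3 and 8, if $n\varepsilon_n^{2-2\gamma} \to \infty$ and $\sqrt n\,\varepsilon_n^{1-\gamma+2p} = O(1)$, and the Hessian matrix $H(\theta)$ of $g(\theta)$ is continuous, nonsingular and finite at $\theta_0$, then $\sqrt n\,\varepsilon_n^{1-\gamma}\ d(\hat\theta,\theta_0) = O_P(1)$.
Theorem: Assume that the conditions of Theorem 15 hold but with $\sqrt n\,\varepsilon_n^{1+2p-\gamma} = o(1)$. If $\lim_{\varepsilon \to 0}\varepsilon^{2-2\gamma}\,\mathrm{Var}\big(L^{\varepsilon}_{1,p}\, g(Z_i,\theta_0)\big) = \Omega$ and $\sqrt n\,\varepsilon_n^{2-\gamma} \to \infty$, then
$$\sqrt n\,\varepsilon_n^{1-\gamma}\,(\hat\theta - \theta_0) \to_d N\big(0,\ H(\theta_0)^{-1}\,\Omega\, H(\theta_0)^{-1}\big).$$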

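43 Gradient-based estimation with numerical derivatives Stronger assumptions for smooth models
Proposition: Suppose the conditions of Theorem 16 hold except $\sqrt n\,\varepsilon_n^{2-\gamma} \to \infty$. Suppose further that $g(z_i,\theta)$ is mean square differentiable in a neighborhood of $\theta_0$: there are measurable functions $D(\cdot,\cdot): Z\times\Theta \to \mathbb R^{p}$ such that
$$E\big[g(Z,\theta_1) - g(Z,\theta_2) - (\theta_1 - \theta_2)'D(Z,\theta_1)\big]^2 = o(\|\theta_1 - \theta_2\|^2), \qquad E\|D(Z,\theta_1)\|^2 < \infty$$
for all $\theta_1, \theta_2 \in N_{\theta_0}$. Define $q_\varepsilon(z_i,\theta) = L^{\varepsilon}_{1,p}\, g(z_i;\theta) - D(z_i,\theta)$. Assume that
$$\sup_{d(\theta,\theta_0)=o(1),\ \varepsilon=o(1)} \big[\mathbb G q_\varepsilon(z_i,\theta) - \mathbb G q_\varepsilon(z_i,\theta_0)\big] = o_p(1),$$
and that $D(z_i,\theta)$ is Donsker in $d(\theta,\theta_0) \le \delta$. Then the conclusion of Theorem 16 holds.
Example: quantile regression.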

44 Gradient-based estimation with numerical derivatives Example
Our theory does not require smoothness of the moment function, and applies to non-smooth objective functions.
Example: maximum score. Simple case: one regressor with a normalized coefficient.
Objective function:
$$\hat Q_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \frac12\Big)\,1\{\theta + x_i > 0\}.$$
Numerical first-order condition (set the numerical gradient equal to zero):
$$L^{\varepsilon_n}_{1,2}\hat Q_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \frac12\Big)\frac{1}{\varepsilon_n}\,U\Big(\frac{x_i + \theta}{\varepsilon_n}\Big) = 0,$$
where $U(\cdot)$ is a uniform kernel.

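45 Gradient-based estimation with numerical derivatives Example: comparison with smoothed maximum score
Use the integrated uniform kernel
$$K(z) = \frac12\big(-1\{z < -1\} + 1\{z \in [-1,1]\}\,z + 1\{z > 1\}\big).$$
The smoothed objective function uses bandwidth $h_n$:
$$\hat Q^{s}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \frac12\Big)K\Big(\frac{x_i + \theta}{h_n}\Big).$$
First-order condition:
$$\frac{\partial}{\partial\theta}\hat Q^{s}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \frac12\Big)\frac{1}{h_n}\,U\Big(\frac{x_i + \theta}{h_n}\Big) = 0.$$
This is identical to equating the numerical gradient to zero!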
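The equivalence between the two-sided numerical gradient of the maximum score objective and the uniform-kernel first-order condition can be verified directly. The sketch below assumes a hypothetical binary-choice design (logistic errors, true $\theta = 0$) and a fixed step `eps = 0.3`; the uniform kernel is taken on the half-open interval $(-1, 1]$ so that it matches the indicator difference exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
x = rng.standard_normal(n)
# Hypothetical data-generating process, for illustration only.
y = (x + rng.logistic(size=n) > 0).astype(float)
eps = 0.3   # step size, playing the role of the bandwidth h_n

def Q_n(theta):
    """Maximum score objective (1/n) * sum_i (y_i - 1/2) * 1{theta + x_i > 0}."""
    return np.mean((y - 0.5) * (theta + x > 0))

def U(u):
    """Uniform kernel density on (-1, 1]."""
    return 0.5 * ((u > -1.0) & (u <= 1.0))

def numerical_gradient(theta):
    """Two-sided numerical derivative of the step-function objective."""
    return (Q_n(theta + eps) - Q_n(theta - eps)) / (2.0 * eps)

def kernel_foc(theta):
    """Uniform-kernel first-order condition with bandwidth eps."""
    return np.mean((y - 0.5) * U((x + theta) / eps) / eps)

grid = np.linspace(-1.0, 1.0, 21)
gap = max(abs(numerical_gradient(t) - kernel_foc(t)) for t in grid)
print(f"largest discrepancy on the grid: {gap:.2e}")
```

The discrepancy should be at the level of floating-point rounding, since $1\{\theta + \varepsilon + x > 0\} - 1\{\theta - \varepsilon + x > 0\} = 1\{-\varepsilon < x + \theta \le \varepsilon\}$.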

46 Gradient-based estimation with numerical derivatives Example
Both the numerical gradient-based procedure and smoothed maximum score have the same convergence rate $\sqrt{n\varepsilon_n}$.
The step size parameter is equivalent to the bandwidth.
If one uses any gradient-search method with maximum score-type objective functions, the asymptotic distribution of Kim and Pollard (1990) is not applicable.

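47 Setup Monte-Carlo Simulations
Analyze minimizers of $\hat Q_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}|x_i - \theta|^{\alpha}$.
In the population, $Q(\theta) = E[|x_i - \theta|^{\alpha}]$, with $x_i \sim N(0,1)$.
The estimator solves
$$M_n(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(1\{x_i \le \hat\theta\} - \frac12\Big)|x_i - \hat\theta|^{\alpha-1} = 0.$$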

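48 Problem Monte-Carlo Simulations
The solution is well-defined if $\alpha \ge 1$. $\alpha = 1$ gives LAD.
Look at the numerical derivative of $M_n(\hat\theta)$ (the Hessian of the sample objective function).
Set $\theta_0 = 0$.
The population Hessian can be computed analytically:
$$D(0) = 1\{\alpha = 1\}\,\varphi(0) + 1\{\alpha > 1\}\,\frac{\alpha - 1}{4\sqrt\pi}\,2^{\alpha/2}\,\Gamma\Big(\frac{\alpha - 1}{2}\Big).$$
Compare $L^{\varepsilon}_{1,p} M_n(\hat\theta)$ with $D(0)$.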

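49 Methodology Monte-Carlo Simulations
First-order formula:
$$D_1(\hat\theta) = L^{\varepsilon_n}_{1,1} = \frac{M_n(\hat\theta + \varepsilon_n) - M_n(\hat\theta)}{\varepsilon_n}.$$
Second-order formula:
$$D_2(\hat\theta) = L^{\varepsilon_n}_{1,2} = \frac{M_n(\hat\theta + \varepsilon_n) - M_n(\hat\theta - \varepsilon_n)}{2\varepsilon_n}.$$
Third-order formula:
$$D_3(\hat\theta) = L^{\varepsilon_n}_{1,3} = \frac{M_n(\hat\theta - 2\varepsilon_n) - 8 M_n(\hat\theta - \varepsilon_n) + 8 M_n(\hat\theta + \varepsilon_n) - M_n(\hat\theta + 2\varepsilon_n)}{12\,\varepsilon_n}.$$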

50 Methodology Monte-Carlo Simulations
Choose $\alpha = 2.5$, $1.5$ and $1$.
Numerical and analytic derivatives coincide when $\alpha = 2$.
Use different rates for the step size of numerical differentiation.
Use formulas of different order.

51 Methodology Monte-Carlo Simulations
We generate 1000 Monte Carlo samples with the number of observations from 500 to …
For each sample $s$ we find the estimate $\hat\theta_s$ by solving $M_{n_s}(\hat\theta_s) = 0$.
We choose a sample-adaptive step of numerical differentiation, $\varepsilon = C\,(n_s)^{-q}$.
We choose $C = 2$ and $q$ from … to …

52 Methodology Monte-Carlo Simulations
Then we evaluate the numerical Hessians $D_1(\hat\theta_s)$, $D_2(\hat\theta_s)$, and $D_3(\hat\theta_s)$.
The values of the numerical Hessians are stored, and then we compute the mean-squared error of the evaluated Hessians across Monte Carlo samples:
$$\mathrm{MSE}(\hat D_i) = \frac{1}{S}\sum_{s=1}^{S}\big(\hat D_i(\hat\theta_s) - D(0)\big)^2.$$
The results are reported by showing the dependence of $\mathrm{MSE}(\hat D_i)$ on the sample size $n_s$.
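The Monte Carlo design can be sketched in a few functions. This is an illustration under assumptions, not the authors' code: the moment function is taken to be $(1\{x_i \le \theta\} - \tfrac12)|x_i - \theta|^{\alpha-1}$, the first-order condition is solved by bisection, the number of replications `S` is reduced to keep the run short, and the rate `q` is a free input because the range used in the slides is not available.

```python
import math
import numpy as np

def analytic_hessian(alpha):
    """Population derivative D(0) for x ~ N(0, 1) under the assumed moment function."""
    if alpha == 1.0:
        return 1.0 / math.sqrt(2.0 * math.pi)   # standard normal density at 0
    return ((alpha - 1.0) / (4.0 * math.sqrt(math.pi))
            * 2.0 ** (alpha / 2.0) * math.gamma((alpha - 1.0) / 2.0))

def M_n(theta, x, alpha):
    """Sample first-order condition (assumed form)."""
    return np.mean(((x <= theta) - 0.5) * np.abs(x - theta) ** (alpha - 1.0))

def solve_foc(x, alpha, iters=60):
    """Bisection on M_n, which is nondecreasing in theta for alpha >= 1."""
    lo, hi = x.min(), x.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if M_n(mid, x, alpha) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def numerical_hessians(theta, x, alpha, eps):
    """First-, second-, and third-order formulas D_1, D_2, D_3."""
    M = lambda t: M_n(t, x, alpha)
    d1 = (M(theta + eps) - M(theta)) / eps
    d2 = (M(theta + eps) - M(theta - eps)) / (2.0 * eps)
    d3 = (M(theta - 2 * eps) - 8 * M(theta - eps)
          + 8 * M(theta + eps) - M(theta + 2 * eps)) / (12.0 * eps)
    return np.array([d1, d2, d3])

def mse_by_formula(alpha, n, q, C=2.0, S=200, seed=0):
    """MSE of D_1, D_2, D_3 around D(0) across S Monte Carlo samples."""
    rng = np.random.default_rng(seed)
    eps = C * n ** (-q)                 # sample-adaptive step size
    target = analytic_hessian(alpha)
    errors = np.empty((S, 3))
    for s in range(S):
        x = rng.standard_normal(n)
        theta_s = solve_foc(x, alpha)
        errors[s] = numerical_hessians(theta_s, x, alpha, eps) - target
    return np.mean(errors ** 2, axis=0)

# alpha = 2: M_n is linear in theta, so all three formulas are exact.
mse_linear = mse_by_formula(alpha=2.0, n=500, q=0.5)
# alpha = 1.5: an intermediate case; q = 0.2 is an arbitrary illustrative rate.
mse_mid = mse_by_formula(alpha=1.5, n=500, q=0.2)
print(mse_linear, mse_mid)
```

For $\alpha = 2$ the moment function reduces to $\tfrac12(\theta - \bar x)$, so the reported mean-squared errors should be zero up to rounding; this gives a quick sanity check before running the non-smooth cases over a grid of sample sizes and rates.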

53 Results Monte-Carlo Simulations
$\alpha = 2.5$: smooth objective function. Decreasing the rate of the step size and increasing the order of the numerical derivative lead to a decrease in the mean-squared error of the derivative evaluation.
$\alpha = 1$: least smooth. The optimal step size rate is close to 0.2. Increasing the order of numerical differentiation yields only a small gain in the precision of the derivative evaluation.
$\alpha = 1.5$: intermediate case. An increase in the order of the numerical derivative leads to a decrease in the mean-squared error, but the mean-squared error first decreases as $q$ increases and then starts to increase again.

54 Results Conclusion
Sufficient conditions for consistency of numerical derivatives.
The problem is related to, but different from, the bias-variance tradeoff.
In cases with smooth objective functions, the step size can decrease at an arbitrarily fast rate.
Gradient-based numerical optimization techniques can be applied even with a nonsmooth sample objective, provided that the step size is selected properly.
The rate of convergence of functions containing numerical differentiation restricts the order of the numerical differentiation operator.
Monte Carlo analysis illustrates the theoretical step size conditions.

55-57 Simulations
[Figures: RMSE against sample size, 1st order derivative, one panel per value of alpha]

58-60 Simulations
[Figures: RMSE against sample size, 2nd order derivative, one panel per value of alpha]

61-63 Simulations
[Figures: RMSE against sample size, 3rd order derivative, one panel per value of alpha]
