Information Theory for Maximum Likelihood Estimation of Diffusion Models

Size: px

Start display at page:

Download "Information Theory for Maximum Likelihood Estimation of Diffusion Models"

Willis Shelton
5 years ago
Views:

1 Information heory for Maximum Likelihood Estimation of Diffusion Models Hwan-sik Choi Binghamton University Mar. 4 Abstract We develop an information theoretic framework for maximum likelihood estimation of diffusion models. wo information criteria that measure the divergence of a diffusion process from the true diffusion are defined. he maximum likelihood estimator MLE converges asymptotically to the limit that minimizes the criteria when sampling interval decreases as sampling span increases. When both drift and diffusion specifications are correct, the criteria become zero and the inverse curvatures of the criteria equal the asymptotic variance of the MLE. For misspecified models, the minimizer of the criteria defines pseudo-true parameters. Pseudo-true drift parameters depend on approximate transitions if used. JEL: C3, C, C4. Keywords: Kullback-Leibler information criterion, fiber bundle, high frequency, infinitesimal generator, infill asymptotics, misspecification. I thank the seminar participants at Purdue University Statistics, Symposium of Probability and Stochastic Processes CIMA, Mexico for helpful discussions. hwansik@binghamton.edu, Department of Economics, Binghamton University, 44 Vestal Parkway E. P.O. Box 6, Binghamton, NY 39, USA.

2 Introduction We develop an information theoretic framework for maximum likelihood estimation of diffusion models in an asymptotic environment in which sampling interval shrinks to zero as sampling span increases to infinity. he new approach provides a unified framework within which we can analyze both correctly specified and misspecified diffusion models. For discrete time models, White 98 gives an information theoretic foundation for analyzing the maximum likelihood estimator MLE. His theory is suitable for the models analyzed with the conventional large sample theory with fixed sampling interval. For diffusion models, estimation and inference under decreasing sampling interval were studied previously by, for example, Bandi and Phillips 3, Jeong and Park 3, and Choi, Jeong, and Park 3. Bandi and Phillips 3 study fully nonparametric estimation. Jeong and Park 3 study the maximum likelihood estimation for correctly specified parametric models for both stationary and non-stationary processes. Choi, Jeong, and Park 3 provide some asymptotic results for misspecified models, although their main focus is on model selection testing. he theory developed in Jeong and Park 3 and Choi, Jeong, and Park 3 can explain the differential properties of the drift and diffusion parameter estimators observed in high frequency samples. hey show that the rates of information accumulation for drift and diffusion parameters are different, and both sample size and sampling span are important for diffusion parameter estimation whereas sampling span only matters for drift parameter estimation. he present paper provides an information theoretic foundation for their new asymptotic theory. In White s theory, the Kullback-Leibler information criterion KLIC, Kullback and Leibler 95 plays a key role in understanding the asymptotic properties of the MLE. He shows that the MLE converges to the limit that minimizes the KLIC from the true distribution and derives asymptotic properties of the MLE under model misspecification. He also shows that the usual inference procedure with the Likelihood Ratio, Lagrange Multiplier, and Wald test statistics is not valid for a misspecified model. Naturally, we are interested in the questions similar to those that White has addressed in his paper. Under the new asymptotic framework with decreasing sampling interval, do the MLEs of misspecified diffusion models converge to some limits asymptotically, do the limits have any meaning, are the usual inference procedures based on likelihood ratios valid, what is the consequence of partial misspecification, what is the impact of having to use an approximate

3 transition when the exact transition is unknown in closed-form? We provide an answer to each of these questions in this paper. Based on the KLIC, we develop two-tier information criteria, which we refer to as the primary and secondary information criteria. he criteria measure the divergence of a diffusion process M from the true diffusion X. he primary criterion depends on the diffusion function of M, and equals zero if and only if the diffusion function is the same as the diffusion function of X. he secondary criterion becomes zero if both drift and diffusion functions of M are correct. Our new information criteria play a similar role as the KLIC in White 98. When sampling interval decreases as sampling span increases, the MLE asymptotically minimizes the two criteria, and the minimizer of the criteria is said to be the pseudo-true parameter. Because the primary criterion does not depend on the drift specification of M, the MLE for a diffusion function parameter is consistent to the same limit regardless of drift specifications. he secondary criterion depends on both diffusion and drift specification. When the exact transition is not known in closed-form and an approximate transition is used, the secondary criterion depends on the choice of an approximate transition as well. herefore, the pseudo-true drift parameter depends on diffusion and drift function specifications and the choice of approximate transitions if used. Under the partial misspecification where the diffusion function is correctly specified, the pseudo-true values of drift function parameters become invariant for a large class of approximate transitions. hus, the MLE of drift parameters would be less sensitive to the choice of an approximate transition when the diffusion function is correctly specified. When the drift function is correctly specified but the diffusion function is misspecified, the drift estimator does not converge to the true value in general. However, we show that the Euler approximate transition has a robustness property in the sense that the MLE of the drift parameters converges asymptotically to the true value. For the Milstein approximate transition and the closed-form approximate transition of Aït-Sahalia, the MLE of the drift parameter may not converge to the true parameter value. We also study the asymptotic distribution of the MLE for both correctly specified and misspecified models. When a diffusion model is correctly specified, we show that the asymptotic variances of the MLE of diffusion and drift parameters are given by the inverse curvatures of the primary and secondary criteria respectively. For misspecified models, the asymptotic variance of the MLE depends on both expected Hessian matrix and covariance matrix of the score functions. 3

4 We develop the new information criteria in the next section, and give information theoretic interpretation of the asymptotic properties of the MLE in Section 3. Divergence of diffusion processes Let W be the standard Brownian motion defined on a probability space Ω, F, P, and X be a stationary diffusion on D R that solves the stochastic differential equation SDE dx t = µ X t dt + σ X t dw t. with a drift function µ and a diffusion function σ. We assume that SDEs in this paper satisfy the conditions in Karatzas and Shreve 99 to admit a weak solution. Suppose we have a parametric diffusion model Mθ for X given by Mθ : dx t = µx t ; α dt + σx t ; β dw t, where µ ; α and σ ; β > are known functions with an unknown parameter vector θ = α, β in a compact set Θ R k. A diffusion model Mθ is misspecified if there exists no θ Θ such that µ = µ ; α and σ = σ ; β on D. Let p t, x, y be the transition density from X = x to X t = y of the true process in. and pt, x, y; θ be the exact or approximate transition density of Mθ. We consider the KLIC from the transition density p t, x, y to pt, x, y; θ. he KLICP, Q from a probability measure P to an equivalent probability measure Q, is defined by ˆ KLICP, Q dp log dp, dq and it is infinite if P and Q are not equivalent. See Csiszár 967a,b, 975 for more general divergence measures. Based on the KLIC, define the divergence measure D t θ E log p t, X, X t,. pt, X, X t ; θ as a function of t given θ. he function D t θ characterizes the divergence from the true process X to Mθ at various transition intervals. For instance, D t θ for small t would measure how close the model transition pt, x, y; θ is from the true transition p t, x, y for observations X and X t obtained over a small time interval t. In the limit of t, 4

5 Dtθ α =., β =. α =.8, β =.7 rue: α =.6, β = ransition interval t Figure : For Mθ : dx t =.α X t dt + β X t dw t, the two curves show D t θ from the true diffusion θ = α, β =.6,. to two diffusion processes θ =.,. and θ =.8,.7. For small t, the process with θ =.,. is closer to the true process, but θ =.8,.7 is closer for large t. D θ would measure the divergence of the marginal densities of X and Mθ. Since D t θ depends on the transition interval t given θ, the proximity of two processes X and Mθ in D t θ should be understood separately for different t. In this sense, we may consider the combination of Mθ and t as a model if sampling occurs at a specific time interval only. Consequently, for the discrete time analysis with fixed sampling interval, it is sufficient to consider D t θ at a specific t as the only relevant criterion. In this paper, however, we are particularly interested in the asymptotic framework in which t since the exact transition of a diffusion model is rarely known in closed-form and all approximate transitions would be valid only if t. his is also one of the reasons that diffusion models are mostly used for high frequency samples. In Figure, we give two examples of the divergence measure D t θ calculated with the CIR process Cox, Ingersoll, and Ross 985 Mθ : dx t =.α X t dt + β X t dw t with a two dimensional parameter vector θ = α, β. he true process X is defined by θ = α, β =.6,., and D t θ from X to the diffusion processes with θ =.,. 5

6 and θ =.8,.7 are shown in Figure. he process with θ =.,. has the diffusion function parameter close to the true value β =. whereas θ =.8,.7 has the drift function parameter close to the true parameter α =.6 compared to the other. he figure shows that the process with θ =.,. is closer to X in D t θ at small t. As t increases, the process θ =.8,.7 becomes closer to the true process. o see the relative effect of α and β on D t θ more clearly as t changes, we compare the contour plots of D t θ for various θ in Figure under quarterly t = /4, monthly t = /, weekly t = /5, and daily t = /5 transition intervals. he true process X is shown as the dot at the center of the plot. Each contour level curve represents the diffusion processes with equal D t θ from X. From this figure, we can notice that the drift function parameter α horizontal moves and the diffusion function parameter β vertical moves influence the divergence differently depending on the transition interval. Given β, a fixed amount of deviation of α from its true value.6 increases the divergence very much with quarterly transition interval, but the increase is quite small under daily transition interval. On the other hand, deviation of β from its true value. has similar impact on D t θ at all four transition intervals. his implies that the divergence measure D t θ is more sensitive to diffusion function specification than drift function specification as transition interval decreases. his example motivates that it is crucial to study the behavior of D t θ with small t to understand the asymmetric impact of drift and diffusion specifications on D t θ for high frequency data. From this motivation, we define the primary criterion as follows. Definition. Primary Criterion. he primary criterion is the limit of the divergence function D t θ as t. C P θ lim t D t θ We show that C P θ plays a crucial role in analyzing the MLE under the asymptotic framework in which the sampling interval decreases to zero. o ensure that C P θ is well defined, we develop our theory under the following assumptions. Let lt, x, y = log pt, x, y be a generic log transition density of Mθ for any θ Θ or of the true process X with a drift function µ and a diffusion function σ. We consider the scaled log transition density lt, x, y tlt, x, y + log t, for t >..3 6

7 β β Quaterly α Weekly e α β β Monthly e α Daily e α Figure : Contour plots of the divergence measure D t θ defined in. from the true process X to Mθ : dx t =.α X t dt + β X t dw t. he true process is defined by θ = α, β =.6,. and shown as the dot at the center. Each level represents the processes in Mθ with equal divergence from X under quarterly t = /4, monthly t = /, weekly t = /5, and daily t = /5 transition intervals. 7

8 Although the transition pt, x, y is not well defined at t =, we extend the domain of the scaled function lt, x, y to t by defining its value at t = as lim t lt, x, y which is assumed to exist by the following assumption. he derivatives of lt, x, y at t = are defined similarly. Assumption. lt, x, y, µx and σx are infinitely differentiable in t, x, y D and θ Θ. Let ht, x, y be one of these functions or their derivative functions. hen ht, x, y is dominated by g in the sense that ht, x, y gxgy, for all small t, y, x D, and θ in the interior of Θ. Moreover, g is locally bounded, increasing at a polynomial rate of order p on the boundary of D, and EgX t d <, where d is the dimension of θ. Assumption. here exists q > such that q sup X t p t [, ] as. If D =,, we further assume that q sup X p t. t [, ] With the above technical conditions, we introduce the following notation for simplicity of exposition. Let ft, x, y be a function that satisfies Assumption. We denote the functions f / x f, x, x, f /t x f, x, x, t with the single argument x as the function ft, x, y and its derivative with respect to t at y = x and t =. Other higher order derivatives such as f /y x, f /yy x, or f /yt x are defined similarly. Using this notation, for the scaled log transition density lt, x, y in.3, we can write l / x = l, x, x, l /t x = l, x, x/ t, l /y x = l, x, x/ y, and other derivatives similarly. We also assume the following for lt, x, y and its derivatives at y = x and t =. 8

9 Assumption 3. We have the following limits for lt, x, y. l/ x =, l/t x = log σ x logπ, l/y x =, l/yy x = σ x. Assumption 4. he functions l /yyy x and l /yyyy x depend on σx but not on µx. Assumptions 3 and 4 can be shown to be satisfied for the exact transition density always. When the exact transition must be approximated because it is not available in closed-form, we can show that the Euler and Milstein approximate transitions and the closed-form approximate transition of Aït-Sahalia satisfy them also. See Lemma A.5 of Jeong and Park 3 for details. For expositional simplicity, we write f fx for any function f with a single argument. For example, we write l / = l / X, µ = µx and σ = σx. Also, we suppress θ for the functions such as the log-likelihoods and the drift and diffusion functions of Mθ. With this notional convention, we derive the explicit form of C P in the following theorem. All proofs are in Appendix. heorem.. Let A t + µ y y + σ y y.4 be the infinitesimal generator of the true process X. Given X = x, define ft, x, y l t, x, y lt, x, y,.5 where l t, x, y and lt, x, y are the scaled log-likelihoods defined in.3 of the true process X and the model Mθ respectively. If Assumptions -4 are satisfied, then we have C P θ = EAf / = [ ] E log σ σ σ + σ..6 Note that Af / = Af / X, σ = σ X and σ = σ X ; β from our notational convention. he theorem shows that the drift specification of Mθ becomes irrelevant for D t θ as t because the primary criterion C P θ does not depend on the drift function 9

10 parameter α. his result also supports our observation from Figure. We will denote C P θ as C P β in the rest of this paper. here is a convenient way to interpret the primary criterion. Suppose that Y β is a random variable of which the distribution is the marginal distribution of the diffusion process hen we can write rx t ; β σ X t σ X t ; β. C P β = E{Y β log Y β}. Intuitively, C P β measures the expected error from approximating log Y with its linear approximation Y for given β. For a correctly specified diffusion function with σ = σ ; β, we would have C P β = since r ; β =. For given β = β, let M β α Mθ be the sub-model defined on the subset Θ β = {α, β Θ β = β } of Θ. invariant on Θ β. heorem. implies that the primary criterion C P β is herefore, if there exists β that minimizes C P β, all the diffusion processes in M β α would minimize the primary criterion and be considered to be closest to X. We define the secondary divergence measure that depends on both α and β to further distinguish the members of M β α. Definition. Secondary Criterion. Let D tθ D tθ t be the derivative of D t θ at t. he secondary criterion is the limit as t. C S θ lim t D tθ We derive the explicit form of C S θ to understand the role of diffusion and drift specifications in determining the secondary divergence of Mθ from the true process X. Let µ /x x µx/ x and σ /x x σ x/ x, and denote µ /x µ /x X and σ /x σ /x X. Other functions and their derivatives without arguments are defined in the same manner.

11 heorem.. Suppose Assumptions -4 are satisfied. We have C S θ = EA f / = [{ } E µ µ + σ /x + σ µ /x + σ /xx 4 σ + σ + σ µ + σ /x l /yyy l /yyy + σ4 4 l /yyyy l /yyyy ] + µ l /yt l /yt + l /tt l /tt + σ l /yyt l /yyt,.7 where A is the infinitesimal generator defined in.4, the function f is the scaled loglikelihood ratio defined in.5 with the exact transition densities of X and Mθ, l/yyy = 3σ /x σ 4, l/yyyy = σ /xx σ 4 5 σ/x 4 σ 6, l/yt = µ σ 3σ /x 4σ,.8 l/tt = µ σ µ /x + µσ /x σ 3σ /x 6σ + σ /xx 4,.9 l/yyt = µ σ /x + µσ /x σ 3σ /x 4σ + 3σ /xx 4 and the derivatives of l are similarly defined with µ x and σ x.,. When an approximate transition is used, the exact form of C S θ would depend on the approximation method generally because the derivatives in.8-. depend on the local properties of the transition of Mθ around y = x and t =. However, it is shown in Jeong and Park that these derivatives have identical forms for the exact transition and Aït-Sahalia s closed-form approximate transition. herefore, Aït-Sahalia s closed-form approximation is sufficiently accurate to give the same C S θ as the exact transition would. Moreover, in the following corollary, we show that if diffusion function specification is correct, C S θ is invariant for a larger class of approximate transitions. Corollary.. If the diffusion function σ ; β of Mθ is identical to the true diffusion function, i.e. σ ; β = σ on D, then C S θ conditional on β = β is simplified to C S θ = [ E µ µ ] σ

12 for the exact transition as well as the Euler, Milstein, and Aït-Sahalia s closed-form approximate transitions. Further, we have C S θ and the equality holds if and only if θ = α, β Θ such that µ ; α = µ on D. As an example, we give the explicit forms of C P and C S for the CIR model Mθ : dx t = α α X t dt + β X t dw t. Let the process with θ = ᾱ, ᾱ, β be the true diffusion. In this example, the function Af / used to calculate the primary criterion becomes a constant that depends on β only rather than random, and is simply given by C P β = β log β + β β. he secondary criterion C S θ = EA f / can be calculated from [ A f / x = ᾱᾱ x + ᾱ ᾱ x β + β x ᾱ β x + β x + 3 ᾱ ᾱ x + β β x β + 5 β β 6x x β ᾱ ᾱ x + ᾱ ᾱ x α α x β x β x ᾱ + ᾱ x + α α x β x β + ᾱ α x + ᾱᾱ x α α x 3 β β x 6 x + ᾱ ᾱᾱ x x + β β α α α x x Let us set the true process as ᾱ, ᾱ, β =.6,.,., and calculate C P β and C S θ. Figure 3 shows the values of C P β for different values of β. he closest diffusion function specification does not depend on drift specifications, and C P becomes zero if and only if the diffusion specification is correct β =.. In Figure 4, we provide the contour plots of C S θ from the true process to processes with α, α given β =.8,.,.,.4. he true value of α, α is shown as the ].

13 C P β β Figure 3: he primary criterion C P β from the true process X to Mθ : dx t = α α X t dt + β X t dw t. he true process is defined by α, α, β =.6,.,.. C P β does not depend on α, α, and it is minimized at the true parameter value β =.. dot at the center of each plot. he true value of the diffusion function parameter is β =.. We can see in the figure that C S θ is minimized at the true value α, α =.6,. when β =.. However, if β., C S θ is minimized at a different α, α point. As shown in this example, slight misspecification of diffusion functions leads to a large change of the closest drift specification in C S. his implies that when diffusion specification is incorrect, the best drift specification in the sense of minimizing C S may not be the correct drift specification. Consequently, the diffusion process with correct drift specification may not be the closest diffusion to the true diffusion in this case. For further development of our theory, we assume that the primary and secondary criteria satisfy the following. Assumption 5. he functions C P β and C S θ satisfy the following conditions. a here is a unique β that solves the minimization problem C P β = min C P β. β b Let Θ β = {α, β β = β }, then we have a unique θ = α, β in the interior of Θ 3

14 α α β = α.8. β = α α α β = α.8. β = α Figure 4: Contour plot of the secondary criterion C S θ from the true process X to Mθ : dx t = α α X t dt + β X t dw t. he true process is defined by α, α, β =.6,.,. and shown as the dot at the center. Each level represents the diffusion processes with α, α in Mθ with equal C S θ from X under four different values of β =.8,.,.,.4. Note that β =. is the true parameter value, and C S θ is minimized at the true parameter α, α =.6,. in this case. 4

15 that solves the minimization problem c C P and C S are twice differentiable at θ. C S θ = min C S θ. θ Θ β Let θ = α, β solve the system of equations he sub-model M β α defined on Θ β β = argmin C P β,. β θ = argmin C S θ.. θ Θ represents the closest members in Mθ to the true diffusion process under the primary criterion C P β. he value α gives the closest diffusion Mθ in M β α that minimizes C S θ given β. he first order conditions for. and. are respectively given by [ ] EAf /β = E σ + σ σ /β =,.3 [ EA µ/α f /α = E σ µ µ + σ σ µ /xα µ /ασ/x ] σ =..4 For example, if the diffusion function of Mθ is constant, σ ; β = β, then we have β = {Eσ X } / implying that β is the mean of the true diffusion function. When an approximate transition is used, the first order condition for the drift parameter α depends on the approximate transition. Aït-Sahalia s closed-form approximate transition results in the same condition in.4 as the exact transition, but for the Euler and Milstein approximate transitions, the first order conditions for α in.4 are given by E [ µ/α σ µ µ [ ] µ/α E µ µ =, for the Euler,.5 σ σ σ 3µ/α σ/x ] σ =, for the Milstein. Compared to the first order condition with the Euler approximation, the Milstein approximation adds an additional term σ 3µ/α σ /x that incorporates the dependence of the σ σ diffusion function σ/x on x, and the first order condition with the exact transition adds the 5

16 term σ µ σ /xα + µ /ασ /x that incorporates the dependence on both diffusion and σ drift functions σ /x and µ /xα on x. When diffusion function is misspecified but drift specification is correct, the solution α with the exact or Milstein approximation is not the true drift parameter α generally. However, the solution of.5 with the Euler approximation would be the same as the true parameter. In this sense, the Euler approximation gives a robust drift estimator under incorrect diffusion specification, which is useful when we are interested in finding correct drift specification rather than the closest diffusion process to the true process. When diffusion specification is correct σ = σ, from Corollary., the first order condition for α is simplified to [ ] EA µ/α f /α = E σ µ µ = for the exact, Euler, Milstein approximations. Considering this result, we can expect that the MLE of drift parameters would be less sensitive to the choice of an approximate density in this case. We study the asymptotic behavior of the ML estimators and the relationship of their asymptotic properties with the primary and secondary criteria in the following section. 3 Information theoretic analysis of maximum likelihood estimation Suppose we have N + observations {X i } N i= of the diffusion X from time zero to = N at a non-random time interval. Denote the log-likelihood function N L log p, X i, X i i= of the true diffusion process, and N Lθ log p, X i, X i ; θ i= 6

17 for Mθ conditional on X. We suppress the dependency of the log-likelihoods on N and or equivalently and for notational simplicity. he MLE ˆθ is defined by ˆθ arg max Lθ. θ Θ When Mθ is misspecified, Lθ and ˆθ are also called the quasi-log-likelihood function and the quasi-mle respectively. he exact transition density pt, x, y; θ of Mθ is known in closed-form for only a few diffusion processes such as the Ornstein-Uhlenbeck or Feller s square root processes. When pt, x, y; θ is not available in closed-form, we must use an approximate transition density such as the Euler or Milstein approximate transitions, the simulated likelihoods of Brandt and Santa-Clara, or the closed-form approximation of Aït-Sahalia based on the Hermite expansion. he log-likelihood ratio L Lθ has an interesting relationship with the primary criterion. If we fix and consider the limit of average log-likelihood ratio plim N L Lθ = plim L Lθ = E log p, X, X = D θ p, X, X ; θ as, then the sequential limit becomes the primary criterion. lim plim N L Lθ = C P β he following theorem shows that the MLE ˆθ is consistent to the minimizer θ = α, β in.-., which is defined as the pseudo-true value. heorem 3.. Suppose that Assumptions -5 are satisfied. Under 3 as and, the MLE ˆθ converges in probability to the solution θ = α, β of the system of equations in.-.. he above theorem shows that the probability limit of ˆβ does not depend on the specification of drift functions, and is robust to the choice of the approximation method for the transition density. But the limit of ˆα depends on the transition density approximation methods as well as diffusion specification. his explains the high sensitivity of drift estimators to the choice of an approximate transition in practice. We also can see that when 7

18 the diffusion function is misspecified but the drift function is correctly specified, ˆα may not converge the true parameter value because the true parameter value may not minimize the secondary criterion. Based on this theorem, we illustrate the geometry of the ML estimation of diffusion models. Geometrically, our model Mθ has the structure of a fiber bundle. A fiber bundle E = B F consists of a base space B, a fiber space F, and a projection of E onto B that defines how to attach fibers on B. It is also denoted simply by F E π B. For us, the total space E of the fiber bundle represents all possible combinations of drift and diffusion specifications defined in Mθ. hus, we can denote the total space as Eθ indexed by θ Θ. he base space B is the space of all diffusion functions σ ; β of Mθ. We denote the base space as Bβ indexed by β. Over B, we define the fiber space F of drift functions µ ; α of Mθ attached to each element β in B. We denote the fiber space as F α indexed by α. Finally, the mapping π : E B of the fiber bundle Eθ onto Bβ defines how each fiber space is attached to B. In our case, π is the projection of θ = α, β in E onto the second factor β in B, in which sense the fiber bundle is a trivial bundle. From these definitions, the fiber attached to β on B can be written as π β, which represents the space of all the diffusion processes in Mθ with diffusion function specification σ ; β. On the right side of Figure 5, we show the fiber bundle E with the fiber space F over the base space B. he primary criterion C P shown in Figure 5 measures the information divergence from the true process µ, σ to σ ; β in the base space Bβ. he σ-orbit represents the space of all diffusion functions with equal divergence in C P from the true diffusion. he divergence C P is defined on the horizontal plane that σ-orbit belongs to, which represents that C P does not depend on F and is determined by diffusion functions only. Under the primary criterion C P, the closest diffusion specification in Bβ is shown as β in Figure 5. hen given β, consider the fiber π β. he secondary criterion C S measures the divergence from µ, σ to an element of π β, and the minimizer of C S is shown as α. herefore, we can think the minimization with respect to C P and C S as two distinct projections, and the pseudo-true value θ = α, β is obtained by sequential application of the two projections. We now study the asymptotic distribution of the MLE. For correctly specified models, the primary and secondary criteria are closely related to the asymptotic variance of the MLE. Suppose we have σ = σ ; β and α = µ ; α for the true parameter θ = α, β. When we have as, Jeong and Park 3 show in their heorem 8

19 E = B F F α fiber µ, σ C S α σ σ-orbit C P β Bβ base Figure 5: he geometry of the ML estimation. On the right, the model Mθ is shown as the fiber bundle E = B F with the fiber space F of the drift functions µ ; α over the base space B of the diffusion functions σ ; β and the projection π : E B that defines how the fibers are attached on B. he primary criterion C P measures the information divergence from the true process µ, σ to σ ; β in Bβ. he σ-orbit represents the space of all diffusion functions with equal divergence in C P from the true diffusion function σ. Under C P, the closest member of the diffusion functions in Bβ is shown as β. he secondary criterion C S measures the divergence from µ, σ to an element of the fiber π β given β, and the minimizer of C S is shown as α. he pseudo-true value θ = α, β is determined by the sequential minimization of C P and C S. 9

20 4.3 that [ ˆβ β d σ/β σ ] /β N, E σ, 3. [ ˆα α d µ/α µ ] /α N, E σ, 3. where σ /β σx ; β / β and µ /α µx ; α / α. In 3.-3., the asymptotic variances of the ML estimators are given by the inverse of the matrices σ/β σ /β µ/α µ /α E and E σ σ 3.3 for ˆβ and ˆα respectively. We show that the matrices in 3.3 are derived from the limits as of the Fisher information matrices calculated under discrete sampling with fixed. For a fixed sampling interval >, the Fisher information matrix G is induced by the KLIC D t θ by the second derivative { } D t θ G t θ =, θ i θ j where { } is the i, j element of G t. herefore, the asymptotic variance of the ML estimator ˆθ under fixed would be given by ˆθ θ d N, G θ, as. Based on this result, we consider the limit of the -sequence of the Fisher { information matrix G θ for β as. Let G p β lim D tθ t be the limit, then we have β i β j } { G P C P } { β } Af / β = = E. 3.4 β i β j β i β j Similarly, for the Fisher information for α, we consider the limit G S θ lim t { D t θ α i α j },

21 which is given by G S θ = { C S } θ α i α j = { E A } f /. 3.5 α i α j his shows that the asymptotic Fisher information matrices are given by the curvatures of C P and C S at θ. In the following theorem, we show that the curvatures given in 3.4 and 3.5 coincide with the inverse asymptotic variance matrices in 3.3 under and. heorem 3.. Suppose that Assumptions -5 are satisfied, and Mθ is correctly specified with the true parameter θ = α, β. Let p, q be the constants defined in Assumption. Under 4pq+ as and, we have where G P 3.5 respectively. ˆβ β d N, G P β, ˆα α d N, G S θ, and G S are the asymptotic Fisher information matrices defined in 3.4 and For misspecified models, the second Bartlett identity does not hold because the Fisher information matrix is different from the negative expected Hessian matrix. herefore, we derive the Fisher information and expected Hessian matrices separately. We first give the asymptotic expansion of the score functions. heorem 3.3. Suppose that Assumptions -5 are satisfied. For misspecified models, under 3 as and, we have the explicit form of the score functions { } Lθ s β θ = β i { } Lθ s α θ = α i + ˆ ˆ ˆ σ σ σ /β dt + O p + O p µ /α σ µ µ + σ σ, 3.6 µ /xα + µ /ασ /x σ σ µ /α σ dw t + O p + O p. 3.7 he asymptotic leading term of s β θ is O p /, and it is generally larger than the second term which is O p. But at β = β, the first term becomes O p / because dt

22 [ ] E σ σ σ /β = from.3. herefore, the first term in 3.6 is larger than the second term only if. Similarly, the first term in the asymptotic expansion of s α θ given 3.7 would be O p at θ = θ. his term becomes the largest term under. We will assume for the rest of the paper to derive the asymptotic distribution of the MLE. As shown in Jeong and Park 3, the asymptotic order of s β θ for correctly specified models is O p /, or equivalently O p N. From heorem 3.3, we find that the score function s β θ for misspecified models becomes larger than O p N. Intuitively, for correctly specified models, we identify the true parameter by finding β that vanishes the first term in 3.6, but for misspecified models, we minimize the primary criterion to find the pseudo-true parameter β that vanishes the first term in 3.6 on the long run average. As shown later in this section, this fact reduces the convergence rate of ˆβ to for misspecified models compared to the rate N of ˆβ for correctly specified models. In the following theorem, we calculate the asymptotic Fisher information matrix from the asymptotic variances of the score functions I β β plim, s βθs β θ and the asymptotic expected Hessian matrices where h β θ = function H β β and the speed density, I α θ plim, s αθs α θ, plim βθ, h, H α θ plim αθ, h, { } { Lθ β i β j and h α θ = Lθ α i α j }. We first define the derivative of the scale ˆ x s x exp m x x µ v σ dv v σ xs x, for any x D fixed. With the scale function s x = x x s w dw, the transformed diffusion s X of the true diffusion X becomes a driftless diffusion, and the stationary density of the diffusion X is given by m x/ D m xdx.

23 heorem 3.4. Suppose that Assumptions -5 are satisfied. Let as and. For a misspecified diffusion model with the pseudo-true parameter θ = α, β, we have the asymptotic Fisher information matrices given by I β β = EV β V β, I α θ = EV α V α, where V β = V β X and V α = V α X are defined from the functions where ˆ x V β x = σ xs x A l /β v m vdv, x V α x = σ xs x ˆ x A l /β v = σ v σ vσ /β v, A l/α v = µ /αv σ v µ v µv + x A l/αv σ v σ v m vdv + σ x σ x µ /αx, µ /xα v + µ /αvσ /x v σ v he asymptotic expected Hessian matrices are the same as the negative second derivatives of the primary and secondary criteria, and their explicit forms are given by { H β β C P β } = β i β j { H α θ C S θ } = = α i α j EA l/αα = E [ µ/αα σ = EA l /ββ = [ ] E σ 4 σ /β σ + σ + σ /β σ, /ββ µ µ µ /αµ /α σ + σ σ µ /xαα + µ ] /αα σ /x σ. From the above theorem, we can get the asymptotic distribution of the MLE. Corollary 3.. Suppose that Assumptions -5 are satisfied. Let as and. For a misspecified diffusion model with the pseudo-true parameter θ = α, β,. 3

24 we have ˆβ β d N, H β β I β β H β β, ˆα α d N, H α θ I α θ H α θ. From the corollary, we can get the explicit form of the asymptotic distributions of the MLE in terms of the model s drift and diffusion functions. It also clarifies that the differences of the asymptotic behaviors of the MLE under correct specification and misspecification. his result also shows that the likelihood ratio test statistics would not have the Chi-squared distribution asymptotically. Consequently, the Lagrange Multiplier, and Wald test statistics would not be distributed as the Chi-squared distribution asymptotically either. 4 Conclusion he information criteria developed in this paper provide the information theoretic foundation for studying asymptotic behaviors of the MLE and likelihood based test statistics for high frequency data. he new framework also allows us to analyze the estimation and inference problems geometrically. We surmise that we can derive the higher order properties of the MLE and other general estimators on this foundation using the differential geometric methods developed in Efron 978 and Amari 98. Our new framework would make the development of these theories easier and the interpretation of them more intuitive. Another important issue is asymptotic analysis of inference with decreasing sampling interval. Since the conventional form of the likelihood based tests would not be valid under model misspecification, a robust test statistic would be desirable. Based on the results of this paper, we can develop a robust version of the likelihood based test statistics and derive its limiting distribution. Also, asymptotic analysis of specification testing for diffusion models in our new framework is another important topic. hese topics will be investigated in the future research. We did not consider in this paper more general processes such as jump diffusions, or Lévy processes. Although these processes are important, the development of the information theory for these processes would warrant a separate paper, and we leave it as a future research topic. 4

25 Appendices A Notational convention We summarize the notional convention used in this paper for reference. For a function ft, x, y, we define the functions f / x f, x, x, f /t x f, x, x, t with the single argument x as the functions evaluated at y = x and t =. Other higher order derivatives such as f /y x, f /yy x, or f /yt x are defined similarly. For a function fx with the single argument x, we define f fx omitting the argument X. For example, we write l/ = l / X, µ = µx, σ = σx. Also, we suppress θ for the functions such as the log-likelihoods and the drift and diffusion functions of Mθ. Using the notations defined above, we define µ /x x µx/ x, σ/x x σ x/ x, and σ/xx x σ x/ x, and denote µ /x µ /x X, σ /x σ /x X, σ /xx σ /xx X. Other functions and their derivatives without arguments are defined in the same manner. For the infinitesimal generator A, we define Af / Af / X, Af /y Af /y X = Af, X, X y, Af /yβ Af /yx. β When these functions are used as integrands, the argument omitted is X t rather than X. 5

26 B Proofs B. Proof of heorem. Proof. We omit θ in D t θ for notational simplicity. he following proof applies to all θ Θ. Define the conditional expectation g t x E x {l t, x, X t lt, x, X t }, where E x is the conditional expectation with respect to X t given X = x. hen we can rewrite D t = E{l t, X, X t lt, X, X t } as D t = Eg t X. Under Assumption, we get lim t D t = E lim t g t X by the dominated convergence theorem. herefore, we need show that lim t g t x = Af / x. From.3 and.5, we can rewrite g t x = E x { l t, x, X t lt, x, X t }/t = E x ft, x, X t /t = Ex ft, x, X t f / x. t he last equality above is from f / x = given by Assumption 3. hen we get the limit E x ft, x, X t f / x lim = Af t t / x by the definition of the infinitesimal generator A. For the explicit expression in.6, we first apply A = t + µ y y + σ y to ft, x, y = l t, x, y lt, x, y to evaluate Aft, x, y at t = and y = x. As the result, y 6

27 we would get From Assumption 3, we have and herefore, Af / = f /t + µ f /y + σ f /yy = l /t l /t + µ l /y l /y + σ l /yy l /yy. f /t x = log σ x σ x, f /yx =, f /yy x = σ x + σ x. Af / x = log σ x σ x + σ x σ x + σ, x which proves the second equality in the theorem. B. Proof of heorem. Proof. For both D t θ and D tθ, we omit θ for notational simplicity. We define the conditional expectation g t x Ex ft, x, X t f / x t given X = x as in the proof of heorem.. D = EAf / by heorem., we can write D lim t D t D t B. Since D t = Eg t X by definition and = lim E g tx Af / X. t t Since lim t E gtx Af / X g tx t = E lim Af / X t t theorem, we only need to show from the dominated convergence g t x Af / x lim = t t A f / x. B. 7

28 o show the above equality, we use Dynkin s formula Dynkin 965 p.33, Øksendal 998 heorem 7.4.., ˆ τ E x ϕx τ = ϕx + E x AϕX s ds, for a twice continuously differentiable function ϕ on R and τ is a stopping time such that E x τ <. By applying Dynkin s formula to the numerator of g t x in B. after setting the stopping time τ = t, we get ˆ t E x ft, x, X t f / x = E x Afs, x, X s ds = ˆ t E x Afs, x, X s ds. B.3 We apply Dynkin s formula twice sequentially to E x Afs, x, X s to have ˆ s E x Afs, x, X s = Af / x + E x A fu, x, X u du herefore, B.3 becomes = Af / x + ˆ s ˆ v {A f / x + E x A 3 fv, x, X v dv} du = Af / x + sa f / x + E x ˆ s ˆ v E x ft, x, X t f / x = taf / x + t A f / x + E x ˆ t From this, we can write A 3 fv, x, X v dv du. ˆ s ˆ v A 3 fv, x, X v dv du ds. g t x Af / x t = A f / x + t Ex ˆ t ˆ s ˆ v A 3 fv, x, X v dv du ds. B.4 Since E x t s v A3 fv, x, X v dv du ds = Ot 3 for any x D from Assumption, we get B. from B.4 by taking t on the both sides. For the second equality in.7, we apply 8

29 A = µ /x yµ y + µ /xx y σ y y { + µ y µ y + σ /x y + σ y µ /x y + σ /xx y } 4 y + σy µ y + σ /x y 3 y 3 + σ4 y 4 4 y t + µ y y t + σ y y t, to ft, x, y = l t, x, y lt, x, y. hen we evaluate A ft, x, y at t = and y = x to get A f / = µ µ + σ /x + σ µ + σ /x where + σ µ /x + σ /xx σ + σ B B + σ4 4 C C + D D + µ A A + σ E E, A = l /yt = µ σ 3σ /x 4σ, B = l /yyy = 3σ /x σ 4, C = l /yyyy = σ /xx σ σ /x σ 6, D = l /tt = µ σ µ /x + µσ /x σ 3σ /x 6σ + σ /xx 4, E = l /yyt = σ µ /x µσ /x σ + 3σ /x 4σ 3σ /xx, 4 and the terms A, B,..., E are defined similarly with µ and σ from the true diffusion. he explicit forms of these terms are given from the corresponding terms A j,..., E j in Section 3 of Choi, Jeong, and Park 3 derived for the exact transition density. 9

30 B.3 Proof of Corollary. Proof. For C S θ, we omit θ for notational simplicity. Since l /yyy = l /yyy and l /yyyy = l/yyyy from σ = σ, we have C S = E From σ = σ, we also have [ l /yt l /yt µ + l /tt l /tt + σ l /yyt l /yyt l/yt l /yt = µ µ σ, l/tt l /tt = µ /x µ /x µ µ l/yyt l /yyt = µ /x µ /x σ σ µ σ /x σ 3 µ σ /x + µσ /x σ 3 ] µσ /x σ σ.., herefore, C S = E [ µ µµ σ µ ] µ σ = [ E µ µ ] σ. For Euler approximation, we use the explicit forms of l /yt, l /tt, and l /yyt which correspond to A j, D j, E j respectively derived for the Euler approximate transition densities in Section 3 of Choi, Jeong, and Park 3. From l/yt l /yt = µ µ σ l/yyt l /yyt =., l/tt l /tt = µ µ σ, we get C S = E [ µ µµ σ µ ] µ σ = [ E µ µ ] σ. For Milstein approximation, we use the explicit forms of l /yt, l /tt, and l /yyt which correspond to A j, D j, E j respectively derived for the Milstein approximate transition densities 3

31 in Section 3 of Choi, Jeong, and Park 3. From l/yt l /yt = µ µ σ, l/tt l /tt = µ µ σ l/yyt l /yyt = 3µ µσ/x σ 4, + 3µ µσ/x σ, we get C S = E [ µ µµ σ µ ] µ σ = [ E µ µ ] σ. B.4 Proof of heorem 3. Proof. he limiting distribution is derived in heorem 4.3 in Jeong and Park 3. We only need to prove that the inverse asymptotic variances of diffusion and drift parameter estimators are the same as the second derivatives of the primary and secondary criteria respectively. For the Fisher information matrix for ˆβ, we first get the first derivative with respect to β i C P β β i = [ ] E σ σ σ σ β i by the dominated convergence theorem. Similarly, we get the second derivative with respect to β j C P β β i β j = E [ σ β j σ 4 σ σ σ + β i σ σ σ σ σ 4 + β j β i σ σ σ ] σ. β i β j Since σ = σ at β = β, we get g P ijβ = C P β β i β j = E [ σ σ σ ] σ 6 = [ σ β i β j E σ ] σ 4, β i β j which is the inverse of the asymptotic variance of the diffusion parameter estimator ˆβ in heorem 4.3 in Jeong and Park 3. For the Fisher information matrix for ˆα, we first collect the part of C S θ that depends 3

32 on α with σ = σ. hat part is given by [ ] E σ µ µ µ. From the dominated convergence theorem, the second derivative is given by g S ijθ = C S θ α i α j = E [ ] µ µ σ, α i α j which is the inverse of the asymptotic variance of the drift parameter estimator ˆα in heorem 4.3 in Jeong and Park 3. B.5 Proof of heorem 3. Proof. Lemma B.7 of Choi, Jeong, and Park 3 gives the asymptotic first order condition that the probability limit θ of the MLE satisfies under 3 as and. he asymptotic first order order condition in their Lemma B.7 can be rewritten as EA l /β = EA l/α =, B.5 which is the first order condition of minimizing the primary and secondary criteria. herefore, θ also minimizes the criteria. B.6 Proof of heorem 3.3 Proof. Define the operator B σ y y. Let A l /β A l / / β and σ /β σ / β and define other derivatives such as A l/β, B l/β, A l/α, AB l /α similarly. When these functions are used in integrands, the argument is X t which is omitted for notational simplicity. From the proof of heorem 3. of Choi, Jeong, and Park 3, under 3, the score functions of β and α are given by s β θ = s α θ = ˆ ˆ A l /β dt + A l/α dt + ˆ ˆ A l/β dt + ˆ B l/β dv t + O p, AB l /α dw t + O p + O p, B.6 B.7 3

33 where V is a standard Brownian motion independent of W. he asymptotic orders of the terms in the above theorem are ˆ ˆ A l /β dt = O p A l/α dt = O p,, ˆ ˆ A l/β dt = O p, AB l /α dw t = O p. ˆ B l/β dv t = O p, he explicit forms of the leading terms of s β θ and s α θ are calculated from A l /β = σ σ σ /β, A l/α = µ /α σ µ µ + AB l /α = σ µ /α σ. σ σ µ /xα + µ /ασ /x σ We get the main result of the theorem by plugging in these terms in B.6-B.7., B.7 Proof of heorem 3.4 From B.6-B.7, under as and, we have ˆ s β θ p A l /β dt and ˆ s α θ A l/α dt + ˆ AB l /α dw t p. herefore, we only need to get the asymptotic distributions of ˆ A l /β dt for s β θ and ˆ A l/α dt + ˆ AB l /α dw t 33

34 for s α θ. We can derive the limiting distributions of A l /β dt and A l/α dt from Lemma B. of Choi, Jeong, and Park 3. heir lemma proves that for fx t such that EfX t =, we have where ˆ fx t dt d EVX t W, ˆ x Vx = σ xs x fvm vdv x for x D. Using the fact EA l /β = EA l/α = from B.5, we apply their lemma to A l /β dt and A l/α dt, which gives us s β θ d N, EV β V β, s α θ d N, EV α V α, B.8 B.9 where V β = V β X and V α = V α X are defined from the functions ˆ x V β x = σ xs x A l /β v m vdv, x ˆ x V β x = σ xs x x A l/α v m vdv + σ x µ /αx σ x. For the expected Hessian matrices, we use the relationships h β θ = ˆ A l /ββ dt + O p + O p, h α θ = ˆ A l/αα dt + O p + O p, 34

35 which are shown in the proof of Lemma B.6 of Choi, Jeong, and Park 3. From as and, and h βθ h αθ ˆ ˆ ˆ ˆ A l /ββ dt A l/αα dt A l /ββ dt p EA l /ββ, p, p, A l/αα dt p EA l/αα, as, the asymptotic expected Hessian matrices of β and α are the same as the negative second derivatives of the primary and secondary criteria respectively. he explicit forms are calculated by taking the second derivatives of the criteria. B.8 Proof of Corollary 3. Proof. Under as and, we get ˆβ β h βθ s βθ p, ˆα α h αθ s α θ p, from the proof of heorem 3. of Choi, Jeong, and Park 3. We only need to show the asymptotic distributions of h βθ s βθ and h αθ s α θ. heir asymptotic distributions are from the results of heorem 3.4 and B.8-B.9. 35

36 References Aït-Sahalia, Y. : Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach, Econometrica, 7, 3 6. Amari, S. I. 98: Differential geometry of curved exponential families - curvatures and information loss, Annals of Statistics,, Bandi, F., and P. Phillips 3: Fully nonparametric estimation of scalar diffusion models, Econometrica, 7, Brandt, M., and P. Santa-Clara : Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets, Journal of Financial Economics, 63, 6. Choi, H.-S., M. Jeong, and J. Y. Park 3: An asymptotic analysis of likelihoodbased diffusion model selection using high frequency data, Journal of Econometrics, 783, Cox, J., J. Ingersoll, Jr, and S. Ross 985: A theory of the term structure of interest rates, Econometrica, 53, Csiszár, I. 967a: Information type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica,, b: On topological properties of f-divergence, Studia Scientiarum Mathematicarum Hungarica,, : I-divergence geometry of probability distributions and minimization problems, Annals of Probability, 3, Dynkin, E. B. 965: Markov processes. Springer-Verlag. Efron, B. 978: he geometry of exponential families, Annals of Statistics, 6, Jeong, M., and J. Y. Park : Asymptotic expansions and second order efficiency of maximum likelihood estimators for diffusion models, Working Paper, Indiana University. 36

37 3: Asymptotic theory of maximum likelihood estimator for diffusion model, Working Paper, Indiana University. Karatzas, I., and S. Shreve 99: Springer, nd edn. Brownian Motion and Stochastic Calculus. Kullback, S., and R. Leibler 95: On information and sufficiency, Annals of Mathematical Statistics,, Øksendal, B. K. 998: Stochastic Differential Equations: An Introduction with Applications. Springer. White, H. 98: Maximum likelihood estimation of misspecified models, Econometrica, 5, 6. 37

Graduate Econometrics I: Maximum Likelihood II

Graduate Econometrics I: Maximum Likelihood II Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood