COPYRIGHT NOTICE: Kenneth J. Singleton: Empirical Dynamic Asset Pricing is published by Princeton University Press and copyrighted, 2006, by Princeton University Press. All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher, except for reading and browsing via the World Wide Web. Users are not permitted to mount this file on any network servers. Follow links for Class Use and other Permissions. For more information send to:

Large-Sample Properties of Extremum Estimators

Extremum estimators are estimators obtained by either maximizing or minimizing a criterion function over the admissible parameter space. In this chapter we introduce more formally the concept of an extremum estimator and discuss the large-sample properties of these estimators. After briefly setting up notation and describing the probability environment within which we discuss estimation, we describe regularity conditions under which an estimator converges almost surely to its population counterpart. We then turn to the large-sample distributions of extremum estimators. Throughout this discussion we maintain the assumption that $\theta_T$ is a consistent estimator of $\theta_0$ and focus on properties of the distribution of $\theta_T$ as $T$ gets large. Whereas discussions of consistency are often criterion-function specific, the large-sample analyses of most of the extremum estimators we use subsequently can be treated concurrently. We formally define a family of estimators that encompasses the first-order conditions of the ML, standard GMM, and LLP estimators as special cases. Then, after presenting a quite general central limit theorem, we establish the asymptotic normality of these estimators. Finally, we examine the relative asymptotic efficiencies of the GMM, LLP, and ML estimators and interpret their asymptotic efficiencies in terms of the restrictions on the joint distribution of the data used in estimation.

Basic Probability Model

Notationally, we let $\Omega$ denote the sample space, $\mathcal{F}$ the set of events about which we want to make probability statements (a $\sigma$-algebra of events), and

Footnote: The perspective on the large-sample properties of extremum estimators taken in this chapter has been shaped by my discussions and collaborations with Lars Hansen over the past years. In particular, the approach to establishing consistency and asymptotic normality follows that of Hansen (1982b, 00).

Pr the probability measure. Thus, we denote the probability space by $(\Omega, \mathcal{F}, \Pr)$. Similarly, we let $\mathcal{B}^K$ denote the Borel algebra of events in $\mathbb{R}^K$, which is the smallest $\sigma$-algebra containing all open and closed rectangles in $\mathbb{R}^K$. A $K$-dimensional vector random variable $X$ is a function from the sample space $\Omega$ to $\mathbb{R}^K$ with the property that for each $B \in \mathcal{B}^K$, $\{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}$. Each random variable $X$ induces a probability space $(\mathbb{R}^K, \mathcal{B}^K, \mu_X)$ by the correspondence $\mu_X(B) = \Pr\{\omega : X(\omega) \in B\}$, for all $B \in \mathcal{B}^K$. Two notions of convergence of sequences of random variables that we use extensively are as follows.

Definition. The sequence of random variables $\{X_T\}$ is said to converge almost surely (a.s.) to the random variable $X$ if and only if there exists a null set $N$ such that

$$\forall \omega \in \Omega \setminus N : \lim_{T \to \infty} X_T(\omega) = X(\omega).$$

Definition. The sequence of random variables $\{X_T\}$ is said to converge in probability to $X$ if and only if, for every $\epsilon > 0$, we have

$$\lim_{T \to \infty} \Pr\{|X_T - X| > \epsilon\} = 0.$$

When the $T$th element of the sequence is the estimator $\theta_T$ for sample size $T$ and the limit is the population parameter vector of interest $\theta_0$, then we call the estimator $\theta_T$ consistent for $\theta_0$.

Definition. A sequence of estimators $\{\theta_T\}$ is said to be strongly (weakly) consistent for a constant parameter vector $\theta_0$ if and only if $\theta_T$ converges almost surely (in probability) to $\theta_0$ as $T \to \infty$.

There are many different sets of sufficient conditions on the structure of asset pricing models and the probability models generating uncertainty for extremum estimators to be consistent. In this chapter we follow closely the approach in Hansen (1982b), which assumes that the underlying random vector of interest, $z_t$, is a stationary and ergodic time series. Subsequent chapters discuss how stochastic trends have been accommodated in DAPMs.

Footnote: The topics discussed in this section are covered in more depth in most intermediate statistics books. See Chung (1974) and Billingsley (1979).

Footnote: A null set $N$ for $\Pr$ is a set with the property that $\Pr\{N\} = 0$.

Footnote: We let $\mathbb{R}^\infty$ denote the space consisting of all infinite sequences $x = (x_1, x_2, \ldots)$ of real numbers (lower-case $x$ indicates $x \in \mathbb{R}^\infty$). A finite-dimensional rectangle is of the form $\{x \in \mathbb{R}^\infty : x_1 \in I_1, \ldots, x_n \in I_n\}$, where $I_1, \ldots, I_n$ are finite or infinite intervals in $\mathbb{R}$. If $\mathcal{B}^\infty$ denotes the smallest $\sigma$-algebra of subsets of $\mathbb{R}^\infty$ containing all finite-dimensional rectangles, then $X = (X_1, X_2, \ldots)$ is a measurable mapping from $\Omega$ to $(\mathbb{R}^\infty, \mathcal{B}^\infty)$ (here the $X_t$'s are random variables).

Definition. A process $\{X_t\}$ is called stationary if, for every $k$, the process $\{X_t\}_{t=k}^{\infty}$ has the same distribution as $\{X_t\}_{t=1}^{\infty}$; that is,

$$\Pr\{(X_1, X_2, \ldots) \in B\} = \Pr\{(X_{k+1}, X_{k+2}, \ldots) \in B\}, \quad B \in \mathcal{B}^\infty.$$

In practical terms, a stationary process is one such that the functional forms of the joint distributions of collections $(X_k, X_{k+1}, \ldots, X_{k+l})$ do not change over time. An important property of a stationary process is that the process $\{Y_k\}$ defined by $Y_k = f(X_k, X_{k+1}, \ldots)$ is also stationary for any $f$ that is measurable relative to $\mathcal{B}^\infty$.

The assumption that $\{X_t\}$ is stationary is not sufficient to ensure that sample averages of the process converge to $EX_1$, a requirement that underlies our large-sample analysis of estimators. (Here we use $EX_1$, because all $X_t$ have the same mean.) The reason is that the sample we observe is the realization $(X_1(\omega_0), X_2(\omega_0), \ldots)$ associated with a single $\omega_0$ in the sample space. If we are to learn about the distribution of the time series $\{X_t\}$ from this realization, then, as we move along the series $\{X_t(\omega_0)\}$, it must be as if we are observing realizations of $X_t(\omega)$ for fixed $t$ as $\omega$ ranges over $\Omega$. To make this idea more precise, suppose there is an event $A \in \mathcal{F}$ with the property that one can find a $B \in \mathcal{B}^\infty$ such that for every $t \geq 1$,

$$A = \{\omega : (X_t(\omega), X_{t+1}(\omega), \ldots) \in B\}.$$

Such an event $A$ is called invariant because, for $\omega_0 \in A$, the information provided by $\{X_t(\omega_0), X_{t+1}(\omega_0), \ldots\}$ is essentially unchanged as $t$ increases. On the other hand, if such a $B$ does not exist, then

$$A = \{\omega : (X_1(\omega), X_2(\omega), \ldots) \in B\} \neq \{\omega : (X_t(\omega), X_{t+1}(\omega), \ldots) \in B\}$$

for some $t > 1$, and $\{X_t(\omega), X_{t+1}(\omega), \ldots\}$ conveys information about a different event in $\mathcal{F}$ (a different part of $\Omega$).

Definition. A stationary process is ergodic if every invariant event has probability zero or one.

If the process is ergodic, then a single realization conveys sufficient information about $\Omega$ for a strong law of large numbers (SLLN) to hold. For further discussion of stationary and ergodic stochastic processes see, e.g., Breiman (1968).

Theorem (Ergodic Theorem). If $X_1, X_2, \ldots$ is a stationary and ergodic process and $E|X_1| < \infty$, then

$$\frac{1}{T} \sum_{t=1}^{T} X_t \to EX_1 \quad \text{a.s.}$$

One can relax the assumption of stationarity, thereby allowing the marginal distributions of $z_t$ to change over time, and still obtain a SLLN. However, this is typically accomplished by replacing the relatively weak requirements implicit in the assumption of stationarity on the dependence between $z_t$ and $z_{t-s}$, for $s \neq 0$, with stronger assumptions (see, e.g., Gallant and White, 1988). Two considerations motivate our focus on the case of stationary and ergodic time series. First, in dynamic asset pricing models, the pricing relations are typically the solutions to a dynamic optimization problem by investors or a replication argument based on no-arbitrage opportunities. As we will see more formally in later chapters, both of these arguments involve optimal forecasts of future variables, and these optimal forecasting problems are typically solved under the assumption of stationary time series. Indeed, these forecasting problems will generally not lend themselves to tractable solutions in the absence of stationarity. Second, the assumption that a time series is stationary does not preclude variation over time in the conditional distributions of $z_t$ conditioned on its own history. In particular, the time variation in conditional means and variances that is often the focus of financial econometric modeling is easily accommodated within the framework of stationary and ergodic time series. Of course, neither of these considerations rules out the possibility that the real world is one in which time series are in fact nonstationary. At a conceptual level, the economic argument for nonstationarity often comes down to the need to include additional conditioning variables.
For example, the case of a change in operating procedures by a monetary authority, as we experienced in the United States in the early 1980s, could be handled by conditioning on variables that determine a monetary authority's operating procedures. However, many of the changes in a pricing environment that would lead us to be concerned about stationarity happen infrequently. Therefore, we do not have repeated observations on the changes that concern us the most. The pragmatic solution to this problem has often been to judiciously choose the sample period so that the state vector $z_t$ in an asset pricing model can reasonably be assumed to be stationary. With these considerations in mind, we proceed under the formal assumption of stationary time series. An important exception is the case of nonstationarity induced by stochastic trends.
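The ergodic theorem can be illustrated numerically. The sketch below (a made-up Gaussian AR(1) example, not from the text) simulates a single realization of a stationary, ergodic process whose unconditional mean is known in closed form, and checks that the time average over longer and longer stretches of that one path approaches the population mean:

```python
import numpy as np

def ar1_path(T, c=1.0, phi=0.9, sigma=1.0, rng=None):
    """Simulate a stationary AR(1): x_t = c + phi*x_{t-1} + sigma*eps_t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mu = c / (1.0 - phi)                  # unconditional mean
    sd = sigma / np.sqrt(1.0 - phi**2)    # unconditional standard deviation
    x = np.empty(T)
    x[0] = rng.normal(mu, sd)             # draw x_0 from the stationary law
    for t in range(1, T):
        x[t] = c + phi * x[t - 1] + sigma * rng.normal()
    return x

# Sample means over increasing stretches of one realization:
rng = np.random.default_rng(42)
x = ar1_path(200_000, rng=rng)
for T in (100, 10_000, 200_000):
    print(T, x[:T].mean())   # should approach mu = 1/(1 - 0.9) = 10
```

Note that the process is serially dependent, yet a single realization suffices: this is precisely what ergodicity buys.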

Consistency: General Considerations

Let $Q_T(z_T, \theta)$ denote the function to be minimized by choice of the $K$-vector $\theta$ of unknown parameters within an admissible parameter space $\Theta \subseteq \mathbb{R}^K$, and let $Q_0(\theta)$ be its population counterpart. Throughout this chapter, it will be assumed that $Q_0(\theta)$ is uniquely minimized at $\theta_0$, the model parameters that generate the data. We begin by presenting a set of quite general sufficient conditions for $\theta_T$ to be a consistent estimator of $\theta_0$. The discussion of these conditions is intended to illustrate the essential features of a probability model that lead to strong consistency ($\theta_T$ converges almost surely to $\theta_0$). Without further assumptions, however, the general conditions proposed are not easily verified in practice. Therefore, we proceed to examine a more primitive set of conditions that imply the conditions of our initial consistency theorem. One critical assumption underlying consistency is the uniform convergence of sample criterion functions to their population counterparts as $T$ gets large. Following are definitions of two notions of uniform convergence.

Definition. Let $g_T(\theta)$ be a nonnegative sequence of random variables depending on the parameter $\theta$. Consider the two modes of uniform convergence of $g_T(\theta)$ to 0:

$$\Pr\Big\{\lim_{T \to \infty} \sup_{\theta \in \Theta} g_T(\theta) = 0\Big\} = 1,$$

$$\lim_{T \to \infty} \Pr\Big\{\sup_{\theta \in \Theta} g_T(\theta) < \epsilon\Big\} = 1 \quad \text{for any } \epsilon > 0.$$

If the first condition holds, then $g_T(\theta)$ is said to converge to 0 almost surely uniformly in $\theta$. If the second holds, then $g_T(\theta)$ is said to converge to 0 in probability uniformly in $\theta$.

The following theorem presents a useful set of sufficient conditions for $\theta_T$ to converge almost surely to $\theta_0$.

Theorem (Consistency). Suppose (i) $\Theta$ is compact. (ii) The nonnegative sample criterion function $Q_T(z_T, \theta)$ is continuous in $\theta$ and is a measurable function of $z_T$ for all $\theta$. (iii) $Q_T(z_T, \theta)$ converges to a nonstochastic function $Q_0(\theta)$ almost surely uniformly in $\theta$ as $T \to \infty$; and $Q_0(\theta)$ attains a unique minimum at $\theta_0$. Define $\theta_T$ as a value of $\theta$ that satisfies

$$Q_T(z_T, \theta_T) = \min_{\theta \in \Theta} Q_T(z_T, \theta).$$
Then $\theta_T$ converges almost surely to $\theta_0$.

Footnote: In situations where $\theta_T$ is not unique, if we let $\Gamma_T$ denote the set of minimizers, we can show that $\delta_T(\omega) = \sup\{\|\theta - \theta_0\| : \theta \in \Gamma_T\}$ converges almost surely to 0 as $T \to \infty$.
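Before turning to the proof, the content of the theorem can be illustrated with a minimal simulation (a made-up Gaussian location example, not from the text). Here $Q_T(\theta) = \frac{1}{T}\sum_t (y_t - \theta)^2$ converges uniformly on the compact set $\Theta = [0, 3]$ to $Q_0(\theta) = 1 + (\theta - \theta_0)^2$, which is uniquely minimized at $\theta_0$, so the sample minimizer converges to $\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 1.25                       # population minimizer (chosen for illustration)
grid = np.linspace(0.0, 3.0, 3001)  # compact parameter set Theta = [0, 3]

for T in (50, 5_000, 500_000):
    y = theta0 + rng.normal(size=T)           # stationary, ergodic (here i.i.d.) data
    # Q_T(theta) = (1/T) sum (y_t - theta)^2, expanded in sample moments:
    m1, m2 = y.mean(), (y ** 2).mean()
    QT = m2 - 2.0 * grid * m1 + grid ** 2
    theta_T = grid[int(np.argmin(QT))]        # extremum estimator over Theta
    print(T, theta_T)                          # approaches theta0 = 1.25
```

The grid search stands in for the abstract minimization over $\Theta$; expanding $Q_T$ in the sample moments $m_1, m_2$ just makes the evaluation cheap.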

Proof. Define the function

$$\rho(\epsilon) = \inf\{Q_0(\theta) - Q_0(\theta_0), \text{ for } \|\theta - \theta_0\| \geq \epsilon\}.$$

As long as $\epsilon > 0$, Assumptions (i)–(iii) guarantee that $\rho(\epsilon) > 0$. (Continuity of $Q_0$ follows from our assumptions.) Assumption (iii) implies that there exists a set $\bar{\Omega}$ with $\Pr(\bar{\Omega}) = 1$ and a positive, finite function $T^*(\omega, \epsilon)$, such that

$$\rho_T(\omega) \equiv \sup_{\theta \in \Theta} |Q_T(\omega, \theta) - Q_0(\theta)| < \rho(\epsilon)/2,$$

for all $\omega \in \bar{\Omega}$, $\epsilon > 0$, and $T \geq T^*(\omega, \epsilon)$. This inequality guarantees that for all $\omega \in \bar{\Omega}$, $\epsilon > 0$, and $T \geq T^*(\omega, \epsilon)$,

$$\begin{aligned} Q_0(\theta_T) - Q_0(\theta_0) &= [Q_0(\theta_T) - Q_T(\omega, \theta_T)] + [Q_T(\omega, \theta_T) - Q_T(\omega, \theta_0)] + [Q_T(\omega, \theta_0) - Q_0(\theta_0)] \\ &\leq |Q_0(\theta_T) - Q_T(\omega, \theta_T)| + |Q_T(\omega, \theta_0) - Q_0(\theta_0)| \\ &\leq 2\rho_T(\omega) < \rho(\epsilon), \end{aligned}$$

(the middle term in the first line is nonpositive because $\theta_T$ minimizes $Q_T$), which implies that $\|\theta_T - \theta_0\| < \epsilon$ for all $\omega \in \bar{\Omega}$, $\epsilon > 0$, and $T \geq T^*(\omega, \epsilon)$.

The assumptions of this consistency theorem are quite general. In particular, the $z_t$'s need not be identically distributed or independent. However, this generality is of little practical value unless the assumptions of the theorem can be verified in actual applications. In practice, this amounts to verifying Assumption (iii). The regularity conditions imposed in the econometrics literature to assure that (iii) holds typically depend on the specification of $Q_T$ and $Q_0$ and, thus, are often criterion-function specific. We present a set of sufficient conditions to establish the almost sure uniform convergence of the sample mean

$$G_T(z_T, \theta) = \frac{1}{T} \sum_{t=1}^{T} g(z_t, \theta)$$

to its population counterpart $G_0(\theta) = E[g(z_t, \theta)]$. This result then is used to establish the uniform convergence of $Q_T$ to $Q_0$ for the cases of ML and GMM estimators for stationary processes. To motivate the regularity conditions we impose on the time series $\{z_t\}$ and the function $g$, it is instructive to examine how far the assumption that

$\{z_t\}$ is stationary and ergodic takes us toward fulfilling the assumptions of the consistency theorem. Therefore, we begin by assuming:

Assumption. $\{z_t : t \geq 1\}$ is a stationary and ergodic stochastic process.

As discussed earlier, the sample and population criterion functions for LLP are

$$Q_0(\delta) = E\big[(y_t - x_t'\delta)^2\big], \qquad Q_T(\delta) = \frac{1}{T} \sum_{t=1}^{T} (y_t - x_t'\delta)^2, \qquad \delta \in \mathbb{R}^K.$$

For the LLP problem, $Q_0(\delta)$ is assured of having a unique minimizer $\delta_0$ if the second-moment matrix $E[x_t x_t']$ has full rank. Thus, with this additional assumption, the second part of Condition (iii) of the consistency theorem is satisfied. Furthermore, under the assumption of ergodicity,

$$\frac{1}{T} \sum_{t=1}^{T} x_t x_t' \to E[x_t x_t'] \quad \text{and} \quad \frac{1}{T} \sum_{t=1}^{T} x_t y_t \to E[x_t y_t] \quad \text{a.s.}$$

It follows immediately that $\delta_T \to \delta_0$ a.s. Though unnecessary in this case, we can also establish the strong consistency of $\delta_T$ for $\delta_0$ from the observation that $Q_T(\delta) \to Q_0(\delta)$ a.s., for all $\delta \in \mathbb{R}^K$. From the figure below it is seen that the criterion functions are quadratic and eventually overlap (for large $T$), so the minimizers of $Q_T(\delta)$ and $Q_0(\delta)$ must eventually coincide. We conclude that the strong consistency of estimators in LLP problems is essentially implied by the assumption that $\{z_t\}$ is stationary and ergodic (and the rank condition on $E[x_t x_t']$).

More generally, the assumptions of ergodicity of $\{z_t\}$ and the continuity of $Q_T(z_T, \theta)$ in its second argument do not imply the strong consistency of the minimizer $\theta_T$ of the criterion function $Q_T(\theta)$. The reason is that ergodicity guarantees only pointwise convergence, and the behavior in the tails of some nonlinear criterion functions may be problematic.

[Figure: Sample and population criterion functions for a least-squares projection.]

To illustrate this

point, the first figure below depicts a relatively well-behaved function $Q_T$ that implies the convergence of $\theta_T$ to $\theta_0$.

[Figure: Well-behaved $Q_0$, $Q_T$.]

In contrast, although the function $Q_T(\theta)$ in the next figure can be constructed to converge pointwise to $Q_0(\theta)$, $\theta_0$ and $\theta_T$ may grow increasingly far apart as $T$ increases if the dip moves farther out to the right as $T$ grows.

[Figure: Poorly behaved $Q_T$.]

This potential problem is ruled out by the assumptions that $\{Q_T : T \geq 1\}$ converges almost surely uniformly in $\theta$ to a function $Q_0$ and that $\theta_0$ is the unique minimizer of $Q_0$. Even uniform convergence of $Q_T$ to $Q_0$ combined with stationarity and ergodicity is not sufficient to ensure that $\theta_T$ converges to $\theta_0$, however. To see why, consider the situation in the last figure below. If $Q_0(\theta)$ asymptotes to the minimum of $Q_0(\theta)$ over $\Theta \subseteq \mathbb{R}$ (but does not achieve this minimum) in the left tail, then $Q_T(\theta_T)$ can get arbitrarily close to $Q_0(\theta_0)$, even though $\theta_T$ and $\theta_0$ are growing infinitely far apart. To rule this case out, we need to impose a restriction on the behavior of $Q_0$ in the tails. This can be accomplished either by imposing restrictions on the admissible parameter space or by restricting $Q_0$ directly. For example, if it is required that

$$\inf\{Q_0(\theta) - Q_0(\theta_0) : \theta \in \Theta, \|\theta - \theta_0\| > \rho\} > 0,$$

then $Q_0(\theta)$ cannot asymptote to $Q_0(\theta_0)$, for $\theta$ far away from $\theta_0$, and convergence of $\theta_T$ to $\theta_0$ is ensured. This condition is satisfied by the least-squares

criterion function for linear models. For nonlinear models, potentially undesirable behavior in the tails is typically ruled out by assuming that $\Theta$ is compact (the tails are chopped off).

[Figure: $Q_T$ converging to an asymptoting $Q_0$.]

With these observations as background, we next provide a primitive set of assumptions that assure the strong consistency of $\theta_T$ for $\theta_0$. As noted earlier, most of the criterion functions we will examine can be expressed as sample means of functions $g(z_t, \theta)$, or are simple functions of such sample means (e.g., a quadratic form). Accordingly, we first present sufficient conditions (beyond stationarity and ergodicity) for the convergence of

$$G_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} g(z_t, \theta)$$

to $E[g(z_t, \theta)]$ almost surely, uniformly in $\theta$. Our first assumption rules out bad behavior in the tails and the second states that the function $g(z_t, \theta)$ has a finite mean for all $\theta$:

Assumption. $\Theta$ is a compact metric space.

Assumption. The function $g(\cdot, \theta)$ is Borel measurable for each $\theta$ in $\Theta$; $Eg(z_t, \theta)$ exists and is finite for all $\theta$ in $\Theta$.

We will also need a stronger notion of continuity of $g(z_t, \theta)$. Let

$$\epsilon_t(\theta, \delta) = \sup\{|g(z_t, \theta) - g(z_t, \alpha)| : \alpha \in \Theta,\ \|\alpha - \theta\| < \delta\}.$$

Definition. The random function $g(z_t, \theta)$ is first-moment continuous at $\theta$ if $\lim_{\delta \to 0} E[\epsilon_t(\theta, \delta)] = 0$.

Footnote: Compactness guarantees that $\Theta$ has a countable dense subset. Hence, under the preceding assumptions, the function $\epsilon_t(\theta, \delta)$ is Borel measurable (it can be represented as the almost sure supremum of a countable collection of Borel measurable functions).

First-moment continuity of $g(z_t, \theta)$ is a joint property of the function $g$ and the random vector $z_t$. Under the preceding assumptions, if $g(z_t, \theta)$ is first-moment continuous at $\theta$ for some $t$, then $g(z_t, \theta)$ is first-moment continuous for every $t$.

Assumption. The random function $g(z_t, \theta)$ is first-moment continuous at all $\theta \in \Theta$.

The measure of distance between $G_T$ and $E[g(z_t, \cdot)]$ we are concerned with is

$$\rho_T = \sup_{\theta \in \Theta} \|G_T(\theta) - Eg(z_t, \theta)\|.$$

Using the compactness of $\Theta$ and the continuity of $g$, it can be shown that $\{\rho_T : T \geq 1\}$ converges almost surely to zero. The proof proceeds as follows: Let $\{\theta_i : i \geq 1\}$ be a countable dense subset of $\Theta$. The distance between $G_T(\theta)$ and $Eg(z_t, \theta)$ satisfies the following inequality:

$$\|G_T(\theta) - Eg(z_t, \theta)\| \leq \|G_T(\theta) - G_T(\theta_i)\| + \|G_T(\theta_i) - Eg(z_t, \theta_i)\| + \|Eg(z_t, \theta_i) - Eg(z_t, \theta)\|.$$

For all $\theta$, the first term on the right-hand side can be made arbitrarily small by choosing $\theta_i$ such that $\|\theta_i - \theta\|$ is small (because the $\theta_i$ are a dense subset of $\Theta$) and then using ergodicity and the uniform continuity of $g(z_t, \theta)$ (uniform continuity follows from the compactness and first-moment continuity assumptions). The second term can be made arbitrarily small for large enough $T$ by ergodicity. Finally, the last term can be made small by exploiting the uniform continuity of $g$. The following theorem summarizes this result, a formal proof of which is provided in Hansen (00).

Theorem (Hansen, 1982b). Suppose the preceding assumptions are satisfied. Then $\{\rho_T : T \geq 1\}$ converges almost surely to zero.

Consistency of Extremum Estimators

Equipped with this uniform convergence theorem, the strong consistency of the extremum estimators discussed earlier can be established.

Maximum Likelihood Estimators

Suppose that the functional form of the density function of $y_t$ conditioned on $y_{t-1}^J \equiv (y_{t-1}, \ldots, y_{t-J})$, $f(y_t \mid y_{t-1}^J; \beta)$, is known for all $t$. Let $Q_0(\beta) = E[\log f(y_t \mid y_{t-1}^J; \beta)]$

denote the population criterion function and suppose that $\beta_0$, the parameter vector of the data-generating process for $y_t$, is a maximizer of $Q_0(\beta)$. To show the uniqueness of $\beta_0$ as a maximizer of $Q_0(\beta)$, required by Condition (iii) of the consistency theorem, we use Jensen's inequality to obtain

$$E\left[\log \frac{f\big(y_t \mid y_{t-1}^J; \beta\big)}{f\big(y_t \mid y_{t-1}^J; \beta_0\big)}\right] < \log E\left[\frac{f\big(y_t \mid y_{t-1}^J; \beta\big)}{f\big(y_t \mid y_{t-1}^J; \beta_0\big)}\right], \quad \beta \neq \beta_0.$$

The right-hand side of this inequality is zero (by the law of iterated expectations) because

$$\int \frac{f\big(y_t \mid y_{t-1}^J; \beta\big)}{f\big(y_t \mid y_{t-1}^J; \beta_0\big)}\, f\big(y_t \mid y_{t-1}^J; \beta_0\big)\, dy_t = 1.$$

Therefore,

$$E\big[\log f\big(y_t \mid y_{t-1}^J; \beta\big)\big] < E\big[\log f\big(y_t \mid y_{t-1}^J; \beta_0\big)\big], \quad \text{if } \beta \neq \beta_0,$$

and $\beta_0$ is the unique maximizer of $Q_0(\beta)$.

The approximate sample log-likelihood function is

$$l_T(\beta) = \frac{1}{T} \sum_{t=J+1}^{T} \log f\big(y_t \mid y_{t-1}^J; \beta\big).$$

Thus, setting $z_t \equiv (y_t, y_{t-1}^J)$ and

$$g(z_t, \beta) = \log f\big(y_t \mid y_{t-1}^J; \beta\big),$$

$G_T$ in the preceding section becomes the log-likelihood function. If the preceding assumptions are satisfied, then the uniform convergence theorem implies the almost sure, uniform convergence of the sample log-likelihood function to $Q_0(\beta)$.

Generalized Method of Moments Estimators

The GMM criterion function is based on the model-implied $M$-vector of moment conditions $E[h(z_t, \theta_0)] = 0$. With use of the sample counterpart to this expectation, the sample and population criterion functions are constructed as quadratic forms with distance matrices $W_T$ and $W_0$, respectively:

Footnote: See DeGroot (1970) for a discussion of the use of first-moment continuity of $\log f(y_t \mid y_{t-1}^J; \beta)$ in proving the strong consistency of ML estimators. DeGroot refers to first-moment continuity as supercontinuity.

$$Q_T(\theta) = H_T(z_T, \theta)'\, W_T\, H_T(z_T, \theta), \qquad Q_0(\theta) = H_0(\theta)'\, W_0\, H_0(\theta),$$

where $H_T(z_T, \theta) = \frac{1}{T}\sum_{t=1}^{T} h(z_t, \theta)$ and $H_0(\theta) = E[h(z_t, \theta)]$. Since $H_0(\theta)$ is zero at $\theta_0$, the function $Q_0(\cdot)$ achieves its minimum (zero) at $\theta_0$. To apply the consistency theorem to these criterion functions we impose an additional assumption.

Assumption. $\{W_T : T \geq 1\}$ is a sequence of $M \times M$ positive semidefinite matrices of random variables with elements that converge almost surely to the corresponding elements of the $M \times M$ constant, positive semidefinite matrix $W_0$ with $\mathrm{rank}(W_0) \geq K$.

In addition, we let

$$\rho_T^* = \sup\{|Q_T(\theta) - Q_0(\theta)| : \theta \in \Theta\}$$

denote the maximum error in approximating $Q_0$ by its sample counterpart $Q_T$. The following lemma shows that the preceding assumptions are sufficient for this approximation error to converge almost surely to zero.

Lemma. Suppose the preceding assumptions are satisfied. Then $\{\rho_T^* : T \geq 1\}$ converges almost surely to zero.

Proof. Repeated application of the Triangle and Cauchy-Schwarz inequalities gives

$$|Q_T(\theta) - Q_0(\theta)| \leq \|H_T(\theta) - H_0(\theta)\| \|W_T\| \|H_T(\theta)\| + \|H_0(\theta)\| \|W_T - W_0\| \|H_T(\theta)\| + \|H_0(\theta)\| \|W_0\| \|H_T(\theta) - H_0(\theta)\|,$$

where $\|W\| = \mathrm{tr}(W'W)^{1/2}$. Therefore, letting $\phi_0 = \max\{\|H_0(\theta)\| : \theta \in \Theta\}$ and $\rho_T \equiv \sup\{\|H_T(\theta) - H_0(\theta)\| : \theta \in \Theta\}$,

$$0 \leq \rho_T^* \leq \rho_T \|W_T\| [\phi_0 + \rho_T] + \phi_0 \|W_T - W_0\| [\phi_0 + \rho_T] + \phi_0 \|W_0\| \rho_T.$$

Since $h(z_t, \theta)$ is first-moment continuous, $H_0(\theta)$ is a continuous function of $\theta$. Therefore, $\phi_0$ is finite because a continuous function on a compact set achieves its maximum. The uniform convergence theorem implies that $\rho_T$ converges almost surely to zero. Since each of the three terms on the right-hand side of the last inequality converges almost surely to zero, it follows that $\{\rho_T^* : T \geq 1\}$ converges almost surely to zero.
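As a concrete, hypothetical instance of this construction (the moment function and parameter values below are invented for illustration), the sketch estimates the mean $m$ and variance $v$ of an i.i.d. sample from the exactly identified moment conditions $h(z_t, \theta) = \big(z_t - m,\ (z_t - m)^2 - v\big)'$, minimizing $Q_T(\theta) = H_T(\theta)' W_T H_T(\theta)$ with $W_T = I$ by grid search over a compact $\Theta$, in the spirit of the compactness assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(2.0, 1.5, size=50_000)     # true mean 2.0, true variance 2.25

# Sample moments entering H_T(theta) = (1/T) sum h(z_t, theta):
zbar, z2bar = z.mean(), (z ** 2).mean()

# Compact grid for Theta, chosen around plausible values:
m_grid = np.linspace(1.5, 2.5, 201)
v_grid = np.linspace(1.75, 2.75, 201)
M, V = np.meshgrid(m_grid, v_grid)

H1 = zbar - M                              # sample mean of z_t - m
H2 = z2bar - 2.0 * M * zbar + M ** 2 - V   # sample mean of (z_t - m)^2 - v
Q = H1 ** 2 + H2 ** 2                      # Q_T = H_T' W_T H_T with W_T = I

i, j = np.unravel_index(int(np.argmin(Q)), Q.shape)
m_hat, v_hat = M[i, j], V[i, j]
print(m_hat, v_hat)                        # near (2.0, 2.25)
```

Because the system is exactly identified ($M = K$), the minimized criterion is essentially zero and the choice of $W_T$ does not matter; the weighting matrix becomes consequential in the overidentified case.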

When this result is combined with the consistency and uniform convergence theorems, it follows that the GMM estimator $\{\theta_T : T \geq 1\}$ converges almost surely to $\theta_0$.

QML Estimators

Key to the consistency of QML estimators is verifying that the population moment equation based on the normal likelihood function is satisfied at $\theta_0$. As noted earlier, this is generally true if the functional forms of the conditional mean and variance of $y_t$ are correctly specified (the moments implied by a DAPM are those in the probability model generating $y_t$). It is informative to verify that this condition is satisfied at $\theta_0$ for the interest rate example. This discussion is, in fact, generic to any one-dimensional state process $y_t$, since it does not depend on the functional forms of the conditional mean $\mu_{rt}$ or variance $\sigma_{rt}^2$. Extensions to the multivariate case, with some increase in notational complexity, are immediate (see, e.g., Bollerslev and Wooldridge, 1992). Recalling the first-order conditions for the normal quasi-likelihood, the limit of the middle term on the right-hand side is

$$\frac{1}{T}\sum_{t} \frac{(r_t - \hat{\mu}_{rt})^2}{2\hat{\sigma}_{rt}^4} \frac{\partial \hat{\sigma}_{rt}^2}{\partial \theta_j} \to E\left[\frac{(r_t - \mu_{rt})^2}{2\sigma_{rt}^4} \frac{\partial \sigma_{rt}^2}{\partial \theta_j}\right].$$

Using the law of iterated expectations, we find that this expectation simplifies as

$$E\left[\frac{(r_t - \mu_{rt})^2}{2\sigma_{rt}^4} \frac{\partial \sigma_{rt}^2}{\partial \theta_j}\right] = E\left[E\left[\frac{(r_t - \mu_{rt})^2}{\sigma_{rt}^2} \,\Big|\, r_{t-1}\right] \frac{1}{2\sigma_{rt}^2} \frac{\partial \sigma_{rt}^2}{\partial \theta_j}\right] = E\left[\frac{1}{2\sigma_{rt}^2} \frac{\partial \sigma_{rt}^2}{\partial \theta_j}\right].$$

This expectation is seen to be minus the limit of the first term in the first-order conditions, so the first and second terms cancel. Thus, for the population first-order conditions to have a zero at $\theta_0$, it remains to show that the limit of the last term, evaluated at $\theta_0$, is zero. This limit is

$$E\left[\frac{(r_t - \mu_{rt})}{\sigma_{rt}^2} \frac{\partial \mu_{rt}}{\partial \theta_j}\right],$$

which is indeed zero, because $E[r_t - \mu_{rt} \mid r_{t-1}] = 0$ by construction and all of the other terms are constant conditional on $r_{t-1}$.
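The cancellation argument above can be checked by simulation. The sketch below (an invented AR(1) model with constant conditional variance; parameter values and names are hypothetical) generates data with deliberately non-Gaussian errors, then verifies that the normal-quasi-likelihood score terms still average to zero at the true parameters, because the conditional mean and variance are correctly specified:

```python
import numpy as np

rng = np.random.default_rng(4)
a0, b0, c0 = 0.1, 0.8, 0.25        # true mean parameters (a, b) and variance c
T = 200_000

# Non-Gaussian errors with mean 0 and variance 1 (uniform on [-sqrt(3), sqrt(3)]):
e = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=T)
r = np.empty(T)
r[0] = a0 / (1.0 - b0)             # start at the unconditional mean
for t in range(1, T):
    r[t] = a0 + b0 * r[t - 1] + np.sqrt(c0) * e[t]

# Normal quasi-log-likelihood score terms, evaluated at the true parameters:
resid = r[1:] - (a0 + b0 * r[:-1])
score_a = resid / c0                                  # d log f / d a
score_c = -0.5 / c0 + 0.5 * resid ** 2 / c0 ** 2      # d log f / d c
print(score_a.mean(), score_c.mean())                 # both near zero
```

The variance score averages to zero even though the likelihood is misspecified (the errors are uniform, not normal), which is exactly the mechanism behind QML consistency.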

Consistency of the QML estimator then follows under the regularity conditions of the consistency theorem.

Asymptotic Normality of Extremum Estimators

The consistency of $\theta_T$ for $\theta_0$ implies that the limiting distribution of $\theta_T$ is degenerate at $\theta_0$. For the purpose of conducting inference about the population value $\theta_0$ of $\theta$, we would like to know the distribution of $\theta_T$ for finite $T$. This distribution is generally not known, but often it can be reliably approximated using the limiting distribution of $\sqrt{T}(\theta_T - \theta_0)$ obtained by a central limit theorem. Applicable central limit theorems have been proven under a wide variety of regularity conditions. We continue our focus on stationary and ergodic economic environments.

Suppose that $\theta_T$ is strongly consistent for $\theta_0$. To show the asymptotic normality of $\theta_T$, we focus on the first-order conditions for the maximization or minimization of $Q_T$, the sample mean of the function $D_0(z_t; \theta)$ introduced earlier. More precisely, we let

$$h(z_t, \theta) = \begin{cases} \partial \log f\big(y_t \mid y_{t-1}^J; \theta\big)/\partial\theta & \text{for the ML estimator}, \\ h(z_t, \theta) & \text{for the GMM estimator}, \\ (y_t - x_t'\theta)\, x_t & \text{for the LLP estimator}. \end{cases}$$

In each case, by appropriate choice of $z_t$ and $\theta$, $E[h(z_t, \theta_0)] = 0$. Thus, the function $D_0(z_t; \theta)$, representing the first-order conditions for $Q_0$, is

$$D_0(z_t; \theta) = A_0\, h(z_t; \theta),$$

where the $K \times M$ matrix $A_0$ is

$$A_0 = \begin{cases} I_K & \text{for the ML estimator}, \\ E[\partial h(z_t, \theta_0)/\partial \theta']'\, W_0 & \text{for the GMM estimator}, \\ I_K & \text{for the LLP estimator}, \end{cases}$$

where $I_K$ denotes the $K \times K$ identity matrix. The choice of $A_0$ for the GMM estimator is motivated subsequently as part of the proof of the asymptotic normality theorem. Using this notation and letting

$$H_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} h(z_t, \theta),$$

we can view all of these estimators as special cases of the following definition of a GMM estimator (Hansen, 1982b).

Definition. The GMM estimator $\{\theta_T : T \geq 1\}$ is a sequence of random vectors that converges in probability to $\theta_0$ for which $\{\sqrt{T}\, A_T H_T(\theta_T) : T \geq 1\}$ converges in probability to zero, where $\{A_T\}$ is a sequence of $K \times M$ matrices converging in probability to the full-rank matrix $A_0$.

For a sequence of random variables $\{X_T\}$, convergence in distribution is defined as follows.

Definition. Let $F_1, F_2, \ldots$ be the distribution functions of the random variables $X_1, X_2, \ldots$. Then the sequence $\{X_T\}$ converges in distribution to $X$ (denoted $X_T \Rightarrow X$) if and only if $F_T(b) \to F_X(b)$ for all $b$ at which $F_X$ is continuous.

The classical central limit theorem examines the partial sums $S_T = (1/\sqrt{T}) \sum_t (X_t - \mu)$ of an independently and identically distributed process $\{X_t\}$ with mean $\mu$ and finite variance. Under these assumptions, the distribution of $S_T$ converges to that of a normal with mean zero and covariance matrix $\mathrm{Var}[X_t]$. However, for the study of asset pricing models, the assumption of independence is typically too strong. It rules out, in particular, persistence in the state variables and time-varying conditional volatilities. The assumption that $\{X_t\}$ is a stationary and ergodic time series, which is much weaker than the i.i.d. assumption in the classical model, is not sufficient to establish a central limit theorem. Essentially, the problem is that an ergodic time series can be highly persistent, so that $X_t$ and $X_s$, for $s \neq t$, are too highly correlated for $S_T$ to converge to a normal random vector. The assumption of independence in the classical central limit theorem avoids this problem by assuming away any temporal dependence. Instead, we will work with the much weaker assumption that $\{X_t\}$ is a Martingale Difference Sequence (MDS), meaning that

$$E[X_t \mid X_{t-1}, X_{t-2}, \ldots] = 0$$
with probability one. The assumption that $X_t$ is mean-independent of its past imposes sufficient structure on the dependence of $\{X_t\}$ for the following central limit theorem to be true.

Theorem (Billingsley, 1961). Let $\{X_t\}_{t=1}^{\infty}$ be a stationary and ergodic MDS such that $EX_1^2$ is finite. Then the distribution of $(1/\sqrt{T}) \sum_{t=1}^{T} X_t$ approaches the normal distribution with mean zero and variance $EX_1^2$.
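Billingsley's theorem can be illustrated with a process that is serially dependent yet mean-independent of its past (an invented example, not from the text): $X_t = \varepsilon_t \varepsilon_{t-1}$ with i.i.d. standard normal $\varepsilon$ is a stationary, ergodic MDS, since $E[X_t \mid \text{past}] = \varepsilon_{t-1} E[\varepsilon_t] = 0$, with $EX_t^2 = 1$. The scaled sums are then approximately $N(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(11)

def scaled_sum(T):
    """(1/sqrt(T)) * sum of X_t, where X_t = eps_t * eps_{t-1} is an MDS."""
    eps = rng.normal(size=T + 1)
    x = eps[1:] * eps[:-1]        # dependent (adjacent X's share an eps) but an MDS
    return x.sum() / np.sqrt(T)

# Monte Carlo distribution of the scaled sums:
draws = np.array([scaled_sum(2_000) for _ in range(5_000)])
print(draws.mean(), draws.var())   # near 0 and near E[X_t^2] = 1
```

Note that independence fails here (adjacent terms share a common $\varepsilon$), which is exactly the extra generality the MDS assumption provides over the classical CLT.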

Though many financial time series are not MDSs, it will turn out that they can be expressed as moving averages of an MDS, and this will be shown to be sufficient for our purposes. Equipped with Billingsley's theorem, under the following conditions we can prove that the GMM estimator is asymptotically normal.

Theorem (Hansen, 1982b). Suppose that
(i) $\{z_t\}$ is stationary and ergodic.
(ii) $\Theta$ is an open subset of $\mathbb{R}^K$.
(iii) $h$ is a measurable function of $z_t$ for all $\theta \in \Theta$,
$$d_0 \equiv E\left[\frac{\partial h}{\partial \theta}(z_t, \theta_0)\right]$$
is finite and has full rank, and $\partial h/\partial \theta$ is first-moment continuous at all $\theta \in \Theta$.
(iv) $\theta_T$ is a GMM estimator of $\theta_0$.
(v) $\sqrt{T}\, H_T(z_T, \theta_0) \Rightarrow N(0, \Sigma_0)$, where $\Sigma_0 = \lim_{T \to \infty} E[T\, H_T(\theta_0) H_T(\theta_0)']$.
(vi) $A_T$ converges in probability to $A_0$, a constant matrix of full rank, and $A_0 d_0$ has full rank.
Then $\sqrt{T}(\theta_T - \theta_0) \Rightarrow N(0, \Omega_0)$, where

$$\Omega_0 = (A_0 d_0)^{-1} A_0 \Sigma_0 A_0' (d_0' A_0')^{-1}.$$

In proving this theorem, we will need the following very useful lemma.

Lemma. Suppose that $\{z_t\}$ is stationary and ergodic and the function $g(z_t, \theta)$ satisfies: (a) $E[g(z_t, \theta_0)]$ exists and is finite, (b) $g$ is first-moment continuous at $\theta_0$; and suppose that $\theta_T$ converges to $\theta_0$ in probability. Then $(1/T) \sum_{t=1}^{T} g(z_t, \theta_T)$ converges to $E[g(z_t, \theta_0)]$ in probability.

Proof of the theorem. When we apply Taylor's theorem on a coordinate-by-coordinate basis,

$$H_T(\theta_T) = H_T(\theta_0) + G_T(\theta_T^*)(\theta_T - \theta_0),$$

where $\theta_T^*$ is a $K \times M$ matrix with the $m$th column, $\theta_{Tm}^*$, satisfying $\|\theta_{Tm}^* - \theta_0\| \leq \|\theta_T - \theta_0\|$, for $m = 1, \ldots, M$, and the $ij$th element of the $M \times K$ matrix $G_T(\theta_T^*)$ is the $j$th

element of the $K$-vector $\partial H_T^i(\theta_T^{*i})/\partial \theta$. The matrix $G_T(\theta_T^*)$ converges in probability to the matrix $d_0$ by the preceding lemma. Furthermore, since $\sqrt{T}\, A_T H_T(\theta_T)$ converges in probability to zero, $\sqrt{T}(\theta_T - \theta_0)$ and $[-(A_0 d_0)^{-1} A_0 \sqrt{T}\, H_T(\theta_0)]$ have the same limiting distribution. Finally, from (v) it follows that $\sqrt{T}(\theta_T - \theta_0)$ is asymptotically normal with mean zero and covariance matrix $(A_0 d_0)^{-1} A_0 \Sigma_0 A_0' (d_0' A_0')^{-1}$.

A key assumption of the theorem is Condition (v), as it takes us a long way toward the desired result. Prior to discussing applications of this theorem, it will be instructive to discuss more primitive conditions for Condition (v) to hold and to characterize $\Sigma_0$. Letting $I_t$ denote the information set generated by $\{z_t, z_{t-1}, \ldots\}$, and $h_t \equiv h(z_t; \theta_0)$, we begin with the special case (where ACh is shorthand for autocorrelation in $h$):

Case ACh(0). $E[h_t \mid I_{t-1}] = 0$. Since $I_{t-1}$ includes $h_s$, for $s \leq t - 1$, $\{h_t\}$ is an MDS. Thus, the MDS central limit theorem (CLT) applies directly and implies Condition (v) with

$$\Sigma_0 = E[h_t h_t'].$$

Case ACh(n−1). $E[h_{t+n} \mid I_t] = 0$, for some $n \geq 1$. When $n > 1$, this case allows for serial correlation in the process $h_t$ up to order $n - 1$. We cannot apply the MDS CLT directly in this case because it presumes that $h_t$ is an MDS. However, it turns out that we can decompose $h_t$ into a finite sum of terms that do follow an MDS, and then Billingsley's CLT can be applied. Toward this end, $h_t$ is written as

$$h_t = \sum_{j=0}^{n-1} u_{t,j},$$

where $u_{t,j} \in I_{t-j}$ and satisfies the property that $E[u_{t,j} \mid I_{t-j-1}] = 0$. This representation follows from the observation that

$$h_t = E[h_t \mid I_{t-1}] + u_{t,0} = E[h_t \mid I_{t-2}] + u_{t,0} + u_{t,1} = \cdots = \sum_{j=0}^{n-1} u_{t,j},$$

where the law of iterated expectations has been used repeatedly, and the final step uses $E[h_t \mid I_{t-n}] = 0$. Thus,

    (1/√T) Σ_{t=1}^T h_t = (1/√T) Σ_{t=1}^T Σ_{j=0}^{n−1} u_{t,j}.

Combining terms for which t − j is the same (and, hence, that reside in the same information set) and defining

    u_t* = Σ_{j=0}^{n−1} u_{t+j,j}

gives

    (1/√T) Σ_{t=1}^T h_t = (1/√T) Σ_t u_t* + V_T^n,

where V_T^n involves a fixed number of the u_{t,j} depending only on n, for all T. Since V_T^n converges to zero in probability as T → ∞, we can focus on the sample mean of u_t* in deriving the limiting distribution of the sample mean of h_t. The series {u_t*} is an MDS. Thus, Billingsley's theorem implies that

    (1/√T) Σ_t u_t* ⇒ N(0, Σ_0),    Σ_0 = E[u_t* u_t*'].

Moreover, substituting the scaled sample mean of the h_t for the scaled sample mean of the u_t* gives

    Σ_0 = lim_{T→∞} E[((1/√T) Σ_{t=1}^T h_t)((1/√T) Σ_{t=1}^T h_t)']
        = lim_{T→∞} Σ_{j=−n+1}^{n−1} ((T − |j|)/T) E[h_t h_{t−j}']
        = Σ_{j=−n+1}^{n−1} E[h_t h_{t−j}'].

In words, the asymptotic covariance matrix of the scaled sample mean of h_t is the sum of the autocovariances of h_t out to order n − 1.
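As a purely illustrative check (not from the text), this sum-of-autocovariances result can be verified numerically for a simple moving-average process. For h_t = ε_t + φ ε_{t−1} with ε_t i.i.d. (so E[h_{t+2} | I_t] = 0, i.e., Case ACh(1) with n = 2), the sum of autocovariances is Γ(0) + 2Γ(1) = (1 + φ)² σ².

```python
import numpy as np

# h_t = eps_t + phi * eps_{t-1}: an MA(1), Case ACh(1) with n = 2.
# Long-run variance = Gamma(0) + 2*Gamma(1)
#                   = (1 + phi^2) sigma^2 + 2 phi sigma^2 = (1 + phi)^2 sigma^2.
rng = np.random.default_rng(0)
phi, sigma, T = 0.5, 1.0, 200_000

eps = rng.normal(0.0, sigma, T + 1)
h = eps[1:] + phi * eps[:-1]

# Sample autocovariances out to the known order n - 1 = 1
gamma0 = np.mean(h * h)
gamma1 = np.mean(h[1:] * h[:-1])
sigma0_hat = gamma0 + 2.0 * gamma1            # estimate of Sigma_0
sigma0_true = (1.0 + phi) ** 2 * sigma ** 2   # analytic value: 2.25

print(sigma0_hat, sigma0_true)
```

With T = 200,000 draws, the estimated sum of autocovariances lands close to the analytic value 2.25, while the naive MDS formula E[h_t h_t'] = (1 + φ²)σ² = 1.25 would understate the asymptotic variance of the scaled sample mean.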

Case ACh(∞). E[h_t h_{t−s}'] ≠ 0, for all s.

Since, in Case ACh(n − 1), n − 1 is the order of the last nonzero autocovariance of h_t, the finite-sum expression for Σ_0 can be rewritten equivalently as

    Σ_0 = Σ_{j=−∞}^{∞} E[h(z_t, θ_0) h(z_{t−j}, θ_0)'].

This suggests that, for the case where E[h_t h_{t−s}'] ≠ 0 for all s (i.e., n = ∞), this expression holds as well. Hansen (1982b) shows that this is indeed the case under the additional assumption that the autocovariance matrices of h_t are absolutely summable.

Distributions of Specific Estimators

In applying the asymptotic normality theorem, it must be verified that the problem of interest satisfies Conditions (iii), (iv), and (v). We next discuss some of the implications of these conditions for the cases of the ML, GMM, and LLP criterion functions. In addition, we examine the form of the asymptotic covariance matrix Ω_0 implied by these criterion functions, and discuss consistent estimators of Ω_0.

Maximum Likelihood Estimation

In the case of ML estimation, we proved in an earlier chapter that

    E[D_0(y_t, y_{t−1}^J; β_0) | y_{t−1}^J] = 0,    where    D_0(y_t, y_{t−1}^J; β) = ∂ log f(y_t | y_{t−1}^J; β)/∂β.

Since the density of y_t conditioned on y_{t−1}^J is, by assumption, the same as the density conditioned on the entire past, this conditional moment restriction implies that the score is an MDS. Therefore, the asymptotic normality theorem and Case ACh(0) apply, and

    √T H_T(θ_0) = (1/√T) Σ_{t=1}^T ∂ log f(y_t | y_{t−1}^J; β_0)/∂β

converges in distribution to a normal random vector with asymptotic covariance matrix

    Σ_0 = E[ (∂ log f(y_t | y_{t−1}^J; β_0)/∂β)(∂ log f(y_t | y_{t−1}^J; β_0)/∂β)' ].

[Footnote: In deriving this result, we implicitly assumed that we could reverse the order of integration and differentiation. Formally, this is justified by the assumption that the partial derivative of log f(y_t | y_{t−1}^J; β) is first-moment continuous at β_0. More precisely, consider a function h(z, θ). Suppose that for some δ > 0, the partial derivative ∂h(z, θ)/∂θ exists for all values of z and all θ such that ‖θ − θ_0‖ < δ, and suppose that this derivative is first-moment continuous at θ_0. If E[h(z, θ)] exists for all ‖θ − θ_0‖ < δ and if E[‖∂h(z, θ)/∂θ‖] < ∞, then ∂E[h(z, θ)]/∂θ evaluated at θ_0 equals E[∂h(z, θ)/∂θ] evaluated at θ_0.]

Furthermore, the first-order conditions to the log-likelihood function give K equations in the K unknowns β, so A_T is I_K in this case and Ω_0^ML = d_0^{-1} Σ_0 (d_0')^{-1}. Thus, it remains to determine d_0. Since E[D_0(z_t, β_0)] = 0, differentiating both sides of this expression with respect to β gives

    d_0^ML = E[∂² log f(y_t | y_{t−1}^J; β_0)/∂β∂β']
           = −E[ (∂ log f(y_t | y_{t−1}^J; β_0)/∂β)(∂ log f(y_t | y_{t−1}^J; β_0)/∂β)' ].

When we combine these expressions and use the fact that if X ~ N(0, Ω_X), then AX ~ N(0, A Ω_X A'), it follows that

    √T (b_ML − β_0) ⇒ N(0, −E[∂² log f(y_t | y_{t−1}^J; β_0)/∂β∂β']^{-1}).

In actual implementations of ML estimation, this asymptotic covariance matrix is replaced by its sample counterpart. From the second equality above it follows that this matrix can be estimated either as the inverse of the sample mean of the outer product of the likelihood scores or as minus the inverse of the sample mean of the second-derivative matrix evaluated at b_ML,

    [ −(1/T) Σ_{t=1}^T ∂² log f(y_t | y_{t−1}^J; b_ML)/∂β∂β' ]^{-1}.

[Footnote: The second equality in the expression for d_0^ML is an important property of conditional density functions that follows from the conditional moment restriction on the score. By definition,

    0 = ∫ [∂ log f(y_t | y_{t−1}^J; β_0)/∂β] f(y_t | y_{t−1}^J; β_0) dy_t.

Differentiating under the integral sign and using the chain rule gives

    0 = E[∂² log f(y_t | y_{t−1}^J; β_0)/∂β∂β' | y_{t−1}^J]
      + E[ (∂ log f(y_t | y_{t−1}^J; β_0)/∂β)(∂ log f(y_t | y_{t−1}^J; β_0)/∂β)' | y_{t−1}^J ].]
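The equivalence of the two sample estimators can be illustrated with a simple sketch (not from the text) using i.i.d. draws from N(μ, σ²), for which the information matrix is diag(1/σ², 1/(2σ⁴)). Both the average outer product of the analytic scores and minus the average Hessian approximate it in large samples.

```python
import numpy as np

# Information-matrix equality for i.i.d. N(mu, s2):
#   (1/T) sum score*score'  ~  -(1/T) sum Hessian  ~  diag(1/s2, 1/(2 s2^2))
rng = np.random.default_rng(5)
mu0, s20, T = 1.0, 2.0, 500_000
y = rng.normal(mu0, np.sqrt(s20), T)
e = y - mu0

# Analytic scores of log f(y; mu, s2), evaluated at the true parameters
score = np.column_stack([e / s20, -0.5 / s20 + e**2 / (2 * s20**2)])
outer = score.T @ score / T

# Analytic second derivatives, averaged over the sample
h11 = -1.0 / s20
h12 = np.mean(-e / s20**2)
h22 = np.mean(0.5 / s20**2 - e**2 / s20**3)
neg_hess = -np.array([[h11, h12], [h12, h22]])

info = np.diag([1.0 / s20, 0.5 / s20**2])  # Fisher information, analytic
print(outer, neg_hess, info)
```

Both `outer` and `neg_hess` converge to `info` as T grows; in a correctly specified model either one can be inverted to estimate the asymptotic covariance of the ML estimator.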

Assuming that the regularity conditions for the lemma are satisfied by the likelihood score, we see that this sample counterpart converges to the asymptotic covariance matrix of b_ML as T → ∞. The asymptotic covariance matrix of b_ML is the Cramér-Rao lower bound, the inverse of the information matrix (minus the expected Hessian of the log-likelihood). This suggests that, even though the ML estimator may be biased in small samples, as T gets large, the ML estimator is the most efficient estimator in the sense of having the smallest asymptotic covariance matrix among all consistent estimators of β_0. This is indeed the case, and we present a partial proof of this result in the section on relative efficiency below.

GMM Estimation

The asymptotic normality theorem applies directly to the case of GMM estimators. The GMM estimator minimizes the quadratic form H_T(θ)' W_T H_T(θ), so the regularity conditions require that h(z_t, θ) be differentiable, that ∂h(z_t, θ)/∂θ be first-moment continuous at θ_0, and that W_T converge in probability to a constant, positive-semidefinite matrix W_0. The first-order conditions to this minimization problem are

    [∂H_T(θ_T)/∂θ]' W_T H_T(θ_T) = 0.

Therefore, the A_T implied by the GMM criterion function is

    A_T = [∂H_T(θ_T)/∂θ]' W_T.

By the lemma and the assumption that W_T converges to W_0, it follows that A_T converges in probability to A_0 = d_0' W_0. Substituting this expression into the general formula for Ω_0, we conclude that √T (θ_T − θ_0) converges in distribution to a normal with mean zero and covariance matrix

    Ω_0^GMM = (d_0' W_0 d_0)^{-1} d_0' W_0 Σ_0 W_0 d_0 (d_0' W_0 d_0)^{-1}.

If the probability limit of the distance matrix defining the GMM criterion function is chosen to be W_0 = Σ_0^{-1}, then this simplifies to

    Ω_0^GMM = (d_0' Σ_0^{-1} d_0)^{-1}.

We show in the section on relative efficiency that this choice of distance matrix is the optimal choice among GMM estimators constructed from linear combinations of the moment equation E[h(z_t, θ_0)] = 0. A consistent estimator of Ω_0^GMM is constructed by replacing all of the matrices in these expressions by their sample counterparts.
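Both covariance formulas are simple matrix algebra, so the simplification under W_0 = Σ_0^{-1} (and the efficiency cost of an arbitrary W_0) can be checked directly. The sketch below, an illustration rather than anything from the text, builds the sandwich from arbitrary full-rank matrices standing in for d_0 and Σ_0.

```python
import numpy as np

def gmm_cov(d0, W0, S0):
    """Sandwich covariance (d' W d)^{-1} d' W S W d (d' W d)^{-1}."""
    B = np.linalg.inv(d0.T @ W0 @ d0)
    return B @ d0.T @ W0 @ S0 @ W0 @ d0 @ B

rng = np.random.default_rng(1)
M, K = 5, 3
d0 = rng.normal(size=(M, K))         # stand-in for E[dh/dtheta'], full-rank M x K
A = rng.normal(size=(M, M))
S0 = A @ A.T + np.eye(M)             # a positive-definite stand-in for Sigma_0

W_arb = np.eye(M)                    # an arbitrary distance matrix
W_opt = np.linalg.inv(S0)            # the optimal choice W_0 = Sigma_0^{-1}

cov_arb = gmm_cov(d0, W_arb, S0)
cov_opt = gmm_cov(d0, W_opt, S0)
cov_eff = np.linalg.inv(d0.T @ np.linalg.inv(S0) @ d0)  # (d' S^{-1} d)^{-1}

# With W_opt the sandwich collapses to the efficient form, and the
# arbitrary-W covariance exceeds it by a positive-semidefinite matrix.
eig_min = np.linalg.eigvalsh(cov_arb - cov_eff).min()
print(np.allclose(cov_opt, cov_eff), eig_min)
```

The positive-semidefiniteness of `cov_arb - cov_eff` is the matrix sense in which W_0 = Σ_0^{-1} is optimal, anticipating the efficiency discussion below.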

The matrix W_0 is estimated by W_T, the matrix used to construct the GMM criterion function, and d_0 is replaced by ∂H_T(θ_T)/∂θ. The construction of a consistent estimator of Σ_0 depends on the degree of autocorrelation in h(z_t, θ_0). In Case ACh(n − 1), with finite n, the autocovariances of h comprising Σ_0 are replaced by their sample counterparts, using the fitted h(z_t, θ_T) in place of h(z_t, θ_0):

    (1/T) Σ_{t=j+1}^T h(z_t, θ_T) h(z_{t−j}, θ_T)'.

An asymptotically equivalent estimator is obtained by subtracting the sample mean from h(z_t, θ_T) before computing the sample autocovariances. If, on the other hand, n = ∞ or n is very large relative to the sample size, then an alternative approach to estimating Σ_0 is required. In Case ACh(∞), Σ_0 is the infinite sum of autocovariances displayed above. Letting Γ_{h0}(j) = E[h_t h_{t−j}'], we proceed by constructing an estimator as a weighted sum of the autocovariances that can feasibly be estimated with a finite sample of length T:

    Σ_T = (T/(T − K)) Σ_{j=−T+1}^{T−1} k(j/B_T) Γ_{hT}(j),

where the sample autocovariances are given by

    Γ_{hT}(j) = (1/T) Σ_{t=j+1}^T h(z_t, θ_T) h(z_{t−j}, θ_T)'     for j ≥ 0,
    Γ_{hT}(j) = (1/T) Σ_{t=−j+1}^T h(z_{t+j}, θ_T) h(z_t, θ_T)'    for j < 0,

and B_T is a bandwidth parameter discussed later. The scaling factor T/(T − K) is a small-sample adjustment for the estimation of θ. The function k(·), called a kernel, determines the weight given to past sample autocovariances in constructing Σ_T. The basic idea of this estimation strategy is that, for fixed j, the sample size T must increase to infinity for Γ_{hT}(j) to be a consistent estimator of Γ_{h0}(j). At the same time, the number of nonzero autocovariances in the weighted sum must increase without bound for Σ_T to be a consistent estimator of Σ_0. The potential problem is that if terms are added proportionately as T gets large, then the number of products h_t h_{t−j}' in the sample estimate of Γ_{hT}(j) stays small regardless of the size of T. To avoid this problem, the kernel must be chosen so that the number of autocovariances included grows, but at a slower rate than T,

so that the number of terms in each sample estimate Γ_{hT}(j) increases to infinity. Two popular kernels for estimating Σ_0 are

    Truncated:  k(x) = 1 for |x| ≤ 1,  and 0 otherwise;
    Bartlett:   k(x) = 1 − |x| for |x| ≤ 1,  and 0 otherwise.

For both of these kernels, the bandwidth B_T determines the number of autocovariances included in the estimation of Σ_T. In the case of the truncated kernel, all lags out to order B_T are included with equal weight. This is the kernel studied by White (1984). In the case of the Bartlett kernel, the autocovariances are given declining weights out to order |j| ≤ B_T. Newey and West (1987b) show that, by using declining weights, the Bartlett kernel guarantees that Σ_T is positive-semidefinite. This need not be the case in finite samples for the truncated kernel. The choice of the bandwidth parameter B_T is discussed in Andrews (1991).

Quasi-Maximum Likelihood Estimation

The QML estimator is a special case of the GMM estimator. Specifically, continuing our discussion of the scalar process r_t with conditional mean μ_rt and variance σ²_rt that depend on the parameter vector θ, let the jth component of h(z_t, θ) be the score associated with θ_j:

    h_j(z_t, θ) = −(1/(2σ²_rt(θ))) ∂σ²_rt(θ)/∂θ_j
                + ((r_t − μ_rt(θ))²/(2σ⁴_rt(θ))) ∂σ²_rt(θ)/∂θ_j
                + ((r_t − μ_rt(θ))/σ²_rt(θ)) ∂μ_rt(θ)/∂θ_j,      j = 1, ..., K.

The asymptotic distribution of the QML estimator is thus determined by the properties of h(z_t, θ_0). From this expression it is seen that E[h_j(z_t, θ_0) | I_{t−1}] = 0; that is, {h(z_t, θ_0)} is an MDS. This follows from the observations that, after taking conditional expectations, the first and second terms cancel and the third term has a conditional mean of zero. Therefore, the QML estimator θ_T^QML falls under Case ACh(0) with M = K (the number of moment equations equals the number of parameters).
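The kernel-weighted estimator of Σ_0 described above can be sketched in a few lines. The following is an illustrative implementation (not from the text): it demeans the fitted moments, accumulates kernel-weighted sample autocovariances out to the bandwidth, and, for compactness, omits the small-sample factor T/(T − K).

```python
import numpy as np

def hac_cov(h, bandwidth, kernel="bartlett"):
    """HAC estimate of Sigma_0 from a T x M array of fitted moments.

    Weighted sum of sample autocovariances Gamma_T(j) with weights
    k(j / bandwidth) from the truncated or Bartlett kernel.
    (The T/(T-K) degrees-of-freedom adjustment is omitted here.)"""
    h = np.asarray(h, dtype=float)
    T, M = h.shape
    hd = h - h.mean(axis=0)          # demeaning: asymptotically equivalent
    S = hd.T @ hd / T                # Gamma_T(0)
    for j in range(1, T):
        x = j / bandwidth
        if abs(x) > 1:
            break                    # both kernels vanish beyond the bandwidth
        w = 1.0 if kernel == "truncated" else 1.0 - abs(x)
        G = hd[j:].T @ hd[:-j] / T   # Gamma_T(j)
        S += w * (G + G.T)           # Gamma_T(-j) = Gamma_T(j)'
    return S

# MA(1) moments with long-run variance (1 + 0.5)^2 = 2.25
rng = np.random.default_rng(2)
eps = rng.normal(size=100_001)
h = (eps[1:] + 0.5 * eps[:-1]).reshape(-1, 1)

S_bart = hac_cov(h, bandwidth=20, kernel="bartlett")
S_trunc = hac_cov(h, bandwidth=20, kernel="truncated")
print(S_bart[0, 0], S_trunc[0, 0])
```

The Bartlett weights shrink the first-order autocovariance slightly (weight 1 − 1/B_T rather than 1), the price paid for a guaranteed positive-semidefinite estimate; both versions land near the true value 2.25 here.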

Its limiting distribution is

    √T (θ_T^QML − θ_0) ⇒ N(0, (d_0^QML)^{-1} Σ_0 (d_0^QML ')^{-1}),

where Σ_0 = E[h(z_t, θ_0) h(z_t, θ_0)'], with h given by the QML score above, and

    d_0^QML = E[∂² log f^N(r_t | I_{t−1}; θ_0)/∂θ∂θ'].

Though these components are exactly the same in form as in the case of full-information ML estimation, d_0^QML and Σ_0 are not related by the information-matrix equality, so the covariance matrix does not simplify further.

Linear Least-Squares Projection

The LLP estimator is the special case of the GMM estimator with z_t = (y_t, x_t')', h(z_t, δ) = (y_t − x_t'δ) x_t, and A_0 = I_K. Also,

    d_0^LLP = −E[x_t x_t'],

and, with u_t ≡ (y_t − x_t'δ_0), where δ_0 is the probability limit of the least-squares estimator δ_T,

    Σ_0 = Σ_{j=−∞}^{∞} E[x_t u_t u_{t−j} x_{t−j}'].

It follows that

    Ω_0^LLP = E[x_t x_t']^{-1} ( Σ_{j=−∞}^{∞} E[x_t u_t u_{t−j} x_{t−j}'] ) E[x_t x_t']^{-1}.

In order to examine several special cases of LLP for forecasting the future, we assume that the variable being forecasted is dated t + n, n ≥ 1, and let x_t denote the vector of forecast variables observed at date t:

    y_{t+n} = x_t' δ_0 + u_{t+n}.

We consider several different assumptions about the projection error u_{t+n}. Unless otherwise noted, throughout the following discussion, the information set I_t denotes the information generated by current and past x_t and u_t.

Consider first Case ACh(0) with n = 1 and E[u_{t+1} | I_t] = 0. One circumstance where this case arises is when a researcher is interested in testing whether y_{t+1} is unforecastable given information in I_t. For instance, if we assume that x_t includes the constant as the first component, and partition x_t as x_t = (1, x̃_t')' and δ_0 conformably as δ_0 = (δ_c, δ_x')', then this case implies that E[y_{t+1} | I_t] = δ_c, δ_x = 0, and y_{t+1} is unforecastable given past information about x_t and y_t. The alternative hypothesis is that

    E[y_{t+1} | I_t] = δ_c + x̃_t' δ_x,

with the (typical) understanding that the projection error under this alternative satisfies E[u_{t+1} | I_t] = 0. A more general alternative would allow δ_x ≠ 0 and the projection error u_{t+1} to be correlated with other variables in I_t. We examine this case later. Since d_0 = −E[x_t x_t'] and this problem fits into Case ACh(0),

    √T (δ_T − δ_0) ⇒ N(0, Ω_0^LLP),

where

    Ω_0^LLP = E[x_t x_t']^{-1} E[u²_{t+1} x_t x_t'] E[x_t x_t']^{-1}.

Without further assumptions, Ω_0^LLP does not simplify. One simplifying assumption that is sometimes made is that the variance of u_{t+1} conditioned on I_t is constant:

    E[u²_{t+1} | I_t] = σ²_u, a constant.

Under this assumption, Σ_0 simplifies to σ²_u E[x_t x_t'] and

    Ω_0^LLP = σ²_u E[x_t x_t']^{-1}.

These characterizations of Ω_0^LLP are not directly applicable because the asymptotic covariance matrices are unknown (they are functions of unknown population moments). Therefore, we replace these unknown moments with their sample counterparts. Let û_{t+1} ≡ (y_{t+1} − x_t' δ_T). With the homoskedasticity assumption, the distribution of δ_T used for inference is

    δ_T ≈ N(δ_0, Ω_T^LLP / T),

where

    Ω_T^LLP = σ̂²_u [ (1/T) Σ_t x_t x_t' ]^{-1},    σ̂²_u = (1/T) Σ_t û²_{t+1}.

This is, of course, the usual distribution theory used in the classical linear least-squares estimation problem. Letting σ̂²_δi denote the ith diagonal element of Ω_T^LLP / T, we can test the null hypothesis H_0: δ^i = δ_0^i.
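The classical case can be made concrete with a small simulated sketch (illustrative, not from the text): a projection of y_{t+1} on a constant and one forecasting variable under the null δ_x = 0, with homoskedastic errors, followed by the usual standard-error and t-ratio calculation.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50_000
x = np.column_stack([np.ones(T), rng.normal(size=T)])  # x_t = (1, x~_t)'
delta0 = np.array([0.2, 0.0])   # delta_x = 0: y_{t+1} unforecastable by x~_t
u = rng.normal(size=T)          # homoskedastic projection error
y = x @ delta0 + u

# Least-squares projection and classical covariance sigma^2_u (X'X)^{-1}
delta_T = np.linalg.solve(x.T @ x, x.T @ y)
uhat = y - x @ delta_T
sigma2_u = np.mean(uhat ** 2)
cov_T = sigma2_u * np.linalg.inv(x.T @ x)       # Omega_T^LLP / T
t_stat = delta_T[1] / np.sqrt(cov_T[1, 1])      # test H_0: delta_x = 0

print(delta_T, t_stat)
```

Under the null the t-ratio is approximately standard normal, so values far outside ±2 would be evidence that y_{t+1} is forecastable by x̃_t.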

The test statistic is

    t = (δ_T^i − δ_0^i)/σ̂_δi ⇒ N(0, 1).

Suppose that we relax the homoskedasticity assumption and let the conditional variance of u_{t+1} be time varying. Then Ω_0^LLP is given by the general sandwich expression, and Σ_0 is now estimated by

    Σ_T = (1/T) Σ_t û²_{t+1} x_t x_t',

and

    Ω_T^LLP = [ (1/T) Σ_t x_t x_t' ]^{-1} Σ_T [ (1/T) Σ_t x_t x_t' ]^{-1}.

Testing proceeds as before, but with a different calculation of σ̂_δi.

Next, consider Case ACh(n − 1), which has n > 1 and E[u_{t+n} | I_t] = 0. This case would arise, for example, in asking whether y_{t+n} is forecastable given information in I_t. For this case, d_0 is unchanged, but the calculation of Σ_0 is modified so that

    Ω_0^LLP = E[x_t x_t']^{-1} ( Σ_{j=−n+1}^{n−1} E[u_{t+n} u_{t+n−j} x_t x_{t−j}'] ) E[x_t x_t']^{-1}.

Analogously to Case ACh(0), this expression simplifies further if the conditional variances and autocorrelations of u_t are constants. To estimate the asymptotic covariance matrix for this case, we replace E[x_t x_t'] by (1/T) Σ_t x_t x_t' and Σ_0 by

    Σ_T = (1/T) Σ_{j=−n+1}^{n−1} Σ_t û_{t+n} û_{t+n−j} x_t x_{t−j}'.

Testing proceeds in exactly the same way as before.
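An illustrative sketch (not from the text) shows why the autocovariance correction matters for overlapping, multiperiod forecasts. With n = 3 the projection error is MA(2), and ignoring its serial correlation (the j = 0, White-style estimate) understates the sampling variance of the slope when the regressor is persistent.

```python
import numpy as np

def ols_hac_cov(x, uhat, n):
    """Sandwich covariance for sqrt(T)(delta_T - delta_0) when the
    projection error of an n-period-ahead forecast is MA(n-1):
    [(1/T) sum x x']^{-1} Sigma_T [(1/T) sum x x']^{-1}, with Sigma_T the
    sum of sample autocovariances of x_t * uhat_{t+n} out to order n-1."""
    T = x.shape[0]
    xu = x * uhat[:, None]
    Sxx_inv = np.linalg.inv(x.T @ x / T)
    S = xu.T @ xu / T                  # j = 0 term (White estimator when n = 1)
    for j in range(1, n):
        G = xu[j:].T @ xu[:-j] / T     # order-j sample autocovariance
        S += G + G.T
    return Sxx_inv @ S @ Sxx_inv

rng = np.random.default_rng(4)
T, n, rho = 100_000, 3, 0.9

# Persistent forecasting variable x~_t (AR(1)), and an overlapping MA(2)
# projection error u_{t+n} independent of x, so E[u_{t+n} | I_t] = 0.
e = rng.normal(size=T)
xt = np.empty(T)
xt[0] = e[0] / np.sqrt(1 - rho**2)
for t in range(1, T):
    xt[t] = rho * xt[t - 1] + e[t]
eps = rng.normal(size=T + 2)
u = eps[2:] + eps[1:-1] + eps[:-2]

x = np.column_stack([np.ones(T), xt])
y = x @ np.array([0.1, 0.0]) + u

delta_T = np.linalg.solve(x.T @ x, x.T @ y)
uhat = y - x @ delta_T
cov_hac = ols_hac_cov(x, uhat, n)      # corrects for the MA(2) overlap
cov_white = ols_hac_cov(x, uhat, 1)    # ignores it
print(cov_hac[1, 1], cov_white[1, 1])
```

Here the serial-correlation-adjusted variance of the slope is roughly two to three times the uncorrected one, so t-ratios built from the uncorrected covariance would be badly oversized.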

Relative Efficiency of Estimators

The efficiency of an estimator can only be judged relative to an a priori set of restrictions on the joint distribution of the z_t that are to be used in estimation. These restrictions enter the formulation of a GMM estimator in two ways: through the choice of the h function and through the choice of A_0. The form of the asymptotic covariance matrix Ω_0 shows the dependence of the limiting distribution on both of these choices. In many circumstances, a researcher will have considerable latitude in choosing either A_0 or h(z_t, θ), or both. Therefore, a natural question is: which is the most efficient GMM estimator among all admissible estimators? In this section, we characterize the optimal GMM estimator, in the sense of being most efficient or, equivalently, having the smallest asymptotic covariance matrix among all estimators that exploit the same information about the distribution of z_t.

GMM Estimators

To highlight the dependence of the distributions of GMM estimators on the information used in specifying the moment equations, it is instructive to start with the conditional version of the moment equations underlying GMM estimation,

    (1/T) Σ_{t=1}^T A_t h(z_t, θ_T) = 0,

where A_t is a (possibly random) K × M matrix in the information set I_t, and h(z_t, θ) is an M-vector, with K ≤ M, satisfying

    E[h(z_t; θ_0) | I_t] = 0.

In this section, we treat z_t as a generic random vector that is not presumed to be in I_t and, indeed, in all of the examples considered subsequently z_t ∉ I_t. Initially, we treat h(z_t; θ) as given by the asset pricing theory and, as such, not subject to the choice of the researcher. We also let

    A = { A_t ∈ I_t  such that  E[A_t ∂h(z_t; θ_0)/∂θ'] has full rank }

denote the class of admissible GMM estimators, where each estimator is indexed by the (possibly random) weights A_t. The efficiency question at hand is: in estimating θ_0, what is the optimal choice of A_t? (Which choice of A_t gives the smallest asymptotic covariance matrix for θ_T among all estimators based on matrices in A?) The following lemma, based on the analysis in Hansen (1985), provides a general characterization of the optimal A* ∈ A.

Lemma. Suppose that the assumptions of the asymptotic normality theorem are satisfied and {A_t} ∈ A is a stationary and ergodic process (jointly with z_t). Then the optimal choice A* ∈ A satisfies


More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Department of Economics, UCSD UC San Diego

Department of Economics, UCSD UC San Diego Department of Economics, UCSD UC San Diego itle: Spurious Regressions with Stationary Series Author: Granger, Clive W.J., University of California, San Diego Hyung, Namwon, University of Seoul Jeon, Yongil,

More information

Econometría 2: Análisis de series de Tiempo

Econometría 2: Análisis de series de Tiempo Econometría 2: Análisis de series de Tiempo Karoll GOMEZ kgomezp@unal.edu.co http://karollgomez.wordpress.com Segundo semestre 2016 IX. Vector Time Series Models VARMA Models A. 1. Motivation: The vector

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006 Analogy Principle Asymptotic Theory Part II James J. Heckman University of Chicago Econ 312 This draft, April 5, 2006 Consider four methods: 1. Maximum Likelihood Estimation (MLE) 2. (Nonlinear) Least

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

Econ 623 Econometrics II Topic 2: Stationary Time Series

Econ 623 Econometrics II Topic 2: Stationary Time Series 1 Introduction Econ 623 Econometrics II Topic 2: Stationary Time Series In the regression model we can model the error term as an autoregression AR(1) process. That is, we can use the past value of the

More information

Estimation, Inference, and Hypothesis Testing

Estimation, Inference, and Hypothesis Testing Chapter 2 Estimation, Inference, and Hypothesis Testing Note: The primary reference for these notes is Ch. 7 and 8 of Casella & Berger 2. This text may be challenging if new to this topic and Ch. 7 of

More information

Chapter 11 GMM: General Formulas and Application

Chapter 11 GMM: General Formulas and Application Chapter 11 GMM: General Formulas and Application Main Content General GMM Formulas esting Moments Standard Errors of Anything by Delta Method Using GMM for Regressions Prespecified weighting Matrices and

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

ECON3327: Financial Econometrics, Spring 2016

ECON3327: Financial Econometrics, Spring 2016 ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary

More information

Probability Space. J. McNames Portland State University ECE 538/638 Stochastic Signals Ver

Probability Space. J. McNames Portland State University ECE 538/638 Stochastic Signals Ver Stochastic Signals Overview Definitions Second order statistics Stationarity and ergodicity Random signal variability Power spectral density Linear systems with stationary inputs Random signal memory Correlation

More information

Parameter Estimation

Parameter Estimation Parameter Estimation Consider a sample of observations on a random variable Y. his generates random variables: (y 1, y 2,, y ). A random sample is a sample (y 1, y 2,, y ) where the random variables y

More information

Online Appendix. j=1. φ T (ω j ) vec (EI T (ω j ) f θ0 (ω j )). vec (EI T (ω) f θ0 (ω)) = O T β+1/2) = o(1), M 1. M T (s) exp ( isω)

Online Appendix. j=1. φ T (ω j ) vec (EI T (ω j ) f θ0 (ω j )). vec (EI T (ω) f θ0 (ω)) = O T β+1/2) = o(1), M 1. M T (s) exp ( isω) Online Appendix Proof of Lemma A.. he proof uses similar arguments as in Dunsmuir 979), but allowing for weak identification and selecting a subset of frequencies using W ω). It consists of two steps.

More information

1. Fundamental concepts

1. Fundamental concepts . Fundamental concepts A time series is a sequence of data points, measured typically at successive times spaced at uniform intervals. Time series are used in such fields as statistics, signal processing

More information

GMM, HAC estimators, & Standard Errors for Business Cycle Statistics

GMM, HAC estimators, & Standard Errors for Business Cycle Statistics GMM, HAC estimators, & Standard Errors for Business Cycle Statistics Wouter J. Den Haan London School of Economics c Wouter J. Den Haan Overview Generic GMM problem Estimation Heteroskedastic and Autocorrelation

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

Thomas J. Fisher. Research Statement. Preliminary Results

Thomas J. Fisher. Research Statement. Preliminary Results Thomas J. Fisher Research Statement Preliminary Results Many applications of modern statistics involve a large number of measurements and can be considered in a linear algebra framework. In many of these

More information

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University Introduction to the Mathematical and Statistical Foundations of Econometrics 1 Herman J. Bierens Pennsylvania State University November 13, 2003 Revised: March 15, 2004 2 Contents Preface Chapter 1: Probability

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

Ch.10 Autocorrelated Disturbances (June 15, 2016)

Ch.10 Autocorrelated Disturbances (June 15, 2016) Ch10 Autocorrelated Disturbances (June 15, 2016) In a time-series linear regression model setting, Y t = x tβ + u t, t = 1, 2,, T, (10-1) a common problem is autocorrelation, or serial correlation of the

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Lecture 8: Multivariate GARCH and Conditional Correlation Models

Lecture 8: Multivariate GARCH and Conditional Correlation Models Lecture 8: Multivariate GARCH and Conditional Correlation Models Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2018 Overview Three issues in multivariate modelling of CH covariances

More information

The Uniform Weak Law of Large Numbers and the Consistency of M-Estimators of Cross-Section and Time Series Models

The Uniform Weak Law of Large Numbers and the Consistency of M-Estimators of Cross-Section and Time Series Models The Uniform Weak Law of Large Numbers and the Consistency of M-Estimators of Cross-Section and Time Series Models Herman J. Bierens Pennsylvania State University September 16, 2005 1. The uniform weak

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

1 Procedures robust to weak instruments

1 Procedures robust to weak instruments Comment on Weak instrument robust tests in GMM and the new Keynesian Phillips curve By Anna Mikusheva We are witnessing a growing awareness among applied researchers about the possibility of having weak

More information

11. Further Issues in Using OLS with TS Data

11. Further Issues in Using OLS with TS Data 11. Further Issues in Using OLS with TS Data With TS, including lags of the dependent variable often allow us to fit much better the variation in y Exact distribution theory is rarely available in TS applications,

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

On detection of unit roots generalizing the classic Dickey-Fuller approach

On detection of unit roots generalizing the classic Dickey-Fuller approach On detection of unit roots generalizing the classic Dickey-Fuller approach A. Steland Ruhr-Universität Bochum Fakultät für Mathematik Building NA 3/71 D-4478 Bochum, Germany February 18, 25 1 Abstract

More information

Multivariate Time Series

Multivariate Time Series Multivariate Time Series Notation: I do not use boldface (or anything else) to distinguish vectors from scalars. Tsay (and many other writers) do. I denote a multivariate stochastic process in the form

More information

ARIMA Modelling and Forecasting

ARIMA Modelling and Forecasting ARIMA Modelling and Forecasting Economic time series often appear nonstationary, because of trends, seasonal patterns, cycles, etc. However, the differences may appear stationary. Δx t x t x t 1 (first

More information

Introduction to Stochastic processes

Introduction to Stochastic processes Università di Pavia Introduction to Stochastic processes Eduardo Rossi Stochastic Process Stochastic Process: A stochastic process is an ordered sequence of random variables defined on a probability space

More information

This note introduces some key concepts in time series econometrics. First, we

This note introduces some key concepts in time series econometrics. First, we INTRODUCTION TO TIME SERIES Econometrics 2 Heino Bohn Nielsen September, 2005 This note introduces some key concepts in time series econometrics. First, we present by means of examples some characteristic

More information

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED by W. Robert Reed Department of Economics and Finance University of Canterbury, New Zealand Email: bob.reed@canterbury.ac.nz

More information

ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications

ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications Yongmiao Hong Department of Economics & Department of Statistical Sciences Cornell University Spring 2019 Time and uncertainty

More information

Nonlinear time series

Nonlinear time series Based on the book by Fan/Yao: Nonlinear Time Series Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 27, 2009 Outline Characteristics of

More information

Spectral representations and ergodic theorems for stationary stochastic processes

Spectral representations and ergodic theorems for stationary stochastic processes AMS 263 Stochastic Processes (Fall 2005) Instructor: Athanasios Kottas Spectral representations and ergodic theorems for stationary stochastic processes Stationary stochastic processes Theory and methods

More information

Uncertainty and Disagreement in Equilibrium Models

Uncertainty and Disagreement in Equilibrium Models Uncertainty and Disagreement in Equilibrium Models Nabil I. Al-Najjar & Northwestern University Eran Shmaya Tel Aviv University RUD, Warwick, June 2014 Forthcoming: Journal of Political Economy Motivation

More information

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Prof. Massimo Guidolin 019 Financial Econometrics Winter/Spring 018 Overview ARCH models and their limitations Generalized ARCH models

More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p GMM and SMM Some useful references: 1. Hansen, L. 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p. 1029-54. 2. Lee, B.S. and B. Ingram. 1991 Simulation estimation

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Robust Unit Root and Cointegration Rank Tests for Panels and Large Systems *

Robust Unit Root and Cointegration Rank Tests for Panels and Large Systems * February, 2005 Robust Unit Root and Cointegration Rank Tests for Panels and Large Systems * Peter Pedroni Williams College Tim Vogelsang Cornell University -------------------------------------------------------------------------------------------------------------------

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Review Session: Econometrics - CLEFIN (20192)

Review Session: Econometrics - CLEFIN (20192) Review Session: Econometrics - CLEFIN (20192) Part II: Univariate time series analysis Daniele Bianchi March 20, 2013 Fundamentals Stationarity A time series is a sequence of random variables x t, t =

More information

Multivariate GARCH models.

Multivariate GARCH models. Multivariate GARCH models. Financial market volatility moves together over time across assets and markets. Recognizing this commonality through a multivariate modeling framework leads to obvious gains

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information