Elements of Asymptotic Theory

James L. Powell
Department of Economics
University of California, Berkeley

Objectives of Asymptotic Theory

While exact results are available for, say, the distribution of the classical least squares estimator for the normal linear regression model, and for other leading special combinations of distributions and statistics, distributional results are generally unavailable when statistics are nonlinear in the underlying data and/or when the joint distribution of those data is not parametrically specified. Fortunately, most statistics of interest can be shown to be smooth functions of sample averages of some observable random variables, in which case, under fairly general conditions, their distributions can be approximated by a normal distribution, usually with a covariance matrix that shrinks to zero as the number of random variables in the averages increases. Thus, as the sample size increases, the gap between the true distribution of the statistic and its approximating distribution asymptotes to zero, motivating the term asymptotic theory.

There are two main steps in the approximation of the distribution of a statistic by a normal distribution. The first step is the demonstration that a sample average converges (in an appropriate sense) to its expectation, and that a standardized version of the statistic has a distribution function that converges to a standard normal. Such results are called limit theorems; the former (approximation of a sample average by its expectation) is called a law of large numbers (abbreviated LLN), and the second (approximation of the distribution of a standardized sample average by a standard normal distribution) is known as a central limit theorem (abbreviated CLT). There are a number of LLNs and CLTs in the statistics literature, which impose different combinations of restrictions on the number of moments of the variables being averaged and the dependence between them; some basic versions are given below.
The second step in deriving a normal approximation to the distribution of a smooth function of sample averages is a demonstration that this smooth function is approximately a linear function of sample averages (which is itself a sample average). Useful results for this step, which use the usual calculus approximation of smooth (differentiable) functions by linear functions, are grouped under the heading Slutsky theorems, after the author of one of the particularly useful results. So, as long as the conditions of the limit theorems and Slutsky theorems apply, smooth functions of sample averages are approximately sample averages themselves, and thus their distribution functions are approximately normal. The whole apparatus of asymptotic theory is made up of the regularity conditions for applicability of the limit and Slutsky theorems, along with the rules for mathematical manipulation of the approximating objects.

In order to discuss approximation of a statistic by a constant, or of its distribution by a normal distribution, we first have to extend the usual definitions of limits of deterministic sequences to accommodate sequences of random variables (yielding the concept of convergence in probability) or their distribution functions (convergence in distribution).

Convergence in Distribution and Probability

For most applications of asymptotic theory to statistical problems, the objects of interest are sequences of random vectors (or matrices), say $X_N$ or $Y_N$, which are generally statistics indexed by the sample size $N$; the object is to find simple approximations for the distribution functions of $X_N$ or $Y_N$ which can be made arbitrarily close to their true distribution functions for $N$ large enough. Writing the cumulative distribution function of $X_N$ as
\[
F_N(x) \equiv \Pr\{X_N \le x\}
\]
(where the argument $x$ has the same dimension as $X_N$ and the inequality holds only if it is true for each component), we define convergence in distribution (sometimes called convergence in law) in terms of convergence of the function $F_N(x)$ to some fixed distribution function $F(x)$ at any point $x$ where $F$ is continuous. That is, the random vector $X_N$ converges in distribution to $X$, denoted $X_N \stackrel{d}{\to} X$ as $N \to \infty$, if
\[
\lim_{N \to \infty} F_N(x) = F(x)
\]
at all points $x$ where $F(x)$ is continuous. A few comments on this definition and its implications are warranted:

1. The limiting random vector $X$ in this definition is really just a placeholder for the distribution function $F(x)$; there need be no real random vector $X$ which has the limiting distribution function. In fact, it is often customary to replace $X$ with the limiting distribution function $F(x)$ in the definition, i.e., to write $X_N \stackrel{d}{\to} F(x)$ as $N \to \infty$; if, say, $F(x)$ is a multivariate standard normal, then we would write $X_N \stackrel{d}{\to} N(0, I)$, where the condition $N \to \infty$ is usually not stated, but implied.

2. If $X_N \stackrel{d}{\to} F(x)$, then we can use the c.d.f. $F(x)$ to approximate probabilities for $X_N$ if $N$ is large, i.e.,
\[
\Pr\{X_N \in A\} \cong \int_A dF(x),
\]
where the symbol $\cong$ means that the difference between the left- and right-hand sides goes to zero as $N$ increases.

3. We would like to construct statistics so that their limiting distributions are of convenient (tabulated) form, e.g., a normal or chi-squared distribution. This often involves standardizing the statistics by subtracting off their expectations and dividing by their standard deviations (or the matrix equivalent).

4. If $\lim_N F_N(x)$ is discontinuous, it may not be a distribution function, which is why the qualification "at all points where $F(x)$ is continuous" has to be appended to the definition of convergence in distribution. To make this more concrete, suppose $X_N \sim N\!\left(0, \frac{1}{N}\right)$, so that
\[
F_N(x) = \Phi\!\left(\sqrt{N}\, x\right) \to
\begin{cases}
0 & \text{if } x < 0, \\
\frac{1}{2} & \text{if } x = 0, \\
1 & \text{if } x > 0.
\end{cases}
\]
Obviously here $\lim_N F_N(x)$ is not right-continuous at zero, so we would never use it to approximate probabilities for $X_N$ (since it isn't a c.d.f.). Instead, a reasonable approximating function for $F_N(x)$ would be
\[
F(x) \equiv 1\{x \ge 0\},
\]
where $1\{A\}$ is the indicator function of the statement $A$, i.e., it equals one if $A$ is true and is zero otherwise. The function $F(x)$ is the c.d.f. of a degenerate random variable $X$ which equals zero with probability one, which is a pretty good approximation for $F_N(x)$ when $N$ is large for this problem.

The last example, convergence of the c.d.f. $F_N(x)$ to the c.d.f. of a degenerate ("nonrandom") random variable, is a special case of the other key convergence concept for asymptotic theory, convergence in probability. Following Ruud's text, we define convergence in probability of the sequence of random vectors $X_N$ to the (nonrandom) vector $x$, denoted $X_N \stackrel{p}{\to} x$ or $\operatorname{plim} X_N = x$ as $N \to \infty$, if $X_N - x \stackrel{d}{\to} 0$, i.e., if the c.d.f. of $X_N - x$ satisfies
\[
F_{X_N - x}(\delta) \to 1\{0 \le \delta\}
\]
whenever all components of $\delta$ are nonzero. (The symbol "plim" stands for the term "probability limit.") We can generalize this definition of convergence in probability of $X_N$ to a constant $x$ to permit $X_N$ to converge in probability to a nondegenerate random vector $X$, by specifying that $X_N \stackrel{p}{\to} X$ means that $X_N - X \stackrel{d}{\to} 0$.

An equivalent definition of convergence in probability, which is the definition usually seen in textbooks, is that $X_N \stackrel{p}{\to} x$ if, for any number $\varepsilon > 0$,
\[
\Pr\{\|X_N - x\| > \varepsilon\} \to 0
\]
as $N \to \infty$. Thus, $X_N$ converges in probability to $x$ if all the probability mass for $X_N$ eventually ends up in an $\varepsilon$-neighborhood of $x$, no matter how small $\varepsilon$ is. Showing the equivalence of these two definitions of probability limit is a routine exercise in probability theory and real analysis.

A stronger form of convergence of $X_N$ to $x$, which can often be used to show convergence in probability, is quadratic mean convergence. We say $X_N \stackrel{qm}{\to} x$ as $N \to \infty$ if
\[
E[\|X_N - x\|^2] \to 0.
\]
If this condition can be shown, then it follows that $X_N$ converges in probability to $x$ because of Markov's Inequality: if $Z$ is a nonnegative (scalar) random variable, i.e., $\Pr\{Z \ge 0\} = 1$, then
\[
\Pr\{Z > K\} \le \frac{E(Z)}{K}
\]
for any $K > 0$. The proof of this inequality is worth remembering; it is based upon a decomposition of $E(Z)$ (which may be infinite) as
\[
E(Z) = \int_0^\infty z \, dF_Z(z) = \int_0^K z \, dF_Z(z) + \int_K^\infty (z - K) \, dF_Z(z) + \int_K^\infty K \, dF_Z(z).
\]
Since the first two terms in this sum are nonnegative, it follows that
\[
E(Z) \ge \int_K^\infty K \, dF_Z(z) = K \Pr\{Z > K\},
\]
from which Markov's inequality immediately follows. A better-known variant of this inequality, Tchebysheff's Inequality, takes $Z = (X - E(X))^2$ for a scalar random variable $X$ (with finite second moment) and $K = \varepsilon^2$ for any $\varepsilon > 0$, from which it follows that
\[
\Pr\{|X - E(X)| > \varepsilon\} = \Pr\{(X - E(X))^2 > \varepsilon^2\} \le \frac{\operatorname{Var}(X)}{\varepsilon^2},
\]
since $\operatorname{Var}(X) = E(Z)$ for this example. A similar consequence of Markov's inequality is that convergence in quadratic mean implies convergence in probability: if $E[\|X_N - x\|^2] \to 0$, then
\[
\Pr\{\|X_N - x\| > \varepsilon\} \le \frac{E[\|X_N - x\|^2]}{\varepsilon^2} \to 0
\]
for any $\varepsilon > 0$.

There is another, stronger version of convergence of a random variable to a limit, known as almost sure convergence (or convergence almost everywhere, or convergence with probability one), which is denoted $X_N \stackrel{a.s.}{\to} x$. Rather than give the detailed definition of this form of convergence, which is best left to a more advanced course, it is worth pointing out that it is indeed stronger than convergence in probability (which is sometimes called weak convergence for that reason), but that, for the usual purposes of asymptotic theory, the weaker concept is sufficient.

Limit Theorems

With the concepts of convergence in probability and distribution, we can more precisely state how a sample average is approximately equal to its expectation, or how its standardized version is approximately normal. The first result, on convergence in probability, is the

Weak Law of Large Numbers (WLLN): If $\bar{X}_N \equiv \frac{1}{N} \sum_{i=1}^N X_i$ is a sample average of scalar, i.i.d. random variables $\{X_i\}_{i=1}^N$ which have $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$ finite, then $\bar{X}_N \stackrel{p}{\to} \mu = E(X_i)$.

Proof: By the usual calculations for means and variances of sums of i.i.d. random variables,
\[
E(\bar{X}_N) = \mu, \qquad \operatorname{Var}(\bar{X}_N) = \frac{\sigma^2}{N} = E(\bar{X}_N - \mu)^2,
\]
which tends to zero as $N \to \infty$. By definition, $\bar{X}_N$ tends to $\mu$ in quadratic mean, and thus in probability.

Inspection of the proof of the WLLN suggests a more general result for averages of random variables that are not i.i.d., namely, that $\bar{X}_N - E(\bar{X}_N) \stackrel{p}{\to} 0$ if
\[
\frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \operatorname{Cov}(X_i, X_j) \to 0.
\]
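The quadratic-mean calculation at the heart of the WLLN proof, $\operatorname{Var}(\bar{X}_N) = \sigma^2/N$, is easy to check by simulation. The sketch below (the normal distribution, seed, and sample sizes are illustrative choices, not from the text) estimates $E(\bar{X}_N - \mu)^2$ by Monte Carlo and compares it with $\sigma^2/N$:

```python
import random
import statistics

random.seed(12345)

mu, sigma, reps = 2.0, 3.0, 2000  # illustrative population and replication counts

def sample_mean(n):
    """Average of n i.i.d. N(mu, sigma^2) draws."""
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))

# Monte Carlo estimate of E[(X_bar_N - mu)^2]; the WLLN proof shows this
# equals sigma^2 / N under i.i.d. sampling, so it should shrink like 1/N.
mse = {n: statistics.fmean((sample_mean(n) - mu) ** 2 for _ in range(reps))
       for n in (10, 100, 1000)}

for n, m in mse.items():
    print(f"N={n:5d}  simulated MSE={m:.4f}  theory sigma^2/N={sigma**2 / n:.4f}")
```

As $N$ grows by a factor of ten, the simulated mean-squared error shrinks by roughly the same factor, which is exactly the $\sigma^2/N$ rate that delivers convergence in quadratic mean, and hence in probability.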
This general condition can be specialized to yield other LLNs for heterogeneous (non-constant variances) or dependent (nonzero covariances) data. Other laws of large numbers exist which relax conditions on the existence of second moments, and which show almost sure convergence (these are known as "strong laws of large numbers"). The next main result, on approximate normality of sample averages, again applies to i.i.d. data:

Central Limit Theorem (Lindeberg-Lévy): If $\bar{X}_N \equiv \frac{1}{N} \sum_{i=1}^N X_i$ is a sample average of scalar, i.i.d. random variables $\{X_i\}_{i=1}^N$ which have $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$ finite, then
\[
Z_N \equiv \frac{\bar{X}_N - E(\bar{X}_N)}{\sqrt{\operatorname{Var}(\bar{X}_N)}} = \sqrt{N} \left( \frac{\bar{X}_N - \mu}{\sigma} \right) \stackrel{d}{\to} N(0, 1).
\]

The proof of this result is a bit more involved, requiring manipulation of characteristic functions, and will not be presented here. The result of this CLT is often rewritten as
\[
\sqrt{N}(\bar{X}_N - \mu) \stackrel{d}{\to} N(0, \sigma^2),
\]
which points out that $\bar{X}_N$ converges to its mean $\mu$ at exactly the rate $\sqrt{N}$ increases, so that the product eventually "balances out" to yield a random variable with a normal distribution. Another way to rewrite the consequence of the CLT is
\[
\bar{X}_N \stackrel{A}{\sim} N\!\left(\mu, \frac{\sigma^2}{N}\right),
\]
where the symbol $\stackrel{A}{\sim}$ means "is approximately distributed as" (or "is asymptotically distributed as"). It is very important here to make the distinction between the limiting distribution of the standardized mean $Z_N$, which is a standard normal, and the asymptotic distribution of the (non-standardized) sample mean $\bar{X}_N$, which is actually a sequence of approximating normal distributions with variances that shrink to zero at speed $1/N$. In short, a limiting distribution cannot depend upon $N$, which has passed to its (infinite) limit, while an asymptotic distribution can involve the sample size $N$.

Like the Weak Law of Large Numbers, the Lindeberg-Lévy Central Limit Theorem admits a number of different generalizations (often named after their authors) which relax the assumption of i.i.d. data while imposing various stronger restrictions on the existence of moments or the tail behavior of the true data distributions.

To this point, the WLLN and CLT apply only to scalar random variables $\{X_i\}$; to extend them to random vectors $\{X_i\}$, we can use the Cramér-Wold device, which is really a theorem stating that if the scalar random variable $Y_N \equiv \lambda' X_N$ converges in distribution to $Y \equiv \lambda' X$ for any fixed vector $\lambda$ having the same dimension as $X_N$ (that is, if $\lambda' X_N \stackrel{d}{\to} \lambda' X$ for any $\lambda$), then the vector $X_N$ converges in distribution to $X$, i.e., $X_N \stackrel{d}{\to} X$. This result immediately yields a multivariate version of the Lindeberg-Lévy CLT: a sample mean $\bar{X}_N$ of i.i.d. random vectors $\{X_i\}$ with $E(X_i) = \mu$ and $V(X_i) = \Sigma$ has
\[
\sqrt{N}(\bar{X}_N - \mu) \stackrel{d}{\to} N(0, \Sigma).
\]
And, since convergence in probability is a special case of convergence in distribution, the Cramér-Wold device also immediately yields a multivariate version of the WLLN.

Slutsky Theorems

To extend the limit theorems for sample averages to statistics which are functions of sample averages, asymptotic theory uses smoothness properties of those functions (i.e., continuity and differentiability) to approximate those functions by polynomials, usually constant or linear functions. The simplest of these approximation results is the continuity theorem, which states that probability limits share an important property of ordinary limits, namely, that the probability limit of a continuous function is the value of that function evaluated at the probability limit:
Continuity Theorem: If $X_N \stackrel{p}{\to} x_0$ and the function $g(x)$ is continuous at $x = x_0$, then $g(X_N) \stackrel{p}{\to} g(x_0)$.

Proof: For any $\varepsilon > 0$, continuity of the function $g(x)$ at $x_0$ implies that there is some $\delta > 0$ so that $\|X_N - x_0\| \le \delta$ implies that $\|g(X_N) - g(x_0)\| \le \varepsilon$. Thus
\[
\Pr\{\|g(X_N) - g(x_0)\| \le \varepsilon\} \ge \Pr\{\|X_N - x_0\| \le \delta\}.
\]
But the probability on the right-hand side converges to one because $X_N \stackrel{p}{\to} x_0$, so the probability on the left-hand side does, too.

With the continuity theorem, it should be clear that probability limits are more conveniently manipulated than, say, mathematical expectations, since, for example, if
\[
\hat{\theta}_N \equiv \frac{\bar{Y}_N}{\bar{X}_N}
\]
is an estimator of the ratio of the means of $Y_i$ and $X_i$,
\[
\theta_0 \equiv \frac{E(Y_i)}{E(X_i)} \equiv \frac{\mu_Y}{\mu_X},
\]
then $\hat{\theta}_N \stackrel{p}{\to} \theta_0$ as long as $\mu_X \ne 0$, even though $E(\hat{\theta}_N) \ne \theta_0$ in general.

Another key approximation result is Slutsky's Theorem, which states that the sum (or product) of a random vector that converges in probability and another that converges in distribution itself converges in distribution:

Slutsky's Theorem: If the random vectors $X_N$ and $Y_N$ have the same length, and if $X_N \stackrel{p}{\to} x_0$ and $Y_N \stackrel{d}{\to} Y$, then
\[
X_N + Y_N \stackrel{d}{\to} x_0 + Y \qquad \text{and} \qquad X_N' Y_N \stackrel{d}{\to} x_0' Y.
\]

The typical application of Slutsky's theorem is to estimators that can be represented in terms of a linear approximation involving some statistics that converge either in probability or in distribution. To be more concrete, suppose the (scalar) estimator $\hat{\theta}_N$ can be decomposed as
\[
\sqrt{N}(\hat{\theta}_N - \theta_0) = X_N' Y_N + Z_N,
\]
where $X_N \stackrel{p}{\to} x_0$, $Y_N \stackrel{d}{\to} N(0, \Sigma)$, and $Z_N \stackrel{p}{\to} 0$. This decomposition is often obtained by a first-order Taylor's series expansion of $\hat{\theta}_N$, and the components $X_N$, $Y_N$, and $Z_N$ are sample averages to which a LLN or CLT can be applied; an example is the so-called "delta method" discussed below. Under these conditions, Slutsky's theorem implies that
\[
\sqrt{N}(\hat{\theta}_N - \theta_0) \stackrel{d}{\to} N(0, x_0' \Sigma x_0).
\]

Both the continuity theorem and Slutsky's theorem are special cases of the continuous mapping theorem, which says that convergence in distribution, like convergence in probability, interchanges with continuous functions, as long as they are continuous everywhere:

Continuous Mapping Theorem: If $X_N \stackrel{d}{\to} X$ and the function $g(x)$ is continuous for all $x$, then $g(X_N) \stackrel{d}{\to} g(X)$.

The continuity requirement for $g(x)$ can be relaxed to hold only on the support of $X$ if needed. The continuous mapping theorem can be used to get approximations for the distributions of test statistics based upon asymptotically-normal estimators; for example, if
\[
Z_N \equiv \sqrt{N}\, \Sigma^{-1/2} (\hat{\theta}_N - \theta_0) \stackrel{d}{\to} N(0, I),
\]
then
\[
T \equiv Z_N' Z_N = N (\hat{\theta}_N - \theta_0)' \Sigma^{-1} (\hat{\theta}_N - \theta_0) \stackrel{d}{\to} \chi^2_p,
\]
where $p \equiv \dim\{\hat{\theta}_N\}$.

The final approximation result discussed here, which is an application of Slutsky's theorem, is the so-called "delta method," which states that continuously-differentiable functions of asymptotically-normal statistics are themselves asymptotically normal:

The Delta Method: If $\hat{\theta}_N$ is a random vector which is asymptotically normal, with
\[
\sqrt{N}(\hat{\theta}_N - \theta_0) \stackrel{d}{\to} N(0, \Sigma)
\]
for some $\theta_0$, and if $g(\theta)$ is continuously differentiable at $\theta = \theta_0$ with Jacobian matrix
\[
G_0 \equiv \frac{\partial g(\theta_0)}{\partial \theta'}
\]
that has full row rank, then
\[
\sqrt{N}\left(g(\hat{\theta}_N) - g(\theta_0)\right) \stackrel{d}{\to} N(0, G_0 \Sigma G_0').
\]

Proof: First suppose that $g(\theta)$ is scalar-valued. Since $\hat{\theta}_N$ converges in probability to $\theta_0$ (by Slutsky's Theorem), and since $\partial g(\theta)/\partial \theta'$ is continuous at $\theta_0$, the Mean Value Theorem of differential calculus implies that
\[
g(\hat{\theta}_N) = g(\theta_0) + \frac{\partial g(\theta_N^*)}{\partial \theta'} (\hat{\theta}_N - \theta_0)
\]
for some value of $\theta_N^*$ between $\hat{\theta}_N$ and $\theta_0$ (with arbitrarily high probability). Since $\hat{\theta}_N \stackrel{p}{\to} \theta_0$, it follows that $\theta_N^* \stackrel{p}{\to} \theta_0$, so
\[
\frac{\partial g(\theta_N^*)}{\partial \theta'} \stackrel{p}{\to} G_0 \equiv \frac{\partial g(\theta_0)}{\partial \theta'}.
\]
Thus, by Slutsky's Theorem,
\[
\sqrt{N}\left(g(\hat{\theta}_N) - g(\theta_0)\right) = \frac{\partial g(\theta_N^*)}{\partial \theta'} \sqrt{N}(\hat{\theta}_N - \theta_0) \stackrel{d}{\to} G_0 \cdot N(0, \Sigma) = N(0, G_0 \Sigma G_0').
\]

To extend this argument to the case where $g(\hat{\theta}_N)$ is vector-valued, use the same argument to show that any (scalar) linear combination $\lambda' g(\hat{\theta}_N)$ has the asymptotic distribution
\[
\sqrt{N}\left(\lambda' g(\hat{\theta}_N) - \lambda' g(\theta_0)\right) \stackrel{d}{\to} N(0, \lambda' G_0 \Sigma G_0' \lambda),
\]
from which the result follows from the Cramér-Wold device.
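As a concluding illustration, the delta method can be checked numerically. The sketch below (the choice $g(\theta) = \theta^2$, the normal distribution, and all constants are illustrative, not from the text) simulates the scaled statistic $\sqrt{N}(g(\bar{X}_N) - g(\mu))$ and compares its variance with the delta-method value $G_0 \Sigma G_0' = (2\mu)^2 \sigma^2$:

```python
import math
import random
import statistics

random.seed(2024)

# Scalar illustration of the delta method with g(theta) = theta^2:
# if sqrt(N)(X_bar - mu) ->d N(0, sigma^2), then
# sqrt(N)(g(X_bar) - g(mu)) ->d N(0, (2 mu)^2 sigma^2).
mu, sigma = 2.0, 1.0        # illustrative population values
n, reps = 400, 4000         # sample size and Monte Carlo replications

def g(t):
    return t * t

G0 = 2.0 * mu               # derivative g'(mu), the 1x1 "Jacobian"
avar = G0 ** 2 * sigma ** 2 # delta-method asymptotic variance = 16

draws = []
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    draws.append(math.sqrt(n) * (g(xbar) - g(mu)))

sim_var = statistics.pvariance(draws)
print(f"simulated variance of sqrt(N)(g(X_bar) - g(mu)): {sim_var:.2f}")
print(f"delta-method asymptotic variance G0^2 sigma^2:   {avar:.2f}")
```

The simulated variance of the scaled statistic lands close to the delta-method value, even though $g$ is nonlinear, because for large $N$ the estimator stays close enough to $\mu$ that the linear term of the Taylor expansion dominates.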