Elements of Asymptotic Theory

James L. Powell
Department of Economics
University of California, Berkeley

Objectives of Asymptotic Theory

While exact results are available for, say, the distribution of the classical least squares estimator for the normal linear regression model, and for other leading special combinations of distributions and statistics, distributional results are generally unavailable when statistics are nonlinear in the underlying data and/or when the joint distribution of those data is not parametrically specified. Fortunately, most statistics of interest can be shown to be smooth functions of sample averages of some observable random variables, in which case, under fairly general conditions, their distributions can be approximated by a normal distribution, usually with a covariance matrix that shrinks to zero as the number of random variables in the averages increases. Thus, as the sample size increases, the gap between the true distribution of the statistic and its approximating distribution asymptotes to zero, motivating the term "asymptotic theory."

There are two main steps in the approximation of the distribution of a statistic by a normal distribution. The first step is the demonstration that a sample average converges (in an appropriate sense) to its expectation, and that a standardized version of the statistic has a distribution function that converges to a standard normal. Such results are called limit theorems; the former (approximation of a sample average by its expectation) is called a law of large numbers (abbreviated LLN), and the second (approximation of the distribution of a standardized sample average by a standard normal distribution) is known as a central limit theorem (abbreviated CLT). There are a number of LLNs and CLTs in the statistics literature, which impose different combinations of restrictions on the number of moments of the variables being averaged and the dependence between them; some basic versions are given below.
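As a quick numerical preview of these two approximations, the following sketch (the Uniform(0,1) design, the sample sizes, and the replication count are all invented for illustration) checks that a sample average settles near its expectation, and that the standardized mean lies inside the usual normal interval roughly 95% of the time:

```python
import math
import random

random.seed(0)

# Invented illustration: i.i.d. Uniform(0, 1) draws, so E(X_i) = 0.5
# and Var(X_i) = 1/12.
mu, sigma = 0.5, math.sqrt(1 / 12)

# Law of large numbers: the sample average settles near its expectation.
means = {N: sum(random.random() for _ in range(N)) / N for N in (10, 1000, 100_000)}
print(means)

# Central limit theorem: the standardized mean is approximately N(0, 1),
# so |Z_N| <= 1.96 should occur with probability close to 0.95.
N, reps = 200, 2000
hits = sum(
    abs(math.sqrt(N) * (sum(random.random() for _ in range(N)) / N - mu) / sigma) <= 1.96
    for _ in range(reps)
)
coverage = hits / reps
print("coverage:", coverage)
```

With the seed fixed, the largest sample's mean is within a fraction of a percent of 0.5, and the coverage frequency is close to the nominal 0.95.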
The second step in deriving a normal approximation of the distribution of a smooth function of sample averages is the demonstration that this smooth function is approximately a linear function of sample averages (which is itself a sample average). Useful results for this step, which use the usual calculus approximation of smooth (differentiable) functions by linear functions, are grouped under the heading Slutsky theorems, after the author of one of the particularly useful results. So, as long as the conditions of the limit theorems and Slutsky theorems apply, smooth functions of sample averages are approximately sample averages
themselves, and thus their distribution functions are approximately normal. The whole apparatus of asymptotic theory is made up of the regularity conditions for applicability of the limit and Slutsky theorems, along with the rules for mathematical manipulation of the approximating objects. In order to discuss approximation of a statistic by a constant, or of its distribution by a normal distribution, we first have to extend the usual definitions of limits of deterministic sequences to accommodate sequences of random variables (yielding the concept of convergence in probability) or their distribution functions (convergence in distribution).

Convergence in Distribution and Probability

For most applications of asymptotic theory to statistical problems, the objects of interest are sequences of random vectors (or matrices), say $X_N$ or $Y_N$, which are typically statistics indexed by the sample size $N$; the object is to find simple approximations for the distribution functions of $X_N$ or $Y_N$ which can be made arbitrarily close to their true distribution functions for $N$ large enough. Writing the cumulative distribution function of $X_N$ as $F_N(x) \equiv \Pr\{X_N \le x\}$ (where the argument $x$ has the same dimension as $X_N$ and the inequality holds only if it is true for each component), we define convergence in distribution (sometimes called convergence in law) in terms of convergence of the function $F_N(x)$ to some fixed distribution function $F(x)$ at any point $x$ where $F$ is continuous. That is, the random vector $X_N$ converges in distribution to $X$, denoted $X_N \overset{d}{\to} X$ as $N \to \infty$, if

$$\lim_{N \to \infty} F_N(x) = F(x)$$

at all points $x$ where $F(x)$ is continuous. A few comments on this definition and its implications are warranted:

1. The limiting random vector $X$ in this definition is really just a placeholder for the distribution function $F(x)$; there need be no real random vector $X$ which has the limiting distribution function. In fact, it is often customary to replace $X$ with the limiting distribution function $F(x)$ in the
definition, i.e., to write $X_N \overset{d}{\to} F(x)$ as $N \to \infty$; if, say, $F(x)$ is a multivariate standard normal, then we would write $X_N \overset{d}{\to} N(0, I)$, where the condition $N \to \infty$ is usually not stated, but implied.

2. If $X_N \overset{d}{\to} F(x)$, then we can use the c.d.f. $F(x)$ to approximate probabilities for $X_N$ if $N$ is large, i.e.,

$$\Pr\{X_N \in A\} \cong \int_A dF(x),$$

where the symbol $\cong$ means that the difference between the left- and right-hand sides goes to zero as $N$ increases.

3. We would like to construct statistics so that their limiting distributions are of convenient (tabulated) form, e.g., a normal or chi-squared distribution. This often involves standardizing the statistics by subtracting their expectations and dividing by their standard deviations (or the matrix equivalent).

4. If $\lim_{N \to \infty} F_N(x)$ is discontinuous, it may not be a distribution function, which is why the qualification "at all points where $F(x)$ is continuous" has to be appended to the definition of convergence in distribution. To make this more concrete, suppose

$$X_N \sim N\left(0, \frac{1}{N}\right),$$

so that

$$F_N(x) = \Phi(\sqrt{N}\,x) \to \begin{cases} 0 & \text{if } x < 0, \\ \tfrac{1}{2} & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$$

Obviously here $\lim_{N \to \infty} F_N(x)$ is not right-continuous, so we would never use it to approximate probabilities for $X_N$ (since it isn't a c.d.f.). Instead, a reasonable approximating function for $F_N(x)$ would be

$$F(x) \equiv 1\{x \ge 0\},$$
where $1\{A\}$ is the indicator function of the statement $A$, i.e., it equals one if $A$ is true and zero otherwise. The function $F(x)$ is the c.d.f. of a degenerate random variable $X$ which equals zero with probability one, which is a pretty good approximation for $F_N(x)$ when $N$ is large for this problem.

The last example, convergence of the c.d.f. $F_N(x)$ to the c.d.f. of a degenerate ("nonrandom") random variable, is a special case of the other key convergence concept for asymptotic theory, convergence in probability. Following Ruud's text, we define convergence in probability of the sequence of random vectors $X_N$ to the (nonrandom) vector $x$, denoted

$$\operatorname{plim}_{N \to \infty} X_N = x$$

or

$$X_N \overset{p}{\to} x \quad \text{as } N \to \infty,$$

if

$$X_N \overset{d}{\to} x,$$

i.e.,

$$F_N(c) \equiv \Pr\{X_N \le c\} \to 1\{x \le c\}$$

whenever all components of $x$ are different from those of $c$. (The symbol "plim" stands for the term "probability limit.") We can generalize this definition of convergence in probability of $X_N$ to a constant $x$ to permit $X_N$ to converge in probability to a nondegenerate random vector $X$, by specifying that $X_N \overset{p}{\to} X$ means that $X_N - X \overset{d}{\to} 0$. An equivalent definition of convergence in probability, which is the definition usually seen in textbooks, is that $X_N \overset{p}{\to} x$ if, for any number $\varepsilon > 0$,

$$\Pr\{\|X_N - x\| > \varepsilon\} \to 0$$
as $N \to \infty$. Thus, $X_N$ converges in probability to $x$ if all the probability mass for $X_N$ eventually ends up in an $\varepsilon$-neighborhood of $x$, no matter how small $\varepsilon$ is. Showing the equivalence of these two definitions of probability limit is a routine exercise in probability theory and real analysis.

A stronger form of convergence of $X_N$ to a nonrandom vector $x$, which can often be used to show convergence in probability, is quadratic mean convergence. We say

$$X_N \overset{qm}{\to} x \quad \text{as } N \to \infty$$

if

$$E[\|X_N - x\|^2] \to 0.$$

If this condition can be shown, then it follows that $X_N$ converges in probability to $x$ because of Markov's Inequality: if $Z$ is a nonnegative (scalar) random variable, i.e., $\Pr\{Z \ge 0\} = 1$, then for any $K > 0$,

$$\Pr\{Z > K\} \le \frac{E(Z)}{K}.$$

The proof of this inequality is worth remembering; it is based upon a decomposition of $E(Z)$ (which may be infinite) as

$$E(Z) = \int_0^\infty z\, dF_Z(z) = \int_0^K z\, dF_Z(z) + \int_K^\infty (z - K)\, dF_Z(z) + \int_K^\infty K\, dF_Z(z).$$

Since the first two terms in this sum are nonnegative, it follows that

$$E(Z) \ge \int_K^\infty K\, dF_Z(z) \ge K \Pr\{Z > K\},$$

from which Markov's inequality immediately follows. A better-known variant of this inequality, Tchebyshev's Inequality, takes $Z = (X - E(X))^2$ for a scalar random variable $X$ (with finite second moment) and $K = \varepsilon^2$ for any $\varepsilon > 0$, from which it follows that

$$\Pr\{|X - E(X)| > \varepsilon\} = \Pr\{(X - E(X))^2 > \varepsilon^2\} \le \frac{\operatorname{Var}(X)}{\varepsilon^2},$$

since $\operatorname{Var}(X) = E(Z)$ for this example. A similar consequence of Markov's inequality is that convergence in quadratic mean implies convergence in probability: if $E[\|X_N - x\|^2] \to 0$,
then

$$\Pr\{\|X_N - x\| > \varepsilon\} \le \frac{E[\|X_N - x\|^2]}{\varepsilon^2} \to 0$$

for any $\varepsilon > 0$.

There is another, stronger version of convergence of a random variable to a limit, known as almost sure convergence (or convergence almost everywhere, or convergence with probability one), which is denoted $X_N \overset{a.s.}{\to} x$. Rather than give the detailed definition of this form of convergence, which is best left to a more advanced course, it is worth pointing out that it is indeed stronger than convergence in probability (which is sometimes called weak convergence for that reason), but that, for the usual purposes of asymptotic theory, the weaker concept is sufficient.

Limit Theorems

With the concepts of convergence in probability and distribution, we can more precisely state how a sample average is approximately equal to its expectation, or how its standardized version is approximately normal. The first result, on convergence in probability, is the

Weak Law of Large Numbers (WLLN): If $\bar{X}_N \equiv \frac{1}{N}\sum_{i=1}^N X_i$ is a sample average of scalar, i.i.d. random variables $\{X_i\}_{i=1}^N$ which have $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$ finite, then

$$\bar{X}_N \overset{p}{\to} \mu = E(X_i).$$

Proof: By the usual calculations for means and variances of sums of i.i.d. random variables,

$$E(\bar{X}_N) = \mu, \qquad \operatorname{Var}(\bar{X}_N) = \frac{\sigma^2}{N} = E(\bar{X}_N - \mu)^2,$$

which tends to zero as $N \to \infty$. By definition, $\bar{X}_N$ tends to $\mu$ in quadratic mean, and thus in probability.

Inspection of the proof of the WLLN suggests a more general result for averages of random variables that are not i.i.d., namely, that $\bar{X}_N - E(\bar{X}_N) \overset{p}{\to} 0$ if

$$\frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \operatorname{Cov}(X_i, X_j) \to 0.$$
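The quadratic-mean argument in the proof of the WLLN can be checked by simulation; in this sketch (the Uniform(0,1) design and the replication count are invented for illustration), the variance of the sample mean across replications tracks $\sigma^2/N = (1/12)/N$, which shrinks to zero:

```python
import random
import statistics

random.seed(1)

# Invented illustration of the quadratic-mean step in the WLLN proof:
# for i.i.d. Uniform(0, 1) draws, Var(X_bar_N) should be close to
# sigma^2 / N = (1/12) / N, which tends to zero as N grows.
def sample_mean(N):
    return statistics.fmean(random.random() for _ in range(N))

emp_var = {}
for N in (10, 100, 1000):
    means = [sample_mean(N) for _ in range(2000)]
    emp_var[N] = statistics.variance(means)
    print(N, round(emp_var[N], 6), "theory:", round((1 / 12) / N, 6))
```

Each tenfold increase in $N$ cuts the simulated variance of $\bar{X}_N$ by roughly a factor of ten, matching the $\sigma^2/N$ rate used in the proof.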
This general condition can be specialized to yield other LLNs for heterogeneous (non-constant variances) or dependent (nonzero covariances) data. Other laws of large numbers exist which relax conditions on the existence of second moments, and which show almost sure convergence (these are known as strong laws of large numbers). The next main result, on approximate normality of sample averages, again applies to i.i.d. data:

Central Limit Theorem (Lindeberg-Levy): If $\bar{X}_N \equiv \frac{1}{N}\sum_{i=1}^N X_i$ is a sample average of scalar, i.i.d. random variables $\{X_i\}_{i=1}^N$ which have $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$ finite, then

$$Z_N \equiv \frac{\bar{X}_N - E(\bar{X}_N)}{\sqrt{\operatorname{Var}(\bar{X}_N)}} = \frac{\sqrt{N}(\bar{X}_N - \mu)}{\sigma} \overset{d}{\to} N(0, 1).$$

The proof of this result is a bit more involved, requiring manipulation of characteristic functions, and will not be presented here. The result of this CLT is often rewritten as

$$\sqrt{N}(\bar{X}_N - \mu) \overset{d}{\to} N(0, \sigma^2),$$

which points out that $\bar{X}_N$ converges to its mean $\mu$ at exactly the rate $\sqrt{N}$ increases, so that the product eventually "balances out" to yield a random variable with a normal distribution. Another way to rewrite the consequence of the CLT is

$$\bar{X}_N \overset{A}{\sim} N\left(\mu, \frac{\sigma^2}{N}\right),$$

where the symbol $\overset{A}{\sim}$ means "is approximately distributed as" (or "is asymptotically distributed as"). It is very important here to make the distinction between the limiting distribution of the standardized mean $Z_N$, which is a standard normal, and the asymptotic distribution of the (non-standardized) sample mean $\bar{X}_N$, which is actually a sequence of approximating normal distributions with variances that shrink to zero at speed $1/N$. In short, a limiting distribution cannot depend upon $N$, which has passed to its (infinite) limit, while an asymptotic distribution can involve the sample size $N$. Like the Weak Law of Large Numbers, the Lindeberg-Levy Central Limit Theorem admits a number of different generalizations (often named after their authors) which relax the assumption of i.i.d. data while
imposing various stronger restrictions on the existence of moments or the tail behavior of the true data distributions.

To this point, the WLLN and CLT apply only to scalar random variables $\{X_i\}$; to extend them to random vectors $\{X_i\}$, we can use the intriguingly-named Cramér-Wold device, which is really a theorem stating that if the scalar random variable $\lambda' X_N$ converges in distribution to $\lambda' X$ for any fixed vector $\lambda$ having the same dimension as $X_N$, that is, if

$$\lambda' X_N \overset{d}{\to} \lambda' X \quad \text{for any } \lambda,$$

then the vector $X_N$ converges in distribution to $X$:

$$X_N \overset{d}{\to} X.$$

This result immediately yields a multivariate version of the Lindeberg-Levy CLT: a sample mean $\bar{X}_N$ of i.i.d. random vectors $\{X_i\}$ with $E(X_i) = \mu$ and $V(X_i) = \Sigma$ has

$$\sqrt{N}(\bar{X}_N - \mu) \overset{d}{\to} N(0, \Sigma).$$

And, since convergence in probability is a special case of convergence in distribution, the Cramér-Wold device also immediately yields a multivariate version of the WLLN.

Slutsky Theorems

To extend the limit theorems for sample averages to statistics which are functions of sample averages, asymptotic theory uses smoothness properties of those functions (i.e., continuity and differentiability) to approximate those functions by polynomials, usually constant or linear functions. The simplest of these approximation results is the continuity theorem, which states that probability limits share an important property of ordinary limits, namely, that the probability limit of a continuous function is the value of that function evaluated at the probability limit:
Continuity Theorem: If $X_N \overset{p}{\to} x_0$ and the function $g(x)$ is continuous at $x = x_0$, then $g(X_N) \overset{p}{\to} g(x_0)$.

Proof: For any $\varepsilon > 0$, continuity of the function $g(x)$ at $x_0$ implies that there is some $\delta > 0$ such that $\|X_N - x_0\| \le \delta$ implies $\|g(X_N) - g(x_0)\| \le \varepsilon$. Thus

$$\Pr\{\|g(X_N) - g(x_0)\| \le \varepsilon\} \ge \Pr\{\|X_N - x_0\| \le \delta\}.$$

But the probability on the right-hand side converges to one because $X_N \overset{p}{\to} x_0$, so the probability on the left-hand side does, too.

With the continuity theorem, it should be clear that probability limits are more conveniently manipulated than, say, mathematical expectations, since, for example, if $\hat{\beta}_N \equiv \bar{Y}_N / \bar{X}_N$ is an estimator of the ratio of the means of $Y_i$ and $X_i$,

$$\beta_0 \equiv \frac{E(Y_i)}{E(X_i)} \equiv \frac{\mu_Y}{\mu_X},$$

then $\hat{\beta}_N \overset{p}{\to} \beta_0$ as long as $\mu_X \ne 0$, even though $E(\hat{\beta}_N) \ne \beta_0$ in general (and may not even be well-defined).

Another key approximation result is Slutsky's Theorem, which states that the sum (or product) of a random vector that converges in probability and another that converges in distribution itself converges in distribution:

Slutsky's Theorem: If the random vectors $X_N$ and $Y_N$ have the same length, and if $X_N \overset{p}{\to} x_0$ and $Y_N \overset{d}{\to} Y$, then

$$X_N + Y_N \overset{d}{\to} x_0 + Y \quad \text{and} \quad X_N' Y_N \overset{d}{\to} x_0' Y.$$

The typical application of Slutsky's theorem is to estimators that can be represented in terms of a linear approximation involving some statistics that converge either in probability or in distribution. To be more concrete, suppose the (scalar) estimator $\hat{\theta}_N$ can be decomposed as

$$\sqrt{N}(\hat{\theta}_N - \theta_0) = X_N' Y_N + Z_N,$$

where $X_N \overset{p}{\to} x_0$, $Y_N \overset{d}{\to} N(0, \Omega)$, and $Z_N \overset{p}{\to} 0$.
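The ratio-of-means example above can be illustrated numerically; the data design below (Uniform regressors, a slope of 3, and Gaussian noise) is invented purely for the sketch:

```python
import random
import statistics

random.seed(2)

# Invented design for the ratio-of-means example: X_i ~ Uniform(1, 2) and
# Y_i = 3 * X_i + e_i with E(e_i) = 0, so beta_0 = E(Y_i) / E(X_i) = 3.
# The continuity theorem gives plim(Y_bar / X_bar) = 3, even though
# E(Y_bar / X_bar) generally differs from 3 in finite samples.
def beta_hat(N):
    xs = [1 + random.random() for _ in range(N)]
    ys = [3 * x + random.gauss(0, 1) for x in xs]
    return statistics.fmean(ys) / statistics.fmean(xs)

est = {N: beta_hat(N) for N in (10, 1000, 100_000)}
print(est)  # tightens around beta_0 = 3 as N grows
```

Nothing here requires computing $E(\hat{\beta}_N)$, which has no simple closed form for this design; consistency follows from the probability limits of the two sample means alone.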
This decomposition is often obtained by a first-order Taylor's series expansion of $\hat{\theta}_N$, and the components $X_N$, $Y_N$, and $Z_N$ are sample averages to which a LLN or CLT can be applied; an example is the so-called "delta method" discussed below. Under these conditions, Slutsky's theorem implies that

$$\sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} N(0, x_0' \Omega x_0).$$

Both the continuity theorem and Slutsky's theorem are special cases of the continuous mapping theorem, which says that convergence in distribution, like convergence in probability, interchanges with continuous functions, as long as they are continuous everywhere:

Continuous Mapping Theorem: If $X_N \overset{d}{\to} X$ and the function $g(x)$ is continuous for all $x$, then $g(X_N) \overset{d}{\to} g(X)$.

The continuity requirement for $g(x)$ can be relaxed to hold only on the support of $X$ if needed. The continuous mapping theorem can be used to get approximations for the distributions of test statistics based upon asymptotically-normal estimators; for example, if

$$Z_N \equiv \sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} N(0, I),$$

then

$$T_N \equiv Z_N' Z_N = N(\hat{\theta}_N - \theta_0)'(\hat{\theta}_N - \theta_0) \overset{d}{\to} \chi^2_p,$$

where $p \equiv \dim\{\hat{\theta}_N\}$.

The final approximation result discussed here, which is an application of Slutsky's theorem, is the so-called delta method, which states that continuously-differentiable functions of asymptotically-normal statistics are themselves asymptotically normal:

The Delta Method: If $\hat{\theta}_N$ is a random vector which is asymptotically normal, with

$$\sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} N(0, \Sigma)$$

for some $\theta_0$, and if $g(\theta)$ is continuously differentiable at $\theta = \theta_0$ with Jacobian matrix

$$G_0 \equiv \frac{\partial g(\theta_0)}{\partial \theta'}$$
that has full row rank, then

$$\sqrt{N}(g(\hat{\theta}_N) - g(\theta_0)) \overset{d}{\to} N(0, G_0 \Sigma G_0').$$

Proof: First suppose that $g(\theta)$ is scalar-valued. Since $\hat{\theta}_N$ converges in probability to $\theta_0$ (by Slutsky's Theorem), and since $\partial g(\theta)/\partial \theta'$ is continuous at $\theta_0$, the Mean Value Theorem of differential calculus implies that

$$g(\hat{\theta}_N) = g(\theta_0) + \frac{\partial g(\theta_N^*)}{\partial \theta'}(\hat{\theta}_N - \theta_0)$$

for some value $\theta_N^*$ between $\hat{\theta}_N$ and $\theta_0$ (with arbitrarily high probability). Since $\hat{\theta}_N \overset{p}{\to} \theta_0$, it follows that $\theta_N^* \overset{p}{\to} \theta_0$, so

$$\frac{\partial g(\theta_N^*)}{\partial \theta'} \overset{p}{\to} G_0 \equiv \frac{\partial g(\theta_0)}{\partial \theta'}.$$

Thus, by Slutsky's Theorem,

$$\sqrt{N}(g(\hat{\theta}_N) - g(\theta_0)) = \frac{\partial g(\theta_N^*)}{\partial \theta'} \cdot \sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} G_0 \cdot N(0, \Sigma) = N(0, G_0 \Sigma G_0').$$

To extend this argument to the case where $g(\hat{\theta}_N)$ is vector-valued, use the same argument to show that any (scalar) linear combination $\lambda' g(\hat{\theta}_N)$ has the asymptotic distribution

$$\sqrt{N}(\lambda' g(\hat{\theta}_N) - \lambda' g(\theta_0)) \overset{d}{\to} N(0, \lambda' G_0 \Sigma G_0' \lambda),$$

from which the result follows by the Cramér-Wold device.
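A small simulation illustrates the delta method conclusion; the choices below (Uniform(0,1) draws for $\hat{\theta}_N = \bar{X}_N$ and $g(\theta) = \theta^2$) are an invented example, not from the text:

```python
import math
import random
import statistics

random.seed(3)

# Invented delta-method example: theta_hat = X_bar of N Uniform(0, 1) draws
# (theta_0 = 0.5, sigma^2 = 1/12) and g(theta) = theta^2, so
# G_0 = g'(theta_0) = 2 * 0.5 = 1, and the theorem predicts
# sqrt(N) * (g(theta_hat) - g(theta_0)) ~ N(0, G_0^2 * sigma^2) = N(0, 1/12).
N, reps = 500, 2000
vals = []
for _ in range(reps):
    theta_hat = sum(random.random() for _ in range(N)) / N
    vals.append(math.sqrt(N) * (theta_hat**2 - 0.5**2))

print("simulated variance:", round(statistics.variance(vals), 4),
      "theory:", round(1 / 12, 4))
```

The simulated variance of $\sqrt{N}(g(\hat{\theta}_N) - g(\theta_0))$ lands close to the predicted $G_0^2 \sigma^2 = 1/12$, as the linearization argument in the proof suggests.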