Lecture Notes on Asymptotic Statistics. Changliang Zou


Prologue

Why asymptotic statistics? The use of asymptotic approximations is two-fold. First, they enable us to find approximate tests and confidence regions. Second, approximations can be used theoretically to study the quality (efficiency) of statistical procedures (Van der Vaart).

Approximate statistical procedures

To carry out a statistical test, we need to know the critical value of the test statistic. Roughly speaking, this means we must know the distribution of the test statistic under the null hypothesis. Because such distributions are often analytically intractable, only approximations are available in practice.

Consider for instance the classical t-test for location. Given a sample of iid observations $X_1, \ldots, X_n$, we wish to test $H_0: \mu = \mu_0$. If the observations arise from a normal distribution with mean $\mu_0$, then the distribution of the t-test statistic $\sqrt{n}(\bar{X}_n - \mu_0)/S_n$ is exactly known, namely $t(n-1)$. However, we may have doubts regarding the normality. If the number of observations is not too small, this does not matter too much: we may act as if $\sqrt{n}(\bar{X}_n - \mu_0)/S_n \sim N(0,1)$. The theoretical justification is the limiting result, as $n \to \infty$,
$$\sup_x \left| P\left( \frac{\sqrt{n}(\bar{X}_n - \mu)}{S_n} \le x \right) - \Phi(x) \right| \to 0,$$
provided that the variables $X_i$ have a finite second moment. A large-sample or asymptotic level-$\alpha$ test is then to reject $H_0$ if $|\sqrt{n}(\bar{X}_n - \mu_0)/S_n| > z_{\alpha/2}$. When the underlying distribution is exponential, the approximation is satisfactory for $n \ge 100$ or so. Thus, one aim of asymptotic statistics is to derive the asymptotic distributions of many types of statistics.

There are similar benefits when obtaining confidence intervals. For instance, consider the maximum likelihood estimator $\hat{\theta}_n$ of a parameter $\theta$ of dimension $p$, based on a sample of size $n$ from a density $f(x; \theta)$. A major result in asymptotic statistics is that in many situations $\sqrt{n}(\hat{\theta}_n - \theta)$ is asymptotically normally distributed with zero mean and covariance matrix $I_\theta^{-1}$, where
$$I_\theta = E\left[ \left( \frac{\partial \log f(X; \theta)}{\partial \theta} \right) \left( \frac{\partial \log f(X; \theta)}{\partial \theta} \right)^T \right]$$
is the Fisher information matrix. Thus, acting as if $\sqrt{n}(\hat{\theta}_n - \theta) \sim N_p(0, I_\theta^{-1})$, we find that the ellipsoid
$$\left\{ \theta : (\theta - \hat{\theta}_n)^T I_\theta (\theta - \hat{\theta}_n) \le \frac{\chi^2_{p,\alpha}}{n} \right\}$$
is an approximate $1-\alpha$ confidence region.
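The quality of such large-sample approximations is easy to probe by simulation. The following sketch (a minimal illustration added to these notes, not part of the original derivation; the Exponential(1) parent distribution and the sample sizes are arbitrary choices) estimates the actual rejection probability of the asymptotic level-$\alpha$ t-test above when the data are exponential rather than normal.

```python
import numpy as np
from scipy import stats

def t_test_level(n, alpha=0.05, reps=20000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the true level of the large-sample t-test
    |sqrt(n)(Xbar - mu0)/S_n| > z_{alpha/2} when X_i ~ Exponential(1),
    so that H0: mu = 1 is true but the data are not normal."""
    z = stats.norm.ppf(1 - alpha / 2)
    x = rng.exponential(scale=1.0, size=(reps, n))          # mu0 = 1
    t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
    return np.mean(np.abs(t) > z)

for n in (10, 30, 100, 500):
    print(n, round(t_test_level(n), 4))   # should approach 0.05 as n grows
```

In line with the remark above, the estimated level is noticeably off for small $n$ and settles near $\alpha$ once $n$ is in the hundreds.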

Efficiency of statistical procedures

For a relatively small number of statistical problems there exists an exact, optimal solution: for example, the Neyman-Pearson lemma for finding UMP tests, the Rao-Blackwell theory for finding MVUEs, and the Cramer-Rao theorem. An exact optimality theory or procedure is not always available, however, and then asymptotic optimality theory may help. For instance, to compare two tests we might compare approximations to their power functions.

Consider the foregoing hypothesis problem for location. A well-known nonparametric test statistic is the sign statistic $T_n = n^{-1} \sum_{i=1}^n I_{\{X_i > \theta_0\}}$, where the null hypothesis is $H_0: \theta = \theta_0$ and $\theta$ denotes the median of the distribution of $X$. Comparing the efficiency of the sign test and the t-test directly is rather difficult because the exact power functions of the two tests are intractable. However, using the definitions and methods introduced later, we can show that the asymptotic relative efficiency of the sign test versus the t-test equals
$$4 f^2(0) \int x^2 f(x)\, dx.$$

To compare estimators, we might compare asymptotic variances rather than exact variances. A major result in this area is that for smooth parametric models maximum likelihood estimators are asymptotically optimal. This roughly means the following. First, MLEs are consistent; second, the rate at which MLEs converge to the true value is the fastest possible, typically $\sqrt{n}$; third, their asymptotic variance attains the Cramer-Rao bound. Thus, asymptotics justify the use of the MLE in certain situations. (Even though the MLE does not lead to the best finite-sample estimator in many cases, it is never a disastrous choice and always leads to a reasonable estimator.)

Contents

Basic convergence concepts and preliminary theorems (8)
Transformations of given statistics: the Delta method (4)

The basic sample statistics: distribution function, moments, quantiles, and order statistics (3)
Asymptotic theory in parametric inference: MLE, likelihood ratio test, etc. (6)
U-statistics, M-estimates and R-estimates (6)
Asymptotic relative efficiency (6)
Asymptotic theory in nonparametric inference: rank and sign tests (6)
Goodness of fit (3)
Nonparametric regression and density estimation (4)
Advanced selected topics: bootstrap and empirical likelihood (4)

Text books

Billingsley, P. (1995). Probability and Measure, 3rd edition. John Wiley, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. John Wiley, New York.
Shao, J. (2003). Mathematical Statistics, 2nd edition. Springer, New York.
Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press, Cambridge.

Chapter 1

Basic convergence concepts and preliminary theorems

Throughout this course there will usually be an underlying probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a set of points, $\mathcal{F}$ is a $\sigma$-field of subsets of $\Omega$, and $P$ is a probability distribution or measure defined on the elements of $\mathcal{F}$. A random variable $X(\omega)$ is a transformation of $\Omega$ into the real line $R$ such that the preimages $X^{-1}(B)$ of Borel sets $B$ are elements of $\mathcal{F}$. A collection of random variables $X_1(\omega), X_2(\omega), \ldots$ on a given $(\Omega, \mathcal{F})$ will typically be denoted by $X_1, X_2, \ldots$.

1.1 Modes of convergence of a sequence of random variables

Definition (convergence in probability) Let $\{X_n\}$ and $X$ be random variables defined on a common probability space. We say $X_n$ converges to $X$ in probability if, for any $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$, or equivalently
$$\lim_{n \to \infty} P(|X_n - X| < \epsilon) = 1, \quad \text{for every } \epsilon > 0.$$

This is usually written as $X_n \xrightarrow{p} X$. Extension to the vector case: for random $p$-vectors $X_1, X_2, \ldots$ and $X$, we say $X_n \xrightarrow{p} X$ if $\|X_n - X\| \xrightarrow{p} 0$, where $\|z\| = (\sum_{i=1}^p z_i^2)^{1/2}$ denotes the Euclidean distance ($L_2$-norm) for $z \in R^p$. It is easily seen that $X_n \xrightarrow{p} X$ iff the corresponding component-wise convergence holds.

Example For iid Bernoulli trials with success probability $p = 1/2$, let $X_n$ denote the number of times in the first $n$ trials that a success is followed by a failure. Writing $T_i = I\{\text{the $i$th trial is a success and the $(i+1)$st trial is a failure}\}$, we have $X_n = \sum_{i=1}^{n-1} T_i$, and therefore $E[X_n] = (n-1)/4$ and
$$\mathrm{Var}[X_n] = \sum_{i=1}^{n-1} \mathrm{Var}[T_i] + 2 \sum_{i=1}^{n-2} \mathrm{Cov}[T_i, T_{i+1}] = \frac{3(n-1)}{16} - \frac{2(n-2)}{16} = \frac{n+1}{16}.$$
It then follows by an application of Chebyshev's inequality [$P(|X - \mu| \ge \epsilon) \le \sigma^2/\epsilon^2$] that $X_n/n \xrightarrow{p} 1/4$.

Definition (bounded in probability) A sequence of random variables $X_n$ is said to be bounded in probability if, for any $\epsilon > 0$, there exists a constant $k$ such that $P(|X_n| > k) \le \epsilon$ for all $n$. Any single random variable (or vector) is bounded in probability.

It is convenient to have short expressions for terms that converge or are bounded in probability. If $X_n \xrightarrow{p} 0$, we write $X_n = o_p(1)$, pronounced "small oh-p-one"; the expression $O_p(1)$ ("big oh-p-one") denotes a sequence that is bounded in probability, written $X_n = O_p(1)$. These are the stochastic versions of $o(\cdot)$ and $O(\cdot)$. More generally, for a given sequence of random variables $R_n$, $X_n = o_p(R_n)$ means $X_n = Y_n R_n$ with $Y_n \xrightarrow{p} 0$, and $X_n = O_p(R_n)$ means $X_n = Y_n R_n$ with $Y_n = O_p(1)$. This expresses that the sequence $X_n$ converges in probability to zero, or is bounded in probability, at the rate $R_n$. For deterministic sequences $X_n$ and $R_n$, $o_p(\cdot)$ and $O_p(\cdot)$ reduce to the usual $o(\cdot)$ and $O(\cdot)$ from calculus. Obviously, $X_n = o_p(R_n)$ implies $X_n = O_p(R_n)$. An expression we will often use is: for some sequence $a_n$, if $a_n X_n \xrightarrow{p} 0$, we write $X_n = o_p(a_n^{-1})$; if $a_n X_n = O_p(1)$, we write $X_n = O_p(a_n^{-1})$.
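To make the $O_p$ notation concrete, here is a small simulation (our own illustration; the sample sizes and replication count are arbitrary) of the success-followed-by-failure counts from the example above: $X_n/n - 1/4$ shrinks as $n$ grows, while the rescaled quantity $\sqrt{n}(X_n/n - 1/4)$ keeps a stable spread, i.e. $X_n/n - 1/4 = O_p(n^{-1/2})$.

```python
import numpy as np

rng = np.random.default_rng(1)

def success_then_failure_count(n):
    """X_n = number of times a success (1) is immediately followed by a failure (0)."""
    trials = rng.integers(0, 2, size=n)            # iid Bernoulli(1/2) trials
    return np.sum((trials[:-1] == 1) & (trials[1:] == 0))

for n in (100, 1000, 10000, 100000):
    reps = 500
    props = np.array([success_then_failure_count(n) / n for _ in range(reps)])
    dev = props - 0.25
    print(f"n={n:6d}  sd(X_n/n - 1/4)={dev.std():.5f}  "
          f"sd(sqrt(n)*(X_n/n - 1/4))={np.sqrt(n) * dev.std():.3f}")
```

The second column shrinks roughly like $n^{-1/2}$, while the third settles near $1/4$, the square root of $\lim_n n\,\mathrm{Var}(X_n/n) = 1/16$.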

Definition (convergence with probability one) Let $\{X_n\}$ and $X$ be random variables defined on a common probability space. We say $X_n$ converges to $X$ with probability 1 (or almost surely, strongly, almost everywhere) if
$$P\left( \lim_{n \to \infty} X_n = X \right) = 1.$$
This can be written as $P(\omega : X_n(\omega) \to X(\omega)) = 1$. We denote this mode of convergence by $X_n \xrightarrow{wp1} X$ or $X_n \xrightarrow{a.s.} X$. The extension to the random vector case is straightforward.

Almost sure convergence is a stronger mode of convergence than convergence in probability. In fact, a characterization of convergence with probability 1 is that
$$\lim_{n \to \infty} P(|X_m - X| < \epsilon, \ \text{all } m \ge n) = 1, \quad \text{every } \epsilon > 0. \tag{1.1}$$
It is clear from this equivalent condition that convergence with probability 1 is stronger than convergence in probability. Its proof can be found on page 7 of Serfling (1980).

Example Suppose $X_1, X_2, \ldots$ is an infinite sequence of iid $U[0,1]$ random variables, and let $X_{(n)} = \max\{X_1, \ldots, X_n\}$. We show $X_{(n)} \xrightarrow{wp1} 1$. Note that
$$P(|X_{(n)} - 1| \le \epsilon, \ \text{all } n \ge m) = P(X_{(n)} \ge 1 - \epsilon, \ \text{all } n \ge m) = P(X_{(m)} \ge 1 - \epsilon) = 1 - (1-\epsilon)^m \to 1$$
as $m \to \infty$.

Definition (convergence in $r$th mean) Let $\{X_n\}$ and $X$ be random variables defined on a common probability space. For $r > 0$, we say $X_n$ converges to $X$ in $r$th mean if
$$\lim_{n \to \infty} E|X_n - X|^r = 0.$$
This is written $X_n \xrightarrow{rth} X$. It is easily shown that
$$X_n \xrightarrow{rth} X \ \Longrightarrow\ X_n \xrightarrow{sth} X, \quad 0 < s < r,$$
by Jensen's inequality (if $g(\cdot)$ is a convex function on $R$, and $X$ and $g(X)$ are integrable random variables, then $g(E[X]) \le E[g(X)]$).

Definition (convergence in distribution) Let $\{X_n\}$ and $X$ be random variables with distribution functions $F_{X_n}(\cdot)$ and $F_X(\cdot)$. We say that $X_n$ converges in distribution (in law) to $X$ if
$$\lim_{n \to \infty} F_{X_n}(t) = F_X(t)$$
at every point $t$ that is a continuity point of $F_X$. This is written as $X_n \xrightarrow{d} X$ or $F_{X_n} \Rightarrow F_X$.

Example Consider $X_n \sim$ Uniform$\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\}$. Then the sequence $X_n$ converges in law to $U[0,1]$. Indeed, for any $t \in [\frac{i}{n}, \frac{i+1}{n})$, the difference between $F_{X_n}(t) = \frac{i}{n}$ and $F_X(t) = t$ satisfies $|\frac{i}{n} - t| < n^{-1}$ and hence can be made arbitrarily small if $n$ is sufficiently large. The result follows from the definition of $\xrightarrow{d}$.

Example Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables with $X_n \sim N(0, 1 + n^{-1})$. Taking the limit of the distribution function of $X_n$ as $n \to \infty$ yields $\lim_{n \to \infty} F_{X_n}(x) = \Phi(x)$ for all $x \in R$. Thus, $X_n \xrightarrow{d} N(0, 1)$.

According to the assertion below the definition of $\xrightarrow{p}$, we know that $X_n \xrightarrow{p} X$ is equivalent to convergence in probability of every one of the sequences of components. The analogous statement for convergence in distribution is false: convergence in distribution of the sequence $X_n$ is stronger than convergence of every one of the sequences of components $X_{ni}$. The point is that the distributions of the components $X_{ni}$ separately do not determine their joint distribution (they might be independent or dependent in many ways). We speak of joint convergence in law versus marginal convergence.

Example If $X \sim U[0,1]$, $X_n = X$ for all $n$, and $Y_n = X$ for $n$ odd and $Y_n = 1 - X$ for $n$ even, then $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} U[0,1]$, yet $(X_n, Y_n)$ does not converge in law.

Suppose $\{X_n\}$ and $X$ are integer-valued random variables. It is not hard to show that $X_n \xrightarrow{d} X$ iff $P(X_n = k) \to P(X = k)$ for every integer $k$. This is a useful characterization of convergence in law for integer-valued random variables.

1.2 Fundamental results and theorems on convergence

1.2.1 Relationships

The relationships among the four modes of convergence are summarized as follows.

Theorem Let $\{X_n\}$ and $X$ be random variables (vectors).
(i) If $X_n \xrightarrow{wp1} X$, then $X_n \xrightarrow{p} X$.
(ii) If $X_n \xrightarrow{rth} X$ for some $r > 0$, then $X_n \xrightarrow{p} X$.
(iii) If $X_n \xrightarrow{p} X$, then $X_n \xrightarrow{d} X$.
(iv) If, for every $\epsilon > 0$, $\sum_{n=1}^\infty P(|X_n - X| > \epsilon) < \infty$, then $X_n \xrightarrow{wp1} X$.

Proof. (i) is an obvious consequence of the equivalent characterization (1.1). (ii) For any $\epsilon > 0$,
$$E|X_n - X|^r \ge E[|X_n - X|^r I(|X_n - X| > \epsilon)] \ge \epsilon^r P(|X_n - X| > \epsilon),$$
and thus $P(|X_n - X| > \epsilon) \le \epsilon^{-r} E|X_n - X|^r \to 0$ as $n \to \infty$. (iii) This is a direct application of Slutsky's theorem. (iv) Let $\epsilon > 0$ be given. We have
$$P(|X_m - X| \ge \epsilon \ \text{for some } m \ge n) = P\left( \bigcup_{m=n}^{\infty} \{|X_m - X| \ge \epsilon\} \right) \le \sum_{m=n}^{\infty} P(|X_m - X| \ge \epsilon).$$
The last term is the tail of a convergent series and hence goes to zero as $n \to \infty$.

Example Consider iid $N(0,1)$ random variables $X_1, X_2, \ldots$, and let $\bar{X}_n$ be the mean of the first $n$ observations. For an $\epsilon > 0$, consider $\sum_{n=1}^\infty P(|\bar{X}_n| > \epsilon)$. By Markov's inequality,
$$P(|\bar{X}_n| > \epsilon) \le \frac{E[\bar{X}_n^4]}{\epsilon^4} = \frac{3}{\epsilon^4 n^2}.$$
Since $\sum_{n=1}^\infty n^{-2} < \infty$, it follows from part (iv) of the theorem above that $\bar{X}_n \xrightarrow{wp1} 0$.

1.2.2 Transformations

It turns out that continuous transformations preserve many types of convergence, and this fact is useful in many applications. We record it next; its proof can be found on page 24 of Serfling (1980).

Theorem (Continuous Mapping Theorem) Let $X_1, X_2, \ldots$ and $X$ be random $p$-vectors defined on a probability space, and let $g(\cdot)$ be a vector-valued (including real-valued) continuous function defined on $R^p$. If $X_n$ converges to $X$ in probability, almost surely, or in law, then $g(X_n)$ converges to $g(X)$ in probability, almost surely, or in law, respectively.

Example (i) If $X_n \xrightarrow{d} N(0,1)$, then $X_n^2 \xrightarrow{d} \chi^2_1$. (ii) If $(X_n, Y_n) \xrightarrow{d} N_2(0, I_2)$, then $\max\{X_n, Y_n\} \xrightarrow{d} \max\{X, Y\}$, which has the CDF $[\Phi(x)]^2$.

The most commonly considered functions of vectors converging in some stochastic sense are linear and quadratic forms, as summarized in the following result.

Corollary Suppose that the $p$-vectors $X_n$ converge to the $p$-vector $X$ in probability, almost surely, or in law. Let $A_{q \times p}$ and $B_{p \times p}$ be matrices. Then $A X_n \to A X$ and $X_n^T B X_n \to X^T B X$ in the given mode of convergence.

Proof. The vector-valued function
$$Ax = \left( \sum_{i=1}^p a_{1i} x_i, \ldots, \sum_{i=1}^p a_{qi} x_i \right)^T$$
and the real-valued function
$$x^T B x = \sum_{i=1}^p \sum_{j=1}^p b_{ij} x_i x_j$$
are continuous functions of $x = (x_1, \ldots, x_p)^T$.

Example (i) If $X_n \xrightarrow{d} N_p(\mu, \Sigma)$, then $C X_n \xrightarrow{d} N_q(C\mu, C \Sigma C^T)$, where $C_{q \times p}$ is a matrix; also, $(X_n - \mu)^T \Sigma^{-1} (X_n - \mu) \xrightarrow{d} \chi^2_p$. (ii) (Sums and products of random variables converging wp1 or in probability) If $X_n \xrightarrow{wp1} X$ and $Y_n \xrightarrow{wp1} Y$, then $X_n + Y_n \xrightarrow{wp1} X + Y$ and $X_n Y_n \xrightarrow{wp1} XY$. The same statements hold with "wp1" replaced by "in probability".

Remark The condition that $g(\cdot)$ is continuous in the continuous mapping theorem can be relaxed to $g(\cdot)$ being continuous a.s., i.e., $P(X \in C(g)) = 1$, where $C(g) = \{x : g \text{ is continuous at } x\}$ is called the continuity set of $g$.

Example (i) If $X_n \xrightarrow{d} X \sim N(0,1)$, then $1/X_n \xrightarrow{d} Z$, where $Z$ has the distribution of $1/X$, even though the function $g(x) = 1/x$ is not continuous at 0. This is because $P(X = 0) = 0$. However, if $X_n = 1/n$ (a degenerate distribution) and
$$g(x) = \begin{cases} 1, & x > 0, \\ 0, & x \le 0, \end{cases}$$
then $X_n \xrightarrow{d} 0$ but $g(X_n) \xrightarrow{d} 1 \ne g(0)$. (ii) If $(X_n, Y_n) \xrightarrow{d} N_2(0, I_2)$, then $X_n / Y_n \xrightarrow{d}$ Cauchy.

Example Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent random variables, each with a Poisson($\theta$) distribution, and let $\bar{X}_n$ be the sample mean computed from $X_1, \ldots, X_n$; then $\bar{X}_n \xrightarrow{p} \theta$ as $n \to \infty$. If we wish to find a consistent estimator of the standard deviation of $X_n$, which is $\theta^{1/2}$, we can consider $\bar{X}_n^{1/2}$. Since the square root transformation is continuous at $\theta$ whenever $\theta > 0$, the CMT implies that $\bar{X}_n^{1/2} \xrightarrow{p} \theta^{1/2}$ as $n \to \infty$.

In Example 1.2.2, the condition that $(X_n, Y_n) \xrightarrow{d} N_2(0, I_2)$ cannot be relaxed to $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$ with $X$ and $Y$ independent; i.e., we need convergence of the joint CDF of $(X_n, Y_n)$. This is different when $\xrightarrow{d}$ is replaced by $\xrightarrow{p}$ or $\xrightarrow{wp1}$, as in Example (ii) above. The following result, which plays an important role in probability and statistics, establishes the convergence in distribution of $X_n + Y_n$ or $X_n Y_n$ when no information regarding the joint CDF of $(X_n, Y_n)$ is provided.

Theorem (Slutsky's Theorem) Let $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$, where $c$ is a finite constant. Then,

(i) $X_n + Y_n \xrightarrow{d} X + c$;
(ii) $X_n Y_n \xrightarrow{d} cX$;
(iii) $X_n / Y_n \xrightarrow{d} X/c$ if $c \ne 0$.

Proof. The method of proof is demonstrated sufficiently by proving (i). Choose and fix $t$ such that $t - c$ is a continuity point of $F_X$. Let $\varepsilon > 0$ be such that $t - c + \varepsilon$ and $t - c - \varepsilon$ are also continuity points of $F_X$. Then
$$F_{X_n + Y_n}(t) = P(X_n + Y_n \le t) \le P(X_n + Y_n \le t, |Y_n - c| < \varepsilon) + P(|Y_n - c| \ge \varepsilon) \le P(X_n \le t - c + \varepsilon) + P(|Y_n - c| \ge \varepsilon)$$
and, similarly,
$$F_{X_n + Y_n}(t) \ge P(X_n \le t - c - \varepsilon) - P(|Y_n - c| \ge \varepsilon).$$
It follows from the previous two inequalities and the hypotheses of the theorem that
$$F_X(t - c - \varepsilon) \le \liminf_{n \to \infty} F_{X_n + Y_n}(t) \le \limsup_{n \to \infty} F_{X_n + Y_n}(t) \le F_X(t - c + \varepsilon).$$
Since $t - c$ is a continuity point of $F_X$, and since $\varepsilon$ can be taken arbitrarily small, this yields $\lim_{n \to \infty} F_{X_n + Y_n}(t) = F_X(t - c)$. The result follows from $F_X(t - c) = F_{X + c}(t)$.

The extension to the vector case is straightforward; (iii) remains valid provided $c \ne 0$ is understood as the matrix $c$ being invertible. A straightforward but often used consequence of this theorem is that if $X_n \xrightarrow{d} X$ and $X_n - Y_n \xrightarrow{p} 0$, then $Y_n \xrightarrow{d} X$. In asymptotic practice, we often first derive a result such as $Y_n = X_n + o_p(1)$ and then investigate the asymptotic distribution of $X_n$.

Example (i) Convergence in probability to a constant is equivalent to convergence in law to that constant. One direction follows from part (iii) of the relationship theorem above; the converse can be proved by definition. Because the degenerate distribution function of the constant $c$ is continuous everywhere except at the point $c$, for any $\epsilon > 0$,
$$P(|X_n - c| \ge \epsilon) = P(X_n \ge c + \epsilon) + P(X_n \le c - \epsilon) \to 1 - F_X(c + \epsilon) + F_X(c - \epsilon) = 0.$$
The result follows from the definition of convergence in probability.

Example Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent random variables with $X_n \sim$ Gamma$(\alpha_n, \beta_n)$, where $\alpha_n$ and $\beta_n$ are sequences of positive real numbers such that $\alpha_n \to \alpha$ and $\beta_n \to \beta$ for some positive real numbers $\alpha$ and $\beta$. Also, let $\hat{\beta}_n$ be a consistent estimator of $\beta$. We can conclude that $X_n / \hat{\beta}_n \xrightarrow{d}$ Gamma$(\alpha, 1)$.

Example (t-statistic) Let $X_1, X_2, \ldots$ be iid random variables with $E X_1 = 0$ and $E X_1^2 < \infty$. Then the t-statistic $\sqrt{n} \bar{X}_n / S_n$, where $S_n^2 = (n-1)^{-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$ is the sample variance, is asymptotically standard normal. To see this, first note that by two applications of the WLLN and the CMT,
$$S_n^2 = \frac{n}{n-1} \left( \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 \right) \xrightarrow{p} 1 \cdot (E X_1^2 - (E X_1)^2) = \mathrm{Var}(X_1).$$
Again by the CMT, $S_n \xrightarrow{p} \sqrt{\mathrm{Var}(X_1)}$. By the CLT, $\sqrt{n} \bar{X}_n \xrightarrow{d} N(0, \mathrm{Var}(X_1))$. Finally, Slutsky's theorem gives that the sequence of t-statistics converges in law to $N(0, \mathrm{Var}(X_1)) / \sqrt{\mathrm{Var}(X_1)} = N(0,1)$.
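As a quick numerical check of the t-statistic example (our own illustration; the centered exponential parent distribution and the sample sizes are arbitrary choices), the following sketch compares the simulated distribution of $\sqrt{n}\,\bar{X}_n/S_n$ with the standard normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def t_stats(n, reps=20000):
    """Simulate sqrt(n)*Xbar/S_n for iid centered Exponential(1) data (EX=0, EX^2<inf)."""
    x = rng.exponential(size=(reps, n)) - 1.0        # centered so that E X_1 = 0
    return np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)

for n in (10, 50, 500):
    t = t_stats(n)
    # Kolmogorov-Smirnov distance to N(0,1): should shrink as n grows
    print(n, round(stats.kstest(t, "norm").statistic, 4))
```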

1.2.3 WLLN and SLLN

We next state some theorems known as the laws of large numbers, which concern the limiting behavior of sums of independent random variables. The weak law of large numbers (WLLN) refers to convergence in probability, whereas the strong law of large numbers (SLLN) refers to a.s. convergence. Our first result gives the WLLN and SLLN for a sequence of iid random variables.

Theorem Let $X_1, X_2, \ldots$ be iid random variables having CDF $F$.
(i) (The WLLN) Constants $a_n$ for which
$$\frac{1}{n} \sum_{i=1}^n X_i - a_n \xrightarrow{p} 0$$
exist iff $\lim_{x \to \infty} x[1 - F(x) + F(-x)] = 0$, in which case we may choose $a_n = \int_{-n}^{n} x \, dF(x)$.
(ii) (The SLLN) A constant $c$ for which
$$\frac{1}{n} \sum_{i=1}^n X_i \xrightarrow{wp1} c$$
exists iff $E[X_1]$ is finite and equals $c$.

Example Suppose $\{X_i\}$ is a sequence of iid random variables with $X_i \sim t(2)$. The variance of $X_i$ does not exist, but the theorem above still applies, and we can therefore conclude that $\bar{X}_n \xrightarrow{p} 0$ as $n \to \infty$.

The next result is for sequences of independent but not necessarily identically distributed random variables.

Theorem Let $X_1, X_2, \ldots$ be random variables with finite expectations.
(i) (The WLLN) Let $X_1, X_2, \ldots$ be uncorrelated with means $\mu_1, \mu_2, \ldots$ and variances $\sigma_1^2, \sigma_2^2, \ldots$. If $\lim_{n \to \infty} n^{-2} \sum_{i=1}^n \sigma_i^2 = 0$, then
$$\frac{1}{n} \sum_{i=1}^n X_i - \frac{1}{n} \sum_{i=1}^n \mu_i \xrightarrow{p} 0.$$
(ii) (The SLLN) Let $X_1, X_2, \ldots$ be independent with means $\mu_1, \mu_2, \ldots$ and variances $\sigma_1^2, \sigma_2^2, \ldots$. If $\sum_{i=1}^\infty \sigma_i^2 / c_i^2 < \infty$, where $c_n$ is ultimately monotone and $c_n \to \infty$, then
$$c_n^{-1} \sum_{i=1}^n (X_i - \mu_i) \xrightarrow{wp1} 0.$$

(iii) (The SLLN with common mean) Let $X_1, X_2, \ldots$ be independent with common mean $\mu$ and variances $\sigma_1^2, \sigma_2^2, \ldots$. If $\sum_{i=1}^\infty \sigma_i^{-2} = \infty$, then
$$\frac{\sum_{i=1}^n X_i / \sigma_i^2}{\sum_{i=1}^n 1/\sigma_i^2} \xrightarrow{wp1} \mu.$$

A special case of part (ii) is obtained by setting $c_i = i$, in which case
$$\frac{1}{n} \sum_{i=1}^n X_i - \frac{1}{n} \sum_{i=1}^n \mu_i \xrightarrow{wp1} 0.$$
The proofs of the two theorems above can be found in Billingsley (1995).

Example Suppose $X_i \stackrel{indep}{\sim} (\mu, \sigma_i^2)$. Then, by simple calculus, the BLUE (best linear unbiased estimate) of $\mu$ is $\sum_{i=1}^n \sigma_i^{-2} X_i / \sum_{i=1}^n \sigma_i^{-2}$. Suppose now that the $\sigma_i^2$ do not grow at a rate faster than $i$; i.e., for some constant $K$, $\sigma_i^2 \le iK$. Then $\sum_{i=1}^n \sigma_i^{-2}$ clearly diverges as $n \to \infty$, and so by part (iii) above the BLUE of $\mu$ is strongly consistent.

Example Suppose $(X_i, Y_i)$, $i = 1, \ldots, n$, are iid bivariate samples from some distribution with $E(X_1) = \mu_1$, $E(Y_1) = \mu_2$, $\mathrm{Var}(X_1) = \sigma_1^2$, $\mathrm{Var}(Y_1) = \sigma_2^2$, and $\mathrm{corr}(X_1, Y_1) = \rho$. Let $r_n$ denote the sample correlation coefficient. The almost sure convergence of $r_n$ to $\rho$ follows very easily. We write
$$r_n = \frac{\frac{1}{n} \sum_{i=1}^n X_i Y_i - \bar{X} \bar{Y}}{\sqrt{\left(\frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}^2\right)\left(\frac{1}{n} \sum_{i=1}^n Y_i^2 - \bar{Y}^2\right)}};$$
then from the SLLN for iid random variables (Theorem 1.2.4) and the continuous mapping theorem (Theorem 1.2.2; Example (ii)),
$$r_n \xrightarrow{wp1} \frac{E(X_1 Y_1) - \mu_1 \mu_2}{\sqrt{\sigma_1^2 \sigma_2^2}} = \rho.$$

1.2.4 Characterization of convergence in law

Next we provide a collection of basic facts about convergence in distribution. The following theorems provide methodology for establishing convergence in distribution.

Theorem Let $X, X_1, X_2, \ldots$ be random $p$-vectors.
(i) (The Portmanteau Theorem) $X_n \xrightarrow{d} X$ is equivalent to the following condition: $E[g(X_n)] \to E[g(X)]$ for every bounded continuous function $g$.
(ii) (Levy-Cramer continuity theorem) Let $\Phi_X, \Phi_{X_1}, \Phi_{X_2}, \ldots$ be the characteristic functions of $X, X_1, X_2, \ldots$, respectively. Then $X_n \xrightarrow{d} X$ iff $\lim_{n \to \infty} \Phi_{X_n}(t) = \Phi_X(t)$ for all $t \in R^p$.
(iii) (Cramer-Wold device) $X_n \xrightarrow{d} X$ iff $c^T X_n \xrightarrow{d} c^T X$ for every $c \in R^p$.

Proof. (i) See Serfling (1980), page 16. (ii) See Shao (2003), page 57. (iii) Assume $c^T X_n \xrightarrow{d} c^T X$ for every $c$; then by part (ii),
$$\lim_{n \to \infty} \Phi_{X_n}(t c_1, \ldots, t c_p) = \Phi_X(t c_1, \ldots, t c_p) \quad \text{for all } t.$$
With $t = 1$, and since $c$ is arbitrary, it follows by part (ii) again that $X_n \xrightarrow{d} X$. The converse can be proved by a similar argument. [Note that $\Phi_{c^T X_n}(t) = \Phi_{X_n}(tc)$ and $\Phi_{c^T X}(t) = \Phi_X(tc)$ for any $t \in R$ and any $c \in R^p$.]

A straightforward application of this theorem is that if $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} c$ for a constant vector $c$, then $(X_n, Y_n) \xrightarrow{d} (X, c)$.

Example (Uniform grid example revisited) Consider now the function $g(x) = x^{10}$, $0 \le x \le 1$. Note that $g$ is continuous and bounded. Therefore, by the Portmanteau theorem,
$$E(g(X_n)) = \frac{1}{n} \sum_{i=1}^n \left( \frac{i}{n} \right)^{10} \to E(g(X)) = \int_0^1 x^{10} \, dx = \frac{1}{11}.$$

Example For $n \ge 1$, $0 \le p \le 1$, and a given continuous function $g: [0,1] \to R$, define the sequence
$$B_n(p) = \sum_{k=0}^n g\!\left( \frac{k}{n} \right) C_n^k p^k (1-p)^{n-k},$$
the so-called Bernstein polynomial. Note that $B_n(p) = E[g(\frac{X}{n})]$, where $X \sim$ Bin$(n, p)$. As $n \to \infty$, $\frac{X}{n} \xrightarrow{p} p$ (WLLN), and it follows that $\frac{X}{n} \xrightarrow{d} \delta_p$, the point mass at $p$. Since $g$ is continuous and hence bounded (on a compact interval), it follows from the Portmanteau theorem that $B_n(p) \to g(p)$.
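To visualize this approximation-by-expectation idea, here is a small sketch (our own addition; the test function $g(x) = |x - 1/2|$ and the grid are arbitrary choices) that evaluates the Bernstein polynomial $B_n(p)$ on a grid and reports its maximum deviation from $g$.

```python
import numpy as np
from scipy.stats import binom

def bernstein(g, n, p):
    """B_n(p) = E[g(X/n)] with X ~ Bin(n, p), i.e. the n-th Bernstein polynomial of g."""
    k = np.arange(n + 1)
    return np.sum(g(k / n) * binom.pmf(k, n, p))

g = lambda x: np.abs(x - 0.5)          # continuous (not smooth) test function
grid = np.linspace(0, 1, 201)

for n in (5, 20, 100, 500):
    err = max(abs(bernstein(g, n, p) - g(p)) for p in grid)
    print(f"n={n:4d}  max_p |B_n(p) - g(p)| = {err:.4f}")
```

The maximum error decreases with $n$, which is exactly the Weierstrass-type approximation the Portmanteau argument delivers pointwise (and, with a little more work, uniformly).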

Example (i) Let $X_1, \ldots, X_n$ be iid random variables having a common CDF, and let $T_n = X_1 + \cdots + X_n$, $n = 1, 2, \ldots$. Suppose that $E|X_1| < \infty$. It follows from the properties of characteristic functions ($[\partial \Phi_X(t)/\partial t]_{t=0} = i\,EX$ and $[\partial^2 \Phi_X(t)/\partial t^2]_{t=0} = -EX^2$) and a Taylor expansion that the CHF of $X_1$ satisfies
$$\Phi_{X_1}(t) = \Phi_{X_1}(0) + i \mu t + o(|t|) \quad \text{as } t \to 0,$$
where $\mu = E X_1$. Then the CHF of $T_n/n$ satisfies
$$\Phi_{T_n/n}(t) = \left[ \Phi_{X_1}\!\left( \frac{t}{n} \right) \right]^n = \left[ 1 + \frac{i \mu t}{n} + o\!\left( \frac{|t|}{n} \right) \right]^n$$
for any $t \in R$ as $n \to \infty$. Since $(1 + c_n/n)^n \to \exp\{c\}$ for any complex sequence $c_n$ satisfying $c_n \to c$, we obtain that $\Phi_{T_n/n}(t) \to \exp\{i \mu t\}$, which is the CHF of the distribution degenerate at $\mu$. By the continuity theorem (part (ii) above), $T_n/n \xrightarrow{d} \mu$. Since convergence in law to a constant implies convergence in probability, this also shows that $T_n/n \xrightarrow{p} \mu$ (an informal proof of the WLLN).

(ii) Similarly, $\mu = 0$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$ imply, by a second-order Taylor expansion,
$$\Phi_{T_n/\sqrt{n}}(t) = \left[ 1 - \frac{\sigma^2 t^2}{2n} + o(t^2 n^{-1}) \right]^n$$
for any $t \in R$ as $n \to \infty$, which implies that $\Phi_{T_n/\sqrt{n}}(t) \to \exp\{-\sigma^2 t^2/2\}$, the CHF of $N(0, \sigma^2)$. Hence $T_n/\sqrt{n} \xrightarrow{d} N(0, \sigma^2)$.

(iii) Suppose now that $X_1, \ldots, X_n$ are random $p$-vectors and that $\mu = E X_1$ and $\Sigma = \mathrm{Cov}(X_1)$ are finite. For any fixed $c \in R^p$, it follows from the previous discussion that $(c^T T_n - n c^T \mu)/\sqrt{n} \xrightarrow{d} N(0, c^T \Sigma c)$. From the Cramer-Wold device, we conclude that $(T_n - n\mu)/\sqrt{n} \xrightarrow{d} N_p(0, \Sigma)$.

The following two simple results are frequently useful in calculations.

Theorem (i) (Prohorov's Theorem) If $X_n \xrightarrow{d} X$ for some $X$, then $X_n = O_p(1)$.
(ii) (Polya's Theorem) If $F_{X_n} \Rightarrow F_X$ and $F_X$ is continuous, then, as $n \to \infty$,
$$\sup_{-\infty < x < \infty} |F_{X_n}(x) - F_X(x)| \to 0.$$

Proof. (i) For any given $\varepsilon > 0$, fix a constant $M$ such that $P(|X| \ge M) < \varepsilon$. By the definition of convergence in law, $P(|X_n| \ge M)$ exceeds $P(|X| \ge M)$ by an arbitrarily small amount for sufficiently large $n$. Thus, there exists $N$ such that $P(|X_n| \ge M) < 2\varepsilon$ for all $n \ge N$. The result follows from the definition of $O_p(1)$.

(ii) First, fix $k \in N$. By the continuity of $F_X$ there exist points $-\infty = x_0 < x_1 < \cdots < x_k = \infty$ with $F_X(x_i) = i/k$. By monotonicity we have, for $x_{i-1} \le x \le x_i$,
$$F_{X_n}(x) - F_X(x) \le F_{X_n}(x_i) - F_X(x_{i-1}) = F_{X_n}(x_i) - F_X(x_i) + 1/k,$$
$$F_{X_n}(x) - F_X(x) \ge F_{X_n}(x_{i-1}) - F_X(x_i) = F_{X_n}(x_{i-1}) - F_X(x_{i-1}) - 1/k.$$
Thus $|F_{X_n}(x) - F_X(x)|$ is bounded above by $\sup_i |F_{X_n}(x_i) - F_X(x_i)| + 1/k$, for every $x$. The latter, finite supremum converges to zero, because each term converges to zero by the hypothesis, for each fixed $k$. Because $k$ is arbitrary, the result follows.

The following result can be used to check whether $X_n \xrightarrow{d} X$ when $X$ has a PDF $f$ and $X_n$ has a PDF $f_n$.

Theorem (Scheffe's Theorem) Let $f_n$ be a sequence of densities of absolutely continuous distributions, with $\lim_{n \to \infty} f_n(x) = f(x)$ for each $x \in R^p$. If $f$ is a density function, then
$$\lim_{n \to \infty} \int |f_n(x) - f(x)| \, dx = 0.$$

Proof. Put $g_n(x) = [f(x) - f_n(x)] I_{\{f(x) \ge f_n(x)\}}$. By noting that $\int [f_n(x) - f(x)] \, dx = 0$, we have
$$\int |f_n(x) - f(x)| \, dx = 2 \int g_n(x) \, dx.$$
Now $0 \le g_n(x) \le f(x)$ for all $x$, and $g_n(x) \to 0$ pointwise. Hence, by dominated convergence, $\lim_{n \to \infty} \int g_n(x) \, dx = 0$. [Dominated convergence theorem: if $\lim_{n \to \infty} f_n = f$ and there exists an integrable function $g$ such that $|f_n| \le g$, then $\lim_{n \to \infty} \int f_n(x) \, dx = \int \lim_{n \to \infty} f_n(x) \, dx$.]

As an example, consider the PDF $f_n$ of the t-distribution $t_n$, $n = 1, 2, \ldots$. One can show (exercise) that $f_n \to f$ pointwise, where $f$ is the standard normal PDF, so the $t_n$ distribution converges in law to $N(0,1)$.
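As a quick numerical companion to Scheffe's theorem (our own sketch; the grid and degrees of freedom are arbitrary choices), one can approximate the $L_1$ distance $\int |f_n(x) - \phi(x)|\,dx$ between the $t_n$ density and the standard normal density and watch it go to zero.

```python
import numpy as np
from scipy import stats

x = np.linspace(-30, 30, 200001)          # wide grid; both densities are tiny outside it
dx = x[1] - x[0]
phi = stats.norm.pdf(x)

for df in (1, 2, 5, 20, 100):
    # Riemann-sum approximation of the L1 distance between t_df and N(0,1) densities
    l1 = np.sum(np.abs(stats.t.pdf(x, df) - phi)) * dx
    print(f"df={df:4d}  approx int |f_n - phi| dx = {l1:.4f}")
```

For very small degrees of freedom the heavy tails make the truncation at $\pm 30$ a mild underestimate, but the decreasing trend toward zero is clear.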

The following result provides a convergence-of-moments criterion for convergence in law.

Theorem (Frechet and Shohat Theorem) Let the distribution functions $F_n$ possess finite moments $\alpha_{nk} = \int t^k \, dF_n(t)$ for $k = 1, 2, \ldots$ and $n = 1, 2, \ldots$. Assume that the limits $\alpha_k = \lim_{n \to \infty} \alpha_{nk}$ exist (finite) for each $k$. Then,
(i) the limits $\alpha_k$ are the moments of some distribution function $F$;
(ii) if the $F$ given by (i) is unique, then $F_n \Rightarrow F$.
[A sufficient condition: the moment sequence $\{\alpha_k\}$ determines the distribution $F$ uniquely if the Carleman condition $\sum_{i=1}^\infty \alpha_{2i}^{-1/(2i)} = \infty$ holds.]

1.2.5 Results on $o_p$ and $O_p$

There are many rules of calculus with $o_p$ and $O_p$ symbols, which we will apply without comment. For instance,
$$o_p(1) + o_p(1) = o_p(1), \quad o_p(1) + O_p(1) = O_p(1), \quad O_p(1)\, o_p(1) = o_p(1),$$
$$(1 + o_p(1))^{-1} = O_p(1), \quad o_p(R_n) = R_n\, o_p(1), \quad O_p(R_n) = R_n\, O_p(1), \quad o_p(O_p(1)) = o_p(1).$$
Two more complicated rules are given by the following lemma.

Lemma Let $g$ be a function defined on $R^p$ such that $g(0) = 0$. Let $X_n$ be a sequence of random vectors with values in $R^p$ that converges in probability to zero. Then, for every $r > 0$,
(i) if $g(t) = o(\|t\|^r)$ as $t \to 0$, then $g(X_n) = o_p(\|X_n\|^r)$;
(ii) if $g(t) = O(\|t\|^r)$ as $t \to 0$, then $g(X_n) = O_p(\|X_n\|^r)$.

Proof. Define $f(t) = g(t)/\|t\|^r$ for $t \ne 0$ and $f(0) = 0$. Then $g(X_n) = f(X_n) \|X_n\|^r$.
(i) Because the function $f$ is continuous at zero by assumption, $f(X_n) \xrightarrow{p} f(0) = 0$ by the continuous mapping theorem.
(ii) By assumption there exist $M$ and $\delta > 0$ such that $|f(t)| \le M$ whenever $\|t\| \le \delta$. Thus $P(|f(X_n)| > M) \le P(\|X_n\| > \delta) \to 0$, and the sequence $f(X_n)$ is bounded in probability.

1.3 The central limit theorem

The most fundamental result on convergence in law is the central limit theorem (CLT) for sums of random variables. We first state the case of chief importance, iid summands.

Definition A sequence of random variables $X_n$ is asymptotically normal with $\mu_n$ and $\sigma_n^2$ if $(X_n - \mu_n)/\sigma_n \xrightarrow{d} N(0,1)$; we write "$X_n$ is AN$(\mu_n, \sigma_n^2)$".

1.3.1 The CLT for the iid case

Theorem (Lindeberg-Levy) Let $X_i$ be iid with mean $\mu$ and finite variance $\sigma^2$. Then
$$\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d} N(0, 1).$$
By Slutsky's theorem, we can also write $\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2)$, and $\bar{X}$ is AN$(\mu, \sigma^2/n)$. See Billingsley (1995) for a proof.

Example (Confidence intervals) This theorem can be used to approximate $P(\bar{X} \le \mu + k\sigma/\sqrt{n})$ by $\Phi(k)$. This is very useful because the sampling distribution of $\bar{X}$ is not available except in some special cases. Setting $k = \Phi^{-1}(1 - \alpha) = z_\alpha$, the interval
$$[\bar{X}_n - \sigma z_\alpha / \sqrt{n}, \ \bar{X}_n + \sigma z_\alpha / \sqrt{n}]$$
is a confidence interval for $\mu$ of asymptotic level $1 - 2\alpha$. More precisely, the probability that $\mu$ is contained in this interval converges to $1 - 2\alpha$ (how accurate is this?).

Example (Sample variance) Suppose $X_1, \ldots, X_n$ are iid with mean $\mu$, variance $\sigma^2$, and $E(X_1^4) < \infty$. Consider the asymptotic distribution of $S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$. Write
$$\sqrt{n}(S_n^2 - \sigma^2) = \frac{n}{n-1} \left[ \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 - \sigma^2 \right) - \sqrt{n}(\bar{X}_n - \mu)^2 \right] + \frac{\sqrt{n}}{n-1}\, \sigma^2.$$
The last two terms converge to zero in probability, and the first term is asymptotically normal by the CLT. The whole expression is asymptotically normal by Slutsky's theorem, i.e.,
$$\sqrt{n}(S_n^2 - \sigma^2) \xrightarrow{d} N(0, \mu_4 - \sigma^4),$$
where $\mu_4$ denotes the centered fourth moment of $X_1$, and $\mu_4 - \sigma^4$ comes from computing the variance of $(X_1 - \mu)^2$.

Example (Level of the chi-square test) Normal theory prescribes rejecting the null hypothesis $H_0: \sigma^2 \le 1$ for values of $n S_n^2$ exceeding the upper $\alpha$ point $\chi^2_{n-1,\alpha}$ of the $\chi^2_{n-1}$ distribution. If the observations are sampled from a normal distribution, the test has exactly level $\alpha$. However, this is no longer the case, even approximately, if the underlying distribution is not normal. The CLT and the previous example yield the following two statements:
$$\frac{\chi^2_{n-1} - (n-1)}{\sqrt{2(n-1)}} \xrightarrow{d} N(0, 1), \qquad \sqrt{n} \left( \frac{S_n^2}{\sigma^2} - 1 \right) \xrightarrow{d} N(0, \kappa + 2),$$
where $\kappa = \mu_4/\sigma^4 - 3$ is the kurtosis of the underlying distribution. The first statement implies that $(\chi^2_{n-1,\alpha} - (n-1))/\sqrt{2(n-1)}$ converges to the upper $\alpha$ point $z_\alpha$ of $N(0,1)$. Thus, the level of the chi-square test satisfies
$$P_{H_0}(n S_n^2 > \chi^2_{n-1,\alpha}) = P\left( \sqrt{n} \left( \frac{S_n^2}{\sigma^2} - 1 \right) > \frac{\chi^2_{n-1,\alpha} - n}{\sqrt{n}} \right) \to 1 - \Phi\left( z_\alpha \sqrt{\frac{2}{\kappa + 2}} \right).$$
So, the asymptotic level equals $1 - \Phi(z_\alpha) = \alpha$ iff the kurtosis of the underlying distribution is 0. If the kurtosis goes to infinity, then the asymptotic level approaches $1 - \Phi(0) = 1/2$. We conclude that the level of the chi-square test is nonrobust against departures from normality that affect the value of the kurtosis. If, instead, we used a normal approximation to the distribution of $\sqrt{n}(S_n^2/\sigma^2 - 1)$, the problem would not arise, provided that the asymptotic variance $\kappa + 2$ is estimated accurately.
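The nonrobustness of the chi-square test's level is easy to see numerically. The following sketch (our own addition; the Laplace distribution, whose kurtosis is 3, and the sample sizes are arbitrary choices) estimates the actual rejection probability at the boundary $\sigma^2 = 1$ and compares it with the limit $1 - \Phi(z_\alpha\sqrt{2/(\kappa+2)})$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, kappa = 0.05, 3.0                       # Laplace kurtosis (excess) is 3

def chi2_test_level(n, reps=20000):
    """Rejection rate of n*S_n^2 > chi2_{n-1,alpha} when X_i ~ Laplace with variance 1."""
    x = rng.laplace(scale=1 / np.sqrt(2), size=(reps, n))   # Var = 2*scale^2 = 1
    s2 = x.var(axis=1, ddof=1)
    crit = stats.chi2.ppf(1 - alpha, df=n - 1)
    return np.mean(n * s2 > crit)

limit = 1 - stats.norm.cdf(stats.norm.ppf(1 - alpha) * np.sqrt(2 / (kappa + 2)))
for n in (20, 100, 500):
    print(n, round(chi2_test_level(n), 3), "limit:", round(limit, 3))
```

The simulated level sits far above the nominal 0.05 and approaches the predicted limit as $n$ grows.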

Theorem (Multivariate CLT for the iid case) Let $X_i$ be iid random $p$-vectors with mean $\mu$ and covariance matrix $\Sigma$. Then
$$\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N_p(0, \Sigma).$$

Proof. By the Cramer-Wold device, this can be proved by finding the limit distribution of the sequence of real variables
$$c^T \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) \right) = \frac{1}{\sqrt{n}} \sum_{i=1}^n (c^T X_i - c^T \mu).$$
Because the random variables $c^T X_i - c^T \mu$ are iid with zero mean and variance $c^T \Sigma c$, this sequence is AN$(0, c^T \Sigma c)$ by the Lindeberg-Levy theorem. This is exactly the distribution of $c^T X$ if $X$ possesses an $N_p(0, \Sigma)$ distribution.

Example Suppose that $X_1, \ldots, X_n$ is a random sample from the Poisson distribution with mean $\theta$. Let $Z_n$ be the proportion of zeros observed, i.e., $Z_n = \frac{1}{n} \sum_{j=1}^n I_{\{X_j = 0\}}$. Let us find the joint asymptotic distribution of $(\bar{X}_n, Z_n)$. Note that $E(X_1) = \theta$, $E I_{\{X_1 = 0\}} = e^{-\theta}$, $\mathrm{Var}(X_1) = \theta$, $\mathrm{Var}(I_{\{X_1 = 0\}}) = e^{-\theta}(1 - e^{-\theta})$, and $E[X_1 I_{\{X_1 = 0\}}] = 0$. So $\mathrm{Cov}(X_1, I_{\{X_1 = 0\}}) = -\theta e^{-\theta}$. Hence,
$$\sqrt{n}\left( (\bar{X}_n, Z_n) - (\theta, e^{-\theta}) \right) \xrightarrow{d} N_2(0, \Sigma), \quad \text{where } \Sigma = \begin{pmatrix} \theta & -\theta e^{-\theta} \\ -\theta e^{-\theta} & e^{-\theta}(1 - e^{-\theta}) \end{pmatrix}.$$

It is not as widely known that the existence of a variance is not necessary for asymptotic normality of partial sums of iid random variables. A CLT without a finite variance can sometimes be useful. We present the general result below and then give an illustrative example. Feller (1966) contains detailed information on the availability of CLTs without the existence of a variance, along with proofs. First, we need a definition.

Definition A function $g: R \to R$ is called slowly varying at $\infty$ if, for every $t > 0$, $\lim_{x \to \infty} g(tx)/g(x) = 1$.

Examples of slowly varying functions are $\log x$, $x/(1+x)$, and indeed any function with a finite nonzero limit as $x \to \infty$. But, for example, $\sqrt{x}$ and $e^x$ are not slowly varying.

Theorem Let $X_1, X_2, \ldots$ be iid from a CDF $F$ on $R$. Let $v(x) = \int_{-x}^{x} y^2 \, dF(y)$. Then there exist constants $\{a_n\}$ and $\{b_n\}$ such that
$$\frac{\sum_{i=1}^n X_i - a_n}{b_n} \xrightarrow{d} N(0, 1)$$
if and only if $v(x)$ is slowly varying at $\infty$.

If $F$ has a finite second moment, then $v(x)$ is automatically slowly varying at $\infty$. We present an example below where asymptotic normality of the partial sums still holds although the summands do not have a finite variance.

Example Suppose $X_1, X_2, \ldots$ are iid from a t-distribution with 2 degrees of freedom ($t(2)$), which has a finite mean but not a finite variance. The density is given by $f(y) = c/(2 + y^2)^{3/2}$ for some positive $c$. Hence, by direct integration, for some other constant $k$,
$$v(x) = k\left[ \mathrm{arcsinh}(x/\sqrt{2}) - \frac{x}{\sqrt{2 + x^2}} \right].$$
Therefore, using the fact that $\mathrm{arcsinh}(x) = \log(2x) + O(x^{-2})$ as $x \to \infty$, we get $\frac{v(tx)}{v(x)} \to 1$ for any $t > 0$ after some algebra. It follows that for iid observations from a $t(2)$ distribution, with suitable centering and normalizing, the partial sums $\sum_{i=1}^n X_i$ converge to a normal distribution, although the $X_i$ do not have a finite variance. The centering can be taken to be zero for the centered t-distribution; it can be shown that the required normalizing is $b_n = \sqrt{n \log n}$ (why?).
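A simulation makes the unusual $\sqrt{n\log n}$ normalization tangible (our own sketch; the sample sizes and replication count are arbitrary choices): partial sums of $t(2)$ variables scaled by $\sqrt{n\log n}$ stay stable and look approximately normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

for n in (10**3, 10**4, 10**5):
    reps = 200
    sums = rng.standard_t(df=2, size=(reps, n)).sum(axis=1)
    scaled = sums / np.sqrt(n * np.log(n))
    ks = stats.kstest(scaled / scaled.std(), "norm").statistic
    print(f"n={n:6d}  sd of S_n/sqrt(n log n) = {scaled.std():.3f}  "
          f"KS distance to fitted normal = {ks:.3f}")
```

The standard-deviation column stabilizes (around a constant depending on the scaling convention), and the Kolmogorov-Smirnov distance to a fitted normal stays small, consistent with the theorem.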

1.3.2 The CLT for the independent, not necessarily iid case

Theorem (Lindeberg-Feller) Suppose $X_n$ is a sequence of independent variables with means $\mu_n$ and variances $\sigma_n^2 < \infty$. Let $s_n^2 = \sum_{i=1}^n \sigma_i^2$. If for any $\epsilon > 0$
$$\frac{1}{s_n^2} \sum_{j=1}^n \int_{|x - \mu_j| > \epsilon s_n} (x - \mu_j)^2 \, dF_j(x) \to 0, \tag{1.2}$$
where $F_i$ is the CDF of $X_i$, then
$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{s_n} \xrightarrow{d} N(0, 1).$$
A proof can be found on page 67 of Shao (2003). Condition (1.2) is called the Lindeberg-Feller condition.

Example Let $X_1, X_2, \ldots$ be independent variables such that $X_j$ has the uniform distribution on $[-j, j]$, $j = 1, 2, \ldots$. Let us verify that the conditions of the Lindeberg-Feller theorem are satisfied. Note that $E X_j = 0$ and $\sigma_j^2 = \frac{1}{2j} \int_{-j}^{j} x^2 \, dx = j^2/3$ for all $j$. Hence,
$$s_n^2 = \sum_{j=1}^n \sigma_j^2 = \frac{1}{3} \sum_{j=1}^n j^2 = \frac{n(n+1)(2n+1)}{18}.$$
For any $\epsilon > 0$ we have $n < \epsilon s_n$ for sufficiently large $n$, since $\lim_{n \to \infty} n/s_n = 0$. Because $|X_j| \le j \le n$, when $n$ is sufficiently large $E(X_j^2 I_{\{|X_j| > \epsilon s_n\}}) = 0$ for every $j \le n$, and consequently $\sum_{j=1}^n E(X_j^2 I_{\{|X_j| > \epsilon s_n\}}) = 0$ for all large $n$. Since also $s_n \to \infty$, Lindeberg's condition holds.

The Lindeberg-Feller theorem is a landmark theorem in probability and statistics. Generally, however, it is hard to verify the Lindeberg-Feller condition. A simpler theorem is the following.

Theorem (Liapounov) Suppose $X_n$ is a sequence of independent variables with means $\mu_n$ and variances $\sigma_n^2 < \infty$. Let $s_n^2 = \sum_{i=1}^n \sigma_i^2$. If for some $\delta > 0$
$$\frac{1}{s_n^{2+\delta}} \sum_{j=1}^n E|X_j - \mu_j|^{2+\delta} \to 0 \tag{1.3}$$
as $n \to \infty$, then
$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{s_n} \xrightarrow{d} N(0, 1).$$
A proof is given in Sen and Singer (1993). For instance, if $s_n \to \infty$, $\sup_{j \ge 1} E|X_j - \mu_j|^{2+\delta} < \infty$, and $s_n^2/n$ is bounded away from zero, then the condition of Liapounov's theorem is satisfied. In practice one usually tries to work with $\delta = 1$ or $2$ for algebraic convenience. It can easily be checked that if the $X_i$ are uniformly bounded and $s_n \to \infty$, the condition is immediately satisfied with $\delta = 1$.

Example Let $X_1, X_2, \ldots$ be independent random variables, and suppose that $X_i$ has the Bernoulli distribution BIN$(p_i, 1)$, $i = 1, 2, \ldots$. For each $i$, $E X_i = p_i$ and
$$E|X_i - E X_i|^3 = (1 - p_i)^3 p_i + p_i^3 (1 - p_i) \le 2 p_i (1 - p_i).$$

Hence,
$$\sum_{i=1}^n E|X_i - E X_i|^3 \le 2 \sum_{i=1}^n E|X_i - E X_i|^2 = 2 \sum_{i=1}^n p_i (1 - p_i) = 2 s_n^2,$$
and Liapounov's condition (1.3) holds with $\delta = 1$ if $s_n \to \infty$, since $2 s_n^2/s_n^3 = 2/s_n \to 0$. For example, if $p_i = 1/i$, or if $M_1 \le p_i \le M_2$ for two constants $M_1, M_2 \in (0,1)$, then $s_n \to \infty$ holds. Accordingly, by Liapounov's theorem,
$$\frac{\sum_{i=1}^n (X_i - p_i)}{s_n} \xrightarrow{d} N(0, 1).$$

A consequence especially useful in regression is the following theorem, which is also proved in Sen and Singer (1993).

Theorem (Hajek-Sidak) Suppose $X_1, X_2, \ldots$ are iid random variables with mean $\mu$ and variance $\sigma^2 < \infty$. Let $c_n = (c_{n1}, c_{n2}, \ldots, c_{nn})$ be a vector of constants such that
$$\max_{1 \le i \le n} \frac{c_{ni}^2}{\sum_{j=1}^n c_{nj}^2} \to 0 \tag{1.4}$$
as $n \to \infty$. Then
$$\frac{\sum_{i=1}^n c_{ni}(X_i - \mu)}{\sigma \sqrt{\sum_{j=1}^n c_{nj}^2}} \xrightarrow{d} N(0, 1).$$
Condition (1.4) ensures that no coefficient dominates the vector $c_n$; it is referred to as the Hajek-Sidak condition in the literature. For example, if $c_n = (1, 0, \ldots, 0)$, the condition would fail, and so would the theorem. The Hajek-Sidak theorem has many applications, including in the regression problem. Here is an important example.

Example (Simple linear regression) Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are iid with mean 0 and variance $\sigma^2$ but are not necessarily normally distributed. The least squares estimate of $\beta_1$ based on $n$ observations is
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (y_i - \bar{y}_n)(x_i - \bar{x}_n)}{\sum_{i=1}^n (x_i - \bar{x}_n)^2} = \beta_1 + \frac{\sum_{i=1}^n \varepsilon_i (x_i - \bar{x}_n)}{\sum_{i=1}^n (x_i - \bar{x}_n)^2}.$$
So $\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n \varepsilon_i c_{ni} / \sum_{j=1}^n c_{nj}^2$, where $c_{ni} = x_i - \bar{x}_n$. Hence, by the Hajek-Sidak theorem,
$$\frac{\sqrt{\sum_{j=1}^n c_{nj}^2}\,(\hat{\beta}_1 - \beta_1)}{\sigma} = \frac{\sum_{i=1}^n \varepsilon_i c_{ni}}{\sigma \sqrt{\sum_{j=1}^n c_{nj}^2}} \xrightarrow{d} N(0, 1),$$
provided
$$\frac{\max_{1 \le i \le n} (x_i - \bar{x}_n)^2}{\sum_{j=1}^n (x_j - \bar{x}_n)^2} \to 0 \quad \text{as } n \to \infty.$$
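The following sketch (our own illustration; the uniformly drawn design points and the centered exponential errors are arbitrary, hypothetical choices) checks the Hajek-Sidak conclusion for the regression slope by simulating the standardized estimator $\sqrt{\sum_j c_{nj}^2}\,(\hat\beta_1-\beta_1)/\sigma$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def standardized_slope(n, reps=5000, beta0=1.0, beta1=2.0):
    """Simulate sqrt(sum c_nj^2) * (beta1_hat - beta1) / sigma with non-normal errors."""
    x = rng.uniform(0, 10, size=n)                  # fixed design, reused across replications
    cx = x - x.mean()
    sigma = 1.0
    out = np.empty(reps)
    for r in range(reps):
        eps = rng.exponential(sigma, size=n) - sigma        # mean 0, variance sigma^2
        y = beta0 + beta1 * x + eps
        b1 = np.sum((y - y.mean()) * cx) / np.sum(cx ** 2)
        out[r] = np.sqrt(np.sum(cx ** 2)) * (b1 - beta1) / sigma
    return out

for n in (10, 50, 200):
    z = standardized_slope(n)
    print(n, "KS distance to N(0,1):", round(stats.kstest(z, "norm").statistic, 3))
```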

For most reasonable designs, the Hajek-Sidak condition above is satisfied. Thus the asymptotic normality of the LSE (least squares estimate) is established under some conditions on the design variables, an important result.

Theorem (Lindeberg-Feller, multivariate) Suppose $X_i$ is a sequence of independent random vectors with means $\mu_i$, covariance matrices $\Sigma_i$, and distribution functions $F_i$. Suppose that $\frac{1}{n} \sum_{i=1}^n \Sigma_i \to \Sigma$ as $n \to \infty$, and that for any $\epsilon > 0$,
$$\frac{1}{n} \sum_{j=1}^n \int_{\|x - \mu_j\| > \epsilon \sqrt{n}} \|x - \mu_j\|^2 \, dF_j(x) \to 0.$$
Then
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu_i) \xrightarrow{d} N(0, \Sigma).$$

Example (Multiple regression) In the linear regression problem we observe a vector $y = X\beta + \varepsilon$ for a fixed or random matrix $X$ of full rank and an error vector $\varepsilon$ with iid components with mean zero and variance $\sigma^2$. The least squares estimator of $\beta$ is $\hat{\beta} = (X^T X)^{-1} X^T y$. This estimator is unbiased and has covariance matrix $\sigma^2 (X^T X)^{-1}$. If the error vector $\varepsilon$ is normally distributed, then $\hat{\beta}$ is exactly normally distributed. Under reasonable conditions on the design matrix, $\hat{\beta}$ is asymptotically normally distributed for a large range of error distributions. Here we fix $p$ and let $n$ tend to infinity. This follows from the representation
$$(X^T X)^{1/2} (\hat{\beta} - \beta) = (X^T X)^{-1/2} X^T \varepsilon = \sum_{i=1}^n a_{ni} \varepsilon_i,$$
where $a_{n1}, \ldots, a_{nn}$ are the columns of the $(p \times n)$ matrix $(X^T X)^{-1/2} X^T =: A$. This sequence is asymptotically normal if the vectors $a_{n1} \varepsilon_1, \ldots, a_{nn} \varepsilon_n$ satisfy the Lindeberg condition. The norming matrix $(X^T X)^{1/2}$ has been chosen to ensure that the vectors in the display have covariance matrix $\sigma^2 I_p$ for every $n$. The remaining condition is
$$\sum_{i=1}^n \|a_{ni}\|^2\, E\, \varepsilon_i^2 I_{\{\|a_{ni}\| |\varepsilon_i| > \epsilon\}} \to 0.$$

This can be simplified to other conditions in several ways. Because $\sum_{i=1}^n \|a_{ni}\|^2 = \mathrm{tr}(A A^T) = p$, it suffices that $\max_i E\, \varepsilon_i^2 I_{\{\|a_{ni}\| |\varepsilon_i| > \epsilon\}} \to 0$, which (given $E \varepsilon_1^2 < \infty$) is equivalent to $\max_i \|a_{ni}\| \to 0$. Alternatively, the expectation $E\, \varepsilon_i^2 I_{\{\|a_{ni}\| |\varepsilon_i| > \epsilon\}}$ can be bounded by $\epsilon^{-k} E|\varepsilon_i|^{k+2} \|a_{ni}\|^k$, and a second set of sufficient conditions is
$$\sum_{i=1}^n \|a_{ni}\|^k \to 0 \quad \text{and} \quad E|\varepsilon_1|^k < \infty, \quad k > 2.$$

1.3.3 CLT for a random number of summands

The canonical CLT for the iid case says that if $X_1, X_2, \ldots$ are iid with mean zero and a finite variance $\sigma^2$, then the sequence of partial sums $T_n = \sum_{i=1}^n X_i$ obeys the central limit theorem in the sense that $\frac{T_n}{\sigma \sqrt{n}} \xrightarrow{d} N(0, 1)$. In some practical problems, for example in sequential statistical analysis, the number of terms in a partial sum is a random variable. Precisely, $\{N(t)\}$, $t \ge 0$, is a family of (nonnegative) integer-valued random variables, and we want to approximate the distribution of $T_{N(t)}$, where for each fixed $n$, $T_n$ is still the sum of $n$ iid variables as above. The question is whether a CLT still holds under appropriate conditions. Here is the Anscombe-Renyi theorem.

Theorem (Anscombe-Renyi) Let $X_i$ be iid with mean $\mu$ and a finite variance $\sigma^2$, let $\{N_n\}$ be a sequence of (nonnegative) integer-valued random variables, and let $\{a_n\}$ be a sequence of positive constants tending to $\infty$ such that $N_n / a_n \xrightarrow{p} c$, $0 < c < \infty$, as $n \to \infty$. Then
$$\frac{T_{N_n} - N_n \mu}{\sigma \sqrt{N_n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.$$

Example (Coupon collection problem) Consider a problem in which a person keeps purchasing boxes of cereal until she obtains a full set of some $n$ coupons. The assumptions are that each box contains each of the $n$ coupons with equal probability, mutually independently. Suppose that the costs of buying the cereal boxes are iid with some mean $\mu$ and some variance $\sigma^2$. If it takes $N_n$ boxes to obtain the complete set of all $n$ coupons, then $N_n / (n \ln n) \xrightarrow{p} 1$ as $n \to \infty$. The total cost to the customer of obtaining the complete set of coupons is $T_{N_n} = X_1 + \cdots + X_{N_n}$.

By the Anscombe-Renyi theorem and Slutsky's theorem, we have that
$$\frac{T_{N_n} - N_n \mu}{\sigma \sqrt{n \ln n}}$$
is approximately $N(0, 1)$.

[On the distribution of $N_n$: let $t_i$ be the number of boxes needed to collect the $i$th new coupon after $i - 1$ coupons have been collected. The probability of obtaining a new coupon given that $i - 1$ have been collected is $p_i = (n - i + 1)/n$. Therefore $t_i$ has a geometric distribution with expectation $1/p_i$, and $N_n = \sum_{i=1}^n t_i$. By Theorem 1.2.5, we know
$$\frac{1}{n \ln n} N_n - \frac{1}{n \ln n} \sum_{i=1}^n \frac{1}{p_i} \xrightarrow{p} 0, \quad \text{where} \quad \frac{1}{n \ln n} \sum_{i=1}^n \frac{1}{p_i} = \frac{1}{n \ln n} \sum_{i=1}^n \frac{n}{n - i + 1} = \frac{1}{\ln n} \sum_{i=1}^n \frac{1}{i} =: \frac{H_n}{\ln n}.$$
Note that $H_n$ is the harmonic number, and hence, using the asymptotics of the harmonic numbers ($H_n = \ln n + \gamma + o(1)$, where $\gamma$ is Euler's constant), we obtain $\frac{N_n}{n \ln n} \xrightarrow{p} 1$.]

1.3.4 Central limit theorems for dependent sequences

The assumption that observed data $X_1, X_2, \ldots$ form an independent sequence is often one of technical convenience. Real data frequently exhibit some dependence, and at the least some correlation at small lags. Exact sampling distributions for fixed $n$ are even more complicated for dependent data than in the independent case, and so asymptotics remain useful. In this subsection we present CLTs for some important dependence structures: stationary m-dependence and sampling without replacement.

Stationary m-dependence

We start with an example to illustrate that a CLT for sample means can hold even if the summands are not independent.

Example Suppose $X_1, X_2, \ldots$ is a stationary Gaussian sequence with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then, for each $n$, $\sqrt{n}(\bar{X}_n - \mu)$ is normally distributed, and so $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \tau^2)$, provided $\tau^2 = \lim_{n \to \infty} \mathrm{Var}(\sqrt{n}(\bar{X}_n - \mu)) < \infty$. But
$$\mathrm{Var}(\sqrt{n}(\bar{X}_n - \mu)) = \sigma^2 + \frac{2}{n} \sum_{i < j} \mathrm{Cov}(X_i, X_j) = \sigma^2 + \frac{2}{n} \sum_{i=1}^{n-1} (n - i) \gamma_i,$$

where $\gamma_i = \mathrm{Cov}(X_1, X_{i+1})$. Therefore, $\tau^2 < \infty$ if and only if $\frac{2}{n} \sum_{i=1}^{n-1} (n - i) \gamma_i$ has a finite limit, say $\rho$, in which case $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2 + \rho)$. What is going on qualitatively is that $\frac{1}{n} \sum_{i=1}^{n-1} (n - i) \gamma_i$ has a finite limit when $\gamma_i \to 0$ adequately fast. Instances of this are when only a fixed finite number of the $\gamma_i$ are nonzero, or when $\gamma_i$ is damped exponentially, i.e., $\gamma_i = O(a^i)$ for some $a < 1$. It turns out that there are general CLTs for sample averages under such conditions. The case of m-dependence is treated below.

Definition A stationary sequence $\{X_n\}$ is called m-dependent for a given fixed $m$ if $(X_1, \ldots, X_i)$ and $(X_j, X_{j+1}, \ldots)$ are independent whenever $j - i > m$.

Theorem (m-dependent sequence) Let $\{X_i\}$ be a stationary m-dependent sequence with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \tau^2)$, where
$$\tau^2 = \sigma^2 + 2 \sum_{i=2}^{m+1} \mathrm{Cov}(X_1, X_i).$$
See Lehmann (1999) for a proof. m-dependent data arise either from standard time series models or as models in their own right. For example, if $\{Z_i\}$ are iid random variables and $X_i = a_1 Z_{i-1} + a_2 Z_{i-2}$, $i \ge 3$, then $\{X_i\}$ is 1-dependent; this is a simple moving average process of use in time series analysis. A more general m-dependent sequence is $X_i = h(Z_i, Z_{i+1}, \ldots, Z_{i+m})$ for some function $h$.

Example Suppose the $Z_i$ are iid with mean $\mu$ and a finite variance $\sigma^2$, and let $X_i = (Z_i + Z_{i+1})/2$. Then, obviously,
$$\sum_{i=1}^n X_i = \frac{Z_1 + Z_{n+1}}{2} + \sum_{i=2}^n Z_i.$$
Then, by Slutsky's theorem, $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. Notice that we write $\sqrt{n}(\bar{X}_n - \mu)$ as a sum of two parts, one of which is dominant and produces the CLT, while the other is asymptotically negligible. This is essentially the method of proof of the CLT for more general m-dependent sequences.

Sampling without replacement

Dependent data also arise naturally in sampling without replacement from a finite population. Central limit theorems are available, and we will present them shortly, starting with an illustrative example.

Example Suppose that, among $N$ objects in a population, $D$ are of type 1 and $N - D$ are of type 2. A sample without replacement of size $n$ is taken, and let $X$ be the number of sampled units of type 1. We can regard the $D$ type-1 units as having numerical values $X_1 = \cdots = X_D = 1$ and the rest as having values $X_{D+1} = \cdots = X_N = 0$, so that $X = \sum_{i=1}^n X_{N_i}$, where $X_{N_1}, \ldots, X_{N_n}$ correspond to the sampled units. Of course, $X$ has the hypergeometric distribution
$$P(X = x) = \frac{C_x^D C_{n-x}^{N-D}}{C_n^N}, \quad 0 \le x \le D.$$
Two configurations can be considered:
(a) $n$ is fixed, and $D/N \to p$, $0 < p < 1$, with $N \to \infty$. In this case, by applying Stirling's approximation to $N!$ and $D!$, $P(X = x) \to C_x^n p^x (1 - p)^{n-x}$, and so $X \xrightarrow{d}$ Bin$(n, p)$;
(b) $n \to \infty$, $N \to \infty$, $N - n \to \infty$, $D/N \to p$, $0 < p < 1$. This is the case in which convergence of $X$ to normality holds.

Here is a general result; again, see Lehmann (1999) for a proof.

Theorem For $N \ge 1$, let $\pi_N$ be a finite population with numerical values $X_1, X_2, \ldots, X_N$. Let $X_{N1}, X_{N2}, \ldots, X_{Nn}$ be the values of the units of a sample without replacement of size $n$. Let $\bar{X}_n = \sum_{i=1}^n X_{Ni}/n$ and $\bar{X}_N = \sum_{i=1}^N X_i/N$. Suppose $n \to \infty$, $N - n \to \infty$, and that either of the following holds:
(a) $\dfrac{\max_{1 \le i \le N} (X_i - \bar{X}_N)^2}{\sum_{i=1}^N (X_i - \bar{X}_N)^2} \to 0$ and $n/N \to \tau$, $0 < \tau < 1$, as $N \to \infty$;
(b) $\dfrac{N \max_{1 \le i \le N} (X_i - \bar{X}_N)^2}{\sum_{i=1}^N (X_i - \bar{X}_N)^2} = O(1)$ as $N \to \infty$.
Then
$$\frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} \xrightarrow{d} N(0, 1).$$

Example Suppose $X_{N1}, \ldots, X_{Nn}$ is a sample without replacement from the set $\{1, 2, \ldots, N\}$, and let $\bar{X}_n = \sum_{i=1}^n X_{Ni}/n$. Then, by direct calculation,
$$E(\bar{X}_n) = \frac{N+1}{2}, \qquad \mathrm{Var}(\bar{X}_n) = \frac{(N - n)(N + 1)}{12 n}.$$
Furthermore,
$$\frac{N \max_{1 \le i \le N} (X_i - \bar{X}_N)^2}{\sum_{i=1}^N (X_i - \bar{X}_N)^2} = \frac{3(N - 1)}{N + 1} = O(1).$$
Hence, by the theorem above,
$$\frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} \xrightarrow{d} N(0, 1).$$

1.3.5 Accuracy of the CLT

Suppose a sequence of CDFs $F_{X_n} \xrightarrow{d} F_X$ for some $F_X$. Such a weak convergence result is usually used to approximate the true value of $F_{X_n}(x)$ at some fixed $n$ and $x$ by $F_X(x)$. However, the weak convergence result by itself says absolutely nothing about the accuracy of approximating $F_{X_n}(x)$ by $F_X(x)$ for that particular value of $n$. To approximate $F_{X_n}(x)$ by $F_X(x)$ for a given finite $n$ is a leap of faith unless we have some idea of the error committed, i.e., of $|F_{X_n}(x) - F_X(x)|$. More specifically, if for a sequence of random variables $X_1, \ldots, X_n$
$$\frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} \xrightarrow{d} Z \sim N(0, 1),$$
then we need some idea of the error
$$\left| P\left( \frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} \le x \right) - \Phi(x) \right|$$
in order to use the central limit theorem for a practical approximation with some degree of confidence. The first result in this direction for the iid case is the classic Berry-Esseen theorem. Typically, these accuracy measures give bounds on the error in the appropriate CLT for any fixed $n$, under assumptions on the moments of the $X_i$. In the canonical iid case with a finite variance, the CLT says that $\sqrt{n}(\bar{X} - \mu)/\sigma$ converges in law to $N(0, 1)$. By Polya's theorem, the uniform error
$$\Delta_n = \sup_{-\infty < x < \infty} \left| P\left( \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \to 0$$
as $n \to \infty$. Bounds on $\Delta_n$ for any given $n$ are called uniform bounds.
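Before stating the Berry-Esseen theorem, one can get a feel for the size of $\Delta_n$ by brute force (a sketch of our own; the Exponential(1) parent and the grid of sample sizes are arbitrary choices): simulate many standardized means and take the maximum gap between their empirical CDF and $\Phi$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def uniform_error(n, reps=50000):
    """Monte Carlo estimate of Delta_n = sup_x |P(sqrt(n)(Xbar-mu)/sigma <= x) - Phi(x)|
    for Exponential(1) data (mu = sigma = 1)."""
    z = np.sort(np.sqrt(n) * (rng.exponential(size=(reps, n)).mean(axis=1) - 1.0))
    ecdf_hi = np.arange(1, reps + 1) / reps        # empirical CDF just after each point
    ecdf_lo = np.arange(0, reps) / reps            # ... and just before
    phi = stats.norm.cdf(z)
    return max(np.max(np.abs(ecdf_hi - phi)), np.max(np.abs(ecdf_lo - phi)))

for n in (5, 20, 100, 400):
    print(f"n={n:4d}  estimated Delta_n ~ {uniform_error(n):.3f}")
```

The estimated $\Delta_n$ shrinks roughly like $n^{-1/2}$, matching the rate that the Berry-Esseen theorem below guarantees.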

The following results are the classic Berry-Esseen uniform bound and an extension of the Berry-Esseen inequality to the case of independent but not iid variables; a proof can be found in Petrov (1975). Introducing a higher-order (third) moment assumption, the Berry-Esseen inequality asserts the rate $O(n^{-1/2})$ for this convergence.

Theorem (i) (Berry-Esseen; iid case) Let $X_1, \ldots, X_n$ be iid with $E(X_1) = \mu$, $\mathrm{Var}(X_1) = \sigma^2$, and $\beta_3 = E|X_1 - \mu|^3 < \infty$. Then there exists a universal constant $C$, not depending on $n$ or the distribution of the $X_i$, such that
$$\sup_x \left| P\left( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C \beta_3}{\sigma^3 \sqrt{n}}.$$
(ii) (Independent but not iid case) Let $X_1, \ldots, X_n$ be independent with $E(X_i) = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$, and $\beta_{3i} = E|X_i - \mu_i|^3 < \infty$. Then there exists a universal constant $C$, not depending on $n$ or the distributions of the $X_i$, such that
$$\sup_x \left| P\left( \frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} \le x \right) - \Phi(x) \right| \le \frac{C \sum_{i=1}^n \beta_{3i}}{\left( \sum_{i=1}^n \sigma_i^2 \right)^{3/2}}.$$

This is the best possible rate in the sense that it cannot be improved without narrowing the class of distribution functions considered. For some specific underlying CDFs $F_X$, better rates of convergence in the CLT may be possible. This issue will become clearer when we discuss asymptotic expansions for $P(\sqrt{n}(\bar{X}_n - \mu)/\sigma \le x)$. In part (i), the universal constant $C$ may be taken to be $C = 0.8$.

Example The Berry-Esseen bound is uniform in $x$, and it is valid for any $n \ge 1$. While these are positive features of the theorem, it may not be possible to establish that $\Delta_n \le \epsilon$ for some preassigned $\epsilon > 0$ by using the Berry-Esseen theorem unless $n$ is very large. Let us see an illustrative example. Suppose $X_1, \ldots, X_n$ are iid BIN$(p, 1)$ and $n = 100$, and suppose we want the CLT approximation to be accurate to within an error of $\Delta_n = 0.005$. In the Bernoulli case, $\beta_3 = pq(1 - 2pq)$, where $q = 1 - p$. Using $C = 0.8$, the uniform Berry-Esseen bound is
$$\Delta_n \le \frac{0.8\, pq(1 - 2pq)}{(pq)^{3/2} \sqrt{n}}.$$
This is less than the prescribed $\Delta_n = 0.005$ only if $pq$ exceeds a threshold that is not attained for any $0 < p < 1$. Even for $p = 0.5$, the bound is less than or equal to $\Delta_n = 0.005$ only when $n$ exceeds roughly 25,000, which is a very large sample size. Of course, this is not necessarily a flaw of the Berry-Esseen inequality itself, because the desire to have a uniform error of at most $\Delta_n = 0.005$ is a tough demand, and a fairly large value of $n$ is probably needed to have such a small error in the CLT.
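The Bernoulli calculation above is easy to reproduce numerically; the helper below (our own addition, using the constant $C = 0.8$ quoted above) evaluates the Berry-Esseen bound as a function of $n$ and $p$ and searches for the smallest $n$ meeting a prescribed accuracy.

```python
import numpy as np

def be_bound(n, p, C=0.8):
    """Berry-Esseen uniform bound C*beta_3/(sigma^3*sqrt(n)) for iid Bernoulli(p)."""
    q = 1 - p
    beta3 = p * q * (1 - 2 * p * q)        # E|X - p|^3
    sigma3 = (p * q) ** 1.5
    return C * beta3 / (sigma3 * np.sqrt(n))

print(be_bound(100, 0.5))                   # about 0.08, far above the target 0.005

# smallest n for which the bound drops below 0.005 at p = 0.5
n = 1
while be_bound(n, 0.5) > 0.005:
    n += 1
print("smallest n with bound <= 0.005 at p = 0.5:", n)
```

With $C = 0.8$ the loop stops at $n = 25{,}600$, in line with the "well over 25,000" figure quoted above.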

Example As an example of independent variables that are not iid, consider $X_i \sim$ BIN$(i^{-1}, 1)$, $i \ge 1$, and let $S_n = \sum_{i=1}^n X_i$. Then $E(S_n) = \sum_{i=1}^n i^{-1}$, $\mathrm{Var}(S_n) = \sum_{i=1}^n (i - 1)/i^2$, and $\beta_{3i} = (i - 1)(i^2 - 2i + 2)/i^4$. Therefore, from part (ii) of the theorem above,
$$\Delta_n \le \frac{C \sum_{i=1}^n (i - 1)(i^2 - 2i + 2)/i^4}{\left[ \sum_{i=1}^n (i - 1)/i^2 \right]^{3/2}}.$$
Observe now that $\sum_{i=1}^n (i - 1)/i^2 = \log n + O(1)$ and $\sum_{i=1}^n (i - 1)(i^2 - 2i + 2)/i^4 = \log n + O(1)$. Substituting these back into the Berry-Esseen bound, one obtains with some minor algebra that $\Delta_n = O((\log n)^{-1/2})$.

For $x$ sufficiently large, while $n$ remains fixed, the quantities $F_{X_n}(x)$ and $F_X(x)$ each become so close to 1 that the bound given in the Berry-Esseen theorem is too crude. There has been a parallel development of bounds on the error in the CLT at a particular $x$, as opposed to bounds on the uniform error. Such bounds are called local Berry-Esseen bounds. Many different types of local bounds are available; we present here just one.

Theorem Let $X_1, \ldots, X_n$ be independent with $E(X_i) = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$, and $E|X_i - \mu_i|^{2+\delta} < \infty$ for some $0 < \delta \le 1$. Then
$$\left| P\left( \frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} \le x \right) - \Phi(x) \right| \le \frac{D \sum_{i=1}^n E|X_i - \mu_i|^{2+\delta}}{(1 + |x|^{2+\delta}) \left( \sum_{i=1}^n \sigma_i^2 \right)^{1 + \delta/2}}$$
for some universal constant $0 < D < \infty$.

Such local bounds are useful in proving convergence of global error criteria such as $\int |F_{X_n}(x) - \Phi(x)|^p \, dx$, or for establishing approximations to the moments of $F_{X_n}$; uniform error bounds would be useless for these purposes. If the third absolute moments are finite, ...


More information

Notes on Random Vectors and Multivariate Normal

Notes on Random Vectors and Multivariate Normal MATH 590 Spring 06 Notes on Random Vectors and Multivariate Normal Properties of Random Vectors If X,, X n are random variables, then X = X,, X n ) is a random vector, with the cumulative distribution

More information

Chapter 5 continued. Chapter 5 sections

Chapter 5 continued. Chapter 5 sections Chapter 5 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Chapter 6. Convergence. Probability Theory. Four different convergence concepts. Four different convergence concepts. Convergence in probability

Chapter 6. Convergence. Probability Theory. Four different convergence concepts. Four different convergence concepts. Convergence in probability Probability Theory Chapter 6 Convergence Four different convergence concepts Let X 1, X 2, be a sequence of (usually dependent) random variables Definition 1.1. X n converges almost surely (a.s.), or with

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Probability and Measure

Probability and Measure Chapter 4 Probability and Measure 4.1 Introduction In this chapter we will examine probability theory from the measure theoretic perspective. The realisation that measure theory is the foundation of probability

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Convergence in Distribution

Convergence in Distribution Convergence in Distribution Undergraduate version of central limit theorem: if X 1,..., X n are iid from a population with mean µ and standard deviation σ then n 1/2 ( X µ)/σ has approximately a normal

More information

Section 27. The Central Limit Theorem. Po-Ning Chen, Professor. Institute of Communications Engineering. National Chiao Tung University

Section 27. The Central Limit Theorem. Po-Ning Chen, Professor. Institute of Communications Engineering. National Chiao Tung University Section 27 The Central Limit Theorem Po-Ning Chen, Professor Institute of Communications Engineering National Chiao Tung University Hsin Chu, Taiwan 3000, R.O.C. Identically distributed summands 27- Central

More information

Chapter 7: Special Distributions

Chapter 7: Special Distributions This chater first resents some imortant distributions, and then develos the largesamle distribution theory which is crucial in estimation and statistical inference Discrete distributions The Bernoulli

More information

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Lecture 32: Asymptotic confidence sets and likelihoods

Lecture 32: Asymptotic confidence sets and likelihoods Lecture 32: Asymptotic confidence sets and likelihoods Asymptotic criterion In some problems, especially in nonparametric problems, it is difficult to find a reasonable confidence set with a given confidence

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Chapter 2. Discrete Distributions

Chapter 2. Discrete Distributions Chapter. Discrete Distributions Objectives ˆ Basic Concepts & Epectations ˆ Binomial, Poisson, Geometric, Negative Binomial, and Hypergeometric Distributions ˆ Introduction to the Maimum Likelihood Estimation

More information

8 Laws of large numbers

8 Laws of large numbers 8 Laws of large numbers 8.1 Introduction We first start with the idea of standardizing a random variable. Let X be a random variable with mean µ and variance σ 2. Then Z = (X µ)/σ will be a random variable

More information

1 Probability theory. 2 Random variables and probability theory.

1 Probability theory. 2 Random variables and probability theory. Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Lecture 2: Review of Probability

Lecture 2: Review of Probability Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................

More information

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata Maura Department of Economics and Finance Università Tor Vergata Hypothesis Testing Outline It is a mistake to confound strangeness with mystery Sherlock Holmes A Study in Scarlet Outline 1 The Power Function

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

Random Variables and Their Distributions

Random Variables and Their Distributions Chapter 3 Random Variables and Their Distributions A random variable (r.v.) is a function that assigns one and only one numerical value to each simple event in an experiment. We will denote r.vs by capital

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

6 The normal distribution, the central limit theorem and random samples

6 The normal distribution, the central limit theorem and random samples 6 The normal distribution, the central limit theorem and random samples 6.1 The normal distribution We mentioned the normal (or Gaussian) distribution in Chapter 4. It has density f X (x) = 1 σ 1 2π e

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory. Charles J. Geyer School of Statistics University of Minnesota

Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory. Charles J. Geyer School of Statistics University of Minnesota Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory Charles J. Geyer School of Statistics University of Minnesota 1 Asymptotic Approximation The last big subject in probability

More information

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Instructions This exam has 7 pages in total, numbered 1 to 7. Make sure your exam has all the pages. This exam will be 2 hours

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics Chapter Three. Point Estimation 3.4 Uniformly Minimum Variance Unbiased Estimator(UMVUE) Criteria for Best Estimators MSE Criterion Let F = {p(x; θ) : θ Θ} be a parametric distribution

More information

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 1: Sample Problems for the Elementary Section of Qualifying Exam in Probability and Statistics https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 2: Sample Problems for the Advanced Section

More information

STAT 7032 Probability. Wlodek Bryc

STAT 7032 Probability. Wlodek Bryc STAT 7032 Probability Wlodek Bryc Revised for Spring 2019 Printed: January 14, 2019 File: Grad-Prob-2019.TEX Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH 45221 E-mail address:

More information

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018 Mathematics Ph.D. Qualifying Examination Stat 52800 Probability, January 2018 NOTE: Answers all questions completely. Justify every step. Time allowed: 3 hours. 1. Let X 1,..., X n be a random sample from

More information

Mathematics Qualifying Examination January 2015 STAT Mathematical Statistics

Mathematics Qualifying Examination January 2015 STAT Mathematical Statistics Mathematics Qualifying Examination January 2015 STAT 52800 - Mathematical Statistics NOTE: Answer all questions completely and justify your derivations and steps. A calculator and statistical tables (normal,

More information

Stat 710: Mathematical Statistics Lecture 31

Stat 710: Mathematical Statistics Lecture 31 Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions

More information

P (A G) dp G P (A G)

P (A G) dp G P (A G) First homework assignment. Due at 12:15 on 22 September 2016. Homework 1. We roll two dices. X is the result of one of them and Z the sum of the results. Find E [X Z. Homework 2. Let X be a r.v.. Assume

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

Chp 4. Expectation and Variance

Chp 4. Expectation and Variance Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

1 Glivenko-Cantelli type theorems

1 Glivenko-Cantelli type theorems STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then

More information

Product measure and Fubini s theorem

Product measure and Fubini s theorem Chapter 7 Product measure and Fubini s theorem This is based on [Billingsley, Section 18]. 1. Product spaces Suppose (Ω 1, F 1 ) and (Ω 2, F 2 ) are two probability spaces. In a product space Ω = Ω 1 Ω

More information

Chapter 7. Confidence Sets Lecture 30: Pivotal quantities and confidence sets

Chapter 7. Confidence Sets Lecture 30: Pivotal quantities and confidence sets Chapter 7. Confidence Sets Lecture 30: Pivotal quantities and confidence sets Confidence sets X: a sample from a population P P. θ = θ(p): a functional from P to Θ R k for a fixed integer k. C(X): a confidence

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM c 2007-2016 by Armand M. Makowski 1 ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM 1 The basic setting Throughout, p, q and k are positive integers. The setup With

More information

1 Presessional Probability

1 Presessional Probability 1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional

More information

On the convergence of sequences of random variables: A primer

On the convergence of sequences of random variables: A primer BCAM May 2012 1 On the convergence of sequences of random variables: A primer Armand M. Makowski ECE & ISR/HyNet University of Maryland at College Park armand@isr.umd.edu BCAM May 2012 2 A sequence a :

More information

Week 9 The Central Limit Theorem and Estimation Concepts

Week 9 The Central Limit Theorem and Estimation Concepts Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population

More information

Chapter 5. Chapter 5 sections

Chapter 5. Chapter 5 sections 1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Expectation. DS GA 1002 Probability and Statistics for Data Science. Carlos Fernandez-Granda

Expectation. DS GA 1002 Probability and Statistics for Data Science.   Carlos Fernandez-Granda Expectation DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Describe random variables with a few numbers: mean,

More information

Lecture 5: Expectation

Lecture 5: Expectation Lecture 5: Expectation 1. Expectations for random variables 1.1 Expectations for simple random variables 1.2 Expectations for bounded random variables 1.3 Expectations for general random variables 1.4

More information