Convergence in Distribution

Undergraduate version of the central limit theorem: if X_1, ..., X_n are iid from a population with mean µ and standard deviation σ, then n^{1/2}(X̄ − µ)/σ has approximately a normal distribution. Also, a Binomial(n, p) random variable has approximately a N(np, np(1 − p)) distribution.

What is the precise meaning of statements like "X and Y have approximately the same distribution"? Desired meaning: X and Y have nearly the same cdf. But care is needed.
Q1) If n is a large number, is the N(0,1/n) distribution close to the distribution of X ≡ 0?
Q2) Is N(0,1/n) close to the N(1/n,1/n) distribution?
Q3) Is N(0,1/n) close to the N(1/√n,1/n) distribution?
Q4) If X_n ≡ 2^{−n}, is the distribution of X_n close to that of X ≡ 0?
Answers depend on how close "close" needs to be, so it is a matter of definition. In practice the usual sort of approximation we want to make is to say that some random variable X, say, has nearly some continuous distribution, like N(0,1). So: we want to know that probabilities like P(X > x) are nearly P(N(0,1) > x). Real difficulty: the case of discrete random variables or infinite dimensions; not done in this course. Mathematicians' meaning of "close": either they can provide an upper bound on the distance between the two things, or they are talking about taking a limit. In this course we take limits and use metrics.
Definition: A sequence of random variables X_n taking values in a separable metric space (S, d) converges in distribution to a random variable X if E(g(X_n)) → E(g(X)) for every bounded continuous function g mapping S to the real line.

Notation: X_n ⇒ X.

Remark: This is abusive language. It is the distributions that converge, not the random variables.

Example: If U is Uniform(0,1) and X_n = U, X = 1 − U, then X_n converges in distribution to X.

Other jargon: weak convergence, convergence in law.
General properties: If X_n ⇒ X and h is continuous from S_1 to S_2 then Y_n = h(X_n) ⇒ Y = h(X).

Theorem 1 (Slutsky) If X_n ⇒ X, Y_n ⇒ y_0 (a constant) and h is continuous from S_1 × S_2 to S_3 at (x, y_0) for each x, then Z_n = h(X_n, Y_n) ⇒ Z = h(X, y_0).
We will begin by specializing to the simplest case: S is the real line and d(x, y) = |x − y|. In the following we suppose that X_n, X are real valued random variables.

Theorem 2 The following are equivalent:
1. X_n converges in distribution to X.
2. P(X_n ≤ x) → P(X ≤ x) for each x such that P(X = x) = 0.
3. The limit of the characteristic functions of X_n is the characteristic function of X:
E(e^{itX_n}) → E(e^{itX}) for every real t.
These are all implied by M_{X_n}(t) → M_X(t) < ∞ for all |t| ≤ ǫ, for some positive ǫ.
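As a quick sanity check of point 3, here is a Monte Carlo sketch (stdlib Python only; the value of t and the sample size are our illustrative choices) comparing the empirical characteristic function of N(0,1) draws to the known cf e^{−t²/2}:

```python
import cmath
import math
import random

random.seed(0)
t = 1.3
n = 200_000

# Monte Carlo estimate of the characteristic function E(exp(itZ)) for Z ~ N(0,1)
est = sum(cmath.exp(1j * t * random.gauss(0, 1)) for _ in range(n)) / n
exact = math.exp(-t * t / 2)  # known characteristic function of N(0,1) at t

print(abs(est - exact))  # small Monte Carlo error
```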
Now let's go back to the questions I asked.

X_n ~ N(0,1/n) and X ≡ 0. Then
P(X_n ≤ x) → 1 if x > 0; 0 if x < 0; 1/2 if x = 0.
The limit is the cdf of X ≡ 0 except at x = 0, and the cdf of X is not continuous at x = 0. So: X_n ⇒ X.

Does X_n ~ N(1/n,1/n) have distribution close to that of Y_n ~ N(0,1/n)? Find a limit X and prove both X_n ⇒ X and Y_n ⇒ X. Take X ≡ 0. Then
E(e^{tX_n}) = e^{t/n + t²/(2n)} → 1 = E(e^{tX})
and
E(e^{tY_n}) = e^{t²/(2n)} → 1,
so both X_n and Y_n have the same limit in distribution.
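The pointwise cdf limit can be checked numerically. A minimal sketch using only the standard library; norm_cdf is our helper, not part of the notes:

```python
import math

def norm_cdf(x, mu, sigma):
    # Normal cdf via the error function (helper defined here for illustration)
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# X_n ~ N(0, 1/n): its cdf at a fixed x tends to 0 (x < 0), 1/2 (x = 0), 1 (x > 0)
for n in [10, 100, 10000]:
    s = 1 / math.sqrt(n)
    print(n, norm_cdf(-0.5, 0, s), norm_cdf(0.0, 0, s), norm_cdf(0.5, 0, s))
```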
[Figure: cdf of N(0,1/n) vs the cdf of X ≡ 0, n = 10000; left panel over (−3, 3), right panel zoomed to (−0.03, 0.03).]
[Figure: cdf of N(1/n,1/n) vs the cdf of N(0,1/n), n = 10000; left panel over (−3, 3), right panel zoomed to (−0.03, 0.03).]
Multiply both X_n and Y_n by n^{1/2} and let X ~ N(0,1). Then n^{1/2}X_n ~ N(n^{−1/2},1) and n^{1/2}Y_n ~ N(0,1). Use characteristic functions to prove that both n^{1/2}X_n and n^{1/2}Y_n converge to N(0,1) in distribution.

If you now let X_n ~ N(n^{−1/2},1/n) and Y_n ~ N(0,1/n), then again both X_n and Y_n converge to 0 in distribution. But if you multiply X_n and Y_n in this case by n^{1/2} then n^{1/2}X_n ~ N(1,1) and n^{1/2}Y_n ~ N(0,1), so n^{1/2}X_n and n^{1/2}Y_n are not close together in distribution.

You can check that 2^{−n} → 0 in distribution.
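A numerical sketch of why the scaling matters (our own illustration, stdlib only): at a fixed x, the cdfs of N(1/√n,1/n) and N(0,1/n) agree in the limit, while the cdfs of the rescaled pair N(1,1) and N(0,1) stay a fixed distance apart:

```python
import math

def norm_cdf(x, mu, sigma):
    # Normal cdf via the error function (illustrative helper)
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

x = 0.5
for n in [10, 100, 10000]:
    s = 1 / math.sqrt(n)
    # cdf gap between X_n ~ N(1/sqrt(n), 1/n) and Y_n ~ N(0, 1/n): vanishes
    gap_unscaled = abs(norm_cdf(x, 1 / math.sqrt(n), s) - norm_cdf(x, 0, s))
    # after multiplying by sqrt(n): N(1,1) vs N(0,1): gap does not vanish
    gap_scaled = abs(norm_cdf(x, 1, 1) - norm_cdf(x, 0, 1))
    print(n, gap_unscaled, gap_scaled)
```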
[Figure: cdf of N(1/sqrt(n),1/n) vs the cdf of N(0,1/n), n = 10000; left panel over (−3, 3), right panel zoomed to (−0.03, 0.03).]
Summary: to derive approximate distributions, show that the sequence of rvs X_n converges weakly to some X. The limit distribution (i.e. the distribution of X) should be non-trivial, like N(0,1). Don't say: X_n is approximately N(1/n,1/n). Do say: n^{1/2}(X_n − 1/n) converges to N(0,1) in distribution.
The Central Limit Theorem

Theorem 3 If X_1, X_2, ... are iid with mean 0 and variance 1 then n^{1/2}X̄ converges in distribution to N(0,1). That is,
P(n^{1/2}X̄ ≤ x) → ∫_{−∞}^x (2π)^{−1/2} e^{−y²/2} dy.

Proof: We will show E(e^{itn^{1/2}X̄}) → e^{−t²/2}. This is the characteristic function of N(0,1), so we are done by our theorem.
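A simulation sketch of the theorem (stdlib only; Uniform(0,1) summands and the sample sizes are our illustrative choices):

```python
import math
import random

random.seed(1)

def standardized_mean(n):
    # Uniform(0,1) has mean 1/2 and variance 1/12; standardize to mean 0, variance 1
    xs = [random.random() for _ in range(n)]
    xbar = sum(xs) / n
    return math.sqrt(n) * (xbar - 0.5) / math.sqrt(1 / 12)

n, reps = 50, 20_000
vals = [standardized_mean(n) for _ in range(reps)]

# Compare the simulated P(Z <= 1) with Phi(1)
p_hat = sum(v <= 1.0 for v in vals) / reps
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))
print(p_hat, phi_1)
```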
Some basic facts: If Z ~ N(0,1) then E(e^{itZ}) = e^{−t²/2}.

Theorem 4 If X is a real random variable with E(|X|^k) < ∞ then the function ψ(t) = E(e^{itX}) has k continuous derivatives as a function of the real variable t. (The real part and imaginary part each have that many derivatives.) Moreover, for 1 ≤ j ≤ k,
ψ^{(j)}(t) = i^j E(X^j e^{itX}).

Theorem 5 (Taylor expansion) For such an X:
ψ(t) = 1 + Σ_{j=1}^k i^j E(X^j) t^j / j! + R(t),
where the remainder function R(t) satisfies lim_{t→0} R(t)/t^k = 0.
Finish the proof: let ψ(t) = E(exp(itX_1)). Then
E(e^{it√n X̄}) = ψ^n(t/√n).
Since the variance is 1 and the mean is 0:
ψ(t) = 1 − t²/2 + R(t), where lim_{t→0} R(t)/t² = 0.
Fix t and replace t by t/√n:
ψ^n(t/√n) = (1 − t²/(2n) + R(t/√n))^n.
Define x_n = −t²/2 + nR(t/√n). Notice x_n → −t²/2 (by the property of R), and use the fact that x_n → x implies
(1 + x_n/n)^n → e^x,
valid for all complex x, to finish the proof:
E(e^{itn^{1/2}X̄}) → e^{−t²/2}.
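The key limit (1 + x_n/n)^n → e^x for complex x_n → x can be checked numerically (a sketch; the particular complex x and the 1/n perturbation are illustrative stand-ins for −t²/2 and nR(t/√n)):

```python
import cmath

x = -0.845 + 0.3j  # illustrative complex limit, standing in for -t^2/2
for n in [10, 1000, 100000]:
    x_n = x + 1.0 / n          # any sequence with x_n -> x; 1/n is an assumed stand-in
    val = (1 + x_n / n) ** n   # complex powers are fine in Python
    print(n, abs(val - cmath.exp(x)))

err = abs(val - cmath.exp(x))  # error at the largest n
```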
Proof of Theorem 4: we do the case k = 1. We must show
lim_{h→0} [ψ(t + h) − ψ(t)]/h = iE(Xe^{itX}).
But
[ψ(t + h) − ψ(t)]/h = E( [e^{i(t+h)X} − e^{itX}]/h ).
Fact: |e^{i(t+h)X} − e^{itX}|/|h| ≤ |X| for any t. By the Dominated Convergence Theorem we can take the limit inside the expectation to get
ψ′(t) = iE(Xe^{itX}).
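A numerical sketch of the differentiation step (our own illustration, stdlib only): the finite difference of an empirical characteristic function matches iE(Xe^{itX}) computed on the same draws:

```python
import cmath
import math
import random

random.seed(2)
t, h = 0.7, 1e-4
zs = [random.gauss(0, 1) for _ in range(100_000)]
n = len(zs)

# Empirical characteristic function psi(s) = (1/n) sum exp(i s Z_j)
psi = lambda s: sum(cmath.exp(1j * s * z) for z in zs) / n

fd = (psi(t + h) - psi(t)) / h                                # finite difference
inside = 1j * sum(z * cmath.exp(1j * t * z) for z in zs) / n  # i E(Z e^{itZ}), same draws

print(abs(fd - inside))  # small: differentiation passed inside the expectation
```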
Multivariate convergence in distribution

Definition: X_n ∈ R^p converges in distribution to X ∈ R^p if E(g(X_n)) → E(g(X)) for each bounded continuous real valued function g on R^p. This is equivalent to either of:

Cramér–Wold device: a^T X_n converges in distribution to a^T X for each a ∈ R^p; or

Convergence of characteristic functions: E(e^{ia^T X_n}) → E(e^{ia^T X}) for each a ∈ R^p.
Extensions of the CLT

1. Y_1, Y_2, ... iid in R^p with mean µ and variance Σ: then n^{1/2}(Ȳ − µ) ⇒ MVN(0,Σ).

2. Lyapunov CLT: for each n let X_{n1}, ..., X_{nn} be independent rvs with
E(X_{ni}) = 0, (1)
Σ_i Var(X_{ni}) = 1, (2)
Σ_i E(|X_{ni}|³) → 0; (3)
then Σ_i X_{ni} ⇒ N(0,1).

3. Lindeberg CLT: if conditions (1), (2) hold and Σ_i E(X²_{ni} 1(|X_{ni}| > ǫ)) → 0 for each ǫ > 0, then Σ_i X_{ni} ⇒ N(0,1). (Lyapunov's condition implies Lindeberg's.)

4. Non-independent rvs: m-dependent CLT, martingale CLT, CLT for mixing processes.

5. Not sums: Slutsky's theorem, δ method.
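A sketch of extension 1 via the Cramér–Wold device (stdlib only; the two-coordinate distribution and the direction a are our illustrative choices):

```python
import math
import random

random.seed(3)

def projected_mean(n, a):
    # Y_i = (U_i, E_i) with independent Uniform(0,1) and Exponential(1) coordinates;
    # mean mu = (0.5, 1.0).  Return sqrt(n) * a^T (Ybar - mu).
    s0 = s1 = 0.0
    for _ in range(n):
        s0 += random.random()
        s1 += random.expovariate(1.0)
    return math.sqrt(n) * (a[0] * (s0 / n - 0.5) + a[1] * (s1 / n - 1.0))

a = (2.0, -1.0)
# Sigma = diag(1/12, 1), so a^T Sigma a = 4*(1/12) + 1 = 4/3
var_target = 4 * (1 / 12) + 1.0

reps, n = 4000, 200
vals = [projected_mean(n, a) for _ in range(reps)]
var_hat = sum(v * v for v in vals) / reps
print(var_hat, var_target)
```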
Slutsky's Theorem in R^p: If X_n ⇒ X and Y_n converges in distribution (or in probability) to c, a constant, then X_n + Y_n ⇒ X + c. More generally, if f(x, y) is continuous then f(X_n, Y_n) ⇒ f(X, c).

Warning: the hypothesis that the limit of Y_n be constant is essential.

Definition: Y_n → Y in probability if for every ǫ > 0: P(d(Y_n, Y) > ǫ) → 0.

Fact: for Y constant, convergence in distribution and in probability are the same. Convergence in probability always implies convergence in distribution. Both are weaker than almost sure convergence:

Definition: Y_n → Y almost surely if P({ω ∈ Ω : lim_n Y_n(ω) = Y(ω)}) = 1.
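A sketch of the classic Slutsky application (our illustration, stdlib only): the studentized mean, where the random denominator s converges in probability to σ, so the t-statistic has the same N(0,1) limit as the exactly standardized mean:

```python
import math
import random

random.seed(4)

def t_stat(n):
    # Exponential(1) sample: mean 1, variance 1
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    # Slutsky: sqrt(n)(xbar - 1)/s has the same N(0,1) limit as sqrt(n)(xbar - 1)/sigma
    return math.sqrt(n) * (xbar - 1.0) / math.sqrt(s2)

reps, n = 5000, 400
vals = [t_stat(n) for _ in range(reps)]
p_hat = sum(abs(v) <= 1.96 for v in vals) / reps
print(p_hat)  # close to 0.95
```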
Theorem 6 (The delta method) Suppose:
1. The sequence Y_n converges to y, a constant.
2. X_n = a_n(Y_n − y) ⇒ X for some random variable X (with constants a_n → ∞).
3. f is a function defined on a neighbourhood of y ∈ R^p which is differentiable at y.
Then a_n(f(Y_n) − f(y)) converges in distribution to f′(y)X. If X_n ∈ R^p and f : R^p → R^q, then f′ is the q × p matrix of first derivatives of the components of f.
Proof: The function f : R^p → R^q is differentiable at y ∈ R^p if there is a matrix Df such that
lim_{h→0} ||f(y + h) − f(y) − Df h|| / ||h|| = 0,
that is, for each ǫ > 0 there is a δ > 0 such that ||h|| ≤ δ implies
||f(y + h) − f(y) − Df h|| ≤ ǫ ||h||.
Define
R_n = a_n(f(Y_n) − f(y)) − a_n Df (Y_n − y)
and
S_n = a_n Df (Y_n − y) = Df X_n.
According to Slutsky's theorem, S_n ⇒ Df X. If we now prove R_n → 0 in probability, then by Slutsky's theorem we find
a_n(f(Y_n) − f(y)) = S_n + R_n ⇒ Df X.
Now fix ǫ_1, ǫ_2 > 0. I claim there is a K so big that for all n,
P(B_n) ≡ P(||a_n(Y_n − y)|| > K) ≤ ǫ_1.
Let δ > 0 be the value in the definition of the derivative corresponding to ǫ_2/K. Choose N so large that n ≥ N implies K/a_n ≤ δ. For n ≥ N, if ||a_n(Y_n − y)|| ≤ K then ||Y_n − y|| ≤ K/a_n ≤ δ, so
||R_n|| ≤ (ǫ_2/K) a_n ||Y_n − y|| ≤ ǫ_2;
that is, {||R_n|| > ǫ_2} ⊆ B_n, so n ≥ N implies P(||R_n|| > ǫ_2) ≤ ǫ_1, which means R_n → 0 in probability.
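A one-dimensional numerical sketch of the delta method just proved (stdlib only; the choice f(x) = x² and Uniform(0,1) data are ours):

```python
import math
import random

random.seed(5)

# Delta method sketch: Y_n = mean of n Uniform(0,1) draws, y = 1/2, f(x) = x^2.
# Then sqrt(n)(f(Y_n) - f(y)) should be approximately N(0, f'(y)^2 * Var) with
# f'(1/2) = 1 and Var = 1/12, i.e. limiting variance 1/12.
def delta_stat(n):
    ybar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (ybar ** 2 - 0.25)

reps, n = 5000, 300
vals = [delta_stat(n) for _ in range(reps)]
var_hat = sum(v * v for v in vals) / reps
print(var_hat, 1 / 12)
```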
Example: Suppose X_1, ..., X_n are a sample from a population with mean µ, variance σ², and third and fourth central moments µ_3 and µ_4. Then
n^{1/2}(s² − σ²) ⇒ N(0, µ_4 − σ⁴),
where ⇒ is notation for convergence in distribution. For simplicity I define s² = (average of the X_i²) − X̄².
How to apply the δ method:

1) Write the statistic as a function of averages. Define
W_i = (X_i², X_i)^T and W̄_n = (average of the X_i², X̄)^T.
Define f(x_1, x_2) = x_1 − x_2². See that s² = f(W̄_n).

2) Compute the mean of your averages:
µ_W ≡ E(W̄_n) = (E(X_i²), E(X_i))^T = (µ² + σ², µ)^T.

3) In the δ method theorem take Y_n = W̄_n and y = E(Y_n).
4) Take a_n = n^{1/2}.

5) Use the central limit theorem:
n^{1/2}(Y_n − y) ⇒ MVN(0, Σ), where Σ = Var(W_i).

6) To compute Σ take the expected value of
(W − µ_W)(W − µ_W)^T.
There are 4 entries in this matrix. The top left entry is
(X² − µ² − σ²)²,
which has expectation
E{(X² − µ² − σ²)²} = E(X⁴) − (µ² + σ²)².
Using the binomial expansion:
E(X⁴) = E{(X − µ + µ)⁴} = µ_4 + 4µµ_3 + 6µ²σ² + 4µ³E(X − µ) + µ⁴ = µ_4 + 4µµ_3 + 6µ²σ² + µ⁴.
So
Σ_11 = µ_4 − σ⁴ + 4µµ_3 + 4µ²σ².
The top right entry is the expectation of
(X² − µ² − σ²)(X − µ),
which is E(X³) − µE(X²). Similarly to the 4th moment calculation, this equals µ_3 + 2µσ². The lower right entry is σ². So
Σ = [ µ_4 − σ⁴ + 4µµ_3 + 4µ²σ²    µ_3 + 2µσ² ]
    [ µ_3 + 2µσ²                  σ²         ]
7) Compute the derivative (gradient) of f: it has components (1, −2x_2). Evaluate at y = (µ² + σ², µ) to get a^T = (1, −2µ). This leads to
n^{1/2}(s² − σ²) ≈ [1, −2µ] n^{1/2}(W̄_n − µ_W),
which converges in distribution to
(1, −2µ) MVN(0,Σ).
This rv is N(0, a^TΣa) = N(0, µ_4 − σ⁴).
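The conclusion can be checked by simulation (stdlib only; Exponential(1) data are our choice, for which µ_4 = 9 and σ⁴ = 1, so the limiting variance µ_4 − σ⁴ is 8):

```python
import math
import random

random.seed(6)

def stat(n):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum(x * x for x in xs) / n - xbar ** 2  # mean of squares minus squared mean
    return math.sqrt(n) * (s2 - 1.0)             # sigma^2 = 1 for Exponential(1)

# For Exponential(1): mu_4 = 9, sigma^4 = 1, so the limit variance is 8
reps, n = 5000, 1000
vals = [stat(n) for _ in range(reps)]
var_hat = sum(v * v for v in vals) / reps
print(var_hat)  # near 8
```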
An alternative approach is worth pursuing. Suppose c is a constant. Define X_i′ = X_i − c. Then the sample variance of the X_i′ is the same as the sample variance of the X_i. Notice that all central moments of X_i′ are the same as for X_i. Conclusion: there is no loss in assuming µ = 0. In this case a^T = (1, 0) and
Σ = [ µ_4 − σ⁴   µ_3 ]
    [ µ_3        σ²  ]
Notice that a^TΣ = [µ_4 − σ⁴, µ_3] and a^TΣa = µ_4 − σ⁴.
Special case: if the population is N(µ, σ²) then µ_3 = 0 and µ_4 = 3σ⁴. Our calculation gives
n^{1/2}(s² − σ²) ⇒ N(0, 2σ⁴).
You can divide through by σ² and get
n^{1/2}(s²/σ² − 1) ⇒ N(0, 2).
In fact ns²/σ² has a χ²_{n−1} distribution, and so the usual central limit theorem shows that
(n − 1)^{−1/2}[ns²/σ² − (n − 1)] ⇒ N(0, 2)
(using the fact that the mean of χ²_1 is 1 and its variance is 2). Factor out n to get
[n/(n − 1)]^{1/2} n^{1/2}(s²/σ² − 1) + (n − 1)^{−1/2} ⇒ N(0, 2),
which is the δ method calculation except for some constants. The difference is unimportant: Slutsky's theorem.
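A quick simulation check of the normal-population special case (our sketch, stdlib only): for N(0,1) data the variance of n^{1/2}(s² − σ²) should be near 2σ⁴ = 2:

```python
import math
import random

random.seed(7)

def stat(n):
    # N(0,1) sample, so sigma^2 = 1 and the limiting variance is 2*sigma^4 = 2
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum(x * x for x in xs) / n - xbar ** 2
    return math.sqrt(n) * (s2 - 1.0)

reps, n = 5000, 500
var_hat = sum(stat(n) ** 2 for _ in range(reps)) / reps
print(var_hat)  # near 2
```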
Extending the ideas to higher dimensions: let W_1, W_2, ... be iid with density
f(x) = exp(−(x + 1))1(x > −1),
a mean-0 shifted exponential. Set S_k = W_1 + ··· + W_k. Plot S_k/√n against k/n for k = 1, ..., n. Label the x axis to run from 0 to 1 and rescale the vertical axis to fit in a square.
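A sketch of the construction in code (stdlib only, no plotting; the seed and sizes are our choices). The endpoint S_n/√n should have variance near Var(W_i) = 1:

```python
import math
import random

random.seed(8)

def rescaled_path(n):
    # W_i = E_i - 1 with E_i ~ Exponential(1): the mean-0 shifted exponential
    s, path = 0.0, []
    for k in range(1, n + 1):
        s += random.expovariate(1.0) - 1.0
        path.append(s / math.sqrt(n))  # plot path[k-1] against k/n
    return path

reps, n = 4000, 400
ends = [rescaled_path(n)[-1] for _ in range(reps)]
var_hat = sum(e * e for e in ends) / reps
print(var_hat)  # near 1
```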
[Figure: four panels (n = 100, 400, 1600, 10000) plotting the rescaled random walk S_k/sqrt(n) against k/n on (0, 1).]
Proof of Slutsky's theorem. First: why is it true? If X_n ⇒ X and Y_n ⇒ y then we will show (X_n, Y_n) ⇒ (X, y). The point is that the joint law of (X, y) is determined by the marginal laws! Once this is done, then
E(h(X_n, Y_n)) → E(h(X, y))
by definition. Note: you don't need continuity at all (x, y), but I will do only the easy case.
Definition: A family {P_α, α ∈ A} of probability measures on (S, d) is tight if for each ǫ > 0 there is a compact K in S such that for every α ∈ A,
P_α(K) ≥ 1 − ǫ.

Theorem 7 If S is a complete separable metric space then each probability measure P on the Borel sets in S is tight.

Proof: Let x_1, x_2, ... be dense in S. For each n draw balls B_{n,1}, B_{n,2}, ... of radius 1/n around x_1, x_2, .... Each point of S is in one of these balls because the sequence of x_j is dense. That is,
S = ∪_{j=1}^∞ B_{n,j}.
Thus
1 = P(S) = lim_{J→∞} P(∪_{j=1}^J B_{n,j}).
Pick J_n so large that
P(∪_{j=1}^{J_n} B_{n,j}) ≥ 1 − ǫ/2^n.
Let F_n be the closure of ∪_{j=1}^{J_n} B_{n,j}. Let K = ∩_{n=1}^∞ F_n. I claim K is compact and has probability at least 1 − ǫ. First,
P(K) = 1 − P(K^c) = 1 − P(∪_n F_n^c) ≥ 1 − Σ_n P(F_n^c) ≥ 1 − Σ_n ǫ/2^n = 1 − ǫ.
(Incidentally, you see that K is not empty!)
Second: K is closed (an intersection of closed sets). Third: K is totally bounded, since each F_n gives a cover of K by finitely many (closed) balls of radius 1/n. So K is compact.

Theorem 8 If X_n converges in distribution to some X in a complete separable metric space S, then the sequence X_n is tight.

Conversely:

Theorem 9 If the sequence X_n is tight then every subsequence is also tight, and there is a subsequence X_{n_k} and a random variable X such that X_{n_k} ⇒ X as k → ∞.

Theorem 10 If there is a rv X such that every subsequence of X_n has a further sub-subsequence converging in distribution to X, then X_n ⇒ X.
Proof of Theorem 9: we do R^p. The first assertion is obvious. Let x_1, x_2, ... be dense in R^p. Find a sequence n_{1,1} < n_{1,2} < ··· such that the sequence F_{n_{1,k}}(x_1) has a limit, which we denote y_1. This exists because the probabilities are trapped in [0,1] (Bolzano–Weierstrass). Pick n_{2,1} < n_{2,2} < ···, a subsequence of the n_{1,k}, such that F_{n_{2,k}}(x_2) has a limit, which we denote y_2. Continue, picking a subsequence n_{m+1,k} of the sequence n_{m,k} so that F_{n_{m+1,k}}(x_{m+1}) has a limit, which we denote y_{m+1}.
Trick: diagonalization. Consider the sequence n_{1,1} < n_{2,2} < ···. After the kth entry, all remaining terms are a subsequence of the kth subsequence n_{k,j}. So
lim_{k→∞} F_{n_{k,k}}(x_j) = y_j
for each j. Idea: we would like to define F_X(x_j) = y_j, but that might not give a cdf. Instead set
F_X(x) = inf{y_j : x_j > x}.
Next: prove that F_X is a cdf. Then prove that the subsequence converges to F_X.