The Central Limit Theorem

Department of Mathematics
Ma 3/103 KC Border
Introduction to Probability and Statistics
Winter 2019

Lecture 11: The Central Limit Theorem

Relevant textbook passages:
Pitman [16]: Section 3.3, especially p. 196.
Larsen–Marx [12]: Section 4.3 (Appendix 4.A.2 is optional).

11.1 Fun with CDFs

Recall that a random variable X on the probability space (S, E, P) has a cumulative distribution function (cdf) F defined by

    F(x) = P(X ≤ x) = P({s ∈ S : X(s) ≤ x}).

Quantiles  (Pitman [16])

Assume X is a random variable with a continuous, strictly increasing cumulative distribution function F. Then the equation

    P(X ≤ x) = F(x) = p

has a unique solution for every p with 0 < p < 1, namely x_p = F⁻¹(p). (If we allow x to take on the values ±∞, then we can be sure that a solution exists for the cases p = 0 and p = 1.)

11.1.1 Definition. When X has a continuous, strictly increasing cumulative distribution function F, the value x_p = F⁻¹(p) is called the p-th quantile of the distribution F. The mapping F⁻¹ : p ↦ x_p is called the quantile function of the distribution F.

Some of the quantiles have special names. When p has the form of a percent, x_p is called a percentile. The p = 1/2 quantile is called the median. The p = 1/4 quantile is the first quartile, and the p = 3/4 quantile is called the third quartile. The interval between the first and third quartiles is called the interquartile range. There are also quintiles, and deciles, and, I'm sure, others as well.

11.1.2 Proposition. Let F be the cdf of the random variable X. Assume that F is continuous. Then F ∘ X is a random variable that is uniformly distributed on [0, 1]. In other words,

    P(F(X) ≤ p) = p,   (0 ≤ p ≤ 1).

Note that we are not assuming that F is strictly increasing.

Proof: Since F is continuous, its range has no gaps; that is, the range includes (0, 1). Now fix p ∈ [0, 1]. There are three cases: p = 1, 0 < p < 1, and p = 0.

The case p = 1 is trivial, since F is bounded above by 1. In case 0 < p < 1, define

    x_p = sup{x ∈ ℝ : F(x) ≤ p} = sup{x ∈ ℝ : F(x) = p},

and note that x_p is finite. In fact, if F is strictly increasing, then x_p is just the p-th quantile defined above. By continuity, F(x_p) = p. By the construction of x_p, for all x ∈ ℝ,

    F(x) ≤ p  ⟺  x ≤ x_p.   (1)

So, replacing x by X above, F(X) ≤ p ⟺ X ≤ x_p, so

    P(F(X) ≤ p) = P(X ≤ x_p) = F(x_p) = p.

The above argument works for p = 0 if 0 is in the range of F. But if 0 is not in the range of F, then F(X) > 0 a.s., so P(F(X) = 0) = 0.

The quantile function in general

Even if F is not continuous, so that the range of F has gaps, there is still a fake inverse of F that is very useful. Define Q_F : (0, 1) → ℝ by

    Q_F(p) = inf{x ∈ ℝ : F(x) ≥ p}.

When F is clear from the context, we shall simply write Q. When F is strictly increasing and continuous, Q_F is just F⁻¹ on (0, 1). More generally, flat spots in F correspond to jumps in Q_F and vice versa. The key property is that for any p ∈ (0, 1) and x ∈ ℝ,

    Q_F(p) ≤ x  ⟺  p ≤ F(x),   (2)

or equivalently

    F(x) < p  ⟺  x < Q_F(p).

Compare this to equation (1).

11.1.3 Example. Refer to Figure 11.1 to get a feel for the relationship between flats and jumps. The cumulative distribution function F has a flat spot from x to y, which means P(x < X < y) = 0. Now {v : F(v) ≥ p} = [x, ∞), so Q_F(p) = x. F also has a jump from p to q at y, which means P(X = y) = q − p. For p′ in the gap, p < p′ < q, we have {v : F(v) ≥ p′} = [y, ∞), so Q_F(p′) = y. This creates a flat spot in Q_F.
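Here is a small Python sketch of Q_F and the key property (2); it is my own illustration, not part of the notes. It uses a random variable taking the values 0 and 1 with probability 1/2 each; the bisection bracket [−10, 10] and the tolerances are arbitrary choices. Note that (2) already implies P(Q_F(U) ≤ x) = P(U ≤ F(x)) = F(x) for a uniform U, so Q_F(U) reproduces the distribution F.

```python
import random

random.seed(1)

# cdf of a random variable taking the values 0 and 1, each with probability 1/2.
def F(x):
    if x < 0:
        return 0.0
    return 0.5 if x < 1 else 1.0

def Q(p, lo=-10.0, hi=10.0, tol=1e-9):
    """Quantile function Q_F(p) = inf{x : F(x) >= p}, found by bisection.

    Uses the key property Q_F(p) <= x  <=>  p <= F(x) to shrink a bracket
    around the infimum.  The bracket [lo, hi] is assumed to contain all mass.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) >= p:
            hi = mid   # mid is an upper bound for the infimum
        else:
            lo = mid
    return hi

# Jumps of F at 0 and 1 become flat spots of Q: Q is 0 on (0, 1/2] and 1 on (1/2, 1).
print(Q(0.25), Q(0.5), Q(0.75))

# Inverse transform sampling: Q_F(U), with U uniform on (0, 1), has distribution F.
sample = [Q(random.random()) for _ in range(20_000)]
frac_at_zero = sum(x < 0.5 for x in sample) / len(sample)
print(frac_at_zero)  # should be close to 1/2
```

The bisection works for any cdf you can evaluate, which is the practical content of the "fake inverse": you never need a closed form for F⁻¹.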

Figure 11.1. Construction of Q_F from F.

11.1.4 Proposition. Let F be a cumulative distribution function and let U be a Uniform[0,1] random variable; that is, for every p ∈ [0, 1], P(U ≤ p) = p. Note that P(U ∈ (0, 1)) = 1, so with probability one Q_F(U) is defined. Then Q_F(U) is a random variable and the cumulative distribution function of Q_F(U) is F.

Proof: From (2),

    P(Q_F(U) ≤ x) = P(U ≤ F(x)) = F(x),

where the last equality comes from the fact that U is uniformly distributed.

The usefulness of this result is this: if you have a uniform random variable U, then you can create a random variable X with any distribution F via the transformation X = Q_F(U).

11.2 Stochastic dominance and increasing functions

Recall that a random variable X stochastically dominates the random variable Y if for every value x, P(X > x) ≥ P(Y > x). This is equivalent to F_X(x) ≤ F_Y(x) for all x.

11.2.1 Proposition. [Not in Pitman] Let X stochastically dominate Y, and let g be a nondecreasing real function. Then E g(X) ≥ E g(Y). If g is strictly increasing, then the inequality is strict unless F_X = F_Y. As a special case, when g(x) = x, we have E X ≥ E Y.

I will prove this in two ways. The first proof is for the special case where X and Y are strictly bounded in absolute value by b, have densities f_X and f_Y, and the function g is continuously differentiable. Then the expected value of g(X) is obtained via the integral

    ∫_{−b}^{b} g(x) f_X(x) dx,

so integrating by parts we see this is equal to

    g(t)F_X(t) |_{−b}^{b} − ∫_{−b}^{b} F_X(x) g′(x) dx.

Likewise E g(Y) is equal to

    g(t)F_Y(t) |_{−b}^{b} − ∫_{−b}^{b} F_Y(x) g′(x) dx.

Now by assumption F_X(−b) = F_Y(−b) = 0 and F_X(b) = F_Y(b) = 1, so the first term in each expectation is identical. Since g is nondecreasing, g′ ≥ 0 everywhere. Since F_X ≤ F_Y everywhere, we conclude E g(X) ≥ E g(Y). q.e.d.

The second proof is more general, and relies on the inverse of the cdf (a.k.a. the quantile function). Last time we showed that even if a cumulative distribution function F is not invertible because it has jumps, we can use the function

    Q_F(p) = inf{x ∈ ℝ : F(x) ≥ p}.

We showed that if U is a Uniform[0,1] random variable, then Q_F(U) is a random variable that has distribution F. Now observe that if F_X ≤ F_Y, then Q_X ≥ Q_Y. Therefore, since g is nondecreasing,

    g(Q_X(U)) ≥ g(Q_Y(U)).

Since expectation is a positive linear operator, we have

    E g(Q_X(U)) ≥ E g(Q_Y(U)).

But g(Q_X(U)) has the same distribution as g(X) and g(Q_Y(U)) has the same distribution as g(Y), so their expectations are the same. q.e.d.

For yet a different sort of proof, based on convex analysis, see Border [3].

11.3 When are distributions close?  [Not in Pitman.]

I have already argued that the de Moivre–Laplace Limit Theorem says that the Binomial(n, p) distributions can be approximated by Normal distributions when n is large. The argument consisted mostly of showing how the Normal density approximates the Binomial pmf. In this lecture I will formalize the notion of closeness for distributions. The appropriate notion is what is called convergence in distribution.

Informally, two distributions are close if their cumulative distribution functions are close. This is fine if both cdfs are continuous. Then we can define the distance between F and G to be sup_x |F(x) − G(x)|. This definition is reasonable in a few other cases as well.
For instance, on the last homework assignment, you had to plot the empirical cdf of data from the coin-tossing experiment, and it was pretty close everywhere to a straight line. But there is another kind of closeness we wish to capture. Suppose X is a random variable that takes on the values 1 and 0, each with probability 1/2, and Y is a random variable that takes on the values 1 + ε and 0, each with probability 1/2. Should their distributions be considered close? I'd like to think so, but |F_X(1) − F_Y(1)| = 1/2 for every ε > 0. See Figure 11.2. There is a simple way to incorporate both notions of closeness, which is captured by the following somewhat opaque definition.
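The failure of the sup-distance in this example is easy to see numerically. The following sketch (my own, not part of the notes) evaluates the two cdfs on a fine grid; the grid and the values of ε are arbitrary choices.

```python
# cdf of a random variable taking the values 0 and a, each with probability 1/2.
def two_point_cdf(a):
    def F(x):
        if x < 0:
            return 0.0
        return 0.5 if x < a else 1.0
    return F

FX = two_point_cdf(1.0)                # X takes the values 0 and 1
for eps in (0.5, 0.1, 0.001):
    FY = two_point_cdf(1.0 + eps)      # Y takes the values 0 and 1 + eps
    grid = [i / 10_000 for i in range(-5_000, 25_001)]   # points in [-0.5, 2.5]
    d = max(abs(FX(x) - FY(x)) for x in grid)
    print(eps, d)   # the sup-distance is 1/2 for every eps > 0
```

However small ε is, the grid contains a point just above 1 where F_X has already jumped to 1 but F_Y is still 1/2, so the sup-distance never shrinks.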

Figure 11.2. Are these cumulative distribution functions close?

11.3.1 Definition. The sequence F_n of cumulative distribution functions converges in distribution to the cumulative distribution function F, written

    F_n ⇒ F,

if for all t at which F is continuous,

    F_n(t) → F(t).

We also say the random variables X_n converge in distribution to X, written X_n ⇒ X, if their cdfs converge in distribution.

11.3.2 Example. Let X_n be a random variable that takes on the values 1 + (1/n) and 0, each with probability 1/2, and let X take on the values 1 and 0, each with probability 1/2. Then the cdf F_n of X_n has jumps of size 1/2 at 0 and 1 + (1/n), while the cdf F of X has jumps of size 1/2 at 0 and 1. You can check that F_n(t) → F(t) at every t except t = 1, which is a point of discontinuity of F. Thus F_n ⇒ F.

11.3.3 Remark. Note that convergence in distribution depends only on the cdf, so the random variables can be defined on different probability spaces and still converge in distribution. This is not true of convergence in probability or almost-sure convergence.

Aside: It is possible to quantify how close two distribution functions are using the Lévy metric, which is defined by

    d(F, G) = inf{ε > 0 : (∀x)  F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε}.   (3)

(Call the first inequality (3a) and the second (3b).)
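To see that the Lévy metric behaves better than the sup-distance on the example of Figure 11.2, here is a small numerical sketch (my own, not part of the notes). It approximates the infimum in (3) by bisection over ε, checking the two inequalities on a grid; for the two-point cdfs of the example the distance comes out near ε rather than being stuck at 1/2. The grid, the bracket, and the values of ε are arbitrary choices.

```python
def two_point_cdf(a):
    """cdf of a random variable taking the values 0 and a, each with probability 1/2."""
    def F(x):
        if x < 0:
            return 0.0
        return 0.5 if x < a else 1.0
    return F

def levy_distance(F, G, grid):
    """Approximate the Levy metric d(F, G) by bisection over epsilon.

    ok(e) checks the defining inequalities of (3) at every grid point.
    """
    def ok(e):
        return all(F(x - e) - e <= G(x) <= F(x + e) + e for x in grid)
    lo, hi = 0.0, 1.0   # d(F, G) <= 1, since cdfs take values in [0, 1]
    for _ in range(40):
        mid = (lo + hi) / 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

F = two_point_cdf(1.0)
grid = [i / 1000 for i in range(-500, 2501)]   # points in [-0.5, 2.5]
for eps in (0.2, 0.05):
    G = two_point_cdf(1.0 + eps)
    print(eps, levy_distance(F, G, grid))   # close to eps, not stuck at 1/2
```

The answer is only as accurate as the grid (a jump falling between grid points can be missed), but it is enough to see the qualitative difference between the two metrics.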

See, e.g., Billingsley [2, Problem 4, pp. 21–22]. Note that since F and G are bounded between 0 and 1, we must have d(F, G) ≤ 1. It may not seem that this expression is symmetric in F and G, but it is. For suppose ε > d(F, G), as defined in (3). Pick an arbitrary x̄ ∈ ℝ. Then applying (3) to x = x̄, we get

    F(x̄ − ε) − ε ≤ G(x̄) ≤ F(x̄ + ε) + ε.

Applying (3b) to x = x̄ − ε gives

    G(x̄ − ε) ≤ F(x̄) + ε,

and applying (3a) to x = x̄ + ε gives

    F(x̄) − ε ≤ G(x̄ + ε).

Rearranging the last two inequalities gives

    G(x̄ − ε) − ε ≤ F(x̄) ≤ G(x̄ + ε) + ε.

That is, ε > d(F, G) implies ε > d(G, F). Symmetry implies the converse, so we see that d(F, G) = d(G, F).

It is now not too hard to show that if d(F, F_n) → 0, then F_n ⇒ F. For suppose d(F, F_n) = ε_n → 0 and that x̄ is a point of continuity of F. Then

    F(x̄ − ε_n) − ε_n ≤ F_n(x̄) ≤ F(x̄ + ε_n) + ε_n.

Since F is continuous at x̄, we have F(x̄ + ε_n) + ε_n → F(x̄) and F(x̄ − ε_n) − ε_n → F(x̄), so F_n(x̄) → F(x̄). Since x̄ is an arbitrary point of continuity of F, we have F_n ⇒ F. The converse is slightly more difficult, and I'll leave it as an exercise.

11.3.4 Fact. It can be shown that if X_n has cdf F_n and X has cdf F, and if g is a bounded continuous function, then

    F_n ⇒ F  implies  E g(X_n) → E g(X).

See, e.g., Breiman [4, §8.3]. In fact, this can be taken as a definition of convergence in distribution.

Aside: The fact above does not imply that if X_n ⇒ X, then E X_n → E X, since the function f(x) = x is not bounded. Ditto for the variance. There is a more complicated result for unbounded functions that applies often enough to be useful. Here it is. Let X_n ⇒ X, and let g and h be continuous functions such that h(x) → ∞ as x → ±∞, and g(x)/h(x) → 0 as x → ±∞. If lim sup_n E(h(X_n)) < ∞, then E(g(X_n)) → E(g(X)). See Breiman [4, §8.3, Exercise 4].
So as a consequence, if X_n ⇒ X and E(|X_n|³) is bounded, then E(X_n) → E(X) and Var(X_n) → Var(X).

11.4 Central Limit Theorem

We saw in the Law of Large Numbers that if S_n is the sum of n independent and identically distributed random variables with finite mean µ and standard deviation σ, then the sample mean A_n = S_n/n converges to µ as n → ∞. This is because the mean of A_n = S_n/n is µ and its standard deviation is σ/√n, so the distribution is collapsing around µ. What if we don't want the distribution to collapse? Let's standardize the average A_n:

    (A_n − µ) / (σ/√n) = (S_n/n − µ) / (σ/√n) = (S_n − nµ) / (σ√n).

Or equivalently, by Proposition 7.2.2, let's look at S_n/√n rather than S_n/n. The mean of S_n/√n is √n µ and its standard deviation is σ. The standardization is

    (S_n/√n − √n µ) / σ = (S_n − nµ) / (σ√n),

which is the same as the standardization of A_n. One version of the Central Limit Theorem tells what happens to the distribution of the standardization of A_n. The de Moivre–Laplace Limit Theorem is a special case of the Central Limit Theorem that applies to the Binomial distribution. Cf. Pitman [16, p. 196] and Larsen–Marx [12, Theorem 4.3.2].

11.4.1 Central Limit Theorem, v. 1. Let X₁, X₂, ... be a sequence of independent and identically distributed random variables. Let µ = E X_i and ∞ > σ² = Var X_i > 0. Define S_n = Σ_{i=1}^n X_i. Then

    (S_n − nµ) / (σ√n) ⇒ N(0, 1).

The proof of this result is beyond the scope of this course, but I have included a completely optional appendix to these notes sketching it. Note that we have to assume that the variance is positive; that is, the X_i's are not degenerate. If all the random variables are degenerate, there is no way to standardize them, and the limiting distribution, if any, will be degenerate, not normal.

But armed with this fact, we are now in a position to explain one of the facts mentioned in Lecture 10.3, page 10–6. I will prove it for the case where X has a finite variance σ², but it is true without that restriction.

11.4.2 Proposition. If X and Y are independent with the same distribution having finite variance σ², and if (X + Y)/√2 has the same distribution as X, then X has a normal N(0, σ²) distribution.

Proof: (cf. Breiman [4, Proposition 9.2.1, p. 186]) Note first that E X = 0, since E X = E[(X + Y)/√2] = √2 E X. Now let X₁, X₂, X₃, ... be a sequence of independent random variables having the same distribution as X. Then X has the same distribution as

    (X₁ + X₂)/√2,  which has the same distribution as  ((X₁ + X₂)/√2 + (X₃ + X₄)/√2)/√2 = (X₁ + X₂ + X₃ + X₄)/√4,  etc.

So X/σ has the same distribution as (X₁ + ⋯ + X_n)/(σ√n) whenever n is a power of 2, but the right-hand side converges in distribution to a N(0, 1) distribution. Since the distribution of X/σ does not depend on n, it must be N(0, 1); that is, X ~ N(0, σ²).
In fact, the CLT (as it is known to its friends) is even more general.
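Before moving on, a quick simulation makes v. 1 concrete. The sketch below (my own, not part of the notes) standardizes sums of Uniform[0,1] draws, for which µ = 1/2 and σ² = 1/12, and compares the empirical cdf of the standardized sum with the standard normal cdf Φ at a few points; n = 30 and the sample size are arbitrary choices.

```python
import math
import random

random.seed(2)

def standardized_sum(n):
    """(S_n - n*mu)/(sigma*sqrt(n)) for S_n a sum of n Uniform[0,1] draws."""
    mu, sigma = 0.5, math.sqrt(1 / 12)
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

def Phi(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

draws = [standardized_sum(30) for _ in range(50_000)]
for x in (-1.0, 0.0, 1.0):
    frac = sum(z <= x for z in draws) / len(draws)
    print(f"P(Z_n <= {x:+.0f}) = {frac:.3f}   vs   Phi({x:+.0f}) = {Phi(x):.3f}")
```

The uniform distribution looks nothing like a normal, yet even for n = 30 the agreement is already good, which is the whole point of the theorem.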

11.5 A CLT for non-identical distributions

What if the X_i's are not identically distributed? The CLT can still hold. The following result is taken from Breiman [4, Theorem 9.2]. There are other variations. See also Feller [8, Theorem VIII.4.3, p. 262], Lindeberg [13], Loève [14, p. 292], and Hall and Heyde [10, §3.2]. Feller [6, 7] provides a stronger theorem that does not require finite variances.

11.5.1 A more general CLT. Let X₁, X₂, ... be independent random variables (not necessarily identically distributed) with E X_i = 0, 0 < Var X_i = σ_i² < ∞, and E|X_i|³ < ∞. Define

    S_n = Σ_{i=1}^n X_i,

and let

    s_n² = σ₁² + ⋯ + σ_n² = Var S_n,

so that s_n is the standard deviation of S_n. Assume that

    lim sup_{n→∞} (1/s_n³) Σ_{i=1}^n E|X_i|³ = 0.

Then

    S_n / s_n ⇒ N(0, 1).

The importance of this result is that for a standardized sum of a large number of independent random variables, each of which contributes a negligible part of the total variance, the distribution is approximately normal. This is the explanation of the observed fact that many characteristics of a population have a normal distribution. In applied statistics, it is used to justify the assumption that data are normally distributed.

11.6 The Berry–Esseen Theorem

One of the nice things about the Weak Law of Large Numbers is that it gave us information on how close we probably were to the mean. The Berry–Esseen Theorem gives information on how close the cdf of the standardized sum is to the standard normal cdf. The statement and a proof may be found in Feller [8, Theorem XVI.5.1]. See also Bhattacharya and Ranga Rao [1, Theorem 12.4, p. 104].

11.6.1 Berry–Esseen Theorem. Let X₁, ..., X_n be independent and identically distributed with expectation 0, variance σ² > 0, and E(|X_i|³) = ρ < ∞. Let F_n be the cumulative distribution function of S_n/(σ√n).
Then for all x ∈ ℝ,

    |F_n(x) − Φ(x)| ≤ 3ρ / (σ³ √n).

11.7 The CLT and densities

Under the hypothesis that each X_i has a density and finite third absolute moment, it can be shown that the density of S_n/(σ√n) converges uniformly to the standard normal density at a rate of 1/√n. See Feller [8, Section XVI.2, pp. 533ff].
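As a sanity check on the Berry–Esseen bound, the discrepancy |F_n(x) − Φ(x)| can be computed exactly when the X_i are centered coin flips taking the values ±1/2 with probability 1/2 each, so that σ = 1/2, ρ = E|X_i|³ = 1/8, and the bound is 3ρ/(σ³√n) = 3/√n. Here is a sketch (my own, not part of the notes; n = 100 is an arbitrary choice):

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n = 100
sigma, rho = 0.5, 0.125                        # X_i = +/- 1/2, each with probability 1/2
bound = 3 * rho / (sigma**3 * math.sqrt(n))    # Berry-Esseen bound = 3/sqrt(n)

# S_n takes the value k - n/2 when there are k heads; F_n is a step function,
# so the sup of |F_n - Phi| is attained at a jump (check both sides of each).
worst = 0.0
cum = 0.0
for k in range(n + 1):
    p = math.comb(n, k) / 2**n                 # P(exactly k heads)
    x = (k - n / 2) / (sigma * math.sqrt(n))   # standardized jump point
    worst = max(worst, abs(cum - Phi(x)))      # left limit of F_n at x
    cum += p
    worst = max(worst, abs(cum - Phi(x)))      # value of F_n at x
print(f"max |F_n - Phi| = {worst:.4f}  <=  bound = {bound:.4f}")
```

For jump distributions like this one the true discrepancy is of order 1/√n (driven by the jumps themselves), so the rate in the bound cannot be improved in general, even though the constant is generous.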

11.8 The Second Limit Theorem

Fréchet and Shohat [9] give an elementary proof of a generalization of another useful theorem on convergence in distribution, originally due to Andrey Andreyevich Markov (Андрей Андреевич Марков),¹ which he called the Second Limit Theorem. More importantly, their proof is in English. The statement here is taken from van der Waerden [19, §24.F, p. 103]. You can also find this result in Breiman [4, Theorem 8.48, pp. 181–182].

11.8.1 Markov–Fréchet–Shohat Second Limit Theorem. Let F_n be a sequence of cdfs, where each F_n has finite moments µ_{k,n} of all orders k = 1, 2, ..., and assume that for each k, lim_{n→∞} µ_{k,n} = µ_k, where each µ_k is finite. Then there is a cdf F such that the k-th moment of F is µ_k. Moreover, if F is uniquely determined by its moments, then F_n ⇒ F.

An important case is the standard Normal distribution, which is determined by its moments

    µ_k = 0 if k is odd,   µ_k = (2m)! / (2^m m!) if k = 2m.

Mann and Whitney [15] used this to derive the asymptotic distribution of their eponymous test statistic [19, p. 277], which we shall discuss later.

11.9 Slutsky's Theorem

My colleague Bob Sherman assures me that the next result, which he refers to as Slutsky's Theorem [18], is incredibly useful in his work. This version is taken from Cramér [5, §20.6].

11.9.1 Theorem (Slutsky's Theorem). Let X₁, X₂, ... be a sequence of random variables with cdfs F₁, F₂, .... Assume that F_n ⇒ F. Let Y₁, Y₂, ... satisfy plim_{n→∞} Y_n = c, where c is a real constant. Then (slightly abusing notation),

    X_n + Y_n ⇒ F(x − c),
    X_n Y_n ⇒ F(x/c)   (c > 0),
    X_n / Y_n ⇒ F(cx)   (c > 0).

One of the virtues of this theorem is that you do not need to know anything about the independence or dependence of the random variables X_n and Y_n. The proof is elementary, and may be found in Cramér [5, §20.6], Jacod and Protter [11, Theorem 18.8], or van der Waerden [19, §24.G, pp. 103–104].

Appendix

Stop here.
DO NOT ENTER. You are not responsible for what follows.

¹ Not to be confused with his son, Andrey Andreyevich Markov, Jr. (Андрей Андреевич Марков).

11.10 The characteristic function of a distribution

Some of you may have taken ACM 95, in which case you have come across Fourier transforms. If you haven't, you may still find this comment weakly illuminating. You may also want to look at Appendix 4.A.2 in Larsen–Marx [12].

You may know that the exponential function extends nicely to the complex numbers, and that for a real number u,

    e^{iu} = cos u + i sin u,

where of course i = √−1. The characteristic function φ_X : ℝ → ℂ of the random variable X is a complex-valued function defined on the real line by

    φ_X(u) = E e^{iuX}.

This expectation always exists as a complex number (integrate the real and imaginary parts separately), since |e^{iux}| = 1 for all u, x.

11.10.1 Fact. If X and Y have the same characteristic function, they have the same distribution.

11.10.2 Fact (Fourier Inversion Formula). If F has a density f and characteristic function φ, then

    φ(u) = ∫ e^{iux} f(x) dx,

and we have the inversion formula

    f(x) = (1/2π) ∫ e^{−iux} φ(u) du.

Characteristic function of an independent sum

11.10.3 Fact. Let X, Y be independent random variables on (S, E, P). Then

    φ_{X+Y}(u) = E(e^{iu(X+Y)}) = E(e^{iuX} e^{iuY}) = E e^{iuX} · E e^{iuY}   (by independence)
               = φ_X(u) φ_Y(u).

Characteristic function of a Normal random variable

11.10.4 Fact. Let X be a standard normal random variable. Then

    φ_X(u) = E e^{iuX} = (1/√2π) ∫ e^{iux} e^{−x²/2} dx = e^{−u²/2}.

(This is not obvious.)

11.11 The characteristic function and convergence in distribution

11.11.1 Theorem. F_n ⇒ F if and only if φ_n(u) → φ(u) for all u.

Proof: See Breiman [4, §8.3].
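A characteristic function can be estimated from data by averaging e^{iuX} over a sample. The sketch below (my own, not part of the notes) does this for standard normal draws and compares the estimate with the normal characteristic function e^{−u²/2} stated above; the sample size and the values of u are arbitrary choices.

```python
import cmath
import math
import random

random.seed(4)

def empirical_cf(sample, u):
    """Empirical characteristic function: the sample average of e^{iuX}."""
    return sum(cmath.exp(1j * u * x) for x in sample) / len(sample)

xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
for u in (0.5, 1.0, 2.0):
    approx = empirical_cf(xs, u)
    exact = math.exp(-u * u / 2)    # characteristic function of N(0, 1)
    print(f"u = {u}: |empirical - exact| = {abs(approx - exact):.4f}")
```

Since |e^{iuX}| = 1, the averaged terms are bounded, so the estimate converges at the usual 1/√N Monte Carlo rate for every u.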

11.12 The characteristic function and the CLT

Sketch of proof of the CLT: We consider the case where E X_i = 0. (Otherwise subtract the mean everywhere.) Compute the characteristic function φ_n of (X₁ + ⋯ + X_n)/(σ√n):

    φ_n(u) = E[ e^{iu(X₁+⋯+X_n)/(σ√n)} ]
           = Π_{k=1}^n E[ e^{iuX_k/(σ√n)} ]   (independence)
           = ( E[ e^{iuX₁/(σ√n)} ] )^n   (identical distribution)
           = ( E[ 1 + iuX₁/(σ√n) + (1/2)(iuX₁/(σ√n))² + o(1/n) ] )^n   (Taylor's Theorem)
           = ( 1 + 0 − u²σ²/(2nσ²) + o(1/n) )^n
           = ( 1 − u²/(2n) + o(1/n) )^n   (E X₁ = 0, E X₁² = σ²)

Now use the well-known fact:

    lim_{x→0} (1 + ax + o(bx))^{1/x} = e^a,

so as n → ∞,

    φ_n(u) → e^{−u²/2},

which is the characteristic function of N(0, 1).

Proof of the well-known fact:

    ln( (1 + ax + o(bx))^{1/x} ) = (1/x) ln(1 + ax + o(bx)) = (1/x)[ax + o(x)] = a + o(x)/x → a as x → 0.

Therefore (1 + ax + o(bx))^{1/x} → e^a as x → 0.

11.13 Other Proofs

The Fourier transform approach to the CLT is algebraic and does not give a lot of insight. There are other approaches that have a more probabilistic flavor. Chapter 2 of Ross and Peköz [17] provides a very nice elementary, but long, proof based on constructing a new sequence of random variables (the Chen–Stein construction) with the same distribution as the sequence (X₁ + ⋯ + X_n)/(σ√n), in such a way that we can easily compute their distance from a standard normal.

Bibliography

[1] R. N. Bhattacharya and R. Ranga Rao. 1986. Normal approximation and asymptotic expansions, corrected and enlarged ed. Malabar, Florida: Robert E. Krieger Publishing Company. Reprint of the 1976 edition published by John Wiley & Sons. The new edition has an additional chapter and mistakes have been corrected.

[2] P. Billingsley. 1968. Convergence of probability measures. Wiley Series in Probability and Mathematical Statistics. New York: Wiley.

[3] K. C. Border. 1991. Functional analytic tools for expected utility theory. In C. D. Aliprantis, K. C. Border, and W. A. J. Luxemburg, eds., Positive Operators, Riesz Spaces, and Economics, number 2 in Studies in Economic Theory. Berlin: Springer-Verlag.

[4] L. Breiman. 1968. Probability. Reading, Massachusetts: Addison-Wesley.

[5] H. Cramér. 1946. Mathematical methods of statistics. Number 34 in Princeton Mathematical Series. Princeton, New Jersey: Princeton University Press. Reprinted 1974.

[6] W. Feller. 1935. Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift 40(1).

[7] W. Feller. 1937. Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung. II. Mathematische Zeitschrift 42(1).

[8] W. Feller. 1971. An introduction to probability theory and its applications, 2d ed., volume 2. New York: Wiley.

[9] M. Fréchet and J. Shohat. 1931. A proof of the generalized second limit-theorem in the theory of probability. Transactions of the American Mathematical Society 33:533–543.

[10] P. Hall and C. C. Heyde. 1980. Martingale limit theory and its application. Probability and Mathematical Statistics. New York: Academic Press.

[11] J. Jacod and P. Protter. 2003. Probability essentials, 2d ed. Berlin and Heidelberg: Springer.

[12] R. J. Larsen and M. L. Marx. 2012. An introduction to mathematical statistics and its applications, 5th ed. Boston: Prentice Hall.

[13] J. W. Lindeberg. 1922. Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift 15(1):211–225.

[14] M. Loève. 1977. Probability theory, 4th ed. Graduate Texts in Mathematics. Berlin: Springer-Verlag.

[15] H. B. Mann and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18(1):50–60.

[16] J. Pitman. 1993. Probability. Springer Texts in Statistics. New York, Berlin, and Heidelberg: Springer.

[17] S. M. Ross and E. A. Peköz. 2007. A second course in probability. Boston: ProbabilityBookstore.com.

[18] E. Slutsky. 1925. Über stochastische Asymptoten und Grenzwerte. Metron 5(3):3–89.

[19] B. L. van der Waerden. 1969. Mathematical statistics. Number 156 in Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen mit besonderer Berücksichtigung der Anwendungsgebiete. New York, Berlin, and Heidelberg: Springer-Verlag. Translated by Virginia Thompson and Ellen Sherman from Mathematische Statistik, published by Springer-Verlag in 1965, as volume 87 in the series Grundlehren der mathematischen Wissenschaften.


More information

Springer Texts in Statistics. Advisors: Stephen Fienberg Ingram Olkin

Springer Texts in Statistics. Advisors: Stephen Fienberg Ingram Olkin Springer Texts in Statistics Advisors: Stephen Fienberg Ingram Olkin Springer Texts in Statistics Alfred Berger Blom Chow and Teicher Christensen Christensen Christensen du Toit, Steyn and Stumpf Finkelstein

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Introduction to Correspondences

Introduction to Correspondences Division of the Humanities and Social Sciences Introduction to Correspondences KC Border September 2010 Revised November 2013 In these particular notes, all spaces are metric spaces. The set of extended

More information

Expectation, Variance and Standard Deviation for Continuous Random Variables Class 6, Jeremy Orloff and Jonathan Bloom

Expectation, Variance and Standard Deviation for Continuous Random Variables Class 6, Jeremy Orloff and Jonathan Bloom Expectation, Variance and Standard Deviation for Continuous Random Variables Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Be able to compute and interpret expectation, variance, and standard

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

From Calculus II: An infinite series is an expression of the form

From Calculus II: An infinite series is an expression of the form MATH 3333 INTERMEDIATE ANALYSIS BLECHER NOTES 75 8. Infinite series of numbers From Calculus II: An infinite series is an expression of the form = a m + a m+ + a m+2 + ( ) Let us call this expression (*).

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: October 16, 2014

More information

Stat 5101 Lecture Slides Deck 4. Charles J. Geyer School of Statistics University of Minnesota

Stat 5101 Lecture Slides Deck 4. Charles J. Geyer School of Statistics University of Minnesota Stat 5101 Lecture Slides Deck 4 Charles J. Geyer School of Statistics University of Minnesota 1 Existence of Integrals Just from the definition of integral as area under the curve, the integral b g(x)

More information

1. Aufgabenblatt zur Vorlesung Probability Theory

1. Aufgabenblatt zur Vorlesung Probability Theory 24.10.17 1. Aufgabenblatt zur Vorlesung By (Ω, A, P ) we always enote the unerlying probability space, unless state otherwise. 1. Let r > 0, an efine f(x) = 1 [0, [ (x) exp( r x), x R. a) Show that p f

More information

Geometric Series and the Ratio and Root Test

Geometric Series and the Ratio and Root Test Geometric Series and the Ratio and Root Test James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University September 5, 2018 Outline 1 Geometric Series

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES

ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES 1. Order statistics Let X 1,...,X n be n real-valued observations. One can always arrangetheminordertogettheorder statisticsx (1) X (2) X (n). SinceX (k)

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Convex Analysis and Economic Theory AY Elementary properties of convex functions

Convex Analysis and Economic Theory AY Elementary properties of convex functions Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory AY 2018 2019 Topic 6: Convex functions I 6.1 Elementary properties of convex functions We may occasionally

More information

Discrete Probability Refresher

Discrete Probability Refresher ECE 1502 Information Theory Discrete Probability Refresher F. R. Kschischang Dept. of Electrical and Computer Engineering University of Toronto January 13, 1999 revised January 11, 2006 Probability theory

More information

MULTIVARIATE PROBABILITY DISTRIBUTIONS

MULTIVARIATE PROBABILITY DISTRIBUTIONS MULTIVARIATE PROBABILITY DISTRIBUTIONS. PRELIMINARIES.. Example. Consider an experiment that consists of tossing a die and a coin at the same time. We can consider a number of random variables defined

More information

IEOR 3106: Introduction to Operations Research: Stochastic Models. Fall 2011, Professor Whitt. Class Lecture Notes: Thursday, September 15.

IEOR 3106: Introduction to Operations Research: Stochastic Models. Fall 2011, Professor Whitt. Class Lecture Notes: Thursday, September 15. IEOR 3106: Introduction to Operations Research: Stochastic Models Fall 2011, Professor Whitt Class Lecture Notes: Thursday, September 15. Random Variables, Conditional Expectation and Transforms 1. Random

More information

Arkansas Tech University MATH 3513: Applied Statistics I Dr. Marcel B. Finan

Arkansas Tech University MATH 3513: Applied Statistics I Dr. Marcel B. Finan 2.4 Random Variables Arkansas Tech University MATH 3513: Applied Statistics I Dr. Marcel B. Finan By definition, a random variable X is a function with domain the sample space and range a subset of the

More information

STAT Sample Problem: General Asymptotic Results

STAT Sample Problem: General Asymptotic Results STAT331 1-Sample Problem: General Asymptotic Results In this unit we will consider the 1-sample problem and prove the consistency and asymptotic normality of the Nelson-Aalen estimator of the cumulative

More information

Convergence Concepts of Random Variables and Functions

Convergence Concepts of Random Variables and Functions Convergence Concepts of Random Variables and Functions c 2002 2007, Professor Seppo Pynnonen, Department of Mathematics and Statistics, University of Vaasa Version: January 5, 2007 Convergence Modes Convergence

More information

MAT137 - Term 2, Week 2

MAT137 - Term 2, Week 2 MAT137 - Term 2, Week 2 This lecture will assume you have watched all of the videos on the definition of the integral (but will remind you about some things). Today we re talking about: More on the definition

More information

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES CLASSICAL PROBABILITY 2008 2. MODES OF CONVERGENCE AND INEQUALITIES JOHN MORIARTY In many interesting and important situations, the object of interest is influenced by many random factors. If we can construct

More information

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous: MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is

More information

Math Camp II. Calculus. Yiqing Xu. August 27, 2014 MIT

Math Camp II. Calculus. Yiqing Xu. August 27, 2014 MIT Math Camp II Calculus Yiqing Xu MIT August 27, 2014 1 Sequence and Limit 2 Derivatives 3 OLS Asymptotics 4 Integrals Sequence Definition A sequence {y n } = {y 1, y 2, y 3,..., y n } is an ordered set

More information

Problem List MATH 5143 Fall, 2013

Problem List MATH 5143 Fall, 2013 Problem List MATH 5143 Fall, 2013 On any problem you may use the result of any previous problem (even if you were not able to do it) and any information given in class up to the moment the problem was

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

IEOR 6711: Stochastic Models I Fall 2013, Professor Whitt Lecture Notes, Thursday, September 5 Modes of Convergence

IEOR 6711: Stochastic Models I Fall 2013, Professor Whitt Lecture Notes, Thursday, September 5 Modes of Convergence IEOR 6711: Stochastic Models I Fall 2013, Professor Whitt Lecture Notes, Thursday, September 5 Modes of Convergence 1 Overview We started by stating the two principal laws of large numbers: the strong

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

1 Presessional Probability

1 Presessional Probability 1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional

More information

Lecture 9: March 26, 2014

Lecture 9: March 26, 2014 COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 9: March 26, 204 Spring 204 Scriber: Keith Nichols Overview. Last Time Finished analysis of O ( n ɛ ) -query

More information

Math 111, Introduction to the Calculus, Fall 2011 Midterm I Practice Exam 1 Solutions

Math 111, Introduction to the Calculus, Fall 2011 Midterm I Practice Exam 1 Solutions Math 111, Introduction to the Calculus, Fall 2011 Midterm I Practice Exam 1 Solutions For each question, there is a model solution (showing you the level of detail I expect on the exam) and then below

More information

7 Random samples and sampling distributions

7 Random samples and sampling distributions 7 Random samples and sampling distributions 7.1 Introduction - random samples We will use the term experiment in a very general way to refer to some process, procedure or natural phenomena that produces

More information

Geometric Series and the Ratio and Root Test

Geometric Series and the Ratio and Root Test Geometric Series and the Ratio and Root Test James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University September 5, 2017 Outline Geometric Series The

More information

1 + lim. n n+1. f(x) = x + 1, x 1. and we check that f is increasing, instead. Using the quotient rule, we easily find that. 1 (x + 1) 1 x (x + 1) 2 =

1 + lim. n n+1. f(x) = x + 1, x 1. and we check that f is increasing, instead. Using the quotient rule, we easily find that. 1 (x + 1) 1 x (x + 1) 2 = Chapter 5 Sequences and series 5. Sequences Definition 5. (Sequence). A sequence is a function which is defined on the set N of natural numbers. Since such a function is uniquely determined by its values

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

Large Sample Theory. Consider a sequence of random variables Z 1, Z 2,..., Z n. Convergence in probability: Z n

Large Sample Theory. Consider a sequence of random variables Z 1, Z 2,..., Z n. Convergence in probability: Z n Large Sample Theory In statistics, we are interested in the properties of particular random variables (or estimators ), which are functions of our data. In ymptotic analysis, we focus on describing the

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES Contents 1. Continuous random variables 2. Examples 3. Expected values 4. Joint distributions

More information

Refining the Central Limit Theorem Approximation via Extreme Value Theory

Refining the Central Limit Theorem Approximation via Extreme Value Theory Refining the Central Limit Theorem Approximation via Extreme Value Theory Ulrich K. Müller Economics Department Princeton University February 2018 Abstract We suggest approximating the distribution of

More information

Stochastic Convergence, Delta Method & Moment Estimators

Stochastic Convergence, Delta Method & Moment Estimators Stochastic Convergence, Delta Method & Moment Estimators Seminar on Asymptotic Statistics Daniel Hoffmann University of Kaiserslautern Department of Mathematics February 13, 2015 Daniel Hoffmann (TU KL)

More information

A new approach for stochastic ordering of risks

A new approach for stochastic ordering of risks A new approach for stochastic ordering of risks Liang Hong, PhD, FSA Department of Mathematics Robert Morris University Presented at 2014 Actuarial Research Conference UC Santa Barbara July 16, 2014 Liang

More information

Useful Probability Theorems

Useful Probability Theorems Useful Probability Theorems Shiu-Tang Li Finished: March 23, 2013 Last updated: November 2, 2013 1 Convergence in distribution Theorem 1.1. TFAE: (i) µ n µ, µ n, µ are probability measures. (ii) F n (x)

More information

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics Chapter 6 Order Statistics and Quantiles 61 Extreme Order Statistics Suppose we have a finite sample X 1,, X n Conditional on this sample, we define the values X 1),, X n) to be a permutation of X 1,,

More information

Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory. Charles J. Geyer School of Statistics University of Minnesota

Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory. Charles J. Geyer School of Statistics University of Minnesota Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory Charles J. Geyer School of Statistics University of Minnesota 1 Asymptotic Approximation The last big subject in probability

More information

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text.

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text. TEST #3 STA 5326 December 4, 214 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. (You will have access to

More information

STAT 6385 Survey of Nonparametric Statistics. Order Statistics, EDF and Censoring

STAT 6385 Survey of Nonparametric Statistics. Order Statistics, EDF and Censoring STAT 6385 Survey of Nonparametric Statistics Order Statistics, EDF and Censoring Quantile Function A quantile (or a percentile) of a distribution is that value of X such that a specific percentage of the

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION

THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION JAINUL VAGHASIA Contents. Introduction. Notations 3. Background in Probability Theory 3.. Expectation and Variance 3.. Convergence

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Product measure and Fubini s theorem

Product measure and Fubini s theorem Chapter 7 Product measure and Fubini s theorem This is based on [Billingsley, Section 18]. 1. Product spaces Suppose (Ω 1, F 1 ) and (Ω 2, F 2 ) are two probability spaces. In a product space Ω = Ω 1 Ω

More information

Convex Analysis and Economic Theory Fall 2018 Winter 2019

Convex Analysis and Economic Theory Fall 2018 Winter 2019 Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory Fall 2018 Winter 2019 Topic 18: Differentiability 18.1 Differentiable functions In this section I want

More information

Lecture 1, August 21, 2017

Lecture 1, August 21, 2017 Engineering Mathematics 1 Fall 2017 Lecture 1, August 21, 2017 What is a differential equation? A differential equation is an equation relating a function (known sometimes as the unknown) to some of its

More information

EXAMPLES OF PROOFS BY INDUCTION

EXAMPLES OF PROOFS BY INDUCTION EXAMPLES OF PROOFS BY INDUCTION KEITH CONRAD 1. Introduction In this handout we illustrate proofs by induction from several areas of mathematics: linear algebra, polynomial algebra, and calculus. Becoming

More information

Algebraic Number Theory Notes: Local Fields

Algebraic Number Theory Notes: Local Fields Algebraic Number Theory Notes: Local Fields Sam Mundy These notes are meant to serve as quick introduction to local fields, in a way which does not pass through general global fields. Here all topological

More information

Proving the central limit theorem

Proving the central limit theorem SOR3012: Stochastic Processes Proving the central limit theorem Gareth Tribello March 3, 2019 1 Purpose In the lectures and exercises we have learnt about the law of large numbers and the central limit

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Stat 5101 Notes: Algorithms

Stat 5101 Notes: Algorithms Stat 5101 Notes: Algorithms Charles J. Geyer January 22, 2016 Contents 1 Calculating an Expectation or a Probability 3 1.1 From a PMF........................... 3 1.2 From a PDF...........................

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

Chapter 4. Chapter 4 sections

Chapter 4. Chapter 4 sections Chapter 4 sections 4.1 Expectation 4.2 Properties of Expectations 4.3 Variance 4.4 Moments 4.5 The Mean and the Median 4.6 Covariance and Correlation 4.7 Conditional Expectation SKIP: 4.8 Utility Expectation

More information

Richard S. Palais Department of Mathematics Brandeis University Waltham, MA The Magic of Iteration

Richard S. Palais Department of Mathematics Brandeis University Waltham, MA The Magic of Iteration Richard S. Palais Department of Mathematics Brandeis University Waltham, MA 02254-9110 The Magic of Iteration Section 1 The subject of these notes is one of my favorites in all mathematics, and it s not

More information

Chapter 6. Convergence. Probability Theory. Four different convergence concepts. Four different convergence concepts. Convergence in probability

Chapter 6. Convergence. Probability Theory. Four different convergence concepts. Four different convergence concepts. Convergence in probability Probability Theory Chapter 6 Convergence Four different convergence concepts Let X 1, X 2, be a sequence of (usually dependent) random variables Definition 1.1. X n converges almost surely (a.s.), or with

More information

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

The Moment Method; Convex Duality; and Large/Medium/Small Deviations Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential

More information

2 (Statistics) Random variables

2 (Statistics) Random variables 2 (Statistics) Random variables References: DeGroot and Schervish, chapters 3, 4 and 5; Stirzaker, chapters 4, 5 and 6 We will now study the main tools use for modeling experiments with unknown outcomes

More information

Lecture 1 Measure concentration

Lecture 1 Measure concentration CSE 29: Learning Theory Fall 2006 Lecture Measure concentration Lecturer: Sanjoy Dasgupta Scribe: Nakul Verma, Aaron Arvey, and Paul Ruvolo. Concentration of measure: examples We start with some examples

More information