The Central Limit Theorem: More of the Story

Slide 1: The Central Limit Theorem: More of the Story

Steven Janke, November 2015

Slide 2: Central Limit Theorem

Theorem (Central Limit Theorem). Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed random variables, each with expectation $\mu$ and variance $\sigma^2$. Then the distribution of

$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$

converges to the distribution of a standard normal random variable:

$$\lim_{n \to \infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy.$$
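As a quick numerical illustration (not from the original slides; the exponential summand and the sample sizes are arbitrary choices), the following sketch standardizes sums of a decidedly non-normal summand and compares the empirical CDF of $Z_n$ to $\Phi(x)$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Exponential(1) summands: mu = sigma = 1, and the summand itself is far from normal.
mu, sigma, n, reps = 1.0, 1.0, 200, 20_000

X = rng.exponential(scale=1.0, size=(reps, n))
Z = (X.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

# Empirical P(Z_n <= x) versus the standard normal CDF Phi(x).
for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, (Z <= x).mean(), 0.5 * (1 + erf(x / sqrt(2))))
```

The two columns agree to two or three decimal places already at $n = 200$.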

Slide 3: Central Limit Theorem Applications

- The sampling distribution of the mean is approximately normal.
- The distribution of experimental errors is approximately normal.

But why the normal distribution?

Slide 4: Benford's Law

In an arbitrary table of data, such as populations or lake areas,

$$P[\text{leading digit is } d] = \log_{10}\!\left(1 + \frac{1}{d}\right).$$

Data: list of the 60 tallest buildings. (Table comparing, for each leading digit, the observed frequencies of heights in meters and in feet against the Benford prediction; the numerical entries were lost in transcription.)
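To make the comparison concrete, here is a small sketch (mine, not from the talk) that tabulates leading digits against the $\log_{10}(1 + 1/d)$ prediction. The stand-in data is a geometric growth series, which is Benford-like by design; any real table such as building heights could be substituted.

```python
import numpy as np

benford = {d: np.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(x // 10 ** np.floor(np.log10(x)))

# Stand-in data: a multiplicative growth series (hypothetical, Benford-like by design).
data = 100.0 * 1.07 ** np.arange(500)

digits = np.array([leading_digit(x) for x in data])
for d in range(1, 10):
    print(d, round((digits == d).mean(), 3), round(benford[d], 3))
```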

Slide 5: Benford Justification

- Simon Newcomb, 1881
- Frank Benford, 1938

Proof arguments:
- Positional number system
- Densities
- Scale invariance
- Scale and base unbiased (Hill 1995)

Slide 6: Central Limit Theorem (restated)

Theorem (Central Limit Theorem). Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed random variables, each with expectation $\mu$ and variance $\sigma^2$. Then the distribution of

$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$

converges to the distribution of a standard normal random variable:

$$\lim_{n \to \infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy.$$

Slide 7: Central Limit Theorem Proof

Proof sketch:
- Let $Y_i = X_i - \mu$.
- The moment generating function of $Y_i$ is $M_{Y_i}(t) = E\,e^{tY_i}$.
- The MGF of $Z_n$ is $M_{Z_n}(t) = \left[M_{Y_1}\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$.
- $\lim_{n \to \infty} \ln M_{Z_n}(t) = \frac{t^2}{2}$.
- The MGF of the standard normal is $e^{t^2/2}$.
- Since the MGFs converge, the distributions converge (Lévy continuity theorem).
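The key limit, $n \ln M_Y(t/\sigma\sqrt{n}) \to t^2/2$, can be checked numerically for a concrete summand. Below (an illustration of my choosing, not the talk's) $Y = X - 1$ for $X \sim \text{Exp}(1)$, whose MGF $M_Y(t) = e^{-t}/(1-t)$ for $t < 1$ is known in closed form.

```python
import numpy as np

# For X ~ Exp(1): mu = sigma = 1, and M_Y(t) = E[e^{t(X-1)}] = e^{-t}/(1-t) for t < 1.
def log_mgf_Y(t):
    return -t - np.log(1.0 - t)

t = 1.5
for n in (10, 100, 1_000, 10_000):
    # log M_{Z_n}(t) = n * log M_Y(t / (sigma * sqrt(n)))
    print(n, n * log_mgf_Y(t / np.sqrt(n)))
print("target t^2/2 =", t * t / 2)
```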

Slide 8: Counter-Examples

- Moment problem: a lognormal random variable is not determined by its moments.
- No first moment: a Cauchy random variable has no MGF and $E|X| = \infty$, so the CLT does not hold.
- No second moment: $f(x) \propto \frac{1}{x^3}$ for $x \ge 1$ has a finite mean but infinite variance.
- Pairwise independence is not sufficient for the CLT.
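The Cauchy counter-example is easy to watch fail: the mean of $n$ i.i.d. standard Cauchy variables is itself standard Cauchy for every $n$, so running averages never settle down. A minimal simulation (my illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Running means of i.i.d. standard Cauchy samples do not converge.
X = rng.standard_cauchy(1_000_000)
for n in (10, 1_000, 100_000, 1_000_000):
    print(n, X[:n].mean())
# The printed means keep jumping around instead of stabilizing.
```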

Slide 9: De Moivre's Theorem, 1733

Each $X_i$ is Bernoulli (0 or 1, $p = \tfrac{1}{2}$).

$$b(k) = P[S_n = k] = \binom{n}{k}\frac{1}{2^n}, \qquad n! \approx \sqrt{2\pi n}\, n^n e^{-n} \ \text{(Stirling's formula)}$$

$$b\!\left(\tfrac{n}{2}\right) \approx \sqrt{\frac{2}{\pi n}}, \qquad \log\frac{b(\tfrac{n}{2} + d)}{b(\tfrac{n}{2})} \approx -\frac{2d^2}{n}, \qquad b\!\left(\tfrac{n}{2} + d\right) \approx \sqrt{\frac{2}{\pi n}}\, e^{-2d^2/n}$$

$$\lim_{n \to \infty} P\!\left[a \le \frac{S_n - n/2}{\sqrt{n}/2} \le b\right] = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx$$
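De Moivre's local approximation can be checked digit by digit: compare the exact binomial probability $b(\tfrac{n}{2}+d)$ with $\sqrt{2/\pi n}\, e^{-2d^2/n}$. A small check using exact binomial coefficients (mine; $n$ even here):

```python
from math import comb, exp, pi, sqrt

n = 100  # number of fair-coin tosses (even, so n/2 is an integer)
for d in (0, 2, 5, 10):
    exact = comb(n, n // 2 + d) / 2 ** n
    approx = sqrt(2 / (pi * n)) * exp(-2 * d * d / n)
    print(d, exact, approx)
# The two columns agree to a few decimal places for moderate d.
```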

Slide 10: Laplace, 1810

Dealt with the independent and identically distributed case. Started with discrete variables:

Consider $X_i$ where $p_k = P[X_i = \tfrac{k}{m}]$ for $k = -m, -m+1, \dots, m-1, m$.

Generating function: $T(t) = \sum_{k=-m}^{m} p_k t^k$; then $q_j = P[\sum X_i = \tfrac{j}{m}]$ is the coefficient of $t^j$ in $T(t)^n$.

Substitute $e^{ix}$ for $t$ and recall $\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{itx} e^{-isx}\, dx = \delta_{ts}$. Then

$$q_j = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-ijx}\left[\sum_{k=-m}^{m} p_k e^{ikx}\right]^n dx.$$

Now expand $e^{ikx}$ in a power series around 0 and use the fact that the mean of $X_i$ is zero.

Slide 11: Why Normal? Normal Characteristic Function

$$f(u) = \frac{1}{\sqrt{2\pi}} \int e^{iux} e^{-x^2/2}\, dx = e^{-u^2/2}$$

$$f_{S_n/\sigma\sqrt{n}}(u) = E\!\left[e^{iu S_n/\sigma\sqrt{n}}\right] = \left(f\!\left(\tfrac{u}{\sigma\sqrt{n}}\right)\right)^{\!n} = \left(1 - \frac{\sigma^2}{2\sigma^2 n}u^2 + o\!\left(\frac{\sigma^2}{\sigma^2 n}u^2\right)\right)^{\!n} = \left(1 - \frac{u^2}{2n} + o\!\left(\frac{u^2}{n}\right)\right)^{\!n} \to e^{-u^2/2}$$
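The final product formula is a one-liner to verify numerically: $(1 - u^2/2n)^n$ should approach $e^{-u^2/2}$. A quick check (my illustration):

```python
from math import exp

u = 2.0
for n in (10, 100, 1_000, 100_000):
    # (1 - u^2/(2n))^n should approach exp(-u^2/2) as n grows.
    print(n, (1 - u * u / (2 * n)) ** n)
print("target exp(-u^2/2) =", exp(-u * u / 2))
```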

Slide 12: Lévy Continuity Theorem

If distribution functions $F_n$ converge to $F$, then the corresponding characteristic functions $f_n$ converge to $f$. Conversely, if $f_n$ converges to some $g$ continuous at 0, then $F_n$ converges to the distribution $F$ whose characteristic function is $g$.

Proof sketch:
- The first direction is the Helly-Bray theorem.
- The set $\{e^{iux}\}$ is a separating set for distribution functions.
- In both directions, continuity points and the mass of $F_n$ are critical.

Slide 13: History

- Laplace never presented a general CLT statement (he was concerned with limiting probabilities for particular problems).
- Concern over convergence led Poisson to improvements (the non-identically distributed case).
- Dirichlet and Cauchy changed the conception of analysis (epsilon/delta).
- Counter-examples uncovered limitations.
- Chebychev proved the CLT using convergence of moments (Markov and Liapounov were his students).
- First rigorous proof: Liapounov, 1900. The CLT holds for independent (but not necessarily i.i.d.) $X_i$ if

$$\frac{\sum_j E|X_j|^3}{\left[\sum_j E X_j^2\right]^{3/2}} = \frac{\sum_j E|X_j|^3}{s_n^3} \to 0.$$

Slide 14: Liapounov Proof

Assume $\dfrac{\sum_j E|X_j|^3}{\left[\sum_j E X_j^2\right]^{3/2}} = \dfrac{\sum_j E|X_j|^3}{s_n^3} \to 0$.

$$g_n(u) = \prod_{k=1}^{n} f_k\!\left(\frac{u}{s_n}\right) = \prod_{k=1}^{n} \left[1 + \left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)\right]$$

$$f_k\!\left(\frac{u}{s_n}\right) = 1 - \frac{u^2}{2s_n^2}\left(\sigma_k^2 + \delta_k\!\left(\tfrac{u}{s_n}\right)\right), \qquad \left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \le \frac{2u^2\sigma_k^2}{s_n^2}, \qquad \sum_{k=1}^{n}\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \le 2u^2$$

Since $(E X_k^2)^{3/2} \le E|X_k|^3$, we get $\sup_{k \le n} \dfrac{\sigma_k}{s_n} \to 0$, and hence $\sup_k \left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \to 0$.

Use $\log(1 + z) = z(1 + \theta z)$ with $|\theta| \le 1$ for $|z| \le \tfrac{1}{2}$:

$$\log g_n(u) = \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) + \theta \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2$$

Slide 15: Liapounov Proof, Continued

$$\log g_n(u) = \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) + \theta \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2$$

$$\left|\theta \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2\right| \le \sup_k\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \cdot \sum_{k=1}^{n}\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \to 0$$

$$f_k\!\left(\tfrac{u}{s_n}\right) - 1 = -\frac{\sigma_k^2}{2s_n^2}u^2 + \theta_k \frac{|u|^3}{s_n^3} E|X_k|^3$$

$$\sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) = -\frac{u^2}{2} + \theta \frac{|u|^3}{s_n^3}\sum_{k=1}^{n} E|X_k|^3 \to -\frac{u^2}{2}$$

Slide 16: Lindeberg, 1922

Theorem (Central Limit Theorem). Let the variables $X_i$ be independent with $EX_i = 0$ and $EX_i^2 = \sigma_i^2$. Let $s_n$ be the standard deviation of the sum $S_n$ and let $F$ be the distribution of $S_n/s_n$. With $\Phi(x)$ the normal distribution: if

$$\frac{1}{s_n^2} \sum_k \int_{|x| \ge \epsilon s_n} x^2\, dF_k \to 0,$$

then, for $n$ large enough,

$$\sup_x |F(x) - \Phi(x)| \le 5\epsilon.$$
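The Lindeberg sum can be computed in closed form for simple arrays. In the sketch below (an illustration, not from the talk) each $X_k$ is uniform on $[-k, k]$: the variances grow, but since $|X_k| \le k$ while $s_n$ grows like $n^{3/2}$, the truncated integrals eventually vanish and the condition holds.

```python
import numpy as np

def lindeberg_sum(n, eps):
    """Lindeberg sum for independent X_k ~ Uniform[-k, k], k = 1..n."""
    k = np.arange(1, n + 1, dtype=float)
    s_n = np.sqrt(np.sum(k ** 2 / 3))  # s_n^2 = sum of the variances k^2/3
    a = eps * s_n
    # For Uniform[-k, k], the integral of x^2 over |x| >= a is (k^3 - a^3)/(3k) when a < k.
    tail = np.where(k > a, (k ** 3 - a ** 3) / (3 * k), 0.0)
    return tail.sum() / s_n ** 2

for n in (10, 100, 1_000, 10_000):
    print(n, lindeberg_sum(n, eps=0.1))
# The sums drop to 0 (exactly 0 once eps * s_n exceeds n), so S_n/s_n is asymptotically normal.
```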

Slide 17: Lindeberg Proof

Pick an auxiliary function $f$. For an arbitrary distribution $V$, define

$$F(x) = \int f(x - t)\, dV(t).$$

With $\phi(x)$ the normal density, define

$$\Psi(x) = \int f(x - t)\phi(t)\, dt.$$

A Taylor expansion of $f$ to third order gives

$$|F(x) - \Psi(x)| < k \int |x|^3\, dV(x).$$

With $U_i$ the distribution of $X_i$,

$$F_1(x) = \int f(x - t)\, dU_1(t), \quad \dots, \quad F_n(x) = \int F_{n-1}(x - t)\, dU_n(t).$$

Note that the distribution $U$ of the sum satisfies

$$U(x) = \int \cdots \int U(x - t_1 - t_2 - \cdots - t_n)\, dU_1(t_1) \cdots dU_n(t_n).$$

By selecting $f$ carefully,

$$|U(x) - \Phi(x)| < 3\left(\sum_i \int |x|^3\, dU_i(x)\right)^{1/4}.$$

Slide 18: Still, Why Normal?

Let $X = X_1 + X_2$ where $X$ is $N(0, 1)$ and $X_1$ is independent of $X_2$. Then

$$f(u) = f_1(u) f_2(u) = e^{-u^2/2}.$$

- $e^{-u^2/2}$ is an entire, non-vanishing function with $|f_1(z)| \le e^{c|z|^2}$, so the Hadamard factorization theorem implies $\log f_1(u)$ is a polynomial in $u$ of degree at most 2.
- $f_1$ is a characteristic function, so $f_1(0) = 1$, $f_1(-u) = \overline{f_1(u)}$, and it is bounded.
- Hence $\log f_1(u) = iua - bu^2$. This is the general form of the normal characteristic function.

Slide 19: Feller-Lévy, 1935

Theorem (Final Central Limit Theorem). Let the variables $X_i$ be independent with $EX_i = 0$ and $EX_i^2 = \sigma_i^2$. Let $S_n = \sum_{i=1}^{n} X_i$ and $s_n^2 = \sum_{k=1}^{n} \sigma_k^2$. $\Phi$ is the normal distribution and $F_k$ is the distribution of $X_k$. Then, as $n \to \infty$,

$$P[S_n/s_n \le x] \to \Phi(x) \quad \text{and} \quad \max_{k \le n} \frac{\sigma_k}{s_n} \to 0$$

if and only if for every $\epsilon > 0$

$$\frac{1}{s_n^2} \sum_k \int_{|x| \ge \epsilon s_n} x^2\, dF_k \to 0.$$

Slide 20: Lévy

The following appeared in Lévy's monograph.

Theorem (Lévy's version). In order that a sum $S = \sum_j X_j$ of independent variables have a distribution close to Gaussian, it is necessary and sufficient that, after reducing medians to zero, the following conditions be satisfied:
- Each summand that is not negligible compared to the dispersion of the entire sum has a distribution close to Gaussian.
- The maximum of the absolute value of the negligible summands is itself negligible compared to the dispersion of the sum.

Slide 21: Normed Sums and Stable Laws

Definition. $X$ is said to have a stable law if whenever $X_1, X_2, \dots, X_k$ are independent with the same distribution as $X$, we have

$$X_1 + X_2 + \cdots + X_k \overset{D}{=} aX + b$$

for some constants $a > 0$ and $b$. Both the normal ($\sigma^2 < \infty$) and Cauchy ($\sigma^2 = \infty$) laws are stable.

Theorem (Limit of Normed Sums). Suppose that

$$\frac{S_n - A_n}{B_n} \overset{D}{\to} X, \qquad S_n = X_1 + X_2 + \cdots + X_n;$$

then $X$ has a stable law.
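Stability is easy to see empirically for the Cauchy law: a sum of $k$ independent standard Cauchys divided by $k$ is again standard Cauchy (so $a = k$, $b = 0$). A minimal sketch (my illustration) compares empirical quantiles:

```python
import numpy as np

rng = np.random.default_rng(2)
k, reps = 5, 200_000

# Sum of k standard Cauchys, divided by k, should again be standard Cauchy.
S = rng.standard_cauchy((reps, k)).sum(axis=1) / k
C = rng.standard_cauchy(reps)

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(S, qs))
print(np.quantile(C, qs))
# The two quantile rows should agree closely.
```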

Slide 22: Entropy

Definition (Discrete Entropy). Let $X$ be a discrete random variable taking values $x_i$ with probability $p_i$. The entropy of $X$ is

$$H(X) = -\sum_i p_i \log p_i.$$

- $H(X) > 0$ unless $X$ is constant.
- $H(aX + b) = H(X)$.
- If $X$ is the result of flipping a fair coin, $H(X) = 1$ (log base 2).
- If $X$ takes $n$ values, $H(X)$ is maximized when $p_i = \frac{1}{n}$.
- Extends to joint entropy $H(X, Y)$.
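A few of these facts in code (my sketch; entropy in bits, so log base 2):

```python
import numpy as np

def H(p):
    """Discrete entropy in bits; zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(H([0.5, 0.5]))             # fair coin: 1.0
print(H([1.0]))                  # constant: 0.0
print(H([0.25] * 4))             # uniform on 4 values: 2.0 = log2(4), the maximum
print(H([0.7, 0.1, 0.1, 0.1]))   # non-uniform on 4 values: strictly less than 2
```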

Slide 23: Entropy Axioms

- Symmetric in the $p_i$.
- Continuous in the $p_i$.
- Normalized: $H(X) = 1$ for a fair coin.
- $X$ and $Y$ independent gives $H(X, Y) = H(X) + H(Y)$.
- Decomposition: $H(r_1, \dots, r_m, q_1, \dots, q_n) = \alpha H(r_1, \dots, r_m) + (1 - \alpha) H(q_1, \dots, q_n)$.

These axioms imply $H(X) = -\sum_i p_i \log p_i$.

Slide 24: Entropy

Definition (Differential Entropy). Let $X$ be a continuous random variable with density $p$. The differential entropy of $X$ is

$$H(X) = -\int p(t) \log p(t)\, dt.$$

- $X$ uniform on $[0, c]$ gives $H(X) = \log c$.
- $X \sim N(0, \sigma^2)$ gives $H(X) = \frac{1}{2}\log(2\pi e \sigma^2)$.
- $H(aX + b) = H(X) + \log|a|$.
- $H(X)$ can be negative.
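Both closed forms are easy to confirm by numerical integration on a grid. A sketch (mine, with arbitrarily chosen $c$ and $\sigma$):

```python
import numpy as np

def diff_entropy(p, dx):
    """Approximate H = -int p log p on a uniform grid (p log p := 0 where p = 0)."""
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return -plogp.sum() * dx

# Uniform on [0, c]: H = log(c)
c = 3.0
x, dx = np.linspace(0, c, 300_000, retstep=True)
print(diff_entropy(np.full_like(x, 1 / c), dx), np.log(c))

# N(0, sigma^2): H = 0.5 * log(2 pi e sigma^2)
sigma = 2.0
x, dx = np.linspace(-10 * sigma, 10 * sigma, 400_000, retstep=True)
p = np.exp(-x ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(diff_entropy(p, dx), 0.5 * np.log(2 * np.pi * np.e * sigma ** 2))
```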

Slide 25: Entropy

Definition (Relative Entropy). Let $p$ and $q$ be densities. The relative entropy distance from $p$ to $q$ is

$$D(p \,\|\, q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx.$$

- If $\mathrm{supp}(p) \not\subseteq \mathrm{supp}(q)$, then $D(p \,\|\, q) = \infty$.
- $D$ is not a metric (not symmetric, and no triangle inequality).
- $D(p \,\|\, q) \ge 0$, with equality if and only if $p = q$ a.e.
- Convergence in $D$ is stronger than convergence in $L^1$.
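As a sanity check on the definition and its asymmetry, the sketch below (mine) computes $D(p \| q)$ and $D(q \| p)$ on a grid for two normal densities; the known closed form for normals, $\log(\sigma_q/\sigma_p) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$, is printed for comparison.

```python
import numpy as np

def kl(p, q, dx):
    """Grid approximation of D(p || q) = int p log(p/q)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x, dx = np.linspace(-30, 30, 600_000, retstep=True)
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 2.0)

print(kl(p, q, dx), kl(q, p, dx))  # not symmetric
# Closed form for D(N(0,1) || N(1,4)): log(2) + (1 + 1)/(2*4) - 1/2
print(np.log(2.0) + 2.0 / 8.0 - 0.5)
```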

Slide 26: Entropy

Lemma (Maximum Entropy). Let $p$ be the density of a random variable with variance $\sigma^2$, and let $\phi_{\sigma^2}$ be the density of a $N(0, \sigma^2)$ random variable. Then

$$H(p) \le H(\phi_{\sigma^2}) = \frac{\log(2\pi e \sigma^2)}{2},$$

with equality if and only if $p$ is a normal density.

Proof:

$$0 \le D(p \,\|\, \phi_{\sigma^2}) = \int p(x)\left(\log p(x) + \frac{\log(2\pi\sigma^2)}{2} + \frac{x^2}{2\sigma^2}\log e\right) dx = -H(p) + \frac{\log(2\pi e \sigma^2)}{2} = -H(p) + H(\phi_{\sigma^2}).$$
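The lemma can be spot-checked: among variance-1 densities, the uniform on $[-\sqrt{3}, \sqrt{3}]$ (which has variance 1) has entropy $\log(2\sqrt{3}) \approx 1.2425$, below the normal's $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$. In code (my sketch):

```python
import numpy as np

# Uniform on [-sqrt(3), sqrt(3)] has variance 1 and entropy log(2*sqrt(3)).
H_uniform = np.log(2 * np.sqrt(3))
H_normal = 0.5 * np.log(2 * np.pi * np.e)  # maximum entropy for variance 1
print(H_uniform, H_normal, H_uniform <= H_normal)
```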

Slide 27: Second Law of Thermodynamics

Let $\Omega$ be the number of microstates corresponding to a particular macrostate. Then the entropy of the macrostate is $S = k \log_e \Omega$.

$\Omega$ is composed of microstates $r$ of probability $p_r$. If there are $v$ copies of the system, about $v_r = v p_r$ are microstates of type $r$. Then

$$\Omega = \frac{v!}{v_1! v_2! \cdots v_k!} \approx \frac{v^v}{v_1^{v_1} v_2^{v_2} \cdots v_k^{v_k}}$$

$$S = k \log_e \Omega \approx k\left(v \log v - \sum_r v_r \log v_r\right) = -k v \sum_r p_r \log p_r$$

Entropy is maximized subject to the energy constraint at Gibbs states ($\sum p_r = 1$ and $\sum p_r E_r = E$).

Slide 28: Fisher Information

Definition. Let $Y$ be a random variable with density $g$ and variance $\sigma^2$. Set $\rho = g'/g$ and let $\phi$ be the normal density with the same mean and variance as $Y$.

- Fisher information: $I(Y) = E[(\rho(Y))^2]$.
- Standardized Fisher information: $J(Y) = \sigma^2 E[(\rho(Y) - \rho_\phi(Y))^2]$.

$Z$ with distribution $N(0, \sigma^2)$ minimizes Fisher information ($I(Z) = \frac{1}{\sigma^2}$) among all distributions with variance $\sigma^2$. (Note: $J(Z) = 0$.)
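For the normal itself the score is explicit, $\rho(z) = \phi'(z)/\phi(z) = -z/\sigma^2$, so $I(Z) = E[(Z/\sigma^2)^2] = 1/\sigma^2$. A Monte Carlo check (my sketch, with an arbitrary $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0

# For Z ~ N(0, sigma^2) the score is rho(z) = -z / sigma^2,
# so I(Z) = E[rho(Z)^2] should equal 1 / sigma^2 = 0.25.
Z = rng.normal(0.0, sigma, 1_000_000)
print(np.mean((Z / sigma ** 2) ** 2), 1 / sigma ** 2)
```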

Slide 29: Fisher Information Properties

Lemma (de Bruijn). If $Z$ is a normal random variable independent of $Y$ with the same mean and variance, then

$$D(Y \,\|\, Z) = \int_0^1 J\!\left(\sqrt{t}\, Y + \sqrt{1 - t}\, Z\right) \frac{1}{2t}\, dt.$$

The proof relies on these facts:
- The normal density satisfies the heat equation: $\frac{\partial \phi_\tau}{\partial \tau} = \frac{1}{2}\frac{\partial^2 \phi_\tau}{\partial x^2}$.
- Hence $Y + Z$ also satisfies the heat equation, and we can then calculate the derivative of $D(Y \,\|\, Z)$.
- If $J(Y) \to 0$ then $D(Y \,\|\, Z) \to 0$.

Slide 30: Fisher Information Properties

Lemma. If $U$ and $V$ are independent, then for $\beta \in [0, 1]$

$$J(U + V) \le \beta^2 J(U) + (1 - \beta^2) J(V)$$

$$J\!\left(\sqrt{\beta}\, U + \sqrt{1 - \beta}\, V\right) \le \beta J(U) + (1 - \beta) J(V)$$

with equality if and only if $U$ and $V$ are normal. In particular,

$$J(X + Y) \le J(X), \qquad H(X + Y) \ge H(X).$$

Slide 31: Fisher Information Properties

From the lemma,

$$J\!\left(\sqrt{\beta}\, U + \sqrt{1 - \beta}\, V\right) \le \beta J(U) + (1 - \beta) J(V).$$

In particular, with $S_n = \left(\sum_{i=1}^{n} X_i\right)/\sqrt{n}$ and $\beta = \frac{n}{n+m}$,

$$n J(S_n) + m J(S_m) \ge (m + n) J(S_{n+m}).$$

If $J(S_n) < \infty$ for some $n$, then $J(S_n)$ converges to 0. (Take $n = m$ and assume i.i.d. variables to see monotone convergence of a subsequence.)

Slide 32: Sketch of Central Limit Theorem Proof

- Assume i.i.d. random variables.
- $J(S_n)$ converges to 0.
- Hence $D(S_n \,\|\, Z)$ converges to 0.
- For densities $p_{S_n}$ and $\phi_{\sigma^2}$, $\left(\int |p_{S_n} - \phi_{\sigma^2}|\right)^2 \le 2 D(S_n \,\|\, Z)$ (Pinsker's inequality).
- So $S_n \to N(0, \sigma^2)$.
- Can generalize to non-i.i.d. variables.

Slide 33: Conclusions

- Assume errors are uniformly asymptotically negligible (no one error dominates).
- Assume independent errors with distributions that have second moments.
- Normalize the sum with the mean and variance.
- Then the limit is a standard normal distribution.
- The normal distribution is stable (infinitely divisible): $X = X_1 + X_2$.
- The normal distribution maximizes entropy.
- Convolution (adding random variables) increases entropy.
