A Brief Analysis of Central Limit Theorem
Omid Khanmohamadi (okhanmoh@math.fsu.edu), Diego Hernán Díaz Martínez (ddiazmar@math.fsu.edu), Tony Wills (twills@math.fsu.edu), Kouadio David Yao (kyao@math.fsu.edu)
SIAM Chapter, Florida State University
March 17, 2014
Outline
- Examples
- Statement of Theorem
- Modes of Convergence
- Fourier Transform and Convolution
- Outline of Proof
- Generalizations
From Concrete to Abstract: Examples then Theorems!
"You should start with understanding the interesting examples and build up to explain what the general phenomena are. This was your progress from initial understanding to more understanding." (Michael Atiyah) [image source: Wikipedia]
"The source of all great mathematics is the special case, the concrete example. It is frequent in mathematics that every instance of a concept of seemingly great generality is in essence the same as a small and concrete special case." (Paul Halmos, 1916-2006) [image source: Wikipedia]
Sum of Dice Throws is (Eventually) Normally Distributed
[Figure: probability mass functions p(k) for the sum of n fair 6-sided dice, for n = 1, 2, 3, 4, 5, showing convergence to a normal distribution as n increases. Image source: Wikipedia]
Dice Throws (Cont'd)
Roll a fair die $10^9$ times, with each roll independent of the others. Fair means the faces have equal probability, so the rolls are identically distributed.
Let $X_i$ be the number that comes up on the $i$-th die and let $S_{10^9} = \sum_{i=1}^{10^9} X_i$ be the total (sum) of the numbers rolled.
The probability that $S_{10^9}$ is less than $x$ standard deviations above its mean is (approximately)
$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$
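The claim can be checked by simulation. The sketch below is not from the slides; it is a minimal Python/NumPy illustration with far fewer dice ($10^3$ per trial rather than $10^9$) and hypothetical sample sizes, comparing the empirical probability with the Gaussian integral above.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF: (1/sqrt(2*pi)) * integral of exp(-t^2/2) up to x."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(0)
n_dice, n_trials = 1_000, 10_000               # illustrative sizes, not the 10^9 of the slide
rolls = rng.integers(1, 7, size=(n_trials, n_dice))
sums = rolls.sum(axis=1)

mu = 3.5 * n_dice                              # mean of one fair die is 3.5
sigma = sqrt(35.0 / 12.0 * n_dice)             # variance of one fair die is 35/12
for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(sums < mu + x * sigma)
    print(f"x = {x:+.1f}:  empirical {empirical:.4f}   normal_cdf(x) {normal_cdf(x):.4f}")
```

The two columns should already agree to within Monte Carlo error for a thousand dice.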
Definitions and Assumptions
Let $X_1, X_2, \ldots, X_n$ be a sequence of i.i.d. random variables, each with mean $\mu = 0$ and variance $\sigma^2 = 1$. Let $S_n = \sum_{i=1}^{n} X_i$. Any other finite $\mu$ and $\sigma^2$ may be reduced to this case.
$E\!\left[\frac{S_n}{\sqrt{n}}\right] = \frac{1}{\sqrt{n}} E[S_n] = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} E[X_i] = 0.$ The mean ($E$) is a linear function.
$\mathrm{Var}\!\left[\frac{S_n}{\sqrt{n}}\right] = \left(\frac{1}{\sqrt{n}}\right)^2 \mathrm{Var}[S_n] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{1}{n} \cdot n = 1.$ Var is not a linear function; it distributes over sums (when the random variables are independent) and it squares scalar multipliers.
Definitions and Assumptions (cont'd)
The Central Limit Theorem is a statement about the so-called normalized sum, defined as $\frac{S_n - n\mu}{\sqrt{n}\,\sigma}$, which in our case is $\frac{S_n}{\sqrt{n}}$.
The normalized sum is the difference between the sum $S_n$ and its expected value $n\mu$, measured relative to (in units of) the standard deviation $\sqrt{n}\,\sigma$; it measures how many standard deviations the sum is from its expected value.
Statement of Central Limit Theorem
With the assumptions of the previous slide, we have
$\Pr\!\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt \quad \text{as } n \to \infty.$
Convergence is in distribution. Convergence is not in probability or almost surely. Convergence is not uniform: tails of the distribution converge more slowly than its center.
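The slower convergence in the tails can be seen numerically. The following sketch is my own illustration, not part of the slides: it uses centered exponential terms (mean 0, variance 1) so that $S_n/\sqrt{n}$ is the normalized sum, and compares empirical tail probabilities with the Gaussian ones; the relative error is much larger at $t = 3$ than near the center.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(1)
n, trials = 30, 200_000
x = rng.exponential(1.0, size=(trials, n)) - 1.0   # Exp(1) shifted to mean 0, variance 1
z = x.sum(axis=1) / sqrt(n)                        # normalized sum S_n / sqrt(n)

for t in (0.0, 1.0, 2.0, 3.0):
    print(f"P(Z > {t}):  empirical {np.mean(z > t):.5f}   Gaussian {1.0 - normal_cdf(t):.5f}")
```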
Convergence in Distribution
The Central Limit Theorem is expressed in terms of convergence in distribution, which is defined as follows.
Definition (Convergence in Distribution). A sequence of random variables $X_1, \ldots, X_n, \ldots$ converges in distribution to $X$ if
$F_{X_n}(x) \to F_X(x) \quad \text{as } n \to \infty$
at all points $x$ where $F_X$ is continuous, where $F_X$ denotes the distribution function of the random variable $X$, given by $F_X(x) := \Pr(X \le x)$.
Characteristic Function and its Relation to Convergence in Distribution
The characteristic function of a real-valued random variable completely determines its probability distribution.
Definition (Characteristic function). Let $F_X$ be the distribution function of the random variable $X$. The characteristic function of $X$ is the function $\varphi_X$ given by
$\varphi_X(\xi) = E[e^{i\xi X}] = \int_{-\infty}^{\infty} e^{i\xi x}\, dF_X(x) = \int_{-\infty}^{\infty} f_X(x)\, e^{i\xi x}\, dx,$
where $f_X$ is the density function of $X$ (if it exists). Notice the relation to the Fourier transform when the density $f_X$ exists.
Convergence in distribution and convergence of the characteristic functions are equivalent.
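As a small sanity check (my own, not from the slides), a characteristic function can be estimated by Monte Carlo. For a Uniform$(-\sqrt{3}, \sqrt{3})$ variable (mean 0, variance 1) the characteristic function has the closed form $\sin(\sqrt{3}\,\xi)/(\sqrt{3}\,\xi)$, and the sample average of $e^{i\xi X}$ reproduces it:

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.sqrt(3.0)                          # Uniform(-a, a) has mean 0 and variance 1
x = rng.uniform(-a, a, size=500_000)

for xi in (0.5, 1.0, 2.0):
    mc = np.mean(np.exp(1j * xi * x))     # Monte Carlo estimate of E[exp(i*xi*X)]
    exact = np.sin(a * xi) / (a * xi)     # closed-form characteristic function of Uniform(-a, a)
    print(f"xi = {xi}:  Monte Carlo {mc.real:+.4f}   exact {exact:+.4f}")
```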
Fourier Transform Pair
The convention we will be using is that the (1-dimensional) Fourier transform of a function $f(x)$ is
$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-i\xi x}\, dx$
and the inverse Fourier transform of a function $\hat{f}(\xi)$ is
$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{i\xi x}\, d\xi.$
Convolution
If $f$ and $g$ are integrable functions, we define the convolution $f * g$ by
$(f * g)(x) = \int_{-\infty}^{\infty} f(x - y)\, g(y)\, dy.$
Convolution is sometimes also known by its German name, Faltung ("folding"). Later, in the proof section, we will see the $n$-fold convolution, which means convolution repeated $n$ times.
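A concrete instance (an illustration of mine, not from the slides): convolving the Uniform$(0,1)$ density with itself on a grid gives the triangular density of the sum of two such variables, peaking at height about 1 near $x = 1$.

```python
import numpy as np

dx = 0.01
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)                      # density of Uniform(0, 1) on a grid

g = np.convolve(f, f) * dx               # discretized (f * f); the dx factor approximates the integral
support = np.arange(len(g)) * dx         # the sum of two Uniform(0, 1) variables lives on [0, 2]

print(support[np.argmax(g)], g.max())    # peak near x = 1 with height near 1: the triangular density
```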
Basic Properties of Fourier Transform
There are a few basic properties of the Fourier transform that we will need. In particular, we need to know what the Fourier transform does to scaling, to a Gaussian, and to convolution.
Scaling: for a non-zero real number $\alpha$, if $g(x) = f(\alpha x)$, then $\hat{g}(\xi) = \frac{1}{|\alpha|}\, \hat{f}\!\left(\frac{\xi}{\alpha}\right)$.
Gaussian: if $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, then $\hat{f}(\xi) = \sqrt{2\pi}\, f(\xi)$; the standard Gaussian is, up to a constant, its own transform.
Convolution: under the Fourier transform, convolution becomes multiplication: $\widehat{f * g}(\xi) = \hat{f}(\xi)\, \hat{g}(\xi)$.
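The convolution property is easy to verify numerically for discrete sequences (a sketch of mine using NumPy's FFT, with zero-padding so that the circular convolution matches the full linear one):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.random(64)
g = rng.random(64)

n = len(f) + len(g) - 1                          # length of the full linear convolution
lhs = np.fft.fft(np.convolve(f, g), n)           # transform of the convolution
rhs = np.fft.fft(f, n) * np.fft.fft(g, n)        # product of the (zero-padded) transforms
print(np.allclose(lhs, rhs))                     # True: convolution becomes multiplication
```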
Overview, View, Review!
"Tell them what you're going to tell them, tell them, and tell them what you told them." (Paul Halmos, 1916-2006) [image source: Wikipedia]
An Overview of the Outline of the Proof
Our goal is to outline the steps in showing
$\Pr\!\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt.$
1. Write the density of the sum $S_n$ in terms of the density of its i.i.d. terms $X_i$ (by using an $n$-fold convolution) to go from $f$ to $f_{S_n}$.
2. Find the effect of scaling on a density (by using a substitution in the integral) to go from $f_{S_n}$ to $f_{S_n/\sqrt{n}}$.
3. Use the scaling results for the Fourier transform and the density, as well as the convolution property, to go from $f_{S_n/\sqrt{n}}$ to $\hat{f}_{S_n/\sqrt{n}}$.
4. Expand $\hat{f}$ around zero to find a useful converging expression.
5. Rewrite that converging expression for $\hat{f}_{S_n/\sqrt{n}}$ to get convergence to a Gaussian.
6. Take the inverse Fourier transform to arrive at the standard Gaussian density.
Step 1: From $f$ to $f_{S_n}$: $n$-fold Convolution
We show the result for two i.i.d. variables, $X_1$ and $X_2$, with identical distributions $F_{X_1} = F_{X_2} =: F$ and densities $f_{X_1} = f_{X_2} =: f$.
$f_{X_1+X_2}(a) = \frac{d}{da} F_{X_1+X_2}(a) = \frac{d}{da} \Pr\{X_1 + X_2 \le a\}.$
$F_{X_1+X_2}(a)$ is given by the integral of $f_{X_1}(x_1)\, f_{X_2}(x_2) = f(x_1)\, f(x_2)$ over $\{(x_1, x_2) : x_1 + x_2 \le a\}$:
$F_{X_1+X_2}(a) = \Pr\{X_1 + X_2 \le a\} = \int_{-\infty}^{\infty} \int_{-\infty}^{a - x_2} f(x_1)\, f(x_2)\, dx_1\, dx_2 = \int_{-\infty}^{\infty} F(a - x)\, f(x)\, dx.$
Differentiation gives
$f_{X_1+X_2}(a) = \frac{d}{da} \int_{-\infty}^{\infty} F(a - x)\, f(x)\, dx = \int_{-\infty}^{\infty} f(a - x)\, f(x)\, dx = (f * f)(a).$
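Step 1 can be checked numerically (my own sketch, not from the slides): for two independent Exp(1) variables, the convolution of the density with itself should match a histogram of simulated sums.

```python
import numpy as np

rng = np.random.default_rng(4)
dx = 0.02
t = np.arange(0.0, 12.0, dx)
f = np.exp(-t)                               # density of Exp(1) on a grid

conv = np.convolve(f, f)[: len(t)] * dx      # Step 1 prediction: density of X1 + X2 is f * f

samples = rng.exponential(1.0, size=(200_000, 2)).sum(axis=1)
hist, _ = np.histogram(samples, bins=t, density=True)

print(np.max(np.abs(conv[:-1] - hist)))      # small: the convolution matches the simulated density
```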
Step 2: From $f_{S_n}$ to $f_{S_n/\sqrt{n}}$: Effect of Scaling on the Density
The Central Limit Theorem involves the probability $\Pr\!\left(a \le \frac{S_n}{\sqrt{n}} \le b\right)$.
Notice that if the density of $S_n$ is $f_{S_n}(t)$, then
$\Pr\!\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) = \Pr\!\left(a\sqrt{n} \le S_n \le b\sqrt{n}\right) = \int_{a\sqrt{n}}^{b\sqrt{n}} f_{S_n}(t)\, dt = \int_a^b \sqrt{n}\, f_{S_n}(\sqrt{n}\, s)\, ds$
by making the substitution $s = t/\sqrt{n}$. This shows that the density of $S_n/\sqrt{n}$ is $\sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$.
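A quick numerical check of the scaling rule (mine, with an arbitrary choice of Exp(1) terms and $n = 5$, so that $f_{S_n}$ is the Gamma$(n, 1)$ density): a histogram of $S_n/\sqrt{n}$ should match $\sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$.

```python
import numpy as np
from math import factorial, sqrt

rng = np.random.default_rng(5)
n = 5
f_Sn = lambda t: t ** (n - 1) * np.exp(-t) / factorial(n - 1)   # Gamma(n, 1) density of S_n

samples = rng.exponential(1.0, size=(400_000, n)).sum(axis=1) / sqrt(n)
hist, edges = np.histogram(samples, bins=80, range=(0.0, 6.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

predicted = sqrt(n) * f_Sn(sqrt(n) * centers)    # claimed density of S_n / sqrt(n)
print(np.max(np.abs(hist - predicted)))          # small: the histogram matches the rescaled density
```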
Step 3: From $f_{S_n/\sqrt{n}}$ to $\hat{f}_{S_n/\sqrt{n}}$
Now we have everything we need to get from the density $f$ of a sequence of i.i.d. random variables to the transform $\hat{f}_{S_n/\sqrt{n}}(\xi)$ of the corresponding normalized sum $S_n/\sqrt{n}$:
$f_{S_n}(t) = (f * \cdots * f)(t)$, so $\hat{f}_{S_n}(\xi) = \widehat{f * \cdots * f}(\xi) = \big(\hat{f}(\xi)\big)^n.$
$f_{S_n/\sqrt{n}}(t) = \sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$, so
$\hat{f}_{S_n/\sqrt{n}}(\xi) = \widehat{\sqrt{n}\, f_{S_n}(\sqrt{n}\, \cdot)}(\xi) = \sqrt{n} \cdot \frac{1}{\sqrt{n}}\, \hat{f}_{S_n}\!\left(\frac{\xi}{\sqrt{n}}\right) = \hat{f}_{S_n}\!\left(\frac{\xi}{\sqrt{n}}\right) = \left(\hat{f}\!\left(\frac{\xi}{\sqrt{n}}\right)\right)^{\!n}.$
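This identity can also be checked by simulation (my own sketch, with Uniform$(-\sqrt{3},\sqrt{3})$ terms): the sample average of $e^{i\xi S_n/\sqrt{n}}$ should agree with $\big(\hat{f}(\xi/\sqrt{n})\big)^n$ up to Monte Carlo error (here $\hat f$ coincides with the characteristic function because the density is even).

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(6)
n, trials = 10, 300_000
a = sqrt(3.0)
x = rng.uniform(-a, a, size=(trials, n))      # i.i.d. mean-0, variance-1 terms
z = x.sum(axis=1) / sqrt(n)                   # normalized sum S_n / sqrt(n)

xi = 1.5
lhs = np.mean(np.exp(1j * xi * z))                         # transform of S_n/sqrt(n), estimated by Monte Carlo
fhat = np.sin(a * xi / sqrt(n)) / (a * xi / sqrt(n))       # transform of one term, evaluated at xi/sqrt(n)
print(lhs.real, fhat ** n)                                 # the two values agree up to Monte Carlo error
```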
Step 4: Taylor Expansion of $\hat{f}$ at 0
The Fourier transform of the density $f$ (identical for all $X_i$) is
$\hat{f}(\xi) = \int_{-\infty}^{\infty} e^{-i\xi x} f(x)\, dx.$
Differentiation under the integral sign can be done, so the Taylor expansion is
$\hat{f}(\xi) = \hat{f}(0) + \hat{f}'(0)\, \xi + \hat{f}''(0)\, \frac{\xi^2}{2} + \epsilon(\xi)\, \xi^2 \quad \text{as } \xi \to 0, \text{ in which limit } \epsilon(\xi) \to 0 \text{ also}.$
Observe that
$\hat{f}(0) = \int f(x)\, dx = 1,$
$\hat{f}'(0) = -i \int x\, f(x)\, dx = 0 \quad (\text{mean } 0),$
$\hat{f}''(0) = -\int x^2 f(x)\, dx = -1 \quad (\text{variance } 1).$
Taylor Expansion of $\hat{f}$ at 0 (cont'd)
So
$\hat{f}(\xi) = 1 - \frac{\xi^2}{2} + \epsilon(\xi)\, \xi^2 \quad \text{as } \xi \to 0,$
which is the same as
$\hat{f}(\xi) - \left(1 - \frac{\xi^2}{2}\right) = \epsilon(\xi)\, \xi^2 \to 0 \quad \text{as } \xi \to 0.$
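For the Uniform$(-\sqrt{3},\sqrt{3})$ example used earlier, $\hat{f}(\xi) = \sin(\sqrt{3}\,\xi)/(\sqrt{3}\,\xi)$, and one can watch the remainder vanish faster than $\xi^2$ (a small check of mine, not from the slides):

```python
import numpy as np

a = np.sqrt(3.0)                                  # Uniform(-a, a): mean 0, variance 1
fhat = lambda xi: np.sin(a * xi) / (a * xi)       # its transform / characteristic function

for xi in (0.5, 0.1, 0.01):
    remainder = fhat(xi) - (1.0 - xi ** 2 / 2.0)
    print(xi, remainder, remainder / xi ** 2)     # remainder / xi^2 -> 0 as xi -> 0
```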
Step 5: Convergence of $\hat{f}_{S_n/\sqrt{n}}(\xi)$ to $e^{-\xi^2/2}$
Hoping that we may get a similar convergence result for $\hat{f}_{S_n/\sqrt{n}}$, we write
$\big(\hat{f}\big)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} = \left[\hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)\right] \sum_{k=0}^{n-1} \big(\hat{f}\big)^k(\xi/\sqrt{n}) \left(1 - \frac{\xi^2}{2n}\right)^{\!n-k-1},$
so
$\left|\big(\hat{f}\big)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n}\right| \le \left|\hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)\right| \sum_{k=0}^{n-1} \left|\hat{f}(\xi/\sqrt{n})\right|^k \left|1 - \frac{\xi^2}{2n}\right|^{n-k-1}.$
Convergence of $\hat{f}_{S_n/\sqrt{n}}(\xi)$ to $e^{-\xi^2/2}$ (cont'd)
Since $|\hat{f}(\xi)| \le \|\hat{f}\|_{L^\infty} \le \|f\|_{L^1} = 1$ and $\left|1 - \frac{\xi^2}{2n}\right| \le 1$ for $n$ large enough, we have
$\left|\big(\hat{f}\big)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n}\right| \le n \left|\hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)\right|.$
As $n \to \infty$, $\xi/\sqrt{n} \to 0$, and by Step 4 the right-hand side equals $n \cdot |\epsilon(\xi/\sqrt{n})|\, \frac{\xi^2}{n} = |\epsilon(\xi/\sqrt{n})|\, \xi^2 \to 0$, so
$\big(\hat{f}\big)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \to 0 \quad \text{as } n \to \infty.$
Since $\left(1 - \frac{\xi^2}{2n}\right)^{\!n} \to e^{-\xi^2/2}$, we conclude
$\hat{f}_{S_n/\sqrt{n}}(\xi) = \big(\hat{f}\big)^n(\xi/\sqrt{n}) \to e^{-\xi^2/2}.$
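The convergence is easy to see numerically for the uniform example (again my own illustration, not from the slides): $\big(\hat{f}(\xi/\sqrt{n})\big)^n$ approaches $e^{-\xi^2/2}$ as $n$ grows.

```python
import numpy as np

a = np.sqrt(3.0)                                  # Uniform(-a, a): mean 0, variance 1
fhat = lambda xi: np.sin(a * xi) / (a * xi)

xi = 2.0
target = np.exp(-xi ** 2 / 2.0)
for n in (10, 100, 1000, 10000):
    # Transform of the normalized sum S_n / sqrt(n), evaluated at xi:
    print(n, fhat(xi / np.sqrt(n)) ** n, target)
```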
Step 6: Convergence of $f_{S_n/\sqrt{n}}(x)$ to $e^{-x^2/2}/\sqrt{2\pi}$: Inverse Fourier Transform
Taking the inverse Fourier transform we obtain
$f_{S_n/\sqrt{n}}(x) \to \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \quad \text{as } n \to \infty,$
which is the conclusion of the Central Limit Theorem!
Observe that this is pointwise convergence of the densities, which in turn gives convergence in distribution.
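As a final check (mine, not from the slides), one can invert the limiting transform $e^{-\xi^2/2}$ by numerical quadrature of the inversion formula from the Fourier Transform Pair slide and recover the standard Gaussian density:

```python
import numpy as np

xi = np.linspace(-20.0, 20.0, 4001)
ghat = np.exp(-xi ** 2 / 2.0)                 # the limiting transform

for x in (0.0, 1.0, 2.0):
    # (1 / 2*pi) * integral of ghat(xi) * exp(i*xi*x) d(xi), by the trapezoidal rule:
    inv = np.trapz(ghat * np.exp(1j * xi * x), xi).real / (2.0 * np.pi)
    exact = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    print(x, inv, exact)                      # the two columns agree
```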
Directions for Generalization
Three more general versions of the CLT will be discussed:
- Lyapunov's CLT, which drops the hypothesis of identical distribution in exchange for a stronger moment requirement (the Lyapunov condition).
- Lindeberg's CLT, which weakens the Lyapunov condition while keeping the same weak requirements on the distributions of the random variables.
- The multivariate CLT, which uses the covariance matrix of the random vectors for the generalization.
Lyapunov's CLT
Suppose $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of independent random variables (not necessarily identically distributed), each with finite expected value $\mu_i$ and variance $\sigma_i^2$. Let $s_n^2 = \sum_{i=1}^{n} \sigma_i^2$. If, for some $\delta > 0$, the following condition (called the Lyapunov condition) holds:
$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E\!\left[\,|X_i - \mu_i|^{2+\delta}\right] = 0,$
then $\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$ converges in distribution to a standard normal random variable as $n \to \infty$.
Lindeberg's CLT
Suppose $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of independent random variables (not necessarily identically distributed), each with finite expected value $\mu_i$ and variance $\sigma_i^2$. Let $s_n^2 = \sum_{i=1}^{n} \sigma_i^2$. If, for every $\epsilon > 0$, the following condition (called the Lindeberg condition) holds:
$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{i=1}^{n} E\!\left[(X_i - \mu_i)^2\, \mathbf{1}_{\{|X_i - \mu_i| > \epsilon s_n\}}\right] = 0,$
then $\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$ converges in distribution to a standard normal random variable as $n \to \infty$.
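A simulation sketch of mine for the non-identically-distributed case (not from the slides): the terms below are independent uniform variables with different, bounded spreads, so the Lindeberg condition holds (for any $\epsilon > 0$, eventually $\epsilon s_n$ exceeds the uniform bound and the indicator vanishes), and the $s_n$-normalized sum is close to standard normal.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(7)
n, trials = 500, 20_000
scales = 1.0 + 0.5 * np.sin(np.arange(1, n + 1))           # different spread for each term
x = rng.uniform(-1.0, 1.0, size=(trials, n)) * scales      # X_i ~ Uniform(-scales[i], scales[i]), mean 0
s_n = sqrt(np.sum(scales ** 2 / 3.0))                       # Var of Uniform(-a, a) is a^2 / 3
z = x.sum(axis=1) / s_n                                     # normalized sum

for t in (-1.0, 0.0, 1.0, 2.0):
    print(t, np.mean(z <= t), normal_cdf(t))                # empirical CDF vs standard normal CDF
```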
Comparison of Finite Variance Conditions
Lindeberg: $\displaystyle\int_{|x - \mu_i| > \epsilon s_n} (x - \mu_i)^2\, dF_i(x) < \infty$
Classical: $\displaystyle\int_{\mathbb{R}} (x - \mu_i)^2\, dF_i(x) < \infty$
Lyapunov: $\displaystyle\int_{\mathbb{R}} |x - \mu_i|^{2+\delta}\, dF_i(x) < \infty$
Observe that, in the classical CLT, $\mu_i = \mu$ and $f_i(x) = f(x)$ for all $i$.
Generalizations in a Nutshell: CLT is Robust
If one has a lot of small random terms which are mostly independent and each contributes a small fraction of the total sum, then the total sum must be approximately normally distributed.
Multivariate CLT
Suppose $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of i.i.d. random vectors in $\mathbb{R}^d$, with finite mean vector $E[X_i] = \mu$ and finite covariance matrix $\Sigma$. Then
$\frac{1}{\sqrt{n}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \to N_d(0, \Sigma)$
in distribution as $n \to \infty$, where $N_d(0, \Sigma)$ is the multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma$.
Note: addition of the vectors is done componentwise.
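A simulation of mine (with arbitrary choices of $d = 2$, a skewed base distribution, and a target $\Sigma$, none of which come from the slides): the normalized vector sum has mean close to the zero vector and sample covariance close to $\Sigma$, as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, trials = 2, 200, 20_000
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])                              # target covariance Sigma
chol = np.linalg.cholesky(cov)
mu = np.array([1.0, -2.0])

base = rng.exponential(1.0, size=(trials, n, d)) - 1.0    # mean-0, variance-1, skewed coordinates
x = base @ chol.T + mu                                    # i.i.d. vectors with mean mu and covariance Sigma

z = (x.sum(axis=1) - n * mu) / np.sqrt(n)                 # normalized vector sum

print(np.round(z.mean(axis=0), 2))                        # close to the zero vector
print(np.round(np.cov(z.T), 2))                           # close to Sigma
```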
Thank you for your attention!
[Figure: Laplace]
Outline
- More Details
Almost Sure Convergence and Convergence in Probability
Because of their relationship to convergence in distribution, it is useful to review almost sure convergence and convergence in probability. Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of random variables defined on the probability space $(\Omega, \mathcal{F}, P)$.
Almost sure convergence (strong convergence): $X_1, X_2, \ldots, X_n, \ldots$ converges almost surely to a random variable $X$ if, for every $\varepsilon > 0$,
$P\!\left(\lim_{n \to \infty} |X_n - X| < \varepsilon\right) = 1.$
Convergence in probability (weak convergence): $X_1, X_2, \ldots, X_n, \ldots$ converges in probability to $X$ if, for every $\varepsilon > 0$,
$\lim_{n \to \infty} P(|X_n - X| < \varepsilon) = 1, \quad \text{or equivalently} \quad \lim_{n \to \infty} P(|X_n - X| \ge \varepsilon) = 0.$
Notable Relationship between Convergence Concepts
Almost sure convergence $\implies$ convergence in probability $\implies$ convergence in distribution.