INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS

The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n → ∞ of the fraction of n repeated, similar, and independent trials in which E occurs. Similarly the expectation of a random variable X is taken to be its asymptotic average, the limit as n → ∞ of the average of n repeated, similar, and independent replications of X. As statisticians trying to make inference about the underlying probability distribution f(x|θ) governing observed random variables X_i, this suggests that we should be interested in the probability distribution for large n of quantities like the average of the RVs, X̄_n = (1/n) Σ_{i=1}^n X_i.

Three of the most celebrated theorems of probability theory concern this sum. For independent random variables X_i, all with the same probability distribution satisfying E|X_i|³ < ∞, set µ = EX_i, σ² = E|X_i − µ|², and S_n = Σ_{i=1}^n X_i. The three main results are:

   Law of Large Numbers:          (S_n − nµ)/n → 0   (i.p. and a.s.)
   Central Limit Theorem:         (S_n − nµ)/(σ√n) ⇒ N(0,1)   (i.d.)
   Law of the Iterated Logarithm: lim sup ±(S_n − nµ)/(σ√(2n log log n)) = 1   (a.s.)

Together these three give a clear picture of how quickly and in what sense (1/n)S_n tends to µ. We begin with the Law of Large Numbers (LLN), in its weak form (asserting convergence i.p.) and in its strong form (convergence a.s.). There are several versions of both theorems. The simplest requires the X_i to be IID and L²; stronger results allow us to weaken (but not eliminate) the independence requirement, permit non-identical distributions, and consider what happens if the RVs are only L¹ (or worse!) instead of L². The text covers these things well; to complement it I am going to: (1) prove the simplest version, and with it the Borel-Cantelli theorems; and (2) show what happens with Cauchy random variables, which don't satisfy the requirements (the LLN fails).
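A quick simulation makes the first two results concrete. The sketch below is illustrative only (the Uniform(0,1) distribution, sample size, and replication count are arbitrary choices): it checks that S_n/n concentrates at µ, while the standardized sums (S_n − nµ)/(σ√n) keep mean near 0 and variance near 1 as the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# IID Uniform(0,1) trials: mu = 1/2, sigma^2 = 1/12.
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)
reps, n = 500, 10_000
X = rng.uniform(size=(reps, n))       # 500 independent replications of n trials
S_n = X.sum(axis=1)

# LLN: S_n/n lands close to mu in every replication.
max_err = np.abs(S_n / n - mu).max()

# CLT: the standardized sums behave like N(0,1) draws.
Z = (S_n - n * mu) / (sigma * np.sqrt(n))
print(max_err, Z.mean(), Z.var())
```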
I. Weak version, non-iid, L²: µ_i = EX_i, σ_ij = E[X_i − µ_i][X_j − µ_j]
   A. Y_n = (S_n − Σµ_i)/n satisfies EY_n = 0, EY_n² = (1/n²) Σ_{i≤n} σ_ii + (2/n²) Σ_{i<j≤n} σ_ij;
      1. If σ_ii ≤ M and either σ_ij ≤ 0 or |σ_ij| < M r^{|i−j|}, r < 1, Chebychev ⇒ Y_n → 0, i.p.
      2. (pairwise) IID L² is OK
II. Strong version, non-iid, L²: EX_i = 0, EX_i² ≤ M, EX_iX_j ≤ 0.
   A. P[|S_n| > nǫ] ≤ Mn/(nǫ)² = M/(nǫ²)
      1. P[|S_{n²}| > n²ǫ] ≤ M/(n²ǫ²), so Σ_n P[|S_{n²}| > n²ǫ] ≤ Mπ²/(6ǫ²) < ∞.
      2. Borel-Cantelli: P[|S_{n²}| > n²ǫ i.o.] = 0, so (1/n²)S_{n²} → 0 a.s.
      3. D_n = max_{n²≤k<(n+1)²} |S_k − S_{n²}|; ED_n² ≤ 2n E|S_{(n+1)²} − S_{n²}|² ≤ 4n²M
      4. Chebychev: P[D_n > n²ǫ] ≤ 4n²M/(n⁴ǫ²), so D_n/n² → 0 a.s.
   B. For n² ≤ k < (n+1)², |S_k|/k ≤ (|S_{n²}| + D_n)/n² → 0 a.s., QED
      1. Applications: Bernoulli RVs, normal number theorem, Monte Carlo integration.
III. Weak version, pairwise-iid, L¹
   A. Equivalent sequences: Σ_n P[X_n ≠ Y_n] < ∞
      1. Then Σ_n 1[X_n ≠ Y_n] < ∞ a.s.
      2. Σ_{i≤n} X_i and (1/a_n) Σ_{i≤n} X_i converge iff Σ_{i≤n} Y_i and (1/a_n) Σ_{i≤n} Y_i both converge
      3. Truncation: Y_n = X_n 1[|X_n| ≤ n]
IV. Counterexamples: Cauchy and worse
   A. X_i ~ dx/(π[1+x²]): P[|S_n/n| ≤ ǫ] = (2/π) tan⁻¹(ǫ) ↛ 1, so the WLLN fails.
   B. P[X_i = ±n] = c/n², n ≥ 1: X_i ∉ L¹, and S_n/n ↛ 0 i.p. or a.s.
   C. P[X_i = ±n] = c/(n² log n), n > 1: X_i ∉ L¹, but S_n/n → 0 i.p. and not a.s.
   D. Medians: for ANY RVs, if X_n → X i.p., then the medians m_n → m, provided the median m of X is unique.
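Counterexample IV.A shows up vividly in simulation: under any weak LLN the spread of S_n/n would shrink as n grows, but for Cauchy draws it does not. A minimal sketch (the sample sizes and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Interquartile range of S_n/n for iid standard Cauchy draws.  Under a WLLN
# the IQR would shrink as n grows; here S_n/n is again standard Cauchy, so
# the IQR stays near its n = 1 value of 2 (quartiles at +/- tan(pi/4) = +/- 1).
iqrs = []
for n in (1, 10, 200):
    means = rng.standard_cauchy(size=(10_000, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    iqrs.append(q3 - q1)
print(iqrs)
```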
Let X_i be iid standard Cauchy RVs, with CDF

   P[X_1 ≤ t] = ∫_{−∞}^t dx/(π[1 + x²]) = 1/2 + (1/π) arctan(t)

and characteristic function

   E e^{iωX_1} = ∫ e^{iωx} dx/(π[1 + x²]) = e^{−|ω|},

so S_n/n has characteristic function

   E e^{iωS_n/n} = E e^{i(ω/n)[X_1 + ... + X_n]} = (E e^{i(ω/n)X_1})^n = (e^{−|ω|/n})^n = e^{−|ω|},

and S_n/n also has the standard Cauchy distribution with P[S_n/n ≤ t] = 1/2 + (1/π) arctan(t); in particular, S_n/n does not converge almost surely, or even in probability.

A LAW OF LARGE NUMBERS FOR CORRELATED SEQUENCES

In many applications we would like a Law of Large Numbers for sequences of random variables that are not independent; for example, in Markov Chain Monte Carlo integration, we have a stationary Markov chain {X_t} (this means that the distribution of X_t is the same for all t and that the conditional distribution of X_u for u > t, given {X_s : s ≤ t}, depends only on X_t) and want to estimate the population mean E[φ(X_t)] for some function φ(·) by the sample mean

   E[φ(X_t)] ≈ (1/T) Σ_{t=1}^T φ(X_t).

Even though they are identically distributed, the random variables Y_t ≡ φ(X_t) won't be independent if the X_t aren't independent, so the LLN we already have doesn't quite apply.

A sequence of random variables Y_t is called stationary if each Y_t has the same probability distribution and, moreover, each finite set (Y_{t_1+h}, Y_{t_2+h}, ..., Y_{t_k+h}) has a joint distribution that doesn't depend on h. The sequence is called L² if each Y_t has a finite variance σ² (and hence also a well-defined mean µ); by stationarity it also follows that the covariance γ_st = E[(Y_s − µ)(Y_t − µ)] is finite and depends only on the absolute difference |t − s|.

Theorem. If a stationary L² sequence has a summable covariance, i.e., satisfies Σ_{t=−∞}^∞ |γ_st| ≤ c < ∞, then

   E[Y_t] = lim_{T→∞} (1/T) Σ_{t=1}^T Y_t.

Proof. Let S_T be the sum of the first T Y_t's and set (as usual) Ȳ_T ≡ S_T/T. The variance of S_T is

   E[(S_T − Tµ)²] = Σ_{s=1}^T Σ_{t=1}^T E[(Y_s − µ)(Y_t − µ)] = Σ_{s=1}^T Σ_{t=1}^T γ_st ≤ Tc,
so Ȳ_T has variance V[Ȳ_T] ≤ c/T, and by Chebychev's inequality

   P[|Ȳ_T − µ| > ǫ] ≤ E[(Ȳ_T − µ)²]/ǫ² = E[(S_T − Tµ)²]/(T²ǫ²) ≤ Tc/(T²ǫ²) = c/(Tǫ²) → 0 as T → ∞.

A strong LLN follows with a bit more work, just as for iid random variables.

Examples

1. IID: If X_t are independent and identically distributed, and if Y_t = φ(X_t) has finite variance σ², then Y_t has a well-defined finite mean µ and Ȳ_T → µ. Here γ_st = σ² if s = t and γ_st = 0 if s ≠ t, so c = σ² < ∞.

2. AR(1): If Z_t are iid N(0,1) for −∞ < t < ∞, µ ∈ R, σ > 0, −1 < ρ < 1, and

   X_t ≡ µ + σ Σ_{s=0}^∞ ρ^s Z_{t−s} = ρ X_{t−1} + α + σ Z_t,   (*)

where α = (1 − ρ)µ, then the X_t are identically distributed (all with the N(µ, σ²/(1−ρ²)) distribution) but not independent (since γ_st = σ² ρ^{|s−t|}/(1−ρ²) ≠ 0); this is called an autoregressive process (because of equation (*), expressing X_t as a regression on previous X_s's) of order one (because only one earlier X_s appears in (*)), and is about the simplest non-iid sequence occurring in applications. Since the covariance is summable,

   Σ_{t=−∞}^∞ γ_st = [σ²/(1−ρ²)] · (1+ρ)/(1−ρ) = σ²/(1−ρ)² < ∞,

we again have X̄_T → µ as T → ∞.

3. Geometric Ergodicity: If for some 0 < ρ < 1 and c > 0 we have |γ_st| ≤ c ρ^{|s−t|} for a Markov chain Y_t, the chain is called geometrically ergodic (because c ρ^t is a geometric sequence), and the same argument as for AR(1) shows that Ȳ_T converges; Meyn & Tweedie (1993), Tierney (1994), and others have given conditions for MCMC chains to be geometrically ergodic, and hence for the almost-sure convergence of sample averages to population means.

4. General Ergodicity: Consider the three sequences of random variables on (Ω, F, P) with Ω = (0,1] and F = B(Ω), each with X_0(ω) = ω:
   1. X_{n+1} ≡ 2X_n (mod 1);
   2. X_{n+1} ≡ X_n + α (mod 1) (Does it matter if α is rational?);
   3. X_{n+1} ≡ 4X_n(1 − X_n).
For each, find a probability measure P (equivalently, find a distribution for X_0) such that the X_n are all identically distributed; the sequence is called ergodic if each E ∈ F left invariant by the transformation T that takes X_n to X_{n+1}, E = T^{−1}(E), is trivial in the sense that P[E] = 0 or P[E] = 1. The Ergodic Theorem asserts that X̄_n converges almost surely to a T-invariant limit X̄ as n → ∞; since only constants are T-invariant for ergodic sequences, it follows that X̄_n → µ = EX_n. The conditions here are weaker than those for the usual LLN; in all three cases above, for example, each X_n is completely determined by X_0, so there is complete dependence!
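Case 3 above can be checked numerically. Under the arcsine law Beta(1/2, 1/2), the logistic map X_{n+1} = 4X_n(1 − X_n) is stationary and ergodic with EX_n = 1/2, so orbit averages should approach 1/2. This is only a sketch: the seed X_0 = 0.3 is an arbitrary choice, and in floating point, rounding error plays the role of noise (the analogous experiment for case 1 fails on a computer, since binary arithmetic collapses every orbit of X_{n+1} = 2X_n mod 1 to 0 within about 53 steps).

```python
# Time average along one orbit of the logistic map X_{n+1} = 4 X_n (1 - X_n).
# Its invariant distribution is the arcsine law Beta(1/2, 1/2), with mean 1/2,
# so the Ergodic Theorem predicts the orbit average tends to 1/2 even though
# every X_n is a deterministic function of X_0.
x, total, N = 0.3, 0.0, 1_000_000
for _ in range(N):
    x = 4.0 * x * (1.0 - x)
    total += x
avg = total / N
print(avg)
```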
Stable Limit Laws

Let S_n = X_1 + ... + X_n be the partial sum of iid random variables. If the random variables are all square integrable, then the Central Limit Theorem applies and necessarily

   (S_n − nµ)/(σ√n) ⇒ N(0,1).

But what if each X_n is not square integrable? We have already seen the CLT fail for Cauchy variables X_j. Denote by F(x) = P[X_n ≤ x] the common CDF of the {X_n}.

Theorem (Stable Limit Law). There exist constants A_n > 0 and B_n ∈ R and a distribution µ for which

   (S_n − B_n)/A_n ⇒ µ

if and only if there are constants 0 < α ≤ 2, M_− ≥ 0, and M_+ ≥ 0, with M_− + M_+ > 0, such that the following limits hold for every ξ > 0 as x → +∞:

   1. F(−x)/(1 − F(x)) = P[X ≤ −x]/P[X > x] → M_−/M_+;
   2. M_+ > 0 ⇒ (1 − F(xξ))/(1 − F(x)) → ξ^{−α};  M_− > 0 ⇒ F(−xξ)/F(−x) → ξ^{−α}.

In this case the limit is the Stable Distribution with index α, with characteristic function

   E[e^{iωY}] = exp( iδω − γ|ω|^α [1 − iβ tan(πα/2) sgn(ω)] ),

where β = (M_+ − M_−)/(M_+ + M_−) and γ is proportional to (M_+ + M_−). The sequence A_n must be essentially A_n ≈ n^{1/α} (more precisely, the sequence C_n = n^{−1/α} A_n is slowly varying in the sense that lim_{n→∞} C_{cn}/C_n = 1 for every c > 0); thus partial sums converge to stable distributions at rate n^{1/α}, more slowly (much more slowly, if α is close to one) than in the L² (Gaussian) case of the central limit theorem. Note that the Cauchy distribution is the special case (α, β, γ, δ) = (1, 0, 1, 0) and the Normal distribution is the special case (α, β, γ, δ) = (2, 0, σ²/2, µ). Although each Stable distribution is absolutely continuous with a continuous probability density function f(y), these two cases and the inverse Gamma distribution with α = 1/2 and β = ±1 are the only ones where the p.d.f. can be given in closed form. Moments are easy enough to compute; for α < 2 the Stable distribution has finite moments only of order p < α and, in particular, none of them has a finite variance. The Cauchy has finite moments of order p < 1 but does not have a well-defined mean. Condition 2.
says that each tail must fall off like a power (such tails are sometimes called Pareto tails), and the powers must be identical; Condition 1. gives the limiting ratio of the two tails. In a common special case M_− = 0; for example, random variables X_n with the Pareto distribution (often used to model income) given by P[X_n > t] = (k/t)^α for t ≥ k will have a stable limit for their partial sums if α < 2, and (by the CLT) a normal limit if α > 2. You can find more details in Chapter 9 of Breiman's book.
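Both the tail condition and the slow convergence are easy to see empirically for Pareto samples. The sketch below (α = 1.5 and k = 1 are arbitrary choices, as are x and ξ) simulates Pareto draws by inverting the CDF, checks the tail-ratio limit of Condition 2., and computes the sample mean: EX = α/(α−1) = 3 is finite, but with an infinite variance the sample mean fluctuates at the stable rate n^{1/α − 1} = n^{−1/3}, not the Gaussian n^{−1/2}.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pareto(k = 1, alpha = 1.5) via inverse CDF: P[X > t] = t^(-alpha) for t >= 1,
# so X = U^(-1/alpha) for U ~ Uniform(0,1).
alpha, n = 1.5, 1_000_000
X = rng.uniform(size=n) ** (-1.0 / alpha)

# Condition 2: the tail ratio P[X > x*xi] / P[X > x] approaches xi^(-alpha).
x, xi = 10.0, 2.0
ratio = (X > x * xi).mean() / (X > x).mean()
print(ratio, xi ** -alpha)   # both near 2^(-1.5)

# EX = alpha/(alpha - 1) = 3 is finite, but the sample mean converges only at
# the heavy-tailed rate n^(-1/3).
print(X.mean())
```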