STA 711: Probability & Measure Theory
Robert L. Wolpert

6 Independence

6.1 Independent Events

A collection of events $\{A_i\} \subset \mathcal{F}$ in a probability space $(\Omega,\mathcal{F},P)$ is called independent if
\[ P\Big[\bigcap_{i\in I} A_i\Big] = \prod_{i\in I} P[A_i] \]
for each finite set $I$ of indices. This is a stronger requirement than pairwise independence, the requirement merely that $P[A_i \cap A_j] = P[A_i]\,P[A_j]$ for each $i \ne j$. For a simple counter-example, toss two fair coins and let $H_n$ be the event "Heads on the $n$th toss" for $n = 1,2$. Then the three events $A_1 := H_1$, $A_2 := H_2$, and $A_3 := H_1 \triangle H_2$ (the event that the coins disagree) each have $P[A_i] = 1/2$ and each pair has $P[A_i \cap A_j] = (1/2)^2 = 1/4$ for $i \ne j$, but $\bigcap A_i = \emptyset$ has probability zero and not $(1/2)^3 = 1/8$.
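The counter-example is small enough to check by exhaustive enumeration. Here is a minimal Python sketch (an illustration added to these notes, not part of the argument) verifying pairwise independence and the failure of mutual independence:

```python
import itertools

outcomes = list(itertools.product([0, 1], repeat=2))  # (toss1, toss2), each with prob 1/4

def prob(event):
    # Probability of an event, given as a predicate on outcomes.
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A1 = lambda w: w[0] == 1          # heads on toss 1
A2 = lambda w: w[1] == 1          # heads on toss 2
A3 = lambda w: w[0] != w[1]       # the two coins disagree

for A in (A1, A2, A3):
    assert prob(A) == 1/2
for A, B in itertools.combinations((A1, A2, A3), 2):
    assert prob(lambda w: A(w) and B(w)) == 1/4       # pairwise independent
print(prob(lambda w: A1(w) and A2(w) and A3(w)))      # 0.0, not (1/2)^3 = 0.125
```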
6.2 Independent Classes of Events

Classes $\{\mathcal{C}_i\}$ of events are called independent if
\[ P\Big[\bigcap_{i\in I} A_i\Big] = \prod_{i\in I} P[A_i] \]
for each finite $I$ whenever each $A_i \in \mathcal{C}_i$. An important tool for simplifying proofs of independence is

Theorem 1 (Basic Criterion) If classes $\{\mathcal{C}_i\}$ of events are independent and if each $\mathcal{C}_i$ is a $\pi$-system, then $\{\sigma(\mathcal{C}_i)\}$ are independent too.

Proof. Let $I$ be a nonempty finite index set and $\{\mathcal{C}_i\}_{i\in I}$ an independent collection of $\pi$-systems. Fix $i \in I$, set $J = I \setminus \{i\}$, and fix $A_j \in \mathcal{C}_j$ for each $j \in J$. Set
\[ \mathcal{L} := \Big\{ B \in \mathcal{F} : P\Big[B \cap \bigcap_{j\in J} A_j\Big] = P[B] \prod_{j\in J} P[A_j] \Big\}. \]
Then
- $\Omega \in \mathcal{L}$, by the independence of $\{\mathcal{C}_j\}_{j\in J}$;
- $B \in \mathcal{L} \Rightarrow B^c \in \mathcal{L}$, by a quick computation; and
- $B_n \in \mathcal{L}$ with $\{B_n\}$ disjoint $\Rightarrow \bigcup B_n \in \mathcal{L}$, by a quick computation.

Thus $\mathcal{L}$ is a $\lambda$-system containing $\mathcal{C}_i$, and so by Dynkin's $\pi$-$\lambda$ theorem it contains $\sigma(\mathcal{C}_i)$. Thus $\sigma(\mathcal{C}_i)$ and $\{A_j\}_{j\in J}$ are independent for each choice of $\{A_j \in \mathcal{C}_j\}$, so $\{\sigma(\mathcal{C}_i), \{\mathcal{C}_j\}_{j\in J}\}$ are independent $\pi$-systems. Repeating the same argument $|I| - 1$ times (or, more elegantly, using mathematical induction on the cardinality $|I|$) completes the proof.

6.3 Independent Random Variables

A collection of random variables $\{X_i\}$ on some probability space $(\Omega,\mathcal{F},P)$ is called independent if
\[ P\Big( \bigcap_{i\in I} [X_i \in B_i] \Big) = \prod_{i\in I} P[X_i \in B_i] \]
for each finite set $I$ of indices and each collection of Borel sets $\{B_i \in \mathcal{B}(\mathbb{R})\}$. This is just the same as the requirement that the $\sigma$-algebras $\mathcal{F}_i := \sigma(X_i) = X_i^{-1}(\mathcal{B})$ be independent; by the Basic Criterion it is enough to check that the joint CDFs factor, i.e., that
\[ P\Big( \bigcap_{i\in I} [X_i \le x_i] \Big) = \prod_{i\in I} F_i(x_i) \]
for each finite index set $I$ and each $x \in \mathbb{R}^I$, or just for a dense set of such $x$ (why?).

For finitely-many jointly continuous random variables this is equivalent to requiring that the joint density function factor as the product of the marginal density functions, while for finitely-many discrete random variables it is equivalent to the usual factorization criterion for the joint pmf. The present definition goes beyond those two cases, however; for example, it includes the case of random variables $X \sim \mathsf{Bi}(7, 0.3)$, $Y \sim \mathsf{Ex}(2.0)$, $Z = |\zeta|$ for $\zeta \sim \mathsf{No}(0,1)$, and $C$ with the Cantor distribution. It also applies to infinite (even uncountable) collections of random variables, where no joint pdf or pmf can exist.

Since $\sigma(g(X)) \subset \sigma(X)$ for any random variable $X$ and Borel function $g(\cdot)$, if $\{X_i\}$ are independent and if $g_i(\cdot)$ are arbitrary Borel functions, it follows that $\{g_i(X_i)\}$ are independent too and, in particular, that if $X \perp Y$ then $X \perp g(Y)$ for all Borel functions $g(\cdot)$.
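The CDF factorization criterion is easy to test empirically. A quick Monte Carlo sketch (assuming numpy), using $X \sim \mathsf{Bi}(7, 0.3)$ and $Y \sim \mathsf{Ex}(2.0)$ from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
X = rng.binomial(7, 0.3, size=N)           # X ~ Bi(7, 0.3)
Y = rng.exponential(scale=1/2.0, size=N)   # Y ~ Ex(2.0), i.e. rate 2

for x, y in [(1, 0.2), (2, 0.5), (4, 1.0)]:
    joint = np.mean((X <= x) & (Y <= y))           # estimates P[X <= x, Y <= y]
    factored = np.mean(X <= x) * np.mean(Y <= y)   # estimates F_X(x) * F_Y(y)
    print(f"x={x}, y={y}: joint {joint:.4f} vs product {factored:.4f}")
```

The two columns agree up to Monte Carlo error; replacing `X` and `Y` by Borel functions of themselves (say `X**2` and `np.exp(-Y)`) leaves the agreement intact, as the last paragraph predicts.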
6.4 Two Zero-One Laws

First, a little toy example to motivate. Begin with a leather bag containing one gold coin, and set $n = 1$.

(a) At the $n$th turn, first add one additional silver coin to the bag, then draw one coin at random. Let $A_n$ be the event $A_n = \{$Draw gold coin on $n$th draw$\}$. Whichever coin you draw, replace it; increment $n$; and repeat.

(b) As above, but at the $n$th turn add $n$ silver coins.

Let $\gamma$ be the probability that you ever draw the gold coin. In each case, is $\gamma = 0$? $\gamma = 1$? or $0 < \gamma < 1$? In the latter case, give an exact asymptotic expression for $\gamma$ and a numerical estimate to four decimals. Why doesn't $0 < \gamma < 1$ violate Borel's zero-one law (Prop. 1 below)? Can you find $\gamma$ exactly, perhaps with the help of Mathematica or Maple? A simulation sketch follows.
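Before reading on, it may help to experiment. A Monte Carlo sketch (assuming numpy; the infinite game is truncated at $T$ turns, so the estimates are only suggestive, and the four-decimal answer in case (b) comes from a truncated infinite product rather than from simulation):

```python
import numpy as np

def ever_drew_gold(add_coins, T, rng):
    # Play T turns; add_coins(n) silver coins enter the bag at turn n.
    silver = 0
    for n in range(1, T + 1):
        silver += add_coins(n)
        if rng.random() < 1.0 / (1 + silver):    # bag holds 1 gold + `silver` silver
            return True
    return False

rng = np.random.default_rng(0)
reps, T = 5_000, 1_000
for name, add in [("(a) add 1 coin/turn", lambda n: 1),
                  ("(b) add n coins/turn", lambda n: n)]:
    g = np.mean([ever_drew_gold(add, T, rng) for _ in range(reps)])
    print(name, "Monte Carlo gamma ~", round(g, 3))

# In case (b) the draws are independent with P[A_n] = 2/(n^2 + n + 2), so
# gamma = 1 - prod_n (1 - P[A_n]); truncating the product gives four decimals:
n = np.arange(1, 1_000_000)
print("(b) via truncated product:", 1 - np.prod(1 - 1/(1 + n*(n+1)/2)))
```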
6.4.1 The Borel-Cantelli Lemmas and Borel's Zero-One Law

Our proof below of the Strong Law of Large Numbers for iid bounded random variables relies on the almost-trivial but very useful:

Lemma 2 (Borel-Cantelli) Let $\{A_n\}$ be events on some probability space $(\Omega,\mathcal{F},P)$ that satisfy $\sum P[A_n] < \infty$. Then the event that infinitely-many of the $\{A_n\}$ occur (the lim sup) has probability zero.

Proof.
\[ P\Big[ \bigcap_n \bigcup_{m \ge n} A_m \Big] \le P\Big[ \bigcup_{m \ge n} A_m \Big] \le \sum_{m \ge n} P[A_m] \to 0 \quad \text{as } n \to \infty. \]

This result does not require independence of the $\{A_n\}$, but its partial converse does:

Lemma 3 (Second Borel-Cantelli) Let $\{A_n\}$ be independent events on a probability space $(\Omega,\mathcal{F},P)$ that satisfy $\sum P[A_n] = \infty$. Then the event that infinitely-many of the $\{A_n\}$ occur (the lim sup) has probability one.

Proof. First recall that $1 + x \le e^x$ for all real $x \in \mathbb{R}$, positive or not (draw a graph). For each pair of integers $1 \le n \le N < \infty$,
\[ P\Big[ \bigcap_{m=n}^N A_m^c \Big] = \prod_{m=n}^N \big(1 - P[A_m]\big) \le \prod_{m=n}^N e^{-P[A_m]} = \exp\Big( -\sum_{m=n}^N P[A_m] \Big) \to \exp\Big( -\sum_{m=n}^\infty P[A_m] \Big) = e^{-\infty} = 0 \]
as $N \to \infty$; thus each $\bigcap_{m \ge n} A_m^c$ is a null set, hence so is their union, so
\[ P\Big[ \bigcap_n \bigcup_{m \ge n} A_m \Big] = 1 - P\Big[ \bigcup_n \bigcap_{m \ge n} A_m^c \Big] \ge 1 - \sum_n P\Big[ \bigcap_{m \ge n} A_m^c \Big] = 1 - 0 = 1. \]

Together these two results comprise

Proposition 1 (Borel's Zero-One Law) For independent events $\{A_n\}$, the event $A := \limsup A_n$ has probability $P(A) = 0$ or $P(A) = 1$, depending on whether the sum $\sum P(A_n)$ is finite or not.

6.4.2 Kolmogorov's Zero-One Law

For any collection $\{X_n\}$ of random variables on a probability space $(\Omega,\mathcal{F},P)$, define two sequences of $\sigma$-algebras ("past" and "future") by
\[ \mathcal{F}_n := \sigma\{X_i : i \le n\} \qquad\qquad \mathcal{T}_n := \sigma\{X_i : i \ge n+1\} \]
and, from them, construct the $\pi$-system $\mathcal{P}$ and $\sigma$-algebra $\mathcal{T}$ by
\[ \mathcal{P} := \bigcup_n \mathcal{F}_n \qquad\qquad \mathcal{T} := \bigcap_n \mathcal{T}_n. \]
In general $\mathcal{P}$ will not be a $\sigma$-algebra, because it will not be closed under countable unions or intersections, but it is a field and hence a $\pi$-system, and it generates the $\sigma$-algebra $\mathcal{F}_\infty := \sigma(\mathcal{P}) \subset \mathcal{F}$. The class $\mathcal{T}$, called the tail $\sigma$-field, includes such events as $\{X_n \text{ converges}\}$ or $\{\limsup X_n \le 1\}$ or, with $S_n := \sum_{j=1}^n X_j$, $\{\frac{1}{n} S_n \text{ converges}\}$ or $\{\frac{1}{n} S_n \to 0\}$.

Theorem 4 (Kolmogorov's Zero-One Law) For independent random variables $X_n$, the tail $\sigma$-field $\mathcal{T}$ is almost trivial in the sense that every event $\Lambda \in \mathcal{T}$ has probability $P[\Lambda] = 0$ or $P[\Lambda] = 1$.

Proof. Let $A \in \mathcal{P} = \bigcup \mathcal{F}_n$ and $\Lambda \in \mathcal{T}$. Then for some $n \in \mathbb{N}$, $A \in \mathcal{F}_n$ and $\Lambda \in \mathcal{T}_n$, so $A \perp \Lambda$; thus $\mathcal{P}$ and $\mathcal{T}$ are independent. Since $\mathcal{P}$ is a $\pi$-system, it follows from the Basic Criterion that $\sigma(\mathcal{P})$ and $\mathcal{T}$ are also independent. But $\mathcal{F}_n \subset \mathcal{P}$, so each $X_n$ is $\sigma(\mathcal{P})$-measurable, hence $\mathcal{T} \subset \sigma(\mathcal{P})$ and each $\Lambda \in \mathcal{T}$ must also be in $\sigma(\mathcal{P})$, hence independent of itself. It follows that
\[ P[\Lambda] = P[\Lambda \cap \Lambda] = P[\Lambda]\,P[\Lambda] = P[\Lambda]^2, \quad \text{so} \quad 0 = P[\Lambda]\big(1 - P[\Lambda]\big), \]
proving that $P[\Lambda]$ must be zero or one.
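The two Borel-Cantelli regimes above are easy to visualize by simulation (assuming numpy): with independent events of probability $1/n^2$ the occurrences stop early, while with probability $1/n$ they keep coming. Any finite run can of course only suggest the zero-one dichotomy:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
n = np.arange(1, N + 1)
for label, p in [("P[A_n] = 1/n^2 (summable)", 1 / n**2),
                 ("P[A_n] = 1/n  (divergent)", 1 / n)]:
    hits = rng.random(N) < p                   # independent events A_n
    last = n[hits][-1] if hits.any() else 0    # index of the last occurrence seen
    print(f"{label}: {hits.sum()} occurrences up to N={N}, last at n={last}")
```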
6.5 Product Spaces

Do independent random variables exist, with arbitrary specified (marginal) distributions? How can they be constructed? One way is to build product probability spaces; let's see how to do that. Let $(\Omega_j, \mathcal{F}_j, P_j)$ be a probability space for $j = 1, 2$ and set:
\begin{align*}
\Omega &= \Omega_1 \times \Omega_2 := \{(\omega_1, \omega_2) : \omega_j \in \Omega_j\} \\
\mathcal{F} &= \mathcal{F}_1 \otimes \mathcal{F}_2 := \sigma\{A_1 \times A_2 : A_j \in \mathcal{F}_j\} \\
P &:= P_1 \times P_2, \text{ the unique extension to } \mathcal{F} \text{ satisfying } P(A_1 \times A_2) = P_1(A_1) \cdot P_2(A_2) \text{ for } A_1 \in \mathcal{F}_1,\ A_2 \in \mathcal{F}_2.
\end{align*}
Any random variables $X_1'$ on $(\Omega_1,\mathcal{F}_1,P_1)$ and $X_2'$ on $(\Omega_2,\mathcal{F}_2,P_2)$ can be extended to the common space $(\Omega,\mathcal{F},P)$ by defining $X_1(\omega_1,\omega_2) := X_1'(\omega_1)$ and $X_2(\omega_1,\omega_2) := X_2'(\omega_2)$; it's easy to show that the $\{X_j\}$ are independent and have the same marginal distributions as the $\{X_j'\}$. Thus independent random variables do exist, with arbitrary distributions. The same construction extends to countable families.

6.6 Fubini's Theorem

We now consider how to evaluate probabilities and integrals on product spaces. For any $\mathcal{F}$-measurable random variable $X : \Omega_1 \times \Omega_2 \to S$ ($S$ would be $\mathbb{R}$ for real-valued RVs, but could also be $\mathbb{R}^n$ or any complete separable metric space), and for any $\omega_2 \in \Omega_2$, the (second) section of $X$ is the $(\Omega_1,\mathcal{F}_1,P_1)$ random variable $X_{\omega_2} : \Omega_1 \to S$ defined by $X_{\omega_2}(\omega_1) := X(\omega_1, \omega_2)$. It is not quite obvious, but true, that $X_{\omega_2}$ is $\mathcal{F}_1$-measurable: show this first for indicator random variables $X = 1_{A_1 \times A_2}$ of product sets, then extend by a $\pi$-$\lambda$ argument to indicators $X = 1_A$ of sets $A \in \mathcal{F}$, then to simple RVs by linearity, then to the nonnegative RVs $X^+$ and $X^-$ for an arbitrary $\mathcal{F}$-measurable $X$ by monotone limits. Similarly, the first section $X_{\omega_1}(\cdot) := X(\omega_1, \cdot)$ is $\mathcal{F}_2$-measurable for each $\omega_1 \in \Omega_1$. Finally:

Fubini's theorem gives conditions (namely, that either $X \ge 0$ or $E|X| < \infty$) to guarantee that these three integrals are meaningful and equal:
\[ \int_{\Omega_2} \Big\{ \int_{\Omega_1} X_{\omega_2}\, dP_1 \Big\}\, dP_2 \overset{?}{=} \int_\Omega X\, dP \overset{?}{=} \int_{\Omega_1} \Big\{ \int_{\Omega_2} X_{\omega_1}\, dP_2 \Big\}\, dP_1 \tag{1} \]
To prove this, first note that it is true for indicators $X = 1_{A_1 \times A_2}$ of the $\pi$-system of measurable rectangles $(A_1 \times A_2)$ with each $A_j \in \mathcal{F}_j$; then verify that the class $\mathcal{C}$ of events $A \in \mathcal{F}$ for which it holds with $X = 1_A$ is a $\lambda$-system. By Dynkin's $\pi$-$\lambda$ theorem it follows that $\mathcal{F} \subset \mathcal{C}$, so (1) holds for all indicators $X = 1_A$ of events $A \in \mathcal{F}$, hence for all nonnegative simple functions in $\mathcal{E}^+$, and finally for all nonnegative $\mathcal{F}$-measurable $X$ by the MCT. For $X \in L_1$, apply this result separately to $X^+$ and $X^-$.
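The product-space construction settles existence in principle; computationally, the usual way to realize independent variables with prescribed marginals is the quantile (inverse-CDF) transform applied to independent uniforms. A sketch assuming numpy and scipy, reusing the marginals from Section 6.3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
U1, U2 = rng.random(100_000), rng.random(100_000)   # independent Uniform(0,1)
X = stats.binom(7, 0.3).ppf(U1)      # X ~ Bi(7, 0.3) via the quantile transform
Y = stats.expon(scale=0.5).ppf(U2)   # Y ~ Ex(2.0)

# Independence is inherited from (U1, U2); e.g. bounded transforms decorrelate:
print(np.corrcoef(np.cos(X), np.sin(Y))[0, 1])      # ~ 0 up to Monte Carlo error
```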
Fubini's theorem applies more generally. Each probability measure $P_j$ may be replaced by an arbitrary $\sigma$-finite measure; or one of them (say, $P_2$) may be replaced by a measurable kernel$^1$ $K(\omega_1, d\omega_2)$ that is a $\sigma$-finite measure $K(\omega_1, \cdot)$ on $\mathcal{F}_2$ in its second variable for each fixed $\omega_1$, and an $\mathcal{F}_1$-measurable function $K(\cdot, B_2)$ in its first variable for each fixed $B_2 \in \mathcal{F}_2$. Now Fubini's Theorem asserts (under positivity or $L_1$ conditions) the equality of the integral of $X$ wrt the measure $P(d\omega_1\, d\omega_2) = P_1(d\omega_1)\, K(\omega_1, d\omega_2)$ to the iterated integrals
\[ \int_{\Omega_2} \nu_X(d\omega_2) = \int_\Omega X\, dP = \int_{\Omega_1} \Big\{ \int_{\Omega_2} X_{\omega_1}(\omega_2)\, K(\omega_1, d\omega_2) \Big\}\, P_1(d\omega_1) \]
for the measure on $\mathcal{F}_2$ given by
\[ \nu_X(d\omega_2) := \int_{\Omega_1} X_{\omega_2}(\omega_1)\, K(\omega_1, d\omega_2)\, P_1(d\omega_1). \]

$^1$ Measurable kernels come up all the time when studying conditional distributions (as you'll see in week 9 of this course) and, in particular, Markov chains and processes.

As an easy consequence (take $\Omega_1 = \mathbb{N}$ with counting measure $P_1(B_1) := \#\{B_1\}$, and let $K(n, B_2) = P(X_n \in B_2)$ be the distribution of $X_n$ on $\Omega_2 = \mathbb{R}$), for any sequence of random variables we may exchange summation and expectation and conclude
\[ E\Big\{ \sum X_n \Big\} \overset{?}{=} \sum \big\{ E X_n \big\} \]
whenever each $X_n \ge 0$ or when $\sum E|X_n| < \infty$, but otherwise equality may fail.

For an example where interchanging the order of integration fails, integrate by parts to verify:
\[ \int_0^1 \Big\{ \int_0^1 \frac{y^2 - x^2}{(x^2 + y^2)^2}\, dx \Big\}\, dy = \int_0^1 \frac{1}{1 + y^2}\, dy = +\frac{\pi}{4}, \]
while
\[ \int_0^1 \Big\{ \int_0^1 \frac{y^2 - x^2}{(x^2 + y^2)^2}\, dy \Big\}\, dx = \int_0^1 \frac{-1}{1 + x^2}\, dx = -\frac{\pi}{4}. \]
As expected in light of Fubini's Theorem, the integrand isn't nonnegative, nor is it in $L_1$: in polar coordinates $|y^2 - x^2| = r^2\,|\sin^2\theta - \cos^2\theta| = r^2\,|\cos 2\theta| \ge r^2 \cos^2(2\theta)$, so
\[ \int_0^1\!\!\int_0^1 \frac{|y^2 - x^2|}{(x^2 + y^2)^2}\, dx\, dy \ge \int_0^{\pi/2}\!\!\int_0^1 \frac{r^2 \cos^2(2\theta)}{r^4}\, r\, dr\, d\theta = \Big( \int_0^{\pi/2} \cos^2(2\theta)\, d\theta \Big) \Big( \int_0^1 r^{-1}\, dr \Big) = (\pi/4)(\infty).
\]
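Numerical quadrature (assuming scipy) confirms that the two iterated integrals really disagree; the singularity at the corner $(0,0)$ makes the quadrature delicate, so treat this as an illustration rather than a proof:

```python
import numpy as np
from scipy import integrate

f = lambda x, y: (y**2 - x**2) / (x**2 + y**2)**2

dx_inner = integrate.quad(
    lambda y: integrate.quad(lambda x: f(x, y), 0, 1)[0], 0, 1)[0]
dy_inner = integrate.quad(
    lambda x: integrate.quad(lambda y: f(x, y), 0, 1)[0], 0, 1)[0]
print(dx_inner, "vs", +np.pi / 4)    # inner dx, outer dy: ~ +0.7854
print(dy_inner, "vs", -np.pi / 4)    # inner dy, outer dx: ~ -0.7854
```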
6.6.1 A Simple but Useful Consequence of Fubini

For any $p > 0$ and any random variable $X$,
\[ E|X|^p = E\Big[ \int_0^{|X|} p\,x^{p-1}\, dx \Big] = E\Big[ \int_0^\infty 1_{\{|X| > x\}}\, p\,x^{p-1}\, dx \Big] = \int_0^\infty E\big[ 1_{\{|X| > x\}} \big]\, p\,x^{p-1}\, dx = \int_0^\infty p\,x^{p-1}\, P\big[ |X| > x \big]\, dx. \]
It follows that
\[ X \in L_p \iff E|X|^p < \infty \iff \sum n^{p-1}\, P[|X| > n] < \infty. \]
The case $p = 1$ is easiest and most important: if $S := \sum_{n=0}^\infty P[|X| > n] < \infty$, then $X \in L_1$ with $E|X| \le S \le E|X| + 1$. If $X$ takes on only nonnegative integer values, then $EX = S$. For any $\epsilon > 0$,
\[ \epsilon \sum_{n=1}^\infty P[|X| > n\epsilon] \le E|X| \le \epsilon \sum_{n=0}^\infty P[|X| > n\epsilon]. \]

6.7 Hoeffding's Inequality

If the $\{X_j\}$ are independent and (individually) bounded, so $(\forall j \in \mathbb{N})$ $(\exists\, a_j, b_j)$ for which $P[a_j \le X_j \le b_j] = 1$, then $(\forall c > 0)$ the sum $S_n := \sum_{j=1}^n X_j$ satisfies
\[ P\big[ (S_n - E S_n) \ge c \big] \le \exp\Big( -2c^2 \Big/ \sum_{j=1}^n (b_j - a_j)^2 \Big). \]
If the $X_j$ are iid and bounded by $|X_j| \le 1$, for example, then $P[(\bar{X}_n - \mu) \ge \epsilon] \le e^{-n\epsilon^2/2}$.

Wassily Hoeffding proved this improvement on Chebychev's inequality for $L_\infty$ random variables in 1963 at UNC. It follows from Hoeffding's Lemma, $E[e^{\lambda(X_j - \mu_j)}] \le \exp\big( \lambda^2 (b_j - a_j)^2 / 8 \big)$, proved in turn from Jensen's inequality and Taylor's theorem (with remainder). The importance is that the bound decreases exponentially in $n$ as $n \to \infty$, while the Chebychev bound only decreases like a power of $n$. The price is that the $\{X_j\}$ must be bounded in $L_\infty$, not merely in $L_2$. See also the related and earlier Bernstein inequality (1937), Chernoff bounds (1952), and Azuma's inequality (1967).
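Both the identity $EX = S$ for nonnegative integer-valued $X$ and the sandwich $E|X| \le S \le E|X| + 1$ are easy to check numerically (a sketch assuming scipy):

```python
import numpy as np
from scipy import stats

ns = np.arange(0, 200)               # truncating the tail sum S at n = 200

X = stats.poisson(3.7)               # nonnegative integer-valued: EX = S exactly
print(X.mean(), np.sum(X.sf(ns)))    # sf(n) = P[X > n]; both print 3.7

Y = stats.expon(scale=1.0)           # continuous with E|Y| = 1
print(Y.mean(), np.sum(Y.sf(ns)))    # 1.0 <= S ~ 1.5820 <= 2.0
```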
Here's a proof for the important special case of $X_j = \pm 1$ with probability $1/2$ each (and hence $\mu = 0$):
\begin{align*}
P[\bar{X}_n \ge \epsilon] = P[S_n \ge n\epsilon] &= P\big[ e^{\lambda S_n} \ge e^{n\lambda\epsilon} \big] &&\text{for any } \lambda > 0 \\
&\le E\, e^{\lambda S_n}\, e^{-n\lambda\epsilon} &&\text{by Markov's inequality} \\
&= \big\{ \tfrac{1}{2} e^\lambda + \tfrac{1}{2} e^{-\lambda} \big\}^n e^{-n\lambda\epsilon} &&\text{by independence} \\
&\le \big\{ e^{\lambda^2/2} \big\}^n e^{-n\lambda\epsilon} = \exp\big( n\lambda^2/2 - n\lambda\epsilon \big) &&\text{see footnote 2.}
\end{align*}
The exponent is minimized at $\lambda = \epsilon$, so $P[\bar{X}_n \ge \epsilon] \le \exp(n\epsilon^2/2 - n\epsilon^2) = e^{-n\epsilon^2/2}$. The general case isn't much harder, but proving that $E\,e^{\lambda X} \le e^{\lambda^2/2}$ is a bit more delicate.

By Borel-Cantelli it follows from Hoeffding's inequality that $(\bar{X}_n - \mu) > \epsilon$ only finitely-many times for each $\epsilon > 0$, if the $\{X_n\} \subset L_\infty$ are iid, leading to our first Strong Law of Large Numbers: $P[\bar{X}_n \to \mu] = 1$ (why does this follow?). Note that Chebychev's inequality only guarantees the algebraic bound $P[|\bar{X}_n| \ge \epsilon] \le 1/n\epsilon^2$, instead of Hoeffding's exponential bound. Since $1/n\epsilon^2$ isn't summable in $n$, Chebychev's bound isn't strong enough to prove a strong LLN, but Hoeffding's is.

Hoeffding's inequality is now commonly used in computer science and machine learning, applied to indicators of Bernoulli events (or, equivalently, to binomial random variables), where it gives the bound
\[ P\big[ (p - \epsilon) n \le S_n \le (p + \epsilon) n \big] = P\big[ |\bar{X}_n - p| \le \epsilon \big] \ge 1 - 2\,e^{-2\epsilon^2 n} \]
for $S_n = \sum_{j \le n} X_j \sim \mathsf{Bi}(n, p)$ with iid Bernoulli $X_j \sim \mathsf{Bi}(1, p)$, showing exponential concentration of probability around the mean. This is far stronger than Chebychev's bound of $P[|\bar{X}_n - p| \le \epsilon] \ge 1 - p(1-p)/\epsilon^2 n$.

$^2$ $\cosh(\lambda) = \big\{ \tfrac{1}{2} e^\lambda + \tfrac{1}{2} e^{-\lambda} \big\} = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le \sum_{k=0}^\infty \frac{(\lambda^2)^k}{2^k\, k!} = e^{\lambda^2/2}$
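A small simulation (assuming numpy) makes the contrast with Chebychev vivid for the $\pm 1$ case just proved; here $S_n$ is sampled via its Binomial representation rather than as an explicit sum of signs:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, reps = 0.1, 1_000_000
for n in [100, 400, 900]:
    # S_n for iid +/-1 signs has the law of 2*Binomial(n, 1/2) - n:
    Xbar = (2 * rng.binomial(n, 0.5, size=reps) - n) / n
    mc = np.mean(Xbar >= eps)                 # Monte Carlo P[Xbar_n >= eps]
    hoeffding = np.exp(-n * eps**2 / 2)       # e^{-n eps^2 / 2}
    chebychev = 1 / (n * eps**2)              # bounds even the two-sided probability
    print(f"n={n}: MC {mc:.5f} <= Hoeffding {hoeffding:.5f} << Chebychev {chebychev:.4f}")
```

As $n$ grows, the Hoeffding bound tracks the exponential decay of the true probability, while the Chebychev bound shrinks only like $1/n$.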