STA 711: Probability & Measure Theory Robert L. Wolpert


6 Independence

6.1 Independent Events

A collection of events $\{A_i\}\subset\mathcal F$ in a probability space $(\Omega,\mathcal F,\mathsf P)$ is called independent if
$$\mathsf P\Big[\bigcap_{i\in I}A_i\Big]=\prod_{i\in I}\mathsf P[A_i]$$
for each finite set $I$ of indices. This is a stronger requirement than pairwise independence, the requirement merely that $\mathsf P[A_i\cap A_j]=\mathsf P[A_i]\,\mathsf P[A_j]$ for each $i\ne j$. For a simple counter-example, toss two fair coins and let $H_n$ be the event "Heads on the $n$th toss" for $n=1,2$. Then the three events $A_1:=H_1$, $A_2:=H_2$, and $A_3:=H_1\,\triangle\,H_2$ (the event that the coins disagree) each have $\mathsf P[A_i]=1/2$ and each pair has $\mathsf P[A_i\cap A_j]=(1/2)^2=1/4$ for $i\ne j$, but $\bigcap A_i=\emptyset$ has probability zero and not $(1/2)^3=1/8$.

6.2 Independent Classes of Events

Classes $\{\mathcal C_i\}$ of events are called independent if $\mathsf P[\bigcap_{i\in I}A_i]=\prod_{i\in I}\mathsf P[A_i]$ for each finite $I$ whenever each $A_i\in\mathcal C_i$. An important tool for simplifying proofs of independence is

Theorem 1 (Basic Criterion) If classes $\{\mathcal C_i\}$ of events are independent and if each $\mathcal C_i$ is a $\pi$-system, then $\{\sigma(\mathcal C_i)\}$ are independent too.

Proof. Let $I$ be a nonempty finite index set and $\{\mathcal C_i\}_{i\in I}$ an independent collection of $\pi$-systems. Fix $i\in I$, set $J=I\setminus\{i\}$, and fix $A_j\in\mathcal C_j$ for each $j\in J$. Set
$$\mathcal L:=\Big\{B\in\mathcal F:\ \mathsf P\Big[B\cap\bigcap_{j\in J}A_j\Big]=\mathsf P[B]\prod_{j\in J}\mathsf P[A_j]\Big\}.$$
Then $\Omega\in\mathcal L$, by the independence of $\{\mathcal C_j\}_{j\in J}$; $B\in\mathcal L\Rightarrow B^c\in\mathcal L$, by a quick computation; and $B_n\in\mathcal L$ with $\{B_n\}$ disjoint $\Rightarrow\bigcup B_n\in\mathcal L$, by another quick computation.
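The two-coin counter-example is small enough to verify exhaustively. Here is a short sketch (Python, purely illustrative; the variable names are mine, not part of the notes) that enumerates the four equally likely outcomes and confirms the three events are pairwise independent but not independent.

```python
from itertools import product
from fractions import Fraction

# Sample space: two fair coin tosses, each outcome has probability 1/4.
outcomes = list(product(['H', 'T'], repeat=2))
P = lambda event: Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A1 = lambda w: w[0] == 'H'        # heads on toss 1
A2 = lambda w: w[1] == 'H'        # heads on toss 2
A3 = lambda w: w[0] != w[1]       # the coins disagree (H1 triangle H2)
events = [A1, A2, A3]

# Pairwise independence: P[Ai & Aj] = P[Ai] P[Aj] = 1/4 for all i != j ...
for i in range(3):
    for j in range(i + 1, 3):
        Ai, Aj = events[i], events[j]
        assert P(lambda w: Ai(w) and Aj(w)) == P(Ai) * P(Aj) == Fraction(1, 4)

# ... but not (mutual) independence: the triple intersection is empty.
print(P(lambda w: A1(w) and A2(w) and A3(w)))   # 0, not (1/2)**3 = 1/8
```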

Thus $\mathcal L$ is a $\lambda$-system containing $\mathcal C_i$, and so by Dynkin's $\pi$-$\lambda$ theorem it contains $\sigma(\mathcal C_i)$. Thus $\sigma(\mathcal C_i)$ and $\{A_j\}_{j\in J}$ are independent for each choice of $\{A_j\in\mathcal C_j\}$, so $\{\sigma(\mathcal C_i),\{\mathcal C_j\}_{j\in J}\}$ are independent $\pi$-systems. Repeating the same argument $|I|-1$ times (or, more elegantly, mathematical induction on the cardinality $|I|$) completes the proof.

6.3 Independent Random Variables

A collection of random variables $\{X_i\}$ on some probability space $(\Omega,\mathcal F,\mathsf P)$ is called independent if
$$\mathsf P\Big(\bigcap_{i\in I}[X_i\in B_i]\Big)=\prod_{i\in I}\mathsf P[X_i\in B_i]$$
for each finite set $I$ of indices and each collection of Borel sets $\{B_i\in\mathcal B(\mathbb R)\}$. This is just the same as the requirement that the $\sigma$-algebras $\mathcal F_i:=\sigma(X_i)=X_i^{-1}(\mathcal B)$ be independent; by the Basic Criterion it is enough to check that the joint CDFs factor, i.e., that $\mathsf P\big(\bigcap_{i\in I}[X_i\le x_i]\big)=\prod_{i\in I}F_i(x_i)$ for each finite index set $I$ and each $x\in\mathbb R^I$, or just for a dense set of such $x$ (Why?). For finitely many jointly continuous random variables this is equivalent to requiring that the joint density function factor as the product of marginal density functions, while for finitely many discrete random variables it's equivalent to the usual factorization criterion for the joint pmf. The present definition goes beyond those two cases, however; for example, it includes the case of random variables $X\sim\mathsf{Bi}(7,.3)$, $Y\sim\mathsf{Ex}(2)$, $Z=\zeta^2$ for $\zeta\sim\mathsf{No}(0,1)$, and $C$ with the Cantor distribution. It also applies to infinite (even uncountable) collections of random variables, where no joint pdf or pmf can exist.

Since $\sigma\big(g(X)\big)\subset\sigma(X)$ for any random variable $X$ and Borel function $g(\cdot)$, if $\{X_i\}$ are independent and if $g_i(\cdot)$ are arbitrary Borel functions, it follows that $\{g_i(X_i)\}$ are independent too and, in particular, that if $X\perp Y$ then $X\perp g(Y)$ for all Borel functions $g(\cdot)$.
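To make the CDF-factorization criterion concrete, here is a small Monte Carlo sketch (Python with NumPy; the sample size and the grid of test points are my choices for illustration). It draws independent $X\sim\mathsf{Bi}(7,.3)$ and $Y\sim\mathsf{Ex}(2)$ and checks that the empirical joint CDF approximately equals the product of the marginal empirical CDFs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent draws: X ~ Binomial(7, 0.3), Y ~ Exponential with rate 2 (mean 1/2).
X = rng.binomial(7, 0.3, size=n)
Y = rng.exponential(scale=0.5, size=n)

# Empirical check that P[X <= x, Y <= y] = P[X <= x] * P[Y <= y] on a small grid.
for x in (1, 2, 4):
    for y in (0.2, 0.5, 1.0):
        joint = np.mean((X <= x) & (Y <= y))
        prod = np.mean(X <= x) * np.mean(Y <= y)
        print(f"x={x}, y={y}: joint={joint:.4f}, product={prod:.4f}")
```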

6.4 Two Zero-One Laws

First, a little toy example to motivate. Begin with a leather bag containing one gold coin, and $n=1$.

(a) At the $n$th turn, first add one additional silver coin to the bag, then draw one coin at random. Let $A_n$ be the event $A_n=\{\text{Draw gold coin on $n$th draw}\}$. Whichever coin you draw, replace it; increment $n$; and repeat.

(b) As above, but at the $n$th turn add $n$ silver coins.

Let $\gamma$ be the probability that you ever draw the gold coin. In each case, is $\gamma=0$? $\gamma=1$? or $0<\gamma<1$? In the latter case, give an exact asymptotic expression for $\gamma$ and a numerical estimate to four decimals. Why doesn't $0<\gamma<1$ violate Borel's zero-one law (Prop. 1 below)? Can you find $\gamma$ exactly, perhaps with the help of Mathematica or Maple?

6.4.1 The Borel-Cantelli Lemmas and Borel's Zero-One Law

Our proof below of the Strong Law of Large Numbers for iid bounded random variables relies on the almost-trivial but very useful:

Lemma 2 (Borel-Cantelli) Let $\{A_n\}$ be events on some probability space $(\Omega,\mathcal F,\mathsf P)$ that satisfy $\sum\mathsf P[A_n]<\infty$. Then the event that infinitely many of the $\{A_n\}$ occur (the limsup) has probability zero.

Proof.
$$\mathsf P\Big[\bigcap_n\bigcup_{m\ge n}A_m\Big]\ \le\ \mathsf P\Big[\bigcup_{m\ge n}A_m\Big]\ \le\ \sum_{m\ge n}\mathsf P[A_m]\ \to\ 0\quad\text{as }n\to\infty.$$

This result does not require independence of the $\{A_n\}$, but its partial converse does:

Lemma 3 (Second Borel-Cantelli) Let $\{A_n\}$ be independent events on a probability space $(\Omega,\mathcal F,\mathsf P)$ that satisfy $\sum\mathsf P[A_n]=\infty$. Then the event that infinitely many of the $\{A_n\}$ occur (the limsup) has probability one.

Proof. First recall that $1+x\le e^x$ for all real $x\in\mathbb R$, positive or not (draw a graph). For each pair of integers $1\le n\le N<\infty$,
$$\mathsf P\Big[\bigcap_{m=n}^N A_m^c\Big]=\prod_{m=n}^N\big(1-\mathsf P[A_m]\big)\le\prod_{m=n}^N e^{-\mathsf P[A_m]}=\exp\Big(-\sum_{m=n}^N\mathsf P[A_m]\Big)\to\exp\Big(-\sum_{m=n}^\infty\mathsf P[A_m]\Big)=e^{-\infty}=0$$
as $N\to\infty$; thus each $\bigcap_{m\ge n}A_m^c$ is a null set, hence so is their union, so
$$\mathsf P\Big[\bigcap_n\bigcup_{m\ge n}A_m\Big]=1-\mathsf P\Big[\bigcup_n\bigcap_{m\ge n}A_m^c\Big]\ \ge\ 1-\sum_n\mathsf P\Big[\bigcap_{m\ge n}A_m^c\Big]=1-0=1.$$

Together these two results comprise the

Proposition 1 (Borel's Zero-One Law) For independent events $\{A_n\}$, the event $A:=\limsup A_n$ has probability $\mathsf P(A)=0$ or $\mathsf P(A)=1$, depending on whether the sum $\sum\mathsf P(A_n)$ is finite or not.
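As a numerical illustration of how Borel's zero-one law bears on the coin-in-the-bag example above, here is a small sketch (Python; the truncation point $N$, and the reading that the bag holds $n+1$ coins at turn $n$ in scheme (a) and $1+n(n+1)/2$ coins in scheme (b) with draws independent across turns, are my assumptions, not part of the notes). Under that independence, the probability of never drawing the gold coin is $\prod_n(1-\mathsf P[A_n])$, so $\gamma=1-\prod_n(1-\mathsf P[A_n])$.

```python
import numpy as np

def gamma_estimate(n_coins_at_turn, N=10_000_000):
    """Approximate gamma = 1 - prod_n (1 - 1/#coins at turn n), truncated at N turns."""
    n = np.arange(1, N + 1, dtype=np.float64)
    p_n = 1.0 / n_coins_at_turn(n)          # P[A_n] = 1 / (number of coins in the bag)
    log_never = np.sum(np.log1p(-p_n))      # log prod (1 - p_n), stable for small p_n
    return 1.0 - np.exp(log_never)

# Scheme (a): one silver coin added per turn, so n+1 coins at turn n.
# sum P[A_n] = sum 1/(n+1) diverges, so by the second Borel-Cantelli lemma gamma = 1.
print("scheme (a):", gamma_estimate(lambda n: n + 1.0))      # creeps toward 1 as N grows

# Scheme (b): n silver coins added at turn n, so 1 + n(n+1)/2 coins at turn n.
# sum P[A_n] converges, and the truncated product settles strictly between 0 and 1.
print("scheme (b):", gamma_estimate(lambda n: 1.0 + n * (n + 1) / 2.0))
```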

6.4.2 Kolmogorov's Zero-One Law

For any collection $\{X_n\}$ of random variables on a probability space $(\Omega,\mathcal F,\mathsf P)$, define two sequences of $\sigma$-algebras ("past" and "future") by
$$\mathcal F_n:=\sigma\{X_i:\ i\le n\}\qquad\qquad \mathcal T_n:=\sigma\{X_i:\ i\ge n+1\}$$
and, from them, construct the $\pi$-system $\mathcal P$ and $\sigma$-algebra $\mathcal T$ by
$$\mathcal P:=\bigcup_n\mathcal F_n\qquad\qquad \mathcal T:=\bigcap_n\mathcal T_n.$$
In general $\mathcal P$ will not be a $\sigma$-algebra, because it will not be closed under countable unions or intersections, but it is a field and hence a $\pi$-system, and it generates the $\sigma$-algebra $\mathcal F_\infty:=\sigma(\mathcal P)\subset\mathcal F$. The class $\mathcal T$, called the tail $\sigma$-field, includes such events as $\{X_n\text{ converges}\}$ or $\{\limsup X_n\le 1\}$ or, with $S_n:=\sum_{j=1}^n X_j$, $\{\tfrac1n S_n\text{ converges}\}$ or $\{\tfrac1n S_n\to 0\}$.

Theorem 4 (Kolmogorov's Zero-One Law) For independent random variables $X_n$, the tail $\sigma$-field $\mathcal T$ is almost trivial in the sense that every event $\Lambda\in\mathcal T$ has probability $\mathsf P[\Lambda]=0$ or $\mathsf P[\Lambda]=1$.

Proof. Let $A\in\mathcal P=\bigcup\mathcal F_n$ and $\Lambda\in\mathcal T$. Then for some $n\in\mathbb N$, $A\in\mathcal F_n$ and $\Lambda\in\mathcal T_n$, so $A\perp\Lambda$; thus $\mathcal P$ and $\mathcal T$ are independent. Since $\mathcal P$ is a $\pi$-system, it follows from the Basic Criterion that $\sigma(\mathcal P)$ and $\mathcal T$ are also independent. But $\mathcal F_n\subset\mathcal P$, so each $X_n$ is $\sigma(\mathcal P)$-measurable, hence $\mathcal T\subset\sigma(\mathcal P)$ and each $\Lambda\in\mathcal T$ must also be in $\sigma(\mathcal P)\perp\mathcal T$. It follows that
$$\mathsf P[\Lambda]=\mathsf P[\Lambda\cap\Lambda]=\mathsf P[\Lambda]\,\mathsf P[\Lambda]=\mathsf P[\Lambda]^2,\quad\text{so}\quad 0=\mathsf P[\Lambda]\big(1-\mathsf P[\Lambda]\big),$$
proving that $\mathsf P[\Lambda]$ must be zero or one.
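The tail event $\{\tfrac1n S_n\text{ converges}\}$ gives a nice way to see the "zero or one" dichotomy numerically. The following sketch (Python; the choice of distributions and sample sizes is mine, purely for illustration) compares iid $\mathsf{No}(0,1)$ summands, for which $\tfrac1n S_n\to0$ almost surely, with iid standard Cauchy summands, for which $\tfrac1n S_n$ has the same Cauchy distribution for every $n$ and converges with probability zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Running means S_n / n along a single sample path, for two choices of iid summands.
normal_path = np.cumsum(rng.standard_normal(n)) / np.arange(1, n + 1)
cauchy_path = np.cumsum(rng.standard_cauchy(n)) / np.arange(1, n + 1)

# The tail event {S_n/n converges} has probability 1 for Normal summands (SLLN)
# and probability 0 for Cauchy summands -- never anything in between.
for k in (10_000, 100_000, 1_000_000):
    print(f"n={k:>9}: normal mean {normal_path[k-1]: .4f}, cauchy mean {cauchy_path[k-1]: .4f}")
```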

6.5 Product Spaces

Do independent random variables exist, with arbitrary specified (marginal) distributions? How can they be constructed? One way is to build product probability spaces; let's see how to do that. Let $(\Omega_j,\mathcal F_j,\mathsf P_j)$ be a probability space for $j=1,2$ and set:
$$\Omega=\Omega_1\times\Omega_2:=\{(\omega_1,\omega_2):\ \omega_j\in\Omega_j\}$$
$$\mathcal F=\mathcal F_1\otimes\mathcal F_2:=\sigma\{A_1\times A_2:\ A_j\in\mathcal F_j\}$$
$$\mathsf P:=\mathsf P_1\times\mathsf P_2,\ \text{the unique extension to $\mathcal F$ satisfying }\ \mathsf P(A_1\times A_2)=\mathsf P_1(A_1)\,\mathsf P_2(A_2)\ \text{ for }A_1\in\mathcal F_1,\ A_2\in\mathcal F_2.$$
Any random variables $X_1$ on $(\Omega_1,\mathcal F_1,\mathsf P_1)$ and $X_2$ on $(\Omega_2,\mathcal F_2,\mathsf P_2)$ can be extended to the common space $(\Omega,\mathcal F,\mathsf P)$ by defining $\tilde X_1(\omega_1,\omega_2):=X_1(\omega_1)$ and $\tilde X_2(\omega_1,\omega_2):=X_2(\omega_2)$; it's easy to show that $\{\tilde X_j\}$ are independent and have the same marginal distributions as $\{X_j\}$. Thus, independent random variables do exist with arbitrary distributions. The same construction extends to countable families.

6.6 Fubini's Theorem

We now consider how to evaluate probabilities and integrals on product spaces. For any $\mathcal F$-measurable random variable $X:\Omega_1\times\Omega_2\to S$ ($S$ would be $\mathbb R$, for real-valued RVs, but could also be $\mathbb R^n$ or any complete separable metric space), and for any $\omega_2\in\Omega_2$, the (second) section of $X$ is the $(\Omega_1,\mathcal F_1,\mathsf P_1)$ random variable $X_{\omega_2}:\Omega_1\to S$ defined by $X_{\omega_2}(\omega_1):=X(\omega_1,\omega_2)$. It is not quite obvious, but true, that $X_{\omega_2}$ is $\mathcal F_1$-measurable: show this first for indicator random variables $X=1_{A_1\times A_2}$ of product sets, then extend by a $\pi$-$\lambda$ argument to indicators $X=1_A$ of sets $A\in\mathcal F$, then to simple RVs by linearity, then to the nonnegative RVs $X^+$ and $X^-$ for an arbitrary $\mathcal F$-measurable $X$ by monotone limits. Similarly, the first section $X_{\omega_1}(\cdot):=X(\omega_1,\cdot)$ is $\mathcal F_2$-measurable for each $\omega_1\in\Omega_1$. Finally: Fubini's theorem gives conditions (namely, that either $X\ge0$ or $\mathsf E|X|<\infty$) to guarantee that these three integrals are meaningful and equal:
$$\int_{\Omega_2}\Big\{\int_{\Omega_1}X_{\omega_2}\,d\mathsf P_1\Big\}\,d\mathsf P_2\ \overset{?}{=}\ \int_\Omega X\,d\mathsf P\ \overset{?}{=}\ \int_{\Omega_1}\Big\{\int_{\Omega_2}X_{\omega_1}\,d\mathsf P_2\Big\}\,d\mathsf P_1\qquad(1)$$
To prove this, first note that it's true for indicators $X=1_{A_1\times A_2}$ of the $\pi$-system of measurable rectangles $(A_1\times A_2)$ with each $A_j\in\mathcal F_j$; then verify that the class $\mathcal C$ of events $A\in\mathcal F$ for which it holds for $X=1_A$ is a $\lambda$-system. By Dynkin's $\pi$-$\lambda$ theorem it follows that $\mathcal F\subset\mathcal C$, so (1) holds for all indicators $X=1_A$ of events $A\in\mathcal F$, hence for all nonnegative simple functions in $\mathcal E^+$, and finally for all nonnegative $\mathcal F$-measurable $X$ by the MCT. For $X\in L_1$, apply this result separately to $X^+$ and $X^-$.
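Here is a small finite illustration of the product construction (Python; the two toy spaces and their probabilities are my choices, made only so the check is exact). It builds $(\Omega,\mathcal F,\mathsf P)$ from two finite spaces, extends $X_1$ and $X_2$ to the product as above, and verifies both that the marginals are preserved and that the extended variables are independent.

```python
from itertools import product
from fractions import Fraction as F

# Two finite probability spaces (Omega_j, P_j) and a random variable on each.
P1 = {'a': F(1, 4), 'b': F(3, 4)}                   # Omega_1 = {a, b}
P2 = {'u': F(1, 3), 'v': F(1, 2), 'w': F(1, 6)}     # Omega_2 = {u, v, w}
X1 = {'a': 0, 'b': 1}
X2 = {'u': -1, 'v': 0, 'w': 2}

# Product space: P(omega1, omega2) = P1(omega1) * P2(omega2).
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}
assert sum(P.values()) == 1

# Extensions to the product space depend only on their own coordinate.
X1_ext = lambda w: X1[w[0]]
X2_ext = lambda w: X2[w[1]]

def prob(event):
    return sum(p for w, p in P.items() if event(w))

# Marginals are preserved, and the extended variables are independent.
for x in set(X1.values()):
    assert prob(lambda w: X1_ext(w) == x) == sum(p for s, p in P1.items() if X1[s] == x)
    for y in set(X2.values()):
        joint = prob(lambda w: X1_ext(w) == x and X2_ext(w) == y)
        assert joint == prob(lambda w: X1_ext(w) == x) * prob(lambda w: X2_ext(w) == y)
print("product construction checks out")
```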

Fubini's theorem applies more generally. Each probability measure $\mathsf P_j$ may be replaced by an arbitrary $\sigma$-finite measure; or one of them (say, $\mathsf P_2$) may be replaced by a measurable kernel$^1$ $K(\omega_1,d\omega_2)$ that is a $\sigma$-finite measure $K(\omega_1,\cdot)$ on $\mathcal F_2$ in its second variable for each fixed $\omega_1$, and an $\mathcal F_1$-measurable function $K(\cdot,B_2)$ in its first variable for each fixed $B_2\in\mathcal F_2$. Now Fubini's Theorem asserts (under positivity or $L_1$ conditions) the equality of the integral of $X$ with respect to the measure $\mathsf P(d\omega_1\,d\omega_2)=\mathsf P_1(d\omega_1)\,K(\omega_1,d\omega_2)$ to the iterated integrals
$$\int_{\Omega_2}\nu_X(d\omega_2)=\int_\Omega X\,d\mathsf P=\int_{\Omega_1}\Big\{\int_{\Omega_2}X_{\omega_1}(\omega_2)\,K(\omega_1,d\omega_2)\Big\}\,\mathsf P_1(d\omega_1)$$
for the measure on $\mathcal F_2$ given by $\nu_X(d\omega_2):=\int_{\Omega_1}X_{\omega_2}(\omega_1)\,K(\omega_1,d\omega_2)\,\mathsf P_1(d\omega_1)$.

As an easy consequence (take $\Omega_1=\mathbb N$ with counting measure $\mathsf P_1(B_1):=\#\{B_1\}$, and let $K(n,B_2)=\mathsf P(X_n\in B_2)$ be the distribution of $X_n$ on $\Omega_2=\mathbb R$), for any sequence of random variables we may exchange summation and expectation and conclude
$$\mathsf E\Big\{\sum X_n\Big\}\ \overset{?}{=}\ \sum\{\mathsf E X_n\}$$
whenever each $X_n\ge0$ or when $\sum\mathsf E|X_n|<\infty$, but otherwise equality may fail.

For an example where interchanging integration order fails, integrate by parts to verify:
$$\int_0^1\Big\{\int_0^1\frac{y^2-x^2}{(x^2+y^2)^2}\,dx\Big\}\,dy=\int_0^1\frac{dy}{1+y^2}=+\frac\pi4,\qquad \int_0^1\Big\{\int_0^1\frac{y^2-x^2}{(x^2+y^2)^2}\,dy\Big\}\,dx=\int_0^1\frac{-\,dx}{1+x^2}=-\frac\pi4.$$
As expected in light of Fubini's Theorem, the integrand isn't nonnegative, nor is it in $L_1$:
$$\int_0^1\!\!\int_0^1\frac{|y^2-x^2|}{(x^2+y^2)^2}\,dx\,dy\ \ge\ \int_0^{\pi/2}\!\!\int_0^1\frac{r^2\,|\sin^2\theta-\cos^2\theta|}{r^4}\,r\,dr\,d\theta\ \ge\ \Big(\int_0^{\pi/2}\cos^2(2\theta)\,d\theta\Big)\Big(\int_0^1 r^{-1}\,dr\Big)=(\pi/4)(\infty)=\infty.$$

$^1$ Measurable kernels come up all the time when studying conditional distributions (as you'll see in week 9 of this course) and, in particular, Markov chains and processes.
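The two iterated integrals in the counterexample can be checked symbolically. A minimal sketch (Python, assuming SymPy is available; the variable names are mine):

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = (y**2 - x**2) / (x**2 + y**2)**2

# Inner integral in x first: int_0^1 f dx = 1/(1 + y^2), then integrate in y.
I_xy = sp.integrate(sp.integrate(f, (x, 0, 1)), (y, 0, 1))
# Inner integral in y first: int_0^1 f dy = -1/(1 + x^2), then integrate in x.
I_yx = sp.integrate(sp.integrate(f, (y, 0, 1)), (x, 0, 1))

print(I_xy, I_yx)   # pi/4 and -pi/4: the two orders disagree, consistent with
                    # the integrand being neither nonnegative nor in L^1.
```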

6.6.1 A Simple but Useful Consequence of Fubini

For any $p>0$ and any random variable $X$,
$$\mathsf E|X|^p=\mathsf E\Big[\int_0^{|X|}p\,x^{p-1}\,dx\Big]=\mathsf E\Big[\int_0^\infty 1_{\{|X|>x\}}\,p\,x^{p-1}\,dx\Big]=\int_0^\infty\mathsf E\big[1_{\{|X|>x\}}\big]\,p\,x^{p-1}\,dx=\int_0^\infty p\,x^{p-1}\,\mathsf P\big[|X|>x\big]\,dx.$$
It follows that
$$X\in L_p\iff \mathsf E|X|^p<\infty\iff \sum_n n^{p-1}\,\mathsf P[|X|>n]<\infty.$$
The case $p=1$ is easiest and most important: if $S:=\sum_{n=0}^\infty\mathsf P[|X|>n]<\infty$, then $X\in L_1$ with $\mathsf E|X|\le S\le\mathsf E|X|+1$. If $X$ takes on only nonnegative integer values then $\mathsf E X=S$. For any $\epsilon>0$,
$$\epsilon\sum_{n=1}^\infty\mathsf P[|X|>n\epsilon]\ \le\ \mathsf E|X|\ \le\ \epsilon\sum_{n=0}^\infty\mathsf P[|X|>n\epsilon].$$

6.7 Hoeffding's Inequality

If $\{X_j\}$ are independent and (individually) bounded, so $(\forall j\in\mathbb N)(\exists\,\{a_j,b_j\})$ for which $\mathsf P[a_j\le X_j\le b_j]=1$, then $(\forall c>0)$, $S_n:=\sum_{j=1}^n X_j$ satisfies
$$\mathsf P\big[(S_n-\mathsf E S_n)\ge c\big]\ \le\ \exp\Big(-2c^2\Big/\sum_{j=1}^n(b_j-a_j)^2\Big).$$
If the $X_j$ are iid and bounded by $|X_j|\le1$, e.g., then $\mathsf P[(\bar X_n-\mu)\ge\epsilon]\le e^{-n\epsilon^2/2}$.

Wassily Hoeffding proved this improvement on Chebychev's inequality for $L_\infty$ random variables in 1963 at UNC. It follows from Hoeffding's Lemma: $\mathsf E[e^{\lambda(X_j-\mu_j)}]\le\exp\big(\lambda^2(b_j-a_j)^2/8\big)$, proved in turn from Jensen's inequality and Taylor's theorem (with remainder). The importance is that the bound decreases exponentially in $n$ as $n\to\infty$, while the Chebychev bound only decreases like a power of $n$. The price is that the $\{X_j\}$ must be bounded in $L_\infty$, not merely in $L_2$. See also the related and earlier Bernstein inequality (1937), Chernoff bounds (1952), and Azuma's inequality (1967).
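The $p=1$ identity $\mathsf E X=\sum_{n\ge0}\mathsf P[X>n]$ for nonnegative integer-valued $X$ is easy to check numerically. A small sketch (Python with SciPy; the Poisson example and its mean are my choices for illustration):

```python
import numpy as np
from scipy import stats

# X ~ Poisson(3) takes nonnegative integer values, so E[X] = sum_{n>=0} P[X > n].
mu = 3.0
X = stats.poisson(mu)

n = np.arange(0, 200)          # truncate the tail sum; the remainder is negligible
tail_sum = np.sum(X.sf(n))     # sf(n) = P[X > n]
print(tail_sum, X.mean())      # both approximately 3.0
```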

Here's a proof for the important special case of $X_j=\pm1$ with probability $1/2$ each (and hence $\mu=0$):
$$\mathsf P[\bar X_n\ge\epsilon]=\mathsf P[S_n\ge n\epsilon]=\mathsf P\big[e^{\lambda S_n}\ge e^{n\lambda\epsilon}\big]\quad\text{for any }\lambda>0$$
$$\le\ \mathsf E\,e^{\lambda S_n}\,e^{-n\lambda\epsilon}\qquad\text{by Markov's inequality}$$
$$=\ \big\{\tfrac12e^\lambda+\tfrac12e^{-\lambda}\big\}^n\,e^{-n\lambda\epsilon}\qquad\text{by independence}$$
$$\le\ \big\{e^{\lambda^2/2}\big\}^n\,e^{-n\lambda\epsilon}=\exp\big(n\lambda^2/2-n\lambda\epsilon\big)\qquad\text{see footnote}^2$$
The exponent is minimized at $\lambda=\epsilon$, so $\mathsf P[\bar X_n\ge\epsilon]\le\exp\big(n\epsilon^2/2-n\epsilon\cdot\epsilon\big)=e^{-n\epsilon^2/2}$. The general case isn't much harder, but proving that $\mathsf E\,e^{\lambda X}\le e^{\lambda^2/2}$ is a bit more delicate.

By Borel-Cantelli it follows from Hoeffding's inequality that $(\bar X_n-\mu)>\epsilon$ only finitely many times for each $\epsilon>0$, if $\{X_n\}\subset L_\infty$ are iid, leading to our first Strong Law of Large Numbers: $\mathsf P[\bar X_n\to\mu]=1$ (why does this follow?). Note that Chebychev's inequality only guarantees the algebraic bound $\mathsf P[|\bar X_n|\ge\epsilon]\le 1/n\epsilon^2$, instead of Hoeffding's exponential bound. Since $1/n\epsilon^2$ isn't summable in $n$, Chebychev's bound isn't strong enough to prove a strong LLN, but Hoeffding's is.

Hoeffding's inequality is now used commonly in computer science and machine learning, applied to indicators of Bernoulli events (or, equivalently, to binomial random variables), where it gives the bound
$$\mathsf P\big[(p-\epsilon)n\le S_n\le(p+\epsilon)n\big]=\mathsf P\big[|\bar X_n-p|\le\epsilon\big]\ \ge\ 1-2e^{-2\epsilon^2 n}$$
for $S_n=\sum_{j\le n}X_j\sim\mathsf{Bi}(n,p)$ with iid $X_j\sim\mathsf{Bi}(1,p)$ (Bernoulli) variables, showing exponential concentration of probability around the mean. This is far stronger than Chebychev's bound of $\mathsf P\big[|\bar X_n-p|\le\epsilon\big]\ge 1-p(1-p)/\epsilon^2 n$.

$^2$ $\cosh(\lambda)=\tfrac12e^\lambda+\tfrac12e^{-\lambda}=\sum_{k\ge0}\dfrac{\lambda^{2k}}{(2k)!}\le\sum_{k\ge0}\dfrac{(\lambda^2)^k}{2^k\,k!}=e^{\lambda^2/2}$

Last edited: September 28, 2015
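To see the exponential-versus-algebraic contrast numerically, here is a short Monte Carlo sketch (Python; the sample sizes, $\epsilon$, and number of replications are my choices). It estimates $\mathsf P[\bar X_n\ge\epsilon]$ for iid $\pm1$ coin flips and compares it with the Hoeffding bound $e^{-n\epsilon^2/2}$ and the Chebychev bound $1/n\epsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
eps, reps = 0.1, 1_000_000

for n in (100, 400, 1600):
    # S_n = sum of n iid +/-1 flips = 2 * Binomial(n, 1/2) - n; xbar = S_n / n.
    s_n = 2.0 * rng.binomial(n, 0.5, size=reps) - n
    empirical = np.mean(s_n / n >= eps)
    hoeffding = np.exp(-n * eps**2 / 2)     # exponential in n
    chebychev = 1.0 / (n * eps**2)          # only algebraic in n (Var X_j = 1)
    print(f"n={n:5d}  P[xbar >= eps] ~ {empirical:.5f}   "
          f"Hoeffding {hoeffding:.5f}   Chebychev {chebychev:.3f}")
```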