Probability and Measure
Probability and Measure
Robert L. Wolpert
Institute of Statistics and Decision Sciences
Duke University, Durham, NC, USA

Convergence of Random Variables

1. Convergence Concepts

1.1. Convergence of Real Numbers

A sequence of real numbers $a_n$ converges to a limit $a$ if and only if, for each $\epsilon > 0$, the sequence eventually lies within a ball of radius $\epsilon$ centered at $a$. It's okay if the first few (or few million) terms lie outside that ball, and the number of terms that do lie outside the ball may depend on how big $\epsilon$ is (if $\epsilon$ is small enough it may take millions of terms before the remaining sequence lies inside the ball). This can be made mathematically precise by introducing a letter (say, $N_\epsilon$) for how many initial terms we have to throw away, so that $a_n \to a$ if and only if there is an $N_\epsilon < \infty$ so that, for each $n \ge N_\epsilon$, $|a_n - a| < \epsilon$: only finitely many $a_n$ can be farther than $\epsilon$ from $a$.

The same notion of convergence works in any (complete) metric space, where we require that some measure of the distance $d(a_n, a)$ from $a_n$ to $a$ tend to zero, in the sense that it exceeds each number $\epsilon > 0$ for at most some finite number $N_\epsilon$ of terms. Points $a_n$ in $d$-dimensional Euclidean space converge to a limit $a \in \mathbb{R}^d$ if and only if each of their coordinates converges; and, since there are only finitely many coordinates, if they all converge then they do so uniformly (i.e., for each $\epsilon$ we can take the same $N_\epsilon$ for all $d$ of the coordinate sequences).

2. Convergence of Random Variables

For random variables $X_n$ the idea of convergence to a limiting random variable $X$ is more delicate, since each $X_n$ is a function of $\omega \in \Omega$ and usually
there are infinitely many points $\omega \in \Omega$. What should we mean in asking about the convergence of a sequence $X_n$ of random variables to a limit $X$? Should we mean that $X_n(\omega)$ converges to $X(\omega)$ for each fixed $\omega$? Or that these sequences converge uniformly in $\omega \in \Omega$? Or that some notion of the distance $d(X_n, X)$ between $X_n$ and the limit $X$ decreases to zero? Should the probability measure $\mathsf{P}$ be involved in some way?

Here are a few different choices of what we might mean by the statement that $X_n$ converges to $X$, for a sequence of random variables $X_n$ and a random variable $X$, all defined on the same probability space $(\Omega, \mathcal{F}, \mathsf{P})$:

pw: The sequence of real numbers $X_n(\omega) \to X(\omega)$ for every $\omega \in \Omega$ (pointwise convergence):
    $(\forall \epsilon > 0)\,(\forall \omega \in \Omega)\,(\exists N_{\epsilon,\omega} < \infty)\,(\forall n \ge N_{\epsilon,\omega})\quad |X_n(\omega) - X(\omega)| < \epsilon.$

uni: The sequences of real numbers $X_n(\omega) \to X(\omega)$ uniformly for $\omega \in \Omega$:
    $(\forall \epsilon > 0)\,(\exists N_\epsilon < \infty)\,(\forall \omega \in \Omega)\,(\forall n \ge N_\epsilon)\quad |X_n(\omega) - X(\omega)| < \epsilon.$

a.s.: Outside some null event $N \in \mathcal{F}$, each sequence of real numbers $X_n(\omega) \to X(\omega)$ (almost-sure convergence, or almost everywhere (a.e.)): for some $N \in \mathcal{F}$ with $\mathsf{P}[N] = 0$,
    $(\forall \epsilon > 0)\,(\forall \omega \notin N)\,(\exists N_{\epsilon,\omega} < \infty)\,(\forall n \ge N_{\epsilon,\omega})\quad |X_n(\omega) - X(\omega)| < \epsilon,$
i.e.,
    $\mathsf{P}\bigl[\bigcup_{\epsilon>0} \bigcap_{N<\infty} \bigcup_{n\ge N} \{|X_n(\omega) - X(\omega)| \ge \epsilon\}\bigr] = 0.$

$L_\infty$: Outside some null event $N \in \mathcal{F}$, the sequences of real numbers $X_n(\omega) \to X(\omega)$ converge uniformly ("almost-uniform" or $L_\infty$ convergence): for some $N \in \mathcal{F}$ with $\mathsf{P}[N] = 0$,
    $(\forall \epsilon > 0)\,(\exists N_\epsilon < \infty)\,(\forall \omega \notin N)\,(\forall n \ge N_\epsilon)\quad |X_n(\omega) - X(\omega)| < \epsilon.$

i.p.: For each $\epsilon > 0$, the probabilities $\mathsf{P}[|X_n - X| > \epsilon] \to 0$ (convergence in probability, or "in measure"):
    $(\forall \epsilon > 0)\,(\forall \eta > 0)\,(\exists N_{\epsilon,\eta} < \infty)\,(\forall n \ge N_{\epsilon,\eta})\quad \mathsf{P}[|X_n - X| > \epsilon] < \eta.$

$L_1$: The expectation $\mathsf{E}|X_n - X|$ converges to zero (convergence in $L_1$):
    $(\forall \epsilon > 0)\,(\exists N_\epsilon < \infty)\,(\forall n \ge N_\epsilon)\quad \mathsf{E}|X_n - X| < \epsilon.$
$L_p$: For some fixed number $p > 0$, the expectation of the $p$th power $\mathsf{E}|X_n - X|^p$ converges to zero (convergence in $L_p$, sometimes called "in the $p$th mean"):
    $(\forall \epsilon > 0)\,(\exists N_\epsilon < \infty)\,(\forall n \ge N_\epsilon)\quad \mathsf{E}|X_n - X|^p < \epsilon.$

i.d.: The distributions of $X_n$ converge to the distribution of $X$, i.e., the measures $\mathsf{P} \circ X_n^{-1}$ converge in some way to $\mathsf{P} \circ X^{-1}$ ("vague" or "weak" convergence, or convergence "in distribution", sometimes written $X_n \Rightarrow X$):
    $(\forall \epsilon > 0)\,(\forall \varphi \in C_b(\mathbb{R}))\,(\exists N_{\epsilon,\varphi} < \infty)\,(\forall n \ge N_{\epsilon,\varphi})\quad |\mathsf{E}\,\varphi(X_n) - \mathsf{E}\,\varphi(X)| < \epsilon.$

Which of these eight notions of convergence is right for random variables? The answer is that all of them are useful in probability theory for one purpose or another. You will want to know which ones imply which other ones, and under what conditions. All but the first two (pointwise, uniform) notions depend upon the measure $\mathsf{P}$; it is possible for a sequence $X_n$ to converge to $X$ in any of these senses for one probability measure $\mathsf{P}$, but to fail to converge for another $\mathsf{P}$. Most of them can be phrased as metric convergence for some notion of distance between random variables:

i.p.: $X_n \to X$ in probability if and only if $d_0(X, X_n) \to 0$ as real numbers, where
    $d_0(X, Y) \equiv \mathsf{E}\Bigl[\dfrac{|X - Y|}{1 + |X - Y|}\Bigr]$

$L_1$: $X_n \to X$ in $L_1$ if and only if $d_1(X, X_n) = \|X - X_n\|_1 \to 0$ as real numbers, where
    $\|X - Y\|_1 \equiv \mathsf{E}|X - Y|$

$L_p$: $X_n \to X$ in $L_p$ if and only if $d_p(X, X_n) = \|X - X_n\|_p \to 0$ as real numbers, where
    $\|X - Y\|_p \equiv (\mathsf{E}|X - Y|^p)^{1/p}$

$L_\infty$: $X_n \to X$ almost uniformly if and only if $d_\infty(X, X_n) = \|X - X_n\|_\infty \to 0$ as real numbers, where
    $\|X - Y\|_\infty \equiv \mathrm{l.u.b.}\{r < \infty : \mathsf{P}[|X - Y| > r] > 0\}$

As the notation suggests, convergence in probability and in $L_\infty$ are in some sense limits of convergence in $L_p$ as $p \to 0$ and $p \to \infty$, respectively. Almost-sure convergence is an exception: there is no metric notion of distance $d(X, Y)$ for which $X_n \to X$ almost surely if and only if $d(X, X_n) \to 0$.
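To make the four distances concrete, here is a small hand computation, a hypothetical illustration (not from the notes) using the indicator $X_n = 1_{(0,1/n]}$ on the unit interval with Lebesgue measure, against the limit $X = 0$: $d_0$ and the $L_p$ norms shrink with $n$, while the $L_\infty$ norm does not.

```python
# Distances to 0 for a two-valued RV X = height * 1_A with P[A] = prob
# (hypothetical illustration of d_0, ||.||_p, ||.||_inf).

def d0(height, prob):
    # d_0(X, 0) = E[|X| / (1 + |X|)] = prob * height / (1 + height)
    return prob * height / (1.0 + height)

def lp_norm(height, prob, p):
    # ||X||_p = (E|X|^p)^(1/p) = height * prob^(1/p)
    return height * prob ** (1.0 / p)

def linf_norm(height, prob):
    # ||X||_inf = least upper bound of values taken with positive probability
    return height if prob > 0 else 0.0

# X_n = 1_{(0,1/n]}: d_0 and ||X_n||_2 shrink, ||X_n||_inf stays 1.
for n in (10, 100, 1000):
    print(n, d0(1.0, 1.0 / n), lp_norm(1.0, 1.0 / n, 2), linf_norm(1.0, 1.0 / n))
```

This matches the claim that $X_n \to 0$ in probability and in every $L_p$, $p < \infty$, but not in $L_\infty$.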
2.1. Almost-Sure Convergence

Let $\{X_n\}$ and $X$ be a collection of RVs on some $(\Omega, \mathcal{F}, \mathsf{P})$. The set of points $\omega$ for which $X_n(\omega)$ does converge to $X(\omega)$ is just
    $\bigcap_{\epsilon>0} \bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} [\omega : |X_n(\omega) - X(\omega)| \le \epsilon],$
the points which, for all $\epsilon > 0$, have $|X_n(\omega) - X(\omega)|$ no more than $\epsilon$ for all but finitely many $n$. The sequence $X_n$ is said to converge almost everywhere (a.e.) to $X$, or to converge to $X$ almost surely (a.s.), if this set of $\omega$ has probability one, or (equivalently) if its complement is a null set:
    $\mathsf{P}\Bigl[\bigcup_{\epsilon>0} \bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} [\omega : |X_n(\omega) - X(\omega)| > \epsilon]\Bigr] = 0.$
The union over $\epsilon > 0$ is only a countable one, since we need include only rational $\epsilon$ (or, for that matter, any sequence $\epsilon_k$ tending to zero, such as $\epsilon_k = 1/k$). Thus $X_n \to X$ a.e. if and only if, for each $\epsilon > 0$,
    $\mathsf{P}\Bigl[\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} [\omega : |X_n(\omega) - X(\omega)| > \epsilon]\Bigr] = 0.$    (a.e.)

This combination of intersection and union occurs frequently in probability, and has a name: for any sequence $E_n$ of events, $\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} E_n$ is called the "lim sup" of the $\{E_n\}$, and is sometimes described more colorfully as $[E_n\ \text{i.o.}]$, the set of points in $E_n$ "infinitely often". Its complement is the "lim inf" of the sets $F_n = E_n^c$, namely $\bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} F_n$: the set of points in all but finitely many of the $F_n$. Since $\mathsf{P}$ is countably additive, and since the unions $\bigcup_{n=m}^{\infty} E_n$ in the definition of lim sup decrease and the intersections $\bigcap_{n=m}^{\infty} F_n$ in the definition of lim inf increase as $m$ grows, we always have $\mathsf{P}[\bigcup_{n=m}^{\infty} E_n] \to \mathsf{P}[\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} E_n]$ and $\mathsf{P}[\bigcap_{n=m}^{\infty} F_n] \to \mathsf{P}[\bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} F_n]$ as $m \to \infty$. Thus,

Theorem 1. $X_n \to X$ $\mathsf{P}$-a.s. if and only if, for every $\epsilon > 0$,
    $\lim_{m\to\infty} \mathsf{P}\bigl[|X_n - X| > \epsilon\ \text{for some}\ n \ge m\bigr] = 0.$

In particular, $X_n \to X$ $\mathsf{P}$-a.s. if $\sum_n \mathsf{P}[|X_n - X| > \epsilon] < \infty$ for each $\epsilon > 0$ (why?).
2.2. Convergence In Probability

The sequence $X_n$ is said to converge to $X$ in probability (i.p.) if, for each $\epsilon > 0$,
    $\mathsf{P}[\omega : |X_n(\omega) - X(\omega)| > \epsilon] \to 0.$    (i.p.)

If we denote by $E_n$ the event $[\omega : |X_n(\omega) - X(\omega)| > \epsilon]$, we see that convergence almost surely requires that $\mathsf{P}[\bigcup_{n \ge m} E_n] \to 0$ as $m \to \infty$, while convergence in probability requires only that $\mathsf{P}[E_n] \to 0$. Thus:

Theorem 2. If $X_n \to X$ a.e. then $X_n \to X$ i.p.

Here is a partial converse:

Theorem 3. If $X_n \to X$ i.p., then there is a subsequence $n_k$ such that $X_{n_k} \to X$ a.e.

Proof. Set $n_0 = 0$ and, for each integer $k \ge 1$, set
    $n_k = \inf\Bigl\{n > n_{k-1} : \mathsf{P}\bigl[\omega : |X_n(\omega) - X(\omega)| > \tfrac{1}{k}\bigr] \le 2^{-k}\Bigr\}.$
For any $\epsilon > 0$ we have $\frac{1}{k} < \epsilon$ eventually (namely, for $k > k_0 = \lceil \frac{1}{\epsilon} \rceil$), and for each $m > k_0$,
    $\mathsf{P}\Bigl[\bigcup_{k=m}^{\infty} [\omega : |X_{n_k}(\omega) - X(\omega)| > \epsilon]\Bigr] \le \mathsf{P}\Bigl[\bigcup_{k=m}^{\infty} \bigl[\omega : |X_{n_k}(\omega) - X(\omega)| > \tfrac{1}{k}\bigr]\Bigr]$
    $\le \sum_{k=m}^{\infty} \mathsf{P}\bigl[\omega : |X_{n_k}(\omega) - X(\omega)| > \tfrac{1}{k}\bigr] \le \sum_{k=m}^{\infty} 2^{-k} = 2^{1-m} \to 0$
as $m \to \infty$.
A Counter-Example

If $X_n \to X$ a.e. implies $X_n \to X$ i.p., and if the converse holds at least along subsequences, are the two notions really identical? Or is it possible for RVs $X_n$ to converge to $X$ i.p., but not a.e.? The answer is that the two notions are different, and that a.e. convergence is strictly stronger than convergence i.p. Here's an example:

Let $(\Omega, \mathcal{F}, \mathsf{P})$ be the unit interval with Borel sets and Lebesgue measure. Define a sequence of random variables $X_n : \Omega \to \mathbb{R}$ by
    $X_n(\omega) = \begin{cases} 1 & \text{if}\ \frac{i}{2^j} < \omega \le \frac{i+1}{2^j} \\ 0 & \text{otherwise} \end{cases}$
where $n = i + 2^j$, $0 \le i < 2^j$. Each $X_n$ is one on an interval of length $2^{-j}$, where $j = \lfloor \log_2 n \rfloor$; since $\frac{1}{n} \le 2^{-j} < \frac{2}{n}$,
    $\mathsf{P}[|X_n| > \epsilon] = 2^{-j} < \frac{2}{n} \to 0$
for each $0 < \epsilon < 1$, so $X_n \to 0$ i.p. On the other hand, for every $j > 0$ we have
    $\Omega = \bigcup_{i=0}^{2^j - 1} \Bigl(\frac{i}{2^j}, \frac{i+1}{2^j}\Bigr] = \bigcup_{n=2^j}^{2^{j+1}-1} \bigl[\omega : X_n(\omega) = 1\bigr],$
so $[\omega : X_n(\omega) \to 0]$ is empty, not a set of probability one! Obviously $X_n$ does not converge a.e.

This example is a building-block for several examples to come, so it is worth getting to know well. Try to verify that $X_n \to 0$ in probability and in $L_p$ but not almost surely. What is $\|X_n\|_p$? Why doesn't $X_n \to 0$ a.s.? What would happen if we multiplied $X_n$ by $n$? By $n^2$? What about the subsequence $Y_n = X_{2^n}$? Does $X_n$ converge in $L_\infty$?

3. Cauchy Convergence

Sometimes we wish to consider a sequence $X_n$ that converges to some limit $X$, perhaps without knowing $X$ in advance; the concept of Cauchy convergence is ideal for this. For any of the distance measures $d_p$ above, with $0 \le p \le \infty$, say $X_n$ is a Cauchy sequence in $L_p$ if
    $(\forall \epsilon > 0)\,(\exists N < \infty)\,(\forall n \ge m \ge N)\quad d_p(X_m, X_n) < \epsilon.$
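The moving-indicator construction above can be sketched numerically; this is a minimal illustration of the counterexample, checking both halves of the claim.

```python
# The counterexample X_n = 1_{(i/2^j, (i+1)/2^j]} with n = i + 2^j:
# P[X_n = 1] = 2^{-j} -> 0 (convergence in probability), yet for each fixed
# omega there is exactly one n in every dyadic block with X_n(omega) = 1,
# so the sequence X_n(omega) never converges to 0.

import math

def X(n, omega):
    """Indicator of (i/2^j, (i+1)/2^j], where n = i + 2^j, 0 <= i < 2^j."""
    j = int(math.log2(n))
    i = n - 2 ** j
    return 1 if i / 2 ** j < omega <= (i + 1) / 2 ** j else 0

# P[X_n = 1] = 2^{-j} shrinks along n = 4, 64, 1024 (j = 2, 6, 10).
probs = [2.0 ** -int(math.log2(n)) for n in (4, 64, 1024)]
print(probs)

# But for a fixed omega, each block {2^j, ..., 2^{j+1}-1} scores exactly one hit,
# so limsup_n X_n(omega) = 1 and X_n(omega) does not converge to 0.
omega = 0.3
hits_per_block = [sum(X(n, omega) for n in range(2 ** j, 2 ** (j + 1)))
                  for j in range(1, 11)]
print(hits_per_block)
```

The "one hit per block" pattern is exactly why $[\omega : X_n(\omega) \to 0]$ is empty even though $\mathsf{P}[X_n = 1] \to 0$.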
The spaces $L_p$ for $0 \le p \le \infty$ are all complete, in the sense that if $X_n$ is Cauchy for $d_p$ then there exists $X \in L_p$ for which $d_p(X_n, X) \to 0$. To see this, take an increasing subsequence $N_k$ along which $d_p(X_m, X_n) < 2^{-k}$ for $n \ge m \ge N_k$, and set $X_0 = 0$ and $N_0 = 0$; set $Y_k \equiv X_{N_k} - X_{N_{k-1}}$. Check to confirm that $\sum_{k=1}^{\infty} Y_k$ converges a.s., to some limit $X \in L_p$ with $d_p(X_n, X) \to 0$.

4. Uniform Integrability

Let $Y \ge 0$ be integrable on some probability space $(\Omega, \mathcal{F}, \mathsf{P})$, $\mathsf{E}[Y] = \int_\Omega Y\,d\mathsf{P} < \infty$; it follows (from DCT or MCT, for example) that
    $\lim_{t\to\infty} \mathsf{E}\bigl[Y 1_{[Y>t]}\bigr] = \lim_{t\to\infty} \int_{[\omega : Y(\omega)>t]} Y\,d\mathsf{P} = 0$
and, consequently, that for any sequence of random variables $X_n$ dominated by $Y$ in the sense that $|X_n| \le Y$ a.s.,
    $\lim_{t\to\infty} \mathsf{E}\bigl[|X_n| 1_{[|X_n|>t]}\bigr] = \lim_{t\to\infty} \int_{[\omega : |X_n(\omega)|>t]} |X_n|\,d\mathsf{P} \le \lim_{t\to\infty} \int_{[\omega : Y(\omega)>t]} Y\,d\mathsf{P} = 0,$
uniformly in $n$. Call the sequence $X_n$ uniformly integrable (or simply UI) if $\mathsf{E}[|X_n| 1_{[|X_n|>t]}] \to 0$ uniformly in $n$, even if it is not dominated by a single integrable random variable $Y$. The big result is:

Theorem 4. If $X_n \to X$ i.p. and if $\{X_n\}$ is UI then $X_n \to X$ in $L_1$.

Proof. Without loss of generality take $X \equiv 0$. Fix any $\epsilon > 0$; find (by UI) $t_\epsilon > 0$ such that $\mathsf{E}[|X_n| 1_{[|X_n|>t_\epsilon]}] \le \epsilon$ for all $n$. Now find (by $X_n \to X$ i.p.) $N_\epsilon \in \mathbb{N}$ such that, for $n \ge N_\epsilon$, $\mathsf{P}[|X_n| > \epsilon] < \epsilon/t_\epsilon$; then:
    $\mathsf{E}|X_n| = \int_{[|X_n| \le \epsilon]} |X_n|\,d\mathsf{P} + \int_{[\epsilon < |X_n| \le t_\epsilon]} |X_n|\,d\mathsf{P} + \int_{[t_\epsilon < |X_n|]} |X_n|\,d\mathsf{P}$
    $\le \int_{[|X_n| \le \epsilon]} \epsilon\,d\mathsf{P} + \int_{[\epsilon < |X_n| \le t_\epsilon]} t_\epsilon\,d\mathsf{P} + \int_{[t_\epsilon < |X_n|]} |X_n|\,d\mathsf{P}$
    $\le \epsilon + t_\epsilon\,\mathsf{P}[|X_n| > \epsilon] + \epsilon \le 3\epsilon.$
Similarly, for any $p > 0$, $X_n \to X$ (i.p.) and $\{|X_n|^p\}$ UI (for example, $|X_n| \le Y \in L_p$) gives $X_n \to X$ ($L_p$). In the special case $|X_n| \le Y \in L_p$ this is just Lebesgue's Dominated Convergence Theorem (DCT). We have seen that $\{X_n\}$ is UI whenever $|X_n| \le Y \in L_1$, but UI is more general than that. Here are two more criteria:

Theorem 5. If $\{X_n\}$ is uniformly bounded in $L_p$ for some $p > 1$ then $\{X_n\}$ is UI.

Proof. Let $c \in \mathbb{R}_+$ be an upper bound for $\mathsf{E}|X_n|^p$. First recall that, by Fubini's Theorem, any random variable $X$ satisfies, for any $q > 0$,
    $\mathsf{E}|X|^q = \int_0^\infty q\,x^{q-1}\,\mathsf{P}[|X| > x]\,dx.$
We apply this with $q = 1$ to the random variables $X_n 1_{[|X_n|>t]}$ and with $q = p$ to the $X_n$, for $t > 0$. Fix any $t > 0$; then
    $\mathsf{E}\bigl[|X_n| 1_{[|X_n|>t]}\bigr] = \int_0^\infty \mathsf{P}\bigl[|X_n| 1_{[|X_n|>t]} > x\bigr]\,dx$
    $= \int_0^t \mathsf{P}[|X_n| > t]\,dx + \int_t^\infty \mathsf{P}[|X_n| > x]\,dx$
    $\le t\,\mathsf{P}[|X_n|^p > t^p] + \int_t^\infty \frac{p\,x^{p-1}}{p\,t^{p-1}}\,\mathsf{P}[|X_n| > x]\,dx$
    $\le t\,\frac{\mathsf{E}|X_n|^p}{t^p} + \frac{1}{p\,t^{p-1}} \int_0^\infty p\,x^{p-1}\,\mathsf{P}[|X_n| > x]\,dx$
    $= t^{1-p}\,(1 + p^{-1})\,\mathsf{E}|X_n|^p \le c\,t^{1-p}\,(1 + p^{-1}) \to 0$
as $t \to \infty$, uniformly in $n$.

Theorem 6. If $\{X_n\}$ is UI, then
    $(\forall \epsilon > 0)\,(\exists \delta > 0)\,(\forall A \in \mathcal{F}\ \text{with}\ \mathsf{P}(A) < \delta)\quad \mathsf{E}[|X_n| 1_A] < \epsilon.$
Conversely, if $\{X_n\}$ is uniformly bounded in $L_1$ and if $(\forall \epsilon > 0)\,(\exists \delta > 0)$ such that $\mathsf{E}[|X_n| 1_A] < \epsilon$ whenever $\mathsf{P}[A] < \delta$, then $\{X_n\}$ is UI.

Proof. Straightforward. The condition that $\{X_n\}$ be uniformly bounded in $L_1$ is unnecessary if $(\Omega, \mathcal{F}, \mathsf{P})$ is non-atomic.
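Theorem 5 and its limits can be checked by hand on indicator families; the following is a hypothetical numerical illustration (not from the notes) contrasting a family bounded in $L_2$ (hence UI, by Theorem 5) with one bounded only in $L_1$ (which is not UI).

```python
# For X = height * 1_A with P[A] = prob, E[X 1_{X > t}] = height * prob
# if height > t, and 0 otherwise; we tabulate the tail expectations.

def tail_expectation(height, prob, t):
    """E[X 1_{X > t}] for the two-valued RV X = height * 1_A, P[A] = prob."""
    return height * prob if height > t else 0.0

t = 100.0

# Family 1: X_n = sqrt(n) 1_{(0,1/n]}.  E[X_n^2] = 1 for all n (bounded in
# L_2), and sup_n E[X_n 1_{X_n > t}] = sup over n with sqrt(n) > t of
# 1/sqrt(n) <= 1/t -> 0: uniformly integrable, as Theorem 5 predicts.
ui_tails = [tail_expectation(n ** 0.5, 1.0 / n, t) for n in range(1, 100001)]
print(max(ui_tails))

# Family 2: X_n = n 1_{(0,1/n]}.  E[X_n] = 1 for all n (bounded in L_1), but
# E[X_n 1_{X_n > t}] = 1 for every n > t, so the sup never tends to zero:
# L_1-bounded does not imply UI.
non_ui_tails = [tail_expectation(n, 1.0 / n, t) for n in range(1, 100001)]
print(max(non_ui_tails))
```

The second family is the standard counterexample $X_n = n\,1_{(0,1/n]}$ cited in the summary below; it also shows why Theorem 5 needs $p > 1$.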
5. Summary: Uniform Integrability and Convergence Concepts

I. Uniform Integrability (UI)
   A. $|X_n| \le Y \in L_r$, $r > 0$, implies $\int_{[|X_n|>t]} |X_n|^r\,d\mathsf{P} \le \int_{[Y>t]} Y^r\,d\mathsf{P} \to 0$ as $t \to \infty$, uniformly in $n$. This uniform convergence to zero is the definition of UI.
   B. If $(\Omega, \mathcal{F}, \mathsf{P})$ is nonatomic, $\{X_n\}$ is UI iff $\forall \epsilon\ \exists \delta$: $\int_\Lambda |X_n|\,d\mathsf{P} < \epsilon$ whenever $\mathsf{P}[\Lambda] \le \delta$ (take $\delta = \epsilon/2t$). [If $(\Omega, \mathcal{F}, \mathsf{P})$ has atoms one must also require $\mathsf{E}|X_n| \le B$.]
   C. $\mathsf{E}|X_n|^p \le c_p < \infty$ implies $\{|X_n|^r\}$ UI for each $r < p$ ... $\delta = (\epsilon/c)^q$, $\frac{1}{p} + \frac{1}{q} = 1$, $q = \frac{p}{p-1}$.
      Remark: not for $r = p$ (counterexample: $X_n = n\,1_{(0,1/n]}$).
   D. Main result (Thm 4.5.4, p. 97): If $X_n \to X$ i.p. then: $\{|X_n|^r\}$ UI iff $X_n \to X$ in $L_r$ iff $\mathsf{E}|X_n|^r \to \mathsf{E}|X|^r$.

II. Vague Convergence
   A. $X_n \to X$ i.p. iff every subsequence $n_k$ has a further subsequence $n_{k_i}$ with $X_{n_{k_i}} \to X$ a.e. (by contradiction)
   B. $X_n \to X$ a.s. and $\varphi(x)$ continuous implies $\varphi(X_n) \to \varphi(X)$ a.s.
   C. $X_n \to X$ i.p. and $\varphi(x)$ continuous implies $\varphi(X_n) \to \varphi(X)$ i.p. (use A)
   D. Definition: $X_n \Rightarrow X$ if $\mathsf{E}\,\varphi(X_n) \to \mathsf{E}\,\varphi(X)$ for all $\varphi \in C_b(\mathbb{R})$
      1. Prop: $X_n \to X$ i.p. implies $X_n \Rightarrow X$ (use II.C)
      2. Prop: $X_n \Rightarrow X$ implies $F_n(r) \to F(r)$ wherever $F(r) = F(r-)$.
         a. Remark: Even if $X_n \Rightarrow X$, $F_n(r)$ may not converge where $F(r)$ jumps;
         b. Remark: Even if $X_n \Rightarrow X$, $f_n(r) = F_n'(r)$ may not converge to $f(r) = F'(r)$; in fact, either may fail to exist.

III. Implications among these notions: a.e., i.p., $L_r$, $L_p$, $L_\infty$, i.d. ($0 < r < p < \infty$):
   A. a.e. ⟹ i.p. (by Easy Borel-Cantelli)
      1. i.p. ⟹ a.e. along subsequences
      2. i.p. ⇏ a.e. (counterexample: $X_n(\omega) = 1_{(i/2^j, (i+1)/2^j]}(\omega)$, $n = i + 2^j$)
   B. $L_p$ ⟹ i.p. (by Chebychev's inequality)
      1. i.p. ⟹ $L_p$ under Uniform Integrability
      2. i.p. ⇏ $L_p$ (counterexample: $X_n = n^{1/p}\,1_{(0,1/n]}$)
   C. $L_p$ ⟹ $L_r$ (by Jensen's inequality)
      1. $L_r$ ⇏ $L_p$ (counterexample: $X_n = n^{1/p}\,1_{(0,1/n]}$)
   D. $L_\infty$ ⟹ $L_p$ (simple estimate)
      1. $L_p$ ⇏ $L_\infty$ (counterexample: $X_n = n^{1/2p}\,1_{(0,1/n]}$)
   E. $L_\infty$ ⟹ a.e. (uniform convergence implies pointwise convergence)
   F. i.p. ⟹ i.d. (II.D.1 above)
      1. i.d. ⇏ i.p. (counterexample: $X_n$, $X$ on different spaces)
      2. i.d. ⟹ a.s. for suitably constructed versions: $\exists (\Omega, \mathcal{F}, \mathsf{P})$ and $\tilde X_n \sim X_n$, $\tilde X \sim X$ with $\tilde X_n \to \tilde X$ a.e.
6. Infinite Coin-Toss and the Laws of Large Numbers

The traditional interpretation of the probability of an event $E$ is its asymptotic frequency: the limit as $n \to \infty$ of the fraction of $n$ repeated, similar, and independent trials in which $E$ occurs. Similarly the expectation of a random variable $X$ is taken to be its asymptotic average, the limit as $n \to \infty$ of the average of $n$ repeated, similar, and independent replications of $X$. As statisticians trying to make inference about the underlying probability distribution $f(x|\theta)$ governing observed random variables $X_i$, this suggests that we should be interested in the probability distribution, for large $n$, of quantities like the average of the RVs, $\frac{1}{n} \sum_{i=1}^n X_i$.

Three of the most celebrated theorems of probability theory concern this sum. For independent random variables $X_i$, all with the same probability distribution satisfying $\mathsf{E}|X_i|^3 < \infty$, set $\mu = \mathsf{E} X_i$, $\sigma^2 = \mathsf{E}|X_i - \mu|^2$, and $S_n = \sum_{i=1}^n X_i$. The three main results are:

Laws of Large Numbers:
    $\dfrac{S_n - n\mu}{\sigma n} \to 0$    (i.p. and a.s.)

Central Limit Theorem:
    $\dfrac{S_n - n\mu}{\sigma \sqrt{n}} \Rightarrow N(0, 1)$    (i.d.)

Law of the Iterated Logarithm:
    $\limsup_{n\to\infty}\ \pm \dfrac{S_n - n\mu}{\sigma \sqrt{2 n \log\log n}} = 1$    (a.s.)

Together these three give a clear picture of how quickly and in what sense $\frac{1}{n} S_n$ tends to $\mu$.

We begin with the Law of Large Numbers (LLN), in its weak form (asserting convergence i.p.) and in its strong form (convergence a.s.). There are several versions of both theorems. The simplest requires the $X_i$ to be IID and $L_2$; stronger results allow us to weaken (but not eliminate) the independence requirement, permit non-identical distributions, and consider what happens if the RVs are only $L_1$ (or worse!) instead of $L_2$. The text covers these things well; to complement it I am going to: (1) prove the simplest version, and with it the Borel-Cantelli theorems; and (2) show what happens with Cauchy random variables, which don't satisfy the requirements (the LLN fails).
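The LLN and CLT statements above can be seen in miniature by Monte Carlo; this is a hypothetical sketch (standard library only) for IID coin tosses, i.e. Bernoulli(1/2) with $\mu = 1/2$ and $\sigma = 1/2$.

```python
# Monte Carlo sketch of the LLN and CLT for IID Bernoulli(1/2) tosses.

import math
import random

random.seed(0)

def sample_mean(n):
    """Average of n fair coin tosses."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# LLN: the sample mean settles near mu = 0.5 as n grows.
for n in (100, 10000):
    print(n, sample_mean(n))

# CLT: (S_n - n*mu) / (sigma * sqrt(n)) should look standard normal, so
# roughly 95% of replications should land in (-1.96, 1.96).
n, reps, mu, sigma = 1000, 2000, 0.5, 0.5
z = [(n * sample_mean(n) - n * mu) / (sigma * math.sqrt(n)) for _ in range(reps)]
frac = sum(-1.96 < v < 1.96 for v in z) / reps
print(frac)
```

The observed coverage fraction hovers near the normal value 0.95, in line with the CLT; the sample means tighten around 0.5, in line with the LLN.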
12 I. Weak version, non-iid, L 2 : µ i = EX i, σ ij = E[X i µ i ][X j µ j ] A. Y n = (S n Σµ i )/n satisfies EY n = 0, EY 2 n = 1 n 2 Σ i σ ii + 2 n 2 Σ i<j σ ij ; 1. If σ ii M and σ ij 0, Chebychev = Y n 0, i.p. 2. (pairwise) IID L 2 is OK II. Strong version, non-iid, L 2 : EX i = 0, EX 2 i M, EX i X j 0. A. P[ S n > nɛ] < Mn n 2 ɛ 2 = M nɛ 2 1. P[ S n 2 > n 2 ɛ] < M, Σ n 2 ɛ 2 n P[ S n 2 > n 2 ɛ] < Mπ2 6ɛ 2 2. Borel-Cantelli: P[ S n 2 > n 2 1 ɛ i.o.] = 0, S n 2 n 2 0 a.s.. 3. D n = max n 2 k<(n+1) 2 S k S n 2, EDn 2 2nE S (n+1) 2 S n 2 4n 2 M 4. Chebychev: P[D n > n 2 ɛ] < 4n2 M n 4 ɛ, D 2 n 0 a.s.. B. S k /k S n 2 +D n 0 a.s.., QED n 2 1. Bernoulli RV s, normal number theorem, Monte Carlo. III. Weak version, pairwise-iid, L 1 A. Equivalent sequences: n P[X n Y n ] < 1. n [X n Y n ] < a.s.. 2. n i=1 [X i], a n n i=1 [X i] converge iff n i=1 [Y i], a n n i=1 [Y i] do 3. Y n = X n 1 [ Xn n] IV. Counterexamples: Cauchy, A. X i dx π[1+x 2 ] = P[ S n /n ɛ] 2 π tan 1 (ɛ) 1, WLLN fails. B. P[X i = n] = ±c, n 1; X n 2 i / L 1, and S n /n 0 i.p. or a.s.. C. P[X i = n] = ±c n 2 log n, n 3; X i / L 1, but S n /n 0 i.p. and not a.s.. D. Medians: for ANY RV s X n X i.p., then m n m if m is unique. Let X i be iid standard Cauchy RV s, with P[X 1 t] = and characteristic function E e iλx 1 = t dx π[1 + x 2 ] = π arctan(t) e iλx so S n /n has characteristic function E e iλsn/n = E e i λ n [X 1+ +X n] = dx π[1 + x 2 ] = e λ, ( ) E e i λ n n X 1 = (e λ n ) n = e λ 12
and $S_n/n$ also has the standard Cauchy distribution, with $\mathsf{P}[S_n/n \le t] = \frac{1}{2} + \frac{1}{\pi}\arctan(t)$; in particular, $S_n/n$ does not converge almost surely, or even in probability.
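The failure of the LLN for Cauchy RVs is easy to see by simulation; the following hypothetical sketch (standard library only) compares the spread of the sample mean $S_n/n$ for Cauchy draws, which never shrinks, against Uniform(0,1) draws, where it shrinks like $1/\sqrt{n}$.

```python
# S_n/n for IID standard Cauchy RVs is again standard Cauchy, so its spread
# is the same for every n; for Uniform(0,1) the LLN makes it collapse.

import math
import random

random.seed(1)

def cauchy():
    # Inverse-CDF sampling: P[X <= t] = 1/2 + arctan(t)/pi, so
    # X = tan(pi * (U - 1/2)) for U ~ Uniform(0,1).
    return math.tan(math.pi * (random.random() - 0.5))

def mean_spread(draw, n, reps):
    """Interquartile-style spread of S_n/n over many replications."""
    means = sorted(sum(draw() for _ in range(n)) / n for _ in range(reps))
    return means[3 * reps // 4] - means[reps // 4]

n, reps = 1000, 400
spread_cauchy = mean_spread(cauchy, n, reps)
spread_unif = mean_spread(random.random, n, reps)
print(spread_cauchy, spread_unif)
```

The Cauchy spread stays near the interquartile range of the standard Cauchy distribution (which is 2) no matter how large $n$ is, while the uniform spread is tiny, exactly as the characteristic-function computation above predicts.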
More informationMathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )
Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio (2014-2015) Etienne Tanré - Olivier Faugeras INRIA - Team Tosca October 22nd, 2014 E. Tanré (INRIA - Team Tosca) Mathematical
More informationPart 2 Continuous functions and their properties
Part 2 Continuous functions and their properties 2.1 Definition Definition A function f is continuous at a R if, and only if, that is lim f (x) = f (a), x a ε > 0, δ > 0, x, x a < δ f (x) f (a) < ε. Notice
More informationIntegration on Measure Spaces
Chapter 3 Integration on Measure Spaces In this chapter we introduce the general notion of a measure on a space X, define the class of measurable functions, and define the integral, first on a class of
More informationII - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define
1 Measures 1.1 Jordan content in R N II - REAL ANALYSIS Let I be an interval in R. Then its 1-content is defined as c 1 (I) := b a if I is bounded with endpoints a, b. If I is unbounded, we define c 1
More informationConvergence of Random Variables
1 / 15 Convergence of Random Variables Saravanan Vijayakumaran sarva@ee.iitb.ac.in Department of Electrical Engineering Indian Institute of Technology Bombay March 19, 2014 2 / 15 Motivation Theorem (Weak
More informationMath LM (24543) Lectures 01
Math 32300 LM (24543) Lectures 01 Ethan Akin Office: NAC 6/287 Phone: 650-5136 Email: ethanakin@earthlink.net Spring, 2018 Contents Introduction, Ross Chapter 1 and Appendix The Natural Numbers N and The
More informationFinite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product
Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )
More informationMcGill University Math 354: Honors Analysis 3
Practice problems McGill University Math 354: Honors Analysis 3 not for credit Problem 1. Determine whether the family of F = {f n } functions f n (x) = x n is uniformly equicontinuous. 1st Solution: The
More informationNotions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy
Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.
More informationA PECULIAR COIN-TOSSING MODEL
A PECULIAR COIN-TOSSING MODEL EDWARD J. GREEN 1. Coin tossing according to de Finetti A coin is drawn at random from a finite set of coins. Each coin generates an i.i.d. sequence of outcomes (heads or
More informationMeasure and Integration: Solutions of CW2
Measure and Integration: s of CW2 Fall 206 [G. Holzegel] December 9, 206 Problem of Sheet 5 a) Left (f n ) and (g n ) be sequences of integrable functions with f n (x) f (x) and g n (x) g (x) for almost
More informationLecture 7. Sums of random variables
18.175: Lecture 7 Sums of random variables Scott Sheffield MIT 18.175 Lecture 7 1 Outline Definitions Sums of random variables 18.175 Lecture 7 2 Outline Definitions Sums of random variables 18.175 Lecture
More information1. Probability Measure and Integration Theory in a Nutshell
1. Probability Measure and Integration Theory in a Nutshell 1.1. Measurable Space and Measurable Functions Definition 1.1. A measurable space is a tuple (Ω, F) where Ω is a set and F a σ-algebra on Ω,
More information1 Probability theory. 2 Random variables and probability theory.
Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major
More informationIntroduction to Real Analysis Alternative Chapter 1
Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces
More informationUseful Probability Theorems
Useful Probability Theorems Shiu-Tang Li Finished: March 23, 2013 Last updated: November 2, 2013 1 Convergence in distribution Theorem 1.1. TFAE: (i) µ n µ, µ n, µ are probability measures. (ii) F n (x)
More informationLarge Sample Theory. Consider a sequence of random variables Z 1, Z 2,..., Z n. Convergence in probability: Z n
Large Sample Theory In statistics, we are interested in the properties of particular random variables (or estimators ), which are functions of our data. In ymptotic analysis, we focus on describing the
More informationCompact operators on Banach spaces
Compact operators on Banach spaces Jordan Bell jordan.bell@gmail.com Department of Mathematics, University of Toronto November 12, 2017 1 Introduction In this note I prove several things about compact
More informationMetric Spaces Math 413 Honors Project
Metric Spaces Math 413 Honors Project 1 Metric Spaces Definition 1.1 Let X be a set. A metric on X is a function d : X X R such that for all x, y, z X: i) d(x, y) = d(y, x); ii) d(x, y) = 0 if and only
More informationProblem Set 2: Solutions Math 201A: Fall 2016
Problem Set 2: s Math 201A: Fall 2016 Problem 1. (a) Prove that a closed subset of a complete metric space is complete. (b) Prove that a closed subset of a compact metric space is compact. (c) Prove that
More informationREVIEW OF ESSENTIAL MATH 346 TOPICS
REVIEW OF ESSENTIAL MATH 346 TOPICS 1. AXIOMATIC STRUCTURE OF R Doğan Çömez The real number system is a complete ordered field, i.e., it is a set R which is endowed with addition and multiplication operations
More informationLecture Notes for MA 623 Stochastic Processes. Ionut Florescu. Stevens Institute of Technology address:
Lecture Notes for MA 623 Stochastic Processes Ionut Florescu Stevens Institute of Technology E-mail address: ifloresc@stevens.edu 2000 Mathematics Subject Classification. 60Gxx Stochastic Processes Abstract.
More informationOn the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables
On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables Deli Li 1, Yongcheng Qi, and Andrew Rosalsky 3 1 Department of Mathematical Sciences, Lakehead University,
More informationOn the convergence of sequences of random variables: A primer
BCAM May 2012 1 On the convergence of sequences of random variables: A primer Armand M. Makowski ECE & ISR/HyNet University of Maryland at College Park armand@isr.umd.edu BCAM May 2012 2 A sequence a :
More informationWhy study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables
ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section
More information1: PROBABILITY REVIEW
1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following
More informationWeak convergence. Amsterdam, 13 November Leiden University. Limit theorems. Shota Gugushvili. Generalities. Criteria
Weak Leiden University Amsterdam, 13 November 2013 Outline 1 2 3 4 5 6 7 Definition Definition Let µ, µ 1, µ 2,... be probability measures on (R, B). It is said that µ n converges weakly to µ, and we then
More information1 Probability space and random variables
1 Probability space and random variables As graduate level, we inevitably need to study probability based on measure theory. It obscures some intuitions in probability, but it also supplements our intuition,
More informationFundamental Inequalities, Convergence and the Optional Stopping Theorem for Continuous-Time Martingales
Fundamental Inequalities, Convergence and the Optional Stopping Theorem for Continuous-Time Martingales Prakash Balachandran Department of Mathematics Duke University April 2, 2008 1 Review of Discrete-Time
More informationSome Background Material
Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important
More information1 Measurable Functions
36-752 Advanced Probability Overview Spring 2018 2. Measurable Functions, Random Variables, and Integration Instructor: Alessandro Rinaldo Associated reading: Sec 1.5 of Ash and Doléans-Dade; Sec 1.3 and
More informationLecture I: Asymptotics for large GUE random matrices
Lecture I: Asymptotics for large GUE random matrices Steen Thorbjørnsen, University of Aarhus andom Matrices Definition. Let (Ω, F, P) be a probability space and let n be a positive integer. Then a random
More informationExercises Measure Theoretic Probability
Exercises Measure Theoretic Probability 2002-2003 Week 1 1. Prove the folloing statements. (a) The intersection of an arbitrary family of d-systems is again a d- system. (b) The intersection of an arbitrary
More informationST213 Mathematics of Random Events
ST213 Mathematics of Random Events Wilfrid S. Kendall version 1.0 28 April 1999 1. Introduction The main purpose of the course ST213 Mathematics of Random Events (which we will abbreviate to MoRE) is to
More informationINTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES
INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES You will be expected to reread and digest these typed notes after class, line by line, trying to follow why the line is true, for example how it
More informationMeasures. 1 Introduction. These preliminary lecture notes are partly based on textbooks by Athreya and Lahiri, Capinski and Kopp, and Folland.
Measures These preliminary lecture notes are partly based on textbooks by Athreya and Lahiri, Capinski and Kopp, and Folland. 1 Introduction Our motivation for studying measure theory is to lay a foundation
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationPreliminaries. Probability space
Preliminaries This section revises some parts of Core A Probability, which are essential for this course, and lists some other mathematical facts to be used (without proof) in the following. Probability
More information