BCAM, May 2012

On the convergence of sequences of random variables: A primer

Armand M. Makowski
ECE & ISR/HyNet
University of Maryland at College Park
armand@isr.umd.edu
A sequence $a : \mathbb{N}_0 \to \mathbb{R}$, often described as $\{a_n,\ n = 1, 2, \ldots\}$, converges to some $a$ in $\mathbb{R}$ if for every $\varepsilon > 0$ there exists $n^\star(\varepsilon)$ such that
$$ |a_n - a| \leq \varepsilon, \quad n \geq n^\star(\varepsilon). $$
We write $\lim_{n \to \infty} a_n = a$ or $a_n \to_n a$.

This definition contains two basic questions:
Existence: It converges!
Value: Find the limiting value!

What happens if $a = \pm\infty$?
Existence

Every monotone sequence converges!

Bolzano-Weierstrass: Every bounded sequence contains at least one convergent subsequence!

Given a sequence $a : \mathbb{N}_0 \to \mathbb{R}$, define
$$ \limsup_{n \to \infty} a_n = \inf_{n \geq 1} \overline{a}_n \quad \text{with} \quad \overline{a}_n = \sup_{m \geq n} a_m $$
and
$$ \liminf_{n \to \infty} a_n = \sup_{n \geq 1} \underline{a}_n \quad \text{with} \quad \underline{a}_n = \inf_{m \geq n} a_m. $$
$\limsup_{n \to \infty} a_n$ = largest accumulation point of the sequence
$\liminf_{n \to \infty} a_n$ = smallest accumulation point of the sequence

For each $n = 1, 2, \ldots$,
$$ \underline{a}_n \leq \liminf_{m \to \infty} a_m \leq \limsup_{m \to \infty} a_m \leq \overline{a}_n. $$

Fact: The sequence $a : \mathbb{N}_0 \to \mathbb{R}$ converges if and only if
$$ \liminf_{n \to \infty} a_n = \limsup_{n \to \infty} a_n \ \left( = \lim_{n \to \infty} a_n \right). $$
A sequence $a : \mathbb{N}_0 \to \mathbb{R}$ is said to be Cauchy if for every $\varepsilon > 0$ there exists $n^\star(\varepsilon)$ such that
$$ |a_n - a_m| \leq \varepsilon, \quad m, n \geq n^\star(\varepsilon). $$

Fact: A sequence $a : \mathbb{N}_0 \to \mathbb{R}$ converges if and only if it is Cauchy: $\mathbb{R}$ is complete under its usual topology.
Cesàro convergence

The Cesàro sequence associated with the sequence $a : \mathbb{N}_0 \to \mathbb{R}$ is the sequence $\overline{a} : \mathbb{N}_0 \to \mathbb{R}$ given by
$$ \overline{a}_n = \frac{1}{n}\left( a_1 + \ldots + a_n \right), \quad n = 1, 2, \ldots $$

A sequence $a : \mathbb{N}_0 \to \mathbb{R}$ is Cesàro-convergent if the associated Cesàro sequence $\overline{a} : \mathbb{N}_0 \to \mathbb{R}$ converges.
Fact: A convergent sequence $a : \mathbb{N}_0 \to \mathbb{R}$ with limit $a$ is also Cesàro-convergent with limit $a$, namely
$$ \lim_{n \to \infty} \overline{a}_n = a. $$
However, the converse is not true, e.g., $a_n = (-1)^n$, $n = 1, 2, \ldots$

Averaging is good! Law of Large Numbers!!!
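The $(-1)^n$ example is easy to check numerically; a minimal sketch (the helper `cesaro` and the sample length are ours, not from the slides):

```python
# Numerical check: a_n = (-1)^n diverges, but its Cesaro averages
# (running means) converge to 0.
def cesaro(seq):
    """Return the running-average (Cesaro) sequence of `seq`."""
    out, total = [], 0.0
    for n, a_n in enumerate(seq, start=1):
        total += a_n
        out.append(total / n)
    return out

a = [(-1) ** n for n in range(1, 10001)]
avg = cesaro(a)
print(a[-2], a[-1])  # the sequence keeps oscillating between -1 and 1
print(avg[-1])       # the Cesaro average is 0 at an even index
```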
Random variables

Given a probability triple $(\Omega, \mathcal{F}, \mathbb{P})$, a $d$-dimensional random variable (rv) is a measurable mapping $X : \Omega \to \mathbb{R}^d$, i.e.,
$$ X^{-1}(B) = \{ \omega \in \Omega : X(\omega) \in B \} \in \mathcal{F}, \quad B \in \mathcal{B}(\mathbb{R}^d). $$

Two viewpoints:
Rv as a mapping
Rv as a probability distribution function (i.e., measure) $F : \mathbb{R}^d \to [0, 1] : x \mapsto F(x) = \mathbb{P}[X \leq x]$

Multiple modes of convergence, with many subtleties!
An obvious definition...

Consider a collection $\{X;\ X_n,\ n = 1, 2, \ldots\}$ of $\mathbb{R}^d$-valued rvs, all defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$. Then, we could say that convergence takes place to $X$ if
$$ \lim_{n \to \infty} X_n(\omega) = X(\omega), \quad \omega \in \Omega. $$

Why not? Too strong!
Modeling information: Often only the corresponding probability distributions $\{F_n,\ n = 1, 2, \ldots\}$ are available.
Four basic modes of convergence

Convergence in distribution
Convergence in the $r$th mean ($r \geq 1$)
Convergence in probability
Convergence with probability one (w.p. 1)

Easy-to-use criteria
Relationships
Impact of (continuous) transformations
Cesàro convergence
Key limit theorems of Probability Theory
Convergence with probability one

Consider a collection $\{X;\ X_n,\ n = 1, 2, \ldots\}$ of $\mathbb{R}^d$-valued rvs, all defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$. We say that the sequence $\{X_n,\ n = 1, 2, \ldots\}$ converges almost surely (a.s.) (or with probability one (w.p. 1)) to $X$ if
$$ \mathbb{P}\left[ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right] = 1. $$
We write $\lim_{n \to \infty} X_n = X$ a.s.
Convergence in distribution

Also known as convergence in law and weak convergence. Multiple equivalent definitions available.

A sequence of probability distribution functions $\{F_n,\ n = 1, 2, \ldots\}$ on $\mathbb{R}$ converges in distribution to the probability distribution function $F$ on $\mathbb{R}$, written $F_n \Longrightarrow_n F$, if
$$ \lim_{n \to \infty} F_n(x) = F(x), \quad x \in C_F, $$
where $C_F$ denotes the continuity set of $F$.
Why this definition? The limit of a distribution is not always a distribution!

Skorokhod's Theorem: Assume the sequence of probability distribution functions $\{F_n,\ n = 1, 2, \ldots\}$ on $\mathbb{R}$ converges in distribution to the probability distribution function $F$ on $\mathbb{R}$. Then there exists a single probability triple $(\Omega, \mathcal{F}, \mathbb{P})$ and a collection of rvs $\{X,\ X_n,\ n = 1, 2, \ldots\}$ defined on it such that
$$ X \sim F, \quad X_n \sim F_n, \quad n = 1, 2, \ldots $$
with the property that
$$ \lim_{n \to \infty} X_n(\omega) = X(\omega), \quad \omega \in \Omega. $$
Proof: Take $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$, $\mathbb{P} = \lambda$ (Lebesgue measure), and set
$$ X(\omega) = F^{-}(\omega) \quad \text{and} \quad X_n(\omega) = F_n^{-}(\omega), \quad \omega \in \Omega,\ n = 1, 2, \ldots $$

For any non-decreasing function $F : \mathbb{R} \to [0, 1]$, define its (left-continuous) generalized inverse $F^{-} : [0, 1] \to \mathbb{R} \cup \{\pm\infty\}$ by
$$ F^{-}(t) = \inf\{ x \in \mathbb{R} : F(x) \geq t \}, \quad 0 \leq t \leq 1. $$
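The construction is easy to exercise numerically when $F$ is continuous and strictly increasing, in which case $F^{-}$ is the ordinary inverse. A sketch for the Exponential(1) distribution (the helper `F_inv` and the sample size are our choices, not from the slides):

```python
import math
import random

# For F(x) = 1 - exp(-x) (Exponential(1)), the generalized inverse is
# F^-(t) = -log(1 - t); then X = F^-(U) with U ~ Uniform(0,1) has cdf F.
def F_inv(t):
    return -math.log(1.0 - t)

random.seed(0)
samples = [F_inv(random.random()) for _ in range(200000)]
mean = sum(samples) / len(samples)
print(mean)  # close to E[X] = 1 for Exponential(1)
```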
Auto-regressive sequences

Consider
$$ X_0 = \xi, \qquad X_{t+1} = \alpha X_t + W_{t+1}, \quad t = 0, 1, \ldots $$

Assume
$\alpha \neq 0$
$\mathbb{R}$-valued rv $\xi$
i.i.d. $\mathbb{R}$-valued rvs $\{W,\ W_t,\ t = 1, 2, \ldots\}$ with $\mathbb{E}[|W|] < \infty$
Mutual independence
Fact: If $|\alpha| < 1$, then there exists an $\mathbb{R}$-valued rv $X$ such that $X_t \Longrightarrow_t X$ regardless of the initial condition $\xi$. The rv $X$ has finite first moment and is characterized by
$$ X =_{st} \sum_{s=0}^{\infty} \alpha^s W_{s+1} $$
because
$$ X_{t+1} = \alpha^{t+1} \xi + \sum_{s=0}^{t} \alpha^{t-s} W_{s+1} =_{st} \alpha^{t+1} \xi + \sum_{s=0}^{t} \alpha^{s} W_{s+1} \qquad (1) $$
since $(W_1, W_2, \ldots, W_t) =_{st} (W_t, W_{t-1}, \ldots, W_1)$.
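A Monte Carlo sketch of this fact, assuming Gaussian noise for concreteness ($W_t \sim N(0,1)$, $\alpha = 0.8$): the limit $\sum_{s} \alpha^s W_{s+1}$ is then $N(0, 1/(1-\alpha^2))$, so after many steps $X_t$ should show that variance regardless of $\xi$.

```python
import random

# AR(1) with alpha = 0.8 and N(0,1) noise: the stationary variance is
# 1 / (1 - alpha**2) = 2.77...; we start far from it (xi = 5) on purpose.
random.seed(1)
alpha, steps, paths = 0.8, 200, 20000
finals = []
for _ in range(paths):
    x = 5.0  # initial condition xi; the limit does not depend on it
    for _ in range(steps):
        x = alpha * x + random.gauss(0.0, 1.0)
    finals.append(x)
m = sum(finals) / paths
var = sum((v - m) ** 2 for v in finals) / paths
print(var)  # close to 1 / (1 - 0.8**2)
```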
Lindley's recursion

Consider
$$ X_0 = \xi, \qquad X_{t+1} = \left( X_t + \eta_{t+1} \right)^+, \quad t = 0, 1, \ldots $$

Assume
$\mathbb{R}_+$-valued rv $\xi$
i.i.d. $\mathbb{R}$-valued rvs $\{\eta,\ \eta_t,\ t = 1, 2, \ldots\}$ with $\mathbb{E}[|\eta|] < \infty$
Mutual independence
Fact: If $\mathbb{E}[\eta] < 0$, then there exists an $\mathbb{R}_+$-valued rv $X$ such that $X_t \Longrightarrow_t X$ regardless of the initial condition $\xi$, with
$$ X =_{st} \left( \sup_{t = 1, 2, \ldots} \left( \eta_1 + \ldots + \eta_t \right) \right)^+. $$
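Behind this fact is a pathwise identity: started from $X_0 = 0$, the recursion gives $X_t = \max\big(0, \max_k (\eta_t + \ldots + \eta_{t-k+1})\big)$, the largest backward partial sum. A sketch checking that identity on one simulated path (the uniform law for $\eta$ is our choice):

```python
import random

# Lindley's recursion from X_0 = 0 versus the max-of-backward-partial-sums
# form, checked pathwise on one driving sequence.
random.seed(2)
etas = [random.uniform(-1.0, 0.5) for _ in range(1000)]  # E[eta] = -0.25 < 0

x = 0.0
for e in etas:
    x = max(x + e, 0.0)

partial, best = 0.0, 0.0
for e in reversed(etas):
    partial += e
    best = max(best, partial)

print(x, best)  # the two agree on every sample path (up to rounding)
```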
Analytic view of weak convergence

With an $\mathbb{R}^d$-valued rv $X = (X_1, \ldots, X_d)$, define its characteristic function $\Phi_X : \mathbb{R}^d \to \mathbb{C}$ given by
$$ \Phi_X(t) = \mathbb{E}\left[ e^{i t' X} \right], \quad t \in \mathbb{R}^d. $$
Also $\Phi_F = \Phi_X$ where $X \sim F$.

Uniqueness: $\Phi_F = \Phi_G$ if and only if $F = G$.
Fact: With an $\mathbb{R}^d$-valued rv $X = (X_1, \ldots, X_d)$, its characteristic function $\Phi_X : \mathbb{R}^d \to \mathbb{C}$ satisfies the following properties:

Bounded: $|\Phi_X(t)| \leq \Phi_X(0) = 1$, $t \in \mathbb{R}^d$

Uniformly continuous on $\mathbb{R}^d$:
$$ \lim_{h \to 0} \sup_{t \in \mathbb{R}^d} |\Phi_X(t + h) - \Phi_X(t)| = 0 $$

Positive definiteness: For every $n = 1, 2, \ldots$, every $t_1, \ldots, t_n$ in $\mathbb{R}^d$ and every $z_1, \ldots, z_n$ in $\mathbb{C}$,
$$ \sum_{k=1}^{n} \sum_{l=1}^{n} \Phi_X(t_k - t_l)\, z_k \overline{z_l} \geq 0. $$

This characterizes characteristic functions among functions $\mathbb{R}^d \to \mathbb{C}$.
Fact: The sequence of probability distribution functions $\{F_n,\ n = 1, 2, \ldots\}$ on $\mathbb{R}^d$ converges in distribution to the probability distribution function $F$ on $\mathbb{R}^d$ if and only if
$$ \lim_{n \to \infty} \Phi_{F_n}(t) = \Phi_F(t), \quad t \in \mathbb{R}^d. $$

Behavior of characteristic functions:
$$ \lim_{n \to \infty} \Phi_{F_n}(t) = \lim_{n \to \infty} \mathbb{E}\left[ e^{i t' X_n} \right], \quad t \in \mathbb{R}^d. $$
Fact: Consider a sequence of probability distribution functions $\{F_n,\ n = 1, 2, \ldots\}$ on $\mathbb{R}^d$ such that the limits
$$ \lim_{n \to \infty} \Phi_{F_n}(t) = \Phi(t), \quad t \in \mathbb{R}^d $$
exist. If $\Phi : \mathbb{R}^d \to \mathbb{C}$ is continuous at $t = 0$, then it is the characteristic function of a probability distribution function $F$ on $\mathbb{R}^d$ and $F_n \Longrightarrow_n F$.

Consequence of the Bochner-Herglotz Theorem, which provides a characterization of characteristic functions through positive definiteness.
Beware (I)

Convergence in distribution says nothing about the behavior of probability density functions:
$$ F_n(x) = \int_{-\infty}^{x} f_n(t)\, dt, \quad x \in \mathbb{R}. $$

Example: $F_n(x) = x^n$, $x \in [0, 1]$, $n = 1, 2, \ldots$ Each $F_n$ has density $f_n(x) = n x^{n-1}$, yet $F_n \Longrightarrow_n F$ where $F$ is degenerate at $x = 1$ and has no density.
Beware (II)

Behavior of probability mass functions (pmfs):
$$ F_n(x) = \sum_{x_j \leq x} \left( F_n(x_j) - F_n(x_j^-) \right), \quad x \in \mathbb{R},\ n = 1, 2, \ldots $$

Example: $X_n = n + \mathrm{Poi}(\lambda)$, $n = 1, 2, \ldots$ Each pmf value tends to $0$ and the mass escapes to infinity, so the pointwise limit of the $F_n$ is not a probability distribution.
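A quick computation illustrating the escaping mass in this example (the helper `poisson_cdf` is ours): for fixed $x$, $\mathbb{P}[X_n \leq x] = \mathbb{P}[\mathrm{Poi}(\lambda) \leq x - n] \to 0$ as $n$ grows.

```python
import math

# P[X_n <= x] for X_n = n + Poi(lam): shifting the Poisson to the right
# drains the probability below any fixed x as n grows.
def poisson_cdf(k, lam):
    if k < 0:
        return 0.0
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(int(k) + 1))

lam, x = 2.0, 10
probs = [poisson_cdf(x - n, lam) for n in (0, 5, 10, 20)]
print(probs)  # strictly shrinking, eventually exactly 0
```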
Tightness

The $\mathbb{R}^d$-valued rvs $\{X_n,\ n = 1, 2, \ldots\}$ are tight if for every $\varepsilon > 0$ there exists a compact subset $K_\varepsilon \subseteq \mathbb{R}^d$ such that
$$ \mathbb{P}[X_n \in K_\varepsilon] \geq 1 - \varepsilon, \quad n = 1, 2, \ldots $$

By Prohorov's Theorem,
Tightness = Sequential precompactness (with respect to weak convergence)

Remember Bolzano-Weierstrass!
Easy criterion

Tightness holds if for some $r \geq 1$ we have
$$ \sup_{n = 1, 2, \ldots} \mathbb{E}\left[ \|X_n\|^r \right] < \infty. $$
By Markov's inequality,
$$ \mathbb{P}[\|X_n\| > c] \leq \frac{\mathbb{E}\left[ \|X_n\|^r \right]}{c^r}, \quad c > 0,\ n = 1, 2, \ldots $$
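A numeric sanity check of the Markov bound with $r = 2$ (the distribution and constants are our choices, not from the slides):

```python
import random

# Empirical check of P[|X| > c] <= E[|X|^r] / c^r with r = 2 and
# X ~ Exponential(1): both sides estimated from the same sample.
random.seed(3)
xs = [random.expovariate(1.0) for _ in range(100000)]
r, c = 2, 3.0
lhs = sum(1 for v in xs if v > c) / len(xs)    # empirical P[X > c]
rhs = sum(v**r for v in xs) / len(xs) / c**r   # empirical E[X^r] / c^r
print(lhs, rhs)  # the bound holds with room to spare
```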
Fact: If the sequence of probability distribution functions $\{F_n,\ n = 1, 2, \ldots\}$ on $\mathbb{R}$ converges in distribution to the probability distribution function $F$ on $\mathbb{R}$, then the collection $\{F_n,\ n = 1, 2, \ldots\}$ is tight.
Fix $x$ in $C_F$. For each $\delta > 0$, there exists a finite integer $n^\star = n^\star(x; \delta)$ such that
$$ F(x) - \delta \leq F_n(x) \leq F(x) + \delta, \quad n \geq n^\star. $$
Consequently,
$$ \mathbb{P}[X_n > x] \leq \mathbb{P}[X > x] + \delta, \quad n \geq n^\star. $$
Now take $x$ (in $C_F$) sufficiently large, say $x = x(\delta)$, such that $\mathbb{P}[X > x] \leq \delta$. Finally,
$$ \mathbb{P}[X_n > x] \leq \mathbb{P}[X > x] + \delta \leq \delta + \delta = 2\delta, \quad n \geq n(\delta) $$
with $n(\delta) = n^\star(x(\delta); \delta)$.
Fact: The sequence of probability distribution functions $\{F_n,\ n = 1, 2, \ldots\}$ on $\mathbb{R}^d$ converges in distribution to the probability distribution function $F$ on $\mathbb{R}^d$ if and only if
$$ \lim_{n \to \infty} \mathbb{E}[g(X_n)] = \mathbb{E}[g(X)] $$
for every bounded continuous mapping $g : \mathbb{R}^d \to \mathbb{R}$.

Alternate definition of weak convergence
Useful consequences
Beware

Assume
$$ X_n \Longrightarrow_n X \quad \text{and} \quad Y_n \Longrightarrow_n Y $$
where for each $n = 1, 2, \ldots$, the pair of rvs $X_n$ and $Y_n$ is defined on the same probability triple $(\Omega_n, \mathcal{F}_n, \mathbb{P}_n)$.
Convergence of sums: Is it true that $X_n + Y_n \Longrightarrow_n X + Y$?

In general no: Take $Z \sim N(0, 1)$, $X_n = Z$ and $Y_n = (-1)^n Z$, so that
$$ X_n + Y_n = (1 + (-1)^n) Z. $$

Fact: We have $X_n + Y_n \Longrightarrow_n X + Y$ if for each $n = 1, 2, \ldots$, the rvs $X_n$ and $Y_n$ are independent!
Joint convergence: Is it true that $(X_n, Y_n) \Longrightarrow_n (X, Y)$?

In general no: Same counterexample as before.

Fact: We have $(X_n, Y_n) \Longrightarrow_n (X, Y)$ if for each $n = 1, 2, \ldots$, the rvs $X_n$ and $Y_n$ are independent, in which case $X$ and $Y$ are independent.
Convergence under transformation: Is it true that $h(X_n) \Longrightarrow_n h(X)$ with $h : \mathbb{R}^d \to \mathbb{R}^p$?

Fact: We have $h(X_n) \Longrightarrow_n h(X)$ if $h : \mathbb{R}^d \to \mathbb{R}^p$ is continuous.

Skorokhod to the rescue!
Convergence in the $r$th mean ($r > 0$)

Consider a collection $\{X;\ X_n,\ n = 1, 2, \ldots\}$ of $\mathbb{R}^d$-valued rvs, all defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$. We say that the sequence $\{X_n,\ n = 1, 2, \ldots\}$ converges in the $r$th mean to the rv $X$ if
$$ \mathbb{E}\left[ \|X_n\|^r \right] < \infty, \quad n = 1, 2, \ldots \quad \text{and} \quad \mathbb{E}\left[ \|X\|^r \right] < \infty, $$
and
$$ \lim_{n \to \infty} \mathbb{E}\left[ \|X_n - X\|^r \right] = 0. $$
This is often written $X_n \xrightarrow{r}_n X$.
Cauchy criterion available: For every $\varepsilon > 0$, there exists a finite integer $n^\star(\varepsilon)$ such that
$$ \mathbb{E}\left[ \|X_n - X_m\|^r \right] \leq \varepsilon, \quad n, m \geq n^\star(\varepsilon). $$
Revisiting auto-regressive sequences

Consider
$$ X_0 = \xi, \qquad X_{t+1} = \alpha X_t + W_{t+1}, \quad t = 0, 1, \ldots $$

Assume
$\alpha \neq 0$
$\mathbb{R}$-valued rv $\xi$
i.i.d. $\mathbb{R}$-valued rvs $\{W,\ W_t,\ t = 1, 2, \ldots\}$ with $\mathbb{E}\left[ W^2 \right] < \infty$
Mutual independence
Recall that for each $t = 0, 1, 2, \ldots$,
$$ X_{t+1} = \alpha^{t+1} \xi + \sum_{s=0}^{t} \alpha^{t-s} W_{s+1}. $$
So
$$ \mathbb{E}\left[ \sum_{s=0}^{t} \alpha^{t-s} W_{s+1} \right] = \left( \sum_{s=0}^{t} \alpha^{t-s} \right) \mathbb{E}[W] = \frac{1 - \alpha^{t+1}}{1 - \alpha}\, \mathbb{E}[W] $$
and
$$ \mathrm{Var}\left[ \sum_{s=0}^{t} \alpha^{t-s} W_{s+1} \right] = \sum_{s=0}^{t} \alpha^{2(t-s)}\, \mathrm{Var}[W_{s+1}] = \left( \sum_{s=0}^{t} \alpha^{2s} \right) \sigma_W^2 = \frac{1 - \alpha^{2(t+1)}}{1 - \alpha^2}\, \sigma_W^2. $$
Convergence in probability

Consider a collection $\{X;\ X_n,\ n = 1, 2, \ldots\}$ of $\mathbb{R}^d$-valued rvs, all defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$. We say that the sequence $\{X_n,\ n = 1, 2, \ldots\}$ converges in probability to the rv $X$ if for every $\varepsilon > 0$,
$$ \lim_{n \to \infty} \mathbb{P}[\|X_n - X\| > \varepsilon] = 0. $$
This is often written $X_n \xrightarrow{P}_n X$.

For $d = 1$,
$$ \lim_{n \to \infty} \mathbb{P}[|X_n - X| > \varepsilon] = 0. $$
Cauchy criterion available.

Fact: Convergence in the $r$th mean implies convergence in probability: By Markov's inequality,
$$ \mathbb{P}[\|X_n - X\| > \varepsilon] = \mathbb{P}[\|X_n - X\|^r > \varepsilon^r] \leq \varepsilon^{-r}\, \mathbb{E}\left[ \|X_n - X\|^r \right], \quad r > 0,\ \varepsilon > 0,\ n = 1, 2, \ldots $$

Converse is not true without additional conditions, e.g., with $\alpha > 0$,
$$ X_n = \begin{cases} 0 & \text{with probability } 1 - n^{-\alpha} \\ n & \text{with probability } n^{-\alpha} \end{cases} $$
Here $X_n \xrightarrow{P}_n 0$, yet $\mathbb{E}[X_n^r] = n^{r - \alpha}$ diverges as soon as $r > \alpha$.
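The counterexample is explicit enough to tabulate (we take $\alpha = 1$ and $r = 2$ for concreteness; the helper names are ours):

```python
# X_n = 0 w.p. 1 - 1/n and n w.p. 1/n (the slide's example with alpha = 1):
# P[|X_n| > eps] = 1/n -> 0, so X_n -> 0 in probability, while
# E[X_n^2] = n**2 / n = n -> infinity: no convergence in the 2nd mean.
def prob_exceeds(n, eps=0.5):
    return 1.0 / n  # P[|X_n| > eps] for any 0 < eps < n

def second_moment(n):
    return n**2 * (1.0 / n)

for n in (10, 1000, 10**6):
    print(n, prob_exceeds(n), second_moment(n))
```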
Fact: Convergence in probability implies convergence in distribution: Indeed, for each $n = 1, 2, \ldots$ and $\varepsilon > 0$, we have
$$ \mathbb{P}[X_n \leq x] \leq \mathbb{P}[X \leq x + \varepsilon] + \mathbb{P}[\|X_n - X\| \geq \varepsilon] $$
and
$$ \mathbb{P}[X \leq x - \varepsilon] \leq \mathbb{P}[X_n \leq x] + \mathbb{P}[\|X_n - X\| \geq \varepsilon]. $$
Thus,
$$ \limsup_{n \to \infty} \mathbb{P}[X_n \leq x] \leq \mathbb{P}[X \leq x + \varepsilon] $$
and
$$ \mathbb{P}[X \leq x - \varepsilon] \leq \liminf_{n \to \infty} \mathbb{P}[X_n \leq x]. $$
Finally let $\varepsilon \downarrow 0$!
Converse is not true! With $Z \sim N(0, 1)$, take $X_n = (-1)^n Z$ for each $n = 1, 2, \ldots$ Obviously, $X_n \Longrightarrow_n Z$ but
$$ X_n - Z = \left( (-1)^n - 1 \right) Z, \quad n = 1, 2, \ldots $$
However, not all is lost: If the sequence $\{X_n,\ n = 1, 2, \ldots\}$ converges in distribution to the a.s. constant rv $c$, then $X_n \xrightarrow{P}_n c$. Every sequence converging in distribution to a constant converges to it in probability!

Indeed, for each $n = 1, 2, \ldots$ and $\varepsilon > 0$, we have
$$ \mathbb{P}[|X_n - c| \geq \varepsilon] = \mathbb{P}[X_n \geq c + \varepsilon] + \mathbb{P}[X_n \leq c - \varepsilon], $$
and both terms go to $0$ since $c \pm \varepsilon$ are continuity points of the limiting (degenerate) distribution.
Problems: You know that
$$ X_n \xrightarrow{P}_n X \quad \text{and} \quad Y_n \xrightarrow{P}_n Y. $$
Convergence of sums: Is it true that $X_n + Y_n \xrightarrow{P}_n X + Y$?

Yes, because for each $n = 1, 2, \ldots$, the event $[\|(X_n + Y_n) - (X + Y)\| > \varepsilon]$ is contained in
$$ \left[ \|X_n - X\| > \frac{\varepsilon}{2} \right] \cup \left[ \|Y_n - Y\| > \frac{\varepsilon}{2} \right]. $$
What if only
$$ X_n \xrightarrow{P}_n X \quad \text{and} \quad Y_n \Longrightarrow_n Y? $$

Counterexample: With $Z \sim N(0, 1)$, set
$$ X_n = Z \quad \text{and} \quad Y_n = (-1)^n Z, \quad n = 1, 2, \ldots $$
so that
$$ X_n + Y_n = (1 + (-1)^n) Z, \quad n = 1, 2, \ldots $$
It is plain that $X_n \xrightarrow{P}_n Z$ and $Y_n \Longrightarrow_n Z$, but the convergence $X_n + Y_n \Longrightarrow_n X + Y$ does not hold, hence $X_n + Y_n \xrightarrow{P}_n X + Y$ fails as well!
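Empirically, the oscillation is visible in the variance of $X_n + Y_n$, which alternates between $0$ (odd $n$) and about $4$ (even $n$); a sketch with simulated $Z$:

```python
import random

# With Z ~ N(0,1), X_n = Z, Y_n = (-1)^n Z: the sum is 0 for odd n and
# 2Z for even n, so Var[X_n + Y_n] alternates between 0 and 4.
random.seed(4)
zs = [random.gauss(0.0, 1.0) for _ in range(50000)]

def sum_variance(n):
    vals = [z + ((-1) ** n) * z for z in zs]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

print(sum_variance(1), sum_variance(2))  # 0 and about 4
```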
Joint convergence: Is it true that $(X_n, Y_n) \xrightarrow{P}_n (X, Y)$?
Convergence under transformation: Is it true that $h(X_n) \xrightarrow{P}_n h(X)$ with continuous $h : \mathbb{R}^d \to \mathbb{R}^p$?

Easy to see if $h : \mathbb{R}^d \to \mathbb{R}^p$ is uniformly continuous!
Fact: Convergence in the a.s. sense implies convergence in probability.

With $\varepsilon > 0$,
$$ [X_n \text{ converges to } X] \subseteq \bigcup_{n=1}^{\infty} B_n(\varepsilon) $$
with the monotone increasing events
$$ B_n(\varepsilon) = \bigcap_{m=n}^{\infty} [\|X_m - X\| \leq \varepsilon], \quad n = 1, 2, \ldots $$
Therefore, by monotonicity,
$$ \mathbb{P}[X_n \text{ converges to } X] \leq \lim_{n \to \infty} \mathbb{P}[B_n(\varepsilon)]. $$
If $\mathbb{P}[X_n \text{ converges to } X] = 1$, then $\lim_{n \to \infty} \mathbb{P}[B_n(\varepsilon)] = 1$ becomes
$$ 0 = \lim_{n \to \infty} \mathbb{P}[B_n(\varepsilon)^c] = \lim_{n \to \infty} \mathbb{P}\left[ \bigcup_{m=n}^{\infty} [\|X_m - X\| > \varepsilon] \right] $$
by complementarity, whence
$$ \lim_{n \to \infty} \mathbb{P}[\|X_n - X\| > \varepsilon] = 0. $$

Converse is not true!
However, not all is lost: Partial converse

If the sequence $\{X_n,\ n = 1, 2, \ldots\}$ converges in probability to the rv $X$, then there exists a sequence $\nu : \mathbb{N}_0 \to \mathbb{N}_0$ with
$$ \nu_k < \nu_{k+1}, \quad k = 1, 2, \ldots \quad \left( \text{whence } \lim_{k \to \infty} \nu_k = \infty \right) $$
such that
$$ \lim_{k \to \infty} X_{\nu_k} = X \quad \text{a.s.} $$

Thus, any sequence convergent in probability contains a deterministic subsequence which converges a.s. (to the same limit).
Borel-Cantelli Lemma

Consider a sequence of events $\{A_n,\ n = 1, 2, \ldots\}$, i.e., $A_n \in \mathcal{F}$, $n = 1, 2, \ldots$ Set
$$ \limsup_{n \to \infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m = [A_n \ \text{i.o.}] $$
and
$$ \liminf_{n \to \infty} A_n = \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m. $$
Obviously $\liminf_{n \to \infty} A_n \subseteq \limsup_{n \to \infty} A_n$.
Fact: If $\sum_{n=1}^{\infty} \mathbb{P}[A_n] < \infty$, then
$$ \mathbb{P}\left[ \limsup_{n \to \infty} A_n \right] = 0. $$

Fact: Assume the events $\{A_n,\ n = 1, 2, \ldots\}$ to be mutually independent. If $\sum_{n=1}^{\infty} \mathbb{P}[A_n] = \infty$, then
$$ \mathbb{P}\left[ \limsup_{n \to \infty} A_n \right] = 1. $$
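A Monte Carlo illustration of the first lemma, with independent events $A_n$ of probability $n^{-2}$ (summable), so only finitely many occur on almost every run; the horizon and run count are our choices:

```python
import random

# Independent events with P[A_n] = 1/n**2: summable probabilities, so
# each simulated run should see its last occurrence at a small index,
# never near the horizon.
random.seed(5)

def last_occurrence(horizon=100000):
    last = 0
    for n in range(1, horizon + 1):
        if random.random() < 1.0 / n**2:
            last = n
    return last

lasts = [last_occurrence() for _ in range(20)]
print(lasts)  # small indices only
```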
Establishing a.s. convergence

How do we show that $\mathbb{P}[X_n \text{ converges to } X] = 1$?

With $\varepsilon > 0$,
$$ A_n(\varepsilon) = [\|X_n - X\| \leq \varepsilon], \quad n = 1, 2, \ldots $$
and
$$ B_n(\varepsilon) = \bigcap_{m=n}^{\infty} A_m(\varepsilon), \quad n = 1, 2, \ldots $$
so that
$$ \bigcup_{n=1}^{\infty} B_n(\varepsilon) = \left[ \|X_n - X\| \leq \varepsilon \ \text{for all but finitely many } n \right]. $$
Key observation: By the definition of convergence,
$$ [X_n \text{ converges to } X] = \bigcap_{\varepsilon > 0} \left( \bigcup_{n=1}^{\infty} B_n(\varepsilon) \right) = \bigcap_{k=1}^{\infty} \left( \bigcup_{n=1}^{\infty} B_n(k^{-1}) \right). $$

Fact: A.s. convergence takes place if for every $\varepsilon > 0$ we have
$$ \mathbb{P}\left[ \bigcup_{n=1}^{\infty} B_n(\varepsilon) \right] = 1, \quad \text{or equivalently,} \quad \mathbb{P}\left[ \left( \bigcup_{n=1}^{\infty} B_n(\varepsilon) \right)^c \right] = 0. $$
But
$$ \left( \bigcup_{n=1}^{\infty} B_n(\varepsilon) \right)^c = \bigcap_{n=1}^{\infty} \left( \bigcap_{m=n}^{\infty} A_m(\varepsilon) \right)^c = \bigcap_{n=1}^{\infty} \left( \bigcup_{m=n}^{\infty} A_m(\varepsilon)^c \right) \qquad (2) $$
so
$$ \mathbb{P}\left[ \left( \bigcup_{n=1}^{\infty} B_n(\varepsilon) \right)^c \right] = \lim_{n \to \infty} \mathbb{P}\left[ \bigcup_{m=n}^{\infty} A_m(\varepsilon)^c \right]. $$
By a union bound argument, $\mathbb{P}\left[ \left( \bigcup_{n=1}^{\infty} B_n(\varepsilon) \right)^c \right] = 0$ if
$$ \sum_{n=1}^{\infty} \mathbb{P}[A_n(\varepsilon)^c] < \infty. $$
Fact: We have $\mathbb{P}[X_n \text{ converges to } X] = 1$ if for every $\varepsilon > 0$ we have
$$ \sum_{n=1}^{\infty} \mathbb{P}[\|X_n - X\| > \varepsilon] < \infty. $$
Instance of the Borel-Cantelli Lemma!
Interchanging limit and expectation

Consider the rvs $\{X_n,\ n = 1, 2, \ldots\}$ with
$$ \mathbb{E}[|X_n|] < \infty, \quad n = 1, 2, \ldots $$
such that $X_n \xrightarrow{P}_n X$ for some rv $X$. When do we have that the limit of the expected values is the expected value of the limit, i.e.,
$$ \lim_{n \to \infty} \mathbb{E}[X_n] = \mathbb{E}[X]? $$
What about using
the Monotone Convergence Theorem?
the Bounded Convergence Theorem?
Uniform integrability

The rvs $\{X_n,\ n = 1, 2, \ldots\}$ are uniformly integrable if
$$ \lim_{c \to \infty} \left( \sup_{n = 1, 2, \ldots} \mathbb{E}\left[ \mathbf{1}[|X_n| > c]\, |X_n| \right] \right) = 0. $$

Easy test: The rvs $\{X_n,\ n = 1, 2, \ldots\}$ are uniformly integrable if for some $r > 1$ we have
$$ \sup_{n = 1, 2, \ldots} \mathbb{E}[|X_n|^r] < \infty. $$
Fact: Consider a collection of rvs $\{X,\ X_n,\ n = 1, 2, \ldots\}$ such that $X_n \Longrightarrow_n X$. If the collection is uniformly integrable, then $\mathbb{E}[|X|] < \infty$ and
$$ \lim_{n \to \infty} \mathbb{E}[X_n] = \mathbb{E}[X]. $$
For each $n = 1, 2, \ldots$ and $c > 0$, we have the decomposition
$$ \mathbb{E}[X_n] - \mathbb{E}[X] = \left( \mathbb{E}[\mathbf{1}[|X_n| \leq c] X_n] - \mathbb{E}[\mathbf{1}[|X| \leq c] X] \right) + \mathbb{E}[\mathbf{1}[|X_n| > c] X_n] - \mathbb{E}[\mathbf{1}[|X| > c] X]. $$
Converse available: No escape from uniform integrability!
Poisson's Theorem for sums of Bernoulli rvs

For each $n = 1, 2, \ldots$, the collection $\{B_{n,k}(p_n),\ k = 1, \ldots, k_n\}$ of i.i.d. Bernoulli($p_n$) rvs is defined on some probability triple $(\Omega_n, \mathcal{F}_n, \mathbb{P}_n)$. Write
$$ S_n = \sum_{k=1}^{k_n} B_{n,k}(p_n), \quad n = 1, 2, \ldots $$
so that
$$ \mathbb{E}[S_n] = \mathbb{E}\left[ \sum_{k=1}^{k_n} B_{n,k}(p_n) \right] = \sum_{k=1}^{k_n} \mathbb{E}[B_{n,k}(p_n)] = k_n p_n. $$
Theorem 1. If for some $\lambda > 0$,
$$ \lim_{n \to \infty} k_n p_n = \lambda \quad \text{and} \quad \lim_{n \to \infty} k_n = \infty, $$
then $S_n \Longrightarrow_n \mathrm{Poi}(\lambda)$ with
$$ \mathrm{Poi}(\lambda)(k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, \ldots $$

Historically: $k_n = n$ and $p_n = \frac{\lambda}{n}$.

Many variations on this theme! Chen-Stein method for Poisson approximation. Point process version leads to ubiquity of Poisson modeling!
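A numerical look at the historical case $k_n = n$, $p_n = \lambda/n$ (the helpers `binom_pmf` and `poi_pmf` are ours): the Binomial$(n, \lambda/n)$ pmf approaches the $\mathrm{Poi}(\lambda)$ pmf as $n$ grows.

```python
import math

# Maximum pointwise gap between the Binomial(n, lam/n) and Poi(lam) pmfs
# over k = 0..9: it shrinks as n grows, as Poisson's theorem predicts.
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poi_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0
errs = {}
for n in (10, 100, 1000):
    errs[n] = max(abs(binom_pmf(k, n, lam / n) - poi_pmf(k, lam)) for k in range(10))
    print(n, errs[n])
```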
Strong Law of Large Numbers

Consider a collection $\{X,\ X_n,\ n = 1, 2, \ldots\}$ of i.i.d. rvs defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$, and write
$$ S_n = X_1 + \ldots + X_n, \quad n = 1, 2, \ldots $$

Theorem 2. If $\mathbb{E}[|X|] < \infty$, then
$$ \lim_{n \to \infty} \frac{S_n}{n} = \mathbb{E}[X] \quad \text{a.s.} $$

Frequentist definition of probability compatible with Kolmogorov's axiomatic model!
Weak Law of Large Numbers

Consider a collection $\{X,\ X_n,\ n = 1, 2, \ldots\}$ of i.i.d. rvs defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$, and write
$$ S_n = X_1 + \ldots + X_n, \quad n = 1, 2, \ldots $$

Theorem 3. If $\mathbb{E}[|X|] < \infty$, then
$$ \frac{S_n}{n} \xrightarrow{P}_n \mathbb{E}[X]. $$

Many variations on this theme! Markov's inequality at work (when second moments are available).
$$ \mathrm{Var}[S_n] = \mathrm{Var}[X_1 + \ldots + X_n] = \sum_{i=1}^{n} \mathrm{Var}[X_i] + \sum_{\substack{k, l = 1 \\ k \neq l}}^{n} \mathrm{Cov}[X_k, X_l]. $$
Here, under the i.i.d. assumptions,
$$ \mathrm{Var}\left[ \frac{S_n}{n} \right] = \frac{\mathrm{Var}[X]}{n} $$
so that
$$ \mathbb{P}\left[ \left| \frac{S_n}{n} - \mathbb{E}[X] \right| > \varepsilon \right] \leq \varepsilon^{-2}\, \mathrm{Var}\left[ \frac{S_n}{n} \right] = \frac{\mathrm{Var}[X]}{n \varepsilon^2}. $$
This also works under weaker assumptions, e.g., uncorrelated rvs, etc.
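The bound is easy to test by simulation, e.g., for i.i.d. Uniform(0,1) rvs where $\mathrm{Var}[X] = 1/12$ (the sample sizes are our choices):

```python
import random

# Chebyshev bound for the sample mean of n i.i.d. Uniform(0,1) rvs:
# P[|S_n/n - 1/2| > eps] <= Var[X] / (n * eps**2) = 1 / (12 * n * eps**2).
random.seed(6)
n, eps, trials = 1000, 0.05, 2000
exceed = 0
for _ in range(trials):
    mean = sum(random.random() for _ in range(n)) / n
    if abs(mean - 0.5) > eps:
        exceed += 1
empirical = exceed / trials
bound = 1.0 / (12 * n * eps**2)
print(empirical, bound)  # the empirical frequency sits below the bound
```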
Central Limit Theorem

Consider a collection $\{X,\ X_n,\ n = 1, 2, \ldots\}$ of i.i.d. rvs defined on the same probability triple $(\Omega, \mathcal{F}, \mathbb{P})$, and write
$$ S_n = X_1 + \ldots + X_n, \quad n = 1, 2, \ldots $$

Theorem 4. If $\mathbb{E}\left[ X^2 \right] < \infty$, then
$$ \sqrt{n} \left( \frac{S_n}{n} - \mathbb{E}[X] \right) \Longrightarrow_n \sigma U $$
with $U \sim N(0, 1)$ and $\sigma^2 = \mathrm{Var}[X]$.
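A Monte Carlo sketch of Theorem 4 for Uniform(0,1) summands, where $\mathbb{E}[X] = 1/2$ and $\sigma^2 = 1/12$: the standardized mean should be approximately $N(0,1)$, so we compare an empirical probability with $\Phi(1) \approx 0.8413$ (sample sizes are our choices).

```python
import math
import random

# Frequency of {sqrt(n) * (S_n/n - 1/2) / sigma <= 1} over many trials;
# by the CLT it should approach Phi(1) ~ 0.8413.
random.seed(7)
n, trials = 400, 20000
sigma = math.sqrt(1.0 / 12.0)
count = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    if math.sqrt(n) * (s / n - 0.5) / sigma <= 1.0:
        count += 1
print(count / trials)  # close to 0.8413
```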