STA205 Probability: Week 8 R. Wolpert

INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS

The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as $n\to\infty$ of the fraction of $n$ repeated, similar, and independent trials in which E occurs. Similarly the expectation of a random variable X is taken to be its asymptotic average, the limit as $n\to\infty$ of the average of $n$ repeated, similar, and independent replications of X. As statisticians trying to make inference about the underlying probability distribution $f(x\mid\theta)$ governing observed random variables $X_i$, this suggests that we should be interested in the probability distribution for large $n$ of quantities like the average of the RV's, $\bar X_n \equiv \frac1n\sum_{i=1}^n X_i$.

Three of the most celebrated theorems of probability theory concern this sum. For independent random variables $X_i$, all with the same probability distribution satisfying $E|X_i|^3 < \infty$, set $\mu = EX_i$, $\sigma^2 = E|X_i-\mu|^2$, and $S_n = \sum_{i=1}^n X_i$. The three main results are:

Laws of Large Numbers:  $\dfrac{S_n - n\mu}{\sigma n} \to 0$  (i.p. and a.s.)

Central Limit Theorem:  $\dfrac{S_n - n\mu}{\sigma\sqrt n} \Rightarrow N(0,1)$  (i.d.)

Law of the Iterated Logarithm:  $\limsup_{n\to\infty} \pm\dfrac{S_n - n\mu}{\sigma\sqrt{2n\log\log n}} = 1.0$  (a.s.)

Together these three give a clear picture of how quickly and in what sense $\frac1n S_n$ tends to $\mu$. We begin with the Law of Large Numbers (LLN), in its weak form (asserting convergence i.p.) and in its strong form (convergence a.s.). There are several versions of both theorems. The simplest requires the $X_i$ to be IID and $L^2$; stronger results allow us to weaken (but not eliminate) the independence requirement, permit non-identical distributions, and consider what happens if the RV's are only $L^1$ (or worse!) instead of $L^2$. The text covers these things well; to complement it I am going to: (1) prove the simplest version, and with it the Borel-Cantelli theorems; and (2) show what happens with Cauchy random variables, which don't satisfy the requirements (the LLN fails).
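
For a quick numerical sanity check of these three scalings, here is a minimal Python sketch (using numpy; the fair $\pm1$ coin flips, so that $\mu = 0$ and $\sigma = 1$, the sample size, and the random seed are illustrative choices, not part of the notes):

```python
# Numerical illustration of the three scalings for fair +/-1 coin flips
# (mu = 0, sigma = 1): S_n/n should be near 0, S_n/sqrt(n) should look like
# one draw from roughly N(0,1), and |S_k|/sqrt(2 k log log k) should be of
# order 1 along the path.
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.choice([-1.0, 1.0], size=n)   # X_1, ..., X_n
s = np.cumsum(x)                      # partial sums S_1, ..., S_n
k = np.arange(1, n + 1)

print("LLN:  S_n/n       =", s[-1] / n)               # near 0
print("CLT:  S_n/sqrt(n) =", s[-1] / np.sqrt(n))      # roughly N(0,1) in size
mask = k > 10                                         # avoid log(log k) <= 0
lil = np.abs(s[mask]) / np.sqrt(2 * k[mask] * np.log(np.log(k[mask])))
print("LIL:  max_k |S_k|/sqrt(2 k log log k) =", lil.max())  # typically of order 1
```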

I. Weak version, non-iid, $L^2$: $\mu_i = EX_i$, $\sigma_{ij} = E[X_i-\mu_i][X_j-\mu_j]$
   A. $Y_n = (S_n - \Sigma\mu_i)/n$ satisfies $EY_n = 0$, $EY_n^2 = \frac1{n^2}\Sigma_{i\le n}\sigma_{ii} + \frac2{n^2}\Sigma_{i<j\le n}\sigma_{ij}$
      1. If $\sigma_{ii} \le M$ and $\sigma_{ij} \le 0$ or $|\sigma_{ij}| < Mr^{|i-j|}$, $r < 1$, Chebychev $\Rightarrow$ $Y_n \to 0$, i.p.
      2. (pairwise) IID $L^2$ is OK
II. Strong version, non-iid, $L^2$: $EX_i = 0$, $EX_i^2 \le M$, $EX_iX_j \le 0$.
   A. $P[|S_n| > n\epsilon] \le \frac{Mn}{n^2\epsilon^2} = \frac{M}{n\epsilon^2}$
      1. $P[|S_{n^2}| > n^2\epsilon] \le \frac{M}{n^2\epsilon^2}$, so $\sum_n P[|S_{n^2}| > n^2\epsilon] \le \frac{M\pi^2}{6\epsilon^2}$
      2. Borel-Cantelli: $P[|S_{n^2}| > n^2\epsilon \text{ i.o.}] = 0$, so $\frac1{n^2}S_{n^2} \to 0$ a.s.
      3. $D_n = \max_{n^2 < k < (n+1)^2}|S_k - S_{n^2}|$, $ED_n^2 \le 2n\,E|S_{(n+1)^2} - S_{n^2}|^2 \le 4n^2M$
      4. Chebychev: $P[D_n > n^2\epsilon] \le \frac{4n^2M}{n^4\epsilon^2}$, so $D_n/n^2 \to 0$ a.s.
   B. For $n^2 \le k < (n+1)^2$, $|S_k|/k \le \frac{|S_{n^2}| + D_n}{n^2} \to 0$ a.s., QED
      1. Bernoulli RV's, normal number theorem, Monte Carlo integration (see the sketch after this outline).
III. Weak version, pairwise-iid, $L^1$
   A. Equivalent sequences: $\sum_n P[X_n \ne Y_n] < \infty$
      1. $\sum_n 1_{[X_n \ne Y_n]} < \infty$ a.s.
      2. $\sum_{i=1}^n X_i$, $\frac1{a_n}\sum_{i=1}^n X_i$ converge iff $\sum_{i=1}^n Y_i$, $\frac1{a_n}\sum_{i=1}^n Y_i$ both converge
      3. $Y_n = X_n 1_{[|X_n| \le n]}$
IV. Counterexamples: Cauchy, etc.
   A. $X_i \sim \frac{dx}{\pi[1+x^2]}$ $\Rightarrow$ $P[|S_n/n| \le \epsilon] = \frac2\pi\tan^{-1}(\epsilon) < 1$, so the WLLN fails.
   B. $P[X_i = \pm n] = \frac{c}{n^2}$, $n \ge 1$: $X_i \notin L^1$, and $S_n/n \not\to 0$ i.p. or a.s.
   C. $P[X_i = \pm n] = \frac{c}{n^2\log n}$, $n > 1$: $X_i \notin L^1$, but $S_n/n \to 0$ i.p. and not a.s.
   D. Medians: for ANY RV's, if $X_n \to X$ i.p. then $m_n \to m$, provided the median $m$ is unique.
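
As a concrete instance of item II.B.1, the strong LLN is what justifies Monte Carlo integration: sample means of iid evaluations of an integrand converge a.s. to the integral. A minimal Python sketch (the integrand $e^{-x^2}$ on $(0,1)$ and the sample sizes are arbitrary illustrative choices, not taken from the notes):

```python
# Monte Carlo integration as an application of the SLLN (illustration only):
# for U_i iid Uniform(0,1), (1/n) sum g(U_i) -> E[g(U_1)] = integral_0^1 g(x) dx  a.s.
import numpy as np

def mc_integral(g, n, rng):
    """Estimate the integral of g over (0,1) by the sample mean of g(U_i)."""
    u = rng.uniform(0.0, 1.0, size=n)
    return g(u).mean()

rng = np.random.default_rng(1)
g = lambda x: np.exp(-x**2)            # arbitrary illustrative integrand
for n in [10**2, 10**4, 10**6]:
    print(n, mc_integral(g, n, rng))   # settles near 0.7468...
```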

Let $X_i$ be iid standard Cauchy RV's, with CDF and characteristic function

$$P[X_1 \le t] = \int_{-\infty}^t \frac{dx}{\pi[1+x^2]} = \frac12 + \frac1\pi\arctan(t), \qquad E\,e^{i\omega X_1} = \int_{-\infty}^\infty e^{i\omega x}\,\frac{dx}{\pi[1+x^2]} = e^{-|\omega|},$$

so $S_n/n$ has characteristic function

$$E\,e^{i\omega S_n/n} = E\,e^{i\frac\omega n[X_1+\cdots+X_n]} = \left(E\,e^{i\frac\omega n X_1}\right)^n = \left(e^{-|\omega|/n}\right)^n = e^{-|\omega|}$$

and $S_n/n$ also has the standard Cauchy distribution with $P[S_n/n \le t] = \frac12 + \frac1\pi\arctan(t)$; in particular, $S_n/n$ does not converge almost surely, or even in probability.

A LAW OF LARGE NUMBERS FOR CORRELATED SEQUENCES

In many applications we would like a Law of Large Numbers for sequences of random variables that are not independent; for example, in Markov Chain Monte Carlo integration, we have a stationary Markov chain $\{X_t\}$ (this means that the distribution of $X_t$ is the same for all $t$ and that the conditional distribution of $X_u$ for $u > t$, given $\{X_s : s \le t\}$, depends only on $X_t$) and want to estimate the population mean $E[\phi(X_t)]$ for some function $\phi(\cdot)$ by the sample mean

$$E[\phi(X_t)] \approx \frac1T\sum_{t=1}^T \phi(X_t).$$

Even though they are identically distributed, the random variables $Y_t \equiv \phi(X_t)$ won't be independent if the $X_t$ aren't independent, so the LLN we already have doesn't quite apply.

A sequence of random variables $Y_t$ is called stationary if each $Y_t$ has the same probability distribution and, moreover, each finite set $(Y_{t_1+h}, Y_{t_2+h}, \dots, Y_{t_k+h})$ has a joint distribution that doesn't depend on $h$. The sequence is called $L^2$ if each $Y_t$ has a finite variance $\sigma^2$ (and hence also a well-defined mean $\mu$); by stationarity it also follows that the covariance $\gamma_{st} = E[(Y_s-\mu)(Y_t-\mu)]$ is finite and depends only on the absolute difference $|t-s|$.

Theorem. If a stationary $L^2$ sequence has a summable covariance, i.e., satisfies $\sum_{t=-\infty}^\infty \gamma_{st} \le c < \infty$, then

$$E[Y_t] = \lim_{T\to\infty} \frac1T\sum_{t=1}^T Y_t.$$

Proof. Let $S_T$ be the sum of the first $T$ $Y_t$'s and set (as usual) $\bar Y_T \equiv S_T/T$. The variance of $S_T$ is

$$E[(S_T - T\mu)^2] = \sum_{s=1}^T\sum_{t=1}^T E[(Y_s-\mu)(Y_t-\mu)] = \sum_{s=1}^T\sum_{t=1}^T \gamma_{st} \le \sum_{s=1}^T\sum_{t=-\infty}^\infty \gamma_{st} \le Tc,$$

so $\bar Y_T$ has variance $V[\bar Y_T] \le c/T$ and by Chebychev's inequality

$$P[|\bar Y_T - \mu| > \epsilon] \le \frac{E[(\bar Y_T-\mu)^2]}{\epsilon^2} = \frac{E[(S_T - T\mu)^2]}{T^2\epsilon^2} \le \frac{Tc}{T^2\epsilon^2} = \frac{c}{T\epsilon^2} \to 0 \quad\text{as } T\to\infty.$$

A strong LLN follows with a bit more work, just as for iid random variables.

Examples

1. IID: If $X_t$ are independent and identically distributed, and if $Y_t = \phi(X_t)$ has finite variance $\sigma^2$, then $Y_t$ has a well-defined finite mean $\mu$ and $\bar Y_T \to \mu$. Here $\gamma_{st} = \sigma^2$ if $s = t$ and $\gamma_{st} = 0$ if $s \ne t$, so $c = \sigma^2 < \infty$.

2. AR$_1$: If $Z_t$ are iid $N(0,1)$ for $-\infty < t < \infty$, $\mu \in \mathbb{R}$, $\sigma > 0$, $-1 < \rho < 1$, and

$$X_t \equiv \mu + \sigma\sum_{s=0}^\infty \rho^s Z_{t-s} = \rho X_{t-1} + \alpha + \sigma Z_t, \qquad (*)$$

where $\alpha = (1-\rho)\mu$, then the $X_t$ are identically distributed (all with the $N\!\left(\mu, \frac{\sigma^2}{1-\rho^2}\right)$ distribution) but not independent (since $\gamma_{st} = \frac{\sigma^2}{1-\rho^2}\rho^{|s-t|} \ne 0$); this is called an autoregressive process (because of equation (*), expressing $X_t$ as a regression on previous $X_s$'s) of order one (because only one earlier $X_s$ appears in (*)), and is about the simplest non-iid sequence occurring in applications. Since the covariance is summable,

$$\sum_{t=-\infty}^\infty \gamma_{st} = \frac{\sigma^2}{1-\rho^2}\cdot\frac{1+\rho}{1-\rho} = \frac{\sigma^2}{(1-\rho)^2} < \infty,$$

we again have $\bar X_T \to \mu$ as $T\to\infty$ (see the simulation sketch after these examples).

3. Geometric Ergodicity: If for some $0 < \rho < 1$ and $c > 0$ we have $|\gamma_{st}| \le c\rho^{|s-t|}$ for a Markov chain $Y_t$, the chain is called geometrically ergodic (because $c\rho^t$ is a geometric sequence), and the same argument as for AR$_1$ shows that $\bar Y_T$ converges; Meyn & Tweedie (1993), Tierney (1994), and others have given conditions for MCMC chains to be geometrically ergodic, and hence for the almost-sure convergence of sample averages to population means.

4. General Ergodicity: Consider the three sequences of random variables on $(\Omega,\mathcal{F},P)$ with $\Omega = (0,1]$ and $\mathcal{F} = \mathcal{B}(\Omega)$, each with $X_0(\omega) = \omega$:

   1. $X_{n+1} \equiv 2X_n \pmod 1$;
   2. $X_{n+1} \equiv X_n + \alpha \pmod 1$ (Does it matter if $\alpha$ is rational?);
   3. $X_{n+1} \equiv 4X_n(1-X_n)$.

For each, find a probability measure $P$ (equivalently, find a distribution for $X_0$) such that the $X_n$ are all identically distributed; the sequence is called ergodic if each $E \in \mathcal{F}$ left invariant by the transformation $T$ that takes $X_n$ to $X_{n+1}$, $E = T^{-1}(E)$, is trivial in the sense that $P[E] = 0$ or $P[E] = 1$. The Ergodic Theorem asserts that $\bar X_n$ converges almost surely to a $T$-invariant limit $\bar X_\infty$ as $n\to\infty$; since only constants are $T$-invariant for ergodic sequences, it follows that $\bar X_n \to \mu = EX_n$. The conditions here are weaker than those for the usual LLN; in all three cases above, for example, each $X_n$ is completely determined by $X_0$ so there is complete dependence!
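
To see the correlated-sequence LLN of Example 2 numerically, here is a small Python sketch (the parameter values $\mu = 3$, $\sigma = 1$, $\rho = 0.9$, the run length, and the seed are arbitrary illustrative choices, not part of the notes):

```python
# Illustration of the LLN for a correlated (AR_1) sequence: despite strong
# autocorrelation, the sample mean of X_t = rho*X_{t-1} + alpha + sigma*Z_t
# still converges to mu; the variance of the sample mean is ~ c/T for the
# larger constant c = sigma^2/(1-rho)^2.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, rho = 3.0, 1.0, 0.9                          # illustrative values
alpha = (1.0 - rho) * mu
T = 10**5

x = np.empty(T)
x[0] = rng.normal(mu, sigma / np.sqrt(1.0 - rho**2))    # start in stationarity
for t in range(1, T):
    x[t] = rho * x[t - 1] + alpha + sigma * rng.normal()

print("sample mean:", x.mean(), "  (target mu =", mu, ")")
```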

Stable Limit Laws

Let $S_n = X_1 + \dots + X_n$ be the partial sum of iid random variables. IF the random variables are all square integrable, THEN the Central Limit Theorem applies and necessarily

$$\frac{S_n - n\mu}{\sqrt n\,\sigma} \Rightarrow \mathsf{No}(0,1).$$

But what if each $X_n$ is not square integrable? We have already seen the CLT fail for Cauchy variables $X_j$. Denote by $F(x) = P[X_n \le x]$ the common CDF of the $\{X_n\}$.

Theorem (Stable Limit Law). There exist constants $A_n > 0$ and $B_n \in \mathbb{R}$ and a distribution $\mu$ for which

$$\frac{S_n}{A_n} - B_n \Rightarrow \mu$$

if and only if there are constants $0 < \alpha \le 2$, $M_- \ge 0$, and $M_+ \ge 0$, with $M_- + M_+ > 0$, such that the following limits hold for every $\xi > 0$ as $x\to+\infty$:

1. $\dfrac{F(-x)}{1-F(x)} = \dfrac{P[X \le -x]}{P[X > x]} \to \dfrac{M_-}{M_+}$;

2. $M_+ > 0 \Rightarrow \dfrac{1-F(x\xi)}{1-F(x)} \to \xi^{-\alpha}$, $\qquad M_- > 0 \Rightarrow \dfrac{F(-x\xi)}{F(-x)} \to \xi^{-\alpha}$.

In this case the limit is the Stable Distribution with index $\alpha$, with characteristic function

$$E[e^{i\omega Y}] = e^{i\delta\omega - \gamma|\omega|^\alpha\left[1 - i\beta\tan\frac{\pi\alpha}2\,\mathrm{sgn}(\omega)\right]},$$

where $\beta = \frac{M_+ - M_-}{M_- + M_+}$ and $\gamma = (M_- + M_+)$. The sequence $A_n$ must be essentially $A_n \approx n^{1/\alpha}$ (more precisely, the sequence $C_n = n^{-1/\alpha}A_n$ is slowly changing in the sense that $1 = \lim_{n\to\infty} C_{cn}/C_n$ for every $c > 0$); thus partial sums converge to stable distributions at rate $n^{1/\alpha}$, more slowly (much more slowly, if $\alpha$ is close to one) than in the $L^2$ (Gaussian) case of the central limit theorem. Note that the Cauchy distribution is the special case of $(\alpha,\beta,\gamma,\delta) = (1,0,1,0)$ and the Normal distribution is the special case of $(\alpha,\beta,\gamma,\delta) = (2,0,\sigma^2/2,\mu)$.

Although each Stable distribution is absolutely continuous with a continuous probability density function $f(y)$, these two cases and the inverse gamma distribution with $\alpha = 1/2$ and $\beta = \pm 1$ are the only ones where the p.d.f. can be given in closed form. Moments are easy enough to compute; for $\alpha < 2$ the Stable distribution only has finite moments of order $p < \alpha$ and, in particular, none of them has a finite variance. The Cauchy has finite moments of order $p < 1$ but does not have a well-defined mean.

Condition 2. says that each tail must fall off like a power (sometimes called Pareto tails), and the powers must be identical; Condition 1. gives the tail ratio. In a common special case, $M_- = 0$; for example, random variables $X_n$ with the Pareto distribution (often used to model income) given by $P[X_n > t] = (k/t)^\alpha$ for $t \ge k$ will have a stable limit for their partial sums if $\alpha < 2$, and (by the CLT) a normal limit if $\alpha \ge 2$. You can find more details in Chapter 9 of Breiman's book.
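
To see the $n^{1/\alpha}$ scaling in practice, one can simulate Pareto partial sums; the following Python sketch (with the arbitrary illustrative choices $k = 1$, $\alpha = 1.5$, and modest Monte Carlo settings, none of them from the notes) compares the spread of $(S_n - n\mu)/n^{1/\alpha}$, which stabilizes, with that of $(S_n - n\mu)/\sqrt n$, which keeps growing:

```python
# Illustration of the n^(1/alpha) scaling for heavy-tailed (Pareto) sums.
# For P[X > t] = (k/t)^alpha with 1 < alpha < 2 the mean mu = alpha*k/(alpha-1)
# is finite but the variance is not, so (S_n - n*mu)/sqrt(n) keeps growing
# with n while (S_n - n*mu)/n^(1/alpha) stays roughly constant in scale.
import numpy as np

rng = np.random.default_rng(3)
k, alpha = 1.0, 1.5                          # illustrative Pareto parameters
mu = alpha * k / (alpha - 1.0)               # finite mean since alpha > 1
reps = 1000

def iqr(a):
    """Interquartile range of a sample."""
    q75, q25 = np.percentile(a, [75, 25])
    return q75 - q25

for n in [10**2, 10**3, 10**4]:
    u = 1.0 - rng.random((reps, n))          # Uniform(0,1], avoids u = 0
    x = k * u ** (-1.0 / alpha)              # inverse-CDF sampling of Pareto(k, alpha)
    s = x.sum(axis=1)
    print(n,
          "IQR of (S_n - n*mu)/n^(1/a):", iqr((s - n * mu) / n ** (1.0 / alpha)),
          "IQR of (S_n - n*mu)/sqrt(n):", iqr((s - n * mu) / np.sqrt(n)))
```

The interquartile range is used instead of the sample variance because the variance of these sums is infinite, so sample variances would not settle down.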