Probability Theory

Richard F. Bass

© Copyright 2014 Richard F. Bass

Contents

1 Basic notions
  1.1 A few definitions from measure theory
  1.2 Definitions
  1.3 Some basic facts
  1.4 Independence
2 Convergence of random variables
  2.1 Types of convergence
  2.2 Weak law of large numbers
  2.3 The strong law of large numbers
  2.4 Techniques related to a.s. convergence
  2.5 Uniform integrability
3 Conditional expectations
4 Martingales
  4.1 Definitions
  4.2 Stopping times
  4.3 Optional stopping
  4.4 Doob's inequalities
  4.5 Martingale convergence theorems
  4.6 Applications of martingales
5 Weak convergence
6 Characteristic functions
  6.1 Inversion formula
  6.2 Continuity theorem
7 Central limit theorem
8 Gaussian sequences
9 Kolmogorov extension theorem
10 Brownian motion
  10.1 Definition and construction
  10.2 Nowhere differentiability
11 Markov chains
  11.1 Framework for Markov chains
  11.2 Examples
  11.3 Markov properties
  11.4 Recurrence and transience
  11.5 Stationary measures
  11.6 Convergence

Chapter 1. Basic notions

1.1 A few definitions from measure theory

Given a set $X$, a $\sigma$-algebra on $X$ is a collection $\mathcal{A}$ of subsets of $X$ such that
(1) $\emptyset \in \mathcal{A}$;
(2) if $A \in \mathcal{A}$, then $A^c \in \mathcal{A}$, where $A^c = X \setminus A$;
(3) if $A_1, A_2, \ldots \in \mathcal{A}$, then $\cup_{i=1}^\infty A_i$ and $\cap_{i=1}^\infty A_i$ are both in $\mathcal{A}$.

A measure on a pair $(X, \mathcal{A})$ is a function $\mu : \mathcal{A} \to [0, \infty]$ such that
(1) $\mu(\emptyset) = 0$;
(2) $\mu(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mu(A_i)$ whenever the $A_i$ are in $\mathcal{A}$ and are pairwise disjoint.

A function $f : X \to \mathbb{R}$ is measurable if the set $\{x : f(x) > a\} \in \mathcal{A}$ for all $a \in \mathbb{R}$. A property holds almost everywhere, written a.e., if the set where it fails has measure 0. For example, $f = g$ a.e. if $\mu(\{x : f(x) \neq g(x)\}) = 0$.

The characteristic function $\chi_A$ of a set $A \in \mathcal{A}$ is the function that is 1 when $x \in A$ and is zero otherwise. A function of the form $\sum_{i=1}^n a_i \chi_{A_i}$ is called a simple function. If $f$ is simple and of the above form, we define
$$\int f\,d\mu = \sum_{i=1}^n a_i \mu(A_i).$$

If $f$ is nonnegative and measurable, we define
$$\int f\,d\mu = \sup\Big\{ \int s\,d\mu : 0 \le s \le f,\ s \text{ simple} \Big\}.$$
Provided $\int |f|\,d\mu < \infty$, we define
$$\int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu,$$
where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$.

1.2 Definitions

A probability or probability measure is a measure whose total mass is one. Because the origins of probability are in statistics rather than analysis, some of the terminology is different. For example, instead of denoting a measure space by $(X, \mathcal{A}, \mu)$, probabilists use $(\Omega, \mathcal{F}, P)$. So here $\Omega$ is a set, $\mathcal{F}$ is called a $\sigma$-field (which is the same thing as a $\sigma$-algebra), and $P$ is a measure with $P(\Omega) = 1$. Elements of $\mathcal{F}$ are called events. Elements of $\Omega$ are denoted $\omega$.

Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from $\Omega$ to $\mathbb{R}$ are called random variables and are usually denoted by $X$ or $Y$ or other capital letters. We often abbreviate "random variable" by r.v.

We let $A^c = \{\omega \in \Omega : \omega \notin A\}$ (called the complement of $A$) and $B - A = B \cap A^c$.

Integration (in the sense of Lebesgue) is called expectation or expected value, and we write $E X$ for $\int X\,dP$. The notation $E[X; A]$ is often used for $\int_A X\,dP$.

The random variable $1_A$ is the function that is one if $\omega \in A$ and zero otherwise. It is called the indicator of $A$ (the name "characteristic function" in probability refers to the Fourier transform). Events such as $\{\omega : X(\omega) > a\}$ are almost always abbreviated by $(X > a)$.

Given a random variable $X$, we can define a probability on $\mathbb{R}$ by
$$P_X(A) = P(X \in A), \qquad A \subset \mathbb{R}. \tag{1.1}$$

The probability $P_X$ is called the law of $X$ or the distribution of $X$. We define $F_X : \mathbb{R} \to [0, 1]$ by
$$F_X(x) = P_X((-\infty, x]) = P(X \le x). \tag{1.2}$$
The function $F_X$ is called the distribution function of $X$.

As an example, let $\Omega = \{H, T\}$, $\mathcal{F}$ all subsets of $\Omega$ (there are 4 of them), and $P(H) = P(T) = \frac12$. Let $X(H) = 1$ and $X(T) = 0$. Then $P_X = \frac12 \delta_0 + \frac12 \delta_1$, where $\delta_x$ is point mass at $x$, that is, $\delta_x(A) = 1$ if $x \in A$ and 0 otherwise. Here $F_X(a) = 0$ if $a < 0$, $\frac12$ if $0 \le a < 1$, and 1 if $a \ge 1$.

Proposition 1.1 The distribution function $F_X$ of a random variable $X$ satisfies:
(a) $F_X$ is nondecreasing;
(b) $F_X$ is right continuous with left limits;
(c) $\lim_{x \to \infty} F_X(x) = 1$ and $\lim_{x \to -\infty} F_X(x) = 0$.

Proof. We prove the first part of (b) and leave the others to the reader. If $x_n \downarrow x$, then $(X \le x_n) \downarrow (X \le x)$, and so $P(X \le x_n) \downarrow P(X \le x)$ since $P$ is a measure. Note that if $x_n \uparrow x$, then $(X \le x_n) \uparrow (X < x)$, and so $F_X(x_n) \uparrow P(X < x)$.

Any function $F : \mathbb{R} \to [0, 1]$ satisfying (a)-(c) of Proposition 1.1 is called a distribution function, whether or not it comes from a random variable.

Proposition 1.2 Suppose $F$ is a distribution function. There exists a random variable $X$ such that $F = F_X$.

Proof. Let $\Omega = [0, 1]$, $\mathcal{F}$ the Borel $\sigma$-field, and $P$ Lebesgue measure. Define $X(\omega) = \sup\{x : F(x) < \omega\}$. (Here the Borel $\sigma$-field is the smallest $\sigma$-field containing all the open sets.) We check that $F_X = F$.

Suppose $X(\omega) \le y$. Then $F(z) \ge \omega$ if $z > y$. By the right continuity of $F$ we have $F(y) \ge \omega$. Hence $(X(\omega) \le y) \subset (\omega \le F(y))$.

Suppose $\omega \le F(y)$. If $X(\omega) > y$, then by the definition of $X$ there exists $z > y$ such that $F(z) < \omega$. But then $\omega \le F(y) \le F(z)$, a contradiction. Therefore $(\omega \le F(y)) \subset (X(\omega) \le y)$. We then have
$$P(X(\omega) \le y) = P(\omega \le F(y)) = F(y).$$

In the above proof, essentially $X = F^{-1}$. However $F$ may have jumps or be constant over some intervals, so some care is needed in defining $X$.

Certain distributions or laws are very common. We list some of them.

(a) Bernoulli. A random variable is Bernoulli if $P(X = 1) = p$, $P(X = 0) = 1 - p$ for some $p \in [0, 1]$.

(b) Binomial. This is defined by $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, where $n$ is a positive integer, $0 \le k \le n$, and $p \in [0, 1]$.

(c) Geometric. For $p \in (0, 1)$ we set $P(X = k) = (1 - p) p^k$. Here $k$ is a nonnegative integer.

(d) Poisson. For $\lambda > 0$ we set $P(X = k) = e^{-\lambda} \lambda^k / k!$. Again $k$ is a nonnegative integer.

(e) Uniform. For some positive integer $n$, set $P(X = k) = 1/n$ for $1 \le k \le n$.

Suppose $F$ is absolutely continuous. This is not the definition, but being absolutely continuous is equivalent to the existence of a function $f$ such that
$$F(x) = \int_{-\infty}^x f(y)\,dy$$
for all $x$. We call $f = F'$ the density of $F$. Some examples of distributions characterized by densities are the following.

(f) Uniform on $[a, b]$. Define $f(x) = (b - a)^{-1} 1_{[a,b]}(x)$. This means that if $X$ has a uniform distribution, then
$$P(X \in A) = \int_A \frac{1}{b - a}\, 1_{[a,b]}(x)\,dx.$$

(g) Exponential. For $x > 0$ let $f(x) = \lambda e^{-\lambda x}$.

(h) Standard normal. Define $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, so that
$$P(X \in A) = \frac{1}{\sqrt{2\pi}} \int_A e^{-x^2/2}\,dx.$$

(i) $N(\mu, \sigma^2)$. We shall see later that a standard normal has mean zero and variance one. If $Z$ is a standard normal, then a $N(\mu, \sigma^2)$ random variable has the same distribution as $\mu + \sigma Z$. It is an exercise in calculus to check that such a random variable has density
$$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}. \tag{1.3}$$

(j) Cauchy. Here
$$f(x) = \frac{1}{\pi} \cdot \frac{1}{1 + x^2}.$$

1.3 Some basic facts

We can use the law of a random variable to calculate expectations.

Proposition 1.3 If $g$ is nonnegative or if $E|g(X)| < \infty$, then
$$E\,g(X) = \int g(x)\,P_X(dx).$$

Proof. If $g$ is the indicator of an event $A$, this is just the definition of $P_X$. By linearity, the result holds for simple functions. By the monotone convergence theorem, the result holds for nonnegative functions, and by writing $g = g^+ - g^-$, it holds for $g$ such that $E|g(X)| < \infty$.

If $F_X$ has a density $f$, then $P_X(dx) = f(x)\,dx$. So, for example, $E X = \int x f(x)\,dx$ and $E X^2 = \int x^2 f(x)\,dx$. (We need $E|X|$ finite to justify this if $X$ is not necessarily nonnegative.)

We define the mean of a random variable to be its expectation, and the variance of a random variable is defined by
$$\mathrm{Var}\,X = E (X - E X)^2.$$
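The construction in Proposition 1.2 is the basis of the inverse transform method for sampling. Here is a minimal sketch (not from the text; it assumes NumPy is available and uses the exponential law of example (g) with rate $\lambda = 2$, for which $X(\omega) = \sup\{x : F(x) < \omega\} = -\log(1 - \omega)/\lambda$ in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                          # rate of the exponential distribution
omega = rng.uniform(size=100_000)  # omega drawn from Lebesgue measure on [0, 1)

# X(omega) = sup{x : F(x) < omega} for F(x) = 1 - exp(-lam * x)
# reduces to the explicit inverse -log(1 - omega) / lam.
X = -np.log1p(-omega) / lam

print(X.mean())  # close to E X = 1/lam = 0.5
print(X.var())   # close to Var X = 1/lam^2 = 0.25
```

The printed moments match the values obtained from Proposition 1.3 by integrating against the density $\lambda e^{-\lambda x}$.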

For example, it is routine to see that the mean of a standard normal is zero and its variance is one. Note
$$\mathrm{Var}\,X = E(X^2 - 2X\,E X + (E X)^2) = E X^2 - (E X)^2.$$

Another equality that is useful is the following.

Proposition 1.4 If $X \ge 0$ a.s. and $p > 0$, then
$$E X^p = \int_0^\infty p\lambda^{p-1}\, P(X > \lambda)\,d\lambda.$$
The proof will show that this equality is also valid if we replace $P(X > \lambda)$ by $P(X \ge \lambda)$.

Proof. Use Fubini's theorem and write
$$\int_0^\infty p\lambda^{p-1} P(X > \lambda)\,d\lambda = E \int_0^\infty p\lambda^{p-1}\, 1_{(\lambda,\infty)}(X)\,d\lambda = E \int_0^X p\lambda^{p-1}\,d\lambda = E X^p.$$

We need two elementary inequalities.

Proposition 1.5 (Chebyshev's inequality) If $X \ge 0$, then
$$P(X \ge a) \le \frac{E X}{a}.$$

Proof. We write
$$P(X \ge a) = E\big[1_{[a,\infty)}(X)\big] \le E\Big[\frac{X}{a}\, 1_{[a,\infty)}(X)\Big] \le E[X/a],$$
since $X/a$ is bigger than 1 when $X \in [a, \infty)$.
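Proposition 1.4 lends itself to a numerical check. A sketch (NumPy assumed; the grid cutoff at 20 and the sample size are arbitrary choices): for rate-one exponential samples, $E X^2 = 2$ should match the integral $\int_0^\infty 2\lambda\, P(X > \lambda)\,d\lambda$, with the tail probability estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=100_000)  # E X^2 = 2 for rate 1

dlam = 0.05
lam_grid = np.arange(0.0, 20.0, dlam)         # truncate the integral at 20
tail = np.array([(X > lam).mean() for lam in lam_grid])  # P(X > lambda)

# Riemann sum for int_0^infty 2 * lambda * P(X > lambda) d(lambda), i.e. p = 2
print((2.0 * lam_grid * tail).sum() * dlam)
print((X**2).mean())                          # both outputs should be close to 2
```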

If we apply this to $X = (Y - E Y)^2$, we obtain
$$P(|Y - E Y| \ge a) = P((Y - E Y)^2 \ge a^2) \le \mathrm{Var}\,Y / a^2. \tag{1.4}$$
This special case of Chebyshev's inequality is sometimes itself referred to as Chebyshev's inequality, while Proposition 1.5 is sometimes called the Markov inequality.

The second inequality we need is Jensen's inequality, not to be confused with the Jensen's formula of complex analysis.

Proposition 1.6 Suppose $g$ is convex and $X$ and $g(X)$ are both integrable. Then
$$g(E X) \le E\,g(X).$$

Proof. One property of convex functions is that they lie above their tangent lines, and more generally their support lines. So if $x_0 \in \mathbb{R}$, we have
$$g(x) \ge g(x_0) + c(x - x_0)$$
for some constant $c$. Take $x = X(\omega)$ and take expectations to obtain
$$E\,g(X) \ge g(x_0) + c(E X - x_0).$$
Now set $x_0$ equal to $E X$.

If $A_n$ is a sequence of sets, define $(A_n \text{ i.o.})$, read "$A_n$ infinitely often," by
$$(A_n \text{ i.o.}) = \cap_{n=1}^\infty \cup_{i=n}^\infty A_i.$$
This set consists of those $\omega$ that are in infinitely many of the $A_n$.

A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove the first part here, leaving the second part to the next section.

Proposition 1.7 (Borel-Cantelli lemma) If $\sum_n P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$.

Proof. We have
$$P(A_n \text{ i.o.}) = \lim_{n \to \infty} P(\cup_{i=n}^\infty A_i).$$
However,
$$P(\cup_{i=n}^\infty A_i) \le \sum_{i=n}^\infty P(A_i),$$
which tends to zero as $n \to \infty$.

1.4 Independence

Let us say two events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. The events $A_1, \ldots, A_n$ are independent if
$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_j}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_j})$$
for every subset $\{i_1, \ldots, i_j\}$ of $\{1, 2, \ldots, n\}$.

Proposition 1.8 If $A$ and $B$ are independent, then $A^c$ and $B$ are independent.

Proof. We write
$$P(A^c \cap B) = P(B) - P(A \cap B) = P(B) - P(A)P(B) = P(B)(1 - P(A)) = P(B)P(A^c).$$

We say two $\sigma$-fields $\mathcal{F}$ and $\mathcal{G}$ are independent if $A$ and $B$ are independent whenever $A \in \mathcal{F}$ and $B \in \mathcal{G}$. The $\sigma$-field generated by a random variable $X$, written $\sigma(X)$, is given by $\{(X \in A) : A \text{ a Borel subset of } \mathbb{R}\}$. Two random variables $X$ and $Y$ are independent if the $\sigma$-field generated by $X$ and the $\sigma$-field generated by $Y$ are independent. We define the independence of $n$ $\sigma$-fields or $n$ random variables in the obvious way.

Proposition 1.8 tells us that $A$ and $B$ are independent if the random variables $1_A$ and $1_B$ are independent, so the definitions above are consistent.

If $f$ and $g$ are Borel functions and $X$ and $Y$ are independent, then $f(X)$ and $g(Y)$ are independent. This follows because the $\sigma$-field generated by $f(X)$ is a sub-$\sigma$-field of the one generated by $X$, and similarly for $g(Y)$.

Let $F_{X,Y}(x, y) = P(X \le x, Y \le y)$ denote the joint distribution function of $X$ and $Y$. (The comma inside the set means "and.")

Proposition 1.9 $F_{X,Y}(x, y) = F_X(x) F_Y(y)$ if and only if $X$ and $Y$ are independent.

Proof. If $X$ and $Y$ are independent, then $1_{(-\infty,x]}(X)$ and $1_{(-\infty,y]}(Y)$ are independent by the above comments. Using the above comments and the definition of independence, this shows $F_{X,Y}(x, y) = F_X(x) F_Y(y)$.

Conversely, if the equality holds, fix $y$ and let $\mathcal{M}_y$ denote the collection of sets $A$ for which $P(X \in A, Y \le y) = P(X \in A) P(Y \le y)$. $\mathcal{M}_y$ contains all sets of the form $(-\infty, x]$. It follows by linearity that $\mathcal{M}_y$ contains all sets of the form $(x, z]$, and then by linearity again, all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of finite unions of such intervals, $\mathcal{A}$, is an algebra generating the Borel $\sigma$-field. It is clear that $\mathcal{M}_y$ is a monotone class, so by the monotone class lemma, $\mathcal{M}_y$ contains the Borel $\sigma$-field.

For a fixed set $A$, let $\mathcal{M}_A$ denote the collection of sets $B$ for which $P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$. Again, $\mathcal{M}_A$ is a monotone class, and by the preceding paragraph it contains the $\sigma$-field generated by the collection of finite unions of intervals of the form $(x, z]$, hence contains the Borel sets. Therefore $X$ and $Y$ are independent.

The following is known as the multiplication theorem.

Proposition 1.10 If $X$, $Y$, and $XY$ are integrable and $X$ and $Y$ are independent, then
$$E[XY] = (E X)(E Y).$$

Proof. Consider the random variables in $\sigma(X)$ (the $\sigma$-field generated by $X$) and $\sigma(Y)$ for which the multiplication theorem is true. It holds for indicators by the definition of $X$ and $Y$ being independent. It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.

It holds for nonnegative random variables by monotone convergence. And it holds for integrable random variables by linearity again.

If $X_1, \ldots, X_n$ are independent, then so are $X_1 - E X_1, \ldots, X_n - E X_n$. Assuming everything is integrable,
$$E[(X_1 - E X_1) + \cdots + (X_n - E X_n)]^2 = E(X_1 - E X_1)^2 + \cdots + E(X_n - E X_n)^2,$$
using the multiplication theorem to show that the expectations of the cross product terms are zero. We have thus shown
$$\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}\,X_1 + \cdots + \mathrm{Var}\,X_n. \tag{1.5}$$

We finish up this section by proving the second half of the Borel-Cantelli lemma.

Proposition 1.11 Suppose $A_n$ is a sequence of independent events. If we have $\sum_n P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.

Note that here the $A_n$ are independent, while in the first half of the Borel-Cantelli lemma no such assumption was necessary.

Proof. Note
$$P(\cup_{i=n}^N A_i) = 1 - P(\cap_{i=n}^N A_i^c) = 1 - \prod_{i=n}^N P(A_i^c) = 1 - \prod_{i=n}^N (1 - P(A_i)).$$
By the mean value theorem, $1 - x \le e^{-x}$, so the right hand side is greater than or equal to $1 - \exp(-\sum_{i=n}^N P(A_i))$. As $N \to \infty$, this tends to 1, so $P(\cup_{i=n}^\infty A_i) = 1$. This holds for all $n$, which proves the result.
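Formula (1.5) is easy to test by simulation. A small sketch (NumPy assumed; the mix of exponential and uniform coordinates is an arbitrary choice): for independent coordinates, the variance of the row sums should agree with the sum of the coordinate variances up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(2)
# ten independent random variables per row: five exponential, five uniform
X = np.hstack([rng.exponential(size=(100_000, 5)),
               rng.uniform(size=(100_000, 5))])

print(X.sum(axis=1).var())   # Var(X_1 + ... + X_10)
print(X.var(axis=0).sum())   # Var X_1 + ... + Var X_10; should agree
```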

Chapter 2. Convergence of random variables

2.1 Types of convergence

In this section we consider three ways a sequence of random variables $X_n$ can converge.

We say $X_n$ converges to $X$ almost surely if $P(X_n \not\to X) = 0$.

$X_n$ converges to $X$ in probability if for each $\varepsilon > 0$,
$$P(|X_n - X| > \varepsilon) \to 0$$
as $n \to \infty$.

$X_n$ converges to $X$ in $L^p$ if
$$E|X_n - X|^p \to 0$$
as $n \to \infty$.

The following proposition shows some relationships among the types of convergence.

Proposition 2.1 (1) If $X_n \to X$ a.s., then $X_n \to X$ in probability.
(2) If $X_n \to X$ in $L^p$, then $X_n \to X$ in probability.
(3) If $X_n \to X$ in probability, there exists a subsequence $n_j$ such that $X_{n_j}$ converges to $X$ almost surely.

Proof. To prove (1), note $|X_n - X|$ tends to 0 almost surely, so $1_{(-\varepsilon,\varepsilon)^c}(X_n - X)$ also converges to 0 almost surely. Now apply the dominated convergence theorem.

(2) comes from Chebyshev's inequality:
$$P(|X_n - X| > \varepsilon) = P(|X_n - X|^p > \varepsilon^p) \le E|X_n - X|^p / \varepsilon^p \to 0$$
as $n \to \infty$.

To prove (3), choose $n_j$ larger than $n_{j-1}$ such that $P(|X_n - X| > 2^{-j}) < 2^{-j}$ whenever $n \ge n_j$. So if we let $A_j = (|X_{n_j} - X| > 2^{-j})$, then $P(A_j) \le 2^{-j}$. By the Borel-Cantelli lemma $P(A_j \text{ i.o.}) = 0$. This implies there exists $J$ (depending on $\omega$) such that if $j \ge J$, then $\omega \notin A_j$. Then for $j \ge J$ we have $|X_{n_j}(\omega) - X(\omega)| \le 2^{-j}$. Therefore $X_{n_j}(\omega)$ converges to $X(\omega)$ if $\omega \notin (A_j \text{ i.o.})$.

Let us give some examples to show there need not be any other implications among the three types of convergence.

Let $\Omega = [0, 1]$, $\mathcal{F}$ the Borel $\sigma$-field, and $P$ Lebesgue measure. Let $X_n = e^n 1_{(0,1/n)}$. Then clearly $X_n$ converges to 0 almost surely and in probability, but $E X_n^p = e^{np}/n \to \infty$ for any $p$.

Let $\Omega$ be the unit circle, and let $P$ be Lebesgue measure on the circle normalized to have total mass 1. Let $t_n = \sum_{i=1}^n i^{-1}$, and let $A_n = \{e^{i\theta} : t_{n-1} \le \theta < t_n\}$. Let $X_n = 1_{A_n}$. Any point on the unit circle will be in infinitely many $A_n$, so $X_n$ does not converge almost surely to 0. But $P(A_n) = 1/(2\pi n) \to 0$, so $X_n \to 0$ in probability and in $L^p$.
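The circle example can be probed numerically. In the sketch below (NumPy assumed; the point $\theta = 1.0$ is an arbitrary choice), a fixed point keeps landing in the arcs $[t_{n-1}, t_n)$ taken mod $2\pi$, so the running count of hits grows without bound, roughly one hit per revolution, i.e. of order $\log n$, even though $P(A_n) = 1/(2\pi n) \to 0$.

```python
import numpy as np

TWO_PI = 2 * np.pi
theta = 1.0                # an arbitrary fixed point on the circle
t_prev, hits = 0.0, 0
checkpoints = {10**k for k in range(1, 7)}
for n in range(1, 1_000_001):
    t = t_prev + 1.0 / n   # t_n = sum_{i <= n} 1/i
    # X_n(theta) = 1 iff theta lies in the arc [t_{n-1}, t_n) mod 2*pi
    if (theta - t_prev) % TWO_PI < t - t_prev:
        hits += 1
    if n in checkpoints:
        print(n, hits)     # the count keeps creeping upward forever
    t_prev = t
```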

2.2 Weak law of large numbers

Suppose $X_n$ is a sequence of independent random variables. Suppose also that they all have the same distribution, that is, $F_{X_n} = F_{X_1}$ for all $n$. This situation comes up so often it has a name: independent, identically distributed, which is abbreviated i.i.d.

Define $S_n = \sum_{i=1}^n X_i$. $S_n$ is called a partial sum process, and $S_n/n$ is the average value of the first $n$ of the $X_i$'s.

Theorem 2.2 (Weak law of large numbers, WLLN) Suppose the $X_i$ are i.i.d. and $E X_1^2 < \infty$. Then $S_n/n \to E X_1$ in probability.

Proof. Since the $X_i$ are i.i.d., they all have the same expectation, and so $E S_n = n E X_1$. Hence $E(S_n/n - E X_1)^2$ is the variance of $S_n/n$. If $\varepsilon > 0$, by Chebyshev's inequality,
$$P(|S_n/n - E X_1| > \varepsilon) \le \frac{\mathrm{Var}(S_n/n)}{\varepsilon^2} = \frac{\sum_{i=1}^n \mathrm{Var}\,X_i}{n^2 \varepsilon^2} = \frac{n\,\mathrm{Var}\,X_1}{n^2 \varepsilon^2}. \tag{2.1}$$
Since $E X_1^2 < \infty$, then $\mathrm{Var}\,X_1 < \infty$, and the result follows by letting $n \to \infty$.

The weak law of large numbers can be improved greatly; it is enough that $x P(|X_1| > x) \to 0$ as $x \to \infty$.

2.3 The strong law of large numbers

Our aim is the strong law of large numbers (SLLN), which says that $S_n/n$ converges to $E X_1$ almost surely if $E|X_1| < \infty$. The strong law of large numbers is the mathematical formulation of the law of averages. If one tosses a fair coin over and over, the proportion of heads should converge to 1/2. Mathematically, if $X_i$ is 1 if the $i$th toss turns up heads and 0 otherwise, then we want $S_n/n$ to converge with probability one to 1/2, where $S_n = X_1 + \cdots + X_n$.
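Before turning to the proofs, here is a quick coin-tossing simulation of the weak law (a sketch, NumPy assumed; $\varepsilon = 0.05$ and the repetition count are arbitrary). The Monte Carlo estimate of $P(|S_n/n - 1/2| > \varepsilon)$ shrinks as $n$ grows, and the Chebyshev bound from (2.1) with $\mathrm{Var}\,X_1 = 1/4$ is printed alongside.

```python
import numpy as np

rng = np.random.default_rng(3)
eps, reps = 0.05, 1000
for n in [10, 100, 1000, 10_000]:
    tosses = rng.random((reps, n)) < 0.5           # X_i in {0, 1}, fair coin
    dev = np.abs(tosses.mean(axis=1) - 0.5) > eps  # event |S_n/n - 1/2| > eps
    print(n, dev.mean(), 0.25 / (n * eps**2))      # estimate vs. bound (2.1)
```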

Before stating and proving the strong law of large numbers for i.i.d. random variables with finite first moment, we need three facts from calculus.

First, recall that if $b_n \to b$ are real numbers, then
$$\frac{b_1 + \cdots + b_n}{n} \to b. \tag{2.2}$$

Second, there exists a constant $c_1$ such that
$$\sum_{k=n}^\infty \frac{1}{k^2} \le \frac{c_1}{n}. \tag{2.3}$$
(To prove this, recall the proof of the integral test and compare the sum to $\int_{n-1}^\infty x^{-2}\,dx$ when $n \ge 2$.)

Third, suppose $a > 1$ and $k_n$ is the largest integer less than or equal to $a^n$. Note $k_n \ge a^n/2$. Then
$$\sum_{\{n : k_n \ge j\}} \frac{1}{k_n^2} \le \sum_{\{n : a^n \ge j\}} \frac{4}{a^{2n}} \le \frac{4}{1 - a^{-2}} \cdot \frac{1}{j^2} \tag{2.4}$$
by the formula for the sum of a geometric series.

We also need two probability estimates.

Lemma 2.3 If $X \ge 0$ a.s., then $E X < \infty$ if and only if
$$\sum_{n=1}^\infty P(X \ge n) < \infty.$$

Proof. Suppose $E X$ is finite. Since $P(X \ge x)$ increases as $x$ decreases,
$$\sum_{n=1}^\infty P(X \ge n) \le \sum_{n=1}^\infty \int_{n-1}^n P(X \ge x)\,dx = \int_0^\infty P(X \ge x)\,dx = E X,$$
which is finite.

If we now suppose $\sum_{n=1}^\infty P(X \ge n) < \infty$, write
$$E X = \int_0^\infty P(X \ge x)\,dx \le 1 + \sum_{n=1}^\infty \int_n^{n+1} P(X \ge x)\,dx \le 1 + \sum_{n=1}^\infty P(X \ge n) < \infty.$$

Lemma 2.4 Let $\{X_n\}$ be an i.i.d. sequence with each $X_n \ge 0$ a.s. and $E X_1 < \infty$. Define $Y_n = X_n 1_{(X_n \le n)}$. Then
$$\sum_{k=1}^\infty \frac{\mathrm{Var}\,Y_k}{k^2} < \infty.$$

Proof. Since $\mathrm{Var}\,Y_k \le E Y_k^2$,
$$\sum_{k=1}^\infty \frac{\mathrm{Var}\,Y_k}{k^2} \le \sum_{k=1}^\infty \frac{E Y_k^2}{k^2} = \sum_{k=1}^\infty \frac{1}{k^2} \int_0^k 2x\, P(Y_k > x)\,dx$$
$$\le \sum_{k=1}^\infty \frac{1}{k^2} \int_0^\infty 1_{(x \le k)}\, 2x\, P(X_k > x)\,dx = \int_0^\infty \Big( \sum_{k=1}^\infty \frac{1}{k^2}\, 1_{(x \le k)} \Big)\, 2x\, P(X_1 > x)\,dx$$
$$\le \int_0^\infty \frac{c_1}{x}\, 2x\, P(X_1 > x)\,dx = 2c_1 \int_0^\infty P(X_1 > x)\,dx = 2c_1\, E X_1 < \infty.$$

We used the fact that the $X_k$ are i.i.d., the Fubini theorem, and (2.3).

Before proving the strong law, let us first show that we cannot weaken the hypotheses.

Theorem 2.5 Suppose the $X_i$ are i.i.d. and $S_n/n$ converges a.s. Then $E|X_1| < \infty$.

Proof. Since $S_n/n$ converges, then
$$\frac{X_n}{n} = \frac{S_n}{n} - \frac{S_{n-1}}{n-1} \cdot \frac{n-1}{n} \to 0$$
almost surely. Let $A_n = (|X_n| > n)$. If $\sum_{n=1}^\infty P(A_n) = \infty$, then using the fact that the $A_n$'s are independent, the Borel-Cantelli lemma (second part) tells us that $P(A_n \text{ i.o.}) = 1$, which contradicts $X_n/n \to 0$ a.s. Therefore $\sum_{n=1}^\infty P(A_n) < \infty$. Since the $X_i$ are identically distributed, we then have
$$\sum_{n=1}^\infty P(|X_1| > n) < \infty,$$
and $E|X_1| < \infty$ follows by Lemma 2.3.

We now state and prove the strong law of large numbers.

Theorem 2.6 Suppose $\{X_i\}$ is an i.i.d. sequence with $E|X_1| < \infty$. Let $S_n = \sum_{i=1}^n X_i$. Then
$$\frac{S_n}{n} \to E X_1, \qquad \text{a.s.}$$

Proof. By writing each $X_n$ as $X_n^+ - X_n^-$ and considering the positive and negative parts separately, it suffices to suppose each $X_n \ge 0$.

Define $Y_k = X_k 1_{(X_k \le k)}$ and let $T_n = \sum_{i=1}^n Y_i$. The main part of the argument is to prove that $T_n/n \to E X_1$ a.s.

Step 1. Let $a > 1$ and let $k_n$ be the largest integer less than or equal to $a^n$. Let $\varepsilon > 0$ and let
$$A_n = \Big( \Big| \frac{T_{k_n} - E T_{k_n}}{k_n} \Big| > \varepsilon \Big).$$
Then
$$P(A_n) \le \frac{\mathrm{Var}(T_{k_n}/k_n)}{\varepsilon^2} = \frac{\mathrm{Var}\,T_{k_n}}{k_n^2 \varepsilon^2} = \frac{1}{\varepsilon^2} \sum_{j=1}^{k_n} \frac{\mathrm{Var}\,Y_j}{k_n^2}.$$
Then
$$\sum_{n=1}^\infty P(A_n) \le \frac{1}{\varepsilon^2} \sum_{j=1}^\infty \mathrm{Var}\,Y_j \sum_{\{n : k_n \ge j\}} \frac{1}{k_n^2} \le \frac{4(1 - a^{-2})^{-1}}{\varepsilon^2} \sum_{j=1}^\infty \frac{\mathrm{Var}\,Y_j}{j^2}$$
by (2.4). By Lemma 2.4, $\sum_{n=1}^\infty P(A_n) < \infty$, and by the Borel-Cantelli lemma, $P(A_n \text{ i.o.}) = 0$. This means that for each $\omega$ except those in a null set, there exists $N(\omega)$ such that if $n \ge N(\omega)$, then $|T_{k_n}(\omega) - E T_{k_n}|/k_n < \varepsilon$. Applying this with $\varepsilon = 1/m$, $m = 1, 2, \ldots$, we conclude
$$\frac{T_{k_n} - E T_{k_n}}{k_n} \to 0, \qquad \text{a.s.}$$

Step 2. Since
$$E Y_j = E[X_j; X_j \le j] = E[X_1; X_1 \le j] \to E X_1$$
by the dominated convergence theorem as $j \to \infty$, then by (2.2)
$$\frac{E T_{k_n}}{k_n} = \frac{\sum_{j=1}^{k_n} E Y_j}{k_n} \to E X_1.$$
Therefore $T_{k_n}/k_n \to E X_1$ a.s.

Step 3. If $k_n \le k \le k_{n+1}$, then
$$\frac{T_k}{k} \le \frac{T_{k_{n+1}}}{k_{n+1}} \cdot \frac{k_{n+1}}{k_n},$$
since we are assuming that the $X_k$ are nonnegative. Therefore
$$\limsup_{k \to \infty} \frac{T_k}{k} \le a\, E X_1, \qquad \text{a.s.}$$
Similarly, $\liminf_{k \to \infty} T_k/k \ge (1/a)\, E X_1$ a.s. Since $a > 1$ is arbitrary,
$$\frac{T_k}{k} \to E X_1, \qquad \text{a.s.}$$

Step 4. Finally,
$$\sum_{n=1}^\infty P(Y_n \ne X_n) = \sum_{n=1}^\infty P(X_n > n) = \sum_{n=1}^\infty P(X_1 > n) < \infty$$
by Lemma 2.3. By the Borel-Cantelli lemma, $P(Y_n \ne X_n \text{ i.o.}) = 0$. In particular, $Y_n - X_n \to 0$ a.s. By (2.2) we have
$$\frac{T_n - S_n}{n} = \frac{\sum_{i=1}^n (Y_i - X_i)}{n} \to 0, \qquad \text{a.s.},$$
hence $S_n/n \to E X_1$ a.s.
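A single-path simulation makes the role of the hypotheses visible (a sketch, NumPy assumed; the Cauchy column is included because $E|X_1| = \infty$ there, so by Theorem 2.5 the averages cannot settle down).

```python
import numpy as np

rng = np.random.default_rng(4)
n = np.arange(1, 1_000_001)

exp_path = rng.exponential(size=n.size).cumsum() / n         # E X_1 = 1, SLLN applies
cauchy_path = rng.standard_cauchy(size=n.size).cumsum() / n  # E|X_1| = infinity

for k in [10**2, 10**4, 10**6]:
    print(k, exp_path[k - 1], cauchy_path[k - 1])
# the exponential averages approach 1; the Cauchy averages keep jumping
```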

2.4 Techniques related to a.s. convergence

If $X_1, \ldots, X_n$ are random variables, $\sigma(X_1, \ldots, X_n)$ is defined to be the smallest $\sigma$-field with respect to which $X_1, \ldots, X_n$ are measurable. This definition extends to countably many random variables.

If $X_i$ is a sequence of random variables, the tail $\sigma$-field is defined by
$$\mathcal{T} = \cap_{n=1}^\infty \sigma(X_n, X_{n+1}, \ldots).$$
An example of an event in the tail $\sigma$-field is $(\limsup_n X_n > a)$. Another example is $(\limsup_n S_n/n > a)$. The reason for this is that if $k < n$ is fixed,
$$\frac{S_n}{n} = \frac{S_k}{n} + \frac{\sum_{i=k+1}^n X_i}{n}.$$
The first term on the right tends to 0 as $n \to \infty$. So $\limsup S_n/n = \limsup (\sum_{i=k+1}^n X_i)/n$, which is in $\sigma(X_{k+1}, X_{k+2}, \ldots)$. This holds for each $k$. The set $(\limsup S_n > a)$ is easily seen not to be in the tail $\sigma$-field.

Theorem 2.7 (Kolmogorov 0-1 law) If the $X_i$ are independent, then the events in the tail $\sigma$-field have probability 0 or 1.

This implies that in the case of i.i.d. random variables, if $S_n/n$ has a limit with positive probability, then it has a limit with probability one, and the limit must be a constant.

Proof. Let $\mathcal{M}$ be the collection of sets in $\sigma(X_{n+1}, \ldots)$ that are independent of every set in $\sigma(X_1, \ldots, X_n)$. $\mathcal{M}$ is easily seen to be a monotone class and it contains $\sigma(X_{n+1}, \ldots, X_N)$ for every $N > n$. Therefore $\mathcal{M}$ must be equal to $\sigma(X_{n+1}, \ldots)$.

If $A$ is in the tail $\sigma$-field, then $A$ is independent of $\sigma(X_1, \ldots, X_n)$ for each $n$. The class $\mathcal{M}_A$ of sets independent of $A$ is a monotone class, hence is a $\sigma$-field containing $\sigma(X_1, \ldots, X_n)$ for each $n$. Therefore $\mathcal{M}_A$ contains $\sigma(X_1, \ldots)$.

We thus have that the event $A$ is independent of itself, or
$$P(A) = P(A \cap A) = P(A) P(A) = P(A)^2.$$
This implies $P(A)$ is zero or one.

Next is Kolmogorov's inequality, a special case of Doob's inequality.

Proposition 2.8 (Kolmogorov's inequality) Suppose the $X_i$ are independent and $E X_i = 0$ for each $i$. Then
$$P\big( \max_{1 \le i \le n} |S_i| \ge \lambda \big) \le \frac{E S_n^2}{\lambda^2}.$$

Proof. Let $A_k = (|S_k| \ge \lambda, |S_1| < \lambda, \ldots, |S_{k-1}| < \lambda)$. Note the $A_k$ are disjoint and that $A_k \in \sigma(X_1, \ldots, X_k)$. Therefore $A_k$ is independent of $S_n - S_k$. Then
$$E S_n^2 \ge \sum_{k=1}^n E[S_n^2; A_k] = \sum_{k=1}^n E[(S_k^2 + 2S_k(S_n - S_k) + (S_n - S_k)^2); A_k]$$
$$\ge \sum_{k=1}^n E[S_k^2; A_k] + 2 \sum_{k=1}^n E[S_k(S_n - S_k); A_k].$$
Using the independence,
$$E[S_k(S_n - S_k) 1_{A_k}] = E[S_k 1_{A_k}]\, E[S_n - S_k] = 0.$$
Therefore
$$E S_n^2 \ge \sum_{k=1}^n E[S_k^2; A_k] \ge \lambda^2 \sum_{k=1}^n P(A_k) = \lambda^2\, P\big( \max_{1 \le k \le n} |S_k| \ge \lambda \big).$$
Our result is immediate from this.

We look at a special case of what is known as Kronecker's lemma.

Proposition 2.9 Suppose $x_i$ are real numbers and $s_n = \sum_{i=1}^n x_i$. If the sum $\sum_{j=1}^\infty (x_j/j)$ converges, then $s_n/n \to 0$.

Proof. Let $b_n = \sum_{j=1}^n (x_j/j)$, $b_0 = 0$, and suppose $b_n \to b$. As we have seen, this implies $(\sum_{i=1}^n b_i)/n \to b$. We have $n(b_n - b_{n-1}) = x_n$, so
$$\frac{s_n}{n} = \frac{\sum_{i=1}^n (i b_i - i b_{i-1})}{n} = \frac{n b_n - \sum_{i=1}^{n-1} b_i}{n} = b_n - \frac{n-1}{n} \cdot \frac{\sum_{i=1}^{n-1} b_i}{n-1} \to b - b = 0.$$
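Kolmogorov's inequality can be checked by simulation as well (a sketch, NumPy assumed; $n$, $\lambda$, and the repetition count are arbitrary): for $\pm 1$ steps, $E S_n^2 = n$, so the bound is $n/\lambda^2$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps, lam = 1000, 10_000, 60.0
steps = rng.choice([-1.0, 1.0], size=(reps, n))  # independent, mean zero
S = steps.cumsum(axis=1)

print((np.abs(S).max(axis=1) >= lam).mean())     # P(max_{i<=n} |S_i| >= lam)
print(n / lam**2)                                # Kolmogorov bound E S_n^2 / lam^2
```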

Lemma 2.10 Suppose $V_i$ is a sequence of independent random variables, each with mean 0. Let $W_n = \sum_{i=1}^n V_i$. If $\sum_{i=1}^\infty \mathrm{Var}\,V_i < \infty$, then $W_n$ converges almost surely.

Proof. Choose $n_j > n_{j-1}$ such that $\sum_{i=n_j}^\infty \mathrm{Var}\,V_i < 2^{-3j}$. If $n > n_j$, then applying Kolmogorov's inequality shows that
$$P\big( \max_{n_j \le i \le n} |W_i - W_{n_j}| > 2^{-j} \big) \le 2^{-3j}/2^{-2j} = 2^{-j}.$$
Letting $n \to \infty$, we have $P(A_j) \le 2^{-j}$, where $A_j = (\max_{n_j \le i} |W_i - W_{n_j}| > 2^{-j})$. By the Borel-Cantelli lemma, $P(A_j \text{ i.o.}) = 0$.

Suppose $\omega \notin (A_j \text{ i.o.})$. Let $\varepsilon > 0$. Choose $j$ large enough so that $2^{-j+1} < \varepsilon$ and $\omega \notin A_j$. If $n, m > n_j$, then
$$|W_n - W_m| \le |W_n - W_{n_j}| + |W_m - W_{n_j}| \le 2^{-j+1} < \varepsilon.$$
Since $\varepsilon$ is arbitrary, $W_n(\omega)$ is a Cauchy sequence, and hence converges.

We now consider the three series criterion. We prove the "if" portion here and defer the "only if" to Section 20.

Theorem 2.11 Let $X_i$ be a sequence of independent random variables, let $A > 0$, and let $Y_i = X_i 1_{(|X_i| \le A)}$. Then $\sum X_i$ converges a.s. if and only if all of the following three series converge:
(a) $\sum_n P(|X_n| > A)$;
(b) $\sum_i E Y_i$;
(c) $\sum_i \mathrm{Var}\,Y_i$.

Proof of "if" part. Since (c) holds, then $\sum (Y_i - E Y_i)$ converges a.s. by Lemma 2.10. Since (b) holds, taking the difference shows $\sum Y_i$ converges a.s. Since (a) holds, $\sum P(X_i \ne Y_i) = \sum P(|X_i| > A) < \infty$, so by Borel-Cantelli, $P(X_i \ne Y_i \text{ i.o.}) = 0$. It follows that $\sum X_i$ converges a.s.
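A concrete instance: with independent fair signs, $\sum_n \pm 1/n$ has mean-zero terms and $\mathrm{Var}(\pm 1/n) = 1/n^2$ summable, so Lemma 2.10 (or the three series criterion with $A = 1$) gives a.s. convergence. A sketch of several paths of partial sums (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n = np.arange(1, 100_001)
for path in range(3):
    signs = rng.choice([-1.0, 1.0], size=n.size)
    W = (signs / n).cumsum()                 # partial sums of sum_i (+-1)/i
    print(path, W[999], W[9999], W[99999])   # each path settles to a (random) limit
```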

2.5 Uniform integrability

Before proceeding to an extension of the SLLN, we discuss uniform integrability. A sequence of random variables $X_i$ is uniformly integrable if
$$\sup_i \int_{(|X_i| > M)} |X_i|\,dP \to 0 \qquad \text{as } M \to \infty.$$

Proposition 2.12 Suppose there exists $\varphi : [0, \infty) \to [0, \infty)$ such that $\varphi$ is nondecreasing, $\varphi(x)/x \to \infty$ as $x \to \infty$, and $\sup_i E\,\varphi(|X_i|) < \infty$. Then the $X_i$ are uniformly integrable.

Proof. Let $\varepsilon > 0$ and choose $x_0$ such that $x/\varphi(x) < \varepsilon$ if $x \ge x_0$. If $M \ge x_0$,
$$\int_{(|X_i| > M)} |X_i| = \int_{(|X_i| > M)} \frac{|X_i|}{\varphi(|X_i|)}\, \varphi(|X_i|) \le \varepsilon\, E\,\varphi(|X_i|) \le \varepsilon \sup_i E\,\varphi(|X_i|).$$

Proposition 2.13 If $X_n$ and $Y_n$ are two uniformly integrable sequences, then $X_n + Y_n$ is also a uniformly integrable sequence.

Proof. Since there exists $M$ such that $\sup_n E[|X_n|; |X_n| > M] < 1$ and $\sup_n E[|Y_n|; |Y_n| > M] < 1$, then $\sup_n E|X_n| \le M + 1$, and similarly for the $Y_n$. Note
$$P(|X_n + Y_n| > K) \le \frac{E|X_n| + E|Y_n|}{K}$$
by Chebyshev's inequality. Now let $\varepsilon > 0$ and choose $L$ such that
$$E[|X_n|; |X_n| > L] < \varepsilon,$$
and the same when $X_n$ is replaced by $Y_n$.

Then
$$E[|X_n + Y_n|; |X_n + Y_n| > K] \le E[|X_n|; |X_n + Y_n| > K] + E[|Y_n|; |X_n + Y_n| > K]$$
$$\le E[|X_n|; |X_n| > L] + E[|X_n|; |X_n| \le L, |X_n + Y_n| > K]$$
$$\quad + E[|Y_n|; |Y_n| > L] + E[|Y_n|; |Y_n| \le L, |X_n + Y_n| > K]$$
$$\le \varepsilon + L\,P(|X_n + Y_n| > K) + \varepsilon + L\,P(|X_n + Y_n| > K) \le 2\varepsilon + L\, \frac{4(1 + M)}{K}.$$
Given $\varepsilon$ we have already chosen $L$. Choose $K$ large enough that $4L(1 + M)/K < \varepsilon$. We then get that $E[|X_n + Y_n|; |X_n + Y_n| > K]$ is bounded by $3\varepsilon$.

The main result we need in this section is Vitali's convergence theorem.

Theorem 2.14 If $X_n \to X$ almost surely and the $X_n$ are uniformly integrable, then $E|X_n - X| \to 0$.

Proof. By the above proposition, $X_n - X$ is uniformly integrable and tends to 0 a.s., so without loss of generality we may assume $X = 0$. Let $\varepsilon > 0$ and choose $M$ such that $\sup_n E[|X_n|; |X_n| > M] < \varepsilon$. Then
$$E|X_n| \le E[|X_n|; |X_n| > M] + E[|X_n|; |X_n| \le M] \le \varepsilon + E[|X_n| 1_{(|X_n| \le M)}].$$
The second term on the right goes to 0 by dominated convergence.

Proposition 2.15 Suppose $X_i$ is an i.i.d. sequence and $E|X_1| < \infty$. Then
$$E\Big| \frac{S_n}{n} - E X_1 \Big| \to 0.$$

Proof. Without loss of generality we may assume $E X_1 = 0$. By the SLLN, $S_n/n \to 0$ a.s. So we need to show that the sequence $S_n/n$ is uniformly integrable.

Pick $M$ such that $E[|X_1|; |X_1| > M] < \varepsilon$. Pick $N = M E|X_1|/\varepsilon$. Then
$$P(|S_n/n| > N) \le \frac{E|S_n|}{nN} \le \frac{E|X_1|}{N} = \frac{\varepsilon}{M}.$$
We used here
$$E|S_n| \le \sum_{i=1}^n E|X_i| = n\,E|X_1|.$$
We then have
$$E[|X_i|; |S_n/n| > N] \le E[|X_i|; |X_i| > M] + E[|X_i|; |X_i| \le M, |S_n/n| > N] \le \varepsilon + M\,P(|S_n/n| > N) \le 2\varepsilon.$$
Finally,
$$E[|S_n/n|; |S_n/n| > N] \le \frac{1}{n} \sum_{i=1}^n E[|X_i|; |S_n/n| > N] \le 2\varepsilon.$$
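Proposition 2.15 can also be watched numerically (a sketch, NumPy assumed; rate-one exponential variables, so $E X_1 = 1$): the Monte Carlo estimate of $E|S_n/n - E X_1|$ decreases toward 0.

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 1000
for n in [10, 100, 1000, 10_000]:
    means = rng.exponential(size=(reps, n)).mean(axis=1)  # samples of S_n / n
    print(n, np.abs(means - 1.0).mean())                  # estimate of E|S_n/n - 1|
```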

Chapter 3. Conditional expectations

If $\mathcal{F} \subset \mathcal{G}$ are two $\sigma$-fields and $X$ is an integrable $\mathcal{G}$ measurable random variable, the conditional expectation of $X$ given $\mathcal{F}$, written $E[X \mid \mathcal{F}]$ and read as "the expectation (or expected value) of $X$ given $\mathcal{F}$," is any $\mathcal{F}$ measurable random variable $Y$ such that $E[Y; A] = E[X; A]$ for every $A \in \mathcal{F}$. The conditional probability of $A \in \mathcal{G}$ given $\mathcal{F}$ is defined by $P(A \mid \mathcal{F}) = E[1_A \mid \mathcal{F}]$.

If $Y_1, Y_2$ are two $\mathcal{F}$ measurable random variables with $E[Y_1; A] = E[Y_2; A]$ for all $A \in \mathcal{F}$, then $Y_1 = Y_2$ a.s.; that is, conditional expectation is unique up to a.s. equivalence.

In the case $X$ is already $\mathcal{F}$ measurable, $E[X \mid \mathcal{F}] = X$. If $X$ is independent of $\mathcal{F}$, $E[X \mid \mathcal{F}] = E X$. Both of these facts follow immediately from the definition. For another example, which ties this definition with the one used in elementary probability courses, if $\{A_i\}$ is a finite collection of disjoint sets whose union is $\Omega$, $P(A_i) > 0$ for all $i$, and $\mathcal{F}$ is the $\sigma$-field generated by the $A_i$'s, then
$$P(A \mid \mathcal{F}) = \sum_i \frac{P(A \cap A_i)}{P(A_i)}\, 1_{A_i}.$$
This follows since the right-hand side is $\mathcal{F}$ measurable and its expectation over any set $A_i$ is $P(A \cap A_i)$.

As an example, suppose we toss a fair coin independently 5 times and let $X_i$ be 1 or 0 depending on whether the $i$th toss was a heads or tails. Let $A$ be the event that there were 5 heads and let $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$.

Then $P(A) = 1/32$, while $P(A \mid \mathcal{F}_1)$ is equal to $1/16$ on the event $(X_1 = 1)$ and 0 on the event $(X_1 = 0)$. $P(A \mid \mathcal{F}_2)$ is equal to $1/8$ on the event $(X_1 = 1, X_2 = 1)$ and 0 otherwise.

We have
$$E\big[E[X \mid \mathcal{F}]\big] = E X \tag{3.1}$$
because $E[E[X \mid \mathcal{F}]] = E[E[X \mid \mathcal{F}]; \Omega] = E[X; \Omega] = E X$.

The following is easy to establish.

Proposition 3.1 (a) If $X \le Y$ are both integrable, then $E[X \mid \mathcal{F}] \le E[Y \mid \mathcal{F}]$ a.s.
(b) If $X$ and $Y$ are integrable and $a \in \mathbb{R}$, then $E[aX + Y \mid \mathcal{F}] = a\,E[X \mid \mathcal{F}] + E[Y \mid \mathcal{F}]$.

It is easy to check that limit theorems such as monotone convergence and dominated convergence have conditional expectation versions, as do inequalities like Jensen's and Chebyshev's inequalities. Thus, for example, we have the following.

Proposition 3.2 (Jensen's inequality for conditional expectations) If $g$ is convex and $X$ and $g(X)$ are integrable, then
$$E[g(X) \mid \mathcal{F}] \ge g(E[X \mid \mathcal{F}]), \qquad \text{a.s.}$$

A key fact is the following.

Proposition 3.3 If $X$ and $XY$ are integrable and $Y$ is measurable with respect to $\mathcal{F}$, then
$$E[XY \mid \mathcal{F}] = Y\,E[X \mid \mathcal{F}]. \tag{3.2}$$

Proof. If $A \in \mathcal{F}$, then for any $B \in \mathcal{F}$,
$$E\big[1_A E[X \mid \mathcal{F}]; B\big] = E\big[E[X \mid \mathcal{F}]; A \cap B\big] = E[X; A \cap B] = E[1_A X; B].$$
Since $1_A E[X \mid \mathcal{F}]$ is $\mathcal{F}$ measurable, this shows that (3.2) holds when $Y = 1_A$ and $A \in \mathcal{F}$. Using linearity and taking limits shows that (3.2) holds whenever $Y$ is $\mathcal{F}$ measurable and $X$ and $XY$ are integrable.

Two other equalities follow.

Proposition 3.4 If $\mathcal{E} \subset \mathcal{F} \subset \mathcal{G}$, then
$$E\big[E[X \mid \mathcal{F}] \mid \mathcal{E}\big] = E[X \mid \mathcal{E}] = E\big[E[X \mid \mathcal{E}] \mid \mathcal{F}\big].$$

Proof. The right equality holds because $E[X \mid \mathcal{E}]$ is $\mathcal{E}$ measurable, hence $\mathcal{F}$ measurable. To show the left equality, let $A \in \mathcal{E}$. Then since $A$ is also in $\mathcal{F}$,
$$E\big[E[E[X \mid \mathcal{F}] \mid \mathcal{E}]; A\big] = E\big[E[X \mid \mathcal{F}]; A\big] = E[X; A] = E\big[E[X \mid \mathcal{E}]; A\big].$$
Since both sides are $\mathcal{E}$ measurable, the equality follows.

To show the existence of $E[X \mid \mathcal{F}]$, we proceed as follows.

Proposition 3.5 If $X$ is integrable, then $E[X \mid \mathcal{F}]$ exists.

Proof. Using linearity, we need only consider $X \ge 0$. Define a measure $Q$ on $\mathcal{F}$ by $Q(A) = E[X; A]$ for $A \in \mathcal{F}$. This is trivially absolutely continuous with respect to $P|_{\mathcal{F}}$, the restriction of $P$ to $\mathcal{F}$. Let $E[X \mid \mathcal{F}]$ be the Radon-Nikodym derivative of $Q$ with respect to $P|_{\mathcal{F}}$. The Radon-Nikodym derivative is $\mathcal{F}$ measurable by construction and so provides the desired random variable.

When $\mathcal{F} = \sigma(Y)$, one usually writes $E[X \mid Y]$ for $E[X \mid \mathcal{F}]$. Notation that is commonly used (however, we will use it only very occasionally and only for heuristic purposes) is $E[X \mid Y = y]$. The definition is as follows. If $A \in \sigma(Y)$, then $A = (Y \in B)$ for some Borel set $B$ by the definition of $\sigma(Y)$, or $1_A = 1_B(Y)$. By linearity and taking limits, if $Z$ is $\sigma(Y)$ measurable, $Z = f(Y)$ for some Borel measurable function $f$. Set $Z = E[X \mid Y]$ and choose $f$ Borel measurable so that $Z = f(Y)$. Then $E[X \mid Y = y]$ is defined to be $f(y)$.

If $X \in L^2$ and $M = \{Y \in L^2 : Y \text{ is } \mathcal{F}\text{-measurable}\}$, one can show that $E[X \mid \mathcal{F}]$ is equal to the projection of $X$ onto the subspace $M$. We will not use this in these notes.
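The five-toss example earlier in this chapter can be reproduced by brute-force enumeration of $\Omega$ (a sketch in plain Python, not from the text): on each atom of $\mathcal{F}_i$, the conditional probability is $P(A \cap \text{atom})/P(\text{atom})$, exactly the elementary formula displayed at the start of the chapter.

```python
from itertools import product

omegas = list(product([0, 1], repeat=5))  # Omega: 32 equally likely outcomes
A = {w for w in omegas if sum(w) == 5}    # the event "five heads"

def cond_prob(i, prefix):
    """Value of P(A | F_i) on the atom {omega : (X_1,...,X_i) = prefix}."""
    atom = [w for w in omegas if w[:i] == prefix]
    return sum(w in A for w in atom) / len(atom)

print(cond_prob(1, (1,)), cond_prob(1, (0,)))      # 1/16 and 0
print(cond_prob(2, (1, 1)), cond_prob(2, (1, 0)))  # 1/8 and 0
```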


Chapter 4. Martingales

4.1 Definitions

In this section we consider martingales. Let $\mathcal{F}_n$ be an increasing sequence of $\sigma$-fields. A sequence of random variables $M_n$ is adapted to $\mathcal{F}_n$ if for each $n$, $M_n$ is $\mathcal{F}_n$ measurable. $M_n$ is a martingale if $M_n$ is adapted to $\mathcal{F}_n$, $M_n$ is integrable for all $n$, and
$$E[M_n \mid \mathcal{F}_{n-1}] = M_{n-1}, \qquad \text{a.s.}, \quad n = 2, 3, \ldots. \tag{4.1}$$
If we have $E[M_n \mid \mathcal{F}_{n-1}] \ge M_{n-1}$ a.s. for every $n$, then $M_n$ is a submartingale. If we have $E[M_n \mid \mathcal{F}_{n-1}] \le M_{n-1}$, we have a supermartingale. Submartingales have a tendency to increase.

Let us take a moment to look at some examples. If $X_i$ is a sequence of mean zero i.i.d. random variables and $S_n$ is the partial sum process, then $M_n = S_n$ is a martingale, since
$$E[M_n \mid \mathcal{F}_{n-1}] = M_{n-1} + E[M_n - M_{n-1} \mid \mathcal{F}_{n-1}] = M_{n-1} + E[M_n - M_{n-1}] = M_{n-1},$$
using independence.

If the $X_i$'s have variance one and $M_n = S_n^2 - n$, then
$$E[S_n^2 \mid \mathcal{F}_{n-1}] = E[(S_n - S_{n-1})^2 \mid \mathcal{F}_{n-1}] + 2S_{n-1} E[S_n \mid \mathcal{F}_{n-1}] - S_{n-1}^2 = 1 + S_{n-1}^2,$$
using independence. It follows that $M_n$ is a martingale.

Another example is the following: if $X \in L^1$ and $M_n = E[X \mid \mathcal{F}_n]$, then $M_n$ is a martingale.
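Since a martingale has constant expectation (take expectations in (4.1)), the first two examples are easy to sanity-check by simulation (a sketch, NumPy assumed; $\pm 1$ steps, so the increments have mean zero and variance one):

```python
import numpy as np

rng = np.random.default_rng(8)
steps = rng.choice([-1.0, 1.0], size=(100_000, 50))  # i.i.d., mean 0, variance 1
S = steps.cumsum(axis=1)

for n in [10, 25, 50]:
    Sn = S[:, n - 1]
    print(n, Sn.mean(), (Sn**2 - n).mean())  # both near 0 = E M_0 for every n
```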

If $M_n$ is a martingale and $H_n \in \mathcal{F}_{n-1}$ for each $n$, it is easy to check that $N_n = \sum_{i=1}^n H_i (M_i - M_{i-1})$ is also a martingale.

If $M_n$ is a martingale, $g$ is convex, and $g(M_n)$ is integrable for each $n$, then by Jensen's inequality
$$E[g(M_{n+1}) \mid \mathcal{F}_n] \ge g(E[M_{n+1} \mid \mathcal{F}_n]) = g(M_n),$$
or $g(M_n)$ is a submartingale. Similarly if $g$ is convex and nondecreasing on $[0, \infty)$ and $M_n$ is a positive submartingale, then $g(M_n)$ is a submartingale because
$$E[g(M_{n+1}) \mid \mathcal{F}_n] \ge g(E[M_{n+1} \mid \mathcal{F}_n]) \ge g(M_n).$$

4.2 Stopping times

We next want to talk about stopping times. Suppose we have a sequence of $\sigma$-fields $\mathcal{F}_i$ such that $\mathcal{F}_i \subset \mathcal{F}_{i+1}$ for each $i$. An example would be $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$.

A random mapping $N$ from $\Omega$ to $\{0, 1, 2, \ldots\}$ is called a stopping time if for each $n$, $(N \le n) \in \mathcal{F}_n$. A stopping time is also called an optional time in the Markov theory literature. The intuition is that one knows whether $N$ has happened by time $n$ by looking at $\mathcal{F}_n$.

Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors, come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop at the third stop light before the city limits (and they had not been there before), they would need to drive to the city limits, then turn around and return past three stop lights. That is not a stopping time, because they have to go ahead of where they wanted to stop to know to stop there.

We use the notation $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. The proof of the following is immediate from the definitions.

Proposition 4.1 (a) Fixed times $n$ are stopping times.
(b) If $N_1$ and $N_2$ are stopping times, then so are $N_1 \wedge N_2$ and $N_1 \vee N_2$.
(c) If $N_n$ is a nondecreasing sequence of stopping times, then so is $N = \sup_n N_n$.
(d) If $N_n$ is a nonincreasing sequence of stopping times, then so is $N = \inf_n N_n$.

(e) If $N$ is a stopping time, then so is $N + n$.

We define
$$\mathcal{F}_N = \{A : A \cap (N \le n) \in \mathcal{F}_n \text{ for all } n\}.$$

4.3 Optional stopping

Note that if one takes expectations in (4.1), one has $E M_n = E M_{n-1}$, and by induction $E M_n = E M_0$. The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem, which says that the same is true if we replace $n$ by a stopping time $N$. There are various versions, depending on what conditions one puts on the stopping times.

Theorem 4.2 If $N$ is a bounded stopping time with respect to $\mathcal{F}_n$ and $M_n$ a martingale, then $E M_N = E M_0$.

Proof. Since $N$ is bounded, let $K$ be the largest value $N$ takes. We write
$$E M_N = \sum_{k=0}^K E[M_N; N = k] = \sum_{k=0}^K E[M_k; N = k].$$
Note $(N = k)$ is $\mathcal{F}_j$ measurable if $j \ge k$, so
$$E[M_k; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = \cdots = E[M_K; N = k].$$
Hence
$$E M_N = \sum_{k=0}^K E[M_K; N = k] = E M_K = E M_0.$$
This completes the proof.

The assumption that $N$ be bounded cannot be entirely dispensed with. For example, let $M_n$ be the partial sums of a sequence of i.i.d. random variables that take the values $\pm 1$, each with probability $\frac12$. If $N = \min\{i : M_i = 1\}$, we will see later on that $N < \infty$ a.s., but $E M_N = 1 \ne 0 = E M_0$.

The same proof as that in Theorem 4.2 gives the following corollary.

Corollary 4.3 If $N$ is bounded by $K$ and $M_n$ is a submartingale, then $E M_N \le E M_K$.

Also the same proof gives

Corollary 4.4 If $N$ is bounded by $K$, $A \in \mathcal{F}_N$, and $M_n$ is a submartingale, then $E[M_N; A] \le E[M_K; A]$.

Proposition 4.5 If $N_1 \le N_2$ are stopping times bounded by $K$ and $M$ is a martingale, then $E[M_{N_2} \mid \mathcal{F}_{N_1}] = M_{N_1}$ a.s.

Proof. Suppose $A \in \mathcal{F}_{N_1}$. We need to show $E[M_{N_1}; A] = E[M_{N_2}; A]$. Define a new stopping time $N_3$ by
$$N_3(\omega) = \begin{cases} N_1(\omega) & \text{if } \omega \in A, \\ N_2(\omega) & \text{if } \omega \notin A. \end{cases}$$
It is easy to check that $N_3$ is a stopping time, so $E M_{N_3} = E M_K = E M_{N_2}$ implies
$$E[M_{N_1}; A] + E[M_{N_2}; A^c] = E[M_{N_2}].$$
Subtracting $E[M_{N_2}; A^c]$ from each side completes the proof.

The following is known as the Doob decomposition.

Proposition 4.6 Suppose $X_k$ is a submartingale with respect to an increasing sequence of $\sigma$-fields $\mathcal{F}_k$. Then we can write $X_k = M_k + A_k$ such that $M_k$ is a martingale adapted to the $\mathcal{F}_k$ and $A_k$ is a sequence of random variables with $A_k$ being $\mathcal{F}_{k-1}$-measurable and $A_0 \le A_1 \le \cdots$.

Proof. Let $a_k = E[X_k \mid \mathcal{F}_{k-1}] - X_{k-1}$ for $k = 1, 2, \ldots$. Since $X_k$ is a submartingale, each $a_k \ge 0$. Then let $A_k = \sum_{i=1}^k a_i$. The fact that the $A_k$ are increasing and measurable with respect to $\mathcal{F}_{k-1}$ is clear. Set $M_k = X_k - A_k$. Then
$$E[M_{k+1} - M_k \mid \mathcal{F}_k] = E[X_{k+1} - X_k \mid \mathcal{F}_k] - a_{k+1} = 0,$$
or $M_k$ is a martingale.

Combining Propositions 4.5 and 4.6 we have

Corollary 4.7 Suppose $X_k$ is a submartingale and $N_1 \le N_2$ are bounded stopping times. Then $E[X_{N_2} \mid \mathcal{F}_{N_1}] \ge X_{N_1}$.

4.4 Doob's inequalities

The first interesting consequences of the optional stopping theorems are Doob's inequalities. If $M_n$ is a martingale, denote $M_n^* = \max_{i \le n} |M_i|$.

Theorem 4.8 If $M_n$ is a martingale or a positive submartingale,
$$P(M_n^* \ge a) \le E[|M_n|; M_n^* \ge a]/a \le E|M_n|/a.$$

Proof. Set $M_{n+1} = M_n$. Let $N = \min\{j : |M_j| \ge a\} \wedge (n+1)$. Since $|\cdot|$ is convex, $|M_n|$ is a submartingale. If $A = (M_n^* \ge a)$, then $A \in \mathcal{F}_N$ because
$$A \cap (N \le j) = (N \le n) \cap (N \le j) = (N \le j) \in \mathcal{F}_j.$$
By Corollary 4.4,
$$P(M_n^* \ge a) \le E\Big[ \frac{|M_N|}{a}; M_n^* \ge a \Big] \le \frac1a\, E[|M_N|; A] \le \frac1a\, E[|M_n|; A] \le \frac1a\, E|M_n|.$$

For $p > 1$, we have the following inequality.

Theorem 4.9 If $p > 1$ and $E|M_i|^p < \infty$ for $i \le n$, then
$$E(M_n^*)^p \le \Big( \frac{p}{p-1} \Big)^p E|M_n|^p.$$

Proof. Note $M_n^* \le \sum_{i=1}^n |M_i|$, hence $M_n^* \in L^p$. We write, using Theorem 4.8 for the first inequality,
$$E(M_n^*)^p = \int_0^\infty p a^{p-1}\, P(M_n^* > a)\,da \le \int_0^\infty p a^{p-1}\, E[|M_n| 1_{(M_n^* \ge a)}/a]\,da$$
$$= E \int_0^{M_n^*} p a^{p-2} |M_n|\,da = \frac{p}{p-1}\, E[(M_n^*)^{p-1} |M_n|] \le \frac{p}{p-1}\, (E(M_n^*)^p)^{(p-1)/p}\, (E|M_n|^p)^{1/p}.$$

The last inequality follows by Hölder's inequality. Now divide both sides by the quantity $(E(M_n^*)^p)^{(p-1)/p}$.

4.5 Martingale convergence theorems

The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval $[a, b]$ is the number of times a process crosses from below $a$ to above $b$.

To be more exact, let
$$S_1 = \min\{k : X_k \le a\}, \qquad T_1 = \min\{k > S_1 : X_k \ge b\},$$
and
$$S_{i+1} = \min\{k > T_i : X_k \le a\}, \qquad T_{i+1} = \min\{k > S_{i+1} : X_k \ge b\}.$$
The number of upcrossings $U_n$ before time $n$ is $U_n = \max\{j : T_j \le n\}$.

Theorem 4.10 (Upcrossing lemma) If $X_k$ is a submartingale, then
$$E U_n \le (b - a)^{-1}\, E[(X_n - a)^+].$$

Proof. The number of upcrossings of $[a, b]$ by $X_k$ is the same as the number of upcrossings of $[0, b-a]$ by $Y_k = (X_k - a)^+$. Moreover $Y_k$ is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval $[0, b-a]$ by the process $Y_k$, we will have the desired inequality for upcrossings of $X$.

So we may assume $a = 0$. Fix $n$ and define $Y_{n+1} = Y_n$. This will still be a submartingale. Define the $S_i$, $T_i$ as above, and let $S_i' = S_i \wedge (n+1)$, $T_i' = T_i \wedge (n+1)$. Since $T_{i+1} > S_{i+1} > T_i$, then $T_{n+1}' = n + 1$. We write
$$E Y_{n+1} = E Y_{S_1'} + \sum_{i=0}^{n+1} E[Y_{T_i'} - Y_{S_i'}] + \sum_{i=0}^{n+1} E[Y_{S_{i+1}'} - Y_{T_i'}].$$

All the summands in the third term on the right are nonnegative since $Y_k$ is a submartingale. For the $j$th upcrossing, $Y_{T_j'} - Y_{S_j'} \ge b - a$, while $Y_{T_i'} - Y_{S_i'}$ is always greater than or equal to 0. So
$$\sum_{i=0}^{n+1} (Y_{T_i'} - Y_{S_i'}) \ge (b - a) U_n.$$
So
$$E U_n \le E Y_{n+1}/(b - a). \tag{4.2}$$

This leads to the martingale convergence theorem.

Theorem 4.11 If $X_n$ is a submartingale such that $\sup_n E X_n^+ < \infty$, then $X_n$ converges a.s. as $n \to \infty$.

Proof. Let $U(a, b) = \lim_{n \to \infty} U_n$. For each pair of rationals $a < b$, by monotone convergence,
$$E U(a, b) \le (b - a)^{-1} \sup_n E(X_n - a)^+ < \infty.$$
So $U(a, b) < \infty$ a.s. Taking the union over all pairs of rationals $a, b$, we see that a.s. the sequence $X_n(\omega)$ cannot have $\limsup X_n > \liminf X_n$. Therefore $X_n$ converges a.s., although we still have to rule out the possibility of the limit being infinite.

Since $X_n$ is a submartingale, $E X_n \ge E X_0$, and thus
$$E|X_n| = E X_n^+ + E X_n^- = 2E X_n^+ - E X_n \le 2E X_n^+ - E X_0.$$
By Fatou's lemma, $E \lim_n |X_n| \le \sup_n E|X_n| < \infty$, so $X_n$ converges a.s. to a finite limit.

Corollary 4.12 If $X_n$ is a positive supermartingale or a martingale bounded above or below, then $X_n$ converges a.s.

Proof. If $X_n$ is a positive supermartingale, $-X_n$ is a submartingale bounded above by 0. Now apply Theorem 4.11.

If $X_n$ is a martingale bounded above, by considering $-X_n$, we may assume $X_n$ is bounded below. Looking at $X_n + M$ for fixed $M$ will not affect the convergence, so we may assume $X_n$ is bounded below by 0. Now apply the first assertion of the corollary.

Proposition 4.13 If $X_n$ is a martingale with $\sup_n E|X_n|^p < \infty$ for some $p > 1$, then the convergence is in $L^p$ as well as a.s. This is also true when $X_n$ is a submartingale. If $X_n$ is a uniformly integrable martingale, then the convergence is in $L^1$. If $X_n \to X$ in $L^1$, then $X_n = E[X \mid \mathcal{F}_n]$. ($X_n$ is a uniformly integrable martingale if the collection of random variables $X_n$ is uniformly integrable.)

Proof. The $L^p$ convergence assertion follows by using Doob's inequality (Theorem 4.9) and dominated convergence. The $L^1$ convergence assertion follows since a.s. convergence together with uniform integrability implies $L^1$ convergence. Finally, if $j < n$, we have $X_j = E[X_n \mid \mathcal{F}_j]$. If $A \in \mathcal{F}_j$,
$$E[X_j; A] = E[X_n; A] \to E[X; A]$$
by the $L^1$ convergence of $X_n$ to $X$. Since this is true for all $A \in \mathcal{F}_j$, $X_j = E[X \mid \mathcal{F}_j]$.

4.6 Applications of martingales

One application of martingale techniques is Wald's identities.

Proposition 4.14 Suppose the $Y_i$ are i.i.d. with $E|Y_1| < \infty$, $N$ is a stopping time with $E N < \infty$, and $N$ is independent of the $Y_i$. Then $E S_N = (E N)(E Y_1)$, where the $S_n$ are the partial sums of the $Y_i$.

Proof. $S_n - n(E Y_1)$ is a martingale, so $E S_{n \wedge N} = E(n \wedge N)\, E Y_1$ by optional stopping. The right hand side tends to $(E N)(E Y_1)$ by monotone convergence. $S_{n \wedge N}$ converges almost surely to $S_N$, and we need to show the expected values converge.

Note
$$S_{n \wedge N} = \sum_{k=0}^n \Big( \sum_{j=0}^k Y_j \Big) 1_{(N=k)} + \Big( \sum_{j=0}^n Y_j \Big) \sum_{k > n} 1_{(N=k)} = \sum_{j=0}^n Y_j 1_{(N \ge j)}.$$
In absolute value the last expression is at most $\sum_{j=0}^\infty |Y_j| 1_{(N \ge j)}$, which, using the independence, has expected value
$$\sum_{j=0}^\infty (E|Y_j|)\, P(N \ge j) \le (E|Y_1|)(1 + E N) < \infty.$$
So by dominated convergence, we have $E S_{n \wedge N} \to E S_N$.

Wald's second identity is a similar expression for the variance of $S_N$.

Let $S_n$ be your fortune at time $n$. In a fair casino, $E[S_{n+1} \mid \mathcal{F}_n] = S_n$. If $N$ is a stopping time, the optional stopping theorem says that $E S_N = E S_0$; in other words, no matter what stopping time you use and what method of betting, you will do no better on average than ending up with what you started with.

We can use martingales to find certain hitting probabilities.

Proposition 4.15 Suppose the $Y_i$ are i.i.d. with $P(Y_1 = 1) = 1/2$, $P(Y_1 = -1) = 1/2$, and $S_n$ is the partial sum process. Suppose $a$ and $b$ are positive integers. Then
$$P(S_n \text{ hits } -a \text{ before } b) = \frac{b}{a+b}.$$
If $N = \min\{n : S_n \in \{-a, b\}\}$, then $E N = ab$.

Proof. $S_n^2 - n$ is a martingale, so $E S_{n \wedge N}^2 = E(n \wedge N)$. Let $n \to \infty$. The right hand side converges to $E N$ by monotone convergence. Since $S_{n \wedge N}$ is bounded in absolute value by $a + b$, the left hand side converges by dominated convergence to $E S_N^2$, which is finite. So $E N$ is finite, hence $N$ is finite almost surely.

$S_n$ is a martingale, so $E S_{n \wedge N} = E S_0 = 0$. By dominated convergence and the fact that $N < \infty$ a.s., hence $S_{n \wedge N} \to S_N$, we have $E S_N = 0$, or
$$-a\,P(S_N = -a) + b\,P(S_N = b) = 0.$$
We also have
$$P(S_N = -a) + P(S_N = b) = 1.$$
Solving these two equations for $P(S_N = -a)$ and $P(S_N = b)$ yields our first result. Since
$$E N = E S_N^2 = a^2\,P(S_N = -a) + b^2\,P(S_N = b),$$
substituting gives the second result.

Based on this proposition, if we let $a \to \infty$, we see that $P(N_b < \infty) = 1$ and $E N_b = \infty$, where $N_b = \min\{n : S_n = b\}$.

An elegant application of martingales is a proof of the SLLN. Fix $N$ large. Let $Y_i$ be i.i.d. Let $Z_n = E[Y_1 \mid S_n, S_{n+1}, \ldots, S_N]$.

We claim $Z_n = S_n/n$. Certainly $S_n/n$ is $\sigma(S_n, \ldots, S_N)$ measurable. If $A \in \sigma(S_n, \ldots, S_N)$ for some $n$, then $A = ((S_n, \ldots, S_N) \in B)$ for some Borel subset $B$ of $\mathbb{R}^{N-n+1}$. Since the $Y_i$ are i.i.d., for each $k \le n$,
$$E[Y_1; (S_n, \ldots, S_N) \in B] = E[Y_k; (S_n, \ldots, S_N) \in B].$$
Summing over $k$ and dividing by $n$,
$$E[Y_1; (S_n, \ldots, S_N) \in B] = E[S_n/n; (S_n, \ldots, S_N) \in B].$$
Therefore $E[Y_1; A] = E[S_n/n; A]$ for every $A \in \sigma(S_n, \ldots, S_N)$. Thus $Z_n = S_n/n$.

Let $X_k = Z_{N-k}$, and let $\mathcal{F}_k = \sigma(S_{N-k}, S_{N-k+1}, \ldots, S_N)$. Note $\mathcal{F}_k$ gets larger as $k$ gets larger, and by the above, $X_k = E[Y_1 \mid \mathcal{F}_k]$. This shows that $X_k$ is a martingale (cf. the example $M_n = E[X \mid \mathcal{F}_n]$ in Section 4.1).

By Doob's upcrossing inequality, if $U_n^X$ is the number of upcrossings of $[a, b]$ by $X$, then
$$E U_{N-1}^X \le \frac{E X_{N-1}^+}{b - a} \le \frac{E|Z_1|}{b - a} = \frac{E|Y_1|}{b - a}.$$
This differs by at most one from the number of upcrossings of $[a, b]$ by $Z_1, \ldots, Z_N$. So the expected number of upcrossings of $[a, b]$ by $Z_k$ for $k \le N$ is bounded by $1 + E|Y_1|/(b - a)$.

Now let $N \to \infty$. By Fatou's lemma, the expected number of upcrossings of $[a, b]$ by $Z_1, Z_2, \ldots$ is finite. Arguing as in the proof of the martingale convergence theorem, this says that $Z_n = S_n/n$ does not oscillate.

It is conceivable that $|S_n/n| \to \infty$. But by Fatou's lemma,
$$E\big[ \lim |S_n/n| \big] \le \liminf E|S_n/n| \le \liminf \frac{n\,E|Y_1|}{n} = E|Y_1| < \infty.$$
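Proposition 4.15 is easy to confirm by simulation (a sketch, NumPy assumed; $a = 3$, $b = 7$ are arbitrary choices): the empirical hitting frequency should be near $b/(a+b) = 0.7$ and the mean exit time near $ab = 21$.

```python
import numpy as np

rng = np.random.default_rng(9)
a, b, reps = 3, 7, 20_000
hits_minus_a, times = 0, []
for _ in range(reps):
    s, n = 0, 0
    while -a < s < b:             # run until S_n hits -a or b
        s += rng.choice((-1, 1))
        n += 1
    hits_minus_a += (s == -a)
    times.append(n)

print(hits_minus_a / reps, b / (a + b))  # P(hit -a before b) = 0.7
print(np.mean(times), a * b)             # E N = ab = 21
```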


Chapter 5. Weak convergence

We will see later that if the $X_i$ are i.i.d. with mean zero and variance one, then $S_n/\sqrt{n}$ converges in the sense
$$P(S_n/\sqrt{n} \in [a, b]) \to P(Z \in [a, b]),$$
where $Z$ is a standard normal. If $S_n/\sqrt{n}$ converged in probability or almost surely, then by the Kolmogorov zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.

We say $F_n$ converges weakly to $F$ if $F_n(x) \to F(x)$ for all $x$ at which $F$ is continuous. Here $F_n$ and $F$ are distribution functions. We say $X_n$ converges weakly to $X$ if $F_{X_n}$ converges weakly to $F_X$. We sometimes say $X_n$ converges in distribution or converges in law to $X$. Probabilities $\mu_n$ converge weakly if their corresponding distribution functions converge, that is, if $F_{\mu_n}(x) = \mu_n(-\infty, x]$ converges weakly.

An example that illustrates why we restrict the convergence to continuity points of $F$ is the following. Let $X_n = 1/n$ with probability one, and $X = 0$ with probability one. $F_{X_n}(x)$ is 0 if $x < 1/n$ and 1 otherwise. $F_{X_n}(x)$ converges to $F_X(x)$ for all $x$ except $x = 0$.

Proposition 5.1 $X_n$ converges weakly to $X$ if and only if $E g(X_n) \to E g(X)$ for all $g$ bounded and continuous.

The idea that $E g(X_n)$ converges to $E g(X)$ for all $g$ bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for $X_n$ taking values in general metric spaces.

Proof. First suppose $E g(X_n)$ converges to $E g(X)$. Let $x$ be a continuity point of $F$, let $\varepsilon > 0$, and choose $\delta$ such that $|F(y) - F(x)| < \varepsilon$ if $|y - x| < \delta$. Choose $g$ continuous such that $g$ is one on $(-\infty, x]$, takes values between 0 and 1, and is 0 on $[x + \delta, \infty)$. Then
$$F_{X_n}(x) \le E g(X_n) \to E g(X) \le F_X(x + \delta) \le F(x) + \varepsilon.$$
Similarly, if $h$ is a continuous function taking values between 0 and 1 that is 1 on $(-\infty, x - \delta]$ and 0 on $[x, \infty)$,
$$F_{X_n}(x) \ge E h(X_n) \to E h(X) \ge F_X(x - \delta) \ge F(x) - \varepsilon.$$
Since $\varepsilon$ is arbitrary, $F_{X_n}(x) \to F_X(x)$.

Now suppose $X_n$ converges weakly to $X$. We start by making some observations. First, if we have a distribution function, it is increasing and so the number of points at which it has a discontinuity is at most countable. Second, if $g$ is a continuous function on a closed bounded interval, it can be approximated uniformly on the interval by step functions. Using the uniform continuity of $g$ on the interval, we may even choose the step function so that the places where it jumps are not in some pre-specified countable set. Our third observation is that
$$P(X < x) = \lim_{k \to \infty} P(X \le x - \tfrac1k) = \lim_{y \uparrow x} F_X(y);$$
thus if $F_X$ is continuous at $x$, then $P(X = x) = F_X(x) - P(X < x) = 0$.

Suppose $g$ is bounded and continuous, and we want to prove that $E g(X_n) \to E g(X)$. By multiplying by a constant, we may suppose that $g$ is bounded by 1. Let $\varepsilon > 0$ and choose $M$ such that $F_X(M) > 1 - \varepsilon$ and $F_X(-M) < \varepsilon$, and so that $M$ and $-M$ are continuity points of $F_X$. We see that
$$P(X_n \le -M) = F_{X_n}(-M) \to F_X(-M) < \varepsilon,$$
and so for large enough $n$ we have $P(X_n \le -M) \le 2\varepsilon$. Similarly, for large enough $n$, $P(X_n > M) \le 2\varepsilon$. Therefore, for large enough $n$, $E g(X_n)$ differs from $E(g 1_{[-M,M]})(X_n)$ by at most $4\varepsilon$. Also, $E g(X)$ differs from $E(g 1_{[-M,M]})(X)$ by at most $2\varepsilon$.

Let $h$ be a step function such that $\sup_{|x| \le M} |h(x) - g(x)| < \varepsilon$ and $h$ is 0 outside of $[-M, M]$. We choose $h$ so that the places where $h$ jumps are continuity points of $F$ and of all the $F_n$. Then $E(g 1_{[-M,M]})(X_n)$ differs from $E h(X_n)$ by at most $\varepsilon$, and the same when $X_n$ is replaced by $X$.
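Proposition 5.1 also suggests a numerical preview of the central limit theorem (a sketch, NumPy assumed; $g = \cos$ is an arbitrary bounded continuous test function, and $E \cos(Z) = e^{-1/2}$ for a standard normal $Z$): $E g(S_n/\sqrt{n})$ approaches $E g(Z)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(10)
g = np.cos                               # a bounded continuous test function

Z = rng.standard_normal(500_000)
print("E g(Z) =", g(Z).mean())           # about exp(-1/2) = 0.6065

for n in [4, 16, 64, 256]:
    steps = rng.choice([-1.0, 1.0], size=(20_000, n))
    Sn = steps.sum(axis=1) / np.sqrt(n)  # S_n / sqrt(n), mean 0, variance 1
    print(n, g(Sn).mean())               # converges to E g(Z)
```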