Inference for Stochastic Processes


Robert L. Wolpert
Revised: June 19, 2005

Introduction

A stochastic process is a family {X_t} of real-valued random variables, all defined on the same probability space (Ω, F, P) so that it will make sense to talk about their joint distribution. In many applications the set T of indices t is infinite (common examples include the non-negative integers Z_+ or reals R_+), so the random variables {X_t} won't have a joint probability density function or probability mass function; we must find some other way to specify the joint distribution, and to make inference about it. In this class we will do this for four specific classes of stochastic processes: 1. Markov Chains, 2. Diffusions, 3. Lévy Processes, and 4. Gaussian Processes. Three of these classes are examples of Markov Processes, so it is worthwhile introducing a few tools useful in the study of all of them. First, an example.

Robert L. Wolpert is Professor of Statistics and Decision Sciences and Professor of the Environment at Duke University, Durham, NC 27708-0251.

Example: Bernoulli Sequence

Let Ω = (0, 1] be the unit interval and, for ω ∈ Ω and n ∈ N = {1, 2, ...}, let β_n(ω) be the n-th bit in the binary expansion of ω,

    β_n(ω) = ⌊2^n ω⌋ (mod 2).

If P(dω) is Lebesgue measure, then the {β_n} are independent Bernoulli random variables, each equal to one with probability 1/2 and otherwise zero. We may use these to construct many interesting examples. For any 0 ≤ p ≤ 1 it is possible to construct a measure P_p(dω) on Ω for which the {β_n} are independent and identically distributed with P_p[β_n = 1] = p.
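The construction is easy to play with numerically. Here is a minimal sketch (not part of the original notes; the function name bits and the sample sizes are illustrative choices) that extracts binary digits of a uniform draw and checks that they behave like fair coin flips:

    import numpy as np

    rng = np.random.default_rng(0)

    def bits(omega, n):
        """First n binary digits of omega in (0, 1): omega = sum_k beta_k 2**(-k)."""
        out = []
        for _ in range(n):
            omega *= 2
            b = int(omega)   # floor(2^k * omega) mod 2, one digit at a time
            out.append(b)
            omega -= b
        return out

    # Drawing omega uniformly (i.e., P = Lebesgue measure on the unit interval),
    # the digits should look like i.i.d. Bernoulli(1/2) random variables.
    sample = np.array([bits(u, 20) for u in rng.uniform(size=50_000)])
    print(sample.mean(axis=0))                            # each entry near 0.5
    print(np.corrcoef(sample[:, 0], sample[:, 1])[0, 1])  # near 0 (independence)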

A Little Probability Theory

Probability and Statistics are complementary sciences for studying the world of uncertain phenomena. In Probability we (pretend that we) know all about the mechanism governing some unpredictable observations, and we compute the probabilities of events we have not yet seen; in Statistics we (pretend that we) know which events have occurred and which have not, and we try to make inference about what must have been the mechanism governing those observations.

In the Bernoulli setting, for example, probability theory allows us to compute that the probability of an even number of 0's before the first 1 is 2/3, if the bits are zero or one with probability one-half each, and that the probability that the total number of ones X among the first 1000 bits is 500 ± 20 is P[480 ≤ X ≤ 520 | p = 0.5] = 0.80534, while statistics allows us to infer, upon seeing 600 ones among the first thousand bits, that it is unlikely that the {β_n} are independent with P[β_n = 1] = 1/2. More precisely, from a Bayesian perspective we can infer that IF the β_n are i.i.d. Bernoulli with some probability parameter p, then P[p ∈ (0.574, 0.625) | X = 600] ≈ 0.90 and P[p ≤ 0.5 | X = 600] ≈ 1.1 × 10^−10, while from a frequentist perspective we compute that P[X ≥ 600 | p = 1/2] = 1.36 × 10^−10 and infer that it would be a miracle to see so many ones if p = 1/2.

Probability Theory concerns what might happen in a random experiment. It is chiefly concerned with events, which we think of informally as things that might happen (and then again might not), and with random variables, or numbers that depend on chance. More formally, we represent everything that might happen in the experiment by the elements of some set Ω called the sample space, represent events by subsets E ⊆ Ω and random variables by real-valued functions X : Ω → R, and think of performing the experiment as choosing some element ω ∈ Ω, whereupon an event E has occurred if ω ∈ E and otherwise has not, and the value of a random variable is observed to be X(ω).

Let F denote the collection of all the events E whose probability P[E] we can compute. In some cases this might be all subsets E ⊆ Ω, but perhaps that is too ambitious. Certainly we can compute P[Ω] = 1 (the probability that anything at all happens has to be one), and we can compute P[E^c] if we can compute P[E] (it's just P[E^c] = 1 − P[E], since necessarily either E occurs or it doesn't), and we will want to be able to compute P[E_1 ∪ E_2] and P[E_1 ∩ E_2] for events E_1 and E_2. Any collection of sets that satisfies these three rules is called an algebra. If it satisfies the somewhat more stringent rule that it contains ∪ E_i (the event that at least one of the {E_i} occurs) and ∩ E_i (the event that all of the {E_i} occur) for any countable collection {E_i} ⊆ F, then F is called a σ-algebra. We wish to assign a probability P[E] to each event E ∈ F in ways that satisfy the obvious rules:

1. P[E] ≥ 0 for every E ∈ F;
2. P[Ω] = 1;
3. P[E_1 ∪ E_2] = P[E_1] + P[E_2], if E_1 ∩ E_2 = ∅.

It turns out to be useful to strengthen the third rule to countable unions, i.e., P[∪ E_i] = Σ P[E_i] if E_i ∩ E_j = ∅ for all i ≠ j. A probability is a real-valued function P : F → R that satisfies these three rules.

The probability distribution of a random variable X is the probability assignment µ_X(B) = P[X ∈ B] for sets B ⊆ R; for this to even make sense we need {ω : X(ω) ∈ B} = X^{−1}(B) to be an event. It turns out that the collection of sets B ⊆ R for which X^{−1}(B) ∈ F is a σ-algebra, so requiring this for intervals B = (a, b] (which we would need just to define the CDF F_X(b) = P[X ≤ b]) is the same as asking that X^{−1}(B) ∈ F for the entire collection B of Borel sets, the smallest σ-algebra of sets B ⊆ R that includes all the open sets (or, equivalently, all the intervals). For any collection {E_α} of subsets E_α ⊆ Ω there is a smallest σ-algebra σ({E_α}) that contains them all; for example, the Borel sets B of any topological space are defined to be the smallest σ-algebra containing all the open sets.
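The numerical claims made earlier in this section are easy to reproduce. The following sketch (again not from the original notes) uses scipy; the printed values should agree with those quoted above up to rounding in the last digits:

    from scipy.stats import binom, beta

    n = 1000

    # Probability: chance the number of ones X among 1000 fair bits is 500 +/- 20
    print(binom.cdf(520, n, 0.5) - binom.cdf(479, n, 0.5))   # about 0.805

    # Frequentist: how surprising is X >= 600 when p = 1/2?
    print(binom.sf(599, n, 0.5))                             # about 1.4e-10

    # Bayesian: a uniform prior on p and X = 600 give the posterior p ~ Beta(601, 401)
    post = beta(601, 401)
    print(post.ppf(0.05), post.ppf(0.95))   # central 90% interval, about (0.574, 0.625)
    print(post.cdf(0.5))                    # posterior P[p <= 1/2], around 1e-10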

Similarly, for any collection of random variables {X_α} there is a smallest σ-algebra G = σ({X_α}) ⊆ F that contains every X_α^{−1}(B). If Y is any L^1 random variable (i.e., one such that E|Y| < ∞) then we define the conditional expectation of Y given G, denoted E[Y | G], to be the best approximation to Y from among all possible functions of the {X_α}. In particular, if Y is already one of the {X_α}, or some function of finitely many {X_{α_j}}, or a limit of such things, then E[Y | G] is just Y itself. More generally, this conditional expectation is defined to be any random variable satisfying the two conditions

    E[Y | G] ∈ G    (shorthand for (E[Y | G])^{−1}(B) ∈ G for Borel sets B);
    E[(E[Y | G] − Y) 1_G] = 0 for all sets G ∈ G.

The first condition says that E[Y | G] has to be a limit of functions of finitely many of the {X_α}, while the second says that the approximation is so good that the average error is zero over any event in G or, equivalently, that the error is orthogonal to every bounded function of the {X_α}. This reduces to the familiar notion of conditional expectation

    E[Y | X_1, ..., X_m] = ∫ y f(x, y) dy / ∫ f(x, y) dy

when X = (X_1, ..., X_m) is finite-dimensional with a joint density function, so that G = σ(X_1, ..., X_m) depends on only finitely many random variables; but it extends the idea to conditioning on infinitely many random variables (even uncountably many). Similarly it extends the notion of conditional probability P[A | B], with G = σ(1_B) = {∅, B, B^c, Ω} and

    P[A | G](ω) = P[A | B] = P[A ∩ B] / P[B]          for ω ∈ B,
    P[A | G](ω) = P[A | B^c] = P[A ∩ B^c] / P[B^c]    for ω ∉ B.

In many applications there will be an obvious candidate for this optimal predictor; an example is the case of martingales.
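The density-ratio formula can be checked numerically in a case where the answer is known in closed form. The sketch below is my own example, not from the notes: for a bivariate normal with standard marginals and correlation ρ, the conditional expectation is E[Y | X = x] = ρx.

    import numpy as np

    rho = 0.6

    def f(x, y):
        """Bivariate normal density, standard marginals, correlation rho."""
        z = (x**2 - 2*rho*x*y + y**2) / (1 - rho**2)
        return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

    # E[Y | X = x0] = (integral of y f(x0,y) dy) / (integral of f(x0,y) dy)
    x0 = 1.5
    y = np.linspace(-10, 10, 4001)
    dy = y[1] - y[0]
    num = np.sum(y * f(x0, y)) * dy   # simple Riemann sums for both integrals
    den = np.sum(f(x0, y)) * dy
    print(num / den, rho * x0)        # the two agree: here E[Y | X = x] = rho * x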

Martingales

In cases where T ⊆ R is one-dimensional the indices are naturally ordered. We often speak of t ∈ T as time and think of the past at any time t ∈ T as including whatever we can observe at times s ≤ t. Let F_t^X represent the past at time t ∈ T of the process X_t; formally, it consists of all events E ∈ F that depend only on {X_s : s ≤ t}, and in particular contains all the events X_t^{−1}(B) that depend just on the process at time t. More generally a filtration is any increasing family F_t ⊆ F of σ-algebras, and X is said to be adapted to F_t if X_t^{−1}(B) ∈ F_t for every t ∈ T. Then F_t will contain the past of X_t but perhaps the past of other processes as well. In the Bernoulli example,

    F_t = { all unions of intervals of the form (i/2^t, j/2^t], 0 ≤ i < j ≤ 2^t }

is just the collection of all unions of half-open intervals with dyadic rational endpoints of order t or less, and the conditional expectation E[Y | F_t] of any integrable function Y ∈ L^1(0, 1) is just the piecewise-constant simple function whose constant value on each dyadic interval of order t is given by Y's average value over that interval,

    E[Y | F_t](ω) = 2^t ∫_{i/2^t}^{(i+1)/2^t} Y(ω′) dω′    for i/2^t < ω ≤ (i+1)/2^t.

Let F_t be a filtration. A real-valued stochastic process M_t is said to be a martingale for F_t if

    E|M_t| < ∞ for every t ∈ T;
    E[M_t | F_s] = M_s for every s ≤ t ∈ T.

The second condition asserts that the best predictor of M_t available, from among all possible functions of {M_r : r ≤ s} or of any other random variables Y with Y^{−1}(B) ∈ F_s, is M_s itself. This says in a very strong way that M_t is conditionally constant: on average it neither increases nor decreases over time. It will follow that E[M_t] = E[M_0], of course, but something much stronger is true: E[M_τ] = E[M_0] even at random times τ ∈ T, so long as the random time depends only on the past. More formally, τ : Ω → T is called a Stopping Time (or sometimes a Markov time) if

    {ω : τ(ω) ≤ t} ∈ F_t for every t ∈ T,

i.e., if the event [τ ≤ t] depends only on the {X_s} for s ≤ t. A stopping time in a gambling game would be a rule for when to quit that depends only on the past and present; this forbids quitting just before we lose (too bad!).
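The dyadic conditional expectation above is concrete enough to compute directly. A minimal sketch (the grid size and the test function Y are my own choices, not from the notes):

    import numpy as np

    def cond_exp_dyadic(Y, t, m=4096):
        """Approximate E[Y | F_t] on (0,1]: replace Y by its average over each
        dyadic interval (i/2^t, (i+1)/2^t], using m equally spaced sample points."""
        omega = (np.arange(m) + 0.5) / m             # grid points in (0, 1)
        idx = np.floor(omega * 2**t).astype(int)     # which dyadic interval each point is in
        vals = Y(omega)
        means = np.bincount(idx, weights=vals) / np.bincount(idx)
        return omega, means[idx]                     # piecewise-constant approximation

    Y = lambda w: np.sin(2 * np.pi * w) + w**2       # any integrable Y on (0, 1]
    for t in (1, 3, 6, 9):
        omega, Et = cond_exp_dyadic(Y, t)
        print(t, np.max(np.abs(Et - Y(omega))))      # error shrinks as t grows

The shrinking error illustrates how the piecewise-constant averages recover Y as the filtration grows finer.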

Martingales have at least three remarkable properties that make them ideal tools for studying stochastic processes and their inference:

Doob's Optional Stopping Theorem: If M_t is a martingale and τ a stopping time, then M_{t∧τ} is a martingale too. This implies that E[M_τ] = E[M_0] for any stopping time τ, if either E[τ] < ∞ or {M_t} is uniformly integrable (UI). (Note: this could be taken as an alternate definition of martingale.)

Martingale Convergence Theorem: If M_t is uniformly integrable then there exists an L^1 random variable Z such that M_t → Z a.s. as t → ∞, and M_t = E[Z | F_t] for all t < ∞. One sufficient condition for uniform integrability is that |M_t| ≤ Y for some Y ∈ L^1; another is that E[|M_t|^p] ≤ K for some p > 1 and K < ∞.

Martingale Maximal Inequality: For any 0 ≤ t set M_t* := sup{|M_s| : 0 ≤ s ≤ t}, the maximum value of |M_s| on the interval s ≤ t. Let p > 1 and set q := p/(p − 1) (so 1/p + 1/q = 1), and write ‖X‖_p := (E|X|^p)^{1/p} (the ordinary L^p norm). Then for any c > 0,

    P[M_t* > c] ≤ E[|M_t|]/c,    ‖M_t*‖_p ≤ q sup_{s ≤ t} ‖M_s‖_p.

Notice that the Maximal Inequality is reminiscent of the Markov Inequality P[|Y| > c] ≤ E[|Y|]/c, true for any L^1 random variable, but it is much stronger: the bound is not only on |M_t| but on the maximum M_t*.

Examples

Recall the Bernoulli random variables β_n ∼ Bi(1, 1/2) constructed above. Define a symmetric random walk starting at x ∈ Z by

    S_t := x + Σ_{n=1}^t (2β_n − 1),

and let F_t = σ(S_n : n ≤ t) = σ(β_n : n ≤ t).
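A quick simulation makes the martingale property and the maximal inequality tangible; the sketch below (path counts and horizon are arbitrary choices of mine) uses the walk just defined with x = 0, and also checks the quadratic process (S_t)^2 − t that appears just below:

    import numpy as np

    rng = np.random.default_rng(1)
    n_paths, T = 100_000, 200
    steps = 2 * rng.integers(0, 2, size=(n_paths, T)) - 1   # +/-1, each w.p. 1/2
    S = np.cumsum(steps, axis=1)                            # S_t started from x = 0

    # Martingale checks: E[S_T] = 0 and E[(S_T)^2 - T] = 0
    print(S[:, -1].mean(), (S[:, -1]**2 - T).mean())        # both near 0

    # Maximal inequality with p = q = 2: E[(sup_{s<=T} |S_s|)^2] <= 4 E[(S_T)^2]
    lhs = (np.abs(S).max(axis=1)**2).mean()
    print(lhs, 4 * (S[:, -1]**2).mean())                    # lhs should not exceed rhs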

and let F t = σ(s n : n t) = σ(β n : n t). Evidently S t begins at S 0 = x and at each time t takes a unit step up if β t = 1 and takes a unit step down if β t = 0, equally likely events. It follows that S t is a martingale, of course, but so too is M(t) = (S t ) t (check this). For 0 < p < 1 under the probability assignment P p that makes the β n into i.i.d. Bi(1,p) random variables, S t will become a biased random walk that steps to the right (i.e., up) with probability p and to the left with probability q = 1 p. If p 1 then S t is no longer a martingale, but both M 0 (t) = (q/p) St and M 1 = S t (p q)t are martingales. These lead to a simple and elegant solution of the famous Gambler s Ruin problem: for a x b let τ be the first time that S t leaves the open interval (a, b), τ min{t T : S t (a, b)}, i.e., the first hitting time of {a, b}, and let f(x) = P[S τ = b] be the probability that the interval is exited to the right. This function f(x) may be calculated using Doob s Optional Sampling Theorem. For the symmetric (p = 1) case, E[S τ S 0 = x] = [f(x)]b + [1 f(x)]a = a + f(x)(b a) = x (by Doob s O.S.T.), so f(x)(b a) = x a and f(x) = x a b a. For example, a gambler with a $100 fortune has a probability of f(100) = 100/110 of winning $10 before going broke, when playing a fair game betting $1 each play with even odds (here a = 0, x = 100, and b = 110). The expected time to complete play is available too: E[(S τ ) τ S 0 = x] = [f(x)]b + [1 f(x)]a E[τ S 0 = x] = a + f(x)(b a ) E[τ S 0 = x] = x (by Doob s O.S.T.), so E[τ S 0 = x] = (x a)(b a ) (x a ) = (b x)(x a), b a or 1000 plays in the gambling example above. 7

For the asymmetric case,

    E[(q/p)^{S_τ} | S_0 = x] = f(x)(q/p)^b + [1 − f(x)](q/p)^a
                             = (q/p)^a + f(x)[(q/p)^b − (q/p)^a] = (q/p)^x

by Doob's O.S.T., so

    f(x) = [(q/p)^x − (q/p)^a] / [(q/p)^b − (q/p)^a]
         = [(p/q)^{b−x} − (p/q)^{b−a}] / [1 − (p/q)^{b−a}]
         ≈ (p/q)^{b−x}    if (p/q)^{b−a} ≈ 0.

Betting on red or black at U.S. roulette has probability p = 9/19 of winning and q = 10/19 of losing, so a gambler with a $100 fortune who bets $1 on red or black each play has probability

    f(100) = [(9/10)^{10} − (9/10)^{110}] / [1 − (9/10)^{110}] ≈ 0.3487

of winning $10 before going broke. Actually he or she has the same odds (to four decimal places) of winning $10 before going broke beginning with the largest fortune of anyone on earth (presently about $50B)! The expected time to complete play is available once again:

    E[S_τ − (p − q)τ | S_0 = x] = f(x) b + [1 − f(x)] a − (p − q) E[τ | S_0 = x] = x

by Doob's O.S.T., so

    E[τ | S_0 = x] = (f(x) b + [1 − f(x)] a − x) / (p − q)
                   = { (b − x)[(p/q)^{b−x} − (p/q)^{b−a}] − (x − a)[1 − (p/q)^{b−x}] } / { (p − q)[1 − (p/q)^{b−a}] }
                   ≈ [(x − a) − (b − a)(p/q)^{b−x}] / (q − p)    if (p/q)^{b−a} ≈ 0,

or about E[τ | S_0 = 100] = [100 − 110 (9/10)^{10}] / [1/19] ≈ 1171 for our ill-fated gambler.
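The asymmetric formulas reduce to a few lines of arithmetic. The sketch below evaluates them for the roulette example and should recover the figures quoted above:

    p, q = 9/19, 10/19          # U.S. roulette: 18 winning slots out of 38
    a, b, x = 0, 110, 100
    r = q / p                   # = 10/9

    f = (r**x - r**a) / (r**b - r**a)         # exit-right probability f(x)
    Etau = (f*b + (1 - f)*a - x) / (p - q)    # expected number of plays
    print(f, Etau)                            # about 0.3487 and 1171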

The problem of doubling our money is even more striking: beginning with a stake of x = $50, the chance of doubling it to reach b = 100 before losing it all (so a = 0) is f(50) = 1/2 in the fair game, with an expectation of 50^2 = 2500 plays to complete the game, while the house edge afforded by the (green) 0 and 00 possibilities at U.S. roulette drops the gambler's chances to f(50) ≈ (0.9)^{50} ≈ 0.00515, with an expected playing time of only about 940 plays. Note that S_t is a random walk whose steps have expectation E[S_t − S_{t−1}] = E[2β_t − 1] = (p − q) = −1/19, so on average it would take about 50 × 19 = 950 steps to fall from 50 to zero; the expected length of the game is a little shorter than that, due to the approximately one-in-two-hundred chance of winning.

Now we are ready to begin studying inference for Markov Processes; we begin by introducing Markov Chains.