A primer on basic probability and Markov chains


David Aristoff

January 26, 2018

Contents

1 Basic probability
   1.1 Informal ideas and random variables
   1.2 Probability spaces
   1.3 Independence and conditional probability
   1.4 Memoryless random variables
   1.5 Normal random variables
   1.6 Expectation
   1.7 Mean and variance
   1.8 Moment generating functions
   1.9 Markov and Chebyshev inequalities
   1.10 Law of large numbers
   1.11 Central limit theorem
   1.12 Modes of convergence
   1.13 Borel-Cantelli lemma

2 Markov chains
   2.1 Informal ideas
   2.2 The Markov property
   2.3 Hitting times and the strong Markov property
   2.4 Irreducibility and aperiodicity
   2.5 The Perron-Frobenius theorem
   2.6 Mean first passage times
   2.7 Stationary distributions and convergence
   2.8 Ergodic theorem
   2.9 Left and right actions
   2.10 Infinite state space

1 Basic probability

This chapter is a crash course in basic probability. It attempts to present the main ideas of elementary probability without resorting to measure-theoretic technicalities or proofs. In particular, we will consider only measurable sets and functions. For this reason the presentation is somewhat incomplete, but this will not matter for the applications we need in later chapters. For a more detailed presentation at about the same level, see [1]. For a higher-level measure-theoretic treatment, see [2].

1.1 Informal ideas and random variables

A random variable is, informally, a real number X whose value is random. We think of the value of X as depending on some outcome ω. For instance, consider rolling two dice and let X be the sum of the rolls. Then if the outcome is ω = (1, 3), we have X = 4. The letter P stands for the probability of some subset of outcomes, or event. In this case P(X = 4) = 3/36, since there are 3 ways for the sum of the rolls to be 4, namely when the outcome is ω = (1, 3), (2, 2) or (3, 1), and 36 total outcomes. Here, the event {X = 4} corresponds to the collection {(1, 3), (2, 2), (3, 1)} of outcomes. Often we use commas to indicate two or more events occurring simultaneously. For instance, P(X odd, X ≥ 4) = 16/36 = 4/9 stands for the probability that X is odd and X ≥ 4. Much of the information about a random variable X is contained in its cumulative distribution function:

Definition 1.1. The cumulative distribution function (CDF) of X is F_X(x) = P(X ≤ x).

The CDF of X gives the probability that the random variable is less than or equal to a given number. When F_X is differentiable, its derivative gives the probability that the random variable is near a given number.

Definition 1.2. If it exists, p_X := F_X′ is the probability density function (PDF) of X.

Note that if X has a PDF p_X, then

P(a < X ≤ b) = F_X(b) − F_X(a) = ∫_a^b p_X(x) dx.

For a simple example where p_X doesn't exist, consider X that always takes the value 0.
Then F_X(x) = 0 for x < 0 and F_X(x) = 1 for x ≥ 0, so F_X isn't differentiable. If X only takes integer or discrete values, we also define the following.

Definition 1.3. Suppose X can only take nonnegative integer values. Then its probability mass function (PMF) is f_X(k) = P(X = k).

We'll usually consider random variables with PDFs or PMFs.

Exercise 1.4. Let X be the sum of two die rolls. Find the PMF and CDF of X.

The simplest random variable with discrete values is the Bernoulli random variable:

Definition 1.5. X is a Bernoulli-p random variable if its PMF is

f_X(k) = p if k = 1, 1 − p if k = 0, and 0 otherwise,

where p ∈ (0, 1).

Note that a Bernoulli-1/2 random variable corresponds to a (fair) coin toss. A Bernoulli-p random variable corresponds to a biased coin toss, where the probability of (say) heads is p. The simplest random variable with continuous values is the uniform random variable:

Definition 1.6. X is a uniform random variable in [0, 1] if its CDF is

F_X(x) = 0 for x < 0, x for x ∈ [0, 1), and 1 for x ≥ 1.

Sometimes X is also called a uniform random number in [0, 1]. X corresponds to sampling a point at random in [0, 1]. Imagine [0, 1] as a dartboard: the probability of throwing a dart that lands on the board to the left of x ∈ [0, 1] is just x.

Exercise 1.7. Find the PDF of a uniform random variable.

We use the word distribution to refer to either the CDF, PDF, or PMF of a random variable. Since the distribution of a random variable contains most of the relevant information, we will often introduce X simply by giving its distribution. However, a distribution does not completely determine a random variable, as we see in the next section. In particular, the distribution of a random variable does not give information about how it is related to other random variables.

Exercise 1.8. Show that F_X is nondecreasing and lim_{x→−∞} F_X(x) = 0, lim_{x→∞} F_X(x) = 1. Show that if X has PDF p_X, then ∫_{−∞}^{∞} p_X(x) dx = 1.

The CDF F_X is also right-continuous. A proof of that requires some set-theoretic technicalities, so we skip it.

1.2 Probability spaces

Recall that events are subsets of outcomes to which probabilities are assigned. Thus, events are simply certain subsets of Ω. (In general, not all subsets of Ω are events, since not all such subsets can be assigned probabilities, but we will ignore this complication. In the context of measure theory, this means we assume everything is measurable.) Formally, P is a function that assigns a number in [0, 1] to each event, with the property that P(Ω) = 1 and:

Definition 1.9 (Countable additivity). If A_1, A_2, ... are disjoint events, then

P(∪_{k=1}^∞ A_k) = Σ_{k=1}^∞ P(A_k).

It is easy to check that countable additivity implies countable subadditivity:

P(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ P(A_k), for any events A_k.

Perhaps the simplest probability measures are the uniform ones, which assign the same probabilities to all individual outcomes or events of the same size:

Example 1.10 (Uniform measure on a finite set). Suppose Ω is a finite set. If P(A) = |A| / |Ω| for each A ⊂ Ω, where |S| denotes the number of elements in a set S, then P is called the uniform probability measure on Ω.

The uniform measure can be defined on certain infinite outcome spaces Ω. In this case the condition P(Ω) = 1 forces each single outcome ω ∈ Ω to have zero probability:

Example 1.11 (Uniform measure on the unit interval). If Ω = [0, 1] and P([a, b]) = b − a for 0 ≤ a ≤ b ≤ 1, then P is called the uniform probability measure on Ω.

Formally, random variables are just real-valued functions of outcomes, that is, functions X : Ω → R.

Definition 1.12. The CDF F_X(x) of X is formally defined by F_X(x) = P(X^{−1}((−∞, x])).

It is standard to write P(X ≤ x) instead of P(X^{−1}((−∞, x])). Similarly, we write P(X = x) instead of P(X^{−1}({x})), and for A ⊂ R we write P(X ∈ A) instead of P(X^{−1}(A)).

Exercise 1.13. Take the two die roll example from above: let Ω = {1, ..., 6}², and let P be uniform on Ω. Define X : Ω → R by X(ω) = ω_1 + ω_2. Compute its probability mass function using the formula f_X(k) = P(X^{−1}({k})), k ∈ N.
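The die-roll PMF above can be computed by brute force over the 36 outcomes. A minimal sketch in Python (the helper name pmf is ours), using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Outcome space for two die rolls; the uniform measure gives P({w}) = 1/36.
omega = list(product(range(1, 7), repeat=2))

def pmf(k):
    # f_X(k) = P(X^{-1}({k})): count the outcomes w with w_1 + w_2 = k.
    return Fraction(sum(1 for w in omega if w[0] + w[1] == k), len(omega))
```

For instance, pmf(4) returns 3/36 = 1/12, matching the computation of P(X = 4) in Section 1.1, and the values pmf(2), ..., pmf(12) sum to 1.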

Random variables are not completely determined by their CDFs. For instance, let P be uniform on Ω = [0, 1] and define X : Ω → R by X(ω) = ω. Let P′ be uniform on Ω′ = [1, 2] and define X′ : Ω′ → R by X′(ω) = ω − 1. Then X and X′ have the same CDF (what is it?). Even though X and X′ are technically not the same, we will usually consider them as equivalent.

The following example is also instructive. Suppose X is defined as above, and define Y : Ω → R by Y(ω) = 1 − ω. Then Y has the same distribution as X (check this!). Define also Z : Ω → R by Z(ω) = 0 if ω < 1/2 and Z(ω) = 1 if ω ≥ 1/2. Then

P(X ∈ [0, 1/2), Z = 0) = 1/2, while P(Y ∈ [0, 1/2), Z = 0) = 0.

(By convention, the comma separating the events stands for and.) Thus, though X, Y have the same distribution, they are related to Z in different ways. To describe the relation between X, Z or Y, Z, we need a joint distribution. For instance, the joint CDF of X and Z is F_{X,Z}(x, z) = P(X ≤ x, Z ≤ z). On the other hand, if two random variables are independent (defined below), then knowing their individual distributions is enough to know how they behave together.

1.3 Independence and conditional probability

Let A, B be events.

Definition 1.14. A, B are independent if P(A ∩ B) = P(A)P(B).

For example, consider two die rolls. Let A be the event that the first roll is odd and B the event that the second roll is 4. It is easy to see that P(A)P(B) = (1/2)(1/6) = 1/12. On the other hand, there are 3 outcomes in which the first roll is odd and the second roll is 4. Thus, P(A ∩ B) = 3/36 = 1/12. Intuitively, A, B are independent if they have nothing to do with each other. In this example, this is automatic because A and B depend on distinct die rolls. Note that A, B being independent is different from A, B being disjoint: if A ∩ B = ∅ then P(A ∩ B) = 0, while P(A), P(B) may be nonzero.

Definition 1.15. The conditional probability of A given B is

P(A | B) = P(A ∩ B)/P(B). (1)

Let Q be the probability defined by Q(A) = P(A ∩ B)/P(B). It corresponds to the original probability P with the added assumption that we know B occurred.
Thus, P(A | B) is the probability that A occurs, given that B occurred. Consider two die rolls. Let B be the event that the first roll is odd, and A the event that the rolls add to 5. Consider first P(A | B). Assuming B occurred, we know that the first roll is either 1, 3 or 5, and each has the same probability 1/3. If the first roll is 1 or 3, the sum of the rolls is 5 only when the second roll is 4 or 2, respectively, each of which has probability 1/6. If the first roll is 5 there is no way the sum is 5. Thus,

P(A | B) = (1/3)(1/6) + (1/3)(1/6) = 1/9.

On the other hand, consider P(A ∩ B). There are only two outcomes, namely (1, 4) and (3, 2), for which the first roll is odd and the rolls add to 5. Thus, P(A ∩ B) = 2/36 = 1/18. Since P(B) = 1/2, we have verified equation (1). Note that if P(B) > 0 and A, B are independent, then P(A | B) = P(A): given that B occurred, we learn nothing about the probability of A.

Exercise 1.16. Consider two die rolls. Let A be the event that the first roll is 3, B the event that the second roll is 4, and C the event that the rolls add to 7. Show that each pair of A, B, C is independent but P(A ∩ B ∩ C) ≠ P(A)P(B)P(C).

Definition 1.17. Random variables X_1, X_2, ... are independent if

P(∩_{j∈J} {X_j ∈ I_j}) = Π_{j∈J} P(X_j ∈ I_j)

for all finite subsets J ⊂ N and intervals I_j ⊂ R.

For example, consider an infinite die roll in which the outcome space is Ω = {1, ..., 6}^N, the sequences ω = (ω_1, ω_2, ...) of numbers ω_i ∈ {1, ..., 6}. Let X_i be 1 if the ith roll is a 6, and 0 otherwise. That is, X_i(ω) = 1_{ω_i = 6}, where 1_A = 1 if A is true and 0 otherwise. Then X_1, X_2, ... are independent. Note that Exercise 1.16 shows something like Definition 1.17 is needed for independence: pairwise independence of the X_i's is not enough.

Exercise 1.18. Use Definition 1.17 to write a definition of when a sequence A_1, A_2, ... of events is independent.

The following is an immediate consequence of the definition of conditional probability.

Theorem 1.19. Let A_k, k = 1, 2, ... be disjoint events with P(∪_{k=1}^∞ A_k) = 1. Then for any event A,

P(A) = Σ_{k=1}^∞ P(A | A_k) P(A_k).

The theorem above is sometimes called the rule of total probability. Despite its simplicity, this rule (and variants thereof) is arguably among the most useful results in probability.

1.4 Memoryless random variables

Definition 1.20. An exponential-λ random variable has CDF F_X(x) = 1 − e^{−λx} for x ≥ 0. Here λ > 0 is called the rate.

X can be constructed by taking P uniform on Ω = (0, 1) and defining X(ω) = −λ^{−1} log(1 − ω) (check this!).
Actually, this provides a way of sampling X: pick a uniform random number ω in (0, 1), and then compute X(ω).
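This inverse-CDF recipe is easy to try numerically. A sketch (the seed and function name are ours): draw many uniforms, map each through X(ω) = −λ^{−1} log(1 − ω), and compare the sample mean with E(X) = 1/λ.

```python
import math
import random

def sample_exponential(lam, rng):
    # X(w) = -log(1 - w)/lam with w uniform in (0, 1):
    # the inverse of the CDF F_X(x) = 1 - e^{-lam x}.
    return -math.log(1.0 - rng.random()) / lam

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should be close to E(X) = 1/lam = 0.5
```

The same idea works for any CDF with an explicit inverse; this is usually called inverse transform sampling.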

Definition 1.21. A geometric-p random variable has PMF f_X(k) = (1 − p)^{k−1} p.

A geometric random variable can be constructed by taking the X_i's to be independent Bernoulli-p random variables (Definition 1.5), and setting X = min{i : X_i = 1}. Intuitively, this corresponds to taking infinitely many biased coin tosses, say with probability of heads p, and then taking the first toss that is heads. If the first heads is on the kth toss, then the first k − 1 tosses were tails and the kth was heads. Then by independence, P(X = k) = (1 − p)^{k−1} p.

Exercise 1.22. Show that if X is exponential or geometric, then

P(X > t + s | X > t) = P(X > s),

where t, s > 0 if X is exponential, and t, s ∈ N if X is geometric.

Because of the property in Exercise 1.22, geometric and exponential random variables are used to model random arrivals. Intuitively, the amount of time you have waited so far for an arrival has nothing to do with the additional amount of time you'll have to wait. For this reason these random variables are called memoryless.

1.5 Normal random variables

Definition 1.23. A standard normal random variable has PDF

p_X(x) = (1/√(2π)) e^{−x²/2}.

This is the so-called bell curve.

Definition 1.24. Let Y_1, Y_2, ... be independent Bernoulli-1/2 random variables and X_i = 2Y_i − 1. Then S_n = X_1 + ··· + X_n is called the simple random walk.

Thinking of n as time, S_n starts at 0 and at each time step moves 1 unit right or left with equal probability. Let F_n be the CDF of n^{−1/2} S_n. It turns out that F_n converges to

F(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt,

the CDF of a standard normal random variable. Thus, a standard normal random variable is simply the (appropriately rescaled) position of a simple random walk after a long time.

Definition 1.25. A sequence X_n, n = 1, 2, ... of random variables is iid if it is independent and each X_i has the same distribution (meaning the same CDF, PMF or PDF).

The word iid is an acronym for independent and identically distributed. The sequence X_1, X_2, ... defining the random walk is iid.
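The simple random walk is easy to simulate. A sketch (the seed and names are ours): since the steps are iid with mean 0 and variance 1, S_n has mean 0 and variance n, which a simulation can confirm.

```python
import random

rng = random.Random(1)

def random_walk(n):
    # S_n = X_1 + ... + X_n with X_i = +1 or -1, each with probability 1/2.
    return sum(1 if rng.random() < 0.5 else -1 for _ in range(n))

walks = [random_walk(100) for _ in range(20_000)]
mean = sum(walks) / len(walks)                  # E(S_n) = 0
var = sum(s * s for s in walks) / len(walks)    # Var(S_n) = n = 100
```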
It turns out that the standard normal distribution can be obtained in a similar way from any reasonable iid sequence. This is the central limit theorem, which will be proved below.

1.6 Expectation

The expected value or expectation of X, denoted E(X), is defined as follows:

Definition 1.26. For a random variable X with PDF p_X,

E(X) = ∫_{−∞}^{∞} x p_X(x) dx. (2)

For a random variable X with PMF f_X,

E(X) = Σ_{k≥0} k f_X(k). (3)

When X does not have a PDF or a PMF, it takes more work to define E(X). We will skip this because it involves measure and integration theory outside the scope of this section. However, we note that if X is nonnegative, then E(X) = ∫_0^∞ (1 − F_X(x)) dx.

The expected value of X corresponds to the average value it takes. That is, equation (3) says the expected value of X is the sum of the values it takes, each multiplied by the probability of that value. Equation (2) says something analogous for continuous-valued random variables. Expectation is linear, that is, E(X + Y) = E(X) + E(Y). If g : R → R, then g ∘ X is a random variable, usually written g(X), and its expectation is given by (2) or (3) under appropriate conditions.

The connection between expectation and probability is as follows. Let 1_{X∈A} be the function which is 1 if X ∈ A and 0 otherwise, called the indicator function of the event X ∈ A. Then P(X ∈ A) = E(1_{X∈A}).

Exercise 1.27. Consider two die rolls and let X be the sum of the rolls. Find E(X).

Exercise 1.28. Find E(X) when X is geometric-p. Repeat for X exponential-λ.

The following is a useful consequence of independence.

Theorem 1.29. If X, Y are independent, then

E(f(X)g(Y)) = E(f(X)) E(g(Y)) (4)

whenever all the expectations exist.

We skip the proof, which requires some measure and integration theory that is outside the scope of this section.

Exercise 1.30. Check that equation (4) holds when X, Y are independent and f(x) = 1 if x ∈ A and 0 otherwise, and g(x) = 1 if x ∈ B and 0 otherwise (here we write f = 1_A and g = 1_B).

1.7 Mean and variance

Definition 1.31. The mean of X is its expected value E(X).

The mean of X is often written µ. Not all random variables have finite mean. For example, the Cauchy distribution, which has PDF p_X(x) = (1/π) · 1/(1 + x²), has E(|X|) = ∞.

Definition 1.32. The variance of X is Var(X) = E((X − µ)²), with µ the mean of X.

The variance of X is often written σ². The standard deviation is its square root, σ. The variance is a measure of how much X tends to differ from its mean. If X has a PDF p_X, then its variance is

σ² = ∫_{−∞}^{∞} (x − µ)² p_X(x) dx.

Exercise 1.33. Let X_1, X_2 be independent random variables with means µ_1, µ_2 and variances σ_1², σ_2². Show that the mean of X_1 + X_2 is µ_1 + µ_2 and the variance of X_1 + X_2 is σ_1² + σ_2². If c > 0 is constant, show the variance of cX_1 is c²σ_1².

The sample variance of x_1, ..., x_n ∈ R is defined by

σ̂² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)², where x̄ = (1/n) Σ_{i=1}^n x_i.

Sample variance can be thought of as an empirical quantity, while the variance of a random variable is a theoretical quantity. The next exercise shows the relationship between the two types of variance.

Exercise 1.34. Let X_1, ..., X_n be iid random variables with variance σ². Show that

σ̂_n² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)², where X̄ = (1/n) Σ_{i=1}^n X_i,

has E(σ̂_n²) = σ².

The normalization 1/(n − 1) is really needed in Exercise 1.34; if 1/n is used instead, there is a slight bias in the estimate of σ², meaning its expected value is incorrect.

Exercise 1.35. Find the mean and variance of an exponential-λ random variable.

1.8 Moment generating functions

Let X be a random variable.

Definition 1.36. For k ∈ N, the kth moment of X is E(X^k).

The first moment of X is its expected value or mean. The second moment is related to its variance via Var(X) = E(X²) − (E(X))². Usually people think of the third moment as measuring the skewness of X (i.e., asymmetry about its mean) and the fourth moment as measuring the heaviness of its tail, meaning its propensity to take large positive or negative values (called kurtosis).
In general, the moments together define the moment generating function, which in turn can be used to characterize a random variable.

Definition 1.37. The moment generating function (MGF) of X is φ_X(s) = E(exp(sX)).

Note that φ_X(s) may be infinite for some s, so the domain of φ_X is simply those s for which the expectation is finite. Why is it called the MGF? Note that, by Taylor expansion,

φ_X(s) = E(1 + sX + s²X²/2 + ···),

and so under appropriate conditions (for being able to differentiate inside the integral), φ_X′(0) = E(X). Similar calculations show that, for k ∈ N and φ_X^{(k)} the kth derivative of φ_X,

φ_X^{(k)}(0) = E(X^k).

Note that the MGF of X can be understood as a Laplace transform of X. Under appropriate conditions the Laplace transform is invertible, in which case the MGF of X can be used to identify the distribution of X.

Theorem 1.38. If two random variables have the same MGF, then they have the same distribution. More generally, if their MGFs agree in a neighborhood of 0, then they have the same distribution.

Exercise 1.39. Show that if X_n, n = 1, 2, ... are iid with common MGF φ, then the MGF of S_n = X_1 + ··· + X_n is φ_{S_n}(s) = φ(s)^n.

Exercise 1.40. Find the MGF of an exponential-λ random variable.

Exercise 1.41. Show that the MGF of a standard normal random variable is φ(s) = e^{s²/2}.

1.9 Markov and Chebyshev inequalities

Theorem 1.42 (Markov inequality). Let X be a nonnegative random variable. Then

P(X ≥ a) ≤ E(X)/a for each a > 0.

Assuming X has a PDF p_X, Markov's inequality follows from

E(X) = ∫_0^∞ x p_X(x) dx ≥ ∫_a^∞ x p_X(x) dx ≥ ∫_a^∞ a p_X(x) dx = a P(X ≥ a).

Corollary 1.43 (Chebyshev inequality). Let X have mean µ and variance σ². Then

P(|X − µ| ≥ kσ) ≤ 1/k².

The corollary is proved by applying the Markov inequality to (X − µ)² with a = k²σ².
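As a concrete check of Markov's inequality, take X exponential-λ (Section 1.4), for which P(X ≥ a) = e^{−λa} exactly and E(X) = 1/λ. The bound then reads e^{−λa} ≤ 1/(λa), which holds since e^x ≥ x. A sketch:

```python
import math

lam = 2.0
mean = 1.0 / lam                      # E(X) for an exponential-lam random variable
for a in [0.1, 0.5, 1.0, 5.0]:
    tail = math.exp(-lam * a)         # P(X >= a), exact for the exponential
    assert tail <= mean / a           # Markov: P(X >= a) <= E(X)/a
```

For small a the bound is vacuous (larger than 1), while for large a the true tail is far smaller than the bound; Markov's inequality trades sharpness for generality.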

1.10 Law of large numbers

The law of large numbers says that the average of n iid random variables converges, in some sense, to their common mean µ as n → ∞.

Theorem 1.44 (Weak law of large numbers). Let X_n, n = 1, 2, ... be iid with finite mean µ and variance σ² < ∞. Set X̄_n = n^{−1}(X_1 + ··· + X_n). Then for each ε > 0,

lim_{n→∞} P(|X̄_n − µ| ≥ ε) = 0. (5)

By Exercise 1.33, the variance of X̄_n is n^{−1}σ². So by Chebyshev's inequality,

P(|X̄_n − µ| ≥ ε) ≤ σ²/(nε²).

The type of convergence in (5) is called convergence in probability. We will examine this and other modes of convergence below. The strong law of large numbers is as follows.

Theorem 1.45 (Strong law of large numbers). Let X_n, n = 1, 2, ... be iid random variables with finite mean µ. Set X̄_n = n^{−1}(X_1 + ··· + X_n). Then

P(lim_{n→∞} X̄_n = µ) = 1.

The type of convergence in Theorem 1.45 is stronger than that of Theorem 1.44. We will compare these and other modes of convergence in Section 1.12 below. We outline an alternative proof of a weak law of large numbers in the next exercise.

Exercise 1.46. Let X_n, n = 1, 2, ... be iid with E(X_1) = µ and common MGF φ whose domain of definition contains a neighborhood of 0. Define X̄_n = n^{−1}(X_1 + ··· + X_n). Show that the MGF of X̄_n is φ_{X̄_n}(s) = φ(s/n)^n, and that φ_{X̄_n}(s) → e^{sµ} as n → ∞. (See Exercise 1.39.) Conclude that the distribution of X̄_n converges to the distribution of the constant random variable X ≡ µ.

1.11 Central limit theorem

Recall that if X_n, n = 1, 2, ... are iid with mean µ, then by the law of large numbers n^{−1}(X_1 + ··· + X_n) tends to µ. Assume WLOG µ = 0. Then n^{−1}(X_1 + ··· + X_n) tends to 0, but if we scale instead by n^{−1/2} we get a nontrivial distribution, the bell curve.

Theorem 1.47 (Central limit theorem). Let X_n, n = 1, 2, ... be iid with E(X_1) = 0 and σ² = E(X_1²) < ∞. Let

Z_n = (X_1 + ··· + X_n)/(σ√n).

Then the distribution of Z_n converges to a standard normal distribution.

To see why the theorem is true, assume the MGF φ of X_1 is defined in a neighborhood of 0 (this assumption is not needed but makes the proof easy).
Since E(X_1) = 0 and E(X_1²) = σ², by Taylor expansion,

φ(s) = 1 + s²σ²/2 + O(s³).

Let φ_n be the MGF of Z_n. By independence of the X_i's (see Exercise 1.46),

φ_n(s) = φ(s/(σ√n))^n = (1 + s²/(2n) + O(n^{−3/2}))^n → e^{s²/2} as n → ∞.

The last expression is the MGF of a standard normal random variable (see Exercise 1.41).

1.12 Modes of convergence

Here we briefly discuss some of the modes of convergence mentioned above.

Definition 1.48. X_n converges to X in distribution, written X_n →d X, if the CDF of X_n converges pointwise to the CDF of X on the set where the CDF of X is continuous.

The following alternative characterization of convergence in distribution is often useful: X_n →d X if and only if E(f(X_n)) → E(f(X)) for all bounded continuous f : R → R.

Definition 1.49. X_n converges to X in probability, written X_n →p X, if for each ε > 0,

lim_{n→∞} P(|X_n − X| ≥ ε) = 0.

Theorem 1.50. Convergence in probability implies convergence in distribution.

We sketch a proof. Let f be bounded and continuous. Given ε > 0,

|f(X_n) − f(X)| ≤ ε + M · 1_{|f(X_n) − f(X)| ≥ ε},

where M = 2 sup |f|. Thus,

E(|f(X_n) − f(X)|) ≤ ε + M P(|f(X_n) − f(X)| ≥ ε) → ε as n → ∞,

since X_n →p X implies f(X_n) →p f(X) (Exercise 1.51). Thus, E(f(X_n)) → E(f(X)).

Exercise 1.51. Show that X_n →p X implies f(X_n) →p f(X) for any continuous f.

Definition 1.52. X_n converges to X a.s., written X_n →a.s. X, if P(lim_{n→∞} X_n = X) = 1.

The a.s. stands for almost sure.

Theorem 1.53. Almost sure convergence implies convergence in probability.

This follows from Fatou's lemma from measure theory; we omit the proof. We have seen that almost sure convergence implies convergence in probability, which in turn implies convergence in distribution. We now give examples to show that the reverse implications do not hold.

First, consider the uniform measure P on Ω = [0, 1], and define X_n(ω) = ω for n even and X_n(ω) = 1 − ω for n odd. Let X be a uniform random variable in [0, 1]. Then all the X_n's have the same distribution, the distribution of X (see Definition 1.6). However, for

n odd,

P(|X_n − X| < ε) = P(1/2 − ε/2 < X < 1/2 + ε/2) = ε.

Thus, X_n does not converge to X in probability.

Next, let X_n be independent Bernoulli-1/n, that is, P(X_n = 1) = 1/n and P(X_n = 0) = 1 − 1/n. Then X_n converges in distribution and in probability to 0, but

P(lim sup_{n→∞} X_n = 1) = 1

by the Borel-Cantelli lemma, proved below. Thus, X_n does not converge almost surely to 0. (See the discussion below Theorem 1.54.)

1.13 Borel-Cantelli lemma

Theorem 1.54. Let E_n, n = 1, 2, ... be independent events with Σ_{n=1}^∞ P(E_n) = ∞. Then

P(∩_{n=1}^∞ ∪_{k≥n} E_k) = 1.

Before proving the theorem, consider its application to the example at the end of the last section. Let E_n = {X_n = 1} be the event that X_n = 1. Then the hypotheses of Theorem 1.54 are satisfied, so P(∩_{n=1}^∞ ∪_{k≥n} {X_k = 1}) = 1. Observe that ∪_{k≥n} {X_k = 1} = {sup_{k≥n} X_k = 1} and thus

∩_{n=1}^∞ ∪_{k≥n} {X_k = 1} = {sup_{k≥n} X_k = 1 for all n} = {lim sup_{n→∞} X_n = 1}.

Since this event has probability 1, X_n does not converge to 0 almost surely.

Now we turn to a proof of the theorem. By De Morgan's laws, it suffices to show that

P(∪_{n=1}^∞ ∩_{k≥n} E_k^c) = 0,

where E_k^c = Ω \ E_k denotes the complement of E_k. We have

P(∩_{k≥n} E_k^c) = Π_{k≥n} P(E_k^c) = Π_{k≥n} (1 − P(E_k)) ≤ Π_{k≥n} exp(−P(E_k)) = exp(−Σ_{k≥n} P(E_k)) = 0,

where we used independence and 1 − x ≤ e^{−x}. Countable subadditivity gives the result.
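The Bernoulli-1/n example can also be explored numerically. Since Σ_{n≤N} 1/n ≈ log N diverges, ones keep arriving forever, though more and more rarely; a sketch (the seed is ours):

```python
import random

rng = random.Random(7)
N = 100_000
# X_n = 1 with probability 1/n. The expected number of ones up to N is the
# harmonic sum 1 + 1/2 + ... + 1/N ~ log N, which diverges as N grows.
ones = [n for n in range(1, N + 1) if rng.random() < 1 / n]
count = len(ones)   # roughly log(100000), i.e. on the order of 10
```

Any finite run can only suggest the behavior; the Borel-Cantelli lemma is what guarantees that the ones never stop arriving.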

2 Markov chains

This chapter is a crash course on the basics of Markov chains. For simplicity of presentation we mostly assume a finite state space. We will describe how results generalize to countably infinite or uncountable state spaces only when needed for later use. For a more detailed presentation see [4].

2.1 Informal ideas

Informally, a Markov chain is a random walk in which the next step of the walk depends only on the current position. For instance, let X_i, i = 1, 2, ... be iid random variables with P(X_i = 1) = P(X_i = −1) = 1/2. Then S_n = X_1 + ··· + X_n is a random walk where at each step we go right or left with equal probability. This is called the simple random walk. Markov chains generalize this example by allowing the steps to have any distribution which depends only on the current position.

2.2 The Markov property

Throughout, J is a finite set and X_n is a sequence of random variables with values in J. Here J is called the state space and elements of J are called states.

Definition 2.1. X_n is a Markov chain if for each n ≥ 0 and i_0, ..., i_{n−1}, i, j ∈ J,

P(X_{n+1} = j | X_0 = i_0, ..., X_{n−1} = i_{n−1}, X_n = i) = P(X_{n+1} = j | X_n = i). (6)

Equation (6) is called the Markov property. We think of the subscript n as time or a time step, and of the sequence X_n as the time evolution of a random walk. Due to the Markov property, this evolution is completely described by the matrices P_n(i, j) = P(X_{n+1} = j | X_n = i). We are mostly interested in cases where these matrices do not depend on n, in which case X_n is called time homogeneous. We will assume our Markov chains are time homogeneous unless otherwise specified.

Definition 2.2. The transition matrix P of a time homogeneous Markov chain is

P_ij = P(X_{n+1} = j | X_n = i).

It is easily checked that

P(X_n = j_n, ..., X_1 = j_1 | X_0 = j_0) = P_{j_0 j_1} ··· P_{j_{n−1} j_n}.

Moreover, probabilities of X_n after n steps can be obtained from the nth power of P:

Exercise 2.3.
Use the definition of conditional probability and induction to show that

P(X_n = j | X_0 = i) = (P^n)_{ij},

where (P^n)_{ij} stands for the ij entry of P^n.
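The identity in Exercise 2.3 can be sanity-checked directly. A sketch with a small hypothetical two-state chain (the matrix is ours, not from the text), comparing (P²)_{ij} with the sum over intermediate states:

```python
from fractions import Fraction

F = Fraction
# A hypothetical two-state transition matrix; rows are nonnegative and sum to 1.
P = [[F(3, 4), F(1, 4)],
     [F(1, 2), F(1, 2)]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

P2 = matmul(P, P)
# P(X_2 = j | X_0 = i) = sum_k P_ik P_kj = (P^2)_ij: condition on the state at time 1.
assert P2[0][1] == P[0][0] * P[0][1] + P[0][1] * P[1][1]
assert all(sum(row) == 1 for row in P2)   # P^2 is again a transition matrix
```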

For example, consider a random walk on three states J = {1, 2, 3}. Suppose

P = [ 0    1/2  1/2
      1/3  0    2/3
      1    0    0   ]. (7)

Starting at 1, we go to either 2 or 3 with the same probability. Starting at 2, we go to 3 twice as often as to 1. And starting at 3, we always go to 1. In this Markov chain we never stay at the same state for two consecutive time steps, but in general this is allowed (in which case the diagonal of P is nonzero). In general, the only assumptions on P are that it is square, its entries are nonnegative, and its rows sum to 1. That is, any such matrix is the transition matrix of a Markov chain, and conversely.

Definition 2.4. We write P_i(X_n = j) ≡ P(X_n = j | X_0 = i) and, for any PMF ρ on J,

P_ρ(X_n = j) = Σ_{i∈J} P_i(X_n = j) ρ(i).

More generally, P_i or P_ρ indicate that X_n started at X_0 = i or at the distribution ρ.

The subscript next to P corresponds to an initial distribution of X_n. Thus, P_i(X_n = j) is the probability that X_n = j when we start at X_0 = i. In light of the law of total probability (Theorem 1.19), P_ρ(X_n = j) corresponds to the probability that X_n = j when X_0 is a random variable with PMF ρ.

2.3 Hitting times and the strong Markov property

A stopping time for a Markov chain (X_n) is a random variable τ with values in N such that the event {τ = k} is known at time k. This means that, based on knowing the values of X_n for 0 ≤ n ≤ k, we can decide whether or not τ = k. Unfortunately, a more precise description of stopping times is not possible without introducing some measure theory. However, we can introduce an important class of stopping times with no further effort.

Definition 2.5. A hitting time of X_n has the form

τ_A = inf{n ≥ 0 : X_n ∈ A}

for some nonempty A ⊂ J. Here, τ_A is called the hitting time of A.

A hitting time is a stopping time, and the most important type of stopping time for us. It turns out that Markov chains (in discrete time) have the Markov property at stopping times.
This is called the strong Markov property:

Theorem 2.6 (Strong Markov property). Let X_n be a Markov chain and τ a stopping time. Then for every n ≥ 1 and i, j_1, ..., j_n ∈ J,

P(X_{τ+n} = j_n, ..., X_{τ+1} = j_1 | X_τ = i, {X_k = i_k, 0 ≤ k < τ}) = P_i(X_n = j_n, ..., X_1 = j_1).

The strong Markov property says that the Markov chain starts afresh after each stopping time. It can be proved from the (ordinary) Markov property.

Exercise 2.7. Prove the strong Markov property in the case where τ is a hitting time. (Hint: use Theorem 1.19.)

We end this section by giving an example of a stopping time that will be important later on. Consider two Markov chains (X_n) and (Y_n) and let

τ = inf{n ≥ 0 : X_n = Y_n}

be the first time the chains intersect each other. Note that τ is the hitting time, for the pair (X_n, Y_n), of the diagonal set A = {(i, i) : i ∈ J}. This τ is called the coupling time of the Markov chains, and is useful for proving convergence theorems, as we will see below.

2.4 Irreducibility and aperiodicity

A Markov chain on a finite state space J is called irreducible if any state can be reached from any other state. We write i → j if and only if (P^n)_{ij} > 0 for some n. Thus, i → j if and only if j can be reached from i after some number of steps, with positive probability.

Definition 2.8. X_n is irreducible if i → j and j → i for each i, j ∈ J. If X_n is not irreducible, it is called reducible.

Exercise 2.9. Show that the transition matrix from (7) defines an irreducible Markov chain. Give an example of a transition matrix defining a reducible Markov chain.

Definition 2.10. The period of X_n is the integer

k = min_{i∈J} gcd{n > 0 : P_i(X_n = i) > 0},

where gcd stands for greatest common divisor. X_n is aperiodic if its period is k = 1.

Intuitively, an irreducible Markov chain is aperiodic if, starting at (any state) i, it can return to i at irregular times. As an example, consider a Markov chain on state space J = {1, 2} with transition probabilities P_12 = P_21 = 1 and P_11 = P_22 = 0. Then the chain is periodic with period 2.

Exercise 2.11. Show that the matrix P from (7) defines an aperiodic Markov chain.

2.5 The Perron-Frobenius theorem

Let X_n be a Markov chain with transition matrix P. The Perron-Frobenius theorem provides useful spectral information about P.

Theorem 2.12 (Perron-Frobenius 1).
Let X_n be irreducible and aperiodic. Then P has a simple eigenvalue λ = 1, and all other eigenvalues are strictly smaller in magnitude. Moreover, the (normalized) left eigenvector corresponding to λ = 1 defines a PMF on J.
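The exercise below can be approached numerically: since all other eigenvalues are strictly smaller in magnitude, repeatedly applying P on the left (power iteration) converges to the normalized left eigenvector for λ = 1. A sketch for the matrix P in (7) (the helper names are ours):

```python
# Transition matrix (7) on states {1, 2, 3}, 0-indexed here.
P = [[0.0, 0.5, 0.5],
     [1 / 3, 0.0, 2 / 3],
     [1.0, 0.0, 0.0]]

def left_mult(pi, P):
    # (pi P)_j = sum_i pi_i P_ij: one step of the distribution of the chain.
    return [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

pi = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    pi = left_mult(pi, P)   # power iteration: converges since |lambda_2| < 1

# pi now (approximately) satisfies pi P = pi and sums to 1.
residual = max(abs(a - b) for a, b in zip(left_mult(pi, P), pi))
```

The starting distribution does not matter here; any initial PMF converges to the same left eigenvector.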

Exercise 2.13. Verify the assumptions and conclusion of Theorem 2.12 for P defined in (7).

It turns out that if X_n has period k, then P has eigenvalues e^{2πim/k} for m = 1, ..., k. And if J has exactly k subsets (necessarily disjoint) on which X_n is irreducible, then P has eigenvalue 1 with geometric multiplicity k. In particular, the transition matrix of any Markov chain (on a finite state space) has the eigenvalue 1 and a corresponding left eigenvector which defines a PMF on J.

Example 2.14. Give an example of a periodic and a non-irreducible Markov chain. Find the left eigenvectors and eigenvalues to verify the statements above.

It is often useful to have the following version of Theorem 2.12 for restricted transition matrices, that is, matrices obtained from P by deleting the rows and columns with indices outside some set B.

Theorem 2.15 (Perron-Frobenius 2). Let P be the transition matrix of X_n, let B ⊂ J be a nonempty proper subset of J, and let Q be the restriction of P to B. Let X̃_n be obtained from X_n by conditioning X_n to always remain in B. If X̃_n is irreducible and aperiodic, then Q has a real simple eigenvalue λ < 1 such that all other eigenvalues are strictly smaller in magnitude, and λ has a left eigenvector that defines a PMF on B.

The proofs of the Perron-Frobenius theorems are somewhat technical. For proofs we refer the reader to [5]. Recall that P^n describes the behavior of X_n after n steps. Later on, we will use this together with the spectral information from the Perron-Frobenius theorems to study long-time convergence of X_n.

2.6 Mean first passage times

Recall τ_A is the hitting time for X_n of a set A ⊂ J.

Definition 2.16. The mean first passage time of X_n to A is the expected value E(τ_A).

Mean first passage times satisfy the following linear system:

Theorem 2.17. Let X_n have transition matrix P. Then u_i = E_i(τ_A) satisfies

u_i = 1 + Σ_{j∉A} P_ij u_j for i ∉ A, and u_i = 0 for i ∈ A. (8)

Clearly u_i = 0 if i ∈ A.
If i ∉ A, then by the Markov property and the law of total probability,

    u_i = E_i(τ_A) = Σ_{j∈J} E_i(τ_A | X_1 = j) P_i(X_1 = j)
        = Σ_{j∈J} (1 + E_j(τ_A)) P_i(X_1 = j)
        = 1 + Σ_{j∈J} u_j P_ij.
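The linear system (8) is straightforward to solve numerically by restricting to J \ A and solving (I − Q)v = 1, the matrix form of (8). A minimal sketch, again using a hypothetical 3-state chain (not the P of (7)) with target set A = {2} in 0-based indexing:

```python
import numpy as np

# Hypothetical transition matrix on J = {0, 1, 2}; not the P of (7).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
A = [2]                                      # target set
rest = [i for i in range(3) if i not in A]   # J \ A

# Restrict P to J \ A and solve (I - Q) v = 1 for v_i = E_i(tau_A).
Q = P[np.ix_(rest, rest)]
v = np.linalg.solve(np.eye(len(rest)) - Q, np.ones(len(rest)))

# Assemble u on all of J, with u_i = 0 for i in A.
u = np.zeros(3)
u[rest] = v
print(u)   # for this chain, u = [5, 5, 0]
```

One can check equation (8) directly: for example u_0 = 1 + 0.5·u_0 + 0.3·u_1 = 1 + 2.5 + 1.5 = 5.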

Let v, Q, I, and 1 denote the restrictions of u, P, the identity matrix, and the all-ones column vector, respectively, to J \ A. By Theorem 2.15, if X_n is irreducible, then I − Q does not have 0 as an eigenvalue, hence is invertible. In this case, the last display becomes

    v = (I − Q)^{-1} 1,

which gives an explicit linear equation for the mean first passage times.

Exercise. Find E_i(τ_A) for i = 1, 2, 3 when P is as in (7) and A = {3}.

2.7 Stationary distributions and convergence

Let P be the transition matrix of a Markov chain X_n on a finite state space J.

Definition. A vector π is a stationary distribution of P (or of X_n) if πP = π. Here πP is matrix multiplication, thinking of π as a row vector. Entrywise, this reads

    Σ_{i∈J} π_i P_ij = π_j.

Another way to write this is P_π(X_1 = j) = π_j. That is, if we start at π at time 0, we are still at π at time 1. More generally, if X_0 has PMF π, then X_n has PMF π for every n. So if we start at π we remain at π forever; this is the reason for the word stationary.

The remarks below the Perron-Frobenius theorem (Theorem 2.12) show that every Markov chain on a finite state space has a stationary distribution. The stationary distribution is unique if the Markov chain is irreducible.

Theorem. An irreducible Markov chain has a unique stationary distribution.

It turns out that the stationary probability π_i can be understood as the average residence time in i during a loop from j to j, in the following sense:

Theorem. Let X_n be an irreducible Markov chain with stationary distribution π. Then for each i ∈ J,

    π_i = E_j(Σ_{n=0}^{τ_j − 1} 1_{X_n = i}) / E_j(τ_j),        (9)

where j ∈ J is arbitrary and τ_j = inf{n > 0 : X_n = j} is the hitting time of j.

The theorem can be understood as follows. The denominator of the RHS of (9) is simply the normalization required to obtain a probability vector, so consider the numerator. It counts the (average) number of times we are at i in a loop from j to j, excluding the last step of the loop.
Multiplying by P represents evolving X_n one step. This corresponds to counting the (average) number of times we are at i in a loop from j to j, excluding the first step of the loop. But at the first and last steps of the loop we are at j, so we get the same thing!
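Formula (9) can be checked by direct simulation: run many loops from j back to j, count visits to each state excluding each loop's last step, and compare with π computed from the left eigenvector. A sketch with an assumed small chain (not the P of (7)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical irreducible, aperiodic chain on J = {0, 1, 2}; not the P of (7).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Exact stationary distribution from the left eigenvector at lambda = 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()

# Monte Carlo estimate of the RHS of (9): simulate loops from j to j,
# counting visits to each state but excluding the final return to j.
j, n_loops = 0, 10000
visits = np.zeros(3)
total_len = 0
for _ in range(n_loops):
    state = j
    while True:
        visits[state] += 1          # count X_0, ..., X_{tau_j - 1}
        total_len += 1              # total_len accumulates tau_j per loop
        state = rng.choice(3, p=P[state])
        if state == j:              # returned to j: loop ends
            break
pi_hat = visits / total_len
print(pi, pi_hat)
```

For this chain the exact answer is π = (12/49, 23/49, 14/49), and the loop estimate π̂ agrees to about two decimal places.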

Exercise. Prove the preceding theorem. Then use it to show that π_i = 1/E_i(τ_i).

If the distribution of X_n converges, it must converge to a stationary distribution:

Theorem. Suppose X_n is a Markov chain with initial distribution ρ such that

    π_j = lim_{n→∞} P_ρ(X_n = j)        (10)

exists for each j ∈ J. Then π is a stationary distribution of X_n.

Note that P_ρ(X_n = j) = (ρP^n)_j, and so (10) shows that lim_{n→∞} ρP^n = π. From this it is easy to see that πP = π. This shows that if a Markov chain converges to some distribution as in (10), then that distribution must be stationary. As we have seen above, every Markov chain (on a finite state space) has a stationary distribution; however, not all Markov chains converge to this distribution.

Exercise. Give an example of a periodic Markov chain for which (10) does not hold. Give an example of a non-irreducible Markov chain for which (10) holds with two different π's, depending on the choice of ρ.

The following theorem gives conditions under which a Markov chain converges to a unique stationary distribution.

Theorem 2.25. Let X_n be aperiodic and irreducible, and let π be its stationary distribution. Then for any initial distribution ρ and any j ∈ J,

    π_j = lim_{n→∞} P_ρ(X_n = j).

The theorem states that X_n converges in distribution to π. One proof comes from Perron-Frobenius (Theorem 2.12). Let r < 1 be an upper bound for the magnitude of the second-largest eigenvalue. From the Perron-Frobenius theorem, it can be shown that

    P^n = 1π + O(r^n),

where 1 is the all-ones column vector and π is a row vector. Then

    P_ρ(X_n = j) = (ρP^n)_j = (ρ1π + O(r^n))_j = (π + O(r^n))_j → π_j    as n → ∞.

Another proof of Theorem 2.25 comes from the following coupling argument. Let X_n and Y_n be two independent copies of the same Markov chain. Let X_0 have PMF ρ and Y_0 have PMF π. Define τ = inf{n ≥ 0 : X_n = Y_n}. It can be checked that aperiodicity and irreducibility imply P(τ < ∞) = 1. But since π is stationary and Y_0 has distribution π, we know Y_n has distribution π for all n, including n = τ.
But this means X_τ has distribution π, and hence X_n has distribution π for n ≥ τ! (Note that this argument also shows uniqueness of π.)

Exercise. Prove Theorem 2.25 by filling in the details of the above proof sketch.
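The geometric convergence P^n = 1π + O(r^n) is easy to observe numerically. The following sketch (with a stand-in chain, not the P of (7)) compares the rows of P^25 with π:

```python
import numpy as np

# Hypothetical aperiodic, irreducible chain; not the P of (7).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Stationary distribution via the left eigenvector at lambda = 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()

# The second-largest eigenvalue magnitude r < 1 controls the rate:
# each row of P^n approaches pi like O(r^n).
r = sorted(np.abs(eigvals))[-2]

# After 25 steps the rows of P^n are numerically indistinguishable from pi.
Pn = np.linalg.matrix_power(P, 25)
err = np.max(np.abs(Pn - np.outer(np.ones(3), pi)))
print(r, err)
```

Since P^n converges to the rank-one matrix 1π, the starting distribution ρ is forgotten: ρP^n → π for every PMF ρ.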

2.8 Ergodic theorem

The ergodic theorem, Theorem 2.28 below, describes a different type of convergence of X_n. While the theorems above involve the convergence of the distribution of X_n to π, the ergodic theorem describes the convergence of time averages of single paths of X_n, i.e. trajectories of X_n corresponding to a single outcome ω. This is a very different type of convergence from the ones discussed above; for example, convergence of X_n to π in distribution cannot be understood in terms of single paths.

Let f : J → R be an arbitrary function on the state space and write f_i := f(i).

Proposition 2.27. Let X_n be irreducible and π its stationary distribution. Then

    Σ_{i∈J} f_i π_i = E_j(Σ_{n=0}^{τ_j − 1} f(X_n)) / E_j(τ_j).

The proposition can be proved in the same way as (9). Now we are ready for the ergodic theorem.

Theorem 2.28. Let X_n be irreducible with stationary distribution π. Then

    (1/n) Σ_{m=0}^{n−1} f(X_m) → Σ_{i∈J} π_i f_i    almost surely as n → ∞.

It is common to understand the ergodic theorem as saying that the time average over a path equals the spatial average with respect to the stationary distribution π.

We only sketch a proof. Let σ_k be the kth time that X_n visits j, and define

    S_k = Σ_{n=σ_k + 1}^{σ_{k+1}} f(X_n),    T_k = σ_{k+1} − σ_k.

By the strong Markov property, the S_k are iid, and so are the T_k. We may approximate

    (1/n) Σ_{m=0}^{n−1} f(X_m)    by    (S_1 + ... + S_k) / (T_1 + ... + T_k)

for k large. By the law of large numbers,

    (S_1 + ... + S_k) / (T_1 + ... + T_k) → E(S_1)/E(T_1)    almost surely.

But by Proposition 2.27, E(S_1)/E(T_1) = Σ_{i∈J} π_i f_i.

Exercise. Fill in the details of the above sketch to prove Theorem 2.28.
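The ergodic theorem can be illustrated by comparing a single-path time average with the space average Σ_i π_i f_i. A simulation sketch with an assumed chain and test function f (both hypothetical, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical irreducible chain (not the P of (7)) and a test function f.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
f = np.array([1.0, -2.0, 5.0])

# Space average sum_i pi_i f_i, with pi from the left eigenvector.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()
space_avg = float(pi @ f)

# Time average (1/n) sum_{m < n} f(X_m) along a single path.
n = 100_000
state, total = 0, 0.0
for _ in range(n):
    total += f[state]
    state = rng.choice(3, p=P[state])
time_avg = total / n
print(space_avg, time_avg)   # the two averages agree closely
```

Note that the agreement holds for a single trajectory; no averaging over independent runs is needed, which is exactly the content of the theorem.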

2.9 Left and right actions

Let P be the transition matrix of a Markov chain X_n on a finite state space J, let µ be a PMF on J, and let f : J → R be a function. It is often convenient to view the left and right matrix multiplications µP and Pf as transformations on measures and functions, respectively, in the following sense:

    (µP^n)_j = jth entry of the PMF µP^n := P_µ(X_n = j),
    (P^n f)(j) = value of the function P^n f at j := E_j(f(X_n)).

Thus, µP^n is a PMF on J, corresponding to taking n steps of the Markov chain started at µ, and P^n f is a function on J, corresponding to evaluating f after n steps. The left and right actions can be combined as follows:

    µP^n f = E_µ(f(X_n)).

2.10 Infinite state space

The case where J is countably infinite is easy to handle, so we consider it first. A Markov chain X_n on state space J is still defined exactly as in Definition 2.1. For the convergence theorems above to hold, we need the additional assumption that for some state j, the expected return time to j is finite, that is, E_j[τ_j] < ∞ with τ_j = inf{n > 0 : X_n = j}. This is called positive recurrence. Then X_n has the stationary distribution π defined by the formula in (9) (and the proof is the same). In this case, irreducibility implies j is reached in finite time, and the coupling argument below Theorem 2.25 gives convergence of X_n to π in distribution. Moreover, the arguments in the proof of the ergodic theorem go through with little modification.

Now consider the case where J is uncountably infinite.

Definition. X_n is a Markov chain on an uncountable state space J if

    P(X_{n+1} ∈ A | X_0 = i_0, ..., X_{n−1} = i_{n−1}, X_n = i) = P(X_{n+1} ∈ A | X_n = i)        (11)

for each A ⊆ J.

We will usually assume J is R^n or a subset of R^n. In this case, the transition matrix P must be replaced with a transition kernel, which we will assume has a density:

    P(X_{n+1} ∈ A | X_n = x) = ∫_A p(x, y) dy.

Think of p as a matrix with entries indexed by the reals. (What properties should p have?) For later reference we record this notation.
Definition. X_n has transition density p if p : J × J → R satisfies

    P(X_{n+1} ∈ A | X_n = x) = ∫_A p(x, y) dy

for all x ∈ J and A ⊆ J.
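The definition can be illustrated with a concrete kernel. The sketch below uses a Laplace step density p(x, y) = (1/2) e^{−|y−x|}, chosen purely for illustration; it checks numerically that p(x, ·) integrates to 1 in y for each fixed x, and samples one step of the corresponding chain.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical transition density on J = R, chosen for illustration only:
# a Laplace step kernel, p(x, y) = 0.5 * exp(-|y - x|).
def p(x, y):
    return 0.5 * np.exp(-np.abs(y - x))

# For each fixed x, y -> p(x, y) must be a probability density:
# nonnegative and integrating to 1 over J.  Check by trapezoidal quadrature.
y = np.linspace(-30.0, 30.0, 60001)
dy = y[1] - y[0]
for x in (-2.0, 0.0, 1.5):
    vals = p(x, y)
    mass = float(np.sum((vals[1:] + vals[:-1]) / 2) * dy)
    assert abs(mass - 1.0) < 1e-4

# One step of the chain from X_n = x samples y from p(x, .);
# for this kernel that is x plus Laplace(0, 1) noise.
x = 0.7
steps = x + rng.laplace(0.0, 1.0, size=100_000)
print(steps.mean(), steps.var())   # close to x, and to 2 (Laplace variance)
```

Replacing the Laplace kernel by a Gaussian gives the chain of the exercise below.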

Using the transition density, all the basics above are easily translated to the current setup: essentially, sums are replaced with integrals. One crucial difference is that the probability that X_n equals any single point is zero. This affects how we can prove convergence to equilibrium, as we discuss below.

Exercise. Let ξ_i, i = 1, 2, ..., be iid standard normal random variables, and let X_0 = 0 and X_n = ξ_1 + ... + ξ_n for n ≥ 1. Find the transition density of X_n.

Definition. Let X_n have transition density p. If π : J → R satisfies

    π(y) = ∫_J π(x) p(x, y) dx    for every y ∈ J,

we say π is a stationary density for X_n.

In uncountable state space, positive recurrence does not make sense, since in general we never hit single points. In fact, if X_n has a transition density, then P_x(X_n = y) = 0 for all x, y whenever n ≥ 1! But the arguments above can still be modified under some extra assumptions. Typically, one assumes there is some set A ⊆ J that can be treated like a point; such an A is called a small set. A typical assumption (called a Doeblin condition) is that there are a probability measure µ on A and a constant c > 0 such that (i) X_n reaches A in finite expected time from every point, and (ii) for each i ∈ A and B ⊆ A, P_i(X_1 ∈ B) ≥ cµ(B). Intuitively, after reaching A, in the next step we are distributed according to µ on A with probability at least c. Being distributed according to µ on A can now be treated as a point j, and all the above arguments based on looking at loops from j to j still hold.

References

[1] R. Durrett, The essentials of probability, Duxbury Press.
[2] R. Durrett, Probability: theory and examples, Duxbury Press.
[3] C. Geyer, Practical Markov chain Monte Carlo.
[4] J.R. Norris, Markov chains, Cambridge Series in Statistical and Probabilistic Mathematics.
[5] E. Seneta, Non-negative matrices and Markov chains, Springer Series in Statistics.
[6] R.J. Baxter, Exactly solved models in statistical mechanics, Academic Press.


More information

MARKOV CHAINS: STATIONARY DISTRIBUTIONS AND FUNCTIONS ON STATE SPACES. Contents

MARKOV CHAINS: STATIONARY DISTRIBUTIONS AND FUNCTIONS ON STATE SPACES. Contents MARKOV CHAINS: STATIONARY DISTRIBUTIONS AND FUNCTIONS ON STATE SPACES JAMES READY Abstract. In this paper, we rst introduce the concepts of Markov Chains and their stationary distributions. We then discuss

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES Contents 1. Continuous random variables 2. Examples 3. Expected values 4. Joint distributions

More information

We introduce methods that are useful in:

We introduce methods that are useful in: Instructor: Shengyu Zhang Content Derived Distributions Covariance and Correlation Conditional Expectation and Variance Revisited Transforms Sum of a Random Number of Independent Random Variables more

More information

18.175: Lecture 8 Weak laws and moment-generating/characteristic functions

18.175: Lecture 8 Weak laws and moment-generating/characteristic functions 18.175: Lecture 8 Weak laws and moment-generating/characteristic functions Scott Sheffield MIT 18.175 Lecture 8 1 Outline Moment generating functions Weak law of large numbers: Markov/Chebyshev approach

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Positive and null recurrent-branching Process

Positive and null recurrent-branching Process December 15, 2011 In last discussion we studied the transience and recurrence of Markov chains There are 2 other closely related issues about Markov chains that we address Is there an invariant distribution?

More information

Stochastic Processes

Stochastic Processes qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 34 To start out the course, we need to know something about statistics and This is only an introduction; for a fuller understanding, you would

More information

M378K In-Class Assignment #1

M378K In-Class Assignment #1 The following problems are a review of M6K. M7K In-Class Assignment # Problem.. Complete the definition of mutual exclusivity of events below: Events A, B Ω are said to be mutually exclusive if A B =.

More information

Chapter 3: Random Variables 1

Chapter 3: Random Variables 1 Chapter 3: Random Variables 1 Yunghsiang S. Han Graduate Institute of Communication Engineering, National Taipei University Taiwan E-mail: yshan@mail.ntpu.edu.tw 1 Modified from the lecture notes by Prof.

More information

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures 36-752 Spring 2014 Advanced Probability Overview Lecture Notes Set 1: Course Overview, σ-fields, and Measures Instructor: Jing Lei Associated reading: Sec 1.1-1.4 of Ash and Doléans-Dade; Sec 1.1 and A.1

More information

Recap of Basic Probability Theory

Recap of Basic Probability Theory 02407 Stochastic Processes? Recap of Basic Probability Theory Uffe Høgsbro Thygesen Informatics and Mathematical Modelling Technical University of Denmark 2800 Kgs. Lyngby Denmark Email: uht@imm.dtu.dk

More information

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27 Probability Review Yutian Li Stanford University January 18, 2018 Yutian Li (Stanford University) Probability Review January 18, 2018 1 / 27 Outline 1 Elements of probability 2 Random variables 3 Multiple

More information

µ n 1 (v )z n P (v, )

µ n 1 (v )z n P (v, ) Plan More Examples (Countable-state case). Questions 1. Extended Examples 2. Ideas and Results Next Time: General-state Markov Chains Homework 4 typo Unless otherwise noted, let X be an irreducible, aperiodic

More information

Lecture 7. µ(x)f(x). When µ is a probability measure, we say µ is a stationary distribution.

Lecture 7. µ(x)f(x). When µ is a probability measure, we say µ is a stationary distribution. Lecture 7 1 Stationary measures of a Markov chain We now study the long time behavior of a Markov Chain: in particular, the existence and uniqueness of stationary measures, and the convergence of the distribution

More information

Review of Probability. CS1538: Introduction to Simulations

Review of Probability. CS1538: Introduction to Simulations Review of Probability CS1538: Introduction to Simulations Probability and Statistics in Simulation Why do we need probability and statistics in simulation? Needed to validate the simulation model Needed

More information

Math 6810 (Probability) Fall Lecture notes

Math 6810 (Probability) Fall Lecture notes Math 6810 (Probability) Fall 2012 Lecture notes Pieter Allaart University of North Texas September 23, 2012 2 Text: Introduction to Stochastic Calculus with Applications, by Fima C. Klebaner (3rd edition),

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

P i [B k ] = lim. n=1 p(n) ii <. n=1. V i :=

P i [B k ] = lim. n=1 p(n) ii <. n=1. V i := 2.7. Recurrence and transience Consider a Markov chain {X n : n N 0 } on state space E with transition matrix P. Definition 2.7.1. A state i E is called recurrent if P i [X n = i for infinitely many n]

More information

Midterm 2 Review. CS70 Summer Lecture 6D. David Dinh 28 July UC Berkeley

Midterm 2 Review. CS70 Summer Lecture 6D. David Dinh 28 July UC Berkeley Midterm 2 Review CS70 Summer 2016 - Lecture 6D David Dinh 28 July 2016 UC Berkeley Midterm 2: Format 8 questions, 190 points, 110 minutes (same as MT1). Two pages (one double-sided sheet) of handwritten

More information

MATH Notebook 5 Fall 2018/2019

MATH Notebook 5 Fall 2018/2019 MATH442601 2 Notebook 5 Fall 2018/2019 prepared by Professor Jenny Baglivo c Copyright 2004-2019 by Jenny A. Baglivo. All Rights Reserved. 5 MATH442601 2 Notebook 5 3 5.1 Sequences of IID Random Variables.............................

More information

2.1 Elementary probability; random sampling

2.1 Elementary probability; random sampling Chapter 2 Probability Theory Chapter 2 outlines the probability theory necessary to understand this text. It is meant as a refresher for students who need review and as a reference for concepts and theorems

More information

1: PROBABILITY REVIEW

1: PROBABILITY REVIEW 1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

Disjointness and Additivity

Disjointness and Additivity Midterm 2: Format Midterm 2 Review CS70 Summer 2016 - Lecture 6D David Dinh 28 July 2016 UC Berkeley 8 questions, 190 points, 110 minutes (same as MT1). Two pages (one double-sided sheet) of handwritten

More information

STA 711: Probability & Measure Theory Robert L. Wolpert

STA 711: Probability & Measure Theory Robert L. Wolpert STA 711: Probability & Measure Theory Robert L. Wolpert 6 Independence 6.1 Independent Events A collection of events {A i } F in a probability space (Ω,F,P) is called independent if P[ i I A i ] = P[A

More information

Recap of Basic Probability Theory

Recap of Basic Probability Theory 02407 Stochastic Processes Recap of Basic Probability Theory Uffe Høgsbro Thygesen Informatics and Mathematical Modelling Technical University of Denmark 2800 Kgs. Lyngby Denmark Email: uht@imm.dtu.dk

More information

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y.

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y. CS450 Final Review Problems Fall 08 Solutions or worked answers provided Problems -6 are based on the midterm review Identical problems are marked recap] Please consult previous recitations and textbook

More information

Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality

Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality Discrete Structures II (Summer 2018) Rutgers University Instructor: Abhishek Bhrushundi

More information

Ergodic Properties of Markov Processes

Ergodic Properties of Markov Processes Ergodic Properties of Markov Processes March 9, 2006 Martin Hairer Lecture given at The University of Warwick in Spring 2006 1 Introduction Markov processes describe the time-evolution of random systems

More information