Lectures for APM 541: Stochastic Modeling in Biology. Jay Taylor


November 3, 2011

Contents

1 Distributions, Expectations, and Random Variables
    Probability Spaces
    Conditional Probabilities
    Discrete Random Variables
    Continuous Random Variables
    Multivariate Distributions
    Sums of Independent Random Variables
2 Approximation and Limit Theorems in Probability
    Convergent Sequences and Approximation
    Modes of Convergence of Random Variables
    Laws of Large Numbers
    The Central Limit Theorem
    The Law of Rare Events
3 Random Number Generation
    Pseudorandom Number Generators
    The Inversion Method
    Rejection Sampling
    Simulating Discrete Random Variables
4 Discrete-time Markov Chains
    Definitions and Properties
    Asymptotic Behavior of Markov Chains
    Class Structure
    Hitting Times and Absorption Probabilities
    Stationary Distributions

5 Biological Applications of Markov Chains
    The Wright-Fisher Model and its Relatives
    Cannings Models
    Galton-Watson Processes
    Chain Epidemic Models
    Epidemics with Household and Community Transmission
6 Continuous-time Markov Chains
    Definitions and Properties
    Kolmogorov Equations
    Gillespie's Algorithm and the Jump Chain
    Stationary Distributions
    Time Reversal
    Poisson Processes and Measures
7 Diffusions and Stochastic Calculus
    Brownian Motion
    The Invariance Principle
    Diffusion Approximations for CTMCs via the Heat Equation
    Properties of standard Brownian motion
    Diffusion Processes
    Diffusion Approximations
    Technical Interlude: Generators and Martingales
    Martingales

Chapter 1

Distributions, Expectations, and Random Variables

1.1 Probability Spaces

We can think about probability in two ways:

Frequentist interpretation: The probability of an event is the limiting frequency with which the event occurs when we conduct an infinite series of identical but independent trials.

Subjective interpretation: The probability of an event measures the strength of our subjective belief that the event will occur in one trial.

There has been much argument, especially amongst statisticians, as to which of these interpretations is correct. I tend to take a fairly pragmatic view of things and switch between these perspectives as best suits the problem that I am working on. However, we can at least write down a formal (and pretty much universally accepted) mathematical definition of probability.

Definition 1.1. A probability space is a triple (Ω, F, P) where:

Ω is the sample space, i.e., the set of all possible outcomes.

F is a collection of subsets of Ω which we call events. F is called a σ-algebra and is required to satisfy the following conditions:

1. The empty set and the sample space are both events: ∅, Ω ∈ F.
2. If E is an event, then its complement E^c = Ω \ E is also an event.
3. If E_1, E_2, ... are events, then their union ∪_n E_n is an event.

P is a function from F into [0, 1]: if E is an event, then P(E) is the probability of E. P is said to be a probability distribution or probability measure on F and is also required to satisfy several conditions:

1. P(∅) = 0; P(Ω) = 1.
2. Countable additivity: If E_1, E_2, ... are mutually exclusive events, i.e., E_i ∩ E_j = ∅ whenever i ≠ j, then

    P(∪_{n=1}^∞ E_n) = Σ_{n=1}^∞ P(E_n).

If you want to read or publish articles on mathematical probability and statistics, then you will need to come to grips with this definition. David Williams' little book, Probability with Martingales, provides an excellent introduction to this theory. In this course, we will usually be very informal and ignore the role played by the σ-algebra F. However, the properties described in the third part of the definition are both useful and intuitive:

P(∅) = 0 means that the probability that nothing (whatsoever) happens is zero.

P(Ω) = 1 means that the probability that something (whatever it is) happens is one.

If E_1 and E_2 are mutually exclusive events, then E_1 ∪ E_2 is the event that either E_1 or E_2 happens, and the probability of that is just the sum of the probability that E_1 happens and the probability that E_2 happens:

    P(E_1 ∪ E_2) = P(E_1) + P(E_2).

Countable additivity says that this property holds when we have a countable collection of disjoint events.

The following lemma lists some other useful properties that can be deduced from Definition 1.1.

Lemma 1.1. The following properties hold for any two events A, B in a probability space:

1. P(A^c) = 1 − P(A).
2. If A and B are mutually exclusive, then P(A ∩ B) = 0.
3. For any two events A and B (not necessarily mutually exclusive), we have:

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Exercise 1.1. Prove Lemma 1.1.

1.2 Conditional Probabilities

It is often the case that we have some partial information about the outcome of an experiment or the state of an unknown system. Our next definition shows how we should modify our beliefs about the unobserved outcome given this additional information:

Definition 1.2. Suppose that A and B are events and that P(B) > 0.
Then the conditional probability that A occurs given that B occurs is

    P(A | B) = P(A ∩ B) / P(B).

In frequentist terms, we can think of the conditional probability P(A | B) as the fraction of trials resulting in both A and B divided by the fraction of trials resulting in B. In general, P(A | B) ≠ P(A), in which case we say that B contains some information about A, i.e., knowing whether B does or does not occur gives us some information about whether A does or does not occur. On the other hand, if P(A | B) = P(A), then B gives us no information about A. This important scenario motivates the next definition.
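The frequentist reading of this definition also suggests a quick simulation check. The sketch below (in Python; the dice events are a made-up example of ours, not from the text) estimates P(A | B) as the fraction of trials in which both A and B occur divided by the fraction in which B occurs:

```python
import random

def estimate_conditional(n_trials=100_000, seed=1):
    """Monte Carlo estimate of P(A | B) as (# trials with A and B) / (# trials with B),
    where (hypothetically) A = {two dice sum to 8} and B = {the first die is even}."""
    rng = random.Random(seed)
    n_b = n_ab = 0
    for _ in range(n_trials):
        d1, d2 = rng.randint(1, 6), rng.randint(1, 6)
        if d1 % 2 == 0:          # event B occurred
            n_b += 1
            if d1 + d2 == 8:     # event A occurred as well
                n_ab += 1
    return n_ab / n_b

# Exact answer for comparison: given d1 in {2, 4, 6}, d2 must be 6, 4, 2,
# so P(A | B) = 3/18 = 1/6
print(estimate_conditional())
```

The estimate should fluctuate around 1/6, with the fluctuations shrinking as the number of trials grows.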

Definition 1.3. (Independent Events)

1. Two events A and B are said to be independent if P(A ∩ B) = P(A) P(B).
2. A countable collection of events E_1, E_2, ... is said to be independent if for every finite subcollection E_{i_1}, ..., E_{i_n} we have

    P(E_{i_1} ∩ ... ∩ E_{i_n}) = P(E_{i_1}) ... P(E_{i_n}).

Example 1.1. Three events A, B, and C are independent if all of the following identities hold:

    P(A ∩ B ∩ C) = P(A) P(B) P(C)
    P(A ∩ B) = P(A) P(B)
    P(A ∩ C) = P(A) P(C)
    P(B ∩ C) = P(B) P(C)

Theorem 1.1. If A and B are independent and P(B) > 0, then

    P(A | B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A).

In other words, if A and B are independent, then, as we would expect, B gives us no information about A.

Notice that the expression for conditional probability stated in Definition 1.2 can be rearranged to give

    P(A ∩ B) = P(A | B) P(B),

i.e., the probability that both A and B occur is equal to the conditional probability that A occurs given that B occurs times the probability that B occurs. Notice that, by symmetry, we also have

    P(A ∩ B) = P(B | A) P(A).

Although elementary, these simple algebraic manipulations lead to two of the most useful formulas in probability. The Law of Total Probability is important because it can often be used to compute the probability of a complicated event by conditioning on additional information. We will see many examples of this procedure throughout the course. Bayes' formula is important, of course, because it forms the foundation of Bayesian statistics, which we will also discuss at length in this course.

Theorem 1.2. (Law of Total Probability) If A is an event and B_1, ..., B_n is a collection of disjoint events such that A ⊂ B_1 ∪ ... ∪ B_n, then

    P(A) = P(A ∩ B_1) + ... + P(A ∩ B_n)
         = P(A | B_1) P(B_1) + ... + P(A | B_n) P(B_n).

Theorem 1.3. (Bayes' formula) If A and B are events with P(A) > 0 and P(B) > 0, then

    P(A | B) = P(B | A) P(A) / P(B).
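The two theorems work together in the classic diagnostic-testing calculation: the Law of Total Probability expands the denominator P(B) in Bayes' formula. A minimal sketch, with hypothetical numbers chosen only for illustration:

```python
def bayes_posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' formula; the denominator
    P(positive) is expanded with the Law of Total Probability over the
    disjoint events {disease} and {no disease}."""
    p_positive = (sensitivity * prior
                  + (1.0 - specificity) * (1.0 - prior))  # Theorem 1.2
    return sensitivity * prior / p_positive               # Theorem 1.3

# Hypothetical numbers: 1% prevalence, 99% sensitivity, 95% specificity.
# The posterior works out to 1/6: most positive tests are false positives.
print(bayes_posterior(0.01, 0.99, 0.95))
```

The surprisingly small posterior illustrates why conditioning on the right event matters when the prior probability is low.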

1.3 Discrete Random Variables

In practice, we are often unable to directly observe the state of the systems that we study in biology and instead must make do with indirect information provided by experiments. One way to model this situation mathematically is by identifying the probability space (Ω, F, P) with the true but unknown state of the system of interest and then introducing random variables that represent the outcomes of the experiments that we perform on that system. For example, if we perform just one experiment and if the set of possible outcomes is denoted E, then we would define a random variable X which is a function from Ω into E. Thus, if the state of the system is ω, then the result of our experiment will be the value X(ω).

To be more concrete, suppose that we choose a saguaro cactus at random from Picacho Peak State Park and we then measure its height. In this case, the probability space could encode all of the processes (e.g., climatic and ecological) influencing the heights of the saguaros in the park, as well as those influencing our sampling of an individual, while the random variable X will denote just the height of that individual, which will be some value in the set E = [0, ∞).

Remark 1.1. As promised, we are skirting over many formalities that are important if we want to prove theorems about random variables. In particular, to define random variables rigorously, we need to attach some additional structure to the set E and then require that the function X is measurable. For our purposes we can ignore these technical issues, but see Chapter 3 in Williams (1991) for the details.

Definition 1.4. Suppose that (Ω, F, P) is a probability space and that X is a random variable that takes values in the set E. Then the distribution of X is the probability distribution µ defined on E by the formula

    µ(A) ≡ P(X ∈ A) ≡ P({ω ∈ Ω : X(ω) ∈ A}).
Here A is a subset of E, i.e., A is a collection of possible outcomes for our experiment, whereas the set {ω ∈ Ω : X(ω) ∈ A} is a subset of Ω.

Remark 1.2. Much of the time we will simply ignore the underlying probability space and restrict our attention to the distributions of the random variables defined on that space. In particular, we will usually just write X rather than X(ω), even when we have a particular value of X in mind. On the other hand, we will often be content to use the notation P(X ∈ A) rather than explicitly introduce the probability measure µ as we did in Definition 1.4. With practice, this shorthand will become very natural.

Discrete random variables provide an important special case of these concepts.

Definition 1.5.

1. A random variable X is said to be discrete if it takes values in a set E that is either finite or contains countably infinitely many points (e.g., the integers).
2. If E = {x_1, x_2, ...}, then the probability mass function of X is the function p : E → [0, 1] defined by the formula

    p(x_i) = P(X = x_i).

The probability mass function of a discrete random variable completely determines its distribution:

    P(X ∈ A) = Σ_{x_i ∈ A} p(x_i).

In other words, to calculate the probability that X takes a value in a set A ⊂ E, we simply need to sum the probability mass function of X over all of the points that belong to A. Notice that this implies that

    Σ_{x_i ∈ E} p(x_i) = P(X ∈ E) = 1,

since E is defined to be the set of all possible values that X can take.

Definition 1.6. If X is a discrete random variable that takes values in a subset of the real numbers, then the expected value of X is defined to be the weighted average of these values

    EX = Σ_i x_i p(x_i).

Remark 1.3. The expected value of a random variable is also called its expectation or its mean and is sometimes written as E[X] for clarity. In some respects, the name expected value is misleading, since EX could well be a value that X never takes. For example, if E = {0, 1} and P(X = 0) = P(X = 1) = 1/2, then

    EX = (1/2)·0 + (1/2)·1 = 1/2,

even though X is never equal to 1/2.

An important property of expectations is that they are linear:

Theorem 1.4. (Linearity) Suppose that X and Y are discrete random variables and that a and b are real numbers. Then

    E[aX + bY] = a EX + b EY.

The next theorem describes another important property of expectations that is sometimes incorrectly stated as a definition, hence the tongue-in-cheek name:

Theorem 1.5. (The Law of the Unconscious Statistician) If X is a discrete random variable with values in a set E and f : E → R is a real-valued function, then f(X) is a discrete random variable and

    E[f(X)] = Σ_i f(x_i) p(x_i).

Exercise 1.2. If you want a challenge, try to prove Theorems 1.4 and 1.5.

Definition 1.7. The variance of a discrete real-valued random variable X is defined as

    Var(X) = E[(X − EX)^2].

Exercise 1.3. Use Theorems 1.4 and 1.5 to show that Var(X) = E[X^2] − (EX)^2.
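Definition 1.6 and Exercise 1.3 together give a recipe for computing means and variances directly from a probability mass function. A minimal sketch (the dictionary representation of a pmf is our own convention, not the text's):

```python
def mean_var(pmf):
    """Mean and variance of a discrete random variable whose pmf is given
    as a dict {value: probability}; uses Var(X) = E[X^2] - (EX)^2 from
    Exercise 1.3."""
    ex = sum(p * x for x, p in pmf.items())
    ex2 = sum(p * x * x for x, p in pmf.items())
    return ex, ex2 - ex * ex

# The coin-flip variable of Remark 1.3: EX = 1/2 although X never equals 1/2
print(mean_var({0: 0.5, 1: 0.5}))  # (0.5, 0.25)
```

Note how the variance formula of Exercise 1.3 avoids a second pass over the support to subtract the mean from each value.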

The next four examples describe some of the more important discrete distributions that we will encounter this semester. In each case, E will denote the set of possible values of the random variable and p(x) will denote its probability mass function.

Example 1.2. X is said to have a Bernoulli distribution with parameter p if E = {0, 1} and

    P(X = 1) = p;  P(X = 0) = 1 − p.

In this case the mean and variance of X are given by

    EX = p;  Var(X) = p(1 − p).

Bernoulli random variables are the simplest non-constant random variables and are often used to represent the success (1) or failure (0) of a random trial.

Example 1.3. X is said to have a binomial distribution with parameters n and p if X takes values in the set E = {0, 1, ..., n} with probability mass function

    P(X = k) = (n choose k) p^k (1 − p)^{n−k}.

Recall that the binomial coefficient that appears in this definition is equal to

    (n choose k) = n! / (k!(n − k)!),

where n! = n(n − 1)(n − 2) ··· 1, and counts the number of ways of choosing a subset of k objects from a collection of n objects. The mean and variance of X are given by

    EX = np;  Var(X) = np(1 − p).

Binomial distributions often arise when we carry out n independent but identical trials, each having probability p of success, and we count the total number of successes.

Exercise 1.4. Suppose that X_1, ..., X_n are independent, identically-distributed (abbreviated i.i.d.) Bernoulli random variables with parameter p, and let X = X_1 + ... + X_n. Show that X is a binomial random variable with parameters n and p.

Example 1.4. X is said to have a geometric distribution with parameter p if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function

    P(X = k) = (1 − p)^k p.

The mean and variance of X are given by

    EX = (1 − p)/p;  Var(X) = (1 − p)/p^2.

Geometric distributions also arise when we carry out independent but identical trials. Let X_1, X_2, ... be an infinite collection of i.i.d.
Bernoulli random variables, each with parameter p, and define X to be the number of failures that occur until the first success. Then X is a geometric random variable with parameter p.
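Exercise 1.4 can also be checked empirically. The sketch below (the parameter values and seed are arbitrary choices of ours) builds binomial draws as sums of Bernoulli trials and compares the sample mean and variance with np and np(1 − p):

```python
import random

def binomial_from_bernoullis(n, p, rng):
    """One binomial draw built as the sum of n i.i.d. Bernoulli(p) trials,
    as in Exercise 1.4."""
    return sum(1 if rng.random() < p else 0 for _ in range(n))

rng = random.Random(2)
samples = [binomial_from_bernoullis(10, 0.3, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)  # should lie near EX = np = 3.0 and Var(X) = np(1-p) = 2.1
```

Agreement improves as the number of replicate draws grows, a preview of the Law of Large Numbers discussed in the next chapter.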

Example 1.5. X is said to have a Poisson distribution with parameter λ if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function

    P(X = k) = e^{−λ} λ^k / k!.

The mean and variance of X are given by:

    EX = λ;  Var(X) = λ.

Poisson distributions often arise in situations where a large number of independent trials are carried out and the probability of success of any one trial is small. We will discuss this in the next lecture when we consider the Law of Rare Events.

1.4 Continuous Random Variables

Some of the variables that we will be interested in take values in sets that are continuous, e.g., the height (in cm) of a randomly sampled individual could be regarded as a random variable that can assume any value between 0 and 300. In this case, the probability mass function is zero at every point, and we need to describe the distribution of the random variable in a different way.

Definition 1.8. A real-valued random variable X is said to be continuous if there is a non-negative function p(x), called the probability density function of X, such that

    P(X ∈ A) = ∫_A p(x) dx,

where A is a subset of R. In particular, by taking A = R, we see that

    ∫_R p(x) dx = P(X ∈ R) = 1,

i.e., the density must integrate to 1 over the whole real line.

Remark 1.4. If X is any real-valued random variable (not necessarily continuous), then the distribution of X is completely determined by its cumulative distribution function (often abbreviated c.d.f.)

    F(x) = P(X ≤ x).

Notice that F(x) is an increasing function of x, i.e., if x < y, then F(x) ≤ F(y). Also,

    lim_{x → −∞} F(x) = 0;  lim_{x → ∞} F(x) = 1.

If X is also continuous, then the cumulative distribution function F(x) and the density function p(x) are related in the following way:

    F(x) = ∫_{−∞}^x p(y) dy  and  p(x) = F′(x),

i.e., the density is just the derivative of the cumulative distribution function.
Furthermore, we can estimate the density of X at a value x using the approximate formula

    p(x) ≈ P(x − ε < X ≤ x + ε) / (2ε) = (F(x + ε) − F(x − ε)) / (2ε),

where ε > 0 is any small positive number.
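This centered-difference formula is easy to test numerically. The sketch below uses an exponential c.d.f. F(x) = 1 − e^{−λx} with an arbitrarily chosen rate λ = 2 and compares the estimate with the exact density λe^{−λx}:

```python
import math

def exponential_cdf(x, lam=2.0):
    """F(x) = 1 - exp(-lam * x) for x >= 0; the rate lam = 2 is an
    arbitrary illustrative choice."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def density_from_cdf(cdf, x, eps=1e-5):
    """Centered-difference estimate p(x) ~ (F(x + eps) - F(x - eps)) / (2 eps)."""
    return (cdf(x + eps) - cdf(x - eps)) / (2.0 * eps)

approx = density_from_cdf(exponential_cdf, 1.0)
exact = 2.0 * math.exp(-2.0)  # p(x) = lam * exp(-lam * x) at x = 1, lam = 2
print(approx, exact)
```

The error of the centered difference shrinks like ε², so even a modest ε gives several digits of agreement here.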

Remark 1.5. An important distinction between probabilities and probability densities is that whereas the probability of any event is a number between 0 and 1, the probability density p(x) may be greater than one (in fact, it can be infinite).

In general, many of the definitions and results that hold for discrete random variables also hold for continuous random variables, provided that we replace the probability mass function by the probability density function and we replace sums by integrals.

Definition 1.9. If X is a continuous random variable with density p(x), then the expected value of X is the weighted average

    EX = ∫_{−∞}^∞ x p(x) dx.

Also, as in the discrete case, the variance of X is defined to be Var(X) = E[(X − EX)^2].

Theorem 1.6. Suppose that X and Y are continuous random variables and that a and b are real numbers. Then

    E[aX + bY] = a EX + b EY.

Theorem 1.7. (The Law of the Unconscious Statistician) If X is a continuous real-valued random variable and f : R → R is a real-valued function, then

    E[f(X)] = ∫_{−∞}^∞ f(x) p(x) dx.

Some important classes of continuous random variables are described below.

Example 1.6. X is said to be uniformly distributed on the interval [a, b] if it has density

    p(x) = 1/(b − a) if x ∈ [a, b];  0 if x < a or x > b.

In this case, the mean and variance of X are given by

    EX = (a + b)/2;  Var(X) = (b − a)^2 / 12.

In addition, if [a, b] = [0, 1], then X is said to be a standard uniform random variable.

Example 1.7. X is said to have the exponential distribution with rate parameter λ > 0 if it has density

    p(x) = λ e^{−λx} if x ≥ 0;  0 if x < 0.

In this case, the mean and variance of X are given by

    EX = 1/λ;  Var(X) = 1/λ^2,

and so the mean is equal to the reciprocal of the rate. We can also explicitly calculate the cumulative distribution function of X: if t ≥ 0, then

    P(X ≤ t) = ∫_0^t λ e^{−λx} dx = 1 − e^{−λt}.

Exponential random variables are often used to model the times between random events when the rates at which these events occur do not change over time. For example, if we assume that the mutation rate at a particular site in the genome of a species of interest is constant, then the time between successive mutations at that site will be exponentially distributed. This follows from the fact that every exponential distribution is memoryless, i.e.,

    P(X > t + s | X > t) = P(X > t + s) / P(X > t) = e^{−λ(t+s)} / e^{−λt} = e^{−λs} = P(X > s).

In other words, if we think of X as the lifespan of an individual (measured in years, say), then this equation says that the conditional probability that the individual will survive for another s years given that they have already survived for t years is the same as the unconditional probability that they will survive for at least s years. It is as if, upon surviving for t years, the clock governing their lifespan begins anew with the same exponential distribution.

Curiously, this property also characterizes exponential distributions: if X is a random variable and the identity

    P(X > t + s | X > t) = P(X > s)

holds for all real numbers s, t > 0, then it can be shown that X is an exponential random variable. This observation will play a central role in our study of continuous-time Markov chains later in the semester.

Example 1.8. X is said to have the gamma distribution with shape parameter α > 0 and scale parameter θ > 0 if it has density

    p(x) = x^{α−1} e^{−x/θ} / (Γ(α) θ^α) if x ≥ 0;  0 if x < 0,

where Γ is the so-called gamma function, defined by

    Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.

In this case, the mean and variance of X are given by

    EX = αθ;  Var(X) = αθ^2.

If α = 1, then X is an exponential random variable with rate parameter θ^{−1}.
If X_1, ..., X_n are independent exponentially distributed random variables, each with rate parameter λ, then their sum X = X_1 + ... + X_n is a gamma random variable with shape parameter α = n and scale parameter θ = λ^{−1}. Thus, gamma random variables are often used to model the durations of processes that last until a series of independent events has occurred, e.g., the time to oncogenic transformation of a cell in which n independent mutations must occur to compromise regulation of the cell cycle.
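The representation of a gamma variable as a sum of independent exponential waiting times can be checked by simulation. In this sketch (using Python's `random.expovariate`; the parameters and seed are our own choices) the sample mean of the simulated sums is compared with αθ:

```python
import random

def gamma_from_exponentials(n, lam, rng):
    """Sum of n independent Exponential(lam) waiting times: a gamma
    variable with shape alpha = n and scale theta = 1/lam."""
    return sum(rng.expovariate(lam) for _ in range(n))

rng = random.Random(3)
samples = [gamma_from_exponentials(5, 2.0, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
print(mean)  # should lie near EX = alpha * theta = 5 * 0.5 = 2.5
```

The same construction underlies the "time until the n-th event" interpretation used in the oncogenesis example above.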

Example 1.9. X is said to have the normal distribution (also called the Gaussian distribution) with mean µ and variance σ^2 > 0 if it has density:

    p(x) = (1/√(2πσ^2)) e^{−(x−µ)^2 / (2σ^2)}.

If, in addition, µ = 0 and σ^2 = 1, then X is said to be a standard normal random variable. One useful property of normal random variables is that if Z is a standard normal random variable, then the variable X = µ + σZ is normally distributed with mean µ and variance σ^2. Normal distributions are ubiquitous in nature, which is why they are called normal and why so much statistical machinery has been developed under the assumption that the data being analyzed are normally distributed. In the next lecture, we will see that this is at least partly explained by the Central Limit Theorem.

1.5 Multivariate Distributions

Suppose that we have carried out several experiments on a system of interest and that we let X_i be a random variable that represents the outcome of the i-th experiment. In this situation, it may be useful to consider all of the experiments together, which we can do by introducing a single vector-valued random variable

    X = (X_1, ..., X_n)

that takes values in the product space

    E = E_1 × ... × E_n = {(x_1, ..., x_n) : x_i ∈ E_i},

where E_i is the set of all possible outcomes of the i-th experiment. For example, if we are working at a bird banding station, then we might collect three pieces of information on each bird captured by the nets, so that X_1 denotes the sex of the bird with values in the set E_1 = {m, f}, X_2 denotes the body mass of the bird with values in the set E_2 = [0, ∞), and X_3 denotes the number of ectoparasites on the bird with values in the set E_3 = {0, 1, 2, ...}. Although we may be interested in each of these variables in its own right, we are likely to learn much more about the birds in the population by considering the random vector containing all of our data on each individual:

    X = (X_1, X_2, X_3) = (sex, mass, parasite load).
Definition 1.10. Suppose that X_1, ..., X_n are random variables defined on the same probability space that take values in the sets E_1, ..., E_n. Then the joint distribution of X_1, ..., X_n is defined to be the distribution of the random vector X = (X_1, ..., X_n) with values in E = E_1 × ... × E_n, i.e.,

    P(X ∈ A) = P((X_1, ..., X_n) ∈ A),

where A is a subset of E. In this case, the distribution of any one of the variables, say X_i, considered on its own, is said to be the marginal distribution of that variable.

Remark 1.6. Although we can always recover the marginal distributions of a collection of random variables from their joint distribution, it is not possible to deduce the latter from the former unless we are given some additional information. In fact, usually there will be infinitely many

ways of assigning a joint distribution that is compatible with any particular set of marginals. On the other hand, an important case in which the joint distribution of a collection of random variables is uniquely determined by the marginal distributions is when the random variables are independent of one another.

Definition 1.11. (Independent Random Variables)

1. Random variables X_1, ..., X_n are said to be independent if

    P(X_1 ∈ E_1, ..., X_n ∈ E_n) = Π_{i=1}^n P(X_i ∈ E_i)

for all sets E_1, ..., E_n such that the events {X_i ∈ E_i} are well-defined.

2. An infinite collection of random variables is said to be independent if every finite subcollection is independent according to 1.

The next theorem states two useful facts: (i) functions of independent random variables are themselves independent; and (ii) the expected value of a product of independent random variables is equal to the product of the expected values of the individual variables.

Theorem 1.8. Suppose that X_1, ..., X_n are independent real-valued random variables and that f_1, ..., f_n are functions from R to R. Then f_1(X_1), ..., f_n(X_n) are independent real-valued random variables and

    E[Π_{i=1}^n f_i(X_i)] = Π_{i=1}^n E[f_i(X_i)]

whenever the expectations on both sides of the equation are defined.

An important special case is when the random variables X_1, ..., X_n are discrete. In this case, the random vector X = (X_1, ..., X_n) is itself a discrete random variable (since the product space E = E_1 × ... × E_n is countable) and the joint probability mass function of the variables X_1, ..., X_n is defined by the formula

    p(x_1, ..., x_n) = P(X = (x_1, ..., x_n)) = P(X_1 = x_1, ..., X_n = x_n).
If the variables are independent and if we let p_i(x) = P(X_i = x) be the (marginal) probability mass function of X_i, then we can use Definition 1.11 to show that the joint probability mass function is equal to the product of the marginal probability mass functions:

    p(x_1, ..., x_n) = p_1(x_1) ... p_n(x_n).

As hinted at above, we can find the marginal probability mass functions p_i(x_i) from the joint probability mass function p(x) even if the variables are not independent. This process is called marginalization and is given by the following formula:

    p_i(y) = Σ_{x ∈ E : x_i = y} p(x).

Here y is a point in E_i and {x ∈ E : x_i = y} is the set of all vectors x = (x_1, ..., x_n) in E for which the i-th coordinate x_i equals y.
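The marginalization formula takes only a few lines to implement. In the sketch below, the dictionary encoding of the joint pmf and the particular numbers are our own illustrative choices:

```python
def marginal(joint, i):
    """Marginal pmf p_i(y) = sum of p(x) over all x with x[i] = y, for a
    joint pmf stored as {(x_1, ..., x_n): probability}."""
    out = {}
    for x, p in joint.items():
        out[x[i]] = out.get(x[i], 0.0) + p
    return out

# A hypothetical joint pmf of two dependent {0, 1}-valued variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(marginal(joint, 0))  # both marginals are {0: 0.5, 1: 0.5}
# Note p(0, 0) = 0.4 != 0.25 = p_1(0) * p_2(0), so the variables are
# dependent even though each marginal is a fair coin flip.
```

This example also illustrates Remark 1.6: the two fair-coin marginals here are compatible with infinitely many joint distributions.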

One of the most commonly encountered discrete multivariate distributions is the multinomial distribution.

Definition 1.12. Let n ≥ 1 be a positive integer and let (p_1, ..., p_k) be a collection of positive real numbers such that p_1 + ... + p_k = 1. We say that the random vector X = (X_1, ..., X_k) has the multinomial distribution with parameters n and (p_1, ..., p_k) if each of the variables X_i takes values in the set {0, ..., n} and if the joint probability mass function of these variables is given by

    p(n_1, ..., n_k) = (n choose n_1, ..., n_k) p_1^{n_1} ... p_k^{n_k}.

Recall that the multinomial coefficient

    (n choose n_1, ..., n_k) = n! / (n_1! ... n_k!)

is the number of ways of partitioning a collection of n elements into k disjoint subsets such that the first subset contains n_1 elements, the second subset contains n_2 elements, etc. In particular, this coefficient is zero if the sum n_1 + ... + n_k is not equal to n. It follows that the sum of the components of X is equal to n,

    X_1 + ... + X_k = n,

which shows that the variables X_1, ..., X_k are not independent.

Multinomial distributions arise in the following way. Suppose that we conduct n independent but identical trials, that each trial can result in one of k possible outcomes, and that the probability that any one trial results in the i-th outcome is p_i. If we let X_i denote the number of trials that result in the i-th outcome, then (X_1, ..., X_k) has the multinomial distribution with parameters n and (p_1, ..., p_k).

We will also work with continuous multivariate distributions.

Definition 1.13. A random vector X = (X_1, ..., X_n) with values in R^n is said to be continuously distributed if there is a non-negative function p : R^n → [0, ∞) with the property that

    P(X ∈ A) = ∫_A p(x) dx,

where A is a subset of R^n. The function p is said to be the joint probability density function of the variables X_1, ..., X_n.
In this case, each of the variables X_i is individually continuous, and the marginal density function p_i(·) of X_i can be recovered from the joint density function by integration:

    p_i(y) = ∫_{x : x_i = y} p(x) dx.

Here {x : x_i = y} is the set of n-dimensional vectors x = (x_1, ..., x_n) whose i-th coordinate x_i is equal to y. If we know that the variables are independent, then we can show that the joint density function is equal to the product of the marginal density functions,

    p(x_1, ..., x_n) = p_1(x_1) ... p_n(x_n),

and, in fact, this identity also implies independence.

Before we give an example of a continuous multivariate distribution, we need two more definitions.

Definition 1.14. Suppose that X and Y are real-valued random variables (either discrete or continuous). The covariance of X and Y is defined to be

    Cov(X, Y) = E[(X − EX)(Y − EY)].

If Cov(X, Y) = 0, then we say that X and Y are uncorrelated.

Exercise 1.5.

1. Show that Cov(X, Y) = Cov(Y, X).
2. Show that Cov(X, Y) = E[XY] − (EX)(EY).
3. Show that any two independent random variables are uncorrelated.
4. Give a counterexample to show that uncorrelated random variables are not necessarily independent.

If we have more than two random variables, then there are many covariances to be tracked. The next definition describes a convenient way of organizing this information.

Definition 1.15. Suppose that X_1, ..., X_n are real-valued random variables (either discrete or continuous), and let σ_ii = Var(X_i) denote the variance of X_i and σ_ij = Cov(X_i, X_j) denote the covariance of X_i and X_j. Then the variance-covariance matrix of the random vector X = (X_1, ..., X_n) is the n × n matrix Σ with entry σ_ij in the i-th row and j-th column:

    Σ = [ σ_11  σ_12  ...  σ_1n
          σ_21  σ_22  ...  σ_2n
           ...   ...  ...   ...
          σ_n1  σ_n2  ...  σ_nn ].

Finally, we come to the promised example.

Definition 1.16. A continuous random vector X = (X_1, ..., X_n) with values in R^n is said to have the multivariate normal distribution with mean vector µ = (µ_1, ..., µ_n) and n × n variance-covariance matrix Σ if it has joint density function

    p(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (x − µ)^T Σ^{−1} (x − µ) }.

In this formula, |Σ| denotes the determinant of the matrix Σ and Σ^{−1} denotes its matrix inverse, i.e., Σ^{−1} is the unique n × n matrix such that ΣΣ^{−1} = Σ^{−1}Σ = I_n, where I_n is the n × n identity matrix, i.e., all of the diagonal elements of I_n equal 1 and all of the off-diagonal elements equal 0.
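As a numerical companion to Exercise 1.5, the sketch below computes a covariance via the identity Cov(X, Y) = E[XY] − (EX)(EY) from part 2. The particular joint pmf (X uniform on {−1, 0, 1} with Y = X²) is a standard example of dependent but uncorrelated variables, so treat it as a hint for part 4 rather than a full answer:

```python
def covariance(joint):
    """Cov(X, Y) = E[XY] - (EX)(EY) (Exercise 1.5, part 2) for a joint pmf
    stored as {(x, y): probability}."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey

# X uniform on {-1, 0, 1} and Y = X^2: clearly dependent, yet uncorrelated
joint = {(-1, 1): 1/3, (0, 0): 1/3, (1, 1): 1/3}
print(covariance(joint))  # 0.0 up to rounding
```

Here Y is a deterministic function of X, so the two variables are as dependent as possible, yet their covariance vanishes because E[XY] = EX = 0.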

Theorem 1.9. If X = (X_1, ..., X_n) is a multivariate normal random vector with mean vector µ = (µ_1, ..., µ_n) and variance-covariance matrix Σ, and b = (b_1, ..., b_n) ∈ R^n is an n-dimensional vector, then the variable

    Z = b · X = b_1 X_1 + ... + b_n X_n

is a real-valued normal random variable with mean

    b · µ = b_1 µ_1 + ... + b_n µ_n

and variance

    b^T Σ b = Σ_{i=1}^n Σ_{j=1}^n σ_ij b_i b_j.

1.6 Sums of Independent Random Variables

Many problems in applied probability involve sums of independent random variables. For example, in the next chapter, we will review some of the classical limit theorems of probability that arise when large numbers of independent random variables are added together. In preparation, here we will see how we can find the distribution of the sum of two independent random variables. We first consider the discrete case.

Lemma 1.2. Let X and Y be independent integer-valued random variables with probability mass functions p_X(n) and p_Y(n). Then the probability mass function of the variable X + Y is

    p_{X+Y}(n) ≡ P(X + Y = n)
              = P(∪_m {X = m, Y = n − m})          (all of the ways they can sum to n)
              = Σ_{m=−∞}^∞ P(X = m, Y = n − m)     (since these are disjoint events)
              = Σ_{m=−∞}^∞ P(X = m) P(Y = n − m)   (since X, Y are independent)
              = Σ_{m=−∞}^∞ p_X(m) p_Y(n − m)
              ≡ p_X ∗ p_Y(n).

The quantity p_X ∗ p_Y(n) is sometimes called the discrete convolution of p_X with p_Y. Notice that it is commutative, i.e., p_X ∗ p_Y(n) = p_Y ∗ p_X(n).

Example 1.10. Suppose that X and Y are independent Poisson random variables with parameters λ and µ, respectively. Then the probability mass function of X + Y is

    p_{X+Y}(n) = Σ_{m=0}^n e^{−λ} (λ^m / m!) · e^{−µ} (µ^{n−m} / (n − m)!)
               = e^{−(λ+µ)} (1/n!) Σ_{m=0}^n (n! / (m!(n − m)!)) λ^m µ^{n−m}
               = e^{−(λ+µ)} (λ + µ)^n / n!,

where we have used the binomial theorem to pass from the second line to the third. Looking at the result, we see that X + Y is itself a Poisson random variable with parameter λ + µ, and so we

have proved the following important fact: the sum of two independent Poisson random variables X and Y is a Poisson random variable with parameter equal to the sum of the parameters of X and Y.

Exercise 1.6. Let X and Y be independent binomial random variables with parameters (n, p) and (m, p), respectively. Show that X + Y is a binomial random variable with parameters (n + m, p).

There are similar results for sums of independent continuous random variables.

Lemma 1.3. Let X and Y be independent continuous random variables with densities p_X(x) and p_Y(x), respectively. Then X + Y is a continuous random variable with density

    p_{X+Y}(z) = ∫_{−∞}^{∞} p_X(t) p_Y(z − t) dt ≡ p_X * p_Y(z),

and p_X * p_Y(z) is called the convolution of p_X and p_Y.

Although Lemmas 1.2 and 1.3 give explicit expressions for the probability mass function and probability density function of a sum of two independent random variables, the sums and integrals that arise in these expressions can be difficult to evaluate. In some cases, the distribution of the sum can be more easily found by considering either the probability generating function or the moment generating function of the variables.

Definition. Let X be a non-negative integer-valued random variable with distribution p_n = P{X = n}. Then the probability generating function is the function ψ_X : [0, 1] → [0, 1] defined by

    ψ_X(s) ≡ E[s^X] = Σ_{n=0}^{∞} p_n s^n.

The most important property of the probability generating function (p.g.f.) of a random variable is that it completely determines the distribution of that variable, i.e., if X and Y have the same p.g.f., then they have the same distribution. The second most important property is given in the next lemma.

Lemma 1.4. Suppose that X and Y are independent non-negative integer-valued random variables with probability generating functions ψ_X(s) and ψ_Y(s).
Then the probability generating function of the sum X + Y is the product ψ_X(s) ψ_Y(s):

    ψ_{X+Y}(s) ≡ E[s^{X+Y}] = E[s^X s^Y] = E[s^X] E[s^Y] = ψ_X(s) ψ_Y(s).

Example. Let X and Y be the independent Poisson random variables introduced in the example above. We first calculate the p.g.f. of X:

    ψ_X(s) = Σ_{n=0}^{∞} e^{−λ} (λ^n / n!) s^n
           = e^{−λ} Σ_{n=0}^{∞} (λs)^n / n!
           = e^{−λ} e^{λs}
           = e^{λ(s−1)}.
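These identities are easy to confirm numerically. The sketch below (Python with NumPy; λ = 2, µ = 3, the evaluation point s, and the truncation level N are arbitrary choices of ours) checks the closed form e^{λ(s−1)} against the truncated series, and then checks Lemma 1.4 by comparing the p.g.f. of the convolved p.m.f. from Lemma 1.2 with the product ψ_X(s) ψ_Y(s):

```python
import numpy as np
from math import exp, factorial

lam, mu = 2.0, 3.0
N = 60  # truncation point; the neglected Poisson tail mass is negligible here

# p.m.f.s of X ~ Poisson(lam) and Y ~ Poisson(mu), truncated at N terms.
pX = np.array([exp(-lam) * lam**n / factorial(n) for n in range(N)])
pY = np.array([exp(-mu) * mu**n / factorial(n) for n in range(N)])

s = 0.7
powers = s ** np.arange(N)

# Truncated series for the p.g.f. of X versus the closed form e^{lam(s-1)}.
psi_X = float(pX @ powers)
print(psi_X, exp(lam * (s - 1)))

# Lemma 1.4 via Lemma 1.2: the p.g.f. of the discrete convolution of the
# two p.m.f.s equals the product psi_X(s) * psi_Y(s).
psi_Y = float(pY @ powers)
pXY = np.convolve(pX, pY)[:N]  # p_{X+Y}(n) for n < N
psi_XY = float(pXY @ powers)
print(psi_XY, psi_X * psi_Y)
```

The two printed pairs agree to essentially machine precision, since the discarded tail terms are astronomically small at this truncation level.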

A similar calculation shows that ψ_Y(s) = e^{µ(s−1)}, and so the p.g.f. of the sum X + Y is

    ψ_{X+Y}(s) = ψ_X(s) ψ_Y(s) = e^{λ(s−1)} e^{µ(s−1)} = e^{(λ+µ)(s−1)}.

Since this is also the p.g.f. of a Poisson random variable with parameter λ + µ, it follows that this is exactly the distribution of X + Y.

A different kind of generating function is needed to handle real-valued random variables that aren't necessarily integer-valued.

Definition. Let X be a real-valued random variable. Then the moment generating function is the function M_X : R → R defined by

    M_X(t) ≡ E[e^{tX}] = { Σ_x e^{tx} p(x)             if X is discrete with p.m.f. p(x),
                         { ∫_{−∞}^{∞} e^{tx} p(x) dx   if X is continuous with density p(x).

In general, the moment generating function (m.g.f.) of an arbitrary random variable X may be infinite for some values of t. However, if it is defined on an interval containing 0, then it uniquely determines the distribution of the random variable X. In other words, if X and Y have the same m.g.f. and if this function is defined on some interval (−a, b), then we can conclude that X and Y have the same distribution. Furthermore, the following counterpart to Lemma 1.4 holds for the moment generating function of a sum of independent variables.

Lemma 1.5. If X and Y are independent real-valued random variables with moment generating functions M_X(t) and M_Y(t), then the moment generating function of their sum X + Y is the product M_X(t) M_Y(t):

    M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t).

Example. Let X and Y be independent normal random variables with means and variances equal to µ_X and σ_X^2 and µ_Y and σ_Y^2, respectively. A somewhat tedious calculation shows that the m.g.f.s of X and Y are

    M_X(t) = exp{ µ_X t + (1/2) σ_X^2 t^2 }  and  M_Y(t) = exp{ µ_Y t + (1/2) σ_Y^2 t^2 },

and then Lemma 1.5 tells us that the m.g.f. of the sum X + Y is

    M_{X+Y}(t) = M_X(t) M_Y(t) = exp{ (µ_X + µ_Y) t + (1/2) (σ_X^2 + σ_Y^2) t^2 }.
However, since M_{X+Y}(t) is also the m.g.f. of a normal random variable with mean µ_X + µ_Y and variance σ_X^2 + σ_Y^2, it follows that this is the distribution of X + Y, i.e., we have shown that the sum of two independent normally distributed random variables is also normally distributed.
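This fact can be checked by simulation. The sketch below (Python with NumPy; the parameter values, sample size, and evaluation point t are arbitrary choices of ours) simulates the sum of two independent normals and compares its empirical mean, variance, and m.g.f. with the closed forms just derived:

```python
import numpy as np

rng = np.random.default_rng(541)
mu_x, sd_x = 1.0, 2.0
mu_y, sd_y = -0.5, 1.5
n = 1_000_000

X = rng.normal(mu_x, sd_x, n)
Y = rng.normal(mu_y, sd_y, n)
S = X + Y

# The mean and variance of the sum match mu_x + mu_y and sd_x^2 + sd_y^2.
print(S.mean(), S.var())

# Empirical m.g.f. E[e^{tS}] at a small t versus the closed form.
t = 0.3
mgf_emp = np.mean(np.exp(t * S))
mgf_thy = np.exp((mu_x + mu_y) * t + 0.5 * (sd_x**2 + sd_y**2) * t**2)
print(mgf_emp, mgf_thy)
```

Keeping t small matters in practice: the variance of e^{tS} grows quickly with t, so the Monte Carlo estimate of the m.g.f. degrades for large t.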

Chapter 2

Approximation and Limit Theorems in Probability

2.1 Convergent Sequences and Approximation

A recurrent theme in applied mathematics is the approximation of a complicated object, be this a number, a system of differential equations, or a stochastic process, by a simpler one. Often this is done heuristically: for example, we may formulate a less complicated model by simply omitting certain details that we believe (or hope) are unimportant. However, in some settings, approximation can be done more rigorously by working with convergent sequences of objects. Before examining how this applies to random variables and distributions, we first recall what it means for a sequence of real numbers to converge to a limit.

Definition 2.1. A sequence of real numbers x_1, x_2, … is said to converge to a limit x if for every positive real number ɛ > 0, we can find an integer N such that the difference |x_n − x| is less than ɛ whenever n ≥ N. When this is true, we write x_n → x or, more formally,

    x = lim_{n→∞} x_n.

The intuition behind this definition is that as n increases, the terms x_n in a convergent sequence should approach the limit arbitrarily closely and then remain close to that limit.

Example 2.1. If x_n = (n − 1)/n, then the sequence x_1, x_2, … converges to the limit x = 1. Indeed, if ɛ > 0 is a positive real number and we take N to be any positive integer greater than 1/ɛ, then for any integer n ≥ N, we have

    |x − x_n| = |1 − (n − 1)/n| = 1/n ≤ 1/N < ɛ.

Convergent sequences can be used to formulate approximations in two fundamentally different ways. On the one hand, we can sometimes approximate a complicated object x by the terms in a sequence of simpler objects x_n that converge to x. In fact, we do this whenever we use a truncated decimal expansion to approximate an irrational number.

Example 2.2. The number π = 3.14159… can be approximated by the terms in the sequence x_1 = 3, x_2 = 3.1, x_3 = 3.14, etc.
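The ɛ-N bookkeeping in Example 2.1 can be checked mechanically. The fragment below (Python; the tolerance ɛ and the range of n tested are arbitrary choices of ours) verifies that once N exceeds 1/ɛ, every subsequent term lies within ɛ of the limit:

```python
eps = 1e-3
N = int(1 / eps) + 1  # any integer N > 1/eps works

# For x_n = (n - 1)/n we have |1 - x_n| = 1/n <= 1/N < eps for all n >= N.
# Record the largest deviation over a long stretch of indices n >= N.
worst = max(abs(1 - (n - 1) / n) for n in range(N, N + 10_000))
print(worst, "<", eps)
```

As the bound predicts, the largest deviation occurs at n = N and equals 1/N.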
On the other hand, if the complicated object is itself a term in a sequence, say x_n, and that sequence converges to a limit x that is more convenient to work with (e.g., easier to simulate),

then we may choose to approximate x_n by x, at least when n is sufficiently large. This is illustrated in the next example, which shows how we can approximate a geometrically distributed random variable by one that is exponentially distributed.

Example 2.3. For each integer n ≥ 1, let X_n be a geometrically distributed random variable with success probability λ/n. Since EX_n = n/λ, we can expect X_n to be large whenever n is large, in which case it will also be expensive to simulate. To find an approximation for X_n, we first observe that if t is a non-negative integer, then

    P{X_n > t} = (1 − λ/n)^t.    (2.1)

This result can be derived in two ways. Either we can evaluate the following infinite series

    P{X_n > t} = Σ_{k=t+1}^{∞} P{X_n = k} = Σ_{k=t+1}^{∞} (λ/n) (1 − λ/n)^{k−1} = (1 − λ/n)^t,

or we can recall Example 1.4 and note that the probability of the event X_n > t is the same as the probability that there are no successes in a series of t independent Bernoulli trials, each of which has success probability λ/n. Expression (2.1) is useful because of the following important result that you may recall from a calculus class:

    lim_{n→∞} (1 − a/n)^{nγ} = e^{−γa}.

It follows that if we replace t by nτ in (2.1), then

    lim_{n→∞} P{X_n > nτ} = lim_{n→∞} (1 − λ/n)^{nτ} = e^{−λτ}.

This limit is interesting because if X is an exponentially-distributed random variable with rate parameter λ, then P{X > τ} = e^{−λτ}, and so we have shown that

    lim_{n→∞} P{X_n > nτ} = P{X > τ}.

Also, since P{Y > t} = 1 − P{Y ≤ t} for any random variable Y, it follows that

    lim_{n→∞} P{X_n ≤ nτ} = P{X ≤ τ},

which can be rewritten as

    lim_{n→∞} P{(1/n) X_n ≤ τ} = P{X ≤ τ}.    (2.2)

Since this last result holds for every non-negative real number τ ≥ 0, we have shown that whenever n is large, the distribution of the random variable (1/n) X_n can be approximated by an exponential distribution with rate parameter λ. Later in the course we will see how this result can be used to approximate a discrete-time Markov chain by a continuous-time Markov chain.
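This approximation is easy to see in simulation. The sketch below (Python with NumPy; λ = 1.5, n = 1000, the sample size, and the grid of τ values are arbitrary choices of ours) compares the empirical survival function of X_n/n with the exponential survival function e^{−λτ}. Note that NumPy's geometric sampler counts the number of trials up to and including the first success, which matches the support {1, 2, …} used in the series above:

```python
import numpy as np

rng = np.random.default_rng(541)
lam, n, m = 1.5, 1000, 500_000

# X_n ~ Geometric(lam/n) on {1, 2, ...}: trials up to and including
# the first success.
Xn = rng.geometric(lam / n, size=m)

# Compare P{X_n / n > tau} with the exponential survival function.
taus = np.array([0.5, 1.0, 2.0])
emp = np.array([np.mean(Xn / n > tau) for tau in taus])
thy = np.exp(-lam * taus)
print(np.round(emp, 3))
print(np.round(thy, 3))
```

Already at n = 1000 the two survival functions agree to within a few thousandths, combining the O(1/n) approximation error of (2.1) with the Monte Carlo noise.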
Remark 2.1. Example 2.3 illustrates another important theme in stochastic modeling, which is that it is often necessary to normalize random variables when we want to pass to a limit. In this particular example, the normalization is suggested by the fact that although the expected values of the unnormalized variables X_n diverge as n tends to infinity, the expectations of the normalized variables X_n/n are all constant:

    E[(1/n) X_n] = 1/λ.

2.2 Modes of Convergence of Random Variables

As rich and as complicated as the real numbers are, real-valued random variables have an even richer and more complicated structure. One illustration of this difference is that there are several different senses (or modes) in which a sequence of real-valued random variables X_1, X_2, … can converge to a limiting random variable X. In this course, we will mostly ignore the many technical issues that this raises and operate with an intuitive sense of what convergence should mean. Indeed, most consumers of stochastic approximations are largely unaware of these issues. However, it will be useful to at least be familiar with some of the jargon. We will begin by assuming that X_1, X_2, … and X are all real-valued random variables.

Definition 2.2. The sequence (X_n; n ≥ 1) is said to converge in probability to X if for every ɛ > 0,

    lim_{n→∞} P{|X_n − X| > ɛ} = 0.

In other words, to say that the sequence converges to X in probability means that the probability that X_n and X differ by some fixed amount ɛ can be made arbitrarily small by taking n sufficiently large. Notice that this definition implicitly assumes that X_n and X are both defined on the same probability space.

Definition 2.3. The sequence (X_n; n ≥ 1) is said to converge almost surely to X if

    P{ lim_{n→∞} X_n = X } = 1.

It follows from Definition 2.1 that the sequence of random variables X_n converges to X almost surely if for every positive real number ɛ > 0, there is an integer-valued random variable N such that

    P{ |X_n − X| < ɛ for all n ≥ N } = 1.

Once again, this definition assumes that all of the variables are defined on the same probability space. However, there are important differences between convergence in probability and almost sure convergence.
In particular, whereas convergence in probability only requires that we compare each X_n with X one at a time, almost-sure convergence requires that we simultaneously compare all of the variables X_n with n ≥ N with X. For this reason, almost-sure convergence is a much stronger mode of convergence than convergence in probability, i.e., almost-sure convergence implies convergence in probability, but the converse is not true.

Remark 2.2. We say that an event E occurs almost surely if P(E) = 1. This is often abbreviated by writing "E occurs a.s." Since non-empty sets can have probability 0, it is important to realize that saying that E occurs almost surely is not the same thing as saying that E must happen. For example, if X is a standard uniform random variable, then the event E = {X is irrational} occurs a.s., since P(E^c) = P{X is rational} = 0, but of course X could be rational. Although these statements may seem counter-intuitive, they reflect the technical complications that arise when we deal with probabilities on sets containing uncountably many objects (e.g., the real numbers).

Definition 2.4. The sequence (X_n; n ≥ 1) is said to converge in distribution to X if

    lim_{n→∞} P{X_n ≤ x} = P{X ≤ x}

for every real number x with the property that P{X = x} = 0.
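Definition 2.2 can be made concrete with a small simulation. In the sketch below (Python with NumPy; the construction X_n = X + Z/n is a toy example of ours in which all of the variables live on one probability space) the exceedance probability P{|X_n − X| > ɛ} visibly shrinks toward zero as n grows:

```python
import numpy as np

rng = np.random.default_rng(541)
m, eps = 200_000, 0.05

# A toy sequence X_n = X + Z/n on a common probability space:
# the perturbation Z/n shrinks, so X_n converges to X in probability.
X = rng.normal(size=m)
Z = rng.normal(size=m)

probs = []
for n in [1, 10, 100]:
    Xn = X + Z / n
    probs.append(np.mean(np.abs(Xn - X) > eps))
print(probs)  # decreasing toward 0
```

In this particular example the convergence is in fact almost sure as well, since Z/n → 0 for every outcome; building a sequence that converges in probability but not almost surely takes more care.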

Convergence in distribution is also called weak convergence and is indeed much weaker than the other two modes of convergence introduced above, i.e., almost-sure convergence and convergence in probability each imply convergence in distribution, but neither is implied by convergence in distribution. Furthermore, we can talk about convergence in distribution even when all of the random variables are defined on distinct probability spaces: only the distributions of these variables appear in the definition.

Remark 2.3. In Example 2.3, we showed that the sequence of normalized geometric random variables (1/n) X_n converges in distribution to an exponential random variable X with parameter λ. See equation (2.2).

Once again, Williams (1991) provides an excellent introduction to the technical issues - see Chapter A13 for the key facts. Also, although I have focused on real-valued random variables above, each of these definitions can be extended to make sense of what it means for a sequence of E-valued random variables to converge to a limiting random variable, where E is any other space that we are likely to encounter (e.g., a set of vectors or matrices or even functions). These extensions become important when we wish to talk about the convergence not only of sequences of random variables, but also of sequences of stochastic processes. I'll briefly discuss some of these issues when we introduce diffusion approximations, but for the most part we can ignore them in this course. The book Markov Processes: Characterization and Convergence by Stewart Ethier and Tom Kurtz (1986) provides one of the best introductions to this subject for readers with a solid grasp of real and functional analysis.

Having introduced the key definitions, we now briefly examine some of the main limit theorems in probability.
2.3 Laws of Large Numbers

Recall that the frequentist interpretation of probability is that the probability of an event is the limiting frequency with which that event occurs when we conduct an infinite series of independent but identical trials. The weak and the strong laws of large numbers show that the intuition behind this interpretation is at least mathematically sound. Recall that we use the letters i.i.d. as an abbreviation for the phrase independent and identically distributed.

Theorem 2.1. Weak Law of Large Numbers (WLLN). Suppose that X_1, X_2, … are i.i.d. random variables and that E|X_1| < ∞. If µ = EX_1 denotes the expected value of X_1 and S_n = X_1 + ··· + X_n is the sum of the first n variables, then for every ɛ > 0,

    lim_{n→∞} P{ |(1/n) S_n − µ| > ɛ } = 0.

In the terminology of the previous section, the WLLN asserts that the sequence of normalized partial sums S_n/n converges in probability to the constant µ. In practical terms, this theorem tells us that if we conduct a large number of i.i.d. trials and calculate the sample mean of the outcomes, then with high probability this will be close to the expected value of the system.

Theorem 2.2. Strong Law of Large Numbers (SLLN). Suppose that X_1, X_2, … are i.i.d. and that E|X_1| < ∞. If µ = EX_1 denotes the expected value of X_1 and S_n = X_1 + ··· + X_n is

the sum of the first n variables, then

    P{ lim_{n→∞} (1/n) S_n = µ } = 1.

Thus, the SLLN asserts that the sequence S_n/n in fact converges almost surely to the constant µ. The connection with the frequentist interpretation of probability is given in the following example.

Example 2.4. Suppose that we are interested in estimating the probability of an event A and that we are able to conduct an unlimited number of i.i.d. trials to do so. Let X_i be the Bernoulli random variable defined by setting X_i = 1 if A occurs on the i-th trial and X_i = 0 otherwise. Then X_1, X_2, … is a sequence of i.i.d. random variables and

    µ = EX_1 = P{X_1 = 1} · 1 + P{X_1 = 0} · 0 = P(A).

Also, since S_n/n is the proportion of the first n trials that resulted in A, the SLLN tells us that the limiting frequency of the event A converges almost surely to the probability P(A), as we expect.

Remark 2.4. The weak and the strong law of large numbers are important special cases of a more general heuristic which states that deterministic behavior can emerge in certain kinds of stochastic models if these consist of a large number of weakly interacting (or indeed independent) components. For example, we can derive many of the deterministic models studied in epidemiology (e.g., the simple SIR model) by starting with a stochastic model of an epidemic in a finite population with N individuals and then letting N tend to infinity. These kinds of limits are sometimes called hydrodynamic limits by analogy with the deterministic models that describe the dynamics of fluids containing large numbers of weakly interacting molecules.

2.4 The Central Limit Theorem

Even if a large system behaves almost deterministically, we may still be interested in the random fluctuations of the system about its deterministic limit. The next theorem addresses this issue in the context of sums of independent random variables.

Theorem 2.3.
Central Limit Theorem (CLT). Suppose that X_1, X_2, … are i.i.d. random variables and that both the mean µ = EX_1 and the variance σ^2 = Var(X_1) of X_1 are finite. If we let

    S_n = X_1 + ··· + X_n  and  Z_n = (S_n − nµ) / (σ √n),

then the sequence Z_1, Z_2, … converges in distribution to a standard normal random variable Z, i.e., for every real number x,

    lim_{n→∞} P{Z_n ≤ x} = P{Z ≤ x} = (1/√(2π)) ∫_{−∞}^{x} e^{−z²/2} dz.

Here is one way to think about the CLT. By the SLLN, we know that the sequence of empirical means S_n/n converges almost surely to µ, i.e., for all sufficiently large n, the differences |S_n/n − µ| should be small. Nonetheless, because S_n/n is a random variable, it may be useful to


More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

THE QUEEN S UNIVERSITY OF BELFAST

THE QUEEN S UNIVERSITY OF BELFAST THE QUEEN S UNIVERSITY OF BELFAST 0SOR20 Level 2 Examination Statistics and Operational Research 20 Probability and Distribution Theory Wednesday 4 August 2002 2.30 pm 5.30 pm Examiners { Professor R M

More information

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y.

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y. CS450 Final Review Problems Fall 08 Solutions or worked answers provided Problems -6 are based on the midterm review Identical problems are marked recap] Please consult previous recitations and textbook

More information

STAT Chapter 5 Continuous Distributions

STAT Chapter 5 Continuous Distributions STAT 270 - Chapter 5 Continuous Distributions June 27, 2012 Shirin Golchi () STAT270 June 27, 2012 1 / 59 Continuous rv s Definition: X is a continuous rv if it takes values in an interval, i.e., range

More information

7 Random samples and sampling distributions

7 Random samples and sampling distributions 7 Random samples and sampling distributions 7.1 Introduction - random samples We will use the term experiment in a very general way to refer to some process, procedure or natural phenomena that produces

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Midterm 2 Review. CS70 Summer Lecture 6D. David Dinh 28 July UC Berkeley

Midterm 2 Review. CS70 Summer Lecture 6D. David Dinh 28 July UC Berkeley Midterm 2 Review CS70 Summer 2016 - Lecture 6D David Dinh 28 July 2016 UC Berkeley Midterm 2: Format 8 questions, 190 points, 110 minutes (same as MT1). Two pages (one double-sided sheet) of handwritten

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Formulas for probability theory and linear models SF2941

Formulas for probability theory and linear models SF2941 Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms

More information

Disjointness and Additivity

Disjointness and Additivity Midterm 2: Format Midterm 2 Review CS70 Summer 2016 - Lecture 6D David Dinh 28 July 2016 UC Berkeley 8 questions, 190 points, 110 minutes (same as MT1). Two pages (one double-sided sheet) of handwritten

More information

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019 Lecture 10: Probability distributions DANIEL WELLER TUESDAY, FEBRUARY 19, 2019 Agenda What is probability? (again) Describing probabilities (distributions) Understanding probabilities (expectation) Partial

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

1 Sequences of events and their limits

1 Sequences of events and their limits O.H. Probability II (MATH 2647 M15 1 Sequences of events and their limits 1.1 Monotone sequences of events Sequences of events arise naturally when a probabilistic experiment is repeated many times. For

More information

JUSTIN HARTMANN. F n Σ.

JUSTIN HARTMANN. F n Σ. BROWNIAN MOTION JUSTIN HARTMANN Abstract. This paper begins to explore a rigorous introduction to probability theory using ideas from algebra, measure theory, and other areas. We start with a basic explanation

More information

Introduction to Stochastic Processes

Introduction to Stochastic Processes Stat251/551 (Spring 2017) Stochastic Processes Lecture: 1 Introduction to Stochastic Processes Lecturer: Sahand Negahban Scribe: Sahand Negahban 1 Organization Issues We will use canvas as the course webpage.

More information

Probability Distributions Columns (a) through (d)

Probability Distributions Columns (a) through (d) Discrete Probability Distributions Columns (a) through (d) Probability Mass Distribution Description Notes Notation or Density Function --------------------(PMF or PDF)-------------------- (a) (b) (c)

More information

System Simulation Part II: Mathematical and Statistical Models Chapter 5: Statistical Models

System Simulation Part II: Mathematical and Statistical Models Chapter 5: Statistical Models System Simulation Part II: Mathematical and Statistical Models Chapter 5: Statistical Models Fatih Cavdur fatihcavdur@uludag.edu.tr March 20, 2012 Introduction Introduction The world of the model-builder

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

FE 5204 Stochastic Differential Equations

FE 5204 Stochastic Differential Equations Instructor: Jim Zhu e-mail:zhu@wmich.edu http://homepages.wmich.edu/ zhu/ January 20, 2009 Preliminaries for dealing with continuous random processes. Brownian motions. Our main reference for this lecture

More information

Chapter 6 Expectation and Conditional Expectation. Lectures Definition 6.1. Two random variables defined on a probability space are said to be

Chapter 6 Expectation and Conditional Expectation. Lectures Definition 6.1. Two random variables defined on a probability space are said to be Chapter 6 Expectation and Conditional Expectation Lectures 24-30 In this chapter, we introduce expected value or the mean of a random variable. First we define expectation for discrete random variables

More information

1 Review of Probability

1 Review of Probability 1 Review of Probability Random variables are denoted by X, Y, Z, etc. The cumulative distribution function (c.d.f.) of a random variable X is denoted by F (x) = P (X x), < x

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

3. Review of Probability and Statistics

3. Review of Probability and Statistics 3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture

More information

Lecture 2: Review of Probability

Lecture 2: Review of Probability Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................

More information

Things to remember when learning probability distributions:

Things to remember when learning probability distributions: SPECIAL DISTRIBUTIONS Some distributions are special because they are useful They include: Poisson, exponential, Normal (Gaussian), Gamma, geometric, negative binomial, Binomial and hypergeometric distributions

More information

MAS113 Introduction to Probability and Statistics. Proofs of theorems

MAS113 Introduction to Probability and Statistics. Proofs of theorems MAS113 Introduction to Probability and Statistics Proofs of theorems Theorem 1 De Morgan s Laws) See MAS110 Theorem 2 M1 By definition, B and A \ B are disjoint, and their union is A So, because m is a

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

One-Parameter Processes, Usually Functions of Time

One-Parameter Processes, Usually Functions of Time Chapter 4 One-Parameter Processes, Usually Functions of Time Section 4.1 defines one-parameter processes, and their variations (discrete or continuous parameter, one- or two- sided parameter), including

More information

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014 Probability Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh August 2014 (All of the slides in this course have been adapted from previous versions

More information

Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality

Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality Discrete Structures II (Summer 2018) Rutgers University Instructor: Abhishek Bhrushundi

More information

3 Multiple Discrete Random Variables

3 Multiple Discrete Random Variables 3 Multiple Discrete Random Variables 3.1 Joint densities Suppose we have a probability space (Ω, F,P) and now we have two discrete random variables X and Y on it. They have probability mass functions f

More information

Universal examples. Chapter The Bernoulli process

Universal examples. Chapter The Bernoulli process Chapter 1 Universal examples 1.1 The Bernoulli process First description: Bernoulli random variables Y i for i = 1, 2, 3,... independent with P [Y i = 1] = p and P [Y i = ] = 1 p. Second description: Binomial

More information

Chapter 2. Some basic tools. 2.1 Time series: Theory Stochastic processes

Chapter 2. Some basic tools. 2.1 Time series: Theory Stochastic processes Chapter 2 Some basic tools 2.1 Time series: Theory 2.1.1 Stochastic processes A stochastic process is a sequence of random variables..., x 0, x 1, x 2,.... In this class, the subscript always means time.

More information

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr.

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr. Topic 2: Probability & Distributions ECO220Y5Y: Quantitative Methods in Economics Dr. Nick Zammit University of Toronto Department of Economics Room KN3272 n.zammit utoronto.ca November 21, 2017 Dr. Nick

More information

Notes 1 : Measure-theoretic foundations I

Notes 1 : Measure-theoretic foundations I Notes 1 : Measure-theoretic foundations I Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Wil91, Section 1.0-1.8, 2.1-2.3, 3.1-3.11], [Fel68, Sections 7.2, 8.1, 9.6], [Dur10,

More information

Chapter 5. Random Variables (Continuous Case) 5.1 Basic definitions

Chapter 5. Random Variables (Continuous Case) 5.1 Basic definitions Chapter 5 andom Variables (Continuous Case) So far, we have purposely limited our consideration to random variables whose ranges are countable, or discrete. The reason for that is that distributions on

More information

Week 12-13: Discrete Probability

Week 12-13: Discrete Probability Week 12-13: Discrete Probability November 21, 2018 1 Probability Space There are many problems about chances or possibilities, called probability in mathematics. When we roll two dice there are possible

More information

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27 Probability Review Yutian Li Stanford University January 18, 2018 Yutian Li (Stanford University) Probability Review January 18, 2018 1 / 27 Outline 1 Elements of probability 2 Random variables 3 Multiple

More information

IEOR 6711: Stochastic Models I SOLUTIONS to the First Midterm Exam, October 7, 2008

IEOR 6711: Stochastic Models I SOLUTIONS to the First Midterm Exam, October 7, 2008 IEOR 6711: Stochastic Models I SOLUTIONS to the First Midterm Exam, October 7, 2008 Justify your answers; show your work. 1. A sequence of Events. (10 points) Let {B n : n 1} be a sequence of events in

More information

Let (Ω, F) be a measureable space. A filtration in discrete time is a sequence of. F s F t

Let (Ω, F) be a measureable space. A filtration in discrete time is a sequence of. F s F t 2.2 Filtrations Let (Ω, F) be a measureable space. A filtration in discrete time is a sequence of σ algebras {F t } such that F t F and F t F t+1 for all t = 0, 1,.... In continuous time, the second condition

More information

Quick review on Discrete Random Variables

Quick review on Discrete Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Quarter 2017 Néhémy Lim Quick review on Discrete Random Variables Notations. Z = {..., 2, 1, 0, 1, 2,...}, set of all integers; N = {0, 1, 2,...}, set of natural

More information

Single Maths B: Introduction to Probability

Single Maths B: Introduction to Probability Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction

More information

STAT/MATH 395 A - PROBABILITY II UW Winter Quarter Moment functions. x r p X (x) (1) E[X r ] = x r f X (x) dx (2) (x E[X]) r p X (x) (3)

STAT/MATH 395 A - PROBABILITY II UW Winter Quarter Moment functions. x r p X (x) (1) E[X r ] = x r f X (x) dx (2) (x E[X]) r p X (x) (3) STAT/MATH 395 A - PROBABILITY II UW Winter Quarter 07 Néhémy Lim Moment functions Moments of a random variable Definition.. Let X be a rrv on probability space (Ω, A, P). For a given r N, E[X r ], if it

More information

Recitation 2: Probability

Recitation 2: Probability Recitation 2: Probability Colin White, Kenny Marino January 23, 2018 Outline Facts about sets Definitions and facts about probability Random Variables and Joint Distributions Characteristics of distributions

More information

1 Exercises for lecture 1

1 Exercises for lecture 1 1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A )

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A ) 6. Brownian Motion. stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ, P) and a real valued stochastic process can be defined

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 17-27 Review Scott Sheffield MIT 1 Outline Continuous random variables Problems motivated by coin tossing Random variable properties 2 Outline Continuous random variables Problems

More information

Stat 426 : Homework 1.

Stat 426 : Homework 1. Stat 426 : Homework 1. Moulinath Banerjee University of Michigan Announcement: The homework carries 120 points and contributes 10 points to the total grade. (1) A geometric random variable W takes values

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

Preliminaries. Probability space

Preliminaries. Probability space Preliminaries This section revises some parts of Core A Probability, which are essential for this course, and lists some other mathematical facts to be used (without proof) in the following. Probability

More information

18.175: Lecture 2 Extension theorems, random variables, distributions

18.175: Lecture 2 Extension theorems, random variables, distributions 18.175: Lecture 2 Extension theorems, random variables, distributions Scott Sheffield MIT Outline Extension theorems Characterizing measures on R d Random variables Outline Extension theorems Characterizing

More information