APM 541: Stochastic Modelling in Biology
Probability Notes
Jay Taylor (ASU), Fall 2013

Outline
1. Motivation
2. Probability and Uncertainty
3. Conditional Probability
4. Discrete Distributions
5. Continuous Distributions
6. Multivariate Distributions
7. Limit Theorems for Sums of Random Variables

Motivation

Chance and contingency influence biological processes at all scales:
- quantum biology: stochastic tunneling in photosynthetic pathways
- stochastic molecular dynamics: molecular motors as Brownian ratchets
- noise in biochemical and gene regulatory networks
- developmental biology: stochastic X-inactivation leads to mosaic phenotypes in female mammals (tortoiseshell cats)
- behavioral ecology: optimal foraging in uncertain environments
- population ecology: extinction risk of small populations depends on demographic and environmental stochasticity
- population genetics: allele frequencies fluctuate because of random genetic drift and random mutation
- phylogenetics: random DNA substitutions are used for phylogenetic inference

[Figures: stochastic X-inactivation produces the mottled coat pattern of tortoiseshell cats; mass extinctions have had profound and often unpredictable consequences for life on earth.]

Probability and Uncertainty

Multiple Interpretations of Probability
- Frequentist interpretation (J. Venn): The probability of an event is equal to its limiting frequency in an infinite series of independent, identical trials.
- Propensity interpretation (C. Peirce, K. Popper): The probability of an event is equal to the physical propensity of the event. (Confession: I have no idea what this means.)
- Logical interpretation (J. M. Keynes, R. Cox, E. Jaynes): Probabilities quantify the degree of support that some evidence E gives to a hypothesis H.
- Subjective interpretation (F. Ramsey, J. von Neumann, B. de Finetti): The probability of a proposition is equal to the degree of an individual's belief that the proposition is true.

A Behavioristic Derivation of the Laws of Probability

We consider a two-player game, with players P_1 and P_2, who make wagers on an event with a then-unknown outcome. Here we will assume that the event can result in one of n mutually exclusive and exhaustive outcomes e_1, ..., e_n. The game takes place in three steps.
1. First, player P_1 assigns a unit price Pr(A) to each possible event A ⊆ S = {e_1, ..., e_n}. This is the amount that P_1 is willing to pay for a $1 wager on the event that A happens, as well as the amount that P_1 is willing to accept from P_2 in exchange for such a wager. The player selling such a wager agrees to pay the purchaser $1 if A occurs and $0 if it does not occur.
2. Player P_2 then decides the amount S(A) that they wish to wager on each event A. If S(A) is positive, then P_2 purchases a wager worth $S(A) from P_1 at the price S(A)Pr(A), while if S(A) is negative, then P_2 sells such a wager to P_1 for the same amount.
3. The outcome of the event is then determined and the two players settle their wagers.

Although player P_1 may assign any set of prices to the various possible wagers, many assignments can be considered irrational insofar as they lead to certain loss for P_1. Indeed, a set of prices {Pr(A) : A ⊆ S} and a set of wagers {S(A) : A ⊆ S} is said to be a Dutch book if these guarantee that P_2 will earn a profit from P_1 no matter which outcome occurs. The following are examples of Dutch books.

Case 1: Suppose that Pr(A) < 0 for some event A. In this case, P_2 can agree to purchase a $1 wager from P_1 at a cost of Pr(A), i.e., P_1 will pay P_2 the amount |Pr(A)| to take such a wager. Then P_2 will have earned 1 + |Pr(A)| if the event A occurs and will have earned |Pr(A)| if the event A does not occur. Thus P_2 makes a profit no matter what occurs, so this is a Dutch book.

Case 2: Suppose that Pr(S) < 1. In this case, P_2 can earn a profit of 1 - Pr(S) > 0 by purchasing a $1 wager on S. Since one of the events e_1, ..., e_n is certain to happen, S will occur no matter which outcome occurs, so P_2 is certain to make a profit of 1 - Pr(S) > 0. Similarly, if Pr(S) > 1, then P_2 should sell a $1 stake in S to P_1, and this will guarantee that P_2 earns a profit of Pr(S) - 1 > 0.

Case 3: Now suppose that A and B are disjoint subsets of S, i.e., A ∩ B = ∅, and that Pr(A ∪ B) < Pr(A) + Pr(B). Assume that P_2 purchases from P_1 a $1 wager on the event A ∪ B and sells to P_1 both a $1 wager on A and a $1 wager on B. Then the profits/losses incurred by P_2 are summarized in the following table:

  outcome   profit on A    profit on B    profit on A ∪ B    total
  A ∩ B̄     Pr(A) - 1      Pr(B)          1 - Pr(A ∪ B)      Pr(A) + Pr(B) - Pr(A ∪ B)
  Ā ∩ B     Pr(A)          Pr(B) - 1      1 - Pr(A ∪ B)      Pr(A) + Pr(B) - Pr(A ∪ B)
  Ā ∩ B̄     Pr(A)          Pr(B)          -Pr(A ∪ B)         Pr(A) + Pr(B) - Pr(A ∪ B)

Notice that we do not consider the outcome A ∩ B since we are assuming that A and B represent mutually exclusive events. It follows that P_2 earns a profit no matter which outcome occurs, and so P_2 has made a Dutch book against P_1. Similarly, if Pr(A ∪ B) > Pr(A) + Pr(B), then P_2 can make a Dutch book by selling a $1 wager on A ∪ B and purchasing a $1 wager on each of the events A and B.
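
To see the arithmetic concretely, here is a small check, not from the original notes, that evaluates P_2's total profit under the Case 3 strategy for one set of incoherent prices; the numbers 0.3, 0.4, 0.6 are illustrative assumptions.

```python
def profit(pr_a, pr_b, pr_union, a_occurs, b_occurs):
    """Profit to P2, who sells $1 wagers on A and B and buys a $1 wager on A u B."""
    total = pr_a + pr_b - pr_union   # premiums received minus premium paid
    if a_occurs:
        total -= 1                   # P2 pays out on the A wager it sold
    if b_occurs:
        total -= 1                   # P2 pays out on the B wager it sold
    if a_occurs or b_occurs:
        total += 1                   # P2 collects on the A u B wager it bought
    return total

# Incoherent prices for disjoint A and B: Pr(A u B) < Pr(A) + Pr(B).
for outcome in [(True, False), (False, True), (False, False)]:
    print(outcome, round(profit(0.3, 0.4, 0.6, *outcome), 3))   # always 0.1 > 0
```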

The above examples show that a Dutch book can be made against P_1 whenever the prices assigned by P_1 to events in S satisfy certain inequalities. This motivates the following definition. The set of prices (Pr(A) : A ⊆ S) is said to be coherent if it satisfies the following conditions:
1. Pr(A) ≥ 0 for each A ⊆ S;
2. Pr(S) = 1;
3. Pr(A ∪ B) = Pr(A) + Pr(B) whenever A and B are disjoint subsets of S.

Thus our previous examples show that any set of prices that is not coherent is vulnerable to a Dutch book. In fact, the converse is also true: under the rules of the above game, P_2 cannot make a Dutch book against P_1 if the prices assigned by P_1 are coherent.

Condition (3) implies that a coherent price assignment is finitely additive: if A_1, ..., A_m are disjoint subsets of S, then

  Pr(∪_{i=1}^m A_i) = Σ_{i=1}^m Pr(A_i).

Indeed, this follows by recursion on m, since we can use (3) to peel off the sets A_m, A_{m-1}, ... one by one from the union inside the price function:

  Pr(∪_{i=1}^m A_i) = Pr((∪_{i=1}^{m-1} A_i) ∪ A_m)
                    = Pr(∪_{i=1}^{m-1} A_i) + Pr(A_m)
                    = Pr(∪_{i=1}^{m-2} A_i) + Pr(A_{m-1}) + Pr(A_m)
                    = ...
                    = Σ_{i=1}^m Pr(A_i).

If we identify subjective probabilities with wagers and we accept that rational actors should assign probabilities in such a way that prevents a Dutch book from being made against them, then it follows that subjective probabilities should obey the usual axioms of frequentist probability. However, this argument has several weaknesses/limitations:
- Our arguments in favor of coherence presume that rational individuals should act in such a way as to avoid being made a sure loser. However, there may be situations in which this is not true, e.g., when individuals act altruistically and make sacrifices to benefit others.
- Our arguments also assume that it is possible for the players to eventually know what the true outcome is. In many instances where we wish to apply probabilistic reasoning, such knowledge might never be available.
- We have restricted attention to settings with only finitely many possible outcomes. However, it is often useful to consider probabilities on infinite sets.

Most modern mathematical treatments of probability begin with Kolmogorov's notion of a probability space.

Definition. A probability space is a triple (Ω, F, P) consisting of the following objects:
1. A set Ω called the sample space.
2. A collection F of subsets of Ω, called events, that satisfies the following conditions:
   - Ω ∈ F;
   - if A ∈ F, then Ā ∈ F;
   - if A_1, A_2, ... ∈ F, then ∪_{i=1}^∞ A_i ∈ F.
3. A function P : F → R satisfying the following conditions:
   - P(A) ≥ 0 for every A ∈ F;
   - P(Ω) = 1;
   - P(∪_{i≥1} A_i) = Σ_{i≥1} P(A_i) for any countable collection of disjoint sets A_i ∈ F.

Some terminology:
- A collection F of sets satisfying the conditions listed under (2) is said to be a σ-algebra on Ω. A set in F is said to be measurable, and the pair (Ω, F) is called a measurable space.
- The function P defined in (3) is said to be a probability measure or probability distribution on F. Since P is defined on F, the probability P(A) is only defined if A is a measurable set.
- Probability measures are countably additive. This is a stronger condition than the finite additivity required for coherence.

The axioms of probability imply several useful properties:
- P(Ā) = 1 - P(A);
- P(∅) = 0;
- if A ⊆ B, then P(A) ≤ P(B);
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B);
- the general inclusion-exclusion formula:

  P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i) - Σ_{i<j} P(A_i ∩ A_j) + Σ_{i<j<k} P(A_i ∩ A_j ∩ A_k) - ... + (-1)^{n+1} P(A_1 ∩ A_2 ∩ ... ∩ A_n).

Exercise: Show that these properties follow from the definition.
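
As a quick sanity check, not from the original notes, the two-event case of inclusion-exclusion can be verified by exact enumeration on a fair die; the events A and B below are illustrative choices.

```python
from fractions import Fraction

# A = {even rolls}, B = {rolls >= 4} on a fair six-sided die.
omega = range(1, 7)

def pr(event):
    """Exact probability of an event under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), 6)

in_A = lambda w: w % 2 == 0
in_B = lambda w: w >= 4
lhs = pr(lambda w: in_A(w) or in_B(w))                        # P(A u B)
rhs = pr(in_A) + pr(in_B) - pr(lambda w: in_A(w) and in_B(w)) # P(A)+P(B)-P(A n B)
print(lhs, rhs)   # both 2/3
```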

Conditional Probability

Conditional Probabilities

Suppose that we have specified a probability distribution P on a space Ω and that we learn that an event B ⊆ Ω has occurred. How should we update our beliefs concerning the probabilities of other events A ⊆ Ω? We will answer this question with the help of the game-theoretic approach used above. To do so, we allow the players to place wagers on conditional events. For example, suppose that player P_2 pays Pr(A|B) to place a wager of $1 on the conditional event that A occurs given that B has occurred, denoted A|B. Depending on the outcome, P_2 will incur the following costs and profits:
- If B occurs and A occurs, then P_1 keeps the fee Pr(A|B) but pays $1 to P_2.
- If B occurs but A does not occur, then P_1 keeps the fee Pr(A|B) and pays nothing to P_2.
- If B does not occur, then P_1 returns the fee Pr(A|B) to P_2 and neither player makes a profit.

Suppose that the prices assigned by P_1 to the wagers AB, A|B, and B satisfy the inequality Pr(AB) < Pr(A|B)Pr(B). I claim that P_2 can make a Dutch book against P_1 by purchasing a wager of $1 on AB and selling wagers of $1 on A|B and of $Pr(A|B) on B. The following table summarizes the profits made by P_2:

  outcome   profit on AB    profit on A|B    profit on B               total
  A ∩ B     1 - Pr(AB)      Pr(A|B) - 1      Pr(A|B)Pr(B) - Pr(A|B)    Pr(A|B)Pr(B) - Pr(AB)
  Ā ∩ B     -Pr(AB)         Pr(A|B)          Pr(A|B)Pr(B) - Pr(A|B)    Pr(A|B)Pr(B) - Pr(AB)
  B̄         -Pr(AB)         0                Pr(A|B)Pr(B)              Pr(A|B)Pr(B) - Pr(AB)

Provided that the prices set by P_1 satisfy the inequality given above, it is clear that P_2 will earn a profit no matter which outcome occurs, and so P_2 has created a Dutch book. Similarly, if Pr(AB) > Pr(A|B)Pr(B), then P_2 can create a Dutch book by purchasing wagers on A|B and B and selling a wager on AB.

Thus, to avoid falling victim to a Dutch book, the prices assigned by P_1 to these three events must satisfy the identity Pr(AB) = Pr(A|B)Pr(B). This result motivates the following definition.

Definition. Let A and B be events in a probability space (Ω, F, P) and suppose that P(B) > 0. Then the conditional probability P(A|B) of A given B is defined to be

  P(A|B) = P(AB) / P(B).
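
A tiny worked example, added here rather than taken from the notes, computes a conditional probability exactly from the definition; the die events are illustrative.

```python
from fractions import Fraction

# A = {roll a 6}, B = {roll an even number} on a fair die.
p_B = Fraction(3, 6)    # B = {2, 4, 6}
p_AB = Fraction(1, 6)   # A n B = {6}
print(p_AB / p_B)       # 1/3 = P(A | B), versus the unconditional P(A) = 1/6
```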

Independence

In general, there is no simple relationship between the probabilities P(A), P(B), and P(AB). However, there is an important special case in which the three probabilities are related by a simple identity.

Definition.
1. Events A and B are independent if P(AB) = P(A)P(B).
2. Events A_1, A_2, A_3, ... are independent if every finite collection {A_{i_1}, A_{i_2}, ..., A_{i_n}} of distinct events satisfies the identity

  P(∩_{j=1}^n A_{i_j}) = Π_{j=1}^n P(A_{i_j}).

Colloquially, we say that two events are independent if there is no causal or logical relationship between them, i.e., knowing that B is true does not change the likelihood that A is true. Indeed, if A and B are independent and P(A) > 0 and P(B) > 0, then

  P(A|B) = P(AB)/P(B) = P(A)P(B)/P(B) = P(A)
  P(B|A) = P(AB)/P(A) = P(A)P(B)/P(A) = P(B),

which shows that our formal definition of independence is consistent with this heuristic interpretation. However, the formal definition is slightly broader in that it applies even when one or more of the events has probability 0.

The Law of Total Probability

Suppose that F_1, ..., F_n are mutually exclusive events and that B is an event contained in the union F_1 ∪ ... ∪ F_n. Then

  P(B) = Σ_{i=1}^n P(B ∩ F_i) = Σ_{i=1}^n P(B|F_i)P(F_i).

The latter identity is known as the law of total probability and can often be used to calculate the probabilities of complex events.

Bayes' Formula

Suppose that E and F are two events with P(E) > 0 and P(F) > 0. Since EF = FE, we can express the probability of the intersection EF in two ways:

  P(E|F)P(F) = P(EF) = P(F|E)P(E).

The equality of the first and third expressions in this identity leads to Bayes' formula:

  P(F|E) = P(F) · P(E|F) / P(E).

This is arguably one of the most important identities in all of probability theory and, as the name suggests, it plays a fundamental role in Bayesian statistics.
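
The following sketch, an addition to the notes, applies Bayes' formula and the law of total probability to a stock diagnostic-test example; the prevalence, sensitivity, and false-positive rate are made-up numbers.

```python
# F = {infected}, E = {test positive}; all rates below are illustrative.
p_F = 0.01           # prevalence P(F)
p_E_given_F = 0.99   # sensitivity P(E | F)
p_E_given_Fc = 0.05  # false-positive rate P(E | not F)

# Law of total probability gives P(E); Bayes' formula then gives P(F | E).
p_E = p_E_given_F * p_F + p_E_given_Fc * (1 - p_F)
p_F_given_E = p_F * p_E_given_F / p_E
print(round(p_F_given_E, 3))   # ~0.167: most positives are false positives
```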

Discrete Distributions

Random Variables

Suppose that (Ω, F, P) is a probability space that represents our beliefs concerning the state of some system. In many cases, it will not be possible to directly observe which state ω ∈ Ω the system occupies, but it will be possible to conduct experiments that provide some information about this state. Mathematically, we can model such an experiment by a function

  X : Ω → E

defined on the sample space and taking values in a set E. Since the value of the variable X(ω) depends on the unknown state ω of the system, we say that X is a random variable. If we wish to emphasize the range of X, then we say that X is an E-valued random variable.

Example: Ω might represent the unknown positions and types of all of the molecules in a population of cells at some point in time. We have little chance of determining all of this information, but we could conduct a microarray experiment to determine the population-averaged levels of gene expression across the genome. In this case, we could take E = [0, ∞)^G, where G is the number of genes included in our study, and let X(ω) = (X_1(ω), ..., X_G(ω)) denote the estimated mRNA levels. Of course, we can also think of each of the quantities X_1, X_2, ... as a random variable in its own right, taking values in the space [0, ∞). In other words, we can define many random variables on the same space.

Definition. Suppose that X : Ω → E is a random variable defined on a probability space (Ω, F, P). The distribution of X is the probability distribution P_X on E defined by the following identity:

  P_X(A) ≡ P(X ∈ A) = P(X^{-1}(A)).

Notice that the two expressions on the right-hand side are determined by the probability measure P on Ω, i.e., the distribution of a random variable depends on the probability distribution on the underlying space on which the variable is defined.

Technical aside: To make this definition rigorous, we need to introduce a σ-algebra E on E and require that X^{-1}(A) ∈ F whenever A ∈ E. X is said to be measurable with respect to the two σ-algebras F and E if this condition is satisfied. The distribution of X is then a probability measure on the measurable space (E, E).

Discrete Random Variables

Definition. A random variable X is said to be discrete if X takes values in a set E that is either finite or countably infinite. The probability mass function of a discrete random variable X is the function p_X : E → [0, 1] defined by the formula p_X(e) = P(X = e).

Example: A random variable X with values in the set E = {0, 1} and probability mass function p_X(1) = p, p_X(0) = 1 - p is said to be a Bernoulli random variable with parameter p.

The probability mass function of a discrete random variable completely determines its distribution via the following identity:

  P(X ∈ A) = Σ_{x∈A} p_X(x), A ⊆ E.

In other words, to calculate the probability that X belongs to A, we simply sum the probability mass function over all of the values in A. In particular, if the probability mass function is summed over the entire space, the sum must equal 1:

  Σ_{x∈E} p_X(x) = P(X ∈ E) = 1.

Expectations of Discrete Random Variables

Definition. If X is a discrete random variable with values in a subset of the real numbers, then the expected value or mean of X is the weighted average of these values:

  E[X] = Σ_{x∈E} x p_X(x).

Example: If X is Bernoulli with parameter p, then the expected value of X is

  E[X] = 1 · p + 0 · (1 - p) = p.

Notice that the expected value of a random variable need not be a value that the variable can assume.

Properties of Expectations

The following two properties are often used to compute expected values:
- Linearity:

  E[c_1X_1 + ... + c_nX_n] = Σ_{i=1}^n c_i E[X_i].

- Transformations: if f : E → R is a real-valued function, then

  E[f(X)] = Σ_{x∈E} f(x) p(x).

Caveat: If f is a non-linear function, then in general E[f(X)] ≠ f(E[X]), as the sketch below illustrates.
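
Here is an exact, added example (not from the notes) of the caveat, using the illustrative non-linear function f(x) = x² and a variable uniform on {0, 1, 2}.

```python
from fractions import Fraction

# X uniform on {0, 1, 2}; f(x) = x**2 is non-linear.
pmf = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
e_x = sum(x * p for x, p in pmf.items())       # E[X] = 1
e_fx = sum(x**2 * p for x, p in pmf.items())   # E[X^2] = 5/3
print(e_fx, e_x**2)                            # 5/3 versus 1: E[f(X)] != f(E[X])
```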

Variance

Definition. If X is a discrete random variable with values in a subset of the real numbers, then the variance of X is the weighted average of the squared difference between X and its mean:

  Var(X) ≡ E[(X - E[X])²] = Σ_{x∈E} p_X(x)(x - E[X])².

In practice, it is often more convenient to calculate the variance using the following formula:

  Var(X) = E[X²] - E[X]².

Example: If X ~ Bernoulli(p), then, since X² = X,

  Var(X) = p - p² = p(1 - p).

Definition. X is said to have the binomial distribution with parameters n ≥ 1 and p ∈ [0, 1], written X ~ Binomial(n, p), if X takes values in the set E = {0, 1, ..., n} with probability mass function

  P(X = k) = (n choose k) p^k (1 - p)^{n-k}, k = 0, 1, ..., n.

Furthermore, the mean and the variance of X are:

  E[X] = np, Var(X) = np(1 - p).

Application: Suppose that we perform a series of n independent, identically distributed (i.i.d.) trials, each of which results in a success with probability p or a failure with probability 1 - p. Then the total number of successes in the n trials is a binomial random variable with parameters n and p. In particular, a Bernoulli random variable with parameter p is also a binomial random variable with parameters n = 1 and p.

Definition. X is said to have the geometric distribution with parameter p ∈ [0, 1], written X ~ Geometric(p), if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function

  P(X = k) = (1 - p)^k p, k ≥ 0.

Furthermore, the mean and the variance of X are:

  E[X] = (1 - p)/p, Var(X) = (1 - p)/p².

Application: Suppose that we perform a series of i.i.d. trials, each of which results in a success with probability p. Then the number of failures that occur before we observe the first success is geometrically distributed with parameter p.

Alternate definitions: Some authors define the geometric distribution to be the number of trials required to obtain the first success, in which case X takes values in the set E = {1, 2, ...} and P(X = k) = (1 - p)^{k-1} p.

Definition. X is said to have the negative binomial distribution with parameters r > 0 and p ∈ [0, 1], written X ~ NB(r, p), if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function

  P(X = k) = (r + k - 1 choose k) p^r (1 - p)^k, k ≥ 0.

Furthermore, the mean and the variance of X are:

  E[X] = r(1 - p)/p, Var(X) = r(1 - p)/p².

Application: Suppose that we perform a series of i.i.d. trials, each of which results in a success with probability p. Then the number of failures that occur before the r-th success is a negative binomial random variable with parameters r and p.

Caveat: As with the geometric distribution, some authors define the negative binomial distribution in terms of the number of trials until the r-th success, in which case X takes values in the set E = {r, r + 1, ...}.

The following plot shows the probability mass function for a series of negative binomial distributions, with fixed success probability p = 0.5 and increasing r = 0.2, 1, 5, 10. When r = 1, this distribution reduces to a geometric distribution. However, as r increases, the distribution becomes more symmetric (less skewed) around its mean.

[Figure: negative binomial probability mass functions p(k) for r = 0.2, 1, 5, 10 with p = 0.5; the r = 1 curve coincides with Geometric(0.5).]

Definition. X is said to have the Poisson distribution with parameter λ ≥ 0, written X ~ Poisson(λ), if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function

  P(X = k) = e^{-λ} λ^k / k!, k ≥ 0.

Furthermore, the mean and the variance of X are:

  E[X] = Var(X) = λ.

Application: Suppose that we perform a large number n of i.i.d. trials, each with success probability p = λ/n. Then the total number of successes is approximately Poisson distributed with parameter λ, and this approximation becomes exact in the limit as n → ∞. This is a special case of the Law of Rare Events.
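
A numerical check of the Law of Rare Events, added here as an illustration (the values of λ and k are arbitrary choices), compares Binomial(n, λ/n) probabilities with the Poisson limit:

```python
from math import comb, exp, factorial

# P(X = k) under Binomial(n, lam/n) approaches the Poisson(lam) value as n grows.
lam, k = 3.0, 2
poisson = exp(-lam) * lam**k / factorial(k)
for n in (10, 100, 10_000):
    p = lam / n
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    print(n, round(binom, 6), round(poisson, 6))
```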

The Poisson distribution provides a useful approximation for the distribution of many kinds of count data. Some examples include:
- the number of misspelled words on a page of a book;
- the number of misdialed phone numbers in a city on a particular day;
- the number of beta particles emitted by a ^14C source in an hour;
- the global number of major earthquakes that occur in a year;
- the number of sites in a non-coding region that differ between two closely related species. The mean of this distribution is 2LµT, where L is the number of base pairs in the region, µ is the substitution rate per site per year, and T is the divergence time in years.

The Poisson distribution is less suited to modeling count data corresponding to events or individuals that are clumped. In such cases, the data are usually overdispersed, i.e., the variance is greater than the mean, and the negative binomial distribution may provide a better fit.

[Figure: negative binomial probability mass functions p(k) with a common mean for r = 1, 2, 5, ..., compared with a Poisson distribution of the same mean.]

- NB: σ² = µ + µ²/r, so r is an inverse measure of aggregation or dispersion: smaller values of r give a more skewed distribution.
- If we let r → ∞ with p chosen so that the mean is µ, then NB(r, p) → Poisson(µ).
- Examples: numbers of macroparasites per individual and numbers of infections per outbreak are usually modeled with the negative binomial distribution.

Continuous Distributions

Continuous Random Variables

Definition. A real-valued random variable X is said to be continuous if there is a non-negative function p : R → [0, ∞], called the probability density function of X, such that

  P(X ∈ A) = ∫_A p(x) dx

for any (nice) subset A ⊆ R.

Technical aside: A set is nice if it belongs to the Borel σ-algebra on R. This is the smallest σ-algebra on R that contains all open intervals. While we will largely ignore measurability in this course, it is an important concept in fully rigorous treatments of probability theory. See David Williams' book Probability with Martingales (1991).

The probability density function plays the same role for continuous random variables that the probability mass function plays for discrete random variables. Be careful, however, not to confuse the two. The existence of a density function has several consequences:
- Densities integrate to 1:

  ∫_R p(x) dx = P(X ∈ R) = 1.

- Points have zero probability mass:

  P(X = t) = ∫_t^t p(x) dx = 0.

These two identities seem to lead to a paradox. On the one hand, the probability that X is equal to any particular point x is zero for every x. On the other hand, the probability that X is equal to some point, i.e., that X ∈ R, is 1. However, there is no contradiction here, since there are uncountably infinitely many real numbers and we have not required that probability distributions be additive over arbitrary collections of disjoint sets:

  1 = P(X ∈ R) = P(∪_{x∈R} {X = x}) ≠ Σ_{x∈R} P(X = x) = 0.

Definition. If X is a real-valued random variable (not necessarily continuous), the cumulative distribution function of X is the function F : R → [0, 1] defined by F(x) = P(X ≤ x).

If X is continuous, then the density p(x) and the cumulative distribution function F(x) are related in the following way:

  F(x) = ∫_{-∞}^x p(t) dt and p(x) = F'(x).

Furthermore, the density of X at a point x can be estimated by the following formula:

  p(x) ≈ P(x - ɛ < X ≤ x + ɛ) / (2ɛ) = (F(x + ɛ) - F(x - ɛ)) / (2ɛ).

Expectations of Continuous Random Variables

Definition. If X is a continuous random variable with probability density function p(x), then the expected value or mean of X is the weighted average of these values:

  E[X] = ∫_R x p(x) dx.

In general, expectations of continuous random variables behave like expectations of discrete random variables, provided that we replace the sum by an integral and the probability mass function by the probability density function. For example, if f : R → R is a real-valued function and X is as in the definition, then

  E[f(X)] = ∫_R f(x) p(x) dx.

Definition. X is said to have the uniform distribution on (l, u), written X ~ U(l, u), if X takes values in the set E = (l, u) with probability density function

  p(x) = 1/(u - l), x ∈ (l, u).

Furthermore, the mean and the variance of X are:

  E[X] = (l + u)/2, Var(X) = (u - l)²/12.

If l = 0 and u = 1, then X is said to be a standard uniform random variable.

Application: In Monte Carlo simulations, we are usually given a stream of independent standard uniform (pseudo-)random variables, and our problem is to use these to generate a stream of independent random variables with the distribution of interest. One standard recipe is sketched below.
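
A minimal sketch of one such recipe, inverse transform sampling, is added here; it assumes the target CDF F is invertible, so that F^{-1}(U) has CDF F when U is standard uniform. The exponential target below is an illustrative choice.

```python
import math
import random

def exponential_sample(lam, rng=random):
    """Draw Exp(lam) from a standard uniform via the inverse CDF.

    The exponential CDF F(t) = 1 - exp(-lam*t) inverts to
    F^{-1}(u) = -log(1 - u)/lam.
    """
    u = rng.random()                 # standard uniform draw
    return -math.log(1.0 - u) / lam  # inverse CDF applied to u

random.seed(0)
draws = [exponential_sample(2.0) for _ in range(100_000)]
print(sum(draws) / len(draws))       # ~0.5 = 1/lam
```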

Definition. X is said to have the Beta distribution with parameters a, b > 0, written X ~ Beta(a, b), if X takes values in the set E = (0, 1) with probability density function

  p(x) = x^{a-1}(1 - x)^{b-1} / β(a, b), x ∈ (0, 1).

Here, β(a, b) is the Beta function, which is defined by the integral

  β(a, b) = ∫_0^1 x^{a-1}(1 - x)^{b-1} dx.

Furthermore, the mean and the variance of X are:

  E[X] = a/(a + b), Var(X) = ab / ((a + b)²(a + b + 1)).

Remark: If a = b = 1, then the Beta distribution reduces to the standard uniform distribution.

Application: The Beta distribution provides a flexible family of probability distributions for quantities that take values in the interval [0, 1], e.g., proportions or probabilities. Depending on whether the parameters are less than or greater than 1, the Beta density is either bimodal, with maxima at x = 0 and x = 1, or unimodal, with a mode at (a - 1)/(a + b - 2).

[Figure: Beta densities p(x) for a = b = 0.1 (bimodal) and a = b = 2 (unimodal).]

Other applications include:
- order statistics of i.i.d. uniform variables;
- posterior distributions of success probabilities;
- equilibrium frequencies of neutral alleles in panmictic populations.

Definition. X is said to have the exponential distribution with rate parameter λ > 0, written X ~ Exp(λ), if X takes values in the set E = [0, ∞) with probability density function

  p(x) = λe^{-λx}, x ≥ 0.

Furthermore, the mean and the variance of X are

  E[X] = 1/λ, Var(X) = 1/λ²,

while the cumulative distribution function is

  F(t) = P(X ≤ t) = 1 - e^{-λt}.

Application: Exponential random variables are often used to model random times, e.g., the time until a radioactive nucleus decays. Of course, the shape of such a distribution should not depend on the units of measurement, and this is true of the exponential distribution. In particular, if X is exponentially distributed with rate λ and Y = γX, then Y is exponentially distributed with rate λ/γ.

Exponential distributions are special in that they are the only continuous distributions on [0, ∞) which are memoryless, i.e., if X ~ Exp(λ) and t, s > 0, then

  P(X > t + s | X > t) = P(X > t + s)/P(X > t) = e^{-λ(t+s)}/e^{-λt} = e^{-λs} = P(X > s).

In other words, the variable has no memory of having survived until time t: the probability that it dies between times t and t + s is the same as the probability of dying between times 0 and s. Put another way, the rate at which the variable dies is constant in time. Later we will see that this property plays a crucial role in the construction and analysis of continuous-time Markov chains, where the waiting times between successive jumps are exponential.
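
An empirical check of memorylessness, added as an illustration (the values of λ, t, and s are arbitrary):

```python
import random

# Compare P(X > t + s | X > t) with P(X > s) for X ~ Exp(lam).
random.seed(0)
lam, t, s = 1.0, 0.7, 0.5
draws = [random.expovariate(lam) for _ in range(200_000)]
survivors = [x for x in draws if x > t]
cond = sum(x > t + s for x in survivors) / len(survivors)
uncond = sum(x > s for x in draws) / len(draws)
print(round(cond, 3), round(uncond, 3))   # both near exp(-lam*s) ~= 0.607
```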

There is an important connection between the geometric distribution and the exponential distribution. Suppose that for each n ≥ 1, X_n is a geometric random variable with success probability p_n = λ/n. As n increases, the expected value E[X_n] ≈ n/λ diverges. Suppose then that we consider the rescaled random variables Y_n = X_n/n. Each of these variables has mean E[Y_n] ≈ λ^{-1}. Furthermore, if we let n tend to infinity, then

  P(Y_n > t) = P(X_n > nt) = (1 - λ/n)^{nt} → e^{-λt} = P(X > t),

where X is exponentially distributed with rate λ. In other words, we can approximate a geometric random variable with small success probability by an exponential random variable, provided that we change the units in which time is measured.

Definition. X is said to have the Gamma distribution with shape parameter α > 0 and rate parameter λ > 0, written X ~ Gamma(α, λ), if X takes values in the set E = [0, ∞) with probability density function

  p(x) = (λ^α / Γ(α)) x^{α-1} e^{-λx}, x ≥ 0.

Here, Γ(α) is the Gamma function, which is defined for α > 0 by the integral

  Γ(α) = ∫_0^∞ x^{α-1} e^{-x} dx.

Furthermore, the mean and the variance of X are:

  E[X] = α/λ, Var(X) = α/λ².

Remark: Sometimes the Gamma distribution is described in terms of the shape parameter α and a scale parameter θ = 1/λ.

The Gamma function introduced on the preceding slide is an important object in its own right and appears in many settings in mathematics and statistics. Although Γ(α) usually cannot be explicitly evaluated, the following identity, obtained by integration by parts, is often useful:

  Γ(α + 1) = ∫_0^∞ x^α e^{-x} dx = [-x^α e^{-x}]_0^∞ + ∫_0^∞ αx^{α-1} e^{-x} dx = αΓ(α).

In particular, if α = n is an integer, then

  Γ(n + 1) = nΓ(n) = n(n - 1)Γ(n - 1) = ... = n!Γ(1) = n!,

since Γ(1) = 1. Thus the Gamma function can be regarded as a smooth extension of the factorial function to the positive real numbers.

The Gamma distribution is related to the exponential distribution in much the same way that the negative binomial distribution is related to the geometric distribution.

[Figure: Gamma densities with rate 1 for shape parameters α = 0.5, 1, 2, ...; the α = 1 curve coincides with Exponential(1).]

- When α = 1, the Gamma distribution reduces to the exponential distribution.
- If X_1, ..., X_n are independent exponential random variables with rate λ, then their sum X = X_1 + ... + X_n is a Gamma random variable with shape parameter α = n and rate λ. Thus the Gamma distribution is often used to model random lifespans that elapse after a series of independent events.
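
A simulation check of the sum property, added as an illustration (n, λ, and the number of replicates are arbitrary choices):

```python
import random

# The sum of n independent Exp(lam) draws should have the Gamma(n, lam)
# mean n/lam and variance n/lam**2.
random.seed(0)
n, lam, reps = 5, 2.0, 100_000
sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
mean = sum(sums) / reps
var = sum((s - mean) ** 2 for s in sums) / reps
print(round(mean, 3), round(var, 3))   # ~2.5 and ~1.25
```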

Definition. X is said to have the normal (or Gaussian) distribution with mean µ and variance σ² > 0, written X ~ N(µ, σ²), if X takes values in R with probability density function

  p(x) = (1/(σ√(2π))) e^{-(x-µ)²/(2σ²)}.

In this case, the mean and the variance of X are µ and σ², as implied by the name of the distribution. Furthermore, if µ = 0 and σ² = 1, then X is said to be a standard normal random variable.

Applications: Normal distributions arise as the limiting distribution of a sum of a large number of independent random variables. This is the content of the Central Limit Theorem, which we will discuss in the next lecture. Many quantities encountered in biological systems are approximately normally distributed, including some morphological traits (among individuals in a population) and some protein concentrations (per cell).

Some useful properties of the normal distribution include:
- Sums of independent normal random variables are normal: if X_1, ..., X_n are independent normal random variables and X_i ~ N(µ_i, σ_i²), then their sum X = X_1 + ... + X_n is a normal random variable with mean µ_1 + ... + µ_n and variance σ_1² + ... + σ_n².
- Linear transformations preserve normality: if X ~ N(µ, σ²), then Y = aX + b is normal with mean aµ + b and variance a²σ². In particular, if Z ~ N(0, 1), then X = σZ + µ ~ N(µ, σ²).

[Figure: normal densities p(x) with variances σ² = 0.25, 1, 4, 25.]

Multivariate Distributions

Multivariate Distributions and Random Vectors

Definition. Suppose that X_1, ..., X_n are random variables defined on a common probability space (Ω, F, P) with values in the sets E_1, ..., E_n, respectively. Then the joint distribution of these variables is the distribution of the random vector X = (X_1, ..., X_n) on the set E = E_1 × ... × E_n, i.e.,

  P(X ∈ A) = P((X_1, ..., X_n) ∈ A) for any A ⊆ E.

In this case, the distribution of any one of the variables, say X_i, considered on its own is said to be the marginal distribution of that variable. The joint distribution of a collection of random variables tells us how the variables are related to one another.

Although we can always recover the marginal distributions of a collection of random variables from their joint distribution, the inverse operation usually is not possible without additional information. One case where this is possible is when the random variables are independent.

Definition.
1. The random variables X_1, ..., X_n are said to be independent if

  P(X_1 ∈ E_1, ..., X_n ∈ E_n) = Π_{i=1}^n P(X_i ∈ E_i)

for all sets E_1, ..., E_n such that the events {X_i ∈ E_i} are well-defined.
2. An infinite collection of random variables is said to be independent if every finite sub-collection is independent.

In other words, X and Y are independent if and only if the events {X ∈ E} and {Y ∈ F} are independent for all sets E and F in the ranges of X and Y.

As a general rule, calculations involving multivariate distributions are greatly simplified when the component variables are independent. The following theorem provides an important example of this principle.

Theorem. Suppose that X_1, ..., X_n are independent real-valued random variables and that f_1, ..., f_n are functions from R to R. Then f_1(X_1), ..., f_n(X_n) are independent real-valued random variables and

  E[Π_{i=1}^n f_i(X_i)] = Π_{i=1}^n E[f_i(X_i)]

whenever the expectations on both sides of the identity are defined. In particular, by letting each f_i(x) = x be the identity function, we obtain the following important special case:

  E[Π_{i=1}^n X_i] = Π_{i=1}^n E[X_i].

Of course, it is often the case that variables of interest are not independent, and then we need metrics to quantify the extent to which they depend on each other. One such metric is the covariance.

Definition. Suppose that X and Y are real-valued random variables defined on the same probability space. The covariance of X and Y is the quantity

  Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y].

The covariance between two random variables is a measure of their linear association. In particular, if X and Y are independent, then since the theorem on the previous slide shows that E[XY] = E[X]E[Y], it follows that Cov(X, Y) = 0. However, the converse of this result is not true: the mere fact that two variables have zero covariance does not imply that they are independent. A standard counterexample is sketched below.
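
The following added sketch gives the textbook counterexample: X uniform on {-1, 0, 1} and Y = X² have zero covariance (since E[XY] = E[X³] = 0 = E[X]E[Y]) yet are clearly dependent.

```python
import random

# Zero covariance without independence: Y is a deterministic function of X.
random.seed(0)
xs = [random.choice([-1, 0, 1]) for _ in range(200_000)]
ys = [x**2 for x in xs]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
print(round(cov, 3))   # ~0, even though knowing X determines Y exactly
```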

Properties of Covariance

Covariances have a number of useful properties:
- Cov(X, X) = Var(X);
- Cov(X, Y) = Cov(Y, X);
- Cov(aX, bY) = ab Cov(X, Y);
- bilinearity:

  Cov(Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j) = Σ_{i=1}^n Σ_{j=1}^m Cov(X_i, Y_j);

- the Cauchy-Schwarz inequality:

  |Cov(X, Y)| ≤ √(Var(X) Var(Y)).

Exercise: Verify the above properties.

When we work with random vectors containing more than two variables, it is often useful to organize the covariances into a single matrix.

Definition. Suppose that X_1, ..., X_n are real-valued random variables and let σ_ij = Cov(X_i, X_j) denote the covariance of X_i and X_j. Then the variance-covariance matrix of the random vector X = (X_1, ..., X_n) is the n × n matrix Σ with entry σ_ij in the i-th row and j-th column:

  Σ = [ σ_11 σ_12 ... σ_1n
        σ_21 σ_22 ... σ_2n
        ...
        σ_n1 σ_n2 ... σ_nn ].

Because Cov(X_i, X_j) = Cov(X_j, X_i), it is clear that every variance-covariance matrix is symmetric. Furthermore, it can be shown that any such matrix is also non-negative definite, i.e., given any (column) vector β ∈ R^n, we have

  β^T Σ β = Var(Σ_{i=1}^n β_i X_i) ≥ 0.

Definition. Suppose that X_1, ..., X_n are discrete random variables with values in the sets E_1, ..., E_n. Then the random vector X = (X_1, ..., X_n) is itself a discrete random variable, and the joint probability mass function of the variables X_1, ..., X_n is defined by the formula

  p(x_1, ..., x_n) = P{X = (x_1, ..., x_n)} = P{X_1 = x_1, ..., X_n = x_n}.

Furthermore, the marginal probability mass function p_i(x) of the variable X_i can be recovered from the joint p.m.f. by summing over all other variables:

  p_i(y) = Σ_{{x∈E : x_i = y}} p(x_1, ..., x_{i-1}, y, x_{i+1}, ..., x_n).

If the random variables X_1, ..., X_n are independent, then the joint probability mass function is equal to the product of the marginal probability mass functions:

  p(x_1, ..., x_n) = Π_{i=1}^n p_i(x_i),

where p_i(x_i) = P(X_i = x_i) is the probability mass function of X_i.

Definition. Let n ≥ 1 and let p_1, ..., p_k be a collection of non-negative real numbers such that p_1 + ... + p_k = 1. We say that the random vector X = (X_1, ..., X_k) has the multinomial distribution with parameters n and (p_1, ..., p_k) if each of the variables X_i takes values in the set {0, ..., n} and if the joint probability mass function of these variables is given by

  p(n_1, ..., n_k) = (n choose n_1, ..., n_k) p_1^{n_1} ... p_k^{n_k},

provided n_1 + ... + n_k = n; here (n choose n_1, ..., n_k) = n!/(n_1! ... n_k!) is the multinomial coefficient.

Application: As the name suggests, the multinomial distribution generalizes the binomial distribution. Suppose that we conduct n i.i.d. trials and that each trial can result in one of k possible outcomes, which have probabilities p_1, ..., p_k. If X_i denotes the number of trials that result in the i-th outcome, then (X_1, ..., X_k) has the multinomial distribution with parameters n and (p_1, ..., p_k).
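
A minimal added sketch of simulating one multinomial draw by classifying n i.i.d. trials, exactly as in the application above; the function name and the probabilities (0.2, 0.3, 0.5) are illustrative assumptions.

```python
import random
from collections import Counter

def multinomial_draw(n, p, rng=random):
    """One draw of (X_1, ..., X_k): classify n i.i.d. trials into k outcomes."""
    # Cumulative probabilities for inversion sampling of a single trial.
    cum, total = [], 0.0
    for pi in p:
        total += pi
        cum.append(total)
    cum[-1] = 1.0   # guard against floating-point rounding
    tally = Counter()
    for _ in range(n):
        u = rng.random()
        k = next(i for i, c in enumerate(cum) if u <= c)
        tally[k] += 1
    return [tally[i] for i in range(len(p))]

random.seed(0)
print(multinomial_draw(100, [0.2, 0.3, 0.5]))   # counts near [20, 30, 50]
```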

Definition. A random vector X = (X_1, ..., X_n) with values in R^n is said to be continuously distributed if there is a non-negative function p : R^n → [0, ∞] such that

  P(X ∈ A) = ∫_A p(x) dx

for any (Borel measurable) subset A ⊆ R^n. In this case, each of the variables X_i is individually continuous, and the marginal density of X_i can be found by integrating the joint density over all of the other components:

  p_i(y) = ∫ ... ∫ p(x_1, ..., x_{i-1}, y, x_{i+1}, ..., x_n) dx_1 ... dx_{i-1} dx_{i+1} ... dx_n.

In general, a random vector formed from a set of marginally continuous random variables need not be jointly continuous. However, if the variables are also independent, then the random vector will be jointly continuous and the joint density is given by the product of the marginal densities:

  p(x_1, ..., x_n) = Π_{i=1}^n p_i(x_i).

Definition. A continuous random vector X = (X_1, ..., X_n) with values in R^n is said to have the multivariate normal distribution with mean vector µ = (µ_1, ..., µ_n) and n × n variance-covariance matrix Σ if it has joint density function

  p(x) = (2π)^{-n/2} |Σ|^{-1/2} exp{ -(1/2)(x - µ)^T Σ^{-1} (x - µ) }.

Here Σ is a positive-definite symmetric matrix with determinant |Σ| > 0 and inverse Σ^{-1}.

We note that if X is multivariate normal, then every component variable X_i is a normal random variable. Furthermore, if X and Y are independent multivariate normal random vectors with values in R^n, then so is X + Y, i.e., sums of independent multivariate normal random vectors are also multivariate normal.

Limit Theorems for Sums of Random Variables

Sums of Independent Random Variables

Many problems in applied probability and statistics involve sums of independent random variables. For example, in discrete-time branching processes, the size of a population at time t + 1 is equal to the sum of the numbers of offspring born to the adult females alive at time t. In some cases, the distribution of the sum of two independent integer-valued random variables can be calculated with the help of the following result.

Theorem. Suppose that X and Y are independent integer-valued random variables with probability mass functions p_X and p_Y, respectively. Then Z = X + Y is an integer-valued random variable with probability mass function

  p_Z(n) = (p_X * p_Y)(n) ≡ Σ_{k=-∞}^{∞} p_X(k) p_Y(n - k).

Remark: The operation p_X * p_Y is called the discrete convolution of p_X and p_Y.

The proof of this result uses only elementary properties of probability distributions and independence:

  p_Z(n) = P(Z = n) = P(X + Y = n)
         = P(∪_{k=-∞}^{∞} {X = k, Y = n - k})
         = Σ_{k=-∞}^{∞} P(X = k, Y = n - k)
         = Σ_{k=-∞}^{∞} P(X = k)P(Y = n - k)
         = Σ_{k=-∞}^{∞} p_X(k) p_Y(n - k).

By way of example, we can show that the sum of two independent Poisson random variables is Poisson. Suppose that X ~ Poisson(λ) and Y ~ Poisson(µ) are independent, and let Z = X + Y. Then the probability mass function of Z is

  p_Z(n) = Σ_{k=-∞}^{∞} p_X(k) p_Y(n - k)
         = Σ_{k=0}^{n} (e^{-λ} λ^k / k!)(e^{-µ} µ^{n-k} / (n - k)!)
         = e^{-(λ+µ)} (1/n!) Σ_{k=0}^{n} (n! / (k!(n - k)!)) λ^k µ^{n-k}
         = e^{-(λ+µ)} (λ + µ)^n / n!,

which shows that X + Y ~ Poisson(λ + µ). This result will be very useful when we begin to work with Poisson processes.
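
A numerical check of this calculation, added as an illustration (the rates and the value of n are arbitrary), convolves the two mass functions directly:

```python
from math import exp, factorial

def poisson_pmf(rate, k):
    """Poisson probability mass function P(X = k)."""
    return exp(-rate) * rate**k / factorial(k)

# Discrete convolution of Poisson(lam) and Poisson(mu) at n, versus Poisson(lam+mu).
lam, mu, n = 1.5, 2.5, 4
conv = sum(poisson_pmf(lam, k) * poisson_pmf(mu, n - k) for k in range(n + 1))
print(round(conv, 10), round(poisson_pmf(lam + mu, n), 10))   # equal
```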

An analogous result exists for sums of independent continuous random variables.

Theorem. Let X and Y be independent continuous random variables with densities p_X and p_Y, respectively. Then Z = X + Y is a continuous random variable with density

  p_Z(z) = (p_X * p_Y)(z) ≡ ∫_{-∞}^{∞} p_X(t) p_Y(z - t) dt,

where p_X * p_Y is called the convolution integral of p_X and p_Y.

Exercise: Use this theorem to show that the sum of two independent exponential random variables with rate parameter λ is Gamma distributed with shape parameter α = 2 and rate parameter λ.

When summing large numbers of independent random variables, the convolutions described on the preceding slides may not be helpful. In these cases, we instead consider the asymptotic or limiting behavior of such sums as the number of terms tends to infinity. To discuss these results, we need to decide what it means for a sequence of random variables or a sequence of probability distributions to converge to a limit. We first recall the definition of convergence in R^n.

Definition. We say that a sequence {x_n : n ≥ 1} ⊆ R^n of n-dimensional vectors converges to a limit x ∈ R^n if for every ɛ > 0 there exists an integer N_ɛ such that for all n ≥ N_ɛ we have ||x - x_n|| < ɛ, where ||y|| denotes the n-dimensional Euclidean norm.

Technical aside: In this definition, the Euclidean norm on R^n can be replaced by any other norm on R^n, and this will not change whether a sequence is convergent or not.

When we turn to spaces of random variables or probability distributions, there is no longer a single mode of convergence that is useful in all settings. The most important modes of convergence are described in the next definition.

Definition. Suppose that X_1, X_2, ... is a sequence of R^n-valued random variables defined on a probability space, and let X also be an R^n-valued random variable on this space.
1. We say that X_n converges in distribution to X if for every bounded, continuous function f : R^n → R,

  lim_{n→∞} E[f(X_n)] = E[f(X)].

2. We say that X_n converges in probability to X if for every ɛ > 0,

  lim_{n→∞} P{ ||X_n - X|| > ɛ } = 0.

3. We say that X_n converges almost surely to X if

  P{ lim_{n→∞} X_n = X } = 1.

The modes of convergence defined on the preceding slide differ in the sense that there are sequences that converge in one mode but don't converge in the other modes. However, there are some useful relationships between these notions. In particular, the following is true:

Theorem. Let X_n, n ≥ 1, and X be as in the previous definition.
1. If X_n converges to X almost surely, then X_n also converges to X in probability and in distribution.
2. If X_n converges to X in probability, then X_n also converges to X in distribution.

In other words, almost sure convergence implies convergence in probability, which implies convergence in distribution.

With these results in mind, let us return to sums of independent random variables. We first consider the asymptotic behavior of the sample mean of a large number of such variables. Recall that we use the abbreviation i.i.d. to stand for independent and identically distributed.

Theorem (The Weak Law of Large Numbers, WLLN). Suppose that X_1, X_2, ... are i.i.d. real-valued random variables and that E|X_1| < ∞. If µ = E[X_1] and S_n = X_1 + ... + X_n is the sum of the first n variables, then the sequence of sample means S_n/n converges in probability to µ, i.e., for every ɛ > 0,

  lim_{n→∞} P{ |S_n/n - µ| > ɛ } = 0.

In other words, if we conduct a large number of independent, identical trials in which we measure some random quantity, then with high probability the sample mean will be close to the actual mean.

In fact, a stronger result is available.

Theorem (The Strong Law of Large Numbers). Suppose that X_1, X_2, ... are i.i.d. real-valued random variables and that E|X_1| < ∞. If µ = E[X_1] and S_n = X_1 + ... + X_n, then the sequence of sample means S_n/n converges almost surely to µ, i.e.,

  P{ lim_{n→∞} S_n/n = µ } = 1.

The SLLN tells us that as we conduct more and more trials, the sample mean is certain to approach the true mean. The SLLN implies the WLLN, but is a much stronger result.
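
A short added illustration of the law of large numbers (the Exp(1) distribution and sample sizes are arbitrary choices):

```python
import random

# Running sample means of i.i.d. Exp(1) variables settle toward mu = 1.
random.seed(0)
checkpoints, total = [], 0.0
for n in range(1, 100_001):
    total += random.expovariate(1.0)
    if n in (10, 100, 1_000, 10_000, 100_000):
        checkpoints.append((n, round(total / n, 4)))
print(checkpoints)   # sample means drift toward 1.0 as n grows
```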

The WLLN and SLLN are special cases of a more general heuristic, which states that deterministic behavior can emerge in certain kinds of stochastic models that consist of a large number of weakly interacting (or indeed independent) components. For example, it is this heuristic that allows us to use deterministic ODEs or PDEs to model the behavior of chemical reactions when the numbers of molecules are large. Nonetheless, in many instances we are specifically interested in the noise or fluctuations of the system about this deterministic limit. In the context of sums of i.i.d. variables, this is addressed by the Central Limit Theorem.

Theorem (Central Limit Theorem, CLT). Suppose that X_1, X_2, ... are i.i.d. real-valued random variables and that both the mean µ = E[X_1] and the variance σ² = Var(X_1) are finite. If S_n = X_1 + ... + X_n and

  Z_n = (S_n - nµ) / (σ√n),

then the sequence Z_1, Z_2, ... converges in distribution to a standard normal random variable.
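
An added simulation of the CLT (the Bernoulli parameter, n, and replicate count are arbitrary; 1.96 is the standard normal 97.5% quantile):

```python
import random

# Standardized Binomial(n, p) sums should be approximately standard normal.
random.seed(0)
n, p, reps = 1_000, 0.3, 10_000
mu, sigma = p, (p * (1 - p)) ** 0.5
zs = []
for _ in range(reps):
    s = sum(random.random() < p for _ in range(n))   # one Binomial(n, p) draw
    zs.append((s - n * mu) / (sigma * n**0.5))
share = sum(-1.96 <= z <= 1.96 for z in zs) / reps
print(round(share, 3))   # ~0.95, the standard normal mass on [-1.96, 1.96]
```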

The variables Z_1, Z_2, ... introduced in the CLT are said to be standardized in the sense that for every n ≥ 1, E[Z_n] = 0 and Var(Z_n) = 1. In other words, since we already know that the sample means S_n/n converge almost surely to the mean µ, to study the fluctuations of S_n/n around this limit we need to subtract the limit and then amplify the differences S_n/n - µ by a factor that grows rapidly enough to compensate for the fact that this difference is tending to 0:

  Z_n = (√n/σ)(S_n/n - µ).

What the CLT tells us is that no matter what the distribution of the X_i's is (as long as it has finite mean and finite variance), these amplified differences will be approximately normally distributed when n is large. This observation presumably explains why so many quantities are approximately normally distributed: in effect, the microscopic details are lost in the Gaussian limit whenever we consider a macroscopic system in which the components act additively.


More information

1 Probability and Random Variables

1 Probability and Random Variables 1 Probability and Random Variables The models that you have seen thus far are deterministic models. For any time t, there is a unique solution X(t). On the other hand, stochastic models will result in

More information

Advanced topics from statistics

Advanced topics from statistics Advanced topics from statistics Anders Ringgaard Kristensen Advanced Herd Management Slide 1 Outline Covariance and correlation Random vectors and multivariate distributions The multinomial distribution

More information

Topic 2: Review of Probability Theory

Topic 2: Review of Probability Theory CS 8850: Advanced Machine Learning Fall 2017 Topic 2: Review of Probability Theory Instructor: Daniel L. Pimentel-Alarcón c Copyright 2017 2.1 Why Probability? Many (if not all) applications of machine

More information

Lecture 16. Lectures 1-15 Review

Lecture 16. Lectures 1-15 Review 18.440: Lecture 16 Lectures 1-15 Review Scott Sheffield MIT 1 Outline Counting tricks and basic principles of probability Discrete random variables 2 Outline Counting tricks and basic principles of probability

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

Review of Basic Probability Theory

Review of Basic Probability Theory Review of Basic Probability Theory James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 35 Review of Basic Probability Theory

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

2.1 Elementary probability; random sampling

2.1 Elementary probability; random sampling Chapter 2 Probability Theory Chapter 2 outlines the probability theory necessary to understand this text. It is meant as a refresher for students who need review and as a reference for concepts and theorems

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014 Probability Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh August 2014 (All of the slides in this course have been adapted from previous versions

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 17-27 Review Scott Sheffield MIT 1 Outline Continuous random variables Problems motivated by coin tossing Random variable properties 2 Outline Continuous random variables Problems

More information

Lecture 17: The Exponential and Some Related Distributions

Lecture 17: The Exponential and Some Related Distributions Lecture 7: The Exponential and Some Related Distributions. Definition Definition: A continuous random variable X is said to have the exponential distribution with parameter if the density of X is e x if

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

EE514A Information Theory I Fall 2013

EE514A Information Theory I Fall 2013 EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/

More information

Statistics 1B. Statistics 1B 1 (1 1)

Statistics 1B. Statistics 1B 1 (1 1) 0. Statistics 1B Statistics 1B 1 (1 1) 0. Lecture 1. Introduction and probability review Lecture 1. Introduction and probability review 2 (1 1) 1. Introduction and probability review 1.1. What is Statistics?

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

Brief Review of Probability

Brief Review of Probability Maura Department of Economics and Finance Università Tor Vergata Outline 1 Distribution Functions Quantiles and Modes of a Distribution 2 Example 3 Example 4 Distributions Outline Distribution Functions

More information

STAT Chapter 5 Continuous Distributions

STAT Chapter 5 Continuous Distributions STAT 270 - Chapter 5 Continuous Distributions June 27, 2012 Shirin Golchi () STAT270 June 27, 2012 1 / 59 Continuous rv s Definition: X is a continuous rv if it takes values in an interval, i.e., range

More information

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27 Probability Review Yutian Li Stanford University January 18, 2018 Yutian Li (Stanford University) Probability Review January 18, 2018 1 / 27 Outline 1 Elements of probability 2 Random variables 3 Multiple

More information

3. Review of Probability and Statistics

3. Review of Probability and Statistics 3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture

More information

Probability and random variables. Sept 2018

Probability and random variables. Sept 2018 Probability and random variables Sept 2018 2 The sample space Consider an experiment with an uncertain outcome. The set of all possible outcomes is called the sample space. Example: I toss a coin twice,

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

1 Review of Probability

1 Review of Probability 1 Review of Probability Random variables are denoted by X, Y, Z, etc. The cumulative distribution function (c.d.f.) of a random variable X is denoted by F (x) = P (X x), < x

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Lecture 1. ABC of Probability

Lecture 1. ABC of Probability Math 408 - Mathematical Statistics Lecture 1. ABC of Probability January 16, 2013 Konstantin Zuev (USC) Math 408, Lecture 1 January 16, 2013 1 / 9 Agenda Sample Spaces Realizations, Events Axioms of Probability

More information

Intro to Probability. Andrei Barbu

Intro to Probability. Andrei Barbu Intro to Probability Andrei Barbu Some problems Some problems A means to capture uncertainty Some problems A means to capture uncertainty You have data from two sources, are they different? Some problems

More information

MAS113 Introduction to Probability and Statistics. Proofs of theorems

MAS113 Introduction to Probability and Statistics. Proofs of theorems MAS113 Introduction to Probability and Statistics Proofs of theorems Theorem 1 De Morgan s Laws) See MAS110 Theorem 2 M1 By definition, B and A \ B are disjoint, and their union is A So, because m is a

More information

Continuous random variables

Continuous random variables Continuous random variables Continuous r.v. s take an uncountably infinite number of possible values. Examples: Heights of people Weights of apples Diameters of bolts Life lengths of light-bulbs We cannot

More information

Probability Background

Probability Background CS76 Spring 0 Advanced Machine Learning robability Background Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu robability Meure A sample space Ω is the set of all possible outcomes. Elements ω Ω are called sample

More information

Chapter 1. Sets and probability. 1.3 Probability space

Chapter 1. Sets and probability. 1.3 Probability space Random processes - Chapter 1. Sets and probability 1 Random processes Chapter 1. Sets and probability 1.3 Probability space 1.3 Probability space Random processes - Chapter 1. Sets and probability 2 Probability

More information

Chapter 3 sections. SKIP: 3.10 Markov Chains. SKIP: pages Chapter 3 - continued

Chapter 3 sections. SKIP: 3.10 Markov Chains. SKIP: pages Chapter 3 - continued Chapter 3 sections Chapter 3 - continued 3.1 Random Variables and Discrete Distributions 3.2 Continuous Distributions 3.3 The Cumulative Distribution Function 3.4 Bivariate Distributions 3.5 Marginal Distributions

More information

Chapter 5 continued. Chapter 5 sections

Chapter 5 continued. Chapter 5 sections Chapter 5 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 3 October 29, 2012 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline Reminder: Probability density function Cumulative

More information

APM 421 Probability Theory Discrete Random Variables. Jay Taylor Fall Jay Taylor (ASU) Fall / 86

APM 421 Probability Theory Discrete Random Variables. Jay Taylor Fall Jay Taylor (ASU) Fall / 86 APM 421 Probability Theory Discrete Random Variables Jay Taylor Fall 2013 Jay Taylor (ASU) Fall 2013 1 / 86 Outline 1 Motivation 2 Infinite Sets and Cardinality 3 Countable Additivity 4 Discrete Random

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Department of Large Animal Sciences. Outline. Slide 2. Department of Large Animal Sciences. Slide 4. Department of Large Animal Sciences

Department of Large Animal Sciences. Outline. Slide 2. Department of Large Animal Sciences. Slide 4. Department of Large Animal Sciences Outline Advanced topics from statistics Anders Ringgaard Kristensen Covariance and correlation Random vectors and multivariate distributions The multinomial distribution The multivariate normal distribution

More information

Things to remember when learning probability distributions:

Things to remember when learning probability distributions: SPECIAL DISTRIBUTIONS Some distributions are special because they are useful They include: Poisson, exponential, Normal (Gaussian), Gamma, geometric, negative binomial, Binomial and hypergeometric distributions

More information

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed.

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed. STAT 302 Introduction to Probability Learning Outcomes Textbook: A First Course in Probability by Sheldon Ross, 8 th ed. Chapter 1: Combinatorial Analysis Demonstrate the ability to solve combinatorial

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Continuous Probability Spaces

Continuous Probability Spaces Continuous Probability Spaces Ω is not countable. Outcomes can be any real number or part of an interval of R, e.g. heights, weights and lifetimes. Can not assign probabilities to each outcome and add

More information

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted

More information

Probability Review. Gonzalo Mateos

Probability Review. Gonzalo Mateos Probability Review Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ September 11, 2018 Introduction

More information

Stat 426 : Homework 1.

Stat 426 : Homework 1. Stat 426 : Homework 1. Moulinath Banerjee University of Michigan Announcement: The homework carries 120 points and contributes 10 points to the total grade. (1) A geometric random variable W takes values

More information

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019 Lecture 10: Probability distributions DANIEL WELLER TUESDAY, FEBRUARY 19, 2019 Agenda What is probability? (again) Describing probabilities (distributions) Understanding probabilities (expectation) Partial

More information

7 Random samples and sampling distributions

7 Random samples and sampling distributions 7 Random samples and sampling distributions 7.1 Introduction - random samples We will use the term experiment in a very general way to refer to some process, procedure or natural phenomena that produces

More information

Lecture notes for Part A Probability

Lecture notes for Part A Probability Lecture notes for Part A Probability Notes written by James Martin, updated by Matthias Winkel Oxford, Michaelmas Term 017 winkel@stats.ox.ac.uk Version of 5 September 017 1 Review: probability spaces,

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Lectures on Statistical Data Analysis

Lectures on Statistical Data Analysis Lectures on Statistical Data Analysis London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk

More information

Single Maths B: Introduction to Probability

Single Maths B: Introduction to Probability Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction

More information

Probability and Probability Distributions. Dr. Mohammed Alahmed

Probability and Probability Distributions. Dr. Mohammed Alahmed Probability and Probability Distributions 1 Probability and Probability Distributions Usually we want to do more with data than just describing them! We might want to test certain specific inferences about

More information

CSE 312 Final Review: Section AA

CSE 312 Final Review: Section AA CSE 312 TAs December 8, 2011 General Information General Information Comprehensive Midterm General Information Comprehensive Midterm Heavily weighted toward material after the midterm Pre-Midterm Material

More information

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr.

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr. Topic 2: Probability & Distributions ECO220Y5Y: Quantitative Methods in Economics Dr. Nick Zammit University of Toronto Department of Economics Room KN3272 n.zammit utoronto.ca November 21, 2017 Dr. Nick

More information

MATH 3510: PROBABILITY AND STATS June 15, 2011 MIDTERM EXAM

MATH 3510: PROBABILITY AND STATS June 15, 2011 MIDTERM EXAM MATH 3510: PROBABILITY AND STATS June 15, 2011 MIDTERM EXAM YOUR NAME: KEY: Answers in Blue Show all your work. Answers out of the blue and without any supporting work may receive no credit even if they

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Refresher on Discrete Probability

Refresher on Discrete Probability Refresher on Discrete Probability STAT 27725/CMSC 25400: Machine Learning Shubhendu Trivedi University of Chicago October 2015 Background Things you should have seen before Events, Event Spaces Probability

More information

Mathematical Statistics 1 Math A 6330

Mathematical Statistics 1 Math A 6330 Mathematical Statistics 1 Math A 6330 Chapter 3 Common Families of Distributions Mohamed I. Riffi Department of Mathematics Islamic University of Gaza September 28, 2015 Outline 1 Subjects of Lecture 04

More information

Introduction to Statistical Data Analysis Lecture 3: Probability Distributions

Introduction to Statistical Data Analysis Lecture 3: Probability Distributions Introduction to Statistical Data Analysis Lecture 3: Probability Distributions James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis

More information

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A )

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A ) 6. Brownian Motion. stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ, P) and a real valued stochastic process can be defined

More information

MATH 556: PROBABILITY PRIMER

MATH 556: PROBABILITY PRIMER MATH 6: PROBABILITY PRIMER 1 DEFINITIONS, TERMINOLOGY, NOTATION 1.1 EVENTS AND THE SAMPLE SPACE Definition 1.1 An experiment is a one-off or repeatable process or procedure for which (a there is a well-defined

More information

Bandits, Experts, and Games

Bandits, Experts, and Games Bandits, Experts, and Games CMSC 858G Fall 2016 University of Maryland Intro to Probability* Alex Slivkins Microsoft Research NYC * Many of the slides adopted from Ron Jin and Mohammad Hajiaghayi Outline

More information

THE QUEEN S UNIVERSITY OF BELFAST

THE QUEEN S UNIVERSITY OF BELFAST THE QUEEN S UNIVERSITY OF BELFAST 0SOR20 Level 2 Examination Statistics and Operational Research 20 Probability and Distribution Theory Wednesday 4 August 2002 2.30 pm 5.30 pm Examiners { Professor R M

More information

Introduction to Machine Learning

Introduction to Machine Learning What does this mean? Outline Contents Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola December 26, 2017 1 Introduction to Probability 1 2 Random Variables 3 3 Bayes

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

CME 106: Review Probability theory

CME 106: Review Probability theory : Probability theory Sven Schmit April 3, 2015 1 Overview In the first half of the course, we covered topics from probability theory. The difference between statistics and probability theory is the following:

More information

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events.

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. 1 Probability 1.1 Probability spaces We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. Definition 1.1.

More information