APM 421 Probability Theory: Discrete Random Variables
Jay Taylor (ASU), Fall 2013

Outline

1. Motivation
2. Infinite Sets and Cardinality
3. Countable Additivity
4. Discrete Random Variables
5. Probability Generating Functions
6. Geometric and related distributions
7. Poisson distribution
8. Fluctuation Tests
9. Poisson Processes

Motivation: Distributions on Infinite Spaces

Example: Suppose that a coin with probability p = 1/2 of landing on heads is repeatedly flipped, and let N be the number of flips that land on tails before we get the first heads. Assuming that the flips are independent of one another, we expect that

$$P(N = k) = \left(\frac{1}{2}\right)^{k+1} \quad \text{for any integer } k \geq 0,$$

since the event {N = k} occurs if and only if the first k flips all land on tails and the (k+1)st toss lands on heads. This suggests that we can define N to be a random variable with values in the set of natural numbers $\mathbb{N} = \{0, 1, \cdots\}$ and distribution given by the above formula. In particular, notice that

$$\sum_{k=0}^{\infty} P(N = k) = \sum_{k=0}^{\infty} \left(\frac{1}{2}\right)^{k+1} = 1,$$

which suggests that $P(N < \infty) = 1$.
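The claimed formula is easy to check empirically. The following is a minimal Python sketch (not part of the original slides) that simulates the coin-flipping experiment and compares the observed frequencies of {N = k} with $(1/2)^{k+1}$; the function name sample_N and the number of trials are illustrative choices.

import random
from collections import Counter

def sample_N():
    """Number of tails before the first heads when a fair coin is flipped repeatedly."""
    n = 0
    while random.random() < 0.5:  # treat this branch as "tails"
        n += 1
    return n

trials = 200_000
counts = Counter(sample_N() for _ in range(trials))

for k in range(6):
    empirical = counts[k] / trials
    predicted = 0.5 ** (k + 1)
    print(f"P(N = {k}): empirical {empirical:.4f}, predicted {predicted:.4f}")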

Unfortunately, our current formulation of probability does not allow us to define such a random variable N. The problem is that at the moment we only require probability distributions to be finitely additive, which means that while we can calculate the probabilities of finite sets such as

$$P(N \leq 4) = \sum_{k=0}^{4} P(N = k) = \sum_{k=0}^{4} \left(\frac{1}{2}\right)^{k+1} = \frac{31}{32},$$

we cannot similarly conclude that

$$P(N \text{ is even}) = \sum_{k=0}^{\infty} P(N = 2k) = \sum_{k=0}^{\infty} \left(\frac{1}{2}\right)^{2k+1},$$

since

$$\{N \text{ is even}\} = \bigcup_{k=0}^{\infty} \{N = 2k\}$$

expresses the event that N is even as a disjoint union of infinitely many sets, and finite additivity tells us nothing in this case.

One way to address this problem is to define a distribution ν on $\mathbb{N}$ by requiring

$$\nu(A) = \sum_{k \in A} \left(\frac{1}{2}\right)^{k+1}$$

for every subset $A \subset \mathbb{N}$. Clearly, ν is coherent:

- $\nu(A) \geq 0$ for every $A \subset \mathbb{N}$;
- $\nu(\mathbb{N}) = \sum_{k \geq 0} \nu(\{k\}) = \sum_{k=0}^{\infty} \left(\frac{1}{2}\right)^{k+1} = 1$;
- ν is finitely additive, since given any pair of disjoint subsets $A, B \subset \mathbb{N}$,
$$\nu(A \cup B) = \sum_{k \in A \cup B} \left(\frac{1}{2}\right)^{k+1} = \sum_{k \in A} \left(\frac{1}{2}\right)^{k+1} + \sum_{k \in B} \left(\frac{1}{2}\right)^{k+1} = \nu(A) + \nu(B).$$

However, ν is not the only coherent distribution on $\mathbb{N}$ which assigns probability $(1/2)^{k+1}$ to each singleton set {k}. In fact, there are infinitely many such distributions and we don't yet know which one (if any) of these should be chosen as the distribution of N.

The previous example demonstrated that finite additivity on its own may not be strong enough to uniquely define the probabilities of infinite sets. Our next example will reveal an even more serious defect: although sure loss cannot occur if we use a coherent probability distribution to place bets on a finite sample space, this is not true for infinite sample spaces.

Example: Let $\mathbb{N} = \{0, 1, 2, \cdots\}$ denote the set of natural numbers and for each $m \geq 1$ and $k = 0, \cdots, m-1$, let $R_{m,k}$ be the set

$$R_{m,k} = \{k + nm : n \geq 0\} = \{k, k+m, k+2m, k+3m, \cdots\}.$$

The sets $R_{m,k}$ are called residue classes mod m. For example, $R_{2,0}$ is the set of non-negative even integers, while $R_{2,1}$ is the set of non-negative odd integers.

It can be shown that there exists a coherent distribution µ on $\mathbb{N}$ with the following properties:

- $\mu(R_{m,k}) = 1/m$ for all $m \geq 1$ and $k = 0, \cdots, m-1$.
- $\mu(B) = 0$ for any finite subset $B \subset \mathbb{N}$. In particular, $\mu(\{n\}) = 0$ for every $n \geq 0$.

Since $R_{1,0} = \mathbb{N}$, it follows that

$$1 = \mu(\mathbb{N}) = \mu\left(\bigcup_{n \geq 0} \{n\}\right) \neq \sum_{n \geq 0} \mu(\{n\}) = 0,$$

but this is no contradiction since we only require µ to be finitely additive. We will say that µ is a uniform distribution on the natural numbers.

Although µ is coherent, it has some unsavory properties. Let A be an event to which you assign probability P(A) = 1/2 and let X be the indicator variable for A, i.e., X = 1 if A occurs and X = 0 if $A^c$ occurs. We will define a second random variable Y with values in $\mathbb{N}$ as follows. Given any subset $B \subset \mathbb{N}$ define

$$P(Y \in B \mid X) = \begin{cases} \mu(B) & \text{if } X = 0 \\ \nu(B) & \text{if } X = 1, \end{cases}$$

where ν is the distribution defined in the first example and µ is the uniform distribution defined in this example.

Since Y has been defined in terms of X, we can use the law of total probability to calculate the probabilities of events of the form {Y ∈ B}. For example, observe that for every $n \geq 0$,

$$P(Y = n) = P(Y = n \mid X = 0)P(X = 0) + P(Y = n \mid X = 1)P(X = 1) = \mu(\{n\}) \cdot \frac{1}{2} + \nu(\{n\}) \cdot \frac{1}{2} = 0 \cdot \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^{n+2},$$

since µ assigns probability 0 to every finite subset of $\mathbb{N}$.

Likewise, we can use Bayes' formula to calculate the conditional distribution of X given Y, e.g.,

$$P(X = 1 \mid Y = n) = \frac{P(X = 1, Y = n)}{P(Y = n)} = \frac{P(Y = n \mid X = 1)P(X = 1)}{P(Y = n)} = \frac{(1/2)^{n+1} \cdot (1/2)}{(1/2)^{n+2}} = 1,$$

which holds for every $n \geq 0$. In other words, although X is equally likely to be equal to 1 or 0, as soon as we learn the value of Y we can immediately deduce that X = 1, no matter what value Y assumes.

These observations also lead to the following consequences for wagers on the event $A^c$. In the absence of any information about Y, we are willing to pay $0.50 for a $1 bet that $A^c$ will occur. However, if we subsequently learn the value of Y, then the value of our $1 wager on $A^c$ immediately becomes $0, since we are then certain that A will occur. This example illustrates a phenomenon known as dynamic sure loss: by gaining information we guarantee that we will lose money.

Notice that we cannot escape this quandary by re-assigning the unconditional probability of A to be 1, since in the absence of information about Y, A is as likely to occur as it is not to occur. Rather, the problem is that coherence is not a strong enough condition on distributions on infinite spaces to avoid certain forms of sure loss.

Fortunately, we can avoid these dilemmas by requiring that probability distributions on infinite spaces satisfy a stronger set of conditions.

Interlude: Infinite Sets and Cardinality

Before we can begin to extend our theory to sets with infinitely many elements, we need to take a closer look at some of the properties of infinite sets. We begin by addressing the following question: what do we mean when we say that two sets, A and B, have the same number of elements?

This is easy when A and B are finite. For example, if

A = {apples, oranges, pears}    B = {87, J, c}

then since A and B each contain three elements, it is clear that they both have the same number of elements. In other words, we count the number of elements in each set and check whether these numbers are equal.

To extend this concept further, we need to take a closer look at counting. When we count the number of elements in a set X and decide that this number is n, what we are doing is creating a function Φ from the set {1, 2, ..., n} into the set X that is both one-to-one and onto:

- Φ is one-to-one if no two distinct elements are assigned the same value by Φ, i.e., if $i \neq j$, then $\Phi(i) \neq \Phi(j)$;
- Φ is onto if every element in the range is the image of an element in the domain, i.e., for every $x \in X$, there is an element i such that $\Phi(i) = x$.

A function Φ that is both one-to-one and onto is said to be bijective.

In general, there may be many bijections Φ between {1, 2, ..., n} and X, but one way to select such a function is to label the elements of $X = \{x_1, \cdots, x_n\}$ and then define $\Phi(i) = x_i$ for $i = 1, \cdots, n$.

This way of thinking about counting can also be applied to pairs of sets whose sizes are being compared. Specifically, if X and Y both have n elements, then there are bijective functions $\Phi^{(X)}$ and $\Phi^{(Y)}$ from {1, ..., n} onto X and Y, respectively. However, this means that $\Psi = \Phi^{(Y)} \circ \left(\Phi^{(X)}\right)^{-1}$ is a bijective function from X onto Y. In fact, the converse is also true: if X and Y are finite and there is a bijection between X and Y, then they have the same numbers of elements.

For example, a bijection can be constructed between the sets A and B as follows:

apples ↔ 1 ↔ 87
oranges ↔ 2 ↔ J
pears ↔ 3 ↔ c

which gives us the mapping Ψ : A → B with Ψ(apples) = 87, Ψ(oranges) = J and Ψ(pears) = c.

These observations lead us to the following definition.

Definition: We say that two sets, X and Y, have the same cardinality, written |X| = |Y|, if there exists a bijective function Φ between X and Y. In contrast, we say that the cardinality of X is less than the cardinality of Y, written |X| < |Y|, if X and Y do not have the same cardinality and there is a subset $D \subset Y$ such that X and D have the same cardinality.

Remarks:

- Interpretation: Cardinality provides us with a way to compare the sizes of different sets. Sets that have the same cardinality have, in some sense, the same number of elements, even if that number is infinite.
- Cardinality is not the only way to measure the size of an infinite set, but it is one of the most basic notions insofar as it does not require additional structure on the set. Other more specialized notions of size include Lebesgue measure, Hausdorff dimension, and capacity.

Given any two finite sets A and B, we can show that either both sets have the same number of elements or one of the two sets has fewer elements than the other. This is a consequence of the fact that the counting numbers are well-ordered: given positive integers n and m, either n = m or n < m or n > m.

However, things are not so straightforward when we turn to infinite sets. In fact, the following two statements are equivalent in the sense that each one implies the other:

- Law of Trichotomy: Given any two sets X and Y, either |X| = |Y| or |X| < |Y| or |X| > |Y|.
- Axiom of Choice: Given any collection of distinct, non-empty sets $S_\alpha$, $\alpha \in A$, there exists a set C which contains exactly one element from each of the sets $S_\alpha$.

Although the axiom of choice is accepted by many mathematicians as one of the fundamental axioms of set theory, it leads to a number of odd results such as the Banach-Tarski paradox, which asserts that it is possible to dissect a three-dimensional ball into a finite number of pieces which can then be reassembled into two disjoint unit balls, each having the same volume as the original.

One of the stranger properties of cardinality is that two sets X and Y can have the same cardinality even when X is a proper subset of Y.

Example: The positive integers $\mathbb{Z}^+$ are a proper subset of the natural numbers $\mathbb{N}$, but the mapping $\Phi(n) = n + 1$ is a bijection from $\mathbb{N}$ onto $\mathbb{Z}^+$, and so both sets have the same cardinality.

Example: The even natural numbers $2\mathbb{N}$ are a proper subset of the natural numbers $\mathbb{N}$, but the mapping $\Phi(n) = 2n$ is a bijection from $\mathbb{N}$ onto $2\mathbb{N}$, and so these sets also have the same cardinality.

Hilbert's Paradox of the Grand Hotel: Suppose that a hotel contains an infinite number of rooms, numbered 1, 2, 3, ..., and that all of the rooms are occupied. A new guest arrives and seeks accommodation. To make a place for them, the hotel moves each guest from their current room to the room with the next higher number, e.g., the person in room 1 moves to room 2, the person in room 2 moves to room 3, and so forth. The new guest is then moved into room 1. In this way, the hotel is able to accommodate new arrivals even when there are no vacancies.

Sets that have the same cardinality as the natural numbers or one of its subsets play an especially important role in probability theory.

Definition: A set X is said to be countable if X is either finite or X has the same cardinality as the natural numbers $\mathbb{N}$. In the latter case, we say that X is countably infinite. If X is neither finite nor countably infinite, then X is said to be uncountable.

The following are examples of countable sets:

- The natural numbers $\mathbb{N} = \{0, 1, 2, \cdots\}$;
- The positive integers $\mathbb{Z}^+ = \{1, 2, \cdots\}$;
- The integers $\mathbb{Z} = \{0, \pm 1, \pm 2, \cdots\}$;
- The rational numbers $\mathbb{Q} = \{p/q : q \neq 0,\ p, q \in \mathbb{Z}\}$;
- Any countable union of countable sets, i.e., if $A_i$ is countable for every $i \geq 1$, then the union $A = \bigcup_i A_i$ is also countable.

As the following theorem shows, not all infinite sets are countably infinite.

Theorem: The set [0, 1] is uncountable.

Proof: We prove this by contradiction, using a clever method invented by Georg Cantor that has come to be known as the Cantor diagonalization argument. If [0, 1] is countable, then there is a bijection Φ between $\mathbb{Z}^+$ and [0, 1]. Each of the numbers $\Phi(n) \in [0, 1]$ has a decimal expansion which can be written as

$$\Phi(n) = 0.c_{n1}c_{n2}c_{n3}\cdots.$$

Let $x \in [0, 1]$ be the number with decimal expansion $x = 0.x_1 x_2 x_3 \cdots$, where $x_n = 2$ whenever $c_{nn} \neq 2$ and $x_n = 1$ whenever $c_{nn} = 2$. I claim that there is no integer n > 0 such that Φ(n) = x.

Indeed, if there is an n > 0 such that Φ(n) = x, then we can write the decimal expansion of x in two ways:

$$x = 0.x_1 x_2 x_3 \cdots = 0.c_{n1}c_{n2}c_{n3}\cdots.$$

However, since decimal expansions that do not end in a repeating series of all 0's or all 9's are unique, it must be the case that $x_i = c_{ni}$ for all $i \geq 1$. In particular, $x_n = c_{nn}$, which is a contradiction since we chose each $x_i$ so that $x_i \neq c_{ii}$. This shows that no bijection can exist between $\mathbb{Z}^+$ and [0, 1], which in turn implies that [0, 1] is uncountably infinite.

Remarks:

- Since $\mathbb{Z}^+$ has the same cardinality as the set $D = \{n^{-1} : n \geq 1\} \subset [0, 1]$, it follows that the cardinality of [0, 1] is strictly larger than that of $\mathbb{Z}^+$. In other words, some infinite sets are bigger than others.
- It can be shown that any interval [a, b] or (a, b) with a < b is uncountably infinite. In particular, the real numbers $\mathbb{R}$ are uncountable, as are all of the Euclidean spaces $\mathbb{R}^n$.

Countable Additivity

To avoid the kinds of difficulties exposed by the examples given at the beginning of these slides, we will require that probability distributions on infinite sets satisfy the following additional condition.

Definition: Let S be a set and let $\mathcal{P}(S)$ be the collection of all subsets of S, i.e., the power set of S. A function $\mu : \mathcal{P}(S) \to \mathbb{R}$ is said to be countably additive if for any countably infinite collection of disjoint sets $A_1, A_2, \cdots$ in $\mathcal{P}(S)$ we have

$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i).$$

Notice that if µ is countably additive and $\mu(\emptyset)$ is finite, then $\mu(\emptyset) = 0$. Indeed, if we take $A_i = \emptyset$ for all $i \geq 1$, then

$$\mu(\emptyset) = \sum_{i=1}^{\infty} \mu(\emptyset).$$

Theorem: Let S be a countably infinite set and suppose that $P : \mathcal{P}(S) \to [0, 1]$ is a countably additive function with P(S) = 1. Then P is coherent.

Proof: To show that P is coherent, we need only show that it is finitely additive. Suppose that $A_1, \cdots, A_n$ is a finite collection of disjoint subsets of S and define $A_{n+k} = \emptyset$ for every $k \geq 1$. Then $A_1, A_2, \cdots$ extended in this fashion is a countably infinite sequence of disjoint subsets of S and by countable additivity we know that

$$P\left(\bigcup_{i=1}^{n} A_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i),$$

which shows that P is finitely additive.

In view of the previous theorem, we will adopt the following definition for probability distributions on countably infinite sets.

Definition: Let $S = \{s_1, s_2, \cdots\}$ be a countably infinite set. A probability distribution on S is a function $P : \mathcal{P}(S) \to [0, 1]$ which satisfies the following conditions:

1. $P(A) \geq 0$ for every subset $A \subset S$;
2. $P(S) = 1$;
3. P is countably additive, i.e., if $A_1, A_2, \cdots$ is a countable sequence of disjoint subsets of S, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

Because every subset of a countably infinite set is either finite or countably infinite, every probability distribution on a countably infinite set is uniquely determined by the probabilities that it assigns to the individual members of the set.

Theorem: Suppose that P is a probability distribution on a countably infinite set $S = \{s_1, s_2, \cdots\}$. Then, for any subset $A \subset S$, we have

$$P(A) = \sum_{s \in A} P(\{s\}).$$

Proof: The result follows from the countable additivity of P and the fact that A can be expressed as a countable disjoint union of singleton sets containing the elements contained in A:

$$A = \bigcup_{s \in A} \{s\}.$$

In particular, this identity leads to an easy method for constructing probability distributions on countably infinite sets.

Theorem: Let $S = \{s_1, s_2, \cdots\}$ be a countably infinite set and suppose that $p_1, p_2, \cdots$ is a sequence of non-negative numbers that sums to 1:

$$\sum_{i=1}^{\infty} p_i = 1.$$

If $P : \mathcal{P}(S) \to [0, 1]$ is defined by

$$P(A) = \sum_{s_i \in A} p_i,$$

then P is a probability distribution on S.

Proof: It is clear from the definition that $P(A) \geq 0$ for every subset $A \subset S$ and also that

$$P(S) = \sum_{s_i \in S} p_i = \sum_{i=1}^{\infty} p_i = 1.$$

Furthermore, if $A_1, A_2, \cdots$ is a countably infinite sequence of disjoint subsets of S, then

$$P\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{s_i \in \bigcup_k A_k} p_i = \sum_{k=1}^{\infty} \sum_{s_i \in A_k} p_i = \sum_{k=1}^{\infty} P(A_k),$$

which shows that P is countably additive.
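The construction in this theorem is easy to experiment with numerically. Below is a minimal Python sketch (not from the slides) that evaluates $P(A) = \sum_{s_i \in A} p_i$ for the probability mass function $p_k = (1/2)^{k+1}$ of the opening example; the helper name prob is illustrative, and infinite events are approximated by truncating the series.

from fractions import Fraction

def prob(A, pmf):
    """P(A) = sum of the pmf over the elements of A (A must be a finite collection here)."""
    return sum(pmf(k) for k in A)

# pmf of the first example: p_k = (1/2)^(k+1) on the natural numbers
pmf = lambda k: Fraction(1, 2) ** (k + 1)

# Finite events can be evaluated exactly:
print(prob(range(5), pmf))                            # P(N <= 4) = 31/32
# Countably infinite events are approximated by truncating the series:
print(float(sum(pmf(2 * k) for k in range(200))))     # P(N even) ~ 2/3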

Discrete Random Variables

Having defined probability distributions on countably infinite spaces, we can also extend our definition of a random variable to include variables which can take on countably infinitely many possible values. The following definition is key.

Definition: A discrete random variable is a random quantity X which takes values in a countable set $S = \{x_1, x_2, \cdots\}$. (Here S can be finite or countably infinite.) In this case, the distribution of X is the probability distribution on S defined by

$$P(A) = P(X \in A)$$

for any subset $A \subset S$. Furthermore, the probability mass function of X is the function $p : S \to [0, 1]$ defined by

$$p(x) = P(X = x)$$

for any $x \in S$.

Example: For each $n \geq 0$, let $p_n = 2^{-(n+1)}$. Since

$$\sum_{n=0}^{\infty} p_n = \sum_{n=0}^{\infty} 2^{-(n+1)} = 1,$$

we can define a probability distribution on the natural numbers $\mathbb{N} = \{0, 1, 2, \cdots\}$ by setting

$$P(A) = \sum_{n \in A} p_n.$$

With this machinery in place, we can also formally define a random variable N which is equal to the number of tails obtained before the first heads when a fair coin is tossed repeatedly. In this case, the probability mass function of N is the function $p : \mathbb{N} \to [0, 1]$ defined by $p(n) = 2^{-(n+1)}$.

Example: There is no uniform distribution on the natural numbers. Indeed, a distribution on a set S is said to be uniform if every element in S has the same probability. Thus, if P were uniform on $\mathbb{N}$, then there would exist a non-negative number $c \geq 0$ such that

$$c = p_n = P(\{n\}) \quad \text{for every } n \geq 0.$$

However, since $\mathbb{N}$ is equal to the countably infinite disjoint union of the singleton sets {n} and we know that $P(\mathbb{N}) = 1$ for any probability distribution on $\mathbb{N}$, the countable additivity of P implies that

$$1 = P(\mathbb{N}) = \sum_{n=0}^{\infty} p_n = \sum_{n=0}^{\infty} c,$$

and the right-hand side is either 0 if c = 0 or ∞ if c > 0. In either case, we have a contradiction and so P cannot be uniform on $\mathbb{N}$.

Theorem: Suppose that X is a discrete random variable with values in the countable set S and let $p : S \to [0, 1]$ be the probability mass function of X. Then

$$\sum_{x \in S} p(x) = 1$$

and

$$P(X \in A) = \sum_{x \in A} p(x)$$

for any subset $A \subset S$.

Exercise: Prove this theorem.

Previously we defined the expected value of a random variable X with finitely many possible values $S = \{x_1, \cdots, x_n\}$ to be the weighted sum of these values:

$$E[X] = \sum_{i=1}^{n} P(X = x_i)\, x_i.$$

Although we would like to be able to extend this definition to random variables with countably infinitely many possible values, the following example shows that this is not entirely straightforward.

Example: Let X be a random variable with values in the integers $\mathbb{Z} = \{0, \pm 1, \pm 2, \cdots\}$ and the following probability mass function $p : \mathbb{Z} \to [0, 1]$:

$$p(n) = P(X = n) = \begin{cases} 0 & \text{if } n = 0 \\ \dfrac{1}{C n^2} & \text{if } n \neq 0. \end{cases}$$

The constant C included in the definition of the probability mass function of X is said to be a normalizing constant and must be chosen so that the probabilities sum to 1:

$$1 = \sum_{n \in \mathbb{Z}} p(n) = \frac{2}{C} \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{2}{C} \cdot \frac{\pi^2}{6} = \frac{\pi^2}{3C}.$$

Thus $C = \pi^2/3$ and so X is a properly defined discrete random variable. Now suppose that we define the expectation of X to be

$$E[X] \stackrel{?}{=} \sum_{n \in \mathbb{Z}} P(X = n)\, n = \frac{1}{C} \sum_{n \neq 0} \frac{n}{n^2} = \frac{1}{C} \sum_{n \neq 0} \frac{1}{n}.$$

Unfortunately, the last expression is ambiguous since its value depends on the order in which the terms are included in the sum. For example, if we first add the positive terms and then the negative terms, then we obtain the difference

$$\sum_{n \neq 0} \frac{1}{n} \stackrel{?}{=} \sum_{n \geq 1} \frac{1}{n} - \sum_{n \geq 1} \frac{1}{n} = \infty - \infty,$$

which is undefined. Alternatively, if we group the terms by absolute value and sum in order of increasing magnitude, then we obtain

$$\sum_{n \geq 1} \left(\frac{1}{n} - \frac{1}{n}\right) = \sum_{n \geq 1} 0 = 0.$$

In fact, given any real number $x \in \mathbb{R}$, it is possible to order the terms in this series so that the sum is equal to x. This shows that our previous definition of the expectation cannot be automatically extended to variables that take on infinitely many values, since the infinite series might not even exist or, if it does exist, might depend on the order in which we list the possible values.

Interlude: Infinite Series

We begin by recalling what it means for a sequence of real numbers to converge to a limit.

Definition: A sequence of real numbers $(x_n : n \geq 1)$ is said to converge to the limit $x \in \mathbb{R}$, written

$$x = \lim_{n \to \infty} x_n,$$

if for every $\epsilon > 0$ there exists a positive integer $N_\epsilon$ such that for every $n \geq N_\epsilon$ we have $|x - x_n| < \epsilon$.

Example: The sequence $(1/n : n \geq 1)$ converges to the limit x = 0, since for any $n \geq N_\epsilon = 1 + \lceil \epsilon^{-1} \rceil$ we have

$$|0 - x_n| = |0 - 1/n| = \frac{1}{n} \leq \frac{1}{1 + \lceil \epsilon^{-1} \rceil} < \frac{1}{\epsilon^{-1}} = \epsilon.$$

Although any sequence of real numbers $(x_n : n \geq 1)$ can be assembled into a formal series, $\sum_{n=1}^{\infty} x_n$, the previous example shows that this sum is not always uniquely defined. For this reason, we need to pick out a special class of series that can be summed.

Definition: An infinite series consisting of the terms $(x_n : n \geq 1)$ is said to be convergent if the sequence of partial sums $s_n = x_1 + \cdots + x_n$ is convergent, i.e., if the limit

$$\sum_{i=1}^{\infty} x_i \equiv \lim_{n \to \infty} \sum_{i=1}^{n} x_i$$

exists.

Our example also revealed that the value or even the existence of the limit of an infinite series may depend on the order of appearance of the terms in that sequence. This is unacceptable if we wish to use infinite series to define expectations of random variables with countably infinitely many values, since the order in which we list these values is completely arbitrary. Fortunately, there is a large class of infinite series that do not suffer from this ambiguity.

Definition: An infinite series consisting of the terms $(x_n : n \geq 1)$ is said to be absolutely convergent if the series with terms $(|x_n| : n \geq 1)$ is convergent, i.e., if the limit

$$\sum_{n=1}^{\infty} |x_n| = \lim_{n \to \infty} \sum_{i=1}^{n} |x_i|$$

exists. If the series formed from $(x_n : n \geq 1)$ is convergent, but not absolutely convergent, then we say that it is conditionally convergent.

Example: Consider the alternating series with terms $x_n = (-1)^{n+1}/n$. This series is convergent with limit

$$\ln(2) = \lim_{n \to \infty} \sum_{k=1}^{n} x_k,$$

but it is only conditionally convergent since

$$\lim_{n \to \infty} \sum_{k=1}^{n} |x_k| = \sum_{k=1}^{\infty} \frac{1}{k} = \infty.$$

There is a profound difference between absolutely and conditionally convergent series that has a direct impact on our ability to define expectations. This is highlighted by the next two theorems.

Theorem: Suppose that $(x_n : n \geq 1)$ are the terms in an absolutely convergent series and let $(y_n : n \geq 1)$ be a rearrangement of these terms. Then $(y_n : n \geq 1)$ is also absolutely convergent and the limit of the series does not depend on the order in which we sum the terms:

$$\sum_{n=1}^{\infty} x_n = \sum_{n=1}^{\infty} y_n.$$

In particular, if $(a_n : n \geq 1)$ is the sequence of non-negative values in $(x_n : n \geq 1)$ and $(-b_n : n \geq 1)$ is the sequence of negative values, both listed in order of appearance, then the series $\sum a_n$ and $\sum b_n$ are both absolutely convergent and

$$\sum_{n=1}^{\infty} x_n = \sum_{n \geq 1} a_n - \sum_{n \geq 1} b_n.$$

The previous theorem states that absolutely convergent series are well-behaved in the sense that their limits do not depend on the order in which we sum their terms. The next theorem shows that absolute convergence is also a necessary condition for this to be true.

Theorem: Suppose that the series $\sum x_n$ is conditionally convergent and let $y \in \mathbb{R}$ be a real number. Then there is a rearrangement of the sequence $(x_n : n \geq 1)$, say $(y_n : n \geq 1)$, such that the series $\sum y_n$ converges to y, i.e.,

$$y = \lim_{n \to \infty} \sum_{k=1}^{n} y_k.$$

In other words, a conditionally convergent series can be rearranged so that it converges to any limit that we like.
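Here is a minimal numerical illustration (not from the slides) of this rearrangement phenomenon for the conditionally convergent alternating harmonic series $x_n = (-1)^{n+1}/n$: greedily take positive terms while the partial sum lies below a chosen target and negative terms while it lies above. The function name and the number of terms are illustrative choices.

import math

def rearranged_partial_sum(target, n_terms=100_000):
    """Greedy rearrangement of the alternating harmonic series sum_{n>=1} (-1)^(n+1)/n.

    Add positive terms (1, 1/3, 1/5, ...) while the partial sum is at or below the
    target, and negative terms (-1/2, -1/4, ...) while it is above; the rearranged
    series converges to the target.
    """
    s = 0.0
    pos, neg = 1, 2  # next odd (positive term) and even (negative term) denominators
    for _ in range(n_terms):
        if s <= target:
            s += 1.0 / pos
            pos += 2
        else:
            s -= 1.0 / neg
            neg += 2
    return s

print(rearranged_partial_sum(math.log(2)))  # ~0.6931
print(rearranged_partial_sum(3.0))          # ~3.0
print(rearranged_partial_sum(-1.0))         # ~-1.0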

These last two theorems lead us to the following definition of the expectation of a random variable that takes on countably infinitely many values.

Definition: Suppose that X is a random variable with values in a countably infinite set $S = \{x_1, x_2, \cdots\} \subset \mathbb{R}$ and let $p(x_i) = p_i$ be the probability mass function of X. Then the expectation of X is defined to be the quantity

$$E[X] = \sum_{k=1}^{\infty} p_k x_k = \lim_{n \to \infty} \sum_{k=1}^{n} p_k x_k,$$

provided that the series $\sum p_k x_k$ is absolutely convergent, i.e., provided that

$$E[|X|] = \sum_{k=1}^{\infty} p_k |x_k| = \lim_{n \to \infty} \sum_{k=1}^{n} p_k |x_k| < \infty.$$

If this condition is not satisfied, then we say that the expectation of X does not exist.

Example: Let N be the random variable with values in the natural numbers $\mathbb{N} = \{0, 1, 2, \cdots\}$ and probability mass function $p(n) = 2^{-(n+1)}$. Since N only takes on non-negative values, we need only check that the series $\sum_{n \geq 0} p_n n$ is convergent. However, this is a consequence of the following calculation:

$$\sum_{n=0}^{\infty} 2^{-(n+1)}\, n = \sum_{n=0}^{\infty} 2^{-(n+1)} \sum_{k=1}^{n} 1 = \sum_{k=1}^{\infty} \sum_{n=k}^{\infty} 2^{-(n+1)} = \sum_{k=1}^{\infty} 2^{-k} = 1.$$

Thus the expectation of N exists and is equal to E[N] = 1.
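As a quick sanity check of this value, the series can also be evaluated symbolically, assuming the sympy package is available:

import sympy as sp

n = sp.symbols('n', integer=True, nonnegative=True)
EN = sp.Sum(n / 2**(n + 1), (n, 0, sp.oo)).doit()   # E[N] = sum_{n>=0} n * 2^(-(n+1))
print(EN)   # 1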

Most of the results that we proved about expectations of random variables taking at most finitely many values extend to expectations of random variables taking countably infinitely many values. Here I will prove one such result and state several others (see the text for proofs).

Theorem: Suppose that X is a random variable having an expectation. Let k and b be constants and define Y = kX + b. Then Y has an expectation and

$$E[Y] = kE[X] + b.$$

Proof: Suppose that X takes values in the set $S = \{x_1, x_2, \cdots\}$ and let $p_i = P(X = x_i)$. Then

$$E[|Y|] = \sum_{i=1}^{\infty} p_i |k x_i + b| \leq \sum_{i=1}^{\infty} p_i \left(|k||x_i| + |b|\right) = |k| \sum_{i=1}^{\infty} p_i |x_i| + |b| \sum_{i=1}^{\infty} p_i = |k|\, E[|X|] + |b| < \infty.$$

This calculation shows that E[Y] exists. Furthermore, its value is

$$E[Y] = \sum_{i=1}^{\infty} p_i (k x_i + b) = \lim_{n \to \infty} \sum_{i=1}^{n} p_i (k x_i + b) = \lim_{n \to \infty} \left( k \sum_{i=1}^{n} p_i x_i + b \sum_{i=1}^{n} p_i \right) = k \lim_{n \to \infty} \sum_{i=1}^{n} p_i x_i + b \lim_{n \to \infty} \sum_{i=1}^{n} p_i = k \sum_{i=1}^{\infty} p_i x_i + b = kE[X] + b.$$

The remaining properties are stated as theorems.

Theorem:

1. Suppose that X and Y are random variables and that the expectations E[X] and E[Y] both exist. Then the expectation of X + Y exists and is equal to E[X + Y] = E[X] + E[Y].
2. In general, suppose that the expectations of the random variables $X_1, \cdots, X_n$ exist and let $c_1, \cdots, c_n$ be constants. Then the expectation of the variable $c_1 X_1 + \cdots + c_n X_n$ exists and is equal to
$$E\left[\sum_{i=1}^{n} c_i X_i\right] = \sum_{i=1}^{n} c_i E[X_i].$$

In other words, expectations remain linear even when extended to random variables taking countably infinitely many values.

Theorem: Suppose that X is a random variable taking countably infinitely many values in the set $S = \{x_1, x_2, \cdots\}$ and let $g : S \to \mathbb{R}$. Then Y = g(X) has expectation

$$E[Y] = \sum_{i=1}^{\infty} P(X = x_i)\, g(x_i)$$

provided that $E[|Y|] < \infty$.

Remark: The existence of E[X] is not enough to guarantee the existence of E[g(X)]. For example, if $P(N = n) = 2^{-(n+1)}$, then we know that E[N] = 1 exists. However, if $g(n) = (-2)^n$, then

$$E[|g(N)|] = \sum_{n=0}^{\infty} 2^{-(n+1)} |(-2)^n| = \sum_{n=0}^{\infty} \frac{1}{2} = \infty,$$

and so E[g(N)] does not exist.

Theorem: Let X and Y be random variables taking at most countably many values and suppose that E[X] and $E[X \mid Y = y_i]$ exist for all possible values $y_i$ of Y. Then the random variable $E[X \mid Y]$ has an expectation and

$$E[X] = E\big[E[X \mid Y]\big].$$

This result is sometimes known as the law of iterated expectations.

Theorem: Let X and Y be independent random variables and suppose that the expectations of g(X) and h(Y) exist, where g and h are functions. Then g(X) and h(Y) are independent random variables and the expectation of g(X)h(Y) exists and is equal to

$$E[g(X)h(Y)] = E[g(X)]\, E[h(Y)].$$

Probability Generating Functions

Definition: Let X be a random variable with values in the natural numbers $\mathbb{N} = \{0, 1, \cdots\}$. The probability generating function of X is the function $\psi_X$ defined by

$$\psi_X(t) = E\left[t^X\right] = \sum_{n=0}^{\infty} P(X = n)\, t^n$$

for those values of $t \in \mathbb{R}$ such that the series on the right-hand side converges. The set of all such t is called the domain of $\psi_X$.

Remarks: The probability generating function is an alternative way of encoding information about the distribution of a random variable. Our aim is to learn about the distribution by studying the properties of the probability generating function.

Example: If X is a binomial random variable with parameters n and p, then the probability generating function of X is

$$\psi_X(t) = E\left[t^X\right] = \sum_{k=0}^{n} P(X = k)\, t^k = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} t^k = \sum_{k=0}^{n} \binom{n}{k} (pt)^k (1-p)^{n-k} = (pt + 1 - p)^n,$$

and the domain of $\psi_X$ is the entire real line.
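A short Python check (not part of the slides) of this closed form: evaluate the defining series term by term and compare it with $(pt + 1 - p)^n$ at a few values of t; the parameter values below are illustrative.

from math import comb

def pgf_binomial_sum(t, n, p):
    """Evaluate E[t^X] = sum_k P(X = k) t^k for X ~ Binomial(n, p) term by term."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * t**k for k in range(n + 1))

def pgf_binomial_closed(t, n, p):
    """Closed form (p t + 1 - p)^n derived above."""
    return (p * t + 1 - p) ** n

n, p = 10, 0.3
for t in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(t, pgf_binomial_sum(t, n, p), pgf_binomial_closed(t, n, p))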

Because the probability generating function is defined by a power series expansion, its domain is at least as large as the radius of convergence of that series. Recall that the radius of convergence of a power series $\phi(x) = \sum c_n x^n$ is the largest number ρ such that the series converges for all x with $|x| < \rho$:

$$\rho = \sup\left\{ r > 0 : \sum_{n=0}^{\infty} |c_n x^n| < \infty \text{ whenever } |x| \leq r \right\}.$$

There are many methods that can be used to determine the radius of convergence of a power series, but one of these is the so-called ratio test.

Theorem: Let $\phi(x) = \sum_n c_n x^n$ be a power series and suppose that the limit

$$\rho \equiv \lim_{n \to \infty} \left|\frac{c_n}{c_{n+1}}\right|$$

exists. Then ρ is the radius of convergence of this power series.

Example: Let X be a natural number-valued random variable with distribution

$$P(X = n) = \begin{cases} 0 & \text{if } n = 0 \\ \dfrac{6}{\pi^2 n^2} & \text{if } n \geq 1, \end{cases}$$

and let $\psi_X$ be the probability generating function of X:

$$\psi_X(t) = \frac{6}{\pi^2} \sum_{n=1}^{\infty} \frac{t^n}{n^2}.$$

To apply the ratio test, we calculate the limit

$$\rho = \lim_{n \to \infty} \left|\frac{c_n}{c_{n+1}}\right| = \lim_{n \to \infty} \frac{(n+1)^2}{n^2} = 1,$$

which shows that the radius of convergence is 1. Since $\psi_X(1) < \infty$ and $\psi_X(-1) < \infty$, the domain of $\psi_X$ is [-1, 1].

An important property of probability generating functions is that they uniquely determine the distribution of a random variable.

Theorem: Suppose that X and Y are natural number-valued random variables with identical probability generating functions, i.e.,

$$\psi_X(t) = E\left[t^X\right] = E\left[t^Y\right] = \psi_Y(t),$$

and both functions have the same domain. Then X and Y have the same distribution, i.e.,

$$P(X = n) = P(Y = n) \quad \text{for every } n \geq 0.$$

Remark: Notice that the theorem only asserts that X and Y have the same distribution, not that X = Y. In such cases we say that X and Y are identical in distribution and we write $X \stackrel{d}{=} Y$.

Proof: Because the radius of convergence of $\psi_X$ and $\psi_Y$ is greater than or equal to 1, we can perform the following differentiations:

$$\frac{d^n}{dt^n} \psi_X(t)\bigg|_{t=0} = \frac{d^n}{dt^n}\left(\sum_{k=0}^{\infty} P(X = k)\, t^k\right)\bigg|_{t=0} = \sum_{k=0}^{\infty} P(X = k)\, \frac{d^n}{dt^n} t^k \bigg|_{t=0} = \sum_{k=n}^{\infty} P(X = k)\, \frac{k!}{(k-n)!}\, t^{k-n} \bigg|_{t=0} = n!\, P(X = n).$$

Similarly,

$$\frac{d^n}{dt^n} \psi_Y(t)\bigg|_{t=0} = n!\, P(Y = n),$$

but since $\psi_X = \psi_Y$, the two functions have identical derivatives of all orders and so P(X = n) = P(Y = n).
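The identity $P(X = n) = \psi_X^{(n)}(0)/n!$ used in this proof can be checked symbolically. The sketch below (assuming sympy is installed; parameter values are illustrative) recovers the binomial probabilities from the derivatives of the pgf $(pt + 1 - p)^n$ at t = 0:

import sympy as sp

t = sp.symbols('t')
n, p = 4, sp.Rational(1, 3)
psi = (p * t + 1 - p) ** n          # pgf of Binomial(4, 1/3)

# Recover P(X = k) = psi^(k)(0) / k!  and compare with the binomial pmf
for k in range(n + 1):
    from_pgf = sp.diff(psi, t, k).subs(t, 0) / sp.factorial(k)
    direct = sp.binomial(n, k) * p**k * (1 - p)**(n - k)
    print(k, from_pgf, sp.simplify(from_pgf - direct) == 0)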

Probability generating functions of sums of independent random variables are particularly well behaved.

Theorem: Suppose that $X_1, \cdots, X_n$ are independent natural number-valued random variables with probability generating functions $\psi_{X_1}, \cdots, \psi_{X_n}$. Then the probability generating function of $X = X_1 + \cdots + X_n$ is

$$\psi_X(t) = \prod_{i=1}^{n} \psi_{X_i}(t),$$

and the domain of $\psi_X$ is the intersection of the domains of the functions $\psi_{X_1}, \cdots, \psi_{X_n}$.

Proof: Since $X_1, \cdots, X_n$ are independent, so are the random variables $t^{X_1}, \cdots, t^{X_n}$ for every value of t. Consequently,

$$\psi_X(t) = E\left[t^{X_1 + \cdots + X_n}\right] = E\left[\prod_{i=1}^{n} t^{X_i}\right] = \prod_{i=1}^{n} E\left[t^{X_i}\right] = \prod_{i=1}^{n} \psi_{X_i}(t),$$

provided that t is contained in the domain of each of the functions $\psi_{X_i}$.

Example: Let $X_1, \cdots, X_n$ be independent Bernoulli random variables, each with parameter p, and let $X = X_1 + \cdots + X_n$. Since $X_1, \cdots, X_n$ all have the same probability generating function

$$\psi_{X_i}(t) = 1 - p + pt,$$

it follows that the probability generating function of X is

$$\psi_X(t) = \prod_{i=1}^{n} \psi_{X_i}(t) = (1 - p + pt)^n.$$

Since this is the same probability generating function that we found for a binomial random variable with parameters n and p, it follows that X is itself a binomial random variable with these parameters.

Probability generating functions can also be used to calculate the mean and the variance of a random variable.

Theorem: Suppose that X is a random variable with probability generating function $\psi_X$ and assume that the radius of convergence of $\psi_X$ is greater than 1. Then the mean and the variance of X are equal to

$$E[X] = \psi_X'(1), \qquad \mathrm{Var}(X) = \psi_X''(1) + \psi_X'(1) - \left(\psi_X'(1)\right)^2.$$

Proof: Provided that the radius of convergence is greater than 1, we can differentiate inside the series:

$$\psi_X'(1) = \sum_{k=0}^{\infty} P(X = k)\, \frac{d}{dt} t^k \bigg|_{t=1} = \sum_{k=1}^{\infty} P(X = k)\, k\, t^{k-1} \bigg|_{t=1} = \sum_{k=0}^{\infty} P(X = k)\, k = E[X].$$

Similarly,

$$\psi_X''(1) = \sum_{k=0}^{\infty} P(X = k)\, \frac{d^2}{dt^2} t^k \bigg|_{t=1} = \sum_{k=2}^{\infty} P(X = k)\, k(k-1) = E[X^2] - E[X],$$

which shows that

$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \psi_X''(1) + \psi_X'(1) - \left(\psi_X'(1)\right)^2.$$

Example: If X is a binomial random variable with parameters n and p, then the probability generating function of X is $\psi_X(t) = (1 - p + pt)^n$, which has derivatives $\psi_X'(1) = np$ and $\psi_X''(1) = n(n-1)p^2$. Consequently,

$$E[X] = np$$

and

$$\mathrm{Var}(X) = \psi_X''(1) + \psi_X'(1) - \left(\psi_X'(1)\right)^2 = n(n-1)p^2 + np - n^2 p^2 = np(1-p).$$

These results agree with those that we previously calculated through more direct means.
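The same differentiation can be carried out symbolically. A sketch (again assuming sympy; symbol names are illustrative) that recovers the binomial mean and variance from the pgf:

import sympy as sp

t, n, p = sp.symbols('t n p', positive=True)
psi = (1 - p + p * t) ** n          # pgf of Binomial(n, p)

d1 = sp.diff(psi, t).subs(t, 1)     # psi'(1)
d2 = sp.diff(psi, t, 2).subs(t, 1)  # psi''(1)

mean = sp.simplify(d1)
var = sp.simplify(d2 + d1 - d1**2)
print(mean)  # n*p
print(var)   # n*p*(1 - p), possibly printed in an equivalent rearranged form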

The Geometric Distribution

Suppose that a series of independent trials is performed and that each trial has probability p of failing. If X is the number of successes that occur before the first failure, then

$$P(X = n) = (1 - p)^n p \quad \text{for } n \geq 0.$$

This distribution is important enough to have its own name.

Definition: A random variable X with values in the natural numbers is said to have the geometric distribution with parameter p, written $X \sim \mathrm{Geometric}(p)$, if the probability mass function of X is

$$P(X = n) = (1 - p)^n p.$$


Exercise: Suppose that $X \sim \mathrm{Geometric}(p)$. Find the probability generating function $\psi_X(t) = E[t^X]$ and use this to calculate the mean and the variance of X.

Solution: The probability generating function of X is

$$\psi_X(t) = \sum_{n=0}^{\infty} p(1-p)^n t^n = \frac{p}{1 - t(1-p)}.$$

Since

$$\psi_X'(t) = \frac{p(1-p)}{(1 - t(1-p))^2} \quad \text{and} \quad \psi_X''(t) = \frac{2p(1-p)^2}{(1 - t(1-p))^3},$$

it follows that $\psi_X'(1) = \dfrac{1-p}{p}$ and $\psi_X''(1) = \dfrac{2(1-p)^2}{p^2}$. Therefore

$$E[X] = \frac{1-p}{p}$$

and

$$\mathrm{Var}(X) = \frac{2(1-p)^2}{p^2} + \frac{1-p}{p} - \left(\frac{1-p}{p}\right)^2 = \frac{1-p}{p^2}.$$
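These formulas are easy to confirm by simulation. A minimal sketch (not from the slides; the sampler name and parameter values are illustrative):

import random
import statistics

def sample_geometric(p):
    """Number of successes before the first failure, with failure probability p per trial."""
    n = 0
    while random.random() > p:   # success with probability 1 - p
        n += 1
    return n

p = 0.25
draws = [sample_geometric(p) for _ in range(200_000)]
print(statistics.mean(draws), (1 - p) / p)          # both ~3.0
print(statistics.variance(draws), (1 - p) / p**2)   # both ~12.0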

The geometric distribution can be used to model random lifespans when the probability of death or failure per unit time is constant. In fact, the geometric distribution is said to be memoryless because the probability of survival over a period [t, t+s] depends only on the duration of the period and not on its starting time:

$$P(X > t + s \mid X > t) = \frac{P(X > t + s)}{P(X > t)} = \frac{(1-p)^{t+s}}{(1-p)^t} = (1-p)^s = P(X > s).$$

In other words, knowing that an individual has survived until time t makes it neither more nor less probable that they will survive for an additional s units of time.

Example: Suppose that we believe that the lifespan of an electronic component can be modeled by a geometric distribution with an unknown parameter p and that we measure the lifespans of m copies of the component in order to estimate p. Let $X_1 = x_1, \cdots, X_m = x_m$ be the observed lifespans and assume that these are independent. The likelihood function for p given the data $D = (x_1, \cdots, x_m)$ is

$$L(p \mid D) \equiv P_p(X_1 = x_1, \cdots, X_m = x_m) = \prod_{i=1}^{m} P_p(X_i = x_i) = \prod_{i=1}^{m} p(1-p)^{x_i} = p^m (1-p)^{x},$$

where $x = x_1 + \cdots + x_m$. The notation $P_p$ was used above to indicate that we are calculating the probability of the data under the assumption that the parameter of the geometric distribution is p.

The likelihood function tells us how the probability of the data varies with our choice of the parameter p. One way to select a point estimate of p is to choose the value of p that maximizes the probability of the data. This estimate is called the maximum likelihood estimate of p and can be found by maximizing the function $L(p \mid D)$. To this end, we differentiate $L(p \mid D)$ with respect to p and set the result equal to 0:

$$0 = \frac{d}{dp} L(p \mid D) = \frac{d}{dp}\left(p^m (1-p)^x\right) = m p^{m-1}(1-p)^x - x p^m (1-p)^{x-1} = p^m (1-p)^x \left(\frac{m}{p} - \frac{x}{1-p}\right).$$

Solving for p shows that the maximum likelihood estimate of p is

$$\hat{p}_{ML} = \frac{m}{m + x}.$$
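A small Python sketch (not from the slides) that applies this estimator to simulated lifespans; the true parameter and sample size are illustrative, and the sampler is redefined here so the snippet is self-contained:

import random

def sample_geometric(p):
    """Number of successes before the first failure (failure probability p per trial)."""
    n = 0
    while random.random() > p:
        n += 1
    return n

true_p = 0.2
m = 1_000
lifespans = [sample_geometric(true_p) for _ in range(m)]

# Maximum likelihood estimate derived above: p_hat = m / (m + x), where x = sum of the data
x = sum(lifespans)
p_hat = m / (m + x)
print(p_hat)   # should be close to 0.2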

If we use the geometric distribution to model random lifespans, then we are implicitly assuming that a single failure is sufficient to cause death or system collapse. However, many systems are robust in the sense that multiple independent components must fail for death to result. To model the lifespan of such a system we will introduce a more general class of distributions.

Definition: A random variable X with values in the natural numbers is said to have the negative binomial distribution with parameters $r \geq 1$ and $p \in [0, 1]$, written $X \sim \mathrm{NB}(r, p)$, if the probability mass function of X is

$$P(X = n) = \binom{n + r - 1}{n} p^r (1-p)^n.$$

Remark: The negative binomial distribution with parameters r = 1 and $p \in [0, 1]$ is just the geometric distribution with parameter p.

The negative binomial distribution arises in the following way. Suppose that a sequence of independent trials is performed and that each trial has probability p of failure. If X is the number of successes that occur before the r-th failure, then $X \sim \mathrm{NB}(r, p)$.

To verify this claim, observe that the event {X = n} occurs if the first n + r − 1 trials result in r − 1 failures and n successes, and the (n + r)-th trial results in a failure. However, the probability of n successes in the first n + r − 1 trials is given by the binomial probability

$$P(n \text{ successes in the first } n + r - 1 \text{ trials}) = \binom{n + r - 1}{n} p^{r-1} (1-p)^n.$$

Furthermore, since the outcome of the (n + r)-th trial is independent of the first n + r − 1 trials, it follows that

$$P(X = n) = \binom{n + r - 1}{n} p^{r-1} (1-p)^n \cdot p = \binom{n + r - 1}{n} p^r (1-p)^n,$$

which is the negative binomial distribution.

Suppose that $X_1, \cdots, X_r$ are independent geometric random variables with parameter p and let $X = X_1 + \cdots + X_r$. Then $X \sim \mathrm{NB}(r, p)$. Indeed, if we interpret $X_1$ as the number of successes that occur before the first failure, $X_2$ as the number of successes that occur between the first and second failures, etc., then it is clear that X is the number of successes that occur before the cumulative number of failures is equal to r.

This observation makes it easy to calculate the probability generating function of X:

$$\psi_X(t) = E\left[t^X\right] = \prod_{i=1}^{r} \psi_{X_i}(t) = \left(\frac{p}{1 - t(1-p)}\right)^r.$$

Furthermore, by differentiating $\psi_X(t)$, we can calculate the mean and the variance of X, which are

$$E[X] = \frac{r(1-p)}{p}, \qquad \mathrm{Var}(X) = \frac{r(1-p)}{p^2}.$$
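The representation of NB(r, p) as a sum of r independent geometrics also gives a simple sampler, which can be used to check the mean and variance formulas by simulation. A hedged sketch (parameter values illustrative, not from the slides):

import random
import statistics

def sample_geometric(p):
    n = 0
    while random.random() > p:   # success with probability 1 - p
        n += 1
    return n

def sample_neg_binomial(r, p):
    """Sum of r independent Geometric(p) variables, i.e. NB(r, p)."""
    return sum(sample_geometric(p) for _ in range(r))

r, p = 3, 0.4
draws = [sample_neg_binomial(r, p) for _ in range(100_000)]
print(statistics.mean(draws), r * (1 - p) / p)          # both ~4.5
print(statistics.variance(draws), r * (1 - p) / p**2)   # both ~11.25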

The Poisson Distribution

Suppose that we perform a large number of independent trials, say $n \geq 100$, and that the probability of a success on any one trial is small, say $p_n = \lambda/n \ll 1$. If we let $X^{(n)}$ denote the total number of successes that occur in all n trials, then $X^{(n)} \sim \mathrm{Binomial}(n, p_n)$ and therefore

$$P(X^{(n)} = k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{n!}{(n-k)!\,k!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \left(\frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k}\right) \left(\frac{\lambda^k}{k!}\right) \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-k}.$$

Notice that three of the terms in the last line depend on n and that these converge to finite limits as $n \to \infty$:

$$\lim_{n \to \infty} \frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} = 1, \qquad \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-k} = 1, \qquad \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}.$$

Consequently, for every integer $k \geq 0$, the probabilities $P(X^{(n)} = k)$ converge to a limit as $n \to \infty$, which is

$$\lim_{n \to \infty} P(X^{(n)} = k) = e^{-\lambda} \left(\frac{\lambda^k}{k!}\right).$$
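A quick numerical illustration of this convergence (not from the slides; λ, k, and the values of n are illustrative): the Binomial(n, λ/n) probabilities approach the Poisson probability as n grows.

from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam, k = 2.0, 3
for n in (10, 100, 1000, 10000):
    print(n, binom_pmf(k, n, lam / n), poisson_pmf(k, lam))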

To verify the third limit on the preceding page, recall that the Taylor series for log(1 + x) is

$$\log(1 + x) = \sum_{n=1}^{\infty} (-1)^{n+1} \frac{x^n}{n},$$

which converges as long as $x \in (-1, 1]$. In particular, when $|x| \ll 1$, we can write $\log(1 + x) = x + O(x^2)$, where $O(x^2)$ stands for a remainder term that is bounded by a constant times $x^2$. Therefore,

$$\lim_{n \to \infty} \log\left(1 - \frac{\lambda}{n}\right)^n = \lim_{n \to \infty} n \log\left(1 - \frac{\lambda}{n}\right) = \lim_{n \to \infty} n\left(-\frac{\lambda}{n} + O(n^{-2})\right) = -\lambda.$$

However, since $e^x$ is continuous on $(-\infty, \infty)$, we can exponentiate both sides of this identity to obtain

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}.$$

Furthermore, the limiting values of the probabilities sum to 1 when k is allowed to range over all of the natural numbers:

$$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.$$

These observations motivate the following definition:

Definition: A random variable X is said to have the Poisson distribution with parameter $\lambda \geq 0$ if X takes values in the non-negative integers with probability mass function

$$p_X(k) = P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}.$$

In this case we write $X \sim \mathrm{Poisson}(\lambda)$.

Remark: The Poisson distribution takes its name from that of the 19th century French mathematician Siméon Denis Poisson (1781-1840).

The probability generating function of the Poisson distribution is

$$\psi_X(t) = E\left[t^X\right] = \sum_{k=0}^{\infty} p_X(k)\, t^k = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} = e^{-\lambda} e^{\lambda t} = e^{\lambda(t-1)}.$$

Differentiating twice with respect to t gives

$$\psi_X'(t) = \lambda e^{\lambda(t-1)}, \qquad \psi_X''(t) = \lambda^2 e^{\lambda(t-1)},$$

and we then find

$$E[X] = \psi_X'(1) = \lambda, \qquad \mathrm{Var}(X) = \psi_X''(1) + \psi_X'(1) - \left(\psi_X'(1)\right)^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$

Thus, λ is equal to both the mean and the variance of the Poisson distribution.

It has long been recognized that Poisson distributions provide a surprisingly accurate model for the statistics of a large number of seemingly unrelated phenomena. Some examples include:

- the number of misprints per page of a book;
- the number of wrong telephone numbers dialed in a day;
- the number of customers entering a post office per day;
- the number of mutations that occur when a genome is replicated;
- the number of α-particles discharged per day from a ¹⁴C source;
- the number of major earthquakes per year;
- the number of Prussian soldiers killed per year by being kicked by a horse.

For example, even the number of vacancies per year on the US Supreme Court is reasonably well modeled by a Poisson distribution:

Number of vacancies (x) | Probability | 1837-1932 Observed | 1837-1932 Expected | 1933-2007 Observed | 1933-2007 Expected
0   | 0.6065 | 59 | 58.2 | 47 | 45.5
1   | 0.3033 | 27 | 29.1 | 21 | 22.7
2   | 0.0758 |  9 |  7.3 |  7 |  5.7
3   | 0.0126 |  1 |  1.2 |  0 |  1.0
> 3 | 0.0018 |  0 |  0.2 |  0 |  0.1

Data from Cole (2010) compared with a Poisson distribution with λ = 0.5.

Since these phenomena are generated by very different physical and biological processes, the fact that they share similar statistical properties cannot be explained by the specific mechanisms that operate in each instance. Instead, the widespread emergence of the Poisson distribution appears to be a consequence of the following more general mathematical result, which is commonly known as the Law of Rare Events.

Theorem: For each $n \geq 1$, let $X_1^{(n)}, \cdots, X_n^{(n)}$ be a collection of independent Bernoulli random variables, each with success probability $p_n = \lambda/n$, and let $X^{(n)} = X_1^{(n)} + \cdots + X_n^{(n)}$ be the total number of successes in these n trials. Then

$$\lim_{n \to \infty} P(X^{(n)} = k) = e^{-\lambda} \left(\frac{\lambda^k}{k!}\right).$$

Interpretation: When n is large, the probability of success $p_n = \lambda/n$ is small and so each success is a rare event. However, since there are many trials, there is a non-negligible probability of having at least one success, and the distribution of the total number of successes is approximately Poisson with parameter λ.

We previously proved the law of rare events by directly calculating the limits of the probabilities $P(X^{(n)} = k)$ and showing that these coincide with the probabilities given by a Poisson distribution. However, we can also prove this result with the help of probability generating functions. The following theorem provides the essential tool.

Theorem: For each $n \geq 1$, let $X^{(n)}$ be a non-negative integer-valued random variable with probability generating function $\psi_n(t)$ and suppose that these functions converge pointwise on the interval (-1, 1), i.e., the limit

$$\psi(t) \equiv \lim_{n \to \infty} \psi_n(t)$$

exists for all $t \in (-1, 1)$. Then ψ(t) is the probability generating function of a random variable X with values in the natural numbers and

$$\lim_{n \to \infty} P(X^{(n)} = k) = P(X = k)$$

for every integer $k \geq 0$.

Proof of the Law of Rare Events: Since $X^{(n)} \sim \mathrm{Binomial}(n, p_n)$, we know that the probability generating function of $X^{(n)}$ is

$$\psi_n(t) = E\left[t^{X^{(n)}}\right] = \left(1 - \frac{\lambda}{n} + \frac{\lambda t}{n}\right)^n.$$

However, the pointwise limit of these functions as n tends to infinity is the function

$$\psi(t) = \lim_{n \to \infty} \left(1 - \frac{\lambda}{n} + \frac{\lambda t}{n}\right)^n = e^{\lambda(t-1)},$$

and convergence occurs over the entire real line, i.e., for all values of t. Since ψ(t) is the probability generating function of the Poisson distribution with parameter λ, it follows that the probabilities $P(X^{(n)} = k)$ converge to those of this Poisson distribution.


Probability generating functions can also be used to prove the following theorem.

Theorem: Suppose that $X_1, \cdots, X_n$ is a collection of independent Poisson-distributed random variables with parameters $\lambda_1, \cdots, \lambda_n$, respectively, and let $X = X_1 + \cdots + X_n$. Then X is Poisson-distributed with parameter $\lambda_1 + \cdots + \lambda_n$.

Proof: If $\psi_i(t) = e^{\lambda_i(t-1)}$ is the probability generating function of $X_i$, then because the $X_i$ are independent, we know that the probability generating function of X is

$$\psi_X(t) = \prod_{i=1}^{n} \psi_i(t) = \prod_{i=1}^{n} e^{\lambda_i(t-1)} = e^{(t-1)\sum_{i=1}^{n} \lambda_i}.$$

Since this is also the probability generating function of a Poisson-distributed random variable with parameter $\lambda_1 + \cdots + \lambda_n$, it follows that this is the distribution of X.

Fluctuation tests and the origin of adaptive mutations

One of the classic experiments of molecular genetics is the fluctuation test, which was developed by Salvador Luria and Max Delbrück in 1943 to investigate the origins of adaptive mutations. An adaptive mutation is one that increases the fitness (e.g., survival, fecundity) of an individual that carries that mutation.

At the time, the molecular processes underpinning heredity and mutation were unknown (e.g., the structure of DNA was only described in 1953). There were two prevailing hypotheses explaining the origins of adaptive mutations: the spontaneous mutation hypothesis and the induced mutation hypothesis.

- According to the spontaneous mutation hypothesis, adaptive mutations occur by chance, irrespective of the environmental conditions.
- According to the induced mutation hypothesis, adaptive mutations are directly induced by the environmental conditions in which they will be favored.

Luria and Delbrück developed an experimental system based on the bacterium Escherichia coli along with a virus (T1 bacteriophage) that infects it. When T1 phage is added to a culture of E. coli, most of the bacteria are killed, but a few resistant cells may survive and give rise to resistant colonies that can be seen on the surface of a petri dish. This shows that resistance to T1 phage is a trait that varies across E. coli bacteria and which is heritable, i.e., the descendants of resistant bacteria are usually themselves resistant.

(Figure sources: Wikipedia; Madeleine Price Ball.)

The experiment carried out by Luria and Delbrück consisted of the following steps:

1. An E. coli culture was initiated from a single T1-susceptible cell and allowed to grow to a population containing millions of bacteria.
2. Several small samples were taken from this culture and spread on agar plates that had also been inoculated with the T1 phage. These plates were left for a period, after which the number of resistant colonies on each plate was counted.
3. The procedures described in steps 1-2 were repeated several times, using independently established E. coli cultures, and the resulting data were used to estimate both the mean and the variance of the number of resistant colonies arising in each culture.
4. If $R_{ij}$ is the number of resistant colonies observed on the j-th plate inoculated with bacteria from the i-th culture, then the mean and the variance of the number of resistant colonies can be estimated by
$$\bar{R}_i = \frac{1}{5} \sum_{j=1}^{5} R_{ij} \quad \text{and} \quad V_i = \frac{1}{4} \sum_{j=1}^{5} \left(R_{ij} - \bar{R}_i\right)^2.$$
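The per-culture estimators in step 4 are just the sample mean and the (n − 1)-denominator sample variance over the five plates. A minimal sketch with made-up illustrative colony counts (none of these numbers come from the original experiment):

import statistics

# Hypothetical counts: counts[i][j] = resistant colonies on plate j from culture i
counts = [
    [1, 0, 0, 2, 1],
    [0, 0, 15, 0, 1],   # a "jackpot" culture consistent with an early spontaneous mutation
    [3, 1, 0, 0, 2],
]

for i, plates in enumerate(counts):
    R_bar = statistics.mean(plates)       # (1/5) * sum_j R_ij
    V = statistics.variance(plates)       # (1/4) * sum_j (R_ij - R_bar)^2
    print(f"culture {i}: mean {R_bar:.2f}, variance {V:.2f}")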

Luria and Delbrück argued that the spontaneous and induced mutation hypotheses could be distinguished in the following manner.

- If mutations are induced, then these will only appear after the bacteria are exposed to the phage. In this case, the number of resistant colonies will be approximately Poisson distributed and the variance will be approximately equal to the mean.
- If mutations are spontaneous, then the number of resistant colonies depends on the timing of the mutation relative to the expansion of the culture. In this case, the variance will be much greater than the mean.

The law of rare events explains why the number of resistant colonies is expected to be Poisson distributed under the induced mutation hypothesis: although there are a large number of cells that can independently mutate to the resistance phenotype, the probability of mutation was known to be low.