Preliminaries. Probability space


1 Preliminaries

This section revises some parts of Core A Probability which are essential for this course, and lists some other mathematical facts to be used (without proof) in what follows.

Probability space

We recall that a sample space Ω is the collection of all possible outcomes of a probabilistic experiment; an event is a collection of possible outcomes, i.e., a subset of the sample space. We introduce the impossible event ∅ and the certain event Ω; also, if A ⊆ Ω and B ⊆ Ω are events, it is natural to consider other events such as A ∪ B (A or B), A ∩ B (A and B), A^c = Ω \ A (not A), and A \ B (A but not B).

Definition 0.1. Let A be a collection of subsets of Ω. We shall call A a field if it has the following properties:
1. ∅ ∈ A;
2. if A_1, A_2 ∈ A, then A_1 ∪ A_2 ∈ A;
3. if A ∈ A, then A^c ∈ A.

Remark. Obviously, every field is closed with respect to taking finite unions or intersections.

Definition 0.2. Let F be a collection of subsets of Ω. We shall call F a σ-field if it has the following properties:
1. ∅ ∈ F;
2. if A_1, A_2, ... ∈ F, then ∪_{k=1}^∞ A_k ∈ F;
3. if A ∈ F, then A^c ∈ F.

Remark. Obviously, property 2 above can be replaced by the equivalent condition ∩_{k=1}^∞ A_k ∈ F. Clearly, if Ω is fixed, the smallest σ-field in Ω is just {∅, Ω} and the biggest σ-field consists of all subsets of Ω.

We observe the following simple fact:

Exercise 0.3. Show that if F_1 and F_2 are σ-fields, then F_1 ∩ F_2 is a σ-field (and, in fact, so is the intersection of an arbitrary, even uncountable, collection of σ-fields), but, in general, F_1 ∪ F_2 is not a σ-field.

If A and B are events, we say that A and B are incompatible (or disjoint) if A ∩ B = ∅.

Definition 0.4. Let Ω be a sample space, and let F be a σ-field of events in Ω. A probability distribution P on (Ω, F) is a collection of numbers P(A), A ∈ F, possessing the following properties:

A1 for every event A ∈ F, P(A) ≥ 0;
A2 P(Ω) = 1;
A3 for any pair of incompatible events A and B, P(A ∪ B) = P(A) + P(B);
A4 for any countable collection A_1, A_2, ... of mutually incompatible events (i.e., A_k ∩ A_j = ∅ for all k ≠ j),
P(∪_{k=1}^∞ A_k) = Σ_{k=1}^∞ P(A_k).

Remark. Notice that the additivity axiom A4 above does not extend to uncountable collections of incompatible events.

Remark. Obviously, property A4 above and Definition 0.2 are nontrivial only in examples with infinitely many different events, i.e., when the collection F of all events (and, therefore, the sample space Ω) is infinite.

The following properties are immediate from the above axioms:

P1 for any pair of events A, B in Ω we have P(B \ A) = P(B) − P(A ∩ B) and P(A ∪ B) = P(A) + P(B \ A); in particular, P(A^c) = 1 − P(A);
P2 if events A, B in Ω are such that A ⊆ B ⊆ Ω, then 0 = P(∅) ≤ P(A) ≤ P(B) ≤ P(Ω) = 1;
P3 if A_1, A_2, ..., A_n are events in Ω, then P(∪_{k=1}^n A_k) ≤ Σ_{k=1}^n P(A_k), with the inequality becoming an equality if these events are mutually incompatible.

Definition 0.5. A probability space is a triple (Ω, F, P), where Ω is a sample space, F is a σ-field of events in Ω, and P(·) is a probability measure on (Ω, F).

In what follows we shall always assume that some probability space (Ω, F, P) is fixed.
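The axioms A1–A4 and the properties P1–P3 can be checked directly on a small finite example. The following Python sketch (not part of the original notes; the die-roll sample space and the two events are chosen purely for illustration) verifies P1 and P3 for the uniform measure on a six-point sample space.

```python
from fractions import Fraction

# Uniform probability space for one roll of a fair die (illustrative choice).
omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A & omega), len(omega))   # P(A) = |A| / |Omega|

A = {2, 4, 6}          # "even"
B = {4, 5, 6}          # "at least four"

# P1: P(B \ A) = P(B) - P(A n B),  P(A u B) = P(A) + P(B \ A),  P(A^c) = 1 - P(A)
assert P(B - A) == P(B) - P(A & B)
assert P(A | B) == P(A) + P(B - A)
assert P(omega - A) == 1 - P(A)

# P3 (two events): P(A u B) <= P(A) + P(B), equality only when A and B are disjoint
assert P(A | B) <= P(A) + P(B)
print(P(A | B), P(A) + P(B))     # 2/3 versus 1: strict inequality here, since A n B is non-empty
```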

Conditional probability, independence

Definition 0.6. The conditional probability of event A given event B with P(B) > 0 is
P(A | B) := P(A ∩ B) / P(B).

It is easy to see that if E ∈ F is any event with P(E) > 0, then P(· | E) is a probability measure on (Ω, F), i.e., axioms A1–A4 and properties P1–P3 hold (with P(·) replaced by P(· | E)). We list some additional useful properties of conditional probabilities:

P4 multiplication rule for probabilities: if A and B are events, then
P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B);
more generally, if A_1, ..., A_n are arbitrary events in F, then
P(∩_{k=1}^n A_k) = P(A_1) ∏_{k=2}^n P(A_k | ∩_{j=1}^{k−1} A_j);   (0.1)
for example, P(A ∩ B ∩ C) = P(A) P(B | A) P(C | A ∩ B).

P5 partition theorem, or formula of total probability: we say that events B_1, ..., B_n form a partition of Ω if they are mutually incompatible (disjoint) and their union ∪_{k=1}^n B_k is the entire space Ω. The partition theorem says that if B_1, ..., B_n form a partition of Ω, then for any event A we have
P(A) = Σ_{k=1}^n P(B_k) P(A | B_k).   (0.2)

P6 Bayes' theorem: for any events A, B with P(B) > 0, we have
P(A | B) = P(A) P(B | A) / P(B);
in particular, if D is an event and C_1, ..., C_n form a partition of Ω, then
P(C_k | D) = P(C_k) P(D | C_k) / Σ_{j=1}^n P(C_j) P(D | C_j).   (0.3)

Exercise 0.7. Check carefully (i.e., by induction) property P4 above.

The next definition is one of the most important in probability theory.

Definition 0.8. We say that events A and B are independent if
P(A ∩ B) = P(A) P(B);   (0.4)
under (0.4), we have P(A | B) = P(A), i.e., event A is independent of B; similarly, P(B | A) = P(B), i.e., event B is independent of A.

More generally:

Definition 0.9. A collection of events A_1, ..., A_n is called (mutually) independent if for every non-empty subset S ⊆ {1, ..., n}
P(∩_{k∈S} A_k) = ∏_{k∈S} P(A_k).   (0.5)

It is immediate from (0.5) that every sub-collection of {A_1, ..., A_n} is also mutually independent.
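Returning to P5 and P6 above, the following Python sketch (the two-machine setting and all the numbers are invented purely for illustration) applies the partition theorem (0.2) and Bayes' theorem (0.3) to a two-element partition C_1, C_2 of Ω.

```python
# Hypothetical example: an item is produced by machine C_1 or C_2, and D is
# the event "the item is defective".  All probabilities below are made up.

P_C = [0.6, 0.4]            # P(C_1), P(C_2): which machine produced the item
P_D_given_C = [0.02, 0.05]  # P(D | C_k)

# (0.2): P(D) = sum_k P(C_k) P(D | C_k)
P_D = sum(pc * pd for pc, pd in zip(P_C, P_D_given_C))

# (0.3): P(C_k | D) = P(C_k) P(D | C_k) / P(D)
posterior = [pc * pd / P_D for pc, pd in zip(P_C, P_D_given_C)]

print(P_D)        # 0.032
print(posterior)  # [0.375, 0.625]; the posterior probabilities sum to one
```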

Random variables

It is very common for the sample space Ω of possible outcomes to be a set of real numbers. Then the outcome of the probabilistic experiment is often called a random variable and denoted by a capital letter such as X. In this case the events are subsets A ⊆ R and it is usual to write P(X ∈ A) instead of P(A), and similarly P(X = 1) for P({1}), P(1 < X < 5) for P(A) with A = (1, 5), and so on. The probability distribution of a r.v. X is the collection of probabilities P(X ∈ A) for all intervals A ⊆ R (and other events that can be obtained from intervals via axioms A1–A4).

Let X be a random variable (so the sample space Ω is a subset of R). We say that X is a discrete r.v. if, in addition, Ω is countable, i.e., if the possible values of X can be enumerated in a (possibly infinite) list. In this case the function p(x) := P(X = x) (defined for all real x) is called the probability mass function of X, and the corresponding probability distribution of X is given by
P(X ∈ A) = Σ_{x∈A} P(X = x) = Σ_{x∈A} p(x).
If X takes possible values x_1, x_2, ..., then, by axioms A2 and A4, Σ_{k≥1} p(x_k) = 1, and if x is NOT one of the possible values of X then p(x) = 0.

Similarly, a random variable X has a continuous probability distribution if there exists a non-negative function f(x) on R such that for any interval (a, b) ⊆ R
P(a < X < b) = ∫_a^b f(x) dx;
in particular, we must have ∫_{−∞}^{∞} f(x) dx = 1. The function f(·) is then called the probability density function (or pdf) of X.

In Core A Probability you saw a number of random variables with discrete (Bernoulli, binomial, geometric, Poisson) or continuous (uniform, exponential, normal) distributions.

Definition. For any random variable X, the cumulative distribution function (or cdf) of X is the function F : R → [0, 1] given at all x ∈ R by
F(x) := P(X ≤ x) = ∫_{−∞}^x f(y) dy for a continuous r.v. X, and F(x) = Σ_{x_k ≤ x} p(x_k) for a discrete r.v. X.   (0.6)

If, in addition, f(x) is a continuous function on some interval (a, b), then by the fundamental theorem of calculus, for all x ∈ (a, b), F'(x) = f(x); i.e., the cdf determines the pdf and vice versa. In fact, the cdf of a r.v. X always determines its probability distribution.

Remark. Suppose X is a random variable and h is some real-valued function defined for all real numbers. Then h(X) is also a random variable, namely, the outcome of a new experiment obtained by running the old experiment to produce the r.v. X and then evaluating h(X).
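As a small numerical illustration (not part of the notes; the rate λ = 2 is an arbitrary choice), the following Python sketch checks the pdf/cdf relation (0.6) for an exponential random variable.

```python
import math

# For X ~ Exp(lam): f(x) = lam * exp(-lam * x),  F(x) = 1 - exp(-lam * x),  x >= 0.
lam = 2.0
f = lambda x: lam * math.exp(-lam * x)
F = lambda x: 1.0 - math.exp(-lam * x)

# P(a < X < b) = integral of f over (a, b) = F(b) - F(a): check by a crude midpoint sum
a, b, n = 0.5, 1.5, 100_000
h = (b - a) / n
integral = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
print(integral, F(b) - F(a))        # both ~ 0.318, i.e. e^{-1} - e^{-3}

# F'(x) ~ f(x): a finite-difference check of the fundamental theorem of calculus
x, eps = 0.7, 1e-6
print((F(x + eps) - F(x - eps)) / (2 * eps), f(x))
```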

Joint distributions

It is essential for most useful applications of probability to have a theory which can handle many random variables simultaneously.

Definition. Let (X_1, ..., X_n) be a multivariate random variable (or random vector). Its cumulative distribution function is
F_{X_1,...,X_n}(x_1, ..., x_n) := P(X_1 ≤ x_1, ..., X_n ≤ x_n);   (0.7)
here and below we write {X_1 ≤ x_1, ..., X_n ≤ x_n} = {X_1 ≤ x_1} ∩ ... ∩ {X_n ≤ x_n}.

Bivariate variables: discrete case

Suppose (X, Y) is a bivariate r.v. and that X and Y are discrete r.v.'s taking possible values x_1, x_2, ... and y_1, y_2, ... respectively. Then the collection of probabilities p(x_j, y_k) = P(X = x_j, Y = y_k), j ≥ 1, k ≥ 1, determines the joint probability distribution of (X, Y). It is important to remember that, given the joint distribution of X and Y, we can recover the probability mass function p_X of X (in this context called the marginal probability distribution of X) via
p_X(x_j) = P(X = x_j) = Σ_k P(X = x_j, Y = y_k) = Σ_k p(x_j, y_k)   (0.8)
for any possible value x_j of X. Similarly, the marginal probability distribution of Y is given by p_Y(y_k) = Σ_j P(X = x_j, Y = y_k) = Σ_j p(x_j, y_k).

Conditional distribution and independence

For any discrete bivariate r.v. (X, Y), the conditional distribution of X given Y has probability mass function
p(x | y) = P(X = x | Y = y) = p(x, y) / p_Y(y)
for all y with p_Y(y) > 0. There is also a r.v. version of the partition theorem (0.2); it is often called the law of total probability: for any X-event A,
P(X ∈ A) = Σ_y P(X ∈ A | Y = y) p_Y(y).   (0.9)

We say that X and Y are independent if for all x, y
p(x, y) = p_X(x) p_Y(y).   (0.10)
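A quick Python sketch (the 2 × 2 joint pmf below is invented for illustration) of how the marginals (0.8) are obtained from a joint table and how the factorisation test (0.10) can fail:

```python
import numpy as np

# Toy joint pmf for a discrete pair (X, Y): rows are the x-values, columns the y-values.
p = np.array([[0.10, 0.20],
              [0.30, 0.40]])

p_X = p.sum(axis=1)             # marginal of X: sum over the y-values, as in (0.8)
p_Y = p.sum(axis=0)             # marginal of Y: sum over the x-values
print(p_X, p_Y)                 # [0.3 0.7], [0.4 0.6]

# X and Y are independent iff p(x_j, y_k) = p_X(x_j) p_Y(y_k) for all j, k
print(np.allclose(p, np.outer(p_X, p_Y)))   # False for this particular table
```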

Alternatively, we have:

Definition. Random variables X, Y are independent if for every X-event A and every Y-event B we have
P(X ∈ A, Y ∈ B) ≡ P((X, Y) ∈ A × B) = P(X ∈ A) P(Y ∈ B).   (0.11)

The definitions (0.10), (0.11) can easily be extended to the case of a general multivariate distribution.

Let (X_1, ..., X_n) be a random vector and g : R^n → R be a function. Then g(X_1, ..., X_n) is a random variable (obtained by the new experiment consisting of first carrying out the original experiment to determine the value of (X_1, ..., X_n) and then applying the function g to this ordered n-tuple to obtain the real number g(X_1, ..., X_n)).

Exercise. 1) Let (X, Y, Z) be a random vector with independent components; show that for any function h : R^2 → R the variables h(X, Y) and Z are independent. 2) Let X_1, ..., X_k and Y_1, ..., Y_m be a collection of independent random variables. If the functions f and g are such that f : R^k → R and g : R^m → R, show that the random variables f(X_1, ..., X_k) and g(Y_1, ..., Y_m) are independent.

Bivariate variables: continuous case

We will only consider the case where (X, Y) has a continuous joint pdf f(x, y) defined for (x, y) ∈ R^2. By analogy with the definition for discrete random variables,
P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy
for any integrable set A ⊆ R^2. In this case X and Y have the marginal pdfs
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,   f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx,
and for any interval (a, b) we have
P(a < X < b) = ∫_a^b ∫_{−∞}^{∞} f(x, y) dy dx = ∫_a^b f_X(x) dx.

We define the continuous conditional density of X given Y by
f(x | y) = f(x, y)/f_Y(y) if f_Y(y) > 0, and f(x | y) = 0 if f_Y(y) = 0.

Also, X and Y are independent if and only if f(x, y) = f_X(x) f_Y(y) for every pair (x, y) ∈ R^2. Transformations g(X, Y) in the continuous case are treated similarly to the discrete case.

Expectation

Definition. For any random variable X, the expected value (or mean) of X is the number
E(X) = Σ_k x_k p(x_k) for X discrete with pmf p, and E(X) = ∫_{−∞}^{∞} x f(x) dx for X continuous with pdf f.   (0.12)

The following generalisation of this definition is of great importance to the whole theory. If X is a discrete r.v. taking values in Ω = {x_1, x_2, ...} with probabilities p(x_k), and the transformed r.v. g(X) takes values y_1, y_2, ... with probabilities
q(y_m) := P(X ∈ G_m) = Σ_{x∈G_m} p(x),  where  G_m := {x ∈ Ω : g(x) = y_m},
then the sets G_m form a partition of Ω and it follows that
E(g(X)) = Σ_m y_m q(y_m) = Σ_m Σ_{x∈G_m} g(x) p(x) = Σ_k g(x_k) p(x_k).
Similarly, if X is a continuous r.v. with pdf f, then E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx.

The most important properties of the expectation are:

E1 linearity: let f, g be real functions and let a, b be real numbers; then
E(a f(X) + b g(X)) = a E(f(X)) + b E(g(X)),   (0.13)
provided the corresponding expectations exist.

E2 monotonicity: if h(x) ≥ 0 for all real x, then E(h(X)) ≥ 0; in other words, if the real functions f, g are such that f(x) ≤ g(x) for all real x, then
E(f(X)) ≤ E(g(X)),   (0.14)
provided the corresponding expectations exist.

Recall three important special cases: the variance Var(X) of a r.v. X, its r-th moment E(X^r), and its moment generating function M_X(t):
Var(X) := E(X − E(X))^2,   M_X(t) := E(e^{tX}).

Exercise. Let X be a r.v., and let g : R → [0, ∞] be an increasing function such that E(g(X)) < ∞. Show that for any real a with g(a) > 0 one has
P(X > a) ≤ E(g(X)) / g(a).   (0.15)
In particular, P(X > a) ≤ E(exp{λ(X − a)}) for any real a and any λ > 0. Notice that the Markov inequality and the Chebyshev inequality are special cases of (0.15).
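For the record, here is how the two classical inequalities follow from (0.15); this short derivation is not spelled out in the notes. For Markov's inequality, let X ≥ 0 and a > 0, and take g(x) = max(x, 0), which is increasing with g(a) = a > 0 and E(g(X)) = E(X); then (0.15) gives P(X > a) ≤ E(X)/a. For Chebyshev's inequality, apply (0.15) to the non-negative variable |X − E(X)| with g(x) = x², increasing on [0, ∞): for any a > 0,
P(|X − E(X)| > a) ≤ E|X − E(X)|² / a² = Var(X) / a².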

Multivariate case

In the multivariate case, the expectation is defined similarly and has properties analogous to those considered above. Additionally, we mention two other properties:

E3 multivariate linearity: let (X_1, ..., X_n) be a random vector, g_1, ..., g_n be real functions, and a_1, ..., a_n be real numbers. Then
E(Σ_{k=1}^n a_k g_k(X_k)) = Σ_{k=1}^n a_k E(g_k(X_k)).   (0.16)

E4 independence: if X_1, ..., X_n are independent r.v.'s, so that their joint pmf/pdf factorises,
p_{X_1,...,X_n}(x_1, ..., x_n) = ∏_{k=1}^n p_{X_k}(x_k),
then for all real functions g_1, ..., g_n one has
E(∏_{k=1}^n g_k(X_k)) = ∏_{k=1}^n E(g_k(X_k)).   (0.17)

We say that the variables X and Y are uncorrelated if their covariance,
Cov(X, Y) := E((X − E(X))(Y − E(Y))) ≡ E(XY) − E(X) E(Y),   (0.18)
vanishes, Cov(X, Y) = 0. In particular, any pair of independent variables is uncorrelated.

By the linearity property E3, the variance Var(Σ_{k=1}^n X_k) of the sum of r.v.'s X_1, ..., X_n equals
Var(Σ_{k=1}^n X_k) = Σ_{k=1}^n Var(X_k) + 2 Σ_{k<l} Cov(X_k, X_l).
Thus, if the variables X_1, ..., X_n are pairwise uncorrelated (in particular, independent), then
Var(Σ_{k=1}^n X_k) = Σ_{k=1}^n Var(X_k).   (0.19)
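The variance-of-a-sum identity and (0.19) are easy to check by simulation; in the following Python sketch (the distributions are arbitrary illustrative choices), Y is built to be correlated with X, while Z is independent of X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

X = rng.normal(size=n)
Y = 0.5 * X + rng.normal(size=n)       # correlated with X
Z = rng.exponential(size=n)            # independent of X

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y)[0, 1]
print(lhs, rhs)                        # both close to 3.25

# (0.19) for the uncorrelated (here independent) pair (X, Z)
print(np.var(X + Z), np.var(X) + np.var(Z))   # both close to 2
```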

Conditional expectation

Let X be a discrete r.v. on a sample space Ω, and let A ⊆ Ω be an event. The conditional expectation of X given A is the number E(X | A) defined by
E(X | A) = Σ_x x P(X = x | A),   (0.20)
where the sum runs over all possible values of X. In particular, we have the partition theorem for expectation: if events B_1, ..., B_n form a partition of the sample space Ω, then
E(X) = Σ_{k=1}^n E(X | B_k) P(B_k).

Using the definition (0.20), it is immediate to compute E(X | Y = y); we recall that E(X | Y) is then a random variable such that E(E(X | Y)) = E(X).

Limiting results

Theorem 0.16 (Law of Large Numbers). Let X_1, ..., X_n be i.i.d. (independent, identically distributed) r.v.'s with E(X_k) = µ and Var(X_k) = σ². Denote S_n := Σ_{k=1}^n X_k. Then for any fixed a > 0,
P(|n^{−1} S_n − µ| > a) → 0   (0.21)
as n → ∞.

Theorem 0.17 (Central Limit Theorem). Under the conditions of the previous theorem, denote
S_n* := (S_n − nµ)/√Var(S_n) ≡ (S_n − nµ)/(σ√n).
Then, as n → ∞, the distribution of S_n* converges to that of the standard Gaussian random variable (i.e., N(0, 1)): for every fixed a ∈ R,
P(S_n* ≤ a) → ∫_{−∞}^a (1/√(2π)) e^{−y²/2} dy.   (0.22)
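The two limit theorems are easy to visualise numerically. The following Python sketch (with U(0, 1) summands, an arbitrary choice, so that µ = 1/2 and σ² = 1/12) estimates the probabilities appearing in (0.21) and (0.22) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1_000, 10_000
mu, sigma = 0.5, (1 / 12) ** 0.5

S = rng.random((reps, n)).sum(axis=1)        # 'reps' independent copies of S_n

# LLN (0.21): P(|S_n/n - mu| > 0.03) is already tiny for n = 1000
print(np.mean(np.abs(S / n - mu) > 0.03))    # ~ 0.001

# CLT (0.22): P(S_n* <= 1) should be close to Phi(1) ~ 0.8413
S_star = (S - n * mu) / (sigma * n ** 0.5)
print(np.mean(S_star <= 1.0))
```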

Moment generating functions

As mentioned before, the moment generating function (or mgf) of a r.v. X is defined via
M_X(t) := E(e^{tX}).   (0.23)
We finish by listing several useful properties of mgf's.

M1 For each positive integer r,
E(X^r) = (d^r M_X / dt^r)(0).

M2 [uniqueness] The mgf M_X(t) of X uniquely determines the probability distribution of X, provided that M_X(t) is finite in some neighbourhood of the origin.

M3 [linear transformation] If X has mgf M_X(t) and Y = aX + b, then M_Y(t) = e^{bt} M_X(at).

M4 [independence] Suppose that X_1, ..., X_n are independent r.v.'s and let Y = Σ_{k=1}^n X_k. Then
M_Y(t) = ∏_{k=1}^n M_{X_k}(t).

M5 [convergence] Suppose that Y_1, Y_2, ... is an infinite sequence of r.v.'s and that Y is a further random variable. Suppose that M_Y(t) is finite for |t| < a for some positive a, and that for all t ∈ (−a, a)
M_{Y_n}(t) → M_Y(t) as n → ∞.
Then, as n → ∞,
P(Y_n ≤ c) → P(Y ≤ c)
for all real c such that P(Y = c) = 0.
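As a quick illustration of M2–M4 (not in the original notes), recall that for X ~ N(µ, σ²) one has M_X(t) = exp{µt + σ²t²/2}. If X_1 ~ N(µ_1, σ_1²) and X_2 ~ N(µ_2, σ_2²) are independent, then by M4,
M_{X_1+X_2}(t) = exp{(µ_1 + µ_2)t + (σ_1² + σ_2²)t²/2},
which, by the uniqueness property M2, is the mgf of N(µ_1 + µ_2, σ_1² + σ_2²); so the sum of independent normal variables is again normal. Similarly, M3 shows that aX_1 + b has mgf exp{(aµ_1 + b)t + a²σ_1²t²/2}, i.e., aX_1 + b ~ N(aµ_1 + b, a²σ_1²).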

1 Sequences of events and their limits

1.1 Monotone sequences of events

Sequences of events arise naturally when a probabilistic experiment is repeated many times. For example, if a coin is flipped consecutively, the event A = {heads never seen} is just the intersection, A = ∩_{n≥1} A_n, of the events A_n = {heads not seen in the first n tosses}. (A priori we do not even know that A is an event, i.e., that it can be assigned a probability!) This simple remark leads to the following important observations: a) taking countable operations is not that exotic in probabilistic models, and thus any reasonable theory should deal with σ-fields; b) the event A is in some sense the limit of the sequence (A_n)_{n≥1}, so understanding limits of sequences of sets (events) might be useful.

In general, finding the limit of a sequence of sets is not easy and we will not do this here (the corresponding theory is the subject of pure courses such as set theory or (real) analysis/measure theory; if interested, have a look at problems E26–E28 and/or get in touch). Instead, we will mostly consider monotone sequences of events.

Definition 1.1. A sequence (A_n)_{n≥1} of events is increasing if A_n ⊆ A_{n+1} for all n ≥ 1. It is decreasing if A_n ⊇ A_{n+1} for all n ≥ 1.

Example 1.2. If (A_n)_{n≥1} is a sequence of arbitrary events, then the sequence (B_n)_{n≥1} with B_n = ∪_{k=1}^n A_k is increasing, whereas the sequence (C_n)_{n≥1} with C_n = ∩_{k=1}^n A_k is decreasing.

The following result shows that the probability measure is continuous along monotone sequences of events.

Lemma 1.3. If (A_n)_{n≥1} is increasing with A = lim_n A_n = ∪_{n≥1} A_n, then
P(A) = P(lim_n A_n) = lim_n P(A_n).
If (A_n)_{n≥1} is decreasing with A = lim_n A_n = ∩_{n≥1} A_n, then
P(A) = P(lim_n A_n) = lim_n P(A_n).

Remark. If (A_n)_{n≥1} is not a monotone sequence of events, the claim of the lemma is not necessarily true (find a counterexample!).

Proof. Let (A_n)_{n≥1} be increasing with A = ∪_{n≥1} A_n. Denote C_1 = A_1 and, for n ≥ 2, put C_n = A_n \ A_{n−1} = A_n ∩ A^c_{n−1}. We then have (why?)
A_n = ∪_{k=1}^n A_k = ∪_{k=1}^n C_k,   A = ∪_{k≥1} A_k = ∪_{k≥1} C_k.
(Decompositions of the form A_n = ∪_{k=1}^n (A_k \ ∪_{m=1}^{k−1} A_m) are often called telescopic; they are analogous to those in sequential Bayes formulae.)

Since the events (C_k)_{k≥1} are mutually incompatible, the σ-additivity of the probability measure gives
P(A) = P(∪_{k≥1} A_k) = P(∪_{k≥1} C_k) = Σ_{k≥1} P(C_k),
and therefore
0 ≤ P(A) − P(A_n) = P(A \ A_n) = P(∪_{k>n} C_k) = Σ_{k>n} P(C_k) → 0
as n → ∞, as a tail sum of the convergent series Σ_{k≥1} P(C_k). A similar argument works for decreasing sequences (do this!).

Example 1.4. A standard six-sided die is tossed repeatedly. Let N_1 denote the total number of ones observed. Assuming that the individual outcomes are independent, show that P(N_1 = ∞) = 1.

Solution. We show that P(N_1 < ∞) = 0. First, notice that {N_1 < ∞} = ∪_{n≥1} B_n with B_n = {no ones after the nth toss}, so it is enough to show that P(B_n) = 0 for all n. However, B_n = ∩_{m>0} C_{n,m} with C_{n,m} = {no one on tosses n+1, ..., n+m} being a decreasing sequence, C_{n,m} ⊇ C_{n,m+1} for all m ≥ 1. Since P(C_{n,m}) = (5/6)^m → 0 as m → ∞, Lemma 1.3 implies P(B_n) = lim_m P(C_{n,m}) = 0, as required.

Example 1.5. Let X be a positive random variable with P(X < ∞) = 1. For k ≥ 1, denote X_k = X/k. Show that the event A(ε) = {X_k > ε finitely often} satisfies P(A(ε)) = 1 for every ε > 0.

Solution. Let Ω_0 = {ω ∈ Ω : X(ω) < ∞} be the event "X is finite"; by assumption, P(Ω_0) = 1. Consider the events B_k = {X_k > ε} = {ω : X(ω) > kε}. Since the random variables X_k form a pointwise decreasing sequence, i.e., X_k(ω) ≥ X_{k+1}(ω) for all ω ∈ Ω and all k ≥ 1, the events B_k decrease (i.e., B_k ⊇ B_{k+1} for all k ≥ 1) towards {X = ∞}; we deduce that A(ε) = {B_k finitely often} ⊇ Ω_0, and hence P(A(ε)) = 1.

Remark. The previous argument shows that the event {ω : X_k(ω) → 0} coincides with ∩_{ε>0} A(ε) ⊇ Ω_0; in other words, the sequence of random variables X_k converges (to zero) with probability one (or almost surely), P(X_k → 0) = 1.

1.2 Borel-Cantelli lemma

Let (A_k)_{k≥1} be an infinite sequence of events from some probability space (Ω, F, P). One is often interested in finding out how many of the events A_n occur. (E.g., some results in Number Theory about rational approximations of irrational numbers are formulated in a form similar to Lemma 1.6!) The event that infinitely many of the events A_n occur, written {A_n i.o.} or {A_n infinitely often}, is
{A_n i.o.} = ∩_{n≥1} ∪_{k≥n} A_k.   (1.1)

The next result is very important for applications. Its proof uses the intrinsic monotonicity structure of the definition (1.1).

Lemma 1.6 (Borel-Cantelli lemma). Let A = ∩_{n≥1} ∪_{k≥n} A_k be the event that infinitely many of the A_n occur. Then:
a) If Σ_k P(A_k) < ∞, then P(A) = 0, i.e., with probability one only finitely many of the A_k occur.
b) If Σ_k P(A_k) = ∞ and A_1, A_2, ... are independent events, then P(A) = 1.

Remark. The independence condition in part b) above cannot be relaxed. Otherwise, let A_n ≡ E for all n ≥ 1, where E ∈ F satisfies 0 < P(E) < 1 (and thus the events A_k are not independent). Then A = E and P(A) = P(E) < 1, even though Σ_n P(A_n) = ∞.

Remark. An even more interesting counterexample to part b) without the independence property can be constructed as follows (do this!): let X be a uniform random variable on (0, 1), written X ~ U(0, 1). For n ≥ 1, consider the event A_n = {X < 1/n}. It is easy to see that A = {A_n i.o.} = ∅, so that one can have Σ_n P(A_n) = ∞ together with P(A) = P(A_n i.o.) = 0.

Example 1.7 (Infinite monkey theorem). By the second Borel-Cantelli lemma, Lemma 1.6 b), a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely (i.e., with probability one) type any particular chosen text, such as the complete works of William Shakespeare (and, in fact, infinitely many copies of the chosen text).

Idea of the argument. Suppose that the typewriter has 50 keys, and the word to be typed is "banana". The chance that the first letter typed is "b" is 1/50, as is the chance that the second letter is "a", and so on. These events are independent, so the chance of the first six letters matching "banana" is 1/50^6. For the same reason the chance that the next six letters match "banana" is also 1/50^6, and so on. Now, the chance of not typing "banana" in a given block of six letters is 1 − 1/50^6. Because each block is typed independently, the chance p of not typing "banana" in any of the first n blocks of six letters is
p = (1 − 1/50^6)^n.
As n grows, p gets smaller: for n = 10^6, p is more than 99.99%, but for n = 10^10 it is about 52.73%, and for n = 10^11 it is about 0.17%; as n goes to infinity, p can be made as small as one likes. If we were to count occurrences of "banana" that crossed blocks, p would approach zero even more quickly. (Using the theory of Markov chains, discussed later in the course, you should be able to show that the expected hitting time of the word "banana" is exactly 50^6.) Finally, once the first copy of the word "banana" appears, the process starts afresh independently of the past, so that the probability of obtaining the second copy of the word "banana" within the same number of blocks is still p, etc.; the result now follows from Lemma 1.6. Of course, the same argument applies if the monkey were typing any other string of characters of finite length, e.g., your favourite novel. (You can use the R script available from the course webpage to explore sequences of different lengths and/or different typewriters.)
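The block argument is easy to simulate. The following Python sketch is an analogue of (not the same as) the R script mentioned above; it is scaled down to a 26-key typewriter and a two-letter target word, so that the probability of at least one hit is visible already for a modest number of blocks.

```python
import random, string

keys = string.ascii_lowercase                    # a 26-key "typewriter"
target = "ab"                                    # scaled-down target word
L = len(target)
p_block = (1 / len(keys)) ** L                   # chance that one block spells the target

def target_in_blocks(n_blocks):
    # True iff at least one of n_blocks independent blocks spells the target
    return any("".join(random.choices(keys, k=L)) == target for _ in range(n_blocks))

n_blocks, reps = 1_000, 2_000
estimate = sum(target_in_blocks(n_blocks) for _ in range(reps)) / reps
print(estimate, 1 - (1 - p_block) ** n_blocks)   # both ~ 0.77; they tend to 1 as n_blocks grows
```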

Remark. By using an appropriate monotone approximation, one can deduce the result as in Example 1.4, without explicitly using the Borel-Cantelli lemma. Moreover, the same argument can be extended to situations when the probability p_n of typing "banana" in the nth block of six letters varies with n but remains uniformly positive, i.e., p_n ≥ δ > 0 for all n ≥ 1. The true power of the lemma is seen in situations when p_n → 0 slowly enough to have Σ_n p_n = ∞ (provided the events in different blocks are independent).

Proof of Lemma 1.6. a) For every n ≥ 1, let B_n := ∪_{k≥n} A_k be the event that at least one of the A_k with k ≥ n occurs. Since A ⊆ B_n for all n ≥ 1, we have
P(A) ≤ P(B_n) ≤ Σ_{k≥n} P(A_k) → 0
as n → ∞, whenever Σ_k P(A_k) < ∞.

b) The event A^c = {A_n occur finitely often} is related to the sequence B_n^c = ∩_{k≥n} A_k^c = {none of the A_k, k ≥ n, occurs} via
A^c = ∪_n ∩_{k≥n} A_k^c = ∪_n B_n^c,
so it is sufficient to show that P(B_n^c) = 0 for all n ≥ 1. By independence and the elementary inequality 1 − x ≤ e^{−x}, x ≥ 0, we get
P(∩_{k=n}^m A_k^c) = ∏_{k=n}^m P(A_k^c) = ∏_{k=n}^m (1 − P(A_k)) ≤ exp{−Σ_{k=n}^m P(A_k)},
so that
P(B_n^c) = lim_{m→∞} P(∩_{k=n}^m A_k^c) ≤ lim_{m→∞} exp{−Σ_{k=n}^m P(A_k)} = 0,
as the sum diverges.

Example 1.8. A standard six-sided die is tossed repeatedly. Let N_k denote the total number of tosses on which face k was observed. Assuming that the individual outcomes are independent, show that P(N_1 = ∞) = P(N_2 = ∞) = P(N_1 = ∞, N_2 = ∞) = 1.

Solution. The equalities P(N_1 = ∞) = P(N_2 = ∞) = 1 can be derived as in Example 1.4, so that the intersection event {N_1 = ∞, N_2 = ∞} also has probability one. Alternatively, we derive the first equality from the Borel-Cantelli lemma. To this end, fix k ∈ {1, 2, ..., 6} and denote A_n^k = {nth toss shows k}. For different n, the events A_n^k are independent and have the same probability 1/6. Since Σ_n P(A_n^k) = ∞, the Borel-Cantelli lemma implies that the event {N_k = ∞} ≡ {A_n^k infinitely often} has probability one. The remaining claims now follow as indicated above.

Example 1.9. A coin showing heads with probability p is tossed repeatedly. With X_n denoting the result of the nth toss, let C_n = {X_n = T, X_{n−1} = H}. Show that P(C_n i.o.) = 1.

Solution. We have {C_{2n} i.o.} ⊆ {C_n i.o.}, where P(C_{2n}) = pq (with q = 1 − p) and the events C_{2n} are independent. The result follows from Lemma 1.6 b) (or via a monotone approximation).

The Borel-Cantelli lemma is often used when one needs to describe the long-term behaviour of sequences of random variables.

Example 1.10. Let (X_k)_{k≥1} be i.i.d. random variables with common exponential distribution of mean 1/λ, i.e., P(X_1 > x) = e^{−λx} for all x ≥ 0. One can show that X_n grows like (1/λ) log n; more precisely, that
P(lim sup_n X_n / log n = 1/λ) = 1.
(Recall that for a real sequence (a_n)_{n≥1} one defines lim sup a_n as the largest limiting point of the sequence (a_n)_{n≥1}; equivalently, lim sup_n a_n = lim_n sup_{k≥n} a_k, see App. A below.)

Solution. For ε > 0, denote
A_n^ε := {ω : X_n(ω) > (1+ε)/λ · log n},   B_n^ε := {ω : X_n(ω) > (1−ε)/λ · log n}.
We clearly have P(A_n^ε) = n^{−(1+ε)} and P(B_n^ε) = n^{−(1−ε)}. Since Σ_n P(A_n^ε) < ∞, by Lemma 1.6 a) the event {A_n^ε infinitely often} has probability zero. Similarly, the events B_n^ε are independent and Σ_n P(B_n^ε) = ∞; thus, by Lemma 1.6 b), the event {B_n^ε infinitely often} has probability one.

Remark (Records). A slightly more general version of the argument from Example 1.10 helps to control the limiting behaviour of records (similar results hold for other distributions, see page E6 in the Problems Sheets): let (X_k)_{k≥1} be i.i.d. exponential r.v.'s with distribution P(X_k > x) = e^{−x}, and let M_n := max_{1≤k≤n} X_k. Then P(M_n/(log n) → 1) = 1, i.e., the normalized maximum M_n/(log n) converges to one almost surely (as n → ∞).

Example. Let random variables (X_n)_{n≥1} be i.i.d. with X_1 ~ U[0, 1]. For α > 0, we have P(X_n > 1 − n^{−α}) = n^{−α}, so that P(X_n > 1 − n^{−α} i.o.) = 1 iff α ≤ 1. A similar analysis shows that
P(X_n > 1 − 1/(n (log n)^β) i.o.) = 1 if β ≤ 1, and = 0 if β > 1.

Lemma 1.6 is one of the main methods of proving almost sure convergence.

Example. If (X_k)_{k≥1} is a sequence of random variables such that for every ε > 0 the event A(ε) = {|X_k| > ε finitely often} has probability one, then X_k is said to converge to zero with probability one; recall the Remark after Example 1.5. A simple example is X_k = X/k for a variable X ≥ 0 of finite mean, EX < ∞. One can then show that Σ_{k≥1} P(|X_k| > ε) = Σ_{k≥1} P(X > kε) < ∞, and thus the result follows from Lemma 1.6; see also Lemma 2.15 below.
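The Records remark is easy to check by simulation; in the following Python sketch (sample sizes chosen for illustration) the ratio M_n / log n approaches 1 as n grows.

```python
import numpy as np

# M_n = max of n i.i.d. Exp(1) variables; M_n / log(n) should be close to 1 for large n.
rng = np.random.default_rng(2)
for n in (10**3, 10**5, 10**7):
    X = rng.exponential(size=n)
    print(n, X.max() / np.log(n))
```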

2 Convergence of random variables

In probability theory one uses various modes of convergence of random variables, many of which are crucial for applications. In this section we shall consider some of the most important of them: convergence in L^r, convergence in probability, and convergence with probability one (a.k.a. almost sure convergence).

2.1 Weak laws of large numbers

Definition 2.1. Let r > 0 be fixed. We say that a sequence X_n, n ≥ 1, of random variables converges to a random variable X in L^r (write X_n →^{L^r} X) as n → ∞, if E|X_n − X|^r → 0 as n → ∞.

Example 2.2. Let (X_n)_{n≥1} be a sequence of random variables such that for some real numbers (a_n)_{n≥1} we have
P(X_n = a_n) = p_n,   P(X_n = 0) = 1 − p_n.   (2.1)
Then X_n →^{L^r} 0 iff E|X_n|^r = |a_n|^r p_n → 0 as n → ∞.

The following result is the L^2 weak law of large numbers (L^2-WLLN).

Theorem 2.3. Let X_j, j ≥ 1, be a sequence of uncorrelated random variables with EX_j = µ and Var(X_j) ≤ C < ∞. Denote S_n = X_1 + ... + X_n. Then (1/n) S_n →^{L^2} µ as n → ∞.

Proof. Immediate from
E((1/n) S_n − µ)^2 = E(S_n − nµ)^2 / n^2 = Var(S_n)/n^2 ≤ Cn/n^2 → 0 as n → ∞.

Definition 2.4. We say that a sequence X_n, n ≥ 1, of random variables converges to a random variable X in probability (write X_n →^P X) as n → ∞, if for every fixed ε > 0
P(|X_n − X| ≥ ε) → 0 as n → ∞.

Example 2.5. Let the sequence (X_n)_{n≥1} be as in (2.1). Then for every ε > 0 we have P(|X_n| ≥ ε) ≤ P(X_n ≠ 0) = p_n, so that X_n →^P 0 if p_n → 0 as n → ∞.

The usual weak law of large numbers (WLLN) is just a convergence in probability result:

Theorem 2.6. Under the conditions of Theorem 2.3, (1/n) S_n →^P µ as n → ∞.

Exercise 2.7. Derive Theorem 2.6 from the Chebyshev inequality.

We prove Theorem 2.6 using the following simple fact:

Lemma 2.8. Let X_n, n ≥ 1, be a sequence of random variables. If X_n →^{L^r} X for some fixed r > 0, then X_n →^P X as n → ∞.

Proof. By the generalized Markov inequality (0.15) with g(x) = x^r applied to Z_n = |X_n − X| ≥ 0, we get, for every fixed ε > 0,
P(Z_n ≥ ε) = P(|X_n − X|^r ≥ ε^r) ≤ E|X_n − X|^r / ε^r → 0
as n → ∞.

Proof of Theorem 2.6. Follows immediately from Theorem 2.3 and Lemma 2.8.

As the following example shows, a high-dimensional cube is almost a sphere.

Example 2.9. Let X_j, j ≥ 1, be i.i.d. with X_j ~ U(−1, 1). Then the variables Y_j = (X_j)^2 satisfy EY_j = 1/3 and Var(Y_j) ≤ E[(Y_j)^2] = E[(X_j)^4] ≤ 1. Fix ε > 0 and consider the set
A_{n,ε} := {z ∈ R^n : (1 − ε)√(n/3) < |z| < (1 + ε)√(n/3)},
where |z| is the usual Euclidean length in R^n, |z|^2 = Σ_{j=1}^n (z_j)^2. By the WLLN,
(1/n) Σ_{j=1}^n Y_j ≡ (1/n) Σ_{j=1}^n (X_j)^2 →^P 1/3;
in other words, for every fixed ε > 0, a point X = (X_1, ..., X_n) chosen uniformly at random in (−1, 1)^n satisfies
P(X ∉ A_{n,ε}) ≤ P(|(1/n) Σ_{j=1}^n (X_j)^2 − 1/3| ≥ (2ε − ε²)/3) → 0 as n → ∞,
i.e., for large n, with probability approaching one, a random point X ∈ (−1, 1)^n is near the n-dimensional sphere of radius √(n/3) centred at the origin.

Theorem 2.10. Let random variables S_n, n ≥ 1, have two finite moments, µ_n = ES_n and σ_n² = Var(S_n) < ∞. If, for some sequence b_n, we have σ_n/b_n → 0 as n → ∞, then (S_n − µ_n)/b_n → 0 as n → ∞, both in L^2 and in probability.

Proof. The result follows immediately from the observation
E((S_n − µ_n)²/b_n²) = Var(S_n)/b_n² → 0 as n → ∞.

Example 2.11. In the coupon collector's problem (Problem R4), let T_n be the time to collect all n coupons. It is easy to show that
ET_n = n Σ_{m=1}^n 1/m ≈ n log n  and  Var(T_n) ≤ n² Σ_{m=1}^n 1/m² ≤ π²n²/6,
so that
(T_n − ET_n)/(n log n) → 0, i.e., T_n/(n log n) → 1 as n → ∞, both in L^2 and in probability.
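The coupon collector's asymptotics of Example 2.11 can be checked by a short Monte Carlo experiment; in the Python sketch below, n and the number of repetitions are illustrative choices.

```python
import random, math

def collect(n):
    # Draw coupons uniformly at random until all n distinct types have been seen.
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        t += 1
    return t

n, reps = 2_000, 20
ratios = [collect(n) / (n * math.log(n)) for _ in range(reps)]
print(sum(ratios) / reps)   # typically ~1.05-1.1 here; the ratio tends to 1 as n grows
```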

2.2 Almost sure convergence

Let (X_k)_{k≥1} be a sequence of i.i.d. random variables having mean EX_1 = µ and finite second moment. Denote S_n := Σ_{k=1}^n X_k. Then the usual (weak) law of large numbers (WLLN) tells us that for every δ > 0
P(|n^{−1} S_n − µ| > δ) → 0 as n → ∞.   (2.2)
In other words, according to the WLLN, n^{−1} S_n converges in probability to the constant random variable X ≡ µ = E(X_1) as n → ∞ (recall Definition 2.4 and Theorem 2.6).

It is important to remember that convergence in probability is not related to pointwise convergence, i.e., convergence X_n(ω) → X(ω) for a fixed ω ∈ Ω. The following useful definition can be realised in terms of a U[0, 1] random variable.

Definition 2.12. The canonical probability space is (Ω, F, P), where Ω = [0, 1], F is the smallest σ-field containing all intervals in [0, 1], and P is the length measure on Ω (i.e., for A = [a, b] ⊆ [0, 1], P(A) = b − a).

Example 2.13. Let (Ω, F, P) be the canonical probability space. For every event A ∈ F consider the indicator random variable
1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 if ω ∉ A.   (2.3)
For n ≥ 1 put m = [log_2 n], i.e., m ≥ 0 is such that 2^m ≤ n < 2^{m+1}, define
A_n = [(n − 2^m)/2^m, (n + 1 − 2^m)/2^m] ⊆ [0, 1],
and let X_n := 1_{A_n}. Since
P(|1_{A_n}| > 0) = P(A_n) = 2^{−[log_2 n]} ≤ 2/n → 0 as n → ∞,
the sequence X_n converges in probability to X ≡ 0. However,
{ω ∈ Ω : X_n(ω) → X(ω) ≡ 0 as n → ∞} = ∅,
i.e., there is no point ω ∈ Ω for which the sequence X_n(ω) ∈ {0, 1} converges to X(ω) = 0. [Try the R script simulating this sequence from the course webpage!]

The following is the key definition of this section.

Definition 2.14. A sequence (X_n)_{n≥1} of random variables in (Ω, F, P) converges, as n → ∞, to a random variable X with probability one (or almost surely) if
P({ω ∈ Ω : X_n(ω) → X(ω) as n → ∞}) = 1.   (2.4)

Remark. For ε > 0, let A_n(ε) = {ω : |X_n(ω) − X(ω)| > ε}. Then property (2.4) is equivalent to saying that for every ε > 0
P({A_n(ε) finitely often}) = 1.   (2.5)
This is why the Borel-Cantelli lemma is so useful in studying almost sure limits.
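The following Python sketch (an analogue of the R script mentioned in Example 2.13, not the script itself) evaluates the sliding-interval indicators at a fixed ω: the value 1 keeps recurring, once in every dyadic block of indices, even though P(X_n = 1) → 0.

```python
import math

def X(n, omega):
    m = int(math.log2(n))                       # m >= 0 with 2^m <= n < 2^{m+1}
    left = (n - 2**m) / 2**m
    right = (n + 1 - 2**m) / 2**m
    return 1 if left <= omega <= right else 0   # indicator of A_n

omega = 0.3
hits = [n for n in range(1, 2_000) if X(n, omega) == 1]
print(hits)   # [1, 2, 5, 10, 20, 41, ...]: one hit in every block [2^m, 2^{m+1}), so X_n(omega) never settles at 0

# ... yet P(X_n = 1) = P(A_n) is tiny for large n: only one index in [1024, 2048) gives a hit
print(len([n for n in range(1024, 2048) if X(n, omega) == 1]))   # 1
```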

Example 1.5 (continued). Consider a finite random variable X, i.e., one satisfying P(|X| < ∞) = 1. Then the sequence (X_k)_{k≥1} defined via X_k := X/k converges to zero with probability one.

Solution. The discussion in Example 1.5 established exactly (2.5).

In general, verifying convergence with probability one is not immediate. The following lemma gives a sufficient condition for almost sure convergence.

Lemma 2.15. Let X_1, X_2, ... and X be random variables. If, for every ε > 0,
Σ_{n=1}^∞ P(|X_n − X| > ε) < ∞,   (2.6)
then X_n converges to X almost surely.

Proof. Fix ε > 0 and let A_n(ε) = {ω ∈ Ω : |X_n(ω) − X(ω)| > ε}. By (2.6), Σ_n P(A_n(ε)) < ∞, and, by Lemma 1.6 a), with probability one only a finite number of the A_n(ε) occur. This means that for every fixed ε > 0 the event
A(ε) := {ω ∈ Ω : |X_n(ω) − X(ω)| ≤ ε for all n large enough}
has probability one. By monotonicity (A(ε_1) ⊆ A(ε_2) if ε_1 < ε_2), the event
{ω ∈ Ω : X_n(ω) → X(ω) as n → ∞} = ∩_{ε>0} A(ε) = ∩_{m≥1} A(1/m)
has probability one. The claim follows.

A straightforward application of Lemma 2.15 improves the WLLN (2.2) and gives the following famous (Borel) Strong Law of Large Numbers (SLLN):

Theorem 2.16 (L^4-SLLN). Let the variables X_1, X_2, ... be i.i.d. with E(X_k) = µ and E((X_k)^4) < ∞. If S_n := X_1 + X_2 + ... + X_n, then S_n/n → µ almost surely as n → ∞.

Proof. We may and shall suppose that µ = E(X_k) = 0 (otherwise, consider the centred variables X̄_k = X_k − µ and deduce the result from the relation (1/n) S̄_n = (1/n) S_n − µ and the linearity of almost sure convergence). Now,
E((S_n)^4) = E((Σ_{k=1}^n X_k)^4) = Σ_k E((X_k)^4) + 6 Σ_{1≤k<m≤n} E((X_k)^2 (X_m)^2),
the remaining cross terms vanishing since E(X_k) = 0, so that E((S_n)^4) ≤ Cn^2 for some C ∈ (0, ∞). By Chebyshev's inequality,
P(|S_n| > nε) ≤ E((S_n)^4)/(nε)^4 ≤ C/(n^2 ε^4),
and the result follows from (2.6).

With some additional work (which we will not do here), one can obtain the following SLLN, which is due to Kolmogorov:

Theorem 2.17 (L^1-SLLN). Let X_1, X_2, ... be i.i.d. r.v.'s with E|X_k| < ∞. If E(X_k) = µ and S_n := X_1 + ... + X_n, then (1/n) S_n → µ almost surely as n → ∞.

Notice that verifying almost sure convergence through the Borel-Cantelli lemma (or the sufficient condition (2.6)) is easier than using an explicit construction in the spirit of Example 1.5. We shall see more examples below.

2.3 Relations between different types of convergence

It is important to remember the relations between the different types of convergence. We know (Lemma 2.8) that
X_n →^{L^r} X  ⟹  X_n →^P X;
one can also show (although we shall not do it here) that
X_n →^{a.s.} X  ⟹  X_n →^P X.
In addition, according to Example 2.13,
X_n →^P X  does not imply  X_n →^{a.s.} X,
and the same construction shows that
X_n →^{L^r} X  does not imply  X_n →^{a.s.} X.
The following examples fill in the remaining gaps.

Example 2.18 (L^r convergence does not imply a.s. convergence). Let X_n be a sequence of independent random variables such that P(X_n = 1) = p_n, P(X_n = 0) = 1 − p_n. Then
X_n →^P 0  ⟺  p_n → 0  ⟺  X_n →^{L^r} 0  as n → ∞,
whereas
X_n →^{a.s.} 0  ⟺  Σ_n p_n < ∞.
In particular, taking p_n = 1/n we deduce the claim. Notice that this example also shows that X_n →^P X does not imply X_n →^{a.s.} X.

Example 2.19 (convergence in probability does not imply L^r convergence). Let (Ω, F, P) be the canonical probability space (recall Definition 2.12). For every n ≥ 1, define
X_n(ω) := e^n 1_{[0,1/n]}(ω), i.e., X_n(ω) = e^n for 0 ≤ ω ≤ 1/n and X_n(ω) = 0 for ω > 1/n.
We obviously have X_n →^{a.s.} 0 and X_n →^P 0 as n → ∞; however, for every r > 0,
E|X_n|^r = e^{nr}/n → ∞ as n → ∞,
i.e., X_n does not converge to 0 in L^r. Notice that this example also shows that X_n →^{a.s.} X does not imply X_n →^{L^r} X.

3 Lebesgue integral

In the simplest case, the (Riemann) integral of a non-negative function can be regarded as the area between the graph of that function and the x-axis. Lebesgue integration is a mathematical construction that extends the notion of the integral to a larger class of functions; it also extends the domains on which these functions can be defined. As such, the Lebesgue integral plays an important role in real analysis, probability, and many other areas of mathematics.

3.1 Integration: Riemann vs. Lebesgue

As part of the general movement towards rigour in mathematics in the nineteenth century, attempts were made to put the integral calculus on a firm foundation. The Riemann integral (proposed by Bernhard Riemann, 1826-1866) is one of the most widely known examples; its definition starts with the construction of a sequence of easily-calculated integrals which converge to the integral of a given function. This definition is successful in the sense that it gives the expected answer for many already-solved problems, and gives useful results for many other problems. However, although the Riemann integral is naturally linear and monotone (see the slides!), it does not interact well with taking limits of sequences of functions, making such limiting functions difficult to analyse (and integrate). This is of prime importance, for instance, in the study of Fourier series, Fourier transforms and other topics.

The Lebesgue integral is easier to deal with when taking limits under the integral sign; it also allows one to calculate integrals for a broader class of functions. For example, the Dirichlet function, which is 0 where its argument is irrational and 1 otherwise, is Lebesgue-integrable but not Riemann-integrable.

Riemann integral

Recall that a partition of an interval [a, b] is a finite sequence a = x_0 < x_1 < x_2 < ... < x_n = b. Each [x_i, x_{i+1}] is called a sub-interval of the partition. The mesh of a partition is defined to be the length of the longest sub-interval, that is, max(x_{i+1} − x_i) over 0 ≤ i ≤ n − 1. Let f be a real-valued function defined on the interval [a, b]. The Riemann sum of f with respect to the partition x_0, ..., x_n is
Σ_{i=0}^{n−1} f(t_i)(x_{i+1} − x_i),
where each t_i is a fixed point in the sub-interval [x_i, x_{i+1}]. Notice that the last expression is the sum of areas of rectangles with heights f(t_i) and widths x_{i+1} − x_i.

Loosely speaking, the Riemann integral of f is the limit of the Riemann sums of f as the partitions get finer and finer (i.e., the mesh goes to zero), and every function f for which this limit does not depend on the approximating sequence is called (Riemann) integrable.

Lebesgue integral: sketch of the construction

The modern approach to the theory of Lebesgue integration has two distinct parts: a) a theory of measurable sets and of measures on these sets; b) a theory of measurable functions and of integrals of these functions.

Measure theory was initially created to provide a detailed analysis of the notion of length of subsets of the real line and, more generally, of area and volume of subsets of Euclidean spaces. In particular, it provided a systematic answer to the question of which subsets of R have a length. As was shown by later developments in set theory, it is actually impossible to assign a length to all subsets of R in a way which preserves certain natural additivity and translation-invariance properties. This suggests that picking out a suitable class of measurable subsets is an essential prerequisite.

The modern approach to measure and integration is axiomatic. One defines a measure as a mapping µ from a σ-field A of subsets of a set E to [0, ∞] which satisfies a certain list of properties (see the slides!). These properties can be shown to hold in many different cases.

Integration. In the Lebesgue theory, integrals are defined for a class of functions called measurable functions. Let E be a set and let A be a σ-field of subsets of E (one often calls (E, A) a measurable space, and (E, A, µ) a measure space). A function f : E → R is measurable if the pre-image of any closed interval [a, b] ⊆ R is in A, i.e., f^{−1}([a, b]) ∈ A. The set of measurable functions is naturally closed under algebraic operations; in addition (and more importantly) this class is closed under various kinds of pointwise sequential limits: e.g., if the sequence (f_k)_{k∈N} consists of measurable functions, then both lim inf_k f_k and lim sup_k f_k are measurable functions.

Let a measure space (E, A, µ) be fixed. The Lebesgue integral ∫_E f dµ for measurable functions f : E → R is constructed in stages.

Indicator functions: If S ∈ A, i.e., the set S is measurable, we define the integral of its indicator function 1_S (recall that 1_S(x) = 1 if x ∈ S and 1_S(x) = 0 otherwise) via ∫ 1_S dµ = µ(S).

Simple functions: For non-negative simple functions, i.e., finite linear combinations of indicator functions f = Σ_k a_k 1_{S_k} (where the sum is finite and all a_k ≥ 0), we use linearity to define
µ(f) := ∫ (Σ_k a_k 1_{S_k}) dµ = Σ_k a_k ∫ 1_{S_k} dµ = Σ_k a_k µ(S_k)
(here we always use the convention 0 · ∞ = 0). This construction is obviously linear and monotone (see the slides!). Moreover, even if a simple function can be written as Σ_k a_k 1_{S_k} in many ways, the integral is always the same. Also, if two functions f_1 and f_2 coincide almost everywhere, i.e., they differ only on a set of measure zero, µ(x : f_1(x) ≠ f_2(x)) = 0, then their integrals are equal, µ(f_1) = µ(f_2).

Non-negative functions: Let f : E → [0, +∞] be measurable. We put
∫_E f dµ := sup { ∫_E h dµ : h ≤ f, 0 ≤ h simple }.
We need to check that this construction is consistent, i.e., if f ≥ 0 is simple we need to verify that this definition coincides with the preceding one. Another question is: if f as above is Riemann-integrable, does this definition give the same value of the integral? It is not hard to prove that the answer to both questions is yes. Clearly, if f : E → [0, +∞] is any measurable function, its integral ∫ f dµ may be infinite.

Signed functions: If f : E → [−∞, +∞] is measurable (complex-valued functions can be integrated similarly, by considering the real part and the imaginary part separately), we decompose it into its positive and negative parts, f = f^+ − f^−, where
f^+(x) = f(x) if f(x) > 0 and f^+(x) = 0 otherwise,   f^−(x) = −f(x) if f(x) < 0 and f^−(x) = 0 otherwise.
Note that the functions f^+ ≥ 0 and f^− ≥ 0 satisfy |f| = f^+ + f^−. If ∫ |f| dµ is finite, then f is called Lebesgue integrable. In this case both integrals ∫ f^+ dµ and ∫ f^− dµ converge, and it makes sense to define
∫ f dµ = ∫ f^+ dµ − ∫ f^− dµ.
It turns out that this definition gives the desirable properties of the integral, namely linearity, monotonicity and regularity when taking limits.

The functions which can be obtained from the above construction are called Borel functions (by definition, f : E → [−∞, +∞] is Borel if for every a ∈ R the set {x ∈ E : f(x) ≤ a} ∈ A, i.e., is measurable). The class of Borel functions is very big and sufficient for most practical considerations (it is not easy to construct a non-Borel real-valued function; get in touch if interested!).
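The following Python sketch (not from the notes; the function f(x) = x² on [0, 1] and the discretisation level are illustrative choices) computes the same integral twice: once as a Riemann sum over a partition of the domain, and once as the integral of a non-negative simple function obtained by partitioning the range, in the spirit of the construction above.

```python
import numpy as np

f = lambda x: x ** 2
n = 10_000                       # number of sub-intervals / range levels
xs = np.linspace(0.0, 1.0, n + 1)

# Riemann: partition the DOMAIN and sum f(t_i) * (x_{i+1} - x_i) with left endpoints.
riemann = np.sum(f(xs[:-1]) * np.diff(xs))

# Lebesgue-style: approximate f from below by the simple function equal to k/n on the
# level set {k/n <= f < (k+1)/n}; for f(x) = x^2 on [0, 1] that set is
# [sqrt(k/n), sqrt((k+1)/n)), with Lebesgue measure sqrt((k+1)/n) - sqrt(k/n).
levels = np.arange(n) / n
meas = np.sqrt(np.minimum((np.arange(n) + 1) / n, 1.0)) - np.sqrt(levels)
lebesgue = np.sum(levels * meas)

print(riemann, lebesgue, 1 / 3)  # both approximations approach the exact value 1/3
```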

3.2 Lebesgue integral: limiting results

The construction described above implies the following limiting property, which is one of the most central in the area.

Theorem 3.1 (Monotone Convergence Theorem; (MON)). Let f and (f_n)_{n≥1} be Borel functions on (E, A, µ) such that 0 ≤ f_n ↑ f. Then, as n → ∞, µ(f_n) ↑ µ(f).

The random variables version of the result is:

Theorem 3.2 (Monotone Convergence Theorem; (MON)). If random variables X_n ≥ 0 are such that X_n ↑ X as n → ∞, then E(X_n) ↑ E(X) as n → ∞.

In view of the almost-everywhere remark in the previous subsection, the following result is rather natural:

Corollary 3.3. Let f and (f_n)_{n≥1} be non-negative Borel functions on (E, A, µ) such that, except on a µ-null set N, 0 ≤ f_n ↑ f, i.e., for all x ∈ E \ N, f_n(x) ↑ f(x), and µ(N) = 0. Then µ(f_n) ↑ µ(f) as n → ∞.

Exercise 3.4. State an analogue of the previous corollary for random variables (using almost sure convergence).

Another important result is:

Theorem 3.5 (Dominated-Convergence Theorem; (DOM)). Let (f_n)_{n≥1} and f be Borel functions on (E, A, µ) such that f_n(x) converges to f(x) for all x ∈ E as n → ∞, and such that the sequence f_n is dominated by a non-negative integrable function g, i.e., for all x ∈ E and n ∈ N,
f_n(x) → f(x) and |f_n(x)| ≤ g(x), with µ(g) < ∞.   (3.1)
Then µ(f_n) → µ(f) as n → ∞.

Theorem 3.6 (Dominated-Convergence Theorem; (DOM)). Let (X_n)_{n≥1} and X be random variables such that for all ω ∈ Ω we have X_n(ω) → X(ω) as n → ∞. If there is a random variable Y ≥ 0 such that E(Y) < ∞ and, for all ω ∈ Ω, |X_n(ω)| ≤ Y(ω), then E(X_n) → E(X) as n → ∞.

Of course, similarly to the corollary above, one can allow the conditions (3.1) to be violated on a set N of measure zero.

Exercise 3.7. State the versions of the last two theorems in the case where convergence is violated on a set of measure zero (i.e., convergence takes place almost surely).

Various examples of applications of these results were discussed in the lectures and tutorials.

4 Generating functions

Even quite straightforward counting problems can lead to laborious and lengthy calculations. These are often greatly simplified by using generating functions (introduced by de Moivre and Euler in the early eighteenth century).

Definition 4.1. Given a collection of real numbers (a_k)_{k≥0}, the function
G(s) = G_a(s) := Σ_{k=0}^∞ a_k s^k   (4.1)
is called the generating function of (a_k)_{k≥0}.

Why do we care? If the generating function G_a(s) of (a_n)_{n≥0} is analytic near the origin, then there is a one-to-one correspondence between G_a(s) and (a_n)_{n≥0}; namely, a_k can be recovered via
a_k = (1/k!) (d^k/ds^k) G_a(s) |_{s=0}.   (4.2)
(This and several other useful properties of power series can be found in Sect. A.4 below.) This result is often referred to as the uniqueness property of generating functions.

Definition 4.2. If X is a discrete random variable with values in Z_+ := {0, 1, ...}, its (probability) generating function,
G(s) ≡ G_X(s) := E(s^X) = Σ_{k=0}^∞ s^k P(X = k),   (4.3)
is just the generating function of the pmf {p_k} ≡ {P(X = k)} of X.

Recall that the moment generating function M_X(t) := E(e^{tX}) of a random variable X (which might be infinite for t ≠ 0!) is just Σ_{k≥0} (E(X^k)/k!) t^k, i.e., the generating function of the sequence E(X^k)/k!. Why do we introduce both G_X(s) and M_X(t)?

The following result illustrates one of the most useful applications of generating functions in probability theory.

Theorem 4.3. If X and Y are independent random variables with values in {0, 1, 2, ...} and Z := X + Y, then their generating functions satisfy
G_Z(s) = G_{X+Y}(s) = G_X(s) G_Y(s).
(Recall: if X and Y are discrete random variables, and f, g : Z_+ → R are arbitrary functions, then f(X) and g(Y) are independent random variables and E[f(X)g(Y)] = E f(X) · E g(Y).)

Example 4.4. If X_1, X_2, ..., X_n are independent identically distributed random variables (from now on often abbreviated to i.i.d.r.v.'s) with values in {0, 1, 2, ...} and if S_n = X_1 + ... + X_n, then
G_{S_n}(s) = G_{X_1}(s) ... G_{X_n}(s) ≡ [G_X(s)]^n.

Example 4.5. Let X_1, X_2, ..., X_n be i.i.d.r.v.'s with values in {0, 1, 2, ...} and let N ≥ 0 be an integer-valued random variable independent of {X_k}_{k≥1}. Then S_N := X_1 + ... + X_N (a two-stage probabilistic experiment!) has generating function
G_{S_N}(s) = G_N(G_X(s)).   (4.4)

Solution. This is a straightforward application of the partition theorem for expectations. Alternatively, the result follows from the standard properties of conditional expectations:
E(z^{S_N}) = E[E(z^{S_N} | N)] = E([G_X(z)]^N) = G_N(G_X(z)).

Example 4.6 (Renewals). Imagine a diligent janitor who replaces a light bulb the same day as it burns out. Suppose the first bulb is put in on day 0 and let X_i be the lifetime of the ith light bulb. Let the individual lifetimes X_i be i.i.d.r.v.'s with values in {1, 2, ...} and a common distribution with generating function G_f(s). Define r_n := P(a light bulb was replaced on day n) and f_k := P(the first light bulb was replaced on day k). Then r_0 = 1, f_0 = 0, and
r_n = Σ_{k=1}^n f_k r_{n−k},  n ≥ 1.
A standard computation implies that G_r(s) = 1 + G_f(s) G_r(s) for all |s| < 1, so that G_r(s) = 1/(1 − G_f(s)).

In general, we say that a sequence (c_n)_{n≥0} is the convolution of (a_k)_{k≥0} and (b_m)_{m≥0} (write c = a ∗ b) if
c_n = Σ_{k=0}^n a_k b_{n−k},  n ≥ 0.   (4.5)
The key property of convolutions is given by the following result.

Theorem 4.7 (Convolution theorem). If c = a ∗ b, then the generating functions G_c(s), G_a(s), and G_b(s) satisfy G_c(s) = G_a(s) G_b(s).

Example 4.8. Let X ~ Poi(λ) and Y ~ Poi(µ) be independent. Then Z = X + Y is Poi(λ + µ).

Solution. A straightforward computation gives G_X(s) = e^{λ(s−1)}; Theorem 4.3 then implies
G_Z(s) = G_X(s) G_Y(s) = e^{λ(s−1)} e^{µ(s−1)} ≡ e^{(λ+µ)(s−1)},
so the result follows by uniqueness.

A similar argument implies the following result.

Example 4.9. If X ~ Bin(n, p) and Y ~ Bin(m, p) are independent, then X + Y ~ Bin(n + m, p).
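The convolution theorem and Example 4.8 are easy to verify numerically; in the Python sketch below (the truncation level is an illustrative choice), the convolution (4.5) of the Poi(2) and Poi(3) pmfs is compared with the Poi(5) pmf.

```python
import numpy as np
from math import exp, factorial

def poi_pmf(lam, kmax):
    # pmf of Poi(lam) on {0, 1, ..., kmax}
    return np.array([exp(-lam) * lam**k / factorial(k) for k in range(kmax + 1)])

K = 40                                    # truncation level; the tail mass beyond K is negligible
px, py = poi_pmf(2.0, K), poi_pmf(3.0, K)
pz = np.convolve(px, py)[: K + 1]         # (4.5): c_n = sum_k a_k b_{n-k}

print(np.max(np.abs(pz - poi_pmf(5.0, K))))   # ~1e-16: the convolution is the Poi(5) pmf
```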

Another useful property of the probability generating function G_X(s) is that it can be used to compute moments of X:

Theorem 4.10. If X has generating function G(s), then
E[X(X − 1)...(X − k + 1)] = G^{(k)}(1)
(here, if G^{(k)}(1) does not exist, we understand the right-hand side as G^{(k)}(1−) := lim_{s↑1} G^{(k)}(s), the limiting value of the kth left derivative of G(s) at s = 1).

Remark. The quantity E[X(X − 1)...(X − k + 1)] is called the kth factorial moment of X. Notice also that
Var(X) = G_X''(1) + G_X'(1) − (G_X'(1))^2.   (4.6)

Proof. Fix s ∈ (0, 1) and differentiate G(s) k times (as |G_X(s)| ≤ E|s|^X ≤ 1 for all |s| ≤ 1, the generating function G_X(s) can be differentiated as many times as we please for all s inside the disk {s : |s| < 1}) to get
G^{(k)}(s) = E[s^{X−k} X(X − 1)...(X − k + 1)].
Taking the limit s ↑ 1 and using Abel's theorem (Theorem A.12 below, which by the previous observation applies to all probability generating functions), we obtain the result.

Remark. Notice also that
lim_{s↑1} G_X(s) ≡ lim_{s↑1} E[s^X] = P(X < ∞).
This allows us to check whether a variable is finite, if we do not know this a priori.

Exercise. Let S_N be defined as in Example 4.5. Use (4.4) to compute E[S_N] and Var[S_N] in terms of E[N], E[X], Var[X] and Var[N]. Now check your result for E[S_N] and Var[S_N] by directly applying the partition theorem for expectations.

Generating functions are also very useful in solving recurrences, especially when combined with the following algebraic fact (an alternative way would be to use products of matrices; get in touch if interested!).

Lemma 4.12 (Partial fraction expansion). Let f(x) = g(x)/h(x) be a ratio of two polynomials without common roots. Let deg(g) < deg(h) = m and suppose that the roots a_1, ..., a_m of h(x) are all distinct. Then f(x) can be decomposed into a sum of partial fractions, i.e., for some constants b_1, b_2, ..., b_m,
f(x) = b_1/(a_1 − x) + b_2/(a_2 − x) + ... + b_m/(a_m − x).   (4.7)

Remark. Since
b/(a − x) = (b/a) Σ_{k≥0} (x/a)^k = Σ_{k≥0} (b/a^{k+1}) x^k,
a generating function of the form (4.7) can easily be written as a power series.
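As a quick sanity check of Theorem 4.10 and (4.6), not worked in the notes: for X ~ Poi(λ) one has G_X(s) = e^{λ(s−1)}, so G_X'(1) = λ and G_X''(1) = λ². Hence E(X) = λ and, by (4.6), Var(X) = λ² + λ − λ² = λ. More generally, the kth factorial moment of a Poisson variable is G_X^{(k)}(1) = λ^k.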


More information

17. Convergence of Random Variables

17. Convergence of Random Variables 7. Convergence of Random Variables In elementary mathematics courses (such as Calculus) one speaks of the convergence of functions: f n : R R, then lim f n = f if lim f n (x) = f(x) for all x in R. This

More information

Formulas for probability theory and linear models SF2941

Formulas for probability theory and linear models SF2941 Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms

More information

Lecture 2: Review of Basic Probability Theory

Lecture 2: Review of Basic Probability Theory ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent

More information

2 n k In particular, using Stirling formula, we can calculate the asymptotic of obtaining heads exactly half of the time:

2 n k In particular, using Stirling formula, we can calculate the asymptotic of obtaining heads exactly half of the time: Chapter 1 Random Variables 1.1 Elementary Examples We will start with elementary and intuitive examples of probability. The most well-known example is that of a fair coin: if flipped, the probability of

More information

Random Variables and Their Distributions

Random Variables and Their Distributions Chapter 3 Random Variables and Their Distributions A random variable (r.v.) is a function that assigns one and only one numerical value to each simple event in an experiment. We will denote r.vs by capital

More information

Integration on Measure Spaces

Integration on Measure Spaces Chapter 3 Integration on Measure Spaces In this chapter we introduce the general notion of a measure on a space X, define the class of measurable functions, and define the integral, first on a class of

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Week 12-13: Discrete Probability

Week 12-13: Discrete Probability Week 12-13: Discrete Probability November 21, 2018 1 Probability Space There are many problems about chances or possibilities, called probability in mathematics. When we roll two dice there are possible

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

8 Laws of large numbers

8 Laws of large numbers 8 Laws of large numbers 8.1 Introduction We first start with the idea of standardizing a random variable. Let X be a random variable with mean µ and variance σ 2. Then Z = (X µ)/σ will be a random variable

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Lecture 1: Review on Probability and Statistics

Lecture 1: Review on Probability and Statistics STAT 516: Stochastic Modeling of Scientific Data Autumn 2018 Instructor: Yen-Chi Chen Lecture 1: Review on Probability and Statistics These notes are partially based on those of Mathias Drton. 1.1 Motivating

More information

EE514A Information Theory I Fall 2013

EE514A Information Theory I Fall 2013 EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

On the convergence of sequences of random variables: A primer

On the convergence of sequences of random variables: A primer BCAM May 2012 1 On the convergence of sequences of random variables: A primer Armand M. Makowski ECE & ISR/HyNet University of Maryland at College Park armand@isr.umd.edu BCAM May 2012 2 A sequence a :

More information

I. ANALYSIS; PROBABILITY

I. ANALYSIS; PROBABILITY ma414l1.tex Lecture 1. 12.1.2012 I. NLYSIS; PROBBILITY 1. Lebesgue Measure and Integral We recall Lebesgue measure (M411 Probability and Measure) λ: defined on intervals (a, b] by λ((a, b]) := b a (so

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Lectures for APM 541: Stochastic Modeling in Biology. Jay Taylor

Lectures for APM 541: Stochastic Modeling in Biology. Jay Taylor Lectures for APM 541: Stochastic Modeling in Biology Jay Taylor November 3, 2011 Contents 1 Distributions, Expectations, and Random Variables 4 1.1 Probability Spaces...................................

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Lecture 11. Probability Theory: an Overveiw

Lecture 11. Probability Theory: an Overveiw Math 408 - Mathematical Statistics Lecture 11. Probability Theory: an Overveiw February 11, 2013 Konstantin Zuev (USC) Math 408, Lecture 11 February 11, 2013 1 / 24 The starting point in developing the

More information

1 Review of Probability

1 Review of Probability 1 Review of Probability Random variables are denoted by X, Y, Z, etc. The cumulative distribution function (c.d.f.) of a random variable X is denoted by F (x) = P (X x), < x

More information

1.1 Review of Probability Theory

1.1 Review of Probability Theory 1.1 Review of Probability Theory Angela Peace Biomathemtics II MATH 5355 Spring 2017 Lecture notes follow: Allen, Linda JS. An introduction to stochastic processes with applications to biology. CRC Press,

More information

THE QUEEN S UNIVERSITY OF BELFAST

THE QUEEN S UNIVERSITY OF BELFAST THE QUEEN S UNIVERSITY OF BELFAST 0SOR20 Level 2 Examination Statistics and Operational Research 20 Probability and Distribution Theory Wednesday 4 August 2002 2.30 pm 5.30 pm Examiners { Professor R M

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

Measure and integration

Measure and integration Chapter 5 Measure and integration In calculus you have learned how to calculate the size of different kinds of sets: the length of a curve, the area of a region or a surface, the volume or mass of a solid.

More information

3. Review of Probability and Statistics

3. Review of Probability and Statistics 3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture

More information

18.175: Lecture 2 Extension theorems, random variables, distributions

18.175: Lecture 2 Extension theorems, random variables, distributions 18.175: Lecture 2 Extension theorems, random variables, distributions Scott Sheffield MIT Outline Extension theorems Characterizing measures on R d Random variables Outline Extension theorems Characterizing

More information

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets

More information

Chapter 3, 4 Random Variables ENCS Probability and Stochastic Processes. Concordia University

Chapter 3, 4 Random Variables ENCS Probability and Stochastic Processes. Concordia University Chapter 3, 4 Random Variables ENCS6161 - Probability and Stochastic Processes Concordia University ENCS6161 p.1/47 The Notion of a Random Variable A random variable X is a function that assigns a real

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

Week 2. Review of Probability, Random Variables and Univariate Distributions

Week 2. Review of Probability, Random Variables and Univariate Distributions Week 2 Review of Probability, Random Variables and Univariate Distributions Probability Probability Probability Motivation What use is Probability Theory? Probability models Basis for statistical inference

More information

II - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define

II - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define 1 Measures 1.1 Jordan content in R N II - REAL ANALYSIS Let I be an interval in R. Then its 1-content is defined as c 1 (I) := b a if I is bounded with endpoints a, b. If I is unbounded, we define c 1

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Limiting Distributions

Limiting Distributions We introduce the mode of convergence for a sequence of random variables, and discuss the convergence in probability and in distribution. The concept of convergence leads us to the two fundamental results

More information

Stochastic Models (Lecture #4)

Stochastic Models (Lecture #4) Stochastic Models (Lecture #4) Thomas Verdebout Université libre de Bruxelles (ULB) Today Today, our goal will be to discuss limits of sequences of rv, and to study famous limiting results. Convergence

More information

The Lebesgue Integral

The Lebesgue Integral The Lebesgue Integral Brent Nelson In these notes we give an introduction to the Lebesgue integral, assuming only a knowledge of metric spaces and the iemann integral. For more details see [1, Chapters

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

RS Chapter 1 Random Variables 6/5/2017. Chapter 1. Probability Theory: Introduction

RS Chapter 1 Random Variables 6/5/2017. Chapter 1. Probability Theory: Introduction Chapter 1 Probability Theory: Introduction Basic Probability General In a probability space (Ω, Σ, P), the set Ω is the set of all possible outcomes of a probability experiment. Mathematically, Ω is just

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

1 Stat 605. Homework I. Due Feb. 1, 2011

1 Stat 605. Homework I. Due Feb. 1, 2011 The first part is homework which you need to turn in. The second part is exercises that will not be graded, but you need to turn it in together with the take-home final exam. 1 Stat 605. Homework I. Due

More information

JUSTIN HARTMANN. F n Σ.

JUSTIN HARTMANN. F n Σ. BROWNIAN MOTION JUSTIN HARTMANN Abstract. This paper begins to explore a rigorous introduction to probability theory using ideas from algebra, measure theory, and other areas. We start with a basic explanation

More information

We introduce methods that are useful in:

We introduce methods that are useful in: Instructor: Shengyu Zhang Content Derived Distributions Covariance and Correlation Conditional Expectation and Variance Revisited Transforms Sum of a Random Number of Independent Random Variables more

More information

Probability Background

Probability Background Probability Background Namrata Vaswani, Iowa State University August 24, 2015 Probability recap 1: EE 322 notes Quick test of concepts: Given random variables X 1, X 2,... X n. Compute the PDF of the second

More information

µ X (A) = P ( X 1 (A) )

µ X (A) = P ( X 1 (A) ) 1 STOCHASTIC PROCESSES This appendix provides a very basic introduction to the language of probability theory and stochastic processes. We assume the reader is familiar with the general measure and integration

More information

Convergence of Random Variables

Convergence of Random Variables 1 / 15 Convergence of Random Variables Saravanan Vijayakumaran sarva@ee.iitb.ac.in Department of Electrical Engineering Indian Institute of Technology Bombay March 19, 2014 2 / 15 Motivation Theorem (Weak

More information

Chapter 6 Expectation and Conditional Expectation. Lectures Definition 6.1. Two random variables defined on a probability space are said to be

Chapter 6 Expectation and Conditional Expectation. Lectures Definition 6.1. Two random variables defined on a probability space are said to be Chapter 6 Expectation and Conditional Expectation Lectures 24-30 In this chapter, we introduce expected value or the mean of a random variable. First we define expectation for discrete random variables

More information

Review: mostly probability and some statistics

Review: mostly probability and some statistics Review: mostly probability and some statistics C2 1 Content robability (should know already) Axioms and properties Conditional probability and independence Law of Total probability and Bayes theorem Random

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

Review of Probability. CS1538: Introduction to Simulations

Review of Probability. CS1538: Introduction to Simulations Review of Probability CS1538: Introduction to Simulations Probability and Statistics in Simulation Why do we need probability and statistics in simulation? Needed to validate the simulation model Needed

More information

Chapter 4. Chapter 4 sections

Chapter 4. Chapter 4 sections Chapter 4 sections 4.1 Expectation 4.2 Properties of Expectations 4.3 Variance 4.4 Moments 4.5 The Mean and the Median 4.6 Covariance and Correlation 4.7 Conditional Expectation SKIP: 4.8 Utility Expectation

More information

Lecture 4: Probability, Proof Techniques, Method of Induction Lecturer: Lale Özkahya

Lecture 4: Probability, Proof Techniques, Method of Induction Lecturer: Lale Özkahya BBM 205 Discrete Mathematics Hacettepe University http://web.cs.hacettepe.edu.tr/ bbm205 Lecture 4: Probability, Proof Techniques, Method of Induction Lecturer: Lale Özkahya Resources: Kenneth Rosen, Discrete

More information

A D VA N C E D P R O B A B I L - I T Y

A D VA N C E D P R O B A B I L - I T Y A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2

More information

Exercises with solutions (Set D)

Exercises with solutions (Set D) Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where

More information

Product measure and Fubini s theorem

Product measure and Fubini s theorem Chapter 7 Product measure and Fubini s theorem This is based on [Billingsley, Section 18]. 1. Product spaces Suppose (Ω 1, F 1 ) and (Ω 2, F 2 ) are two probability spaces. In a product space Ω = Ω 1 Ω

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

Measures and Measure Spaces

Measures and Measure Spaces Chapter 2 Measures and Measure Spaces In summarizing the flaws of the Riemann integral we can focus on two main points: 1) Many nice functions are not Riemann integrable. 2) The Riemann integral does not

More information

Fundamental Tools - Probability Theory II

Fundamental Tools - Probability Theory II Fundamental Tools - Probability Theory II MSc Financial Mathematics The University of Warwick September 29, 2015 MSc Financial Mathematics Fundamental Tools - Probability Theory II 1 / 22 Measurable random

More information

Elementary Probability. Exam Number 38119

Elementary Probability. Exam Number 38119 Elementary Probability Exam Number 38119 2 1. Introduction Consider any experiment whose result is unknown, for example throwing a coin, the daily number of customers in a supermarket or the duration of

More information