Probability and Statistics. Vittoria Silvestri


Statslab, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge CB3 0WB, UK


Contents

Preface

Chapter 1. Discrete probability
1. Introduction
2. Combinatorial analysis
3. Exercises
4. Stirling's formula
5. Properties of probability measures
6. Exercises
7. Independence
8. Conditional probability
9. Exercises
10. Some natural probability distributions
11. Exercises
12. Additional exercises
13. Random variables
14. Expectation
15. Exercises
16. Variance and covariance
17. Exercises
18. Inequalities
19. Exercises
20. Probability generating functions
21. Conditional distributions
22. Exercises

Chapter 2. Continuous probability
1. Some natural continuous probability distributions
2. Continuous random variables
3. Exercises
4. Transformations of one-dimensional random variables
5. Exercises
6. Moment generating functions
7. Exercises
8. Multivariate distributions
9. Exercises
10. Transformations of two-dimensional random variables
11. Limit theorems
12. Exercises

Chapter 3. Statistics
1. Parameter estimation
2. Exercises
3. Mean square error
4. Confidence intervals
5. Exercises
6. Bayesian statistics
7. Exercises

Appendix A. Background on set theory
Table of Φ(z)

Preface

These lecture notes are for the NYU Shanghai course Probability and Statistics, given in the Fall semester. The contents are closely based on the following lecture notes, all available online:

James Norris: james/lectures/p.pdf
Douglas Kennedy:
Richard Weber: rrw1/stats/sa4.pdf

Please email vs358@cam.ac.uk with comments and corrections.


CHAPTER 1

Discrete probability

1. Introduction

This course concerns the study of experiments with random outcomes, such as throwing a die, tossing a coin or blindly drawing a card from a deck. Say that the set of possible outcomes is Ω = {ω_1, ω_2, ω_3, ...}. We call Ω the sample space, while its elements are called outcomes. A subset A ⊆ Ω is called an event.

Example 1.1 (Throwing a die). Throw an ordinary six-faced die: the sample space is Ω = {1, 2, 3, 4, 5, 6}. Examples of events are:
{5} (the outcome is 5),
{2, 4, 6} (the outcome is even),
{3, 6} (the outcome is divisible by 3).

Example 1.2 (Drawing a card). Draw a card from a standard deck: Ω is the set of all possible cards, so that |Ω| = 52. Examples of events are:
A_1 = {the card is a Jack}, |A_1| = 4,
A_2 = {the card is Diamonds}, |A_2| = 13,
A_3 = {the card is not the Queen of Spades}, |A_3| = 51.

Example 1.3 (Picking a natural number). Pick any natural number: the sample space is Ω = N. Examples of events are:
{the number is at most 5} = {0, 1, 2, 3, 4, 5},
{the number is even} = {0, 2, 4, 6, 8, ...},
{the number is not 7} = N \ {7}.

Example 1.4 (Picking a real number). Pick any number in the closed interval [0, 1]: the sample space is Ω = [0, 1]. Examples of events are:
{x : x < 1/3} = [0, 1/3),
{x : x ≠ 0.7} = [0, 1] \ {0.7} = [0, 0.7) ∪ (0.7, 1],
{x : x = 2^(−n) for some n ∈ N} = {1, 1/2, 1/4, 1/8, ...}.

Recall that a set Ω is said to be countable if there exists a bijection between Ω and a subset of N. A set is said to be uncountable if it is not countable. Thus, for example, any finite set is countable, the set of even (or odd) integers is countable, but R is uncountable. Note that the sample space Ω is finite in the first two examples, infinite but countable in the third example, and uncountable in the last example.

Remark 1.5. For the first part of the course we will restrict to countable sample spaces, thus excluding Example 1.4 above.

We can now give a general definition.

Definition 1.6. Let Ω be any set, and F be a set of subsets of Ω. We say that F is a σ-algebra if
- Ω ∈ F,
- if A ∈ F, then A^c ∈ F,
- for every sequence (A_n)_{n≥1} in F, it holds that ⋃_{n=1}^∞ A_n ∈ F.

Assume that F is a σ-algebra. A function P : F → [0, 1] is called a probability measure if
- P(Ω) = 1,
- for any sequence of disjoint events (A_n)_{n≥1} it holds that

P(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ P(A_n).

The triple (Ω, F, P) is called a probability space.

Remark 1.7. In the case of a countable sample space we take F to be the set of all subsets of Ω, unless otherwise stated.

We think of F as the collection of observable events. If A ∈ F, then P(A) is the probability of the event A. In some probability models, such as the one in Example 1.4, the probability of each individual outcome is 0. This is one reason why we need to specify probabilities of events rather than outcomes.

1.1. Equally likely outcomes. The simplest case is that of a finite sample space Ω and equally likely outcomes:

P({ω}) = 1/|Ω| for all ω ∈ Ω.

Note that by definition of probability measure this implies that

P(A) = |A|/|Ω| for all A ∈ F.

Moreover:
P(∅) = 0,
if A ⊆ B then P(A) ≤ P(B),
if A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B),
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(A ∪ B) ≤ P(A) + P(B).

To check that P is a probability measure, note that P(Ω) = |Ω|/|Ω| = 1, and for disjoint events A_1, ..., A_n it holds that

P(A_1 ∪ ⋯ ∪ A_n) = |A_1 ∪ A_2 ∪ ⋯ ∪ A_n|/|Ω| = (|A_1| + ⋯ + |A_n|)/|Ω| = ∑_{k=1}^n P(A_k),

as wanted.

Example 1.8. When throwing a fair die there are 6 possible outcomes, all equally likely. Then Ω = {1, 2, 3, 4, 5, 6} and P({i}) = 1/6 for i = 1, ..., 6. So P(even outcome) = P({2, 4, 6}) = 1/2, while P(outcome ≤ 5) = P({1, 2, 3, 4, 5}) = 5/6.

2. Combinatorial analysis

We have seen that it is often necessary to count the number of subsets of Ω with a given property. We now take a systematic look at some counting methods.

2.1. Multiplication rule. Take N finite sets Ω_1, Ω_2, ..., Ω_N (some of which might coincide), with cardinalities |Ω_k| = n_k. We imagine picking one element from each set: how many ways do we have to do so? Clearly, we have n_1 choices for the first element. Now, for each choice of the first element, we have n_2 choices for the second, so that |Ω_1 × Ω_2| = n_1 n_2. Once the first two elements are chosen, we have n_3 choices for the third, and so on, giving

|Ω_1 × Ω_2 × ⋯ × Ω_N| = n_1 n_2 ⋯ n_N.

We refer to this as the multiplication rule.
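The rule P(A) = |A|/|Ω| for equally likely outcomes is easy to check numerically. The following Python sketch (an illustrative aside, not part of the original notes) reproduces the computations of Example 1.8:

```python
from fractions import Fraction

# Sample space of a fair die, with equally likely outcomes.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) = |A| / |Omega| for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
at_most_5 = {w for w in omega if w <= 5}

print(prob(even), prob(at_most_5))  # 1/2 5/6
```

Using Fraction keeps the probabilities exact, so they match the hand computations digit for digit.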

Example 2.1. A restaurant offers 6 starters, 7 main courses and 5 desserts. The number of possible three-course meals is then 6 · 7 · 5 = 210.

Example 2.2. Throw 3 dice. Then the number of outcomes giving an even number, then a 6, and then a number smaller than 4 is 3 · 1 · 3 = 9.

Example 2.3 (The number of subsets). Suppose a set Ω = {ω_1, ω_2, ..., ω_n} has n elements. How many subsets does Ω have? We proceed as follows. To each subset A of Ω we can associate a string of 0's and 1's of length n so that the i-th entry is 1 if ω_i is in A, and 0 otherwise. Thus if, say, Ω = {ω_1, ω_2, ω_3, ω_4} then

A_1 = {ω_1} ↦ 1, 0, 0, 0
A_2 = {ω_1, ω_3, ω_4} ↦ 1, 0, 1, 1
A_3 = ∅ ↦ 0, 0, 0, 0.

This defines a bijection between the subsets of Ω and the strings of 0's and 1's of length n. Thus we have to count the number of such strings. Since for each entry we have 2 choices (either 0 or 1), there are 2^n strings. This shows that a set of n elements has 2^n subsets. Note that this also counts the number of functions from a set of n elements to {0, 1}.

2.2. Permutations. How many possible orderings of n elements are there? Label the elements {1, 2, ..., n}. A permutation is a bijection from {1, 2, ..., n} to itself, i.e. an ordering of the elements. We may obtain all permutations by successively choosing the image of element 1, then the image of element 2 and so on. We have n choices for the image of 1, then n − 1 choices for the image of 2, n − 2 choices for the image of 3, until we have only one choice for the image of n. Thus the total number of choices is, by the multiplication rule,

n! = n(n − 1)(n − 2) ⋯ 1.

Thus there are n! different orderings, or permutations, of n elements. Equivalently, there are n! different bijections between any two sets of n elements.

Example 2.4. There are 52! possible orderings of a standard deck of cards.

2.3. Subsets. How many ways are there to choose k elements from a set of n elements?

With ordering.
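Both counts above can be verified by brute-force enumeration. Here is a short Python check (an added aside, not from the original notes) for n = 4:

```python
from itertools import permutations, product
from math import factorial

n = 4

# Each subset of an n-element set corresponds to a 0/1 string of length n
# (Example 2.3), so enumerating the strings counts the subsets: 2**n of them.
num_subsets = len(list(product([0, 1], repeat=n)))

# Orderings of n elements (Section on permutations): n! of them.
num_orderings = len(list(permutations(range(n))))

print(num_subsets, num_orderings)  # 16 24
```

The same enumeration works for any small n; the counts grow as 2^n and n! respectively.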
We have n choices for the first element, n − 1 choices for the second element and so on, ending with n − k + 1 choices for the k-th element. Thus there are

(2.1) n(n − 1) ⋯ (n − k + 1) = n!/(n − k)!

ways to choose k ordered elements from n. An alternative way to obtain the above formula is the following: to pick k ordered elements from n, first pick a permutation of the n elements (n! choices), then forget all elements but the first k. Since for each choice of the first k elements there are (n − k)! permutations starting with those k elements, we again obtain (2.1).

Without ordering. To choose k unordered elements from n, we could first choose k ordered elements, and then forget about the order. Recall that there are n!/(n − k)! possible ways to choose k ordered elements from n. Moreover, any given k elements can be ordered in k! possible ways. Thus there are

(n choose k) = n!/(k!(n − k)!)

possible ways to choose k unordered elements from n. More generally, suppose we have integers n_1, n_2, ..., n_k with n_1 + n_2 + ⋯ + n_k = n. Then we have

(n choose n_1, ..., n_k) = n!/(n_1! ⋯ n_k!)

possible ways to partition n elements into k subsets of cardinalities n_1, ..., n_k.

Example 2.5. Imagine having a box containing n balls. Then there are:
n!/(n − k)! ordered ways to pick k balls,
(n choose k) unordered ways to pick k balls,
(n choose n_1, n_2, n_3) ways to subdivide the balls into 3 unordered groups of n_1, n_2 and n_3 balls each.

2.4. Subsets with repetitions. How many ways are there to choose k elements from a set of n elements, allowing repetitions?

With ordering. We have n choices for the first element, n choices for the second element and so on. Thus there are

n^k = n · n ⋯ n

possible ways to choose k ordered elements from n, allowing repetitions.

Without ordering. Suppose we want to choose k elements from n, allowing repetitions but discarding the order. How many ways do we have to do so? Note that naïvely dividing n^k by k! doesn't give the right answer, since there may be repetitions. Instead, we count as follows. Label the n elements {1, 2, ..., n}, and for each element draw a star ∗ each time it is picked, separating consecutive elements by a vertical bar. With n = 4 and k = 6, for instance, the diagram

1 ∗∗ | 2 ∗ | 3 | 4 ∗∗∗

records that element 1 was picked twice, element 2 once, element 3 never, and element 4 three times. Note that there are k stars and n − 1 vertical bars. Now delete the numbers:

(2.2) ∗∗ | ∗ | | ∗∗∗

The above diagram uniquely identifies an unordered choice of (possibly repeated) k elements. Thus we simply have to count how many such diagrams there are. The only restriction is that there
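Both the binomial and the multinomial counts can be confirmed in a few lines of Python (an illustrative aside, not from the original notes):

```python
from itertools import combinations
from math import comb, factorial

n, k = 5, 3

# Unordered choices of k elements from n, by direct enumeration,
# compared with the formula n! / (k! (n-k)!).
num_unordered = len(list(combinations(range(n), k)))
assert num_unordered == comb(n, k) == factorial(n) // (factorial(k) * factorial(n - k))

# Multinomial coefficient: partitions of 5 elements into groups of sizes 2, 2, 1.
num_partitions = factorial(5) // (factorial(2) * factorial(2) * factorial(1))

print(num_unordered, num_partitions)  # 10 30
```

Here `math.comb` computes the binomial coefficient directly; the enumeration is only there to confirm the formula.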

must be n − 1 vertical bars and k stars. Since there are n + k − 1 locations in total, we can fix such a diagram by assigning the positions of the stars, which can be done in

(n + k − 1 choose k)

ways. This therefore counts the number of ways to choose k unordered elements from n, allowing repetitions.

Example 2.6. Imagine again having a box with n balls, but now each time a ball is picked, it is put back in the box (so that it can be picked again). There are:
n^k ordered ways to pick k balls, and
(n + k − 1 choose k) unordered ways to pick k balls.

Example 2.7 (Increasing and non-decreasing functions). An increasing function from {1, 2, ..., k} to {1, 2, ..., n} is uniquely determined by its range, which is a subset of {1, 2, ..., n} of size k. Vice versa, each such subset determines a unique increasing function. This bijection tells us that there are (n choose k) increasing functions from {1, 2, ..., k} to {1, 2, ..., n}, since this is the number of subsets of size k of {1, 2, ..., n}. How about non-decreasing functions? There is a bijection from the set of non-decreasing functions f : {1, 2, ..., k} → {1, 2, ..., n} to the set of increasing functions g : {1, 2, ..., k} → {1, 2, ..., n + k − 1}, given by g(i) = f(i) + i − 1. Hence the number of such non-decreasing functions is (n + k − 1 choose k).

3. Exercises

Exercise 1. Define a probability space. Take Ω = {a, b, c}. Which of the following are σ-algebras? Explain.
F = {∅, {a}},
F = {∅, {a}, {b, c}, {a, b, c}},
F = {∅, {a}, {b}, {c}, {b, c}, {a, b, c}},
F = {∅, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}}.

Exercise 2. Let Ω be a sample space, and A be an event. Show that P(A^c) = 1 − P(A).

Exercise 3. How many ways are there to divide a deck of 52 cards into two decks of 26 each? How many ways are there to do so, ensuring that each of the two decks contains exactly 13 black and 13 red cards?
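The stars-and-bars count (n + k − 1 choose k) can be checked by enumerating multisets directly; Python's `itertools.combinations_with_replacement` does exactly that (an added aside, not part of the notes):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 4, 3  # choose k elements from n, repetitions allowed, order ignored

# Direct enumeration of unordered choices with repetition, compared with
# the stars-and-bars formula (n + k - 1 choose k).
num = len(list(combinations_with_replacement(range(n), k)))
assert num == comb(n + k - 1, k)

print(num)  # 20
```

By Example 2.7, 20 is also the number of non-decreasing functions from {1, 2, 3} to {1, 2, 3, 4}.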

Exercise 4. Four mice are chosen (without replacement) from a litter, two of which are white. The probability that both white mice are chosen is twice the probability that neither is chosen. How many mice are there in the litter?

Exercise 5. Suppose that n (non-identical) balls are tossed at random into n boxes. What is the probability that no box is empty? What is the probability that all balls end up in the same box?

Exercise 6. [Ordered partitions] An ordered partition of k of size n is a sequence (k_1, k_2, ..., k_n) of non-negative integers such that k_1 + ⋯ + k_n = k. How many ordered partitions of k of size n are there?

Exercise 7. [The birthday problem] If there are n people in the room, what is the probability that at least two people have the same birthday?

3.1. Recap of formulas.
Permutations of n elements: n!.
Choose k elements from n, no repetitions:
with ordering: n!/(n − k)!,
without ordering: (n choose k).
Choose k elements from n, with repetitions:
with ordering: n^k,
without ordering: (n + k − 1 choose k).

4. Stirling's formula

We have seen how factorials are ubiquitous in combinatorial analysis. It is therefore important to be able to provide asymptotics for n! as n becomes large, which is often the case of interest. Recall that we write a_n ∼ b_n to mean that a_n/b_n → 1 as n → ∞.

Theorem 4.1 (Stirling's formula). As n → ∞ we have

n! ∼ √(2π) n^(n+1/2) e^(−n).

Note that this implies that log(n!) ∼ log(√(2π) n^(n+1/2) e^(−n)) as n → ∞. We now prove this weaker statement.

Set l_n = log(n!) = log 1 + log 2 + ⋯ + log n. Write ⌊x⌋ for the integer part of x. Then

log ⌊x⌋ ≤ log x ≤ log ⌊x + 1⌋.

Integrating over the interval [1, n] we get

l_{n−1} ≤ ∫_1^n log x dx ≤ l_n,

from which

∫_1^n log x dx ≤ l_n ≤ ∫_1^{n+1} log x dx.

Integrating by parts we find

∫_1^n log x dx = n log n − n + 1,

from which we deduce that

n log n − n + 1 ≤ l_n ≤ (n + 1) log(n + 1) − n.

Since n log n − n + 1 ∼ n log n and (n + 1) log(n + 1) − n ∼ n log n, dividing through by n log n and taking the limit as n → ∞ we find that l_n ∼ n log n, as wanted.

Example 4.2. Suppose we have 4n balls, of which 2n are red and 2n are black. We put them randomly into two urns, so that each urn contains 2n balls. What is the probability that each urn contains exactly n red balls and n black balls? Call this probability p_n. Note that there are (4n choose 2n) ways to distribute the balls into the two urns, (2n choose n)(2n choose n) of which will result in each urn containing exactly n balls of each color. Thus

p_n = (2n choose n)² / (4n choose 2n) = ((2n)!)⁴ / ((n!)⁴ (4n)!)
    ∼ (√(2π) (2n)^(2n+1/2) e^(−2n))⁴ / ((√(2π) n^(n+1/2) e^(−n))⁴ · √(2π) (4n)^(4n+1/2) e^(−4n))
    = √(2/(πn)).
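Stirling's formula is also easy to test numerically. The sketch below (an added aside, not from the notes) works in log-space, since n^(n+1/2) overflows floating point long before the asymptotics become visible; `math.lgamma(n + 1)` gives log(n!) exactly for this purpose:

```python
from math import exp, lgamma, log, pi

def log_stirling(n):
    """log of the Stirling approximation sqrt(2*pi) * n**(n + 1/2) * exp(-n)."""
    return 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n

# log(n!) = lgamma(n + 1); since n!/approximation -> 1, these ratios
# should approach 1 from above as n grows.
ratios = [exp(lgamma(n + 1) - log_stirling(n)) for n in (5, 50, 500)]
print(ratios)
```

The ratios decrease towards 1 (roughly like 1 + 1/(12n)), in agreement with Theorem 4.1.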

5. Properties of probability measures

Let (Ω, F, P) be a probability space, and recall that P : F → [0, 1] has the property that P(Ω) = 1 and for any sequence (A_n)_{n≥1} of disjoint events in F it holds that

P(⋃_{n≥1} A_n) = ∑_{n≥1} P(A_n).

Then we have the following:
0 ≤ P(A) ≤ 1 for all A ∈ F,
if A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B),
P(A^c) = 1 − P(A), since P(A) + P(A^c) = P(Ω) = 1,
P(∅) = 0,
if A ⊆ B then P(A) ≤ P(B), since P(B) = P(A ∪ (B \ A)) = P(A) + P(B \ A) ≥ P(A),
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(A ∪ B) ≤ P(A) + P(B).

5.1. Continuity of probability measures. For a non-decreasing sequence of events A_1 ⊆ A_2 ⊆ A_3 ⊆ ⋯ we have

P(⋃_{n≥1} A_n) = lim_{n→∞} P(A_n).

Indeed, we can define a new sequence (B_n)_{n≥1} as

B_1 = A_1, B_2 = A_2 \ A_1, B_3 = A_3 \ (A_1 ∪ A_2), ..., B_n = A_n \ (A_1 ∪ ⋯ ∪ A_{n−1}), ...

Then the B_n's are disjoint, and so

P(A_n) = P(B_1 ∪ ⋯ ∪ B_n) = ∑_{k=1}^n P(B_k).

Taking the limit as n → ∞ on both sides, we get

lim_{n→∞} P(A_n) = ∑_{n=1}^∞ P(B_n) = P(⋃_{n≥1} B_n) = P(⋃_{n≥1} A_n),

as wanted. Similarly, for a non-increasing sequence of events B_1 ⊇ B_2 ⊇ B_3 ⊇ ⋯ we have

P(⋂_{n≥1} B_n) = lim_{n→∞} P(B_n).

Example 5.1. Take Ω = N and let A_n = {1, 2, ..., n} for n ≥ 1. Then we have A_1 ⊆ A_2 ⊆ A_3 ⊆ ⋯ and

lim_{n→∞} P(A_n) = P(⋃_{n≥1} A_n) = P(Ω) = 1.

If, on the other hand, we set A_n = A_5 for all n ≥ 5, then we find

lim_{n→∞} P(A_n) = P(⋃_{n≥1} A_n) = P(A_5).

Example 5.2. Take Ω = N and let B_n = {n, n + 1, n + 2, ...} for n ≥ 1. Then we have B_1 ⊇ B_2 ⊇ B_3 ⊇ ⋯ and

lim_{n→∞} P(B_n) = P(⋂_{n≥1} B_n) = P(∅) = 0.

If, on the other hand, we set B_n = B_10 for all n ≥ 10, then we find

lim_{n→∞} P(B_n) = P(⋂_{n≥1} B_n) = P(B_10).

5.2. Subadditivity of probability measures. For any events A_1, A_2, ..., A_n we have

(5.1) P(⋃_{k=1}^n A_k) ≤ ∑_{k=1}^n P(A_k).

To see this, define the events

B_1 = A_1, B_2 = A_2 \ A_1, B_3 = A_3 \ (A_1 ∪ A_2), ..., B_n = A_n \ (A_1 ∪ ⋯ ∪ A_{n−1}).

Then the B_k's are disjoint, with ⋃_{k=1}^n B_k = ⋃_{k=1}^n A_k, and P(B_k) ≤ P(A_k) for all k = 1, ..., n, from which

P(⋃_{k=1}^n A_k) = P(⋃_{k=1}^n B_k) = ∑_{k=1}^n P(B_k) ≤ ∑_{k=1}^n P(A_k).

The same proof shows that this also holds for infinite sequences of events (A_n)_{n≥1}:

P(⋃_{n≥1} A_n) ≤ ∑_{n≥1} P(A_n).

The above inequalities show that probability measures are subadditive, and are often referred to as Boole's inequality.

Example 5.3. We have already seen that P(A ∪ B) ≤ P(A) + P(B), which is an example of (5.1) with n = 2.

Example 5.4. Throw a fair die, so that Ω = {1, 2, 3, 4, 5, 6}. Then

P({1, 2, 3} ∪ {2, 4}) = P({1, 2, 3, 4}) = 4/6,

while

P({1, 2, 3}) + P({2, 4}) = 3/6 + 2/6 = 5/6,

so indeed P({1, 2, 3} ∪ {2, 4}) ≤ P({1, 2, 3}) + P({2, 4}).

Example 5.5. Toss a fair coin 3 times. Write H for head and T for tail. Then

P({exactly one head} ∪ {(T, H, H)} ∪ {(H, T, T)}) = P({(H, T, T), (T, H, T), (T, T, H), (T, H, H)}) = 4/8 = 1/2,

while

P(exactly one head) + P((T, H, H)) + P((H, T, T)) = 3/8 + 1/8 + 1/8 = 5/8.

5.3. Inclusion-exclusion formula. We have already seen that for any two events A, B

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

How does this generalise to an arbitrary number of events? For 3 events we have

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

In general, for n events A_1, A_2, ..., A_n we have

P(⋃_{i=1}^n A_i) = ∑_{k=1}^n (−1)^(k+1) ∑_{1 ≤ i_1 < ⋯ < i_k ≤ n} P(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_k}).

You are not required to remember the general formula for this course, but you should learn it for n = 2, 3.
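The three-event formula can be verified mechanically for any concrete events. The following Python aside (not part of the original notes) checks it for three events on a fair die:

```python
from fractions import Fraction

# Fair die: equally likely outcomes on {1, ..., 6}.
omega = set(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

A, B, C = {1, 2, 3}, {2, 4}, {3, 4, 5}

# Inclusion-exclusion for three events.
lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
assert lhs == rhs
print(lhs)  # 5/6
```

The set operators `|` and `&` play the roles of ∪ and ∩, so the code mirrors the formula term by term.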

Example 5.6. Pick a card from a standard deck of 52. Define the events

A = {the card is diamonds}, B = {the card is even}, C = {the card is at least 10}.

Then

P(A) = 13/52, P(B) = 24/52, P(C) = 16/52,
P(A ∩ B) = 6/52, P(A ∩ C) = 4/52, P(B ∩ C) = 8/52, P(A ∩ B ∩ C) = 2/52,

which gives

P(A ∪ B ∪ C) = (13 + 24 + 16 − 6 − 4 − 8 + 2)/52 = 37/52.

6. Exercises

Exercise 1. Let A, B, C be three events. Express in symbols the following events: (i) only A occurs, (ii) at least one event occurs, (iii) exactly one event occurs, (iv) no event occurs.

Exercise 2. Show that for any two sets A, B it holds that (A ∪ B)^c = A^c ∩ B^c.

Exercise 3. Show that, for any three events A, B, C,

P(A^c ∩ (B ∪ C)) = P(B) + P(C) − P(B ∩ C) − P(C ∩ A) − P(A ∩ B) + P(A ∩ B ∩ C).

How many numbers in {1, ..., } are not divisible by 7 but are divisible by 3 or 5?

Exercise 4. Let (A_n)_{n≥1} be a sequence of events in some probability space (Ω, F, P). Define

A = {ω ∈ Ω : ω ∈ A_n infinitely often},
B = {ω ∈ Ω : ω ∈ A_n for all sufficiently large n}.

Write A and B in terms of unions/intersections of the A_k's.

Exercise 5. A committee of size r is chosen at random from a set of n people. Calculate the probability that m given people will be in the committee.
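The arithmetic of Example 5.6 can be confirmed by enumerating the deck. The sketch below (an added aside, not in the original notes) models cards as (rank, suit) pairs with ranks 1 to 13; this encoding is an assumption for illustration, but it matches how the example counts even cards and cards of value at least 10:

```python
from fractions import Fraction
from itertools import product

# A deck as (rank, suit) pairs, ranks 1..13 (illustrative encoding).
deck = set(product(range(1, 14), ["clubs", "diamonds", "hearts", "spades"]))

def P(event):
    return Fraction(len(event), len(deck))

A = {c for c in deck if c[1] == "diamonds"}   # diamonds
B = {c for c in deck if c[0] % 2 == 0}        # even rank
C = {c for c in deck if c[0] >= 10}           # rank at least 10

print(P(A | B | C))  # 37/52
```

The enumeration reproduces the inclusion-exclusion answer 37/52 without computing any intersections by hand.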

7. Independence

We will now discuss what is arguably the most important concept in probability theory, namely independence.

Definition 7.1. Two events A, B are said to be independent if

P(A ∩ B) = P(A)P(B).

Example 7.2. Throw two fair dice, and let

A = {the first number is even}, B = {the second number is at most 2}.

Then P(A) = 3/6, P(B) = 2/6 and P(A ∩ B) = 6/36, so P(A ∩ B) = P(A)P(B) and the events A and B are independent.

Example 7.3. Pick a card from a standard deck. Let

A = {the card is hearts}, B = {the card is at most 5}.

Then P(A) = 13/52, P(B) = 20/52 and P(A ∩ B) = 5/52, so P(A ∩ B) = P(A)P(B) and therefore A and B are independent.

Example 7.4. Toss two fair dice. Let

A = {the sum of the two numbers is 6}, B = {the first number is 4}.

Then P(A) = 5/36, P(B) = 1/6 and P(A ∩ B) = 1/36. Thus P(A ∩ B) ≠ P(A)P(B) and the events are not independent. Intuitively, the probability of getting 6 for the sum depends on the first outcome, since if we were to get 6 at the first toss then it would be impossible to obtain 6 for the sum, while if the first toss gives a number at most 5 then we have a positive probability of getting 6 for the sum. If, on the other hand, we replace A with

A' = {the sum of the two numbers is 7},

then P(A') = 6/36, P(B) = 1/6 and P(A' ∩ B) = 1/36. Thus P(A' ∩ B) = P(A')P(B) and the events are independent. That is, knowing that the first toss gave 4 doesn't give us any information on the probability that the sum is equal to 7.

An important property of independence is the following: if A is independent of B, then A is independent of B^c. Indeed, we have

P(A ∩ B^c) = P(A) − P(A ∩ B)
           = P(A) − P(A)P(B)      (by independence of A, B)
           = P(A)(1 − P(B))
           = P(A)P(B^c)           (since P(B^c) = 1 − P(B)),

which shows that A and B^c are independent.

Example 7.5. Let A, B be defined as in Example 7.2, so that B^c = {the second number is at least 3}. Then P(B^c) = 1 − P(B) = 4/6, so P(A)P(B^c) = 12/36 = P(A ∩ B^c), so A and B^c are independent.

Example 7.6. Let A, B be defined as in Example 7.3, so that B^c = {the card is at least 6}. Then P(B^c) = 1 − P(B) = 32/52, so P(A)P(B^c) = 8/52 = P(A ∩ B^c), so A and B^c are independent.

We can also define independence for more than two events.

Definition 7.7. We say that the events A_1, A_2, ..., A_n are independent if for any k ≥ 2 and any collection of distinct indices 1 ≤ i_1 < ⋯ < i_k ≤ n we have

P(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_k}) = P(A_{i_1})P(A_{i_2}) ⋯ P(A_{i_k}).

Note that we require all possible subsets of the n events to be independent.

Example 7.8. Toss 3 fair coins. Let

A = {first coin H}, B = {second coin H}, C = {third coin T}.

Then P(A) = P(B) = P(C) = 1/2 and

P(A ∩ B) = 1/4 = P(A)P(B),
P(B ∩ C) = 1/4 = P(B)P(C),
P(A ∩ C) = 1/4 = P(A)P(C),
P(A ∩ B ∩ C) = 1/8 = P(A)P(B)P(C),

so A, B, C are independent.

At this point the reader may wonder whether independence of n events is equivalent to pairwise independence, that is, the requirement that any two of the events are independent. The answer is no: independence is stronger than pairwise independence. Here is an example.

Example 7.9. Toss 2 fair coins, so that

Ω = {(H, H), (H, T), (T, H), (T, T)}.

Let

A = {the first coin is T} = {(T, H), (T, T)},
B = {the second coin is T} = {(H, T), (T, T)},
C = {get exactly one T} = {(T, H), (H, T)}.

Then P(A) = P(B) = P(C) = 1/2 and P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4, so the events are pairwise independent. On the other hand,

P(A ∩ B ∩ C) = 0 ≠ P(A)P(B)P(C),

so the events are not independent.

8. Conditional probability

Closely related to the notion of independence is that of conditional probability.

Definition 8.1. Let A, B be two events with P(B) > 0. Then the conditional probability of A given B is

P(A|B) = P(A ∩ B)/P(B).

We interpret P(A|B) as the probability that the event A occurs, when it is known that the event B occurs.

Example 8.2. Toss a fair die. Let A_1 = {3} and B = {even} = {2, 4, 6}. Then

P(A_1|B) = P({3 and even})/P(even) = 0.

This is intuitive: if we know that the outcome is even, the probability that the outcome is 3 is zero. If instead we take A_2 = {2, 3, 4, 5, 6} then

P(A_2|B) = P({2, 4, 6})/P(even) = 1.
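Example 7.9 is easy to verify by enumeration. The following Python aside (not part of the original notes) checks that the three events are pairwise independent but not independent as a triple:

```python
from fractions import Fraction
from itertools import product

omega = set(product("HT", repeat=2))   # two fair coin tosses

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "T"}          # first coin T
B = {w for w in omega if w[1] == "T"}          # second coin T
C = {w for w in omega if w.count("T") == 1}    # exactly one T

# Pairwise independence holds:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ... but independence of the triple fails:
print(P(A & B & C), P(A) * P(B) * P(C))  # 0 1/8
```

The failure is exactly the one noted above: A ∩ B ∩ C is empty, while the product of the three probabilities is 1/8.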

This is also intuitive: if we know that the outcome is even, then we are certain that it is at least 2. Finally, take A_3 = {4, 5, 6}, for which

P(A_3|B) = P({4, 6})/P(even) = 2/3.

An important property to note is that if A and B are independent, then P(A|B) = P(A). That is, when two events are independent, knowing that one occurs does not affect the probability of the other. This follows from

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),

where we have used that A and B are independent in the second equality.

8.1. The law of total probability. From the definition of conditional probability we see that for any two events A, B with P(B) > 0 we have

P(A ∩ B) = P(A|B)P(B).

This shows that the probability of two events occurring simultaneously can be broken up into calculating successive probabilities: first the probability that B occurs, and then the probability that A occurs given that B has occurred. Clearly, by replacing B with B^c in the above formula we also have

P(A ∩ B^c) = P(A|B^c)P(B^c).

But then, since B and B^c are disjoint and their union is Ω,

P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B)P(B) + P(A|B^c)P(B^c).

This has an important generalisation, called the law of total probability.

Theorem 8.3 (Law of total probability). Let (B_n)_{n≥1} be a sequence of disjoint events of positive probability, whose union is the sample space Ω. Then for all events A

P(A) = ∑_{n=1}^∞ P(A|B_n)P(B_n).

Proof. We know that, by assumption,

⋃_{n≥1} B_n = Ω.

This gives

P(A) = P(A ∩ Ω) = P(A ∩ ⋃_{n≥1} B_n) = P(⋃_{n≥1} (A ∩ B_n))
     = ∑_{n≥1} P(A ∩ B_n) = ∑_{n≥1} P(A|B_n)P(B_n). □

Remark 8.4. In the statement of the law of total probability we can also drop the assumption P(B_n) > 0, provided we interpret P(A|B_n)P(B_n) = 0 if P(B_n) = 0. It follows that we can also take (B_n)_n to be a finite collection of events (simply set B_n = ∅ from some finite index onwards).

Example 8.5. An urn contains 10 black balls and 5 red balls. We draw two balls from the urn without replacement. What is the probability that the second ball drawn is black? Let

A = {the second ball is black}, B = {the first ball is black}.

Then

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) = (9/14)(10/15) + (10/14)(5/15) = 2/3.

If, in general, the urn contains b black balls and r red balls, we have

P(A) = [b/(b + r)] · [(b − 1)/(b + r − 1)] + [r/(b + r)] · [b/(b + r − 1)] = b/(b + r).

Example 8.6. Toss a fair coin. If it comes up head, then toss one fair die, otherwise toss 2 fair dice. Call M the maximum number obtained from the dice, and let A = {M ≤ 4}. What is the probability of A? We look at the two cases: either the coin came up head and we tossed one die, or the coin came up tail and we tossed two dice. Thus let

B = {the coin gives H}.

By the law of total probability we have

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) = P(M ≤ 4|H)P(H) + P(M ≤ 4|T)P(T)
     = (4/6)(1/2) + (16/36)(1/2) = 5/9.

Example 8.7. Toss a fair coin twice. If we get (H, H) we toss 1 die, if (T, T) we toss 2 dice, and otherwise we toss 3 dice. With M defined as before, what is the probability of the event A = {M ≤ 4}? Again, we split the probability space into disjoint events according to the number of dice we toss. Let

B_1 = {(H, H)}, B_2 = {(T, T)}, B_3 = {(H, T), (T, H)}.

Then by the law of total probability

P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3)
     = (4/6)(1/4) + (16/36)(1/4) + (64/216)(1/2) = 23/54.

8.2. Bayes' theorem. Note that if A and B are two events of positive probability we have, by definition of conditional probability,

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).

Often, one of the conditional probabilities is easier to compute than the other. Bayes' theorem tells us how to switch between the two.

Theorem 8.8 (Bayes' theorem). For any two events A, B, both of positive probability, we have

P(A|B) = P(B|A)P(A)/P(B).

Proof. We have

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B). □

Example 8.9. Going back to Example 8.5, suppose we are told that the second ball is black. What is the probability that the first ball was black? Applying Bayes' theorem we find

P(B|A) = P(A|B)P(B)/P(A) = (9/14)(10/15)/(2/3) = 9/14.

Example 8.10. Going back to Example 8.6, suppose we are told that M ≤ 2. What is the probability that the coin came up head? Recall that B = {the coin gives H}, and let C = {M ≤ 2}. By Bayes' theorem, we have

P(B|C) = P(C|B)P(B)/P(C) = P(C|B)P(B)/(P(C|B)P(B) + P(C|B^c)P(B^c))
       = (2/6 · 1/2)/((2/6 · 1/2) + (4/36 · 1/2)) = 3/4.

9. Exercises

Exercise 1. Toss two fair dice. Determine whether the following events are independent:
(i) A = {the first die gives 4}, B = {the maximum outcome is 3},
(ii) A = {the first die gives 4}, B = {the maximum outcome is 5},
(iii) A = {the first die gives 5}, B = {the sum of the outcomes is 6},
(iv) A = {the first die gives 6}, B = {the sum of the outcomes is 7}.

Exercise 2. Let A and B be independent. Show that A^c and B^c are independent.
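The numbers in Examples 8.5 and 8.9 can be confirmed by brute-force enumeration. This Python aside (not from the original notes) lists all ordered draws of two distinct balls from the urn:

```python
from fractions import Fraction

# Urn of Example 8.5: 10 black ("B") and 5 red ("R") balls; two draws
# without replacement = ordered pairs of distinct ball positions.
balls = ["B"] * 10 + ["R"] * 5
pairs = [(balls[i], balls[j]) for i in range(15) for j in range(15) if i != j]

def P(pred):
    hits = [p for p in pairs if pred(p)]
    return Fraction(len(hits), len(pairs))

second_black = P(lambda p: p[1] == "B")               # law of total probability: 2/3
both_black = P(lambda p: p[0] == "B" and p[1] == "B")
first_black_given_second = both_black / second_black  # Bayes: 9/14

print(second_black, first_black_given_second)  # 2/3 9/14
```

Enumerating position pairs rather than colour pairs keeps the 15 · 14 = 210 outcomes equally likely, which is what both examples implicitly assume.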

Exercise 3. Let A, B be two events with P(B) > 0. Define P(A|B). Show that
(a) P(B|B) = 1,
(b) if A ⊆ B then P(A|B) = P(A)/P(B),
(c) if A ⊊ B then P(A|B) < 1.

Exercise 4. Give an example of two events for which P(A|B) ≠ P(B|A). Can you find two events A, B of positive probability such that P(A|B)P(B) ≠ P(B|A)P(A)? Explain.

Exercise 5. Draw a card from a standard deck. What is the probability that the card is hearts? What is the probability that the card is even, given that it is hearts? What is the probability that the card is even, given that it is at most 5? What is the probability that the card is at most 5, given that it is even?

Exercise 6. Students of a given class get grades A, B, C, D with probability 1/8, 2/8, 3/8, 2/8 respectively. If they misread the question, they usually do worse, getting grades A, B, C, D with probability 1/10, 2/10, 3/10, 4/10 respectively. Each student misreads the question with probability 1/3. What is the probability that:
(a) a student who read the question correctly gets A?
(b) a student who got B has read the question correctly?
(c) a student who got C has misread the question?
(d) a student who misread the question got C?
(e) a student gets D?

Exercise 7. What is the probability that a function f : {1, 2, ..., k} → {1, 2, ..., n} is constant, given that it is non-decreasing? What is the probability that f : {1, 2, ..., k} → {1, 2, ..., n} is non-decreasing, given that it is constant?

10. Some natural probability distributions

Up to this point in the course we have only worked with equally likely outcomes. On the other hand, many other choices of probability measures are possible. We will next see some of them, which arise naturally.

Definition 10.1. A probability distribution, or simply distribution, is a probability measure on some (Ω, F). We say that a distribution is discrete if Ω is a discrete set.

For discrete Ω we always take the σ-algebra F to be the set of all subsets of Ω. In particular, it contains all sets {ω}, ω ∈ Ω, made of a single outcome. Thus, if µ is a probability distribution, we can write

p_ω = µ({ω}).

Note that p_ω ∈ [0, 1] for all ω ∈ Ω, and

∑_{ω∈Ω} p_ω = µ(Ω) = 1.

Moreover, µ is uniquely determined by the collection (p_ω)_{ω∈Ω}, which we call weights. We think of the weight p_ω as the probability of seeing the outcome ω when sampling according to µ. We now list several natural probability distributions.

10.1. Bernoulli distribution. Take Ω = {0, 1} and define µ to be the probability distribution given by the weights

p_1 = p, p_0 = 1 − p.

Then µ models the number of heads obtained in one biased coin toss (the coin gives head with probability p and tail with probability 1 − p). Such µ is called the Bernoulli distribution of parameter p, and it is denoted by Bernoulli(p).

10.2. Binomial distribution. Fix an integer N ≥ 1 and let Ω = {0, 1, 2, ..., N}. Define µ to be the probability distribution given by the weights

p_k = (N choose k) p^k (1 − p)^(N−k), 0 ≤ k ≤ N.

Then µ models the number of heads in N biased coin tosses (again, the coin gives head with probability p, and tail with probability 1 − p). Such µ is called the Binomial distribution of parameters N, p, and it is denoted by Binomial(N, p). Note that

µ(Ω) = ∑_{k=0}^N p_k = ∑_{k=0}^N (N choose k) p^k (1 − p)^(N−k) = (p + (1 − p))^N = 1.

10.3. Geometric distribution. Let Ω = {1, 2, 3, ...}, and define µ as the distribution given by the weights

p_k = (1 − p)^(k−1) p, k ≥ 1.

Then µ models the number of biased coin tosses up to (and including) the first head. Such µ is called the Geometric distribution of parameter p, and it is denoted by Geometric(p). Note that

µ(Ω) = ∑_{k≥1} p_k = p ∑_{k≥1} (1 − p)^(k−1) = p · 1/(1 − (1 − p)) = 1.

Note: A variant of the geometric distribution counts the number of tails until the first head. In this case Ω = {0, 1, 2, 3, ...} and

p_k = (1 − p)^k p, k ≥ 0.

This is also referred to as Geometric(p), and you should check that µ(Ω) = 1. It should always be made clear which version of the geometric distribution is intended.

10.4. Poisson distribution. Let Ω = N, and for a fixed parameter λ ∈ (0, +∞) let µ be the probability distribution given by the weights

p_k = e^(−λ) λ^k / k!, k ≥ 0.

Such µ is called the Poisson distribution of parameter λ, and it is denoted by Poisson(λ). Note that

µ(Ω) = ∑_{k=0}^∞ e^(−λ) λ^k / k! = e^(−λ) e^λ = 1.

This distribution arises as the limit of a Binomial distribution with parameters N, λ/N as N → ∞. Indeed, if p_k(N, λ/N) denote the weights of a Binomial(N, λ/N) distribution, we have

p_k(N, λ/N) = (N choose k) (λ/N)^k (1 − λ/N)^(N−k)
            = [N(N − 1) ⋯ (N − k + 1)/N^k] · (λ^k/k!) · (1 − λ/N)^(N−k) → e^(−λ) λ^k / k!

as N → ∞, since

N(N − 1) ⋯ (N − k + 1)/N^k → 1, (1 − λ/N)^(N−k) = (1 − λ/N)^N (1 − λ/N)^(−k) → e^(−λ).

11. Exercises

Exercise 1. Let A, B be two events such that 0 < P(A) < 1 and 0 < P(B) < 1. We say that B attracts A if P(A|B) > P(A). Show that if B attracts A then A attracts B.

Exercise 2. Tom is undecided as to whether to take a Mathematics course or a Literature course. He estimates that the probability of getting an A would be 1/3 in a Maths course, and 1/2 in a Literature course. Suppose he chooses which course to take by tossing a fair coin: what is the probability that he gets an A? Suppose that instead he tosses a biased coin: if it comes up head, which happens with probability p ∈ (0, 1), he takes the Maths course, and otherwise he takes the Literature course. What is the probability that Tom gets an A?
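The Poisson limit above can be observed numerically. The sketch below (an added aside, not from the original notes) compares Binomial(N, λ/N) weights with the Poisson(λ) weight at a fixed k as N grows:

```python
from math import comb, exp, factorial

lam, k = 2.0, 3

def binom_weight(N, p, k):
    """Binomial(N, p) weight: (N choose k) p^k (1-p)^(N-k)."""
    return comb(N, k) * p**k * (1 - p) ** (N - k)

# Poisson(lam) weight at k: e^(-lam) lam^k / k!
poisson_weight = exp(-lam) * lam**k / factorial(k)

# Binomial(N, lam/N) weights approach the Poisson weight as N -> infinity.
approx = [binom_weight(N, lam / N, k) for N in (10, 100, 10000)]
print(approx, poisson_weight)
```

The discrepancy shrinks roughly like 1/N, consistent with the two convergence statements used in the derivation.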

28 28 1. DISCRETE PROBABILITY Exercise 3. Toss two dice. Let A i = {the first die gives i}, 1 i 6 B j = {the sum of the two dice is j}, 2 j 12. Determine for which (i, j) the events A i, B j are independent. Exercise 4. At a certain stage of a criminal investigation the inspector is 60 per cent convinced of the guilt of a certain suspect. Suppose, however, that a new piece of evidence is uncovered: it is found that the criminal is left-handed. If the probability of being both left-handed and non-guilty is 2/25, how certain of the guilt of the suspect should the inspector be if it turns out that the suspect is left-handed? Exercise 5. Show that for any A, B with P(B) > 0 it holds 0 P(A B) 1. Exercise 6. Toss a biased coin, which gives head with probability p (0, 1) and tail with probability 1 p. If it comes up head, roll two dice and record the sum. If tail, pick a card from a standard deck and record its number (between 1 and 13). What is the probability that the number you have recorded is 4? What is the probability that it is 2? 12. Additional exercises Exercise 1. A committee of 3 people is to be formed from a group of n people. How many different committees are possible? Let c(n) denote the number of possible committees. Find α such that c(n) lim n n α (0, + ). Exercise 2. How many different words can be formed from (exactly) the letters QUIZZES? Exercise 3. How many distinct non-negative integer-valued solutions (x 1, x 2 ) of the equation x 1 + x 2 = 3 are possible? Exercise 4. Show that P(A B) P(A). When does equality hold? Exercise 5. A total of 36 members of a club play tennis, 28 play squash and 18 play badminton. Furthermore, 22 of the members play both tennis and squash, 12 play both tennis and badminton, 9 play both squash and badminton, and 4 play all three sports. How many members of the club play at least one of the three sports?

29 13. RANDOM VARIABLES 29 Exercise 6. How many people have to be in a room in order that the probability that at least two of them celebrate their birthday in the same month is at least 1/2? Assume that all months are equally likely. Exercise 7. An urn contains two coins of type A and one coin of type B. When a type A coin is flipped, it comes up head with probability 1/4, while when a type B coin is flipped it comes up head with probability 3/4. A coin is randomly chosen from the urn and flipped. What is the probability to get head? Given that the coin gave head, what is the probability that it was a coin of type A? Exercise 8. An infinite sequence of independent trials is to be performed. Each trial results in a success with probability p (0, 1), and a failure with probability 1 p. Determine the probability that: (a) at least one success occurs in the first n trials, (b) exactly k successes occur in the first n trials, (c) all trials result in successes, (d) all trials result in failures. Exercise 9. An infinite sequence of independent trials is to be performed. Each trial results in a success with probability p (0, 1), and a failure with probability 1 p. What is the probability distribution of the number of trials until (and including) the first success? Write down the corresponding weights (p k ) k 1. Exercise 10. Each day of the week James flips a biased coin: if it comes up head (which happens with probability p), then he goes out for a walk. What is the distribution of the number of times James goes for a walk in one week? Write down the corresponding weights (p k ) 0 k Random variables It is often the case that when a random experiment is conducted we don t just want to record the outcome, but perhaps we are interested in some functions of the outcome. Definition A random variable on (Ω, F) taking values in a discrete set S is a function X : Ω S. Typically S R or S R k, in which case we say that X is a real-valued random variable.

Example. Toss a biased coin. Then Ω = {H, T}, where H, T stand for head and tail. Define X(H) = 1, X(T) = 0. Then X : Ω → {0, 1} counts the number of heads in the outcome.

Example. Toss two biased coins, so that Ω = {H, T}². Define X(HH) = 2, X(TH) = X(HT) = 1, X(TT) = 0. Then X : Ω → {0, 1, 2} counts the number of heads in 2 coin tosses.

Example. Roll two dice, so that Ω = {1, 2, 3, 4, 5, 6}². For (i, j) ∈ Ω, set X(i, j) = i + j. Then X : Ω → {2, 3, ..., 12} records the sum of the two dice. We could also define the random variables

Y(i, j) = max{i, j}, Z(i, j) = i.

Then Y : Ω → {1, 2, 3, 4, 5, 6} records the maximum of the two dice, while Z : Ω → {1, 2, 3, 4, 5, 6} records the outcome of the first die.

Example. For a given probability space (Ω, F, P) and A ∈ F, define

1_A(ω) = 1 for ω ∈ A, and 1_A(ω) = 0 for ω ∉ A.

Then 1_A : Ω → {0, 1} tells us whether the outcome was in A or not. Note that

(i) 1_{A^c} = 1 − 1_A, (ii) 1_{A∩B} = 1_A 1_B, (iii) 1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B).

You should prove (iii) as an exercise. For a subset T ⊆ S we denote the set {ω ∈ Ω : X(ω) ∈ T} simply by {X ∈ T}. Since X takes values in a discrete set S, we let p_x = P(X = x) for all x ∈ S. The collection (p_x)_{x∈S} is referred to as the probability distribution of X. If the probability distribution of X is, say, Geometric(p), then we say that X is a Geometric random variable of parameter p, and write X ∼ Geometric(p). Similarly, for the other distributions we have encountered we write X ∼ Bernoulli(p), X ∼ Binomial(N, p), X ∼ Poisson(λ). Given a random variable X taking values in S ⊆ R, the function F_X : R → [0, 1] given by

F_X(x) = P(X ≤ x)

31 13. RANDOM VARIABLES 31 is called distribution function of X. Note that F X is piecewise constant, non-decreasing, right-continuous, and lim F X(x) = 0, x lim F X(x) = 1. x + Example Toss a biased coin, which gives head with probability p, and define the random variable X(H) = 1, X(T ) = 0. Then 0, if x < 0 F X (x) = 1 p, if x [0, 1) 1, if x 1. Note that F X is piecewise constant, non-decreasing, right-continuous, and its jumps are given by the weights of a Bernoulli distribution of parameter p. Knowing the probability distribution function is equivalent to knowing the collection of weights (p x ) x S such that p x = P(X = x), and hence it is equivalent to knowing the probability distribution of X. Definition 13.7 (Independent random variables). Two random variables X, Y, taking values in S X, S Y respectively, are said to be independent if P(X = x, Y = y) = P(X = x)p(y = y) for all (x, y) S X S Y. In general, the random variables X 1, X 2... X n taking values in S 1, S 2... S n are said to be independent if P(X 1 = x 1, X 2 = x 2... X n = x n ) = P(X 1 = x 1 )P(X 2 = x 2 ) P(X n = x n ) for all (x 1, x 2... x n ) S 1 S 2 S n. Note that if X 1, X 2... X N are independent, then for any k N and any distinct indices 1 i 1 < i 2 < < i k N, the random variables X i1, X i2... X ik are independent. We show this with N = 3 for simplicity. Let X : Ω S X, Y : Ω S Y and Z : Ω S Z be independent random variables, that is P(X = x, Y = y, Z = z) = P(X = x)p(y = y)p(z = z) for all (x, y, z) S X S Y S Z. Then X, Y are independent. Indeed, P(X = x, Y = y) = P(X = x, Y = y, Z = z) = P(X = x)p(y = y)p(z = z) z S Z z S Z = P(X = x)p(y = y) P(Z = z) = P(X = x)p(y = y). z S Z }{{} 1

Similarly, one shows that Y, Z and X, Z are independent.

Example 13.8. Consider the probability distribution on Ω = {0, 1}^N given by the weights

p_ω = ∏_{k=1}^N p^{ω_k} (1 − p)^{1−ω_k}

for any ω = (ω_1, ω_2, ..., ω_N) ∈ Ω. This models a sequence of N tosses of a biased coin. On Ω we can define the random variables X_1, X_2, ..., X_N by X_k(ω) = ω_k for all ω = (ω_1, ω_2, ..., ω_N) ∈ Ω. Then each X_k is a Bernoulli random variable of parameter p, since

P(X_i = 1) = P(ω_i = 1) = ∑_{ω : ω_i = 1} p_ω = p = 1 − P(X_i = 0).

Moreover, for ω = (ω_1, ω_2, ..., ω_N) ∈ {0, 1}^N,

P(X_1 = ω_1, X_2 = ω_2, ..., X_N = ω_N) = p_ω = ∏_{k=1}^N p^{ω_k} (1 − p)^{1−ω_k} = ∏_{k=1}^N P(X_k = ω_k),

so X_1, X_2, ..., X_N are independent. Define a further random variable S_N on Ω by setting

S_N(ω) = ∑_{k=1}^N X_k(ω) = ∑_{k=1}^N ω_k.

Then S_N(ω) counts the number of ones in ω. Moreover, for k = 0, 1, ..., N,

|{S_N = k}| = \binom{N}{k},

and p_ω = p^k (1 − p)^{N−k} for all ω ∈ {S_N = k}, so

P(S_N = k) = \binom{N}{k} p^k (1 − p)^{N−k}.

Thus S_N is a Binomial random variable of parameters N, p. We have seen in the above example that the sum of random variables is again a random variable. In general, one could look at functions of random variables.

Definition 13.9. Let X : Ω → S be a random variable taking values in S, and let g : S → S′ be a function. Then g(X) : Ω → S′, defined by

g(X)(ω) = g(X(ω)),

is a random variable taking values in S′.
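Example 13.8 can be verified by brute-force enumeration of Ω = {0, 1}^N; the sketch below (with the arbitrary choices N = 5, p = 0.4, not taken from the notes) checks that the induced distribution of S_N matches the Binomial(N, p) weights.

```python
from itertools import product
from math import comb, isclose

p, N = 0.4, 5  # arbitrary illustrative parameters

# Distribution of S_N(ω) = ω_1 + ... + ω_N under p_ω = Π p^{ω_k} (1-p)^{1-ω_k}.
dist = {k: 0.0 for k in range(N + 1)}
for omega in product((0, 1), repeat=N):
    weight = 1.0
    for bit in omega:
        weight *= p if bit == 1 else 1 - p
    dist[sum(omega)] += weight

# Compare with the Binomial(N, p) weights C(N, k) p^k (1-p)^(N-k).
for k in range(N + 1):
    assert isclose(dist[k], comb(N, k) * p**k * (1 - p)**(N - k))
print("S_N is Binomial(N, p)")
```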

14. Expectation

From now on all the random variables we consider are assumed to take real values, unless otherwise specified. We say that a random variable X is non-negative if X takes values in S ⊆ [0, ∞).

Definition 14.1 (Expectation of a non-negative random variable). For a non-negative random variable X : Ω → S we define the expectation (or expected value, or mean value) of X to be

E(X) = ∑_{x∈S} x P(X = x).

Thus the expectation of X is the average of the values taken by X, averaged with weights corresponding to the probabilities of the values.

Example. If X ∼ Bernoulli(p) then

E(X) = 1 · P(X = 1) + 0 · P(X = 0) = p.

Example. If X ∼ Geometric(p) then

E(X) = ∑_{k≥1} k P(X = k) = ∑_{k≥1} k (1 − p)^{k−1} p = p [ −(d/dx) ∑_{k=0}^∞ (1 − x)^k ]_{x=p} = p [ −(d/dx) (1/(1 − (1 − x))) ]_{x=p} = p · 1/p² = 1/p.

For a random variable X write

X_+ = max{X, 0}, X_− = max{−X, 0},

so that X = X_+ − X_− and |X| = X_+ + X_−.

Definition. For X : Ω → S, define X_+ and X_− as above. Then, provided E(X_+) < ∞ or E(X_−) < ∞, we define the expectation of X to be

E(X) = E(X_+) − E(X_−) = ∑_{x∈S} x P(X = x).

Note that we are allowing the expectation to be infinite: if E(X_+) = +∞ and E(X_−) < ∞ then we set E(X) = +∞. Similarly, if E(X_+) < ∞ and E(X_−) = +∞ then we set E(X) = −∞. However, if both E(X_+) = E(X_−) = +∞ then the expectation of X is not defined.

Definition. A random variable X : Ω → S is said to be integrable if E(|X|) < ∞.

Example. Let X : Ω → Z have probability distribution (p_k)_{k∈Z} where

p_0 = P(X = 0) = 0, p_k = P(X = k) = P(X = −k) = 2^{−k−1}, k ≥ 1.

Then E(X_+) = E(X_−) < ∞, so E(X) = E(X_+) − E(X_−) = 0.

Example. Let X : Ω → Z have probability distribution (p_k)_{k∈Z} where

p_0 = P(X = 0) = 0, p_k = P(X = k) = P(X = −k) = 3/(πk)², k ≥ 1.

Then E(X_+) = E(X_−) = +∞, so E(X) is not defined.

Properties of the expectation. The expectation of a random variable X : Ω → S satisfies the following properties, some of which we state without proof:

(1) If X ≥ 0 then E(X) ≥ 0, and E(X) = 0 if and only if P(X = 0) = 1.
(2) If c ∈ R is a constant, then E(c) = c and E(cX) = c E(X).
(3) For random variables X, Y, it holds that E(X + Y) = E(X) + E(Y).

Proof. If X, Y take values in S_X, S_Y respectively, we have

E(X + Y) = ∑_{z ∈ S_X + S_Y} z P(X + Y = z) = ∑_{x∈S_X} ∑_{y∈S_Y} (x + y) P(X = x, Y = y)
= ∑_{x∈S_X} x ( ∑_{y∈S_Y} P(X = x, Y = y) ) + ∑_{y∈S_Y} y ( ∑_{x∈S_X} P(X = x, Y = y) )
= ∑_{x∈S_X} x P(X = x) + ∑_{y∈S_Y} y P(Y = y) = E(X) + E(Y). □

The above properties generalise by induction, to give that for any constants c_1, c_2, ..., c_n and random variables X_1, X_2, ..., X_n it holds that

E( ∑_{k=1}^n c_k X_k ) = ∑_{k=1}^n c_k E(X_k).

Thus the expectation is linear.

(4) For any function g : S → S′, g(X) : Ω → S′ is a random variable taking values in S′, and

E(g(X)) = ∑_{x∈S} g(x) P(X = x).

An important example is given by g(x) = x^k; the corresponding expectation E(X^k) is called the k-th moment of the random variable X.

(5) If X ≥ 0 takes integer values, then

E(X) = ∑_{j=1}^∞ P(X ≥ j).

Proof. We have

E(X) = ∑_{k=0}^∞ k P(X = k) = ∑_{k=1}^∞ ∑_{j=1}^k P(X = k) = ∑_{j=1}^∞ ∑_{k=j}^∞ P(X = k) = ∑_{j=1}^∞ P(X ≥ j). □

(6) If X, Y are independent random variables, then E(XY) = E(X)E(Y). In general, if X_1, X_2, ..., X_N are independent random variables,

E(X_1 X_2 ··· X_N) = E(X_1) E(X_2) ··· E(X_N).

Example. Let X ∼ Geometric(p). Then P(X ≥ k) = (1 − p)^{k−1}, and

E(X) = ∑_{k=1}^∞ P(X ≥ k) = ∑_{k=1}^∞ (1 − p)^{k−1} = ∑_{k=0}^∞ (1 − p)^k = 1/p.

Example. Roll two dice, and let X, Y denote the outcomes of the first and second die respectively. Then X and Y are independent, and

E(X) = E(Y) = ∑_{k=1}^6 k P(X = k) = (1/6) ∑_{k=1}^6 k = 21/6,

so if S = X + Y and P = XY we have

E(S) = E(X + Y) = E(X) + E(Y) = 42/6 = 7, E(P) = E(XY) = E(X)E(Y) = (21/6)².

In general, if X_1, X_2, ..., X_N denote the outcomes of N consecutive rolls of a die, and we define

S = ∑_{k=1}^N X_k, P = ∏_{k=1}^N X_k,

then

E(S) = N E(X_1) = (21/6) N, E(P) = [E(X_1)]^N = (21/6)^N.
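Property (5) and the geometric example above can be checked numerically, truncating the infinite sums at a large index; p = 0.25 below is an arbitrary illustrative choice.

```python
from math import isclose

p, K = 0.25, 2000  # arbitrary parameter and truncation point

# X ~ Geometric(p) on {1, 2, ...}: P(X = k) = (1-p)^(k-1) p, P(X >= k) = (1-p)^(k-1).
direct = sum(k * (1 - p)**(k - 1) * p for k in range(1, K))  # definition of E(X)
tails = sum((1 - p)**(k - 1) for k in range(1, K))           # sum of P(X >= k)

assert isclose(direct, 1 / p)
assert isclose(tails, 1 / p)
print("both sums give E(X) = 1/p =", 1 / p)
```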

Functions of random variables. Recall that if X : Ω → S is a random variable, and g : S → S′ is a (non-random) function, then g(X) : Ω → S′ is itself a random variable, and by Property (4)

E(g(X)) = ∑_{x∈S} g(x) P(X = x).

Now, if X : Ω → S_X and Y : Ω → S_Y are two random variables, and f : S_X → S′_X and g : S_Y → S′_Y are two functions, then if X, Y are independent, also f(X), g(Y) are independent. This generalises to arbitrarily many random variables, to give that if X_1, X_2, ..., X_N are independent, then f_1(X_1), f_2(X_2), ..., f_N(X_N) are independent.

Example. Toss two dice. Let

A = {the first die gives 4}, B = {the sum of the two dice is 7}.

Then we have seen that A, B are independent events. Define the indicator random variables X = 1_A, Y = 1_B. Since indicators of independent events are independent, X, Y are independent. Moreover,

E(X) = P(A) = 1/6, E(Y) = P(B) = 1/6,
E(X + Y) = P(A) + P(B) = 1/3, E(XY) = P(A)P(B) = 1/36.

If, on the other hand, we introduce

C = {the sum of the two dice is 3}, Z = 1_C,

then A, C are not independent, so X, Z are not independent, and

E(Z) = P(C) = 2/36, E(X + Z) = P(A) + P(C) = 8/36,
E(XZ) = E(1_{A∩C}) = P(A ∩ C) = 0 ≠ E(X)E(Z).

15. Exercises

Exercise 1. Let X ∼ Bernoulli(p). Determine the distribution of Y = 1 − X.

Exercise 2. Show that for any two sets A, B,

1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B).

Exercise 3. Show that if A, B are independent events then 1_A and 1_B are independent random variables. Compute the expectation of 1_{A∩B} and 1_{A∪B}.

37 16. VARIANCE AND COVARIANCE 37 Exercise 4. Roll two dice, and let Ω = {1, } 2 be the associated sample space. Define the random variable X : Ω {1, } by X(i, j) = max{i, j} for all (i, j) Ω. Write down the weights of the random variable X. Draw the graph of the associated distribution function F X, and check that lim F X(x) = 0, x lim F X(x) = 1. x + Exercise 5. Roll two dice, and let Ω = {1, } 2 be the associated sample space. Consider the event A = {i = 4}, and define the random variable 1 if i = 4, X(i, j) = 1 A (i, j) = 0 if i 4. Define further the random variable Determine whether X and Y are independent. Y (i, j) = i + j. Exercise 6. Redo all the computations of Example 13.8 for N = 2 and N = 3. Exercise 7. Show that the X k s defined in Example 13.8 are Bernoulli(p). Exercise 8. For an event A Ω, let 1 A denote the indicator function of A. Compute E(1 A ). When is E(1 A ) = 0? When is E(1 A ) = 1? Exercise 9. Let X Binomial(N, p). Show that E(X) = Np. Exercise 10. Let X P oisson(λ). Show that E(X) = λ. 16. Variance and covariance Variance. Once we know the mean of a random variable X, we may ask how much typically X deviates from it. This is measured by the variance. Definition For any random variable X with finite mean E(X), the variance of X is defined to be Var(X) = E[(X E(X)) 2 ]. Thus Var(X) measures how much the distribution of X is concentrated around its mean: the smaller the variance, the more the distribution is concentrated around E(X).

Properties of the variance. The variance of a random variable X : Ω → S satisfies the following properties:

(1) Var(X) = E(X²) − [E(X)]².

Proof. We have

Var(X) = E[(X − E(X))²] = E[X² + (E(X))² − 2X E(X)] = E(X²) − [E(X)]². □

(2) If E(X) = 0 then Var(X) = E(X²).
(3) Var(X) ≥ 0, and Var(X) = 0 if and only if P(X = c) = 1 for some constant c ∈ R.
(4) If c ∈ R then Var(cX) = c² Var(X).
(5) If c ∈ R then Var(X + c) = Var(X).
(6) If X, Y are independent then Var(X + Y) = Var(X) + Var(Y). In general, if X_1, X_2, ..., X_N are independent then

Var(X_1 + X_2 + ··· + X_N) = Var(X_1) + Var(X_2) + ··· + Var(X_N).

Covariance. Closely related to the concept of variance is that of covariance.

Definition 16.2. For any two random variables X, Y with finite mean, we define their covariance to be

Cov(X, Y) = E[(X − E(X))(Y − E(Y))].

Properties of the covariance. The covariance of two random variables X, Y satisfies the following properties:

(1) Cov(X, Y) = E(XY) − E(X)E(Y).

Proof. We have

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E[XY − X E(Y) − Y E(X) + E(X)E(Y)] = E(XY) − E(X)E(Y). □

(2) Cov(X, X) = Var(X).
(3) Cov(X, c) = 0 for all c ∈ R.
(4) For c ∈ R, Cov(cX, Y) = c Cov(X, Y) and Cov(X + c, Y) = Cov(X, Y).
(5) For X, Y, Z random variables, Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y).

The above properties generalise, to give that the covariance is bilinear. That is, for any collections of random variables X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_n, and constants a_1, a_2, ..., a_n and b_1, b_2, ..., b_n, it holds that

Cov( ∑_{i=1}^n a_i X_i , ∑_{j=1}^n b_j Y_j ) = ∑_{i=1}^n ∑_{j=1}^n a_i b_j Cov(X_i, Y_j).

(6) For any two random variables with finite variance, it holds that

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

Proof. We have

Var(X + Y) = E[(X − E(X) + Y − E(Y))²]
= E[(X − E(X))²] + E[(Y − E(Y))²] + 2 E[(X − E(X))(Y − E(Y))]
= Var(X) + Var(Y) + 2 Cov(X, Y). □

(7) If X, Y are independent then Cov(X, Y) = 0.

Proof. We have

Cov(X, Y) = E(XY) − E(X)E(Y) = E(X)E(Y) − E(X)E(Y) = 0,

where the second equality follows from independence. □

We remark that the converse of property (7) is false, as shown by the following example.

Example 16.3 (Zero covariance does not imply independence). Let X take the values 2, 1, −1, −2, each with probability 1/4, and let Y = X². Then E(X) = 0 and E(XY) = E(X³) = 0 by symmetry, so

Cov(X, Y) = E(XY) − E(X)E(Y) = 0.

However, X, Y are not independent, since

P(X = 2, Y = 4) = P(X = 2) = 1/4,

while

P(X = 2)P(Y = 4) = P(X = 2) P({X = 2} ∪ {X = −2}) = (1/4)(1/2) = 1/8.
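A numerical check of Example 16.3 (not in the original notes): enumerate the four-point distribution and verify that the covariance vanishes while independence fails.

```python
from math import isclose

# X uniform on {-2, -1, 1, 2}, Y = X^2, as in Example 16.3.
support = [-2, -1, 1, 2]
w = 1 / 4

EX = sum(x * w for x in support)
EY = sum(x**2 * w for x in support)
EXY = sum(x**3 * w for x in support)   # XY = X^3
cov = EXY - EX * EY
assert isclose(cov, 0.0, abs_tol=1e-12)

# Independence fails: P(X=2, Y=4) = 1/4 but P(X=2) P(Y=4) = (1/4)(1/2) = 1/8.
p_joint = w
p_product = w * (2 * w)                # P(Y=4) = P(X=-2) + P(X=2) = 1/2
assert not isclose(p_joint, p_product)
print("Cov(X, Y) = 0, yet X and Y are not independent")
```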

17. Exercises

Exercise 1. Toss two biased coins (each giving head with probability p). Let

A = {the first coin gives head}, B = {the second coin gives head}, C = {at least one head}.

Let X = 1_A, Y = 1_B, Z = 1_C. Compute the expectation and variance of X, Y, Z. Compute Cov(X, Y), Cov(X, Z). Are X, Z independent? Compute E(X + Y), E[(X + Y)²], E[log(X + 2)], E(4YZ − X²).

Exercise 2. Let X be a random variable taking the values 0, 1, 2, 3 with probabilities p_0, p_1, p_2, p_3 respectively, where p_0 = p_1 = p_2 = 1/6, p_3 = 1/2. Compute E(X), E(X²), Var(X). Describe the random variable Y = X² (that is, list the possible values of Y and the associated weights). Compute E(Y), E(Y²), Var(Y) and Cov(X, Y). Are X, Y independent? Compute E(X + Y + 3), Var(X + Y − 2), E[(X − Y)X], E[(X + Y)(X − Y)].

Exercise 3. Draw two cards from a standard deck. Let

A = {the first card is a heart}, B = {the second card is a heart}.

Define X = 1_A, Y = 1_B. Are A, B independent? Are X, Y independent? Write down the weights of X and Y, and compute E(X), E(Y), Cov(X, Y), Var(X + Y).

Exercise 4. Let X ∼ Geometric(p). Compute E(X), E(X(X − 1)), and deduce E(X²).

Exercise 5. Let X ∼ Poisson(λ). Compute E(X), E(X²) and Var(X).

Exercise 6. Toss a biased coin N times. For all 1 ≤ k ≤ N let

A_k = {the k-th coin toss gives head},

and define X_k = 1_{A_k}. Compute E(X_k) and Var(X_k) for all k. Are the X_k's independent? Let S = ∑_{k=1}^N X_k. Then S counts the total number of heads in N coin tosses. Compute E(S) and Var(S). Assume N is even. What is the expected total number of heads obtained in even coin tosses? And odd coin tosses?

Exercise 7. Let X be a random variable with finite mean µ. For x ∈ R define the function

f(x) = E[(X − x)²].

By computing f′(x), show that f is minimised at x = µ, and compute its minimum value.

Exercise 8. Let X be a random variable with mean µ and variance σ². Compute E(X²) and E[(X − 2)²] in terms of µ, σ².

Exercise 9. Give an example of two random variables with equal mean but different variance. Give an example of two random variables with the same mean and variance, but different distribution.

18. Inequalities

It is often the case that, rather than computing the exact value of E(X), we are only interested in having a bound for it.

Theorem 18.1 (Markov's inequality). Let X be a non-negative random variable, and λ ∈ (0, ∞). Then

P(X ≥ λ) ≤ E(X)/λ.

Proof. Let Y = λ 1_{X≥λ}. Then X ≥ Y, so

E(X) ≥ E(Y) = E(λ 1_{X≥λ}) = λ E(1_{X≥λ}) = λ P(X ≥ λ). □

Example 18.2. Roll a die N times, and let S_N denote the total number of 3's obtained. Then S_N ∼ Binomial(N, 1/6), and so E(S_N) = N/6. Thus, for α > 0,

P(S_N ≥ αN) ≤ E(S_N)/(αN) = 1/(6α).

Note that the bound is useless if α ≤ 1/6, while it contains interesting information if α > 1/6.

Theorem 18.3 (Chebyshev's inequality). Let X be a random variable with finite mean E(X). Then, for λ ∈ (0, ∞),

P(|X − E(X)| ≥ λ) ≤ Var(X)/λ².

Proof. We have

P(|X − E(X)| ≥ λ) = P((X − E(X))² ≥ λ²) ≤ E[(X − E(X))²]/λ² = Var(X)/λ²,

where the inequality follows from Markov's inequality. □
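The Markov bound of Example 18.2 can be compared with the exact binomial tail; N = 60 and α = 1/3 below are arbitrary choices made for illustration, not values from the notes.

```python
from math import comb

N, p, alpha = 60, 1 / 6, 1 / 3          # arbitrary illustrative parameters
threshold = int(round(alpha * N))       # alpha * N = 20

# Exact tail P(S_N >= alpha N) for S_N ~ Binomial(N, 1/6).
tail = sum(comb(N, k) * p**k * (1 - p)**(N - k)
           for k in range(threshold, N + 1))

markov = 1 / (6 * alpha)                # Markov's bound E(S_N)/(alpha N)
assert tail <= markov
print(f"exact tail {tail:.3e} <= Markov bound {markov:.3f}")
```

The exact tail is far smaller than the bound, which illustrates that Markov's inequality is crude but completely general.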

Theorem 18.4 (Cauchy-Schwarz inequality). For all random variables X, Y, it holds that

E(|XY|) ≤ √(E(X²) E(Y²)).

We do not prove this inequality. Note that if Y is constant, say P(Y = 1) = 1, then the inequality gives E(|X|) ≤ √(E(X²)).

Example 18.5. Toss a fair coin repeatedly. Let X denote the number of tosses until (and including) the first head, while Y denotes the number of tosses until (and including) the first tail. Then X ∼ Geometric(p) and Y ∼ Geometric(1 − p) with p = 1/2, and X, Y are not independent since

P(X = 1, Y = 1) = 0 ≠ P(X = 1)P(Y = 1).

We want a bound for E(XY). To this end, recall that if Z ∼ Geometric(p) then E(Z²) = (2 − p)/p². So

E(X²) = (2 − 1/2)/(1/2)² = 6 = E(Y²),

from which, using the Cauchy-Schwarz inequality, we find

E(XY) ≤ √(6 · 6) = 6.

This, together with E(X) = E(Y) = 2, allows us to bound the covariance:

Cov(X, Y) = E(XY) − E(X)E(Y) ≤ 6 − 4 = 2.

Theorem 18.6 (Jensen's inequality). Let X be an integrable random variable and let f : R → R be a convex function. Then

f(E(X)) ≤ E(f(X)).

We do not prove this. To remember the direction of the inequality, take f(x) = x² and use that E(X²) − [E(X)]² = Var(X) ≥ 0.

Example 18.7. Let X ∼ Poisson(λ), so that E(X) = λ. Define Y = e^X. Since f(x) = e^x is a convex function, we have

E(Y) = E(e^X) ≥ e^{E(X)} = e^λ.

19. Exercises

Exercise 1. In a sequence of n independent trials the probability of a success at the i-th trial is p_i. Let N denote the total number of successes. Find E(N) and Var(N).

43 19. EXERCISES 43 Exercise 2. Let X be an integrable random variable. Show that for every p > 0 and all x > 0 Show further that for all β 0 P( X x) E( X p )x p. P(X x) E(e βx )e βx. Exercise 3. Let X 1, X 2... X n be independent and identically distributed random variables with finite mean µ and variance σ 2. Define the sample mean X = 1 n X k. n Compute E( X) and Var( X). Use Chebyshev s inequality to bound Use this bound to find n such that P( X µ > 2σ). P( X µ > 2σ) Exercise 4. Let X P oisson(λ). Compute E(e X ). Compute E(e αx ) for all α R. Let X, Y be independent P oisson(λ) random variables. For any α, β R compute E(e αx+βy ). Exercise 5. Let X 1, X 2... X n be independent and identically distributed random variables with finite mean µ and variance σ 2. Define the sample mean X = 1 n X k. n Use Chebyshev s inequality to show that for any ε > 0 as n. P( X µ > ε) 0 Exercise 6. Explain why E(X) log E(e X ), E(X) E( X ). for any random variable X with finite mean.

Exercise 7. Roll a die N times. Denote by X the total number of 3's and by Y the total number of 6's. Are X and Y independent? Denote by Z the total number of even outcomes. Are X and Z independent? Compute the expectation and variance of X, Y, Z. [Hint: write X, Y, Z as sums of independent indicator random variables.]

20. Probability generating functions

Let X be a random variable taking values in N.

Definition. We define the generating function of X to be

G_X(t) = E(t^X) = ∑_{k=0}^∞ t^k P(X = k), for t ∈ [0, 1].

Note that the assumption t ∈ [0, 1] guarantees that the infinite sum defining G_X(t) is convergent. Moreover,

G_X(1) = ∑_{k=0}^∞ P(X = k) = 1,

and G_X(t) ≤ 1 for all t ∈ [0, 1].

Example. If X ∼ Bernoulli(p) then

G_X(t) = E(t^X) = tp + 1 − p.

Example. If X ∼ Geometric(p), then

G_X(t) = E(t^X) = ∑_{k=1}^∞ t^k (1 − p)^{k−1} p = pt ∑_{k=0}^∞ [t(1 − p)]^k = pt/(1 − t(1 − p)).

Example. If X ∼ Poisson(λ), then

G_X(t) = E(t^X) = ∑_{k=0}^∞ t^k e^{−λ} λ^k/k! = e^{−λ} ∑_{k=0}^∞ (tλ)^k/k! = e^{−λ+λt}.

The importance of probability generating functions lies in the following result.

Theorem. The probability generating function of a random variable X determines the distribution of X uniquely.

We do not prove the above theorem. Note that it tells us that if X, Y have the same generating function, that is if E(t^X) = E(t^Y) for all t ∈ [0, 1], then X, Y have the same distribution.

Example. We show that if X ∼ Poisson(λ) and Y ∼ Poisson(µ) are independent, then X + Y ∼ Poisson(λ + µ). Indeed, we find

G_{X+Y}(t) = E(t^{X+Y}) = E(t^X)E(t^Y) = e^{−λ+λt} e^{−µ+µt} = e^{−(λ+µ)+(λ+µ)t}.

We recognise in this expression the probability generating function of a Poisson(λ + µ) random variable. Since probability generating functions uniquely determine distributions, we conclude that X + Y ∼ Poisson(λ + µ).

Example. If X ∼ Binomial(N, p), then

G_X(t) = E(t^X) = ∑_{k=0}^N t^k \binom{N}{k} p^k (1 − p)^{N−k} = (tp + 1 − p)^N.

We use this to show that the sum of two independent Bernoulli(p) random variables is Binomial(2, p). Indeed, if X, Y are independent Bernoulli(p), we find

G_{X+Y}(t) = E(t^{X+Y}) = E(t^X)E(t^Y) = (tp + 1 − p)²,

which is the probability generating function of a Binomial(2, p). This generalises to show that the sum of N independent Bernoulli(p) random variables is Binomial(N, p).

As shown in the above examples, if X, Y are independent then

G_{X+Y}(t) = E(t^{X+Y}) = E(t^X)E(t^Y) = G_X(t) G_Y(t).

If, moreover, X and Y also have the same distribution, then G_{X+Y}(t) = [G_X(t)]². This generalises to arbitrarily many random variables, to give that if X_1, X_2, ..., X_n are independent, and S = ∑_{k=1}^n X_k, then

G_S(t) = G_{X_1}(t) G_{X_2}(t) ··· G_{X_n}(t).

If, moreover, the X_k's all have the same distribution, then G_S(t) = [G_{X_1}(t)]^n.

Example. If X_1, X_2, ..., X_n are independent random variables, with X_k ∼ Poisson(λ_k), then X_1 + X_2 + ··· + X_n ∼ Poisson(λ_1 + λ_2 + ··· + λ_n).
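The Poisson addition example above can be sketched numerically (λ = 1.5 and µ = 2.5 are arbitrary choices): convolving the weights of X and Y reproduces the Poisson(λ + µ) weights, and the PGFs multiply as claimed.

```python
from math import exp, factorial, isclose

lam, mu, K = 1.5, 2.5, 80  # arbitrary parameters; K truncates the supports

def poisson_weights(rate, K):
    # Poisson(rate) weights p_k = e^{-rate} rate^k / k!, k = 0, ..., K-1.
    return [exp(-rate) * rate**k / factorial(k) for k in range(K)]

wx, wy = poisson_weights(lam, K), poisson_weights(mu, K)

# Convolution: P(X + Y = n) = sum_k P(X = k) P(Y = n - k).
for n in range(20):
    conv = sum(wx[k] * wy[n - k] for k in range(n + 1))
    target = exp(-(lam + mu)) * (lam + mu)**n / factorial(n)
    assert isclose(conv, target)

# PGFs multiply: G_X(t) G_Y(t) = e^{-(λ+µ)+(λ+µ)t}, checked at a sample point.
t = 0.7
GX = sum(t**k * w for k, w in enumerate(wx))
GY = sum(t**k * w for k, w in enumerate(wy))
assert isclose(GX * GY, exp(-(lam + mu) + (lam + mu) * t))
print("X + Y is Poisson(lambda + mu), and G_{X+Y} = G_X G_Y")
```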

The expectation E(X^n) is called the n-th moment of X. Provided G_X(t) is finite for some t > 1, we can differentiate inside the series, to obtain

G(1) = ∑_{k=0}^∞ P(X = k) = 1,
G′(1) = ∑_{k=0}^∞ k P(X = k) = E(X),
G″(1) = ∑_{k=0}^∞ k(k − 1) P(X = k) = E(X(X − 1)).

In general,

G^{(n)}(1) = E(X(X − 1) ··· (X − n + 1)).

Example. Recall that if X ∼ Poisson(λ) then G_X(t) = e^{−λ+λt}. So

G′(t) = λ e^{−λ+λt}, G′(1) = λ = E(X),
G″(t) = λ² e^{−λ+λt}, G″(1) = λ² = E(X(X − 1)),

from which we find

E(X) = G′(1) = λ, E(X²) = G″(1) + G′(1) = λ² + λ,

and so Var(X) = E(X²) − [E(X)]² = λ.

21. Conditional distributions

Definition 21.1 (Joint distribution). Given two random variables X : Ω → S_X and Y : Ω → S_Y, their joint distribution is defined as the collection of probabilities

P(X = x, Y = y), x ∈ S_X, y ∈ S_Y.

Note that the joint distribution of X and Y is a probability distribution on S_X × S_Y.

Example 21.2. Let X, Y be independent Bernoulli random variables of parameter p, and define Z = X + Y. Then, if we interpret X, Y as recording the outcomes of two independent coin tosses, Z denotes the total number of heads. Here S_X = S_Y = {0, 1} while S_Z = {0, 1, 2}. Moreover, by independence,

P(X = x, Y = y) = P(X = x)P(Y = y) for all (x, y) ∈ S_X × S_Y.

On the other hand, X and Z are not independent, and they have joint distribution

P(X = 0, Z = 0) = (1 − p)², P(X = 0, Z = 1) = (1 − p)p,
P(X = 1, Z = 1) = p(1 − p), P(X = 1, Z = 2) = p².

Note that the weights of the joint distribution sum up to 1.

Remark 21.3. The joint distribution of two independent random variables is the product of the marginal distributions. From the joint distribution, one can recover the marginal distributions of X and Y by means of the total probability law:

P(X = x) = ∑_{y∈S_Y} P(X = x, Y = y) for all x ∈ S_X,
P(Y = y) = ∑_{x∈S_X} P(X = x, Y = y) for all y ∈ S_Y.

Example 21.4. Continuing Example 21.2, from the joint distribution of X, Z we can recover the marginal distribution of Z via

P(Z = 0) = P(Z = 0, X = 0) + P(Z = 0, X = 1) = (1 − p)²,
P(Z = 1) = P(Z = 1, X = 0) + P(Z = 1, X = 1) = 2p(1 − p),
P(Z = 2) = P(Z = 2, X = 0) + P(Z = 2, X = 1) = p².

This generalises to arbitrarily many random variables: for X_1, ..., X_n random variables taking values in S_1, ..., S_n, their joint distribution is given by

P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n), x_1 ∈ S_1, ..., x_n ∈ S_n,

and we can recover the marginal distributions via

P(X_i = x) = ∑_{x_1, ..., x_{i−1}, x_{i+1}, ..., x_n} P(X_1 = x_1, ..., X_i = x, ..., X_n = x_n).

Definition 21.5 (Conditional distribution). For two random variables X, Y, the conditional distribution of X given {Y = y} (which we assume to have positive probability) is the collection of probabilities

P(X = x | Y = y), x ∈ S_X,

where, clearly,

P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y).

Then we can recover the marginal distribution of X via the total probability law:

P(X = x) = ∑_{y∈S_Y} P(X = x | Y = y) P(Y = y).

Example 21.6. Continuing Example 21.2, the conditional distribution of Z given {X = 0} is

P(Z = 0 | X = 0) = 1 − p, P(Z = 1 | X = 0) = p, P(Z = 2 | X = 0) = 0.

On the other hand, the conditional distribution of Z given {X = 1} is

P(Z = 0 | X = 1) = 0, P(Z = 1 | X = 1) = 1 − p, P(Z = 2 | X = 1) = p.

We can recover the marginal distribution of Z by

P(Z = 0) = P(Z = 0 | X = 0)P(X = 0) + P(Z = 0 | X = 1)P(X = 1) = (1 − p)²,
P(Z = 1) = P(Z = 1 | X = 0)P(X = 0) + P(Z = 1 | X = 1)P(X = 1) = 2p(1 − p),
P(Z = 2) = P(Z = 2 | X = 0)P(X = 0) + P(Z = 2 | X = 1)P(X = 1) = p².

Remark 21.7. If X, Y are independent, then the conditional distribution of X given {Y = y} coincides with the distribution of X, since

P(X = x | Y = y) = P(X = x) for all (x, y) ∈ S_X × S_Y.

Definition 21.8 (Conditional expectation). Finally, we define the conditional expectation of X given {Y = y} by

E(X | Y = y) = ∑_{x∈S_X} x P(X = x | Y = y).

Note that if X, Y are independent then E(X | Y = y) = E(X) for all y ∈ S_Y.

Example 21.9. Continuing Example 21.2, we have

E(Z) = E(X) + E(Y) = 2p,

while

E(Z | X = 0) = 0 · P(Z = 0 | X = 0) + 1 · P(Z = 1 | X = 0) + 2 · P(Z = 2 | X = 0) = p.

We could have arrived at the same conclusion by observing that, since Z = X + Y, on the event {X = 0} we have Z = Y, and so

E(Z | X = 0) = E(Y | X = 0) = E(Y) = p.

Properties of the conditional expectation.

(1) For a constant c ∈ R, we have E(cX | Y = y) = c E(X | Y = y) and E(c | Y = y) = c.
(2) For random variables X, Z, it holds that

E(X + Z | Y = y) = E(X | Y = y) + E(Z | Y = y).

The above two properties generalise by induction, to give that for any constants c_1, c_2, ..., c_n and random variables X_1, X_2, ..., X_n it holds that

E( ∑_{k=1}^n c_k X_k | Y = y ) = ∑_{k=1}^n c_k E(X_k | Y = y).

Thus the conditional expectation is linear.

(3) If X, Y are independent then E(X | Y = y) = E(X) for all y ∈ S_Y.
(4) For a function g : S_X → S′,

E(g(X) | Y = y) = ∑_{x∈S_X} g(x) P(X = x | Y = y).
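The computations in Example 21.2 and its continuations can be reproduced by enumerating the joint weights of (X, Z); p = 0.3 below is an arbitrary illustrative choice, not a value from the notes.

```python
from math import isclose

p = 0.3  # arbitrary illustrative parameter

# Joint weights of (X, Z) with Z = X + Y, where X, Y are independent Bernoulli(p).
joint = {}
for x in (0, 1):
    for y in (0, 1):
        w = (p if x else 1 - p) * (p if y else 1 - p)
        joint[(x, x + y)] = joint.get((x, x + y), 0.0) + w

# Conditional distribution of Z given {X = 0}: should be 1-p, p, 0.
pX0 = sum(w for (x, z), w in joint.items() if x == 0)
cond = {z: joint.get((0, z), 0.0) / pX0 for z in (0, 1, 2)}
assert isclose(cond[0], 1 - p) and isclose(cond[1], p) and cond[2] == 0.0

# Conditional expectation E(Z | X = 0) = p, versus E(Z) = 2p.
EZ_given_X0 = sum(z * q for z, q in cond.items())
EZ = sum(z * w for (x, z), w in joint.items())
assert isclose(EZ_given_X0, p) and isclose(EZ, 2 * p)
print("E(Z | X = 0) =", EZ_given_X0, " E(Z) =", EZ)
```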

49 22. EXERCISES Exercises Exercise 1. Let K be a random variable taking values {0, 1, } with probability 1/8 each. Define ( ) ( ) Kπ Kπ X = cos, Y = sin. 4 4 Show that Cov(X, Y ) = 0 but X, Y are not independent. Exercise 2. Let X be a Poisson random variable of parameter λ. We have seen in a previous exercise that for all β 0 P(X x) E(e βx )e βx. By finding the value of β that minimises the right hand side, show that for all x λ P(X x) exp{ x log(x/λ) λ + x}. Show that, on the other hand, for integer x P(X = x) 1 2πx exp{ x log(x/λ) λ + x} as x. Exercise 3. Suppose that a random variable X has probability generating function G X (t) = t(1 tr ) r(1 t) for t [0, 1), where r 1 is a fixed integer. Determine the mean of X. Exercise 4. Toss a biased coin repeatedly, and let X denote the number of tosses until (and including) the second head. Show that P(X = k) = (k 1)(1 p) k 2 p 2 for k 2. Show that the probability generating function of X is ( ) pt 2 G X (t) =. 1 (1 p)t Deduce that E(X) = 2/p and Var(X) = 2(1 p)/p 2. Explain how X can be represented as the sum of two independent and identically distributed random variables. Use this representation to derive again the mean and variance of X. Exercise 5. Solve the above exercise with X denoting the number of tosses until (and including) the a th head, for a 2. Show that in this case ( ) pt a G X (t) =, E(X) = a a(1 p), Var(X) = 1 (1 p)t p p 2.

50 1. DISCRETE PROBABILITY

Exercise 6. Let X, Y ∼ Poisson(λ) be independent random variables. Find the distribution of X + Y. Show that the conditional distribution of X given {X + Y = n} is Binomial(n, 1/2).

Exercise 7. Let X ∼ Poisson(λ) and Y ∼ Poisson(µ) be independent random variables. Find the distribution of X + Y. Show that the conditional distribution of X given {X + Y = n} is Binomial(n, λ/(λ + µ)).

Exercise 8. The number of misprints on a given page of a book has Poisson distribution of parameter λ, and the numbers of misprints on different pages are independent. What is the probability that the second misprint will occur on page k?

51 CHAPTER 2

Continuous probability

1. Some natural continuous probability distributions

Up to this point we have only considered probability distributions on discrete sets, such as finite sets, N or Z. We now turn our attention to continuous distributions.

Definition 1.1 (Probability density function). A function f : R → R is a probability density function if
f(x) ≥ 0 for all x ∈ R, and ∫_{−∞}^{+∞} f(x) dx = 1.
We will often write p.d.f. in place of probability density function for brevity. Given a p.d.f., we can define a probability measure µ on R by setting
µ([a, b]) = ∫_a^b f(x) dx
for all a, b ∈ R. Note that
µ((−∞, a]) = ∫_{−∞}^a f(x) dx, µ(R) = ∫_{−∞}^{+∞} f(x) dx = 1,
and µ([a, b)) = µ((a, b]) = µ((a, b)) = µ([a, b]). We interpret f(x) as the density of probability at x (the continuous analogue of the weight of x), and µ([a, b]) as the probability of the interval [a, b]. For this course we restrict to piecewise continuous probability density functions.

Example 1.2. Take f(x) = e^{−|x|}/2. Then f(x) ≥ 0 for all x ∈ R, and
∫_{−∞}^{+∞} f(x) dx = 2 ∫_0^{+∞} (e^{−x}/2) dx = 1.
Thus f is a valid probability density function. If µ denotes the associated probability measure, we have
µ([0, ∞)) = ∫_0^∞ (e^{−x}/2) dx = 1/2, µ([−1, 1]) = ∫_{−1}^1 (e^{−|x|}/2) dx = ∫_0^1 e^{−x} dx = 1 − e^{−1}.
This tells us that if we choose a real number at random according to µ, the probability that this number is non-negative is 1/2 (which was also clear, as f is symmetric), while the probability that it has modulus at most 1 is 1 − e^{−1}.
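The probabilities in Example 1.2 can be checked numerically. The sketch below (an illustration, not part of the notes) integrates f(x) = e^{−|x|}/2 with the trapezoid rule and compares the results with the exact values derived above:

```python
import math

# Density of Example 1.2
def f(x):
    return math.exp(-abs(x)) / 2

# Composite trapezoid rule on [a, b] with n panels
def trapezoid(g, a, b, n=100_000):
    h = (b - a) / n
    return h * ((g(a) + g(b)) / 2 + sum(g(a + i * h) for i in range(1, n)))

total = trapezoid(f, -40, 40)     # ≈ µ(R) = 1 (mass beyond ±40 is negligible)
nonneg = trapezoid(f, 0, 40)      # ≈ µ([0, ∞)) = 1/2
mod_le_1 = trapezoid(f, -1, 1)    # ≈ µ([-1, 1]) = 1 - 1/e

print(total, nonneg, mod_le_1)
```

The three printed values agree with 1, 1/2 and 1 − e^{−1} to well within the quadrature error.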

52 2. CONTINUOUS PROBABILITY

Remark 1.3. To be precise, we would have had to introduce a σ-algebra on R, called the Borel σ-algebra, and require that f is Borel-measurable. This is a very weak assumption, and it holds for all piecewise continuous f, to which we restrict our attention. These subtleties go beyond the scope of this course, and we will ignore them.

We now list some natural continuous probability distributions. Throughout, we use the following notation: for A ⊆ R,
1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∉ A,
in analogy with the notation introduced for indicator random variables.

1.1. Uniform distribution. For a, b ∈ R with a < b, the density function
f(x) = (1/(b − a)) 1_{[a,b]}(x)
defines the uniform distribution on [a, b]. Note that
∫_{−∞}^{+∞} f(x) dx = ∫_a^b f(x) dx = 1.
The uniform distribution gives to each interval contained in [a, b] a probability proportional to its length.

1.2. Exponential distribution. For λ ∈ (0, ∞), the density function
f(x) = λe^{−λx} 1_{[0,+∞)}(x)
defines the exponential distribution of parameter λ. Note that
∫_{−∞}^{+∞} f(x) dx = ∫_0^∞ λe^{−λx} dx = 1.

1.3. Gamma distribution. The Gamma distribution generalises the exponential distribution. For α, λ ∈ (0, +∞), the density function
f(x) = (λ^α x^{α−1} e^{−λx} / Γ(α)) 1_{[0,+∞)}(x)
defines the Gamma distribution of parameters α, λ. Here
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
is a finite constant, which ensures that f integrates to 1, since with the substitution y = λx,
∫_0^∞ λ^α x^{α−1} e^{−λx} dx = ∫_0^∞ y^{α−1} e^{−y} dy = Γ(α).
Note that by setting α = 1 we recover the exponential distribution, of which the Gamma distribution is therefore a generalisation.

53 2. CONTINUOUS RANDOM VARIABLES

1.4. Cauchy distribution. The density function
f(x) = 1 / (π(1 + x²))
defines the Cauchy distribution. Note that
∫_{−∞}^{+∞} f(x) dx = (2/π) ∫_0^∞ dx/(1 + x²) = (2/π) [arctan(x)]_0^{+∞} = 1.

1.5. Gaussian distribution. The density function
f(x) = (1/√(2π)) e^{−x²/2}
defines the standard Gaussian distribution, or standard normal distribution, written N(0, 1). More generally, for µ ∈ R and σ² ∈ (0, ∞), the density function
f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
defines the Gaussian distribution of mean µ and variance σ², written N(µ, σ²). To check that f integrates to 1 we can use Fubini's theorem and change variables into polar coordinates to obtain
(∫_{−∞}^{+∞} e^{−x²/2} dx)² = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} e^{−(x²+y²)/2} dx dy = ∫_0^{2π} ∫_0^∞ e^{−r²/2} r dr dθ = 2π.

2. Continuous random variables

A random variable X : Ω → R has probability density function f if
P(X ∈ [a, b]) = ∫_a^b f(x) dx
for all a < b ∈ R. We again define the distribution function of X by
F_X(x) = P(X ≤ x) = ∫_{−∞}^x f(y) dy.

Definition 2.1. A random variable X is said to be continuous if the associated distribution function F_X is continuous.

Note that by the fundamental theorem of calculus F′_X(x) = f_X(x); thus the probability density function is the derivative of the distribution function. Moreover,
lim_{x→−∞} F_X(x) = 0, lim_{x→+∞} F_X(x) = 1,
and F_X is non-decreasing, since F′_X(x) = f_X(x) ≥ 0.
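The normalising constants above can also be verified numerically. The following sketch (an illustration, not part of the notes) integrates the Cauchy and standard Gaussian densities with the trapezoid rule; note how the heavy 1/x² tails of the Cauchy force a much wider integration window, since truncating at ±R leaves out mass 2/(πR):

```python
import math

# Composite trapezoid rule on [a, b] with n panels
def trapezoid(g, a, b, n):
    h = (b - a) / n
    return h * ((g(a) + g(b)) / 2 + sum(g(a + i * h) for i in range(1, n)))

cauchy_total = trapezoid(lambda x: 1 / (math.pi * (1 + x * x)),
                         -1e4, 1e4, 400_000)
gauss_total = trapezoid(lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi),
                        -10, 10, 10_000)

print(cauchy_total, gauss_total)  # both ≈ 1; the Cauchy value is short by ~6e-5
```

The Gaussian tails decay so fast that [−10, 10] already captures essentially all the mass, while the Cauchy integral over [−10⁴, 10⁴] still misses about 2/(π·10⁴) ≈ 6·10⁻⁵.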

54 2. CONTINUOUS PROBABILITY

Definition 2.2. Let X be a continuous random variable with probability density function f and distribution function F. Then, provided
E(X⁺) = ∫_0^{+∞} x f(x) dx < ∞ or E(X⁻) = ∫_{−∞}^0 (−x) f(x) dx < ∞,
we define
E(X) = ∫_{−∞}^{+∞} x f(x) dx.
Note that if E(X⁺) = +∞ and E(X⁻) < ∞ then E(X) = +∞, while if E(X⁻) = +∞ and E(X⁺) < ∞ then E(X) = −∞, in analogy with the discrete case.

Example 2.3. If X ∼ Uniform[a, b] then
E(X) = ∫_a^b x/(b − a) dx = (b² − a²)/(2(b − a)) = (a + b)/2.

Example 2.4. If X ∼ Exponential(λ) then we integrate by parts to get
E(X) = ∫_0^∞ xλe^{−λx} dx = [−xe^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ.

Example 2.5. If X ∼ Cauchy then
E(X⁺) = ∫_0^∞ x/(π(1 + x²)) dx = +∞ = E(X⁻),
so the expectation is not defined.

Example 2.6. If X ∼ N(0, 1) then
E(X) = ∫_{−∞}^{+∞} (x/√(2π)) e^{−x²/2} dx = 0
by symmetry.

The expectation satisfies the same properties as in the case of discrete random variables. In particular it is linear, and for a function g : R → R we have
E(g(X)) = ∫_{−∞}^{+∞} g(x) f(x) dx.
Moreover, if X is non-negative, we have the alternative formula
E(X) = ∫_0^∞ P(X ≥ x) dx = ∫_0^∞ (1 − F(x)) dx.

Example 2.7. If X ∼ Exponential(λ) then P(X ≥ x) = e^{−λx}, so we can use the above formula to compute
E(X) = ∫_0^∞ P(X ≥ x) dx = [−e^{−λx}/λ]_0^∞ = 1/λ.
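Examples 2.4 and 2.7 compute E(X) = 1/λ for the exponential distribution in two ways; the sketch below (an illustration, not part of the notes) reproduces both numerically, by a Monte Carlo average of samples and by the tail formula:

```python
import math, random

random.seed(0)
lam = 2.0  # arbitrary illustrative parameter; E(X) should be 1/lam = 0.5

# (a) Monte Carlo average of exponential samples
n_samples = 200_000
mc_mean = sum(random.expovariate(lam) for _ in range(n_samples)) / n_samples

# (b) tail formula E(X) = ∫_0^∞ P(X ≥ x) dx with P(X ≥ x) = e^{-lam x},
#     approximated by the midpoint rule on [0, 20]
n, upper = 100_000, 20.0
h = upper / n
tail_mean = h * sum(math.exp(-lam * (i + 0.5) * h) for i in range(n))

print(mc_mean, tail_mean)  # both ≈ 1/lam = 0.5
```

The tail-formula value is accurate to quadrature precision, while the Monte Carlo value carries statistical noise of order 1/√n.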

55 3. EXERCISES

Definition 2.8. A random variable X is said to be integrable if
E(|X|) = ∫_{−∞}^{+∞} |x| f(x) dx < ∞.
For X an integrable random variable, we define the variance of X by
Var(X) = E[(X − E(X))²] = ∫_{−∞}^{+∞} (x − E(X))² f(x) dx.
In analogy with the discrete case,
Var(X) = E(X²) − [E(X)]² = ∫ x² f(x) dx − (∫ x f(x) dx)².

Example 2.9. If X ∼ Uniform[a, b] then
E(X²) = ∫_a^b x²/(b − a) dx = (b³ − a³)/(3(b − a)) = (a² + b² + ab)/3,
from which
Var(X) = E(X²) − [E(X)]² = (a² + b² + ab)/3 − (a + b)²/4 = (a − b)²/12.

Example 2.10. If X ∼ Exponential(λ) then we integrate by parts to get
E(X²) = ∫_0^∞ x² λe^{−λx} dx = [−x² e^{−λx}]_0^∞ + 2 ∫_0^∞ x e^{−λx} dx = 2/λ²,
from which
Var(X) = E(X²) − [E(X)]² = 2/λ² − 1/λ² = 1/λ².

Example 2.11. If X ∼ N(0, 1) then we integrate by parts to get
Var(X) = E(X²) = ∫_{−∞}^{+∞} x² e^{−x²/2}/√(2π) dx = [−x e^{−x²/2}/√(2π)]_{−∞}^{+∞} + ∫_{−∞}^{+∞} e^{−x²/2}/√(2π) dx = 1.

3. Exercises

Exercise 1. Let X be a Uniform random variable on [−1, 1]. Compute E(X³) and E(X⁴). Can you compute E(X^{2n+1}) for n ≥ 0?

56 2. CONTINUOUS PROBABILITY

Exercise 2. Recall that the Gamma function is defined as Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx. Using integration by parts, show that for all positive integers k,
Γ(k + 1) = kΓ(k).
By comparison with the exponential distribution, explain why Γ(1) = 1. Use this to conclude that
Γ(k) = (k − 1)!
for all positive integers k.

4. Transformations of one-dimensional random variables

Let X be a random variable taking values in S ⊆ R (that is, with probability density function f_X supported on S). We consider a function g : S → S′ and define the random variable Y = g(X). What is the probability density function f_Y of Y? To answer this question, let us assume that g is strictly increasing, so that g′(x) > 0 for all x. Then we can look at the distribution function
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)).
Differentiating both sides, we find
f_Y(y) = (d/dy) F_X(g^{−1}(y)) = f_X(g^{−1}(y)) (d/dy) g^{−1}(y).
A similar argument applies in the case of g decreasing, to give the following.

Theorem 4.1. Let X be a random variable taking values in S ⊆ R with p.d.f. f_X, and let g : S → S′ be a strictly increasing or strictly decreasing function. Then the probability density function of Y = g(X) is given by
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| for y ∈ S′.

Example 4.2. Let X ∼ Uniform[0, 1]. We may assume that X only takes values in (0, 1], since P(X = 0) = 0. Define Y = −log X, so that Y takes values in [0, ∞). Then we have S = (0, 1], S′ = [0, ∞) and g(x) = −log x strictly decreasing. Using that g^{−1}(y) = e^{−y}, we find
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = 1 · e^{−y} = e^{−y},

57 5. EXERCISES

which shows that Y ∼ Exponential(1). Note that we could have arrived at the same conclusion by directly looking at the distribution functions. Indeed, we have
F_Y(y) = P(Y ≤ y) = P(−log X ≤ y) = P(X ≥ e^{−y}) = 1 − e^{−y}.
By differentiating both sides with respect to y, we find
f_Y(y) = e^{−y},
so Y ∼ Exponential(1).

Example 4.3. Let X ∼ N(µ, σ²), and define Y = (X − µ)/σ. Then S = S′ = R and g(x) = (x − µ)/σ. Using that g^{−1}(y) = σy + µ, we find
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} |_{x=σy+µ} · σ = (1/√(2π)) e^{−y²/2},
which shows that Y ∼ N(0, 1). Note that we could have arrived at the same conclusion using that
F_Y(y) = P(Y ≤ y) = P(X ≤ µ + σy) = ∫_{−∞}^{µ+σy} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx
for all y ∈ R, from which, differentiating both sides, we get
f_Y(y) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} |_{x=σy+µ} · σ = (1/√(2π)) e^{−y²/2}.
The same reasoning shows that if X ∼ N(0, 1) and Y = σX + µ then Y ∼ N(µ, σ²).

Note that the above example gives us a way to deduce the mean and variance of a N(µ, σ²) from those of N(0, 1). Indeed, if X ∼ N(0, 1) then Y = σX + µ ∼ N(µ, σ²), from which, using the linearity of expectation,
E(Y) = σE(X) + µ = µ, and Var(Y) = σ² Var(X) = σ².

5. Exercises

Exercise 1. Let X be a random variable with p.d.f. f(x) = c e^{−3x} 1_{[0,∞)}(x). Find c. Find P(X ∈ [2, 3]), P(|X| ≤ 1), P(X > 4).

Exercise 2. Let X be a random variable with p.d.f. f(x) = cx 1_{[0,2]}(x). Find c. Find the p.d.f. of Y = 2X.

Exercise 3. Let X ∼ Exponential(1). Find the p.d.f. of Y = X².
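Example 4.2 is easy to check by simulation. The sketch below (an illustration, not part of the notes) draws uniform samples, applies Y = −log X, and compares the empirical distribution function of Y with F(y) = 1 − e^{−y} at a few points:

```python
import bisect, math, random

random.seed(1)
# 1 - random.random() lies in (0, 1], so the logarithm is always defined
ys = sorted(-math.log(1 - random.random()) for _ in range(100_000))

def ecdf(y):
    # fraction of the sorted samples that are ≤ y
    return bisect.bisect_right(ys, y) / len(ys)

worst = max(abs(ecdf(y) - (1 - math.exp(-y))) for y in (0.1, 0.5, 1.0, 2.0, 4.0))
print(worst)  # of order 1/sqrt(100000) ≈ 0.003
```

The same recipe, run in reverse, is exactly the inverse-transform method of Exercise 8 on the next page: applying F^{−1} to a uniform sample produces a sample with distribution function F.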

58 2. CONTINUOUS PROBABILITY

Exercise 4. Let X ∼ Cauchy. Find the p.d.f. of Y = arctan X.

Exercise 5. Conversely, show that if X ∼ U[−π/2, π/2] and Y = tan X, then Y ∼ Cauchy.

Exercise 6. Let X ∼ U[−1, 1]. Find the p.d.f. of Y = 2X + 5 and of Z = X³.

Exercise 7. Let X be an exponential random variable of parameter λ. Show that for any 0 < s < t it holds that
P(X > t | X > s) = P(X > t − s).
This is the memoryless property of the exponential distribution. Indeed, thinking of X as a random time (say the time of arrival of a bus), it tells us that, given that X > s, the random time X − s still has exponential distribution of parameter λ. In fact, this property characterises the exponential distribution, but we won't prove this.

Exercise 8. Let X ∼ U[0, 1], and let F be a strictly increasing distribution function. Show that the random variable Y = F^{−1}(X) has distribution function F. This is important for simulating random variables on a computer.

6. Moment generating functions

Definition 6.1. The moment generating function of a random variable X with p.d.f. f_X is defined as
M_X(θ) = E(e^{θX}) = ∫_{−∞}^{+∞} e^{θx} f_X(x) dx,
whenever the above integral is finite.

Note that M_X(0) = 1, and if θ < 0 and X ≥ 0 then M_X(θ) ≤ 1, so the moment generating function of a non-negative random variable is always finite on (−∞, 0]. The range of values θ > 0 for which the moment generating function is defined depends on the tails of the probability density function f_X.

Example 6.2. Let X be an exponential random variable of parameter λ. Then
M_X(θ) = E(e^{θX}) = ∫_0^∞ λ e^{−(λ−θ)x} dx = λ/(λ − θ),
provided θ < λ, which is needed for the integral to be finite.

59 6. MOMENT GENERATING FUNCTIONS

If X, Y are independent random variables, then
M_{X+Y}(θ) = E(e^{θ(X+Y)}) = E(e^{θX})E(e^{θY}) = M_X(θ)M_Y(θ).
This generalises to arbitrarily many random variables: if X_1, X_2, ..., X_n are independent random variables, then
M_{X_1+⋯+X_n}(θ) = M_{X_1}(θ)M_{X_2}(θ) ⋯ M_{X_n}(θ).
Moreover, if the X_k's also have the same distribution, then M_{X_1+⋯+X_n}(θ) = [M_{X_1}(θ)]^n.

The moment generating function (which we abbreviate m.g.f.) plays the same role for continuous random variables as the probability generating function for discrete random variables. In particular, the following theorem holds.

Theorem 6.3. The moment generating function of a random variable X determines the distribution of X uniquely, provided M_X(θ) < ∞ for some θ > 0.

Example 6.4. In this example we compute the moment generating function of a Gaussian random variable, and show that the sum of independent Gaussian random variables is still Gaussian. Let X ∼ N(µ, σ²). Then
M_X(θ) = (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{θx − (x−µ)²/(2σ²)} dx.
The above integral is finite for all θ ∈ R. We complete the square in the exponent as follows:
θx − (x − µ)²/(2σ²) = −(x² − 2(µ + θσ²)x + µ²)/(2σ²)
= −(x² − 2(µ + θσ²)x + (µ + θσ²)² + µ² − (µ + θσ²)²)/(2σ²)
= −(x − (µ + θσ²))²/(2σ²) + θµ + θ²σ²/2.
Plugging this into the integral, we find
M_X(θ) = e^{θµ + θ²σ²/2} (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{−(x−(µ+θσ²))²/(2σ²)} dx = e^{θµ + θ²σ²/2},
where in the last equality we have used that the p.d.f. of a N(µ + θσ², σ²) random variable integrates to 1. Thus we have shown that
X ∼ N(µ, σ²) ⟹ M_X(θ) = e^{θµ + θ²σ²/2}.
In particular, X ∼ N(0, 1) ⟹ M_X(θ) = e^{θ²/2}.
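The closed form M_X(θ) = e^{θµ + θ²σ²/2} can be checked against a Monte Carlo estimate of E(e^{θX}). The sketch below (an illustration, not part of the notes) does this for µ = 1, σ = 2 at a few values of θ:

```python
import math, random

random.seed(3)
mu, sigma = 1.0, 2.0  # arbitrary illustrative parameters
xs = [random.gauss(mu, sigma) for _ in range(400_000)]

rel_errs = []
for theta in (-0.5, 0.25, 0.5):
    empirical = sum(math.exp(theta * x) for x in xs) / len(xs)  # ≈ E(e^{θX})
    exact = math.exp(theta * mu + theta ** 2 * sigma ** 2 / 2)
    rel_errs.append(abs(empirical - exact) / exact)

print(max(rel_errs))  # small relative error at every θ tested
```

Note that the Monte Carlo noise grows with |θ|, since e^{θX} has variance M_X(2θ) − M_X(θ)², which blows up quickly in θ.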

60 2. CONTINUOUS PROBABILITY

We now use this explicit expression for the m.g.f. to compute the distribution of the sum of two independent Gaussian random variables. Let X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y) be independent. Then
M_{X+Y}(θ) = E(e^{θ(X+Y)}) = E(e^{θX})E(e^{θY}) = e^{θ(µ_X+µ_Y) + (θ²/2)(σ²_X+σ²_Y)},
which shows that X + Y ∼ N(µ_X + µ_Y, σ²_X + σ²_Y).

Finally, exactly as in the discrete case, we can use the m.g.f. M_X of a random variable X to recover the moments of X. Indeed, from
e^{θX} = 1 + θX + θ²X²/2 + θ³X³/3! + ⋯,
taking the expectation on both sides we find
M_X(θ) = 1 + θE(X) + (θ²/2)E(X²) + (θ³/3!)E(X³) + ⋯.
Provided M_X is finite in a neighbourhood of 0, we can differentiate to get
M_X(0) = 1, M′_X(0) = E(X), M″_X(0) = E(X²),
and in general M_X^{(k)}(0) = E(X^k), whenever the expectation is finite.

Example 6.5. We have seen that if X is exponential of parameter λ then M_X(θ) = λ/(λ − θ) for θ < λ. We differentiate to find
M′_X(θ) = λ/(λ − θ)², M′_X(0) = 1/λ = E(X),
and
M″_X(θ) = 2λ/(λ − θ)³, M″_X(0) = 2/λ² = E(X²),
from which we recover Var(X) = 1/λ². Note that all the above computations are for θ < λ.

7. Exercises

Exercise 1. Show that the m.g.f. of a Gamma random variable of parameters n, λ, where n is a positive integer, is given by
M_X(θ) = (λ/(λ − θ))^n.
Deduce that the m.g.f. of an exponential random variable of parameter λ is given by λ/(λ − θ). Differentiate the m.g.f. to compute the mean and variance of the Gamma(n, λ) distribution. Show that if X, Y are independent exponential random variables of parameter λ, then X + Y ∼

61 8. MULTIVARIATE DISTRIBUTIONS

Gamma(2, λ). In general, show that if X_1, X_2, ..., X_n are i.i.d. exponential random variables of parameter λ, then X_1 + X_2 + ⋯ + X_n ∼ Gamma(n, λ).

Exercise 2. Compute the m.g.f. of a uniform random variable on [a, b]. Differentiate to deduce the mean and variance.

Exercise 3. Compute the m.g.f. of a standard Gaussian random variable by completing the square.

Exercise 4. Show that the m.g.f. of a Cauchy random variable is infinite for all values of θ ≠ 0, while it is equal to 1 at θ = 0.

8. Multivariate distributions

8.1. Joint distribution. We start by defining the joint probability density function of two real-valued random variables.

Definition 8.1. Two random variables X, Y are said to have joint probability density function f_{X,Y} if
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy
for all subsets A ⊆ R². Their joint distribution function is given by
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) du dv.

Properties of the joint probability density function f_{X,Y}:
(i) f_{X,Y}(x, y) ≥ 0,
(ii) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f_{X,Y}(x, y) dx dy = 1.

Properties of the joint distribution function F_{X,Y}:
(i) For each fixed x, F_{X,Y}(x, y) is non-decreasing in y, and for each fixed y, F_{X,Y}(x, y) is non-decreasing in x.
(ii) We have
lim_{x,y→+∞} F_{X,Y}(x, y) = 1, lim_{x→−∞} F_{X,Y}(x, y) = lim_{y→−∞} F_{X,Y}(x, y) = 0,
and
F_{X,Y}(x, +∞) = P(X ≤ x) = F_X(x), F_{X,Y}(+∞, y) = P(Y ≤ y) = F_Y(y).

62 2. CONTINUOUS PROBABILITY

Note that f_{X,Y} : R² → R and F_{X,Y} : R² → [0, 1], and
P((X, Y) ∈ (a, b) × (c, d)) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dx dy
for all (a, b) × (c, d) ⊆ R². Moreover,
f_{X,Y}(x, y) = (∂²/∂x∂y) F_{X,Y}(x, y),
so we can obtain the joint p.d.f. by differentiating the joint distribution function.

Example 8.2. Let (X, Y) be uniformly distributed on [0, 1]², which is equivalent to saying that the joint p.d.f. of X, Y is given by f_{X,Y}(x, y) = 1_{[0,1]²}(x, y). To compute, say, P(X < Y), we integrate the joint p.d.f. over the set {x < y}, to get
P(X < Y) = ∫∫_{x<y} f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^y dx dy = ∫_0^1 y dy = 1/2.
If, on the other hand, we want to compute P(X ≤ 1/2, X + Y ≤ 1), we integrate the joint p.d.f. over this set to get
P(X ≤ 1/2, X + Y ≤ 1) = ∫∫_{x≤1/2, x+y≤1} f_{X,Y}(x, y) dx dy = ∫_0^{1/2} ∫_0^{1−x} dy dx = 3/8.

Example 8.3. Let (X, Y) have joint p.d.f. f_{X,Y}(x, y) = 6xy² 1_{[0,1]²}(x, y). We compute P((X, Y) ∈ A) where A = {(x, y) : x + y ≥ 1}. We have
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_{1−y}^1 6xy² dx dy = ∫_0^1 (6y³ − 3y⁴) dy = 9/10.

Note that from the joint p.d.f. we can recover the marginals by integrating over one of the variables:
f_X(x) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dy, f_Y(y) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dx.

Example 8.4. Continuing with the above example, we find
f_X(x) = ∫_0^1 6xy² dy = 2x for x ∈ [0, 1],
and
f_Y(y) = ∫_0^1 6xy² dx = 3y² for y ∈ [0, 1].
Note that f_{X,Y}(x, y) = f_X(x)f_Y(y). In this case we say that X and Y are independent.
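Both probabilities above can be estimated by Monte Carlo. The sketch below (an illustration, not part of the notes) samples from each joint density; for the second one it uses the factorisation f(x, y) = 2x · 3y², sampling each marginal by the inverse-transform method (F_X(x) = x² and F_Y(y) = y³):

```python
import random

random.seed(4)
n = 200_000

# Example 8.2: (X, Y) uniform on [0,1]^2; P(X ≤ 1/2, X + Y ≤ 1) = 3/8
hits_82 = 0
for _ in range(n):
    x, y = random.random(), random.random()
    if x <= 0.5 and x + y <= 1:
        hits_82 += 1
p_82 = hits_82 / n

# Example 8.3: f(x, y) = 6xy^2 on [0,1]^2; inverse transform gives
# X = sqrt(U) and Y = V^(1/3); P(X + Y ≥ 1) = 9/10
hits_83 = 0
for _ in range(n):
    x, y = random.random() ** 0.5, random.random() ** (1 / 3)
    if x + y >= 1:
        hits_83 += 1
p_83 = hits_83 / n

print(p_82, p_83)  # ≈ 0.375 and ≈ 0.9
```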

63 8. MULTIVARIATE DISTRIBUTIONS

Definition 8.5. Two random variables X, Y are said to be independent if
P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y)
for all (x, y) ∈ R².

The following theorem provides a very useful characterisation of independence.

Theorem 8.6. Two random variables X, Y are independent if and only if their joint p.d.f. is given by the product of the marginals, that is,
f_{X,Y}(x, y) = f_X(x)f_Y(y).
We won't prove this theorem.

Example 8.7. Let X, Y have joint p.d.f.
f_{X,Y}(x, y) = λ² e^{−λ(x+y)} 1_{[0,∞)²}(x, y)
for λ > 0. Then
f_X(x) = λe^{−λx} 1_{[0,∞)}(x), f_Y(y) = λe^{−λy} 1_{[0,∞)}(y),
so f_{X,Y} = f_X f_Y, and X, Y are independent exponential random variables of parameter λ.

Example 8.8. Let X, Y have joint p.d.f.
f_{X,Y}(x, y) = λe^{−λx} 1_{[0,∞)×[0,1]}(x, y)
for λ > 0. Then
f_X(x) = λe^{−λx} 1_{[0,∞)}(x), f_Y(y) = 1_{[0,1]}(y),
so f_{X,Y} = f_X f_Y, and X, Y are independent, with X ∼ Exponential(λ) and Y ∼ U[0, 1].

All the above definitions generalise in the obvious way.

Definition 8.9. The random variables X_1, X_2, ..., X_n are said to have joint probability density function f_{X_1...X_n} if
P((X_1, X_2, ..., X_n) ∈ A) = ∫_A f_{X_1...X_n} dx_1 dx_2 ⋯ dx_n
for all subsets A ⊆ R^n. Their joint distribution function is given by
F_{X_1...X_n}(x_1, ..., x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n) = ∫_{−∞}^{x_1} ∫_{−∞}^{x_2} ⋯ ∫_{−∞}^{x_n} f_{X_1...X_n}(u_1, ..., u_n) du_1 ⋯ du_n.
The random variables X_1, X_2, ..., X_n are said to be independent if
P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n) = P(X_1 ≤ x_1)P(X_2 ≤ x_2) ⋯ P(X_n ≤ x_n)

64 2. CONTINUOUS PROBABILITY

for all (x_1, ..., x_n) ∈ R^n.

Theorem 8.10. The random variables X_1, X_2, ..., X_n are independent if and only if
f_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) = f_{X_1}(x_1)f_{X_2}(x_2) ⋯ f_{X_n}(x_n).

Exactly as in the discrete case, if X, Y are independent random variables, and g_1, g_2 are real functions, then g_1(X), g_2(Y) are again independent random variables. In general, if X_1, X_2, ..., X_n are independent random variables, and g_1, g_2, ..., g_n are real functions, then g_1(X_1), g_2(X_2), ..., g_n(X_n) are independent random variables, and
E(g_1(X_1)g_2(X_2) ⋯ g_n(X_n)) = E(g_1(X_1))E(g_2(X_2)) ⋯ E(g_n(X_n)).

8.2. Covariance. Given two random variables X, Y and a function g : R² → R, we have the formula
E(g(X, Y)) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) f_{X,Y}(x, y) dx dy.
In particular, taking g(x, y) = xy we find
E(XY) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} xy f_{X,Y}(x, y) dx dy.

Definition 8.11 (Covariance). Given two random variables X, Y with joint p.d.f. f_{X,Y}, we define their covariance as
Cov(X, Y) = E(XY) − E(X)E(Y).
Note that if X, Y are independent then f_{X,Y}(x, y) = f_X(x)f_Y(y), and so
E(XY) = ∫∫ xy f_{X,Y}(x, y) dx dy = ∫∫ xy f_X(x)f_Y(y) dx dy = (∫ x f_X(x) dx)(∫ y f_Y(y) dy) = E(X)E(Y),
which shows that Cov(X, Y) = 0. The covariance satisfies all the properties that we have listed for the discrete case.

Example 8.12. Let X, Y be two random variables with joint p.d.f. f_{X,Y}(x, y) = cxy for 0 ≤ x ≤ y² and y ∈ [0, 1]. To ensure that f_{X,Y} integrates to 1 we take c = 12. We find
E(XY) = ∫_0^1 ∫_0^{y²} 12 x² y² dx dy = 4 ∫_0^1 y⁸ dy = 4/9,
and marginals
f_X(x) = ∫_{√x}^1 12xy dy = 6x(1 − x), f_Y(y) = ∫_0^{y²} 12xy dx = 6y⁵,

65 9. EXERCISES

from which E(X) = 1/2 and E(Y) = 6/7. It follows that
Cov(X, Y) = E(XY) − E(X)E(Y) = 4/9 − 3/7 = 1/63 ≠ 0,
and therefore X, Y are not independent.

9. Exercises

Exercise 1. Let X, Y be independent exponential random variables of parameter λ > 0. Find the distribution of Z = min{X, Y}. Deduce E(Z) and Var(Z).

Exercise 2. Let (X, Y) be uniformly distributed on the unit disc, that is, f_{X,Y}(x, y) = c 1_{{x²+y²≤1}}(x, y). Determine c. Compute the marginal probability density functions f_X and f_Y. Are X and Y independent?

Exercise 3. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = c(x + 2y) 1_{[0,1]²}(x, y). Find c. Compute P(X < Y) and P(X + Y < 1/2). Find the marginals f_X and f_Y. Compute E(X), E(Y) and Cov(X, Y). Compute E(XY²).

Exercise 4. Let X, Y have joint distribution function F_{X,Y}(x, y) = 1 − e^{−x} − e^{−y} + e^{−(x+y)} for (x, y) ∈ [0, ∞)². Find P(X < 4, Y < 2). Find P(X < 3). Differentiate to compute the joint p.d.f. f_{X,Y} of X, Y. Check that f_{X,Y} integrates to 1.

Exercise 5. Let X, Y be independent uniform random variables on [0, 1]. Write the joint p.d.f. f_{X,Y}. Compute E(XY), E(sin(X + Y)). Write down the joint distribution function F_{X,Y}.

Exercise 6. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = cx²e^y 1_{{0<x<y<1}}(x, y). Find c. Find the marginals f_X and f_Y. Are X, Y independent? Compute E(XY), E(X²e^Y), E(X(1 + 4Y)), E(e^{X+Y}), E(e^{X−Y}).

Exercise 7. The radius of a circle is exponentially distributed with parameter λ. Determine the probability density function of the perimeter of the circle and of the area of the circle.

Exercise 8. Let Y ∼ N(µ, σ²). Determine the p.d.f. of X = e^Y.

Exercise 9. Let X ∼ U[0, π]. Determine the p.d.f. of Y = cos X.

Exercise 10. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = (1/y) 1_{{0<x<y<1}}(x, y). Check that f_{X,Y} integrates to 1. Find the marginal p.d.f.s f_X and f_Y. Are X, Y independent? Compute E(XY), and E(XY^k) for k a positive integer.

66 2. CONTINUOUS PROBABILITY

Exercise 11. Find the joint p.d.f. of X, Y having joint distribution function
F_{X,Y}(x, y) = (1 − e^{−x²})(1 − e^{−y²}) 1_{[0,∞)²}(x, y).

10. Transformations of two-dimensional random variables

Let X, Y be two random variables with joint probability density function f_{X,Y}. We introduce the new random variables
U = u(X, Y), V = v(X, Y)
for some functions u, v : R² → R such that the map (X, Y) → (U, V) is invertible. Then we can write
U = u(X, Y), V = v(X, Y) and X = x(U, V), Y = y(U, V).
We would like to obtain a formula for the joint p.d.f. of U, V. To this end, introduce the Jacobian J of this transformation, defined as
J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).
Then the joint p.d.f. of (U, V) is given by
f_{U,V}(u, v) = |J| f_{X,Y}(x(u, v), y(u, v)).

Example 10.1. Let X, Y be independent exponential random variables of parameter λ > 0, so that their joint p.d.f. is f_{X,Y}(x, y) = λ²e^{−λ(x+y)} for (x, y) ∈ [0, +∞)². We want to compute the joint p.d.f. of (X + Y, Y). Let U = X + Y, V = Y. Then
u(x, y) = x + y, v(x, y) = y, so that x(u, v) = u − v, y(u, v) = v,
and so
J = det [ 1  −1 ; 0  1 ] = 1.
It follows that the joint p.d.f. of U, V is given by
f_{U,V}(u, v) = |J| f_{X,Y}(u − v, v) = f_X(u − v)f_Y(v) = λ²e^{−λu} 1_{[0,∞)}(u) 1_{[0,u]}(v).
We can use this to compute the marginal p.d.f. of X + Y, to get
f_{X+Y}(u) = ∫_{−∞}^{+∞} f_{U,V}(u, v) dv = ∫_0^u λ²e^{−λu} dv = λ²ue^{−λu}
for u ≥ 0. Thus X + Y ∼ Gamma(2, λ).
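The conclusion of this example is easy to test by simulation. The sketch below (an illustration, not part of the notes) sums pairs of independent Exponential(λ) samples and checks that the empirical mean and variance match those of Gamma(2, λ), namely 2/λ and 2/λ²:

```python
import random

random.seed(5)
lam, n = 1.5, 200_000  # arbitrary illustrative parameter

sums = [random.expovariate(lam) + random.expovariate(lam) for _ in range(n)]

mean = sum(sums) / n
var = sum((s - mean) ** 2 for s in sums) / n

print(mean, var)  # ≈ 2/lam ≈ 1.333 and ≈ 2/lam² ≈ 0.889
```

This is also consistent with Exercise 1 of Section 7, where the Gamma(n, λ) mean and variance follow from differentiating the m.g.f.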

67 11. LIMIT THEOREMS

Example 10.2. Let again X, Y be independent exponentials of parameter λ > 0. Define
U = X + Y, V = X/(X + Y).
Then
u(x, y) = x + y, v(x, y) = x/(x + y), so that x(u, v) = uv, y(u, v) = u(1 − v),
from which
J = det [ v  u ; 1 − v  −u ] = −uv − u(1 − v) = −u.
It follows that the joint p.d.f. of U, V is given by
f_{U,V}(u, v) = |J| f_X(uv) f_Y(u(1 − v)) = λ²ue^{−λu}
for 0 < u < ∞, 0 < v < 1. Note that we can decompose f_{U,V} as a product f_U f_V with
f_U(u) = λ²ue^{−λu} 1_{[0,∞)}(u), f_V(v) = 1_{[0,1]}(v).
It follows that U, V are independent random variables, with U ∼ Gamma(2, λ) and V ∼ U[0, 1].

11. Limit theorems

In this section we look at the limiting behaviour of sums of i.i.d. (independent and identically distributed) random variables.

Theorem 11.1 (Weak law of large numbers). Let (X_k)_{k≥1} be a sequence of i.i.d. random variables with finite mean µ, and define S_N = X_1 + X_2 + ⋯ + X_N. Then for all ε > 0,
P(|S_N/N − µ| > ε) → 0
as N → ∞.

Proof. We prove the theorem assuming that the random variables also have finite variance σ². Note that E(S_N/N) = µ and Var(S_N/N) = σ²/N. It then follows from Chebyshev's inequality that
P(|S_N/N − µ| > ε) = P(|S_N/N − E(S_N/N)| > ε) ≤ Var(S_N/N)/ε² = σ²/(Nε²) → 0
as N → ∞.

The following theorem tells us that the convergence holds in a stronger sense.

68 2. CONTINUOUS PROBABILITY

Theorem 11.2 (Strong law of large numbers). Let (X_k)_{k≥1} be a sequence of i.i.d. random variables with finite mean µ. Then
P( lim_{N→∞} S_N/N = µ ) = 1.
We won't prove this result in this course.

Finally, once we know that S_N/N → µ as N → ∞, we can ask what error we make when approximating S_N by Nµ for large N. This is the object of the next result.

Theorem 11.3 (Central Limit Theorem). Let (X_k)_{k≥1} be a sequence of i.i.d. random variables with finite mean µ and variance σ². Then for all x ∈ R,
P( (S_N − Nµ)/(σ√N) ≤ x ) → Φ(x)
as N → ∞, where
Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy
is the distribution function of a standard Gaussian random variable.
Thus the central limit theorem tells us that, for large N, we can approximate the sum S_N by a Gaussian random variable with mean Nµ and variance Nσ².

12. Exercises

Exercise 1. Let X, Y be independent non-negative random variables, with p.d.f.s f_X and f_Y respectively. By performing the change of variables U = X + Y, V = Y, show that
f_{X+Y}(u) = ∫_0^u f_X(u − v)f_Y(v) dv = (f_X ∗ f_Y)(u).

Exercise 2. Use the previous exercise to compute the p.d.f. of the sum of two independent U[0, 1] random variables, and sketch its graph.

Exercise 3. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = 4xy 1_{[0,1]²}(x, y). Are X, Y independent? Find the joint p.d.f. of U = X/Y and V = XY. Compute the marginals f_U and f_V. Are U, V independent?

Exercise 4. (Polar coordinates) Let X, Y be independent N(0, 1) random variables. Write their joint p.d.f. f_{X,Y}. Define the new random variables R, Θ by
X = R cos Θ, Y = R sin Θ.
Compute the joint p.d.f. f_{R,Θ} of R, Θ, and show that it factorises into f_R(r)f_Θ(θ). Deduce that R, Θ are independent, and determine their distributions.
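The central limit theorem can be illustrated numerically. The sketch below (an illustration, not part of the notes) takes i.i.d. Uniform[0, 1] summands, for which µ = 1/2 and σ² = 1/12, standardises sums of N = 50 of them, and compares the empirical distribution function with Φ at a few points:

```python
import math, random

random.seed(6)
N, trials = 50, 100_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of Uniform[0,1]

def phi(x):
    # standard normal distribution function, via the error function
    return (1 + math.erf(x / math.sqrt(2))) / 2

zs = [(sum(random.random() for _ in range(N)) - N * mu) / (sigma * math.sqrt(N))
      for _ in range(trials)]

worst = max(abs(sum(z <= x for z in zs) / trials - phi(x)) for x in (-1, 0, 1, 2))
print(worst)  # small: CLT approximation error plus Monte Carlo noise
```

Even at the modest value N = 50 the discrepancy is already dominated by Monte Carlo noise rather than by the CLT approximation itself.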

69 12. EXERCISES

Exercise 5. Let X, Y be independent N(0, 1) random variables. For a, b, c ∈ R, define
X = aU + bV, Y = cV.
Assuming that ac ≠ 0, find the joint p.d.f. of U, V and the marginals f_U and f_V. Where was the assumption ac ≠ 0 used?

Exercise 6. Let X, Y be independent N(0, 1) random variables. Show that, for any fixed θ, the random variables
U = X cos θ + Y sin θ, V = −X sin θ + Y cos θ
are independent, and find their distributions.


71 CHAPTER 3

Statistics

1. Parameter estimation

Statistics is concerned with data analysis. Let us start with a simple example. Imagine tossing a biased coin n times. The coin gives head with probability p, but p is unknown: we would like to estimate p based on the observed outcomes. Denote by X_1, X_2, ..., X_n the random variables recording the n coin tosses. Then they are independent Bernoulli(p) random variables with mean p. Let x_1, x_2, ..., x_n denote the sequence of observed outcomes in {0, 1}. A reasonable guess for p is given by the proportion of heads in the trials, that is, we estimate p via
p̂(x_1, ..., x_n) = (1/n) Σ_{k=1}^n x_k.
How good is this guess? And how do we build a reasonable guess for the parameter in a more general setting?

Let X_1, X_2, ..., X_n be independent and identically distributed (i.i.d.) random variables, representing n identical and independent experiments (or measurements). We assume that the X_k's have common probability distribution P(· ; θ) depending on a parameter θ ∈ R. Our aim is to estimate θ, that is, to come up with a guess for the value of θ, based on having observed that X_1 = x_1, X_2 = x_2, ..., X_n = x_n.

Definition 1.1. An estimator T for the parameter θ is a function T = T(X_1, X_2, ..., X_n). We say that the estimator T is unbiased if
E(T(X_1, ..., X_n)) = θ,
and T is said to be biased if it is not unbiased.

Thus an unbiased estimator gives, on average, the right guess for the value of θ. We will often use the vector notation X = (X_1, X_2, ..., X_n) for brevity, using which we say that T is unbiased if E(T(X)) = θ.

72 3. STATISTICS

Example 1.2. Going back to biased coin tosses, we have proposed to estimate p via
T(X) = (1/n) Σ_{k=1}^n X_k.
To check whether this is an unbiased estimator for the parameter p, we compute the expectation:
E(T(X)) = (1/n) Σ_{k=1}^n E(X_k) = p,
so T is unbiased for p. Note that we could have instead considered the estimator T_2(X) = X_1, which is also unbiased, since E(T_2(X)) = E(X_1) = p. This corresponds to guessing that the probability p of head equals 1 if the first coin comes up head, and 0 otherwise, ignoring all the following coin tosses. This doesn't seem to be a good estimator, and we will see that indeed it isn't.

How do we build estimators in a general setting?

1.1. Maximum Likelihood Estimation. Let X_1, X_2, ..., X_n be i.i.d. random variables with common distribution P(· ; θ) depending on a real parameter θ. Suppose we have observed that X_1 = x_1, X_2 = x_2, ..., X_n = x_n, and we are asked to come up with a guess for the value of θ based on these data. What would be our best guess? Intuitively, it is reasonable to pick the value of θ which maximises the probability of seeing the observed data. This is the idea behind maximum likelihood estimation.

Definition 1.3 (Likelihood). The likelihood of θ is the function
l(θ) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n ; θ).
The log-likelihood of θ is defined as
L(θ) = log l(θ).
Note that by independence
l(θ) = Π_{k=1}^n P(X_k = x_k ; θ),
and
L(θ) = Σ_{k=1}^n log P(X_k = x_k ; θ).

73 1. PARAMETER ESTIMATION

Definition 1.4. The Maximum Likelihood Estimator (MLE) for θ, given data x = (x_1, x_2, ..., x_n), is the value θ̂(x) of θ that maximises the likelihood l(θ), that is,
θ̂(x) = arg max_θ l(θ).
The associated MLE is given by T(X) = θ̂(X).

This gives us an algorithm to construct estimators: simply write down the likelihood, and maximise it with respect to the parameter θ.

Remark 1.5. Since L(θ) = log l(θ), the likelihood and log-likelihood are maximised at the same point. Thus the MLE is also the value of θ that maximises the log-likelihood:
θ̂(x) = arg max_θ L(θ).
This is often convenient for computations. Indeed, the likelihood comes as a product (due to independence) of functions of θ, so it is lengthy to differentiate. On the other hand, the log-likelihood is a sum of functions of θ, which is much easier to differentiate.

Example 1.6. Going back to coin tosses, we have X_1, X_2, ..., X_n i.i.d. Bernoulli(p), so
P(X_k = x_k ; p) = p^{x_k}(1 − p)^{1−x_k}
for all k ≤ n. The likelihood function is given by
l(p) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n ; p) = Π_{k=1}^n p^{x_k}(1 − p)^{1−x_k} = p^{Σ_k x_k}(1 − p)^{n − Σ_k x_k}.
Thus
L(p) = log l(p) = (Σ_{k=1}^n x_k) log p + (Σ_{k=1}^n (1 − x_k)) log(1 − p).
We differentiate to find
(d/dp) L(p) = (1/p) Σ_{k=1}^n x_k − (1/(1 − p)) Σ_{k=1}^n (1 − x_k) = (Σ_{k=1}^n x_k − np) / (p(1 − p)),
which shows that L(p) is maximised at
p̂(x) = (1/n) Σ_{k=1}^n x_k.
Thus the MLE is given by
T(X) = (1/n) Σ_{k=1}^n X_k,
which, as we have already seen, is unbiased.
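The maximisation in Example 1.6 can be checked numerically. The sketch below (an illustration, not part of the notes) simulates Bernoulli data, evaluates the log-likelihood L(p) on a fine grid, and confirms that the grid argmax coincides with the closed-form MLE, the sample proportion:

```python
import math, random

random.seed(7)
true_p, n = 0.3, 1_000  # arbitrary illustrative parameters
xs = [1 if random.random() < true_p else 0 for _ in range(n)]
s = sum(xs)

def log_lik(p):
    # L(p) = (Σx) log p + (n − Σx) log(1 − p)
    return s * math.log(p) + (n - s) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]  # candidate values 0.001, ..., 0.999
grid_mle = max(grid, key=log_lik)
closed_form = s / n                         # sample proportion of heads

print(grid_mle, closed_form)  # identical here, since s/n lies on the grid
```

Since L(p) is strictly concave, the grid maximiser is the grid point nearest the true argmax; with n = 1000 the sample proportion is itself a grid point, so the two agree exactly.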

74 3. STATISTICS

Example 1.7. Let X_1, X_2, ..., X_n be i.i.d. Poisson(λ). To build the MLE for λ, we write down the likelihood
l(λ) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = Π_{k=1}^n e^{−λ} λ^{x_k}/x_k! = e^{−λn} λ^{Σ_k x_k} Π_{k=1}^n (1/x_k!).
Thus
L(λ) = log l(λ) = −λn + (log λ) Σ_{k=1}^n x_k − Σ_{k=1}^n log(x_k!).
We differentiate to get
(d/dλ) L(λ) = −n + (1/λ) Σ_{k=1}^n x_k,
which shows that L(λ) is maximised at
λ̂(x) = (1/n) Σ_{k=1}^n x_k.
Thus the MLE for λ is
T(X) = (1/n) Σ_{k=1}^n X_k.
This is unbiased, since
E(T(X)) = (1/n) Σ_{k=1}^n E(X_k) = λ.

For continuous random variables we can use exactly the same approach. If X_1, X_2, ..., X_n are i.i.d. continuous random variables with common p.d.f. f(· ; θ) depending on a parameter θ, define the likelihood as
l(θ) = f_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n ; θ) = Π_{k=1}^n f(x_k ; θ),
and the log-likelihood by L(θ) = log l(θ) whenever l(θ) ≠ 0, and L(θ) = 0 otherwise. Again, the MLE is the value of θ which maximises the likelihood, or equivalently the log-likelihood.

Example 1.8. Let X_1, X_2, ..., X_n be i.i.d. exponentials of parameter λ. We aim to build the MLE for λ. To this end, write the likelihood function
l(λ) = Π_{k=1}^n λe^{−λx_k} = λ^n e^{−λ Σ_k x_k},
from which
L(λ) = n log λ − λ Σ_{k=1}^n x_k.

We differentiate to get

dL/dλ = n/λ − Σ_{k=1}^n x_k,

which shows that the likelihood is maximised at

λ̂(x) = ( (1/n) Σ_{k=1}^n x_k )^{−1}.

Thus our MLE is given by

T(X) = ( (1/n) Σ_{k=1}^n X_k )^{−1}.

This is biased, since for n = 1 we find

E(T(X)) = E(1/X_1) = ∫_0^∞ (1/x) λ e^{−λx} dx = +∞.

Example 1.9. Let X_1, X_2, ..., X_n be i.i.d. Uniform random variables in [0, θ]. We want to estimate θ. Using that the X_k's have common p.d.f.

f(x) = (1/θ) 1_{[0,θ]}(x),

we have likelihood function

l(θ) = (1/θ^n) 1_{[0,θ]}(x_1) 1_{[0,θ]}(x_2) · · · 1_{[0,θ]}(x_n).

Using that the condition x_k ∈ [0, θ] for all k is equivalent to

max_{1≤k≤n} x_k ≤ θ and min_{1≤k≤n} x_k ≥ 0,

we rewrite l(θ) as

l(θ) = 1/θ^n if θ ≥ max_{1≤k≤n} x_k and min_{1≤k≤n} x_k ≥ 0, and l(θ) = 0 otherwise.

Moreover, we can assume that min_k x_k ≥ 0, since negative data would contradict our assumption that the X_k's are uniform in [0, θ]. We differentiate the likelihood to find

(d/dθ) l(θ) = −n θ^{−n−1} < 0 for θ ≥ max_{1≤k≤n} x_k,

so the likelihood is strictly decreasing on this region, which shows that it is maximised at

θ̂(x) = max_{1≤k≤n} x_k.

Our MLE is therefore given by

T(X) = max_{1≤k≤n} X_k.

Is this unbiased? To answer this question, we have to compute its expectation and check whether it equals θ. To this end, we first compute the distribution function F_T of T(X):

F_T(x) = P(T ≤ x) = P(max_{1≤k≤n} X_k ≤ x) = P(X_1 ≤ x, X_2 ≤ x, ..., X_n ≤ x) = (x/θ)^n

for x ∈ [0, θ], while F_T(x) = 0 if x ≤ 0 and F_T(x) = 1 if x ≥ θ. It follows that the p.d.f. of T is given by

f_T(x) = (n/θ^n) x^{n−1} 1_{[0,θ]}(x).

We use this to compute

E(T) = ∫_0^θ x f_T(x) dx = (n/θ^n) ∫_0^θ x^n dx = (n/(n+1)) θ.

Thus the MLE T(X) is biased. On the other hand, note that as n → ∞

E(T(X)) = (n/(n+1)) θ → θ,

so T(X) is asymptotically unbiased.

2. Exercises

Exercise 1. Let X_1, X_2, ..., X_n be i.i.d. geometric random variables of parameter p. Write down the likelihood function l(p). By maximising l(p), compute the MLE for p. Is it unbiased for n = 1? [Hint: recall the Taylor series for log(1 − x).]

Exercise 2. Let X ~ Binomial(N, p), where N is known and p is to be estimated. Write down the likelihood function l(p). By maximising l(p), compute the MLE for p. Is it unbiased?

Exercise 3. Let X ~ Binomial(N, p), where p is known and N is to be estimated. Write down the likelihood function l(N). By maximising l(N), compute the MLE for N. Is it always unique?

Exercise 4. Let X_1, X_2, ..., X_n be i.i.d. N(µ, 1) random variables. Write down the likelihood function l(µ). By maximising l(µ), compute the MLE for µ. Is it unbiased?

Exercise 5. Let X_1, X_2, ..., X_n be i.i.d. N(µ, σ²) random variables, with µ known and σ² to be estimated. Write down the likelihood function l(σ²). By maximising l(σ²), compute the MLE for σ². Is it unbiased?

Exercise 6. Let X_1, X_2, ..., X_n be i.i.d. U[0, θ] random variables. We have seen that the MLE for θ is given by T(X) = max_{1≤k≤n} X_k, and that this is a biased estimator. Can you use this to build an unbiased estimator for θ?

3. Mean square error

Let X_1, X_2, ..., X_n be i.i.d. following a given probability distribution depending on a parameter θ, which we aim to estimate. We have seen how to build the MLE for θ, but of course we could consider different estimators. Which one is the best? We need a measure of how good a given estimator is.

Example 3.1. If the X_k's are i.i.d. Bernoulli of parameter p, representing independent coin tosses, we have seen that the MLE is given by

T_1(X) = (1/n) Σ_{k=1}^n X_k,

and it is unbiased. On the other hand, we could introduce the alternative estimators

T_2(X) = X_1, T_3(X) = (1/3) X_1 + (2/3) X_2.

It is easy to see that they are all unbiased: which one should we use? We want our estimator to be as close as possible to the true value p, so it is reasonable to choose the estimator that minimises the square distance from p. We have

E[(T_1(X) − p)²] = p(1 − p)/n, E[(T_2(X) − p)²] = p(1 − p), E[(T_3(X) − p)²] = (5/9) p(1 − p).

This shows that the MLE T_1(X) is the one that, on average, gives the least square error, so it is the one that performs best. Note that, in fact,

E[(T_1(X) − p)²] = p(1 − p)/n → 0 as n → ∞,

so if we could observe infinitely many coin tosses we would be able to determine the value of p with certainty. This is false for the estimators T_2 and T_3.

Definition 3.2 (Mean Square Error). Given an estimator T(X) for a parameter θ, the Mean Square Error (MSE) of T is defined as

MSE(T) = E[(T(X) − θ)²].

Note that if T is unbiased, that is E(T(X)) = θ, then MSE(T) = Var(T(X)). If T is biased, let bias(T) = E(T(X)) − θ denote the bias of T. Then the MSE can be expressed as

MSE(T) = Var(T(X)) + (bias(T))².

Indeed,

MSE(T) = E[(T(X) − θ)²]
       = E[(T(X) − E(T(X)) + E(T(X)) − θ)²]
       = E[(T(X) − E(T(X)))²] + 2 (E(T(X)) − θ) E[T(X) − E(T(X))] + (E(T(X)) − θ)²
       = Var(T(X)) + (bias(T))²,

since the middle term vanishes: E[T(X) − E(T(X))] = 0. This identity is often useful for computations involving biased estimators.

Example 3.3. Let us go back to Example 1.9, in which we showed that the MLE for i.i.d. U[0, θ] random variables is given by T(X) = max_{1≤k≤n} X_k, which has p.d.f.

f_T(x) = (n/θ^n) x^{n−1} 1_{[0,θ]}(x)

and mean E(T) = (n/(n+1)) θ. Note that T is biased: we may wish to consider instead the unbiased estimator T_1(X) = 2X_1. Does T_1 perform better than T? To answer this question, we compute the Mean Square Errors. Since

bias(T) = E(T(X)) − θ = −θ/(n+1),
Var(T) = E(T(X)²) − E(T)² = (n/θ^n) ∫_0^θ x^{n+1} dx − ((n/(n+1)) θ)² = (n/(n+2)) θ² − (n/(n+1))² θ²,

we find

MSE(T) = Var(T(X)) + (bias(T))² = 2θ² / ((n+1)(n+2)).

On the other hand, since T_1 is unbiased we have

MSE(T_1) = Var(T_1(X)) = 4 Var(X_1) = θ²/3.

Since

2θ² / ((n+1)(n+2)) ≤ θ²/3 for all θ > 0 and n ≥ 1,

this shows that T, although biased, on average performs better than the unbiased estimator T_1. Note that again

MSE(T) = 2θ² / ((n+1)(n+2)) → 0 as n → ∞,

which means that if we had access to infinitely many observations we would be able to determine the exact value of θ using the estimator T. This is false for the unbiased estimator T_1, whose MSE does not depend on n.
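The comparison in Example 3.3 can be checked by simulation (an illustrative sketch, not part of the original notes; θ, n and the number of trials are arbitrary choices): the empirical mean square errors of T(X) = max_k X_k and T_1(X) = 2X_1 should match the exact formulas 2θ²/((n+1)(n+2)) and θ²/3.

```python
import random

random.seed(1)

theta, n, trials = 1.0, 10, 100_000
se_T, se_T1 = 0.0, 0.0
for _ in range(trials):
    xs = [random.uniform(0, theta) for _ in range(n)]
    se_T += (max(xs) - theta) ** 2        # squared error of the biased MLE T(X)
    se_T1 += (2 * xs[0] - theta) ** 2     # squared error of the unbiased T_1(X)
mse_T, mse_T1 = se_T / trials, se_T1 / trials

exact_T = 2 * theta**2 / ((n + 1) * (n + 2))   # 2 theta^2 / ((n+1)(n+2))
exact_T1 = theta**2 / 3                        # theta^2 / 3, independent of n
print(mse_T, exact_T, mse_T1, exact_T1)
```

For n = 10 the biased MLE already has an MSE roughly twenty times smaller than the unbiased T_1, in line with the inequality above.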

4. Confidence intervals

Given i.i.d. experiments X_1, X_2, ..., X_n following some distribution f(·; θ) depending on a parameter θ ∈ R, we have seen how to construct the MLE for θ. Moreover, given several estimators for θ, we have proposed to use the MSE to determine which one performs better. Once an estimator T for θ is chosen, and data x = (x_1, x_2, ..., x_n) have been observed, we estimate θ by θ̂(x) = T(x), which is a real number: a precise guess for θ. It is often the case that, instead of a single guess, we are willing to accept a guess of the form

a(x) ≤ θ ≤ b(x)

for some a(x), b(x) depending on the data, provided this has high probability. In other words, we are willing to give up precision in our guess to gain confidence that our guess is correct.

Example 4.1. Let X_1, X_2, ..., X_n be i.i.d. N(µ, 1), where µ is to be estimated. We have seen that the MLE for µ is

T(X) = X̄ = (1/n) Σ_{k=1}^n X_k,

and it is unbiased. Moreover, since a linear combination of Gaussian random variables is still Gaussian, we have X̄ ~ N(µ, 1/n), and thus X̄ − µ ~ N(0, 1/n). Note that, crucially, the distribution of X̄ − µ does not depend on µ, and

Z = √n (X̄ − µ) ~ N(0, 1).

Thus we can choose α, β such that

P(α ≤ Z ≤ β) = ∫_α^β (1/√(2π)) e^{−x²/2} dx = 0.95.

The choice of α, β is not unique: it is natural to choose the interval symmetric about 0, that is β = −α, which makes the interval [α, β] as small as possible. With this choice, writing α > 0 for the common absolute value, we get

P(−α ≤ Z ≤ α) = P(−α ≤ √n (X̄ − µ) ≤ α) = 0.95.

Rearranging, we find

P(X̄ − α/√n ≤ µ ≤ X̄ + α/√n) = 0.95.

In all, we have found that if a(X) = X̄ − α/√n and b(X) = X̄ + α/√n, the interval [a(X), b(X)]

is a 95% confidence interval for µ, that is

P(a(X) ≤ µ ≤ b(X)) = 0.95.

Definition 4.2 (Confidence interval). Given two statistics a(X) and b(X) with a(X) ≤ b(X), and a number γ ∈ (0, 1), the interval [a(X), b(X)] is said to be a 100γ% confidence interval for the unknown parameter θ if

P(a(X) ≤ θ ≤ b(X)) = γ.

It is important to note that here the parameter θ is not random; rather, the extrema a(X) and b(X) of the confidence interval are. Thus, if we were to collect data multiple times, and compute the confidence interval for each data collection, we would expect θ to be in the confidence interval 100γ% of the time.

Example 4.3. Let X_1, X_2, ..., X_n be i.i.d. exponential random variables of parameter λ. Consider the statistic

T(X) = Σ_{k=1}^n X_k

(which is not the MLE for λ). We have seen that the sum of n independent exponentials of parameter λ has Gamma(n, λ) distribution, so T(X) ~ Gamma(n, λ):

f_T(t) = (λ^n t^{n−1} e^{−λt} / (n−1)!) 1_{[0,∞)}(t).

We cannot use T(X) directly to construct a confidence interval for λ, as its p.d.f. depends on the unknown parameter λ. We therefore look for some other function of the X_k's whose distribution does not depend on λ. Take S(X) = 2λT(X). We compute the distribution function of S, to find

F_S(s) = P(S ≤ s) = P(2λT ≤ s) = P(T ≤ s/(2λ)) = F_T(s/(2λ)),

where F_T denotes the distribution function of T(X). Differentiating, we find

f_S(s) = (d/ds) F_T(s/(2λ)) = f_T(s/(2λ)) · 1/(2λ) = ((1/2)^n s^{n−1} / (n−1)!) e^{−s/2} 1_{[0,∞)}(s).

Note that S ~ Gamma(n, 1/2). In particular, we see that the p.d.f. of S(X) does not depend on λ. Given γ ∈ (0, 1), then, we can build a 100γ% confidence interval for λ by choosing α, β ∈ R such that

P(α ≤ S(X) ≤ β) = γ.

Using that S(X) = 2λ Σ_{k=1}^n X_k, this gives

P(α ≤ 2λ Σ_{k=1}^n X_k ≤ β) = γ,

which, solving the inequalities with respect to λ, gives

P( α / (2 Σ_{k=1}^n X_k) ≤ λ ≤ β / (2 Σ_{k=1}^n X_k) ) = γ.

Note that here α, β, γ are known numbers. Thus

[ α / (2 Σ_{k=1}^n X_k) , β / (2 Σ_{k=1}^n X_k) ]

is a 100γ% confidence interval for λ.

4.1. Approximate confidence intervals. We have seen that in order to build a confidence interval for a parameter θ we need to find some function of the data X and the parameter θ whose distribution does not depend on θ. We have seen in Example 4.1 that this is possible when X_1, X_2, ..., X_n are i.i.d. N(µ, 1) random variables and T(X) = X̄. But how should we proceed in general? The idea is to use the Central Limit Theorem (CLT) to reduce to the Gaussian case.

Example 4.4. Let X_1, X_2, ..., X_n be i.i.d. Bernoulli random variables of parameter p. Let

S_n = Σ_{k=1}^n X_k,

so that the MLE for p is given by T(X) = S_n/n, and it is unbiased. Write

E(X_1) = p, Var(X_1) = p(1 − p).

Then by the CLT the random variable

(S_n − np) / √(np(1 − p))

is approximately N(0, 1) distributed as n → ∞. Thus, if Z ~ N(0, 1), for very large n we can approximate

P( −α ≤ (S_n − np)/√(np(1 − p)) ≤ α ) ≈ P(−α ≤ Z ≤ α) = ∫_{−α}^α (1/√(2π)) e^{−x²/2} dx.

We choose α > 0 so that the above probability equals the given γ ∈ (0, 1), from which we get

P( −α ≤ (S_n − np)/√(np(1 − p)) ≤ α ) ≈ γ,

which, rearranging (and estimating p(1 − p) by (S_n/n)(1 − S_n/n)), gives

P( S_n/n − (α/n) √(S_n(n − S_n)/n) ≤ p ≤ S_n/n + (α/n) √(S_n(n − S_n)/n) ) ≈ γ

for n large enough. Thus

[ S_n/n − (α/n) √(S_n(n − S_n)/n) , S_n/n + (α/n) √(S_n(n − S_n)/n) ]

is an approximate 100γ% confidence interval for p. This is useful for opinion polls, where we assume that each person votes, say, democratic with probability p, and given that we observed n votes we want to estimate p.

Definition 4.5 (Approximate confidence interval). Let X ∈ R^n. Given two statistics a(X) and b(X) with a(X) ≤ b(X), and a number γ ∈ (0, 1), the interval [a(X), b(X)] is said to be an approximate 100γ% confidence interval for the unknown parameter θ if

P(a(X) ≤ θ ≤ b(X)) → γ

as n → ∞.

5. Exercises

Exercise 1. Suppose we observe the outcomes of 10 tosses of the same biased coin, giving head with probability p (unknown). These are given by …, where 1 stands for head and 0 stands for tail. By using Maximum Likelihood Estimation, make a guess for p.

Exercise 2. Let X_1, X_2, ..., X_n be i.i.d. exp(λ), with λ to be estimated. Suppose that n = 5 and we observe outcomes x = (0.2, 3, 1, 0.3, 4.1). By using Maximum Likelihood Estimation, make a guess for λ.

Exercise 3. Let X ~ Binomial(N, p), N known and p to be estimated. Compute the MSE of the MLE T(X) = X/N for p. Consider the alternative estimator

T_1(X) = (X + 1)/(N + 1).

Is it unbiased? Compute the MSE of T_1. Which one would you use between T and T_1? Why?

Exercise 4. In the setting of the previous exercise, define the alternative estimator

T_α(X) = α X/N + (1 − α) · 1/2,

for α ∈ [0, 1] given. Is T_α unbiased? Show that

MSE(T_α) = (α²/N) p(1 − p) + (1 − α)² (1/2 − p)².

Suppose α = 1/2. Which estimator would you choose between T(X) = X/N and T_α(X)? Why?

Exercise 5. Let X_1, X_2, ..., X_n be i.i.d. N(µ, 1) random variables. Compute the MSE of the MLE T(X) for µ. Show that MSE(T) → 0 as n → ∞. Let T_1 = X_1 be an alternative estimator. Is it unbiased? Show that MSE(T_1) does not converge to 0 as n → ∞. Which estimator would you use, and why?

Exercise 6. Let X_1, X_2, ..., X_n be i.i.d. U[0, θ]. We have seen that the MLE is T(X) = max_{1≤k≤n} X_k and it is biased. Introduce the unbiased estimator

T_1(X) = ((n + 1)/n) T(X).

Compute the MSE of T_1. Which estimator would you choose between T and T_1, and why?

Exercise 7. Suppose that X_1, X_2, ..., X_n are i.i.d. N(µ, 4) random variables, where µ is to be estimated. Show that

S(X) = (√n/2)(X̄ − µ) ~ N(0, 1), where X̄ = (1/n) Σ_{k=1}^n X_k.

Using S(X), construct a 95% confidence interval for µ. Construct a 90% confidence interval for µ. Which one is wider? Construct a 98% confidence interval for µ.

Exercise 8. Let X_1, X_2, ..., X_n be i.i.d. Poisson random variables of parameter λ to be estimated. Assuming that n is large, construct a 94% approximate confidence interval for λ.

Exercise 9. Let X_1, X_2, ..., X_n be i.i.d. Geometric random variables of parameter p to be estimated. Assuming that n is large, construct a 90% approximate confidence interval for p.

6. Bayesian statistics

Up to this point we have discussed point estimation. That is, we have assumed that X_1, X_2, ..., X_n are i.i.d. random variables following a common probability distribution f(·; θ), depending on a specific parameter θ, which is unknown. We could, on the other hand, think of the parameter θ as being itself random. We make an initial guess for the distribution of θ, and then update the guess based on the observed data. This is the point of view of Bayesian statistics.

Notation. When we want to treat the unknown parameter as a random variable we denote it by Θ, while we use θ for its values. We say that Θ ~ π if

P(Θ = θ) = π(θ) (discrete setting),
P(Θ ∈ [θ_1, θ_2]) = ∫_{θ_1}^{θ_2} π(θ) dθ (continuous setting).

Definition 6.1 (Discrete setting). The initial guess for the probability distribution of the parameter θ is called the prior distribution, and it is denoted by π(·). Given that the data x = (x_1, x_2, ..., x_n) are observed, the updated distribution for θ is called the posterior distribution, and it is given by

π(θ | x) = P(Θ = θ | X = x) = P(X = x; θ) π(θ) / P(X = x).

Example 6.2. Imagine tossing a biased coin giving head with probability Θ. We represent the outcome by a Bernoulli(Θ) random variable. Suppose we are told that Θ ∈ {θ_1, θ_2, θ_3} for some given numbers 0 ≤ θ_1 < θ_2 < θ_3 ≤ 1. Prior to observing the outcome of the coin toss we have no reason to believe that one of the values is more likely than the others, so our prior distribution is uniform over the set {θ_1, θ_2, θ_3}, that is

π(θ_1) = π(θ_2) = π(θ_3) = 1/3.

We now observe the outcome x = 1 (head) of the coin toss. How does our belief on θ change given the observed datum? We compute the posterior distribution

π(θ_k | 1) = P(X = 1; θ_k) π(θ_k) / P(X = 1)

for k = 1, 2, 3. We use the law of total probability to compute

P(X = 1) = P(X = 1; θ_1) π(θ_1) + P(X = 1; θ_2) π(θ_2) + P(X = 1; θ_3) π(θ_3) = (θ_1 + θ_2 + θ_3)/3.

Thus our posterior distribution is given by

π(θ_1 | 1) = P(X = 1; θ_1) π(θ_1) / P(X = 1) = θ_1 / (θ_1 + θ_2 + θ_3),
π(θ_2 | 1) = P(X = 1; θ_2) π(θ_2) / P(X = 1) = θ_2 / (θ_1 + θ_2 + θ_3),
π(θ_3 | 1) = P(X = 1; θ_3) π(θ_3) / P(X = 1) = θ_3 / (θ_1 + θ_2 + θ_3).

Note that the posterior distribution is a probability distribution, as the weights sum up to 1. In fact, we could have used this to compute P(X = 1).

Definition 6.3 (Continuous setting). The initial guess for the probability distribution of the parameter θ is called the prior distribution, and it is denoted by π(·). Given that the data x = (x_1, x_2, ..., x_n) are observed, the updated distribution for θ is called the posterior distribution, and it is given by

π(θ | x) = f_{X_1...X_n}(x; θ) π(θ) / f_{X_1...X_n}(x).
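The discrete update of Example 6.2 is a one-liner in code (a sketch, not part of the notes; the three values of θ are hypothetical): multiply prior by likelihood, then normalise by the evidence P(X = 1).

```python
# Posterior for Example 6.2: uniform prior on {theta_1, theta_2, theta_3},
# one observed head (x = 1), so P(X = 1; theta) = theta. The posterior
# weight of theta_k is theta_k / (theta_1 + theta_2 + theta_3).
thetas = [0.2, 0.5, 0.8]                 # hypothetical candidate values
prior = [1 / 3] * 3

likelihood = [t for t in thetas]          # P(X = 1; theta_k) = theta_k
unnorm = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnorm)                    # P(X = 1) by total probability
posterior = [u / evidence for u in unnorm]

print(posterior)
```

Observing a head shifts belief towards the larger values of θ, and the weights automatically sum to 1, as noted above.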

Example 6.4. Let X_1, X_2, ..., X_n be i.i.d. exponential random variables of parameter Θ. We assume that Θ ~ Exponential(µ) for some given µ. Thus our prior distribution is

π(θ) = µ e^{−µθ} 1_{[0,∞)}(θ).

After data x = (x_1, x_2, ..., x_n) are observed, our posterior distribution for the parameter is given by

π(θ | x) = f_{X_1...X_n}(x; θ) π(θ) / f_{X_1...X_n}(x) = (1 / f_{X_1...X_n}(x)) ( ∏_{k=1}^n θ e^{−θ x_k} ) µ e^{−µθ} = (µ / f_{X_1...X_n}(x)) θ^n e^{−θ(Σ_k x_k + µ)}

for θ ≥ 0. Note that the factor µ / f_{X_1...X_n}(x) does not depend on θ, so it must equal the unique constant that makes the posterior distribution integrate to 1. Since the remaining part of the p.d.f. is proportional to the p.d.f. of a Gamma distribution with parameters n + 1 and Σ_k x_k + µ, we conclude that

Θ | x ~ Gamma( n + 1, Σ_{k=1}^n x_k + µ ).

That is, if prior to the data collection we believed that the parameter was Exponential(µ), after observing the data our new belief is that the parameter is Gamma(n + 1, Σ_{k=1}^n x_k + µ).

6.1. Bayesian estimation. Once we have computed the posterior distribution π(· | x) for Θ, we can use it to make a guess for the value of θ.

Definition 6.5 (Bayes estimator). The Bayes estimator θ̂(x) is defined to be the posterior mean, that is

θ̂(x) = E(Θ | x) = ∫ θ π(θ | x) dθ.

Example 6.6. We have seen in Example 6.4 that the posterior distribution is a Gamma(n + 1, Σ_{k=1}^n x_k + µ). It follows that the Bayes estimator for the unknown parameter θ is given by

θ̂(x) = E(Θ | x) = (n + 1) / (Σ_{k=1}^n x_k + µ).

Theorem 6.7. The Bayes estimator is the value of θ which minimises the expected posterior quadratic loss E[(Θ − θ)² | x].

Remark 6.8. It is possible to introduce different loss functions. For example, we might want to estimate θ by the minimiser of the expected posterior absolute loss E[|Θ − θ| | x]. In this case the Bayes estimator would be the median of the posterior distribution.
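The conjugate update of Example 6.4 reduces to bookkeeping on the two Gamma parameters; a short sketch (not part of the notes; the data vector and prior rate µ are hypothetical):

```python
# Conjugate update from Example 6.4: X_k i.i.d. Exponential(theta) with prior
# Theta ~ Exponential(mu). The posterior is Gamma(n + 1, sum(x) + mu), and the
# Bayes estimator (for quadratic loss) is the posterior mean.
def posterior_params(xs, mu):
    n = len(xs)
    return n + 1, sum(xs) + mu           # (shape, rate) of the Gamma posterior

xs = [0.5, 1.2, 0.8, 2.0]                # hypothetical observed data
mu = 1.0                                 # hypothetical prior rate

shape, rate = posterior_params(xs, mu)
bayes_estimate = shape / rate            # E(Theta | x) = (n + 1) / (sum x_k + mu)
print(shape, rate, bayes_estimate)
```

Here the posterior is Gamma(5, 5.5), so the Bayes estimate is 5/5.5 ≈ 0.909, close to the MLE n/Σx_k = 4/4.5 but shrunk by the prior.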

7. Exercises

Exercise 1. Let X_1, X_2, ..., X_n be i.i.d. N(Θ, 1), and consider the prior distribution Θ ~ N(0, 1). Show that the posterior distribution is given by

Θ | x ~ N( (1/(n+1)) Σ_{k=1}^n x_k , 1/(n+1) ).

Deduce that the Bayes estimator (for quadratic loss) is given by θ̂(x) = (1/(n+1)) Σ_{k=1}^n x_k.

Exercise 2. Let X_1, X_2, ..., X_n be i.i.d. Poisson(Θ) random variables, with prior distribution Θ ~ Exponential(1). Show that

Θ | x ~ Gamma( Σ_{k=1}^n x_k + 1 , n + 1 ).

Deduce that the Bayes estimator (for quadratic loss) for θ is given by

θ̂(x) = (Σ_{k=1}^n x_k + 1) / (n + 1).

Exercise 3. Solve the above exercise with prior Θ ~ Exponential(µ).

Exercise 4. Let X_1, X_2, ..., X_n be i.i.d. Bernoulli(Θ) random variables, with prior distribution Θ ~ U[0, 1] (no prior information). Show that the posterior distribution is given by

Θ | x ~ Beta( Σ_{k=1}^n x_k + 1 , n + 1 − Σ_{k=1}^n x_k ),

where we say that X ~ Beta(α, β) if

f_X(x) = (x^{α−1} (1 − x)^{β−1} / B(α, β)) 1_{[0,1]}(x),

with B(α, β) = Γ(α)Γ(β)/Γ(α + β). Using that a Beta(α, β) random variable has expectation α/(α + β), deduce that the Bayes estimator (for quadratic loss) is given by

θ̂(x) = (Σ_{k=1}^n x_k + 1) / (n + 2).

Exercise 5. We have an urn containing N balls, N known, some of which are blue and some of which are red. Let Θ be the proportion of red balls. We take the uninformative prior Θ ~ U[0, 1]. Suppose that we pick one ball, and it turns out to be red. Compute the posterior distribution Θ | Red. Compute the Bayes estimator (for quadratic loss) for the proportion of red balls.

Exercise 6. Suppose we roll a biased die, where the probability of getting k is proportional to Θ^k, for k = 1, 2, 3, 4, 5, 6. Let X_1, X_2, ..., X_n denote n independent rolls of the die. Starting from the uninformative prior Θ ~ U[0, 1], compute the posterior distribution π(θ | x) of Θ | x, and the Bayes estimator (for quadratic loss).

Exercise 7. Let X_1, X_2, ..., X_n be i.i.d. Exponential(Θ) random variables, with prior distribution Θ ~ Gamma(α, β) for α, β known. Show that the posterior distribution is given by

Θ | x ~ Gamma( α + n , β + Σ_{k=1}^n x_k ).

Deduce that the Bayes estimator (for quadratic loss) is given by θ̂(x) = (α + n) / (β + Σ_{k=1}^n x_k).
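As a numerical companion to the Beta posterior stated in Exercise 4 (a sketch under the exercise's own assumptions; the data vector is hypothetical), the Bayes estimator under the uniform prior is the posterior mean (Σx_k + 1)/(n + 2):

```python
# Beta-Bernoulli posterior mean from Exercise 4: with prior Theta ~ U[0,1]
# and Bernoulli data x, the posterior is Beta(sum(x) + 1, n - sum(x) + 1),
# so the Bayes estimator is (sum(x) + 1) / (n + 2).
def bayes_estimate(xs):
    n, s = len(xs), sum(xs)
    alpha, beta = s + 1, n - s + 1       # Beta posterior parameters
    return alpha / (alpha + beta)        # = (s + 1) / (n + 2)

xs = [1, 1, 0, 1, 0]                     # hypothetical tosses: 3 heads in 5
print(bayes_estimate(xs))                # (3 + 1) / (5 + 2) = 4/7
```

With no data at all the estimate is 1/2, the prior mean; as n grows it approaches the MLE Σx_k / n.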


Appendix A. Background on set theory

A set is a collection of elements, which can be finite or infinite. If A is a finite set, |A| denotes the number of elements of A. The empty set is the set consisting of no elements, and it is denoted by ∅. If A and B are two sets, then

A ∪ B = {x : x ∈ A or x ∈ B} (union),
A ∩ B = {x : x ∈ A and x ∈ B} (intersection),
A \ B = {x : x ∈ A and x ∉ B} (difference),

and A ⊆ B if all elements of A are also elements of B. If A ∩ B = ∅ then we say that A and B are disjoint. Note that the empty set is disjoint from every set, including itself. In fact, the empty set is the only set disjoint from itself. If Ω is a set, and A ⊆ Ω, we define the complement of A in Ω as

A^c = Ω \ A = {x ∈ Ω : x ∉ A}.

Note that A ∩ A^c = ∅, so A and A^c are disjoint.

Example 0.1. Take A = {1, 2, 3}, B = {2, 4, 6, 8}. Then

A ∪ B = {1, 2, 3, 4, 6, 8}, A ∩ B = {2}, A \ B = {1, 3}, B \ A = {4, 6, 8}.

If C = {6, 7, 8} then A ∪ C = {1, 2, 3, 6, 7, 8}, A ∩ C = ∅, A \ C = A, A ∪ B ∪ C = {1, 2, 3, 4, 6, 7, 8}, A ∩ B ∩ C = ∅.

Take Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Then A, B, C ⊆ Ω. Their complements in Ω are

A^c = {4, 5, 6, 7, 8, 9, 10}, B^c = {1, 3, 5, 7, 9, 10}, C^c = {1, 2, 3, 4, 5, 9, 10}.

We can also define infinite unions and intersections. For a sequence of sets (A_n)_{n ∈ N} set

∪_{n ∈ N} A_n = A_1 ∪ A_2 ∪ A_3 ∪ · · · = {x : x ∈ A_n for some n ∈ N},
∩_{n ∈ N} A_n = A_1 ∩ A_2 ∩ A_3 ∩ · · · = {x : x ∈ A_n for all n ∈ N}.

Example 0.2. Take A_n = {n, n + 1, n + 2, ...} = {k ∈ N : k ≥ n}. Then

∪_{n ∈ N} A_n = N, ∩_{n ∈ N} A_n = ∅.

If on the other hand we take A_n as before but set A_n = A_10 for all n ≥ 10, then

∪_{n ∈ N} A_n = N, ∩_{n ∈ N} A_n = A_10 = {10, 11, 12, ...}.

Given two sets A, B, we define their product as

A × B = {(x, y) : x ∈ A, y ∈ B}.

Example 0.3. If A = {1, 2, 3} and B = {α, β} then

A × B = {(1, α), (1, β), (2, α), (2, β), (3, α), (3, β)},
A × A = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)},
B × B = {(α, α), (α, β), (β, α), (β, β)}.

Note that the pairs are ordered, and that |A × B| = |A| · |B|. We can define the product of an arbitrary number of sets by setting

A_1 × A_2 × A_3 × · · · × A_n = {(x_1, x_2, x_3, ..., x_n) : x_k ∈ A_k}.

Example 0.4. If A = {0, 1}, B = {a} and C = {x, y} then

A × B × C = {(0, a, x), (0, a, y), (1, a, x), (1, a, y)}.

Note that |A × B × C| = |A| · |B| · |C| = 4.
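These product-set examples translate directly into code (an illustrative sketch, not part of the notes), where the counting rule |A × B| = |A| · |B| can be checked mechanically:

```python
from itertools import product

A = {1, 2, 3}
B = {"alpha", "beta"}

# A x B is the set of ordered pairs (a, b) with a in A and b in B.
AxB = set(product(A, B))
assert len(AxB) == len(A) * len(B)   # |A x B| = |A| * |B| = 6

print(len(AxB))
```

Because the pairs are ordered, (1, "alpha") and ("alpha", 1) are different objects, matching the remark above.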

Table of Φ(z)

[Table of values of the standard Normal distribution function Φ(z); not reproduced here.]


More information

Recap of Basic Probability Theory

Recap of Basic Probability Theory 02407 Stochastic Processes? Recap of Basic Probability Theory Uffe Høgsbro Thygesen Informatics and Mathematical Modelling Technical University of Denmark 2800 Kgs. Lyngby Denmark Email: uht@imm.dtu.dk

More information

STAT 414: Introduction to Probability Theory

STAT 414: Introduction to Probability Theory STAT 414: Introduction to Probability Theory Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical Exercises

More information

Probability Theory and Random Variables

Probability Theory and Random Variables Probability Theory and Random Variables One of the most noticeable aspects of many computer science related phenomena is the lack of certainty. When a job is submitted to a batch oriented computer system,

More information

Recap of Basic Probability Theory

Recap of Basic Probability Theory 02407 Stochastic Processes Recap of Basic Probability Theory Uffe Høgsbro Thygesen Informatics and Mathematical Modelling Technical University of Denmark 2800 Kgs. Lyngby Denmark Email: uht@imm.dtu.dk

More information

Outline Conditional Probability The Law of Total Probability and Bayes Theorem Independent Events. Week 4 Classical Probability, Part II

Outline Conditional Probability The Law of Total Probability and Bayes Theorem Independent Events. Week 4 Classical Probability, Part II Week 4 Classical Probability, Part II Week 4 Objectives This week we continue covering topics from classical probability. The notion of conditional probability is presented first. Important results/tools

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions

More information

Chapter 4 : Discrete Random Variables

Chapter 4 : Discrete Random Variables STAT/MATH 394 A - PROBABILITY I UW Autumn Quarter 2015 Néhémy Lim Chapter 4 : Discrete Random Variables 1 Random variables Objectives of this section. To learn the formal definition of a random variable.

More information

1. Discrete Distributions

1. Discrete Distributions Virtual Laboratories > 2. Distributions > 1 2 3 4 5 6 7 8 1. Discrete Distributions Basic Theory As usual, we start with a random experiment with probability measure P on an underlying sample space Ω.

More information

UCSD CSE 21, Spring 2014 [Section B00] Mathematics for Algorithm and System Analysis

UCSD CSE 21, Spring 2014 [Section B00] Mathematics for Algorithm and System Analysis UCSD CSE 21, Spring 2014 [Section B00] Mathematics for Algorithm and System Analysis Lecture 8 Class URL: http://vlsicad.ucsd.edu/courses/cse21-s14/ Lecture 8 Notes Goals for Today Counting Partitions

More information

Probability. J.R. Norris. December 13, 2017

Probability. J.R. Norris. December 13, 2017 Probability J.R. Norris December 13, 2017 1 Contents 1 Mathematical models for randomness 6 1.1 A general definition............................... 6 1.2 Equally likely outcomes.............................

More information

Probabilistic models

Probabilistic models Probabilistic models Kolmogorov (Andrei Nikolaevich, 1903 1987) put forward an axiomatic system for probability theory. Foundations of the Calculus of Probabilities, published in 1933, immediately became

More information

Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality

Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality Lecture 13 (Part 2): Deviation from mean: Markov s inequality, variance and its properties, Chebyshev s inequality Discrete Structures II (Summer 2018) Rutgers University Instructor: Abhishek Bhrushundi

More information

Chapter 2. Conditional Probability and Independence. 2.1 Conditional Probability

Chapter 2. Conditional Probability and Independence. 2.1 Conditional Probability Chapter 2 Conditional Probability and Independence 2.1 Conditional Probability Example: Two dice are tossed. What is the probability that the sum is 8? This is an easy exercise: we have a sample space

More information

Methods of Mathematics

Methods of Mathematics Methods of Mathematics Kenneth A. Ribet UC Berkeley Math 10B February 23, 2016 Office hours Office hours Monday 2:10 3:10 and Thursday 10:30 11:30 in Evans; Tuesday 10:30 noon at the SLC Kenneth A. Ribet

More information

Math Bootcamp 2012 Miscellaneous

Math Bootcamp 2012 Miscellaneous Math Bootcamp 202 Miscellaneous Factorial, combination and permutation The factorial of a positive integer n denoted by n!, is the product of all positive integers less than or equal to n. Define 0! =.

More information

Module 1. Probability

Module 1. Probability Module 1 Probability 1. Introduction In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive

More information

Chapter 2. Conditional Probability and Independence. 2.1 Conditional Probability

Chapter 2. Conditional Probability and Independence. 2.1 Conditional Probability Chapter 2 Conditional Probability and Independence 2.1 Conditional Probability Probability assigns a likelihood to results of experiments that have not yet been conducted. Suppose that the experiment has

More information

Discrete Probability Refresher

Discrete Probability Refresher ECE 1502 Information Theory Discrete Probability Refresher F. R. Kschischang Dept. of Electrical and Computer Engineering University of Toronto January 13, 1999 revised January 11, 2006 Probability theory

More information

Conditional Probability

Conditional Probability Conditional Probability Idea have performed a chance experiment but don t know the outcome (ω), but have some partial information (event A) about ω. Question: given this partial information what s the

More information

5. Conditional Distributions

5. Conditional Distributions 1 of 12 7/16/2009 5:36 AM Virtual Laboratories > 3. Distributions > 1 2 3 4 5 6 7 8 5. Conditional Distributions Basic Theory As usual, we start with a random experiment with probability measure P on an

More information

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section

More information

LECTURE 1. 1 Introduction. 1.1 Sample spaces and events

LECTURE 1. 1 Introduction. 1.1 Sample spaces and events LECTURE 1 1 Introduction The first part of our adventure is a highly selective review of probability theory, focusing especially on things that are most useful in statistics. 1.1 Sample spaces and events

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information

Introduction to Probability. Ariel Yadin. Lecture 1. We begin with an example [this is known as Bertrand s paradox]. *** Nov.

Introduction to Probability. Ariel Yadin. Lecture 1. We begin with an example [this is known as Bertrand s paradox]. *** Nov. Introduction to Probability Ariel Yadin Lecture 1 1. Example: Bertrand s Paradox We begin with an example [this is known as Bertrand s paradox]. *** Nov. 1 *** Question 1.1. Consider a circle of radius

More information

MATH1231 Algebra, 2017 Chapter 9: Probability and Statistics

MATH1231 Algebra, 2017 Chapter 9: Probability and Statistics MATH1231 Algebra, 2017 Chapter 9: Probability and Statistics A/Prof. Daniel Chan School of Mathematics and Statistics University of New South Wales danielc@unsw.edu.au Daniel Chan (UNSW) MATH1231 Algebra

More information

4. Suppose that we roll two die and let X be equal to the maximum of the two rolls. Find P (X {1, 3, 5}) and draw the PMF for X.

4. Suppose that we roll two die and let X be equal to the maximum of the two rolls. Find P (X {1, 3, 5}) and draw the PMF for X. Math 10B with Professor Stankova Worksheet, Midterm #2; Wednesday, 3/21/2018 GSI name: Roy Zhao 1 Problems 1.1 Bayes Theorem 1. Suppose a test is 99% accurate and 1% of people have a disease. What is the

More information

CSC Discrete Math I, Spring Discrete Probability

CSC Discrete Math I, Spring Discrete Probability CSC 125 - Discrete Math I, Spring 2017 Discrete Probability Probability of an Event Pierre-Simon Laplace s classical theory of probability: Definition of terms: An experiment is a procedure that yields

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER / Probability

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER / Probability ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER 2 2017/2018 DR. ANTHONY BROWN 5.1. Introduction to Probability. 5. Probability You are probably familiar with the elementary

More information

Review of Probability. CS1538: Introduction to Simulations

Review of Probability. CS1538: Introduction to Simulations Review of Probability CS1538: Introduction to Simulations Probability and Statistics in Simulation Why do we need probability and statistics in simulation? Needed to validate the simulation model Needed

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Lecture 1: Review on Probability and Statistics

Lecture 1: Review on Probability and Statistics STAT 516: Stochastic Modeling of Scientific Data Autumn 2018 Instructor: Yen-Chi Chen Lecture 1: Review on Probability and Statistics These notes are partially based on those of Mathias Drton. 1.1 Motivating

More information

SDS 321: Introduction to Probability and Statistics

SDS 321: Introduction to Probability and Statistics SDS 321: Introduction to Probability and Statistics Lecture 14: Continuous random variables Purnamrita Sarkar Department of Statistics and Data Science The University of Texas at Austin www.cs.cmu.edu/

More information

STAT 418: Probability and Stochastic Processes

STAT 418: Probability and Stochastic Processes STAT 418: Probability and Stochastic Processes Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 3: Probability, Bayes Theorem, and Bayes Classification Peter Belhumeur Computer Science Columbia University Probability Should you play this game? Game: A fair

More information

CMPSCI 240: Reasoning Under Uncertainty

CMPSCI 240: Reasoning Under Uncertainty CMPSCI 240: Reasoning Under Uncertainty Lecture 5 Prof. Hanna Wallach wallach@cs.umass.edu February 7, 2012 Reminders Pick up a copy of B&T Check the course website: http://www.cs.umass.edu/ ~wallach/courses/s12/cmpsci240/

More information

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

CS37300 Class Notes. Jennifer Neville, Sebastian Moreno, Bruno Ribeiro

CS37300 Class Notes. Jennifer Neville, Sebastian Moreno, Bruno Ribeiro CS37300 Class Notes Jennifer Neville, Sebastian Moreno, Bruno Ribeiro 2 Background on Probability and Statistics These are basic definitions, concepts, and equations that should have been covered in your

More information

MATH 151, FINAL EXAM Winter Quarter, 21 March, 2014

MATH 151, FINAL EXAM Winter Quarter, 21 March, 2014 Time: 3 hours, 8:3-11:3 Instructions: MATH 151, FINAL EXAM Winter Quarter, 21 March, 214 (1) Write your name in blue-book provided and sign that you agree to abide by the honor code. (2) The exam consists

More information

Statistical Inference

Statistical Inference Statistical Inference Lecture 1: Probability Theory MING GAO DASE @ ECNU (for course related communications) mgao@dase.ecnu.edu.cn Sep. 11, 2018 Outline Introduction Set Theory Basics of Probability Theory

More information

4th IIA-Penn State Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur

4th IIA-Penn State Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur 4th IIA-Penn State Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur Laws of Probability, Bayes theorem, and the Central Limit Theorem Rahul Roy Indian Statistical Institute, Delhi. Adapted

More information

P (A B) P ((B C) A) P (B A) = P (B A) + P (C A) P (A) = P (B A) + P (C A) = Q(A) + Q(B).

P (A B) P ((B C) A) P (B A) = P (B A) + P (C A) P (A) = P (B A) + P (C A) = Q(A) + Q(B). Lectures 7-8 jacques@ucsdedu 41 Conditional Probability Let (Ω, F, P ) be a probability space Suppose that we have prior information which leads us to conclude that an event A F occurs Based on this information,

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

HW MATH425/525 Lecture Notes 1

HW MATH425/525 Lecture Notes 1 HW MATH425/525 Lecture Notes 1 Definition 4.1 If an experiment can be repeated under the same condition, its outcome cannot be predicted with certainty, and the collection of its every possible outcome

More information

Homework 4 Solution, due July 23

Homework 4 Solution, due July 23 Homework 4 Solution, due July 23 Random Variables Problem 1. Let X be the random number on a die: from 1 to. (i) What is the distribution of X? (ii) Calculate EX. (iii) Calculate EX 2. (iv) Calculate Var

More information

Tom Salisbury

Tom Salisbury MATH 2030 3.00MW Elementary Probability Course Notes Part V: Independence of Random Variables, Law of Large Numbers, Central Limit Theorem, Poisson distribution Geometric & Exponential distributions Tom

More information

MAT 271E Probability and Statistics

MAT 271E Probability and Statistics MAT 71E Probability and Statistics Spring 013 Instructor : Class Meets : Office Hours : Textbook : Supp. Text : İlker Bayram EEB 1103 ibayram@itu.edu.tr 13.30 1.30, Wednesday EEB 5303 10.00 1.00, Wednesday

More information

Chapter 2 Class Notes

Chapter 2 Class Notes Chapter 2 Class Notes Probability can be thought of in many ways, for example as a relative frequency of a long series of trials (e.g. flips of a coin or die) Another approach is to let an expert (such

More information

Lecture 2: Probability, conditional probability, and independence

Lecture 2: Probability, conditional probability, and independence Lecture 2: Probability, conditional probability, and independence Theorem 1.2.6. Let S = {s 1,s 2,...} and F be all subsets of S. Let p 1,p 2,... be nonnegative numbers that sum to 1. The following defines

More information

Axiomatic Foundations of Probability. Definition: Probability Function

Axiomatic Foundations of Probability. Definition: Probability Function Chapter 1 sections We will SKIP a number of sections Set theory SKIP Real number uncountability Definition of probability Finite sample spaces Counting methods Combinatorial methods SKIP tennis tournament

More information

TOPIC 12 PROBABILITY SCHEMATIC DIAGRAM

TOPIC 12 PROBABILITY SCHEMATIC DIAGRAM TOPIC 12 PROBABILITY SCHEMATIC DIAGRAM Topic Concepts Degree of Importance References NCERT Book Vol. II Probability (i) Conditional Probability *** Article 1.2 and 1.2.1 Solved Examples 1 to 6 Q. Nos

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information

Math LM (27794) - Lectures 01

Math LM (27794) - Lectures 01 Math 37500 -LM (27794) - Lectures 01 Ethan Akin Office: NAC 6/287 Phone: 650-5136 Email: ethanakin@earthlink.net Fall, 2018 Contents Probability and Counting, Chapter 1 Counting, Sections 1.3, 1.4 Adjusting

More information

Week 2. Review of Probability, Random Variables and Univariate Distributions

Week 2. Review of Probability, Random Variables and Univariate Distributions Week 2 Review of Probability, Random Variables and Univariate Distributions Probability Probability Probability Motivation What use is Probability Theory? Probability models Basis for statistical inference

More information

Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events

Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events Discrete Structures II (Summer 2018) Rutgers University Instructor: Abhishek

More information

STAT 516 Answers Homework 2 January 23, 2008 Solutions by Mark Daniel Ward PROBLEMS. = {(a 1, a 2,...) : a i < 6 for all i}

STAT 516 Answers Homework 2 January 23, 2008 Solutions by Mark Daniel Ward PROBLEMS. = {(a 1, a 2,...) : a i < 6 for all i} STAT 56 Answers Homework 2 January 23, 2008 Solutions by Mark Daniel Ward PROBLEMS 2. We note that E n consists of rolls that end in 6, namely, experiments of the form (a, a 2,...,a n, 6 for n and a i

More information

2. Suppose (X, Y ) is a pair of random variables uniformly distributed over the triangle with vertices (0, 0), (2, 0), (2, 1).

2. Suppose (X, Y ) is a pair of random variables uniformly distributed over the triangle with vertices (0, 0), (2, 0), (2, 1). Name M362K Final Exam Instructions: Show all of your work. You do not have to simplify your answers. No calculators allowed. There is a table of formulae on the last page. 1. Suppose X 1,..., X 1 are independent

More information

Math Introduction to Probability. Davar Khoshnevisan University of Utah

Math Introduction to Probability. Davar Khoshnevisan University of Utah Math 5010 1 Introduction to Probability Based on D. Stirzaker s book Cambridge University Press Davar Khoshnevisan University of Utah Lecture 1 1. The sample space, events, and outcomes Need a math model

More information

1 The Basic Counting Principles

1 The Basic Counting Principles 1 The Basic Counting Principles The Multiplication Rule If an operation consists of k steps and the first step can be performed in n 1 ways, the second step can be performed in n ways [regardless of how

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Discrete Probability

Discrete Probability Discrete Probability Counting Permutations Combinations r- Combinations r- Combinations with repetition Allowed Pascal s Formula Binomial Theorem Conditional Probability Baye s Formula Independent Events

More information