Probability Theory and Random Processes Lecture Notes


Sangwon (Justin) Hyun
June 18, 2016

Lecture 1

Recommended Readings: WMS

Populations, Statistics and Random Processes

Broadly speaking, in Statistics we try to infer, learn, or estimate some features or parameters of a population using observed data sampled from that population. What do we mean by population? We can think of a population as an entity which encompasses all possible data that can be observed in an experiment. This can be a tangible collection of objects or a more abstract entity. Examples of tangible populations are:

- the collection of all pennies in the USA
- all human beings born in the year 2000.

Most often, we perform an experiment because we are interested in learning about a particular feature or parameter of a population. Examples of parameters for the above populations are:

- the mean weight of all the pennies in the USA
- the proportion of people born in the year 2000 that are taller than 5.25 feet today
- the variability in the weight of all the pennies in the USA.

In order to learn such features we usually proceed as follows:

1. specify a statistical model for the unknown true population (e.g. specify a class of functions that approximates well the unknown true distribution of the weight of all the pennies in the USA)

2. collect data X_1, ..., X_n, possibly by performing a particular experiment (e.g. sample n pennies and measure their weights)

3. compute a statistic, i.e. a function of the data (and of the data exclusively!), to estimate the parameter of interest (e.g. compute the average weight of the n sampled pennies)

4. perform further statistical analyses.

Probability Theory provides the rigorous mathematical framework needed in the above process (and much more). Probability Theory can be studied at very different levels, depending on one's purpose and needs. At a purely mathematical level, Probability Theory is an application of Measure Theory to real-world problems. At a more introductory and conceptual level, Probability Theory is a set of mathematical tools that allows us to carry out and analyze the process described above in a scientific way (especially for statisticians, computer scientists, scientists and friends).

As an example, some frequently used statistics or estimators are

- the sample mean X̄ = (1/n) Σ_{i=1}^n X_i
- the sample variance S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)².

In particular, they estimate the mean µ of the distribution of interest (which is a measure of central tendency) and its variance σ² (which is a measure of variability/dispersion). With respect to the examples above, we can use the sample mean to estimate the mean weight µ of all the pennies in the USA by collecting n pennies, measuring their weights X_1, ..., X_n, and then computing X̄. We expect that, as n gets larger, X̄ ≈ µ.
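As an illustrative sketch (not part of the notes), the recipe above can be simulated: we draw n hypothetical penny weights from a known distribution and check that the sample mean and sample variance land near the true µ and σ². The values µ = 2.5 and σ = 0.1 are made-up stand-ins.

```python
# A simulation sketch: the sample mean and sample variance approach
# mu and sigma^2 as n grows. The "population" is simulated Normal data;
# mu = 2.5 and sigma = 0.1 are hypothetical values, not real penny weights.
import random

random.seed(0)
mu, sigma = 2.5, 0.1
n = 10_000
X = [random.gauss(mu, sigma) for _ in range(n)]

xbar = sum(X) / n                                # sample mean
s2 = sum((x - xbar) ** 2 for x in X) / (n - 1)   # sample variance, 1/(n-1)

print(xbar, s2)   # close to 2.5 and 0.01
```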
If we are also interested in estimating the variance σ² of the weight of the pennies in the USA, or its standard deviation σ = √σ², we might compute the sample variance S² or the sample standard deviation S = √S² (note that, as opposed to the sample variance, the sample standard deviation has the same unit of measure as the data X_1, ..., X_n). Again, as n gets larger, we expect that S² ≈ σ². We can also use the sample mean to estimate the proportion of people born in the year 2000 who are taller than 5.25 feet today: first we measure the heights X_1, ..., X_n of n persons born in the year 2000, then we introduce an indicator

variable

    Z_i = 1_{(5.25, ∞)}(X_i) = 1 if X_i > 5.25, and 0 if X_i ≤ 5.25    (1)

and we finally compute Z̄ = (1/n) Σ_{i=1}^n Z_i.

Again, as an example, we might imagine that a given probability distribution, for instance the Normal distribution, might be a good model to describe the weight of the pennies in the USA. This is a modeling assumption you make; ideally, in practice, the appropriateness of these assumptions should be checked. The Normal distribution is completely characterized by two parameters: its mean µ and its variance σ². How would you plan an experiment to estimate µ and σ² if you believe that the Normal distribution N(µ, σ²) is a good model for the weight of all the pennies in the USA?

We should always keep in mind that there is a difference between models and the truth. While the Normal distribution might provide a good model for the distribution of the weight of the pennies in the USA, the true distribution of the weight of the pennies in the USA might not be Normal (and most likely it is not Normal, in fact).

(Diagram: TRUTH vs. MODELS vs. STATISTICS)

Why is the Normal distribution, for any µ and σ², most likely not the true distribution of the weight of the pennies in the USA?

There are situations in which the quantity that we want to study evolves with time (or with space, or with respect to some other dimension). For example, suppose that you are interested in the number of people walking into a store on a given day. We can model that quantity as a random quantity X_t varying with respect to time t. Most often there exists some kind of dependence between the number of people in the store at time t and the number of people in the store at time t′ (especially if |t − t′| is small). There is a branch of Probability Theory that is devoted to studying and modeling this type of problem. We usually refer to the collection {X_t : t ∈ T} as a random process or a stochastic process. The set T can represent a time interval, a spatial region, or some more elaborate domain.
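The indicator-variable estimate of a proportion described above can be sketched in a few lines; the heights below are hypothetical sample data, not real measurements.

```python
# Hypothetical heights (in feet) of people born in 2000; Z_i indicates
# whether person i is taller than 5.25 feet, and zbar estimates the
# proportion of the population that is.
heights = [5.0, 5.5, 6.1, 5.2, 5.3, 4.9, 5.8, 5.26]

Z = [1 if x > 5.25 else 0 for x in heights]   # indicator variables Z_i
zbar = sum(Z) / len(Z)                        # estimated proportion
print(zbar)   # 0.625
```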
Another example of a random process is observing the location of robberies occurring in a given city. (Question: what is T?)

Another typical example of a random process is the evolution of a stock price on Wall Street as a (random) function of time. (Question: what is T?) We will devote part of this course to studying (at an introductory level) some of the theory related to random processes.

Sample Spaces and Set Theory

What do we mean by probability? There are two major interpretations:

- objective probability: the long-run frequency of the occurrence of a given event (e.g. seeing heads in a coin flip)
- subjective probability: a subjective belief in the rate of occurrence of a given event.

Methods based on objective probability are usually called frequentist or classical, whereas those based on subjective probability are usually called Bayesian, as they are associated with the Bayesian paradigm. In this class, we focus on frequentist methods (but we will discuss later in the course the basic idea at the basis of Bayesian Statistics).

In Probability Theory, the notion of event is usually described in terms of a set (of outcomes). Therefore, before beginning our discussion of Probability Theory, it is worthwhile to review some basic facts about set theory. Throughout the course we will adopt the convention that Ω denotes the universal set (the superset of all sets) and ∅ denotes the empty set (i.e. a set that does not contain any element). Recall that a set A is a subset of another set B (or A is contained in B) if for any element x of A we have x ∈ A ⟹ x ∈ B. We denote the relation "A is a subset of B" by A ⊆ B (similarly, A ⊇ B means "A is a superset of B" or "A contains B"). Clearly, if A ⊆ B and B ⊆ A, then A = B.

Let ∧ and ∨ stand for "and" and "or" respectively. Given two subsets A and B of Ω, recall the following basic set operations:

- union of sets: A ∪ B = {x ∈ Ω : x ∈ A ∨ x ∈ B}
- intersection of sets: A ∩ B = {x ∈ Ω : x ∈ A ∧ x ∈ B}
- complement of a set: A^c = {x ∈ Ω : x ∉ A}
- set difference: A \ B = {x ∈ Ω : x ∈ A ∧ x ∉ B}
- symmetric set difference: A △ B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B).

(Venn diagrams)

Notice that, for any set A ⊆ Ω, A ∪ A^c = Ω and A ∩ A^c = ∅. Two sets A, B ⊆ Ω for which A ∩ B = ∅ are said to be disjoint or mutually exclusive.

Also, recall the distributive property for unions and intersections. For A, B, C ⊆ Ω, we have

    A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
    A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

A partition (disjoint union) of Ω is a collection of subsets A_1, ..., A_n that satisfy

1. A_i ∩ A_j = ∅ for i ≠ j
2. ∪_{i=1}^n A_i = A_1 ∪ A_2 ∪ ... ∪ A_n = Ω.

Finally, we have De Morgan's Laws:

    (A ∪ B)^c = A^c ∩ B^c
    (A ∩ B)^c = A^c ∪ B^c.

These can be extended to the union and the intersection of any n > 0 sets:

    (∪_{i=1}^n A_i)^c = ∩_{i=1}^n A_i^c
    (∩_{i=1}^n A_i)^c = ∪_{i=1}^n A_i^c.

Make sure you draw some Venn diagrams in the white space!

We used before the word experiment to describe the process of collecting data or observations. An experiment is, broadly speaking, any process by which an observation is made. This can be done actively, if you have control over the apparatus that collects the data, or passively, if you only get to see the data but have no control over the apparatus that collects them. An experiment generates observations or outcomes. Consider the following example: you toss a fair coin twice (your experiment). The possible outcomes (simple events) of the experiment are HH, HT, TH, TT (H: heads, T: tails). The collection of all possible outcomes of an experiment forms the sample space of that experiment. In this case, the sample space is the set Ω = {HH, HT, TH, TT}. Any subset of the sample space of an experiment is called an event. For instance, the event "you observe at least one tails in your experiment" is the set A = {HT, TH, TT} ⊆ Ω.
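The set identities above are easy to sanity-check with Python's built-in set type; Ω and the subsets A, B below are arbitrary illustrative choices.

```python
# Sanity-checking De Morgan's laws and the symmetric-difference identity;
# Omega, A, and B are arbitrary illustrative sets.
Omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def comp(S):
    return Omega - S   # complement relative to Omega

assert comp(A | B) == comp(A) & comp(B)   # (A ∪ B)^c = A^c ∩ B^c
assert comp(A & B) == comp(A) | comp(B)   # (A ∩ B)^c = A^c ∪ B^c

sym_diff = (A - B) | (B - A)              # A △ B
assert sym_diff == (A | B) - (A & B)      # (A △ B) = (A ∪ B) \ (A ∩ B)
print(sym_diff)   # {0, 1, 4, 5}
```

Events in a sample space, such as A = {HT, TH, TT} above, are sets of exactly this kind.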

Note: These are discrete events, i.e. they are collections of sample points from a discrete sample space. A discrete sample space Ω is one that contains either a finite or a countable number of distinct sample points. An event in a discrete sample space is simply a collection of sample points, i.e. any subset of Ω.

A picture about even and odd dice rolls (using notation E_i, i = 1, ..., 6 in sample space Ω, as in WMS 2.4):

In this part of the course, just assume we deal with discrete events from discrete sample spaces. Later, we will deal with continuous sample spaces.

Exercises in class

1. A fair coin is tossed twice. You have decided to play a game whose outcome is: if the first flip is H, you win $1,000; if the first flip is T, you lose $1,000.

(a) Writing a particular outcome as the concatenation of single outcomes such as H or T, what is the sample space Ω?
(b) Write as A, a subset of Ω, the set corresponding to the event that the first flip is H. What are the elements in A? What is the probability of A?
(c) Denote as B, in the same way as above, the event that the second toss is T. What are the elements in B? What is the probability of B?
(d) What is the probability that the first flip is heads and the second is tails?
(e) Draw A, B, and Ω in a Venn diagram.
(f) What is your expected return?

Homework 1 (due Thursday May 19th)

1. (25 pts) Consider the Normal distribution N(µ, σ²) that we mentioned in class and a sample X_1, ..., X_n of size n obtained from this distribution. Which of the following are statistics and why?

(a) X̄
(b) √n (X̄ − µ) / S
(c) Σ_{i=1}^n X_i
(d) max{X_1, ..., X_n, σ²}
(e) 1_{X_3 > 0}(X_3)

Extra credit (5 pts): What parameter do you think X̄ is trying to estimate? Justify your answer.

2. (25 pts) Show that, for any A, B_1, ..., B_n ⊆ Ω with ∪_{i=1}^n B_i = Ω, we have A = (A ∩ B_1) ∪ ... ∪ (A ∩ B_n).

3. (25 pts) Consider the following experiments:

- You have three keys and exactly one of them opens the door that you want to open. At each attempt to open the door, you pick one of the keys completely at random and try to open the door (at each attempt you pick one of the three keys at random, regardless of the fact that you may already have tried some in previous attempts). You count the number of attempts needed to open the door. What is the sample space for this experiment?
- You have an urn with 1 red ball, 1 blue ball and 1 white ball. Two times, you draw a ball from the urn, look at its color, take note of the color on a piece of paper, and put the ball back in the urn. What is the sample space for this experiment? Write out the set corresponding to the event "you observe at least 1 white ball".

4. (25 pts) Think of an example of a random process and describe what the quantity of interest X_t is and what the variable t represents in your example (time, space, ...?).

5. Extra credit (10 pts) Show that the expression of the sample variance

    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²    (2)

is equivalent to

    S² = (1/(n−1)) [ Σ_{i=1}^n X_i² − (1/n) (Σ_{i=1}^n X_i)² ].    (3)

Lecture 2

Recommended readings: WMS

Unconditional Probability

Consider again the example from the previous lecture. We toss a fair coin twice. The sample space for the experiment is Ω = {HH, HT, TH, TT}. If we were to repeat the experiment over and over again, what is the frequency of the occurrence of the sequence HT? What about the other sequences? What about the frequency of {HT} ∪ {TT}?

To make the intuition a little more formal, we can say that a probability measure on Ω is a set function P satisfying the following axioms:

1. for any A ⊆ Ω, P(A) ∈ [0, 1]
2. P(Ω) = 1
3. for any countable collection of disjoint events {A_i}_{i=1}^∞ ⊆ Ω, P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

From the second and third axioms, it follows that P(∅) = 0. (Spend 30 seconds thinking about this.)

An interesting question to ask is what the probability of the union of two (possibly non-disjoint) events is. That is, given A, B ⊆ Ω, what is P(A ∪ B)? Here is the answer. Notice first that A = (A ∩ B) ∪ (A ∩ B^c) and the two sets are disjoint (why?). Using the third axiom of probability (the countable additivity axiom), we have

    P(A) = P(A ∩ B) + P(A ∩ B^c)
    P(B) = P(A ∩ B) + P(A^c ∩ B).    (4)

Now, A ∪ B = (A ∩ B^c) ∪ (A^c ∩ B) ∪ (A ∩ B), and again these sets are all disjoint. Putting it all together we get

    P(A ∪ B) = P(A) − P(A ∩ B) + P(B) − P(A ∩ B) + P(A ∩ B)
             = P(A) + P(B) − P(A ∩ B).    (5)

How does this formula read when A and B are disjoint? Why is P(A ∩ B) = 0 when A and B are disjoint?
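Formula (5) can be checked by enumeration on the two-toss sample space, where every outcome has probability 1/4; the events A and B below are arbitrary choices.

```python
# Checking P(A ∪ B) = P(A) + P(B) - P(A ∩ B) on the two-toss sample space.
Omega = {"HH", "HT", "TH", "TT"}

def P(E):
    return len(E) / len(Omega)   # equally likely outcomes

A = {"HT", "TT"}   # second toss is T
B = {"TH", "TT"}   # first toss is T

lhs = P(A | B)                   # P(A ∪ B) directly
rhs = P(A) + P(B) - P(A & B)     # the addition rule
print(lhs, rhs)   # 0.75 0.75
```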

This can be extended to more than 2 events. For instance, given A, B, C ⊆ Ω, we have

    P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)    (6)

and, in general, for A_1, ..., A_n ⊆ Ω we have the so-called inclusion-exclusion formula (note the alternating signs of the summands):

    P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i) − Σ_{1≤i<j≤n} P(A_i ∩ A_j) + Σ_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k) − ... + (−1)^{n−1} P(A_1 ∩ ... ∩ A_n).    (7)

Notice that in general we have the following union bound for any A_1, ..., A_n ⊆ Ω:

    P(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n P(A_i).    (8)

For a given experiment with m possible outcomes, how can we compute the probability of an event of interest? We can always do the following:

1. define the experiment and describe its simple events E_i, i ∈ {1, ..., m}
2. define reasonable probabilities on the simple events, P(E_i), i ∈ {1, ..., m}
3. define the event of interest A in terms of the simple events E_i
4. compute P(A) = Σ_{i : E_i ⊆ A} P(E_i).

Here is an example (and we will describe it in terms of the above scheme). There are 5 candidates for two identical job positions: 3 females and 2 males. What is the probability that a completely random selection process will appear discriminatory (i.e. exactly 2 males or exactly 2 females are chosen for these job positions)? We can approach this problem in the following way.

1. We introduce 5 labels, one for each of the 5 candidates: M_1, M_2, F_1, F_2, F_3. The sample space is then

    Ω = {M_1M_2, M_1F_1, M_1F_2, M_1F_3, M_2M_1, M_2F_1, M_2F_2, M_2F_3, F_1M_1, F_1M_2, F_1F_2, F_1F_3, F_2M_1, F_2M_2, F_2F_1, F_2F_3, F_3M_1, F_3M_2, F_3F_1, F_3F_2}.

2. Because the selection process is completely random, each of the simple events of the sample space is equally likely. Therefore the probability of any simple event is just 1/|Ω|.

3. The event of interest is A = {M_1M_2, M_2M_1, F_1F_2, F_1F_3, F_2F_1, F_2F_3, F_3F_1, F_3F_2}.

4. P(A) = P(M_1M_2) + P(M_2M_1) + ... + P(F_3F_2) = |A|/|Ω| = 8/20 = 2/5 = 0.4.

Idea: if the simple events are equally likely to occur, then the probability of a composite event A is just P(A) = |A|/|Ω|.

Questions to ask when you define the sample space and the probabilities on the simple events:

- Is the sampling done with or without replacement?
- Does the order of the labels matter?
- Can we efficiently compute the size of the sample space, |Ω|?

This leads to our next topic, which will equip us with tools to conveniently calculate probabilities.

Tools for counting sample points

Basic techniques from combinatorial analysis come in handy for this type of question when the simple events forming the sample space are equally likely to occur. Let's take a closer look at some relevant cases.

Suppose you have a group of m elements a_1, ..., a_m and another group of n elements b_1, ..., b_n. You can form mn pairs containing one element from each group. Of course you can easily extend this reasoning to more than just two groups. This is a useful fact when we are sampling with replacement and the order matters. Consider the following example. You toss a six-sided die twice. You have m = 6 possible outcomes on the first roll and n = 6 possible outcomes on the second roll. The size of the sample space consisting of all possible pairs of the type (i, j) for i, j ∈ {1, ..., 6} is |Ω| = mn = 6² = 36. Is the experiment performed with replacement? Yes: if the first roll is a 2, nothing precludes the second roll being a 2 again. Does the order matter? Yes: the pair (2,5), corresponding to a 2 on the first roll and a 5 on the second roll, is not equal to the pair (5,2), corresponding to a 5 on the first roll and a 2 on the second roll.
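The hiring example can be verified by brute force: enumerate all ordered pairs of distinct candidates and count the discriminatory selections.

```python
# Brute-force check of the hiring example: Omega is all ordered pairs of
# distinct candidates, and A is the event that both hires are the same sex.
from itertools import permutations

candidates = ["M1", "M2", "F1", "F2", "F3"]
Omega = list(permutations(candidates, 2))        # 20 ordered pairs
A = [w for w in Omega if w[0][0] == w[1][0]]     # both M or both F

p = len(A) / len(Omega)
print(len(Omega), len(A), p)   # 20 8 0.4
```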

Sampling without replacement when order matters

An ordered arrangement of r distinct elements is called a permutation. The number of ways of ordering n distinct elements taken r at a time is denoted P^n_r, where

    P^n_r = n(n−1)(n−2)...(n−r+1) = n!/(n−r)!.    (9)

Consider the following example. There are 30 members in a student organization and they need to choose a president, a vice-president, and a treasurer. In how many ways can this be done? The number of distinct elements (persons) is n = 30, and the number of persons to be appointed is r = 3. The sampling is done without replacement (a president is chosen out of 30 people, then among the 29 people left a vice-president is chosen, and finally among the 28 people left a treasurer is chosen). Does the order matter? Yes: the president is elected first, then the vice-president, and finally the treasurer (think about ranking the 30 candidates). The number of ways in which the three positions can be assigned is therefore P^30_3 = 30!/(30−3)! = 30 · 29 · 28 = 24360.

Sampling without replacement when order does not matter

The number of combinations of n elements taken r at a time corresponds to the number of subsets of size r that can be formed from the n objects. This is denoted C^n_r, where

    C^n_r = (n choose r) = P^n_r / r! = n!/((n−r)! r!).    (10)

Here is an example. How many different subsets of three people can become officers of the organization, if chosen randomly? Order doesn't matter here (we are not interested in the exact appointments for a given candidate). The answer is therefore C^30_3 = 30!/(27! 3!) = 24360/6 = 4060.

The binomial coefficient (n choose r) can be extended to the multinomial case in a straightforward way. Suppose that we want to partition n distinct elements into k distinct groups of sizes n_1, ..., n_k, in such a way that each of the n elements appears in exactly one of the groups. This can be done in

    (n choose n_1, ..., n_k) = n!/(n_1! n_2! ... n_k!)    (11)

possible ways.
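Both counts in the officer examples are available directly in Python's standard library (`math.perm` and `math.comb`, Python 3.8+):

```python
# The counts from the two examples above, via the standard library.
import math

P_30_3 = math.perm(30, 3)   # ordered appointments: 30 * 29 * 28
C_30_3 = math.comb(30, 3)   # unordered subsets: perm / 3!

print(P_30_3, C_30_3)   # 24360 4060
```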

Exercises in class:

1. The final exam of this course has 5 multiple choice questions. Questions 1 to 3 and question 5 each have 3 possible choices. Question 4 has 3 possible choices. Suppose that you haven't studied at all for the exam and your answers to the final exam are given completely at random. What is the probability that you will get a full score on the exam?

2. You have a string of 10 letters. How many substrings of length 7 can you form based on the original string?

3. Google is looking to hire 3 software engineers and asked CMU for potential candidates. CMU advised that Google hire 3 students from the summer class. Because the positions must be filled as quickly as possible, Google decided to skip the interview process and hire 3 of you completely at random. What is the probability that *you* will become a software engineer at Google?

Lecture 3

Recommended readings: WMS, sections

Conditional Probability

The probability of an event A may vary as a function of the occurrence of other events. Then it becomes interesting to compute conditional probabilities, e.g. the probability that the event A occurs given the knowledge that another event B occurred. As an example of unconditional probability, think about the event A = "my final grade for the summer class final exam is higher than 90/100" (this event is not conditional on the occurrence of another event) and its probability P(A). As an example of conditional probability, consider now the event B = "my midterm grade for the class was higher than 80/100": how are P(A) and P(A given that B occurred) related? Intuitively, we might expect that P(A) < P(A given that B occurred)! (Beware, however, that as we will see, the definition of conditional probability is not intrinsically related to causality.)

The conditional probability of an event A given another event B is usually denoted P(A | B). Here is an intuitive description of conditional probability via Venn diagrams.

By definition, the conditional probability of the event A given the event B is

    P(A | B) = P(A ∩ B) / P(B).    (12)

Observe that the quantity P(A | B) is well-defined only if P(B) > 0. Consider the following table:

            B      B^c
    A      0.3    0.1
    A^c    0.4    0.2

The unconditional probabilities of the events A and B are respectively P(A) = 0.3 + 0.1 = 0.4 and P(B) = 0.3 + 0.4 = 0.7. The conditional probability of the event A given the event B is P(A | B) = P(A ∩ B)/P(B) = 0.3/0.7 = 3/7, while the probability of the event B given the event A is P(B | A) = P(B ∩ A)/P(A) = 0.3/0.4 = 3/4. The conditional probability of the event A given that B^c occurs is P(A | B^c) = P(A ∩ B^c)/P(B^c) = 0.1/0.3 = 1/3. Does the probability of A vary as a function of the occurrence of the event B?

The conditional probability P(· | B) with respect to an event B with P(B) > 0 is a proper probability measure, therefore the three axioms of probability that we discussed in Lecture 2 hold for P(· | B) as well. When it comes to standard computations, P(· | B) has the same properties as P(·): for instance, for disjoint events A_1, A_2 ⊆ Ω we have P(A_1 ∪ A_2 | B) = P(A_1 | B) + P(A_2 | B).

Exercise in class

By using a certain prescription drug, there is a 10% chance of experiencing headaches, a 15% chance of experiencing nausea, and a 5% chance of experiencing both headaches and nausea. What is the probability that you will experience a headache given that you feel nauseous after taking the prescription drug?

Independence

The event A ⊆ Ω is said to be independent of the event B ⊆ Ω if P(A ∩ B) = P(A)P(B). This means that the occurrence of the event B does not alter the chance that the event A happens. In fact, assuming that P(B) > 0, we easily see that this is equivalent to P(A | B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A). Furthermore, assuming that also P(A) > 0, we have that P(A | B) = P(A) is equivalent to P(B | A) = P(B) (independence is a symmetric relation!).

Exercise in class

Consider the following events and the corresponding table of probabilities:

            B      B^c
    A       ·      ·
    A^c     ·      ·

Are the events A and B independent?

Exercise in class

Consider the following events and the corresponding table of probabilities:

            B      B^c
    A      1/4    1/12
    A^c    1/2    1/6

Are the events A and B independent?

Exercise in class

Let A, B ⊆ Ω and P(B) > 0. What is P(A | Ω)? What is P(A | A)? Let B ⊆ A: what is P(A | B)? Let A ∩ B = ∅: what is P(A | B)?

Notice that from the definition of conditional probability, we have the following:

    P(A ∩ B) = P(B)P(A | B) = P(A)P(B | A).    (13)

This can be generalized to more than two events. For instance, for three events A, B, C ⊆ Ω, we have

    P(A ∩ B ∩ C) = P(A)P(B | A)P(C | A ∩ B)    (14)

and for n events A_1, ..., A_n

    P(A_1 ∩ ... ∩ A_n) = P(A_1)P(A_2 | A_1)P(A_3 | A_1 ∩ A_2) ... P(A_n | A_1 ∩ ... ∩ A_{n−1}).    (15)

Warning: Independent and disjoint are not the same. Two events being disjoint simply means that they do not share any outcomes. For instance, in a single coin flip, A = {H} and B = {T} are disjoint events, but B gives perfect knowledge of A: we know that P(A | B) = 0 and P(A | B^c) = 1, so that P(A | B) ≠ P(A) = 1/2.
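The table with entries 1/4, 1/12, 1/2, 1/6 can be checked for independence in exact rational arithmetic:

```python
# Independence check for the table above: A and B are independent
# iff P(A ∩ B) = P(A)P(B).
from fractions import Fraction as F

P_AB, P_ABc = F(1, 4), F(1, 12)     # P(A ∩ B),   P(A ∩ B^c)
P_AcB, P_AcBc = F(1, 2), F(1, 6)    # P(A^c ∩ B), P(A^c ∩ B^c)

P_A = P_AB + P_ABc   # marginal of A: 1/3
P_B = P_AB + P_AcB   # marginal of B: 3/4
independent = P_AB == P_A * P_B

print(P_A, P_B, independent)   # 1/3 3/4 True
```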

Law of Total Probability and Bayes' Rule

Assume that {B_i}_{i=1}^∞ is a partition of Ω, i.e. for any i ≠ j we have B_i ∩ B_j = ∅, and ∪_{i=1}^∞ B_i = Ω. Assume also that P(B_i) > 0 for all i. Then, for any A ⊆ Ω, we have the so-called law of total probability:

    P(A) = Σ_{i=1}^∞ P(A | B_i)P(B_i).    (16)

As you showed in Homework 1, A = ∪_{i=1}^∞ (A ∩ B_i) when ∪_{i=1}^∞ B_i = Ω. But this time the sets (A ∩ B_i) are also disjoint, since {B_i}_{i=1}^∞ is a partition of Ω. Furthermore, by equation (13) we have P(A ∩ B_i) = P(A | B_i)P(B_i). Thus, P(A) = Σ_{i=1}^∞ P(A ∩ B_i) = Σ_{i=1}^∞ P(A | B_i)P(B_i).

We can now use the law of total probability to derive the so-called Bayes' rule. We have

    P(B_i | A) = P(B_i ∩ A)/P(A) = P(A | B_i)P(B_i) / Σ_{j=1}^∞ P(A | B_j)P(B_j).    (17)

For two events A, B this reduces to

    P(B | A) = P(A | B)P(B)/P(A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B^c)P(B^c)].    (18)

Exercise in class

You are diagnosed with a disease that has two types, A and B. In the population, the probability of having type A is 10% and the probability of having type B is 90%. You undergo a test that is 80% accurate, i.e. if you have the type A disease, the test will diagnose type A with probability 80% and type B with probability 20% (and vice versa). The test indicates that you have type A. What is the probability that you really have the type A disease?

Let A = "you have type A", B = "you have type B", T_A = "the test diagnoses type A", and T_B = "the test diagnoses type B". We know that P(A) = 0.1 and P(B) = 0.9. The test is 80% accurate, meaning that

    P(T_A | A) = P(T_B | B) = 0.8
    P(T_A | B) = P(T_B | A) = 0.2.

We want to compute P(A | T_A). We have

    P(A | T_A) = P(A ∩ T_A)/P(T_A) = P(T_A | A)P(A) / [P(T_A | A)P(A) + P(T_A | B)P(B)] = 0.08/0.26 = 8/26 ≈ 0.31.
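The disease-test exercise translates directly into code, using equation (18):

```python
# The disease-test exercise via Bayes' rule.
P_A, P_B = 0.1, 0.9      # priors: P(type A), P(type B)
P_TA_given_A = 0.8       # test accuracy
P_TA_given_B = 0.2

P_TA = P_TA_given_A * P_A + P_TA_given_B * P_B   # law of total probability
P_A_given_TA = P_TA_given_A * P_A / P_TA         # Bayes' rule

print(P_A_given_TA)   # 8/26, about 0.308
```

Note that even though the test is "80% accurate", the low prior P(A) = 0.1 drags the posterior down to about 0.31.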

Conditional Independence

We are now equipped to talk about another concept parallel to independence. Let C be an event with P(C) > 0. We say that the events A and B are conditionally independent given C if

    P(A ∩ B | C) = P(A | C)P(B | C).

This implies that

    P(A | B ∩ C) = P(A | C).

To see this, notice that conditional independence of A and B given C implies that

    P(A | B ∩ C) = P(A ∩ B ∩ C)/P(B ∩ C) = P(A ∩ B | C)P(C) / (P(B | C)P(C)) = P(A | C)P(B | C) / P(B | C) = P(A | C).

Conditional independence does not imply, nor is it implied by, independence.

Independence does not imply conditional independence. Toss two coins. Let H_1 be the event that the first toss is H and H_2 the event that the second toss is H. Let D be the event that the two tosses are different. H_1 and H_2 are clearly independent, but P(H_1 | D) = P(H_2 | D) = 1/2 while P(H_1 ∩ H_2 | D) = 0.

Conditional independence does not imply independence. We have two coins, a blue one and a red one. For the blue coin the probability of H is 0.99 and for the red coin it is 0.01. A coin is chosen at random and then tossed twice. Let B be the event that the blue coin is selected and R the event that the red coin is selected. Let H_1 and H_2 be the events that the first and second toss is H, respectively. By the law of total probability,

    P(H_1) = P(H_1 | B)P(B) + P(H_1 | R)P(R) = 0.99 · 1/2 + 0.01 · 1/2 = 1/2.

Similarly P(H_2) = 1/2. But, using again the law of total probability,

    P(H_1 ∩ H_2) = P(H_1 ∩ H_2 | B)P(B) + P(H_1 ∩ H_2 | R)P(R) = 0.99² · 1/2 + 0.01² · 1/2 = 0.4901,

which is different from P(H_1)P(H_2) = 1/4.
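The blue/red coin example in a few lines, using the law of total probability and the fact that the two tosses are independent given the chosen coin:

```python
# The blue/red coin example: conditionally independent given the coin,
# but not independent unconditionally.
p_blue, p_red = 0.99, 0.01   # P(H) for each coin; each coin chosen w.p. 1/2

P_H1 = 0.5 * p_blue + 0.5 * p_red            # law of total probability
P_H1H2 = 0.5 * p_blue**2 + 0.5 * p_red**2    # independence given the coin

print(P_H1, P_H1H2)   # ≈ 0.5 and ≈ 0.4901, so P(H1 ∩ H2) ≠ P(H1)P(H2) = 0.25
```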

Homework 2 (due Thursday, May 19)

1. (12 pts) A secretary types 4 letters to 4 people and addresses the 4 envelopes. If he/she inserts the letters at random, what is the probability that exactly 2 letters will go in the correct envelopes?

2. (12 pts, 6 pts each) A labor dispute has arisen concerning the distribution of 20 laborers to four different construction jobs. The first job is considered very undesirable and requires 6 laborers. The second, third and fourth utilize 4, 5, and 5 laborers respectively. An accusation was made because four people of the same race, out of the twenty, were all placed in job 1. In considering whether the assignment represents an injustice, an investigation is launched to find the probability of the observed event.

(a) What is the number of points in the sample space?
(b) Find the probability of the observed event, if it is assumed that the laborers are randomly assigned to all jobs.

3. (12 pts) A fair die is tossed 6 times, and the number on the uppermost face is recorded each time. What is the probability that the numbers recorded in the first four rolls are 1, 2, 3, 4, in any order? It is useful to invoke the counting rules.

4. (12 pts, 6 pts each)

(a) What is the number of ways of selecting two random applicants out of five?
(b) What is the probability that exactly one of the two best applicants appears in a random selection of two out of five?

5. (12 pts) A population of voters contains 40% Republicans and 60% Democrats. It is reported that 30% of the Republicans and 70% of the Democrats favor an election issue. A person chosen at random from this population is found to favor the issue in question. What is the conditional probability that this person is a Democrat?

6. (4 pts) Use the axioms of probability to show that for any A ⊆ Ω we have P(A^c) = 1 − P(A).

7. (8 pts) Let B ⊆ A. Show that P(A) = P(B) + P(A ∩ B^c).

8. (10 pts) Show that for A, B ⊆ Ω, P(A ∩ B) ≥ max{0, P(A) + P(B) − 1}.

9. (10 pts) Let A, B ⊆ Ω be such that P(A) > 0 and P(B) > 0. Show that

- if A and B are disjoint, then they are not independent;
- if A and B are independent, then they are not disjoint.

10. (10 pts) Let A, B ⊆ Ω be such that P(A) > 0 and P(B) > 0. Prove or disprove the following claim: in general, it is true that P(A | B) = P(B | A). If you believe that the claim is true, clearly explain your reasoning. If you believe that the claim is false in general, can you find an additional condition under which the claim is true?

11. (10 pts) Let A, B ⊆ Ω be independent events. Show that the following pairs of events are also independent:

- A, B^c
- A^c, B
- A^c, B^c.

Lecture 4

Recommended readings: WMS, sections 3.1, 3.2

Random Variables and Probability Distributions

Let's start with an example. Suppose that you flip a fair coin three times. The sample space for this experiment is

    Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Since the coin is fair, we have that

    P({HHH}) = P({HHT}) = ... = P({TTT}) = 1/|Ω| = 1/8.

Suppose that we are interested in a particular quantity associated to the experiment, say the number of tails that we observe. This is of course a random quantity. In particular, we can conveniently define a function X : Ω → ℝ that counts the number of tails. Precisely,

    X(HHH) = 0
    X(HHT) = 1
    X(HTH) = 1
    X(HTT) = 2
    ...
    X(TTT) = 3.

Since the outcome of the experiment is random, the value that the function X takes is also random. That's why we call a function like X a random variable.¹ P induces through X a probability distribution on ℝ. In particular, we can easily see that the probability distribution of the random variable X is

    P(X = 0) = P({HHH}) = 1/8
    P(X = 1) = P({HHT, HTH, THH}) = 3/8
    P(X = 2) = P({HTT, THT, TTH}) = 3/8
    P(X = 3) = P({TTT}) = 1/8
    P(X = any other number) = 0.

¹ For historical reasons X is called a random variable, even though X is clearly a function mapping Ω into ℝ.
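The distribution of X (the number of tails in three flips) can be recovered by brute force, enumerating the eight equally likely outcomes:

```python
# Recovering the p.m.f. of X = number of tails by enumerating Omega.
from itertools import product
from collections import Counter

Omega = ["".join(w) for w in product("HT", repeat=3)]   # HHH, HHT, ..., TTT
counts = Counter(w.count("T") for w in Omega)           # outcomes per value of X
pmf = {x: c / len(Omega) for x, c in counts.items()}

print(pmf)   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```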

More examples (from the board):

1. Flip a coin three times and let X(ω) = # H's in ω. Then

    X(ω) = 0 if ω ∈ {TTT}
           1 if ω ∈ {HTT, THT, TTH}
           2 if ω ∈ {HHT, HTH, THH}
           3 if ω ∈ {HHH}.

2. Flip a coin until we see H, and let X(ω) = # T's before the first H. Here Ω = {H, TH, TTH, TTTH, ...} and X takes the values 0, 1, 2, ...

3. We transmit a signal and record the time it takes to complete the transmission. Here Ω = {x ∈ ℝ : x > 0} and X(ω) = ω.

Formal Definition of a random variable

Suppose we have a sample space Ω. A random variable X is a function from Ω into the real line. In other words, X : Ω → ℝ, or ω ∈ Ω ↦ X(ω) ∈ ℝ. Random variables will be denoted with capital letters, such as X. This is a consistent, standard notation for a random variable.

Why do we care about random variables? After all, based on the example, it seems that we still have to clearly define and specify the sample space in order to determine P(X = x) for x ∈ ℝ. Actually, it turns out that when we model a random quantity of interest (such as the number of tails in the coin tossing example), most of the time one assumes (or knows) a distribution to use. (Of course, this assumption must be reasonable and needs to be rigorously checked by the modeller.) By modelling the randomness of a phenomenon as a random variable whose distribution is known, we can bypass the trouble of defining/specifying a sample space. In the example above, if we simply assume that X has a Binomial distribution, then the sample space is automatically the discrete space of length-3 binary outcomes (HHH, HHT, etc.), and the probability of events of interest (e.g. X = 1, which corresponds to one head out of three throws, i.e. the subset {HTT, THT, TTH} of the sample space) is precisely calculable.

Gist: Assuming that X has a probability distribution (e.g. the Binomial distribution in the above example) is equivalent to making an assumption about the probabilities that we are interested in, i.e. P(X = x) for x ∈ ℝ.

Exercise in class

Let X and Y be random variables with the following distributions:

    P(X = x) = 0.4 if x = 0
               0.6 if x = 1
               0   if x ∉ {0, 1}

and

P(Y = y) = 0.7 if y = −1, 0.3 if y = 1, and 0 if y ∉ {−1, 1}.

Suppose that for any x, y ∈ R the events {X = x} and {Y = y} are independent. What is the probability distribution of the random variable Z = X + Y? We have

Z = −1 if {X = 0} ∩ {Y = −1} occurs
Z = 0 if {X = 1} ∩ {Y = −1} occurs
Z = 1 if {X = 0} ∩ {Y = 1} occurs
Z = 2 if {X = 1} ∩ {Y = 1} occurs.

Thus, using independence,

P(Z = −1) = 0.4 · 0.7 = 0.28
P(Z = 0) = 0.6 · 0.7 = 0.42
P(Z = 1) = 0.4 · 0.3 = 0.12
P(Z = 2) = 0.6 · 0.3 = 0.18
P(Z = z) = 0 if z ∉ {−1, 0, 1, 2}.

At this point it is worthwhile to make an important distinction between two types of random variables, namely discrete random variables and continuous random variables. We say that a random variable is discrete if the set of values that it can take is at most countable. On the other hand, a random variable taking values in an uncountably infinite set is called continuous.

Question Consider the following:

You draw a circle on a piece of paper and one diameter of the circle; at the center of the circle, you hold your pencil standing orthogonal to the plane in which the circle lies. At some point you let go of the pencil. X is the random variable corresponding to the angle that the pencil forms with the diameter you drew, after the pencil has fallen onto the piece of paper.

You roll a die repeatedly. Y is the random variable corresponding to the number of rolls needed until you observe a six for the first time.
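The Z = X + Y exercise above can be checked mechanically: under independence, P(Z = z) is a sum of products of the marginal probabilities. A small sketch (assuming, as in the exercise, that Y takes the values −1 and 1):

```python
from fractions import Fraction

# Marginal pmfs from the exercise, in exact arithmetic.
p_X = {0: Fraction(2, 5), 1: Fraction(3, 5)}     # 0.4, 0.6
p_Y = {-1: Fraction(7, 10), 1: Fraction(3, 10)}  # 0.7, 0.3

# Independence gives P(X = x, Y = y) = P(X = x) P(Y = y); the events
# {X + Y = z} for distinct z are disjoint, so their probabilities add.
p_Z = {}
for x, px in p_X.items():
    for y, py in p_Y.items():
        p_Z[x + y] = p_Z.get(x + y, Fraction(0)) + px * py
```

The resulting dictionary is exactly the distribution 0.28, 0.42, 0.12, 0.18 on {−1, 0, 1, 2}.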

24 What are the possible values that X and Y can take? Are X and Y discrete or continuous random variables? Depending on whether a given random variable X is discrete or continuous, we use two special types of functions in order to describe its distribution. if X is discrete, let s define its support as the set supp(x) = {x R : P (X = x) > 0} (if X is discrete, supp(x) is either a finite or countable set). We can describe the probability distribution of X in terms of its probability mass function (p.m.f.), i.e. the function p(x) = P (X = x) (19) mapping R into [0, 1]. The function p satisfies the following properties: 1. p(x) [0, 1] x R 2. x supp(x) p(x) = 1. if X is continuous, we can describe the probability distribution of X by means of the probability density function (p.d.f.) f : R R +. Define in this case supp(x) = {x R : f(x) > 0}. The function f satisfies the following properties: 1. f(x) 0 x R 2. R f(x) dx = supp(x) f(x) dx = 1. We use f to compute the probability of events of the type {X (a, b]} for a < b. In particular, for a < b, we have P (X (a, b]) = b a f(x) dx (20) Notice that this implies that, for any x R, P (X = x) = 0 if X is a continuous random variable! Also, if X is a continous random variable, it is clear from above that P (X (a, b]) = P (X [a, b]) = P (X [a, b)) = P (X (a, b)). (21) In general, for any set A R, we have P (X A) = 24 A f(x) dx.

25 Exercise in class A group of 4 components is known to contain 2 defectives. An inspector randomly tests the components one at a time until the two defectives are found. Let X denote the number of tests on which the second defective is found. What is the p.m.f. of X? What is the support of X? Graph the p.m.f. of X. Let D stand for defective and N stand for non-defective. The sample space for this experiment is Ω = {DD, NDD, DND, DNND, NDND, NNDD}. The simple events in Ω are equally likely because the inspector samples the components completely at random. Thus, the probability of each simple event in Ω is just 1/ Ω = 1/6. We have P (X = 2) = P ({DD}) = 1/6 P (X = 3) = P ({NDD, DND}) = 2/6 = 1/3 P (X = 4) = P ({DNND, NDND, NNDD}) = 3/6 = 1/2 P (X = any other number) = 0. Thus, the p.m.f. of X is 1/6 if x = 2 1/3 if x = 3 p(x) = 1/2 if x = 4 0 if x / {2, 3, 4} and supp(x) = {2, 3, 4}. Graph goes here: Exercise in class 25

Consider the following p.d.f. for the random variable X:

f(x) = e^(−x) 1_[0,∞)(x) = e^(−x) if x ≥ 0, and 0 if x < 0.

What is the support of X? Compute P(2 < X < 3). Graph the p.d.f. of X.

The support of X is clearly the set [0, ∞). We have

P(2 < X < 3) = P(X ∈ (2, 3)) = ∫_2^3 f(x) dx = [−e^(−x)]_2^3 = e^(−2) − e^(−3).

Graph goes here:

Another way to describe the distribution of a random variable is by means of its cumulative distribution function (c.d.f.). Again we will separate the discrete case and the continuous case.

If X is a discrete random variable, its c.d.f. is defined as the function

F(x) = Σ_{y ∈ supp(X) : y ≤ x} p(y).

Notice that for a discrete random variable, F is a step function and hence not a continuous function.

If X is a continuous random variable, its c.d.f. is defined as the function

F(x) = ∫_{−∞}^x f(y) dy.

Notice that for a continuous random variable, F is a continuous function.
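The computation above, and the link between p.d.f. and c.d.f., can be verified numerically. A sketch for the example density f(x) = e^(−x) on [0, ∞):

```python
import math

def f(x):
    # p.d.f. of the example: e^(-x) on [0, infinity), 0 elsewhere
    return math.exp(-x) if x >= 0 else 0.0

def F(x):
    # c.d.f. obtained by integrating f: 1 - e^(-x) for x >= 0
    return 1 - math.exp(-x) if x >= 0 else 0.0

# P(2 < X < 3) via the c.d.f. ...
exact = F(3) - F(2)                      # = e^(-2) - e^(-3)

# ... and via a crude midpoint Riemann sum of the p.d.f. over (2, 3).
n = 100_000
width = 1.0 / n
riemann = sum(f(2 + (i + 0.5) * width) * width for i in range(n))
```

Both routes give e^(−2) − e^(−3) ≈ 0.0855, illustrating equation (20).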

27 In both cases, the c.d.f. of X satisfies the following properties: 1. lim x F (x) = 0 2. lim x + F (x) = 1 3. x y = F (x) F (y) 4. F is a right-continuous function, i.e. for any x R we have lim y x + F (y) = F (x). Exercise in class Compute the c.d.f. of the random variable X in the two examples above and draw its graph. For the example in the discrete case, we have 0 if x < 2 1/6 if 2 x < 3 F (x) = 1/6 + 1/3 = 1/2 if 3 x < 4 1/2 + 1/2 = 1 if x 4. For the example in the continuous case, we have { 0 if x < 0 F (x) = 0 + x = 0 e y dy if x 0 Graph goes here: { 0 if x < 0 e y x 0 if x 0 = { 0 if x < 0 1 e x if x 0. What is the relationship between the c.d.f. of a random variable and its p.m.f./p.d.f.? 27

28 For a discrete random variable X, let x (i) denote the i-th largest element in supp(x). Then, { F (x p(x (i) ) = (i) ) F (x (i 1) ) if i 2 F (x (i) ) if i = 1 For a continuous random variable X, f(x) = d dy F (y) for any x at which F is differentiable. y=x Furthermore, notice that for a continuous random variable X with c.d.f. F and for a < b one has P (a < X b) = P (a X b) = P (a X < b) = P (a < X < b) = b a f(x) dx = b f(x) dx a f(x) dx = F (b) F (a). This is easy. However, for a discrete random variable one has to be careful with the bounds. Using the same notation as before, for i < j one has P (x (i) < X x (j) ) = F (x (j) ) F (x (i) ) P (x (i) X x (j) ) = F (x (j) ) F (x (i) ) + p(x (i) ) P (x (i) X < x (j) ) = F (x (j 1) ) F (x (i) ) + p(x (i) ) P (x (i) < X < x (j) ) = F (x (j 1) ) F (x (i) ). Suppose that the c.d.f. F of a continuous random variable X is strictly increasing. Then F is invertible, meaning that there exists a function F 1 : (0, 1) R {, + } such that for any x R we have F 1 (F (x)) = x. Then, for any α (0, 1) we can define the α-quantile of X as the number x α = F 1 (α) (22) with the property that P (X x α ) = α. This can be extended to c.d.f. that are not strictly increasing and to p.m.f. s, but for this class we will not use that extension. Exercise in class Consider again a random variable X with p.d.f. { f(x) = e x e x if x 0 1 [0, ) (x) = 0 if x < 0. 28

Compute the α-quantile of X. We know from above that

F(x) = 0 if x < 0, and 1 − e^(−x) if x ≥ 0.

For α ∈ (0, 1), set α = F(x_α) = 1 − e^(−x_α). We then have that the α-quantile is

x_α = −log(1 − α).
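The quantile formula just derived can be checked against the c.d.f.; a short sketch for the same Exponential example with f(x) = e^(−x):

```python
import math

def F(x):
    # c.d.f. of the example: F(x) = 1 - e^(-x) for x >= 0
    return 1 - math.exp(-x) if x >= 0 else 0.0

def quantile(alpha):
    # x_alpha = F^{-1}(alpha) = -log(1 - alpha), for alpha in (0, 1)
    return -math.log(1 - alpha)

# Defining property of the quantile: P(X <= x_alpha) = F(x_alpha) = alpha.
checks = [abs(F(quantile(a)) - a) for a in (0.05, 0.5, 0.95)]
```

For instance, the median (α = 0.5) comes out as −log(0.5) = log 2, as expected for this distribution.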

Lecture 5 Recommended readings: WMS, sections 3.3, 4.3

Expectation and Variance

The expectation (or expected value, or mean) is an important operator associated to a probability distribution. Given a random variable X with p.m.f. p (if X is discrete) or p.d.f. f (if X is continuous), its expectation E(X) is defined as

E(X) = Σ_{x ∈ supp(X)} x p(x), if X is discrete
E(X) = ∫_R x f(x) dx = ∫_{supp(X)} x f(x) dx, if X is continuous.

Roughly speaking, E(X) is the center of the distribution of X. (Here supp(X) just means the set of values where the p.m.f. (or p.d.f.) of X is nonzero. For the discrete random variable X below, it is {0, 1, 2, 3}. As a preview, for a normal random variable it is (−∞, ∞).)

Exercise in class Consider the random variable X with p.m.f.

p(x) = 0.2 if x = 0, 0.3 if x = 1, 0.1 if x = 2, 0.4 if x = 3, and 0 if x ∉ {0, 1, 2, 3}.

What is E(X)?

By definition we have

E(X) = Σ_{x ∈ supp(X)} x p(x) = 0 · 0.2 + 1 · 0.3 + 2 · 0.1 + 3 · 0.4 = 1.7.

Exercise in class Consider the random variable X with p.d.f.

f(x) = 3x² 1_[0,1](x).

What is E(X)?

Again, by definition,

E(X) = ∫_R x f(x) dx = ∫_{supp(X)} x f(x) dx = ∫_0^1 x · 3x² dx = 3 ∫_0^1 x³ dx = (3/4) x⁴ |_0^1 = 3/4.

Consider a function g : R → R of the random variable X and the new random variable g(X). The expectation of g(X) is simply

E(g(X)) = Σ_{x ∈ supp(X)} g(x) p(x), if X is discrete
E(g(X)) = ∫_R g(x) f(x) dx = ∫_{supp(X)} g(x) f(x) dx, if X is continuous.

Exercise in class Consider once again the two random variables above and the function g(x) = x². What is E(X²)?

In the discrete example, we have

E(X²) = Σ_{x ∈ supp(X)} x² p(x) = 0² · 0.2 + 1² · 0.3 + 2² · 0.1 + 3² · 0.4 = 4.3.

In the continuous example, we have

E(X²) = ∫_R x² f(x) dx = ∫_{supp(X)} x² f(x) dx = ∫_0^1 x² · 3x² dx = 3 ∫_0^1 x⁴ dx = (3/5) x⁵ |_0^1 = 3/5.

A technical note: for a random variable X, the expected value E(X) is a well-defined quantity whenever E(|X|) < ∞.

The expected value operator E is a linear operator: given two random variables X, Y and scalars a, b ∈ R we have

E(aX) = a E(X)
E(X + Y) = E(X) + E(Y). (23)

For any scalar a ∈ R, we have E(a) = a. To see this, consider the random variable Y with p.m.f.

p(y) = 1_{a}(y) = 1 if y = a, and 0 if y ≠ a.

From the definition of E(Y) it is clear that E(Y) = a.

Exercise in class Consider again the continuous random variable X of the previous example. What is E(X + X²)? We have

E(X + X²) = E(X) + E(X²) = 3/4 + 3/5 = 27/20.

Beware that in general, for two random variables X, Y, it is not true that E(XY) = E(X)E(Y). However, we shall see later in the class that this is true when X and Y are independent.

The letter µ is often used to denote the expected value E(X) of a random variable X, i.e. µ = E(X).

Another very important operator associated to a probability distribution is the variance. The variance measures how spread out the distribution of a random variable is. The variance of a random variable X is defined as

V(X) = E[(X − µ)²] = E(X²) − µ². (24)

Equivalently, one can write

if X is discrete: V(X) = Σ_{x ∈ supp(X)} (x − µ)² p(x) = Σ_{x ∈ supp(X)} x² p(x) − µ²
if X is continuous: V(X) = ∫_{supp(X)} (x − µ)² f(x) dx = ∫_{supp(X)} x² f(x) dx − µ².

Exercise in class Consider the random variables of the previous examples. What is V(X)?

In the discrete example,

V(X) = E(X²) − [E(X)]² = 4.3 − (1.7)² = 4.3 − 2.89 = 1.41.

In the continuous example,

V(X) = E(X²) − [E(X)]² = 3/5 − (3/4)² = 3/5 − 9/16 = 48/80 − 45/80 = 3/80.

σ² is often used to denote the variance V(X) of a random variable X, i.e. σ² = V(X). The variance of a random variable X is finite as soon as E(X²) < ∞.

Here are two important properties of the variance operator. For any a ∈ R we have

V(aX) = a² V(X)
V(a + X) = V(X).

Exercise in class Consider the continuous random variable of the previous examples. What is V(√(80/3) X)?

We have

V(√(80/3) X) = (80/3) V(X) = (80/3) · (3/80) = 1.

Beware that in general, for two random variables X, Y, it is not true that V(X + Y) = V(X) + V(Y). However, we shall see later in the course that this is true when X and Y are independent.

Markov's Inequality and Tchebysheff's Inequality

Sometimes we may want to compute or approximate the probability of certain events involving a random variable X even if we don't know its distribution, provided that we know its expectation or its variance. Let a > 0. The following inequalities are useful in these cases:

P(|X| ≥ a) ≤ E(|X|)/a (Markov's inequality)
P(|X − µ| ≥ a) ≤ V(X)/a² (Tchebysheff's inequality) (25)

There are some conditions! Markov's inequality requires E(|X|) < ∞; it is most often stated for a non-negative random variable X, in which case it reads P(X ≥ a) ≤ E(X)/a. Tchebysheff's inequality can be used for any random variable with finite (literally, not infinite) variance.

Let σ² = V(X) as usual, so that σ is the standard deviation of X. Note that we can conveniently take a = kσ, and the second inequality then reads

P(|X − µ| ≥ kσ) ≤ σ²/(k²σ²) = 1/k² (26)

where µ = E(X). Thus, we can easily bound the probability that X deviates from its expectation by more than k times its standard deviation.

(Added after lecture.) Exercise in class A call center receives an average of 10,000 phone calls a day, with a standard deviation of 2,000 calls. What is the probability that there will be more than 15,000 calls? Call X the number of phone calls (a random variable) with E(X) = 10,000 and σ = 2,000.

1. Using Markov's inequality (X is non-negative), we get

P(X ≥ 15,000) ≤ E(X)/15,000 = 10,000/15,000 = 2/3.

This is quick and easy, but we can do better.

2. Using Tchebysheff's inequality, we get

P(X ≥ 15,000) = P(X − 10,000 ≥ 5,000) ≤ P(|X − 10,000| ≥ 5,000) ≤ (2,000)²/(5,000)² = 4/25 = 0.16.

This is much better than the previous result.
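The two bounds can be reproduced in a few lines (values taken from the call-center example; the code is only a check of the arithmetic):

```python
from fractions import Fraction

mu = 10_000      # E(X): average number of calls per day
sigma = 2_000    # standard deviation of the number of calls
a = 15_000       # threshold of interest

# Markov (X is non-negative): P(X >= a) <= E(X) / a
markov_bound = Fraction(mu, a)

# Tchebysheff: P(X >= a) <= P(|X - mu| >= a - mu) <= sigma^2 / (a - mu)^2
tcheby_bound = Fraction(sigma, a - mu) ** 2
```

The Tchebysheff bound (0.16) is far tighter than the Markov bound (2/3) here, because it uses the extra information in σ.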

35 Lecture 6 Recommended readings: WMS, sections Independent Random Variables Consider a collection of n random variables X 1,..., X n and their joint probability distribution. Their joint probability distribution can be described in terms of the joint c.d.f. F X1,...,X n (x 1, x 2,..., x n ) = P (X 1 x 1 X 2 x 2 X n x n ), (27) in terms of the joint p.m.f. (if the random variables are all discrete) p X1,...,X n (x 1, x 2,..., x n ) = P (X 1 = x 1 X 2 = x 2 X n = x n ), (28) or in terms of the joint p.d.f. f X1,...,X n (if the random variables are all continuous). The random variables X 1,..., X n are said to be independent if either of the following holds: F X1,...,X n (x 1, x 2,..., x n ) = n i=1 F X i (x i ) p X1,...,X n (x 1, x 2,..., x n ) = n i=1 p X i (x i ) f X1,...,X n (x 1, x 2,..., x n ) = n i=1 f X i (x i ). Then, if we consider an arbitrary collection of events {X 1 A 1 }, {X 2 A 2 },..., {X n A n }, we have that P (X 1 A 1 X 2 A 2 X n A n ) = n P (X i A i ). If the random variables also share the same marginal distribution, i.e. we have F Xi = F i {1,..., n} i=1 p Xi = p f Xi = f i {1,..., n} (if the random variables are all discrete) i {1,..., n} (if the random variables are all continuous) then the random variables X 1,..., X n are said to be independent and identically distributed, usually shortened in i.i.d.. 35
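To make the factorization concrete, here is a small sketch (with made-up marginal p.m.f.s, not from the notes) showing that for independent discrete variables the joint p.m.f. is the product of the marginals, and that the marginals can be recovered back from the joint:

```python
from fractions import Fraction

# Hypothetical marginal pmfs for two independent discrete variables.
p_X = {0: Fraction(1, 4), 1: Fraction(3, 4)}
p_Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Under independence the joint pmf factorizes: p(x, y) = p_X(x) * p_Y(y).
joint = {(x, y): px * py for x, px in p_X.items() for y, py in p_Y.items()}

# Summing the joint over y recovers the marginal of X.
marg_X = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in p_X}
```

The same product structure is what makes probabilities of joint events, like P(X₁ ∈ A₁, ..., Xₙ ∈ Aₙ), split into a product of n marginal probabilities.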

36 Probability calculations using sums and products (Added after first release!) We can make good use of some basic probability concepts to invoke some tools for probability calculations! We know that if random events A and B are independent, then P (A B) = P (A)P (B). Also, if A and B are mutually exclusive, then P (A B) = P (A) + P (B). Take an example of 5 coin flips of an uneven coin that lands on heads with probability p. Then, the probability of 3 heads happening, P (X = 3) for the random variable X which counts the number of heads in this scenario, can be calculated using these two tricks (instead of counting, which we have been doing so far!). Let s do this in two steps: 1. What is the probability of a particular outcome HHHT T? This is simply obtained by multiplying p three times and (1 p) two times. Why is this? because within a particular draw, those five coin flips are independent, so each single outcome s probability should be multiplied. (They are not disjoint, as one event does not preclude the possibility of another event!) 2. What about HHT HT? This outcome is disjoint from the first outcome HHHT T, or for that matter, any other outcome who is a combination of 3 H s and 2 T s. They can never happen together, so they are disjoint events! (They are not independent, because they will never happen together in fact, if you know that HHT HT happened, then you definitely know that HHHT T didn t happen!) Also, we know how to count how many such events can happen it is ( 5 3)! So, we can calculate the aggregate probability of 3 heads and 2 tails happening by adding the single event s probability p 3 (1 p) 2 up exactly ( 5 3) times! i.e. ( ) 5 P (3successesin5trials) = p 3 (1 p) 2 (29) 3 In general, we are interested in the probability distribution of the r number of heads out of n trials. We now proceed to learn some common useful discrete probability distributions. 
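The two-step argument above (multiply probabilities within one outcome, add across disjoint outcomes) can be checked directly; a sketch with an arbitrary head probability p:

```python
from itertools import product
from math import comb, isclose

p = 0.3  # head probability of the uneven coin (arbitrary choice for the check)

# Steps 1 and 2: every sequence with exactly 3 H's has probability
# p^3 (1-p)^2, and the sequences are disjoint, so we add them up.
brute = sum(
    p ** 3 * (1 - p) ** 2
    for seq in product("HT", repeat=5)
    if seq.count("H") == 3
)

# Closed form from equation (29): C(5, 3) * p^3 * (1-p)^2.
closed = comb(5, 3) * p ** 3 * (1 - p) ** 2
```

There are C(5, 3) = 10 qualifying sequences, so the brute-force sum matches the closed form for any p.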
Frequently Used Discrete Distributions Notation The symbol is an identifier for a random variable, and specifies its pmf. In statistical jargon, we will say that a random variable has a certain distri- 36

37 bution to signify that it has a a certain pmf/pdf. Preview Some named discrete distributions that are frequently used are the following: 1. Binomial distribution (outcome of n coin flips) 2. Bernoulli distribution (special case of Binomial, n = 1) 3. Multinomial distribution (outcome of n (unfair) die rolls) 4. Geometric distribution (# times to 1st success) 5. Negative binomial distribution (# times to r th success) 6. Hypergeometric distribution (# times x successes of one class will happen in n samples taken from population N that has two classes of size r and N r) 7. Poisson distribution (number of successes expected in i.i.d., memoryless process; linked to exponential distribution) The Binomial Distribution Consider again tossing a coin (not necessarily fair) n times in such a way that each coin flip is independent of the other coin flips. By this, we mean that if H i denotes the event observing heads on the i-th toss, then P (H i H j ) = P (H i )P (H j ) for all i j. Suppose that the probability of seeing heads on each flip is p [0, 1] (and let s call the event seeing heads a success). Introduce the random variables X i = { 1 if the i th flip is a success 0 otherwise fo i {0, 1, 2,..., n}. The number of heads that we observe (or the number of successes in the experiment) is X = n X i. i=1 Under the conditions described above, the random variable X is distributed according to the Binomial distribution with parameters n Z + (number of 37

trials) and p ∈ [0, 1] (probability of success in each trial). We denote this by X ~ Binomial(n, p). The p.m.f. of X is

p(x) = C(n, x) p^x (1 − p)^(n−x) if x ∈ {0, 1, 2, ..., n}, and 0 if x ∉ {0, 1, 2, ..., n}. (30)

Its expectation is E(X) = np and its variance is V(X) = np(1 − p).

In particular, when n = 1, we usually say that X is distributed according to the Bernoulli distribution with parameter p, denoted X ~ Bernoulli(p). In this case E(X) = p and V(X) = p(1 − p). Every X_i above is a Bernoulli distributed random variable.

(Added after lecture.) The sum of independent binomial random variables with the same success probability is also binomial; i.e. the family is closed under summation. If X₁, X₂ are independent binomial random variables with X_i ~ Binomial(n_i, p), then the sum Y = X₁ + X₂ also follows a binomial distribution, with parameters (n₁ + n₂, p).

Exercise in class: Prove this, using p.m.f.s. Define Z = X₁ + X₂, then calculate P(Z = z) = P(X₁ + X₂ = z) by

P(X₁ + X₂ = z) = Σ_{x=0}^{z} p_{X₁}(x) p_{X₂}(z − x).

Exercise in class: Let X denote the number of 6's observed after rolling a fair die 4 times. Then X ~ Binomial(n, p) with n = 4 and p = 1/6. What is the probability that we observe the number 6 at least 3 times in the 4 rolls? What is the expected number of times that the number 6 is observed in 4 rolls? We have

P(X ≥ 3) = P(X = 3) + P(X = 4) = p(3) + p(4)
= C(4, 3) (1/6)³ (5/6) + C(4, 4) (1/6)⁴
= 20/1296 + 1/1296 = 21/1296 ≈ 0.0162

and E(X) = np = 4/6 = 2/3.

The Geometric Distribution

Consider now counting the number of coin flips needed before the first success is observed in the setting described above, and let X denote the corresponding random variable. Then X has the Geometric distribution with

39 parameter p [0, 1], denoted X Geometric(p). Its p.m.f. is given by { (1 p) x 1 p if x {1, 2, 3,... } p(x) = (31) 0 if x / {1, 2, 3,... } The expected value of X is E(X) = 1/p, while its variance is V (X) = (1 p)/p 2. The Geometric distribution is one of the distribution that have the socalled memoryless property. This means that if X Geometric(p), then P (X > x + y X > x) = P (X > y) for any 0 < x y. To see this, let s first compute the c.d.f. of X. We have { { 0 if x < 1 0 if x < 1 F (x) = P (X x) = p x y=1 (1 = p)y 1 if x 1 p x 1 y=0 (1 p)y if x 1 { { 0 if x < 1 0 if x < 1 = = p 1 (1 p) x 1 (1 p) if x 1 1 (1 p) x if x 1. Let s look for example to the case 1 < x < y. We have P (X > x + y X > x) = P (X > x + y) P (X > x) = = (1 p) y = P (X > y). 1 F (x + y) 1 F (x) = (1 p) x+y (1 p) x Exercise in class Consider again rolling a fair die. What is the probability that the first 6 is rolled on the 4-th roll? Let X denote the random variable counting the number of rolls needed to observe the first 6. We have X Geometric(p) with p = 1/6. Then, P (X = 4) = p(4) = p(1 p) 4 1 = = The Negative Binomial Distribution Suppose that we are interested in counting the number of coin flips needed in order to observe the r-th success. If X denotes the corresponding random variable, then X has the Negative Binomial distribution with parameters 39

40 r Z + (number of successes) and p [0, 1] (probability of success on each trial). We use the notation X NBinomial(r, p). Its p.m.f. is then p(x) = {( x 1 r 1) p r (1 p) x r if x {r, r + 1,... } 0 if x / {r, r + 1,... }. (32) The expected value of X is E(X) = r/p and its variance is V (X) = r(1 p)/p 2. Exercise in class: Consider again rolling a fair die. What is the probability that the 3rd 6 is observed on the 5-th roll? We have X NBinomial(r, p) with r = 3 and p = 1/6 and P (X = 5) = ( ) ( 1 6 ) 3 ( ) = Question: How do the Geometric distribution and the Negative Binomial ditribution relate? The Hypergeometric Distribution Suppose that you have a population of N elements that can be divided into 2 subpopulations of size r and N r (say the population of successes and failures, respectively). Imagine that you sample n N elements from this population without replacement. What is the probability that your sample contains x successes? To answer this question we can introduce the random variable X with the Hypergeometric distribution with parameters r Z + (number of successes in the population), N Z + population size, and n Z + (number of trials). We write X HGeometric(r, N, n). Its p.m.f. is ( x)( r N r n x) if x {0, 1,..., r} p(x) = ( N n) (33) 0 if x / {0, 1,..., r}. The expectation and the variance of X are E(X) = nr/n and V (X) = nr N r N n N N N 1. Exercise in class: There are 10 candidates for 4 software engineering positions at Google. 7 of 40

41 them are female candidates and the remaining 3 are male candidates. If the selection process at Google is completely random, what is the probability that only 1 male candidate will be eventually hired? The number of male candidates hired by Google can be described by means of the random variable X Hypergeometric(r, N, n) with r = 3, N = 10, and n = 4. Then, P (X = 1) = ( 3 1 ( 10 4 )( 7 3) The Poisson Distribution ) = 3 7! 4!3! 10! 6!4! = 7!6! 2 10! = = 1 2. The Poisson distribution can be thought of as a limiting case of the Binomial distribution. Consider the following example: you own a store and you want to model the number of people who enter in your store on a given day. In any time interval of that day, the number of people walking in the store is certainly discrete, but in principle that number can be any non-negative integer. We could try and divide the day into n smaller subperiods in such a way that, as n, only one person can walk into the store in any given subperiod. If we let n, it is clear however that the probability p that a person will walk in the store in an infinitesimally small subperiod of time is such that p 0. The Poisson distribution arises as a limiting case of the Binomial distribution when n, p 0, and np λ (0, ). The p.m.f. of a random variable X that has the Poisson distribution with parameter λ > 0, denoted X Poisson(λ) is p(x) = {e λ λx x! if x {0, 1, 2, 3,... } 0 if x / {0, 1, 2, 3,... } (34) Both the expected value and the variance of X are equal to λ, E(X) = V (X) = λ. (Added after lecture, to help with homework 4.) Poisson random variables have the convenient property of being closed to summation. if X Poisson(λ 1 ) and Y Poisson(λ 2 ) and they are independent, then X + Y Poisson(λ 1 + λ 2 ). Exercise in class: Prove this, using pmfs. Define Z, then calculate P (Z = z) = P (X + Y = z) by: z P (X + Y = z) = f X (x)f Y (z x) x=0 41
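Before proving the closure property with p.m.f.s, it can be checked numerically: the convolution of two Poisson p.m.f.s should agree with the Poisson(λ₁ + λ₂) p.m.f. A sketch with arbitrary rates:

```python
import math

def poisson_pmf(lam, x):
    # p(x) = e^(-lambda) * lambda^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam1, lam2 = 1.5, 2.5  # arbitrary rates for the check

# Convolution formula from the exercise:
# P(X + Y = z) = sum_{x=0}^{z} p_X(x) p_Y(z - x).
def conv(z):
    return sum(poisson_pmf(lam1, x) * poisson_pmf(lam2, z - x)
               for x in range(z + 1))

# Compare with the Poisson(lam1 + lam2) pmf at several points.
errors = [abs(conv(z) - poisson_pmf(lam1 + lam2, z)) for z in range(10)]
```

The agreement is exact up to floating-point error, which is what the binomial-theorem step in the p.m.f. proof delivers.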

Exercise in class: At the CMU USPS office, the expected number of students waiting in line between 1PM and 2PM is 3. What is the probability that you will see more than 2 students already in line in front of you, if you go to the USPS office in that period of time?

Let X ~ Poisson(λ) with λ = 3 be the number of students in line when you enter the office. We want to compute P(X > 2). We have

P(X > 2) = 1 − P(X ≤ 2) = 1 − Σ_{x=0}^{2} p(x) = 1 − e^(−3) Σ_{x=0}^{2} 3^x / x!
= 1 − e^(−3) (1 + 3 + 9/2) = 1 − 8.5 e^(−3) ≈ 0.5768.

Multinomial Distribution

(Added after lecture.) The multinomial distribution is a generalization of the binomial distribution. Suppose n i.i.d. experiments are performed. Each experiment can lead to one of r possible outcomes with probabilities p₁, p₂, ..., p_r such that

p_i > 0 and Σ_{i=1}^{r} p_i = 1.

Now, let X_i be the number of experiments resulting in outcome i, for i ∈ {1, ..., r}. Then the multinomial distribution characterizes the joint probability of (X₁, X₂, ..., X_r); in other words, it fully describes

P(X₁ = x₁, X₂ = x₂, ..., X_r = x_r). (35)

The probability mass function is

p_{X₁,...,X_r}(x₁, ..., x_r) = n!/(x₁! x₂! ··· x_r!) · p₁^{x₁} ··· p_r^{x_r} if Σ_{i=1}^{r} x_i = n, and 0 otherwise. (36)

What is the mean? The mean is defined for each of the X_i, and is E(X_i) = n p_i. The variance is V(X_i) = n p_i (1 − p_i). We will later prove (after we have learned about the concept of covariance) that the variance of each X_i is V(X_i) = Cov(X_i, X_i) = n p_i (1 − p_i), and that the covariance of X_i and X_j for i ≠ j is Cov(X_i, X_j) = −n p_i p_j.

How should we understand the expectation and variance? We can see that the marginal probability distribution of each X_i is simply Binomial(n, p_i):

43 Age Proportion if we only focus on one variable at a time, each is simply the number of successes (each success having probability p i ) out of n trials! One might argue that it somehow doesn t seem like the other outcomes are irrelevant, so that each X i can be treated as a separate binomial random variable. He would be partially correct! (about his former point) The outcomes of X i and X j are pitted against each other; if X i is high, then X j (j i) should be low, since there are only n draws in total! Indeed, we will find that these are negatively correlated (and have negative covariance). Exercise in class A fair die is rolled 9 times. What is the probability of 1 appearing 3 times, 2 appearing 2 times, 3 appearing 2 times, 4 appearing 1 times, 5 appearing 1 time, and 6 appearing 0 times? Exercise in class According to recent census figures, the proportion of adults (18 years or older of age) in the U.S. associated with 5 age categories are as given in the following table If the figures are accurate and five adults are randomly sampled, find the probability that the sample contains one person between the ages of 18 and 24, two between the ages of 25 and 34, and two between 45 and 64. (Hint: see WMS) 43
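For the first exercise (a fair die rolled 9 times), the multinomial p.m.f. of equation (36) can be evaluated directly; a small sketch in exact arithmetic:

```python
from math import factorial
from fractions import Fraction

def multinomial_pmf(counts, probs):
    # n! / (x_1! * ... * x_r!) * p_1^x_1 * ... * p_r^x_r, cf. equation (36)
    n = sum(counts)
    denom = 1
    for x in counts:
        denom *= factorial(x)
    coef = factorial(n) // denom          # multinomial coefficient (exact)
    prob = Fraction(1)
    for x, p in zip(counts, probs):
        prob *= Fraction(p) ** x
    return coef * prob

counts = [3, 2, 2, 1, 1, 0]               # counts for faces 1..6 in the exercise
probs = [Fraction(1, 6)] * 6              # fair die
answer = multinomial_pmf(counts, probs)
```

The multinomial coefficient is 9!/(3! 2! 2! 1! 1! 0!) = 15120, so the answer is 15120/6⁹ ≈ 0.0015.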

44 Homework 3 (due Tuesday, May 24th) 1. (4 pts) Consider three urns. Urn A contains 2 white and 3 red balls, urn B contains 8 white and 4 red balls and urn C contains 1 white and and 3 red balls. If one ball is selected from each urn, what is the probability that the ball chosen from urn A was white, given that exactly 2 white balls were selected? 2. (4 pts) Consider the random variable X with p.m.f. 0.1 if x = 0 p(x) = 0.9 if x = 1 0 if x / {0, 1}. Draw the p.m.f. of X. Compute the c.d.f. of X and draw it. 3. (10 pts) Let X be a continuous random variable with p.d.f. f(x) = 3x 2 1 [0,1] (x). Compute the c.d.f. of X and draw it. What is P (X (0.2, 0.4))? What is P (X [ 3, 0.2))? 4. (10 pts) Consider again the exercise of page 25 on the defective components. Compute the c.d.f. of X and use it (possibly together with the p.m.f. of X) to compute the following probabilities: P (2 X 5) P (X > 3) P (0 < X < 4). 5. (10 pts) Consider the following c.d.f. for the random variable X: 0 if x < 0 8 x if 0 x < 1 F (x) = P (X x) = x if 1 x < x if 3 2 x < 2 1 if x 2. What is the p.d.f. of X? 1 44

45 6. (10 pts) Let α (0, 1). Derive the α-quantile associated to the following c.d.f. 1 F (x) = 1 + e x. 7. (10 pts) Consider the continuous random variable X with p.d.f. f and its tranformation 1 A (X) where A R. Show that E(1 A (X)) = P (X A). 8. (6 pts) Consider a random variable X and a scalar a R. Using the definition of variance, show that V (X + a) = V (X). 9. ( 6 pts ) A gambling book recommends the following winning strategy for the game of roulette. It recommends that a gambler bets $1 on red. If red appears (which has probability 18/38 ), then the gambler should take her $1 profit and quit. If the gambler looses this bet (which has probability 20/38 of occurring), she should make additional $1 bets on red on each of the next two spins of the roulette wheel and then quit. Let X denote the gamblers winning when she quits. (a) Find P (X 0) (b) Are you convinced that this is a winning strategy? Explain your answer. (c) Find E[X]. 10. (10 pts) Let X be a random variable with p.d.f. f(x) = 1 [0,1] (x). Draw the p.d.f. of X. Compute the c.d.f. of X and draw it. Compute E(3X). Compute V (2X + 100). 11. (10 pts) A secretary is given n > 0 passwords to access a computer file. In order to open the file, she starts entering the password in a random order and after each attempt she discards the password that she used if that password was not the correct one. Let X be the random variable corresponding to the number of password that the secretary needs to enter before she can open the file. Argue that P (X = 1) = P (X = 2) = = P (X = n) = 1/n. Find the expectation and the variance of X. 45

46 Potentially useful facts: n x=1 x = n(n + 1)/2 n x=1 x2 = n(n + 1)(2n + 1)/ (10 pts) Consider the positive random variable X (i.e. P (X > 0) = 1) with expectation E(X) = 1 and variance V (X) = 0.002, and is symmetric. Use Markov s inequality for a lower bound, and Tchebysheff s inequality to provide an upper bounds for the probability of the event {X 0.1}. (Hint: think about what allows you to quantify the probability that X 0.1 or X 1.9.) Are these bounds useful in this case? Comment on the bounds. 46

47 Lecture 7 Recommended readings: WMS, sections Frequently Used Continuous Distributions In this section, we will describe some of the most commonly used continuous distribution. In particular, we will focus on three important families which are often used in the statistical modeling of physical phenomena: 1. the Normal family: this is a class of distributions that is commonly used to describe physical phenomena where the quantity of interest takes values (at least in principle) in the range (, ) 2. the Gamma family: this is a class of distributions which are frequently used to describe physical phenomena where the quantity of interest takes non-negative values, i.e. it takes values in the interval [0, ) 3. the Beta family: this is a class of distributions that is commonly used to describe physical phenomena where the quantity of interest takes values in some interval [a, b] of the real line. We will also focus on some relevant subfamilies of the families mentioned above. The Uniform Distribution The Uniform distribution is the simplest among the continuous distributions. We say that a random variable X has the Uniform distribution with parameters a, b R, a < b, if its p.d.f. is f(x) = 1 b a 1 [a,b](x) = { 1 b a if x [a, b] 0 if x / [a, b]. (37) In this case, we use the notation X Uniform(a, b). We have E(X) = a+b 2 and V (X) = (b a)2 12. The Normal Distribution The Normal distribution is of the most relevant distributions for applications in Statistics. We say that a random variable X has a Normal distribution 47

48 with parameters µ R and σ 2 R + if its p.d.f. is f(x) = 1 (x µ)2 e 2σ 2. 2πσ 2 In this case, we write X N (µ, σ). The parameters of X correspond to its expectation µ = E(X) and its variance σ 2 = V (X). The c.d.f. of X N (µ, σ 2 ) can be expressed in terms of the error function erf( ). However, for our purposes, we will not need to investigate this further. The Standardized Normal Distribution Given X N (µ, σ 2 ), we can always obtain a standardized version Z of X such that Z still has a Normal distribution, E(Z) = 0, and V (Z) = 1 (i.e. Z N (0, 1)). This can be done by means of the transformation Z = X µ. σ The random variable Z is said to be standardized. Of course, one can also standardize random variables that have other distributions (as long as they have finite variance), but unlike the Normal case the resulting standardized variable may not belong anymore to the same family to which the original random variable X belonged to. The Normal family is closed with respect to translation and scaling: if X N (µ, σ 2 ), then ax + b N (aµ + b, a 2 σ 2 ) for a 0 and b R. Furthermore, if X N (µ, σ 2 ) and Y N (ν, τ 2 ) and X and Y are independent, then X + Y N (µ + ν, σ 2 + τ 2 ). We finally mention that it is common notation to indicate the c.d.f. of Z N (0, 1) by Φ( ). Notice that Φ( x) = 1 Φ(x) for any x R. The Gamma Distribution We say that the random variable X has a Gamma distribution with parameters α, β > 0 if its p.d.f. is 1 f(x) = β α Γ(α) xα 1 e x β 1 [0, ) (x). (38) Notice that the p.d.f. of the Gamma distribution includes the Gamma function Γ(α) = for α > 0. Useful properties of the Gamma function: 0 48 x α 1 e x dx (39)

For α ∈ Z_+, we have that Γ(α) = (α − 1)!. For any α > 0, we have Γ(α + 1) = αΓ(α). (This recursive property often comes in handy for computations that involve Gamma-distributed random variables.) Make sure not to confuse the Gamma distribution, described by the p.d.f. of equation (38), with the Gamma function of equation (39)! The expectation and the variance of X are respectively E(X) = αβ and V(X) = αβ^2. The c.d.f. of a Gamma-distributed random variable can be expressed explicitly in terms of the incomplete Gamma function. Once again, for our purposes, we don't need to investigate this further. We saw that the Normal family is closed with respect to translation and scaling. The Gamma family is closed with respect to positive scaling only: if X ∼ Gamma(α, β), then cX ∼ Gamma(α, cβ), provided that c > 0. Furthermore, if X_1 ∼ Gamma(α_1, β), X_2 ∼ Gamma(α_2, β), ..., X_n ∼ Gamma(α_n, β) are independent random variables, then Σ_{i=1}^n X_i ∼ Gamma(Σ_{i=1}^n α_i, β). The Exponential Distribution The Exponential distribution constitutes a subfamily of the Gamma family. In particular, X is said to have an Exponential distribution with parameter β > 0 if X ∼ Gamma(1, β). In that case, we write X ∼ Exponential(β). The p.d.f. of X is therefore f(x) = (1/β) e^{−x/β} 1_{[0,∞)}(x). Because the Exponential distribution is a subfamily of the Gamma distribution, we have E(X) = αβ = β and V(X) = αβ^2 = β^2. Exercise in class Compute the c.d.f. of X ∼ Exponential(β). We have F(x) = P(X ≤ x) = { 0 if x < 0; ∫_0^x (1/β) e^{−y/β} dy if x ≥ 0 } = { 0 if x < 0; 1 − e^{−x/β} if x ≥ 0 }.
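The c.d.f. just derived and the closure of the Gamma family under sums of independent variables with a common scale can both be checked numerically. The following is a quick sketch (not part of the original notes; the parameter values are arbitrary illustrations) using Python's standard library, where `expovariate` takes a rate (the reciprocal of our scale β) and `gammavariate` takes shape and scale in the same order as our (α, β):

```python
import math
import random

random.seed(0)
beta = 2.0  # scale parameter, so that E(X) = beta, matching the notes' parametrization

# Empirical check of F(x) = 1 - exp(-x/beta) for X ~ Exponential(beta)
n = 200_000
samples = [random.expovariate(1.0 / beta) for _ in range(n)]  # expovariate takes the rate 1/beta
x = 1.5
empirical_cdf = sum(s <= x for s in samples) / n
closed_form = 1 - math.exp(-x / beta)

# Closure under sums: Gamma(2, beta) + Gamma(3, beta) should behave like Gamma(5, beta),
# so the mean of the sum should be near 5*beta and the variance near 5*beta**2
sums = [random.gammavariate(2, beta) + random.gammavariate(3, beta) for _ in range(n)]
mean_sum = sum(sums) / n
var_sum = sum((s - mean_sum) ** 2 for s in sums) / n
```

With β = 2 the sum should have mean close to 10 and variance close to 20, in line with E(X) = αβ and V(X) = αβ^2.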

We will learn a bit more about this when we revisit Poisson processes and exponential wait times of events, but we note here that the Exponential distribution also has the property of being memoryless. For an Exponential random variable X, the following holds P(X > t + s | X > t) = P(X > s) (40) or, equivalently, P(X > t + s) = P(X > s) P(X > t). (41) Exercise in class Prove this. Hint: consider using 1 − CDF of the Exponential distribution. (Also, this is a homework problem.) The Chi-Square Distribution The Chi-Square distribution is another relevant subfamily of the Gamma family of distributions. It frequently arises in statistical model-fitting procedures. We say that X has a Chi-Square distribution with ν degrees of freedom, ν ∈ N_+, if X ∼ Gamma(ν/2, 2). In this case, we write X ∼ χ^2(ν). We have E(X) = ν and V(X) = 2ν. A Chi-Square random variable can be formed as the sum of the squares of ν independent standard Normal random variables: if X_1, ..., X_ν are i.i.d. N(0, 1) and X = Σ_{i=1}^ν X_i^2, then X ∼ χ^2(ν). This distribution is useful when we want to verify (i.e. conduct hypothesis tests) whether two categories are independent (see Pearson's Chi-square test of independence) or whether a prior belief about the proportions of the population in some categories (say, male/female, or income groups) is plausible (see the Goodness of fit test for further study). The Beta distribution Suppose that you are interested in studying a quantity Y that can only take values in the interval [a, b] ⊂ R. We can easily transform Y in such a way that it can only take values in the standard unit interval [0, 1]: it is enough to consider the normalized version of Y, X = (Y − a)/(b − a).
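Both the memoryless property (40) and the squared-Normals construction of the Chi-Square can be seen directly in simulation. A minimal sketch, not from the original notes (the values of β, t, s, and ν below are arbitrary choices for illustration):

```python
import random

random.seed(1)
beta = 2.0
n = 300_000
xs = [random.expovariate(1.0 / beta) for _ in range(n)]  # X ~ Exponential(beta)

# Memoryless property: P(X > t + s | X > t) should match P(X > s)
t, s = 1.0, 0.5
p_cond = sum(x > t + s for x in xs) / sum(x > t for x in xs)
p_uncond = sum(x > s for x in xs) / n

# Chi-Square: the sum of nu squared independent N(0,1) draws has mean nu and variance 2*nu
nu = 4
chi2 = [sum(random.gauss(0, 1) ** 2 for _ in range(nu)) for _ in range(50_000)]
m = sum(chi2) / len(chi2)
v = sum((c - m) ** 2 for c in chi2) / len(chi2)
```

Here `p_cond` and `p_uncond` agree up to Monte Carlo error, and `m` and `v` land near ν = 4 and 2ν = 8 respectively.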

Then, a flexible family of distributions that can be used to model X is the Beta family. We say that a random variable X has a Beta distribution with parameters α, β > 0, denoted X ∼ Beta(α, β), if its p.d.f. is of the form f(x) = (x^{α−1}(1 − x)^{β−1}/B(α, β)) 1_{[0,1]}(x), (42) where B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β). The c.d.f. of X can be expressed explicitly in terms of the incomplete Beta function B(x; α, β) = ∫_0^x y^{α−1}(1 − y)^{β−1} dy, but we don't need to investigate this further for our purposes. The expected value of X is E(X) = α/(α + β) and its variance is V(X) = αβ/((α + β)^2(α + β + 1)). A Note on the Normalizing Constant of a Probability Density Function Most frequently, a given p.d.f. f takes the form f(x) = c g(x), where c is a positive constant and g is a function. The part of f depending on x, i.e. the function g, is usually called the kernel of the p.d.f. f. Very often one can guess whether f belongs to a certain family by simply inspecting g. Then, if f is indeed a density, c > 0 is exactly the right constant which makes f integrate to 1. Therefore, if for any reason c is unknown, but you can guess that X ∼ f belongs to a certain family of distributions for a particular value of its parameters, in order to figure out c one does not necessarily have to compute the integral c = 1/∫_{supp(X)} g(x) dx. Let us illustrate this by means of two examples: Let f(x) = c e^{−x/2} 1_{[0,∞)}(x) and suppose we want to figure out what c is. Here g(x) = e^{−x/2} 1_{[0,∞)}(x) is the kernel of an Exponential p.d.f. with parameter β = 2. We know therefore that c must be equal to 1/β = 1/2.
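The kernel-recognition shortcut can be sanity-checked by brute force: integrating the kernel numerically and inverting. A small sketch (not from the original notes; the grid and truncation point are arbitrary numerical choices):

```python
import math

# Numerically integrate the kernel g(x) = exp(-x/2) over [0, inf) with a trapezoid
# rule on a truncated grid; the normalizing constant is then c = 1 / integral.
def g(x):
    return math.exp(-x / 2)

upper, steps = 60.0, 600_000   # e^{-30} makes the truncation error negligible
h = upper / steps
integral = h * (g(0) / 2 + sum(g(i * h) for i in range(1, steps)) + g(upper) / 2)
c = 1 / integral
# Recognizing g as an Exponential(beta = 2) kernel predicts c = 1/beta = 1/2
```

The numerical integral comes out at 2 (the value of β), so c = 1/2, exactly what inspecting the kernel predicted without any calculation.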

Let f(x) = c x^4 (1 − x)^5 1_{[0,1]}(x) and again suppose that we want to figure out the value of c. Here g(x) = x^4(1 − x)^5 1_{[0,1]}(x), which is the kernel of a Beta p.d.f. with parameters α = 5 and β = 6. We know therefore that c must be the number c = Γ(α + β)/(Γ(α)Γ(β)) = Γ(5 + 6)/(Γ(5)Γ(6)) = 10!/(4! 5!) = 1260. Extra: Simulations of random variables Fun stuff: if we have time, we will try some simulations of some random variables (Normal, Poisson, Exponential, Gamma with different parameters, etc.) and overlay p.m.f.s/p.d.f.s on them to show the dependence on n. Also, as a preview, we will see simulated examples of the central limit theorem phenomenon, as well as 2d scatter plots and 3d density plots of independent and positively correlated normal distributions.
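As a text-only preview of the histogram-versus-density overlays planned above, one can check that the relative frequency of simulated Beta(5, 6) draws falling in a bin matches the probability predicted by the density f(x) = 1260 x^4 (1 − x)^5 just derived. A sketch, not part of the notes (the bin [0.4, 0.5] is an arbitrary choice):

```python
import random

random.seed(2)
# Sample from Beta(5, 6) and compare a histogram bin's relative frequency with the
# probability predicted by the density f(x) = 1260 * x**4 * (1 - x)**5 on [0, 1].
n = 400_000
xs = [random.betavariate(5, 6) for _ in range(n)]

lo, hi = 0.4, 0.5
empirical = sum(lo <= x < hi for x in xs) / n

# Midpoint Riemann-sum approximation of the bin probability under the density
steps = 10_000
width = (hi - lo) / steps
predicted = sum(
    1260 * (lo + (i + 0.5) * width) ** 4 * (1 - (lo + (i + 0.5) * width)) ** 5
    for i in range(steps)
) * width
```

The two numbers agree to within Monte Carlo error, which is what an overlaid p.d.f. curve shows visually.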

Homework 4 (due Thursday, May 28th) 1. (6 pts) An oil exploration firm is formed with enough capital to finance ten explorations. The probability of a particular exploration being successful is 0.1. Assume the explorations are independent. Find the mean and variance of the number of successful explorations. 2. (8 pts) Ten motors are packaged for sale in a certain warehouse. The motors sell for USD 100 each, but a double-your-money-back guarantee is in effect for any defectives the purchaser may receive (i.e. if the purchaser receives one defective motor, the seller has to pay USD 200 back to the purchaser). Find the expected net gain for the seller if the probability of any one motor being defective is 0.08 (assuming that the quality of any one motor is independent of that of the others). 3. (6 pts) Consider again Exercise 11 of Homework 3. Suppose this time that the secretary does not discard a password once she has tried it and discovered that it is wrong. What is the distribution of the random variable X corresponding to the number of attempts needed until she enters the correct password? What are the mean and the variance of X? 4. (8 pts) A geological study indicates that an exploratory oil well should strike oil with probability 0.2. (a) What is the probability that the first strike comes on the third well drilled? (b) What is the probability that the third strike comes on the seventh well drilled? (c) What assumptions did you make to obtain the answers to parts (a) and (b)? (d) Find the mean and variance of the number of wells that must be drilled if the company wants to set up three producing wells. 5. (6 pts) Cards are dealt at random and without replacement from a standard 52-card deck. What is the probability that the second king is dealt on the fifth card? 6. (8 pts) A parking lot has two entrances.
Cars arrive at entrance I according to a Poisson distribution at an average of three per hour and at entrance II according to a Poisson distribution at an average of

four per hour. What is the probability that a total of three cars will arrive at the parking lot in a given hour? Assume that the numbers of cars arriving at the two entrances are independent. 7. (6 pts) Derive the c.d.f. of X ∼ Uniform(a, b). 8. (4 pts) Let X ∼ Uniform(0, 1). Is X ∼ Beta(α, β) for some α, β > 0? 9. (8 pts) The number of defective circuit boards coming off a soldering machine follows a Poisson distribution. During a specific eight-hour day, one defective circuit board was found. Conditional on the event that a single defective circuit was produced, the time at which it was produced has a Uniform distribution. (a) Find the probability that the defective circuit was produced during the first hour of operation during that day. (b) Find the probability that it was produced during the last hour of operation during that day. (c) Given that no defective circuit boards were produced during the first four hours of operation, find the probability that the defective board was manufactured during the fifth hour. 10. (6 pts) Let Z ∼ N(0, 1). Show that, for any z > 0, P(|Z| > z) = 2(1 − Φ(z)). 11. (6 pts) Let Z ∼ N(0, 1). Express in terms of Φ, the c.d.f. of Z, the following probabilities: P(Z^2 < 1) and P(3Z/5 + 3 > 2). Compute the exact probability of the event Φ^{−1}(α/3) ≤ Z ≤ Φ^{−1}(4α/3), where Φ^{−1}(α) is the inverse c.d.f. computed at α (i.e. the α-quantile of Z). 12. (8 pts) The SAT and ACT college entrance exams are taken by thousands of students each year. The mathematics portions of each of these exams produce scores that are approximately normally distributed. In recent years, SAT mathematics exam scores have averaged 480 with standard deviation 100. The average and standard deviation for ACT mathematics scores are 18 and 6, respectively.

An engineering school sets 550 as the minimum SAT math score for new students. (a) What percentage of students will score below 550 in a typical year? Express your answer in terms of Φ. (b) What score should the engineering school set as a comparable standard on the ACT math test? You can leave your answer in terms of Φ^{−1}. 13. (6 pts) Suppose that a random variable X has a probability density function given by f(x) = c x^3 e^{−x/2} 1_{[0,∞)}(x). Find the value of c that makes f a valid p.d.f. (hint: if you think about this carefully, no calculations are needed!) Does X have a Chi-Square distribution? If so, with how many degrees of freedom? What are E(X) and V(X)? 14. (8 pts) Consider the random variable X ∼ Exponential(β). Show that the distribution of X has the memoryless property. 15. (6 pts) An interesting property of the Exponential distribution is that its associated hazard function is constant on [0, ∞). The hazard function associated to a density f is defined as the function h(x) = f(x)/(1 − F(x)). (43) One way to interpret the hazard function is in terms of the failure rate of a certain component. In this sense, h can be thought of as the ratio between the probability density of failure at time x, f(x), and the probability that the component is still working at that time, which is P(X > x) = 1 − F(x). Show that if X ∼ Exponential(β) is a random variable describing the time at which a particular component fails to work, then the failure rate of X is constant (i.e. h does not depend on x). 16. Extra credit (10 pts) Consider the Binomial p.m.f. p(x) = (n choose x) p^x (1 − p)^{n−x} 1_{{0,1,...,n}}(x)

with parameters n and p. Show that lim p(x) = e^{−λ} (λ^x/x!) 1_{{0,1,...}}(x), where the limit is taken as n → ∞, p → 0, and np → λ > 0. (Hint: for three converging sequences a_n, b_n, c_n whose limits are a, b, c respectively, lim_{n→∞} a_n b_n c_n = abc.) 17. Extra credit (10 pts) Consider the Normal p.d.f. f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)} with parameters µ ∈ R and σ^2 > 0. Show that µ − σ and µ + σ correspond to the inflection points of f. 18. Extra credit (10 pts) Consider the random variable X ∼ Beta(α, β) with α, β ∈ Z_+. Show that, for n ∈ Z_+, E(X^{n+1})/E(X^n) = (α + n)/(α + β + n). Use this result to show that E(X^n) = Π_{i=0}^{n−1} (α + i)/(α + β + i). 19. Extra credit (10 pts) Download the free version of R and RStudio (Desktop edition) on your personal computer. You can use the instructions here: (a) When you're done, type version at the command line and hit enter. Write down the resulting version.string. Next, type print("Hello, world!"). (b) Next, type mysample = rexp(n=1000, rate=2) to generate 1,000 i.i.d. samples from an exponential distribution. These are 1,000 possible wait times (in hours) for a bus that is expected to arrive once every half hour. (c) Next, type hist(mysample) to produce a histogram (an estimate of the probability density function (p.d.f.); the outline of the bars will roughly take the shape of the p.d.f.) of the exponential distribution.

(d) Next, use abline(v=c(0.25,0.75), col="red") to plot two vertical lines. The area between these lines is a good estimate of P(0.25 ≤ X ≤ 0.75) for an exponential random variable X with rate λ = 2; or, in other words, the probability of the wait time being between 15 and 45 minutes. (e) Lastly, use dev.copy(png, "mydirectory/histogram.png"); graphics.off() to save the image as a png file in your home directory. Include this picture in your homework. Figure 1: Your figure should look something like this!

Lecture 8 Recommended readings: WMS, sections Multivariate Probability Distributions So far we have focused on univariate probability distributions, i.e. probability distributions for a single random variable. However, when we discussed independence of random variables in Lecture 6, we introduced the notions of joint c.d.f., joint p.m.f., and joint p.d.f. for a collection of n random variables X_1, ..., X_n. In this lecture we will elaborate more on these objects. Let X_1, ..., X_n be a collection of n random variables. Regardless of whether they are discrete or continuous, we denote by F_{X_1,...,X_n} their joint c.d.f., i.e. the function F_{X_1,...,X_n}(x_1, ..., x_n) = P(X_1 ≤ x_1, ..., X_n ≤ x_n). If they are all discrete, we denote by p_{X_1,...,X_n} their joint p.m.f., i.e. p_{X_1,...,X_n}(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n). If they are all continuous, we denote by f_{X_1,...,X_n} their joint p.d.f. The above functions satisfy properties that are similar to those satisfied by their univariate counterparts. The joint c.d.f. F_{X_1,...,X_n} satisfies: F_{X_1,...,X_n}(x_1, ..., x_n) ∈ [0, 1] for any x_1, ..., x_n ∈ R; F_{X_1,...,X_n} is monotonically non-decreasing in each of its variables; F_{X_1,...,X_n} is right-continuous with respect to each of its variables; lim_{x_i → −∞} F_{X_1,...,X_n}(x_1, ..., x_i, ..., x_n) = 0 for any i ∈ {1, ..., n}; lim_{x_1 → +∞, ..., x_n → +∞} F_{X_1,...,X_n}(x_1, ..., x_n) = 1. The joint p.m.f. satisfies: p_{X_1,...,X_n}(x_1, ..., x_n) ∈ [0, 1] for any x_1, ..., x_n ∈ R; Σ_{x_1 ∈ supp(X_1)} ⋯ Σ_{x_n ∈ supp(X_n)} p_{X_1,...,X_n}(x_1, ..., x_n) = 1. The joint p.d.f. satisfies:

f_{X_1,...,X_n}(x_1, ..., x_n) ≥ 0 for any x_1, ..., x_n ∈ R; ∫_R ⋯ ∫_R f_{X_1,...,X_n}(x_1, ..., x_n) dx_1 ⋯ dx_n = ∫_{supp(X_1)} ⋯ ∫_{supp(X_n)} f_{X_1,...,X_n}(x_1, ..., x_n) dx_n ⋯ dx_1 = 1. Furthermore we have F_{X_1,...,X_n}(x_1, ..., x_n) = Σ_{y_1 ≤ x_1, y_1 ∈ supp(X_1)} ⋯ Σ_{y_n ≤ x_n, y_n ∈ supp(X_n)} p(y_1, ..., y_n) and F_{X_1,...,X_n}(x_1, ..., x_n) = ∫_{−∞}^{x_1} ⋯ ∫_{−∞}^{x_n} f_{X_1,...,X_n}(y_1, ..., y_n) dy_n ⋯ dy_1. Exercise in class: You are given the following bivariate p.m.f.: p_{X_1,X_2}(x_1, x_2) = { 1/8 if (x_1, x_2) = (0, −1); 1/4 if (x_1, x_2) = (0, 0); 1/8 if (x_1, x_2) = (0, 1); 1/4 if (x_1, x_2) = (2, −1); 1/4 if (x_1, x_2) = (2, 0); 0 otherwise }. What is P(X_1 ≤ 1, X_2 ≤ 0) = F_{X_1,X_2}(1, 0)? We have F_{X_1,X_2}(1, 0) = p_{X_1,X_2}(0, −1) + p_{X_1,X_2}(0, 0) = 1/8 + 1/4 = 3/8. Draw figure of bivariate pmf here. Exercise in class: You are given the following bivariate p.d.f.: f_{X_1,X_2}(x_1, x_2) = e^{−(x_1+x_2)} 1_{[0,∞)×[0,∞)}(x_1, x_2). What is P(X_1 ≤ 1, X_2 > 5)? What is P(X_1 + X_2 ≤ 3)?

We have P(X_1 ≤ 1, X_2 > 5) = ∫_5^∞ ∫_0^1 e^{−(x_1+x_2)} dx_1 dx_2 = (∫_0^1 e^{−x_1} dx_1)(∫_5^∞ e^{−x_2} dx_2) = (1 − e^{−1}) e^{−5} and P(X_1 + X_2 ≤ 3) = ∫_0^3 ∫_0^{3−x_1} e^{−(x_1+x_2)} dx_2 dx_1 = ∫_0^3 e^{−x_1}(1 − e^{−(3−x_1)}) dx_1 = ∫_0^3 (e^{−x_1} − e^{−3}) dx_1 = (1 − e^{−3}) − 3e^{−3} = 1 − 4e^{−3}. Draw figure of bivariate pdf here. Marginal Distributions Given a collection of random variables X_1, ..., X_n and their joint distribution, how can we derive the marginal distribution of only one of them, say X_i? The idea is to sum or integrate the joint distribution over all the variables except for X_i. Draw figure of marginalizing a bivariate pmf here. Thus, given p_{X_1,...,X_n} we have that p_{X_i}(x_i) = Σ_{y_1 ∈ supp(X_1)} ⋯ Σ_{y_{i−1} ∈ supp(X_{i−1})} Σ_{y_{i+1} ∈ supp(X_{i+1})} ⋯ Σ_{y_n ∈ supp(X_n)} p_{X_1,...,X_n}(y_1, ..., y_{i−1}, x_i, y_{i+1}, ..., y_n), and given f_{X_1,...,X_n} we have f_{X_i}(x_i) = ∫_R ⋯ ∫_R f_{X_1,...,X_n}(y_1, ..., y_{i−1}, x_i, y_{i+1}, ..., y_n) dy_n ⋯ dy_{i+1} dy_{i−1} ⋯ dy_1 = ∫_{supp(X_1)} ⋯ ∫_{supp(X_{i−1})} ∫_{supp(X_{i+1})} ⋯ ∫_{supp(X_n)} f_{X_1,...,X_n}(y_1, ..., y_{i−1}, x_i, y_{i+1}, ..., y_n) dy_n ⋯ dy_{i+1} dy_{i−1} ⋯ dy_1.
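The two probabilities computed above can be double-checked by Monte Carlo: under this joint density, X_1 and X_2 are simply i.i.d. Exponential with rate 1. A quick sketch, not part of the original notes:

```python
import math
import random

random.seed(3)
# X1, X2 i.i.d. Exponential with rate 1, matching f(x1, x2) = exp(-(x1 + x2)) on [0, inf)^2
n = 400_000
p_a = p_b = 0
for _ in range(n):
    x1 = random.expovariate(1.0)
    x2 = random.expovariate(1.0)
    p_a += (x1 <= 1) and (x2 > 5)
    p_b += (x1 + x2 <= 3)
p_a /= n
p_b /= n

exact_a = (1 - math.exp(-1)) * math.exp(-5)  # P(X1 <= 1, X2 > 5)
exact_b = 1 - 4 * math.exp(-3)               # P(X1 + X2 <= 3)
```

Both empirical proportions land on the closed-form answers up to sampling error.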

Exercise in class: Consider again the bivariate p.m.f. p_{X_1,X_2}(x_1, x_2) = { 1/8 if (x_1, x_2) = (0, −1); 1/4 if (x_1, x_2) = (0, 0); 1/8 if (x_1, x_2) = (0, 1); 1/4 if (x_1, x_2) = (2, −1); 1/4 if (x_1, x_2) = (2, 0); 0 otherwise }. Derive the marginal p.m.f. of X_2. Notice first that supp(X_2) = {−1, 0, 1}. We have p_{X_2}(−1) = p_{X_1,X_2}(0, −1) + p_{X_1,X_2}(2, −1) = 1/8 + 1/4 = 3/8; p_{X_2}(0) = p_{X_1,X_2}(0, 0) + p_{X_1,X_2}(2, 0) = 1/4 + 1/4 = 1/2; p_{X_2}(1) = p_{X_1,X_2}(0, 1) = 1/8. Thus, p_{X_2}(x_2) = { 3/8 if x_2 = −1; 1/2 if x_2 = 0; 1/8 if x_2 = 1; 0 if x_2 ∉ {−1, 0, 1} }. Exercise in class: Consider again the bivariate p.d.f. f_{X_1,X_2}(x_1, x_2) = e^{−(x_1+x_2)} 1_{[0,∞)×[0,∞)}(x_1, x_2). Derive the marginal p.d.f. of X_1. Notice first that supp(X_1) = [0, ∞). For x_1 ∈ [0, ∞) we have f_{X_1}(x_1) = ∫_R f_{X_1,X_2}(x_1, x_2) dx_2 = e^{−x_1} ∫_0^∞ e^{−x_2} dx_2 = e^{−x_1}. Thus, f_{X_1}(x_1) = e^{−x_1} 1_{[0,∞)}(x_1).
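For a discrete joint p.m.f., "sum over the other coordinate" is a one-line operation once the table is stored as a dictionary. A minimal sketch of the marginalization in the exercise above (not part of the original notes):

```python
from collections import defaultdict

# The joint p.m.f. from the exercise, stored as {(x1, x2): probability}
joint = {(0, -1): 1/8, (0, 0): 1/4, (0, 1): 1/8, (2, -1): 1/4, (2, 0): 1/4}

# Marginalize: for each value of X2, sum the joint p.m.f. over all values of X1
p_x2 = defaultdict(float)
for (x1, x2), p in joint.items():
    p_x2[x2] += p
# p_x2 now holds {-1: 3/8, 0: 1/2, 1: 1/8}, matching the hand computation
```

The marginal also sums to 1, as any p.m.f. must.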

Conditional Distributions We will limit our discussion here to the bivariate case, but all that follows is easily extended to more than two random variables. Suppose that we are given a pair of random variables X_1, X_2 and we want to compute probabilities of the type P(X_1 ∈ A_1 | X_2 = x_2) for a particular fixed value x_2 of X_2. To do this, we need either the conditional p.m.f. (if X_1 is discrete) or the conditional p.d.f. (if X_1 is continuous) of X_1 given X_2 = x_2. By definition, we have p_{X_1|X_2=x_2}(x_1) = p_{X_1,X_2}(x_1, x_2)/p_{X_2}(x_2) and f_{X_1|X_2=x_2}(x_1) = f_{X_1,X_2}(x_1, x_2)/f_{X_2}(x_2). Notice the two following important facts: 1. p_{X_1|X_2=x_2} and f_{X_1|X_2=x_2} are not well-defined if x_2 is such that p_{X_2}(x_2) = 0 and f_{X_2}(x_2) = 0 respectively (i.e. x_2 ∉ supp(X_2)); 2. given x_2 ∈ supp(X_2), p_{X_1|X_2=x_2} and f_{X_1|X_2=x_2} are null whenever p_{X_1,X_2}(x_1, x_2) = 0 and f_{X_1,X_2}(x_1, x_2) = 0 respectively. So whenever you are computing a conditional distribution, it is good practice to 1) determine the support of the conditioning variable X_2 and clearly state that the conditional distribution that you are about to compute is only well-defined for x_2 ∈ supp(X_2), and 2) given x_2 ∈ supp(X_2), clarify for which values of x_1 ∈ R the conditional distribution p_{X_1|X_2=x_2} or f_{X_1|X_2=x_2} is null. Exercise in class: Consider again the bivariate p.m.f. p_{X_1,X_2}(x_1, x_2) = { 1/8 if (x_1, x_2) = (0, −1); 1/4 if (x_1, x_2) = (0, 0); 1/8 if (x_1, x_2) = (0, 1); 1/4 if (x_1, x_2) = (2, −1); 1/4 if (x_1, x_2) = (2, 0); 0 otherwise }.

Derive the conditional distribution of X_1 given X_2. First of all, notice that the conditional p.m.f. of X_1 given X_2 = x_2 is only well-defined for x_2 ∈ supp(X_2) = {−1, 0, 1}. Then we have p_{X_1|X_2=−1}(x_1) = { p_{X_1,X_2}(0, −1)/p_{X_2}(−1) if x_1 = 0; p_{X_1,X_2}(2, −1)/p_{X_2}(−1) if x_1 = 2; 0 if x_1 ∉ {0, 2} } = { (1/8)/(3/8) if x_1 = 0; (1/4)/(3/8) if x_1 = 2; 0 otherwise } = { 1/3 if x_1 = 0; 2/3 if x_1 = 2; 0 otherwise }; p_{X_1|X_2=0}(x_1) = { p_{X_1,X_2}(0, 0)/p_{X_2}(0) if x_1 = 0; p_{X_1,X_2}(2, 0)/p_{X_2}(0) if x_1 = 2; 0 otherwise } = { (1/4)/(1/2) if x_1 = 0; (1/4)/(1/2) if x_1 = 2; 0 otherwise } = { 1/2 if x_1 = 0; 1/2 if x_1 = 2; 0 otherwise }; p_{X_1|X_2=1}(x_1) = { p_{X_1,X_2}(0, 1)/p_{X_2}(1) if x_1 = 0; 0 if x_1 ≠ 0 } = { (1/8)/(1/8) if x_1 = 0; 0 if x_1 ≠ 0 } = { 1 if x_1 = 0; 0 if x_1 ≠ 0 }. Exercise in class: Consider again the bivariate p.d.f. f_{X_1,X_2}(x_1, x_2) = e^{−(x_1+x_2)} 1_{[0,∞)×[0,∞)}(x_1, x_2). Derive the conditional p.d.f. of X_2 given X_1. First of all, notice that the conditional p.d.f. of X_2 given X_1 = x_1 is only well-defined for x_1 ∈ supp(X_1) = [0, ∞). For x_1 ∈ [0, ∞), we have f_{X_2|X_1=x_1}(x_2) = f_{X_1,X_2}(x_1, x_2)/f_{X_1}(x_1) = e^{−(x_1+x_2)} 1_{[0,∞)×[0,∞)}(x_1, x_2) / (e^{−x_1} 1_{[0,∞)}(x_1)) = e^{−x_2} 1_{[0,∞)}(x_2).

Question: can we say something about whether or not X_1 and X_2 are independent in this case? Exercise: (not in class) (added after lecture) Take the distribution X of heights of adults in the United States in some age range to be Normal with mean Y and standard deviation 10, in inches. Y is a random variable (say, a Normal random variable); depending on the value of Y, the center (mean) of the distribution of X changes! The conditional distribution of X given A, where A = {Y = 60} (which can be written as X | Y = 60, and whose probability density function can be written as f_{X|Y=60}(x)), is simply N(60, 10^2)! What about when Y is more generally a random variable, say Y ∼ N(55, 5^2)? Then, as we will learn in a few lectures, the expectation of X given Y, E(X | Y), is also a random variable, and can be treated as a function of the random variable Y! A Test for the Independence of Two Random Variables Suppose that you are given (X_1, X_2) ∼ f_{X_1,X_2}. If 1. the support of f_{X_1,X_2} is a rectangular region, and 2. f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2) for all x_1, x_2 ∈ R, then X_1 and X_2 are independent. The same is true for the case where (X_1, X_2) is discrete, after obvious changes to 1). Notice that 1) can be conveniently captured by appropriately using indicator functions. Exercise in class: Consider the following bivariate p.d.f.: f_{X_1,X_2}(x_1, x_2) = 6(1 − x_2) 1_{{0 ≤ x_1 ≤ x_2 ≤ 1}}(x_1, x_2). Are X_1 and X_2 independent? (hint: by math (partitioning the p.d.f.), and also intuition from drawing the support.) A Note on Independence and Normalizing Constants Frequently, the p.d.f. (and the same holds true for a p.m.f.) of a pair of random variables takes the form f_{X_1,X_2}(x_1, x_2) = c g(x_1) h(x_2) (44) where c > 0 is a constant, g is a function depending only on x_1, and h is a function depending only on x_2. If this is the case, the two random variables

X_1 and X_2 are independent, and g and h are the kernels of the p.d.f.s of X_1 and X_2 respectively. Thus, by simply inspecting g and h we can guess to which families of distributions X_1 and X_2 belong. We do not need to worry about the constant c, as we know that if f_{X_1,X_2} is a bona fide p.d.f., then c = c_1 c_2 is exactly equal to the product of the normalizing constants c_1 and c_2 that satisfy c_1 ∫_R g(x) dx = c_2 ∫_R h(x) dx = 1. (45) This easily generalizes to a collection of n > 2 random variables.

Lecture 9 Recommended readings: WMS, sections Expected Value in the Context of Bivariate Distributions (part 1) In Lecture 5, we introduced the concept of expectation or expected value of a random variable. We will now introduce other operators that are based on the expected value operator and we will investigate how they act on bivariate distributions. While we focus on bivariate distributions, it is worthwhile to keep in mind that all that we discuss in this Lecture can be extended to collections of random variables X_1, ..., X_n with n > 2. For a function g: R^2 → R, (x_1, x_2) ↦ g(x_1, x_2), and a pair of random variables (X_1, X_2) with joint p.m.f. p_{X_1,X_2} (if discrete) or joint p.d.f. f_{X_1,X_2} (if continuous), the expected value of g(X_1, X_2) is defined as E(g(X_1, X_2)) = Σ_{x_1 ∈ supp(X_1)} Σ_{x_2 ∈ supp(X_2)} g(x_1, x_2) p_{X_1,X_2}(x_1, x_2) (discrete case) and E(g(X_1, X_2)) = ∫_R ∫_R g(x_1, x_2) f_{X_1,X_2}(x_1, x_2) dx_1 dx_2 (continuous case). Expectation, Variance and Independence In Lecture 5, we pointed out that in general, given two random variables X_1 and X_2, it is not true that E(X_1X_2) = E(X_1)E(X_2). We said, however, that it is true as soon as X_1 and X_2 are independent. Let's see why. Without loss of generality, assume that X_1 and X_2 are continuous (for the discrete case, just change integration into summation as usual). If they are independent, f_{X_1,X_2} = f_{X_1} f_{X_2}. Take g(x_1, x_2) = x_1 x_2. Then, we have E(X_1X_2) = ∫_R ∫_R x_1 x_2 f_{X_1,X_2}(x_1, x_2) dx_1 dx_2 = ∫_R ∫_R x_1 x_2 f_{X_1}(x_1) f_{X_2}(x_2) dx_1 dx_2 = ∫_R x_1 f_{X_1}(x_1) dx_1 ∫_R x_2 f_{X_2}(x_2) dx_2 = E(X_1)E(X_2).
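The factorization E(X_1X_2) = E(X_1)E(X_2) under independence is easy to see empirically. A quick sketch, not from the original notes (the choice of Uniform and Exponential marginals is an arbitrary illustration):

```python
import random

random.seed(4)
# Independent draws: X1 ~ Uniform(0, 1), X2 ~ Exponential with scale beta = 2
n = 300_000
x1s = [random.random() for _ in range(n)]
x2s = [random.expovariate(0.5) for _ in range(n)]  # expovariate takes the rate 1/beta

e_prod = sum(a * b for a, b in zip(x1s, x2s)) / n
e_x1 = sum(x1s) / n
e_x2 = sum(x2s) / n
# Under independence, e_prod should be close to e_x1 * e_x2
# (theoretical value: 0.5 * 2 = 1)
```

Pairing independently generated streams is what makes the product of means emerge; reusing the same stream for both factors would instead produce E(X^2).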

The above argument is easily extended to show that for any two functions g_1, g_2: R → R, if X_1 and X_2 are independent, then E(g_1(X_1)g_2(X_2)) = E(g_1(X_1))E(g_2(X_2)). Exercise in class: Consider the following bivariate p.d.f.: f_{X_1,X_2}(x_1, x_2) = (3/2) e^{−3x_1} 1_{[0,∞)×[0,2]}(x_1, x_2). Are X_1 and X_2 independent? Why? What are their marginal distributions? Compute E(3X_1X_2 + 3). Exercise in class: Consider the pair of random variables X_1 and X_2 with joint p.d.f. f_{X_1,X_2}(x_1, x_2) = (1/8) x_1 e^{−(x_1+x_2)/2} 1_{[0,∞)×[0,∞)}(x_1, x_2) and consider the function g(x, y) = y/x. What is E(g(X_1, X_2))? First of all, notice that X_1 and X_2 are independent. Therefore, we know already that E(g(X_1, X_2)) = E(X_2/X_1) = E(X_2)E(1/X_1). Moreover, the kernel of the p.d.f. of X_1 is x_1 e^{−x_1/2} 1_{[0,∞)}(x_1), while that of the p.d.f. of X_2 is e^{−x_2/2} 1_{[0,∞)}(x_2). Thus, X_1 and X_2 are distributed according to Gamma(2, 2) and Exponential(2) respectively. It follows that E(X_2) = β = 2. We only need to compute E(1/X_1). This is equal to E(1/X_1) = ∫_0^∞ (1/x_1) f_{X_1}(x_1) dx_1 = ∫_0^∞ (1/x_1) (1/4) x_1 e^{−x_1/2} dx_1 = (1/4) ∫_0^∞ e^{−x_1/2} dx_1 = (1/4) · 2 = 1/2. It follows that E(X_2/X_1) = E(X_2)E(1/X_1) = 2 · (1/2) = 1. Back to Lecture 5 again. There we mentioned that for two random variables X and Y, V(X + Y) ≠ V(X) + V(Y) usually, but that the result is

true if X and Y are independent. Now, we can easily see that if X and Y are independent and we take the functions g_1 = g_2 = g with g(X) = X − E(X), by our previous results we have V(X + Y) = E[(X + Y − (E(X) + E(Y)))^2] = E[((X − E(X)) + (Y − E(Y)))^2] = E[(X − E(X))^2 + (Y − E(Y))^2 + 2(X − E(X))(Y − E(Y))] = E[(X − E(X))^2] + E[(Y − E(Y))^2] + 2E(X − E(X))E(Y − E(Y)) = V(X) + V(Y). (46) Notice that we exploited the independence of X and Y together with the results above to deal with the expectation of the cross-product 2(X − E(X))(Y − E(Y)) with g_1 = g_2 = g. Question: what about V(X − Y)? Covariance The covariance of a pair of random variables X and Y is a measure of their linear dependence. By definition, the covariance between X and Y is Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y). (47) It is easy to check that the covariance operator satisfies, for a, b, c, d ∈ R, Cov(a + bX, c + dY) = bd Cov(X, Y). Also, notice that Cov(X, Y) = Cov(Y, X) and Cov(X, X) = V(X). Question: suppose that X and Y are independent. What is Cov(X, Y)? There exists a scaled version of the covariance which takes values in [−1, 1]. This is the correlation between X and Y. The correlation between X and Y is defined as Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)). (48) It is easy to check that for a, b, c, d ∈ R with b, d > 0, we have Cor(a + bX, c + dY) = Cor(X, Y). So, unlike the covariance operator, the correlation operator is not affected by affine transformations of X and Y (with positive slopes).
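The scaling rules Cov(a + bX, c + dY) = bd Cov(X, Y) and the affine invariance of the correlation can be verified exactly on any finite set of paired values, since both operators reduce to averages there. A minimal sketch (the data values and coefficients below are arbitrary illustrations, not from the notes):

```python
from statistics import mean

def cov(xs, ys):
    # Population covariance of two paired samples
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) * (y - my) for x, y in zip(xs, ys))

def cor(xs, ys):
    return cov(xs, ys) / (cov(xs, xs) * cov(ys, ys)) ** 0.5

# A small "population" of paired values
x = [1.0, 2.0, 4.0, 7.0]
y = [0.5, 1.0, 3.0, 2.5]

a, b, c, d = 3.0, 2.0, -1.0, 5.0
xt = [a + b * xi for xi in x]   # a + bX
yt = [c + d * yi for yi in y]   # c + dY

lhs = cov(xt, yt)               # Cov(a + bX, c + dY)
rhs = b * d * cov(x, y)         # b * d * Cov(X, Y)
corr_same = cor(xt, yt)         # equals cor(x, y), since b, d > 0
```

Flipping the sign of d would flip the sign of the correlation, which is why the invariance statement needs b, d > 0.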

Beware that while we saw that X and Y independent ⇒ Cov(X, Y) = 0, the converse is not true (i.e. independence is stronger than uncorrelatedness). Consider the following example. Let X be a random variable with E(X) = E(X^3) = 0. Consider Y = X^2. Clearly, X and Y are not independent (in fact, Y is a deterministic function of X!). However, Cov(X, Y) = E(XY) − E(X)E(Y) = E(X · X^2) − 0 = E(X^3) = 0. So X and Y are uncorrelated, because Cor(X, Y) = 0, but they are not independent! Exercise in class: Look back at equation (46). Without assuming that X and Y are independent, but rather only making the weaker assumption that X and Y are uncorrelated (i.e. Cov(X, Y) = 0), show that V(X + Y) = V(X − Y) = V(X) + V(Y). From the discussion above it is clear that in general V(X + Y) = V(X) + V(Y) + 2Cov(X, Y) and V(X − Y) = V(X) + V(Y) − 2Cov(X, Y). This generalizes as follows. Given a_1, ..., a_n ∈ R and X_1, ..., X_n, V(Σ_{i=1}^n a_i X_i) = Σ_{i=1}^n a_i^2 V(X_i) + Σ_{i=1}^n Σ_{j≠i} a_i a_j Cov(X_i, X_j) = Σ_{i=1}^n a_i^2 V(X_i) + Σ_{1≤i<j≤n} 2 a_i a_j Cov(X_i, X_j). If the random variables X_1, ..., X_n are all pairwise uncorrelated, then obviously the second summand above is null. Exercise in class: You are given three random variables X, Y, Z with V(X) = 1, V(Y) = 4, V(Z) = 3, Cov(X, Y) = 0, Cov(X, Z) = 1, and Cov(Y, Z) = 1. Compute V(3X + Y − 2Z). We have V(3X + Y − 2Z) = V(3X) + V(Y) + V(−2Z) + 2Cov(3X, Y) + 2Cov(3X, −2Z) + 2Cov(Y, −2Z) = 9V(X) + V(Y) + 4V(Z) + 6Cov(X, Y) − 12Cov(X, Z) − 4Cov(Y, Z) = 9 + 4 + 12 + 0 − 12 − 4 = 9.
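The linear-combination formula is just the quadratic form a^T V a in the coefficient vector and the covariance matrix, which makes the exercise above a three-line computation. A sketch, not from the original notes:

```python
# Variance of a linear combination from variances and covariances:
# V(sum a_i X_i) = sum_i sum_j a_i * a_j * Cov(X_i, X_j), with Cov(X_i, X_i) = V(X_i)
a = [3, 1, -2]                      # coefficients of X, Y, Z in 3X + Y - 2Z
V = [[1, 0, 1],                     # covariance matrix: V(X)=1, V(Y)=4, V(Z)=3,
     [0, 4, 1],                     # Cov(X,Y)=0, Cov(X,Z)=1, Cov(Y,Z)=1
     [1, 1, 3]]

var = sum(a[i] * a[j] * V[i][j] for i in range(3) for j in range(3))
print(var)  # 9
```

The double sum automatically produces the diagonal terms a_i^2 V(X_i) and the paired cross terms 2 a_i a_j Cov(X_i, X_j).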

Exercise in class: Recall the multinomial distribution from the lecture about discrete distributions. Suppose that (X_1, ..., X_k) has a Multinomial(n; p_1, ..., p_k) distribution. We want to prove: Cov[X_i, X_j] = −n p_i p_j for i ≠ j. The trick is to treat a multinomial experiment as a sequence of n independent trials Y_1, ..., Y_n. Let 1_{Y_t=i} be the random variable which takes the value 1 if the t-th outcome Y_t took value i, and notice that X_i = Σ_{t=1}^n 1_{Y_t=i} and X_j = Σ_{t=1}^n 1_{Y_t=j}. Then, proceed as follows: Cov[X_i, X_j] = Σ_{t=1}^n Σ_{s=1}^n Cov[1_{Y_t=i}, 1_{Y_s=j}]. Now, if s = t, Cov[1_{Y_t=i}, 1_{Y_t=j}] = E[1_{Y_t=i} 1_{Y_t=j}] − E[1_{Y_t=i}]E[1_{Y_t=j}] = 0 − p_i p_j = −p_i p_j (the product 1_{Y_t=i} 1_{Y_t=j} is always 0, since for i ≠ j the t-th trial cannot have outcome i and outcome j at once), while if s ≠ t, Cov[1_{Y_t=i}, 1_{Y_s=j}] = 0 by independence. Hence the result: Cov[X_i, X_j] = −n p_i p_j.
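The negative covariance −n p_i p_j (counts compete for the same n trials, so one being large forces the others down) shows up clearly in simulation. A sketch, not from the original notes (n = 10 and p = (0.2, 0.3, 0.5) are arbitrary illustrative choices):

```python
import random

random.seed(5)
# Simulate Multinomial(n; p) count vectors and check Cov(X_1, X_2) ~= -n * p_1 * p_2
n = 10
reps = 200_000
xi_samples, xj_samples = [], []
for _ in range(reps):
    counts = [0, 0, 0]
    for _ in range(n):
        u = random.random()             # categories with probabilities 0.2, 0.3, 0.5
        counts[0 if u < 0.2 else (1 if u < 0.5 else 2)] += 1
    xi_samples.append(counts[0])
    xj_samples.append(counts[1])

mi = sum(xi_samples) / reps             # should be near n * p_1 = 2
mj = sum(xj_samples) / reps
cov_ij = sum((a - mi) * (b - mj) for a, b in zip(xi_samples, xj_samples)) / reps
# Theory: -n * p_1 * p_2 = -10 * 0.2 * 0.3 = -0.6
```

The empirical covariance settles near −0.6, matching the derivation above.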

Lecture 10 Recommended readings: WMS, section 5.11 Expected Value in the Context of Bivariate Distributions (part 2) Using conditional distributions, we can define the notions of conditional expectation, conditional variance, and conditional covariance. Introduction (READ! but not covered in lecture) Conditional expectation can first be thought of as simply the expectation of a (discrete, for now) random variable X given an event A: E(X | A) = Σ_{x ∈ supp(X)} x p_{X|A}(x | A). Conditional expectation E(X | A) has all the properties of regular expectation. In particular: 1. E[f(X) | A] = Σ_{x ∈ supp(X)} f(x) p_{X|A}(x | A); 2. E[aX + b | A] = aE[X | A] + b for all a, b. Set A = {Y = y}: E[X | Y = y] = Σ_{x ∈ supp(X)} x p_{X|Y}(x | y), where E[X | Y = y] is the conditional expectation of X given Y = y. Now, write E[X | Y], where X and Y are r.v.'s, for the random variable, a function of Y, whose value when Y = y is E[X | Y = y]. So think of E[X | Y] = g(Y) for some function g. From here on, the distinction between how we refer to the two may be blurred, i.e. we will call both E(X | A) and E(X | Y) conditional expectations, but it should be clear from context which is meant. Conditional Expectation Given a conditional p.m.f. p_{X_1|X_2=x_2} or a conditional p.d.f. f_{X_1|X_2=x_2}, the conditional expectation of X_1 given X_2 = x_2 is defined as E(X_1 | X_2 = x_2) = Σ_{x_1 ∈ supp(X_1)} x_1 p_{X_1|X_2=x_2}(x_1) (49)

72 in the discrete case and as E(X 1 X 2 = x 2 ) = x 1 supp(x 1 ) x 1 f X1 X 2 =x 2 (x 1 ) dx 1 (50) in the continuous case. It is clear from the definition that the conditional expectation is a function of the value of X 2. Since X 2 is a random variable, this means that the conditional expectation (despite its name) is itself a random variable! This is a fundamental difference with respect to the standard concept of expectation for a random variable (for instance E(X 1 )) which is just a constant depending on the distribution of X 1. In terms of notation, we often denote the random variable corresponding to the conditional expectation of X 1 given X 2 simply as E(X 1 X 2 ). Note that if X 1 and X 2 are independent, then E(X 1 X 2 ) = E(X 1 ). Conditional expectation has the so-called tower property, or law of total expectation, which says that E(E(X 1 X 2 )) = E(X 1 ) where the outer expectation is taken with respect to the probability distribution of X 2. This is easier to understand in the context of discrete random variables. Exercise (not in class): Prove this, for discrete and continuous case. Hint: carefully write down the expression for E(E(X 1 X 2 )), where E(X 1 X 2 ) can be treated as a random variable and a function of X 2. You may even write it as g(x 2 ) if it makes it clear. Then, work on the double sum (discrete) or double integral (continuous). Conditional Variance We can also define the conditional variance of X 1 given X 2. We have V (X 1 X 2 ) = E[(X 1 E(X 1 X 2 )) 2 X 2 ] = E(X 2 1 X 2 ) [E(X 1 X 2 )] 2 (51) Obtaining the unconditional variance from the conditional variance is a little harder than obtaining the unconditional expectation from the conditional expectation: V (X 1 ) = E[V (X 1 X 2 )] + V [E(X 1 X 2 )]. Note that if X 1 and X 2 are independent, then V (X 1 X 2 ) = V (X 1 ). 72

Conditional Covariance

It is worthwhile mentioning that we can also define the conditional covariance between X_1 and X_2 given a third random variable X_3. We have

Cov(X_1, X_2 | X_3) = E[(X_1 − E(X_1 | X_3))(X_2 − E(X_2 | X_3)) | X_3]
                    = E(X_1 X_2 | X_3) − E(X_1 | X_3) E(X_2 | X_3).     (52)

To get the unconditional covariance we have the following formula:

Cov(X_1, X_2) = E[Cov(X_1, X_2 | X_3)] + Cov[E(X_1 | X_3), E(X_2 | X_3)].     (53)

Note that if X_1 and X_3 are independent, and X_2 and X_3 are independent, then Cov(X_1, X_2 | X_3) = Cov(X_1, X_2).

In terms of computation, everything is essentially unchanged. The only difference is that we sum or integrate against the conditional p.m.f./conditional p.d.f. rather than the marginal p.m.f./marginal p.d.f.

Exercise in class: Consider again the p.d.f. of exercise 7 in Homework 5:

f_{X_1,X_2}(x_1, x_2) = 6(1 − x_2) 1_{0 ≤ x_1 ≤ x_2 ≤ 1}(x_1, x_2).

What is E(X_1 | X_2)? Compute E(X_1).

We first need to compute f_{X_1|X_2=x_2} for x_2 ∈ supp(X_2). For x_2 ∈ [0, 1] we have

f_{X_2}(x_2) = ∫_0^{x_2} 6(1 − x_2) dx_1 = 6 x_2 (1 − x_2) 1_{[0,1]}(x_2).

Notice that from the marginal p.d.f. of X_2 we can see that X_2 ~ Beta(2, 2). For x_2 ∈ (0, 1) we have

f_{X_1|X_2=x_2}(x_1) = f_{X_1,X_2}(x_1, x_2) / f_{X_2}(x_2) = 6(1 − x_2) 1_{[0,x_2]}(x_1) / (6 x_2 (1 − x_2)) = (1/x_2) 1_{[0,x_2]}(x_1).

Notice that f_{X_1|X_2=x_2} is the p.d.f. of a Uniform(0, x_2) distribution. It follows that E(X_1 | X_2 = x_2) = x_2/2. We can write this more concisely (and in a way that stresses the fact that the conditional expectation is a random variable!) as E(X_1 | X_2) = X_2/2.
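This worked exercise can be sanity-checked by simulation. A Python sketch (not from the notes; it uses the standard fact that the median of three Uniform(0, 1) draws is a Beta(2, 2) draw):

```python
import random

random.seed(0)
reps = 200_000

def beta22():
    # Beta(2,2) draw: the median (2nd order statistic) of three Uniform(0,1) draws.
    return sorted(random.random() for _ in range(3))[1]

x1s, x2s = [], []
for _ in range(reps):
    x2 = beta22()                  # X2 ~ Beta(2, 2)
    x1 = random.uniform(0.0, x2)   # X1 | X2 = x2 ~ Uniform(0, x2)
    x1s.append(x1)
    x2s.append(x2)

# Tower property: E(X1) = E(X2/2) = 1/4.
mean_x1 = sum(x1s) / reps

# Conditional expectation: among draws with X2 near 0.5,
# the average of X1 should be near E(X1 | X2 = 0.5) = 0.25.
bin_vals = [a for a, b in zip(x1s, x2s) if 0.45 < b < 0.55]
cond_mean = sum(bin_vals) / len(bin_vals)
```

Both estimates should land near 1/4, matching E(X_1 | X_2) = X_2/2 and E(X_1) = 1/4.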

We have

E(X_1) = E[E(X_1 | X_2)] = E(X_2/2) = E(X_2)/2 = [2/(2 + 2)]/2 = 1/4.

Exercise in class: Let N ~ Poisson(λ) and let Y_1, ..., Y_N be i.i.d. Gamma(α, β), with N independent of each of the Y_i's. Let T = Σ_{i=1}^N Y_i. Compute E(T | N), E(T), V(T | N), and V(T).

We have that

E(T | N = n) = E(Σ_{i=1}^N Y_i | N = n) = Σ_{i=1}^n E(Y_i | N = n) = Σ_{i=1}^n E(Y_i) = n α β.

Thus, E(T | N) = N α β. Now,

E(T) = E[E(T | N)] = E(N α β) = α β E(N) = α β λ.

The conditional variance is

V(T | N = n) = V(Σ_{i=1}^N Y_i | N = n) = V(Σ_{i=1}^n Y_i | N = n) = Σ_{i=1}^n V(Y_i | N = n) = Σ_{i=1}^n V(Y_i) = n α β².

Thus, V(T | N) = N α β². The unconditional variance of T is then

V(T) = V[E(T | N)] + E[V(T | N)]
     = V(N α β) + E(N α β²)
     = α² β² V(N) + α β² E(N)
     = α² β² λ + α β² λ
     = α β² λ (1 + α).

Exercise in class: Let Q ~ Uniform(0, 1) and Y | Q ~ Binomial(n, Q). Compute E(Y) and V(Y). We have

E(Y) = E[E(Y | Q)] = E(nQ) = n E(Q) = n/2

and

V(Y) = V[E(Y | Q)] + E[V(Y | Q)]
     = V(nQ) + E(nQ(1 − Q))
     = n² V(Q) + n E(Q − Q²)
     = n²/12 + n E(Q) − n E(Q²)
     = n²/12 + n/2 − n (V(Q) + [E(Q)]²)
     = n²/12 + n/2 − n(1/12 + 1/4)
     = n²/12 + n/6
     = (n/6)(n/2 + 1).
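Both in-class exercises can be verified numerically. A Python sketch (parameter values are arbitrary choices, and the Poisson sampler is a standard textbook algorithm, not from the notes):

```python
import math
import random

random.seed(0)
reps = 100_000

def poisson(lam):
    # Knuth's multiplication algorithm; fine for small lam.
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= random.random()
    return k - 1

# Exercise 1: T = sum of N i.i.d. Gamma(alpha, beta) draws, N ~ Poisson(lam).
alpha, beta, lam = 3.0, 2.0, 2.0
ts = []
for _ in range(reps):
    n_terms = poisson(lam)
    ts.append(sum(random.gammavariate(alpha, beta) for _ in range(n_terms)))
mean_t = sum(ts) / reps
var_t = sum((t - mean_t) ** 2 for t in ts) / (reps - 1)
# Theory: E(T) = alpha*beta*lam = 12, V(T) = alpha*beta^2*lam*(1 + alpha) = 96.

# Exercise 2: Q ~ Uniform(0,1), Y | Q ~ Binomial(m, Q).
m = 12
ys = []
for _ in range(reps):
    q = random.random()
    ys.append(sum(1 for _ in range(m) if random.random() < q))
mean_y = sum(ys) / reps
var_y = sum((y - mean_y) ** 2 for y in ys) / (reps - 1)
# Theory: E(Y) = m/2 = 6, V(Y) = (m/6)(m/2 + 1) = 14.
```

The empirical means and variances should match the closed-form answers derived above.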

Homework 5 (due June 2nd)

1. (10 pts) Simple example of multivariate joint distributions, and conditional distributions. All car accidents that occurred in the state of Florida in 1998 and that resulted in an injury were recorded, along with the seat belt use of the injured person and whether the injury was fatal or not. Total number of recorded accidents is 577,006.

                No belt    Belt
   Fatal        1,601      510
   Non-fatal    162,527    412,368

This table can be used to describe the joint distribution of two binary random variables X and Y, where X is 1 if the passenger does not wear the seat belt and 0 otherwise, and Y is 1 if the accident is fatal and 0 otherwise.

(a) What is the probability that an accident is fatal and the passenger did not wear the seat belt?
(b) What is the probability that an accident is fatal and the passenger wore the seat belt?
(c) What is the marginal distribution of X? (Hint: just in case you're wondering what you need to produce, notice X is a discrete random variable. You should completely produce the probability mass function (or CDF) of X!)
(d) What is the conditional distribution of Y | X = no belt and of Y | X = belt? (the same hint applies here)

2. (10 pts) Consider the following function:

f(x_1, x_2) = 6 x_1² x_2   if 0 ≤ x_1 ≤ 1 and 0 ≤ x_2 ≤ 1
            = 0            otherwise.

Verify that f is a valid bivariate p.d.f. Suppose (X_1, X_2) ~ f.

Are X_1 and X_2 independent random variables?
Compute P(X_1 + X_2 < 1).
Compute the marginal p.d.f. of X_1. What are the values of its parameters?

Compute the marginal p.d.f. of X_2.
Compute the conditional p.d.f. of X_2 given X_1.
Compute P(X_2 < 1.1 | X_1 = 0.6).

3. (10 pts) Consider the following bivariate p.m.f.:

p_{X_1,X_2}(x_1, x_2) = 0.38  if (x_1, x_2) = (0, 0)
                        0.17  if (x_1, x_2) = (1, 0)
                        0.14  if (x_1, x_2) = (0, 1)
                        0.02  if (x_1, x_2) = (1, 1)
                        0.24  if (x_1, x_2) = (0, 2)
                        0.05  if (x_1, x_2) = (1, 2).

Compute the marginal p.m.f. of X_1 and the marginal p.m.f. of X_2.
Compute p_{X_2|X_1=0}, the conditional p.m.f. of X_2 given X_1 = 0.
Compute P(X_1 = 0 | X_2 = 2).

4. (10 pts) Let X_1 ~ Uniform(0, 1) and, for 0 < x_1 ≤ 1,

f_{X_2|X_1=x_1}(x_2) = (1/x_1) 1_{(0, x_1]}(x_2).

What named continuous distribution does f_{X_2|X_1=x_1} seem to resemble? What are the values of its parameters?
Compute the joint p.d.f. of (X_1, X_2).
Compute the marginal p.d.f. of X_2.

5. (5 pts) Let X_1 ~ Binomial(n_1, p) and X_2 ~ Binomial(n_2, p). Assume that X_1 and X_2 are independent. Consider W = X_1 + X_2. One can show that W ~ Binomial(n_1 + n_2, p). Use this result to show that p_{X_1|W=w} is a Hypergeometric p.m.f. with parameters r = n_1, N = n_1 + n_2, and n = w.

6. (10 pts) Suppose that you have a coin such that, if tossed, heads appears with probability p ∈ [0, 1] and tails occurs with probability 1 − p. Person A tosses the coin. The number of tosses until the first head appears for person A is X_1. Then person B tosses the coin. The number of tosses until the first head appears for person B is X_2. Assume that X_1 and X_2 are independent. What is the probability of the event X_1 = X_2?

7. (10 pts) Let (X_1, X_2) ~ f_{X_1,X_2} with

f_{X_1,X_2}(x_1, x_2) = (1/x_1) 1_{0 < x_2 ≤ x_1 ≤ 1}(x_1, x_2).

Compute E(X_1 | X_2).

8. (10 pts) Consider

f_{X_1,X_2}(x_1, x_2) = 6(1 − x_2) 1_{0 ≤ x_1 ≤ x_2 ≤ 1}(x_1, x_2).

Are X_1 and X_2 independent? Are X_1 and X_2 correlated (positively or negatively) or uncorrelated?

9. (10 pts) Let X_1 and X_2 be uncorrelated random variables. Consider Y_1 = X_1 + X_2 and Y_2 = X_1 − X_2.

Express Cov(Y_1, Y_2) in terms of V(X_1) and V(X_2).
Give an expression for Cor(Y_1, Y_2).
Is it possible that Cor(Y_1, Y_2) = 0? If so, when does this happen?

10. (10 pts) Let

f_{X_1}(x_1) = (1/6) x_1³ e^{−x_1} 1_{[0,∞)}(x_1)

and

f_{X_2}(x_2) = (1/2) e^{−x_2/2} 1_{[0,∞)}(x_2).

Let Y = X_1 − X_2.

Compute E(Y).
Let Cov(X_1, X_2) = γ. Compute V(Y).
Suppose that X_1 and X_2 are uncorrelated. What is V(Y) in this case?

11. (5 pts) Use the definition of variance to show that V(X ± Y) = V(X) + V(Y) ± 2 Cov(X, Y).

12. Extra credit (10 pts) Let Z ~ N(0, 1). Show that for any odd n ≥ 1, E(Z^n) = 0.

13. Extra Credit: (10 pts) Show that for two random variables X and Y, Cor(X, Y) ∈ [−1, 1]. Hint: consider X' = X − E(X) and Y' = Y − E(Y) and study the quadratic function h(t, X', Y') = (tX' − Y')² as a function of t. Notice that h(t, X', Y') ≥ 0, therefore E(h(t, X', Y')) ≥ 0, which implies that the discriminant of E(h(t, X', Y')) as a quadratic function of t must be non-positive! Try to use this fact to prove the claim that Cor(X, Y) ∈ [−1, 1].

14. Extra Credit (10 pts): In an earlier homework, you downloaded R and RStudio and printed Hello, world!, and generated some samples from a given distribution. We will use R again for a fun new experiment. The Hypergeometric distribution can be approximated by a Binomial distribution. Specifically, HG(n, N, r) ≈ Bin(n, p = r/N) when N ≥ 20n, as a rule of thumb! (Fun exercise: first think about intuitively why this is true, then prove that the former distribution converges to the latter as N → ∞ with p = r/N held fixed!) Conduct a simulation in R to demonstrate that this is true. Start with the code below and make modifications as you see fit. Among other things, you may need to increase the number of simulations (n.sample), adjust the parameters (n, N, r) so that we have a valid scenario where the Binomial distribution can approximate the Hypergeometric distribution, and adjust the plotting parameters so that you get a meaningful, easy to read graphic that shows that the two simulated distributions are approximately the same.

# Initialize some useful information
n.sample <- 100
n <- 500
N <- 15000  # assumed value; any N >= 20*n fits the rule of thumb
r <- 750
p <- r/N

# Generate random sample from binomial distribution
# Don't confuse n and size here. The folks who wrote this function
# used some unfortunately confusing parameter names
bin.sample <- rbinom(n = n.sample, size = n, prob = p)

# Generate random sample from hypergeometric distribution
# Don't get confused here -- the folks who wrote this function
# used a different parameterization of the hypergeometric distribution
hg.sample <- rhyper(nn = n.sample, m = r, n = N - r, k = n)

# Plot the results
par(mfrow = c(2, 1))
hist(bin.sample, breaks = 50,
     main = "Simulated Binomial Distribution",
     col = 2, xlab = "Hey there, fix my axis label!!")
hist(hg.sample, breaks = 50,
     main = "Simulated Hypergeometric Distribution",
     col = 3, xlab = "Master Chief")

Lecture 11

Recommended readings: WMS, sections

Functions of Random Variables and Their Distributions

Suppose that you are given a random variable X ~ F_X and a function g: R → R. Consider the new random variable Y = g(X). How can we find the distribution of Y? This is the focus of this lecture. There are two approaches that one can typically use to find the distribution of a function of a random variable Y = g(X): the first approach is based on the c.d.f., the other approach is based on the change of variable technique. The former is general and works for any function g; the latter requires some assumptions on g, but it is generally faster than the general approach based on the c.d.f.

The method of the cumulative distribution function

The idea is pretty simple. You know that X ~ F_X and you want to find F_Y. The process is as follows:

1. F_Y(y) = P(Y ≤ y) by definition
2. P(Y ≤ y) = P(g(X) ≤ y), since Y = g(X)
3. now you want to express the event {g(X) ≤ y} in terms of X, since you know the distribution of X only
4. let A = {x ∈ R : g(x) ≤ y}; then {g(X) ≤ y} = {X ∈ A}
5. it follows that P(g(X) ≤ y) = P(X ∈ A), which can typically be expressed in terms of F_X
6. (EXTRA) once you have F_Y, the p.m.f. or the p.d.f. of Y can be easily derived from it.

Exercise in class: Let Z ~ N(0, 1) and g(x) = x². Consider Y = g(Z) = Z². What is the probability distribution of Y?

Following the steps above, we have

F_Y(y) = P(Y ≤ y) = P(Z² ≤ y) = 0                    if y < 0
                              = P(Z ∈ [−√y, √y])     if y ≥ 0
                              = P(|Z| ≤ √y)          if y ≥ 0.

Notice that in this case A = [−√y, √y]. It follows that

F_Y(y) = 0                   if y < 0
       = Φ(√y) − Φ(−√y)      if y ≥ 0
       = 2Φ(√y) − 1          if y ≥ 0.

Let φ denote the standard normal p.d.f.,

φ(x) = (1/√(2π)) e^{−x²/2}.

The p.d.f. of Y = Z² is therefore

f_Y(y) = 0                if y ≤ 0
       = (1/√y) φ(√y)     if y > 0
       = (1/√(2π)) y^{−1/2} e^{−y/2}   if y > 0.

Notice that this is the p.d.f. of a Gamma(1/2, 2) ≡ χ²(1) distribution.

Exercise in class: the probability integral transform. Consider a continuous random variable X ~ F_X where F_X, the c.d.f. of X, is strictly increasing. Consider the random variable Y = F_X(X). What is the probability distribution of Y? The transformation Y = F_X(X) is usually called the probability integral transform. To see why, notice that

Y = F_X(X) = ∫_{−∞}^{X} f_X(t) dt.

We have

F_Y(y) = P(Y ≤ y) = P(F_X(X) ≤ y) = 0                        if y < 0
                                  = P(X ≤ F_X^{−1}(y))       if y ∈ [0, 1)
                                  = 1                        if y ≥ 1.

In this case, therefore, A = (−∞, F_X^{−1}(y)]. Then,

F_Y(y) = 0                        if y < 0
       = F_X(F_X^{−1}(y)) = y     if y ∈ [0, 1)
       = 1                        if y ≥ 1.

The p.d.f. of Y is therefore

f_Y(y) = 1_{[0,1]}(y),

i.e. Y ~ Uniform(0, 1).

The method of the change of variable

We will not focus on the mathematical motivation behind this technique. If you are curious and want to learn more about it, I strongly recommend reading more on the topic. For our purposes, we will take this procedure as a black box. In order to apply the method of the change of variable, the function g must admit an inverse g^{−1} (this is not required by the method based on the c.d.f.). A sufficient condition is that g is strictly monotone (increasing or decreasing). Suppose that you are given X ~ F_X and you want to compute the probability distribution of Y = g(X). First of all, determine the support of Y. Then use this formula to obtain f_Y on the support of Y from f_X and g:

f_Y(y) = f_X(g^{−1}(y)) |d/dz g^{−1}(z)| evaluated at z = y.

Exercise in class: Consider X ~ Beta(1, 2) and g(x) = 2x − 1. What is the p.d.f. of Y = g(X)? First of all, notice that (with a little abuse of notation) supp(Y) = g(supp(X)). Since supp(X) = [0, 1], it follows that supp(Y) = [−1, 1]. We have

f_X(x) = 2(1 − x) 1_{[0,1]}(x),

g^{−1}(y) = (y + 1)/2,

and d/dy g^{−1}(y) = 1/2. Thus, for y ∈ [−1, 1],

f_Y(y) = f_X(g^{−1}(y)) |d/dx g^{−1}(x)| at x = y
       = 2(1 − (y + 1)/2) · (1/2)
       = 1 − (y + 1)/2
       = (1 − y)/2.

The complete description of the p.d.f. of Y is therefore

f_Y(y) = ((1 − y)/2) 1_{[−1,1]}(y).
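The three worked examples of this lecture can all be checked by simulation. A Python sketch, not part of the notes, with all sample sizes chosen arbitrarily (for the Beta(1, 2) example, X is drawn by inverting its c.d.f. F_X(x) = 2x − x², which gives F_X^{−1}(u) = 1 − √(1 − u)):

```python
import math
import random

random.seed(0)
reps = 200_000

# 1) c.d.f. method: Y = Z^2 with Z ~ N(0,1) satisfies F_Y(1) = 2*Phi(1) - 1.
hits = sum(1 for _ in range(reps) if random.gauss(0.0, 1.0) ** 2 <= 1.0)
p_hat = hits / reps
phi_1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))  # Phi(1)
target = 2.0 * phi_1 - 1.0                            # about 0.6827

# 2) Probability integral transform: X exponential with mean lam,
#    Y = F_X(X) = 1 - exp(-X/lam) should be Uniform(0,1).
lam = 2.0
xs = [random.expovariate(1.0 / lam) for _ in range(reps)]  # mean-lam exponential
us = [1.0 - math.exp(-x / lam) for x in xs]
p_unif = sum(1 for u in us if u <= 0.3) / reps             # should be near 0.3

# 3) Change of variable: X ~ Beta(1,2) via inverse transform, Y = 2X - 1.
#    The derived p.d.f. f_Y(y) = (1 - y)/2 on [-1, 1] gives E(Y) = -1/3.
ys = [2.0 * (1.0 - math.sqrt(1.0 - random.random())) - 1.0 for _ in range(reps)]
mean_y = sum(ys) / reps
```

Each empirical quantity should land close to the value predicted by the corresponding derivation above.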

Lecture 12

Recommended readings: WMS, sections , 9.7, Cosma Shalizi's lecture notes

Estimation

Read more of sections 8 and 9 for a more comprehensive coverage!

What to do when you don't know λ = 5?

The concepts and framework of probability theory covered so far have proven useful in calculating theoretical ("ideal") long-run frequency properties of random events. The problem settings so far have always included a known probability distribution of the random variable of interest; for instance, the number of people entering a coffee shop was simply given as 5 people an hour, on average. This led us to use a Poisson random variable X ~ Poisson(λ = 5) to calculate an event like P(5 ≤ X ≤ 20), by summing the probability mass function over the appropriate region. Of course, in practice we don't have perfect information that says the truth of this process is λ = 5.

Data as realizations of random variables

Instead of having the perfect information that λ = 5, as a practitioner and probabilistic modeler you will have data about the arrivals in a coffee shop, over several days. If you are modeling gambling probabilities involving unfair coin flips (Binomial(n, p)) or unevenly shaped dice rolls (X ~ Multinomial(p_1, ..., p_6, n)), then you will have data about the outcomes of coin flips or dice rolls. The goal of estimation is to recover the statistical parameters from observed data, by calculating statistics from the data. Indeed, statistics are just functions of the data. Here is an example of the typical setting in which you will be asked to estimate a parameter. You have observed n dice rolls. The data will look like this:

Draw      1    2    ...  n
Outcome   x_1  x_2  ...  x_n

We will call the outcome of each draw random variables (in upper case): X_1, ..., X_n. Each entry in the table above is a realization of the random variable X_i; we can write this as X_1 = x_1, X_2 = x_2, ..., X_n = x_n for the particular outcome (x_1, ..., x_n).
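The idea of data as realizations can be made concrete with a quick simulation. A Python sketch (a hypothetical fair die; not part of the notes' R code) that draws n rolls and tabulates their relative frequencies, which are themselves statistics estimating the true face probabilities 1/6:

```python
import random
from collections import Counter

random.seed(0)
n = 60_000

# n die rolls: each observed outcome x_i is one realization of X_i.
rolls = [random.randint(1, 6) for _ in range(n)]

# Relative frequency of each face: a statistic, i.e. a function of the data.
freq = {face: count / n for face, count in Counter(rolls).items()}
# For a fair die each relative frequency should be close to 1/6, about 0.1667.
```

With different seeds the frequencies change: statistics are random variables because the data are.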

Statistical Model

The statistical model (interchangeably used are: probabilistic model, or data model) is what we assume (impose) about the nature of the randomness of the data. In other words, it is what we impose as the type of randomness that X_1, ..., X_n follow. A sensible model in this example is

X_1, X_2, ..., X_n iid ~ Bernoulli(p).

Under this assumption, the number of heads X = Σ_{i=1}^n X_i is distributed as X ~ Binomial(n, p). We will sometimes write X = (X_1, ..., X_n) and x = (x_1, ..., x_n).

Estimation

Lastly, the estimation of the parameter p is done by

p̂ = g(X_1, ..., X_n) = (1/n) Σ_{i=1}^n X_i.

The hat notation is used to emphasize that p̂ is aimed at learning about the value of p. The estimate p̂ is sometimes written p̂(X) to emphasize that it is a function of the data! Also, p̂ is a good estimate of the parameter p if it is close to it. How do you find a good function g of the data whose value provides a good estimate of p? This is a core matter of interest in statistical modeling, and further studies in statistics and machine learning attempt to answer this question using cool mathematical and algorithmic insights. We will get a flavor of one such technique in this class.

Maximum likelihood estimation

One way of finding a good g is to think about the likelihood of the data, given my statistical model. If your model parameter is p = 0.8, and your data are mostly zero, let us say that 95 percent of them were zero, then it is quite unlikely that they are realizations from the statistical model X_1, ..., X_n iid ~ Bernoulli(0.8); this model (which is completely determined by p) is quite implausible. What about p = 0.5? If the data are mostly zero, then Bernoulli(p = 0.5) is still not quite plausible, but it is more likely than before that mostly-zero draws came from it. We can continue this guessing game of finding the most plausible estimate of p, or we can take a more principled route. Write the likelihood

function of the data, given the model:

L(p | x) = f_X(x | p)                                         (54)
         = P(X = x | p)                                       (55)
         = P(X_1 = x_1, ..., X_n = x_n | p)                   (56)
         = ∏_{i=1}^n p_{X_i}(x_i)                             (57)
         = p^{Σ_{i=1}^n x_i} (1 − p)^{n − Σ_{i=1}^n x_i},     (58)

where p_{X_i}(·) is the probability mass function of X_i ~ Bernoulli(p). For our purposes, the likelihood function is the joint probability of x. We would like to find the value of p that maximizes the joint probability (likelihood) for the data on hand, x_1, ..., x_n. That value of p is what is called the maximum likelihood estimator (MLE) of the parameter of interest (p):

p̂_MLE = argmax_p L(p | x).

How would you maximize (54)? Since this is a polynomial function in p, we can use differentiation; we recall from calculus that in order to maximize a polynomial, a good strategy is to take the derivative and set it equal to zero (plus some other steps, such as checking concavity/convexity).

Exercise in class

Obtain the maximum likelihood estimator for the parameter p of a Bernoulli random variable (the above setting). (Hint: maximizing a function g(x) is the same as maximizing log(g(x)).)

Bias

The bias of an estimator is the difference between the expectation of p̂ and the actual parameter:

bias = E(p̂(X)) − p.

One salient quality of a statistical estimator p̂ is for the expectation (mean) of p̂ to be equal to p, so that on average p̂ is not consistently misled, i.e. the bias is zero. This quality of an estimator is called unbiasedness, and such an estimator is called an unbiased estimator.

Exercise in class

In the above Bernoulli data example, what is the bias of the estimator p̂ = X̄? Is p̂ = X̄ unbiased?

What is an example of a biased estimator? Take a Normal data model, where X_i ~ N(0, σ²). We would like to estimate the variance (noise level) σ². We will show that the MLE σ̂² of σ² is biased.

Derivation of σ̂²_MLE goes here:

Exercise in class

Obtain the maximum likelihood estimator for the parameter λ of a Poisson random variable. What is E(λ̂_MLE − λ)? (Note: the subscript MLE is sometimes written to emphasize that it is a maximum likelihood estimate.)

Here is a summarizing table distinguishing parameters and statistics:

                              µ                        µ̂
Where does it come from?      Underlying truth         Data
What is it?                   The goal of estimation   Function of data
What is it a function of?     Population               Sample
What is it called?            Parameter                Statistic
For example, it is called     True mean                Sample mean
Is it random?                 Constant value           Random variable
Example                       p = 0.6                  p̂ = g(Y_1, ..., Y_n) = Ȳ
Example                       µ = 3.54                 µ̂ = g(X_1, ..., X_n) = X̄ = (1/n) Σ_{i=1}^n X_i

Invariance Property of MLEs

Maximum likelihood estimators have a convenient property of being invariant to transformation by a function g.

Formally, if θ̂_MLE is an MLE of θ, and g is some real-valued function, then g(θ̂_MLE) is a maximum likelihood estimator for τ = g(θ). The proof is simple enough in concept. First, notice that g may not be a one-to-one function, i.e. it may have many values θ mapping to the same image τ = g(θ). So a maximizer of L with respect to τ = g(θ) may not be unique with respect to θ (i.e. there may be many θ ∈ g^{−1}(τ̂) that maximize L). Thus, we define

L*(τ | x) = sup_{θ : g(θ) = τ} L(θ | x).

Note that if L* is maximized by τ̂, then L is maximized at all θ ∈ g^{−1}(τ̂). Denoting τ̂ as the maximizer of L*, see that:

L*(τ̂ | x) = max_τ L*(τ | x) = max_τ sup_{θ : g(θ) = τ} L(θ | x) = sup_θ L(θ | x) = L(θ̂ | x).

The proof is complete.

Case study: linear regression

We are equipped with all the tools and probability theory needed to study a technique (algorithm) called simple linear regression. We will stretch our knowledge a little bit to do maximum likelihood estimation on a couple of parameters. Consider the situation in which you have data pairs

(Y_1, X_1), ..., (Y_n, X_n).

For example, these could be pairs of observations of blood pressure and cholesterol intake. Let us say that there is no noise in the cholesterol intake measurements, and only in the blood pressure level. What we will treat as random is the generating mechanism of how blood pressure is affected by cholesterol intake,

Y | X,

which practically means that we will assume X is known, and instead study how Y is distributed given X. Our statistical model will be a slight extension of the models we have dealt with:

Y_i | X_i = f(X_i) + ε_i                                   (59)
          = β_0 + β_1 X_i + ε_i,  ε_i ~ N(0, σ²)           (60)

where for simplicity we assume σ² is a known constant. f is called a regression function, and in general can be thought of as a function that maps the predictors X_i to the response Y_i; in our case, we take it to be a linear

function f(x) = a + bx. We are interested in estimating the parameters β_0 and β_1.

Picture goes here

We can re-write the above model as

Y_i | X_i ~ N(β_0 + β_1 X_i, σ²).

What is the joint likelihood of Y given X?

L(β_0, β_1 | Y) = P(Y = y | β_0, β_1) = ∏_{i=1}^n f_{Y_i}(y_i | β_0, β_1).

We would like to see two things:

1. The maximum likelihood estimate is equal to a least squares estimate.
2. The MLE estimates β̂_0, β̂_1 are unbiased.
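A simulation can preview both claims. The sketch below (Python, with a hypothetical fixed design and parameter values) computes the least squares estimates in closed form, β̂_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and β̂_0 = ȳ − β̂_1 x̄, which are also the maximizers of the Normal likelihood above, and averages them over many simulated data sets to suggest unbiasedness:

```python
import random

random.seed(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0
xs = [i / 10 for i in range(30)]  # fixed (noiseless) design points

def least_squares(ys):
    # Closed-form simple linear regression fit.
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

reps = 5_000
b0s, b1s = [], []
for _ in range(reps):
    # Generate one data set from the model Y_i = beta0 + beta1*X_i + eps_i.
    ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]
    b0, b1 = least_squares(ys)
    b0s.append(b0)
    b1s.append(b1)

b0_bar = sum(b0s) / reps  # should be near beta0 = 2.0
b1_bar = sum(b1s) / reps  # should be near beta1 = 0.5
```

The averages of the estimates across simulated data sets sit close to the true β_0 and β_1, which is what unbiasedness predicts.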

What are the practical implications? The first point says that the least squares estimate, a sensible first way to estimate a regression function f by minimizing the squared error

argmin_θ Σ_i (y_i − f(x_i; θ))²

given some functional form f, is equivalent to a sound statistical strategy for obtaining f. The second point says that the maximum likelihood estimates are, on average, equal to the true parameters β_0 and β_1.

Note:

1. For the purpose of our class, it is more important to recognize the techniques (usages of various tools, such as maximum likelihood estimation and bias) than to memorize the exact properties of linear regression. The focus of this lecture is to introduce some applications in which probability theory is useful.

2. There are many different ways to do statistical estimation, and extensions of the ideas from this lecture. If you are further interested, take !

3. After that, to learn more about linear regression (and much more about data analysis), take !
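As a numerical check on the Bernoulli MLE exercise from earlier in this lecture, the sketch below (Python, with a hypothetical data vector) maximizes the log-likelihood over a fine grid of p values and recovers the closed-form answer p̂ = x̄:

```python
import math

# Hypothetical Bernoulli sample (the coin-flip setting of this lecture).
data = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
n, s = len(data), sum(data)

def log_lik(p):
    # log L(p | x) = (sum x_i) log p + (n - sum x_i) log(1 - p)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

# Brute-force maximization over a grid of p values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)

# The closed-form MLE is the sample mean.
p_closed = s / n
```

The grid maximizer and the sample mean coincide, as the calculus derivation predicts.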

Homework 6 (due June 9th)

1. (5 pts) Let X | Q = q ~ Binomial(n, q) and Q ~ Beta(α, β). Compute E(X).

2. (5 pts) Let X | Q = q ~ Binomial(n, q) and Q ~ Beta(3, 2). Find the marginal p.d.f. of X. (Notice that here we are mixing discrete and continuous random variables, but the usual definitions still apply!)

3. (5 pts) Let X ~ Exponential(4). Consider Y = 3X + 1. Find the p.d.f. of Y and E(Y).

4. (10 pts) We mentioned in class that if X ~ Gamma(α, β) then, for c > 0, Y = cX ~ Gamma(α, cβ). Prove it.

5. (10 pts) Let F be a given continuous and strictly increasing c.d.f. Let U ~ Uniform(0, 1). Find a function g such that g(U) ~ F. You can assume that g is strictly increasing.

6. (10 pts) Let X ~ f_X with

f_X(x) = (b/x²) 1_{[b,∞)}(x)

and let U ~ Uniform(0, 1). Find a function g such that g(U) has the same distribution as X.

7. (10 pts) Let θ > 0 and X ~ f_X with

f_X(x) = (2x/θ) e^{−x²/θ} 1_{[0,∞)}(x).

Consider Y = X². What is f_Y, the p.d.f. of Y? What are E(Y) and V(Y)?

8. Extra credit (10 pts) Let X ~ Beta(α, β). Consider Y = 1 − X. What is the probability distribution of Y? Now, let α = β = 1. Claim that X and Y have the same distribution. Are X and Y also independent?

9. (5 pts) Let X ~ F_X. What is E(X | X)?

10. (10 pts) Let X, Y, Z be three random variables such that E(X | Z) = 3Z, E(Y | Z) = 1, and E(XY | Z) = 4Z. Assume furthermore that E(Z) = 0. Are X and Y uncorrelated?

11. Extra credit (5 pts) Let Y be a continuous random variable with distribution Uniform(a, b), for some a < b. Find the c.d.f. of Y.

12. Extra credit (10 pts) Suppose we want to simulate from a continuous random variable X having a strictly increasing c.d.f. F_X. However, the computer at our disposal can only simulate from a Uniform(0, 1) distribution.

(a) Show that if U ~ Uniform(0, 1), its c.d.f. is

F_U(u) = 0   if u < 0
       = u   if u ∈ [0, 1]
       = 1   if u > 1.

(b) Consider the random variable Y = F_X^{−1}(U), where U ~ Uniform(0, 1). Show that Y has the same distribution as X by proving that P(Y ≤ x) = F_X(x) for all x ∈ R. Hint: use part (a) and the monotonicity of F_X.

(c) Using the same computer, how would you go about generating an Exp(λ) random variable? (You can use the following fact without proving it: if X ~ Exp(λ), then F_X(x) = 1 − e^{−x/λ}.)

13. (10 points, 2 pts each) We will study estimation for a normally distributed sample.

(a) Obtain the maximum likelihood estimator for the mean in a Normal model (assume σ² > 0 is known):

X_1, ..., X_n ~ N(µ, σ²).

(Hint: write down the likelihood function L(µ, σ² | y) = ∏_i f(y_i | µ, σ²).)

(b) Obtain the maximum likelihood estimator for the noise parameter in the Normal model. (Hint: same hint.)

(c) Is the maximum likelihood estimate of µ unbiased?

(d) Is the maximum likelihood estimate of σ² unbiased?

(e) What is an unbiased estimate of σ²? (Two hints: (1) consider multiplying by a constant; (2) n is a constant, i.e. there is nothing random about the sample size.)

14. (10 points, 2.5 pts each) We will now study a Chi-square random variable and the sample variance S².

(a) Prove that the random variable

U = (n − 1) S² / σ²,   where S² = Σ_{i=1}^n (X_i − X̄)² / (n − 1),

is distributed as a Chi-square random variable with n − 1 degrees of freedom (i.e. U ~ χ²(n − 1)).

(b) What is the expectation E(U)?

(c) What is E(S²)?

(d) What is the bias of S²? Is it unbiased?

Note: There is no extra credit for the following problem, because it is a duplicate of problem 13. The problem is reproduced here for reference:

15. Extra Credit (10 pts) Let's say it is reasonable that an n-sized random sample of body weights in the age range of 17-25, X_1, ..., X_n, is i.i.d. distributed as a Normal distribution with mean µ and variance σ². A public health official would like to know the spread (variance) of the body weights of this population. We already learned in class that a maximum likelihood estimator of the two parameters, (µ̂_MLE, σ̂²_MLE), is

( (1/n) Σ_{i=1}^n X_i ,  (1/n) Σ_{i=1}^n (X_i − X̄)² ).

(a) Is the above estimator for σ² biased? (7 pts) If so, what is the bias?

(b) Also, can you guess what an unbiased estimator is? (3 pts)

Lecture 13

Recommended readings: WMS, sections

The Empirical Rule, Sampling Distributions, the Central Limit Theorem, and the Delta Method

In this lecture we will focus on some basic statistical concepts at the basis of statistical inference. Before starting, let's quickly discuss a useful rule of thumb associated with the Normal distribution which is often mentioned and used in practice.

The Empirical Rule Based on the Normal Distribution

Suppose that you collect data X_1, ..., X_n ~ f, where f is an unknown probability density function which, however, you know is bell-shaped (or you expect to be bell-shaped given the information you have on the particular phenomenon you are observing, or you can show to be bell-shaped using, for example, a histogram).

Example of histograms and density estimation goes here:

We know that we can estimate the mean µ_f and the standard deviation σ_f of f by means of the sample mean and the sample standard deviation, X̄ and S, respectively. On the basis of these two statistics, you may be interested in approximately quantifying the probability content of intervals of the form [µ_f − kσ_f, µ_f + kσ_f], where k is a positive integer. Because for

X ~ N(µ, σ²) one has

P(µ − σ ≤ X ≤ µ + σ) ≈ 68%
P(µ − 2σ ≤ X ≤ µ + 2σ) ≈ 95%
P(µ − 3σ ≤ X ≤ µ + 3σ) ≈ 99%,

one would expect that, based on X̄ and S,

P([X̄ − S, X̄ + S]) ≈ P([µ_f − σ_f, µ_f + σ_f]) ≈ 68%
P([X̄ − 2S, X̄ + 2S]) ≈ P([µ_f − 2σ_f, µ_f + 2σ_f]) ≈ 95%
P([X̄ − 3S, X̄ + 3S]) ≈ P([µ_f − 3σ_f, µ_f + 3σ_f]) ≈ 99%     (61)

if the probability density f is bell-shaped. The approximations of equation (61) are frequently referred to as the empirical rules based on the Normal distribution.

Sampling Distributions

To illustrate the concept of sampling distribution, we will consider the Normal model, which assumes that the data X_1, ..., X_n are i.i.d. N(µ, σ²), with parameters µ and σ² that are usually assumed to be unknown. In order to estimate µ and σ², we saw that we can use the two statistics X̄ and S² (the sample mean and the sample variance). These statistics, or estimators, are random variables too, since they depend on the data X_1, ..., X_n. If they are random variables, then they must have their own probability distributions! The probability distribution of a statistic (or of an appropriate stabilizing transformation of it) is called the sampling distribution of that statistic. We have the following results:

1. √n (X̄ − µ)/σ ~ N(0, 1)
2. (n − 1) S²/σ² ~ χ²(n − 1)
3. √n (X̄ − µ)/S ~ t(n − 1).

Some comments: 1. is easy to understand and prove (it's just standardization of a Normal random variable!); it is useful when we are interested in making inferences about µ and we know σ. 2. is a little harder to prove. We use this result when we are interested in making inferences about σ² and µ is unknown. 3. is a classical result. It is useful when we want to make inferences on µ and σ² is unknown. We won't study in detail the t distribution. However, this distribution arises when we take the ratio of a standard

Normal distribution and the square root of a χ² distribution divided by its degrees of freedom (under the assumption that the Normal and the χ²-distributed random variables are also independent). The t distribution looks like a Normal distribution with fatter tails. As the number of degrees of freedom of the t distribution goes to ∞, the t distribution converges in distribution to a standard Normal distribution.

For the purposes of this course, we say that a sequence of random variables X_1, X_2, ..., X_n, ... converges in distribution to a certain distribution F if their c.d.f.'s F_1, F_2, ..., F_n, ... are such that

lim_{n→∞} F_n(x) = F(x)

for each x ∈ R at which F is continuous. We will use the notation →_d to denote convergence in distribution. The idea here is that the limiting c.d.f. F (often called the limiting distribution of the X's) can be used to approximate probability statements about the random variables in the sequence. This idea will be key to the next result, the Central Limit Theorem.

Basic inference example, and p-value example goes here.

The Central Limit Theorem

The Central Limit Theorem is a remarkable result. Simply put, suppose that you have a sequence of random variables X_1, X_2, ..., X_n, ... iid ~ F distributed according to some c.d.f. F, such that their expectation µ and their variance σ² exist and are finite. Then the standardized version of their average converges in distribution to a standard Normal distribution. In other words,

√n (X̄_n − µ)/σ →_d Z     (62)

as n → ∞, where Z ~ N(0, 1). Another way to express the Central Limit

Theorem is to say that, for any x ∈ R,

lim_{n→∞} P( √n (X̄_n − µ)/σ ≤ x ) = Φ(x).     (63)

This result is extremely frequently used and invoked to approximate probability statements regarding the average of a collection of i.i.d. random variables when n is large. However, the population variance σ² is rarely known. The Central Limit Theorem can be extended to accommodate this case. We have

√n (X̄_n − µ)/S →_d Z     (64)

where Z ~ N(0, 1).

Exercise in class: The number of errors per computer program has a Poisson distribution with mean λ = 5. We receive n = 125 programs written by n = 125 different programmers. Let X_1, ..., X_n denote the number of errors in each of the n programs. Then X_1, ..., X_n iid ~ Poisson(λ). We want to approximate the probability that the average number of errors per program is not larger than 5.5. We have µ = E(X_1) = λ and σ = √V(X_1) = √λ. Furthermore,

P(X̄_n ≤ 5.5) = P( √n (X̄_n − µ)/σ ≤ √n (5.5 − µ)/σ )
             ≈ P( Z ≤ √125 (5.5 − 5)/√5 )
             = P(Z ≤ 2.5)
             = Φ(2.5)

where Z ~ N(0, 1).

The Delta Method

If we have a sequence of random variables X_1, X_2, ..., X_n, ... which converges in distribution to a standard Normal and a differentiable function g: R → R, then the Delta Method allows us to find the limiting distribution for the sequence of random variables g(X_1), g(X_2), ..., g(X_n), .... Assume that

√n (X̄_n − µ)/σ →_d N(0, 1)

and that g : R → R is differentiable with g′(µ) ≠ 0. Then,

√n (g(X̄_n) − g(µ)) / (g′(µ) σ) →_d N(0, 1).

Exercise in class: Let X_1, ..., X_n iid ∼ F where F is some c.d.f. such that both the expectation µ and the variance σ² of the X's exist and are finite. By the Central Limit Theorem,

√n (X̄_n − µ) / σ →_d N(0, 1).

Consider the function g(x) = e^x. Find the limiting distribution of g(X̄_n). We have g′(x) = g(x) = e^x > 0 for any x ∈ R; thus g′(µ) ≠ 0 necessarily. By applying the Delta Method, we have that

√n (e^{X̄_n} − e^µ) / (e^µ σ) = √n [g(X̄_n) − g(µ)] / (g′(µ) σ) →_d N(0, 1).

Thus, we can use the distribution N(e^µ, e^{2µ} σ² / n) to approximate probability statements about g(X̄_n) = e^{X̄_n} when n is large.

The Law of Large Numbers

We have so far studied the Central Limit Theorem and a particular mode of convergence, convergence in distribution. Another important result on the convergence of the average of a sequence of random variables is the law of large numbers³. Simply put, the law of large numbers says that the sample mean of a sequence of i.i.d. random variables converges to the common expectation of the random variables in the sequence. More precisely, let X_1, ..., X_n, ... be a sequence of iid random variables with common mean µ. Then, for any ε > 0,

lim_{n→∞} P( |X̄_n − µ| > ε ) = 0.    (65)

In this case, we use the notation X̄_n →_p µ.

³ This is also called the (weak) law of large numbers; a stronger version exists, but we do not discuss it in this course.
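The delta-method exercise above can be sanity-checked by Monte Carlo. The sketch below is not part of the notes; it assumes Exponential(1) data so that µ = σ = 1, with illustrative sample sizes, and compares the simulated mean and variance of e^{X̄_n} against the delta-method prediction N(e^µ, e^{2µ}σ²/n).

```python
import math
import random
import statistics

random.seed(0)
n = 500        # sample size per replication (illustrative)
reps = 2000    # Monte Carlo replications
mu = sigma = 1.0   # mean and s.d. of the Exponential(1) distribution

# For each replication, draw X_1, ..., X_n iid Exponential(1) and record
# g(X-bar) = exp(X-bar).
vals = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    vals.append(math.exp(xbar))

approx_mean = statistics.fmean(vals)
approx_var = statistics.pvariance(vals)

# Delta method: g(X-bar) is approximately N(e^mu, e^(2*mu) * sigma^2 / n).
pred_mean = math.exp(mu)
pred_var = math.exp(2 * mu) * sigma**2 / n
```

The simulated mean and variance should land close to the predicted values, with the agreement improving as n grows.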

Example in class: Suppose that you repeatedly flip a coin which has probability p ∈ (0, 1) of showing heads. Introduce the random variables

X_i = 1 if the i-th flip is heads, and X_i = 0 if the i-th flip is tails.

Consider the random variable X̄_n describing the observed proportion of heads. Then, as n → ∞, X̄_n →_p p. In English, this means that for any arbitrarily small ε > 0, the probability of the event "the proportion of heads differs from p by more than ε" converges to 0 as n → ∞.

Summary: two ways in which random variables converge. We introduced two notions of convergence for random variables: convergence in distribution (which we associated with the Central Limit Theorem) and convergence in probability (which we associated with the weak law of large numbers). There are other modes of convergence of random variables, all of which share a common goal: what does the distribution of Y_n = f(X_1, ..., X_n) look like as we collect more and more samples (n → ∞)? These results allow us to understand the precision or accuracy with which our statistics (estimators, necessarily functions of the data) estimate some unknown quantity (usually a parameter). For further study, read a good probability textbook or consider taking a graduate class!
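The coin-flip example can be illustrated numerically. The sketch below is not from the notes; the values p = 0.3 and ε = 0.02 are illustrative. It estimates P(|X̄_n − p| > ε) for increasing n and shows the probability shrinking toward 0, as the weak law of large numbers predicts.

```python
import random

random.seed(1)
p, eps = 0.3, 0.02   # illustrative head probability and tolerance
reps = 500           # number of simulated flip sequences per n

def deviation_prob(n):
    """Estimate P(|X-bar_n - p| > eps) from `reps` sequences of n flips."""
    count = 0
    for _ in range(reps):
        heads = sum(random.random() < p for _ in range(n))
        if abs(heads / n - p) > eps:
            count += 1
    return count / reps

# The estimated probabilities decrease toward 0 as n grows.
probs = [deviation_prob(n) for n in (50, 500, 5000)]
```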

Lecture 14

Recommended readings: WMS, sections 3.9, 4.9, 6.5

Moments

Remember how we learned that a CDF or a PDF/PMF completely characterizes the distribution of a random variable; in other words, it contains all you need to know about the law of randomness of that random variable. If two random variables (from different origins or sampling procedures) have the same CDF, then they have the same distribution.

What is a moment of F? In mathematics/statistics/mechanics, a moment is a specific quantitative measure of the shape of a set of points. (For simplicity, think of such points as potential realizations of random variables.) Would you agree that, if you have a set of points and you knew (1) the center of balance and (2) how spread out they are, then you already have a pretty good rough idea of the distribution? The first, the center of balance, is the first moment (and the mean). The spread (variance) is the second moment minus the square of the first moment. The story goes on: what if you knew which side of the mean the points are more concentrated on? This is given by the third moment, plus some multiples of the first and second moments. Would you agree that, as you go further and learn more such information about the shape of the points, you learn more about the exact distribution of the random variables? Indeed, knowing all the moments is (under conditions we will not worry about in this course) equivalent to knowing the CDF, from which we know all we need to know about the distribution!

In this lecture, we introduce a particular function associated to a probability distribution F which uniquely characterizes F. This function is called the moment-generating function (henceforth shortened to m.g.f.) of F because, if it exists, it allows us to easily compute any moment of F, i.e. any expectation of the type E(X^k) with k ∈ {0, 1, ...} for X ∼ F.
Moment generating functions

There are other functions similar to the m.g.f. which we will not discuss in this course: these include the probability-generating function for discrete probability distributions (which is a compact power series representation of the p.m.f.), the characteristic function (which is the inverse Fourier transform of a p.m.f./p.d.f. and always exists), and the cumulant-generating

function (which is the logarithm of the m.g.f.).

The moments {E(X^k)}_{k=1}^∞ associated to a distribution X ∼ F completely characterize F (if they exist). They can all be encapsulated in the m.g.f.

m_X(t) = E(e^{tX}) = ∫_{supp(X)} e^{tx} f_X(x) dx   if X is continuous,
m_X(t) = E(e^{tX}) = Σ_{x ∈ supp(X)} e^{tx} p_X(x)   if X is discrete.

If two random variables X and Y are such that m_X = m_Y, then their c.d.f.'s F_X and F_Y are equal at almost all points (i.e. they can differ at at most countably many points). We say that the moment generating function m_X of X ∼ F exists if there exists an open neighborhood around t = 0 in which m_X(t) is finite. Note that it is always true that, for X ∼ F with an arbitrary distribution F, m_X(0) = 1.

The name of this function comes from the following feature: suppose that m_X exists; then for any k ∈ {0, 1, ...}

(d^k/dt^k) m_X(t) |_{t=0} = E(X^k).    (66)

This means that we can generate the moments of X ∼ F from m_X by differentiating m_X and evaluating its derivatives at t = 0.

Exercise in class: Show that the m.g.f. of X ∼ Binomial(n, p) is

m_X(t) = [p e^t + 1 − p]^n

and use it to compute V(X).
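A quick numerical check of this machinery, as a sketch with illustrative values n = 10 and p = 0.3 (not part of the notes): the closed-form Binomial m.g.f. agrees with the direct sum over the p.m.f., and finite differences of m_X at t = 0 recover E(X) = np and V(X) = np(1 − p).

```python
import math

# Illustrative parameters (not from the notes).
n, p = 10, 0.3

def mgf_closed(t):
    """Closed-form Binomial m.g.f.: (p e^t + 1 - p)^n."""
    return (p * math.exp(t) + 1 - p) ** n

def mgf_direct(t):
    """E(e^{tX}) computed directly from the Binomial p.m.f."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) * math.exp(t * x)
               for x in range(n + 1))

# Moments via central finite differences of the m.g.f. at t = 0.
h = 1e-4
m1 = (mgf_closed(h) - mgf_closed(-h)) / (2 * h)                     # ~ E(X)
m2 = (mgf_closed(h) - 2 * mgf_closed(0.0) + mgf_closed(-h)) / h**2  # ~ E(X^2)
var = m2 - m1**2                                                    # ~ V(X)
```

Here `m1` should be close to np = 3 and `var` close to np(1 − p) = 2.1, matching the in-class derivation.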

Exercise in class: Show that the m.g.f. of X ∼ Uniform(0, 1) is

m_X(t) = (e^t − 1) / t.

M.g.f.'s also provide a tool which is often useful to identify the distribution of a linear combination of random variables. In the m.g.f. world, sums become products! In particular, consider a collection of n independent random variables X_1, ..., X_n with m.g.f.'s m_{X_1}, ..., m_{X_n}. It is easy to check from the definition of the m.g.f. that, if we consider Y = Σ_{i=1}^n (a_i X_i + b_i), then

m_Y(t) = e^{t Σ_{i=1}^n b_i} Π_{i=1}^n m_{X_i}(a_i t).    (67)

Exercise in class:

1. What is the mgf of X ∼ Poisson(λ)?
2. Consider Y ∼ Poisson(µ) with X and Y independent. What is the distribution of X + Y?
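The second exercise can be explored by simulation before you prove it with m.g.f.'s. The sketch below is not from the notes: the rates λ = 2 and µ = 3 are illustrative, and the small sampler is a standard Knuth-style Poisson generator. It compares the empirical p.m.f. of X + Y against a single Poisson p.m.f. with the combined parameter.

```python
import math
import random

random.seed(2)

def poisson_sample(lam):
    """Knuth-style Poisson sampler (fine for small lam)."""
    L = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > L:
        k += 1
        prod *= random.random()
    return k

lam1, lam2 = 2.0, 3.0   # illustrative parameters
reps = 20_000
sums = [poisson_sample(lam1) + poisson_sample(lam2) for _ in range(reps)]
mean_sum = sum(sums) / reps

# Compare the empirical p.m.f. of X + Y with the Poisson(lam1 + lam2) p.m.f.
lam = lam1 + lam2
max_err = max(abs(sums.count(k) / reps - math.exp(-lam) * lam**k / math.factorial(k))
              for k in range(15))
```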

Why are moment generating functions useful?

First, for X and Y whose moment generating functions are finite, m_X(t) = m_Y(t) for all t if and only if P(X ≤ x) = P(Y ≤ x) for all x. We will briefly prove this for discrete distributions X and Y with finite support. One direction is trivial: if the two have the same distribution, then of course m_X(t) = E(e^{tX}) = E(e^{tY}) = m_Y(t). The other direction is harder. Call A = supp(X) ∪ supp(Y), and a_1, ..., a_n the elements of A. Then the mgf of X is

m_X(t) = E(e^{tX}) = Σ_{x ∈ supp(X)} e^{tx} p_X(x) = Σ_{i=1}^n e^{t a_i} p_X(a_i),

and likewise m_Y(t) = Σ_{i=1}^n e^{t a_i} p_Y(a_i). Subtract the two to get

m_X(t) − m_Y(t) = Σ_{i=1}^n e^{t a_i} [p_X(a_i) − p_Y(a_i)] = 0,

which must hold for all t (in particular for all t near zero). Since the functions t ↦ e^{t a_i} for distinct a_i are linearly independent, the only way this sum can vanish identically is that p_X(a_i) = p_Y(a_i) for all i; hence the pmfs are the same, and the distributions are the same.

Second, if m_{X_n}(t) → m_X(t) for all t, and P(X ≤ x) is continuous in x, then P(X_n ≤ x) → P(X ≤ x). You can prove the central limit theorem with moment generating functions, assuming the moments are finite. (This proof is in the homework.)

As a bonus, we can derive a useful inequality called a large deviation bound (or inequality). Let S_n = X_1 + ... + X_n with the X_i iid with common mgf m(t) = E(e^{tX_1}) and mean µ, and assume m(t) is finite for some t > 0. Then, for any a > µ,

P(S_n ≥ an) ≤ e^{−n I(a)}

where I(a) is defined as I(a) = max{ at − log m(t) : t > 0 }. (Remember, you already have Markov's and Tchebysheff's inequalities in your toolbox.) We will see an application of the large deviation bound in the homework.
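As a concrete instance of the large deviation bound (a sketch, not from the notes; the fair-coin Bernoulli(1/2) setup and threshold a = 0.6 are illustrative), we can maximize at − log m(t) on a grid and compare the bound e^{−nI(a)} with the exact Binomial tail:

```python
import math

# Fair-coin example (illustrative): X_i ~ Bernoulli(1/2), so the m.g.f. is
# m(t) = E(e^{t X_1}) = (1 + e^t) / 2.
def log_mgf(t):
    return math.log((1.0 + math.exp(t)) / 2.0)

a, n = 0.6, 100   # threshold a > mu = 1/2, and number of summands

# Crude grid maximization of I(a) = max_{t > 0} { a*t - log m(t) }.
I = max(a * t - log_mgf(t) for t in (i / 1000.0 for i in range(1, 5000)))
bound = math.exp(-n * I)   # large deviation bound on P(S_n >= a*n)

# Exact tail probability for S_n ~ Binomial(n, 1/2).
k0 = math.ceil(a * n)
tail = sum(math.comb(n, k) for k in range(k0, n + 1)) / 2**n
```

The exact tail is roughly 0.03 while the bound is roughly 0.13: the bound is not tight for moderate n, but it decays exponentially in n, which Tchebysheff's inequality cannot give you.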

Homework 7 (due June 14th)

(For this section of the class, I highly recommend trying some problems in the WMS book, for the relevant sections as written in the lecture notes.)

1. (20 pts, 5 points each) Maximum likelihood estimation

(a) What is the MLE of the variance of X_i in a setting where we have n i.i.d. observations X_1, ..., X_n ∼ Bernoulli(p)?

(b) What is the MLE of the variance of X in a setting where we have one Binomial observation?

(c) What is the MLE of θ in a setting where we have samples X_1, ..., X_n ∼ Exponential(θ), where the exponential distribution is defined (differently from WMS) as f_{X_i}(x) = θ e^{−θx}? Is it easy to see whether this estimator is unbiased? Explain.

(d) The MLE has the remarkable property that it converges in distribution to a normal random variable (under some mild conditions that we will not worry about) in the following way:

√n (θ̂_MLE − θ) →_d N(0, 1/I(θ))

where I(θ) is Fisher's information number:

I(θ) = V[ (d/dθ) log f(X; θ) ] = −E[ (d²/dθ²) log f(X; θ) ].

Continuing with the setting of (c) and using this result about the asymptotic normality of the maximum likelihood estimator, what is a good approximate distribution of θ̂_MLE when n is large?

2. (30 pts) Let X_1, ..., X_10 iid ∼ N(µ, σ²). We don't know the value of σ², but we want to decide whether, based on the data X_1, ..., X_10, σ² ≤ σ₀² = 3 or not. We observe from the data that S² = s² = 10. Use the sampling distribution of S² (read lecture 13) to compute the probability of the event {S² ≥ s²} under the assumption that the true unknown variance is σ² = σ₀² (you will need to use

a statistical software and use the c.d.f. of a χ² with a certain number of degrees of freedom to compute this probability)⁴. Based on the probability that you computed, explain why the data X_1, ..., X_10 suggest that it is likely not true that σ² ≤ σ₀².

3. (20 pts) The times a cashier spends processing individual customers' orders are independent random variables with mean 2.5 minutes and standard deviation 2 minutes. What is the approximate probability that it will take more than 4 hours to process the orders of 100 people? You can leave your answer in terms of Φ.

4. (15 pts) Let X_1, ..., X_n be iid random variables with c.d.f. F, finite expectation µ and finite variance σ². Consider the random variable Y = a X̄_n + b for a, b ∈ R and a > 0. Assume that n is large. Give an approximate expression for the α-quantile of Y. Your answer will be in terms of Φ⁻¹.

5. (15 pts) Let X_1, ..., X_n iid ∼ Bernoulli(p) be a sequence of binary random variables where

X_i = 1 if the Pittsburgh Pirates won the i-th game against the Cincinnati Reds, and X_i = 0 if the Pittsburgh Pirates lost the i-th game against the Cincinnati Reds,

and p ∈ (0, 1). Suppose that we are interested in the odds that the Pittsburgh Pirates win a game against the Cincinnati Reds. To do that, we need to consider the function

g(p) = p / (1 − p)

representing the odds of the Pirates winning against the Reds. It is natural to estimate p by means of X̄_n. Provide a distribution that can be used to approximate probability statements about g(X̄_n).

6. Extra credit (10 pts) Let U_1, U_2, ..., U_n, ... be a sequence of random variables such that U_n →_d Uniform(0, 1). Consider the function g(x) =

⁴ We can safely plug in σ₀² because sup_{0 < σ² ≤ σ₀²} P_{σ²}(S² > s²) = P_{σ₀²}(S² > s²). Here the notation P_{σ₀²}(A) should be interpreted as the probability of the event A when we assume that the true unknown variance is σ₀².

−log x and the new sequence of random variables X_1, X_2, ..., X_n, ... with X_i = −log U_i. Prove that X_n →_d Exponential(1). Hint: use the method of the c.d.f. from lecture.

Lecture 15

Recommended readings: N/A

Introduction to Random Processes

A random process is a mathematical model of a probabilistic experiment that evolves in time and generates a sequence of (random) values. More formally, consider a collection of random variables X = {X_t}_{t ∈ T} defined on a sample space Ω endowed with a probability measure P, where the collection is indexed by a parameter t taking values in the set T. Such a collection is commonly referred to as a random process or a stochastic process. It suffices to think about t as time⁵, and T as the domain of time (for example, [0, ∞)).

Recall that a random variable is a function mapping the sample space Ω into a subset S of the real numbers. The set S is usually called the state space of the random process X. In fact, if at time t ∈ T the random variable X_t takes the value X_t = x_t ∈ S, we say that the state of the random process X at time t is x_t. Based on the features of the sets T and S, we say that X is a

- discrete-time random process, if T is a discrete set⁶
- continuous-time random process, if T is not a discrete set
- discrete-state random process, if S is a discrete set
- continuous-state random process, if S is not a discrete set.

It is clear from the above that X = {X_t}_{t ∈ T} is a function or mapping of both the elements ω ∈ Ω and the time parameter t:

X : Ω × T → S,  (ω, t) ↦ X_t(ω).

We can therefore think of X in at least two ways:

- as a collection of random variables X_t taking values in S; i.e. for any fixed time t, the random process X corresponds to a random variable X_t valued in S

⁵ t can be more generally thought of as time or space (in the latter case we call this a spatial process), or just any indexing that keeps track of the evolution of the random variable.
⁶ i.e. T is finite or countably infinite

- as a collection of random functions of time ("trajectories"); i.e. for any fixed ω ∈ Ω, we can view X as the sample path or trajectory t ↦ X_t(ω), which is a (random) function of time.

Figure 2: Illustration of sample paths from a stochastic process.

In contrast to the study of a single random variable or of a collection of independent random variables, the analysis of a random process puts much more focus on the dependence structure between the random variables in the process at different times, and on the implications of that dependence structure. Because there are many real-world phenomena that can be interpreted and modeled as random processes, there is a wide variety of random processes. In this class, however, we will focus only on the following processes:

1. Bernoulli process: discrete-time and discrete-state,
2. Poisson process: continuous-time and discrete-state,
3. Brownian motion (the Wiener process): continuous-time and continuous-state,
4. Discrete-time Markov chains: discrete-time and any state space.

We can further characterize the Bernoulli process and the Poisson process as arrival-type processes: in these processes, we are interested in occurrences

that have the character of an arrival, such as customers walking into a store, message receptions at a receiver, incoming telephone calls, etc. In the Bernoulli process arrivals occur in discrete time, while in the Poisson process arrivals occur in continuous time.

The Bernoulli Process

The Bernoulli process is one of the most elementary random processes. It can be visualized as an infinite sequence of independent coin tosses where the probability of heads in each toss is a fixed number p ∈ [0, 1]. Thus, a Bernoulli process is essentially a sequence of iid Bernoulli random variables. Question: what is the state space of a Bernoulli process?

The Poisson Process

The Poisson process is a counting random process in continuous time. The basic idea is that we are interested in counting the number of events (occurring at a certain rate λ > 0) that have occurred up to a certain time t. For example, imagine that you own a store and you are counting the number of customers that have entered your store up to a certain time t. A Poisson process is a probabilistic model that can be used to model this phenomenon. In a Poisson process, the number of events that have occurred up to a certain time t is a random variable that has a Poisson distribution with parameter λt. We will give a more precise definition of the Poisson process in the next lecture, and we will study some interesting properties of this process. Question: what is the state space of a Poisson process?

Brownian Motion

Brownian motion is one of the most important and well-studied continuous-time stochastic processes. It has almost surely continuous sample paths which are almost everywhere non-differentiable. Brownian motion is part of a class of random processes called diffusion processes; these processes usually arise as the solution of a stochastic differential equation. Brownian motion is the key process for the definition of stochastic integrals, i.e.
integrals of the type

∫_0^T X_t(ω) dB_t(ω)    (68)

where {B_t}_{t=0}^T is a Brownian motion and {X_t}_{t=0}^T is another random process. The branch of mathematics devoted to studying the properties of these

integrals is called stochastic calculus. From a more practical point of view, the study of Brownian motion is historically associated with modeling the motion of a particle suspended in a fluid, where the motion results from the collisions of the particle with the atoms and molecules forming the gas or liquid in which the particle is immersed. Question: what is the state space of a Brownian motion?

Discrete-Time Markov Chains

Informally, a Markov chain is a random process with the property that, conditional on the present state of the process, the future states of the process are independent of the past states. Mathematically, this condition is described by the Markov property

P(X_n = x_n | X_{n−1} = x_{n−1}, X_{n−2} = x_{n−2}, ..., X_0 = x_0) = P(X_n = x_n | X_{n−1} = x_{n−1}).    (69)

How would you compactly summarize the Markov chains in the pictures below? First, notice that the state space is discrete (let's say K possible discrete states exist). We can use a transition probability matrix P ∈ R^{K×K},

which characterizes the probability of a transition from one state to another, across time.
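As a concrete illustration (the two-state chain below is invented for this sketch; it is not one of the pictures referenced above), we can simulate a chain directly from its transition matrix and compare the long-run frequency of a state with its stationary probability:

```python
import random

random.seed(3)

# Transition probability matrix (rows sum to 1): P[i][j] = P(X_{n+1}=j | X_n=i).
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(state):
    """One transition of the chain: sample the next state from row P[state]."""
    return 0 if random.random() < P[state][0] else 1

steps = 200_000
state, visits0 = 0, 0
for _ in range(steps):
    state = step(state)
    visits0 += (state == 0)

freq0 = visits0 / steps
# For a two-state chain, the stationary probability of state 0 is
# pi_0 = P[1][0] / (P[0][1] + P[1][0]) = 5/6 here.
pi0 = P[1][0] / (P[0][1] + P[1][0])
```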

Lecture 16

Recommended readings: Ross, sections

The Bernoulli Process and the Poisson Process

In lecture (and in-class examples), we will define the Bernoulli and Poisson processes and their useful properties. Then, in the homework problems, you will learn further aspects, how to apply them in several settings, and how to calculate/identify useful quantities.

The Bernoulli Process

Let X = {X_i}_{i=1}^∞ be a sequence of iid Bernoulli(p) random variables. The random process X is called a Bernoulli process.

Question: describe the sample paths of a Bernoulli process. Answer:

The Bernoulli process can be thought of as an arrival process: at each time i either there is an arrival (i.e. X_i = 1; equivalently stated, we observe a success) or there is not an arrival (i.e. X_i = 0; equivalently stated, we observe a failure). Many of the properties of a Bernoulli process are already known to us from previous discussions.

Question: Consider n distinct times; what is the probability distribution of the number of arrivals in those n times? Answer:

Question: What is the probability distribution of the time of the first arrival? Answer:

Question: Consider the time needed until the r-th arrival (r ∈ {1, 2, ...}); what is the probability distribution associated to the time of the r-th arrival? Answer:
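These questions are meant to be answered in class, but you can check your answers by simulation. The sketch below (illustrative p = 0.25, not from the notes) tabulates the empirical distribution of the first-arrival time and compares it with the Geometric(p) p.m.f., which you should verify matches your in-class answer:

```python
import random

random.seed(4)
p = 0.25          # illustrative success probability
reps = 50_000

def first_arrival():
    """Index of the first success in a simulated Bernoulli(p) process."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

times = [first_arrival() for _ in range(reps)]
mean_time = sum(times) / reps   # should be close to 1/p

# Empirical frequency of {first arrival at time k} vs the Geometric(p) p.m.f.
emp = [times.count(k) / reps for k in range(1, 6)]
geo = [p * (1 - p) ** (k - 1) for k in range(1, 6)]
max_err = max(abs(e - g) for e, g in zip(emp, geo))
```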

The independence of the random variables in the Bernoulli process has an important implication for the process. Consider a Bernoulli process X = {X_i}_{i=1}^∞ and the process X_n = {X_i}_{i=n+1}^∞. Because the random variables in X_n are independent of the random variables in X \ X_n, it follows that X_n is itself a Bernoulli process (starting at time n) which does not depend on the initial n random variables of X. We say that the Bernoulli process thus satisfies the fresh-start property.

Question: Recall the memoryless property of the Geometric distribution. How do you interpret this property in light of the fresh-start property of the Bernoulli process?

The Bernoulli process itself may not be the most interesting Markov process, since it trivially satisfies the Markov property (all draws are independent). What if we thought about Y_n = X_n mod 10? Now things get more interesting. (You'll see this in the next homework.) Question: Can you prove that Y_n is still a Markov process?

The Poisson Process

(Credit where credit is due: this material was adapted from Dr. Virtamo's lecture notes.)

Motivation: The Poisson process is one of the most important models used in queueing theory. The arrival process of customers is well modeled by a Poisson process. In teletraffic theory the customers may be calls or packets. A Poisson process is a viable model when the calls or packets originate from a large population of independent users. In the following, it is instructive to think that the Poisson process we consider represents discrete arrivals (of e.g. calls or packets).

Picture of arrival times goes here:

A random process X = {X_t}_{t ∈ T} is called a counting process if X_t represents the number of events that occur by time t. More properly, if X is a counting process, it must also satisfy the following intuitive properties:

- for any t ∈ T, X_t ≥ 0

- for any t ∈ T, X_t ∈ Z
- if s ≤ t, then X_s ≤ X_t
- for s ≤ t, X_t − X_s is the number of events occurring in the time interval (s, t].

A counting process X is said to have independent increments if the numbers of events occurring in disjoint time intervals are independent. This means that the random variables X_t − X_s and X_v − X_u are independent whenever (s, t] ∩ (u, v] = ∅. Furthermore, a counting process X is said to have stationary increments if the distribution of the number of events that occur in a time interval depends only on the length of the time interval. This means that, for each fixed t, the random variables X_{t+s} − X_s have the same distribution for all s ∈ T.

Example: Consider a Bernoulli process X = {X_i}_{i=1}^∞ and the process Y = {Y_i}_{i=1}^∞ with Y_i = Σ_{j ≤ i} X_j. Then Y is a discrete-time counting process with independent and stationary increments.

Definition 1: We are now ready to rigorously define what a Poisson process is. A counting process X = {X_t}_{t ≥ 0} is said to be a Poisson process with rate λ > 0 if

- X_0 = 0
- X has independent increments
- the number of events in any time interval of length t > 0 is Poisson distributed with expectation λt, i.e. for any s, t ≥ 0 we have X_{t+s} − X_s ∼ Poisson(λt).

Question: Does a Poisson process have stationary increments? Answer:

Definition 2: A pure birth process: in an infinitesimal time interval dt, there is at most one arrival, and an arrival happens with probability λ dt, independently of arrivals outside of the interval.

Definition 3: A counting process whose inter-arrival times follow an Exponential(λ) distribution.

Notice that in all of these definitions there is no restriction that t must take values in some discrete set, so this is a continuous-time random process.
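The equivalence of these definitions can be previewed numerically. The sketch below (illustrative λ = 2 and t = 3, not from the notes) generates the process via Exponential(λ) inter-arrival times as in definition 3, and checks that the count X_t has mean and variance close to λt, as definition 1 predicts for a Poisson(λt) random variable.

```python
import random
import statistics

random.seed(5)
lam, t = 2.0, 3.0   # illustrative rate and time horizon
reps = 20_000

def count_arrivals():
    """Definition 3: accumulate Exponential(lam) inter-arrival times and
    return the number of arrivals that land in (0, t]."""
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)
        if total > t:
            return n
        n += 1

counts = [count_arrivals() for _ in range(reps)]
mean_c = statistics.fmean(counts)
var_c = statistics.pvariance(counts)
# Definition 1 predicts X_t ~ Poisson(lam * t), so mean = variance = lam * t = 6.
```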

It is interesting to prove the equivalence of the three definitions, and we will do so now.

1 ⇒ 2: First, definition 1 says:

P(X_t = x) = (λt)^x e^{−λt} / x!

Then consider these three outcomes of X_dt (denoting dt by h sometimes):

1. P(X_dt = 0) = e^{−λ dt} = 1 − λ dt + o(dt) = 1 − λh + o(h)
2. P(X_dt = 1) = (λ dt / 1!) e^{−λ dt} = λ dt − λ²(dt)² + ... = λh + o(h)
3. P(X_dt ≥ 2) = 1 − P(X_dt = 0) − P(X_dt = 1) = o(h), since the leading term (λ dt)² e^{−λ dt} / 2! is already o(h).

The second case proves our point. The others are interesting too; keep reading after these proofs.

1 ⇒ 3: Denote by Y the time between two arrivals. Then notice the clever fact that

{Y > t} ⇔ {X_t = 0}.

(Why? Because if X_t ≥ 1, then an arrival has happened by time t, so the inter-arrival time is less than t.) Now take the probability of these two events and equate them:

P(Y > t) = P(X_t = 0) = e^{−λt},

and manipulate this further to get the c.d.f. of Y:

P(Y ≤ t) = 1 − P(Y > t) = 1 − e^{−λt},

which we recognize as an Exponential(λ) c.d.f.

3 ⇒ 2: Exercise in class: If Y ∼ Exponential(λ), then proceed by computing P(Y ≤ dt) = ... (hint: use the Taylor expansion e^{−a} = Σ_{x=0}^∞ (−1)^x a^x / x!)

2 ⇒ 1: First, we introduce the probability generating function (PGF for short). A PGF is a convenient representation for discrete random variables that take values in {0, 1, 2, ...}:

G(z) = E[z^X] = Σ_{x=0}^∞ p(x) z^x.

This is useful because it allows a succinct description of the probability distribution via

P(X = i) = G^{(i)}(0) / i!,

where G^{(i)} denotes the i-th derivative of G with respect to z.

What is the PGF of a Poisson random variable X ∼ Pois(λ)?

G_X(z) = E(z^X) = Σ_{x=0}^∞ z^x e^{−λ} λ^x / x! = e^{−λ} Σ_{x=0}^∞ (zλ)^x / x! = e^{−λ} e^{zλ} = e^{λ(z−1)}.

Now let's consider the PGF of the counter X_{0,t}, which counts how many occurrences happen between time 0 and t:

G_t(z) = E(z^{X_{0,t}}).

By stationary and independent increments,

G_{t+dt}(z) = E(z^{X_{0,t+dt}}) = E(z^{X_{0,t} + X_{t,t+dt}}) = E(z^{X_{0,t}}) E(z^{X_{t,t+dt}}).

The last factor is (1 − λ dt) z⁰ + λ dt z¹ + o(dt), so

G_{t+dt}(z) − G_t(z) = −λ dt (1 − z) G_t(z) + o(dt),

i.e. (d/dt) G_t(z) = −λ(1 − z) G_t(z). With G_0(z) = 1, the solution is G_t(z) = e^{λt(z−1)}, which we recognize as the PGF of a Poisson(λt) random variable, as claimed in definition 1.

Food for thought: What is definition 2 saying? Now that we've examined it a bit more carefully, we can say that a counting process X = {X_t}_{t ≥ 0} is a Poisson process with rate λ > 0 if

1. X_0 = 0
2. X has stationary and independent increments
3. P(X_h = 1) = λh + o(h)
4. P(X_h ≥ 2) = o(h)

Condition 3 in definition 2 says that the probability of observing an arrival in a small amount of time h is roughly proportional to h. Condition 4 says that it is very unlikely to observe more than one arrival in a short amount of time.

Question: So we know the probability distribution of the count of arrivals up to any time in a Poisson process. What about the probability distribution of the inter-arrival times {T_i}_{i=1}^∞?

First, define the n-th arrival time (n = 0, 1, ...) as:

Y_0 = 0,  Y_n = min{t : X_t = n}.

Then we define the inter-arrival times as the random variables

T_n = Y_n − Y_{n−1},  n = 1, 2, ...

As we saw in the proof of definition 3, notice that the event {T_1 > t} is equivalent to the event "there are no arrivals in the time interval [0, t]", i.e. {X_t = 0}. Thus,

P(T_1 > t) = P(X_t = 0) = e^{−λt},

from which we see that T_1 ∼ Exponential(λ). Now, what is the distribution of T_2? We have

P(T_2 > t | T_1 = s) = P(X_{t+s} − X_s = 0 | T_1 = s) = P(X_{t+s} − X_s = 0 | X_s − X_0 = 1) = P(X_{t+s} − X_s = 0) = e^{−λt},

which implies that T_2 ∼ Exponential(λ) as well. By using the same argument, it is easy to see that all the inter-arrival times of a Poisson process are iid Exponentially distributed random variables with parameter λ.

Question: what is the distribution of the waiting time until the n-th arrival?

A Poisson process is a Markov process (i.e. it is a random process that satisfies the (weak) Markov property), as an immediate consequence of the first definition. Namely, conditional on X_{t_1} = x, the distribution of a subsequent X_{t_2} for t_2 > t_1 is independent of what happened prior to t_1, because (X_{t_2} − X_{t_1}) | {X_{t_1} = x} = X_{t_2} − x ∼ Pois(λ(t_2 − t_1)), where x and t_1 can

now be treated as constants! X_{t_2} − x has a completely known (Poisson) distribution that uses no information about states before t_1.

The Poisson Process as a Continuous-time Version of the Bernoulli Process

Fix an arrival rate λ > 0 and some time t > 0, and divide the time interval (0, t] into n subintervals of length h = t/n. Consider now a Bernoulli process X = {X_i}_{i=1}^n defined over these time subintervals, where each X_i is a Bernoulli random variable recording whether there was an arrival in the time interval ((i−1)h, ih]. Impose now conditions 3 and 4 of the definition of the Poisson process that we described above. We have that the probability p of observing an arrival in any of these subintervals is

p = λh + o(h) = λt/n + o(1/n).

Thus, the number of subintervals in which we record an arrival has a Binomial(n, p) distribution with p.m.f.

p(x) = (n choose x) (λt/n + o(1/n))^x (1 − λt/n + o(1/n))^{n−x},  x ∈ {0, 1, ..., n}.

Now let n → ∞, or equivalently h → 0, so that the partition of (0, t] becomes finer and finer and we approach the continuous-time regime. Following the same limit calculations that we did in Homework 4, as n → ∞ we have

p(x) → e^{−λt} (λt)^x / x!,  x ∈ {0, 1, ...},

which is the p.m.f. of a Poisson(λt) distribution. Thus, the Poisson process can be thought of as a continuous-time version of the Bernoulli process.
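The limit above can be checked exactly on a computer. The sketch below (illustrative λt = 3, not from the notes) computes the largest pointwise gap between the Binomial(n, λt/n) p.m.f. and the Poisson(λt) p.m.f., which shrinks as n grows:

```python
import math

target = 3.0   # the Poisson parameter lam * t (illustrative)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def pois_pmf(x):
    return math.exp(-target) * target**x / math.factorial(x)

def gap(n):
    """Largest pointwise gap between Binomial(n, target/n) and Poisson(target)."""
    p = target / n
    return max(abs(binom_pmf(x, n, p) - pois_pmf(x)) for x in range(21))

# The gap decreases toward 0 as the partition of (0, t] becomes finer.
gaps = [gap(n) for n in (10, 100, 1000, 10000)]
```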

Homework 8 (due June 16th)

1. (10 pts) Prove the weak law of large numbers under the additional assumption that the variance σ² of the iid sequence X_1, X_2, ..., X_n, ... exists and is finite. Hint: use Tchebysheff's inequality. (Go back to earlier lecture notes and homework for how Tchebysheff's inequality was used.)

2. (10 pts) Consider the sequence of random variables X_1, X_2, ..., X_n, ... with X_n ∼ N(µ, 3/(1 + log n)). Show that X_n →_p µ. (Hint: use Tchebysheff again.)

3. (15 pts, 5 pts each)
(a) What is the mgf of X_i ∼ Bernoulli(p)?
(b) What is the mgf of X = Σ_i X_i?
(c) What is the distribution of X? (We knew this, but now we know a convenient way of proving it!)

4. (10 pts) Sum of Poissons:
(a) Prove that the sum of independent Poisson distributed X, Y with parameters λ_1, λ_2 is a Poisson random variable. (What is the parameter?) (Hint: the following formula, sometimes called the discrete convolution formula, for Z = X_1 + X_2,

P(X_1 + X_2 = z) = f_Z(z) = Σ_{x=0}^z f_{X_1}(x) f_{X_2}(z − x),

may come in handy.)
(b) Prove the above with moment generating functions.

5. (10 pts) Assume that X_1, ..., X_n have finite moment generating functions; i.e. m_{X_i}(t) < ∞ for all t, for all i = 1, ..., n. Prove the central limit theorem again, using moment generating functions. (Hint: come to office hours for a hint.)

6. (10 pts) Using the central limit theorem for Poisson random variables, compute the value of

lim_{n→∞} e^{−n} Σ_{i=0}^n n^i / i!

7. (15 pts) Roll a fair die n times, and call the outcomes X_i. Then estimate the probability that S_n = Σ_i X_i exceeds its expectation by at least n, for n = 100 and n =
(5 pts) Use Tchebysheff's inequality.
(5 pts) Obtain the mgf of S_n.
(5 pts) Use large deviation bounds. (Hint: you will need to use a computer program of your choice to get a component in the large deviation bound.)

8. (10 pts) In each game, a gambler wins the dollars he bets with probability p and loses them with probability 1 − p. If he has less than $3, he will bet all that he has. Otherwise, since his goal is to have $5, he will bet only the difference between $5 and what he has. He continues to bet until he has either $0 or $5. Let X_n denote the amount he has right after the n-th bet.
(a) Draw the transition graph.
(b) Classify the states. (We haven't covered this material yet! Full points given.)

9. (10 pts) The number of failures X_t which occur in a computer network over the time interval [0, t) can be described by a homogeneous Poisson process {X_t, t ≥ 0}. On average, there is a failure after every 4 hours, i.e. the intensity of the process is equal to λ = 0.25.
(a) What is the probability of at most 1 failure in [0, 8), at least 2 failures in [8, 16), and at most 1 failure in [16, 24) (time unit: hour)?
(b) What is the probability that the third failure occurs after 8 hours?

10. Extra credit (10 pts) Prove the large deviation bound. (Hint: start with the Markov inequality.)

123 Lecture 17 Recommended readings: Ross, sections 5.3.4, 5.3.5, Splitting, Merging, Further Properties of the Poisson Process, the Nonhomogeneous Poisson Process, and the Spatial Poisson Process Splitting a Bernoulli Process *႘ Consider a Bernoulli process of parameter p [0, 1]. Suppose that we keep any arrival with probability q [0, 1] or otherwise discard it with probability 1 q. The# new process I thusǐ obtained is èč č still a Bernoulli ºč č O = Ǐ Ǐ M! ŗ!ǐ! Ǐ Ú + + Ċč`č Pč Ǐ process with Čč -. áč example ùč Ҍߥof Ԧ MƯ & Ǐ h&ū Æ Ǐ M Ǐ parameter pq.ø 2This is an thinned Bernoulli č process. Similarly, č!ǐ I č Ǐ Ǐ Ǐ? Ǐ H z / Ŭ Ŗ KnjǏ -@ mǐ č č the process discarded is âč aď thinned Bernoulli process -@obtained ÔǏ Ǐ ÝŒİ ÄǏ by 7 the =^ Ǐ! KǏ Ƹ ĺ Ǐ đ ć arrivals ƀ Ǐ č ś Ǐ Ǐ Ǐ ƹǐ /' q)). Ø\č Oč ӫ ' /' ''ؿ as well (with parameter p(1 Ǐ + Ǐ ř ߥۯ č Ù ŭ Ǐ ÕǏ Ǐ Ǐ œű Ǐ ơ g nǐ Ļ Ǐ 9č č : + G + Ɛ႘ ෩႘ Qč Ǐ This can be easily generalized to the case where e! weǐ Ƣò split the original Ǐ ƍ ø AǏ process in more than two subprocesses. Ʒw AǏ #U č č a Uþ č ȥ ط g č `i %č ͳ - ÅǏ ǰ Չ ܨ ک ႘ Ͱ ஆθ ༪ அ ધ႘ŋ੦႘ Է ނ ʋปഅŋ ౪இ႘ Ͱผŋफ़ʋΓΓ႘Ԋ႘ /ժ١չ Tű Ǐ ႘ Merging a Bernoulli aէ ĭmňňm كڇ YXč ŊԷ ӑβէ ҕ ε ԯ TǏ DŽ Ǝƣ Ǐ ëč \č ࢱ ႘ ņߥ g ǍƁǏ $ gǐ - m Ǐ Ƥƅě ƺ DžĜ Nj@Ǐ ѧ Ӑߥ Process Ç ႘ Ǐ K Ǐ ú= č YÎX ã č řԧߥ ܥ Ӱ ڹ ҵߥ ; ල႘ B႘ ෨႘ ව෪႘ Ċ ǏBernoulli ŲŔ Ǐ : Ҕ ႘ Ǔ à one ȹ LJ Ҁ with - Ǐ Qč 6 č T Dh Ǐ ]č Consider two independent processes, parameter p and äč Û ĕ Ä^ č. ² ^ č 2 č í îč ႘ the otherã with parameterƥ bǎǐ q. Consider the new process obtained by óə Ǐ ( Ɔ: ƦƧǏ éč O nǐ c Ǐ ïą--mðñčübù G Ï č ႘ ô -Ǐ ࢲε႘ Ğ Ǐ ƻ ŕ Ū Ǐ K Ǐ žũĝ ı ďǐ 1 ifμthere was an arrival at time i in either process Γ ߥ Xi = č 0 otherwise. / ॲ ਓമςਔเ႘ Ș Ҟϻ႘ òÿb Íč Pč Ņ ¾ ط Ȧ ط Ǐ z U ҖnnԷ Է Ƽ ǏVԷ The process thus obtained is a Bernoulli process with parameter p + q pq. ҕӯߥ ħôĉ ĐǏ / ; ļǐ ::+hr Ǐ ğ ųǐ This is easily generalized to the case Öwhere we merge more than two independent Bernoulli processes. 

Splitting a Poisson Process
Consider a Poisson process X with rate λ > 0. Suppose that there can be two different types of arrivals in the process, say type A and type B, that each arrival is classified as type A with probability p [0, 1] and as type B with probability 1 - p, and that the classification is done independently of the rest of the process. Let X_A and X_B denote the counting processes associated with arrivals of type A and type B, respectively. Then X_A and X_B are Poisson processes with rates λp and λ(1 - p), respectively. Furthermore, X_A and X_B are independent processes. The two processes X_A and X_B are examples of thinned Poisson processes. This can be easily generalized to the case where we consider n >= 3 different types of arrivals.
Merging two Poisson Processes
Consider two independent Poisson processes X_1 and X_2 with rates λ > 0 and µ > 0 respectively, and consider the new process X = X_1 + X_2. The merged process X is a Poisson process with rate λ + µ. Furthermore, any arrival in the process X has probability λ/(λ + µ) of originating from X_1 and probability µ/(λ + µ) of originating from X_2. This is easily generalized to the case where more than two Poisson processes are merged.
Conditional Distribution of the Arrival Times
Suppose that we know that the number of arrivals up to time t > 0 in a Poisson process with rate λ is n > 0. What is the conditional distribution of the n arrival times S_1,..., S_n given that there were exactly n arrivals?
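The continuous-time analogues of the Bernoulli results can be checked the same way. The sketch below (function names are mine) simulates a Poisson process via exponential interarrival gaps, then verifies that thinning with probability p gives rate λp and that superposing two independent streams gives rate λ + µ.

```python
import random

def poisson_arrival_times(rate, t_max, rng):
    """Arrival times of a rate-`rate` Poisson process on [0, t_max),
    built from i.i.d. Exponential(rate) interarrival gaps."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= t_max:
            return times
        times.append(t)

rng = random.Random(2)
lam, mu, p, t_max = 2.0, 0.5, 0.3, 10_000.0

arrivals = poisson_arrival_times(lam, t_max, rng)

# Splitting: classify each arrival as type A independently with probability p.
type_a = [s for s in arrivals if rng.random() < p]
rate_a = len(type_a) / t_max
print(rate_a)  # ~ lam * p = 0.6

# Merging: superpose two independent streams and check the combined rate.
merged = sorted(arrivals + poisson_arrival_times(mu, t_max, rng))
print(len(merged) / t_max)  # ~ lam + mu = 2.5
```

The empirical rates converge to λp and λ + µ as t_max grows, by the law of large numbers applied to the arrival counts.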


More information

More on Distribution Function

More on Distribution Function More on Distribution Function The distribution of a random variable X can be determined directly from its cumulative distribution function F X. Theorem: Let X be any random variable, with cumulative distribution

More information

Analysis of Engineering and Scientific Data. Semester

Analysis of Engineering and Scientific Data. Semester Analysis of Engineering and Scientific Data Semester 1 2019 Sabrina Streipert s.streipert@uq.edu.au Example: Draw a random number from the interval of real numbers [1, 3]. Let X represent the number. Each

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

4/17/2012. NE ( ) # of ways an event can happen NS ( ) # of events in the sample space

4/17/2012. NE ( ) # of ways an event can happen NS ( ) # of events in the sample space I. Vocabulary: A. Outcomes: the things that can happen in a probability experiment B. Sample Space (S): all possible outcomes C. Event (E): one outcome D. Probability of an Event (P(E)): the likelihood

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 34 To start out the course, we need to know something about statistics and This is only an introduction; for a fuller understanding, you would

More information