Chapter 2: Introduction to Probability

2.1 Probability Model

Probability concerns the chance of observing a certain outcome resulting from an experiment. However, since chance is an abstraction of something not physically measurable, we need a precise mathematical definition. In this chapter, we will use the following definition of a probability model, which consists of three components:

Sample Space: Specifies all possible outcomes of an experiment.
Event: Specifies a particular outcome or combination of outcomes.
Probability Law: Specifies how likely an event is to occur.

A pictorial summary of these concepts is illustrated in Figure 2.1.

Figure 2.1: Illustration of sample space, event and probability law.
Sample Space

Definition 1. A sample space Ω is the set of all possible outcomes of an experiment. We denote by ω an element of Ω.

Examples:
Coin flip: Ω = {H, T}.
Throw a die: Ω = {1, 2, 3, 4, 5, 6}.
Waiting time for a bus in West Lafayette: Ω = {t : 0 ≤ t ≤ 30 minutes}.

In the last example, we see that a sample space can be continuous.

Counterexamples:
Throw a die: Ω = {1, 2, 3} is not a sample space because it is not exhaustive.
Throw a die: Ω = {1, 1, 2, 3, 4, 5, 6} is not a sample space because its elements are not exclusive.

Therefore, in order to make a valid sample space, we have to make sure Ω contains all possible outcomes and there is no repetition among the outcomes.

Event

Definition 2. An event F is a subset of the sample space Ω.

Note that an outcome ω is an element of Ω, but an event F is a subset contained in Ω, i.e., F ⊆ Ω. Thus, an event can contain one single outcome, but it can also contain many outcomes.

Example: Throw a die. Let Ω = {1, 2, 3, 4, 5, 6}.
F_1 = {even numbers} = {2, 4, 6}.
F_2 = {less than 3} = {1, 2}.

Example: Wait for a bus. Let Ω = {t : 0 ≤ t ≤ 30}.
F_1 = {wait less than 10 minutes} = {t : 0 ≤ t < 10}.
F_2 = {wait less than 5 or more than 20 minutes} = {t : 0 ≤ t < 5} ∪ {t : 20 < t ≤ 30}.

© 2017 Stanley Chan. All Rights Reserved.
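Events and set operations like the ones above can be sketched directly with Python sets. This is an illustrative sketch of my own, not part of the text: the continuous bus-waiting sample space is discretized to whole minutes so that it fits in a finite set.

```python
# Hypothetical sketch: the bus-waiting sample space, discretized to whole
# minutes (0..30) for illustration; events are then ordinary subsets.
omega = set(range(0, 31))

F1 = {t for t in omega if t < 10}            # wait less than 10 minutes
F2 = {t for t in omega if t < 5 or t > 20}   # less than 5 or more than 20

# Events are subsets of the sample space ...
assert F1 <= omega and F2 <= omega
# ... and set operations build new events, e.g. union and intersection.
assert F1 | F2 == {t for t in omega if t < 10 or t > 20}
assert F1 & F2 == {0, 1, 2, 3, 4}
```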
In this example, we see that we can create a new event by operating on existing events through set operations, e.g., union, intersection, etc. Formally, we define the collection of all possible events as the event space.

Definition 3. The collection of all possible events is called the event space or the σ-field, denoted by F. An event space satisfies the following two properties:
If F ∈ F, then also F^c ∈ F.
If F_1, F_2, ... ∈ F, then the union ∪_i F_i ∈ F.

The two properties of the event space are essential to ensure that every event that can be generated from existing events is included, because any other set operation can be derived from complement and union.

Example. In a coin flip experiment where Ω = {H, T}, the event space is F = {∅, {H}, {T}, Ω}.

Probability Law

Definition 4. A probability law is a function P : F → [0, 1] that maps an event A to a real number in [0, 1]. The function must satisfy three axioms, known as the axioms of probability:
I. Non-negativity: P[A] ≥ 0, for any A ⊆ Ω.
II. Normalization: P[Ω] = 1.
III. Additivity: For any disjoint sets {A_1, A_2, ...}, it holds that

P[∪_i A_i] = Σ_i P[A_i].

The non-negativity axiom ensures that a probability value cannot be negative. The normalization axiom ensures that the probability of observing all possible outcomes is 1. The additivity axiom defines how set operations can be translated into probability operations. The infinite number of sets in the axiom makes it applicable to both discrete and continuous sample spaces.

Finite Additivity. The countable additivity stated in Axiom III involves an infinite number of sets. As a special case, we can reduce the infinite collection to a finite one, which states that for any two disjoint sets A and B, we have

P[A ∪ B] = P[A] + P[B].
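For a discrete sample space such as a fair die, the axioms can be checked numerically. The setup below is a sketch of my own, not from the text; exact fractions avoid floating-point issues.

```python
from fractions import Fraction

# A probability law on a fair die: each outcome carries mass 1/6,
# and P of an event is the sum of the masses of its outcomes.
omega = {1, 2, 3, 4, 5, 6}
mass = {w: Fraction(1, 6) for w in omega}

def P(event):
    return sum(mass[w] for w in event)

A, B = {2, 4, 6}, {1}                # two disjoint events
assert P(A) >= 0                     # Axiom I: non-negativity
assert P(omega) == 1                 # Axiom II: normalization
assert P(A | B) == P(A) + P(B)       # finite additivity for disjoint A, B
```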
In words, if A and B are disjoint, then the probability of observing either A or B is the sum of the two individual probabilities. The union of A and B is equivalent to the logical OR. Once this OR operation is defined, all other logical operations can be subsequently defined. The following corollaries are some examples.

Corollary 1. P[A^c] = 1 − P[A].

Proof. Since Ω = A ∪ A^c and A and A^c are disjoint, by finite additivity we have P[Ω] = P[A ∪ A^c] = P[A] + P[A^c]. By the normalization axiom, we have P[Ω] = 1. Therefore, P[A^c] = 1 − P[A].

Corollary 2. For any A ⊆ Ω, P[A] ≤ 1.

Proof. We prove by contradiction. Assume P[A] > 1. Consider the complement A^c, where A ∪ A^c = Ω. Since P[A^c] = 1 − P[A], we must have P[A^c] < 0 because by hypothesis P[A] > 1. But P[A^c] < 0 violates the non-negativity axiom. So we must have P[A] ≤ 1.

Corollary 3. P[∅] = 0.

Proof. Since ∅ = Ω^c, by the first corollary we have P[∅] = 1 − P[Ω] = 0.

Corollary 4. For any A and B, P[A ∪ B] = P[A] + P[B] − P[A ∩ B]. Note that this statement is different from Axiom III because A and B are not necessarily disjoint.

Proof. First, observe that A ∪ B can be partitioned into three disjoint subsets as A ∪ B = (A\B) ∪ (A ∩ B) ∪ (B\A). Since A\B = A ∩ B^c and B\A = B ∩ A^c, by finite additivity we have

P[A ∪ B] = P[A\B] + P[A ∩ B] + P[B\A]
= P[A ∩ B^c] + P[A ∩ B] + P[B ∩ A^c]
(a)= P[A ∩ B^c] + P[A ∩ B] + P[B ∩ A^c] + P[A ∩ B] − P[A ∩ B]
(b)= P[A ∩ (B^c ∪ B)] + P[(A^c ∪ A) ∩ B] − P[A ∩ B]
= P[A ∩ Ω] + P[Ω ∩ B] − P[A ∩ B]
= P[A] + P[B] − P[A ∩ B],

where in (a) we added and subtracted a term P[A ∩ B], and in (b) we used finite additivity so that, for instance, P[A ∩ B^c] + P[A ∩ B] = P[(A ∩ B^c) ∪ (A ∩ B)] = P[A ∩ (B^c ∪ B)].

The above proof is a rigorous way of deriving the result. A simpler way to visualize the result is to draw a Venn diagram and observe that A ∪ B contains the overlap A ∩ B, which needs to be subtracted.
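Corollary 4 can also be verified numerically. The following sketch (my own setup, assuming a fair die with the uniform law) checks the identity on the two events from the earlier die example.

```python
from fractions import Fraction

# Uniform law on a fair die: P[A] = |A| / |omega|.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(omega))

A = {2, 4, 6}   # even numbers
B = {1, 2}      # less than 3

# Corollary 4: P[A ∪ B] = P[A] + P[B] - P[A ∩ B].
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 2/3
```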
Corollary 5 (Union Bound). For any A and B, P[A ∪ B] ≤ P[A] + P[B].

Proof. Since P[A ∪ B] = P[A] + P[B] − P[A ∩ B] and by the non-negativity axiom P[A ∩ B] ≥ 0, we must have P[A ∪ B] ≤ P[A] + P[B].

The union bound is a common tool for analyzing probabilities when the intersection A ∩ B is difficult to evaluate.

Corollary 6. If A ⊆ B, then P[A] ≤ P[B].

Proof. If A ⊆ B, then there exists a set B\A, disjoint from A, such that B = A ∪ (B\A). Therefore, by finite additivity we have P[B] = P[A] + P[B\A] ≥ P[A].

This corollary is useful when comparing two events of different sizes. For example, in the bus waiting example, if we let A = {t ≤ 5} and B = {t ≤ 10}, then P[A] ≤ P[B] because we have to wait through the first 5 minutes before we can reach the remaining 5 minutes.

2.2 Conditional Probability

Definition 5. Assume P[B] ≠ 0. The conditional probability of A given B is

P[A | B] def= P[A ∩ B] / P[B].   (2.1)

Pictorially, a conditional probability is the proportion of P[A ∩ B] relative to P[B]. It is the probability that A happens when we know that B has already happened. The difference between P[A | B] and P[A ∩ B] is the denominator they carry:

P[A | B] = P[A ∩ B] / P[B]   and   P[A ∩ B] = P[A ∩ B] / P[Ω].

Since P[Ω] ≥ P[B], P[A | B] is always larger than or equal to P[A ∩ B]. Conditional probabilities are ubiquitous in this course and beyond. They concern the likelihood that one event happens given that another event has happened. This notion of conditioning is common in our daily lives. The following are some examples.

Example. Throw a die. Let A = {get 3} and B = {odd numbers}.
Figure 2.2: Illustration of conditional probability and its comparison with P[A ∩ B].

Clearly, P[A] = 1/6 and P[B] = 1/2. It is also not difficult to see that P[A ∩ B] = P[A] = 1/6, because A ⊆ B and so A ∩ B = A. The conditional probability of A given B is

P[A | B] = P[A ∩ B] / P[B] = (1/6) / (1/2) = 1/3.

In words, if we know that we have an odd number, then the probability of obtaining a 3 has to be computed over {1, 3, 5}, which gives us a probability of 1/3. If we do not know that we have an odd number, then the probability of obtaining a 3 has to be computed from the sample space {1, 2, 3, 4, 5, 6}, which gives us 1/6.

Example. Let

A = {eat 2 burgers} and B = {finish a football game}.

In this example,
P[A] = probability that you eat 2 burgers,
P[B] = probability that you just finished a football game,
P[A ∩ B] = probability that you just finished a football game and you eat 2 burgers,
P[A | B] = probability that you eat 2 burgers given that you just finished a football game.
Without knowing that you just finished a football game, you may not be hungry, and so the probability of eating 2 burgers (i.e., P[A]) could be low. However, if we know that you finished a football game, then it is quite likely that you are hungry and want to eat 2 burgers. This is the conditional probability P[A | B].

Example. Let

A = {Purdue wins the Big Ten championship} and B = {Purdue wins 15 games consecutively}.
In this example,
P[A] = probability that Purdue wins the championship,
P[B] = probability that Purdue wins 15 games consecutively,
P[A ∩ B] = probability that Purdue wins the championship and wins 15 games consecutively,
P[A | B] = probability that Purdue wins the championship given that it has won 15 games consecutively.
If we do not know whether Purdue has won 15 games consecutively, then it is unlikely that Purdue will win the championship because the sample space of all possible competition results is large. However, if Purdue has already won 15 games consecutively, then the denominator of the probability becomes much smaller. In this case, the conditional probability is high.

Proposition 1. Let P[B] > 0. The conditional probability P[· | B] satisfies Axiom I to Axiom III.

Proof. Let us check the axioms:
Axiom I: P[A | B] = P[A ∩ B] / P[B]. Since P[B] > 0 and the non-negativity axiom gives P[A ∩ B] ≥ 0, we therefore have P[A | B] ≥ 0.
Axiom II: P[Ω | B] = P[Ω ∩ B] / P[B] = P[B] / P[B] = 1.
Axiom III: Consider two disjoint sets A and C. Then,

P[A ∪ C | B] = P[(A ∪ C) ∩ B] / P[B] = P[(A ∩ B) ∪ (C ∩ B)] / P[B]
(a)= P[A ∩ B] / P[B] + P[C ∩ B] / P[B] = P[A | B] + P[C | B],

where (a) holds because if A and C are disjoint, then A ∩ B and C ∩ B are also disjoint.

The implication of Proposition 1 is that conditional probabilities are legitimate probabilities. In proving the proposition, note that the set B is present and fixed in all three axioms.

2.3 Independence

Definition 6. Two events A and B are statistically independent if P[A ∩ B] = P[A] P[B].

Disjoint vs. Independent. It should be cautioned that disjoint and independent are two different concepts, i.e.,

Disjoint ⇏ Independent.
If A and B are disjoint, then A ∩ B = ∅. This only implies that P[A ∩ B] = 0. However, it says nothing about whether P[A ∩ B] can be factorized into P[A] P[B]. If A and B are independent, then we have P[A ∩ B] = P[A] P[B]. But this does not imply that P[A ∩ B] = 0. The only case in which two disjoint events are also independent is when P[A] = 0 or P[B] = 0.

Example. Throw a die twice. Let A = {1st die is 3} and B = {2nd die is 4}. Are A and B independent? We can show that

P[A ∩ B] = P[{(3, 4)}] = 1/36,
P[A] = 1/6, and P[B] = 1/6.

So P[A ∩ B] = P[A] P[B]. Thus, A and B are independent.

Example. Throw a die twice. Let A = {1st die is 1} and B = {sum is 7}. Are A and B independent? Note that

P[A ∩ B] = P[{(1, 6)}] = 1/36,
P[A] = 1/6,
P[B] = P[{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}] = 6/36 = 1/6.

So P[A ∩ B] = P[A] P[B]. Thus, A and B are independent.

Example. Throw a die twice. Let A = {max is 2} and B = {min is 2}. Are A and B independent? Let us first list out A and B:

A = {(1, 2), (2, 1), (2, 2)},
B = {(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (3, 2), (4, 2), (5, 2), (6, 2)}.
Therefore, the probabilities are

P[A] = 3/36 and P[B] = 9/36,
P[A ∩ B] = P[{(2, 2)}] = 1/36.

Clearly, P[A ∩ B] ≠ P[A] P[B], and so A and B are dependent.

Independence via Conditional Probability. Recall that P[A | B] = P[A ∩ B] / P[B]. If A and B are independent, then P[A ∩ B] = P[A] P[B], and so

P[A | B] = P[A ∩ B] / P[B] = P[A] P[B] / P[B] = P[A].

This suggests an interpretation of independence: if the occurrence of B provides no additional information about the occurrence of A, then A and B are independent. However, we do not define independence via conditional probability, because P[A ∩ B] = P[A] P[B] holds even when P[B] = 0, whereas the conditional probability P[A | B] requires P[B] > 0.

2.4 Bayes Theorem and Law of Total Probability

Theorem 1 (Bayes Theorem). For any two events A and B such that P[A] > 0 and P[B] > 0, it holds that

P[A | B] = P[B | A] P[A] / P[B].

Proof. By the definition of conditional probability, we have

P[A | B] = P[A ∩ B] / P[B] and P[B | A] = P[B ∩ A] / P[A].

Rearranging the terms yields P[A ∩ B] = P[B | A] P[A], which gives the desired result upon dividing both sides by P[B].

Bayes Theorem provides two views of the intersection P[A ∩ B] using two different conditional probabilities. See Figure 2.3 for a pictorial illustration. We call P[B | A] the conditional probability of B given A, and P[A | B] the posterior probability of A given B.
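The three independence examples above can be checked mechanically by enumerating all 36 outcomes of two dice. This is a sketch of my own, assuming fair dice and the uniform law on the 36 ordered pairs.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))   # uniform law

def independent(A, B):
    return P(A & B) == P(A) * P(B)

A1 = {w for w in omega if w[0] == 3}; B1 = {w for w in omega if w[1] == 4}
A2 = {w for w in omega if w[0] == 1}; B2 = {w for w in omega if sum(w) == 7}
A3 = {w for w in omega if max(w) == 2}; B3 = {w for w in omega if min(w) == 2}

assert independent(A1, B1)        # 1/36 == (1/6)(1/6)
assert independent(A2, B2)        # 1/36 == (1/6)(1/6)
assert not independent(A3, B3)    # 1/36 != (3/36)(9/36)
```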
Figure 2.3: Bayes theorem provides two views of P[A ∩ B] using P[A | B] and P[B | A].

Theorem 2 (Law of Total Probability). Let {A_1, A_2, ..., A_n} be a partition of Ω, i.e., A_1, ..., A_n are disjoint and Ω = A_1 ∪ A_2 ∪ ... ∪ A_n. Then, for any B ⊆ Ω,

P[B] = Σ_{i=1}^{n} P[B | A_i] P[A_i].

Proof. We start from the right-hand side:

Σ_{i=1}^{n} P[B | A_i] P[A_i] (a)= Σ_{i=1}^{n} P[B ∩ A_i] (b)= P[∪_{i=1}^{n} (B ∩ A_i)] (c)= P[B ∩ (∪_{i=1}^{n} A_i)] (d)= P[B ∩ Ω] = P[B],

where (a) follows from the definition of conditional probability, (b) is due to Axiom III, (c) holds because of the distributive property of sets, and (d) results from the partition property of {A_1, A_2, ..., A_n}.

Interpretation. The law of total probability can be understood as follows. If the sample space Ω consists of disjoint subsets A_1, ..., A_n, then we can compute P[B] by summing over its portions P[B | A_1], ..., P[B | A_n]. However, the probability of being in A_1, ..., A_n is determined by P[A_1], ..., P[A_n]. Therefore, when performing the sum, we need to weight each P[B | A_i] by P[A_i]. See Figure 2.4 for an illustration.

Corollary 7. Let {A_1, A_2, ..., A_n} be a partition of Ω, i.e., A_1, ..., A_n are disjoint and Ω = A_1 ∪ A_2 ∪ ... ∪ A_n. Then, for any B ⊆ Ω,

P[A_j | B] = P[B | A_j] P[A_j] / Σ_{i=1}^{n} P[B | A_i] P[A_i].
Figure 2.4: The law of total probability decomposes P[B] into multiple conditional probabilities P[B | A_i], each weighted by P[A_i].

Proof. We just need to apply Bayes Theorem and the Law of Total Probability:

P[A_j | B] = P[B | A_j] P[A_j] / P[B] = P[B | A_j] P[A_j] / Σ_{i=1}^{n} P[B | A_i] P[A_i].

Example. Consider the communication channel shown in Figure 2.5. The probability of sending a 1 is p and the probability of sending a 0 is 1 − p. Given that 1 is sent, the probability of receiving 1 is 1 − η. Given that 0 is sent, the probability of receiving 0 is 1 − ε. We want to find the probability that a 1 has been received. Define the events

S_0 = {0 is sent} and R_0 = {0 is received},
S_1 = {1 is sent} and R_1 = {1 is received}.

Then, the probability that 1 is received is P[R_1]. However, P[R_1] ≠ 1 − η, because 1 − η is the conditional probability that 1 is received given that 1 is sent; it is also possible that we receive 1 as a result of an error when 0 is sent. Therefore, we need to account for the probabilities of S_0 and S_1. Using the law of total probability, we have

P[R_1] = P[R_1 | S_1] P[S_1] + P[R_1 | S_0] P[S_0] = (1 − η)p + ε(1 − p).

Now, suppose that we have received 1. What is the probability that 1 was originally sent? This asks for the posterior probability P[S_1 | R_1], which can be found using Bayes Theorem:

P[S_1 | R_1] = P[R_1 | S_1] P[S_1] / P[R_1] = (1 − η)p / ((1 − η)p + ε(1 − p)).
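The channel computation lends itself to a short numerical sketch. The values of p, η, ε below are my own illustrative choices; the text keeps them symbolic.

```python
# P[R1] via the law of total probability, and P[S1 | R1] via Bayes Theorem.
def prob_receive_one(p, eta, eps):
    """P[R1] = (1 - eta) * p + eps * (1 - p)."""
    return (1 - eta) * p + eps * (1 - p)

def posterior_sent_one(p, eta, eps):
    """P[S1 | R1] = (1 - eta) * p / P[R1]."""
    return (1 - eta) * p / prob_receive_one(p, eta, eps)

# Illustrative (assumed) numbers: equally likely bits, 10% flip either way.
p, eta, eps = 0.5, 0.1, 0.1
print(prob_receive_one(p, eta, eps))    # 0.5
print(posterior_sent_one(p, eta, eps))  # 0.9
```

With this symmetric channel and equally likely inputs, receiving a 1 is as likely as sending one, and the posterior equals 1 − η, as the symmetry suggests.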
Figure 2.5: A two-channel communication system.

Example. Consider a tennis tournament. Your probability of winning the game is
0.3 against 1/2 of the players (Event A),
0.4 against 1/4 of the players (Event B),
0.5 against 1/4 of the players (Event C).
What is the probability of winning the game? Let W be the event that you win the game. Then, by the Law of Total Probability, we have

P[W] = P[W | A] P[A] + P[W | B] P[B] + P[W | C] P[C]
= (0.3)(0.5) + (0.4)(0.25) + (0.5)(0.25) = 0.375.
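The tennis calculation can be written as a small total-probability helper; this is an illustrative sketch, not part of the text.

```python
# P[W] = sum_i P[W | A_i] P[A_i] over a partition of opponents.
def total_probability(cond_probs, priors):
    assert abs(sum(priors) - 1.0) < 1e-12   # priors must form a partition
    return sum(c * q for c, q in zip(cond_probs, priors))

p_win = total_probability([0.3, 0.4, 0.5], [0.5, 0.25, 0.25])
print(p_win)   # 0.375
```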