Probability and Distribution Theory


PART THREE: Probability and Distribution Theory

CHAPTER 6: The Theory of Statistics: An Introduction

6.1 What You Will Learn in This Chapter

So far all our analysis has been descriptive; we have provided parsimonious ways to describe and summarize data and thereby acquire useful information. In our examination of data and experiments, we discovered that there were many regularities that held for large collections of data. However, this led to new questions: If random variables are unpredictable, why is it that the same experiment produces the same shape of histogram? What explains the different shapes of histograms? Why do the shapes of histograms become smooth as the number of observations increases? How can we explain the idea of a structural relationship?

We now introduce the theory of statistics, called probability theory. This theory will provide the answers to these and many other questions as yet unposed. In addition, at last we will formally define the idea of a random variable. The basic theory that we delineate in this chapter underlies all statistical reasoning. Probability theory and the theory of distributions that is to follow in the succeeding chapters are the foundation for making general statements about as yet unobserved events. Instead of having to restrict our statements to a description of a particular set of historically observed data points, probability theory provides the explanations for what it is that we observe. It enables us to recognize our actual data as merely a finite-sized sample drawn from a theoretical population of infinite extent. Our explanations move from statements of certainty to explanations expressed in terms of relative frequencies, or in terms of the odds in favor of or against an event occurring.

You will discover in this chapter that besides simple probability, there are concepts of joint and conditional probability, as well as the notion of independence between random variables. The independence that we introduce in this chapter is statistical independence. Statistical independence between two variables implies that we gain no information about the distribution of events of one of them from information on the values taken by the other. This is a critical chapter, because the theory that we will develop lies at the heart of all our work hereafter. All decision making depends one way or another on the concept of conditional probability, and the idea of independence is invoked at every turn.

6.2 Introduction

We began our study of statistics by giving an intuitive definition of a random variable as a variable that cannot be predicted by any other variable or by its own past. Given this definition we were in a quandary to start, because we had no way to describe, or summarize, large, or even small, amounts of data. We began our search for ways to describe random data by counting; that is, we counted the number of occurrences of each value of the variable and called it a frequency. We then saw the benefit of changing to relative frequencies and extended this idea to continuous random variables. With both frequency charts and histograms we saw that relative frequencies added to one by their definition and that the area under a histogram is also one. More important, we saw that different types of survey or experimental data took different shapes for their frequency charts or their histograms. Also, we saw that if we had enough data there seemed to be a consistency in the shapes of charts and histograms over different trials of the same type of experiment or survey.

We began with nothing to explain. We now have a lot to explain. What explains the various shapes of frequency charts and histograms? Why does the same shape occur if random variables are truly unpredictable? The stability of the shape seems to depend on the number of observations, but how? For some pairs of random variables, the distribution of one variable seems to depend on the value taken by the other variable. Why is this, and how does the relationship between the variables change? Why are some variables related in this statistical sense and others not? Most important, can we predict the shape of distributions and the statistical relationships that we have discovered so far? Might we be able to discover even more interesting relationships? The task of this chapter is to begin the process of providing the answers to these questions.

What we require is a theory of relative frequencies. This self-imposed task is common to all sciences. Each science discovers some regularities in its data and then tries to explain those regularities. The same is true in statistics; we need to develop a theory to explain the regularities that we have observed. But if we spend the time to develop a theory of relative frequency, or a theory of statistics, as we shall now call it, what will we gain from our efforts? First, we would expect to be able to answer the questions that we posed to ourselves in the previous paragraphs. But in addition, we would like to be able to generalize from specific events and specific observations to make statements that will hold in similar, but different, situations; this process is called inference and will eventually occupy a lot of our efforts. Another very important objective is to be able to deduce from our theory new types of concepts and new types of relationships between variables that can be used to further our understanding of random variables. Finally, the development of a theory of statistics will enable us to improve our ability to make decisions involving random variables, or to deal with situations in which we have to make decisions without full information. The most practical aspect of the application of statistics and of statistical theory is its use in decision making under uncertainty and in determining how to take risks. We will discuss both of these issues in later chapters.

If we are to do a creditable job of developing a theory of statistics, we should keep in mind a few important guidelines. First, as the theory is to explain relative frequency, it would be useful if we developed the theory from an abstraction and generalization of relative frequencies. We want to make the theory as broadly applicable, or as general, as is feasible; that is, we want to be able to encompass as many different types of random variables and near-random variables as possible. Although we want the theory to generate as many new concepts and relationships as possible, it is advisable that we be as sparing as we can with our basic assumptions. The less we assume to begin, the less chance there is of our theory proving to be suitable only for special cases. Finally, we would like our theory's assumptions to be as simple and as noncontroversial as possible. If everyone agrees that our assumptions are plausible and useful in almost any potential application, then we will achieve substantial and broad agreement with our theoretical conclusions. A side benefit to this approach is that we will be able to make our language more precise and that we will be able to build up our ideas step by step. This will facilitate our own understanding of the theory to be created. But, as with all theories, we will have to idealize our hypothesized experiments to concentrate on the most important aspects of each situation. By abstracting from practical details, we will gain insight. Let us begin.

This is a chapter on theory. Consequently, we are now dealing in abstractions and with theoretical concepts, not with actually observed data. However, we will illustrate the abstract ideas with many simple examples of experiments. Try not to confuse the discussion of the illustrative experiment with the abstract theory that is being developed. The notation will change to reflect the change in viewpoint; we will no longer use the lowercase Roman alphabet to represent variables or things like moments. Now we will use uppercase letters to represent random variables and Greek letters to represent theoretical values and objects like moments for theoretical distributions. This convention of lowercase Roman to represent observed variables and uppercase to represent random variables is restricted to variables; that is, we will need both uppercase and lowercase Roman letters to represent certain functions and distributions that will be defined in later chapters. The context should make it abundantly clear whether we are talking about variables, functions, or distributions. However, much of the new notation will not come into play until the next chapter. This chapter relies heavily on elementary set theory, which is reviewed in Appendix A, the Mathematical Appendix.
The reader is advised at least to scan the material in the appendix to be sure that all the relevant concepts are familiar before proceeding with this chapter.

6.3 The Theory: First Steps

The Sample Space

Let us start with a very simple illustrative experiment that involves a 12-sided die. Each side is labeled by a number: 1, 2, 3, ..., 10, 11, 12. Imagine the experiment of tossing this die; each toss is called a trial of the experiment. If you throw this die, it will land on one side and on only one side. This situation is very common and is characterized by this statement:

Conjecture: There is a set of events that contains a finite number of discrete alternatives that are mutually exclusive; the set of events is exhaustive for the outcomes of the experiment.

An event is an outcome or occurrence of a trial in an experiment. The set of outcomes (or events) of the experiment of tossing the 12-sided die is finite; that is, there are only 12 of them. The events are mutually exclusive because one and only one of the outcomes can occur at a time. The set of events is exhaustive because nothing else but 1 of the 12 listed events can occur. Events that are mutually exclusive and exhaustive are known as elementary events; they cannot be broken down into simpler mutually exclusive events. The exhaustive set of mutually exclusive events is called the sample space. This is a somewhat misleading term because it (as well as all the other terms defined in this section) is an abstraction; it does not refer to actual observations. In our current experiment, a trial is the tossing of a 12-sided die; an event is the number that shows up on the toss. The sample space is the set of 12 numbers {1, 2, 3, ..., 11, 12}; the events are mutually exclusive and exhaustive, because on each trial, or toss, only one of the numbers will be on top. Listing the 12 numbers as the outcomes of trials logically exhausts the possible outcomes of the experiment.

Look at Table 6.1, which shows the relative frequencies of 1,000 tosses of a 12-sided die. The first few observed outcomes were 9, 4, 11, 2, 11, 6, 3, 12, 6, 11, ... In the first ten trials of this experiment, there were three 11s and two 6s. What would you have expected? Is this result strange, or is it unremarkable? One of the objectives of our theoretical analysis is to provide answers to this sort of question. These numbers are actual observed outcomes, but what we want to do is to abstract from this situation. We want to be able to speak about all possible tosses of a 12-sided die, not just about this particular set of actual outcomes, as we have done exclusively until now. This change in viewpoint is difficult for some at first, but if you keep thinking about the idea it will soon become natural and easy for you.

Let us study another example. Suppose that we have a coin with two sides, heads and tails, and the edge is so thin that we do not have to worry about the coin landing on its edge. What we are really saying is that we want to study a situation in which there are only two possibilities, heads and tails, which are exhaustive and mutually exclusive. The sample space is S = {e_1, e_2}; in any trial, one and only one of e_1 or e_2 can occur, where e_1 represents heads and e_2 represents tails.
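The distinction between the abstract sample space and a particular sequence of trials is easy to make concrete on a computer. The following minimal Python sketch (an illustration of ours, not one of the text's experiments) simulates ten trials of the 12-sided die and tallies the outcomes, so you can see for yourself how often a run of ten trials produces repeats like three 11s.

```python
import random
from collections import Counter

sample_space = list(range(1, 13))   # the 12 elementary events of the die

def trial():
    """One trial of the experiment: toss the 12-sided die once."""
    return random.choice(sample_space)

outcomes = [trial() for _ in range(10)]   # ten trials, as in the text's opening example
print("first ten outcomes:", outcomes)
print("tallies:", Counter(outcomes))      # e.g. three 11s would appear as 11: 3
```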

Table 6.1: Frequency Tabulation of a 12-Sided Die. (Columns: Die Value, Absolute Frequency, Relative Frequency, Cumulative Frequency, Cumulative Relative Frequency; the table's entries were not preserved in this transcription.)

Our language about heads and tails is colorful and helpful in trying to visualize the process, but the abstract language of sample spaces and events is more instructive and enables us to generalize our ideas immediately. Remember that throughout this and the next two chapters, the experiments that are described and the observations that are generated by them are merely illustrative examples. These examples are meant to give you insight into the development of the abstract theory that you are trying to learn.

Let us study a more challenging example. Suppose that we have two coins that we toss at the same time. At each toss we can get any combination of heads and tails from the two coins. Working out the appropriate sample space is a little more tricky in this example. Remember that we are looking for a set of mutually exclusive and exhaustive events. Here is a list of the potential outcomes for this experiment:

H, H
H, T
T, H
T, T

where the first letter refers to the first coin and the second letter to the second coin. As listed, these four outcomes are mutually exclusive and exhaustive, because one and only one of these events can and will occur. But what if we had listed the outcomes as

{H, H}; {H, T or T, H}; {T, T}

You might be tempted to list only three mutually exclusive and exhaustive events. This is not correct, because H and T can occur together in two ways: H, T or T, H. The event {H, T} is not an elementary event, because it is composed of two subevents, H, T and T, H; that is, heads first, then tails, or the reverse order. The notation {H, T} means that we are examining the pair H, T without worrying about order; {H, T} is the unordered set containing H and T. This example shows that our definition of a sample space must be refined: A sample space is a set of elementary events that are mutually exclusive and exhaustive. (An elementary event is an event that cannot be broken down into a subset of events.)
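To see why order matters when building this sample space, here is a small sketch of ours that enumerates the ordered outcomes of the double-coin toss; it produces four elementary events, with ('H', 'T') and ('T', 'H') appearing as distinct atoms.

```python
from itertools import product

coins = ['H', 'T']
sample_space = list(product(coins, repeat=2))   # ordered pairs, one per elementary event
print(sample_space)   # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]

# The unordered pair {H, T} lumps two of these atoms together, so it is
# a compound event, not an elementary one:
unordered_HT = [e for e in sample_space if set(e) == {'H', 'T'}]
print(unordered_HT)   # [('H', 'T'), ('T', 'H')] -- two distinct elementary events
```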

Table 6.2: Frequency Tabulation for a Single-Coin Toss. (Columns: Elementary Event, Absolute Frequency, Relative Frequency, Cumulative Frequency, Cumulative Relative Frequency; entries not preserved in this transcription.)

Table 6.3: Frequency Tabulation for a Double-Coin Toss. (Same columns; entries likewise not preserved.)

In a very real sense our elementary events are the atoms of the theory of statistics, or of probability theory, as it is also called. So a sample space is a collection of elementary events, or a collection of atoms. These are our basic building blocks. In the die example the sample space had 12 elementary events, the numbers from 1 to 12, or more abstractly, {e_1, e_2, ..., e_12}. In the single-coin toss experiment, the sample space had two elementary events, heads and tails, or more generally, e_1 and e_2; and in the last experiment involving the tossing of two coins, we had a sample space with four elementary events, e_1 to e_4.

Introducing Probabilities

Table 6.1 shows the relative frequencies for an experiment with a 12-sided die, and Tables 6.2 and 6.3 show observed relative frequencies for 100 trials on each of two experiments: one with a single coin, one with two coins. We want to be able to explain these relative frequencies, so we need an abstract analog to relative frequency. We define the probability of an elementary event as a number between zero and one such that the sum of the probabilities over the sample space is one; this last requirement reflects the fact that relative frequencies sum to one. To each elementary event, e_i, we assign a number between zero and one; call it p_i. We can write this as

S = {e_1, e_2, ..., e_k} with probabilities p_1, p_2, ..., p_k

for a sample space having k elementary events; or, we can write the assignment out:

e_1: pr(e_1) = p_1
e_2: pr(e_2) = p_2
e_3: pr(e_3) = p_3
...
e_k: pr(e_k) = p_k

The expression pr(e_2) means: assign a specific number between zero and one to the elementary event that is designated in the argument, e_2 in this case; the value given by that assignment is p_2. Our notation reinforces the idea that we are assigning a number to each elementary event. But we are not at liberty to assign just any number between zero and one; there is one more constraint, namely Σ p_i = 1. If you look at the simplest example, where the sample space is {e_1, e_2}, the tossing-of-one-coin experiment, you will see that this last constraint still leaves a lot of choice. If a probability of p_1 is assigned to e_1, 0 ≤ p_1 ≤ 1, then our constraint merely says that pr(e_2) = 1 − p_1 = p_2 for any valid value of p_1. If we are to proceed with our examples, we will have to resolve this issue.

There is one easy way out of our difficulty, given our current ignorance about what values of probabilities we should assign to our elementary events: assume that they are all the same. Consequently, if there are k elementary events, the assumed probability is 1/k for each elementary event. This convenient assumption is called the equally likely principle, or, following Laplace, the "principle of insufficient reason"; the former phrase is easier to comprehend. This principle is really an expression of our ignorance of the actual probabilities that would apply to this particular type of experiment. Until we begin to derive probability distributions, we will have to invoke this principle quite often; in any case, it does seem to be reasonable under the circumstances. Following this principle, we can assign a probability distribution to each of our three experiments:

12-sided die experiment: S = {1, 2, ..., 11, 12}; p_i = 1/12 = .0833, i = 1, 2, ..., 12
Single-coin toss experiment: S = {e_1, e_2}; p_i = 1/2 = .50, i = 1, 2
Double-coin toss experiment: S = {e_1, e_2, e_3, e_4}; p_i = 1/4 = .25, i = 1, 2, 3, 4

In each case, we have defined probability such that 0 ≤ p_i ≤ 1, i = 1, ..., k, and Σ p_i = 1. This is called a probability distribution. A probability distribution, as its name implies, is a statement of how probabilities are distributed over the elementary events in the sample space.

An immediate question may occur to you. If probabilities are the theoretical analogues of relative frequency, then what can we say about the relationship, if any, between probability and relative frequency? Tables 6.1, 6.2, and 6.3 show the results of three actual experiments designed to illustrate the three sample spaces that we have been discussing. We get excellent agreement for experiment 2; the assigned probability is .50 and the relative frequency is .50. The results for experiment 1 are not so good; no relative frequency equals the assigned probability of .0833, but they do seem to be scattered about that value. With the last experiment, no relative frequency equals the assigned probability, but the relative frequencies do seem to be scattered about the assigned probability of .25. At first sight it would appear that we have not made much progress, but on reflection we recall that to get stable shapes we had to have a lot of data; so maybe that is our problem. However, the question does raise an issue that we will have to face eventually; namely, when can we say that we have agreement between theory and what we observe? What are the criteria that we should use? We will meet this issue directly soon enough, but for now you should keep the problem in mind.
We can conclude at this time only that the observed relative frequencies seem to be scattered about our assumed probabilities.
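This scatter of relative frequencies about the assigned probability is easy to reproduce. The short Python sketch below (our illustration; the trial counts are arbitrary) simulates the 12-sided die under the equally likely principle and shows the relative frequency of one face drifting toward 1/12 ≈ .0833 as the number of trials grows.

```python
import random

random.seed(1)   # an arbitrary seed, fixed only so repeated runs match

for n in (100, 1_000, 10_000, 100_000):
    tosses = [random.randint(1, 12) for _ in range(n)]
    rel_freq = tosses.count(7) / n   # relative frequency of the face 7
    print(f"n = {n:>6}: relative frequency of 7 = {rel_freq:.4f} (assigned probability .0833)")
```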

Probabilities of Unions and Joint Events

If this were all that we could do with probability, it would be a pretty poor theory. So far all that we have done is to assign probability to elementary events using the equally likely principle. But what is the probability of getting, on a single roll of the 12-sided die, a 2, a 4, or a 10? Another way of saying this is that we will declare the roll a success if, on a single roll, we get one of the numbers 2, 4, or 10. Given that the outcomes from a single roll of a die are mutually exclusive, we will get only one of these three alternatives. If we obtain any other number, we will declare a failure. The question is, what is the probability of success in any trial of this new experiment? Will our new theory help in this more complex, but more interesting, case? Maybe we could just add up the individual probabilities to get the probability of the new event, {e_2, or e_4, or e_10}:

pr(e_2, or e_4, or e_10) = p_2 + p_4 + p_10 = 1/12 + 1/12 + 1/12 = .25

Because our development of the theory is meant to explain relative frequency, we might see how reasonable this guess is by looking at an experiment. A generous soul volunteered to toss a 12-sided die to get 2,000 trials on the event {e_2, or e_4, or e_10}. The result was a relative frequency of .238, which at least seems to be close. The lesson so far seems to be that to obtain the probability of two or more elementary events all we need do is add up the individual probabilities; but we should be careful and make sure that the current success is not a lucky break.

Let us try another experiment, and ask another question. What is the probability of the event of at least one head in the two-coin experiment? Our present theoretical answer is

pr(e_1, or e_2, or e_3) = p_1 + p_2 + p_3 = 1/4 + 1/4 + 1/4 = .75

Event e_1 is {H, H}, e_2 is {H, T}, and e_3 is {T, H}. But what is the value obtained by the experiment? Our generous soul reluctantly came to our rescue again to produce the result of .78. Maybe we are onto something. So far, for this first set of problems involving only elementary events, the probability of an event composed of two or more elementary events is just the sum of the probabilities of the individual elementary events; or, in our more abstract notation:

pr(e_i, or e_j, or e_k, ..., or e_m) = p_i + p_j + p_k + ... + p_m

These new events are called compound events, because they are compounds, or unions, of the elementary events. In any trial, you declare that a compound event has occurred if any one of its members occurs.

[Figure 6.1: Table of elementary events for the two-coin toss experiment. The lined region is the event "at least one head".]

In our previous examples, the compound event was to get a 2, 4, or 10. So, if in any single trial in the 12-sided die example you throw either a 2, 4, or 10, you have an occurrence of that compound event. In the two-coin toss experiment, the compound event was said to occur if you throw at least one head in a trial using two coins. Figure 6.1 illustrates this last case. Success is represented by the lined region, which is the union of the events {e_1, e_2, e_3}.

If we can discuss the probability of a compound event by relating it to the probabilities of the member elementary events, can we discuss the probability of two or more compound events? Think about this case: what is the probability of at least one head or at least one tail in the two-coin experiment? To answer this question we need to know the relevant compound events. The compound event for at least one head is (e_1, or e_2, or e_3); call it a. The compound event for at least one tail is (e_2, or e_3, or e_4); call it b. Let us call our event "at least one head or at least one tail" c. From our previous efforts, we have

pr(e_1, or e_2, or e_3) = p_1 + p_2 + p_3 = pr(a) = .75
pr(e_2, or e_3, or e_4) = p_2 + p_3 + p_4 = pr(b) = .75

So, is the probability of at least one head or at least one tail the sum of pr(a) and pr(b)? Try it:

pr(c) = pr(a) + pr(b) = .75 + .75 = 1.5!

[Figure 6.2: Table of elementary events for the two-coin toss experiment. The lined region is the event "at least one head"; the shaded region is the event "at least one tail".]

Something is decidedly wrong! Probability cannot be greater than one. What worked so well for unions of elementary events does not seem to work for unions of compound events. Let us look more closely at this problem. Study Figure 6.2, which reproduces Figure 6.1. Here we have put lines to represent the event a, which is at least one head, and shaded the region corresponding to the event b, which is at least one tail. Figure 6.2 gives us a clue to the solution of our problem, for we see that the elementary events e_2 and e_3 are represented twice, once in the event a and once in the event b. The event that is represented by the overlap between events a and b defines a new idea of a compound event. In this example, the overlap, or intersection, is the event defined by the elementary events e_2 and e_3, that is, the occurrence of both a head and a tail on a single trial. In our first attempt at adding probabilities, the elementary events to be added were mutually exclusive, but the events a and b are not mutually exclusive. If {H, T} occurs, this is consistent with declaring that event a has occurred, and it is also consistent with declaring that event b has occurred. Remember that a compound event occurs whenever any one member of its defining set of elementary events occurs on any trial. Events a and b are not mutually exclusive.

What is the way out of our difficulty? Well, we know how to add probability when the events are mutually exclusive, but what do we do when they are not? One solution is to convert our problem into one that involves only mutually exclusive events. To do this, we will have to develop some useful notation to ease our efforts.
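The double counting can be made concrete with Python's set operations, which anticipate the union and intersection notation introduced next. In this sketch of ours (using the equal 1/4 probabilities assumed above), the events a and b share two elementary events, and naive addition counts them twice.

```python
from itertools import product

space = set(product('HT', repeat=2))   # the four elementary events
p = {e: 1/4 for e in space}            # equally likely principle

a = {e for e in space if 'H' in e}     # at least one head: {e1, e2, e3}
b = {e for e in space if 'T' in e}     # at least one tail: {e2, e3, e4}

print(sum(p[e] for e in a) + sum(p[e] for e in b))   # 1.5 -- not a probability!
print(a & b)                           # {('H','T'), ('T','H')}: counted twice above
print(sum(p[e] for e in a | b))        # 1.0 -- the correct pr(at least one head or tail)
```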

A Mathematical Digression

Recall that we used the symbol S to represent the sample space. Compound events are collections of the elements, or members, of the set S. Suppose that A and B represent any two such compound events. From the events A and B, we create new events C and D:

Event C occurs if any elementary event in A or B occurs. This is written as C = A ∪ B.
Event D occurs if any elementary event in both A and B occurs. This is written as D = A ∩ B.

The symbol ∪ is called union and indicates that the event C is composed of the union of all the events in A or B (a composite) but without duplication. The symbol ∩ is called intersection and indicates that the event D is composed of all elementary events that are in both compound events A and B. In our previous example with the two-coin toss, the event c was the union of the events a and b; that is, c = a ∪ b. The elementary events that overlap between a and b form the intersection between the compound events a and b; that is, we define the event d by d = a ∩ b, where d represents the compound event created by the intersection of a and b. The event d is composed of the elementary events {e_2, e_3}.

To make sure that we have a good understanding of our new tools, we should try another example. Consider the sample space for the 12-sided die experiment. To have something to work with, let us define the following compound events:

E_1 = {1, 3, 5, 7, 9, 11}
E_2 = {2, 4, 6, 8, 10, 12}
E_3 = {5, 7, 9}
E_4 = {9, 10, 11, 12}

Now we can practice our new definitions:

A = E_1 ∪ E_2 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} = S
B = E_1 ∩ E_3 = {5, 7, 9} = E_3
C = E_3 ∪ E_4 = {5, 7, 9, 10, 11, 12}
D = E_3 ∩ E_4 = {9}
F = E_1 ∩ E_2 = ∅

This last symbol, ∅, means that F is an empty set; that is, F has no elements. Be careful to distinguish this from {0}, which is the set whose single element is zero. Events that have a null intersection are mutually exclusive, and vice versa.
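Python's built-in set type mirrors this notation almost symbol for symbol, so the digression can be checked mechanically. A small sketch of ours:

```python
S  = set(range(1, 13))   # sample space of the 12-sided die
E1 = {1, 3, 5, 7, 9, 11}
E2 = {2, 4, 6, 8, 10, 12}
E3 = {5, 7, 9}
E4 = {9, 10, 11, 12}

print(E1 | E2 == S)       # A = E1 ∪ E2 = S                      -> True
print(E1 & E3 == E3)      # B = E1 ∩ E3 = E3                     -> True
print(sorted(E3 | E4))    # C = [5, 7, 9, 10, 11, 12]
print(E3 & E4)            # D = {9}
print(E1 & E2 == set())   # F = ∅: E1 and E2 are mutually exclusive -> True
```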

One other useful notation is A^c, which means the complement of A. It contains all members of S that are not in A. For example, using our four events E_1, E_2, E_3, and E_4, we have

E_1^c = E_2
E_4^c = {1, 2, 3, 4, 5, 6, 7, 8}
∅^c = S
S^c = ∅

Let us experiment some more with these relationships. By matching the lists of component elementary events, confirm the following relationships:

(E_1 ∩ E_3) ∪ (E_1^c ∩ E_3) = E_3
(E_1 ∩ E_2) ∪ (E_1^c ∩ E_2) = E_2
E_1 ∪ E_1^c = S
E_1 ∩ E_1^c = ∅

From these relationships we can build more. For example, what might we mean by the set operation A − B? We might guess that this expression means the set of all elements that are in A but are not in B. Let us formalize that notion by defining, for any two sets A and B:

A − B = A ∩ B^c

The right-hand side of the equation represents all those elements of the universal set S that are in A and in the complement of B, that is, in A but not in B.

We have defined numerous alternative compound events; one question that we have not yet asked is, how many are there? This question can be answered easily only for the simple case in which we have a finite number of elementary events. Recall the single-coin toss example with only two elementary events, {e_1, e_2}. The total possible collection of events is ∅, {e_1}, {e_2}, and [S or {e_1, e_2}]; that is, four altogether. Now consider the total number of events for the two-coin toss experiment. We have {e_1}, {e_2}, {e_3}, {e_4}, {e_1, e_2}, {e_1, e_3}, {e_1, e_4}, {e_2, e_3}, {e_2, e_4}, {e_3, e_4}, {e_1, e_2, e_3}, {e_1, e_2, e_4}, {e_1, e_3, e_4}, {e_2, e_3, e_4}, [S or {e_1, e_2, e_3, e_4}], and ∅, with a total of 16 events. The rule for determining the number of alternative events when there are k elementary events is given by 2^k. Each elementary event is either included or not, so there are just two choices for each. The choices combine independently across the k elementary events, so the total number of choices is 2 × 2 × ... × 2 (k times) = 2^k. In the first example there were only two elementary events, so the total number of events is 2^2, or 4. In the second example there were four elementary events, so there are 2^4, or 16, different events.
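The 2^k rule is easy to verify by brute force. Here is a short sketch of ours that builds every event (every subset) of a sample space and counts them.

```python
from itertools import chain, combinations

def all_events(sample_space):
    """Every subset of the sample space, from the empty event to S itself."""
    s = list(sample_space)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

print(len(all_events(['e1', 'e2'])))                 # 4  = 2**2 (single-coin toss)
print(len(all_events(['e1', 'e2', 'e3', 'e4'])))     # 16 = 2**4 (double-coin toss)
```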

[Figure 6.3: A Venn diagram illustrating compound events A, B, and D within the sample space S. One lined region is A^c ∩ B, the other lined region is A ∩ B^c, the cross-hatched region is A ∩ B, and the shaded region is A ∪ B. Note that A^c ∩ D = B^c ∩ D = D, and A ∩ D = B ∩ D = ∅.]

Calculating the Probabilities of the Union of Events

With our new tools, we can easily resolve our problem of how to calculate the probabilities of compound events that are not mutually exclusive. Recollect that we know how to add probabilities for events that are mutually exclusive, but we do not yet know how to add probabilities for events that are not mutually exclusive. Consequently, our first attempt to solve our problem is to try to convert our sum of compound events into an equivalent sum of mutually exclusive compound events. Look at Figure 6.3, which illustrates a sample space S and three arbitrary compound events A, B, and D. Notice that A and B overlap, so the intersection between A and B is not empty. But A and D do not overlap, so the intersection between A and D is empty, or ∅. The union of A and B is represented by the shading over the areas labeled A and B. This figure is known as a Venn diagram; it is very useful in helping to visualize problems involving the formation of compound events from other compound events. While we work through the demonstration of how to add probabilities for compound events that are not mutually exclusive, focus on Figure 6.3.

Suppose that we want to calculate the probability of the event E given by the union of A and B; that is, E = A ∪ B, which is composed of all the elementary events that are in either A or B. The idea is to reexpress the compound events A and B so that the new compound events are mutually exclusive. We can then express the probability of E as a simple sum of the probabilities of the component compound events that are mutually exclusive. From Figure 6.3, we see that there are three mutually exclusive compound events in the union of A and B: {A ∩ B}, {A ∩ B^c}, and {A^c ∩ B}. Because both A^c ∩ A and B^c ∩ B are null (that is, A^c ∩ A = ∅ and B^c ∩ B = ∅), our three listed events are mutually exclusive. For example, none of the elementary events that are in {A ∩ B} can be in {A ∩ B^c} as well, because {B ∩ B^c} is null; an elementary event cannot be in both B and B^c at the same time. Let us now reexpress the event {A ∪ B} in terms of its component events:

A ∪ B = (A ∩ B) ∪ (A ∩ B^c) ∪ (A^c ∩ B)
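Before checking this identity logically, it can be checked by brute force with Python sets; here is a sketch of ours using two overlapping events from the 12-sided die example.

```python
S = set(range(1, 13))
A = {5, 7, 9}          # E3 from the running die example
B = {9, 10, 11, 12}    # E4; overlaps A in {9}

def complement(event):
    return S - event   # e.g. complement(A) is A^c

lhs = A | B
rhs = (A & B) | (A & complement(B)) | (complement(A) & B)
print(lhs == rhs)      # True: A ∪ B = (A ∩ B) ∪ (A ∩ B^c) ∪ (A^c ∩ B)
```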

First, we should check that the equation is correct, which is illustrated in Figure 6.3, by making sure logically that no elementary event can be in more than one of the component events for {A ∪ B} and that any elementary event that is in {A ∪ B} is also in one of (A ∩ B), (A ∩ B^c), or (A^c ∩ B). All the elementary events that are in {A ∪ B} are in one, and only one, of (A ∩ B), (A ∩ B^c), or (A^c ∩ B). Now that we have mutually exclusive compound events, we can use our old expression to get the probability of the compound event {A ∪ B}:

pr(A ∪ B) = pr(A ∩ B) + pr(A ∩ B^c) + pr(A^c ∩ B)

Let us see if we can rearrange this expression to relate the left side to the probabilities for A and B. To do this, we use the following identities:

(A ∩ B) ∪ (A ∩ B^c) = A, so that pr[(A ∩ B) ∪ (A ∩ B^c)] = pr(A)
(B ∩ A) ∪ (B ∩ A^c) = B, so that pr[(B ∩ A) ∪ (B ∩ A^c)] = pr(B)

If we now add and subtract pr(A ∩ B) = pr(B ∩ A) (why is this always true?) in the expression for pr(A ∪ B), we will be able to rewrite pr(A ∪ B) in terms of pr(A) and pr(B) to get

pr(A ∪ B) = [pr(A ∩ B) + pr(A ∩ B^c)] + [pr(A^c ∩ B) + pr(A ∩ B)] − pr(A ∩ B)
          = pr(A) + pr(B) − pr(A ∩ B)     (6.1)

This is our new general statement for evaluating the probability of compound events formed from the union of any two arbitrary compound events. If the compound events are mutually exclusive, as elementary events are, then the new statement reduces to the old, because pr(A ∩ B), for A and B mutually exclusive, is 0. The name given to the probability of the intersection of A and B is the joint probability of A and B; it is the probability that in any trial an elementary event will occur that is in both of the compound events A and B. Refer to our compound events E_1 to E_4 from the 12-sided die experiment:

E_1 = {1, 3, 5, 7, 9, 11}
E_2 = {2, 4, 6, 8, 10, 12}
E_3 = {5, 7, 9}
E_4 = {9, 10, 11, 12}

Let us calculate the probabilities of A to F, where

A = E_1 ∪ E_2 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} = S
B = E_1 ∩ E_3 = {5, 7, 9} = E_3
C = E_3 ∪ E_4 = {5, 7, 9, 10, 11, 12}
D = E_3 ∩ E_4 = {9}
F = E_1 ∩ E_2 = ∅

In each instance, we have two ways to calculate the required probability. We can reduce each compound event to a collection of elementary events and then merely add up the probabilities of the component elementary events; we can always do this. However, trying to calculate the probabilities by this procedure can easily become a tremendous chore. The alternative is to use our theory to find easier and simpler ways to perform the calculations. Some results should be immediately obvious. For example, what are the probabilities for events A and F? The immediate answers are 1 and 0; do you see why? The probability of S, pr(S), is the probability that at least one of the logical outcomes will occur on any trial; because we have defined the set of elementary events to be exhaustive, we know that one of the outcomes must happen, so the probability is 1. Correspondingly, the probability that none of the elementary events will occur is 0 by the same reasoning. Now consider the probability of event C = E_3 ∪ E_4. We have discovered that this probability is given by the sum of the probabilities of the individual component events less an allowance for the double counting that is caused by the intersection of the component events. The probability of C in this case is

pr(C) = pr(E_3) + pr(E_4) − pr(E_3 ∩ E_4) = 3/12 + 4/12 − 1/12 = 6/12 = 1/2

Recall that E_3 ∩ E_4 = {5, 7, 9} ∩ {9, 10, 11, 12} = {9} and E_3 ∪ E_4 = {5, 7, 9, 10, 11, 12}. So, in both cases the probabilities are easily confirmed: pr({9}) = 1/12 and pr({5, 7, 9, 10, 11, 12}) = 6/12 = 1/2.
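Equation 6.1 and the two ways of computing these probabilities can be cross-checked mechanically. In this sketch of ours, probabilities come from summing equally likely atoms of 1/12 over a set's members, and the inclusion-exclusion rule gives the same answer as direct enumeration of the union.

```python
E3 = {5, 7, 9}
E4 = {9, 10, 11, 12}

def pr(event):
    """Probability of a compound event under the equally likely principle."""
    return len(event) / 12

print(pr(E3 | E4))                     # 0.5, by direct enumeration of the union
print(pr(E3) + pr(E4) - pr(E3 & E4))   # 0.5, by Equation 6.1
```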

The Definition of Probability for Sample Spaces of Discrete Events

In the section "Introducing Probabilities," we defined the probability of an elementary event. That was fine as a beginning, but now we have broadened the notion of probability quite a bit. We now see that probability really is defined on subsets of the sample space rather than on the sample space itself. Indeed, we would like our definition of probability to be as general as we can make it. In this connection, for any sample space S we want to be able to define probabilities for any subset of S that is constructed by any combination of unions, intersections, or complements. Logically, this means that we should be able to determine the probability of any set of events that are combined by the logical statements "and," "or," and "not." In short, given any subset of S formed in any way whatsoever using these procedures, we will be able to assess its probability.

Probability for a sample space of discrete events is a function defined on the class of all subsets of the sample space that satisfies the following conditions:

1. For any set A contained in S: 0 ≤ pr(A) ≤ 1.
2. For any disjoint sets A and B, that is, A ∩ B = ∅: pr(A ∪ B) = pr(A) + pr(B).
3. For any sets {A_i} such that ∪_i A_i = S and that are mutually disjoint, that is, A_i ∩ A_j = ∅ for all i ≠ j: pr(∪_i A_i) = 1.

We are merely saying that for any set of events drawn from S the probability is between 0 and 1, that the probability of the union of two mutually exclusive events is the sum of the constituent probabilities, and that the union of a set of mutually exclusive and exhaustive events has a probability of 1. We are now ready to extend our notion of probability once again.

6.4 Conditional Probability

Often in life we face circumstances in which the outcome of some occurrence depends on the outcome of a prior occurrence. Even more important, the very alternatives that we face often depend on the outcome of previous choices and the chance outcomes from those choices. The probability distribution of your future income depends on your choice of profession and on many other events over which you may have no control. We can simulate such compound choices by contemplating tossing one die to determine the die to be tossed subsequently. How do we use our probability theory in this situation? What is different from the previous case is that we now have to consider calculating probability conditional on some prior choice, and that choice itself may depend on a probability. Suppose that you are contemplating a choice of university to attend, and then, given the university that you attend, you face having to get a job on graduating. If you apply to ten universities, you can consider the chance that you will enter each, and given your attendance at each university, you can consider the chance that you will be employed within six months. The chance that you are employed within six months depends on the university from which you graduate. This situation can be modeled more abstractly by contemplating tossing a die to represent your first set of alternatives: which university you will attend. Given each university that you might attend, there is an associated chance that you will soon be employed, and this is represented by the toss of a second die; but which die you get to toss depends on which university you actually attend. The various income outcomes from each university can be represented by tossing a die, and across universities the dice to be tossed will be different in that they will represent different probabilities. The questions that we might want to ask include: What is the chance of getting a job whatever university you attend? For each university that you might attend, what are the chances that you will get a job? And how can you determine what those chances might be?

What we are really trying to do is to define a new set of probabilities from other probabilities that relate to a specific subset of choices or outcomes. A conditional probability is the probability that an event will occur given that another has occurred.

[Figure 6.4: Illustration of the concept of conditional probability. A Venn diagram of a sample space S containing mutually exclusive events A_1, A_2, A_3, and A_4 and an event B overlapping them.]

Look at the Venn diagram in Figure 6.4. What we are trying to do is illustrated there by the problem of defining a new set of probabilities relative to the event B. If we know that the event B has occurred, or we merely want to restrict our attention to the event B, then relative to the event B what are the probabilities for the events {A_i}, i = 1, ..., 4? You may think of the event B as attending university B instead of some other university; you can regard the events {A_i} as getting different types of jobs. If we are looking at the probability of the event A_i, i = 1, 2, ..., given the event B, then we are in part concerned with the joint probability of each of the events A_i, i = 1, 2, ..., and B, that is, with the probability of the events (A_i ∩ B), i = 1, ..., 4. The event A_i given B is the set of elementary events such that we would declare that the event A_i has occurred and the event B has occurred; so far this is just the joint event of A_i and B. The change in focus from joint probability stems from the idea that now we would like to talk about the probability of A_i relative to the probability of B occurring. We are in effect changing our frame of reference from the whole sample space S that contains A_i and B to just that part of the sample space that is represented by the compound event B. Figure 6.4 shows a sample space S containing four compound events, A_1, A_2, A_3, and A_4, together with an event, B, with respect to which we want to calculate the conditional probability of the A_i given B. As drawn, the A_i do not intersect; they are mutually exclusive; that is, the joint probability of A_i ∩ A_j for any i ≠ j is 0. This assumption is not necessary, but it is a great convenience while explaining the theory of conditional probabilities. Let the joint probability of each A_i with B be denoted p_i^b; that is, pr(A_i ∩ B) = p_i^b, i = 1, 2, 3, 4.

Since pr(S) = 1 and the union of the compound events A_i ∩ B, i = 1, 2, 3, 4, is certainly less than S (because B is not the whole of S), we know that pr[∪_i (A_i ∩ B)], where

pr[∪_i (A_i ∩ B)] = pr[(A_1 ∩ B) ∪ (A_2 ∩ B) ∪ (A_3 ∩ B) ∪ (A_4 ∩ B)] = p_1^b + p_2^b + p_3^b + p_4^b

is not greater than 1; indeed, it is less than 1. But our intent was to try to concentrate on probability restricted to the event B. This suggests that we divide each p_i^b by pr(B) to obtain a set of probabilities that, relative to the event B, sum to 1. We define the conditional probability of the event A given the event B by

pr(A | B) = pr(A ∩ B) / pr(B)     (6.2)

The probability of an event A restricted to the event B is the joint probability of A and B divided by the probability of the event B. It is clear that this procedure yields a new set of probabilities that also sum to one, but only over the compound event B. A simple example is given by considering the two mutually exclusive events A and A^c. The distribution of probability over A and A^c, where the event B intersects both, is

pr(A | B) + pr(A^c | B) = pr(A ∩ B)/pr(B) + pr(A^c ∩ B)/pr(B) = pr(B)/pr(B) = 1     (6.3)

We can add the probabilities in this expression because A and A^c are mutually exclusive. Further, it is always true that (A ∩ B) ∪ (A^c ∩ B) = B for any events A and B. If you do not see this right away, draw a Venn diagram and work out the proof for yourself.

Many statisticians claim that conditional probabilities are the most important probabilities, because almost all events met in practice are conditional on something. Without going that far, you will soon discover that conditional probabilities are very, very useful. For now, let us try another simple example from the sample space S = {1, 2, 3, ..., 11, 12}. Define the compound events a_i and b as follows:

a_1 = {1, 2, 3, 4, 5, 6}
a_2 = {7, 8}
a_3 = {9, 10, 11}
b = {6, 7, 8, 9}

So, the compound events formed by the intersection of the a_i and b are

a_1 ∩ b = {6}
a_2 ∩ b = {7, 8}
a_3 ∩ b = {9}

The corresponding joint probabilities are now easily calculated by adding the probabilities of the mutually exclusive (elementary) events in each set:

p_1^b = pr(a_1 ∩ b) = 1/12
p_2^b = pr(a_2 ∩ b) = 2/12
p_3^b = pr(a_3 ∩ b) = 1/12
pr(b) = 4/12

The corresponding conditional probabilities are given by:

pr(a_1 | b) = p_1^b / pr(b) = (1/12)/(4/12) = 1/4
pr(a_2 | b) = p_2^b / pr(b) = (2/12)/(4/12) = 1/2
pr(a_3 | b) = p_3^b / pr(b) = (1/12)/(4/12) = 1/4
Σ_i pr(a_i | b) = Σ_i p_i^b / pr(b) = 1

Now that we understand the idea of conditional probability, it is not a great step to recognize that we can always, and trivially, reexpress ordinary, or marginal, probabilities as conditional probabilities relative to the whole sample space. (Marginal probabilities are the probabilities associated with unconditional events.) Recognize, for a sample space S and any set A that is a subset of S, that A ∩ S = A and that pr(S) = 1. Therefore, we can formally state that the conditional probability of A given the sample space S is pr(A). More formally, we have

pr(A | S) = pr(A ∩ S) / pr(S) = pr(A)
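The division by pr(b) in these calculations is just Equation 6.2 at work. The sketch below (ours) redoes the a_i and b example with exact fraction arithmetic, so the answers 1/4, 1/2, and 1/4 come out exactly and visibly sum to one over b.

```python
from fractions import Fraction

def pr(event):
    return Fraction(len(event), 12)   # equally likely atoms of 1/12 each

a = [{1, 2, 3, 4, 5, 6}, {7, 8}, {9, 10, 11}]
b = {6, 7, 8, 9}

for i, a_i in enumerate(a, start=1):
    joint = pr(a_i & b)                           # p_i^b = pr(a_i ∩ b)
    print(f"pr(a_{i} | b) = {joint / pr(b)}")     # 1/4, 1/2, 1/4

print(sum(pr(a_i & b) for a_i in a) / pr(b))      # 1: the conditional probabilities sum to one over b
```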

Let us experiment with the concept of conditional probability in solving problems. We will use the popular game Dungeons & Dragons to illustrate the idea of conditional probability. Imagine that you are facing four doors: behind one is a treasure, behind another is an amulet to gain protection from scrofula, and behind the other two are the dreaded Hydra and the fire dragon, respectively. With an octagonal die (an eight-sided die), suppose that rolling a 1 or a 2 gives you entrance to the treasure, rolling a 4 or a 5 provides the amulet, and rolling a 6 or a 7 brings on the fire-breathing dragon. However, if you roll a 3 or an 8, you get the Hydra. The probability of this event, using the equally likely principle, is 2/8, or 1/4. These probabilities, labeled the marginal probabilities, are illustrated in Figure 6.5.

[Figure 6.5: A probability tree for the Dungeons & Dragons example. From the starting node ("You are here"), branches with marginal probabilities 1/4, 1/4, 1/4, and 1/4 lead to Treasure, Amulet, Hydra, and Dragon; from the Hydra node, branches with conditional probabilities 1/4, 3/8, and 3/8 lead to "Lose an Arm," "Grows a Head," and "Killed."]

If you rolled a 3 or an 8, you must now roll another octagonal die to discover the outcomes that await you through the Hydra door. If you then roll a 1 or a 2, you will lose an arm, with probability 1/4; if a 3, 4, or 5, the Hydra grows another head, with probability 3/8; and if a 6 or more, the Hydra dies and you escape for another adventure, with probability 3/8. These probabilities are the conditional probabilities.

What is the probability of your losing an arm given that you have already chosen the Hydra door? What is the probability of your losing an arm before you know which door you have to go through? The former probability is the conditional probability; it is the probability of losing an arm given that you have chosen the Hydra door, 1/4. The second probability is one we need to calculate. The conditional probabilities of the various alternatives, given that you choose the Hydra door, are:

lose your arm: conditional probability = 2/8 = 1/4
Hydra grows a head: conditional probability = 3/8
Hydra killed: conditional probability = 3/8
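The tree in Figure 6.5 can be checked by simulation. In this sketch of ours (die rules exactly as stated above), the relative frequency of "chose the Hydra door and lost an arm" settles near the 1/16 that the joint-probability formula derived next produces.

```python
import random

random.seed(7)   # arbitrary; fixed so runs are repeatable
N = 100_000
lost_arm = 0

for _ in range(N):
    door_roll = random.randint(1, 8)          # the first octagonal die
    if door_roll in (3, 8):                   # Hydra door: marginal probability 2/8 = 1/4
        hydra_roll = random.randint(1, 8)     # the second octagonal die
        if hydra_roll in (1, 2):              # lose an arm: conditional probability 1/4
            lost_arm += 1

print(lost_arm / N)   # near 1/16 = .0625 = pr(arm | Hydra) * pr(Hydra)
```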

These three probabilities are of the form pr(a_i | b), where a_i is one of the three alternatives facing you and b is the event "chose the Hydra door." In this example, you have been given the conditional probability; but to work out the probability of losing an arm before you choose your door, you will have to know the probability of choosing the Hydra door. That probability, the marginal probability, as we calculated, is 1/4. Now how do we get the probability of losing an arm? Look back at the definition of conditional probability:

pr(A | B) = pr(A ∩ B) / pr(B)     (6.4)

In our case pr(A | B) = 1/4, that is, the probability of losing an arm given you drew the Hydra door; and pr(B) = 1/4, that is, the probability of drawing the Hydra door. So, the probability of both drawing the Hydra door and losing your arm is

pr(A ∩ B) = pr(A | B) pr(B)     (6.5)

We got this from Equation 6.4 by multiplying both sides of that equation by pr(B). We conclude from Equation 6.5 that the joint event of choosing the Hydra door and losing an arm has a probability of (1/4) × (1/4) = 1/16, because pr(A | B) = 1/4 and pr(B) = 1/4. This last expression, which we have used to obtain the joint probability of 1/16, provides us with an interesting insight: the joint probability of the events A and B is the probability of A given B multiplied by the probability of the event B. This is a general result, not restricted to our current example. Further, we immediately recognize that we could as easily have written

pr(A ∩ B) = pr(B | A) pr(A)     (6.6)

We could have defined joint probability in terms of this relationship; that is, the joint probability of two events A and B is given by the product of the probability of B given A times the probability of A. This is a type of statistical chain rule. The probability of getting both A and B in a trial is equivalent to the probability of getting B first, followed by the probability of getting A given that B has already been drawn. We evaluate the events together, but for the purposes of calculation, we can think of them as events that occur consecutively.

Summing Up the Many Definitions of Probability

We started with one simple idea of probability, a theoretical analogue of relative frequency. We now have four concepts of probability:

(Marginal) probability: pr(A), pr(C)
Joint probability: pr(A ∩ B), pr(K ∩ L)
Composite (union) probability: pr(A ∪ B), pr(K ∪ L)
Conditional probability: pr(A | B), pr(C | K)

where A, B, C, K, and L each represent any event. We describe probabilities as marginal to stress that we are not talking about joint, composite, or conditional probability. So marginal probabilities are just our old probabilities dressed up with an added ...


More information

I - Probability. What is Probability? the chance of an event occuring. 1classical probability. 2empirical probability. 3subjective probability

I - Probability. What is Probability? the chance of an event occuring. 1classical probability. 2empirical probability. 3subjective probability What is Probability? the chance of an event occuring eg 1classical probability 2empirical probability 3subjective probability Section 2 - Probability (1) Probability - Terminology random (probability)

More information

1 Normal Distribution.

1 Normal Distribution. Normal Distribution.. Introduction A Bernoulli trial is simple random experiment that ends in success or failure. A Bernoulli trial can be used to make a new random experiment by repeating the Bernoulli

More information

STAT Chapter 3: Probability

STAT Chapter 3: Probability Basic Definitions STAT 515 --- Chapter 3: Probability Experiment: A process which leads to a single outcome (called a sample point) that cannot be predicted with certainty. Sample Space (of an experiment):

More information

Formalizing Probability. Choosing the Sample Space. Probability Measures

Formalizing Probability. Choosing the Sample Space. Probability Measures Formalizing Probability Choosing the Sample Space What do we assign probability to? Intuitively, we assign them to possible events (things that might happen, outcomes of an experiment) Formally, we take

More information

Chapter 6: Probability The Study of Randomness

Chapter 6: Probability The Study of Randomness Chapter 6: Probability The Study of Randomness 6.1 The Idea of Probability 6.2 Probability Models 6.3 General Probability Rules 1 Simple Question: If tossing a coin, what is the probability of the coin

More information

Probability Theory Review

Probability Theory Review Cogsci 118A: Natural Computation I Lecture 2 (01/07/10) Lecturer: Angela Yu Probability Theory Review Scribe: Joseph Schilz Lecture Summary 1. Set theory: terms and operators In this section, we provide

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information

4 Lecture 4 Notes: Introduction to Probability. Probability Rules. Independence and Conditional Probability. Bayes Theorem. Risk and Odds Ratio

4 Lecture 4 Notes: Introduction to Probability. Probability Rules. Independence and Conditional Probability. Bayes Theorem. Risk and Odds Ratio 4 Lecture 4 Notes: Introduction to Probability. Probability Rules. Independence and Conditional Probability. Bayes Theorem. Risk and Odds Ratio Wrong is right. Thelonious Monk 4.1 Three Definitions of

More information

Introducing Proof 1. hsn.uk.net. Contents

Introducing Proof 1. hsn.uk.net. Contents Contents 1 1 Introduction 1 What is proof? 1 Statements, Definitions and Euler Diagrams 1 Statements 1 Definitions Our first proof Euler diagrams 4 3 Logical Connectives 5 Negation 6 Conjunction 7 Disjunction

More information

Lecture Notes. Here are some handy facts about the probability of various combinations of sets:

Lecture Notes. Here are some handy facts about the probability of various combinations of sets: Massachusetts Institute of Technology Lecture 20 6.042J/18.062J: Mathematics for Computer Science April 20, 2000 Professors David Karger and Nancy Lynch Lecture Notes 1 Set Theory and Probability 1.1 Basic

More information

Week 2. Section Texas A& M University. Department of Mathematics Texas A& M University, College Station 22 January-24 January 2019

Week 2. Section Texas A& M University. Department of Mathematics Texas A& M University, College Station 22 January-24 January 2019 Week 2 Section 1.2-1.4 Texas A& M University Department of Mathematics Texas A& M University, College Station 22 January-24 January 2019 Oğuz Gezmiş (TAMU) Topics in Contemporary Mathematics II Week2 1

More information

Stochastic Histories. Chapter Introduction

Stochastic Histories. Chapter Introduction Chapter 8 Stochastic Histories 8.1 Introduction Despite the fact that classical mechanics employs deterministic dynamical laws, random dynamical processes often arise in classical physics, as well as in

More information

3.2 Probability Rules

3.2 Probability Rules 3.2 Probability Rules The idea of probability rests on the fact that chance behavior is predictable in the long run. In the last section, we used simulation to imitate chance behavior. Do we always need

More information

2. FUNCTIONS AND ALGEBRA

2. FUNCTIONS AND ALGEBRA 2. FUNCTIONS AND ALGEBRA You might think of this chapter as an icebreaker. Functions are the primary participants in the game of calculus, so before we play the game we ought to get to know a few functions.

More information

Probability COMP 245 STATISTICS. Dr N A Heard. 1 Sample Spaces and Events Sample Spaces Events Combinations of Events...

Probability COMP 245 STATISTICS. Dr N A Heard. 1 Sample Spaces and Events Sample Spaces Events Combinations of Events... Probability COMP 245 STATISTICS Dr N A Heard Contents Sample Spaces and Events. Sample Spaces........................................2 Events........................................... 2.3 Combinations

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

Review Basic Probability Concept

Review Basic Probability Concept Economic Risk and Decision Analysis for Oil and Gas Industry CE81.9008 School of Engineering and Technology Asian Institute of Technology January Semester Presented by Dr. Thitisak Boonpramote Department

More information

Probability and the Second Law of Thermodynamics

Probability and the Second Law of Thermodynamics Probability and the Second Law of Thermodynamics Stephen R. Addison January 24, 200 Introduction Over the next several class periods we will be reviewing the basic results of probability and relating probability

More information

MITOCW watch?v=vjzv6wjttnc

MITOCW watch?v=vjzv6wjttnc MITOCW watch?v=vjzv6wjttnc PROFESSOR: We just saw some random variables come up in the bigger number game. And we're going to be talking now about random variables, just formally what they are and their

More information

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ). Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.

More information

Notes Week 2 Chapter 3 Probability WEEK 2 page 1

Notes Week 2 Chapter 3 Probability WEEK 2 page 1 Notes Week 2 Chapter 3 Probability WEEK 2 page 1 The sample space of an experiment, sometimes denoted S or in probability theory, is the set that consists of all possible elementary outcomes of that experiment

More information

2. Probability. Chris Piech and Mehran Sahami. Oct 2017

2. Probability. Chris Piech and Mehran Sahami. Oct 2017 2. Probability Chris Piech and Mehran Sahami Oct 2017 1 Introduction It is that time in the quarter (it is still week one) when we get to talk about probability. Again we are going to build up from first

More information

Chapter 4: An Introduction to Probability and Statistics

Chapter 4: An Introduction to Probability and Statistics Chapter 4: An Introduction to Probability and Statistics 4. Probability The simplest kinds of probabilities to understand are reflected in everyday ideas like these: (i) if you toss a coin, the probability

More information

Brief Review of Probability

Brief Review of Probability Brief Review of Probability Nuno Vasconcelos (Ken Kreutz-Delgado) ECE Department, UCSD Probability Probability theory is a mathematical language to deal with processes or experiments that are non-deterministic

More information

Chapter 13, Probability from Applied Finite Mathematics by Rupinder Sekhon was developed by OpenStax College, licensed by Rice University, and is

Chapter 13, Probability from Applied Finite Mathematics by Rupinder Sekhon was developed by OpenStax College, licensed by Rice University, and is Chapter 13, Probability from Applied Finite Mathematics by Rupinder Sekhon was developed by OpenStax College, licensed by Rice University, and is available on the Connexions website. It is used under a

More information

RVs and their probability distributions

RVs and their probability distributions RVs and their probability distributions RVs and their probability distributions In these notes, I will use the following notation: The probability distribution (function) on a sample space will be denoted

More information

(Refer Slide Time: 0:21)

(Refer Slide Time: 0:21) Theory of Computation Prof. Somenath Biswas Department of Computer Science and Engineering Indian Institute of Technology Kanpur Lecture 7 A generalisation of pumping lemma, Non-deterministic finite automata

More information

Business Statistics. Lecture 3: Random Variables and the Normal Distribution

Business Statistics. Lecture 3: Random Variables and the Normal Distribution Business Statistics Lecture 3: Random Variables and the Normal Distribution 1 Goals for this Lecture A little bit of probability Random variables The normal distribution 2 Probability vs. Statistics Probability:

More information

Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Lecture 02 Groups: Subgroups and homomorphism (Refer Slide Time: 00:13) We looked

More information

The Inductive Proof Template

The Inductive Proof Template CS103 Handout 24 Winter 2016 February 5, 2016 Guide to Inductive Proofs Induction gives a new way to prove results about natural numbers and discrete structures like games, puzzles, and graphs. All of

More information

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1 Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 4-1 Overview 4-2 Fundamentals 4-3 Addition Rule Chapter 4 Probability 4-4 Multiplication Rule:

More information

If S = {O 1, O 2,, O n }, where O i is the i th elementary outcome, and p i is the probability of the i th elementary outcome, then

If S = {O 1, O 2,, O n }, where O i is the i th elementary outcome, and p i is the probability of the i th elementary outcome, then 1.1 Probabilities Def n: A random experiment is a process that, when performed, results in one and only one of many observations (or outcomes). The sample space S is the set of all elementary outcomes

More information

3 The language of proof

3 The language of proof 3 The language of proof After working through this section, you should be able to: (a) understand what is asserted by various types of mathematical statements, in particular implications and equivalences;

More information

Chapter 2 Class Notes

Chapter 2 Class Notes Chapter 2 Class Notes Probability can be thought of in many ways, for example as a relative frequency of a long series of trials (e.g. flips of a coin or die) Another approach is to let an expert (such

More information

Chapter 8: An Introduction to Probability and Statistics

Chapter 8: An Introduction to Probability and Statistics Course S3, 200 07 Chapter 8: An Introduction to Probability and Statistics This material is covered in the book: Erwin Kreyszig, Advanced Engineering Mathematics (9th edition) Chapter 24 (not including

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Notes 1 Autumn Sample space, events. S is the number of elements in the set S.)

Notes 1 Autumn Sample space, events. S is the number of elements in the set S.) MAS 108 Probability I Notes 1 Autumn 2005 Sample space, events The general setting is: We perform an experiment which can have a number of different outcomes. The sample space is the set of all possible

More information

2) There should be uncertainty as to which outcome will occur before the procedure takes place.

2) There should be uncertainty as to which outcome will occur before the procedure takes place. robability Numbers For many statisticians the concept of the probability that an event occurs is ultimately rooted in the interpretation of an event as an outcome of an experiment, others would interpret

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

The probability of an event is viewed as a numerical measure of the chance that the event will occur.

The probability of an event is viewed as a numerical measure of the chance that the event will occur. Chapter 5 This chapter introduces probability to quantify randomness. Section 5.1: How Can Probability Quantify Randomness? The probability of an event is viewed as a numerical measure of the chance that

More information

Probability- describes the pattern of chance outcomes

Probability- describes the pattern of chance outcomes Chapter 6 Probability the study of randomness Probability- describes the pattern of chance outcomes Chance behavior is unpredictable in the short run, but has a regular and predictable pattern in the long

More information

Basic Probability. Introduction

Basic Probability. Introduction Basic Probability Introduction The world is an uncertain place. Making predictions about something as seemingly mundane as tomorrow s weather, for example, is actually quite a difficult task. Even with

More information

Guide to Proofs on Sets

Guide to Proofs on Sets CS103 Winter 2019 Guide to Proofs on Sets Cynthia Lee Keith Schwarz I would argue that if you have a single guiding principle for how to mathematically reason about sets, it would be this one: All sets

More information

MITOCW ocw f99-lec30_300k

MITOCW ocw f99-lec30_300k MITOCW ocw-18.06-f99-lec30_300k OK, this is the lecture on linear transformations. Actually, linear algebra courses used to begin with this lecture, so you could say I'm beginning this course again by

More information

Delayed Choice Paradox

Delayed Choice Paradox Chapter 20 Delayed Choice Paradox 20.1 Statement of the Paradox Consider the Mach-Zehnder interferometer shown in Fig. 20.1. The second beam splitter can either be at its regular position B in where the

More information

STA111 - Lecture 1 Welcome to STA111! 1 What is the difference between Probability and Statistics?

STA111 - Lecture 1 Welcome to STA111! 1 What is the difference between Probability and Statistics? STA111 - Lecture 1 Welcome to STA111! Some basic information: Instructor: Víctor Peña (email: vp58@duke.edu) Course Website: http://stat.duke.edu/~vp58/sta111. 1 What is the difference between Probability

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019 Lecture 10: Probability distributions DANIEL WELLER TUESDAY, FEBRUARY 19, 2019 Agenda What is probability? (again) Describing probabilities (distributions) Understanding probabilities (expectation) Partial

More information

Statistics 251: Statistical Methods

Statistics 251: Statistical Methods Statistics 251: Statistical Methods Probability Module 3 2018 file:///volumes/users/r/renaes/documents/classes/lectures/251301/renae/markdown/master%20versions/module3.html#1 1/33 Terminology probability:

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Lines and Their Equations

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Lines and Their Equations ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 017/018 DR. ANTHONY BROWN. Lines and Their Equations.1. Slope of a Line and its y-intercept. In Euclidean geometry (where

More information

Sample Space: Specify all possible outcomes from an experiment. Event: Specify a particular outcome or combination of outcomes.

Sample Space: Specify all possible outcomes from an experiment. Event: Specify a particular outcome or combination of outcomes. Chapter 2 Introduction to Probability 2.1 Probability Model Probability concerns about the chance of observing certain outcome resulting from an experiment. However, since chance is an abstraction of something

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER / Probability

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER / Probability ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER 2 2017/2018 DR. ANTHONY BROWN 5.1. Introduction to Probability. 5. Probability You are probably familiar with the elementary

More information

Ordinary Differential Equations Prof. A. K. Nandakumaran Department of Mathematics Indian Institute of Science Bangalore

Ordinary Differential Equations Prof. A. K. Nandakumaran Department of Mathematics Indian Institute of Science Bangalore Ordinary Differential Equations Prof. A. K. Nandakumaran Department of Mathematics Indian Institute of Science Bangalore Module - 3 Lecture - 10 First Order Linear Equations (Refer Slide Time: 00:33) Welcome

More information

DECISIONS UNDER UNCERTAINTY

DECISIONS UNDER UNCERTAINTY August 18, 2003 Aanund Hylland: # DECISIONS UNDER UNCERTAINTY Standard theory and alternatives 1. Introduction Individual decision making under uncertainty can be characterized as follows: The decision

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

PROBABILITY THEORY 1. Basics

PROBABILITY THEORY 1. Basics PROILITY THEORY. asics Probability theory deals with the study of random phenomena, which under repeated experiments yield different outcomes that have certain underlying patterns about them. The notion

More information

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky What follows is Vladimir Voevodsky s snapshot of his Fields Medal work on motivic homotopy, plus a little philosophy and from my point of view the main fun of doing mathematics Voevodsky (2002). Voevodsky

More information

Notes on Mathematics Groups

Notes on Mathematics Groups EPGY Singapore Quantum Mechanics: 2007 Notes on Mathematics Groups A group, G, is defined is a set of elements G and a binary operation on G; one of the elements of G has particularly special properties

More information

Roberto s Notes on Linear Algebra Chapter 11: Vector spaces Section 1. Vector space axioms

Roberto s Notes on Linear Algebra Chapter 11: Vector spaces Section 1. Vector space axioms Roberto s Notes on Linear Algebra Chapter 11: Vector spaces Section 1 Vector space axioms What you need to know already: How Euclidean vectors work. What linear combinations are and why they are important.

More information

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n = Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,

More information

Nondeterministic finite automata

Nondeterministic finite automata Lecture 3 Nondeterministic finite automata This lecture is focused on the nondeterministic finite automata (NFA) model and its relationship to the DFA model. Nondeterminism is an important concept in the

More information

Hardy s Paradox. Chapter Introduction

Hardy s Paradox. Chapter Introduction Chapter 25 Hardy s Paradox 25.1 Introduction Hardy s paradox resembles the Bohm version of the Einstein-Podolsky-Rosen paradox, discussed in Chs. 23 and 24, in that it involves two correlated particles,

More information

STT When trying to evaluate the likelihood of random events we are using following wording.

STT When trying to evaluate the likelihood of random events we are using following wording. Introduction to Chapter 2. Probability. When trying to evaluate the likelihood of random events we are using following wording. Provide your own corresponding examples. Subjective probability. An individual

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 1

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 1 CS 70 Discrete Mathematics and Probability Theory Fall 013 Vazirani Note 1 Induction Induction is a basic, powerful and widely used proof technique. It is one of the most common techniques for analyzing

More information

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models Contents Mathematical Reasoning 3.1 Mathematical Models........................... 3. Mathematical Proof............................ 4..1 Structure of Proofs........................ 4.. Direct Method..........................

More information

Incompatibility Paradoxes

Incompatibility Paradoxes Chapter 22 Incompatibility Paradoxes 22.1 Simultaneous Values There is never any difficulty in supposing that a classical mechanical system possesses, at a particular instant of time, precise values of

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Fall 203 Vazirani Note 2 Random Variables: Distribution and Expectation We will now return once again to the question of how many heads in a typical sequence

More information