Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 5 (MWF) Probabilities and the rules Suhasini Subba Rao
Review of previous lecture We looked at the interquartile range. The concept of a variance and standard deviation was introduced. The standard deviation is the square root of the variance. Like the mean, the standard deviation (which is the square root of the variance) is a measure of spread of the population, and is not random. On the other hand, the sample variance and standard deviation are random. In statistics we have: Populations, Random variables (the measurements we are interested in the population) and Probabilities which are allocated 1
to the random variables. In this lecture we will introduce the notion of a probability and the rules that go with it. 2
Probability Suppose there is a population of people and I select one person at random. What is the chance (probability) that the selected person s height lies in the interval [5.5, 5.75]feet? How do we calculate this chance? The probability that the randomly selected person s height lies in the interval [5.5, 5.75] is: Total number of people in population with height in the interval [5.5, 5.75]. Total number of people in population 3
Probability and random variables A probability is always positive and is greater than or equal to zero and less than or equal to one (lies in the interval [0, 1]). Random variables and probabilities Recall a random variable X denotes a measurement on a randomly selected individual (it can be the height of a randomly selected student or the gender of a random selected person (for the sake of simplicity we will assume either male or female)). A random variable can take any one of several different outcomes, to each outcome we allocate a probability. We now consider some examples to familiarize ourselves with this notation. 4
Examples of random variables and their probabilities For example if X is the gender of a randomly chosen in the world population. Then clearly X = {Male or Female}. We write P (X =Female) to denote the probability a random chosen person is female. Similar X can denote the height of a randomly chosen person. Then P (5 < X < 5.25), denotes the probability of that a randomly selected students height is in the interval [5, 5.25]. 5
Illustration:Data from a 651 class Summary of information from a 651 class. 6
Probabilities and random variables: Illustration 1 (i) The measurement of interest may be the height of the randomly chosen person. Therefore X = height of randomly selected person. The set of all possible outcomes is any number in the interval [5, 7]. Therefore, P (5 X 7) = 1. P (5.75 X < 6.25) = P (5.75 X < 6) + P (6 X < 6.25) = 0.22 + 0.14. It is impossible from this plot to evaluate P (5.9 X < 6.15). (ii) The measurement of interest may be the gender of the randomly chosen person. Therefore Y = gender of randomly chosen person. The set of all possible outcomes is {male, female}. To each outcome we assign a probability. Since {male, female} is the set of all outcomes (there are no other), we have that P (Y = male or female) =1 and P (Y = Male) = 0.4308. 7
Mutually exclusive events Two events (outcomes) are mutually exclusive, if the occurrence of one event excludes the occurrence of the other event. In the height example, the interval A = [5, 5.5) is an (this is the event that a randomly chosen person s height lies in the interval A = [5, 5.5)). Similarly B = [5.75, 6) is an event. A = [5, 5.5) and B = [5.75, 6) are mutually exclusive events. This means if the randomly chosen person s height lies in the interval A = [5, 5.5), then their height cannot lie in the interval B = [5.75, 6]. Similarly if we know that X (the height of a randomly chosen person) lies in B = [5.75, 6) then it cannot lie in A = [5, 5.5). On the other hand, the events A = [5, 5.5) and C = [5.25, 5.75) are not mutually exclusive. If X = 5.3, then it lies in both events A and C. 8
Suppose X is the time of the day I wake up. Let A=night time and B=day time, then A and B are mutually exclusive events. In anyone day, I cannot get up at both night and day time. 9
Mutually exclusive events and probabilities Benefits of mutually exclusive events If two events A and B are mutually exclusive, then P (A or B) = P (A) + P (B) (often P (A or B) = P (A B)). Height Example Suppose X is the height of a randomly chosen person. If A = [5, 5.5) and B = [5.75, 6), then P (X in A or X in B) = P (5 X < 5.5 or 5.75 X < 6). Since A and B are mutually exclusive then P (X in A or X in B) = P (5 X < 5.5 or 5.75 X < 6) = P (5 X < 5.5) + P (5.75 X < 6). Plot these two events on a density plot. 10
Example 2 Suppose that X is the height of a randomly chosen person. Suppose that the probability P (X t) is known for every single number t. That is I know P (X 4), P (X 4.5) P (X 6) etc. Using this information, how can I evaluate: (i) P (4 < X 7). (ii) P (X 8) (iii) P (4 < X 7 or 8 < X 9) (iv) P (4 < X 7 or 6 < X 8) Solutions in http://www.stat.tamu.edu/~suhasini/teaching651/ probabilities_lecture5.pdf 11
Reasons for the above exercise Soon you will need to understand how to calculate the chance of a certain event happening (such as the sample mean being less than a certain value). This will require you to use the normal tables and look up probabilites and the above will be useful. It will give you some idea how percentiles are calculated. Let us return to the doctor office example. Suppose your blood level lies between 13-15, then given the chances the distribution of blood levels (based on) P (X t), you can calculate the chance P (13 < X 15). 12
Motivation: Conditional probabilities Is there a relationship between binge drinking (excessive consumption of alcohol) and gender. 17096 students were surveyed (7180 males and 9916 females). The data is given below. Male Female Binge 1630 1684 Not Binge 5550 8232 In order to analyze this type of data we need to introduce the notion of a conditional probability. This is a probability calculated using not the entire population but the conditioned subpopulation. 13
Motivation: cont A conditional probability is the probability the probability of an event given that we already have some (possibly partial) information about. Male Female Binge 1630 1684 Not Binge 5550 8232 Subtotals 7180 9916 The probability of binge drinking given that you know person is female is 1684/9916 17%, whereas the probability of binge drinking given that you know person is male is 1630/7180 22.7%. We can write these conditional probabilities as P(binge drink Male) = 0.227 and P(binge drink Female) = 0.17. Based on the above information, does gender change the chance of 14
binge drinking? In other words does information about the gender of the person change the chance of bring drinking. This example alludes to the notion of independence, which we discuss at the end of this lecture. 15
Conditional probability Example Suppose X is the height of a randomly selected person and we know that X lies in the interval [5, 6], then what is the probability X lies in [5, 5.5]? This probability is Number of people whose height is in [5, 5.5] Number of people whose height is in [5, 6]. We write this probability as: the probability that X lies in [5, 5.5] given that X lies in [5, 6]. Using probability notation this is written as: P (5 X 5.5 5 X 6) = P (5 X 5.5 }{{} 5 X 6) denotes given 16
Compare this with the the probability a random person lies in the interval [5, 5.5], written as P (5 X 5.5). In this case P (5 X 5.5) will be smaller than P (5 X 5.5 5 X 6). What is P (5 X 5.5 X 6)? 17
Conditional probability: Heights and gender of 18 people Height 5.5 5.9 4.9 6.2 6 5.9 5.2 5.7 5.3 Gender F M F M M F F M F Height 5 6.3 5.6 5.9 5.8 5.9 6 5.6 5.5 Gender M M F M F M M F F Suppose we randomly select a person and let X denote their height and Y their gender. Calculate (i) P (X 5.5). (ii) P (X 5.5 Y = M). (iii) P (X 5.5 Y = F ). What do we observe? Is there a difference in the distribution of male and female heights. 18
Example - heights and gender (general) Suppose we randomly choose a person, and let X denote their height and Y their gender. Then, P (5 X 5.5 Y = female), is the probability that a randomly person s height lies in interval [5, 5.5] given that we know the person is female. Clearly P (5 X 5.5 Y = female) P (5 X 5.5) and P (5 X 5.5 Y = female) P (5 X 5.5 Y = male). Question In general, which do you think is the larger probability; P (5 X 5.5 Y = female) or P (5 X 5.5 Y = male). Answer Since females, in general, tends to be smaller than males, and [5, 5.5] feet is quite small. It is more likely that a female should have a height in the interval in [5, 5.5] than a male to have a height in [5, 5.5]. 19
This is saying that the probability a randomly chosen person s height lies in [5, 5.5] given that the person is female is greater than the probability a randomly chosen persons height lies in [5, 5.5] given that the person is male. In more succinct notation, this is saying that P (5 X 5.5 Y = female) > P (5 X 5.5 Y = male). Suppose that X denotes the height of a randomly chosen person in the 651 data and Y denotes their gender. How to evaluate: (i) P (5 X 5.5) (ii) P (5 X 5.5 Y = female) (iii) P (5 X 5.5 Y = male)? We will not be using conditional probabilities much in this course. 20
However, the idea is important to understand. This is because when we do statistical testing we will use the idea of probability of observing some event given that A is true. 21
Independence and conditional probabilities Definition Suppose that we have two events A and B. The events A and B are independent of each other if P (A B) = P (A). This means the event B has no influence what so ever on the chance of event A occurring. Let us return to the example of heights and gender. Do you think height and gender are independent? In other words, does information about the gender of a person play no role in the chance of their being a certain height. Intuitively this seems not be the case. We showed this in the above example. Where we showed that P (5 X 5.5 Y = female) > P (5 X 5.5 Y = male). Which means that P (5 X 5.5 Y = female) P (5 X 5.5) Hence gender and height are not independent. 22
Random variables X and Y are independent if for any outcome of X and and outcome of Y, P (X = event A Y = event B) = P (X = event A). Independent and mutually exclusive (they not the same) Example If A is the event a height is in the interval [5, 5.5] and B is the event a height is in the interval [6, 6.5], then P (X in [5, 5.5] X in [6, 6.5]) = 0 P (X in [5, 5.5]). A and B are not independent events. 23
Example: Binge drinking We recall that data set: Male Female Binge 1630 1684 Not Binge 5550 8232 Subtotals 7180 9916 We recall the probability of binge drinking given that you know the person is female = P(binge drink Female) = 0.17, whereas the probability of binge drinking given that you know the person is male = P(binge drink Male) = 0.227. As these conditional probabilities are different, there is a dependence/association between gender and binge drinking. 24
Examples: Independent events Define the random variables. X is the season in college station. Y are the number of hair dressers in college station. Z is the temperature in college station. It is unlikely that the number of hair dressers in college station have an influence on the temperature. Therefore P (Z [25, 28] Y ) = P (X [25, 28]). 25
On the other hand season does have an unfluence on temperature, therefore P (X [25, 28] X = summer ) P (X [25, 28]) or equivalently, P (X [25, 28] and Z = summer ) P (X [25, 28])P (Z = summer). We will understand the above statement in lecture 6. 26
Dependence does not imply causality A study was done in the early 90s to see if the mortality rates between left and right handed people were the same. To do this the psychologists collected the death records of 2000 people who died in May, 1990, in Southern California. They rang the families of all these people and asked whether the were left or right handed. They also categorized the people who died into those above 60 and below 60 years old. This is the data they collected. Out of the 2000, 400 were left handed. Out of the 2000, 150 were left handed and died below 60. Out of the 2000, 300 were right handed and died below 60. 27
(a) A person dies below 60? Lecture 5 (MWF) Random variables, Probability and conditional probabilities (b) A person dies below 60 given that they are left handed? (c) Does the data suggest there is an association/dependence between left handedess and early mortality? 28
Solution It instructive to summarize the data as a contingency table: Died before 60 Died after 60 totals Left 150 250 400 Right 300 1300 1600 Totals 450 1550 2000 (a) The chance a person dies below 60. Calculate the total proportion who died before 60 P (before 60) = 450 2000 (b) A person dies below 60 given that they are left handed. 29
Focus on people who are only left handed. There are 400 of these. Of these, 150 died before 60. Therefore the proportion of left handed people in the sample who died before 60 is P (before 60 left handed) = 150 400 = 3 8 = 37.5% Using the same argument the proportion of right handed people in the sample who died before 60 is P (before 60 right handed) = 300 1600 = 3 16 = 18.75%. (c) As these proportions are different, it suggests there is some sort of dependence/association between lefthandedness and early mortality. 30
Why the dependence? Can we conclude from this data that being left handed increases the chance of early mortality? This statement is a causal statement, it asks whether being left handed causes early mortality. This data set cannot answer this question, does not take into account other variables associated with being left handed which are causing the effect you are seeing. The most plausible explanation for the difference in proportions is that the incidence of left-handedness has increased over the past 50 years (because people are no longer forced to be right handed). The unobserved variable is 31
Example: Before 1930, no one was left handed, what would the numbers in the table and proportions be? 32