Chapter 5: Probability in Our Daily Lives

Spring 2015

These notes reflect material from our text, Statistics: The Art and Science of Learning from Data, Third Edition, by Alan Agresti and Christine Franklin, published by Pearson, 2013.

Probability quantifies randomness. It is a formal framework with a very specific vocabulary and notation.

Imagine an experiment with a specific set of outcomes (say, flipping a fair coin twice). $S$ is the sample space of all possible outcomes. Subsets of $S$ are called events and are denoted with letters like $A$ and $B$. The empty set, $\emptyset$, is the event that contains no outcomes. Two events are disjoint if their intersection is empty.

The Russian mathematician Kolmogorov helped to clarify the essential properties of a probability function, $P$:

- $P(S) = 1$ for the entire sample space $S$
- $0 \le P(A) \le 1$ for any event $A \subseteq S$
- $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$ for disjoint events $A_i$

First examples: flip a coin, flip three coins, roll a die, roll two dice

If you roll a die once the result is completely uncertain, because the individual outcomes are equally likely. But now begin to methodically roll the die and after each toss calculate the total number of 6s observed so far divided by the total number of rolls at this point. Call this a cumulative proportion and graph these cumulative proportions for a large number of rolls of the die, say 100,000 rolls. A computer did this and displayed the following graph.

In this particular simulation, the first ten rolls of the die produced the sequence 0001010010, where 1 means a 6 was rolled and 0 means something else appeared. Calculate the first ten cumulative proportions for this short sequence and compare your results to the following chart. What is the height of the dotted red line?

[Figure: cumulative proportion $\hat{p}_n$ (vertical axis, 0.00 to 0.30) against $n$, the number of rolls (horizontal axis, 1 to 100,000 on a logarithmic scale), with a dotted red horizontal line.]

Fig. Cumulative proportions of a 6 in 100,000 rolls of a fair die, from OpenIntro Statistics, chapter 2

Display discrete probabilities in a table

Flip a fair coin

    outcome       h     t
    probability   0.5   0.5
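The cumulative-proportion experiment described above can be sketched in R. This is only an illustration, not the code that produced the figure: the seed, the variable names, and the use of sample() to simulate the die are assumptions made here.

```r
# First ten rolls from the notes: 1 means a 6 was rolled, 0 means anything else.
first_ten <- c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0)

# Cumulative proportion after each roll:
# running count of 6s divided by the number of rolls so far.
cum_prop_ten <- cumsum(first_ten) / seq_along(first_ten)
round(cum_prop_ten, 3)

# Simulate 100,000 rolls of a fair die (the seed is an arbitrary choice).
set.seed(1721)
n_rolls <- 100000
rolls <- sample(1:6, size = n_rolls, replace = TRUE)
cum_prop <- cumsum(rolls == 6) / seq_len(n_rolls)

# Plot with a logarithmic horizontal axis, as in the figure.
plot(seq_len(n_rolls), cum_prop, type = "l", log = "x",
     xlab = "n (number of rolls)", ylab = "cumulative proportion of 6s")
```

A different seed gives a different path for small n, but the curve should settle down in much the same way as n grows.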
[Figure: Venn diagram of two events $A$ and $B$.]

Rules of Probability

- Mutually exclusive events. $A \cap B = \emptyset$.
- Unions. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
- Complements. $P(A^c) = 1 - P(A)$.
- Independent events. $P(A \cap B) = P(A)\,P(B)$ when $A$ and $B$ are independent.
- Conditional probability. $P(A \mid B) = P(A \cap B)/P(B)$ when $P(B) \ne 0$.
- Intersections. $P(A \cap B) = P(A \mid B)\,P(B)$.
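The rules above can be checked numerically on one of the first examples, rolling two dice. The following is a minimal sketch: the events A, B, and C and the helper prob() are hypothetical choices made for illustration, and prob() is valid only because all 36 outcomes are equally likely.

```r
# Sample space for rolling two fair dice: 36 equally likely outcomes.
S <- expand.grid(die1 = 1:6, die2 = 1:6)

# Hypothetical helper: the probability of an event is the fraction of
# outcomes in S for which it occurs (equally likely outcomes only).
prob <- function(event) mean(event)

A <- S$die1 == 6              # event A: the first die shows a 6
B <- S$die1 + S$die2 >= 10    # event B: the total is at least 10

# Unions: P(A or B) = P(A) + P(B) - P(A and B)
all.equal(prob(A | B), prob(A) + prob(B) - prob(A & B))

# Complements: P(not A) = 1 - P(A)
all.equal(prob(!A), 1 - prob(A))

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_A_given_B <- prob(A & B) / prob(B)

# Intersections: P(A and B) = P(A | B) P(B)
all.equal(prob(A & B), p_A_given_B * prob(B))

# Independent events: the two dice do not influence each other,
# so P(A and C) = P(A) P(C) for C = "the second die shows a 6".
C <- S$die2 == 6
all.equal(prob(A & C), prob(A) * prob(C))
```

Note that A and B here are not independent: knowing that the total is at least 10 changes the probability that the first die shows a 6.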
Contingency tables and conditional probabilities

Vocabulary for diagnostic testing, with $S$ = medical state present and POS = test positive:

- sensitivity $P(\mathrm{POS} \mid S)$
- specificity $P(\mathrm{NEG} \mid S^c)$
- incidence $P(S)$

Consider the Triple Blood Test for Down Syndrome (Agresti and Franklin, chapter 5, pp. 232-233)

                              Blood Test
    Status                POS     NEG     Total
    D (Down)               48       6        54
    D^c (unaffected)     1307    3921      5228
    Total                1355    3927      5282

Calculate the following probabilities based on the figures in this study:

- sensitivity $P(\mathrm{POS} \mid D)$, specificity $P(\mathrm{NEG} \mid D^c)$, incidence $P(D)$
- false positives $P(\mathrm{POS} \mid D^c)$, false negatives $P(\mathrm{NEG} \mid D)$

An individual being tested would be most concerned about $P(D \mid \mathrm{POS})$. What is this probability? Why is it so small? Hint: Calculate $P(D^c \mid \mathrm{POS})$.

Again, an individual being tested would want to know $P(D \mid \mathrm{NEG})$. How would that probability compare to the a priori $P(D)$?

[Figure: "Triple Blood Test", a plot of status (unaffected, Down) against blood test (POS, NEG).]
Using R to Compute Conditional Probabilities

Construct a matrix named down to represent the Down Syndrome contingency table, and then use addmargins(down) to compute its row and column totals.

    # Counts are entered column by column: the pos column, then the neg column.
    down <- c(48, 1307, 6, 3921)
    dim(down) <- c(2, 2)
    dimnames(down) <- list(status=c("down", "unaffected"), "blood test"=c("pos", "neg"))
    down
    # status        pos  neg
    #   down         48    6
    #   unaffected 1307 3921

    addmargins(down)
    # status        pos  neg  Sum
    #   down         48    6   54
    #   unaffected 1307 3921 5228
    #   Sum        1355 3927 5282

Then prop.table(down, 1) will divide each row by its row sum. The numbers in each row are conditional probabilities. And prop.table(down, 2) will divide each column by its column sum. The numbers in each column are conditional probabilities. Therefore, each of the eight numbers shown below is a conditional probability of the form $P(A \mid B)$ for some $A$ and $B$. Identify the correct $A$ and $B$ for each number.

    prop.table(down, 1)
    # status             pos       neg
    #   down       0.8888889 0.1111111
    #   unaffected 0.2500000 0.7500000

    prop.table(down, 2)
    # status              pos         neg
    #   down       0.03542435 0.001527884
    #   unaffected 0.96457565 0.998472116

What values do these tables indicate for $P(\mathrm{pos} \mid \mathrm{down})$ and $P(\mathrm{down} \mid \mathrm{pos})$?
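One way to see how the two prop.table() results relate is to rebuild a column-conditional probability from row-conditional ones, using only the conditional-probability and intersection rules. The sketch below is an illustration under stated assumptions: the variable names are chosen here, and the matrix is rebuilt with matrix() so the block runs on its own.

```r
# Rebuild the contingency table so this block is self-contained.
down <- matrix(c(48, 1307, 6, 3921), nrow = 2,
               dimnames = list(status = c("down", "unaffected"),
                               "blood test" = c("pos", "neg")))

row_cond <- prop.table(down, 1)   # each row sums to 1
col_cond <- prop.table(down, 2)   # each column sums to 1

sensitivity <- row_cond["down", "pos"]           # P(pos | down)
specificity <- row_cond["unaffected", "neg"]     # P(neg | unaffected)
incidence   <- sum(down["down", ]) / sum(down)   # P(down)

# P(pos): add the two disjoint ways of testing positive.
p_pos <- sensitivity * incidence + (1 - specificity) * (1 - incidence)

# P(down | pos) = P(pos | down) P(down) / P(pos)
p_down_given_pos <- sensitivity * incidence / p_pos

# The result agrees with the corresponding entry of the column-conditional table.
all.equal(p_down_given_pos, col_cond["down", "pos"])
```

Comparing the sizes of the two terms that make up p_pos is a useful starting point for the question of why P(down | pos) is so small.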
Boston Smallpox Epidemic of 1721

The following contingency table (OpenIntro Statistics, pp. 83-87) refers to the Boston smallpox epidemic of 1721. A total of 6224 residents of Boston contracted smallpox in this epidemic and 850 of them died. The epidemic was marked by vigorous public debate of the value (or lack thereof) of a type of inoculation known as variolation (which was dangerous). The Reverend Cotton Mather advocated inoculation but the physician William Douglass was firmly against it. See the article in Harvard's Contagion for more details.

An effective smallpox vaccination procedure was eventually demonstrated by Edward Jenner in England in 1796, and succeeding efforts to eradicate smallpox from the world were finally declared to be successful in 1980 by the World Health Organization. Cotton Mather, on the other hand, lives on in infamy for his role in the Salem witch trials.

                  Inoculated
    Result      yes      no     Total
    lived       238    5136      5374
    died          6     844       850
    Total       244    5980      6224

[Figure: "Smallpox Epidemic, Boston, 1721", a plot of result (died, lived) against inoculated (yes, no).]
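The R approach used earlier for the Triple Blood Test carries over directly to this table. The sketch below is an illustration, not part of the original notes: the object name smallpox and the dimension labels are assumptions made here.

```r
# Counts from the smallpox table; matrix() fills column by column,
# so the "yes" column (238, 6) comes first, then the "no" column (5136, 844).
smallpox <- matrix(c(238, 6, 5136, 844), nrow = 2,
                   dimnames = list(result = c("lived", "died"),
                                   inoculated = c("yes", "no")))

# Row and column totals, matching the margins of the printed table.
addmargins(smallpox)

# Divide each column by its column total: P(result | inoculated).
prop.table(smallpox, 2)

# Divide each row by its row total: P(inoculated | result).
prop.table(smallpox, 1)
```

Comparing the "died" entries in the two columns of prop.table(smallpox, 2) is the natural way to compare the death rates of the inoculated and non-inoculated groups.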
Tree Diagrams

The following tree diagram, generated by OpenIntro software, summarizes the relevant statistics for the Boston smallpox epidemic of 1721. Here Inoculated is a categorical explanatory variable with levels yes and no. In the Inoculated column of the tree diagram are the probabilities $P(\mathrm{yes})$ and $P(\mathrm{no})$. The categorical response variable Result has levels lived and died. The conditional probabilities in the Result column are $P(\mathrm{lived} \mid \mathrm{yes})$, $P(\mathrm{died} \mid \mathrm{yes})$, $P(\mathrm{lived} \mid \mathrm{no})$, $P(\mathrm{died} \mid \mathrm{no})$. The probabilities calculated by the software in the third column are $P(\mathrm{lived\ and\ yes})$, $P(\mathrm{died\ and\ yes})$, $P(\mathrm{lived\ and\ no})$, $P(\mathrm{died\ and\ no})$, because $P(A)\,P(B \mid A) = P(A \cap B)$.

    Inoculated      Result            Joint probability
    yes, 0.0392     lived, 0.9754     0.0392*0.9754 = 0.03824
                    died,  0.0246     0.0392*0.0246 = 0.00096
    no,  0.9608     lived, 0.8589     0.9608*0.8589 = 0.82523
                    died,  0.1411     0.9608*0.1411 = 0.13557

Fig. Smallpox in Boston, 1721, from OpenIntro Statistics, chapter 2, pp. 83-87
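The three columns of the tree diagram can be reproduced from the raw counts. This is a sketch under stated assumptions, not the OpenIntro software itself: the object names are chosen here, and sweep() is simply one convenient way to multiply each column of conditional probabilities by its marginal probability.

```r
# Counts from the smallpox table: rows are result, columns are inoculated.
smallpox <- matrix(c(238, 6, 5136, 844), nrow = 2,
                   dimnames = list(result = c("lived", "died"),
                                   inoculated = c("yes", "no")))

# First column of the tree: P(yes) and P(no).
p_inoc <- colSums(smallpox) / sum(smallpox)

# Second column of the tree: P(result | inoculated).
p_result_given_inoc <- prop.table(smallpox, 2)

# Third column of the tree: multiply each column of conditional
# probabilities by its marginal, P(A) P(B | A) = P(A and B).
joint <- sweep(p_result_given_inoc, 2, p_inoc, `*`)

round(p_inoc, 4)
round(p_result_given_inoc, 4)
round(joint, 5)

# The joint probabilities equal the cell counts divided by the grand total.
all.equal(joint, smallpox / sum(smallpox))
```

The values in joint can differ from the tree diagram in the last digits, because the diagram multiplies probabilities that were already rounded to four decimal places.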
Exercises

We will attempt to solve some of the following exercises as a community project in class today. Finish these solutions as homework exercises, write them up carefully and clearly, and hand them in at the beginning of class next Friday.

Exercises for Chapter 5: 5.10 (coin), 5.12 (stock market), 5.21 (risky behavior), 5.23 (seat belts), 5.34 (free throws), 5.37 (Down), 5.38 (job), 5.40 (serves), 5.50 (Masters), 5.57 (mammogram)

Class work 5a, probability
Exercises from Chapter 5: 5.10 (coin), 5.12 (stock market), 5.21 (risky behavior), 5.23 (seat belts), 5.34 (free throws)

Class work 5b, probability
Exercises from Chapter 5: 5.37 (Down), 5.38 (job), 5.40 (serves), 5.50 (Masters), 5.57 (mammogram)