Chapter 5. The Goodness of Fit Test. 5.1 Dice, Computers and Genetics

Thomas James
5 years ago

The CM of casting a die was introduced in Chapter 1. We assumed that the six possible outcomes of this CM are equally likely; i.e. we assumed the ELC. Later I mentioned that I own two round-cornered dice and I suspect that the ELC is not reasonable for either of them. How can we decide whether to believe in the ELC for a die?

In this chapter we will learn about the (Chi-squared) Goodness of Fit Test. This test was developed circa 1900 by Karl Pearson (1857-1936), in part to investigate theories of genetic inheritance. While I cannot give you an exact reference, sometime in the 1990s Scientific American (or a similarly themed journal; sorry) published an issue devoted to "The 20 greatest scientific discoveries of the 20th Century." Next to such obvious entries as the jet engine, the structure of DNA and the splitting of the atom was... the test of this chapter! This was a curious inclusion for at least two reasons.

1. Whereas it is true that modern statisticians do not condemn this test, they don't use it very often.
2. With all the wondrous discoveries of those one hundred years, I can't imagine putting any statistical method on the list!

Many of my more zealous colleagues might disagree with my last statement, but I would be truly amazed if any of them selected the Goodness of Fit Test as our main contribution.

When I read this issue of the journal, my sense was that there were two reasons for including our test. First, the test is important historically because it provided a confirmation that Mendel's genes made sense. This was important because genes provided a mechanism for Darwin's work. (I am not a biologist and indeed understand the subject poorly, but as I understand things, Darwin provided no mechanism for natural selection.) Second, I suspect that the editors wanted to take the most inclusive view of science for the issue. Hence, even Statistics received attention.
In any event, unless you work for a casino or are interested in gambling, you might think that the study of dice is a bit frivolous. Well, as mentioned above, there are applications to genetics. But why do I mention computers in this section title?

Well, we hear all the time about computer models that help us learn about the world. There are computer models for the climate, the mutation of species or viruses, and so on. These computer models typically include CMs and at some point in the analysis the computer programmer will simulate the operations of these various CMs by using a program called a random number generator. For example, a random number generator might promise to select a digit at random (this implies ELC in this setting) from 0, 1, 2, ..., 9. But how does the programmer know that the program works as advertised? The test of this chapter can be used to investigate this issue.

5.2 The Chi-Squared Curves

In Chapter 2 we learned about the family of normal curves. Also in Chapter 2 we learned about the family of binomial distributions. A binomial is characterized by the values of two parameters: the number of trials $n$ and the probability of success on any trial $p$. In Chapter 4 we learned about the family of Poisson distributions. A Poisson is characterized by the value of a single parameter $\theta$. In this section we will learn about a family of curves called the Chi-Squared curves.

(Note: Many people call these the Chi-Square curves, that is, no d at the end of Square, but this has always annoyed me. For example, when I read the equation $3^2 = 9$, I say, "Three squared equals nine." I would never say, "Three square equals nine." "Three square" sounds like I am talking about meals!)

This might be a good time to tell/remind you that $\chi$ is the lower case Greek letter chi, where the ch is pronounced as a k and the i is a long i. My word processor does not include an upper case chi because it looks just like an upper case ex; i.e. $X$ can be either of two letters, and hopefully the context will make it clear which you should use. A helpful reminder is that $X$ is always ex whereas $X^2$ is usually chi-squared. It is only rarely that statisticians square an ex; for example, why would anyone want to know the square of the number of heads I get when tossing a coin?

A Chi-Squared curve is characterized by the value of one parameter, called its degrees of freedom (df). The degrees of freedom can be any positive integer, 1, 2, 3, .... Our symbol for this curve will be $\chi^2(df)$. For example, $\chi^2(5)$ is the Chi-Squared curve with df = 5.

I will talk about a Chi-Squared random variable. Such a random variable will be denoted by $X^2$ and take on values $\chi^2$. (This is similar to how a binomial random variable $X$ takes on values $x$.) Further, this terminology implies that we use a Chi-Squared curve to calculate probabilities for $X^2$. Following our now standard notation, we write this as: $X^2 \sim \chi^2(df)$. By the way, as the symbol $X^2$ suggests, a Chi-Squared random variable can never take on a negative number for its value; indeed, this is why we have the squared in the notation, as a reminder that negatives are impossible.

On our course webpage there are links to both a table and a calculator for Chi-Squared curves. We will use only the calculator in this course; I provide the table in case you are interested in it. At this time, I want you to go to the calculator. When you call up the calculator you will find the default screen. You will see a curve with the area to the right of 10 shaded blue. Below the curve are three boxes. Reading from these boxes we learn that the area under a $\chi^2(10)$ curve to the right of 10 is 0.4405.

I want you to take a few minutes and experiment with this site to learn a bit about Chi-Squared curves. In particular, temporarily ignore the bottom two boxes, type 1 in the degrees of freedom box and click on the Compute box. The site then displays a picture of the $\chi^2(1)$ curve. Repeat this exercise for df = 2, 3, ..., 10, 30, 50, 100. What have you seen? Well, I note that the $\chi^2$ curves are skewed to the right (the curve to the right of its peak is longer than the curve, if any, to the left of its peak). As the number of degrees of freedom increases, the skewness becomes less noticeable. For df = 100 the $\chi^2$ curve looks symmetric and bell-shaped. In fact, for a large number of degrees of freedom, a standardized $\chi^2$ curve can be well approximated by a snc, but we won't need this fact. Thus, we won't learn how to do it.

There are two ways to use the calculator and both will be valuable to us. You always begin by typing the df in the top box. You may then use the calculator in what I call the Direct or Indirect way.

Direct: Type a positive number in the left box and then hit compute. The site will give you the area under the $\chi^2$ curve to the right of your number. Example: I type in 5 for degrees of freedom and then 8.72 in the left box. I hit compute and the site displays 0.1208 in the right box. This means, as the picture reminds us, that the area under the $\chi^2(5)$ curve to the right of 8.72 equals 0.1208.

Indirect: Type a number strictly between 0 and 1 in the right box and then hit compute. The site will display a number in the left box. I need an example to explain this. Example: I type in 7 for degrees of freedom and then 0.05 in the right box. I hit compute and the site displays 14.067 in the left box. This means, as the picture reminds us, that the area under the $\chi^2(7)$ curve to the right of 14.067 equals 0.05.

Notice that I was not able to explain the Indirect use without an example. This is a general problem for teachers of Statistics, so we invent some new notation to save us from this difficulty. In my above example, the number 14.067 is denoted by $\chi^2_{0.05}(7)$; i.e. $\chi^2_{0.05}(7) = 14.067$. This equation tells us that the area under the $\chi^2(7)$ curve to the right of 14.067 is equal to 0.05. Here are some more examples of this notation. Practice these with the calculator to make sure you can verify my claims.

1. $\chi^2_{0.01}(7) = 18.475$. (Hint: Type 7 for df; 0.01 in the right box; and hit compute. You will get 18.475 in the left box.)
2. $\chi^2_{[\ldots]}(3) = [\ldots]$
3. $\chi^2_{[\ldots]}(9) = [\ldots]$
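The Direct and Indirect uses of the calculator can also be sketched in code. What follows is only a minimal sketch in Python, not the course calculator: the function names `chi2_right_area` and `chi2_cutoff` are made up for this illustration, the area is computed from a standard series for the incomplete gamma function, and the Indirect answer is found by bisection on the Direct one.

```python
import math

def chi2_right_area(x, df):
    """Direct use: area under the chi-squared(df) curve to the right of x."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z);
    # P(a, z) is the area to the LEFT of x.
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-17 * total:
        n += 1
        term *= z / (a + n)
        total += term
    left_area = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - left_area

def chi2_cutoff(alpha, df):
    """Indirect use: the number that has area alpha to its right."""
    lo, hi = 0.0, 1.0
    while chi2_right_area(hi, df) > alpha:   # expand until the cutoff is bracketed
        hi *= 2.0
    for _ in range(100):                     # bisection
        mid = (lo + hi) / 2.0
        if chi2_right_area(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(chi2_right_area(10, 10), 4))   # Direct: df = 10, left box = 10
print(round(chi2_cutoff(0.05, 7), 3))      # Indirect: df = 7, right box = 0.05
```

Bisection is slow but hard to get wrong, which seems a fair trade for a teaching sketch; a statistical package would normally be used instead.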

5.3 The Hypotheses

We assume that we have a CM that can be operated repeatedly and, when so operated, yields i.i.d. trials. Whether the outcomes are categories or numbers, we assign numbers to each outcome: $1, 2, \ldots, k$ or $0, 1, 2, \ldots, (k-1)$ if there are a finite number of possible outcomes, or $0, 1, 2, \ldots$ if there is a sequence of possible outcomes. Note that for the finite case there are $k$ possible outcomes. The probability of outcome $i$ is denoted by $p_i$. So far, this is all quite routine.

The Goodness of Fit Test is used when we have a theory about the values of the $p_i$'s and we want to evaluate whether or not the theory is reasonable. Here are some examples.

1. Dice. The CM is the casting of a die. The possible outcomes are 1, 2, ..., 6. I might entertain the theory that the die is balanced; i.e. I might assume the ELC.

2. Genetics. This is just one of many related problems that arise in genetics. Individual snapdragon (Antirrhinum majus) plants can be red-, pink- or white-flowered. Self-pollination of pink-flowered plants can yield any of these three colors. The CM is one self-pollination with outcome 1 (red), 2 (pink) or 3 (white). Repeated operations are obtained by repeated self-pollinations. These repeated operations are assumed to yield i.i.d. trials. A Mendelian genetic model states that $p_1 = 0.25$, $p_2 = 0.50$ and $p_3 = 0.25$.

3. Bernoulli Trials. Carol likes to shoot free throws. Every day she attempts 10 free throws. As we saw in Chapter 2, if we assume that her individual shots are BT, then her total number of successes on a day has a Bin(10, $p$) distribution. We will learn how to use her data from many days to test the binomial model. We will learn how to do this for both of the cases: $p$ is known and $p$ is unknown. (Note: The case in which $p$ is unknown will not be on the exam.)

4. Poisson Process. Every day David counts the number of successes in a fixed location over the same one-hour period of time. As discussed in Chapter 4, David might assume that the number of successes each day has a Poisson($\theta$) distribution. Given data from many days, we will learn how to test this assumption for both of the cases: $\theta$ is known and $\theta$ is unknown. (Note: This entire topic of the Goodness of Fit Test for the Poisson is in an optional section and will not be on the exam.)

We will restrict attention to the situation in which the CM has a finite number of possible outcomes until the last section of this chapter; i.e. we will return to the Poisson Process in the last section.

As mentioned above, the test of this chapter is relevant whenever we have a theory that specifies the values of the $p_i$'s. The procedure we will learn is an example of a test of (statistical) hypotheses. Below I will introduce you to the features of a test of hypotheses with special attention paid to the Goodness of Fit Test of this chapter.

The first feature is that every test has two hypotheses, denoted by $H_0$ and $H_1$. The first of these is called the null hypothesis and the second is called the alternative hypothesis. Because of its name, many texts denote the alternative hypothesis by $H_a$, but we will stick with $H_1$.

Each hypothesis is a conjecture about reality. These conjectures do not overlap; i.e. they cannot both be true. Curiously, it is possible that neither is true, although standard analyses tend to ignore this possibility. (Well, perhaps ignore is too strong a word, but in my experience analysts do not like to dwell on this possibility.)

For the Goodness of Fit Test, the null hypothesis states that our theory about the probabilities is correct. The alternative hypothesis states that our theory is incorrect. This might sound confusing, but in any particular situation it is quite simple. For example, for our die study,

$H_0$: $p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = 1/6$
$H_1$: Not $H_0$; i.e. at least one of the $p_i$'s does not equal 1/6.

For our snapdragon example,

$H_0$: $p_1$ (red) $= 0.25$, $p_2$ (pink) $= 0.50$, $p_3$ (white) $= 0.25$
$H_1$: Not $H_0$.

In general, let $p_{i0}$ denote the theory's value of $p_i$. This makes the hypotheses:

$H_0$: $p_i = p_{i0}$ for all $i$
$H_1$: Not $H_0$; i.e. $p_i \neq p_{i0}$ for at least one $i$.

The hypotheses must be selected before data are collected. This should never be a problem because the hypotheses are derived from questions of scientific interest, which exist before we collect data.

Every test of hypotheses begins with the assumption that the null hypothesis is correct. There are two reasons for this, one philosophical, one practical. The philosophical reason is often described as Occam's razor which states, roughly, that we prefer a simpler model for the world unless the simple model proves to be seriously inadequate. (See Wikipedia for more details.) In the current example, it is simpler to assume the die is balanced than to assume it is not. (If it is not balanced, we need to learn about its six probabilities and need a reason why they are not all the same.) Similarly, a Mendelian genetic model provides a simple way to explain inheritance of traits. If it is incorrect, another (more complicated) model needs to be found.

The practical reason we begin with the assumption that the null hypothesis is correct is that we need it in order to obtain useful math results.

Thus, a test of hypotheses can be described, briefly, as follows. We specify our hypotheses. We assume the null hypothesis is true. We collect and analyze data. Based on our analysis we select one of two options:

- Stop assuming the null hypothesis is correct; this is referred to as rejecting the null hypothesis.
- Continue to assume the null hypothesis is correct; this is referred to as failing to reject the null hypothesis.

Statisticians (among others) find it insightful to list all the possible consequences of selecting an option. In particular, for a test of hypotheses, we find the following 2 × 2 (read "2 by 2") table to be very helpful.

                         Truth (Only Nature knows)
Action                   $H_0$ is correct     $H_1$ is correct
Fail to reject $H_0$     Correct decision     Type 2 Error
Reject $H_0$             Type 1 Error         Correct decision

In words, a Type 1 Error occurs when a correct null hypothesis is rejected and a Type 2 Error occurs when a false null hypothesis is not rejected. The researcher prefers to make a correct decision, of course, but should remember that an error is possible.

Before collecting data, we don't know what our decision will be and we don't know the truth; thus, we are uncertain about whether we will make an error. Because of this uncertainty, we can consider the activity of calculating the probability of an error. Statisticians and scientists focus attention primarily on the probability of a Type 1 error and pay much less attention to the probability of a Type 2 error. This is partly philosophical: following Occam's razor, we want to avoid discarding a simple and true theory. But it is also pragmatic. The mathematics of computing the probability of a Type 1 error can be daunting, but the probability of a Type 2 error is always much more complicated.

The researcher must specify the significance level of the test. It is denoted by $\alpha$ (read alpha) and is usually taken to equal 0.05. Sometimes $\alpha$ is 0.01 or 0.10, but basically any small positive number is ok. So, what is $\alpha$? Well, it is the probability that the test makes a Type 1 error. Let's be clear on this: The researcher will use a rule or criterion to decide which option to select: reject or fail to reject. The rule must satisfy the condition that the probability of a Type 1 error (that is, the probability of rejecting the null hypothesis given that it is true) must be equal to $\alpha$. Actually, in practice, usually the best we can hope for is approximately $\alpha$.

5.4 The Test Statistic and Its Sampling Distribution

After specifying the hypotheses and the significance level of the test, the researcher collects the data. The idea of the test statistic is to summarize the data with a single number, which is called the observed value of the test statistic. The observed value of the test statistic guides the researcher in choosing between the two options, reject or fail to reject. Because we need to be able to calculate the probability of a Type 1 error, we need to know how to calculate probabilities for the test statistic on the assumption that the null hypothesis is true; that is, we need to know the sampling distribution of the test statistic.

A professional statistician determines a test statistic for a particular situation by applying certain principles of what makes a test good, and by using some math techniques that can be quite sophisticated and complicated. For this course, I will motivate my choices for test statistics, but not attempt to derive or prove why they are preferred.

After observing the $n$ i.i.d. trials, we count the observed frequency of each category. We denote the observed frequency for category $i$ by $O_i$, for all possible values of $i$. (The letter O is for observed.)

Consider $O_1$, the frequency of occurrence of outcome 1 in the $n$ trials. We can view 1 as a success and any other outcome as a failure. Thus, $O_1 \sim$ Bin($n$, $p_1$), and the mean of the probability distribution of $O_1$ is $np_1$. Now $p_1$ is unknown, which would be a huge problem except remember that we are assuming the null hypothesis is correct. With this assumption, $p_1 = p_{10}$, which is a known number. The mean of $O_1$ becomes $np_{10}$, which we can easily compute. This argument for outcome 1 can be extended to the other outcomes; the result is that the mean of the probability distribution for each $O_i$ is $np_{i0}$; again these are all easily computable numbers.

Dating back to the gambling origins of probability theory, the mean was called the expected value. Because this is an old test (over 100 years old, as mentioned earlier) this older terminology is reflected in our notation and we denote the mean of $O_i$ by $E_i$. Thus, $E_i = np_{i0}$. At this point, it might help to introduce two specific examples.

Example 1: An Electronic Die. My statistical software package, Minitab, claims to have a random number generator that can simulate a balanced die. I decided to investigate this claim. I had my computer generate 600 trials from its so-called balanced die. My observed and expected frequencies are in the table below.

[Table: $O_i$ and $E_i$ for outcomes 1 to 6]

Example 2: Hypothetical Snapdragon Flowers. (Aside: I was disappointed when I searched the web for genetic data. I found sites that talked at length about how the Goodness of Fit Test is so important in genetics and then the example was... tossing a coin!) I will modify some data from published sources and hope that my modification avoids any lawsuits for infringement of copyrights! Suppose that George grows $n = 240$ snapdragons and obtains the data summarized in the following display.

[Table: $O_i$ and $E_i$ for outcomes 1 = Red, 2 = Pink, 3 = White]

I am now reminded of the words of Yogi Berra: "You can observe a lot by just watching." Please look at the O's and their corresponding E's. They do not all agree. (In fact, none of them agree.) This is not surprising; I simulated 10,000 data sets as I did above (600 casts of an i.i.d. die with the ELC true) and never obtained data for which all the O's equaled 100. In other words, the data almost always contain some evidence in support of the alternative hypothesis, even when the null hypothesis is true.

Read this last sentence again. Notice that I talk of evidence in support of the alternative. This is how statisticians talk. We never say evidence in support of or against the null. We never say evidence against the alternative. We say evidence in support of the alternative because we are already assuming the null is correct and are looking to see whether there is evidence in support of the alternative. Well, as I said, there is almost always some evidence in support of the alternative; what we are really looking to do is to determine whether the evidence in support of the alternative is sufficiently strong to convince us to reject the null hypothesis.

Let's look at our data again. We have six O's and six E's for the die example and three of each for the flowers. Any discrepancy between an O and its E provides evidence in support of the alternative. In other words, I compare the O's and the E's to see whether they agree, almost agree, disagree somewhat, and so on. In mathematics a common way to compare two numbers is to subtract one from the other, and we do that here. In particular, for each possible outcome we compare the O and the E by calculating $(O - E)$ and placing these values in our table for the die:

[Table: $O_i$, $E_i$ and $O_i - E_i$ for outcomes 1 to 6]

Below is this table for the flowers:

[Table: $O_i$, $E_i$ and $O_i - E_i$ for outcomes 1 = Red, 2 = Pink, 3 = White]

If an $O - E$ is 0, then there is no evidence in support of the alternative. We want to treat an $O - E$ of, say, $-10$ as the same evidence as an $O - E$ of $+10$. We can do this by taking the absolute value of $O - E$, but it turns out to be better to square the value. Thus, the values of $(O_i - E_i)^2$ are added to our table, first for the die and then for the flowers.

[Table: $O_i$, $E_i$, $O_i - E_i$ and $(O_i - E_i)^2$ for outcomes 1 to 6]

[Table: $O_i$, $E_i$, $O_i - E_i$ and $(O_i - E_i)^2$ for outcomes 1 = Red, 2 = Pink, 3 = White]

Finally, we need to adjust for sample size because the values of $(O - E)^2$, even for a balanced die, will tend to be larger the more often we cast the die. We adjust for sample size by dividing each $(O - E)^2$ by $E$, which we add to our table, first for the die and then for the flowers. Also, we sum the values of $(O - E)^2/E$ and call the total $\chi^2$.

[Table: $O_i$, $E_i$, $O_i - E_i$, $(O_i - E_i)^2$ and $(O_i - E_i)^2/E_i$ for outcomes 1 to 6, with total $\chi^2 = 3.56$]

[Table: $O_i$, $E_i$, $O_i - E_i$, $(O_i - E_i)^2$ and $(O_i - E_i)^2/E_i$ for outcomes 1 = Red, 2 = Pink, 3 = White, with total $\chi^2$]

In symbols, the total is
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}.$$

The number $\chi^2$ is called the observed value of the test statistic for the Goodness of Fit Test. The test statistic $X^2$ is the procedure or rule we follow to obtain the number $\chi^2$. The test statistic is a random variable. This is similar to our discussion of the binomial earlier; $X$ is the rule "calculate the total number of successes," while $x$ is an actual number of successes that we obtain.

I need to note some features of $\chi^2$. First, $\chi^2$ cannot be a negative number. If $\chi^2 = 0$ then there is absolutely no evidence in the data in support of the alternative. (Why?) It can be shown mathematically that the larger the value of $\chi^2$ the stronger the evidence in support of the alternative. Thus, logic tells us that if we choose to reject the null for a particular value of $\chi^2$ then we should also reject for any larger value because any larger value would be even stronger evidence in support of the alternative. Thus, it is clear that our rule should be:

Reject the null hypothesis if, and only if, $\chi^2 \geq c$, for some number $c$.

But what should we choose for $c$? We can answer this question because of the following important major result, discovered by Pearson.

Given that the null hypothesis is true, the sampling distribution of $X^2$ is approximated by the Chi-Squared curve with df $= k - 1$. (Remember: $k$ is the number of categories for the response.)

Following the notation I introduced earlier, our rule becomes:

Reject the null hypothesis if, and only if, $\chi^2 \geq \chi^2_{\alpha}(k-1)$.

Why? Well, we want the probability of a Type 1 error to equal $\alpha$. Because of our important major result, on the assumption that the null hypothesis is true, the probability that we will get a value of $X^2$ that is equal to or larger than $\chi^2_{\alpha}(k-1)$ is equal (approximately) to $\alpha$.
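The computation of the observed value of the test statistic and the reject / fail to reject rule can be sketched in a few lines of Python. This is only an illustration: the counts below are invented (they are not the data of Example 1), the function name `chi_squared_statistic` is made up, and the cutoff 11.07 is the value of the number with area 0.05 to its right under the Chi-Squared curve with df = 5, typed in by hand rather than computed.

```python
def chi_squared_statistic(observed, null_probs):
    """Sum of (O_i - E_i)^2 / E_i, where E_i = n * p_i0."""
    n = sum(observed)
    expected = [n * p for p in null_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 600 casts of a die (invented for this sketch).
observed = [95, 108, 102, 91, 110, 94]
null_probs = [1 / 6] * 6      # the ELC: every face has probability 1/6

chi2 = chi_squared_statistic(observed, null_probs)
critical_value = 11.07        # cutoff for alpha = 0.05 and df = k - 1 = 5
reject = chi2 >= critical_value

print(round(chi2, 2), "reject" if reject else "fail to reject")
```

Each expected frequency here is 100, so the statistic is just the sum of the squared deviations divided by 100.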
By the way, statisticians get tired of writing and saying, "Reject the null hypothesis if, and only if," and then giving a rule. Instead we define the critical region of the test to be those values of the test statistic that result in the rejection of the null hypothesis. Thus, for our Goodness of Fit Test, the critical region is $\chi^2 \geq \chi^2_{\alpha}(k-1)$.

For our die example, $k = 6$. For $\alpha = 0.05$, the critical region is

$\chi^2 \geq \chi^2_{0.05}(5) = 11.07$.

Recall that we obtain 11.07 by going to

west/applets/chisqdemo.html

Once at the site, we type 5 in the degrees of freedom box, 0.05 in the lower right box and click on compute.

For our snapdragon data, $k = 3$. For $\alpha = 0.05$, the critical region is $\chi^2 \geq \chi^2_{0.05}(2) = 5.991$.

We now evaluate our data. For the die data, $\chi^2 = 3.56$ is not in the critical region, so we do not reject the null hypothesis. For the snapdragons, the observed $\chi^2$ is not in the critical region, so we do not reject the null hypothesis. Let me do a few more examples.

Example 3: Casting my round-cornered blue die. I actually cast my blue round-cornered die 1,000 times and recorded the results! (Goodbye Friday evening!) My data are below.

[Table: $O_i$, $E_i$, $O_i - E_i$, $(O_i - E_i)^2$ and $(O_i - E_i)^2/E_i$ for outcomes 1 to 6, with total $\chi^2$]

If I again use $\alpha = 0.05$ my critical region is $\chi^2 \geq 11.07$. For these data the observed $\chi^2$ is in the critical region, so my decision is to reject the null hypothesis and conclude that my die is not balanced.

Example 4: Hypothetical fair-coin tosser. Bert plans to toss his favorite coin four times every day for $n = 160$ days. He is convinced that the coin is fair (i.e. the probability of a head is 0.5), but he wonders about memory/independence. If there is independence then he has BT and the total number of heads on any given day will follow the Bin(4, 0.50) distribution. It is the appropriateness of this binomial that Bert wants to investigate. The possible values of the number of heads on a day are, of course, 0, 1, 2, 3 and 4. You can check that if the binomial model is correct, then these outcomes have probabilities 1/16, 4/16, 6/16, 4/16 and 1/16, respectively. Bert's null is that these binomial probabilities are correct and his alternative is that they are not.
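The check of Bert's null probabilities, and the expected frequencies $E_i = np_{i0}$ that follow from them, can be sketched as below. This is a small illustration in Python using exact fractions; the variable names are made up for the sketch.

```python
from fractions import Fraction
from math import comb

m = 4                      # tosses per day
p = Fraction(1, 2)         # probability of a head for a fair coin
n_days = 160               # number of days, i.e. number of trials n

# Bin(4, 0.50) probabilities for 0, 1, 2, 3 and 4 heads.
probs = [comb(m, x) * p**x * (1 - p)**(m - x) for x in range(m + 1)]

# Expected frequencies E_i = n * p_i0 under the null hypothesis.
expected = [n_days * q for q in probs]

for heads, (q, e) in enumerate(zip(probs, expected)):
    print(heads, q, e)     # fractions print in lowest terms, e.g. 4/16 as 1/4
```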
Bert collects his data and obtains the numbers shown in the table below, which also presents all of his computations.

[Table: $O_i$, $E_i$, $O_i - E_i$, $(O_i - E_i)^2$ and $(O_i - E_i)^2/E_i$ for outcomes 0 to 4, with total $\chi^2$]

Because $k = 5$, df = 4; for $\alpha = 0.10$ the critical region is $\chi^2 \geq \chi^2_{0.10}(4) = 7.779$. Our observed $\chi^2$ is not in the critical region, so the null hypothesis is not rejected.

Example 5: Hypothetical free throw shooter. Imagine a basketball player named Shack. He is getting old, doesn't run too well and has always been a poor free throw shooter. He decides to work on the last of these as follows: Five times per day for the next 80 days he will shoot four free throws and count the number of successes that he achieves. Thus, Shack will collect $n = 400$ numerical values, with each value being one of: 0, 1, 2, 3 or 4. Shack wants to use the Goodness of Fit Test to test whether these 400 values behave as if they come from a binomial distribution.

Note the difference between Examples 4 and 5. In Example 4, the analysis assumed that $p = 0.50$. For this analysis we assume that $p$ is unknown. Our null hypothesis is that the probabilities follow the binomial distribution for $m = 4$ trials for some $p$. The alternative is that the binomial is not correct, for any value of $p$.

First, we need to look at the data Shack collects. His O's are below.

Outcome    0     1     2     3     4
$O_i$      25    118   139   93    25

In order to proceed we need to use our data to estimate $p$. Shack shoots a total of 400(4) = 1600 free throws. From the table above, he obtains:

25(0) + 118(1) + 139(2) + 93(3) + 25(4) = 775 successes.

Thus, we estimate $p$ by $\hat{p} = 775/1600 = 0.4844$. We calculate our E's using the Bin(4, 0.4844) distribution. The E's (probabilities times 400) for 0, 1, 2, 3 and 4 are: 28.3, 106.2, 149.7, 93.8 and 22.0, respectively. I will add these to our data and complete the computations of $\chi^2$:

[Table: $O_i$, $E_i$, $O_i - E_i$, $(O_i - E_i)^2$ and $(O_i - E_i)^2/E_i$ for outcomes 0 to 4, with total $\chi^2$]
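Shack's computations can be retraced from the observed frequencies given above. The sketch below, in Python, is an illustration rather than the original worksheet: it uses the unrounded estimate of $p$ instead of 0.4844, so its E's agree with the ones quoted in the text only after rounding to one decimal place, and its $\chi^2$ may differ slightly from a hand computation based on rounded E's.

```python
from math import comb

observed = [25, 118, 139, 93, 25]   # frequencies of 0, 1, 2, 3, 4 successes
m = 4                               # free throws per occasion
n = sum(observed)                   # number of occasions

# Estimate p by (total successes) / (total free throws).
successes = sum(x * o for x, o in enumerate(observed))
p_hat = successes / (n * m)

# Expected frequencies from the Bin(4, p_hat) distribution.
probs = [comb(m, x) * p_hat**x * (1 - p_hat)**(m - x) for x in range(m + 1)]
expected = [n * q for q in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(successes, round(p_hat, 4))
print([round(e, 1) for e in expected])
print(round(chi2, 3))
```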

Next, we need the critical region. Well, we need the following fact. Let $j$ denote the number of parameters that we must estimate in order to obtain the E's. For the current example, $j = 1$.

Given that the null hypothesis is true, the sampling distribution of $X^2$ is approximated by the Chi-Squared curve with df $= k - j - 1$.

For our hypothetical Shack, df $= 5 - 1 - 1 = 3$. For $\alpha = 0.05$, the critical region is $\chi^2 \geq 7.815$. Because our observed $\chi^2$ is less than 7.815, we do not reject the null hypothesis.

We have now performed five Goodness of Fit Tests, Examples 1 to 5 above. In Example 3, we rejected the null, but in the other four examples, we failed to reject the null. Did we make any Type 1 or Type 2 errors? Well, only Nature knows, but because most of these were hypothetical examples, I was Nature so I know! Indeed, only Example 3 had real data. In Examples 1 (electronic die), 2 (snapdragons) and 4 (coin tosser), as Nature I decided to make the null hypothesis true and all three of our tests made the correct decision: do not reject. In Example 5 (Shack) as Nature I decided that the alternative was true and our test made a Type 2 error; it failed to reject a false null. (FYI: I generated Shack's data as follows: on 200 occasions it was Bin(4, 0.4), a bad outing for Shack, and on 200 occasions it was Bin(4, 0.6), a good outing for Shack. Our test did not detect this.) Finally, in Example 3, my round-cornered blue die, I rejected the null, so it is possible that I made a Type 1 error; only Nature knows.

5.5 The Attained Significance Level

In this section we learn about a very important idea in modern tests of hypotheses, the Attained Significance Level of a test. Even the biggest fan of our test of hypotheses must admit that the choice of $\alpha$ is arbitrary. The Attained Significance Level, also called the P-value, helps.

There is an annoying feature in the above presentation.
Consider testing a die for the ELC with $\alpha = 0.05$. The critical region is $\chi^2 \geq 11.07$. Consider four hypothetical results:

- Researcher A: Obtains $\chi^2 = 2.12$ and does not reject the null.
- Researcher B: Obtains $\chi^2 = 11.05$ and does not reject the null.
- Researcher C: Obtains $\chi^2 = 11.08$ and rejects the null.
- Researcher D: Obtains $\chi^2 = 51.00$ and rejects the null.

I have three complaints with the above four examples:

- Researchers A and B have substantially different strengths of evidence (2.12 is quite different from 11.05), but this is ignored by saying neither rejects.
- Researchers C and D have substantially different strengths of evidence (11.08 is quite different from 51.00), but this is ignored by saying both reject.
- And perhaps most seriously, Researchers B and C have almost identical strengths of evidence (11.05 and 11.08 are very similar), but this is worse than ignored in concluding that one should reject and the other should not!

The Attained Significance Level helps to reduce substantially the seriousness of these complaints.

Recall Example 1, my electronic die. The observed value of the test statistic was $\chi^2 = 3.56$. Recall also that I used $\alpha = 0.05$ to obtain the critical region $\chi^2 \geq 11.07$. But suppose I had chosen for my critical region: $\chi^2 \geq 3.56$. What would we conclude? Well, first it looks awfully suspicious to have just happened to choose a $c$ for my critical region that exactly matches the observed value of $\chi^2$. Let's ignore that for the moment. If I go to my Chi-Squared calculator, I find that the area under the $\chi^2(5)$ curve to the right of 3.56 is 0.6143. Thus, if I had selected $\alpha = 0.6143$ then my critical region would have been $\chi^2 \geq 3.56$ and I would have just barely rejected the null. If I pick any $\alpha$ larger than 0.6143 then I would have a $c$ smaller than 3.56 and I would reject the null; if I pick any $\alpha$ smaller than 0.6143 then I would have a $c$ larger than 3.56 and I would fail to reject the null. Thus, I would reject the null if, and only if, my $\alpha \geq 0.6143$. In words, 0.6143 is the smallest $\alpha$ for which the null hypothesis would be rejected. This number, 0.6143, is the P-value for these data.

Below are the P-values for the other four Examples above.

Example 2: Snapdragons. The observed $\chi^2$ has df = 2. From the calculator, the area under the $\chi^2(2)$ curve to the right of the observed $\chi^2$ is the P-value. We would reject the null if, and only if, our $\alpha$ is at least as large as this P-value.

Example 3: Round-cornered blue die. The observed $\chi^2$ has df = 5.
From the calculator, the area under the χ²(5) curve to the right of […] is […]. Thus, the P-value is […]. We would reject the null if, and only if, our α ≥ […]. FYI, according to my computer software package, the P-value is one in ten trillion.

Example 4: Fair-coin tosser. The observed χ² = […] with df = 4. From the calculator, the area under the χ²(4) curve to the right of […] is […]. Thus, the P-value is […].

Example 5: Shack's free throws. The observed χ² = […] with df = 3. From the calculator, the area under the χ²(3) curve to the right of […] is […]. Thus, the P-value is […].

The approach I described earlier for a test of hypotheses is sometimes called the classical approach. It can be viewed as quite rigid: every analysis must end with a decision to reject or not. This reflects mathematics in two ways. First, every math problem ends in a solution and then we go on to the next math problem. The solution here is to reject or not. Second, for academic researchers who want to publish research papers, the rigidity of the classical approach is helpful for proving theorems and obtaining other mathematical results.

But science is much more dynamic than math. One hundred years ago many (most?) scientists believed the space between planets in our solar system was filled with ether. If space = ether was a math result, well, then there would be ether. But space = ether was, presumably, a useful scientific theory until it was replaced by a better (more correct) one.

Before I offer an alternative to the classical approach, let me remind you of what the P-value does for us. If we decide to use the rigid reject or fail to reject approach to tests of hypotheses, the P-value has the virtue of removing, to some extent, the arbitrariness of the choice of α in the following way. By reporting the P-value, the researcher allows the consumer to apply his/her own choice of α to the decision-making process.

As I said above, science is more dynamic than math. Scientists may be less interested in a carved-in-stone decision and more interested in evaluating the strength of the evidence in the data. The second interpretation of the P-value, given below, helps with this. As mentioned on page 59, the larger the value of χ², the stronger the evidence in support of the alternative. Thus, the P-value is the probability of the researcher obtaining the actual evidence or even stronger evidence. Remember, the probability is computed under the assumption that the null is correct. This is a little tricky: the smaller the P-value, the stronger the evidence in support of the alternative. For example, if one gets a P-value of 0.0001, this means that the probability of getting such strong (or stronger) evidence is one in ten thousand. In other words, it is unlikely; thus, the evidence one has is very strong.

This second interpretation of the P-value helps sort out the problems with Researchers A, B, C and D introduced on page 62. With the help of the Chi-Squared calculator, we can obtain the following P-values for these researchers; recall that df = 5 for all of them.
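The calculator lookups used throughout this section can also be reproduced in code. The following is a minimal Python sketch, not the calculator referred to in the text: the function name chi2_upper_tail is hypothetical, and it assumes an integer df so that the standard closed-form series for the upper-tail area applies. The values 3.56, 11.05, 11.08 and 51.00 are the χ² values quoted earlier in this section, all with df = 5.

```python
import math


def chi2_upper_tail(x, df):
    """Area under the chi-squared(df) curve to the right of x.

    Sketch for integer df >= 1 only (an assumption of this example).
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series.
    tail = math.erfc(math.sqrt(half))
    if df == 1:
        return tail
    # Series: sum_{k=0}^{(df-3)/2} x^k / (1*3*5*...*(2k+1))
    term, total = 1.0, 1.0
    for k in range(1, (df - 1) // 2):
        term *= x / (2 * k + 1)
        total += term
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total


# P-values (upper-tail areas) for chi-squared values quoted in the text, df = 5.
for label, stat in [("Example 1", 3.56), ("Researcher B", 11.05),
                    ("Researcher C", 11.08), ("Researcher D", 51.00)]:
    print(f"{label}: chi-squared = {stat:.2f}, P-value = {chi2_upper_tail(stat, 5):.4g}")
```

Because the 0.05 critical value for df = 5 is about 11.07, the areas for 11.05 and 11.08 land on opposite sides of 0.05 while being nearly equal, which is the point of the complaint about Researchers B and C.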
Researcher    χ²       P-value
A             […]      […]
B             11.05    […]
C             11.08    […]
D             51.00    […]

If you go and reread my complaints about the reject/fail to reject approach to analysis given earlier, you will see that the P-value does a good job of answering them.

5.6 *Some Loose Ends (Optional)

We have been using the Chi-Squared curve to compute probabilities because in the limit it works. That is, using the Chi-Squared curve is an approximation. Is the approximation any good? To answer this, I first note that, as a practical matter, it is impossible to calculate exact probabilities. (I know, some people say "Impossible is nothing"; but not for this!) We can, however, simulate the distribution of the test statistic X². In particular, I simulated 10,000 runs in which each run consisted of casting my balanced electronic die n = 600 times. Remember, for α = 0.05 the critical region is χ² ≥ 11.07, where the number 11.07 is obtained by using the Chi-Squared curve with df = 5 as an approximation to the sampling distribution of X². Each run consisted of:

1. I had the computer generate (simulate) 600 casts of a balanced die.

2. For the data just obtained, I calculated the value of χ² exactly as illustrated above.

Thus, each run resulted in a value of χ². I then sorted the 10,000 values of χ² and determined, by counting, that 489 of my simulated values were ≥ 11.07. Thus, the relative frequency of occurrence of χ² ≥ 11.07 was 489/10,000 = 0.0489. By the LLN, P(X² ≥ 11.07) is close to 0.0489. Thus, the exact significance level of my test is close to 0.0489, compared with the nominal 0.05. In conclusion, it seems that, at least for this one example, the Chi-Squared curve provides an adequate approximation.

Statisticians have worked on this problem a great deal and have reached similar conclusions. Basically, the most cautious of the conclusions is that it is OK to use the Chi-Squared curve as an approximation provided all of the E's are 5 or larger, which has been the case in all of our examples. Our next loose end is discussed in our next example.

Example 6: Prussian Cavalry Corps. I want to thank my friend Bret Larget for providing the following reference for the first real data set in my career as an undergraduate math major: Ladislaus Bortkiewicz (1898). Das Gesetz der kleinen Zahlen, in the journal Monatshefte für Mathematik, vol. 9, p. A[…]. DOI: […]/BF[…].

Somebody (Bortkiewicz?) collected n = 200 observations. Each observation was a count: the number of soldiers kicked to death by a horse/mule during a given calendar year in a given Prussian Cavalry Corps unit. I can't recall whether it was data on 20 units for 10 years or 10 units for 20 years, nor when this occurred, although, based on the date of publication, the data were collected before the 20th century.

If one thinks of a fatality as a success, then one might wonder whether the Poisson distribution would be a good model for these data. (Why?) This example is similar to the Shack example because there is no reason to believe we know the value of θ. Thus, our first task is to estimate θ from the data.
I will show you the data soon, but for now let me remark that a total of 122 men were kicked to death, giving a mean of 122/200 = 0.61 deaths per unit per year. Thus, our estimate of θ is 0.61. My next step is to calculate probabilities for Poisson(0.61). With the help of our calculator I get the following results.

x:            […]
P(X = x):     […]
nP(X = x):    […]

I will now show you the data and the necessary computations.

Outcome:              […]    Total
O_i:                  […]
E_i:                  […]
O_i − E_i:            […]
(O_i − E_i)²:         […]
(O_i − E_i)²/E_i:     […]    χ² = 0.198
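The Poisson(0.61) step can be sketched in code. This is a minimal Python sketch under one assumption that is mine rather than the text's: the outcomes are grouped as 0, 1, 2 and "3 or more", a grouping that gives four categories but is not confirmed by the tables here. The helper name poisson_pmf is hypothetical; only n = 200 and the total of 122 deaths come from the text.

```python
import math


def poisson_pmf(x, theta):
    """P(X = x) for a Poisson distribution with mean theta."""
    return math.exp(-theta) * theta ** x / math.factorial(x)


n = 200                  # number of observations (from the text)
theta_hat = 122 / n      # estimated mean: 122 deaths over 200 unit-years

# Assumed grouping: 0, 1, 2 and "3 or more" (the last one by subtraction).
labels = ["0", "1", "2", "3 or more"]
probs = [poisson_pmf(x, theta_hat) for x in range(3)]
probs.append(1.0 - sum(probs))

# Expected counts E = n * P(X = x) for each category.
expected = [n * p for p in probs]

for label, p, e in zip(labels, probs, expected):
    print(f"x = {label:>9}: P = {p:.4f}, E = {e:.2f}")
```

Under this grouping the last expected count comes out a little under 5, which is the situation the recommended minimum for the E's is meant to flag.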

I will use the Chi-Squared approximation even though one of the E's falls slightly below the recommended minimum of 5. The df = […] = 2, subtracting twice as in the Shack example. The P-value is the area under the χ²(2) curve to the right of 0.198; this area is 0.9057. Thus, there is only very weak evidence in support of the alternative.

Remember that a test of hypotheses only tests some of what we assume. It turns out that we cannot test everything. For example, consider the Goodness of Fit Test for the ELC for an electronic die, like the data I provided with my Example 1. If we generate n = 600 casts and each side lands up exactly 100 times, it is correct to say that the Goodness of Fit Test finds no evidence in support of the alternative. But the Goodness of Fit Test does not examine the assumption of i.i.d. trials. Here are two extreme possibilities that the Goodness of Fit Test would not see.

Lack of independence. Suppose that the electronic die yields the sequence 1, 2, 3, 4, 5, 6 repeatedly. The trials are not independent, but our test of this chapter won't notice it.

Lack of i.d. Suppose that the first 100 casts yield all 1's; the next 100 casts yield all 2's; and so on. This would occur if the probabilities are changing, but, again, the test of our section would not spot it.
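Both extreme cases are easy to check numerically, and the same helper can rerun the simulation described at the start of this section. The following is a minimal Python sketch: the function names are hypothetical, and Python's random module with an arbitrary seed stands in for whatever software produced the count of 489 out of 10,000 reported above, so the simulated rate will differ somewhat from that figure.

```python
import random


def chi_square_stat(casts, sides=6):
    """Goodness of Fit statistic for the equally-likely hypothesis on a die."""
    n = len(casts)
    expected = n / sides              # E is the same for every side
    counts = [0] * sides
    for c in casts:
        counts[c - 1] += 1            # sides are numbered 1..sides
    return sum((o - expected) ** 2 / expected for o in counts)


# Lack of independence: the sequence 1, 2, 3, 4, 5, 6 repeated 100 times.
cyclic = [1 + i % 6 for i in range(600)]

# Lack of identical distribution: 100 ones, then 100 twos, and so on.
blocks = [1 + i // 100 for i in range(600)]

# Every side appears exactly 100 times in both sequences, so both print 0.0.
print(chi_square_stat(cyclic), chi_square_stat(blocks))


def simulated_rejection_rate(runs=10_000, n=600, critical=11.07, seed=1):
    """Fraction of simulated balanced-die runs whose statistic is >= critical."""
    rng = random.Random(seed)         # the seed is an arbitrary choice
    rejections = 0
    for _ in range(runs):
        casts = [rng.randint(1, 6) for _ in range(n)]
        if chi_square_stat(casts) >= critical:
            rejections += 1
    return rejections / runs
```

Calling simulated_rejection_rate() with its defaults repeats the 10,000-run experiment; a result near 0.05 supports using the Chi-Squared curve here. The statistic depends only on the six counts, never on the order of the casts, which is why neither extreme sequence is detected.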
