Categorical Data Analysis. The data are often just counts of how many things each category has.

Size: px

Start display at page:

Download "Categorical Data Analysis. The data are often just counts of how many things each category has."

Daisy Mitchell
5 years ago
Views:

1 Categorical Data Analysis So far we ve been looking at continuous data arranged into one or two groups, where each group has more than one observation. E.g., a series of measurements on one or two things. Now we're changing things: we re interested in data that is categorical. The data are often just counts of how many things each category has. For example, we're interested in blood types. We go out and collect the following data from 263 people: Number of people with blood type A: y 1 = 115 Number of people with blood type B: y 2 = 20 Number of people with blood type AB: y 3 = 17 Number of people with blood type O: y 4 = 111 Note that the y's are NOT measurements, they're counts! Now suppose we had some theory about the distribution of blood types; could we use these data to test our theory? Obviously the answer is yes. In the U.S., people have the following proportions for blood type: Blood type A: 42% Blood type B: 10% Blood type AB: 4% Blood type O: 44% Does our theory (the percentages above) match our data? The Chi-square ( χ 2 ) goodness of fit test. Or better, are our data compatible with our hypothesis? Essentially, we have a number of categories for our data, and we have some idea as to the proportions we should expect for our each of our categories. We use the chi-square test to compare our idea (= hypothesis) with the actual data. Here s a simple (made up) example from genetics: We have a simple dominant-recessive relationship. Say, yellow and purple corn kernel color. Purple is dominant, and yellow is recessive. Without going into the details, if we have two heterozygous parents, we would expect 3/4 of our offspring (= kernels) to be purple, and 1/4 yellow. IMPORTANT - you do NOT need to understand the genetics here; all you need to know is that some theory tells us we expect 3/4 purple kernels and 1/4 yellow kernels

2 You want to test this theory, and you go out and collect a sample of 267 corn kernels. How many do you expect to be purple? 3/4 out of 267: 3/4 x 267 = How many yellow? 1/4 out of 267: 1/4 x 267 = Notice - take the proportion you expect for each category and multiply this by the total number in your sample; this gives you what you expect for each category in your sample. Now you can compare this to what you actually got. Suppose you took your sample and got: 157 purple 110 yellow Looking at this, you would think you should have gotten more yellow kernels and less purple kernels. But maybe your differences are due to random chance? Set up your hypotheses: Decide on α H 0 : Pr{purple} =.75, Pr{yellow} =.25 Your H 0 is different from what you re used to; you need to specify each expected proportion. H 1 : At least one probability listed in H 0 is incorrect Let s pick 0.05 Calculate your test statistic. This is now χ 2* (or χ 2 s, as your text calls it): χ 2 * = c (O i E i ) 2 i=1 E i i goes from 1 to c, where c is the number of categories (2 in our example). for our example we have: X 2* = = Compare this to the tabulated χ 2 with c-1 degrees of freedom and the appropriate level of α. Introducing the χ 2 distribution.

3 The value you calculate, χ 2*, will follow a χ 2 distribution if n is moderately large. Notice that the χ 2 test is based on an approximation. You can get exact values, but they re a bit of a pain, and it really isn't necessary - the approximation is very good. Just like many other distributions, the value of the χ 2 distribution depends on the degrees of freedom, d.f. or ν. Here is what it looks like for a couple of different values of ν: Just like before, we reject for values that wind up in the tails (usually only in the upper tail). Finally, you make your comparison: Notice how different the χ 2 looks based on the value of ν. If χ 2* χ 2 table, we reject our H 0. Otherwise fail to reject H 0. It's important to use the correct value for the d.f. = ν or you can get very misleading results. Here s our comparison:

4 χ 2 * 2 2 = χ c 1,α = χ 1,0.05 = 3.84 So we reject our H 0 and conclude that at least one of our proportions is not as specified in H 0. An important point - notice that with just two categories, if one of our proportions is wrong, that immediately implies that the other one is wrong as well (why?). Some comments: Except in the case of two categories, the alternative hypothesis is non-directional. If we have a reason to suspect a directional alternative (and have two categories), we can proceed as follows: H 0 : Pr{Male} =.6 (and therefore Pr{Female} =.4) H 1 : Pr{Male} >.6 (so what are females?) Two examples of the χ 2 test: You should be fairly comfortable with one sided tests by now. Let's do our blood group example: H 0 : Pr{blood type A} = 0.42 Pr{blood type B} = 0.10 Pr{blood type AB} = 0.04 Pr{blood type O} = 0.44 H 1 : at least one of these proportions is not correct. Let's use α =.01 So here's our data (= observed): And our expected values are: Blood type A: x.42 = Blood type B: x.10 = Blood type AB: x.04 = Blood type O: x.44 = Total: 263 So now we calculate χ 2* : χ 2* = ( ) ( ) ( ) ( ) = 5.88 Get our value for χ 2 table: 2 χ 3,.01 = 11.34

5 Finally, because χ 2* is less than our χ 2 table, we fail to reject, and conclude that our null hypothesis is consistent with the data: We have no evidence to show that our blood type sample does not follow the proportions in the U.S. Color vision in squirrels (this example is based on the old textbook for this class: Statistics for the Life Sciences, by Samuels, Witmer, and Schaffner (which is in turn based on a study by G.H. Jacobs from 1978)): A squirrel was exposed to a red panel and two white panels. By pressing the red panel, the squirrel was rewarded; no reward was given for pressing the white panel. In 75 trials, the squirrel correctly pressed the red panel 45 times. Can the squirrel see color? Comment: any one see a problem with the experimental setup? In other words, does this experiment really test color vision in squirrels? H 0 : Pr{red} = 1/3 (so Pr{white} = 2/3) H 1 : Pr{red} 1/3 Incidentally, a squirrel is going to do the best it can to get food, so the alternative probably should be one sided here (H 1 : Pr{red} > 1/3). More on that soon. Let's use α =.01 Calculate our expected: For red, 75 x 1/3 = 25 For white, 75 x 2/3 = 50 Note that the actual (observed) proportion for red is 45/75 = 0.6 Calculate our χ 2* : This would have agreed with our directional alternative if we had used it And our χ 2 table is: X 2* = χ 2 1,0.02 = = 24 And since χ 2* >> χ 2 table, we reject H 0 and conclude that squirrels can see the color red. Incidentally, if you wanted to do a one sided test, you'd do almost everything the same, with the following modifications:

6 Use the correct H 1 Verify that the data agree with H 1 Use the appropriate column in your χ 2 tables, but divided by 2 (i.e., use α/2). For our squirrel example, you'd get: 2 χ 1,0.025 = 5.41 Notice that the critical value got smaller - as usual, one sided tests make it easier to reject the H 0 ; we get more power! Bottom line: we are very confident that squirrels can see the color red. The assumptions of the χ 2 test: The data are collected randomly (you just can t get away from this one!) The smallest expected value is at least 5 (the observed value is irrelevant!). The χ 2 test is an approximation, and approximations get better the bigger the sample size. You can often determine ahead of time what your expected values will be. If they're too small you might consider increasing your sample size. There are other techniques for dealing with smaller sample sizes, but we won t learn them here.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,