Ch. 11 Inference for Distributions of Categorical Data

Ch. 11-2: Inferences for Relationships

The two-sample z procedures from Ch. 10 allowed us to compare proportions of successes in two populations or for two treatments. What if we want to compare the distributions of a single categorical variable across several populations or treatments? For this new test, we use two-way tables to present the data.

2 populations, 1 categorical variable, 3 categories.

                ECRCHS   Granada   Total
Not much           16        4       20
1+ per week        14       16       30
1+ per day         20       30       50
Total              50       50      100  (grand total)

Are we looking at row totals or column totals? Row totals.
Not much: 20/100 = 20%
1+ per week: 30/100 = 30%
1+ per day: 50/100 = 50%

Are we looking at 1 row or 1 column? 1 column (the Granada column, total 50).
Not much: 4/50 = 8%
1+ per week: 16/50 = 32%
1+ per day: 30/50 = 60%
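The column percentages above can be reproduced with a short sketch. The function name `column_distribution` is illustrative, and the observed counts are the ones implied by the slide's percentages and later calculations (ECRCHS: 16, 14, 20; Granada: 4, 16, 30).

```python
def column_distribution(table, col):
    """Return each row's share of the chosen column's total."""
    column = [row[col] for row in table]
    total = sum(column)
    return [count / total for count in column]

# Rows: Not much, 1+ per week, 1+ per day. Columns: ECRCHS, Granada.
# Counts are reconstructed from the slide's own arithmetic (an assumption).
observed = [[16, 4], [14, 16], [20, 30]]

granada = column_distribution(observed, 1)
print([f"{p:.0%}" for p in granada])  # ['8%', '32%', '60%']
```

Passing `0` instead of `1` gives the ECRCHS distribution, which is the one the Granada column gets compared against.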

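Row totals: 20, 30, 50. Column totals: 50, 50. Grand total: 100.

If there is no difference between the schools, 30% of the 50 ECRCHS students are expected to use Facebook 1+ per week:
$\frac{30}{100} \cdot 50 = 15$

$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

$df = (\#\text{ of rows} - 1)(\#\text{ of columns} - 1) = (r - 1)(c - 1)$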
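As a minimal sketch of the expected-count and degrees-of-freedom formulas, the following applies them to the Facebook table. The helper names are illustrative, and the observed counts are the ones implied by the slide's calculations.

```python
def expected_counts(observed):
    """Expected count = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def degrees_of_freedom(observed):
    """df = (number of rows - 1) * (number of columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)

# Rows: Not much, 1+ per week, 1+ per day. Columns: ECRCHS, Granada.
observed = [[16, 4], [14, 16], [20, 30]]

expected = expected_counts(observed)
df = degrees_of_freedom(observed)
print(expected)  # [[10.0, 10.0], [15.0, 15.0], [25.0, 25.0]]
print(df)        # 2
```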

In Ch. 11-1, we used a χ² GOF test to test the claimed distribution of a categorical variable. Can we use it here? No. We are not comparing a sample distribution to a claimed distribution. We are comparing a sample distribution to another sample distribution.

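Expected count for row 1, column 1: $\frac{20 \cdot 50}{100} = 10$

Observed counts, with expected counts in parentheses:

                ECRCHS     Granada    Total
Not much        16 (10)     4 (10)      20
1+ per week     14 (15)    16 (15)      30
1+ per day      20 (25)    30 (25)      50
Total           50         50          100

State:
H0: There is no difference in the distribution of Facebook habits between ECRCHS and Granada.
Ha: There is some difference in the distribution of Facebook habits between ECRCHS and Granada.
α = 0.05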

When comparing a sample distribution to another sample distribution, we use the χ² test of homogeneity.

Plan: χ² test of homogeneity
Random: random sample from each high school.
Large Sample Size: All expected counts are at least 5. (10, 10, 15, 15, 25, 25)
Independent: Two things to check:
1) Both samples or groups need to be independent of each other.
2) Individual observations in each sample or group have to be independent. When sampling without replacement for both samples, we must check the 10% condition for both.
We clearly have two independent samples, one from each school. There must be at least 10(50) = 500 students at both ECRCHS and Granada.

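Do: χ² distribution, df = 2
$df = (r - 1)(c - 1) = (3 - 1)(2 - 1) = 2$

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(16 - 10)^2}{10} + \frac{(4 - 10)^2}{10} + \frac{(14 - 15)^2}{15} + \frac{(16 - 15)^2}{15} + \frac{(20 - 25)^2}{25} + \frac{(30 - 25)^2}{25}$
$= 3.6 + 3.6 + 0.07 + 0.07 + 1 + 1$
$\chi^2 = 9.34$

χ²cdf(lower bound, upper bound, df) = χ²cdf(9.34, 99999, 2) = .0094 (p-value)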
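The Do step can be checked without a calculator. This is a sketch under stated assumptions: the function names are illustrative, the observed counts are the ones appearing in the slide's formula, and the upper-tail probability uses the closed-form series that holds only for even degrees of freedom (for df = 2 it reduces to $e^{-x/2}$).

```python
import math

def expected_counts(observed):
    """Expected count = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the two-way table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

def upper_tail_even_df(x, df):
    """P(chi-square >= x) for even df, via the closed-form series."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this sketch only handles even df >= 2")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# Rows: Not much, 1+ per week, 1+ per day. Columns: ECRCHS, Granada.
observed = [[16, 4], [14, 16], [20, 30]]

chi2 = chi_square_statistic(observed)
p_value = upper_tail_even_df(chi2, df=2)
print(round(chi2, 2), round(p_value, 4))  # 9.33 0.0094
```

The unrounded statistic is 9.33; the slide's 9.34 comes from adding contributions that were already rounded to two decimals. The p-value rounds to .0094 either way.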

Conclude: Assuming H0 is true (there is no difference in the distribution of Facebook habits between ECRCHS and Granada), there is a 0.0094 probability of getting a χ² value of 9.34 or more purely by chance. This provides strong evidence against H0 and is statistically significant at the α = 0.05 level (.0094 < .05). Therefore, we reject H0 and can conclude that there is some difference in Facebook habits between ECRCHS and Granada. The largest components of χ² are the two contributions of 3.6, because the number of ECRCHS and Granada students who don't go on Facebook much was higher than expected and lower than expected, respectively.

Just by looking at the data, what do you think the p-value will be?

Expected counts (passed, did not pass) for the three test preparation strategies:
(29.8, 20.2), row total 50
(59.6, 40.4), row total 100
(59.6, 40.4), row total 100
Column totals: 149 passed, 101 did not pass. Grand total: 250.
It is not appropriate to round expected counts to whole numbers.

State:
H0: There is no difference in the success rates for the three test preparation strategies.
Ha: There is a difference in the success rates for the three test preparation strategies.
α = 0.05

Plan: χ² test of homogeneity
Random: random sample of 149 students who had passed the exam and a separate sample of 101 students who did not pass the exam.
Large Sample Size: All expected counts are at least 5. (29.8, 59.6, 59.6, 20.2, 40.4, 40.4)
Independent: Independent samples were taken. There must be at least 10(149) = 1490 students who have passed the AP Stats exam and at least 10(101) = 1010 students who did not.

Do: χ² distribution, df = 2
$df = (r - 1)(c - 1) = (3 - 1)(2 - 1) = 2$

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(40 - 29.8)^2}{29.8} + \frac{(10 - 20.2)^2}{20.2} + \frac{(99 - 59.6)^2}{59.6} + \frac{(1 - 40.4)^2}{40.4} + \frac{(10 - 59.6)^2}{59.6} + \frac{(90 - 40.4)^2}{40.4}$
$= 3.49 + 5.15 + 26.05 + 38.43 + 41.28 + 60.9$
$\chi^2 = 175.286$

You can use χ² GOF-Test to get the contribution values quickly, but don't say you used χ² GOF-Test for a test of homogeneity.

χ²cdf(lower bound, upper bound, df) = χ²cdf(175.286, 9999, 2) ≈ 0 (p-value)
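The individual contributions can also be listed with a few lines of code. This is a sketch only: `cell_contributions` is an illustrative name, and the observed counts (passed, did not pass) for the three strategies are the ones implied by the slide's expected counts, totals, and contribution values.

```python
def cell_contributions(observed):
    """Return (O - E)^2 / E for each cell, in the table's layout."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    result = []
    for row, r in zip(observed, row_totals):
        contribution_row = []
        for o, c in zip(row, col_totals):
            e = r * c / grand_total          # expected count for this cell
            contribution_row.append((o - e) ** 2 / e)
        result.append(contribution_row)
    return result

# Columns: passed, did not pass. One row per preparation strategy.
# Counts are reconstructed from the slide's arithmetic (an assumption).
observed = [[40, 10], [99, 1], [10, 90]]

parts = cell_contributions(observed)
chi2 = sum(sum(row) for row in parts)
largest = max(max(row) for row in parts)
print([[round(x, 2) for x in row] for row in parts])
print(round(chi2, 3), round(largest, 1))
```

Looking at `largest` is what backs up the last sentence of a conclusion: it identifies the cell that departs most from its expected count.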

Conclude: Assuming H0 is true (there is no difference in the success rates for the three test preparation strategies), there is approximately a 0 probability of getting a χ² value of 175.286 or more purely by chance. This provides very strong evidence against H0 and is statistically significant at the α = 0.05 level (0 < .05). Therefore, we reject H0 and can conclude that there is a difference in success rates for the three types of test preparation. The largest component of χ² is 60.9 because the number of students who didn't pass the exam with no review was much higher than expected.

What if we have a single random sample from a single population that's classified according to two categorical variables, and our goal is to see if the two categorical variables have a relationship/association? New test!
Why can't we use χ² GOF? There's more than one categorical variable.
Why can't we use χ² homogeneity? There's one population and more than one categorical variable.

Two categorical variables: math class and sport played.

Expected counts, with row and column totals:

  27.8   25.3   29.1   23.7  |  106
  38.4   34.9   40.1   32.7  |  146
  21.8   19.8   22.8   18.6  |   83
  ----------------------------------
    88     80     92     75  |  335

State:
H0: There is no association between the math class and sport played for high school students.
Ha: There is some association between the math class and sport played for high school students.
OR
H0: Math class and sport played are independent in the population of high school students.
Ha: Math class and sport played are not independent in the population of high school students.
α = 0.05

Plan: χ² test of association/independence
Random: random sample of 335 high school students.
Large Sample Size: All expected counts are at least 5. The lowest expected count is 18.6. (see table)
Independent: One thing to check: individual observations in the sample have to be independent. When sampling without replacement, we must check the 10% condition. There must be at least 10(335) = 3350 high school students in the USA who play a sport and take a math class.
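The expected counts and the Large Sample Size check depend only on the row and column totals, so they can be sketched from the totals alone (106, 146, 83 and 88, 80, 92, 75). The function name is illustrative.

```python
def expected_from_totals(row_totals, col_totals):
    """Expected count = row total * column total / grand total."""
    grand_total = sum(row_totals)
    if grand_total != sum(col_totals):
        raise ValueError("row and column totals must add to the same grand total")
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

row_totals = [106, 146, 83]
col_totals = [88, 80, 92, 75]

expected = expected_from_totals(row_totals, col_totals)
lowest = min(min(row) for row in expected)
all_at_least_5 = lowest >= 5

print([[round(e, 1) for e in row] for row in expected])
print(round(lowest, 1), all_at_least_5)  # 18.6 True
```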

Do: χ² distribution, df = 6
$df = (r - 1)(c - 1) = (3 - 1)(4 - 1) = 6$

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(35 - 27.8)^2}{27.8} + \frac{(42 - 38.4)^2}{38.4} + \frac{(11 - 21.8)^2}{21.8} + \cdots$
$= 1.86 + .34 + 5.35 + 2.99 + 2.81 + .07 + 10.05 + 2.44 + 2.27 + .07 + .42 + .31$
$\chi^2 = 28.96$

You can use χ² GOF-Test to get the contribution values quickly, but don't say you used χ² GOF-Test for a test of association/independence.

χ²cdf(lower bound, upper bound, df) = χ²cdf(28.96, 9999, 6) ≈ 0 (p-value)
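The p-value shown as 0 is a rounded value, not an exact zero. As a small sketch, for df = 6 the upper-tail probability has the closed form $e^{-x/2}\left(1 + \frac{x}{2} + \frac{(x/2)^2}{2}\right)$, which can be evaluated directly. The function name is illustrative, and the formula applies to df = 6 only.

```python
import math

def upper_tail_df6(x):
    """P(chi-square >= x) for df = 6 only, using the closed form."""
    half = x / 2
    return math.exp(-half) * (1 + half + half ** 2 / 2)

p_value = upper_tail_df6(28.96)
print(f"{p_value:.6f}")
```

The result is on the order of 0.00006, which is why the calculator output is reported as approximately 0 and why the conclusion says "about a 0 probability".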

Conclude: Assuming H0 is true (there is no association between math class and sport played for HS students), there is about a 0 probability of getting a χ² value of 28.96 or more purely by chance. This provides strong evidence against H0 and is statistically significant at the α = 0.05 level (0 < .05). Therefore, we reject H0 and can conclude that there is some association between math class and sport played. The largest component of χ² is 10.05 because the number of Geometry students who play football is much less than expected.

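Which χ² test?

1 categorical variable, 1 population: χ² GOF test. Example: Skittles problem. Tests the null hypothesis that a categorical variable has a claimed distribution.
1 categorical variable, 2 or more populations: χ² test of homogeneity. Example: Facebook habits at ECRCHS vs Granada. Comparing the distribution of one categorical variable in two or more populations.
2 categorical variables, 1 population: χ² test of association/independence. Example: Math class vs sport played. Investigating the relationship between two categorical variables in one population.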

Is there an association between resemblance and dog breed?

Expected counts: 12.78 and 12.22 (row total 25); 10.22 and 9.78 (row total 20). Column totals: 23 and 22. Grand total: 45.

χ² test of association/independence: χ² = 3.73, p-value = .053

Does the data give convincing evidence of a difference in resemblance and an owner's choice in dog breed?

Two-proportion z test (two-sided):
$\hat{p}_1 = \frac{16}{25} = .64$, $\hat{p}_2 = \frac{7}{20} = .35$
z = 1.934, p-value = .053 (the same p-value as the χ² test)

$z^2 = \chi^2$: $1.934^2 \approx 3.74$, which agrees with χ² = 3.73 up to rounding.
This only works for two-sided two-proportion z tests.
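The link between the two tests can be checked numerically. This is a sketch under stated assumptions: the function names are illustrative, the 2x2 table is built from the slide's 16 of 25 and 7 of 20, and both p-values use the standard library's `math.erfc`, which gives the two-sided normal tail and, equivalently, the χ² upper tail for df = 1.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic."""
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / standard_error

def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    total = 0.0
    for row, r in zip(observed, row_totals):
        for o, c in zip(row, col_totals):
            e = r * c / grand_total
            total += (o - e) ** 2 / e
    return total

# Group 1: 16 of 25; group 2: 7 of 20 (the remaining counts follow by subtraction).
observed = [[16, 25 - 16], [7, 20 - 7]]

z = two_proportion_z(16, 25, 7, 20)
chi2 = chi_square_statistic(observed)
p_two_sided_z = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
p_chi2_df1 = math.erfc(math.sqrt(chi2 / 2))        # chi-square upper tail, df = 1

print(round(z, 3), round(z ** 2, 2), round(chi2, 2))
print(round(p_two_sided_z, 3), round(p_chi2_df1, 3))
```

The squared z statistic and the χ² statistic agree to floating-point precision, and so do the two p-values. A one-sided z test would halve its p-value and break the match, which is why the equivalence is limited to two-sided tests.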