Lecture 21 Comparing Counts - Chi-square test

Thais Paiva
STA 111 - Summer 2013 Term II
August 5, 2013

Lecture Plan
1. Goodness of fit
2. Independence
3. Homogeneity

CEOs and the Zodiac
Fortune magazine collected the zodiac signs of 256 heads of the 400 largest companies:

Sign          Count
Aries            23
Taurus           20
Gemini           18
Cancer           23
Leo              20
Virgo            19
Libra            18
Scorpio          21
Sagittarius      19
Capricorn        22
Aquarius         24
Pisces           29

CEOs and the Zodiac
There are more Pisces than any other sign.
Are there enough of them to constitute evidence that CEOs are more likely to be born under some signs than others?
If the distribution were uniform, we would expect 256/12 = 21.33 CEOs in each sign.
How closely do the observed counts fit this null model?

Goodness of fit
We will test whether the null model fits the observed data.
Since there is no parameter to estimate, confidence intervals do not make sense.
If we were only interested in Pisces, we could use a one-proportion z-test with $H_0: p = 1/12$ versus $H_A: p > 1/12$.
But we are interested in the whole distribution of signs, so we need a test that takes all signs into account.

Goodness of fit
Assumptions and conditions:
The data are counts for two or more categories.
Each unit is independent of the other units.
Sample size: we should expect to see at least 5 in each cell (some authors say at least 1).
In the zodiac example, we would need n such that $n/12 \geq 5$, that is, $n \geq 60$.

Goodness of fit
Calculations:
In each cell, we expect a certain count under the null ($n/12$ in the zodiac example).
If the null model fits the data, the expected and observed counts should be similar.
Just like with OLS regression, we will square the differences:
$$\chi^2 = \sum_{i=1}^{p} \frac{(\mathrm{Obs}_i - \mathrm{Exp}_i)^2}{\mathrm{Exp}_i} \sim \chi^2_{p-1},$$
where p is the number of cells.
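The calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_statistic` and the die-roll counts are hypothetical and do not come from the lecture.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (Obs_i - Exp_i)^2 / Exp_i over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data (not from the lecture): 60 rolls of a die, tested
# against a uniform null model with 60 / 6 = 10 expected rolls per face.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1  # p - 1, with p = number of cells

print(f"chi-square = {statistic:.3f} on {degrees_of_freedom} df")
```

Dividing each squared difference by its expected count puts cells with large and small expected counts on a comparable scale before they are summed.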

The $\chi^2$ distribution
Like the t, the $\chi^2$ is indexed by degrees of freedom.
Unlike the t, the $\chi^2$ is skewed and always positive.
[Figure: density curves of the $\chi^2_4$ and $\chi^2_{15}$ distributions, plotted for x from 0 to 30]
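One way to get a feel for these two properties is a small simulation. The sketch below relies on a standard fact that is not stated on the slide: a $\chi^2$ variable with k degrees of freedom has the same distribution as the sum of k squared independent standard normals. The function name, seed, and number of draws are arbitrary choices for illustration.

```python
import random
import statistics


def simulate_chi_square(df, n_draws, rng):
    """Draw n_draws values, each the sum of df squared standard normals."""
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        for _ in range(n_draws)
    ]


rng = random.Random(111)  # fixed seed so the run is reproducible
results = {df: simulate_chi_square(df, 20_000, rng) for df in (4, 15)}

for df, draws in results.items():
    print(
        f"df={df:2d}  min={min(draws):.3f}  "
        f"mean={statistics.mean(draws):.2f}  "
        f"median={statistics.median(draws):.2f}"
    )
```

Every simulated value is positive, and in each case the mean sits above the median, which is what a right-skewed distribution looks like. The mean is also close to the degrees of freedom, the theoretical mean of a $\chi^2$ distribution.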

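The $\chi^2$ test for goodness of fit
$$\chi^2 = \sum_{i=1}^{p} \frac{(\mathrm{Obs}_i - \mathrm{Exp}_i)^2}{\mathrm{Exp}_i} \sim \chi^2_{p-1}$$
[Figure: $\chi^2$ density with the critical value $\chi^2_\alpha$ marked on the x axis; area $1 - \alpha$ lies to its left and area $\alpha$ in the upper tail]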

One-sided or two-sided?
If the data do not match the null model, then $\chi^2$ will be large, whichever direction the individual cells deviate in, because the differences are squared.
Only large values count as evidence against the null, so it works like a one-sided test.

Back to the zodiac
$H_0: p_{\mathrm{Aries}} = p_{\mathrm{Taurus}} = \dots = p_{\mathrm{Pisces}}$
$H_A$: not all $p_i$ are equal
$\mathrm{Exp}_i = 256/12 = 21.333$ for each $i = 1, \dots, 12$
$$\chi^2 = \sum_{i=1}^{12} \frac{(\mathrm{Obs}_i - 21.333)^2}{21.333} = 5.094,$$
to be compared against a $\chi^2_{11}$ distribution, which gives a p-value > 0.25.
From the back of the book (p. 676): $P[\chi^2_{11} > 13.7] = 0.25$, so $P[\chi^2_{11} > 5.094] > 0.25$.
There is virtually no evidence of a non-uniform distribution of zodiac signs among executives.
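The zodiac calculation can be reproduced with a short Python sketch. The counts and the table value 13.7 are taken from the lecture; the variable names are illustrative. Rather than computing an exact p-value, the code makes the same comparison against the table value that the slide does.

```python
# Observed zodiac counts for the 256 CEOs, as listed in the lecture.
observed = {
    "Aries": 23, "Taurus": 20, "Gemini": 18, "Cancer": 23,
    "Leo": 20, "Virgo": 19, "Libra": 18, "Scorpio": 21,
    "Sagittarius": 19, "Capricorn": 22, "Aquarius": 24, "Pisces": 29,
}

n = sum(observed.values())   # total number of CEOs
p = len(observed)            # number of cells (signs)
expected = n / p             # uniform null model: same count for every sign

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = p - 1

# Table value quoted in the lecture: P[chi2_11 > 13.7] = 0.25.
TABLE_VALUE_UPPER_25_PERCENT = 13.7
p_value_exceeds_025 = chi_square < TABLE_VALUE_UPPER_25_PERCENT

print(f"expected per sign = {expected:.3f}")
print(f"chi-square = {chi_square:.3f} on {df} df")
print(f"p-value > 0.25: {p_value_exceeds_025}")
```

Because the statistic falls below the table value, the upper-tail area beyond it must be larger than 0.25, which is the slide's conclusion.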

Independence
A study at the University of Texas Southwestern examined whether the risk of hepatitis C was related to whether people had tattoos and to where they got their tattoos.

                     Hep. C   No hep C   Total
Tattoo, parlor           17         35      52
Tattoo, elsewhere         8         53      61
No tattoo                22        491     513
Total                    47        579     626

This type of data is different: we have a single group that is categorized according to two variables.
This is called a contingency table.

Independence
We could convert the counts to proportions and see whether $P(\text{Parlor and hep C}) = P(\text{Parlor}) \times P(\text{hep C})$, for example.
But even if the variables were independent, the equality would never turn out to hold exactly.
How close does it have to be?
We can use a $\chi^2$ test for independence.
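As a quick illustration of that comparison, the sketch below computes both sides of the equality for the parlor row and the hepatitis C column, using the counts from the tattoo table. The variable names are illustrative.

```python
# Counts from the tattoo / hepatitis C table in the lecture.
n_total = 626
n_parlor = 52            # row total: tattoo from a parlor
n_hep_c = 47             # column total: has hepatitis C
n_parlor_and_hep_c = 17  # cell count: parlor tattoo and hepatitis C

p_joint = n_parlor_and_hep_c / n_total
p_product = (n_parlor / n_total) * (n_hep_c / n_total)

print(f"P(Parlor and hep C)  = {p_joint:.4f}")
print(f"P(Parlor) * P(hep C) = {p_product:.4f}")
print(f"count expected under independence = {n_total * p_product:.1f}")
```

The observed joint proportion is several times the product of the marginals, but the numbers alone do not say whether a gap of that size could arise by chance. That is the question the $\chi^2$ test answers.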

Independence
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(\mathrm{Obs}_{ij} - \mathrm{Exp}_{ij})^2}{\mathrm{Exp}_{ij}} \sim \chi^2_{(R-1)(C-1)}$$

Observed counts, with expected counts in parentheses:

                     Hep. C        No hep C       Total
Tattoo, parlor       17 (3.9)      35 (48.1)         52
Tattoo, elsewhere     8 (4.6)      53 (56.4)         61
No tattoo            22 (38.5)    491 (474.5)       513
Total                47           579               626

Assuming independence, $\mathrm{Exp}_{11} = 626 \times \frac{52}{626} \times \frac{47}{626} = 3.9$.
The number of degrees of freedom is now $(R-1)(C-1)$, where R is the number of rows and C is the number of columns.

Independence: tattoos and hepatitis
$H_0$: Tattoo status and hepatitis status are independent.
$H_A$: Tattoo status and hepatitis status are not independent.
$$\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{2} \frac{(\mathrm{Obs}_{ij} - \mathrm{Exp}_{ij})^2}{\mathrm{Exp}_{ij}} = 57.91,$$
to be compared against a $\chi^2_2$ distribution, which gives a p-value < 0.001 (from the back of the book).
There is strong evidence that tattoo status and hepatitis C status are not independent.
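The whole test can be sketched in Python as follows. The function names are illustrative, not from the lecture. The p-value line uses a closed form that holds only for 2 degrees of freedom, where the upper-tail probability is exactly $e^{-x/2}$; the lecture itself reads the p-value from the table in the book.

```python
import math


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (Obs - Exp)^2 / Exp over every cell of the table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: parlor, elsewhere, no tattoo. Columns: hep C, no hep C.
tattoo_table = [[17, 35], [8, 53], [22, 491]]

chi_square = chi_square_statistic(tattoo_table)
df = (len(tattoo_table) - 1) * (len(tattoo_table[0]) - 1)

# Closed form valid for df = 2 only: P[chi2_2 > x] = exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(f"chi-square = {chi_square:.2f} on {df} df")
print(f"p-value < 0.001: {p_value < 0.001}")
```

Note that the expected count for the parlor and hepatitis C cell is below 5, so the sample size condition from the goodness-of-fit slides is worth keeping in mind when reading this result.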

Homogeneity
Many high schools survey graduating classes to determine their plans. We might wonder whether plans have stayed roughly the same through the decades.

              1980   1990   2000   Total
College        320    245    288     853
Employment      98     24     17     139
Military        18     19      5      42
Travel          17      2      5      24
Total          453    290    315    1058

Homogeneity
Assumptions and conditions:
The data are counts for two or more categories, with two or more groups to compare.
Each unit that is counted is independent of the other units.
Sample size: we should expect to see at least 5 in each cell.
In the class plans example, that means every expected count (row total times column total, divided by the grand total) should be at least 5; the smallest, for Travel in 1990, is about 6.6.

Homogeneity
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(\mathrm{Obs}_{ij} - \mathrm{Exp}_{ij})^2}{\mathrm{Exp}_{ij}} \sim \chi^2_{(R-1)(C-1)}$$

Observed counts, with expected counts in parentheses:

              1980            1990            2000            Total
College       320 (365.23)    245 (233.81)    288 (253.96)      853
Employment     98 (59.52)      24 (38.10)      17 (41.39)       139
Military       18 (17.98)      19 (11.51)       5 (12.50)        42
Travel         17 (10.28)       2 (6.58)        5 (7.15)         24
Total         453             290             315              1058

Under the null, $\mathrm{Exp}_{21} = 1058 \times \frac{139}{1058} \times \frac{453}{1058} = 59.52$.
The calculation is exactly the same as in the independence test.

Back to high school
$H_0$: The post-high school choices have the same distribution in 1980, 1990, and 2000.
$H_A$: The post-high school choices in 1980, 1990, and 2000 do not have the same distribution.
$$\chi^2 = \sum_{i=1}^{4} \sum_{j=1}^{3} \frac{(\mathrm{Obs}_{ij} - \mathrm{Exp}_{ij})^2}{\mathrm{Exp}_{ij}} = 72.77,$$
to be compared against a $\chi^2_6$ distribution, which gives a p-value < 0.001 (from the back of the book).
There is strong evidence of inhomogeneity.
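A sketch of the homogeneity calculation in Python, using the class plans table from the lecture. The function names are illustrative. For the p-value, the code uses a standard closed form for even degrees of freedom, $P[\chi^2_{2k} > x] = e^{-x/2} \sum_{j=0}^{k-1} (x/2)^j / j!$, which is an addition here and not the lecture's method (the lecture uses the table in the book).

```python
import math


def chi_square_test(table):
    """Return (statistic, df) for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def upper_tail_even_df(x, df):
    """P[chi2_df > x], using the closed form that holds for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to even, positive df")
    half = x / 2
    series = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * series


# Columns: 1980, 1990, 2000.
plans = [
    [320, 245, 288],  # College
    [98, 24, 17],     # Employment
    [18, 19, 5],      # Military
    [17, 2, 5],       # Travel
]

statistic, df = chi_square_test(plans)
p_value = upper_tail_even_df(statistic, df)

print(f"chi-square = {statistic:.2f} on {df} df")
print(f"p-value < 0.001: {p_value < 0.001}")
```

The code is the same as it would be for a test of independence; only the hypotheses and the way the data were collected differ.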

Summary
1. Goodness of fit
2. Independence
3. Homogeneity