
Section 12

Tests of independence and homogeneity.

In this lecture we will consider a situation when our observations are classified by two different features and we would like to test if these features are independent. For example, we can ask if the number of children in a family and family income are independent. Our sample space $\mathcal{X}$ will consist of $a \times b$ pairs

$$\mathcal{X} = \{(i, j) : i = 1, \ldots, a,\ j = 1, \ldots, b\}$$

where the first coordinate represents the first feature that belongs to one of $a$ categories and the second coordinate represents the second feature that belongs to one of $b$ categories. An i.i.d. sample $X_1, \ldots, X_n$ can be represented by the contingency table below, where $N_{ij}$ is the number of observations in cell $(i, j)$.

Table 12.1: Contingency table.

| Feature 1 \ Feature 2 | $1$ | $2$ | $\cdots$ | $b$ |
|---|---|---|---|---|
| $1$ | $N_{11}$ | $N_{12}$ | $\cdots$ | $N_{1b}$ |
| $2$ | $N_{21}$ | $N_{22}$ | $\cdots$ | $N_{2b}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ |
| $a$ | $N_{a1}$ | $N_{a2}$ | $\cdots$ | $N_{ab}$ |

We would like to test the independence of the two features, which means that

$$P(X = (i, j)) = P(X^1 = i)\, P(X^2 = j).$$

If we introduce the notations $P(X = (i, j)) = \alpha_{ij}$, $P(X^1 = i) = p_i$ and $P(X^2 = j) = q_j$,

then we want to test that for all $i$ and $j$ we have $\alpha_{ij} = p_i q_j$. Therefore, our hypotheses can be formulated as follows:

$$H_0 : \alpha_{ij} = p_i q_j \ \text{ for all } (i, j), \text{ for some } (p_1, \ldots, p_a) \text{ and } (q_1, \ldots, q_b)$$

$$H_1 : \text{otherwise}$$

We can see that this null hypothesis $H_0$ is a special case of the composite hypotheses from the previous lecture and it can be tested using the chi-squared goodness-of-fit test. The total number of groups is $r = a \times b$. Since the $p_i$'s and $q_j$'s should add up to one,

$$p_1 + \cdots + p_a = 1 \quad \text{and} \quad q_1 + \cdots + q_b = 1,$$

one parameter in each sequence, for example $p_a$ and $q_b$, can be computed in terms of the other probabilities, and we can take $(p_1, \ldots, p_{a-1})$ and $(q_1, \ldots, q_{b-1})$ as free parameters of the model. This means that the dimension of the parameter set is

$$s = (a - 1) + (b - 1).$$

Therefore, if we find the maximum likelihood estimates $p_i^*$ and $q_j^*$ for the parameters of this model, then the chi-squared statistic

$$T = \sum_{i,j} \frac{(N_{ij} - n p_i^* q_j^*)^2}{n p_i^* q_j^*} \to \chi^2_{r - s - 1} = \chi^2_{ab - (a-1) - (b-1) - 1} = \chi^2_{(a-1)(b-1)}$$

converges in distribution to the $\chi^2_{(a-1)(b-1)}$ distribution with $(a-1)(b-1)$ degrees of freedom. To formulate the test it remains to find the maximum likelihood estimates of the parameters. We need to maximize the likelihood function

$$\prod_{i,j} (p_i q_j)^{N_{ij}} = \prod_i p_i^{\sum_j N_{ij}} \prod_j q_j^{\sum_i N_{ij}} = \prod_i p_i^{N_{i+}} \prod_j q_j^{N_{+j}}$$

where we introduced the notations

$$N_{i+} = \sum_j N_{ij} \quad \text{and} \quad N_{+j} = \sum_i N_{ij}$$

for the total number of observations in the $i$th row and $j$th column. Since the $p_i$'s and $q_j$'s are not related to each other, maximizing the likelihood function above is equivalent to maximizing $\prod_i p_i^{N_{i+}}$ and $\prod_j q_j^{N_{+j}}$ separately. Let us maximize $\prod_{i=1}^{a} p_i^{N_{i+}}$ or, taking the logarithm, maximize

$$\sum_{i=1}^{a} N_{i+} \log p_i = \sum_{i=1}^{a-1} N_{i+} \log p_i + N_{a+} \log(1 - p_1 - \cdots - p_{a-1}),$$

since the probabilities add up to one. Setting the derivative in $p_i$ equal to zero, we get

$$\frac{N_{i+}}{p_i} - \frac{N_{a+}}{1 - p_1 - \cdots - p_{a-1}} = \frac{N_{i+}}{p_i} - \frac{N_{a+}}{p_a} = 0$$

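or $N_{i+} p_a = N_{a+} p_i$. Adding up these equations for all $i \le a$, and using $\sum_i N_{i+} = n$ and $\sum_i p_i = 1$, gives

$$n p_a = N_{a+} \implies p_a = \frac{N_{a+}}{n} \implies p_i = \frac{N_{i+}}{n}.$$

Therefore, we get that the MLE for $p_i$ is

$$p_i^* = \frac{N_{i+}}{n}.$$

Similarly, the MLE for $q_j$ is

$$q_j^* = \frac{N_{+j}}{n}.$$

Therefore, the chi-squared statistic $T$ in this case can be written as

$$T = \sum_{i,j} \frac{(N_{ij} - N_{i+} N_{+j} / n)^2}{N_{i+} N_{+j} / n}$$

and the decision rule is given by

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

where the threshold $c$ is determined from the condition

$$\chi^2_{(a-1)(b-1)}(c, +\infty) = \alpha.$$

Example. In a 1992 poll, 189 Montana residents were asked whether their personal financial status was worse, the same or better than one year ago. The opinions were divided into three groups by income range: under 20K, between 20K and 35K, and over 35K. We would like to test if opinions were independent of income.

Table 12.2: Montana outlook poll.

| Income ($a = 3$) \ Opinion ($b = 3$) | Worse | Same | Better | Total |
|---|---|---|---|---|
| $\le$ 20K | 20 | 15 | 12 | 47 |
| (20K, 35K) | 24 | 27 | 32 | 83 |
| $\ge$ 35K | 14 | 22 | 23 | 59 |
| Total | 58 | 64 | 67 | 189 |

The chi-squared statistic is

$$T = \frac{(20 - 47 \cdot 58 / 189)^2}{47 \cdot 58 / 189} + \cdots + \frac{(23 - 67 \cdot 59 / 189)^2}{67 \cdot 59 / 189} = 5.21.$$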
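As a numerical check of the example, the following Python sketch recomputes $T$ from the Montana counts using the expected counts $N_{i+} N_{+j} / n$, and evaluates the tail probability $\chi^2_{(a-1)(b-1)}(T, +\infty)$ that the decision rule compares with the level $\alpha$. The function names `chi2_independence` and `chi2_tail_even_df` are illustrative and not part of the lecture. The tail formula is the standard closed form that holds only for an even number of degrees of freedom, which is an assumption that happens to cover the 4 degrees of freedom here.

```python
import math

# Observed counts from the Montana outlook poll
# (rows: income group, columns: worse / same / better).
observed = [
    [20, 15, 12],   # under 20K
    [24, 27, 32],   # between 20K and 35K
    [14, 22, 23],   # over 35K
]

def chi2_independence(table):
    """Return (T, degrees of freedom) for a table of counts N_ij."""
    row_totals = [sum(row) for row in table]         # N_{i+}
    col_totals = [sum(col) for col in zip(*table)]   # N_{+j}
    n = sum(row_totals)
    t = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # n * p_i^* * q_j^*
            t += (n_ij - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return t, df

def chi2_tail_even_df(x, df):
    """Tail probability of chi-squared(df) on (x, +inf); even df only (assumption)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs an even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

T, df = chi2_independence(observed)
tail = chi2_tail_even_df(T, df)
print(f"T = {T:.2f}, df = {df}, tail probability = {tail:.3f}")
```

Since the tail probability is decreasing in its argument, $T > c$ is equivalent to the tail probability at $T$ being smaller than $\alpha$, so the same decision can be read off either quantity.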

If we take level of significance $\alpha = 0.05$ then the threshold $c$ is:

$$\chi^2_{(a-1)(b-1)}(c, +\infty) = \chi^2_4(c, +\infty) = \alpha = 0.05 \implies c = 9.488.$$

Since $T = 5.21 < c = 9.488$ we accept the null hypothesis that opinions are independent of income.

Test of homogeneity.

Suppose that the population is divided into $R$ groups and each group (or the entire population) is divided into $C$ categories. We would like to test whether the distribution of categories in each group is the same.

Table 12.3: Test of homogeneity.

| | Category 1 | $\cdots$ | Category $C$ | $\Sigma$ |
|---|---|---|---|---|
| Group 1 | $N_{11}$ | $\cdots$ | $N_{1C}$ | $N_{1+}$ |
| $\vdots$ | $\vdots$ | | $\vdots$ | $\vdots$ |
| Group $R$ | $N_{R1}$ | $\cdots$ | $N_{RC}$ | $N_{R+}$ |
| $\Sigma$ | $N_{+1}$ | $\cdots$ | $N_{+C}$ | $n$ |

If we denote

$$P(\text{Category}_j \mid \text{Group}_i) = p_{ij}$$

so that for each group $i \le R$ we have

$$\sum_{j=1}^{C} p_{ij} = 1,$$

then we want to test the following hypotheses:

$$H_0 : p_{ij} = p_j \ \text{ for all groups } i \le R$$

$$H_1 : \text{otherwise}$$

If the observations $X_1, \ldots, X_n$ are sampled independently from the entire population, then homogeneity over groups is the same as independence of groups and categories. Indeed, if we have homogeneity

$$P(\text{Category}_j \mid \text{Group}_i) = P(\text{Category}_j)$$

then we have

$$P(\text{Group}_i, \text{Category}_j) = P(\text{Category}_j \mid \text{Group}_i)\, P(\text{Group}_i) = P(\text{Category}_j)\, P(\text{Group}_i)$$

which means the groups and categories are independent. Conversely, if we have independence then

$$P(\text{Category}_j \mid \text{Group}_i) = \frac{P(\text{Group}_i, \text{Category}_j)}{P(\text{Group}_i)} = \frac{P(\text{Category}_j)\, P(\text{Group}_i)}{P(\text{Group}_i)} = P(\text{Category}_j)$$

which is homogeneity. This means that to test homogeneity we can use the test of independence above.

Interestingly, the same test can be used in the case when the sampling is done not from the entire population but from each group separately, which means that we decide a priori about the sample size in each group, $N_{1+}, \ldots, N_{R+}$. When we sample from the entire population these numbers are random and by the LLN $N_{i+} / n$ will approximate the probability $P(\text{Group}_i)$, i.e. $N_{i+}$ reflects the proportion of group $i$ in the population. When we pick these numbers a priori one can simply think that we artificially renormalize the proportion of each group in the population and test for homogeneity among groups as independence in this new artificial population.

Another way to argue that the test will be the same is as follows. Assume that

$$P(\text{Category}_j \mid \text{Group}_i) = p_j$$

where the probabilities $p_j$ are all given. Then by Pearson's theorem we have the convergence in distribution

$$\sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} p_j)^2}{N_{i+} p_j} \to \chi^2_{C-1}$$

for each group $i \le R$, which implies that

$$\sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} p_j)^2}{N_{i+} p_j} \to \chi^2_{R(C-1)}$$

since the samples in different groups are independent. If now we assume that the probabilities $p_1, \ldots, p_C$ are unknown and plug in the maximum likelihood estimates $p_j^* = N_{+j} / n$, then

$$\sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} N_{+j} / n)^2}{N_{i+} N_{+j} / n} \to \chi^2_{R(C-1) - (C-1)} = \chi^2_{(R-1)(C-1)}$$

because we have $C - 1$ free parameters $p_1, \ldots, p_{C-1}$ and estimating each unknown parameter results in losing one degree of freedom.

Example (Textbook, page 560). In this example, 100 people were asked whether the service provided by the fire department in the city was satisfactory. Shortly after the survey, a large fire occurred in the city. Suppose that the same 100 people were asked again whether they thought that the service provided by the fire department was satisfactory. The results are in the following table:

| | Satisfactory | Unsatisfactory |
|---|---|---|
| Before fire | 80 | 20 |
| After fire | 72 | 28 |

Suppose that we would like to test whether the opinions changed after the fire by using a chi-squared test. However, the i.i.d. sample consisted of pairs of opinions of 100 people
$$(X_1^1, X_1^2), \ldots, (X_{100}^1, X_{100}^2)$$

where the first coordinate/feature is a person's opinion before the fire and it belongs to one of two categories

{"Satisfactory", "Unsatisfactory"},

and the second coordinate/feature is a person's opinion after the fire and it also belongs to one of two categories

{"Satisfactory", "Unsatisfactory"}.

So the correct contingency table corresponding to the above data and satisfying the assumption of the chi-squared test would be the following, with row totals 80 and 20 and column totals 72 and 28 matching the first table:

| | Sat. after | Uns. after |
|---|---|---|
| Sat. before | 70 | 10 |
| Uns. before | 2 | 18 |

In order to use the first contingency table, we would have to poll 100 people after the fire independently of the 100 people polled before the fire.
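The relation between the two tables can be made concrete with a small Python sketch. It assumes the four paired counts are indexed as (opinion before, opinion after), chosen so that their margins agree with the first table. The dictionary `paired` and the helper `margin` are illustrative names, not part of the lecture.

```python
# Paired counts keyed by (opinion before, opinion after).
# Assumption: the cells are indexed so that the margins match the first table.
paired = {
    ("sat", "sat"): 70,
    ("sat", "uns"): 10,
    ("uns", "sat"): 2,
    ("uns", "uns"): 18,
}

def margin(table, position):
    """Sum the paired counts over the other coordinate of the key."""
    totals = {}
    for key, count in table.items():
        totals[key[position]] = totals.get(key[position], 0) + count
    return totals

before = margin(paired, 0)        # opinions before the fire
after = margin(paired, 1)         # opinions after the fire
n_people = sum(paired.values())   # number of people actually polled
changed = paired[("sat", "uns")] + paired[("uns", "sat")]

print("before:", before)
print("after:", after)
print("people polled:", n_people, "of whom changed opinion:", changed)
```

The first table contains only these two margins. Its entries add up to 200 although only 100 people were polled, so every person is counted twice and the two rows are not independent samples, which is why the chi-squared test of homogeneity does not apply to it directly.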