Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................ 1 1.1.1 Example: Twis.............. 2 1.2 Chi-Squared Test for a Specified Distributio... 3 2 Chi-Squared Tests with Estimated Parameters 3 2.1 Chi-Squared for a Parametrized Distributio... 4 3 Two-Way Cotigecy Tables 5 3.1 Perspective 1: Test for homogeeity........ 6 3.2 Perspective 2: Test for idepedece....... 6 3.2.1 Example................... 7 Copyright 2019, Joh T. Whela, ad all that Tuesday 16 April 2019 1 Chi-Squared Tests with Kow Probabilities 1.1 Chi-Squared Testig The ext sort of iferece we cosider is kow as categorical data aalysis. Cosider a experimet with a set of k possible discrete outcomes. If we do idepedet repetitios of that experimet, we fid 1 that ed i outcome #1, 2 that ed i outcome #2, up to k edig i outcome #k, where each i is a o-egative iteger, ad 1 2 k. There are k differet categories ito which each observatio ca fall, ad we cout up the umbers of each. We have a model which tells us the expected probabilities p 1, p 2,..., p k of the differet outcomes, where 0 p i 1 ad p 1 p 2 p k 1. The expected umbers of observatios i each category accordig to the model are p 1, p 2,..., p k. (Note that these will i geeral ot be itegers.) If we cosider this model to be the ull hypothesis, we re iterested i a test which rejects the model if 1

the observed umber couts i the categories are very differet from their expected values. To see the test statistic, we take a slight diversio ad recall that if X 1,..., X k are idepedet radom ormal variables such that X i N(µ i, σ 2 i ), the U k (X i µ i ) 2 i1 σ 2 i k i1 Z i 2 (1.1) is a chi-square radom variable with k degrees of freedom, also kow as χ 2 (k). Similarly, oe of the cosequeces of Studet s theorem is that if {X i } is a sample from N(µ, σ 2 ) V k (X i X) 2 i1 σ 2 χ 2 (k 1) (1.2) as we saw whe we cosidered cofidece itervals for the populatio variace, if we have a statistic W χ 2 (ν), P (W χ 2 α,ν) α (1.3) so comparig the statistic value to χ 2 α,ν gives a test at sigificace level α of the origial model. To coect this to the categorical data aalysis problem, cosider the k 2 case, where there are two possible outcomes. The N 1 Bi(, p 1 ) ad N 2 N 1. If p 1 ad p 2 (1 p 1 ) are both more tha about 5, we ca treat the biomial radom variable N 1 as approximately N(p 1, p 1 p 2 ) so W (N 1 p 1 ) 2 p 1 p 2 (1.4) Is approximately χ 2 (1)-distributed. If we use a little algebra, specifically 1 p 1 p 2 1 p 1 1 p 2 ad N 1 p 1 (N 2 p 2 ), we ca show W (N 1 p 1 ) 2 (N 2 p 2 ) 2 (1.5) p 1 p 2 which is a form of the chi-squared statistic (with k 1 1 degree of freedom) which treats outcomes 1 ad 2 o equal footig. Now cosider the case where k > 2. The radom variables N 1, N 2,..., N k are ot idepedet, but obey what s called a multiomial distributio, whose pmf (for every possible combiatio of o-egative itegers with 1 2 k ) is p( 1,..., k )! 1! 2! k! p 1 1 p 2 2 p k k (1.6) I fact, the last radom variable N k N 1 N 2 N k 1 is redudat; all the iformatio is carried i the first k 1 variables. It turs out that, as log as all the expected umber couts {p i } (icludig p k ) are at least 5, the statistic W k (N i p i ) 2 i1 p i (1.7) writte i aalogy with (1.5) is a chi-squared radom variable with k 1 degrees of freedom. So the statistic value (to be compared to χ 2 α,k 1 ) for a specific set of umber couts is k ( i p i ) 2 (observed expected) 2 p i expected i1 1.1.1 Example: Twis (1.8) Cosider 50 sets of twis. If they are all frateral, ad boys ad girls are equally likely, we d expect o average 12.5 boy-boy pairs, 25 mixed pairs, ad 12.5 girl-girl pairs. obs 8 23 19 exp 12.5 25 12.5 2

We ca calculate χ 2 (8 12.5)2 12.5 (23 25)2 25 (19 12.5)2 12.5 5.16 (1.9) ad ote that χ 2.10,2 4.605 while χ 2.05,2 5.992 so this hypothesis would be rejected at the 10% level but ot the 5% level. 1.2 Chi-Squared Test for a Specified Distributio We ca also apply the chi-squared test to a histogram of data hypothesized to be a sample draw from a specified distributio. The the umber couts are the umber of observatios which fall ito each bi. For example, suppose we wat to test the hypothesis that a set of 80 observatios is a sample from a expoetial distributio with rate parameter λ l 2. The cdf is F (x) 1 e x l 2 1 1 2 x (1.10) so the expected umber of observatios lyig betwee x 1 ad x 2 is. Suppose we divide ito four bis, ad obtai the 2 x 1 2 x 2 followig observed ad expected histogram. x < 1 1 x < 2 2 x < 3 3 x obs 40 21 14 5 exp 40 20 10 10 The chi-squared statistic will be χ 2 (40 40)2 40 (21 20)2 20 0 40 1 20 16 10 25 10 (14 10)2 10 1 32 50 20 (5 10)2 10 83 20 4.15 (1.11) We ca compare this to the percetiles of χ 2 (3) distributio. Sice χ 2.10,3 6.251, we see that the P -value is greater tha 10%, ad we would ot reject this model at ay reasoable cofidece level. Note that we could choose differet bis tha the oes we did; i practice it s good to choose the bis to have similar umbers of expected observatios. Practice Problems 14.1, 14.3 Thursday 18 April 2019 2 Chi-Squared Tests with Estimated Parameters I both of the examples from last time, we assumed that the expected umbers for each category were kow, but that required a assumptio i each case. Cosiderig the problem of twi distributio, we assumed that half of all childre were male, leadig to expected umbers of /4 /2 /4 If the fractio of male childre is θ, this becomes θ 2 2θ(1 θ) (1 θ) 2 we ve replaced the fractio expected i category k, which was previously a fixed umber p i, with a fuctio π i (θ). Ad i priciple, we ca imagie that there would be ot just oe parameter θ, but a set of parameters θ {θ 1, θ 2,..., θ m }, where 3

m < k 1. The chi-square statistic would the be χ 2 (θ) k [ i π i (θ)] 2 i1 π i (θ) (2.1) To actually evaluate this for a test, we eed to put i a estimate ˆθ for the parameters. Oe choice would be to fid the oe which miimizes χ 2 (θ) for the observed { i }, but istead, we will use the maximum likelihood estimate, i.e., the oe which maximizes f( 1,..., k ; θ) [π 1 (θ)] 1 [π 2 (θ)] 2 [π k (θ)] k (2.2) or equivaletly l f( 1,..., k ; θ) k! i l π i (θ) l 1! k! i1 (2.3) where the 1,..., k we use are the actual observed values. For example, i the twi example, the log-likelihood is l f( 1,..., k ; θ) 2 1 l θ 2 [l θ l(1 θ))] 2 3 l(1 θ) cost (2 1 2 ) l θ ( 2 2 3 ) l(1 θ) cost (2.4) where the costat depeds o 1, 2 ad 3, but ot θ. Differetiatig with respect to θ gives the maximum likelihood equatio which has the solutio 2 1 2 ˆθ ˆθ 2 1 2 2 2 2 3 1 ˆθ 16 23 100 0 (2.5).39 (2.6) We ca plug that ito the formulas for the expected umbers of twis i each category (for 50) ad get obs 8 23 19 exp θ 2 2θ(1 θ) (1 θ) 2 exp for θ.39 7.605 23.790 18.605 The chi-squared is the χ 2 (8 7.605)2 7.605 (23 23.790)2 23.790 (19 18.605)2 18.605.0551 (2.7) However, sice we ve used the data to fit oe parameter, the percetiles we should compare it to are ot χ 2 (2) but χ 2 (2 1) χ 2 (1). I geeral, if we have k categories ad m fuctioally idepedet parameters (meaig o oe parameter ca be completely determied from the other m 1), the resultig statistic will be approximately χ 2 (k 1 m). 2.1 Chi-Squared for a Parametrized Distributio We ca also retur to the applicatio where we apply the chisquared test to a bied histogram to check a hypothesized distributio. Last time we gave a example where we observed the followig umber couts for a data sample of size 80: x < 1 1 x < 2 2 x < 3 3 x 40 21 14 5 Last time we hypothesized a expoetial distributio with λ l 2, but suppose we allowed λ to be a ukow parameter. The the cdf of the distributio would be F (x) 1 e λx (2.8) ad the expected umber couts i the bis would be 4

x < 1 1 x < 2 2 x < 3 3 x (1 e λ ) (e λ e 2λ ) (e 2λ e 3λ ) e 3λ If we wated to follow the maximum likelihood prescriptio from the previous sectio, we would eed to fid the λ value which maximizes (1 e λ ) 1 (e λ e 2λ ) 2 (e 2λ e 3λ ) 3 (e 3λ ) 4 (2.9) This seems like a pretty complicated fuctio, although it turs out that it is maximized by settig λ to ˆλ l 1 2 2 3 3 3 4 2 2 3 3 4 (2.10) ad the resultig statistic is approximately χ 2 (3)-distributed. But i geeral it may be difficult ad/or require a umerical method to fid the parameters ˆθ which maximize k i l[f (b i ; θ) F (a i ; θ)] (2.11) i1 where the ith histogram bi covers a i < x b i, which is what we d eed to get a statistic which was χ 2 distributed with k 1 m degrees of freedom. Istead, it is ofte coveiet to use some other poit estimate costructed ot just from the bied histogram, but from the full sample of data values. For istace, for the expoetial model we could use λ 1/x. If we do this, though the χ 2 value will geerally ot be as low as if we had really miimized over θ. This meas that the threshold c α for a test with false alarm probability α will be betwee the value χ 2 α,k 1 we d use if we had fixed the parameters arbitrarily ad the (lower) value χ 2 α,k 1 m we d use if we had miimized the χ 2 over the m parameters, i.e., χ 2 α,k 1 m c α χ 2 α,k 1 (2.12) Practice Problems 14.15, 14.17 Tuesday 23 April 2019 3 Two-Way Cotigecy Tables We ow cosider aother categorical data aalysis problem, i which we have two sets of categories, ad each observatio falls ito oe category from the first set ad oe from the secod set. We d like to kow if the data idicate ay statistical relatioship betwee which of the first set of categories a observatio falls ito ad which of the secod. For istace, suppose we survey studets majorig i four disciplies about their food choices: Vega Vegetaria No-Veg Total Math & Stat 9 23 49 81 Physics 6 15 30 51 Chemistry 12 28 62 102 Biology 25 50 98 173 Total 52 116 239 407 We d like to kow if the data idicate ay sigificat tedecies for studets i oe major to have oe diet or aother. There are two ways to pose the questio: 1. Is there ay differece i the tedecies of studets i oe major or aother to have a vega, vegetaria, or ovegetaria diet? (homogeeity) 2. Is there ay correlatio betwee the major chose by a studet ad their dietary choices? (idepedece) The two questios have differet iterpretatios i the framework of classical statistics, but they tur out to lead to idetical data aalysis procedures. 5

As a matter of otatio, we ll cosider the table to have I rows ad J colums, labelled by i 1, 2,..., I ad j 1, 2,..., J, respectively. (I the table above, I is 4 ad J is 3.) We ll call the observed umber i each cell (i, j). (Devore calls this N ij.) We ll write the total of the umbers i row i as (i, ) J j1 (i, j) (Devore: i ) ad i colum j as (, j) I i1 (i, j) (Devore: j). The total umber of observatios is I (i, ) i1 I i1 J (i, j) j1 J (, j) (3.1) j1 3.1 Perspective 1: Test for homogeeity I the first way of lookig at thigs, the umber (i, ) i each row is a give, ad for each i we cosider {N(i, j) j 1,... J} to be a multiomial radom vector with probabilities p(j i). (Devore calls this p ij.) Note that this is ot a coditioal probability accordig to the set theory defiitio of Devore Chapter Two, but it s the probability for a observatio which is i row i to fall ito colum j. These probabilities obey J j1 p(j i) 1 for each i. The most importat property of the multiomial for us is that E (N(i, j)) (i, )p(j i) (3.2) The ull hypothesis of homogeeity is that each of these I sets of J probabilities is the same: H 0 : p(j i) p(, j) for all I (3.3) We ca estimate this commo probability as ˆp(, j) (,j) so that the estimated expected umber of items i each cell is ê(i, j) (i, )ˆp(, j) (i, ) (, j) (3.4) We the use these estimated expected umbers to make a chisquared statistic for the whole table s divergece from homogeeity: I J χ 2 [(i, j) ê(i, j)] 2 (3.5) ê(i, j) i1 j1 We have I multiomial radom variables with J categories each, which meas we ve observed I(J 1) idepedet umbers. We ve estimated J probabilities, but oly J 1 of them were idepedet because they had to add to 1. Thus the umber of degrees of freedom for the chi-squared should be I(J 1) (J 1) (I 1)(J 1) (3.6) Note that although the formalism treats the rows ad colums rather differetly, the fial data aalysis prescriptio treats them symmetrically. 3.2 Perspective 2: Test for idepedece I the secod way of lookig at thigs, the oly thig we treat as give is the total umber of observatios, ad the observatio {N(i, j) i 1,... I; j 1,... J} is a (IJ-dimesioal) multiomial radom vector with probabilities p(i, j). (Devore also calls this p ij, which is oe reaso I wated to make the otatio more descriptive.) These probabilities obey I J i1 j1 p(i, j) 1. Now the multiomial says E (N(i, j)) p(i, j) (3.7) The ull hypothesis of idepedece is that the probabilities ca be decomposed assumig idepedece of the two sets of categories: H 0 : p(i, j) p(i, ) p(, j) (3.8) 6

We ca estimate the probabilties as ˆp(i, ) (i, ) ad ˆp(, j) (,j) so that the estimated expected umber of items i each cell is ê(i, j) ˆp(i, )ˆp(, j) (i, ) (, j) 2 (i, ) (, j) (3.9) which is exactly what we saw above 1 We the make the chisquared as before: χ 2 I i1 J [(i, j) ê(i, j)] 2 j1 ê(i, j) (3.10) We have a sigle multiomial with IJ categories ow, so there are IJ 1 idepedet observatios. We ve estimated I probabilities {ˆp(i, ), I 1 of which are idepedet, ad J probabilities {ˆp(, j), J 1 of which are idepedet, so we have Vega Vegetaria No-Veg Total Math & Stat 10.35 23.09 47.57 81 Physics 6.52 14.54 29.95 51 Chemistry 13.03 29.07 59.90 102 Biology 22.10 49.31 101.59 173 Total 52 116 239 407 Note that it s really easy to do this i a simple spreadsheet program. If I costruct the chi-squared statistic I get 0.99. I eed to compare this to a chi-squared distributio with (4 1)(3 1) (3)(2) 6 degrees of freedom, so 0.99 is rather low ideed ad the data show o sigificat correlatio. Practice Problems 14.27, 14.34, 14.35, 14.41 IJ 1 (I 1) (J 1) IJ I J 1 (I 1)(J 1) (3.11) which, agai, is the same umber of degrees of freedom as the homogeeity test told us. 3.2.1 Example We retur to the example above. The estimates accordig to the model are (81)(52) 10.35, (81)(116) 23.09, etc.: 407 407 1 Note that viewed as a statistic, the estimate from idepedece is N(i, ) N(,j) ê(i, j) while that from homogeeity is ê(i, j) but this distictio has o effect o the data aalysis prescriptio. (i, ) N(,j), 7