Goodness-of-fit for composite hypotheses.

Section 11. Goodness-of-fit for composite hypotheses.

Example. Let us consider a Matlab example. Let us generate 50 observations from the normal distribution $N(1, 2^2)$, i.e. with mean 1 and standard deviation 2:

    X = normrnd(1,2,50,1);

Then, running the chi-squared goodness-of-fit test chi2gof,

    [H,P,STATS] = chi2gof(X)

outputs

    H = 0, P = 0.8793,
    STATS =
        chi2stat: 0.6742
        df: 3
        edges: [-3.7292 -0.9249 0.0099 0.9447 1.8795 2.8142 5.6186]
        O: [8 7 8 8 9 10]
        E: [8.7743 7.0639 8.7464 8.8284 7.2645 9.3226]

The test accepts the hypothesis that the data is normal. Notice, however, that something is different. Matlab grouped the data into 6 intervals, so the chi-squared test from the previous lecture should have $r - 1 = 6 - 1 = 5$ degrees of freedom, but we have df: 3! The difference is that now our hypothesis is not that the data comes from a particular given distribution but that the data comes from a family of distributions, which is called a composite hypothesis. Running

    [H,P,STATS] = chi2gof(X,'cdf',@(z)normcdf(z,mean(X),std(X,1)))

would test a simple hypothesis that the data comes from a particular normal distribution $N(\hat\mu, \hat\sigma^2)$, and the output

    H = 0, P = 0.9838
    STATS =
        chi2stat: 0.6842

        df: 5
        edges: [-3.7292 -0.9249 0.0099 0.9447 1.8795 2.8142 5.6186]
        O: [8 7 8 8 9 10]
        E: [8.6525 7.0995 8.8282 8.9127 7.3053 9.2017]

has df: 5. However, we cannot use this test because we estimate the parameters $\hat\mu$ and $\hat\sigma^2$ of this distribution using the data, so this is not a particular given distribution; in fact, this is the distribution that fits the data the best, so the statistic $T$ in Pearson's theorem will behave differently.

Let us start with a discrete case when a random variable takes a finite number of values $B_1, \ldots, B_r$ with probabilities

$$p_1 = P(X = B_1), \ldots, p_r = P(X = B_r).$$

We would like to test a hypothesis that this distribution comes from a family of distributions $\{P_\theta : \theta \in \Theta\}$. In other words, if we denote

$$p_j(\theta) = P_\theta(X = B_j),$$

we want to test

    $H_0$: $p_j = p_j(\theta)$ for all $j \le r$ for some $\theta \in \Theta$
    $H_1$: otherwise.

If we wanted to test $H_0$ for one particular fixed $\theta$ we could use the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta))^2}{n p_j(\theta)},$$

where $\nu_j$ is the number of observations equal to $B_j$, and use a simple chi-squared goodness-of-fit test. The situation now is more complicated because we want to test if $p_j = p_j(\theta)$, $j \le r$, at least for some $\theta \in \Theta$, which means that we have many candidates for $\theta$. One way to approach this problem is as follows.

(Step 1) Assuming that hypothesis $H_0$ holds, i.e. $P = P_\theta$ for some $\theta \in \Theta$, we can find an estimate $\theta^*$ of this unknown $\theta$, and then

(Step 2) try to test if, indeed, the distribution $P$ is equal to $P_{\theta^*}$ by using the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta^*))^2}{n p_j(\theta^*)}$$

in the chi-squared goodness-of-fit test.

This approach looks natural; the only question is what estimate $\theta^*$ to use and how the fact that $\theta^*$ also depends on the data will affect the convergence of $T$. It turns out that if we let $\theta^*$ be the maximum likelihood estimate, i.e. the $\theta$ that maximizes the likelihood function

$$\varphi(\theta) = p_1(\theta)^{\nu_1} \cdots p_r(\theta)^{\nu_r},$$

then the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta^*))^2}{n p_j(\theta^*)} \xrightarrow{d} \chi^2_{r-s-1} \qquad (11.0.1)$$

converges to the $\chi^2_{r-s-1}$ distribution with $r - s - 1$ degrees of freedom, where $s$ is the dimension of the parameter set $\Theta$. Of course, here we assume that $s \le r - 2$ so that we have at least one degree of freedom. Very informally, by dimension we understand the number of free parameters that describe the set

$$\{(p_1(\theta), \ldots, p_r(\theta)) : \theta \in \Theta\}.$$

Then the decision rule will be

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

where the threshold $c$ is determined from the condition

$$P(\delta \ne H_0 \mid H_0) = P(T > c \mid H_0) \approx \chi^2_{r-s-1}(c, +\infty) = \alpha,$$

where $\alpha \in [0, 1]$ is the level of significance.

Example 1. Suppose that a gene has two possible alleles $A_1$ and $A_2$ and the combinations of these alleles define three genotypes $A_1A_1$, $A_1A_2$ and $A_2A_2$. We want to test a theory that

    Probability to pass $A_1$ to a child $= \theta$
    Probability to pass $A_2$ to a child $= 1 - \theta$

and that the probabilities of genotypes are given by

$$p_1(\theta) = P(A_1A_1) = \theta^2, \quad p_2(\theta) = P(A_1A_2) = 2\theta(1-\theta), \quad p_3(\theta) = P(A_2A_2) = (1-\theta)^2. \qquad (11.0.2)$$

Suppose that, given a random sample $X_1, \ldots, X_n$ from the population, the counts of each genotype are $\nu_1$, $\nu_2$ and $\nu_3$. To test the theory we want to test the hypothesis

    $H_0$: $p_1 = p_1(\theta)$, $p_2 = p_2(\theta)$, $p_3 = p_3(\theta)$ for some $\theta \in [0, 1]$
    $H_1$: otherwise.

First of all, the dimension of the parameter set is $s = 1$ since the distributions are determined by one parameter $\theta$. To find the MLE $\theta^*$ we have to maximize the likelihood function

$$p_1(\theta)^{\nu_1} p_2(\theta)^{\nu_2} p_3(\theta)^{\nu_3}$$

or, equivalently, maximize the log-likelihood

$$\log p_1(\theta)^{\nu_1} p_2(\theta)^{\nu_2} p_3(\theta)^{\nu_3} = \nu_1 \log p_1(\theta) + \nu_2 \log p_2(\theta) + \nu_3 \log p_3(\theta) = \nu_1 \log \theta^2 + \nu_2 \log 2\theta(1-\theta) + \nu_3 \log (1-\theta)^2.$$
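The derivative needed in the next step can be written out explicitly. The following worked step is added here for convenience; it uses only (11.0.2) and writes $\ell(\theta)$ (a name introduced for this illustration) for the log-likelihood above:

```latex
\ell(\theta) = (2\nu_1 + \nu_2)\log\theta + (\nu_2 + 2\nu_3)\log(1-\theta) + \nu_2 \log 2,
\qquad
\ell'(\theta) = \frac{2\nu_1 + \nu_2}{\theta} - \frac{\nu_2 + 2\nu_3}{1-\theta}.
```

Since $\nu_1 + \nu_2 + \nu_3 = n$, the two numerators add up to $2n$.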

If we compute the critical point by setting the derivative equal to 0, we get

$$\theta^* = \frac{2\nu_1 + \nu_2}{2n}.$$

Therefore, under the null hypothesis $H_0$ the statistic

$$T = \frac{(\nu_1 - n p_1(\theta^*))^2}{n p_1(\theta^*)} + \frac{(\nu_2 - n p_2(\theta^*))^2}{n p_2(\theta^*)} + \frac{(\nu_3 - n p_3(\theta^*))^2}{n p_3(\theta^*)} \xrightarrow{d} \chi^2_{r-s-1} = \chi^2_{3-1-1} = \chi^2_1$$

converges to the $\chi^2_1$ distribution with one degree of freedom. Therefore, in the decision rule

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

the threshold $c$ is determined by the condition

$$P(\delta \ne H_0 \mid H_0) \approx \chi^2_1(T > c) = \alpha.$$

For example, if $\alpha = 0.05$ then $c = 3.841$.

Example 2. A blood type O, A, B, AB is determined by a combination of two alleles out of A, B, O, and allele O is dominated by A and B. Suppose that $p$, $q$ and $r = 1 - p - q$ are the population frequencies of alleles A, B and O correspondingly. If alleles are passed randomly from the parents then the probabilities of blood types will be

    Blood type   Allele combinations   Probabilities   Counts
    O            OO                    $r^2$           $\nu_1 = 121$
    A            AA, AO                $p^2 + 2pr$     $\nu_2 = 120$
    B            BB, BO                $q^2 + 2qr$     $\nu_3 = 79$
    AB           AB                    $2pq$           $\nu_4 = 33$

We would like to test this theory based on the counts of each blood type in a random sample of 353 people. We have four groups and two free parameters $p$ and $q$, so the chi-squared statistic $T$ under the null hypothesis will have a $\chi^2_{4-2-1} = \chi^2_1$ distribution with one degree of freedom. First, we have to find the MLE of the parameters $p$ and $q$. The log-likelihood is

$$\nu_1 \log r^2 + \nu_2 \log(p^2 + 2pr) + \nu_3 \log(q^2 + 2qr) + \nu_4 \log(2pq)$$
$$= 2\nu_1 \log(1 - p - q) + \nu_2 \log(2p - p^2 - 2pq) + \nu_3 \log(2q - q^2 - 2pq) + \nu_4 \log(2pq).$$

Unfortunately, if we set the derivatives with respect to $p$ and $q$ equal to zero, we get a system of two equations that is hard to solve explicitly. So instead we can maximize the log-likelihood numerically (equivalently, minimize its negative) to get the MLE $\hat p = 0.247$ and $\hat q = 0.173$. Plugging these into the formulas for the blood type probabilities, we get the estimated probabilities and estimated counts in each group:

                   O          A          B          AB
    $\hat p_i$     0.3364     0.3475     0.2306     0.0855
    $n\hat p_i$    118.7492   122.6777   81.4050    30.1681
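The numerical maximization in Example 2 can be sketched in Matlab as follows. This is a minimal illustration rather than the computation actually used for the notes: the choice of fminsearch, the softmax reparametrization that keeps $p$, $q$, $r$ positive and summing to one, and all variable names are assumptions made here. The result should land close to the rounded values quoted above.

```matlab
% Observed counts for blood types O, A, B, AB (from the table above).
nu = [121 120 79 33];
n  = sum(nu);                               % 353 people

% Assumption of this sketch: map unconstrained t = [t1 t2] to allele
% frequencies [p q r] via a softmax, so p, q, r > 0 and p + q + r = 1.
alleles = @(t) exp([t 0]) / sum(exp([t 0]));

% Blood type probabilities [O A B AB] for a = [p q r].
cellprob = @(a) [a(3)^2, ...
                 a(1)^2 + 2*a(1)*a(3), ...
                 a(2)^2 + 2*a(2)*a(3), ...
                 2*a(1)*a(2)];

% Negative log-likelihood; minimizing it maximizes the likelihood.
negloglik = @(t) -sum(nu .* log(cellprob(alleles(t))));

% Derivative-free search starting from p = q = r = 1/3.
tHat = fminsearch(negloglik, [0 0]);
a    = alleles(tHat);

pHat  = cellprob(a);                        % estimated probabilities
nPHat = n * pHat;                           % estimated counts
fprintf('p = %.4f, q = %.4f\n', a(1), a(2));
disp([pHat; nPHat]);
```

The reparametrization avoids taking logarithms of negative numbers if the search wanders; a constrained optimizer working directly on $(p, q)$ would be an equally reasonable choice.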

We can now compute the chi-squared statistic $T \approx 0.44$ and the p-value $\chi^2_1(T, \infty) = 0.5071$. The data agrees very well with the above theory.

We could also use a similar test when the distributions $P_\theta$, $\theta \in \Theta$, are not necessarily supported by a finite number of points $B_1, \ldots, B_r$, for example, continuous distributions. In this case, if we want to test the hypothesis

    $H_0$: $P = P_\theta$ for some $\theta \in \Theta$

we can group the data into intervals $I_1, \ldots, I_r$ and test the hypothesis

    $H_0$: $p_j = p_j(\theta) = P_\theta(X \in I_j)$ for all $j \le r$ for some $\theta$.

For example, if we discretize the normal distribution by grouping the data into intervals $I_1, \ldots, I_r$ then the hypothesis will be

    $H_0$: $p_j = N(\mu, \sigma^2)(I_j)$ for all $j \le r$ for some $(\mu, \sigma^2)$.

There are two free parameters $\mu$ and $\sigma^2$ that describe all these probabilities, so in this case $s = 2$. The Matlab function chi2gof tests for normality by grouping the data and computing the statistic $T$ in (11.0.1); that is why it uses the $\chi^2_{r-s-1}$ distribution with $r - s - 1 = 6 - 2 - 1 = 3$ degrees of freedom and, thus, df: 3 in the example above.

Example. Let us test if the data normtemp from the normal body temperature dataset fits a normal distribution.

    [H,P,STATS] = chi2gof(normtemp)

gives

    H = 0, P = 0.0504
    STATS =
        chi2stat: 9.4682
        df: 4
        edges: [1x8 double]
        O: [13 12 29 27 35 10 4]
        E: [9.9068 16.9874 27.6222 31.1769 24.4270 13.2839 6.5958]

and we accept the null hypothesis at the default level of significance $\alpha = 0.05$ since the p-value $0.0504 > \alpha = 0.05$. We have $r = 7$ groups and, therefore, $r - s - 1 = 7 - 2 - 1 = 4$ degrees of freedom.

In the case when the distributions $P_\theta$ are continuous or, more generally, have an infinite number of values that must be grouped in order to use the chi-squared test (for example, the normal or Poisson distribution), it can be a difficult numerical problem to maximize the grouped likelihood function

$$P_\theta(I_1)^{\nu_1} \cdots P_\theta(I_r)^{\nu_r} \to \max_\theta.$$
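For the normal family the grouped likelihood can still be maximized numerically without much code. The sketch below reuses the bin edges and observed counts printed in the first chi2gof output; the use of fminsearch, the extension of the two outer intervals to $\pm\infty$, the log-parametrization of $\sigma$ and all variable names are assumptions of this illustration, and it is not meant to describe how chi2gof works internally.

```matlab
% Bin edges and observed counts copied from the first chi2gof output.
edges = [-3.7292 -0.9249 0.0099 0.9447 1.8795 2.8142 5.6186];
O     = [8 7 8 8 9 10];
n     = sum(O);

% Assumption: extend the two outer intervals to -Inf and +Inf so that
% the interval probabilities sum to one.
cuts = [-Inf, edges(2:end-1), Inf];

% P_theta(I_j) for theta = (mu, sigma); sigma = exp(t(2)) stays positive.
binprob = @(t) diff(normcdf(cuts, t(1), exp(t(2))));

% Negative grouped log-likelihood: -sum_j nu_j * log P_theta(I_j).
negloglik = @(t) -sum(O .* log(binprob(t)));

% Start from the parameters used to generate the data (mean 1, sd 2).
t0   = [1, log(2)];
tHat = fminsearch(negloglik, t0);

muStar    = tHat(1);
sigmaStar = exp(tHat(2));

E    = n * binprob(tHat);             % expected counts n * p_j(theta*)
T    = sum((O - E).^2 ./ E);          % statistic of the form (11.0.1)
df   = numel(O) - 2 - 1;              % r - s - 1 with s = 2
pval = 1 - chi2cdf(T, df);
fprintf('mu* = %.4f, sigma* = %.4f, T = %.4f, p = %.4f\n', ...
        muStar, sigmaStar, T, pval);
```

With only two parameters a derivative-free search is adequate; for families with more parameters or many bins, a gradient-based optimizer would likely be a better fit.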

It is tempting to use the usual non-grouped MLE $\hat\theta$ of $\theta$ instead of the above $\theta^*$ because it is often easier to compute; in fact, for many distributions we know explicit formulas for these MLEs. However, if we use $\hat\theta$ in the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)} \qquad (11.0.3)$$

then it will no longer converge to the $\chi^2_{r-s-1}$ distribution. A famous result in [1] proves that typically this $T$ will converge to a distribution in between $\chi^2_{r-s-1}$ and $\chi^2_{r-1}$. Intuitively this is easy to understand because $\theta^*$ specifically fits the grouped data $\nu_1, \ldots, \nu_r$, so the expected counts $n p_1(\theta^*), \ldots, n p_r(\theta^*)$ should be a better fit compared to the expected counts $n p_1(\hat\theta), \ldots, n p_r(\hat\theta)$. On the other hand, these last expected counts should be a better fit than simply using the true expected counts $n p_1(\theta_0), \ldots, n p_r(\theta_0)$, since the MLE $\hat\theta$ fits the data better than the true distribution. So typically we would expect

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta^*))^2}{n p_j(\theta^*)} \le \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)} \le \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta_0))^2}{n p_j(\theta_0)}.$$

But the left hand side converges to $\chi^2_{r-s-1}$ and the right hand side converges to $\chi^2_{r-1}$. Thus, if the decision rule is based on the statistic (11.0.3),

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

then the threshold $c$ can be determined conservatively from the tail of the $\chi^2_{r-1}$ distribution since

$$P(\delta \ne H_0 \mid H_0) = P(T > c) \le \chi^2_{r-1}(T > c) = \alpha.$$

References:

[1] Chernoff, Herman; Lehmann, E. L. (1954) The use of maximum likelihood estimates in $\chi^2$ tests for goodness of fit. Ann. Math. Statistics 25, pp. 579-586.
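To see how much the conservative threshold differs in practice, the two bounding distributions can be compared directly. The sketch below plugs in $r = 7$, $s = 2$ and the statistic 9.4682 from the body temperature example purely as an illustration; treating that value as if it had been computed with the non-grouped MLE is an assumption made here, as are the variable names.

```matlab
r = 7;  s = 2;  alpha = 0.05;         % groups, parameters, level
T = 9.4682;                           % chi2stat from the example above

% Thresholds from the two distributions that bracket the limit of (11.0.3).
cLower = chi2inv(1 - alpha, r - s - 1);   % chi^2_{r-s-1} quantile, about 9.49
cUpper = chi2inv(1 - alpha, r - 1);       % chi^2_{r-1} quantile, about 12.59

% Corresponding tail probabilities of the observed statistic.
pLower = 1 - chi2cdf(T, r - s - 1);
pUpper = 1 - chi2cdf(T, r - 1);

% Conservative decision: reject H_0 only if T exceeds the larger threshold.
rejectConservative = T > cUpper;
fprintf('c in [%.4f, %.4f], p-value in [%.4f, %.4f], reject = %d\n', ...
        cLower, cUpper, pLower, pUpper, rejectConservative);
```

Using the $\chi^2_{r-1}$ threshold keeps the level at most $\alpha$ asymptotically, at the price of some power when the true limit is closer to $\chi^2_{r-s-1}$.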