Goodness of Fit Tests: Homogeneity

Mathematics 47: Lecture 35, Dan Sloughter, Furman University, May 11, 2006

Testing for homogeneity

Suppose we have $c$ random samples from discrete distributions, each having the same $r$ possible outcomes. Let $p_{ij}$ be the probability of outcome $i$ for the $j$th distribution, where $i = 1, 2, \ldots, r$ and $j = 1, 2, \ldots, c$, and let $p_j = (p_{1j}, p_{2j}, \ldots, p_{rj})$ for $j = 1, 2, \ldots, c$. We want to test

$$H_0 : p_1 = p_2 = \cdots = p_c$$
$$H_A : p_j \neq p_k \text{ for some } j \neq k.$$

Let

$$\begin{aligned}
n_{ij} &= \text{number of observations of outcome } i \text{ in sample } j, \\
n_{i+} &= n_{i1} + n_{i2} + \cdots + n_{ic} = \text{number of observations of outcome } i, \\
n_{+j} &= n_{1j} + n_{2j} + \cdots + n_{rj} = \text{size of sample } j, \\
n &= n_{1+} + n_{2+} + \cdots + n_{r+} = n_{+1} + n_{+2} + \cdots + n_{+c} = \text{total number of observations}.
\end{aligned}$$

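We may summarize this information in a contingency table, with one row for each outcome and one column for each sample:

$$\begin{array}{c|cccc|c}
 & 1 & 2 & \cdots & c & \text{Total} \\ \hline
1 & n_{11} & n_{12} & \cdots & n_{1c} & n_{1+} \\
2 & n_{21} & n_{22} & \cdots & n_{2c} & n_{2+} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
r & n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r+} \\ \hline
\text{Total} & n_{+1} & n_{+2} & \cdots & n_{+c} & n
\end{array}$$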
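As a small illustration of this notation, the sketch below computes the margins $n_{i+}$, $n_{+j}$, and $n$ for a table stored as a list of rows. The counts and the function name `margins` are made up for illustration and do not come from the lecture.

```python
def margins(table):
    """Return (row_totals, col_totals, n) for an r x c table of counts.

    table[i][j] plays the role of n_ij: the count of outcome i in sample j.
    """
    row_totals = [sum(row) for row in table]        # n_{i+}
    col_totals = [sum(col) for col in zip(*table)]  # n_{+j}
    n = sum(row_totals)                             # total number of observations
    return row_totals, col_totals, n


# Hypothetical table with r = 2 outcomes and c = 3 samples.
counts = [[4, 6, 10],
          [1, 9, 5]]
row_totals, col_totals, n = margins(counts)
print("row totals:", row_totals)
print("column totals:", col_totals)
print("n:", n)
```

Summing the row totals and summing the column totals must give the same $n$, which is a convenient sanity check on any table entered by hand.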

Under $H_0$, the maximum likelihood estimator of the probability of outcome $i$ is $n_{i+}/n$, and so the expected number of observations of outcome $i$ in sample $j$ is

$$e_{ij} = n_{+j} \, \frac{n_{i+}}{n} = \frac{n_{i+} n_{+j}}{n}.$$

We may now evaluate either

$$-2\log(\lambda) = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \log\left(\frac{n_{ij}}{e_{ij}}\right)$$

or

$$Q = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}.$$
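The two statistics can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names and the sample table are hypothetical, and cells with $n_{ij} = 0$ are taken to contribute zero to $-2\log(\lambda)$.

```python
import math


def expected_counts(table):
    """e_ij = n_{i+} * n_{+j} / n for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[ri * cj / n for cj in col_totals] for ri in row_totals]


def likelihood_ratio_stat(table):
    """-2 log(lambda) = 2 * sum of n_ij * log(n_ij / e_ij)."""
    expected = expected_counts(table)
    return 2.0 * sum(
        n_ij * math.log(n_ij / e_ij)
        for row, e_row in zip(table, expected)
        for n_ij, e_ij in zip(row, e_row)
        if n_ij > 0  # an empty cell contributes 0 to the sum
    )


def pearson_stat(table):
    """Q = sum of (n_ij - e_ij)^2 / e_ij."""
    expected = expected_counts(table)
    return sum(
        (n_ij - e_ij) ** 2 / e_ij
        for row, e_row in zip(table, expected)
        for n_ij, e_ij in zip(row, e_row)
    )


# Hypothetical 3 x 2 table of counts (not data from the lecture).
counts = [[12, 8],
          [5, 15],
          [3, 7]]
print(expected_counts(counts))
print(likelihood_ratio_stat(counts), pearson_stat(counts))
```

For tables with reasonably large expected counts the two statistics are usually close, as in the examples later in the lecture.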

Note: We initially have $c(r - 1)$ degrees of freedom (adding together $r - 1$ degrees of freedom for each of the $c$ samples) and have estimated $r - 1$ parameters. Hence we have

$$c(r - 1) - (r - 1) = (r - 1)(c - 1)$$

degrees of freedom. That is, under $H_0$, both $-2\log(\lambda)$ and $Q$ are approximately $\chi^2((r - 1)(c - 1))$.
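To turn an observed statistic into a p-value we need the upper tail of a chi-square distribution. The sketch below uses only the Python standard library; the names `chi2_sf` and `homogeneity_df` are hypothetical. It relies on the standard recurrence $P(U_{k+2} \geq x) = P(U_k \geq x) + (x/2)^{k/2} e^{-x/2} / \Gamma(k/2 + 1)$, where $U_k$ is $\chi^2(k)$, and is intended for integer degrees of freedom of modest size.

```python
import math


def chi2_sf(x, df):
    """P(U >= x) for U ~ chi-square(df), with df a positive integer and x >= 0."""
    if df % 2 == 0:
        tail, k = math.exp(-x / 2.0), 2                 # closed form for df = 2
    else:
        tail, k = math.erfc(math.sqrt(x / 2.0)), 1      # closed form for df = 1
    while k < df:
        # Step from k to k + 2 degrees of freedom.
        tail += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail


def homogeneity_df(r, c):
    """Degrees of freedom for r outcomes and c samples: (r - 1)(c - 1)."""
    return (r - 1) * (c - 1)


# Six outcomes and two samples give five degrees of freedom.
df = homogeneity_df(6, 2)
print(df, chi2_sf(11.0705, df))  # 11.0705 is roughly the 95th percentile of chi-square(5)
```

A library routine such as a chi-square survival function from a statistics package would normally be preferable; the hand-rolled version is used here only to keep the example self-contained.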

Example

When Jane Austen died in 1817, she left the novel Sanditon unfinished, but with a summary of the rest. An admirer completed the novel, and it was then published. In 1978, A. Q. Morton published some statistical studies comparing the writings of Austen and the person who completed Sanditon. Morton counted the occurrences of "a", "an", "this", "that", "with", and "without" in chapters 1 and 3 of Sense and Sensibility; chapters 1, 2, and 3 of Emma; and chapters 1 and 6 of Sanditon (written by Austen), and also the occurrences of these words in chapters 12 and 24 of Sanditon (not written by Austen).

The results are tabulated as counts of each word (rows a, an, this, that, with, without, and Total) for each author (columns Austen, Imitator, and Total). The column totals are 1017 for Austen and 196 for the imitator, for a total of 1213 observations.

The expected frequencies are

$$e_{11} = \frac{(1017)(517)}{1213} = 433.46, \quad e_{12} = \frac{(196)(517)}{1213} = 83.54,$$

$$e_{21} = \frac{(1017)(91)}{1213} = 76.30, \quad e_{22} = \frac{(196)(91)}{1213} = 14.70,$$

and so on, giving a table of expected frequencies with the same rows (a, an, this, that, with, without) and columns (Austen, Imitator) as the table of observed counts.
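The four expected frequencies quoted here follow directly from the margins, and the arithmetic can be checked with a short sketch. Only the totals that appear in the text are used; the variable names and the helper `expected` are hypothetical.

```python
# Margins quoted in the example.
n_austen, n_imitator, n_total = 1017, 196, 1213
row_a, row_an = 517, 91  # total occurrences of "a" and of "an"


def expected(row_total, col_total, n):
    """Expected count under H0: (row total)(column total) / n."""
    return row_total * col_total / n


e11 = expected(row_a, n_austen, n_total)     # "a", Austen
e12 = expected(row_a, n_imitator, n_total)   # "a", imitator
e21 = expected(row_an, n_austen, n_total)    # "an", Austen
e22 = expected(row_an, n_imitator, n_total)  # "an", imitator

print(f"e11 = {e11:.2f}, e12 = {e12:.2f}, e21 = {e21:.2f}, e22 = {e22:.2f}")
```

Within each row the expected counts add up to the row total, so $e_{11} + e_{12} = 517$ and $e_{21} + e_{22} = 91$.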

Evaluating our test statistics, we find either $-2\log(\lambda) = 31.75$ or $q = 32.83$. There are $(6 - 1)(2 - 1) = 5$ degrees of freedom, so if $U$ is $\chi^2(5)$, the p-value is either $P(U \geq 31.75)$ or $P(U \geq 32.83)$, both of which are below $10^{-5}$. Hence we may conclude that the imitator has not been successful in imitating this aspect of Austen's style.
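The whole calculation can be sketched as follows. The individual cell counts are not given in the text above, so the counts in the code are an assumption: they are the figures commonly quoted for Morton's study. They do reproduce the column totals 1017 and 196 and the row totals 517 and 91 exactly, and they give statistics within a few hundredths of the values 31.75 and 32.83 quoted in this example. The function name `chi2_5_sf` is hypothetical.

```python
import math

# ASSUMPTION: cell counts as commonly quoted for Morton's study; they are not
# stated in the text and are used here only for illustration.
words = ["a", "an", "this", "that", "with", "without"]
austen = [434, 62, 86, 236, 161, 38]
imitator = [83, 29, 15, 22, 43, 4]

n_austen, n_imitator = sum(austen), sum(imitator)
n = n_austen + n_imitator

g = 0.0  # -2 log(lambda)
q = 0.0  # Pearson statistic
for x, y in zip(austen, imitator):
    row_total = x + y
    for observed, col_total in ((x, n_austen), (y, n_imitator)):
        e = row_total * col_total / n  # expected count under H0
        g += 2.0 * observed * math.log(observed / e)
        q += (observed - e) ** 2 / e


def chi2_5_sf(x):
    """P(U >= x) for U ~ chi-square(5), using the closed form for 5 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * (1.0 + x / 3.0))


print(f"-2 log(lambda) = {g:.2f}, p-value = {chi2_5_sf(g):.2e}")
print(f"q = {q:.2f}, p-value = {chi2_5_sf(q):.2e}")
```

Most of the discrepancy comes from "an", "that", and "with": under these counts the imitator uses "an" and "with" more often, and "that" less often, than homogeneity would predict.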

Example (Doll and Hill Cancer Study)

In a study of patients in London hospitals in 1948 and 1949, Doll and Hill categorized each of 709 lung cancer patients and 709 control patients (that is, patients who did not have lung cancer) as either a smoker or a non-smoker. The results of the study form a table with rows Non-smoker and Smoker and columns Cancer and Control, with a column total of 709 for each group.

The data raise the following question: Are the 38 additional non-smokers in the control group due to randomness, or to a higher rate of smoking among people with lung cancer than among those without lung cancer?

The expected frequencies are computed as before; because each group has 709 patients, the expected count in each cell is half of its row total. Both $-2\log(\lambda)$ and $q$ then give very small p-values. Hence we have very strong evidence for rejecting the hypothesis that the rate of smoking is the same in the two groups.
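A sketch of this calculation follows. The cell counts are not given in the text above; the values of 21 non-smokers among the cancer patients and 59 among the controls are the figures usually cited for this study and should be treated as an assumption. They are consistent with 709 patients per group and with the 38 additional non-smokers in the control group. With one degree of freedom the chi-square tail probability reduces to $P(U \geq x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

# ASSUMPTION: cell counts as usually cited for the Doll and Hill study;
# they are not stated in the text.
table = [[21, 59],    # non-smokers: cancer, control
         [688, 650]]  # smokers:     cancer, control

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

g = 0.0  # -2 log(lambda)
q = 0.0  # Pearson statistic
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # equals row_totals[i] / 2 here
        g += 2.0 * observed * math.log(observed / e)
        q += (observed - e) ** 2 / e

# (2 - 1)(2 - 1) = 1 degree of freedom.
p_g = math.erfc(math.sqrt(g / 2.0))
p_q = math.erfc(math.sqrt(q / 2.0))

print(f"-2 log(lambda) = {g:.3f}, p-value = {p_g:.2e}")
print(f"q = {q:.3f}, p-value = {p_q:.2e}")
```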

Note: We could also perform this test as a two-sample test for the equality of the probability of success in two independent Bernoulli populations. That is, let $p_X$ be the proportion of non-smokers in the cancer population and let $p_Y$ be the proportion of non-smokers in the control population. We want to test

$$H_0 : p_X = p_Y$$
$$H_A : p_X \neq p_Y.$$

Now, writing $n_{11}$ and $n_{12}$ for the numbers of non-smokers in the cancer and control samples,

$$\hat{p}_X = \frac{n_{11}}{709}, \quad \hat{p}_Y = \frac{n_{12}}{709}, \quad \hat{p} = \frac{n_{11} + n_{12}}{1418}.$$

Hence

$$z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{709} + \frac{1}{709}\right)}}.$$

This yields the same p-value as for $q$ above. Indeed, $z^2 = q$.
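The identity $z^2 = q$ can be checked numerically. The sketch below again assumes 21 non-smokers among the 709 cancer patients and 59 among the 709 controls; these counts are not stated in the text and are used only for illustration. The identity itself holds for any $2 \times 2$ table with positive expected counts.

```python
import math

# ASSUMPTION: non-smoker counts as usually cited for the Doll and Hill study.
x, m = 21, 709  # non-smokers and sample size, cancer group
y, k = 59, 709  # non-smokers and sample size, control group

p_x, p_y = x / m, y / k
p_pool = (x + y) / (m + k)  # pooled estimate of the common proportion under H0

z = (p_x - p_y) / math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / m + 1.0 / k))
p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value

# Pearson statistic for the same 2 x 2 table, built from the pooled proportion.
q = 0.0
for observed_row, p in (((x, y), p_pool), ((m - x, k - y), 1.0 - p_pool)):
    for observed, size in zip(observed_row, (m, k)):
        e = size * p  # expected count under H0
        q += (observed - e) ** 2 / e

print(f"z = {z:.4f}, z^2 = {z * z:.4f}, q = {q:.4f}, p-value = {p_value:.2e}")
```

Because the square of a standard normal variable is $\chi^2(1)$, the two-sided p-value for $z$ equals $P(U \geq q)$ for $U$ distributed as $\chi^2(1)$, which is why the two tests agree.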
