Goodness of Fit Tests: Homogeneity

Goodness of Fit Tests: Homogeneity. Mathematics 47: Lecture 35. Dan Sloughter. Furman University. May 11, 2006

Testing for homogeneity

Suppose we have $c$ random samples from discrete distributions, each having the same $r$ possible outcomes. Let $p_{ij}$ be the probability of outcome $i$ for the $j$th distribution, where $i = 1, 2, \ldots, r$ and $j = 1, 2, \ldots, c$. Let $p_j = (p_{1j}, p_{2j}, \ldots, p_{rj})$ for $j = 1, 2, \ldots, c$. We want to test

\[
\begin{aligned}
H_0 &: p_1 = p_2 = \cdots = p_c \\
H_A &: p_j \neq p_k \text{ for some } j \neq k.
\end{aligned}
\]

Let

\[
\begin{aligned}
n_{ij} &= \text{number of observations of outcome } i \text{ in sample } j, \\
n_{i+} &= n_{i1} + n_{i2} + \cdots + n_{ic} = \text{number of observations of outcome } i, \\
n_{+j} &= n_{1j} + n_{2j} + \cdots + n_{rj} = \text{size of sample } j, \\
n &= n_{1+} + n_{2+} + \cdots + n_{r+} = n_{+1} + n_{+2} + \cdots + n_{+c} = \text{total number of observations.}
\end{aligned}
\]

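Testing for homogeneity (cont'd)

We may summarize this information in a contingency table as follows (rows are outcomes, columns are samples).

\[
\begin{array}{c|cccc|c}
 & 1 & 2 & \cdots & c & \text{Total} \\ \hline
1 & n_{11} & n_{12} & \cdots & n_{1c} & n_{1+} \\
2 & n_{21} & n_{22} & \cdots & n_{2c} & n_{2+} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
r & n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r+} \\ \hline
\text{Total} & n_{+1} & n_{+2} & \cdots & n_{+c} & n
\end{array}
\]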

Testing for homogeneity (cont'd)

Under $H_0$, the maximum likelihood estimator of the probability of outcome $i$ is

\[ \frac{n_{i+}}{n}. \]

And so the expected number of observations of outcome $i$ in sample $j$ is

\[ e_{ij} = n_{+j} \, \frac{n_{i+}}{n} = \frac{n_{i+} n_{+j}}{n}. \]

We may now evaluate either

\[ -2\log(\lambda) = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \log\left(\frac{n_{ij}}{e_{ij}}\right) \]

or

\[ Q = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}. \]
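The two statistics above can be computed directly from a table of counts. What follows is a minimal sketch in Python, not part of the original lecture: the function name `homogeneity_statistics` and the small 2 x 2 table at the end are hypothetical, and the code assumes every row and column total is positive so that each $e_{ij}$ is nonzero.

```python
import math


def homogeneity_statistics(counts):
    """Return (expected, g, q) for an r x c table of observed counts.

    counts[i][j] is n_ij, the number of observations of outcome i in sample j.
    g is the likelihood ratio statistic -2 log(lambda); q is Pearson's statistic.
    Assumes all row and column totals are positive.
    """
    row_totals = [sum(row) for row in counts]         # n_{i+}
    col_totals = [sum(col) for col in zip(*counts)]   # n_{+j}
    n = sum(row_totals)

    # e_ij = n_{i+} * n_{+j} / n
    expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

    g = 0.0
    q = 0.0
    for obs_row, exp_row in zip(counts, expected):
        for n_ij, e_ij in zip(obs_row, exp_row):
            if n_ij > 0:  # a zero cell contributes nothing to the log term
                g += 2.0 * n_ij * math.log(n_ij / e_ij)
            q += (n_ij - e_ij) ** 2 / e_ij
    return expected, g, q


# Hypothetical 2 x 2 table, used only to exercise the function.
expected, g, q = homogeneity_statistics([[10, 20], [30, 40]])
print(expected, g, q)
```

Each inner list of `counts` is one outcome (row) and each position within it is one sample (column), matching the layout of the contingency table.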

Testing for homogeneity (cont'd)

Note: We initially have $c(r - 1)$ degrees of freedom (adding together $r - 1$ degrees of freedom for each of the $c$ samples) and have estimated $r - 1$ parameters. Hence we have

\[ c(r - 1) - (r - 1) = (r - 1)(c - 1) \]

degrees of freedom. That is, under $H_0$, both $-2\log(\lambda)$ and $Q$ are approximately $\chi^2((r - 1)(c - 1))$.
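To turn an observed statistic into a p-value we need the upper tail of a chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom. The sketch below uses only Python's standard library and relies on the closed-form expressions for integer degrees of freedom; the names `chi2_sf` and `homogeneity_df` are hypothetical helpers, not from the lecture. A statistics package would normally be used instead.

```python
import math


def homogeneity_df(r, c):
    """Degrees of freedom c(r - 1) - (r - 1) = (r - 1)(c - 1)."""
    return (r - 1) * (c - 1)


def chi2_sf(x, df):
    """P(U >= x) for U chi-square with df degrees of freedom (df a positive integer)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term = 1.0
        total = 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + exp(-x/2) * sum_{j=1}^{(df-1)/2} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = sum(
        half ** (j - 0.5) / math.gamma(j + 0.5)
        for j in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# A table with r = 6 outcomes and c = 2 samples has 5 degrees of freedom.
df = homogeneity_df(6, 2)
print(df, chi2_sf(32.83, df))
```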

Example

When Jane Austen died in 1817, she left the novel Sanditon unfinished, but with a summary of the rest. This was completed by an admirer, and then published.

In 1978, A. Q. Morton published some statistical studies comparing the writings of Austen and the person who completed Sanditon. Morton counted the occurrences of "a", "an", "this", "that", "with", and "without" in chapters 1 and 3 of Sense and Sensibility; chapters 1, 2, and 3 of Emma; and chapters 1 and 6 of Sanditon (written by Austen), and also the occurrences of these words in chapters 12 and 24 of Sanditon (not written by Austen).

The results:

Word       Austen   Imitator   Total
a             434         83     517
an             62         29      91
this           86         15     101
that          236         22     258
with          161         43     204
without        38          4      42
Total        1017        196    1213

The expected frequencies are

\[
e_{11} = \frac{(1017)(517)}{1213} = 433.46, \qquad
e_{12} = \frac{(196)(517)}{1213} = 83.54,
\]
\[
e_{21} = \frac{(1017)(91)}{1213} = 76.30, \qquad
e_{22} = \frac{(196)(91)}{1213} = 14.70,
\]

and so on, giving us the following table of expected frequencies:

Word        Austen   Imitator     Total
a           433.46      83.54    517.00
an           76.30      14.70     91.00
this         84.68      16.32    101.00
that        216.31      41.69    258.00
with        171.04      32.96    204.00
without      35.21       6.79     42.00
Total      1017.00     196.00   1213.00

Evaluating our test statistics, we find either $-2\log(\lambda) = 31.75$ or $q = 32.83$. If $U$ is $\chi^2(5)$, we have either

p-value $= P(U \geq 31.75) = 0.000006583$

or

p-value $= P(U \geq 32.83) = 0.000004068$.

Hence we may conclude that the imitator has not been successful in imitating this aspect of Austen's style.
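To see where the large value of $q$ comes from, it can help to list each cell's contribution $(n_{ij} - e_{ij})^2 / e_{ij}$. The following Python sketch does this for the word counts given above; the variable names are illustrative only. By this arithmetic, the largest contributions come from the imitator's use of "an" and "that".

```python
# Observed counts from the word table: (Austen, Imitator) for each word.
observed = {
    "a": (434, 83),
    "an": (62, 29),
    "this": (86, 15),
    "that": (236, 22),
    "with": (161, 43),
    "without": (38, 4),
}
samples = ("Austen", "Imitator")

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]  # n_{+j}
n = sum(col_totals)

# Contribution of each cell to Pearson's statistic.
contributions = {}
for word, row in observed.items():
    row_total = sum(row)  # n_{i+}
    for j, n_ij in enumerate(row):
        e_ij = row_total * col_totals[j] / n
        contributions[(word, samples[j])] = (n_ij - e_ij) ** 2 / e_ij

q = sum(contributions.values())

# Print cells from largest to smallest contribution, then the total.
for cell, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(cell, round(value, 2))
print("q =", round(q, 2))
```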

Example (Doll and Hill Cancer Study)

In a study of patients in London hospitals in 1948 and 1949, Doll and Hill categorized each of 709 lung cancer patients and 709 control patients (that is, patients who did not have lung cancer) as either a smoker or a non-smoker.

Results of the study:

              Cancer   Control   Total
Non-smoker        21        59      80
Smoker           688       650    1338
Total            709       709    1418

The data raise the following question: Are the 38 additional non-smokers in the control group due to randomness, or to a higher rate of smoking among people with lung cancer than among those without lung cancer?

The expected frequencies are:

              Cancer   Control   Total
Non-smoker        40        40      80
Smoker           669       669    1338
Total            709       709    1418

And so $-2\log(\lambda) = 19.87802$ and $q = 19.12922$, giving p-values of 0.00000825 and 0.00001222, respectively. Hence we have very strong evidence for rejecting the hypothesis that the rate of smoking among the two groups is the same.

Note: We could also perform this test as a two-sample test for the equality of the probability of success in two independent Bernoulli populations. That is, let $p_X$ be the proportion of non-smokers in the cancer population and let $p_Y$ be the proportion of non-smokers in the control population. We want to test

\[
\begin{aligned}
H_0 &: p_X = p_Y \\
H_A &: p_X \neq p_Y.
\end{aligned}
\]

Now

\[
\hat{p}_X = \frac{21}{709} = 0.02962, \qquad
\hat{p}_Y = \frac{59}{709} = 0.08322, \qquad
\hat{p} = \frac{80}{1418} = 0.05642.
\]

Hence

\[
z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{709} + \frac{1}{709}\right)}} = -4.373697.
\]

This yields a p-value of 0.00001222, the same as for $q$ above. Indeed: $z^2 = 19.12922 = q$.
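The agreement between $z^2$ and $q$ can be checked numerically. This is a small Python sketch using the Doll and Hill counts; the variable names are illustrative, and the two-sided normal p-value is computed with the standard identity $P(|Z| \geq |z|) = \operatorname{erfc}(|z| / \sqrt{2})$.

```python
import math

# Non-smokers and sample sizes from the Doll and Hill table.
x_nonsmokers, x_size = 21, 709   # cancer sample
y_nonsmokers, y_size = 59, 709   # control sample

p_x = x_nonsmokers / x_size
p_y = y_nonsmokers / y_size
p_pooled = (x_nonsmokers + y_nonsmokers) / (x_size + y_size)

# Pooled two-sample z statistic for equality of proportions.
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / x_size + 1 / y_size))
z = (p_x - p_y) / std_err

# Two-sided p-value for a standard normal Z.
p_value = math.erfc(abs(z) / math.sqrt(2))

# Pearson's statistic for the same 2 x 2 table, for comparison with z^2.
observed = [[21, 59], [688, 650]]
expected = [[40.0, 40.0], [669.0, 669.0]]
q = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(z, z * z, q, p_value)
```

The sign of $z$ is negative here only because the cancer sample is taken as $X$; the two-sided p-value does not depend on that choice.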