Inference for Categorical Data. Chi-Square Tests for Goodness of Fit and Independence


Chi-Square Tests for Goodness of Fit and Independence

Chi-Square Tests

In this course, we use chi-square tests in two different ways:

The chi-square test for goodness-of-fit is used to determine whether an observed frequency distribution differs significantly from an expected frequency distribution.

The chi-square test for independence is used to determine whether two categorical (nominal or ordinal) variables exhibit a significant relationship.

The null hypothesis for both versions of the chi-square test is expressed in terms of expected frequencies. The first step in computing the chi-square statistic for either version is to determine the expected frequencies. The second step, computing the chi-square statistic itself, is the same for both versions of the test.

The Chi-Square Test for Goodness-of-Fit

The chi-square test for goodness-of-fit uses frequency data from a sample to test hypotheses about the shape or proportions of a population. Each individual in the sample is classified into one category on the scale of measurement. The data, called observed frequencies and denoted as $k$ or $f_O$, simply count how many individuals from the sample are in each category.

The null hypothesis specifies the proportion of the population that should be in each category. The proportions from the null hypothesis are used to compute expected frequencies that describe how the sample would appear if it were in perfect agreement with the null hypothesis.

For the goodness-of-fit test, the expected frequency for each category is the expected value of a binomial count:

$$f_E = E[k] = np$$

where $k = f_O$ is the observed count of items in a category, $p$ is the proportion from the null hypothesis, and $n$ is the size of the sample.

Example 1: Uniform Expected Frequencies

Number of symbols thrown in a game of rock/paper/scissors

Symbol      Rock   Paper   Scissors   Total
Observed    30     21      24         75
Expected    25     25      25

For 3 possible categories, if the player is throwing symbols randomly, then we should expect n/3 occurrences per category. For n = 75, we would expect 25 occurrences per category.

Example 2: Non-uniform Expected Frequencies

Official M&M's Color Distribution

Color     p
brown     0.3
red       0.2
blue      0.1
orange    0.1
green     0.1
yellow    0.2

Number of M&Ms of each color in a sample bag

Color       Brown   Red    Blue   Orange   Green   Yellow   Total
Observed    14      6      7      4        9       13       53
Expected    15.9    10.6   5.3    5.3      5.3     10.6
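The expected frequencies in both examples follow from $f_E = np$. The following is a minimal Python sketch of that arithmetic, assuming the proportions from the two examples above (uniform thirds for rock/paper/scissors; 0.3, 0.2, 0.1, 0.1, 0.1, 0.2 for the M&M colors). The function name `expected_frequencies` is illustrative and does not come from the slides.

```python
def expected_frequencies(n, proportions):
    """Expected count per category under H0: f_E = n * p."""
    return [n * p for p in proportions]

# Example 1: three equally likely symbols, n = 75 throws.
rps_expected = expected_frequencies(75, [1 / 3, 1 / 3, 1 / 3])

# Example 2: published M&M color proportions, n = 53 candies in the bag.
mm_colors = ["brown", "red", "blue", "orange", "green", "yellow"]
mm_proportions = [0.3, 0.2, 0.1, 0.1, 0.1, 0.2]
mm_expected = dict(zip(mm_colors, expected_frequencies(53, mm_proportions)))

print([round(e, 1) for e in rps_expected])
print({color: round(e, 1) for color, e in mm_expected.items()})
```

Note that the expected frequencies need not be whole numbers, and that they always sum to the sample size n.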

Binomial vs. Multinomial Distribution

Binomial PMF:

$$P(k) = \frac{n!}{(n-k)!\,k!}\; p^{k} (1-p)^{n-k}$$

Multinomial PMF:

$$P(k_1, \ldots, k_c) = \frac{n!}{k_1! \cdots k_c!}\; p_1^{k_1} \cdots p_c^{k_c}$$
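To make the relationship between the two PMFs concrete, here is a small Python sketch using only the standard library. The function names are illustrative. It shows that a multinomial with two categories gives the same probability as the binomial.

```python
from math import comb, factorial, prod

def binomial_pmf(k, n, p):
    """P(k successes in n trials) for success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def multinomial_pmf(counts, probs):
    """P(k_1, ..., k_c) for category probabilities p_1, ..., p_c."""
    coefficient = factorial(sum(counts))
    for k in counts:
        coefficient //= factorial(k)  # stays an exact integer at every step
    return coefficient * prod(p**k for p, k in zip(probs, counts))

# With two categories the multinomial reduces to the binomial.
print(binomial_pmf(3, 10, 0.4))
print(multinomial_pmf([3, 7], [0.4, 0.6]))
```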

Computing the Chi-Square Statistic

$$\chi^2(df) = \sum \frac{(f_O - f_E)^2}{f_E}$$

For the goodness-of-fit test: df = # cells - 1

Chi-Square Distribution

[Figure: chi-square density curves, dchisq(x, k), for df = 1, 2, 4, and 8.]
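The label dchisq(x, k) appears to refer to R's chi-square density function. As a hedged illustration, the same density can be evaluated in Python from its standard formula, $f(x; k) = x^{k/2-1} e^{-x/2} / (2^{k/2}\,\Gamma(k/2))$. The function name `chi2_pdf` and the evaluation points are assumptions chosen for this sketch.

```python
from math import exp, gamma

def chi2_pdf(x, k):
    """Chi-square density with k degrees of freedom, for x > 0."""
    if x <= 0:
        return 0.0
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

# Evaluate each curve at a few x values to see how the shape changes with k.
for k in (1, 2, 4, 8):
    print(k, [round(chi2_pdf(x, k), 4) for x in (1, 2, 4, 8)])
```

As the degrees of freedom grow, the bulk of the density moves to the right and the curve becomes more symmetric.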

Example: Official M&M's Color Distribution

Number of M&Ms of each color in a sample bag

Color       Brown   Red    Blue   Orange   Green   Yellow   Total
Observed    14      6      7      4        9       13       53
Expected    15.9    10.6   5.3    5.3      5.3     10.6

$$\chi^2(5) = \sum \frac{(f_O - f_E)^2}{f_E} = \frac{(14-15.9)^2}{15.9} + \frac{(6-10.6)^2}{10.6} + \frac{(7-5.3)^2}{5.3} + \frac{(4-5.3)^2}{5.3} + \frac{(9-5.3)^2}{5.3} + \frac{(13-10.6)^2}{10.6} = 6.21$$
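The sum for the M&M bag can be reproduced in a few lines of Python. This is a sketch of the arithmetic only, using the observed counts 14, 6, 7, 4, 9, 13 and the published proportions 0.3, 0.2, 0.1, 0.1, 0.1, 0.2; the function name is illustrative.

```python
def chi_square_statistic(observed, expected):
    """Sum of (f_O - f_E)^2 / f_E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [14, 6, 7, 4, 9, 13]                 # brown, red, blue, orange, green, yellow
proportions = [0.3, 0.2, 0.1, 0.1, 0.1, 0.2]    # null-hypothesis proportions
n = sum(observed)
expected = [n * p for p in proportions]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1                          # goodness of fit: # cells - 1
print(round(chi2, 2), df)                       # 6.21 5
```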

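[Figure: chi-square density for the example. With $\alpha = 0.05$, the critical value is $\chi^2_{0.05} = 11.07$. The observed statistic is $\chi^2_{\text{observed}} = 6.21$, giving $p = P(\chi^2 > \chi^2_{\text{observed}}) = 0.2863$.]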

[Figure, repeated on two slides: a second version of the plot, labeled $p = P(\chi^2 > \chi^2_{\text{observed}}) = 0.2959$, $\alpha = 0.05$, $\chi^2_{\text{observed}} = 6.21$, $\chi^2_{0.05} = 11.03$.]

Example: Official M&M's Color Distribution (decision)

$\chi^2(5) = 6.21 < 11.07$; retain $H_0$.

The frequency distribution of colors in our bag is not significantly different from the distribution expected based on M&M's published proportions.
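The decision can be checked numerically. The sketch below uses only the Python standard library and a closed-form expression for the chi-square upper-tail probability that holds for integer degrees of freedom. The function names `chi2_sf` and `chi2_critical`, and the bisection search for the critical value, are choices made for this illustration rather than anything from the slides.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability P(chi2 > x) for integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum_{i=0}^{df/2 - 1} h^i / i!
        return exp(-h) * sum(h**i / gamma(i + 1) for i in range(df // 2))
    # Odd df: erfc(sqrt(h)) + exp(-h) * sum_{i=1}^{(df-1)/2} h^(i - 1/2) / Gamma(i + 1/2)
    tail = sum(h ** (i - 0.5) / gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(h)) + exp(-h) * tail

def chi2_critical(alpha, df, lo=0.0, hi=200.0):
    """Value x with P(chi2 > x) = alpha, found by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

chi2_observed, df, alpha = 6.21, 5, 0.05
p_value = chi2_sf(chi2_observed, df)
critical = chi2_critical(alpha, df)
print(round(p_value, 4), round(critical, 2))
print("reject H0" if p_value < alpha else "retain H0")
```

Comparing the statistic to the critical value and comparing the p-value to $\alpha$ are equivalent decision rules; both lead to retaining $H_0$ here.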

The Chi-Square Test for Independence

The second chi-square test, the chi-square test for independence, can be used and interpreted in two different ways:

1. Testing hypotheses about the relationship between two variables in a population.
   $H_0$: There is no relationship between factor A and factor B. (This interpretation is analogous to correlation.)
2. Testing hypotheses about differences between proportions for two or more populations.
   $H_0$: There is no difference between the distribution of factor A under different levels of factor B. (This interpretation is analogous to interaction in factorial ANOVAs.)

The data for a chi-square test for independence are usually organized in a matrix, with the categories of one variable defining the rows and the categories of the second variable defining the columns. These matrices are usually called contingency tables.

Frequency of successes and relapses for anorexic patients treated with Prozac or a placebo

             Outcome
Treatment    Success   Relapse   Total
Drug         13        36        49
Placebo      14        30        44
Total        27        66        93

The data, called observed frequencies, simply show how many individuals from the sample fall into each cell (i.e., combination of factor levels) of the matrix. The null hypothesis for this test states that there is no relationship between the two variables. In other words, the two variables are independent.

Computing Expected Frequencies

For the goodness-of-fit test, the expected frequency is computed as

$$f_E = np$$

For the test for independence, the expected frequency for each cell in the matrix is computed as

$$f_E = n\,p_R\,p_C = \frac{f_R\,f_C}{n}$$

where $f_R$ and $f_C$ are the row and column totals for that cell, and $p_R = f_R/n$ and $p_C = f_C/n$ are the corresponding marginal proportions.
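As an illustration of the independence formula, the sketch below computes $f_E = f_R f_C / n$ for every cell of a contingency table. It assumes the four observed cell counts from the Prozac/placebo table (13, 36, 14, 30); the function name `expected_table` is hypothetical.

```python
def expected_table(observed):
    """Expected cell counts under independence: f_E = f_R * f_C / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: drug, placebo. Columns: success, relapse.
observed = [[13, 36], [14, 30]]
expected = expected_table(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

The expected table keeps the same row and column totals as the observed table; only the allocation of counts within the cells changes.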

To understand the intuition behind this, consider the joint and marginal probabilities underlying the frequency distribution:

             Outcome
Treatment    Success               Relapse               Total
Drug         P(success, drug)      P(relapse, drug)      P(drug)
Placebo      P(success, placebo)   P(relapse, placebo)   P(placebo)
Total        P(success)            P(relapse)            1

Remember from earlier in the semester that if two factors (treatment and outcome) are independent, then:

$$P(\text{outcome}, \text{treatment}) = P(\text{outcome})\,P(\text{treatment})$$

Computing the Chi-Square Statistic

The calculation of chi-square is the same for all chi-square tests:

$$\chi^2 = \sum \frac{(f_O - f_E)^2}{f_E}$$

However, computation of the degrees of freedom differs:

For the goodness-of-fit test: df = # cells - 1
For the test of independence: df = (# rows - 1)(# cols - 1)

Full Example

Event Frequencies

            Death Sentence
Race        Yes    No     Total
Black       95     425    520
Nonblack    19     128    147
Total       114    553    667

Event Probabilities (f/n)

            Death Sentence
Race        Yes      No       Total
Black       0.142    0.637    0.780
Nonblack    0.028    0.192    0.220
Total       0.171    0.829    1.000

Are race of the defendant and application of the death sentence independent?

$$P(\text{black}, \text{death}) = 0.142$$

$$P(\text{black})\,P(\text{death}) = 0.780 \times 0.171 = 0.133$$
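The comparison between the joint probability and the product of the marginals can be sketched in Python as follows, assuming the cell counts 95, 425, 19, and 128 from the frequency table. The variable names are illustrative.

```python
# Observed counts keyed by (race, death sentence).
counts = {
    ("black", "yes"): 95, ("black", "no"): 425,
    ("nonblack", "yes"): 19, ("nonblack", "no"): 128,
}
n = sum(counts.values())

# Joint probability of one cell, and the two marginal probabilities.
p_joint = counts[("black", "yes")] / n
p_black = sum(v for (race, _), v in counts.items() if race == "black") / n
p_death = sum(v for (_, sentence), v in counts.items() if sentence == "yes") / n

# Under independence these two numbers would be equal.
print(round(p_joint, 3), round(p_black * p_death, 3))   # 0.142 0.133
```

The two values differ slightly; the chi-square test asks whether a difference of this size is larger than sampling variability alone would produce.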

Full Example: Expected Frequencies from the Observed Event Frequencies

$$f_{E(\text{black, yes})} = \frac{520 \times 114}{667} = 88.88$$

$$f_{E(\text{black, no})} = \frac{520 \times 553}{667} = 431.12$$

$$f_{E(\text{nonblack, yes})} = \frac{147 \times 114}{667} = 25.12$$

$$f_{E(\text{nonblack, no})} = \frac{147 \times 553}{667} = 121.88$$

Observed Event Frequencies

            Death Sentence
Race        Yes    No     Total
Black       95     425    520
Nonblack    19     128    147
Total       114    553    667

Expected Event Frequencies

            Death Sentence
Race        Yes      No        Total
Black       88.88    431.12    520
Nonblack    25.12    121.88    147
Total       114      553       667

$$\chi^2(1) = \sum \frac{(f_O - f_E)^2}{f_E} = \frac{(95-88.88)^2}{88.88} + \frac{(425-431.12)^2}{431.12} + \frac{(19-25.12)^2}{25.12} + \frac{(128-121.88)^2}{121.88} = 0.42 + 0.09 + 1.49 + 0.31 = 2.31$$

$$P\left(\chi^2(1) > 2.31\right) = 0.13; \quad \text{retain } H_0$$

These data do not indicate a significant relationship between race and sentencing in death penalty trials.

Or: these data are not sufficient to conclude that black defendants are sentenced at a different rate than nonblack defendants in death penalty trials.
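The whole test of independence for the death-sentence data can be put together in one short Python sketch. It uses only the standard library; the function names are illustrative, and the p-value line relies on the identity $P(\chi^2(1) > x) = \operatorname{erfc}(\sqrt{x/2})$, which applies only when df = 1.

```python
from math import erfc, sqrt

def expected_table(observed):
    """Expected cell counts under independence: f_E = f_R * f_C / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(observed):
    """Return the chi-square statistic and df for a contingency table."""
    expected = expected_table(observed)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows: black, nonblack. Columns: death sentence yes, no.
observed = [[95, 425], [19, 128]]
chi2, df = chi_square_independence(observed)
p_value = erfc(sqrt(chi2 / 2))   # upper-tail probability, valid for df == 1 only
print(round(chi2, 2), df, round(p_value, 2))   # 2.31 1 0.13
```

Since the p-value exceeds $\alpha = 0.05$, the sketch agrees with the decision to retain $H_0$.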