Multiple Sample Categorical Data

paired and unpaired data, goodness-of-fit testing, testing for independence
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html

Testing whether two dice have the same distribution

Suppose we want to know whether two irregular 6-faced dice, with faces numbered 1 through 6 as usual, have the same chances of landing on each digit.

NOTE: The question is not whether they are fair.

To find out, we throw the first die m = 500 times, obtaining X_1, ..., X_m ∈ {1, ..., 6}, and then throw the second die n = 500 times as well, obtaining Y_1, ..., Y_n ∈ {1, ..., 6}. (We assume all the throws are independent of each other.)

NOTE: In principle, m and n can be different, although for a given total sample size m + n, it is best to choose m = n if possible.

We then test
    H_0: the dice X and Y have the same distribution
versus
    H_1: the dice X and Y have different distributions

Summary statistics. The counts
    M_s = #{i : X_i = s},  s = 1, ..., 6
    N_s = #{i : Y_i = s},  s = 1, ..., 6
are (jointly) sufficient, and can be displayed in a table as follows:

    Digit     1     2     3     4     5     6   Total
    X       105    70    80    89    75    81     500
    Y        77    71    83    77    72   120     500
    Total   182   141   163   166   147   201    1000

Graphics. The plots of choice are segmented barplots and side-by-side barplots; they offer different advantages.
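The slides contain no code, but as a minimal sketch of how these summary counts can be tabulated (Python with numpy assumed; the counts are copied from the table above):

```python
import numpy as np

# Observed counts for each digit 1..6, taken from the slide.
M = np.array([105, 70, 80, 89, 75, 81])   # first die, m = 500 throws
N = np.array([77, 71, 83, 77, 72, 120])   # second die, n = 500 throws

# Given raw throws X (values in 1..6), the counts come from bincount:
# M = np.bincount(X, minlength=7)[1:]

table = np.vstack([M, N])
print(table.sum(axis=1))  # row totals: [500 500]
print(table.sum(axis=0))  # per-digit totals: [182 141 163 166 147 201]
```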

Chi-squared goodness-of-fit test

The observed counts are
    M_s = #{i : X_i = s},  s = 1, ..., 6
    N_s = #{i : Y_i = s},  s = 1, ..., 6

Under the null, X and Y have the same distribution, say p = (p_1, ..., p_6), and the expected counts are
    E(M_s) = m p_s
    E(N_s) = n p_s

The issue is that we do not know p! (Compare with the one-sample setting.) The idea is to estimate p based on the combined sample:
    \hat{p}_s = (M_s + N_s) / (m + n)

With \hat{p} defined, we can then obtain estimated expected counts
    \hat{E}(M_s) = m \hat{p}_s
    \hat{E}(N_s) = n \hat{p}_s

The final step is to compare the observed and estimated expected counts with the usual chi-squared test statistic:
    D = \sum_{s=1}^{6} [ (M_s - m\hat{p}_s)^2 / (m\hat{p}_s) + (N_s - n\hat{p}_s)^2 / (n\hat{p}_s) ]

Theory. Under the null, D has asymptotically (m, n → ∞) the chi-squared distribution with 6 − 1 = 5 degrees of freedom. (This matches the general formula below with k = 2 samples and S = 6 faces.)

Two or more dice

The same methodology extends to comparing the distributions of any number k ≥ 2 of dice with the same number of faces S ≥ 2. The sample sizes may be different. The expected counts (under the null) are estimated from all the samples combined.

Theory. The resulting test statistic has asymptotically (as all the sample sizes diverge) the chi-squared distribution with (k − 1)(S − 1) degrees of freedom.
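As a pointer (not part of the slides): this two-sample test is what scipy computes from the 2 × 6 table of counts, since it pools the rows to estimate p and forms the expected counts m\hat{p}_s and n\hat{p}_s. A minimal sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[105, 70, 80, 89, 75, 81],
                  [77, 71, 83, 77, 72, 120]])

# Pearson chi-squared test on the 2 x 6 table; under the null the
# statistic is asymptotically chi-squared with (2-1)(6-1) = 5 df.
D, pval, df, expected = chi2_contingency(table, correction=False)
print(D, pval, df)
```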

Testing whether two dice are independent of each other

Suppose we now want to know whether, when rolling these dice together, the digits they show are independent. We throw the pair of dice together n = 500 times and record the outcomes, denoted (X_1, Y_1), ..., (X_n, Y_n), with (X_i, Y_i) ∈ {1, ..., 6} × {1, ..., 6}. (We assume the throws are independent.) In this setting, the variables X and Y (the results from the two dice) are paired.

We test
    H_0: the dice X and Y are independent
versus
    H_1: the dice X and Y are not independent

Known marginal distributions

First, assume that we know that both dice are fair. (Each die might have been rigorously tested beforehand, based on many trials.) Under the null hypothesis that the dice are independent, we have
    P((X, Y) = (a, b)) = P(X = a) P(Y = b) = (1/6)(1/6) = 1/36,  for all a, b ∈ {1, ..., 6}

We can simply apply the chi-squared GOF test to decide whether Z_1, ..., Z_n, where Z_i = (X_i, Y_i), are uniformly distributed over {1, ..., 6} × {1, ..., 6}. After all, the variable Z is just a factor, here with 36 levels, so we are in the one-sample categorical data situation!

Unknown marginal distributions

Now assume that we do not know the distributions of the dice. (This situation is much more common.) Under the null hypothesis, the dice are independent, so that
    P((X, Y) = (a, b)) = P(X = a) P(Y = b),  for all a, b ∈ {1, ..., 6}
But now we do not know the marginals P(X = a) or P(Y = b).
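For the known-marginals case, a minimal illustrative sketch (not from the slides; the rolls are simulated with numpy, and the test uses scipy's one-sample chisquare against the uniform null over the 36 cells):

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
n = 500
X = rng.integers(1, 7, size=n)   # simulated fair die
Y = rng.integers(1, 7, size=n)   # simulated fair die, independent of X

# Map each pair (X_i, Y_i) to one of the 36 cells and tally.
Z = (X - 1) * 6 + (Y - 1)
counts = np.bincount(Z, minlength=36)

# GOF test against the uniform null: expected count n/36 in every cell;
# the statistic is asymptotically chi-squared with 36 - 1 = 35 df.
D, pval = chisquare(counts, f_exp=np.full(36, n / 36))
print(D, pval)
```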

Contingency table

Summary statistics. The joint counts
    N_{s,t} = #{i : (X_i, Y_i) = (s, t)}
are sufficient and are used as summary statistics. They are organized in a matrix, called a contingency table (here with totals):

              Y
    X       1     2     3     4     5     6   Sum
    1      13    18    17    11     9    19    87
    2      16    10    16    18    15    13    88
    3      10    21    12    16    11    20    90
    4      14    14    15    10    14    14    81
    5      14    13    15    15    21     8    86
    6      16    13    10     6    13    10    68
    Sum    83    89    85    76    83    84   500

Graphics. The main plots are the segmented barplot, the side-by-side barplot, and the mosaic plot.

Chi-squared goodness-of-fit test

The observed counts are
    N_{s,t} = #{i : (X_i, Y_i) = (s, t)}
Under the null, X and Y are independent, say with marginals p and q, and the expected counts are
    E(N_{s,t}) = n P(X = s, Y = t) = n P(X = s) P(Y = t) = n p_s q_t
The issue is that we do not know the marginals, neither p nor q.
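As an aside (not from the slides), building such a table from paired observations is a one-liner with pandas; a minimal sketch with simulated rolls:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = rng.integers(1, 7, size=500)
Y = rng.integers(1, 7, size=500)

# Joint counts N_{s,t}, with row and column totals in the margins.
table = pd.crosstab(pd.Series(X, name="X"), pd.Series(Y, name="Y"),
                    margins=True, margins_name="Sum")
print(table)
```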

The idea is to estimate p and q from the margins. Define the marginal counts
    N_{s,·} = #{i : X_i = s}
    N_{·,t} = #{i : Y_i = t}
and then the estimates
    \hat{p}_s = N_{s,·} / n
    \hat{q}_t = N_{·,t} / n

With \hat{p} and \hat{q} defined, we can then obtain estimated expected counts
    \hat{E}(N_{s,t}) = n \hat{p}_s \hat{q}_t = N_{s,·} N_{·,t} / n

The final step is to compare the observed and estimated expected counts with the usual chi-squared test statistic:
    D = \sum_{s=1}^{6} \sum_{t=1}^{6} (N_{s,t} - n \hat{p}_s \hat{q}_t)^2 / (n \hat{p}_s \hat{q}_t)

Theory. Under the null, D has asymptotically (n → ∞) the chi-squared distribution with (6 − 1)(6 − 1) = 25 degrees of freedom.

The same methodology extends to testing for independence between two factors with S and T levels, respectively. The margins are used in the same way to estimate the expected counts under the null.

Theory. The resulting test statistic has asymptotically (n → ∞) the chi-squared distribution with (S − 1)(T − 1) degrees of freedom.
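A minimal sketch (not from the slides) running this test on the 6 × 6 table above with scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [13, 18, 17, 11,  9, 19],
    [16, 10, 16, 18, 15, 13],
    [10, 21, 12, 16, 11, 20],
    [14, 14, 15, 10, 14, 14],
    [14, 13, 15, 15, 21,  8],
    [16, 13, 10,  6, 13, 10],
])

# Expected counts are N_{s,.} N_{.,t} / n; df = (6-1)(6-1) = 25.
D, pval, df, expected = chi2_contingency(table)
print(D, pval, df)
```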

Fisher's exact test

R. A. Fisher (a great figure in statistics) developed an exact test for 2 × 2 contingency tables (meaning the two categorical variables are binary). He told the following story ("the lady tasting tea") to motivate his test. Here it is, paraphrased:

A British woman claimed to be able to tell whether milk or tea was added to the cup first. To test this, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman's guess; the alternative is that there is a positive association (that the odds ratio is greater than 1). The resulting counts are as follows:

               Guess
    Truth    Milk   Tea   Sum
    Milk        3     1     4
    Tea         1     3     4
    Sum         4     4     8

The expected counts are too small to use the chi-squared approximation. What can we do? How can we quantify how accurate the lady's guesses are?

Fisher's idea is to fix the margins (meaning the row sums and the column sums), enumerate all the contingency tables with the same margins, and sum the probabilities of all the tables that are at least as extreme as the table that is observed.

Enumerating all the tables with the observed margins is easy, since there is only one degree of freedom left. For example, we can focus on the top left cell, which determines all the other ones. A table here is at least as extreme as the observed table if the top left cell has a higher count (implying a stronger positive association).

Suppose we have a general 2 × 2 contingency table

             Y = 1    Y = 0    Sum
    X = 1    N_{11}   N_{10}   N_{1·}
    X = 0    N_{01}   N_{00}   N_{0·}
    Sum      N_{·1}   N_{·0}   n

When X and Y are independent, the probability of obtaining such a table, conditioned on having these margins, is
    \binom{N_{1·}}{N_{11}} \binom{N_{0·}}{N_{01}} / \binom{n}{N_{·1}}
Indeed, the top left cell count is hypergeometric.
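As an illustration (not from the slides), scipy implements this test for 2 × 2 tables; a minimal sketch for the tea-tasting counts, with the hypergeometric computation shown alongside:

```python
from scipy.stats import fisher_exact, hypergeom

table = [[3, 1],
         [1, 3]]

# One-sided test of positive association (odds ratio > 1).
oddsratio, pval = fisher_exact(table, alternative="greater")
print(pval)  # about 0.2429, matching the slide

# Equivalently, the top-left count N11 is hypergeometric:
# population of 8 cups, 4 "milk-first", 4 guessed "milk".
print(hypergeom.sf(2, 8, 4, 4))  # P(N11 >= 3), same value
```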

In our example, the probability of the observed table is
    \binom{4}{3} \binom{4}{1} / \binom{8}{4}
There is only one more extreme table,

             Milk   Tea   Sum
    Milk        4     0     4
    Tea         0     4     4
    Sum         4     4     8

and it has probability
    \binom{4}{4} \binom{4}{0} / \binom{8}{4}
The p-value is the sum of these:
    \binom{4}{3}\binom{4}{1}/\binom{8}{4} + \binom{4}{4}\binom{4}{0}/\binom{8}{4} ≈ 0.2429

Exact testing for general S × T tables

The procedure extends to contingency tables of any dimensions. Assume the following are given:
    row sums: (m_{s·} : s = 1, ..., S)
    column sums: (m_{·t} : t = 1, ..., T)
Under independence, and conditioned on these margins, the probability of observing the table M = (m_{st} : s = 1, ..., S; t = 1, ..., T) is equal to
    \prod_{s=1}^{S} m_{s·}! \prod_{t=1}^{T} m_{·t}! / ( n! \prod_{s=1}^{S} \prod_{t=1}^{T} m_{st}! )
where n is the sample size, meaning n = \sum_{s,t} m_{st}.
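This table probability is easy to compute directly from factorials; a minimal sketch (not from the slides; the helper table_prob is hypothetical, written just for illustration, and is checked against the 2 × 2 example above):

```python
from math import factorial, prod
import numpy as np

def table_prob(table):
    """Probability of a contingency table under independence,
    conditioned on its row and column sums."""
    table = np.asarray(table)
    n = int(table.sum())
    num = prod(factorial(int(r)) for r in table.sum(axis=1)) \
        * prod(factorial(int(c)) for c in table.sum(axis=0))
    den = factorial(n) * prod(factorial(int(x)) for x in table.ravel())
    return num / den

print(table_prob([[3, 1], [1, 3]]))  # 16/70, the observed table
print(table_prob([[4, 0], [0, 4]]))  # 1/70, the more extreme table
```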

In analogy with Fisher's exact test, we may define a table as being at least as extreme as the one we observe if its probability is at least as small as the probability of the one we observe. Alternatively, it may be defined as having a test statistic (e.g., Pearson's) at least as extreme as the statistic for the table we observe.

The main issue is computational: enumerating all tables with given margins may be prohibitive, as their number grows very fast with the number of cells and the magnitude of the counts.

Calibration by permutation

Fisher's method is based on the permutation distribution with the margins fixed. Under the null hypothesis, X_i and Y_i are independent. In particular, for any permutation π of {1, ..., n}, the permuted data
    (X_1, Y_{π(1)}), ..., (X_n, Y_{π(n)})
has the same distribution as the original data
    (X_1, Y_1), ..., (X_n, Y_n)

Therefore, under the null, any test statistic
    D = Λ[(X_1, Y_1), ..., (X_n, Y_n)]
has the same distribution after permutation, meaning that for any permutation π,
    D_π = Λ[(X_1, Y_{π(1)}), ..., (X_n, Y_{π(n)})]
has the same distribution as D under the null.

Suppose that we reject for large values of Λ, and define
    P = #{π : D_π ≥ D_obs} / n!
P is the fraction of permuted statistics that are at least as extreme as the one we observed. P is a valid p-value, in the sense that
    P_0(P ≤ p) ≤ p,  for all p ∈ (0, 1)
In fact, if all the D_π's are distinct, then under the null P is uniformly distributed over {k/n! : k = 1, ..., n!}.
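Anticipating the Monte Carlo estimate defined on the next slide (enumerating all n! permutations is infeasible, so one samples them), here is a minimal illustrative sketch with simulated rolls, using Pearson's statistic as Λ; none of this code is from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.integers(1, 7, size=n)
Y = rng.integers(1, 7, size=n)

def pearson(x, y):
    """Pearson chi-squared statistic for the contingency table of
    two paired samples with values in {1,...,6}."""
    obs = np.zeros((6, 6))
    np.add.at(obs, (x - 1, y - 1), 1)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / len(x)
    return ((obs - exp) ** 2 / exp).sum()

D_obs = pearson(X, Y)

# Sample B random permutations of Y; the margins stay fixed.
B = 2000
D_perm = np.array([pearson(X, rng.permutation(Y)) for _ in range(B)])
P_hat = (np.sum(D_perm >= D_obs) + 1) / (B + 1)
print(D_obs, P_hat)
```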

In practice, the number n! of permutations is too large to compute P exactly. In that case, we estimate P by Monte Carlo sampling. For B a large integer, sample π_1, ..., π_B i.i.d. uniformly from the permutations of {1, ..., n} and estimate P by
    \hat{P} = ( #{b : D_{π_b} ≥ D_obs} + 1 ) / (B + 1)
It happens that \hat{P} is also a valid p-value.

The parametric bootstrap

The bootstrap offers an alternative method for obtaining a p-value by simulation. It mimics Monte Carlo simulations, replacing the (unknown) marginals with the estimated marginals.

Assume without loss of generality that X takes values in {1, ..., S} and Y takes values in {1, ..., T}. Let (p_1, ..., p_S) denote the marginal distribution of X and (q_1, ..., q_T) the marginal distribution of Y.

Let \hat{p}_s denote the MLE for p_s, meaning
    \hat{p}_s = #{i : X_i = s} / n
and let \hat{q}_t denote the MLE for q_t, meaning
    \hat{q}_t = #{i : Y_i = t} / n

Suppose we are rejecting for large values of a test statistic
    D = Λ[(X_1, Y_1), ..., (X_n, Y_n)]
Let B be a large integer.

1. For b = 1, ..., B, do the following:
   (a) Generate a sample of size n, X_1^{(b)}, ..., X_n^{(b)}, from (\hat{p}_1, ..., \hat{p}_S); generate a sample of size n, Y_1^{(b)}, ..., Y_n^{(b)}, from (\hat{q}_1, ..., \hat{q}_T).
   (b) Compute
       D_b = Λ[(X_1^{(b)}, Y_1^{(b)}), ..., (X_n^{(b)}, Y_n^{(b)})]

2. The estimated p-value is
    ( #{b : D_b ≥ D_obs} + 1 ) / (B + 1)
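A minimal sketch of this parametric bootstrap (not from the slides; the rolls are simulated, Pearson's statistic again plays the role of Λ, and the pearson helper from the permutation sketch is repeated so the block is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.integers(1, 7, size=n)
Y = rng.integers(1, 7, size=n)

def pearson(x, y):
    # Pearson chi-squared statistic for the 6 x 6 table of (x, y).
    obs = np.zeros((6, 6))
    np.add.at(obs, (x - 1, y - 1), 1)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / len(x)
    return ((obs - exp) ** 2 / exp).sum()

D_obs = pearson(X, Y)

# MLEs of the marginal distributions.
p_hat = np.bincount(X, minlength=7)[1:] / n
q_hat = np.bincount(Y, minlength=7)[1:] / n

# Resample X and Y independently from the estimated marginals.
B = 2000
faces = np.arange(1, 7)
D_boot = np.array([
    pearson(rng.choice(faces, size=n, p=p_hat),
            rng.choice(faces, size=n, p=q_hat))
    for _ in range(B)
])
pval = (np.sum(D_boot >= D_obs) + 1) / (B + 1)
print(D_obs, pval)
```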