Multiple Sample Categorical Data


paired and unpaired data, goodness-of-fit testing, testing for independence

University of California, San Diego
Instructor: Ery Arias-Castro

Testing whether two dice have the same distribution

Suppose we want to know whether two irregular 6-faced dice, with faces numbered 1 through 6 as usual, have the same chances of landing on any digit.

NOTE: The question is not whether they are fair or not.

To determine whether this is so, we throw the first die $m = 500$ times, obtaining $X_1, \dots, X_m \in \{1, \dots, 6\}$, and then throw the second die $n = 500$ times also, obtaining $Y_1, \dots, Y_n \in \{1, \dots, 6\}$. (We assume all the throws are independent of each other.)

NOTE: In principle, $m$ and $n$ can be different, although for the same total sample size $m + n$, it is best to choose $m = n$ if possible.

We then test

$H_0$: the dice X and Y have the same distribution
versus
$H_1$: the dice X and Y have different distributions

Summary statistics. The counts

$$M_s = \#\{i : X_i = s\}, \quad s = 1, \dots, 6$$
$$N_s = \#\{i : Y_i = s\}, \quad s = 1, \dots, 6$$

are (jointly) sufficient, and can be displayed in a table with one row per die (X and Y) plus a Total row, and one column per digit (1 through 6) plus a Total column.

Graphics. The plots of choice are segmented barplots and side-by-side barplots. They offer different advantages.
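The summary counts described above can be tabulated in a few lines. The following is a minimal sketch in Python, not part of the original slides; the function name `count_table` and the tiny samples are made up for illustration.

```python
from collections import Counter

def count_table(x, y, faces=6):
    """Return the counts (M_s), (N_s) and the column totals (M_s + N_s)."""
    cx, cy = Counter(x), Counter(y)
    m_counts = [cx.get(s, 0) for s in range(1, faces + 1)]
    n_counts = [cy.get(s, 0) for s in range(1, faces + 1)]
    totals = [a + b for a, b in zip(m_counts, n_counts)]
    return m_counts, n_counts, totals

# Tiny made-up samples, for illustration only (not data from the slides).
x = [1, 2, 2, 6, 3, 3, 3]
y = [6, 6, 1, 4, 5]
M, N, total = count_table(x, y)

# One row per die plus a Total row; the last entry of each row is the row total.
print("X    ", M, sum(M))
print("Y    ", N, sum(N))
print("Total", total, sum(total))
```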

Chi-squared goodness-of-fit test

The observed counts are

$$M_s = \#\{i : X_i = s\}, \quad s = 1, \dots, 6$$
$$N_s = \#\{i : Y_i = s\}, \quad s = 1, \dots, 6$$

Under the null, X and Y have the same distribution, say $p = (p_1, \dots, p_6)$, and the expected counts are

$$E(M_s) = m p_s, \qquad E(N_s) = n p_s.$$

The issue is that we do not know $p$! (Compare with the one-sample setting.)

The idea is to estimate $p$ based on the combined sample:

$$\hat p_s = \frac{M_s + N_s}{m + n}.$$

With $\hat p$ defined, we can then obtain the estimated expected counts

$$\hat E(M_s) = m \hat p_s, \qquad \hat E(N_s) = n \hat p_s.$$

The final step is to compare the observed and estimated expected counts with the usual chi-squared test statistic:

$$D = \sum_{s=1}^{6} \left[ \frac{(M_s - m \hat p_s)^2}{m \hat p_s} + \frac{(N_s - n \hat p_s)^2}{n \hat p_s} \right].$$

Theory. Under the null, $D$ has asymptotically ($m, n \to \infty$) the chi-squared distribution with $6 - 1 = 5$ degrees of freedom.

Two or more dice

The same methodology extends to comparing the distributions of any number $k \ge 2$ of dice with the same number of faces $S \ge 2$. The sample sizes may be different. The expected counts (under the null) are estimated based on all the samples combined.

Theory. The resulting test statistic has asymptotically (as all the sample sizes diverge) the chi-squared distribution with $(k - 1)(S - 1)$ degrees of freedom.
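The statistic $D$ and its degrees of freedom can be computed directly from the table of counts. The sketch below is one possible Python rendering for $k$ samples; the function name `homogeneity_statistic` and the example counts are assumptions made for illustration, and the code assumes every face appears at least once in the pooled sample (otherwise an estimated expected count is zero).

```python
def homogeneity_statistic(counts):
    """counts[j][s] is the number of times sample j showed face s + 1.

    Returns (D, df): the chi-squared statistic with pooled estimates and
    the degrees of freedom (k - 1)(S - 1).
    """
    k, S = len(counts), len(counts[0])
    sizes = [sum(row) for row in counts]          # sample sizes (m, n, ...)
    total = sum(sizes)
    # Pooled estimate of p: combined count of each face over combined sample size.
    pooled = [sum(row[s] for row in counts) / total for s in range(S)]
    D = 0.0
    for j in range(k):
        for s in range(S):
            expected = sizes[j] * pooled[s]       # estimated expected count
            D += (counts[j][s] - expected) ** 2 / expected
    return D, (k - 1) * (S - 1)

# Made-up counts for two dice thrown 60 times each (illustration only).
example = [[10, 10, 10, 10, 10, 10],
           [5, 15, 10, 10, 10, 10]]
D, df = homogeneity_statistic(example)
print(D, df)
```

The value of `D` would then be compared with the upper tail of the chi-squared distribution with `df` degrees of freedom.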

Testing whether two dice are independent of each other

Suppose we now want to know whether, when rolling these dice together, the digits they show are independent.

We throw the pair of dice together $n = 500$ times and record the outcomes, denoted $(X_1, Y_1), \dots, (X_n, Y_n)$, with $(X_i, Y_i) \in \{1, \dots, 6\} \times \{1, \dots, 6\}$. (We assume the throws are independent.)

In this setting, the variables X and Y (results from the two dice) are paired.

We test

$H_0$: the dice X and Y are independent
versus
$H_1$: the dice X and Y are not independent

Known marginal distributions

First, assume that we know that both dice are fair. (Each die might have been rigorously tested beforehand over many trials.)

Under the null hypothesis, the dice are independent, so that

$$P((X, Y) = (a, b)) = P(X = a) \, P(Y = b) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}, \quad a, b \in \{1, \dots, 6\}.$$

We can simply apply the chi-squared GOF test to decide whether $Z_1, \dots, Z_n$, where $Z_i = (X_i, Y_i)$, are uniformly distributed over $\{1, \dots, 6\} \times \{1, \dots, 6\}$. After all, the variable Z is just a factor, here with 36 levels, so we are in the one-sample categorical data situation!

Unknown marginal distributions

Now assume that we do not know the distributions of the dice. (This situation is much more common.)

Under the null hypothesis, the dice are independent, so that

$$P((X, Y) = (a, b)) = P(X = a) \, P(Y = b), \quad a, b \in \{1, \dots, 6\}.$$

But now we do not know the marginals $P(X = a)$ or $P(Y = b)$.
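For the known-marginals case, treating each pair as a single factor with 36 levels reduces the problem to a one-sample goodness-of-fit computation against the uniform distribution. A minimal Python sketch follows; the function name `uniform_pair_gof` is hypothetical, and the degrees of freedom returned are the usual one-sample value (number of levels minus one).

```python
from collections import Counter
from itertools import product

def uniform_pair_gof(pairs, faces=6):
    """Chi-squared GOF statistic for pairs against the uniform law on faces^2 cells.

    `pairs` is a list of tuples (x, y) with entries in 1..faces.
    Returns (D, df) with df = faces^2 - 1.
    """
    n = len(pairs)
    expected = n / faces ** 2                     # same expected count in every cell
    counts = Counter(pairs)
    cells = product(range(1, faces + 1), repeat=2)
    D = sum((counts.get(cell, 0) - expected) ** 2 / expected for cell in cells)
    return D, faces ** 2 - 1

# Illustration: every one of the 36 cells observed exactly twice, so D is zero.
balanced = [(a, b) for a in range(1, 7) for b in range(1, 7)] * 2
print(uniform_pair_gof(balanced))
```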

Contingency table

Summary statistics. The joint counts are sufficient and used as summary statistics:

$$N_{s,t} = \#\{i : (X_i, Y_i) = (s, t)\}.$$

They are organized in a matrix, called a contingency table (here with totals), with one row per value of X, one column per value of Y, and a Sum row and a Sum column.

Graphics. The main plots are the segmented barplot, the side-by-side barplot, and the mosaic plot.

Chi-squared goodness-of-fit test

The observed counts are

$$N_{s,t} = \#\{i : (X_i, Y_i) = (s, t)\}.$$

Under the null, X and Y are independent, say with marginals $p$ and $q$, and the expected counts are

$$E(N_{s,t}) = n \, P(X = s, Y = t) = n \, P(X = s) \, P(Y = t) = n \, p_s q_t.$$

The issue is that we do not know the marginals, neither $p$ nor $q$.
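Building the contingency table from the paired observations is a simple counting exercise. The sketch below is an illustrative Python version; the function name `contingency_table` and the small list of pairs are made up, and values are assumed to be coded as 1..S and 1..T.

```python
def contingency_table(pairs, S=6, T=6):
    """Return the S x T table of joint counts N[s][t] with its row and column sums."""
    table = [[0] * T for _ in range(S)]
    for s, t in pairs:
        table[s - 1][t - 1] += 1                  # values are 1-based, indices 0-based
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return table, row_sums, col_sums

# Made-up pairs for a 2 x 2 illustration.
pairs = [(1, 1), (1, 2), (2, 2), (2, 2)]
table, row_sums, col_sums = contingency_table(pairs, S=2, T=2)
print(table, row_sums, col_sums)
```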

The idea is to estimate $p$ and $q$ from the margins. Define the marginal counts

$$N_{s,\cdot} = \#\{i : X_i = s\}, \qquad N_{\cdot,t} = \#\{i : Y_i = t\}$$

as before, and then the estimates

$$\hat p_s = \frac{N_{s,\cdot}}{n}, \qquad \hat q_t = \frac{N_{\cdot,t}}{n}.$$

With $\hat p$ and $\hat q$ defined, we can then obtain the estimated expected counts

$$\hat E(N_{s,t}) = n \hat p_s \hat q_t = \frac{N_{s,\cdot} \, N_{\cdot,t}}{n}.$$

The final step is to compare the observed and estimated expected counts with the usual chi-squared test statistic:

$$D = \sum_{s=1}^{6} \sum_{t=1}^{6} \frac{(N_{s,t} - n \hat p_s \hat q_t)^2}{n \hat p_s \hat q_t}.$$

Theory. Under the null, $D$ has asymptotically ($n \to \infty$) the chi-squared distribution with $(6 - 1)(6 - 1) = 25$ degrees of freedom.

The same methodology extends to testing for independence between two factors with $S$ and $T$ levels, respectively. The margins are used in the same way to estimate the expected counts under the null.

Theory. The resulting test statistic has asymptotically ($n \to \infty$) the chi-squared distribution with $(S - 1)(T - 1)$ degrees of freedom.
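The computation above can be sketched in Python for a general $S \times T$ table. The function name `independence_statistic` and the example tables are assumptions for illustration; the code assumes every row and column sum is positive.

```python
def independence_statistic(table):
    """Chi-squared statistic for independence, with expected counts from the margins.

    `table[s][t]` is the joint count N_{s,t}. Returns (D, df) with
    df = (S - 1)(T - 1).
    """
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]        # N_{s,.}
    col_sums = [sum(col) for col in zip(*table)]  # N_{.,t}
    D = 0.0
    for s, row in enumerate(table):
        for t, observed in enumerate(row):
            expected = row_sums[s] * col_sums[t] / n   # n * p_hat_s * q_hat_t
            D += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return D, df

# Made-up 2 x 2 tables (illustration only).
print(independence_statistic([[10, 20], [20, 10]]))
print(independence_statistic([[2, 4], [3, 6]]))   # rows proportional, so D is 0
```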

Fisher's exact test

R.A. Fisher (a great figure in statistics) developed an exact test for 2 x 2 contingency tables (meaning the two categorical variables are binary).

He tells the following story ("lady tasting tea") to motivate his test. Here is the story (paraphrased):

A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman's guess, the alternative that there is a positive association (that the odds ratio is greater than 1).

The resulting counts are as follows:

                 Guess
Truth     Milk    Tea    Sum
Milk        3      1      4
Tea         1      3      4
Sum         4      4      8

The expected counts are too small to use the chi-squared approximation. What can we do? How can we quantify how accurate the lady's guess is?

Fisher's idea is to fix the margins (meaning the row sums and the column sums), enumerate all the contingency tables with the same margins, and sum the probabilities of all the tables that are at least as extreme as the table that is observed.

Enumerating all the tables with the observed margins is easy, since there is only one degree of freedom left. For example, we can focus on the top left cell, which determines all the other ones.

A table here is at least as extreme as the observed table if its top left cell has a count at least as high (implying a positive association at least as strong).

Suppose we have a general 2 x 2 contingency table

          Y = 1            Y = 0            Sum
X = 1     $N_{11}$         $N_{10}$         $N_{1\cdot}$
X = 0     $N_{01}$         $N_{00}$         $N_{0\cdot}$
Sum       $N_{\cdot 1}$    $N_{\cdot 0}$    $n$

When X and Y are independent, the probability of obtaining such a table, conditioned on having these margins, is

$$\binom{N_{1\cdot}}{N_{11}} \binom{N_{0\cdot}}{N_{01}} \Big/ \binom{n}{N_{\cdot 1}}.$$

Indeed, the top left cell count is hypergeometric.
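The hypergeometric formula and the one-sided summation can be sketched in Python with `math.comb`. The function names are hypothetical, and the numbers in the usage lines (all margins equal to 4, top left count 3) are an assumed illustration consistent with the eight-cup setup.

```python
from math import comb

def table_probability(n11, row1, row0, col1):
    """Probability that the top left cell equals n11, given the margins.

    row1, row0 are the two row sums; col1 is the first column sum.
    """
    n = row1 + row0
    return comb(row1, n11) * comb(row0, col1 - n11) / comb(n, col1)

def fisher_one_sided(n11, row1, row0, col1):
    """Sum the probabilities of all tables whose top left cell is >= n11."""
    upper = min(row1, col1)                       # largest feasible top left count
    return sum(table_probability(k, row1, row0, col1)
               for k in range(n11, upper + 1))

# Assumed illustration: 8 cups, margins all equal to 4, top left count 3.
print(table_probability(3, 4, 4, 4))   # probability of that single table
print(fisher_one_sided(3, 4, 4, 4))    # one-sided p-value
```

Because the margins are fixed, looping over the top left cell alone enumerates every table, which is the single degree of freedom mentioned above.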

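In our example, the probability of the observed table is

$$\binom{4}{3} \binom{4}{1} \Big/ \binom{8}{4} = \frac{16}{70}.$$

There is only one more extreme table

                 Guess
Truth     Milk    Tea    Sum
Milk        4      0      4
Tea         0      4      4
Sum         4      4      8

and it has probability

$$\binom{4}{4} \binom{4}{0} \Big/ \binom{8}{4} = \frac{1}{70}.$$

The p-value is the sum of these:

$$\binom{4}{3} \binom{4}{1} \Big/ \binom{8}{4} + \binom{4}{4} \binom{4}{0} \Big/ \binom{8}{4} = \frac{17}{70} \approx 0.243.$$

Exact testing for general S x T tables

The procedure extends to contingency tables of any dimensions.

Assume the following are given:

row sums: $(m_{s\cdot} : s = 1, \dots, S)$
column sums: $(m_{\cdot t} : t = 1, \dots, T)$

Under independence, and conditional on these marginal sums, the probability of obtaining the table

$$M = (m_{st} : s = 1, \dots, S; \; t = 1, \dots, T)$$

is equal to

$$\left( \prod_{s=1}^{S} m_{s\cdot}! \right) \left( \prod_{t=1}^{T} m_{\cdot t}! \right) \Big/ \left( n! \prod_{s=1}^{S} \prod_{t=1}^{T} m_{st}! \right),$$

where $n$ is the sample size, meaning $n = \sum_s \sum_t m_{st}$.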
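The factorial formula for a general table is easy to evaluate exactly with integer arithmetic. The following Python sketch is an illustration, with the hypothetical name `conditional_table_probability`; it returns an exact fraction, and on a 2 x 2 table it agrees with the hypergeometric expression used for Fisher's test.

```python
from fractions import Fraction
from math import factorial, prod

def conditional_table_probability(table):
    """Probability of `table` given its row and column sums, under independence."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    numerator = (prod(factorial(m) for m in row_sums)
                 * prod(factorial(m) for m in col_sums))
    denominator = factorial(n) * prod(factorial(m) for row in table for m in row)
    return Fraction(numerator, denominator)

# 2 x 2 tables with all margins equal to 4 (assumed illustration).
print(conditional_table_probability([[3, 1], [1, 3]]))   # 8/35, that is 16/70
print(conditional_table_probability([[4, 0], [0, 4]]))   # 1/70
```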

In analogy with Fisher's exact test, we may define a table as being at least as extreme as the one we observe if its probability is at least as small as the probability of the one we observe.

Alternatively, it may be defined as having a test statistic (e.g., Pearson's) at least as extreme as the statistic for the table we observe.

The main issue is computational: enumerating all tables with given margins may be prohibitive, as their number increases very fast with the number of cells and the magnitude of the counts.

Calibration by permutation

Fisher's method is based on the permutation distribution with the margins being fixed.

Under the null hypothesis, $X_i$ and $Y_i$ are independent. In particular, for any permutation $\pi$ of $\{1, \dots, n\}$, the permuted data

$$(X_1, Y_{\pi(1)}), \dots, (X_n, Y_{\pi(n)})$$

has the same distribution as the original data

$$(X_1, Y_1), \dots, (X_n, Y_n).$$

Therefore, under the null, any test statistic

$$D = \Lambda\big[ (X_1, Y_1), \dots, (X_n, Y_n) \big]$$

has the same distribution after permutation, meaning that for any permutation $\pi$,

$$D_\pi = \Lambda\big[ (X_1, Y_{\pi(1)}), \dots, (X_n, Y_{\pi(n)}) \big]$$

has the same distribution as $D$ under the null.

Suppose that we reject for large values of $\Lambda$, and define

$$P = \frac{\#\{\pi : D_\pi \ge D_{\text{obs}}\}}{n!}.$$

$P$ is the fraction of permuted statistics that are at least as extreme as the one we observed.

$P$ is a valid p-value, in the sense that

$$\mathbb{P}_0(P \le p) \le p, \quad \text{for all } p \in (0, 1),$$

where $\mathbb{P}_0$ denotes probability under the null.

In fact, if all the $D_\pi$'s are distinct, then under the null $P$ is uniformly distributed over $\{k/n! : k = 1, \dots, n!\}$.
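For very small samples, the permutation p-value $P$ can be computed exactly by enumerating all $n!$ permutations. The sketch below is an illustration in Python; the function names and the four-observation example are made up, Pearson's statistic is used as one possible choice of $\Lambda$, and a small tolerance guards the comparison against floating-point rounding.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def pearson_statistic(xs, ys):
    """Pearson's chi-squared statistic for independence, over the observed levels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    cx, cy = Counter(xs), Counter(ys)
    D = 0.0
    for s in cx:
        for t in cy:
            expected = cx[s] * cy[t] / n
            D += (joint[(s, t)] - expected) ** 2 / expected
    return D

def exact_permutation_pvalue(xs, ys, statistic=pearson_statistic):
    """Fraction of the n! permutations of ys whose statistic is >= the observed one."""
    d_obs = statistic(xs, ys)
    hits = sum(statistic(xs, perm) >= d_obs - 1e-12 for perm in permutations(ys))
    return hits / factorial(len(ys))

# Made-up sample of size 4 with perfect association (illustration only).
xs = [0, 0, 1, 1]
ys = [0, 0, 1, 1]
print(exact_permutation_pvalue(xs, ys))
```

The identity permutation is always counted, so the result is never smaller than $1/n!$. Enumeration is only practical for roughly ten observations or fewer.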

In practice, the number ($n!$) of permutations is too large to compute $P$ exactly. In that case, we estimate $P$ by Monte Carlo sampling.

For $B$ a large integer, sample $\pi_1, \dots, \pi_B$ iid uniform from the permutations of $\{1, \dots, n\}$ and estimate $P$ by

$$\hat P = \frac{\#\{b : D_{\pi_b} \ge D_{\text{obs}}\} + 1}{B + 1}.$$

It happens that $\hat P$ is also a valid p-value.

The parametric bootstrap

The bootstrap offers an alternative method for obtaining a p-value by simulation. It mimics Monte Carlo simulations, replacing the (unknown) marginals with the estimated marginals.

Assume without loss of generality that X takes values in $\{1, \dots, S\}$ and Y takes values in $\{1, \dots, T\}$. Let $(p_1, \dots, p_S)$ denote the marginal distribution of X and $(q_1, \dots, q_T)$ the marginal distribution of Y.

Let $\hat p_s$ denote the MLE for $p_s$, meaning

$$\hat p_s = \frac{1}{n} \#\{i : X_i = s\}.$$

Let $\hat q_t$ denote the MLE for $q_t$, meaning

$$\hat q_t = \frac{1}{n} \#\{i : Y_i = t\}.$$

Suppose we are rejecting for large values of a test statistic

$$D = \Lambda\big[ (X_1, Y_1), \dots, (X_n, Y_n) \big].$$

Let $B$ be a large integer.

1. For $b = 1, \dots, B$, do the following:
   (a) Generate a sample of size $n$, $X^{(b)}_1, \dots, X^{(b)}_n$, from $(\hat p_1, \dots, \hat p_S)$.
       Generate a sample of size $n$, $Y^{(b)}_1, \dots, Y^{(b)}_n$, from $(\hat q_1, \dots, \hat q_T)$.
   (b) Compute $D_b = \Lambda\big[ (X^{(b)}_1, Y^{(b)}_1), \dots, (X^{(b)}_n, Y^{(b)}_n) \big]$.
2. The estimated p-value is

$$\frac{\#\{b : D_b \ge D_{\text{obs}}\} + 1}{B + 1}.$$
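Both simulation schemes above can be sketched with Python's standard `random` module. This is an illustrative sketch rather than a reference implementation: the function names, the default `B = 999`, the fixed `seed`, and the use of Pearson's statistic for $\Lambda$ are all assumptions made here.

```python
import random
from collections import Counter

def pearson_statistic(xs, ys):
    """Pearson's chi-squared statistic for independence, over the observed levels."""
    n = len(xs)
    joint, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((joint[(s, t)] - cx[s] * cy[t] / n) ** 2 / (cx[s] * cy[t] / n)
               for s in cx for t in cy)

def mc_permutation_pvalue(xs, ys, B=999, statistic=pearson_statistic, seed=0):
    """Monte Carlo estimate of the permutation p-value, with the +1 correction."""
    rng = random.Random(seed)
    d_obs = statistic(xs, ys)
    ys_perm = list(ys)
    hits = 0
    for _ in range(B):
        rng.shuffle(ys_perm)                      # a uniformly random permutation of ys
        hits += statistic(xs, ys_perm) >= d_obs - 1e-12
    return (hits + 1) / (B + 1)

def bootstrap_pvalue(xs, ys, B=999, statistic=pearson_statistic, seed=0):
    """Parametric bootstrap p-value: X and Y drawn independently from the
    estimated marginals (p_hat, q_hat)."""
    rng = random.Random(seed)
    n = len(xs)
    d_obs = statistic(xs, ys)
    cx, cy = Counter(xs), Counter(ys)
    x_levels, x_weights = list(cx), list(cx.values())   # weights proportional to p_hat
    y_levels, y_weights = list(cy), list(cy.values())   # weights proportional to q_hat
    hits = 0
    for _ in range(B):
        xb = rng.choices(x_levels, weights=x_weights, k=n)
        yb = rng.choices(y_levels, weights=y_weights, k=n)
        hits += statistic(xb, yb) >= d_obs - 1e-12
    return (hits + 1) / (B + 1)

# Made-up sample with perfect association (illustration only).
xs = [0] * 20 + [1] * 20
ys = [0] * 20 + [1] * 20
print(mc_permutation_pvalue(xs, ys, B=199), bootstrap_pvalue(xs, ys, B=199))
```

The permutation version keeps the observed margins fixed in every replicate, whereas the bootstrap version lets the margins vary from one replicate to the next.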
