Categorical Variables and Contingency Tables: Description and Inference


STAT 526, Professor Olga Vitek
March 3, 2011
Reading: Agresti Ch. 1, 2 and 3; Faraway Ch. 4

Univariate Binomial and Multinomial Measurements

Binomial Distribution

Probability distribution: $Y_1, Y_2, \ldots, Y_n$ iid Bernoulli($\pi$), so that $Y = \sum_{i=1}^{n} Y_i \sim \text{Binomial}(n, \pi)$
$$p(y) = \binom{n}{y} \pi^{y} (1-\pi)^{n-y}, \qquad \mu = E(Y) = n\pi, \qquad \sigma^2 = \mathrm{var}(Y) = n\pi(1-\pi)$$
Log-likelihood: $L(\pi) = y\log(\pi) + (n-y)\log(1-\pi)$
Maximum Likelihood Estimator: $\hat\pi = y/n$, with $E(\hat\pi) = \pi$ and $SE(\hat\pi) = \sqrt{\pi(1-\pi)/n}$

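Large-sample tests for $\pi$

For a known $\pi_0$, test $H_0: \pi = \pi_0$ vs $H_a: \pi \neq \pi_0$

Wald test:
$$z_W = \frac{\hat\pi - \pi_0}{SE} = \frac{\hat\pi - \pi_0}{\sqrt{\hat\pi(1-\hat\pi)/n}} \overset{H_0,\ \text{approx}}{\sim} N(0, 1)$$
Likelihood ratio test:
$$z_L = 2(L_1 - L_0) = 2\left( y\log\frac{\hat\pi}{\pi_0} + (n-y)\log\frac{1-\hat\pi}{1-\pi_0} \right) \overset{H_0,\ \text{approx}}{\sim} \chi^2_1$$
Score test:
$$z_S = \frac{\hat\pi - \pi_0}{SE_0} = \frac{\hat\pi - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \overset{H_0,\ \text{approx}}{\sim} N(0, 1)$$
The score statistic is closer to $N(0, 1)$ than the Wald statistic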
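The three statistics above can be computed directly from their definitions. The following is a minimal sketch in R; the values of `y`, `n` and `pi0` are made up for illustration and do not come from the slides.

```r
# Hypothetical data (illustrative only): y successes in n trials, null value pi0
y <- 42; n <- 100; pi0 <- 0.5
pi_hat <- y / n

# Wald: standard error evaluated at the estimate
z_wald <- (pi_hat - pi0) / sqrt(pi_hat * (1 - pi_hat) / n)

# Score: standard error evaluated at the null value
z_score <- (pi_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Likelihood ratio: 2 * (L1 - L0), referred to chi-squared with 1 df
lr <- 2 * (y * log(pi_hat / pi0) + (n - y) * log((1 - pi_hat) / (1 - pi0)))

p_values <- c(
  wald  = 2 * pnorm(-abs(z_wald)),
  score = 2 * pnorm(-abs(z_score)),
  lr    = pchisq(lr, df = 1, lower.tail = FALSE)
)
print(p_values)
```

The squared score statistic should agree with the `X-squared` value from `prop.test(y, n, p = pi0, correct = FALSE)`.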

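Large-sample CI for $\pi$

Based on the Wald test statistic:
$$\hat\pi \pm z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}$$
Performs poorly unless $n$ is large

Based on the score test statistic:
$$\hat\pi\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right) \pm z_{\alpha/2}\sqrt{\frac{1}{n + z_{\alpha/2}^2}\left[\hat\pi(1-\hat\pi)\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\cdot\frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right)\right]}$$
Performs better than Wald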
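As a rough illustration of the two intervals, here is a sketch in R that evaluates both formulas for a small sample. The counts `y` and `n` are hypothetical; the weight `w` is just shorthand for $n/(n + z_{\alpha/2}^2)$ introduced for this example.

```r
# Hypothetical data (illustrative only)
y <- 8; n <- 20; alpha <- 0.05
pi_hat <- y / n
z <- qnorm(1 - alpha / 2)

# Wald interval: estimate +/- z * estimated standard error
wald_ci <- pi_hat + c(-1, 1) * z * sqrt(pi_hat * (1 - pi_hat) / n)

# Score interval: weighted average of pi_hat and 1/2 as the midpoint
w <- n / (n + z^2)                      # weight on pi_hat; 1 - w is the weight on 1/2
center <- pi_hat * w + 0.5 * (1 - w)
half_width <- z * sqrt((pi_hat * (1 - pi_hat) * w + 0.25 * (1 - w)) / (n + z^2))
score_ci <- center + c(-1, 1) * half_width

print(rbind(wald = wald_ci, score = score_ci))
```

In base R, `prop.test(y, n, correct = FALSE)$conf.int` should return the same score interval, which gives a quick cross-check of the hand computation.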

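Multinomial Distribution

Probability distribution: for observation $i$, $(Y_{i1}, \ldots, Y_{ic})$ with $Y_{ij} = 1$ if the observation falls in category $j$, and 0 otherwise
Counts: $n_j = \sum_{i=1}^{n} Y_{ij}$, $(n_1, \ldots, n_c) \sim \text{Multinomial}(n; \pi_1, \ldots, \pi_c)$, $n = \sum_{j=1}^{c} n_j$
$$p(n_1, n_2, \ldots, n_{c-1}) = \left(\frac{n!}{n_1!\, n_2! \cdots n_c!}\right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c}$$
$E(n_j) = n\pi_j$, $\mathrm{var}(n_j) = n\pi_j(1-\pi_j)$, $\mathrm{cov}(n_j, n_k) = -n\pi_j\pi_k$
Log-likelihood: $L(\pi) = \sum_{j=1}^{c} n_j \log \pi_j$
Maximum Likelihood Estimator: $\hat\pi_j = n_j/n$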

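Large-Sample Test for $(\pi_1, \ldots, \pi_c)$

For known $(\pi_{10}, \pi_{20}, \ldots, \pi_{c0})$, test $H_0: \pi_j = \pi_{j0}$ for all $j$ vs $H_a: \pi_j \neq \pi_{j0}$ for some $j$

Pearson test:
$$X^2 = \sum_{j=1}^{c} \frac{(O_j - E_{j0})^2}{E_{j0}} = \sum_{j=1}^{c} \frac{(n_j - n\pi_{j0})^2}{n\pi_{j0}} \overset{H_0,\ \text{approx}}{\sim} \chi^2_{c-1}$$
E.g. in genetics: test theories of trait inheritance

Likelihood ratio test:
$$G^2 = 2(L_1 - L_0) = 2\sum_{j=1}^{c} n_j \log\left(\frac{n_j}{n\pi_{j0}}\right) \overset{H_0,\ \text{approx}}{\sim} \chi^2_{c-1}$$
The two tests are asymptotically equivalent when $H_0$ is true. For $n/c < 5$, $X^2$ converges faster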
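A small sketch of both statistics in R follows. The observed counts and the 9:3:3:1 null ratio are invented for illustration (loosely in the spirit of the genetics remark above); they are not data from the slides.

```r
# Hypothetical counts in c = 4 categories and a hypothesized 9:3:3:1 ratio
n_obs <- c(90, 35, 28, 7)
pi0 <- c(9, 3, 3, 1) / 16

n <- sum(n_obs)
expected <- n * pi0                     # E_j0 = n * pi_j0

X2 <- sum((n_obs - expected)^2 / expected)      # Pearson statistic
G2 <- 2 * sum(n_obs * log(n_obs / expected))    # likelihood ratio statistic
df <- length(n_obs) - 1

p_X2 <- pchisq(X2, df, lower.tail = FALSE)
p_G2 <- pchisq(G2, df, lower.tail = FALSE)
print(c(X2 = X2, G2 = G2, p_X2 = p_X2, p_G2 = p_G2))
```

The Pearson value can be compared against `chisq.test(n_obs, p = pi0)`; note that the $G^2$ formula as written needs all $n_j > 0$.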

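Poisson Distribution

Probability distribution: $Y$ = number of events in a fixed interval of space/time, $Y \sim \text{Poisson}(\mu)$
$$p(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \quad y = 0, 1, \ldots; \qquad E(Y) = \mathrm{var}(Y) = \mu$$
If $Y_1, Y_2, \ldots, Y_c$ are independent Poisson($\mu_i$), then $\sum_{i=1}^{c} Y_i \sim \text{Poisson}\left(\sum_{i=1}^{c} \mu_i\right)$

$c$ independent Poisson random variables, conditional on their total, are Multinomial:
$$P\Big(Y_1 = n_1, \ldots, Y_c = n_c \,\Big|\, \sum_i Y_i = n\Big) = \frac{P(Y_1 = n_1, \ldots, Y_c = n_c)}{P(\sum_i Y_i = n)} = \frac{\prod_i \left[\exp(-\mu_i)\mu_i^{n_i}/n_i!\right]}{\exp(-\sum_i \mu_i)\left(\sum_i \mu_i\right)^{n}/n!} = \frac{n!}{\prod_i n_i!}\prod_i \pi_i^{n_i}, \qquad \pi_i = \frac{\mu_i}{\sum_k \mu_k}$$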
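The Poisson-to-multinomial identity can be checked numerically. This sketch uses arbitrary means `mu` and counts `counts` (both made up) and relies only on the base R density functions `dpois` and `dmultinom`.

```r
# Hypothetical Poisson means and one possible vector of counts
mu <- c(2, 3, 5)
counts <- c(1, 4, 2)

# Left side: joint Poisson probability divided by the probability of the total
lhs <- prod(dpois(counts, mu)) / dpois(sum(counts), sum(mu))

# Right side: multinomial probability with pi_i = mu_i / sum(mu)
rhs <- dmultinom(counts, prob = mu / sum(mu))

print(all.equal(lhs, rhs))
```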

2-Way Contingency Tables

Contingency Tables

Contingency table = classification table: frequencies of outcomes
Two-way table: frequencies of outcomes of two categorical variables
$I \times J$ table: a table with $I$ rows and $J$ columns
Contingency tables can arise from several sampling schemes
Inference depends on the sampling scheme
Example: lung cancer status (columns: Cases, Controls, Total) cross-classified by smoking (rows: Yes, No, Total)

Joint Distribution and Independence

Underlying probability distribution of $X$ (smoking) and $Y$ (cancer)
Joint distribution: $\pi_{ij}$, probability of cell $(i, j)$
Marginal distribution: $\pi_{i+} = \sum_{j=1}^{J} \pi_{ij}$, probability of row $i$; $\pi_{+j} = \sum_{i=1}^{I} \pi_{ij}$, probability of column $j$
Conditional distribution: $\pi_{j|i} = \pi_{ij}/\pi_{i+}$, distribution of $j$ given $i$
Independence: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$

Multinomial Sampling

The total sample size $n$ is fixed, but the row and column totals are not
$X$ and $Y$ are treated equally: $P(X = i, Y = j) = \pi_{ij}$, $i = 1, \ldots, I$; $j = 1, \ldots, J$
Describe associations through the joint distribution
This is the multinomial distribution again, with $IJ$ categories
Likelihood and log-likelihood:
$$\text{Likelihood} = \frac{n!}{n_{11}! \cdots n_{IJ}!}\prod_{i=1}^{I}\prod_{j=1}^{J} \pi_{ij}^{n_{ij}}, \qquad L = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\log(\pi_{ij}) + \text{constant}$$

Multinomial Sampling: Testing for Independence

Hypotheses:
$H_0$ (reduced model): $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$
$H_a$ (full model): $\pi_{ij} \neq \pi_{i+}\pi_{+j}$ for some $i$ and $j$

Pearson $\chi^2$ test:
$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \overset{H_0,\ \text{approx}}{\sim} \chi^2_{(I-1)(J-1)}$$
$O_{ij} = n_{ij}$, $E_{ij} = n\hat\pi_{i+}\hat\pi_{+j} = n_{i+}n_{+j}/n$
Df $= (IJ - 1) - (I - 1) - (J - 1) = (I-1)(J-1)$

Likelihood ratio test:
Full model: $\hat\pi_{ij} = n_{ij}/n_{++}$
Reduced model: $\hat\pi_{i+} = n_{i+}/n_{++}$; $\hat\pi_{+j} = n_{+j}/n_{++}$
$$G^2 = 2(L_1 - L_0) = 2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\log\frac{n_{ij}\,n_{++}}{n_{i+}\,n_{+j}} \overset{H_0,\ \text{approx}}{\sim} \chi^2_{(I-1)(J-1)}$$

Independent (or Product) Multinomial Sampling

The row totals $n_{i+}$, $i = 1, \ldots, I$, are fixed
E.g., $X$ is an explanatory variable, and the response $Y$ occurs separately at each setting of $X$
View the categorical response as a function of a categorical predictor
Describe associations in terms of conditional distributions: $P(Y = j \mid X = i) = \pi_{j|i}$, $i = 1, \ldots, I$; $j = 1, \ldots, J$
For a fixed $i$, $\{n_{ij},\ j = 1, \ldots, J\}$ follow a multinomial distribution:
$$f(n_{i1}, \ldots, n_{iJ}) = \frac{n_{i+}!}{n_{i1}! \cdots n_{iJ}!}\prod_{j=1}^{J} \pi_{j|i}^{n_{ij}}$$

Compare Proportions: Independent Multinomial Sampling

$H_0: \pi_1 = \pi_2$ vs $H_a: \pi_1 \neq \pi_2$
ML estimate of the difference: $\hat\pi_1 - \hat\pi_2 = \dfrac{y_1}{n_1} - \dfrac{y_2}{n_2}$
$$SE(\hat\pi_1 - \hat\pi_2) = \left[\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right]^{1/2}$$
Wald confidence interval: $\hat\pi_1 - \hat\pi_2 \pm z_{\alpha/2}\,\widehat{SE}(\hat\pi_1 - \hat\pi_2)$
Replace $\pi$ with $\hat\pi$ to estimate the SE
Usually too narrow
Better methods (e.g. delta method) exist
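The Wald interval for a difference of proportions is short enough to compute by hand. The sketch below uses made-up counts `y1`, `n1`, `y2`, `n2`.

```r
# Hypothetical data: successes and sample sizes in two independent groups
y1 <- 30; n1 <- 50
y2 <- 20; n2 <- 50
alpha <- 0.05

p1 <- y1 / n1
p2 <- y2 / n2
diff_hat <- p1 - p2

# Estimated SE: plug the sample proportions into the SE formula
se_hat <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
wald_ci <- diff_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se_hat

print(c(estimate = diff_hat, lower = wald_ci[1], upper = wald_ci[2]))
```

For two groups, `prop.test(c(y1, y2), c(n1, n2), correct = FALSE)$conf.int` should reproduce this interval.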

Testing for Independence of Rows and Columns: Independent Multinomial Sampling

Independence in this context is often called homogeneity of the conditional distributions
$X$ and $Y$ are independent $\Leftrightarrow$ $\pi_{j|1} = \cdots = \pi_{j|I}$ for all $j$
Can interpret the independence in terms of the product of marginal probabilities:
$\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$ $\Leftrightarrow$ $\pi_{j|1} = \cdots = \pi_{j|I}$ for all $j$
($\Rightarrow$) $\pi_{j|i} = \pi_{ij}/\pi_{i+} = (\pi_{i+}\pi_{+j})/\pi_{i+} = \pi_{+j}$
($\Leftarrow$) Let $\pi_{j|i} = a_j$; then $\pi_{+j} = \sum_{i=1}^{I}\pi_{ij} = \sum_{i=1}^{I}\pi_{i+}a_j = a_j$, so $\pi_{ij} = \pi_{i+}a_j = \pi_{i+}\pi_{+j}$

Testing for Independence of Rows and Columns

Test the homogeneity of conditional distributions

                          Column
    Row      1                            ...   J                            Total
    1        $\pi_{11}$ ($\pi_{1|1}$)     ...   $\pi_{1J}$ ($\pi_{J|1}$)     $\pi_{1+}$
    ...
    I        $\pi_{I1}$ ($\pi_{1|I}$)     ...   $\pi_{IJ}$ ($\pi_{J|I}$)     $\pi_{I+}$
    Total    $\pi_{+1}$                   ...   $\pi_{+J}$                   $\pi_{++}$

Consider the new notation: $\pi_j(x) = P(Y = j \mid X = x)$
Although the interpretation is different, use the same Pearson $X^2$ test and the LR test

Test for Independence: Odds Ratio

Odds ratio:
$$\theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \frac{P(Y=1 \mid X=1)/P(Y=2 \mid X=1)}{P(Y=1 \mid X=2)/P(Y=2 \mid X=2)} = \frac{P(X=1 \mid Y=1)/P(X=2 \mid Y=1)}{P(X=1 \mid Y=2)/P(X=2 \mid Y=2)}$$
Equally valid for prospective (conditional on $X$), retrospective (conditional on $Y$) and cross-sectional (multinomial) sampling designs
MLE:
$$\hat\theta = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$$
When some $n_{ij} = 0$, $\hat\theta$ is not a good estimator. It is improved by adding 0.5 to each cell count:
$$\tilde\theta = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}$$

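Test for Independence: Odds Ratio

$X$ and $Y$ are independent $\Leftrightarrow$ $\theta = \dfrac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = 1$
To check, substitute $\pi_{ij} = \pi_{i+}\pi_{+j}$ in the formula above
Asymptotically, $\log\hat\theta \sim N(\log(\theta), \hat\sigma^2)$, where
$$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$$
Large-sample CI for $\log\theta$: $\log\hat\theta \pm z_{\alpha/2}\,\widehat{SE}(\log\hat\theta) = [L, U]$
Large-sample CI for $\theta$: $[e^{L}, e^{U}]$. Usually too wide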
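As an illustration of the estimator and its large-sample interval, here is a sketch in R on a made-up 2 x 2 table `tab` (the counts are hypothetical and chosen so that the sample odds ratio is a round number).

```r
# Hypothetical 2 x 2 table: rows are levels of X, columns are levels of Y
tab <- matrix(c(20, 10,
                12, 18), nrow = 2, byrow = TRUE)
alpha <- 0.05

# Sample odds ratio (cross-product ratio) and the 0.5-adjusted version
theta_hat <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
theta_adj <- ((tab[1, 1] + 0.5) * (tab[2, 2] + 0.5)) /
             ((tab[1, 2] + 0.5) * (tab[2, 1] + 0.5))

# Estimated SE of log(theta_hat): square root of the sum of reciprocal counts
se_log <- sqrt(sum(1 / tab))

# CI on the log scale, then exponentiate the endpoints
log_ci <- log(theta_hat) + c(-1, 1) * qnorm(1 - alpha / 2) * se_log
theta_ci <- exp(log_ci)

print(c(theta_hat = theta_hat, theta_adj = theta_adj,
        lower = theta_ci[1], upper = theta_ci[2]))
```

Because the interval is built on the log scale, it is symmetric around $\log\hat\theta$ but not around $\hat\theta$ itself.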

Poisson Sampling

Observe a process over a period of time, and record the number of occurrences
No fixed quantities
Denote by $Y_{ij}$ the count in cell $(i, j)$. Poisson sampling assumes that the $Y_{ij}$ are independent Poisson($\pi_{ij}$), so that
$$\sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij} \sim \text{Poisson}\left(\sum_{i=1}^{I}\sum_{j=1}^{J} \pi_{ij}\right)$$
The hypothesis of independence of $X$ and $Y$ has the form $\log(\pi_{ij}) = \lambda + \alpha_i + \beta_j$
This is the log-linear model of independence for two-way contingency tables
Under independence, the log of the expected cell count, $\log(\mu_{ij})$ with $\mu_{ij} = \pi_{ij}$ here, is an additive function of a row effect $\alpha_i$ and a column effect $\beta_j$
Since we don't have a replicate table, the model with the interaction is saturated

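Poisson Sampling

An additive model $\log \pi_{ij} = \mu + \alpha_i + \beta_j$ implies the independence of the margins. The cell probabilities are
$$\pi_{ij} = \frac{E(\text{count})}{\text{sum of all } E(\text{count})} = \frac{e^{\mu + \alpha_i + \beta_j}}{e^{\mu}\left(\sum_i e^{\alpha_i}\right)\left(\sum_j e^{\beta_j}\right)} = \pi_{i+}\pi_{+j},$$
where $\pi_{i+} = e^{\alpha_i}/\sum_i e^{\alpha_i} = \sum_j \pi_{ij}$ and $\pi_{+j} = e^{\beta_j}/\sum_j e^{\beta_j} = \sum_i \pi_{ij}$
Test for independence: Pearson $X^2$ or LR test as before (more on this later)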
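One way to see the additive log-linear model in action is to fit it with `glm(..., family = poisson)` and compare the fitted counts with $n_{i+}n_{+j}/n$. This is a sketch on a made-up 2 x 3 table; the data frame `d` and its column names are assumptions for the example.

```r
# Hypothetical 2 x 3 table in long format (expand.grid varies `row` fastest)
d <- expand.grid(row = factor(1:2), col = factor(1:3))
d$y <- c(20, 30, 25, 25, 15, 35)

# Log-linear model of independence: log(mean) = intercept + row effect + column effect
fit <- glm(y ~ row + col, family = poisson, data = d)

# Expected counts under independence, computed directly from the margins
tab <- xtabs(y ~ row + col, data = d)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Likelihood ratio statistic G^2 against the saturated model
G2 <- 2 * sum(tab * log(tab / expected))

print(cbind(fitted = fitted(fit), by_hand = as.vector(expected)))
print(c(deviance = deviance(fit), G2 = G2, df = df.residual(fit)))
```

The residual deviance of the additive model plays the role of $G^2$, with residual degrees of freedom $(I-1)(J-1)$.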

Hypergeometric Sampling

Both row and column margins are fixed
When $X$ and $Y$ are independent, the table $\{n_{ij}\}$, given the row and column margins, follows a hypergeometric distribution with probability
$$\frac{\left(\prod_{i=1}^{I} n_{i+}!\right)\left(\prod_{j=1}^{J} n_{+j}!\right)}{n_{++}!\,\prod_{i=1}^{I}\prod_{j=1}^{J} n_{ij}!}$$
The distribution is parameter free
For a $2 \times 2$ table:
$$P(n_{11} = k) = \frac{\binom{n_{1+}}{k}\binom{n_{2+}}{n_{+1} - k}}{\binom{n_{++}}{n_{+1}}}, \qquad \max(0,\ n_{1+} + n_{+1} - n) \le k \le \min(n_{1+}, n_{+1})$$
Fisher's exact test: p-value = total probability of all outcomes at least as extreme as the one observed. The p-value takes only discrete values for small samples
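The 2 x 2 case can be reproduced with `dhyper`. The sketch below uses a made-up table `tab` and takes "at least as extreme" to mean "no more probable than the observed table", which is one common two-sided convention; the small relative tolerance guards against floating-point ties.

```r
# Hypothetical 2 x 2 table with small counts
tab <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE)

n1p <- sum(tab[1, ])    # first row total
n2p <- sum(tab[2, ])    # second row total
np1 <- sum(tab[, 1])    # first column total

# Support of n11 given the margins, and its hypergeometric probabilities
k <- max(0, n1p + np1 - sum(tab)):min(n1p, np1)
probs <- dhyper(k, n1p, n2p, np1)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_obs <- dhyper(tab[1, 1], n1p, n2p, np1)
p_value <- sum(probs[probs <= p_obs * (1 + 1e-7)])

print(data.frame(n11 = k, prob = probs))
print(p_value)
```

With so few possible tables, the p-value can only take a handful of values, which is the discreteness noted above.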

Case study: Agresti p.80

    # read the data
    X <- data.frame(y=c(178, 138, 108, 570, 648, 442, 138, 252, 252),
                    belief=rep(c("1-Fundam", "2-Moder", "3-Liber"), 3),
                    degree=rep(c("1-<HS", "2-HS", "3-BS/grad"), 1, each=3) )

    # a table of observed values (ov)
    ov <- xtabs(y ~ degree+belief, data=X)
    > ov
               belief
    degree      1-Fundam 2-Moder 3-Liber
      1-<HS          178     138     108
      2-HS           570     648     442
      3-BS/grad      138     252     252

    # export the table into latex
    # (prints a LaTeX table environment whose tabular has columns
    #  1-Fundam, 2-Moder, 3-Liber and one row per degree level)
    library(xtable)
    xtable(ov)

Data visualization

    # dotchart
    dotchart(t(ov), xlab="Observed counts")

[Figure: dot chart of the observed counts, grouped by degree (1-<HS, 2-HS, 3-BS/grad) with the three belief levels (1-Fundam, 2-Moder, 3-Liber) within each group; x-axis: Observed counts]

Data visualization

    # mosaic plot
    mosaicplot(ov, color=TRUE)

[Figure: mosaic plot of ov; x-axis: degree (1-<HS, 2-HS, 3-BS/grad); y-axis: belief (1-Fundam, 2-Moder, 3-Liber)]

2 x 2 table: Compare Proportions

Independent multinomial sampling: restrictions on the rows
Compare proportions of columns, given rows
prop.test also implements the Pearson $X^2$ test with Yates correction for small samples (from each O-E, subtract 0.5 if positive, and add 0.5 if negative)

    > prop.test(ov[1:2,1:2])
    2-sample test for equality of proportions with continuity correction
    data: ov[1:2, 1:2]
    alternative hypothesis: two.sided

The output reports X-squared with df = 1 and its p-value, a 95 percent confidence interval for the difference, and the sample estimates prop 1 and prop 2.

    #-----double-check the proportions
    > 178/(178 + 138)
    > 570/(570 + 648)

2 x 2 table: Hypergeometric Sampling

Conditional on both margins
Hypergeometric test: compare distributions of counts within the 4 cells
$H_0$ is specified in terms of OR = 1
Produces a CI for the OR

    > fisher.test(ov[1:2,1:2])
    Fisher's Exact Test for Count Data
    data: ov[1:2, 1:2]
    alternative hypothesis: true odds ratio is not equal to 1

The output reports the p-value, a 95 percent confidence interval for the odds ratio, and the sample estimate of the odds ratio.

I x J table: Pearson $X^2$

(Independent) multinomial sampling: restrictions on a margin, or on the total
$H_0$ in terms of independence of rows and columns

    > summary(ov)
    Call: xtabs(formula = y ~ degree + belief, data = X)
    Number of cases in table: 2726
    Number of factors: 2
    Test for independence of all factors:
        Chisq = 69.16, df = 4, p-value = 3.42e-14

Pearson residuals: divide the raw residual by $\widehat{SE}(n_{ij})$ in Poisson sampling
$$e_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\hat\mu_{ij}^{1/2}}$$
Standardized Pearson residuals: divide the raw residual by $\widehat{SE}(\text{residual})$ in Poisson sampling
$$e_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}(1 - p_{i+})(1 - p_{+j})}}$$
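The `Chisq` value reported by `summary(ov)` can be reproduced from the formulas for $X^2$ and $G^2$. The sketch below re-enters the belief-by-degree counts as a plain matrix so that it runs on its own; the object names are chosen for this example.

```r
# Belief-by-degree counts from the case study, re-entered as a matrix
ov <- matrix(c(178, 138, 108,
               570, 648, 442,
               138, 252, 252),
             nrow = 3, byrow = TRUE,
             dimnames = list(degree = c("1-<HS", "2-HS", "3-BS/grad"),
                             belief = c("1-Fundam", "2-Moder", "3-Liber")))

# Expected counts under independence: n_i+ * n_+j / n
expected <- outer(rowSums(ov), colSums(ov)) / sum(ov)

X2 <- sum((ov - expected)^2 / expected)      # Pearson statistic
G2 <- 2 * sum(ov * log(ov / expected))       # likelihood ratio statistic
df <- (nrow(ov) - 1) * (ncol(ov) - 1)

print(c(X2 = X2, G2 = G2, df = df))
print(pchisq(c(X2, G2), df, lower.tail = FALSE))
```

`X2` should match the 69.16 shown above up to rounding, and `chisq.test(ov)` gives the same statistic.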

Visualizing the association

    # --Compute Pearson and standardized Pearson residuals ---
    # expected counts under independence: (row totals) x (column totals) / n
    e <- apply(ov, 1, sum) %*% t(apply(ov, 2, sum)) / sum(ov)
    pearsonresid <- (ov - e)/sqrt(e)
    # 1 - p_{i+} and 1 - p_{+j}
    prow <- 1-apply(ov, 1, sum) / sum(ov)
    pcol <- 1-apply(ov, 2, sum) / sum(ov)
    standpearsonresid <- pearsonresid/ sqrt(prow %*% t(pcol) )
    dotchart( t(standpearsonresid) )
    # reference lines at +/- 2
    abline(v=c(-2,2))

[Figure: dot chart of the standardized Pearson residuals, grouped by degree (1-<HS, 2-HS, 3-BS/grad) with the three belief levels within each group, and vertical reference lines at -2 and 2; x-axis: Standardized Pearson Residuals]
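As a cross-check on the manual residual computation, base R's `chisq.test` returns both kinds of residuals in its result (`$residuals` for Pearson and `$stdres` for standardized). The sketch below re-enters the counts as a matrix so it runs on its own, and compares the two routes.

```r
# Belief-by-degree counts from the case study, re-entered as a matrix
ov <- matrix(c(178, 138, 108,
               570, 648, 442,
               138, 252, 252),
             nrow = 3, byrow = TRUE,
             dimnames = list(degree = c("1-<HS", "2-HS", "3-BS/grad"),
                             belief = c("1-Fundam", "2-Moder", "3-Liber")))

fit <- chisq.test(ov)

# Manual versions, following the residual formulas
e <- outer(rowSums(ov), colSums(ov)) / sum(ov)
manual_pearson <- (ov - e) / sqrt(e)
manual_std <- manual_pearson /
  sqrt(outer(1 - rowSums(ov) / sum(ov), 1 - colSums(ov) / sum(ov)))

print(round(fit$stdres, 2))
print(all.equal(as.vector(manual_std), as.vector(fit$stdres)))
```

Cells whose standardized residual falls outside roughly (-2, 2) are the ones contributing most to the lack of independence.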

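Ordered Categories

Ordered categories have more info
Assign scores to categories
Rows: $(u_1, \ldots, u_I)$, e.g. $(1, \ldots, I)$
Cols: $(v_1, \ldots, v_J)$, e.g. $(1, \ldots, J)$
$H_0: \mathrm{cor}(u, v) = 0$ vs $H_a: \mathrm{cor}(u, v) \neq 0$
Study the linear trend:
$$r = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} (u_i - \bar u)(v_j - \bar v)\,n_{ij}}{\sqrt{\left[\sum_{i=1}^{I}\sum_{j=1}^{J} (u_i - \bar u)^2 n_{ij}\right]\left[\sum_{i=1}^{I}\sum_{j=1}^{J} (v_j - \bar v)^2 n_{ij}\right]}}$$
$$\bar u = \sum_{i=1}^{I}\sum_{j=1}^{J} u_i n_{ij}/n; \qquad \bar v = \sum_{i=1}^{I}\sum_{j=1}^{J} v_j n_{ij}/n; \qquad M^2 = (n-1)r^2 \overset{H_0}{\sim} \chi^2_1$$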

Case Study: Ordered Categories

    # existing implementation
    > library(coin)
    > lbl_test(as.table(ov))
    Asymptotic Linear-by-Linear Association Test
    data: belief (ordered) by degree (1-<HS < 2-HS < 3-BS/grad)
    df = 1, p-value = 6.939e-14

    # manually
    # centered row scores u (degree) and column scores v (belief)
    u <- as.vector(scale(1:3, center=sum(c(1:3)*ov)/sum(ov), scale=FALSE))
    v <- as.vector(scale(1:3, center=sum(t(ov)*c(1:3))/sum(ov), scale=FALSE))
    # correlation between the scores, weighted by the cell counts
    r <- sum(u%*%t(v)*ov) / sqrt(sum(u^2*ov) * sum(t(ov) * v^2))
    M2 <- (sum(ov) - 1) * r^2
    # upper-tail p-value from chi-squared with 1 df
    > 1-pchisq(M2, 1, lower.tail=TRUE)

2x2 tables: Matched Pairs

Repeated measurements on the same subjects
Ask the same people the same question twice
Goal: compare proportions
Absence of association cannot be interpreted as independence
Example (Agresti Ch. 10.1): approval of the President's performance, one month apart, for the same sample of Americans

                        2nd Survey
    1st Survey      Approve   Disapprove
    Approve             794          150
    Disapprove           86          570

$H_0$: marginal homogeneity, $\pi_{1+} = \pi_{+1}$
$\delta = \pi_{+1} - \pi_{1+} = (\pi_{11} + \pi_{21}) - (\pi_{11} + \pi_{12}) = \pi_{21} - \pi_{12}$
Equivalent to testing table symmetry

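Large-sample test and CI

$\hat\delta = p_{+1} - p_{1+} = p_{2+} - p_{+2}$
$$\mathrm{var}(\hat\delta) = \left[\pi_{1+}(1 - \pi_{1+}) + \pi_{+1}(1 - \pi_{+1}) - 2(\pi_{11}\pi_{22} - \pi_{12}\pi_{21})\right]/n$$
When the two responses are positively associated ($\pi_{11}\pi_{22} > \pi_{12}\pi_{21}$), the variance is smaller than in independent samples, so the matched design is more efficient
Estimated variance: $\widehat{\mathrm{var}}(\hat\delta) = \left[(p_{12} + p_{21}) - (p_{12} - p_{21})^2\right]/n$
CI: $\hat\delta \pm z_{\alpha/2}\,\widehat{SE}(\hat\delta)$
Wald test, with the SE evaluated under $H_0$ ($\pi_{12} = \pi_{21}$):
$$z = \frac{\hat\delta}{\widehat{SE}(\hat\delta)} = \frac{n_{21} - n_{12}}{(n_{21} + n_{12})^{1/2}}, \qquad z^2 \overset{H_0}{\sim} \chi^2_1 \quad \text{(called the McNemar test)}$$
Only depends on the counts outside of the diagonal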
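These quantities can be worked out by hand for the President approval counts (794, 150, 86, 570, entered column by column as in the slides' `matrix()` call). The following is a sketch; variable names such as `delta_hat` are chosen for this example, and no continuity correction is applied.

```r
# President approval counts: rows = 1st survey, columns = 2nd survey
tab <- matrix(c(794, 86, 150, 570), nrow = 2,
              dimnames = list(first  = c("Approve", "Disapprove"),
                              second = c("Approve", "Disapprove")))
n <- sum(tab)
n12 <- tab[1, 2]    # approve first, disapprove second
n21 <- tab[2, 1]    # disapprove first, approve second
p12 <- n12 / n
p21 <- n21 / n

# Difference of marginal proportions p_{+1} - p_{1+}, and its estimated SE
delta_hat <- p21 - p12
se_hat <- sqrt(((p12 + p21) - (p12 - p21)^2) / n)
ci <- delta_hat + c(-1, 1) * qnorm(0.975) * se_hat

# McNemar statistic without continuity correction
z <- (n21 - n12) / sqrt(n21 + n12)
p_value <- pchisq(z^2, df = 1, lower.tail = FALSE)

print(c(delta_hat = delta_hat, lower = ci[1], upper = ci[2]))
print(c(z = z, z_squared = z^2, p_value = p_value))
```

`mcnemar.test(tab, correct = FALSE)` should return the same `z^2`; the default call applies a continuity correction and therefore gives a slightly smaller statistic.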

President Approval Example

    # Read the data
    Performance <- matrix(c(794, 86, 150, 570), nrow = 2,
                          dimnames = list("1st Survey" = c("Approve", "Disapprove"),
                                          "2nd Survey" = c("Approve", "Disapprove")) )
    > Performance
                2nd Survey
    1st Survey   Approve Disapprove
      Approve        794        150
      Disapprove      86        570

    # Test
    > mcnemar.test(Performance)
    McNemar's Chi-squared test with continuity correction
    data: Performance
    df = 1, p-value = 4.115e-05

Significant change (in fact, a drop) in approval ratings
