STAT 705: Analysis of Contingency Tables
Timothy Hanson, Department of Statistics, University of South Carolina

1 Tests for a binomial probability π

Let $Y \sim \mathrm{bin}(n,\pi)$. The likelihood is
\[ L(\pi) = \binom{n}{y}\pi^y(1-\pi)^{n-y}, \]
and the log-likelihood is
\[ \ell(\pi) = \log\binom{n}{y} + y\log\pi + (n-y)\log(1-\pi). \]
So
\[ \ell'(\pi) = \frac{y}{\pi} - \frac{n-y}{1-\pi}. \]

2 Solving $\ell'(\pi) = 0$ for $\pi$ gives the MLE $\hat\pi = y/n$, the number of successes out of the total number of trials. Taking the second derivative of $\ell(\pi)$ gives
\[ \ell''(\pi) = -\frac{y}{\pi^2} - \frac{n-y}{(1-\pi)^2}, \]
and so the Fisher information is
\[ -E\{\ell''(\pi)\} = E\!\left(\frac{Y}{\pi^2} + \frac{n-Y}{(1-\pi)^2}\right) = \frac{n\pi}{\pi^2} + \frac{n-n\pi}{(1-\pi)^2} = \frac{n}{\pi(1-\pi)}. \]
The large-sample result is then
\[ \hat\pi = \frac{Y}{n} \;\dot\sim\; N\!\left(\pi,\ \frac{\pi(1-\pi)}{n}\right). \]
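As a quick numerical check, here is a minimal R sketch (with made-up data $y = 7$, $n = 25$) that maximizes the log-likelihood directly and compares the answer to the closed-form MLE and its large-sample standard error:

## Hypothetical data: y successes out of n trials
y <- 7; n <- 25

## Binomial log-likelihood, dropping the constant log choose(n, y)
loglik <- function(p) y * log(p) + (n - y) * log(1 - p)

## Numerical MLE via one-dimensional optimization
opt <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)

pi.hat <- y / n                            # closed-form MLE
se.hat <- sqrt(pi.hat * (1 - pi.hat) / n)  # large-sample standard error
c(numerical = opt$maximum, closed.form = pi.hat, se = se.hat)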

3 Let's consider $H_0: \pi = \pi_0$, where $\pi_0$ is fixed and known (e.g. $H_0: \pi = 0.5$). The Wald test plugs the MLE $\hat\pi = y/n$ in for the unknown $\pi$ in the large-sample variance:
\[ \hat\pi = \frac{Y}{n} \;\dot\sim\; N\!\left(\pi,\ \frac{\hat\pi(1-\hat\pi)}{n}\right). \]
Recall that $se(\hat\pi) = \sqrt{\hat\pi(1-\hat\pi)/n}$. So then
\[ Z_W = \frac{\hat\pi - \pi_0}{se(\hat\pi)} = \frac{\hat\pi - \pi_0}{\sqrt{\hat\pi(1-\hat\pi)/n}} \;\dot\sim\; N(0,1) \]
when $H_0$ is true. Squaring, $W = Z_W^2 \;\dot\sim\; \chi^2_1$.

4 The score test plugs the null value into the large-sample variance:
\[ \hat\pi = \frac{Y}{n} \;\dot\sim\; N\!\left(\pi,\ \frac{\pi_0(1-\pi_0)}{n}\right). \]
So then
\[ Z_S = \frac{\hat\pi - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \;\dot\sim\; N(0,1) \]
when $H_0$ is true. Squaring, $S = Z_S^2 \;\dot\sim\; \chi^2_1$.

5 Evaluating the log-likelihood at the unconstrained MLE gives
\[ L_1 = \ell(\hat\pi) = \log\binom{n}{y} + y\log\hat\pi + (n-y)\log(1-\hat\pi). \]
Under the constraint $H_0: \pi = \pi_0$ the log-likelihood is simply
\[ L_0 = \ell(\pi_0) = \log\binom{n}{y} + y\log\pi_0 + (n-y)\log(1-\pi_0) \]
(there are no parameters left to maximize the constrained likelihood over!), and so the LRT statistic, plugging $Y$ in for $y$, is
\[ L = -2(L_0 - L_1) = 2\left\{ Y\log\frac{\hat\pi}{\pi_0} + (n-Y)\log\frac{1-\hat\pi}{1-\pi_0} \right\} \;\dot\sim\; \chi^2_1 \]
when $H_0$ is true.

6 In all three cases, an approximate $\alpha = 0.05$ significance test of $H_0: \pi = \pi_0$ is carried out by computing $W$, $S$, or $L$ and rejecting if the test statistic is larger than the quantile corresponding to 0.05 right-tail probability from a $\chi^2_1$ distribution, i.e. larger than $\chi^2_1(0.05) = 3.84$. Confidence intervals are obtained by inverting the test statistics.
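For concreteness, the following R sketch computes all three statistics for made-up data $y = 7$, $n = 25$ and null value $\pi_0 = 0.5$, and compares them to the $\chi^2_1$ critical value:

## Hypothetical data and null value
y <- 7; n <- 25; pi0 <- 0.5
pi.hat <- y / n

## Wald statistic: difference standardized by the estimated variance
W <- (pi.hat - pi0)^2 / (pi.hat * (1 - pi.hat) / n)

## Score statistic: difference standardized by the null variance
S <- (pi.hat - pi0)^2 / (pi0 * (1 - pi0) / n)

## Likelihood ratio statistic
L <- 2 * (y * log(pi.hat / pi0) + (n - y) * log((1 - pi.hat) / (1 - pi0)))

## Compare to the chi-square critical value and get p-values
crit <- qchisq(0.95, df = 1)              # 3.84
cbind(stat = c(W = W, S = S, L = L),
      reject = c(W, S, L) > crit,
      p.value = pchisq(c(W, S, L), df = 1, lower.tail = FALSE))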

7 Wherefore art thou, vegetarians?

Out of n = 25 students, y = 0 were vegetarians (Agresti, 2002). Assuming binomial data, the 95% CIs found by inverting the Wald, score, and LRT tests are

Wald  (0, 0)
score (0, 0.133)
LRT   (0, 0.074)

The Wald interval is particularly troublesome. Why the difference? For small or large (true, unknown) $\pi$, the normal approximation to the distribution of $\hat\pi$ is pretty bad in small samples. A solution is to consider the exact sampling distribution of $\hat\pi$ rather than a normal approximation.
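One way to see where the score and LRT intervals come from is to invert the tests numerically: scan a grid of null values $\pi_0$ and keep those not rejected at level 0.05. A minimal R sketch for $y = 0$, $n = 25$ (the Wald interval collapses to $\{0\}$ here because $se(\hat\pi) = 0$ when $y = 0$):

## Invert the score and LRT tests over a grid of null values pi0
y <- 0; n <- 25
pi.hat <- y / n
crit <- qchisq(0.95, df = 1)
pi0 <- seq(1e-6, 1 - 1e-6, length.out = 100000)

score.stat <- (pi.hat - pi0)^2 / (pi0 * (1 - pi0) / n)
lrt.stat   <- 2 * (ifelse(y == 0, 0, y * log(pi.hat / pi0)) +
                   (n - y) * log((1 - pi.hat) / (1 - pi0)))

range(pi0[score.stat <= crit])   # roughly (0, 0.133)
range(pi0[lrt.stat   <= crit])   # roughly (0, 0.074)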

8 Exact inference

An exact test proceeds as follows. Under $H_0: \pi = \pi_0$ we know $Y \sim \mathrm{bin}(n,\pi_0)$. Values of $\hat\pi$ far away from $\pi_0$, or equivalently values of $Y$ far away from $n\pi_0$, indicate that $H_0: \pi = \pi_0$ is unlikely. Say we reject $H_0$ if $Y < a$ or $Y > b$, where $0 \le a < b \le n$. Then we set the type I error at $\alpha$ by requiring $P(\text{reject } H_0 \mid H_0 \text{ is true}) = \alpha$. That is,
\[ P(Y < a \mid \pi = \pi_0) = \frac{\alpha}{2} \quad\text{and}\quad P(Y > b \mid \pi = \pi_0) = \frac{\alpha}{2}. \]

9 However, since $Y$ is discrete, the best we can do is to bound the type I error by choosing $a$ as large as possible such that
\[ P(Y < a \mid \pi = \pi_0) = \sum_{i=0}^{a-1} \binom{n}{i}\pi_0^i(1-\pi_0)^{n-i} < \frac{\alpha}{2}, \]
and $b$ as small as possible such that
\[ P(Y > b \mid \pi = \pi_0) = \sum_{i=b+1}^{n} \binom{n}{i}\pi_0^i(1-\pi_0)^{n-i} < \frac{\alpha}{2}. \]

10 For example, when $n = 20$, $H_0: \pi = 0.25$, and $\alpha = 0.05$, we have $P(Y < 2 \mid \pi = 0.25) = 0.024$ and $P(Y < 3 \mid \pi = 0.25) = 0.091$, so $a = 2$. Also, $P(Y > 9 \mid \pi = 0.25) = 0.014$ and $P(Y > 8 \mid \pi = 0.25) = 0.041$, so $b = 9$. We reject $H_0: \pi = 0.25$ when $Y < 2$ or $Y > 9$. The type I error is bounded, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) \le 0.05$, but in fact this is conservative: $P(\text{reject } H_0 \mid H_0 \text{ is true}) = 0.024 + 0.014 = 0.038$. Nonetheless, this type of exact test can be inverted to obtain exact confidence intervals for $\pi$. However, the actual coverage probability is at least as large as $1-\alpha$, and typically more, so the procedure errs on the side of being conservative (CIs are bigger than they need to be).
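The cutoffs $a$ and $b$ and the attained size are easy to find from the binomial CDF; a short R sketch for $n = 20$, $\pi_0 = 0.25$, $\alpha = 0.05$:

n <- 20; pi0 <- 0.25; alpha <- 0.05

## P(Y < a) = pbinom(a - 1, n, pi0); take the largest a keeping it below alpha/2
a <- max(which(pbinom(0:n - 1, n, pi0) < alpha / 2)) - 1   # which() is 1-based
## P(Y > b) = pbinom(b, n, pi0, lower.tail = FALSE); take the smallest such b
b <- min(which(pbinom(0:n, n, pi0, lower.tail = FALSE) < alpha / 2)) - 1

size <- pbinom(a - 1, n, pi0) + pbinom(b, n, pi0, lower.tail = FALSE)
c(a = a, b = b, attained.size = size)   # a = 2, b = 9, size approximately 0.038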

11 To obtain the 95% CI from inverting the score test, and from inverting the exact (Clopper-Pearson) test, in R type:

> out1 <- prop.test(x=0, n=25, conf.level=0.95, correct=F)
> out1$conf.int
[1] 0.0000000 0.1331923
attr(,"conf.level")
[1] 0.95
> out2 <- binom.test(x=0, n=25, conf.level=0.95)
> out2$conf.int
[1] 0.0000000 0.1371852
attr(,"conf.level")
[1] 0.95

Comment: I would recommend, wherever possible, the use of exact tests instead of approximate tests.

12 Describing Contingency Tables

Contingency tables & their distributions

Let X and Y be categorical variables measured on a subject, with I and J levels respectively. Each subject sampled will have an associated (X, Y); e.g. (X, Y) = (female, Republican). For the gender variable X, I = 2, and for political affiliation Y we might have J = 3. Say n individuals are sampled and cross-classified according to their outcome (X, Y). A contingency table places the raw number of subjects falling into each cross-classification category into the table cells. We call such a table an I x J table.

13 If we relabel the category outcomes to be integers 1 <= X <= I and 1 <= Y <= J (i.e. turn our experimental outcomes into random variables), we can simplify notation. In the abstract, a contingency table looks like:

n_ij       Y = 1    Y = 2    ...    Y = J    Totals
X = 1      n_11     n_12     ...    n_1J     n_1+
X = 2      n_21     n_22     ...    n_2J     n_2+
...
X = I      n_I1     n_I2     ...    n_IJ     n_I+
Totals     n_+1     n_+2     ...    n_+J     n = n_++

If subjects are randomly sampled from the population and cross-classified, both X and Y are random and (X, Y) has a bivariate discrete joint distribution. Let $\pi_{ij} = P(X = i, Y = j)$, the probability of falling into the (i, j)th (row, column) cell of the table.
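For illustration, here is how such a table, with its $n_{i+}$, $n_{+j}$, and $n_{++}$ margins, might be built in R from raw (X, Y) pairs; the data below are simulated, not real:

## Simulate n subjects cross-classified on two categorical variables
set.seed(1)
n <- 100
X <- sample(c("female", "male"), n, replace = TRUE)
Y <- sample(c("Democrat", "Republican", "Independent"), n, replace = TRUE)

tab <- table(X, Y)   # the I x J table of counts n_ij
addmargins(tab)      # appends the row totals n_i+, column totals n_+j, and n_++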

14 From Chapter 2 in Christensen (1997) we have a sample of n = 52 males aged 11 to 30 years with knee operations via arthroscopic surgery. They are cross-classified according to X = 1, 2, 3 for injury type (twisted knee, direct blow, or both) and Y = 1, 2, 3 for surgical result (excellent, good, or fair-to-poor).

n_ij            Excellent   Good   Fair to poor   Totals
Twisted knee
Direct blow
Both types
Totals                                             n = 52

with theoretical probabilities:

pi_ij           Excellent   Good   Fair to poor   Totals
Twisted knee    pi_11       pi_12  pi_13          pi_1+
Direct blow     pi_21       pi_22  pi_23          pi_2+
Both types      pi_31       pi_32  pi_33          pi_3+
Totals          pi_+1       pi_+2  pi_+3          pi_++ = 1

15 The marginal probabilities that X = i or Y = j are
\[ P(X = i) = \sum_{j=1}^{J} P(X = i, Y = j) = \sum_{j=1}^{J} \pi_{ij} = \pi_{i+}, \]
\[ P(Y = j) = \sum_{i=1}^{I} P(X = i, Y = j) = \sum_{i=1}^{I} \pi_{ij} = \pi_{+j}. \]
A "+" in place of a subscript denotes a sum of all elements over that subscript. We must have
\[ \pi_{++} = \sum_{i=1}^{I}\sum_{j=1}^{J} \pi_{ij} = 1. \]
The counts have a multinomial distribution $\mathbf{n} \sim \mathrm{mult}(n, \boldsymbol\pi)$, where $\mathbf{n} = [n_{ij}]_{I\times J}$ and $\boldsymbol\pi = [\pi_{ij}]_{I\times J}$.
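A brief R sketch, with a made-up 2 x 3 probability table, showing that a table sampled this way is a single multinomial draw whose cell probabilities sum to one:

## Hypothetical 2 x 3 probability table; entries must sum to 1
probs <- matrix(c(0.10, 0.20, 0.10,
                  0.15, 0.25, 0.20), nrow = 2, byrow = TRUE)
stopifnot(abs(sum(probs) - 1) < 1e-12)

## One multinomial draw of n = 200 subjects, reshaped back into an I x J table
n <- 200
counts <- matrix(rmultinom(1, size = n, prob = probs), nrow = nrow(probs))
addmargins(as.table(counts))   # margins give n_i+, n_+j, and n_++ = n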

16 Often X is fixed and only Y is random. For example, in a case-control study a fixed number of cases and a fixed number of controls are sampled. Then a risk factor or exposure Y is compared among cases and controls within the table. This results in a separate multinomial distribution for each level of X; however, we can pretend that the joint distribution is one large multinomial and nothing changes.

Example: Aspirin and heart attacks. n = 1360 stroke patients randomly assigned to aspirin or placebo & followed about 3 years.

                    Heart attack   No heart attack   Total
Placebo (fixed)
Aspirin (fixed)

17 Independence

When (X, Y) are jointly distributed, X and Y are independent if
\[ P(X = i, Y = j) = P(X = i)P(Y = j), \quad\text{or}\quad \pi_{ij} = \pi_{i+}\pi_{+j}. \]
Let
\[ \pi^{X\mid Y}_{i\mid j} = P(X = i \mid Y = j) = \pi_{ij}/\pi_{+j} \quad\text{and}\quad \pi^{Y\mid X}_{j\mid i} = P(Y = j \mid X = i) = \pi_{ij}/\pi_{i+}. \]
Then independence of X and Y implies
\[ P(X = i \mid Y = j) = P(X = i) \quad\text{and}\quad P(Y = j \mid X = i) = P(Y = j). \]
The probability of any given column response is the same for each row, and the probability of any given row response is the same for each column.

18 Comparing two proportions

Let X and Y be dichotomous. Let $\pi_1 = P(Y = 1 \mid X = 1)$ and $\pi_2 = P(Y = 1 \mid X = 2)$. The difference in the probability of Y = 1 when X = 1 versus X = 2 is $\pi_1 - \pi_2$. The relative risk $\pi_1/\pi_2$ may be more informative for rare outcomes. However, it may also exaggerate the effect of X = 1 versus X = 2 and cloud the issue.

19 Example: Let Y = 1 indicate presence of a disease and X = 1 indicate an exposure.

When $\pi_2 = 0.001$ and $\pi_1 = 0.01$, the difference is $\pi_1 - \pi_2 = 0.009$. However, $\pi_1/\pi_2 = 10$: you are 10 times more likely to get the disease when X = 1 than when X = 2. In either case, though, the probability of getting the disease is small.

When $\pi_2 = 0.401$ and $\pi_1 = 0.41$, the difference is again $\pi_1 - \pi_2 = 0.009$. However, $\pi_1/\pi_2 \approx 1.02$: you are about 2% more likely to get the disease when X = 1 than when X = 2. This doesn't seem as drastic as 1000%.

These sorts of comparisons figure into reporting results concerning public health and safety information, e.g. HRT for post-menopausal women, relative safety of SUVs versus sedans, etc.
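A quick R check of the two scenarios just described:

## Two pairs of probabilities with the same difference but very different relative risks
risk.summary <- function(pi1, pi2)
  c(difference = pi1 - pi2, relative.risk = pi1 / pi2)

risk.summary(pi1 = 0.01, pi2 = 0.001)   # difference 0.009, relative risk 10
risk.summary(pi1 = 0.41, pi2 = 0.401)   # difference 0.009, relative risk about 1.02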

20 Odds ratios

The odds of success (say Y = 1) versus failure (Y = 2) are $\Omega = \pi/(1-\pi)$, where $\pi = P(Y = 1)$. When someone says "3 to 1 odds the Vikings will win," they mean $\Omega = 3$, which implies the probability the Vikings will win is 0.75, from $\pi = \Omega/(\Omega+1)$. Odds measure the relative rates of success and failure.

An odds ratio compares the relative rates of success (or disease or whatever) across two exposures X = 1 and X = 2:
\[ \theta = \frac{\Omega_1}{\Omega_2} = \frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}. \]
Odds ratios are always positive, and a ratio greater than 1 indicates that the relative rate of success for X = 1 is greater than for X = 2. However, the odds ratio gives no information on the individual probabilities $\pi_1 = P(Y = 1 \mid X = 1)$ and $\pi_2 = P(Y = 1 \mid X = 2)$.

21 Different values for these parameters can lead to the same odds ratio.

Example: $\pi_1 = 5/6 \approx 0.83$ & $\pi_2 = 0.5$ yield $\theta = 5.0$. So does a pair of much smaller probabilities, e.g. $\pi_1 = 0.05$ & $\pi_2 \approx 0.0104$. One set of values might imply a different decision than the other, but $\theta = 5.0$ in both cases. Here, the relative risk is about 1.7 and 5, respectively.

Note that when dealing with a rare outcome, where $\pi_i \approx 0$, the relative risk is approximately equal to the odds ratio.

22 When $\theta = 1$ we must have $\Omega_1 = \Omega_2$, which further implies that $\pi_1 = \pi_2$, and hence Y does not depend on the value of X. If (X, Y) are both random, then X and Y are stochastically independent.

An important property of the odds ratio is the following:
\[ \theta = \frac{P(Y = 1 \mid X = 1)/P(Y = 2 \mid X = 1)}{P(Y = 1 \mid X = 2)/P(Y = 2 \mid X = 2)} = \frac{P(X = 1 \mid Y = 1)/P(X = 2 \mid Y = 1)}{P(X = 1 \mid Y = 2)/P(X = 2 \mid Y = 2)}. \]
You should verify this formally. This implies that, for the purposes of estimating an odds ratio, it does not matter whether data are sampled prospectively, retrospectively, or cross-sectionally. The common odds ratio is estimated by $\hat\theta = n_{11}n_{22}/[n_{12}n_{21}]$.
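A sketch of the verification: write both conditional forms in terms of the joint probabilities $\pi_{ij}$ and cancel the marginals.
\[ \frac{P(Y=1\mid X=1)/P(Y=2\mid X=1)}{P(Y=1\mid X=2)/P(Y=2\mid X=2)} = \frac{(\pi_{11}/\pi_{1+})/(\pi_{12}/\pi_{1+})}{(\pi_{21}/\pi_{2+})/(\pi_{22}/\pi_{2+})} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}, \]
\[ \frac{P(X=1\mid Y=1)/P(X=2\mid Y=1)}{P(X=1\mid Y=2)/P(X=2\mid Y=2)} = \frac{(\pi_{11}/\pi_{+1})/(\pi_{21}/\pi_{+1})}{(\pi_{12}/\pi_{+2})/(\pi_{22}/\pi_{+2})} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}. \]
Both reduce to the cross-product ratio $\pi_{11}\pi_{22}/(\pi_{12}\pi_{21})$, so the row-conditional and column-conditional odds ratios coincide.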

23 Inference for Contingency Tables

Inference for association parameters: odds ratios

The sample odds ratio $\hat\theta = n_{11}n_{22}/(n_{12}n_{21})$ can be zero, undefined, or $\infty$ if one or more of $\{n_{11}, n_{22}, n_{12}, n_{21}\}$ are zero. An alternative is to add 1/2 an observation to each cell:
\[ \tilde\theta = \frac{(n_{11}+0.5)(n_{22}+0.5)}{(n_{12}+0.5)(n_{21}+0.5)}. \]
This also corresponds to a particular Bayesian estimate.

Both $\hat\theta$ and $\tilde\theta$ have skewed sampling distributions when $n = n_{++}$ is small. The sampling distribution of $\log\hat\theta$ is relatively symmetric and therefore more amenable to a Gaussian approximation. An approximate $(1-\alpha)\times 100\%$ CI for $\log\theta$ is given by
\[ \log\hat\theta \pm z_{\alpha/2}\sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}. \]
A CI for $\theta$ is obtained by exponentiating the interval endpoints.

24 When $\hat\theta = 0$ this doesn't work ($\log 0 = -\infty$). One can use $n_{ij} + 0.5$ in place of $n_{ij}$ in the MLE and standard error, yielding
\[ \log\tilde\theta \pm z_{\alpha/2}\sqrt{\frac{1}{n_{11}+0.5} + \frac{1}{n_{12}+0.5} + \frac{1}{n_{21}+0.5} + \frac{1}{n_{22}+0.5}}. \]
Perhaps a better approach would involve inverting score or LRT tests for $H_0: \theta = t$.
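A small R sketch of this adjusted interval; the 2 x 2 counts below are hypothetical:

## Wald CI for the odds ratio with the +0.5 adjustment
or.ci <- function(tab, conf.level = 0.95) {
  a <- tab + 0.5                                    # add 1/2 observation to each cell
  log.or <- log(a[1, 1] * a[2, 2] / (a[1, 2] * a[2, 1]))
  se     <- sqrt(sum(1 / a))                        # sum of reciprocal adjusted counts
  z      <- qnorm(1 - (1 - conf.level) / 2)
  exp(log.or + c(-1, 1) * z * se)                   # CI for theta itself
}

or.ci(matrix(c(10, 2, 4, 12), nrow = 2, byrow = TRUE))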

25 Example: Aspirin and heart attacks. n = 1360 stroke patients randomly assigned to aspirin or placebo (product multinomial sampling) & followed about 3 years.

                    Heart attack   No heart attack   Total
Placebo (fixed)
Aspirin (fixed)

The 95% CI for $\log\theta$ based on $\hat\theta$ is $(-0.157, 1.047)$, and so the CI for $\theta$ is $(e^{-0.157}, e^{1.047}) = (0.85, 2.85)$. We cannot reject $H_0: \theta = 1$ (at level $\alpha = 0.05$): the data are consistent with heart attack incidence being unrelated to aspirin intake.

26 Difference in proportions

Assume (1) multinomial sampling or (2) product binomial sampling where the $n_{i+}$ are fixed (fixed row totals, as in the heart attack data). Let $\pi_1 = P(Y = 1 \mid X = 1)$ and $\pi_2 = P(Y = 1 \mid X = 2)$. The sample proportion for each level of X is the MLE:
\[ \hat\pi_1 = n_{11}/n_{1+}, \qquad \hat\pi_2 = n_{21}/n_{2+}. \]
Using either large-sample results or the CLT we have
\[ \hat\pi_1 \;\dot\sim\; N\!\left(\pi_1,\ \frac{\pi_1(1-\pi_1)}{n_{1+}}\right), \qquad \hat\pi_2 \;\dot\sim\; N\!\left(\pi_2,\ \frac{\pi_2(1-\pi_2)}{n_{2+}}\right). \]
Since the difference of two independent normals is also normal, we have
\[ \hat\pi_1 - \hat\pi_2 \;\dot\sim\; N\!\left(\pi_1 - \pi_2,\ \frac{\pi_1(1-\pi_1)}{n_{1+}} + \frac{\pi_2(1-\pi_2)}{n_{2+}}\right). \]

27 Plugging in the MLEs for the unknowns, we estimate the standard deviation of the difference in sample proportions by the standard error
\[ \hat\sigma(\hat\pi_1 - \hat\pi_2) = \sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n_{1+}} + \frac{\hat\pi_2(1-\hat\pi_2)}{n_{2+}}}. \]
A Wald CI for the unknown difference has endpoints
\[ \hat\pi_1 - \hat\pi_2 \pm z_{\alpha/2}\,\hat\sigma(\hat\pi_1 - \hat\pi_2). \]
For the aspirin data, this yields $0.014 \pm 1.96(0.0097)$, for the 95% CI $(-0.005, 0.033)$.
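In R this interval can be computed directly from a 2 x 2 table with fixed row totals; the counts below are hypothetical stand-ins, not the aspirin data:

## Wald CI for pi1 - pi2 from a 2 x 2 table with fixed row totals
diff.ci <- function(tab, conf.level = 0.95) {
  p1 <- tab[1, 1] / sum(tab[1, ])                  # row-1 sample proportion
  p2 <- tab[2, 1] / sum(tab[2, ])                  # row-2 sample proportion
  se <- sqrt(p1 * (1 - p1) / sum(tab[1, ]) + p2 * (1 - p2) / sum(tab[2, ]))
  z  <- qnorm(1 - (1 - conf.level) / 2)
  c(estimate = p1 - p2, lower = p1 - p2 - z * se, upper = p1 - p2 + z * se)
}

diff.ci(matrix(c(15, 285, 10, 290), nrow = 2, byrow = TRUE))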

28 Estimating relative risk

Like the odds ratio, the relative risk $\pi_1/\pi_2 \in (0, \infty)$ and tends to have a skewed sampling distribution in small samples. Let $r = \hat\pi_1/\hat\pi_2$ be the sample relative risk. Large-sample normality implies
\[ \log r = \log(\hat\pi_1/\hat\pi_2) \;\dot\sim\; N\!\left(\log(\pi_1/\pi_2),\ \sigma^2(\log r)\right), \]
where
\[ \sigma(\log r) = \sqrt{\frac{1-\pi_1}{\pi_1 n_{1+}} + \frac{1-\pi_2}{\pi_2 n_{2+}}}. \]
Plugging in $\hat\pi_i$ for $\pi_i$ gives the standard error, and CIs are obtained as usual for $\log(\pi_1/\pi_2)$, then exponentiated to get the CI for $\pi_1/\pi_2$.

Applying this to the heart attack data we obtain a 95% CI for $\pi_1/\pi_2$ of $(0.86, 2.75)$: the probability of a heart attack on placebo is estimated to be between 0.86 and 2.75 times that on aspirin.
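And a matching R sketch for the relative-risk interval, using the same hypothetical 2 x 2 layout as in the sketch above:

## Wald CI for the relative risk pi1/pi2 on the log scale
rr.ci <- function(tab, conf.level = 0.95) {
  n1 <- sum(tab[1, ]); n2 <- sum(tab[2, ])
  p1 <- tab[1, 1] / n1; p2 <- tab[2, 1] / n2
  log.rr <- log(p1 / p2)
  se     <- sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
  z      <- qnorm(1 - (1 - conf.level) / 2)
  exp(log.rr + c(-1, 1) * z * se)                  # back-transform to the RR scale
}

rr.ci(matrix(c(15, 285, 10, 290), nrow = 2, byrow = TRUE))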

29 Testing independence in I x J tables

Assume one $\mathrm{mult}(n, \boldsymbol\pi)$ distribution for the whole table, and let $\pi_{ij} = P(X = i, Y = j)$; we must have $\pi_{++} = 1$. If the table is 2 x 2, we can just look at $H_0: \theta = 1$. In general, independence holds if
\[ H_0: \pi_{ij} = \pi_{i+}\pi_{+j} \ \text{ for all } i, j, \quad\text{or equivalently,}\quad \mu_{ij} = n\pi_{i+}\pi_{+j}, \]
where $\mu_{ij} = E(n_{ij}) = n\pi_{ij}$ is the expected cell count. That is, independence imposes a constraint: under it, the marginal parameters $\pi_{1+},\ldots,\pi_{I+}$ and $\pi_{+1},\ldots,\pi_{+J}$ define all probabilities in the I x J table.

30 Pearson's statistic is
\[ X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}, \]
where $\hat\mu_{ij} = n(n_{i+}/n)(n_{+j}/n) = n_{i+}n_{+j}/n$ is the MLE of $\mu_{ij}$ under $H_0$.

There are $I-1$ free $\{\pi_{i+}\}$ and $J-1$ free $\{\pi_{+j}\}$, so the degrees of freedom are
\[ IJ - 1 - [(I-1) + (J-1)] = (I-1)(J-1). \]
When $H_0$ is true, $X^2 \;\dot\sim\; \chi^2_{(I-1)(J-1)}$.

31 The LRT statistic boils down to
\[ G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\log(n_{ij}/\hat\mu_{ij}), \]
and also $G^2 \;\dot\sim\; \chi^2_{(I-1)(J-1)}$ when $H_0$ is true. Moreover, $X^2 - G^2 \xrightarrow{p} 0$.

The $\chi^2$ approximation is better for $X^2$ than for $G^2$ in smaller samples. The approximation can be okay when some $\hat\mu_{ij} = n_{i+}n_{+j}/n$ are as small as 1, provided most are at least 5. When in doubt, use small-sample methods. Everything holds for product multinomial sampling too (fixed marginal totals for one variable)!
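For a generic I x J table of counts, both statistics take only a few lines in R; the example table below is made up:

## Pearson X^2 and likelihood-ratio G^2 tests of independence
tab <- matrix(c(20, 30, 25,
                15, 40, 45), nrow = 2, byrow = TRUE)

mu.hat <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # fitted counts under H0
X2 <- sum((tab - mu.hat)^2 / mu.hat)
G2 <- 2 * sum(tab * log(tab / mu.hat))                   # assumes all n_ij > 0
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

c(X2 = X2, G2 = G2, df = df,
  p.X2 = pchisq(X2, df, lower.tail = FALSE),
  p.G2 = pchisq(G2, df, lower.tail = FALSE))

## chisq.test(tab, correct = FALSE) reproduces X2 and its p-value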

32 Following up chi-squared tests for independence

Rejecting $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ does not tell us about the nature of the association.

Pearson and standardized residuals

The Pearson residual is
\[ e_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}}}, \]
where, as before, $\hat\mu_{ij} = n_{i+}n_{+j}/n$ is the estimate under $H_0: X \perp Y$. When $H_0: X \perp Y$ is true, under multinomial sampling $e_{ij} \;\dot\sim\; N(0, v)$, where $v < 1$, in large samples. Note that $\sum_{i=1}^{I}\sum_{j=1}^{J} e_{ij}^2 = X^2$.
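In R, both kinds of residuals are available from chisq.test; a minimal sketch on a made-up table ($r_{ij}$, the standardized version, is defined on the next slide):

## Pearson and standardized Pearson residuals from a test of independence
tab <- matrix(c(20, 30, 25,
                15, 40, 45), nrow = 2, byrow = TRUE)

fit <- chisq.test(tab)
fit$residuals   # Pearson residuals e_ij = (n_ij - mu.hat_ij) / sqrt(mu.hat_ij)
fit$stdres      # standardized residuals r_ij, approximately N(0,1) under H0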

33 Standardized Pearson residuals are Pearson residuals divided by their standard error under multinomial sampling:
\[ r_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}(1 - p_{i+})(1 - p_{+j})}}, \]
where the $p_{i+} = n_{i+}/n$ and $p_{+j} = n_{+j}/n$ (i.e. $p_{ij} = n_{ij}/n$) are MLEs under the full (non-independence) model. Values of $|r_{ij}| > 3$ happen very rarely when $H_0: X \perp Y$ is true, and $|r_{ij}| > 2$ happen only roughly 5% of the time. Pearson residuals and their standardized version tell us which cell counts are much larger or smaller than what we would expect under $H_0: X \perp Y$.

Example: we analyze data on religious beliefs and highest degree achieved (Agresti, 2002) using the following SAS code:

34 Religious beliefs by highest degree

Highest degree                    Fundamentalist   Moderate   Liberal
Less than high school
High school or junior college
Bachelor or graduate

data table;
input Degree $ Religion $ count;
datalines;
1 fund
1 mod
1 lib
2 fund
2 mod
2 lib
3 fund
3 mod
3 lib  252
;
proc format;
value $dc '1' = '< HS'  '2' = 'HS or JC'  '3' = '>= BA/BS';
value $rc 'fund' = 'Fundamentalist'  'mod' = 'Moderate'  'lib' = 'Liberal';
proc freq order=data;
weight count;
format Religion $rc. Degree $dc.;
tables Degree*Religion / chisq expected measures cmh1;
proc genmod order=data;
class Degree Religion;
format Religion $rc. Degree $dc.;
model count = Degree Religion / dist=poi link=log residuals;
run;

35 Annotated output from proc freq: the Degree by Religion cross-classification, with each cell showing the observed Frequency, the Expected count under independence, the overall Percent, the Row Pct, and the Col Pct, along with the row and column Totals.

36 More output.

Statistics for Table of Degree by Religion

Statistic                          DF    Value    Prob
Chi-Square                          4             <.0001
Likelihood Ratio Chi-Square         4             <.0001

Annotated output from proc genmod: the Observation Statistics table lists, for each of the nine cells, the raw residual (Resraw), the Pearson residual (Reschi), the deviance residual (Resdev), their standardized versions (StResdev and StReschi), and the likelihood residual (Reslik).

37 The StReschi column has the $r_{ij}$. Values larger than 3 in magnitude indicate severe departures from independence. Observations 1 and 9, corresponding to (less than high school, Fundamentalist) and (at least BA/BS, Liberal), are over-represented relative to independence. Observations 6 and 7, corresponding to (HS or JC, Liberal) and (at least BA/BS, Fundamentalist), are under-represented. That is, we tend to see concentrations along the diagonal: increased education is associated with increasingly liberal religious views.
