PB HLTH 240A: Advanced Categorical Data Analysis Fall 2007


Sandrine Dudoit, Division of Biostatistics and Department of Statistics, University of California, Berkeley. September 17, 2007.

Acknowledgements. These lecture notes are based on Professor Jewell's notes and Chapter 6 of his book Statistics for Epidemiology.

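Outline. Hypothesis testing; population-based, cohort, and case-control study designs; comparison of study designs; Pearson's χ²-statistic, multiway tables, and Fisher's exact test; multiple hypothesis testing.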

A 2 × 2 contingency table is a convenient representation of the (data generating or empirical) joint distribution of two binary variables, e.g., the relationship between disease ($D$, $\bar{D}$ = not $D$) and exposure ($E$, $\bar{E}$ = not $E$).

Table 1: Generic 2 × 2 contingency table for the disease-exposure joint distribution.

| Exposure status | $D$ | $\bar{D}$ | Total |
|---|---|---|---|
| $E$ | $a$ | $b$ | $n_E = a + b$ |
| $\bar{E}$ | $c$ | $d$ | $n_{\bar{E}} = c + d$ |
| Total | $n_D = a + c$ | $n_{\bar{D}} = b + d$ | $n = a + b + c + d$ |

Various measures of association between disease and exposure have been proposed in the literature.

Relative risk: $RR \equiv \dfrac{\Pr(D \mid E)}{\Pr(D \mid \bar{E})}$;
Odds ratio: $OR \equiv \dfrac{\Pr(D \mid E)/\Pr(\bar{D} \mid E)}{\Pr(D \mid \bar{E})/\Pr(\bar{D} \mid \bar{E})}$;
Excess risk: $ER \equiv \Pr(D \mid E) - \Pr(D \mid \bar{E})$;
Attributable risk: $AR \equiv \dfrac{\Pr(D) - \Pr(D \mid \bar{E})}{\Pr(D)}$. (1)

When disease and exposure status are independent, i.e., $\Pr(D \& E) = \Pr(D)\Pr(E)$, then
$$RR = 1, \quad OR = 1, \quad ER = 0, \quad AR = 0. \tag{2}$$
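As an illustration, the four measures can be computed directly from the four joint probabilities of a 2 × 2 table. The following is a minimal sketch; the function name `association_measures` and its argument names are hypothetical and not part of the lecture notes.

```python
def association_measures(p_de, p_dbar_e, p_d_ebar, p_dbar_ebar):
    """Compute RR, OR, ER and AR from the joint probabilities
    Pr(D&E), Pr(notD&E), Pr(D&notE), Pr(notD&notE)."""
    total = p_de + p_dbar_e + p_d_ebar + p_dbar_ebar
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    p_d = p_de + p_d_ebar                                  # Pr(D)
    p_d_given_e = p_de / (p_de + p_dbar_e)                 # Pr(D | E)
    p_d_given_ebar = p_d_ebar / (p_d_ebar + p_dbar_ebar)   # Pr(D | not E)
    odds_e = p_d_given_e / (1 - p_d_given_e)
    odds_ebar = p_d_given_ebar / (1 - p_d_given_ebar)
    return {
        "RR": p_d_given_e / p_d_given_ebar,
        "OR": odds_e / odds_ebar,
        "ER": p_d_given_e - p_d_given_ebar,
        "AR": (p_d - p_d_given_ebar) / p_d,
    }

# Independence with Pr(D) = 0.2 and Pr(E) = 0.3: joint probabilities are products.
indep = association_measures(0.06, 0.24, 0.14, 0.56)
```

Under independence the returned values agree with Equation (2) up to floating-point error.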

A common question in statistical practice is to assess whether a disease-exposure association observed in a sample from a population of interest reflects a genuine association in this population or is due to random sampling variation. The statistical approach to this problem involves testing the null hypothesis $H_0$ of disease-exposure independence, i.e.,
$$H_0 : \Pr(D \& E) = \Pr(D)\Pr(E) \iff RR = 1 \iff OR = 1 \iff ER = 0 \iff AR = 0. \tag{3}$$

As detailed below for various study designs, one can use empirical frequencies from a contingency table such as Table 1 to estimate parameters of the disease-exposure joint distribution (e.g., $\Pr(D \& E)$, $\Pr(D \mid E)$, $OR$) and thus measures of deviation from independence. For instance, for a population-based design, the larger the value of the sample-derived test statistic $(ad - bc)^2$, the greater the evidence against disease-exposure independence in the population. But what is large? Hypothesis testing procedures provide benchmarks for assessing largeness, by accounting for the sampling distribution of test statistics.

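Birthweight and mother's marital status, $n = 200$, under three study designs. Disease status: low ($D$) vs. normal ($\bar{D}$) birthweight; exposure status: unmarried ($E$) vs. married ($\bar{E}$). Cells are labeled as in Table 1.

| Design | $a$ | $b$ | $c$ | $d$ | $n_E$ | $n_{\bar{E}}$ | $n_D$ | $n_{\bar{D}}$ | $\widehat{OR}$ |
|---|---|---|---|---|---|---|---|---|---|
| (a) Population-based | 7 | 52 | 7 | 134 | 59 | 141 | 14 | 186 | 2.58 |
| (b) Cohort | 12 | 88 | 5 | 95 | 100 | 100 | 17 | 183 | 2.59 |
| (c) Case-control | 50 | 28 | 50 | 72 | 78 | 122 | 100 | 100 | 2.57 |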

Statistical inference involves using observed data with known empirical distribution $P_n$ to learn properties of, i.e., estimate parameters of or test hypotheses for, the unknown data generating distribution $P$. In estimation, the estimator $\psi_n = \hat{\Psi}(P_n)$ is known, whereas the parameter $\psi = \Psi(P)$ is unknown.

Hypothesis testing is concerned with using observed data to make decisions regarding properties of, i.e., test hypotheses for, the unknown data generating distribution. Specifically, consider a data generating distribution $P \in \mathcal{M}$, belonging to a model $\mathcal{M}$, i.e., a set of possibly non-parametric distributions. Suppose one has a learning set $\mathcal{X}_n \equiv \{X_i : i = 1, \ldots, n\} \overset{IID}{\sim} P$, of $n$ independent and identically distributed (IID) random variables (RV) from $P$. Let $P_n$ denote the empirical distribution of the learning set $\mathcal{X}_n$, which places probability $1/n$ on each $X_i$, $i = 1, \ldots, n$.

One can define a null hypothesis as $H_0 \equiv I(P \in \mathcal{M}_0)$ and the corresponding alternative hypothesis as $H_1 \equiv I(P \notin \mathcal{M}_0)$, in terms of a submodel $\mathcal{M}_0 \subseteq \mathcal{M}$ for the data generating distribution $P$. Null hypotheses are often expressed in terms of parameters $\psi = \Psi(P)$, defined as functions of the data generating distribution $P$, e.g., a mean, regression coefficient, or odds ratio. A hypothesis testing procedure is a data-driven rule for deciding whether to reject a null hypothesis, i.e., declare it false.

Decisions to reject or not a null hypothesis are based on test statistics $T_n = T(P_n)$, defined as functions of the empirical distribution $P_n$, i.e., of the data $\mathcal{X}_n$, e.g., t-statistics, χ²-statistics, F-statistics, likelihood ratio statistics. Testing procedures provide rejection regions $\mathcal{C}$ for test statistics $T_n$ and can be expressed as
$$\text{Reject } H_0 \text{ iff } T_n \in \mathcal{C}. \tag{4}$$
In any testing problem, two types of errors can be committed. A Type I error, or false positive, is committed by rejecting a true null hypothesis. A Type II error, or false negative, is committed by failing to reject a false null hypothesis.

The actual Type I error rate of a testing procedure is the chance of incorrectly rejecting the null hypothesis $H_0$: $\Pr(\text{Reject } H_0)$, $P \in \mathcal{M}_0$. The actual power of a testing procedure is the chance of correctly rejecting the null hypothesis $H_0$: $\Pr(\text{Reject } H_0)$, $P \notin \mathcal{M}_0$. Ideally, one would like to simultaneously minimize both Type I errors and Type II errors. Unfortunately, this is not feasible and one seeks a trade-off between the two types of errors. This trade-off typically involves the minimization of Type II errors, i.e., the maximization of power, subject to a Type I error constraint.

The actual Type I error rate of a testing procedure is often different from its nominal Type I error rate, i.e., the level $\alpha$ at which the procedure seeks to control false positives.
Conservative: $\Pr(\text{Reject } H_0) < \alpha$, $P \in \mathcal{M}_0$. (5)
Anti-conservative: $\Pr(\text{Reject } H_0) > \alpha$, $P \in \mathcal{M}_0$.

A key component of any testing procedure is the test statistics null distribution (rather than data generating null distribution) used to obtain test statistic rejection regions, parameter confidence regions, and p-values. Indeed, whether testing single or multiple hypotheses, one needs the (joint) distribution of the test statistics in order to derive a procedure that probabilistically controls Type I errors. In practice, however, the true distribution of the test statistics is unknown and replaced by a null distribution. The choice of a proper null distribution is crucial in order to ensure that control of the Type I error rate under the assumed null distribution does indeed provide the desired control under the true distribution.

This issue is particularly relevant for large-scale testing problems, such as those encountered in biomedical and genomic research, which concern high-dimensional multivariate distributions, with complex and unknown dependence structures among variables. Resampling procedures (e.g., bootstrap, permutation) are useful for consistent estimation of the null distribution and of the corresponding test statistic rejection regions, parameter confidence regions, and p-values.

A testing procedure's nominal Type I error rate $\Pr_0(\text{Reject } H_0)$, under a null distribution, can differ from its actual Type I error rate $\Pr(\text{Reject } H_0)$, under the true data generating distribution $P \in \mathcal{M}_0$.
Conservative: $\Pr(\text{Reject } H_0) < \Pr_0(\text{Reject } H_0)$. (6)
Anti-conservative: $\Pr(\text{Reject } H_0) > \Pr_0(\text{Reject } H_0)$.

The p-value for the test of a null hypothesis $H_0$ is the smallest nominal Type I error level $\alpha$ at which one would reject $H_0$, given the data. That is,
$$P_{0n} \equiv \inf\{\alpha \in [0, 1] : \text{Reject } H_0 \text{ at nominal level } \alpha\} = \inf\{\alpha \in [0, 1] : T_n \in \mathcal{C}(\alpha)\}, \tag{7}$$
where $\mathcal{C}(\alpha)$ is a rejection region chosen with the aim that $\Pr(T_n \in \mathcal{C}(\alpha)) \leq \alpha$, $P \in \mathcal{M}_0$. The smaller the p-value $P_{0n}$, the stronger the evidence against the null hypothesis $H_0$. Thus, for a test with nominal Type I error level $\alpha$, one rejects $H_0$ for small p-value $P_{0n}$,
$$\text{Reject } H_0 \text{ iff } P_{0n} \leq \alpha. \tag{8}$$

P-values are flexible summaries of a testing procedure, in the sense that results are supplied for all Type I error levels $\alpha$, i.e., the level $\alpha$ need not be chosen ahead of time. P-values provide convenient benchmarks to compare different testing procedures, whereby smaller p-values indicate a less conservative procedure.

There is a duality between hypothesis testing and confidence regions. Consider a data generating distribution $P$, a parameter $\psi = \Psi(P) \in \mathbb{R}$, and a null hypothesis of the form $H_0 = I(\psi = \psi_0)$ or $H_0 = I(\psi \leq \psi_0)$. A $(1 - \alpha)100\%$ confidence region (CR) for the parameter $\psi$ is a function $A_n(\alpha) = A(P_n)(\alpha) \subseteq \mathbb{R}$ of the empirical distribution $P_n$ such that
$$\Pr(\psi \in A_n(\alpha)) \geq (1 - \alpha). \tag{9}$$
That is, the chance that the data-driven subset (i.e., random variable) $A_n(\alpha)$ contains the true unknown parameter $\psi$ is at least $(1 - \alpha)$.

This suggests rejecting the null hypothesis $H_0$ if the confidence region $A_n(\alpha)$ does not include the null value $\psi_0$. Indeed, if the confidence region is derived such that $\Pr_0(\psi_0 \in A_n(\alpha)) \geq (1 - \alpha)$ under a null distribution that satisfies the null hypothesis $H_0$, then the testing procedure
$$\text{Reject } H_0 \text{ iff } \psi_0 \notin A_n(\alpha) \tag{10}$$
has nominal Type I error rate at most $\alpha$ under the null distribution.

Three representations of a hypothesis testing procedure. One can report the results of a hypothesis testing procedure with nominal Type I error level $\alpha$ in terms of the following three entities.
- Rejection region $\mathcal{C}(\alpha)$ for the test statistic $T_n$: Reject $H_0$ iff $T_n \in \mathcal{C}(\alpha)$.
- p-value $P_{0n}$: Reject $H_0$ iff $P_{0n} \leq \alpha$.
- Confidence region $A_n(\alpha)$ for the parameter of interest $\psi$: Reject $H_0$ iff $\psi_0 \notin A_n(\alpha)$.
The first two are general and equivalent representations, while the third applies only to parametric hypotheses such as $H_0 = I(\psi = \psi_0)$ or $H_0 = I(\psi \leq \psi_0)$.

A χ²-statistic does not necessarily follow a χ²-distribution. One can mechanically plug any set of numbers into the formula for a test statistic, such as the χ²-statistic of Equation (12), below. However, the sampling distribution of the test statistic depends on the data generating distribution, i.e., the population of interest and the study design. For a given testing procedure, the nominal Type I error rate, under a null distribution, usually differs from the actual Type I error rate, under the true data generating distribution. Beware of anti-conservative/conservative behavior.

A p-value is NOT the chance that the null hypothesis is true. Be precise with conditional probabilities. A p-value is a random variable. Indeed, a p-value is a function of the data and can therefore be viewed as a test statistic. Ask yourself: What is fixed/random? What is known/unknown? What is a function of the unknown $P$/known $P_n$? $[0, 1] \neq \{0.05\}$.

Statistical significance ≠ subject-matter significance. Compare a statistically significant tiny p-value for a subject-matter insignificant tiny association in a study with large sample size vs. a statistically significant tiny p-value for a subject-matter significant large association in a study with small sample size. Cynical view: p-values are proxies for sample size.

Probability statements are only meaningful with respect to distributions, and many distributions can satisfy a null hypothesis. It is more rigorous to say that a p-value is the chance of obtaining a value as extreme or more extreme than the observed value of the test statistic "under a distribution that satisfies the null hypothesis" vs. "under the null hypothesis".

Inference frameworks: Bayesian vs. frequentist.

Multiple hypothesis testing. When multiple hypotheses are tested simultaneously, there is an increased chance of committing at least one Type I error. Small (unadjusted) p-values that would lead to the rejection of a single hypothesis (e.g., 0.001) no longer correspond to statistically significant findings. E.g., the chance that at least one p-value is less than $\alpha$ for $M$ independent test statistics is $1 - (1 - \alpha)^M$ and converges to one as $M$ increases. For $M = 1{,}000$ and $\alpha = 0.01$, this chance is approximately 0.99996! One needs to adjust for multiple hypothesis testing.

E.g., combining data from $M$ 2 × 2 tables (Jewell, 2003, Section 9.1).
- Sum of χ²-statistics: low power due to inflated degrees of freedom, $M$.
- Bonferroni procedure: very conservative, $\alpha/M$ p-value cut-off.
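The inflation of the Type I error with the number of tests, and the effect of a Bonferroni cut-off, can be checked numerically. This is a minimal sketch assuming $M$ independent tests with all null hypotheses true; the function names are hypothetical.

```python
def chance_of_any_false_positive(alpha, m):
    """Chance that at least one of m independent p-values falls below
    alpha when all m null hypotheses are true: 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_cutoff(alpha, m):
    """Per-test p-value cut-off alpha / m used by the Bonferroni procedure."""
    return alpha / m

# M = 1,000 tests at level alpha = 0.01, without and with adjustment.
unadjusted = chance_of_any_false_positive(0.01, 1000)
adjusted = chance_of_any_false_positive(bonferroni_cutoff(0.01, 1000), 1000)
print(f"unadjusted: {unadjusted:.5f}, Bonferroni: {adjusted:.5f}")
```

With the Bonferroni cut-off, the chance of at least one false positive stays below the nominal level, which is the sense in which the procedure controls, conservatively, the Type I error.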

Population-based design

In a population-based design, one needs a sampling frame for the study population. A simple population-based design involves taking a random sample from the population of interest and assessing disease and exposure status for the sampled individuals. With such a study, one can estimate disease and exposure marginal and joint probabilities directly from the empirical frequencies in Table 1.
- $\Pr(D \& E)$ can be estimated by $a/n$;
- $\Pr(D)$ can be estimated by $\hat{p} \equiv (a + c)/n$;
- $\Pr(E)$ can be estimated by $\hat{q} \equiv (a + b)/n$.
Constants: $n$. Random variables: $n_D$, $n_{\bar{D}}$, $n_E$, $n_{\bar{E}}$, $a$, $b$, $c$, $d$.

The null hypothesis of disease-exposure independence is formulated as $H_0 : \Pr(D \& E) = \Pr(D)\Pr(E)$. The measure of disease-exposure dependence $\Pr(D \& E) - \Pr(D)\Pr(E)$ can be estimated by the corresponding difference in empirical frequencies
$$\frac{a}{n} - \hat{p}\hat{q} = \frac{a}{n} - \frac{(a + c)}{n}\frac{(a + b)}{n} = \frac{ad - bc}{n^2}.$$

Note that one can equivalently consider any of the three other cells in Table 1, i.e., focus on any of the following three measures of disease-exposure dependence:
$$\Pr(\bar{D} \& E) - \Pr(\bar{D})\Pr(E), \quad \Pr(D \& \bar{E}) - \Pr(D)\Pr(\bar{E}), \quad \Pr(\bar{D} \& \bar{E}) - \Pr(\bar{D})\Pr(\bar{E}).$$
Since the sample size $n$ is fixed by design, it is natural to use $\hat{\delta} \equiv ad - bc$ as an empirical measure of disease-exposure dependence. Furthermore, ignoring the sign of the difference yields the measure $\hat{\delta}^2 = (ad - bc)^2$. Large values of the test statistics $|\hat{\delta}|$ and $\hat{\delta}^2$ are suggestive of population disease-exposure dependence.

To assess largeness, one needs to account for the sampling variation in $ad - bc$. In a population-based study with large sample size $n$, the sampling distribution of the random variable corresponding to $ad - bc$ can be approximated by a Gaussian distribution. Under the null hypothesis of disease-exposure independence, the RV for $ad - bc$ has expected value zero and a variance that can be estimated by
$$\hat{v} \equiv \frac{1}{n}(a + b)(a + c)(b + d)(c + d). \tag{11}$$

This suggests the following test statistics.
t-statistic: $t_1 \equiv \dfrac{\hat{\delta} - 0}{\sqrt{\hat{v}}} = \dfrac{\sqrt{n}\,(ad - bc)}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$;
χ²-statistic: $t_2 \equiv t_1^2 = \dfrac{n(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}$. (12)
Under $H_0$ and for large $n$,
$$T_1 \sim N(0, 1) \quad \text{and} \quad T_2 \sim \chi^2(1). \tag{13}$$

One can use the $\chi^2(1)$ distribution as a null distribution for the test statistic $T_2$ to derive rejection regions and p-values. For a nominal Type I error level $\alpha$, a rejection region for $T_2$ can be defined as $\mathcal{C}(\alpha) = [c(\alpha), \infty)$, where $c(\alpha)$ is the $(1 - \alpha)$-quantile of the $\chi^2(1)$ distribution. That is, one rejects the null hypothesis of disease-exposure independence when $T_2$ exceeds the critical value or cut-off $c(\alpha)$. Under the $\chi^2(1)$ null distribution, the chance of committing a Type I error, i.e., incorrectly rejecting the null hypothesis of disease-exposure independence, is $\alpha$. That is,
$$\Pr{}_{\chi^2(1)}(\text{Reject } H_0) = \Pr{}_{\chi^2(1)}(T_2 \geq c(\alpha)) = \alpha.$$

Note that the procedure's nominal Type I error rate $\Pr_{\chi^2(1)}(\text{Reject } H_0)$, under the $\chi^2(1)$ (or other) null distribution, can differ from its actual Type I error rate $\Pr(\text{Reject } H_0)$, under the true data generating distribution $P \in \mathcal{M}_0$. For a test based on the test statistic $T_2$ and the $\chi^2(1)$ null distribution, the p-value can be expressed as the chance that a $\chi^2(1)$ random variable is at least as large as the observed value $t_2$, i.e.,
$$p = \Pr{}_{\chi^2(1)}(T_2 \geq t_2). \tag{14}$$

A parameter of interest for measuring disease-exposure association is the odds ratio,
$$OR = \frac{\Pr(D \mid E)/\Pr(\bar{D} \mid E)}{\Pr(D \mid \bar{E})/\Pr(\bar{D} \mid \bar{E})}.$$
For population-based designs, one can estimate the odds ratio by replacing the unknown conditional probabilities by the corresponding empirical frequencies from Table 1. That is,
$$\widehat{OR} \equiv \frac{\hat{p}_E/(1 - \hat{p}_E)}{\hat{p}_{\bar{E}}/(1 - \hat{p}_{\bar{E}})} = \frac{(a/(a + b))/(b/(a + b))}{(c/(c + d))/(d/(c + d))} = \frac{ad}{bc}, \tag{15}$$
where $p_E \equiv \Pr(D \mid E)$ and $p_{\bar{E}} \equiv \Pr(D \mid \bar{E})$ are estimated, respectively, by $\hat{p}_E \equiv a/(a + b)$ and $\hat{p}_{\bar{E}} \equiv c/(c + d)$.

The sampling distribution of the odds ratio estimator $\widehat{OR}$ tends to be skewed to the right, especially for small sample sizes $n$ (Jewell, 2003, Figure 7.1). It is therefore not appropriate to rely on a Gaussian approximation to derive rejection regions and confidence regions for the odds ratio. However, the sampling distribution of the log odds ratio estimator $\log(\widehat{OR})$ tends to be symmetric and more amenable to a Gaussian approximation (Jewell, 2003, Figure 7.2). As detailed in Jewell (2003), an estimate of the variance of $\log(\widehat{OR})$ is given by
$$\hat{v} \equiv \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}. \tag{16}$$

Thus, one has the following Gaussian approximation to the sampling distribution of the empirical log odds ratio,
$$\frac{\log(\widehat{OR}) - \log(OR)}{\sqrt{\hat{V}}} \sim N(0, 1). \tag{17}$$
A two-sided $(1 - \alpha)100\%$ confidence interval (CI) for the log odds ratio parameter $\log(OR)$ is given by
$$\log(ad) - \log(bc) \pm \Phi^{-1}(1 - \alpha/2)\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}, \tag{18}$$
where $\Phi$ denotes the standard Gaussian CDF and $\Phi^{-1}(1 - \alpha/2)$ the corresponding $(1 - \alpha/2)$-quantile. One can derive the same estimators and confidence intervals for the cohort and case-control designs considered below.

Table 2: Population-based design. Birthweight and mother's marital status, $n = 200$.

| Marital status | Low birthweight, $D$ | Normal birthweight, $\bar{D}$ | Total |
|---|---|---|---|
| Unmarried, $E$ | 7 | 52 | $n_E = 59$ |
| Married, $\bar{E}$ | 7 | 134 | $n_{\bar{E}} = 141$ |
| Total | $n_D = 14$ | $n_{\bar{D}} = 186$ | $n = 200$ |

$\widehat{OR} = ad/(bc) \approx 2.58$, $t_2 \approx 3.04$, $p \approx 0.08$, 95% confidence interval for $\log(OR)$: $[-0.15, 2.04] \ni 0$.
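The summary statistics for the population-based example can be recomputed from Equations (12), (14), (15), and (18). The sketch below uses only the Python standard library; the function name `analyze_2x2` is hypothetical, and the cell counts $c = 7$ and $d = 134$ are an assumption inferred from the stated margins ($n_D = 14$, $n_{\bar{E}} = 141$). The $\chi^2(1)$ tail probability is obtained from the identity $\Pr(\chi^2(1) \geq t) = \mathrm{erfc}(\sqrt{t/2})$.

```python
import math
from statistics import NormalDist

def analyze_2x2(a, b, c, d, alpha=0.05):
    """Odds ratio, chi-square statistic, chi-square(1) p-value and
    log odds ratio confidence interval for a 2 x 2 table."""
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)                              # Equation (15)
    t2 = n * (a * d - b * c) ** 2 / (
        (a + b) * (a + c) * (b + d) * (c + d))                  # Equation (12)
    p_value = math.erfc(math.sqrt(t2 / 2.0))                    # Equation (14)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)               # sqrt of Equation (16)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_or = math.log(odds_ratio)
    return odds_ratio, t2, p_value, (log_or - z * se, log_or + z * se)

# Population-based example (c and d inferred from the margins).
odds_ratio, t2, p_value, ci = analyze_2x2(7, 52, 7, 134)
print(round(odds_ratio, 2), round(t2, 2), round(p_value, 3),
      [round(x, 2) for x in ci])
```

The same function applies unchanged to the cohort and case-control examples, since the estimators and test statistics coincide across the three designs.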

Cohort design

In a cohort design, one needs a sampling frame for the exposed and unexposed populations. A simple cohort design involves taking independent random samples from the exposed and unexposed populations and assessing disease status for the sampled individuals. With such a study, one can only estimate the conditional disease probabilities given exposure from the empirical frequencies in Table 1.
- $p_E \equiv \Pr(D \mid E)$ can be estimated by $\hat{p}_E \equiv a/(a + b)$;
- $p_{\bar{E}} \equiv \Pr(D \mid \bar{E})$ can be estimated by $\hat{p}_{\bar{E}} \equiv c/(c + d)$.
Constants: $n$, $n_E$, $n_{\bar{E}}$. Random variables: $n_D$, $n_{\bar{D}}$, $a$, $b$, $c$, $d$.

The null hypothesis of disease-exposure independence can be restated as $H_0 : \Pr(D \mid E) = \Pr(D \mid \bar{E})$. The measure of disease-exposure dependence $\delta \equiv \Pr(D \mid E) - \Pr(D \mid \bar{E})$ can be estimated by the corresponding difference in empirical frequencies
$$\hat{\delta} \equiv \hat{p}_E - \hat{p}_{\bar{E}} = \frac{a}{a + b} - \frac{c}{c + d},$$
with Gaussian asymptotic distribution and estimated null variance
$$\hat{v} \equiv \hat{p}(1 - \hat{p})\left(\frac{1}{n_E} + \frac{1}{n_{\bar{E}}}\right) = \left(\frac{a + c}{n}\right)\left(\frac{b + d}{n}\right)\left(\frac{1}{a + b} + \frac{1}{c + d}\right). \tag{19}$$

This suggests the following test statistics.
t-statistic: $t_1 \equiv \dfrac{\hat{\delta} - 0}{\sqrt{\hat{v}}}$;
χ²-statistic: $t_2 \equiv t_1^2 = \dfrac{\hat{\delta}^2}{\hat{v}}$. (20)
Under $H_0$ and for large $n$,
$$T_1 \sim N(0, 1) \quad \text{and} \quad T_2 \sim \chi^2(1). \tag{21}$$

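Furthermore, it turns out that the $T_1$ and $T_2$ test statistics for the cohort design are identical to the corresponding test statistics in Equation (12) for the population-based design. Indeed, using $\frac{1}{a+b} + \frac{1}{c+d} = \frac{n}{(a+b)(c+d)}$,
$$t_1^{cohort} = \frac{\frac{a}{a+b} - \frac{c}{c+d}}{\sqrt{\left(\frac{a+c}{n}\right)\left(\frac{b+d}{n}\right)\left(\frac{1}{a+b} + \frac{1}{c+d}\right)}} = \frac{\frac{ac + ad - ac - bc}{(a+b)(c+d)}}{\sqrt{\frac{(a+c)(b+d)}{n^2}\,\frac{n}{(a+b)(c+d)}}} = \frac{\sqrt{n}\,(ad - bc)}{\sqrt{(a + b)(a + c)(b + d)(c + d)}} = t_1^{popn}.$$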
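The algebraic identity between the cohort and population-based statistics can also be checked numerically. The following is a minimal sketch with hypothetical function names; the counts $a = 12$ and $b = 88$ are an assumption inferred from the margins of the cohort example ($n_D = 17$, $n_E = 100$, $c = 5$, $d = 95$).

```python
import math

def t1_population(a, b, c, d):
    """t-statistic of Equation (12), based on ad - bc."""
    n = a + b + c + d
    return math.sqrt(n) * (a * d - b * c) / math.sqrt(
        (a + b) * (a + c) * (b + d) * (c + d))

def t1_cohort(a, b, c, d):
    """t-statistic of Equation (20), based on the difference in
    conditional disease frequencies and the pooled null variance (19)."""
    n = a + b + c + d
    n_e, n_ebar = a + b, c + d
    p_hat = (a + c) / n
    delta_hat = a / n_e - c / n_ebar
    v_hat = p_hat * (1 - p_hat) * (1 / n_e + 1 / n_ebar)
    return delta_hat / math.sqrt(v_hat)

t_pop = t1_population(12, 88, 5, 95)
t_coh = t1_cohort(12, 88, 5, 95)
print(t_pop, t_coh, t_coh ** 2)
```

Both functions return the same value, and its square matches the χ²-statistic reported for the cohort example.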

Table 3: Cohort design. Birthweight and mother's marital status, $n_E = n_{\bar{E}} = 100$.

| Marital status | Low birthweight, $D$ | Normal birthweight, $\bar{D}$ | Total |
|---|---|---|---|
| Unmarried, $E$ | 12 | 88 | $n_E = 100$ |
| Married, $\bar{E}$ | 5 | 95 | $n_{\bar{E}} = 100$ |
| Total | $n_D = 17$ | $n_{\bar{D}} = 183$ | $n = 200$ |

$\widehat{OR} = ad/(bc) \approx 2.59$, $t_2 \approx 3.15$, $p \approx 0.08$, 95% confidence interval for $\log(OR)$: $[-0.13, 2.03] \ni 0$.

Case-control design

In a case-control design, one needs a sampling frame for the disease/case and no disease/control populations. A simple case-control design involves taking independent random samples from the disease and no disease populations and assessing exposure status for the sampled individuals. With such a study, one can only estimate the conditional exposure probabilities given disease from the empirical frequencies in Table 1.
- $q_D \equiv \Pr(E \mid D)$ can be estimated by $\hat{q}_D \equiv a/(a + c)$;
- $q_{\bar{D}} \equiv \Pr(E \mid \bar{D})$ can be estimated by $\hat{q}_{\bar{D}} \equiv b/(b + d)$.
Constants: $n$, $n_D$, $n_{\bar{D}}$. Random variables: $n_E$, $n_{\bar{E}}$, $a$, $b$, $c$, $d$.

The null hypothesis of disease-exposure independence can be restated as $H_0 : \Pr(E \mid D) = \Pr(E \mid \bar{D})$. The measure of disease-exposure dependence $\delta \equiv \Pr(E \mid D) - \Pr(E \mid \bar{D})$ can be estimated by the corresponding difference in empirical frequencies
$$\hat{\delta} \equiv \hat{q}_D - \hat{q}_{\bar{D}} = \frac{a}{a + c} - \frac{b}{b + d},$$
with Gaussian asymptotic distribution and estimated null variance
$$\hat{v} \equiv \hat{q}(1 - \hat{q})\left(\frac{1}{n_D} + \frac{1}{n_{\bar{D}}}\right) = \left(\frac{a + b}{n}\right)\left(\frac{c + d}{n}\right)\left(\frac{1}{a + c} + \frac{1}{b + d}\right). \tag{22}$$

This suggests the following test statistics.
t-statistic: $t_1 \equiv \dfrac{\hat{\delta} - 0}{\sqrt{\hat{v}}}$;
χ²-statistic: $t_2 \equiv t_1^2 = \dfrac{\hat{\delta}^2}{\hat{v}}$. (23)
Under $H_0$ and for large $n$,
$$T_1 \sim N(0, 1) \quad \text{and} \quad T_2 \sim \chi^2(1). \tag{24}$$
Furthermore, it turns out that the $T_1$ and $T_2$ test statistics for the case-control design are identical to the corresponding test statistics for the population-based and cohort designs.

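Table 4: Case-control design. Birthweight and mother's marital status, $n_D = n_{\bar{D}} = 100$.

| Marital status | Low birthweight, $D$ | Normal birthweight, $\bar{D}$ | Total |
|---|---|---|---|
| Unmarried, $E$ | 50 | 28 | $n_E = 78$ |
| Married, $\bar{E}$ | 50 | 72 | $n_{\bar{E}} = 122$ |
| Total | $n_D = 100$ | $n_{\bar{D}} = 100$ | $n = 200$ |

$\widehat{OR} = ad/(bc) \approx 2.57$, $t_2 \approx 10.17$, $p \approx 0.001$, 95% confidence interval for $\log(OR)$: $[0.36, 1.53] \not\ni 0$.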

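Comparison of study designs

Table 5: Comparison of study designs. Results of the χ²-test of independence and log odds ratio confidence intervals for population-based, cohort, and case-control study designs.

| Study design | Odds ratio, $\widehat{OR}$ | χ²-statistic, $t_2$ | p-value, $p$ | $\log(OR)$ 95% CI |
|---|---|---|---|---|
| Population-based | 2.58 | 3.04 | ≈ 0.08 | [-0.15, 2.04] |
| Cohort | 2.59 | 3.15 | ≈ 0.08 | [-0.13, 2.03] |
| Case-control | 2.57 | ≈ 10.17 | ≈ 0.001 | [0.36, 1.53] |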

The differences in results for the χ²-test of independence reflect differences in power among the three study designs, i.e., in the chance of rejecting the null hypothesis of independence when there truly is a disease-exposure association in the population. In general, the power of a testing procedure depends on a variety of factors, including the extent of the deviation from the null hypothesis (i.e., the true unknown population association, as measured, for example, by $RR$, $OR$, $ER$, $AR$) and the sample size $n$. However, these two factors are the same in each of our three examples.

As in our setting, power is further influenced by the study design via the balance of the empirical marginal frequencies for disease ($n_D$, $n_{\bar{D}}$) and exposure ($n_E$, $n_{\bar{E}}$).

Population-based vs. cohort designs. The χ²-statistic used for both designs may be written as
$$t_2 = \frac{(\hat{p}_E - \hat{p}_{\bar{E}})^2}{\hat{p}(1 - \hat{p})\left(\frac{1}{n_E} + \frac{1}{n_{\bar{E}}}\right)}, \tag{25}$$
where
- $\hat{p} = (a + c)/n$ is an estimate of $\Pr(D)$ (population-based design),
- $\hat{p}_E = a/(a + b)$ is an estimate of $\Pr(D \mid E)$,
- $\hat{p}_{\bar{E}} = c/(c + d)$ is an estimate of $\Pr(D \mid \bar{E})$.

The numerator term $\hat{p}_E - \hat{p}_{\bar{E}}$ behaves in a similar manner for both designs, in the sense that it converges to the corresponding parameter $\Pr(D \mid E) - \Pr(D \mid \bar{E})$, which is fixed for a given population. The design influences the test statistic through the variance term $1/n_E + 1/n_{\bar{E}}$. In particular, this term is minimized and the test statistic $t_2$ maximized when $n_E = n_{\bar{E}} = n/2$. For a cohort design with fixed total sample size $n$, the optimal exposure sample size allocation in terms of power is $n_E = n_{\bar{E}} = n/2$. For a population-based design, the exposure sample sizes $n_E$ and $n_{\bar{E}}$ are unlikely to be equal, as they are random variables with distribution depending on the unknown population frequency $\Pr(E)$.
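The claim that the variance term $1/n_E + 1/n_{\bar{E}}$ is smallest for a balanced allocation can be verified by brute force. This is a small sketch with a hypothetical function name, using $n = 200$ as in the examples.

```python
def variance_factor(n_e, n):
    """Design-dependent variance term 1/n_E + 1/n_Ebar, with n_Ebar = n - n_E."""
    return 1.0 / n_e + 1.0 / (n - n_e)

n = 200
# Search over all allocations with at least one subject per exposure group.
best_n_e = min(range(1, n), key=lambda n_e: variance_factor(n_e, n))
balanced = variance_factor(n // 2, n)   # equals 4 / n
unbalanced = variance_factor(59, n)     # n_E = 59, as in the population-based example
print(best_n_e, balanced, unbalanced)
```

The unbalanced allocation observed in the population-based example gives a larger variance term than the balanced cohort allocation, and hence a smaller test statistic for the same frequency difference.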

For large sample size $n$, a population-based design yields a less powerful χ²-test of independence than a cohort design with equal exposure sample sizes $n_E = n_{\bar{E}} = n/2$.

Population-based vs. case-control designs. A similar argument as above, with the roles of disease and exposure interchanged, can be applied to compare population-based and case-control designs. For a case-control design with fixed total sample size $n$, the optimal disease sample size allocation in terms of power is $n_D = n_{\bar{D}} = n/2$. For large sample size $n$, a population-based design yields a less powerful χ²-test of independence than a case-control design with equal disease sample sizes $n_D = n_{\bar{D}} = n/2$.

Cohort vs. case-control designs. Consider for simplicity cohort and case-control designs with balanced sample sizes $n_E = n_{\bar{E}} = n/2$ and $n_D = n_{\bar{D}} = n/2$, respectively. For the cohort design, $\hat{p} = (\hat{p}_E + \hat{p}_{\bar{E}})/2$ and $1/n_E + 1/n_{\bar{E}} = 4/n$. The power of the χ²-test is then driven by the expected value of the random variable for $(\hat{p}_E - \hat{p}_{\bar{E}})/\sqrt{\hat{p}(1 - \hat{p})}$, which, for large sample size $n$, can be approximated by the scaled frequency difference parameter
$$\delta = \frac{p_E - p_{\bar{E}}}{\sqrt{\left((p_E + p_{\bar{E}})/2\right)\left(1 - (p_E + p_{\bar{E}})/2\right)}}, \tag{26}$$
where $p_E = \Pr(D \mid E)$ and $p_{\bar{E}} = \Pr(D \mid \bar{E})$.

The closer to 1/2 the average $(\Pr(D \mid E) + \Pr(D \mid \bar{E}))/2$, the greater $\delta$, hence the greater the power of the χ²-test for a cohort design. A similar argument as above, with the roles of disease and exposure interchanged, shows that the closer to 1/2 the average $(\Pr(E \mid D) + \Pr(E \mid \bar{D}))/2$, the greater the power of the χ²-test for a case-control design. If $(\Pr(D \mid E) + \Pr(D \mid \bar{E}))/2$ is closer to 1/2 than $(\Pr(E \mid D) + \Pr(E \mid \bar{D}))/2$, then a cohort design leads to a more powerful χ²-test than a case-control design. It can further be shown that if $\Pr(D)$ is closer to 1/2 than $\Pr(E)$, then a cohort design leads to a more powerful χ²-test than a case-control design.
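To make the comparison concrete, the scaled frequency difference of Equation (26) can be evaluated for both designs using the conditional frequencies of the population-based example, $(7/59, 7/141)$ for disease given exposure and $(7/14, 52/186)$ for exposure given disease. The function name below is hypothetical, and plugging sample frequencies into the large-sample parameter is an illustrative assumption.

```python
import math

def scaled_difference(p1, p2):
    """Scaled frequency difference of Equation (26):
    (p1 - p2) / sqrt(avg * (1 - avg)), with avg = (p1 + p2) / 2."""
    avg = (p1 + p2) / 2.0
    return (p1 - p2) / math.sqrt(avg * (1.0 - avg))

# Balanced cohort design: conditional disease frequencies given exposure.
delta_cohort = scaled_difference(7 / 59, 7 / 141)
# Balanced case-control design: conditional exposure frequencies given disease.
delta_case_control = scaled_difference(7 / 14, 52 / 186)
print(round(delta_cohort, 3), round(delta_case_control, 3))
```

Here the average exposure frequency is much closer to 1/2 than the average disease frequency, so the case-control scaled difference is the larger of the two, in line with the argument above.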

Figure 1: Cohort vs. case-control designs. Scaled frequency difference $\delta$ vs. average frequency for various fixed odds ratios (solid: $OR = 4$, dotted: $OR = 3$, dash-dotted: $OR = 2$).

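Table 6: Comparison of study designs. Factors affecting the power of the χ²-test of independence for population-based, cohort, and case-control study designs.

| Study design | $n_D$ | $n_{\bar{D}}$ | $n_E$ | $n_{\bar{E}}$ | $(\hat{q}_D, \hat{q}_{\bar{D}})$ | $(\hat{p}_E, \hat{p}_{\bar{E}})$ | Scaled difference, $\hat{\delta}$ |
|---|---|---|---|---|---|---|---|
| Population-based | 14 | 186 | 59 | 141 | (7/14, 52/186) | (7/59, 7/141) | NA |
| Cohort | 17 | 183 | 100 | 100 | NA | (12/100, 5/100) | ≈ 0.25 |
| Case-control | 100 | 100 | 78 | 122 | (50/100, 28/100) | NA | ≈ 0.45 |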

Summary of large sample power comparison.
- For a cohort design with fixed total sample size $n$, the χ²-test of independence is most powerful with equal exposure sample sizes $n_E = n_{\bar{E}} = n/2$.
- A cohort design with equal exposure sample sizes $n_E = n_{\bar{E}} = n/2$ yields a more powerful χ²-test of independence than a population-based design with the same overall sample size.
- For a case-control design with fixed total sample size $n$, the χ²-test of independence is most powerful with equal disease sample sizes $n_D = n_{\bar{D}} = n/2$.

- A case-control design with equal disease sample sizes $n_D = n_{\bar{D}} = n/2$ yields a more powerful χ²-test of independence than a population-based design with the same overall sample size.
- When $\Pr(D)$ is closer to 1/2 than $\Pr(E)$, a cohort design with equal exposure sample sizes yields a more powerful χ²-test of independence than a case-control design with equal disease sample sizes.
- When $\Pr(E)$ is closer to 1/2 than $\Pr(D)$, a case-control design with equal disease sample sizes yields a more powerful χ²-test of independence than a cohort design with equal exposure sample sizes.

Pearson's χ²-statistic

Let us denote the disease-exposure empirical frequencies of Table 1 by $(O_{DE}, O_{\bar{D}E}, O_{D\bar{E}}, O_{\bar{D}\bar{E}})$. For a population-based design with sample size $n$, the disease-exposure frequencies follow a multinomial distribution,
$$(O_{DE}, O_{\bar{D}E}, O_{D\bar{E}}, O_{\bar{D}\bar{E}}) \sim \text{Multinomial}\left(n, \left(\Pr(D \& E), \Pr(\bar{D} \& E), \Pr(D \& \bar{E}), \Pr(\bar{D} \& \bar{E})\right)\right). \tag{27}$$

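One can show that the χ²-statistic of Equation (12) can be written as a Pearson χ²-statistic,
$$T_2 = \sum_{i \in \{D, \bar{D}\}} \sum_{j \in \{E, \bar{E}\}} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \tag{28}$$
where the expected values of the disease-exposure frequencies under the null hypothesis of independence and estimators thereof are as follows.

| Frequency | Expected value under $H_0$ (parameter) | Estimate |
|---|---|---|
| $O_{DE}$ | $E_{DE} = n \Pr(D)\Pr(E)$ | $e_{DE} = n \frac{(a+c)}{n} \frac{(a+b)}{n}$ |
| $O_{\bar{D}E}$ | $E_{\bar{D}E} = n \Pr(\bar{D})\Pr(E)$ | $e_{\bar{D}E} = n \frac{(b+d)}{n} \frac{(a+b)}{n}$ |
| $O_{D\bar{E}}$ | $E_{D\bar{E}} = n \Pr(D)\Pr(\bar{E})$ | $e_{D\bar{E}} = n \frac{(a+c)}{n} \frac{(c+d)}{n}$ |
| $O_{\bar{D}\bar{E}}$ | $E_{\bar{D}\bar{E}} = n \Pr(\bar{D})\Pr(\bar{E})$ | $e_{\bar{D}\bar{E}} = n \frac{(b+d)}{n} \frac{(c+d)}{n}$ |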
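The equivalence between the Pearson form (28) and the shortcut form of Equation (12) can be confirmed numerically. The sketch below uses hypothetical function names and, for illustration, the counts $a = 7$, $b = 52$, $c = 7$, $d = 134$ (the last two are an assumption inferred from the margins of the population-based example).

```python
def pearson_chi2(a, b, c, d):
    """Pearson statistic: sum over the four cells of (O - e)^2 / e, with
    estimated null expected counts e = (column total)(row total) / n."""
    n = a + b + c + d
    observed = {("D", "E"): a, ("Dbar", "E"): b,
                ("D", "Ebar"): c, ("Dbar", "Ebar"): d}
    disease_totals = {"D": a + c, "Dbar": b + d}
    exposure_totals = {"E": a + b, "Ebar": c + d}
    total = 0.0
    for (i, j), o in observed.items():
        e = disease_totals[i] * exposure_totals[j] / n
        total += (o - e) ** 2 / e
    return total

def shortcut_chi2(a, b, c, d):
    """Chi-square statistic of Equation (12)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

pearson = pearson_chi2(7, 52, 7, 134)
shortcut = shortcut_chi2(7, 52, 7, 134)
print(pearson, shortcut)
```

The two functions agree up to floating-point error for any table with positive margins.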

Multiway tables

Pearson's χ²-statistic extends to categorical variables taking on more than two values and to higher-order multiway tables for more than two variables. Specifically, consider $J$ categorical random variables $X = (X(j) : j = 1, \ldots, J)$, where $X(j) \in \{1, \ldots, K_j\}$. The (data generating or empirical) joint distribution of $X$ may be represented using a $K_1 \times \cdots \times K_J$ contingency table.

The null hypothesis $H_0$ of mutual independence among the $J$ random variables may be tested using the Pearson χ²-statistic
$$T_2 \equiv \sum_{k_1 = 1}^{K_1} \cdots \sum_{k_J = 1}^{K_J} \frac{(O_{k_1 \ldots k_J} - E_{k_1 \ldots k_J})^2}{E_{k_1 \ldots k_J}}, \tag{29}$$
where $O_{k_1 \ldots k_J}$ and $E_{k_1 \ldots k_J}$ are, respectively, the observed frequencies and estimated null expected frequencies for cell $(k_1, \ldots, k_J)$. Under $H_0$ and for large $n$, $T_2$ has a $\chi^2(\nu)$ distribution with
$$\nu = \prod_{j=1}^{J} K_j - 1 - \sum_{j=1}^{J} (K_j - 1) \tag{30}$$
degrees of freedom.
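The degrees of freedom in Equation (30) are easy to tabulate for a given set of category counts. This is a minimal sketch; the function name `independence_df` is hypothetical.

```python
import math

def independence_df(levels):
    """Degrees of freedom of Equation (30) for testing mutual independence
    of J categorical variables with K_1, ..., K_J levels:
    (number of cells - 1) minus the number of free marginal parameters."""
    return math.prod(levels) - 1 - sum(k - 1 for k in levels)

print(independence_df([2, 2]))     # 2 x 2 table
print(independence_df([3, 4]))     # 3 x 4 table
print(independence_df([2, 2, 2]))  # 2 x 2 x 2 table
```

For two variables the formula reduces to the familiar $(K_1 - 1)(K_2 - 1)$, which gives one degree of freedom for the 2 × 2 table.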

Fisher's exact test

Since most of the information on disease-exposure association is contained in the interior entries of Table 1, we may as well assume that the row and column marginal totals are fixed. The analysis is then identical for all three study designs, although each design generates different marginal totals and thus has different power properties. With fixed margins, there is only one random variable, i.e., one degree of freedom, in the table. Indeed, given $a$, the other three interior entries, $b$, $c$, $d$, are determined. One can therefore focus on a single cell, without loss of generality $a$, to study the relationship between disease and exposure.

The random variable for $a$, $A$, has a non-central hypergeometric distribution, with parameters determined by the known marginal totals and the unknown population odds ratio $OR$. Under the null hypothesis of disease-exposure independence, $A$ has a central hypergeometric null distribution, with parameters fully determined by the marginal totals,
$$\Pr{}_0(A = a) = \binom{a + b}{a}\binom{c + d}{c} \bigg/ \binom{n}{a + c} = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{n!\,a!\,b!\,c!\,d!}, \tag{31}$$
$$E_0[A] = \frac{(a + b)(a + c)}{n}, \qquad \text{Var}_0[A] = \frac{(a + b)(c + d)(a + c)(b + d)}{n^2 (n - 1)}.$$
Fisher's exact test derives rejection regions and p-values based on the exact finite sample hypergeometric null distribution of the test statistic $A$.

Under $H_0$ and for large $n$,
$$T_1 \equiv \frac{A - E_0[A]}{\sqrt{\text{Var}_0[A]}} \sim N(0, 1) \quad \text{and} \quad T_2 \equiv \frac{(A - E_0[A])^2}{\text{Var}_0[A]} \sim \chi^2(1). \tag{32}$$
The above χ²-statistic is identical to the χ²-statistic of Equation (12), except for the $(n - 1)$ numerator term replacing the $n$ term. This difference is irrelevant asymptotically, but can have important implications for small sample size $n$.
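The central hypergeometric null distribution of Equation (31) and an exact p-value can be computed with integer binomial coefficients. The sketch below is illustrative only: the function names are hypothetical, the p-value shown is the one-sided upper-tail version $\Pr_0(A \geq a)$ (one of several conventions), and the counts $c = 7$, $d = 134$ are an assumption inferred from the margins of the population-based example.

```python
from math import comb

def hypergeom_pmf(x, n_e, n_ebar, n_d):
    """Pr_0(A = x) of Equation (31) for fixed margins: n_e exposed,
    n_ebar unexposed, and n_d diseased individuals."""
    return comb(n_e, x) * comb(n_ebar, n_d - x) / comb(n_e + n_ebar, n_d)

def fisher_upper_tail(a, b, c, d):
    """One-sided exact p-value Pr_0(A >= a) given the table margins."""
    n_e, n_ebar, n_d = a + b, c + d, a + c
    upper = min(n_e, n_d)
    return sum(hypergeom_pmf(x, n_e, n_ebar, n_d) for x in range(a, upper + 1))

a, b, c, d = 7, 52, 7, 134
n_e, n_ebar, n_d, n = a + b, c + d, a + c, a + b + c + d
support = range(max(0, n_d - n_ebar), min(n_e, n_d) + 1)
pmf = {x: hypergeom_pmf(x, n_e, n_ebar, n_d) for x in support}
mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
t2_fixed_margins = (a - mean) ** 2 / var   # Equation (32)
p_exact = fisher_upper_tail(a, b, c, d)
print(mean, var, t2_fixed_margins, p_exact)
```

The mean and variance computed from the probability mass function match the closed-form expressions above, and the resulting statistic equals $(n - 1)/n$ times the χ²-statistic of Equation (12).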

Asymptotic vs. finite sample null distributions

Our discussion of the χ²-test of independence has focused on large sample sizes $n$, that is, on rejection regions and p-values based on the asymptotic χ² null distribution approximation to the sampling distribution of the test statistic $T_2$. The appropriateness of this approximation depends not only on the sample size $n$ but also on the null joint frequencies, i.e., on the unknown marginal disease and exposure frequencies $\Pr(D)$ and $\Pr(E)$. The χ² null distribution has been shown to be valid so long as each of the null expected frequencies for Table 1 is greater than one.

A rule of thumb for justifying the χ² null distribution is that each of the estimated null expected values $(e_{DE}, e_{\bar{D}E}, e_{D\bar{E}}, e_{\bar{D}\bar{E}})$ be greater than one. That is, $e_{DE} = (a + b)(a + c)/n > 1$, etc. For a design with fixed marginal totals, one can use Fisher's exact test based on a central hypergeometric null distribution.
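The rule of thumb can be turned into a small check on the estimated null expected counts. This is a sketch with hypothetical function names; the threshold of one follows the rule stated above, and the second table is a made-up sparse example.

```python
def estimated_null_expected(a, b, c, d):
    """Estimated expected cell counts under independence:
    (row total) * (column total) / n for each of the four cells."""
    n = a + b + c + d
    return {
        "DE": (a + b) * (a + c) / n,
        "DbarE": (a + b) * (b + d) / n,
        "DEbar": (c + d) * (a + c) / n,
        "DbarEbar": (c + d) * (b + d) / n,
    }

def chi2_null_ok(a, b, c, d, threshold=1.0):
    """True when every estimated null expected count exceeds the threshold."""
    return all(e > threshold
               for e in estimated_null_expected(a, b, c, d).values())

print(estimated_null_expected(7, 52, 7, 134)["DE"])  # (59 * 14) / 200
print(chi2_null_ok(7, 52, 7, 134))
print(chi2_null_ok(1, 2, 1, 20))  # hypothetical sparse table
```

When the check fails, Fisher's exact test is the natural fallback.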

Multiple hypothesis testing

Please refer to the lecture notes Multiple Testing Procedures with Applications to Genomics.

- Data generating distribution: $P \in \mathcal{M}$.
- Parameters: $\psi = (\psi(j) : j = 1, \ldots, J)$, where $\psi(j) = \Psi(P)(j)$.
- Null and alternative hypotheses: $H_0(m) = I(P \in \mathcal{M}(m))$ and $H_1(m) = I(P \notin \mathcal{M}(m))$, where $\mathcal{M}(m) \subseteq \mathcal{M}$, $m = 1, \ldots, M$.
- Data and empirical distribution: $\mathcal{X}_n = \{X_i : i = 1, \ldots, n\} \overset{IID}{\sim} P$, $P_n$.
- Test statistics: $T_n = (T_n(m) : m = 1, \ldots, M)$, where $T_n(m) = T(m; \mathcal{X}_n) = T(m; P_n)$.
- Null distribution: $Q_0$ (or estimator thereof, $Q_{0n}$).
- Multiple testing procedure rejection regions: $\mathcal{R}_n = \mathcal{R}(T_n, Q_{0n}, \alpha) = \{m : T_n(m) \in \mathcal{C}_n(m)\} = \{m : H_0(m) \text{ is rejected}\}$.
- Type I error rate: $\theta_n = \Theta(F_{V_n, R_n})$, where $V_n = |\mathcal{R}_n \cap \mathcal{H}_0|$ = # Type I errors and $R_n = |\mathcal{R}_n|$ = # rejected hypotheses.
- Type II error rate/power: $\vartheta_n = \Theta(F_{U_n, R_n})$, where $U_n = |\mathcal{R}_n^c \cap \mathcal{H}_1|$ = # Type II errors.
- Summaries of results: adjusted p-values, test statistic rejection regions, parameter confidence regions.

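Table 7: Type I and Type II errors in multiple hypothesis testing.

| Null hypotheses | Non-rejected, $\mathcal{R}_n^c$ | Rejected, $\mathcal{R}_n$ | Total |
|---|---|---|---|
| True, $\mathcal{H}_0$ | $W_n = \lvert\mathcal{R}_n^c \cap \mathcal{H}_0\rvert$ | $V_n = \lvert\mathcal{R}_n \cap \mathcal{H}_0\rvert$ | $h_0$ |
| False, $\mathcal{H}_1$ | $U_n = \lvert\mathcal{R}_n^c \cap \mathcal{H}_1\rvert$ | $S_n = \lvert\mathcal{R}_n \cap \mathcal{H}_1\rvert$ | $h_1$ |
| Total | $M - R_n$ | $R_n$ | $M$ |

Type I errors: $\mathcal{R}_n \cap \mathcal{H}_0$. Type II errors: $\mathcal{R}_n^c \cap \mathcal{H}_1$.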

Table 8: Multiple hypothesis testing flowchart.
1. Specify the data generating distribution and parameters of interest: $P$, $\psi = (\psi(j) : j = 1, \ldots, J)$.
2. Define null and alternative hypotheses: $H_0(m) = I(P \in \mathcal{M}(m))$ and $H_1(m) = I(P \notin \mathcal{M}(m))$.
3. Specify test statistics: $T_n = (T_n(m) : m = 1, \ldots, M)$.
4. Estimate the test statistics null distribution: $Q_{0n}$.
5. Select a Type I error rate: $\Theta(F_{V_n, R_n})$.
6. Apply a multiple testing procedure (MTP).
7. Summarize results: adjusted p-values, rejection regions, confidence regions.

Apply MTP: available procedures, by Type I error rate.
- FWER, $\Pr(V_n > 0)$: single-step common-cut-off maxT; single-step common-quantile minP; step-down common-cut-off maxT; step-down common-quantile minP; resampling-based empirical Bayes.
- gFWER, $\Pr(V_n > k)$: single-step common-cut-off T(k + 1); single-step common-quantile P(k + 1); augmentation; resampling-based empirical Bayes.
- General $\Theta(F_{V_n})$: single-step common-cut-off; single-step common-quantile; resampling-based empirical Bayes.
- TPPFP, $\Pr(V_n/R_n > q)$: augmentation; resampling-based empirical Bayes.
- gTP, $\Pr(g(V_n, R_n) > q)$: augmentation; resampling-based empirical Bayes.
- FDR, $E[V_n/R_n]$: TPPFP-based; resampling-based empirical Bayes.
- gEV, $E[g(V_n, R_n)]$: gTP-based; resampling-based empirical Bayes.
- General $\Theta(F_{g(V_n, R_n)})$: resampling-based empirical Bayes.

References
- S. Dudoit and M. J. van der Laan. Multiple Testing Procedures with Applications to Genomics. Springer Series in Statistics. Springer, New York.
- N. P. Jewell. Statistics for Epidemiology. Chapman & Hall/CRC, 2003.
See also the articles and lecture notes on multiple hypothesis testing.
