PB HLTH 240A: Advanced Categorical Data Analysis Fall 2007
Sandrine Dudoit, Division of Biostatistics and Department of Statistics, University of California, Berkeley. September 17, 2007.
Acknowledgements. These lecture notes are based on Professor Jewell's notes and Chapter 6 of his book Statistics for Epidemiology.
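Outline. Hypothesis testing; study designs (population-based, cohort, case-control, and their comparison); Pearson's $\chi^2$-statistic, multiway tables, and Fisher's exact test; multiple hypothesis testing.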
A 2 x 2 contingency table is a convenient representation of the (data generating or empirical) joint distribution between two binary variables. E.g., the relationship between disease ($D$, $\bar{D}$ = not $D$) and exposure ($E$, $\bar{E}$ = not $E$).

Table 1: 2 x 2 contingency table. Generic 2 x 2 contingency table for the disease-exposure joint distribution.

                          Disease status
                          $D$              $\bar{D}$              Total
Exposure   $E$            $a$              $b$                    $n_E = a + b$
status     $\bar{E}$      $c$              $d$                    $n_{\bar{E}} = c + d$
           Total          $n_D = a + c$    $n_{\bar{D}} = b + d$  $n = a + b + c + d$
Various measures of association between disease and exposure have been proposed in the literature.
Relative risk: $RR \equiv \Pr(D \mid E) / \Pr(D \mid \bar{E})$;
Odds ratio: $OR \equiv \dfrac{\Pr(D \mid E) / \Pr(\bar{D} \mid E)}{\Pr(D \mid \bar{E}) / \Pr(\bar{D} \mid \bar{E})}$;
Excess risk: $ER \equiv \Pr(D \mid E) - \Pr(D \mid \bar{E})$;
Attributable risk: $AR \equiv \dfrac{\Pr(D) - \Pr(D \mid \bar{E})}{\Pr(D)}$. (1)
When disease and exposure status are independent, i.e., $\Pr(D \,\&\, E) = \Pr(D)\Pr(E)$, then $RR = 1$, $OR = 1$, $ER = 0$, $AR = 0$. (2)
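The definitions in Equation (1) can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name, argument names, and the two sets of joint probabilities are hypothetical illustrations.

```python
def association_measures(p_de, p_nde, p_dne, p_ndne):
    """Return RR, OR, ER and AR from the four joint probabilities.

    p_de   = Pr(D & E),      p_nde  = Pr(not D & E),
    p_dne  = Pr(D & not E),  p_ndne = Pr(not D & not E).
    """
    p_e = p_de + p_nde          # Pr(E)
    p_ne = p_dne + p_ndne       # Pr(not E)
    p_d = p_de + p_dne          # Pr(D)
    risk_e = p_de / p_e         # Pr(D | E)
    risk_ne = p_dne / p_ne      # Pr(D | not E)
    return {
        "RR": risk_e / risk_ne,
        "OR": (risk_e / (1 - risk_e)) / (risk_ne / (1 - risk_ne)),
        "ER": risk_e - risk_ne,
        "AR": (p_d - risk_ne) / p_d,
    }


# Hypothetical independent population: Pr(D) = 0.1, Pr(E) = 0.3.
independent = association_measures(0.03, 0.27, 0.07, 0.63)
# Hypothetical dependent population with the same margins.
dependent = association_measures(0.06, 0.24, 0.04, 0.66)
print(independent)
print(dependent)
```

Under independence the four measures sit at their null values (up to floating-point error), as stated in Equation (2).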
A common question in statistical practice is whether a disease-exposure association observed in a sample from a population of interest reflects a genuine association in this population or is due to random sampling variation. The statistical approach to this problem involves testing the null hypothesis $H_0$ of disease-exposure independence, i.e., $H_0: \Pr(D \,\&\, E) = \Pr(D)\Pr(E)$, (3) which is equivalent to $RR = 1$, $OR = 1$, $ER = 0$, and $AR = 0$.
As detailed below for various study designs, one can use empirical frequencies from a 2 x 2 contingency table such as Table 1 to estimate parameters of the disease-exposure joint distribution (e.g., $\Pr(D \,\&\, E)$, $\Pr(D \mid E)$, $OR$) and thus measures of deviation from independence. For instance, for a population-based design, the larger the value of the sample-derived test statistic $(ad - bc)^2$, the greater the evidence against disease-exposure independence in the population. But what is large? Hypothesis testing procedures provide benchmarks for assessing largeness, by accounting for the sampling distribution of test statistics.
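Example: birthweight (low, $D$; normal, $\bar{D}$) and mother's marital status (unmarried, $E$; married, $\bar{E}$) under three study designs, each with $n = 200$.
(a) Population-based design, $OR \approx 2.58$. Unmarried: 7 low, 52 normal, $n_E = 59$. Married: 7 low, 134 normal, $n_{\bar{E}} = 141$. Totals: $n_D = 14$, $n_{\bar{D}} = 186$.
(b) Cohort design, $OR \approx 2.59$. Unmarried: 12 low, 88 normal, $n_E = 100$. Married: 5 low, 95 normal, $n_{\bar{E}} = 100$. Totals: $n_D = 17$, $n_{\bar{D}} = 183$.
(c) Case-control design, $OR \approx 2.57$. Unmarried: 50 low, 28 normal, $n_E = 78$. Married: 50 low, 72 normal, $n_{\bar{E}} = 122$. Totals: $n_D = 100$, $n_{\bar{D}} = 100$.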
Statistical inference involves using observed data, with known empirical distribution $P_n$, to learn properties of, i.e., estimate parameters of or test hypotheses for, the unknown data generating distribution $P$. In estimation, the estimator $\psi_n = \hat{\Psi}(P_n)$ is known, whereas the parameter $\psi = \Psi(P)$ is unknown.
Hypothesis testing is concerned with using observed data to make decisions regarding properties of, i.e., test hypotheses for, the unknown data generating distribution. Specifically, consider a data generating distribution $P \in \mathcal{M}$, belonging to a model $\mathcal{M}$, i.e., a set of possibly non-parametric distributions. Suppose one has a learning set $\mathcal{X}_n \equiv \{X_i : i = 1, \ldots, n\} \overset{IID}{\sim} P$ of $n$ independent and identically distributed (IID) random variables (RV) from $P$. Let $P_n$ denote the empirical distribution of the learning set $\mathcal{X}_n$, which places probability $1/n$ on each $X_i$, $i = 1, \ldots, n$.
One can define a null hypothesis as $H_0 \equiv I(P \in \mathcal{M}_0)$ and the corresponding alternative hypothesis as $H_1 \equiv I(P \notin \mathcal{M}_0)$, in terms of a submodel $\mathcal{M}_0 \subseteq \mathcal{M}$ for the data generating distribution $P$. Null hypotheses are often expressed in terms of parameters $\psi = \Psi(P)$, defined as functions of the data generating distribution $P$, e.g., a mean, regression coefficient, or odds ratio. A hypothesis testing procedure is a data-driven rule for deciding whether to reject a null hypothesis, i.e., declare it false.
Decisions to reject or not a null hypothesis are based on test statistics $T_n = T(P_n)$, defined as functions of the empirical distribution $P_n$, i.e., of the data $\mathcal{X}_n$. E.g., t-statistics, $\chi^2$-statistics, F-statistics, likelihood ratio statistics. Testing procedures provide rejection regions $\mathcal{C}$ for test statistics $T_n$ and can be expressed as: Reject $H_0$ iff $T_n \in \mathcal{C}$. (4) In any testing problem, two types of errors can be committed. A Type I error, or false positive, is committed by rejecting a true null hypothesis. A Type II error, or false negative, is committed by failing to reject a false null hypothesis.
The actual Type I error rate of a testing procedure is the chance of incorrectly rejecting the null hypothesis $H_0$: $\Pr(\text{Reject } H_0)$, $P \in \mathcal{M}_0$. The actual power of a testing procedure is the chance of correctly rejecting the null hypothesis $H_0$: $\Pr(\text{Reject } H_0)$, $P \notin \mathcal{M}_0$. Ideally, one would like to simultaneously minimize both Type I errors and Type II errors. Unfortunately, this is not feasible, and one seeks a trade-off between the two types of errors. This trade-off typically involves the minimization of Type II errors, i.e., the maximization of power, subject to a Type I error constraint.
The actual Type I error rate of a testing procedure is often different from its nominal Type I error rate, i.e., the level $\alpha$ at which the procedure seeks to control false positives.
Conservative: $\Pr(\text{Reject } H_0) < \alpha$, $P \in \mathcal{M}_0$.
Anti-conservative: $\Pr(\text{Reject } H_0) > \alpha$, $P \in \mathcal{M}_0$. (5)
A key component of any testing procedure is the test statistics null distribution (rather than data generating null distribution) used to obtain test statistic rejection regions, parameter confidence regions, and p-values. Indeed, whether testing single or multiple hypotheses, one needs the (joint) distribution of the test statistics in order to derive a procedure that probabilistically controls Type I errors. In practice, however, the true distribution of the test statistics is unknown and replaced by a null distribution. The choice of a proper null distribution is crucial in order to ensure that control of the Type I error rate under the assumed null distribution does indeed provide the desired control under the true distribution.
This issue is particularly relevant for large-scale testing problems, such as those encountered in biomedical and genomic research, which concern high-dimensional multivariate distributions, with complex and unknown dependence structures among variables. Resampling procedures (e.g., bootstrap, permutation) are useful for consistent estimation of the null distribution and of the corresponding test statistic rejection regions, parameter confidence regions, and p-values.
A procedure's nominal Type I error rate $\Pr_0(\text{Reject } H_0)$, under a null distribution, can differ from its actual Type I error rate $\Pr(\text{Reject } H_0)$, under the true data generating distribution $P \in \mathcal{M}_0$.
Conservative: $\Pr(\text{Reject } H_0) < \Pr_0(\text{Reject } H_0)$.
Anti-conservative: $\Pr(\text{Reject } H_0) > \Pr_0(\text{Reject } H_0)$. (6)
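The gap between nominal and actual Type I error rates can be explored by simulation. The sketch below is an illustration under stated assumptions, not a procedure from the notes: it draws population-based samples with disease and exposure independent (so the null hypothesis holds), applies the $\chi^2$-statistic of Equation (12) with the $\chi^2(1)$ 0.95-quantile 3.841 as cut-off, and reports the observed rejection frequency. The probabilities, sample size, number of replicates, and seed are hypothetical choices.

```python
import random


def chi2_statistic(a, b, c, d):
    """Chi-square statistic n (ad - bc)^2 / product of the four margins."""
    n = a + b + c + d
    den = (a + b) * (a + c) * (b + d) * (c + d)
    if den == 0:
        # An empty margin leaves the statistic undefined; count it as no rejection.
        return 0.0
    return n * (a * d - b * c) ** 2 / den


def actual_type1_rate(p_d, p_e, n, reps=2000, cutoff=3.841, seed=0):
    """Monte Carlo estimate of Pr(Reject H0) when D and E are independent."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = b = c = d = 0
        for _ in range(n):
            diseased = rng.random() < p_d
            exposed = rng.random() < p_e  # drawn independently of disease status
            if exposed and diseased:
                a += 1
            elif exposed:
                b += 1
            elif diseased:
                c += 1
            else:
                d += 1
        if chi2_statistic(a, b, c, d) >= cutoff:
            rejections += 1
    return rejections / reps


rate = actual_type1_rate(p_d=0.07, p_e=0.30, n=200)
print("nominal level 0.05, simulated actual level:", rate)
```

Rerunning with smaller `n` or rarer disease is one way to see conservative or anti-conservative behavior emerge.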
The p-value for the test of a null hypothesis $H_0$ is the smallest nominal Type I error level $\alpha$ at which one would reject $H_0$, given the data. That is,
$P_{0n} \equiv \inf\{\alpha \in [0, 1] : \text{Reject } H_0 \text{ at nominal level } \alpha\} = \inf\{\alpha \in [0, 1] : T_n \in \mathcal{C}(\alpha)\}$, (7)
where $\mathcal{C}(\alpha)$ is a rejection region chosen with the aim that $\Pr(T_n \in \mathcal{C}(\alpha)) \leq \alpha$, $P \in \mathcal{M}_0$. The smaller the p-value $P_{0n}$, the stronger the evidence against the null hypothesis $H_0$. Thus, for a test with nominal Type I error level $\alpha$, one rejects $H_0$ for small p-value $P_{0n}$: Reject $H_0$ iff $P_{0n} \leq \alpha$. (8)
p-values are flexible summaries of a testing procedure, in the sense that results are supplied for all Type I error levels $\alpha$, i.e., the level $\alpha$ need not be chosen ahead of time. p-values provide convenient benchmarks to compare different testing procedures, whereby smaller p-values indicate a less conservative procedure.
There is a duality between hypothesis testing and confidence regions. Consider a data generating distribution $P$, a parameter $\psi = \Psi(P) \in \mathbb{R}$, and a null hypothesis of the form $H_0 = I(\psi = \psi_0)$ or $H_0 = I(\psi \leq \psi_0)$. A $(1 - \alpha)100\%$ confidence region (CR) for the parameter $\psi$ is a function $A_n(\alpha) = A(P_n)(\alpha) \subseteq \mathbb{R}$ of the empirical distribution $P_n$ such that $\Pr(\psi \in A_n(\alpha)) \geq (1 - \alpha)$. (9) That is, the chance that the data-driven subset (i.e., random variable) $A_n(\alpha)$ contains the true unknown parameter $\psi$ is at least $(1 - \alpha)$.
This suggests rejecting the null hypothesis $H_0$ if the confidence region $A_n(\alpha)$ does not include the null value $\psi_0$. Indeed, if the confidence region is derived such that $\Pr_0(\psi_0 \in A_n(\alpha)) \geq (1 - \alpha)$ under a null distribution that satisfies the null hypothesis $H_0$, then the testing procedure: Reject $H_0$ iff $\psi_0 \notin A_n(\alpha)$ (10) has nominal Type I error rate at most $\alpha$ under the null distribution.
Three representations of a hypothesis testing procedure. One can report the results of a hypothesis testing procedure with nominal Type I error level $\alpha$ in terms of the following three entities.
Rejection region $\mathcal{C}(\alpha)$ for the test statistic $T_n$: Reject $H_0$ iff $T_n \in \mathcal{C}(\alpha)$.
p-value $P_{0n}$: Reject $H_0$ iff $P_{0n} \leq \alpha$.
Confidence region $A_n(\alpha)$ for the parameter of interest $\psi$: Reject $H_0$ iff $\psi_0 \notin A_n(\alpha)$.
The first two are general and equivalent representations, while the third applies only to parametric hypotheses such as $H_0 = I(\psi = \psi_0)$ or $H_0 = I(\psi \leq \psi_0)$.
Null distribution. A $\chi^2$-statistic does not necessarily follow a $\chi^2$-distribution. One can mechanically plug any set of numbers into the formula for a test statistic, such as the $\chi^2$-statistic of Equation (12), below. However, the sampling distribution of the test statistic depends on the data generating distribution, i.e., the population of interest and the study design. For a given testing procedure, the nominal Type I error rate, under a null distribution, usually differs from the actual Type I error rate, under the true data generating distribution. Beware of anti-conservative/conservative behavior.
A p-value is NOT the chance that the null hypothesis is true. Be precise with conditional probabilities. A p-value is a random variable. Indeed, a p-value is a function of the data and can therefore be viewed as a test statistic. Ask yourself: What is fixed/random? What is known/unknown? What is a function of the unknown $P$/known $P_n$? $[0, 1] \neq \{0.05\}$.
Statistical significance $\neq$ subject-matter significance. Compare a statistically significant tiny p-value for a subject-matter insignificant tiny association in a study with large sample size vs. a statistically significant tiny p-value for a subject-matter significant large association in a study with small sample size. Cynical view: p-values are proxies for sample size.
Probability statements are only meaningful with respect to distributions, and many distributions can satisfy a null hypothesis. It is more rigorous to say that a p-value is the chance of obtaining a value as extreme or more extreme than the observed value of the test statistic "under a distribution that satisfies the null hypothesis" vs. "under the null hypothesis".
Statistical inference frameworks: Bayesian vs. frequentist.
Multiple hypothesis testing. When multiple hypotheses are tested simultaneously, there is an increased chance of committing at least one Type I error. Small (unadjusted) p-values that would lead to the rejection of a single hypothesis (e.g., 0.001) no longer correspond to statistically significant findings. E.g., the chance that at least one p-value is less than $\alpha$ for $M$ independent test statistics is $1 - (1 - \alpha)^M$, which converges to one as $M$ increases. For $M = 1{,}000$ and $\alpha = 0.01$, this chance is about 0.99996! One needs to adjust for multiple hypothesis testing.
E.g., combining data from $M$ 2 x 2 contingency tables (Jewell, 2003, Section 9.1). Sum of $\chi^2$-statistics: low power due to inflated degrees of freedom, $M$. Bonferroni procedure: very conservative, $\alpha/M$ p-value cut-off.
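The arithmetic behind the multiplicity problem and the Bonferroni cut-off is short enough to verify directly. This is a minimal sketch assuming $M$ independent tests of true null hypotheses with exactly uniform p-values; the function names are hypothetical.

```python
def prob_at_least_one_small_p(alpha, m):
    """Chance that at least one of m independent null p-values falls below alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_cutoff(alpha, m):
    """Per-hypothesis p-value cut-off alpha / m used by the Bonferroni procedure."""
    return alpha / m


unadjusted = prob_at_least_one_small_p(0.01, 1000)
adjusted = prob_at_least_one_small_p(bonferroni_cutoff(0.05, 1000), 1000)
print("unadjusted, alpha = 0.01, M = 1000:", unadjusted)
print("Bonferroni at overall level 0.05, M = 1000:", adjusted)
```

With the Bonferroni cut-off the chance of at least one false positive stays below the overall level, at the price of a very small per-test threshold.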
Population-based design. In a population-based design, one needs a sampling frame for the study population. A simple population-based design involves taking a random sample from the population of interest and assessing disease and exposure status for the sampled individuals. With such a study design, one can estimate disease and exposure marginal and joint probabilities directly from empirical frequencies in Table 1. $\Pr(D \,\&\, E)$ can be estimated by $a/n$; $\Pr(D)$ can be estimated by $\hat{p} \equiv (a + c)/n$; $\Pr(E)$ can be estimated by $\hat{q} \equiv (a + b)/n$. Constants: $n$. Random variables: $n_D$, $n_{\bar{D}}$, $n_E$, $n_{\bar{E}}$, $a$, $b$, $c$, $d$.
The null hypothesis of disease-exposure independence is formulated as $H_0: \Pr(D \,\&\, E) = \Pr(D)\Pr(E)$. The measure of disease-exposure dependence $\Pr(D \,\&\, E) - \Pr(D)\Pr(E)$ can be estimated by the corresponding difference in empirical frequencies,
$\dfrac{a}{n} - \hat{p}\hat{q} = \dfrac{a}{n} - \dfrac{(a + c)}{n}\dfrac{(a + b)}{n} = \dfrac{ad - bc}{n^2}$.
Note that one can equivalently consider any of the three other cells in Table 1, i.e., focus on any of the following three measures of disease-exposure dependence: $\Pr(\bar{D} \,\&\, E) - \Pr(\bar{D})\Pr(E)$, $\Pr(D \,\&\, \bar{E}) - \Pr(D)\Pr(\bar{E})$, $\Pr(\bar{D} \,\&\, \bar{E}) - \Pr(\bar{D})\Pr(\bar{E})$. Since the sample size $n$ is fixed by design, it is natural to use $\hat{\delta} \equiv ad - bc$ as an empirical measure of disease-exposure dependence. Furthermore, ignoring the sign of the difference yields the measure $\hat{\delta}^2 = (ad - bc)^2$. Large values of the test statistics $|\hat{\delta}|$ and $\hat{\delta}^2$ are suggestive of population disease-exposure dependence.
To assess largeness, one needs to account for the sampling variation in $ad - bc$. In a population-based study with large sample size $n$, the sampling distribution of the random variable corresponding to $ad - bc$ can be approximated by a Gaussian distribution. Under the null hypothesis of disease-exposure independence, the RV for $ad - bc$ has expected value zero and a variance that can be estimated by
$\hat{v} \equiv \dfrac{1}{n}(a + b)(a + c)(b + d)(c + d)$. (11)
This suggests the following test statistics.
t-statistic: $t_1 \equiv \dfrac{\hat{\delta} - 0}{\sqrt{\hat{v}}} = \dfrac{\sqrt{n}\,(ad - bc)}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$;
$\chi^2$-statistic: $t_2 \equiv t_1^2 = \dfrac{n(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}$. (12)
Under $H_0$ and for large $n$, $T_1 \sim N(0, 1)$ and $T_2 \sim \chi^2(1)$. (13)
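As a worked instance of Equation (12), the sketch below evaluates both statistics for the population-based birthweight example. The counts $a = 7$ and $b = 52$ are the unmarried row of the example; $c = 7$ and $d = 134$ are the values implied by the margins $n_D = 14$ and $n_{\bar{D}} = 186$. Function names are hypothetical.

```python
import math


def t_statistic(a, b, c, d):
    """t1 = sqrt(n) (ad - bc) / sqrt((a+b)(a+c)(b+d)(c+d)), Equation (12)."""
    n = a + b + c + d
    margins = (a + b) * (a + c) * (b + d) * (c + d)
    return math.sqrt(n) * (a * d - b * c) / math.sqrt(margins)


def chi2_statistic(a, b, c, d):
    """t2 = n (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)), Equation (12)."""
    n = a + b + c + d
    margins = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / margins


# Population-based example: unmarried (7 low, 52 normal), married (7 low, 134 normal).
t1 = t_statistic(7, 52, 7, 134)
t2 = chi2_statistic(7, 52, 7, 134)
print("t1 =", round(t1, 3), " t2 =", round(t2, 3))
```

The value of `t2` agrees with the 3.04 reported for this example, and `t1 ** 2` reproduces `t2`.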
One can use the $\chi^2(1)$ distribution as a null distribution for the test statistic $T_2$ to derive rejection regions. For a nominal Type I error level $\alpha$, a rejection region for $T_2$ can be defined as $\mathcal{C}(\alpha) = [c(\alpha), \infty)$, where $c(\alpha)$ is the $(1 - \alpha)$-quantile of the $\chi^2(1)$ distribution. That is, one rejects the null hypothesis of disease-exposure independence when $T_2$ exceeds the critical value or cut-off $c(\alpha)$. Under the $\chi^2(1)$ null distribution, the chance of committing a Type I error, i.e., incorrectly rejecting the null hypothesis of disease-exposure independence, is $\alpha$. That is, $\Pr_{\chi^2(1)}(\text{Reject } H_0) = \Pr_{\chi^2(1)}(T_2 \geq c(\alpha)) = \alpha$.
Note that the procedure's nominal Type I error rate $\Pr_{\chi^2(1)}(\text{Reject } H_0)$, under the $\chi^2(1)$ (or other) null distribution, can differ from its actual Type I error rate $\Pr(\text{Reject } H_0)$, under the true data generating distribution $P \in \mathcal{M}_0$. For a test based on the test statistic $T_2$ and the $\chi^2(1)$ null distribution, the p-value can be expressed as the chance that a $\chi^2(1)$ random variable is at least as large as the observed value $t_2$, i.e., $p = \Pr_{\chi^2(1)}(T_2 \geq t_2)$. (14)
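Equation (14) can be evaluated without a statistics library, since a $\chi^2(1)$ random variable is the square of a standard Gaussian one, so its upper tail is $\mathrm{erfc}(\sqrt{t/2})$. The sketch below assumes that identity; the function name is hypothetical, and 3.04 is the observed statistic of the population-based example.

```python
import math


def chi2_1_sf(t):
    """Upper tail Pr(X >= t) for X ~ chi-square(1), using X = Z^2 with Z ~ N(0, 1)."""
    return math.erfc(math.sqrt(t / 2.0))


p_value = chi2_1_sf(3.04)   # observed t2 in the population-based example
reject = p_value <= 0.05    # decision rule of Equation (8) at nominal level 0.05
print("p-value:", round(p_value, 4), " reject H0 at 0.05:", reject)
```

The same function recovers the familiar cut-off: the tail probability at 3.841 is close to 0.05.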
A parameter of interest for measuring disease-exposure association is the odds ratio, $OR = \dfrac{\Pr(D \mid E) / \Pr(\bar{D} \mid E)}{\Pr(D \mid \bar{E}) / \Pr(\bar{D} \mid \bar{E})}$. For population-based designs, one can estimate the odds ratio by replacing the unknown conditional probabilities by the corresponding empirical frequencies from Table 1. That is,
$\widehat{or} \equiv \dfrac{\hat{p}_E / (1 - \hat{p}_E)}{\hat{p}_{\bar{E}} / (1 - \hat{p}_{\bar{E}})} = \dfrac{(a/(a + b)) / (b/(a + b))}{(c/(c + d)) / (d/(c + d))} = \dfrac{ad}{bc}$, (15)
where $p_E \equiv \Pr(D \mid E)$ and $p_{\bar{E}} \equiv \Pr(D \mid \bar{E})$ are estimated, respectively, by $\hat{p}_E \equiv a/(a + b)$ and $\hat{p}_{\bar{E}} \equiv c/(c + d)$.
The sampling distribution of the odds ratio estimator $\widehat{OR}$ tends to be skewed to the right, especially for small sample sizes $n$ (Jewell, 2003, Figure 7.1). It is therefore not appropriate to rely on a Gaussian approximation to derive rejection regions and confidence regions for the odds ratio. However, the sampling distribution of the log odds ratio estimator $\log(\widehat{OR})$ tends to be symmetric and more amenable to a Gaussian approximation (Jewell, 2003, Figure 7.2). As detailed in Jewell (2003), an estimate of the variance of $\log(\widehat{OR})$ is given by
$\hat{v} \equiv \dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}$. (16)
Thus, one has the following Gaussian approximation to the sampling distribution of the empirical log odds ratio,
$\dfrac{\log(\widehat{OR}) - \log(OR)}{\sqrt{\hat{V}}} \sim N(0, 1)$. (17)
A two-sided $(1 - \alpha)100\%$ confidence interval (CI) for the log odds ratio parameter $\log(OR)$ is given by
$\log(ad) - \log(bc) \pm \Phi^{-1}(1 - \alpha/2)\sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}$, (18)
where $\Phi$ denotes the standard Gaussian CDF and $\Phi^{-1}(1 - \alpha/2)$ the corresponding $(1 - \alpha/2)$-quantile. One can derive the same estimators and confidence intervals for the cohort and case-control designs considered below.
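The interval of Equation (18) is straightforward to compute. The sketch below uses Python's standard-library `statistics.NormalDist` for the Gaussian quantile; the function name is hypothetical, and the counts are those of the population-based example ($c = 7$ and $d = 134$ being implied by its margins).

```python
import math
from statistics import NormalDist


def log_odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for log(OR), Equation (18)."""
    log_or = math.log(a * d) - math.log(b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # square root of Equation (16)
    z = NormalDist().inv_cdf(1 - alpha / 2)         # Gaussian (1 - alpha/2)-quantile
    return log_or - z * se, log_or + z * se


lower, upper = log_odds_ratio_ci(7, 52, 7, 134)
print("95% CI for log(OR):", (round(lower, 2), round(upper, 2)))
print("95% CI for OR:", (round(math.exp(lower), 2), round(math.exp(upper), 2)))
```

Exponentiating the endpoints gives an interval for the odds ratio itself, which is asymmetric around the point estimate.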
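Table 2: Population-based design. Birthweight and mother's marital status, $n = 200$.

                          Birthweight
                          Low, $D$      Normal, $\bar{D}$     Total
Marital   Unmarried, $E$      7             52                    $n_E = 59$
status    Married, $\bar{E}$  7             134                   $n_{\bar{E}} = 141$
          Total               $n_D = 14$    $n_{\bar{D}} = 186$   $n = 200$

$\widehat{or} = \dfrac{ad}{bc} = \dfrac{7 \cdot 134}{52 \cdot 7} \approx 2.58$, $t_2 = \dfrac{200\,(7 \cdot 134 - 52 \cdot 7)^2}{59 \cdot 14 \cdot 186 \cdot 141} \approx 3.04$, $p \approx 0.08$, 95% confidence interval for $\log(OR)$: $[-0.15, 2.04] \ni 0$.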
Cohort design. In a cohort design, one needs a sampling frame for the exposed and unexposed populations. A simple cohort design involves taking independent random samples from the exposed and unexposed populations and assessing disease status for the sampled individuals. With such a study design, one can only estimate the conditional disease probabilities given exposure from the empirical frequencies in Table 1. $p_E \equiv \Pr(D \mid E)$ can be estimated by $\hat{p}_E \equiv a/(a + b)$; $p_{\bar{E}} \equiv \Pr(D \mid \bar{E})$ can be estimated by $\hat{p}_{\bar{E}} \equiv c/(c + d)$. Constants: $n$, $n_E$, $n_{\bar{E}}$. Random variables: $n_D$, $n_{\bar{D}}$, $a$, $b$, $c$, $d$.
The null hypothesis of disease-exposure independence can be restated as $H_0: \Pr(D \mid E) = \Pr(D \mid \bar{E})$. The measure of disease-exposure dependence $\delta \equiv \Pr(D \mid E) - \Pr(D \mid \bar{E})$ can be estimated by the corresponding difference in empirical frequencies, $\hat{\delta} \equiv \hat{p}_E - \hat{p}_{\bar{E}} = \dfrac{a}{a + b} - \dfrac{c}{c + d}$, with Gaussian asymptotic distribution and estimated null variance
$\hat{v} \equiv \hat{p}(1 - \hat{p})\left(\dfrac{1}{n_E} + \dfrac{1}{n_{\bar{E}}}\right) = \left(\dfrac{a + c}{n}\right)\left(\dfrac{b + d}{n}\right)\left(\dfrac{1}{a + b} + \dfrac{1}{c + d}\right)$. (19)
This suggests the following test statistics.
t-statistic: $t_1 \equiv \dfrac{\hat{\delta} - 0}{\sqrt{\hat{v}}}$;
$\chi^2$-statistic: $t_2 \equiv t_1^2 = \dfrac{\hat{\delta}^2}{\hat{v}}$. (20)
Under $H_0$ and for large $n$, $T_1 \sim N(0, 1)$ and $T_2 \sim \chi^2(1)$. (21)
Furthermore, it turns out that the $T_1$ and $T_2$ test statistics for the cohort design are identical to the corresponding test statistics in Equation (12) for the population-based design. Indeed,
$t_1^{cohort} = \dfrac{\dfrac{a}{a + b} - \dfrac{c}{c + d}}{\sqrt{\left(\dfrac{a + c}{n}\right)\left(\dfrac{b + d}{n}\right)\left(\dfrac{1}{a + b} + \dfrac{1}{c + d}\right)}} = \dfrac{\dfrac{1}{(a + b)(c + d)}(ac + ad - ac - bc)}{\sqrt{\dfrac{(a + c)(b + d)}{n^2}\,\dfrac{n}{(a + b)(c + d)}}} = \dfrac{\sqrt{n}\,(ad - bc)}{\sqrt{(a + b)(a + c)(b + d)(c + d)}} = t_1^{popn}$.
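The algebraic identity above can also be confirmed numerically. In the sketch below, `chi2_cohort` follows Equations (19) and (20) and `chi2_population` follows Equation (12); both function names are hypothetical, and the counts are those of the cohort example (12 and 88 for unmarried mothers being implied by $n_D = 17$ and $n_E = 100$).

```python
def chi2_population(a, b, c, d):
    """Equation (12): n (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def chi2_cohort(a, b, c, d):
    """Equations (19)-(20): squared risk difference over its estimated null variance."""
    n_e, n_ne = a + b, c + d
    n = n_e + n_ne
    p_hat = (a + c) / n                      # pooled disease frequency
    delta_hat = a / n_e - c / n_ne           # difference in conditional frequencies
    v_hat = p_hat * (1 - p_hat) * (1 / n_e + 1 / n_ne)
    return delta_hat ** 2 / v_hat


cohort_counts = (12, 88, 5, 95)
t2_cohort = chi2_cohort(*cohort_counts)
t2_popn = chi2_population(*cohort_counts)
print("cohort form:", round(t2_cohort, 4), " population form:", round(t2_popn, 4))
```

Both forms return the same number for any table with non-empty margins, which is why a single formula serves both designs.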
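Table 3: Cohort design. Birthweight and mother's marital status, $n_E = n_{\bar{E}} = 100$.

                          Birthweight
                          Low, $D$      Normal, $\bar{D}$     Total
Marital   Unmarried, $E$      12            88                    $n_E = 100$
status    Married, $\bar{E}$  5             95                    $n_{\bar{E}} = 100$
          Total               $n_D = 17$    $n_{\bar{D}} = 183$   $n = 200$

$\widehat{or} = \dfrac{ad}{bc} = \dfrac{12 \cdot 95}{88 \cdot 5} \approx 2.59$, $t_2 = \dfrac{200\,(12 \cdot 95 - 88 \cdot 5)^2}{100 \cdot 17 \cdot 183 \cdot 100} \approx 3.15$, $p \approx 0.08$, 95% confidence interval for $\log(OR)$: $[-0.13, 2.03] \ni 0$.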
Case-control design. In a case-control design, one needs a sampling frame for the disease/case and no disease/control populations. A simple case-control design involves taking independent random samples from the disease and no disease populations and assessing exposure status for the sampled individuals. With such a study design, one can only estimate the conditional exposure probabilities given disease from the empirical frequencies in Table 1. $q_D \equiv \Pr(E \mid D)$ can be estimated by $\hat{q}_D \equiv a/(a + c)$; $q_{\bar{D}} \equiv \Pr(E \mid \bar{D})$ can be estimated by $\hat{q}_{\bar{D}} \equiv b/(b + d)$. Constants: $n$, $n_D$, $n_{\bar{D}}$. Random variables: $n_E$, $n_{\bar{E}}$, $a$, $b$, $c$, $d$.
The null hypothesis of disease-exposure independence can be restated as $H_0: \Pr(E \mid D) = \Pr(E \mid \bar{D})$. The measure of disease-exposure dependence $\delta \equiv \Pr(E \mid D) - \Pr(E \mid \bar{D})$ can be estimated by the corresponding difference in empirical frequencies, $\hat{\delta} \equiv \hat{q}_D - \hat{q}_{\bar{D}} = \dfrac{a}{a + c} - \dfrac{b}{b + d}$, with Gaussian asymptotic distribution and estimated null variance
$\hat{v} \equiv \hat{q}(1 - \hat{q})\left(\dfrac{1}{n_D} + \dfrac{1}{n_{\bar{D}}}\right) = \left(\dfrac{a + b}{n}\right)\left(\dfrac{c + d}{n}\right)\left(\dfrac{1}{a + c} + \dfrac{1}{b + d}\right)$. (22)
This suggests the following test statistics.
t-statistic: $t_1 \equiv \dfrac{\hat{\delta} - 0}{\sqrt{\hat{v}}}$;
$\chi^2$-statistic: $t_2 \equiv t_1^2 = \dfrac{\hat{\delta}^2}{\hat{v}}$. (23)
Under $H_0$ and for large $n$, $T_1 \sim N(0, 1)$ and $T_2 \sim \chi^2(1)$. (24)
Furthermore, it turns out that the $T_1$ and $T_2$ test statistics for the case-control design are identical to the corresponding test statistics for the population-based and cohort designs.
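Table 4: Case-control design. Birthweight and mother's marital status, $n_D = n_{\bar{D}} = 100$.

                          Birthweight
                          Low, $D$      Normal, $\bar{D}$     Total
Marital   Unmarried, $E$      50            28                    $n_E = 78$
status    Married, $\bar{E}$  50            72                    $n_{\bar{E}} = 122$
          Total               $n_D = 100$   $n_{\bar{D}} = 100$   $n = 200$

$\widehat{or} = \dfrac{ad}{bc} = \dfrac{50 \cdot 72}{28 \cdot 50} \approx 2.57$, $t_2 = \dfrac{200\,(50 \cdot 72 - 28 \cdot 50)^2}{78 \cdot 100 \cdot 100 \cdot 122} \approx 10.17$, $p \approx 0.001$, 95% confidence interval for $\log(OR)$: $[0.36, 1.53] \not\ni 0$.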
Table 5: Comparison of study designs. Results of the $\chi^2$-test of independence and log odds ratio confidence intervals for population-based, cohort, and case-control study designs.

Study design                 Population-based   Cohort            Case-control
Odds ratio, $\widehat{or}$   2.58               2.59              2.57
$\chi^2$-statistic, $t_2$    3.04               3.15              10.17
p-value, $p$                 0.08               0.08              0.001
$\log(OR)$ 95% CI            [-0.15, 2.04]      [-0.13, 2.03]     [0.36, 1.53]
The differences in results for the $\chi^2$-test of independence reflect differences in power among the three study designs, i.e., in the chance of rejecting the null hypothesis of independence when there truly is a disease-exposure association in the population. In general, the power of a testing procedure depends on a variety of factors, including the extent of the deviation from the null hypothesis (i.e., the true unknown population association, as measured, for example, by $RR$, $OR$, $ER$, $AR$) and the sample size $n$. However, these two factors are the same in each of our three examples.
As in our setting, power is further influenced by the study design via the balance of the empirical marginal frequencies for disease ($n_D$, $n_{\bar{D}}$) and exposure ($n_E$, $n_{\bar{E}}$).
Population-based vs. cohort design. The $\chi^2$-statistic used for both designs may be written as
$t_2 = \dfrac{(\hat{p}_E - \hat{p}_{\bar{E}})^2}{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_E} + \dfrac{1}{n_{\bar{E}}}\right)}$, (25)
where $\hat{p} = (a + c)/n$ is an estimate of $\Pr(D)$ (population-based design), $\hat{p}_E = a/(a + b)$ is an estimate of $\Pr(D \mid E)$, and $\hat{p}_{\bar{E}} = c/(c + d)$ is an estimate of $\Pr(D \mid \bar{E})$.
The numerator term $\hat{p}_E - \hat{p}_{\bar{E}}$ behaves in a similar manner for both designs, in the sense that it converges to the corresponding parameter $\Pr(D \mid E) - \Pr(D \mid \bar{E})$, which is fixed for a given population. The design influences the test statistic through the variance term $1/n_E + 1/n_{\bar{E}}$. In particular, this term is minimized and the test statistic $t_2$ maximized when $n_E = n_{\bar{E}} = n/2$. For a cohort design with fixed total sample size $n$, the optimal exposure sample size allocation in terms of power is $n_E = n_{\bar{E}} = n/2$. For a population-based design, the exposure sample sizes $n_E$ and $n_{\bar{E}}$ are unlikely to be equal, as they are random variables with distribution depending on the unknown population frequency $\Pr(E)$.
For large sample size $n$, a population-based design yields a less powerful $\chi^2$-test of independence than a cohort design with equal exposure sample sizes $n_E = n_{\bar{E}} = n/2$.
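The role of the variance term $1/n_E + 1/n_{\bar{E}}$ can be illustrated with a brute-force search over allocations. This is a small sketch with hypothetical function and variable names; $n = 200$ and the unbalanced split 59/141 are taken from the birthweight examples.

```python
def variance_factor(n_e, n):
    """The design-dependent term 1/n_E + 1/n_Ebar for exposure group sizes n_e and n - n_e."""
    return 1.0 / n_e + 1.0 / (n - n_e)


n = 200
# Search every allocation that leaves both exposure groups non-empty.
best_n_e = min(range(1, n), key=lambda n_e: variance_factor(n_e, n))
balanced = variance_factor(best_n_e, n)
unbalanced = variance_factor(59, n)   # split observed in the population-based example

print("allocation minimizing the variance term:", best_n_e)
print("balanced term:", balanced, " unbalanced (59/141) term:", round(unbalanced, 5))
```

The balanced allocation attains $4/n$, and any unbalanced split inflates the denominator of Equation (25).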
Population-based vs. case-control design. A similar argument as above, with the roles of disease and exposure interchanged, can be applied to compare population-based and case-control designs. For a case-control design with fixed total sample size $n$, the optimal disease sample size allocation in terms of power is $n_D = n_{\bar{D}} = n/2$. For large sample size $n$, a population-based design yields a less powerful $\chi^2$-test of independence than a case-control design with equal disease sample sizes $n_D = n_{\bar{D}} = n/2$.
Cohort vs. case-control design. Consider for simplicity cohort and case-control designs with balanced sample sizes $n_E = n_{\bar{E}} = n/2$ and $n_D = n_{\bar{D}} = n/2$, respectively. For the cohort design, $\hat{p} = (\hat{p}_E + \hat{p}_{\bar{E}})/2$ and $1/n_E + 1/n_{\bar{E}} = 4/n$. The power of the $\chi^2$-test is then driven by the expected value of the random variable for $(\hat{p}_E - \hat{p}_{\bar{E}})/\sqrt{\hat{p}(1 - \hat{p})}$, which, for large sample size $n$, can be approximated by the scaled frequency difference parameter
$\delta = \dfrac{p_E - p_{\bar{E}}}{\sqrt{\left((p_E + p_{\bar{E}})/2\right)\left(1 - (p_E + p_{\bar{E}})/2\right)}}$, (26)
where $p_E = \Pr(D \mid E)$ and $p_{\bar{E}} = \Pr(D \mid \bar{E})$.
The closer to 1/2 the average $(\Pr(D \mid E) + \Pr(D \mid \bar{E}))/2$, the greater $\delta$, hence the greater the power of the $\chi^2$-test for a cohort design. A similar argument as above, with the roles of disease and exposure interchanged, shows that the closer to 1/2 the average $(\Pr(E \mid D) + \Pr(E \mid \bar{D}))/2$, the greater the power of the $\chi^2$-test for a case-control design. If $(\Pr(D \mid E) + \Pr(D \mid \bar{E}))/2$ is closer to 1/2 than $(\Pr(E \mid D) + \Pr(E \mid \bar{D}))/2$, then a cohort design leads to a more powerful $\chi^2$-test than a case-control design. It can further be shown that if $\Pr(D)$ is closer to 1/2 than $\Pr(E)$, then a cohort design leads to a more powerful $\chi^2$-test than a case-control design.
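The dependence of $\delta$ on the average frequency can be seen by holding the odds ratio fixed and varying the unexposed risk. The sketch below evaluates Equation (26) that way; the function name and the two risk values are hypothetical illustrations, with $OR = 4$ chosen to match one of the curves in Figure 1.

```python
import math


def scaled_difference(odds_ratio, p_unexposed):
    """Return (average frequency, delta of Equation (26)) for a fixed odds ratio."""
    odds_exposed = odds_ratio * p_unexposed / (1 - p_unexposed)
    p_exposed = odds_exposed / (1 + odds_exposed)      # Pr(D | E) implied by the OR
    average = (p_exposed + p_unexposed) / 2
    delta = (p_exposed - p_unexposed) / math.sqrt(average * (1 - average))
    return average, delta


# Same odds ratio, two hypothetical unexposed risks.
avg_common, delta_common = scaled_difference(4, 1 / 3)   # average frequency equal to 1/2
avg_rare, delta_rare = scaled_difference(4, 0.05)        # rare disease

print("average", round(avg_common, 3), "delta", round(delta_common, 3))
print("average", round(avg_rare, 3), "delta", round(delta_rare, 3))
```

For the same odds ratio, the scaled difference is noticeably larger when the average frequency is 1/2 than when the disease is rare.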
Figure 1: Cohort vs. case-control design. Scaled frequency difference $\delta$ vs. average frequency for various fixed odds ratios (solid: $OR = 4$, dotted: $OR = 3$, dash-dotted: $OR = 2$).
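Table 6: Comparison of study designs. Factors affecting the power of the $\chi^2$-test of independence for population-based, cohort, and case-control study designs.

Study design                                Population-based   Cohort            Case-control
Disease, $(n_D, n_{\bar{D}})$               (14, 186)          (17, 183)         100 = 100
Exposure, $(n_E, n_{\bar{E}})$              (59, 141)          100 = 100         (78, 122)
$(\hat{q}_D, \hat{q}_{\bar{D}})$            (7/14, 52/186)     NA                (50/100, 28/100)
$(\hat{p}_E, \hat{p}_{\bar{E}})$            (7/59, 7/141)      (12/100, 5/100)   NA

Scaled difference, $\hat{\delta}$: NA wherever the required conditional frequencies are NA.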
Summary of large sample power comparison. For a cohort design with fixed total sample size $n$, the $\chi^2$-test of independence is most powerful with equal exposure sample sizes $n_E = n_{\bar{E}} = n/2$. A cohort design with equal exposure sample sizes $n_E = n_{\bar{E}} = n/2$ yields a more powerful $\chi^2$-test of independence than a population-based design with the same overall sample size. For a case-control design with fixed total sample size $n$, the $\chi^2$-test of independence is most powerful with equal disease sample sizes $n_D = n_{\bar{D}} = n/2$.
A case-control design with equal disease sample sizes $n_D = n_{\bar{D}} = n/2$ yields a more powerful $\chi^2$-test of independence than a population-based design with the same overall sample size. When $\Pr(D)$ is closer to 1/2 than $\Pr(E)$, a cohort design with equal exposure sample sizes yields a more powerful $\chi^2$-test of independence than a case-control design with equal disease sample sizes. When $\Pr(E)$ is closer to 1/2 than $\Pr(D)$, a case-control design with equal disease sample sizes yields a more powerful $\chi^2$-test of independence than a cohort design with equal exposure sample sizes.
Pearson's $\chi^2$-statistic. Let us denote the disease-exposure empirical frequencies of Table 1 by $(O_{DE}, O_{\bar{D}E}, O_{D\bar{E}}, O_{\bar{D}\bar{E}})$. For a population-based design with sample size $n$, the disease-exposure frequencies follow a multinomial distribution,
$(O_{DE}, O_{\bar{D}E}, O_{D\bar{E}}, O_{\bar{D}\bar{E}}) \sim \text{Multinomial}\left(n, \left(\Pr(D \,\&\, E), \Pr(\bar{D} \,\&\, E), \Pr(D \,\&\, \bar{E}), \Pr(\bar{D} \,\&\, \bar{E})\right)\right)$. (27)
One can show that the $\chi^2$-statistic of Equation (12) can be written as a Pearson $\chi^2$-statistic,
$T_2 = \sum_{i \in \{D, \bar{D}\}} \sum_{j \in \{E, \bar{E}\}} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$, (28)
where the expected values of the disease-exposure frequencies under the null hypothesis of independence and estimators thereof are as follows.

Frequency              Expected value under $H_0$ (parameter)   Estimate
$O_{DE}$               $n \Pr(D)\Pr(E)$                         $e_{DE} \equiv n\,\dfrac{(a + c)}{n}\dfrac{(a + b)}{n}$
$O_{\bar{D}E}$         $n \Pr(\bar{D})\Pr(E)$                   $e_{\bar{D}E} \equiv n\,\dfrac{(b + d)}{n}\dfrac{(a + b)}{n}$
$O_{D\bar{E}}$         $n \Pr(D)\Pr(\bar{E})$                   $e_{D\bar{E}} \equiv n\,\dfrac{(a + c)}{n}\dfrac{(c + d)}{n}$
$O_{\bar{D}\bar{E}}$   $n \Pr(\bar{D})\Pr(\bar{E})$             $e_{\bar{D}\bar{E}} \equiv n\,\dfrac{(b + d)}{n}\dfrac{(c + d)}{n}$
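The equivalence between Equation (28) and Equation (12) can be verified on the population-based example. The sketch below computes the observed-minus-expected form cell by cell; function names and dictionary keys are hypothetical, and $c = 7$, $d = 134$ are the counts implied by that example's margins.

```python
def pearson_chi2(a, b, c, d):
    """Pearson statistic: sum over the four cells of (O - E)^2 / E, Equation (28)."""
    n = a + b + c + d
    observed = {("D", "E"): a, ("notD", "E"): b, ("D", "notE"): c, ("notD", "notE"): d}
    disease_totals = {"D": a + c, "notD": b + d}
    exposure_totals = {"E": a + b, "notE": c + d}
    total = 0.0
    for (i, j), o_ij in observed.items():
        e_ij = disease_totals[i] * exposure_totals[j] / n   # estimated null expected count
        total += (o_ij - e_ij) ** 2 / e_ij
    return total


def chi2_closed_form(a, b, c, d):
    """Equation (12): n (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


pearson = pearson_chi2(7, 52, 7, 134)
closed = chi2_closed_form(7, 52, 7, 134)
print("Pearson form:", round(pearson, 4), " closed form:", round(closed, 4))
```

The cell-by-cell form is the one that generalizes to larger tables, where no closed form like Equation (12) is available.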
Multiway tables. Pearson's $\chi^2$-statistic extends to categorical variables taking on more than two values and to higher-order multiway tables for more than two variables. Specifically, consider $J$ categorical random variables $X = (X(j) : j = 1, \ldots, J)$, where $X(j) \in \{1, \ldots, K_j\}$. The (data generating or empirical) joint distribution of $X$ may be represented using a $K_1 \times \cdots \times K_J$ contingency table.
The null hypothesis $H_0$ of mutual independence among the $J$ random variables may be tested using the Pearson $\chi^2$-statistic
$T_2 \equiv \sum_{k_1 = 1}^{K_1} \cdots \sum_{k_J = 1}^{K_J} \dfrac{(O_{k_1 \ldots k_J} - E_{k_1 \ldots k_J})^2}{E_{k_1 \ldots k_J}}$, (29)
where $O_{k_1 \ldots k_J}$ and $E_{k_1 \ldots k_J}$ are, respectively, the observed frequencies and estimated null expected frequencies for cell $(k_1, \ldots, k_J)$. Under $H_0$ and for large $n$, $T_2$ has a $\chi^2(\nu)$ distribution with
$\nu = \prod_{j=1}^{J} K_j - 1 - \sum_{j=1}^{J} (K_j - 1)$ (30)
degrees of freedom.
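Equation (30) counts the free cell probabilities, minus the free marginal probabilities estimated under mutual independence. A minimal sketch follows; the function name is hypothetical and `math.prod` requires Python 3.8 or later.

```python
import math


def independence_df(levels):
    """Degrees of freedom of Equation (30) for a K_1 x ... x K_J table."""
    return math.prod(levels) - 1 - sum(k - 1 for k in levels)


print("2 x 2 table:", independence_df([2, 2]))
print("3 x 3 table:", independence_df([3, 3]))
print("2 x 3 x 4 table:", independence_df([2, 3, 4]))
```

For two variables the formula reduces to the familiar $(K_1 - 1)(K_2 - 1)$, which gives one degree of freedom for a 2 x 2 table.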
Fisher's exact test. Since most of the information on disease-exposure association is contained in the interior entries of Table 1, we may as well assume that the row and column marginal totals are fixed. The analysis is then identical for all three study designs, although each generates different marginal totals and thus has different power properties. With fixed margins, there is only one random variable, i.e., one degree of freedom, in the table. Indeed, given $a$, the other three interior entries, $b$, $c$, $d$, are determined. One can therefore focus on a single cell, without loss of generality $a$, to study the relationship between disease and exposure.
The random variable for $a$, $A$, has a non-central hypergeometric distribution, with parameters determined by the known marginal totals and the unknown population odds ratio $OR$. Under the null hypothesis of disease-exposure independence, $A$ has a central hypergeometric null distribution, with parameters fully determined by the marginal totals:
$\Pr_0(A = a) = \binom{a + b}{a}\binom{c + d}{c} \Big/ \binom{n}{a + c} = \dfrac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{n!\,a!\,b!\,c!\,d!}$, (31)
$E_0[A] = \dfrac{(a + b)(a + c)}{n}$,
$Var_0[A] = \dfrac{(a + b)(c + d)(a + c)(b + d)}{n^2 (n - 1)}$.
Fisher's exact test derives rejection regions based on the exact finite sample hypergeometric null distribution of the test statistic $A$.
Under $H_0$ and for large $n$,
$T_1 \equiv \dfrac{A - E_0[A]}{\sqrt{Var_0[A]}} \sim N(0, 1)$ and $T_2 \equiv \dfrac{(A - E_0[A])^2}{Var_0[A]} \sim \chi^2(1)$. (32)
The above $\chi^2$-statistic is identical to the $\chi^2$-statistic of Equation (12), except for the $(n - 1)$ numerator term replacing the $n$ term. This difference is irrelevant asymptotically, but can have important implications for small sample size $n$.
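The central hypergeometric null distribution of Equation (31) can be tabulated with exact integer binomial coefficients. The sketch below is an illustration, not the notes' own code: function names are hypothetical, the counts are those of the population-based example ($c = 7$, $d = 134$ implied by its margins), and the upper-tail p-value $\Pr_0(A \geq a)$ is one possible one-sided choice that the notes do not specify.

```python
import math


def hypergeom_pmf(x, row1, col1, n):
    """Pr0(A = x) with first-row total row1 = a + b and first-column total col1 = a + c."""
    return math.comb(row1, x) * math.comb(n - row1, col1 - x) / math.comb(n, col1)


def fisher_upper_p(a, b, c, d):
    """One-sided exact p-value Pr0(A >= a) with both margins held fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    return sum(hypergeom_pmf(x, row1, col1, n) for x in range(a, min(row1, col1) + 1))


a, b, c, d = 7, 52, 7, 134
row1, col1, n = a + b, a + c, a + b + c + d
support = range(max(0, col1 - (n - row1)), min(row1, col1) + 1)

total_mass = sum(hypergeom_pmf(x, row1, col1, n) for x in support)
mean_from_pmf = sum(x * hypergeom_pmf(x, row1, col1, n) for x in support)
mean_formula = row1 * col1 / n
var_formula = row1 * (n - row1) * col1 * (n - col1) / (n ** 2 * (n - 1))
t2_fixed_margins = (a - mean_formula) ** 2 / var_formula   # Equation (32)
t2_eq12 = n * (a * d - b * c) ** 2 / (row1 * col1 * (n - col1) * (n - row1))
exact_p = fisher_upper_p(a, b, c, d)

print("E0[A] =", mean_formula, " Var0[A] =", round(var_formula, 4))
print("T2 with fixed margins:", round(t2_fixed_margins, 4), " Equation (12):", round(t2_eq12, 4))
print("one-sided exact p-value:", round(exact_p, 4))
```

The two $\chi^2$-type statistics differ exactly by the factor $(n - 1)/n$, in line with the remark above.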
Asymptotic vs. finite sample null distributions. Our discussion of the $\chi^2$-test of independence has focused on large sample sizes $n$, that is, on rejection regions based on the asymptotic $\chi^2$ null distribution approximation to the sampling distribution of the test statistic $T_2$. The appropriateness of this approximation depends not only on the sample size $n$ but also on the null joint frequencies, i.e., on the unknown marginal disease and exposure frequencies $\Pr(D)$ and $\Pr(E)$. The $\chi^2$ null distribution has been shown to be valid so long as each of the null expected frequencies for Table 1 is greater than one.
A rule of thumb for justifying the $\chi^2$ null distribution is that each of the estimated null expected values $(e_{DE}, e_{\bar{D}E}, e_{D\bar{E}}, e_{\bar{D}\bar{E}})$ be greater than one. That is, $e_{DE} = (a + b)(a + c)/n > 1$, etc. For a 2 x 2 contingency table with fixed marginal totals, one can use Fisher's exact test based on a central hypergeometric null distribution.
Multiple hypothesis testing. Please refer to the lecture notes Multiple Testing Procedures with Applications to Genomics.
Data generating distribution: $P \in \mathcal{M}$.
Parameters: $\psi = (\psi(j) : j = 1, \ldots, J)$, where $\psi(j) = \Psi(P)(j)$.
Null and alternative hypotheses: $H_0(m) = I(P \in \mathcal{M}(m))$ and $H_1(m) = I(P \notin \mathcal{M}(m))$, where $\mathcal{M}(m) \subseteq \mathcal{M}$, $m = 1, \ldots, M$.
Data and empirical distribution: $\mathcal{X}_n = \{X_i : i = 1, \ldots, n\} \overset{IID}{\sim} P$, $P_n$.
Test statistics: $T_n = (T_n(m) : m = 1, \ldots, M)$, where $T_n(m) = T(m; \mathcal{X}_n) = T(m; P_n)$.
Test statistics null distribution: $Q_0$ (or estimator thereof, $Q_{0n}$).
Multiple testing procedure and rejection regions: $\mathcal{R}_n = \mathcal{R}(T_n, Q_{0n}, \alpha) = \{m : T_n(m) \in \mathcal{C}_n(m)\} = \{m : H_0(m) \text{ is rejected}\}$.
Type I error rate: $\theta_n = \Theta(F_{V_n, R_n})$, where $V_n = |\mathcal{R}_n \cap \mathcal{H}_0|$ = # Type I errors and $R_n = |\mathcal{R}_n|$ = # rejected hypotheses.
73 Multiple hypothesis testing
Type II error rate/power: $\vartheta_n = \Theta(F_{U_n, R_n})$, where $U_n = |\mathcal{R}_n^c \cap \mathcal{H}_1|$ = # Type II errors.
Summaries of results: adjusted p-values, test statistic rejection regions, and parameter confidence regions.
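74 Multiple hypothesis testing
Table 7: Type I and Type II errors in multiple hypothesis testing.

| Null hypotheses | Non-rejected, $\mathcal{R}_n^c$ | Rejected, $\mathcal{R}_n$ | |
|---|---|---|---|
| True, $\mathcal{H}_0$ | $W_n = \lvert \mathcal{R}_n^c \cap \mathcal{H}_0 \rvert$ | $V_n = \lvert \mathcal{R}_n \cap \mathcal{H}_0 \rvert$ | $h_0$ |
| False, $\mathcal{H}_1$ | $U_n = \lvert \mathcal{R}_n^c \cap \mathcal{H}_1 \rvert$ | $S_n = \lvert \mathcal{R}_n \cap \mathcal{H}_1 \rvert$ | $h_1$ |
| | $M - R_n$ | $R_n$ | $M$ |

Type I errors: $\mathcal{R}_n \cap \mathcal{H}_0$. Type II errors: $\mathcal{R}_n^c \cap \mathcal{H}_1$.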
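The cell counts of Table 7 are plain set operations on the index sets of rejected hypotheses and true null hypotheses. The sketch below illustrates this; the function name and the choice to index hypotheses by 1, ..., M are assumptions made for the example. In practice the set of true nulls is unknown, so these counts can only be computed in simulations.

```python
def error_counts(rejected, true_nulls, n_hypotheses):
    """Cross-classify M hypotheses as in Table 7.

    rejected     : set of indices m in 1..M for which H_0(m) is rejected (R_n)
    true_nulls   : set of indices m in 1..M for which H_0(m) is true (H_0)
    n_hypotheses : M
    """
    everything = set(range(1, n_hypotheses + 1))
    rejected, true_nulls = set(rejected), set(true_nulls)
    if not (rejected <= everything and true_nulls <= everything):
        raise ValueError("indices must lie in 1..M")
    not_rejected = everything - rejected      # complement of R_n
    false_nulls = everything - true_nulls     # H_1
    return {
        "W": len(not_rejected & true_nulls),  # true nulls, not rejected
        "V": len(rejected & true_nulls),      # Type I errors
        "U": len(not_rejected & false_nulls), # Type II errors
        "S": len(rejected & false_nulls),     # correct rejections
        "R": len(rejected),
        "h0": len(true_nulls),
        "h1": len(false_nulls),
    }


# Example with M = 6: hypotheses 1-4 are true nulls, hypotheses 4 and 5 are rejected.
print(error_counts(rejected={4, 5}, true_nulls={1, 2, 3, 4}, n_hypotheses=6))
```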
75 Multiple hypothesis testing
Table 8: Multiple hypothesis testing flowchart.
Specify data generating distribution and parameters of interest: $P$, $\psi = (\psi(j) : j = 1, \ldots, J)$
Define null and alternative hypotheses: $H_0(m) = I(P \in \mathcal{M}(m))$ and $H_1(m) = I(P \notin \mathcal{M}(m))$
Specify test statistics: $T_n = (T_n(m) : m = 1, \ldots, M)$
Estimate test statistics null distribution: $Q_{0n}$
Select Type I error rate: $\Theta(F_{V_n, R_n})$
Apply MTP
Summarize results: adjusted p-values, rejection regions, confidence regions
76 Multiple hypothesis testing: Apply MTP
FWER, $\Pr(V_n > 0)$: single-step common-cut-off maxT; single-step common-quantile minP; step-down common-cut-off maxT; step-down common-quantile minP; resampling-based empirical Bayes.
gFWER, $\Pr(V_n > k)$: single-step common-cut-off T(k + 1); single-step common-quantile P(k + 1); augmentation; resampling-based empirical Bayes.
General $\Theta(F_{V_n})$: single-step common-cut-off; single-step common-quantile; resampling-based empirical Bayes.
TPPFP, $\Pr(V_n/R_n > q)$: augmentation; resampling-based empirical Bayes.
gTP, $\Pr(g(V_n, R_n) > q)$: augmentation; resampling-based empirical Bayes.
FDR, $E[V_n/R_n]$: TPPFP-based; resampling-based empirical Bayes.
gEV, $E[g(V_n, R_n)]$: gTP-based; resampling-based empirical Bayes.
General $\Theta(F_{g(V_n, R_n)})$: resampling-based empirical Bayes.
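Of the procedures listed, the single-step common-cut-off maxT procedure for FWER control is perhaps the simplest to sketch. The code below is an illustration under stated assumptions, not the lecture's implementation: it assumes one-sided rejection regions of the form $T_n(m) > c$, and it assumes the estimated null distribution $Q_{0n}$ is supplied as a list of simulated or resampled $M$-vectors (`null_draws`). How those draws are produced (for example by bootstrap or permutation) is outside the sketch, and the function name is hypothetical.

```python
from math import ceil


def single_step_maxt(t_obs, null_draws, alpha=0.05):
    """Single-step common-cut-off maxT sketch.

    t_obs      : observed test statistics (T_n(m) : m = 1..M)
    null_draws : list of B vectors of length M, assumed drawn from Q_0n
    alpha      : nominal FWER level

    Returns (cutoff, rejected, adjusted_p), with hypotheses indexed from 1.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not null_draws:
        raise ValueError("need at least one draw from the null distribution")

    # Null distribution of the maximum statistic over the M hypotheses.
    maxima = sorted(max(z) for z in null_draws)
    n_draws = len(maxima)

    # Common cut-off: empirical (1 - alpha)-quantile of the maxima.
    # The 1e-9 guards against floating-point error pushing ceil up by one.
    idx = max(ceil((1 - alpha) * n_draws - 1e-9) - 1, 0)
    cutoff = maxima[idx]

    rejected = {m for m, t in enumerate(t_obs, start=1) if t > cutoff}
    # Adjusted p-value: proportion of null maxima at least as large as T_n(m).
    adjusted_p = [sum(mx >= t for mx in maxima) / n_draws for t in t_obs]
    return cutoff, rejected, adjusted_p


# Toy example with a deterministic stand-in for Q_0n (B = 20 draws, M = 3).
toy_null = [[i, i - 1, 0] for i in range(1, 21)]
print(single_step_maxt([16.5, 3.0, 15.0], toy_null, alpha=0.25))
```

Using the maximum over all $M$ statistics is what makes the cut-off common to every hypothesis; the step-down variants in the list refine this by recomputing the maximum over the hypotheses not yet rejected.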
77 References
S. Dudoit and M. J. van der Laan. Multiple Testing Procedures with Applications to Genomics. Springer Series in Statistics. Springer, New York.
N. P. Jewell. Statistics for Epidemiology. Chapman & Hall/CRC.
Articles and lecture notes on multiple hypothesis testing: