Sample Size Estimation for Studies of High-Dimensional Data

1 Sample Size Estimation for Studies of High-Dimensional Data. James J. Chen, Ph.D., National Center for Toxicological Research, Food and Drug Administration. June 3, 2009, China Medical University, Taichung, Taiwan.

2 Outline
Background (single endpoint): power analysis and sample size.
Sample size problem in multiple testing: multiple testing framework; Type I error and power.
Sample size estimation: independent model; correlated model.

3 Statistical Hypothesis Testing
Population: Control (μ_c, σ); Treatment (μ_t, σ). Difference δ = μ_c − μ_t.
Experiment: Control (x̄_c, s_c); Treatment (x̄_t, s_t), with n samples in each group.
Is the mean difference between Control and Treatment significant?
Two-sample t-test: t = (x̄_c − x̄_t) / √(s_c²/n + s_t²/n), which equals d / (s√(2/n)) when d = x̄_c − x̄_t is the observed difference and both groups share the standard deviation s.
Statistical test:
S1 - Determine the proper test statistic.
S2 - Perform the statistical test on the observed data.
S3 - Compute the p-value (under the null hypothesis).
S4 - Compare the p-value with the pre-specified significance level α.

4 Power of a Test
Two-sample t-test: t = (x̄_c − x̄_t) / √(s_c²/n + s_t²/n) = d / (s√(2/n)) for a common standard deviation s.
The significance (power) of a test depends on the sample size n, the mean difference d (δ), and the standard deviation s (σ).
Power: the probability of declaring significance (p-value ≤ α).
Given δ = μ_c − μ_t, σ, n, and α, the power (1−β) of a test (t-test) can be computed.
Given δ = μ_c − μ_t, σ, (1−β), and α, the sample size n needed for the test (t-test) to achieve significance with that power can be computed.
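The first of these two computations can be sketched in a few lines. This is only an illustration under a stated assumption: the t distribution is replaced by the standard normal so that the code needs nothing beyond the Python standard library. The function name `two_sample_power` and its parameters are hypothetical, not taken from the talk.

```python
from statistics import NormalDist


def two_sample_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n per group.

    Assumption: a normal (z) approximation stands in for the t distribution.
    """
    z = NormalDist()
    se = sigma * (2.0 / n) ** 0.5            # standard error of the mean difference
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)    # two-sided critical value
    shift = abs(delta) / se                  # standardized true difference
    # Probability that the statistic falls beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, round(two_sample_power(delta=1.0, sigma=1.0, n=n), 3))
```

For small n the normal approximation overstates the power somewhat, so a t-based calculation would be preferable in practice.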

5 Sample Size Estimation
Sample size estimation for a given study is conducted during the design stage, before data are collected.
The three factors δ, σ, and α, together with the power (1−β), determine the needed sample size n.
Effect size δ: the targeted statistical distance between the parameters of two populations that the study is designed to detect (the difference in population means). It represents the smallest effect that is considered clinically or biologically relevant.
Sample size: the number of experimental units (e.g., biological samples).

6 Sample Size Estimation: One Endpoint
Sample size problem: the number of samples needed to ensure the power (1−β) to detect the difference δ with a test at the significance level α.
Sample size calculation requires specifying
1. the effect size δ or the standardized effect size δ/σ,
2. the desired power (1−β),
3. the Type I error α.
Sample size for the two-sample t-test: n = 2(t_{α/2} + t_β)² / (δ/σ)².
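As a rough illustration of the one-endpoint formula, the sketch below evaluates n = 2(z_{α/2} + z_β)²/(δ/σ)² and rounds up. It assumes normal quantiles in place of the t quantiles on the slide (a common first approximation), and the function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample comparison.

    Assumption: z quantiles replace the t quantiles of the slide formula.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 quantile
    z_beta = z.inv_cdf(power)                # upper beta quantile, beta = 1 - power
    n_star = 2.0 * (z_alpha + z_beta) ** 2 / (delta / sigma) ** 2
    return math.ceil(n_star)                 # smallest integer >= n*


if __name__ == "__main__":
    print(n_per_group(delta=1.0, sigma=1.0, alpha=0.05, power=0.80))
    print(n_per_group(delta=2.0, sigma=1.0, alpha=0.001, power=0.90))
```

Because the t quantiles exceed the normal ones for small degrees of freedom, the exact t-based answer can be one or two samples larger.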

7 High-Dimensional Data
High-dimensional data (e.g., gene expression data): each sample is characterized by hundreds or thousands of correlated measurements. Most measurements are unrelated to the phenotypes. The number of phenotype samples is small.
The curse of dimensionality: multiple testing; feature selection; over-fitting; poor scalability; etc.

8 High-Dimensional Data for Cancer Biology
Goal: to identify markers (genes or proteins) that may have a functional role in specific phenotypes.
Diagnosis: define patterns that can identify specific phenotypes.
Prognosis: establish a patient's clinical outcome independent of treatment.
Prediction: predict the outcome of a specific treatment.

9 Data Matrix
Potential markers x_1, x_2, x_3, ..., x_m (columns); samples (rows):
Sample   Group           x_1     x_2     x_3     ...   x_m
S_11     Control (-)     y_111   y_211   y_311   ...   y_m11
S_21     Control (-)     y_112   y_212   y_312   ...   y_m12
...
S_n1     Control (-)     y_11k   y_21k   y_31k   ...   y_m1k
S_12     Treatment (+)   y_121   y_221   y_321   ...   y_m21
S_22     Treatment (+)   y_122   y_222   y_322   ...   y_m22
...
S_n2     Treatment (+)   y_12k   y_22k   y_32k   ...   y_m2k
Use an appropriate test statistic to compare the two groups for each variable (univariate analysis) at the significance level α.

10 Testing Multiple Endpoints
Decision by true state:
True state    Significant   Non-significant   Total
Null          V (α)         S (1−α)           m_0
Alternative   U (1−β)       T (β)             m_1
Total         R             A                 m
V is the number of false positives; U is the number of true positives; R is the total number of significant results. E(V)/m_0 = α is the comparison-wise error (CWE) rate; E(U)/m_1 = (1−β) is the comparison-wise power (sensitivity).

11 Type I Errors in Multiple Testing
Three Type I errors in multiple testing:
CWE = E(V)/m: expected proportion of false positives among the m tests.
FWE = Pr(V > 0): probability of at least one false positive.
FDR = E(V/R): expected proportion of false positives among the significant findings.
Type I error based on the CWE at α:
With α = v*/m, the expected number of false positives is v*.
With α = f*/m, the FWE will be at f* (Bonferroni approach).
When m is large, α becomes small. Thus, the Bonferroni approach is not practical. Most favor the FDR approach in high-dimensional data.

12 FDR - Comments
If m_0 = m (all null hypotheses are true), every rejection is a false positive, so V/R = 1 whenever there is any rejection and FDR = E(V/R) = Pr(V > 0) = FWE. In this case, FWE and FDR are equivalent.
If m_0 < m, it can be shown that Pr(V > 0) ≥ E(V/R). A multiple comparison procedure (MCP) that controls the FWE therefore also controls the FDR.
The FDR approach allows findings to be made, provided that the investigator is willing to accept a small fraction of false positive findings.

13 Sample Size Estimation: HD
Since a large number of tests is made and the structure of the data is complex, determination of the needed sample size is difficult.
Most current methods proposed for sample size analysis do not consider the dependency of expression levels and/or assume equal variance among genes.

14 Power and Sample Size
Power is defined by (1−β), the proportion of the true alternatives that are declared significant (sensitivity λ).
Sample size problem: the number of samples needed to ensure detecting at least a fraction λ of the m_1 true alternatives for the difference δ with a test at the significance level α. (Both m and m_1 are pre-specified by the investigator.)

15 A Simple Method
For specified α and power (1−β) = λ, the needed sample size based on the univariate calculation is n = ⌈n*⌉ with n* = 2(t_{α/2} + t_β)² / (δ/σ)², where ⌈n*⌉ is the smallest integer ≥ n*.
Conversely, given δ, α, and n, the outcome of each test of a true alternative is a Bernoulli variable with p = (1−β). The expected proportion of detections (sensitivity) is λ.
The sample size so calculated achieves the sensitivity λ on average. But the probability of actually detecting at least a fraction λ of the alternatives can be low.

16 Confidence Probability
Given m, m_1, δ, and α, the relationship between the sample size n and the power (1−β) is n = ⌈n*⌉ with n* = 2(t_{α/2} + t_β)² / (δ/σ)².
Under the independent model, the probability φ_λ of detecting at least a fraction λ of the m_1 alternatives is the sum of the binomial probabilities:
φ_λ = Σ_{j = ⌈λ m_1⌉}^{m_1} C(m_1, j) (1−β)^j β^(m_1 − j).
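The binomial sum described on this slide is easy to evaluate numerically. The sketch below is one way to do it, working in log space so that large m_1 does not overflow; the function name `confidence_probability` and the argument names are assumptions made for illustration.

```python
import math


def confidence_probability(m1, power, lam):
    """P(at least ceil(lam * m1) of m1 independent tests are significant).

    Each test of a true alternative is treated as an independent
    Bernoulli trial with success probability `power` (= 1 - beta).
    """
    k_min = max(0, math.ceil(lam * m1 - 1e-9))   # tolerance guards float round-off
    if power >= 1.0:
        return 1.0
    if power <= 0.0:
        return 1.0 if k_min == 0 else 0.0
    total = 0.0
    for j in range(k_min, m1 + 1):
        log_term = (math.lgamma(m1 + 1) - math.lgamma(j + 1)
                    - math.lgamma(m1 - j + 1)
                    + j * math.log(power) + (m1 - j) * math.log1p(-power))
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    # Power equal to the target sensitivity: the target is met only about half the time.
    print(round(confidence_probability(m1=200, power=0.90, lam=0.90), 3))
```

Setting the per-test power equal to λ illustrates the point of the previous slide: the sensitivity is λ on average, yet the chance of reaching it in a given experiment is far from certain.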

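17 Table: Estimated sample size n, sensitivity λ, and power (confidence probability) φ under the independent model, based on m = 2,000, δ = 2, and α = [value not preserved]. Univariate method. Columns: π_1 (%), λ, φ, n*, n, λ, φ. [Table values are not preserved in this transcript.]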

18 An Alternative Formulation
The number of samples needed to ensure the specified sensitivity λ with a confidence probability of at least φ.
The sample size n is calculated using the two equations:
n = ⌈n*⌉ with n* = 2(t_{α/2} + t_β)² / (δ/σ)², and
Σ_{j = ⌈λ m_1⌉}^{m_1} C(m_1, j) (1−β)^j β^(m_1 − j) ≥ φ.
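One way to solve this formulation numerically is a simple search: increase n until the binomial tail probability reaches φ. The sketch below does that under two assumptions that are not from the talk: a normal approximation to the per-test power (keeping only the dominant tail) and independence among the m_1 tests. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_Z = NormalDist()


def power_at(n, delta, sigma, alpha):
    """Normal-approximation power of a two-sided test (dominant tail only)."""
    se = sigma * math.sqrt(2.0 / n)
    return _Z.cdf(abs(delta) / se - _Z.inv_cdf(1.0 - alpha / 2.0))


def tail_probability(m1, p, lam):
    """P(Binomial(m1, p) >= ceil(lam * m1)), summed in log space."""
    k_min = max(0, math.ceil(lam * m1 - 1e-9))
    if p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 1.0 if k_min == 0 else 0.0
    total = 0.0
    for j in range(k_min, m1 + 1):
        total += math.exp(math.lgamma(m1 + 1) - math.lgamma(j + 1)
                          - math.lgamma(m1 - j + 1)
                          + j * math.log(p) + (m1 - j) * math.log1p(-p))
    return min(total, 1.0)


def n_with_confidence(m1, delta, sigma, alpha, lam, phi, n_max=1000):
    """Smallest n per group whose confidence probability is at least phi."""
    for n in range(2, n_max + 1):
        if tail_probability(m1, power_at(n, delta, sigma, alpha), lam) >= phi:
            return n
    raise ValueError("no n up to n_max reaches the requested confidence")


if __name__ == "__main__":
    print(n_with_confidence(m1=200, delta=2.0, sigma=1.0,
                            alpha=0.001, lam=0.9, phi=0.95))
```

The search returns a value at least as large as the univariate (mean) answer, because the per-test power must exceed λ by enough to absorb the binomial variation.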

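19 Table: Estimated sample size n, sensitivity λ, and power (confidence probability) φ under the independent model, based on m = 2,000, δ = 2, and α = [value not preserved]. Columns: π_1 (%), λ; mean method: n*, λ, φ; confidence probability method: n, λ, φ. [Table values are not preserved in this transcript.]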

20 Type I Errors: CWE, FWE, FDR
Type I errors: FWE, FDR, or CWE (v: number of false positives).
Since m_1 and (1 − β) are pre-specified, a Type I error can be expressed in terms of the CWE.
Setting α = v/m bounds the expected number of false positives by v; for v < 1 (e.g., v = 0.05) the FWE is controlled at v.
Setting α = [m_1 (1 − β) q*] / [m_0 (1 − q*)], the FDR will be controlled at q*.
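The two conversions on this slide are plain arithmetic, sketched below. The helper `approx_fdr` uses the ratio of expectations E(V) / (E(V) + E(U)) as a stand-in for E(V/R), which is an approximation added here for checking the algebra; all function names are hypothetical.

```python
def alpha_for_fwe(fwe, m):
    """Per-comparison alpha giving family-wise error `fwe` (Bonferroni)."""
    return fwe / m


def alpha_for_fdr(q, m, pi1, power):
    """Per-comparison alpha giving FDR q when a fraction pi1 of the m
    hypotheses are true alternatives, each detected with probability `power`."""
    m1 = pi1 * m
    m0 = m - m1
    return m1 * power * q / (m0 * (1.0 - q))


def approx_fdr(alpha, m, pi1, power):
    """Ratio-of-expectations approximation to the FDR: E(V) / (E(V) + E(U))."""
    m1 = pi1 * m
    m0 = m - m1
    expected_v = m0 * alpha        # expected false positives
    expected_u = m1 * power        # expected true positives
    return expected_v / (expected_v + expected_u)


if __name__ == "__main__":
    a = alpha_for_fdr(q=0.05, m=2000, pi1=0.05, power=0.9)
    print(a, approx_fdr(a, m=2000, pi1=0.05, power=0.9))
    print(alpha_for_fwe(0.05, 2000))
```

The FDR-based α is typically much larger than the Bonferroni α for the same m, which is why the FDR formulation needs fewer samples.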

21 Simulation Experiment: Independent Model
Fixed m = 2,000, α = 0.001, δ = 2; specify m_1 and n.
Null model: m_0 = m(1 − π_1) variables from N(0,1). Alternative model: m_1 = mπ_1 variables from N(δ,1).
For each sample set, the t-statistics were computed, and v and u were counted at α = 0.001.
The estimates of α, FDR, λ, and φ_λ were then calculated. The estimate of φ_λ was the proportion of the 1,000 simulations in which at least a fraction λ of the m_1 alternatives was detected.
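A scaled-down version of this experiment can be sketched with the standard library alone. Two things here are assumptions for illustration: the t critical value is passed in by the caller (the standard library has no t quantile function; 3.792 is the tabulated two-sided 0.001 value for 22 degrees of freedom), and the demo uses far fewer variables and repetitions than the slide so that it runs quickly. The function names are hypothetical.

```python
import math
import random


def t_statistic(x, y):
    """Two-sample t-statistic for equal group sizes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    return (mx - my) / math.sqrt((vx + vy) / n)


def simulate(m, m1, n, delta, t_crit, lam, n_sim, seed=0):
    """Estimate alpha, FDR, sensitivity and confidence probability."""
    rng = random.Random(seed)
    m0 = m - m1
    alpha_sum = fdr_sum = lam_sum = 0.0
    hits = 0
    for _ in range(n_sim):
        v = u = 0                                  # false and true positives
        for gene in range(m):
            shift = delta if gene < m1 else 0.0    # first m1 variables are alternatives
            ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]
            trt = [rng.gauss(shift, 1.0) for _ in range(n)]
            if abs(t_statistic(trt, ctrl)) > t_crit:
                if gene < m1:
                    u += 1
                else:
                    v += 1
        alpha_sum += v / m0
        fdr_sum += v / (v + u) if (v + u) > 0 else 0.0
        lam_sum += u / m1
        hits += 1 if u >= lam * m1 else 0
    return {"alpha": alpha_sum / n_sim, "fdr": fdr_sum / n_sim,
            "lambda": lam_sum / n_sim, "phi": hits / n_sim}


if __name__ == "__main__":
    print(simulate(m=200, m1=20, n=12, delta=2.0, t_crit=3.792,
                   lam=0.8, n_sim=50, seed=1))
```

With the slide's settings (m = 2,000 and 1,000 repetitions) a vectorized implementation would be more practical than this pure-Python loop.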

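22 Table: Estimated α and FDR q from the mean and confidence probability methods under the independent model, m = 2,000, δ = 2, and α = [value not preserved]. Columns: π_1, λ, q*; mean method (n^a): n, α, q; confidence probability method (n^b): n, α, q. [Table values are not preserved in this transcript.]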

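23 Table: Estimated sensitivity λ and confidence probability φ from the mean and confidence probability methods under the independent model, m = 2,000, δ = 2, and α = [value not preserved]. Columns: π_1 (%), λ; mean method (n^a): n, λ, φ; confidence probability method (n^b): n, λ, φ. [Table values are not preserved in this transcript.]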

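24 A Re-sampling Method: Tibshirani
[Flow diagram.] Pilot data X give a statistic t_i and a standard deviation s_i for each variable X_i, and permutations of the pilot data give the null distribution. Null genes: the statistics t_01, t_02, ..., t_0N yield the false-positive counts v_1, v_2, ..., v_N, from which the Type I error or FDR is estimated. Non-null genes: the statistics t_11, t_12, ..., t_1N, shifted by δ / (σ√(1/n_0 + 1/n_1)), yield the true-positive counts u_1, u_2, ..., u_N, whose (1−φ)th percentile is u*.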

25 Lin-Chen Re-sampling Method
1. Start with the sample size n from the independent model.
2. Compute the adjustment factor f = f_1 × f_2.
3. Generate permutation samples (b = 1, 2, ..., B). Compute the t-statistics from the permutation samples and multiply each t-statistic by the factor f: t_b = {f t_0b, f t_1b}. Add δ to a set of randomly selected m_1 genes in Group [number not preserved].
4. Construct a null distribution by pooling all t_0b's.
5. Calculate v_b (Type I error) and u_b (power) for each t_b.
6. Order u_1, u_2, ..., u_B and find the (1−φ)th percentile, u*.
7. If u* > m_1 λ, stop and report n as the sample size estimate; otherwise, increase n by 1 and repeat from step 2.
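The loop structure of these steps can be sketched as follows. This is a loose illustration, not the authors' implementation: the explicit form of the adjustment factor is not recoverable from this transcript, so it is left as a caller-supplied function `adjust` (defaulting to 1, i.e. no adjustment); the effect is applied as a shift δ / (s_i √(2/n)) on the t-statistic scale, following the re-sampling diagram; and one column permutation is shared by all variables to keep their correlations. All names are hypothetical, and the pilot standard deviations are assumed to be nonzero.

```python
import math
import random


def t_stat(x, y):
    """Two-sample t-statistic with an unpooled standard error."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    return (mx - my) / se if se > 0 else 0.0


def pooled_sd(x, y):
    """Pooled within-group standard deviation of one variable."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    ss = sum((a - mx) ** 2 for a in x) + sum((b - my) ** 2 for b in y)
    return math.sqrt(ss / (len(x) + len(y) - 2))


def resampling_sample_size(pilot, n_ctrl, m1, delta, alpha, lam, phi, n_start,
                           adjust=lambda n: 1.0, n_perm=100, n_max=100, seed=0):
    """pilot: one row per variable; the first n_ctrl values are controls."""
    rng = random.Random(seed)
    m, cols = len(pilot), list(range(len(pilot[0])))
    sds = [pooled_sd(row[:n_ctrl], row[n_ctrl:]) for row in pilot]
    for n in range(n_start, n_max + 1):
        f = adjust(n)                              # step 2 (form assumed by caller)
        null_pool, alt_stats = [], []
        for _ in range(n_perm):                    # step 3
            rng.shuffle(cols)                      # same relabelling for every variable
            alt = set(rng.sample(range(m), m1))    # randomly chosen non-null variables
            shifted = []
            for i, row in enumerate(pilot):
                vals = [row[c] for c in cols]
                t = f * t_stat(vals[:n_ctrl], vals[n_ctrl:])
                if i in alt:
                    shifted.append(t + delta / (sds[i] * math.sqrt(2.0 / n)))
                else:
                    null_pool.append(abs(t))       # step 4: pooled null statistics
            alt_stats.append(shifted)
        null_pool.sort()
        idx = min(len(null_pool) - 1, int((1.0 - alpha) * len(null_pool)))
        cutoff = null_pool[idx]                    # two-sided cutoff at level alpha
        u = sorted(sum(abs(t) > cutoff for t in s) for s in alt_stats)  # step 5
        u_star = u[min(n_perm - 1, int((1.0 - phi) * n_perm))]          # step 6
        if u_star > lam * m1:                      # step 7
            return n
    raise ValueError("no sample size up to n_max met the target")


if __name__ == "__main__":
    gen = random.Random(42)
    demo = [[gen.gauss(0.0, 1.0) for _ in range(8)] for _ in range(60)]  # synthetic pilot
    print(resampling_sample_size(demo, n_ctrl=4, m1=6, delta=2.0, alpha=0.05,
                                 lam=0.5, phi=0.9, n_start=4, n_perm=40))
```

The demo pilot data are synthetic independent normals, used only to show the call; with real pilot data the shared permutation is what lets the estimate reflect correlation among variables.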

26 Comments
1. This method is modified from Tibshirani (2006).
2. It accounts for the correlations and variances of the variables.
3. In practice, the sample size of the pilot dataset is small.
4. Adjustment factors f = f_1 × f_2, defined in terms of the pilot group sizes n_0p and n_1p, the target group sizes n_0 and n_1, and t quantiles at α/2 [the explicit formulas are garbled in this transcript]. f_1 uses the maximum likelihood estimate of the t-statistic; f_2 accounts for the different degrees of freedom of the t-distribution, i.e., n_0p + n_1p − 2 for the pilot data and n_0 + n_1 − 2 for the target sample size.

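27 A Re-sampling Method: Lin & Chen
[Flow diagram.] Pilot data X give a statistic t_i and a standard deviation s_i for each variable X_i, and permutations of the pilot data give the null distribution. Every permutation t-statistic is multiplied by the adjustment factor f, which involves the group sizes n_0p, n_1p, n_0, n_1 and the t quantiles at α/2 with n_0 + n_1 − 2 and n_0p + n_1p − 2 degrees of freedom. Null genes: the statistics f t_01, f t_02, ..., f t_0N yield the false-positive counts v_1, v_2, ..., v_N, from which the Type I error or FDR is estimated. Non-null genes: the statistics f t_11, f t_12, ..., f t_1N, shifted by δ / (σ√(1/n_0 + 1/n_1)), yield the true-positive counts u_1, u_2, ..., u_N, whose (1−φ)th percentile is u*.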

28 Simulation Experiment: Correlated Model
Colon cancer data set (Alon et al., 1999, PNAS): the colon dataset consists of expression patterns of 2,000 human genes in 22 normal and 40 colon tumor tissues.
Fixed m = 2,000, α = 0.001, δ = 2; specify m_1 and n.
1. Randomly select 4 samples from each group.
2. Compute the sample sizes using the Tibshirani and the Lin & Chen methods.
3. Repeat 1,000 times, selecting a different set of four samples each time.
4. Estimate the mean and standard deviation.
Repeat the procedure using 6 samples from each group.

29 Table: Sample size estimates (sd) from pilot sets of 4 or 6 samples per group with 1,000 repetitions, m = 2,000, δ = 2, and α = 10^-3.
π_1 (%)   λ    n^a   n^b (4)     n^c (4)       n^b (6)       n^c (6)
--        --   --    -- (3.16)   26.6 (7.98)   14.3 (2.35)   18.0 (3.88)
--        --   --    -- (3.23)   30.0 (8.09)   15.3 (2.48)   19.7 (4.05)
--        --   --    -- (3.68)   35.5 (9.15)   17.3 (2.45)   22.8 (4.12)
--        --   --    -- (2.76)   25.5 (7.04)   14.2 (2.33)   17.8 (3.88)
--        --   --    -- (3.33)   29.6 (8.17)   15.4 (2.38)   19.8 (3.90)
--        --   --    -- (3.48)   35.4 (8.74)   17.3 (2.53)   22.8 (4.26)
a. mean method; b. Lin & Chen; c. Tibshirani. Entries shown as -- are not preserved in this transcript.

30 Table: Estimated sensitivity λ and confidence probability φ from the univariate mean and Lin & Chen methods using the colon tumor data, m = 2,000, δ = 2, and α = [value not preserved]. Columns: π_1 (%), λ; univariate mean method: n, λ, φ; Lin & Chen method: n, λ, φ. [Table values are not preserved in this transcript.]

31 Summary - I
Common approaches: sample size estimation methods were derived from the independent or equi-correlated models because of the complexity of the correlation among the variables.
The sample size is formulated as the number of arrays needed to achieve the specified sensitivity λ on average.
This formulation is inadequate because the achieved sensitivity varies around λ from one experiment to the next.

32 Summary - II
Alternative: the number of arrays needed to ensure detecting at least the specified sensitivity with a confidence probability of at least 95%.
A permutation method that uses a small pilot dataset to estimate the sample size is proposed. The method
uses a small pilot dataset, 4-6 samples per group;
provides efficient estimates of sample size;
performs well for an illustrative dataset.

33 References
Tsai, C-A, Wang, S-J, Chen, D-T, and Chen, J.J. Sample size for gene expression microarray experiments. Bioinformatics 21.
Tibshirani, R. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics 2006; 7:106.
Lin, W-J and Chen, J.J. Power and sample size estimation in microarray studies (manuscript).

34 Sample Size Estimation for Studies of High-Dimensional Data
Before conducting an experiment, one important issue to decide is the number of samples required to have adequate power to detect treatment effects. Sample size estimation for a single endpoint is formulated as the number of samples needed to ensure the specified power of detecting the specified mean difference at a pre-specified significance level α. For high-dimensional data, the sample size is commonly calculated to achieve a specified sensitivity on average, and the needed sample size can be calculated using a univariate method. This formulation is inadequate because of the variance in estimating the sensitivity. In addition, the univariate method does not take the dependence among the variables into consideration. This talk formulates the sample size problem as the number of samples needed to ensure detecting at least the specified sensitivity with the desired confidence probability. A permutation method that uses a small pilot dataset to estimate the sample size will be presented. This method accounts for correlation and variance heterogeneity among variables and is shown to perform well for an example dataset.
