An Omnibus Consistent Adaptive Percentile Modified Wilcoxon Rank Sum Test with Applications in Gene Expression Studies


O. Thas, L. Clement, J.C.W. Rayner, B. Carvalho, and W. Van Criekinge

Supplementary Material

Corresponding author: Olivier Thas, Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Belgium

Web Appendix A

Tables 1, 2 and 3 give the empirical powers. See Section 4 of the paper for the details.

Table 1: Powers of exact permutation tests for location-shift alternatives with different parent distributions (n = m = 10, α = 0.05; powers approximated from 10,000 simulations).
Columns: θ, Test, Uniform, Normal, Cauchy, Chi, Expon., Logistic.
Rows, repeated for each of four values of θ starting at θ = 0.0: WMW, Gastwirth, HFR, LT, MAX, SUM, APMW, Lepage, BWS, t-test.

Table 2: Powers of permutation tests for location-shift alternatives with different parent distributions (n = m = 25, α = 0.05; powers approximated from 10,000 simulations, p-values based on 200,000 Monte Carlo runs).
Columns: θ, Test, Uniform, Normal, Cauchy, Chi, Expon., Logistic.
Rows, repeated for each of four values of θ starting at θ = 0.0: WMW, Gastwirth, HFR, LT, MAX, SUM, APMW, Lepage, BWS, t-test.

Web Appendix B

Applying our test to the complete set of genes resulted in a list of significantly differentially expressed genes. The list contains some genes for which differential expression was not detected by the t-test or the WMW test. Some of these genes have been biologically validated by confirming that the gene's promoter is methylated in most patients of the adenoma group. Methylation is an epigenetic process that results in gene silencing, i.e. the gene cannot be expressed when its promoter is methylated. The methylation status was determined in an MSP (Methylation Specific PCR) experiment.

We present the results for the LDOC1L gene with accession number NM , which is known as a gene for a leucine zipper. The data are presented in Table 4. The two-sided Welch t-test and the WMW test both give p-values that are nonsignificant at the α = 0.05 level. The adaptive PMW test gives $R_n = 2.86$ ($p, r \in [0.01, 0.5]$) with a p-value of 0.02, resulting in the rejection of the null hypothesis. The value of $R_n$ corresponds to $S_n(p, r)$ using only a fraction $p = 15.71\%$ of the observations in the right tail and a fraction $r = 4.41\%$ of the observations in the left tail of the pooled sample. This suggests that the difference between the two distributions is more pronounced for the larger expression values. This is also illustrated in Figure 1 of Web Appendix B, which shows kernel density estimates of the two distributions.

The biological validation implies that the rejection of the null hypothesis is not a false positive result. This case study demonstrates that the APMW test is successful in detecting biologically relevant effects where some other tests fail.

Figure 1: Density estimates of the expression values of the LDOC1L gene in adenoma (full line) and carcinoma patients (dashed line). Horizontal axis: expression value; vertical axis: density.

In the original study (Carvalho et al., 2008, 2009) no corrections for multiple testing were performed. The biologists selected potential differentially expressed genes based on the ranking of the p-values of genes that had p-values smaller than 0.05 for both the Welch t-test and the WMW test, but they also selected a few genes that resulted in small p-values for the APMW test but that were not detected by the other two tests. This resulted in the detection of the leucine zipper gene. We do not advocate this procedure, but it demonstrates that the APMW test is successful in detecting biologically relevant effects where some other tests fail.

Web Appendix C

In the absence of ties, the exact mean and variance of $T_n(p, r)$ under $H_0$ for odd $n$ are

$$E\{T_n(p, r)\} = \frac{n_1}{2n}\left\{n_p(n_p + 1) - n_r(n_r + 1)\right\}$$

and

$$\begin{aligned}
\operatorname{Var}\{T_n(p, r)\} ={}& \frac{n_1 n_2 n_p (n_p + 1)}{12 n^2 (n - 1)}\left\{4 n n_p + 2n - 3 n_p^2 - 3 n_p\right\} \\
&+ 2\,\frac{n_1 n_2 n_p n_r}{4 n^2 (n - 1)}\,(n_p + 1)(n_r + 1) \\
&+ \frac{n_1 n_2 n_r (n_r + 1)}{12 n^2 (n - 1)}\left\{4 n n_r + 2n - 3 n_r^2 - 3 n_r\right\},
\end{aligned}$$

and for even $n$ they are

$$E\{T_n(p, r)\} = \frac{n_1}{2n}\left\{n_p^2 - n_r^2\right\}$$

and

$$\begin{aligned}
\operatorname{Var}\{T_n(p, r)\} ={}& \frac{n_1 n_2 n_p}{12 n^2 (n - 1)}\left\{4 n n_p^2 - 3 n_p^3 - n\right\} \\
&+ 2\,\frac{n_1 n_2 n_p^2 n_r^2}{4 n^2 (n - 1)} \\
&+ \frac{n_1 n_2 n_r}{12 n^2 (n - 1)}\left\{4 n n_r^2 - 3 n_r^3 - n\right\}.
\end{aligned}$$
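The closed forms of Web Appendix C can be checked numerically for small samples. The sketch below is illustrative only and rests on two assumptions that are not stated in this supplement: the tail scores are taken to be 1, ..., k for odd n and 1/2, 3/2, ..., k - 1/2 for even n (score sets that reproduce the stated means), and the upper and lower fractions do not overlap. The function names `tail_scores`, `closed_form` and `brute_force` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def tail_scores(k, n):
    """Assumed scores of a tail fraction: 1..k (odd n) or 1/2..k-1/2 (even n)."""
    shift = 0 if n % 2 else Fraction(1, 2)
    return [j - shift for j in range(1, k + 1)]


def closed_form(n1, n2, n_p, n_r):
    """Mean and variance of T_n(p, r) under H0 from the Web Appendix C formulas."""
    n = n1 + n2
    if n % 2:  # odd n
        mean = Fraction(n1, 2 * n) * (n_p * (n_p + 1) - n_r * (n_r + 1))

        def tail_var(k):
            factor = Fraction(n1 * n2 * k * (k + 1), 12 * n**2 * (n - 1))
            return factor * (4 * n * k + 2 * n - 3 * k**2 - 3 * k)

        cross = 2 * Fraction(n1 * n2 * n_p * n_r, 4 * n**2 * (n - 1)) * (n_p + 1) * (n_r + 1)
    else:  # even n
        mean = Fraction(n1, 2 * n) * (n_p**2 - n_r**2)

        def tail_var(k):
            factor = Fraction(n1 * n2 * k, 12 * n**2 * (n - 1))
            return factor * (4 * n * k**2 - 3 * k**3 - n)

        cross = 2 * Fraction(n1 * n2 * n_p**2 * n_r**2, 4 * n**2 * (n - 1))
    return mean, tail_var(n_p) + cross + tail_var(n_r)


def brute_force(n1, n2, n_p, n_r):
    """Same moments by enumerating every assignment of sample 1 to pooled ranks."""
    n = n1 + n2
    weight = [Fraction(0)] * n
    for pos, score in zip(range(n_r), tail_scores(n_r, n)):
        weight[pos] = -score  # lower fraction enters with a minus sign
    for pos, score in zip(range(n - n_p, n), tail_scores(n_p, n)):
        weight[pos] = score  # upper fraction enters with a plus sign
    values = [sum(weight[i] for i in idx) for idx in combinations(range(n), n1)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# One odd and one even pooled sample size.
assert closed_form(4, 5, 3, 2) == brute_force(4, 5, 3, 2)
assert closed_form(4, 4, 3, 2) == brute_force(4, 4, 3, 2)
```

Exact rational arithmetic is used so that the comparison is an equality check rather than a floating-point tolerance check. The enumeration grows as the binomial coefficient of n and n1, so it is only practical for small n.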

Web Appendix D

Here we present the proofs of the exact mean and variance in the presence of ties (see Appendix A). We only give the proofs for the upper fraction statistic $T_{np}$. The proof for the lower fraction statistic $B_{nr}$ is completely analogous.

Preliminaries

For a given fraction $p$ there are $n_p$ observations of the pooled sample in the upper fraction. Let $N_{p1}$, written $N_p$ in the remainder, denote the number of observations of the first sample that appear in the upper fraction, i.e.

$$N_{p1} = \sum_{i=n-n_p+1}^{n} c_i,$$

where $c_i$ is as defined in Section 3 (i.e. $c_i = 1$ if the $i$th smallest observation in the pooled sample comes from sample 1). Note that $N_{p1}$ has a hypergeometric distribution with parameters $n_p$, $n$ and $n_1$, i.e.

$$P(N_p = k) = \frac{\binom{n_p}{k}\binom{n - n_p}{n_1 - k}}{\binom{n}{n_1}}.$$

Thus

$$E\{N_p\} = \frac{n_1 n_p}{n}, \qquad \operatorname{Var}\{N_p\} = \frac{n_1 n_p}{n}\,\frac{n - n_p}{n - 1}\,\frac{n - n_1}{n},$$

and

$$E\{N_p^2\} = \operatorname{Var}\{N_p\} + \left[E\{N_p\}\right]^2 = \frac{n_1 n_p}{n(n - 1)}\left(n - n_p - n_1 + n_1 n_p\right).$$

Let $C_p^t = (c_{n-n_p+1}, \ldots, c_n)$ denote the vector with the sample indicators of the observations in the upper fraction. Then

$$P(C_p = c \mid N_p = k) = \frac{1}{\binom{n_p}{k}}.$$

Hence, for any $i \in \{n - n_p + 1, \ldots, n\}$,

$$P(c_{pi} = 1 \mid N_p = k) = \sum_{c:\, c_{pi} = 1} P(C_p = c \mid N_p = k) = \frac{\binom{n_p - 1}{k - 1}}{\binom{n_p}{k}} = \frac{k}{n_p}.$$

Thus also $E\{c_{pi} \mid N_p = k\} = k / n_p$.

Finally, we will also need $P(c_i = 1 \text{ and } c_j = 1 \mid N_p = k)$. We make a distinction between $i = j$ and $i \neq j$. When $i = j$,

$$P(c_i = 1 \text{ and } c_j = 1 \mid N_p = k) = P(c_i = 1 \mid N_p = k) = \frac{k}{n_p}.$$
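The hypergeometric moments used in the preliminaries can be confirmed by summing over the exact probability mass function. This is a minimal sketch using only the Python standard library; the function names `hypergeom_pmf` and `hypergeom_moments` and the example sizes are illustrative choices, not part of the paper or the APMW package.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, n, n1, n_p):
    """P(N_p = k): k of the n_p upper-fraction observations come from sample 1."""
    return Fraction(comb(n_p, k) * comb(n - n_p, n1 - k), comb(n, n1))


def hypergeom_moments(n, n1, n_p):
    """Return (E N_p, Var N_p, E N_p^2) by direct summation over the support."""
    support = range(max(0, n1 - (n - n_p)), min(n1, n_p) + 1)
    mean = sum(k * hypergeom_pmf(k, n, n1, n_p) for k in support)
    second = sum(k * k * hypergeom_pmf(k, n, n1, n_p) for k in support)
    return mean, second - mean**2, second


n, n1, n_p = 20, 8, 5
mean, var, second = hypergeom_moments(n, n1, n_p)

# Compare with the closed forms stated in the preliminaries.
assert mean == Fraction(n1 * n_p, n)
assert var == Fraction(n1 * n_p * (n - n_p) * (n - n1), n * n * (n - 1))
assert second == Fraction(n1 * n_p, n * (n - 1)) * (n - n_p - n1 + n1 * n_p)
```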

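When $i \neq j$,

$$P(c_i = 1 \text{ and } c_j = 1 \mid N_p = k) = P(c_i = 1 \mid c_j = 1 \text{ and } N_p = k)\, P(c_j = 1 \mid N_p = k) = \frac{k - 1}{n_p - 1}\,\frac{k}{n_p}.$$

Exact Mean

$$\begin{aligned}
E\{T_{np}\} &= E_{N_p}\left\{E_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\}\right\}\right\} \\
&= E_{N_p}\left\{E_{C_p \mid N_p}\left\{\sum_{i=n-n_p+1}^{n} c_i\, a(i, \tau)\right\}\right\} \\
&= E_{N_p}\left\{\frac{N_p}{n_p} \sum_{i=n-n_p+1}^{n} a(i, \tau)\right\} \\
&= \frac{1}{n_p}\, E_{N_p}\{N_p\} \sum_{i=n-n_p+1}^{n} a(i, \tau) \\
&= \frac{n_1}{n}\, A_p(\tau).
\end{aligned}$$

Exact Variance

First we write

$$\begin{aligned}
\operatorname{Var}\{T_{np}\} &= E_{C_p, N_p}\left\{\operatorname{Var}\{T_{np} \mid C_p, N_p\}\right\} + \operatorname{Var}_{C_p, N_p}\left\{E\{T_{np} \mid C_p, N_p\}\right\} \\
&= E_{N_p}\left\{E_{C_p \mid N_p}\left\{\operatorname{Var}\{T_{np} \mid C_p, N_p\} \mid N_p\right\}\right\} \qquad (1) \\
&\quad + E_{N_p}\left\{\operatorname{Var}_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}\right\} \qquad (2) \\
&\quad + \operatorname{Var}_{N_p}\left\{E_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}\right\}. \qquad (3)
\end{aligned}$$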

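Term (1) contains $\operatorname{Var}\{T_{np} \mid C_p, N_p\} = 0$ and it is thus exactly zero.

For Term (2) we first calculate $\operatorname{Var}_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}$:

$$\begin{aligned}
\operatorname{Var}_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}
&= \operatorname{Var}_{C_p \mid N_p}\left\{\sum_{i=n-n_p+1}^{n} c_i\, a(i, \tau) \,\Big|\, N_p\right\} \\
&= \sum_{i=n-n_p+1}^{n} \sum_{j=n-n_p+1}^{n} a(i, \tau)\, a(j, \tau) \operatorname{Cov}\{c_i, c_j \mid N_p\} \\
&= \sum_{i=n-n_p+1}^{n} \sum_{j=n-n_p+1}^{n} a(i, \tau)\, a(j, \tau) \left[P(c_i = 1 \text{ and } c_j = 1 \mid N_p) - \left(\frac{N_p}{n_p}\right)^2\right] \\
&= \frac{N_p (n_p - N_p)}{n_p^2} \left[A_p^{(2)}(\tau) - \frac{A_p^{(1,1)}(\tau)}{n_p - 1}\right].
\end{aligned}$$

Term (2) then becomes

$$E_{N_p}\left\{\operatorname{Var}_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}\right\} = \frac{n_1 (n_p - 1)(n - n_1)}{n (n - 1)\, n_p} \left(A_p^{(2)}(\tau) - \frac{A_p^{(1,1)}(\tau)}{n_p - 1}\right).$$

For Term (3) we first calculate $E_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}$:

$$\begin{aligned}
E_{C_p \mid N_p}\left\{E\{T_{np} \mid C_p, N_p\} \mid N_p\right\}
&= E_{C_p \mid N_p}\left\{\sum_{i=n-n_p+1}^{n} c_i\, a(i, \tau) \,\Big|\, N_p\right\} \\
&= \sum_{i=n-n_p+1}^{n} \frac{N_p}{n_p}\, a(i, \tau) \\
&= \frac{N_p}{n_p}\, A_p(\tau).
\end{aligned}$$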
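The conditional variance in Term (2) can be verified by enumerating every placement of the sample-1 indicators within the upper fraction. In this sketch, $A_p^{(2)}(\tau)$ is read as the sum of squared scores and $A_p^{(1,1)}(\tau)$ as the sum of products $a(i, \tau)\,a(j, \tau)$ over $i \neq j$; that reading is inferred from the double sum above rather than quoted from Appendix A of the paper. The function names and the example scores are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def conditional_variance(a, k):
    """Variance of sum(c_i * a_i) when k ones are placed uniformly among len(a) slots."""
    totals = [sum(a[i] for i in idx) for idx in combinations(range(len(a)), k)]
    mean = Fraction(sum(totals), len(totals))
    return sum((t - mean) ** 2 for t in totals) / len(totals)


def term2_closed_form(a, k):
    """k (n_p - k) / n_p^2 * [A2 - A11 / (n_p - 1)] with A2, A11 as read above."""
    n_p = len(a)
    a2 = sum(x * x for x in a)   # sum of squared scores
    a11 = sum(a) ** 2 - a2       # sum over i != j of a_i * a_j
    return Fraction(k * (n_p - k), n_p**2) * (a2 - Fraction(a11, n_p - 1))


# Integer scores with a tie, to mimic tied observations in the upper fraction.
scores = [1, 2, 2, 4, 7]
for k in range(len(scores) + 1):
    assert conditional_variance(scores, k) == term2_closed_form(scores, k)
```

The check passes for every value of k, including the degenerate cases k = 0 and k = n_p, where both sides are zero.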

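Finally,

$$\operatorname{Var}_{N_p}\left\{\frac{N_p}{n_p}\, A_p(\tau)\right\} = \frac{A_p(\tau)^2}{n_p^2} \operatorname{Var}\{N_p\} = A_p(\tau)^2\, \frac{n_1 (n - n_p)(n - n_1)}{n^2 (n - 1)\, n_p}.$$

Now that we have the three terms, and using $A_p(\tau)^2 = A_p^{(2)}(\tau) + A_p^{(1,1)}(\tau)$, their sum is

$$\operatorname{Var}\{T_{np}\} = \frac{n_1 (n - n_1)}{n^2 (n - 1)} \left((n - 1)\, A_p^{(2)}(\tau) - A_p^{(1,1)}(\tau)\right).$$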
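As a sanity check on the final result, the exact mean and variance of the upper fraction statistic can be compared with a full enumeration of the permutation distribution for a small pooled sample. The sketch assumes $A_p(\tau)$ is the sum of the upper-fraction scores, $A_p^{(2)}(\tau)$ the sum of their squares and $A_p^{(1,1)}(\tau)$ the sum of cross products over $i \neq j$; the function names and example scores are illustrative and are not taken from the APMW package.

```python
from fractions import Fraction
from itertools import combinations


def exact_moments(a, n, n1):
    """Mean and variance of T_np from the Web Appendix D formulas.

    a  : scores a(i, tau) of the upper fraction (ties allowed)
    n  : pooled sample size
    n1 : size of sample 1
    """
    a_sum = sum(a)
    a2 = sum(x * x for x in a)
    a11 = a_sum**2 - a2
    mean = Fraction(n1, n) * a_sum
    var = Fraction(n1 * (n - n1), n * n * (n - 1)) * ((n - 1) * a2 - a11)
    return mean, var


def enumerated_moments(a, n, n1):
    """Same moments from all equally likely positions of sample 1 in the pooled ranking."""
    weights = [0] * (n - len(a)) + list(a)  # observations outside the fraction score 0
    values = [sum(weights[i] for i in idx) for idx in combinations(range(n), n1)]
    mean = Fraction(sum(values), len(values))
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# Tied integer scores in an upper fraction of size 4, pooled sample of size 9.
assert exact_moments([1, 3, 3, 5], 9, 4) == enumerated_moments([1, 3, 3, 5], 9, 4)
```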

Web Appendix E

The R package APMW contains the R function apmw.test, which performs the APMW test. The function returns an R object of class htest. Type help(apmw.test) on the R command line for more help. The file name of the source code of the R package is apmw.zip.

The R file apmw example.r shows the R code to analyse the leucine zipper gene data. See also Section 5 of the paper and Web Appendix B for the data. Make sure that the compiled C code file, the R source file and the data file are in your R working directory.

References

Carvalho, B., Postma, C., Mongera, S., Hopmans, E., Diskin, S., van de Wiel, M., Van Criekinge, W., Thas, O., Matthäi, A., Cuesta, M., Terhaar, J., Craanen, M., Schröck, E., Ylstra, B., & Meijer, G. (2008). Integration of DNA and expression microarray data unravels seven putative oncogenes on 20q amplicon involved in colorectal adenoma to carcinoma progression. Cellular Oncology, 30,

Carvalho, B., Postma, C., Mongera, S., Hopmans, E., Diskin, S., van de Wiel, M., Van Criekinge, W., Thas, O., Matthäi, A., Cuesta, M., Terhaar, J., Craanen, M., Schröck, E., Ylstra, B., & Meijer, G. (2009). Multiple putative oncogenes at the chromosome 20q amplicon contribute to colorectal adenoma to carcinoma progression. Gut, 58,

Table 3: Powers of permutation tests for location-shift alternatives with different parent distributions (n = m = 50, α = 0.05; powers approximated from 10,000 simulations, p-values based on 200,000 Monte Carlo runs).
Columns: θ, Test, Uniform, Normal, Cauchy, Chi, Expon., Logistic.
Rows, repeated for each of four values of θ starting at θ = 0.0: WMW, Gastwirth, HFR, LT, MAX, SUM, APMW, Lepage, BWS, t-test.

Table 4: Gene expression levels of the LDOC1L gene (expression values after RMA preprocessing).
Columns: adenoma, carcinoma.
