The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies Ian Barnett, Rajarshi Mukherjee & Xihong Lin Harvard University ibarnett@hsph.harvard.edu August 5, 2014 Ian Barnett JSM 2014 August 5, 2014 1 / 20
Background Genome-wide association studies (GWAS): millions of common (minor allele frequency > 0.05) SNPs genotyped. Gene-level/pathway-level analysis can provide power to detect these types of effects by combining information over the SNPs. Goal: Develop powerful, computationally efficient, statistical methodology for SNP-sets that have the power to detect joint SNP effects. Ian Barnett JSM 2014 August 5, 2014 2 / 20
Model Model n subjects, q covariates, p genetic variants. Y i is phenotype for ith individual X i contains q covariates for ith individual G i contains SNP information (minor allele counts) in a gene/pathway/snp-set for ith individual α and β contain regression coefficients. µ i = E(Y i G i, X i ) h(µ i ) = X i α + G i β h( ) is the link function. Ian Barnett JSM 2014 August 5, 2014 3 / 20
Marginal SNP test statistics The marginal score test statistic for the jth variant is: Z j = G T j (Y ˆµ 0) where ˆµ 0 is the MLE of E(Y H 0 ). Assume Z j is normalized. Letting UU T = Ĉov(Z) = ˆΣ, define the transformed (decorrelated) test statistics: Z = U 1 L Z MVN(0, n I p) Ian Barnett JSM 2014 August 5, 2014 4 / 20
Current popular methods Method SKAT MinP p Test statistic j=1 Z j 2 max j { Z j } High power when signal sparsity is low. Accurate Pros p-values can be obtained quickly. Cons Can have very low power when sparsity is high. High power when signal sparsity is high. Slightly lower power when sparsity is low. Difficult to obtain accurate analytic p- values. Ian Barnett JSM 2014 August 5, 2014 5 / 20
The higher criticism Let S(t) = p j=1 1 { Zj t} Assumes Σ = I p Under H 0, S(t) Binomial(p, 2 Φ(t)) where Φ(t) = 1 Φ(t) is the survival function of the normal distribution. The Higher Criticism test statistic is: HC = sup t>0 { } S(t) 2p Φ(t) 2p Φ(t)(1 2 Φ(t)) Ian Barnett JSM 2014 August 5, 2014 6 / 20
The higher criticism Histogram of the Z i Density 0.00 0.10 0.20 0.30 4 2 0 2 4 t argmax{hc(t)} Ian Barnett JSM 2014 August 5, 2014 7 / 20
Adjusting for correlation Recalling that Z = U 1 Z, let S (t) = p j=1 1 { Z j t} Note that under H 0, S (t) Binomial(p, 2 Φ(t)) regardless for general correlated Σ. The innovated Higher Criticism test statistic is: { } S (t) 2p Φ(t) ihc = sup t>0 2p Φ(t)(1 2 Φ(t)) Ian Barnett JSM 2014 August 5, 2014 8 / 20
Asymptotic p-values for ihc The supremum of this standardized empirical process follows a Gumbel distribution asymptotically. For 0 < ɛ < δ < 1, if the supremum is taken over the interval Φ 1 (1 δ/2) < t < Φ 1 (1 ɛ/2), then, as shown in Jaeschke (1979), we can write: pr(ihc < x) exp[ exp{ x(2 log ρ) 1/2 2 1 log π + 2 1 log 2 ρ + 2 log ρ}], where ρ = 2 1 log[δ(1 ɛ)/{ɛ(1 δ)}]. Jaeschke (1979) shows that this converges in distribution at an abysmal rate of O{(log p) 1/2 } Ian Barnett JSM 2014 August 5, 2014 9 / 20
Slow convergence to asymptotic distribution The supremum of the higher criticism test statistic is taken over the range Φ 1 (1 δ/2) < t < Φ 1 (1 ɛ/2) where ɛ = 0 01 and δ = 0 40. Each empirical distribution is constructed using 500 samples of p independent standard normal random variables. 1.0 0.8 CDF(x) 0.6 0.4 0.2 0.0 Theoretical; p= Empirical; p=10 2 Empirical; p=10 6 0 1 2 3 4 x Ian Barnett JSM 2014 August 5, 2014 10 / 20
Analytic p-values for ihc Letting h be the observed GHC statistic: ( { p-value = pr sup t>0 S (t) 2p Φ(t) 2p Φ(t)(1 2 Φ(t)) There exists 0 < t 1 < < t p, such that ( p ) p-value = 1 pr {S (t k ) p k} k=1 } h Then apply the chain rule of conditioning to get a product of binomial probabilities. ) Ian Barnett JSM 2014 August 5, 2014 11 / 20
Type I error rates Table: Empirical type I error rate percentages from 10 6 simulations for the higher criticism using the proposed analytic method over a range of significance levels, α. Asymptotic type I error rate percentages are provided for comparison: exact(asymptotic). p α 2 10 50 5 0 4 98(5 95) 4 98(3 24) 5 01(1 84) 1 0 1 00(2 24) 9 92 10 1 (7 31 10 1 ) 1 01(1 59 10 1 ) 1 0 10 1 9 97 10 2 (2 05 10 1 ) 1 01 10 1 (6 03 10 2 ) 9 75 10 2 (4 90 10 3 ) 1 0 10 2 1 01 10 2 (5 32 10 2 ) 1 12 10 2 (7 30 10 3 ) 9 80 10 3 (4 00 10 4 ) Ian Barnett JSM 2014 August 5, 2014 12 / 20
Decorrelating dampens true signals Cancer Genetic Markers of Susceptibility (CGEM) Breast Cancer GWAS: FGFR2 gene Frequency 0 2 4 6 8 Z Frequency 0 2 4 6 8 Z * 4 2 0 2 4 Decorrelating causes ihc to lose power. Marginal test statistics Ian Barnett JSM 2014 August 5, 2014 13 / 20
Comparison Method MinP SKAT ihc Robust to signal sparsity Robust to correlation/ld structure Computationally efficient Does not require decorrelating test statistics Ian Barnett JSM 2014 August 5, 2014 14 / 20
Comparison Method MinP SKAT ihc GHC Robust to signal sparsity Robust to correlation/ld structure Computationally efficient Doesn t require decorrelating test statistics *We will also consider the omnibus test, OMNI, in our power simulations. It is based on the minimum p-value of the SKAT, MinP, and GHC. Ian Barnett JSM 2014 August 5, 2014 15 / 20
Our contribution: the generalized higher critcism (GHC) Recall S(t) = p j=1 1 { Zj t} Now we allow Σ to have arbitrary correlation structure. S(t) is no longer binomial. Instead we approximate with Beta-binomial, matching on first two moments. The Generalized Higher Criticism test statistic is: S(t) 2p Φ(t) GHC = sup t>0 Var(S(t)) GHC achieves the same as detection boundary as HC. Ian Barnett JSM 2014 August 5, 2014 16 / 20
The variance estimator Var(S(t)) Theorem 1 Let r n = 2 p(1 p) 1 k<l p (Σ kl) n and let H i (t) be the Hermite polynomials: H 0 (t) = 1, H 1 (t) = t, H 2 (t) = t 2 1 and so on. Then ( ) Cov S(t k ), S(t j ) = p[2 Φ(max{t j, t k }) 4 Φ(t j ) Φ(t k )] +4p(p 1)φ(t j )φ(t k ) H 2i 1 (t j )H 2i 1 (t k )r 2i i=1 (2i)! Proof follows from Schwartzman and Lin (2009) where they showed: P(Z k > t i, Z l > t j ) = Φ(t i ) Φ(t j ) + φ(t i )φ(t j ) n=1 Σ n kl n! H n 1(t i )H n 1 (t j ) Ian Barnett JSM 2014 August 5, 2014 17 / 20
Analytic p-values for the GHC Letting h be the observed GHC statistic: S(t) 2p p-value = pr sup Φ(t) t>0 Var(S(t)) h There exists 0 < t 1 < < t p, such that ( p ) p-value = 1 pr {S(t k ) p k} k=1 Ian Barnett JSM 2014 August 5, 2014 18 / 20
Power simulations ρ 1=0.4; ρ 3=0; p=40; 2 causal ρ 1=0.4; ρ 3=0; p=40; 4 causal power 0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.05 0.10 0.15 ρ 2 GHC MinP SKAT ihc OMNI power 0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.05 0.10 0.15 ρ 2 GHC MinP SKAT ihc OMNI ρ 1=0.4; ρ 3=0.4; p=40; 2 causal ρ 1=0.4; ρ 3=0.4; p=40; 4 causal power 0.0 0.2 0.4 0.6 0.8 1.0 GHC MinP SKAT ihc OMNI 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ρ 2 power 0.0 0.2 0.4 0.6 0.8 1.0 ρ 1 : correlation within causal variants. ρ 2 : correlation between causal and noncausal variants. ρ 3 : correlation within non-causal variants. 0.0 0.1 0.2 0.3 0.4 0.5 ρ 2 Ian Barnett JSM 2014 August 5, 2014 19 / 20 GHC MinP SKAT ihc OMNI
Data analysis The National Cancer Institute s Cancer Genetic Markers of Susceptibility (CGEM) breast cancer GWAS. Sample has 1145 cases, 1142 controls with european ancestry. 5 4 FGFR2 PTCD3 POLR1A Observed log 10(pvalue) 3 2 1 0 0 1 2 3 4 5 Expected log 10(pvalue) Ian Barnett JSM 2014 August 5, 2014 20 / 20