Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals (BayesMP)

Zhiguang Huo (1), Chi Song (2), George Tseng (3)
(1) Department of Biostatistics, University of Florida
(2) Department of Biostatistics, the Ohio State University
(3) Department of Biostatistics, University of Pittsburgh

July 30, 2018
Background for data integration (Tseng et al. 2012)

- Horizontal meta-analysis: the same type of genomic data from multiple patient cohorts.
- Vertical integrative analysis: multiple types of genomic data from the same patient cohort.
- BayesMP: differential expression (DE) analysis.
Background for meta-analysis of DE analysis

According to Tseng et al. (2012), there are four major categories of transcriptomic meta-analysis:

- Combine effect sizes
  - Fixed effects models, random effects models
- Combine p-values
  - p-value aggregation methods: Fisher (Fisher, 1925), Stouffer (Stouffer, 1949)
  - Order statistics: minP (Tippett, 1931), maxP (Wilkinson, 1951), rOP (Song, 2014)
- Combine ranks
  - rankSum, rankProd (Hong et al., 2006)
- Direct merge
Combine p-values

Combining p-values is simple, powerful, and independent of batch effects.

Table: p-value combining method. E.g. for gene 1, combine $p_{11}, p_{12}, \ldots, p_{1S}$.

| Gene | Study 1 | Study 2 | ... | Study S |
| 1 | $p_{11}$ | $p_{12}$ | ... | $p_{1S}$ |
| 2 | $p_{21}$ | $p_{22}$ | ... | $p_{2S}$ |
| 3 | $p_{31}$ | $p_{32}$ | ... | $p_{3S}$ |
| ... | ... | ... | ... | ... |
| G | $p_{G1}$ | $p_{G2}$ | ... | $p_{GS}$ |

Genomic meta-analysis:
- Apply the p-value combining method gene by gene.
- Adjust for multiple comparisons.
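The gene-wise workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the p-values are assumed to be independent across studies, and Benjamini-Hochberg is used as one possible multiple-comparison adjustment.

```python
import math
from statistics import NormalDist


def fisher(pvals):
    """Fisher: -2 * sum(log p) is chi-square with 2S df under the null."""
    s = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # test statistic divided by 2
    # Closed-form chi-square survival function for even df = 2S.
    term, total = 1.0, 1.0
    for i in range(1, s):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer(pvals):
    """Stouffer: sum of z-scores divided by sqrt(S) is standard normal under the null."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)


def min_p(pvals):
    """minP: the smallest of S independent uniforms follows Beta(1, S)."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)


def max_p(pvals):
    """maxP: the largest of S independent uniforms follows Beta(S, 1)."""
    return max(pvals) ** len(pvals)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    g = len(pvals)
    order = sorted(range(g), key=lambda i: pvals[i])
    adjusted = [0.0] * g
    running_min = 1.0
    for rank in range(g, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * g / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative table only: rows are genes, columns are studies.
p_table = [[0.001, 0.020, 0.300], [0.400, 0.600, 0.500]]
combined = [fisher(row) for row in p_table]
adjusted = benjamini_hochberg(combined)
```

Swapping `fisher` for `stouffer`, `min_p`, or `max_p` in the list comprehension changes the combining rule without touching the adjustment step.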
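Motivation 1: Hypothesis testing setting

$\theta_s$ is the effect size of study $s$, $1 \le s \le S$.

- $HS_B$ targets biomarkers that are DE in one or more studies (Fisher, minP):
  $H_0: \vec{\theta} \in \bigcap_{s} \{\theta_s = 0\}$ vs $H_A: \vec{\theta} \in \bigcup_{s} \{\theta_s \neq 0\}$.
- $HS_A$ targets biomarkers that are DE in all studies (maxP):
  $H_0: \vec{\theta} \in \bigcap_{s} \{\theta_s = 0\}$ vs $H_A: \vec{\theta} \in \bigcap_{s} \{\theta_s \neq 0\}$.
- $HS_r$ targets biomarkers that are DE in $r$ or more studies (rOP):
  $H_0: \vec{\theta} \in \bigcap_{s} \{\theta_s = 0\}$ vs $H_A: \sum_{s} I\{\theta_s \neq 0\} \ge r$.

Problem: in $HS_A$ and $HS_r$, the null and alternative hypotheses are not complementary.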
Motivation 2: differential expression from multiple tissues

Figure: heatmap of gene modules I to VI across brown fat, heart, and liver.

Phenotypes:
- Black: wild type.
- Red: VLCAD-deficient.

Differential expression patterns:
- Homogeneous differential expression pattern (Modules I, II).
- Study-specific differential expression pattern (Modules III, IV, V, VI).

How can we categorize the differential expression patterns across studies (meta-patterns)?
Z statistics and their distribution

Figure: Z statistics distribution in one study. Black line: null component. Red line: positive DE component. Blue line: negative DE component.

- $p_{gs}$ is the one-sided p-value for gene $g$ and study $s$.
- $Z_{gs} = \Phi^{-1}(p_{gs})$, where $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function (CDF) of the standard Gaussian distribution.
- Null component: assume a standard Gaussian distribution or an empirical null (Efron, 2004).
- Alternative component: Dirichlet process.
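The transformation $Z_{gs} = \Phi^{-1}(p_{gs})$ can be illustrated with the Python standard library. This is a small sketch: the function name `p_to_z` and the clipping constant `eps` are assumptions added here so that p-values of exactly 0 or 1 do not map to infinite z-scores.

```python
from statistics import NormalDist


def p_to_z(p_one_sided, eps=1e-12):
    """Map a one-sided p-value to a z-score via the standard normal inverse CDF."""
    # Clip away from 0 and 1 (an assumption of this sketch) to keep the result finite.
    p = min(max(p_one_sided, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


# Illustrative p-values only; each study's column would be transformed the same way.
z_scores = [p_to_z(p) for p in (0.001, 0.5, 0.999)]
```

With this sign convention, small one-sided p-values give negative z-scores and p-values near 1 give positive z-scores, so both directions of differential expression are visible on the same scale.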
Multiple studies

Figure: Z statistics distribution in three studies. (a) Study 1. (b) Study 2. (c) Study 3.

$Y_{gs} \in \{-1, 0, 1\}$ is the DE indicator:

$f^{(s)}(Z_{gs} \mid Y_{gs}) = f_0^{(s)}(Z_{gs}) \, I(Y_{gs} = 0) + f_{+1}^{(s)}(Z_{gs}) \, I(Y_{gs} = 1) + f_{-1}^{(s)}(Z_{gs}) \, I(Y_{gs} = -1)$

Prior:

$Y_{gs} \sim \mathrm{Mult}\big(1, (1 - \pi_g, \pi_g^{+}, \pi_g^{-})\big) \cdot (0, 1, -1)$,

where $\pi_g^{+} = \pi_g \delta_g$ and $\pi_g^{-} = \pi_g (1 - \delta_g)$.
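To make the prior on the DE indicator concrete, the following sketch draws $Y_{gs}$ for one gene across studies. The function names and the numeric values of $\pi_g$ and $\delta_g$ are hypothetical, chosen only for illustration.

```python
import random


def de_prior_probs(pi_g, delta_g):
    """Prior probabilities of Y_gs = 0, +1, -1 given pi_g and delta_g."""
    return {
        0: 1.0 - pi_g,              # not DE
        1: pi_g * delta_g,          # pi_g^+ : positive DE
        -1: pi_g * (1.0 - delta_g)  # pi_g^- : negative DE
    }


def sample_de_indicators(pi_g, delta_g, n_studies, rng):
    """Draw one DE indicator per study from the multinomial prior."""
    probs = de_prior_probs(pi_g, delta_g)
    labels = list(probs)
    weights = [probs[k] for k in labels]
    return rng.choices(labels, weights=weights, k=n_studies)


rng = random.Random(2018)
y_g = sample_de_indicators(pi_g=0.4, delta_g=0.75, n_studies=3, rng=rng)
```

Here $\pi_g$ controls how often gene $g$ is DE at all, and $\delta_g$ splits that probability between the positive and negative directions.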
Graphical Model

Figure: Graphical representation of the Bayesian latent hierarchical model. Nodes: $G_{0+}$, $G_{0-}$, $\gamma$, $\beta$, $\alpha$, $\pi_g$, $\delta_g$, $G_{s+}$, $G_{s-}$, $Y_{gs}$, $f_{k+}^{(s)}$, $f_{k-}^{(s)}$, $f_0^{(s)}$, $Z_{gs}$. Shaded nodes are observed variables. Dashed nodes are pre-estimated or fixed parameters. Arrows represent the generative process. Dashed lines represent equivalent variables. $s$ is the study index and $g$ is the gene index.
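Bayesian computing

1. Update the $\pi_g$'s:
   $\pi_g \mid Y_{gs} \sim \mathrm{Beta}\big(\gamma/(G - \gamma) + Y_g^{+} + Y_g^{-},\; S - Y_g^{+} - Y_g^{-} + 1\big)$,
   where $Y_g^{+} = \sum_s I(Y_{gs} = 1)$ and $Y_g^{-} = \sum_s I(Y_{gs} = -1)$.
2. Update the $\delta_g$'s:
   $\delta_g \mid Y_{gs} \sim \mathrm{Beta}(\beta + Y_g^{+},\; \beta + Y_g^{-})$.
3. Update the $Y_{gs}$'s: first update the $C_{gs}$'s such that
   $\Pr(C_{gs} = k \mid C_{-g,s}, Z_{gs}, \pi_g^{\pm}) \propto h_k^{(s)}(Z_{gs} \mid C_{-g,s}) \, (\pi_g^{+})^{I(k>0)} (\pi_g^{-})^{I(k<0)} (1 - \pi_g)^{I(k=0)}$,
   then set $Y_{gs} = \mathrm{sgn}(C_{gs})$.

Conjugacy makes the Bayesian computing very fast.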
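Steps 1 and 2 are plain conjugate Beta draws, which the sketch below illustrates for a single gene. It is not the authors' C++ code: the function name, argument names, and hyperparameter values are assumptions, and step 3 is omitted because it depends on the Dirichlet process component $h_k^{(s)}$.

```python
import random


def update_pi_delta(y_g, gamma, beta, n_genes, rng):
    """One Gibbs draw of (pi_g, delta_g) given the DE indicators of gene g."""
    n_studies = len(y_g)
    y_pos = sum(1 for y in y_g if y == 1)   # Y_g^+
    y_neg = sum(1 for y in y_g if y == -1)  # Y_g^-
    # Step 1: pi_g | Y ~ Beta(gamma/(G - gamma) + Y+ + Y-, S - Y+ - Y- + 1)
    pi_g = rng.betavariate(
        gamma / (n_genes - gamma) + y_pos + y_neg,
        n_studies - y_pos - y_neg + 1,
    )
    # Step 2: delta_g | Y ~ Beta(beta + Y+, beta + Y-)
    delta_g = rng.betavariate(beta + y_pos, beta + y_neg)
    return pi_g, delta_g


rng = random.Random(0)
# Hypothetical values: 4 studies, G = 1000 genes, gamma = 1, beta = 0.5.
pi_g, delta_g = update_pi_delta([1, 1, -1, 0], gamma=1.0, beta=0.5, n_genes=1000, rng=rng)
```

Because both updates are closed-form Beta draws, each sweep over genes costs only a few counts and two random draws per gene.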
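Decision making framework (Problem 1)

For meta-analysis, we declare as differentially expressed the genes that fall in one of the following alternative spaces:

- $\Omega_{\bar{A}}$: $\Omega^1_{\bar{A}} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) = S\}$.
- $\Omega_B$: $\Omega^1_B = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) \ge 1\}$.
- $\Omega_r$: $\Omega^1_r = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) \ge r\}$.

Efron (2001) proposed the local FDR, shown here for $\Omega_{\bar{A}}$:

$\xi_g = \Pr(\vec{\theta}_g \in \Omega^0_{\bar{A}} \mid Z) = 1 - \Pr(\vec{\theta}_g \in \Omega^1_{\bar{A}} \mid Z)$.

Given a threshold $\kappa$, we declare gene $g$ a DE gene if $\xi_g \le \kappa$; the expected number of false discoveries is $\sum_g \xi_g I(\xi_g \le \kappa)$.

The Bayesian false discovery rate (FDR) (Newton 2004) is defined as

$\dfrac{\sum_g \xi_g I(\xi_g \le \kappa)}{\sum_g I(\xi_g \le \kappa)}$.

We compare the Bayesian FDR of our approach with the frequentist FDR (Benjamini-Hochberg).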
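The Bayesian FDR is simply the average local FDR among the declared genes, which the following sketch computes. The function names and the example values of $\xi_g$ are hypothetical; the threshold search scans the observed values in increasing order, which is one straightforward way to pick $\kappa$ for a nominal level.

```python
def bayesian_fdr(local_fdr, kappa):
    """Average local FDR among genes declared DE (xi_g <= kappa)."""
    selected = [xi for xi in local_fdr if xi <= kappa]
    if not selected:
        return 0.0  # no discoveries, so no false discoveries
    return sum(selected) / len(selected)


def largest_threshold(local_fdr, level=0.05):
    """Largest observed kappa whose Bayesian FDR stays at or below `level`."""
    best = None
    for kappa in sorted(set(local_fdr)):
        if bayesian_fdr(local_fdr, kappa) <= level:
            best = kappa
        else:
            break  # the mean of the smallest values only grows from here
    return best


# Illustrative local FDR values for five genes.
xi = [0.01, 0.02, 0.10, 0.50, 0.90]
kappa = largest_threshold(xi, level=0.05)
declared = [g for g, value in enumerate(xi) if kappa is not None and value <= kappa]
```

In this example the first three genes are declared DE: their local FDRs average to about 0.043, while adding the fourth gene would raise the Bayesian FDR to about 0.16.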
Biomarker clustering for meta-patterns of homogeneous and heterogeneous differential signals (Problem 2)

Denote by $U_{gs}$ the posterior probability vector for $Y_{gs}$:

$U_{gs} = \big(\Pr(Y_{gs} = 1 \mid Z),\; \Pr(Y_{gs} = -1 \mid Z),\; \Pr(Y_{gs} = 0 \mid Z)\big)$.

- For genes $i$ and $j$, calculate the dissimilarity between $U_{is}$ and $U_{js}$ in study $s$, then average over the study index $s$.
- Apply tight clustering (Tseng and Wong) to the gene-gene dissimilarity matrix to obtain stable modules.
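The slide does not say which dissimilarity is used between $U_{is}$ and $U_{js}$. As an illustration only, the sketch below assumes a symmetrized Kullback-Leibler divergence and averages it over studies; the function names, the smoothing constant `eps`, and the example vectors are all hypothetical.

```python
import math


def symmetric_kl(u, v, eps=1e-10):
    """Symmetrized KL divergence KL(u||v) + KL(v||u) between two probability vectors."""
    total = 0.0
    for a, b in zip(u, v):
        a, b = max(a, eps), max(b, eps)  # guard against log(0)
        total += (a - b) * math.log(a / b)
    return total


def gene_dissimilarity(u_i, u_j):
    """Average the per-study dissimilarity between two genes over all studies."""
    per_study = [symmetric_kl(a, b) for a, b in zip(u_i, u_j)]
    return sum(per_study) / len(per_study)


# Each gene has one vector (Pr(Y=1|Z), Pr(Y=-1|Z), Pr(Y=0|Z)) per study; three studies here.
u_gene1 = [(0.90, 0.05, 0.05), (0.80, 0.10, 0.10), (0.85, 0.05, 0.10)]
u_gene2 = [(0.05, 0.90, 0.05), (0.10, 0.10, 0.80), (0.05, 0.05, 0.90)]
d_12 = gene_dissimilarity(u_gene1, u_gene2)
```

Filling a matrix with `gene_dissimilarity` for every pair of genes gives the gene-gene dissimilarity matrix that a clustering method would take as input.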
Simulation (FDR)

Table: Comparison of different methods by FDR for decision spaces $D_{\bar{A}}$, $D_B$, and $D_r$ ($r = \lfloor S/2 \rfloor + 1$). The nominal FDR is 5% for all compared methods. Entries are means with SDs in parentheses, calculated from 100 simulations.

| S | σ | $D_{\bar{A}}$: BayesMP | $D_{\bar{A}}$: maxP | $D_B$: BayesMP | $D_B$: Fisher | $D_B$: AW | $D_r$: BayesMP | $D_r$: rOP |
| 3 | 1 | 0.054 (0.008) | 0.207 (0.013) | 0.050 (0.006) | 0.035 (0.005) | 0.035 (0.004) | 0.035 (0.005) | 0.087 (0.008) |
| 3 | 2 | 0.052 (0.012) | 0.199 (0.016) | 0.054 (0.008) | 0.035 (0.006) | 0.035 (0.006) | 0.036 (0.006) | 0.080 (0.010) |
| 3 | 3 | 0.036 (0.018) | 0.183 (0.021) | 0.050 (0.010) | 0.034 (0.008) | 0.035 (0.009) | 0.031 (0.008) | 0.071 (0.015) |
| 5 | 1 | 0.069 (0.009) | 0.358 (0.017) | 0.053 (0.005) | 0.035 (0.004) | 0.034 (0.004) | 0.038 (0.005) | 0.129 (0.008) |
| 5 | 2 | 0.073 (0.016) | 0.348 (0.023) | 0.055 (0.006) | 0.035 (0.005) | 0.034 (0.005) | 0.041 (0.007) | 0.113 (0.008) |
| 5 | 3 | 0.054 (0.032) | 0.332 (0.035) | 0.049 (0.008) | 0.035 (0.008) | 0.034 (0.008) | 0.036 (0.008) | 0.098 (0.013) |
| 10 | 1 | 0.096 (0.019) | 0.583 (0.023) | 0.061 (0.004) | 0.035 (0.004) | 0.036 (0.004) | 0.049 (0.005) | 0.228 (0.010) |
| 10 | 2 | 0.108 (0.029) | 0.569 (0.027) | 0.061 (0.006) | 0.035 (0.005) | 0.035 (0.005) | 0.058 (0.009) | 0.197 (0.012) |
| 10 | 3 | 0.083 (0.063) | 0.553 (0.038) | 0.053 (0.007) | 0.036 (0.006) | 0.036 (0.006) | 0.056 (0.009) | 0.163 (0.014) |
Simulation (AUC)

Table: Comparison of different methods by AUC of the ROC curve for decision spaces $D_{\bar{A}}$, $D_B$, and $D_r$ ($r = \lfloor S/2 \rfloor + 1$). The nominal FDR is 5% for all compared methods. Entries are means with SDs in parentheses, calculated from 100 simulations.

| S | σ | $D_{\bar{A}}$: BayesMP | $D_{\bar{A}}$: maxP | $D_B$: BayesMP | $D_B$: Fisher | $D_B$: AW | $D_r$: BayesMP | $D_r$: rOP |
| 3 | 1 | 0.977 (0.003) | 0.926 (0.003) | 0.973 (0.002) | 0.973 (0.002) | 0.973 (0.002) | 0.980 (0.002) | 0.972 (0.003) |
| 3 | 2 | 0.907 (0.006) | 0.875 (0.007) | 0.879 (0.004) | 0.877 (0.004) | 0.875 (0.004) | 0.902 (0.004) | 0.873 (0.005) |
| 3 | 3 | 0.831 (0.008) | 0.805 (0.008) | 0.787 (0.005) | 0.783 (0.005) | 0.779 (0.005) | 0.819 (0.006) | 0.776 (0.006) |
| 5 | 1 | 0.974 (0.004) | 0.920 (0.003) | 0.979 (0.002) | 0.978 (0.002) | 0.979 (0.002) | 0.986 (0.002) | 0.979 (0.002) |
| 5 | 2 | 0.920 (0.007) | 0.890 (0.006) | 0.897 (0.004) | 0.894 (0.004) | 0.892 (0.004) | 0.929 (0.004) | 0.893 (0.005) |
| 5 | 3 | 0.864 (0.009) | 0.833 (0.009) | 0.812 (0.005) | 0.806 (0.005) | 0.801 (0.005) | 0.859 (0.005) | 0.802 (0.006) |
| 10 | 1 | 0.964 (0.007) | 0.910 (0.003) | 0.985 (0.001) | 0.983 (0.002) | 0.985 (0.001) | 0.986 (0.002) | 0.986 (0.002) |
| 10 | 2 | 0.910 (0.011) | 0.905 (0.006) | 0.920 (0.003) | 0.917 (0.003) | 0.917 (0.003) | 0.950 (0.004) | 0.919 (0.004) |
| 10 | 3 | 0.875 (0.013) | 0.863 (0.010) | 0.848 (0.004) | 0.839 (0.004) | 0.834 (0.005) | 0.904 (0.005) | 0.838 (0.006) |
Mouse Metabolism data

Table: Sample size description

| Study | Wild type | VLCAD-deficient |
| Brown fat | 4 | 4 |
| Heart | 3 | 4 |
| Liver | 4 | 4 |

- Metabolism disorder in children.
- Two genotypes of the mouse model were studied: wild type (VLCAD +/+) and VLCAD-deficient (VLCAD -/-).
- The total number of genes from these three transcriptomic studies is 14,495.
- Under $D_B$ at FDR 5%, we declared 1,701 DE genes.
- Under $D_{\bar{A}}$ at FDR 5%, we declared 133 DE genes.
Mouse Metabolism data metapattern

Figure: Meta-pattern modules I to VI across brown fat, heart, and liver. (a) Heatmap. (b) CS (scale 0.0 to 0.8). (c) Bar plot over Brown+, Heart+, Liver+, Brown−, Heart−, Liver−. Module sizes in the order shown: n = 277, 195, 194, 140, 276, 110.
Mouse Metabolism data pathway enrichment analysis

Table: module information

| Module | Target pathway | q value |
| Module 1 | KEGG LYSOSOME | $2.8 \times 10^{-4}$ |
| Module 2 | BIOCARTA AHSP PATHWAY | 0.017 |
| Module 3 | DEFENSE RESPONSE | $4.2 \times 10^{-8}$ |
| Module 4 | BIOCARTA MCM PATHWAY | $3.9 \times 10^{-3}$ |
| Module 5 | none | |
| Module 6 | FC GAMMA R MEDIATED PHAGOCYTOSIS | 0.067 |
Mouse Metabolism data: $D_{\bar{A}}$ at FDR 5%, 133 genes

Figure: Heatmaps of the 133 DE genes detected under $D_{\bar{A}}$ (at an FDR level of 5%) in the mouse metabolism dataset. (a) Brown fat. (b) Heart. (c) Liver.
Summary

Novelty:
1. The p-value based method can combine data from different microarray and RNA-seq platforms.
2. The Bayesian framework provides complementary decision-making spaces.
3. The non-parametric Bayesian framework makes the method robust against distributional assumptions.
4. Meta-patterns help characterize heterogeneity among studies of the same disease but with different phenotypes.

Performance:
1. Better performance than current meta-analysis hypothesis testing methods (AUC, FDR, etc.).
2. Computing is fast because of conjugacy.
3. Implemented in C++ and publicly available on GitHub.