Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018


1 Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals (BayesMP)
Zhiguang Huo (1), Chi Song (2), George Tseng (3)
(1) Department of Biostatistics, University of Florida
(2) Department of Biostatistics, the Ohio State University
(3) Department of Biostatistics, University of Pittsburgh
July 30, 2018

2 Background for data integration (Tseng et al. 2012)
Horizontal meta-analysis: the same type of genomic data from multiple patient cohorts.
Vertical integrative analysis: multiple types of genomic data from the same patient cohort.
BayesMP: meta-analysis of differential expression (DE).

3 Background for meta-analysis of DE analysis
According to Tseng et al. (2012), there are four major categories of transcriptomic meta-analysis:
1. Combine effect sizes: fixed effects models, random effects models.
2. Combine p-values:
   p-value aggregation methods: Fisher (Fisher, 1925), Stouffer (Stouffer, 1949);
   order statistics: minp (Tippett, 1931), maxp (Wilkinson, 1951), rop (Song, 2014).
3. Combine ranks: ranksum, rankprod (Hong et al., 2006).
4. Direct merge.

4 Combine p-values
Combining p-values is simple, powerful, and independent of batch effects.
Table: p-value combining method, e.g. combine $p_{11}, p_{12}, \ldots, p_{1S}$ for gene 1.

Gene   Study 1   Study 2   ...   Study S
1      p_{11}    p_{12}    ...   p_{1S}
2      p_{21}    p_{22}    ...   p_{2S}
3      p_{31}    p_{32}    ...   p_{3S}
...
G      p_{G1}    p_{G2}    ...   p_{GS}

Genomic meta-analysis:
1. Apply a p-value combining method gene by gene.
2. Adjust for multiple comparisons.
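The two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses Fisher's method as the combining rule and Benjamini-Hochberg as the multiple-comparison adjustment, and the function names and the small p-value matrix are made up for the example. It assumes independent studies and p-values in (0, 1].

```python
import math

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p) follows chi-square with 2S df under the null."""
    half = -sum(math.log(p) for p in pvals)  # half of the Fisher statistic
    # Closed-form chi-square survival function for even degrees of freedom 2S:
    # exp(-x/2) * sum_{i=0}^{S-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, len(pvals)):
        term *= half / i
        total += term
    return math.exp(-half) * total

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical G x S matrix of p-values (3 genes, 3 studies).
p_matrix = [
    [0.001, 0.004, 0.020],
    [0.300, 0.700, 0.450],
    [0.040, 0.600, 0.010],
]
combined = [fisher_combine(row) for row in p_matrix]  # step 1: gene-wise combination
q_values = bh_adjust(combined)                        # step 2: multiple-comparison adjustment
```

Any other combining rule from the previous slide could replace `fisher_combine` without changing the second step.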

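5 Motivation 1: Hypothesis testing setting
$\theta_s$ is the effect size of study $s$, $1 \le s \le S$.
$HS_B$ targets biomarkers that are DE in one or more studies:
$H_0: \vec{\theta} \in \bigcap_s \{\theta_s = 0\}$ vs $H_A: \vec{\theta} \in \bigcup_s \{\theta_s \neq 0\}$ (Fisher, minp).
$HS_A$ targets biomarkers that are DE in all studies:
$H_0: \vec{\theta} \in \bigcap_s \{\theta_s = 0\}$ vs $H_A: \vec{\theta} \in \bigcap_s \{\theta_s \neq 0\}$ (maxp).
$HS_r$ targets biomarkers that are DE in $r$ or more studies:
$H_0: \vec{\theta} \in \bigcap_s \{\theta_s = 0\}$ vs $H_A: \sum_s I\{\theta_s \neq 0\} \ge r$ (rop).
Problem: in $HS_A$ and $HS_r$, the null and alternative hypotheses are not complementary.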
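The order-statistic tests named on this slide (minp, maxp, rop) can be illustrated with one small function. This is a sketch under the usual assumption of independent Uniform(0, 1) p-values under the global null; the function names are illustrative. The r-th smallest of S such p-values follows Beta(r, S - r + 1), whose CDF at x equals the probability that a Binomial(S, x) variable is at least r.

```python
import math

def rth_order_pvalue(pvals, r):
    """P-value for the r-th smallest of S independent p-values under the global null."""
    S = len(pvals)
    x = sorted(pvals)[r - 1]  # r-th order statistic, 1-indexed
    # Beta(r, S - r + 1) CDF at x, written as a binomial tail sum.
    return sum(math.comb(S, k) * x**k * (1.0 - x) ** (S - k) for k in range(r, S + 1))

def minp(pvals):
    """Smallest p-value (r = 1): sensitive to DE in at least one study."""
    return rth_order_pvalue(pvals, 1)

def maxp(pvals):
    """Largest p-value (r = S): sensitive to DE in all studies."""
    return rth_order_pvalue(pvals, len(pvals))

def rop(pvals, r):
    """r-th ordered p-value: sensitive to DE in r or more studies."""
    return rth_order_pvalue(pvals, r)

example = [0.1, 0.2, 0.5]  # hypothetical p-values of one gene in S = 3 studies
results = {"minp": minp(example), "rop_2": rop(example, 2), "maxp": maxp(example)}
```

All three statistics are calibrated against the same global null, which is why a small maxp or rop p-value does not by itself establish that the gene is DE in all, or in at least r, studies.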


6 Motivation 2: differential expression from multiple tissues
Figure: heatmap of gene modules I to VI in brown fat, heart, and liver. Phenotypes: black, wild type; red, VLCAD-deficient.
Differential expression patterns:
Homogeneous differential expression pattern (Modules I, II).
Study-specific differential expression pattern (Modules III, IV, V, VI).
How should meta-analysis differential expression patterns (meta-patterns) be categorized?

7 Z statistics and its distribution
Figure: Z statistics distribution in one study. Black line: null component. Red line: positive DE component. Blue line: negative DE component.
$p_{gs}$ is the one-sided p-value for gene $g$ and study $s$.
$Z_{gs} = \Phi^{-1}(p_{gs})$, where $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function (CDF) of the standard Gaussian distribution.
Null component: assume a standard Gaussian distribution or an empirical null (Efron, 2004).
Alternative component: modeled with a Dirichlet process.
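The transformation from one-sided p-values to Z statistics can be sketched with the Python standard library. The clipping constant `eps` is an assumption added here to keep p-values of exactly 0 or 1 finite; it is not part of the slide.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def p_to_z(p, eps=1e-12):
    """Z = Phi^{-1}(p) for a one-sided p-value p, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return STANDARD_NORMAL.inv_cdf(p)

# Hypothetical one-sided p-values of one gene in three studies.
one_sided_p = [0.5, 0.975, 0.001]
z_scores = [p_to_z(p) for p in one_sided_p]
```

With this convention a p-value near 0 maps to a large negative Z and a p-value near 1 to a large positive Z, so the two tails of the Z distribution correspond to the two directions of differential expression.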

8 Multiple studies
Figure: Z statistics distribution in three studies. Panels: (a) Study 1, (b) Study 2, (c) Study 3.
$Y_{gs} \in \{-1, 0, 1\}$ is the DE indicator:
$f^{(s)}(Z_{gs} \mid Y_{gs}) = f^{(s)}_{0}(Z_{gs})\, I(Y_{gs} = 0) + f^{(s)}_{+1}(Z_{gs})\, I(Y_{gs} = 1) + f^{(s)}_{-1}(Z_{gs})\, I(Y_{gs} = -1)$.
Prior: $Y_{gs} \sim \mathrm{Mult}\big(1, (1 - \pi_g, \pi^{+}_g, \pi^{-}_g)\big) \cdot (0, 1, -1)^{\top}$, where $\pi^{+}_g = \pi_g \delta_g$ and $\pi^{-}_g = \pi_g (1 - \delta_g)$.
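As a rough illustration of the indicator prior and the three-component density, the sketch below draws $Y_{gs}$ from the multinomial prior and evaluates $f^{(s)}(Z_{gs} \mid Y_{gs})$. The Gaussian components with means 0, +2, and -2 are hypothetical stand-ins chosen only to make the example runnable; in the model itself the alternative components come from a Dirichlet process.

```python
import math
import random

def prior_probs(pi_g, delta_g):
    """Prior probabilities of Y_gs = 0, +1, -1 given pi_g and delta_g."""
    return {0: 1.0 - pi_g, 1: pi_g * delta_g, -1: pi_g * (1.0 - delta_g)}

def sample_indicator(pi_g, delta_g, rng):
    """Draw Y_gs in {-1, 0, 1} from the multinomial prior."""
    probs = prior_probs(pi_g, delta_g)
    labels = list(probs)
    return rng.choices(labels, weights=[probs[k] for k in labels], k=1)[0]

def gaussian_pdf(z, mean, sd=1.0):
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical component means: null at 0, positive DE at +2, negative DE at -2.
COMPONENT_MEANS = {0: 0.0, 1: 2.0, -1: -2.0}

def conditional_density(z, y):
    """f(Z | Y = y): exactly one indicator term is active for a given y."""
    return gaussian_pdf(z, COMPONENT_MEANS[y])

rng = random.Random(0)
draws = [sample_indicator(pi_g=0.4, delta_g=0.75, rng=rng) for _ in range(5)]
```

Here `pi_g` is the probability that gene g is DE in a study and `delta_g` is the probability that a DE signal is positive, matching $\pi^{+}_g = \pi_g \delta_g$ and $\pi^{-}_g = \pi_g (1 - \delta_g)$.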

9 Graphical Model
Figure: Graphical representation of the Bayesian latent hierarchical model. Nodes: $\gamma$, $\beta$, $\alpha$, $G_{0+}$, $G_{0-}$, $\pi_g$, $\delta_g$, $G_{s+}$, $G_{s-}$, $Y_{gs}$, $f^{(s)}_{k+}$, $f^{(s)}_{k-}$, $f^{(s)}_{0}$, $Z_{gs}$. Shaded nodes are observed variables. Dashed nodes are pre-estimated or fixed parameters. Arrows represent the generative process. Dashed lines represent equivalent variables. $s$ is the study index and $g$ is the gene index.

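10 Bayesian computing
1. Update the $\pi_g$'s:
   $\pi_g \mid Y_{gs} \sim \mathrm{Beta}\big(\gamma/(G - \gamma) + Y^{+}_g + Y^{-}_g,\; S - Y^{+}_g - Y^{-}_g + 1\big)$,
   where $Y^{+}_g = \sum_s I(Y_{gs} = 1)$ and $Y^{-}_g = \sum_s I(Y_{gs} = -1)$.
2. Update the $\delta_g$'s:
   $\delta_g \mid Y_{gs} \sim \mathrm{Beta}(\beta + Y^{+}_g,\; \beta + Y^{-}_g)$.
3. Update the $Y_{gs}$'s: first update the $C_{gs}$'s such that
   $\Pr(C_{gs} = k \mid C_{-g,s}, Z_{gs}, \pi^{\pm}_g) \propto h^{(s)}_k(Z_{gs} \mid C_{-g,s})\, (\pi^{+}_g)^{I(k>0)}\, (\pi^{-}_g)^{I(k<0)}\, (1 - \pi_g)^{I(k=0)}$,
   then set $Y_{gs} = \mathrm{sgn}(C_{gs})$.
Conjugacy makes the Bayesian computing fast.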
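Steps 1 and 2 above are conjugate Beta updates and can be sketched in a few lines. This is an illustration only: the hyperparameter values are hypothetical, the function names are invented for the example, and step 3 (the component assignment $C_{gs}$) is omitted because it depends on the Dirichlet process machinery not spelled out on the slide.

```python
import random

def count_signs(y_row):
    """Return (Y_g^+, Y_g^-): numbers of studies with Y_gs = +1 and Y_gs = -1."""
    n_pos = sum(1 for y in y_row if y == 1)
    n_neg = sum(1 for y in y_row if y == -1)
    return n_pos, n_neg

def pi_posterior_params(y_row, gamma, G):
    """Beta parameters of pi_g | Y; requires 0 < gamma < G."""
    n_pos, n_neg = count_signs(y_row)
    S = len(y_row)
    return gamma / (G - gamma) + n_pos + n_neg, S - n_pos - n_neg + 1

def delta_posterior_params(y_row, beta):
    """Beta parameters of delta_g | Y; requires beta > 0."""
    n_pos, n_neg = count_signs(y_row)
    return beta + n_pos, beta + n_neg

def gibbs_update_gene(y_row, gamma, G, beta, rng):
    """One conjugate draw of (pi_g, delta_g) given the current indicators of gene g."""
    a, b = pi_posterior_params(y_row, gamma, G)
    c, d = delta_posterior_params(y_row, beta)
    return rng.betavariate(a, b), rng.betavariate(c, d)

rng = random.Random(1)
y_row = [1, -1, 0]  # hypothetical indicators of one gene in S = 3 studies
pi_g, delta_g = gibbs_update_gene(y_row, gamma=100.0, G=1100.0, beta=0.5, rng=rng)
```

Because both updates are direct draws from Beta distributions, no Metropolis-Hastings step is needed for $\pi_g$ or $\delta_g$, which is the source of the speed noted on the slide.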

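11 Decision making framework (Problem 1)
For meta-analysis purposes, we declare as differentially expressed the genes that fall in:
$\Omega_{\bar{A}}$: $\Omega^{1}_{\bar{A}} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) = S\}$.
$\Omega_{B}$: $\Omega^{1}_{B} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) \ge 1\}$.
$\Omega_{r}$: $\Omega^{1}_{r} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) \ge r\}$.
Efron (2001) proposed the local FDR: $\xi_g = \Pr(\vec{\theta}_g \in \Omega^{0}_{\bar{A}} \mid Z) = 1 - \Pr(\vec{\theta}_g \in \Omega^{1}_{\bar{A}} \mid Z)$, and analogously for $\Omega_B$ and $\Omega_r$.
Given a threshold $\kappa$, we declare gene $g$ a DE gene if $\xi_g \le \kappa$; the expected number of false discoveries is $\sum_g \xi_g I(\xi_g \le \kappa)$.
The Bayesian false discovery rate (FDR) (Newton, 2004) is defined as
$\mathrm{FDR}(\kappa) = \dfrac{\sum_g \xi_g I(\xi_g \le \kappa)}{\sum_g I(\xi_g \le \kappa)}$.
We compare the Bayesian FDR of our approach with the frequentist FDR (Benjamini-Hochberg).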
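The decision rule above can be sketched as follows. The sketch assumes that posterior draws of the indicators $Y_{gs}$ are available and treats $Y_{gs} \neq 0$ as the event $\theta_{gs} \neq 0$; the function names and the toy inputs are illustrative. Setting `r = S` gives the $\Omega_{\bar{A}}$ rule and `r = 1` gives the $\Omega_B$ rule.

```python
def local_fdr(y_draws, r):
    """xi_g = 1 - Pr(at least r studies are DE | Z), estimated from posterior draws.

    y_draws: list of draws for one gene, each a list of S indicators in {-1, 0, 1}.
    """
    hits = sum(1 for draw in y_draws if sum(1 for y in draw if y != 0) >= r)
    return 1.0 - hits / len(y_draws)

def bayesian_fdr(xi, kappa):
    """Average local FDR among genes declared DE at threshold kappa."""
    selected = [x for x in xi if x <= kappa]
    return sum(selected) / len(selected) if selected else 0.0

def largest_threshold(xi, alpha):
    """Largest kappa among the observed xi values with Bayesian FDR <= alpha, else None."""
    best = None
    for kappa in sorted(set(xi)):
        # The Bayesian FDR is non-decreasing in kappa, so stop at the first failure.
        if bayesian_fdr(xi, kappa) <= alpha:
            best = kappa
        else:
            break
    return best

# Hypothetical local FDR values for four genes.
xi = [0.01, 0.02, 0.2, 0.9]
kappa = largest_threshold(xi, alpha=0.05)
declared = [g for g, x in enumerate(xi) if kappa is not None and x <= kappa]
```

In this toy input the first two genes are declared DE, since adding the third would raise the average local FDR above 5%.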

12 Biomarker clustering for meta-patterns of homogeneous and heterogeneous differential signals (Problem 2)
Denote by $U_{gs}$ the posterior probability vector for $Y_{gs}$:
$U_{gs} = \big(\Pr(Y_{gs} = 1 \mid Z),\; \Pr(Y_{gs} = -1 \mid Z),\; \Pr(Y_{gs} = 0 \mid Z)\big)$.
We calculate the dissimilarity between $U_{is}$ and $U_{js}$ in study $s$ and then average over the study index $s$.
We then apply tight clustering (Tseng and Wong) to the gene-gene dissimilarity matrix to obtain stable modules.
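A gene-gene dissimilarity matrix of this kind might be built as in the sketch below. The slide does not state which dissimilarity is used between $U_{is}$ and $U_{js}$; the symmetrized Kullback-Leibler divergence here is an assumption chosen for illustration, as are the smoothing constant `eps` and all function names. The tight clustering step itself is not shown.

```python
import math

def smooth(u, eps=1e-8):
    """Add a small constant and renormalize so that logarithms stay finite."""
    total = sum(u) + eps * len(u)
    return [(x + eps) / total for x in u]

def sym_kl(u, v):
    """Symmetrized KL divergence between two probability vectors (assumed measure)."""
    p, q = smooth(u), smooth(v)
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def gene_dissimilarity(U_i, U_j):
    """Average the per-study dissimilarity over the S studies."""
    return sum(sym_kl(u, v) for u, v in zip(U_i, U_j)) / len(U_i)

def dissimilarity_matrix(U):
    """U[g][s] is the 3-vector (Pr(Y=1|Z), Pr(Y=-1|Z), Pr(Y=0|Z)) for gene g, study s."""
    G = len(U)
    return [[gene_dissimilarity(U[i], U[j]) for j in range(G)] for i in range(G)]

# Hypothetical posterior vectors for 3 genes in 2 studies.
U = [
    [(0.90, 0.05, 0.05), (0.80, 0.10, 0.10)],  # up-regulated in both studies
    [(0.85, 0.05, 0.10), (0.75, 0.10, 0.15)],  # similar to gene 0
    [(0.05, 0.90, 0.05), (0.10, 0.10, 0.80)],  # down in study 1, null in study 2
]
D = dissimilarity_matrix(U)
```

Genes with similar posterior DE patterns across studies, such as genes 0 and 1 here, receive a small dissimilarity and would tend to fall into the same module.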

13 Simulation (FDR)
Table: Comparison of different methods by FDR for decision spaces $D_{\bar{A}}$, $D_B$, and $D_r$ ($r = S/2 + 1$). The nominal FDR is 5% for all compared methods. The mean results and SD (in parentheses) were calculated based on 100 simulations. Only the SD entries survive in this copy; the $S$ and $\sigma$ row labels and the mean values are missing, and rows are listed in the order they appear.

       D_Ā                 D_B                           D_r
S  σ   BayesMP   maxp      BayesMP   Fisher    AW        BayesMP   rop
       (0.008)   (0.013)   (0.006)   (0.005)   (0.004)   (0.005)   (0.008)
       (0.012)   (0.016)   (0.008)   (0.006)   (0.006)   (0.006)   (0.010)
       (0.018)   (0.021)   (0.010)   (0.008)   (0.009)   (0.008)   (0.015)
       (0.009)   (0.017)   (0.005)   (0.004)   (0.004)   (0.005)   (0.008)
       (0.016)   (0.023)   (0.006)   (0.005)   (0.005)   (0.007)   (0.008)
       (0.032)   (0.035)   (0.008)   (0.008)   (0.008)   (0.008)   (0.013)
       (0.019)   (0.023)   (0.004)   (0.004)   (0.004)   (0.005)   (0.010)
       (0.029)   (0.027)   (0.006)   (0.005)   (0.005)   (0.009)   (0.012)
       (0.063)   (0.038)   (0.007)   (0.006)   (0.006)   (0.009)   (0.014)

14 Simulation (AUC)
Table: Comparison of different methods by AUC of the ROC curve for decision spaces $D_{\bar{A}}$, $D_B$, and $D_r$ ($r = S/2 + 1$). The nominal FDR is 5% for all compared methods. The mean results and SD (in parentheses) were calculated based on 100 simulations. Only the SD entries survive in this copy; the $S$ and $\sigma$ row labels and the mean values are missing, and rows are listed in the order they appear.

       D_Ā                 D_B                           D_r
S  σ   BayesMP   maxp      BayesMP   Fisher    AW        BayesMP   rop
       (0.003)   (0.003)   (0.002)   (0.002)   (0.002)   (0.002)   (0.003)
       (0.006)   (0.007)   (0.004)   (0.004)   (0.004)   (0.004)   (0.005)
       (0.008)   (0.008)   (0.005)   (0.005)   (0.005)   (0.006)   (0.006)
       (0.004)   (0.003)   (0.002)   (0.002)   (0.002)   (0.002)   (0.002)
       (0.007)   (0.006)   (0.004)   (0.004)   (0.004)   (0.004)   (0.005)
       (0.009)   (0.009)   (0.005)   (0.005)   (0.005)   (0.005)   (0.006)
       (0.007)   (0.003)   (0.001)   (0.002)   (0.001)   (0.002)   (0.002)
       (0.011)   (0.006)   (0.003)   (0.003)   (0.003)   (0.004)   (0.004)
       (0.013)   (0.010)   (0.004)   (0.004)   (0.005)   (0.005)   (0.006)

15 Mouse Metabolism data
Table: Sample size description

Study       Wild type   VLCAD-deficient
Brown fat   4           4
Heart       3           4
Liver       4           4

Metabolism disorder in children.
Two genotypes of the mouse model, wild type (VLCAD +/+) and VLCAD-deficient (VLCAD -/-), were studied.
The total number of genes from these three transcriptomic studies is 14,495.
Under $D_B$ at FDR 5%, we declared 1,701 genes.
Under $D_{\bar{A}}$ at FDR 5%, we declared 133 genes.


16 Mouse Metabolism data meta-pattern
Figure: meta-pattern modules I to VI across brown fat, heart, and liver. Panels: (a) heatmap, (b) CS, (c) bar plot over Brown+, Heart+, Liver+, Brown-, Heart-, Liver-. Module sizes, in the order shown: n = 277, 195, 194, 140, 276, 110.

17 Mouse Metabolism data: pathway enrichment analysis
Table: module information

Module     Target pathway                      q value
module 1   KEGG LYSOSOME                       q =
module 2   BIOCARTA AHSP PATHWAY               q =
module 3   DEFENSE RESPONSE                    q =
module 4   BIOCARTA MCM PATHWAY                q =
module 5   none
module 6   FC GAMMA R MEDIATED PHAGOCYTOSIS    q =

18 Mouse Metabolism data: $D_{\bar{A}}$ at FDR 5%, 133 genes
Figure: Heatmaps of the 133 DE genes detected under $D_{\bar{A}}$ (at an FDR level of 5%) in the mouse metabolism dataset. Panels: (a) Brown, (b) Heart, (c) Liver.

19 Summary
Novelty:
1. The p-value based method can combine data from different microarray and RNA-seq platforms.
2. The Bayesian framework provides a complementary decision-making space.
3. The non-parametric Bayesian framework makes the method robust against distributional assumptions.
4. Meta-patterns help characterize heterogeneity among studies of the same disease with different phenotypes.
Performance:
1. Better performance than current meta-analysis hypothesis testing methods (AUC, FDR, etc.).
2. Computing is fast because of conjugacy.
3. Implemented in C++ and publicly available on GitHub.
