Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018

Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals (BayesMP)

Zhiguang Huo 1, Chi Song 2, George Tseng 3
1 Department of Biostatistics, University of Florida
2 Department of Biostatistics, the Ohio State University
3 Department of Biostatistics, University of Pittsburgh
July 30, 2018

Background for data integration (Tseng et al. 2012)
- Horizontal meta-analysis: the same type of genomic data from multiple patient cohorts.
- Vertical integrative analysis: multiple types of genomic data from the same patient cohort.
- BayesMP: horizontal meta-analysis for differential expression (DE) analysis.

Background for meta-analysis of DE analysis
According to Tseng et al. (2012), there are four major categories of transcriptomic meta-analysis:
- Combine effect sizes: fixed effects models, random effects models.
- Combine p-values:
  p-value aggregation methods: Fisher (Fisher, 1925), Stouffer (Stouffer, 1949);
  order statistics: minP (Tippett, 1931), maxP (Wilkinson, 1951), rOP (Song, 2014).
- Combine ranks: rankSum, rankProd (Hong et al., 2006).
- Direct merge.

Combine p-values
Combining p-values is simple, powerful, and independent of batch effects.

Table: p-value combining method. E.g., for gene 1, combine p_11, p_12, ..., p_1S.
Genes | Study 1 | Study 2 | ... | Study S
1     | p_11    | p_12    | ... | p_1S
2     | p_21    | p_22    | ... | p_2S
3     | p_31    | p_32    | ... | p_3S
...   | ...     | ...     | ... | ...
G     | p_G1    | p_G2    | ... | p_GS

Genomic meta-analysis:
- Apply the p-value combining method gene by gene.
- Adjust for multiple comparisons.
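The gene-wise combination step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code behind the slides. It assumes the S p-values of a gene are independent and uniform under the null, and (for Stouffer) that they are one-sided in the same direction. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square variable; closed form valid for even df only."""
    k = df // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def fisher(pvals):
    """Fisher: -2 * sum(log p) follows chi-square with 2S degrees of freedom."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return chi2_sf_even_df(stat, 2 * len(pvals))

def stouffer(pvals):
    """Stouffer: sum of z-scores divided by sqrt(S) is standard normal."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)

def minp(pvals):
    """minP: the smallest of S uniform p-values follows Beta(1, S)."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)

def maxp(pvals):
    """maxP: the largest of S uniform p-values follows Beta(S, 1)."""
    return max(pvals) ** len(pvals)

# One gene (one row of the table) observed in S = 3 studies.
row = [0.01, 0.02, 0.03]
combined = {f.__name__: f(row) for f in (fisher, stouffer, minp, maxp)}
```

The combined p-values would then be adjusted across the G genes, for example with the Benjamini-Hochberg procedure.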

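Motivation 1: Hypothesis testing settings
$\theta_s$ is the effect size of study $s$, $1 \le s \le S$.
- HS_B targets biomarkers that are DE in one or more studies (Fisher, minP):
  $H_0: \vec{\theta} \in \bigcap_{s=1}^{S} \{\theta_s = 0\}$ vs $H_A: \vec{\theta} \in \bigcup_{s=1}^{S} \{\theta_s \neq 0\}$
- HS_A targets biomarkers that are DE in all studies (maxP):
  $H_0: \vec{\theta} \in \bigcap_{s=1}^{S} \{\theta_s = 0\}$ vs $H_A: \vec{\theta} \in \bigcap_{s=1}^{S} \{\theta_s \neq 0\}$
- HS_r targets biomarkers that are DE in r or more studies (rOP):
  $H_0: \vec{\theta} \in \bigcap_{s=1}^{S} \{\theta_s = 0\}$ vs $H_A: \sum_{s=1}^{S} I\{\theta_s \neq 0\} \ge r$
Problem: under HS_A and HS_r, the null and alternative hypotheses are not complementary. For example, with r >= 2, a gene that is DE in exactly one study belongs to neither.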

Motivation 2: differential expression from multiple tissues
[Figure: heatmap of gene modules I to VI across three tissues (brown fat, heart, liver).]
Phenotypes: black, wild type; red, VLCAD-deficient.
Differential expression patterns:
- Homogeneous differential expression pattern (Modules I, II).
- Study-specific differential expression pattern (Modules III, IV, V, VI).
How can we categorize meta-analysis differential expression patterns (meta-patterns)?

Z statistics and their distribution
[Figure: Z statistics distribution in one study. Black line: null component. Red line: positive DE component. Blue line: negative DE component.]
- $p_{gs}$ is the one-sided p-value for gene $g$ and study $s$.
- $Z_{gs} = \Phi^{-1}(p_{gs})$, where $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function (CDF) of the standard Gaussian distribution.
- Null component: assume a standard Gaussian distribution or an empirical null (Efron, 2004).
- Alternative component: Dirichlet process.
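The transformation from one-sided p-values to Z statistics might look like the sketch below. The clipping constant `eps` is an assumption added here so that p-values of exactly 0 or 1 do not map to infinite Z; it is not part of the slides. Which tail corresponds to up-regulation depends on how the one-sided test was set up.

```python
from statistics import NormalDist

def p_to_z(p, eps=1e-12):
    """Z = Phi^{-1}(p) for a one-sided p-value p.

    eps is a hypothetical guard: p is clipped to [eps, 1 - eps]
    before applying the inverse standard Gaussian CDF.
    """
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)

# Under the null, p is uniform on (0, 1), so Z is standard Gaussian.
z_values = [p_to_z(p) for p in (0.001, 0.5, 0.999)]
```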

Multiple studies
[Figure: Z statistics distribution in three studies. (a) Study 1, (b) Study 2, (c) Study 3.]
$Y_{gs} \in \{-1, 0, 1\}$ is the DE indicator:
$$f^{(s)}(Z_{gs} \mid Y_{gs}) = f_0^{(s)}(Z_{gs})\, I(Y_{gs} = 0) + f_{+1}^{(s)}(Z_{gs})\, I(Y_{gs} = 1) + f_{-1}^{(s)}(Z_{gs})\, I(Y_{gs} = -1).$$
Prior:
$$Y_{gs} \sim \mathrm{Mult}\big(1, (1 - \pi_g, \pi_g^{+}, \pi_g^{-})\big) \cdot (0, 1, -1)^{\top},$$
where $\pi_g^{+} = \pi_g \delta_g$ and $\pi_g^{-} = \pi_g (1 - \delta_g)$.
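To make the generative reading of the prior concrete, here is a small simulation sketch. Drawing $Y_{gs}$ follows the multinomial prior above. For $Z_{gs}$, the Dirichlet process alternatives are replaced by a plain Gaussian shift `mu`, which is purely an illustrative assumption; the function names are hypothetical as well.

```python
import random

def sample_y(pi_g, delta_g, rng):
    """Draw Y_gs in {-1, 0, 1} with probabilities
    (pi_g * (1 - delta_g), 1 - pi_g, pi_g * delta_g)."""
    u = rng.random()
    if u < 1.0 - pi_g:
        return 0                                  # null
    if u < 1.0 - pi_g + pi_g * delta_g:
        return 1                                  # positive DE
    return -1                                     # negative DE

def sample_z(y, rng, mu=3.0):
    """Stand-in for f_y^{(s)}: N(0, 1) when y = 0, N(+/- mu, 1) otherwise.
    The slides use Dirichlet process alternatives; mu is an assumption."""
    return rng.gauss(mu * y, 1.0)

rng = random.Random(0)
ys = [sample_y(0.4, 0.75, rng) for _ in range(20000)]
zs = [sample_z(y, rng) for y in ys]
```

With $\pi_g = 0.4$ and $\delta_g = 0.75$, roughly 60% of the draws are null, 30% positive DE, and 10% negative DE.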

Graphical Model
[Figure: Graphical representation of the Bayesian latent hierarchical model. Nodes: $G_{0+}$, $G_{0-}$, $\gamma$, $\beta$, $\alpha$, $\pi_g$, $\delta_g$, $G_{s+}$, $G_{s-}$, $Y_{gs}$, $f_{k+}^{(s)}$, $f_{k-}^{(s)}$, $f_0^{(s)}$, $Z_{gs}$.]
Shaded nodes are observed variables. Dashed nodes are pre-estimated or fixed parameters. Arrows represent the generative process. Dashed lines connect equivalent variables. $s$ is the study index and $g$ is the gene index.

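Bayesian computing
1. Update the $\pi_g$'s:
   $$\pi_g \mid Y_{gs} \sim \mathrm{Beta}\big(\gamma/(G - \gamma) + Y_g^{+} + Y_g^{-},\; S - Y_g^{+} - Y_g^{-} + 1\big),$$
   where $Y_g^{+} = \sum_s I(Y_{gs} = 1)$ and $Y_g^{-} = \sum_s I(Y_{gs} = -1)$.
2. Update the $\delta_g$'s:
   $$\delta_g \mid Y_{gs} \sim \mathrm{Beta}(\beta + Y_g^{+},\; \beta + Y_g^{-}).$$
3. Update the $Y_{gs}$'s. First update the component labels $C_{gs}$ such that
   $$\Pr(C_{gs} = k \mid C_{-g,s}, Z_{gs}, \pi_g^{\pm}) \propto h_k^{(s)}(Z_{gs} \mid C_{-g,s})\, (\pi_g^{+})^{I(k>0)}\, (\pi_g^{-})^{I(k<0)}\, (1 - \pi_g)^{I(k=0)},$$
   then set $Y_{gs} = \mathrm{sgn}(C_{gs})$.
Conjugacy makes the Bayesian computing very fast.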
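Steps 1 and 2 are plain Beta draws, which is where the conjugacy pays off. A minimal sketch of those two updates for a single gene is given below; step 3 (the Dirichlet process label update) is omitted. The function name and the hyperparameter values in the usage lines are hypothetical, chosen only for illustration.

```python
import random

def update_pi_delta(y_g, gamma, beta, G, rng):
    """One Gibbs draw of (pi_g, delta_g) given the DE indicators of gene g.

    y_g   : list of S indicators in {-1, 0, 1}, one per study
    gamma : hyperparameter of the pi_g prior (requires 0 < gamma < G)
    beta  : hyperparameter of the symmetric delta_g prior
    G     : total number of genes
    """
    S = len(y_g)
    y_plus = sum(1 for y in y_g if y == 1)     # Y_g^+
    y_minus = sum(1 for y in y_g if y == -1)   # Y_g^-
    pi_g = rng.betavariate(gamma / (G - gamma) + y_plus + y_minus,
                           S - y_plus - y_minus + 1)
    delta_g = rng.betavariate(beta + y_plus, beta + y_minus)
    return pi_g, delta_g

# Illustrative values only: S = 5 studies, G = 1000 genes.
rng = random.Random(0)
draws = [update_pi_delta([1, 1, 1, 0, -1], gamma=100.0, beta=0.5, G=1000, rng=rng)
         for _ in range(20000)]
mean_pi = sum(d[0] for d in draws) / len(draws)
mean_delta = sum(d[1] for d in draws) / len(draws)
```

For this gene the posterior means are about 0.67 for $\pi_g$ and 0.70 for $\delta_g$: DE in four of five studies, mostly in the positive direction.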

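Decision making framework (Problem 1)
For meta-analysis purposes, we declare as differentially expressed the genes that fall in one of the following spaces:
- $\Omega_{\bar{A}}$: $\Omega^{1}_{\bar{A}} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) = S\}$.
- $\Omega_{B}$: $\Omega^{1}_{B} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) \ge 1\}$.
- $\Omega_{r}$: $\Omega^{1}_{r} = \{\vec{\theta}_g : \sum_{s=1}^{S} I(\theta_{gs} \neq 0) \ge r\}$.
Efron (2001) proposed the local FDR, shown here for $\Omega_{\bar{A}}$:
$$\xi_g = \Pr(\vec{\theta}_g \in \Omega^{0}_{\bar{A}} \mid Z) = 1 - \Pr(\vec{\theta}_g \in \Omega^{1}_{\bar{A}} \mid Z).$$
Given a threshold $\kappa$, we declare gene $g$ a DE gene if $\xi_g \le \kappa$; the expected number of false discoveries is $\sum_g \xi_g\, I(\xi_g \le \kappa)$.
The Bayesian false discovery rate (FDR) (Newton, 2004) is defined as
$$\frac{\sum_g \xi_g\, I(\xi_g \le \kappa)}{\sum_g I(\xi_g \le \kappa)}.$$
We compare the Bayesian FDR of our approach with the FDR from the frequentist perspective (Benjamini-Hochberg).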
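The decision rule can be sketched in a few lines. The sketch assumes posterior draws of the indicators $Y_{gs}$ are available (for example from the Gibbs sampler) and estimates $\xi_g$ by the fraction of draws in which fewer than r studies are non-null; r = S corresponds to $\Omega_{\bar{A}}$ and r = 1 to $\Omega_B$. Function names are hypothetical.

```python
def local_fdr(y_draws, r):
    """xi_g = 1 - Pr(at least r studies non-null | Z), estimated from
    posterior draws; y_draws is a list of length-S indicator lists."""
    hits = sum(1 for y in y_draws if sum(1 for v in y if v != 0) >= r)
    return 1.0 - hits / len(y_draws)

def bayesian_fdr(xi, kappa):
    """Average local FDR among genes declared DE at threshold kappa."""
    selected = [x for x in xi if x <= kappa]
    return sum(selected) / len(selected) if selected else 0.0

def select_genes(xi, alpha=0.05):
    """Largest set of genes (by increasing xi) whose Bayesian FDR is <= alpha."""
    order = sorted(range(len(xi)), key=lambda g: xi[g])
    chosen, running = [], 0.0
    for rank, g in enumerate(order, start=1):
        running += xi[g]
        if running / rank > alpha:   # running mean is non-decreasing, so stop
            break
        chosen.append(g)
    return chosen

# Four posterior draws for one gene observed in S = 3 studies.
draws = [[1, 0, -1], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
xi_A, xi_B = local_fdr(draws, r=3), local_fdr(draws, r=1)

xi = [0.01, 0.5, 0.02, 0.9, 0.08]
declared = select_genes(xi, alpha=0.05)
```

Because the running mean of the sorted $\xi_g$ never decreases, stopping at the first violation yields the largest admissible set.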

Biomarker clustering for meta-patterns of homogeneous and heterogeneous differential signals (Problem 2)
Denote by $U_{gs}$ the posterior probability vector for $Y_{gs}$:
$$U_{gs} = \big(\Pr(Y_{gs} = 1 \mid Z),\; \Pr(Y_{gs} = -1 \mid Z),\; \Pr(Y_{gs} = 0 \mid Z)\big).$$
- For each pair of genes $i$ and $j$, calculate the dissimilarity between $U_{is}$ and $U_{js}$ in study $s$, then average over the study index $s$.
- Apply tight clustering (Tseng and Wong) to the gene-gene dissimilarity matrix to obtain stable modules.
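The slide does not name the dissimilarity measure. The sketch below uses a symmetrized Kullback-Leibler divergence between the two probability vectors as one possible choice; that choice, the smoothing constant `eps`, and the function names are all assumptions. The tight clustering step itself is not shown.

```python
import math

def sym_kl(u, v, eps=1e-10):
    """Symmetrized KL divergence: KL(u||v) + KL(v||u) = sum (a - b) * log(a / b).
    eps is a hypothetical floor that keeps the logarithm finite."""
    total = 0.0
    for a, b in zip(u, v):
        a, b = max(a, eps), max(b, eps)
        total += (a - b) * math.log(a / b)
    return total

def gene_dissimilarity(U_i, U_j):
    """Average the per-study dissimilarity over the S studies.
    U_i, U_j: lists of S posterior vectors (Pr(Y=1), Pr(Y=-1), Pr(Y=0))."""
    return sum(sym_kl(u, v) for u, v in zip(U_i, U_j)) / len(U_i)

# Two genes in S = 2 studies: same pattern in study 1, opposite in study 2.
U_1 = [[0.8, 0.1, 0.1], [0.5, 0.25, 0.25]]
U_2 = [[0.8, 0.1, 0.1], [0.25, 0.5, 0.25]]
d_12 = gene_dissimilarity(U_1, U_2)
```

Computing `gene_dissimilarity` for every pair of genes gives the matrix that the clustering step takes as input.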

Simulation (FDR)
Table: Comparison of different methods by FDR for decision spaces D_Ā, D_B, and D_r (r = S/2 + 1). The nominal FDR is 5% for all compared methods. Entries are means with SD in parentheses, calculated from 100 simulations.

S  | σ | D_Ā: BayesMP  | D_Ā: maxP     | D_B: BayesMP  | D_B: Fisher   | D_B: AW       | D_r: BayesMP  | D_r: rOP
3  | 1 | 0.054 (0.008) | 0.207 (0.013) | 0.050 (0.006) | 0.035 (0.005) | 0.035 (0.004) | 0.035 (0.005) | 0.087 (0.008)
3  | 2 | 0.052 (0.012) | 0.199 (0.016) | 0.054 (0.008) | 0.035 (0.006) | 0.035 (0.006) | 0.036 (0.006) | 0.080 (0.010)
3  | 3 | 0.036 (0.018) | 0.183 (0.021) | 0.050 (0.010) | 0.034 (0.008) | 0.035 (0.009) | 0.031 (0.008) | 0.071 (0.015)
5  | 1 | 0.069 (0.009) | 0.358 (0.017) | 0.053 (0.005) | 0.035 (0.004) | 0.034 (0.004) | 0.038 (0.005) | 0.129 (0.008)
5  | 2 | 0.073 (0.016) | 0.348 (0.023) | 0.055 (0.006) | 0.035 (0.005) | 0.034 (0.005) | 0.041 (0.007) | 0.113 (0.008)
5  | 3 | 0.054 (0.032) | 0.332 (0.035) | 0.049 (0.008) | 0.035 (0.008) | 0.034 (0.008) | 0.036 (0.008) | 0.098 (0.013)
10 | 1 | 0.096 (0.019) | 0.583 (0.023) | 0.061 (0.004) | 0.035 (0.004) | 0.036 (0.004) | 0.049 (0.005) | 0.228 (0.010)
10 | 2 | 0.108 (0.029) | 0.569 (0.027) | 0.061 (0.006) | 0.035 (0.005) | 0.035 (0.005) | 0.058 (0.009) | 0.197 (0.012)
10 | 3 | 0.083 (0.063) | 0.553 (0.038) | 0.053 (0.007) | 0.036 (0.006) | 0.036 (0.006) | 0.056 (0.009) | 0.163 (0.014)

Simulation (AUC)
Table: Comparison of different methods by AUC of the ROC curve for decision spaces D_Ā, D_B, and D_r (r = S/2 + 1). The nominal FDR is 5% for all compared methods. Entries are means with SD in parentheses, calculated from 100 simulations.

S  | σ | D_Ā: BayesMP  | D_Ā: maxP     | D_B: BayesMP  | D_B: Fisher   | D_B: AW       | D_r: BayesMP  | D_r: rOP
3  | 1 | 0.977 (0.003) | 0.926 (0.003) | 0.973 (0.002) | 0.973 (0.002) | 0.973 (0.002) | 0.980 (0.002) | 0.972 (0.003)
3  | 2 | 0.907 (0.006) | 0.875 (0.007) | 0.879 (0.004) | 0.877 (0.004) | 0.875 (0.004) | 0.902 (0.004) | 0.873 (0.005)
3  | 3 | 0.831 (0.008) | 0.805 (0.008) | 0.787 (0.005) | 0.783 (0.005) | 0.779 (0.005) | 0.819 (0.006) | 0.776 (0.006)
5  | 1 | 0.974 (0.004) | 0.920 (0.003) | 0.979 (0.002) | 0.978 (0.002) | 0.979 (0.002) | 0.986 (0.002) | 0.979 (0.002)
5  | 2 | 0.920 (0.007) | 0.890 (0.006) | 0.897 (0.004) | 0.894 (0.004) | 0.892 (0.004) | 0.929 (0.004) | 0.893 (0.005)
5  | 3 | 0.864 (0.009) | 0.833 (0.009) | 0.812 (0.005) | 0.806 (0.005) | 0.801 (0.005) | 0.859 (0.005) | 0.802 (0.006)
10 | 1 | 0.964 (0.007) | 0.910 (0.003) | 0.985 (0.001) | 0.983 (0.002) | 0.985 (0.001) | 0.986 (0.002) | 0.986 (0.002)
10 | 2 | 0.910 (0.011) | 0.905 (0.006) | 0.920 (0.003) | 0.917 (0.003) | 0.917 (0.003) | 0.950 (0.004) | 0.919 (0.004)
10 | 3 | 0.875 (0.013) | 0.863 (0.010) | 0.848 (0.004) | 0.839 (0.004) | 0.834 (0.005) | 0.904 (0.005) | 0.838 (0.006)

Mouse Metabolism data
Table: Sample size description
Study     | Wild type | VLCAD-deficient
Brown fat | 4         | 4
Heart     | 3         | 4
Liver     | 4         | 4

- Context: metabolism disorder in children.
- Two genotypes of the mouse model were studied: wild type (VLCAD +/+) and VLCAD-deficient (VLCAD -/-).
- The total number of genes from these three transcriptomic studies is 14,495.
- Under D_B at FDR 5%, we declared 1,701 genes.
- Under D_Ā at FDR 5%, we declared 133 genes.

Mouse Metabolism data: meta-patterns
[Figure: meta-pattern modules I to VI across brown fat, heart, and liver. Panels: (a) heatmap, (b) CS, (c) bar plot over Brown+, Heart+, Liver+, Brown-, Heart-, Liver-, with axis range 0.0 to 0.8. Module sizes as they appear in the panels: n = 277, 195, 194, 140, 276, 110.]

Mouse Metabolism data: pathway enrichment analysis
Table: module information
Module   | Target pathway                   | q value
module 1 | KEGG LYSOSOME                    | 2.8 × 10^-4
module 2 | BIOCARTA AHSP PATHWAY            | 0.017
module 3 | DEFENSE RESPONSE                 | 4.2 × 10^-8
module 4 | BIOCARTA MCM PATHWAY             | 3.9 × 10^-3
module 5 | none                             |
module 6 | FC GAMMA R MEDIATED PHAGOCYTOSIS | 0.067

Mouse Metabolism data: D_Ā at FDR 5%, 133 genes
[Figure: Heatmaps of the 133 DE genes detected under D_Ā (at an FDR level of 5%) in the mouse metabolism dataset. (a) Brown fat, (b) Heart, (c) Liver.]

Summary
Novelty:
1. The p-value based method is capable of combining data from different microarray and RNA-seq platforms.
2. The Bayesian framework provides complementary decision spaces.
3. The non-parametric Bayesian framework makes the method robust against distributional assumptions.
4. Meta-patterns help characterize heterogeneity among studies of the same disease but with different phenotypes.
Performance:
1. Better performance than current meta-analysis hypothesis testing methods (AUC, FDR, etc.).
2. Computing is fast because of conjugacy.
3. Implemented in C++ and publicly available on GitHub.