Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models

David B. Dahl, Department of Statistics, Texas A&M University
With Marina Vannucci, Michael Newton, & Qianxing Mo
3rd Lehmann Symposium, Rice University, May 16, 2007

References:
D. B. Dahl, M. A. Newton (200?), "Multiple Hypothesis Testing by Clustering Treatment Effects of Correlated Objects," Journal of the American Statistical Association, accepted.
D. B. Dahl (2006), "Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model," in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.
D. B. Dahl, Q. Mo, M. Vannucci (200?), "Simultaneous Inference for Multiple Testing and Clustering via a Dirichlet Process Mixture Model," Statistical Modelling: An International Journal, accepted.

Outline
1. Motivation
2. Z-scores Demonstration
3. BEMMA for Differential Expression (Dahl, Newton 200?)
4. BEMMA for Clustering (Dahl 200?)
5. SIMTAC for DE and Clustering (Dahl, Mo, Vannucci 200?)

1. Motivation

Two Statistical Tasks

Multiple hypothesis testing:
- Goal: detect a shift in the marginal distribution of gene expression.
- Statistical dependence among genes is a nuisance parameter.

Clustering:
- Goal: group genes that are highly correlated.
- Correlation may reflect underlying biological factors of interest.

We propose a hybrid methodology. Main idea: simultaneously infer the clustering and test for differential expression.

Other work: Storey (2007), Yuan & Kendziorski (2006), Tibshirani & Wasserman (2006).

[Figure: expression of Gene 1 across treatments 1 through 10.]

[Figure: expression of Gene 2 across treatments 1 through 10.]

[Figure: both genes shown together across treatments 1 through 10.]

2. Z-scores Demonstration

Elementary Setup

Parameters $\theta_1, \dots, \theta_n$ for $n$ observations.
Hypotheses: $H_{0i}: \theta_i = 0$ vs. $H_{ai}: \theta_i > 0$.
Test statistics $Z_1, \dots, Z_n$ are independent with $Z_i \sim N(\theta_i, 1)$.

Method I

Standard Z-test in which $H_{0i}$ is rejected if $Z_i > z$, where $z$ is chosen to achieve the desired size. The test has power
$$1 - \Phi(z - \theta_i),$$
where $\Phi(x)$ is the standard normal distribution function evaluated at $x$.

Method II

Assumes a known clustering: $c_{ij} = I\{\theta_i = \theta_j\}$. Test statistic:
$$S_i = Z_i + \sum_{j \ne i} c_{ij} Z_j.$$
The test has power
$$1 - \Phi\!\left(z - \sqrt{n^{(i)}}\,\theta_i\right), \qquad \text{where } n^{(i)} = \sum_{j=1}^{n} c_{ij}.$$
Method II is never less powerful than Method I.

Method III

Clustering indicators $c_{ij}$ are estimated:
$$\hat{c}_{ij} = \begin{cases} c_{ij} & \text{with probability } 1 - \gamma \\ 1 - c_{ij} & \text{with probability } \gamma, \end{cases}$$
where $\gamma$ is the error rate of clustering. Take Method II, but replace $c_{ij}$ with $\hat{c}_{ij}$ to form $\hat{S}_i$. Under an assumption about the distribution of $\theta_1, \dots, \theta_n$, the test has power
$$1 - \Phi(z - k\,\theta_i),$$
where $k$ is a constant involving $\gamma$, $n^{(i)}$, etc.

[Figure: Power(Method III) / Power(Method I) versus the clustering error probability γ, shown for n^(i) = 10 and n^(i) = 30.]
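The comparison in the figure is easy to reproduce numerically. Below is a minimal Python sketch (not from the talk) that evaluates the closed-form powers of Methods I and II and simulates Method III with noisy cluster indicators; the total number of genes, the effect size, and the assumption that genes outside the cluster are null are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

z = norm.ppf(0.95)        # critical value for a size-0.05 one-sided test
theta = 1.0               # effect size for the gene under test (assumed)
n_i = 10                  # true cluster size n^(i)
gamma = 0.1               # clustering error rate
n = 200                   # total number of genes (assumed for illustration)

# Method I: power = 1 - Phi(z - theta)
power1 = 1 - norm.cdf(z - theta)

# Method II: sum the n^(i) z-scores in the (known) cluster, standardize;
# power = 1 - Phi(z - sqrt(n^(i)) * theta)
power2 = 1 - norm.cdf(z - np.sqrt(n_i) * theta)

# Method III: Monte Carlo with indicators c-hat flipped w.p. gamma.
# Assumption (not in the talk): genes outside the cluster are null.
reps = 20000
hits = 0
for _ in range(reps):
    c = np.zeros(n, dtype=bool)
    c[:n_i] = True                      # true cluster = first n^(i) genes
    theta_all = np.where(c, theta, 0.0)
    zscores = rng.normal(theta_all, 1.0)
    flip = rng.random(n) < gamma
    c_hat = np.where(flip, ~c, c)
    c_hat[0] = True                     # gene i always includes itself
    s = zscores[c_hat].sum() / np.sqrt(c_hat.sum())
    hits += s > z
power3 = hits / reps

print(f"Method I power:   {power1:.3f}")
print(f"Method II power:  {power2:.3f}")
print(f"Method III power: {power3:.3f}  (simulated)")
```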

3. BEMMA for Differential Expression (Dahl, Newton 200?)

BEMMA for Differential Expression (Dahl, Newton 200?)

Bayesian Effects Model for Microarrays (BEMMA):
- Conjugate Dirichlet process mixture (DPM) model.
- Identifies differentially expressed genes by borrowing strength from genes likely to have the same parameters.
- Averages over clustering uncertainty.

Sampling model:
$$y_{gtr} \mid \mu_g, \tau_{gt}, \lambda_g \sim N(\mu_g + \tau_{gt}, \lambda_g),$$
where $r$ indexes replicates, $t$ treatments, and $g$ genes.

Genes $g$ and $g'$ come from the same cluster if and only if
$$(\tau_{g1}, \dots, \tau_{gT}, \lambda_g) = (\tau_{g'1}, \dots, \tau_{g'T}, \lambda_{g'}).$$

Nuisance Parameters

Gene-specific means $\mu_1, \dots, \mu_G$ are not related to differential expression or clustering.

[Figure: example gene expression profile across treatments 1 through 10.]

Let $d_g$ be a vector whose elements are $y_{gtr} - \bar{y}_{g1}$ for $t \ge 2$.

Model

Sampling distribution:
$$d_g \mid \tau_g, \lambda_g \sim N_N(X\tau_g, \lambda_g M),$$
where $\tau_g = (\tau_{g2}, \dots, \tau_{gT})$, $M = \left(I + \tfrac{1}{R_1} J\right)^{-1}$, and $X$ is a design matrix picking off the appropriate element of $\tau_g$.

Clustering is based on $(\tau, \lambda)$ via a Dirichlet process prior:
$$(\tau_g, \lambda_g) \mid F \sim F, \qquad F \sim DP(\alpha F_0),$$
where $F_0$ is conjugate to the likelihood.
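To make the setup concrete, here is a small Python sketch of how $d_g$, the design matrix $X$, and the matrix $M$ can be formed for one gene; the dimensions ($T = 3$ treatments, $R = 4$ replicates each) and the precision reading of $\lambda_g M$ are assumptions for illustration.

```python
import numpy as np

T, R = 3, 4                      # treatments and replicates (assumed)
rng = np.random.default_rng(1)

# Raw expression for one gene: y[t, r], with t = 0 the reference treatment
y = rng.normal(10.0, 1.0, size=(T, R))

# d_g stacks y_gtr - mean(y_g1.) for t >= 2 (0-based: t = 1, 2)
d_g = (y[1:] - y[0].mean()).ravel()          # length N = (T-1) * R

# Design matrix X picks off the element of tau_g = (tau_g2, ..., tau_gT)
# corresponding to each entry of d_g
X = np.kron(np.eye(T - 1), np.ones((R, 1)))  # shape (N, T-1)

# M = (I + J / R_1)^{-1}, with J the all-ones matrix and R_1 = R replicates
# in the reference treatment; under a precision parameterization (an
# assumption here), lambda_g * M is the precision matrix of d_g
N = (T - 1) * R
M = np.linalg.inv(np.eye(N) + np.ones((N, N)) / R)

print(d_g.shape, X.shape, M.shape)
```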

Model Fitting

The $\tau$'s and $\lambda$'s may be integrated away, leaving only the clustering of the $G$ genes. Sample from the posterior clustering distribution using MCMC:
- Gibbs sampler of MacEachern (1994) and Neal (1992)
- Merge-split sampler of Jain & Neal (2004)
- Merge-split sampler of Dahl (2003)

After MCMC, it is easy to sample the $\tau$'s and $\lambda$'s given the clustering.
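For intuition about the Gibbs option, here is a heavily simplified Python sketch of one collapsed Gibbs scan in the spirit of MacEachern (1994) and Neal (1992): each gene is removed from its cluster and reassigned with probability proportional to cluster size times a marginal likelihood ratio. The function marg_lik is a hypothetical stand-in for the closed-form marginal likelihood that a conjugate $F_0$ provides; this is a sketch, not the talk's implementation.

```python
import numpy as np

def gibbs_scan(data, labels, alpha, marg_lik, rng):
    """One collapsed Gibbs scan over cluster labels for a conjugate DPM.

    data:     list of per-gene summary vectors (e.g., the d_g's)
    labels:   list of current cluster labels, one int per gene
    alpha:    DP concentration parameter
    marg_lik: marg_lik(items) -> marginal likelihood of a set of items
              under F_0 (hypothetical; conjugacy gives a closed form)
    """
    G = len(data)
    for g in range(G):
        labels[g] = -1                       # remove gene g from its cluster
        clusters = [c for c in set(labels) if c != -1]
        weights = []
        for c in clusters:
            members = [data[j] for j in range(G) if labels[j] == c]
            # predictive weight: n_c * p(d_g | members of cluster c)
            w = len(members) * marg_lik(members + [data[g]]) / marg_lik(members)
            weights.append(w)
        # weight for opening a new cluster: alpha * p(d_g alone)
        weights.append(alpha * marg_lik([data[g]]))
        probs = np.array(weights) / np.sum(weights)
        choice = rng.choice(len(probs), p=probs)
        labels[g] = clusters[choice] if choice < len(clusters) else max(labels) + 1
    return labels
```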

Inference on Differential Expression

Define a univariate parameter $q_g$ that encodes the hypothesis of interest. For example, the global F-test in the one-way ANOVA setting is analogous to
$$q_g = \sum_{t=2}^{T} \tau_{gt}^2.$$
Estimate $q_g$ under squared-error loss by computing its expectation with respect to $p(q_g \mid d_1, \dots, d_G)$. Rank genes for evidence of differential expression using the estimates $\hat{q}_1, \dots, \hat{q}_G$.
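Given posterior draws of the $\tau$'s (sampled after the MCMC clustering step), $\hat{q}_g$ is a Monte Carlo average. A minimal sketch, assuming the draws are stored as an array of shape (B, G, T-1):

```python
import numpy as np

def q_hat(tau_draws):
    """Posterior mean of q_g = sum_t tau_gt^2 from MCMC output.

    tau_draws: array of shape (B, G, T-1) holding B posterior draws of
               (tau_g2, ..., tau_gT) for each of G genes (assumed layout).
    Returns a length-G vector; larger values indicate stronger evidence
    of differential expression.
    """
    q_draws = np.sum(tau_draws ** 2, axis=2)   # shape (B, G): q_g per draw
    return q_draws.mean(axis=0)                # average over draws

# Ranking: genes sorted by decreasing q-hat
# order = np.argsort(-q_hat(tau_draws))
```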

Simulation Study

Some other methods for differential expression:
- EBarrays (Kendziorski, Newton, et al., 2003)
- LIMMA (Smyth, 2004)

Comparison based on the proportion of false discoveries.

Simulated datasets from a time-course experiment:
- Three time points, two treatment conditions.
- 300 of 1,200 genes are differentially expressed.
- Interest lies in genes that are differentially expressed at one or more time points.

Four levels of clustering:
- Heavy clustering: 12 clusters of 100 genes each.
- Moderate clustering: 60 clusters of 20 genes each.
- Weak clustering: 240 clusters of 5 genes each.
- No clustering.

[Figure: false discovery rate versus number of discoveries (0 to 200) under heavy clustering.]

[Figure: false discovery rate versus number of discoveries under moderate clustering.]

[Figure: false discovery rate versus number of discoveries under weak clustering.]

[Figure: false discovery rate versus number of discoveries under no clustering.]

Paraquat Experiment

- Old and young mice treated with a paraquat injection.
- Sacrificed at baseline or 1, 3, 5, or 7 hours after injection.
- Three replicates per treatment.
- 10,043 probe sets on Affymetrix MG-U74A arrays.
- Background correction and normalization using RMA (Irizarry et al., 2003).

Biologists are interested in genes whose expression between old and young is similar at baseline and very different at one hour.

[Figure: estimated treatment effects for probe set 92885_at, showing deviation from reference for young and old mice at baseline and 1, 3, 5, and 7 hours after injection.]

4. BEMMA for Clustering (Dahl 200?)

Inference on Clustering (Dahl 200?)

The MCMC sampling algorithm produces $B$ clusterings $\pi^{(1)}, \dots, \pi^{(B)}$ from the posterior clustering distribution. Point estimation methods:
- Maximum a posteriori (MAP) clustering
- Medvedovic & Sivaganesan (2002): hierarchical clustering using pairwise probabilities
- Dahl (2006): stochastic search to minimize the posterior expected loss of Binder (1978)
- Lau & Green (200?): heuristic to minimize the posterior expected loss of Binder (1978)

Least-squares clustering selects the observed clustering closest to the matrix of pairwise probabilities in terms of squared distances:
$$\pi_{LS} = \arg\min_{\pi \in \{\pi^{(1)}, \dots, \pi^{(B)}\}} \sum_{i=1}^{G} \sum_{j=1}^{G} \left(\delta_{i,j}(\pi) - \hat{p}_{i,j}\right)^2 \tag{1}$$
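A minimal sketch of the least-squares selection in equation (1), assuming the $B$ posterior clusterings are stored as integer label vectors; $\delta_{i,j}(\pi)$ is the co-clustering indicator and $\hat{p}_{i,j}$ the pairwise probability averaged over draws:

```python
import numpy as np

def least_squares_clustering(draws):
    """Select pi_LS from MCMC clustering draws, per equation (1).

    draws: array of shape (B, G); draws[b, i] is the cluster label of
           gene i in posterior draw b.
    """
    B, G = draws.shape
    # delta[b] is the G x G co-clustering indicator matrix for draw b
    delta = (draws[:, :, None] == draws[:, None, :]).astype(float)
    p_hat = delta.mean(axis=0)                 # pairwise probabilities
    losses = ((delta - p_hat) ** 2).sum(axis=(1, 2))
    return draws[np.argmin(losses)]            # draw minimizing the loss
```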

Heavy Clustering

Clustering Method                 Adjusted Rand Index (95% C.I.)
MCLUST                            0.413 (0.380, 0.447)
BEMMA (least-squares)             0.402 (0.373, 0.431)
BEMMA (MAP)                       0.390 (0.362, 0.419)
HCLUST (effects, average)         0.277 (0.247, 0.308)
HCLUST (effects, complete)        0.260 (0.242, 0.279)
HCLUST (correlation, complete)    0.162 (0.144, 0.180)
HCLUST (correlation, average)     0.156 (0.141, 0.172)

Table: Adjusted Rand index for BEMMA and other methods under heavy clustering. Larger values indicate better agreement between the estimated and true clustering.

Moderate Clustering

Clustering Method                 Adjusted Rand Index (95% C.I.)
BEMMA (least-squares)             0.154 (0.146, 0.163)
MCLUST                            0.144 (0.136, 0.152)
BEMMA (MAP)                       0.127 (0.119, 0.135)
HCLUST (effects, complete)        0.117 (0.111, 0.123)
HCLUST (effects, average)         0.101 (0.095, 0.107)
HCLUST (correlation, average)     0.079 (0.075, 0.083)
HCLUST (correlation, complete)    0.073 (0.068, 0.078)

Table: Adjusted Rand index for BEMMA and other methods under moderate clustering. Larger values indicate better agreement between the estimated and true clustering.

Weak Clustering

Clustering Method                 Adjusted Rand Index (95% C.I.)
MCLUST                            0.050 (0.048, 0.052)
HCLUST (effects, complete)        0.045 (0.043, 0.048)
BEMMA (least-squares)             0.042 (0.040, 0.043)
HCLUST (effects, average)         0.037 (0.035, 0.038)
BEMMA (MAP)                       0.031 (0.030, 0.033)
HCLUST (correlation, average)     0.029 (0.027, 0.030)
HCLUST (correlation, complete)    0.027 (0.025, 0.029)

Table: Adjusted Rand index for BEMMA and other methods under weak clustering. Larger values indicate better agreement between the estimated and true clustering.
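The adjusted Rand index reported in these tables is a standard measure of agreement between two partitions; for reference, here is how it can be computed with scikit-learn (a tooling assumption, not necessarily what was used for the tables):

```python
from sklearn.metrics import adjusted_rand_score

# Label vectors: true simulated clustering vs. an estimated clustering
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
est_labels  = [0, 0, 1, 1, 1, 2, 2, 0]

# 1.0 = perfect agreement; values near 0 = chance-level agreement
print(adjusted_rand_score(true_labels, est_labels))
```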

[Figure: posterior distribution of the number of clusters, ranging over roughly 80 to 120.]

[Figure: effects intensity plot of the cluster containing probe set 92885_at; genes grouped by inferred clusters, young and old mice at baseline and 1, 3, 5, and 7 hours.]

[Figure: effects intensity plot for all clusters, same layout.]

[Figure: effects intensity plot for the smallest clusters, same layout.]

5. SIMTAC for DE and Clustering (Dahl, Mo, Vannucci 200?)

SIMTAC (Dahl, Mo, Vannucci 200?)

Simultaneous Inference for Multiple Testing and Clustering (SIMTAC):
- Extension of BEMMA.
- Separates the clustering of regression coefficients from the accommodation of heteroscedasticity.
- Wider class of experimental designs.
- No need to specify an arbitrary reference treatment.
- Nonconjugate Dirichlet process mixture (DPM) model.

Sampling distribution:
$$d_g \mid \mu_g, \beta_g, \lambda_g \sim N_K\!\left(\mu_g j + X\beta_g, \lambda_g M\right)$$

Prior distribution:
$$\mu_g \sim N(m_\mu, p_\mu), \qquad \beta_g \mid G_\beta \sim G_\beta, \quad G_\beta \sim DP(\alpha_\beta G_{0\beta}), \qquad \lambda_g \mid G_\lambda \sim G_\lambda, \quad G_\lambda \sim DP(\alpha_\lambda G_{0\lambda})$$
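The structural point here, that $\beta_g$ and $\lambda_g$ receive separate DP priors so two genes can share a coefficient vector without sharing a precision, can be illustrated with a small truncated stick-breaking simulation; the base measures and dimensions below are assumptions, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_draws(n, alpha, base_sampler, n_atoms=500):
    """Draw n values from a (truncated) stick-breaking DP realization."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))
    atoms = base_sampler(n_atoms)
    idx = rng.choice(n_atoms, size=n, p=w / w.sum())
    return atoms[idx]

G, K = 100, 5
# Independent DPs: one clusters coefficient vectors, one clusters precisions
beta = dp_draws(G, alpha=1.0, base_sampler=lambda m: rng.normal(0, 1, (m, K)))
lam  = dp_draws(G, alpha=1.0, base_sampler=lambda m: rng.gamma(2.0, 1.0, m))

# Genes sharing a beta atom need not share a lambda atom (and vice versa)
print(len(np.unique(beta, axis=0)), "beta clusters;",
      len(np.unique(lam)), "lambda clusters")
```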

Simulation Study

Size of        Time Point 1     Time Point 2          Time Point 3          Clusters with this
Each Cluster                                                                Configuration
120            β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} = β_{g,2}     1
40             β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} = β_{g,2}     2
40             β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} = β_{g,2}     1
15             β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} = β_{g,2}     6
15             β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} = β_{g,2}     1
15             β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     1
5              β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} = β_{g,2}     19
5              β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} = β_{g,2}     2
5              β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     2
5              β_{g,3} ≠ 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     1
2              β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} = β_{g,2}     48
2              β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} = β_{g,2}     4
2              β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     4
2              β_{g,3} ≠ 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     4
1              β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} = β_{g,2}     95
1              β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} = β_{g,2}     5
1              β_{g,3} = 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     5
1              β_{g,3} ≠ 0      β_{g,4} ≠ β_{g,1}     β_{g,5} ≠ β_{g,2}     5
1              β_{g,3} = 0      β_{g,4} = β_{g,1}     β_{g,5} ≠ β_{g,2}     5
1              β_{g,3} ≠ 0      β_{g,4} ≠ β_{g,1}     β_{g,5} = β_{g,2}     5

Table: Clusters in a synthetic dataset. For the 216 clusters in each synthetic dataset, this table shows the relationships among the regression coefficients (encoding equivalent and differential expression at each time point) and the cluster sizes.

[Figure: proportion of false discoveries versus number of discoveries (0 to 100) for ANOVA, BEMMA, and SIMTAC.]

[Figure: difference in the proportion of false discoveries versus number of discoveries, for ANOVA minus SIMTAC and BEMMA minus SIMTAC.]

Summary

- Dependence can be exploited to improve power in multiple testing.
- Dirichlet process mixture (DPM) models provide powerful machinery for simultaneous inference on clustering and multiple hypothesis testing.

BEMMA:
- Under weak clustering, BEMMA performs as well as its peers; under heavier clustering, it performs substantially better.
- BEMMA has been successfully applied to a replicated microarray study with 10,000+ probe sets and 10 treatment conditions.

SIMTAC:
- Improved implementation of the idea.
- Simulation results are encouraging; the method is now being applied to local data.

Least-squares clustering:
- A convenient and conceptually appealing procedure for point estimation of a clustering.