A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

Faming Liang, Chuanhai Liu, and Naisyin Wang, Texas A&M University

Multiple Hypothesis Testing: Introduction

In microarray data analysis, an important question is how to identify genes that are differentially expressed between two types of tissues or across experimental conditions. Because a large number of genes is tested simultaneously, significance testing methods such as Student's t-test or the Wilcoxon test can produce a large number of false positive findings if the extremes of the many test statistics are not properly accounted for.

Existing methods in the literature:
(1) False discovery rate (FDR) methods (Benjamini and Hochberg, 1995)
(2) Empirical Bayes methods (Efron et al., 2001; Efron, 2004)

Multiple Hypothesis Testing: FDR Methods

Notation: Let H_1, ..., H_N denote the collection of N null hypotheses, Y_1, ..., Y_N the corresponding N test statistics, and P_1, ..., P_N the corresponding N p-values.

                              Accept H_i   Reject H_i   Total
Genes for which H_i is true       U            V         n_0
Genes for which H_i is false      T            S         n_1
Total                             W            R          N

Multiple Hypothesis Testing: Literature Review of FDR Methods

Assumption: the N p-values are mutually independent and follow the mixture distribution
f(p) = π_0 f_0(p) + (1 − π_0) f_1(p),
where π_0 denotes the a priori probability that a null hypothesis is true and is typically near 1, say π_0 ≥ 0.9; f_0 and f_1 denote the distributions of the p-values under the null and alternative hypotheses, respectively. FDR methods generally assume that f_0 is Uniform[0,1].

Benjamini and Hochberg (1995) and Benjamini and Liu (1999) define the FDR as the expected proportion of false positive findings among all rejected hypotheses, i.e.,
FDR = E(V/R | R > 0) Pr(R > 0).
Sequential p-value-based testing procedures are proposed to control the FDR at a desired level.
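As a small illustration (not part of the original slides), a minimal sketch of the classical Benjamini-Hochberg step-up procedure, which controls the FDR at level q under independence, might look like this in Python; the function name and interface are our own.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses with
    the k smallest p-values, where k is the largest index such that
    p_(k) <= k * q / N.  Returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    N = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, N + 1) / N
    below = p[order] <= thresholds
    reject = np.zeros(N, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting the bound
        reject[order[:k + 1]] = True
    return reject
```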

Multiple Hypothesis Testing: Storey's Procedures

Storey (2002, 2003) and Storey, Taylor and Siegmund (2004) proposed a new class of testing procedures that incorporate information about π_0; these tests usually have higher power. Related quantities:

(i) Positive FDR: pFDR = E(V/R | R > 0), which is conceptually simpler than the FDR.

(ii) If the rejection region is of the form Λ = [0, λ], then
pFDR(λ) = π_0 P_{f_0}(Λ) / P_f(Λ),
where P_{f_0} and P_f are the probabilities of Λ with respect to f_0 and f, respectively.

(iii) q-value: q(p) = inf_{Λ: p ∈ Λ} pFDR(Λ). Storey (2002) argued that the q-value is the natural pFDR analogue of the p-value used in a conventional single-hypothesis test and suggested that it be used as the reference quantity for decisions in multiple testing.
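As a small illustration (not from the slides), with a Uniform[0,1] null the pFDR and q-values over nested regions [0, λ] can be estimated directly from the empirical distribution of the p-values; the π_0 value below is simply assumed rather than estimated.

```python
import numpy as np

def pfdr_qvalues(p, pi0=0.9):
    """Estimate pFDR([0, lambda]) at each observed p-value, using
    P_f0([0, lambda]) = lambda under the uniform null and the empirical
    CDF for P_f, then take q(p) = inf over lambda >= p of pFDR(lambda)."""
    p = np.asarray(p, dtype=float)
    N = len(p)
    order = np.argsort(p)
    ps = p[order]
    F_hat = np.arange(1, N + 1) / N              # empirical CDF at sorted p's
    pfdr = pi0 * ps / F_hat
    q_sorted = np.minimum.accumulate(pfdr[::-1])[::-1]  # running inf, lambda >= p
    q = np.empty(N)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```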

Multiple Hypothesis Testing: Literature Review of Empirical Bayes Methods

Unlike FDR methods, empirical Bayes methods work on the test statistics Y_i or on the test scores Z_i = Φ^{-1}(P_i) or Z_i = Φ^{-1}(1 − P_i), and assume the scores follow the mixture distribution
f(z) = π_0 f_0(z) + (1 − π_0) f_1(z),
where Φ is the CDF of the standard Gaussian; f_0 may be a non-standard Gaussian and can be estimated from the data. Efron (2004) estimated f_0 and f using a spline-based method. Do, Muller and Tang (2004) estimated f_0 and f_1 using a nonparametric Bayesian approach.
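For concreteness (not part of the slides), the score transformation can be computed with scipy; the p-value vector below is made up for the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical vector of p-values from gene-wise tests.
p = np.array([0.001, 0.20, 0.47, 0.88, 0.03])

# The two conventions mentioned above; under the second, a larger score
# corresponds to stronger evidence against the null.
z_lower = norm.ppf(p)        # Z_i = Phi^{-1}(P_i)
z_upper = norm.ppf(1.0 - p)  # Z_i = Phi^{-1}(1 - P_i)

print(z_upper)
```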

In empirical Bayes methods, differentially expressed genes are usually identified using the local FDR,
fdr(z_i) = π_0 f_0(z_i) / f(z_i),
or the posterior expected FDR (Genovese and Wasserman, 2002, 2003).
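A rough illustration of the local-FDR idea is sketched below, assuming a Gaussian kernel density estimate for f and a normal fitted to the central scores as a crude empirical null; this is not the spline-based fit of Efron (2004), and the function name and π_0 default are our own.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def local_fdr(z, pi0=0.9):
    """Rough local-FDR estimate fdr(z) = pi0 * f0(z) / f(z).
    Illustration only: f is a Gaussian KDE of all scores and f0 a normal
    fitted to the central scores."""
    z = np.asarray(z, dtype=float)
    f = gaussian_kde(z)                        # marginal density f
    center = z[np.abs(z - np.median(z)) < 2 * z.std()]
    mu0, sd0 = center.mean(), center.std()     # crude empirical null
    fdr = pi0 * norm.pdf(z, mu0, sd0) / f(z)
    return np.clip(fdr, 0.0, 1.0)
```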

Multiple Hypothesis Testing: Difficulties of the Two Existing Approaches

- The distribution of the p-values of the non-differentially expressed genes may deviate appreciably from Uniform[0,1], which invalidates the FDR methods.
- When the number of differentially expressed genes is small, the estimates of f_1 or f may be unreliable.

Our approach estimates f_0 (and thus F_0) parametrically with a sequential procedure that takes advantage of missing-data techniques, and estimates F nonparametrically. Note that estimating F is much easier than estimating f or f_1, as evidenced by the faster optimal convergence rate of the former.

Sequential Bayesian Procedure: Methodology

Assumptions: The procedure works on the test scores Z_i = Φ^{-1}(1 − P_i). It assumes that the scores are mutually independent, that the majority of the Z_i's come from the null distribution F_0, and that the others come from F_1, which may be an arbitrarily complex distribution. It further assumes there exists a number C > mode(f_0) such that any Z < C follows the null distribution F_0.

Let {z_1, ..., z_N} be a sample from the mixture distribution
F(z) = π_0 F_0(z | θ) + (1 − π_0) F_1(z),
and suppose that {z*_1, ..., z*_n} are the observations from F_0. Our goal is to determine the sample size n (n ≤ N) and the parameters of F_0. We assume that {z*_1, z*_2, ..., z*_m} is an identical copy of the set of the m smallest z_i's, and that there exists a cut-off value c such that max_{1 ≤ i ≤ m} z*_i ≤ c and min_{m+1 ≤ i ≤ n} z*_i > c. Treating {z*_{m+1}, ..., z*_n} as missing, the posterior of n and θ is
P(θ, n | z*_1, ..., z*_m) ∝ (n choose m) ∏_{i=1}^{m} f_0(z*_i | θ) [1 − F_0(c | θ)]^{n−m} P(n) P(θ).

Bayes specification: Specify f_0 as a generalized Gaussian distribution (Box and Tiao, 1973),
f_0(z | θ) = β / (2αΓ(1/β)) exp{ −(|z − µ|/α)^β },
where θ = (µ, α, β) with α > 0 and β > 0. The location parameter µ is the center of the distribution, the scale parameter α controls the spread, and the shape parameter β governs the rate of exponential decay.

Priors for θ = (µ, α, β): f(µ) ∝ 1 for µ < c, f(α) ∝ 1/α, and β follows a χ²(ν) distribution with ν degrees of freedom; that is, f(β) ∝ β^{ν/2 − 1} e^{−β/2}.
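A minimal sketch of this density (not the authors' code) is given below; with β = 2 and α = √2 it reduces to the standard normal, which matches the true parameter values used later in Example 1.

```python
import numpy as np
from scipy.special import gammaln

def gen_gaussian_logpdf(z, mu, alpha, beta):
    """Log-density of the generalized Gaussian
    f0(z | mu, alpha, beta) = beta / (2*alpha*Gamma(1/beta))
                              * exp(-(|z - mu| / alpha)**beta)."""
    z = np.asarray(z, dtype=float)
    log_norm = np.log(beta) - np.log(2.0 * alpha) - gammaln(1.0 / beta)
    return log_norm - (np.abs(z - mu) / alpha) ** beta

# beta = 2, alpha = sqrt(2) recovers the standard normal log-density at 0.
print(gen_gaussian_logpdf(0.0, 0.0, np.sqrt(2.0), 2.0))  # approx -0.9189
```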

Based on the symmetry of f_0, the prior on n is
P(n) ∝ exp{ −λ |n − n_0| },  n = m, m + 1, ..., N,
where λ is a hyperparameter and n_0 is twice the count #{z_i : z_i ≤ mode(f_0), i = 1, ..., N}, truncated at 0.95N (i.e., n_0 is set to 0.95N when the doubled count exceeds 0.95N). Setting λ = 0.001 corresponds to a vague prior on n.

The posterior distribution of n and θ is
P(θ, n | z*_1, ..., z*_m) ∝ (n choose m) (β / (2αΓ(1/β)))^m exp{ −∑_{i=1}^{m} (|z*_i − µ|/α)^β } × [ 1/2 − (1/(2Γ(1/β))) ∫_0^{((c−µ)/α)^β} t^{1/β − 1} e^{−t} dt ]^{n−m} × e^{−λ|n − n_0|} (β^{ν/2 − 1}/α) e^{−β/2},
where µ < c, α > 0, β > 0, and n ∈ {m, m + 1, ..., N}.

The scheme for sampling from P(θ, n | z*_1, ..., z*_m):
(a) Simulate n from the conditional posterior f(n | θ, z*_1, ..., z*_m) using the Metropolis-Hastings algorithm.
(b) Simulate θ from the conditional posterior f(θ | n, z*_1, ..., z*_m) using the Metropolis-Hastings algorithm.
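A minimal sketch of the log-posterior and one Metropolis-within-Gibbs sweep is shown below, assuming the generalized Gaussian null and the priors above; the proposal scales, the ν default, and all function names are our own choices, not the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln, gammainc

def log_f0(z, mu, alpha, beta):
    # Log generalized-Gaussian density.
    return (np.log(beta) - np.log(2 * alpha) - gammaln(1.0 / beta)
            - (np.abs(z - mu) / alpha) ** beta)

def log_tail(c, mu, alpha, beta):
    # log[1 - F0(c | theta)] for mu < c, via the regularized lower
    # incomplete gamma function.
    u = ((c - mu) / alpha) ** beta
    tail = 0.5 * (1.0 - gammainc(1.0 / beta, u))
    return np.log(max(tail, 1e-300))

def log_post(theta, n, z_star, c, m, N, n0, lam=0.001, nu=10.0):
    # nu (chi-square prior d.f. for beta) is an assumed hyperparameter.
    mu, alpha, beta = theta
    if not (mu < c and alpha > 0 and beta > 0 and m <= n <= N):
        return -np.inf
    return (gammaln(n + 1) - gammaln(m + 1) - gammaln(n - m + 1)  # binom(n, m)
            + log_f0(z_star, mu, alpha, beta).sum()
            + (n - m) * log_tail(c, mu, alpha, beta)
            - lam * abs(n - n0)
            + (nu / 2.0 - 1.0) * np.log(beta) - beta / 2.0 - np.log(alpha))

def mwg_step(theta, n, z_star, c, m, N, n0, rng, step=0.05):
    # (a) Metropolis update for n with a symmetric integer random walk.
    n_prop = n + rng.integers(-5, 6)
    if np.log(rng.uniform()) < (log_post(theta, n_prop, z_star, c, m, N, n0)
                                - log_post(theta, n, z_star, c, m, N, n0)):
        n = n_prop
    # (b) Metropolis update for theta = (mu, alpha, beta), proposed jointly.
    prop = theta + step * rng.standard_normal(3)
    if np.log(rng.uniform()) < (log_post(prop, n, z_star, c, m, N, n0)
                                - log_post(theta, n, z_star, c, m, N, n0)):
        theta = prop
    return theta, n
```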

With the estimate of θ obtained from the simulation, we can test whether z_(m+1), ..., z_(m+s) ~ f_0(z | θ̂), where z_(m+1), ..., z_(m+s) ∈ (c, c + δ α̂]. Here δ is usually set to a value between 0.01 and 0.1. If the hypothesis that z_(m+1), ..., z_(m+s) ~ f_0(z | θ̂) is accepted, we set the new m ← m + s, repeat the simulation, and re-estimate θ with the data {z_(1), ..., z_(m+s)}. The procedure is iterated until no more scores can be added to the null sample (one possible form of this null-score-addition test is sketched below).
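The slides do not spell out the form of the null-score-addition test. A simple stand-in, assuming a one-sample Kolmogorov-Smirnov test of the s candidate scores against the fitted F_0 restricted to (c, c + δ α̂], might look like the following; all names and the choice of test are illustrative.

```python
import numpy as np
from scipy.special import gammainc
from scipy.stats import kstest

def F0(z, mu, alpha, beta):
    """CDF of the generalized Gaussian with parameters (mu, alpha, beta)."""
    z = np.asarray(z, dtype=float)
    u = (np.abs(z - mu) / alpha) ** beta
    return 0.5 + 0.5 * np.sign(z - mu) * gammainc(1.0 / beta, u)

def addition_test(z_new, theta_hat, c, delta, gamma=0.05):
    """Accept the candidate scores in (c, c + delta*alpha_hat] as null if a
    KS test against the fitted F0, restricted to that interval, is not
    rejected at level gamma.  Illustrative stand-in only."""
    mu, alpha, beta = theta_hat
    lo, hi = F0(c, mu, alpha, beta), F0(c + delta * alpha, mu, alpha, beta)
    u = (F0(z_new, mu, alpha, beta) - lo) / (hi - lo)  # conditional PIT
    pval = kstest(u, 'uniform').pvalue
    return pval >= gamma, pval
```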

The Sequential Bayesian Estimation Algorithm

(0) Set c to the Q-th percentile of z and determine m such that z_(m) < c and z_(m+1) > c. Let I denote the number of consecutive stages used for estimating n and θ. Set I = 1, n̂_0 = m, β̂_0 = 2, and let µ̂_0 and α̂_0 be the mean and standard deviation of z_(1), ..., z_(m), respectively.

(1) Simulate samples (n_1, θ_1), ..., (n_M, θ_M) from the joint posterior P(n, θ | z*_1, ..., z*_m) for M steps, starting from the current estimate (n̂_{I−1}, θ̂_{I−1}). Estimate n by
n̂_I = (1 − 1/I) n̂_{I−1} + (1 / ((M − M_0) I)) ∑_{i=M_0+1}^{M} n_i,
where M_0 is the number of burn-in steps in each stage (for I = 1 the first term vanishes). Estimate µ, α and β similarly.

(2) Test the hypothesis H_0: z_(m+1), ..., z_(m+s) ~ f_0(z | θ̂_I) versus H_1: z_(m+1), ..., z_(m+s) do not follow f_0(z | θ̂_I) at a pre-specified level γ. If H_0 is accepted, set m ← m + s, c ← c + δ α̂, n̂_0 = n̂_I, θ̂_0 = θ̂_I, and I = 1, and go to step (1); otherwise, keep m and c unchanged, set I ← I + 1, and go to step (3).

(3) If I > S, go to step (4); otherwise, go to step (1).

(4) Fix m at its current value, re-simulate samples (n̂_1, θ̂_1), ..., (n̂_M, θ̂_M) from f(n, θ | z*_1, ..., z*_m) with the Metropolis-within-Gibbs sampler, calculate q̂(z_i) for i = 1, ..., N, and identify the differentially expressed genes according to the q̂(z_i)'s.

Default settings: Q = 80, M_0 = 200, M = 1000, δ = 0.025, γ = 0.05, S = 3.
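A driver loop for steps (0)-(3) is sketched below, reusing the mwg_step and addition_test sketches shown earlier. It is a minimal illustration, not the authors' reference implementation; in particular, the prior center n0 is simply set to m here rather than to twice the count below the mode of f_0.

```python
import numpy as np

def sequential_bayes(z, mwg_step, addition_test, Q=80, M=1000, M0=200,
                     delta=0.025, gamma=0.05, S=3, seed=0):
    """Illustrative driver for the sequential estimation algorithm."""
    rng = np.random.default_rng(seed)
    z = np.sort(np.asarray(z, dtype=float))
    N = len(z)
    c = np.percentile(z, Q)                          # step (0)
    m = int(np.searchsorted(z, c, side='right'))
    theta_hat = np.array([z[:m].mean(), z[:m].std(), 2.0])
    n_hat, I = float(m), 1
    n0 = m   # assumed prior center; the slides use twice the count below mode(f0)
    while True:
        # step (1): run the sampler for M steps and average the post-burn-in draws
        theta, n = theta_hat.copy(), int(round(n_hat))
        draws_n, draws_th = [], []
        for t in range(M):
            theta, n = mwg_step(theta, n, z[:m], c, m, N, n0, rng)
            if t >= M0:
                draws_n.append(n)
                draws_th.append(theta)
        n_hat = (1 - 1 / I) * n_hat + np.mean(draws_n) / I
        theta_hat = (1 - 1 / I) * theta_hat + np.mean(draws_th, axis=0) / I
        # step (2): try to add the scores in (c, c + delta * alpha_hat]
        s = int(np.sum((z > c) & (z <= c + delta * theta_hat[1])))
        ok = s > 0 and addition_test(z[m:m + s], theta_hat, c, delta, gamma)[0]
        if ok:
            m, c, I = m + s, c + delta * theta_hat[1], 1
        else:
            I += 1                                   # step (3)
            if I > S:
                break
    return m, c, n_hat, theta_hat                    # feed into step (4)
```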

Estimator of FDR in the Sequential Bayesian Procedure

If the rejection region takes the form Λ = [ω, ∞), we estimate
p̂FDR(ω) = (1/M) ∑_{i=1}^{M} (n̂_i / N) [1 − F_0(ω | θ̂_i)] / [1 − F̂(ω)],
where F̂(ω) = #{z_i : z_i ≤ ω}/N is the empirical CDF of the scores. Furthermore, for nested rejection regions of the form [ω, ∞), the q-value of an observed score z can be estimated by
q̂(z) = inf_{ω ≤ z} p̂FDR(ω).
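A minimal sketch of this estimator is given below, assuming the MCMC draws (n̂_i, θ̂_i) come from the final re-simulation and that F0 is the fitted-null CDF (e.g., the earlier sketch); names and the numerical safeguards are our own.

```python
import numpy as np

def pfdr_and_qvalues(z, n_draws, theta_draws, F0):
    """Estimate pFDR([omega, inf)) at each observed score and convert to
    q-values by a running infimum over omega <= z.  Illustrative only."""
    z = np.asarray(z, dtype=float)
    N = len(z)
    order = np.argsort(z)
    zs = z[order]
    F_hat = np.searchsorted(zs, zs, side='right') / N   # empirical CDF
    denom = np.maximum(1.0 - F_hat, 1.0 / N)             # floored to avoid 0 at the max
    pfdr = np.zeros(N)
    for n_i, (mu, alpha, beta) in zip(n_draws, theta_draws):
        pfdr += (n_i / N) * (1.0 - F0(zs, mu, alpha, beta)) / denom
    pfdr /= len(n_draws)
    # q(z) = inf over omega <= z of pFDR(omega): running minimum from the left.
    q_sorted = np.minimum.accumulate(pfdr)
    q = np.empty(N)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```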

Simulated Examples: Example 1

This example comprises 20 datasets. Each dataset consists of 2100 test scores: the first 2000 scores are generated from the standard normal distribution, and the remaining 100 scores are generated from a left-truncated Student-t distribution with df = 5 degrees of freedom and truncation threshold T = 3; that is,
f(z) = π_0 φ(z) + (1 − π_0) t(z | df = 5, T = 3).
The histogram of one of the 20 datasets is shown in Figure 1(a). Five runs of the sequential procedure on this dataset produce π̂_0 = 0.9526 with standard deviation 0.0009, almost identical to the true value 0.9524, and θ̂ = (µ̂, α̂, β̂) = (0.0158, 1.4179, 1.9914) with standard deviations (0.0026, 0.0039, 0.0070), also very close to the true value (0.0, 1.414, 2.0).
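For reference (not the authors' simulation code), one way to generate a dataset of this form is by inverse-CDF sampling of the truncated t component; the function name and seed are arbitrary.

```python
import numpy as np
from scipy.stats import t

def simulate_example1(n_null=2000, n_alt=100, df=5, T=3.0, seed=1):
    """Generate one dataset in the style of Example 1: null scores from
    N(0,1) and alternative scores from a Student-t(df) left-truncated at T
    (i.e., restricted to [T, inf)), via inverse-CDF sampling."""
    rng = np.random.default_rng(seed)
    z_null = rng.standard_normal(n_null)
    u = rng.uniform(t.cdf(T, df), 1.0, size=n_alt)   # uniform on [F_t(T), 1)
    z_alt = t.ppf(u, df)
    return np.concatenate([z_null, z_alt])

z = simulate_example1()
print(z.shape, z.min(), z.max())
```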

Simulated Examples: Example 2

This example also comprises 20 datasets. Each dataset consists of 2100 test scores, of which the first 2000 are generated from N(0, 1.5²) and the remaining 100 from N(4, 1); that is,
f(z) = (π_0/1.5) φ(z/1.5) + (1 − π_0) φ(z − 4).
Figure 3 shows the histogram of one dataset. For this dataset, five runs of the sequential procedure produce (π̂_0, µ̂, α̂, β̂) = (0.949, 0.05, 2.148, 1.98) with standard deviations (0.001, 0.002, 0.006, 0.007). The true values are (π_0, µ, α, β) = (0.952, 0.0, 2.121, 2.0).

Figure 1: (a) Histogram of the scores; the vertical line shows the cut-off point c obtained in a run of the sequential procedure at the final stage. (b) The q-values versus the test scores. (c) The number of significant scores versus the q-value cut-off values. (d) The true false discovery rates (tFDR) versus the q-values.

Figure 2: Intermediate results produced in a run of the sequential procedure. (a) The increase of m across stages. (b) The p-values of the null-score-addition tests. (c) The estimates of π_0 and θ at different stages.

Table 1: True false discovery rates for different rejection regions (Example 1), averaged over 20 runs; standard deviations are shown in parentheses. Storey's procedure: http://faculty.washington.edu/~jstorey/.

                             True false discovery rate
Method       π̂_0            q̂(z) ≤ 0.2     q̂(z) ≤ 0.15    q̂(z) ≤ 0.1     q̂(z) ≤ 0.05
Sequential   0.948 (0.002)   0.208 (0.015)   0.161 (0.012)   0.107 (0.008)   0.051 (0.006)
Storey       0.934 (0.012)   0.206 (0.007)   0.156 (0.008)   0.109 (0.005)   0.053 (0.005)

Figure 3: Histogram of one dataset generated from the mixture model of Example 2.

Table 2: True false discovery rates for different rejection regions (Example 2), averaged over 20 runs; standard deviations are shown in parentheses.

                             True false discovery rate
Method       π̂_0            q̂(z) ≤ 0.2     q̂(z) ≤ 0.15    q̂(z) ≤ 0.1     q̂(z) ≤ 0.05
Sequential   0.952 (0.003)   0.199 (0.020)   0.146 (0.017)   0.084 (0.017)   0.053 (0.015)
Storey       1.000 (0.000)   0.687 (0.006)   0.638 (0.006)   0.577 (0.007)   0.479 (0.009)

Avian Pineal Gland Gene Expression Data: Data Description

The avian pineal gland contains both circadian oscillators and photoreceptors that produce rhythms in the biosynthesis of the hormone melatonin in vivo and in vitro. It is of great interest to understand the genetic mechanisms driving these rhythms. For this purpose, a series of cDNA microarrays of bird pineal gland transcripts under light-dark (LD) and constant darkness (DD) conditions was generated. Under LD, birds were euthanized at Zeitgeber times (ZT) 2, 6, 10, 14, 18, and 22 hours to obtain mRNA for adequate cDNA libraries. Under both LD and DD conditions, p-values P_i for testing the existence of time effects were produced and transformed to test scores using Φ^{-1}(1 − P_i).

Figure 4: Computational results of the sequential procedure for the LD data. (a) The histogram of the scores and the fitted f_0 in one run of the sequential procedure. (b) The q-values versus the test scores. (c) The number of significant genes versus the q-value cut-off values.

Estimate: (π̂_0, µ̂, α̂, β̂) = (0.863, 1.339, 2.203, 2.540) with standard deviations (0.0002, 0.0019, 0.0021, 0.0147). Among the 1400 identified differentially expressed genes, about 400 (≈ 1400 × 28%) are expected to be false positives.

Figure 5: Computational results of Storey's procedure for the LD data. (a) The q-values versus the test scores. (b) The number of significant genes versus the q-value cut-off values.

Figure 6: Computational results of the sequential procedure for the DD data. (a) The histogram of the scores and the fitted f_0 in one run of the sequential procedure. (b) The q-values versus the test scores. (c) The number of significant genes versus the q-value cut-off values.

Estimate: (π̂_0, µ̂, α̂, β̂) = (0.972, 0.361, 0.942, 1.284) with standard deviations (0.0004, 0.0020, 0.0021, 0.0146).

Figure 7: Computational results of the sequential procedure for the DD data with significance level γ = 0.001 for the null-score-addition tests. (a) and (d): the histograms of the scores and the fitted f_0. (b) and (e): the q-values versus the test scores. (c) and (f): the number of significant genes versus the q-value cut-off values.

Figure 8: Computational results of Storey's procedure for the DD data. (a) The q-values versus the test scores. (b) The number of significant genes versus the q-value cut-off values.

Sequential Bayesian Procedure: Discussion

- Genes that are suspected to be differentially expressed can be identified by comparing the density curve of the fitted null distribution with the histogram of the test scores.
- The null distribution can be modeled by other parametric families, say a mixture of generalized Gaussian distributions, to reflect the biological background of the null condition for the dataset under study.
- The null-score-addition process can be slightly improved by adding a backward deletion step, although the final results appear to be fairly insensitive to the choice of m.