SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical mdel fr micrarray data analysis David Rssell Department f Bistatistics M.D. Andersn Cancer Center, Hustn, TX 77030, USA rsselldavid@gmail.cm Cntents 1 Bayesian prcedure fr gene differential expressin analysis 2 2 Frequentist perating characteristics f the Bayesian prcedure 3 3 Frequentist FDR fr the Armstrng dataset 5 4 Gdness f fit in the Armstrng dataset 6 1

1 Bayesian prcedure fr gene differential expressin analysis Here we detail the Bayesian prcedure that minimizes the Bayesian FNR subject t the Bayesian FDR being belw a threshld α. Let δ 1,..., δ n dente the expressin pattern that each gene fllws, e.g. if there are nly tw grups t be cmpared δ i = 0 dentes the null hypthesis that bth grups are equal and δ i = 1 dentes that they are different. Dente as d i = d i (x) the pattern that gene i is assigned t, i.e. d i = 0 means that we declare the gene as equally expressed (EE) and d i 0 that we declare it differentially expressed (DE). The false negative (FNP) and false discvery prprtins (FDP) can be written as: FNP = FDP = i=1 I(d i = 0)I(δ i 0) i=1 I(d i = 0) i=1 I(d i = 1)I(δ i = 0) i=1 I(d, (1) i = 1) where I( ) is the indicatr functin. That is, the FNP is the prprtin f genes declared EE that are actually DE, and the FDP is the prprtin f genes declared DE that are actually EE. The Bayesian FNR and Bayesian FDR are defined as the expected FNP and FDP, respectively, where the expectatin is taken with respect t the psterir distributin f δ 1,..., δ n (1). Fr any fixed decisins d 1,..., d n, ne can evaluate the Bayesian FNR and FDR simply as BFNR = BFDR = i=1 I(d i = 0)(1 v i0 ) i=1 I(d i = 0) i=1 I(d i = 1)v i0 n i=1 I(d i = 1), (2) where v i0 = P (δ i = 0 x) is the psterir prbability that gene i is EE. (2) shwed that, in a setup with nly tw hyptheses, the ptimal rule t minimize BFNR subject t BFDR< α is t declare as DE all genes with v i0 belw a certain threshld t, i.e. d i = I(v i0 < t), where t is the minimum value such that BFDR α. Nte that BFNR and BFDR nly take int accunt whether a gene was classified int pattern 0 r nt. Therefre, when minimizing BFNR subject t BFDR α it makes n difference whether a particular gene is assigned t pattern 1 r pattern 2, say. We prpse the

bvius: given that a gene is declared DE, we assign it t the pattern with highest psterir prbability, i.e. δ i = I(v i0 < t) argmax k {1,...,H 1} (v ik ). It is straightfrward t see that, fr any fixed BFNR and BFDR, this rule maximizes the expected number f genes crrectly classified int their expressin pattern. 2 Frequentist perating characteristics f the Bayesian prcedure In this sectin we review the algrithm that (3) prpsed t estimate the frequentist FDR fr any given prcedure, i.e. the expected FDP in (1) when the prcedure is applied under repeated sampling. As a first step, ne btains btstrap samples frm the riginal data, in such a way that the sample mean and variance f each gene are rughly preserved and that it represents a sample under the cmplete null hypthesis that n genes are DE. Then ne repeatedly applies the prcedure t each btstrap dataset, btaining an estimate f the number f false psitives, and cmpares that t the number f genes fund in the riginal dataset. Mre specifically, the algrithm is as fllws: Algrithm 1 1. Apply the prcedure t the riginal dataset, and dente the number f genes declared t be DE as P. Dente as X i and S i the sample mean and standard deviatin f the gene expressin measurements fr gene i = 1... n. 2. Fr b = 1... B, d: Cmpute z ij = (x ij X i )/S i, i = 1... n, j = 1... J. Fr each gene, btain a sample f size J with replacement frm the cllectin f all z ij. Dente this sampled values as z (b) ij. Then cmpute x (b) ij = S i z (b) ij X i. Apply the prcedure t find differentially expressed genes t the btstrap dataset. Since all discveries are false psitives, dente the number f genes declared DE as FP b. P B b=1 FP 3. Estimate the frequentist FDR as FDR = ˆπ b 0, where ˆπ B P 0 is an estimate f the prprtin f EE genes.

Grup 1 Grup 2 2.2 2.8 2.8 3.2 3.1 2.0 2.5 2.1 2.7 2.1 2.0 2.5 2.1 2.9 2.3 10.5 2.9 2.2 2.4 2.4 Table 1: Hypthetical expressin values fr a gene, 10 arrays and 2 grups There is an imprtant remark t make here: there are ther ways t simulate data under the null hypthesis. Fr instance, ne culd simply permute the grup labels r btstrap data within each gene, but this culd be trublesme when the sample size is t small t prvide an accurate representatin f the null. Determining what sample size is large enugh is nt a simple matter. Fr example, suppse that we have a gene fr which mst expressin values are small, but fr ne f the grups it ccasinally presents large expressin values. Table 1 10 presents hypthetical values fr a single gene and 2 grups. Nte that there are ( ) 10 5 = 184, 756 ways t permute the grup labels, which at first sight may seem a large enugh number. Hwever, in grup 2 there is an utlying value. Fr any permutatin, whatever grup the utlier is assigned t will tend t be declared t have higher expressin levels than the ther grup, and it will be cunted as a false psitive. If ne btains a btstrap sample, there is sme prbability that the effect f the utlier will be mitigated (the utlier may nt be sampled at all, r be sampled the same number f times in bth grups), but it will still tend t increase t false psitive cunt. If ne btstraps residuals frm ther genes, such as Algrithm 1 des, the value f 10.5 may nt be utlying at all anymre, since ther genes may present values that are als far away frm the mean. Hence, we see that three reasnable strategies t sample under the null may result in three quite different sampling distributins and estimated FDR. The main issue is whether the utlying value shuld be cnsidered t be an errr r nt. In ur experience, in micrarrays it is nt unfrequent t encunter data such as that in Table 1. A pssible bilgical explanatin is that a small prprtin f individuals frm grup 2 experience sme kind f

GaGa mdel MiGaGa mdel (M=2) Estimated frequentist FDR 5 arrays 10 arrays 15 arrays All data Estimated frequentist FDR 5 arrays 10 arrays 15 arrays All data Bayesian FDR Bayesian FDR Figure 1: Armstrng dataset. Frequentist FDR fr the GaGa (a) and Mi- GaGa (b) mdels. mutatin, which causes the expressin f a particular gene t raise cnsiderably. If such a discvery is bilgically meaningful ne shuld nt cunt it as a false psitive, and hence methds based n permuting r btstrapping each gene separately wuld nt be apprpriate. We agree with (3) that mre research is needed regarding this tpic, but we feel that unless the number f arrays is quite large it may be beneficial t use a resampling scheme that uses data frm several genes at nce. 3 Frequentist FDR fr the Armstrng dataset We nw estimate the frequentist FDR f the Bayesian prcedure utlined in Sectin 1 by applying Algrithm 1 t the Armstrng dataset. Figure 1(a) displays the estimated frequentist FDR fr target Bayesian FDRs ranging frm 0 t 0.1, bth fr the GaGa and MiGaGa mdels and increasing amunts f data. Fr Bayesian FDR at the 0.05 level, the estimated frequentist FDR is always belw 0.05. The nly exceptin is the MiGaGa mdel applied t the full dataset, fr which the frequentist FDR is estimated t be 6.4%. Figure 2 prvides the analgus estimates fr the mntnically transfrmed data, which was analyzed with a GaGa mdel. Fr a Bayesian FDR f 0.05, the estimated frequentist FDR is belw 0.05, as desired.

(a) (b) GaGa mdel, transfrmed data Estimated frequentist FDR 5 arrays 10 arrays 15 arrays All data Density 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Observed data GaGa 0 5 10 15 Bayesian FDR Expressin levels (lg scale) Figure 2: Armstrng dataset after mntnic transfrmatin. (a): frequentist FDR ft the GaGa mdel; (b): marginal distributin f data vs. prir predictive f GaGa mdel with ω = ˆω. We cnclude that the Bayesian prcedure frm Sectin 1 has reasnably gd frequentist perating characteristics when applied t the Armstrng dataset. 4 Gdness f fit in the Armstrng dataset In the riginal paper, Sectin 6.2.1, we evaluated sme aspects f the verall gdness-f-fit. Figure 2(b) cmpares the marginal distributin f the mntnically transfrmed data with draws frm the prir-predictive GaGa mdel, setting the hyper-parameters t their psterir mean. The mntnic transfrmatin imprves the fit f the GaGa mdel substantially (cmpare with Figure 4(a) in the riginal paper). We nw assess the fit fr sme genes individually. First, we select the tw genes with the highest prbability f being DE accrding t the Ga mdel. Figure 3(a) cmpares their bserved expressin values with draws frm their psterir predictive distributin based n the Ga mdel. We see that, even thugh the mdel underestimates the variability fr the MLL grup, the tw genes d actually seem t be differentially expressed. Figure 3(b) shws hw draws frm the GaGa mdel psterir predictive mre apprpriately capture

(a) (b) Ga mdel GaGa mdel Prbe 37809_at ALL MLL Prbe 37809_at CI CII Prbe 1914_at (c) Ga mdel Prbe 1914_at (d) GaGa mdel Prbe 2087_s_at ALL MLL Prbe 2087_s_at CI CII Prbe 1369_s_at Prbe 1369_s_at Figure 3: Observed expressin values vs. predictive distributin. Large black symbls are actual bservatins, small gray symbls are draws frm the psterir predictive. (a),(b): the tw genes with highest prbability f being DE accrding t the Ga mdel; (c),(d): tw genes declared DE by the Ga mdel and declared EE by the GaGa mdel

the variability f the data. Fr these tw genes in particular, hwever, the result f the inference is the same: bth mdels declare prbes 1914 at and 37809 at t be differentially expressed. We nw select tw genes that are declared DE by the Ga mdel and EE by the GaGa mdel, and again cmpare the bserved values t their psterir predictive distributin. Figure 3(c) reveals that the Ga mdel underestimates the variability in the data, while in panel (d) we see that GaGa represents it mre satisfactrily. In this case the inference abut the prbes 1369 s at and 2087 s at frm bth mdels is radically different, fr the GaGa mdel assigns a psterir prbability < 0.01 that each gene is differentially expressed. The pr Ga fit t these tw genes and the fact that n strng differences between grups are bserved in Figure 3(c)-(d) suggest that the inference prvided by the GaGa mdel is mre realiable. Althugh nt presented here, we bserve a similar favrable behavir f the MiGaGa mdel with M = 2 cmpnents. We cnclude that the psterir distributin f the GaGa and MiGaGa mdels present an adequate fit t the data. This is in cntrast with the prir-predictive plt in Figure 3(b) in the riginal paper, which suggested the GaGa fit t be f limited quality. This shuld nt be t surprising, it merely reflects that the psterir distributin incrprates the infrmatin abut bimdality present in the data. References [1] P. Müller, G. Parmigiani, and K. Rice. FDR and Bayesian Multiple Cmparisns Rules. Oxfrd University Press, 2007. [2] P. Müller, G. Parmigiani, C. Rbert, and J. Russeau. Optimal sample size fr multiple testing: the case f gene expressin micrarrays. Jurnal f the American Statistical Assciatin, 99:990 1001, 2004. [3] J.D. Strey. The ptimal discvery prcedure: A new apprach t simultaneus significance testing. Jurnal f the Ryal Statistical Sciety B, 69:347 368, 2007.