GaGa: a Parsimonious and Flexible Hierarchical Model for Microarray Data Analysis

David Rossell
Department of Biostatistics, M.D. Anderson Cancer Center, Houston, TX 77030, USA
rosselldavid@gmail.com

Abstract

Bayesian hierarchical models are attractive for microarray data since they allow sharing information across genes and across different analyses in a coherent manner. Kendziorski et al. (2003) and Newton et al. (2004) introduced the gamma-gamma hierarchical model. The model parsimoniously describes the expression of thousands of genes with a small number of hyper-parameters. This makes the model easy to interpret and analytically tractable. However, we find important limitations of the model when fitting real datasets. The limitations are due to some of the assumptions being too restrictive. We propose a simple extension of the model that improves the fit substantially with almost no increase in complexity. The model allows comparing not only mean expression between groups but also the distributional shape, which we argue to be of biological relevance. We propose a second extension that uses a mixture of gamma distributions to further improve the fit, at the expense of increased computational burden. We propose several approximations that significantly reduce the computational cost. We use our approach for inference about differential gene expression and for class prediction, both in simulated and real datasets. We find that both extensions are preferable to the original formulation of the model, and that they provide advantages over several other popular methods, especially for small samples.

1 Introduction

Two common inference problems with microarray data are differential expression analysis, i.e. the comparison of some measure of gene expression between groups, and class prediction, i.e. the prediction of an unknown group label for a new sample. For both procedures, one of the challenges is that the number of genes greatly exceeds the number of replicated measurements that are obtained for each gene.
That is, data is abundant at an overall level but it is scarce at the gene level, and therefore there is much potential for methods that allow for the sharing of information across genes.

Footnote: To whom correspondence should be addressed.

Hierarchical models naturally allow for the sharing of information between genes. Typical examples are Lönnstedt and Speed (2002) and Smyth (2004), who modeled gene-specific parameter estimates via hierarchical empirical Bayes methods to obtain improved testing procedures. Kendziorski et al. (2003), Newton et al. (2001) and Newton and Kendziorski (2003) proposed hierarchical models that depend on few parameters, i.e. they greatly reduce the dimensionality of the problem. This feature is particularly important for small sample sizes.

We build on the gamma-gamma hierarchical model of Kendziorski et al. (2003). The model is used extensively in recent literature (Parmigiani et al., 2003; Müller et al., 2004; Zhao et al., 2005; Chiogna et al., 2007). The gamma-gamma model assumes that the observations for each gene arise from a gamma distribution with common shape parameter across all genes and a scale parameter that arises from a hierarchical gamma prior. Since the model uses a single gamma prior, we refer to it as the Ga model. We find the gamma choice appealing, for it is a flexible family that can capture a variety of distributional shapes.

In this paper we show that, although this model is elegant and parsimonious, it fails to provide an adequate fit in real datasets, and that this is partly due to the assumption that the shape parameter is common across genes. We propose a simple extension of the model that specifies a gamma prior on both the shape and the inverse mean parameters (GaGa model). The extension is still parsimonious, requiring only one additional hyper-parameter, and it can be fit both via empirical Bayes and a fully Bayesian approach. We develop an algorithm that implements posterior inference with a computational effort that is comparable to the Ga model. We then develop a second extension that specifies a gamma prior on the shape parameter and a mixture of gamma priors on the inverse mean (MiGaGa model). This provides additional flexibility, albeit at the expense of reduced model parsimony and increased computational cost.
An advantage of the GaGa and MiGaGa hierarchical models is that they allow comparing both the shape and mean parameters between groups. That is, one compares not only mean expression levels but also the distributional shape. This is important because, as argued by Lapointe et al. (2004) and Coombes et al. (2007), the latter comparison may be biologically more meaningful than comparing mean expressions. As an example, consider the case of cancer biomarkers. Many mutations, deletions and translocations affect only a proportion of the diseased individuals. When comparing normal and cancer cells, only that proportion exhibits a modified expression pattern, i.e. the tail behavior or the variability will be different across groups even though the mean expression may be almost unchanged.

The paper is structured as follows. In Section 2 we review the Ga model and we extend it to the GaGa model. We derive expressions for posterior probabilities of interest and an MCMC sampling scheme to fit the model. In Section 3 we propose as a further generalization the MiGaGa. For both extensions, the posterior distributions of the gamma shape parameters are known only up to a constant. We refer to this distribution, which to our knowledge has not been described before, as the gamma shape distribution. In Section 4 we derive useful approximations for this distribution. In Section 5 we explain how to find differentially expressed (DE) genes and perform class prediction. In Section 6 we apply our approach to simulated data and real datasets. Finally, in Section 7 we present some concluding remarks. The methods described in this paper are implemented in the R library gaga.

2 The GaGa model

We assume that the data has been background corrected, normalized and quantified in a sensible manner (Dudoit et al., 2002). Let $x_{ij}$ be the measure of expression for gene $i$, $i = 1, \ldots, n$, in microarray $j$, $j = 1, \ldots, J$. Let $z_j \in \{1, \ldots, K\}$ indicate group membership, e.g. $z_j = 1$ for normal cells and $z_j = 2$ for cancer cells. We denote the vector of observations for gene $i$ as $x_i$ and the whole dataset as $x$. We use Ga(·) to denote a gamma distribution, IGa(·) for the inverse gamma, Mult(·) for the multinomial, Dirichlet(·) for the Dirichlet and GaS(·) for the gamma shape distribution. The GaS distribution is defined in Section 4.

In the differential expression problem the investigator is interested in determining the expression pattern that each gene follows. This inference problem can be viewed as a hypothesis testing problem. Throughout we use the terms hypothesis and expression pattern interchangeably. The simplest setup has $K = 2$ groups and 2 hypotheses: pattern 0, under which both groups are equally expressed (null hypothesis), and pattern 1, under which they are differentially expressed (alternative hypothesis). For $K > 2$ we may want to consider more than 2 patterns. For example, if group 1 corresponds to normal cells, group 2 to cells with type A cancer and group 3 to type B cancer, one may be interested in assigning each gene to one of the following patterns:

Pattern 0: Normal = Cancer A = Cancer B
Pattern 1: Normal ≠ Cancer A = Cancer B
Pattern 2: Normal ≠ Cancer A ≠ Cancer B. (1)

Denote by $H$ the number of hypotheses, and let the latent variable $\delta_i \in \{0, 1, \ldots, H-1\}$ indicate the true expression pattern for gene $i$. We refer to genes with $\delta_i = 0$ as equally expressed (EE) and genes with $\delta_i \neq 0$ as differentially expressed (DE), and we denote $\delta = (\delta_1, \ldots, \delta_n)$.

2.1 The model

The Ga model (Kendziorski et al., 2003; Newton et al., 2001; Newton and Kendziorski, 2003) assumes that the $x_{ij}$ are independent realizations from $\mathrm{Ga}(\alpha_{i,z_j}, \lambda_{i,z_j})$ (i.e. the mean is $\alpha_{i,z_j}/\lambda_{i,z_j}$).
The model assumes $\delta_i \sim \mathrm{Mult}(1, \pi)$, fixes $\alpha_{i,z_j} = \alpha$ for all $i, j$ and specifies the hierarchical prior $\lambda_{i,z_j} \sim \mathrm{Ga}(\alpha_0, \nu)$ for all distinct scale parameters under pattern $\delta_i$. Here $(\alpha_0, \nu, \alpha, \pi)$ are hyper-parameters common to all genes. For $\delta_i = 0$ (EE genes) we have $\lambda_{i,1} = \ldots = \lambda_{i,K}$, and for $\delta_i \neq 0$ some of the $\lambda_{i,z_j}$ are different from each other, according to the specification of the hypotheses.

The Ga model imposes the restriction that $1/\sqrt{\alpha_{i,z_j}}$, the within-groups coefficient of variation (CV), must be constant across all genes and groups. The assumption is analytically convenient, but we have found it not to be reasonable in typical datasets. Figure 1 shows empirical CVs for two datasets described in Section 6. The CVs exhibit a skewed distribution with a substantial amount of variability, which contradicts the assumption that CVs are constant. Figure 1 highlights the genes declared DE by the Ga model in both datasets. These are often genes with above average CV. This is due, we believe, to the constant CV assumption, which makes the Ga model view atypical CVs as evidence for differential expression.
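The constant-CV restriction can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the helper names `simulate_ga_gene` and `empirical_cv` and all numerical values are hypothetical and are not taken from the paper or from the gaga library. It draws observations from a gamma with shape α and rate λ and checks that the empirical CV stays close to 1/√α whatever the value of λ.

```python
import math
import random
import statistics


def simulate_ga_gene(alpha, lam, n_obs, rng):
    """Draw n_obs values from a gamma with shape alpha and rate lam (mean alpha/lam)."""
    # random.gammavariate takes (shape, scale), so the scale is 1/rate.
    return [rng.gammavariate(alpha, 1.0 / lam) for _ in range(n_obs)]


def empirical_cv(values):
    """Coefficient of variation: sample standard deviation over sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


rng = random.Random(1)
alpha = 25.0                      # hypothetical common shape parameter
cvs = {}
for lam in (0.5, 2.0, 10.0):      # hypothetical gene-specific rate parameters
    x = simulate_ga_gene(alpha, lam, 20000, rng)
    cvs[lam] = empirical_cv(x)

theoretical_cv = 1.0 / math.sqrt(alpha)
```

Changing `lam` moves the mean but leaves the CV untouched, so under a common shape parameter any gene whose empirical CV is atypical can only be accommodated by letting its groups differ.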

Figure 1: Sample mean (horizontal axis, log scale) and coefficient of variation (vertical axis) for each gene; marked points denote genes declared DE by the Ga model. (a): Gould dataset; (b): Armstrong dataset.

When imposing constant $\alpha_{i,z_j}$ one can only compare the scale parameters $\lambda_{i,z_j}$ between groups, while in practice it can be more relevant biologically to compare the full distribution (Lapointe et al., 2004; Coombes et al., 2007). We propose a generalization that addresses this limitation. We introduce gene and pattern-specific shape parameters $\alpha_{i,z_j}$ and assume $x_{ij} \sim \mathrm{Ga}(\alpha_{i,z_j}, \alpha_{i,z_j}/\lambda_{i,z_j})$ (i.e. $\lambda_{i,z_j}$ is the mean), with the following hierarchical prior

\[
\begin{aligned}
\lambda_{i,k} \mid \delta_i, \alpha_0, \nu &\sim \mathrm{IGa}(\alpha_0, \alpha_0/\nu), \quad \text{indep. for } i = 1, \ldots, n,\\
\alpha_{i,k} \mid \delta_i, \beta, \mu &\sim \mathrm{Ga}(\beta, \beta/\mu), \quad \text{indep. for } i = 1, \ldots, n,
\end{aligned} \tag{2}
\]

and a prior for $\delta_i$ as before. We refer to (2) as the GaGa model. As in the Ga model, the values of $(\alpha_{i1}, \lambda_{i1}), \ldots, (\alpha_{iK}, \lambda_{iK})$ are tied when $\delta_i = 0$, whereas under $\delta_i \neq 0$ some of them are different from each other (although they still arise from the same marginal distribution). The GaGa model replaces the hyper-parameter $\alpha$ of the Ga model by the pair $(\beta, \mu)$. That is, the additional flexibility is achieved with only one more hyper-parameter. We complete the Bayesian model with hyper-priors:

\[
\alpha_0 \sim \mathrm{Ga}(a_{\alpha_0}, b_{\alpha_0}); \quad \nu \sim \mathrm{IGa}(a_\nu, b_\nu); \quad \beta \sim \mathrm{Ga}(a_\beta, b_\beta); \quad \mu \sim \mathrm{IGa}(a_\mu, b_\mu); \quad \pi \sim \mathrm{Dirichlet}(p). \tag{3}
\]

We believe that eliciting prior distributions can be advantageous in microarray studies, since there usually is some degree of prior knowledge. For example, the investigator may have an idea about what proportion of DE genes to expect. As another example, many normalization procedures result in values that are below 15 on the log-scale. The use of hyper-priors is not critical. Alternatively, $(\alpha_0, \nu, \beta, \mu, \pi)$ can be fixed by an empirical Bayes argument, using an expectation-maximization algorithm completely analogous to that for the Ga model (Kendziorski et al. (2003), Appendix). We implement both methods in our gaga library. Microarray datasets are strongly informative about the parameters in (3), since they are common to all genes. In the datasets in Section 6 we found fairly similar results for two reasonable prior specifications and the empirical Bayes method.

2.2 Posterior distributions

We derive the posterior distribution of the first-stage parameters, assuming fixed hyper-parameters $\omega = (\alpha_0, \nu, \beta, \mu, \pi)$. From (2) we see that, conditional on $\omega$, the gene-specific parameters $(\delta_i, \alpha_{i1}, \ldots, \alpha_{iK}, \lambda_{i1}, \ldots, \lambda_{iK})$ are independent a posteriori across genes $i = 1, \ldots, n$. Therefore, it suffices to derive the posterior for each gene separately. We denote the vector of parameters for a single gene as $\lambda_i = (\lambda_{i,1}, \ldots, \lambda_{i,K})$, $\alpha_i = (\alpha_{i,1}, \ldots, \alpha_{i,K})$ and we let $\lambda = (\lambda_1, \ldots, \lambda_n)$, $\alpha = (\alpha_1, \ldots, \alpha_n)$ be the collection of these parameters.

Let $N_{\delta_i}$ be the number of groups that are distinct under pattern $\delta_i$. In our example in (1) we have $H = 3$ patterns: under pattern 0 we have $N_0 = 1$ distinct group, and similarly $N_1 = 2$, $N_2 = 3$. Let $S_{i,\delta_i,k}$ for $i = 1, \ldots, n$, $\delta_i = 0, \ldots, H-1$ and $k = 1, \ldots, N_{\delta_i}$ be the sum of observations from gene $i$ that under pattern $\delta_i$ correspond to the $k$th distinct group, $P_{i,\delta_i,k}$ be the product of the same observations and $J_{i,\delta_i,k}$ be the number of terms in the sum. In our example $S_{10,0,1}$ denotes the sum of all observations from gene 10 (since under pattern 0 there is only one distinct group), $S_{10,1,1}$ denotes the sum of observations from normal samples (since it is the first distinct group under pattern 1) and $S_{10,1,2}$ the sum from cancers of type A and B.

The posterior probability that gene $i$ follows expression pattern $l$, which we denote as $v_{il}$, is given by $v_{il} = P(\delta_i = l \mid x_i, \omega) \propto f(x_i \mid \delta_i = l, \omega)\, \pi_l$ for $l = 0, \ldots, H-1$, where

\[
f(x_i \mid \delta_i, \omega) = \left[ \frac{(\alpha_0/\nu)^{\alpha_0} (\beta/\mu)^{\beta}}{\Gamma(\alpha_0)\Gamma(\beta)} \right]^{N_{\delta_i}} \prod_{k=1}^{N_{\delta_i}} \frac{1}{C\big(J_{i,\delta_i,k},\, \beta,\, \beta/\mu - \log(P_{i,\delta_i,k}),\, \alpha_0,\, \alpha_0/\nu,\, S_{i,\delta_i,k}\big)}, \tag{4}
\]

and $C(\cdot)$ is the gamma shape normalization constant, defined in Section 4.
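The posterior distribution of $(\alpha_{ik}, \lambda_{ik})$ conditional on $\delta_i$ is

\[
\begin{aligned}
\alpha_{i,k} \mid \delta_i, \omega, x_i &\sim \mathrm{GaS}\big(J_{i,\delta_i,k},\, \beta,\, \beta/\mu - \log(P_{i,\delta_i,k}),\, \alpha_0,\, \alpha_0/\nu,\, S_{i,\delta_i,k}\big),\\
\lambda_{i,k} \mid \alpha_{i,k}, \delta_i, \omega, x_i &\sim \mathrm{IGa}\big(\alpha_{i,k} J_{i,\delta_i,k} + \alpha_0,\, \alpha_0/\nu + \alpha_{i,k} S_{i,\delta_i,k}\big).
\end{aligned} \tag{5}
\]

For any given $\omega$, (5) can be used to obtain posterior credibility intervals in the usual fashion. Note that $\alpha_{i,k}$ and $\lambda_{i,k}$ are not conditionally independent a posteriori given $(\delta_i, \omega)$ as they are a priori.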
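The bookkeeping behind (5) can be sketched in a few lines. This is a minimal illustration, not the gaga implementation: the function names, the dictionary encoding of a pattern and every numerical value are assumptions made for the example. It computes the sum, log-product and count of observations for each distinct group under a pattern, and then draws the mean parameter from its inverse gamma conditional given a value of the shape parameter.

```python
import math
import random


def sufficient_stats(x, groups, pattern):
    """Per distinct group under a pattern: sum S, log-product log P and count J.

    x       : expression values for one gene, one per microarray
    groups  : group label z_j of each microarray
    pattern : dict mapping each group label to a distinct-group index 0..N-1
    """
    n_distinct = len(set(pattern.values()))
    stats = [{"S": 0.0, "logP": 0.0, "J": 0} for _ in range(n_distinct)]
    for value, z in zip(x, groups):
        s = stats[pattern[z]]
        s["S"] += value
        s["logP"] += math.log(value)
        s["J"] += 1
    return stats


def lambda_posterior_params(alpha, stat, alpha0, nu):
    """Shape and scale of the inverse gamma for lambda given alpha, as in (5)."""
    return alpha * stat["J"] + alpha0, alpha0 / nu + alpha * stat["S"]


def draw_lambda(alpha, stat, alpha0, nu, rng):
    """If G ~ Gamma(shape, rate = scale_ig) then 1/G ~ IGa(shape, scale_ig)."""
    shape, scale_ig = lambda_posterior_params(alpha, stat, alpha0, nu)
    # random.gammavariate takes (shape, scale), so the scale is 1/rate.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale_ig)


# Hypothetical gene measured on 2 normal, 2 cancer A and 2 cancer B arrays.
x = [7.9, 8.1, 9.0, 9.4, 9.1, 9.3]
groups = ["normal", "normal", "A", "A", "B", "B"]
pattern1 = {"normal": 0, "A": 1, "B": 1}    # Normal != Cancer A = Cancer B
stats = sufficient_stats(x, groups, pattern1)
draw = draw_lambda(alpha=50.0, stat=stats[1], alpha0=4.5, nu=0.3,
                   rng=random.Random(0))
```

Under this pattern the two cancer groups share one pair of parameters, so their four observations are pooled into a single set of statistics.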

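2.3 Model fitting

The posterior distribution of $\omega = (\alpha_0, \nu, \beta, \mu, \pi)$ given $(\delta, \alpha, \lambda)$ and $x$ is characterized as follows. We find

\[
\begin{aligned}
\alpha_0 \mid \delta, \alpha, \lambda, x &\sim \mathrm{GaS}\Big(\sum_{i=1}^{n} N_{\delta_i},\, a_{\alpha_0},\, b_{\alpha_0} + S_{\log\lambda},\, a_\nu,\, b_\nu,\, S_{1/\lambda}\Big),\\
\nu \mid \alpha_0, \delta, \alpha, \lambda, x &\sim \mathrm{IGa}\Big(a_\nu + \alpha_0 \sum_{i=1}^{n} N_{\delta_i},\, b_\nu + \alpha_0 S_{1/\lambda}\Big).
\end{aligned} \tag{6}
\]

The hyper-parameters $(\beta, \mu, \pi)$ are conditionally independent of $(\alpha_0, \nu)$ given $(\delta, \alpha, \lambda, x)$, with

\[
\begin{aligned}
\beta \mid \delta, \alpha, \lambda, x &\sim \mathrm{GaS}\Big(\sum_{i=1}^{n} N_{\delta_i},\, a_\beta,\, b_\beta - S_{\log\alpha},\, a_\mu,\, b_\mu,\, S_\alpha\Big),\\
\mu \mid \beta, \delta, \alpha, \lambda, x &\sim \mathrm{IGa}\Big(a_\mu + \beta \sum_{i=1}^{n} N_{\delta_i},\, b_\mu + \beta S_\alpha\Big)
\end{aligned} \tag{7}
\]

and

\[
\pi \mid \delta, \alpha, \lambda, x \sim \mathrm{Dirichlet}\Big(p_0 + \sum_{i=1}^{n} I(\delta_i = 0),\, \ldots,\, p_{H-1} + \sum_{i=1}^{n} I(\delta_i = H-1)\Big) \tag{8}
\]

conditionally independent of $(\beta, \mu)$. Here $S_{\log\lambda} = \sum_{i=1}^{n} \sum_{k=1}^{N_{\delta_i}} \log(\lambda_{i,k})$, $S_{1/\lambda} = \sum_{i=1}^{n} \sum_{k=1}^{N_{\delta_i}} 1/\lambda_{i,k}$, $S_\alpha = \sum_{i=1}^{n} \sum_{k=1}^{N_{\delta_i}} \alpha_{i,k}$ and $S_{\log\alpha} = \sum_{i=1}^{n} \sum_{k=1}^{N_{\delta_i}} \log(\alpha_{i,k})$ are sums over all distinct $\lambda_{i,k}$ and $\alpha_{i,k}$.

Together with the posteriors given in Section 2.2, this allows us to implement a Gibbs sampling scheme to fit the model (Gelfand and Smith, 1990). The Gibbs sampler is defined by iterative sampling of $(\delta, \alpha, \lambda)$ given $\omega$, and sampling $\omega$ given $(\delta, \alpha, \lambda)$.

3 The MiGaGa model

The GaGa model addresses the problem illustrated in Figure 1 by allowing varying CVs across genes. However, another limitation remains. A well-known feature of the GCRMA normalization procedure (Wu et al., 2004) is that it creates a distinctly bimodal distribution for the gene expressions. Figure 2(a) shows the empirical distribution of $x_{ij}$ for the Armstrong dataset (see Section 6.2), and compares it with the prior predictive under the GaGa model. The model does not capture the bimodality. To address this limitation we introduce a further generalization, by letting $\lambda_{i,k}$ arise from a mixture

\[
\lambda_{i,k} \mid \delta_i, \rho, \alpha_0, \nu \sim \sum_{m=1}^{M} \rho_m\, \mathrm{IGa}(\alpha_{0m}, \alpha_{0m}/\nu_m), \qquad \rho \sim \mathrm{Dirichlet}(r) \tag{9}
\]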

and specifying the following priors

\[
\begin{aligned}
\alpha_{0m} &\sim \mathrm{Ga}(a_{\alpha_0}, b_{\alpha_0}), \quad \text{for } m = 1, \ldots, M \text{ indep.},\\
\nu_m &\sim \mathrm{IGa}(a_\nu, b_\nu), \quad \text{for } m = 1, \ldots, M \text{ indep.}
\end{aligned} \tag{10}
\]

The rest of the model is as in (2) and (3). The statement of posterior distributions and the Gibbs sampler are largely analogous to those for the GaGa prior. The main difference is that for the MiGaGa one introduces latent variables indicating the cluster to which each gene belongs. Compared to the GaGa prior, the additional flexibility in MiGaGa potentially allows us to obtain a better fit to the data, although this comes at the cost of increased model complexity and computational burden. Figure 2(a) shows how the MiGaGa prior predictive improves the GaGa fit substantially.

Figure 2: Armstrong dataset. (a): marginal distribution of the data (expression levels, log scale) vs. prior predictive of GaGa and MiGaGa with $M = 2$ and $\omega = \hat\omega$; (b): probe 1110_at, a gene with evidence for change in shape and no change in location. Posterior predictive under the GaGa model with 10 arrays per group, shown for ALL and MLL (the plotting symbols distinguish the 24 ALL observations from the 18 MLL observations).

4 The Gamma shape distribution

We define the distribution that arises as the posterior of the shape parameter of the gamma distribution under independent sampling and a gamma prior, as in (2). We assume that the gamma distribution is indexed by the shape and mean parameters. This distribution, which we refer to as the gamma shape distribution, has not been described before. It is similar to the distribution that arises when the parameterization is in terms of the shape and scale parameters (Damsleth, 1975; Miller, 1980). To simplify notation we denote by $y$ a positive continuous random variable that follows this distribution. Its probability density function, indexed by the parameters $a \geq 0$, $b, d, r, s > 0$, $c > -a \log(s/a)$, can be written as:

\[
f(y \mid a, b, c, d, r, s) = C(a, b, c, d, r, s)\, \frac{\Gamma(ay + d)}{\Gamma(y)^a} \left( \frac{y}{r + sy} \right)^{ay + d} y^{b - d - 1} e^{-yc}\, I(y > 0), \tag{11}
\]

where $C(a, b, c, d, r, s)$ is the normalization constant and $\Gamma(\cdot)$ is the gamma function. For $a = d = 0$, (11) simplifies to a gamma distribution. In general, to obtain random draws from (11) or to evaluate $C(a, b, c, d, r, s)$ one has to resort to numerical methods. This is impractical in our setup, since the posterior simulation requires repeated and computationally efficient simulation from a GaS distribution. Approximations are required to decrease the computational burden.

We start by deriving an approximation to (11) that is appropriate for large values of $y$. By approximating $\Gamma(\cdot)$ with Stirling's formula and evaluating the limit of the resulting expression as $y \to \infty$ we find that (11) is approximately proportional to

\[
y^{a/2 + b - 1/2 - 1} \exp\{-y\,(c + a \log(s/a))\}. \tag{12}
\]

One can obtain approximate random draws from (11) by drawing from a $\mathrm{Ga}(a/2 + b - 1/2,\; c + a \log(s/a))$. To approximate $C(a, b, c, d, r, s)$, denote as $g(y)$ the probability density function of the gamma approximation, and let $m$ be its mode. Evaluating $g$ and (11) at $m$ gives

\[
C(a, b, c, d, r, s) \approx g(m)\, \frac{\Gamma(m)^a}{\Gamma(am + d)} \left( \frac{m}{r + sm} \right)^{-(am + d)} m^{-b + d + 1} e^{mc}. \tag{13}
\]

Figure 3 shows the gamma shape distribution and its gamma approximation for two randomly selected cases that were encountered in posterior simulations for the MiGaGa model. The density in panel (a) arises as the posterior of the shape parameter for a single gene with a sample size of 5 observations per group, whereas panel (b) results as the posterior of a parameter shared by a large number of genes, i.e. it represents a situation with a large sample size. In both cases the approximation is very close. In the microarray datasets that we have analyzed so far the approximation worked well.
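The large-$y$ behavior behind (12) can be checked numerically. The following sketch is an illustration under assumptions rather than the gaga code: the function names and the parameter values are hypothetical, and the kernels are left unnormalized. It evaluates the logarithm of the kernel in (11) and of the gamma kernel in (12) and verifies that their difference flattens to a constant as $y$ grows, which is what proportionality in the limit means.

```python
import math


def gas_log_kernel(y, a, b, c, d, r, s):
    """Log of the unnormalized gamma shape density in (11)."""
    return (math.lgamma(a * y + d) - a * math.lgamma(y)
            + (a * y + d) * math.log(y / (r + s * y))
            + (b - d - 1.0) * math.log(y) - c * y)


def gamma_approx_params(a, b, c, s):
    """Shape and rate of the large-y gamma approximation suggested by (12)."""
    return a / 2.0 + b - 0.5, c + a * math.log(s / a)


def gamma_log_kernel(y, shape, rate):
    """Log of the unnormalized gamma density with the given shape and rate."""
    return (shape - 1.0) * math.log(y) - rate * y


# Hypothetical parameter values chosen so that c + a*log(s/a) > 0.
a, b, c, d, r, s = 5.0, 2.0, 6.0, 3.0, 1.0, 2.0
shape, rate = gamma_approx_params(a, b, c, s)


def log_ratio(y):
    """Exact kernel minus approximate kernel; flattens to a constant as y grows."""
    return gas_log_kernel(y, a, b, c, d, r, s) - gamma_log_kernel(y, shape, rate)


drift_small_y = abs(log_ratio(2.0) - log_ratio(4.0))
drift_large_y = abs(log_ratio(200.0) - log_ratio(400.0))
```

With these values the log-ratio still drifts noticeably between $y = 2$ and $y = 4$ but is nearly flat between $y = 200$ and $y = 400$, so the approximation is best when most of the mass of the distribution lies at large $y$.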
In some rare cases we detected that the mode of the approximation did not match that of (11) (indicated by the first derivative of $\log[f(y \mid a, b, c, d, r, s)]$ not being close to zero). In these cases we used a few Newton-Raphson steps to locate the mode and used the gamma approximation that matches the location of the mode as well as the value of the second derivative of $\log[f(y \mid a, b, c, d, r, s)]$ evaluated at the mode.

Figure 3: Gamma approximation to the gamma shape distribution (density of the gamma shape distribution and of its gamma approximation, plotted against $y$). Parameter values are (a): a=10; b=0.90; c= ; d= ; r= ; s=65.02; (b): a=1532, b=.16, c=3469, d=.16, r=.016, s=159.5.

5 Inference

5.1 Differentially expressed genes

We formalize inference for differential expression by minimizing the Bayesian false negative rate (BFNR) subject to an upper bound $\alpha$ on the Bayesian false discovery rate (BFDR) (Müller et al., 2007). Briefly, BFNR is the posterior expected proportion of genes declared EE (i.e. assigned to pattern 0) that are actually DE (i.e. do not follow pattern 0), and BFDR is the expected proportion of genes declared DE that are actually EE. This definition remains valid for more than two hypotheses. The Bayes rule is to declare a gene as DE whenever its posterior probability of DE is above a certain threshold. The result extends trivially to our multiple hypotheses setup with a slight adjustment: given that a gene is not classified into pattern 0, we propose assigning it to the pattern with the highest posterior probability. That is, for given BFDR and BFNR we maximize the number of genes correctly classified into their expression pattern.

Since the posterior probabilities in Section 2.2 are derived under an assumed probability model, deviations from the assumptions may result in poor performance of the procedure. We propose to assess its frequentist operating characteristics following the bootstrap scheme introduced by Storey (2007), which allows estimating the frequentist FDR for any given $\alpha$. In Section 6.3 we apply this procedure to a real dataset. For more details, see Storey (2007) and the supplementary material.

To ease the computational burden of the bootstrap-based procedure, we use an approximation. Instead of using $P(\delta_i = l \mid x)$ we use $v_{il} = P(\delta_i = l \mid x_i, \hat\omega)$ as given in (4), where $\hat\omega$ is the posterior mean of $\omega$. We have found this strategy to deliver very similar results to those from the exact version of the algorithm, but at a much lower computational cost.
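The decision rule of this section can be sketched as follows. This is a simplified illustration, not the gaga implementation: the function names and the posterior probabilities are hypothetical. Genes are added to the DE list in decreasing order of posterior probability of DE for as long as the Bayesian FDR of the list, computed as the average posterior probability of EE among declared genes, stays within the bound; each declared gene is then assigned to its most probable DE pattern.

```python
def declare_de(prob_de, max_bfdr):
    """Indices of genes declared DE, largest posterior probabilities first.

    Genes are added in decreasing order of P(DE) while the Bayesian FDR of the
    list, i.e. the average of 1 - P(DE) over declared genes, stays <= max_bfdr.
    """
    order = sorted(range(len(prob_de)), key=lambda i: prob_de[i], reverse=True)
    declared, expected_false = [], 0.0
    for i in order:
        new_false = expected_false + (1.0 - prob_de[i])
        if new_false / (len(declared) + 1) > max_bfdr:
            break
        declared.append(i)
        expected_false = new_false
    return declared


def assign_patterns(post_probs, max_bfdr):
    """Map each gene to pattern 0 (EE) or to its most probable DE pattern.

    post_probs[i][l] is the posterior probability v_il that gene i follows
    pattern l, with l = 0 meaning equally expressed.
    """
    prob_de = [1.0 - v[0] for v in post_probs]
    declared = set(declare_de(prob_de, max_bfdr))
    patterns = []
    for i, v in enumerate(post_probs):
        if i in declared:
            # Among DE patterns, pick the one with the highest posterior probability.
            patterns.append(max(range(1, len(v)), key=lambda l: v[l]))
        else:
            patterns.append(0)
    return patterns


# Hypothetical posterior probabilities for 5 genes and H = 3 patterns.
v = [[0.01, 0.90, 0.09],
     [0.02, 0.10, 0.88],
     [0.20, 0.50, 0.30],
     [0.97, 0.02, 0.01],
     [0.60, 0.30, 0.10]]
result = assign_patterns(v, max_bfdr=0.05)
```

Because genes are visited in decreasing order of probability, the running average of the EE probabilities can only grow, so stopping at the first violation yields the longest list that respects the bound.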

5.2 Class prediction

We set the goal of maximizing the number of future samples $x^* = (x^*_1, \ldots, x^*_n)$ that are correctly classified as type $z^*$. The Bayes rule is to assign the new sample to the type $k$ that has the highest posterior probability $P(z^* = k \mid x^*, x)$. As in Section 5.1, we use the approximation $u_k = P(z^* = k \mid x^*, x, \hat\omega) \propto f(x^* \mid z^* = k, \hat\omega, x)\, P(z^* = k)$, where $\hat\omega$ is the posterior mean of $\omega$, $f(x^* \mid z^* = k, \hat\omega, x)$ is the predictive distribution for the measurements of a sample of type $k$ and $P(z^* = k)$ is the prior probability. The prior probabilities can be based, for example, on the prevalence of the disease in the population under study, the presence of risk factors or the outcome of previous tests. For the GaGa model we find

\[
f(x^* \mid z^* = k, \hat\omega, x) = \prod_{1 \leq i \leq n} \sum_{l=0}^{H-1} f(x^*_i \mid z^* = k, \delta_i = l, \hat\omega, x_i)\, v_{il}, \tag{14}
\]

where

\[
f(x^*_i \mid z^* = k, \delta_i = l, \omega, x_i) = \frac{C\big(J_{i,\delta_i,k},\, \beta,\, \beta/\mu - \log(P_{i,\delta_i,k}),\, \alpha_0,\, \alpha_0/\nu,\, S_{i,\delta_i,k}\big)}{C\big(J_{i,\delta_i,k} + 1,\, \beta,\, \beta/\mu - \log(P_{i,\delta_i,k}) - \log(x^*_i),\, \alpha_0,\, \alpha_0/\nu,\, x^*_i + S_{i,\delta_i,k}\big)}. \tag{15}
\]

An interesting feature of (14) is that the classifier weights the contribution of each gene according to the posterior probabilities $v_{il}$, and that in particular genes with zero posterior probability of being DE do not contribute to the classifier. This is important from a practical standpoint, since frequently one wants to use only a subset of the genes. The classifier is robust with respect to how many genes are chosen. Similar expressions can be obtained for the MiGaGa model by averaging according to the cluster weights $\rho$.

6 Results

We assess the performance of the GaGa and MiGaGa models in simulated and real data. In Section 6.1 we revisit the Gould dataset that Kendziorski et al. (2003) originally used to illustrate the Ga model. In Section 6.2 we analyze the leukemia dataset of Armstrong et al. (2002), and in Section 6.3 we conduct two simulation studies based on this dataset.

In all analyses we tried two different prior specifications. Under the first prior we use $a_{\alpha_0} = b_{\alpha_0} = a_\nu = b_\nu = a_\beta = b_\beta = a_\mu = b_\mu = $ and all the elements of $p$ and $r$ equal to 0.1.
Second, we defined a slightly more informative prior taking into account that log-expression levels are rarely above 15 and that coefficients of variation (CV) are usually centered around  with large variance. We then found prior parameter values that were consistent with this information and that at the same time allowed for substantial prior uncertainty: a α0 = , b α0 = 10 4, a ν = 0.016, b ν = , a β = 0.004, b β = 10 3, a µ = and b µ = 20. The GaGa model yielded very similar parameter estimates and lists of differentially expressed genes under both priors. The MiGaGa model was slightly more sensitive to the prior specification, with the non-informative prior resulting in a higher posterior expectation for $\pi_0$ and a shorter list of DE genes. We present only the results arising from the more informative prior.

To fit the model we obtain 5,000 posterior samples, assess the convergence with trajectory plots and save the Monte Carlo draws once convergence is judged to have been reached (typically well before 1,000 samples for the GaGa model and 2,500 samples for the MiGaGa). We ran the Markov chain a second time with different starting values and verified that it converged to the same target distribution.

For differential expression analysis, we compare our methodology with the empirical Bayes procedure of Smyth (2005), adjusting the p-values via the beta-uniform mixture approach of Pounds and Morris (2003) (EBayes-BUM), and with two-sample Wilcoxon tests adjusting p-values both via BUM (Wilcoxon-BUM) and the Benjamini and Hochberg (1995) method (Wilcoxon-BH). We also perform class prediction, comparing our methodology with Fisher's linear discriminant analysis (LDA). We use EBayes as implemented in the R/Bioconductor functions lmFit and ebayes, BUM as implemented in Bum and BH as in p.adjust (libraries limma (Smyth, 2005), ClassComparison (Coombes, 2005) and stats, respectively). All methods were set up to control the FDR below .

6.1 Gould data

We used the already pre-processed version of the Gould dataset provided with the R library EBarrays. We fit the model to log-transformed data, since this reduced the effect of outliers and resulted in a better performance of the model. Data is available for 5,000 genes and 4 inbred lines: 2 parental and 2 offspring. The parental lines are a Copenhagen (COP) rat strain resistant to mammary carcinogenesis and a Wistar-Furth (WF) rat strain that is highly susceptible. The two offspring lines are obtained by crossing the parental lines, in such a way that they are homozygous COP/COP throughout the genome except for a small region in which they are homozygous WF/WF. In one of the offspring lines the COP/COP region is approximately 30cM (line CI) and in the other it is 1.5cM (line CII). Therefore, it seems reasonable to expect gene expression for the CI and CII lines to be somewhere between COP and WF. Further, CI should be closer to the COP line than CII, while CII should be closer to WF. The dataset contains 1 microarray for the COP group, 2 for CI, 5 for CII and 2 for WF.
To illustrate our approach we analyze a randomly chosen subset containing 2 microarrays from each of CI, CII and WF (we ignore the COP group for lack of replicates). This will allow us to assess how well the model fits the 3 CII microarrays that are not used to fit the model. For each gene, the experimenters considered four expression patterns:

Pattern 0: CI = CII = WF
Pattern 1: CI ≠ CII = WF
Pattern 2: CI = CII ≠ WF
Pattern 3: CI ≠ CII ≠ WF. (16)

6.1.1 Model fit

The Ga model estimates $\hat\alpha_0 = $ , $\hat\nu = 0.463$, $\hat\pi = (0.980, 0.001, 0.001, 0.018)$ and fixes $\hat\alpha_{i,j} = \hat\alpha = $ . The GaGa extension yields posterior means $\hat\alpha_0 = $ , $\hat\nu = 0.152$, $\hat\beta = 1.089$, $\hat\mu = $ and $\hat\pi = (0.999, 0, 0.001, 0)$. That is, the Ga model views 98.0% of the genes as being EE, while for the GaGa model the figure is 99.9%. Also, Ga assumes a constant within-groups CV of $1/\sqrt{\hat\alpha} = 0.047$, while GaGa estimates it to vary across genes as $1/\sqrt{\mathrm{Ga}(1.089,\ )}$, indicating that the CVs are not constant. In Figure 1(a) we detected lack of fit of the Ga model at an overall level, which the GaGa model overcomes.

Figure 4: Gould data. Observed expression values vs. predictive distribution, by group (CI, CII, WF), for the two genes with highest posterior probability of DE according to the Ga model. (a): Ga model; (b): GaGa model. Large black symbols are actual observations, small gray symbols are draws from the posterior predictive.

Next we consider the fit for individual genes. In particular, one should make sure that the fit is reasonable for the genes that are declared to be DE. Otherwise the inference would be suspect. We select the two genes that are deemed the most interesting by the Ga model, i.e. those with lowest probability of being EE. We compare their observed expression levels with draws from the posterior predictive distribution for these two genes. Figure 4(a) reveals that the Ga model seriously underestimates the variability, inaccurately predicting that the observations fall into 3 clearly separated groups. Figure 4(b) presents the same plot for the GaGa model. Here the model-based predictions appropriately reflect the variability. We can no longer see any separation between the 3 groups. We conclude that these genes are found by Ga due to model lack-of-fit.

6.1.2 Differential expression analysis

The Ga model allocates 2 genes to pattern 1, 1 to pattern 2 and 78 to pattern 3, while GaGa allocates all genes to pattern 0. Under the GaGa model the largest probability of DE for any gene is . As we saw in Figure 1, the genes found by the Ga model tend to be those with CV above average. For comparison, performing F-tests via EBayes-BUM did not identify any DE genes either.

6.1.3 Class prediction

Since there is little evidence that any gene is differentially expressed, we do not expect this dataset to allow us to build a good classifier. We conducted a small simulation study that confirmed the lack of predictive power, regardless of how many genes were used to build the classifier.

6.2 Armstrong data

The data consisted of 24 Affymetrix U95A arrays from acute lymphoblastic leukemia (ALL) samples, 18 U95A arrays from lymphoblastic leukemia with MLL translocations (MLL), and 2 U95Av2 arrays also from the MLL group. The U95Av2 arrays were obtained at a later date than the rest, possibly under different experimental conditions, so we excluded them from the analysis. The dataset also contained samples with acute myelogenous leukemia, but for illustration we restrict attention to the ALL and MLL groups. The data was background corrected, normalized and summarized using the function just.gcrma from the R library gcrma (Wu and Irizarry, 2007). Note that different pre-processing algorithms result in different distributional shapes for the observed gene expression quantifications, and hence the quality of the model fit can be affected by the choice of pre-processing method. To explore the effect of the distributional shape, in addition to analyzing the GCRMA normalized data we apply a monotonic transformation to enforce unimodality and we analyze the result with a GaGa model. Within each microarray, the transformation maps sample quantiles to the corresponding quantiles of a gamma distribution with matching mean and variance.

6.2.1 Model fit

Figure 1(b) reveals a violation of the constant CV assumption of the Ga model, and shows that the model tends to flag genes with large CVs as DE. Figure 2(a) shows that a MiGaGa fit with $M = 2$ components describes the data better than a single-component GaGa. An analogous plot shows how the monotone transformation improves the fit of the GaGa model substantially.
This plot and further assessment of goodness-of-fit can be found in the supplementary material.

To study the behavior of the methods under small sample sizes and evaluate the reliability of the results, we start by fitting the model to 5 randomly chosen arrays from each group. We then add 5 more arrays per group, then 10 and finally we analyze the full dataset. For the GCRMA data with 5 arrays per group the GaGa posterior means are $\hat\alpha_0 = 4.520$, $\hat\nu = 0.314$, $\hat\beta = 0.826$, $\hat\mu = $ and $\hat\pi = (0.954, 0.046)$. That is, the gene-specific shape parameters are estimated to arise from a gamma with mean  and standard deviation . With more than 5 arrays similar $(\hat\alpha_0, \hat\nu, \hat\beta, \hat\mu)$ were obtained. However, $\hat\pi_1$ increased to 0.121,  and  when analyzing 10 arrays, 15 arrays and the full dataset, respectively. The MiGaGa estimates behaved in a similar manner, and so did the GaGa estimates for the monotonically transformed data, obtaining similar $\hat\pi_1$ in all models. Since we did not observe this phenomenon on simulated data, we believe that it is due to some component of the model being misspecified, e.g. assuming conditional independence of $\delta_i$ given $\omega$. In our experience, in real datasets many methods provide estimated proportions of DE genes that change widely with sample size. For instance, the EBayes-BUM and Wilcoxon-BUM estimates increase from  to  and from  to 0.333, respectively. Compared to these two other procedures, under the GaGa and MiGaGa models $\hat\pi_1$ is relatively stable across sample sizes.

6.2.2 Differential expression analysis

Table 1 shows the number of genes declared DE when analyzing a subset of 5, 10 and 15 arrays per group, as well as the full dataset. The table also provides the percentage of reproducibility, i.e. how many amongst those genes were found again when analyzing the full dataset. For instance, with 5 arrays per group MiGaGa found 339 genes, 79.6% of which were confirmed in the full data. We see that the three variants of our model find more genes than EBayes-BUM, Wilcoxon-BUM and Wilcoxon-BH, with a reproducibility around 80%. The reproducibility of the competing methods is very high for small sample sizes but it diminishes as more data becomes available. In fact, they find fewer DE genes in the full dataset than in a subset with 15 arrays per group.

Table 1: Gene discoveries in the Armstrong dataset. # DE: number of genes declared DE; % rep.: percentage of # DE also found when analyzing the full dataset. ODP reports the mean of two analyses, each using B=100 permutations.
Columns: 5 arrays (# DE, % rep.); 10 arrays (# DE, % rep.); 15 arrays (# DE, % rep.); All data (# DE).
Rows: GaGa; MiGaGa (M = 2); GaGa (transf.); EBayes-BUM; Wilcoxon-BUM; Wilcoxon-BH.

One should be careful in comparing the number of genes detected by each method, for our models compare not only mean expression but also the distributional shape between ALL and MLL samples. To highlight this feature we inspect the 1096 genes that were reported as DE under the GaGa model with 10 arrays per group. Figure 2(b) shows the predictive distribution for a gene that has a 95% credibility interval for $\lambda_{i1} - \lambda_{i2}$ containing 0 and an interval for $\alpha_{i1}/\alpha_{i2}$ not containing 1.
This is a gene with stronger evidence of a difference in shape between groups than in location. The observed data, which include the 22 samples not used to fit the model, indeed suggest a larger tail for ALL than for MLL samples.

Class prediction

We select the 100 genes with the largest posterior probability of being differentially expressed according to each of the three models (GaGa, MiGaGa and GaGa applied to the transformed data) when fit with 5 observations per group, and predict the class for the rest of the dataset. The three models correctly classify all samples. The same result is

found when only using the top 10 genes to build the classifier. For comparison, using Fisher's Linear Discriminant Analysis with the 5 genes having smallest p-values according to Wilcoxon (BH) correctly classifies 16 ALL and 13 MLL samples.

6.3 Simulation study

We conduct two simulation studies. First, we assess the frequentist operating characteristics of the Bayesian procedure. For this purpose, we apply the bootstrap procedure described in Storey (2007) to the Armstrong dataset, so that in the generated data all genes are equally expressed. We then compute posterior probabilities of differential expression for each gene (setting $\omega$ to its posterior mean) and apply the Bayes rule of Müller et al. (2004), as described in Section 5.1, controlling the Bayesian FDR at 5%. We repeat this process 500 times and we obtain an estimate for the frequentist FDR, as described in Storey (2007). We find that the GaGa model appropriately controls the frequentist FDR below the desired 5%, both when applied to the original and the monotonically transformed data. MiGaGa controlled the FDR below 5% when analyzing 5, 10 and 15 arrays per group, but when analyzing the full dataset the estimated frequentist FDR was 6.4%. For more details, see the supplementary material.

For the second simulation study, we generate 12,626 gene expression values for 5 MLL and 5 ALL samples from a GaGa model. We set $\omega$ to its estimated value for the Armstrong dataset: $\alpha_0 = 4.520$, $\nu = 0.314$, $\beta = 0.826$, $\mu$ = and $\pi = (0.954, 0.046)$. Fitting the GaGa model to the simulated data provides the posterior means $\hat{\alpha}_0 = 4.630$, $\hat{\nu} = 0.312$, $\hat{\beta} = 0.906$, $\hat{\mu}$ = and $\hat{\pi} = (0.955, 0.045)$. All the 95% credibility intervals contained the true value, except the one for $\beta$ which was (0.882, 0.930). MiGaGa with $M = 2$ estimates $\hat{\rho}$ = (0.004, ), i.e. it correctly assigns a negligible posterior probability to one of the clusters. The posterior means are $\hat{\alpha} = 4.618$, $\hat{\nu}$ = (for the second cluster), $\hat{\beta} = 0.870$, $\hat{\mu} = 514.6$, $\hat{\pi} = (0.956, 0.044)$.
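The Bayesian FDR control used in the first simulation study can be illustrated with a minimal sketch. It assumes that the posterior probabilities of differential expression are already available as an array; the function name `bayesian_fdr_select` and its arguments are hypothetical and do not come from the paper or from any package. Genes are ranked by posterior probability, and the largest set whose average posterior probability of being a false discovery stays at or below the target is declared DE.

```python
import numpy as np


def bayesian_fdr_select(post_prob_de, fdr=0.05):
    """Select genes so that the posterior expected FDR is at most `fdr`.

    post_prob_de: posterior probability of differential expression per gene.
    Returns a boolean mask of selected genes and the expected FDR of that set.
    """
    p = np.asarray(post_prob_de, dtype=float)
    order = np.argsort(-p)            # most probable DE genes first
    err = 1.0 - p[order]              # posterior probability that each call is false
    # Expected FDR when declaring the top k genes, for k = 1, ..., n.
    running_fdr = np.cumsum(err) / np.arange(1, p.size + 1)
    # err is sorted increasingly, so running_fdr is non-decreasing and the
    # count of entries below the target gives the largest admissible k.
    n_sel = int(np.sum(running_fdr <= fdr))
    selected = np.zeros(p.size, dtype=bool)
    selected[order[:n_sel]] = True
    expected_fdr = float(running_fdr[n_sel - 1]) if n_sel > 0 else 0.0
    return selected, expected_fdr


# Toy usage with made-up posterior probabilities.
mask, efdr = bayesian_fdr_select([0.99, 0.98, 0.90, 0.50, 0.10], fdr=0.05)
```

Thresholding the running average rather than each probability separately lets a gene with moderate evidence be included when the genes ranked above it have very strong evidence.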
GaGa and MiGaGa tagged 469 and 468 genes as differentially expressed, respectively, 4.5% of which were false positives. EBayes-BUM found 449 genes (7.1% false positives), whereas Wilcoxon-BUM and Wilcoxon-BH did not find any significant genes.

7 Discussion

We have introduced two simple extensions of the Ga model. The GaGa model relaxes the constant coefficient of variation assumption. This results in a parsimonious model, which substantially improves the quality of the model fit and reliability of the resulting inference. The increased generality comes at a negligible computational cost. We derived an approximation for the posterior distribution of the gamma shape parameter that further simplifies computation. The second extension, the MiGaGa, increases the model flexibility by incorporating a mixture prior, at the expense of model parsimony. In practice, a mixture with as few as two components may suffice to provide a satisfactory fit, as we have illustrated with the Armstrong dataset. Further, we have shown how to improve the quality of the GaGa fit by using a simple monotonic transformation that guarantees unimodality of the marginal distribution of the data.
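To make the constant coefficient of variation assumption concrete, a short worked derivation may help. It assumes that the observations $x_{ij}$ for gene $i$ follow a gamma distribution with shape $\alpha_i$ and mean $\lambda_i$; this parameterization is adopted here for illustration only.

$$
x_{ij} \sim \mathrm{Ga}\!\left(\alpha_i, \frac{\alpha_i}{\lambda_i}\right)
\;\Rightarrow\;
E(x_{ij}) = \lambda_i, \quad
\mathrm{Var}(x_{ij}) = \frac{\lambda_i^2}{\alpha_i}, \quad
\mathrm{CV}(x_{ij}) = \frac{\sqrt{\mathrm{Var}(x_{ij})}}{E(x_{ij})} = \frac{1}{\sqrt{\alpha_i}}.
$$

Under this parameterization, fixing $\alpha_i = \alpha$ for all genes forces every gene to have the same coefficient of variation $1/\sqrt{\alpha}$, whereas gene-specific shape parameters let it vary from gene to gene.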

The hierarchical nature of the models allows for the sharing of information across genes. In simulations and in real data we have shown how both GaGa and MiGaGa find more genes than three competing methods. For instance, when analyzing a subset with 5 arrays per group from the Armstrong dataset we detect around 200 differentially expressed genes, while the most that any competing method finds is 47. The fact that around 80% of these genes were found again when analyzing the full data gives us confidence that these are not spurious findings. Also, we conducted simulations under the null hypotheses which estimated the FDR to be around the desired 5%.

The differences in the number of genes found by each method are partly due to the fact that they are testing different hypotheses. Our models not only seek differences in mean expression as the competing methods do, but also test for differences in the distributional shape. We believe that this may frequently be of biological interest, since many mutations, deletions and translocations affect only a proportion of the diseased individuals, and hence one expects to see differences in the tail behavior between groups. Since our models are built to be sensitive to tail behavior, the presence of outlying values can have an effect on the inference. If the experimenter believes that differences in shape lack biological relevance and hence that this sensitivity is undesirable, one can easily use the model output to focus on genes with location shift. For instance, one can compute posterior credibility intervals for the difference between group means and disregard those genes for which it contains zero.

In addition, we have shown a fully Bayesian approach for class prediction. The approach allows one to specify prior probabilities that take into account the prevalence of the disease under study and the outcome of previous tests, for instance. In the Gould dataset the model revealed that the data lacked predictive power, while in the Armstrong dataset the proposed approach correctly classified all the 32 samples that had not been used to fit the model.
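The role of the prior class probabilities in the class prediction approach can be sketched with Bayes' theorem. The example below assumes that the log predictive density of a new sample under each class has already been computed from the fitted model; the function name `posterior_class_probs` and its inputs are hypothetical, and the numbers are made up for illustration.

```python
import numpy as np


def posterior_class_probs(log_pred_lik, prior):
    """Combine per-class log predictive densities with prior class probabilities.

    log_pred_lik: log predictive density of the new sample under each class.
    prior: prior probability of each class (e.g. reflecting disease prevalence).
    Returns the posterior probability of each class.
    """
    log_post = np.asarray(log_pred_lik, dtype=float) + np.log(np.asarray(prior, dtype=float))
    log_post -= log_post.max()        # subtract the maximum for numerical stability
    post = np.exp(log_post)
    return post / post.sum()


# Toy usage: the same evidence combined with two different priors.
evidence = np.log([0.2, 0.1])
equal_prior = posterior_class_probs(evidence, [0.5, 0.5])
low_prevalence = posterior_class_probs(evidence, [0.1, 0.9])
```

Working on the log scale matters in practice because a predictive density built from many genes is a product of many small terms and would underflow if multiplied directly.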
In both cases, we showed how the classifier is robust with respect to the number of genes used to build it.

As a limitation, we have not explicitly modeled the dependence between genes. In datasets with strong correlations, we expect that this may have a stronger effect on class prediction than on inference about gene expression. Not learning about the dependence structure also limits the use of the model in finding gene networks or gene interactions. Interesting future work will be to include dependence. Other possibilities are extending the model to include covariate information and study-specific random effects, which would make it appealing for meta-analysis purposes, or using the model for sample size calculation as in Müller et al. (2004) or sequential sample size calculation. In the latter application, the computational efficiency of the GaGa model should prove a major asset.

Acknowledgments

We thank Peter Müller for his very useful comments.

References

S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. Boer, M.D. Minden, E.S. Sallan, E.S. Lander, T.R. Golub, and S.J. Korsmeyer. MLL translocations specify a

distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30:41-47.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57.

M. Chiogna, M.S. Massa, and C. Romualdi. Effect of normalizations on detecting differentially expressed genes with cDNA microarray experiments. Technical report, Universita degli studi di Padova, Dipartimento di scienze statistiche.

K.R. Coombes, J. Wang, and K.A. Baggerly. A statistical method for finding biomarkers from microarray data, with application to prostate cancer. Technical report, M.D. Anderson Cancer Center. URL utmdabtr00704.pdf.

Kevin R. Coombes. ClassComparison: Classes and methods for class comparison problems on microarrays. R package version 1.1.

E. Damsleth. Conjugate classes for gamma distributions. Scandinavian Journal of Statistics, 2:80-84.

S. Dudoit, H.Y. Yang, M.J. Callow, and T.P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12.

A.E. Gelfand and A.F.M. Smith. Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85.

C.M. Kendziorski, M.A. Newton, H. Lan, and M.N. Gould. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine, 22.

J. Lapointe, C. Li, J.P. Higgins, M. Rijn, E. Bair, K. Montgomery, M. Ferrari, L. Egevad, W. Rayford, U. Bergerheim, P. Ekman, A.M. DeMarzo, R. Tibshirani, D. Botstein, P.O. Brown, J.D. Brooks, and J.R. Pollack. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proceedings of the National Academy of Science, 101.

I. Lönnstedt and T. Speed. Replicated microarray data. Statistica Sinica, 12(1).

R.B. Miller. Bayesian analysis of the two-parameter gamma distribution. Technometrics, 22:65-69.

P. Müller, G. Parmigiani, C. Robert, and J. Rousseau. Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association, 99.

P. Müller, G. Parmigiani, and K. Rice. FDR and Bayesian Multiple Comparisons Rules. Oxford University Press.

M.A. Newton and C.M. Kendziorski. Parametric Empirical Bayes Methods for Microarrays. Springer Verlag, New York.

M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8:37-52.

M.A. Newton, A. Noueriry, D. Sarkar, and P. Ahlquist. Detecting differential gene expression with a semiparametric hierarchical mixture model. Biostatistics, 5.

G. Parmigiani, E.S. Garett, R.A. Irizarry, and S.L. Zeger, editors. The Analysis of Gene Expression Data. Springer.

S. Pounds and S.W. Morris. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics, 10.

G.K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3.

G.K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor. Springer, New York.

J.D. Storey. The optimal discovery procedure: A new approach to simultaneous significance testing. Journal of the Royal Statistical Society B, 69.

J. Wu and J.M.J. Irizarry, R. with contributions from Gentry. gcrma: Background Adjustment Using Sequence Information. R package.

Z. Wu, R.A. Irizarry, R. Gentleman, F.M. Murillo, and F. Spencer. A model based background adjustment for oligonucleotide expression arrays. Technical report, Johns Hopkins University, Dept. of Biostatistics.

Y. Zhao, M.C. Li, and R. Simon. An adaptive method for cDNA microarray normalization. Bioinformatics, 6:28.

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis

David Rossell
Department of Biostatistics, M.D. Anderson Cancer Center, Houston, TX 77030, USA
rosselldavid@gmail.com