Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset

Gene Expression

Liu, X. 1,2, Sivaganesan, S. 3, Yeung, K.Y. 4, Guo, J. 1, Bumgarner, R.E. 4, Medvedovic, Mario 1

1 Department of Environmental Health, University of Cincinnati, 3223 Eden Av. ML 56, Cincinnati OH 45267, 2 Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, Cincinnati, OH 45229, 3 Mathematical Sciences Department, University of Cincinnati, Cincinnati, OH 45221, 4 Department of Microbiology, University of Washington, Seattle, WA 98195.

ABSTRACT

Motivation: Identifying groups of co-regulated genes by monitoring their expression over various experimental conditions is complicated by the fact that such co-regulation is condition-specific. Ignoring the context-specific nature of co-regulation significantly reduces the ability of clustering procedures to detect co-expressed genes due to additional noise introduced by non-informative measurements.

Results: We have developed a novel Bayesian hierarchical model and corresponding computational algorithms for clustering gene expression profiles across diverse experimental conditions and studies that accounts for context-specificity of gene expression patterns. The model is based on the Bayesian infinite mixtures framework and does not require a priori specification of the number of clusters. We demonstrate that explicit modeling of context-specificity results in increased accuracy of the cluster analysis by examining the specificity and sensitivity of clusters in microarray data. We also demonstrate that probabilities of co-expression derived from the posterior distribution of clusterings are valid estimates of statistical significance of created clusters.

Availability: The open-source package gimm is available at http://eh3.uc.edu/gimm.
Contact: Mario.Medvedovic@uc.edu (to whom correspondence should be addressed)
Supplementary information: http://eh3.uc.edu/gimm/csimm

1 INTRODUCTION

Identifying and interpreting gene expression patterns and characterizing groups of co-expressed genes defining these patterns through cluster analysis has been a productive approach to learning from DNA microarray data. The results of such analyses have served to dissect regulatory mechanisms underlying co-expression, identify pathways involved in biological processes and annotate gene function. The quality of these results and conclusions is directly dependent on the quality of the clustering procedure used in the analysis. Since the advent of the microarray technology, virtually all traditional clustering approaches have been applied in this context and numerous new approaches have been developed (Yeung and Bumgarner 2004).

To identify subsets of co-expressed genes, most clustering procedures depend on either visual identification of clusters from patterns in a color-coded display (such as hierarchical clustering) or on the correct specification of the number of patterns present in data prior to the analysis (k-means and Self Organizing Maps). The most commonly used clustering procedures are ad-hoc by nature and incapable of separating statistically significant clusters from artifacts of random fluctuations in the data. On the other hand, clustering approaches based on the statistical modeling of the data often require the number of clusters to be specified in advance (Barash and Friedman 2002; McLachlan et al. 2002; Segal et al. 2003). When the correct number of clusters is estimated from the data, traditional methods fail to account for this significant source of variability in assessing the statistical significance of detected patterns (Medvedovic and Guo 2004).

Assessing the function of a gene product is a multidimensional endeavor whereby one may ascertain a number of properties including structure, the low-level function of a protein (i.e. kinase, protease, etc), and higher level function describing the biological processes in which the protein participates.
Identification of groups of co-expressed genes across diverse microarray datasets is a very promising strategy for assessing higher-level function of gene products. Such analysis is complicated by the fact that co-regulation is often condition-specific and may not extend across all conditions. The problem of context-specificity can be particularly pronounced when combining gene expression profiles across different experiments, tissue types or even different organisms to perform meta-cluster analysis (Segal et al. 2004; Segal et al. 2003; Stuart et al. 2003). In these situations, measurements of genes' expression under all conditions are not necessarily informative with regards to their co-regulation. Ignoring the local nature of co-regulation significantly reduces one's ability to detect co-regulated genes due to the noise introduced by non-informative measurements. Previously proposed solutions to this problem in terms of the context-specific Bayesian networks (Barash and Friedman 2002) and more general module networks (Segal et al. 2003) rely on the specification or estimation of the correct number of patterns. In these respects, they suffer from the same problems related to the estimation of the correct number of patterns in the finite mixture based clustering.

We developed a context-specific infinite mixture model (CSIMM) to allow clusters of co-expressed genes to be further grouped locally on subsets of experimental conditions that do not contribute any information about their differences. This approach makes use of the Bayesian infinite mixture framework (Medvedovic et al. 2004; Medvedovic and Sivaganesan 2002) to circumvent the issue of identifying the correct number of global and local patterns in the data. Infinite mixtures are one possible parametrization of semi-parametric Bayesian models with Dirichlet process priors (Neal, 2000) and the CSIMM described here can be thought of as a hierarchical Dirichlet process. The infinite mixtures framework facilitates averaging over models with different numbers of patterns, and the posterior distribution of clusterings incorporates uncertainties related to not knowing the correct number of clusters, either globally or locally. Consequently, the resulting posterior probabilities of co-expression offer a reliable assessment of the statistical significance of the groupings. We demonstrate the ability of the procedure to integrate information from diverse microarray experiments through a simulation study and by assessing the performance of algorithms in the context of functional annotation of genes based on their co-expression.

Oxford University Press 2005

2 METHODS

2.1 Motivation

Our goal is to identify clusters of genes exhibiting similar expression patterns across multiple microarray datasets. Each data set or context consists of a number of closely related microarray experiments that share a biological relationship and sample a limited range of perturbations to the system under study. For example, one data set may consist of measurements of gene expression at different time points after heat shock, while another may consist of measurements at different stages of the cell cycle. For the sake of discussion, we will refer to each data set as a context and to the entire collection of datasets as the global dataset.

Fig. 1. Simple context-specificity of expression patterns. (Columns: experiments E1 to E15, grouped into Context 1, Context 2 and Context 3.)

It is reasonable to assume that different regulatory programs are employed by different biological processes and that specific subsets of regulatory programs are needed to respond to a given type of perturbation. Some regulatory programs will respond to all the perturbations available within the global dataset while others will respond to none, one or a limited number of perturbations.
That is, some genes will be co-regulated on a global scale, while others (perhaps most) will be co-regulated on a local scale. We define a global clustering structure on a set of gene expression profiles by saying that two genes belong to the same global cluster if they share a common pattern of expression in all of the examined datasets. On the other hand, we define a local clustering structure as groups of genes that share a common expression pattern within a subset (a context) of the data but which do not group together when examined globally.

Figure 1 shows an example of the type of structure we might reasonably expect within gene expression data and the information (groupings) we would like to recover. For clarity, the example provided is overly simplified. It shows only three data sets (contexts), 20 genes, 4 global clusters and two local clusters within each context. Expression level is either high (coded light gray) or low (black). Local clusters within each context consist of genes that are either high or low within this context. For example, global Cluster 1 is formed of genes that are high within Context 1 and low within Contexts 2 and 3. On the other hand, Context 1 does not contribute any information about differences between global Clusters 1 and 4. We wish to be able to separate patterns in such global clusters even in the presence of data from many other noisy or non-informative contexts. We construct a Bayesian hierarchical model that describes the probability distribution of the data so that local and global clusters can be identified.

2.2 Context Specific Infinite Mixture Model (CSIMM)

Suppose that expression was measured for T genes across M experimental conditions. If $x_{ij}$ is the expression level of the i-th gene for the j-th experimental condition, then $x_i = (x_{i1}, x_{i2}, \ldots, x_{iM})$ denotes the complete expression profile for the i-th gene. Each gene expression profile can be viewed as being generated by one out of Q different underlying expression patterns.
Expression profiles generated by the same pattern form a cluster of similar expression profiles. If $c_i$ is the classification variable indicating the pattern that generates the i-th expression profile ($c_i = q$ means that the i-th expression profile was generated by the q-th pattern), then a clustering is defined by the set of classification variables for all gene expression profiles, $C = (c_1, c_2, \ldots, c_T)$. In our model, the q-th pattern is represented by the mean vector of the M-dimensional Gaussian distribution, $\mu_q = (\mu_{q1}, \ldots, \mu_{qM})$. Profiles clustering together (i.e. belonging to the same pattern) are assumed to be a random sample from the same multivariate Gaussian distribution. That is, $c_i = q$ implies that $x_i \sim N_M(\mu_q, \Sigma_q)$, where $\Sigma_q$ is the variance-covariance matrix of the M-dimensional multivariate Gaussian distribution.

Suppose further that each gene profile is partitioned into R sub-profiles. Without loss of generality we can assume that the first $r_1$ experimental conditions form the first sub-profile, experimental conditions $r_1 + 1$ to $r_1 + r_2$ form the second sub-profile, etc. That is, $x_i = (x_i^1, \ldots, x_i^R)$, where $x_i^j = (x_{i, r'_j + 1}, \ldots, x_{i, r'_j + r_j})$ and $r'_j = r_1 + \cdots + r_{j-1}$. The two most extreme cases are R=M and R=1. The case of R=1 is equivalent to the simple clustering in which the context structure is not defined. The case of R=M represents the situation in which each microarray hybridization represents a separate context.

The local structure of the co-expression patterns is specified by the Q by R matrix $L = (L_{qf})$, where $L_{qf} = t$ if global cluster q is placed in local cluster t within context f. Thus, within each context, we create groups of locally indistinguishable global clusters. All gene expression profiles contained in global clusters that are indistinguishable within a context form a local cluster of genes which are co-expressed within this context. The joint posterior distribution of all parameters in the model, including the global and local clustering variables C and L, given data is estimated using the Gibbs sampler (Gelfand and Smith 1990).
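The bookkeeping behind the sub-profiles and the local structure matrix L can be illustrated with a minimal Python sketch. It is not taken from the gimm package: the function names `split_into_contexts` and `local_clusters`, the zero-based indexing and the example matrix `L` are hypothetical choices made for illustration only.

```python
from itertools import accumulate

def split_into_contexts(profile, context_sizes):
    """Split one expression profile x_i into R sub-profiles (one per context)."""
    if sum(context_sizes) != len(profile):
        raise ValueError("context sizes must sum to the profile length M")
    # Offsets r'_j = r_1 + ... + r_{j-1}: index where context j starts.
    offsets = [0] + list(accumulate(context_sizes))[:-1]
    return [profile[o:o + r] for o, r in zip(offsets, context_sizes)]

def local_clusters(L, context):
    """Group global clusters q that share the same local label L[q][context]."""
    groups = {}
    for q, row in enumerate(L):
        groups.setdefault(row[context], []).append(q)
    return groups

# Toy profile with M = 15 conditions split into R = 3 contexts of 5 conditions.
x_i = [float(j) for j in range(15)]
subprofiles = split_into_contexts(x_i, [5, 5, 5])

# Hypothetical local structure for Q = 4 global clusters (rows) and
# R = 3 contexts (columns); entry L[q][f] is the local cluster label t.
L = [[0, 0, 0],
     [0, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
groups_in_first_context = local_clusters(L, 0)
```

In this toy matrix, global clusters 0, 1 and 3 carry the same label in the first context, so that context contributes no information for separating them, which is the situation the model is designed to exploit.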
The clusters of globally and locally co-expressed genes are formed based on the marginal posterior distributions of the classification variables C and L. Summarizing the sample of clusterings generated by the Gibbs sampler in mixture models is generally a non-trivial problem. We circumvent this problem by calculating posterior pairwise probabilities (PPPs) of co-expression for genes i and j as the proportion of the samples in which these two genes are clustered together (Medvedovic and Sivaganesan 2002). We then use these probabilities as the similarity measure to hierarchically cluster gene expression profiles by applying the average linkage principle. The mathematical specification of the model describing the distribution of the data and the specifics of the Gibbs sampler are given in the Appendix. All conditional probability distributions needed to run the Gibbs sampler are given in the Supplemental Materials.

2.3 Implementation

Computational procedures for performing CSIMM-based clustering are implemented in a standalone package gimm. The package consists of C++ code, a simple Java-based GUI and installation scripts, and it is available for both Linux/Unix and Windows platforms. The Windows version is available as a self-installing package. The software generates .cdt and .gtr files defining the hierarchical clustering that can be viewed and analyzed using the treeview program (Eisen et al. 1998). The Linux C++ code is designed to exploit the OpenMP parallelization when an appropriate compiler is installed. For Linux, we also developed the R package wrapper that facilitates using gimm within R. All packages and the source code can be freely downloaded from http://eh3.uc.edu/gimm. We discuss the computational complexity of the algorithm in the Web Supplement.

3 RESULTS

3.1 Simulation Study

The study was designed to compare different clustering procedures based on their ability to correctly separate simulated expression profiles into different clusters in repeated experiments.
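Because the comparisons in this section rest on the posterior pairwise probabilities (PPPs) introduced in Section 2.2, a minimal Python sketch of that summary may be useful. It assumes the Gibbs sampler output is available as a list of label vectors, one per retained iteration; the function names and the toy samples are hypothetical and are not part of gimm.

```python
def pairwise_coclustering(samples):
    """PPP matrix: fraction of sampled clusterings in which genes i and j
    receive the same cluster label."""
    n_genes = len(samples[0])
    counts = [[0] * n_genes for _ in range(n_genes)]
    for labels in samples:
        for i in range(n_genes):
            for j in range(n_genes):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_samples = len(samples)
    return [[c / n_samples for c in row] for row in counts]

def ppp_distance(ppp):
    """Distance 1 - PPP, suitable as input to average-linkage clustering."""
    return [[1.0 - p for p in row] for row in ppp]

# Four toy clusterings of four genes (labels are arbitrary within a sample).
samples = [[0, 0, 1, 1],
           [0, 0, 0, 1],
           [0, 1, 1, 1],
           [0, 0, 1, 1]]
ppp = pairwise_coclustering(samples)
dist = ppp_distance(ppp)
```

Only label equality within a sample is used, so the summary is unaffected by the arbitrary relabeling of clusters between iterations. The distance matrix can then be passed to any average-linkage hierarchical clustering routine.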
The problem is treated in the traditional statistical hypothesis-testing framework of assessing the probability that a procedure will correctly conclude that two expression profiles are generated by distinct patterns of expression (i.e. belong to two different clusters) while controlling the probability of falsely concluding that two profiles belong to different clusters when they are actually generated by the same pattern. Unlike in a traditional statistical hypothesis testing procedure, we do not supply the labels for profiles that are being compared.

We simulated data representing the structure depicted in Figure 1, where the heat map was taken to represent the values of the mean expression profiles in the corresponding cluster. Low expression levels (black) were set to 0 and the high expression levels (gray) were set to 1. For example, in each dataset, profile g1 was randomly drawn from the 15-dimensional Gaussian distribution whose mean vector is equal to 1 in the first 5 dimensions (e1, ..., e5) and 0 in the other 10 dimensions (e6, ..., e15). Data was simulated for different levels of noise (σ). The selected range of random noise allowed us to assess the performance of different approaches in easy and progressively more difficult (i.e. noisier) situations. 100 datasets were generated for each noise level.

We are focusing on the ability of a method to separate profiles in Cluster 1 from profiles in Cluster 2. This is the most difficult aspect of the analysis since Cluster 1 has only two profiles and differs from Cluster 2 only on 5 out of 15 experimental conditions. Methods are tested based on their ability to correctly conclude that profiles in Cluster 1 are different from profiles in Cluster 2. In a sense we are assessing the power of clustering procedures to conclude that profiles in Cluster 1 are different from profiles in Cluster 2. If we knew which profiles came from which cluster, we could perform a simple test of hypothesis to decide one way or another. In the unsupervised situation, we do not supply the membership but the goal is still the same.
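The simulation design can be sketched in Python as follows. Only the pattern of Cluster 1 (high in the first context, low elsewhere), the two profiles in Cluster 1, the 20 genes and the 15 conditions come from the text; the patterns of Clusters 2 to 4, the remaining cluster sizes and the use of independent noise with a common σ are assumptions made purely for illustration.

```python
import random

CONTEXT_SIZES = (5, 5, 5)  # three contexts of five experiments each

# High (1) / low (0) level of each global cluster in each context.
# Cluster 1 follows the text; clusters 2-4 are HYPOTHETICAL patterns.
CLUSTER_PATTERNS = {1: (1, 0, 0), 2: (1, 1, 0), 3: (0, 1, 1), 4: (1, 0, 1)}

# Cluster 1 has two profiles (from the text); other sizes are ASSUMED.
CLUSTER_SIZES = {1: 2, 2: 6, 3: 6, 4: 6}

def mean_profile(pattern):
    """Expand a per-context pattern into a 15-dimensional mean vector."""
    mean = []
    for level, size in zip(pattern, CONTEXT_SIZES):
        mean.extend([float(level)] * size)
    return mean

def simulate_dataset(sigma, rng):
    """Draw one dataset: Gaussian noise with s.d. sigma around cluster means."""
    profiles, labels = [], []
    for cluster, pattern in CLUSTER_PATTERNS.items():
        mu = mean_profile(pattern)
        for _ in range(CLUSTER_SIZES[cluster]):
            profiles.append([rng.gauss(m, sigma) for m in mu])
            labels.append(cluster)
    return profiles, labels

rng = random.Random(0)
datasets = [simulate_dataset(0.3, rng) for _ in range(100)]  # one noise level
```

With these assumed patterns, Cluster 1 and Cluster 2 differ only in the second context, that is on 5 of the 15 conditions, which mirrors the difficult comparison the study focuses on.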
The performance of different clustering procedures was assessed by constructing Receiver Operating Characteristic (ROC) curves that relate the probability of the clustering method to correctly separate profiles from different clusters and the probability of incorrectly separating profiles from the same clusters.

Fig. 2. ROC curves for different clustering approaches (Simple Infinite Mixtures, Euclidean Distance, Context-Specific IMM). (Axes: False Positive Rate and True Positive Rate; one panel per noise level σ, including σ = 0.3 and σ = 0.5.)

Let X be the posterior probability cut-off for separating profiles in Cluster 1 from Cluster 2. For a fixed cut-off point X, we consider that the clustering procedure is correctly concluding that a profile from Cluster 1 does not belong to Cluster 2 if its posterior pairwise probability of co-clustering with any single profile from Cluster 2 is less than X. That is, $\max\{p(c_i = c_j)$ for all profiles j from Cluster 2$\} < X$, where $p(c_i = c_j)$ denotes the posterior pairwise probability of co-expression for profiles i and j. We consider that the clustering procedure is incorrectly concluding that profile 1 and profile 2 from Cluster 1 do not belong in the same cluster if $p(c_1 = c_2) < X$. The true positive rate (TPR) is the proportion of times that a correct decision is made and the false positive rate (FPR) is the proportion of times that an incorrect decision is made. As the cut-off X is increased from 0 to 1, both the TPR and the FPR will increase. The area under the curve relating the TPR and FPR as X is increased from 0 to 1 describes the efficiency of a statistical procedure, with random decision-making having an area of 0.5 while the ideal statistical procedure would have an area equal to 1.

ROC curves in Figure 2 indicate that the context-specific infinite mixtures model significantly outperforms the simple mixture model in its ability to separate different patterns of expression while controlling the false positive rates. The difference between the simple infinite mixture and context-specific infinite mixtures is clearly due to the better representation of the underlying patterns offered by the context-specific model. Furthermore, over- and under-fitting the data by specifying too many or too few contexts has the expected consequences on the clustering results (Figure S1 in the Supplement). When placing each experiment in a separate context (over-fitting), the performance is actually worse than for the simple model. Failing to specify all contexts (two out of three) causes a reduction in the performance of the context-specific model, but it still outperforms the simple model.

Posterior pairwise probabilities are valid measures of statistical significance: In Figure 3 we plotted observed false positive rates against corresponding statistical significance levels from the CSIMM analysis. Given a significance level α, all gene-pairs whose PPP was lower than α were assumed to belong to different clusters. As the empirical false positive rates are always less than α, we conclude that PPPs based on CSIMM are valid measures of statistical significance at all noise levels. This is also true for the simple IMM, but not for the finite mixtures model (Figure S2 in the Supplement). Additionally we performed a similar analysis on 100 data sets in which all data points were generated from a single probability distribution, representing the situation when there is no clustering structure in the data (Random). As can be seen from the virtually perfect 45 degree line, PPPs correctly protect against Type I errors when there are no patterns in the data.

Fig. 3. Posterior probabilities as measures of statistical significance. The Random scenario corresponds to the situation in which all profiles were generated by the same multivariate normal probability distribution. (Panels A and B; horizontal axis: Significance Level α; curves per noise level σ, including σ = 0.3 and σ = 0.5, and Random.)

3.2 Yeast cell-cycle and sporulation data

Comparing the performance of different clustering procedures on real-world data is complicated by the lack of a gold standard (i.e. the correct clustering). We assessed our clustering results by forming functional groupings of genes based on the information available in the KEGG database of biological pathways (Kanehisa et al. 2004). The strength of association between such functional clusters and clusters of co-expressed genes formed by a clustering procedure was interpreted as the measure of the precision for the clustering procedure.

We constructed the test dataset by combining two microarray experiments assessing two distinct yet related biological processes. The first dataset is the yeast sporulation dataset (Primig et al. 2000) consisting of gene expression measurements at 8 and 7 time points throughout the sporulation process for two sporulation-competent yeast strains SK1 and W303 respectively. The second dataset is the cell-cycle dataset (Cho et al. 1998) consisting of gene expression measurements at 17 time-points spanning two complete yeast cell cycles. The two data sets were matched by identifying 6044 ORFs represented on both of the two versions of the Affymetrix microarrays used in these experiments. Data was mildly processed by setting any measurement below 1 to 1, log-transforming it and centering each gene's expression profile around zero for the two experiments separately. Genes which never reached the signal of 100 were excluded from the analysis, resulting in a total of 5685 genes remaining. 1044 ORFs represented genes associated with at least one KEGG pathway.

Fig. 4. ROC curves comparing the performance of different clustering approaches on the joint sporulation and cell cycle dataset. A) The curve relating true positive and false positive rates. B) The curve relating actual numbers of true positive and false positive pairs of co-clustered genes.
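The preprocessing steps described for the yeast data (flooring at 1, log transformation, per-experiment centering and removal of genes that never reach a signal of 100) can be sketched in Python as below. The log base, the order of the filtering and flooring steps, and the name `preprocess` are assumptions; the text does not specify them.

```python
import math

def preprocess(raw, floor=1.0, min_peak=100.0):
    """raw maps gene -> list of per-experiment lists of raw signals.

    Returns gene -> concatenated profile, log-transformed and centered
    around zero within each experiment separately.
    """
    kept = {}
    for gene, experiments in raw.items():
        # Exclude genes whose raw signal never reaches min_peak anywhere.
        if max(max(signals) for signals in experiments) < min_peak:
            continue
        profile = []
        for signals in experiments:
            # ASSUMPTION: base-2 logarithm; the text only says "log".
            logged = [math.log2(max(s, floor)) for s in signals]
            center = sum(logged) / len(logged)
            profile.extend(v - center for v in logged)
        kept[gene] = profile
    return kept

# Toy input: two genes measured in two experiments.
raw = {"g1": [[0.5, 4.0], [128.0, 512.0]],
       "g2": [[10.0, 20.0], [30.0, 40.0]]}
processed = preprocess(raw)
```

Centering within each experiment separately removes differences in overall signal level between the two studies, so that only the shape of each sub-profile enters the clustering.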

Data from the two experiments was clustered separately and jointly using the simple IMM approach, CSIMM and Euclidean distance-based hierarchical clustering (EDHC). For each hierarchical clustering, the tree was cut to create 1 to 5685 clusters. For a fixed number of clusters, a pair of genes (from the 1044 genes assigned to at least one pathway) belonging to the same cluster was assumed to be a true positive if the two genes both belonged to at least one specific KEGG pathway, and it was considered to be a false positive if they did not share a single KEGG pathway. True and false positive rates were then obtained by dividing the number of true/false positives by the total number of gene pairs sharing a common KEGG pathway and the total number of gene pairs not sharing a KEGG pathway respectively. When all genes are placed in their own individual clusters (5685 clusters), both true and false positive rates are equal to zero. As we reduce the number of clusters, both true and false positive rates increase, defining a ROC curve. At the extreme when all genes are placed in the same cluster, both true and false positive rates are equal to one.

Fig. 5. Gene expression levels (green-red heat map) and KEGG pathway memberships (blue heat map) for 54 genes which were co-clustered with at least one other gene after cutting the CSIMM-based tree at the average linkage distance of 5. (Expression columns: SK1, W303 and cell cycle time courses; pathway columns: Ribosome, Cell cycle, Purine metabolism, Pyrimidine metabolism, DNA Polymerase.)

ROC curves derived in such a way for each dataset/method combination for the statistically relevant false-positive rates (less than 5) are shown in Figure 4A. The global clustering methods, IMM and EDHC, both performed worse for the joint data analysis than when using the cell-cycle data alone. The ROC curve for the CSIMM indicated that this method was able to integrate information from both datasets into a single more precise analysis.
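The pair-based true and false positive rates defined above can be computed as in the following Python sketch. The function name `pair_rates`, the input layout (one cluster label and one set of pathway identifiers per gene) and the toy data are hypothetical; real pathway memberships would come from KEGG.

```python
from itertools import combinations

def pair_rates(cluster_of, pathways_of):
    """Return (TPR, FPR, n_true_pos, n_false_pos) over all gene pairs.

    A co-clustered pair is a true positive if the two genes share at least
    one pathway and a false positive if they share none.
    """
    tp = fp = n_sharing = n_not_sharing = 0
    for g, h in combinations(sorted(cluster_of), 2):
        shares_pathway = bool(pathways_of[g] & pathways_of[h])
        co_clustered = cluster_of[g] == cluster_of[h]
        if shares_pathway:
            n_sharing += 1
            tp += co_clustered
        else:
            n_not_sharing += 1
            fp += co_clustered
    tpr = tp / n_sharing if n_sharing else 0.0
    fpr = fp / n_not_sharing if n_not_sharing else 0.0
    return tpr, fpr, tp, fp

# Toy example with four annotated genes and hypothetical pathway labels.
pathways_of = {"a": {"p1"}, "b": {"p1", "p2"}, "c": {"p2"}, "d": {"p3"}}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
rates = pair_rates(cluster_of, pathways_of)
```

Evaluating `pair_rates` for every cut of the hierarchical tree, from one cluster per gene to a single cluster, traces out a ROC curve of the kind described in the text.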
The behavior of the overfitted model in which each microarray is treated as a separate context is in line with the behavior observed in the analysis of simulated data.

Due to the strong imbalance between the total number of positive pairs (30,336) and negative pairs (513,067), a relatively low FPR still results in a large number of false positive pairs in comparison to the number of true positive pairs. Therefore, we examined more closely the behavior of different clustering procedures by relating the absolute numbers of false and true positive pairs of genes (Figure 4B). The improvement in precision of the CSIMM over competing approaches when looking at this outcome for fewer than 10 false positive pairs is dramatic. Clusters of co-expressed genes at the highest ratio of true to false positives (191.6 with 5 false and 958 true positive pairs) along with corresponding KEGG pathways are displayed in Figure 5. The highest ratio of true to false positives was achieved when cutting the tree at the average linkage distance of 5. Genes were included in the heat map if they were assigned to at least one KEGG pathway and were co-clustered with at least one other gene from KEGG. The KEGG pathways implicated by these patterns are clearly related to the biologic processes under investigation (sporulation and cell cycle). Although our gold standard based on KEGG implicated 5 false positive pairs, closer examination of the genes in two clusters on the top of the heat map (RNR1, MCD1, POL1, CDC45 and POL1) reveals that the activity of all these genes is tightly regulated during the DNA replication process. This indicates that at this level of resolution, CSIMM creates perfect groupings of functionally related genes from KEGG.

Interestingly, the context-specific model for the cell-cycle data, in which gene expression profiles are split into two distinct cell cycles, performed better than the simple model when analyzing the cell-cycle data alone (Figure S3 in the supplement).
This could be a consequence of issues previously raised about the synchronization of cells in different microarray experiments characterizing gene expression signatures of the cell cycle (Cooper and Shedden 2003).

We examined broader patterns of expression implicated at this level of significance by clustering all 135 genes that were co-clustered with at least one other gene after cutting the tree at the average linkage distance of 5, regardless of their KEGG membership (Figure S4 in the supplement). In addition to KEGG pathway memberships, we examined correlations of the clusters generated by CSIMM with transcription factors shown to bind their promoters in Chip-on-Chip experiments (Lee et al. 2002). The hierarchical tree in Figure S4 was cut into 8 clusters. Six of these clusters had more than two genes and they were tested for overrepresentation of genes whose promoters are substrates of any one single transcription factor using Fisher's exact test. Eight transcription factors were significantly associated with at least one of the clusters and their functional roles are closely related to the biological processes examined, as well as the KEGG pathway associations, lending additional credibility to the clusters identified by CSIMM. In comparison, cutting the tree formed by the Euclidean distance-based hierarchical clustering to obtain 135 genes that were co-clustered with at least one other gene generated diffused patterns without any obvious clustering structure (Figure S5 in the supplement). Separate analyses of cell-cycle and sporulation data offered a similar picture (Figures S6, S7, S8, S9 in the supplement).

Fig. 6. Empirical false discovery rate as a function of the average linkage distance used to cut the CSIMM-based hierarchical clustering tree.

Finally, we investigated the validity of PPP-derived significance levels in deciding which clusters of genes are statistically significant. This was assessed by examining the proportion of false positive co-clusterings in clusters obtained by cutting the hierarchical tree at different levels of average-linkage distances derived from posterior pairwise probabilities of co-expression. If the tree is cut at the similarity level d, the average PPP between each gene in a cluster and all other genes in the same cluster is greater than (1-d). At the same time, the false discovery rate (FDR) is calculated as the proportion, out of all pairs of co-expressed genes implicated when the tree is cut at the average-linkage distance d, of pairs which did not share a single KEGG pathway. Plotting the false discovery rates against different values of d (Figure 6) indicates that the values of d approximate the empirical false discovery rates very well.

4 DISCUSSION

The most important distinguishing feature of the model described here lies in its ability to circumvent the difficult problem of identifying the correct number of local and/or global patterns in the data. Previously described context-specific models relied on different versions of penalized likelihood scores to estimate the correct number of patterns in the data. There are some obvious advantages of being able to identify the single most likely number of clusters. However, we previously demonstrated that our model-averaging results in a more accurate analysis than the clustering procedure in which the correct number of clusters is estimated from data.
Here we further demonstrate that the posterior distribution of clusterings offers a credible assessment of statistical significance of identified clusters and devise a practical approach for identifying statistically significant patterns in the data. This also simplifies the use of the model-based clustering since the whole procedure resembles simple hierarchical approaches.

The notion of context specificity introduced in our model is different from the two previously proposed context-specificity definitions. In the context-specific finite mixture model introduced by Barash and Friedman (2002), all uninformative measurements within a context are placed into a single default cluster. CSIMM instead forms distinct groups of global clusters within each context. The module network described by Segal et al. (2003) introduces a notion of context-specificity in which contexts are defined differently for different clusters and the distribution of all measurements within the same cluster and context is represented by the univariate Gaussian distribution. These two methods also facilitate estimation and modeling of the most likely context structure, while CSIMM at this point requires the context structure to be specified in advance. On the other hand, CSIMM uses globally defined contexts which are identical for all clusters, and the patterns within different contexts are described by multivariate Gaussian distributions. The distinction between the univariate and the multivariate definition of local patterns seems to be particularly important in situations in which distinct local clusters describe complex patterns such as time series or dose-response data.

ACKNOWLEDGEMENTS

The development of statistical models presented here has been supported by the grant 1R21HG002849 from NHGRI. Yeung is supported by NIH-NCI 1K25CA106988.

REFERENCES

Yeung,K.Y. et al. (2004) Pattern Recognition in Gene Expression Data. Recent Devel.Nucleic Acids Res. 1, 333-354.
Barash,Y. et al. (2002) Context-specific bayesian clustering for gene expression data. J Comput Biol. 9[2], 169-191.
McLachlan,G.J. et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18[3], 413-422.
Segal,E. et al. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat.Genet. 34, 166-176.
Medvedovic,M. et al. (2004) Bayesian Model-Averaging in Unsupervised Learning From Microarray Data. BIOKDD 2004.
Segal,E. et al. (2004) A module map showing conditional activity of expression modules in cancer. Nat Genet. 36[10], 1090-1098.
Stuart,J.M. et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302[5643], 249-255.
Medvedovic,M. et al. (2002) Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 18[9], 1194-1206.
Medvedovic,M. et al. (2004) Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 20[8], 1222-1232.
Neal,R.M. (2000) Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics 9, 249-265.
Gelfand,A.E. et al. (1990) Sampling-based approaches to calculating marginal densities. Journal of The American Statistical Association 85, 398-409.
Eisen,M.B. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc.Natl.Acad.Sci.U.S.A 95[25], 14863-14868.
Kanehisa,M. et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res. 32 Database issue, D277-D280.
Primig,M. et al. (2000) The core meiotic transcriptome in budding yeasts. Nat.Genet. 26[4], 415-423.
Cho,R.J. et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol.Cell 2[1], 65-73.
Cooper,S. et al. (2003) Microarray analysis of gene expression during the cell cycle. Cell Chromosome. 2[1], 1.
Lee,T.I. et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298[5594], 799-804.
Gelman,A. et al. (2003) Bayesian Data Analysis. Second edition.
Cowell,R.G. et al. (1999) Probabilistic Networks and Expert Systems.

APPENDIX: CSIMM MODEL

Context-specific infinite mixtures for clustering gene expression profiles cross diverse microrry dtset The sttisticl model describing the distribution of the dt is given in the form of Byesin hierrchicl model (Gelmn et l. 003). Dependencies between vrious model prmeters nd the dt re defined by the Directed Acyclic Network (Cowell et l. 1999) in Figure 7. Nodes in the network represent rndom vribles nd rcs define the independence structure of the joint probbility distribution function. Assuming tht the probbility distribution of ny node is independent of its non-descendnts if vlues of the prent nodes re given (Directed Mrkov Assumption), the joint probbility distribution of ll prmeters nd dt is given by the product of the locl probbility distributions of individul rndom vribles given their prents. p(x, C, L, M, M, S, α,, λ, τ, β, φ) = p(x C, M, S)p(C α)p(m L,M )p(s β, φ) p(l C,)p(M λ, τ)p(α)p()p(λ)p(τ)p(β)p(φ) M={µ 1,,µ Q } is the set of ll men vectors ssocited with Q globl ptterns, S={Σ 1,,Σ Q } is the set of corresponding vrincecovrince mtrices, M = {( µ,..., µ ),..., ( µ,..., µ )} is 11 K 1 1 1R KRR { 1,..., Σ R the set of ll locl men vectors, S = Σ } is the set of corresponding vrince-covrince mtrices nd K f is the number of locl groupings of globl clusters within context f. α,, λ, τ, β nd φ re hyperprmeters for C, L, M, nd S respectively. The probbility distribution of the expression dt vector for gene i, given its clssifiction vrible c i, globl mens M nd the vrincecovrince mtrices S is p ( xi ci = q, M, S) = f N ( xi µ q,σq ), where f N (. µ,σ) is the multivrite Gussin probbility distribution function with men µ nd vrince-covrince mtrix Σ. All vrince-covrince mtrices in the model re context-specific nd digonl. Tht is Σ q is the block digonl mtrix with contextspecific digonl mtrices σ I on the digonl. α C Fig. 7. 
The probability distribution of the global mean vector $\mu_q$, given the local structure $L$ and the local parameters $\tilde{M}$ (local means $\tilde{\mu}_{tf}$) and $\tilde{S}$ (local variance-covariance matrices $\tilde{\Sigma}_f$), is

$$p\big((\mu_{q1}, \mu_{q2}, \ldots, \mu_{qR}) \mid L, \tilde{M}, \tilde{S}\big) = f_N(\mu_{q1} \mid \tilde{\mu}_{L_{q1} 1}, \tilde{\Sigma}_1)\, f_N(\mu_{q2} \mid \tilde{\mu}_{L_{q2} 2}, \tilde{\Sigma}_2) \cdots f_N(\mu_{qR} \mid \tilde{\mu}_{L_{qR} R}, \tilde{\Sigma}_R),$$

where $\mu_{qf}$ is the subvector of the global mean $\mu_q$ on the $f$th context.

Prior distributions for the local groupings $L$ are defined following the infinite mixtures approach, which avoids the specification of the correct number of groups of local clusters for each context (Medvedovic et al. 2004; Medvedovic and Sivaganesan 2002). The probability of assigning the global cluster $q$ to an already existing group of clusters $t$ within the context $f$, given $C$ and the hyperparameter $a$, is given by

$$p(L_{qf} = t \mid C, a) = \frac{n_{-qft}}{Q - 1 + a}, \quad t = 1, \ldots, Q,$$

where $n_{-qft}$ is the number of global clusters currently placed in local cluster $t$ within context $f$, without counting global cluster $q$. The probability of assigning the global cluster to a new local group is given by

$$p(L_{qf} \neq L_{q'f},\ q' \neq q \mid C, a) = \frac{a}{Q - 1 + a}.$$

The rest of the local conditional probability distributions, the structure of the variance-covariance matrices and the hyperparameters are stated in the supplemental material.

The joint posterior distribution of all parameters in the model given the data is estimated using a Gibbs sampler. The Gibbs sampler (Gelfand and Smith 1990) is a general procedure for sampling observations from a multivariate distribution. It proceeds by iteratively drawing observations from the complete conditional distributions of all components given the current values of all other components. Under mild conditions, the distribution of the generated multivariate observations converges to the target multivariate distribution. The Gibbs sampler employed here is derived from previously described algorithms for fitting infinite mixture models. Conditional posterior distributions for $M$, $\tilde{M}$ and $L$ are derived assuming that $\tilde{\Sigma}_f = \tilde{\sigma}_f^2 I$ and by letting $\tilde{\sigma}_f \to 0$ within all contexts. This effectively forces all global cluster means grouped together within a context to be identical within this context.
Consequently, instead of estimating means and variances for each of the $Q$ global clusters within each context $f$, we estimate only $K_f < Q$ local parameters. The posterior distributions for the local classification variables, conditional on all other parameters in the model, are:

$$p(L_{qf} = t \mid C, X, a) \propto \frac{n_{-qft}}{Q - 1 + a}\, f_N\!\left(\bar{x}_f^{\,q} \,\Big|\, \tilde{\mu}_{tf}, \frac{\sigma_{tf}^2}{n_q} I\right)$$

$$p(L_{qf} \neq L_{q'f},\ q' \neq q \mid C, X, a) \propto \frac{a}{Q - 1 + a} \int f_N\!\left(\bar{x}_f^{\,q} \,\Big|\, \tilde{\mu}_{tf}, \frac{\sigma_{tf}^2}{n_q} I\right) p(\tilde{\mu}_{tf}, \sigma_{tf})\, d(\tilde{\mu}_{tf}, \sigma_{tf}),$$

where $n_q$ is the number of genes in global cluster $q$, $x_{if}$ is the subvector of $x_i$ on the $f$th context, and

$$\bar{x}_f^{\,q} = \frac{1}{n_q} \sum_{i:\, c_i = q} x_{if}.$$

All other conditional posterior distributions are similar to those of the simple infinite mixture models (Medvedovic et al. 2004; Medvedovic and Sivaganesan 2002) and are given in the web supplement. The Gibbs sampler is initialized by sampling all model parameters from their respective prior distributions and placing all global gene expression profiles into a single cluster. The Gibbs sampler then samples first the global clusters, then the local groupings of global clusters within each context, and then the rest of the parameters in the model. To alleviate the problem of slow mixing, we apply a heuristic annealing adjustment described in the Web Supplement. Previously, we demonstrated that such modifications preserve the topology of the posterior distribution of clusterings (Medvedovic et al. 2004).
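The update of a single local classification variable can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the gimm implementation: `a` stands for the hyperparameter governing the local groupings, the Gaussian likelihood term for each existing local group and the integrated likelihood for a new group are assumed to be computed elsewhere and passed in as log values (the priors they depend on are given only in the web supplement), and all function and argument names are hypothetical.

```python
import math
import random


def local_assignment_probs(n_minus, a, log_lik_existing, log_lik_new):
    """Normalized probabilities for placing one global cluster in a context.

    n_minus[t] is the number of other global clusters in local group t,
    so sum(n_minus) equals Q - 1. The last entry of the returned list is
    the probability of opening a new local group. Assumes a > 0.
    """
    total = sum(n_minus) + a  # Q - 1 + a
    log_w = []
    for n, ll in zip(n_minus, log_lik_existing):
        # Prior weight times likelihood, on the log scale; an empty
        # group gets zero probability.
        log_w.append(math.log(n / total) + ll if n > 0 else float("-inf"))
    log_w.append(math.log(a / total) + log_lik_new)

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    norm = sum(weights)
    return [w / norm for w in weights]


def sample_local_assignment(probs, rng=random):
    """Draw an index from probs by inverting the cumulative distribution."""
    u = rng.random()
    cumulative = 0.0
    for t, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return t
    return len(probs) - 1  # guard against rounding in the cumulative sum
```

Working on the log scale matters in practice because the likelihood of an averaged profile over many measurements can underflow when exponentiated directly.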