Combining Microarrays and Biological Knowledge for Estimating Gene Networks via Bayesian Networks


Seiya Imoto 1, Tomoyuki Higuchi 2, Takao Goto 1, Kousuke Tashiro 3, Satoru Kuhara 3 and Satoru Miyano 1

1 Human Genome Center, Institute of Medical Science, University of Tokyo, Shirokanedai, Minato-ku, Tokyo, Japan. {imoto, takao, miyano}@ims.u-tokyo.ac.jp
2 The Institute of Statistical Mathematics, 4-6-7, Minami-Azabu, Minato-ku, Tokyo, Japan. higuchi@ism.ac.jp
3 Graduate School of Genetic Resources Technology, Kyushu University, Hakozaki, Higashi-ku, Fukuoka, Japan. {ktashiro, kuhara}@grt.kyushu-u.ac.jp

Abstract

We propose a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge, including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on. Unfortunately, in many cases microarray data do not contain enough information for constructing gene networks accurately. Our method adds biological knowledge to the estimation of gene networks under a Bayesian statistical framework, and also controls the trade-off between microarray information and biological knowledge automatically. We conduct Monte Carlo simulations to show the effectiveness of the proposed method. We analyze Saccharomyces cerevisiae gene expression data as an application.

1. Introduction

In recent years, a large amount of gene expression data has been collected, and estimating a gene network has become one of the central topics in the field of bioinformatics. Several methodologies have been proposed for constructing a gene network based on gene expression data, such as Boolean networks [1, 2, 32, 42], differential equation models [7, 10, 11, 32] and Bayesian networks [13, 14, 17, 18, 20, 22, 23, 37]. The main drawback of gene network construction from microarray data is that, while the gene network contains a large number of genes, the information contained in gene expression data is limited by the number of microarrays, their quality, the experimental design, noise, and measurement errors.
Therefore, estimated gene networks contain some incorrect gene regulations, which cannot be evaluated from a biological viewpoint. In particular, the direction of gene regulation is difficult to decide using gene expression data only. Hence, the use of biological knowledge, including protein-protein and protein-DNA interactions [3, 5, 16, 21, 25], sequences of the binding sites of the genes controlled by transcription regulators [31, 40, 47], literature and so on, is considered to be a key for microarray data analysis. The use of biological knowledge has previously received considerable attention for extracting more information from microarray data [4, 6, 18, 33, 36, 38, 41].

In this paper, we provide a general framework for combining microarray data and biological knowledge aimed at estimating a gene network by using a Bayesian network model. If the gene regulation mechanisms are completely known, we can model the gene network easily. However, many parts of the true gene network are still unknown and need to be estimated from data. Hence, it is necessary to construct a suitable criterion for evaluating estimated gene networks in order to obtain an optimal network. While criteria proposed previously for evaluating a Bayesian network model only measure the closeness between a model and microarray data, we derive a criterion for selecting networks based on both microarray data and biological knowledge. The proposed criterion is constructed from two components: one shows the fitness of the model to the microarray data, and the other reflects biological knowledge, which is modeled under a probabilistic framework. Our proposed method automatically tunes the balance between the biological knowledge and microarray data based on our criterion and estimates a gene network from the combined data.

In Section 2.1, we describe our statistical model for constructing gene networks, and we introduce a criterion for evaluating networks in Section 2.2. A statistical framework for representing biological knowledge is described in Section 2.3. In Section 2.4, we illustrate how to model various types of biological knowledge in practice. Monte Carlo simulations, in Section 3.1, are conducted to show the effectiveness of the proposed method. We apply our method to Saccharomyces cerevisiae gene expression data in Section 3.2.

2. Method for Estimating Gene Networks

2.1. Bayesian network and nonparametric heteroscedastic regression model

Bayesian networks [26] are a type of graphical model for capturing complex relationships among a large number of random variables by a directed acyclic graph encoding the Markov assumption. In the context of Bayesian networks, a gene corresponds to a random variable shown as a node, while gene regulations are shown by directed edges. Thus gene interactions are modeled by the conditional distribution of each gene. We use Bayesian network and nonparametric heteroscedastic regression models [23] for constructing gene networks from microarray data.

Suppose that we have $n$ sets of microarrays $\{x_1, \ldots, x_n\}$ of $p$ genes, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ is a $p$-dimensional gene expression vector obtained by the $i$th microarray.
Here, $x_{ij}$ is the expression value of the $j$th gene, denoted by gene $j$, measured by the $i$th microarray after the required normalizations and transformation [39]. Ordinarily, $x_{ij}$ is given by $\log_2(R_{ij}/G_{ij})$, where $R_{ij}$ and $G_{ij}$ are the normalized intensities of Cy5 and Cy3 for gene $j$ measured by the $i$th microarray. The interaction between gene $j$ and its parents is modeled by the nonparametric additive regression model [19] with heterogeneous error variances

$$x_{ij} = m_{j1}(p^{(j)}_{i1}) + \cdots + m_{jq_j}(p^{(j)}_{iq_j}) + \varepsilon_{ij},$$

where $p^{(j)}_{ik}$ is the expression value of the $k$th parent of gene $j$ measured by the $i$th microarray, and the $\varepsilon_{ij}$ are independently and normally distributed with mean 0 and variance $\sigma^2_{ij}$. Here, $m_{jk}(\cdot)$ is a smooth function constructed by B-splines [9, 12, 24] of the form

$$m_{jk}(p^{(j)}_{ik}) = \sum_{m=1}^{M_{jk}} \gamma^{(j)}_{mk}\, b^{(j)}_{mk}(p^{(j)}_{ik}),$$

where $\{b^{(j)}_{1k}(\cdot), \ldots, b^{(j)}_{M_{jk},k}(\cdot)\}$ is a prescribed set of B-splines and the $\gamma^{(j)}_{mk}$ are parameters. Hence, a Bayesian network and nonparametric heteroscedastic regression model can be represented as

$$f(x_i; \theta_G) = \prod_{j=1}^{p} f_j(x_{ij} \mid p_{ij}; \theta_j)$$

for $i = 1, \ldots, n$, where $\theta_G$ is a parameter vector and $f_j(x_{ij} \mid p_{ij}; \theta_j)$ is the density of a Gaussian distribution with mean $m_{j1}(p^{(j)}_{i1}) + \cdots + m_{jq_j}(p^{(j)}_{iq_j})$ and variance $\sigma^2_{ij}$. If gene $j$ has no parent genes, we use $\mu_j$ and $\sigma^2_j$ instead of $m_{j1}(p^{(j)}_{i1}) + \cdots + m_{jq_j}(p^{(j)}_{iq_j})$ and $\sigma^2_{ij}$, respectively.

This model has several advantages. First, unlike Boolean networks and discrete Bayesian networks [13, 14, 17, 18, 20, 37], no discretization of gene expression data, which leads to information loss, is required. Second, even nonlinear relationships between genes are automatically extracted based on gene expression data.

2.2. Criterion for evaluating networks

Some gene networks are partially known, but many mechanisms of gene regulation are still unknown. Therefore we need to estimate unknown structures of the gene network from the data. Hence, the construction of a suitable criterion for measuring the closeness between an estimated gene network and the true one is an essential problem for statistical gene network modeling. Following the result of Imoto et al. [23], a criterion for evaluating an estimated gene network can be derived from a Bayes approach. First, we briefly introduce the derivation of their criterion. We then explain how to extend their criterion for combining microarray data and biological knowledge.

When we construct a gene network $G$ by using a Bayesian network model, the posterior probability of the network is obtained as the product of the prior probability of the network, $\pi(G)$, and the marginal likelihood, divided by the normalizing constant. After dropping the normalizing constant, the posterior probability of the network is proportional to

$$\pi(G) \int \prod_{i=1}^{n} f(x_i; \theta_G)\, \pi(\theta_G \mid \lambda)\, d\theta_G,$$

where $\pi(\theta_G \mid \lambda)$ is a prior distribution on the parameter vector $\theta_G$ with hyperparameter vector $\lambda$ satisfying $\log \pi(\theta_G \mid \lambda) = O(n)$. The essential problem for constructing a criterion based on the posterior probability of the network is how to compute the marginal likelihood, which is given by a high-dimensional integral. Imoto et al. [23] used the Laplace approximation for integrals [8, 30, 45] and derived a criterion, named BNRC$_{\mathrm{hetero}}$ (Bayesian network and Nonparametric heteroscedastic Regression Criterion), of the form

$$\mathrm{BNRC}_{\mathrm{hetero}}(G) = -2\log \pi(G) + \log\left| \frac{n}{2\pi} J_\lambda(\hat\theta_G) \right| - 2 n\, l_\lambda(\hat\theta_G \mid X),$$

where

$$J_\lambda(\theta_G) = -\frac{\partial^2 l_\lambda(\theta_G \mid X)}{\partial\theta_G\, \partial\theta_G^T}, \qquad l_\lambda(\theta_G \mid X) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i; \theta_G) + \frac{1}{n}\log \pi(\theta_G \mid \lambda),$$

and $\hat\theta_G$ is the mode of $l_\lambda(\theta_G \mid X)$. Suppose that the prior distribution $\pi(\theta_G \mid \lambda)$ is factorized as

$$\pi(\theta_G \mid \lambda) = \prod_{j,k} \pi_{jk}(\gamma_{jk} \mid \lambda_{jk}),$$

where $\gamma_{jk} = (\gamma^{(j)}_{1k}, \ldots, \gamma^{(j)}_{M_{jk},k})^T$ is a parameter vector and $\lambda_{jk}$ is a hyperparameter. We use a singular $M_{jk}$-variate normal distribution as the prior distribution on $\gamma_{jk}$,

$$\pi_{jk}(\gamma_{jk} \mid \lambda_{jk}) = \left(\frac{2\pi}{n\lambda_{jk}}\right)^{-(M_{jk}-2)/2} |K_{jk}|_+^{1/2} \exp\left(-\frac{n\lambda_{jk}}{2}\, \gamma_{jk}^T K_{jk} \gamma_{jk}\right),$$

where $K_{jk}$ is an $M_{jk} \times M_{jk}$ symmetric positive semidefinite matrix satisfying $\gamma_{jk}^T K_{jk} \gamma_{jk} = \sum_{\alpha=3}^{M_{jk}} (\gamma^{(j)}_{\alpha k} - 2\gamma^{(j)}_{\alpha-1,k} + \gamma^{(j)}_{\alpha-2,k})^2$. Then we have the decomposition

$$\mathrm{BNRC}_{\mathrm{hetero}} = -2\log \pi(G) + \sum_{j=1}^{p} \mathrm{BNRC}^{(j)}_{\mathrm{hetero}}.$$

Here $\mathrm{BNRC}^{(j)}_{\mathrm{hetero}}$ is a score for gene $j$ and is given by

$$\begin{aligned}
\mathrm{BNRC}^{(j)}_{\mathrm{hetero}} = {} & -\Big(\sum_{k=1}^{q_j} M_{jk} + 1\Big) \log\Big(\frac{2\pi}{n}\Big) - \sum_{i=1}^{n} \log w_{ij} + n \log(2\pi\hat\sigma_j^2) + n \\
& + \sum_{k=1}^{q_j} \big\{ \log|\Lambda_{jk}| - M_{jk} \log(n\hat\sigma_j^2) \big\} - \log(2\hat\sigma_j^2) \\
& + \sum_{k=1}^{q_j} \Big\{ (M_{jk} - 2) \log\Big(\frac{2\pi\hat\sigma_j^2}{n\beta_{jk}}\Big) - \log|K_{jk}|_+ + \frac{n\beta_{jk}}{\hat\sigma_j^2}\, \hat\gamma_{jk}^T K_{jk} \hat\gamma_{jk} \Big\},
\end{aligned}$$

where $w_{ij}$, $i = 1, \ldots, n$, are the weights of the heterogeneous error variance $\sigma^2_{ij} = w_{ij}^{-1}\sigma_j^2$, and $\Lambda_{jk} = B_{jk}^T W_j B_{jk} + n\beta_{jk} K_{jk}$ with $B_{jk} = (b_{jk}(p^{(j)}_{1k}), \ldots, b_{jk}(p^{(j)}_{nk}))^T$, $b_{jk}(p^{(j)}_{ik}) = (b^{(j)}_{1k}(p^{(j)}_{ik}), \ldots, b^{(j)}_{M_{jk},k}(p^{(j)}_{ik}))^T$, $W_j = \mathrm{diag}(w_{1j}, \ldots, w_{nj})$ and $\beta_{jk} = \sigma_j^2 \lambda_{jk}$. The details of the parameter estimation are described in Imoto et al. [23].

2.3. Adding biological knowledge

The criterion $\mathrm{BNRC}_{\mathrm{hetero}}(G)$, introduced in the previous section, contains two quantities: the prior probability $\pi(G)$ of the network, and the marginal likelihood of the data. The marginal likelihood shows the fitness of the model to the microarray data. The biological knowledge can then be added into the prior probability of the network $\pi(G)$.

Let $U_{ij}$ be the interaction energy of the edge from gene $i$ to gene $j$, and let $U_{ij}$ be categorized into $I$ values, $H_1, \ldots, H_I$, based on biological knowledge. For example, if we know a priori that gene $i$ regulates gene $j$, we set $U_{ij} = H_1$. However, if we do not know whether gene $k$ regulates gene $l$ or not, we set $U_{kl} = H_2$. Note that $0 < H_1 < H_2$. The total energy of the network $G$ can then be defined as

$$E(G) = \sum_{\{i,j\} \in G} U_{ij},$$

where the sum is taken over the existing edges in the network $G$. Under the Bayesian network framework, the total energy can be decomposed into the sum of the local energies

$$E(G) = \sum_{j=1}^{p} \sum_{i \in L_j} U_{ij} = \sum_{j=1}^{p} E_j, \qquad (1)$$

where $L_j$ is the index set of the parents of gene $j$ and $E_j = \sum_{i \in L_j} U_{ij}$ is the local energy defined by gene $j$ and its parents. Figure 1 shows an example of a gene network and its energy.

Figure 1. A gene network and its energy. The index sets $L_3 = \{1\}$, $L_4 = \{2\}$ and $L_5 = \{3, 4\}$ are illustrated, and $L_1$ and $L_2$ are defined by empty sets. The local energies are $E_3 = U_{13}$, $E_4 = U_{24}$ and $E_5 = U_{35} + U_{45}$. The total energy of this network is $E = E_3 + E_4 + E_5 = U_{13} + U_{24} + U_{35} + U_{45}$.

The probability of a network $G$, $\pi(G)$, is naturally modeled by the Gibbs distribution [15]

$$\pi(G) = Z^{-1} \exp\{-\zeta E(G)\}, \qquad (2)$$

where $\zeta$ $(> 0)$ is a hyperparameter and $Z$ is a normalizing constant called the partition function

$$Z = \sum_{G \in \mathcal{G}} \exp\{-\zeta E(G)\}.$$

Here $\mathcal{G}$ is the set of possible networks. By replacing $\zeta H_1, \ldots, \zeta H_I$ with $\zeta_1, \ldots, \zeta_I$, respectively, the normalizing constant $Z$ is a function of $\zeta_1, \ldots, \zeta_I$. We call $\zeta_j$ an inverse normalized temperature. By substituting (1) into (2), we have

$$\pi(G) = Z^{-1} \prod_{j=1}^{p} \exp\{-\zeta E_j\} = Z^{-1} \prod_{j=1}^{p} \prod_{i \in L_j} \exp(-\zeta_{\alpha(i,j)}),$$

with $\alpha(i,j) = k$ for $U_{ij} = H_k$. Hence, by adding biological knowledge into the prior probability of the network, $\mathrm{BNRC}_{\mathrm{hetero}}$ can be rewritten as

$$\mathrm{BNRC}_{\mathrm{hetero}}(G; \zeta_1, \ldots, \zeta_I) = 2\log Z + \sum_{j=1}^{p} \Big\{ 2\sum_{i \in L_j} \zeta_{\alpha(i,j)} + \mathrm{BNRC}^{(j)}_{\mathrm{hetero}} \Big\}. \qquad (3)$$

We can choose an optimal network under the given $\zeta_1, \ldots, \zeta_I$. The optimal values of $\zeta_1, \ldots, \zeta_I$ are also obtained as the minimizer of (3). Therefore, we can represent an algorithm for estimating a gene network from microarray data and biological knowledge as follows:

Step 1: Set the values $\zeta_1, \ldots, \zeta_I$.
Step 2: Estimate a gene network by minimizing $\mathrm{BNRC}_{\mathrm{hetero}}(G; \zeta_1, \ldots, \zeta_I)$ under the given $\zeta_1, \ldots, \zeta_I$.
Step 3: Repeat Step 1 and Step 2 for the candidate values of $\zeta_1, \ldots, \zeta_I$.
Step 4: An optimal gene network is obtained from the candidate networks obtained in Step 3.

In Step 2, we use the greedy hill-climbing algorithm for learning networks. The details are shown in Imoto et al. [23]. Note that the proposed prior probability of the network can be used for other types of Bayesian network models, such as discrete Bayesian networks and dynamic Bayesian networks [29, 34, 36, 43].
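The energy, Gibbs prior and partition function defined above can be illustrated on a toy problem. The following is a minimal sketch, assuming three genes indexed 0, 1, 2 and two categories of edges (known and unknown); the helper names (`make_zeta`, `all_dags`, `log_prior`) and the values 0.5 and 5.0 are hypothetical choices for illustration and are not taken from the authors' implementation. The partition function is computed by brute-force enumeration of all directed acyclic graphs, which is only feasible for a handful of genes.

```python
import itertools
import math


def is_acyclic(p, edges):
    """Return True if the directed graph on nodes 0..p-1 has no cycle (Kahn's algorithm)."""
    indegree = [0] * p
    for _, child in edges:
        indegree[child] += 1
    stack = [v for v in range(p) if indegree[v] == 0]
    removed = 0
    while stack:
        v = stack.pop()
        removed += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    stack.append(child)
    return removed == p


def all_dags(p):
    """Yield every DAG on p labeled nodes as a tuple of (parent, child) edges."""
    possible = [(i, j) for i in range(p) for j in range(p) if i != j]
    for r in range(len(possible) + 1):
        for edges in itertools.combinations(possible, r):
            if is_acyclic(p, edges):
                yield edges


def make_zeta(p, known_edges, zeta1, zeta2):
    """Hypothetical helper: assign zeta1 to known edges and zeta2 to all other edges."""
    return {(i, j): (zeta1 if (i, j) in known_edges else zeta2)
            for i in range(p) for j in range(p) if i != j}


def energy(edges, zeta):
    """Scaled total energy: the sum of zeta over the edges present in the network."""
    return sum(zeta[e] for e in edges)


def partition_function(p, zeta):
    """Z = sum over all DAGs of exp(-energy); exhaustive, so only usable for tiny p."""
    return sum(math.exp(-energy(edges, zeta)) for edges in all_dags(p))


def log_prior(edges, p, zeta):
    """log pi(G) = -energy(G) - log Z under the Gibbs distribution."""
    return -energy(edges, zeta) - math.log(partition_function(p, zeta))


# Toy example: the edge 0 -> 1 is known a priori, everything else is unknown.
zeta = make_zeta(3, {(0, 1)}, zeta1=0.5, zeta2=5.0)
known_graph = ((0, 1),)
reversed_graph = ((1, 0),)
prior_penalty_known = -2.0 * log_prior(known_graph, 3, zeta)
prior_penalty_reversed = -2.0 * log_prior(reversed_graph, 3, zeta)
```

In this sketch `prior_penalty_known` is smaller than `prior_penalty_reversed`, which is how a known regulation receives a better criterion score than an edge that is not supported by prior knowledge.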
The computation of the partition function, $Z$, is intractable even for moderately sized gene networks. To avoid this problem, we compute upper and lower bounds of the partition function and use them for choosing the optimal values of $\zeta_1, \ldots, \zeta_I$. An upper bound is obtained by directed graphs, which are allowed to contain cyclic graphs. Thus the true value of the partition function is not greater than the upper bound. A lower bound is computed by multi-level directed graphs with the following assumptions: (A1) there is one top gene, and (A2) genes at the same level have a common parent gene that is located one level above them. We also consider joined graphs of some multi-level directed graphs satisfying (A1) and (A2). Since the number of possible graphs is much larger than the number included in the computation, the true value of the partition function should be greater than the lower bound. Since the optimization of the network structure for fixed $\zeta_1, \ldots, \zeta_I$ does not depend on the value of the partition function, our method works well in practice. Of course, when the number of genes is small, we can perform an exhaustive search and compute the partition function completely. However, we think that the development of an effective algorithm to enumerate all possible networks or approximate the partition function is an important problem.

2.4. Prior design for various biological knowledge

In this subsection, we show some examples of biological knowledge and how to include them in the prior probability in practice. We consider using two values $\zeta_1$ and $\zeta_2$ satisfying $0 < \zeta_1 < \zeta_2$ for representing biological knowledge. Basically, we allocate $\zeta_1$ to a known relationship and $\zeta_2$ otherwise. The prior information can be summarized as a $p \times p$ matrix $U$ whose $(i,j)$ element, $u_{ij}$, corresponds to $\zeta_1$ or $\zeta_2$.

Protein-protein interactions

The number of known protein-protein interactions is rapidly increasing, and they are kept in some public databases such as GRID [16] and BIND [3, 5]. Protein-protein interactions show that at least two proteins form a complex. Therefore, representing protein-protein interactions by a directed graph is not suitable. However, they can be included in our method. If we know that gene $i$ and gene $j$ create a protein-protein interaction, we set $u_{ij} = u_{ji} = \zeta_1$. In such a case, we will decide whether we make a virtual node corresponding to a protein complex theoretically [35].

Protein-DNA interactions

Protein-DNA interactions show gene regulations by transcription factors and can be modeled more easily than protein-protein interactions. When gene $i$ is a transcription regulator and controls gene $j$, we set $u_{ij} = \zeta_1$ and $u_{ji} = \zeta_2$.

Sequences

Genes that are controlled by a transcription regulator might have a consensus motif in their promoter DNA sequences. If gene $j_1$, ..., gene $j_n$ have a consensus motif and are controlled by gene $i$, we set $u_{ij_1} = \cdots = u_{ij_n} = \zeta_1$ and $u_{j_1 i} = \cdots = u_{j_n i} = \zeta_2$. Previously, consensus motifs were often used for the evaluation of estimated gene networks from a biological viewpoint. This information, however, can be introduced directly into our method.

Figure 2. Artificial gene network and functional structures between nodes. (a) The artificial network with nodes $g_1, \ldots, g_{20}$. (b) The functional relationships between nodes, which include linear forms such as $g_2 = 0.7 g_1 + \varepsilon_2$, sigmoid forms such as $g_{10} = 1/\{1 + \exp(-4 g_3)\} + \varepsilon_{10}$, and piecewise forms.
One straightforward way is the use of known regulatory motifs kept in public databases such as SCPD [40] and YTF [47]. As for an advanced method, Tamada et al. [44] proposed a method for simultaneously estimating a gene network and detecting regulatory motifs based on our method, and succeeded in estimating an accurate gene network and detecting a true regulatory motif.

Gene networks and pathways

The information of gene networks can be introduced directly into our method by transforming the prescribed network structures into the matrix $U$. We can then estimate a gene network based on $U$ and microarray data. Our method can also use gene networks estimated by other techniques such as Boolean networks, differential equation models, and so on. Also, some databases, such as KEGG [28], contain several known gene networks and pathways. This information can be used similarly.

Literature

Some research has been performed to extract information from a huge amount of literature [27]. The literature contains various kinds of information, including the biological knowledge described above. So we can model literature information in the same way.

3. Computational Experiments

3.1. Monte Carlo simulations

Before analyzing real gene expression data, we perform Monte Carlo simulations to examine the properties of the proposed method. We assume an artificial network with 20 nodes shown in Figure 2 (a). The functional relationships between nodes are listed in Figure 2 (b). A network will be rebuilt from simulated data consisting of 50 or 100 observations, which corresponds to 50 or 100 microarrays. As for the biological knowledge, we tried the following situations: (Case 1) we know some gene regulations (100%, 75%, 50% or 25% out of the 19 edges shown in Figure 2 (a)), and (Case 2) we know some gene regulations, but some (1, 2, or 3) incorrect edges are kept in the database. The candidate values of $\zeta_1$ and $\zeta_2$ are $\{0.5, 1.0\}$ and $\{\zeta_1, 2.5, 5.0, 7.5, 10.0\}$, respectively.
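The search over candidate values can be sketched as a small grid loop. This is a minimal illustration, assuming the two inverse normalized temperatures are written `zeta1` and `zeta2`; `estimate` stands for any user-supplied routine that learns a network for one pair of values and returns its criterion score, and `toy_estimate` is a made-up stand-in used only to make the example runnable. Neither is the authors' code, and a real score would also have to account for the partition function, for instance through bounds on it.

```python
def candidate_pairs(zeta1_values=(0.5, 1.0), zeta2_extra=(2.5, 5.0, 7.5, 10.0)):
    """List (zeta1, zeta2) pairs; zeta2 == zeta1 corresponds to using no prior knowledge."""
    return [(z1, z2) for z1 in zeta1_values for z2 in (z1,) + tuple(zeta2_extra)]


def select_network(pairs, estimate):
    """Run estimate(zeta1, zeta2) -> (score, network) for each pair and keep the minimum score."""
    best = None
    for z1, z2 in pairs:
        score, network = estimate(z1, z2)
        if best is None or score < best[0]:
            best = (score, z1, z2, network)
    return best


def toy_estimate(z1, z2):
    """Hypothetical stand-in for network learning: a score with its minimum at zeta2 = 5.0."""
    return (z2 - 5.0) ** 2 + z1, None


pairs = candidate_pairs()
best_score, best_zeta1, best_zeta2, best_network = select_network(pairs, toy_estimate)
```

With the toy score, the loop selects `zeta1 = 0.5` and `zeta2 = 5.0`; with a real estimator the selected pair depends entirely on the data and the prior knowledge.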

Figure 3. The behavior of BNRC$_{\mathrm{hetero}}$ when $\zeta_1 = 0.5$. We can find that the optimal inverse normalized temperature $\zeta_2$ is 5.0.

Figure 4. An example of resulting networks based on 100 samples. We used $\zeta_1 = 0.5$ and $\zeta_2 = 5.0$, which are selected by our criterion (see Figure 3). The legend distinguishes edges found with knowledge, edges found without knowledge, edges that appear in both methods, and true edges.

Figure 4 shows two estimated networks. One is estimated from 100 observations (microarrays) alone. We use $\zeta_1 = \zeta_2 = 0.5$, i.e. we did not use any knowledge (we denote this network by $N_0$ for convenience). The other is estimated from 100 observations and prior information on 75% of the gene regulations, i.e. we know 14 correct relations out of all 19 correct edges (we denote this network by $N_1$). Edges appearing in both networks are colored green, while edges appearing in $N_0$ or $N_1$ only are colored blue and red, respectively. By adding prior knowledge, it is clear that we succeeded in reducing the number of false positives. We also find four additional correct relationships. Figure 3 shows the behavior of BNRC$_{\mathrm{hetero}}$ when $\zeta_1 = 0.5$. We find that the optimal value of $\zeta_2$ is 5.0. From the Monte Carlo simulations, we observed that in practice $\zeta_2$ can be selected by using the middle values (depicted by a blue line) of the upper and lower bounds, or the upper bounds. For the selection of $\zeta_1$, we use the middle value of the upper and lower bounds of the score of our criterion.

The results of the Monte Carlo simulations are summarized as follows. In (Case 1), we obtained networks more accurately as long as we added correct knowledge. We observed that the number of false positives decreased drastically. We presume the reason is the nature of directed acyclic graphs. Since a Bayesian network model is a directed acyclic graph, one incorrect estimate may affect the relations in its neighborhood. However, by adding some correct knowledge, we can restrict the search space of the Bayesian network model learning effectively.

In (Case 2), the results depend on the type of incorrect knowledge. (i) If we use misdirected relations, e.g. gene 8 → gene 3, as prior knowledge, serious problems occur. Since microarray data to some degree support the misdirected relations, they tend to receive a better criterion score. (ii) If we add indirect relations such as gene 1 → gene 8, we observed that our method controlled the balance between this prior information and microarray data and could decide whether the prior relation is true. (iii) If irrelevant relations such as gene 20 → gene 5 are added as prior information, we observed that our method could reject this prior information, because the microarray data do not support these relations.

3.2. Example using experimental data

In this subsection, we demonstrate our method by analyzing Saccharomyces cerevisiae gene expression data obtained by disrupting 100 genes, almost all of which are transcription factors. We focus on five genes, MCM1, SWI5, ACE2, SNF2 and STE12 (see Table 1), and extract the genes that are regulated by these five genes from the Yeast Proteome Database [46]. Thus, we construct the prior network shown in Figure 5, based on the database information. We include the prior network in our Bayesian network estimation method.
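Turning a list of regulator-target relations into the prior matrix can be sketched as follows. This is a minimal illustration, assuming the inverse normalized temperatures are written `zeta1` (known regulation) and `zeta2` (everything else); the function name `build_prior_matrix` and the small gene subset are hypothetical and only meant to show the bookkeeping, not the authors' implementation. Diagonal entries are filled with `zeta2` but are never used, because a gene cannot be its own parent.

```python
def build_prior_matrix(genes, regulations, zeta1, zeta2):
    """Return (U, index): U[i][j] is zeta1 if gene i is known to regulate gene j, else zeta2."""
    index = {gene: k for k, gene in enumerate(genes)}
    p = len(genes)
    U = [[zeta2] * p for _ in range(p)]
    for regulator, targets in regulations.items():
        for target in targets:
            U[index[regulator]][index[target]] = zeta1
    return U, index


# Illustrative subset of regulator -> target relations (a few entries only).
genes = ["SWI5", "STE12", "CDC6", "SIC1", "STE6", "FAR1"]
regulations = {
    "SWI5": ["CDC6", "SIC1"],
    "STE12": ["STE6", "FAR1"],
}
U, index = build_prior_matrix(genes, regulations, zeta1=0.5, zeta2=2.5)
```

The reverse direction of each listed regulation keeps the larger value `zeta2`, so an edge pointing from a target back to its regulator is penalized more heavily than the listed direction.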

Table 1. Five transcription factors and their regulating genes.

MCM1: transcription factor of the MADS box family
    MET14, CDC6, MET2, CDC5, MET6, SIC1, STE6, CLN2, PCL2, STE2, ACE2, MET16, MET3, MET4, CAR1, SWI5, PCL9, CLB1, MET17, EGT2, ARG5,6, PMA1, RME1, CLB2
SWI5: transcription factor
    CDC6, SIC1, CLN2, PCL2, PCL9, EGT2, RME1, CTS1, HO
ACE2: metallothionein expression activator
    CLN2, EGT2, HO, CTS1, RME1
SNF2: component of SWI/SNF global transcription activator complex
    CTS1, HO
STE12: transcriptional activator
    STE6, FAR1, KAR3, SST2, FUS1, STE2, BAR1, AGA1, AFR1, CIK1

Figure 5. Prior knowledge network. The genes that are in each shadowed circle are regulated by the parent genes.

Figure 6. Resulting network based on microarray data only.

That is, the purpose of this analysis is to estimate the gene network containing the above 36 genes from microarray data together with the prior network.

Figure 6 shows the estimated gene network using microarray data only. There are many non-prior edges, and many of them are probably false positives. In addition, we find three misdirected relations: SWI5 → MCM1, HO → ACE2 and STE6 → STE12. By adding the prior network, we obtain the gene network shown in Figure 8. As for the inverse normalized temperatures $\zeta_1$ and $\zeta_2$, we set $\zeta_1 = 0.5$ and choose the optimal value of $\zeta_2$. We also estimated a gene network based on $\zeta_1 = 1$ and found the results described below to be essentially unchanged. Figure 7 shows the behavior of BNRC$_{\mathrm{hetero}}$ with respect to $\zeta_2$. We find that the optimal value of $\zeta_2$ is 2.5.

Figure 7. Optimization of $\zeta_2$. We can find that the optimal value of $\zeta_2$ is 2.5.

Figure 8. Resulting network based on microarray data and biological knowledge. The inverse normalized temperatures are selected by our criterion ($\zeta_1 = 0.5$, $\zeta_2 = 2.5$).

Figure 8 shows the resulting network based on microarray data and the biological knowledge represented by the prior network in Figure 5. We show the edges that correspond to the prior knowledge in black. The edges between genes that are regulated by the same transcription factor in the prior network are shown in blue. The red edges do not correspond to the prior knowledge.

In particular, we find that the relationships around MCM1 improve drastically. The network based on microarray data only (Figure 6) indicates that only SIC1 and ACE2 are regulated by MCM1. Note that the underlined genes correspond to the prior network information. After adding the prior knowledge and optimizing the inverse normalized temperatures, we find that 10 of the 24 genes that are listed as co-regulated genes of MCM1 in Table 1 are extracted. Also, the relationships around STE12 become clearer. Before adding prior knowledge, the estimated network in Figure 6 suggests that FUS1, AFR1, KAR3, BAR1, MET4, MET16 and MCM1 are regulated by STE12, while STE12 is controlled by HO, STE6 and MET3. On the other hand, the network in Figure 8 shows that STE12 regulates FUS1, AFR1, KAR3, CIK1, STE2, STE6, HO and MCM1. Note that the three misdirected relations described above are corrected in Figure 8.

The difference between the inverse normalized temperatures $\zeta_1 = 0.5$ and $\zeta_2 = 2.5$ is small, because the score of the criterion is increased by $2\zeta_1$ or $2\zeta_2$ when we add an edge that is listed or not listed in the prior network, respectively. Therefore, the microarray data contain this information, and we succeeded in extracting it with the slight help of the prior network.

We optimized the inverse normalized temperature $\zeta_2$ based on the proposed criterion. From the network based on the optimal inverse normalized temperatures, we can find the gap between microarray data and biological knowledge. By comparing Figure 6 with Figure 8, we find that the microarray data reflect the relationship between seven genes (CLN2, RME1, CDC6, EGT2, PCL2, PCL9 and SIC1) and two transcription factors (MCM1 and SWI5).
On the other hand, we find that there are somewhat large differences between microarray data and the prior network for the relationship between MCM1 and the thirteen genes that are in the biggest circle.

4. Discussion

In this paper we proposed a general framework for combining microarray data and biological knowledge aimed at estimating a gene network. An advantage of our method is that the balance between microarray information and biological knowledge is optimized by the proposed criterion. By adding biological knowledge into our Bayesian network estimation method, we succeeded in extracting more information from microarray data and estimating the gene network more accurately. We believe that the combination of microarray data and biological knowledge gives a new perspective for understanding the systems of living creatures.

We consider the following problems as our future work: (1) In the real application, we demonstrated how to use a gene network that is obtained biologically as prior knowledge. There are various types of biological knowledge, which we listed in Section 2.4. How to use such knowledge together with microarray data in practice is a very important problem. (2) From biological knowledge, we deterministically decided the category to which edges belong, e.g. $u_{11} = \zeta_1$, $u_{12} = \zeta_2$, and so on. However, biological knowledge contains some errors. In fact, $u_{ij}$ can be viewed as a random variable, and a statistical model can be constructed for $u_{ij}$. In that sense, our method can be extended to a Bayesian network estimation method with a self-repairing database mechanism. We would like to investigate these problems in a future paper.

References

[1] T. Akutsu, S. Miyano and S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing, 4, 17-28, 1999.
[2] T. Akutsu, S. Miyano and S. Kuhara. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 2000.
[3] G.D. Bader, I. Donaldson, C. Wolting, B.F.F. Ouellette, T. Pawson and C.W.V. Hogue. BIND-The biomolecular interaction network database. Nucleic Acids Research, 29.
[4] H. Bannai, S. Inenaga, A. Shinohara, M. Takeda and S. Miyano. A string pattern regression algorithm and its application to pattern discovery in long introns. Genome Informatics, 13, 3-11, 2002.
[5] BIND.
[6] H.J. Bussemaker, H. Li and E.D. Siggia. Regulatory element detection using correlation with expression. Nature Genetics, 27, 2001.
[7] T. Chen, H. He and G. Church. Modeling gene expression with differential equations. Pacific Symposium on Biocomputing, 4, 29-40, 1999.
[8] A.C. Davison. Approximate predictive likelihood. Biometrika, 73.
[9] C. de Boor. A Practical Guide to Splines. Springer, Berlin.
[10] M.J.L. de Hoon, S. Imoto and S. Miyano. Inferring gene regulatory networks from time-ordered gene expression data using differential equations. Proc. 5th International Conference on Discovery Science, Lecture Note in Artificial Intelligence, 2534, Springer-Verlag.
[11] M.J.L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara and S. Miyano. Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. Pacific Symposium on Biocomputing, 8, 17-28.
[12] P.H.C. Eilers and B. Marx. Flexible smoothing with B-splines and penalties (with discussion). Statistical Science, 11.
[13] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In M.I. Jordan ed., Kluwer Academic Publisher.
[14] N. Friedman, M. Linial, I. Nachman and D. Pe'er. Using Bayesian network to analyze expression data. J. Comp. Biol., 7.
[15] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution and the Bayesian restorations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6.
[16] GRID.
[17] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola and R.A. Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing, 6.
[18] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola and R.A. Young. Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing, 7, 2002.
[19] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall.
[20] D. Heckerman. A tutorial on learning with Bayesian networks. In M.I. Jordan ed., Kluwer Academic Publisher.
[21] T. Ideker, O. Ozier, B. Schwikowski and A.F. Siegel. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18 (ISMB 2002), S233-S240.
[22] S. Imoto, T. Goto and S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing, 7, 2002.
[23] S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara and S. Miyano. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology, in press. (Preliminary version has appeared in Proc. 1st IEEE Computer Society Bioinformatics Conference, 2002.)
[24] S. Imoto and S. Konishi. Selection of smoothing parameters in B-spline nonparametric regression models using information criteria. Ann. Inst. Statist. Math., in press.
[25] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA, 97.
[26] F.V. Jensen. An introduction to Bayesian Networks. University College London Press.
[27] T.-K. Jenssen, A. Lægreid, J. Komorowski and E. Hovig. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28.
[28] KEGG.
[29] S. Kim, S. Imoto and S. Miyano. Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Proc. 1st International Workshop on Computational Methods in Systems Biology, Lecture Note in Computer Science, 2602, Springer-Verlag.
[30] S. Konishi, T. Ando and S. Imoto. Bayesian information criteria and smoothing parameter selection in radial basis function networks. Submitted for publication.
[31] T.I. Lee, N.J. Rinaldi, F. Robert, D.T. Odom, Z. Bar-Joseph, G.K. Gerber, N.M. Hannett, C.T. Harbison, C.M. Thompson, I. Simon, J. Zeitlinger, E.G. Jennings, H.L. Murray, D.B. Gordon, B. Ren, J.J. Wyrick, J-B. Tagne, T.L. Volkert, E. Fraenkel, D.K. Gifford and R.A. Young. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 2002.
[32] Y. Maki, D. Tominaga, M. Okamoto, S. Watanabe and Y. Eguchi. Development of a system for the inference of large scale genetic networks. Pacific Symposium on Biocomputing, 6.
[33] D.R. Masys. Linking microarray data to the literature. Nature Genetics, 28, 9-10.
[34] K. Murphy and S. Mian. Modelling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, CA.
[35] N. Nariai, S. Kim, S. Imoto and S. Miyano. Using protein-protein interactions for refining gene networks estimated from microarray data by Bayesian networks. Under preparation.
[36] I.M. Ong, J.D. Glasner and D. Page. Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18 (ISMB 2002), S241-S248.
[37] D. Pe'er, A. Regev, G. Elidan and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17 (ISMB 2001), S215-S224.
[38] Y. Pilpel, P. Sudarsanam and G.M. Church. Identifying regulatory networks by combinatorial analysis of promoter elements. Nature Genetics, 29, 153-9, 2001.
[39] J. Quackenbush. Microarray data normalization and transformation. Nature Genetics, 32, 2002.
[40] SCPD.
[41] E. Segal, Y. Barash, I. Simon, N. Friedman and D. Koller. From promoter sequence to expression: a probabilistic framework. Proc. 6th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2002), 2002.
[42] I. Shmulevich, E.R. Dougherty, S. Kim and W. Zhang. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18.
[43] V.A. Smith, E.D. Jarvis and A.J. Hartemink. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics, 18 (ISMB 2002), S216-S224.
[44] Y. Tamada, S. Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara and S. Miyano. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, (ECCB 2003), in press.
[45] L. Tierney and J.B. Kadane. Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc., 81, 82-86.
[46] YPD. databases/ypd.shtml
[47] YTF.
