Additional File 1 - Detailed explanation of the expression level CPD

As mentioned in the main text, the main CPD for the biclustering model consists of two individual factors:

$$P(e.level \mid e.gene.B, e.array.B, e.array.ID) = P_1(e.level \mid e.gene.B, e.array.B, e.array.ID) \cdot P_2(e.level \mid e.gene.B, e.array.B) \tag{1.1}$$

1.1. CPD factor 1: $P_1(e.level \mid e.gene.B, e.array.B, e.array.ID)$

This factor describes the main conditional probability that an expression level belongs to a distribution, determined by the gene to bicluster assignment $e.gene.B$, given the array to bicluster assignment $e.array.B$ and a specific array ID $e.array.ID$. Below we first define the CPD $P_1(\cdot)$ for three specific situations: one in which the expression value was assigned to the background, one in which the expression value was assigned to a single bicluster, and one in which the expression value is assigned to different overlapping biclusters. Based on these specific definitions (indicated by a $*$) we introduce the generalized definition of $P_1(\cdot)$ that covers all three situations.

Situation 1: background distribution

If the expression level is not part of any bicluster ($e.gene.B \cap e.array.B = \emptyset$), it is assigned to a virtual bicluster with index -1 that describes the background. This bicluster is described with separate Normal distributions $(\mu_{bgr,a}, \sigma_{bgr,a})$, one for each array $a$. The parameters of these distributions are fixed and derived a priori from the dataset using a robust estimation.
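The text states only that the background parameters come from a robust estimation and does not name the estimator. As a minimal sketch, assuming the median as the location estimate and the scaled median absolute deviation (MAD) as the scale estimate, the per-array background parameters could be computed as follows. The function name `robust_background`, the row/column layout of `expr`, and the consistency constant 1.4826 are illustrative assumptions, not taken from the source.

```python
import statistics

# Assumption: median/MAD stands in for the unspecified "robust estimation".
MAD_TO_SIGMA = 1.4826  # makes the MAD consistent with sigma for Normal data


def robust_background(expr):
    """Estimate one (mu_bgr, sigma_bgr) pair per array.

    expr: list of rows (one per gene); each row holds one value per array.
    Returns a list with one (mu, sigma) tuple per array (column).
    """
    params = []
    for column in zip(*expr):  # iterate over arrays
        mu = statistics.median(column)
        mad = statistics.median([abs(x - mu) for x in column])
        params.append((mu, MAD_TO_SIGMA * mad))
    return params


if __name__ == "__main__":
    # Four genes, two arrays; the value 100.0 is an outlier in array 0.
    data = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [100.0, 13.0]]
    for a, (mu, sigma) in enumerate(robust_background(data)):
        print(f"array {a}: mu_bgr={mu:.3f}, sigma_bgr={sigma:.3f}")
```

With this choice the single outlier barely moves the estimates for array 0, which is the property a robust estimator is meant to provide.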

Situation 2: biclusters without overlap

If no overlap occurs between different biclusters, each expression level can only be assigned to exactly one bicluster, each of which is modeled with Normal distributions with parameters $(\mu_{a,b}, \sigma_{a,b})$. The values of these parameters depend on the gene to bicluster and array to bicluster assignments ($g.B$, $a.B$) and on the unique array identifier $a.ID$. The probability $P_1^*(\cdot)$ to observe an expression level that belongs to a single bicluster only is defined as:

$$P_1^*(e.level \mid e.gene.B, e.array.B, e.array.ID) = P(e.level \mid e.array.ID = a, e.bicluster = \{b\}) = P(e.level \mid a, b) = \frac{1}{\sqrt{2\pi\sigma_{a,b}^2}} \exp\left(-\frac{(e.level - \mu_{a,b})^2}{2\sigma_{a,b}^2}\right) \tag{1.2}$$

We introduced the probability $P(e.level \mid e.array.ID = a, e.bicluster = \{b\})$ as the probability that an expression level belongs to a single bicluster. The attribute $e.bicluster$ does not formally exist in the model, but it is implicitly defined as the set of bicluster indices to which the expression level belongs, namely the intersection $e.gene.B \cap e.array.B$.

Situation 3: overlapping biclusters

When different biclusters overlap, an expression level can belong to multiple biclusters. To avoid overfitting it seems appropriate to model the overlap regions using the parameter sets that were already defined for the individual biclusters (situation 2, i.e., one parameter set per array-bicluster combination). For example, depending on the definition of the overlap, $P_1^*(\cdot)$ would be assigned a high probability if the expression level approximates the sum, average, weighted sum, minimum, or maximum, etc. of the probability distributions in the contributing biclusters. In our model we chose an overlap model where the probability of an expression level in the overlap region is defined as the geometric mean of the probabilities assigned to the expression level based on the distributions of the individual biclusters. For computational reasons, we assumed that the standard deviations of the distributions of the overlapping biclusters are almost identical and that an expression level can belong to at most two biclusters. Formally, $P_1^*(\cdot)$ can then be defined as:

$$P_1^*(e.level \mid e.gene.B, e.array.B, e.array.ID) = \prod_{b \in \{set(e.gene.B) \cap set(e.array.B)\}} P(e.level \mid a, b)^{1/\#\{set(e.gene.B) \cap set(e.array.B)\}} = \prod_{b \in set(B_e)} P(e.level \mid a, b)^{1/\#set(B_e)} \tag{1.3}$$

where the following notation is used: $set(X)$ denotes the set of indices $i$ for which the vector elements $X_i$ of binary vector $X$ are 1. $B_e$ is defined as the element-wise product of $e.gene.B$ and $e.array.B$. Therefore, $set(B_e)$ is the set of bicluster indices in the intersection of $e.gene.B$ and $e.array.B$, or formally: $set(B_e) = set(e.gene.B) \cap set(e.array.B)$. Finally, $\#set(B_e)$ is the number of elements in this set.

Generalized formula

The following notation covers all situations mentioned above:

$$P_1(e.level \mid e.gene.B, e.array.B, e.array.ID) = \prod_{b \in set(B_e)} P(e.level \mid a, b)^{1/\#set(B_e)} \tag{1.4}$$
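The generalized factor can be illustrated with a short sketch that takes the geometric mean of the Normal densities over the biclusters shared by the gene and the array, and falls back to the virtual background bicluster (index -1) when that intersection is empty. The function names, the binary-list encoding of the assignments, and the `params` dictionary keyed by `(array_id, bicluster_index)` are assumptions made for illustration; the computation is done in log space purely for numerical convenience.

```python
import math

BACKGROUND = -1  # index of the virtual background bicluster


def log_normal_pdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def p1(level, gene_b, array_b, array_id, params):
    """Geometric mean of the densities over the shared biclusters.

    gene_b, array_b: binary lists, one entry per bicluster.
    params: dict mapping (array_id, bicluster_index) -> (mu, sigma);
            the key (array_id, -1) holds the background parameters.
    """
    # set(B_e): indices where both the gene and the array belong to the bicluster
    members = [b for b, (g, a) in enumerate(zip(gene_b, array_b)) if g * a == 1]
    if not members:
        members = [BACKGROUND]  # empty intersection: use the background
    log_sum = sum(log_normal_pdf(level, *params[(array_id, b)]) for b in members)
    return math.exp(log_sum / len(members))  # exponent 1 / #set(B_e)


if __name__ == "__main__":
    params = {(0, -1): (0.0, 1.0), (0, 0): (1.0, 0.5), (0, 1): (2.0, 0.5)}
    print(p1(0.0, [0, 0], [1, 1], 0, params))  # background case
    print(p1(1.0, [1, 0], [1, 1], 0, params))  # single bicluster
    print(p1(1.5, [1, 1], [1, 1], 0, params))  # overlap of two biclusters
```

The three calls correspond to situations 1, 2 and 3: the single-bicluster case is simply an overlap of size one, so no separate code path is needed for it.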

Situation 2 is implicitly covered in the notation of situation 3, as it can be formulated as a special case of overlap with only one bicluster. Situation 1 is covered by the use of the virtual bicluster with index -1. This background bicluster can by definition not overlap with any other bicluster. The definition of the set $set(B_e)$ is also slightly different from how it was defined in situation 2, as $b \in set(B_e)$ now covers:

  $B_e$ is empty: background distribution; the product is over the set $b \in [-1]$, so $set(B_e) = [-1]$.
  $B_e$ is not empty: bicluster distribution; the product is over the set of biclusters in the intersection and never includes $b = -1$ by definition.

1.2. CPD factor 2: $P_2(e.level \mid e.gene.B, e.array.B)$

Without penalizing for model complexity, the MAP solution would include a very large number of biclusters, since each additional bicluster introduces additional degrees of freedom to model the expression values. Models with many biclusters can better explain the data and thus result in higher MAP solutions. Reducing model complexity in a traditional way by including additional terms in the log-likelihood or log-posterior distribution (such as the Bayesian information criterion (BIC) [1] or the Akaike information criterion (AIC) [2]) would lead to computational intractability if an Expectation-Maximization algorithm is used to find the MAP solution. The optimization algorithm assumes independent optimization per gene or per array in the substeps of the EM procedure. This independence no longer exists if one of the criteria mentioned above is included in the model. Therefore, an alternative strategy is used to reduce model complexity by introducing a penalty factor $P_2(\cdot)$.

The additional penalty factor $P_2(\cdot)$ is defined such that it only allows a set of expression levels to be included in a bicluster if they are on average N times more likely to be in their respective bicluster distribution than in their background distribution. The factor $P_2(\cdot)$ decomposes similarly to $P_1(\cdot)$, leading to the following expression:

$$P_2(e.level \mid e.gene.B, e.array.B) = \prod_{b \in set(B_e)} P_2(e.level \mid b)^{1/\#set(B_e)} \tag{1.5}$$

where $P_2(e.level \mid b) = \pi_{bgr}$ describes the probability that the expression level belongs to the background bicluster ($b = -1$) and $P_2(e.level \mid b) = \pi_{bicl}$ describes the probability that the expression level belongs to a bicluster other than the background ($b \neq -1$). This implies that a subset of expression levels $E$ for a particular gene or array will be assigned to a bicluster if Equation (1.6) holds:

$$\prod_{e \in E} P_1(e, bicl)\,\pi_{bicl} > \prod_{e \in E} P_1(e, bgr)\,\pi_{bgr} \iff \prod_{e \in E} \frac{P_1(e, bicl)}{P_1(e, bgr)} > \prod_{e \in E} \frac{\pi_{bgr}}{\pi_{bicl}} \tag{1.6}$$

The user-defined ratio $\pi_{bicl}/\pi_{bgr}$ determines how many times more likely it must be on average that an expression value is part of the bicluster distribution compared to being part of the background distribution before such a set of expression values $E$ is actually added to the bicluster. To guide the user in determining this ratio, we assume there exist one or more sets of genes in the dataset that are known to be coexpressed. In most practical biological situations, such known sets of genes exist (e.g., a set of operon genes). If such a set is not available, standard clustering techniques can also be used to identify one or more of these clusters.

Figure 1.1 illustrates how to choose the ratio $\pi_{bicl}/\pi_{bgr}$ given that a set of genes is known to be coexpressed. For every array, we calculate the probability score of being generated by the bicluster distribution to which it is assigned versus the score of being generated by the background distribution. The difference between these two probability scores is defined as δ. If the conditions under which these genes are coexpressed are also known in advance (see Figure 1.1 (top panel)), we use the known labels that indicate whether or not the array belongs to the bicluster to train a classifier. This implies determining the optimal threshold of δ so that the global error rate of misclassifying an array with known labels is minimized (= the product of the false positive rate and the false negative rate). If the conditions are unknown in advance, a plot of sorted δ is made. The suggested δ is the one that makes the best distinction between arrays with a low δ and arrays with a high δ value (cut-off point), as shown in Figure 1.1 (bottom panel).
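The inclusion rule and its link to δ can be made concrete with a small sketch. It assumes the condition has the form "the product of the bicluster densities times the bicluster prior exceeds the product of the background densities times the background prior", which in log space reduces to comparing the average log-density difference with the negated log prior ratio. Treating δ as this average log-density difference is an assumption made here for illustration, as are the function names and the `log_ratio` parameter (the user-defined log ratio of the bicluster prior to the background prior).

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def mean_delta(levels, bicl, bgr):
    """Average log-density difference (bicluster minus background).

    levels: expression levels of the candidate set E.
    bicl, bgr: (mu, sigma) of the bicluster and background distributions.
    Assumption: this average is used as the delta score.
    """
    diffs = [log_normal_pdf(x, *bicl) - log_normal_pdf(x, *bgr) for x in levels]
    return sum(diffs) / len(diffs)


def include_in_bicluster(levels, bicl, bgr, log_ratio):
    """Log-space form of the inclusion condition.

    sum(log P1(e, bicl)) + n*log(pi_bicl) > sum(log P1(e, bgr)) + n*log(pi_bgr)
    is equivalent to mean_delta > -log(pi_bicl / pi_bgr) = -log_ratio.
    """
    return mean_delta(levels, bicl, bgr) > -log_ratio


if __name__ == "__main__":
    bicl, bgr = (2.0, 0.2), (0.0, 1.0)
    print(include_in_bicluster([2.0, 2.1, 1.9], bicl, bgr, log_ratio=-0.5))
    print(include_in_bicluster([0.0, 0.1, -0.1], bicl, bgr, log_ratio=-0.5))
```

Under these assumptions, a more negative `log_ratio` raises the bar that a set of expression levels must clear, which is how the penalty discourages adding poorly supported genes or arrays to a bicluster.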

Figure 1.1. Determining the ratio $\log(\pi_{bicl}/\pi_{bgr})$. (top) Results of a simulated 500x200 dataset with three 50x50 biclusters (noise level 0.2). The plot shows the δ over all arrays (multiplied with the number of biclusters) for a set of genes that are known to be coexpressed in a number of arrays. Large δ indicates that the expression levels of the genes are more likely to be part of the bicluster distribution for that array than to be part of the background distribution. The δ threshold that best classifies these two sets of arrays is 0.5, leading to an optimal ratio of $\log(\pi_{bicl}/\pi_{bgr}) = -0.5$. (bottom) In the E. coli compendium, for the set of genes that are known to be regulated by FNR, the plot shows sorted δ over all arrays. Based on this plot, well chosen values for $\log(\pi_{bicl}/\pi_{bgr})$ range between -0.5 and -1.0.

References

1. Schwarz G: Estimating the Dimension of a Model. Annals of Statistics 1978, 6:461-464.
2. Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19:716-723.