
On the Approximability of Comparing Genomes with Duplicates

Sébastien Angibaud 1, Guillaume Fertin 1, Irena Rusu 1, Annelyse Thévenin 2, and Stéphane Vialette 3

arXiv:0806.1103v1 [q-bio.QM] 6 Jun 2008

1 Laboratoire d'Informatique de Nantes-Atlantique (LINA), UMR CNRS 6241, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes Cedex 3, France
{sebastien.angibaud,guillaume.fertin,irena.rusu}@univ-nantes.fr
2 Laboratoire de Recherche en Informatique (LRI), UMR CNRS 8623, Université Paris-Sud, 91405 Orsay, France
thevenin@lri.fr
3 IGM-LabInfo, UMR CNRS 8049, Université Paris-Est, 5 Bd Descartes 77454 Marne-la-Vallée, France
vialette@univ-mlv.fr

Abstract. A central problem in comparative genomics consists in computing a (dis-)similarity measure between two genomes, e.g. in order to construct a phylogenetic tree. A large number of such measures have been proposed in the recent past: number of reversals, number of breakpoints, number of common or conserved intervals, SAD etc. In their initial definitions, all these measures suppose that genomes contain no duplicates. However, we now know that genes can be duplicated within the same genome. One possible approach to overcome this difficulty is to establish a one-to-one correspondence (i.e. a matching) between genes of both genomes, where the correspondence is chosen in order to optimize the studied measure. Then, after a gene relabeling according to this matching and a deletion of the unmatched signed genes, two genomes without duplicates are obtained and the measure can be computed. In this paper, we are interested in three measures (number of breakpoints, number of common intervals and number of conserved intervals) and three models of matching (exemplar, intermediate and maximum matching models). We prove that, for each model and each measure M, computing a matching between two genomes that optimizes M is APX-hard. We show that this result remains true even for two genomes $G_1$ and $G_2$ such that $G_1$ contains no duplicates and no gene of $G_2$ appears more than twice. Therefore, our results extend those of [7, 10, 13].
Besides, in order to evaluate the possible existence of approximation algorithms concerning the number of breakpoints, we also study the complexity of the following decision problem: is there an exemplarization (resp. an intermediate matching, a maximum matching) that induces no breakpoint? In particular, we extend a result of [13] by proving the problem to be NP-complete in the exemplar model for a new class of instances, we note that the problems are equivalent in the intermediate and the exemplar models, and we show that the problem is in P in the maximum matching model. Finally, we focus on a fourth measure, closely related to the number of breakpoints: the number of adjacencies, for which we give several constant ratio approximation algorithms in the maximum matching model, in the case where genomes contain the same number of duplications of each gene.

Keywords: genome rearrangements, APX-hardness, duplicate genes, breakpoints, adjacencies, common intervals, conserved intervals, approximation algorithms.

1 Introduction and Preliminaries

In comparative genomics, computing a measure of (dis-)similarity between two genomes is a central problem: such a measure can be used, for instance, to construct phylogenetic trees. The measures defined so far essentially fall into two categories. The first one consists in counting the minimum number of operations needed to transform a genome into another (e.g. the edit distance [21] or the number of reversals [4]). The second one contains (dis-)similarity measures based on the genome structure, such as the number of breakpoints [7], the conserved intervals distance [6], the number of common intervals [10], SAD and MAD [24] etc.

When genomes contain no duplicates, most measures can be computed in polynomial time. However, assuming that genomes contain no duplicates is too limited. Indeed, it has been recently shown that a great number of duplicates exists in some genomes. For example, in [20], authors estimate that 15% of genes are duplicated in the human genome. A possible approach to overcome this difficulty is to specify a one-to-one correspondence (i.e. a matching) between genes of both genomes and to remove the unmatched genes, thus obtaining two genomes with identical gene content and no duplicates. Usually, the above mentioned matching is chosen in order to optimize the studied measure, following the parsimony principle. Three models achieving this correspondence have been proposed: the exemplar model [23], the intermediate model [3] and the maximum matching model [25]. Before defining precisely the measures and models studied in this paper, we need to introduce some notations.

Notations used in the paper. A genome $G$ is represented by a sequence of signed integers (called signed genes). For any genome $G$, we denote by $\mathcal{F}_G$ the set of unsigned integers (called genes) that are present in $G$. For any signed gene $g$, let $-g$ be the signed gene having the opposite sign and let $|g| \in \mathcal{F}_G$ be the corresponding (unsigned) gene. Given a genome $G$ without duplicates and two signed genes $a$, $b$ such that $a$ is located before $b$, let $G[a,b]$ be the set $S \subseteq \mathcal{F}_G$ of genes located between genes $a$ and $b$ in $G$, $a$ and $b$ included. We also note $[a,b]_G$ the substring (i.e. the sequence of consecutive elements) of $G$ starting at $a$ and finishing at $b$ in $G$.

Let $occ(g,G)$ be the number of occurrences of a given gene $g$ in a genome $G$ and let $occ(G) = \max\{occ(g,G) \mid g \in \mathcal{F}_G\}$. A pair of genomes $(G_1,G_2)$ is said to be of type $(x,y)$ if $occ(G_1) = x$ and $occ(G_2) = y$. A pair of genomes $(G_1,G_2)$ is said to be balanced if, for each gene $g \in \mathcal{F}_{G_1} \cup \mathcal{F}_{G_2}$, we have $occ(g,G_1) = occ(g,G_2)$ (otherwise, $(G_1,G_2)$ will be said to be unbalanced). Note that a pair $(G_1,G_2)$ of type $(x,x)$ is not necessarily balanced.
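The occurrence counts and the notions of type and balance can be illustrated with a minimal Python sketch. It assumes a genome is stored as a list of nonzero signed integers; the function names and the toy genomes `H1` and `H2` are illustrative choices, not taken from the paper.

```python
from collections import Counter

def occ_gene(g, genome):
    """occ(g, G): number of occurrences of the unsigned gene g in genome G."""
    return sum(1 for s in genome if abs(s) == g)

def occ(genome):
    """occ(G): largest number of occurrences of any gene of G."""
    return max(Counter(abs(s) for s in genome).values())

def pair_type(g1, g2):
    """Type (x, y) of the pair (G1, G2)."""
    return (occ(g1), occ(g2))

def is_balanced(g1, g2):
    """True if every gene occurs equally often in both genomes."""
    return Counter(abs(s) for s in g1) == Counter(abs(s) for s in g2)

# Toy genomes (hypothetical): gene 2 is duplicated in H1, gene 1 in H2.
H1 = [+1, -2, +3, +2]
H2 = [+2, +1, -1, +3]
```

Here the pair `(H1, H2)` is of type (2,2) but is not balanced, which matches the closing remark of the paragraph above.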
Denote by $n_G$ the size of genome $G$, that is, the number of signed genes it contains. Let $G[p]$, $1 \le p \le n_G$, be the signed gene that occurs at position $p$ on genome $G$, and let $|G[p]| \in \mathcal{F}_G$ be the corresponding (unsigned) gene. Let $N_G[p]$, $1 \le p \le n_G$, be the number of occurrences of $|G[p]|$ in the first $(p-1)$ positions of $G$. We define a duo in a genome $G$ as a pair of successive signed genes. Given a duo $d_i = (G[i],G[i+1])$ in a genome $G$, we note $-d_i$ the duo equal to $(-G[i+1],-G[i])$. Let $(d_1,d_2)$ be a pair of duos; $(d_1,d_2)$ is called a duo match if $d_1$ is a duo of $G_1$, $d_2$ is a duo of $G_2$, and if either $d_1 = d_2$ or $d_1 = -d_2$.

For example, consider the genome $G_1 = +1\ +2\ +3\ +4\ +5\ -1\ -2\ +6\ -2$. Then, $\mathcal{F}_{G_1} = \{1,2,3,4,5,6\}$, $n_{G_1} = 9$, $occ(1,G_1) = 2$, $occ(G_1) = 3$, $G_1[7] = -2$, $-G_1[7] = +2$, $|G_1[7]| = 2$ and $N_{G_1}[7] = 1$. Let $G_2$ be the genome $G_2 = +2\ -1\ +6\ +3\ -5\ -4\ +2\ -1\ -2$. Then the pair $(G_1,G_2)$ is balanced and is of type $(3,3)$. Let $d_1 = (G_1[4],G_1[5])$ be the duo $(+4,+5)$ and $d_2$ be the duo $(G_2[5],G_2[6])$. The pair $(d_1,d_2)$ is a duo match. Now, consider the genome $G_3 = +3\ -2\ +6\ +4\ -1\ +5$ without duplicates. We have $G_3[+6,-1] = \{1,4,6\}$ and $[+6,-1]_{G_3} = (+6,+4,-1)$.

Breakpoints, adjacencies, common and conserved intervals. Let us now define the four measures we will study in this paper. Let $G_1$ and $G_2$ be two genomes without duplicates and with the same gene content, that is $\mathcal{F}_{G_1} = \mathcal{F}_{G_2}$.

Breakpoint and Adjacency. Let $(a,b)$ be a duo in $G_1$. We say that the duo $(a,b)$ induces a breakpoint of $(G_1,G_2)$ if neither $(a,b)$ nor $(-b,-a)$ is a duo in $G_2$. Otherwise, we say that $(a,b)$ induces an adjacency of $(G_1,G_2)$. For example, when $G_1 = +1\ +2\ +3\ +4\ +5$ and $G_2 = +5\ -4\ -3\ +2\ +1$, the duo $(2,3)$ in $G_1$ induces a breakpoint of $(G_1,G_2)$ while $(3,4)$ in $G_1$ induces an adjacency of $(G_1,G_2)$. We note $B(G_1,G_2)$ (resp. $A(G_1,G_2)$) the number of breakpoints (resp. the number of adjacencies) that exist between $G_1$ and $G_2$.

Common interval. A common interval of $(G_1,G_2)$ is a substring of $G_1$ such that $G_2$ contains a permutation of this substring (not taking signs into account). For example, consider $G_1 = +1\ +2\ +3\ +4\ +5$ and $G_2 = +2\ -4\ +3\ +5\ +1$. The substring $[+3,+5]_{G_1}$ is a common interval of $(G_1,G_2)$.

Conserved interval. Consider two signed genes $a$ and $b$ of $G_1$ such that $a$ precedes $b$, where the precedence relation is taken in the broad sense, i.e. possibly $a = b$. The substring $[a,b]_{G_1}$ is a conserved interval of $(G_1,G_2)$ if either (i) $a$ precedes $b$ in $G_2$ and $G_2[a,b] = G_1[a,b]$, or (ii) $-b$ precedes $-a$ in $G_2$ and $G_2[-b,-a] = G_1[a,b]$. For example, if $G_1 = +1\ +2\ +3\ +4\ +5$ and $G_2 = -5\ -4\ +3\ -2\ +1$, the substring $[+2,+5]_{G_1}$ is a conserved interval of $(G_1,G_2)$. We note that the notion of conserved interval does not consider the sign of genes. Note also that a conserved interval is actually a common interval, but with additional restrictions on its extremities.

Dealing with duplicates in genomes. When genomes contain duplicates, we cannot directly compute the measures defined in the previous paragraph. A solution consists in finding a one-to-one correspondence (i.e. a matching) between duplicated genes of $G_1$ and $G_2$; we then use this correspondence to rename genes of $G_1$ and $G_2$, and we delete the unmatched signed genes in order to obtain two genomes $G'_1$ and $G'_2$ such that $G'_2$ is a permutation of $G'_1$; thus, the measure computation becomes possible. In this paper, we will focus on three models of matching: the exemplar, intermediate and maximum matching models.

The exemplar model [23]: for each gene $g$, we keep in the matching $M$ only one occurrence of $g$ in $G_1$ and in $G_2$, and we remove all the other occurrences. Hence, we obtain two genomes $G^E_1$ and $G^E_2$ without duplicates. The triplet $(G^E_1,G^E_2,M)$ is called an exemplarization of $(G_1,G_2)$.
Note that in this model, $M$ can be inferred from the exemplarized genomes $G^E_1$ and $G^E_2$. Thus, in the rest of the paper, any exemplarization $(G^E_1,G^E_2,M)$ of $(G_1,G_2)$ will be only described by the pair $(G^E_1,G^E_2)$.

The intermediate model [3]: in this model, for each gene $g$, we keep in the matching $M$ an arbitrary number $k_g$ of occurrences, $1 \le k_g \le \min(occ(g,G_1),occ(g,G_2))$, in order to obtain genomes $G^I_1$ and $G^I_2$. We call the triplet $(G^I_1,G^I_2,M)$ an intermediate matching of $(G_1,G_2)$.

The maximum matching model [25]: in this case, we keep in the matching $M$ the maximum number of signed genes in both genomes. More precisely, we look for a one-to-one correspondence between signed genes of $G_1$ and $G_2$ that matches, for each gene $g$, exactly $\min(occ(g,G_1),occ(g,G_2))$ occurrences. After this operation, we delete each unmatched signed gene. The triplet $(G^M_1,G^M_2,M)$ obtained by this operation is called a maximum matching of $(G_1,G_2)$.

Problems studied in this paper. Consider two genomes $G_1$ and $G_2$ with duplicates. Let EComI (resp. IComI, MComI) be the problem which consists in finding an exemplarization (resp. intermediate matching, maximum matching) $(G'_1,G'_2,M)$ of $(G_1,G_2)$ such that the number of common intervals of $(G'_1,G'_2)$ is maximized. Moreover, let EConsI (resp. IConsI, MConsI) be the problem which consists in finding an exemplarization (resp. intermediate matching, maximum matching) $(G'_1,G'_2,M)$ of $(G_1,G_2)$ such that the number of conserved intervals of $(G'_1,G'_2)$ is maximized. In Section 2, we prove the APX-hardness of EComI and EConsI, even for genomes $G_1$ and $G_2$ such that $occ(G_1) = 1$ and $occ(G_2) = 2$. These results induce the APX-hardness under the other models (i.e., IComI, MComI, IConsI and MConsI are APX-hard). These results extend in particular those of [7, 10].
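For genomes without duplicates and with the same gene content, the measures defined above can be sketched in a few lines of Python. This is only an illustration of the definitions, assuming genomes are lists of nonzero signed integers and using 0-based positions; the function names are hypothetical and the quadratic enumeration is not meant to be efficient.

```python
def duos(genome):
    """All pairs of successive signed genes."""
    return [(genome[i], genome[i + 1]) for i in range(len(genome) - 1)]

def adjacencies_and_breakpoints(g1, g2):
    """Return (A(G1,G2), B(G1,G2))."""
    d2 = set(duos(g2))
    adj = sum(1 for (a, b) in duos(g1) if (a, b) in d2 or (-b, -a) in d2)
    return adj, len(g1) - 1 - adj

def common_intervals(g1, g2):
    """Index pairs (i, j) such that g1[i..j] is a common interval."""
    pos2 = {abs(s): p for p, s in enumerate(g2)}
    found = []
    for i in range(len(g1)):
        lo = hi = pos2[abs(g1[i])]
        for j in range(i, len(g1)):
            p = pos2[abs(g1[j])]
            lo, hi = min(lo, p), max(hi, p)
            # the genes of g1[i..j] are contiguous in g2 iff the span matches
            if hi - lo == j - i:
                found.append((i, j))
    return found

def is_conserved(g1, g2, i, j):
    """True if the substring g1[i..j] is a conserved interval of (g1, g2)."""
    a, b = g1[i], g1[j]
    genes = {abs(s) for s in g1[i:j + 1]}
    pos2 = {s: p for p, s in enumerate(g2)}
    # case (i): a precedes b in g2; case (ii): -b precedes -a in g2
    for first, last in ((a, b), (-b, -a)):
        if first in pos2 and last in pos2 and pos2[first] <= pos2[last]:
            if {abs(s) for s in g2[pos2[first]:pos2[last] + 1]} == genes:
                return True
    return False
```

On the examples of the text, `adjacencies_and_breakpoints([1, 2, 3, 4, 5], [5, -4, -3, 2, 1])` gives one adjacency (the duo (3,4)) and three breakpoints, the index pair `(2, 4)` (that is, $[+3,+5]_{G_1}$) is reported as a common interval for $G_2 = +2\ -4\ +3\ +5\ +1$, and `is_conserved([1, 2, 3, 4, 5], [-5, -4, 3, -2, 1], 1, 4)` holds.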

Let EBD (resp. IBD, MBD) be the problem which consists in finding an exemplarization (resp. intermediate matching, maximum matching) $(G'_1,G'_2,M)$ of $(G_1,G_2)$ that minimizes the number of breakpoints between $G'_1$ and $G'_2$. In Section 3, we prove the APX-hardness of EBD, even for genomes $G_1$ and $G_2$ such that $occ(G_1) = 1$ and $occ(G_2) = 2$. This result implies that IBD and MBD are also APX-hard, and extends those of [13].

Let ZEBD (resp. ZIBD, ZMBD) be the problem which consists in determining, for two genomes $G_1$ and $G_2$, whether there exists an exemplarization (resp. intermediate matching, maximum matching) which induces zero breakpoint. In Section 4, we study the complexity of ZEBD, ZMBD and ZIBD: in particular, we extend a result of [13] by proving ZEBD to be NP-complete for a new class of instances. We also note that the problems ZEBD and ZIBD are equivalent, and we show that ZMBD is in P. Finally, in Section 5, we focus on a fourth measure, closely related to the number of breakpoints: the number of adjacencies, for which we give several constant ratio approximation algorithms in the maximum matching model, in the case where genomes are balanced.

2 EComI and EConsI are APX-hard

Consider two genomes $G_1$ and $G_2$ with duplicates, and let EComI (resp. IComI, MComI) be the problem which consists in finding an exemplarization (resp. intermediate matching, maximum matching) $(G'_1,G'_2,M)$ of $(G_1,G_2)$ such that the number of common intervals of $(G'_1,G'_2)$ is maximized. Moreover, let EConsI (resp. IConsI, MConsI) be the problem which consists in finding an exemplarization (resp. intermediate matching, maximum matching) $(G'_1,G'_2,M)$ of $(G_1,G_2)$ such that the number of conserved intervals of $(G'_1,G'_2)$ is maximized.

EComI and MComI have been proved to be NP-complete even if $occ(G_1) = 1$ and $occ(G_2) = 2$ in [10]. Besides, in [6], Blin and Rizzi have studied the problem of computing a distance built on the number of conserved intervals.
This distance differs from the number of conserved intervals we study in this paper, mainly in the sense that (i) it can be applied to two sets of genomes (as opposed to two genomes in our case), and (ii) the distance between two identical genomes of length $n$ is equal to 0 (as opposed to $\frac{n(n+1)}{2}$ in our case). Blin and Rizzi [6] proved that finding the minimum distance is NP-complete, under both the exemplar and maximum matching models. A closer analysis of their proof shows that it can be easily adapted to prove that EConsI and MConsI are NP-complete, even in the case $occ(G_1) = 1$. We can conclude from the above results that IComI and IConsI are also NP-complete, since when one genome contains no duplicates, exemplar, intermediate and maximum matching models are equivalent.

In this section, we improve the above results by showing that the six problems EComI, IComI, MComI, EConsI, IConsI and MConsI are APX-hard, even when genomes $G_1$ and $G_2$ are such that $occ(G_1) = 1$ and $occ(G_2) = 2$. The main result is Theorem 1, which will be completed by Corollary 1 at the end of the section.

Theorem 1. EComI and EConsI are APX-hard even when genomes $G_1$ and $G_2$ are such that $occ(G_1) = 1$ and $occ(G_2) = 2$.

We prove Theorem 1 by using an L-reduction [22] from the Min-Vertex-Cover problem on cubic graphs, denoted here Min-Vertex-Cover-3. Let $G = (V,E)$ be a cubic graph, i.e. for all $v \in V$, $degree(v) = 3$. A set of vertices $V' \subseteq V$ is called a vertex cover of $G$ if for each edge $e \in E$, there exists a vertex $v \in V'$ such that $e$ is incident to $v$. The problem Min-Vertex-Cover-3 is defined as follows:

Problem: Min-Vertex-Cover-3
Input: A cubic graph $G = (V,E)$.
Solution: A vertex cover $V'$ of $G$.
Measure: The cardinality of $V'$.

Min-Vertex-Cover-3 was proved to be APX-complete in [1].

2.1 Reduction

Let $G = (V,E)$ be an instance of Min-Vertex-Cover-3, where $G$ is a cubic graph with $V = \{v_1 \ldots v_n\}$ and $E = \{e_1 \ldots e_m\}$. Consider the transformation $R$ which associates to the graph $G$ two genomes $G_1$ and $G_2$ in the following way, where each gene has a positive sign:

$$G_1 = b_1\, b_2 \ldots b_m\ x\ a_1 C_1 f_1\ a_2 C_2 f_2 \ldots a_n C_n f_n\ y\ b_{m+n}\, b_{m+n-1} \ldots b_{m+1} \qquad (1)$$

$$G_2 = y\ a_1 D_1 f_1\ b_{m+1}\ a_2 D_2 f_2\ b_{m+2} \ldots b_{m+n-1}\ a_n D_n f_n\ b_{m+n}\ x \qquad (2)$$

with:
- for each $i$, $1 \le i \le n$, $a_i = 6i-5$ and $f_i = 6i$;
- for each $i$, $1 \le i \le n$, $C_i = (a_i+1),(a_i+2),(a_i+3),(a_i+4)$;
- for each $i$, $1 \le i \le n+m$, $b_i = 6n+i$;
- $x = 7n+m+1$ and $y = 7n+m+2$;
- for each $i$, $1 \le i \le n$, $D_i = (a_i+3),(b_j),(a_i+1),(b_k),(a_i+4),(b_l),(a_i+2)$, where $e_j$, $e_k$ and $e_l$ are the edges which are incident to $v_i$ in $G$, with $j < k < l$.

In the following, genes $b_i$, $1 \le i \le m$, are called markers. There is no duplicated gene in $G_1$ and the markers are the only duplicated genes in $G_2$; these genes occur twice in $G_2$. Hence, we have $occ(G_1) = 1$ and $occ(G_2) = 2$.

[Fig. 1. The cubic graph G, with vertices $v_1, \ldots, v_4$ and edges $e_1, \ldots, e_6$.]

To illustrate the reduction, consider the cubic graph $G$ of Figure 1. From $G$, we construct the following genomes $G_1$ and $G_2$:

$$G_1 = \underbrace{25}_{b_1}\ \underbrace{26}_{b_2}\ \underbrace{27}_{b_3}\ \underbrace{28}_{b_4}\ \underbrace{29}_{b_5}\ \underbrace{30}_{b_6}\ \underbrace{35}_{x}\ \underbrace{1\ 2\ 3\ 4\ 5\ 6}_{a_1 C_1 f_1}\ \underbrace{7\ 8\ 9\ 10\ 11\ 12}_{a_2 C_2 f_2}\ \underbrace{13\ 14\ 15\ 16\ 17\ 18}_{a_3 C_3 f_3}\ \underbrace{19\ 20\ 21\ 22\ 23\ 24}_{a_4 C_4 f_4}\ \underbrace{36}_{y}\ \underbrace{34}_{b_{10}}\ \underbrace{33}_{b_9}\ \underbrace{32}_{b_8}\ \underbrace{31}_{b_7}$$

$$G_2 = \underbrace{36}_{y}\ 1\ \underbrace{4\ 25\ 2\ 26\ 5\ 27\ 3}_{D_1}\ 6\ \underbrace{31}_{b_7}\ 7\ \underbrace{10\ 25\ 8\ 28\ 11\ 29\ 9}_{D_2}\ 12\ \underbrace{32}_{b_8}\ 13\ \underbrace{16\ 26\ 14\ 28\ 17\ 30\ 15}_{D_3}\ 18\ \underbrace{33}_{b_9}\ 19\ \underbrace{22\ 27\ 20\ 29\ 23\ 30\ 21}_{D_4}\ 24\ \underbrace{34}_{b_{10}}\ \underbrace{35}_{x}$$
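The transformation $R$ can be written out directly from (1) and (2). The sketch below is a hedged illustration: the function name `reduction_R` and the dictionary `incident` (mapping each vertex index to the 1-based indices of its three incident edges) are assumptions of this example. The incidence lists used for the graph of Figure 1 are read off from the markers appearing in $D_1, \ldots, D_4$ of the example genomes.

```python
def reduction_R(n, incident):
    """Build (G1, G2) from a cubic graph with n vertices.

    incident[i] holds the three 1-based edge indices incident to v_i.
    All genes are positive, as in constructions (1) and (2).
    """
    m = 3 * n // 2                       # a cubic graph has 3n/2 edges
    b = lambda i: 6 * n + i              # b_i, for 1 <= i <= n + m
    x, y = 7 * n + m + 1, 7 * n + m + 2
    g1 = [b(i) for i in range(1, m + 1)] + [x]
    g1 += list(range(1, 6 * n + 1))      # a_1 C_1 f_1 ... a_n C_n f_n
    g1 += [y] + [b(m + i) for i in range(n, 0, -1)]
    g2 = [y]
    for i in range(1, n + 1):
        a = 6 * i - 5                    # a_i; f_i = a_i + 5
        j, k, l = sorted(incident[i])
        g2 += [a, a + 3, b(j), a + 1, b(k), a + 4, b(l), a + 2, a + 5, b(m + i)]
    return g1, g2 + [x]

# Incidences of the example graph, as recovered from D_1..D_4 in the text.
K4 = {1: (1, 2, 3), 2: (1, 4, 5), 3: (2, 4, 6), 4: (3, 5, 6)}
G1, G2 = reduction_R(4, K4)
```

With these inputs, `G1` and `G2` coincide with the two example genomes above, and each marker 25, ..., 30 occurs exactly twice in `G2`.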

2.2 Preliminary results

In order to prove Theorem 1, we first give four intermediate lemmas. In the following, a common interval for the EComI problem or a conserved interval for EConsI is called a robust interval. Besides, a trivial interval will denote either an interval of length one (i.e. a singleton), or the whole genome.

Lemma 1. For any exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$, the non-trivial robust intervals of $(G_1,G^E_2)$ are necessarily contained in some sequence $a_i C_i f_i$ of $G_1$ ($1 \le i \le n$).

Proof. We start by proving the lemma for common intervals, and we will then extend it to conserved intervals. First, we prove that, for any exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$, each common interval $I$ such that $|I| \ge 2$ contains either both of $x$, $y$ or none of them. In the former case, this further implies that $I$ covers the whole genome. Suppose there exists a common interval $I_x$ (recall that by definition $I_x$ is on $G_1$) such that $|I_x| \ge 2$ and $I_x$ contains $x$. Let $PI_x$ be the permutation of $I_x$ in $G^E_2$. The interval $I_x$ must contain either $b_m$ or $a_1$. Let us detail each of the two cases:

(a) If $I_x$ contains $b_m$, then $PI_x$ contains $b_m$ too. Notice that there is some $i$, $1 \le i \le n$, such that $b_m$ belongs to $D_i$ in $G^E_2$. Then $PI_x$ contains all genes between $D_i$ and $x$ in $G^E_2$. Thus $PI_x$ contains $b_{m+n}$. Consequently, $I_x$ contains $b_{m+n}$ and it also contains $y$.
(b) If $I_x$ contains $a_1$, then $PI_x$ contains $a_1$ too. Then $PI_x$ contains all genes between $a_1$ and $x$. Thus $PI_x$ contains $b_{m+n}$. Hence, $I_x$ contains $b_{m+n}$ and then it also contains $y$.

Now, suppose that $I_y$ is a common interval such that $|I_y| \ge 2$ and $I_y$ contains $y$. Let $PI_y$ be the permutation of $I_y$ on $G^E_2$. The interval $I_y$ must contain either $b_{m+n}$ or $f_n$. Let us detail each of the two cases:

(a) If $I_y$ contains $b_{m+n}$, then $PI_y$ contains $b_{m+n}$ too. Thus $PI_y$ contains all genes between $b_{m+n}$ and $y$. Hence $PI_y$ contains all the sequences $D_i$, $1 \le i \le n$. In particular, $PI_y$ contains all the markers and consequently $I_y$ must contain $x$.
(b) If $I_y$ contains $f_n$, then $PI_y$ contains $f_n$ too. Then $PI_y$ contains all genes between $f_n$ and $y$.
In particular, $PI_y$ contains $b_{m+n-1}$ and then $I_y$ contains $b_{m+n-1}$ too. Hence, $I_y$ also contains $b_{m+n}$, similarly to the previous case. Thus $I_y$ contains $x$.

We conclude that each non-singleton common interval containing either $x$ or $y$ necessarily contains both $x$ and $y$. Therefore, and by construction of $G_2$, there is only one such interval, that is $G_1$ itself. Hence, any non-trivial common interval is necessarily, in $G_1$, either strictly on the left of $x$, or between $x$ and $y$, or strictly on the right of $y$. Let us analyze these different cases:

Let $I$ be a non-trivial common interval situated strictly on the left of $x$ in $G_1$. Thus $I$ is a sequence of at least two consecutive markers. Since in any exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$, every marker in $G^E_2$ has neighboring genes which are not markers, this contradicts the fact that $I$ is a common interval.

Let $I$ be a non-trivial common interval situated strictly on the right of $y$ in $G_1$. Then $I$ is a substring of $b_{m+n},\ldots,b_{m+1}$ containing at least two genes. In any exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$, for each pair $(b_{m+i},b_{m+i+1})$ of $G^E_2$, with $1 \le i < n$, we have $a_{i+1} \in G^E_2[b_{m+i},b_{m+i+1}]$. This contradicts the fact that $I$ is strictly on the right of $y$ in $G_1$.

Let $I$ be a non-trivial common interval lying between $x$ and $y$ in $G_1$. For any exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$, a common interval cannot contain, in $G_1$, both $f_i$ and $a_{i+1}$ for some $i$, $1 \le i \le n-1$ (since $b_{m+i}$ is situated between $f_i$ and $a_{i+1}$ in $G^E_2$ and on the right of $x$ in $G_1$). Hence, a non-trivial common interval of $(G_1,G^E_2)$ is included in some sequence $a_i C_i f_i$ in $G_1$, $1 \le i \le n$. This proves the lemma for common intervals. By definition, any conserved interval is necessarily a common interval. So, a non-trivial conserved interval of $(G_1,G^E_2)$ is included in some sequence $a_i C_i f_i$ in $G_1$, $1 \le i \le n$. The lemma is proved.

Lemma 2. Let $(G_1,G^E_2)$ be an exemplarization of $(G_1,G_2)$ and $i \in [1 \ldots n]$. Let $\Delta$ be a substring of $[a_i+3,a_i+2]_{G^E_2}$ that does not contain any marker. If $|\Delta| \in \{2,3\}$, then there is no robust interval $I$ of $(G_1,G^E_2)$ such that $\Delta$ is a permutation of $I$.

Proof. First, we prove that there is no permutation $I$ of $\Delta$ such that $I$ is a common interval of $(G_1,G^E_2)$. Next, we show that there is no permutation $I$ of $\Delta$ such that $I$ is a conserved interval. By Lemma 1, we know that a non-trivial common interval of $(G_1,G^E_2)$ is a substring of some sequence $a_i C_i f_i$, $1 \le i \le n$. This substring contains only consecutive integers. Therefore, if there exists a permutation $I$ of $\Delta$ such that $I$ is a common interval of $(G_1,G^E_2)$, then $\Delta$ must be a permutation of consecutive integers. If $|\Delta| = 2$, we have $\Delta = (p,q)$ where $p$ and $q$ are not consecutive integers, and if $|\Delta| = 3$, then we have $\Delta = (a_i+3,a_i+1,a_i+4)$ or $\Delta = (a_i+1,a_i+4,a_i+2)$. In these three cases, $\Delta$ is not a permutation of consecutive integers. Hence, there is no permutation $I$ of $\Delta$ such that $I$ is a common interval of $(G_1,G^E_2)$. Moreover, any conserved interval is also a common interval. Thus, there is no permutation $I$ of $\Delta$ such that $I$ is a conserved interval of $(G_1,G^E_2)$.

For more clarity, let us now introduce some notations. Given a graph $G = (V,E)$, let $VC = \{v_{i_1},v_{i_2} \ldots v_{i_k}\}$ be a vertex cover of $G$. Let $R(G) = (G_1,G_2)$ be the pair of genomes defined by the construction described in (1) and (2).
Now, let $F$ be the function which associates to $VC$, $G_1$ and $G_2$ an exemplarization $F(VC)$ of $(G_1,G_2)$ as follows. In $G_2$, all the markers are removed from the sequences $D_i$ for all $i \ne i_1, i_2 \ldots i_k$. Next, for each marker which is still present twice, one of its occurrences is arbitrarily removed. Since in $G_2$ only markers are duplicated, we conclude that $F(VC)$ is an exemplarization of $(G_1,G_2)$.

Given a cubic graph $G$ and genomes $G_1$ and $G_2$ obtained by the transformation $R(G)$, let us define the function $S$ which associates to an exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$ the vertex cover $VC'$ of $G$ defined as follows: $VC' = \{v_i \mid 1 \le i \le n \wedge \exists j \in \{1 \ldots m\},\ b_j \in G^E_2[a_i,f_i]\}$. In other words, we keep in $VC'$ the vertices $v_i$ of $G$ for which there exists some gene $b_j$ such that $b_j$ is in $G^E_2[a_i,f_i]$. We now prove that $VC'$ is a vertex cover. Consider an edge $e_p$ of $G$. By construction of $G_1$ and $G_2$, there exists some $i$, $1 \le i \le n$, such that gene $b_p$ is located between $a_i$ and $f_i$ in $G^E_2$. The presence of gene $b_p$ between $a_i$ and $f_i$ implies that vertex $v_i$ belongs to $VC'$. We conclude that each edge is incident to at least one vertex of $VC'$.

Let $W$ be the function defined on {EConsI, EComI} by $W(pb) = 1$ if $pb$ = EConsI and $W(pb) = 4$ if $pb$ = EComI. Let $opt_{pb}(A)$ be the optimum result of an instance $A$ for an optimization problem $pb$, $pb \in$ {EComI, EConsI, Min-Vertex-Cover-3}.

We now define the function $T$ whose arguments are a problem $pb \in$ {EConsI, EComI} and a cubic graph $G$. Let $R(G) = (G_1,G_2)$ as usual. Then $T(pb,G)$ is defined as the number of robust trivial intervals of an exemplarization $(G_1,G^E_2)$ of $(G_1,G_2)$ with respect to $pb$. Let $n$ and $m$ be respectively the number of vertices and the number of edges of $G$. We have $T(\text{EConsI},G) = 7n+m+2$ and $T(\text{EComI},G) = 7n+m+3$. Indeed, for EComI, there are $7n+m+2$ singletons and we also need to consider the whole genome.

Lemma 3. Let $pb \in$ {EComI, EConsI}. Let $G$ be a cubic graph and $R(G) = (G_1,G_2)$. Let $(G_1,G^E_2)$ be an exemplarization of $(G_1,G_2)$ and let $i$, $1 \le i \le n$. Then only two cases can occur with respect to $D_i$.
1. Either in $G^E_2$, all the markers from $D_i$ were removed, and in this case, there are exactly $W(pb)$ non-trivial robust intervals involving $D_i$.
2. Or in $G^E_2$, at least one marker was kept in $D_i$, and in this case, there is no non-trivial robust interval involving $D_i$.

Proof. We first prove the lemma for the EComI problem and then we extend it to EConsI. Lemma 1 implies that each non-trivial common interval $I$ of $(G_1,G^E_2)$ is contained in some substring of $a_i C_i f_i$, $1 \le i \le n$. So, the permutation of $I$ on $G^E_2$ is contained in a substring of $a_i D_i f_i$, $1 \le i \le n$.

Consider $i$, $1 \le i \le n$, and suppose that all the markers from $D_i$ are removed on $G^E_2$. Thus, $a_i C_i f_i$, $C_i$, $a_i C_i$ and $C_i f_i$ are common intervals of $(G_1,G^E_2)$. Let us now show that there is no other non-trivial common interval involving $D_i$. Let $\Delta$ be a substring of $[a_i+3,a_i+2]_{G^E_2}$ such that $|\Delta| \in \{2,3\}$. By Lemma 2, we know that $\Delta$ is not a common interval. The remaining intervals are $(a_i,a_i+3)$, $(a_i,a_i+3,a_i+1)$, $(a_i,a_i+3,a_i+1,a_i+4)$, $(a_i+1,a_i+4,a_i+2,f_i)$, $(a_i+4,a_i+2,f_i)$ and $(a_i+2,f_i)$. By construction, none of them can be a common interval, because none of them is a permutation of consecutive integers. Hence, there are only four non-trivial common intervals involving $D_i$ in $G^E_2$. Among these four common intervals, only $a_i C_i f_i$ is a conserved interval too. In the end, if all the markers are removed from $D_i$, there are exactly four non-trivial common intervals and one non-trivial conserved interval involving $D_i$. So, given a problem $pb \in$ {EComI, EConsI}, there are exactly $W(pb)$ non-trivial robust intervals involving $D_i$.

Now, suppose that at least one marker of $D_i$ is kept in $G^E_2$. Lemma 1 shows that each non-trivial common interval $I$ of $(G_1,G^E_2)$ is contained in some substring of $a_i C_i f_i$, $1 \le i \le n$.
Since no marker is present in a sequence $a_i C_i f_i$, we deduce that there does not exist any non-trivial common interval containing a marker. So, a non-trivial common interval involving $D_i$ only must contain a substring $\Delta$ of $[a_i+3,a_i+2]_{G^E_2}$ such that $\Delta$ contains no marker. Since no marker is an extremity of $[a_i+3,a_i+2]_{G^E_2}$, we have $2 \le |\Delta| \le 3$. By Lemma 2, we know that $\Delta$ is not a common interval. The remaining intervals to be considered are the intervals $a_i\Delta$ and $\Delta f_i$. By construction of $a_i C_i f_i$, these intervals are not common intervals (the absence of gene $a_i+2$ for $a_i\Delta$ and of gene $a_i+3$ for $\Delta f_i$ implies that these intervals are not a permutation of consecutive integers). Hence, these intervals cannot be conserved intervals either.

Lemma 4. Let $pb \in$ {EComI, EConsI}. Let $G = (V,E)$ be a cubic graph with $V = \{v_1 \ldots v_n\}$ and $E = \{e_1 \ldots e_m\}$ and let $G_1$, $G_2$ be the two genomes obtained by $R(G)$.
1. Let $VC$ be a vertex cover of $G$ and denote $k = |VC|$. Then the exemplarization $F(VC)$ of $(G_1,G_2)$ has at least $N = n \cdot W(pb) + T(pb,G) - W(pb) \cdot k$ robust intervals.
2. Let $(G_1,G^E_2)$ be an exemplarization of $(G_1,G_2)$ and let $VC'$ be the vertex cover of $G$ obtained by $S(G_1,G^E_2)$. Then $|VC'| = \frac{W(pb) \cdot n + T(pb,G) - N'}{W(pb)}$, where $N'$ is the number of robust intervals of $(G_1,G^E_2)$.

Proof. 1. Let $pb \in$ {EComI, EConsI}. Let $G$ be a cubic graph and let $G_1$ and $G_2$ be the two genomes obtained by $R(G)$. Let $VC$ be a vertex cover of $G$ and denote $k = |VC|$. Let $(G_1,G^E_2)$ be the exemplarization of $(G_1,G_2)$ obtained by $F(VC)$. By construction, we have at least $(n-k)$ substrings $D_i$ in $G^E_2$ for which all the markers are removed. By Lemma 3, we know that each of these substrings implies the existence of $W(pb)$ non-trivial robust intervals. So, we have at least $W(pb) \cdot (n-k)$ non-trivial robust intervals. Moreover, by definition of $T(pb,G)$, the number of trivial robust intervals of $(G_1,G^E_2)$ is exactly $T(pb,G)$. Thus, we have at least $N = W(pb) \cdot n + T(pb,G) - W(pb) \cdot k$ robust intervals of $(G_1,G^E_2)$.

2. Let $(G_1,G^E_2)$ be an exemplarization of $(G_1,G_2)$ and let $n-j$ be the number of sequences $D_i$, $1 \le i \le n$, for which all markers have been deleted in $G^E_2$. Then, by Lemmas 1 and 3, the number of robust intervals of $(G_1,G^E_2)$ is equal to $N' = W(pb) \cdot n + T(pb,G) - W(pb) \cdot j$. Let $VC'$ be the vertex cover obtained by $S(G_1,G^E_2)$. Each marker has one occurrence in $G^E_2$ and these occurrences lie in $j$ sequences $D_i$. So, by definition of $S$, we conclude that $|VC'| = j = \frac{W(pb) \cdot n + T(pb,G) - N'}{W(pb)}$.

2.3 Main result

Let us first define the notion of L-reduction [22]: let $A$ and $B$ be two optimization problems and $c_A$, $c_B$ be respectively their cost functions. An L-reduction from problem $A$ to problem $B$ is a pair of polynomial-time computable functions $R$ and $S$ with the following properties:
(a) If $x$ is an instance of $A$, then $R(x)$ is an instance of $B$;
(b) If $x$ is an instance of $A$ and $y$ is a solution of $R(x)$, then $S(y)$ is a solution of $A$;
(c) If $x$ is an instance of $A$ and $R(x)$ is its corresponding instance of $B$, then there is some positive constant $\alpha$ such that $opt_B(R(x)) \le \alpha \cdot opt_A(x)$;
(d) If $s$ is a solution of $R(x)$, then there is some positive constant $\beta$ such that $|opt_A(x) - c_A(S(s))| \le \beta\,|opt_B(R(x)) - c_B(s)|$.

We prove Theorem 1 by showing that the pair $(R,S)$ defined previously is an L-reduction from Min-Vertex-Cover-3 to EConsI and from Min-Vertex-Cover-3 to EComI. First note that properties (a) and (b) are obviously satisfied by $R$ and $S$. Consider $pb \in$ {EComI, EConsI}. Let $G = (V,E)$ be a cubic graph with $n$ vertices and $m$ edges. We now prove properties (c) and (d).
Consider the genomes $G_1$ and $G_2$ obtained by $R(G)$. For the sake of clarity, we abbreviate here and in the following $opt_{\text{Min-Vertex-Cover-3}}$ to $opt_{\text{Min-VC}}$. First, we need to prove that there exists $\alpha \ge 0$ such that $opt_{pb}(G_1,G_2) \le \alpha \cdot opt_{\text{Min-VC}}(G)$. Since $G$ is cubic, we have the following properties:

$$n \ge 4 \qquad (3)$$

$$m = \frac{1}{2}\sum_{i=1}^{n} degree(v_i) = \frac{3n}{2} \qquad (4)$$

$$opt_{\text{Min-VC}}(G) \ge \frac{m}{3} = \frac{n}{2} \qquad (5)$$

To explain property (5), remark that, in a cubic graph $G$ with $n$ vertices and $m$ edges, each vertex covers three edges. Thus, a set of $k$ vertices covers at most $3k$ edges. Hence, any vertex cover of $G$ must contain at least $\frac{m}{3}$ vertices.

By Lemma 3, we know that sequences of the form $a_i C_i f_i$, $1 \le i \le n$, contain either zero or $W(pb)$ non-trivial robust intervals. By Lemma 1, there are no other non-trivial robust intervals. So, we have the following inequality:

$$opt_{pb}(G_1,G_2) \le \underbrace{T(pb,G)}_{\text{trivial robust intervals}} + W(pb) \cdot n$$

If $pb$ = EComI, we have:

$$opt_{\text{EComI}}(G_1,G_2) \le 7n+m+3+4n, \quad\text{hence}\quad opt_{\text{EComI}}(G_1,G_2) \le \frac{27n}{2} \text{ by (3) and (4)} \qquad (6)$$

And if $pb$ = EConsI, we have:

$$opt_{\text{EConsI}}(G_1,G_2) \le 7n+m+2+n, \quad\text{hence}\quad opt_{\text{EConsI}}(G_1,G_2) \le \frac{21n}{2} \text{ by (3) and (4)} \qquad (7)$$

Altogether, by (5), (6) and (7), we prove property (c) with $\alpha = 27$.

Now, let us prove property (d). Let $VC = \{v_{i_1},v_{i_2} \ldots v_{i_P}\}$ be a minimum vertex cover of $G$. Then $P = opt_{\text{Min-VC}}(G)$. Let $G_1$ and $G_2$ be the genomes obtained by $R(G)$. Let $(G_1,G^E_2)$ be an exemplarization of $(G_1,G_2)$ and let $k'$ be the number of robust intervals of $(G_1,G^E_2)$. Finally, let $VC'$ be the vertex cover of $G$ such that $VC' = S(G_1,G^E_2)$. We need to find a positive constant $\beta$ such that $|P - |VC'|| \le \beta\,|opt_{pb}(G_1,G_2) - k'|$.

For $pb \in$ {EComI, EConsI}, let $N_{pb}$ be the number of robust intervals between the two genomes obtained by $F(VC)$. By the first property of Lemma 4, we have

$$opt_{pb}(G_1,G_2) \ge N_{pb} \ge W(pb) \cdot n + T(pb,G) - W(pb) \cdot P.$$

So, it is sufficient to prove that there exists some $\beta \ge 0$ such that $|P - |VC'|| \le \beta\,|W(pb) \cdot n + T(pb,G) - W(pb) \cdot P - k'|$. By the second property of Lemma 4, we have $|VC'| = \frac{W(pb) \cdot n + T(pb,G) - k'}{W(pb)}$. Since $P \le |VC'|$, we have

$$|P - |VC'|| = |VC'| - P = \frac{W(pb) \cdot n + T(pb,G) - k'}{W(pb)} - P = \frac{1}{W(pb)}\left(W(pb) \cdot n + T(pb,G) - W(pb) \cdot P - k'\right).$$

So $\beta = 1$ is sufficient in both cases, since $W(\text{EComI}) = 4$ and $W(\text{EConsI}) = 1$, which implies $\frac{1}{W(pb)} \le 1$. Altogether, we then have $|opt_{\text{Min-VC}}(G) - |VC'|| \le 1 \cdot |opt_{pb}(G_1,G_2) - k'|$.

We proved that the reduction $(R,S)$ is an L-reduction. This implies that for two genomes $G_1$ and $G_2$, both problems EConsI and EComI are APX-hard even if $occ(G_1) = 1$ and $occ(G_2) = 2$. Theorem 1 is proved.

We extend in Corollary 1 our results for the intermediate and maximum matching models.

Corollary 1. IComI, MComI, IConsI and MConsI are APX-hard even when genomes $G_1$ and $G_2$ are such that $occ(G_1) = 1$ and $occ(G_2) = 2$.

Proof. The intermediate and maximum matching models are identical to the exemplar model when one of the two genomes contains no duplicates. Hence, the APX-hardness result for EComI (resp. EConsI) also holds for IComI and MComI (resp. IConsI and MConsI).
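The counting argument of Lemma 4 can be checked numerically on the four-vertex example of Section 2.1. The sketch below is a sanity check on one instance under stated assumptions, not a substitute for the proof: the function names are hypothetical, the edge incidences are those read off from $D_1, \ldots, D_4$ in the example genomes, and `exemplarize_F` resolves the "arbitrary" removal in $F$ by keeping the first remaining occurrence of each marker.

```python
def reduction_R(n, incident):
    """Genomes (G1, G2) of constructions (1) and (2), all genes positive."""
    m = 3 * n // 2
    b = lambda i: 6 * n + i
    x, y = 7 * n + m + 1, 7 * n + m + 2
    g1 = [b(i) for i in range(1, m + 1)] + [x] + list(range(1, 6 * n + 1))
    g1 += [y] + [b(m + i) for i in range(n, 0, -1)]
    g2 = [y]
    for i in range(1, n + 1):
        a = 6 * i - 5
        j, k, l = sorted(incident[i])
        g2 += [a, a + 3, b(j), a + 1, b(k), a + 4, b(l), a + 2, a + 5, b(m + i)]
    return g1, g2 + [x]

def exemplarize_F(g2, n, m, cover):
    """F(VC): drop markers from D_i when v_i is outside the cover, then keep
    the first remaining occurrence of each marker (one arbitrary choice)."""
    out, seen, block = [], set(), 0
    for s in g2:
        if s <= 6 * n:                       # gene of a_i D_i f_i, block i
            block = (s + 5) // 6
        if 6 * n < s <= 6 * n + m:           # s is a marker b_1..b_m
            if block not in cover or s in seen:
                continue
            seen.add(s)
        out.append(s)
    return out

def cover_S(g2e, n, m):
    """S: vertices v_i whose block [a_i, f_i] still contains a marker."""
    pos = {s: p for p, s in enumerate(g2e)}
    return {i for i in range(1, n + 1)
            if any(6 * n < s <= 6 * n + m
                   for s in g2e[pos[6 * i - 5]:pos[6 * i] + 1])}

def count_common(g1, g2):
    pos = {s: p for p, s in enumerate(g2)}
    total = 0
    for i in range(len(g1)):
        lo = hi = pos[g1[i]]
        for j in range(i, len(g1)):
            lo, hi = min(lo, pos[g1[j]]), max(hi, pos[g1[j]])
            total += (hi - lo == j - i)
    return total

def count_conserved(g1, g2):
    # every gene is positive here, so only case (i) of the definition applies
    pos = {s: p for p, s in enumerate(g2)}
    total = 0
    for i in range(len(g1)):
        for j in range(i, len(g1)):
            p, q = pos[g1[i]], pos[g1[j]]
            total += (p <= q and set(g2[p:q + 1]) == set(g1[i:j + 1]))
    return total

n, m = 4, 6
K4 = {1: (1, 2, 3), 2: (1, 4, 5), 3: (2, 4, 6), 4: (3, 5, 6)}
g1, g2 = reduction_R(n, K4)
g2e = exemplarize_F(g2, n, m, cover={1, 2, 3})   # a vertex cover of size 3
```

For this instance, $W \cdot n + T - W \cdot k$ evaluates to $16 + 37 - 12 = 41$ for EComI and $4 + 36 - 3 = 37$ for EConsI, and the two counting functions return exactly these values; `cover_S(g2e, n, m)` gives back the cover $\{v_1, v_2, v_3\}$.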

3 EBD is APX-hard

Consider two genomes $G_1$ and $G_2$ with duplicates, and let EBD (resp. IBD, MBD) be the problem which consists in finding an exemplarization (resp. intermediate matching, maximum matching) $(G'_1,G'_2,M)$ of $(G_1,G_2)$ that minimizes the number of breakpoints between $G'_1$ and $G'_2$. EBD has been proved to be NP-complete even if $occ(G_1) = 1$ and $occ(G_2) = 2$ [7]. Some inapproximability results also exist: in particular, it has been proved in [13] that, in the general case, EBD cannot be approximated within a factor $c \log n$, where $c > 0$ is a constant, and cannot be approximated within a factor 1.36 when $occ(G_1) = occ(G_2) = 2$. Moreover, for two balanced genomes $G_1$ and $G_2$ such that $k = occ(G_1) = occ(G_2)$, several approximation algorithms for MBD are given. These approximation algorithms admit respectively a ratio of 1.1037 when $k = 2$ [17], 4 when $k = 3$ [17] and $4k$ in the general case [19]. We can conclude from the above results that the IBD and MBD problems are also NP-complete, since when one genome contains no duplicates, exemplar, intermediate and maximum matching models are equivalent.

In this section, we improve the above results by showing that the three problems EBD, IBD and MBD are APX-hard, even when genomes $G_1$ and $G_2$ are such that $occ(G_1) = 1$ and $occ(G_2) = 2$. The main result is Theorem 2 below, which will be completed by Corollary 2 at the end of the section.

Theorem 2. EBD is APX-hard even when genomes $G_1$ and $G_2$ are such that $occ(G_1) = 1$ and $occ(G_2) = 2$.

To prove Theorem 2, we use an L-reduction from Min-Vertex-Cover-3 to EBD. Let $G = (V,E)$ be a cubic graph with $V = \{v_1 \ldots v_n\}$ and $E = \{e_1 \ldots e_m\}$. For each $i$, $1 \le i \le n$, let $e_{f_i}$, $e_{g_i}$ and $e_{h_i}$ be the three edges which are incident to $v_i$ in $G$ with $f_i < g_i < h_i$.
Let $R'$ be the polynomial transformation which associates to $G$ the following genomes $G_1$ and $G_2$, where each gene has a positive sign:

\[ G_1 = a_0\; a_1\; b_1\; a_2\; b_2 \ldots a_n\; b_n\; c_1\; d_1\; c_2\; d_2 \ldots c_m\; d_m\; c_{m+1} \]
\[ G_2 = a_0\; a_n\; d_{f_n}\; d_{g_n}\; d_{h_n}\; b_n \ldots a_2\; d_{f_2}\; d_{g_2}\; d_{h_2}\; b_2\; a_1\; d_{f_1}\; d_{g_1}\; d_{h_1}\; b_1\; c_1\; c_2 \ldots c_m\; c_{m+1} \]

with:
- $a_0 = 0$, and for each $i$, $1 \le i \le n$, $a_i = i$ and $b_i = n+i$;
- $c_{m+1} = 2n+m+1$, and for each $i$, $1 \le i \le m$, $c_i = 2n+i$ and $d_i = 2n+m+1+i$.

We remark that there is no duplication in $G_1$, so $occ(G_1) = 1$. In $G_2$, only the genes $d_i$, $1 \le i \le m$, are duplicated and occur twice. Thus $occ(G_2) = 2$.

Let $G$ be a cubic graph and $VC$ be a vertex cover of $G$. Let $G_1$ and $G_2$ be the genomes obtained by $R'(G)$. We define $F'$ to be the polynomial transformation which associates to $VC$, $G_1$ and $G_2$ the exemplarization $F'(VC) = (G_1,G_2^E)$ of $(G_1,G_2)$ as follows. For each $i$ such that $v_i \notin VC$, we remove from $G_2$ the genes $d_{f_i}$, $d_{g_i}$ and $d_{h_i}$. Then, for each $j$, $1 \le j \le m$, such that $d_j$ still has two occurrences in $G_2$, we arbitrarily remove one of these occurrences in order to obtain the genome $G_2^E$. Hence, $(G_1,G_2^E)$ is an exemplarization of $(G_1,G_2)$.

Given a cubic graph $G$, we construct $G_1$ and $G_2$ by the transformation $R'(G)$. Given an exemplarization $(G_1,G_2^E)$ of $(G_1,G_2)$, let $S'$ be the polynomial transformation which associates to $(G_1,G_2^E)$ the set $VC = \{v_i \mid 1 \le i \le n,\ a_i \text{ and } b_i \text{ are not consecutive in } G_2^E\}$. We claim that $VC$ is a vertex cover of $G$. Indeed, let $e_p$, $1 \le p \le m$, be an edge of $G$. Genome $G_2^E$ contains one occurrence of gene $d_p$ since $G_2^E$ is an exemplarization of $G_2$. By construction, there exists $i$, $1 \le i \le n$, such

that $d_p$ is in $G_2^E[a_i,b_i]$ and such that $e_p$ is incident to $v_i$. The presence of $d_p$ in $G_2^E[a_i,b_i]$ implies that vertex $v_i$ belongs to $VC$. We can conclude that each edge of $G$ is incident to at least one vertex of $VC$.

Lemmas 5 and 6 below are used to prove that $(R',S')$ is an L-reduction from the Min-Vertex-Cover-3 problem to the EBD problem. Let $G = (V,E)$ be a cubic graph with $V = \{v_1,v_2 \ldots v_n\}$ and $E = \{e_1,e_2 \ldots e_m\}$ and let us construct $(G_1,G_2)$ by the transformation $R'(G)$.

Lemma 5. Let $VC$ be a vertex cover of $G$ and $(G_1,G_2^E)$ the exemplarization given by $F'(VC)$. Then $|VC| = k \Rightarrow B(G_1,G_2^E) \le n+2m+k+1$, where $B(G_1,G_2^E)$ is the number of breakpoints between $G_1$ and $G_2^E$.

Proof. Suppose $|VC| = k$. Let us list the breakpoints between genomes $G_1$ and $G_2^E$ obtained by $F'(VC)$. The pairs $(b_i,a_{i+1})$, $1 \le i \le n-1$, and $(b_n,c_1)$ induce one breakpoint each. For all $i$, $1 \le i \le m$, each pair of the form $(c_i,d_i)$ (resp. $(d_i,c_{i+1})$) induces one breakpoint. For all $i$, $1 \le i \le n$, such that $v_i \in VC$, $(a_i,b_i)$ induces at most one breakpoint. Finally, the pair $(a_0,a_1)$ induces one breakpoint. Thus there are at most $n+2m+k+1$ breakpoints of $(G_1,G_2^E)$.

Lemma 6. Let $(G_1,G_2^E)$ be an exemplarization of $(G_1,G_2)$ and $VC'$ be the vertex cover of $G$ obtained by $S'(G_1,G_2^E)$. We have $B(G_1,G_2^E) = k' \Rightarrow |VC'| = k' - n - 2m - 1$.

Proof. Let $(G_1,G_2^E)$ be an exemplarization of $(G_1,G_2)$ and $VC'$ be the vertex cover obtained by $S'(G_1,G_2^E)$. Suppose $B(G_1,G_2^E) = k'$. For any exemplarization $(G_1,G_2^E)$ of $(G_1,G_2)$, the following breakpoints always occur: the pair $(a_0,a_1)$; for each $i$, $1 \le i \le m$, each pair $(c_i,d_i)$ and $(d_i,c_{i+1})$; for each $i$, $1 \le i \le n-1$, the pair $(b_i,a_{i+1})$; the pair $(b_n,c_1)$. Thus, we have at least $n+2m+1$ breakpoints. The other possible breakpoints are induced by pairs of the form $(a_i,b_i)$. Since we have $B(G_1,G_2^E) = k'$, there are exactly $k' - n - 2m - 1$ such breakpoints. By construction of $VC'$, the cardinality of $VC'$ is equal to the number of breakpoints induced by pairs of the form $(a_i,b_i)$. So, we have: $|VC'| = k' - n - 2m - 1$.
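The construction of $G_1$ and $G_2$, the exemplarization built from a vertex cover, and the cover read back from an exemplarization can be checked on a small instance. The following is only a minimal sketch: the function names (`build_genomes`, `exemplarize`, `recover_cover`, `breakpoints`) are hypothetical, the test graph $K_4$ is our choice, and the breakpoint count assumes the usual definition (a pair of consecutive genes $(x,y)$ of $G_1$ is a breakpoint when neither $(x,y)$ nor $(-y,-x)$ is consecutive in the other genome), which is not restated in this section. The "arbitrary" removal of duplicates is implemented here as "keep the first remaining occurrence".

```python
def build_genomes(n, edges):
    """Genomes G1, G2 for a cubic graph; edges[j-1] = (u, w) is edge e_j (1-based vertices)."""
    m = len(edges)
    a = lambda i: i                    # a_i = i (and a_0 = 0)
    b = lambda i: n + i                # b_i = n + i
    c = lambda j: 2 * n + j            # c_j = 2n + j, including c_{m+1}
    d = lambda j: 2 * n + m + 1 + j    # d_j = 2n + m + 1 + j
    g1 = [0]
    for i in range(1, n + 1):
        g1 += [a(i), b(i)]
    for j in range(1, m + 1):
        g1 += [c(j), d(j)]
    g1.append(c(m + 1))
    g2 = [0]
    for i in range(n, 0, -1):          # blocks a_i d.. b_i in decreasing order of i
        incident = [j for j, e in enumerate(edges, 1) if i in e]
        g2 += [a(i)] + [d(j) for j in incident] + [b(i)]
    g2 += [c(j) for j in range(1, m + 2)]
    return g1, g2

def exemplarize(n, m, g2, cover):
    """Exemplarization of G2 from a vertex cover: drop d-genes of uncovered blocks, then duplicates."""
    first_d = 2 * n + m + 2
    kept, out, current = set(), [], None
    for gene in g2:
        if 1 <= gene <= n:             # gene a_i opens the block of vertex v_i
            current = gene
        if gene >= first_d:            # a d-gene
            if current not in cover or gene in kept:
                continue
            kept.add(gene)
        out.append(gene)
    return out

def recover_cover(n, g2e):
    """Vertices v_i whose genes a_i and b_i are not consecutive in the exemplarized genome."""
    pos = {g: k for k, g in enumerate(g2e)}
    return {i for i in range(1, n + 1) if pos[n + i] - pos[i] != 1}

def breakpoints(g1, g2):
    """Number of breakpoints between two duplicate-free genomes (assumed standard definition)."""
    adjacent = set(zip(g2, g2[1:]))
    return sum(1 for x, y in zip(g1, g1[1:])
               if (x, y) not in adjacent and (-y, -x) not in adjacent)

# K4 is cubic: n = 4 vertices, m = 6 edges; {1, 2, 3} is a vertex cover.
n, edges = 4, [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
m = len(edges)
g1, g2 = build_genomes(n, edges)
cover = {1, 2, 3}
g2e = exemplarize(n, m, g2, cover)
recovered = recover_cover(n, g2e)
print(breakpoints(g1, g2e), n + 2 * m + 1 + len(recovered))
```

On this instance the two printed numbers coincide, which is the equality of Lemma 6, and the breakpoint count does not exceed $n+2m+k+1$ with $k = 3$, as Lemma 5 states.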
To prove that $(R',S')$ is an L-reduction, we first notice that properties (a) and (b) of an L-reduction are trivially verified. The next lemma proves property (c).

Lemma 7. The inequality $\mathrm{opt}_{EBD}(G_1,G_2) \le 12\cdot\mathrm{opt}_{\text{Min-VC}}(G)$ holds.

Proof. For a cubic graph $G$ with $n$ vertices and $m$ edges, we have $2m = 3n$ (see (4)) and $\mathrm{opt}_{\text{Min-VC}}(G) \ge \frac{n}{2}$ (see (5)). By construction of the genomes $G_1$ and $G_2$, any exemplarization of $(G_1,G_2)$ contains $2n+2m+2$ genes in each genome. Thus, we have $\mathrm{opt}_{EBD}(G_1,G_2) \le 2n+2m+2 = 5n+2 \le 6n$ ($n \ge 4$ in a cubic graph). Since $6n \le 12\cdot\frac{n}{2}$, we conclude that $\mathrm{opt}_{EBD}(G_1,G_2) \le 12\cdot\mathrm{opt}_{\text{Min-VC}}(G)$.

Now, we prove property (d) of our L-reduction.

Lemma 8. Let $(G_1,G_2^E)$ be an exemplarization of $(G_1,G_2)$ and let $VC'$ be the vertex cover of $G$ obtained by $S'(G_1,G_2^E)$. Then, we have $|\mathrm{opt}_{\text{Min-VC}}(G) - |VC'|| \le |\mathrm{opt}_{EBD}(G_1,G_2) - B(G_1,G_2^E)|$.

Proof. Let $(G_1,G_2^E)$ be an exemplarization of $(G_1,G_2)$ and $VC'$ be the vertex cover of $G$ obtained by $S'(G_1,G_2^E)$. Let $VC$ be a vertex cover of $G$ such that $|VC| = \mathrm{opt}_{\text{Min-VC}}(G)$. We know that $\mathrm{opt}_{\text{Min-VC}}(G) \le |VC'|$ and $\mathrm{opt}_{EBD}(G_1,G_2) \le B(G_1,G_2^E)$. So, it is sufficient to prove $|VC'| - \mathrm{opt}_{\text{Min-VC}}(G) \le B(G_1,G_2^E) - \mathrm{opt}_{EBD}(G_1,G_2)$.

By Lemma 5, we have $B(F'(VC)) \le n+2m+1+\mathrm{opt}_{\text{Min-VC}}(G)$, which implies $\mathrm{opt}_{EBD}(G_1,G_2) \le B(F'(VC)) \le n+2m+1+\mathrm{opt}_{\text{Min-VC}}(G)$. Then

\[ B(G_1,G_2^E) - \mathrm{opt}_{EBD}(G_1,G_2) \ge B(G_1,G_2^E) - n - 2m - 1 - \mathrm{opt}_{\text{Min-VC}}(G) \tag{8} \]

By Lemma 6, we have $|VC'| = B(G_1,G_2^E) - n - 2m - 1$, which implies

\[ |VC'| - \mathrm{opt}_{\text{Min-VC}}(G) = B(G_1,G_2^E) - n - 2m - 1 - \mathrm{opt}_{\text{Min-VC}}(G) \tag{9} \]

Finally, by (8) and (9), we get $|VC'| - \mathrm{opt}_{\text{Min-VC}}(G) \le B(G_1,G_2^E) - \mathrm{opt}_{EBD}(G_1,G_2)$.

Lemmas 7 and 8 prove that the pair $(R',S')$ is an L-reduction from Min-Vertex-Cover-3 to EBD. Hence, EBD is APX-hard even if $occ(G_1) = 1$ and $occ(G_2) = 2$, and Theorem 2 is proved.

We extend in Corollary 2 our results to the intermediate and maximum matching models.

Corollary 2. The IBD and MBD problems are APX-hard even when genomes $G_1$ and $G_2$ are such that $occ(G_1) = 1$ and $occ(G_2) = 2$.

Proof. The intermediate and maximum matching models are identical to the exemplar model when one of the two genomes contains no duplicates. Hence, the APX-hardness result for EBD also holds for IBD and MBD.

4 Zero breakpoint distance

This section is devoted to zero breakpoint distance recognition issues. Indeed, in [13], the authors showed that deciding whether the exemplar breakpoint distance between any two genomes is zero or not is NP-complete even when no gene occurs more than three times in both genomes, i.e., instances of type (3,3). This important result implies that the exemplar breakpoint distance problem does not admit any approximation in polynomial time, unless P = NP. Following this line of research, we first complement the result of [13] by proving that deciding whether the exemplar breakpoint distance between any two genomes is zero or not is NP-complete, even when no gene is duplicated more than twice in one of the genomes (the maximum number of duplications is however unbounded in the other genome).
This result is next extended to the intermediate matching model, and we give a practical, but exponential, algorithm for deciding whether the exemplar breakpoint distance between any two genomes is zero or not in case no gene occurs more than twice in both genomes (a problem whose complexity, P versus NP-complete, remains open). Finally, we show that deciding whether the maximum matching breakpoint distance between any two genomes is zero or not is polynomial-time solvable, and hence that such negative approximation results (the ones we obtained for the exemplar and intermediate models) do not propagate to the maximum matching model.

The following easy observation will prove extremely useful in the sequel of the present section.

Observation 3. Let $G_1$ and $G_2$ be two genomes. If the exemplar breakpoint distance between $G_1$ and $G_2$ is zero, then there exists an exemplarization $(G_1^E,G_2^E)$ of $(G_1,G_2)$ such that (1) $G_1^E = G_2^E$, or (2) $(G_1^E)^r = G_2^E$, where $(G_1^E)^r$ is the signed reversal of genome $G_1^E$.

The same observation can be made for the intermediate and maximum matching models.

4.1 Zero exemplar breakpoint distance

The zero exemplar breakpoint distance (ZEBD) problem is formally defined as follows.

Problem: ZEBD
Input: Two genomes $G_1$ and $G_2$.
Question: Is the exemplar breakpoint distance between $G_1$ and $G_2$ equal to zero?

Aiming at precisely defining the inapproximability landscape of computing the exemplar breakpoint distance between two genomes, we complement the result of [13], who showed ZEBD to be NP-complete even for instances of type (3,3), by the following theorem.

Theorem 4. ZEBD is NP-complete even if no gene occurs more than twice in $G_1$.

Proof. Membership of ZEBD in NP is immediate. The reduction we use to prove hardness is from Min-Vertex-Cover [16]. Let an arbitrary instance of Min-Vertex-Cover be given by a graph $G = (V,E)$ and a positive integer $k$. Write $V = \{v_1,v_2 \ldots v_n\}$ and $E = \{e_1,e_2 \ldots e_m\}$. In the rest of the proof, elements of $V$ (resp. $E$) will be seen either as vertices (resp. edges) or genes, depending on the context. The corresponding instance $(G_1,G_2)$ of ZEBD is defined as follows:

\[ G_1 = v_1\; X_1\; v_2\; X_2 \ldots v_n\; X_n \]
\[ G_2 = Y[1]\; Y[2] \ldots Y[k]\; Y_V . \]

For each $i = 1,2,\ldots,n$, $X_i$ is defined to be $X_i = e_{i_1} e_{i_2} \ldots e_{i_j}$, where $e_{i_1},e_{i_2},\ldots,e_{i_j}$, $i_1 < i_2 < \ldots < i_j$, are the edges incident to vertex $v_i$. The strings $Y[i]$, $1 \le i \le k$, are all equal and are defined by $Y[i] = Y_V\, Y_E$ where $Y_V = v_1 v_2 \ldots v_n$ and $Y_E = e_1 e_2 \ldots e_m$. Notice that no gene occurs more than twice in $G_1$ (actually genes $v_i$ occur once and genes $e_i$ occur twice). However, the number of occurrences of each gene in $G_2$ is upper bounded by $k+1$. Furthermore, all genes have positive sign, and hence according to Observation 3 we only need to consider exemplarizations $(G_1^E,G_2^E)$ of $(G_1,G_2)$ such that $G_1^E = G_2^E$. It is immediate to check that our construction can be carried out in polynomial time.

We now claim that there exists a vertex cover of size $k$ in $G$ iff the exemplar breakpoint distance between $G_1$ and $G_2$ is zero.

Suppose first that there exists a vertex cover $V' \subseteq V$ of size $k$ in $G$. Write $V' = \{v_{i_1},v_{i_2},\ldots,v_{i_k}\}$, $i_1 < i_2 < \ldots < i_k$.
For convenience, we also define $i_0$ to be 0. From $V'$ we construct an exemplarization $(G_1^E,G_2^E)$ as follows. We obtain $G_1^E$ from $G_1$ by a two-step procedure. First we delete in $G_1$ all strings $X_i$ such that $v_i \notin V'$. Second, for each $1 \le j \le m$, if gene $e_j$ still occurs twice, we delete its second occurrence (this second step is concerned with edges connecting two vertices in $V'$). We now turn to $G_2^E$. For $1 \le j \le k$, we consider the string $Y[j] = Y_V\, Y_E$ that we process as follows: (1) we delete in $Y_V$ all genes but $v_{i_j}$ and those genes $v_l \notin V'$ such that $i_{j-1} < l < i_j$, and (2) we delete in $Y_E$ every gene $e_l$ that is not incident to $v_{i_j}$, or that is incident to both $v_{i_j}$ and some smaller vertex in $V'$ (i.e., $e_l = \{v_{i_{j'}},v_{i_j}\}$ for some $j' < j$). Finally, we delete in the trailing string $Y_V = v_1 v_2 \ldots v_n$ all genes but those $v_l$ ($\notin V'$) such that $i_k < l$. Since $V'$ is a vertex cover in $G$, it follows that each gene occurs once in the obtained genomes, i.e., $(G_1^E,G_2^E)$ is indeed an exemplarization of $(G_1,G_2)$. It is now easily seen that $G_1^E = G_2^E$, and hence that the exemplar breakpoint distance between $G_1$ and $G_2$ is zero.
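The instance built in this proof can be explored on a toy graph. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, genes are encoded as tuples `('v', i)` and `('e', j)`, and the zero-distance test is a brute force over the exemplarizations of $G_1$ (exponential in the number of edges), not an algorithm from the paper. It relies on the fact, used in the proof, that all genes are positive, so only the case $G_1^E = G_2^E$ has to be tested, which amounts to asking whether some exemplarization of $G_1$ is a subsequence of $G_2$.

```python
from itertools import product

def build_zebd_instance(n, edges, k):
    """G1 = v_1 X_1 ... v_n X_n and G2 = (Y_V Y_E)^k Y_V; edges[j-1] is edge e_j."""
    m = len(edges)
    g1 = []
    for i in range(1, n + 1):
        g1.append(('v', i))
        # X_i: edges incident to v_i, by increasing index
        g1 += [('e', j) for j, e in enumerate(edges, 1) if i in e]
    y_v = [('v', i) for i in range(1, n + 1)]
    y_e = [('e', j) for j in range(1, m + 1)]
    return g1, (y_v + y_e) * k + y_v

def is_subsequence(seq, genome):
    """True if seq can be obtained from genome by deleting genes."""
    it = iter(genome)
    return all(gene in it for gene in seq)

def zero_exemplar_distance(g1, g2):
    """Brute force: try every choice of one occurrence per gene in G1 (positive genes only)."""
    positions = {}
    for p, gene in enumerate(g1):
        positions.setdefault(gene, []).append(p)
    for choice in product(*positions.values()):
        exemplar = [g1[p] for p in sorted(choice)]
        if is_subsequence(exemplar, g2):
            return True
    return False

# Triangle: its smallest vertex cover has size 2.
triangle = [(1, 2), (1, 3), (2, 3)]
for k in (1, 2):
    g1, g2 = build_zebd_instance(3, triangle, k)
    print(k, zero_exemplar_distance(g1, g2))
```

For the triangle, the test fails with $k = 1$ and succeeds with $k = 2$, in line with the claimed equivalence.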

Conversely, suppose that the exemplar breakpoint distance between $G_1$ and $G_2$ is zero. Since all genes have a positive sign, it follows that there exists an exemplarization $(G_1^E,G_2^E)$ of $(G_1,G_2)$ such that $G_1^E = G_2^E$. Exemplarization $G_2^E$ can be written as

\[ G_2^E = Y_V[1]\; Y_E[1]\; Y_V[2]\; Y_E[2] \ldots Y_V[k]\; Y_E[k]\; Y_V[k+1] \]

where $Y_V[i]$, $1 \le i \le k+1$, is a string on $V$ and $Y_E[i]$, $1 \le i \le k$, is a string on $E$, $V$ and $E$ being viewed as alphabets. Now, define $V' \subseteq V$ as follows: $v_i \in V'$ iff gene $v_i$ occurs in some $Y_V[j]$, $1 \le j \le k$, as the last gene. By construction, $|V'| \le k$ (we may indeed have $|V'| < k$ if some $Y_V[j]$, $1 \le j \le k$, denotes the empty string). We now observe that, since no gene $v_i$ is duplicated in $G_1$, all genes $e_l$ that occur between some gene $v_i \in V'$ and some gene $v_j \in V'$ in $G_2^E$ should match genes in string $X_i$ in $G_1$. Then it follows that $V'$ is a vertex cover of size at most $k$ in $G$.

The complexity of ZEBD remains open in case no gene occurs more than twice in $G_1$ and more than a constant number of times in $G_2$, i.e., instances of type $(2,c)$ for some $c = O(1)$; recall here that ZEBD is NP-complete if no gene occurs more than three times in $G_1$ or in $G_2$ (instances of type (3,3), [13]). In particular, the complexity of ZEBD for instances of type (2,2) is open. However, we propose here a practical, but exponential, algorithm for ZEBD for instances of type (2,2), which is well suited in case the number of genes that occur twice both in $G_1$ and in $G_2$ is relatively small.

Proposition 1. ZEBD for instances of type (2,2) (no gene occurs more than twice in $G_1$ and in $G_2$) is solvable in $O^*(1.6182^{2k})$ time, where $k$ is upper-bounded by the number of genes that occur exactly twice in $G_1$ and in $G_2$.

Proof. According to Observation 3, for any instance $(G_1,G_2)$, we only need to focus on exemplarizations $(G_1^E,G_2^E)$ such that $G_1^E = G_2^E$ or $(G_1^E)^r = G_2^E$, where $(G_1^E)^r$ is the signed reversal of $G_1^E$. Let us first consider the case $G_1^E = G_2^E$ (the case $(G_1^E)^r = G_2^E$ is identical up to a signed reversal and will thereby be briefly discussed at the end of the proof).
Let $(G_1,G_2)$ be an instance of type (2,2) of ZEBD. Our algorithm transforms instance $(G_1,G_2)$ into a CNF boolean formula $\varphi$ with only few large clauses such that $\varphi$ is satisfiable iff the exemplar breakpoint distance between $G_1$ and $G_2$ is zero. By hypothesis, each signed gene occurs at most twice in $G_1$ and in $G_2$. Therefore, for any signed gene $g$, we have one out of four possible distinct configurations depicted in Figure 2, where $p_1$, $p_2$, $q_1$ and $q_2$ are positions of occurrence of $g$ in $G_1$ and $G_2$.

Fig. 2. The 4 gene-configurations for instances of type (2,2): $p_1$ and $p_2$ are the occurrence positions of gene $g$ in $G_1$, and $q_1$ and $q_2$ are the occurrence positions of gene $g$ in $G_2$.

Furthermore, since we are looking for an exemplarization $(G_1^E,G_2^E)$ of $(G_1,G_2)$ such that $G_1^E = G_2^E$, we may assume, in case $g$ occurs only once in $G_1$ or in $G_2$, that all occurrences of $g$ have the same sign (otherwise a trivial self-reduction would indeed apply). In other words, referring to Figure 2, we assume $G_1[p_1] = G_2[q_1] = G_2[q_2]$ in case (2), $G_1[p_1] = G_1[p_2] = G_2[q_1]$ in case (3), and $G_1[p_1] = G_2[q_1]$ in case (4). Finally, as for case (1), we may assume that either all occurrences have the same sign, or $G_1[p_1] = -G_1[p_2]$ and $G_2[q_1] = -G_2[q_2]$ (otherwise a trivial self-reduction would again apply).

We now describe the construction of the CNF boolean formula $\varphi$. First, the set of boolean variables $X$ is defined as follows: for each gene $g$ occurring at position $p$ in $G_1$ and at position $q$ in $G_2$ (i.e., $G_1[p] = G_2[q]$) we add to $X$ the boolean variable $x^{p}_{q}$. We now turn to defining the clauses of $\varphi$. Let $g$ be any gene, and let the occurrence positions of $g$ in $G_1$ and in $G_2$ be noted as in Figure 2.

- If $occ(g,G_1) = occ(g,G_2) = 2$ (case (1)):
  - if $G_1[p_1] = G_1[p_2] = G_2[q_1] = G_2[q_2]$, we add to $\varphi$ the clauses $(x^{p_1}_{q_1} \vee x^{p_1}_{q_2} \vee x^{p_2}_{q_1} \vee x^{p_2}_{q_2})$, $(\neg x^{p_1}_{q_1} \vee \neg x^{p_1}_{q_2})$, $(\neg x^{p_1}_{q_1} \vee \neg x^{p_2}_{q_1})$, $(\neg x^{p_1}_{q_1} \vee \neg x^{p_2}_{q_2})$, $(\neg x^{p_1}_{q_2} \vee \neg x^{p_2}_{q_1})$, $(\neg x^{p_1}_{q_2} \vee \neg x^{p_2}_{q_2})$ and $(\neg x^{p_2}_{q_1} \vee \neg x^{p_2}_{q_2})$;
  - otherwise, we have $G_1[p_1] = -G_1[p_2]$ and $G_2[q_1] = -G_2[q_2]$ (see above discussion):
    - if $G_1[p_1] = G_2[q_1]$ and $G_1[p_2] = G_2[q_2]$, we add to $\varphi$ the clauses $(x^{p_1}_{q_1} \vee x^{p_2}_{q_2})$ and $(\neg x^{p_1}_{q_1} \vee \neg x^{p_2}_{q_2})$;
    - if $G_1[p_1] = G_2[q_2]$ and $G_1[p_2] = G_2[q_1]$, we add to $\varphi$ the clauses $(x^{p_1}_{q_2} \vee x^{p_2}_{q_1})$ and $(\neg x^{p_1}_{q_2} \vee \neg x^{p_2}_{q_1})$.
- If $occ(g,G_1) = 1$ and $occ(g,G_2) = 2$ (case (2)), we add to $\varphi$ the clauses $(x^{p_1}_{q_1} \vee x^{p_1}_{q_2})$ and $(\neg x^{p_1}_{q_1} \vee \neg x^{p_1}_{q_2})$.
- If $occ(g,G_1) = 2$ and $occ(g,G_2) = 1$ (case (3)), we add to $\varphi$ the clauses $(x^{p_1}_{q_1} \vee x^{p_2}_{q_1})$ and $(\neg x^{p_1}_{q_1} \vee \neg x^{p_2}_{q_1})$.
- If $occ(g,G_1) = occ(g,G_2) = 1$ (case (4)), we add to $\varphi$ the clause $(x^{p_1}_{q_1})$.

The rationale of this construction is that if formula $\varphi$ evaluates to true for some assignment $f$ and $f(x^{p}_{q})$ is true for some gene $g$ occurring at position $p$ in $G_1$ and $q$ in $G_2$, then all occurrences of $g$ but the one at position $p$ should be deleted in $G_1$ and all occurrences of $g$ but the one at position $q$ should be deleted in $G_2$, in order to obtain the exemplar solution.

What is left is to enforce that $\varphi$ evaluates to true iff the exemplar breakpoint distance between $G_1$ and $G_2$ is zero. To this aim, we add to $\varphi$ the following clauses. For each pair of variables $(x^{i_1}_{j_1},x^{i_2}_{j_2})$ such that $G_1[i_1] \neq G_1[i_2]$, $i_1 < i_2$ and $j_1 > j_2$, we add to $\varphi$ the clause $(\neg x^{i_1}_{j_1} \vee \neg x^{i_2}_{j_2})$. The construction of $\varphi$ is now complete. Clearly, $\varphi$ evaluates to true iff the exemplar breakpoint distance between $G_1$ and $G_2$ is zero.

Let $k$ be the number of genes $g$ that occur twice in $G_1$ and in $G_2$ with the same sign, i.e., $G_1[p_1] = G_1[p_2] = G_2[q_1] = G_2[q_2]$.
We now make the important observation that all clauses in $\varphi$ have size less than or equal to 2 except those $k$ clauses of size 4 introduced in case gene $g$ occurs twice in $G_1$ and in $G_2$ with the same sign. By introducing a new boolean variable, we can easily replace in $\varphi$ each clause of size 4 by two clauses of size 3, and hence we may now assume that $\varphi$ is a 3-CNF formula (i.e., each clause has size at most 3) with exactly $2k$ clauses of size 3.

As for the case $(G_1^E)^r = G_2^E$, we replace $G_1$ by $(G_1)^r$ and construct another 3-CNF formula $\varphi'$ as described above. The two 3-CNF formulas need, however, to be examined separately.

Fernau proposed in [15] an algorithm for solving 3-CNF boolean formulas that runs in $O^*(1.6182^{l})$ time, where $l$ is the number of clauses of size 3. Therefore, ZEBD for instances of type (2,2) is solvable in $O^*(1.6182^{2k})$ time, where $k$ is the number of genes $g$ that occur twice in $G_1$ and in $G_2$.
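The shape of the formula built in this proof can be illustrated on tiny genomes. The sketch below is an assumption-laden illustration, not the algorithm of the proposition: the function names are hypothetical, genomes are lists of nonzero signed integers, only the case $G_1^E = G_2^E$ is encoded, the per-gene clauses are generated uniformly as "at least one, at most one" over the variables of each gene (which gives the clauses of cases (1) to (4) for instances of type (2,2)), and satisfiability is decided by exhaustive search instead of the 3-CNF algorithm of [15].

```python
from itertools import combinations, product

def build_cnf(g1, g2):
    """Variables are pairs (p, q) with g1[p] == g2[q]; a clause is a list of (variable, polarity)."""
    variables = [(p, q) for p, a in enumerate(g1)
                 for q, b in enumerate(g2) if a == b]
    family = lambda var: abs(g1[var[0]])
    clauses = []
    for gene in {abs(x) for x in g1} | {abs(x) for x in g2}:
        own = [v for v in variables if family(v) == gene]
        clauses.append([(v, True) for v in own])              # keep at least one pair
        clauses += [[(u, False), (v, False)]                  # keep at most one pair
                    for u, v in combinations(own, 2)]
    for u, v in combinations(variables, 2):
        # two kept pairs of different genes must not cross
        if family(u) != family(v) and (u[0] - v[0]) * (u[1] - v[1]) < 0:
            clauses.append([(u, False), (v, False)])
    return variables, clauses

def satisfiable(variables, clauses):
    """Exhaustive search over all assignments (for tiny examples only)."""
    for values in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, values))
        if all(any(assign[v] == pol for v, pol in clause) for clause in clauses):
            return True
    return False

# G1 = 1 2 1 and G2 = 2 1: deleting the first occurrence of gene 1 in G1 gives G2.
print(satisfiable(*build_cnf([1, 2, 1], [2, 1])))
# G1 = 1 2 and G2 = 2 1: the only possible pairs cross, so the formula has no model.
print(satisfiable(*build_cnf([1, 2], [2, 1])))
```

A gene occurring twice in both genomes with the same sign yields exactly one clause of size 4, which is the only kind of large clause the running-time analysis has to account for.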

4.2 Zero intermediate matching breakpoint distance

We now turn to the zero intermediate breakpoint distance (ZIBD) problem. It is defined as follows.

Problem: ZIBD
Input: Two genomes $G_1$ and $G_2$.
Question: Is the intermediate breakpoint distance between $G_1$ and $G_2$ equal to zero?

We show here that ZEBD and ZIBD are equivalent problems. We need the following lemma.

Lemma 9 ([2]). Let $G_1$ and $G_2$ be two genomes without duplicates and with the same gene content, and $G'_1$ and $G'_2$ be the two genomes obtained from $G_1$ and $G_2$ by deleting any gene $g$. Then $B(G'_1,G'_2) \le B(G_1,G_2)$.

Theorem 5. ZEBD and ZIBD are equivalent problems.

Proof. One direction is trivial (any exemplarization is indeed an intermediate matching). The other direction follows from Lemma 9.

It follows from Theorem 5 that the problem IBD is not approximable even for instances of type (3,3) (see [13]) and if no gene occurs more than twice in $G_1$ (see Theorem 4).

4.3 Zero maximum matching breakpoint distance

We show here that, in contrast to the exemplar and the intermediate matching models, deciding whether the maximum matching breakpoint distance between two genomes is equal to zero is polynomial-time solvable, and hence we cannot rule out the existence of accurate approximation algorithms for the maximum matching model. We refer to this problem as ZMBD.

Problem: ZMBD
Input: Two genomes $G_1$ and $G_2$.
Question: Is the maximum matching breakpoint distance between $G_1$ and $G_2$ equal to zero?

The main idea of our approach is to transform any instance of ZMBD into a matching diagram and next use an efficient algorithm for finding a large set of non-intersecting line segments. Note that this latter problem is equivalent to finding a large increasing subsequence in permutations. A matching diagram [18] consists of, say $n$, points on each of two parallel lines, and $n$ straight line segments matching distinct pairs of points.
The intersection graph of the line segments is called a permutation graph (the reason for the name is that if the points on the top line are numbered $1,2,\ldots,n$, then the points on the other line are numbered by a permutation on $1,2,\ldots,n$).

We describe how to turn the pair of genomes $(G_1,G_2)$ into a matching diagram $D(G_1,G_2)$. For the sake of presentation we introduce the following notations. For each gene family $g$, we write $occ_{pos}(G,g)$ (resp. $occ_{neg}(G,g)$) for the number of positive (resp. negative) occurrences of gene $g$ in genome $G$. According to Observation 3, it is enough to consider two cases: $G_1^M = G_2^M$ or $(G_1^M)^r = G_2^M$, where $(G_1^M,G_2^M,\mathcal{M})$ is a maximum matching of $(G_1,G_2)$. Let us first focus on testing $G_1^M = G_2^M$ (the case $(G_1^M)^r = G_2^M$ is identical up to a signed reversal). We describe the construction of the top labeled points. Reading genome $G_1$ from left to right, we replace gene $g$ by the sequence of labeled points

\[ +g_1(i,occ_{pos}(G_2,g))\;\; +g_1(i,occ_{pos}(G_2,g)-1)\;\; \ldots\;\; +g_1(i,1) \]
